Semantic Web 1 (2012) 1–5
IOS Press

DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia

Editor(s): Name Surname, University, Country
Solicited review(s): Name Surname, University, Country
Open review(s): Name Surname, University, Country

Jens Lehmann a,*, Robert Isele g, Max Jakob e, Anja Jentzsch d, Dimitris Kontokostas a, Pablo N. Mendes f, Sebastian Hellmann a, Mohamed Morsey a, Patrick van Kleef c, Sören Auer a, Christian Bizer b

a University of Leipzig, Institute of Computer Science, AKSW Group, Augustusplatz 10, D-04009 Leipzig, Germany. E-mail: {lastname}@informatik.uni-leipzig.de
b University of Mannheim, Research Group Data and Web Science, B6-26, D-68159 Mannheim, Germany. E-mail: [email protected]
c OpenLink Software, 10 Burlington Mall Road, Suite 265, Burlington, MA 01803, U.S.A. E-mail: [email protected]
d Hasso-Plattner-Institute for IT-Systems Engineering, Prof.-Dr.-Helmert-Str. 2-3, D-14482 Potsdam, Germany. E-mail: [email protected]
e Neofonie GmbH, Robert-Koch-Platz 4, D-10115 Berlin, Germany. E-mail: [email protected]
f Kno.e.sis - Ohio Center of Excellence in Knowledge-enabled Computing, Wright State University, Dayton, USA. E-mail: [email protected]
g Brox IT-Solutions GmbH, An der Breiten Wiese 9, D-30625 Hannover, Germany. E-mail: [email protected]

Abstract. The DBpedia community project extracts structured, multilingual knowledge from Wikipedia and makes it freely available using Semantic Web and Linked Data standards. The extracted knowledge, comprising more than 1.8 billion facts, is structured according to an ontology maintained by the community. The knowledge is obtained from different Wikipedia language editions, thus covering more than 100 languages, and mapped to the community ontology. The resulting data sets are linked to more than 30 other data sets in the Linked Open Data (LOD) cloud. The DBpedia project was started in 2006 and has meanwhile attracted large interest in research and practice. Being a central part of the LOD cloud, it serves as a connection hub for other data sets. For the research community, DBpedia provides a testbed serving real world data spanning many domains and languages. Due to the continuous growth of Wikipedia, DBpedia also provides an increasing added value for data acquisition, re-use and integration tasks within organisations. In this system report, we give an overview over the DBpedia community project, including its architecture, technical implementation, maintenance, internationalisation, usage statistics, and showcase some popular DBpedia applications.

Keywords: Linked Open Data, Knowledge Extraction, Wikipedia, Data Web, RDF, OWL

* Corresponding author. E-mail: [email protected].

1570-0844/12/$27.50 © 2012 – IOS Press and the authors. All rights reserved



1 Introduction

The DBpedia community project extracts knowledge from Wikipedia and makes it widely available via established Semantic Web standards and Linked Data best practices. Wikipedia is currently the 7th most popular website1, the most widely used encyclopedia, and one of the finest examples of truly collaboratively created content. However, due to the lack of exploitation of the inherent structure of Wikipedia articles, Wikipedia itself only offers very limited querying and search capabilities. For instance, it is difficult to find all rivers that flow into the Rhine or all Italian composers from the 18th century. One of the goals of the DBpedia project is to provide those querying and search capabilities to a wide community by extracting structured data from Wikipedia, which can then be used for answering expressive queries such as the ones outlined above.
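
For illustration, the first of these questions could be posed against DBpedia as a SPARQL query roughly as follows. This is a sketch only: it assumes the class dbo:River and the property dbo:riverMouth from the mapping-based extraction described later, and the actual results depend on the mappings in the current release.

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>

    # All rivers that flow into the Rhine
    SELECT ?river WHERE {
      ?river rdf:type dbo:River ;
             dbo:riverMouth dbr:Rhine .
    }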

The DBpedia project was started in 2006 and has meanwhile attracted significant interest in research and practice. It has been a key factor for the success of the Linked Open Data initiative and serves as an interlinking hub for other data sets (see Section 5). For the research community, DBpedia provides a testbed serving real data spanning various domains and more than 100 language editions. Numerous applications, algorithms and tools have been built around or applied to DBpedia. Due to the continuous growth of Wikipedia and improvements in DBpedia, the extracted data provides an increasing added value for data acquisition, re-use and integration tasks within organisations. While the quality of extracted data is unlikely to reach the quality of completely manually curated data sources, it can be applied to some enterprise information integration use cases and has shown to be relevant beyond research projects, as we will describe in Section 7.

One of the reasons why DBpedia's data quality has improved over the past years is that the structure of the knowledge in DBpedia itself is meanwhile maintained by its community. Most importantly, the community creates mappings from Wikipedia information representation structures to the DBpedia ontology. This ontology, which will later be explained in detail, unifies different template structures, both within single Wikipedia language editions and across currently 27 different languages. The maintenance of different language editions of DBpedia is spread across a number of organisations. Each organisation is responsible for the support of a certain language. The local DBpedia chapters are coordinated by the DBpedia Internationalisation Committee. In addition to multilingual support, DBpedia also provides data-level links into more than 30 external data sets, which are partially also contributed by partners beyond the core project team.

1 See http://www.alexa.com/topsites, retrieved in June 2013.

The aim of this system report is to provide a description of the DBpedia community project, including the architecture of the DBpedia extraction framework, its technical implementation, maintenance, internationalisation, usage statistics, as well as showcasing some popular DBpedia applications. This system report is a comprehensive update and extension of previous project descriptions in [1] and [5]. The main novelties compared to these articles are:

– The concept and implementation of the extraction based on a community-curated DBpedia ontology.
– The wide internationalisation of DBpedia.
– A live synchronisation module which processes updates in Wikipedia as well as the DBpedia ontology and allows third parties to keep their copies of DBpedia up-to-date.
– A description of the maintenance of public DBpedia services and statistics about their usage.
– An increased number of interlinked data sets which can be used to further enrich the content of DBpedia.
– The discussion and summary of novel third party applications of DBpedia.

In essence, the report summarizes major developments in DBpedia in the past four years since the publication of [5].

The system report is structured as follows: In the next section, we describe the DBpedia extraction framework, which forms the technical core of DBpedia. This is followed by an explanation of the community-curated DBpedia ontology, with a focus on its evolution over the past years and multilingual support. In Section 4, we explicate how DBpedia is synchronised with Wikipedia with just very short delays and how updates are propagated to DBpedia mirrors employing the DBpedia Live system. Subsequently, we give an overview of the external data sets that are interlinked from DBpedia or that set data-level links pointing at DBpedia themselves (Section 5). In Section 6, we provide statistics on the access of DBpedia and describe lessons learned for the maintenance of a large scale public data set. Within Section 7, we briefly describe several use cases and applications of DBpedia in a variety of different areas. Finally, we report on related work in Section 8 and conclude in Section 9.

2 Extraction Framework

Wikipedia articles consist mostly of free text, but also comprise various types of structured information in the form of wiki markup. Such information includes infobox templates, categorisation information, images, geo-coordinates, links to external web pages, disambiguation pages, redirects between pages, and links across different language editions of Wikipedia. The DBpedia extraction framework extracts this structured information from Wikipedia and turns it into a rich knowledge base. In this section, we give an overview of the DBpedia knowledge extraction framework.

2.1 General Architecture

Figure 1 shows an overview of the technical framework. The DBpedia extraction is structured into four phases:

Input: Wikipedia pages are read from an external source. Pages can either be read from a Wikipedia dump or directly fetched from a MediaWiki installation using the MediaWiki API.

Parsing: Each Wikipedia page is parsed by the wiki parser. The wiki parser transforms the source code of a Wikipedia page into an Abstract Syntax Tree.

Extraction: The Abstract Syntax Tree of each Wikipedia page is forwarded to the extractors. DBpedia offers extractors for many different purposes, for instance, to extract labels, abstracts or geographical coordinates. Each extractor consumes an Abstract Syntax Tree and yields a set of RDF statements.

Output: The collected RDF statements are written to a sink. Different formats, such as N-Triples, are supported.
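
The following minimal Scala sketch illustrates the shape of this pipeline (the framework itself is written in Scala, cf. Section 2.7). All type and method names here are simplified illustrations and do not reflect the actual framework API:

    // Simplified sketch of the four-phase pipeline: input, parsing, extraction, output.
    case class RdfStatement(subject: String, predicate: String, obj: String)

    // Stand-in for the Abstract Syntax Tree produced by the wiki parser.
    case class PageNode(title: String, wikiText: String)

    trait Extractor {
      // Each extractor consumes a parsed page and yields a set of RDF statements.
      def extract(page: PageNode): Seq[RdfStatement]
    }

    object LabelExtractor extends Extractor {
      def extract(page: PageNode): Seq[RdfStatement] =
        Seq(RdfStatement(
          "http://dbpedia.org/resource/" + page.title.replace(' ', '_'),
          "http://www.w3.org/2000/01/rdf-schema#label",
          "\"" + page.title + "\""))
    }

    // Output phase: feed every statement produced by every extractor to a sink,
    // e.g. a writer that serializes the statements as N-Triples.
    def runExtraction(pages: Iterator[PageNode],
                      extractors: Seq[Extractor],
                      sink: RdfStatement => Unit): Unit =
      for (page <- pages; extractor <- extractors; stmt <- extractor.extract(page))
        sink(stmt)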

2.2 Extractors

The DBpedia extraction framework employs various extractors for translating different parts of Wikipedia pages to RDF statements. A list of all available extractors is shown in Table 1. DBpedia extractors can be divided into four categories:

Mapping-Based Infobox Extraction: The mapping-based infobox extraction uses manually written mappings that relate infoboxes in Wikipedia to terms in the DBpedia ontology. The mappings also specify a datatype for each infobox property and thus help the extraction framework to produce high quality data. The mapping-based extraction will be described in detail in Section 2.4.

Raw Infobox Extraction: The raw infobox extraction provides a direct mapping from infoboxes in Wikipedia to RDF. As the raw infobox extraction does not rely on explicit extraction knowledge in the form of mappings, the quality of the extracted data is lower. The raw infobox data is useful if a specific infobox has not been mapped yet and thus is not available in the mapping-based extraction.

Feature Extraction: The feature extraction uses a number of extractors that are specialized in extracting a single feature from an article, such as a label or geographic coordinates.

Statistical Extraction: Some NLP related extractors aggregate data from all Wikipedia pages in order to provide data that is based on statistical measures of page links or word counts, as further described in Section 2.6.

2.3 Raw Infobox Extraction

The type of Wikipedia content that is most valuable for the DBpedia extraction is infoboxes. Infoboxes are frequently used to list an article's most relevant facts as a table of attribute-value pairs on the top right-hand side of the Wikipedia page (for right-to-left languages on the top left-hand side). Infoboxes that appear in a Wikipedia article are based on a template that specifies a list of attributes that can form the infobox. A wide range of infobox templates are used in Wikipedia. Common examples are templates for infoboxes that describe persons, organisations or automobiles. As Wikipedia's infobox template system has evolved over time, different communities of Wikipedia editors use different templates to describe the same type of things (e.g. Infobox city japan, Infobox swiss town and Infobox town de). In addition, different templates use different names for the same attribute (e.g. birthplace and placeofbirth). As many Wikipedia editors do not strictly follow the recommendations given on the page that describes a template, attribute values are expressed using a wide range of different formats and units of measurement. An excerpt of an infobox that is based on a template for describing automobiles is shown below.


Name | Description | Example
abstract | Extracts the first lines of the Wikipedia article | dbr:Berlin dbo:abstract "Berlin is the capital city of (...)"
article categories | Extracts the categorization of the article | dbr:Oliver_Twist dc:subject dbr:Category:English_novels
category label | Extracts labels for categories | dbr:Category:English_novels rdfs:label "English novels"
disambiguation | Extracts disambiguation links | dbr:Alien dbo:wikiPageDisambiguates dbr:Alien_(film)
external links | Extracts links to external web pages related to the concept | dbr:Animal_Farm dbo:wikiPageExternalLink <http://books.google.com/?id=RBGmrDnBs8UC>
geo coordinates | Extracts geo-coordinates | dbr:Berlin georss:point "52.5006 13.3989"
grammatical gender | Extracts grammatical genders for persons | dbr:Abraham_Lincoln foaf:gender "male"
homepage | Extracts links to the official homepage of an instance | dbr:Alabama foaf:homepage <http://alabama.gov/>
image | Extracts the first image of a Wikipedia page | dbr:Berlin foaf:depiction <http://.../Overview_Berlin.jpg>
infobox | Extracts all properties from all infoboxes | dbr:Animal_Farm dbo:date "March 2010"
interlanguage | Extracts interwiki links | dbr:Albedo dbo:wikiPageInterLanguageLink dbr:Albedo
label | Extracts the article title as label | dbr:Berlin rdfs:label "Berlin"
lexicalizations | Extracts information about surface forms and their association with concepts (only N-Quad format) | dbr:Pine sptl:lexicalization lx:pine_tree ls:Pine_pine_tree ; lx:pine_tree rdfs:label "pine tree" ; ls:Pine_pine_tree sptl:pUriGivenSf 0.941
mappings | Extraction based on mappings of Wikipedia infoboxes to the DBpedia ontology | dbr:Berlin dbo:country dbr:Germany
page ID | Extracts page ids of articles | dbr:Autism dbo:wikiPageID 25
page links | Extracts all links between Wikipedia articles | dbr:Autism dbo:wikiPageWikiLink dbr:Human_brain
persondata | Extracts information about persons represented using the PersonData template | dbr:Andre_Agassi foaf:birthDate "1970-04-29"
PND | Extracts PND (Personennamendatei) data about a person | dbr:William_Shakespeare dbo:individualisedPnd "118613723"
redirects | Extracts redirect links between articles in Wikipedia | dbr:ArtificalLanguages dbo:wikiPageRedirects dbr:Constructed_language
revision ID | Extracts the revision ID of the Wikipedia article | dbr:Autism <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/Autism?oldid=495234324>
SKOS categories | Extracts information about which concept is a category and how categories are related using the SKOS vocabulary | dbr:Category:World_War_II skos:broader dbr:Category:Modern_history
thematic concept | Extracts 'thematic' concepts, the centers of discussion for categories | dbr:Category:Music skos:subject dbr:Music
topic signatures | Extracts topic signatures | dbr:Alkane sptl:topicSignature "carbon alkanes atoms"
wiki page | Extracts links to corresponding articles in Wikipedia | dbr:AnAmericanInParis foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/AnAmericanInParis>

Table 1: Overview of the DBpedia extractors. sptl is used as prefix for http://spotlight.dbpedia.org/vocab/, lx is used for http://spotlight.dbpedia.org/lexicalizations/ and ls is used for http://spotlight.dbpedia.org/scores/.


Fig. 1: Overview of DBpedia extraction framework.

    {{Infobox automobile
    | name = Ford GT40
    | manufacturer = [[Ford Advanced Vehicles]]
    | production = 1964-1969
    | engine = 4181cc
    (...)
    }}

In this infobox, the first line specifies the infobox type and the subsequent lines specify various attributes of the described entity.

An excerpt of the extracted data is as follows:2

    dbr:Ford_GT40
        dbp:name "Ford GT40"@en ;
        dbp:manufacturer dbr:Ford_Advanced_Vehicles ;
        dbp:engine 4181 ;
        dbp:production 1964 ;
        (...)

This extraction output has weaknesses: the resource is not associated with a class in the ontology, and the engine and production data use literal values for which the semantics are not obvious. Those problems can be overcome by the mapping-based infobox extraction presented in the next subsection.

2 We use dbr for http://dbpedia.org/resource/, dbo for http://dbpedia.org/ontology/ and dbp for http://dbpedia.org/property/ as prefixes throughout the report.

2.4 Mapping-Based Infobox Extraction

In order to homogenize the description of information in the knowledge base, a community effort was initiated in 2010 to develop an ontology schema and mappings from Wikipedia infobox properties to this ontology. The alignment between Wikipedia infoboxes and the ontology is performed via community-provided mappings that help to normalize name variations in properties and classes. Heterogeneity in the Wikipedia infobox system, like using different infoboxes for the same type of entity or using different property names for the same property (cf. Section 2.3), can be alleviated in this way. This significantly increases the quality of the raw Wikipedia infobox data by typing resources, merging name variations and assigning specific datatypes to the values.

This effort is realized using the DBpedia Mappings Wiki3, a MediaWiki installation set up to enable users to collaboratively create and edit mappings. These mappings are specified using the DBpedia Mapping Language. The mapping language makes use of MediaWiki templates that define DBpedia ontology classes and properties as well as template/table to ontology mappings. A mapping assigns a type from the DBpedia ontology to the entities that are described by the corresponding infobox. In addition, attributes in the infobox are mapped to properties in the DBpedia ontology. In the following, we show a mapping that maps infoboxes that use the Infobox automobile template to the DBpedia ontology:

3 http://mappings.dbpedia.org


    {{TemplateMapping
    | mapToClass = Automobile
    | mappings =
        {{PropertyMapping
        | templateProperty = name
        | ontologyProperty = foaf:name }}
        {{PropertyMapping
        | templateProperty = manufacturer
        | ontologyProperty = manufacturer }}
        {{DateIntervalMapping
        | templateProperty = production
        | startDateOntologyProperty = productionStartDate
        | endDateOntologyProperty = productionEndDate }}
        {{IntermediateNodeMapping
        | nodeClass = AutomobileEngine
        | correspondingProperty = engine
        | mappings =
            {{PropertyMapping
            | templateProperty = engine
            | ontologyProperty = displacement
            | unit = Volume }}
            {{PropertyMapping
            | templateProperty = engine
            | ontologyProperty = powerOutput
            | unit = Power }}
        }}
    (...)
    }}

The RDF statements that are extracted from the previous infobox example are shown below. As we can see, the production period is correctly split into a start year and an end year, and the engine is represented by a distinct RDF node:

    dbr:Ford_GT40
        rdf:type dbo:Automobile ;
        rdfs:label "Ford GT40"@en ;
        dbo:manufacturer dbr:Ford_Advanced_Vehicles ;
        dbo:productionStartYear "1964"^^xsd:gYear ;
        dbo:productionEndYear "1969"^^xsd:gYear ;
        dbo:engine [
            rdf:type AutomobileEngine ;
            dbo:displacement 0.004181
        ]
        (...)

The DBpedia Mapping Wiki is not only used to map different templates within a single language edition of Wikipedia to the DBpedia ontology, but is used to map templates from all Wikipedia language editions to the shared DBpedia ontology. Figure 2 shows how the infobox properties author and συγγραφέας (author in Greek) are both being mapped to the global identifier dbo:author. That means in turn that information from all language versions of DBpedia can be merged, and DBpedias for smaller languages can be augmented with knowledge from larger DBpedias, such as the English edition. Conversely, the larger DBpedia editions can benefit from more specialized knowledge from localized editions, such as data about smaller towns, which is often only present in the corresponding language edition [40].

Besides hosting the mappings and the DBpedia ontology definition, the DBpedia Mappings Wiki offers various tools which support users in their work:

– Mapping Validator: When editing a mapping, the mapping can be directly validated by a button on the edit page. This validates changes for syntactic correctness before saving them and highlights inconsistencies such as missing property definitions.
– Extraction Tester: The extraction tester, linked on each mapping page, tests a mapping against a set of example Wikipedia pages. This gives direct feedback about whether a mapping works and how the resulting data is structured.
– Mapping Tool: The DBpedia Mapping Tool is a graphical user interface that supports users in creating and editing mappings.

2.5 URI Schemes

For every Wikipedia article, the framework introduces a number of URIs to represent the concepts described on a particular page. Up to 2011, DBpedia published URIs only under the http://dbpedia.org domain. The main namespaces were:

– http://dbpedia.org/resource/ (prefix dbr) for representing article data. Apart from a few corner cases, there is a one-to-one mapping between a Wikipedia page and a DBpedia resource based on the article title. For example, for the Wikipedia article on Berlin4, DBpedia will produce the URI dbr:Berlin.
– http://dbpedia.org/property/ (prefix dbp) for representing properties extracted from the raw infobox extraction (cf. Section 2.3), e.g. dbp:population.
– http://dbpedia.org/ontology/ (prefix dbo) for representing the DBpedia ontology (cf. Section 2.4), e.g. dbo:populationTotal.

4 http://en.wikipedia.org/wiki/Berlin


Fig. 2: Depiction of the mapping from the Greek and English Wikipedia templates about books to the same DBpedia ontology class [23].

Although data from other Wikipedia language editions were extracted, they used the same namespaces. This was achieved by exploiting the Wikipedia inter-language links5. For every page in a language other than English, the page was extracted only if the page contained an inter-language link to an English page. In that case, using the English link, the data was extracted under the English resource name (i.e. dbr:Berlin).

Recent DBpedia internationalisation developments showed that this approach resulted in less and redundant data [23]. Thus, starting from the DBpedia 3.7 release, two types of data sets were generated. The localized data sets contain all things that are described in a specific language. Within the data sets, things are identified with language-specific URIs, such as http://<lang>.dbpedia.org/resource/ for article data and http://<lang>.dbpedia.org/property/ for property data. In addition, we produce a canonicalized data set for each language. The canonicalized data sets only contain things for which a corresponding page in the English edition of Wikipedia exists. Within all canonicalized data sets, the same thing is identified with the same URI from the generic language-agnostic namespace http://dbpedia.org/resource/.

5 http://en.wikipedia.org/wiki/Help:Interlanguage_links

2.6 NLP Extraction

DBpedia provides a number of data sets which have been created to support Natural Language Processing (NLP) tasks [30]. Currently, four datasets are extracted: topic signatures, grammatical gender, lexicalizations and thematic concept. While the topic signatures and the grammatical gender extractors primarily extract data from the article text, the lexicalizations and thematic concept extractors make use of the wiki markup.

DBpedia entities can be referred to using many different names and abbreviations. The Lexicalization data set provides access to alternative names for entities and concepts, associated with several scores estimating the association strength between name and URI. These scores distinguish more common names for specific entities from rarely used ones and also show how ambiguous a name is with respect to all possible concepts that it can mean.

The topic signatures data set enables the description of DBpedia resources based on unstructured information, as compared to the structured factual data provided by the mapping-based and raw extractors. We build a Vector Space Model (VSM) where each DBpedia resource is a point in a multidimensional space of words. Each DBpedia resource is represented by a vector, and each word occurring in Wikipedia is a dimension of this vector. Word scores are computed using the tf-idf weight, with the intention to measure how strong the association between a word and a DBpedia resource is. Note that word stems are used in this context in order to generalize over inflected words. We use the computed weights to select the strongest related word stems for each entity and build topic signatures [27].
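
A compact sketch of this computation in Scala is shown below (illustrative only; the tokenization, stemming and selection thresholds of the actual extractor differ). The parameter docs is assumed to map each DBpedia resource URI to the word stems that co-occur with it in Wikipedia:

    // Sketch: tf-idf weights over a resource-by-word-stem matrix and top-k topic signatures.
    def topicSignatures(docs: Map[String, Seq[String]], k: Int): Map[String, Seq[String]] = {
      val n = docs.size.toDouble
      // document frequency: in how many resources does each stem occur?
      val df: Map[String, Int] =
        docs.values.flatMap(_.distinct).groupBy(identity).map { case (w, ws) => w -> ws.size }
      docs.map { case (resource, stems) =>
        // term frequency of each stem within this resource
        val tf = stems.groupBy(identity).map { case (w, ws) => w -> ws.size.toDouble }
        // tf-idf weight measures how strongly a stem is associated with the resource
        val tfidf = tf.map { case (w, f) => w -> f * math.log(n / df(w)) }
        resource -> tfidf.toSeq.sortBy(-_._2).take(k).map(_._1)
      }
    }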

There are two more Feature Extractors related to Natural Language Processing. The thematic concepts data set relies on Wikipedia's category system to capture the idea of a 'theme', a subject that is discussed in its articles. Many of the categories in Wikipedia are linked to an article that describes the main topic of that category. We rely on this information to mark DBpedia entities and concepts that are 'thematic', that is, they are the center of discussion for a category.

The grammatical gender data set uses a simple heuristic to decide on a grammatical gender for instances of the class Person in DBpedia. While parsing an article in the English Wikipedia, if there is a mapping from an infobox in this article to the class dbo:Person, we record the frequency of gender-specific pronouns in their declined forms (Subject, Object, Possessive Adjective, Possessive Pronoun and Reflexive), i.e. he, him, his, himself (masculine) and she, her, hers, herself (feminine). Grammatical genders for DBpedia entities are assigned based on the dominating gender in these pronouns.
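
A simplified Scala sketch of this heuristic is shown below. It is an illustration only: the real extractor operates on the parsed article and only for resources mapped to dbo:Person, and the tokenization here is deliberately crude.

    // Sketch of the grammatical gender heuristic: count declined masculine vs. feminine
    // pronouns in the article text and assign the dominating gender, if any.
    val masculinePronouns = Set("he", "him", "his", "himself")
    val femininePronouns  = Set("she", "her", "hers", "herself")

    def guessGender(articleText: String): Option[String] = {
      val tokens = articleText.toLowerCase.split("[^a-z]+")
      val m = tokens.count(t => masculinePronouns(t))
      val f = tokens.count(t => femininePronouns(t))
      if (m > f) Some("male")
      else if (f > m) Some("female")
      else None // no dominating gender found
    }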

2.7 Summary of Other Recent Developments

In this section, we summarize the improvements of the DBpedia extraction framework since the publication of the previous DBpedia overview article [5] in 2009. One of the major changes on the implementation level is that the extraction framework has been rewritten in Scala in 2010 to improve the efficiency of the extractors by an order of magnitude compared to the previous PHP based framework. The new, more modular framework also allows extracting data from tables in Wikipedia pages and supports extraction from multiple MediaWiki templates per page. Another significant change was the creation and utilization of the DBpedia Mappings Wiki as described earlier. Further significant changes include the mentioned NLP extractors and the introduction of URI schemes.

In addition, there were several smaller improvements and general maintenance. Overall, over the past four years, the parsing of the MediaWiki markup improved quite a lot, which led to better overall coverage, for example concerning references and parser functions. In addition, the collection of MediaWiki namespace identifiers for many languages is now performed semi-automatically, leading to a high accuracy of detection. This concerns common title prefixes such as User, File, Template, Help, Portal etc. in English that indicate pages that do not contain encyclopedic content and would produce noise in the data. They are important for specific extractors as well; for instance, the category hierarchy data set (SKOS) is produced from pages of the Category namespace. Furthermore, the output of the extraction system now supports more formats, and several compliance issues regarding URIs, IRIs, N-Triples and Turtle were fixed.

The individual data extractors have been improved as well, in both number and quality, in many areas. The abstract extraction was enhanced, producing more accurate plain text representations of the beginning of Wikipedia article texts. More diverse and more specific datatypes do exist (e.g. many currencies and XSD datatypes such as xsd:gYearMonth, xsd:positiveInteger, etc.), and for a number of classes and properties specific datatypes were added (e.g. inhabitants/km² for the population density of populated places and m³/s for the discharge of rivers). Many issues related to data parsers were resolved, and the quality of the owl:sameAs data set for multiple language versions was increased by an implementation that takes bijective relations into account.

There are also further extractors, e.g. for Wikipedia page IDs and revisions. Moreover, redirect and disambiguation extractors were introduced and improved. For the redirect data, the transitive closure is computed, while taking care of catching cycles in the links. The redirects also help regarding infobox coverage in the mapping-based extraction by resolving alternative template names. Moreover, in the past, if an infobox value pointed to a redirect, this redirection was not properly resolved and thus resulted in RDF links that led to URIs which did not contain any further information. Resolving redirects affected approximately 15% of all links and hence increased the overall interconnectivity of resources in the DBpedia ontology.
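
The redirect resolution can be sketched as follows (an illustration, not the framework's actual code): each redirect chain is followed to its final target while visited pages are remembered so that cycles are caught.

    // Sketch: resolve a chain of redirects to its final target, catching cycles.
    def resolveRedirect(start: String, redirects: Map[String, String]): String = {
      val seen = scala.collection.mutable.Set[String]()
      var current = start
      while (redirects.contains(current) && !seen.contains(current)) {
        seen += current
        current = redirects(current)
      }
      current // a non-redirect page, or the point where a cycle closes
    }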

Finally, a new heuristic to increase the connectiveness of DBpedia instances was introduced. If an infobox contains a string value that is not linked to another Wikipedia article, the extraction framework searches for hyperlinks in the same Wikipedia article that have the same anchor text as the infobox value string. If such a link exists, the target of that link is used to replace the string value in the infobox. This method further increases the number of object property assertions in the DBpedia ontology.
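
A sketch of this heuristic (simplified; articleLinks is assumed to map anchor texts of the wiki links found in the same article to their link targets):

    // Sketch: replace an unlinked infobox string value by the target of a wiki link
    // from the same article whose anchor text equals that value.
    def resolveInfoboxValue(value: String,
                            articleLinks: Map[String, String]): Either[String, String] =
      articleLinks.get(value.trim) match {
        case Some(target) => Right(target) // emit an object property pointing to the linked resource
        case None         => Left(value)   // keep the plain literal value
      }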

[Fig. 3: Snapshot of a part of the DBpedia ontology, showing top classes such as owl:Thing, dbo:Agent, dbo:Person, dbo:Athlete, dbo:Place, dbo:PopulatedPlace, dbo:Settlement, dbo:Organisation, dbo:Species, dbo:Eukaryote and dbo:Work, together with selected properties (e.g. dbo:birthPlace, dbo:populationTotal, dbo:areaTotal) and their datatypes.]

3 DBpedia Ontology

The DBpedia ontology consists of 320 classes which form a subsumption hierarchy and are described by 1650 different properties. With a maximal depth of 5, the subsumption hierarchy is intentionally kept rather shallow, which fits use cases in which the ontology is visualized or navigated. Figure 3 depicts a part of the DBpedia ontology, indicating the relations among the top ten classes of the DBpedia ontology, i.e. the classes with the highest number of instances.
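
Such per-class instance counts can be obtained directly from the public SPARQL endpoint, e.g. with a query along the following lines (a sketch; the numbers returned depend on the DBpedia release currently served):

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX owl: <http://www.w3.org/2002/07/owl#>

    # The ten DBpedia ontology classes with the highest number of instances
    SELECT ?class (COUNT(?instance) AS ?count) WHERE {
      ?instance rdf:type ?class .
      ?class rdf:type owl:Class .
    }
    GROUP BY ?class
    ORDER BY DESC(?count)
    LIMIT 10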

[Fig. 4: Growth of the DBpedia ontology: number of classes and properties per DBpedia version (3.2 to 3.8).]

The DBpedia ontology is maintained and extended by the community in the DBpedia Mappings Wiki. Figure 4 depicts the growth of the DBpedia ontology over time. While the number of classes is not growing too much due to the already good coverage of the initial version of the ontology, the number of properties increases over time due to the collaboration on the DBpedia Mappings Wiki and the addition of more detailed information to infoboxes by Wikipedia editors.

3.1 Mapping Statistics

As of April 2013, there exist mapping communities for 27 languages, 23 of which are active. Figure 5 shows statistics for the coverage of these mappings in DBpedia. Figures (a) and (c) refer to the absolute number of template and property mappings that are defined for every DBpedia language edition. Figures (b) and (d) depict the percentage of the defined template and property mappings compared to the total number of available templates and properties for every Wikipedia language edition. Figures (e) and (g) show the occurrences (instances) that the defined template and property mappings have in Wikipedia. Finally, figures (f) and (h) give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.

It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage. Other language editions, such as Bulgarian, Dutch, English, Greek, Polish and Spanish, have mapped templates covering more than 50% of total template occurrences. In addition, almost all languages have covered more than 20% of property occurrences, with Bulgarian and Portuguese reaching up to 70%.


Fig. 5: Mapping coverage statistics for all mapping-enabled languages.

The mapping activity of the ontology enrichment process, along with the editing of the ten most active mapping language communities, is depicted in Figure 6. It is interesting to notice that the high mapping activity peaks coincide with the DBpedia release dates. For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarter of that year have a very high activity compared to the 4th quarter. In the last two years (2012 and 2013), most of the DBpedia mapping language communities have defined their own chapters and have their own release dates. Thus, recent mapping activity shows less fluctuation.

Finally, Figure 7 shows the English property mappings occurrence frequency. Both axes are in log scale and represent the number of property mappings (x axis) that have exactly y occurrences (y axis). The occurrence frequency follows a long tail distribution: a low number of property mappings have a high number of occurrences, and a high number of property mappings have a low number of occurrences.

Fig. 6: Mapping community activity for (a) the ontology and (b) the 10 most active language editions.

Fig. 7: English property mappings occurrence frequency (both axes are in log scale).

3.2 Instance Data

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding language. For 20 of these languages, we report in this section the overall number of entities being described by the localized versions as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5); CD = Canonicalized data sets (see Section 2.5); all = Overall number of instances in the data set, including instances without infobox data; with MD = Number of instances for which mapping-based infobox data exists; Raw Properties = Number of different properties that are generated by the raw infobox extractor; Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor; Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor; Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language correspond to instances in the English version of DBpedia, and which portion of the instances is described by clean mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results on the one hand from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have been so far created in the Mapping Wiki for a specific language.

Table 3 reports the number of instances for a set of selected ontology classes within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to a specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages; the second column contains the number of instances that are described only in a single language version; the next column contains the number of instances that are contained in two languages but not in three or more languages, etc. For example, 12936 persons are described in five languages but not in six or more languages.


Language | Inst. LD all | Inst. CD all | Inst. with MD CD | Raw Prop. CD | Map. Prop. CD | Raw Statem. CD | Map. Statem. CD
en | 3769926 | 3769926 | 2359521 | 48293 | 1313 | 65143840 | 33742015
de | 1243771 | 650037 | 204335 | 9593 | 261 | 7603562 | 2880381
fr | 1197334 | 740044 | 214953 | 13551 | 228 | 8854322 | 2901809
it | 882127 | 580620 | 383643 | 9716 | 181 | 12227870 | 4804731
es | 879091 | 542524 | 310348 | 14643 | 476 | 7740458 | 4383206
pl | 848298 | 538641 | 344875 | 7306 | 266 | 7696193 | 4511794
ru | 822681 | 439605 | 123011 | 13522 | 76 | 6973305 | 1389473
pt | 699446 | 460258 | 272660 | 12851 | 602 | 6255151 | 4005527
ca | 367362 | 241534 | 112934 | 8696 | 183 | 3689870 | 1301868
cs | 225133 | 148819 | 34893 | 5564 | 334 | 1857230 | 474459
hu | 209180 | 138998 | 63441 | 6821 | 295 | 2506399 | 601037
ko | 196132 | 124591 | 30962 | 7095 | 419 | 1035606 | 417605
tr | 187850 | 106644 | 40438 | 7512 | 440 | 1350679 | 556943
ar | 165722 | 103059 | 16236 | 7898 | 268 | 635058 | 168686
eu | 132877 | 108713 | 41401 | 2245 | 19 | 2255897 | 532709
sl | 129834 | 73099 | 22036 | 4235 | 470 | 1213801 | 222447
bg | 125762 | 87679 | 38825 | 3984 | 274 | 774443 | 488678
hr | 109890 | 71469 | 10343 | 3334 | 158 | 701182 | 151196
el | 71936 | 48260 | 10813 | 2866 | 288 | 206460 | 113838

Table 2: Basic statistics about localized DBpedia editions.

Class | en | it | pl | es | pt | fr | de | ru | ca | hu
Person | 763643 | 145060 | 70708 | 65337 | 43057 | 62942 | 33122 | 18620 | 7107 | 15529
  Athlete | 185126 | 47187 | 30332 | 19482 | 14130 | 21646 | 31237 | 0 | 721 | 4527
  Artist | 61073 | 12511 | 16120 | 25992 | 10571 | 13465 | 0 | 0 | 2004 | 3821
  Politician | 23096 | 0 | 8943 | 5513 | 3342 | 0 | 0 | 12004 | 1376 | 760
Place | 572728 | 141101 | 182727 | 132961 | 116660 | 80602 | 131766 | 67932 | 73078 | 18324
  PopulPlace | 387166 | 138077 | 167034 | 121204 | 109418 | 72252 | 79410 | 63826 | 72743 | 15535
  Building | 60514 | 1270 | 2946 | 3570 | 803 | 921 | 83 | 43 | 0 | 527
  River | 24267 | 0 | 1383 | 2 | 4149 | 3333 | 6707 | 3924 | 0 | 565
Organisation | 192832 | 4142 | 12193 | 11710 | 10949 | 17513 | 16973 | 1598 | 1399 | 3993
  Company | 44516 | 4142 | 2566 | 975 | 1903 | 5832 | 7200 | 0 | 440 | 618
  EducInst | 42270 | 0 | 599 | 1207 | 270 | 1636 | 2171 | 1010 | 0 | 115
  Band | 27061 | 0 | 2993 | 0 | 4476 | 3868 | 5368 | 0 | 263 | 802
Work | 333269 | 51918 | 32386 | 36484 | 33869 | 39195 | 18195 | 34363 | 5240 | 9777
  MusicWork | 159070 | 23652 | 14987 | 15911 | 17053 | 16206 | 0 | 6588 | 697 | 4704
  Film | 71715 | 17210 | 9467 | 9396 | 8896 | 9741 | 15038 | 12045 | 3859 | 2320
  Software | 27947 | 5682 | 3050 | 4833 | 3419 | 5733 | 2368 | 0 | 606 | 857

Table 3: Number of instances per class within 10 localized DBpedia versions.

The number 871630 for the class Person means that all 20 language versions together describe 871630 different persons. The number is higher than the number of persons described in the canonicalized English infobox data set (763643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195263 persons are described in two or more languages. The large difference of this number compared to the total number of 871630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).


Class | Instances | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10+
Person | 871630 | 676367 | 94339 | 42382 | 21647 | 12936 | 8198 | 5295 | 3437 | 2391 | 4638
Place | 643260 | 307729 | 150349 | 45836 | 36339 | 20831 | 13523 | 20808 | 31422 | 11262 | 5161
Organisation | 206670 | 160398 | 22661 | 9312 | 5002 | 3221 | 2072 | 1421 | 928 | 594 | 1061
Work | 360808 | 243706 | 54855 | 23097 | 12605 | 8277 | 5732 | 4007 | 2911 | 1995 | 3623

Table 4: Cross-language overlap: number of instances that are described in multiple languages.

3.3 Internationalisation Community

The introduction of the mapping-based infobox extractor alongside live synchronisation approaches in [19] allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages6. The DBpedia 3.7 release7 in September 2011 was the first DBpedia release to use the localized I18n DBpedia extraction framework [23].

At the time of writing, official DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish8. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services. Recently, the DBpedia internationalisation committee9 has manifested its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek10 and Dutch11), the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [23].

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted.

6 ar, bg, bn, ca, cs, de, el, en, es, et, eu, fr, ga, hi, hr, hu, id, it, ja, ko, nl, pl, pt, ru, sl, tr, ur
7 http://blog.dbpedia.org/2011/09/11/
8 Accessed on 25/02/2013: http://wiki.dbpedia.org/Internationalization/Chapters
9 http://wiki.dbpedia.org/Internationalization
10 http://el.dbpedia.org
11 http://nl.dbpedia.org

Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage. The mapping statistics have also been used as a way to promote a healthy competition between language editions: a sprint page was created with bar charts that show how close each language is to achieving total coverage (cf. Figure 5) and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

4 Live Synchronisation

Wikipedia articles are continuously revised at a very high rate, e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute12. This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [19,33]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported13. Meanwhile, DBpedia Live for Dutch14 was developed.

12 http://stats.wikimedia.org/EN/SummaryEN.htm
13 http://live.dbpedia.org
14 http://live.nl.dbpedia.org


4.1 DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia Foundation kindly provided us access to their update stream using the OAI-PMH protocol [24]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component, serving as a proxy, constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.
– Mappings Wiki: The DBpedia Mappings Wiki described in Section 2.4 serves as secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.
– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.
– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

Fig. 8: Overview of DBpedia Live extraction framework.

4.2 Features of DBpedia Live

The core components of the DBpedia Live extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.
– Unmodified Pages: The update of unmodified pages at regular intervals.
– Changesets Publication: The publication of triple-changesets.
– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.
– Data Isolation: Separate data from different sources.

Mapping-Affected Pages: Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (e.g. dbo:translator) or an existing one (e.g. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.

Unmodified Pages: Extraction framework improvements or the activation or deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feed and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets: Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set for the added triples and another set for the deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.

Synchronisation Tool: The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.
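
Conceptually, integrating one changeset amounts to issuing a DELETE DATA operation for the triples in the deleted set and an INSERT DATA operation for the triples in the added set, e.g. as in the following SPARQL Update sketch. The triples shown are made up for illustration; the actual tool processes the changeset files in the order of their sequence numbers.

    # Triples from the changeset's deleted set
    DELETE DATA {
      <http://dbpedia.org/resource/Berlin>
        <http://dbpedia.org/ontology/populationTotal>
        "3499879"^^<http://www.w3.org/2001/XMLSchema#integer> .
    } ;
    # Triples from the changeset's added set
    INSERT DATA {
      <http://dbpedia.org/resource/Berlin>
        <http://dbpedia.org/ontology/populationTotal>
        "3501872"^^<http://www.w3.org/2001/XMLSchema#integer> .
    }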

Data Isolation: In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and mapping-based infobox extraction.
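
For example, a single query can combine live article data with the static interlinks by addressing the graphs explicitly, as in the following sketch (the class dbo:Settlement is just an illustration):

    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    PREFIX dbo: <http://dbpedia.org/ontology/>

    SELECT ?place ?external WHERE {
      GRAPH <http://live.dbpedia.org>   { ?place a dbo:Settlement . }
      GRAPH <http://static.dbpedia.org> { ?place owl:sameAs ?external . }
    }
    LIMIT 100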

5 Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as the external data sets that set links pointing at DBpedia resources.

5.1 Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third party data sets. The DBpedia project maintains a link repository15 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third column contain the predicate that is used for linking as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

Links in DBpedia have been used for various purposes. One example is the combination of data about European Union project funding (FTS) [29] and data about countries in DBpedia.

15 https://github.com/dbpedia/dbpedia-links

Data set | Predicate | Count | Tool
Amsterdam Museum | owl:sameAs | 627 | S
BBC Wildlife Finder | owl:sameAs | 444 | S
Book Mashup | rdf:type, owl:sameAs | 9 100 |
Bricklink | dc:publisher | 10 100 |
CORDIS | owl:sameAs | 314 | S
Dailymed | owl:sameAs | 894 | S
DBLP Bibliography | owl:sameAs | 196 | S
DBTune | owl:sameAs | 838 | S
Diseasome | owl:sameAs | 2 300 | S
Drugbank | owl:sameAs | 4 800 | S
EUNIS | owl:sameAs | 3 100 | S
Eurostat (Linked Stats) | owl:sameAs | 253 | S
Eurostat (WBSG) | owl:sameAs | 137 |
CIA World Factbook | owl:sameAs | 545 | S
flickr wrappr | dbp:hasPhotoCollection | 3 800 000 | C
Freebase | owl:sameAs | 3 600 000 | C
GADM | owl:sameAs | 1 900 |
GeoNames | owl:sameAs | 86 500 | S
GeoSpecies | owl:sameAs | 16 000 | S
GHO | owl:sameAs | 196 | L
Project Gutenberg | owl:sameAs | 2 500 | S
Italian Public Schools | owl:sameAs | 5 800 | S
LinkedGeoData | owl:sameAs | 103 600 | S
LinkedMDB | owl:sameAs | 13 800 | S
MusicBrainz | owl:sameAs | 23 000 |
New York Times | owl:sameAs | 9 700 |
OpenCyc | owl:sameAs | 27 100 | C
OpenEI (Open Energy) | owl:sameAs | 678 | S
Revyu | owl:sameAs | 6 |
Sider | owl:sameAs | 2 000 | S
TCMGeneDIT | owl:sameAs | 904 |
UMBEL | rdf:type | 896 400 |
US Census | owl:sameAs | 12 600 |
WikiCompany | owl:sameAs | 8 300 |
WordNet | dbp:wordnet_type | 467 100 |
YAGO2 | rdf:type | 18 100 000 |
Sum | | 27 211 732 |

Table 5: Data sets linked from DBpedia.

The following query compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia):

    SELECT * {
      { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
          ?com rdf:type fts-o:Commitment .
          ?com fts-o:year ?year .
          ?year rdfs:label ?ftsyear .
          ?com fts-o:benefit ?benefit .
          ?benefit fts-o:detailAmount ?amount .
          ?benefit fts-o:beneficiary ?beneficiary .
          ?beneficiary fts-o:country ?country .
          ?country owl:sameAs ?ftscountry .
        } GROUP BY ?ftsyear ?ftscountry }
      { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
          ?dbpcountry rdf:type dbo:Country .
          ?dbpcountry dbp:gdpNominal ?gdpnominal .
          ?dbpcountry dbp:gdpNominalYear ?gdpyear .
        } }
      FILTER ((?ftsyear = str(?gdpyear)) &&
              (?ftscountry = ?dbpcountry))
    }

In addition to providing outgoing links on instance level, DBpedia also sets links on schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively. In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
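
These schema-level links can themselves be retrieved from the ontology with a query along the following lines (a sketch):

    PREFIX owl: <http://www.w3.org/2002/07/owl#>

    # DBpedia ontology classes declared equivalent to Schema.org classes
    SELECT ?dboClass ?schemaClass WHERE {
      ?dboClass owl:equivalentClass ?schemaClass .
      FILTER (STRSTARTS(STR(?schemaClass), "http://schema.org/"))
    }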

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of links pointing at DBpedia from other data sets is 39007478 according to the Data Hub16. However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [36]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of the subject and the object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary over all resources they store.

16 See http://wiki.dbpedia.org/Interlinking for details.

Data set | Link Predicate Count | Link Count
okaboo.com | 4 | 2407121
tfri.gov.tw | 57 | 837459
naplesplus.us | 2 | 212460
fu-berlin.de | 7 | 145322
freebase.com | 108 | 138619
geonames.org | 3 | 125734
opencyc.org | 3 | 19648
geospecies.org | 10 | 16188
dbrec.net | 3 | 12856
faviki.com | 5 | 12768

Table 6: Top 10 data sets in Sindice ordered by the number of links to DBpedia.

Metric | Value
Total links | 3960212
Total distinct data sets | 248
Total distinct predicates | 345

Table 7: Sindice summary statistics for incoming links to DBpedia.

summary over all resources they store With the helpof the Sindice team we examined this graph summaryto obtain all links pointing to DBpedia As shown inTable 7 Sindice knows about 248 data sets linking toDBpedia 70 of those data sets link to DBpedia viaowlsameAs but other link predicates are also verycommon as evident in this table In total Sindice hasindexed 4 million links pointing at DBpedia Table 6lists the 10 data sets which set most links to DBpediaalong with the used link predicate and the number oflinks

It should be noted that the data in Sindice is not complete; for instance, it does not contain all data sets that are catalogued by the DataHub.17 However, it crawls for RDFa snippets, converts microformats, etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster (see appendix for details). Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data, and DBpedia is currently ranked second in terms of incoming links.

17 http://datahub.io


Domain                 Datasets        Links
purl.org                    498    6,717,520
dbpedia.org                 248    3,960,212
creativecommons.org       2,483    3,030,910
identi.ca                 1,021    2,359,276
l3s.de                       34    1,261,487
rkbexplorer.com              24    1,212,416
nytimes.com                  27    1,174,941
w3.org                      405      658,535
geospecies.org               13      523,709
livejournal.com          14,881      366,025

Table 8
Top 10 datasets by incoming links in Sindice

[Fig. 9: The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, per quarter (2012-Q1 to 2012-Q4).]

6 DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint and, third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last couple of years.

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to their download popularity. The top five languages with respect to the download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language edition are indicated in Figure 9. To host the DBpedia data set downloads, a bandwidth of approximately 18 TB per quarter, or 6 TB per month, is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 are depicted

DBpedia      Configuration
3.3 - 3.4    AMD Opteron 8220, 2.80 GHz, 4 cores, 32 GB
3.5 - 3.7    Intel Xeon E5520, 2.27 GHz, 8 cores, 48 GB
3.8          Intel Xeon E5-2630, 2.30 GHz, 8 cores, 64 GB

Table 9
Hardware of the machines serving the public SPARQL endpoint

in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1,000 times per month.18 Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Supposedly, they are used for network analysis or for providing relevant links in user interfaces, and they are downloaded more often as they are not provided via the official SPARQL endpoint.

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint19 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports infinite horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round-robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection of any size. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful hardware, as shown in Table 9.

The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.

The log files used in the following analysis excluded traffic generated by

1. clients that have been temporarily rate limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally using too many resources.

18 The IP address was only filtered for that specific file and month in those cases.

19 http://dbpedia.org/sparql

[Fig. 10: The download count (in thousands) and download volume (in TB) of the DBpedia data sets.]

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10 MB) are enforced. If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code20 like 509 and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.21.21 Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the last couple of DBpedia releases. The Avg/Day column represents the average number of hits (resp. visits) per day, followed by the Median and Standard Deviation. The last column

20 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

21 http://www.webalizer.org

DBpedia     Avg/Day      Median        Stdev      Maximum
3.3           733,811     711,306      188,991    1,319,539
3.4         1,212,549   1,165,893      351,226    2,371,657
3.5         1,122,612   1,035,444      386,454    2,908,720
3.6         1,328,355   1,286,750      256,945    2,495,031
3.7         2,085,399   1,930,728    1,057,398    8,873,473
3.8         2,910,410   2,717,775    1,085,640    7,678,490

Table 10
Number of SPARQL endpoint hits

Fig 11 Average number of SPARQL endpoint requests per day

shows the maximum number of hits (visits) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30 minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30 minute interval.
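The following minimal Python sketch illustrates this sessionisation logic under the assumption that log records are plain (client IP, Unix timestamp) pairs; the actual log analysis tooling differs in detail:

    # Sketch of visit counting with a floating 30-minute window.
    WINDOW = 30 * 60  # 30 minutes, in seconds

    def count_visits(log):
        """Count visits in a list of (client_ip, unix_timestamp) request records."""
        last_seen = {}  # client_ip -> timestamp of that client's previous request
        visits = 0
        for ip, ts in sorted(log, key=lambda row: row[1]):
            # a new visit starts if the client is new or was idle longer than the window
            if ip not in last_seen or ts - last_seen[ip] > WINDOW:
                visits += 1
            last_seen[ip] = ts
        return visits

    # two requests 10 minutes apart form one visit; a request 2 hours later starts a second visit
    print(count_visits([("10.0.0.1", 0), ("10.0.0.1", 600), ("10.0.0.1", 7800)]))  # -> 2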


DBpedia    Avg/Day    Median     Stdev    Maximum
3.3          9,750      9,775    2,036     13,126
3.4         11,329     11,473    1,546     14,198
3.5         16,592     16,902    2,969     23,129
3.6         19,471     17,664    5,691     56,064
3.7         23,972     22,262   10,741    127,869
3.8         16,851     16,711    2,960     27,416

Table 11
Number of SPARQL endpoint visits

Fig 12 Average SPARQL endpoint visits per day

Figure 11 and Figure 12 show the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language-specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days' worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

Endpoint     3.3     3.4     3.5     3.6     3.7     3.8
data       1,355   2,681   2,230   2,547   3,714   4,246
ontology      80     168     142     178     116      97
page       2,634   4,703   1,810   1,687   3,787   7,377
property     231     311     137     176     176     128
resource   2,976   4,080   2,554   2,309   4,436   7,143
sparql     2,090   4,177   2,266   9,112  15,902  15,475
other        252     420     342     277     579     695
total      9,619  16,541   9,434  16,286  28,710  35,142

Table 12
Hits per service to http://dbpedia.org in thousands

Fig 13 Traffic Linked Data versus SPARQL endpoint

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x that redirects the client to either the page (HTML-based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs for either the page or data version of an article directly, or download the raw data from the links at the bottom of the page-based article. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology.
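The exchange below sketches this behaviour; the 303 status code and the exact redirect targets are shown only as an illustration of the redirect pattern and may differ in detail:

    GET /resource/Berlin HTTP/1.1
    Host: dbpedia.org
    Accept: text/html
    -> 303 See Other, Location: http://dbpedia.org/page/Berlin

    GET /resource/Berlin HTTP/1.1
    Host: dbpedia.org
    Accept: text/turtle
    -> 303 See Other, Location: http://dbpedia.org/data/Berlin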

Figure 13 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focussed on the calls to the sparql endpoint and counted the number of statements per type. As the log files only record the full SPARQL query on a GET request, all the PUT requests are counted as unknown.


Statement     3.3     3.4     3.5     3.6     3.7     3.8
ask            56     269     360     326     159     677
construct      55      31      14      11      22      38
describe       11       8       4       7      62     111
select      1,891   3,663   1,658   8,030  11,204  13,516
unknown        78     206     229     738   4,455   1,134
total       2,090   4,177   2,266   9,112  15,902  15,475

Table 13
Hits per statement type in thousands

Statement    3.3    3.4    3.5    3.6    3.7    3.8
distinct    19.5   11.4   17.3   19.4   13.3   25.4
filter      45.7   13.7   31.8   25.3   29.2   36.1
functions    8.8    6.9   23.5   21.3   25.5   25.9
geo         27.7    7.0   39.6    6.7    9.3    9.2
group        0.0    0.0    0.0    0.0    0.0    0.2
limit        4.6    6.5   11.6   10.5    7.7    7.8
optional    30.6   23.8   47.3   26.3   16.7   17.2
order        2.3    1.3    1.9    3.2    1.2    1.6
union        3.3    3.2   26.4   11.3   12.1   20.6

Table 14
Trends in SPARQL select queries (rounded values in %)

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– use of GEO objects
– GROUP BY
– LIMIT / OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of the various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 14 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 14. Number of daily requests sent to DBpedia Live for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013.

7 Use Cases and Applications

With DBpedia evolving into a hub on the Web of Data due to its coverage of various domains, it is an eminent data set for various tasks. It can not only improve Wikipedia search, but also serve as a data source for applications and mashups, as well as for text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [30]. For that purpose, DBpedia includes a number of specialized data sets.22 For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [12]. The thematic concepts data set of resources can be used for creating a corpus from

22 http://wiki.dbpedia.org/Datasets/NLP


Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

7.1.1 Annotation, Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking – refers to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [21].

As DBpedia reflects a vast amount of structured real-world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [31] is an open source tool23 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as a context model in order to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base described as SPARQL queries.
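As a rough illustration of how such a web service is invoked, a request along the following lines can be used; the endpoint path and parameter names reflect the Spotlight service as commonly documented at the time and are given here only as an example, since they may have changed since:

    GET /rest/annotate?text=Berlin+is+the+capital+of+Germany&confidence=0.2&support=20 HTTP/1.1
    Host: spotlight.dbpedia.org
    Accept: application/json

The response lists the detected surface forms together with the DBpedia URIs they were disambiguated to.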

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI24, the Semantic API from Ontos25, Open Calais26 and Zemanta27, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems

23 http://spotlight.dbpedia.org
24 http://www.alchemyapi.com
25 http://www.ontos.com
26 http://www.opencalais.com
27 http://www.zemanta.com

(without disambiguation) in the context of the Apache Stanbol project.28

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki29 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [13]. The BBC30 employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can easily be integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia.31 DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series.32 Several QA systems, such as TBSL [41], PowerAqua [28], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

28 http://stanbol.apache.org
29 http://www.faviki.com
30 http://bbc.co.uk
31 http://www.aaai.org/Magazine/Watson/watson.php
32 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [25].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide

– Context information for bibliographic and archive records: Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.

– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project.33 Recently, it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia.34 These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of such tools and structure them in categories.

33 http://viaf.org
34 Accessed on 12.02.2013, http://www.oclc.org/research/news/2012/12-07a.html

Facet-based browsers. An award-winning35 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full text from Wikipedia in order to compute hierarchical facets [14]. Another facet-based browser, which allows users to create complex graph structures of facets in a visually appealing interface and to filter them, is gFacet [15]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LodLive [7]. The OpenLink built-in facet-based browser36 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and querying. The DBpedia Query Builder37 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [16] tool provides an intuitive interface which allows users to explore the neighborhood of and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [17] allows users to create statistical analysis queries and correlations in RDF data and DBpedia in particular.
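As an illustration, the kind of query the builder helps to assemble, i.e. a small set of triple patterns, might look as follows; the class, property and resource names are ordinary DBpedia identifiers chosen here only as an example:

    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX dbr: <http://dbpedia.org/resource/>

    SELECT ?person WHERE {
      ?person a dbo:Person ;
              dbo:birthPlace dbr:Leipzig .
    }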

Spatial applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources available, written in 171 languages (of which approximately 147 can be considered active38), containing information about hundreds of spoken and even ancient

35 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
36 http://dbpedia.org/fct
37 http://querybuilder.dbpedia.org
38 http://s23.org/wikistats/wiktionaries_html.php


languages. For example, the English Wiktionary contains nearly 3 million words.39 A Wiktionary page provides, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies and, most prominently, senses. Within this tree, numerous kinds of linguistic properties are given, including synonyms, hyponyms, hyperonyms, example sentences, links to Wikipedia and many more. The fast-changing nature together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules poses the biggest problem for the automated transformation into a structured knowledge base. We identified this as a serious problem: although the value of Wiktionary is known and usage scenarios are obvious, only some rudimentary tools exist, which either focus on a specific subset of the data or only cover one or two WLEs [18]. Existing tools can be seen as adapters to single WLEs; they are hard to maintain, and there are too many languages that constantly change and require a programmer to refactor complex code. As opposed to the currently employed classic and straightforward approach (implementing software adapters for scraping), Wiktionary2RDF [18] employs a declarative mediator/wrapper pattern. Figure 15 shows the work flow of the specially created Wiktionary extractor that takes as input one config XML file for each WLE. The aim is to enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves. A simple XML dialect was created to encode the "entry layout explained" (ELE) guidelines and declare triple patterns that define how the resulting RDF should be built. An example is given for the following wiki syntax:

===Synonyms===
[[building]]
[[company]]

The regular expression for scraping and the triple generation rule encoded in XML look like:

<wikiTemplate>===Synonyms===
( [[$target]]
)+
</wikiTemplate>

<triple s="http://some.ns/$entityId"
        p="http://some.ns/hasSynonym"
        o="http://some.ns/$target" />

39 See http://en.wiktionary.org/wiki/semantic for a simple example page.

Fig 15 Detail view of the Wiktionary extractor

This configuration is interpreted and run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data.40 Statistics are shown in Table 15.

8 Related Work

8.1 Cross-Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, the Wikimedia Germany e.V. started the development of Wikidata.41 Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering the truth about things, but at providing statements about them. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011 according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

40 http://wiktionary.dbpedia.org
41 wikidata.org


Language      Words        Triples      Resources     Predicates   Senses      XML lines
en          2,142,237    28,593,364    11,804,039         28       424,386        930
fr          4,657,817    35,032,121    20,462,349         22       592,351        490
ru          1,080,156    12,813,437     5,994,560         17       149,859      1,449
de            701,739     5,618,508     2,966,867         16       122,362        671

Table 15
Statistical comparison of extractions for different languages; "XML lines" measures the number of lines of the XML configuration files.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface42 as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase43 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and on serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [32] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version.

42 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
43 http://www.freebase.com

This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only be edited indirectly, by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and in areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3 YAGO

One of the projects which pursues similar goals to DBpedia is YAGO44 [39]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [20], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as from other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content, whereas YAGO focuses on extracting a smaller number of relations compared to DBpedia, in order to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia

44 http://www.mpi-inf.mpg.de/yago-naga/yago/


community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects45 which are able to process wiki syntax, as well as a list of data extraction extensions46 for MediaWiki.

JWPL (Java Wikipedia Library) [46] is an open-source Java-based API that allows access to the information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as an XML dump and is incorporated in the lexical resource UBY47 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [34]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguation or for synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service.48 Additional projects presented by the authors that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site.49

An earlier approach to improve the quality of the infobox schemata and contents is described in [44].

45 http://www.mediawiki.org/wiki/Alternative_parsers
46 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction
47 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
48 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
49 http://sigwp.org/en

The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text by applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [35], the authors describe their effort of building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike50 and Hudong Baike51. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [42]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

KnowItAll52 is a web-scale knowledge extraction effort which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [26] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [37] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [43] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [44] is a system which autonomously extracts structured data from Wikipedia and uses

50 http://baike.baidu.com
51 http://www.hudong.com
52 http://www.cs.washington.edu/research/knowitall


self-supervised linking. [2] was an infobox extraction approach for Wikipedia which later became the DBpedia project.

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that DBpedia has matured and improved significantly in the last years, in particular also in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof of concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [18] meanwhile becoming part of DBpedia, and LinkedGeoData [38], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach a significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [45], we took a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki.53 If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other nicely. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].
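A minimal sketch of such a sanity check for the birth/death example, assuming the standard dbo:birthDate and dbo:deathDate properties, is the following SPARQL query, whose result set, ideally empty, lists suspicious resources:

    PREFIX dbo: <http://dbpedia.org/ontology/>

    SELECT ?person ?birth ?death WHERE {
      ?person dbo:birthDate ?birth ;
              dbo:deathDate ?death .
      FILTER (?death < ?birth)
    }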

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP. Despite recent advances (cf. Section 7), there is still a huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and to update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to which entity a certain surface form should be mapped.

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Paul Kreis for DBpedia development
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.
[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.
[3] C. Becker and C. Bizer. Exploring the geospatial semantic web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.
[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.
[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.
[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the web of data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.
[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.
[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.
[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.
[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web, pages 100–110. ACM, 2004.
[12] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: enriching DBpedia with multimedia information. In Proceedings of the Sixth International Conference on Knowledge Capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.
[13] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.
[14] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.
[15] P. Heim, T. Ertl, and J. Ziegler. Facet graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.
[16] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.
[17] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.
[18] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In Proc. of the Joint International Semantic Technology Conference, 2012.
[19] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.
[20] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.
[21] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.
[22] D. Kontokostas, S. Auer, S. Hellmann, J. Lehmann, P. Westphal, R. Cornelissen, and A. Zaveri. Test-driven data quality evaluation for SPARQL endpoints. In Submitted to 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.
[23] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of linked data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.
[24] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The open archives initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.
[25] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.
[26] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.
[27] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics, pages 495–501, 2000.
[28] V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the semantic web. Semantic Web, 3(3):249–265, 2012.
[29] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.
[30] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.
[31] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.
[32] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.
[33] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.
[34] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.
[35] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin, Heidelberg, 2011. Springer-Verlag.
[36] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, Nov. 2008.
[37] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.
[38] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.
[39] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.
[40] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.
[41] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st International Conference on World Wide Web, pages 639–648. ACM, 2012.
[42] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st International Conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.
[43] F. Wu and D. Weld. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th World Wide Web Conference, 2008.
[44] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.
[45] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.
[46] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

Appendix

A Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net/>
PREFIX an:    <http://vocab.sindice.net/analytics>
PREFIX dom:   <http://sindice.com/dataspace/default/domain>

SELECT ?Dataset ?Target_Dataset ?predicate
       SUM(xsd:long(?cardinality)) AS ?total
FROM <http://sindice.com/analytics> {
  ?target any23:domain_uri ?Target_Dataset .
  {
    ?edge an:source ?source ;
          an:target ?target ;
          an:label ?predicate ;
          an:cardinality ?cardinality ;
          an:publishedIn ?Dataset .
  }
  ?source any23:domain_uri ?Dataset .

  FILTER (?Dataset != ?Target_Dataset)
}

Listing 1: SPARQL query on the Sindice summary graph for obtaining the number of incoming links.


1 Introduction

The DBpedia community project extracts knowl-edge from Wikipedia and makes it widely availablevia established Semantic Web standards and LinkedData best practices Wikipedia is currently the 7th mostpopular website1 the most widely used encyclopediaand one of the finest examples of truly collaborativelycreated content However due to the lack of the ex-ploitation of the inherent structure of Wikipedia arti-cles Wikipedia itself only offers very limited query-ing and search capabilities For instance it is difficultto find all rivers that flow into the Rhine or all Italiancomposers from the 18th century One of the goals ofthe DBpedia project is to provide those querying andsearch capabilities to a wide community by extractingstructured data from Wikipedia which can then be usedfor answering expressive queries such as the ones out-lined above

The DBpedia project was started in 2006 and hasmeanwhile attracted significant interest in research andpractice It has been a key factor for the success of theLinked Open Data initiative and serves as an interlink-ing hub for other data sets (see Section 5) For the re-search community DBpedia provides a testbed servingreal data spanning various domains and more than 100language editions Numerous applications algorithmsand tools have been build around or applied to DBpe-dia Due to the continuous growth of Wikipedia andimprovements in DBpedia the extracted data providesan increasing added value for data acquisition re-useand integration tasks within organisations While thequality of extracted data is unlikely to reach the qualityof completely manually curated data sources it can beapplied to some enterprise information integration usecases and has shown to be relevant beyond researchprojects as we will describe in in Section 7

One of the reasons why DBpediarsquos data quality hasimproved over the past years is that the structure of theknowledge in DBpedia itself is meanwhile maintainedby its community Most importantly the communitycreates mappings from Wikipedia information repre-sentation structures to the DBpedia ontology Thisontology unifies different template structures whichwill later be explained in detail ndash both within singleWikipedia language editions and across currently 27different languages The maintenance of different lan-guage editions of DBpedia is spread across a number

1See httpwwwalexacomtopsites Retrieved inJune 2013

of organisations Each organisation is responsible forthe support of a certain language The local DBpediachapters are coordinated by the DBpedia Internation-alisation Committee In addition to multilingual sup-port DBpedia also provides data-level links into morethan 30 external data sets which are partially also con-tributed from partners beyond the core project team

The aim of this system report is to provide a de-scription of the DBpedia community project includ-ing the architecture of the DBpedia extraction frame-work its technical implementation maintenance in-ternationalisation usage statistics as well as showcas-ing some popular DBpedia applications This systemreport is a comprehensive update and extension of pre-vious project descriptions in [1] and [5] The main nov-elties compared to these articles are

ndash The concept and implementation of the extrac-tion based on a community-curated DBpedia on-tology

ndash The wide internationalisation of DBpediandash A live synchronisation module which processes

updates in Wikipedia as well as the DBpediaontology and allows third parties to keep theircopies of DBpedia up-to-date

ndash A description of the maintenance of public DB-pedia services and statistics about their usage

ndash An increased number of interlinked data setswhich can be used to further enrich the content ofDBpedia

ndash The discussion and summary of novel third partyapplications of DBpedia

In essence the report summarizes major developmentsin DBpedia in the past four years since the publicationof [5]

The system report is structured as follows In thenext section we describe the DBpedia extractionframework which forms the technical core of DB-pedia This is followed by an explanation of thecommunity-curated DBpedia ontology with a focus onits evolution over the past years and multilingual sup-port In Section 4 we explicate how DBpedia is syn-chronised with Wikipedia with just very short delaysand how updates are propagated to DBpedia mirrorsemploying the DBpedia Live system Subsequentlywe give an overview of the external data sets that areinterlinked from DBpedia or that set data-level linkspointing at DBpedia themselves (Section 5) In Sec-tion 6 we provide statistics on the access of DBpediaand describe lessons learned for the maintenance of alarge scale public data set Within Section 7 we briefly

Lehmann et al DBpedia 3

describe several use cases and applications of DBpe-dia in a variety of different areas Finally we report onrelated work in Section 8 and conclude in Section 9

2 Extraction Framework

Wikipedia articles consist mostly of free text butalso comprise various types of structured informationin the form of wiki markup Such information includesinfobox templates categorisation information imagesgeo-coordinates links to external web pages disam-biguation pages redirects between pages and linksacross different language editions of Wikipedia TheDBpedia extraction framework extracts this structuredinformation from Wikipedia and turns it into a richknowledge base In this section we give an overviewof the DBpedia knowledge extraction framework

21 General Architecture

Figure 1 shows an overview of the technical frame-work The DBpedia extraction is structured into fourphases

Input: Wikipedia pages are read from an external source. Pages can either be read from a Wikipedia dump or directly fetched from a MediaWiki installation using the MediaWiki API.

Parsing: Each Wikipedia page is parsed by the wiki parser. The wiki parser transforms the source code of a Wikipedia page into an Abstract Syntax Tree.

Extraction: The Abstract Syntax Tree of each Wikipedia page is forwarded to the extractors. DBpedia offers extractors for many different purposes, for instance, to extract labels, abstracts or geographical coordinates. Each extractor consumes an Abstract Syntax Tree and yields a set of RDF statements.

Output: The collected RDF statements are written to a sink. Different formats, such as N-Triples, are supported.

2.2. Extractors

The DBpedia extraction framework employs various extractors for translating different parts of Wikipedia pages to RDF statements. A list of all available extractors is shown in Table 1. DBpedia extractors can be divided into four categories:

Mapping-Based Infobox Extraction: The mapping-based infobox extraction uses manually written mappings that relate infoboxes in Wikipedia to terms in the DBpedia ontology. The mappings also specify a datatype for each infobox property and thus help the extraction framework to produce high quality data. The mapping-based extraction will be described in detail in Section 2.4.

Raw Infobox Extraction: The raw infobox extraction provides a direct mapping from infoboxes in Wikipedia to RDF. As the raw infobox extraction does not rely on explicit extraction knowledge in the form of mappings, the quality of the extracted data is lower. The raw infobox data is useful if a specific infobox has not been mapped yet and thus is not available in the mapping-based extraction.

Feature Extraction: The feature extraction uses a number of extractors that are specialized in extracting a single feature from an article, such as a label or geographic coordinates.

Statistical Extraction: Some NLP related extractors aggregate data from all Wikipedia pages in order to provide data that is based on statistical measures of page links or word counts, as further described in Section 2.6.

2.3. Raw Infobox Extraction

The type of Wikipedia content that is most valuable for the DBpedia extraction are infoboxes. Infoboxes are frequently used to list an article's most relevant facts as a table of attribute-value pairs on the top right-hand side of the Wikipedia page (for right-to-left languages on the top left-hand side). Infoboxes that appear in a Wikipedia article are based on a template that specifies a list of attributes that can form the infobox. A wide range of infobox templates are used in Wikipedia. Common examples are templates for infoboxes that describe persons, organisations or automobiles. As Wikipedia's infobox template system has evolved over time, different communities of Wikipedia editors use different templates to describe the same type of things (e.g. Infobox city japan, Infobox swiss town and Infobox town de). In addition, different templates use different names for the same attribute (e.g. birthplace and placeofbirth). As many Wikipedia editors do not strictly follow the recommendations given on the page that describes a template, attribute values are expressed using a wide range of different formats and units of measurement. An excerpt of an infobox that is based on a template for describing automobiles is shown below.

Name | Description | Example
abstract | Extracts the first lines of the Wikipedia article | dbr:Berlin dbo:abstract "Berlin is the capital city of (...)"
article categories | Extracts the categorization of the article | dbr:Oliver_Twist dc:subject dbr:Category:English_novels
category label | Extracts labels for categories | dbr:Category:English_novels rdfs:label "English novels"
disambiguation | Extracts disambiguation links | dbr:Alien dbo:wikiPageDisambiguates dbr:Alien_(film)
external links | Extracts links to external web pages related to the concept | dbr:Animal_Farm dbo:wikiPageExternalLink <http://books.google.com/?id=RBGmrDnBs8UC>
geo coordinates | Extracts geo-coordinates | dbr:Berlin georss:point "52.5006 13.3989"
grammatical gender | Extracts grammatical genders for persons | dbr:Abraham_Lincoln foaf:gender "male"
homepage | Extracts links to the official homepage of an instance | dbr:Alabama foaf:homepage <http://alabama.gov/>
image | Extracts the first image of a Wikipedia page | dbr:Berlin foaf:depiction <http://.../Overview_Berlin.jpg>
infobox | Extracts all properties from all infoboxes | dbr:Animal_Farm dbo:date "March 2010"
interlanguage | Extracts interwiki links | dbr:Albedo dbo:wikiPageInterLanguageLink dbr:Albedo
label | Extracts the article title as label | dbr:Berlin rdfs:label "Berlin"
lexicalizations | Extracts information about surface forms and their association with concepts (only N-Quad format) | dbr:Pine sptl:lexicalization lx:pine_tree ls:Pine_pine_tree ; lx:pine_tree rdfs:label "pine tree" ; ls:Pine_pine_tree sptl:pUriGivenSf 0.941
mappings | Extraction based on mappings of Wikipedia infoboxes to the DBpedia ontology | dbr:Berlin dbo:country dbr:Germany
page ID | Extracts page ids of articles | dbr:Autism dbo:wikiPageID 25
page links | Extracts all links between Wikipedia articles | dbr:Autism dbo:wikiPageWikiLink dbr:Human_brain
persondata | Extracts information about persons represented using the PersonData template | dbr:Andre_Agassi foaf:birthDate "1970-04-29"
PND | Extracts PND (Personennamendatei) data about a person | dbr:William_Shakespeare dbo:individualisedPnd "118613723"
redirects | Extracts redirect links between articles in Wikipedia | dbr:ArtificalLanguages dbo:wikiPageRedirects dbr:Constructed_language
revision ID | Extracts the revision ID of the Wikipedia article | dbr:Autism <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/Autism?oldid=495234324>
SKOS categories | Extracts information about which concept is a category and how categories are related using the SKOS Vocabulary | dbr:Category:World_War_II skos:broader dbr:Category:Modern_history
thematic concept | Extracts 'thematic' concepts, the centers of discussion for categories | dbr:Category:Music skos:subject dbr:Music
topic signatures | Extracts topic signatures | dbr:Alkane sptl:topicSignature "carbon alkanes atoms"
wiki page | Extracts links to corresponding articles in Wikipedia | dbr:AnAmericanInParis foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/AnAmericanInParis>

Table 1. Overview of the DBpedia extractors. sptl: is used as prefix for http://spotlight.dbpedia.org/vocab/, lx: is used for http://spotlight.dbpedia.org/lexicalizations/ and ls: is used for http://spotlight.dbpedia.org/scores/.


Fig. 1. Overview of DBpedia extraction framework.

{{Infobox automobile
| name = Ford GT40
| manufacturer = [[Ford Advanced Vehicles]]
| production = 1964-1969
| engine = 4181cc
(...)
}}

In this infobox, the first line specifies the infobox type and the subsequent lines specify various attributes of the described entity.

An excerpt of the extracted data is as follows:2

dbr:Ford_GT40
    dbp:name "Ford GT40"@en ;
    dbp:manufacturer dbr:Ford_Advanced_Vehicles ;
    dbp:engine 4181 ;
    dbp:production 1964 ;
    (...) .

This extraction output has weaknesses: the resource is not associated to a class in the ontology, and the engine and production data use literal values for which the semantics are not obvious. Those problems can be overcome by the mapping-based infobox extraction presented in the next subsection.

2 We use dbr: for http://dbpedia.org/resource/, dbo: for http://dbpedia.org/ontology/ and dbp: for http://dbpedia.org/property/ as prefixes throughout the report.

2.4. Mapping-Based Infobox Extraction

In order to homogenize the description of information in the knowledge base, in 2010 a community effort was initiated to develop an ontology schema and mappings from Wikipedia infobox properties to this ontology. The alignment between Wikipedia infoboxes and the ontology is performed via community-provided mappings that help to normalize name variations in properties and classes. Heterogeneity in the Wikipedia infobox system, like using different infoboxes for the same type of entity or using different property names for the same property (cf. Section 2.3), can be alleviated in this way. This significantly increases the quality of the raw Wikipedia infobox data by typing resources, merging name variations and assigning specific datatypes to the values.

This effort is realized using the DBpedia Mappings Wiki3, a MediaWiki installation set up to enable users to collaboratively create and edit mappings. These mappings are specified using the DBpedia Mapping Language. The mapping language makes use of MediaWiki templates that define DBpedia ontology classes and properties as well as template/table to ontology mappings. A mapping assigns a type from the DBpedia ontology to the entities that are described by the corresponding infobox. In addition, attributes in the infobox are mapped to properties in the DBpedia ontology. In the following, we show a mapping that maps infoboxes that use the Infobox automobile template to the DBpedia ontology.

3 http://mappings.dbpedia.org


{{TemplateMapping
| mapToClass = Automobile
| mappings =
    {{PropertyMapping
    | templateProperty = name
    | ontologyProperty = foaf:name }}
    {{PropertyMapping
    | templateProperty = manufacturer
    | ontologyProperty = manufacturer }}
    {{DateIntervalMapping
    | templateProperty = production
    | startDateOntologyProperty = productionStartDate
    | endDateOntologyProperty = productionEndDate }}
    {{IntermediateNodeMapping
    | nodeClass = AutomobileEngine
    | correspondingProperty = engine
    | mappings =
        {{PropertyMapping
        | templateProperty = engine
        | ontologyProperty = displacement
        | unit = Volume }}
        {{PropertyMapping
        | templateProperty = engine
        | ontologyProperty = powerOutput
        | unit = Power }}
    }}
    (...)
}}

The RDF statements that are extracted from the previous infobox example are shown below. As we can see, the production period is correctly split into a start year and an end year, and the engine is represented by a distinct RDF node:

dbr:Ford_GT40
    rdf:type dbo:Automobile ;
    rdfs:label "Ford GT40"@en ;
    dbo:manufacturer dbr:Ford_Advanced_Vehicles ;
    dbo:productionStartYear "1964"^^xsd:gYear ;
    dbo:productionEndYear "1969"^^xsd:gYear ;
    dbo:engine [
        rdf:type AutomobileEngine ;
        dbo:displacement 0.004181
    ] ;
    (...) .

The DBpedia Mappings Wiki is not only used to map different templates within a single language edition of Wikipedia to the DBpedia ontology, but is used to map templates from all Wikipedia language editions to the shared DBpedia ontology. Figure 2 shows how the infobox properties author and συγγραφέας – author in Greek – are both being mapped to the global identifier dbo:author. That means in turn that information from all language versions of DBpedia can be merged, and DBpedias for smaller languages can be augmented with knowledge from larger DBpedias, such as the English edition. Conversely, the larger DBpedia editions can benefit from more specialized knowledge from localized editions, such as data about smaller towns, which is often only present in the corresponding language edition [40].

Besides hosting of the mappings and DBpedia ontology definition, the DBpedia Mappings Wiki offers various tools which support users in their work:

– Mapping Validator: When editing a mapping, the mapping can be directly validated by a button on the edit page. This validates changes before saving them for syntactic correctness and highlights inconsistencies such as missing property definitions.
– Extraction Tester: The extraction tester linked on each mapping page tests a mapping against a set of example Wikipedia pages. This gives direct feedback about whether a mapping works and how the resulting data is structured.
– Mapping Tool: The DBpedia Mapping Tool is a graphical user interface that supports users to create and edit mappings.

2.5. URI Schemes

For every Wikipedia article, the framework introduces a number of URIs to represent the concepts described on a particular page. Up to 2011, DBpedia published URIs only under the http://dbpedia.org domain. The main namespaces were:

– http://dbpedia.org/resource/ (prefix dbr:) for representing article data. Apart from a few corner cases, there is a one-to-one mapping between a Wikipedia page and a DBpedia resource based on the article title. For example, for the Wikipedia article on Berlin4, DBpedia will produce the URI dbr:Berlin.
– http://dbpedia.org/property/ (prefix dbp:) for representing properties extracted from the raw infobox extraction (cf. Section 2.3), e.g. dbp:population.
– http://dbpedia.org/ontology/ (prefix dbo:) for representing the DBpedia ontology (cf. Section 2.4), e.g. dbo:populationTotal.

4 http://en.wikipedia.org/wiki/Berlin


Fig. 2. Depiction of the mapping from the Greek and English Wikipedia templates about books to the same DBpedia Ontology class [23].

Although data from other Wikipedia language editions were extracted, they used the same namespaces. This was achieved by exploiting the Wikipedia inter-language links5. For every page in a language other than English, the page was extracted only if the page contained an inter-language link to an English page. In that case, using the English link, the data was extracted under the English resource name (i.e. dbr:Berlin).

Recent DBpedia internationalisation developments showed that this approach resulted in less and redundant data [23]. Thus, starting from the DBpedia 3.7 release, two types of data sets were generated. The localized data sets contain all things that are described in a specific language. Within the datasets, things are identified with language specific URIs such as http://<lang>.dbpedia.org/resource/ for article data and http://<lang>.dbpedia.org/property/ for property data. In addition, we produce a canonicalized data set for each language. The canonicalized data sets only contain things for which a corresponding page in the English edition of Wikipedia exists. Within all canonicalized data sets, the same thing is identified with the same URI from the generic language-agnostic namespace http://dbpedia.org/resource/.
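As an illustration of the two URI schemes, the following Turtle sketch (illustrative triples, using the German edition as example and the prefix conventions of this report) contrasts a localized URI with its canonicalized counterpart:

@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .

# localized data set of the German DBpedia edition (language-specific namespace)
<http://de.dbpedia.org/resource/Berlin> rdfs:label "Berlin"@de .

# canonicalized data set: the same thing under the generic namespace,
# present because a corresponding English Wikipedia article exists
<http://dbpedia.org/resource/Berlin> rdfs:label "Berlin"@de .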

5 http://en.wikipedia.org/wiki/Help:Interlanguage_links

2.6. NLP Extraction

DBpedia provides a number of data sets which have been created to support Natural Language Processing (NLP) tasks [30]. Currently, four datasets are extracted: topic signatures, grammatical gender, lexicalizations and thematic concept. While the topic signatures and the grammatical gender extractors primarily extract data from the article text, the lexicalizations and thematic concept extractors make use of the wiki markup.

DBpedia entities can be referred to using many different names and abbreviations. The Lexicalization data set provides access to alternative names for entities and concepts, associated with several scores estimating the association strength between name and URI. These scores distinguish more common names for specific entities from rarely used ones and also show how ambiguous a name is with respect to all possible concepts that it can mean.

The topic signatures data set enables the description of DBpedia resources based on unstructured information, as compared to the structured factual data provided by the mapping-based and raw extractors. We build a Vector Space Model (VSM) where each DBpedia resource is a point in a multidimensional space of words. Each DBpedia resource is represented by a vector, and each word occurring in Wikipedia is a dimension of this vector. Word scores are computed using the tf-idf weight with the intention to measure how strong is the association between a word and a DBpedia resource. Note that word stems are used in this context in order to generalize over inflected words. We use the computed weights to select the strongest related word stems for each entity and build topic signatures [27].
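For reference, the common textbook form of this weighting is sketched below; the report does not specify which tf-idf variant the extractor implements, so this should be read as an assumed standard formulation. Here tf(t, r) is the frequency of word stem t in the Wikipedia article of resource r, df(t) the number of articles containing t, and N the total number of articles:

w(t, r) = \mathrm{tf}(t, r) \cdot \log \frac{N}{\mathrm{df}(t)}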

There are two more Feature Extractors related to Natural Language Processing. The thematic concepts data set relies on Wikipedia's category system to capture the idea of a 'theme', a subject that is discussed in its articles. Many of the categories in Wikipedia are linked to an article that describes the main topic of that category. We rely on this information to mark DBpedia entities and concepts that are 'thematic', that is, they are the center of discussion for a category.

The grammatical gender data set uses a simple heuristic to decide on a grammatical gender for instances of the class Person in DBpedia. While parsing an article in the English Wikipedia, if there is a mapping from an infobox in this article to the class dbo:Person, we record the frequency of gender-specific pronouns in their declined forms (Subject, Object, Possessive Adjective, Possessive Pronoun and Reflexive) – i.e. he, him, his, himself (masculine) and she, her, hers, herself (feminine). Grammatical genders for DBpedia entities are assigned based on the dominating gender in these pronouns.

2.7. Summary of Other Recent Developments

In this section, we summarize the improvements of the DBpedia extraction framework since the publication of the previous DBpedia overview article [5] in 2009. One of the major changes on the implementation level is that the extraction framework has been rewritten in Scala in 2010 to improve the efficiency of the extractors by an order of magnitude compared to the previous PHP based framework. The new, more modular framework also allows to extract data from tables in Wikipedia pages and supports extraction from multiple MediaWiki templates per page. Another significant change was the creation and utilization of the DBpedia Mappings Wiki as described earlier. Further significant changes include the mentioned NLP extractors and the introduction of URI schemes.

In addition, there were several smaller improvements and general maintenance. Overall, over the past four years, the parsing of the MediaWiki markup improved quite a lot, which led to better overall coverage, for example concerning references and parser functions. In addition, the collection of MediaWiki namespace identifiers for many languages is now performed semi-automatically, leading to a high accuracy of detection. This concerns common title prefixes such as User, File, Template, Help, Portal, etc. in English that indicate pages that do not contain encyclopedic content and would produce noise in the data. They are important for specific extractors as well; for instance, the category hierarchy data set (SKOS) is produced from pages of the Category namespace. Furthermore, the output of the extraction system now supports more formats, and several compliance issues regarding URIs, IRIs, N-Triples and Turtle were fixed.

The individual data extractors have been improved as well, in both number and quality, in many areas. The abstract extraction was enhanced, producing more accurate plain text representations of the beginning of Wikipedia article texts. More diverse and more specific datatypes do exist (e.g. many currencies and XSD datatypes such as xsd:gYearMonth, xsd:positiveInteger, etc.) and, for a number of classes and properties, specific datatypes were added (e.g. inhabitants/km² for the population density of populated places and m³/s for the discharge of rivers). Many issues related to data parsers were resolved and the quality of the owl:sameAs data set for multiple language versions was increased by an implementation that takes bijective relations into account.

There are also further extractors, e.g. for Wikipedia page IDs and revisions. Moreover, redirect and disambiguation extractors were introduced and improved. For the redirect data, the transitive closure is computed, while taking care of catching cycles in the links. The redirects also help regarding infobox coverage in the mapping-based extraction by resolving alternative template names. Moreover, in the past, if an infobox value pointed to a redirect, this redirection was not properly resolved and thus resulted in RDF links that led to URIs which did not contain any further information. Resolving redirects affected approximately 15% of all links and hence increased the overall interconnectivity of resources in the DBpedia ontology.

Finally, a new heuristic to increase the connectiveness of DBpedia instances was introduced. If an infobox contains a string value that is not linked to another Wikipedia article, the extraction framework searches for hyperlinks in the same Wikipedia article that have the same anchor text as the infobox value string. If such a link exists, the target of that link is used to replace the string value in the infobox. This method further increases the number of object property assertions in the DBpedia ontology.


Fig. 3. Snapshot of a part of the DBpedia ontology.

3. DBpedia Ontology

The DBpedia ontology consists of 320 classes which form a subsumption hierarchy and are described by 1650 different properties. With a maximal depth of 5, the subsumption hierarchy is intentionally kept rather shallow, which fits use cases in which the ontology is visualized or navigated. Figure 3 depicts a part of the DBpedia ontology, indicating the relations among the top ten classes of the DBpedia ontology, i.e. the classes with the highest number of instances.
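Since the ontology is published as OWL and loaded into the public endpoint, figures such as the class count can be re-checked with a simple query. The following is only a sketch, under the assumption that the ontology terms reside in the default graph of the endpoint; the counts naturally vary between releases, and an analogous query over owl:ObjectProperty and owl:DatatypeProperty yields the property count.

PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT (COUNT(DISTINCT ?class) AS ?classes)
WHERE {
  ?class a owl:Class .
  # restrict to terms in the DBpedia ontology namespace
  FILTER(STRSTARTS(STR(?class), "http://dbpedia.org/ontology/"))
}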

Fig. 4. Growth of the DBpedia ontology: number of ontology elements (classes and properties) per DBpedia version (3.2–3.8).

The DBpedia ontology is maintained and extended by the community in the DBpedia Mappings Wiki. Figure 4 depicts the growth of the DBpedia ontology over time. While the number of classes is not growing too much due to the already good coverage of the initial version of the ontology, the number of properties increases over time due to the collaboration on the DBpedia Mappings Wiki and the addition of more detailed information to infoboxes by Wikipedia editors.

3.1. Mapping Statistics

As of April 2013, there exist mapping communities for 27 languages, 23 of which are active. Figure 5 shows statistics for the coverage of these mappings in DBpedia. Figures (a) and (c) refer to the absolute number of template and property mappings that are defined for every DBpedia language edition. Figures (b) and (d) depict the percentage of the defined template and property mappings compared to the total number of available templates and properties for every Wikipedia language edition. Figures (e) and (g) show the occurrences (instances) that the defined template and property mappings have in Wikipedia. Finally, figures (f) and (h) give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.

It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage. Other language editions, such as Bulgarian, Dutch, English, Greek, Polish and Spanish, have mapped templates covering more than 50% of total template occurrences. In addition, almost all languages have covered more than 20% of property occurrences, with Bulgarian and Portuguese reaching up to 70%.


Fig. 5. Mapping coverage statistics for all mapping-enabled languages.

The mapping activity of the ontology enrichment process, along with the editing of the ten most active mapping language communities, is depicted in Figure 6. It is interesting to notice that the high mapping activity peaks coincide with the DBpedia release dates. For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarter of that year have a very high activity compared to the 4th quarter. In the last two years (2012 and 2013), most of the DBpedia mapping language communities have defined their own chapters and have their own release dates. Thus, recent mapping activity shows less fluctuation.

Finally, Figure 7 shows the English property mappings occurrence frequency. Both axes are in log scale and represent the number of property mappings (x axis) that have exactly y occurrences (y axis). The occurrence frequency follows a long tail distribution. Thus, a low number of property mappings have a high number of occurrences, and a high number of property mappings have a low number of occurrences.

3.2. Instance Data

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding language.

Fig. 6. Mapping community activity for (a) ontology and (b) 10 most active language editions.

Fig. 7. English property mappings occurrence frequency (both axes are in log scale).

For 20 of these languages, we report in this section the overall number of entities being described by the localized versions as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5); CD = Canonicalized data sets (see Section 2.5); all = Overall number of instances in the data set, including instances without infobox data; with MD = Number of instances for which mapping-based infobox data exists; Raw Properties = Number of different properties that are generated by the raw infobox extractor; Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor; Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor; Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language correspond to instances in the English version of DBpedia and which portion of the instances is described by clean mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results on the one hand from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have been so far created in the Mappings Wiki for a specific language.

Table 3 reports the number of instances for a set of selected ontology classes within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to a specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages, the second column contains the number of instances that are described only in a single language version, the next column contains the number of instances that are contained in two languages but not in three or more languages, etc. For example, 12936 persons are described in five languages, but not in six or more languages.

Lang | Inst. LD all | Inst. CD all | Inst. with MD CD | Raw Prop. CD | Map Prop. CD | Raw Statem. CD | Map Statem. CD
en | 3769926 | 3769926 | 2359521 | 48293 | 1313 | 65143840 | 33742015
de | 1243771 | 650037 | 204335 | 9593 | 261 | 7603562 | 2880381
fr | 1197334 | 740044 | 214953 | 13551 | 228 | 8854322 | 2901809
it | 882127 | 580620 | 383643 | 9716 | 181 | 12227870 | 4804731
es | 879091 | 542524 | 310348 | 14643 | 476 | 7740458 | 4383206
pl | 848298 | 538641 | 344875 | 7306 | 266 | 7696193 | 4511794
ru | 822681 | 439605 | 123011 | 13522 | 76 | 6973305 | 1389473
pt | 699446 | 460258 | 272660 | 12851 | 602 | 6255151 | 4005527
ca | 367362 | 241534 | 112934 | 8696 | 183 | 3689870 | 1301868
cs | 225133 | 148819 | 34893 | 5564 | 334 | 1857230 | 474459
hu | 209180 | 138998 | 63441 | 6821 | 295 | 2506399 | 601037
ko | 196132 | 124591 | 30962 | 7095 | 419 | 1035606 | 417605
tr | 187850 | 106644 | 40438 | 7512 | 440 | 1350679 | 556943
ar | 165722 | 103059 | 16236 | 7898 | 268 | 635058 | 168686
eu | 132877 | 108713 | 41401 | 2245 | 19 | 2255897 | 532709
sl | 129834 | 73099 | 22036 | 4235 | 470 | 1213801 | 222447
bg | 125762 | 87679 | 38825 | 3984 | 274 | 774443 | 488678
hr | 109890 | 71469 | 10343 | 3334 | 158 | 701182 | 151196
el | 71936 | 48260 | 10813 | 2866 | 288 | 206460 | 113838

Table 2. Basic statistics about Localized DBpedia Editions.

Class | en | it | pl | es | pt | fr | de | ru | ca | hu
Person | 763643 | 145060 | 70708 | 65337 | 43057 | 62942 | 33122 | 18620 | 7107 | 15529
  Athlete | 185126 | 47187 | 30332 | 19482 | 14130 | 21646 | 31237 | 0 | 721 | 4527
  Artist | 61073 | 12511 | 16120 | 25992 | 10571 | 13465 | 0 | 0 | 2004 | 3821
  Politician | 23096 | 0 | 8943 | 5513 | 3342 | 0 | 0 | 12004 | 1376 | 760
Place | 572728 | 141101 | 182727 | 132961 | 116660 | 80602 | 131766 | 67932 | 73078 | 18324
  PopulPlace | 387166 | 138077 | 167034 | 121204 | 109418 | 72252 | 79410 | 63826 | 72743 | 15535
  Building | 60514 | 1270 | 2946 | 3570 | 803 | 921 | 83 | 43 | 0 | 527
  River | 24267 | 0 | 1383 | 2 | 4149 | 3333 | 6707 | 3924 | 0 | 565
Organisation | 192832 | 4142 | 12193 | 11710 | 10949 | 17513 | 16973 | 1598 | 1399 | 3993
  Company | 44516 | 4142 | 2566 | 975 | 1903 | 5832 | 7200 | 0 | 440 | 618
  EducInst | 42270 | 0 | 599 | 1207 | 270 | 1636 | 2171 | 1010 | 0 | 115
  Band | 27061 | 0 | 2993 | 0 | 4476 | 3868 | 5368 | 0 | 263 | 802
Work | 333269 | 51918 | 32386 | 36484 | 33869 | 39195 | 18195 | 34363 | 5240 | 9777
  MusicWork | 159070 | 23652 | 14987 | 15911 | 17053 | 16206 | 0 | 6588 | 697 | 4704
  Film | 71715 | 17210 | 9467 | 9396 | 8896 | 9741 | 15038 | 12045 | 3859 | 2320
  Software | 27947 | 5682 | 3050 | 4833 | 3419 | 5733 | 2368 | 0 | 606 | 857

Table 3. Number of instances per class within 10 localized DBpedia versions.

The number 871630 for the class Person means that all 20 language versions together describe 871630 different persons. The number is higher than the number of persons described in the canonicalized English infobox data set (763643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195263 persons are described in two or more languages. The large difference of this number compared to the total number of 871630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).


Class | Instances | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10+
Person | 871630 | 676367 | 94339 | 42382 | 21647 | 12936 | 8198 | 5295 | 3437 | 2391 | 4638
Place | 643260 | 307729 | 150349 | 45836 | 36339 | 20831 | 13523 | 20808 | 31422 | 11262 | 5161
Organisation | 206670 | 160398 | 22661 | 9312 | 5002 | 3221 | 2072 | 1421 | 928 | 594 | 1061
Work | 360808 | 243706 | 54855 | 23097 | 12605 | 8277 | 5732 | 4007 | 2911 | 1995 | 3623

Table 4. Cross-language overlap: Number of instances that are described in multiple languages.

3.3. Internationalisation Community

The introduction of the mapping-based infobox extractor alongside live synchronisation approaches in [19] allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages6. The DBpedia 3.7 release7 in September 2011 was the first DBpedia release to use the localized I18n DBpedia extraction framework [23].

At the time of writing, official DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish8. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services. Recently, the DBpedia internationalisation committee9 has manifested its structure and each language edition has a representative with a vote in elections. In some cases (e.g. Greek10 and Dutch11), the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [23].

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted.

6 ar, bg, bn, ca, cs, de, el, en, es, et, eu, fr, ga, hi, hr, hu, id, it, ja, ko, nl, pl, pt, ru, sl, tr, ur

7 http://blog.dbpedia.org/2011/09/11/
8 Accessed on 25/02/2013: http://wiki.dbpedia.org/Internationalization/Chapters
9 http://wiki.dbpedia.org/Internationalization
10 http://el.dbpedia.org
11 http://nl.dbpedia.org

Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage. The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is from achieving total coverage (cf. Figure 5), and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

4. Live Synchronisation

Wikipedia articles are continuously revised at a very high rate, e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute12. This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [19,33]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported13. Meanwhile, DBpedia Live for Dutch14 was developed.

12 http://stats.wikimedia.org/EN/SummaryEN.htm

13 http://live.dbpedia.org
14 http://live.nl.dbpedia.org


4.1. DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia foundation kindly provided us access to their update stream using the OAI-PMH protocol [24]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component, serving as a proxy, constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.
– Mappings Wiki: The DBpedia Mappings Wiki described in Section 2.4 serves as secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.
– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.
– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2. Features of DBpedia Live

The core components of the DBpedia Live Extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.
– Unmodified Pages: The update of unmodified pages at regular intervals.
– Changesets Publication: The publication of triple-changesets.
– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.
– Data Isolation: Separate data from different sources.

Fig. 8. Overview of DBpedia Live extraction framework.

Mapping-Affected Pages. Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.

Unmodified Pages. Extraction framework improvements or activation/deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feed and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets. Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set for the added triples and another set for the deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.

Synchronisation Tool. The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.
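As a minimal sketch of what applying one such added/deleted changeset pair amounts to on a mirror's endpoint (the triples and values below are illustrative, not taken from an actual changeset), the deleted set is removed and the added set is inserted via SPARQL Update:

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# remove the triples listed in the "deleted" changeset file (illustrative value)
DELETE DATA { dbr:Berlin dbo:populationTotal "3499879"^^xsd:integer . } ;
# insert the triples listed in the "added" changeset file (illustrative value)
INSERT DATA { dbr:Berlin dbo:populationTotal "3502000"^^xsd:integer . }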

Data Isolation. In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and mapping-based infobox extraction.
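A client interested only in the article-derived triples can therefore scope its query to the corresponding named graph. The query below is a minimal sketch against the DBpedia Live endpoint, assuming the graph URIs listed above:

PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?p ?o
WHERE {
  # restrict the match to data extracted from the article update feed
  GRAPH <http://live.dbpedia.org> {
    dbr:Berlin ?p ?o .
  }
}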

5. Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as the external data sets that set links pointing at DBpedia resources.

5.1. Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third party data sets. The DBpedia project maintains a link repository15 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third column contain the predicate that is used for linking as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

Links in DBpedia have been used for various purposes. One example is the combination of data about European Union project funding (FTS) [29] and data about countries in DBpedia.

15 https://github.com/dbpedia/dbpedia-links

Data set | Predicate | Count | Tool
Amsterdam Museum | owl:sameAs | 627 | S
BBC Wildlife Finder | owl:sameAs | 444 | S
Book Mashup | rdf:type, owl:sameAs | 9 100 |
Bricklink | dc:publisher | 10 100 |
CORDIS | owl:sameAs | 314 | S
Dailymed | owl:sameAs | 894 | S
DBLP Bibliography | owl:sameAs | 196 | S
DBTune | owl:sameAs | 838 | S
Diseasome | owl:sameAs | 2 300 | S
Drugbank | owl:sameAs | 4 800 | S
EUNIS | owl:sameAs | 3 100 | S
Eurostat (Linked Stats) | owl:sameAs | 253 | S
Eurostat (WBSG) | owl:sameAs | 137 |
CIA World Factbook | owl:sameAs | 545 | S
flickr wrappr | dbp:hasPhotoCollection | 3 800 000 | C
Freebase | owl:sameAs | 3 600 000 | C
GADM | owl:sameAs | 1 900 |
GeoNames | owl:sameAs | 86 500 | S
GeoSpecies | owl:sameAs | 16 000 | S
GHO | owl:sameAs | 196 | L
Project Gutenberg | owl:sameAs | 2 500 | S
Italian Public Schools | owl:sameAs | 5 800 | S
LinkedGeoData | owl:sameAs | 103 600 | S
LinkedMDB | owl:sameAs | 13 800 | S
MusicBrainz | owl:sameAs | 23 000 |
New York Times | owl:sameAs | 9 700 |
OpenCyc | owl:sameAs | 27 100 | C
OpenEI (Open Energy) | owl:sameAs | 678 | S
Revyu | owl:sameAs | 6 |
Sider | owl:sameAs | 2 000 | S
TCMGeneDIT | owl:sameAs | 904 |
UMBEL | rdf:type | 896 400 |
US Census | owl:sameAs | 12 600 |
WikiCompany | owl:sameAs | 8 300 |
WordNet | dbp:wordnet_type | 467 100 |
YAGO2 | rdf:type | 18 100 000 |
Sum | | 27 211 732 |

Table 5. Data sets linked from DBpedia.

The following query compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia):

SELECT * WHERE {
  {
    SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) WHERE {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
    }
  }
  {
    SELECT ?dbpcountry ?gdpyear ?gdpnominal WHERE {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
    }
  }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

In addition to providing outgoing links on instance-level, DBpedia also sets links on schema-level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively. In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
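Such schema-level links look as follows in Turtle; the two statements below are an illustrative sketch (whether these specific links are among the 45 class and 31 property links is not stated in this report):

@prefix dbo:    <http://dbpedia.org/ontology/> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix schema: <http://schema.org/> .

dbo:Person    owl:equivalentClass    schema:Person .
dbo:birthDate owl:equivalentProperty schema:birthDate .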

5.2. Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of links pointing at DBpedia from other data sets is 39007478, according to the Data Hub16. However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [36]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary over all resources they store.

16 See http://wiki.dbpedia.org/Interlinking for details.

Data set | Link Predicate Count | Link Count
okaboo.com | 4 | 2407121
tfri.gov.tw | 57 | 837459
naplesplus.us | 2 | 212460
fu-berlin.de | 7 | 145322
freebase.com | 108 | 138619
geonames.org | 3 | 125734
opencyc.org | 3 | 19648
geospecies.org | 10 | 16188
dbrec.net | 3 | 12856
faviki.com | 5 | 12768

Table 6. Top 10 data sets in Sindice ordered by the number of links to DBpedia.

Metric | Value
Total links | 3960212
Total distinct data sets | 248
Total distinct predicates | 345

Table 7. Sindice summary statistics for incoming links to DBpedia.

With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing at DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia, along with the used link predicate and the number of links.

It should be noted that the data in Sindice is not complete; for instance, it does not contain all data sets that are catalogued by the DataHub17. However, it crawls for RDFa snippets, converts microformats, etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster (see appendix for details). Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data, and DBpedia is currently ranked second in terms of incoming links.

17 http://datahub.io


Domain | Datasets | Links
purl.org | 498 | 6717520
dbpedia.org | 248 | 3960212
creativecommons.org | 2483 | 3030910
identi.ca | 1021 | 2359276
l3s.de | 34 | 1261487
rkbexplorer.com | 24 | 1212416
nytimes.com | 27 | 1174941
w3.org | 405 | 658535
geospecies.org | 13 | 523709
livejournal.com | 14881 | 366025

Table 8. Top 10 datasets by incoming links in Sindice.

Fig. 9. The download count and download volume (in GB) of the English language edition of DBpedia (per quarter, 2012).

6. DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint and, third, it provides dereferencable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last couple of years.

6.1. Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to the download popularity as well. The top five languages with respect to the download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language is indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 18 TB per quarter or 6 TB per month is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 is depicted in Figure 10.

DBpedia | Configuration
3.3 – 3.4 | AMD Opteron 8220, 2.80GHz, 4 Cores, 32GB
3.5 – 3.7 | Intel Xeon E5520, 2.27GHz, 8 Cores, 48GB
3.8 | Intel Xeon E5-2630, 2.30GHz, 8 Cores, 64GB

Table 9. Hardware of the machines serving the public SPARQL endpoint.

In those statistics, we filtered out all IP addresses which requested a file more than 1000 times per month18. Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of links exists between two resources. Supposedly, they are used for network analysis or providing relevant links in user interfaces and downloaded more often as they are not provided via the official SPARQL endpoint.

6.2. Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint19 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-nodes cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports infinite horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection of any size. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful hardware, as shown in Table 9.

The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.
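Because of the result set cap, clients that need larger result sets have to page through them with LIMIT and OFFSET. The query below is only a minimal sketch of such paging (the class used is just an example; a stable ORDER BY is assumed so that consecutive pages do not overlap):

PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?person
WHERE { ?person a dbo:Person }
ORDER BY ?person      # deterministic ordering so consecutive pages line up
LIMIT 50000           # at most one full result page per request
OFFSET 50000          # second page; increase by 50000 for each further request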

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate limited after a burst period,

18 The IP address was only filtered for that specific file and month in those cases.

19 http://dbpedia.org/sparql


Fig. 10. The download count and download volume (in GB) of the DBpedia data sets.

2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally use too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10MB) are enforced. If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code20, like 509, and quickly drops the connection. The system further uses an iptables based firewall for permanent blocking of clients identified by their IP addresses.

6.3. Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.21. Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the last couple of DBpedia releases. The Avg/Day column represents the average number of hits (resp. visits) per day, followed by the Median and Standard Deviation. The last column shows the maximum number of hits (visits) that was recorded on a single day for each data set version.

20 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

21 http://www.webalizer.org

DBpedia | Avg/Day | Median | Stdev | Maximum
3.3 | 733811 | 711306 | 188991 | 1319539
3.4 | 1212549 | 1165893 | 351226 | 2371657
3.5 | 1122612 | 1035444 | 386454 | 2908720
3.6 | 1328355 | 1286750 | 256945 | 2495031
3.7 | 2085399 | 1930728 | 1057398 | 8873473
3.8 | 2910410 | 2717775 | 1085640 | 7678490

Table 10. Number of SPARQL endpoint hits.

Fig. 11. Average number of SPARQL endpoint requests per day.

Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30 minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30 minute interval.


DBpedia | Avg/Day | Median | Stdev | Maximum
3.3 | 9750 | 9775 | 2036 | 13126
3.4 | 11329 | 11473 | 1546 | 14198
3.5 | 16592 | 16902 | 2969 | 23129
3.6 | 19471 | 17664 | 5691 | 56064
3.7 | 23972 | 22262 | 10741 | 127869
3.8 | 16851 | 16711 | 2960 | 27416

Table 11. Number of SPARQL endpoint visits.

Fig. 12. Average SPARQL endpoint visits per day.

Figure 11 and Figure 12 show the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language specific DBpedia endpoints and DBpedia Live.

6.4. Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data Hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

Endpoint | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
data | 1355 | 2681 | 2230 | 2547 | 3714 | 4246
ontology | 80 | 168 | 142 | 178 | 116 | 97
page | 2634 | 4703 | 1810 | 1687 | 3787 | 7377
property | 231 | 311 | 137 | 176 | 176 | 128
resource | 2976 | 4080 | 2554 | 2309 | 4436 | 7143
sparql | 2090 | 4177 | 2266 | 9112 | 15902 | 15475
other | 252 | 420 | 342 | 277 | 579 | 695
total | 9619 | 16541 | 9434 | 16286 | 28710 | 35142

Table 12. Hits per service to http://dbpedia.org in thousands.

Fig. 13. Traffic: Linked Data versus SPARQL endpoint.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x to redirect the client to either the page (HTML based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data from the links at the bottom of the page based article. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology.

Figure 13 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focussed on the calls to the sparql endpoint and counted the number of statements per type. As the log files only record the full SPARQL query on a GET request, all the PUT requests are counted as unknown.


Statement | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
ask | 56 | 269 | 360 | 326 | 159 | 677
construct | 55 | 31 | 14 | 11 | 22 | 38
describe | 11 | 8 | 4 | 7 | 62 | 111
select | 1891 | 3663 | 1658 | 8030 | 11204 | 13516
unknown | 78 | 206 | 229 | 738 | 4455 | 1134
total | 2090 | 4177 | 2266 | 9112 | 15902 | 15475

Table 13. Hits per statement type in thousands.

Statement | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
distinct | 19.5 | 11.4 | 17.3 | 19.4 | 13.3 | 25.4
filter | 45.7 | 13.7 | 31.8 | 25.3 | 29.2 | 36.1
functions | 8.8 | 6.9 | 23.5 | 21.3 | 25.5 | 25.9
geo | 27.7 | 7.0 | 39.6 | 6.7 | 9.3 | 9.2
group | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2
limit | 4.6 | 6.5 | 11.6 | 10.5 | 7.7 | 7.8
optional | 30.6 | 23.8 | 47.3 | 26.3 | 16.7 | 17.2
order | 2.3 | 1.3 | 1.9 | 3.2 | 1.2 | 1.6
union | 3.3 | 3.2 | 26.4 | 11.3 | 12.1 | 20.6

Table 14. Trends in SPARQL select (rounded values in %).

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– Use of GEO objects
– GROUP BY
– LIMIT, OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and usage in property tags. Table 14 shows the use of various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.

6.5. Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 14 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 14. Number of daily requests sent to DBpedia Live for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013.

7. Use Cases and Applications

With DBpedia evolving into a hub on the Web of Data, due to its coverage of various domains, it is an eminent data set for various tasks. It can not only improve Wikipedia search, but also serve as a data source for applications and mashups, as well as for text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [30]. For that purpose, DBpedia includes a number of specialized data sets22. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [12]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

22 http://wiki.dbpedia.org/Datasets/NLP

7.1.1 Annotation / Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking tasks – refers to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [21].

As DBpedia reflects a vast amount of structured real world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [31] is an open source tool23 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base described as SPARQL queries.
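As an illustration, a Spotlight-style annotation request could look like the following sketch; the endpoint URL has changed over the years, so treat it and the parameter names as assumptions:

import requests

# Hypothetical Spotlight endpoint; verify the URL of the deployed service.
SPOTLIGHT_URL = "http://api.dbpedia-spotlight.org/en/annotate"

def annotate(text, confidence=0.4):
    """Ask DBpedia Spotlight for DBpedia resources mentioned in `text`."""
    response = requests.get(
        SPOTLIGHT_URL,
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
    )
    response.raise_for_status()
    # Each annotation carries the DBpedia URI and the matched surface form.
    return [(r["@URI"], r["@surfaceForm"])
            for r in response.json().get("Resources", [])]

print(annotate("Berlin is the capital of Germany."))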

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI24, Semantic API from Ontos25, Open Calais26 and Zemanta27, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project28.

23 http://spotlight.dbpedia.org
24 http://www.alchemyapi.com
25 http://www.ontos.com
26 http://www.opencalais.com
27 http://www.zemanta.com

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki29 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [13]. The BBC30 employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system. The Watson system won a $1 million prize in Jeopardy and relies on several data sets including DBpedia31. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series32. Several QA systems, such as TBSL [41], PowerAqua [28], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

28 http://stanbol.apache.org
29 http://www.faviki.com
30 http://bbc.co.uk
31 http://www.aaai.org/Magazine/Watson/watson.php
32 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [25].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide (see the query sketch after this list):

– Context information for bibliographic and archive records. Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.
– Stable and curated identifiers for linking. DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.
– A basis for a thesaurus for subject indexing. The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.
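A sketch of how a library application might pull such context information for an author from the public SPARQL endpoint, using the SPARQLWrapper library (the chosen resource and properties are merely illustrative):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?abstract ?birthDate WHERE {
      <http://dbpedia.org/resource/Charles_Dickens>
          dbo:abstract ?abstract ;
          dbo:birthDate ?birthDate .
      FILTER (lang(?abstract) = "en")
    }
""")
sparql.setReturnFormat(JSON)

# Enrich a bibliographic record with background information from DBpedia.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["birthDate"]["value"], row["abstract"]["value"][:80], "...")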

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority Files (VIAF) project33. Recently it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia34. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of tools and structure them in categories.

33 http://viaf.org
34 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html

Facet Based Browsers. An award-winning35 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [14]. Another facet-based browser, which allows users to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [15]. A generic SPARQL-based facet explorer which also uses a graph-based visualisation of facets is LODLive [7]. The OpenLink built-in facet-based browser36 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying. The DBpedia Query Builder37 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [16] tool provides an intuitive interface which allows users to explore the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [17] allows users to create statistical analysis queries and correlations in RDF data and DBpedia in particular.

Spatial Applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources available, written in 171 languages (of which approximately 147 can be considered active38), containing information about hundreds of spoken and even ancient languages. For example, the English Wiktionary contains nearly 3 million words39. A Wiktionary page provides, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies and, most prominently, senses. Within this tree, numerous kinds of linguistic properties are given, including synonyms, hyponyms, hyperonyms, example sentences, links to Wikipedia and many more. The fast changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, poses the biggest problem to the automated transformation into a structured knowledge base. We identified this as a serious problem: although the value of Wiktionary is known and usage scenarios are obvious, only some rudimentary tools exist, which either focus on a specific subset of the data or only cover one or two WLEs [18]. Existing tools can be seen as adapters to single WLEs: they are hard to maintain, and there are too many languages that constantly change and require a programmer to refactor complex code. Opposed to the currently employed classic and straightforward approach (implementing software adapters for scraping), Wiktionary2RDF [18] employs a declarative mediator/wrapper pattern. Figure 15 shows the work flow of the specially created Wiktionary extractor that takes as input one config XML file for each WLE. The aim is to enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves. A simple XML dialect was created to encode the "entry layout explained" (ELE) guidelines and declare triple patterns that define how the resulting RDF should be built. An example is given for the following wiki syntax:

35 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
36 http://dbpedia.org/fct/
37 http://querybuilder.dbpedia.org
38 http://s23.org/wikistats/wiktionaries_html.php

===Synonyms===
[[building]]
[[company]]

The regular expression for scraping and the triple generation rule, encoded in XML, look like:

<wikiTemplate>===Synonyms===
( [[$target]]
)+
</wikiTemplate>

<triple s="http://some.ns/$entityId" p="http://some.ns/hasSynonym" o="http://some.ns/$target" />

39 See http://en.wiktionary.org/wiki/semantic for a simple example page.

Fig. 15: Detail view of the Wiktionary extractor

This configuration is interpreted and run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data40. Statistics are shown in Table 15.

8 Related Work

8.1 Cross-Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata41. Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering the truth about things, but at providing statements about them. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011 according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

40 http://wiktionary.dbpedia.org
41 wikidata.org

24 Lehmann et al DBpedia

language   words      triples      resources    predicates   senses    XML lines
en         2142237    28593364     11804039     28           424386    930
fr         4657817    35032121     20462349     22           592351    490
ru         1080156    12813437      5994560     17           149859    1449
de          701739     5618508      2966867     16           122362    671

Table 15: Statistical comparison of extractions for different languages. XML lines measures the number of lines of the XML configuration files.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.
– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface42, as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase43 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [32] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked and the system can easily revert to a previous version.

42 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
43 http://www.freebase.com

This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only be edited indirectly by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and in areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3 YAGO

One of the projects which pursues similar goals to DBpedia is YAGO44 [39]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [20], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content, whereas YAGO focuses on extracting a smaller number of relations compared to DBpedia to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

44 http://www.mpi-inf.mpg.de/yago-naga/yago/

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects45 that are able to process wiki syntax, as well as a list of data extraction extensions46 for MediaWiki.

JWPL (Java Wikipedia Library, [46]) is an open-source, Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki Markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as an XML dump and is incorporated in the lexical resource UBY47 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [34]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguations or synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service48. Additional projects presented by the authors that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site49.

An earlier approach to improve the quality of the infobox schemata and contents is described in [44].

45 http://www.mediawiki.org/wiki/Alternative_parsers
46 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction
47 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby/
48 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
49 http://sigwp.org/en/

The presented methodology encompasses a three step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [35] the authors describe their effort building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike50 and Hudong Baike51. Apart from utilizing MediaWiki and HTML Markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [42]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories, and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

KnowItAll52 is a web scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [26] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [37] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [43] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [44] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking. [2] was an infobox extraction approach for Wikipedia, which later became the DBpedia project.

50 http://baike.baidu.com
51 http://www.hudong.com
52 http://www.cs.washington.edu/research/knowitall/

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that DBpedia has matured and improved significantly in the last years, in particular also in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [18] meanwhile becoming part of DBpedia, and LinkedGeoData [38], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMaps.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach a significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement. In the future we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [45], we took a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki53. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured but at the same time heterogeneous and non-standardized (thus making the extraction error prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other nicely. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed in certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].
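A hedged sketch of such a sanity check (the Live endpoint URL is the commonly used public address and, like the concrete check, only an illustration):

from SPARQLWrapper import SPARQLWrapper, JSON

# Find persons whose recorded death date lies before their birth date --
# a strong indicator of a typo or extraction error in Wikipedia.
sparql = SPARQLWrapper("http://live.dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?person ?birth ?death WHERE {
      ?person a dbo:Person ;
              dbo:birthDate ?birth ;
              dbo:deathDate ?death .
      FILTER (?death < ?birth)
    } LIMIT 100
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["person"]["value"], row["birth"]["value"], row["death"]["value"])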

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP. Despite recent advances (cf. Section 7), there is still a huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationship between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Paul Kreis for DBpedia development
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.
[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.
[3] C. Becker and C. Bizer. Exploring the geospatial semantic web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.
[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.
[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.
[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. LodLive, exploring the web of data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 – 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.
[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.
[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.
[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.
[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.
[12] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.
[13] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.
[14] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.
[15] P. Heim, T. Ertl, and J. Ziegler. Facet graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.
[16] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.
[17] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.
[18] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In Proc. of the Joint International Semantic Technology Conference, 2012.
[19] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia Live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.
[20] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.
[21] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.
[22] D. Kontokostas, S. Auer, S. Hellmann, J. Lehmann, P. Westphal, R. Cornelissen, and A. Zaveri. Test-driven data quality evaluation for SPARQL endpoints. In Submitted to 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.
[23] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of linked data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.
[24] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The open archives initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.
[25] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.
[26] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.
[27] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational Linguistics, pages 495–501, 2000.
[28] V. Lopez, M. Fernandez, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the semantic web. Semantic Web, 3(3):249–265, 2012.
[29] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.
[30] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP – a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.
[31] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.
[32] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.
[33] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.
[34] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.
[35] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web – Volume Part II, pages 205–220, Berlin, Heidelberg, 2011. Springer-Verlag.
[36] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, Nov. 10 2008.
[37] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.
[38] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.
[39] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: a core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.
[40] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.
[41] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648. ACM, 2012.
[42] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.
[43] F. Wu and D. Weld. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th World Wide Web Conference, 2008.
[44] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.
[45] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.
[46] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

Appendix

A Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net/>
PREFIX an: <http://vocab.sindice.net/analytics#>
PREFIX dom: <http://sindice.com/dataspace/default/domain/>
SELECT ?Dataset ?Target_Dataset ?predicate SUM(xsd:long(?cardinality)) AS ?total
FROM <http://sindice.com/analytics> {
  ?target any23:domain_uri ?Target_Dataset .

  ?edge an:source ?source ;
        an:target ?target ;
        an:label ?predicate ;
        an:cardinality ?cardinality ;
        an:publishedIn ?Dataset .

  ?source any23:domain_uri ?Dataset .

  FILTER (?Dataset != ?Target_Dataset)
}

Listing 1: SPARQL query on the Sindice summary graph for obtaining the number of incoming links



describe several use cases and applications of DBpedia in a variety of different areas. Finally, we report on related work in Section 8 and conclude in Section 9.

2 Extraction Framework

Wikipedia articles consist mostly of free text, but also comprise various types of structured information in the form of wiki markup. Such information includes infobox templates, categorisation information, images, geo-coordinates, links to external web pages, disambiguation pages, redirects between pages, and links across different language editions of Wikipedia. The DBpedia extraction framework extracts this structured information from Wikipedia and turns it into a rich knowledge base. In this section, we give an overview of the DBpedia knowledge extraction framework.

2.1 General Architecture

Figure 1 shows an overview of the technical framework. The DBpedia extraction is structured into four phases (a toy sketch of these phases follows the list):

Input: Wikipedia pages are read from an external source. Pages can either be read from a Wikipedia dump or directly fetched from a MediaWiki installation using the MediaWiki API.

Parsing: Each Wikipedia page is parsed by the wiki parser. The wiki parser transforms the source code of a Wikipedia page into an Abstract Syntax Tree.

Extraction: The Abstract Syntax Tree of each Wikipedia page is forwarded to the extractors. DBpedia offers extractors for many different purposes, for instance to extract labels, abstracts or geographical coordinates. Each extractor consumes an Abstract Syntax Tree and yields a set of RDF statements.

Output: The collected RDF statements are written to a sink. Different formats, such as N-Triples, are supported.
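The following highly simplified Python sketch mirrors these four phases; all function names and the toy infobox parsing are hypothetical (the real framework is written in Scala and builds a full abstract syntax tree):

import re

def read_pages():
    # Input: in the real framework pages come from a Wikipedia dump
    # or the MediaWiki API; here a single hard-coded page suffices.
    yield ("Berlin", "{{Infobox city|name=Berlin|population=3500000}}")

def parse(markup):
    # Parsing: this toy version only pulls out infobox key/value pairs.
    return dict(re.findall(r"\|(\w+)=([^|}]+)", markup))

def infobox_extractor(title, tree):
    # Extraction: turn attribute/value pairs into raw dbp: triples.
    for key, value in tree.items():
        yield (f"dbr:{title}", f"dbp:{key}", value)

def run():
    # Output: the real framework writes e.g. N-Triples; we just print.
    for title, markup in read_pages():
        for triple in infobox_extractor(title, parse(markup)):
            print(triple)

run()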

2.2 Extractors

The DBpedia extraction framework employs various extractors for translating different parts of Wikipedia pages to RDF statements. A list of all available extractors is shown in Table 1. DBpedia extractors can be divided into four categories:

Mapping-Based Infobox Extraction: The mapping-based infobox extraction uses manually written mappings that relate infoboxes in Wikipedia to terms in the DBpedia ontology. The mappings also specify a datatype for each infobox property and thus help the extraction framework to produce high quality data. The mapping-based extraction will be described in detail in Section 2.4.

Raw Infobox Extraction: The raw infobox extraction provides a direct mapping from infoboxes in Wikipedia to RDF. As the raw infobox extraction does not rely on explicit extraction knowledge in the form of mappings, the quality of the extracted data is lower. The raw infobox data is useful if a specific infobox has not been mapped yet and thus is not available in the mapping-based extraction.

Feature Extraction: The feature extraction uses a number of extractors that are specialized in extracting a single feature from an article, such as a label or geographic coordinates.

Statistical Extraction: Some NLP related extractors aggregate data from all Wikipedia pages in order to provide data that is based on statistical measures of page links or word counts, as further described in Section 2.6.

2.3 Raw Infobox Extraction

The type of Wikipedia content that is most valuable for the DBpedia extraction are infoboxes. Infoboxes are frequently used to list an article's most relevant facts as a table of attribute-value pairs on the top right-hand side of the Wikipedia page (for right-to-left languages on the top left-hand side). Infoboxes that appear in a Wikipedia article are based on a template that specifies a list of attributes that can form the infobox. A wide range of infobox templates are used in Wikipedia. Common examples are templates for infoboxes that describe persons, organisations or automobiles. As Wikipedia's infobox template system has evolved over time, different communities of Wikipedia editors use different templates to describe the same type of things (e.g. Infobox city japan, Infobox swiss town and Infobox town de). In addition, different templates use different names for the same attribute (e.g. birthplace and placeofbirth). As many Wikipedia editors do not strictly follow the recommendations given on the page that describes a template, attribute values are expressed using a wide range of different formats and units of measurement. An excerpt of an infobox that is based on a template for describing automobiles is shown below.


abstract – Extracts the first lines of the Wikipedia article.
  Example: dbr:Berlin dbo:abstract "Berlin is the capital city of (...)"

article categories – Extracts the categorization of the article.
  Example: dbr:Oliver_Twist dc:subject dbr:Category:English_novels

category label – Extracts labels for categories.
  Example: dbr:Category:English_novels rdfs:label "English novels"

disambiguation – Extracts disambiguation links.
  Example: dbr:Alien dbo:wikiPageDisambiguates dbr:Alien_(film)

external links – Extracts links to external web pages related to the concept.
  Example: dbr:Animal_Farm dbo:wikiPageExternalLink <http://books.google.com/?id=RBGmrDnBs8UC>

geo coordinates – Extracts geo-coordinates.
  Example: dbr:Berlin georss:point "52.5006 13.3989"

grammatical gender – Extracts grammatical genders for persons.
  Example: dbr:Abraham_Lincoln foaf:gender "male"

homepage – Extracts links to the official homepage of an instance.
  Example: dbr:Alabama foaf:homepage <http://alabama.gov/>

image – Extracts the first image of a Wikipedia page.
  Example: dbr:Berlin foaf:depiction <http://...Overview_Berlin.jpg>

infobox – Extracts all properties from all infoboxes.
  Example: dbr:Animal_Farm dbo:date "March 2010"

interlanguage – Extracts interwiki links.
  Example: dbr:Albedo dbo:wikiPageInterLanguageLink dbr:Albedo

label – Extracts the article title as label.
  Example: dbr:Berlin rdfs:label "Berlin"

lexicalizations – Extracts information about surface forms and their association with concepts (only N-Quad format).
  Example: dbr:Pine sptl:lexicalization lx:pine_tree ls:Pine_pine_tree ; lx:pine_tree rdfs:label "pine tree" ; ls:Pine_pine_tree sptl:pUriGivenSf 0.941

mappings – Extraction based on mappings of Wikipedia infoboxes to the DBpedia ontology.
  Example: dbr:Berlin dbo:country dbr:Germany

page ID – Extracts page ids of articles.
  Example: dbr:Autism dbo:wikiPageID 25

page links – Extracts all links between Wikipedia articles.
  Example: dbr:Autism dbo:wikiPageWikiLink dbr:Human_brain

persondata – Extracts information about persons represented using the PersonData template.
  Example: dbr:Andre_Agassi foaf:birthDate "1970-04-29"

PND – Extracts PND (Personennamendatei) data about a person.
  Example: dbr:William_Shakespeare dbo:individualisedPnd "118613723"

redirects – Extracts redirect links between articles in Wikipedia.
  Example: dbr:ArtificalLanguages dbo:wikiPageRedirects dbr:Constructed_language

revision ID – Extracts the revision ID of the Wikipedia article.
  Example: dbr:Autism <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/Autism?oldid=495234324>

SKOS categories – Extracts information about which concept is a category and how categories are related using the SKOS Vocabulary.
  Example: dbr:Category:World_War_II skos:broader dbr:Category:Modern_history

thematic concept – Extracts 'thematic' concepts, the centers of discussion for categories.
  Example: dbr:Category:Music skos:subject dbr:Music

topic signatures – Extracts topic signatures.
  Example: dbr:Alkane sptl:topicSignature "carbon alkanes atoms"

wiki page – Extracts links to corresponding articles in Wikipedia.
  Example: dbr:AnAmericanInParis foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/AnAmericanInParis>

Table 1: Overview of the DBpedia extractors. sptl: is used as prefix for http://spotlight.dbpedia.org/vocab/, lx: for http://spotlight.dbpedia.org/lexicalizations/ and ls: for http://spotlight.dbpedia.org/scores/.


Fig. 1: Overview of the DBpedia extraction framework

{{Infobox automobile
| name         = Ford GT40
| manufacturer = [[Ford Advanced Vehicles]]
| production   = 1964-1969
| engine       = 4181cc
(...)

In this infobox, the first line specifies the infobox type and the subsequent lines specify various attributes of the described entity.

An excerpt of the extracted data is as follows2:

dbr:Ford_GT40 [
  dbp:name "Ford GT40"@en ;
  dbp:manufacturer dbr:Ford_Advanced_Vehicles ;
  dbp:engine 4181 ;
  dbp:production 1964 ;
  (...)
]

This extraction output has weaknesses: the resource is not associated to a class in the ontology, and the engine and production data use literal values for which the semantics are not obvious. Those problems can be overcome by the mapping-based infobox extraction presented in the next subsection.

2 We use dbr for http://dbpedia.org/resource/, dbo for http://dbpedia.org/ontology/ and dbp for http://dbpedia.org/property/ as prefixes throughout the report.

2.4 Mapping-Based Infobox Extraction

In order to homogenize the description of information in the knowledge base, in 2010 a community effort was initiated to develop an ontology schema and mappings from Wikipedia infobox properties to this ontology. The alignment between Wikipedia infoboxes and the ontology is performed via community-provided mappings that help to normalize name variations in properties and classes. Heterogeneity in the Wikipedia infobox system, like using different infoboxes for the same type of entity or using different property names for the same property (cf. Section 2.3), can be alleviated in this way. This significantly increases the quality of the raw Wikipedia infobox data by typing resources, merging name variations and assigning specific datatypes to the values.

This effort is realized using the DBpedia Mappings Wiki3, a MediaWiki installation set up to enable users to collaboratively create and edit mappings. These mappings are specified using the DBpedia Mapping Language. The mapping language makes use of MediaWiki templates that define DBpedia ontology classes and properties, as well as template/table to ontology mappings. A mapping assigns a type from the DBpedia ontology to the entities that are described by the corresponding infobox. In addition, attributes in the infobox are mapped to properties in the DBpedia ontology. In the following, we show a mapping that maps infoboxes that use the Infobox automobile template to the DBpedia ontology.

3 http://mappings.dbpedia.org


{{TemplateMapping
| mapToClass = Automobile
| mappings =
  {{PropertyMapping
  | templateProperty = name
  | ontologyProperty = foaf:name }}
  {{PropertyMapping
  | templateProperty = manufacturer
  | ontologyProperty = manufacturer }}
  {{DateIntervalMapping
  | templateProperty = production
  | startDateOntologyProperty = productionStartDate
  | endDateOntologyProperty = productionEndDate }}
  {{IntermediateNodeMapping
  | nodeClass = AutomobileEngine
  | correspondingProperty = engine
  | mappings =
    {{PropertyMapping
    | templateProperty = engine
    | ontologyProperty = displacement
    | unit = Volume }}
    {{PropertyMapping
    | templateProperty = engine
    | ontologyProperty = powerOutput
    | unit = Power }}
  }}
(...)

The RDF statements that are extracted from the previous infobox example are shown below. As we can see, the production period is correctly split into a start year and an end year, and the engine is represented by a distinct RDF node.

dbr:Ford_GT40 [
  rdf:type dbo:Automobile ;
  rdfs:label "Ford GT40"@en ;
  dbo:manufacturer dbr:Ford_Advanced_Vehicles ;
  dbo:productionStartYear "1964"^^xsd:gYear ;
  dbo:productionEndYear "1969"^^xsd:gYear ;
  dbo:engine [
    rdf:type AutomobileEngine ;
    dbo:displacement 0.004181
  ]
  (...)
]

The DBpedia Mappings Wiki is not only used to map different templates within a single language edition of Wikipedia to the DBpedia ontology, but is used to map templates from all Wikipedia language editions to the shared DBpedia ontology. Figure 2 shows how the infobox properties author and συγγραφέας – author in Greek – are both being mapped to the global identifier dbo:author. That means, in turn, that information from all language versions of DBpedia can be merged, and DBpedias for smaller languages can be augmented with knowledge from larger DBpedias such as the English edition. Conversely, the larger DBpedia editions can benefit from more specialized knowledge from localized editions, such as data about smaller towns, which is often only present in the corresponding language edition [40].

Besides hosting the mappings and the DBpedia ontology definition, the DBpedia Mappings Wiki offers various tools which support users in their work:

– Mapping Validator: When editing a mapping, the mapping can be directly validated by a button on the edit page. This validates changes before saving them for syntactic correctness and highlights inconsistencies such as missing property definitions.
– Extraction Tester: The extraction tester linked on each mapping page tests a mapping against a set of example Wikipedia pages. This gives direct feedback about whether a mapping works and how the resulting data is structured.
– Mapping Tool: The DBpedia Mapping Tool is a graphical user interface that supports users to create and edit mappings.

2.5 URI Schemes

For every Wikipedia article, the framework introduces a number of URIs to represent the concepts described on a particular page. Up to 2011, DBpedia published URIs only under the http://dbpedia.org domain. The main namespaces were:

– http://dbpedia.org/resource/ (prefix dbr) for representing article data. Apart from a few corner cases, there is a one-to-one mapping between a Wikipedia page and a DBpedia resource based on the article title. For example, for the Wikipedia article on Berlin4, DBpedia will produce the URI dbr:Berlin.
– http://dbpedia.org/property/ (prefix dbp) for representing properties extracted from the raw infobox extraction (cf. Section 2.3), e.g. dbp:population.
– http://dbpedia.org/ontology/ (prefix dbo) for representing the DBpedia ontology (cf. Section 2.4), e.g. dbo:populationTotal.

4 http://en.wikipedia.org/wiki/Berlin


Fig. 2: Depiction of the mapping from the Greek and English Wikipedia templates about books to the same DBpedia ontology class [23]

Although data from other Wikipedia language editions were extracted, they used the same namespaces. This was achieved by exploiting the Wikipedia inter-language links5. For every page in a language other than English, the page was extracted only if the page contained an inter-language link to an English page. In that case, using the English link, the data was extracted under the English resource name (i.e. dbr:Berlin).

Recent DBpedia internationalisation developments showed that this approach resulted in less and redundant data [23]. Thus, starting from the DBpedia 3.7 release, two types of data sets were generated. The localized data sets contain all things that are described in a specific language. Within the datasets, things are identified with language-specific URIs such as http://<lang>.dbpedia.org/resource/ for article data and http://<lang>.dbpedia.org/property/ for property data. In addition, we produce a canonicalized data set for each language. The canonicalized data sets only contain things for which a corresponding page in the English edition of Wikipedia exists. Within all canonicalized data sets, the same thing is identified with the same URI from the generic language-agnostic namespace http://dbpedia.org/resource/.

5 http://en.wikipedia.org/wiki/Help:Interlanguage_links

2.6 NLP Extraction

DBpedia provides a number of data sets which have been created to support Natural Language Processing (NLP) tasks [30]. Currently, four datasets are extracted: topic signatures, grammatical gender, lexicalizations and thematic concept. While the topic signatures and the grammatical gender extractors primarily extract data from the article text, the lexicalizations and thematic concept extractors make use of the wiki markup.

DBpedia entities can be referred to using many different names and abbreviations. The Lexicalization data set provides access to alternative names for entities and concepts, associated with several scores estimating the association strength between name and URI. These scores distinguish more common names for specific entities from rarely used ones, and also show how ambiguous a name is with respect to all possible concepts that it can mean.

The topic signatures data set enables the description of DBpedia resources based on unstructured information, as compared to the structured factual data provided by the mapping-based and raw extractors. We build a Vector Space Model (VSM) where each DBpedia resource is a point in a multidimensional space of words. Each DBpedia resource is represented by a vector, and each word occurring in Wikipedia is a dimension of this vector. Word scores are computed using the tf-idf weight with the intention to measure how strong the association between a word and a DBpedia resource is. Note that word stems are used in this context in order to generalize over inflected words. We use the computed weights to select the strongest related word stems for each entity and build topic signatures [27].
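A minimal, self-contained sketch of this weighting scheme (plain tf-idf over a toy corpus; stemming and the full Wikipedia article texts are omitted):

import math
from collections import Counter

# Toy corpus: each DBpedia resource is represented by the words of its article.
docs = {
    "dbr:Alkane": "carbon alkanes atoms carbon hydrogen".split(),
    "dbr:Berlin": "city capital germany population".split(),
    "dbr:Alkene": "carbon atoms double bond".split(),
}

def topic_signature(resource, top_n=3):
    """Return the top_n words with the highest tf-idf weight for a resource."""
    tf = Counter(docs[resource])
    n_docs = len(docs)
    def idf(word):
        df = sum(1 for words in docs.values() if word in words)
        return math.log(n_docs / df)
    weights = {w: tf[w] * idf(w) for w in tf}
    return sorted(weights, key=weights.get, reverse=True)[:top_n]

print(topic_signature("dbr:Alkane"))   # e.g. ['alkanes', 'hydrogen', 'carbon']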

There are two more Feature Extractors related to Natural Language Processing. The thematic concepts data set relies on Wikipedia's category system to capture the idea of a 'theme', a subject that is discussed in its articles. Many of the categories in Wikipedia are linked to an article that describes the main topic of that category. We rely on this information to mark DBpedia entities and concepts that are 'thematic', that is, they are the center of discussion for a category.

The grammatical gender data set uses a simple heuristic to decide on a grammatical gender for instances of the class Person in DBpedia. While parsing an article in the English Wikipedia, if there is a mapping from an infobox in this article to the class dbo:Person, we record the frequency of gender-specific pronouns in their declined forms (Subject, Object, Possessive Adjective, Possessive Pronoun and Reflexive), i.e. he, him, his, himself (masculine) and she, her, hers, herself (feminine). Grammatical genders for DBpedia entities are assigned based on the dominating gender in these pronouns.
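The heuristic can be sketched as follows (toy article text; the pronoun lists follow the description above):

import re

MASCULINE = {"he", "him", "his", "himself"}
FEMININE  = {"she", "her", "hers", "herself"}

def grammatical_gender(article_text):
    """Assign a gender based on the dominating declined pronoun forms."""
    tokens = re.findall(r"[a-z]+", article_text.lower())
    masc = sum(t in MASCULINE for t in tokens)
    fem  = sum(t in FEMININE for t in tokens)
    if masc == fem:
        return None        # no dominating gender, leave unspecified
    return "male" if masc > fem else "female"

print(grammatical_gender("Abraham Lincoln ... he led ... his speech ... himself"))
# -> 'male'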

2.7 Summary of Other Recent Developments

In this section we summarize the improvements of the DBpedia extraction framework since the publication of the previous DBpedia overview article [5] in 2009. One of the major changes on the implementation level is that the extraction framework was rewritten in Scala in 2010 to improve the efficiency of the extractors by an order of magnitude compared to the previous PHP based framework. The new, more modular framework also allows extraction of data from tables in Wikipedia pages and supports extraction from multiple MediaWiki templates per page. Another significant change was the creation and utilization of the DBpedia Mappings Wiki as described earlier. Further significant changes include the mentioned NLP extractors and the introduction of URI schemes.

In addition, there were several smaller improvements and general maintenance. Overall, over the past four years the parsing of the MediaWiki markup improved quite a lot, which led to better overall coverage, for example concerning references and parser functions. In addition, the collection of MediaWiki namespace identifiers for many languages is now performed semi-automatically, leading to a high accuracy of detection. This concerns common title prefixes such as User, File, Template, Help, Portal etc. in English that indicate pages that do not contain encyclopedic content and would produce noise in the data. They are important for specific extractors as well; for instance, the category hierarchy data set (SKOS) is produced from pages of the Category namespace. Furthermore, the output of the extraction system now supports more formats, and several compliance issues regarding URIs, IRIs, N-Triples and Turtle were fixed.

The individual data extractors have been improved as well, in both number and quality in many areas. The abstract extraction was enhanced, producing more accurate plain text representations of the beginning of Wikipedia article texts. More diverse and more specific datatypes do exist (e.g. many currencies and XSD datatypes such as xsd:gYearMonth, xsd:positiveInteger, etc.) and for a number of classes and properties specific datatypes were added (e.g. inhabitants/km² for the population density of populated places and m³/s for the discharge of rivers). Many issues related to data parsers were resolved and the quality of the owl:sameAs data set for multiple language versions was increased by an implementation that takes bijective relations into account.

There are also further extractors, e.g. for Wikipedia page IDs and revisions. Moreover, redirect and disambiguation extractors were introduced and improved. For the redirect data, the transitive closure is computed, while taking care of catching cycles in the links. The redirects also help regarding infobox coverage in the mapping-based extraction by resolving alternative template names. Moreover, in the past, if an infobox value pointed to a redirect, this redirection was not properly resolved and thus resulted in RDF links that led to URIs which did not contain any further information. Resolving redirects affected approximately 15% of all links and hence increased the overall interconnectivity of resources in the DBpedia ontology.

Finally, a new heuristic to increase the connectivity of DBpedia instances was introduced. If an infobox contains a string value that is not linked to another Wikipedia article, the extraction framework searches for hyperlinks in the same Wikipedia article that have the same anchor text as the infobox value string. If such a link exists, the target of that link is used to replace the string value in the infobox. This method further increases the number of object property assertions in the DBpedia ontology.
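A minimal sketch of this anchor-text heuristic, assuming the page's hyperlinks have already been collected into an anchor-text-to-target map (all names are illustrative):

def link_infobox_strings(infobox_values, page_links):
    """infobox_values: dict property -> raw string value from an infobox.
    page_links: dict anchor text -> target article, harvested from the
    hyperlinks of the same Wikipedia page.
    Replaces a string value by the link target whenever a hyperlink with
    identical anchor text exists on the page."""
    resolved = {}
    for prop, value in infobox_values.items():
        target = page_links.get(value.strip())
        resolved[prop] = target if target is not None else value
    return resolved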


Fig. 3. Snapshot of a part of the DBpedia ontology, showing the classes owl:Thing, dbo:Agent, dbo:Person, dbo:Athlete, dbo:Place, dbo:PopulatedPlace, dbo:Settlement, dbo:Species, dbo:Eukaryote, dbo:Organisation and dbo:Work connected via rdfs:subClassOf, together with selected properties (e.g. dbo:birthPlace, dbo:birthDate, dbo:deathDate, dbo:populationTotal, dbo:areaTotal, dbo:elevation, dbo:runtime, dbo:releaseDate, dbo:conservationStatus) and their datatypes.

3 DBpedia Ontology

The DBpedia ontology consists of 320 classes which form a subsumption hierarchy and are described by 1,650 different properties. With a maximal depth of 5, the subsumption hierarchy is intentionally kept rather shallow, which fits use cases in which the ontology is visualized or navigated. Figure 3 depicts a part of the DBpedia ontology, indicating the relations among the top ten classes of the DBpedia ontology, i.e. the classes with the highest number of instances.

Fig. 4. Growth of the DBpedia ontology: number of ontology elements (classes and properties) per DBpedia version 3.2 to 3.8.

The DBpedia ontology is maintained and extended by the community in the DBpedia Mappings Wiki. Figure 4 depicts the growth of the DBpedia ontology over time. While the number of classes is not growing too much, due to the already good coverage of the initial version of the ontology, the number of properties increases over time due to the collaboration on the DBpedia Mappings Wiki and the addition of more detailed information to infoboxes by Wikipedia editors.

3.1 Mapping Statistics

As of April 2013, there exist mapping communities for 27 languages, 23 of which are active. Figure 5 shows statistics for the coverage of these mappings in DBpedia. Figures (a) and (c) refer to the absolute number of template and property mappings that are defined for every DBpedia language edition. Figures (b) and (d) depict the percentage of the defined template and property mappings compared to the total number of available templates and properties for every Wikipedia language edition. Figures (e) and (g) show the occurrences (instances) that the defined template and property mappings have in Wikipedia. Finally, figures (f) and (h) give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.

It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage. Other language editions, such as Bulgarian, Dutch, English, Greek, Polish and Spanish, have mapped templates covering more than 50% of total template occurrences. In addition, almost all languages have covered more than 20% of property occurrences, with Bulgarian and Portuguese reaching up to 70%.


Fig 5 Mapping coverage statistics for all mapping-enabled languages

The mapping activity of the ontology enrichment process, along with the editing of the ten most active mapping language communities, is depicted in Figure 6. It is interesting to notice that the high mapping activity peaks coincide with the DBpedia release dates. For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarter of that year show a very high activity compared to the 4th quarter. In the last two years (2012 and 2013) most of the DBpedia mapping language communities have defined their own chapters and have their own release dates. Thus, recent mapping activity shows less fluctuation.

Finally, Figure 7 shows the English property mappings occurrence frequency. Both axes are in log scale and represent the number of property mappings (x axis) that have exactly y occurrences (y axis). The occurrence frequency follows a long tail distribution. Thus, a low number of property mappings have a high number of occurrences, and a high number of property mappings have a low number of occurrences.

3.2 Instance Data

Fig. 6. Mapping community activity for (a) ontology and (b) 10 most active language editions.

Fig. 7. English property mappings occurrence frequency (both axes are in log scale).

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding language. For 20 of these languages, we report in this section the overall number of entities being described by the localized versions as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5); CD = Canonicalized data sets (see Section 2.5); all = Overall number of instances in the data set, including instances without infobox data; with MD = Number of instances for which mapping-based infobox data exists; Raw Properties = Number of different properties that are generated by the raw infobox extractor; Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor; Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor; Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language correspond to instances in the English version of DBpedia and which portion of the instances is described by clean mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results on the one hand from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have been created so far in the Mapping Wiki for a specific language.

Table 3 reports the number of instances for a set of selected ontology classes within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to a specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages; the second column contains the number of instances that are described only in a single language version; the next column contains the number of instances that are contained in two languages but not in three or more languages, etc.


      Inst. LD all   Inst. CD all   Inst. with MD CD   Raw Prop. CD   Map. Prop. CD   Raw Statem. CD   Map. Statem. CD
en    3769926        3769926        2359521            48293          1313            65143840         33742015
de    1243771        650037         204335             9593           261             7603562          2880381
fr    1197334        740044         214953             13551          228             8854322          2901809
it    882127         580620         383643             9716           181             12227870         4804731
es    879091         542524         310348             14643          476             7740458          4383206
pl    848298         538641         344875             7306           266             7696193          4511794
ru    822681         439605         123011             13522          76              6973305          1389473
pt    699446         460258         272660             12851          602             6255151          4005527
ca    367362         241534         112934             8696           183             3689870          1301868
cs    225133         148819         34893              5564           334             1857230          474459
hu    209180         138998         63441              6821           295             2506399          601037
ko    196132         124591         30962              7095           419             1035606          417605
tr    187850         106644         40438              7512           440             1350679          556943
ar    165722         103059         16236              7898           268             635058           168686
eu    132877         108713         41401              2245           19              2255897          532709
sl    129834         73099          22036              4235           470             1213801          222447
bg    125762         87679          38825              3984           274             774443           488678
hr    109890         71469          10343              3334           158             701182           151196
el    71936          48260          10813              2866           288             206460           113838

Table 2. Basic statistics about Localized DBpedia Editions.

                 en       it       pl       es       pt       fr       de       ru      ca      hu
Person           763643   145060   70708    65337    43057    62942    33122    18620   7107    15529
  Athlete        185126   47187    30332    19482    14130    21646    31237    0       721     4527
  Artist         61073    12511    16120    25992    10571    13465    0        0       2004    3821
  Politician     23096    0        8943     5513     3342     0        0        12004   1376    760
Place            572728   141101   182727   132961   116660   80602    131766   67932   73078   18324
  PopulPlace     387166   138077   167034   121204   109418   72252    79410    63826   72743   15535
  Building       60514    1270     2946     3570     803      921      83       43      0       527
  River          24267    0        1383     2        4149     3333     6707     3924    0       565
Organisation     192832   4142     12193    11710    10949    17513    16973    1598    1399    3993
  Company        44516    4142     2566     975      1903     5832     7200     0       440     618
  EducInst       42270    0        599      1207     270      1636     2171     1010    0       115
  Band           27061    0        2993     0        4476     3868     5368     0       263     802
Work             333269   51918    32386    36484    33869    39195    18195    34363   5240    9777
  MusicWork      159070   23652    14987    15911    17053    16206    0        6588    697     4704
  Film           71715    17210    9467     9396     8896     9741     15038    12045   3859    2320
  Software       27947    5682     3050     4833     3419     5733     2368     0       606     857

Table 3. Number of instances per class within 10 localized DBpedia versions.

For example, 12936 persons are described in five languages but not in six or more languages. The number 871630 for the class Person means that all 20 language versions together describe 871630 different persons. The number is higher than the number of persons described in the canonicalized English infobox data set (763643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195263 persons are described in two or more languages. The large difference of this number compared to the total number of 871630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).


Class          Instances   1        2        3       4       5       6       7       8       9       10+
Person         871630      676367   94339    42382   21647   12936   8198    5295    3437    2391    4638
Place          643260      307729   150349   45836   36339   20831   13523   20808   31422   11262   5161
Organisation   206670      160398   22661    9312    5002    3221    2072    1421    928     594     1061
Work           360808      243706   54855    23097   12605   8277    5732    4007    2911    1995    3623

Table 4. Cross-language overlap: Number of instances that are described in multiple languages.

3.3 Internationalisation Community

The introduction of the mapping-based infobox extractor alongside live synchronisation approaches in [19] allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages.6 The DBpedia 3.7 release7 in September 2011 was the first DBpedia release to use the localized I18n DBpedia extraction framework [23].

At the time of writing, official DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish.8 Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services. Recently, the DBpedia internationalisation committee9 has manifested its structure and each language edition has a representative with a vote in elections. In some cases (e.g. Greek10 and Dutch11) the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [23].

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage. The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is to achieving total coverage (cf. Figure 5), and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

6 ar, bg, bn, ca, cs, de, el, en, es, et, eu, fr, ga, hi, hr, hu, id, it, ja, ko, nl, pl, pt, ru, sl, tr, ur
7 http://blog.dbpedia.org/2011/09/11/
8 Accessed on 25/02/2013: http://wiki.dbpedia.org/Internationalization/Chapters
9 http://wiki.dbpedia.org/Internationalization
10 http://el.dbpedia.org
11 http://nl.dbpedia.org

4 Live Synchronisation

Wikipedia articles are continuously revised at a very high rate, e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute.12 This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [19,33]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported.13 Meanwhile, DBpedia Live for Dutch14 was developed.

12 http://stats.wikimedia.org/EN/SummaryEN.htm
13 http://live.dbpedia.org
14 http://live.nl.dbpedia.org


4.1 DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia Foundation kindly provided us access to their update stream using the OAI-PMH protocol [24]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component serving as a proxy constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.
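A simplified sketch of such an update feed consumer is shown below; it polls an OAI-PMH repository with the standard ListRecords verb and follows resumption tokens. The endpoint URL is a placeholder, and the real Java proxy additionally buffers and deduplicates updates:

import requests
import xml.etree.ElementTree as ET

OAI = "https://example.org/wiki/Special:OaiRepository"  # placeholder endpoint
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def poll_updates(last_seen, prefix="mediawiki"):
    """Yield identifiers of pages changed after `last_seen` (UTC timestamp)
    by harvesting the OAI-PMH ListRecords responses page by page."""
    params = {"verb": "ListRecords", "metadataPrefix": prefix, "from": last_seen}
    while True:
        root = ET.fromstring(requests.get(OAI, params=params).content)
        for rec in root.iterfind(".//oai:record", NS):
            yield rec.find(".//oai:identifier", NS).text
        token = root.find(".//oai:resumptionToken", NS)
        if token is None or not (token.text or "").strip():
            break  # harvest finished; poll again later with a newer timestamp
        params = {"verb": "ListRecords", "resumptionToken": token.text}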

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.

– Mappings Wiki: The DBpedia Mappings Wiki described in Section 2.4 serves as secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.

– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.

– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2 Features of DBpedia Live

Fig. 8. Overview of DBpedia Live extraction framework.

The core components of the DBpedia Live Extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.

– Unmodified Pages: The update of unmodified pages at regular intervals.

– Changesets Publication: The publication of triple-changesets.

– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.

– Data Isolation: Separate data from different sources.

Mapping-Affected Pages. Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.

Unmodified Pages. Extraction framework improvements or activation/deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feed and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets. Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set for the added triples and another set for the deleted triples. We write those two sets into N-Triples files, compress them and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.

Synchronisation Tool. The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.
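A minimal sketch of what a mirror has to do per changeset is shown below, assuming one gzipped N-Triples file of removed and one of added triples and a SPARQL Update endpoint of the local store (the URL is a placeholder; the real tool additionally tracks changeset sequence numbers, handles authentication and retries):

import gzip
import requests

UPDATE_ENDPOINT = "http://localhost:8890/sparql"  # local mirror, placeholder

def apply_changeset(removed_path, added_path):
    """Apply one published changeset (gzipped N-Triples files of removed
    and added triples) to a local mirror via SPARQL Update."""
    with gzip.open(removed_path, "rt", encoding="utf-8") as f:
        removed = f.read()
    with gzip.open(added_path, "rt", encoding="utf-8") as f:
        added = f.read()
    for update in (f"DELETE DATA {{ {removed} }}", f"INSERT DATA {{ {added} }}"):
        r = requests.post(UPDATE_ENDPOINT, data={"update": update})
        r.raise_for_status()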

Data Isolation. In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and mapping-based infobox extraction.
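The graph separation can be observed with an ordinary SPARQL query; the sketch below asks in which graphs the English label of a resource resides (endpoint and resource are examples):

import requests

ENDPOINT = "http://live.dbpedia.org/sparql"

QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?g ?label WHERE {
  GRAPH ?g { <http://dbpedia.org/resource/Berlin> rdfs:label ?label }
  FILTER (lang(?label) = "en")
}
"""

def labels_per_graph():
    """Return (graph URI, label) pairs, showing which source graph the data comes from."""
    r = requests.get(ENDPOINT, params={"query": QUERY,
                                       "format": "application/sparql-results+json"})
    r.raise_for_status()
    return [(b["g"]["value"], b["label"]["value"])
            for b in r.json()["results"]["bindings"]]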

5 Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as the external data sets that set links pointing at DBpedia resources.

5.1 Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third party data sets. The DBpedia project maintains a link repository15 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third column contain the predicate that is used for linking as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

15 https://github.com/dbpedia/dbpedia-links

Data set                   Predicate                  Count        Tool
Amsterdam Museum           owl:sameAs                 627          S
BBC Wildlife Finder        owl:sameAs                 444          S
Book Mashup                rdf:type, owl:sameAs       9 100
Bricklink                  dc:publisher               10 100
CORDIS                     owl:sameAs                 314          S
Dailymed                   owl:sameAs                 894          S
DBLP Bibliography          owl:sameAs                 196          S
DBTune                     owl:sameAs                 838          S
Diseasome                  owl:sameAs                 2 300        S
Drugbank                   owl:sameAs                 4 800        S
EUNIS                      owl:sameAs                 3 100        S
Eurostat (Linked Stats)    owl:sameAs                 253          S
Eurostat (WBSG)            owl:sameAs                 137
CIA World Factbook         owl:sameAs                 545          S
flickr wrappr              dbp:hasPhotoCollection     3 800 000    C
Freebase                   owl:sameAs                 3 600 000    C
GADM                       owl:sameAs                 1 900
GeoNames                   owl:sameAs                 86 500       S
GeoSpecies                 owl:sameAs                 16 000       S
GHO                        owl:sameAs                 196          L
Project Gutenberg          owl:sameAs                 2 500        S
Italian Public Schools     owl:sameAs                 5 800        S
LinkedGeoData              owl:sameAs                 103 600      S
LinkedMDB                  owl:sameAs                 13 800       S
MusicBrainz                owl:sameAs                 23 000
New York Times             owl:sameAs                 9 700
OpenCyc                    owl:sameAs                 27 100       C
OpenEI (Open Energy)       owl:sameAs                 678          S
Revyu                      owl:sameAs                 6
Sider                      owl:sameAs                 2 000        S
TCMGeneDIT                 owl:sameAs                 904
UMBEL                      rdf:type                   896 400
US Census                  owl:sameAs                 12 600
WikiCompany                owl:sameAs                 8 300
WordNet                    dbp:wordnet_type           467 100
YAGO2                      rdf:type                   18 100 000

Sum                                                   27 211 732

Table 5. Data sets linked from DBpedia.

Links in DBpedia have been used for various purposes. One example is the combination of data about European Union project funding (FTS) [29] and data about countries in DBpedia. The following query compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia):

SELECT * {
  { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
  } }
  { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
  } }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

In addition to providing outgoing links on instance-level, DBpedia also sets links on schema-level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively. In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of links pointing at DBpedia from other data sets is 39007478 according to the Data Hub.16 However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [36]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered.

16 See http://wiki.dbpedia.org/Interlinking for details
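Under this definition, incoming links can be counted with a few lines of code; the sketch below is purely illustrative and assumes the triples are already available as (subject, predicate, object) URI strings:

from collections import Counter
from urllib.parse import urlparse

def second_level_domain(uri):
    host = urlparse(uri).hostname or ""
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host

def count_incoming_links(triples, target="dbpedia.org"):
    """Count links to `target`: a triple is a link if subject and object
    belong to different data sets (second-level domains)."""
    per_dataset = Counter()
    for s, p, o in triples:
        s_ds, o_ds = second_level_domain(s), second_level_domain(o)
        if o_ds == target and s_ds != o_ds:
            per_dataset[s_ds] += 1
    return per_dataset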

Data set          Link Predicate Count   Link Count
okaboo.com        4                      2407121
tfri.gov.tw       57                     837459
naplesplus.us     2                      212460
fu-berlin.de      7                      145322
freebase.com      108                    138619
geonames.org      3                      125734
opencyc.org       3                      19648
geospecies.org    10                     16188
dbrec.net         3                      12856
faviki.com        5                      12768

Table 6. Top 10 data sets in Sindice ordered by the number of links to DBpedia.

Metric                      Value
Total links                 3960212
Total distinct data sets    248
Total distinct predicates   345

Table 7. Sindice summary statistics for incoming links to DBpedia.

Sindice computes a graph summary over all resources they store. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70 of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing at DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia, along with the used link predicate and the number of links.

It should be noted that the data in Sindice is not complete, for instance it does not contain all data sets that are catalogued by the DataHub.17 However, it crawls for RDFa snippets, converts microformats etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster (see appendix for details). Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data, and DBpedia is currently ranked second in terms of incoming links.

17 http://datahub.io


domain                 datasets   links
purl.org               498        6717520
dbpedia.org            248        3960212
creativecommons.org    2483       3030910
identi.ca              1021       2359276
l3s.de                 34         1261487
rkbexplorer.com        24         1212416
nytimes.com            27         1174941
w3.org                 405        658535
geospecies.org         13         523709
livejournal.com        14881      366025

Table 8. Top 10 datasets by incoming links in Sindice.

Fig. 9. The download count (in thousands) and download volume of the English language edition of DBpedia per quarter of 2012.

6 DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint and, third, it provides dereferencable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last couple of years.

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to the download popularity as well. The top five languages with respect to the download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language edition is indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 18 TB per quarter or 6 TB per month is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 is depicted in Figure 10. In those statistics we filtered out all IP addresses which requested a file more than 1000 times per month.18 Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of links exists between two resources. Supposedly, they are used for network analysis or providing relevant links in user interfaces and downloaded more often as they are not provided via the official SPARQL endpoint.

DBpedia     Configuration
3.3 - 3.4   AMD Opteron 8220, 2.80GHz, 4 Cores, 32GB
3.5 - 3.7   Intel Xeon E5520, 2.27GHz, 8 Cores, 48GB
3.8         Intel Xeon E5-2630, 2.30GHz, 8 Cores, 64GB

Table 9. Hardware of the machines serving the public SPARQL endpoint.

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint19 is hosted using the Virtuoso Universal Server (Enterprise Edition), version 6.4, in a 4-nodes cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports infinite horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection of any size. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful hardware, as shown in Table 9.

The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.
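Clients that need complete result sets therefore typically page through them; a minimal sketch is given below (the page size and query are examples, and a deterministic ORDER BY should be part of the query body for stable paging):

import requests

ENDPOINT = "http://dbpedia.org/sparql"
PAGE = 10000  # stay well below the 50,000-row result cap

def fetch_all(query_body):
    """Page through a result set with LIMIT/OFFSET so that no single
    request exceeds the endpoint's row limit."""
    offset, rows = 0, []
    while True:
        q = f"{query_body} LIMIT {PAGE} OFFSET {offset}"
        r = requests.get(ENDPOINT, params={"query": q,
                                           "format": "application/sparql-results+json"})
        r.raise_for_status()
        batch = r.json()["results"]["bindings"]
        rows.extend(batch)
        if len(batch) < PAGE:
            return rows
        offset += PAGE

# Example:
# fetch_all("SELECT ?p WHERE { ?p a <http://dbpedia.org/ontology/Person> } ORDER BY ?p")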

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate limited after a burst period,

18 The IP address was only filtered for that specific file and month in those cases.
19 http://dbpedia.org/sparql


Fig. 10. The download count (in thousands) and download volume of the DBpedia data sets.

2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally use too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10MB) are enforced. If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code20 like 509 and quickly drops the connection. The system further uses an iptables based firewall for permanent blocking of clients identified by their IP addresses.

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.21.21 Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the last couple of DBpedia releases. The Avg/Day column represents the average number of hits (resp. visits) per day, followed by the Median and Standard Deviation.

20 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
21 http://www.webalizer.org

DBpedia   Avg/Day   Median    Stdev     Maximum
3.3       733811    711306    188991    1319539
3.4       1212549   1165893   351226    2371657
3.5       1122612   1035444   386454    2908720
3.6       1328355   1286750   256945    2495031
3.7       2085399   1930728   1057398   8873473
3.8       2910410   2717775   1085640   7678490

Table 10. Number of SPARQL endpoint hits.

Fig 11 Average number of SPARQL endpoint requests per day

The last column shows the maximum number of hits (visits) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30 minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30 minute interval.
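The visit counting can be sketched as follows, assuming log entries reduced to (IP, Unix timestamp) pairs sorted by time; names are illustrative:

from collections import defaultdict

WINDOW = 30 * 60  # 30 minute floating window, in seconds

def count_visits(hits):
    """hits: iterable of (ip, unix_timestamp) pairs, sorted by timestamp.
    Subsequent requests from the same IP are grouped into one visit while
    the gap between them stays within the 30 minute window."""
    last_seen = {}
    visits = defaultdict(int)
    for ip, ts in hits:
        if ip not in last_seen or ts - last_seen[ip] > WINDOW:
            visits[ip] += 1  # gap too large (or first request): new visit
        last_seen[ip] = ts
    return sum(visits.values())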


DBpedia   Avg/Day   Median   Stdev   Maximum
3.3       9750      9775     2036    13126
3.4       11329     11473    1546    14198
3.5       16592     16902    2969    23129
3.6       19471     17664    5691    56064
3.7       23972     22262    10741   127869
3.8       16851     16711    2960    27416

Table 11. Number of SPARQL endpoint visits.

Fig 12 Average SPARQL endpoint visits per day

Figure 11 and Figure 12 show the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data Hub, returning resources in a number of different formats. For each data set we randomly selected 14 days worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

Endpoint   3.3    3.4     3.5    3.6     3.7     3.8
data       1355   2681    2230   2547    3714    4246
ontology   80     168     142    178     116     97
page       2634   4703    1810   1687    3787    7377
property   231    311     137    176     176     128
resource   2976   4080    2554   2309    4436    7143
sparql     2090   4177    2266   9112    15902   15475
other      252    420     342    277     579     695
total      9619   16541   9434   16286   28710   35142

Table 12. Hits per service to http://dbpedia.org in thousands.

Fig 13 Traffic Linked Data versus SPARQL endpoint

The resource endpoint uses the Accept line in the HTTP header sent by the client to return a HTTP status code 30x to redirect the client to either the page (HTML based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data from the links at the bottom of the page based article. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology.

Figure 13 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focussed on the calls to the sparql endpoint and counted the number of statements per type. As the log files only record the full SPARQL query on a GET request, all the PUT requests are counted as unknown.


Statement   3.3    3.4    3.5    3.6    3.7     3.8
ask         56     269    360    326    159     677
construct   55     31     14     11     22      38
describe    11     8      4      7      62      111
select      1891   3663   1658   8030   11204   13516
unknown     78     206    229    738    4455    1134
total       2090   4177   2266   9112   15902   15475

Table 13. Hits per statement type in thousands.

Statement   3.3    3.4    3.5    3.6    3.7    3.8
distinct    19.5   11.4   17.3   19.4   13.3   25.4
filter      45.7   13.7   31.8   25.3   29.2   36.1
functions   8.8    6.9    23.5   21.3   25.5   25.9
geo         27.7   7.0    39.6   6.7    9.3    9.2
group       0.0    0.0    0.0    0.0    0.0    0.2
limit       4.6    6.5    11.6   10.5   7.7    7.8
optional    30.6   23.8   47.3   26.3   16.7   17.2
order       2.3    1.3    1.9    3.2    1.2    1.6
union       3.3    3.2    26.4   11.3   12.1   20.6

Table 14. Trends in SPARQL select (rounded values in %).

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– Use of GEO objects
– GROUP BY
– LIMIT, OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and usage in property tags. Table 14 shows the use of various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
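A simplified version of this keyword counting is sketched below; the regular expressions are approximations of the constructs listed above, not the exact patterns used for the reported numbers:

import re

FEATURES = {
    "distinct": r"\bDISTINCT\b",
    "filter": r"\bFILTER\b",
    "functions": r"\b(CONCAT|CONTAINS|ISIRI)\b",
    "geo": r"\b(geo|wgs84\w*)\s*:",
    "group": r"\bGROUP\s+BY\b",
    "limit": r"\b(LIMIT|OFFSET)\b",
    "optional": r"\bOPTIONAL\b",
    "order": r"\bORDER\s+BY\b",
    "union": r"\bUNION\b",
}

def feature_shares(queries):
    """Return, per feature, the percentage of SELECT queries that use it."""
    selects = [q for q in queries if re.search(r"\bSELECT\b", q, re.I)]
    n = max(len(selects), 1)
    return {name: 100.0 * sum(bool(re.search(p, q, re.I)) for q in selects) / n
            for name, p in FEATURES.items()}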

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 14 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 14. Number of daily requests sent to DBpedia Live for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013.

7 Use Cases and Applications

With DBpedia evolving into a hub on the Web of Data, due to its coverage of various domains, it is an eminent data set for various tasks. It can not only improve Wikipedia search, but also serve as a data source for applications and mashups, as well as for text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [30]. For that purpose, DBpedia includes a number of specialized data sets.22 For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [12].

22 http://wiki.dbpedia.org/Datasets/NLP


The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

7.1.1 Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking tasks – refers to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [21].

As DBpedia reflects a vast amount of structured real world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [31] is an open source tool23 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base described as SPARQL queries.
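A typical call to the Spotlight web service can be sketched as follows; the REST path, parameter names and JSON response fields are assumptions based on the public service at the time of writing and may differ for other deployments or newer versions:

import requests

SPOTLIGHT = "http://spotlight.dbpedia.org/rest/annotate"  # public web service (assumed URL)

def annotate(text, confidence=0.4):
    """Annotate free text with DBpedia resources via the Spotlight web
    service and return (surface form, DBpedia URI) pairs."""
    r = requests.get(SPOTLIGHT,
                     params={"text": text, "confidence": confidence},
                     headers={"Accept": "application/json"})
    r.raise_for_status()
    # "Resources", "@surfaceForm" and "@URI" follow the service's JSON layout (assumed)
    return [(res["@surfaceForm"], res["@URI"])
            for res in r.json().get("Resources", [])]

# annotate("Berlin is the capital of Germany.")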

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI24, Semantic API from Ontos25, Open Calais26 and Zemanta27, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project.28

23 http://spotlight.dbpedia.org
24 http://www.alchemyapi.com
25 http://www.ontos.com
26 http://www.opencalais.com
27 http://www.zemanta.com

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki29 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [13]. The BBC30 employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system. The Watson system won a $1 million prize in Jeopardy and relies on several data sets including DBpedia.31 DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series.32 Several QA systems, such as TBSL [41], PowerAqua [28], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

28 http://stanbol.apache.org
29 http://www.faviki.com
30 http://bbc.co.uk
31 http://www.aaai.org/Magazine/Watson/watson.php
32 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property from mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [25].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.

– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project.33 Recently, it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia.34 These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of tools and structure them in categories.

33 http://viaf.org
34 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html

Facet Based Browsers. An award-winning35 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [14]. Another facet based browser, which allows to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [15]. A generic SPARQL based facet explorer, which also uses a graph based visualisation of facets, is LODLive [7]. The OpenLink built-in facet based browser36 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying. The DBpedia Query Builder37 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [16] tool provides an intuitive interface which allows to explore the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [17] allows to create statistical analysis queries and correlations in RDF data and DBpedia in particular.

Spatial Applications. DBpedia Mobile [3] is a location aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources available, written in 171 languages (of which approximately 147 can be considered active38), containing information about hundreds of spoken and even ancient languages.

35 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
36 http://dbpedia.org/fct
37 http://querybuilder.dbpedia.org
38 https://s23.org/wikistats/wiktionaries_html.php


For example, the English Wiktionary contains nearly 3 million words.39 A Wiktionary page provides, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies and, most prominently, senses. Within this tree, numerous kinds of linguistic properties are given, including synonyms, hyponyms, hyperonyms, example sentences, links to Wikipedia and many more. The fast changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, poses the biggest problem to the automated transformation into a structured knowledge base. We identified this as a serious problem: although the value of Wiktionary is known and usage scenarios are obvious, only some rudimentary tools exist which either focus on a specific subset of the data or only cover one or two WLEs [18]. Existing tools can be seen as adapters to single WLEs: they are hard to maintain and there are too many languages that constantly change and require a programmer to refactor complex code. Opposed to the currently employed classic and straight-forward approach (implementing software adapters for scraping), Wiktionary2RDF [18] employs a declarative mediator/wrapper pattern. Figure 15 shows the work flow of the specially created Wiktionary extractor that takes as input one config XML file for each WLE. The aim is to enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves. A simple XML dialect was created to encode the "entry layout explained" (ELE) guidelines and declare triple patterns that define how the resulting RDF should be built. An example is given for the following wiki syntax:

===Synonyms===
[[building]]
[[company]]

The regular expression for scraping and the triple generation rule, encoded in XML, look as follows:

<wikiTemplate>===Synonyms===
( [[$target]]
)+
</wikiTemplate>

<triple s="http://some.ns/$entityId" p="http://some.ns/hasSynonym" o="http://some.ns/$target" />

39 See http://en.wiktionary.org/wiki/semantic for a simple example page

Fig 15 Detail view of the Wiktionary extractor

This configuration is interpreted and run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data.40 Statistics are shown in Table 15.

8 Related Work

8.1 Cross Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, the Wikimedia Germany e.V. started the development of Wikidata.41 Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering the truth about things, but providing statements about them. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011 according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

40 http://wiktionary.dbpedia.org
41 wikidata.org


language   words     triples    resources   predicates   senses   XML lines
en         2142237   28593364   11804039    28           424386   930
fr         4657817   35032121   20462349    22           592351   490
ru         1080156   12813437   5994560     17           149859   1449
de         701739    5618508    2966867     16           122362   671

Table 15. Statistical comparison of extractions for different languages. XML lines measures the number of lines of the XML configuration files.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface42 as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase43 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [32] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked and the system can easily revert to a previous version.

42 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
43 http://www.freebase.com

This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular, in focus areas of Google and areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3 YAGO

One of the projects which pursues similar goals to DBpedia is YAGO44 [39]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [20], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content, whereas YAGO focuses on extracting a smaller number of relations compared to DBpedia to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia

44 http://www.mpi-inf.mpg.de/yago-naga/yago/


community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects45 which are able to process wiki syntax, as well as a list of data extraction extensions46 for MediaWiki.

JWPL (Java Wikipedia Library, [46]) is an open-source, Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki Markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as an XML dump and is incorporated in the lexical resource UBY47 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [34]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguations or synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service48. Additional projects presented by the authors that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site49.

An earlier approach to improve the quality of the infobox schemata and contents is described in [44].

45 http://www.mediawiki.org/wiki/Alternative_parsers

46 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction

47 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby/

48 http://sigwp.org/en/index.php/Wikipedia_Thesaurus

49 http://sigwp.org/en/

The presented methodology encompasses a three step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [35] the authors describe their effort building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike50 and Hudong Baike51. Apart from utilizing MediaWiki and HTML Markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve a better cross-lingual knowledge-linkage beyond the use of Wikipedia interlanguage links is presented in [42]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

KnowItAll52 is a web scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [26] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [37] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [43] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [44] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

50 http://baike.baidu.com
51 http://www.hudong.com
52 http://www.cs.washington.edu/research/knowitall


[2] was an infobox extraction approach for Wikipedia, which later became the DBpedia project.

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that DBpedia has matured and improved significantly in the last years, in particular also in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [18] meanwhile becoming part of DBpedia, and LinkedGeoData [38], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach a significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia

ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [45], we took a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision also extracting semantic information from typed links. Typed links are a feature of Semantic MediaWiki, which was backported and implemented as a very lightweight extension for MediaWiki53. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other nicely. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, which identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].
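A minimal sketch of such a sanity check, assuming the dbo:birthDate and dbo:deathDate properties of the DBpedia ontology (the actual queries used in practice may differ), could look as follows:

# Persons whose recorded death date precedes their birth date --
# candidates for feedback to Wikipedia editors.
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?person ?birth ?death
WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER (?death < ?birth)
}
LIMIT 100

Run periodically against the DBpedia Live endpoint, the result list could then be turned into feedback for the corresponding Wikipedia articles.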

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP. Despite recent advances (cf. Section 7), there is still a huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and update (it requires recomputation). With DBpedia and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Paul Kreis for DBpedia development
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– people who helped contributing data for certain parts of the article:

  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.
[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.
[3] C. Becker and C. Bizer. Exploring the geospatial semantic web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.
[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.
[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.
[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the web of data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.
[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.
[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. Freya: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.
[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.
[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.
[12] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.
[13] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.
[14] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.
[15] P. Heim, T. Ertl, and J. Ziegler. Facet graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.
[16] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. Relfinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.
[17] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. Semlens: visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.
[18] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In Proc. of the Joint International Semantic Technology Conference, 2012.
[19] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.
[20] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.
[21] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.
[22] D. Kontokostas, S. Auer, S. Hellmann, J. Lehmann, P. Westphal, R. Cornelissen, and A. Zaveri. Test-driven data quality evaluation for SPARQL endpoints. In Submitted to 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.
[23] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of linked data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.
[24] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The open archives initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.
[25] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.
[26] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.
[27] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational linguistics, pages 495–501, 2000.
[28] V. Lopez, M. Fernandez, E. Motta, and N. Stieler. Poweraqua: Supporting users in querying and exploring the semantic web. Semantic Web, 3(3):249–265, 2012.
[29] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.
[30] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.
[31] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.
[32] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in graphd. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.
[33] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.
[34] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.
[35] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin, Heidelberg, 2011. Springer-Verlag.
[36] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, Nov. 10 2008.
[37] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.
[38] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.
[39] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.
[40] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.
[41] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648. ACM, 2012.
[42] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.
[43] F. Wu and D. Weld. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th World Wide Web Conference, 2008.
[44] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.
[45] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.
[46] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

Appendix

A Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net/>
PREFIX an: <http://vocab.sindice.net/analytics>
PREFIX dom: <http://sindice.com/dataspace/default/domain/>

SELECT ?Dataset ?Target_Dataset ?predicate
       SUM(xsd:long(?cardinality)) AS ?total
FROM <http://sindice.com/analytics> {
  ?target any23:domain_uri ?Target_Dataset .

  ?edge an:source ?source ;
        an:target ?target ;
        an:label ?predicate ;
        an:cardinality ?cardinality ;
        an:publishedIn ?Dataset .

  ?source any23:domain_uri ?Dataset .

  FILTER (?Dataset != ?Target_Dataset)
}

Listing 1: SPARQL query on the Sindice summary graph for obtaining the number of incoming links.



Name | Description | Example
abstract | Extracts the first lines of the Wikipedia article | dbr:Berlin dbo:abstract "Berlin is the capital city of (...)"
article categories | Extracts the categorization of the article | dbr:Oliver_Twist dc:subject dbr:Category:English_novels
category label | Extracts labels for categories | dbr:Category:English_novels rdfs:label "English novels"
disambiguation | Extracts disambiguation links | dbr:Alien dbo:wikiPageDisambiguates dbr:Alien_(film)
external links | Extracts links to external web pages related to the concept | dbr:Animal_Farm dbo:wikiPageExternalLink <http://books.google.com/?id=RBGmrDnBs8UC>
geo coordinates | Extracts geo-coordinates | dbr:Berlin georss:point "52.5006 13.3989"
grammatical gender | Extracts grammatical genders for persons | dbr:Abraham_Lincoln foaf:gender "male"
homepage | Extracts links to the official homepage of an instance | dbr:Alabama foaf:homepage <http://alabama.gov/>
image | Extracts the first image of a Wikipedia page | dbr:Berlin foaf:depiction <http://.../Overview_Berlin.jpg>
infobox | Extracts all properties from all infoboxes | dbr:Animal_Farm dbp:date "March 2010"
interlanguage | Extracts interwiki links | dbr:Albedo dbo:wikiPageInterLanguageLink dbr:Albedo
label | Extracts the article title as label | dbr:Berlin rdfs:label "Berlin"
lexicalizations | Extracts information about surface forms and their association with concepts (only N-Quad format) | dbr:Pine sptl:lexicalization lx:pine_tree ls:Pine_pine_tree ; lx:pine_tree rdfs:label "pine tree" ls:Pine_pine_tree ; ls:Pine_pine_tree sptl:pUriGivenSf 0.941
mappings | Extraction based on mappings of Wikipedia infoboxes to the DBpedia ontology | dbr:Berlin dbo:country dbr:Germany
page ID | Extracts page ids of articles | dbr:Autism dbo:wikiPageID 25
page links | Extracts all links between Wikipedia articles | dbr:Autism dbo:wikiPageWikiLink dbr:Human_brain
persondata | Extracts information about persons represented using the PersonData template | dbr:Andre_Agassi foaf:birthDate "1970-04-29"
PND | Extracts PND (Personennamendatei) data about a person | dbr:William_Shakespeare dbo:individualisedPnd "118613723"
redirects | Extracts redirect links between articles in Wikipedia | dbr:ArtificalLanguages dbo:wikiPageRedirects dbr:Constructed_language
revision ID | Extracts the revision ID of the Wikipedia article | dbr:Autism <http://www.w3.org/ns/prov#wasDerivedFrom> <http://en.wikipedia.org/wiki/Autism?oldid=495234324>
SKOS categories | Extracts information about which concept is a category and how categories are related using the SKOS Vocabulary | dbr:Category:World_War_II skos:broader dbr:Category:Modern_history
thematic concept | Extracts 'thematic' concepts, the centers of discussion for categories | dbr:Category:Music skos:subject dbr:Music
topic signatures | Extracts topic signatures | dbr:Alkane sptl:topicSignature "carbon alkanes atoms"
wiki page | Extracts links to corresponding articles in Wikipedia | dbr:AnAmericanInParis foaf:isPrimaryTopicOf <http://en.wikipedia.org/wiki/AnAmericanInParis>

Table 1
Overview of the DBpedia extractors. sptl is used as prefix for http://spotlight.dbpedia.org/vocab/, lx is used for http://spotlight.dbpedia.org/lexicalizations/ and ls is used for http://spotlight.dbpedia.org/scores/.


Fig 1 Overview of DBpedia extraction framework

{{Infobox automobile
| name = Ford GT40
| manufacturer = [[Ford Advanced Vehicles]]
| production = 1964-1969
| engine = 4181cc
(...)

In this infobox, the first line specifies the infobox type and the subsequent lines specify various attributes of the described entity.

An excerpt of the extracted data is as follows:2

dbr:Ford_GT40 [
  dbp:name "Ford GT40"@en ;
  dbp:manufacturer dbr:Ford_Advanced_Vehicles ;
  dbp:engine 4181 ;
  dbp:production 1964 ;
  (...)
]

This extraction output has weaknesses: the resource is not associated with a class in the ontology, and the engine and production data use literal values for which the semantics are not obvious. Those problems can be overcome by the mapping-based infobox extraction presented in the next subsection.

2 We use dbr for http://dbpedia.org/resource/, dbo for http://dbpedia.org/ontology/ and dbp for http://dbpedia.org/property/ as prefixes throughout the report.

2.4 Mapping-Based Infobox Extraction

In order to homogenize the description of information in the knowledge base, in 2010 a community effort was initiated to develop an ontology schema and mappings from Wikipedia infobox properties to this ontology. The alignment between Wikipedia infoboxes and the ontology is performed via community-provided mappings that help to normalize name variations in properties and classes. Heterogeneity in the Wikipedia infobox system, like using different infoboxes for the same type of entity or using different property names for the same property (cf. Section 2.3), can be alleviated in this way. This significantly increases the quality of the raw Wikipedia infobox data by typing resources, merging name variations and assigning specific datatypes to the values.

This effort is realized using the DBpedia Mappings Wiki3, a MediaWiki installation set up to enable users to collaboratively create and edit mappings. These mappings are specified using the DBpedia Mapping Language. The mapping language makes use of MediaWiki templates that define DBpedia ontology classes and properties as well as template/table to ontology mappings. A mapping assigns a type from the DBpedia ontology to the entities that are described by the corresponding infobox. In addition, attributes in the infobox are mapped to properties in the DBpedia ontology. In the following, we show a mapping that maps infoboxes that use the Infobox automobile template to the DBpedia ontology:

3 http://mappings.dbpedia.org


{{TemplateMapping
| mapToClass = Automobile
| mappings =
  {{PropertyMapping
  | templateProperty = name
  | ontologyProperty = foaf:name }}
  {{PropertyMapping
  | templateProperty = manufacturer
  | ontologyProperty = manufacturer }}
  {{DateIntervalMapping
  | templateProperty = production
  | startDateOntologyProperty = productionStartDate
  | endDateOntologyProperty = productionEndDate }}
  {{IntermediateNodeMapping
  | nodeClass = AutomobileEngine
  | correspondingProperty = engine
  | mappings =
    {{PropertyMapping
    | templateProperty = engine
    | ontologyProperty = displacement
    | unit = Volume }}
    {{PropertyMapping
    | templateProperty = engine
    | ontologyProperty = powerOutput
    | unit = Power }}
  }}
(...)

The RDF statements that are extracted from the previous infobox example are shown below. As we can see, the production period is correctly split into a start year and an end year, and the engine is represented by a distinct RDF node:

dbr:Ford_GT40 [
  rdf:type dbo:Automobile ;
  rdfs:label "Ford GT40"@en ;
  dbo:manufacturer dbr:Ford_Advanced_Vehicles ;
  dbo:productionStartYear "1964"^^xsd:gYear ;
  dbo:productionEndYear "1969"^^xsd:gYear ;
  dbo:engine [
    rdf:type AutomobileEngine ;
    dbo:displacement 0.004181
  ]
  (...)
]

The DBpedia Mappings Wiki is not only used to map different templates within a single language edition of Wikipedia to the DBpedia ontology, but is used to map templates from all Wikipedia language editions to the shared DBpedia ontology. Figure 2 shows how the infobox properties author and συγγραφέας – author in Greek – are both being mapped to the global identifier dbo:author. That means, in turn, that information from all language versions of DBpedia can be merged,

and DBpedias for smaller languages can be augmented with knowledge from larger DBpedias, such as the English edition. Conversely, the larger DBpedia editions can benefit from more specialized knowledge from localized editions, such as data about smaller towns, which is often only present in the corresponding language edition [40].

Besides hosting of the mappings and the DBpedia ontology definition, the DBpedia Mappings Wiki offers various tools which support users in their work:

– Mapping Validator: When editing a mapping, the mapping can be directly validated by a button on the edit page. This validates changes, before saving them, for syntactic correctness and highlights inconsistencies such as missing property definitions.

– Extraction Tester: The extraction tester linked on each mapping page tests a mapping against a set of example Wikipedia pages. This gives direct feedback about whether a mapping works and how the resulting data is structured.

– Mapping Tool: The DBpedia Mapping Tool is a graphical user interface that supports users in creating and editing mappings.

2.5 URI Schemes

For every Wikipedia article, the framework introduces a number of URIs to represent the concepts described on a particular page. Up to 2011, DBpedia published URIs only under the http://dbpedia.org domain. The main namespaces were:

– http://dbpedia.org/resource/ (prefix dbr) for representing article data. Apart from a few corner cases, there is a one-to-one mapping between a Wikipedia page and a DBpedia resource based on the article title. For example, for the Wikipedia article on Berlin4, DBpedia will produce the URI dbr:Berlin.

– http://dbpedia.org/property/ (prefix dbp) for representing properties extracted from the raw infobox extraction (cf. Section 2.3), e.g. dbp:population.

– http://dbpedia.org/ontology/ (prefix dbo) for representing the DBpedia ontology (cf. Section 2.4), e.g. dbo:populationTotal (a query sketch combining these namespaces is given below).
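To illustrate how these namespaces appear together in the data, the following query sketch (an illustration, not taken from the extraction framework itself) retrieves all raw (dbp) and ontology (dbo) statements about a single resource from a DBpedia SPARQL endpoint; it only uses the prefixes introduced above:

PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?property ?value
WHERE {
  dbr:Ford_GT40 ?property ?value .
  # keep raw infobox properties (dbp) and mapped ontology properties (dbo)
  FILTER (STRSTARTS(STR(?property), "http://dbpedia.org/property/")
       || STRSTARTS(STR(?property), "http://dbpedia.org/ontology/"))
}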

4 http://en.wikipedia.org/wiki/Berlin


Fig 2 Depiction of the mapping from the Greek and English Wikipedia templates about books to the same DBpedia Ontology class [23]

Although data from other Wikipedia language editions were extracted, they used the same namespaces. This was achieved by exploiting the Wikipedia inter-language links5. For every page in a language other than English, the page was extracted only if the page contained an inter-language link to an English page. In that case, using the English link, the data was extracted under the English resource name (i.e. dbr:Berlin).

Recent DBpedia internationalisation developments showed that this approach resulted in less and redundant data [23]. Thus, starting from the DBpedia 3.7 release, two types of data sets were generated. The localized data sets contain all things that are described in a specific language. Within the datasets, things are identified with language-specific URIs such as http://<lang>.dbpedia.org/resource/ for article data and http://<lang>.dbpedia.org/property/ for property data. In addition, we produce a canonicalized data set for each language. The canonicalized data sets only contain things for which a corresponding page in the English edition of Wikipedia exists. Within all canonicalized data sets, the same thing is identified with the same URI from the generic language-agnostic namespace http://dbpedia.org/resource/.

5 http://en.wikipedia.org/wiki/Help:Interlanguage_links

2.6 NLP Extraction

DBpedia provides a number of data sets which have been created to support Natural Language Processing (NLP) tasks [30]. Currently, four datasets are extracted: topic signatures, grammatical gender, lexicalizations and thematic concept. While the topic signatures and the grammatical gender extractors primarily extract data from the article text, the lexicalizations and thematic concept extractors make use of the wiki markup.

DBpedia entities can be referred to using many different names and abbreviations. The Lexicalization data set provides access to alternative names for entities and concepts, associated with several scores estimating the association strength between name and URI. These scores distinguish more common names for specific entities from rarely used ones, and also show how ambiguous a name is with respect to all possible concepts that it can mean.

The topic signatures data set enables the description of DBpedia resources based on unstructured information, as compared to the structured factual data provided by the mapping-based and raw extractors. We build a Vector Space Model (VSM) where each DBpedia resource is a point in a multidimensional space of words. Each DBpedia resource is represented by a vector, and each word occurring in Wikipedia is a dimension of this vector. Word scores are computed using the


tf-idf weight, with the intention to measure how strong the association is between a word and a DBpedia resource. Note that word stems are used in this context in order to generalize over inflected words. We use the computed weights to select the strongest related word stems for each entity and build topic signatures [27].
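A common formulation of this weight, given here only for illustration (the exact weighting scheme used by the extractor is not spelled out in this report), is

\mathrm{tfidf}(w, r) = \mathrm{tf}(w, r) \cdot \log \frac{N}{\mathrm{df}(w)}

where tf(w, r) is the frequency of word stem w in the Wikipedia article of resource r, df(w) is the number of articles in which w occurs, and N is the total number of articles.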

There are two more Feature Extractors related to Natural Language Processing. The thematic concepts data set relies on Wikipedia's category system to capture the idea of a 'theme', a subject that is discussed in its articles. Many of the categories in Wikipedia are linked to an article that describes the main topic of that category. We rely on this information to mark DBpedia entities and concepts that are 'thematic', that is, they are the center of discussion for a category.

The grammatical gender data set uses a simple heuristic to decide on a grammatical gender for instances of the class Person in DBpedia. While parsing an article in the English Wikipedia, if there is a mapping from an infobox in this article to the class dbo:Person, we record the frequency of gender-specific pronouns in their declined forms (Subject, Object, Possessive Adjective, Possessive Pronoun and Reflexive) – i.e. he, him, his, himself (masculine) and she, her, hers, herself (feminine). Grammatical genders for DBpedia entities are assigned based on the dominating gender in these pronouns.

2.7 Summary of Other Recent Developments

In this section, we summarize the improvements of the DBpedia extraction framework since the publication of the previous DBpedia overview article [5] in 2009. One of the major changes on the implementation level is that the extraction framework was rewritten in Scala in 2010, improving the efficiency of the extractors by an order of magnitude compared to the previous PHP-based framework. The new, more modular framework also allows extracting data from tables in Wikipedia pages and supports extraction from multiple MediaWiki templates per page. Another significant change was the creation and utilization of the DBpedia Mappings Wiki as described earlier. Further significant changes include the mentioned NLP extractors and the introduction of URI schemes.

In addition, there were several smaller improvements and general maintenance. Overall, over the past four years the parsing of the MediaWiki markup improved quite a lot, which led to better overall coverage, for example concerning references and parser functions. In addition, the collection of MediaWiki

namespace identifiers for many languages is now performed semi-automatically, leading to a high accuracy of detection. This concerns common title prefixes such as User, File, Template, Help, Portal etc. in English that indicate pages that do not contain encyclopedic content and would produce noise in the data. They are important for specific extractors as well; for instance, the category hierarchy data set (SKOS) is produced from pages of the Category namespace. Furthermore, the output of the extraction system now supports more formats, and several compliance issues regarding URIs, IRIs, N-Triples and Turtle were fixed.

The individual data extractors have been improved as well, in both number and quality, in many areas. The abstract extraction was enhanced, producing more accurate plain text representations of the beginning of Wikipedia article texts. More diverse and more specific datatypes do exist (e.g. many currencies and XSD datatypes such as xsd:gYearMonth, xsd:positiveInteger, etc.), and for a number of classes and properties, specific datatypes were added (e.g. inhabitants/km² for the population density of populated places and m³/s for the discharge of rivers). Many issues related to data parsers were resolved, and the quality of the owl:sameAs data set for multiple language versions was increased by an implementation that takes bijective relations into account.

There are also further extractors, e.g. for Wikipedia page IDs and revisions. Moreover, redirect and disambiguation extractors were introduced and improved. For the redirect data, the transitive closure is computed, while taking care of catching cycles in the links. The redirects also help regarding infobox coverage in the mapping-based extraction by resolving alternative template names. Moreover, in the past, if an infobox value pointed to a redirect, this redirection was not properly resolved and thus resulted in RDF links that led to URIs which did not contain any further information. Resolving redirects affected approximately 15% of all links and hence increased the overall interconnectivity of resources in the DBpedia ontology.

Finally, a new heuristic to increase the connectiveness of DBpedia instances was introduced. If an infobox contains a string value that is not linked to another Wikipedia article, the extraction framework searches for hyperlinks in the same Wikipedia article that have the same anchor text as the infobox value string. If such a link exists, the target of that link is used to replace the string value in the infobox. This method further increases the number of object property assertions in the DBpedia ontology.


[Figure 3: Snapshot of a part of the DBpedia ontology, showing top classes such as owl:Thing, dbo:Agent, dbo:Person, dbo:Athlete, dbo:Place, dbo:PopulatedPlace, dbo:Settlement, dbo:Organisation, dbo:Work, dbo:Species and dbo:Eukaryote, together with selected properties and rdfs:subClassOf relations.]

3 DBpedia Ontology

The DBpedia ontology consists of 320 classes, which form a subsumption hierarchy and are described by 1650 different properties. With a maximal depth of 5, the subsumption hierarchy is intentionally kept rather shallow, which fits use cases in which the ontology is visualized or navigated. Figure 3 depicts a part of the DBpedia ontology, indicating the relations among the top ten classes of the DBpedia ontology, i.e. the classes with the highest number of instances.
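The set of most frequently instantiated classes can be derived directly from the data; a minimal query sketch against a DBpedia SPARQL endpoint (restricted to the ontology namespace, and meant as an illustration rather than the exact query used for the figure) is:

SELECT ?class (COUNT(?instance) AS ?count)
WHERE {
  ?instance a ?class .
  # keep only classes from the DBpedia ontology namespace
  FILTER (STRSTARTS(STR(?class), "http://dbpedia.org/ontology/"))
}
GROUP BY ?class
ORDER BY DESC(?count)
LIMIT 10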

[Figure 4: Growth of the DBpedia ontology – number of classes and properties per DBpedia version (3.2 to 3.8).]

The DBpedia ontology is maintained and extended by the community in the DBpedia Mappings Wiki. Figure 4 depicts the growth of the DBpedia ontology over time. While the number of classes is not growing too much, due to the already good coverage of the initial version of the ontology, the number of properties

increases over time due to the collaboration on the DBpedia Mappings Wiki and the addition of more detailed information to infoboxes by Wikipedia editors.

3.1 Mapping Statistics

As of April 2013, there exist mapping communities for 27 languages, 23 of which are active. Figure 5 shows statistics for the coverage of these mappings in DBpedia. Figures (a) and (c) refer to the absolute number of template and property mappings that are defined for every DBpedia language edition. Figures (b) and (d) depict the percentage of the defined template and property mappings compared to the total number of available templates and properties for every Wikipedia language edition. Figures (e) and (g) show the occurrences (instances) that the defined template and property mappings have in Wikipedia. Finally, figures (f) and (h) give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.

It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage. Other language editions, such as Bulgarian, Dutch, English, Greek, Polish and Spanish, have mapped templates covering more than 50% of total template occurrences. In addition, almost all languages have covered more than 20% of property occurrences, with Bulgarian and Portuguese reaching up to 70%.


Fig 5 Mapping coverage statistics for all mapping-enabled languages

The mapping activity of the ontology enrichment process, along with the editing of the ten most active mapping language communities, is depicted in Figure 6. It is interesting to note that the high mapping activity peaks coincide with the DBpedia release dates. For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarter of that year have a very high activity compared to the 4th quarter. In the last two years (2012 and 2013), most of the DBpedia mapping language communities have defined their own chapters and have their own release dates. Thus, recent mapping activity shows less fluctuation.

Finally, Figure 7 shows the English property mappings occurrence frequency. Both axes are in log scale and represent the number of property mappings (x axis) that have exactly y occurrences (y axis). The occurrence frequency follows a long tail distribution. Thus, a low number of property mappings have a high number of occurrences, and a high number of property mappings have a low number of occurrences.

3.2 Instance Data

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding language.


Fig 6 Mapping community activity for (a) ontology and (b) 10 most active language editions

Fig 7 English property mappings occurrence frequency (both axes are in log scale)

For 20 of these languages, we report in this section the overall number of entities being described by the localized versions, as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions, as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5); CD = Canonicalized data sets (see Section 2.5); all = Overall number of instances in the data set, including instances without infobox data; with MD = Number of instances for which mapping-based infobox data exists; Raw Properties = Number of different properties that are generated by the raw infobox extractor; Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor; Raw State-

ments = Number of statements (facts) that are generated by the raw infobox extractor; Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language correspond to instances in the English version of DBpedia, and which portion of the instances is described by clean mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results, on the one hand, from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have been so far created in the Mappings Wiki for a specific language.

Table 3 reports the number of instances for a set of selected ontology classes within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to a specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages; the second column contains the number of instances that are described only in a single language version; the next column contains the number of instances that are contained in two languages but not in three or more languages, etc. For example, 12936 persons are described in five languages but not in six


   | Inst. LD all | Inst. CD all | Inst. with MD CD | Raw Prop. CD | Map. Prop. CD | Raw Statem. CD | Map. Statem. CD
en | 3769926 | 3769926 | 2359521 | 48293 | 1313 | 65143840 | 33742015
de | 1243771 | 650037 | 204335 | 9593 | 261 | 7603562 | 2880381
fr | 1197334 | 740044 | 214953 | 13551 | 228 | 8854322 | 2901809
it | 882127 | 580620 | 383643 | 9716 | 181 | 12227870 | 4804731
es | 879091 | 542524 | 310348 | 14643 | 476 | 7740458 | 4383206
pl | 848298 | 538641 | 344875 | 7306 | 266 | 7696193 | 4511794
ru | 822681 | 439605 | 123011 | 13522 | 76 | 6973305 | 1389473
pt | 699446 | 460258 | 272660 | 12851 | 602 | 6255151 | 4005527
ca | 367362 | 241534 | 112934 | 8696 | 183 | 3689870 | 1301868
cs | 225133 | 148819 | 34893 | 5564 | 334 | 1857230 | 474459
hu | 209180 | 138998 | 63441 | 6821 | 295 | 2506399 | 601037
ko | 196132 | 124591 | 30962 | 7095 | 419 | 1035606 | 417605
tr | 187850 | 106644 | 40438 | 7512 | 440 | 1350679 | 556943
ar | 165722 | 103059 | 16236 | 7898 | 268 | 635058 | 168686
eu | 132877 | 108713 | 41401 | 2245 | 19 | 2255897 | 532709
sl | 129834 | 73099 | 22036 | 4235 | 470 | 1213801 | 222447
bg | 125762 | 87679 | 38825 | 3984 | 274 | 774443 | 488678
hr | 109890 | 71469 | 10343 | 3334 | 158 | 701182 | 151196
el | 71936 | 48260 | 10813 | 2866 | 288 | 206460 | 113838

Table 2
Basic statistics about Localized DBpedia Editions.

             | en | it | pl | es | pt | fr | de | ru | ca | hu
Person       | 763643 | 145060 | 70708 | 65337 | 43057 | 62942 | 33122 | 18620 | 7107 | 15529
  Athlete    | 185126 | 47187 | 30332 | 19482 | 14130 | 21646 | 31237 | 0 | 721 | 4527
  Artist     | 61073 | 12511 | 16120 | 25992 | 10571 | 13465 | 0 | 0 | 2004 | 3821
  Politician | 23096 | 0 | 8943 | 5513 | 3342 | 0 | 0 | 12004 | 1376 | 760
Place        | 572728 | 141101 | 182727 | 132961 | 116660 | 80602 | 131766 | 67932 | 73078 | 18324
  PopulPlace | 387166 | 138077 | 167034 | 121204 | 109418 | 72252 | 79410 | 63826 | 72743 | 15535
  Building   | 60514 | 1270 | 2946 | 3570 | 803 | 921 | 83 | 43 | 0 | 527
  River      | 24267 | 0 | 1383 | 2 | 4149 | 3333 | 6707 | 3924 | 0 | 565
Organisation | 192832 | 4142 | 12193 | 11710 | 10949 | 17513 | 16973 | 1598 | 1399 | 3993
  Company    | 44516 | 4142 | 2566 | 975 | 1903 | 5832 | 7200 | 0 | 440 | 618
  EducInst   | 42270 | 0 | 599 | 1207 | 270 | 1636 | 2171 | 1010 | 0 | 115
  Band       | 27061 | 0 | 2993 | 0 | 4476 | 3868 | 5368 | 0 | 263 | 802
Work         | 333269 | 51918 | 32386 | 36484 | 33869 | 39195 | 18195 | 34363 | 5240 | 9777
  MusicWork  | 159070 | 23652 | 14987 | 15911 | 17053 | 16206 | 0 | 6588 | 697 | 4704
  Film       | 71715 | 17210 | 9467 | 9396 | 8896 | 9741 | 15038 | 12045 | 3859 | 2320
  Software   | 27947 | 5682 | 3050 | 4833 | 3419 | 5733 | 2368 | 0 | 606 | 857

Table 3
Number of instances per class within 10 localized DBpedia versions.

or more languages. The number 871630 for the class Person means that all 20 language versions together describe 871630 different persons. The number is higher than the number of persons described in the canonicalized English infobox data set (763643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding

infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195263 persons are described in two or more languages. The large difference of this number compared to the total number of 871630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).


Class        | Instances | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10+
Person       | 871630 | 676367 | 94339 | 42382 | 21647 | 12936 | 8198 | 5295 | 3437 | 2391 | 4638
Place        | 643260 | 307729 | 150349 | 45836 | 36339 | 20831 | 13523 | 20808 | 31422 | 11262 | 5161
Organisation | 206670 | 160398 | 22661 | 9312 | 5002 | 3221 | 2072 | 1421 | 928 | 594 | 1061
Work         | 360808 | 243706 | 54855 | 23097 | 12605 | 8277 | 5732 | 4007 | 2911 | 1995 | 3623

Table 4
Cross-language overlap: Number of instances that are described in multiple languages.

3.3 Internationalisation Community

The introduction of the mapping-based infobox extractor alongside live synchronisation approaches in [19] allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages6. The DBpedia 3.7 release7 in September 2011 was the first DBpedia release to use the localized I18n DBpedia extraction framework [23].

At the time of writing, official DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish8. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services. Recently, the DBpedia internationalisation committee9 has manifested its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek10 and Dutch11), the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [23].

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how

6 ar, bg, bn, ca, cs, de, el, en, es, et, eu, fr, ga, hi, hr, hu, id, it, ja, ko, nl, pl, pt, ru, sl, tr, ur

7 http://blog.dbpedia.org/2011/09/11/

8 Accessed on 25/02/2013: http://wiki.dbpedia.org/Internationalization/Chapters

9 http://wiki.dbpedia.org/Internationalization

10 http://el.dbpedia.org
11 http://nl.dbpedia.org

many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage. The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is from achieving total coverage (cf. Figure 5), and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

4 Live Synchronisation

Wikipedia articles are continuously revised at a very high rate, e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute12. This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [19,33]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported13. Meanwhile, DBpedia Live for Dutch14 was developed.

12 http://stats.wikimedia.org/EN/SummaryEN.htm

13 http://live.dbpedia.org
14 http://live.nl.dbpedia.org


4.1 DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia foundation kindly provided us access to their update stream using the OAI-PMH protocol [24]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component, serving as a proxy, constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.

– Mappings Wiki: The DBpedia Mappings Wiki described in Section 2.4 serves as secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.

– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.

– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2 Features of DBpedia Live

The core components of the DBpedia Live Extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.

Fig 8 Overview of DBpedia Live extraction framework

– Unmodified Pages: The update of unmodified pages at regular intervals.

– Changesets Publication: The publication of triple-changesets.

– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.

– Data Isolation: Separate data from different sources.

Mapping-Affected Pages. Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.
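For illustration only, the set of entities affected in this example corresponds to what the following query would return; the actual framework identifies the affected Wikipedia pages internally rather than via SPARQL:

PREFIX dbo: <http://dbpedia.org/ontology/>

# All entities currently typed as dbo:Book would be queued for re-extraction.
SELECT ?entity
WHERE {
  ?entity a dbo:Book .
}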

Unmodified Pages. Extraction framework improvements or the activation/deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feed and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets. Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set of added triples and another set of deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.

Synchronisation Tool. The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files


sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.
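Conceptually, applying a single changeset therefore amounts to one DELETE DATA and one INSERT DATA operation against the mirror's store. The following sketch is purely illustrative; the triples and the target graph are example values and not part of the published changeset format:

PREFIX dbo: <http://dbpedia.org/ontology/>

# Hypothetical changeset: one triple was removed, one was added.
DELETE DATA {
  GRAPH <http://live.dbpedia.org> {
    <http://dbpedia.org/resource/Example_City> dbo:populationTotal 100000 .
  }
} ;
INSERT DATA {
  GRAPH <http://live.dbpedia.org> {
    <http://dbpedia.org/resource/Example_City> dbo:populationTotal 100001 .
  }
}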

Data Isolation. In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds is contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) is kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and the mapping-based infobox extraction.
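A client can therefore restrict a query to one of these sources with a GRAPH clause, as in the following illustrative example:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Only consider triples extracted from the article update feed.
SELECT ?s ?label
WHERE {
  GRAPH <http://live.dbpedia.org> {
    ?s rdfs:label ?label .
  }
}
LIMIT 10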

5. Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as the external data sets that set links pointing at DBpedia resources.

5.1 Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third-party data sets. The DBpedia project maintains a link repository15 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third columns contain the predicate that is used for linking as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to a custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

Links in DBpedia have been used for various purposes. One example is the combination of data about

15 https://github.com/dbpedia/dbpedia-links

Data set                   Predicate                Count        Tool
Amsterdam Museum           owl:sameAs               627          S
BBC Wildlife Finder        owl:sameAs               444          S
Book Mashup                rdf:type, owl:sameAs     9 100
Bricklink                  dc:publisher             10 100
CORDIS                     owl:sameAs               314          S
Dailymed                   owl:sameAs               894          S
DBLP Bibliography          owl:sameAs               196          S
DBTune                     owl:sameAs               838          S
Diseasome                  owl:sameAs               2 300        S
Drugbank                   owl:sameAs               4 800        S
EUNIS                      owl:sameAs               3 100        S
Eurostat (Linked Stats)    owl:sameAs               253          S
Eurostat (WBSG)            owl:sameAs               137
CIA World Factbook         owl:sameAs               545          S
flickr wrappr              dbp:hasPhotoCollection   3 800 000    C
Freebase                   owl:sameAs               3 600 000    C
GADM                       owl:sameAs               1 900
GeoNames                   owl:sameAs               86 500       S
GeoSpecies                 owl:sameAs               16 000       S
GHO                        owl:sameAs               196          L
Project Gutenberg          owl:sameAs               2 500        S
Italian Public Schools     owl:sameAs               5 800        S
LinkedGeoData              owl:sameAs               103 600      S
LinkedMDB                  owl:sameAs               13 800       S
MusicBrainz                owl:sameAs               23 000
New York Times             owl:sameAs               9 700
OpenCyc                    owl:sameAs               27 100       C
OpenEI (Open Energy)       owl:sameAs               678          S
Revyu                      owl:sameAs               6
Sider                      owl:sameAs               2 000        S
TCMGeneDIT                 owl:sameAs               904
UMBEL                      rdf:type                 896 400
US Census                  owl:sameAs               12 600
WikiCompany                owl:sameAs               8 300
WordNet                    dbp:wordnet_type         467 100
YAGO2                      rdf:type                 18 100 000
Sum                                                 27 211 732

Table 5. Data sets linked from DBpedia

European Union project funding (FTS) [29] and data about countries in DBpedia. The following query compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia):

SELECT * {
  { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
  } }
  { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
  } }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

In addition to providing outgoing links on instance level, DBpedia also sets links on schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively. In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
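Since these schema-level links are part of the ontology data, they can be retrieved with an ordinary query. The following sketch assumes the equivalence triples are exposed through the public endpoint:

PREFIX owl: <http://www.w3.org/2002/07/owl#>

# List DBpedia classes declared equivalent to Schema.org terms.
SELECT ?dboClass ?schemaOrgClass
WHERE {
  ?dboClass owl:equivalentClass ?schemaOrgClass .
  FILTER (STRSTARTS(STR(?schemaOrgClass), "http://schema.org/"))
}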

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of links pointing at DBpedia from other data sets is 39,007,478 according to the Data Hub.16 However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [36]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of the subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph

16 See http://wiki.dbpedia.org/Interlinking for details.

Data set          Link Predicate Count    Link Count
okaboo.com        4                       2407121
tfri.gov.tw       57                      837459
naplesplus.us     2                       212460
fu-berlin.de      7                       145322
freebase.com      108                     138619
geonames.org      3                       125734
opencyc.org       3                       19648
geospecies.org    10                      16188
dbrec.net         3                       12856
faviki.com        5                       12768

Table 6. Top 10 data sets in Sindice ordered by the number of links to DBpedia

Metric                       Value
Total links                  3960212
Total distinct data sets     248
Total distinct predicates    345

Table 7. Sindice summary statistics for incoming links to DBpedia

summary over all resources they store. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70 of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing at DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia, along with the used link predicate and the number of links.

It should be noted that the data in Sindice is not complete, for instance it does not contain all data sets that are catalogued by the DataHub.17 However, it crawls for RDFa snippets, converts microformats, etc., which are not captured by the DataHub. Despite this inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster (see appendix for details). Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data, and DBpedia is currently ranked second in terms of incoming links.

17 http://datahub.io


domain                 datasets    links
purl.org               498         6717520
dbpedia.org            248         3960212
creativecommons.org    2483        3030910
identi.ca              1021        2359276
l3s.de                 34          1261487
rkbexplorer.com        24          1212416
nytimes.com            27          1174941
w3.org                 405         658535
geospecies.org         13          523709
livejournal.com        14881       366025

Table 8. Top 10 datasets by incoming links in Sindice

Fig. 9. The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, per quarter of 2012.

6. DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint, and third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last couple of years.

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages also vary with respect to their download popularity. The top five languages with respect to the download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language edition are indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 18 TB per quarter, or 6 TB per month, is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 are depicted

DBpedia      Configuration
3.3 - 3.4    AMD Opteron 8220, 2.80 GHz, 4 cores, 32 GB
3.5 - 3.7    Intel Xeon E5520, 2.27 GHz, 8 cores, 48 GB
3.8          Intel Xeon E5-2630, 2.30 GHz, 8 cores, 64 GB

Table 9. Hardware of the machines serving the public SPARQL endpoint

in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1,000 times per month.18 Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Supposedly, they are used for network analysis or for providing relevant links in user interfaces, and they are downloaded more often because they are not provided via the official SPARQL endpoint.

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint19 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports infinite horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round-robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection of any size. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful hardware, as shown in Table 9.

The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.
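Clients that need complete result sets can page through them with LIMIT and OFFSET; the following query is a purely illustrative sketch of such paging (the class used is just an example):

PREFIX dbo: <http://dbpedia.org/ontology/>

# Retrieve the third page of 1000 results each.
SELECT ?country
WHERE { ?country a dbo:Country . }
ORDER BY ?country
LIMIT 1000
OFFSET 2000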

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate-limited after a burst period,

18 The IP address was only filtered for that specific file and month in those cases.

19 http://dbpedia.org/sparql


Fig. 10. The download count (in thousands) and download volume (in TB) of the DBpedia data sets.

2. clients that have been banned after misuse,

3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally using too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate-limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10 MB) are enforced. If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code,20 such as 509, and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.21.21 Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the last couple of DBpedia releases. The Avg/Day column represents the average number of hits (resp. visits) per day, followed by the Median and Standard Deviation. The last column

20 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
21 http://www.webalizer.org

DBpedia    Avg/Day    Median     Stdev      Maximum
3.3        733811     711306     188991     1319539
3.4        1212549    1165893    351226     2371657
3.5        1122612    1035444    386454     2908720
3.6        1328355    1286750    256945     2495031
3.7        2085399    1930728    1057398    8873473
3.8        2910410    2717775    1085640    7678490

Table 10. Number of SPARQL endpoint hits

Fig. 11. Average number of SPARQL endpoint requests per day.

shows the maximum number of hits (visits) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30-minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30-minute interval.


DBpedia    Avg/Day    Median    Stdev     Maximum
3.3        9750       9775      2036      13126
3.4        11329      11473     1546      14198
3.5        16592      16902     2969      23129
3.6        19471      17664     5691      56064
3.7        23972      22262     10741     127869
3.8        16851      16711     2960      27416

Table 11. Number of SPARQL endpoint visits

Fig. 12. Average SPARQL endpoint visits per day.

Figure 11 and Figure 12 show the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,

2. blocking of apps that were abusing the DBpedia endpoint,

3. uptake of the language-specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

Endpoint    3.3     3.4     3.5     3.6     3.7     3.8
data        1355    2681    2230    2547    3714    4246
ontology    80      168     142     178     116     97
page        2634    4703    1810    1687    3787    7377
property    231     311     137     176     176     128
resource    2976    4080    2554    2309    4436    7143
sparql      2090    4177    2266    9112    15902   15475
other       252     420     342     277     579     695
total       9619    16541   9434    16286   28710   35142

Table 12. Hits per service to http://dbpedia.org, in thousands

Fig. 13. Traffic: Linked Data versus SPARQL endpoint.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x that redirects the client to either the page (HTML-based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data from the links at the bottom of the page-based article. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology.

Figure 13 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focussed on the calls to the sparql endpoint and counted the number of statements per type. As the log files only record the full SPARQL query on a GET request, all the PUT requests are counted as unknown.


Statement    3.3     3.4     3.5     3.6     3.7     3.8
ask          56      269     360     326     159     677
construct    55      31      14      11      22      38
describe     11      8       4       7       62      111
select       1891    3663    1658    8030    11204   13516
unknown      78      206     229     738     4455    1134
total        2090    4177    2266    9112    15902   15475

Table 13. Hits per statement type, in thousands

Statement    3.3     3.4     3.5     3.6     3.7     3.8
distinct     19.5    11.4    17.3    19.4    13.3    25.4
filter       45.7    13.7    31.8    25.3    29.2    36.1
functions    8.8     6.9     23.5    21.3    25.5    25.9
geo          27.7    7.0     39.6    6.7     9.3     9.2
group        0.0     0.0     0.0     0.0     0.0     0.2
limit        4.6     6.5     11.6    10.5    7.7     7.8
optional     30.6    23.8    47.3    26.3    16.7    17.2
order        2.3     1.3     1.9     3.2     1.2     1.6
union        3.3     3.2     26.4    11.3    12.1    20.6

Table 14. Trends in SPARQL select queries (rounded values, in %)

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS, like CONCAT, CONTAINS, ISIRI
– use of GEO objects
– GROUP BY
– LIMIT, OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of the SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of the various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
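As a purely illustrative example, a single query of the following shape would be counted under the DISTINCT, OPTIONAL, FILTER, ORDER BY and LIMIT categories of Table 14 (the class and properties are taken from earlier examples in this article and are not meant as a recommended query):

PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbp:  <http://dbpedia.org/property/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?country ?gdp
WHERE {
  ?country a dbo:Country ;
           rdfs:label ?label .
  OPTIONAL { ?country dbp:gdpNominal ?gdp . }
  FILTER (LANG(?label) = "en")
}
ORDER BY DESC(?gdp)
LIMIT 10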

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of

live update requests, i.e. changeset downloads. Figure 14 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 14. Number of daily requests sent to DBpedia Live for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013.

7. Use Cases and Applications

With DBpedia evolving into a hub on the Web of Data due to its coverage of various domains, it is an eminent data set for various tasks. It can not only improve Wikipedia search, but also serve as a data source for applications and mashups, as well as for text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [30]. For that purpose, DBpedia includes a number of specialized data sets.22 For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [12]. The thematic concepts data set of resources can be used for creating a corpus from

22 http://wiki.dbpedia.org/Datasets/NLP


Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

7.1.1 Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or

other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking – refer to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [21].

As DBpedia reflects a vast amount of structured real world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [31] is an open source tool23 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
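For instance, a whitelist of annotation targets could be expressed along the following lines; this is an illustrative sketch and not the exact configuration syntax used by Spotlight:

PREFIX dbo: <http://dbpedia.org/ontology/>

# Only annotate mentions of books for which an illustrator is recorded.
SELECT ?resource
WHERE {
  ?resource a dbo:Book ;
            dbo:illustrator ?illustrator .
}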

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI,24 the Semantic API from Ontos,25 Open Calais26 and Zemanta,27 among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems

23 http://spotlight.dbpedia.org
24 http://www.alchemyapi.com
25 http://www.ontos.com
26 http://www.opencalais.com
27 http://www.zemanta.com

(without disambiguation) in the context of the Apache Stanbol project.28

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki29 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [13]. The BBC30 employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge

across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia.31 DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series.32 Several QA systems, such as TBSL [41], PowerAqua [28], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

28 http://stanbol.apache.org
29 http://www.faviki.com
30 http://bbc.co.uk
31 http://www.aaai.org/Magazine/Watson/watson.php
32 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [25].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.

– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project.33 Recently, it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia.34 These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of these tools and structure them in categories.

33 http://viaf.org
34 Accessed on 12/02/2013, http://www.oclc.org/research/news/2012/12-07a.html

Facet Based Browsers. An award-winning35 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [14]. Another facet-based browser, which allows to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [15]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser36 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying. The DBpedia Query Builder37 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [16] tool provides an intuitive interface which allows to explore the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [17] allows to create statistical analysis queries and correlations in RDF data and DBpedia in particular.

Spatial Applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources available, written in 171 languages (of which approximately 147 can be considered active38), containing information about hundreds of spoken and even ancient

35 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/

36 http://dbpedia.org/fct
37 http://querybuilder.dbpedia.org
38 http://s23.org/wikistats/wiktionaries_html.php


languages. For example, the English Wiktionary contains nearly 3 million words.39 A Wiktionary page provides, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies and most prominently senses. Within this tree, numerous kinds of linguistic properties are given, including synonyms, hyponyms, hyperonyms, example sentences, links to Wikipedia and many more. The fast-changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, poses the biggest problem for the automated transformation into a structured knowledge base. We identified this as a serious problem: although the value of Wiktionary is known and usage scenarios are obvious, only some rudimentary tools exist which either focus on a specific subset of the data or only cover one or two WLEs [18]. Existing tools can be seen as adapters to single WLEs; they are hard to maintain and there are too many languages that constantly change and require a programmer to refactor complex code. As opposed to the currently employed classic and straightforward approach (implementing software adapters for scraping), Wiktionary2RDF [18] employs a declarative mediator/wrapper pattern. Figure 15 shows the workflow of the specially created Wiktionary extractor, which takes as input one config XML file for each WLE. The aim is to enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves. A simple XML dialect was created to encode the "entry layout explained" (ELE) guidelines and declare triple patterns that define how the resulting RDF should be built. An example is given for the following wiki syntax:

===Synonyms===
[[building]]
[[company]]

The regular expression for scraping and the triple generation rule, encoded in XML, look like this:

<wikiTemplate>===Synonyms===
( [[$target]]
)+
</wikiTemplate>

<triple s="http://some.ns/$entityId"
        p="http://some.ns/hasSynonym"
        o="http://some.ns/$target" />
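Applied to the wiki snippet above, such a rule emits one triple per matched synonym. Assuming, purely for illustration, that the processed entry receives the identifier house, the output would look roughly like:

<http://some.ns/house> <http://some.ns/hasSynonym> <http://some.ns/building> .
<http://some.ns/house> <http://some.ns/hasSynonym> <http://some.ns/company> .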

39 See http://en.wiktionary.org/wiki/semantic for a simple example page.

Fig. 15. Detail view of the Wiktionary extractor.

This configuration is interpreted and run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data.40 Statistics are shown in Table 15.

8. Related Work

8.1 Cross Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, Wikimedia Germany e.V. started

the development of Wikidata.41 Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering the truth about things, but at providing statements about them. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

40 http://wiktionary.dbpedia.org
41 http://wikidata.org


Language   Words      Triples     Resources   Predicates   Senses    XML lines
en         2142237    28593364    11804039    28           424386    930
fr         4657817    35032121    20462349    22           592351    490
ru         1080156    12813437    5994560     17           149859    1449
de         701739     5618508     2966867     16           122362    671

Table 15. Statistical comparison of extractions for different languages. XML lines measures the number of lines of the XML configuration files.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface42 as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase43 is a graph database which also extracts

structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [32] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked and the system can easily revert to a previous version.

42 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
43 http://www.freebase.com

This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular, in focus areas of Google and areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3 YAGO

One of the projects which pursues similar goals

to DBpedia is YAGO44 [39]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet. WordNet synsets are represented as classes and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [20], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content, whereas YAGO focuses on extracting a smaller number of relations compared to DBpedia, to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia

44 http://www.mpi-inf.mpg.de/yago-naga/yago


community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects45 that are able to process wiki syntax, as well as a list of data extraction extensions46 for MediaWiki.

JWPL (Java Wikipedia Library) [46] is an open-source Java-based API that allows to access information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as an XML dump and is incorporated in the lexical resource UBY47 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [34]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguation, synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service.48 Additional projects presented by the authors that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site.49

An earlier approach to improve the quality of the infobox schemata and contents is described in [44].

45 http://www.mediawiki.org/wiki/Alternative_parsers
46 http://www.mediawiki.org/wiki/Extension_Matrix#data_extraction
47 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
48 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
49 http://sigwp.org/en

The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [35], the authors describe their effort of building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike50 and Hudong Baike.51 Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [42]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

KnowItAll52 is a web-scale knowledge extraction effort which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [26] is a large common sense knowledge base which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [37] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [43] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [44] is a system which autonomously extracts structured data from Wikipedia and uses self-

50 http://baike.baidu.com
51 http://www.hudong.com
52 http://www.cs.washington.edu/research/knowitall


supervised linking. [2] was an infobox extraction approach for Wikipedia which later became the DBpedia project.

9. Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included, in particular, (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that DBpedia has matured and improved significantly in the last years, in particular also in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There is a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [18] meanwhile becoming part of DBpedia, and LinkedGeoData [38], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach a significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia

ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [45], we took a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki.53 If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other nicely. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, which identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].
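A minimal sketch of such a sanity check, assuming the ontology properties dbo:birthDate and dbo:deathDate are used to express the respective dates, could look as follows:

PREFIX dbo: <http://dbpedia.org/ontology/>

# Report persons whose recorded death date precedes their birth date.
SELECT ?person ?birth ?death
WHERE {
  ?person dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER (?death < ?birth)
}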

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP. Despite recent advances (cf. Section 7), there is still a huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors,
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint,

– Kingsley Idehen for SPARQL and Linked Data hosting and community support,

– Paul Kreis for DBpedia development,
– Christopher Sahnwaldt for DBpedia development and release management,
– Claus Stadler for DBpedia development,
– people who helped contributing data for certain parts of the article:

* Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI),
* Freebase: Shawn Simister and Tom Morris (working at Freebase),
* Instance Data Analysis: Volha Bryl (working at University of Mannheim).

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.
[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.
[3] C. Becker and C. Bizer. Exploring the geospatial semantic web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.
[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.
[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.
[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. LodLive, exploring the web of data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.
[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.
[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.
[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.
[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.
[12] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.
[13] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[14] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[15] P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.
[16] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Mérialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.
[17] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.
[18] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In Proc. of the Joint International Semantic Technology Conference, 2012.
[19] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia Live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.
[20] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.
[21] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.
[22] D. Kontokostas, S. Auer, S. Hellmann, J. Lehmann, P. Westphal, R. Cornelissen, and A. Zaveri. Test-driven data quality evaluation for SPARQL endpoints. In Submitted to 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.
[23] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of linked data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.
[24] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.
[25] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.
[26] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.
[27] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational Linguistics, pages 495–501, 2000.

[28] V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the semantic web. Semantic Web, 3(3):249–265, 2012.
[29] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[30] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.
[31] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.
[32] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.
[33] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.
[34] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.
[35] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin, Heidelberg, 2011. Springer-Verlag.
[36] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. Int. J. of Metadata, Semantics and Ontologies, 3:37–52, November 2008.
[37] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.
[38] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.
[39] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: a core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.
[40] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.
[41] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648. ACM, 2012.
[42] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.
[43] F. Wu and D. Weld. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

Lehmann et al DBpedia 29

[44] F Wu and D S Weld Autonomously semantifying wikipediaIn Proceedings of the 16th Conference on Information andKnowledge Management pages 41ndash50 ACM 2007

[45] A Zaveri D Kontokostas M A Sherif L BuhmannM Morsey S Auer and J Lehmann User-driven quality eval-uation of dbpedia In To appear in Proceedings of 9th Inter-national Conference on Semantic Systems I-SEMANTICS rsquo13Graz Austria September 4-6 2013 ACM 2013

[46] T Zesch C Muller and I Gurevych Extracting lexical seman-tic knowledge from wikipedia and wiktionary In Proceedingsof the 6th International Conference on Language Resourcesand Evaluation Marrakech Morocco May 2008 electronicproceedings

Appendix

A Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net/>
PREFIX an: <http://vocab.sindice.net/analytics#>
PREFIX dom: <http://sindice.com/dataspace/default/domain/>
SELECT ?Dataset ?Target_Dataset ?predicate
       (SUM(xsd:long(?cardinality)) AS ?total)
FROM <http://sindice.com/analytics> {
  ?target any23:domain_uri ?Target_Dataset .
  ?edge an:source ?source ;
        an:target ?target ;
        an:label ?predicate ;
        an:cardinality ?cardinality ;
        an:publishedIn ?Dataset .
  ?source any23:domain_uri ?Dataset .
  FILTER (?Dataset != ?Target_Dataset)
}

Listing 1: SPARQL query on the Sindice summary graph for obtaining the number of incoming links.


Fig. 1. Overview of DBpedia extraction framework

{{Infobox automobile
| name = Ford GT40
| manufacturer = [[Ford Advanced Vehicles]]
| production = 1964-1969
| engine = 4181cc
(...)
}}

In this infobox, the first line specifies the infobox type and the subsequent lines specify various attributes of the described entity.

An excerpt of the extracted data is as follows:2

dbr:Ford_GT40 [
  dbp:name "Ford GT40"@en ;
  dbp:manufacturer dbr:Ford_Advanced_Vehicles ;
  dbp:engine 4181 ;
  dbp:production 1964 ;
  (...)
]

This extraction output has weaknesses: The resource is not associated to a class in the ontology, and the engine and production data use literal values for which the semantics are not obvious. Those problems can be overcome by the mapping-based infobox extraction presented in the next subsection.

2 We use dbr for http://dbpedia.org/resource/, dbo for http://dbpedia.org/ontology/ and dbp for http://dbpedia.org/property/ as prefixes throughout the report.

2.4. Mapping-Based Infobox Extraction

In order to homogenize the description of information in the knowledge base, in 2010 a community effort was initiated to develop an ontology schema and mappings from Wikipedia infobox properties to this ontology. The alignment between Wikipedia infoboxes and the ontology is performed via community-provided mappings that help to normalize name variations in properties and classes. Heterogeneity in the Wikipedia infobox system, like using different infoboxes for the same type of entity or using different property names for the same property (cf. Section 2.3), can be alleviated in this way. This significantly increases the quality of the raw Wikipedia infobox data by typing resources, merging name variations and assigning specific datatypes to the values.

This effort is realized using the DBpedia Mappings Wiki3, a MediaWiki installation set up to enable users to collaboratively create and edit mappings. These mappings are specified using the DBpedia Mapping Language. The mapping language makes use of MediaWiki templates that define DBpedia ontology classes and properties as well as template/table to ontology mappings. A mapping assigns a type from the DBpedia ontology to the entities that are described by the corresponding infobox. In addition, attributes in the infobox are mapped to properties in the DBpedia ontology. In the following, we show a mapping that maps infoboxes that use the Infobox automobile template to the DBpedia ontology:

3 http://mappings.dbpedia.org


{{TemplateMapping
|mapToClass = Automobile
|mappings =
  {{PropertyMapping
  | templateProperty = name
  | ontologyProperty = foaf:name }}
  {{PropertyMapping
  | templateProperty = manufacturer
  | ontologyProperty = manufacturer }}
  {{DateIntervalMapping
  | templateProperty = production
  | startDateOntologyProperty = productionStartDate
  | endDateOntologyProperty = productionEndDate }}
  {{IntermediateNodeMapping
  | nodeClass = AutomobileEngine
  | correspondingProperty = engine
  | mappings =
    {{PropertyMapping
    | templateProperty = engine
    | ontologyProperty = displacement
    | unit = Volume }}
    {{PropertyMapping
    | templateProperty = engine
    | ontologyProperty = powerOutput
    | unit = Power }}
  }}
(...)
}}

The RDF statements that are extracted from the previous infobox example are shown below. As we can see, the production period is correctly split into a start year and an end year, and the engine is represented by a distinct RDF node:

dbr:Ford_GT40 [
  rdf:type dbo:Automobile ;
  rdfs:label "Ford GT40"@en ;
  dbo:manufacturer dbr:Ford_Advanced_Vehicles ;
  dbo:productionStartYear "1964"^^xsd:gYear ;
  dbo:productionEndYear "1969"^^xsd:gYear ;
  dbo:engine [
    rdf:type AutomobileEngine ;
    dbo:displacement 0.004181
  ]
  (...)
]

The DBpedia Mapping Wiki is not only used to map different templates within a single language edition of Wikipedia to the DBpedia ontology, but is used to map templates from all Wikipedia language editions to the shared DBpedia ontology. Figure 2 shows how the infobox properties author and συγγραφέας – author in Greek – are both being mapped to the global identifier dbo:author. That means in turn that information from all language versions of DBpedia can be merged, and DBpedias for smaller languages can be augmented with knowledge from larger DBpedias such as the English edition. Conversely, the larger DBpedia editions can benefit from more specialized knowledge from localized editions, such as data about smaller towns, which is often only present in the corresponding language edition [40].

Besides hosting of the mappings and DBpedia ontology definition, the DBpedia Mappings Wiki offers various tools which support users in their work:

– Mapping Validator: When editing a mapping, the mapping can be directly validated by a button on the edit page. This validates changes before saving them for syntactic correctness and highlights inconsistencies such as missing property definitions.

– Extraction Tester: The extraction tester linked on each mapping page tests a mapping against a set of example Wikipedia pages. This gives direct feedback about whether a mapping works and how the resulting data is structured.

– Mapping Tool: The DBpedia Mapping Tool is a graphical user interface that supports users to create and edit mappings.

2.5. URI Schemes

For every Wikipedia article, the framework introduces a number of URIs to represent the concepts described on a particular page. Up to 2011, DBpedia published URIs only under the http://dbpedia.org domain. The main namespaces were:

– http://dbpedia.org/resource/ (prefix dbr) for representing article data. Apart from a few corner cases, there is a one-to-one mapping between a Wikipedia page and a DBpedia resource based on the article title. For example, for the Wikipedia article on Berlin4, DBpedia will produce the URI dbr:Berlin.

– http://dbpedia.org/property/ (prefix dbp) for representing properties extracted from the raw infobox extraction (cf. Section 2.3), e.g. dbp:population.

– http://dbpedia.org/ontology/ (prefix dbo) for representing the DBpedia ontology (cf. Section 2.4), e.g. dbo:populationTotal.

4 http://en.wikipedia.org/wiki/Berlin


Fig. 2. Depiction of the mapping from the Greek and English Wikipedia templates about books to the same DBpedia ontology class [23]

Although data from other Wikipedia language editions were extracted, they used the same namespaces. This was achieved by exploiting the Wikipedia inter-language links5. For every page in a language other than English, the page was extracted only if the page contained an inter-language link to an English page. In that case, using the English link, the data was extracted under the English resource name (i.e. dbr:Berlin).

Recent DBpedia internationalisation developments showed that this approach resulted in less and redundant data [23]. Thus, starting from the DBpedia 3.7 release, two types of data sets were generated. The localized data sets contain all things that are described in a specific language. Within the datasets, things are identified with language-specific URIs such as http://<lang>.dbpedia.org/resource/ for article data and http://<lang>.dbpedia.org/property/ for property data. In addition, we produce a canonicalized data set for each language. The canonicalized data sets only contain things for which a corresponding page in the English edition of Wikipedia exists. Within all canonicalized data sets, the same thing is identified with the same URI from the generic language-agnostic namespace http://dbpedia.org/resource/.

5 http://en.wikipedia.org/wiki/Help:Interlanguage_links
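To make the interplay of the three namespaces concrete, the following minimal sketch queries the public endpoint for the raw (dbp:) and mapped (dbo:) population values of dbr:Berlin. It uses the Python SPARQLWrapper library; the choice of endpoint, query and properties is illustrative and not part of the extraction framework itself.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Sketch: the dbr/dbp/dbo namespaces in one query (illustrative properties).
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery("""
    PREFIX dbr: <http://dbpedia.org/resource/>
    PREFIX dbp: <http://dbpedia.org/property/>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?raw ?mapped WHERE {
      OPTIONAL { dbr:Berlin dbp:population      ?raw . }     # raw infobox value
      OPTIONAL { dbr:Berlin dbo:populationTotal ?mapped . }  # mapped, typed value
    }
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row.get("raw", {}).get("value"), row.get("mapped", {}).get("value"))
```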

2.6. NLP Extraction

DBpedia provides a number of data sets which have been created to support Natural Language Processing (NLP) tasks [30]. Currently, four datasets are extracted: topic signatures, grammatical gender, lexicalizations and thematic concept. While the topic signatures and the grammatical gender extractors primarily extract data from the article text, the lexicalizations and thematic concept extractors make use of the wiki markup.

DBpedia entities can be referred to using many different names and abbreviations. The Lexicalization data set provides access to alternative names for entities and concepts, associated with several scores estimating the association strength between name and URI. These scores distinguish more common names for specific entities from rarely used ones and also show how ambiguous a name is with respect to all possible concepts that it can mean.

The topic signatures data set enables the description of DBpedia resources based on unstructured information, as compared to the structured factual data provided by the mapping-based and raw extractors. We build a Vector Space Model (VSM) where each DBpedia resource is a point in a multidimensional space of words. Each DBpedia resource is represented by a vector, and each word occurring in Wikipedia is a dimension of this vector.


Word scores are computed using the tf-idf weight, with the intention to measure how strong the association between a word and a DBpedia resource is. Note that word stems are used in this context in order to generalize over inflected words. We use the computed weights to select the strongest related word stems for each entity and build topic signatures [27].
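The following is a minimal sketch of this idea (not the actual Scala extractor): per-resource texts are turned into tf-idf vectors and the top-weighted terms form the signature. The toy input data and function names are assumptions for illustration; stemming is omitted for brevity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy "resource -> associated Wikipedia text" corpus (illustrative data).
texts = {
    "dbr:Ford_GT40": "endurance racing car built by ford to win the le mans race",
    "dbr:Berlin": "capital city of germany with districts on the river spree",
}
resources = list(texts)

vectorizer = TfidfVectorizer(stop_words="english")  # stemming omitted for brevity
matrix = vectorizer.fit_transform(texts[r] for r in resources)
terms = vectorizer.get_feature_names_out()

def topic_signature(resource, k=5):
    """Return the k terms with the highest tf-idf weight for a resource."""
    weights = matrix[resources.index(resource)].toarray()[0]
    ranked = sorted(zip(terms, weights), key=lambda pair: pair[1], reverse=True)
    return [term for term, weight in ranked[:k] if weight > 0]

print(topic_signature("dbr:Berlin"))
```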

There are two more Feature Extractors related to Natural Language Processing. The thematic concepts data set relies on Wikipedia's category system to capture the idea of a 'theme', a subject that is discussed in its articles. Many of the categories in Wikipedia are linked to an article that describes the main topic of that category. We rely on this information to mark DBpedia entities and concepts that are 'thematic', that is, they are the center of discussion for a category.

The grammatical gender data set uses a simple heuristic to decide on a grammatical gender for instances of the class Person in DBpedia. While parsing an article in the English Wikipedia, if there is a mapping from an infobox in this article to the class dbo:Person, we record the frequency of gender-specific pronouns in their declined forms (Subject, Object, Possessive Adjective, Possessive Pronoun and Reflexive), i.e. he, him, his, himself (masculine) and she, her, hers, herself (feminine). Grammatical genders for DBpedia entities are assigned based on the dominating gender in these pronouns.
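A simplified sketch of this pronoun-counting heuristic is shown below (illustrative only; the tie-breaking and thresholding behaviour of the real extractor may differ):

```python
import re

MASCULINE = {"he", "him", "his", "himself"}
FEMININE = {"she", "her", "hers", "herself"}

def grammatical_gender(article_text):
    """Guess the grammatical gender of a dbo:Person from pronoun frequencies."""
    tokens = re.findall(r"[a-z]+", article_text.lower())
    masculine = sum(token in MASCULINE for token in tokens)
    feminine = sum(token in FEMININE for token in tokens)
    if masculine == feminine == 0:
        return None                       # no evidence in the article text
    return "male" if masculine > feminine else "female"

print(grammatical_gender("She was born in Leipzig. Her early work ..."))  # female
```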

2.7. Summary of Other Recent Developments

In this section we summarize the improvements of the DBpedia extraction framework since the publication of the previous DBpedia overview article [5] in 2009. One of the major changes on the implementation level is that the extraction framework has been rewritten in Scala in 2010 to improve the efficiency of the extractors by an order of magnitude compared to the previous PHP based framework. The new, more modular framework also allows extracting data from tables in Wikipedia pages and supports extraction from multiple MediaWiki templates per page. Another significant change was the creation and utilization of the DBpedia Mappings Wiki as described earlier. Further significant changes include the mentioned NLP extractors and the introduction of URI schemes.

In addition, there were several smaller improvements and general maintenance. Overall, over the past four years, the parsing of the MediaWiki markup improved quite a lot, which led to better overall coverage, for example concerning references and parser functions. In addition, the collection of MediaWiki namespace identifiers for many languages is now performed semi-automatically, leading to a high accuracy of detection. This concerns common title prefixes such as User, File, Template, Help, Portal etc. in English that indicate pages that do not contain encyclopedic content and would produce noise in the data. They are important for specific extractors as well, for instance the category hierarchy data set (SKOS) is produced from pages of the Category namespace. Furthermore, the output of the extraction system now supports more formats, and several compliance issues regarding URIs, IRIs, N-Triples and Turtle were fixed.

The individual data extractors have been improved as well in both number and quality in many areas. The abstract extraction was enhanced, producing more accurate plain text representations of the beginning of Wikipedia article texts. More diverse and more specific datatypes do exist (e.g. many currencies and XSD datatypes such as xsd:gYearMonth, xsd:positiveInteger etc.), and for a number of classes and properties, specific datatypes were added (e.g. inhabitants/km² for the population density of populated places and m³/s for the discharge of rivers). Many issues related to data parsers were resolved, and the quality of the owl:sameAs data set for multiple language versions was increased by an implementation that takes bijective relations into account.

There are also further extractors, e.g. for Wikipedia page IDs and revisions. Moreover, redirect and disambiguation extractors were introduced and improved. For the redirect data, the transitive closure is computed, while taking care of catching cycles in the links. The redirects also help regarding infobox coverage in the mapping-based extraction by resolving alternative template names. Moreover, in the past, if an infobox value pointed to a redirect, this redirection was not properly resolved and thus resulted in RDF links that led to URIs which did not contain any further information. Resolving redirects affected approximately 15% of all links and hence increased the overall inter-connectivity of resources in the DBpedia ontology.
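The redirect resolution step can be sketched as follows (a simplification of the actual framework code, assuming the redirects data set is available as a simple title-to-title mapping):

```python
def resolve_redirects(redirects):
    """Compute the transitive closure of redirect links, catching cycles.

    `redirects` maps a page title to the page it redirects to.
    """
    resolved = {}
    for start, target in redirects.items():
        seen = {start}
        # Follow the redirect chain until an article or an already seen page.
        while target in redirects and target not in seen:
            seen.add(target)
            target = redirects[target]
        resolved[start] = target
    return resolved

print(resolve_redirects({"GT40": "Ford GT40", "Ford GT 40": "GT40"}))
# -> {'GT40': 'Ford GT40', 'Ford GT 40': 'Ford GT40'}
```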

Finally, a new heuristic to increase the connectiveness of DBpedia instances was introduced. If an infobox contains a string value that is not linked to another Wikipedia article, the extraction framework searches for hyperlinks in the same Wikipedia article that have the same anchor text as the infobox value string. If such a link exists, the target of that link is used to replace the string value in the infobox. This method further increases the number of object property assertions in the DBpedia ontology.
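A sketch of this anchor-text heuristic under simplified assumptions (the article's links are given as an anchor-text to target mapping; the real framework operates on parsed wiki markup):

```python
def link_infobox_value(value, page_links):
    """Replace a plain infobox string by a link target if the same article
    links that exact anchor text somewhere else.

    `page_links` maps anchor texts appearing in the article to link targets.
    """
    return page_links.get(value.strip(), value)

links = {"Ford Advanced Vehicles": "dbr:Ford_Advanced_Vehicles"}
print(link_infobox_value("Ford Advanced Vehicles", links))  # becomes an object link
print(link_infobox_value("4181cc", links))                  # stays a literal value
```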


Fig. 3. Snapshot of a part of the DBpedia ontology (legend: C = owl:Class, P = owl:DatatypeProperty / owl:ObjectProperty; the snapshot shows top classes such as dbo:Place, dbo:PopulatedPlace, dbo:Settlement, dbo:Agent, dbo:Person, dbo:Athlete, dbo:Organisation, dbo:Species, dbo:Eukaryote and dbo:Work, connected by rdfs:subClassOf relations and selected properties)

3. DBpedia Ontology

The DBpedia ontology consists of 320 classes which form a subsumption hierarchy and are described by 1,650 different properties. With a maximal depth of 5, the subsumption hierarchy is intentionally kept rather shallow, which fits use cases in which the ontology is visualized or navigated. Figure 3 depicts a part of the DBpedia ontology, indicating the relations among the top ten classes of the DBpedia ontology, i.e. the classes with the highest number of instances.

Fig. 4. Growth of the DBpedia ontology: number of ontology elements (classes and properties) per DBpedia version 3.2 to 3.8

The DBpedia ontology is maintained and extended by the community in the DBpedia Mappings Wiki. Figure 4 depicts the growth of the DBpedia ontology over time. While the number of classes is not growing too much due to the already good coverage of the initial version of the ontology, the number of properties increases over time due to the collaboration on the DBpedia Mappings Wiki and the addition of more detailed information to infoboxes by Wikipedia editors.

3.1. Mapping Statistics

As of April 2013, there exist mapping communities for 27 languages, 23 of which are active. Figure 5 shows statistics for the coverage of these mappings in DBpedia. Figures (a) and (c) refer to the absolute number of template and property mappings that are defined for every DBpedia language edition. Figures (b) and (d) depict the percentage of the defined template and property mappings compared to the total number of available templates and properties for every Wikipedia language edition. Figures (e) and (g) show the occurrences (instances) that the defined template and property mappings have in Wikipedia. Finally, figures (f) and (h) give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.

It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage. Other language editions, such as Bulgarian, Dutch, English, Greek, Polish and Spanish, have mapped templates covering more than 50% of total template occurrences. In addition, almost all languages have covered more than 20% of property occurrences, with Bulgarian and Portuguese reaching up to 70%.


Fig 5 Mapping coverage statistics for all mapping-enabled languages

The mapping activity of the ontology enrichment process, along with the editing of the ten most active mapping language communities, is depicted in Figure 6. It is interesting to notice that the high mapping activity peaks coincide with the DBpedia release dates. For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarter of that year have a very high activity compared to the 4th quarter. In the last two years (2012 and 2013), most of the DBpedia mapping language communities have defined their own chapters and have their own release dates. Thus, recent mapping activity shows less fluctuation.

Finally, Figure 7 shows the English property mappings occurrence frequency. Both axes are in log scale and represent the number of property mappings (x axis) that have exactly y occurrences (y axis). The occurrence frequency follows a long tail distribution. Thus, a low number of property mappings have a high number of occurrences, and a high number of property mappings have a low number of occurrences.

3.2. Instance Data

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding language.


Fig. 6. Mapping community activity for (a) ontology and (b) 10 most active language editions

Fig. 7. English property mappings occurrence frequency (both axes are in log scale)

For 20 of these languages, we report in this section the overall number of entities being described by the localized versions, as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions, as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5), CD = Canonicalized data sets (see Section 2.5), all = Overall number of instances in the data set including instances without infobox data, with MD = Number of instances for which mapping-based infobox data exists, Raw Properties = Number of different properties that are generated by the raw infobox extractor, Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor, Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor, Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language correspond to instances in the English version of DBpedia and which portion of the instances is described by clean mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results on the one hand from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have been so far created in the Mapping Wiki for a specific language.

Table 3 reports the number of instances for a set of selected ontology classes within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to a specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages, the second column contains the number of instances that are described only in a single language version, the next column contains the number of instances that are contained in two languages but not in three or more languages, etc. For example, 12936 persons are described in five languages but not in six


Lang | Inst. LD all | Inst. CD all | Inst. with MD CD | Raw Prop. CD | Map Prop. CD | Raw Statem. CD | Map Statem. CD
en | 3769926 | 3769926 | 2359521 | 48293 | 1313 | 65143840 | 33742015
de | 1243771 | 650037 | 204335 | 9593 | 261 | 7603562 | 2880381
fr | 1197334 | 740044 | 214953 | 13551 | 228 | 8854322 | 2901809
it | 882127 | 580620 | 383643 | 9716 | 181 | 12227870 | 4804731
es | 879091 | 542524 | 310348 | 14643 | 476 | 7740458 | 4383206
pl | 848298 | 538641 | 344875 | 7306 | 266 | 7696193 | 4511794
ru | 822681 | 439605 | 123011 | 13522 | 76 | 6973305 | 1389473
pt | 699446 | 460258 | 272660 | 12851 | 602 | 6255151 | 4005527
ca | 367362 | 241534 | 112934 | 8696 | 183 | 3689870 | 1301868
cs | 225133 | 148819 | 34893 | 5564 | 334 | 1857230 | 474459
hu | 209180 | 138998 | 63441 | 6821 | 295 | 2506399 | 601037
ko | 196132 | 124591 | 30962 | 7095 | 419 | 1035606 | 417605
tr | 187850 | 106644 | 40438 | 7512 | 440 | 1350679 | 556943
ar | 165722 | 103059 | 16236 | 7898 | 268 | 635058 | 168686
eu | 132877 | 108713 | 41401 | 2245 | 19 | 2255897 | 532709
sl | 129834 | 73099 | 22036 | 4235 | 470 | 1213801 | 222447
bg | 125762 | 87679 | 38825 | 3984 | 274 | 774443 | 488678
hr | 109890 | 71469 | 10343 | 3334 | 158 | 701182 | 151196
el | 71936 | 48260 | 10813 | 2866 | 288 | 206460 | 113838

Table 2: Basic statistics about Localized DBpedia Editions

Class | en | it | pl | es | pt | fr | de | ru | ca | hu
Person | 763643 | 145060 | 70708 | 65337 | 43057 | 62942 | 33122 | 18620 | 7107 | 15529
  Athlete | 185126 | 47187 | 30332 | 19482 | 14130 | 21646 | 31237 | 0 | 721 | 4527
  Artist | 61073 | 12511 | 16120 | 25992 | 10571 | 13465 | 0 | 0 | 2004 | 3821
  Politician | 23096 | 0 | 8943 | 5513 | 3342 | 0 | 0 | 12004 | 1376 | 760
Place | 572728 | 141101 | 182727 | 132961 | 116660 | 80602 | 131766 | 67932 | 73078 | 18324
  PopulPlace | 387166 | 138077 | 167034 | 121204 | 109418 | 72252 | 79410 | 63826 | 72743 | 15535
  Building | 60514 | 1270 | 2946 | 3570 | 803 | 921 | 83 | 43 | 0 | 527
  River | 24267 | 0 | 1383 | 2 | 4149 | 3333 | 6707 | 3924 | 0 | 565
Organisation | 192832 | 4142 | 12193 | 11710 | 10949 | 17513 | 16973 | 1598 | 1399 | 3993
  Company | 44516 | 4142 | 2566 | 975 | 1903 | 5832 | 7200 | 0 | 440 | 618
  EducInst | 42270 | 0 | 599 | 1207 | 270 | 1636 | 2171 | 1010 | 0 | 115
  Band | 27061 | 0 | 2993 | 0 | 4476 | 3868 | 5368 | 0 | 263 | 802
Work | 333269 | 51918 | 32386 | 36484 | 33869 | 39195 | 18195 | 34363 | 5240 | 9777
  MusicWork | 159070 | 23652 | 14987 | 15911 | 17053 | 16206 | 0 | 6588 | 697 | 4704
  Film | 71715 | 17210 | 9467 | 9396 | 8896 | 9741 | 15038 | 12045 | 3859 | 2320
  Software | 27947 | 5682 | 3050 | 4833 | 3419 | 5733 | 2368 | 0 | 606 | 857

Table 3: Number of instances per class within 10 localized DBpedia versions

or more languages. The number 871630 for the class Person means that all 20 language versions together describe 871630 different persons. The number is higher than the number of persons described in the canonicalized English infobox data set (763643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195263 persons are described in two or more languages. The large difference of this number compared to the total number of 871630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).


Class | Instances | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10+
Person | 871630 | 676367 | 94339 | 42382 | 21647 | 12936 | 8198 | 5295 | 3437 | 2391 | 4638
Place | 643260 | 307729 | 150349 | 45836 | 36339 | 20831 | 13523 | 20808 | 31422 | 11262 | 5161
Organisation | 206670 | 160398 | 22661 | 9312 | 5002 | 3221 | 2072 | 1421 | 928 | 594 | 1061
Work | 360808 | 243706 | 54855 | 23097 | 12605 | 8277 | 5732 | 4007 | 2911 | 1995 | 3623

Table 4: Cross-language overlap: Number of instances that are described in multiple languages

3.3. Internationalisation Community

The introduction of the mapping-based infobox extractor alongside live synchronisation approaches in [19] allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages6. The DBpedia 3.7 release7 in September 2011 was the first DBpedia release to use the localized I18n DBpedia extraction framework [23].

At the time of writing, official DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish8. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services. Recently, the DBpedia internationalisation committee9 has manifested its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek10 and Dutch11), the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [23].

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how

6 ar, bg, bn, ca, cs, de, el, en, es, et, eu, fr, ga, hi, hr, hu, id, it, ja, ko, nl, pl, pt, ru, sl, tr, ur

7 http://blog.dbpedia.org/2011/09/11/
8 Accessed on 25/02/2013: http://wiki.dbpedia.org/Internationalization/Chapters
9 http://wiki.dbpedia.org/Internationalization
10 http://el.dbpedia.org
11 http://nl.dbpedia.org

many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage. The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is from achieving total coverage (cf. Figure 5), and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

4. Live Synchronisation

Wikipedia articles are continuously revised at a very high rate, e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute12. This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [19,33]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported13. Meanwhile, DBpedia Live for Dutch14 was developed.

12 http://stats.wikimedia.org/EN/SummaryEN.htm

13 http://live.dbpedia.org
14 http://live.nl.dbpedia.org


4.1. DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia foundation kindly provided us access to their update stream using the OAI-PMH protocol [24]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component, serving as a proxy, constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.

– Mappings Wiki: The DBpedia Mappings Wiki described in Section 2.4 serves as secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.

– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.

– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2. Features of DBpedia Live

The core components of the DBpedia Live Extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.

– Unmodified Pages: The update of unmodified pages at regular intervals.

– Changesets Publication: The publication of triple-changesets.

– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.

– Data Isolation: Separate data from different sources.

Fig. 8. Overview of DBpedia Live extraction framework

Mapping-Affected Pages. Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.

Unmodified Pages. Extraction framework improvements or activation / deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping affected pages feed and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets. Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set for the added triples and another set for the deleted triples. We write those two sets into N-Triples files, compress them and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.

Synchronisation Tool. The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.
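A minimal sketch of what such a mirror update could look like in Python, assuming a pair of gzipped N-Triples changeset files and a local Virtuoso endpoint that accepts SPARQL updates (file names, endpoint URL and graph URI are assumptions; the official synchronisation tool is a separate application):

```python
import gzip
from SPARQLWrapper import SPARQLWrapper, POST

# Assumed local mirror endpoint that accepts SPARQL 1.1 updates.
endpoint = SPARQLWrapper("http://localhost:8890/sparql")
endpoint.setMethod(POST)

def apply_changeset(added_path, removed_path, graph="http://live.dbpedia.org"):
    """Apply one changeset (added/removed N-Triples files) to the mirror."""
    with gzip.open(removed_path, "rt", encoding="utf-8") as f:
        removed = f.read()
    with gzip.open(added_path, "rt", encoding="utf-8") as f:
        added = f.read()
    # Delete the outdated triples first, then insert the new ones.
    endpoint.setQuery("DELETE DATA { GRAPH <%s> { %s } }" % (graph, removed))
    endpoint.query()
    endpoint.setQuery("INSERT DATA { GRAPH <%s> { %s } }" % (graph, added))
    endpoint.query()

apply_changeset("000001.added.nt.gz", "000001.removed.nt.gz")
```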

Data Isolation. In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and mapping-based infobox extraction.

5. Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as the external data sets that set links pointing at DBpedia resources.

5.1. Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third party data sets. The DBpedia project maintains a link repository15 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third columns contain the predicate that is used for linking as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

Links in DBpedia have been used for various purposes. One example is the combination of data about

15 https://github.com/dbpedia/dbpedia-links

Data set | Predicate | Count | Tool
Amsterdam Museum | owl:sameAs | 627 | S
BBC Wildlife Finder | owl:sameAs | 444 | S
Book Mashup | rdf:type, owl:sameAs | 9 100 |
Bricklink | dc:publisher | 10 100 |
CORDIS | owl:sameAs | 314 | S
Dailymed | owl:sameAs | 894 | S
DBLP Bibliography | owl:sameAs | 196 | S
DBTune | owl:sameAs | 838 | S
Diseasome | owl:sameAs | 2 300 | S
Drugbank | owl:sameAs | 4 800 | S
EUNIS | owl:sameAs | 3 100 | S
Eurostat (Linked Stats) | owl:sameAs | 253 | S
Eurostat (WBSG) | owl:sameAs | 137 |
CIA World Factbook | owl:sameAs | 545 | S
flickr wrappr | dbp:hasPhotoCollection | 3 800 000 | C
Freebase | owl:sameAs | 3 600 000 | C
GADM | owl:sameAs | 1 900 |
GeoNames | owl:sameAs | 86 500 | S
GeoSpecies | owl:sameAs | 16 000 | S
GHO | owl:sameAs | 196 | L
Project Gutenberg | owl:sameAs | 2 500 | S
Italian Public Schools | owl:sameAs | 5 800 | S
LinkedGeoData | owl:sameAs | 103 600 | S
LinkedMDB | owl:sameAs | 13 800 | S
MusicBrainz | owl:sameAs | 23 000 |
New York Times | owl:sameAs | 9 700 |
OpenCyc | owl:sameAs | 27 100 | C
OpenEI (Open Energy) | owl:sameAs | 678 | S
Revyu | owl:sameAs | 6 |
Sider | owl:sameAs | 2 000 | S
TCMGeneDIT | owl:sameAs | 904 |
UMBEL | rdf:type | 896 400 |
US Census | owl:sameAs | 12 600 |
WikiCompany | owl:sameAs | 8 300 |
WordNet | dbp:wordnet_type | 467 100 |
YAGO2 | rdf:type | 18 100 000 |
Sum | | 27 211 732 |

Table 5: Data sets linked from DBpedia

European Union project funding (FTS) [29] and data about countries in DBpedia. The following query compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia):

SELECT * {
  { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
  } }
  { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
  } }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

In addition to providing outgoing links on instance-level, DBpedia also sets links on schema-level pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively. In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.

5.2. Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of links pointing at DBpedia from other data sets is 39007478 according to the Data Hub16. However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [36]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data set of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: The data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph

16 See http://wiki.dbpedia.org/Interlinking for details.

Data set | Link Predicate Count | Link Count
okaboo.com | 4 | 2407121
tfri.gov.tw | 57 | 837459
naplesplus.us | 2 | 212460
fu-berlin.de | 7 | 145322
freebase.com | 108 | 138619
geonames.org | 3 | 125734
opencyc.org | 3 | 19648
geospecies.org | 10 | 16188
dbrec.net | 3 | 12856
faviki.com | 5 | 12768

Table 6: Top 10 data sets in Sindice ordered by the number of links to DBpedia

Metric | Value
Total links | 3960212
Total distinct data sets | 248
Total distinct predicates | 345

Table 7: Sindice summary statistics for incoming links to DBpedia

summary over all resources they store. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing at DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia, along with the used link predicate and the number of links.

It should be noted that the data in Sindice is not complete, for instance it does not contain all data sets that are catalogued by the DataHub17. However, it crawls for RDFa snippets, converts microformats etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster (see appendix for details). Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data, and DBpedia is currently ranked second in terms of incoming links.

17 http://datahub.io


Domain | Datasets | Links
purl.org | 498 | 6717520
dbpedia.org | 248 | 3960212
creativecommons.org | 2483 | 3030910
identi.ca | 1021 | 2359276
l3s.de | 34 | 1261487
rkbexplorer.com | 24 | 1212416
nytimes.com | 27 | 1174941
w3.org | 405 | 658535
geospecies.org | 13 | 523709
livejournal.com | 14881 | 366025

Table 8: Top 10 datasets by incoming links in Sindice

Fig. 9. The download count (in thousands) and download volume of the English language edition of DBpedia (2012, by quarter)

6. DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint, and third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last couple of years.

6.1. Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to the download popularity as well. The top five languages with respect to the download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language is indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 18 TB per quarter or 6 TB per month is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 is depicted

DBpedia | Configuration
3.3 - 3.4 | AMD Opteron 8220, 2.80 GHz, 4 cores, 32 GB
3.5 - 3.7 | Intel Xeon E5520, 2.27 GHz, 8 cores, 48 GB
3.8 | Intel Xeon E5-2630, 2.30 GHz, 8 cores, 64 GB

Table 9: Hardware of the machines serving the public SPARQL endpoint

in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1000 times per month18. Pagelinks are the most downloaded dataset, although they are not semantically rich as they do not reveal which type of links exists between two resources. Supposedly, they are used for network analysis or providing relevant links in user interfaces, and downloaded more often as they are not provided via the official SPARQL endpoint.

6.2. Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint19 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports infinite horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection of any size. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful hardware, as shown in Table 9.

The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.
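Because of the result set cap, clients typically page through larger results with LIMIT and OFFSET. A minimal sketch against the public endpoint (the page size and example query are illustrative):

```python
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setReturnFormat(JSON)

PAGE = 10000  # stay below the 50000 row limit of the endpoint
offset, results = 0, []
while True:
    sparql.setQuery(f"""
        SELECT ?city WHERE {{ ?city a <http://dbpedia.org/ontology/City> }}
        ORDER BY ?city LIMIT {PAGE} OFFSET {offset}
    """)
    page = sparql.query().convert()["results"]["bindings"]
    results.extend(page)
    if len(page) < PAGE:   # last (partial) page reached
        break
    offset += PAGE

print(len(results), "cities retrieved")
```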

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate limited after a burst period,

18 The IP address was only filtered for that specific file and month in those cases.

19 http://dbpedia.org/sparql


Fig. 10. The download count (in thousands) and download volume of the DBpedia data sets

2. clients that have been banned after misuse,

3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally use too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10 MB) are enforced. If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code20 like 509 and quickly drops the connection. The system further uses an iptables based firewall for permanent blocking of clients identified by their IP addresses.

6.3. Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.2121. Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the last couple of DBpedia releases. The Avg/Day column represents the average number of hits (resp. visits) per day, followed by the Median and Standard Deviation. The last column

20 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

21 http://www.webalizer.org

DBpedia | Avg/Day | Median | Stdev | Maximum
3.3 | 733811 | 711306 | 188991 | 1319539
3.4 | 1212549 | 1165893 | 351226 | 2371657
3.5 | 1122612 | 1035444 | 386454 | 2908720
3.6 | 1328355 | 1286750 | 256945 | 2495031
3.7 | 2085399 | 1930728 | 1057398 | 8873473
3.8 | 2910410 | 2717775 | 1085640 | 7678490

Table 10: Number of SPARQL endpoint hits

Fig 11 Average number of SPARQL endpoint requests per day

shows the maximum number of hits (visits) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30 minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30 minute interval.


DBpedia | Avg/Day | Median | Stdev | Maximum
3.3 | 9750 | 9775 | 2036 | 13126
3.4 | 11329 | 11473 | 1546 | 14198
3.5 | 16592 | 16902 | 2969 | 23129
3.6 | 19471 | 17664 | 5691 | 56064
3.7 | 23972 | 22262 | 10741 | 127869
3.8 | 16851 | 16711 | 2960 | 27416

Table 11: Number of SPARQL endpoint visits

Fig 12 Average SPARQL endpoint visits per day

Figure 11 and Figure 12 show the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,

2. blocking of apps that were abusing the DBpedia endpoint,

3. uptake of the language specific DBpedia endpoints and DBpedia Live.

6.4. Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data Hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

Endpoint | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
data | 1355 | 2681 | 2230 | 2547 | 3714 | 4246
ontology | 80 | 168 | 142 | 178 | 116 | 97
page | 2634 | 4703 | 1810 | 1687 | 3787 | 7377
property | 231 | 311 | 137 | 176 | 176 | 128
resource | 2976 | 4080 | 2554 | 2309 | 4436 | 7143
sparql | 2090 | 4177 | 2266 | 9112 | 15902 | 15475
other | 252 | 420 | 342 | 277 | 579 | 695
total | 9619 | 16541 | 9434 | 16286 | 28710 | 35142

Table 12: Hits per service to http://dbpedia.org in thousands

Fig 13 Traffic Linked Data versus SPARQL endpoint

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x to redirect the client to either the page (HTML based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data from the links at the bottom of the page based article. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology.

Figure 13 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focussed on the calls to the sparql endpoint and counted the number of statements per type. As the log files only record the full SPARQL query on a GET request, all the PUT requests are counted as unknown.


Statement | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
ask | 56 | 269 | 360 | 326 | 159 | 677
construct | 55 | 31 | 14 | 11 | 22 | 38
describe | 11 | 8 | 4 | 7 | 62 | 111
select | 1891 | 3663 | 1658 | 8030 | 11204 | 13516
unknown | 78 | 206 | 229 | 738 | 4455 | 1134
total | 2090 | 4177 | 2266 | 9112 | 15902 | 15475

Table 13: Hits per statement type in thousands

Statement | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
distinct | 19.5 | 11.4 | 17.3 | 19.4 | 13.3 | 25.4
filter | 45.7 | 13.7 | 31.8 | 25.3 | 29.2 | 36.1
functions | 8.8 | 6.9 | 23.5 | 21.3 | 25.5 | 25.9
geo | 27.7 | 7.0 | 39.6 | 6.7 | 9.3 | 9.2
group | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2
limit | 4.6 | 6.5 | 11.6 | 10.5 | 7.7 | 7.8
optional | 30.6 | 23.8 | 47.3 | 26.3 | 16.7 | 17.2
order | 2.3 | 1.3 | 1.9 | 3.2 | 1.2 | 1.6
union | 3.3 | 3.2 | 26.4 | 11.3 | 12.1 | 20.6

Table 14: Trends in SPARQL select (rounded values in %)

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– Use of GEO objects
– GROUP BY
– LIMIT / OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and usage in property tags. Table 14 shows the use of various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.

6.5. Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 14 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 14. Number of daily requests sent to DBpedia Live for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013

7. Use Cases and Applications

With DBpedia evolving into a hub on the Web of Data due to its coverage of various domains, it is an eminent data set for various tasks. It can not only improve Wikipedia search, but also serve as a data source for applications and mashups, as well as for text analysis and annotation tools.

7.1. Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [30]. For that purpose, DBpedia includes a number of specialized data sets22. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [12]. The thematic concepts data set of resources can be used for creating a corpus from

22 http://wiki.dbpedia.org/Datasets/NLP


Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

7.1.1. Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking tasks – refers to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [21].

As DBpedia reflects a vast amount of structured real world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [31] is an open source tool23 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base described as SPARQL queries.
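A small sketch of calling the Spotlight web service from Python is shown below; the endpoint URL, parameters and response fields reflect the commonly documented REST interface of the public service, but should be treated as assumptions here:

```python
import requests

def spotlight_annotate(text, confidence=0.4):
    """Annotate `text` with DBpedia resources via the Spotlight web service."""
    response = requests.get(
        "http://spotlight.dbpedia.org/rest/annotate",  # assumed public endpoint
        params={"text": text, "confidence": confidence},
        headers={"Accept": "application/json"},
    )
    response.raise_for_status()
    # Each annotated mention carries the URI of the disambiguated resource.
    return [r["@URI"] for r in response.json().get("Resources", [])]

print(spotlight_annotate("Berlin is the capital of Germany."))
# e.g. ['http://dbpedia.org/resource/Berlin', 'http://dbpedia.org/resource/Germany']
```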

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI24, Semantic API from Ontos25, Open Calais26 and Zemanta27, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems

23 http://spotlight.dbpedia.org
24 http://www.alchemyapi.com
25 http://www.ontos.com
26 http://www.opencalais.com
27 http://www.zemanta.com

(without disambiguation) in the context of the Apache Stanbol project28.

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki29 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [13]. The BBC30 employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge

across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia31. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series32. Several QA systems, such as TBSL [41], PowerAqua [28], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topical diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.
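To illustrate the kind of output such systems produce, the following hedged Python sketch shows a SPARQL query that a question answering system might generate for the question "Which films were directed by Steven Spielberg?", executed against the public DBpedia endpoint. The class and property names are taken from the DBpedia ontology as described in this article; the hard part, translating natural language into the query, is of course not shown.

import requests

QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?film WHERE {
  ?film a dbo:Film ;
        dbo:director dbr:Steven_Spielberg .
} LIMIT 10
"""

# The public endpoint accepts the query and a result format as request parameters.
response = requests.get(
    "http://dbpedia.org/sparql",
    params={"query": QUERY, "format": "application/sparql-results+json"},
    timeout=30,
)
for binding in response.json()["results"]["bindings"]:
    print(binding["film"]["value"])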

28 http://stanbol.apache.org
29 http://www.faviki.com
30 http://bbc.co.uk
31 http://www.aaai.org/Magazine/Watson/watson.php
32 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [25].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.

– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and in Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project33. Recently, it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia34. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of such tools and structure them into categories.

33 http://viaf.org
34 Accessed on 12.02.2013: http://www.oclc.org/research/news/2012/12-07a.html

Facet Based Browsers An award-winning35 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full text from Wikipedia in order to compute hierarchical facets [14]. Another facet-based browser, which allows creating complex graph structures of facets in a visually appealing interface and filtering them, is gFacet [15]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser36 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying The DBpedia Query Builder37 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [16] tool provides an intuitive interface which allows exploring the neighborhood of, and connections between, resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [17] allows creating statistical analysis queries and correlations in RDF data and DBpedia in particular.

Spatial Applications DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.
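As an illustration of such a filter, the following sketch shows a bounding-box query of the kind a location-aware client might issue. The geo vocabulary (geo:lat/geo:long) is the one exposed by DBpedia, while the coordinates (roughly central Berlin) are arbitrary example values; the query can be submitted to the public SPARQL endpoint like any other query.

# Bounding-box filter over DBpedia places, assuming the W3C geo vocabulary used by DBpedia.
NEARBY_PLACES_QUERY = """
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?place ?lat ?long WHERE {
  ?place a dbo:Place ;
         geo:lat ?lat ;
         geo:long ?long .
  FILTER (?lat > 52.50 && ?lat < 52.54 &&
          ?long > 13.35 && ?long < 13.43)
} LIMIT 50
"""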

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources available, written in 171 languages (of which approximately 147 can be considered active38), containing information about hundreds of spoken and even ancient

35 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/

36 http://dbpedia.org/fct
37 http://querybuilder.dbpedia.org
38 http://s23.org/wikistats/wiktionaries_html.php


languages. For example, the English Wiktionary contains nearly 3 million words39. A Wiktionary page provides, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies and, most prominently, senses. Within this tree, numerous kinds of linguistic properties are given, including synonyms, hyponyms, hyperonyms, example sentences, links to Wikipedia and many more. The fast changing nature of the project, together with its fragmentation into Wiktionary language editions (WLE) with independent layout rules, poses the biggest problem for the automated transformation into a structured knowledge base. We identified this as a serious problem: although the value of Wiktionary is known and usage scenarios are obvious, only some rudimentary tools exist, which either focus on a specific subset of the data or only cover one or two WLEs [18]. Existing tools can be seen as adapters to single WLEs: they are hard to maintain, and there are too many languages that constantly change and require a programmer to refactor complex code. As opposed to the currently employed classic and straightforward approach (implementing software adapters for scraping), Wiktionary2RDF [18] employs a declarative mediator/wrapper pattern. Figure 15 shows the workflow of the specially created Wiktionary extractor that takes as input one XML configuration file for each WLE. The aim is to enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves. A simple XML dialect was created to encode the "entry layout explained" (ELE) guidelines and declare triple patterns that define how the resulting RDF should be built. An example is given for the following wiki syntax:

===Synonyms===
* [[building]]
* [[company]]

The regular expression for scraping and the triple generation rule, encoded in XML, look as follows:

<wikiTemplate>===Synonyms===
(* [[$target]]
)+
</wikiTemplate>

<triple s="http://some.ns/$entityId"
        p="http://some.ns/hasSynonym"
        o="http://some.ns/$target" />

39 See http://en.wiktionary.org/wiki/semantic for a simple example page.

Fig. 15. Detail view of the Wiktionary extractor.

This configuration is interpreted and run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data40. Statistics are shown in Table 15.
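The following Python sketch illustrates, in a strongly simplified form, the mediator/wrapper idea behind this extractor: a declarative pattern (here a hard-coded regular expression standing in for the XML configuration above) is matched against wiki text, and each match is turned into a triple. The namespace, the property name and the entity identifier mirror the http://some.ns/ placeholders of the example; the real extractor is far more general.

import re

# Toy wiki text corresponding to the synonyms example above.
WIKI_TEXT = """===Synonyms===
* [[building]]
* [[company]]
"""

SYNONYM_BLOCK = re.compile(r"===Synonyms===\n((?:\* \[\[[^\]]+\]\]\n)+)")
SYNONYM_ITEM = re.compile(r"\* \[\[([^\]]+)\]\]")

def extract_synonym_triples(entity_id, wiki_text):
    triples = []
    for block in SYNONYM_BLOCK.findall(wiki_text):
        for target in SYNONYM_ITEM.findall(block):
            triples.append((f"http://some.ns/{entity_id}",
                            "http://some.ns/hasSynonym",
                            f"http://some.ns/{target}"))
    return triples

for triple in extract_synonym_triples("house", WIKI_TEXT):
    print(triple)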

8 Related Work

8.1 Cross Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata41. Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering the truth about things, but at providing statements about them. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011 according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

40 http://wiktionary.dbpedia.org
41 http://wikidata.org


language  words    triples   resources  predicates  senses  XML lines
en        2142237  28593364  11804039   28          424386   930
fr        4657817  35032121  20462349   22          592351   490
ru        1080156  12813437   5994560   17          149859  1449
de         701739   5618508   2966867   16          122362   671

Table 15
Statistical comparison of extractions for different languages. "XML lines" measures the number of lines of the XML configuration files.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface42 as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase43 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and on serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [32] graph database, which allows metadata for each fact to be stored efficiently. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version.

42 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
43 http://www.freebase.com

This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only be edited indirectly, by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and in areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3 YAGO

One of the projects which pursues similar goals to DBpedia is YAGO44 [39]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [20], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as from other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content, whereas YAGO focuses on extracting a smaller number of relations compared to DBpedia in order to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia

44 http://www.mpi-inf.mpg.de/yago-naga/yago/


community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been a target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects45 that are able to process wiki syntax, as well as a list of data extraction extensions46 for MediaWiki.

JWPL (Java Wikipedia Library) [46] is an open-source, Java-based API that allows accessing information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as an XML dump and is incorporated in the lexical resource UBY47 for language tools.

Several different approaches to extracting knowledge from Wikipedia are presented in [34]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguation, synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service48. Additional projects presented by the authors that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site49.

An earlier approach to improving the quality of infobox schemata and contents is described in [44].

45 http://www.mediawiki.org/wiki/Alternative_parsers
46 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction
47 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
48 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
49 http://sigwp.org/en

The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text by applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [35], the authors describe their effort of building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used, large encyclopedias Baidu Baike50 and Hudong Baike51. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieving better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [42]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

KnowItAll52 is a Web-scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [26] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [37] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [43] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [44] is a system which autonomously extracts structured data from Wikipedia and uses

50 http://baike.baidu.com
51 http://www.hudong.com
52 http://www.cs.washington.edu/research/knowitall


self-supervised linking. [2] was an infobox extraction approach for Wikipedia, which later became the DBpedia project.

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that DBpedia has matured and improved significantly in the last years, in particular also in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof of concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [18] meanwhile becoming part of DBpedia, and LinkedGeoData [38], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion An area which is still largely unexplored is the integration and fusion of different DBpedia language editions. Non-English DBpedia editions provide different and often better coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach a significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [45], we took a first step in this direction.

Inline extraction Currently, DBpedia extracts information primarily from templates. In the future, we envision also extracting semantic information from typed links. Typed links are a feature of Semantic MediaWiki, which was backported and implemented as a very lightweight extension for MediaWiki53. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representation and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other nicely. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed at certain intervals. For example, a query could check that the birth date of a person must always be before the death date, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].
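A minimal example of such a sanity check, assuming the dbo:birthDate and dbo:deathDate properties of the DBpedia ontology, is the following SPARQL query (wrapped as a Python constant so it could be scheduled by a monitoring script): any returned binding points to a person whose recorded death date precedes the birth date and should be reported back to Wikipedia editors.

# Sanity check: persons whose death date precedes their birth date.
BIRTH_BEFORE_DEATH_CHECK = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?person ?birth ?death WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER (?death < ?birth)
}
"""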

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP Despite recent advances (cf. Section 7), there is still a huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) or to update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to which entity a certain surface form should be mapped.

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Paul Kreis for DBpedia development
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.
[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.
[3] C. Becker and C. Bizer. Exploring the geospatial semantic web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.
[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.
[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.
[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. LodLive, exploring the web of data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.
[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.
[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.
[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.
[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.
[12] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.
[13] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.
[14] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.
[15] P. Heim, T. Ertl, and J. Ziegler. Facet graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.
[16] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.
[17] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.
[18] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In Proc. of the Joint International Semantic Technology Conference, 2012.
[19] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.
[20] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.
[21] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.
[22] D. Kontokostas, S. Auer, S. Hellmann, J. Lehmann, P. Westphal, R. Cornelissen, and A. Zaveri. Test-driven data quality evaluation for SPARQL endpoints. In Submitted to 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.
[23] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of linked data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.
[24] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.
[25] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.
[26] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.
[27] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational linguistics, pages 495–501, 2000.
[28] V. Lopez, M. Fernandez, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the semantic web. Semantic Web, 3(3):249–265, 2012.
[29] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.
[30] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.
[31] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.
[32] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.
[33] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.
[34] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.
[35] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin, Heidelberg, 2011. Springer-Verlag.
[36] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, November 2008.
[37] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.
[38] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.
[39] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.
[40] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.
[41] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648. ACM, 2012.
[42] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.
[43] F. Wu and D. Weld. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th World Wide Web Conference, 2008.
[44] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.
[45] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.
[46] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

Appendix

A Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net/>
PREFIX an: <http://vocab.sindice.net/analytics>
PREFIX dom: <http://sindice.com/dataspace/default/domain/>
SELECT ?Dataset ?Target_Dataset ?predicate
       SUM(xsd:long(?cardinality)) AS ?total
FROM <http://sindice.com/analytics> WHERE {
  ?target any23:domain_uri ?Target_Dataset .
  ?edge an:source ?source ;
        an:target ?target ;
        an:label ?predicate ;
        an:cardinality ?cardinality ;
        an:publishedIn ?Dataset .
  ?source any23:domain_uri ?Dataset .
  FILTER (?Dataset != ?Target_Dataset)
}

Listing 1: SPARQL query on the Sindice summary graph for obtaining the number of incoming links.



{{TemplateMapping
| mapToClass = Automobile
| mappings =
  {{PropertyMapping
  | templateProperty = name
  | ontologyProperty = foaf:name }}
  {{PropertyMapping
  | templateProperty = manufacturer
  | ontologyProperty = manufacturer }}
  {{DateIntervalMapping
  | templateProperty = production
  | startDateOntologyProperty = productionStartDate
  | endDateOntologyProperty = productionEndDate }}
  {{IntermediateNodeMapping
  | nodeClass = AutomobileEngine
  | correspondingProperty = engine
  | mappings =
    {{PropertyMapping
    | templateProperty = engine
    | ontologyProperty = displacement
    | unit = Volume }}
    {{PropertyMapping
    | templateProperty = engine
    | ontologyProperty = powerOutput
    | unit = Power }}
  }}
(...)
}}

The RDF statements that are extracted from the previous infobox example are shown below. As we can see, the production period is correctly split into a start year and an end year, and the engine is represented by a distinct RDF node:

dbr:Ford_GT40
    rdf:type dbo:Automobile ;
    rdfs:label "Ford GT40"@en ;
    dbo:manufacturer dbr:Ford_Advanced_Vehicles ;
    dbo:productionStartYear "1964"^^xsd:gYear ;
    dbo:productionEndYear "1969"^^xsd:gYear ;
    dbo:engine [
        rdf:type dbo:AutomobileEngine ;
        dbo:displacement 0.004181
    ] ;
    (...)

The DBpedia Mappings Wiki is not only used to map different templates within a single language edition of Wikipedia to the DBpedia ontology, but is used to map templates from all Wikipedia language editions to the shared DBpedia ontology. Figure 2 shows how the infobox properties author and συγγραφέας – author in Greek – are both being mapped to the global identifier dbo:author. That means, in turn, that information from all language versions of DBpedia can be merged, and DBpedias for smaller languages can be augmented with knowledge from larger DBpedias such as the English edition. Conversely, the larger DBpedia editions can benefit from more specialized knowledge from localized editions, such as data about smaller towns, which is often only present in the corresponding language edition [40].

Besides hosting the mappings and the DBpedia ontology definition, the DBpedia Mappings Wiki offers various tools which support users in their work:

– Mapping Validator: When editing a mapping, the mapping can be directly validated via a button on the edit page. This validates changes for syntactic correctness before saving them and highlights inconsistencies such as missing property definitions.

– Extraction Tester: The extraction tester, linked on each mapping page, tests a mapping against a set of example Wikipedia pages. This gives direct feedback about whether a mapping works and how the resulting data is structured.

– Mapping Tool: The DBpedia Mapping Tool is a graphical user interface that supports users in creating and editing mappings.

2.5 URI Schemes

For every Wikipedia article, the framework introduces a number of URIs to represent the concepts described on a particular page. Up to 2011, DBpedia published URIs only under the http://dbpedia.org domain. The main namespaces were:

– http://dbpedia.org/resource/ (prefix dbr) for representing article data. Apart from a few corner cases, there is a one-to-one mapping between a Wikipedia page and a DBpedia resource based on the article title. For example, for the Wikipedia article on Berlin4, DBpedia will produce the URI dbr:Berlin.

– http://dbpedia.org/property/ (prefix dbp) for representing properties extracted from the raw infobox extraction (cf. Section 2.3), e.g. dbp:population.

– http://dbpedia.org/ontology/ (prefix dbo) for representing the DBpedia ontology (cf. Section 2.4), e.g. dbo:populationTotal.

4 http://en.wikipedia.org/wiki/Berlin


Fig. 2. Depiction of the mapping from the Greek and English Wikipedia templates about books to the same DBpedia ontology class [23].

Although data from other Wikipedia language editions were extracted, they used the same namespaces. This was achieved by exploiting the Wikipedia interlanguage links5. For every page in a language other than English, the page was extracted only if it contained an interlanguage link to an English page. In that case, using the English link, the data was extracted under the English resource name (i.e. dbr:Berlin).

Recent DBpedia internationalisation developments showed that this approach resulted in less data as well as in redundant data [23]. Thus, starting from the DBpedia 3.7 release, two types of data sets were generated. The localized data sets contain all things that are described in a specific language. Within the data sets, things are identified with language-specific URIs such as http://<lang>.dbpedia.org/resource/ for article data and http://<lang>.dbpedia.org/property/ for property data. In addition, we produce a canonicalized data set for each language. The canonicalized data sets only contain things for which a corresponding page in the English edition of Wikipedia exists. Within all canonicalized data sets, the same thing is identified with the same URI from the generic, language-agnostic namespace http://dbpedia.org/resource/.
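A minimal sketch of this URI scheme, ignoring details such as percent-encoding of special characters, is given below; the helper functions are hypothetical and only illustrate how localized and canonicalized identifiers relate to article titles.

def localized_uri(lang, title):
    # Language-specific namespace used by the localized data sets.
    return "http://%s.dbpedia.org/resource/%s" % (lang, title.replace(" ", "_"))

def canonical_uri(title):
    # Generic namespace used by the canonicalized data sets.
    return "http://dbpedia.org/resource/" + title.replace(" ", "_")

print(localized_uri("de", "Berlin"))  # http://de.dbpedia.org/resource/Berlin
print(canonical_uri("Berlin"))        # http://dbpedia.org/resource/Berlin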

5 http://en.wikipedia.org/wiki/Help:Interlanguage_links

2.6 NLP Extraction

DBpedia provides a number of data sets which have been created to support Natural Language Processing (NLP) tasks [30]. Currently, four data sets are extracted: topic signatures, grammatical gender, lexicalizations and thematic concepts. While the topic signatures and the grammatical gender extractors primarily extract data from the article text, the lexicalizations and thematic concepts extractors make use of the wiki markup.

DBpedia entities can be referred to using many different names and abbreviations. The lexicalizations data set provides access to alternative names for entities and concepts, associated with several scores estimating the association strength between name and URI. These scores distinguish more common names for specific entities from rarely used ones, and also show how ambiguous a name is with respect to all possible concepts that it can denote.

The topic signatures data set enables the description of DBpedia resources based on unstructured information, as compared to the structured factual data provided by the mapping-based and raw extractors. We build a Vector Space Model (VSM) where each DBpedia resource is a point in a multidimensional space of words. Each DBpedia resource is represented by a vector, and each word occurring in Wikipedia is a dimension of this vector. Word scores are computed using the


tf-idf weight, with the intention to measure how strong the association between a word and a DBpedia resource is. Note that word stems are used in this context in order to generalize over inflected words. We use the computed weights to select the most strongly related word stems for each entity and build topic signatures [27].
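The following simplified Python sketch illustrates the idea: each resource's associated text is treated as a bag of words, words are weighted by tf-idf and the top-weighted terms form the topic signature. Stemming, the actual corpus and the scoring details of the real extractor are omitted; the example documents are made up.

import math
import re
from collections import Counter

# Toy corpus: DBpedia resource -> associated Wikipedia text (illustrative only).
DOCS = {
    "dbr:Ford_GT40": "ford gt40 race car le mans endurance race ford",
    "dbr:Berlin": "berlin capital germany city river spree city",
}

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def topic_signatures(docs, n=3):
    tokenized = {uri: Counter(tokenize(text)) for uri, text in docs.items()}
    num_docs = len(tokenized)
    # Document frequency of each word across the corpus.
    doc_freq = Counter(word for counts in tokenized.values() for word in counts)
    signatures = {}
    for uri, counts in tokenized.items():
        total = sum(counts.values())
        tfidf = {w: (c / total) * math.log(num_docs / doc_freq[w]) for w, c in counts.items()}
        # Keep the n most strongly associated words as the topic signature.
        signatures[uri] = [w for w, _ in sorted(tfidf.items(), key=lambda kv: -kv[1])[:n]]
    return signatures

print(topic_signatures(DOCS))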

There are two more feature extractors related to Natural Language Processing. The thematic concepts data set relies on Wikipedia's category system to capture the idea of a 'theme', a subject that is discussed in its articles. Many of the categories in Wikipedia are linked to an article that describes the main topic of that category. We rely on this information to mark DBpedia entities and concepts that are 'thematic', that is, they are the center of discussion for a category.

The grammatical gender data set uses a simple heuristic to decide on a grammatical gender for instances of the class Person in DBpedia. While parsing an article in the English Wikipedia, if there is a mapping from an infobox in this article to the class dbo:Person, we record the frequency of gender-specific pronouns in their declined forms (subject, object, possessive adjective, possessive pronoun and reflexive), i.e. he, him, his, himself (masculine) and she, her, hers, herself (feminine). Grammatical genders for DBpedia entities are assigned based on the dominating gender in these pronouns.
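A minimal sketch of this heuristic is shown below; the pronoun lists follow the description above, while the tie-breaking behaviour and any thresholds of the actual extractor are assumptions.

import re
from collections import Counter

MASCULINE = {"he", "him", "his", "himself"}
FEMININE = {"she", "her", "hers", "herself"}

def grammatical_gender(article_text):
    # Count declined masculine and feminine pronouns and pick the dominating gender.
    words = Counter(re.findall(r"[a-z]+", article_text.lower()))
    masc = sum(words[w] for w in MASCULINE)
    fem = sum(words[w] for w in FEMININE)
    if masc == fem:
        return None  # undecided (assumed tie-breaking behaviour)
    return "male" if masc > fem else "female"

print(grammatical_gender("She was born in Paris. Her early work made her famous."))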

2.7 Summary of Other Recent Developments

In this section, we summarize the improvements of the DBpedia extraction framework since the publication of the previous DBpedia overview article [5] in 2009. One of the major changes on the implementation level is that the extraction framework was rewritten in Scala in 2010, improving the efficiency of the extractors by an order of magnitude compared to the previous PHP-based framework. The new, more modular framework also allows extracting data from tables in Wikipedia pages and supports extraction from multiple MediaWiki templates per page. Another significant change was the creation and utilization of the DBpedia Mappings Wiki as described earlier. Further significant changes include the mentioned NLP extractors and the introduction of URI schemes.

In addition, there were several smaller improvements and general maintenance. Overall, over the past four years, the parsing of the MediaWiki markup improved quite a lot, which led to better overall coverage, for example concerning references and parser functions. In addition, the collection of MediaWiki namespace identifiers for many languages is now performed semi-automatically, leading to a high accuracy of detection. This concerns common title prefixes such as User, File, Template, Help, Portal, etc. in English that indicate pages that do not contain encyclopedic content and would produce noise in the data. They are important for specific extractors as well; for instance, the category hierarchy data set (SKOS) is produced from pages of the Category namespace. Furthermore, the output of the extraction system now supports more formats, and several compliance issues regarding URIs, IRIs, N-Triples and Turtle were fixed.

The individual data extractors have been improved as well, in both number and quality, in many areas. The abstract extraction was enhanced, producing more accurate plain text representations of the beginning of Wikipedia article texts. More diverse and more specific datatypes now exist (e.g. many currencies and XSD datatypes such as xsd:gYearMonth, xsd:positiveInteger, etc.), and for a number of classes and properties specific datatypes were added (e.g. inhabitants/km² for the population density of populated places and m³/s for the discharge of rivers). Many issues related to data parsers were resolved, and the quality of the owl:sameAs data set for multiple language versions was increased by an implementation that takes bijective relations into account.

There are also further extractors, e.g. for Wikipedia page IDs and revisions. Moreover, redirect and disambiguation extractors were introduced and improved. For the redirect data, the transitive closure is computed, while taking care of catching cycles in the links. The redirects also help regarding infobox coverage in the mapping-based extraction by resolving alternative template names. Moreover, in the past, if an infobox value pointed to a redirect, this redirection was not properly resolved and thus resulted in RDF links that led to URIs which did not contain any further information. Resolving redirects affected approximately 15% of all links and hence increased the overall interconnectivity of resources in the DBpedia ontology.
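The following sketch illustrates redirect resolution with cycle detection on a toy redirect map (the page titles are invented); the real framework computes the transitive closure over the complete redirects data set.

# Toy redirect map: source page title -> redirect target.
REDIRECTS = {
    "Big Apple": "NYC",
    "NYC": "New York City",
    "A": "B", "B": "C", "C": "A",   # an artificial redirect cycle
}

def resolve(page, redirects):
    # Follow the redirect chain to its final target; give up when a cycle is found.
    seen = {page}
    while page in redirects:
        page = redirects[page]
        if page in seen:
            return None  # cycle: leave the link unresolved
        seen.add(page)
    return page

print(resolve("Big Apple", REDIRECTS))  # New York City
print(resolve("A", REDIRECTS))          # None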

Finally, a new heuristic to increase the connectiveness of DBpedia instances was introduced. If an infobox contains a string value that is not linked to another Wikipedia article, the extraction framework searches for hyperlinks in the same Wikipedia article that have the same anchor text as the infobox value string. If such a link exists, the target of that link is used to replace the string value in the infobox. This method further increases the number of object property assertions in the DBpedia ontology.
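A simplified sketch of this heuristic is given below; the wiki text and infobox value are invented, and the link-parsing regular expression is deliberately naive.

import re

ARTICLE_WIKITEXT = ("The car was designed by [[Ford Advanced Vehicles|Ford's racing arm]] "
                    "and later by [[Carroll Shelby]].")

def link_targets_by_anchor(wikitext):
    # Map each link's anchor text (or its target, if no anchor is given) to the link target.
    targets = {}
    for target, anchor in re.findall(r"\[\[([^\]|]+)\|?([^\]]*)\]\]", wikitext):
        targets[(anchor or target).strip()] = target.strip()
    return targets

def resolve_infobox_value(value, wikitext):
    # Replace a plain string value by a link target with the same anchor text, if one exists.
    return link_targets_by_anchor(wikitext).get(value, value)

print(resolve_infobox_value("Ford's racing arm", ARTICLE_WIKITEXT))  # Ford Advanced Vehicles
print(resolve_infobox_value("Carroll Shelby", ARTICLE_WIKITEXT))     # Carroll Shelby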


Fig. 3. Snapshot of a part of the DBpedia ontology. The figure shows the top classes (owl:Thing, dbo:Place, dbo:PopulatedPlace, dbo:Settlement, dbo:Agent, dbo:Person, dbo:Athlete, dbo:Organisation, dbo:Species, dbo:Eukaryote, dbo:Work) connected by rdfs:subClassOf edges, together with selected object properties (e.g. dbo:birthPlace, dbo:writer, dbo:producer, dbo:location) and datatype properties (e.g. dbo:populationTotal, dbo:areaTotal, dbo:birthDate, dbo:deathDate, dbo:elevation).

3 DBpedia Ontology

The DBpedia ontology consists of 320 classes, which form a subsumption hierarchy and are described by 1,650 different properties. With a maximal depth of 5, the subsumption hierarchy is intentionally kept rather shallow, which fits use cases in which the ontology is visualized or navigated. Figure 3 depicts a part of the DBpedia ontology, indicating the relations among the top ten classes of the DBpedia ontology, i.e. the classes with the highest number of instances.

Fig. 4. Growth of the DBpedia ontology: number of classes and properties per DBpedia version (3.2 to 3.8).

The DBpedia ontology is maintained and extended by the community in the DBpedia Mappings Wiki. Figure 4 depicts the growth of the DBpedia ontology over time. While the number of classes is not growing much, due to the already good coverage of the initial version of the ontology, the number of properties increases over time due to the collaboration on the DBpedia Mappings Wiki and the addition of more detailed information to infoboxes by Wikipedia editors.

3.1 Mapping Statistics

As of April 2013, there exist mapping communities for 27 languages, 23 of which are active. Figure 5 shows statistics for the coverage of these mappings in DBpedia. Figures (a) and (c) refer to the absolute number of template and property mappings that are defined for every DBpedia language edition. Figures (b) and (d) depict the percentage of the defined template and property mappings compared to the total number of available templates and properties for every Wikipedia language edition. Figures (e) and (g) show the occurrences (instances) that the defined template and property mappings have in Wikipedia. Finally, figures (f) and (h) give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.

It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage. Other language editions, such as Bulgarian, Dutch, English, Greek, Polish and Spanish, have mapped templates covering more than 50% of total template occurrences. In addition, almost all languages have covered more than 20% of property occurrences, with Bulgarian and Portuguese reaching up to 70%.


Fig. 5. Mapping coverage statistics for all mapping-enabled languages.

The mapping activity of the ontology enrichment process, along with the editing activity of the ten most active mapping language communities, is depicted in Figure 6. It is interesting to notice that the high mapping activity peaks coincide with the DBpedia release dates. For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarters of that year show a very high activity compared to the 4th quarter. In the last two years (2012 and 2013), most of the DBpedia mapping language communities have founded their own chapters and have their own release dates. Thus, recent mapping activity shows less fluctuation.

Finally, Figure 7 shows the occurrence frequency of the English property mappings. Both axes are in log scale and represent the number of property mappings (x axis) that have exactly y occurrences (y axis). The occurrence frequency follows a long tail distribution: a low number of property mappings have a high number of occurrences, and a high number of property mappings have a low number of occurrences.

3.2 Instance Data

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding


Fig. 6. Mapping community activity for (a) the ontology and (b) the 10 most active language editions.

Fig. 7. English property mappings occurrence frequency (both axes are in log scale).

language. For 20 of these languages, we report in this section the overall number of entities being described by the localized versions, as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions, as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5); CD = Canonicalized data sets (see Section 2.5); all = Overall number of instances in the data set, including instances without infobox data; with MD = Number of instances for which mapping-based infobox data exists; Raw Properties = Number of different properties that are generated by the raw infobox extractor; Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor; Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor; Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language correspond to instances in the English version of DBpedia, and which portion of the instances is described by clean mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results on the one hand from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have been created so far in the Mappings Wiki for a specific language.

Table 3 reports the number of instances for a set of selected ontology classes within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to a specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages; the second column contains the number of instances that are described only in a single language version; the next column contains the number of instances that are contained in two languages but not in three or more languages, etc. For example, 12936 persons are described in five languages but not in six


     Inst. LD all  Inst. CD all  Inst. with MD CD  Raw Prop. CD  Map. Prop. CD  Raw Statem. CD  Map. Statem. CD
en   3769926       3769926       2359521           48293         1313           65143840        33742015
de   1243771        650037        204335            9593          261            7603562         2880381
fr   1197334        740044        214953           13551          228            8854322         2901809
it    882127        580620        383643            9716          181           12227870         4804731
es    879091        542524        310348           14643          476            7740458         4383206
pl    848298        538641        344875            7306          266            7696193         4511794
ru    822681        439605        123011           13522           76            6973305         1389473
pt    699446        460258        272660           12851          602            6255151         4005527
ca    367362        241534        112934            8696          183            3689870         1301868
cs    225133        148819         34893            5564          334            1857230          474459
hu    209180        138998         63441            6821          295            2506399          601037
ko    196132        124591         30962            7095          419            1035606          417605
tr    187850        106644         40438            7512          440            1350679          556943
ar    165722        103059         16236            7898          268             635058          168686
eu    132877        108713         41401            2245           19            2255897          532709
sl    129834         73099         22036            4235          470            1213801          222447
bg    125762         87679         38825            3984          274             774443          488678
hr    109890         71469         10343            3334          158             701182          151196
el     71936         48260         10813            2866          288             206460          113838

Table 2
Basic statistics about Localized DBpedia Editions.

              en      it      pl      es      pt      fr      de      ru      ca     hu
Person        763643  145060  70708   65337   43057   62942   33122   18620   7107   15529
  Athlete     185126  47187   30332   19482   14130   21646   31237   0       721    4527
  Artist      61073   12511   16120   25992   10571   13465   0       0       2004   3821
  Politician  23096   0       8943    5513    3342    0       0       12004   1376   760
Place         572728  141101  182727  132961  116660  80602   131766  67932   73078  18324
  PopulPlace  387166  138077  167034  121204  109418  72252   79410   63826   72743  15535
  Building    60514   1270    2946    3570    803     921     83      43      0      527
  River       24267   0       1383    2       4149    3333    6707    3924    0      565
Organisation  192832  4142    12193   11710   10949   17513   16973   1598    1399   3993
  Company     44516   4142    2566    975     1903    5832    7200    0       440    618
  EducInst    42270   0       599     1207    270     1636    2171    1010    0      115
  Band        27061   0       2993    0       4476    3868    5368    0       263    802
Work          333269  51918   32386   36484   33869   39195   18195   34363   5240   9777
  MusicWork   159070  23652   14987   15911   17053   16206   0       6588    697    4704
  Film        71715   17210   9467    9396    8896    9741    15038   12045   3859   2320
  Software    27947   5682    3050    4833    3419    5733    2368    0       606    857

Table 3
Number of instances per class within 10 localized DBpedia versions.

or more languages. The number 871630 for the class Person means that all 20 language versions together describe 871630 different persons. The number is higher than the number of persons described in the canonicalized English infobox data set (763643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195263 persons are described in two or more languages. The large difference of this number compared to the total number of 871630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).


Class         Instances  1       2       3      4      5      6      7      8      9      10+
Person        871630     676367  94339   42382  21647  12936  8198   5295   3437   2391   4638
Place         643260     307729  150349  45836  36339  20831  13523  20808  31422  11262  5161
Organisation  206670     160398  22661   9312   5002   3221   2072   1421   928    594    1061
Work          360808     243706  54855   23097  12605  8277   5732   4007   2911   1995   3623

Table 4: Cross-language overlap: Number of instances that are described in multiple languages.

3.3 Internationalisation Community

The introduction of the mapping-based infobox extractor alongside live synchronisation approaches in [19] allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages.6 The DBpedia 3.7 release7 in September 2011 was the first DBpedia release to use the localized I18n DBpedia extraction framework [23].

At the time of writing, official DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish.8 Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services. Recently, the DBpedia internationalisation committee9 has formalised its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek10 and Dutch11), the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [23].

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia.

6 ar, bg, bn, ca, cs, de, el, en, es, et, eu, fr, ga, hi, hr, hu, id, it, ja, ko, nl, pl, pt, ru, sl, tr, ur

7 http://blog.dbpedia.org/2011/09/11/
8 Accessed on 25/02/2013: http://wiki.dbpedia.org/Internationalization/Chapters
9 http://wiki.dbpedia.org/Internationalization
10 http://el.dbpedia.org
11 http://nl.dbpedia.org

These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage. The mapping statistics have also been used as a way to promote a healthy competition between language editions: a sprint page was created with bar charts that show how close each language is to achieving total coverage (cf. Figure 5), and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

4 Live Synchronisation

Wikipedia articles are continuously revised at a very high rate, e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute.12 This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [19,33]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported.13 Meanwhile, DBpedia Live for Dutch14 was developed.

12 http://stats.wikimedia.org/EN/SummaryEN.htm
13 http://live.dbpedia.org
14 http://live.nl.dbpedia.org


4.1 DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia foundation kindly provided us access to their update stream using the OAI-PMH protocol [24]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component serving as a proxy constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.

– Mappings Wiki: The DBpedia Mappings Wiki described in Section 2.4 serves as secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.

– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.

– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2 Features of DBpedia Live

The core components of the DBpedia Live extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.

– Unmodified Pages: The update of unmodified pages at regular intervals.

– Changesets Publication: The publication of triple-changesets.

– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.

– Data Isolation: Separate data from different sources.

Fig. 8. Overview of the DBpedia Live extraction framework.

Mapping-Affected Pages. Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.

Unmodified Pages. Extraction framework improvements or the activation / deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feed and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets. Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set for the added triples and another set for the deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.

Synchronisation Tool. The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.
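To illustrate the principle, the deleted and added triples of a changeset can be applied to a mirror with a pair of SPARQL Update operations along the following lines (a minimal, hypothetical sketch: the resource and population values shown are made up, and the actual tool issues the equivalent operations programmatically):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# remove the triples listed in the "deleted" part of the changeset
DELETE DATA {
  dbr:Berlin dbo:populationTotal "3499879"^^xsd:nonNegativeInteger .
} ;
# add the triples listed in the "added" part of the changeset
INSERT DATA {
  dbr:Berlin dbo:populationTotal "3515473"^^xsd:nonNegativeInteger .
}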

Data Isolation. In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and mapping-based infobox extraction.
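A client interested in only one of these sources can therefore restrict its query to the corresponding named graph. The following sketch (a hypothetical example query) retrieves the English abstract of a resource from the live article-update graph only:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?abstract WHERE {
  GRAPH <http://live.dbpedia.org> {
    dbr:Berlin dbo:abstract ?abstract .
    FILTER (lang(?abstract) = "en")
  }
}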

5 Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as the external data sets that set links pointing at DBpedia resources.

5.1 Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third party data sets. The DBpedia project maintains a link repository15 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third column contain the predicate that is used for linking as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

Links in DBpedia have been used for various purposes.

15 https://github.com/dbpedia/dbpedia-links

Data set                  Predicate               Count       Tool
Amsterdam Museum          owl:sameAs              627         S
BBC Wildlife Finder       owl:sameAs              444         S
Book Mashup               rdf:type, owl:sameAs    9 100
Bricklink                 dc:publisher            10 100
CORDIS                    owl:sameAs              314         S
Dailymed                  owl:sameAs              894         S
DBLP Bibliography         owl:sameAs              196         S
DBTune                    owl:sameAs              838         S
Diseasome                 owl:sameAs              2 300       S
Drugbank                  owl:sameAs              4 800       S
EUNIS                     owl:sameAs              3 100       S
Eurostat (Linked Stats)   owl:sameAs              253         S
Eurostat (WBSG)           owl:sameAs              137
CIA World Factbook        owl:sameAs              545         S
flickr wrappr             dbp:hasPhotoCollection  3 800 000   C
Freebase                  owl:sameAs              3 600 000   C
GADM                      owl:sameAs              1 900
GeoNames                  owl:sameAs              86 500      S
GeoSpecies                owl:sameAs              16 000      S
GHO                       owl:sameAs              196         L
Project Gutenberg         owl:sameAs              2 500       S
Italian Public Schools    owl:sameAs              5 800       S
LinkedGeoData             owl:sameAs              103 600     S
LinkedMDB                 owl:sameAs              13 800      S
MusicBrainz               owl:sameAs              23 000
New York Times            owl:sameAs              9 700
OpenCyc                   owl:sameAs              27 100      C
OpenEI (Open Energy)      owl:sameAs              678         S
Revyu                     owl:sameAs              6
Sider                     owl:sameAs              2 000       S
TCMGeneDIT                owl:sameAs              904
UMBEL                     rdf:type                896 400
US Census                 owl:sameAs              12 600
WikiCompany               owl:sameAs              8 300
WordNet                   dbp:wordnet_type        467 100
YAGO2                     rdf:type                18 100 000

Sum                                               27 211 732

Table 5: Data sets linked from DBpedia.

One example is the combination of data about European Union project funding (FTS) [29] and data about countries in DBpedia. The following query compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia):

SELECT * {
  { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry } }
  { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear } }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

In addition to providing outgoing links on instance-level, DBpedia also sets links on schema-level pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively. In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
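These schema-level links can be inspected directly via the public SPARQL endpoint; a minimal sketch (a hypothetical example query, assuming the standard owl prefix) that lists DBpedia ontology classes together with their Schema.org equivalents is:

PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?dbpediaClass ?schemaOrgClass WHERE {
  ?dbpediaClass owl:equivalentClass ?schemaOrgClass .
  FILTER (STRSTARTS(STR(?schemaOrgClass), "http://schema.org/"))
}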

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of links pointing at DBpedia from other data sets is 39007478 according to the Data Hub.16 However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [36]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data set of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered.

16 See http://wiki.dbpedia.org/Interlinking for details.

Data set        Link Predicate Count  Link Count
okaboo.com      4                     2407121
tfri.gov.tw     57                    837459
naplesplus.us   2                     212460
fu-berlin.de    7                     145322
freebase.com    108                   138619
geonames.org    3                     125734
opencyc.org     3                     19648
geospecies.org  10                    16188
dbrec.net       3                     12856
faviki.com      5                     12768

Table 6: Top 10 data sets in Sindice ordered by the number of links to DBpedia.

Metric                     Value
Total links                3960212
Total distinct data sets   248
Total distinct predicates  345

Table 7: Sindice summary statistics for incoming links to DBpedia.

Sindice computes a graph summary over all resources they store. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70 of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing at DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia, along with the used link predicate and the number of links.

It should be noted that the data in Sindice is not complete, for instance it does not contain all data sets that are catalogued by the DataHub.17 However, it crawls for RDFa snippets, converts microformats etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster (see appendix for details). Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data, and DBpedia is currently ranked second in terms of incoming links.

17 http://datahub.io


domain               datasets  links
purl.org             498       6717520
dbpedia.org          248       3960212
creativecommons.org  2483      3030910
identi.ca            1021      2359276
l3s.de               34        1261487
rkbexplorer.com      24        1212416
nytimes.com          27        1174941
w3.org               405       658535
geospecies.org       13        523709
livejournal.com      14881     366025

Table 8: Top 10 datasets by incoming links in Sindice.

Fig. 9. The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, per quarter of 2012.

6 DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint and, third, it provides dereferencable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last couple of years.

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to the download popularity as well. The top five languages with respect to the download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language edition is indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 18 TB per quarter or 6 TB per month is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity.

DBpedia    Configuration
3.3 - 3.4  AMD Opteron 8220, 2.80 GHz, 4 cores, 32 GB
3.5 - 3.7  Intel Xeon E5520, 2.27 GHz, 8 cores, 48 GB
3.8        Intel Xeon E5-2630, 2.30 GHz, 8 cores, 64 GB

Table 9: Hardware of the machines serving the public SPARQL endpoint.

The download count and the download volume of each data set during the year 2012 are depicted in Figure 10. In those statistics we filtered out all IP addresses which requested a file more than 1000 times per month.18 Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Supposedly, they are used for network analysis or for providing relevant links in user interfaces, and they are downloaded more often as they are not provided via the official SPARQL endpoint.

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint19 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports infinite horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection of any size. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful hardware, as shown in Table 9.

The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally use too many resources.

18 The IP address was only filtered for that specific file and month in those cases.

19 http://dbpedia.org/sparql


Fig. 10. The download count (in thousands) and download volume (in TB) of the DBpedia data sets.


Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10 MB) are enforced. If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code,20 like 509, and quickly drops the connection. The system further uses an iptables based firewall for permanent blocking of clients identified by their IP addresses.

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.21.21 Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the last couple of DBpedia releases. The AvgDay column represents the average number of hits (resp. visits) per day, followed by the Median and Standard Deviation.

20 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
21 http://www.webalizer.org

DBpedia  AvgDay   Median   Stdev    Maximum
3.3      733811   711306   188991   1319539
3.4      1212549  1165893  351226   2371657
3.5      1122612  1035444  386454   2908720
3.6      1328355  1286750  256945   2495031
3.7      2085399  1930728  1057398  8873473
3.8      2910410  2717775  1085640  7678490

Table 10: Number of SPARQL endpoint hits.

Fig. 11. Average number of SPARQL endpoint requests per day.

The last column shows the maximum number of hits (visits) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30 minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30 minute interval.


DBpedia  AvgDay  Median  Stdev  Maximum
3.3      9750    9775    2036   13126
3.4      11329   11473   1546   14198
3.5      16592   16902   2969   23129
3.6      19471   17664   5691   56064
3.7      23972   22262   10741  127869
3.8      16851   16711   2960   27416

Table 11: Number of SPARQL endpoint visits.

Fig. 12. Average SPARQL endpoint visits per day.

Figure 11 and Figure 12 show the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data Hub returning resources in a number of different formats. For each data set we randomly selected 14 days' worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

Endpoint   3.3   3.4    3.5   3.6    3.7    3.8
data       1355  2681   2230  2547   3714   4246
ontology   80    168    142   178    116    97
page       2634  4703   1810  1687   3787   7377
property   231   311    137   176    176    128
resource   2976  4080   2554  2309   4436   7143
sparql     2090  4177   2266  9112   15902  15475
other      252   420    342   277    579    695
total      9619  16541  9434  16286  28710  35142

Table 12: Hits per service to http://dbpedia.org in thousands.

Fig. 13. Traffic: Linked Data versus SPARQL endpoint.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x to redirect the client to either the page (HTML based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data from the links at the bottom of the page-based article. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology.

Figure 13 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13 we focussed on the calls to the sparql endpoint and counted the number of statements per type. As the log files only record the full SPARQL query on a GET request, all the POST requests are counted as unknown.


Statement  3.3   3.4   3.5   3.6   3.7    3.8
ask        56    269   360   326   159    677
construct  55    31    14    11    22     38
describe   11    8     4     7     62     111
select     1891  3663  1658  8030  11204  13516
unknown    78    206   229   738   4455   1134
total      2090  4177  2266  9112  15902  15475

Table 13: Hits per statement type in thousands.

Statement  3.3   3.4   3.5   3.6   3.7   3.8
distinct   19.5  11.4  17.3  19.4  13.3  25.4
filter     45.7  13.7  31.8  25.3  29.2  36.1
functions  8.8   6.9   23.5  21.3  25.5  25.9
geo        27.7  7.0   39.6  6.7   9.3   9.2
group      0.0   0.0   0.0   0.0   0.0   0.2
limit      4.6   6.5   11.6  10.5  7.7   7.8
optional   30.6  23.8  47.3  26.3  16.7  17.2
order      2.3   1.3   1.9   3.2   1.2   1.6
union      3.3   3.2   26.4  11.3  12.1  20.6

Table 14: Trends in SPARQL select queries (rounded values in %).

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– Use of GEO objects
– GROUP BY
– LIMIT / OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
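For illustration, a typical select query of the kind counted in this analysis might combine several of these constructs; the following is a hypothetical example (not taken from the logs) that uses DISTINCT, OPTIONAL, FILTER, ORDER BY and LIMIT against the DBpedia ontology:

PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX dbr:  <http://dbpedia.org/resource/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?person ?name ?birthDate WHERE {
  ?person a dbo:Person ;
          rdfs:label ?name ;
          dbo:birthPlace dbr:Leipzig .
  OPTIONAL { ?person dbo:birthDate ?birthDate }
  FILTER (lang(?name) = "en")
}
ORDER BY ?name
LIMIT 100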

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 14 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 14. Number of daily requests sent to DBpedia Live for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013.

7 Use Cases and Applications

With DBpedia evolving into a hub on the Web of Data due to its coverage of various domains, it is an eminent data set for various tasks. It can not only improve Wikipedia search, but also serve as a data source for applications and mashups, as well as for text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [30]. For that purpose, DBpedia includes a number of specialized data sets.22 For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [12].

22 http://wiki.dbpedia.org/Datasets/NLP


The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

7.1.1 Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation, also known as key phrase extraction and entity linking, refers to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [21].

As DBpedia reflects a vast amount of structured real-world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [31] is an open source tool23 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base described as SPARQL queries.
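As a sketch of the latter option, a restriction such as "only annotate soccer players born in Germany" could be expressed along the following lines (a hypothetical example; the class and properties used are taken from the DBpedia ontology):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?resource WHERE {
  ?resource a dbo:SoccerPlayer ;
            dbo:birthPlace ?place .
  ?place dbo:country dbr:Germany .
}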

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI,24 Semantic API from Ontos,25 Open Calais26 and Zemanta,27 among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project.28

23 http://spotlight.dbpedia.org
24 http://www.alchemyapi.com
25 http://www.ontos.com
26 http://www.opencalais.com
27 http://www.zemanta.com


Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki29 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [13]. The BBC30 employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system. The Watson system won a $1 million prize in Jeopardy and relies on several data sets including DBpedia.31 DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series.32 Several QA systems, such as TBSL [41], PowerAqua [28], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems. Due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.
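To illustrate the task, such a system typically has to translate a natural language question like "Which books were written by Dan Brown?" into a SPARQL query against DBpedia roughly along these lines (a hypothetical example, not taken from the QALD benchmark):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?book WHERE {
  ?book a dbo:Book ;
        dbo:author dbr:Dan_Brown .
}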

28 http://stanbol.apache.org
29 http://www.faviki.com
30 http://bbc.co.uk
31 http://www.aaai.org/Magazine/Watson/watson.php
32 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property from mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [25].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.

– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project.33 Recently, it was announced that VIAF added a total of 250000 reciprocal authority links to Wikipedia.34 These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of such tools and structure them in categories.

33 http://viaf.org
34 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html

Facet Based Browsers. An award-winning35 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [14]. Another facet based browser, which allows to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [15]. A generic SPARQL based facet explorer, which also uses a graph based visualisation of facets, is LODLive [7]. The OpenLink built-in facet based browser36 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying. The DBpedia Query Builder37 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [16] tool provides an intuitive interface which allows to explore the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [17] allows to create statistical analysis queries and correlations in RDF data and DBpedia in particular.

Spatial Applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources available, written in 171 languages (of which approximately 147 can be considered active38), containing information about hundreds of spoken and even ancient languages.

35 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
36 http://dbpedia.org/fct
37 http://querybuilder.dbpedia.org
38 http://s23.org/wikistats/wiktionaries_html.php


For example, the English Wiktionary contains nearly 3 million words.39 A Wiktionary page provides, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies, and most prominently senses. Within this tree, numerous kinds of linguistic properties are given, including synonyms, hyponyms, hyperonyms, example sentences, links to Wikipedia and many more. The fast changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, poses the biggest problem to the automated transformation into a structured knowledge base. We identified this as a serious problem: although the value of Wiktionary is known and usage scenarios are obvious, only some rudimentary tools exist, which either focus on a specific subset of the data or only cover one or two WLEs [18]. Existing tools can be seen as adapters to single WLEs; they are hard to maintain and there are too many languages that constantly change and require a programmer to refactor complex code. Opposed to the currently employed classic and straightforward approach (implementing software adapters for scraping), Wiktionary2RDF [18] employs a declarative mediator/wrapper pattern. Figure 15 shows the work flow of the specially created Wiktionary extractor that takes as input one config XML file for each WLE. The aim is to enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves. A simple XML dialect was created to encode the "entry layout explained" (ELE) guidelines and declare triple patterns that define how the resulting RDF should be built. An example is given for the following wiki syntax:

===Synonyms===
[[building]]
[[company]]

The regular expression for scraping and the triple generation rule encoded in XML look like this:

<wikiTemplate>===Synonyms===
( [[$target]]
)+
</wikiTemplate>

<triple s="http://some.ns/$entityId" p="http://some.ns/hasSynonym" o="http://some.ns/$target" />

39 See http://en.wiktionary.org/wiki/semantic for a simple example page.

Fig. 15. Detail view of the Wiktionary extractor.

This configuration is interpreted and run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data.40 Statistics are shown in Table 15.

8 Related Work

8.1 Cross Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, the Wikimedia Germany e.V. started the development of Wikidata.41

Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects, and allows for central access to the data in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering the truth about things, but providing statements about them. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011 according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

40 http://wiktionary.dbpedia.org
41 wikidata.org


language  words    triples   resources  predicates  senses  XML lines
en        2142237  28593364  11804039   28          424386  930
fr        4657817  35032121  20462349   22          592351  490
ru        1080156  12813437  5994560    17          149859  1449
de        701739   5618508   2966867    16          122362  671

Table 15: Statistical comparison of extractions for different languages. XML lines measures the number of lines of the XML configuration files.


At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface42 as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase43 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF.

Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [32] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked and the system can easily revert to a previous version.

42 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
43 http://www.freebase.com

This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google and areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3 YAGO

One of the projects which pursues similar goals to DBpedia is YAGO44 [39].

YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet. WordNet synsets are represented as classes and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [20], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content, whereas YAGO focuses on extracting a smaller number of relations compared to DBpedia to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

44 http://www.mpi-inf.mpg.de/yago-naga/yago/



The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects45 that are able to process wiki syntax, as well as a list of data extraction extensions46 for MediaWiki.

JWPL (Java Wikipedia Library) [46] is an open-source Java-based API that allows to access information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki Markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as XML dump and is incorporated in the lexical resource UBY47 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [34]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguations or synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service.48 Additional projects presented by the authors that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site.49

An earlier approach to improve the quality of the infobox schemata and contents is described in [44].

45 http://www.mediawiki.org/wiki/Alternative_parsers
46 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction
47 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
48 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
49 http://sigwp.org/en

The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text by applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [35], the authors describe their effort building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike50 and Hudong Baike.51 Apart from utilizing MediaWiki and HTML Markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve a better cross-lingual knowledge-linkage beyond the use of Wikipedia interlanguage links is presented in [42]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

KnowItAll52 is a web scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [26] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [37] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [43] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [44] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking. [2] was an infobox extraction approach for Wikipedia which later became the DBpedia project.

50 http://baike.baidu.com
51 http://www.hudong.com
52 http://www.cs.washington.edu/research/knowitall



9 Conclusions and Future Work

In this system report, we presented an overview on recent advances of the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that DBpedia matured and improved significantly in the last years, in particular also in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [18] meanwhile becoming part of DBpedia, and LinkedGeoData [38], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach a significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement. In the future we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [45], we took a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki, which was backported and implemented as a very lightweight extension for MediaWiki.53 If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other nicely. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, which identify data quality issues and are executed in certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].
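A minimal sketch of such a sanity check (a hypothetical example, assuming the DBpedia ontology properties dbo:birthDate and dbo:deathDate) returns all persons whose recorded death date lies before their birth date:

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?person ?birth ?death WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER (?death < ?birth)
}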

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP. Despite recent advances (cf. Section 7), there is still a huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (depends on training data) and update (requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationship between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Paul Kreis for DBpedia development
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.

[3] C. Becker and C. Bizer. Exploring the geospatial semantic web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the web of data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.

[12] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[13] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[14] R Hahn C Bizer C Sahnwaldt C Herta S RobinsonM Burgle H Duwiger and U Scheel Faceted wikipediasearch In W Abramowicz and R Tolksdorf editors BusinessInformation Systems 13th International Conference BIS 2010Berlin Germany May 3-5 2010 Proceedings volume 47 of

28 Lehmann et al DBpedia

Lecture Notes in Business Information Processing pages 1ndash11Springer 2010

[15] P Heim T Ertl and J Ziegler Facet graphs Complex seman-tic querying made easy In Proceedings of the 7th Extended Se-mantic Web Conference (ESWC 2010) volume 6088 of LNCSpages 288ndash302 BerlinHeidelberg 2010 Springer

[16] P Heim S Hellmann J Lehmann S Lohmann and T Stege-mann Relfinder Revealing relationships in RDF knowledgebases In T-S Chua Y Kompatsiaris B Merialdo W HaasG Thallinger and W Bailer editors Semantic Multimedia4th International Conference on Semantic and Digital MediaTechnologies SAMT 2009 Graz Austria December 2-4 2009Proceedings volume 5887 of Lecture Notes in Computer Sci-ence pages 182ndash187 Springer 2009

[17] P Heim S Lohmann D Tsendragchaa and T Ertl Semlensvisual analysis of semantic data with scatter plots and semanticlenses In C Ghidini A-C N Ngomo S N Lindstaedt andT Pellegrini editors Proceedings the 7th International Con-ference on Semantic Systems I-SEMANTICS 2011 Graz Aus-tria September 7-9 2011 ACM International Conference Pro-ceeding Series pages 175ndash178 ACM 2011

[18] S Hellmann J Brekle and S Auer Leveraging the crowd-sourcing of lexical resources for bootstrapping a linguistic datacloud In Proc of the Joint International Semantic TechnologyConference 2012

[19] S Hellmann C Stadler J Lehmann and S Auer DBpedialive extraction In Proc of 8th International Conferenceon Ontologies DataBases and Applications of Semantics(ODBASE) volume 5871 of Lecture Notes in Computer Sci-ence pages 1209ndash1223 2009

[20] J Hoffart F M Suchanek K Berberich and G WeikumYAGO2 A spatially and temporally enhanced knowledge basefrom wikipedia Artif Intell 19428ndash61 2013

[21] H Ji R Grishman H T Dang K Griffitt and J EllisOverview of the TAC 2010 knowledge base population trackIn Third Text Analysis Conference (TAC 2010) 2010

[22] D Kontokostas S Auer S Hellmann J Lehmann P West-phal R Cornelissen and A Zaveri Test-driven data qualityevaluation for sparql endpoints In Submitted to 12th Interna-tional Semantic Web Conference 21-25 October 2013 SydneyAustralia 2013

[23] D Kontokostas C Bratsas S Auer S Hellmann I Antoniouand G Metakides Internationalization of linked data The caseof the greek dbpedia edition Web Semantics Science Servicesand Agents on the World Wide Web 15(0)51 ndash 61 2012

[24] C Lagoze H V de Sompel M Nelson and S WarnerThe open archives initiative protocol for metadata har-vesting httpwwwopenarchivesorgOAIopenarchivesprotocolhtml 2008

[25] J Lehmann S Monahan L Nezda A Jung and Y Shi LCCapproaches to knowledge base population at TAC 2010 InProceedings TAC Workshop 2010

[26] D B Lenat Cyc A large-scale investment in knowledgeinfrastructure Communications of the ACM 38(11)33ndash381995

[27] C-Y Lin and E H Hovy The automated acquisition oftopic signatures for text summarization In Proceedings of the18th conference on Computational linguistics pages 495ndash5012000

[28] V Lopez M Fernandez E Motta and N Stieler PoweraquaSupporting users in querying and exploring the semantic web

Semantic Web 3(3)249ndash265 2012[29] M Martin C Stadler P Frischmuth and J Lehmann Increas-

ing the financial transparency of european commission projectfunding Semantic Web Journal Special Call for LinkedDataset descriptions 2013

[30] P N Mendes M Jakob and C Bizer DBpedia for NLP -a multilingual cross-domain knowledge base In Proceedingsof the International Conference on Language Resources andEvaluation (LREC) Istanbul Turkey 2012

[31] P N Mendes M Jakob A Garcıa-Silva and C BizerDBpedia Spotlight Shedding light on the web of documentsIn Proceedings of the 7th International Conference on Seman-tic Systems (I-Semantics) Graz Austria 2011

[32] S M Meyer J Degener J Giannandrea and B MichenerOptimizing schema-last tuple-store queries in graphd In A KElmagarmid and D Agrawal editors SIGMOD Conferencepages 1047ndash1056 ACM 2010

[33] M Morsey J Lehmann S Auer C Stadler and S Hell-mann DBpedia and the live extraction of structured data fromwikipedia Program electronic library and information sys-tems 4627 2012

[34] K Nakayama M Pei M Erdmann M Ito M ShirakawaT Hara and S Nishio Wikipedia mining Wikipedia as a cor-pus for knowledge extraction In Annual Wikipedia Conference(Wikimania) 2008

[35] X Niu X Sun H Wang S Rong G Qi and Y Yu ZhishimeWeaving chinese linking open data In 10th International Con-ference on The semantic web - Volume Part II pages 205ndash220Berlin Heidelberg 2011 Springer-Verlag

[36] E Oren R Delbru M Catasta R Cyganiak H Stenzhornand G Tummarello Sindicecom a document-oriented lookupindex for open linked data Int J of Metadata and Semanticsand Ontologies 337ndash52 Nov 10 2008

[37] S P Ponzetto and M Strube Wikitaxonomy A large scaleknowledge resource In M Ghallab C D SpyropoulosN Fakotakis and N M Avouris editors ECAI volume 178of Frontiers in Artificial Intelligence and Applications pages751ndash752 IOS Press 2008

[38] C Stadler J Lehmann K Hoffner and S Auer Linkedgeo-data A core for a web of spatial open data Semantic WebJournal 3(4)333ndash354 2012

[39] F M Suchanek G Kasneci and G Weikum Yago a core ofsemantic knowledge In C L Williamson M E Zurko P FPatel-Schneider and P J Shenoy editors WWW pages 697ndash706 ACM 2007

[40] E Tacchini A Schultz and C Bizer Experiments withwikipedia cross-language data fusion In Proceedings of the5th Workshop on Scripting and Development for the SemanticWeb ESWC Citeseer 2009

[41] C Unger L Buhmann J Lehmann A-C Ngonga NgomoD Gerber and P Cimiano Template-based question answer-ing over rdf data In Proceedings of the 21st international con-ference on World Wide Web pages 639ndash648 ACM 2012

[42] Z Wang J Li Z Wang and J Tang Cross-lingual knowledgelinking across wiki knowledge bases In Proceedings of the21st international conference on World Wide Web pages 459ndash468 New York NY USA 2012 ACM

[43] F Wu and D Weld Automatically Refining the WikipediaInfobox Ontology In Proceedings of the 17th World Wide WebConference 2008

Lehmann et al DBpedia 29

[44] F Wu and D S Weld Autonomously semantifying wikipediaIn Proceedings of the 16th Conference on Information andKnowledge Management pages 41ndash50 ACM 2007

[45] A Zaveri D Kontokostas M A Sherif L BuhmannM Morsey S Auer and J Lehmann User-driven quality eval-uation of dbpedia In To appear in Proceedings of 9th Inter-national Conference on Semantic Systems I-SEMANTICS rsquo13Graz Austria September 4-6 2013 ACM 2013

[46] T Zesch C Muller and I Gurevych Extracting lexical seman-tic knowledge from wikipedia and wiktionary In Proceedingsof the 6th International Conference on Language Resourcesand Evaluation Marrakech Morocco May 2008 electronicproceedings

Appendix

A Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net/>
PREFIX an:    <http://vocab.sindice.net/analytics>
PREFIX dom:   <http://sindice.com/dataspace/default/domain>
PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#>

SELECT ?Dataset ?Target_Dataset ?predicate (SUM(xsd:long(?cardinality)) AS ?total)
FROM <http://sindice.com/analytics> {
  ?target any23:domain_uri ?Target_Dataset .

  ?edge an:source ?source ;
        an:target ?target ;
        an:label ?predicate ;
        an:cardinality ?cardinality ;
        an:publishedIn ?Dataset .

  ?source any23:domain_uri ?Dataset .

  FILTER (?Dataset != ?Target_Dataset)
}
GROUP BY ?Dataset ?Target_Dataset ?predicate

Listing 1: SPARQL query on the Sindice summary graph for obtaining the number of incoming links

Fig. 2. Depiction of the mapping from the Greek and English Wikipedia templates about books to the same DBpedia ontology class [23]

Although data from other Wikipedia language editions were extracted, they used the same namespaces. This was achieved by exploiting the Wikipedia inter-language links5. For every page in a language other than English, the page was extracted only if the page contained an inter-language link to an English page. In that case, using the English link, the data was extracted under the English resource name (i.e. dbr:Berlin).

Recent DBpedia internationalisation developments showed that this approach resulted in less and redundant data [23]. Thus, starting from the DBpedia 3.7 release, two types of data sets were generated. The localized data sets contain all things that are described in a specific language. Within the datasets, things are identified with language-specific URIs, such as http://<lang>.dbpedia.org/resource/ for article data and http://<lang>.dbpedia.org/property/ for property data. In addition, we produce a canonicalized data set for each language. The canonicalized data sets only contain things for which a corresponding page in the English edition of Wikipedia exists. Within all canonicalized data sets, the same thing is identified with the same URI from the generic, language-agnostic namespace http://dbpedia.org/resource/.

5 http://en.wikipedia.org/wiki/Help:Interlanguage_links
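To make the two URI schemes concrete, the following sketch assumes a localized chapter endpoint (here the Greek one) that publishes owl:sameAs links into the canonical namespace; the concrete resource is only an example, not part of the extraction framework description.

PREFIX owl: <http://www.w3.org/2002/07/owl#>

# Canonical (language-agnostic) URI for a localized Greek resource
SELECT ?canonical WHERE {
  <http://el.dbpedia.org/resource/Βερολίνο> owl:sameAs ?canonical .
  FILTER (STRSTARTS(STR(?canonical), "http://dbpedia.org/resource/"))
}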

2.6 NLP Extraction

DBpedia provides a number of data sets which have been created to support Natural Language Processing (NLP) tasks [30]. Currently, four datasets are extracted: topic signatures, grammatical gender, lexicalizations and thematic concepts. While the topic signatures and the grammatical gender extractors primarily extract data from the article text, the lexicalizations and thematic concepts extractors make use of the wiki markup.

DBpedia entities can be referred to using many different names and abbreviations. The lexicalizations data set provides access to alternative names for entities and concepts, associated with several scores estimating the association strength between name and URI. These scores distinguish more common names for specific entities from rarely used ones, and also show how ambiguous a name is with respect to all possible concepts that it can mean.

The topic signatures data set enables the description of DBpedia resources based on unstructured information, as compared to the structured factual data provided by the mapping-based and raw extractors. We build a Vector Space Model (VSM) where each DBpedia resource is a point in a multidimensional space of words. Each DBpedia resource is represented by a vector, and each word occurring in Wikipedia is a dimension of this vector. Word scores are computed using the tf-idf weight, with the intention to measure how strong the association between a word and a DBpedia resource is. Note that word stems are used in this context in order to generalize over inflected words. We use the computed weights to select the strongest related word stems for each entity and build topic signatures [27].
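A standard formulation of this weighting (the generic tf-idf definition; the extractor may use a slightly different variant) is:

\mathrm{tfidf}(w, r) = \mathrm{tf}(w, r) \cdot \log\frac{N}{\mathrm{df}(w)}

where tf(w, r) is the frequency of word stem w in the text associated with resource r, df(w) is the number of resources whose associated text contains w, and N is the total number of resources.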

There are two more feature extractors related to Natural Language Processing. The thematic concepts data set relies on Wikipedia's category system to capture the idea of a 'theme', a subject that is discussed in its articles. Many of the categories in Wikipedia are linked to an article that describes the main topic of that category. We rely on this information to mark DBpedia entities and concepts that are 'thematic', that is, they are the center of discussion for a category.

The grammatical gender data set uses a simple heuristic to decide on a grammatical gender for instances of the class Person in DBpedia. While parsing an article in the English Wikipedia, if there is a mapping from an infobox in this article to the class dbo:Person, we record the frequency of gender-specific pronouns in their declined forms (Subject, Object, Possessive Adjective, Possessive Pronoun and Reflexive) – i.e. he, him, his, himself (masculine) and she, her, hers, herself (feminine). Grammatical genders for DBpedia entities are assigned based on the dominating gender in these pronouns.
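Assuming the resulting assignments are published with the foaf:gender property (an assumption about the data set's vocabulary, not stated above), their distribution could be inspected with a query like the following sketch:

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo:  <http://dbpedia.org/ontology/>

# Distribution of grammatical gender values assigned to persons
SELECT ?gender (COUNT(?person) AS ?persons) WHERE {
  ?person rdf:type dbo:Person ;
          foaf:gender ?gender .
}
GROUP BY ?gender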

2.7 Summary of Other Recent Developments

In this section we summarize the improvements of the DBpedia extraction framework since the publication of the previous DBpedia overview article [5] in 2009. One of the major changes on the implementation level is that the extraction framework was rewritten in Scala in 2010 to improve the efficiency of the extractors by an order of magnitude compared to the previous PHP based framework. The new, more modular framework also allows extracting data from tables in Wikipedia pages and supports extraction from multiple MediaWiki templates per page. Another significant change was the creation and utilization of the DBpedia Mappings Wiki, as described earlier. Further significant changes include the mentioned NLP extractors and the introduction of URI schemes.

In addition, there were several smaller improvements and general maintenance. Overall, over the past four years the parsing of the MediaWiki markup improved quite a lot, which led to better overall coverage, for example concerning references and parser functions. In addition, the collection of MediaWiki namespace identifiers for many languages is now performed semi-automatically, leading to a high accuracy of detection. This concerns common title prefixes such as User, File, Template, Help, Portal etc. in English that indicate pages that do not contain encyclopedic content and would produce noise in the data. They are important for specific extractors as well; for instance, the category hierarchy data set (SKOS) is produced from pages of the Category namespace. Furthermore, the output of the extraction system now supports more formats, and several compliance issues regarding URIs, IRIs, N-Triples and Turtle were fixed.

The individual data extractors have been improved as well, in both number and quality, in many areas. The abstract extraction was enhanced, producing more accurate plain text representations of the beginning of Wikipedia article texts. More diverse and more specific datatypes do exist (e.g. many currencies and XSD datatypes such as xsd:gYearMonth, xsd:positiveInteger, etc.), and for a number of classes and properties specific datatypes were added (e.g. inhabitants/km² for the population density of populated places and m³/s for the discharge of rivers). Many issues related to data parsers were resolved, and the quality of the owl:sameAs data set for multiple language versions was increased by an implementation that takes bijective relations into account.

There are also further extractors, e.g. for Wikipedia page IDs and revisions. Moreover, redirect and disambiguation extractors were introduced and improved. For the redirect data, the transitive closure is computed, while taking care of catching cycles in the links. The redirects also help regarding infobox coverage in the mapping-based extraction by resolving alternative template names. Moreover, in the past, if an infobox value pointed to a redirect, this redirection was not properly resolved and thus resulted in RDF links that led to URIs which did not contain any further information. Resolving redirects affected approximately 15% of all links and hence increased the overall interconnectivity of resources in the DBpedia ontology.

Finally, a new heuristic to increase the connectiveness of DBpedia instances was introduced. If an infobox contains a string value that is not linked to another Wikipedia article, the extraction framework searches for hyperlinks in the same Wikipedia article that have the same anchor text as the infobox value string. If such a link exists, the target of that link is used to replace the string value in the infobox. This method further increases the number of object property assertions in the DBpedia ontology.

Fig. 3. Snapshot of a part of the DBpedia ontology, showing the top classes (e.g. owl:Thing, dbo:Agent, dbo:Place, dbo:Person, dbo:Athlete, dbo:Organisation, dbo:PopulatedPlace, dbo:Settlement, dbo:Species, dbo:Eukaryote, dbo:Work) connected via rdfs:subClassOf, together with selected typed properties such as dbo:birthPlace, dbo:writer, dbo:producer, dbo:populationTotal and dbo:birthDate

3 DBpedia Ontology

The DBpedia ontology consists of 320 classes, which form a subsumption hierarchy and are described by 1,650 different properties. With a maximal depth of 5, the subsumption hierarchy is intentionally kept rather shallow, which fits use cases in which the ontology is visualized or navigated. Figure 3 depicts a part of the DBpedia ontology, indicating the relations among the top ten classes of the DBpedia ontology, i.e. the classes with the highest number of instances.
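The hierarchy can be inspected directly on the public endpoint; the following sketch lists ontology classes with their direct superclasses (the query shape is an illustration, and the exact results depend on the loaded release):

PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# DBpedia ontology classes and their direct superclasses
SELECT ?class ?super WHERE {
  ?class a owl:Class ;
         rdfs:subClassOf ?super .
  FILTER (STRSTARTS(STR(?class), "http://dbpedia.org/ontology/"))
}
ORDER BY ?super ?class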

Fig. 4. Growth of the DBpedia ontology: number of ontology elements (classes and properties) per DBpedia version (3.2 to 3.8)

The DBpedia ontology is maintained and extended by the community in the DBpedia Mappings Wiki. Figure 4 depicts the growth of the DBpedia ontology over time. While the number of classes is not growing too much due to the already good coverage of the initial version of the ontology, the number of properties increases over time due to the collaboration on the DBpedia Mappings Wiki and the addition of more detailed information to infoboxes by Wikipedia editors.

3.1 Mapping Statistics

As of April 2013, there exist mapping communities for 27 languages, 23 of which are active. Figure 5 shows statistics for the coverage of these mappings in DBpedia. Figures (a) and (c) refer to the absolute number of template and property mappings that are defined for every DBpedia language edition. Figures (b) and (d) depict the percentage of the defined template and property mappings compared to the total number of available templates and properties for every Wikipedia language edition. Figures (e) and (g) show the occurrences (instances) that the defined template and property mappings have in Wikipedia. Finally, figures (f) and (h) give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.

It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage. Other language editions, such as Bulgarian, Dutch, English, Greek, Polish and Spanish, have mapped templates covering more than 50% of total template occurrences. In addition, almost all languages have covered more than 20% of property occurrences, with Bulgarian and Portuguese reaching up to 70%.

Fig. 5. Mapping coverage statistics for all mapping-enabled languages

The mapping activity of the ontology enrichment process, along with the editing of the ten most active mapping language communities, is depicted in Figure 6. It is interesting to notice that the high mapping activity peaks coincide with the DBpedia release dates. For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarter of that year have a very high activity compared to the 4th quarter. In the last two years (2012 and 2013), most of the DBpedia mapping language communities have defined their own chapters and have their own release dates. Thus, recent mapping activity shows less fluctuation.

Finally, Figure 7 shows the English property mappings occurrence frequency. Both axes are in log scale and represent the number of property mappings (x axis) that have exactly y occurrences (y axis). The occurrence frequency follows a long tail distribution. Thus, a low number of property mappings have a high number of occurrences, and a high number of property mappings have a low number of occurrences.

3.2 Instance Data

Fig. 6. Mapping community activity for (a) the ontology and (b) the 10 most active language editions

Fig. 7. English property mappings occurrence frequency (both axes are in log scale)

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding language. For 20 of these languages, we report in this section the overall number of entities being described by the localized versions, as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions, as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5); CD = Canonicalized data sets (see Section 2.5); all = Overall number of instances in the data set, including instances without infobox data; with MD = Number of instances for which mapping-based infobox data exists; Raw Properties = Number of different properties that are generated by the raw infobox extractor; Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor; Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor; Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language correspond to instances in the English version of DBpedia, and which portion of the instances is described by clean mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results, on the one hand, from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have so far been created in the Mappings Wiki for a specific language.

Table 3 reports the number of instances for a set of selected ontology classes within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to a specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.
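Counts such as those in Table 3 can be reproduced for a single endpoint with a query of the following shape (an illustrative sketch; the published tables were derived from the canonicalized data sets described above):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# Number of instances per DBpedia ontology class
SELECT ?class (COUNT(DISTINCT ?instance) AS ?instances) WHERE {
  ?instance rdf:type ?class .
  FILTER (STRSTARTS(STR(?class), "http://dbpedia.org/ontology/"))
}
GROUP BY ?class
ORDER BY DESC(?instances)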

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages; the second column contains the number of instances that are described only in a single language version; the next column contains the number of instances that are contained in two languages but not in three or more languages, etc.

Language | Inst. LD all | Inst. CD all | Inst. with MD CD | Raw Prop. CD | Map. Prop. CD | Raw Statem. CD | Map. Statem. CD
en | 3769926 | 3769926 | 2359521 | 48293 | 1313 | 65143840 | 33742015
de | 1243771 | 650037 | 204335 | 9593 | 261 | 7603562 | 2880381
fr | 1197334 | 740044 | 214953 | 13551 | 228 | 8854322 | 2901809
it | 882127 | 580620 | 383643 | 9716 | 181 | 12227870 | 4804731
es | 879091 | 542524 | 310348 | 14643 | 476 | 7740458 | 4383206
pl | 848298 | 538641 | 344875 | 7306 | 266 | 7696193 | 4511794
ru | 822681 | 439605 | 123011 | 13522 | 76 | 6973305 | 1389473
pt | 699446 | 460258 | 272660 | 12851 | 602 | 6255151 | 4005527
ca | 367362 | 241534 | 112934 | 8696 | 183 | 3689870 | 1301868
cs | 225133 | 148819 | 34893 | 5564 | 334 | 1857230 | 474459
hu | 209180 | 138998 | 63441 | 6821 | 295 | 2506399 | 601037
ko | 196132 | 124591 | 30962 | 7095 | 419 | 1035606 | 417605
tr | 187850 | 106644 | 40438 | 7512 | 440 | 1350679 | 556943
ar | 165722 | 103059 | 16236 | 7898 | 268 | 635058 | 168686
eu | 132877 | 108713 | 41401 | 2245 | 19 | 2255897 | 532709
sl | 129834 | 73099 | 22036 | 4235 | 470 | 1213801 | 222447
bg | 125762 | 87679 | 38825 | 3984 | 274 | 774443 | 488678
hr | 109890 | 71469 | 10343 | 3334 | 158 | 701182 | 151196
el | 71936 | 48260 | 10813 | 2866 | 288 | 206460 | 113838

Table 2: Basic statistics about localized DBpedia editions

Class | en | it | pl | es | pt | fr | de | ru | ca | hu
Person | 763643 | 145060 | 70708 | 65337 | 43057 | 62942 | 33122 | 18620 | 7107 | 15529
  Athlete | 185126 | 47187 | 30332 | 19482 | 14130 | 21646 | 31237 | 0 | 721 | 4527
  Artist | 61073 | 12511 | 16120 | 25992 | 10571 | 13465 | 0 | 0 | 2004 | 3821
  Politician | 23096 | 0 | 8943 | 5513 | 3342 | 0 | 0 | 12004 | 1376 | 760
Place | 572728 | 141101 | 182727 | 132961 | 116660 | 80602 | 131766 | 67932 | 73078 | 18324
  PopulPlace | 387166 | 138077 | 167034 | 121204 | 109418 | 72252 | 79410 | 63826 | 72743 | 15535
  Building | 60514 | 1270 | 2946 | 3570 | 803 | 921 | 83 | 43 | 0 | 527
  River | 24267 | 0 | 1383 | 2 | 4149 | 3333 | 6707 | 3924 | 0 | 565
Organisation | 192832 | 4142 | 12193 | 11710 | 10949 | 17513 | 16973 | 1598 | 1399 | 3993
  Company | 44516 | 4142 | 2566 | 975 | 1903 | 5832 | 7200 | 0 | 440 | 618
  EducInst | 42270 | 0 | 599 | 1207 | 270 | 1636 | 2171 | 1010 | 0 | 115
  Band | 27061 | 0 | 2993 | 0 | 4476 | 3868 | 5368 | 0 | 263 | 802
Work | 333269 | 51918 | 32386 | 36484 | 33869 | 39195 | 18195 | 34363 | 5240 | 9777
  MusicWork | 159070 | 23652 | 14987 | 15911 | 17053 | 16206 | 0 | 6588 | 697 | 4704
  Film | 71715 | 17210 | 9467 | 9396 | 8896 | 9741 | 15038 | 12045 | 3859 | 2320
  Software | 27947 | 5682 | 3050 | 4833 | 3419 | 5733 | 2368 | 0 | 606 | 857

Table 3: Number of instances per class within 10 localized DBpedia versions (indented classes are subclasses of the preceding superclass)

For example, 12,936 persons are described in five languages but not in six or more languages. The number 871,630 for the class Person means that all 20 language versions together describe 871,630 different persons. The number is higher than the number of persons described in the canonicalized English infobox data set (763,643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195,263 persons are described in two or more languages. The large difference of this number compared to the total number of 871,630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).

Class | Instances | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10+
Person | 871630 | 676367 | 94339 | 42382 | 21647 | 12936 | 8198 | 5295 | 3437 | 2391 | 4638
Place | 643260 | 307729 | 150349 | 45836 | 36339 | 20831 | 13523 | 20808 | 31422 | 11262 | 5161
Organisation | 206670 | 160398 | 22661 | 9312 | 5002 | 3221 | 2072 | 1421 | 928 | 594 | 1061
Work | 360808 | 243706 | 54855 | 23097 | 12605 | 8277 | 5732 | 4007 | 2911 | 1995 | 3623

Table 4: Cross-language overlap: number of instances that are described in multiple languages

3.3 Internationalisation Community

The introduction of the mapping-based infobox extractor alongside live synchronisation approaches in [19] allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages6. The DBpedia 3.7 release7 in September 2011 was the first DBpedia release to use the localized I18n DBpedia extraction framework [23].

At the time of writing, official DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish8. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services. Recently, the DBpedia internationalisation committee9 has manifested its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek10 and Dutch11), the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [23].

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage, and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage. The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is to achieving total coverage (cf. Figure 5), and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

6 ar, bg, bn, ca, cs, de, el, en, es, et, eu, fr, ga, hi, hr, hu, id, it, ja, ko, nl, pl, pt, ru, sl, tr, ur
7 http://blog.dbpedia.org/2011/09/11/
8 Accessed on 25/02/2013: http://wiki.dbpedia.org/Internationalization/Chapters
9 http://wiki.dbpedia.org/Internationalization
10 http://el.dbpedia.org
11 http://nl.dbpedia.org

4 Live Synchronisation

Wikipedia articles are continuously revised at a very high rate; e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute12. This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [19,33]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported13. Meanwhile, DBpedia Live for Dutch14 was developed.

12 http://stats.wikimedia.org/EN/SummaryEN.htm
13 http://live.dbpedia.org
14 http://live.nl.dbpedia.org

4.1 DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia Foundation kindly provided us access to their update stream using the OAI-PMH protocol [24]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component serving as a proxy constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework in order to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.
– Mappings Wiki: The DBpedia Mappings Wiki described in Section 2.4 serves as a secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.
– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.
– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2 Features of DBpedia Live

The core components of the DBpedia Live extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.
– Unmodified Pages: The update of unmodified pages at regular intervals.
– Changesets Publication: The publication of triple-changesets.
– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.
– Data Isolation: Separate data from different sources.

Fig. 8. Overview of the DBpedia Live extraction framework

Mapping-Affected Pages: Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.

Unmodified Pages: Extraction framework improvements or activation / deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feeds and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets: Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set of added triples and another set of deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.
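On a mirror, integrating one changeset essentially amounts to a pair of SPARQL Update operations over the two triple sets; the triples below are illustrative placeholders rather than actual changeset content.

PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>

# Apply one changeset: remove the deleted triples, then add the new ones
DELETE DATA {
  dbr:Berlin dbo:populationTotal 3499879 .
} ;
INSERT DATA {
  dbr:Berlin dbo:populationTotal 3520031 .
}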

Synchronisation Tool: The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them, and updates the target SPARQL endpoint via insert and delete operations.

Data Isolation: In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and the mapping-based infobox extraction.
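A query restricted to one of these named graphs can then be expressed as in the following sketch (the graph URI is taken from the text above; the triple pattern is an arbitrary example):

PREFIX dbo: <http://dbpedia.org/ontology/>

# Query only the live article data, ignoring static interlinks and the ontology graph
SELECT ?person ?place WHERE {
  GRAPH <http://live.dbpedia.org> {
    ?person dbo:birthPlace ?place .
  }
}
LIMIT 10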

5 Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as of the external data sets that set links pointing to DBpedia resources.

5.1 Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third party data sets. The DBpedia project maintains a link repository15 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third columns contain the predicate that is used for linking as well as the overall number of links set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to a custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

Links in DBpedia have been used for various purposes. One example is the combination of data about

15 https://github.com/dbpedia/dbpedia-links

Data set | Predicate | Count | Tool
Amsterdam Museum | owl:sameAs | 627 | S
BBC Wildlife Finder | owl:sameAs | 444 | S
Book Mashup | rdf:type, owl:sameAs | 9 100 |
Bricklink | dc:publisher | 10 100 |
CORDIS | owl:sameAs | 314 | S
Dailymed | owl:sameAs | 894 | S
DBLP Bibliography | owl:sameAs | 196 | S
DBTune | owl:sameAs | 838 | S
Diseasome | owl:sameAs | 2 300 | S
Drugbank | owl:sameAs | 4 800 | S
EUNIS | owl:sameAs | 3 100 | S
Eurostat (Linked Stats) | owl:sameAs | 253 | S
Eurostat (WBSG) | owl:sameAs | 137 |
CIA World Factbook | owl:sameAs | 545 | S
flickr wrappr | dbp:hasPhotoCollection | 3 800 000 | C
Freebase | owl:sameAs | 3 600 000 | C
GADM | owl:sameAs | 1 900 |
GeoNames | owl:sameAs | 86 500 | S
GeoSpecies | owl:sameAs | 16 000 | S
GHO | owl:sameAs | 196 | L
Project Gutenberg | owl:sameAs | 2 500 | S
Italian Public Schools | owl:sameAs | 5 800 | S
LinkedGeoData | owl:sameAs | 103 600 | S
LinkedMDB | owl:sameAs | 13 800 | S
MusicBrainz | owl:sameAs | 23 000 |
New York Times | owl:sameAs | 9 700 |
OpenCyc | owl:sameAs | 27 100 | C
OpenEI (Open Energy) | owl:sameAs | 678 | S
Revyu | owl:sameAs | 6 |
Sider | owl:sameAs | 2 000 | S
TCMGeneDIT | owl:sameAs | 904 |
UMBEL | rdf:type | 896 400 |
US Census | owl:sameAs | 12 600 |
WikiCompany | owl:sameAs | 8 300 |
WordNet | dbp:wordnet_type | 467 100 |
YAGO2 | rdf:type | 18 100 000 |
Sum | | 27 211 732 |

Table 5: Data sets linked from DBpedia

European Union project funding (FTS) [29] and data about countries in DBpedia. The following query compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia):

SELECT * {
  { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
  } }
  { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
  } }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

In addition to providing outgoing links on the instance level, DBpedia also sets links on the schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively. In particular, in 2011 Google, Microsoft, and Yahoo! announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
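These schema-level links are part of the ontology itself and can be listed with a query such as the following sketch:

PREFIX owl: <http://www.w3.org/2002/07/owl#>

# DBpedia ontology classes declared equivalent to Schema.org classes
SELECT ?dbpediaClass ?schemaOrgClass WHERE {
  ?dbpediaClass owl:equivalentClass ?schemaOrgClass .
  FILTER (STRSTARTS(STR(?schemaOrgClass), "http://schema.org/"))
}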

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of links pointing at DBpedia from other data sets is 39,007,478 according to the Data Hub16. However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [36]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI; e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary over all resources they store. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing at DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia, along with the used link predicate and the number of links.

It should be noted that the data in Sindice is not complete; for instance, it does not contain all data sets that are catalogued by the DataHub17. However, it crawls for RDFa snippets, converts microformats etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster (see appendix for details). Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data, and DBpedia is currently ranked second in terms of incoming links.

16 See http://wiki.dbpedia.org/Interlinking for details.
17 http://datahub.io

Data set | Link predicate count | Link count
okaboo.com | 4 | 2407121
tfri.gov.tw | 57 | 837459
naplesplus.us | 2 | 212460
fu-berlin.de | 7 | 145322
freebase.com | 108 | 138619
geonames.org | 3 | 125734
opencyc.org | 3 | 19648
geospecies.org | 10 | 16188
dbrec.net | 3 | 12856
faviki.com | 5 | 12768

Table 6: Top 10 data sets in Sindice ordered by the number of links to DBpedia

Metric | Value
Total links | 3960212
Total distinct data sets | 248
Total distinct predicates | 345

Table 7: Sindice summary statistics for incoming links to DBpedia

Domain | Datasets | Links
purl.org | 498 | 6717520
dbpedia.org | 248 | 3960212
creativecommons.org | 2483 | 3030910
identi.ca | 1021 | 2359276
l3s.de | 34 | 1261487
rkbexplorer.com | 24 | 1212416
nytimes.com | 27 | 1174941
w3.org | 405 | 658535
geospecies.org | 13 | 523709
livejournal.com | 14881 | 366025

Table 8: Top 10 datasets by incoming links in Sindice

Fig. 9. The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, per quarter of 2012

6 DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint and, third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last couple of years.

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to their download popularity as well. The top five languages with respect to the download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language edition are indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 18 TB per quarter, or 6 TB per month, is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 are depicted in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1,000 times per month18. Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of links exists between two resources. Supposedly, they are used for network analysis or for providing relevant links in user interfaces, and they are downloaded more often as they are not provided via the official SPARQL endpoint.

DBpedia | Configuration
3.3 - 3.4 | AMD Opteron 8220, 2.80 GHz, 4 cores, 32 GB
3.5 - 3.7 | Intel Xeon E5520, 2.27 GHz, 8 cores, 48 GB
3.8 | Intel Xeon E5-2630, 2.30 GHz, 8 cores, 64 GB

Table 9: Hardware of the machines serving the public SPARQL endpoint

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint19 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports infinite horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection of any size. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful hardware, as shown in Table 9.

The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally using too many resources.

18 The IP address was only filtered for that specific file and month in those cases.
19 http://dbpedia.org/sparql

Fig. 10. The download count (in thousands) and download volume (in TB) of the DBpedia data sets

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10 MB) are enforced. If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code20, like 509, and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.21.21 Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the last couple of DBpedia releases. The Avg/Day column represents the average number of hits (resp. visits) per day, followed by the Median and Standard Deviation. The last column shows the maximum number of hits (visits) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30 minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30 minute interval.

20 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
21 http://www.webalizer.org

DBpedia | Avg/Day | Median | Stdev | Maximum
3.3 | 733811 | 711306 | 188991 | 1319539
3.4 | 1212549 | 1165893 | 351226 | 2371657
3.5 | 1122612 | 1035444 | 386454 | 2908720
3.6 | 1328355 | 1286750 | 256945 | 2495031
3.7 | 2085399 | 1930728 | 1057398 | 8873473
3.8 | 2910410 | 2717775 | 1085640 | 7678490

Table 10: Number of SPARQL endpoint hits

Fig. 11. Average number of SPARQL endpoint requests per day

DBpedia | Avg/Day | Median | Stdev | Maximum
3.3 | 9750 | 9775 | 2036 | 13126
3.4 | 11329 | 11473 | 1546 | 14198
3.5 | 16592 | 16902 | 2969 | 23129
3.6 | 19471 | 17664 | 5691 | 56064
3.7 | 23972 | 22262 | 10741 | 127869
3.8 | 16851 | 16711 | 2960 | 27416

Table 11: Number of SPARQL endpoint visits

Fig. 12. Average SPARQL endpoint visits per day

Figure 11 and Figure 12 show the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

Endpoint | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
data | 1355 | 2681 | 2230 | 2547 | 3714 | 4246
ontology | 80 | 168 | 142 | 178 | 116 | 97
page | 2634 | 4703 | 1810 | 1687 | 3787 | 7377
property | 231 | 311 | 137 | 176 | 176 | 128
resource | 2976 | 4080 | 2554 | 2309 | 4436 | 7143
sparql | 2090 | 4177 | 2266 | 9112 | 15902 | 15475
other | 252 | 420 | 342 | 277 | 579 | 695
total | 9619 | 16541 | 9434 | 16286 | 28710 | 35142

Table 12: Hits per service to http://dbpedia.org, in thousands

Fig. 13. Traffic: Linked Data versus SPARQL endpoint

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x, redirecting the client to either the page (HTML based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data from the links at the bottom of the page-based article. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology.

Figure 13 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focussed on the calls to the sparql endpoint and counted the number of statements per type. As the log files only record the full SPARQL query on a GET request, all the POST requests are counted as unknown.

Statement | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
ask | 56 | 269 | 360 | 326 | 159 | 677
construct | 55 | 31 | 14 | 11 | 22 | 38
describe | 11 | 8 | 4 | 7 | 62 | 111
select | 1891 | 3663 | 1658 | 8030 | 11204 | 13516
unknown | 78 | 206 | 229 | 738 | 4455 | 1134
total | 2090 | 4177 | 2266 | 9112 | 15902 | 15475

Table 13: Hits per statement type, in thousands

Statement | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
distinct | 19.5 | 11.4 | 17.3 | 19.4 | 13.3 | 25.4
filter | 45.7 | 13.7 | 31.8 | 25.3 | 29.2 | 36.1
functions | 8.8 | 6.9 | 23.5 | 21.3 | 25.5 | 25.9
geo | 27.7 | 7.0 | 39.6 | 6.7 | 9.3 | 9.2
group | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2
limit | 4.6 | 6.5 | 11.6 | 10.5 | 7.7 | 7.8
optional | 30.6 | 23.8 | 47.3 | 26.3 | 16.7 | 17.2
order | 2.3 | 1.3 | 1.9 | 3.2 | 1.2 | 1.6
union | 3.3 | 3.2 | 26.4 | 11.3 | 12.1 | 20.6

Table 14: Trends in SPARQL select queries (rounded values in %)

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– Use of GEO objects
– GROUP BY
– LIMIT / OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of the SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of the various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 14 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 14. Number of daily requests sent to DBpedia Live for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013

7 Use Cases and Applications

With DBpedia evolving into a hub on the Web of Data, due to its coverage of various domains, it is an eminent data set for various tasks. It can not only improve Wikipedia search, but can also serve as a data source for applications and mashups, as well as for text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [30]. For that purpose, DBpedia includes a number of specialized data sets22. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [12]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

22 http://wiki.dbpedia.org/Datasets/NLP

7.1.1 Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking tasks – refers to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [21].

As DBpedia reflects a vast amount of structured real world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [31] is an open source tool23 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as a context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
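For example, a whitelist of annotation targets restricted to politicians born in Germany could be described by a query along the following lines (an illustrative sketch of such a selection query, not an excerpt from an actual Spotlight configuration):

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Annotation targets: politicians born in a place located in Germany
SELECT DISTINCT ?target WHERE {
  ?target rdf:type dbo:Politician ;
          dbo:birthPlace ?place .
  ?place dbo:country dbr:Germany .
}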

There are numerous other NLP APIs that link enti-ties in text to DBpedia AlchemyAPI24 Semantic APIfrom Ontos25 Open Calais26 and Zemanta27 amongothers Furthermore the DBpedia ontology has beenused for training named entity recognition systems

23 http://spotlight.dbpedia.org
24 http://www.alchemyapi.com
25 http://www.ontos.com
26 http://www.opencalais.com
27 http://www.zemanta.com

(without disambiguation) in the context of the Apache Stanbol project28.

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki29 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [13]. The BBC30 employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2. Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system. The Watson system won a $1 million prize in Jeopardy and relies on several data sets including DBpedia31. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series32. Several QA systems, such as TBSL [41], PowerAqua [28], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems. Due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain specific dictionaries because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.
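To give an impression of what such systems have to produce, the sketch below shows a SPARQL query of the kind a QA system might generate for a question like "Who is the author of The Lord of the Rings?", executed with the SPARQLWrapper library. The chosen label and the property dbo:author are illustrative assumptions about how the question maps onto the DBpedia ontology, not the output of any of the cited systems.

from SPARQLWrapper import SPARQLWrapper, JSON

# Illustrative translation of a natural language question into a SPARQL query.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?authorName WHERE {
        ?book rdfs:label "The Lord of the Rings"@en ;
              dbo:author ?author .
        ?author rdfs:label ?authorName .
        FILTER (lang(?authorName) = "en")
    }
""")
sparql.setReturnFormat(JSON)
for binding in sparql.query().convert()["results"]["bindings"]:
    print(binding["authorName"]["value"])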

28 http://stanbol.apache.org
29 http://www.faviki.com
30 http://bbc.co.uk
31 http://www.aaai.org/Magazine/Watson/watson.php
32 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property from mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [25].
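A minimal sketch of such a fact validation step is shown below: a candidate triple mined from text is checked with a SPARQL ASK query against the public DBpedia endpoint. The concrete URIs are illustrative; real slot filling systems additionally normalise surface forms and handle redirects.

from SPARQLWrapper import SPARQLWrapper, JSON

def fact_supported(subject_uri, property_uri, object_uri,
                   endpoint="http://dbpedia.org/sparql"):
    """Check whether a candidate (entity, property, value) triple is asserted in DBpedia."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery("ASK { <%s> <%s> <%s> }" %
                    (subject_uri, property_uri, object_uri))
    sparql.setReturnFormat(JSON)
    return sparql.query().convert()["boolean"]

# Illustrative candidate fact extracted from text.
print(fact_supported("http://dbpedia.org/resource/Berlin",
                     "http://dbpedia.org/ontology/country",
                     "http://dbpedia.org/resource/Germany"))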

7.2. Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records. Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.

– Stable and curated identifiers for linking. DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing. The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority Files (VIAF) project33. Recently, it was announced that VIAF added a total of 250000 reciprocal authority links to Wikipedia34. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

7.3. Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of tools and structure them in categories.

33 http://viaf.org
34 Accessed on 12.02.2013: http://www.oclc.org/research/news/2012/12-07a.html

Facet Based Browsers. An award-winning35 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [14]. Another facet based browser, which allows users to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [15]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet based browser36 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.
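The following sketch illustrates the kind of aggregation query a facet browser might issue, counting instances per rdf:type among resources whose English label contains a keyword. The query shape and the use of the public endpoint are illustrative assumptions and do not reproduce the queries generated by the OpenLink facet browser.

import requests

# Facet-style aggregation: how many matching resources fall into each class?
query = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?type (COUNT(?s) AS ?count) WHERE {
    ?s rdfs:label ?label ; a ?type .
    FILTER (lang(?label) = "en" && CONTAINS(LCASE(?label), "leipzig"))
}
GROUP BY ?type
ORDER BY DESC(?count)
LIMIT 10
"""
response = requests.get("http://dbpedia.org/sparql",
                        params={"query": query,
                                "format": "application/sparql-results+json"})
for row in response.json()["results"]["bindings"]:
    print(row["type"]["value"], row["count"]["value"])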

Search and Querying. The DBpedia Query Builder37 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [16] tool provides an intuitive interface which allows users to explore the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [17] allows users to create statistical analysis queries and correlations in RDF data and DBpedia in particular.

Spatial Applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.
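A minimal sketch of a nearby-locations query, as a spatial application might issue it, is given below. It uses the W3C geo vocabulary (geo:lat, geo:long) published by DBpedia and a simple bounding box around assumed coordinates; DBpedia Mobile's actual queries and distance handling may differ.

from SPARQLWrapper import SPARQLWrapper, JSON

# Bounding-box query for places near a given position (roughly central Berlin).
lat, lon, delta = 52.52, 13.40, 0.05
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
    SELECT ?place ?plat ?plon WHERE {
        ?place geo:lat ?plat ; geo:long ?plon .
        FILTER (?plat > %f && ?plat < %f && ?plon > %f && ?plon < %f)
    } LIMIT 50
""" % (lat - delta, lat + delta, lon - delta, lon + delta))
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["place"]["value"])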

7.4. Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources available, written in 171 languages (of which approximately 147 can be considered active38), containing information about hundreds of spoken and even ancient

35 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany
36 http://dbpedia.org/fct
37 http://querybuilder.dbpedia.org
38 http://s23.org/wikistats/wiktionaries_html.php


languages. For example, the English Wiktionary contains nearly 3 million words39. A Wiktionary page provides, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies and most prominently senses. Within this tree, numerous kinds of linguistic properties are given, including synonyms, hyponyms, hyperonyms, example sentences, links to Wikipedia and many more. The fast changing nature together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules poses the biggest problem to the automated transformation into a structured knowledge base. We identified this as a serious problem: although the value of Wiktionary is known and usage scenarios are obvious, only some rudimentary tools exist, which either focus on a specific subset of the data or only cover one or two WLEs [18]. Existing tools can be seen as adapters to single WLEs; they are hard to maintain and there are too many languages that constantly change and require a programmer to refactor complex code. Opposed to the currently employed classic and straightforward approach (implementing software adapters for scraping), Wiktionary2RDF [18] employs a declarative mediator/wrapper pattern. Figure 15 shows the work flow of the specially created Wiktionary extractor that takes as input one config XML file for each WLE. The aim is to enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves. A simple XML dialect was created to encode the "entry layout explained" (ELE) guidelines and declare triple patterns that define how the resulting RDF should be built. An example is given for the following wiki syntax:

===Synonyms===
[[building]]
[[company]]

The regular expression for scraping and the triple generation rule encoded in XML look like:

<wikiTemplate>===Synonyms===
( [[$target]]
)+
</wikiTemplate>

<triple s="http://some.ns/$entityId" p="http://some.ns/hasSynonym" o="http://some.ns/$target" />

39 See http://en.wiktionary.org/wiki/semantic for a simple example page

Fig. 15. Detail view of the Wiktionary extractor

This configuration is interpreted and run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data40. Statistics are shown in Table 15.
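A small sketch of consuming such an extract with rdflib is shown below. It reuses the placeholder namespace http://some.ns/ and the hasSynonym property from the triple pattern above; the vocabulary actually published at wiktionary.dbpedia.org differs, and the input file name is hypothetical.

from rdflib import Graph, Namespace

# Placeholder namespace taken from the triple pattern in the listing above.
NS = Namespace("http://some.ns/")

g = Graph()
g.parse("wiktionary_extract.nt", format="nt")  # hypothetical extractor output

# List all synonyms recorded for the entry "house".
for synonym in g.objects(NS["house"], NS["hasSynonym"]):
    print(synonym)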

8 Related Work

8.1. Cross Domain Community Knowledge Bases

8.1.1. Wikidata

In March 2012, the Wikimedia Germany e.V. started the development of Wikidata41. Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering the truth about things, but providing statements about them. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011 according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

40 http://wiktionary.dbpedia.org
41 wikidata.org


language   words     triples     resources   predicates  senses   XML lines
en         2142237   28593364    11804039    28          424386   930
fr         4657817   35032121    20462349    22          592351   490
ru         1080156   12813437    5994560     17          149859   1449
de         701739    5618508     2966867     16          122362   671

Table 15. Statistical comparison of extractions for different languages. "XML lines" measures the number of lines of the XML configuration files.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface42 as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2. Freebase

Freebase43 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [32] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only. Deleted triples are marked and the system can easily revert to a previous version.

42 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
43 http://www.freebase.com

This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google and areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3. YAGO

One of the projects which pursues similar goals to DBpedia is YAGO44 [39]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet. WordNet synsets are represented as classes and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [20], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content. YAGO focuses on extracting a smaller number of relations compared to DBpedia to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia and therefore by the DBpedia

44 http://www.mpi-inf.mpg.de/yago-naga/yago


community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2. Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects45 that are able to process wiki syntax, as well as a list of data extraction extensions46 for MediaWiki.

JWPL (Java Wikipedia Library [46]) is an open-source Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki Markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as XML dump and is incorporated in the lexical resource UBY47 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [34]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguation or synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service48. Additional projects presented by the authors that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site49.

An earlier approach to improve the quality of the infobox schemata and contents is described in [44].

45 http://www.mediawiki.org/wiki/Alternative_parsers
46 http://www.mediawiki.org/wiki/Extension_Matrix#data_extraction
47 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
48 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
49 http://sigwp.org/en

The presented methodology encompasses a three step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [35] the authors describe their effort building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike50 and Hudong Baike51. Apart from utilizing MediaWiki and HTML Markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve a better cross-lingual knowledge-linkage beyond the use of Wikipedia interlanguage links is presented in [42]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

KnowItAll52 is a web scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [26] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [37] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [43] refines existing Wikipedia infoboxes based on machine learning techniques using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [44] is a system which autonomously extracts structured data from Wikipedia and uses

50 http://baike.baidu.com
51 http://www.hudong.com
52 http://www.cs.washington.edu/research/knowitall


self-supervised linking. [2] was an infobox extraction approach for Wikipedia, which later became the DBpedia project.

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that DBpedia matured and improved significantly in the last years, in particular also in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories and DBpedia already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [18] meanwhile becoming part of DBpedia, and LinkedGeoData [38], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach a significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement. In the future we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia

ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [45], we took a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki, which was backported and implemented as a very lightweight extension for MediaWiki53. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other nicely. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].
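One possible sanity check of this kind is sketched below: a SPARQL query against the DBpedia Live endpoint that lists persons whose recorded death date precedes their birth date. The endpoint path and the exact properties are assumptions; the checks actually deployed may differ.

from SPARQLWrapper import SPARQLWrapper, JSON

# Sanity check: persons whose death date lies before their birth date.
sparql = SPARQLWrapper("http://live.dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?person ?birth ?death WHERE {
        ?person a dbo:Person ;
                dbo:birthDate ?birth ;
                dbo:deathDate ?death .
        FILTER (?death < ?birth)
    } LIMIT 100
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["person"]["value"], row["birth"]["value"], row["death"]["value"])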

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP. Despite recent advances (cf. Section 7), there is still a huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (depends on training data) and update (requires recomputation). With DBpedia and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationship between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Paul Kreis for DBpedia development
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– people who helped contributing data for certain parts of the article:

  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.
[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.
[3] C. Becker and C. Bizer. Exploring the geospatial semantic web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.
[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.
[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.
[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the web of data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.
[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.
[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.
[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.
[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.
[12] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.
[13] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.
[14] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[15] P. Heim, T. Ertl, and J. Ziegler. Facet graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.
[16] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.
[17] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.
[18] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In Proc. of the Joint International Semantic Technology Conference, 2012.
[19] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.
[20] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.
[21] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.
[22] D. Kontokostas, S. Auer, S. Hellmann, J. Lehmann, P. Westphal, R. Cornelissen, and A. Zaveri. Test-driven data quality evaluation for SPARQL endpoints. In Submitted to 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.
[23] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of linked data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.
[24] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The open archives initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.
[25] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.
[26] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.
[27] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational linguistics, pages 495–501, 2000.
[28] V. Lopez, M. Fernandez, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the semantic web. Semantic Web, 3(3):249–265, 2012.
[29] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[30] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.
[31] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.
[32] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.
[33] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.
[34] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.
[35] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin, Heidelberg, 2011. Springer-Verlag.
[36] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, Nov. 10 2008.
[37] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.
[38] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.
[39] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.
[40] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.
[41] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648. ACM, 2012.
[42] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.
[43] F. Wu and D. Weld. Automatically Refining the Wikipedia Infobox Ontology. In Proceedings of the 17th World Wide Web Conference, 2008.
[44] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.
[45] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In To appear in Proceedings of 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.
[46] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

Appendix

A Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net/>
PREFIX an: <http://vocab.sindice.net/analytics#>
PREFIX dom: <http://sindice.com/dataspace/default/domain/>
SELECT ?Dataset ?Target_Dataset ?predicate
       SUM(xsd:long(?cardinality)) AS ?total
FROM <http://sindice.com/analytics>
{
  ?target any23:domain_uri ?Target_Dataset .

  ?edge an:source ?source ;
        an:target ?target ;
        an:label ?predicate ;
        an:cardinality ?cardinality ;
        an:publishedIn ?Dataset .

  ?source any23:domain_uri ?Dataset .

  FILTER (?Dataset != ?Target_Dataset)
}

Listing 1: SPARQL query on the Sindice summary graph for obtaining the number of incoming links


tf-idf weight with the intention to measure how strong the association between a word and a DBpedia resource is. Note that word stems are used in this context in order to generalize over inflected words. We use the computed weights to select the strongest related word stems for each entity and build topic signatures [27].

There are two more Feature Extractors related to Natural Language Processing. The thematic concepts data set relies on Wikipedia's category system to capture the idea of a 'theme', a subject that is discussed in its articles. Many of the categories in Wikipedia are linked to an article that describes the main topic of that category. We rely on this information to mark DBpedia entities and concepts that are 'thematic', that is, they are the center of discussion for a category.

The grammatical gender data set uses a simple heuristic to decide on a grammatical gender for instances of the class Person in DBpedia. While parsing an article in the English Wikipedia, if there is a mapping from an infobox in this article to the class dbo:Person, we record the frequency of gender-specific pronouns in their declined forms (Subject, Object, Possessive Adjective, Possessive Pronoun and Reflexive) – i.e. he, him, his, himself (masculine) and she, her, hers, herself (feminine). Grammatical genders for DBpedia entities are assigned based on the dominating gender in these pronouns.
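The following is a simplified, illustrative re-implementation of this heuristic on plain article text; the real extractor works on parsed pages mapped to dbo:Person, and the 2:1 dominance threshold used here is an assumption.

import re
from collections import Counter

MASCULINE = {"he", "him", "his", "himself"}
FEMININE = {"she", "her", "hers", "herself"}

def grammatical_gender(article_text, ratio=2.0):
    """Count declined masculine/feminine pronouns and return the dominating gender."""
    tokens = re.findall(r"[a-z]+", article_text.lower())
    counts = Counter(t for t in tokens if t in MASCULINE | FEMININE)
    male = sum(counts[t] for t in MASCULINE)
    female = sum(counts[t] for t in FEMININE)
    if male > ratio * female:
        return "male"
    if female > ratio * male:
        return "female"
    return None  # no clearly dominating gender

print(grammatical_gender("She was born in Leipzig, where her family lived."))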

2.7. Summary of Other Recent Developments

In this section we summarize the improvements of the DBpedia extraction framework since the publication of the previous DBpedia overview article [5] in 2009. One of the major changes on the implementation level is that the extraction framework has been rewritten in Scala in 2010 to improve the efficiency of the extractors by an order of magnitude compared to the previous PHP based framework. The new, more modular framework also allows to extract data from tables in Wikipedia pages and supports extraction from multiple MediaWiki templates per page. Another significant change was the creation and utilization of the DBpedia Mappings Wiki as described earlier. Further significant changes include the mentioned NLP extractors and the introduction of URI schemes.

In addition, there were several smaller improvements and general maintenance. Overall, over the past four years the parsing of the MediaWiki markup improved quite a lot, which led to better overall coverage, for example concerning references and parser functions. In addition, the collection of MediaWiki namespace identifiers for many languages is now performed semi-automatically, leading to a high accuracy of detection. This concerns common title prefixes such as User, File, Template, Help, Portal etc. in English that indicate pages that do not contain encyclopedic content and would produce noise in the data. They are important for specific extractors as well; for instance, the category hierarchy data set (SKOS) is produced from pages of the Category namespace. Furthermore, the output of the extraction system now supports more formats and several compliance issues regarding URIs, IRIs, N-Triples and Turtle were fixed.

The individual data extractors have been improved as well in both number and quality in many areas. The abstract extraction was enhanced, producing more accurate plain text representations of the beginning of Wikipedia article texts. More diverse and more specific datatypes do exist (e.g. many currencies and XSD datatypes such as xsd:gYearMonth, xsd:positiveInteger, etc.) and for a number of classes and properties, specific datatypes were added (e.g. inhabitants/km2 for the population density of populated places and m3/s for the discharge of rivers). Many issues related to data parsers were resolved and the quality of the owl:sameAs data set for multiple language versions was increased by an implementation that takes bijective relations into account.

There are also further extractors, e.g. for Wikipedia page IDs and revisions. Moreover, redirect and disambiguation extractors were introduced and improved. For the redirect data, the transitive closure is computed, while taking care of catching cycles in the links. The redirects also help regarding infobox coverage in the mapping-based extraction by resolving alternative template names. Moreover, in the past, if an infobox value pointed to a redirect, this redirection was not properly resolved and thus resulted in RDF links that led to URIs which did not contain any further information. Resolving redirects affected approximately 15% of all links and hence increased the overall interconnectivity of resources in the DBpedia ontology.

Finally, a new heuristic to increase the connectiveness of DBpedia instances was introduced. If an infobox contains a string value that is not linked to another Wikipedia article, the extraction framework searches for hyperlinks in the same Wikipedia article that have the same anchor text as the infobox value string. If such a link exists, the target of that link is used to replace the string value in the infobox. This method further increases the number of object property assertions in the DBpedia ontology.
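A simplified sketch of this heuristic is given below; the dictionary of anchor texts stands in for the extraction framework's internal representation of the article's hyperlinks.

def resolve_infobox_value(value, article_links):
    """Replace a plain infobox string by a wiki link if an identical anchor text exists.

    article_links maps anchor text to the linked Wikipedia article title.
    """
    if value.startswith("[[") and value.endswith("]]"):
        return value  # already a wiki link, nothing to do
    target = article_links.get(value)
    if target is not None:
        # The linked value later becomes an object property assertion
        # instead of a plain literal.
        return "[[%s]]" % target
    return value

links_in_article = {"Leipzig University": "Leipzig_University"}
print(resolve_infobox_value("Leipzig University", links_in_article))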


Fig. 3. Snapshot of a part of the DBpedia ontology, showing top classes (e.g. dbo:Person, dbo:Place, dbo:Organisation, dbo:Work) with example datatype and object properties (e.g. dbo:birthDate, dbo:birthPlace, dbo:populationTotal) and rdfs:subClassOf relations

3 DBpedia Ontology

The DBpedia ontology consists of 320 classes which form a subsumption hierarchy and are described by 1650 different properties. With a maximal depth of 5, the subsumption hierarchy is intentionally kept rather shallow, which fits use cases in which the ontology is visualized or navigated. Figure 3 depicts a part of the DBpedia ontology, indicating the relations among the top ten classes of the DBpedia ontology, i.e. the classes with the highest number of instances.

Fig. 4. Growth of the DBpedia ontology: number of ontology elements (classes and properties) per DBpedia version (3.2–3.8)

The DBpedia ontology is maintained and extended by the community in the DBpedia Mappings Wiki. Figure 4 depicts the growth of the DBpedia ontology over time. While the number of classes is not growing too much due to the already good coverage of the initial version of the ontology, the number of properties increases over time due to the collaboration on the DBpedia Mappings Wiki and the addition of more detailed information to infoboxes by Wikipedia editors.

3.1. Mapping Statistics

As of April 2013, there exist mapping communities for 27 languages, 23 of which are active. Figure 5 shows statistics for the coverage of these mappings in DBpedia. Figures (a) and (c) refer to the absolute number of template and property mappings that are defined for every DBpedia language edition. Figures (b) and (d) depict the percentage of the defined template and property mappings compared to the total number of available templates and properties for every Wikipedia language edition. Figures (e) and (g) show the occurrences (instances) that the defined template and property mappings have in Wikipedia. Finally, figures (f) and (h) give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.

It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage. Other language editions, such as Bulgarian, Dutch, English, Greek, Polish and Spanish, have mapped templates covering more than 50% of total template occurrences. In addition, almost all languages have covered more than 20% of property occurrences, with Bulgarian and Portuguese reaching up to 70%.


Fig. 5. Mapping coverage statistics for all mapping-enabled languages

The mapping activity of the ontology enrichment process, along with the editing of the ten most active mapping language communities, is depicted in Figure 6. It is interesting to notice that the high mapping activity peaks coincide with the DBpedia release dates. For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarter of that year have a very high activity compared to the 4th quarter. In the last two years (2012 and 2013), most of the DBpedia mapping language communities have defined their own chapters and have their own release dates. Thus, recent mapping activity shows less fluctuation.

Finally, Figure 7 shows the English property mappings occurrence frequency. Both axes are in log scale and represent the number of property mappings (x axis) that have exactly y occurrences (y axis). The occurrence frequency follows a long tail distribution. Thus, a low number of property mappings have a high number of occurrences and a high number of property mappings have a low number of occurrences.

3.2. Instance Data

Fig. 6. Mapping community activity for (a) ontology and (b) 10 most active language editions

Fig. 7. English property mappings occurrence frequency (both axes are in log scale)

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding language. For 20 of these languages, we report in this section the overall number of entities being described by the localized versions as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5), CD = Canonicalized data sets (see Section 2.5), all = Overall number of instances in the data set including instances without infobox data, with MD = Number of instances for which mapping-based infobox data exists, Raw Properties = Number of different properties that are generated by the raw infobox extractor, Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor, Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor, Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language correspond to instances in the English version of DBpedia and which portion of the instances is described by clean mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results on the one hand from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have been so far created in the Mappings Wiki for a specific language.

Table 3 reports the number of instances for a set of selected ontology classes within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to a specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages, the second column contains the number of instances that are described only in a single language version, the next column contains the number of instances that are contained in two languages but not in three or more languages, etc. For example, 12936 persons are described in five languages but not in six


lang  Inst LD all  Inst CD all  Inst with MD CD  Raw Prop CD  Map Prop CD  Raw Statem CD  Map Statem CD
en    3769926      3769926      2359521          48293        1313         65143840       33742015
de    1243771      650037       204335           9593         261          7603562        2880381
fr    1197334      740044       214953           13551        228          8854322        2901809
it    882127       580620       383643           9716         181          12227870       4804731
es    879091       542524       310348           14643        476          7740458        4383206
pl    848298       538641       344875           7306         266          7696193        4511794
ru    822681       439605       123011           13522        76           6973305        1389473
pt    699446       460258       272660           12851        602          6255151        4005527
ca    367362       241534       112934           8696         183          3689870        1301868
cs    225133       148819       34893            5564         334          1857230        474459
hu    209180       138998       63441            6821         295          2506399        601037
ko    196132       124591       30962            7095         419          1035606        417605
tr    187850       106644       40438            7512         440          1350679        556943
ar    165722       103059       16236            7898         268          635058         168686
eu    132877       108713       41401            2245         19           2255897        532709
sl    129834       73099        22036            4235         470          1213801        222447
bg    125762       87679        38825            3984         274          774443         488678
hr    109890       71469        10343            3334         158          701182         151196
el    71936        48260        10813            2866         288          206460         113838

Table 2. Basic statistics about Localized DBpedia Editions

Class          en      it      pl      es      pt      fr      de      ru     ca     hu
Person         763643  145060  70708   65337   43057   62942   33122   18620  7107   15529
  Athlete      185126  47187   30332   19482   14130   21646   31237   0      721    4527
  Artist       61073   12511   16120   25992   10571   13465   0       0      2004   3821
  Politician   23096   0       8943    5513    3342    0       0       12004  1376   760
Place          572728  141101  182727  132961  116660  80602   131766  67932  73078  18324
  PopulPlace   387166  138077  167034  121204  109418  72252   79410   63826  72743  15535
  Building     60514   1270    2946    3570    803     921     83      43     0      527
  River        24267   0       1383    2       4149    3333    6707    3924   0      565
Organisation   192832  4142    12193   11710   10949   17513   16973   1598   1399   3993
  Company      44516   4142    2566    975     1903    5832    7200    0      440    618
  EducInst     42270   0       599     1207    270     1636    2171    1010   0      115
  Band         27061   0       2993    0       4476    3868    5368    0      263    802
Work           333269  51918   32386   36484   33869   39195   18195   34363  5240   9777
  MusicWork    159070  23652   14987   15911   17053   16206   0       6588   697    4704
  Film         71715   17210   9467    9396    8896    9741    15038   12045  3859   2320
  Software     27947   5682    3050    4833    3419    5733    2368    0      606    857

Table 3. Number of instances per class within 10 localized DBpedia versions

or more languages. The number 871630 for the class Person means that all 20 language versions together describe 871630 different persons. The number is higher than the number of persons described in the canonicalized English infobox data set (763643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195263 persons are described in two or more languages. The large difference of this number compared to the total number of 871630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).


Class         Instances  1       2       3      4      5      6      7      8      9      10+
Person        871630     676367  94339   42382  21647  12936  8198   5295   3437   2391   4638
Place         643260     307729  150349  45836  36339  20831  13523  20808  31422  11262  5161
Organisation  206670     160398  22661   9312   5002   3221   2072   1421   928    594    1061
Work          360808     243706  54855   23097  12605  8277   5732   4007   2911   1995   3623

Table 4. Cross-language overlap: Number of instances that are described in multiple languages

3.3. Internationalisation Community

The introduction of the mapping-based infobox extractor alongside live synchronisation approaches in [19] allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages6. The DBpedia 3.7 release7 in September 2011 was the first DBpedia release to use the localized I18n DBpedia extraction framework [23].

At the time of writing, official DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish8. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services. Recently, the DBpedia internationalisation committee9 has manifested its structure and each language edition has a representative with a vote in elections. In some cases (e.g. Greek10 and Dutch11) the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [23].

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how

6 ar, bg, bn, ca, cs, de, el, en, es, et, eu, fr, ga, hi, hr, hu, id, it, ja, ko, nl, pl, pt, ru, sl, tr, ur
7 http://blog.dbpedia.org/2011/09/11/
8 Accessed on 25.02.2013: http://wiki.dbpedia.org/Internationalization/Chapters
9 http://wiki.dbpedia.org/Internationalization
10 http://el.dbpedia.org
11 http://nl.dbpedia.org

many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage. The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is from achieving total coverage (cf. Figure 5), and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

4 Live Synchronisation

Wikipedia articles are continuously revised at a very high rate, e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute12. This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [19,33]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported13. Meanwhile, DBpedia Live for Dutch14 was developed.

12 http://stats.wikimedia.org/EN/SummaryEN.htm
13 http://live.dbpedia.org
14 http://live.nl.dbpedia.org


4.1. DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia foundation kindly provided us access to their update stream using the OAI-PMH protocol [24]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component serving as a proxy constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.
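The sketch below illustrates how such a pull works at the protocol level, using the standard OAI-PMH verb ListRecords with a from timestamp and resumption tokens for paging. The endpoint URL and the metadata prefix are placeholders, not the actual Wikimedia configuration.

import requests
import xml.etree.ElementTree as ET

OAI_ENDPOINT = "http://example.org/wikipedia-oai"  # hypothetical proxy URL
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def fetch_updates(since, prefix="oai_dc"):
    """Yield OAI-PMH records changed since the given timestamp, following resumption tokens."""
    params = {"verb": "ListRecords", "metadataPrefix": prefix, "from": since}
    while True:
        root = ET.fromstring(requests.get(OAI_ENDPOINT, params=params).content)
        for record in root.iter(OAI_NS + "record"):
            yield record  # hand the page revision to the extraction manager
        token = root.find(".//" + OAI_NS + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break  # no more batches for now
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

for rec in fetch_updates("2013-01-01T00:00:00Z"):
    pass  # feed the revision into the live extraction workflow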

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.

– Mappings Wiki: The DBpedia Mappings Wiki, described in Section 2.4, serves as a secondary input source. Changes to the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.

– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.

– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2 Features of DBpedia Live

The core components of the DBpedia Live extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.

– Unmodified Pages: The update of unmodified pages at regular intervals.

– Changesets Publication: The publication of triple-changesets.

– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.

– Data Isolation: Separation of data from different sources.

Fig. 8. Overview of DBpedia Live extraction framework

Mapping-Affected Pages: Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.

Unmodified Pages: Extraction framework improvements or the activation/deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feed and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets: Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set of added triples and a set of deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can simply download those files, decompress and integrate them.

Synchronisation Tool: The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.
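As an illustration, applying a single changeset to a local mirror could look like the following sketch; the file names, endpoint URL and the use of the SPARQL 1.1 "update" parameter are assumptions, and the official synchronisation tool is a separate program.

    # Sketch: apply one changeset (added/removed N-Triples) to a local mirror
    # via SPARQL 1.1 Update. File names and endpoint are illustrative only.
    import gzip
    import requests

    UPDATE_ENDPOINT = "http://localhost:8890/sparql"   # e.g. a local Virtuoso instance

    def read_ntriples(path):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            return [line.strip() for line in f if line.strip()]

    def apply_changeset(added_path, removed_path):
        removed = read_ntriples(removed_path)
        added = read_ntriples(added_path)
        if removed:   # remove outdated triples first, then insert the new ones
            requests.post(UPDATE_ENDPOINT,
                          data={"update": "DELETE DATA { %s }" % "\n".join(removed)}).raise_for_status()
        if added:
            requests.post(UPDATE_ENDPOINT,
                          data={"update": "INSERT DATA { %s }" % "\n".join(added)}).raise_for_status()

    apply_changeset("000001.added.nt.gz", "000001.removed.nt.gz")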

Data Isolation: In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and the mapping-based infobox extraction.
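A client can exploit this separation by restricting a query to one of the named graphs, e.g. as in the following sketch (the endpoint URL is assumed to be the public DBpedia Live SPARQL endpoint):

    # Sketch: query only the article-update graph of DBpedia Live.
    import requests

    query = """
    SELECT ?abstract WHERE {
      GRAPH <http://live.dbpedia.org> {
        <http://dbpedia.org/resource/Berlin>
            <http://dbpedia.org/ontology/abstract> ?abstract .
      }
    } LIMIT 1
    """
    resp = requests.get("http://live.dbpedia.org/sparql",   # assumed endpoint URL
                        params={"query": query,
                                "format": "application/sparql-results+json"})
    print(resp.json()["results"]["bindings"])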

5 Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as the external data sets that set links pointing at DBpedia resources.

5.1 Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third-party data sets. The DBpedia project maintains a link repository15 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third columns contain the predicate that is used for linking as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to a custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

Links in DBpedia have been used for various purposes. One example is the combination of data about

15 https://github.com/dbpedia/dbpedia-links

Data set                  Predicate                Count        Tool

Amsterdam Museum          owl:sameAs               627          S
BBC Wildlife Finder       owl:sameAs               444          S
Book Mashup               rdf:type, owl:sameAs     9 100
Bricklink                 dc:publisher             10 100
CORDIS                    owl:sameAs               314          S
Dailymed                  owl:sameAs               894          S
DBLP Bibliography         owl:sameAs               196          S
DBTune                    owl:sameAs               838          S
Diseasome                 owl:sameAs               2 300        S
Drugbank                  owl:sameAs               4 800        S
EUNIS                     owl:sameAs               3 100        S
Eurostat (Linked Stats)   owl:sameAs               253          S
Eurostat (WBSG)           owl:sameAs               137
CIA World Factbook        owl:sameAs               545          S
flickr wrappr             dbp:hasPhotoCollection   3 800 000    C
Freebase                  owl:sameAs               3 600 000    C
GADM                      owl:sameAs               1 900
GeoNames                  owl:sameAs               86 500       S
GeoSpecies                owl:sameAs               16 000       S
GHO                       owl:sameAs               196          L
Project Gutenberg         owl:sameAs               2 500        S
Italian Public Schools    owl:sameAs               5 800        S
LinkedGeoData             owl:sameAs               103 600      S
LinkedMDB                 owl:sameAs               13 800       S
MusicBrainz               owl:sameAs               23 000
New York Times            owl:sameAs               9 700
OpenCyc                   owl:sameAs               27 100       C
OpenEI (Open Energy)      owl:sameAs               678          S
Revyu                     owl:sameAs               6
Sider                     owl:sameAs               2 000        S
TCMGeneDIT                owl:sameAs               904
UMBEL                     rdf:type                 896 400
US Census                 owl:sameAs               12 600
WikiCompany               owl:sameAs               8 300
WordNet                   dbp:wordnet_type         467 100
YAGO2                     rdf:type                 18 100 000

Sum                                                27 211 732

Table 5: Data sets linked from DBpedia

European Union project funding (FTS) [29] and data about countries in DBpedia. The following query compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia):

SELECT * WHERE {
  {
    SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) WHERE {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
    }
  }
  {
    SELECT ?dbpcountry ?gdpyear ?gdpnominal WHERE {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
    }
  }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

In addition to providing outgoing links on the instance level, DBpedia also sets links on the schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively. In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
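Such schema-level links can be retrieved directly from the public endpoint; the following sketch lists the owl:equivalentClass links into Schema.org (the result count depends on the loaded release):

    # Sketch: list DBpedia ontology classes declared equivalent to schema.org classes.
    import requests

    query = """
    PREFIX owl: <http://www.w3.org/2002/07/owl#>
    SELECT ?dbpediaClass ?schemaOrgClass WHERE {
      ?dbpediaClass owl:equivalentClass ?schemaOrgClass .
      FILTER (STRSTARTS(STR(?schemaOrgClass), "http://schema.org/"))
    }
    """
    resp = requests.get("http://dbpedia.org/sparql",
                        params={"query": query,
                                "format": "application/sparql-results+json"})
    for row in resp.json()["results"]["bindings"]:
        print(row["dbpediaClass"]["value"], "->", row["schemaOrgClass"]["value"])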

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of links pointing at DBpedia from other data sets is 39,007,478 according to the Data Hub.16 However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [36]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of the subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph

16 See http://wiki.dbpedia.org/Interlinking for details.

Data set         Link Predicate Count   Link Count

okaboo.com       4                      2407121
tfri.gov.tw      57                     837459
naplesplus.us    2                      212460
fu-berlin.de     7                      145322
freebase.com     108                    138619
geonames.org     3                      125734
opencyc.org      3                      19648
geospecies.org   10                     16188
dbrec.net        3                      12856
faviki.com       5                      12768

Table 6: Top 10 data sets in Sindice ordered by the number of links to DBpedia

Metric                      Value

Total links                 3960212
Total distinct data sets    248
Total distinct predicates   345

Table 7: Sindice summary statistics for incoming links to DBpedia

summary over all resources they store. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70 of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed about 4 million links pointing at DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia, along with the used link predicate and the number of links.
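The link-counting scheme itself is straightforward; the sketch below approximates it over a local N-Triples file, using a naive second-level domain as the data set identifier (a faithful reimplementation would use the public-suffix list and the crawl's provenance information, as Sindice does):

    # Sketch: count inter-dataset links in an N-Triples file. A "dataset" is
    # approximated by the second-level domain; only triples whose subject is
    # authoritative for the source domain are counted.
    import re
    from collections import Counter
    from urllib.parse import urlparse

    URI = re.compile(r"<([^>]+)>")

    def dataset(uri):
        host = urlparse(uri).hostname or ""
        return ".".join(host.split(".")[-2:])      # naive second-level domain

    def count_links(ntriples_path, source_domain):
        links = Counter()
        with open(ntriples_path, encoding="utf-8") as f:
            for line in f:
                uris = URI.findall(line)
                if len(uris) != 3:                 # object is a literal or blank node: not a link
                    continue
                s_ds, o_ds = dataset(uris[0]), dataset(uris[2])
                if s_ds == source_domain and s_ds != o_ds:
                    links[(s_ds, o_ds)] += 1
        return links

    print(count_links("sample.nt", "fu-berlin.de").most_common(10))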

It should be noted that the data in Sindice is not complete; for instance, it does not contain all data sets that are catalogued by the DataHub.17 However, it crawls RDFa snippets, converts microformats, etc., which are not captured by the DataHub. Despite this inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster (see the appendix for details). Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data, and DBpedia is currently ranked second in terms of incoming links.

17 http://datahub.io


domain                datasets   links

purl.org              498        6717520
dbpedia.org           248        3960212
creativecommons.org   2483       3030910
identi.ca             1021       2359276
l3s.de                34         1261487
rkbexplorer.com       24         1212416
nytimes.com           27         1174941
w3.org                405        658535
geospecies.org        13         523709
livejournal.com       14881      366025

Table 8: Top 10 datasets by incoming links in Sindice

Fig. 9. The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia per quarter of 2012

6 DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint, and third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last couple of years.

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to their download popularity. The top five languages with respect to download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language edition are indicated in Figure 9. Hosting the DBpedia dataset downloads currently requires a bandwidth of approximately 18 TB per quarter, or 6 TB per month.

Furthermore, DBpedia consists of several data sets, which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 is depicted

DBpedia     Configuration

3.3 - 3.4   AMD Opteron 8220, 2.80 GHz, 4 cores, 32 GB
3.5 - 3.7   Intel Xeon E5520, 2.27 GHz, 8 cores, 48 GB
3.8         Intel Xeon E5-2630, 2.30 GHz, 8 cores, 64 GB

Table 9: Hardware of the machines serving the public SPARQL endpoint

in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1,000 times per month.18 Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Supposedly, they are used for network analysis or for providing relevant links in user interfaces, and they are downloaded more often because they are not provided via the official SPARQL endpoint.

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint19 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round-robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection of any size. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful hardware, as shown in Table 9.

The Virtuoso instance is configured to process queries within a 1200-second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.
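Clients needing more rows than the 50,000-row cap can page through a result set with LIMIT and OFFSET, for instance as in the following sketch (ORDER BY is added here only to keep pages stable; very deep offsets may still be rejected by the server):

    # Sketch: page through a large result set on the public endpoint.
    import requests

    ENDPOINT = "http://dbpedia.org/sparql"
    PAGE = 10000

    def iter_results(pattern):
        offset = 0
        while True:
            query = ("SELECT ?s WHERE { %s } ORDER BY ?s LIMIT %d OFFSET %d"
                     % (pattern, PAGE, offset))
            rows = requests.get(ENDPOINT, params={
                "query": query,
                "format": "application/sparql-results+json"}).json()["results"]["bindings"]
            if not rows:
                return
            yield from rows
            offset += PAGE

    for row in iter_results("?s a <http://dbpedia.org/ontology/Settlement>"):
        print(row["s"]["value"])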

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate-limited after a burst period,

18 The IP address was only filtered for that specific file and month in those cases.

19 http://dbpedia.org/sparql


Fig. 10. The download count (in thousands) and download volume (in TB) of the DBpedia data sets

2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally using too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate-limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10 MB) are enforced. If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code20 such as 509 and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.21.21 Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the last couple of DBpedia releases. The Avg/Day column represents the average number of hits (resp. visits) per day, followed by the Median and Standard Deviation. The last column

20 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

21 http://www.webalizer.org

DBpedia   Avg/Day   Median    Stdev     Maximum

3.3       733811    711306    188991    1319539
3.4       1212549   1165893   351226    2371657
3.5       1122612   1035444   386454    2908720
3.6       1328355   1286750   256945    2495031
3.7       2085399   1930728   1057398   8873473
3.8       2910410   2717775   1085640   7678490

Table 10: Number of SPARQL endpoint hits

Fig 11 Average number of SPARQL endpoint requests per day

shows the maximum number of hits (visits) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30-minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30-minute interval.
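This visit definition corresponds to a simple sessionisation rule; a sketch over parsed and time-sorted (IP, timestamp) log entries could look as follows:

    # Sketch: count visits; a visit groups requests from the same IP as long as
    # the gap between consecutive requests stays below 30 minutes.
    from collections import defaultdict
    from datetime import datetime, timedelta

    WINDOW = timedelta(minutes=30)

    def count_visits(entries):
        """entries: iterable of (ip, datetime) tuples, sorted by time."""
        last_seen = {}
        visits = defaultdict(int)
        for ip, ts in entries:
            if ip not in last_seen or ts - last_seen[ip] > WINDOW:
                visits[ip] += 1               # gap too large: a new visit starts
            last_seen[ip] = ts
        return sum(visits.values())

    log = [("10.0.0.1", datetime(2013, 1, 1, 10, 0)),
           ("10.0.0.1", datetime(2013, 1, 1, 10, 20)),   # same visit
           ("10.0.0.1", datetime(2013, 1, 1, 11, 30))]   # new visit
    print(count_visits(log))                              # -> 2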


DBpedia   Avg/Day   Median   Stdev   Maximum

3.3       9750      9775     2036    13126
3.4       11329     11473    1546    14198
3.5       16592     16902    2969    23129
3.6       19471     17664    5691    56064
3.7       23972     22262    10741   127869
3.8       16851     16711    2960    27416

Table 11: Number of SPARQL endpoint visits

Fig 12 Average SPARQL endpoint visits per day

Figure 11 and Figure 12 show the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoints,

2. the blocking of applications that were abusing the DBpedia endpoint,

3. the uptake of the language-specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days' worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

Endpoint   3.3    3.4     3.5    3.6     3.7     3.8

data       1355   2681    2230   2547    3714    4246
ontology   80     168     142    178     116     97
page       2634   4703    1810   1687    3787    7377
property   231    311     137    176     176     128
resource   2976   4080    2554   2309    4436    7143
sparql     2090   4177    2266   9112    15902   15475
other      252    420     342    277     579     695
total      9619   16541   9434   16286   28710   35142

Table 12: Hits per service to http://dbpedia.org, in thousands

Fig 13 Traffic Linked Data versus SPARQL endpoint

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x, redirecting the client to either the page (HTML-based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs pointing to either the page or data version of an article directly, or download the raw data from the links at the bottom of the page-based article. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology.
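The redirect behaviour can be observed with any HTTP client; in the sketch below the exact redirect targets and status codes are what the described setup suggests, not normative values:

    # Sketch: observe how a /resource/ URI redirects to /page/ or /data/
    # depending on the Accept header.
    import requests

    uri = "http://dbpedia.org/resource/Berlin"

    html = requests.get(uri, headers={"Accept": "text/html"}, allow_redirects=False)
    rdf = requests.get(uri, headers={"Accept": "text/turtle"}, allow_redirects=False)

    print(html.status_code, html.headers.get("Location"))   # e.g. 303 .../page/Berlin
    print(rdf.status_code, rdf.headers.get("Location"))     # e.g. 303 .../data/Berlin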

Figure 13 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focused on the calls to the sparql endpoint and counted the number of statements per type. As the log files only record the full SPARQL query on a GET request, all requests submitted via POST are counted as unknown.


Statement   3.3    3.4    3.5    3.6    3.7     3.8

ask         56     269    360    326    159     677
construct   55     31     14     11     22      38
describe    11     8      4      7      62      111
select      1891   3663   1658   8030   11204   13516
unknown     78     206    229    738    4455    1134
total       2090   4177   2266   9112   15902   15475

Table 13: Hits per statement type, in thousands

Statement   3.3    3.4    3.5    3.6    3.7    3.8

distinct    19.5   11.4   17.3   19.4   13.3   25.4
filter      45.7   13.7   31.8   25.3   29.2   36.1
functions   8.8    6.9    23.5   21.3   25.5   25.9
geo         27.7   7.0    39.6   6.7    9.3    9.2
group       0.0    0.0    0.0    0.0    0.0    0.2
limit       4.6    6.5    11.6   10.5   7.7    7.8
optional    30.6   23.8   47.3   26.3   16.7   17.2
order       2.3    1.3    1.9    3.2    1.2    1.6
union       3.3    3.2    26.4   11.3   12.1   20.6

Table 14: Trends in SPARQL SELECT queries (rounded values in %)

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– use of GEO objects
– GROUP BY
– LIMIT, OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of the SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of the various keywords as a percentage of the total SELECT queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
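The keyword analysis itself amounts to simple pattern counting over the logged queries, roughly as in the following sketch (log parsing, URL decoding and query normalisation are omitted):

    # Sketch: count SPARQL keyword usage as a percentage of logged SELECT queries.
    import re

    KEYWORDS = ["DISTINCT", "FILTER", "GROUP BY", "LIMIT", "OPTIONAL", "ORDER BY", "UNION"]

    def keyword_stats(queries):
        selects = [q for q in queries if re.search(r"\bSELECT\b", q, re.I)]
        return {kw: 100.0 * sum(bool(re.search(r"\b" + kw + r"\b", q, re.I))
                                for q in selects) / len(selects)
                for kw in KEYWORDS}

    sample = ["SELECT DISTINCT ?s WHERE { ?s ?p ?o } LIMIT 10",
              "SELECT ?s WHERE { ?s ?p ?o OPTIONAL { ?s ?q ?v } }"]
    print(keyword_stats(sample))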

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 14 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 14. Number of daily requests sent to the DBpedia Live endpoint for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013

7 Use Cases and Applications

With DBpedia evolving into a hub on the Web of Data, due to its coverage of various domains, it is an eminent data set for various tasks. It can not only improve Wikipedia search, but also serve as a data source for applications and mashups, as well as for text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [30]. For that purpose, DBpedia includes a number of specialized data sets.22 For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, to mention just a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [12]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

22 http://wiki.dbpedia.org/Datasets/NLP

7.1.1 Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as the key phrase extraction and entity linking tasks – refers to the task of finding real-world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [21].

As DBpedia reflects a vast amount of structured real-world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [31] is an open source tool23 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as a context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
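Calling the web service is a single HTTP request; the sketch below uses the service's JSON output (endpoint URL, parameter names and response fields are those documented for the public service at the time of writing and may change):

    # Sketch: annotate a sentence with DBpedia resources via DBpedia Spotlight.
    import requests

    resp = requests.get("http://spotlight.dbpedia.org/rest/annotate",   # public service endpoint (may change)
                        params={"text": "Berlin is the capital of Germany.",
                                "confidence": 0.4},
                        headers={"Accept": "application/json"})
    for res in resp.json().get("Resources", []):
        print(res["@surfaceForm"], "->", res["@URI"])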

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI24, the Semantic API from Ontos25, Open Calais26 and Zemanta27, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project.28

23 http://spotlight.dbpedia.org
24 http://www.alchemyapi.com
25 http://www.ontos.com
26 http://www.opencalais.com
27 http://www.zemanta.com

Tag disambiguation: Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki29 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [13]. The BBC30 employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia.31 DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series.32 Several QA systems, such as TBSL [41], PowerAqua [28], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

28 http://stanbol.apache.org
29 http://www.faviki.com
30 http://bbc.co.uk
31 http://www.aaai.org/Magazine/Watson/watson.php
32 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or as training data in this task, as was done by the Watson team [4] and others [25].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.

– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project.33 Recently, it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia.34 These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of these tools, structured by category.

33 http://viaf.org
34 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html

Facet-Based Browsers: An award-winning35 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full text from Wikipedia in order to compute hierarchical facets [14]. Another facet-based browser, which allows users to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [15]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser36 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying: The DBpedia Query Builder37 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [16] tool provides an intuitive interface which allows users to explore the neighborhood of and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [17] allows users to create statistical analysis queries and correlations in RDF data, and DBpedia in particular.

Spatial Applications: DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources available, written in 171 languages (of which approximately 147 can be considered active38), containing information about hundreds of spoken and even ancient languages. For example, the English Wiktionary contains nearly 3 million words.39 A Wiktionary page provides, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies and, most prominently, senses. Within this tree, numerous kinds of linguistic properties are given, including synonyms, hyponyms, hyperonyms, example sentences, links to Wikipedia and many more. The fast-changing nature of the project, together with its fragmentation into Wiktionary language editions (WLE) with independent layout rules, poses the biggest problem for an automated transformation into a structured knowledge base. We identified this as a serious problem: although the value of Wiktionary is known and usage scenarios are obvious, only some rudimentary tools exist, which either focus on a specific subset of the data or cover only one or two WLEs [18]. Existing tools can be seen as adapters to single WLEs: they are hard to maintain, and there are too many languages that constantly change and require a programmer to refactor complex code. As opposed to the currently employed classic and straightforward approach (implementing software adapters for scraping), Wiktionary2RDF [18] employs a declarative mediator/wrapper pattern. Figure 15 shows the workflow of the specially created Wiktionary extractor, which takes as input one configuration XML file for each WLE. The aim is to enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves. A simple XML dialect was created to encode the "entry layout explained" (ELE) guidelines and to declare triple patterns that define how the resulting RDF should be built.

35 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
36 http://dbpedia.org/fct
37 http://querybuilder.dbpedia.org
38 http://s23.org/wikistats/wiktionaries_html.php

An example is given for the following wiki syntax:

===Synonyms===
[[building]]
[[company]]

The regular expression for scraping and the triple generation rule, encoded in XML, look like:

<wikiTemplate>===Synonyms===
( [[$target]]
)+
</wikiTemplate>

<triple s="http://some.ns/$entityId" p="http://some.ns/hasSynonym"
        o="http://some.ns/$target" />

39 See http://en.wiktionary.org/wiki/semantic for a simple example page.

Fig 15 Detail view of the Wiktionary extractor

This configuration is interpreted and run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data.40 Statistics are shown in Table 15.

8 Related Work

8.1 Cross-Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata.41 Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering the truth about things, but at providing statements about them. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided into three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

40 http://wiktionary.dbpedia.org
41 http://wikidata.org


language   words     triples    resources   predicates   senses   XML lines

en         2142237   28593364   11804039    28           424386   930
fr         4657817   35032121   20462349    22           592351   490
ru         1080156   12813437   5994560     17           149859   1449
de         701739    5618508    2966867     16           122362   671

Table 15: Statistical comparison of extractions for different languages. XML lines measures the number of lines of the XML configuration files.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions; thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface42 as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase43 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [32] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version.

42 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface

43 http://www.freebase.com

This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and in areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3 YAGO

One of the projects which pursues similar goals to DBpedia is YAGO44 [39]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet. WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [20], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content, whereas YAGO focuses on extracting a smaller number of relations compared to DBpedia in order to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

44 http://www.mpi-inf.mpg.de/yago-naga/yago

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects45 that are able to process wiki syntax, as well as a list of data extraction extensions46 for MediaWiki.

JWPL (Java Wikipedia Library [46]) is an open-source Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as an XML dump and is incorporated in the lexical resource UBY47 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [34]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguation or for synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service.48 Additional projects presented by the authors that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site.49

An earlier approach to improve the quality of the infobox schemata and contents is described in [44].

45 http://www.mediawiki.org/wiki/Alternative_parsers
46 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction
47 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
48 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
49 http://sigwp.org/en

The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text by applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [35], the authors describe their effort of building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike50 and Hudong Baike51. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [42]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

KnowItAll52 is a web-scale knowledge extraction effort which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [26] is a large common sense knowledge base which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [37] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [43] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [44] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking. [2] was an infobox extraction approach for Wikipedia which later became the DBpedia project.

50 http://baike.baidu.com
51 http://www.hudong.com
52 http://www.cs.washington.edu/research/knowitall

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included, in particular, (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that DBpedia has matured and improved significantly in the last years, in particular also in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof of concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [18] meanwhile becoming part of DBpedia, and LinkedGeoData [38], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion: An area which is still largely unexplored is the integration and fusion of different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach a significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement: In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [45], we took a first step in this direction.

Inline extraction: Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki.53 If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia: While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other nicely. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia: A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].
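One such sanity check could be expressed as in the following sketch; the property names come from the DBpedia ontology, and the endpoint is assumed to be DBpedia Live:

    # Sketch: sanity check that flags persons whose death date precedes their birth date.
    import requests

    query = """
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT ?person ?birth ?death WHERE {
      ?person a dbo:Person ;
              dbo:birthDate ?birth ;
              dbo:deathDate ?death .
      FILTER (?death < ?birth)
    }
    LIMIT 100
    """
    resp = requests.get("http://live.dbpedia.org/sparql",   # assumed DBpedia Live endpoint
                        params={"query": query,
                                "format": "application/sparql-results+json"})
    for row in resp.json()["results"]["bindings"]:
        print(row["person"]["value"], row["birth"]["value"], row["death"]["value"])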

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP: Despite recent advances (cf. Section 7), there is still a huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) or to update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Paul Kreis for DBpedia development
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)
  * Instance Data Analysis: Volha Bryl (working at the University of Mannheim)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.
[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.
[3] C. Becker and C. Bizer. Exploring the geospatial semantic web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.
[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.
[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.
[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the web of data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.
[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.
[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.
[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.
[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.
[12] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.
[13] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.
[14] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[15] P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.
[16] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.
[17] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.
[18] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In Proc. of the Joint International Semantic Technology Conference, 2012.
[19] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia Live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.
[20] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.
[21] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.
[22] D. Kontokostas, S. Auer, S. Hellmann, J. Lehmann, P. Westphal, R. Cornelissen, and A. Zaveri. Test-driven data quality evaluation for SPARQL endpoints. In Submitted to 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.
[23] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of linked data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.
[24] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.
[25] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.
[26] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.
[27] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational Linguistics, pages 495–501, 2000.
[28] V. Lopez, M. Fernandez, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the semantic web. Semantic Web, 3(3):249–265, 2012.
[29] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[30] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.
[31] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.
[32] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.
[33] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.
[34] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.
[35] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin, Heidelberg, 2011. Springer-Verlag.
[36] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. Int. J. of Metadata, Semantics and Ontologies, 3:37–52, 2008.
[37] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.
[38] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.
[39] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: a core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.
[40] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.
[41] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648. ACM, 2012.
[42] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.
[43] F. Wu and D. Weld. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th World Wide Web Conference, 2008.
[44] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.
[45] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.
[46] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

Appendix

A Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net>
PREFIX an: <http://vocab.sindice.net/analytics>
PREFIX dom: <http://sindice.com/dataspace/default/domain>
SELECT ?Dataset ?Target_Dataset ?predicate
       SUM(xsd:long(?cardinality)) AS ?total
FROM <http://sindice.com/analytics> {
  ?target any23:domain_uri ?Target_Dataset .
  {
    ?edge an:source ?source ;
          an:target ?target ;
          an:label ?predicate ;
          an:cardinality ?cardinality ;
          an:publishedIn ?Dataset .
  }
  ?source any23:domain_uri ?Dataset .
  FILTER (?Dataset != ?Target_Dataset)
}

Listing 1: SPARQL query on the Sindice summary graph for obtaining the number of incoming links


Fig 3 Snapshot of a part of the DBpedia ontology, showing the top classes (owl:Thing, dbo:Agent, dbo:Person, dbo:Athlete, dbo:Place, dbo:PopulatedPlace, dbo:Settlement, dbo:Organisation, dbo:Work, dbo:Species, dbo:Eukaryote) connected via rdfs:subClassOf, together with selected object and datatype properties (e.g. dbo:birthPlace, dbo:writer, dbo:producer, dbo:populationTotal, dbo:areaTotal, dbo:birthDate, dbo:conservationStatus) and their domains and ranges

3 DBpedia Ontology

The DBpedia ontology consists of 320 classes which form a subsumption hierarchy and are described by 1,650 different properties. With a maximal depth of 5, the subsumption hierarchy is intentionally kept rather shallow, which fits use cases in which the ontology is visualized or navigated. Figure 3 depicts a part of the DBpedia ontology, indicating the relations among the top ten classes of the DBpedia ontology, i.e. the classes with the highest number of instances.
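The class hierarchy itself can be retrieved from the public SPARQL endpoint. The following query is a minimal illustrative sketch (not one of the article's original listings), assuming only the standard rdfs vocabulary and the http://dbpedia.org/ontology/ namespace used throughout this article; it returns all subclass relations between DBpedia ontology classes:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?subClass ?superClass
WHERE {
  # keep only pairs where both classes are in the DBpedia ontology namespace
  ?subClass rdfs:subClassOf ?superClass .
  FILTER (regex(str(?subClass), "^http://dbpedia.org/ontology/"))
  FILTER (regex(str(?superClass), "^http://dbpedia.org/ontology/"))
}
ORDER BY ?superClass ?subClass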

Fig 4 Growth of the DBpedia ontology: number of classes and properties per DBpedia version (3.2 to 3.8)

The DBpedia ontology is maintained and extended by the community in the DBpedia Mappings Wiki. Figure 4 depicts the growth of the DBpedia ontology over time. While the number of classes is not growing too much due to the already good coverage of the initial version of the ontology, the number of properties increases over time due to the collaboration on the DBpedia Mappings Wiki and the addition of more detailed information to infoboxes by Wikipedia editors.

3.1 Mapping Statistics

As of April 2013, there exist mapping communities for 27 languages, 23 of which are active. Figure 5 shows statistics for the coverage of these mappings in DBpedia. Figures (a) and (c) refer to the absolute number of template and property mappings that are defined for every DBpedia language edition. Figures (b) and (d) depict the percentage of the defined template and property mappings compared to the total number of available templates and properties for every Wikipedia language edition. Figures (e) and (g) show the occurrences (instances) that the defined template and property mappings have in Wikipedia. Finally, figures (f) and (h) give the percentage of the mapped template and property occurrences compared to the total template and property occurrences in a Wikipedia language edition.

It can be observed in the figure that the Portuguese DBpedia language edition is the most complete regarding mapping coverage. Other language editions such as Bulgarian, Dutch, English, Greek, Polish and Spanish have mapped templates covering more than 50% of total template occurrences. In addition, almost all languages have covered more than 20% of property occurrences, with Bulgarian and Portuguese reaching up to 70%.


Fig 5 Mapping coverage statistics for all mapping-enabled languages

The mapping activity of the ontology enrichment process, along with the editing of the ten most active mapping language communities, is depicted in Figure 6. It is interesting to notice that the high mapping activity peaks coincide with the DBpedia release dates. For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarter of that year have a very high activity compared to the 4th quarter. In the last two years (2012 and 2013), most of the DBpedia mapping language communities have defined their own chapters and have their own release dates. Thus, recent mapping activity shows less fluctuation.

Finally, Figure 7 shows the English property mappings occurrence frequency. Both axes are in log scale and represent the number of property mappings (x axis) that have exactly y occurrences (y axis). The occurrence frequency follows a long tail distribution. Thus, a low number of property mappings have a high number of occurrences, and a high number of property mappings have a low number of occurrences.

3.2 Instance Data

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding language.


Fig 6 Mapping community activity for (a) ontology and (b) 10 most active language editions

Fig 7 English property mappings occurrence frequency (both axes are in log scale)

For 20 of these languages, we report in this section the overall number of entities being described by the localized versions as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5), CD = Canonicalized data sets (see Section 2.5), all = Overall number of instances in the data set including instances without infobox data, with MD = Number of instances for which mapping-based infobox data exists, Raw Properties = Number of different properties that are generated by the raw infobox extractor, Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor, Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor, Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language correspond to instances in the English version of DBpedia, and which portion of the instances is described by clean mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results on the one hand from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have been so far created in the Mapping Wiki for a specific language.

Table 3 reports the number of instances for a set of selected ontology classes within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to a specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.
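Per-class instance counts of this kind can be reproduced directly against a DBpedia SPARQL endpoint. The following query is a sketch added here purely for illustration (it is not one of the article's listings); it groups all typed resources by their DBpedia ontology class:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT ?class (COUNT(?instance) AS ?count)
WHERE {
  ?instance rdf:type ?class .
  # restrict to classes of the DBpedia ontology
  FILTER (regex(str(?class), "^http://dbpedia.org/ontology/"))
}
GROUP BY ?class
ORDER BY DESC(?count)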

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages; the second column contains the number of instances that are described only in a single language version; the next column contains the number of instances that are contained in two languages but not in three or more languages, etc. For example, 12936 persons are described in five languages but not in six


      Inst. LD all   Inst. CD all   Inst. with MD CD   Raw Prop. CD   Map Prop. CD   Raw Statem. CD   Map Statem. CD

en    3769926        3769926        2359521            48293          1313           65143840         33742015
de    1243771        650037         204335             9593           261            7603562          2880381
fr    1197334        740044         214953             13551          228            8854322          2901809
it    882127         580620         383643             9716           181            12227870         4804731
es    879091         542524         310348             14643          476            7740458          4383206
pl    848298         538641         344875             7306           266            7696193          4511794
ru    822681         439605         123011             13522          76             6973305          1389473
pt    699446         460258         272660             12851          602            6255151          4005527
ca    367362         241534         112934             8696           183            3689870          1301868
cs    225133         148819         34893              5564           334            1857230          474459
hu    209180         138998         63441              6821           295            2506399          601037
ko    196132         124591         30962              7095           419            1035606          417605
tr    187850         106644         40438              7512           440            1350679          556943
ar    165722         103059         16236              7898           268            635058           168686
eu    132877         108713         41401              2245           19             2255897          532709
sl    129834         73099          22036              4235           470            1213801          222447
bg    125762         87679          38825              3984           274            774443           488678
hr    109890         71469          10343              3334           158            701182           151196
el    71936          48260          10813              2866           288            206460           113838

Table 2: Basic statistics about Localized DBpedia Editions

Class          en       it       pl       es       pt       fr      de       ru      ca      hu

Person         763643   145060   70708    65337    43057    62942   33122    18620   7107    15529
  Athlete      185126   47187    30332    19482    14130    21646   31237    0       721     4527
  Artist       61073    12511    16120    25992    10571    13465   0        0       2004    3821
  Politician   23096    0        8943     5513     3342     0       0        12004   1376    760
Place          572728   141101   182727   132961   116660   80602   131766   67932   73078   18324
  PopulPlace   387166   138077   167034   121204   109418   72252   79410    63826   72743   15535
  Building     60514    1270     2946     3570     803      921     83       43      0       527
  River        24267    0        1383     2        4149     3333    6707     3924    0       565
Organisation   192832   4142     12193    11710    10949    17513   16973    1598    1399    3993
  Company      44516    4142     2566     975      1903     5832    7200     0       440     618
  EducInst     42270    0        599      1207     270      1636    2171     1010    0       115
  Band         27061    0        2993     0        4476     3868    5368     0       263     802
Work           333269   51918    32386    36484    33869    39195   18195    34363   5240    9777
  MusicWork    159070   23652    14987    15911    17053    16206   0        6588    697     4704
  Film         71715    17210    9467     9396     8896     9741    15038    12045   3859    2320
  Software     27947    5682     3050     4833     3419     5733    2368     0       606     857

Table 3: Number of instances per class within 10 localized DBpedia versions

or more languages. The number 871630 for the class Person means that all 20 language versions together describe 871630 different persons. The number is higher than the number of persons described in the canonicalized English infobox data set (763643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195263 persons are described in two or more languages. The large difference of this number compared to the total number of 871630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).


Class          Instances   1        2        3       4       5       6       7      8      9      10+

Person         871630      676367   94339    42382   21647   12936   8198    5295   3437   2391   4638
Place          643260      307729   150349   45836   36339   20831   13523   20808  31422  11262  5161
Organisation   206670      160398   22661    9312    5002    3221    2072    1421   928    594    1061
Work           360808      243706   54855    23097   12605   8277    5732    4007   2911   1995   3623

Table 4: Cross-language overlap: Number of instances that are described in multiple languages

3.3 Internationalisation Community

The introduction of the mapping-based infobox extractor alongside live synchronisation approaches in [19] allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages6. The DBpedia 3.7 release7 in September 2011 was the first DBpedia release to use the localized I18n DBpedia extraction framework [23].

At the time of writing, official DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish8. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services. Recently, the DBpedia internationalisation committee9 has manifested its structure and each language edition has a representative with a vote in elections. In some cases (e.g. Greek10 and Dutch11) the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [23].

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how

6 ar, bg, bn, ca, cs, de, el, en, es, et, eu, fr, ga, hi, hr, hu, id, it, ja, ko, nl, pl, pt, ru, sl, tr, ur
7 http://blog.dbpedia.org/2011/09/11/
8 Accessed on 25/02/2013: http://wiki.dbpedia.org/Internationalization/Chapters
9 http://wiki.dbpedia.org/Internationalization
10 http://el.dbpedia.org
11 http://nl.dbpedia.org

many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage. The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is from achieving total coverage (cf. Figure 5), and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

4 Live Synchronisation

Wikipedia articles are continuously revised at a very high rate, e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute12. This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [19,33]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported13. Meanwhile, DBpedia Live for Dutch14 was developed.

12 http://stats.wikimedia.org/EN/SummaryEN.htm
13 http://live.dbpedia.org
14 http://live.nl.dbpedia.org


4.1 DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia foundation kindly provided us access to their update stream using the OAI-PMH protocol [24]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component serving as a proxy constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.
– Mappings Wiki: The DBpedia Mappings Wiki described in Section 2.4 serves as secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.
– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.
– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2 Features of DBpedia Live

The core components of the DBpedia Live Extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.

Fig 8 Overview of DBpedia Live extraction framework

– Unmodified Pages: The update of unmodified pages at regular intervals.
– Changesets Publication: The publication of triple-changesets.
– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.
– Data Isolation: Separate data from different sources.

Mapping-Affected Pages: Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.

Unmodified Pages: Extraction framework improvements or activation / deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feed, and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets: Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set for the added triples and another set for the deleted triples. We write those two sets into N-Triples files, compress them and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.

Synchronisation Tool: The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.
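As an illustration of what such an update amounts to, the following sketch (with invented example triples, not actual changeset content) shows the pair of SPARQL Update operations a mirror would issue against its local store for one changeset, deleting the removed triples and inserting the added ones:

# example values only; real changesets contain the triples extracted from the revised article
DELETE DATA {
  <http://dbpedia.org/resource/Berlin>
      <http://dbpedia.org/ontology/populationTotal>
      "3292365"^^<http://www.w3.org/2001/XMLSchema#integer> .
} ;
INSERT DATA {
  <http://dbpedia.org/resource/Berlin>
      <http://dbpedia.org/ontology/populationTotal>
      "3375222"^^<http://www.w3.org/2001/XMLSchema#integer> .
}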

Data Isolation: In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and mapping-based infobox extraction.
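For example, a client that is only interested in the freshly extracted article data can scope its query to the corresponding named graph. The following minimal sketch (using Berlin as an arbitrary example resource) restricts a lookup to the live article-update graph listed above:

SELECT ?p ?o
WHERE {
  # only triples from the article update feed, not static links or the ontology
  GRAPH <http://live.dbpedia.org> {
    <http://dbpedia.org/resource/Berlin> ?p ?o .
  }
}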

5 Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as the external data sets that set links pointing at DBpedia resources.

5.1 Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third party data sets. The DBpedia project maintains a link repository15 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third column contain the predicate that is used for linking as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

Links in DBpedia have been used for various purposes. One example is the combination of data about

15 https://github.com/dbpedia/dbpedia-links

Data set                  Predicate                Count        Tool

Amsterdam Museum          owl:sameAs               627          S
BBC Wildlife Finder       owl:sameAs               444          S
Book Mashup               rdf:type, owl:sameAs     9 100
Bricklink                 dc:publisher             10 100
CORDIS                    owl:sameAs               314          S
Dailymed                  owl:sameAs               894          S
DBLP Bibliography         owl:sameAs               196          S
DBTune                    owl:sameAs               838          S
Diseasome                 owl:sameAs               2 300        S
Drugbank                  owl:sameAs               4 800        S
EUNIS                     owl:sameAs               3 100        S
Eurostat (Linked Stats)   owl:sameAs               253          S
Eurostat (WBSG)           owl:sameAs               137
CIA World Factbook        owl:sameAs               545          S
flickr wrappr             dbp:hasPhotoCollection   3 800 000    C
Freebase                  owl:sameAs               3 600 000    C
GADM                      owl:sameAs               1 900
GeoNames                  owl:sameAs               86 500       S
GeoSpecies                owl:sameAs               16 000       S
GHO                       owl:sameAs               196          L
Project Gutenberg         owl:sameAs               2 500        S
Italian Public Schools    owl:sameAs               5 800        S
LinkedGeoData             owl:sameAs               103 600      S
LinkedMDB                 owl:sameAs               13 800       S
MusicBrainz               owl:sameAs               23 000
New York Times            owl:sameAs               9 700
OpenCyc                   owl:sameAs               27 100       C
OpenEI (Open Energy)      owl:sameAs               678          S
Revyu                     owl:sameAs               6
Sider                     owl:sameAs               2 000        S
TCMGeneDIT                owl:sameAs               904
UMBEL                     rdf:type                 896 400
US Census                 owl:sameAs               12 600
WikiCompany               owl:sameAs               8 300
WordNet                   dbp:wordnet_type         467 100
YAGO2                     rdf:type                 18 100 000

Sum                                                27 211 732

Table 5: Data sets linked from DBpedia

European Union project funding (FTS) [29] and data about countries in DBpedia. The following query compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia):

SELECT * {
  {
    SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
    }
  }
  {
    SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
    }
  }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

In addition to providing outgoing links on instance-level, DBpedia also sets links on schema-level pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively. In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
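These schema-level links can themselves be queried on the public endpoint. The following query is a small illustrative sketch (not part of the original listings) that lists all DBpedia ontology classes declared equivalent to a schema.org term:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?dbpediaClass ?schemaOrgClass
WHERE {
  ?dbpediaClass owl:equivalentClass ?schemaOrgClass .
  # keep only equivalences that point into the schema.org namespace
  FILTER (regex(str(?schemaOrgClass), "^http://schema.org/"))
}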

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of links pointing at DBpedia from other data sets is 39,007,478 according to the Data Hub16. However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [36]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph

16 See http://wiki.dbpedia.org/Interlinking for details.

Data set         Link Predicate Count   Link Count

okaboo.com       4                      2407121
tfri.gov.tw      57                     837459
naplesplus.us    2                      212460
fu-berlin.de     7                      145322
freebase.com     108                    138619
geonames.org     3                      125734
opencyc.org      3                      19648
geospecies.org   10                     16188
dbrec.net        3                      12856
faviki.com       5                      12768

Table 6: Top 10 data sets in Sindice ordered by the number of links to DBpedia

Metric                      Value

Total links                 3960212
Total distinct data sets    248
Total distinct predicates   345

Table 7: Sindice summary statistics for incoming links to DBpedia

summary over all resources they store. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing at DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia, along with the used link predicate and the number of links.

It should be noted that the data in Sindice is not complete, for instance it does not contain all data sets that are catalogued by the DataHub17. However, it crawls for RDFa snippets, converts microformats, etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster (see appendix for details). Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data, and DBpedia is currently ranked second in terms of incoming links.

17 http://datahub.io


domain                datasets   links

purl.org              498        6717520
dbpedia.org           248        3960212
creativecommons.org   2483       3030910
identi.ca             1021       2359276
l3s.de                34         1261487
rkbexplorer.com       24         1212416
nytimes.com           27         1174941
w3.org                405        658535
geospecies.org        13         523709
livejournal.com       14881      366025

Table 8: Top 10 datasets by incoming links in Sindice

Fig 9 The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia per quarter (2012-Q1 to 2012-Q4)

6 DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint and, third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last couple of years.

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to the download popularity as well. The top five languages with respect to the download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language edition is indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 18 TB per quarter or 6 TB per month is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 is depicted

DBpedia     Configuration

3.3 – 3.4   AMD Opteron 8220, 2.80GHz, 4 Cores, 32GB
3.5 – 3.7   Intel Xeon E5520, 2.27GHz, 8 Cores, 48GB
3.8         Intel Xeon E5-2630, 2.30GHz, 8 Cores, 64GB

Table 9: Hardware of the machines serving the public SPARQL endpoint

in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1000 times per month18. Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of links exists between two resources. Supposedly, they are used for network analysis or providing relevant links in user interfaces, and are downloaded more often as they are not provided via the official SPARQL endpoint.

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint19 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports infinite horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection of any size. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful hardware, as shown in Table 9.

The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate limited after a burst period,

18 The IP address was only filtered for that specific file and month in those cases.
19 http://dbpedia.org/sparql


Fig 10 The download count (in thousands) and download volume (in TB) of the DBpedia data sets

2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally use too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10MB) are enforced. If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code20, like 509, and quickly drops the connection. The system further uses an iptables based firewall for permanent blocking of clients identified by their IP addresses.

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.21²¹. Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the last couple of DBpedia releases. The Avg/Day column represents the average number of hits (resp. visits) per day, followed by the Median and Standard Deviation. The last column

20 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
21 http://www.webalizer.org

DBpedia   Avg/Day   Median    Stdev     Maximum

3.3       733811    711306    188991    1319539
3.4       1212549   1165893   351226    2371657
3.5       1122612   1035444   386454    2908720
3.6       1328355   1286750   256945    2495031
3.7       2085399   1930728   1057398   8873473
3.8       2910410   2717775   1085640   7678490

Table 10: Number of SPARQL endpoint hits

Fig 11 Average number of SPARQL endpoint requests per day

shows the maximum number of hits (visits) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30 minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30 minute interval.


DBpedia   Avg/Day   Median   Stdev   Maximum

3.3       9750      9775     2036    13126
3.4       11329     11473    1546    14198
3.5       16592     16902    2969    23129
3.6       19471     17664    5691    56064
3.7       23972     22262    10741   127869
3.8       16851     16711    2960    27416

Table 11: Number of SPARQL endpoint visits

Fig 12 Average SPARQL endpoint visits per day

Figure 11 and Figure 12 show the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data Hub returning resources in a number of different formats. For each data set, we randomly selected 14 days worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

Endpoint   3.3    3.4     3.5    3.6     3.7     3.8

data       1355   2681    2230   2547    3714    4246
ontology   80     168     142    178     116     97
page       2634   4703    1810   1687    3787    7377
property   231    311     137    176     176     128
resource   2976   4080    2554   2309    4436    7143
sparql     2090   4177    2266   9112    15902   15475
other      252    420     342    277     579     695
total      9619   16541   9434   16286   28710   35142

Table 12: Hits per service to http://dbpedia.org in thousands

Fig 13 Traffic Linked Data versus SPARQL endpoint

The resource endpoint uses the Accept line in the HTTP header sent by the client to return a HTTP status code 30x to redirect the client to either the page (HTML based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data from the links at the bottom of the page based article. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology.

Figure 13 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13 we focussed on the calls to the sparql endpoint and counted the number of statements per type. As the log files only record the full SPARQL query on a GET request, all the PUT requests are counted as unknown.


Statement   3.3    3.4    3.5    3.6    3.7     3.8

ask         56     269    360    326    159     677
construct   55     31     14     11     22      38
describe    11     8      4      7      62      111
select      1891   3663   1658   8030   11204   13516
unknown     78     206    229    738    4455    1134
total       2090   4177   2266   9112   15902   15475

Table 13: Hits per statement type in thousands

Statement   3.3    3.4    3.5    3.6    3.7    3.8

distinct    19.5   11.4   17.3   19.4   13.3   25.4
filter      45.7   13.7   31.8   25.3   29.2   36.1
functions   8.8    6.9    23.5   21.3   25.5   25.9
geo         27.7   7.0    39.6   6.7    9.3    9.2
group       0.0    0.0    0.0    0.0    0.0    0.2
limit       4.6    6.5    11.6   10.5   7.7    7.8
optional    30.6   23.8   47.3   26.3   16.7   17.2
order       2.3    1.3    1.9    3.2    1.2    1.6
union       3.3    3.2    26.4   11.3   12.1   20.6

Table 14: Trends in SPARQL select (rounded values in %)

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– Use of GEO objects
– GROUP BY
– LIMIT, OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and usage in property tags. Table 14 shows the use of various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 14 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig 14 Number of daily requests sent to DBpedia Live for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013

7 Use Cases and Applications

With DBpedia evolving into a hub on the Web of Data due to its coverage of various domains, it is an eminent data set for various tasks. It can not only improve Wikipedia search, but also serve as a data source for applications and mashups, as well as for text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [30]. For that purpose, DBpedia includes a number of specialized data sets22. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [12]. The thematic concepts data set of resources can be used for creating a corpus from

22 http://wiki.dbpedia.org/Datasets/NLP


Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

7.1.1 Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking tasks – refers to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [21].

As DBpedia reflects a vast amount of structured real world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [31] is an open source tool23 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base described as SPARQL queries.
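As a hypothetical illustration of such a restriction (the query is not taken from the Spotlight documentation), a user who only wants mentions of settlements located in Germany to be annotated could describe the allowed target resources with a query along the following lines; dbo:Settlement and dbo:country are existing DBpedia ontology terms, but their combination here is an assumed example configuration:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?resource
WHERE {
  # candidate annotation targets: settlements whose country is Germany
  ?resource rdf:type dbo:Settlement .
  ?resource dbo:country <http://dbpedia.org/resource/Germany> .
}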

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI24, Semantic API from Ontos25, Open Calais26 and Zemanta27, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems

23 http://spotlight.dbpedia.org
24 http://www.alchemyapi.com
25 http://www.ontos.com
26 http://www.opencalais.com
27 http://www.zemanta.com

(without disambiguation) in the context of the Apache Stanbol project28.

Tag disambiguation: Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki29 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [13]. The BBC30 employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system. The Watson system won a $1 million prize in Jeopardy and relies on several data sets including DBpedia31. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series32. Several QA systems such as TBSL [41], PowerAqua [28], FREyA [9] and QAKiS [6] have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

28 http://stanbol.apache.org
29 http://www.faviki.com
30 http://bbc.co.uk
31 http://www.aaai.org/Magazine/Watson/watson.php
32 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald
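To give a flavour of the task, a QALD-style natural language question such as "Who wrote The Lord of the Rings?" has to be translated by a QA system into a SPARQL query over DBpedia, roughly like the following sketch (the choice of dbo:author and of the resource URI is an assumption made here purely for illustration):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?author
WHERE {
  # "Who wrote The Lord of the Rings?"
  dbr:The_Lord_of_the_Rings dbo:author ?author .
}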


Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property from mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [25].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.
– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.
– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority Files (VIAF) project33. Recently it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia34. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.
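Assuming the harvested links are represented, like most DBpedia identity links, as owl:sameAs triples pointing to viaf.org URIs (an assumption made for illustration, not a statement about the exact layout of that data set), a library could look up the DBpedia resources connected to its VIAF identifiers with a query such as:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?dbpediaResource ?viafId
WHERE {
  ?dbpediaResource owl:sameAs ?viafId .
  # keep only identity links whose target is a viaf.org URI
  FILTER (regex(str(?viafId), "^http://viaf.org/"))
}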

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of tools and structure them in categories.

33 http://viaf.org
34 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html

Facet Based Browsers: An award-winning35 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [14]. Another facet based browser, which allows to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [15]. A generic SPARQL based facet explorer, which also uses a graph based visualisation of facets, is LODLive [7]. The OpenLink built-in facet based browser36 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying: The DBpedia Query Builder37 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [16] tool provides an intuitive interface which allows to explore the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [17] allows to create statistical analysis queries and correlations in RDF data and DBpedia in particular.

Spatial Applications: DBpedia Mobile [3] is a location aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources available, written in 171 languages (of which approximately 147 can be considered active38), containing information about hundreds of spoken and even ancient

35 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany
36 http://dbpedia.org/fct
37 http://querybuilder.dbpedia.org
38 http://s23.org/wikistats/wiktionaries_html.php


languages. For example, the English Wiktionary contains nearly 3 million words39. A Wiktionary page provides, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies and most prominently senses. Within this tree, numerous kinds of linguistic properties are given, including synonyms, hyponyms, hyperonyms, example sentences, links to Wikipedia and many more. The fast changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, poses the biggest problem to the automated transformation into a structured knowledge base. We identified this as a serious problem: Although the value of Wiktionary is known and usage scenarios are obvious, only some rudimentary tools exist, which either focus on a specific subset of the data or only cover one or two WLEs [18]. Existing tools can be seen as adapters to single WLEs – they are hard to maintain, and there are too many languages that constantly change and require a programmer to refactor complex code. Opposed to the currently employed classic and straight-forward approach (implementing software adapters for scraping), Wiktionary2RDF [18] employs a declarative mediator/wrapper pattern. Figure 15 shows the work flow of the specially created Wiktionary extractor that takes as input one config XML file for each WLE. The aim is to enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves. A simple XML dialect was created to encode the "entry layout explained" (ELE) guidelines and declare triple patterns that define how the resulting RDF should be built. An example is given for the following wiki syntax:

===Synonyms===
[[building]]
[[company]]

Regular expression for scraping and the triple generation rule encoded in XML look like:

<wikiTemplate>===Synonyms===
( [[$target]]
)+
</wikiTemplate>

<triple s="http://some.ns/$entityId" p="http://some.ns/hasSynonym" o="http://some.ns/$target" />

39 See http://en.wiktionary.org/wiki/semantic for a simple example page.

Fig 15 Detail view of the Wiktionary extractor

This configuration is interpreted and run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data40. Statistics are shown in Table 15.

8 Related Work

8.1 Cross Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, the Wikimedia Germany e.V. started the development of Wikidata41. Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering the truth about things, but at providing statements about them. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

40 http://wiktionary.dbpedia.org
41 wikidata.org


language   words     triples    resources   predicates   senses   XML lines

en         2142237   28593364   11804039    28           424386   930
fr         4657817   35032121   20462349    22           592351   490
ru         1080156   12813437   5994560     17           149859   1449
de         701739    5618508    2966867     16           122362   671

Table 15: Statistical comparison of extractions for different languages. XML lines measures the number of lines of the XML configuration files

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.
– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface42 as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase43 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [32] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only. Deleted triples are marked and the system can easily revert to a previous version.

42 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
43 http://www.freebase.com

This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular, in focus areas of Google and areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3 YAGO

One of the projects which pursues similar goals to DBpedia is YAGO44 [39]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet. WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [20], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content. YAGO focuses on extracting a smaller number of relations compared to DBpedia, in order to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

44 http://www.mpi-inf.mpg.de/yago-naga/yago

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects45 that are able to process wiki syntax, as well as a list of data extraction extensions46 for MediaWiki.

JWPL (Java Wikipedia Library) [46] is an open-source Java-based API that provides access to the information offered by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as XML dump and is incorporated in the lexical resource UBY47 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [34]. Features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguation, synonyms, translations, taxonomic relations, and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service.48 Additional projects presented by the authors that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site.49

An earlier approach to improve the quality of the infobox schemata and contents is described in [44].

45 http://www.mediawiki.org/wiki/Alternative_parsers
46 http://www.mediawiki.org/wiki/Extension_Matrix (data extraction extensions)
47 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby/
48 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
49 http://sigwp.org/en/

The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text by applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [35] the authors describe their effort of building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike50 and Hudong Baike51. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [42]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories, and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

KnowItAll52 is a web-scale knowledge extraction effort which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [26] is a large common sense knowledge base which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [37] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [43] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [44] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking. The work in [2] was an infobox extraction approach for Wikipedia which later became the DBpedia project.

50 http://baike.baidu.com
51 http://www.hudong.com
52 http://www.cs.washington.edu/research/knowitall

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included, in particular, (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that DBpedia has matured and improved significantly in the last years, in particular also in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [18] meanwhile becoming part of DBpedia, and LinkedGeoData [38], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach a significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors as well as wrong or outdated information in Wikipedia.
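As a rough illustration of such a comparison, a federated SPARQL 1.1 query along the following lines could contrast the population values extracted by two chapters for the same resources. This is only a sketch under assumptions: the German chapter endpoint URL, the use of dbo:populationTotal, and the owl:sameAs linking pattern are illustrative choices, not a description of an implemented fusion workflow.

PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?city ?popEN ?popDE WHERE {
  ?city a dbo:City ;
        dbo:populationTotal ?popEN ;
        owl:sameAs ?cityDE .
  FILTER (STRSTARTS(STR(?cityDE), "http://de.dbpedia.org/resource/"))
  SERVICE <http://de.dbpedia.org/sparql> {   # assumed chapter endpoint
    ?cityDE dbo:populationTotal ?popDE .
  }
  FILTER (?popEN != ?popDE)                  # report only conflicting values
}
LIMIT 100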

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [45], we took a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki.53 If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other nicely. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].
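A minimal sketch of such a sanity check, assuming the dbo:birthDate and dbo:deathDate properties of the DBpedia ontology, could look as follows; it lists persons whose recorded death date precedes their birth date.

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?person ?birth ?death WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER (?death < ?birth)   # inconsistent records only
}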

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP. Despite recent advances (cf. Section 7), there is still a huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and to update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to which entity a certain surface form should be mapped.

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors,
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint,
– Kingsley Idehen for SPARQL and Linked Data hosting and community support,
– Paul Kreis for DBpedia development,
– Christopher Sahnwaldt for DBpedia development and release management,
– Claus Stadler for DBpedia development,
– people who helped contributing data for certain parts of the article:
  ∗ Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI),
  ∗ Freebase: Shawn Simister and Tom Morris (working at Freebase),
  ∗ Instance Data Analysis: Volha Bryl (working at University of Mannheim).

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.

[3] C. Becker and C. Bizer. Exploring the geospatial semantic web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the web of data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.

[12] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[13] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[14] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[15] P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[16] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[17] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[18] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In Proc. of the Joint International Semantic Technology Conference, 2012.

[19] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[20] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.

[21] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[22] D. Kontokostas, S. Auer, S. Hellmann, J. Lehmann, P. Westphal, R. Cornelissen, and A. Zaveri. Test-driven data quality evaluation for SPARQL endpoints. In Submitted to 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.

[23] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of linked data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[24] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[25] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[26] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[27] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational Linguistics, pages 495–501, 2000.

[28] V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the semantic web. Semantic Web, 3(3):249–265, 2012.

[29] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[30] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[31] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[32] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[33] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[34] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[35] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin, Heidelberg, 2011. Springer-Verlag.

[36] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, Nov. 10 2008.

[37] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[38] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.

[39] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[40] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[41] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648. ACM, 2012.

[42] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[43] F. Wu and D. Weld. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[44] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[45] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

[46] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

Appendix

A Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net/>
PREFIX an:    <http://vocab.sindice.net/analytics#>
PREFIX dom:   <http://sindice.com/dataspace/default/domain/>

SELECT ?Dataset ?Target_Dataset ?predicate
       (SUM(xsd:long(?cardinality)) AS ?total)
FROM <http://sindice.com/analytics> {
  ?target any23:domain_uri ?Target_Dataset .
  {
    ?edge an:source ?source ;
          an:target ?target ;
          an:label ?predicate ;
          an:cardinality ?cardinality ;
          an:publishedIn ?Dataset .
  }
  ?source any23:domain_uri ?Dataset .
  FILTER (?Dataset != ?Target_Dataset)
}

Listing 1: SPARQL query on the Sindice summary graph for obtaining the number of incoming links.



Fig. 5. Mapping coverage statistics for all mapping-enabled languages.

The mapping activity of the ontology enrichment process, along with the editing of the ten most active mapping language communities, is depicted in Figure 6. It is interesting to notice that the high mapping activity peaks coincide with the DBpedia release dates. For instance, the DBpedia 3.7 version was released in September 2011, and the 2nd and 3rd quarter of that year show a very high activity compared to the 4th quarter. In the last two years (2012 and 2013), most of the DBpedia mapping language communities have defined their own chapters and have their own release dates. Thus, recent mapping activity shows less fluctuation.

Finally, Figure 7 shows the occurrence frequency of the English property mappings. Both axes are in log scale and represent the number of property mappings (x axis) that have exactly y occurrences (y axis). The occurrence frequency follows a long-tail distribution: a low number of property mappings have a high number of occurrences, and a high number of property mappings have a low number of occurrences.

3.2 Instance Data

Fig. 6. Mapping community activity for (a) ontology and (b) 10 most active language editions.

Fig. 7. English property mappings occurrence frequency (both axes are in log scale).

The DBpedia 3.8 release contains localized versions of DBpedia for 111 languages, which have been extracted from the Wikipedia edition in the corresponding language. For 20 of these languages, we report in this section the overall number of entities being described by the localized versions, as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions, as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5), CD = Canonicalized data sets (see Section 2.5), all = Overall number of instances in the data set, including instances without infobox data, with MD = Number of instances for which mapping-based infobox data exists, Raw Properties = Number of different properties that are generated by the raw infobox extractor, Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor, Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor, Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language correspond to instances in the English version of DBpedia, and which portion of the instances is described by clean, mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results, on the one hand, from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have been created so far in the Mapping Wiki for a specific language.

Table 3 reports the number of instances for a set of selected ontology classes within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to the specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages, the second column contains the number of instances that are described only in a single language version, the next column contains the number of instances that are contained in two languages but not in three or more languages, etc. For example, 12936 persons are described in five languages but not in six or more languages.


Table 2: Basic statistics about Localized DBpedia Editions.

      Inst. LD all   Inst. CD all   Inst. with MD CD   Raw Prop. CD   Map. Prop. CD   Raw Statem. CD   Map. Statem. CD
en    3769926        3769926        2359521            48293          1313            65143840         33742015
de    1243771         650037         204335             9593           261             7603562          2880381
fr    1197334         740044         214953            13551           228             8854322          2901809
it     882127         580620         383643             9716           181            12227870          4804731
es     879091         542524         310348            14643           476             7740458          4383206
pl     848298         538641         344875             7306           266             7696193          4511794
ru     822681         439605         123011            13522            76             6973305          1389473
pt     699446         460258         272660            12851           602             6255151          4005527
ca     367362         241534         112934             8696           183             3689870          1301868
cs     225133         148819          34893             5564           334             1857230           474459
hu     209180         138998          63441             6821           295             2506399           601037
ko     196132         124591          30962             7095           419             1035606           417605
tr     187850         106644          40438             7512           440             1350679           556943
ar     165722         103059          16236             7898           268              635058           168686
eu     132877         108713          41401             2245            19             2255897           532709
sl     129834          73099          22036             4235           470             1213801           222447
bg     125762          87679          38825             3984           274              774443           488678
hr     109890          71469          10343             3334           158              701182           151196
el      71936          48260          10813             2866           288              206460           113838

Table 3: Number of instances per class within 10 localized DBpedia versions.

                 en      it      pl      es      pt      fr      de      ru     ca     hu
Person        763643  145060   70708   65337   43057   62942   33122   18620   7107  15529
  Athlete     185126   47187   30332   19482   14130   21646   31237       0    721   4527
  Artist       61073   12511   16120   25992   10571   13465       0       0   2004   3821
  Politician   23096       0    8943    5513    3342       0       0   12004   1376    760
Place         572728  141101  182727  132961  116660   80602  131766   67932  73078  18324
  PopulPlace  387166  138077  167034  121204  109418   72252   79410   63826  72743  15535
  Building     60514    1270    2946    3570     803     921      83      43      0    527
  River        24267       0    1383       2    4149    3333    6707    3924      0    565
Organisation  192832    4142   12193   11710   10949   17513   16973    1598   1399   3993
  Company      44516    4142    2566     975    1903    5832    7200       0    440    618
  EducInst     42270       0     599    1207     270    1636    2171    1010      0    115
  Band         27061       0    2993       0    4476    3868    5368       0    263    802
Work          333269   51918   32386   36484   33869   39195   18195   34363   5240   9777
  MusicWork   159070   23652   14987   15911   17053   16206       0    6588    697   4704
  Film         71715   17210    9467    9396    8896    9741   15038   12045   3859   2320
  Software     27947    5682    3050    4833    3419    5733    2368       0    606    857

The number 871630 for the class Person means that all 20 language versions together describe 871630 different persons. The number is higher than the number of persons described in the canonicalized English infobox data set (763643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195263 persons are described in two or more languages. The large difference of this number compared to the total number of 871630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).


Table 4: Cross-language overlap: Number of instances that are described in multiple languages.

Class          Instances        1       2      3      4      5      6      7      8      9    10+
Person            871630   676367   94339  42382  21647  12936   8198   5295   3437   2391   4638
Place             643260   307729  150349  45836  36339  20831  13523  20808  31422  11262   5161
Organisation      206670   160398   22661   9312   5002   3221   2072   1421    928    594   1061
Work              360808   243706   54855  23097  12605   8277   5732   4007   2911   1995   3623

3.3 Internationalisation Community

The introduction of the mapping-based infobox extractor, alongside the live synchronisation approaches in [19], allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages.6 The DBpedia 3.7 release7 in September 2011 was the first DBpedia release to use the localized I18n DBpedia extraction framework [23].

At the time of writing, official DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish.8 Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services. Recently, the DBpedia internationalisation committee9 has manifested its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek10 and Dutch11), the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [23].

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage. The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is to achieving total coverage (cf. Figure 5), and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

6 ar, bg, bn, ca, cs, de, el, en, es, et, eu, fr, ga, hi, hr, hu, id, it, ja, ko, nl, pl, pt, ru, sl, tr, ur
7 http://blog.dbpedia.org/2011/09/11/
8 Accessed on 25/02/2013: http://wiki.dbpedia.org/Internationalization/Chapters
9 http://wiki.dbpedia.org/Internationalization
10 http://el.dbpedia.org
11 http://nl.dbpedia.org

4 Live Synchronisation

Wikipedia articles are continuously revised at a very high rate; e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute.12 This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [19,33]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported.13 Meanwhile, DBpedia Live for Dutch14 was developed.

12 http://stats.wikimedia.org/EN/SummaryEN.htm
13 http://live.dbpedia.org
14 http://live.nl.dbpedia.org


4.1 DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia Foundation kindly provided us access to their update stream using the OAI-PMH protocol [24]. This protocol allows a program to pull page updates in XML via HTTP. A Java component, serving as a proxy, constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework, in order to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.

– Mappings Wiki: The DBpedia Mappings Wiki described in Section 2.4 serves as a secondary input source. Changes of the Mappings Wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.

– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.

– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2 Features of DBpedia Live

The core components of the DBpedia Live extraction framework provide the following features:

Fig. 8. Overview of DBpedia Live extraction framework.

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.
– Unmodified Pages: The update of unmodified pages at regular intervals.
– Changesets Publication: The publication of triple-changesets.
– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.
– Data Isolation: Separate data from different sources.

Mapping-Affected Pages. Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (e.g. dbo:translator) or an existing one (e.g. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.
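For illustration, the set of affected pages for such a change could be approximated with a query of the following shape against the live endpoint. This is only a sketch: the framework identifies affected pages internally via its template-to-class mappings rather than through a public SPARQL query, and foaf:isPrimaryTopicOf is assumed here as the link between a resource and its Wikipedia page.

PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT DISTINCT ?page WHERE {
  ?book a dbo:Book ;                 # all entities of the affected class
        foaf:isPrimaryTopicOf ?page .  # their corresponding Wikipedia pages
}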

Unmodified Pages. Extraction framework improvements or activation/deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feed and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets. Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set of added triples and a set of deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.

Synchronisation Tool. The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them, and updates the target SPARQL endpoint via insert and delete operations.

Data Isolation. In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and the mapping-based infobox extraction.
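A brief sketch of how a client could make use of this graph layout: the query below restricts matching to the article-update graph named above. The triple pattern itself (dbo:populationTotal on a specific resource) is merely an illustrative assumption.

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?population WHERE {
  GRAPH <http://live.dbpedia.org> {   # only data from the article update feed
    <http://dbpedia.org/resource/Berlin> dbo:populationTotal ?population .
  }
}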

5 Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as of the external data sets that set links pointing to DBpedia resources.

5.1 Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third-party data sets. The DBpedia project maintains a link repository15 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third columns contain the predicate that is used for linking, as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to a custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

15 https://github.com/dbpedia/dbpedia-links

Table 5: Data sets linked from DBpedia.

Data set                  Predicate                 Count        Tool
Amsterdam Museum          owl:sameAs                627          S
BBC Wildlife Finder       owl:sameAs                444          S
Book Mashup               rdf:type, owl:sameAs      9 100
Bricklink                 dc:publisher              10 100
CORDIS                    owl:sameAs                314          S
Dailymed                  owl:sameAs                894          S
DBLP Bibliography         owl:sameAs                196          S
DBTune                    owl:sameAs                838          S
Diseasome                 owl:sameAs                2 300        S
Drugbank                  owl:sameAs                4 800        S
EUNIS                     owl:sameAs                3 100        S
Eurostat (Linked Stats)   owl:sameAs                253          S
Eurostat (WBSG)           owl:sameAs                137
CIA World Factbook        owl:sameAs                545          S
flickr wrappr             dbp:hasPhotoCollection    3 800 000    C
Freebase                  owl:sameAs                3 600 000    C
GADM                      owl:sameAs                1 900
GeoNames                  owl:sameAs                86 500       S
GeoSpecies                owl:sameAs                16 000       S
GHO                       owl:sameAs                196          L
Project Gutenberg         owl:sameAs                2 500        S
Italian Public Schools    owl:sameAs                5 800        S
LinkedGeoData             owl:sameAs                103 600      S
LinkedMDB                 owl:sameAs                13 800       S
MusicBrainz               owl:sameAs                23 000
New York Times            owl:sameAs                9 700
OpenCyc                   owl:sameAs                27 100       C
OpenEI (Open Energy)      owl:sameAs                678          S
Revyu                     owl:sameAs                6
Sider                     owl:sameAs                2 000        S
TCMGeneDIT                owl:sameAs                904
UMBEL                     rdf:type                  896 400
US Census                 owl:sameAs                12 600
WikiCompany               owl:sameAs                8 300
WordNet                   dbp:wordnet_type          467 100
YAGO2                     rdf:type                  18 100 000
Sum                                                 27 211 732

Links in DBpedia have been used for various purposes. One example is the combination of data about European Union project funding (FTS) [29] and data about countries in DBpedia. The following query compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia):

SELECT * {
  { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
  } }
  { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
  } }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

In addition to providing outgoing links on instance level, DBpedia also sets links on schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively. In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
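These schema-level links can be inspected directly. A simple query of the following form (a sketch, assuming the ontology triples are loaded into the queried endpoint) lists the DBpedia classes that have been declared equivalent to Schema.org terms:

PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?dbpediaClass ?schemaOrgClass WHERE {
  ?dbpediaClass owl:equivalentClass ?schemaOrgClass .
  FILTER (STRSTARTS(STR(?schemaOrgClass), "http://schema.org/"))
}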

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of links pointing at DBpedia from other data sets is 39007478, according to the Data Hub.16 However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [36]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered.

16 See http://wiki.dbpedia.org/Interlinking for details.

Table 6: Top 10 data sets in Sindice, ordered by the number of links to DBpedia.

Data set          Link Predicate Count   Link Count
okaboo.com          4                    2407121
tfri.gov.tw        57                     837459
naplesplus.us       2                     212460
fu-berlin.de        7                     145322
freebase.com      108                     138619
geonames.org        3                     125734
opencyc.org         3                      19648
geospecies.org     10                      16188
dbrec.net           3                      12856
faviki.com          5                      12768

Table 7: Sindice summary statistics for incoming links to DBpedia.

Metric                      Value
Total links                 3960212
Total distinct data sets    248
Total distinct predicates   345

Sindice computes a graph summary over all resources it stores. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing at DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia, along with the used link predicate and the number of links.

It should be noted that the data in Sindice is not complete: for instance, it does not contain all data sets that are catalogued by the DataHub.17 However, it crawls for RDFa snippets, converts microformats, etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster (see appendix for details). Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data, and DBpedia is currently ranked second in terms of incoming links.

17 http://datahub.io


Table 8: Top 10 datasets by incoming links in Sindice.

domain                 datasets     links
purl.org                    498   6717520
dbpedia.org                 248   3960212
creativecommons.org        2483   3030910
identi.ca                  1021   2359276
l3s.de                       34   1261487
rkbexplorer.com              24   1212416
nytimes.com                  27   1174941
w3.org                      405    658535
geospecies.org               13    523709
livejournal.com           14881    366025

[Figure 9: bar chart per quarter of 2012, showing Download Count (in Thousands) and Download Volume (in TB)]
Fig. 9. The download count and download volume (in GB) of the English language edition of DBpedia.

6 DBpedia Usage Statistics

DBpedia is served on the web in three forms. First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint and, third, it provides dereferencable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last couple of years.

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to their download popularity. The top five languages with respect to download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language edition are indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 18 TB per quarter, or 6 TB per month, is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity.

Table 9: Hardware of the machines serving the public SPARQL endpoint.

DBpedia     Configuration
3.3 - 3.4   AMD Opteron 8220, 2.80 GHz, 4 Cores, 32 GB
3.5 - 3.7   Intel Xeon E5520, 2.27 GHz, 8 Cores, 48 GB
3.8         Intel Xeon E5-2630, 2.30 GHz, 8 Cores, 64 GB

The download count and the download volume of each data set during the year 2012 are depicted in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1000 times per month.18 Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Supposedly, they are used for network analysis or for providing relevant links in user interfaces, and they are downloaded more often as they are not provided via the official SPARQL endpoint.

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint19 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports infinite horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines, or by adding several separate clusters with a round-robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection of any size. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful hardware, as shown in Table 9.

The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally using too many resources.

18 The IP address was only filtered for that specific file and month in those cases.
19 http://dbpedia.org/sparql

[Figure 10: bar chart per data set, showing Download Count (in Thousands) and Download Volume (in TB)]
Fig. 10. The download count and download volume (in GB) of the DBpedia data sets.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10 MB) are enforced. If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code20 like 509 and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.21 (http://www.webalizer.org). Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the last couple of DBpedia releases. The Avg/Day column represents the average number of hits (resp. visits) per day, followed by the Median and Standard Deviation.

20 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

Table 10: Number of SPARQL endpoint hits.

DBpedia    Avg/Day     Median      Stdev      Maximum
3.3         733811      711306     188991     1319539
3.4        1212549     1165893     351226     2371657
3.5        1122612     1035444     386454     2908720
3.6        1328355     1286750     256945     2495031
3.7        2085399     1930728    1057398     8873473
3.8        2910410     2717775    1085640     7678490

Fig. 11. Average number of SPARQL endpoint requests per day.

The last column shows the maximum number of hits (visits) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30-minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30-minute interval.


Table 11: Number of SPARQL endpoint visits.

DBpedia    Avg/Day    Median    Stdev    Maximum
3.3          9750       9775     2036      13126
3.4         11329      11473     1546      14198
3.5         16592      16902     2969      23129
3.6         19471      17664     5691      56064
3.7         23972      22262    10741     127869
3.8         16851      16711     2960      27416

Fig. 12. Average SPARQL endpoint visits per day.

Figure 11 and Figure 12 show the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language-specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

Table 12: Hits per service to http://dbpedia.org, in thousands.

Endpoint     3.3     3.4     3.5     3.6     3.7     3.8
data        1355    2681    2230    2547    3714    4246
ontology      80     168     142     178     116      97
page        2634    4703    1810    1687    3787    7377
property     231     311     137     176     176     128
resource    2976    4080    2554    2309    4436    7143
sparql      2090    4177    2266    9112   15902   15475
other        252     420     342     277     579     695
total       9619   16541    9434   16286   28710   35142

Fig. 13. Traffic: Linked Data versus SPARQL endpoint.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x, redirecting the client to either the page (HTML-based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data from the links at the bottom of the page-based article. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology.

Figure 13 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focussed on the calls to the sparql endpoint and counted the number of statements per type. As the log files only record the full SPARQL query on a GET request, all the PUT requests are counted as unknown.


Table 13: Hits per statement type, in thousands.

Statement    3.3     3.4     3.5     3.6     3.7     3.8
ask           56     269     360     326     159     677
construct     55      31      14      11      22      38
describe      11       8       4       7      62     111
select      1891    3663    1658    8030   11204   13516
unknown       78     206     229     738    4455    1134
total       2090    4177    2266    9112   15902   15475

Table 14: Trends in SPARQL select (rounded values in %).

Statement    3.3    3.4    3.5    3.6    3.7    3.8
distinct    19.5   11.4   17.3   19.4   13.3   25.4
filter      45.7   13.7   31.8   25.3   29.2   36.1
functions    8.8    6.9   23.5   21.3   25.5   25.9
geo         27.7    7.0   39.6    6.7    9.3    9.2
group        0.0    0.0    0.0    0.0    0.0    0.2
limit        4.6    6.5   11.6   10.5    7.7    7.8
optional    30.6   23.8   47.3   26.3   16.7   17.2
order        2.3    1.3    1.9    3.2    1.2    1.6
union        3.3    3.2   26.4   11.3   12.1   20.6

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– Use of GEO objects
– GROUP BY
– LIMIT, OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of the various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
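As a purely illustrative example (not taken from the logs), a query of the kind counted in Table 14 might combine several of these constructs at once:

PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT ?city ?name ?population WHERE {
  ?city a dbo:City ;
        dbo:populationTotal ?population .
  OPTIONAL { ?city rdfs:label ?name . FILTER (LANG(?name) = "en") }
  FILTER (?population > 1000000)
}
ORDER BY DESC(?population)
LIMIT 10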

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 14 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 14. Number of daily requests sent to DBpedia Live for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013.

7. Use Cases and Applications

With DBpedia evolving into a hub on the Web of Data due to its coverage of various domains, it is an eminent data set for various tasks. It can not only improve Wikipedia search, but also serve as a data source for applications and mashups, as well as for text analysis and annotation tools.

7.1. Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [30]. For that purpose, DBpedia includes a number of specialized data sets22. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [12]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

22 http://wiki.dbpedia.org/Datasets/NLP

7.1.1. Annotation, Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking – refers to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [21].

As DBpedia reflects a vast amount of structured real world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [31] is an open source tool23 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as a context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
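For example, such a whitelist of annotation targets could be expressed as a SPARQL query of roughly the following form; this is a sketch, and the chosen class and property are only illustrative:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?resource WHERE {
  # restrict annotation to politicians of German nationality
  ?resource a dbo:Politician ;
            dbo:nationality dbr:Germany .
}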

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI24, Semantic API from Ontos25, Open Calais26 and Zemanta27, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project28.

23 http://spotlight.dbpedia.org
24 http://www.alchemyapi.com
25 http://www.ontos.com
26 http://www.opencalais.com
27 http://www.zemanta.com

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki29 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [13]. The BBC30 employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2. Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system. The Watson system won a $1 million prize in Jeopardy and relies on several data sets including DBpedia31. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series32. Several QA systems, such as TBSL [41], PowerAqua [28], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topical diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

28 http://stanbol.apache.org
29 http://www.faviki.com
30 http://bbc.co.uk
31 http://www.aaai.org/Magazine/Watson/watson.php
32 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property from mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [25].

7.2. Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.

– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project33. Recently it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia34. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.
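The value of such connections can be illustrated with a small query sketch; it assumes that the harvested authority links are exposed as owl:sameAs triples pointing to viaf.org identifiers, which is an illustrative assumption rather than a guarantee of any particular release:

PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?dbpediaResource ?viafUri WHERE {
  # pairs of DBpedia resources and VIAF authority identifiers
  ?dbpediaResource owl:sameAs ?viafUri .
  FILTER (STRSTARTS(STR(?viafUri), "http://viaf.org/"))
}
LIMIT 100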

7.3. Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of such tools and structure them in categories.

33 http://viaf.org
34 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html

Facet Based Browsers. An award-winning35 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [14]. Another facet-based browser, which allows users to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [15]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser36 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying. The DBpedia Query Builder37 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [16] tool provides an intuitive interface which allows users to explore the neighborhood of, and connections between, resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [17] allows users to create statistical analysis queries and correlations in RDF data, and in DBpedia in particular.

Spatial Applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4. Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources available, written in 171 languages (of which approximately 147 can be considered active38), containing information about hundreds of spoken and even ancient languages.

35 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
36 http://dbpedia.org/fct
37 http://querybuilder.dbpedia.org
38 http://s23.org/wikistats/wiktionaries_html.php


For example, the English Wiktionary contains nearly 3 million words39. A Wiktionary page provides, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies and, most prominently, senses. Within this tree, numerous kinds of linguistic properties are given, including synonyms, hyponyms, hyperonyms, example sentences, links to Wikipedia and many more. The fast-changing nature together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules poses the biggest problem to the automated transformation into a structured knowledge base. We identified this as a serious problem: although the value of Wiktionary is known and usage scenarios are obvious, only some rudimentary tools exist, which either focus on a specific subset of the data or only cover one or two WLEs [18]. Existing tools can be seen as adapters to single WLEs; they are hard to maintain, and there are too many languages that constantly change and require a programmer to refactor complex code. Opposed to the currently employed classic and straightforward approach (implementing software adapters for scraping), Wiktionary2RDF [18] employs a declarative mediator/wrapper pattern. Figure 15 shows the work flow of the specially created Wiktionary extractor, which takes as input one config XML file for each WLE. The aim is to enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves. A simple XML dialect was created to encode the "entry layout explained" (ELE) guidelines and declare triple patterns that define how the resulting RDF should be built. An example is given for the following wiki syntax:

===Synonyms===
 [[building]]
 [[company]]

The regular expression for scraping and the triple generation rule encoded in XML look as follows:

<wikiTemplate>===Synonyms===
( [[$target]]
)+
</wikiTemplate>

<triple s="http://some.ns/$entityId" p="http://some.ns/hasSynonym" o="http://some.ns/$target" />

39 See http://en.wiktionary.org/wiki/semantic for a simple example page.

Fig. 15. Detail view of the Wiktionary extractor

This configuration is interpreted and run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data40. Statistics are shown in Table 15.

8. Related Work

8.1. Cross-Domain Community Knowledge Bases

8.1.1. Wikidata

In March 2012, the Wikimedia Germany e.V. started the development of Wikidata41. Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering the truth about things, but at providing statements about them. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011 according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

40 http://wiktionary.dbpedia.org
41 wikidata.org

language  words     triples    resources  predicates  senses  XML lines
en        2142237   28593364   11804039   28          424386  930
fr        4657817   35032121   20462349   22          592351  490
ru        1080156   12813437   5994560    17          149859  1449
de        701739    5618508    2966867    16          122362  671

Table 15: Statistical comparison of extractions for different languages. "XML lines" measures the number of lines of the XML configuration files.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013 the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface42, as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2. Freebase

Freebase43 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [32] graph database, which allows metadata for each fact to be stored efficiently. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version.

42 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
43 http://www.freebase.com

This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3. YAGO

One of the projects which pursues similar goals to DBpedia is YAGO44 [39]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [20], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content, while YAGO focuses on extracting a smaller number of relations compared to DBpedia to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

44 http://www.mpi-inf.mpg.de/yago-naga/yago/

8.2. Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects45 which are able to process wiki syntax, as well as a list of data extraction extensions46 for MediaWiki.

JWPL (Java Wikipedia Library [46]) is an open-source, Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki Markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as an XML dump and is incorporated in the lexical resource UBY47 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [34]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguation or for synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service48. Additional projects presented by the authors that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site49.

An earlier approach to improve the quality of the infobox schemata and contents is described in [44].

45 http://www.mediawiki.org/wiki/Alternative_parsers
46 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction
47 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby/
48 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
49 http://sigwp.org/en/

The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text, applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [35] the authors describe their effort of building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike50 and Hudong Baike51. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [42]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

KnowItAll52 is a web-scale knowledge extraction effort which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [26] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [37] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [43] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [44] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

50 http://baike.baidu.com
51 http://www.hudong.com
52 http://www.cs.washington.edu/research/knowitall


[2] was an infobox extraction approach for Wikipedia which later became the DBpedia project.

9. Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included, in particular, (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that DBpedia has matured and improved significantly in recent years, in particular also in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [18] meanwhile becoming part of DBpedia, and LinkedGeoData [38], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach a significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement. In the future we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [45], we have taken a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision also extracting semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki53. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other nicely. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].
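A minimal sketch of such a sanity check could look as follows; it uses the DBpedia ontology properties dbo:birthDate and dbo:deathDate, while the exact set of checks deployed remains an implementation choice:

PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?person ?birth ?death WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  # report persons whose recorded death date precedes their birth date
  FILTER (?death < ?birth)
}

Each non-empty result of such a query points at a likely extraction error or a typo in the underlying Wikipedia article.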

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP. Despite recent advances (cf. Section 7), there is still a huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and to update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationship between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Paul Kreis for DBpedia development
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– people who helped contributing data for certain parts of the article:
  ∗ Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  ∗ Freebase: Shawn Simister and Tom Morris (working at Freebase)
  ∗ Instance Data Analysis: Volha Bryl (working at University of Mannheim)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.

[3] C. Becker and C. Bizer. Exploring the geospatial semantic web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the web of data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web, pages 100–110. ACM, 2004.

[12] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: enriching DBpedia with multimedia information. In Proceedings of the Sixth International Conference on Knowledge Capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[13] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[14] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[15] P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[16] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[17] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[18] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In Proc. of the Joint International Semantic Technology Conference, 2012.

[19] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[20] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.

[21] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[22] D. Kontokostas, S. Auer, S. Hellmann, J. Lehmann, P. Westphal, R. Cornelissen, and A. Zaveri. Test-driven data quality evaluation for SPARQL endpoints. In Submitted to 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.

[23] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of linked data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[24] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[25] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[26] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[27] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics, pages 495–501, 2000.

[28] V. Lopez, M. Fernandez, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the semantic web. Semantic Web, 3(3):249–265, 2012.

[29] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[30] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[31] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[32] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[33] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[34] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[35] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin, Heidelberg, 2011. Springer-Verlag.

[36] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, Nov. 10 2008.

[37] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[38] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.

[39] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[40] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[41] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st International Conference on World Wide Web, pages 639–648. ACM, 2012.

[42] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st International Conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[43] F. Wu and D. Weld. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[44] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[45] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

[46] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

Appendix

A. Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net/>
PREFIX an: <http://vocab.sindice.net/analytics>
PREFIX dom: <http://sindice.com/dataspace/default/domain/>
SELECT ?Dataset ?Target_Dataset ?predicate
       (SUM(xsd:long(?cardinality)) AS ?total)
FROM <http://sindice.com/analytics> {
  ?target any23:domain_uri ?Target_Dataset .

  ?edge an:source ?source ;
        an:target ?target ;
        an:label ?predicate ;
        an:cardinality ?cardinality ;
        an:publishedIn ?Dataset .

  ?source any23:domain_uri ?Dataset .

  FILTER (?Dataset != ?Target_Dataset)
}

Listing 1: SPARQL query on the Sindice summary graph for obtaining the number of incoming links


Fig 6 Mapping community activity for (a) ontology and (b) 10 most active language editions

Fig. 7. English property mappings occurrence frequency (both axes are in log scale)

ing language. For 20 of these languages, we report in this section the overall number of entities being described by the localized versions, as well as the number of facts (i.e. statements) that have been extracted from infoboxes describing these things. Afterwards, we report on the number of instances of popular classes within the 20 DBpedia versions, as well as the conceptual overlap between the languages.

Table 2 shows the overall number of things, ontology and raw-infobox properties, infobox statements and type statements for the 20 languages. The column headings have the following meaning: LD = Localized data sets (see Section 2.5); CD = Canonicalized data sets (see Section 2.5); all = Overall number of instances in the data set, including instances without infobox data; with MD = Number of instances for which mapping-based infobox data exists; Raw Properties = Number of different properties that are generated by the raw infobox extractor; Mapping Properties = Number of different properties that are generated by the mapping-based infobox extractor; Raw Statements = Number of statements (facts) that are generated by the raw infobox extractor; Mapping Statements = Number of statements (facts) that are generated by the mapping-based infobox extractor.

It is interesting to see that the English version of DBpedia describes about three times more instances than the second and third largest language editions (French, German). Comparing the first column of the table with the second and third reveals which portion of the instances of a specific language correspond to instances in the English version of DBpedia, and which portion of the instances is described by clean mapping-based infobox data. The difference between the number of properties in the raw infobox data set and the cleaner mapping-based infobox data set (columns 4 and 5) results, on the one hand, from multiple Wikipedia infobox properties being mapped to a single ontology property. On the other hand, it reflects the number of mappings that have been created so far in the Mappings Wiki for a specific language.

Table 3 reports the number of instances for a set of selected ontology classes within the canonicalized DBpedia data sets for each language. The indented classes are subclasses of the superclasses set in bold. The zero values in the table indicate that no infoboxes have been mapped to a specific ontology class within the corresponding language so far. Again, the English version of DBpedia covers by far the most instances.

Table 4 shows, for the canonicalized mapping-based data set, how many instances are described in multiple languages. The Instances column contains the total number of instances per class across all 20 languages; the second column contains the number of instances that are described only in a single language version; the next column contains the number of instances that are contained in two languages, but not in three or more languages, etc. For example, 12,936 persons are described in five languages, but not in six or more languages.


     Inst. LD all  Inst. CD all  Inst. with MD CD  Raw Prop. CD  Map. Prop. CD  Raw Statem. CD  Map. Statem. CD
en   3769926       3769926       2359521           48293         1313           65143840        33742015
de   1243771       650037        204335            9593          261            7603562         2880381
fr   1197334       740044        214953            13551         228            8854322         2901809
it   882127        580620        383643            9716          181            12227870        4804731
es   879091        542524        310348            14643         476            7740458         4383206
pl   848298        538641        344875            7306          266            7696193         4511794
ru   822681        439605        123011            13522         76             6973305         1389473
pt   699446        460258        272660            12851         602            6255151         4005527
ca   367362        241534        112934            8696          183            3689870         1301868
cs   225133        148819        34893             5564          334            1857230         474459
hu   209180        138998        63441             6821          295            2506399         601037
ko   196132        124591        30962             7095          419            1035606         417605
tr   187850        106644        40438             7512          440            1350679         556943
ar   165722        103059        16236             7898          268            635058          168686
eu   132877        108713        41401             2245          19             2255897         532709
sl   129834        73099         22036             4235          470            1213801         222447
bg   125762        87679         38825             3984          274            774443          488678
hr   109890        71469         10343             3334          158            701182          151196
el   71936         48260         10813             2866          288            206460          113838

Table 2: Basic statistics about Localized DBpedia Editions

              en      it      pl      es      pt      fr      de      ru      ca     hu
Person        763643  145060  70708   65337   43057   62942   33122   18620   7107   15529
  Athlete     185126  47187   30332   19482   14130   21646   31237   0       721    4527
  Artist      61073   12511   16120   25992   10571   13465   0       0       2004   3821
  Politician  23096   0       8943    5513    3342    0       0       12004   1376   760
Place         572728  141101  182727  132961  116660  80602   131766  67932   73078  18324
  PopulPlace  387166  138077  167034  121204  109418  72252   79410   63826   72743  15535
  Building    60514   1270    2946    3570    803     921     83      43      0      527
  River       24267   0       1383    2       4149    3333    6707    3924    0      565
Organisation  192832  4142    12193   11710   10949   17513   16973   1598    1399   3993
  Company     44516   4142    2566    975     1903    5832    7200    0       440    618
  EducInst    42270   0       599     1207    270     1636    2171    1010    0      115
  Band        27061   0       2993    0       4476    3868    5368    0       263    802
Work          333269  51918   32386   36484   33869   39195   18195   34363   5240   9777
  MusicWork   159070  23652   14987   15911   17053   16206   0       6588    697    4704
  Film        71715   17210   9467    9396    8896    9741    15038   12045   3859   2320
  Software    27947   5682    3050    4833    3419    5733    2368    0       606    857

Table 3: Number of instances per class within 10 localized DBpedia versions

The number 871,630 for the class Person means that all 20 language versions together describe 871,630 different persons. The number is higher than the number of persons described in the canonicalized English infobox data set (763,643) listed in Table 3, since there are infoboxes in non-English articles which describe a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195,263 persons are described in two or more languages. The large difference between this number and the total number of 871,630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).

Class         Instances  1       2       3      4      5      6      7      8      9      10+
Person        871630     676367  94339   42382  21647  12936  8198   5295   3437   2391   4638
Place         643260     307729  150349  45836  36339  20831  13523  20808  31422  11262  5161
Organisation  206670     160398  22661   9312   5002   3221   2072   1421   928    594    1061
Work          360808     243706  54855   23097  12605  8277   5732   4007   2911   1995   3623

Table 4: Cross-language overlap: Number of instances that are described in multiple languages

3.3. Internationalisation Community

The introduction of the mapping-based infobox extractor alongside live synchronisation approaches in [19] allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages6. The DBpedia 3.7 release7 in September 2011 was the first DBpedia release to use the localized I18n DBpedia extraction framework [23].

At the time of writing, official DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish8. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services. Recently, the DBpedia internationalisation committee9 has manifested its structure, and each language edition has a representative with a vote in elections. In some cases (e.g. Greek10 and Dutch11) the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [23].

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted.

6 ar, bg, bn, ca, cs, de, el, en, es, et, eu, fr, ga, hi, hr, hu, id, it, ja, ko, nl, pl, pt, ru, sl, tr, ur
7 http://blog.dbpedia.org/2011/09/11/
8 Accessed on 25/02/2013: http://wiki.dbpedia.org/Internationalization/Chapters
9 http://wiki.dbpedia.org/Internationalization
10 http://el.dbpedia.org
11 http://nl.dbpedia.org

Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage. The mapping statistics have also been used as a way to promote a healthy competition between language editions: a sprint page was created with bar charts that show how close each language is to achieving total coverage (cf. Figure 5), and line charts showing the progress over time, highlighting when one language overtakes another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

4. Live Synchronisation

Wikipedia articles are continuously revised at a very high rate, e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute12. This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [19,33]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported13. Meanwhile, DBpedia Live for Dutch14 was developed.

12 http://stats.wikimedia.org/EN/SummaryEN.htm
13 http://live.dbpedia.org
14 http://live.nl.dbpedia.org

4.1. DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia Foundation kindly provided us access to their update stream using the OAI-PMH protocol [24]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component, serving as a proxy, constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.

– Mappings Wiki: The DBpedia Mappings Wiki, described in Section 2.4, serves as a secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.

– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.

– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2. Features of DBpedia Live

The core components of the DBpedia Live extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.

Fig 8 Overview of DBpedia Live extraction framework

– Unmodified Pages: The update of unmodified pages at regular intervals.

– Changesets Publication: The publication of triple-changesets.

– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.

– Data Isolation: Separate data from different sources.

Mapping-Affected Pages. Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.

Unmodified Pages. Extraction framework improvements or the activation/deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feed and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets. Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set of added triples and another set of deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress them and integrate them.

Synchronisation Tool. The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.
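As an illustration, the deleted and added sets of one changeset could be applied to a mirror with a SPARQL Update request along the following lines; the triples shown are placeholders rather than actual changeset content:

# remove the triples listed in the "removed" part of the changeset
DELETE DATA {
  <http://dbpedia.org/resource/Berlin>
      <http://dbpedia.org/ontology/populationTotal> "3499879" .
} ;
# add the triples listed in the "added" part of the changeset
INSERT DATA {
  <http://dbpedia.org/resource/Berlin>
      <http://dbpedia.org/ontology/populationTotal> "3515473" .
}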

Data Isolation. In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and mapping-based infobox extraction.
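A minimal sketch of how a client can restrict a query to one of these named graphs is shown below; the class used is only an illustrative example:

PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?city WHERE {
  # query only the continuously updated article data
  GRAPH <http://live.dbpedia.org> {
    ?city a dbo:City .
  }
}
LIMIT 10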

5. Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as of the external data sets that set links pointing to DBpedia resources.

5.1. Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third-party data sets. The DBpedia project maintains a link repository15, for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third columns contain the predicate that is used for linking, as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to a custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

Links in DBpedia have been used for various pur-poses One example is the combination of data about

15httpsgithubcomdbpediadbpedia-links

Data set | Predicate | Count | Tool
Amsterdam Museum | owl:sameAs | 627 | S
BBC Wildlife Finder | owl:sameAs | 444 | S
Book Mashup | rdf:type, owl:sameAs | 9 100 |
Bricklink | dc:publisher | 10 100 |
CORDIS | owl:sameAs | 314 | S
Dailymed | owl:sameAs | 894 | S
DBLP Bibliography | owl:sameAs | 196 | S
DBTune | owl:sameAs | 838 | S
Diseasome | owl:sameAs | 2 300 | S
Drugbank | owl:sameAs | 4 800 | S
EUNIS | owl:sameAs | 3 100 | S
Eurostat (Linked Stats) | owl:sameAs | 253 | S
Eurostat (WBSG) | owl:sameAs | 137 |
CIA World Factbook | owl:sameAs | 545 | S
flickr wrappr | dbp:hasPhotoCollection | 3 800 000 | C
Freebase | owl:sameAs | 3 600 000 | C
GADM | owl:sameAs | 1 900 |
GeoNames | owl:sameAs | 86 500 | S
GeoSpecies | owl:sameAs | 16 000 | S
GHO | owl:sameAs | 196 | L
Project Gutenberg | owl:sameAs | 2 500 | S
Italian Public Schools | owl:sameAs | 5 800 | S
LinkedGeoData | owl:sameAs | 103 600 | S
LinkedMDB | owl:sameAs | 13 800 | S
MusicBrainz | owl:sameAs | 23 000 |
New York Times | owl:sameAs | 9 700 |
OpenCyc | owl:sameAs | 27 100 | C
OpenEI (Open Energy) | owl:sameAs | 678 | S
Revyu | owl:sameAs | 6 |
Sider | owl:sameAs | 2 000 | S
TCMGeneDIT | owl:sameAs | 904 |
UMBEL | rdf:type | 896 400 |
US Census | owl:sameAs | 12 600 |
WikiCompany | owl:sameAs | 8 300 |
WordNet | dbp:wordnet_type | 467 100 |
YAGO2 | rdf:type | 18 100 000 |
Sum | | 27 211 732 |

Table 5: Data sets linked from DBpedia

European Union project funding (FTS) [29] and data about countries in DBpedia. The following query compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia):

SELECT * {
  { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
  } }
  { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
  } }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

In addition to providing outgoing links on instance level, DBpedia also sets links on schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively. In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
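
The following sketch shows how such schema-level links can be retrieved from an endpoint hosting the DBpedia ontology; restricting the targets to Schema.org terms via their URI prefix is only an illustrative choice.

PREFIX owl: <http://www.w3.org/2002/07/owl#>

# DBpedia ontology classes declared equivalent to Schema.org terms
SELECT ?dbpediaClass ?schemaOrgClass
WHERE {
  ?dbpediaClass owl:equivalentClass ?schemaOrgClass .
  FILTER (STRSTARTS(STR(?schemaOrgClass), "http://schema.org/"))
}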

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of links pointing at DBpedia from other data sets is 39,007,478 according to the Data Hub.16 However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [36]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of the subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary over all resources it stores.

16 See http://wiki.dbpedia.org/Interlinking for details.

Data set | Link Predicate Count | Link Count
okaboo.com | 4 | 2407121
tfri.gov.tw | 57 | 837459
naplesplus.us | 2 | 212460
fu-berlin.de | 7 | 145322
freebase.com | 108 | 138619
geonames.org | 3 | 125734
opencyc.org | 3 | 19648
geospecies.org | 10 | 16188
dbrec.net | 3 | 12856
faviki.com | 5 | 12768

Table 6: Top 10 data sets in Sindice ordered by the number of links to DBpedia

Metric | Value
Total links | 3960212
Total distinct data sets | 248
Total distinct predicates | 345

Table 7: Sindice summary statistics for incoming links to DBpedia

With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing at DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia, along with the used link predicate and the number of links.

It should be noted that the data in Sindice is not complete, for instance it does not contain all data sets that are catalogued by the DataHub.17 However, it crawls for RDFa snippets, converts microformats etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster (see appendix for details). Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data, and DBpedia is currently ranked second in terms of incoming links.

17 http://datahub.io


domain | datasets | links
purl.org | 498 | 6717520
dbpedia.org | 248 | 3960212
creativecommons.org | 2483 | 3030910
identi.ca | 1021 | 2359276
l3s.de | 34 | 1261487
rkbexplorer.com | 24 | 1212416
nytimes.com | 27 | 1174941
w3.org | 405 | 658535
geospecies.org | 13 | 523709
livejournal.com | 14881 | 366025

Table 8: Top 10 datasets by incoming links in Sindice

Fig. 9. The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, per quarter (2012-Q1 to 2012-Q4).

6 DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint and, third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last couple of years.

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to the download popularity as well. The top five languages with respect to the download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language edition is indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 18 TB per quarter, or 6 TB per month, is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 is depicted

DBpedia | Configuration
3.3 - 3.4 | AMD Opteron 8220, 2.80 GHz, 4 Cores, 32 GB
3.5 - 3.7 | Intel Xeon E5520, 2.27 GHz, 8 Cores, 48 GB
3.8 | Intel Xeon E5-2630, 2.30 GHz, 8 Cores, 64 GB

Table 9: Hardware of the machines serving the public SPARQL endpoint

in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1000 times per month.18 Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Supposedly, they are used for network analysis or for providing relevant links in user interfaces, and are downloaded more often as they are not provided via the official SPARQL endpoint.

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint19 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports infinite horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round-robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection of any size. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful hardware, as shown in Table 9.

The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.
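
Clients that need result sets larger than this cap can page through them with LIMIT and OFFSET. The following is a minimal sketch (the class dbo:Person and the page size of 10,000 are only illustrative choices); incrementing OFFSET by the page size retrieves the next slice.

PREFIX dbo: <http://dbpedia.org/ontology/>

# Second slice of 10,000 persons; ORDER BY gives a stable paging order
SELECT ?person
WHERE { ?person a dbo:Person }
ORDER BY ?person
LIMIT 10000
OFFSET 10000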

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate limited after a burst period,

18 The IP address was only filtered for that specific file and month in those cases.

19 http://dbpedia.org/sparql


Fig. 10. The download count (in thousands) and download volume (in TB) of the DBpedia data sets.

2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally use too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10 MB) are enforced. If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code20 like 509 and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.21.21 Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the last couple of DBpedia releases. The Avg/Day column represents the average number of hits (resp. visits) per day, followed by the Median and Standard Deviation. The last column

20 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes

21 http://www.webalizer.org

DBpedia | Avg/Day | Median | Stdev | Maximum
3.3 | 733811 | 711306 | 188991 | 1319539
3.4 | 1212549 | 1165893 | 351226 | 2371657
3.5 | 1122612 | 1035444 | 386454 | 2908720
3.6 | 1328355 | 1286750 | 256945 | 2495031
3.7 | 2085399 | 1930728 | 1057398 | 8873473
3.8 | 2910410 | 2717775 | 1085640 | 7678490

Table 10: Number of SPARQL endpoint hits

Fig 11 Average number of SPARQL endpoint requests per day

shows the maximum number of hits (visits) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30 minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30 minute interval.


DBpedia | Avg/Day | Median | Stdev | Maximum
3.3 | 9750 | 9775 | 2036 | 13126
3.4 | 11329 | 11473 | 1546 | 14198
3.5 | 16592 | 16902 | 2969 | 23129
3.6 | 19471 | 17664 | 5691 | 56064
3.7 | 23972 | 22262 | 10741 | 127869
3.8 | 16851 | 16711 | 2960 | 27416

Table 11: Number of SPARQL endpoint visits

Fig 12 Average SPARQL endpoint visits per day

Figure 11 and Figure 12 show the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,

2. blocking of apps that were abusing the DBpedia endpoint,

3. uptake of the language-specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data Hub, returning resources in a number of different formats. For each data set we randomly selected 14 days worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

Endpoint | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
data | 1355 | 2681 | 2230 | 2547 | 3714 | 4246
ontology | 80 | 168 | 142 | 178 | 116 | 97
page | 2634 | 4703 | 1810 | 1687 | 3787 | 7377
property | 231 | 311 | 137 | 176 | 176 | 128
resource | 2976 | 4080 | 2554 | 2309 | 4436 | 7143
sparql | 2090 | 4177 | 2266 | 9112 | 15902 | 15475
other | 252 | 420 | 342 | 277 | 579 | 695
total | 9619 | 16541 | 9434 | 16286 | 28710 | 35142

Table 12: Hits per service to http://dbpedia.org in thousands

Fig 13 Traffic Linked Data versus SPARQL endpoint

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x to redirect the client to either the page (HTML based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data from the links at the bottom of the page-based article. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology.

Figure 13 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13 we focussed on the calls to the sparql endpoint and counted the number of statements per type. As the log files only record the full SPARQL query on a GET request, all the PUT requests are counted as unknown.


Statement | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
ask | 56 | 269 | 360 | 326 | 159 | 677
construct | 55 | 31 | 14 | 11 | 22 | 38
describe | 11 | 8 | 4 | 7 | 62 | 111
select | 1891 | 3663 | 1658 | 8030 | 11204 | 13516
unknown | 78 | 206 | 229 | 738 | 4455 | 1134
total | 2090 | 4177 | 2266 | 9112 | 15902 | 15475

Table 13: Hits per statement type in thousands

Statement | 3.3 | 3.4 | 3.5 | 3.6 | 3.7 | 3.8
distinct | 19.5 | 11.4 | 17.3 | 19.4 | 13.3 | 25.4
filter | 45.7 | 13.7 | 31.8 | 25.3 | 29.2 | 36.1
functions | 8.8 | 6.9 | 23.5 | 21.3 | 25.5 | 25.9
geo | 27.7 | 7.0 | 39.6 | 6.7 | 9.3 | 9.2
group | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.2
limit | 4.6 | 6.5 | 11.6 | 10.5 | 7.7 | 7.8
optional | 30.6 | 23.8 | 47.3 | 26.3 | 16.7 | 17.2
order | 2.3 | 1.3 | 1.9 | 3.2 | 1.2 | 1.6
union | 3.3 | 3.2 | 26.4 | 11.3 | 12.1 | 20.6

Table 14: Trends in SPARQL select (rounded values in %)

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– Use of GEO objects
– GROUP BY
– LIMIT, OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects we counted the use of SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
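
As a purely illustrative example (not taken from the logs), a query of the kind counted in Table 14 might combine several of these constructs, for instance DISTINCT, OPTIONAL, FILTER, ORDER BY and LIMIT:

PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Hypothetical query combining several of the counted constructs
SELECT DISTINCT ?city ?population
WHERE {
  ?city a dbo:City ;
        rdfs:label ?label .
  OPTIONAL { ?city dbo:populationTotal ?population }
  FILTER (lang(?label) = "en")
}
ORDER BY DESC(?population)
LIMIT 100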

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 14 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 14. Number of daily requests sent to the DBpedia Live endpoint for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013.

7 Use Cases and Applications

With DBpedia evolving into a hub on the Web of Data, due to its coverage of various domains, it is an eminent data set for various tasks. It can not only improve Wikipedia search, but also serve as a data source for applications and mashups, as well as for text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [30]. For that purpose, DBpedia includes a number of specialized data sets.22 For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [12]. The thematic concepts data set of resources can be used for creating a corpus from

22 http://wiki.dbpedia.org/Datasets/NLP


Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

7.1.1 Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or

other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking tasks – refers to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [21].

As DBpedia reflects a vast amount of structured real world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [31] is an open source tool23 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base described as SPARQL queries, as in the example below.
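
A hypothetical whitelist of this kind could, for instance, restrict annotation to politicians born in Germany; the following sketch assumes the standard DBpedia ontology terms and is not taken from the Spotlight distribution.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Illustrative whitelist: only annotate mentions of politicians born in Germany
SELECT ?resource
WHERE {
  ?resource a dbo:Politician ;
            dbo:birthPlace dbr:Germany .
}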

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI,24 Semantic API from Ontos,25 Open Calais26 and Zemanta,27 among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems

23 http://spotlight.dbpedia.org
24 http://www.alchemyapi.com
25 http://www.ontos.com
26 http://www.opencalais.com
27 http://www.zemanta.com

(without disambiguation) in the context of the Apache Stanbol project.28

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki29 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [13]. The BBC30 employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge

across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia.31 DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series.32 Several QA systems, such as TBSL [41], PowerAqua [28], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.
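
To give a flavour of the task, a QALD-style question such as "Which books were written by Dan Brown?" could be answered by a query along the following lines; this mapping from question to query is exactly what such systems have to produce automatically (the example is illustrative and not taken from the benchmark).

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# "Which books were written by Dan Brown?"
SELECT ?book
WHERE {
  ?book a dbo:Book ;
        dbo:author dbr:Dan_Brown .
}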

28 http://stanbol.apache.org
29 http://www.faviki.com
30 http://bbc.co.uk
31 http://www.aaai.org/Magazine/Watson/watson.php
32 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property from mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [25].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.

– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project.33 Recently it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia.34 These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of tools and structure them in categories.

33 http://viaf.org
34 Accessed on 12.02.2013: http://www.oclc.org/research/news/2012/12-07a.html

Facet Based Browsers. An award-winning35 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [14]. Another facet-based browser, which allows to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [15]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser36 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying. The DBpedia Query Builder37 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [16] tool provides an intuitive interface which allows to explore the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [17] allows to create statistical analysis queries and correlations in RDF data and DBpedia in particular.

Spatial Applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources available, written in 171 languages (of which approximately 147 can be considered active38), containing information about hundreds of spoken and even ancient

35 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany
36 http://dbpedia.org/fct
37 http://querybuilder.dbpedia.org
38 http://s23.org/wikistats/wiktionaries_html.php


languages. For example, the English Wiktionary contains nearly 3 million words.39 A Wiktionary page provides, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies and, most prominently, senses. Within this tree, numerous kinds of linguistic properties are given, including synonyms, hyponyms, hyperonyms, example sentences, links to Wikipedia and many more. The fast changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, poses the biggest problem to the automated transformation into a structured knowledge base. We identified this as a serious problem: although the value of Wiktionary is known and usage scenarios are obvious, only some rudimentary tools exist, which either focus on a specific subset of the data or only cover one or two WLEs [18]. Existing tools can be seen as adapters to single WLEs; they are hard to maintain and there are too many languages that constantly change and require a programmer to refactor complex code. Opposed to the currently employed classic and straightforward approach (implementing software adapters for scraping), Wiktionary2RDF [18] employs a declarative mediator/wrapper pattern. Figure 15 shows the work flow of the specially created Wiktionary extractor, which takes as input one config XML file for each WLE. The aim is to enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves. A simple XML dialect was created to encode the "entry layout explained" (ELE) guidelines and declare triple patterns that define how the resulting RDF should be built. An example is given for the following wiki syntax:

===Synonyms===
[[building]]
[[company]]

The regular expression for scraping and the triple generation rule encoded in XML look like:

<wikiTemplate>===Synonyms===
( [[$target]]
)+
</wikiTemplate>

<triple s="http://some.ns/$entityId" p="http://some.ns/hasSynonym" o="http://some.ns/$target" />

39 See http://en.wiktionary.org/wiki/semantic for a simple example page.

Fig 15 Detail view of the Wiktionary extractor

This configuration is interpreted and run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data.40 Statistics are shown in Table 15.

8 Related Work

8.1 Cross Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, the Wikimedia Germany e.V. started

the development of Wikidata.41 Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering the truth about things, but at providing statements about them. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

40 http://wiktionary.dbpedia.org
41 wikidata.org


language words triples resources predicates senses XML lines

en 2142237 28593364 11804039 28 424386 930fr 4657817 35032121 20462349 22 592351 490ru 1080156 12813437 5994560 17 149859 1449de 701739 5618508 2966867 16 122362 671

Table 15

Statistical comparison of extractions for different languages XMLlines measures the number of lines of the XML configuration files

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface42 as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase43 is a graph database which also extracts

structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [32] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked and the system can easily revert to a previous version.

42 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
43 http://www.freebase.com

This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3 YAGO

One of the projects which pursues similar goals

to DBpedia is YAGO44 [39]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet. WordNet synsets are represented as classes and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [20], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content, while YAGO focuses on extracting a smaller number of relations compared to DBpedia to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

44 http://www.mpi-inf.mpg.de/yago-naga/yago

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects45 which are able to process wiki syntax, as well as a list of data extraction extensions46 for MediaWiki.

JWPL (Java Wikipedia Library, [46]) is an open-source Java-based API that allows to access information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki Markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as XML dump and is incorporated in the lexical resource UBY47 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [34]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguations or synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service.48 Additional projects presented by the authors that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site.49

An earlier approach to improve the quality of the infobox schemata and contents is described in [44].

45 http://www.mediawiki.org/wiki/Alternative_parsers
46 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction
47 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
48 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
49 http://sigwp.org/en

The presented methodology encompasses a three step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [35] the authors describe their effort of building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike50 and Hudong Baike.51 Apart from utilizing MediaWiki and HTML Markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve a better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [42]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

KnowItAll52 is a web scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [26] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [37] is a large taxonomy derived from categories in Wikipedia, by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [43] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [44] is a system which autonomously extracts structured data from Wikipedia and uses self-

50 http://baike.baidu.com
51 http://www.hudong.com
52 http://www.cs.washington.edu/research/knowitall


supervised linking. [2] was an infobox extraction approach for Wikipedia which later became the DBpedia project.

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included, in particular, (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that DBpedia has matured and improved significantly in the last years, in particular also in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories and DBpedia already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [18] meanwhile becoming part of DBpedia, and LinkedGeoData [38], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach a significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [45], we took a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki, which was backported and implemented as a very lightweight extension for MediaWiki.53 If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other nicely. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, which identify data quality issues and are executed in certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].
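
A minimal sketch of such a sanity check, assuming the standard DBpedia ontology properties for birth and death dates, could look as follows; resources returned by this query point to likely errors in Wikipedia or in the extraction.

PREFIX dbo: <http://dbpedia.org/ontology/>

# Persons whose recorded death date precedes their birth date
SELECT ?person ?birth ?death
WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER (?death < ?birth)
}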

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP. Despite recent advances (cf. Section 7), there is still a huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and update (it requires recomputation). With DBpedia and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationship between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.
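
As a simple sketch of this idea, candidate entities for a surface form can first be looked up via their labels, before the relationship information is used for ranking; the surface form "Leipzig" is only an example, and real systems would additionally use resources such as the lexicalizations data set mentioned in Section 7.1.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Candidate DBpedia entities for the surface form "Leipzig"
SELECT DISTINCT ?candidate
WHERE {
  ?candidate rdfs:label "Leipzig"@en .
}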

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Paul Kreis for DBpedia development
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.
[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.
[3] C. Becker and C. Bizer. Exploring the geospatial semantic web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.
[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.
[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.
[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the web of data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.
[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.
[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.
[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.
[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.
[12] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.
[13] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.
[14] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.
[15] P. Heim, T. Ertl, and J. Ziegler. Facet graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.
[16] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Mérialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.
[17] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.
[18] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In Proc. of the Joint International Semantic Technology Conference, 2012.
[19] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.
[20] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.
[21] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.
[22] D. Kontokostas, S. Auer, S. Hellmann, J. Lehmann, P. Westphal, R. Cornelissen, and A. Zaveri. Test-driven data quality evaluation for SPARQL endpoints. In Submitted to 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.
[23] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of linked data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.
[24] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.
[25] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.
[26] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.
[27] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational Linguistics, pages 495–501, 2000.
[28] V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the semantic web. Semantic Web, 3(3):249–265, 2012.
[29] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.
[30] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.
[31] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.
[32] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.
[33] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.
[34] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.
[35] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin, Heidelberg, 2011. Springer-Verlag.
[36] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, Nov. 10 2008.
[37] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.
[38] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.
[39] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.
[40] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.
[41] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648. ACM, 2012.
[42] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.
[43] F. Wu and D. Weld. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th World Wide Web Conference, 2008.
[44] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.
[45] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.
[46] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

Appendix

A Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net>
PREFIX an: <http://vocab.sindice.net/analytics>
PREFIX dom: <http://sindice.com/dataspace/default/domain>

SELECT ?Dataset ?Target_Dataset ?predicate
       SUM(xsd:long(?cardinality)) AS ?total
FROM <http://sindice.com/analytics> {
  ?target any23:domain_uri ?Target_Dataset .

  ?edge an:source ?source ;
        an:target ?target ;
        an:label ?predicate ;
        an:cardinality ?cardinality ;
        an:publishedIn ?Dataset .

  ?source any23:domain_uri ?Dataset .

  FILTER (?Dataset != ?Target_Dataset)
}

Listing 1: SPARQL query on the Sindice summary graph for obtaining the number of incoming links



 | Inst. LD all | Inst. CD all | Inst. with MD CD | Raw Prop. CD | Map. Prop. CD | Raw Statem. CD | Map. Statem. CD
en | 3769926 | 3769926 | 2359521 | 48293 | 1313 | 65143840 | 33742015
de | 1243771 | 650037 | 204335 | 9593 | 261 | 7603562 | 2880381
fr | 1197334 | 740044 | 214953 | 13551 | 228 | 8854322 | 2901809
it | 882127 | 580620 | 383643 | 9716 | 181 | 12227870 | 4804731
es | 879091 | 542524 | 310348 | 14643 | 476 | 7740458 | 4383206
pl | 848298 | 538641 | 344875 | 7306 | 266 | 7696193 | 4511794
ru | 822681 | 439605 | 123011 | 13522 | 76 | 6973305 | 1389473
pt | 699446 | 460258 | 272660 | 12851 | 602 | 6255151 | 4005527
ca | 367362 | 241534 | 112934 | 8696 | 183 | 3689870 | 1301868
cs | 225133 | 148819 | 34893 | 5564 | 334 | 1857230 | 474459
hu | 209180 | 138998 | 63441 | 6821 | 295 | 2506399 | 601037
ko | 196132 | 124591 | 30962 | 7095 | 419 | 1035606 | 417605
tr | 187850 | 106644 | 40438 | 7512 | 440 | 1350679 | 556943
ar | 165722 | 103059 | 16236 | 7898 | 268 | 635058 | 168686
eu | 132877 | 108713 | 41401 | 2245 | 19 | 2255897 | 532709
sl | 129834 | 73099 | 22036 | 4235 | 470 | 1213801 | 222447
bg | 125762 | 87679 | 38825 | 3984 | 274 | 774443 | 488678
hr | 109890 | 71469 | 10343 | 3334 | 158 | 701182 | 151196
el | 71936 | 48260 | 10813 | 2866 | 288 | 206460 | 113838

Table 2: Basic statistics about Localized DBpedia Editions

 | en | it | pl | es | pt | fr | de | ru | ca | hu
Person | 763643 | 145060 | 70708 | 65337 | 43057 | 62942 | 33122 | 18620 | 7107 | 15529
Athlete | 185126 | 47187 | 30332 | 19482 | 14130 | 21646 | 31237 | 0 | 721 | 4527
Artist | 61073 | 12511 | 16120 | 25992 | 10571 | 13465 | 0 | 0 | 2004 | 3821
Politician | 23096 | 0 | 8943 | 5513 | 3342 | 0 | 0 | 12004 | 1376 | 760
Place | 572728 | 141101 | 182727 | 132961 | 116660 | 80602 | 131766 | 67932 | 73078 | 18324
PopulPlace | 387166 | 138077 | 167034 | 121204 | 109418 | 72252 | 79410 | 63826 | 72743 | 15535
Building | 60514 | 1270 | 2946 | 3570 | 803 | 921 | 83 | 43 | 0 | 527
River | 24267 | 0 | 1383 | 2 | 4149 | 3333 | 6707 | 3924 | 0 | 565
Organisation | 192832 | 4142 | 12193 | 11710 | 10949 | 17513 | 16973 | 1598 | 1399 | 3993
Company | 44516 | 4142 | 2566 | 975 | 1903 | 5832 | 7200 | 0 | 440 | 618
EducInst | 42270 | 0 | 599 | 1207 | 270 | 1636 | 2171 | 1010 | 0 | 115
Band | 27061 | 0 | 2993 | 0 | 4476 | 3868 | 5368 | 0 | 263 | 802
Work | 333269 | 51918 | 32386 | 36484 | 33869 | 39195 | 18195 | 34363 | 5240 | 9777
MusicWork | 159070 | 23652 | 14987 | 15911 | 17053 | 16206 | 0 | 6588 | 697 | 4704
Film | 71715 | 17210 | 9467 | 9396 | 8896 | 9741 | 15038 | 12045 | 3859 | 2320
Software | 27947 | 5682 | 3050 | 4833 | 3419 | 5733 | 2368 | 0 | 606 | 857

Table 3: Number of instances per class within 10 localized DBpedia versions

or more languages. The number 871630 for the class Person means that all 20 language versions together describe 871630 different persons. The number is higher than the number of persons described in the canonicalized English infobox data set (763643) listed in Table 3, since there are infoboxes in non-English articles describing a person without a corresponding infobox in the English article describing the same person. Summing up columns 2 to 10+ for the Person class, we see that 195263 persons are described in two or more languages. The large difference of this number compared to the total number of 871630 persons is due to the much smaller size of the localized DBpedia versions compared to the English one (cf. Table 2).


Class         Instances        1       2      3      4      5     6     7     8     9   10+
Person           871630   676367   94339  42382  21647  12936  8198  5295  3437  2391  4638
Place            643260   307729  150349  45836  36339  20831 13523 20808 31422 11262  5161
Organisation     206670   160398   22661   9312   5002   3221  2072  1421   928   594  1061
Work             360808   243706   54855  23097  12605   8277  5732  4007  2911  1995  3623

Table 4: Cross-language overlap: Number of instances that are described in multiple languages

3.3 Internationalisation Community

The introduction of the mapping-based infobox extractor alongside live synchronisation approaches in [19] allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages6. The DBpedia 3.7 release7 in September 2011 was the first DBpedia release to use the localized I18n DBpedia extraction framework [23].

At the time of writing, official DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish8. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services. Recently, the DBpedia internationalisation committee9 has manifested its structure and each language edition has a representative with a vote in elections. In some cases (e.g. Greek10 and Dutch11) the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [23].

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia.

6 ar, bg, bn, ca, cs, de, el, en, es, et, eu, fr, ga, hi, hr, hu, id, it, ja, ko, nl, pl, pt, ru, sl, tr, ur
7 http://blog.dbpedia.org/2011/09/11/
8 Accessed on 25/02/2013: http://wiki.dbpedia.org/Internationalization/Chapters
9 http://wiki.dbpedia.org/Internationalization
10 http://el.dbpedia.org
11 http://nl.dbpedia.org

These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage. The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is from achieving total coverage (cf. Figure 5) and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as it can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

4 Live Synchronisation

Wikipedia articles are continuously revised at a very high rate, e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute12. This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [19,33]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported13. Meanwhile, DBpedia Live for Dutch14 was developed.

12 http://stats.wikimedia.org/EN/SummaryEN.htm
13 http://live.dbpedia.org
14 http://live.nl.dbpedia.org


4.1 DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia foundation kindly provided us access to their update stream using the OAI-PMH protocol [24]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component serving as a proxy constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.

– Mappings Wiki: The DBpedia Mappings Wiki described in Section 2.4 serves as secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.

– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.

– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2 Features of DBpedia Live

The core components of the DBpedia Live Extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.
– Unmodified Pages: The update of unmodified pages at regular intervals.
– Changesets Publication: The publication of triple-changesets.
– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.
– Data Isolation: Separate data from different sources.

Fig. 8: Overview of DBpedia Live extraction framework

Mapping-Affected Pages: Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.
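As an illustration, the set of entities affected by such a mapping change corresponds to the instances of the mapped class; the query below is a minimal sketch of how they could be listed via SPARQL (the DBpedia Live framework identifies affected pages internally rather than through the public endpoint):

PREFIX dbo: <http://dbpedia.org/ontology/>

# Sketch: list all resources typed as dbo:Book, i.e. the entities
# that would be re-extracted after a change to the Book mapping.
SELECT ?entity WHERE {
  ?entity a dbo:Book .
}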

Unmodified Pages: Extraction framework improvements or activation/deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feed and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets: Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set for the added triples and another set for the deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.

Synchronisation Tool: The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files


sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.

Data Isolation: In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and mapping-based infobox extraction.
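A client of a DBpedia Live mirror can therefore scope a query to one of these sources with a GRAPH clause; the following is a minimal sketch that combines the live article graph with the ontology graph, using the graph URIs listed above:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Combine facts from the live article graph with class labels
# from the ontology graph of a DBpedia Live mirror.
SELECT ?resource ?classLabel WHERE {
  GRAPH <http://live.dbpedia.org>     { ?resource a ?class . }
  GRAPH <http://dbpedia.org/ontology> { ?class rdfs:label ?classLabel . }
}
LIMIT 100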

5 Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as the external data sets that set links pointing at DBpedia resources.

5.1 Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third party data sets. The DBpedia project maintains a link repository15 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third column contain the predicate that is used for linking as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

Links in DBpedia have been used for various purposes.

15 https://github.com/dbpedia/dbpedia-links

Data set                  Predicate                Count       Tool
Amsterdam Museum          owl:sameAs               627         S
BBC Wildlife Finder       owl:sameAs               444         S
Book Mashup               rdf:type, owl:sameAs     9 100
Bricklink                 dc:publisher             10 100
CORDIS                    owl:sameAs               314         S
Dailymed                  owl:sameAs               894         S
DBLP Bibliography         owl:sameAs               196         S
DBTune                    owl:sameAs               838         S
Diseasome                 owl:sameAs               2 300       S
Drugbank                  owl:sameAs               4 800       S
EUNIS                     owl:sameAs               3 100       S
Eurostat (Linked Stats)   owl:sameAs               253         S
Eurostat (WBSG)           owl:sameAs               137
CIA World Factbook        owl:sameAs               545         S
flickr wrappr             dbp:hasPhotoCollection   3 800 000   C
Freebase                  owl:sameAs               3 600 000   C
GADM                      owl:sameAs               1 900
GeoNames                  owl:sameAs               86 500      S
GeoSpecies                owl:sameAs               16 000      S
GHO                       owl:sameAs               196         L
Project Gutenberg         owl:sameAs               2 500       S
Italian Public Schools    owl:sameAs               5 800       S
LinkedGeoData             owl:sameAs               103 600     S
LinkedMDB                 owl:sameAs               13 800      S
MusicBrainz               owl:sameAs               23 000
New York Times            owl:sameAs               9 700
OpenCyc                   owl:sameAs               27 100      C
OpenEI (Open Energy)      owl:sameAs               678         S
Revyu                     owl:sameAs               6
Sider                     owl:sameAs               2 000       S
TCMGeneDIT                owl:sameAs               904
UMBEL                     rdf:type                 896 400
US Census                 owl:sameAs               12 600
WikiCompany               owl:sameAs               8 300
WordNet                   dbp:wordnet_type         467 100
YAGO2                     rdf:type                 18 100 000

Sum                                                27 211 732

Table 5: Data sets linked from DBpedia

One example is the combination of data about European Union project funding (FTS) [29] and data about countries in DBpedia. The following query compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia):

SELECT * {
  { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
  } }
  { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
  } }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

In addition to providing outgoing links on instance-level, DBpedia also sets links on schema-level pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively. In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
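Such schema-level links are part of the ontology itself and can be listed directly; the query below is a small sketch that retrieves the DBpedia classes declared equivalent to a Schema.org term (assuming the ontology is loaded in the queried endpoint):

PREFIX owl: <http://www.w3.org/2002/07/owl#>

# List DBpedia ontology classes that are declared equivalent
# to a class in the http://schema.org namespace.
SELECT ?dbpediaClass ?schemaOrgClass WHERE {
  ?dbpediaClass owl:equivalentClass ?schemaOrgClass .
  FILTER (STRSTARTS(STR(?schemaOrgClass), "http://schema.org/"))
}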

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of links pointing at DBpedia from other data sets is 39,007,478 according to the Data Hub16. However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [36]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered.

16 See http://wiki.dbpedia.org/Interlinking for details

Data set          Link Predicate Count   Link Count
okaboo.com                 4              2407121
tfri.gov.tw               57               837459
naplesplus.us              2               212460
fu-berlin.de               7               145322
freebase.com             108               138619
geonames.org               3               125734
opencyc.org                3                19648
geospecies.org            10                16188
dbrec.net                  3                12856
faviki.com                 5                12768

Table 6: Top 10 data sets in Sindice ordered by the number of links to DBpedia

Metric                      Value
Total links                 3960212
Total distinct data sets    248
Total distinct predicates   345

Table 7: Sindice summary statistics for incoming links to DBpedia

Sindice computes a graph summary over all resources they store. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing at DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia, along with the used link predicate and the number of links.

It should be noted that the data in Sindice is not complete, for instance it does not contain all data sets that are catalogued by the DataHub17. However, it crawls for RDFa snippets, converts microformats etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster (see appendix for details). Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data, and DBpedia is currently ranked second in terms of incoming links.

17 http://datahub.io


domain                 datasets      links
purl.org                    498    6717520
dbpedia.org                 248    3960212
creativecommons.org        2483    3030910
identi.ca                  1021    2359276
l3s.de                       34    1261487
rkbexplorer.com              24    1212416
nytimes.com                  27    1174941
w3.org                      405     658535
geospecies.org               13     523709
livejournal.com           14881     366025

Table 8: Top 10 datasets by incoming links in Sindice

Fig. 9: The download count (in thousands) and download volume of the English language edition of DBpedia, per quarter of 2012

6 DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint and third, it provides dereferencable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last couple of years.

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to the download popularity as well. The top five languages with respect to the download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language is indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 18 TB per quarter or 6 TB per month is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity.

DBpedia      Configuration
3.3 – 3.4    AMD Opteron 8220, 2.80 GHz, 4 cores, 32 GB
3.5 – 3.7    Intel Xeon E5520, 2.27 GHz, 8 cores, 48 GB
3.8          Intel Xeon E5-2630, 2.30 GHz, 8 cores, 64 GB

Table 9: Hardware of the machines serving the public SPARQL endpoint

The download count and the download volume of each data set during the year 2012 is depicted in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1,000 times per month18. Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Supposedly, they are used for network analysis or providing relevant links in user interfaces, and are downloaded more often as they are not provided via the official SPARQL endpoint.

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint19 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports infinite horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection of any size. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful hardware, as shown in Table 9.

The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging alongside the ability to produce partial results.
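Clients that need more than the maximum result set size can therefore page through results; the following is a minimal sketch of such a paged query (the ordering key and page size are only illustrative choices):

PREFIX dbo: <http://dbpedia.org/ontology/>

# Retrieve the second page of 10,000 results; a stable ORDER BY
# keeps the pages disjoint across successive requests.
SELECT ?person WHERE {
  ?person a dbo:Person .
}
ORDER BY ?person
LIMIT 10000
OFFSET 10000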

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate limited after a burst period

18 The IP address was only filtered for that specific file and month in those cases.

19 http://dbpedia.org/sparql


Fig. 10: The download count (in thousands) and download volume of the DBpedia data sets

2. clients that have been banned after misuse

3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally use too many resources

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10 MB) are enforced. If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code20 like 509 and quickly drops the connection. The system further uses an iptables based firewall for permanent blocking of clients identified by their IP addresses.

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.21. Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the last couple of DBpedia releases. The Avg/Day column represents the average number of hits (resp. visits) per day, followed by the Median and Standard Deviation.

20 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
21 http://www.webalizer.org

DBpedia   Avg/Day    Median     Stdev      Maximum
3.3        733811     711306    188991     1319539
3.4       1212549    1165893    351226     2371657
3.5       1122612    1035444    386454     2908720
3.6       1328355    1286750    256945     2495031
3.7       2085399    1930728   1057398     8873473
3.8       2910410    2717775   1085640     7678490

Table 10: Number of SPARQL endpoint hits

Fig. 11: Average number of SPARQL endpoint requests per day

The last column shows the maximum number of hits (visits) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30 minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30 minute interval.


DBpedia   Avg/Day   Median   Stdev   Maximum
3.3          9750     9775    2036     13126
3.4         11329    11473    1546     14198
3.5         16592    16902    2969     23129
3.6         19471    17664    5691     56064
3.7         23972    22262   10741    127869
3.8         16851    16711    2960     27416

Table 11: Number of SPARQL endpoint visits

Fig. 12: Average SPARQL endpoint visits per day

Figure 11 and Figure 12 show the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint

2. blocking of apps that were abusing the DBpedia endpoint

3. uptake of the language specific DBpedia endpoints and DBpedia Live

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data Hub returning resources in a number of different formats. For each data set, we randomly selected 14 days worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

Endpoint     3.3     3.4     3.5     3.6     3.7     3.8
data        1355    2681    2230    2547    3714    4246
ontology      80     168     142     178     116      97
page        2634    4703    1810    1687    3787    7377
property     231     311     137     176     176     128
resource    2976    4080    2554    2309    4436    7143
sparql      2090    4177    2266    9112   15902   15475
other        252     420     342     277     579     695
total       9619   16541    9434   16286   28710   35142

Table 12: Hits per service to http://dbpedia.org in thousands

Fig. 13: Traffic Linked Data versus SPARQL endpoint

The resource endpoint uses the Accept line in the HTTP header sent by the client to return a HTTP status code 30x to redirect the client to either the page (HTML based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data from the links at the bottom of the page based article. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology.

Figure 13 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focussed on the calls to the sparql endpoint and counted the number of statements per type. As the log files only record the full SPARQL query on a GET request, all the PUT requests are counted as unknown.


Statement    3.3    3.4    3.5    3.6    3.7    3.8
ask           56    269    360    326    159    677
construct     55     31     14     11     22     38
describe      11      8      4      7     62    111
select      1891   3663   1658   8030  11204  13516
unknown       78    206    229    738   4455   1134
total       2090   4177   2266   9112  15902  15475

Table 13: Hits per statement type in thousands

Statement    3.3    3.4    3.5    3.6    3.7    3.8
distinct    19.5   11.4   17.3   19.4   13.3   25.4
filter      45.7   13.7   31.8   25.3   29.2   36.1
functions    8.8    6.9   23.5   21.3   25.5   25.9
geo         27.7    7.0   39.6    6.7    9.3    9.2
group        0.0    0.0    0.0    0.0    0.0    0.2
limit        4.6    6.5   11.6   10.5    7.7    7.8
optional    30.6   23.8   47.3   26.3   16.7   17.2
order        2.3    1.3    1.9    3.2    1.2    1.6
union        3.3    3.2   26.4   11.3   12.1   20.6

Table 14: Trends in SPARQL select (rounded values in %)

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, isIRI
– Use of GEO objects
– GROUP BY
– LIMIT / OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and usage in property tags. Table 14 shows the use of various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
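For illustration, a query combining several of the analysed constructs might look like the following sketch (the class and property names are ordinary DBpedia ontology terms used here only as an example of a "complex" query in the sense of Table 14):

PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Combines DISTINCT, OPTIONAL, FILTER, ORDER BY and LIMIT.
SELECT DISTINCT ?city ?name ?population WHERE {
  ?city a dbo:City ;
        rdfs:label ?name .
  OPTIONAL { ?city dbo:populationTotal ?population . }
  FILTER (lang(?name) = "en")
}
ORDER BY DESC(?population)
LIMIT 20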

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 14 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 14: Number of daily requests sent to DBpedia Live for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013

7 Use Cases and Applications

With DBpedia evolving into a hub on the Web of Data, due to its coverage of various domains, it is an eminent data set for various tasks. It can not only improve Wikipedia search, but also serve as a data source for applications and mashups, as well as for text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [30]. For that purpose, DBpedia includes a number of specialized data sets22. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [12].

22 http://wiki.dbpedia.org/Datasets/NLP


The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

7.1.1 Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking tasks – refers to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [21].

As DBpedia reflects a vast amount of structured real world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [31] is an open source tool23 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base described as SPARQL queries.

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI24, Semantic API from Ontos25, Open Calais26 and Zemanta27, among others.

23 http://spotlight.dbpedia.org
24 http://www.alchemyapi.com
25 http://www.ontos.com
26 http://www.opencalais.com
27 http://www.zemanta.com

Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project28.

Tag disambiguation: Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki29 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [13]. The BBC30 employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system. The Watson system won a $1 million prize in Jeopardy and relies on several data sets including DBpedia31. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series32. Several QA systems, such as TBSL [41], PowerAqua [28], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.
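To give a flavour of the task, such a system typically has to translate a natural language question like "Which books were written by Dan Brown?" into a SPARQL query over the DBpedia schema; a hypothetical sketch of such a target query is shown below (the mapping to dbo:author is one plausible choice, not an official QALD gold-standard query):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# "Which books were written by Dan Brown?" expressed against
# the DBpedia ontology; the QA system must discover this mapping.
SELECT DISTINCT ?book WHERE {
  ?book a dbo:Book ;
        dbo:author dbr:Dan_Brown .
}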

28 http://stanbol.apache.org
29 http://www.faviki.com
30 http://bbc.co.uk
31 http://www.aaai.org/Magazine/Watson/watson.php
32 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property from mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [25].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.

– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority Files (VIAF) project33. Recently, it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia34. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of tools and structure them in categories.

33 http://viaf.org
34 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html

Facet Based Browsers: An award-winning35 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [14]. Another facet based browser, which allows users to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [15]. A generic SPARQL based facet explorer, which also uses a graph based visualisation of facets, is LODLive [7]. The OpenLink built-in facet based browser36 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying: The DBpedia Query Builder37 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [16] tool provides an intuitive interface which allows users to explore the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [17] allows users to create statistical analysis queries and correlations in RDF data and DBpedia in particular.

Spatial Applications: DBpedia Mobile [3] is a location aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources available, written in 171 languages (of which approximately 147 can be considered active38), containing information about hundreds of spoken and even ancient languages.

35 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
36 http://dbpedia.org/fct
37 http://querybuilder.dbpedia.org
38 http://s23.org/wikistats/wiktionaries_html.php


For example, the English Wiktionary contains nearly 3 million words39. A Wiktionary page provides, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies and, most prominently, senses. Within this tree, numerous kinds of linguistic properties are given, including synonyms, hyponyms, hyperonyms, example sentences, links to Wikipedia and many more. The fast changing nature together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules poses the biggest problem to the automated transformation into a structured knowledge base. We identified this as a serious problem: although the value of Wiktionary is known and usage scenarios are obvious, only some rudimentary tools exist, which either focus on a specific subset of the data or only cover one or two WLEs [18]. Existing tools can be seen as adapters to single WLEs: they are hard to maintain and there are too many languages that constantly change and require a programmer to refactor complex code. Opposed to the currently employed classic and straight-forward approach (implementing software adapters for scraping), Wiktionary2RDF [18] employs a declarative mediator/wrapper pattern. Figure 15 shows the work flow of the specially created Wiktionary extractor that takes as input one config XML file for each WLE. The aim is to enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves. A simple XML dialect was created to encode the "entry layout explained" (ELE) guidelines and declare triple patterns that define how the resulting RDF should be built. An example is given for the following wiki syntax:

===Synonyms===
[[building]]
[[company]]

The regular expression for scraping and the triple generation rule encoded in XML look like:

<wikiTemplate>===Synonyms===
( [[$target]]
)+
</wikiTemplate>

<triple s="http://some.ns/$entityId" p="http://some.ns/hasSynonym" o="http://some.ns/$target" />

39 See http://en.wiktionary.org/wiki/semantic for a simple example page

Fig. 15: Detail view of the Wiktionary extractor

This configuration is interpreted and run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data40. Statistics are shown in Table 15.

8 Related Work

8.1 Cross Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, the Wikimedia Germany e.V. started the development of Wikidata41. Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects, and allows for central access to the data in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering the truth about things, but providing statements about them. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011 according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

40 http://wiktionary.dbpedia.org
41 wikidata.org


language     words      triples    resources   predicates   senses   XML lines
en         2142237     28593364     11804039           28   424386         930
fr         4657817     35032121     20462349           22   592351         490
ru         1080156     12813437      5994560           17   149859        1449
de          701739      5618508      2966867           16   122362         671

Table 15: Statistical comparison of extractions for different languages. XML lines measures the number of lines of the XML configuration files

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface42, as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase43 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [32] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only. Deleted triples are marked and the system can easily revert to a previous version.

42 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
43 http://www.freebase.com

This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3 YAGO

One of the projects which pursues similar goals to DBpedia is YAGO44 [39]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet. WordNet synsets are represented as classes and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [20], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content. YAGO focuses on extracting a smaller number of relations compared to DBpedia to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia

44 http://www.mpi-inf.mpg.de/yago-naga/yago


community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects45 which are able to process wiki syntax, as well as a list of data extraction extensions46 for MediaWiki.

JWPL (Java Wikipedia Library [46]) is an open-source Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki Markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as XML dump and is incorporated in the lexical resource UBY47 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [34]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguations or synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service48. Additional projects presented by the authors that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site49.

An earlier approach to improve the quality of the infobox schemata and contents is described in [44].

45 http://www.mediawiki.org/wiki/Alternative_parsers
46 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction
47 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
48 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
49 http://sigwp.org/en

The presented methodology encompasses a three step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [35], the authors describe their effort building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike50 and Hudong Baike51. Apart from utilizing MediaWiki and HTML Markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve a better cross-lingual knowledge-linkage beyond the use of Wikipedia interlanguage links is presented in [42]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

KnowItAll52 is a web scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [26] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [37] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [43] refines existing Wikipedia infoboxes based on machine learning techniques using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [44] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

50 http://baike.baidu.com
51 http://www.hudong.com
52 http://www.cs.washington.edu/research/knowitall


[2] was an infobox extraction approach for Wikipedia, which later became the DBpedia project.

9 Conclusions and Future Work

In this system report, we presented an overview on recent advances of the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that DBpedia matured and improved significantly in the last years, in particular also in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories and DBpedia already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [18] meanwhile becoming part of DBpedia, and LinkedGeoData [38], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future we see, in particular, the following directions for advancing the DBpedia project:

Multilingual data integration and fusion: An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach a significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement: In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [45], we took a first step in this direction.

Inline extraction: Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links is a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki53. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia: While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other nicely. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia: A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, which identify data quality issues and are executed in certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP: Despite recent advances (cf. Section 7), there is still a huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (depends on training data) and update (requires recomputation). With DBpedia and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationship between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors

– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint

– Kingsley Idehen for SPARQL and Linked Data hosting and community support

– Paul Kreis for DBpedia development

– Christopher Sahnwaldt for DBpedia development and release management

– Claus Stadler for DBpedia development

– people who helped contributing data for certain parts of the article:

  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)

  * Freebase: Shawn Simister and Tom Morris (working at Freebase)

  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.

[3] C. Becker and C. Bizer. Exploring the geospatial semantic web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the web of data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web, pages 100–110. ACM, 2004.

[12] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: enriching DBpedia with multimedia information. In Proceedings of the Sixth International Conference on Knowledge Capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[13] A Garcıa-Silva M Szomszor H Alani and O Corcho Pre-liminary results in tag disambiguation using DBpedia In 1stInternational Workshop in Collective Knowledage Capturingand Representation (CKCaR) California USA 2009

[14] R Hahn C Bizer C Sahnwaldt C Herta S RobinsonM Burgle H Duwiger and U Scheel Faceted wikipediasearch In W Abramowicz and R Tolksdorf editors BusinessInformation Systems 13th International Conference BIS 2010Berlin Germany May 3-5 2010 Proceedings volume 47 of


Lecture Notes in Business Information Processing pages 1ndash11Springer 2010

[15] P Heim T Ertl and J Ziegler Facet graphs Complex seman-tic querying made easy In Proceedings of the 7th Extended Se-mantic Web Conference (ESWC 2010) volume 6088 of LNCSpages 288ndash302 BerlinHeidelberg 2010 Springer

[16] P Heim S Hellmann J Lehmann S Lohmann and T Stege-mann Relfinder Revealing relationships in RDF knowledgebases In T-S Chua Y Kompatsiaris B Merialdo W HaasG Thallinger and W Bailer editors Semantic Multimedia4th International Conference on Semantic and Digital MediaTechnologies SAMT 2009 Graz Austria December 2-4 2009Proceedings volume 5887 of Lecture Notes in Computer Sci-ence pages 182ndash187 Springer 2009

[17] P Heim S Lohmann D Tsendragchaa and T Ertl Semlensvisual analysis of semantic data with scatter plots and semanticlenses In C Ghidini A-C N Ngomo S N Lindstaedt andT Pellegrini editors Proceedings the 7th International Con-ference on Semantic Systems I-SEMANTICS 2011 Graz Aus-tria September 7-9 2011 ACM International Conference Pro-ceeding Series pages 175ndash178 ACM 2011

[18] S Hellmann J Brekle and S Auer Leveraging the crowd-sourcing of lexical resources for bootstrapping a linguistic datacloud In Proc of the Joint International Semantic TechnologyConference 2012

[19] S Hellmann C Stadler J Lehmann and S Auer DBpedialive extraction In Proc of 8th International Conferenceon Ontologies DataBases and Applications of Semantics(ODBASE) volume 5871 of Lecture Notes in Computer Sci-ence pages 1209ndash1223 2009

[20] J Hoffart F M Suchanek K Berberich and G WeikumYAGO2 A spatially and temporally enhanced knowledge basefrom wikipedia Artif Intell 19428ndash61 2013

[21] H Ji R Grishman H T Dang K Griffitt and J EllisOverview of the TAC 2010 knowledge base population trackIn Third Text Analysis Conference (TAC 2010) 2010

[22] D Kontokostas S Auer S Hellmann J Lehmann P West-phal R Cornelissen and A Zaveri Test-driven data qualityevaluation for sparql endpoints In Submitted to 12th Interna-tional Semantic Web Conference 21-25 October 2013 SydneyAustralia 2013

[23] D Kontokostas C Bratsas S Auer S Hellmann I Antoniouand G Metakides Internationalization of linked data The caseof the greek dbpedia edition Web Semantics Science Servicesand Agents on the World Wide Web 15(0)51 ndash 61 2012

[24] C Lagoze H V de Sompel M Nelson and S WarnerThe open archives initiative protocol for metadata har-vesting httpwwwopenarchivesorgOAIopenarchivesprotocolhtml 2008

[25] J Lehmann S Monahan L Nezda A Jung and Y Shi LCCapproaches to knowledge base population at TAC 2010 InProceedings TAC Workshop 2010

[26] D B Lenat Cyc A large-scale investment in knowledgeinfrastructure Communications of the ACM 38(11)33ndash381995

[27] C-Y Lin and E H Hovy The automated acquisition oftopic signatures for text summarization In Proceedings of the18th conference on Computational linguistics pages 495ndash5012000

[28] V Lopez M Fernandez E Motta and N Stieler PoweraquaSupporting users in querying and exploring the semantic web

Semantic Web 3(3)249ndash265 2012[29] M Martin C Stadler P Frischmuth and J Lehmann Increas-

ing the financial transparency of european commission projectfunding Semantic Web Journal Special Call for LinkedDataset descriptions 2013

[30] P N Mendes M Jakob and C Bizer DBpedia for NLP -a multilingual cross-domain knowledge base In Proceedingsof the International Conference on Language Resources andEvaluation (LREC) Istanbul Turkey 2012

[31] P N Mendes M Jakob A Garcıa-Silva and C BizerDBpedia Spotlight Shedding light on the web of documentsIn Proceedings of the 7th International Conference on Seman-tic Systems (I-Semantics) Graz Austria 2011

[32] S M Meyer J Degener J Giannandrea and B MichenerOptimizing schema-last tuple-store queries in graphd In A KElmagarmid and D Agrawal editors SIGMOD Conferencepages 1047ndash1056 ACM 2010

[33] M Morsey J Lehmann S Auer C Stadler and S Hell-mann DBpedia and the live extraction of structured data fromwikipedia Program electronic library and information sys-tems 4627 2012

[34] K Nakayama M Pei M Erdmann M Ito M ShirakawaT Hara and S Nishio Wikipedia mining Wikipedia as a cor-pus for knowledge extraction In Annual Wikipedia Conference(Wikimania) 2008

[35] X Niu X Sun H Wang S Rong G Qi and Y Yu ZhishimeWeaving chinese linking open data In 10th International Con-ference on The semantic web - Volume Part II pages 205ndash220Berlin Heidelberg 2011 Springer-Verlag

[36] E Oren R Delbru M Catasta R Cyganiak H Stenzhornand G Tummarello Sindicecom a document-oriented lookupindex for open linked data Int J of Metadata and Semanticsand Ontologies 337ndash52 Nov 10 2008

[37] S P Ponzetto and M Strube Wikitaxonomy A large scaleknowledge resource In M Ghallab C D SpyropoulosN Fakotakis and N M Avouris editors ECAI volume 178of Frontiers in Artificial Intelligence and Applications pages751ndash752 IOS Press 2008

[38] C Stadler J Lehmann K Hoffner and S Auer Linkedgeo-data A core for a web of spatial open data Semantic WebJournal 3(4)333ndash354 2012

[39] F M Suchanek G Kasneci and G Weikum Yago a core ofsemantic knowledge In C L Williamson M E Zurko P FPatel-Schneider and P J Shenoy editors WWW pages 697ndash706 ACM 2007

[40] E Tacchini A Schultz and C Bizer Experiments withwikipedia cross-language data fusion In Proceedings of the5th Workshop on Scripting and Development for the SemanticWeb ESWC Citeseer 2009

[41] C Unger L Buhmann J Lehmann A-C Ngonga NgomoD Gerber and P Cimiano Template-based question answer-ing over rdf data In Proceedings of the 21st international con-ference on World Wide Web pages 639ndash648 ACM 2012

[42] Z Wang J Li Z Wang and J Tang Cross-lingual knowledgelinking across wiki knowledge bases In Proceedings of the21st international conference on World Wide Web pages 459ndash468 New York NY USA 2012 ACM

[43] F Wu and D Weld Automatically Refining the WikipediaInfobox Ontology In Proceedings of the 17th World Wide WebConference 2008


[44] F Wu and D S Weld Autonomously semantifying wikipediaIn Proceedings of the 16th Conference on Information andKnowledge Management pages 41ndash50 ACM 2007

[45] A Zaveri D Kontokostas M A Sherif L BuhmannM Morsey S Auer and J Lehmann User-driven quality eval-uation of dbpedia In To appear in Proceedings of 9th Inter-national Conference on Semantic Systems I-SEMANTICS rsquo13Graz Austria September 4-6 2013 ACM 2013

[46] T Zesch C Muller and I Gurevych Extracting lexical seman-tic knowledge from wikipedia and wiktionary In Proceedingsof the 6th International Conference on Language Resourcesand Evaluation Marrakech Morocco May 2008 electronicproceedings

Appendix

A Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net/>
PREFIX an:    <http://vocab.sindice.net/analytics>
PREFIX dom:   <http://sindice.com/dataspace/default/domain/>

SELECT ?Dataset ?Target_Dataset ?predicate
       SUM(xsd:long(?cardinality)) AS ?total
FROM <http://sindice.com/analytics> {
  ?target any23:domain_uri ?Target_Dataset .

  ?edge an:source ?source ;
        an:target ?target ;
        an:label ?predicate ;
        an:cardinality ?cardinality ;
        an:publishedIn ?Dataset .

  ?source any23:domain_uri ?Dataset .

  FILTER (?Dataset != ?Target_Dataset)
}

Listing 1: SPARQL query on the Sindice summary graph for obtaining the number of incoming links.


Class          Instances   1        2        3       4       5       6      7      8      9      10+
Person         871630      676367   94339    42382   21647   12936   8198   5295   3437   2391   4638
Place          643260      307729   150349   45836   36339   20831   13523  20808  31422  11262  5161
Organisation   206670      160398   22661    9312    5002    3221    2072   1421   928    594    1061
Work           360808      243706   54855    23097   12605   8277    5732   4007   2911   1995   3623

Table 4: Cross-language overlap. Number of instances that are described in multiple languages.

3.3. Internationalisation Community

The introduction of the mapping-based infobox extractor alongside live synchronisation approaches in [19] allowed the international DBpedia community to easily define infobox-to-ontology mappings. As a result of this development, there are presently mappings for 27 languages6. The DBpedia 3.7 release7 in September 2011 was the first DBpedia release to use the localized I18n DBpedia extraction framework [23].

At the time of writing, official DBpedia chapters for 14 languages have been founded: Basque, Czech, Dutch, English, French, German, Greek, Italian, Japanese, Korean, Polish, Portuguese, Russian and Spanish8. Besides providing mappings from infoboxes in the corresponding Wikipedia editions, DBpedia chapters organise a local community and provide hosting for data sets and associated services. Recently, the DBpedia internationalisation committee9 has manifested its structure and each language edition has a representative with a vote in elections. In some cases (e.g. Greek10 and Dutch11) the existence of a local DBpedia chapter has had a positive effect on the creation of localized LOD clouds [23].

In the weeks leading to a new release, the DBpedia project organises a mapping sprint, where communities from each language work together to improve mappings, increase coverage and detect bugs in the extraction process. The progress of the mapping effort is tracked through statistics on the number of mapped templates and properties, as well as the number of times these templates and properties occur in Wikipedia. These statistics provide an estimate of the coverage of each Wikipedia edition in terms of how many entities will be typed and how many properties from those entities will be extracted. Therefore, they can be used by each language edition to prioritize properties and templates with higher impact on the coverage. The mapping statistics have also been used as a way to promote a healthy competition between language editions. A sprint page was created with bar charts that show how close each language is from achieving total coverage (cf. Figure 5) and line charts showing the progress over time, highlighting when one language is overtaking another in their race for higher coverage. The mapping sprints have served as a great motivator for the crowd-sourcing efforts, as can be noted from the increase in the number of mapping contributions in the weeks leading to a release.

6 ar, bg, bn, ca, cs, de, el, en, es, et, eu, fr, ga, hi, hr, hu, id, it, ja, ko, nl, pl, pt, ru, sl, tr, ur
7 http://blog.dbpedia.org/2011/09/11/
8 Accessed on 25/02/2013: http://wiki.dbpedia.org/InternationalizationChapters
9 http://wiki.dbpedia.org/Internationalization
10 http://el.dbpedia.org
11 http://nl.dbpedia.org

4. Live Synchronisation

Wikipedia articles are continuously revised at a very high rate, e.g. the English Wikipedia, in June 2013, has approximately 3.3 million edits per month, which is equal to 77 edits per minute12. This high change frequency leads to DBpedia data quickly being outdated, which in turn leads to the need for a methodology to keep DBpedia in synchronisation with Wikipedia. As a consequence, the DBpedia Live system was developed, which works on a continuous stream of updates from Wikipedia and processes that stream on the fly [19,33]. It allows extracted data to stay up-to-date with a small delay of at most a few minutes. Since the English Wikipedia is the largest among all Wikipedia editions with respect to the number of articles and the number of edits per month, it was the first language DBpedia Live supported13. Meanwhile, DBpedia Live for Dutch14 was developed.

12 http://stats.wikimedia.org/EN/SummaryEN.htm
13 http://live.dbpedia.org
14 http://live.nl.dbpedia.org


4.1. DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia foundation kindly provided us access to their update stream using the OAI-PMH protocol [24]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component, serving as a proxy, constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.
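To give an impression of what such a proxy does, the following is a minimal sketch (in Python, not the actual Java component; the endpoint URL, metadata prefix and polling interval are assumptions) of incrementally harvesting revised pages via OAI-PMH ListRecords and handing them to a downstream consumer:

import time
import urllib.request
import xml.etree.ElementTree as ET

# Assumed OAI-PMH endpoint of a MediaWiki installation; the real DBpedia Live
# setup consumes a Wikimedia-provided update stream instead.
OAI_ENDPOINT = "http://example.org/wiki/Special:OAIRepository"
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def poll_updates(last_seen, consumer):
    """Fetch records changed since `last_seen` (UTC timestamp) and feed them on."""
    url = (OAI_ENDPOINT + "?verb=ListRecords&metadataPrefix=mediawiki"
           + "&from=" + last_seen)
    with urllib.request.urlopen(url) as response:
        tree = ET.parse(response)
    for record in tree.iter(OAI_NS + "record"):
        consumer(record)          # e.g. enqueue the page for re-extraction
    return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())

# Polling loop; the proxy decouples Wikipedia from the extraction framework:
# last = "2013-01-01T00:00:00Z"
# while True:
#     last = poll_updates(last, print)
#     time.sleep(30)   # assumed polling interval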

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.
– Mappings Wiki: The DBpedia Mappings Wiki, described in Section 2.4, serves as secondary input source. Changes of the mappings wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.
– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.
– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2. Features of DBpedia Live

The core components of the DBpedia Live extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.
– Unmodified Pages: The update of unmodified pages at regular intervals.
– Changesets Publication: The publication of triple-changesets.
– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.
– Data Isolation: Separate data from different sources.

Fig. 8. Overview of the DBpedia Live extraction framework.

Mapping-Affected Pages. Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example, if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.

Unmodified Pages. Extraction framework improvements or the activation / deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feed and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets. Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set of added triples and another set of deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.
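As a sketch of how a mirror could integrate one such changeset pair (the graph URI and the concrete triples are only illustrative; the actual file layout follows the published DBpedia Live conventions), the deleted triples are removed first and the added triples inserted afterwards via SPARQL Update:

# Contents of the ".removed" N-Triples file, applied first
DELETE DATA {
  GRAPH <http://live.dbpedia.org> {
    <http://dbpedia.org/resource/Berlin>
        <http://dbpedia.org/ontology/populationTotal> "3499879"^^<http://www.w3.org/2001/XMLSchema#integer> .
  }
} ;
# Contents of the ".added" N-Triples file, applied afterwards
INSERT DATA {
  GRAPH <http://live.dbpedia.org> {
    <http://dbpedia.org/resource/Berlin>
        <http://dbpedia.org/ontology/populationTotal> "3501872"^^<http://www.w3.org/2001/XMLSchema#integer> .
  }
}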

Synchronisation Tool. The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.

Data Isolation. In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds is contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) is kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and mapping-based infobox extraction.
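For example, a client that is only interested in the continuously updated article data (excluding static interlinks and the ontology) can restrict a query to the corresponding named graph; the resource and property in this sketch are only illustrative:

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?abstract
FROM <http://live.dbpedia.org>        # article update feed only
WHERE {
  <http://dbpedia.org/resource/Berlin> dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "en")
}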

5. Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as of the external data sets that set links pointing to DBpedia resources.

5.1. Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third-party data sets. The DBpedia project maintains a link repository15 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third columns contain the predicate that is used for linking as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to a custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

15 https://github.com/dbpedia/dbpedia-links

Data set                   Predicate               Count        Tool
Amsterdam Museum           owl:sameAs              627          S
BBC Wildlife Finder        owl:sameAs              444          S
Book Mashup                rdf:type, owl:sameAs    9 100
Bricklink                  dc:publisher            10 100
CORDIS                     owl:sameAs              314          S
Dailymed                   owl:sameAs              894          S
DBLP Bibliography          owl:sameAs              196          S
DBTune                     owl:sameAs              838          S
Diseasome                  owl:sameAs              2 300        S
Drugbank                   owl:sameAs              4 800        S
EUNIS                      owl:sameAs              3 100        S
Eurostat (Linked Stats)    owl:sameAs              253          S
Eurostat (WBSG)            owl:sameAs              137
CIA World Factbook         owl:sameAs              545          S
flickr wrappr              dbp:hasPhotoCollection  3 800 000    C
Freebase                   owl:sameAs              3 600 000    C
GADM                       owl:sameAs              1 900
GeoNames                   owl:sameAs              86 500       S
GeoSpecies                 owl:sameAs              16 000       S
GHO                        owl:sameAs              196          L
Project Gutenberg          owl:sameAs              2 500        S
Italian Public Schools     owl:sameAs              5 800        S
LinkedGeoData              owl:sameAs              103 600      S
LinkedMDB                  owl:sameAs              13 800       S
MusicBrainz                owl:sameAs              23 000
New York Times             owl:sameAs              9 700
OpenCyc                    owl:sameAs              27 100       C
OpenEI (Open Energy)       owl:sameAs              678          S
Revyu                      owl:sameAs              6
Sider                      owl:sameAs              2 000        S
TCMGeneDIT                 owl:sameAs              904
UMBEL                      rdf:type                896 400
US Census                  owl:sameAs              12 600
WikiCompany                owl:sameAs              8 300
WordNet                    dbp:wordnet_type        467 100
YAGO2                      rdf:type                18 100 000
Sum                                                27 211 732

Table 5: Data sets linked from DBpedia.

Links in DBpedia have been used for various purposes. One example is the combination of data about European Union project funding (FTS) [29] and data about countries in DBpedia. The following query compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia):

SELECT * {
  { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
  } }
  { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
  } }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

In addition to providing outgoing links on the instance level, DBpedia also sets links on the schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively. In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
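Expressed in RDF, such a schema-level link is a single triple; the following Turtle snippet shows the general form (the specific class correspondence is given only to illustrate the mechanism):

@prefix dbo:    <http://dbpedia.org/ontology/> .
@prefix schema: <http://schema.org/> .
@prefix owl:    <http://www.w3.org/2002/07/owl#> .

# Schema-level link from a DBpedia ontology class to a Schema.org class
dbo:Person owl:equivalentClass schema:Person .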

5.2. Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of links pointing at DBpedia from other data sets is 39,007,478 according to the Data Hub16. However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [36]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered. Sindice computes a graph summary over all resources they store. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing at DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia, along with the used link predicate and the number of links.

It should be noted that the data in Sindice is not complete, for instance it does not contain all data sets that are catalogued by the DataHub17. However, it crawls for RDFa snippets, converts microformats etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster (see appendix for details). Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data, and DBpedia is currently ranked second in terms of incoming links.

16 See http://wiki.dbpedia.org/Interlinking for details.
17 http://datahub.io

Data set            Link Predicate Count   Link Count
okaboo.com          4                      2 407 121
tfri.gov.tw         57                     837 459
naplesplus.us       2                      212 460
fu-berlin.de        7                      145 322
freebase.com        108                    138 619
geonames.org        3                      125 734
opencyc.org         3                      19 648
geospecies.org      10                     16 188
dbrec.net           3                      12 856
faviki.com          5                      12 768

Table 6: Top 10 data sets in Sindice ordered by the number of links to DBpedia.

Metric                      Value
Total links                 3 960 212
Total distinct data sets    248
Total distinct predicates   345

Table 7: Sindice summary statistics for incoming links to DBpedia.

Domain                 Datasets   Links
purl.org               498        6 717 520
dbpedia.org            248        3 960 212
creativecommons.org    2 483      3 030 910
identi.ca              1 021      2 359 276
l3s.de                 34         1 261 487
rkbexplorer.com        24         1 212 416
nytimes.com            27         1 174 941
w3.org                 405        658 535
geospecies.org         13         523 709
livejournal.com        14 881     366 025

Table 8: Top 10 datasets by incoming links in Sindice.

[Figure 9: bar chart per quarter of 2012; legend: Download Count (in Thousands), Download Volume (in TB)]
Fig. 9. The download count and download volume of the English language edition of DBpedia.

6. DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint, and third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last couple of years.

6.1. Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to their download popularity. The top five languages with respect to download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume for the English language edition are indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 18 TB per quarter, or 6 TB per month, is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 are depicted in Figure 10. In those statistics we filtered out all IP addresses which requested a file more than 1000 times per month18. Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Supposedly, they are used for network analysis or for providing relevant links in user interfaces, and they are downloaded more often as they are not provided via the official SPARQL endpoint.

DBpedia    Configuration
3.3 – 3.4  AMD Opteron 8220, 2.80 GHz, 4 cores, 32 GB
3.5 – 3.7  Intel Xeon E5520, 2.27 GHz, 8 cores, 48 GB
3.8        Intel Xeon E5-2630, 2.30 GHz, 8 cores, 64 GB

Table 9: Hardware of the machines serving the public SPARQL endpoint.

6.2. Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint19 is hosted using the Virtuoso Universal Server (Enterprise Edition), version 6.4, in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports infinite horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round-robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection of any size. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful hardware, as shown in Table 9.

The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.
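Clients that need complete result sets beyond the 50000-row cap can therefore page through the results; a sketch of such a paged query (the page size of 10000 is an arbitrary choice below the server limit) looks as follows:

PREFIX dbo: <http://dbpedia.org/ontology/>

# Second page of 10000 results; increase OFFSET in steps of the page size
SELECT ?person WHERE {
  ?person a dbo:Person .
}
ORDER BY ?person
LIMIT 10000
OFFSET 10000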

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate-limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally using too many resources.

18 The IP address was only filtered for that specific file and month in those cases.
19 http://dbpedia.org/sparql

[Figure 10: bar chart per data set; legend: Download Count (in Thousands), Download Volume (in TB)]
Fig. 10. The download count and download volume of the DBpedia data sets.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate-limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10 MB) are enforced. If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code20, such as 509, and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.
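From the client side, a robust consumer of the endpoint should therefore expect such status codes and back off before retrying; a minimal sketch (the retry parameters are arbitrary, and the handled status codes simply mirror the behaviour described above) could look like this:

import time
import urllib.error
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"

def query_with_backoff(sparql, retries=5, delay=2.0):
    """Send a SPARQL query, backing off when the endpoint rate-limits us."""
    url = ENDPOINT + "?" + urllib.parse.urlencode(
        {"query": sparql, "format": "application/sparql-results+json"})
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url) as response:
                return response.read()
        except urllib.error.HTTPError as e:
            if e.code in (503, 509):          # rate limited: wait and retry
                time.sleep(delay * (attempt + 1))
            else:
                raise
    raise RuntimeError("rate limit not lifted after %d retries" % retries)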

6.3. Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.2121. Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the last couple of DBpedia releases. The Avg/Day column represents the average number of hits (resp. visits) per day, followed by the Median and Standard Deviation. The last column shows the maximum number of hits (visits) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30 minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30 minute interval.

20 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
21 http://www.webalizer.org

DBpedia   Avg/Day    Median     Stdev      Maximum
3.3       733811     711306     188991     1319539
3.4       1212549    1165893    351226     2371657
3.5       1122612    1035444    386454     2908720
3.6       1328355    1286750    256945     2495031
3.7       2085399    1930728    1057398    8873473
3.8       2910410    2717775    1085640    7678490

Table 10: Number of SPARQL endpoint hits.

Fig. 11. Average number of SPARQL endpoint requests per day.


DBpedia   Avg/Day   Median    Stdev    Maximum
3.3       9750      9775      2036     13126
3.4       11329     11473     1546     14198
3.5       16592     16902     2969     23129
3.6       19471     17664     5691     56064
3.7       23972     22262     10741    127869
3.8       16851     16711     2960     27416

Table 11: Number of SPARQL endpoint visits.

Fig. 12. Average SPARQL endpoint visits per day.

Figure 11 and Figure 12 show the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language-specific DBpedia endpoints and DBpedia Live.

6.4. Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data Hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days' worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

Endpoint    3.3    3.4    3.5    3.6    3.7    3.8
data        1355   2681   2230   2547   3714   4246
ontology    80     168    142    178    116    97
page        2634   4703   1810   1687   3787   7377
property    231    311    137    176    176    128
resource    2976   4080   2554   2309   4436   7143
sparql      2090   4177   2266   9112   15902  15475
other       252    420    342    277    579    695
total       9619   16541  9434   16286  28710  35142

Table 12: Hits per service to http://dbpedia.org, in thousands.

Fig. 13. Traffic: Linked Data versus SPARQL endpoint.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x, redirecting the client to either the page (HTML-based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data from the links at the bottom of the page-based article. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology.
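This content negotiation can be observed directly; the following sketch (using Python's standard library, with a resource chosen only for illustration) requests a resource URI once with an HTML Accept header and once with an RDF one, and prints where the server redirects to:

import urllib.request

RESOURCE = "http://dbpedia.org/resource/Berlin"   # illustrative resource

def final_url(accept):
    """Follow the 30x redirect issued by the resource endpoint."""
    request = urllib.request.Request(RESOURCE, headers={"Accept": accept})
    with urllib.request.urlopen(request) as response:
        return response.geturl()

print(final_url("text/html"))              # expected: a .../page/... URL
print(final_url("application/rdf+xml"))    # expected: a .../data/... URL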

Figure 13 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13 we focussed on the calls to the sparql endpoint and counted the number of statements per type. As the log files only record the full SPARQL query on a GET request, all the PUT requests are counted as unknown.


Statement   3.3    3.4    3.5    3.6    3.7    3.8
ask         56     269    360    326    159    677
construct   55     31     14     11     22     38
describe    11     8      4      7      62     111
select      1891   3663   1658   8030   11204  13516
unknown     78     206    229    738    4455   1134
total       2090   4177   2266   9112   15902  15475

Table 13: Hits per statement type, in thousands.

Statement   3.3    3.4    3.5    3.6    3.7    3.8
distinct    19.5   11.4   17.3   19.4   13.3   25.4
filter      45.7   13.7   31.8   25.3   29.2   36.1
functions   8.8    6.9    23.5   21.3   25.5   25.9
geo         27.7   7.0    39.6   6.7    9.3    9.2
group       0.0    0.0    0.0    0.0    0.0    0.2
limit       4.6    6.5    11.6   10.5   7.7    7.8
optional    30.6   23.8   47.3   26.3   16.7   17.2
order       2.3    1.3    1.9    3.2    1.2    1.6
union       3.3    3.2    26.4   11.3   12.1   20.6

Table 14: Trends in SPARQL select queries (rounded values in %).

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– use of GEO objects
– GROUP BY
– LIMIT, OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects we counted the use of SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
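A simple way to reproduce this kind of keyword analysis on a set of logged queries is sketched below (the keyword list mirrors part of the one above; the log format, one query per line, is an assumption):

import re
from collections import Counter

KEYWORDS = ["DISTINCT", "FILTER", "GROUP BY", "LIMIT",
            "OFFSET", "OPTIONAL", "ORDER BY", "UNION"]

def keyword_shares(queries):
    """Return the share of SELECT queries using each keyword, in percent."""
    selects = [q for q in queries if re.search(r"\bSELECT\b", q, re.I)]
    if not selects:
        return {}
    counts = Counter()
    for q in selects:
        for kw in KEYWORDS:
            if re.search(r"\b" + kw.replace(" ", r"\s+") + r"\b", q, re.I):
                counts[kw] += 1
    return {kw: 100.0 * counts[kw] / len(selects) for kw in KEYWORDS}

# Example usage with a hypothetical log file containing one query per line:
# with open("sparql_queries.log") as f:
#     print(keyword_shares(f.readlines()))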

6.5. Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 14 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 14. Number of daily requests sent to DBpedia Live for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013.

7. Use Cases and Applications

With DBpedia evolving into a hub on the Web of Data due to its coverage of various domains, it is an eminent data set for various tasks. It can not only improve Wikipedia search, but also serve as a data source for applications and mashups, as well as for text analysis and annotation tools.

7.1. Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [30]. For that purpose, DBpedia includes a number of specialized data sets22. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [12]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

22 http://wiki.dbpedia.org/Datasets/NLP

7.1.1. Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as the key phrase extraction and entity linking tasks – refers to the task of finding real-world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [21].

As DBpedia reflects a vast amount of structured real-world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [31] is an open source tool23, including a free web service, that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as a context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
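As an illustration of how such an annotation service can be used, the following sketch calls the Spotlight web service's annotate endpoint over HTTP (the exact host, path, parameter names and response fields should be checked against the current Spotlight documentation; they are given here as assumptions):

import json
import urllib.parse
import urllib.request

# Assumed service location and parameters; consult the Spotlight
# documentation for the endpoint matching your deployment.
SPOTLIGHT_URL = "http://spotlight.dbpedia.org/rest/annotate"

def annotate(text, confidence=0.4):
    """Return DBpedia resource URIs detected in `text` by DBpedia Spotlight."""
    params = urllib.parse.urlencode({"text": text, "confidence": confidence})
    request = urllib.request.Request(SPOTLIGHT_URL + "?" + params,
                                     headers={"Accept": "application/json"})
    with urllib.request.urlopen(request) as response:
        result = json.load(response)
    return [r["@URI"] for r in result.get("Resources", [])]

# print(annotate("Berlin is the capital of Germany."))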

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI24, the Semantic API from Ontos25, Open Calais26 and Zemanta27, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project28.

23 http://spotlight.dbpedia.org
24 http://www.alchemyapi.com
25 http://www.ontos.com
26 http://www.opencalais.com
27 http://www.zemanta.com

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki29 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [13]. The BBC30 employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2. Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system. The Watson system won a $1 million prize in Jeopardy and relies on several data sets including DBpedia31. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series32. Several QA systems, such as TBSL [41], PowerAqua [28], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topical diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

28 http://stanbol.apache.org
29 http://www.faviki.com
30 http://bbc.co.uk
31 http://www.aaai.org/Magazine/Watson/watson.php
32 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property from mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [25].

7.2. Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.
– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.
– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project33. Recently, it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia34. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

7.3. Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of tools and structure them in categories.

33 http://viaf.org
34 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html

Facet Based Browsers. An award-winning35 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full text from Wikipedia in order to compute hierarchical facets [14]. Another facet-based browser, which allows to create complex graph structures of facets in a visually appealing interface and to filter them, is gFacet [15]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser36 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying. The DBpedia Query Builder37 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [16] tool provides an intuitive interface which allows to explore the neighborhood of and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [17] allows to create statistical analysis queries and correlations in RDF data and DBpedia in particular.

Spatial Applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

35 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany
36 http://dbpedia.org/fct
37 http://querybuilder.dbpedia.org

7.4. Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources available, written in 171 languages (of which approximately 147 can be considered active38), containing information about hundreds of spoken and even ancient languages. For example, the English Wiktionary contains nearly 3 million words39. A Wiktionary page provides, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies and, most prominently, senses. Within this tree, numerous kinds of linguistic properties are given, including synonyms, hyponyms, hyperonyms, example sentences, links to Wikipedia and many more. The fast-changing nature together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules poses the biggest problem for the automated transformation into a structured knowledge base. We identified this as a serious problem: although the value of Wiktionary is known and usage scenarios are obvious, only some rudimentary tools exist, which either focus on a specific subset of the data or only cover one or two WLEs [18]. Existing tools can be seen as adapters to single WLEs – they are hard to maintain, and there are too many languages that constantly change and require a programmer to refactor complex code. Opposed to the currently employed classic and straightforward approach (implementing software adapters for scraping), Wiktionary2RDF [18] employs a declarative mediator/wrapper pattern. Figure 15 shows the work flow of the specially created Wiktionary extractor, which takes as input one config XML file for each WLE. The aim is to enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves. A simple XML dialect was created to encode the "entry layout explained" (ELE) guidelines and declare triple patterns that define how the resulting RDF should be built. An example is given for the following wiki syntax:

===Synonyms===
* [[building]]
* [[company]]

The regular expression for scraping and the triple generation rule encoded in XML look like:

<wikiTemplate>===Synonyms===
( * [[$target]]
)+
</wikiTemplate>
...
<triple s="http://some.ns/$entityId"
        p="http://some.ns/hasSynonym"
        o="http://some.ns/$target" />

38 http://s23.org/wikistats/wiktionaries_html.php
39 See http://en.wiktionary.org/wiki/semantic for a simple example page.

Fig. 15. Detail view of the Wiktionary extractor.

This configuration is interpreted and run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data40. Statistics are shown in Table 15.

8. Related Work

8.1. Cross-Domain Community Knowledge Bases

8.1.1. Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata41. Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering the truth about things, but at providing statements about them. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

40 http://wiktionary.dbpedia.org
41 http://wikidata.org


Language   Words      Triples     Resources   Predicates   Senses    XML lines
en         2142237    28593364    11804039    28           424386    930
fr         4657817    35032121    20462349    22           592351    490
ru         1080156    12813437    5994560     17           149859    1449
de         701739     5618508     2966867     16           122362    671

Table 15: Statistical comparison of extractions for different languages. "XML lines" measures the number of lines of the XML configuration files.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.
– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface42 as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2. Freebase

Freebase43 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [32] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked and the system can easily revert to a previous version. This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular, in focus areas of Google and areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

42 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
43 http://www.freebase.com

8.1.3. YAGO

One of the projects which pursues similar goals to DBpedia is YAGO44 [39]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet. WordNet synsets are represented as classes and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [20], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as from other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content, while YAGO focuses on extracting a smaller number of relations compared to DBpedia to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

44 http://www.mpi-inf.mpg.de/yago-naga/yago

8.2. Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects45 which are able to process wiki syntax, as well as a list of data extraction extensions46 for MediaWiki.

JWPL (Java Wikipedia Library, [46]) is an open-source Java-based API that allows to access information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as an XML dump and is incorporated in the lexical resource UBY47 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [34]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguation or for synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service48. Additional projects presented by the authors that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site49.

An earlier approach to improve the quality of the infobox schemata and contents is described in [44].

45 http://www.mediawiki.org/wiki/Alternative_parsers
46 http://www.mediawiki.org/wiki/Extension_Matrix#data_extraction
47 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
48 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
49 http://sigwp.org/en

The presented methodology encompasses a three stepprocess of preprocessing classification and extractionDuring preprocessing refined target infobox schemataare created applying statistical methods and trainingsets are extracted based on real Wikipedia data Afterassigning a class and the corresponding target schema(classification) the training sets are used to extract tar-get infobox values from the documentrsquos text applyingmachine learning algorithms

The idea of using structured data from certainmarkup structures was also applied to other user-drivenWeb encyclopedias In [35] the authors describe theireffort building an integrated Chinese Linking OpenData (CLOD) source based on the Chinese Wikipediaand the two widely used and large encyclopedias BaiduBaike50 and Hudong Baike51 Apart from utilizing Me-diaWiki and HTML Markup for the actual extractionthe Wikipedia interlanguage links were used to link theCLOD source to the English DBpedia

A more generic approach to achieve a better cross-lingual knowledge-linkage beyond the use of Wikipediainterlanguage links is presented in [42] Focusing onwiki knowledge bases the authors introduce their so-lution based on structural properties like similar link-age structures the assignment to similar categoriesand similar interests of the authors of wiki documentsin the considered languages Since this approach islanguage-feature-agnostic it is not restricted to certainlanguages

KnowItAll52 is a web scale knowledge extraction ef-fort which is domain-independent and uses genericextraction rules co-occurrence statistics and NaiveBayes classification [11] Cyc [26] is a large com-mon sense knowledge base which is now partially re-leased as OpenCyc and also available as an OWL on-tology OpenCyc is linked to DBpedia which providesan ontological embedding in its comprehensive struc-tures WikiTaxonomy [37] is a large taxonomy de-rived from categories in Wikipedia by classifying cate-gories as instances or classes and deriving a subsump-tion hierarchy The KOG system [43] refines existingWikipedia infoboxes based on machine learning tech-niques using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Net-works KYLIN [44] is a system which autonomouslyextracts structured data from Wikipedia and uses self-

50 http://baike.baidu.com/
51 http://www.hudong.com/
52 http://www.cs.washington.edu/research/knowitall/


[2] was an infobox extraction approach for Wikipedia, which later became the DBpedia project.

9. Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that DBpedia has matured and improved significantly in the last years, in particular also in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof of concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [18] meanwhile becoming part of DBpedia, and LinkedGeoData [38], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach a significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors as well as wrong or outdated information in Wikipedia.

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [45], we took a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki, which was backported and implemented as a very lightweight extension for MediaWiki53. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other nicely. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].
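
A minimal sketch of such a sanity check, using the DBpedia ontology properties dbo:birthDate and dbo:deathDate (the concrete checks used in [22] may differ), could look as follows:

PREFIX dbo: <http://dbpedia.org/ontology/>

# Persons whose recorded death date precedes their birth date --
# candidates for feedback to Wikipedia editors.
SELECT ?person ?birth ?death
WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER (?death < ?birth)
}
LIMIT 100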

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP. Despite recent advances (cf. Section 7), there is still huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and to update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Paul Kreis for DBpedia development
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.
[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.
[3] C. Becker and C. Bizer. Exploring the geospatial semantic web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.
[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.
[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.
[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. LodLive, exploring the web of data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 – 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.
[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.
[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.
[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.
[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.
[12] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.
[13] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.
[14] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.
[15] P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.
[16] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.
[17] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.
[18] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In Proc. of the Joint International Semantic Technology Conference, 2012.
[19] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.
[20] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.
[21] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.
[22] D. Kontokostas, S. Auer, S. Hellmann, J. Lehmann, P. Westphal, R. Cornelissen, and A. Zaveri. Test-driven data quality evaluation for SPARQL endpoints. In Submitted to 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.
[23] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of linked data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.
[24] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The open archives initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.
[25] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.
[26] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.
[27] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational linguistics, pages 495–501, 2000.
[28] V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the semantic web. Semantic Web, 3(3):249–265, 2012.
[29] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.
[30] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP – a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.
[31] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.
[32] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.
[33] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.
[34] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.
[35] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web – Volume Part II, pages 205–220, Berlin, Heidelberg, 2011. Springer-Verlag.
[36] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, 2008.
[37] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.
[38] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.
[39] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.
[40] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.
[41] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648. ACM, 2012.
[42] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.
[43] F. Wu and D. Weld. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th World Wide Web Conference, 2008.
[44] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.
[45] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.
[46] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

Appendix

A Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net/>
PREFIX an: <http://vocab.sindice.net/analytics#>
PREFIX dom: <http://sindice.com/dataspace/default/domain/>
SELECT ?Dataset ?Target_Dataset ?predicate
       (SUM(xsd:long(?cardinality)) AS ?total)
FROM <http://sindice.com/analytics> {
  ?target any23:domain_uri ?Target_Dataset .

  ?edge an:source ?source ;
        an:target ?target ;
        an:label ?predicate ;
        an:cardinality ?cardinality ;
        an:publishedIn ?Dataset .

  ?source any23:domain_uri ?Dataset .

  FILTER (?Dataset != ?Target_Dataset)
}

Listing 1: SPARQL query on the Sindice summary graph for obtaining the number of incoming links.


4.1 DBpedia Live System Architecture

In order for live synchronisation to be possible, we need access to the changes made in Wikipedia. The Wikimedia Foundation kindly provided us access to their update stream using the OAI-PMH protocol [24]. This protocol allows a programme to pull page updates in XML via HTTP. A Java component, serving as a proxy, constantly retrieves new updates and feeds them to the DBpedia Live framework. This proxy is necessary to decouple the stream from the framework and to simplify maintenance of the software. The live extraction workflow uses this update stream to extract new knowledge upon relevant changes in Wikipedia articles.

The overall architecture of DBpedia Live is indicated in Figure 8. The major components of the system are as follows:

– Local Wikipedia Mirror: A local copy of a Wikipedia language edition is installed, which is kept in real-time synchronisation with its live version using the OAI-PMH protocol. Keeping a local Wikipedia mirror allows us to exceed any access limits posed by Wikipedia.

– Mappings Wiki: The DBpedia Mappings Wiki described in Section 2.4 serves as secondary input source. Changes of the Mappings Wiki are also consumed via an OAI-PMH stream. Note that a single mapping change can affect a high number of DBpedia resources.

– DBpedia Live Extraction Manager: This is the core component of the DBpedia Live extraction architecture. The manager takes feeds of pages for re-processing as input and applies all the enabled extractors. After processing a page, the extracted triples are a) inserted into a backend triple store (in our case Virtuoso [10]), updating the old triples, and b) saved as changesets into a compressed N-Triples file structure.

– Synchronisation Tool: This tool allows third parties to keep DBpedia Live mirrors up-to-date by harvesting the produced changesets.

4.2 Features of DBpedia Live

The core components of the DBpedia Live extraction framework provide the following features:

– Mapping-Affected Pages: The update of all pages that are affected by a mapping change.

Fig. 8. Overview of the DBpedia Live extraction framework.

– Unmodified Pages: The update of unmodified pages at regular intervals.
– Changesets Publication: The publication of triple-changesets.
– Synchronisation Tool: A synchronisation tool for harvesting updates to DBpedia Live mirrors.
– Data Isolation: Separation of data from different sources.

Mapping-Affected Pages. Whenever an infobox mapping change occurs, all the Wikipedia pages that use that infobox are reprocessed. Taking Figure 2 as an example: if a new property mapping is introduced (i.e. dbo:translator) or an existing one (i.e. dbo:illustrator) is updated or deleted, then all entities belonging to the class dbo:Book are reprocessed. Thus, upon a mapping change, we identify all the affected Wikipedia pages and feed them for reprocessing.
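
As an illustration (not part of the extraction framework itself), the set of entities that have to be re-queued after such a change to the dbo:Book mapping could be obtained from the triple store with a query along the following lines:

PREFIX dbo: <http://dbpedia.org/ontology/>

# All resources currently typed as dbo:Book -- the candidates for reprocessing.
SELECT DISTINCT ?entity
WHERE {
  ?entity a dbo:Book .
}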

Unmodified Pages. Extraction framework improvements or the activation/deactivation of DBpedia extractors might never be applied to rarely modified pages. To overcome this problem, we obtain a list of the pages which have not been processed over a period of time (30 days in our case) and feed that list to the DBpedia Live extraction framework for reprocessing. This feed has a lower priority than the update or the mapping-affected pages feeds and ensures that all articles reflect a recent state of the output of the extraction framework.

Publication of Changesets. Whenever a Wikipedia article is processed, we get two disjoint sets of triples: a set for the added triples and another set for the deleted triples. We write those two sets into N-Triples files, compress them, and publish the compressed files as changesets. If another DBpedia Live mirror wants to synchronise with the DBpedia Live endpoint, it can just download those files, decompress and integrate them.

Synchronisation Tool. The synchronisation tool enables a DBpedia Live mirror to stay in synchronisation with our live endpoint. It downloads the changeset files sequentially, decompresses them and updates the target SPARQL endpoint via insert and delete operations.
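
Conceptually, applying a changeset to a mirror amounts to a pair of SPARQL Update operations of the following form. This is a simplified sketch: the example resource and population values are hypothetical, the actual triples come from the downloaded added/removed N-Triples files, and the graph URI is the live graph described below.

# Remove the triples listed in the "removed" file of a changeset ...
DELETE DATA {
  GRAPH <http://live.dbpedia.org> {
    <http://dbpedia.org/resource/Berlin>
      <http://dbpedia.org/ontology/populationTotal> "3499879"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
  }
};
# ... and insert the triples listed in the "added" file.
INSERT DATA {
  GRAPH <http://live.dbpedia.org> {
    <http://dbpedia.org/resource/Berlin>
      <http://dbpedia.org/ontology/populationTotal> "3501872"^^<http://www.w3.org/2001/XMLSchema#nonNegativeInteger> .
  }
}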

Data Isolation. In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds is contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) is kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and the mapping-based infobox extraction.
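
A query that should only match the live article data can name that graph explicitly; a minimal sketch:

PREFIX dbo: <http://dbpedia.org/ontology/>

# Restrict matching to the triples produced by the live article update feed.
SELECT ?abstract
WHERE {
  GRAPH <http://live.dbpedia.org> {
    <http://dbpedia.org/resource/Berlin> dbo:abstract ?abstract .
  }
  FILTER (lang(?abstract) = "en")
}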

5. Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as of the external data sets that set links pointing at DBpedia resources.

5.1 Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third-party data sets. The DBpedia project maintains a link repository15 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third columns contain the predicate that is used for linking as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to a custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.

Links in DBpedia have been used for various purposes.

15 https://github.com/dbpedia/dbpedia-links

Data set                  Predicate                 Count        Tool
Amsterdam Museum          owl:sameAs                627          S
BBC Wildlife Finder       owl:sameAs                444          S
Book Mashup               rdf:type, owl:sameAs      9 100
Bricklink                 dc:publisher              10 100
CORDIS                    owl:sameAs                314          S
Dailymed                  owl:sameAs                894          S
DBLP Bibliography         owl:sameAs                196          S
DBTune                    owl:sameAs                838          S
Diseasome                 owl:sameAs                2 300        S
Drugbank                  owl:sameAs                4 800        S
EUNIS                     owl:sameAs                3 100        S
Eurostat (Linked Stats)   owl:sameAs                253          S
Eurostat (WBSG)           owl:sameAs                137
CIA World Factbook        owl:sameAs                545          S
flickr wrappr             dbp:hasPhotoCollection    3 800 000    C
Freebase                  owl:sameAs                3 600 000    C
GADM                      owl:sameAs                1 900
GeoNames                  owl:sameAs                86 500       S
GeoSpecies                owl:sameAs                16 000       S
GHO                       owl:sameAs                196          L
Project Gutenberg         owl:sameAs                2 500        S
Italian Public Schools    owl:sameAs                5 800        S
LinkedGeoData             owl:sameAs                103 600      S
LinkedMDB                 owl:sameAs                13 800       S
MusicBrainz               owl:sameAs                23 000
New York Times            owl:sameAs                9 700
OpenCyc                   owl:sameAs                27 100       C
OpenEI (Open Energy)      owl:sameAs                678          S
Revyu                     owl:sameAs                6
Sider                     owl:sameAs                2 000        S
TCMGeneDIT                owl:sameAs                904
UMBEL                     rdf:type                  896 400
US Census                 owl:sameAs                12 600
WikiCompany               owl:sameAs                8 300
WordNet                   dbp:wordnet_type          467 100
YAGO2                     rdf:type                  18 100 000
Sum                                                 27 211 732

Table 5: Data sets linked from DBpedia.

One example is the combination of data about European Union project funding (FTS) [29] and data about countries in DBpedia. The following query compares funding per year (from FTS) and per country with the gross domestic product of a country (from DBpedia):

SELECT * {
  { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
  } }
  { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
  } }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

In addition to providing outgoing links on the instance level, DBpedia also sets links on the schema level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively. In particular, in 2011 Google, Microsoft and Yahoo! announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
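
These schema-level links can be inspected directly via SPARQL; a minimal sketch:

PREFIX owl: <http://www.w3.org/2002/07/owl#>

# DBpedia ontology classes that are declared equivalent to Schema.org classes.
SELECT ?dboClass ?schemaOrgClass
WHERE {
  ?dboClass owl:equivalentClass ?schemaOrgClass .
  FILTER (STRSTARTS(STR(?dboClass), "http://dbpedia.org/ontology/"))
  FILTER (STRSTARTS(STR(?schemaOrgClass), "http://schema.org/"))
}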

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of links pointing at DBpedia from other data sets is 39,007,478 according to the Data Hub16. However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [36]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for the analysis only considers authoritative entities: the data set of the subject of a triple must match the domain it was retrieved from, otherwise it is not considered.

16 See http://wiki.dbpedia.org/Interlinking for details.

Data set         Link Predicate Count   Link Count
okaboo.com       4                      2 407 121
tfri.gov.tw      57                     837 459
naplesplus.us    2                      212 460
fu-berlin.de     7                      145 322
freebase.com     108                    138 619
geonames.org     3                      125 734
opencyc.org      3                      19 648
geospecies.org   10                     16 188
dbrec.net        3                      12 856
faviki.com       5                      12 768

Table 6: Top 10 data sets in Sindice ordered by the number of links to DBpedia.

Metric                      Value
Total links                 3 960 212
Total distinct data sets    248
Total distinct predicates   345

Table 7: Sindice summary statistics for incoming links to DBpedia.

Sindice computes a graph summary over all resources it stores. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing at DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia, along with the used link predicate and the number of links.

It should be noted that the data in Sindice is not complete, for instance it does not contain all data sets that are catalogued by the DataHub17. However, it crawls for RDFa snippets, converts microformats, etc., which are not captured by the DataHub. Despite this inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster (see the appendix for details). Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data, and DBpedia is currently ranked second in terms of incoming links.

17 http://datahub.io


Domain                Datasets   Links
purl.org              498        6 717 520
dbpedia.org           248        3 960 212
creativecommons.org   2 483      3 030 910
identi.ca             1 021      2 359 276
l3s.de                34         1 261 487
rkbexplorer.com       24         1 212 416
nytimes.com           27         1 174 941
w3.org                405        658 535
geospecies.org        13         523 709
livejournal.com       14 881     366 025

Table 8: Top 10 datasets by incoming links in Sindice.

Fig. 9. The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, per quarter of 2012.

6. DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint, and third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last couple of years.

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to their download popularity. The top five languages with respect to download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language edition is indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 18 TB per quarter, or 6 TB per month, is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 is depicted in Figure 10.

DBpedia     Configuration
3.3 – 3.4   AMD Opteron 8220, 2.80 GHz, 4 cores, 32 GB
3.5 – 3.7   Intel Xeon E5520, 2.27 GHz, 8 cores, 48 GB
3.8         Intel Xeon E5-2630, 2.30 GHz, 8 cores, 64 GB

Table 9: Hardware of the machines serving the public SPARQL endpoint.

In those statistics, we filtered out all IP addresses which requested a file more than 1,000 times per month18. Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of link exists between two resources. Supposedly, they are used for network analysis or for providing relevant links in user interfaces, and they are downloaded more often because they are not provided via the official SPARQL endpoint.

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint19 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports infinite horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines or by adding several separate clusters with a round-robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection of any size. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful hardware, as shown in Table 9.

The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.
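
Result sets larger than this limit can therefore be retrieved page by page; a minimal sketch of such a paged query (the class used here is just an example):

PREFIX dbo: <http://dbpedia.org/ontology/>

# Second page of 1,000 results: skip the first 1,000 matches.
SELECT ?city
WHERE {
  ?city a dbo:City .
}
ORDER BY ?city
LIMIT 1000
OFFSET 1000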

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate-limited after a burst period,

18 The IP address was only filtered for that specific file and month in those cases.

19 http://dbpedia.org/sparql


Fig. 10. The download count (in thousands) and download volume (in TB) of the DBpedia data sets.

2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or that generally use too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate-limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10 MB) are enforced. If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code20, like 509, and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.2121. Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the last couple of DBpedia releases. The Avg/Day column represents the average number of hits (resp. visits) per day, followed by the Median and Standard Deviation. The last column shows the maximum number of hits (visits) that was recorded on a single day for each data set version.

20 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
21 http://www.webalizer.org/

DBpedia   Avg/Day     Median      Stdev       Maximum
3.3       733 811     711 306     188 991     1 319 539
3.4       1 212 549   1 165 893   351 226     2 371 657
3.5       1 122 612   1 035 444   386 454     2 908 720
3.6       1 328 355   1 286 750   256 945     2 495 031
3.7       2 085 399   1 930 728   1 057 398   8 873 473
3.8       2 910 410   2 717 775   1 085 640   7 678 490

Table 10: Number of SPARQL endpoint hits.

Fig. 11. Average number of SPARQL endpoint requests per day.

Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30-minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30-minute interval.


DBpedia   Avg/Day   Median    Stdev    Maximum
3.3       9 750     9 775     2 036    13 126
3.4       11 329    11 473    1 546    14 198
3.5       16 592    16 902    2 969    23 129
3.6       19 471    17 664    5 691    56 064
3.7       23 972    22 262    10 741   127 869
3.8       16 851    16 711    2 960    27 416

Table 11: Number of SPARQL endpoint visits.

Fig. 12. Average number of SPARQL endpoint visits per day.

Figure 11 and Figure 12 show the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language-specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data hub, returning resources in a number of different formats. For each data set, we randomly selected 14 days worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

Endpoint    3.3     3.4      3.5     3.6      3.7      3.8
data        1 355   2 681    2 230   2 547    3 714    4 246
ontology    80      168      142     178      116      97
page        2 634   4 703    1 810   1 687    3 787    7 377
property    231     311      137     176      176      128
resource    2 976   4 080    2 554   2 309    4 436    7 143
sparql      2 090   4 177    2 266   9 112    15 902   15 475
other       252     420      342     277      579      695
total       9 619   16 541   9 434   16 286   28 710   35 142

Table 12: Hits per service to http://dbpedia.org, in thousands.

Fig. 13. Traffic share: Linked Data versus SPARQL endpoint.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x, redirecting the client to either the page (HTML-based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data from the links at the bottom of the page-based article. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology.

Figure 13 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focussed on the calls to the sparql endpoint and counted the number of statements per type. As the log files only record the full SPARQL query on a GET request, all the PUT requests are counted as unknown.


Statement   3.3     3.4     3.5     3.6     3.7      3.8
ask         56      269     360     326     159      677
construct   55      31      14      11      22       38
describe    11      8       4       7       62       111
select      1 891   3 663   1 658   8 030   11 204   13 516
unknown     78      206     229     738     4 455    1 134
total       2 090   4 177   2 266   9 112   15 902   15 475

Table 13: Hits per statement type, in thousands.

Statement   3.3    3.4    3.5    3.6    3.7    3.8
distinct    19.5   11.4   17.3   19.4   13.3   25.4
filter      45.7   13.7   31.8   25.3   29.2   36.1
functions   8.8    6.9    23.5   21.3   25.5   25.9
geo         27.7   7.0    39.6   6.7    9.3    9.2
group       0.0    0.0    0.0    0.0    0.0    0.2
limit       4.6    6.5    11.6   10.5   7.7    7.8
optional    30.6   23.8   47.3   26.3   16.7   17.2
order       2.3    1.3    1.9    3.2    1.2    1.6
union       3.3    3.2    26.4   11.3   12.1   20.6

Table 14: Trends in SPARQL select queries (rounded values in %).

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– Use of GEO objects
– GROUP BY
– LIMIT, OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of the various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
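
To make the counted constructs concrete, the following illustrative query (a sketch, not taken from the logs) combines several of them, namely DISTINCT, OPTIONAL, FILTER, ORDER BY and LIMIT:

PREFIX dbo:  <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Cities in Germany with more than one million inhabitants, with English labels if present.
SELECT DISTINCT ?city ?label ?population
WHERE {
  ?city a dbo:City ;
        dbo:country <http://dbpedia.org/resource/Germany> ;
        dbo:populationTotal ?population .
  OPTIONAL { ?city rdfs:label ?label . FILTER (lang(?label) = "en") }
  FILTER (?population > 1000000)
}
ORDER BY DESC(?population)
LIMIT 10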

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 14 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 14. Number of daily requests sent to DBpedia Live for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013.

7. Use Cases and Applications

With DBpedia evolving into a hub on the Web of Data, due to its coverage of various domains, it is an eminent data set for various tasks. It can not only improve Wikipedia search, but also serve as a data source for applications and mashups, as well as for text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [30]. For that purpose, DBpedia includes a number of specialized data sets22. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [12].

22 http://wiki.dbpedia.org/Datasets/NLP


The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

7.1.1 Annotation – Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking tasks – refers to the task of finding real-world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [21].

As DBpedia reflects a vast amount of structured real-world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [31] is an open source tool23 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as a context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
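
For instance, such a restriction could limit the annotation targets to politicians born in Germany; the selection itself is plain SPARQL (a sketch; how Spotlight consumes such a query is configured in the tool and not detailed here):

PREFIX dbo: <http://dbpedia.org/ontology/>

# Candidate annotation targets: politicians born in a German place.
SELECT DISTINCT ?resource
WHERE {
  ?resource a dbo:Politician ;
            dbo:birthPlace ?place .
  ?place dbo:country <http://dbpedia.org/resource/Germany> .
}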

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI24, the Semantic API from Ontos25, Open Calais26 and Zemanta27, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project28.

23 http://spotlight.dbpedia.org/
24 http://www.alchemyapi.com/
25 http://www.ontos.com/
26 http://www.opencalais.com/
27 http://www.zemanta.com/


Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki29 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [13]. The BBC30 employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia31. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series32. Several QA systems, such as TBSL [41], PowerAqua [28], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topical diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.
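
As an illustration of what such systems have to produce, a QALD-style question like "Which books were written by Dan Brown?" roughly corresponds to the following query over the DBpedia ontology (a sketch; the actual benchmark queries may differ):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# "Which books were written by Dan Brown?"
SELECT DISTINCT ?book
WHERE {
  ?book a dbo:Book ;
        dbo:author dbr:Dan_Brown .
}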

28 http://stanbol.apache.org/
29 http://www.faviki.com/
30 http://bbc.co.uk
31 http://www.aaai.org/Magazine/Watson/watson.php
32 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/


Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [25].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records: Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.
– Stable and curated identifiers for linking: DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.
– A basis for a thesaurus for subject indexing: The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project33. Recently, it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia34. These links are already harvested by DBpedia Live, and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of these tools and structure them into categories.

33 http://viaf.org/
34 Accessed on 12/02/2013, http://www.oclc.org/research/news/2012/12-07a.html

Facet-Based Browsers. An award-winning35 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full text from Wikipedia in order to compute hierarchical facets [14]. Another facet-based browser, which allows to create complex graph structures of facets in a visually appealing interface and to filter them, is gFacet [15]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LodLive [7]. The OpenLink built-in facet-based browser36 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying. The DBpedia Query Builder37 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [16] tool provides an intuitive interface which allows to explore the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [17] allows to create statistical analysis queries and correlations in RDF data and DBpedia in particular.
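
A typical query assembled this way consists of only a few triple patterns, for example (an illustrative sketch):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Soccer players born in Leipzig.
SELECT ?person
WHERE {
  ?person a dbo:SoccerPlayer ;
          dbo:birthPlace dbr:Leipzig .
}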

Spatial Applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources available, written in 171 languages (of which approximately 147 can be considered active38), containing information about hundreds of spoken and even ancient languages.

35 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
36 http://dbpedia.org/fct/
37 http://querybuilder.dbpedia.org/
38 http://s23.org/wikistats/wiktionaries_html.php


For example, the English Wiktionary contains nearly 3 million words39. A Wiktionary page provides, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies and, most prominently, senses. Within this tree, numerous kinds of linguistic properties are given, including synonyms, hyponyms, hyperonyms, example sentences, links to Wikipedia and many more. The fast-changing nature together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules poses the biggest problem to the automated transformation into a structured knowledge base. We identified this as a serious problem: although the value of Wiktionary is known and usage scenarios are obvious, only some rudimentary tools exist, which either focus on a specific subset of the data or only cover one or two WLEs [18]. Existing tools can be seen as adapters to single WLEs: they are hard to maintain, and there are too many languages that constantly change and require a programmer to refactor complex code. Opposed to the currently employed classic and straightforward approach (implementing software adapters for scraping), Wiktionary2RDF [18] employs a declarative mediator/wrapper pattern. Figure 15 shows the workflow of the specially created Wiktionary extractor, which takes as input one config XML file for each WLE. The aim is to enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves. A simple XML dialect was created to encode the "entry layout explained" (ELE) guidelines and declare triple patterns that define how the resulting RDF should be built. An example is given for the following wiki syntax:

===Synonyms===
[[building]]
[[company]]

The regular expression for scraping and the triple generation rule, encoded in XML, look as follows:

<wikiTemplate>===Synonyms===
( [[$target]]
)+
</wikiTemplate>

<triple s="http://some.ns/$entityId" p="http://some.ns/hasSynonym" o="http://some.ns/$target" />

39 See http://en.wiktionary.org/wiki/semantic for a simple example page.

Fig. 15. Detail view of the Wiktionary extractor.

This configuration is interpreted and run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data40. Statistics are shown in Table 15.

8. Related Work

8.1 Cross-Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata41. Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering the truth about things, but at providing statements about them. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided into three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

40 http://wiktionary.dbpedia.org/
41 http://wikidata.org


language   words       triples      resources    predicates   senses    XML lines
en         2 142 237   28 593 364   11 804 039   28           424 386   930
fr         4 657 817   35 032 121   20 462 349   22           592 351   490
ru         1 080 156   12 813 437   5 994 560    17           149 859   1 449
de         701 739     5 618 508    2 966 867    16           122 362   671

Table 15: Statistical comparison of extractions for different languages. "XML lines" measures the number of lines of the XML configuration files.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.
– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and can include data from Wikidata.

Wikidata also offers a Linked Data interface42 as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase43 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and on serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [32] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked and the system can easily revert to a previous version.

42 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
43 http://www.freebase.com/

This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and in areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3 YAGO

One of the projects which pursues similar goals to DBpedia is YAGO44 [39]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet. WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [20], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and to provide an RDF version of its content. YAGO focuses on extracting a smaller number of relations compared to DBpedia in order to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

44httpwwwmpi-infmpgdeyago-nagayago

Lehmann et al DBpedia 25

munity itself whereas this task is facilitated by expert-designed declarative rules in YAGO2

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects45 which are able to process wiki syntax, as well as a list of data extraction extensions46 for MediaWiki.

JWPL (Java Wikipedia Library, [46]) is an open-source Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki Markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as an XML dump and is incorporated in the lexical resource UBY47 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [34]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguations or synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service48. Additional projects presented by the authors that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site49.

An earlier approach to improve the quality of the infobox schemata and contents is described in [44].

45 http://www.mediawiki.org/wiki/Alternative_parsers
46 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction
47 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby/
48 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
49 http://sigwp.org/en/

The presented methodology encompasses a three step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [35] the authors describe their effort building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike50 and Hudong Baike51. Apart from utilizing MediaWiki and HTML Markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve a better cross-lingual knowledge-linkage beyond the use of Wikipedia interlanguage links is presented in [42]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

KnowItAll52 is a web scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [26] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [37] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [43] refines existing Wikipedia infoboxes based on machine learning techniques using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [44] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking. [2] was an infobox extraction approach for Wikipedia which later became the DBpedia project.

50 http://baike.baidu.com
51 http://www.hudong.com
52 http://www.cs.washington.edu/research/knowitall

9 Conclusions and Future Work

In this system report, we presented an overview on recent advances of the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that DBpedia matured and improved significantly in the last years, in particular also in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories and DBpedia already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [18] meanwhile becoming part of DBpedia, and LinkedGeoData [38], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future we see in particular the following directions for advancing the DBpedia project:

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach a significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors as well as wrong or outdated information in Wikipedia.
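As a rough illustration (this sketch is not taken from the article), such a comparison can already be expressed as a federated query. The German chapter endpoint, the use of dbo:populationTotal in both editions and the sameAs-based linkage are assumptions of the sketch:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?city ?popEn ?popDe WHERE {
  ?city a dbo:City ;
        dbo:populationTotal ?popEn ;
        owl:sameAs ?cityDe .
  FILTER (STRSTARTS(STR(?cityDe), "http://de.dbpedia.org/resource/"))
  SERVICE <http://de.dbpedia.org/sparql> {
    ?cityDe dbo:populationTotal ?popDe .
  }
  FILTER (?popEn != ?popDe)
}
LIMIT 100

Diverging values returned by such a query may point either to an extraction error in one of the editions or to outdated information in one of the Wikipedia language versions.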

Community-driven data quality improvement. In the future we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [45], we took a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki53. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other nicely. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, which identify data quality issues and are executed in certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].
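A minimal sketch of such a sanity check (added here for illustration, not part of the original article; it assumes the dbo:birthDate and dbo:deathDate properties of the DBpedia ontology) could look as follows:

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?person ?birth ?death WHERE {
  ?person dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER (?death < ?birth)
}

Run periodically against the DBpedia Live endpoint, a non-empty result points either to an extraction problem or to an error in the underlying Wikipedia article.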

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP. Despite recent advances (cf. Section 7), there is still a huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and update (it requires recomputation). With DBpedia and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationship between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Paul Kreis for DBpedia development
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– people who helped contribute data for certain parts of the article:
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.
[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.
[3] C. Becker and C. Bizer. Exploring the geospatial semantic web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.
[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.
[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.
[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the web of data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.
[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.
[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.
[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.
[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.
[12] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.
[13] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[14] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.
[15] P. Heim, T. Ertl, and J. Ziegler. Facet graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.
[16] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.
[17] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.
[18] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In Proc. of the Joint International Semantic Technology Conference, 2012.
[19] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.
[20] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.
[21] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.
[22] D. Kontokostas, S. Auer, S. Hellmann, J. Lehmann, P. Westphal, R. Cornelissen, and A. Zaveri. Test-driven data quality evaluation for SPARQL endpoints. In Submitted to 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.
[23] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of linked data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.
[24] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The open archives initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.
[25] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.
[26] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.
[27] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational linguistics, pages 495–501, 2000.

[28] V. Lopez, M. Fernandez, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the semantic web. Semantic Web, 3(3):249–265, 2012.
[29] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.
[30] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.
[31] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.
[32] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.
[33] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.
[34] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.
[35] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin, Heidelberg, 2011. Springer-Verlag.
[36] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, Nov. 10 2008.
[37] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.
[38] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.
[39] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.
[40] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.
[41] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648. ACM, 2012.
[42] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.
[43] F. Wu and D. Weld. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[44] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.
[45] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In To appear in Proceedings of 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.
[46] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

Appendix

A. Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net/>
PREFIX an: <http://vocab.sindice.net/analytics>
PREFIX dom: <http://sindice.com/dataspace/default/domain/>

SELECT ?Dataset ?Target_Dataset ?predicate SUM(xsd:long(?cardinality)) AS ?total
FROM <http://sindice.com/analytics> {
  ?target any23:domain_uri ?Target_Dataset .
  ?edge an:source ?source ;
        an:target ?target ;
        an:label ?predicate ;
        an:cardinality ?cardinality ;
        an:publishedIn ?Dataset .
  ?source any23:domain_uri ?Dataset .
  FILTER (?Dataset != ?Target_Dataset)
}

Listing 1. SPARQL query on the Sindice summary graph for obtaining the number of incoming links.


sequentially decompresses them and updates the target SPARQL endpoint via insert and delete operations.

Data Isolation. In order to keep the data isolated, DBpedia Live keeps different sources of data in different SPARQL graphs. Data from the article update feeds are contained in the graph with the URI http://live.dbpedia.org, static data (i.e. links to the LOD cloud) are kept in http://static.dbpedia.org, and the DBpedia ontology is stored in http://dbpedia.org/ontology. All data is also accessible under the http://dbpedia.org graph for combined queries. Next versions of DBpedia Live will also separate data from the raw infobox extraction and mapping-based infobox extraction.

5 Interlinking

DBpedia is interlinked with numerous external data sets following the Linked Data principles. In this section, we give an overview of the number and types of outgoing links that point from DBpedia into other data sets, as well as the external data sets that set links pointing at DBpedia resources.

5.1 Outgoing Links

Similar to the DBpedia ontology, DBpedia also follows a community approach for adding links to other third party data sets. The DBpedia project maintains a link repository15 for which conventions for adding linksets and linkset metadata are defined. The adherence to those guidelines is supervised by a linking committee. Linksets which are added to the repository are used for the subsequent official DBpedia release as well as for DBpedia Live. Table 5 lists the linksets created by the DBpedia community as of April 2013. The first column names the data set that is the target of the links. The second and third column contain the predicate that is used for linking, as well as the overall number of links that is set between DBpedia and the external data set. The last column names the tool that was used to generate the links. The value S refers to Silk, L to LIMES, C to a custom script, and a missing entry means that the dataset is copied from the previous releases and not regenerated.
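For illustration (not part of the original article), the triples collected in such a linkset are plain owl:sameAs statements; the GeoNames identifier used below is only an example:

@prefix owl: <http://www.w3.org/2002/07/owl#> .

# One outgoing link from DBpedia to GeoNames, as it would appear in a linkset.
<http://dbpedia.org/resource/Berlin>
    owl:sameAs <http://sws.geonames.org/2950159/> .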

Data set                  Predicate                 Count        Tool
Amsterdam Museum          owl:sameAs                627          S
BBC Wildlife Finder       owl:sameAs                444          S
Book Mashup               rdf:type, owl:sameAs      9,100
Bricklink                 dc:publisher              10,100
CORDIS                    owl:sameAs                314          S
Dailymed                  owl:sameAs                894          S
DBLP Bibliography         owl:sameAs                196          S
DBTune                    owl:sameAs                838          S
Diseasome                 owl:sameAs                2,300        S
Drugbank                  owl:sameAs                4,800        S
EUNIS                     owl:sameAs                3,100        S
Eurostat (Linked Stats)   owl:sameAs                253          S
Eurostat (WBSG)           owl:sameAs                137
CIA World Factbook        owl:sameAs                545          S
flickr wrappr             dbp:hasPhotoCollection    3,800,000    C
Freebase                  owl:sameAs                3,600,000    C
GADM                      owl:sameAs                1,900
GeoNames                  owl:sameAs                86,500       S
GeoSpecies                owl:sameAs                16,000       S
GHO                       owl:sameAs                196          L
Project Gutenberg         owl:sameAs                2,500        S
Italian Public Schools    owl:sameAs                5,800        S
LinkedGeoData             owl:sameAs                103,600      S
LinkedMDB                 owl:sameAs                13,800       S
MusicBrainz               owl:sameAs                23,000
New York Times            owl:sameAs                9,700
OpenCyc                   owl:sameAs                27,100       C
OpenEI (Open Energy)      owl:sameAs                678          S
Revyu                     owl:sameAs                6
Sider                     owl:sameAs                2,000        S
TCMGeneDIT                owl:sameAs                904
UMBEL                     rdf:type                  896,400
US Census                 owl:sameAs                12,600
WikiCompany               owl:sameAs                8,300
WordNet                   dbp:wordnet_type          467,100
YAGO2                     rdf:type                  18,100,000
Sum                                                 27,211,732

Table 5. Data sets linked from DBpedia

15 https://github.com/dbpedia/dbpedia-links

Links in DBpedia have been used for various purposes. One example is the combination of data about European Union project funding (FTS) [29] and data about countries in DBpedia. The following query compares funding per year (from FTS) and country with the gross domestic product of a country (from DBpedia):

SELECT * {
  { SELECT ?ftsyear ?ftscountry (SUM(?amount) AS ?funding) {
      ?com rdf:type fts-o:Commitment .
      ?com fts-o:year ?year .
      ?year rdfs:label ?ftsyear .
      ?com fts-o:benefit ?benefit .
      ?benefit fts-o:detailAmount ?amount .
      ?benefit fts-o:beneficiary ?beneficiary .
      ?beneficiary fts-o:country ?country .
      ?country owl:sameAs ?ftscountry .
  } }
  { SELECT ?dbpcountry ?gdpyear ?gdpnominal {
      ?dbpcountry rdf:type dbo:Country .
      ?dbpcountry dbp:gdpNominal ?gdpnominal .
      ?dbpcountry dbp:gdpNominalYear ?gdpyear .
  } }
  FILTER ((?ftsyear = str(?gdpyear)) &&
          (?ftscountry = ?dbpcountry))
}

In addition to providing outgoing links on instance-level, DBpedia also sets links on schema-level, pointing from the DBpedia ontology to equivalent terms in other schemas. Links to other schemata can be set by the community within the DBpedia Mappings Wiki by using owl:equivalentClass in class templates and owl:equivalentProperty in datatype or object property templates, respectively. In particular, in 2011 Google, Microsoft and Yahoo announced their collaboration on Schema.org, a collection of vocabularies for marking up content on web pages. The DBpedia 3.8 ontology contains 45 equivalent class and 31 equivalent property links pointing to http://schema.org terms.
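A small sketch of what such schema-level links look like in RDF (added for illustration; the concrete class and property pairs below are examples, not a listing of the actual mappings):

@prefix owl:    <http://www.w3.org/2002/07/owl#> .
@prefix dbo:    <http://dbpedia.org/ontology/> .
@prefix schema: <http://schema.org/> .

# Equivalence links as they can be declared via the Mappings Wiki templates.
dbo:Person    owl:equivalentClass    schema:Person .
dbo:birthDate owl:equivalentProperty schema:birthDate .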

5.2 Incoming Links

DBpedia is being linked to from a variety of data sets. The overall number of links pointing at DBpedia from other data sets is 39,007,478 according to the Data Hub16. However, those counts are entered by users and may not always be valid and up-to-date.

In order to identify actually published and online data sets that link to DBpedia, we used Sindice [36]. The Sindice project crawls RDF resources on the web and indexes those resources. In Sindice, a data set is defined by the second-level domain name of the entity's URI, e.g. all resources available at the domain fu-berlin.de are considered to belong to the same data set. A triple is considered to be a link if the data sets of subject and object are different. Furthermore, the Sindice data we used for the analysis only considers authoritative entities: the data set of a subject of a triple must match the domain it was retrieved from, otherwise it is not considered.

16 See http://wiki.dbpedia.org/Interlinking for details.

Data set         Link Predicate Count   Link Count
okaboo.com       4                      2,407,121
tfri.gov.tw      57                     837,459
naplesplus.us    2                      212,460
fu-berlin.de     7                      145,322
freebase.com     108                    138,619
geonames.org     3                      125,734
opencyc.org      3                      19,648
geospecies.org   10                     16,188
dbrec.net        3                      12,856
faviki.com       5                      12,768

Table 6. Top 10 data sets in Sindice ordered by the number of links to DBpedia

Metric                      Value
Total links                 3,960,212
Total distinct data sets    248
Total distinct predicates   345

Table 7. Sindice summary statistics for incoming links to DBpedia

Sindice computes a graph summary over all resources they store. With the help of the Sindice team, we examined this graph summary to obtain all links pointing to DBpedia. As shown in Table 7, Sindice knows about 248 data sets linking to DBpedia. 70% of those data sets link to DBpedia via owl:sameAs, but other link predicates are also very common, as evident in this table. In total, Sindice has indexed 4 million links pointing at DBpedia. Table 6 lists the 10 data sets which set most links to DBpedia, along with the used link predicate and the number of links.

It should be noted that the data in Sindice is not complete, for instance it does not contain all data sets that are catalogued by the DataHub17. However, it crawls for RDFa snippets, converts microformats etc., which are not captured by the DataHub. Despite the inaccuracy, the relative comparison of different datasets can still give us insights. Therefore, we analysed the link structure of all Sindice datasets using the Sindice cluster (see appendix for details). Table 8 shows the datasets with most incoming links. Those are authorities in the network structure of the web of data and DBpedia is currently ranked second in terms of incoming links.

17 http://datahub.io


Domain                Datasets   Links
purl.org              498        6,717,520
dbpedia.org           248        3,960,212
creativecommons.org   2,483      3,030,910
identi.ca             1,021      2,359,276
l3s.de                34         1,261,487
rkbexplorer.com       24         1,212,416
nytimes.com           27         1,174,941
w3.org                405        658,535
geospecies.org        13         523,709
livejournal.com       14,881     366,025

Table 8. Top 10 datasets by incoming links in Sindice

Fig. 9. The download count (in thousands) and download volume (in TB) of the English language edition of DBpedia, per quarter of 2012.

6 DBpedia Usage Statistics

DBpedia is served on the web in three forms: First, it is provided in the form of downloadable data sets, where each data set contains the results of one of the extractors listed in Table 1. Second, DBpedia is served via a public SPARQL endpoint and, third, it provides dereferenceable URIs according to the Linked Data principles. In this section, we explore some of the statistics gathered during the hosting of DBpedia over the last couple of years.

6.1 Download Statistics for the DBpedia Data Sets

DBpedia covers more than 100 languages, but those languages vary with respect to their download popularity as well. The top five languages with respect to the download volume are English, Chinese, German, Catalan and French, respectively. The download count and download volume of the English language edition is indicated in Figure 9. To host the DBpedia dataset downloads, a bandwidth of approximately 18 TB per quarter or 6 TB per month is currently needed.

Furthermore, DBpedia consists of several data sets which vary with respect to their download popularity. The download count and the download volume of each data set during the year 2012 is depicted in Figure 10. In those statistics, we filtered out all IP addresses which requested a file more than 1,000 times per month.18 Pagelinks are the most downloaded dataset, although they are not semantically rich, as they do not reveal which type of links exists between two resources. Supposedly, they are used for network analysis or providing relevant links in user interfaces, and are downloaded more often as they are not provided via the official SPARQL endpoint.

DBpedia     Configuration
3.3 - 3.4   AMD Opteron 8220, 2.80GHz, 4 cores, 32GB
3.5 - 3.7   Intel Xeon E5520, 2.27GHz, 8 cores, 48GB
3.8         Intel Xeon E5-2630, 2.30GHz, 8 cores, 64GB

Table 9. Hardware of the machines serving the public SPARQL endpoint

6.2 Public Static DBpedia SPARQL Endpoint

The main public DBpedia SPARQL endpoint19 is hosted using the Virtuoso Universal Server (Enterprise Edition) version 6.4 software in a 4-node cluster configuration. This cluster setup provides parallelization of query execution, even when the cluster nodes are on the same machine, as splitting a query over several nodes allows better use of parallel threads on modern multi-core CPUs on standard commodity hardware.

Virtuoso supports infinite horizontal scale-out, either by redistributing the existing cluster nodes onto multiple machines, or by adding several separate clusters with a round robin HTTP front-end. This allows the cluster setup to grow in line with desired response times for an RDF data set collection of any size. As the size of the DBpedia data set increased and its use by the Linked Data community grew, the project migrated to increasingly powerful hardware, as shown in Table 9.

The Virtuoso instance is configured to process queries within a 1200 second timeout window and a maximum result set size of 50,000 rows. It provides OFFSET and LIMIT support for paging, alongside the ability to produce partial results.
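To illustrate (this sketch is not part of the original article), a client can stay within the result set limit by paging with LIMIT/OFFSET; the query, the page size and the format parameter accepted by the endpoint are assumptions of the example:

import json
import urllib.parse
import urllib.request

ENDPOINT = "http://dbpedia.org/sparql"
PAGE = 10000
# For deterministic paging an ORDER BY clause would normally be added.
QUERY = "SELECT ?town WHERE { ?town a <http://dbpedia.org/ontology/Town> } LIMIT %d OFFSET %d"

offset = 0
while True:
    params = urllib.parse.urlencode({
        "query": QUERY % (PAGE, offset),
        "format": "application/sparql-results+json",
    })
    with urllib.request.urlopen(ENDPOINT + "?" + params) as response:
        bindings = json.load(response)["results"]["bindings"]
    if not bindings:
        break
    for row in bindings:
        print(row["town"]["value"])
    offset += PAGE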

The log files used in the following analysis excluded traffic generated by:

1. clients that have been temporarily rate limited after a burst period,
2. clients that have been banned after misuse,
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally using too many resources.

18 The IP address was only filtered for that specific file and month in those cases.
19 http://dbpedia.org/sparql

Fig. 10. The download count (in thousands) and download volume (in TB) of the DBpedia data sets.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10MB) are enforced. If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code20 like 509 and quickly drops the connection. The system further uses an iptables-based firewall for permanent blocking of clients identified by their IP addresses.

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.21.21 Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the last couple of DBpedia releases. The Avg/Day column represents the average number of hits (resp. visits) per day, followed by the Median and Standard Deviation.

20 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
21 http://www.webalizer.org

DBpedia   Avg/Day     Median      Stdev       Maximum
3.3       733,811     711,306     188,991     1,319,539
3.4       1,212,549   1,165,893   351,226     2,371,657
3.5       1,122,612   1,035,444   386,454     2,908,720
3.6       1,328,355   1,286,750   256,945     2,495,031
3.7       2,085,399   1,930,728   1,057,398   8,873,473
3.8       2,910,410   2,717,775   1,085,640   7,678,490

Table 10. Number of SPARQL endpoint hits

Fig. 11. Average number of SPARQL endpoint requests per day.

The last column shows the maximum number of hits (visits) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30 minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30 minute interval.


DBpedia   Avg/Day   Median   Stdev    Maximum
3.3       9,750     9,775    2,036    13,126
3.4       11,329    11,473   1,546    14,198
3.5       16,592    16,902   2,969    23,129
3.6       19,471    17,664   5,691    56,064
3.7       23,972    22,262   10,741   127,869
3.8       16,851    16,711   2,960    27,416

Table 11. Number of SPARQL endpoint visits

Fig. 12. Average SPARQL endpoint visits per day.

Figure 11 and Figure 12 show the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data Hub returning resources in a number of different formats. For each data set, we randomly selected 14 days' worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

Endpoint   3.3     3.4      3.5     3.6      3.7      3.8
data       1,355   2,681    2,230   2,547    3,714    4,246
ontology   80      168      142     178      116      97
page       2,634   4,703    1,810   1,687    3,787    7,377
property   231     311      137     176      176      128
resource   2,976   4,080    2,554   2,309    4,436    7,143
sparql     2,090   4,177    2,266   9,112    15,902   15,475
other      252     420      342     277      579      695
total      9,619   16,541   9,434   16,286   28,710   35,142

Table 12. Hits per service to http://dbpedia.org in thousands

Fig. 13. Traffic: Linked Data versus SPARQL endpoint.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x that redirects the client to either the page (HTML based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data from the links at the bottom of the page-based article. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology.
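A small sketch of this content negotiation from a client's perspective (added for illustration; the final URLs in the comment are the expected behaviour, not something asserted by the article):

import urllib.request

# The same resource URI served as HTML or RDF depending on the Accept header;
# redirects are followed automatically by urllib.
uri = "http://dbpedia.org/resource/Berlin"
for accept in ("text/html", "text/turtle"):
    req = urllib.request.Request(uri, headers={"Accept": accept})
    with urllib.request.urlopen(req) as resp:
        print(accept, "->", resp.geturl())
# Expected to end up at a .../page/Berlin URL for HTML and a .../data/Berlin URL for RDF.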

Figure 13 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13, we focussed on the calls to the sparql endpoint and counted the number of statements per type. As the log files only record the full SPARQL query on a GET request, all the PUT requests are counted as unknown.


Statement   3.3     3.4     3.5     3.6     3.7      3.8
ask         56      269     360     326     159      677
construct   55      31      14      11      22       38
describe    11      8       4       7       62       111
select      1,891   3,663   1,658   8,030   11,204   13,516
unknown     78      206     229     738     4,455    1,134
total       2,090   4,177   2,266   9,112   15,902   15,475

Table 13. Hits per statement type in thousands

Statement   3.3    3.4    3.5    3.6    3.7    3.8
distinct    19.5   11.4   17.3   19.4   13.3   25.4
filter      45.7   13.7   31.8   25.3   29.2   36.1
functions   8.8    6.9    23.5   21.3   25.5   25.9
geo         27.7   7.0    39.6   6.7    9.3    9.2
group       0.0    0.0    0.0    0.0    0.0    0.2
limit       4.6    6.5    11.6   10.5   7.7    7.8
optional    30.6   23.8   47.3   26.3   16.7   17.2
order       2.3    1.3    1.9    3.2    1.2    1.6
union       3.3    3.2    26.4   11.3   12.1   20.6

Table 14. Trends in SPARQL select (rounded values in %)

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– Use of GEO objects
– GROUP BY
– LIMIT, OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects, we counted the use of SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
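A rough sketch of how keyword shares like those in Table 14 can be derived from a sample of logged SELECT queries (this is not the analysis script used by the authors; the log format and keyword list are assumptions):

import re

KEYWORDS = ["DISTINCT", "FILTER", "GROUP BY", "LIMIT", "OPTIONAL", "ORDER BY", "UNION"]

def keyword_shares(queries):
    # queries: an iterable of raw SPARQL query strings taken from the access log.
    counts = {kw: 0 for kw in KEYWORDS}
    selects = [q for q in queries if re.search(r"\bSELECT\b", q, re.I)]
    for q in selects:
        for kw in KEYWORDS:
            if re.search(r"\b" + kw.replace(" ", r"\s+") + r"\b", q, re.I):
                counts[kw] += 1
    total = len(selects) or 1
    return {kw: 100.0 * n / total for kw, n in counts.items()}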

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 14 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 14. Number of daily requests sent to DBpedia Live for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013.

7 Use Cases and Applications

With DBpedia evolving into a hub on the Web of Data due to its coverage of various domains, it is an eminent data set for various tasks. It can not only improve Wikipedia search, but also serve as a data source for applications and mashups, as well as for text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [30]. For that purpose, DBpedia includes a number of specialized data sets22. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [12]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

22 http://wiki.dbpedia.org/Datasets/NLP

7.1.1 Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking tasks – refers to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [21].

As DBpedia reflects a vast amount of structured real world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [31] is an open source tool23 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as a context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base described as SPARQL queries.
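A minimal sketch of calling the Spotlight web service (added for illustration; the endpoint path, the parameter names and the JSON field names reflect the public service as assumed here, not something specified in the article):

import json
import urllib.parse
import urllib.request

data = urllib.parse.urlencode({
    "text": "Berlin is the capital of Germany.",
    "confidence": "0.4",
}).encode()
req = urllib.request.Request(
    "http://spotlight.dbpedia.org/rest/annotate",
    data=data,
    headers={"Accept": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    # Field names such as "Resources", "@surfaceForm" and "@URI" are assumptions.
    for res in json.load(resp).get("Resources", []):
        print(res.get("@surfaceForm"), "->", res.get("@URI"))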

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI24, the Semantic API from Ontos25, Open Calais26 and Zemanta27, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project28.

23 http://spotlight.dbpedia.org
24 http://www.alchemyapi.com
25 http://www.ontos.com
26 http://www.opencalais.com
27 http://www.zemanta.com

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki29 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [13]. The BBC30 employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system. The Watson system won a $1 million prize in Jeopardy and relies on several data sets including DBpedia31. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series32. Several QA systems, such as TBSL [41], PowerAqua [28], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

28 http://stanbol.apache.org
29 http://www.faviki.com
30 http://bbc.co.uk
31 http://www.aaai.org/Magazine/Watson/watson.php
32 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald/


Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property from mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [25].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records. Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.
– Stable and curated identifiers for linking. DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.
– A basis for a thesaurus for subject indexing. The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority Files (VIAF) project33. Recently it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia34. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of tools and structure them in categories.

33 http://viaf.org
34 Accessed on 12/02/2013, http://www.oclc.org/research/news/2012/12-07a.html

Facet Based Browsers. An award-winning35 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [14]. Another facet based browser, which allows creating complex graph structures of facets in a visually appealing interface and filtering them, is gFacet [15]. A generic SPARQL based facet explorer, which also uses a graph based visualisation of facets, is LodLive [7]. The OpenLink built-in facet based browser36 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying. The DBpedia Query Builder37 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [16] tool provides an intuitive interface which allows exploring the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [17] allows creating statistical analysis queries and correlations in RDF data and DBpedia in particular.

Spatial Applications. DBpedia Mobile [3] is a location aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources available, written in 171 languages (of which approximately 147 can be considered active38), containing information about hundreds of spoken and even ancient languages. For example, the English Wiktionary contains nearly 3 million words39. A Wiktionary page provides, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies and most prominently senses. Within this tree, numerous kinds of linguistic properties are given, including synonyms, hyponyms, hyperonyms, example sentences, links to Wikipedia and many more. The fast changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, poses the biggest problem to the automated transformation into a structured knowledge base. We identified this as a serious problem: although the value of Wiktionary is known and usage scenarios are obvious, only some rudimentary tools exist, which either focus on a specific subset of the data or only cover one or two WLEs [18]. Existing tools can be seen as adapters to single WLEs – they are hard to maintain and there are too many languages that constantly change and require a programmer to refactor complex code. Opposed to the currently employed classic and straight-forward approach (implementing software adapters for scraping), Wiktionary2RDF [18] employs a declarative mediator/wrapper pattern. Figure 15 shows the work flow of the specially created Wiktionary extractor, which takes as input one config XML file for each WLE. The aim is to enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves. A simple XML dialect was created to encode the "entry layout explained" (ELE) guidelines and declare triple patterns that define how the resulting RDF should be built. An example is given for the following wiki syntax:

35 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
36 http://dbpedia.org/fct
37 http://querybuilder.dbpedia.org
38 http://s23.org/wikistats/wiktionaries_html.php

===Synonyms===
[[building]]
[[company]]

The regular expression for scraping and the triple generation rule, encoded in XML, look like:

<wikiTemplate>===Synonyms===
( [[$target]]
)+
</wikiTemplate>

<triple s="http://some.ns/$entityId"
        p="http://some.ns/hasSynonym"
        o="http://some.ns/$target" />

39 See http://en.wiktionary.org/wiki/semantic for a simple example page.

Fig. 15. Detail view of the Wiktionary extractor.

This configuration is interpreted and run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data40. Statistics are shown in Table 15.


40 http://wiktionary.dbpedia.org


Language   Words       Triples      Resources    Predicates   Senses    XML lines
en         2,142,237   28,593,364   11,804,039   28           424,386   930
fr         4,657,817   35,032,121   20,462,349   22           592,351   490
ru         1,080,156   12,813,437   5,994,560    17           149,859   1,449
de         701,739     5,618,508    2,966,867    16           122,362   671

Table 15. Statistical comparison of extractions for different languages. XML lines measures the number of lines of the XML configuration files.

ndash The second phase (infoboxes) gathered infobox-related data for a subset of the entities with theexplicit goal of augmenting the infoboxes that arecurrently widely used with data from Wikidata

ndash The third phase (lists) will expand the set of prop-erties beyond those related to infoboxes and willprovide ways of exploiting this data within andoutside the Wikimedia projects

At the time of writing of this article the developmentof the third phase is ongoing

Wikidata already contains 1195 million items and348 properties that can be used to describe themSince March 2013 the Wikidata extension is live on allWikipedia language editions and thus their pages canbe linked to items in Wikidata and include data fromWikidata

Wikidata also offers a Linked Data interface42 aswell as regular RDF dumps of all its data The plannedcollaboration with Wikidata is outlined in Section 9

812 FreebaseFreebase43 is a graph database which also extracts

structured data from Wikipedia and makes it availablein RDF Both DBpedia and Freebase link to each otherand provide identifiers based on those for Wikipediaarticles They both provide dumps of the extracteddata as well as APIs or endpoints to access the dataand allow their communities to influence the schemaof the data There are however also major differencesbetween both projects DBpedia focuses on being anRDF representation of Wikipedia and serving as a hubon the Web of Data whereas Freebase uses severalsources to provide broad coverage The store behindFreebase is the GraphD [32] graph database which al-lows to efficiently store metadata for each fact Thisgraph store is append-only Deleted triples are markedand the system can easily revert to a previous version

42httpmetawikimediaorgwikiWikidataDevelopmentLinkedDataInterface

43httpwwwfreebasecom

This is necessary since Freebase data can be directlyedited by users whereas information in DBpedia canonly indirectly be edited by modifying the content ofWikipedia or the Mappings Wiki From an organisa-tional point of view Freebase is mainly run by Googlewhereas DBpedia is an open community project Inparticular in focus areas of Google and areas in whichFreebase includes other data sources the Freebasedatabase provides a higher coverage than DBpedia

813 YAGOOne of the projects which pursues similar goals

to DBpedia is YAGO44 [39] YAGO is identical toDBpedia in that each article in Wikipedia becomes anentity in YAGO Based on this it uses the leaf cate-gories in the Wikipedia category graph to infer typeinformation about an entity One of its key featuresis to link this type information to WordNet Word-Net synsets are represented as classes and the ex-tracted types of entities may become subclasses ofsuch a synset In the YAGO2 system [20] declara-tive extraction rules were introduced which can ex-tract facts from different parts of Wikipedia articleseg infoboxes and categories as well as other sourcesYAGO2 also supports spatial and temporal dimensionsfor facts at the core of its system

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content. YAGO focuses on extracting a smaller number of relations compared to DBpedia in order to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

44 http://www.mpi-inf.mpg.de/yago-naga/yago/

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.
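To illustrate how these links can be used, the following SPARQL query is a minimal sketch (not taken from the paper): it retrieves the YAGO classes assigned to a DBpedia resource together with its owl:sameAs link into YAGO. The example resource, the class namespace http://dbpedia.org/class/yago/ and the YAGO resource namespace are assumptions based on the DBpedia 3.8 data layout and may differ between releases.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT DISTINCT ?yagoClass ?yagoEntity WHERE {
  # YAGO classes are exposed as additional rdf:type statements
  <http://dbpedia.org/resource/Berlin> rdf:type ?yagoClass .
  FILTER(STRSTARTS(STR(?yagoClass), "http://dbpedia.org/class/yago/"))
  # sameAs links point to the corresponding entity in YAGO
  OPTIONAL {
    <http://dbpedia.org/resource/Berlin> owl:sameAs ?yagoEntity .
    FILTER(STRSTARTS(STR(?yagoEntity), "http://yago-knowledge.org/resource/"))
  }
}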

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects45 which are able to process wiki syntax, as well as a list of data extraction extensions46 for MediaWiki.

JWPL (Java Wikipedia Library, [46]) is an open-source Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as an XML dump and is incorporated in the lexical resource UBY47 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [34]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g., for word-sense disambiguation, synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service48. Additional projects presented by the authors that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site49.

An earlier approach to improve the quality of the infobox schemata and contents is described in [44].

45 http://www.mediawiki.org/wiki/Alternative_parsers
46 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction
47 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby/
48 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
49 http://sigwp.org/en/

The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [35] the authors describe their effort building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike50 and Hudong Baike51. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve a better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [42]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

KnowItAll52 is a web-scale knowledge extraction effort which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [26] is a large common sense knowledge base which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [37] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [43] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [44] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking. [2] was an infobox extraction approach for Wikipedia which later became the DBpedia project.

50 http://baike.baidu.com
51 http://www.hudong.com
52 http://www.cs.washington.edu/research/knowitall

9 Conclusions and Future Work

In this system report we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included, in particular, (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that DBpedia has matured and improved significantly in the last years, in particular also in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [18] meanwhile becoming part of DBpedia, and LinkedGeoData [38], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project:

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach a significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors as well as wrong or outdated information in Wikipedia.
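As a rough illustration of such a comparison, the following SPARQL 1.1 federated query (a sketch, not part of the DBpedia framework) contrasts the population recorded for the same city in the English and German DBpedia editions. The endpoint URL http://de.dbpedia.org/sparql, the direction of the owl:sameAs links and the use of dbo:populationTotal in both editions are assumptions.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT ?city ?popEn ?popDe WHERE {
  # population according to the English edition
  ?city a dbo:City ;
        dbo:populationTotal ?popEn .
  # population according to the German edition, joined via sameAs
  SERVICE <http://de.dbpedia.org/sparql> {
    ?cityDe owl:sameAs ?city ;
            dbo:populationTotal ?popDe .
  }
  FILTER (?popEn != ?popDe)
}
LIMIT 100

Diverging values returned by such a query are candidates for manual inspection, pointing either to an extraction error or to outdated information in one of the Wikipedia editions.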

Community-driven data quality improvement. In the future we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [45], we took a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision also extracting semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki53. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other nicely. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].
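A minimal sketch of such a sanity check, assuming the DBpedia ontology properties dbo:birthDate and dbo:deathDate, is shown below; it lists persons whose recorded death date precedes their birth date and could be executed periodically against the DBpedia Live endpoint.

PREFIX dbo: <http://dbpedia.org/ontology/>

SELECT ?person ?birth ?death WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  # a death date before the birth date indicates an error in the source article
  FILTER (?death < ?birth)
}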

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP. Despite recent advances (cf. Section 7), there is still a huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.
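As a simple sketch of this idea (not the method of any particular annotation tool), the following query scores two hypothetical candidate entities for the surface form "Berlin" by counting their direct DBpedia links to entities that have already been identified unambiguously in the surrounding text; the candidate with the highest score would be preferred.

SELECT ?candidate (COUNT(?context) AS ?score) WHERE {
  # candidate entities for the ambiguous surface form
  VALUES ?candidate { <http://dbpedia.org/resource/Berlin>
                      <http://dbpedia.org/resource/Berlin,_New_Hampshire> }
  # unambiguous entities already found in the surrounding text
  VALUES ?context { <http://dbpedia.org/resource/Germany>
                    <http://dbpedia.org/resource/Brandenburg_Gate> }
  { ?candidate ?p ?context } UNION { ?context ?p ?candidate }
}
GROUP BY ?candidate
ORDER BY DESC(?score)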

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Paul Kreis for DBpedia development
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.

[3] C. Becker and C. Bizer. Exploring the geospatial semantic web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the web of data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.

[12] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[13] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[14] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[15] P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[16] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[17] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[18] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In Proc. of the Joint International Semantic Technology Conference, 2012.

[19] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[20] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.

[21] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[22] D. Kontokostas, S. Auer, S. Hellmann, J. Lehmann, P. Westphal, R. Cornelissen, and A. Zaveri. Test-driven data quality evaluation for SPARQL endpoints. In Submitted to 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.

[23] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of linked data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[24] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[25] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[26] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[27] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational Linguistics, pages 495–501, 2000.

[28] V. Lopez, M. Fernandez, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the semantic web. Semantic Web, 3(3):249–265, 2012.

[29] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[30] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[31] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[32] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[33] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[34] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[35] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin, Heidelberg, 2011. Springer-Verlag.

[36] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, Nov. 10 2008.

[37] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[38] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.

[39] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: a core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[40] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[41] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648. ACM, 2012.

[42] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[43] F. Wu and D. Weld. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[44] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[45] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

[46] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

Appendix

A Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net/>
PREFIX an: <http://vocab.sindice.net/analytics#>
PREFIX dom: <http://sindice.com/dataspace/default/domain/>

SELECT ?Dataset ?Target_Dataset ?predicate
       SUM(xsd:long(?cardinality)) AS ?total
FROM <http://sindice.com/analytics> {
  ?target any23:domain_uri ?Target_Dataset .
  ?edge an:source ?source ;
        an:target ?target ;
        an:label ?predicate ;
        an:cardinality ?cardinality ;
        an:publishedIn ?Dataset .
  ?source any23:domain_uri ?Dataset .
  FILTER (?Dataset != ?Target_Dataset)
}

Listing 1: SPARQL query on the Sindice summary graph for obtaining the number of incoming links


ndash The first phase (interwiki links) created an entitybase for the Wikimedia projects This providesa better alternative to the previous interlanguagelink system

40httpwiktionarydbpediaorg41wikidataorg

24 Lehmann et al DBpedia

language words triples resources predicates senses XML lines

en 2142237 28593364 11804039 28 424386 930fr 4657817 35032121 20462349 22 592351 490ru 1080156 12813437 5994560 17 149859 1449de 701739 5618508 2966867 16 122362 671

Table 15

Statistical comparison of extractions for different languages XMLlines measures the number of lines of the XML configuration files

ndash The second phase (infoboxes) gathered infobox-related data for a subset of the entities with theexplicit goal of augmenting the infoboxes that arecurrently widely used with data from Wikidata

ndash The third phase (lists) will expand the set of prop-erties beyond those related to infoboxes and willprovide ways of exploiting this data within andoutside the Wikimedia projects

At the time of writing of this article the developmentof the third phase is ongoing

Wikidata already contains 1195 million items and348 properties that can be used to describe themSince March 2013 the Wikidata extension is live on allWikipedia language editions and thus their pages canbe linked to items in Wikidata and include data fromWikidata

Wikidata also offers a Linked Data interface42 aswell as regular RDF dumps of all its data The plannedcollaboration with Wikidata is outlined in Section 9

812 FreebaseFreebase43 is a graph database which also extracts

structured data from Wikipedia and makes it availablein RDF Both DBpedia and Freebase link to each otherand provide identifiers based on those for Wikipediaarticles They both provide dumps of the extracteddata as well as APIs or endpoints to access the dataand allow their communities to influence the schemaof the data There are however also major differencesbetween both projects DBpedia focuses on being anRDF representation of Wikipedia and serving as a hubon the Web of Data whereas Freebase uses severalsources to provide broad coverage The store behindFreebase is the GraphD [32] graph database which al-lows to efficiently store metadata for each fact Thisgraph store is append-only Deleted triples are markedand the system can easily revert to a previous version

42httpmetawikimediaorgwikiWikidataDevelopmentLinkedDataInterface

43httpwwwfreebasecom

This is necessary since Freebase data can be directlyedited by users whereas information in DBpedia canonly indirectly be edited by modifying the content ofWikipedia or the Mappings Wiki From an organisa-tional point of view Freebase is mainly run by Googlewhereas DBpedia is an open community project Inparticular in focus areas of Google and areas in whichFreebase includes other data sources the Freebasedatabase provides a higher coverage than DBpedia

813 YAGOOne of the projects which pursues similar goals

to DBpedia is YAGO44 [39] YAGO is identical toDBpedia in that each article in Wikipedia becomes anentity in YAGO Based on this it uses the leaf cate-gories in the Wikipedia category graph to infer typeinformation about an entity One of its key featuresis to link this type information to WordNet Word-Net synsets are represented as classes and the ex-tracted types of entities may become subclasses ofsuch a synset In the YAGO2 system [20] declara-tive extraction rules were introduced which can ex-tract facts from different parts of Wikipedia articleseg infoboxes and categories as well as other sourcesYAGO2 also supports spatial and temporal dimensionsfor facts at the core of its system

One of the main differences between DBpedia andYAGO in general is that DBpedia tries to stay veryclose to Wikipedia and provide an RDF version of itscontent YAGO focuses on extracting a smaller numberof relations compared to DBpedia to achieve very highprecision and consistent knowledge The two knowl-edge bases offer different type systems Whereas theDBpedia ontology is manually maintained YAGO isbacked by WordNet and Wikipedia leaf categoriesDue to this YAGO contains much more classes thanDBpedia Another difference is that the integration ofattributes and objects in infoboxes is done via map-pings in DBpedia and therefore by the DBpedia com-

44httpwwwmpi-infmpgdeyago-nagayago

Lehmann et al DBpedia 25

munity itself whereas this task is facilitated by expert-designed declarative rules in YAGO2

The two knowledge bases are connected eg DBpediaoffers the YAGO type hierarchy as an alternative to theDBpedia ontology and sameAs links are provided inboth directions While the underlying systems are verydifferent both projects share similar aims and posi-tively complement and influence each other

82 Knowledge Extraction from Wikipedia

Since its official start in 2001 Wikipedia has alwaysbeen the target of automatic extraction of informationdue to its easy availability open license and encyclo-pedic knowledge A large number of parsers scraperprojects and publications exist In this section we re-strict ourselves to approaches that are either notablerecent or pertinent to DBpedia MediaWikiorg main-tains an up-to-date list of software projects45 who areable to process wiki syntax as well as a list of dataextraction extensions46 for MediaWiki

JWPL (Java Wikipedia Library [46]) is an open-source Java-based API that allows to access informa-tion provided by the Wikipedia API (redirects cate-gories articles and link structure) JWPL contains aMediaWiki Markup parser that can be used to furtheranalyze the contents of a Wikipedia page Data is alsoprovided as XML dump and is incorporated in the lex-ical resource UBY47 for language tools

Several different approaches to extract knowledgefrom Wikipedia are presented in [34] Given featureslike anchor texts interlanguage links category linksand redirect pages are utilized eg for word-sense dis-ambiguations or synonyms translations taxonomic re-lations and abbreviation or hypernym resolution re-spectively Apart from this link structures are used tobuild the Wikipedia Thesaurus Web service48 Addi-tional projects presented by the authors that exploit thementioned features are listed on the Special InterestGroup on Wikipedia Mining (SIGWP) Web site49

An earlier approach to improve the quality of the infobox schemata and contents is described in [44].

45 http://www.mediawiki.org/wiki/Alternative_parsers

46 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction

47 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby

48 http://sigwp.org/en/index.php/Wikipedia_Thesaurus

49 http://sigwp.org/en/

The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text by applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [35], the authors describe their effort building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike50 and Hudong Baike51. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve a better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [42]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories, and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

KnowItAll52 is a Web-scale knowledge extraction effort which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [26] is a large common sense knowledge base which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [37] is a large taxonomy derived from the categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [43] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [44] is a system which autonomously extracts structured data from Wikipedia and uses

50 http://baike.baidu.com
51 http://www.hudong.com
52 http://www.cs.washington.edu/research/knowitall


self-supervised linking. Finally, [2] was an infobox extraction approach for Wikipedia which later became the DBpedia project.

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that DBpedia has matured and improved significantly in the last years, in particular also in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof of concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [18] meanwhile becoming part of DBpedia, and LinkedGeoData [38], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach a significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors as well as wrong or outdated information in Wikipedia.
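
As a minimal sketch of such a comparison (not part of the current extraction framework), a federated SPARQL query could contrast the values of a property between the English and the German edition; the choice of dbo:populationTotal and the German endpoint URL are assumptions made for the example:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

# English resources whose population value differs from the one
# recorded for the corresponding resource in the German edition
SELECT ?s ?popEn ?popDe WHERE {
  ?s dbo:populationTotal ?popEn ;
     owl:sameAs ?sDe .
  FILTER(STRSTARTS(STR(?sDe), "http://de.dbpedia.org/resource/"))
  SERVICE <http://de.dbpedia.org/sparql> {
    ?sDe dbo:populationTotal ?popDe .
  }
  FILTER(?popEn != ?popDe)
}
LIMIT 100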

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [45], we took a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision also extracting semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki53. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representation and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are the Wikipedia infoboxes, which are structured but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other nicely. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, implemented as SPARQL queries on the DBpedia Live endpoint, which identify data quality issues and are executed at regular intervals. For example, a query could check that the birthday of a person is always before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].
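
A minimal sketch of such a sanity check, assuming the standard dbo:birthDate and dbo:deathDate properties of the DBpedia ontology, could look as follows:

PREFIX dbo: <http://dbpedia.org/ontology/>

# Persons whose recorded death date precedes their birth date,
# i.e. candidates for a typo in the underlying Wikipedia infobox
SELECT ?person ?birth ?death WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER(?death < ?birth)
}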

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP. Despite recent advances (cf. Section 7), there is still a huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and to update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to which entity a certain surface form should be mapped.
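
As a simple illustration of such a graph-based signal (a sketch, not the method of any particular system; the candidate and context URIs are chosen for the example), one could prefer the candidate that is most strongly connected to entities already resolved from the surrounding text:

# Which candidate for the surface form "Paris" is connected to the
# context entity Texas? The candidate with more connections is preferred.
SELECT ?candidate (COUNT(*) AS ?links) WHERE {
  VALUES ?candidate { <http://dbpedia.org/resource/Paris>
                      <http://dbpedia.org/resource/Paris,_Texas> }
  { ?candidate ?p <http://dbpedia.org/resource/Texas> }
  UNION
  { <http://dbpedia.org/resource/Texas> ?p ?candidate }
}
GROUP BY ?candidate
ORDER BY DESC(?links)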

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Paul Kreis for DBpedia development
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– people who helped contributing data for certain parts of the article:

  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.
[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.
[3] C. Becker and C. Bizer. Exploring the geospatial semantic web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.
[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.
[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD, International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.
[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the web of data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.
[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.
[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.
[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.
[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th International Conference on World Wide Web, pages 100–110. ACM, 2004.
[12] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: enriching DBpedia with multimedia information. In Proceedings of the Sixth International Conference on Knowledge Capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.
[13] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.
[14] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010. Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.
[15] P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.
[16] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009. Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.
[17] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.
[18] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In Proc. of the Joint International Semantic Technology Conference, 2012.
[19] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia Live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.
[20] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.
[21] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.
[22] D. Kontokostas, S. Auer, S. Hellmann, J. Lehmann, P. Westphal, R. Cornelissen, and A. Zaveri. Test-driven data quality evaluation for SPARQL endpoints. In Submitted to 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.
[23] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of linked data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.
[24] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The open archives initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.
[25] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.
[26] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.
[27] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th Conference on Computational Linguistics, pages 495–501, 2000.
[28] V. Lopez, M. Fernandez, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the semantic web. Semantic Web, 3(3):249–265, 2012.
[29] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.
[30] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.
[31] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.
[32] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.
[33] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.
[34] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.
[35] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin, Heidelberg, 2011. Springer-Verlag.
[36] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, Nov. 10 2008.
[37] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.
[38] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.
[39] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.
[40] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.
[41] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st International Conference on World Wide Web, pages 639–648. ACM, 2012.
[42] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st International Conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.
[43] F. Wu and D. Weld. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th World Wide Web Conference, 2008.
[44] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.
[45] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.
[46] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

Appendix

A Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net/>
PREFIX an: <http://vocab.sindice.net/analytics#>
PREFIX dom: <http://sindice.com/dataspace/default/domain/>

SELECT ?Dataset ?Target_Dataset ?predicate
       SUM(xsd:long(?cardinality)) AS ?total
FROM <http://sindice.com/analytics> {
  ?target any23:domain_uri ?Target_Dataset .

  ?edge an:source ?source ;
        an:target ?target ;
        an:label ?predicate ;
        an:cardinality ?cardinality ;
        an:publishedIn ?Dataset .

  ?source any23:domain_uri ?Dataset .

  FILTER (?Dataset != ?Target_Dataset)
}

Listing 1: SPARQL query on the Sindice summary graph for obtaining the number of incoming links.


[6] E Cabrio J Cojan A P Aprosio B Magnini A Lavelli andF Gandon QAKiS an open domain QA system based on rela-tional patterns In ISWC-PD International Semantic Web Con-ference (Posters amp Demos) volume 914 of CEUR WorkshopProceedings CEUR-WSorg 2012

[7] D V Camarda S Mazzini and A Antonuccio Lodlive ex-ploring the web of data In V Presutti and H S Pinto editorsI-SEMANTICS 2012 - 8th International Conference on Seman-tic Systems I-SEMANTICS rsquo12 Graz Austria September 5-72012 pages 197ndash200 ACM 2012

[8] S Campinas T E Perry D Ceccarelli R Delbru and G Tum-marello Introducing rdf graph summary with application toassisted sparql formulation In Database and Expert SystemsApplications (DEXA) 2012 23rd International Workshop onpages 261ndash266 IEEE 2012

[9] D Damljanovic M Agatonovic and H Cunningham FreyaAn interactive way of querying linked data using natural lan-guage In The Semantic Web ESWC 2011 Workshops pages125ndash138 Springer 2012

[10] O Erling and I Mikhailov RDF support in the virtuosoDBMS In S Auer C Bizer C Muller and A V Zhdanovaeditors CSSW volume 113 of LNI pages 59ndash68 GI 2007

[11] O Etzioni M Cafarella D Downey S Kok A-M PopescuT Shaked S Soderland D S Weld and A Yates Web-scaleinformation extraction in knowitall In Proceedings of the 13thinternational conference on World Wide Web pages 100ndash110ACM 2004

[12] A Garcıa-Silva M Jakob P N Mendes and C Bizer Mul-tipedia enriching DBpedia with multimedia information InProceedings of the sixth international conference on Knowl-edge capture K-CAP rsquo11 pages 137ndash144 New York NYUSA 2011 ACM

[13] A Garcıa-Silva M Szomszor H Alani and O Corcho Pre-liminary results in tag disambiguation using DBpedia In 1stInternational Workshop in Collective Knowledage Capturingand Representation (CKCaR) California USA 2009

[14] R Hahn C Bizer C Sahnwaldt C Herta S RobinsonM Burgle H Duwiger and U Scheel Faceted wikipediasearch In W Abramowicz and R Tolksdorf editors BusinessInformation Systems 13th International Conference BIS 2010Berlin Germany May 3-5 2010 Proceedings volume 47 of

28 Lehmann et al DBpedia

Lecture Notes in Business Information Processing pages 1ndash11Springer 2010

[15] P Heim T Ertl and J Ziegler Facet graphs Complex seman-tic querying made easy In Proceedings of the 7th Extended Se-mantic Web Conference (ESWC 2010) volume 6088 of LNCSpages 288ndash302 BerlinHeidelberg 2010 Springer

[16] P Heim S Hellmann J Lehmann S Lohmann and T Stege-mann Relfinder Revealing relationships in RDF knowledgebases In T-S Chua Y Kompatsiaris B Merialdo W HaasG Thallinger and W Bailer editors Semantic Multimedia4th International Conference on Semantic and Digital MediaTechnologies SAMT 2009 Graz Austria December 2-4 2009Proceedings volume 5887 of Lecture Notes in Computer Sci-ence pages 182ndash187 Springer 2009

[17] P Heim S Lohmann D Tsendragchaa and T Ertl Semlensvisual analysis of semantic data with scatter plots and semanticlenses In C Ghidini A-C N Ngomo S N Lindstaedt andT Pellegrini editors Proceedings the 7th International Con-ference on Semantic Systems I-SEMANTICS 2011 Graz Aus-tria September 7-9 2011 ACM International Conference Pro-ceeding Series pages 175ndash178 ACM 2011

[18] S Hellmann J Brekle and S Auer Leveraging the crowd-sourcing of lexical resources for bootstrapping a linguistic datacloud In Proc of the Joint International Semantic TechnologyConference 2012

[19] S Hellmann C Stadler J Lehmann and S Auer DBpedialive extraction In Proc of 8th International Conferenceon Ontologies DataBases and Applications of Semantics(ODBASE) volume 5871 of Lecture Notes in Computer Sci-ence pages 1209ndash1223 2009

[20] J Hoffart F M Suchanek K Berberich and G WeikumYAGO2 A spatially and temporally enhanced knowledge basefrom wikipedia Artif Intell 19428ndash61 2013

[21] H Ji R Grishman H T Dang K Griffitt and J EllisOverview of the TAC 2010 knowledge base population trackIn Third Text Analysis Conference (TAC 2010) 2010

[22] D Kontokostas S Auer S Hellmann J Lehmann P West-phal R Cornelissen and A Zaveri Test-driven data qualityevaluation for sparql endpoints In Submitted to 12th Interna-tional Semantic Web Conference 21-25 October 2013 SydneyAustralia 2013

[23] D Kontokostas C Bratsas S Auer S Hellmann I Antoniouand G Metakides Internationalization of linked data The caseof the greek dbpedia edition Web Semantics Science Servicesand Agents on the World Wide Web 15(0)51 ndash 61 2012

[24] C Lagoze H V de Sompel M Nelson and S WarnerThe open archives initiative protocol for metadata har-vesting httpwwwopenarchivesorgOAIopenarchivesprotocolhtml 2008

[25] J Lehmann S Monahan L Nezda A Jung and Y Shi LCCapproaches to knowledge base population at TAC 2010 InProceedings TAC Workshop 2010

[26] D B Lenat Cyc A large-scale investment in knowledgeinfrastructure Communications of the ACM 38(11)33ndash381995

[27] C-Y Lin and E H Hovy The automated acquisition oftopic signatures for text summarization In Proceedings of the18th conference on Computational linguistics pages 495ndash5012000

[28] V Lopez M Fernandez E Motta and N Stieler PoweraquaSupporting users in querying and exploring the semantic web

Semantic Web 3(3)249ndash265 2012[29] M Martin C Stadler P Frischmuth and J Lehmann Increas-

ing the financial transparency of european commission projectfunding Semantic Web Journal Special Call for LinkedDataset descriptions 2013

[30] P N Mendes M Jakob and C Bizer DBpedia for NLP -a multilingual cross-domain knowledge base In Proceedingsof the International Conference on Language Resources andEvaluation (LREC) Istanbul Turkey 2012

[31] P N Mendes M Jakob A Garcıa-Silva and C BizerDBpedia Spotlight Shedding light on the web of documentsIn Proceedings of the 7th International Conference on Seman-tic Systems (I-Semantics) Graz Austria 2011

[32] S M Meyer J Degener J Giannandrea and B MichenerOptimizing schema-last tuple-store queries in graphd In A KElmagarmid and D Agrawal editors SIGMOD Conferencepages 1047ndash1056 ACM 2010

[33] M Morsey J Lehmann S Auer C Stadler and S Hell-mann DBpedia and the live extraction of structured data fromwikipedia Program electronic library and information sys-tems 4627 2012

[34] K Nakayama M Pei M Erdmann M Ito M ShirakawaT Hara and S Nishio Wikipedia mining Wikipedia as a cor-pus for knowledge extraction In Annual Wikipedia Conference(Wikimania) 2008

[35] X Niu X Sun H Wang S Rong G Qi and Y Yu ZhishimeWeaving chinese linking open data In 10th International Con-ference on The semantic web - Volume Part II pages 205ndash220Berlin Heidelberg 2011 Springer-Verlag

[36] E Oren R Delbru M Catasta R Cyganiak H Stenzhornand G Tummarello Sindicecom a document-oriented lookupindex for open linked data Int J of Metadata and Semanticsand Ontologies 337ndash52 Nov 10 2008

[37] S P Ponzetto and M Strube Wikitaxonomy A large scaleknowledge resource In M Ghallab C D SpyropoulosN Fakotakis and N M Avouris editors ECAI volume 178of Frontiers in Artificial Intelligence and Applications pages751ndash752 IOS Press 2008

[38] C Stadler J Lehmann K Hoffner and S Auer Linkedgeo-data A core for a web of spatial open data Semantic WebJournal 3(4)333ndash354 2012

[39] F M Suchanek G Kasneci and G Weikum Yago a core ofsemantic knowledge In C L Williamson M E Zurko P FPatel-Schneider and P J Shenoy editors WWW pages 697ndash706 ACM 2007

[40] E Tacchini A Schultz and C Bizer Experiments withwikipedia cross-language data fusion In Proceedings of the5th Workshop on Scripting and Development for the SemanticWeb ESWC Citeseer 2009

[41] C Unger L Buhmann J Lehmann A-C Ngonga NgomoD Gerber and P Cimiano Template-based question answer-ing over rdf data In Proceedings of the 21st international con-ference on World Wide Web pages 639ndash648 ACM 2012

[42] Z Wang J Li Z Wang and J Tang Cross-lingual knowledgelinking across wiki knowledge bases In Proceedings of the21st international conference on World Wide Web pages 459ndash468 New York NY USA 2012 ACM

[43] F Wu and D Weld Automatically Refining the WikipediaInfobox Ontology In Proceedings of the 17th World Wide WebConference 2008

Lehmann et al DBpedia 29

[44] F Wu and D S Weld Autonomously semantifying wikipediaIn Proceedings of the 16th Conference on Information andKnowledge Management pages 41ndash50 ACM 2007

[45] A Zaveri D Kontokostas M A Sherif L BuhmannM Morsey S Auer and J Lehmann User-driven quality eval-uation of dbpedia In To appear in Proceedings of 9th Inter-national Conference on Semantic Systems I-SEMANTICS rsquo13Graz Austria September 4-6 2013 ACM 2013

[46] T Zesch C Muller and I Gurevych Extracting lexical seman-tic knowledge from wikipedia and wiktionary In Proceedingsof the 6th International Conference on Language Resourcesand Evaluation Marrakech Morocco May 2008 electronicproceedings

Appendix

A Sindice Analysis

The following query was used to analyse the incom-ing links of all datasets crawled by Sindice includingDBpedia The query runs against the Sindice SummaryGraph [8] which contains statistics over all Sindicedatasets Due to the complexity of the query it was runas a Hadoop job on the Sindice cluster

1 PREFIX any232 lthttpvocabsindicenetgt3 PREFIX an4 lthttpvocabsindicenetanalyticsgt5 PREFIX dom6 lthttpsindicecomdataspacedefault

domaingt7 SELECT Dataset Target_Datatset predicate

SUM(xsdlong(cardinality)) AS total8 FROM lthttpsindicecomanalyticsgt 9 target any23domain_uri Target_Dataset

1011 edge ansource source 12 antarget target 13 anlabel predicate14 ancardinality cardinality15 anpublishedIn Dataset 1617 source any23domain_uri Dataset 1819 FILTER (Dataset = Target_Dataset)

Listing 1 SPARQL query on the Sindice summarygraph for obtaining the number of incoming links

Page 18: Semantic Web 1 (2012) 1–5 IOS Press DBpedia - A Large ... · Semantic Web 1 (2012) 1–5 1 IOS Press DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia


[Figure omitted: bar chart per DBpedia release; axes show download count (in thousands) and download volume (in TB).]

Fig. 10. The download count and download volume of the DBpedia data sets.

2. clients that have been banned after misuse;
3. applications, spiders and other crawlers that are blocked after frequently hitting the rate limit or generally using too many resources.

Virtuoso supports HTTP Access Control Lists (ACLs), which allow the administrator to rate limit certain IP addresses or whole IP ranges. A maximum number of requests per second (currently 15) as well as a bandwidth limit per request (currently 10MB) are enforced. If the client software can handle compression, replies are compressed to further save bandwidth. Exception rules can be configured for multiple clients hidden behind a NAT firewall (appearing as a single IP address) or for temporary requests for higher rate limits. When a client hits an ACL limit, the system reports an appropriate HTTP status code20, like 509, and quickly drops the connection. The system further uses an iptables based firewall for permanent blocking of clients identified by their IP addresses.

6.3 Public Static Endpoint Statistics

The statistics presented in this section were extracted from reports generated by Webalizer v2.2121. Table 10 and Table 11 show various DBpedia SPARQL endpoint usage statistics for the last couple of DBpedia releases. The Avg./Day column represents the average number of hits (resp. visits) per day, followed by the Median and Standard Deviation. The last column shows the maximum number of hits (visits) that was recorded on a single day for each data set version. Visits (i.e. sessions of subsequent queries from the same client) are determined by a floating 30 minute time window. All requests from behind a NAT firewall are logged under the same external IP address and are therefore counted towards the same visit if they occur within the 30 minute interval.

20 http://en.wikipedia.org/wiki/List_of_HTTP_status_codes
21 http://www.webalizer.org

DBpedia   Avg./Day    Median      Stdev      Maximum
3.3         733811     711306     188991     1319539
3.4        1212549    1165893     351226     2371657
3.5        1122612    1035444     386454     2908720
3.6        1328355    1286750     256945     2495031
3.7        2085399    1930728    1057398     8873473
3.8        2910410    2717775    1085640     7678490

Table 10: Number of SPARQL endpoint hits

Fig. 11. Average number of SPARQL endpoint requests per day.


DBpedia   Avg./Day   Median    Stdev    Maximum
3.3          9750      9775     2036      13126
3.4         11329     11473     1546      14198
3.5         16592     16902     2969      23129
3.6         19471     17664     5691      56064
3.7         23972     22262    10741     127869
3.8         16851     16711     2960      27416

Table 11: Number of SPARQL endpoint visits

Fig. 12. Average SPARQL endpoint visits per day.

Figure 11 and Figure 12 show the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,
2. blocking of apps that were abusing the DBpedia endpoint,
3. uptake of the language specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data Hub returning resources in a number of different formats. For each data set we randomly selected 14 days worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

Endpoint    3.3     3.4     3.5     3.6     3.7     3.8
data       1355    2681    2230    2547    3714    4246
ontology     80     168     142     178     116      97
page       2634    4703    1810    1687    3787    7377
property    231     311     137     176     176     128
resource   2976    4080    2554    2309    4436    7143
sparql     2090    4177    2266    9112   15902   15475
other       252     420     342     277     579     695
total      9619   16541    9434   16286   28710   35142

Table 12: Hits per service to http://dbpedia.org, in thousands

Fig. 13. Traffic: Linked Data versus SPARQL endpoint.

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x, redirecting the client to either the page (HTML based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data from the links at the bottom of the page-based article. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology.

Figure 13 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13 we focused on the calls to the sparql endpoint and counted the number of statements per type. As the log files only record the full SPARQL query on a GET request, all the PUT requests are counted as unknown.


Statement    3.3     3.4     3.5     3.6     3.7     3.8
ask            56     269     360     326     159     677
construct      55      31      14      11      22      38
describe       11       8       4       7      62     111
select       1891    3663    1658    8030   11204   13516
unknown        78     206     229     738    4455    1134
total        2090    4177    2266    9112   15902   15475

Table 13: Hits per statement type, in thousands

Statement    3.3     3.4     3.5     3.6     3.7     3.8
distinct     19.5    11.4    17.3    19.4    13.3    25.4
filter       45.7    13.7    31.8    25.3    29.2    36.1
functions     8.8     6.9    23.5    21.3    25.5    25.9
geo          27.7     7.0    39.6     6.7     9.3     9.2
group         0.0     0.0     0.0     0.0     0.0     0.2
limit         4.6     6.5    11.6    10.5     7.7     7.8
optional     30.6    23.8    47.3    26.3    16.7    17.2
order         2.3     1.3     1.9     3.2     1.2     1.6
union         3.3     3.2    26.4    11.3    12.1    20.6

Table 14: Trends in SPARQL select (rounded values in %)

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– use of GEO objects
– GROUP BY
– LIMIT / OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects we counted the use of SPARQL PREFIX geo and wgs84 declarations and usage in property tags. Table 14 shows the use of various keywords as a percentage of the total select queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
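To make the counted constructs concrete, the following query is an illustrative example only (the class and property names, dbo:City and dbo:populationTotal, are ordinary DBpedia ontology terms and not taken from the analysed logs). It combines several of the constructs listed above (DISTINCT, FILTER, OPTIONAL, ORDER BY and LIMIT):

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Cities with more than two million inhabitants, ordered by population.
SELECT DISTINCT ?city ?name ?population WHERE {
  ?city a dbo:City ;
        rdfs:label ?name ;
        dbo:populationTotal ?population .
  OPTIONAL { ?city dbo:abstract ?abstract .
             FILTER(LANG(?abstract) = "en") }
  FILTER(LANG(?name) = "en" && ?population > 2000000)
}
ORDER BY DESC(?population)
LIMIT 10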

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 14 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 14. Number of daily requests sent to the DBpedia Live endpoint for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013.

7 Use Cases and Applications

With DBpedia evolving into a hub on the Web of Data due to its coverage of various domains, it is an eminent data set for various tasks. It can not only improve Wikipedia search, but also serve as a data source for applications and mashups, as well as for text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [30]. For that purpose, DBpedia includes a number of specialized data sets22. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [12]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

22 http://wiki.dbpedia.org/Datasets/NLP

7.1.1 Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking tasks – refers to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [21].

As DBpedia reflects a vast amount of structured real world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [31] is an open source tool23 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base, described as SPARQL queries.
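As a sketch of the latter option, a whitelist of annotation targets could be expressed as a query along the following lines; the class and property used here (dbo:Politician, dbo:birthPlace) are merely an illustration and not a configuration shipped with Spotlight:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Restrict annotation to politicians born in Berlin.
SELECT ?resource WHERE {
  ?resource a dbo:Politician ;
            dbo:birthPlace dbr:Berlin .
}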

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI24, Semantic API from Ontos25, Open Calais26 and Zemanta27, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project28.

23 http://spotlight.dbpedia.org
24 http://www.alchemyapi.com
25 http://www.ontos.com
26 http://www.opencalais.com
27 http://www.zemanta.com

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki29 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [13]. The BBC30 employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system. The Watson system won a $1 million prize in Jeopardy! and relies on several data sets including DBpedia31. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series32. Several QA systems, such as TBSL [41], PowerAqua [28], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topic diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably answer questions over DBpedia correctly could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.
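To illustrate the target representation such systems have to construct, the following query is a hypothetical mapping (assuming the DBpedia terms dbo:Book, dbo:author and the resource dbr:Dan_Brown) of the question "Which books were written by Dan Brown?":

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Target query for "Which books were written by Dan Brown?"
SELECT ?book WHERE {
  ?book a dbo:Book ;
        dbo:author dbr:Dan_Brown .
}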

28 http://stanbol.apache.org
29 http://www.faviki.com
30 http://bbc.co.uk
31 http://www.aaai.org/Magazine/Watson/watson.php
32 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald


Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property from mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [25].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records. Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.
– Stable and curated identifiers for linking. DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.
– A basis for a thesaurus for subject indexing. The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project33. Recently it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia34. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.
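As a sketch, assuming the harvested VIAF links are published as owl:sameAs statements pointing to viaf.org URIs, a library could look up the external identifiers of an author as follows:

PREFIX owl: <http://www.w3.org/2002/07/owl#>

# External identifiers of an author that point to VIAF.
SELECT ?id WHERE {
  <http://dbpedia.org/resource/Douglas_Adams> owl:sameAs ?id .
  FILTER(CONTAINS(STR(?id), "viaf.org"))
}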

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of tools and structure them in categories.

33 http://viaf.org
34 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html

Facet Based Browsers. An award-winning35 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [14]. Another facet based browser, which allows to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [15]. A generic SPARQL based facet explorer, which also uses a graph based visualisation of facets, is LODLive [7]. The OpenLink built-in facet based browser36 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.
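The aggregations such browsers compute typically boil down to grouping queries. A minimal sketch, counting instances per DBpedia ontology class as one plausible facet, could look like this:

PREFIX dbo: <http://dbpedia.org/ontology/>

# Number of instances per DBpedia ontology class.
SELECT ?class (COUNT(?instance) AS ?count) WHERE {
  ?instance a ?class .
  FILTER(STRSTARTS(STR(?class), "http://dbpedia.org/ontology/"))
}
GROUP BY ?class
ORDER BY DESC(?count)
LIMIT 20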

Search and Querying. The DBpedia Query Builder37 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [16] tool provides an intuitive interface which allows to explore the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [17] allows to create statistical analysis queries and correlations in RDF data and DBpedia in particular.

Spatial Applications. DBpedia Mobile [3] is a location aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.
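A location-aware client of this kind essentially issues bounding-box queries over the WGS84 coordinates published by DBpedia. The following query is an illustrative sketch; the bounding box roughly covers central Berlin and is chosen purely for the example:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Resources located within a small bounding box around central Berlin.
SELECT ?place ?label ?lat ?long WHERE {
  ?place geo:lat ?lat ;
         geo:long ?long ;
         rdfs:label ?label .
  FILTER(?lat > 52.4 && ?lat < 52.6 &&
         ?long > 13.3 && ?long < 13.5)
  FILTER(LANG(?label) = "en")
}
LIMIT 50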

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources available, written in 171 languages (of which approximately 147 can be considered active38), containing information about hundreds of spoken and even ancient languages. For example, the English Wiktionary contains nearly 3 million words39. A Wiktionary page provides, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies and most prominently senses. Within this tree, numerous kinds of linguistic properties are given, including synonyms, hyponyms, hyperonyms, example sentences, links to Wikipedia and many more. The fast changing nature together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules poses the biggest problem to the automated transformation into a structured knowledge base. We identified this as a serious problem: although the value of Wiktionary is known and usage scenarios are obvious, only some rudimentary tools exist which either focus on a specific subset of the data or only cover one or two WLEs [18]. Existing tools can be seen as adapters to single WLEs: they are hard to maintain and there are too many languages that constantly change and require a programmer to refactor complex code. Opposed to the currently employed classic and straight-forward approach (implementing software adapters for scraping), Wiktionary2RDF [18] employs a declarative mediator/wrapper pattern. Figure 15 shows the work flow of the specially created Wiktionary extractor that takes as input one config XML file for each WLE. The aim is to enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves. A simple XML dialect was created to encode the "entry layout explained" (ELE) guidelines and declare triple patterns that define how the resulting RDF should be built. An example is given for the following wiki syntax:

35 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany
36 http://dbpedia.org/fct
37 http://querybuilder.dbpedia.org
38 http://s23.org/wikistats/wiktionaries_html.php

===Synonyms===
* [[building]]
* [[company]]

The regular expression for scraping and the triple generation rule encoded in XML look like:

<wikiTemplate>===Synonyms===
(* [[$target]]
)+
</wikiTemplate>

<triple s="http://some.ns/$entityId"
        p="http://some.ns/hasSynonym"
        o="http://some.ns/$target" />

39 See http://en.wiktionary.org/wiki/semantic for a simple example page.

Fig. 15. Detail view of the Wiktionary extractor.

This configuration is interpreted and run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data40. Statistics are shown in Table 15.

8 Related Work

8.1 Cross Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata41. Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering the truth about things, but providing statements about them. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011 according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided into three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.
– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.
– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

40 http://wiktionary.dbpedia.org
41 wikidata.org

language     words      triples     resources    predicates   senses    XML lines
en         2142237    28593364     11804039            28     424386          930
fr         4657817    35032121     20462349            22     592351          490
ru         1080156    12813437      5994560            17     149859         1449
de          701739     5618508      2966867            16     122362          671

Table 15: Statistical comparison of extractions for different languages. XML lines measures the number of lines of the XML configuration files.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013 the Wikidata extension is live on all Wikipedia language editions and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface42 as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase43 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [32] graph database, which allows to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked and the system can easily revert to a previous version.

42 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
43 http://www.freebase.com

This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only indirectly be edited by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3 YAGO

One of the projects which pursues similar goals to DBpedia is YAGO44 [39]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [20], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content, while YAGO focuses on extracting a smaller number of relations compared to DBpedia to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

44 http://www.mpi-inf.mpg.de/yago-naga/yago

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.
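For example, because DBpedia ships the YAGO type hierarchy alongside its own ontology, the YAGO classes of a resource can be retrieved directly from the DBpedia endpoint. The class namespace used in the filter below is an assumption for illustration:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

# YAGO classes assigned to the resource Berlin.
SELECT ?yagoType WHERE {
  <http://dbpedia.org/resource/Berlin> rdf:type ?yagoType .
  FILTER(STRSTARTS(STR(?yagoType), "http://dbpedia.org/class/yago/"))
}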

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects45 which are able to process wiki syntax, as well as a list of data extraction extensions46 for MediaWiki.

JWPL (Java Wikipedia Library [46]) is an open-source Java-based API that allows to access information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki Markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as XML dump and is incorporated in the lexical resource UBY47 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [34]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguations or synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service48. Additional projects presented by the authors that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site49.

An earlier approach to improve the quality of the infobox schemata and contents is described in [44]. The presented methodology encompasses a three step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text, applying machine learning algorithms.

45 http://www.mediawiki.org/wiki/Alternative_parsers
46 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction
47 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
48 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
49 http://sigwp.org/en

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [35] the authors describe their effort building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike50 and Hudong Baike51. Apart from utilizing MediaWiki and HTML Markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve a better cross-lingual knowledge-linkage beyond the use of Wikipedia interlanguage links is presented in [42]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

KnowItAll52 is a web scale knowledge extraction effort, which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [26] is a large common sense knowledge base, which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [37] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [43] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [44] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking. [2] was an infobox extraction approach for Wikipedia which later became the DBpedia project.

50 http://baike.baidu.com
51 http://www.hudong.com
52 http://www.cs.washington.edu/research/knowitall

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that DBpedia matured and improved significantly in the last years, in particular also in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories and DBpedia already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [18] meanwhile becoming part of DBpedia, and LinkedGeoData [38], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach a significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.
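A minimal sketch of such a comparison, assuming the German chapter endpoint at http://de.dbpedia.org/sparql and owl:sameAs links from the German to the English edition, could use SPARQL 1.1 federation to flag diverging population values:

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

# Cities whose population differs between the English and German editions.
SELECT ?city ?popEN ?popDE WHERE {
  ?city a dbo:City ;
        dbo:populationTotal ?popEN .
  SERVICE <http://de.dbpedia.org/sparql> {
    ?cityDE owl:sameAs ?city ;
            dbo:populationTotal ?popDE .
  }
  FILTER(?popEN != ?popDE)
}
LIMIT 100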

Community-driven data quality improvement. In the future we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [45], we took a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki, which was backported and implemented as a very lightweight extension for MediaWiki53. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other nicely. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, which identify data quality issues and are executed in certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].
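A sanity check of the kind described above could, for instance, be expressed as the following query; dbo:birthDate and dbo:deathDate are the obvious candidate properties, while the exact set of checks would be curated by the community:

PREFIX dbo: <http://dbpedia.org/ontology/>

# Persons whose recorded death date precedes their birth date.
SELECT ?person ?birth ?death WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER(?death < ?birth)
}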

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP. Despite recent advances (cf. Section 7), there is still a huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (depends on training data) and update (requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationship between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.
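A first, deliberately naive building block for such a disambiguation step is candidate retrieval by surface form. The query below, using "Leipzig" as an example label, returns candidate entities together with their ontology types, which can then be scored against the context of the mention:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Candidate entities for the surface form "Leipzig", with their types.
SELECT ?candidate ?type WHERE {
  ?candidate rdfs:label "Leipzig"@en .
  OPTIONAL {
    ?candidate a ?type .
    FILTER(STRSTARTS(STR(?type), "http://dbpedia.org/ontology/"))
  }
}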

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors,
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint,
– Kingsley Idehen for SPARQL and Linked Data hosting and community support,
– Paul Kreis for DBpedia development,
– Christopher Sahnwaldt for DBpedia development and release management,
– Claus Stadler for DBpedia development,
– people who helped contributing data for certain parts of the article:
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI),
  * Freebase: Shawn Simister and Tom Morris (working at Freebase),
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim).

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S Auer C Bizer G Kobilarov J Lehmann R Cyganiakand Z Ives DBpedia A nucleus for a web of open data InProceedings of the 6th International Semantic Web Conference(ISWC) volume 4825 of Lecture Notes in Computer Sciencepages 722ndash735 Springer 2008

[2] S Auer and J Lehmann What have Innsbruck and Leipzigin common extracting semantics from wiki content In Pro-ceedings of the ESWC (2007) volume 4519 of Lecture Notesin Computer Science pages 503ndash517 Springer 2007

[3] C Becker and C Bizer Exploring the geospatial semantic webwith DBpedia mobile J Web Sem 7(4)278ndash286 2009

[4] D Bikel V Castelli R Florian and D-J Han Entity link-ing and slot filling through statistical processing and inferencerules In Proceedings TAC Workshop 2009

[5] C Bizer J Lehmann G Kobilarov S Auer C Becker R Cy-ganiak and S Hellmann DBpedia - a crystallization pointfor the web of data Journal of Web Semantics 7(3)154ndash1652009

[6] E Cabrio J Cojan A P Aprosio B Magnini A Lavelli andF Gandon QAKiS an open domain QA system based on rela-tional patterns In ISWC-PD International Semantic Web Con-ference (Posters amp Demos) volume 914 of CEUR WorkshopProceedings CEUR-WSorg 2012

[7] D V Camarda S Mazzini and A Antonuccio Lodlive ex-ploring the web of data In V Presutti and H S Pinto editorsI-SEMANTICS 2012 - 8th International Conference on Seman-tic Systems I-SEMANTICS rsquo12 Graz Austria September 5-72012 pages 197ndash200 ACM 2012

[8] S Campinas T E Perry D Ceccarelli R Delbru and G Tum-marello Introducing rdf graph summary with application toassisted sparql formulation In Database and Expert SystemsApplications (DEXA) 2012 23rd International Workshop onpages 261ndash266 IEEE 2012

[9] D Damljanovic M Agatonovic and H Cunningham FreyaAn interactive way of querying linked data using natural lan-guage In The Semantic Web ESWC 2011 Workshops pages125ndash138 Springer 2012

[10] O Erling and I Mikhailov RDF support in the virtuosoDBMS In S Auer C Bizer C Muller and A V Zhdanovaeditors CSSW volume 113 of LNI pages 59ndash68 GI 2007

[11] O Etzioni M Cafarella D Downey S Kok A-M PopescuT Shaked S Soderland D S Weld and A Yates Web-scaleinformation extraction in knowitall In Proceedings of the 13thinternational conference on World Wide Web pages 100ndash110ACM 2004

[12] A Garcıa-Silva M Jakob P N Mendes and C Bizer Mul-tipedia enriching DBpedia with multimedia information InProceedings of the sixth international conference on Knowl-edge capture K-CAP rsquo11 pages 137ndash144 New York NYUSA 2011 ACM

[13] A Garcıa-Silva M Szomszor H Alani and O Corcho Pre-liminary results in tag disambiguation using DBpedia In 1stInternational Workshop in Collective Knowledage Capturingand Representation (CKCaR) California USA 2009

[14] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[15] P Heim T Ertl and J Ziegler Facet graphs Complex seman-tic querying made easy In Proceedings of the 7th Extended Se-mantic Web Conference (ESWC 2010) volume 6088 of LNCSpages 288ndash302 BerlinHeidelberg 2010 Springer

[16] P Heim S Hellmann J Lehmann S Lohmann and T Stege-mann Relfinder Revealing relationships in RDF knowledgebases In T-S Chua Y Kompatsiaris B Merialdo W HaasG Thallinger and W Bailer editors Semantic Multimedia4th International Conference on Semantic and Digital MediaTechnologies SAMT 2009 Graz Austria December 2-4 2009Proceedings volume 5887 of Lecture Notes in Computer Sci-ence pages 182ndash187 Springer 2009

[17] P Heim S Lohmann D Tsendragchaa and T Ertl Semlensvisual analysis of semantic data with scatter plots and semanticlenses In C Ghidini A-C N Ngomo S N Lindstaedt andT Pellegrini editors Proceedings the 7th International Con-ference on Semantic Systems I-SEMANTICS 2011 Graz Aus-tria September 7-9 2011 ACM International Conference Pro-ceeding Series pages 175ndash178 ACM 2011

[18] S Hellmann J Brekle and S Auer Leveraging the crowd-sourcing of lexical resources for bootstrapping a linguistic datacloud In Proc of the Joint International Semantic TechnologyConference 2012

[19] S Hellmann C Stadler J Lehmann and S Auer DBpedialive extraction In Proc of 8th International Conferenceon Ontologies DataBases and Applications of Semantics(ODBASE) volume 5871 of Lecture Notes in Computer Sci-ence pages 1209ndash1223 2009

[20] J Hoffart F M Suchanek K Berberich and G WeikumYAGO2 A spatially and temporally enhanced knowledge basefrom wikipedia Artif Intell 19428ndash61 2013

[21] H Ji R Grishman H T Dang K Griffitt and J EllisOverview of the TAC 2010 knowledge base population trackIn Third Text Analysis Conference (TAC 2010) 2010

[22] D Kontokostas S Auer S Hellmann J Lehmann P West-phal R Cornelissen and A Zaveri Test-driven data qualityevaluation for sparql endpoints In Submitted to 12th Interna-tional Semantic Web Conference 21-25 October 2013 SydneyAustralia 2013

[23] D Kontokostas C Bratsas S Auer S Hellmann I Antoniouand G Metakides Internationalization of linked data The caseof the greek dbpedia edition Web Semantics Science Servicesand Agents on the World Wide Web 15(0)51 ndash 61 2012

[24] C Lagoze H V de Sompel M Nelson and S WarnerThe open archives initiative protocol for metadata har-vesting httpwwwopenarchivesorgOAIopenarchivesprotocolhtml 2008

[25] J Lehmann S Monahan L Nezda A Jung and Y Shi LCCapproaches to knowledge base population at TAC 2010 InProceedings TAC Workshop 2010

[26] D B Lenat Cyc A large-scale investment in knowledgeinfrastructure Communications of the ACM 38(11)33ndash381995

[27] C-Y Lin and E H Hovy The automated acquisition oftopic signatures for text summarization In Proceedings of the18th conference on Computational linguistics pages 495ndash5012000

[28] V. Lopez, M. Fernandez, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the semantic web. Semantic Web, 3(3):249–265, 2012.

[29] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[30] P N Mendes M Jakob and C Bizer DBpedia for NLP -a multilingual cross-domain knowledge base In Proceedingsof the International Conference on Language Resources andEvaluation (LREC) Istanbul Turkey 2012

[31] P N Mendes M Jakob A Garcıa-Silva and C BizerDBpedia Spotlight Shedding light on the web of documentsIn Proceedings of the 7th International Conference on Seman-tic Systems (I-Semantics) Graz Austria 2011

[32] S M Meyer J Degener J Giannandrea and B MichenerOptimizing schema-last tuple-store queries in graphd In A KElmagarmid and D Agrawal editors SIGMOD Conferencepages 1047ndash1056 ACM 2010

[33] M Morsey J Lehmann S Auer C Stadler and S Hell-mann DBpedia and the live extraction of structured data fromwikipedia Program electronic library and information sys-tems 4627 2012

[34] K Nakayama M Pei M Erdmann M Ito M ShirakawaT Hara and S Nishio Wikipedia mining Wikipedia as a cor-pus for knowledge extraction In Annual Wikipedia Conference(Wikimania) 2008

[35] X Niu X Sun H Wang S Rong G Qi and Y Yu ZhishimeWeaving chinese linking open data In 10th International Con-ference on The semantic web - Volume Part II pages 205ndash220Berlin Heidelberg 2011 Springer-Verlag

[36] E Oren R Delbru M Catasta R Cyganiak H Stenzhornand G Tummarello Sindicecom a document-oriented lookupindex for open linked data Int J of Metadata and Semanticsand Ontologies 337ndash52 Nov 10 2008

[37] S P Ponzetto and M Strube Wikitaxonomy A large scaleknowledge resource In M Ghallab C D SpyropoulosN Fakotakis and N M Avouris editors ECAI volume 178of Frontiers in Artificial Intelligence and Applications pages751ndash752 IOS Press 2008

[38] C Stadler J Lehmann K Hoffner and S Auer Linkedgeo-data A core for a web of spatial open data Semantic WebJournal 3(4)333ndash354 2012

[39] F M Suchanek G Kasneci and G Weikum Yago a core ofsemantic knowledge In C L Williamson M E Zurko P FPatel-Schneider and P J Shenoy editors WWW pages 697ndash706 ACM 2007

[40] E Tacchini A Schultz and C Bizer Experiments withwikipedia cross-language data fusion In Proceedings of the5th Workshop on Scripting and Development for the SemanticWeb ESWC Citeseer 2009

[41] C Unger L Buhmann J Lehmann A-C Ngonga NgomoD Gerber and P Cimiano Template-based question answer-ing over rdf data In Proceedings of the 21st international con-ference on World Wide Web pages 639ndash648 ACM 2012

[42] Z Wang J Li Z Wang and J Tang Cross-lingual knowledgelinking across wiki knowledge bases In Proceedings of the21st international conference on World Wide Web pages 459ndash468 New York NY USA 2012 ACM

[43] F Wu and D Weld Automatically Refining the WikipediaInfobox Ontology In Proceedings of the 17th World Wide WebConference 2008

Lehmann et al DBpedia 29

[44] F Wu and D S Weld Autonomously semantifying wikipediaIn Proceedings of the 16th Conference on Information andKnowledge Management pages 41ndash50 ACM 2007

[45] A Zaveri D Kontokostas M A Sherif L BuhmannM Morsey S Auer and J Lehmann User-driven quality eval-uation of dbpedia In To appear in Proceedings of 9th Inter-national Conference on Semantic Systems I-SEMANTICS rsquo13Graz Austria September 4-6 2013 ACM 2013

[46] T Zesch C Muller and I Gurevych Extracting lexical seman-tic knowledge from wikipedia and wiktionary In Proceedingsof the 6th International Conference on Language Resourcesand Evaluation Marrakech Morocco May 2008 electronicproceedings

Appendix

A Sindice Analysis

The following query was used to analyse the incom-ing links of all datasets crawled by Sindice includingDBpedia The query runs against the Sindice SummaryGraph [8] which contains statistics over all Sindicedatasets Due to the complexity of the query it was runas a Hadoop job on the Sindice cluster

1 PREFIX any232 lthttpvocabsindicenetgt3 PREFIX an4 lthttpvocabsindicenetanalyticsgt5 PREFIX dom6 lthttpsindicecomdataspacedefault

domaingt7 SELECT Dataset Target_Datatset predicate

SUM(xsdlong(cardinality)) AS total8 FROM lthttpsindicecomanalyticsgt 9 target any23domain_uri Target_Dataset

1011 edge ansource source 12 antarget target 13 anlabel predicate14 ancardinality cardinality15 anpublishedIn Dataset 1617 source any23domain_uri Dataset 1819 FILTER (Dataset = Target_Dataset)

Listing 1 SPARQL query on the Sindice summarygraph for obtaining the number of incoming links

DBpedia   Avg./Day   Median    Stdev    Maximum
3.3        9750       9775      2036     13126
3.4       11329      11473      1546     14198
3.5       16592      16902      2969     23129
3.6       19471      17664      5691     56064
3.7       23972      22262     10741    127869
3.8       16851      16711      2960     27416

Table 11: Number of SPARQL endpoint visits

Fig 12 Average SPARQL endpoint visits per day

Figure 11 and Figure 12 show the increasing popularity of DBpedia. There is a distinct dip in hits to the SPARQL endpoint in DBpedia 3.5, which is partially due to more strict initial limits for bot-related traffic, which were later relaxed. The sudden drop of visits between the 3.7 and the 3.8 data sets can be attributed to:

1. applications starting to use their own private DBpedia endpoint,

2. blocking of apps that were abusing the DBpedia endpoint,

3. uptake of the language-specific DBpedia endpoints and DBpedia Live.

6.4 Query Types and Trends

The DBpedia server is not only a SPARQL endpoint, but also serves as a Linked Data hub, returning resources in a number of different formats. For each data set we randomly selected 14 days' worth of log files and processed those in order to show the various services called. Table 12 shows the number of hits to the various endpoints.

Endpoint     3.3     3.4     3.5     3.6     3.7     3.8
data        1355    2681    2230    2547    3714    4246
ontology      80     168     142     178     116      97
page        2634    4703    1810    1687    3787    7377
property     231     311     137     176     176     128
resource    2976    4080    2554    2309    4436    7143
sparql      2090    4177    2266    9112   15902   15475
other        252     420     342     277     579     695
total       9619   16541    9434   16286   28710   35142

Table 12: Hits per service to http://dbpedia.org (in thousands)
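For illustration, a count of this kind can be produced from raw access logs with a few lines of code. The following minimal Python sketch is not part of the DBpedia toolchain; it assumes Apache-style access logs and a hypothetical file name, and maps each request to one of the endpoints listed in Table 12 by its first path segment:

import re
from collections import Counter

# Assumed Apache-style access log; the request path is the second token of the
# quoted request string, e.g. "GET /resource/Berlin HTTP/1.1".
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (/[^ ?"]*)')

KNOWN = {"data", "ontology", "page", "property", "resource", "sparql"}

def endpoint_of(path):
    # Map a request path to one of the endpoints listed in Table 12.
    segment = path.split("/")[1] if len(path) > 1 else ""
    return segment if segment in KNOWN else "other"

def count_hits(log_lines):
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if match:
            counts[endpoint_of(match.group(1))] += 1
    counts["total"] = sum(counts.values())
    return counts

# Hypothetical log file name, for illustration only:
# with open("dbpedia-access.log") as f:
#     print(count_hits(f))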

Fig 13 Traffic Linked Data versus SPARQL endpoint

The resource endpoint uses the Accept line in the HTTP header sent by the client to return an HTTP status code 30x, redirecting the client to either the page (HTML based) or data (formats like RDF/XML or Turtle) equivalent of the article. Clients also frequently mint their own URLs to either the page or data version of an article directly, or download the raw data from the links at the bottom of the page-based article. This explains why the count of page and data hits in the table is larger than the number of hits on the resource endpoint. The ontology and property endpoints return meta information about the DBpedia ontology.
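This redirect behaviour can be observed with any HTTP client. The following Python sketch merely illustrates the content negotiation described above; the requests library and the example resource are our choices and not part of the DBpedia infrastructure:

import requests

uri = "http://dbpedia.org/resource/Berlin"

# Asking for HTML: the resource endpoint answers with a 30x redirect to the
# /page/ (HTML) variant of the article.
html = requests.get(uri, headers={"Accept": "text/html"}, allow_redirects=False)
print(html.status_code, html.headers.get("Location"))

# Asking for RDF: the same URI redirects to the /data/ variant in the
# requested serialization (e.g. Turtle).
rdf = requests.get(uri, headers={"Accept": "text/turtle"}, allow_redirects=False)
print(rdf.status_code, rdf.headers.get("Location"))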

Figure 13 shows the percentages of traffic hits that were generated by the main endpoints. As we can see, the usage of the SPARQL endpoint has doubled from about 22 percent in 2009 to about 44 percent in 2013. However, this still means that 56 percent of traffic hits are directed to the Linked Data service.

In Table 13 we focused on the calls to the sparql endpoint and counted the number of statements per type. As the log files only record the full SPARQL query on a GET request, all the PUT requests are counted as unknown.


Statement    3.3     3.4     3.5     3.6     3.7     3.8
ask           56     269     360     326     159     677
construct     55      31      14      11      22      38
describe      11       8       4       7      62     111
select      1891    3663    1658    8030   11204   13516
unknown       78     206     229     738    4455    1134
total       2090    4177    2266    9112   15902   15475

Table 13: Hits per statement type (in thousands)

Keyword     3.3     3.4     3.5     3.6     3.7     3.8
distinct   19.5    11.4    17.3    19.4    13.3    25.4
filter     45.7    13.7    31.8    25.3    29.2    36.1
functions   8.8     6.9    23.5    21.3    25.5    25.9
geo        27.7     7.0    39.6     6.7     9.3     9.2
group       0.0     0.0     0.0     0.0     0.0     0.2
limit       4.6     6.5    11.6    10.5     7.7     7.8
optional   30.6    23.8    47.3    26.3    16.7    17.2
order       2.3     1.3     1.9     3.2     1.2     1.6
union       3.3     3.2    26.4    11.3    12.1    20.6

Table 14: Trends in SPARQL SELECT queries (rounded values in %)

Finally, we analyzed each SPARQL query and counted the use of keywords and constructs like:

– DISTINCT
– FILTER
– FUNCTIONS like CONCAT, CONTAINS, ISIRI
– Use of GEO objects
– GROUP BY
– LIMIT, OFFSET
– OPTIONAL
– ORDER BY
– UNION

For the GEO objects we counted the use of SPARQL PREFIX geo and wgs84 declarations and their usage in property tags. Table 14 shows the use of the various keywords as a percentage of the total SELECT queries made to the sparql endpoint for the sample sets. In general, we observed that queries became more complex over time, indicating an increasing maturity and higher expectations of the user base.
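The keyword counts in Table 14 can be approximated by matching regular expressions against the logged SELECT queries. The following Python sketch illustrates the idea; the exact patterns (e.g. how GEO usage is detected) are simplifying assumptions on our side:

import re
from collections import Counter

# Simplified patterns for the keywords and constructs listed above.
KEYWORDS = {
    "distinct":  r"\bDISTINCT\b",
    "filter":    r"\bFILTER\b",
    "functions": r"\b(CONCAT|CONTAINS|ISIRI)\b",
    "geo":       r"(\bPREFIX\s+geo:|wgs84)",
    "group":     r"\bGROUP\s+BY\b",
    "limit":     r"\b(LIMIT|OFFSET)\b",
    "optional":  r"\bOPTIONAL\b",
    "order":     r"\bORDER\s+BY\b",
    "union":     r"\bUNION\b",
}

def keyword_shares(select_queries):
    # Percentage of SELECT queries that use each keyword at least once.
    counts = Counter()
    for query in select_queries:
        for name, pattern in KEYWORDS.items():
            if re.search(pattern, query, flags=re.IGNORECASE):
                counts[name] += 1
    total = len(select_queries) or 1
    return {name: round(100.0 * counts[name] / total, 1) for name in KEYWORDS}

print(keyword_shares([
    "SELECT DISTINCT ?x WHERE { ?x a ?t } LIMIT 10",
    "SELECT ?x WHERE { ?x ?p ?o . FILTER(?o > 5) } ORDER BY ?x",
]))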

6.5 Statistics for DBpedia Live

Since its official release at the end of June 2011, DBpedia Live has attracted a steadily increasing number of users. Furthermore, more users tend to use the synchronisation tool to synchronise their own DBpedia Live mirrors. This leads to an increasing number of live update requests, i.e. changeset downloads. Figure 14 indicates the number of daily SPARQL and synchronisation requests sent to the DBpedia Live endpoint in the period between August 2012 and January 2013.

Fig. 14: Number of daily requests sent to the DBpedia Live endpoint for a) SPARQL queries and b) synchronisation requests, from August 2012 until January 2013

7 Use Cases and Applications

With DBpedia evolving into a hub on the Web of Data, due to its coverage of various domains, it is an eminent data set for various tasks. It can not only improve Wikipedia search, but also serve as a data source for applications and mashups, as well as for text analysis and annotation tools.

7.1 Natural Language Processing

DBpedia can support many tasks in Natural Language Processing (NLP) [30]. For that purpose, DBpedia includes a number of specialized data sets22. For instance, the lexicalizations data set can be used to estimate the ambiguity of phrases, to help select unambiguous identifiers for ambiguous phrases, or to provide alternative names for entities, just to mention a few examples. Topic signatures can be useful in tasks such as query expansion or document summarization, and have been successfully employed to classify ambiguously described images as good depictions of DBpedia entities [12]. The thematic concepts data set of resources can be used for creating a corpus from Wikipedia to be used as training data for topic classifiers, among other things (see below). The grammatical gender data set can, for example, be used to add a gender feature in co-reference resolution.

22 http://wiki.dbpedia.org/Datasets/NLP
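As a simple illustration of estimating phrase ambiguity, the following Python sketch counts the candidate resources listed on a disambiguation page via the public SPARQL endpoint. The property dbo:wikiPageDisambiguates and the endpoint URL are assumptions about the deployed data sets; the lexicalizations data set itself provides richer, score-based information:

import requests

ENDPOINT = "http://dbpedia.org/sparql"  # public endpoint, assumed to be reachable

def candidate_count(disambiguation_label):
    # Count the resources a disambiguation page points to; a rough proxy for
    # how ambiguous the corresponding surface form is.
    query = """
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX dbo: <http://dbpedia.org/ontology/>
    SELECT (COUNT(DISTINCT ?candidate) AS ?n) WHERE {
      ?dis rdfs:label "%s"@en ;
           dbo:wikiPageDisambiguates ?candidate .
    }
    """ % disambiguation_label
    resp = requests.get(ENDPOINT, params={
        "query": query, "format": "application/sparql-results+json"})
    bindings = resp.json()["results"]["bindings"]
    return int(bindings[0]["n"]["value"]) if bindings else 0

print(candidate_count("Victoria (disambiguation)"))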

7.1.1 Annotation: Entity Disambiguation

An important use case for NLP is annotating texts or other content with semantic information. Named entity recognition and disambiguation – also known as key phrase extraction and entity linking tasks – refers to the task of finding real world entities in text and linking them to unique identifiers. One of the main challenges in this regard is ambiguity: an entity name or surface form may be used in different contexts to refer to different concepts. Many different methods have been developed to resolve this ambiguity with fairly high accuracy [21].

As DBpedia reflects a vast amount of structured real world knowledge obtained from Wikipedia, DBpedia URIs can be used as identifiers for the majority of domains in text annotation. Consequently, interlinking text documents with Linked Data enables the Web of Data to be used as background knowledge within document-oriented applications such as semantic search or faceted browsing (cf. Section 7.3).

Many applications performing this task of annotating text with entities in fact use DBpedia entities as targets. For example, DBpedia Spotlight [31] is an open source tool23 including a free web service that detects mentions of DBpedia resources in text. It uses the lexicalizations in conjunction with the topic signatures data set as a context model in order to be able to disambiguate found mentions. The main advantage of this system is its comprehensiveness and flexibility, allowing one to configure it based on quality measures such as prominence, contextual ambiguity, topical pertinence and disambiguation confidence, as well as the DBpedia ontology. The resources that should be annotated can be specified by a list of resource types or by more complex relationships within the knowledge base described as SPARQL queries.
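A minimal invocation of the Spotlight web service could look as follows; the REST path and parameter names are assumptions based on the public Spotlight documentation rather than a normative part of DBpedia:

import requests

# REST path and parameter names are assumptions based on the public Spotlight
# documentation; footnote 23 gives the project page.
SPOTLIGHT = "http://spotlight.dbpedia.org/rest/annotate"

def annotate(text, confidence=0.5):
    resp = requests.get(SPOTLIGHT,
                        params={"text": text, "confidence": confidence},
                        headers={"Accept": "application/json"})
    resp.raise_for_status()
    # Each detected mention carries its surface form and the DBpedia URI.
    return [(r["@surfaceForm"], r["@URI"])
            for r in resp.json().get("Resources", [])]

print(annotate("Berlin is the capital of Germany."))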

There are numerous other NLP APIs that link entities in text to DBpedia: AlchemyAPI24, the Semantic API from Ontos25, Open Calais26 and Zemanta27, among others. Furthermore, the DBpedia ontology has been used for training named entity recognition systems (without disambiguation) in the context of the Apache Stanbol project28.

23 http://spotlight.dbpedia.org
24 http://www.alchemyapi.com
25 http://www.ontos.com
26 http://www.opencalais.com
27 http://www.zemanta.com

Tag disambiguation. Similar to linking entities in text to DBpedia, user-generated tags attached to multimedia content such as music, photos or videos can also be connected to the Linked Data hub. This has previously been implemented by letting the user resolve ambiguities. For example, Faviki29 suggests a set of DBpedia entities coming from Zemanta's API and lets the user choose the desired one. Alternatively, similar disambiguation techniques as mentioned above can be utilized to choose entities from tags automatically [13]. The BBC30 employs DBpedia URIs for tagging their programmes. Short clips and full episodes are tagged using two different tools, while utilizing DBpedia to benefit from global identifiers that can be easily integrated with other knowledge bases.

7.1.2 Question Answering

DBpedia provides a wealth of human knowledge across different domains and languages, which makes it an excellent target for question answering and keyword search approaches. One of the most prominent efforts in this area is the DeepQA project, which resulted in the IBM Watson system. The Watson system won a $1 million prize in Jeopardy and relies on several data sets including DBpedia31. DBpedia is also the primary target for several QA systems in the Question Answering over Linked Data (QALD) workshop series32. Several QA systems, such as TBSL [41], PowerAqua [28], FREyA [9] and QAKiS [6], have been applied to DBpedia using the QALD benchmark questions. DBpedia is interesting as a test case for such systems: due to its large schema and data size, as well as its topical diversity, it provides significant scientific challenges. In particular, it would be difficult to provide capable QA systems for DBpedia based only on simple patterns or via domain-specific dictionaries, because of its size and broad coverage. Therefore, a question answering system which is able to reliably and correctly answer questions over DBpedia could be seen as a truly intelligent system. In the latest QALD series, question answering benchmarks also exploit national DBpedia chapters for multilingual question answering.

28 http://stanbol.apache.org
29 http://www.faviki.com
30 http://bbc.co.uk
31 http://www.aaai.org/Magazine/Watson/watson.php
32 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald
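To make the setting concrete, the following sketch executes a SPARQL query that a QA system might generate for the question "What is the capital of Germany?". The property dbo:capital and the dbr: naming scheme are assumptions about the English DBpedia edition:

import requests

ENDPOINT = "http://dbpedia.org/sparql"

# A query a QA system might generate for "What is the capital of Germany?".
QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?capital WHERE { dbr:Germany dbo:capital ?capital }
"""

resp = requests.get(ENDPOINT, params={
    "query": QUERY, "format": "application/sparql-results+json"})
for binding in resp.json()["results"]["bindings"]:
    print(binding["capital"]["value"])   # expected: http://dbpedia.org/resource/Berlin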


Similarly, the slot filling task in natural language processing poses the challenge of finding values for a given entity and property by mining text. This can be viewed as question answering with static questions but changing targets. DBpedia can be exploited for fact validation or training data in this task, as was done by the Watson team [4] and others [25].

7.2 Digital Libraries and Archives

In the case of libraries and archives, DBpedia could offer a broad range of information on a broad range of domains. In particular, DBpedia could provide:

– Context information for bibliographic and archive records. Background information such as an author's demographics, a film's homepage or an image could be used to enhance user interaction.

– Stable and curated identifiers for linking. DBpedia is a hub of Linked Open Data. Thus, (re-)using commonly used identifiers could ease integration with other libraries or knowledge bases.

– A basis for a thesaurus for subject indexing. The broad range of Wikipedia topics, in addition to the stable URIs, could form the basis for a global classification system.

Libraries have already invested both in Linked Data and Wikipedia (and transitively in DBpedia) through the realization of the Virtual International Authority File (VIAF) project33. Recently it was announced that VIAF added a total of 250,000 reciprocal authority links to Wikipedia34. These links are already harvested by DBpedia Live and will also be included in the next static DBpedia release. This creates a huge opportunity for libraries that use VIAF to get connected to DBpedia and the LOD cloud in general.

7.3 Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of such tools and structure them in categories.

33 http://viaf.org
34 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html

Facet Based Browsers. An award-winning35 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [14]. Another facet-based browser, which allows users to create complex graph structures of facets in a visually appealing interface and filter them, is gFacet [15]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser36 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying. The DBpedia Query Builder37 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [16] tool provides an intuitive interface which allows users to explore the neighborhood and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [17] allows users to create statistical analysis queries and correlations over RDF data and DBpedia in particular.

Spatial Applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.

7.4 Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources available, written in 171 languages (of which approximately 147 can be considered active38), containing information about hundreds of spoken and even ancient

35 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany/
36 http://dbpedia.org/fct
37 http://querybuilder.dbpedia.org
38 http://s23.org/wikistats/wiktionaries_html.php


languages. For example, the English Wiktionary contains nearly 3 million words39. A Wiktionary page provides, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies and, most prominently, senses. Within this tree, numerous kinds of linguistic properties are given, including synonyms, hyponyms, hyperonyms, example sentences, links to Wikipedia and many more. The fast changing nature, together with the fragmentation of the project into Wiktionary language editions (WLE) with independent layout rules, poses the biggest problem to the automated transformation into a structured knowledge base. We identified this as a serious problem: although the value of Wiktionary is known and usage scenarios are obvious, only some rudimentary tools exist, which either focus on a specific subset of the data or only cover one or two WLEs [18]. Existing tools can be seen as adapters to single WLEs – they are hard to maintain, and there are too many languages that constantly change and require a programmer to refactor complex code. As opposed to the currently employed classic and straightforward approach (implementing software adapters for scraping), Wiktionary2RDF [18] employs a declarative mediator/wrapper pattern. Figure 15 shows the work flow of the specially created Wiktionary extractor that takes as input one config XML file for each WLE. The aim is to enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves. A simple XML dialect was created to encode the "entry layout explained" (ELE) guidelines and declare triple patterns that define how the resulting RDF should be built. An example is given for the following wiki syntax:

===Synonyms===
* [[building]]
* [[company]]

The regular expression for scraping and the triple generation rule, encoded in XML, look as follows:

<wikiTemplate>===Synonyms===
( * [[$target]]
)+
</wikiTemplate>

<triple s="http://some.ns/$entityId"
        p="http://some.ns/hasSynonym"
        o="http://some.ns/$target" />

39 See http://en.wiktionary.org/wiki/semantic for a simple example page

Fig 15 Detail view of the Wiktionary extractor

This configuration is interpreted and run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data40. Statistics are shown in Table 15.

8 Related Work

8.1 Cross Domain Community Knowledge Bases

8.1.1 Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata41. Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering the truth about things, but at providing statements about them. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011, according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided into three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

40 http://wiktionary.dbpedia.org
41 http://wikidata.org


language    words      triples      resources    predicates   senses    XML lines
en          2142237    28593364     11804039     28           424386     930
fr          4657817    35032121     20462349     22           592351     490
ru          1080156    12813437      5994560     17           149859    1449
de           701739     5618508      2966867     16           122362     671

Table 15: Statistical comparison of extractions for different languages. "XML lines" measures the number of lines of the XML configuration files.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface42, as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2 Freebase

Freebase43 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and on serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [32] graph database, which allows metadata to be stored efficiently for each fact. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version.

42 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
43 http://www.freebase.com

This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only be edited indirectly, by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and in areas in which Freebase includes other data sources, the Freebase database provides a higher coverage than DBpedia.

8.1.3 YAGO

One of the projects which pursues similar goals to DBpedia is YAGO44 [39]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet. WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [20], declarative extraction rules were introduced, which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as from other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content. YAGO focuses on extracting a smaller number of relations compared to DBpedia, in order to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia. Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

44 http://www.mpi-inf.mpg.de/yago-naga/yago/

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.

8.2 Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been the target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects45 that are able to process wiki syntax, as well as a list of data extraction extensions46 for MediaWiki.

JWPL (Java Wikipedia Library) [46] is an open-source, Java-based API that allows access to information provided by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. Data is also provided as an XML dump and is incorporated in the lexical resource UBY47 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [34]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g. for word-sense disambiguation or for synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service48. Additional projects presented by the authors that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site49.

An earlier approach to improve the quality of the infobox schemata and contents is described in [44].

45 http://www.mediawiki.org/wiki/Alternative_parsers
46 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction
47 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby/
48 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
49 http://sigwp.org/en

The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text by applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [35] the authors describe their effort of building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike50 and Hudong Baike51. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [42]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

KnowItAll52 is a Web-scale knowledge extraction effort which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [26] is a large common sense knowledge base which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [37] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [43] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [44] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking. [2] was an infobox extraction approach for Wikipedia which later became the DBpedia project.

50 http://baike.baidu.com
51 http://www.hudong.com
52 http://www.cs.washington.edu/research/knowitall

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that DBpedia has matured and improved significantly in the last years, in particular also in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof of concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [18] meanwhile becoming part of DBpedia, and LinkedGeoData [38], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion of different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach a significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors, as well as wrong or outdated information in Wikipedia.
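A first step in this direction can already be prototyped on top of the public chapter endpoints. The following hedged sketch compares the value of one property for the same entity across language editions; the endpoint URL pattern and the use of owl:sameAs links are assumptions that hold for several, but not all, editions:

import requests

# Chapter endpoints; the <lang>.dbpedia.org/sparql pattern is an assumption that
# holds for several of the internationalised editions.
ENDPOINTS = {
    "en": "http://dbpedia.org/sparql",
    "de": "http://de.dbpedia.org/sparql",
}

QUERY = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT ?population WHERE {
  { <http://dbpedia.org/resource/Leipzig> dbo:populationTotal ?population }
  UNION
  { ?local owl:sameAs <http://dbpedia.org/resource/Leipzig> ;
           dbo:populationTotal ?population }
}
"""

def population_per_edition():
    values = {}
    for lang, endpoint in ENDPOINTS.items():
        resp = requests.get(endpoint, params={
            "query": QUERY, "format": "application/sparql-results+json"})
        bindings = resp.json()["results"]["bindings"]
        values[lang] = [b["population"]["value"] for b in bindings]
    return values

# Diverging values across editions hint at extraction errors or outdated infoboxes.
print(population_per_edition())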

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [45], we took a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki53. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other nicely. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed in certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].
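Such a sanity check can be expressed directly in SPARQL. The following sketch runs one of the checks mentioned above against a DBpedia Live endpoint (the endpoint URL is an assumption; dbo:birthDate and dbo:deathDate are the relevant ontology properties):

import requests

LIVE_ENDPOINT = "http://live.dbpedia.org/sparql"  # assumed DBpedia Live endpoint

# Persons whose birth date lies after their death date, which points to a typo
# or vandalism in the underlying Wikipedia article.
CHECK = """
PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?person ?birth ?death WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER (?birth > ?death)
}
LIMIT 100
"""

resp = requests.get(LIVE_ENDPOINT, params={
    "query": CHECK, "format": "application/sparql-results+json"})
for b in resp.json()["results"]["bindings"]:
    print(b["person"]["value"], b["birth"]["value"], b["death"]["value"])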

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP. Despite recent advances (cf. Section 7), there is still a huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and to update (it requires recomputation). With DBpedia and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to what entity a certain surface form should be mapped.
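As an illustration of this idea, the following sketch ranks candidate entities for an ambiguous surface form by how many direct links they share with unambiguous context entities, using simple ASK queries; the scoring scheme is a deliberately naive assumption and not the method of any particular system:

import requests

ENDPOINT = "http://dbpedia.org/sparql"

def connected(candidate, context):
    # ASK whether two resources are directly linked in either direction.
    query = "ASK { { <%s> ?p <%s> } UNION { <%s> ?p <%s> } }" % (
        candidate, context, context, candidate)
    resp = requests.get(ENDPOINT, params={
        "query": query, "format": "application/sparql-results+json"})
    return resp.json().get("boolean", False)

def rank_candidates(candidates, context_entities):
    # Prefer the candidate that shares the most direct links with the context.
    scores = {c: sum(connected(c, ctx) for ctx in context_entities)
              for c in candidates}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Disambiguating the surface form "Victoria" given an Australian context:
print(rank_candidates(
    ["http://dbpedia.org/resource/Victoria_(Australia)",
     "http://dbpedia.org/resource/Queen_Victoria"],
    ["http://dbpedia.org/resource/Melbourne",
     "http://dbpedia.org/resource/Australia"]))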

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Paul Kreis for DBpedia development
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S Auer C Bizer G Kobilarov J Lehmann R Cyganiakand Z Ives DBpedia A nucleus for a web of open data InProceedings of the 6th International Semantic Web Conference(ISWC) volume 4825 of Lecture Notes in Computer Sciencepages 722ndash735 Springer 2008

[2] S Auer and J Lehmann What have Innsbruck and Leipzigin common extracting semantics from wiki content In Pro-ceedings of the ESWC (2007) volume 4519 of Lecture Notesin Computer Science pages 503ndash517 Springer 2007

[3] C Becker and C Bizer Exploring the geospatial semantic webwith DBpedia mobile J Web Sem 7(4)278ndash286 2009

[4] D Bikel V Castelli R Florian and D-J Han Entity link-ing and slot filling through statistical processing and inferencerules In Proceedings TAC Workshop 2009

[5] C Bizer J Lehmann G Kobilarov S Auer C Becker R Cy-ganiak and S Hellmann DBpedia - a crystallization pointfor the web of data Journal of Web Semantics 7(3)154ndash1652009

[6] E Cabrio J Cojan A P Aprosio B Magnini A Lavelli andF Gandon QAKiS an open domain QA system based on rela-tional patterns In ISWC-PD International Semantic Web Con-ference (Posters amp Demos) volume 914 of CEUR WorkshopProceedings CEUR-WSorg 2012

[7] D V Camarda S Mazzini and A Antonuccio Lodlive ex-ploring the web of data In V Presutti and H S Pinto editorsI-SEMANTICS 2012 - 8th International Conference on Seman-tic Systems I-SEMANTICS rsquo12 Graz Austria September 5-72012 pages 197ndash200 ACM 2012

[8] S Campinas T E Perry D Ceccarelli R Delbru and G Tum-marello Introducing rdf graph summary with application toassisted sparql formulation In Database and Expert SystemsApplications (DEXA) 2012 23rd International Workshop onpages 261ndash266 IEEE 2012

[9] D Damljanovic M Agatonovic and H Cunningham FreyaAn interactive way of querying linked data using natural lan-guage In The Semantic Web ESWC 2011 Workshops pages125ndash138 Springer 2012

[10] O Erling and I Mikhailov RDF support in the virtuosoDBMS In S Auer C Bizer C Muller and A V Zhdanovaeditors CSSW volume 113 of LNI pages 59ndash68 GI 2007

[11] O Etzioni M Cafarella D Downey S Kok A-M PopescuT Shaked S Soderland D S Weld and A Yates Web-scaleinformation extraction in knowitall In Proceedings of the 13thinternational conference on World Wide Web pages 100ndash110ACM 2004

[12] A Garcıa-Silva M Jakob P N Mendes and C Bizer Mul-tipedia enriching DBpedia with multimedia information InProceedings of the sixth international conference on Knowl-edge capture K-CAP rsquo11 pages 137ndash144 New York NYUSA 2011 ACM

[13] A Garcıa-Silva M Szomszor H Alani and O Corcho Pre-liminary results in tag disambiguation using DBpedia In 1stInternational Workshop in Collective Knowledage Capturingand Representation (CKCaR) California USA 2009

[14] R Hahn C Bizer C Sahnwaldt C Herta S RobinsonM Burgle H Duwiger and U Scheel Faceted wikipediasearch In W Abramowicz and R Tolksdorf editors BusinessInformation Systems 13th International Conference BIS 2010Berlin Germany May 3-5 2010 Proceedings volume 47 of


Lecture Notes in Business Information Processing pages 1ndash11Springer 2010

[15] P Heim T Ertl and J Ziegler Facet graphs Complex seman-tic querying made easy In Proceedings of the 7th Extended Se-mantic Web Conference (ESWC 2010) volume 6088 of LNCSpages 288ndash302 BerlinHeidelberg 2010 Springer

[16] P Heim S Hellmann J Lehmann S Lohmann and T Stege-mann Relfinder Revealing relationships in RDF knowledgebases In T-S Chua Y Kompatsiaris B Merialdo W HaasG Thallinger and W Bailer editors Semantic Multimedia4th International Conference on Semantic and Digital MediaTechnologies SAMT 2009 Graz Austria December 2-4 2009Proceedings volume 5887 of Lecture Notes in Computer Sci-ence pages 182ndash187 Springer 2009

[17] P Heim S Lohmann D Tsendragchaa and T Ertl Semlensvisual analysis of semantic data with scatter plots and semanticlenses In C Ghidini A-C N Ngomo S N Lindstaedt andT Pellegrini editors Proceedings the 7th International Con-ference on Semantic Systems I-SEMANTICS 2011 Graz Aus-tria September 7-9 2011 ACM International Conference Pro-ceeding Series pages 175ndash178 ACM 2011

[18] S Hellmann J Brekle and S Auer Leveraging the crowd-sourcing of lexical resources for bootstrapping a linguistic datacloud In Proc of the Joint International Semantic TechnologyConference 2012

[19] S Hellmann C Stadler J Lehmann and S Auer DBpedialive extraction In Proc of 8th International Conferenceon Ontologies DataBases and Applications of Semantics(ODBASE) volume 5871 of Lecture Notes in Computer Sci-ence pages 1209ndash1223 2009

[20] J Hoffart F M Suchanek K Berberich and G WeikumYAGO2 A spatially and temporally enhanced knowledge basefrom wikipedia Artif Intell 19428ndash61 2013

[21] H Ji R Grishman H T Dang K Griffitt and J EllisOverview of the TAC 2010 knowledge base population trackIn Third Text Analysis Conference (TAC 2010) 2010

[22] D Kontokostas S Auer S Hellmann J Lehmann P West-phal R Cornelissen and A Zaveri Test-driven data qualityevaluation for sparql endpoints In Submitted to 12th Interna-tional Semantic Web Conference 21-25 October 2013 SydneyAustralia 2013

[23] D Kontokostas C Bratsas S Auer S Hellmann I Antoniouand G Metakides Internationalization of linked data The caseof the greek dbpedia edition Web Semantics Science Servicesand Agents on the World Wide Web 15(0)51 ndash 61 2012

[24] C Lagoze H V de Sompel M Nelson and S WarnerThe open archives initiative protocol for metadata har-vesting httpwwwopenarchivesorgOAIopenarchivesprotocolhtml 2008

[25] J Lehmann S Monahan L Nezda A Jung and Y Shi LCCapproaches to knowledge base population at TAC 2010 InProceedings TAC Workshop 2010

[26] D B Lenat Cyc A large-scale investment in knowledgeinfrastructure Communications of the ACM 38(11)33ndash381995

[27] C-Y Lin and E H Hovy The automated acquisition oftopic signatures for text summarization In Proceedings of the18th conference on Computational linguistics pages 495ndash5012000

[28] V Lopez M Fernandez E Motta and N Stieler PoweraquaSupporting users in querying and exploring the semantic web

Semantic Web 3(3)249ndash265 2012[29] M Martin C Stadler P Frischmuth and J Lehmann Increas-

ing the financial transparency of european commission projectfunding Semantic Web Journal Special Call for LinkedDataset descriptions 2013

[30] P N Mendes M Jakob and C Bizer DBpedia for NLP -a multilingual cross-domain knowledge base In Proceedingsof the International Conference on Language Resources andEvaluation (LREC) Istanbul Turkey 2012

[31] P N Mendes M Jakob A Garcıa-Silva and C BizerDBpedia Spotlight Shedding light on the web of documentsIn Proceedings of the 7th International Conference on Seman-tic Systems (I-Semantics) Graz Austria 2011

[32] S M Meyer J Degener J Giannandrea and B MichenerOptimizing schema-last tuple-store queries in graphd In A KElmagarmid and D Agrawal editors SIGMOD Conferencepages 1047ndash1056 ACM 2010

[33] M Morsey J Lehmann S Auer C Stadler and S Hell-mann DBpedia and the live extraction of structured data fromwikipedia Program electronic library and information sys-tems 4627 2012

[34] K Nakayama M Pei M Erdmann M Ito M ShirakawaT Hara and S Nishio Wikipedia mining Wikipedia as a cor-pus for knowledge extraction In Annual Wikipedia Conference(Wikimania) 2008

[35] X Niu X Sun H Wang S Rong G Qi and Y Yu ZhishimeWeaving chinese linking open data In 10th International Con-ference on The semantic web - Volume Part II pages 205ndash220Berlin Heidelberg 2011 Springer-Verlag

[36] E Oren R Delbru M Catasta R Cyganiak H Stenzhornand G Tummarello Sindicecom a document-oriented lookupindex for open linked data Int J of Metadata and Semanticsand Ontologies 337ndash52 Nov 10 2008

[37] S P Ponzetto and M Strube Wikitaxonomy A large scaleknowledge resource In M Ghallab C D SpyropoulosN Fakotakis and N M Avouris editors ECAI volume 178of Frontiers in Artificial Intelligence and Applications pages751ndash752 IOS Press 2008

[38] C Stadler J Lehmann K Hoffner and S Auer Linkedgeo-data A core for a web of spatial open data Semantic WebJournal 3(4)333ndash354 2012

[39] F M Suchanek G Kasneci and G Weikum Yago a core ofsemantic knowledge In C L Williamson M E Zurko P FPatel-Schneider and P J Shenoy editors WWW pages 697ndash706 ACM 2007

[40] E Tacchini A Schultz and C Bizer Experiments withwikipedia cross-language data fusion In Proceedings of the5th Workshop on Scripting and Development for the SemanticWeb ESWC Citeseer 2009

[41] C Unger L Buhmann J Lehmann A-C Ngonga NgomoD Gerber and P Cimiano Template-based question answer-ing over rdf data In Proceedings of the 21st international con-ference on World Wide Web pages 639ndash648 ACM 2012

[42] Z Wang J Li Z Wang and J Tang Cross-lingual knowledgelinking across wiki knowledge bases In Proceedings of the21st international conference on World Wide Web pages 459ndash468 New York NY USA 2012 ACM

[43] F Wu and D Weld Automatically Refining the WikipediaInfobox Ontology In Proceedings of the 17th World Wide WebConference 2008


[44] F Wu and D S Weld Autonomously semantifying wikipediaIn Proceedings of the 16th Conference on Information andKnowledge Management pages 41ndash50 ACM 2007

[45] A Zaveri D Kontokostas M A Sherif L BuhmannM Morsey S Auer and J Lehmann User-driven quality eval-uation of dbpedia In To appear in Proceedings of 9th Inter-national Conference on Semantic Systems I-SEMANTICS rsquo13Graz Austria September 4-6 2013 ACM 2013

[46] T Zesch C Muller and I Gurevych Extracting lexical seman-tic knowledge from wikipedia and wiktionary In Proceedingsof the 6th International Conference on Language Resourcesand Evaluation Marrakech Morocco May 2008 electronicproceedings

Appendix

A Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net/>
PREFIX an: <http://vocab.sindice.net/analytics>
PREFIX dom: <http://sindice.com/dataspace/default/domain/>

SELECT ?Dataset ?Target_Dataset ?predicate
       SUM(xsd:long(?cardinality)) AS ?total
FROM <http://sindice.com/analytics> {
  ?target any23:domain_uri ?Target_Dataset .

  ?edge an:source ?source ;
        an:target ?target ;
        an:label ?predicate ;
        an:cardinality ?cardinality ;
        an:publishedIn ?Dataset .

  ?source any23:domain_uri ?Dataset .

  FILTER (?Dataset != ?Target_Dataset)
}

Listing 1: SPARQL query on the Sindice summary graph for obtaining the number of incoming links

Page 20: Semantic Web 1 (2012) 1–5 IOS Press DBpedia - A Large ... · Semantic Web 1 (2012) 1–5 1 IOS Press DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia

20 Lehmann et al DBpedia

Statement 33 34 35 36 37 38

ask 56 269 360 326 159 677construct 55 31 14 11 22 38describe 11 8 4 7 62 111select 1891 3663 1658 8030 11204 13516unknown 78 206 229 738 4455 1134total 2090 4177 2266 9112 15902 15475

Table 13Hits per statement type in thousands

Statement 33 34 35 36 37 38

distinct 195 114 173 194 133 254filter 457 137 318 253 292 361functions 88 69 235 213 255 259geo 277 70 396 67 93 92group 00 00 00 00 00 02limit 46 65 116 105 77 78optional 306 238 473 263 167 172order 23 13 19 32 12 16union 33 32 264 113 121 206

Table 14Trends in SPARQL select (rounded values in )

Finally we analyzed each SPARQL query andcounted the use of keywords and constructs like

ndash DISTINCTndash FILTERndash FUNCTIONS like CONCAT CONTAINS ISIRIndash Use of GEO objectsndash GROUP BYndash LIMIT OFFSETndash OPTIONALndash ORDER BYndash UNION

For the GEO objects we counted the use of SPARQLPREFIX geo and wgs84 declarations and usage inproperty tags Table 14 shows the use of various key-words as a percentage of the total select queries madeto the sparql endpoint for the sample sets In generalwe observed that queries became more complex overtime indicating an increasing maturity and higher ex-pectations of the user base

65 Statistics for DBpedia Live

Since its official release at the end of June 2011DBpedia Live attracted a steadily increasing numberof users Furthermore more users tend to use the syn-chronisation tool to synchronise their own DBpediaLive mirrors This leads to an increasing number of

live update requests ie changeset downloads Fig-ure 14 indicates the number of daily SPARQL and syn-chronisation requests sent to DBpedia Live endpoint inthe period between August 2012 and January 2013

Fig 14 Number of daily requests sent to the DBpedia Live for a)SPARQL queries and b) syncronization requests from August 2012until January 2013

7 Use Cases and Applications

With DBpedia evolving into a hub on the Web ofData due to its coverage of various domains it is aneminent data set for various tasks It can not only im-prove Wikipedia search but also serve as a data sourcefor applications and mashups as well as for text anal-ysis and annotation tools

71 Natural Language Processing

DBpedia can support many tasks in Natural Lan-guage Processing (NLP) [30] For that purpose DBpediaincludes a number of specialized data sets22 For in-stance the lexicalizations data set can be used to es-timate the ambiguity of phrases to help select un-ambiguous identifiers for ambiguous phrases or toprovide alternative names for entities just to men-tion a few examples Topic signatures can be useful intasks such as query expansion or document summa-rization and has been successfully employed to clas-sify ambiguously described images as good depictionsof DBpedia entities [12] The thematic concepts dataset of resources can be used for creating a corpus from

22httpwikidbpediaorgDatasetsNLP

Lehmann et al DBpedia 21

Wikipedia to be used as training data for topic classi-fiers among other things (see below) The grammati-cal gender data set can for example be used to add agender feature in co-reference resolution

711 Annotation Entity DisambiguationAn important use case for NLP is annotating texts or

other content with semantic information Named entityrecognition and disambiguation ndash also known as keyphrase extraction and entity linking tasks ndash refers tothe task of finding real world entities in text and link-ing them to unique identifiers One of the main chal-lenges in this regard is ambiguity an entity name orsurface form may be used in different contexts to re-fer to different concepts Many different methods havebeen developed to resolve this ambiguity with fairlyhigh accuracy [21]

As DBpedia reflects a vast amount of structured realworld knowledge obtained from Wikipedia DBpediaURIs can be used as identifiers for the majority ofdomains in text annotation Consequently interlink-ing text documents with Linked Data enables theWeb of Data to be used as background knowledgewithin document-oriented applications such as seman-tic search or faceted browsing (cf Section 73)

Many applications performing this task of annotat-ing text with entities in fact use DBpedia entities as tar-gets For example DBpedia Spotlight [31] is an opensource tool23 including a free web service that de-tects mentions of DBpedia resources in text It usesthe lexicalizations in conjunction with the topic signa-tures data set as context model in order to be able todisambiguate found mentions The main advantage ofthis system is its comprehensiveness and flexibility al-lowing one to configure it based on quality measuressuch as prominence contextual ambiguity topical per-tinence and disambiguation confidence as well as theDBpedia ontology The resources that should be anno-tated can be specified by a list of resource types or bymore complex relationships within the knowledge basedescribed as SPARQL queries

There are numerous other NLP APIs that link enti-ties in text to DBpedia AlchemyAPI24 Semantic APIfrom Ontos25 Open Calais26 and Zemanta27 amongothers Furthermore the DBpedia ontology has beenused for training named entity recognition systems

23httpspotlightdbpediaorg24httpwwwalchemyapicom25httpwwwontoscom26httpwwwopencalaiscom27httpwwwzemantacom

(without disambiguation) in the context of the ApacheStanbol project28

Tag disambiguation Similar to linking entities in textto DBpedia user-generated tags attached to multime-dia content such as music photos or videos can also beconnected to the Linked Data hub This has previouslybeen implemented by letting the user resolve ambigu-ities For example Faviki29 suggests a set of DBpediaentities coming from Zemantarsquos API and lets the userchoose the desired one Alternatively similar disam-biguation techniques as mentioned above can be uti-lized to choose entities from tags automatically [13]The BBC30 employs DBpedia URIs for tagging theirprogrammes Short clips and full episodes are taggedusing two different tools while utilizing DBpedia tobenefit from global identifiers that can be easily inte-grated with other knowledge bases

712 Question AnsweringDBpedia provides a wealth of human knowledge

across different domains and languages which makesit an excellent target for question answering and key-word search approaches One of the most prominentefforts in this area is the DeepQA project which re-sulted in the IBM Watson system The Watson systemwon a $1 million prize in Jeopardy and relies on sev-eral data sets including DBpedia31 DBpedia is also theprimary target for several QA systems in the QuestionAnswering over Linked Data (QALD) workshop se-ries32 Several QA systems such as TBSL [41] Pow-erAqua [28] FREyA [9] and QAKiS [6] have been ap-plied to DBpedia using the QALD benchmark ques-tions DBpedia is interesting as a test case for such sys-tems Due to its large schema and data size as wellas its topic adversity it provides significant scientificchallenges In particular it would be difficult to pro-vide capable QA systems for DBpedia based only onsimple patterns or via domain specific dictionaries be-cause of its size and broad coverage Therefore a ques-tion answering system which is able to reliable an-swer questions over DBpedia correctly could be seenas a truly intelligent system In the latest QALD seriesquestion answering benchmarks also exploit nationalDBpedia chapters for multilingual question answering

28httpstanbolapacheorg29httpwwwfavikicom30httpbbccouk31httpwwwaaaiorgMagazineWatson

watsonphp32httpgreententacletechfakuni-

bielefeldde˜cungerqald

22 Lehmann et al DBpedia

Similarly the slot filling task in natural languageprocessing poses the challenge of finding values for agiven entity and property from mining text This canbe viewed as question answering with static questionsbut changing targets DBpedia can be exploited for factvalidation or training data in this task as was done bythe Watson team [4] and others [25]

72 Digital Libraries and Archives

In the case of libraries and archives DBpedia couldoffer a broad range of information on a broad range ofdomains In particular DBpedia could provide

ndash Context information for bibliographic and archiverecords Background information such as an au-thorrsquos demographics a filmrsquos homepage or an im-age could be used to enhance user interaction

ndash Stable and curated identifiers for linking DBpediais a hub of Linked Open Data Thus (re-)usingcommonly used identifiers could ease integrationwith other libraries or knowledge bases

ndash A basis for a thesaurus for subject indexing Thebroad range of Wikipedia topics in addition to thestable URIs could form the basis for a global clas-sification system

Libraries have already invested both in Linked Dataand Wikipedia (and transitively to DBpedia) thoughthe realization of the Virtual International AuthorityFiles (VIAF) project33 Recently it was announcedthat VIAF added a total of 250000 reciprocal author-ity links to Wikipedia34 These links are already har-vested by DBpedia Live and will also be included inthe next static DBpedia release This creates a huge op-portunity for libraries that use VIAF to get connectedto DBpedia and the LOD cloud in general

7.3. Knowledge Exploration

Since DBpedia spans many domains and has a diverse schema, many knowledge exploration tools either used DBpedia as a testbed or were specifically built for DBpedia. We give a brief overview of such tools and structure them in categories.

33 http://viaf.org
34 Accessed on 12/02/2013: http://www.oclc.org/research/news/2012/12-07a.html

Facet Based Browsers. An award-winning35 facet-based browser used the Neofonie search engine to combine facts in DBpedia with full-text from Wikipedia in order to compute hierarchical facets [14]. Another facet-based browser, which allows users to create complex graph structures of facets in a visually appealing interface and to filter them, is gFacet [15]. A generic SPARQL-based facet explorer, which also uses a graph-based visualisation of facets, is LODLive [7]. The OpenLink built-in facet-based browser36 is an interface which enables developers to explore DBpedia, compute aggregations over facets and view the underlying SPARQL queries.

Search and Querying. The DBpedia Query Builder37 allows developers to easily create simple SPARQL queries, more specifically sets of triple patterns, via intelligent autocompletion. The autocompletion functionality ensures that only URIs which lead to solutions are suggested to the user. The RelFinder [16] tool provides an intuitive interface which allows users to explore the neighborhood of and connections between resources specified by the user. For instance, the user can view the shortest paths connecting certain persons in DBpedia. SemLens [17] allows users to create statistical analysis queries and correlations in RDF data, and in DBpedia in particular.
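A RelFinder-like exploration can be approximated with plain SPARQL; the query below (with resources chosen purely for illustration) lists connections of length two between two persons:

PREFIX dbr: <http://dbpedia.org/resource/>
SELECT DISTINCT ?p1 ?mid ?p2 WHERE {
  dbr:Albert_Einstein ?p1 ?mid .
  ?mid ?p2 dbr:Max_Planck .
}
LIMIT 50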

Spatial Applications. DBpedia Mobile [3] is a location-aware client which renders a map of nearby locations from DBpedia, provides icons for schema classes and supports more than 30 languages from various DBpedia language editions. It can follow RDF links to other data sets linked from DBpedia and supports powerful SPARQL filters to restrict the viewed data.
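In its simplest form, such a location query can be expressed as a bounding-box filter over the W3C geo coordinates stored in DBpedia; the coordinates below (roughly central Berlin) are illustrative:

PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?place ?lat ?long WHERE {
  ?place geo:lat ?lat ;
         geo:long ?long .
  FILTER (?lat > 52.45 && ?lat < 52.55 &&
          ?long > 13.35 && ?long < 13.45)
}
LIMIT 100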

7.4. Applications of the Extraction Framework: Wiktionary Extraction

Wiktionary is one of the biggest collaboratively created lexical-semantic and linguistic resources available, written in 171 languages (of which approximately 147 can be considered active38), containing information about hundreds of spoken and even ancient languages.

35 http://blog.dbpedia.org/2009/11/20/german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany
36 http://dbpedia.org/fct
37 http://querybuilder.dbpedia.org
38 http://s23.org/wikistats/wiktionaries_html.php


For example, the English Wiktionary contains nearly 3 million words.39 A Wiktionary page provides, for a lexical word, a hierarchical disambiguation to its language, part of speech, sometimes etymologies and, most prominently, senses. Within this tree, numerous kinds of linguistic properties are given, including synonyms, hyponyms, hyperonyms, example sentences, links to Wikipedia and many more. The fast changing nature of the project, together with its fragmentation into Wiktionary language editions (WLE) with independent layout rules, poses the biggest problem for the automated transformation into a structured knowledge base. We identified this as a serious problem: although the value of Wiktionary is known and usage scenarios are obvious, only some rudimentary tools exist which either focus on a specific subset of the data or cover only one or two WLEs [18]. Existing tools can be seen as adapters to single WLEs; they are hard to maintain, and there are too many languages that constantly change and require a programmer to refactor complex code. Opposed to the currently employed classic and straightforward approach (implementing software adapters for scraping), Wiktionary2RDF [18] employs a declarative mediator/wrapper pattern. Figure 15 shows the work flow of the specially created Wiktionary extractor, which takes as input one config XML file for each WLE. The aim is to enable non-programmers (the community of adopters and domain experts) to tailor and maintain the WLE wrappers themselves. A simple XML dialect was created to encode the "entry layout explained" (ELE) guidelines and declare triple patterns that define how the resulting RDF should be built. An example is given for the following wiki syntax:

===Synonyms===
[[building]]
[[company]]

The regular expression for scraping and the triple generation rule, encoded in XML, look as follows:

<wikiTemplate>===Synonyms===
( [[$target]]
)+
</wikiTemplate>

<triple s="http://some.ns/$entityId"
        p="http://some.ns/hasSynonym"
        o="http://some.ns/$target" />

39 See http://en.wiktionary.org/wiki/semantic for a simple example page.

Fig. 15. Detail view of the Wiktionary extractor.

This configuration is interpreted and run against Wiktionary dumps. The resulting data set is open in every aspect and hosted as Linked Data.40 Statistics are shown in Table 15.
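Once such a data set is published, the extracted relations can be retrieved with simple queries; the following sketch re-uses the placeholder namespace of the rule above, and the sense resource name is invented for illustration:

PREFIX ns: <http://some.ns/>
# All synonyms extracted for a (hypothetical) sense resource.
SELECT ?synonym WHERE {
  ns:house-English-Noun-1 ns:hasSynonym ?synonym .
}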

8. Related Work

8.1. Cross-Domain Community Knowledge Bases

8.1.1. Wikidata

In March 2012, Wikimedia Germany e.V. started the development of Wikidata.41 Wikidata is a free knowledge base about the world that can be read and edited by humans and machines alike. It provides data in all languages of the Wikimedia projects and allows for central access to the data, in a similar vein as Wikimedia Commons does for multimedia files. Things described in the Wikidata knowledge base are called items and can have labels, descriptions and aliases in all languages. Wikidata does not aim at offering the truth about things, but at providing statements about them. Rather than stating that Berlin has a population of 3.5 million, Wikidata contains the statement about Berlin's population being 3.5 million as of 2011 according to the German statistical office. Thus, Wikidata can offer a variety of statements from different sources and dates. As there are potentially many different statements for a given item and property, ranks can be added to statements to define their status (preferred, normal or deprecated). The initial development was divided in three phases:

– The first phase (interwiki links) created an entity base for the Wikimedia projects. This provides a better alternative to the previous interlanguage link system.

– The second phase (infoboxes) gathered infobox-related data for a subset of the entities, with the explicit goal of augmenting the infoboxes that are currently widely used with data from Wikidata.

– The third phase (lists) will expand the set of properties beyond those related to infoboxes and will provide ways of exploiting this data within and outside the Wikimedia projects.

At the time of writing of this article, the development of the third phase is ongoing.

40 http://wiktionary.dbpedia.org
41 wikidata.org

language   words      triples     resources    predicates  senses   XML lines
en         2142237    28593364    11804039     28          424386   930
fr         4657817    35032121    20462349     22          592351   490
ru         1080156    12813437     5994560     17          149859   1449
de          701739     5618508     2966867     16          122362   671

Table 15: Statistical comparison of extractions for different languages. "XML lines" measures the number of lines of the XML configuration files.

Wikidata already contains 11.95 million items and 348 properties that can be used to describe them. Since March 2013, the Wikidata extension is live on all Wikipedia language editions, and thus their pages can be linked to items in Wikidata and include data from Wikidata.

Wikidata also offers a Linked Data interface42 as well as regular RDF dumps of all its data. The planned collaboration with Wikidata is outlined in Section 9.

8.1.2. Freebase

Freebase43 is a graph database which also extracts structured data from Wikipedia and makes it available in RDF. Both DBpedia and Freebase link to each other and provide identifiers based on those for Wikipedia articles. They both provide dumps of the extracted data, as well as APIs or endpoints to access the data, and allow their communities to influence the schema of the data. There are, however, also major differences between both projects. DBpedia focuses on being an RDF representation of Wikipedia and serving as a hub on the Web of Data, whereas Freebase uses several sources to provide broad coverage. The store behind Freebase is the GraphD [32] graph database, which allows it to efficiently store metadata for each fact. This graph store is append-only: deleted triples are marked, and the system can easily revert to a previous version.

42 http://meta.wikimedia.org/wiki/Wikidata/Development/LinkedDataInterface
43 http://www.freebase.com

This is necessary since Freebase data can be directly edited by users, whereas information in DBpedia can only be edited indirectly, by modifying the content of Wikipedia or the Mappings Wiki. From an organisational point of view, Freebase is mainly run by Google, whereas DBpedia is an open community project. In particular in focus areas of Google, and in areas in which Freebase includes other data sources, the Freebase database provides higher coverage than DBpedia.

8.1.3. YAGO

One of the projects which pursues similar goals to DBpedia is YAGO44 [39]. YAGO is identical to DBpedia in that each article in Wikipedia becomes an entity in YAGO. Based on this, it uses the leaf categories in the Wikipedia category graph to infer type information about an entity. One of its key features is to link this type information to WordNet: WordNet synsets are represented as classes, and the extracted types of entities may become subclasses of such a synset. In the YAGO2 system [20], declarative extraction rules were introduced which can extract facts from different parts of Wikipedia articles, e.g. infoboxes and categories, as well as from other sources. YAGO2 also supports spatial and temporal dimensions for facts at the core of its system.

One of the main differences between DBpedia and YAGO in general is that DBpedia tries to stay very close to Wikipedia and provide an RDF version of its content, whereas YAGO focuses on extracting a smaller number of relations compared to DBpedia in order to achieve very high precision and consistent knowledge. The two knowledge bases offer different type systems: whereas the DBpedia ontology is manually maintained, YAGO is backed by WordNet and Wikipedia leaf categories. Due to this, YAGO contains many more classes than DBpedia.

44 http://www.mpi-inf.mpg.de/yago-naga/yago


Another difference is that the integration of attributes and objects in infoboxes is done via mappings in DBpedia, and therefore by the DBpedia community itself, whereas this task is facilitated by expert-designed declarative rules in YAGO2.

The two knowledge bases are connected, e.g. DBpedia offers the YAGO type hierarchy as an alternative to the DBpedia ontology, and sameAs links are provided in both directions. While the underlying systems are very different, both projects share similar aims and positively complement and influence each other.
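For example, the YAGO classes assigned to a DBpedia resource can be listed directly from the DBpedia endpoint, since they are published in a dedicated namespace; the resource below is illustrative:

PREFIX dbr: <http://dbpedia.org/resource/>
SELECT ?yagoType WHERE {
  dbr:Berlin a ?yagoType .
  FILTER (STRSTARTS(STR(?yagoType), "http://dbpedia.org/class/yago/"))
}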

8.2. Knowledge Extraction from Wikipedia

Since its official start in 2001, Wikipedia has always been a target of automatic extraction of information, due to its easy availability, open license and encyclopedic knowledge. A large number of parsers, scraper projects and publications exist. In this section, we restrict ourselves to approaches that are either notable, recent or pertinent to DBpedia. MediaWiki.org maintains an up-to-date list of software projects45 which are able to process wiki syntax, as well as a list of data extraction extensions46 for MediaWiki.

JWPL (Java Wikipedia Library) [46] is an open-source Java-based API that provides access to information exposed by the Wikipedia API (redirects, categories, articles and link structure). JWPL contains a MediaWiki markup parser that can be used to further analyze the contents of a Wikipedia page. The data is also provided as an XML dump and is incorporated in the lexical resource UBY47 for language tools.

Several different approaches to extract knowledge from Wikipedia are presented in [34]. Given features like anchor texts, interlanguage links, category links and redirect pages are utilized, e.g., for word-sense disambiguation or for synonyms, translations, taxonomic relations and abbreviation or hypernym resolution, respectively. Apart from this, link structures are used to build the Wikipedia Thesaurus Web service48. Additional projects presented by the authors that exploit the mentioned features are listed on the Special Interest Group on Wikipedia Mining (SIGWP) Web site49.

An earlier approach to improve the quality of the infobox schemata and contents is described in [44].

45 http://www.mediawiki.org/wiki/Alternative_parsers
46 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction
47 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby
48 http://sigwp.org/en/index.php/Wikipedia_Thesaurus
49 http://sigwp.org/en

The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created by applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text by applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [35], the authors describe their effort of building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike50 and Hudong Baike51. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve better cross-lingual knowledge linkage, beyond the use of Wikipedia interlanguage links, is presented in [42]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

KnowItAll52 is a Web-scale knowledge extraction effort which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [26] is a large common sense knowledge base which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [37] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [43] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [44] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking.

50 http://baike.baidu.com
51 http://www.hudong.com
52 http://www.cs.washington.edu/research/knowitall


[2] was an infobox extraction approach for Wikipedia which later became the DBpedia project.

9. Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that DBpedia has matured and improved significantly in the last years, in particular also in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof of concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [18] meanwhile becoming part of DBpedia, and LinkedGeoData [38], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project.

Multilingual data integration and fusion. An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions comprise a better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach a significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors as well as wrong or outdated information in Wikipedia.
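Such a comparison can already be prototyped with a federated query. The sketch below contrasts the population value of Berlin in the English and German editions; it assumes that the German chapter endpoint at http://de.dbpedia.org/sparql is available, that both editions map the value to dbo:populationTotal, and that the queried endpoint supports SPARQL 1.1 federation:

PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?popEn ?popDe WHERE {
  <http://dbpedia.org/resource/Berlin> dbo:populationTotal ?popEn .
  SERVICE <http://de.dbpedia.org/sparql> {
    <http://de.dbpedia.org/resource/Berlin> dbo:populationTotal ?popDe .
  }
}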

Community-driven data quality improvement. In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [45], we took a first step in this direction.

Inline extraction. Currently, DBpedia extracts information primarily from templates. In the future, we envision to also extract semantic information from typed links. Typed links are a feature of Semantic MediaWiki which was backported and implemented as a very lightweight extension for MediaWiki.53 If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia. While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia are Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other nicely. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].
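A minimal sketch of such a sanity check, using only properties from the DBpedia ontology, could look as follows:

PREFIX dbo: <http://dbpedia.org/ontology/>
# Persons whose recorded death date precedes their birth date.
SELECT ?person ?birth ?death WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER (?death < ?birth)
}

Any resource returned by such a query points either to an extraction error or to a mistake in the underlying Wikipedia article.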

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP. Despite recent advances (cf. Section 7), there is still huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and to update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to which entity a certain surface form should be mapped.
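A minimal sketch of this idea: given the ambiguous surface form "Washington" and an unambiguous context entity already identified in the same text, candidates that are directly related to the context entity in DBpedia can be preferred; all resources below are chosen purely for illustration:

SELECT DISTINCT ?candidate WHERE {
  VALUES ?candidate { <http://dbpedia.org/resource/George_Washington>
                      <http://dbpedia.org/resource/Washington,_D.C.>
                      <http://dbpedia.org/resource/Washington_(state)> }
  { ?candidate ?p <http://dbpedia.org/resource/Potomac_River> }
  UNION
  { <http://dbpedia.org/resource/Potomac_River> ?p ?candidate }
}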

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Paul Kreis for DBpedia development
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– people who helped contributing data for certain parts of the article:
  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.
[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.
[3] C. Becker and C. Bizer. Exploring the geospatial semantic web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.
[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.
[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.
[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. LodLive, exploring the web of data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.
[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.
[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.
[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.
[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.
[12] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.
[13] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.
[14] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.
[15] P. Heim, T. Ertl, and J. Ziegler. Facet graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.
[16] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.
[17] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.
[18] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In Proc. of the Joint International Semantic Technology Conference, 2012.
[19] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.
[20] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.
[21] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.
[22] D. Kontokostas, S. Auer, S. Hellmann, J. Lehmann, P. Westphal, R. Cornelissen, and A. Zaveri. Test-driven data quality evaluation for SPARQL endpoints. In Submitted to 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.
[23] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of linked data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.
[24] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The open archives initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.
[25] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.
[26] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.
[27] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational Linguistics, pages 495–501, 2000.
[28] V. Lopez, M. Fernandez, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the semantic web. Semantic Web, 3(3):249–265, 2012.
[29] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.
[30] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.
[31] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.
[32] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.
[33] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.
[34] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.
[35] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin, Heidelberg, 2011. Springer-Verlag.
[36] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, Nov. 10, 2008.
[37] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.
[38] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.
[39] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.
[40] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.
[41] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648. ACM, 2012.
[42] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.
[43] F. Wu and D. Weld. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th World Wide Web Conference, 2008.
[44] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.
[45] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.
[46] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

Appendix

A. Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net/>
PREFIX an: <http://vocab.sindice.net/analytics#>
PREFIX dom: <http://sindice.com/dataspace/default/domain/>
SELECT ?Dataset ?Target_Dataset ?predicate
       (SUM(xsd:long(?cardinality)) AS ?total)
FROM <http://sindice.com/analytics>
WHERE {
  ?target any23:domain_uri ?Target_Dataset .

  ?edge an:source ?source ;
        an:target ?target ;
        an:label ?predicate ;
        an:cardinality ?cardinality ;
        an:publishedIn ?Dataset .

  ?source any23:domain_uri ?Dataset .

  FILTER (?Dataset != ?Target_Dataset)
}

Listing 1: SPARQL query on the Sindice summary graph for obtaining the number of incoming links.

Tag disambiguation Similar to linking entities in textto DBpedia user-generated tags attached to multime-dia content such as music photos or videos can also beconnected to the Linked Data hub This has previouslybeen implemented by letting the user resolve ambigu-ities For example Faviki29 suggests a set of DBpediaentities coming from Zemantarsquos API and lets the userchoose the desired one Alternatively similar disam-biguation techniques as mentioned above can be uti-lized to choose entities from tags automatically [13]The BBC30 employs DBpedia URIs for tagging theirprogrammes Short clips and full episodes are taggedusing two different tools while utilizing DBpedia tobenefit from global identifiers that can be easily inte-grated with other knowledge bases

712 Question AnsweringDBpedia provides a wealth of human knowledge

across different domains and languages which makesit an excellent target for question answering and key-word search approaches One of the most prominentefforts in this area is the DeepQA project which re-sulted in the IBM Watson system The Watson systemwon a $1 million prize in Jeopardy and relies on sev-eral data sets including DBpedia31 DBpedia is also theprimary target for several QA systems in the QuestionAnswering over Linked Data (QALD) workshop se-ries32 Several QA systems such as TBSL [41] Pow-erAqua [28] FREyA [9] and QAKiS [6] have been ap-plied to DBpedia using the QALD benchmark ques-tions DBpedia is interesting as a test case for such sys-tems Due to its large schema and data size as wellas its topic adversity it provides significant scientificchallenges In particular it would be difficult to pro-vide capable QA systems for DBpedia based only onsimple patterns or via domain specific dictionaries be-cause of its size and broad coverage Therefore a ques-tion answering system which is able to reliable an-swer questions over DBpedia correctly could be seenas a truly intelligent system In the latest QALD seriesquestion answering benchmarks also exploit nationalDBpedia chapters for multilingual question answering

28httpstanbolapacheorg29httpwwwfavikicom30httpbbccouk31httpwwwaaaiorgMagazineWatson

watsonphp32httpgreententacletechfakuni-

bielefeldde˜cungerqald

22 Lehmann et al DBpedia

Similarly the slot filling task in natural languageprocessing poses the challenge of finding values for agiven entity and property from mining text This canbe viewed as question answering with static questionsbut changing targets DBpedia can be exploited for factvalidation or training data in this task as was done bythe Watson team [4] and others [25]

72 Digital Libraries and Archives

In the case of libraries and archives DBpedia couldoffer a broad range of information on a broad range ofdomains In particular DBpedia could provide

ndash Context information for bibliographic and archiverecords Background information such as an au-thorrsquos demographics a filmrsquos homepage or an im-age could be used to enhance user interaction

ndash Stable and curated identifiers for linking DBpediais a hub of Linked Open Data Thus (re-)usingcommonly used identifiers could ease integrationwith other libraries or knowledge bases

ndash A basis for a thesaurus for subject indexing Thebroad range of Wikipedia topics in addition to thestable URIs could form the basis for a global clas-sification system

Libraries have already invested both in Linked Dataand Wikipedia (and transitively to DBpedia) thoughthe realization of the Virtual International AuthorityFiles (VIAF) project33 Recently it was announcedthat VIAF added a total of 250000 reciprocal author-ity links to Wikipedia34 These links are already har-vested by DBpedia Live and will also be included inthe next static DBpedia release This creates a huge op-portunity for libraries that use VIAF to get connectedto DBpedia and the LOD cloud in general

73 Knowledge Exploration

Since DBpedia spans many domains and has a di-verse schema many knowledge exploration tools ei-ther used DBpedia as a testbed or were specificallybuilt for DBpedia We give a brief overview of toolsand structure them in categories

33httpviaforg34Accessed on 12022013 httpwwwoclcorg

researchnews201212-07ahtml

Facet Based Browsers An award-winning35 facet-based browser used the Neofonie search engine tocombine facts in DBpedia with full-text from Wikipediain order to compute hierarchical facets [14] Anotherfacet based browser which allows to create complexgraph structures of facets in a visually appealing inter-face and filter them is gFacet [15] A generic SPARQLbased facet explorer which also uses a graph based vi-sualisation of facets is LODLive [7] The OpenLinkbuilt-in facet based browser36 is an interface which en-ables developers to explore DBpedia compute aggre-gations over facets and view the underlying SPARQLqueries

Search and Querying The DBpedia Query Builder37

allows developers to easily create simple SPARQLqueries more specifically sets of triple patterns viaintelligent autocompletion The autocompletion func-tionality ensures that only URIs which lead to solu-tions are suggested to the user The RelFinder [16]tool provides an intuitive interface which allows to ex-plore the neighborhood and connections between re-sources specified by the user For instance the usercan view the shortest paths connecting certain personsin DBpedia SemLens [17] allows to create statisti-cal analysis queries and correlations in RDF data andDBpedia in particular

Spatial Applications DBpedia Mobile [3] is a loca-tion aware client which renders a map of nearby loca-tions from DBpedia provides icons for schema classesand supports more than 30 languages from variousDBpedia language editions It can follow RDF linksto other data sets linked from DBpedia and supportspowerful SPARQL filters to restrict the viewed data

74 Applications of the Extraction FrameworkWiktionary Extraction

Wiktionary is one of the biggest collaboratively cre-ated lexical-semantic and linguistic resources avail-able written in 171 languages (of which approxi-mately 147 can be considered active38) containing in-formation about hundreds of spoken and even ancient

35httpblogdbpediaorg20091120german-government-proclaims-faceted-wikipedia-search-one-of-the-365-best-ideas-in-germany

36httpdbpediaorgfct37httpquerybuilderdbpediaorg38https23orgwikistatswiktionaries_

htmlphp

Lehmann et al DBpedia 23

languages For example the English Wiktionary con-tains nearly 3 million words39 A Wiktionary page pro-vides for a lexical word a hierarchical disambiguationto its language part of speech sometimes etymolo-gies and most prominently senses Within this tree nu-merous kinds of linguistic properties are given includ-ing synonyms hyponyms hyperonyms example sen-tences links to Wikipedia and many more The fastchanging nature together with the fragmentation of theproject into Wiktionary language editions (WLE) withindependent layout rules poses the biggest problem tothe automated transformation into a structured knowl-edge base We identified this as a serious problem Al-though the value of Wiktionary is known and usagescenarios are obvious only some rudimentary toolsexist which either focus on a specific subset of thedata or they only cover one or two WLEs [18] Ex-isting tools can be seen as adapters to single WLEmdash they are hard to maintain and there are too manylanguages that constantly change and require a pro-grammer to refactor complex code Opposed to thecurrently employed classic and straight-forward ap-proach (implementing software adapters for scraping)Wiktionary2RDF [18] employs a declarative media-torwrapper pattern Figure 15 shows the work flow ofthe specially created Wiktionary extractor that takes asinput one config XML file for each WLE The aim is toenable non-programmers (the community of adoptersand domain experts) to tailor and maintain the WLEwrappers themselves A simple XML dialect was cre-ated to encode the ldquoentry layout explainedrdquo (ELE)guidelines and declare triple patterns that define howthe resulting RDF should be built An example is givenfor the following wiki syntax

1 ===Synonyms===2 [[building]]3 [[company]]

Regular expression for scraping and the triple genera-tion rule encoded in XML look like

1 ltwikiTemplategt===Synonyms===2 ( [[$target]]3 )+4 ltwikiTemplategt5 6 lttriple s=httpsomens$entityId p=httpsomens

hasSynonym o=httpsomens$target gt

39See httpenwiktionaryorgwikisemanticfor a simple example page

Fig 15 Detail view of the Wiktionary extractor

This configuration is interpreted and run against Wik-tionary dumps The resulting data set is open in ev-ery aspect and hosted as Linked Data40 Statistics areshown in Table 15

8 Related Work

81 Cross Domain Community Knowledge Bases

811 WikidataIn March 2012 the Wikimedia Germany eV started

the development of Wikidata41 Wikidata is a freeknowledge base about the world that can be read andedited by humans and machines alike It provides datain all languages of the Wikimedia projects and allowsfor central access to the data in a similar vein as Wiki-media Commons does for multimedia files Thingsdescribed in the Wikidata knowledge base are calleditems and can have labels descriptions and aliasesin all languages Wikidata does not aim at offeringthe truth about things but providing statements aboutthem Rather than stating that Berlin has a populationof 35 million Wikidata contains the statement aboutBerlinrsquos population being 35 million as of 2011 ac-cording to the German statistical office Thus Wiki-data can offer a variety of statements from differentsources and dates As there are potentially many dif-ferent statements for a given item and property rankscan be added to statements to define their status (pre-ferred normal or deprecated) The initial developmentwas divided in three phases

ndash The first phase (interwiki links) created an entitybase for the Wikimedia projects This providesa better alternative to the previous interlanguagelink system

40httpwiktionarydbpediaorg41wikidataorg

24 Lehmann et al DBpedia

language words triples resources predicates senses XML lines

en 2142237 28593364 11804039 28 424386 930fr 4657817 35032121 20462349 22 592351 490ru 1080156 12813437 5994560 17 149859 1449de 701739 5618508 2966867 16 122362 671

Table 15

Statistical comparison of extractions for different languages XMLlines measures the number of lines of the XML configuration files

ndash The second phase (infoboxes) gathered infobox-related data for a subset of the entities with theexplicit goal of augmenting the infoboxes that arecurrently widely used with data from Wikidata

ndash The third phase (lists) will expand the set of prop-erties beyond those related to infoboxes and willprovide ways of exploiting this data within andoutside the Wikimedia projects

At the time of writing of this article the developmentof the third phase is ongoing

Wikidata already contains 1195 million items and348 properties that can be used to describe themSince March 2013 the Wikidata extension is live on allWikipedia language editions and thus their pages canbe linked to items in Wikidata and include data fromWikidata

Wikidata also offers a Linked Data interface42 aswell as regular RDF dumps of all its data The plannedcollaboration with Wikidata is outlined in Section 9

812 FreebaseFreebase43 is a graph database which also extracts

structured data from Wikipedia and makes it availablein RDF Both DBpedia and Freebase link to each otherand provide identifiers based on those for Wikipediaarticles They both provide dumps of the extracteddata as well as APIs or endpoints to access the dataand allow their communities to influence the schemaof the data There are however also major differencesbetween both projects DBpedia focuses on being anRDF representation of Wikipedia and serving as a hubon the Web of Data whereas Freebase uses severalsources to provide broad coverage The store behindFreebase is the GraphD [32] graph database which al-lows to efficiently store metadata for each fact Thisgraph store is append-only Deleted triples are markedand the system can easily revert to a previous version

42httpmetawikimediaorgwikiWikidataDevelopmentLinkedDataInterface

43httpwwwfreebasecom

This is necessary since Freebase data can be directlyedited by users whereas information in DBpedia canonly indirectly be edited by modifying the content ofWikipedia or the Mappings Wiki From an organisa-tional point of view Freebase is mainly run by Googlewhereas DBpedia is an open community project Inparticular in focus areas of Google and areas in whichFreebase includes other data sources the Freebasedatabase provides a higher coverage than DBpedia

813 YAGOOne of the projects which pursues similar goals

to DBpedia is YAGO44 [39] YAGO is identical toDBpedia in that each article in Wikipedia becomes anentity in YAGO Based on this it uses the leaf cate-gories in the Wikipedia category graph to infer typeinformation about an entity One of its key featuresis to link this type information to WordNet Word-Net synsets are represented as classes and the ex-tracted types of entities may become subclasses ofsuch a synset In the YAGO2 system [20] declara-tive extraction rules were introduced which can ex-tract facts from different parts of Wikipedia articleseg infoboxes and categories as well as other sourcesYAGO2 also supports spatial and temporal dimensionsfor facts at the core of its system

One of the main differences between DBpedia andYAGO in general is that DBpedia tries to stay veryclose to Wikipedia and provide an RDF version of itscontent YAGO focuses on extracting a smaller numberof relations compared to DBpedia to achieve very highprecision and consistent knowledge The two knowl-edge bases offer different type systems Whereas theDBpedia ontology is manually maintained YAGO isbacked by WordNet and Wikipedia leaf categoriesDue to this YAGO contains much more classes thanDBpedia Another difference is that the integration ofattributes and objects in infoboxes is done via map-pings in DBpedia and therefore by the DBpedia com-

44httpwwwmpi-infmpgdeyago-nagayago

Lehmann et al DBpedia 25

munity itself whereas this task is facilitated by expert-designed declarative rules in YAGO2

The two knowledge bases are connected eg DBpediaoffers the YAGO type hierarchy as an alternative to theDBpedia ontology and sameAs links are provided inboth directions While the underlying systems are verydifferent both projects share similar aims and posi-tively complement and influence each other

82 Knowledge Extraction from Wikipedia

Since its official start in 2001 Wikipedia has alwaysbeen the target of automatic extraction of informationdue to its easy availability open license and encyclo-pedic knowledge A large number of parsers scraperprojects and publications exist In this section we re-strict ourselves to approaches that are either notablerecent or pertinent to DBpedia MediaWikiorg main-tains an up-to-date list of software projects45 who areable to process wiki syntax as well as a list of dataextraction extensions46 for MediaWiki

JWPL (Java Wikipedia Library [46]) is an open-source Java-based API that allows to access informa-tion provided by the Wikipedia API (redirects cate-gories articles and link structure) JWPL contains aMediaWiki Markup parser that can be used to furtheranalyze the contents of a Wikipedia page Data is alsoprovided as XML dump and is incorporated in the lex-ical resource UBY47 for language tools

Several different approaches to extract knowledgefrom Wikipedia are presented in [34] Given featureslike anchor texts interlanguage links category linksand redirect pages are utilized eg for word-sense dis-ambiguations or synonyms translations taxonomic re-lations and abbreviation or hypernym resolution re-spectively Apart from this link structures are used tobuild the Wikipedia Thesaurus Web service48 Addi-tional projects presented by the authors that exploit thementioned features are listed on the Special InterestGroup on Wikipedia Mining (SIGWP) Web site49

An earlier approach to improve the quality of theinfobox schemata and contents is described in [44]

45httpwwwmediawikiorgwikiAlternative_parsers

46httpwwwmediawikiorgwikiExtension_Matrixdata_extraction

47httpwwwukptu-darmstadtdedatalexical-resourcesuby

48httpsigwporgenindexphpWikipedia_Thesaurus

49httpsigwporgen

The presented methodology encompasses a three stepprocess of preprocessing classification and extractionDuring preprocessing refined target infobox schemataare created applying statistical methods and trainingsets are extracted based on real Wikipedia data Afterassigning a class and the corresponding target schema(classification) the training sets are used to extract tar-get infobox values from the documentrsquos text applyingmachine learning algorithms

The idea of using structured data from certainmarkup structures was also applied to other user-drivenWeb encyclopedias In [35] the authors describe theireffort building an integrated Chinese Linking OpenData (CLOD) source based on the Chinese Wikipediaand the two widely used and large encyclopedias BaiduBaike50 and Hudong Baike51 Apart from utilizing Me-diaWiki and HTML Markup for the actual extractionthe Wikipedia interlanguage links were used to link theCLOD source to the English DBpedia

A more generic approach to achieve a better cross-lingual knowledge-linkage beyond the use of Wikipediainterlanguage links is presented in [42] Focusing onwiki knowledge bases the authors introduce their so-lution based on structural properties like similar link-age structures the assignment to similar categoriesand similar interests of the authors of wiki documentsin the considered languages Since this approach islanguage-feature-agnostic it is not restricted to certainlanguages

KnowItAll52 is a web scale knowledge extraction ef-fort which is domain-independent and uses genericextraction rules co-occurrence statistics and NaiveBayes classification [11] Cyc [26] is a large com-mon sense knowledge base which is now partially re-leased as OpenCyc and also available as an OWL on-tology OpenCyc is linked to DBpedia which providesan ontological embedding in its comprehensive struc-tures WikiTaxonomy [37] is a large taxonomy de-rived from categories in Wikipedia by classifying cate-gories as instances or classes and deriving a subsump-tion hierarchy The KOG system [43] refines existingWikipedia infoboxes based on machine learning tech-niques using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Net-works KYLIN [44] is a system which autonomouslyextracts structured data from Wikipedia and uses self-

50httpbaikebaiducom51httpwwwhudongcom52httpwwwcswashingtoneduresearch

knowitall

26 Lehmann et al DBpedia

supervised linking [2] was an infobox extraction ap-proach for Wikipedia which later became the DBpediaproject

9 Conclusions and Future Work

In this system report we presented an overview onrecent advances of the DBpedia community projectThe technical innovations described in this articleincluded in particular (1) the extraction based onthe community-curated DBpedia ontology (2) thelive synchronisation of DBpedia with Wikipedia andDBpedia mirrors through update propagation and (3)the facilitation of the internationalisation of DBpediaAs a result we demonstrated that DBpedia maturedand improved significantly in the last years in particu-lar also in terms of coverage usability and data qual-ity

With DBpedia we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced con-tent repositories There are a large number of furthercrowd-sourced content repositories and DBpedia al-ready had an impact on their structured data publishingand interlinking Two examples are Wiktionary withthe Wiktionary extraction [18] meanwhile becomingpart of DBpedia and LinkedGeoData [38] which aimsto implement similar data extraction publishing andlinking strategies for OpenStreetMaps

In the future we see in particular the following di-rections for advancing the DBpedia project

Multilingual data integration and fusion An areawhich is still largely unexplored is the integration andfusion between different DBpedia language editionsNon-English DBpedia editions comprise a better anddifferent coverage of local culture When we are ableto precisely identify equivalent overlapping and com-plementary parts in different DBpedia language edi-tions we can reach a significantly increased coverageOn the other hand comparing the values of a specificproperty between different language editions will helpus to spot extraction errors as well as wrong or out-dated information in Wikipedia

Community-driven data quality improvement In thefuture we also aim to engage a larger community ofDBpedia users in feedback loops which help us toidentify data quality problems and corresponding de-ficiencies of the DBpedia extraction framework Byconstantly monitoring the data quality and integratingimprovements into the mappings to the DBpedia on-

tology as well as fixes into the extraction frameworkwe aim to demonstrate that the Wikipedia communityis not only capable to create the largest encyclopediabut also the most comprehensive and structured knowl-edge base With the DBpedia quality evaluation cam-paign [45] we were going a first step in this direction

Inline extraction Currently DBpedia extracts infor-mation primarily from templates In the future we en-vision to also extract semantic information from typedlinks Typed links is a feature of Semantic MediaWikiwhich was backported and implemented as a verylightweight extension for MediaWiki53 If this exten-sion is deployed at Wikipedia installations this opensup completely new possibilities for more fine-grainedand non-invasive knowledge representations and ex-traction from Wikipedia

Collaboration between Wikidata and DBpedia WhileDBpedia provides a comprehensive and current viewon entity descriptions extracted from Wikipedia Wiki-data offers a variety of factual statements from dif-ferent sources and dates One of the richest sourcesof DBpedia are Wikipedia infoboxes which are struc-tured but at the same time heterogeneous and non-standardized (thus making the extraction error prone incertain cases) The aim of Wikidata is to populate in-foboxes automatically from a centrally managed high-quality fact database In this regard both projects com-plement each other nicely In future versions DBpediawill include more raw data provided by Wikidata andadd services such as Linked DataSPARQL endpointsRDF dumps linking and ontology mapping for Wiki-data

Feedback for Wikipedia. A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed at certain intervals. For example, a query could check that the birth date of a person is always before the death date, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].
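A minimal sketch of such a sanity check, assuming the standard dbo:birthDate and dbo:deathDate properties of the DBpedia ontology, could look as follows:

PREFIX dbo: <http://dbpedia.org/ontology/>
SELECT ?person ?birth ?death WHERE {
  ?person dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  # flag persons whose recorded death date precedes their birth date
  FILTER (?death < ?birth)
}
LIMIT 100

Run periodically against the DBpedia Live endpoint, the result list points directly to Wikipedia articles whose infobox dates deserve a second look.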

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP. Despite recent advances (cf. Section 7), there is still a huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and to update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to which entity a certain surface form should be mapped.
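As a deliberately simplified illustration of how this background knowledge can be tapped (systems such as DBpedia Spotlight combine such lookups with contextual and statistical evidence), candidate entities for a given surface form could be retrieved by label:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?candidate ?type WHERE {
  # candidate generation: resources whose English label matches the surface form
  ?candidate rdfs:label "Leipzig"@en ;
             a ?type .
  # restrict to DBpedia ontology classes so candidates can be grouped by type
  FILTER (STRSTARTS(STR(?type), "http://dbpedia.org/ontology/"))
}

The relations of each candidate to entities already recognised in the surrounding text can then serve as disambiguation evidence.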

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Paul Kreis for DBpedia development
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– people who helped contributing data for certain parts of the article:
  ∗ Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  ∗ Freebase: Shawn Simister and Tom Morris (working at Freebase)
  ∗ Instance Data Analysis: Volha Bryl (working at University of Mannheim)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.
[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.
[3] C. Becker and C. Bizer. Exploring the geospatial semantic web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.
[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.
[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.
[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the web of data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.
[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.
[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.
[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.
[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.
[12] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.
[13] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.
[14] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.
[15] P. Heim, T. Ertl, and J. Ziegler. Facet graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.
[16] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.
[17] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.
[18] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In Proc. of the Joint International Semantic Technology Conference, 2012.
[19] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.
[20] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.
[21] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.
[22] D. Kontokostas, S. Auer, S. Hellmann, J. Lehmann, P. Westphal, R. Cornelissen, and A. Zaveri. Test-driven data quality evaluation for SPARQL endpoints. In Submitted to 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.
[23] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of linked data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.
[24] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The open archives initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.
[25] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.
[26] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.
[27] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational linguistics, pages 495–501, 2000.
[28] V. Lopez, M. Fernandez, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the semantic web. Semantic Web, 3(3):249–265, 2012.
[29] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.
[30] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.
[31] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.
[32] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in graphd. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.
[33] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.
[34] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.
[35] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin, Heidelberg, 2011. Springer-Verlag.
[36] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, Nov. 10 2008.
[37] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.
[38] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.
[39] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.
[40] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.
[41] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648. ACM, 2012.
[42] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.
[43] F. Wu and D. Weld. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th World Wide Web Conference, 2008.
[44] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.
[45] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In To appear in Proceedings of 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.
[46] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

Appendix

A Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net/>
PREFIX an: <http://vocab.sindice.net/analytics#>
PREFIX dom: <http://sindice.com/dataspace/default/domain#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?Dataset ?Target_Dataset ?predicate
       (SUM(xsd:long(?cardinality)) AS ?total)
FROM <http://sindice.com/analytics>
WHERE {
  ?target any23:domain_uri ?Target_Dataset .
  ?edge an:source ?source ;
        an:target ?target ;
        an:label ?predicate ;
        an:cardinality ?cardinality ;
        an:publishedIn ?Dataset .
  ?source any23:domain_uri ?Dataset .
  FILTER (?Dataset != ?Target_Dataset)
}
GROUP BY ?Dataset ?Target_Dataset ?predicate

Listing 1: SPARQL query on the Sindice summary graph for obtaining the number of incoming links.


9 Conclusions and Future Work

In this system report we presented an overview onrecent advances of the DBpedia community projectThe technical innovations described in this articleincluded in particular (1) the extraction based onthe community-curated DBpedia ontology (2) thelive synchronisation of DBpedia with Wikipedia andDBpedia mirrors through update propagation and (3)the facilitation of the internationalisation of DBpediaAs a result we demonstrated that DBpedia maturedand improved significantly in the last years in particu-lar also in terms of coverage usability and data qual-ity

With DBpedia we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced con-tent repositories There are a large number of furthercrowd-sourced content repositories and DBpedia al-ready had an impact on their structured data publishingand interlinking Two examples are Wiktionary withthe Wiktionary extraction [18] meanwhile becomingpart of DBpedia and LinkedGeoData [38] which aimsto implement similar data extraction publishing andlinking strategies for OpenStreetMaps

In the future we see in particular the following di-rections for advancing the DBpedia project

Multilingual data integration and fusion An areawhich is still largely unexplored is the integration andfusion between different DBpedia language editionsNon-English DBpedia editions comprise a better anddifferent coverage of local culture When we are ableto precisely identify equivalent overlapping and com-plementary parts in different DBpedia language edi-tions we can reach a significantly increased coverageOn the other hand comparing the values of a specificproperty between different language editions will helpus to spot extraction errors as well as wrong or out-dated information in Wikipedia

Community-driven data quality improvement In thefuture we also aim to engage a larger community ofDBpedia users in feedback loops which help us toidentify data quality problems and corresponding de-ficiencies of the DBpedia extraction framework Byconstantly monitoring the data quality and integratingimprovements into the mappings to the DBpedia on-

tology as well as fixes into the extraction frameworkwe aim to demonstrate that the Wikipedia communityis not only capable to create the largest encyclopediabut also the most comprehensive and structured knowl-edge base With the DBpedia quality evaluation cam-paign [45] we were going a first step in this direction

Inline extraction Currently DBpedia extracts infor-mation primarily from templates In the future we en-vision to also extract semantic information from typedlinks Typed links is a feature of Semantic MediaWikiwhich was backported and implemented as a verylightweight extension for MediaWiki53 If this exten-sion is deployed at Wikipedia installations this opensup completely new possibilities for more fine-grainedand non-invasive knowledge representations and ex-traction from Wikipedia

Collaboration between Wikidata and DBpedia WhileDBpedia provides a comprehensive and current viewon entity descriptions extracted from Wikipedia Wiki-data offers a variety of factual statements from dif-ferent sources and dates One of the richest sourcesof DBpedia are Wikipedia infoboxes which are struc-tured but at the same time heterogeneous and non-standardized (thus making the extraction error prone incertain cases) The aim of Wikidata is to populate in-foboxes automatically from a centrally managed high-quality fact database In this regard both projects com-plement each other nicely In future versions DBpediawill include more raw data provided by Wikidata andadd services such as Linked DataSPARQL endpointsRDF dumps linking and ontology mapping for Wiki-data

Feedback for Wikipedia A promising prospect isthat DBpedia can help to identify misrepresentationserrors and inconsistencies in Wikipedia In the futurewe plan to provide more feedback to the Wikipediacommunity about the quality of Wikipedia This canfor instance be achieved in the form of sanity checkswhich are implemented as SPARQL queries on theDBpedia Live endpoint which identify data quality is-sues and are executed in certain intervals For examplea query could check that the birthday of a person mustalways be before the death day or spot outliers that dif-fer significantly from the range of the majority of theother values In case a Wikipedia editor makes a mis-take or typo when adding such information to a pagethis could be automatically identified and provided asfeedback to Wikipedians [22]

53httpwwwmediawikiorgwikiExtensionLightweightRDFa

Lehmann et al DBpedia 27

Integrate DBpedia and NLP Despite recent ad-vances (cf Section 7) there is still a huge potentialfor employing Linked Data background knowledgein various Natural Language Processing (NLP) tasksOne very promising research avenue in this regard isto employ DBpedia as structured background knowl-edge for named entity recognition and disambiguationCurrently most approaches use statistical informationsuch as co-occurrence for named entity disambigua-tion However co-occurrence is not always easy todetermine (depends on training data) and update (re-quires recomputation) With DBpedia and in particu-lar DBpedia Live we have comprehensive and evolv-ing background knowledge comprising information onthe relationship between a large number of real-worldentities Consequently we can employ this informa-tion for deciding to what entity a certain surface formshould be mapped

Acknowledgment

We would like to acknowledge the support of thefollowing people and organisations to DBpedia

ndash all Wikipedia contributorsndash OpenLink Software for providing and maintain-

ing the server infrastructure for the main DBpediaendpoint

ndash Kingsley Idehen for SPARQL and Linked Datahosting and community support

ndash Paul Kreis for DBpedia developmentndash Christopher Sahnwaldt for DBpedia development

and release managementndash Claus Stadler for DBpedia developmentndash people who helped contributing data for certain

parts of the article

lowast Sindice analysis Stphane Campinas SzymonDanielczyk and Gabriela Vulcu (working atDERI)

lowast Freebase Shawn Simister and Tom Morris(working at Freebase)

lowast Instance Data Analysis Volha Bryl (working atUniversity of Mannheim)

This work was supported by grants from the Euro-pean Unionrsquos 7th Framework Programme provided forthe projects LOD2 (GA no 257943) GeoKnow (GAno 318159) and Dicode (GA no 257184)

References

[1] S Auer C Bizer G Kobilarov J Lehmann R Cyganiakand Z Ives DBpedia A nucleus for a web of open data InProceedings of the 6th International Semantic Web Conference(ISWC) volume 4825 of Lecture Notes in Computer Sciencepages 722ndash735 Springer 2008

[2] S Auer and J Lehmann What have Innsbruck and Leipzigin common extracting semantics from wiki content In Pro-ceedings of the ESWC (2007) volume 4519 of Lecture Notesin Computer Science pages 503ndash517 Springer 2007

[3] C Becker and C Bizer Exploring the geospatial semantic webwith DBpedia mobile J Web Sem 7(4)278ndash286 2009

[4] D Bikel V Castelli R Florian and D-J Han Entity link-ing and slot filling through statistical processing and inferencerules In Proceedings TAC Workshop 2009

[5] C Bizer J Lehmann G Kobilarov S Auer C Becker R Cy-ganiak and S Hellmann DBpedia - a crystallization pointfor the web of data Journal of Web Semantics 7(3)154ndash1652009

[6] E Cabrio J Cojan A P Aprosio B Magnini A Lavelli andF Gandon QAKiS an open domain QA system based on rela-tional patterns In ISWC-PD International Semantic Web Con-ference (Posters amp Demos) volume 914 of CEUR WorkshopProceedings CEUR-WSorg 2012

[7] D V Camarda S Mazzini and A Antonuccio Lodlive ex-ploring the web of data In V Presutti and H S Pinto editorsI-SEMANTICS 2012 - 8th International Conference on Seman-tic Systems I-SEMANTICS rsquo12 Graz Austria September 5-72012 pages 197ndash200 ACM 2012

[8] S Campinas T E Perry D Ceccarelli R Delbru and G Tum-marello Introducing rdf graph summary with application toassisted sparql formulation In Database and Expert SystemsApplications (DEXA) 2012 23rd International Workshop onpages 261ndash266 IEEE 2012

[9] D Damljanovic M Agatonovic and H Cunningham FreyaAn interactive way of querying linked data using natural lan-guage In The Semantic Web ESWC 2011 Workshops pages125ndash138 Springer 2012

[10] O Erling and I Mikhailov RDF support in the virtuosoDBMS In S Auer C Bizer C Muller and A V Zhdanovaeditors CSSW volume 113 of LNI pages 59ndash68 GI 2007

[11] O Etzioni M Cafarella D Downey S Kok A-M PopescuT Shaked S Soderland D S Weld and A Yates Web-scaleinformation extraction in knowitall In Proceedings of the 13thinternational conference on World Wide Web pages 100ndash110ACM 2004

[12] A Garcıa-Silva M Jakob P N Mendes and C Bizer Mul-tipedia enriching DBpedia with multimedia information InProceedings of the sixth international conference on Knowl-edge capture K-CAP rsquo11 pages 137ndash144 New York NYUSA 2011 ACM

[13] A Garcıa-Silva M Szomszor H Alani and O Corcho Pre-liminary results in tag disambiguation using DBpedia In 1stInternational Workshop in Collective Knowledage Capturingand Representation (CKCaR) California USA 2009

[14] R Hahn C Bizer C Sahnwaldt C Herta S RobinsonM Burgle H Duwiger and U Scheel Faceted wikipediasearch In W Abramowicz and R Tolksdorf editors BusinessInformation Systems 13th International Conference BIS 2010Berlin Germany May 3-5 2010 Proceedings volume 47 of

28 Lehmann et al DBpedia

Lecture Notes in Business Information Processing pages 1ndash11Springer 2010

[15] P Heim T Ertl and J Ziegler Facet graphs Complex seman-tic querying made easy In Proceedings of the 7th Extended Se-mantic Web Conference (ESWC 2010) volume 6088 of LNCSpages 288ndash302 BerlinHeidelberg 2010 Springer

[16] P Heim S Hellmann J Lehmann S Lohmann and T Stege-mann Relfinder Revealing relationships in RDF knowledgebases In T-S Chua Y Kompatsiaris B Merialdo W HaasG Thallinger and W Bailer editors Semantic Multimedia4th International Conference on Semantic and Digital MediaTechnologies SAMT 2009 Graz Austria December 2-4 2009Proceedings volume 5887 of Lecture Notes in Computer Sci-ence pages 182ndash187 Springer 2009

[17] P Heim S Lohmann D Tsendragchaa and T Ertl Semlensvisual analysis of semantic data with scatter plots and semanticlenses In C Ghidini A-C N Ngomo S N Lindstaedt andT Pellegrini editors Proceedings the 7th International Con-ference on Semantic Systems I-SEMANTICS 2011 Graz Aus-tria September 7-9 2011 ACM International Conference Pro-ceeding Series pages 175ndash178 ACM 2011

[18] S Hellmann J Brekle and S Auer Leveraging the crowd-sourcing of lexical resources for bootstrapping a linguistic datacloud In Proc of the Joint International Semantic TechnologyConference 2012

[19] S Hellmann C Stadler J Lehmann and S Auer DBpedialive extraction In Proc of 8th International Conferenceon Ontologies DataBases and Applications of Semantics(ODBASE) volume 5871 of Lecture Notes in Computer Sci-ence pages 1209ndash1223 2009

[20] J Hoffart F M Suchanek K Berberich and G WeikumYAGO2 A spatially and temporally enhanced knowledge basefrom wikipedia Artif Intell 19428ndash61 2013

[21] H Ji R Grishman H T Dang K Griffitt and J EllisOverview of the TAC 2010 knowledge base population trackIn Third Text Analysis Conference (TAC 2010) 2010

[22] D Kontokostas S Auer S Hellmann J Lehmann P West-phal R Cornelissen and A Zaveri Test-driven data qualityevaluation for sparql endpoints In Submitted to 12th Interna-tional Semantic Web Conference 21-25 October 2013 SydneyAustralia 2013

[23] D Kontokostas C Bratsas S Auer S Hellmann I Antoniouand G Metakides Internationalization of linked data The caseof the greek dbpedia edition Web Semantics Science Servicesand Agents on the World Wide Web 15(0)51 ndash 61 2012

[24] C Lagoze H V de Sompel M Nelson and S WarnerThe open archives initiative protocol for metadata har-vesting httpwwwopenarchivesorgOAIopenarchivesprotocolhtml 2008

[25] J Lehmann S Monahan L Nezda A Jung and Y Shi LCCapproaches to knowledge base population at TAC 2010 InProceedings TAC Workshop 2010

[26] D B Lenat Cyc A large-scale investment in knowledgeinfrastructure Communications of the ACM 38(11)33ndash381995

[27] C-Y Lin and E H Hovy The automated acquisition oftopic signatures for text summarization In Proceedings of the18th conference on Computational linguistics pages 495ndash5012000

[28] V Lopez M Fernandez E Motta and N Stieler PoweraquaSupporting users in querying and exploring the semantic web

Semantic Web 3(3)249ndash265 2012[29] M Martin C Stadler P Frischmuth and J Lehmann Increas-

ing the financial transparency of european commission projectfunding Semantic Web Journal Special Call for LinkedDataset descriptions 2013

[30] P N Mendes M Jakob and C Bizer DBpedia for NLP -a multilingual cross-domain knowledge base In Proceedingsof the International Conference on Language Resources andEvaluation (LREC) Istanbul Turkey 2012

[31] P N Mendes M Jakob A Garcıa-Silva and C BizerDBpedia Spotlight Shedding light on the web of documentsIn Proceedings of the 7th International Conference on Seman-tic Systems (I-Semantics) Graz Austria 2011

[32] S M Meyer J Degener J Giannandrea and B MichenerOptimizing schema-last tuple-store queries in graphd In A KElmagarmid and D Agrawal editors SIGMOD Conferencepages 1047ndash1056 ACM 2010

[33] M Morsey J Lehmann S Auer C Stadler and S Hell-mann DBpedia and the live extraction of structured data fromwikipedia Program electronic library and information sys-tems 4627 2012

[34] K Nakayama M Pei M Erdmann M Ito M ShirakawaT Hara and S Nishio Wikipedia mining Wikipedia as a cor-pus for knowledge extraction In Annual Wikipedia Conference(Wikimania) 2008

[35] X Niu X Sun H Wang S Rong G Qi and Y Yu ZhishimeWeaving chinese linking open data In 10th International Con-ference on The semantic web - Volume Part II pages 205ndash220Berlin Heidelberg 2011 Springer-Verlag

[36] E Oren R Delbru M Catasta R Cyganiak H Stenzhornand G Tummarello Sindicecom a document-oriented lookupindex for open linked data Int J of Metadata and Semanticsand Ontologies 337ndash52 Nov 10 2008

[37] S P Ponzetto and M Strube Wikitaxonomy A large scaleknowledge resource In M Ghallab C D SpyropoulosN Fakotakis and N M Avouris editors ECAI volume 178of Frontiers in Artificial Intelligence and Applications pages751ndash752 IOS Press 2008

[38] C Stadler J Lehmann K Hoffner and S Auer Linkedgeo-data A core for a web of spatial open data Semantic WebJournal 3(4)333ndash354 2012

[39] F M Suchanek G Kasneci and G Weikum Yago a core ofsemantic knowledge In C L Williamson M E Zurko P FPatel-Schneider and P J Shenoy editors WWW pages 697ndash706 ACM 2007

[40] E Tacchini A Schultz and C Bizer Experiments withwikipedia cross-language data fusion In Proceedings of the5th Workshop on Scripting and Development for the SemanticWeb ESWC Citeseer 2009

[41] C Unger L Buhmann J Lehmann A-C Ngonga NgomoD Gerber and P Cimiano Template-based question answer-ing over rdf data In Proceedings of the 21st international con-ference on World Wide Web pages 639ndash648 ACM 2012

[42] Z Wang J Li Z Wang and J Tang Cross-lingual knowledgelinking across wiki knowledge bases In Proceedings of the21st international conference on World Wide Web pages 459ndash468 New York NY USA 2012 ACM

[43] F Wu and D Weld Automatically Refining the WikipediaInfobox Ontology In Proceedings of the 17th World Wide WebConference 2008

Lehmann et al DBpedia 29

[44] F Wu and D S Weld Autonomously semantifying wikipediaIn Proceedings of the 16th Conference on Information andKnowledge Management pages 41ndash50 ACM 2007

[45] A Zaveri D Kontokostas M A Sherif L BuhmannM Morsey S Auer and J Lehmann User-driven quality eval-uation of dbpedia In To appear in Proceedings of 9th Inter-national Conference on Semantic Systems I-SEMANTICS rsquo13Graz Austria September 4-6 2013 ACM 2013

[46] T Zesch C Muller and I Gurevych Extracting lexical seman-tic knowledge from wikipedia and wiktionary In Proceedingsof the 6th International Conference on Language Resourcesand Evaluation Marrakech Morocco May 2008 electronicproceedings

Appendix

A Sindice Analysis

The following query was used to analyse the incom-ing links of all datasets crawled by Sindice includingDBpedia The query runs against the Sindice SummaryGraph [8] which contains statistics over all Sindicedatasets Due to the complexity of the query it was runas a Hadoop job on the Sindice cluster

1 PREFIX any232 lthttpvocabsindicenetgt3 PREFIX an4 lthttpvocabsindicenetanalyticsgt5 PREFIX dom6 lthttpsindicecomdataspacedefault

domaingt7 SELECT Dataset Target_Datatset predicate

SUM(xsdlong(cardinality)) AS total8 FROM lthttpsindicecomanalyticsgt 9 target any23domain_uri Target_Dataset

1011 edge ansource source 12 antarget target 13 anlabel predicate14 ancardinality cardinality15 anpublishedIn Dataset 1617 source any23domain_uri Dataset 1819 FILTER (Dataset = Target_Dataset)

Listing 1 SPARQL query on the Sindice summarygraph for obtaining the number of incoming links

Page 23: Semantic Web 1 (2012) 1–5 IOS Press DBpedia - A Large ... · Semantic Web 1 (2012) 1–5 1 IOS Press DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia

Lehmann et al DBpedia 23

languages For example the English Wiktionary con-tains nearly 3 million words39 A Wiktionary page pro-vides for a lexical word a hierarchical disambiguationto its language part of speech sometimes etymolo-gies and most prominently senses Within this tree nu-merous kinds of linguistic properties are given includ-ing synonyms hyponyms hyperonyms example sen-tences links to Wikipedia and many more The fastchanging nature together with the fragmentation of theproject into Wiktionary language editions (WLE) withindependent layout rules poses the biggest problem tothe automated transformation into a structured knowl-edge base We identified this as a serious problem Al-though the value of Wiktionary is known and usagescenarios are obvious only some rudimentary toolsexist which either focus on a specific subset of thedata or they only cover one or two WLEs [18] Ex-isting tools can be seen as adapters to single WLEmdash they are hard to maintain and there are too manylanguages that constantly change and require a pro-grammer to refactor complex code Opposed to thecurrently employed classic and straight-forward ap-proach (implementing software adapters for scraping)Wiktionary2RDF [18] employs a declarative media-torwrapper pattern Figure 15 shows the work flow ofthe specially created Wiktionary extractor that takes asinput one config XML file for each WLE The aim is toenable non-programmers (the community of adoptersand domain experts) to tailor and maintain the WLEwrappers themselves A simple XML dialect was cre-ated to encode the ldquoentry layout explainedrdquo (ELE)guidelines and declare triple patterns that define howthe resulting RDF should be built An example is givenfor the following wiki syntax

1 ===Synonyms===2 [[building]]3 [[company]]

Regular expression for scraping and the triple genera-tion rule encoded in XML look like

1 ltwikiTemplategt===Synonyms===2 ( [[$target]]3 )+4 ltwikiTemplategt5 6 lttriple s=httpsomens$entityId p=httpsomens

hasSynonym o=httpsomens$target gt

39See httpenwiktionaryorgwikisemanticfor a simple example page

Fig 15 Detail view of the Wiktionary extractor

This configuration is interpreted and run against Wik-tionary dumps The resulting data set is open in ev-ery aspect and hosted as Linked Data40 Statistics areshown in Table 15

8 Related Work

81 Cross Domain Community Knowledge Bases

811 WikidataIn March 2012 the Wikimedia Germany eV started

the development of Wikidata41 Wikidata is a freeknowledge base about the world that can be read andedited by humans and machines alike It provides datain all languages of the Wikimedia projects and allowsfor central access to the data in a similar vein as Wiki-media Commons does for multimedia files Thingsdescribed in the Wikidata knowledge base are calleditems and can have labels descriptions and aliasesin all languages Wikidata does not aim at offeringthe truth about things but providing statements aboutthem Rather than stating that Berlin has a populationof 35 million Wikidata contains the statement aboutBerlinrsquos population being 35 million as of 2011 ac-cording to the German statistical office Thus Wiki-data can offer a variety of statements from differentsources and dates As there are potentially many dif-ferent statements for a given item and property rankscan be added to statements to define their status (pre-ferred normal or deprecated) The initial developmentwas divided in three phases

ndash The first phase (interwiki links) created an entitybase for the Wikimedia projects This providesa better alternative to the previous interlanguagelink system

40httpwiktionarydbpediaorg41wikidataorg

24 Lehmann et al DBpedia

language words triples resources predicates senses XML lines

en 2142237 28593364 11804039 28 424386 930fr 4657817 35032121 20462349 22 592351 490ru 1080156 12813437 5994560 17 149859 1449de 701739 5618508 2966867 16 122362 671

Table 15

Statistical comparison of extractions for different languages XMLlines measures the number of lines of the XML configuration files

ndash The second phase (infoboxes) gathered infobox-related data for a subset of the entities with theexplicit goal of augmenting the infoboxes that arecurrently widely used with data from Wikidata

ndash The third phase (lists) will expand the set of prop-erties beyond those related to infoboxes and willprovide ways of exploiting this data within andoutside the Wikimedia projects

At the time of writing of this article the developmentof the third phase is ongoing

Wikidata already contains 1195 million items and348 properties that can be used to describe themSince March 2013 the Wikidata extension is live on allWikipedia language editions and thus their pages canbe linked to items in Wikidata and include data fromWikidata

Wikidata also offers a Linked Data interface42 aswell as regular RDF dumps of all its data The plannedcollaboration with Wikidata is outlined in Section 9

812 FreebaseFreebase43 is a graph database which also extracts

structured data from Wikipedia and makes it availablein RDF Both DBpedia and Freebase link to each otherand provide identifiers based on those for Wikipediaarticles They both provide dumps of the extracteddata as well as APIs or endpoints to access the dataand allow their communities to influence the schemaof the data There are however also major differencesbetween both projects DBpedia focuses on being anRDF representation of Wikipedia and serving as a hubon the Web of Data whereas Freebase uses severalsources to provide broad coverage The store behindFreebase is the GraphD [32] graph database which al-lows to efficiently store metadata for each fact Thisgraph store is append-only Deleted triples are markedand the system can easily revert to a previous version

42httpmetawikimediaorgwikiWikidataDevelopmentLinkedDataInterface

43httpwwwfreebasecom

This is necessary since Freebase data can be directlyedited by users whereas information in DBpedia canonly indirectly be edited by modifying the content ofWikipedia or the Mappings Wiki From an organisa-tional point of view Freebase is mainly run by Googlewhereas DBpedia is an open community project Inparticular in focus areas of Google and areas in whichFreebase includes other data sources the Freebasedatabase provides a higher coverage than DBpedia

813 YAGOOne of the projects which pursues similar goals

to DBpedia is YAGO44 [39] YAGO is identical toDBpedia in that each article in Wikipedia becomes anentity in YAGO Based on this it uses the leaf cate-gories in the Wikipedia category graph to infer typeinformation about an entity One of its key featuresis to link this type information to WordNet Word-Net synsets are represented as classes and the ex-tracted types of entities may become subclasses ofsuch a synset In the YAGO2 system [20] declara-tive extraction rules were introduced which can ex-tract facts from different parts of Wikipedia articleseg infoboxes and categories as well as other sourcesYAGO2 also supports spatial and temporal dimensionsfor facts at the core of its system

One of the main differences between DBpedia andYAGO in general is that DBpedia tries to stay veryclose to Wikipedia and provide an RDF version of itscontent YAGO focuses on extracting a smaller numberof relations compared to DBpedia to achieve very highprecision and consistent knowledge The two knowl-edge bases offer different type systems Whereas theDBpedia ontology is manually maintained YAGO isbacked by WordNet and Wikipedia leaf categoriesDue to this YAGO contains much more classes thanDBpedia Another difference is that the integration ofattributes and objects in infoboxes is done via map-pings in DBpedia and therefore by the DBpedia com-

44httpwwwmpi-infmpgdeyago-nagayago

Lehmann et al DBpedia 25

munity itself whereas this task is facilitated by expert-designed declarative rules in YAGO2

The two knowledge bases are connected eg DBpediaoffers the YAGO type hierarchy as an alternative to theDBpedia ontology and sameAs links are provided inboth directions While the underlying systems are verydifferent both projects share similar aims and posi-tively complement and influence each other

82 Knowledge Extraction from Wikipedia

Since its official start in 2001 Wikipedia has alwaysbeen the target of automatic extraction of informationdue to its easy availability open license and encyclo-pedic knowledge A large number of parsers scraperprojects and publications exist In this section we re-strict ourselves to approaches that are either notablerecent or pertinent to DBpedia MediaWikiorg main-tains an up-to-date list of software projects45 who areable to process wiki syntax as well as a list of dataextraction extensions46 for MediaWiki

JWPL (Java Wikipedia Library [46]) is an open-source Java-based API that allows to access informa-tion provided by the Wikipedia API (redirects cate-gories articles and link structure) JWPL contains aMediaWiki Markup parser that can be used to furtheranalyze the contents of a Wikipedia page Data is alsoprovided as XML dump and is incorporated in the lex-ical resource UBY47 for language tools

Several different approaches to extract knowledgefrom Wikipedia are presented in [34] Given featureslike anchor texts interlanguage links category linksand redirect pages are utilized eg for word-sense dis-ambiguations or synonyms translations taxonomic re-lations and abbreviation or hypernym resolution re-spectively Apart from this link structures are used tobuild the Wikipedia Thesaurus Web service48 Addi-tional projects presented by the authors that exploit thementioned features are listed on the Special InterestGroup on Wikipedia Mining (SIGWP) Web site49

An earlier approach to improve the quality of the infobox schemata and contents is described in [44].

45 http://www.mediawiki.org/wiki/Alternative_parsers

46 http://www.mediawiki.org/wiki/Extension_Matrix/data_extraction

47 http://www.ukp.tu-darmstadt.de/data/lexical-resources/uby

48 http://sigwp.org/en/index.php/Wikipedia_Thesaurus

49 http://sigwp.org/en/

The presented methodology encompasses a three-step process of preprocessing, classification and extraction. During preprocessing, refined target infobox schemata are created applying statistical methods, and training sets are extracted based on real Wikipedia data. After assigning a class and the corresponding target schema (classification), the training sets are used to extract target infobox values from the document's text, applying machine learning algorithms.

The idea of using structured data from certain markup structures was also applied to other user-driven Web encyclopedias. In [35] the authors describe their effort building an integrated Chinese Linking Open Data (CLOD) source based on the Chinese Wikipedia and the two widely used and large encyclopedias Baidu Baike50 and Hudong Baike51. Apart from utilizing MediaWiki and HTML markup for the actual extraction, the Wikipedia interlanguage links were used to link the CLOD source to the English DBpedia.

A more generic approach to achieve better cross-lingual knowledge linkage beyond the use of Wikipedia interlanguage links is presented in [42]. Focusing on wiki knowledge bases, the authors introduce their solution based on structural properties like similar linkage structures, the assignment to similar categories, and similar interests of the authors of wiki documents in the considered languages. Since this approach is language-feature-agnostic, it is not restricted to certain languages.

KnowItAll52 is a web-scale knowledge extraction effort which is domain-independent and uses generic extraction rules, co-occurrence statistics and Naive Bayes classification [11]. Cyc [26] is a large common sense knowledge base which is now partially released as OpenCyc and also available as an OWL ontology. OpenCyc is linked to DBpedia, which provides an ontological embedding in its comprehensive structures. WikiTaxonomy [37] is a large taxonomy derived from categories in Wikipedia by classifying categories as instances or classes and deriving a subsumption hierarchy. The KOG system [43] refines existing Wikipedia infoboxes based on machine learning techniques, using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Networks. KYLIN [44] is a system which autonomously extracts structured data from Wikipedia and uses self-supervised linking. [2] was an infobox extraction approach for Wikipedia which later became the DBpedia project.

50 http://baike.baidu.com
51 http://www.hudong.com
52 http://www.cs.washington.edu/research/knowitall

9 Conclusions and Future Work

In this system report, we presented an overview of recent advances of the DBpedia community project. The technical innovations described in this article included in particular (1) the extraction based on the community-curated DBpedia ontology, (2) the live synchronisation of DBpedia with Wikipedia and DBpedia mirrors through update propagation, and (3) the facilitation of the internationalisation of DBpedia. As a result, we demonstrated that DBpedia has matured and improved significantly in the last years, in particular also in terms of coverage, usability and data quality.

With DBpedia, we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced content repositories. There are a large number of further crowd-sourced content repositories, and DBpedia has already had an impact on their structured data publishing and interlinking. Two examples are Wiktionary, with the Wiktionary extraction [18] meanwhile becoming part of DBpedia, and LinkedGeoData [38], which aims to implement similar data extraction, publishing and linking strategies for OpenStreetMap.

In the future, we see in particular the following directions for advancing the DBpedia project:

Multilingual data integration and fusion: An area which is still largely unexplored is the integration and fusion between different DBpedia language editions. Non-English DBpedia editions provide better and different coverage of local culture. When we are able to precisely identify equivalent, overlapping and complementary parts in different DBpedia language editions, we can reach a significantly increased coverage. On the other hand, comparing the values of a specific property between different language editions will help us to spot extraction errors as well as wrong or outdated information in Wikipedia.
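
The following federated query is a minimal sketch of such a cross-lingual comparison. It assumes that the English and German DBpedia endpoints (http://dbpedia.org/sparql and http://de.dbpedia.org/sparql) are reachable and support SPARQL 1.1 federation, that both editions use the dbo:populationTotal property, and that Leipzig is merely an example resource; an actual fusion framework would of course be more involved.

PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

# Compare the population value extracted by the English and German editions
SELECT ?enPopulation ?dePopulation WHERE {
  <http://dbpedia.org/resource/Leipzig> dbo:populationTotal ?enPopulation ;
                                        owl:sameAs ?deResource .
  FILTER (STRSTARTS(STR(?deResource), "http://de.dbpedia.org/resource/"))
  SERVICE <http://de.dbpedia.org/sparql> {
    ?deResource dbo:populationTotal ?dePopulation .
  }
  # Report only cases where the two editions disagree
  FILTER (?enPopulation != ?dePopulation)
}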

Community-driven data quality improvement: In the future, we also aim to engage a larger community of DBpedia users in feedback loops, which help us to identify data quality problems and corresponding deficiencies of the DBpedia extraction framework. By constantly monitoring the data quality and integrating improvements into the mappings to the DBpedia ontology, as well as fixes into the extraction framework, we aim to demonstrate that the Wikipedia community is not only capable of creating the largest encyclopedia, but also the most comprehensive and structured knowledge base. With the DBpedia quality evaluation campaign [45], we took a first step in this direction.

Inline extraction: Currently, DBpedia extracts information primarily from templates. In the future, we envision also extracting semantic information from typed links. Typed links are a feature of Semantic MediaWiki, which was backported and implemented as a very lightweight extension for MediaWiki53. If this extension is deployed at Wikipedia installations, this opens up completely new possibilities for more fine-grained and non-invasive knowledge representations and extraction from Wikipedia.

Collaboration between Wikidata and DBpedia: While DBpedia provides a comprehensive and current view on entity descriptions extracted from Wikipedia, Wikidata offers a variety of factual statements from different sources and dates. One of the richest sources of DBpedia is Wikipedia infoboxes, which are structured, but at the same time heterogeneous and non-standardized (thus making the extraction error-prone in certain cases). The aim of Wikidata is to populate infoboxes automatically from a centrally managed, high-quality fact database. In this regard, both projects complement each other nicely. In future versions, DBpedia will include more raw data provided by Wikidata and add services such as Linked Data/SPARQL endpoints, RDF dumps, linking and ontology mapping for Wikidata.

Feedback for Wikipedia: A promising prospect is that DBpedia can help to identify misrepresentations, errors and inconsistencies in Wikipedia. In the future, we plan to provide more feedback to the Wikipedia community about the quality of Wikipedia. This can, for instance, be achieved in the form of sanity checks, which are implemented as SPARQL queries on the DBpedia Live endpoint, identify data quality issues and are executed at certain intervals. For example, a query could check that the birthday of a person must always be before the death day, or spot outliers that differ significantly from the range of the majority of the other values. In case a Wikipedia editor makes a mistake or typo when adding such information to a page, this could be automatically identified and provided as feedback to Wikipedians [22].
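
A minimal sketch of such a sanity check is given below. It assumes the DBpedia Live SPARQL endpoint and the dbo:birthDate and dbo:deathDate properties; the queries used in an actual monitoring setup may differ.

PREFIX dbo: <http://dbpedia.org/ontology/>

# Persons whose recorded death date precedes their birth date --
# candidates for extraction errors or typos in Wikipedia
SELECT ?person ?birth ?death WHERE {
  ?person a dbo:Person ;
          dbo:birthDate ?birth ;
          dbo:deathDate ?death .
  FILTER (?death < ?birth)
}
LIMIT 100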

53 http://www.mediawiki.org/wiki/Extension:LightweightRDFa


Integrate DBpedia and NLP: Despite recent advances (cf. Section 7), there is still huge potential for employing Linked Data background knowledge in various Natural Language Processing (NLP) tasks. One very promising research avenue in this regard is to employ DBpedia as structured background knowledge for named entity recognition and disambiguation. Currently, most approaches use statistical information such as co-occurrence for named entity disambiguation. However, co-occurrence is not always easy to determine (it depends on training data) and to update (it requires recomputation). With DBpedia, and in particular DBpedia Live, we have comprehensive and evolving background knowledge comprising information on the relationships between a large number of real-world entities. Consequently, we can employ this information for deciding to which entity a certain surface form should be mapped.
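
As a simple illustration of this idea, the query below gathers candidate DBpedia entities for an ambiguous surface form via their labels and returns their ontology types as a first disambiguation signal. This is only a sketch assuming the public endpoint and English labels; production systems such as DBpedia Spotlight [31] additionally use context statistics and surface-form dictionaries.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Candidate entities for the surface form "Washington" and their DBpedia ontology types
SELECT ?candidate ?type WHERE {
  ?candidate rdfs:label "Washington"@en .
  OPTIONAL {
    ?candidate a ?type .
    FILTER (STRSTARTS(STR(?type), "http://dbpedia.org/ontology/"))
  }
}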

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Paul Kreis for DBpedia development
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– people who helped contributing data for certain parts of the article:

  * Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  * Freebase: Shawn Simister and Tom Morris (working at Freebase)
  * Instance Data Analysis: Volha Bryl (working at University of Mannheim)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.

[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.

[3] C. Becker and C. Bizer. Exploring the geospatial semantic web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.

[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.

[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.

[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. LodLive, exploring the web of data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 - 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.

[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.

[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.

[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.

[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.

[12] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.

[13] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.

[14] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.

[15] P. Heim, T. Ertl, and J. Ziegler. Facet Graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.

[16] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.

[17] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.

[18] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In Proc. of the Joint International Semantic Technology Conference, 2012.

[19] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.

[20] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.

[21] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.

[22] D. Kontokostas, S. Auer, S. Hellmann, J. Lehmann, P. Westphal, R. Cornelissen, and A. Zaveri. Test-driven data quality evaluation for SPARQL endpoints. In Submitted to 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.

[23] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of linked data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.

[24] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The open archives initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.

[25] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.

[26] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.

[27] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational linguistics, pages 495–501, 2000.

[28] V. López, M. Fernández, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the semantic web. Semantic Web, 3(3):249–265, 2012.

[29] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.

[30] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP - a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.

[31] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.

[32] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in GraphD. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.

[33] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.

[34] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.

[35] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web - Volume Part II, pages 205–220, Berlin, Heidelberg, 2011. Springer-Verlag.

[36] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. Int. J. of Metadata and Semantics and Ontologies, 3:37–52, Nov. 10 2008.

[37] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.

[38] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.

[39] F. M. Suchanek, G. Kasneci, and G. Weikum. YAGO: a core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.

[40] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.

[41] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648. ACM, 2012.

[42] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.

[43] F. Wu and D. Weld. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th World Wide Web Conference, 2008.

[44] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.

[45] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In To appear in Proceedings of 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.

[46] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

Appendix

A Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net/>
PREFIX an: <http://vocab.sindice.net/analytics#>
PREFIX dom: <http://sindice.com/dataspace/default/domain/>

SELECT ?Dataset ?Target_Dataset ?predicate
       SUM(xsd:long(?cardinality)) AS ?total
FROM <http://sindice.com/analytics>
{
  ?target any23:domain_uri ?Target_Dataset .

  ?edge an:source ?source ;
        an:target ?target ;
        an:label ?predicate ;
        an:cardinality ?cardinality ;
        an:publishedIn ?Dataset .

  ?source any23:domain_uri ?Dataset .

  FILTER (?Dataset != ?Target_Dataset)
}

Listing 1: SPARQL query on the Sindice summary graph for obtaining the number of incoming links.



Page 25: Semantic Web 1 (2012) 1–5 IOS Press DBpedia - A Large ... · Semantic Web 1 (2012) 1–5 1 IOS Press DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia

Lehmann et al DBpedia 25

munity itself whereas this task is facilitated by expert-designed declarative rules in YAGO2

The two knowledge bases are connected eg DBpediaoffers the YAGO type hierarchy as an alternative to theDBpedia ontology and sameAs links are provided inboth directions While the underlying systems are verydifferent both projects share similar aims and posi-tively complement and influence each other

82 Knowledge Extraction from Wikipedia

Since its official start in 2001 Wikipedia has alwaysbeen the target of automatic extraction of informationdue to its easy availability open license and encyclo-pedic knowledge A large number of parsers scraperprojects and publications exist In this section we re-strict ourselves to approaches that are either notablerecent or pertinent to DBpedia MediaWikiorg main-tains an up-to-date list of software projects45 who areable to process wiki syntax as well as a list of dataextraction extensions46 for MediaWiki

JWPL (Java Wikipedia Library [46]) is an open-source Java-based API that allows to access informa-tion provided by the Wikipedia API (redirects cate-gories articles and link structure) JWPL contains aMediaWiki Markup parser that can be used to furtheranalyze the contents of a Wikipedia page Data is alsoprovided as XML dump and is incorporated in the lex-ical resource UBY47 for language tools

Several different approaches to extract knowledgefrom Wikipedia are presented in [34] Given featureslike anchor texts interlanguage links category linksand redirect pages are utilized eg for word-sense dis-ambiguations or synonyms translations taxonomic re-lations and abbreviation or hypernym resolution re-spectively Apart from this link structures are used tobuild the Wikipedia Thesaurus Web service48 Addi-tional projects presented by the authors that exploit thementioned features are listed on the Special InterestGroup on Wikipedia Mining (SIGWP) Web site49

An earlier approach to improve the quality of theinfobox schemata and contents is described in [44]

45httpwwwmediawikiorgwikiAlternative_parsers

46httpwwwmediawikiorgwikiExtension_Matrixdata_extraction

47httpwwwukptu-darmstadtdedatalexical-resourcesuby

48httpsigwporgenindexphpWikipedia_Thesaurus

49httpsigwporgen

The presented methodology encompasses a three stepprocess of preprocessing classification and extractionDuring preprocessing refined target infobox schemataare created applying statistical methods and trainingsets are extracted based on real Wikipedia data Afterassigning a class and the corresponding target schema(classification) the training sets are used to extract tar-get infobox values from the documentrsquos text applyingmachine learning algorithms

The idea of using structured data from certainmarkup structures was also applied to other user-drivenWeb encyclopedias In [35] the authors describe theireffort building an integrated Chinese Linking OpenData (CLOD) source based on the Chinese Wikipediaand the two widely used and large encyclopedias BaiduBaike50 and Hudong Baike51 Apart from utilizing Me-diaWiki and HTML Markup for the actual extractionthe Wikipedia interlanguage links were used to link theCLOD source to the English DBpedia

A more generic approach to achieve a better cross-lingual knowledge-linkage beyond the use of Wikipediainterlanguage links is presented in [42] Focusing onwiki knowledge bases the authors introduce their so-lution based on structural properties like similar link-age structures the assignment to similar categoriesand similar interests of the authors of wiki documentsin the considered languages Since this approach islanguage-feature-agnostic it is not restricted to certainlanguages

KnowItAll52 is a web scale knowledge extraction ef-fort which is domain-independent and uses genericextraction rules co-occurrence statistics and NaiveBayes classification [11] Cyc [26] is a large com-mon sense knowledge base which is now partially re-leased as OpenCyc and also available as an OWL on-tology OpenCyc is linked to DBpedia which providesan ontological embedding in its comprehensive struc-tures WikiTaxonomy [37] is a large taxonomy de-rived from categories in Wikipedia by classifying cate-gories as instances or classes and deriving a subsump-tion hierarchy The KOG system [43] refines existingWikipedia infoboxes based on machine learning tech-niques using both SVMs and a more powerful joint-inference approach expressed in Markov Logic Net-works KYLIN [44] is a system which autonomouslyextracts structured data from Wikipedia and uses self-

50httpbaikebaiducom51httpwwwhudongcom52httpwwwcswashingtoneduresearch

knowitall

26 Lehmann et al DBpedia

supervised linking [2] was an infobox extraction ap-proach for Wikipedia which later became the DBpediaproject

9 Conclusions and Future Work

In this system report we presented an overview onrecent advances of the DBpedia community projectThe technical innovations described in this articleincluded in particular (1) the extraction based onthe community-curated DBpedia ontology (2) thelive synchronisation of DBpedia with Wikipedia andDBpedia mirrors through update propagation and (3)the facilitation of the internationalisation of DBpediaAs a result we demonstrated that DBpedia maturedand improved significantly in the last years in particu-lar also in terms of coverage usability and data qual-ity

With DBpedia we also aim to provide a proof-of-concept and blueprint for the feasibility of large-scale knowledge extraction from crowd-sourced con-tent repositories There are a large number of furthercrowd-sourced content repositories and DBpedia al-ready had an impact on their structured data publishingand interlinking Two examples are Wiktionary withthe Wiktionary extraction [18] meanwhile becomingpart of DBpedia and LinkedGeoData [38] which aimsto implement similar data extraction publishing andlinking strategies for OpenStreetMaps

In the future we see in particular the following di-rections for advancing the DBpedia project

Multilingual data integration and fusion An areawhich is still largely unexplored is the integration andfusion between different DBpedia language editionsNon-English DBpedia editions comprise a better anddifferent coverage of local culture When we are ableto precisely identify equivalent overlapping and com-plementary parts in different DBpedia language edi-tions we can reach a significantly increased coverageOn the other hand comparing the values of a specificproperty between different language editions will helpus to spot extraction errors as well as wrong or out-dated information in Wikipedia

Community-driven data quality improvement In thefuture we also aim to engage a larger community ofDBpedia users in feedback loops which help us toidentify data quality problems and corresponding de-ficiencies of the DBpedia extraction framework Byconstantly monitoring the data quality and integratingimprovements into the mappings to the DBpedia on-

tology as well as fixes into the extraction frameworkwe aim to demonstrate that the Wikipedia communityis not only capable to create the largest encyclopediabut also the most comprehensive and structured knowl-edge base With the DBpedia quality evaluation cam-paign [45] we were going a first step in this direction

Inline extraction Currently DBpedia extracts infor-mation primarily from templates In the future we en-vision to also extract semantic information from typedlinks Typed links is a feature of Semantic MediaWikiwhich was backported and implemented as a verylightweight extension for MediaWiki53 If this exten-sion is deployed at Wikipedia installations this opensup completely new possibilities for more fine-grainedand non-invasive knowledge representations and ex-traction from Wikipedia

Collaboration between Wikidata and DBpedia WhileDBpedia provides a comprehensive and current viewon entity descriptions extracted from Wikipedia Wiki-data offers a variety of factual statements from dif-ferent sources and dates One of the richest sourcesof DBpedia are Wikipedia infoboxes which are struc-tured but at the same time heterogeneous and non-standardized (thus making the extraction error prone incertain cases) The aim of Wikidata is to populate in-foboxes automatically from a centrally managed high-quality fact database In this regard both projects com-plement each other nicely In future versions DBpediawill include more raw data provided by Wikidata andadd services such as Linked DataSPARQL endpointsRDF dumps linking and ontology mapping for Wiki-data

Feedback for Wikipedia A promising prospect isthat DBpedia can help to identify misrepresentationserrors and inconsistencies in Wikipedia In the futurewe plan to provide more feedback to the Wikipediacommunity about the quality of Wikipedia This canfor instance be achieved in the form of sanity checkswhich are implemented as SPARQL queries on theDBpedia Live endpoint which identify data quality is-sues and are executed in certain intervals For examplea query could check that the birthday of a person mustalways be before the death day or spot outliers that dif-fer significantly from the range of the majority of theother values In case a Wikipedia editor makes a mis-take or typo when adding such information to a pagethis could be automatically identified and provided asfeedback to Wikipedians [22]

53httpwwwmediawikiorgwikiExtensionLightweightRDFa

Lehmann et al DBpedia 27

Integrate DBpedia and NLP Despite recent ad-vances (cf Section 7) there is still a huge potentialfor employing Linked Data background knowledgein various Natural Language Processing (NLP) tasksOne very promising research avenue in this regard isto employ DBpedia as structured background knowl-edge for named entity recognition and disambiguationCurrently most approaches use statistical informationsuch as co-occurrence for named entity disambigua-tion However co-occurrence is not always easy todetermine (depends on training data) and update (re-quires recomputation) With DBpedia and in particu-lar DBpedia Live we have comprehensive and evolv-ing background knowledge comprising information onthe relationship between a large number of real-worldentities Consequently we can employ this informa-tion for deciding to what entity a certain surface formshould be mapped

Acknowledgment

We would like to acknowledge the support of the following people and organisations to DBpedia:

– all Wikipedia contributors
– OpenLink Software for providing and maintaining the server infrastructure for the main DBpedia endpoint
– Kingsley Idehen for SPARQL and Linked Data hosting and community support
– Paul Kreis for DBpedia development
– Christopher Sahnwaldt for DBpedia development and release management
– Claus Stadler for DBpedia development
– people who helped contributing data for certain parts of the article:
  ∗ Sindice analysis: Stéphane Campinas, Szymon Danielczyk and Gabriela Vulcu (working at DERI)
  ∗ Freebase: Shawn Simister and Tom Morris (working at Freebase)
  ∗ Instance Data Analysis: Volha Bryl (working at University of Mannheim)

This work was supported by grants from the European Union's 7th Framework Programme provided for the projects LOD2 (GA no. 257943), GeoKnow (GA no. 318159) and Dicode (GA no. 257184).

References

[1] S. Auer, C. Bizer, G. Kobilarov, J. Lehmann, R. Cyganiak, and Z. Ives. DBpedia: A nucleus for a web of open data. In Proceedings of the 6th International Semantic Web Conference (ISWC), volume 4825 of Lecture Notes in Computer Science, pages 722–735. Springer, 2008.
[2] S. Auer and J. Lehmann. What have Innsbruck and Leipzig in common? Extracting semantics from wiki content. In Proceedings of the ESWC (2007), volume 4519 of Lecture Notes in Computer Science, pages 503–517. Springer, 2007.
[3] C. Becker and C. Bizer. Exploring the geospatial semantic web with DBpedia Mobile. J. Web Sem., 7(4):278–286, 2009.
[4] D. Bikel, V. Castelli, R. Florian, and D.-J. Han. Entity linking and slot filling through statistical processing and inference rules. In Proceedings TAC Workshop, 2009.
[5] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia – a crystallization point for the web of data. Journal of Web Semantics, 7(3):154–165, 2009.
[6] E. Cabrio, J. Cojan, A. P. Aprosio, B. Magnini, A. Lavelli, and F. Gandon. QAKiS: an open domain QA system based on relational patterns. In ISWC-PD International Semantic Web Conference (Posters & Demos), volume 914 of CEUR Workshop Proceedings. CEUR-WS.org, 2012.
[7] D. V. Camarda, S. Mazzini, and A. Antonuccio. Lodlive, exploring the web of data. In V. Presutti and H. S. Pinto, editors, I-SEMANTICS 2012 – 8th International Conference on Semantic Systems, I-SEMANTICS '12, Graz, Austria, September 5-7, 2012, pages 197–200. ACM, 2012.
[8] S. Campinas, T. E. Perry, D. Ceccarelli, R. Delbru, and G. Tummarello. Introducing RDF graph summary with application to assisted SPARQL formulation. In Database and Expert Systems Applications (DEXA), 2012 23rd International Workshop on, pages 261–266. IEEE, 2012.
[9] D. Damljanovic, M. Agatonovic, and H. Cunningham. FREyA: An interactive way of querying linked data using natural language. In The Semantic Web: ESWC 2011 Workshops, pages 125–138. Springer, 2012.
[10] O. Erling and I. Mikhailov. RDF support in the Virtuoso DBMS. In S. Auer, C. Bizer, C. Müller, and A. V. Zhdanova, editors, CSSW, volume 113 of LNI, pages 59–68. GI, 2007.
[11] O. Etzioni, M. Cafarella, D. Downey, S. Kok, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. Web-scale information extraction in KnowItAll. In Proceedings of the 13th international conference on World Wide Web, pages 100–110. ACM, 2004.
[12] A. García-Silva, M. Jakob, P. N. Mendes, and C. Bizer. Multipedia: enriching DBpedia with multimedia information. In Proceedings of the sixth international conference on Knowledge capture, K-CAP '11, pages 137–144, New York, NY, USA, 2011. ACM.
[13] A. García-Silva, M. Szomszor, H. Alani, and O. Corcho. Preliminary results in tag disambiguation using DBpedia. In 1st International Workshop in Collective Knowledge Capturing and Representation (CKCaR), California, USA, 2009.
[14] R. Hahn, C. Bizer, C. Sahnwaldt, C. Herta, S. Robinson, M. Bürgle, H. Düwiger, and U. Scheel. Faceted Wikipedia search. In W. Abramowicz and R. Tolksdorf, editors, Business Information Systems, 13th International Conference, BIS 2010, Berlin, Germany, May 3-5, 2010, Proceedings, volume 47 of Lecture Notes in Business Information Processing, pages 1–11. Springer, 2010.
[15] P. Heim, T. Ertl, and J. Ziegler. Facet graphs: Complex semantic querying made easy. In Proceedings of the 7th Extended Semantic Web Conference (ESWC 2010), volume 6088 of LNCS, pages 288–302, Berlin/Heidelberg, 2010. Springer.
[16] P. Heim, S. Hellmann, J. Lehmann, S. Lohmann, and T. Stegemann. RelFinder: Revealing relationships in RDF knowledge bases. In T.-S. Chua, Y. Kompatsiaris, B. Merialdo, W. Haas, G. Thallinger, and W. Bailer, editors, Semantic Multimedia, 4th International Conference on Semantic and Digital Media Technologies, SAMT 2009, Graz, Austria, December 2-4, 2009, Proceedings, volume 5887 of Lecture Notes in Computer Science, pages 182–187. Springer, 2009.
[17] P. Heim, S. Lohmann, D. Tsendragchaa, and T. Ertl. SemLens: visual analysis of semantic data with scatter plots and semantic lenses. In C. Ghidini, A.-C. N. Ngomo, S. N. Lindstaedt, and T. Pellegrini, editors, Proceedings of the 7th International Conference on Semantic Systems, I-SEMANTICS 2011, Graz, Austria, September 7-9, 2011, ACM International Conference Proceeding Series, pages 175–178. ACM, 2011.
[18] S. Hellmann, J. Brekle, and S. Auer. Leveraging the crowdsourcing of lexical resources for bootstrapping a linguistic data cloud. In Proc. of the Joint International Semantic Technology Conference, 2012.
[19] S. Hellmann, C. Stadler, J. Lehmann, and S. Auer. DBpedia live extraction. In Proc. of 8th International Conference on Ontologies, DataBases, and Applications of Semantics (ODBASE), volume 5871 of Lecture Notes in Computer Science, pages 1209–1223, 2009.
[20] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: A spatially and temporally enhanced knowledge base from Wikipedia. Artif. Intell., 194:28–61, 2013.
[21] H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. Overview of the TAC 2010 knowledge base population track. In Third Text Analysis Conference (TAC 2010), 2010.
[22] D. Kontokostas, S. Auer, S. Hellmann, J. Lehmann, P. Westphal, R. Cornelissen, and A. Zaveri. Test-driven data quality evaluation for SPARQL endpoints. In Submitted to 12th International Semantic Web Conference, 21-25 October 2013, Sydney, Australia, 2013.
[23] D. Kontokostas, C. Bratsas, S. Auer, S. Hellmann, I. Antoniou, and G. Metakides. Internationalization of linked data: The case of the Greek DBpedia edition. Web Semantics: Science, Services and Agents on the World Wide Web, 15(0):51–61, 2012.
[24] C. Lagoze, H. V. de Sompel, M. Nelson, and S. Warner. The Open Archives Initiative protocol for metadata harvesting. http://www.openarchives.org/OAI/openarchivesprotocol.html, 2008.
[25] J. Lehmann, S. Monahan, L. Nezda, A. Jung, and Y. Shi. LCC approaches to knowledge base population at TAC 2010. In Proceedings TAC Workshop, 2010.
[26] D. B. Lenat. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38, 1995.
[27] C.-Y. Lin and E. H. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the 18th conference on Computational linguistics, pages 495–501, 2000.
[28] V. López, M. Fernández, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the semantic web. Semantic Web, 3(3):249–265, 2012.
[29] M. Martin, C. Stadler, P. Frischmuth, and J. Lehmann. Increasing the financial transparency of European Commission project funding. Semantic Web Journal, Special Call for Linked Dataset descriptions, 2013.
[30] P. N. Mendes, M. Jakob, and C. Bizer. DBpedia for NLP – a multilingual cross-domain knowledge base. In Proceedings of the International Conference on Language Resources and Evaluation (LREC), Istanbul, Turkey, 2012.
[31] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer. DBpedia Spotlight: Shedding light on the web of documents. In Proceedings of the 7th International Conference on Semantic Systems (I-Semantics), Graz, Austria, 2011.
[32] S. M. Meyer, J. Degener, J. Giannandrea, and B. Michener. Optimizing schema-last tuple-store queries in graphd. In A. K. Elmagarmid and D. Agrawal, editors, SIGMOD Conference, pages 1047–1056. ACM, 2010.
[33] M. Morsey, J. Lehmann, S. Auer, C. Stadler, and S. Hellmann. DBpedia and the live extraction of structured data from Wikipedia. Program: electronic library and information systems, 46:27, 2012.
[34] K. Nakayama, M. Pei, M. Erdmann, M. Ito, M. Shirakawa, T. Hara, and S. Nishio. Wikipedia mining: Wikipedia as a corpus for knowledge extraction. In Annual Wikipedia Conference (Wikimania), 2008.
[35] X. Niu, X. Sun, H. Wang, S. Rong, G. Qi, and Y. Yu. Zhishi.me: Weaving Chinese linking open data. In 10th International Conference on The Semantic Web – Volume Part II, pages 205–220, Berlin, Heidelberg, 2011. Springer-Verlag.
[36] E. Oren, R. Delbru, M. Catasta, R. Cyganiak, H. Stenzhorn, and G. Tummarello. Sindice.com: a document-oriented lookup index for open linked data. Int. J. of Metadata, Semantics and Ontologies, 3:37–52, Nov. 10 2008.
[37] S. P. Ponzetto and M. Strube. WikiTaxonomy: A large scale knowledge resource. In M. Ghallab, C. D. Spyropoulos, N. Fakotakis, and N. M. Avouris, editors, ECAI, volume 178 of Frontiers in Artificial Intelligence and Applications, pages 751–752. IOS Press, 2008.
[38] C. Stadler, J. Lehmann, K. Höffner, and S. Auer. LinkedGeoData: A core for a web of spatial open data. Semantic Web Journal, 3(4):333–354, 2012.
[39] F. M. Suchanek, G. Kasneci, and G. Weikum. Yago: a core of semantic knowledge. In C. L. Williamson, M. E. Zurko, P. F. Patel-Schneider, and P. J. Shenoy, editors, WWW, pages 697–706. ACM, 2007.
[40] E. Tacchini, A. Schultz, and C. Bizer. Experiments with Wikipedia cross-language data fusion. In Proceedings of the 5th Workshop on Scripting and Development for the Semantic Web, ESWC. Citeseer, 2009.
[41] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In Proceedings of the 21st international conference on World Wide Web, pages 639–648. ACM, 2012.
[42] Z. Wang, J. Li, Z. Wang, and J. Tang. Cross-lingual knowledge linking across wiki knowledge bases. In Proceedings of the 21st international conference on World Wide Web, pages 459–468, New York, NY, USA, 2012. ACM.
[43] F. Wu and D. Weld. Automatically refining the Wikipedia infobox ontology. In Proceedings of the 17th World Wide Web Conference, 2008.
[44] F. Wu and D. S. Weld. Autonomously semantifying Wikipedia. In Proceedings of the 16th Conference on Information and Knowledge Management, pages 41–50. ACM, 2007.
[45] A. Zaveri, D. Kontokostas, M. A. Sherif, L. Bühmann, M. Morsey, S. Auer, and J. Lehmann. User-driven quality evaluation of DBpedia. In To appear in Proceedings of the 9th International Conference on Semantic Systems, I-SEMANTICS '13, Graz, Austria, September 4-6, 2013. ACM, 2013.
[46] T. Zesch, C. Müller, and I. Gurevych. Extracting lexical semantic knowledge from Wikipedia and Wiktionary. In Proceedings of the 6th International Conference on Language Resources and Evaluation, Marrakech, Morocco, May 2008. Electronic proceedings.

Appendix

A Sindice Analysis

The following query was used to analyse the incoming links of all datasets crawled by Sindice, including DBpedia. The query runs against the Sindice Summary Graph [8], which contains statistics over all Sindice datasets. Due to the complexity of the query, it was run as a Hadoop job on the Sindice cluster.

PREFIX any23: <http://vocab.sindice.net>
PREFIX an: <http://vocab.sindice.net/analytics>
PREFIX dom: <http://sindice.com/dataspace/default/domain>
SELECT ?Dataset ?Target_Dataset ?predicate
       SUM(xsd:long(?cardinality)) AS ?total
FROM <http://sindice.com/analytics> {
  ?target any23:domain_uri ?Target_Dataset .
  ?edge an:source ?source ;
        an:target ?target ;
        an:label ?predicate ;
        an:cardinality ?cardinality ;
        an:publishedIn ?Dataset .
  ?source any23:domain_uri ?Dataset .
  FILTER (?Dataset != ?Target_Dataset)
}

Listing 1: SPARQL query on the Sindice summary graph for obtaining the number of incoming links.

