
04/12/23

Information Extraction & Linked Data Cloud

Dr. Dhaval Thakker KTP Research Associate

Press Association Images & Nottingham Trent University

© Dhaval Thakker, Press Association, Nottingham Trent University

Outline

• Press Association & its operations
• Introduction to the Semantic Technology Project at PA Images
• IE and Knowledge base systems
• Semantic Web browsing
• Problem of generating Knowledge bases
• Introduction to Linked Data Cloud (LDC)
• How do we use LDC
• Current and Future Work
• Conclusions

Press Association (pressassociation.com)

Sections: Background, Semantic Web project, Knowledge base, Conclusions

Press Association & its operations
• UK’s leading multimedia news & information provider
• Core News Agency operation
• Editorial services: sports data, entertainment guides, weather forecasting, photo syndication

Free-text versus Semantic Approach

Free-Text
• Lack of structure
• Have to rely on the annotator to provide all possible keywords
• Repetitive annotation effort
• Low accuracy

Semantic
• Adds structure: concepts and relationships
• Provides inference (implicit reasoning) capacity
• Accurate results
• “Related” and “similarity”-based browsing


… the Semantic Web

The Web was “invented” by Tim Berners-Lee (amongst others), a physicist working at CERN

“The next generation WWW is a Web in which machines can converse in a meaningful way, rather than a web limited to humans requesting HTML pages.”

Tim Berners-Lee

… need to add “Semantics”

Use ontologies (dictionaries of terms) to help computers understand the meaning (semantics) of domain concepts


PA Images Workflow

[Workflow diagram; recoverable labels:]
• Agency/Photographers: provide minimum metadata in IPTC
• Images with metadata are passed to Captioners for batch processing
• Captioners: modify existing and add new metadata
• Information Extraction
• Storage & Browsing: semantic structure
• Other labels: Metadata, Company, Website

Utilisation of Semantic Technologies for Intelligent Indexing and Retrieval of PA Images photo Collection

Development of a comprehensive semantic-based taxonomy for PA Images domains of News, Entertainment and Sports.

Design and implementation of a web-based and semantics-transparent annotation tool.

Design and development of software programmes to semi-automate the annotation of legacy data.

Development of semantically-enabled search technology, specifically tailored for the PA Photos Image Retrieval engine.


Text Mining System Overview

[System diagram; recoverable labels:]
• Input: images with captions
• GATE-based IE System: Gazetteer (known entities), JAPE Grammar (context rules), Disambiguation/Summarisation
• Output: entities of interest, annotated image captions
• PA KB: what to store, what to extract, confirmation
• Linked Data Cloud
• Schema: PA Images ontology, PA Images view
• Other labels: Captions, Learned Facts
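The two extraction stages named in the diagram, a gazetteer of known entities followed by context rules, can be illustrated with a minimal sketch. This is plain Python, not GATE or JAPE; the gazetteer entries, the role-word rule, the function name `extract_entities` and the sample caption are all made up for illustration.

```python
import re

# Hypothetical gazetteer: surface form -> entity type (entries are illustrative).
GAZETTEER = {
    "David Beckham": "Person",
    "Wembley Stadium": "Location",
    "Manchester United": "Organisation",
}

# Hypothetical context rule: a capitalised name that follows a role word
# is taken to be a Person (a stand-in for a JAPE-style grammar rule).
ROLE_RULE = re.compile(r"\b(?:referee|manager|striker)\s+((?:[A-Z][a-z]+\s?){1,3})")


def extract_entities(caption):
    """Return a sorted list of (text, type, source) tuples found in a caption."""
    found = set()
    # Stage 1: gazetteer lookup for entities that are already known.
    for name, entity_type in GAZETTEER.items():
        if name in caption:
            found.add((name, entity_type, "gazetteer"))
    # Stage 2: context rule for names the gazetteer does not contain.
    for match in ROLE_RULE.finditer(caption):
        name = match.group(1).strip()
        if name not in GAZETTEER:
            found.add((name, "Person", "rule"))
    return sorted(found)


# Made-up caption, used only to exercise both stages.
caption = "Manchester United's David Beckham talks to referee Graham Poll at Wembley Stadium"
entities = extract_entities(caption)
```

In this sketch, names found only by the rule are the natural candidates for the "confirmation" and "learned facts" labels in the diagram, although the slides do not say how that step is implemented.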

PA Images Ontology (OWL)


Knowledge base (KB)

Ontology (schema)

Royalty (Royal Family)
‒ name
‒ relationship
    Type 1
    ‒ Spouse
    ‒ From
    ‒ To
    Type 2
    ‒ Partner
    ‒ From
    ‒ To
‒ predecessor
‒ successor
‒ father
‒ mother
‒ Title

Data

Royalty (Henry VIII)
‒ name (Tudor, Henry / Henry VIII of England)
‒ relationship
    Spouse (Anne Boleyn)
    Spouse (Catherine Parr)
    Spouse (Jane Seymour)
    Spouse (Anne of Cleves)
    Spouse (Catherine Howard)
    Spouse (Catherine of Aragon)
‒ Predecessor (Henry VII)
‒ Successor (Edward VI)
‒ Father (Henry VII of England)
‒ Mother (Elizabeth of York)
‒ Title (king of England and Ireland)
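The split between schema and data can be sketched in Python, with dataclasses playing the role of the ontology and one instance playing the role of the data. The class and field names (`Relationship`, `Royalty`, `from_date`, `to_date`) are assumptions chosen to mirror the slide, not the actual PA Images ontology, and the From/To dates are left empty because the slide does not give them.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Relationship:
    kind: str                         # "Spouse" (Type 1) or "Partner" (Type 2)
    person: str
    from_date: Optional[str] = None   # "From" in the schema; not given on the slide
    to_date: Optional[str] = None     # "To" in the schema; not given on the slide


@dataclass
class Royalty:
    names: List[str]
    relationships: List[Relationship] = field(default_factory=list)
    predecessor: Optional[str] = None
    successor: Optional[str] = None
    father: Optional[str] = None
    mother: Optional[str] = None
    title: Optional[str] = None


# The data from the slide, expressed as one instance of the schema.
henry = Royalty(
    names=["Tudor, Henry", "Henry VIII of England"],
    relationships=[
        Relationship("Spouse", name)
        for name in [
            "Anne Boleyn", "Catherine Parr", "Jane Seymour",
            "Anne of Cleves", "Catherine Howard", "Catherine of Aragon",
        ]
    ],
    predecessor="Henry VII",
    successor="Edward VI",
    father="Henry VII of England",
    mother="Elizabeth of York",
    title="king of England and Ireland",
)

# A structured query that free-text keywords cannot answer reliably.
spouses = [r.person for r in henry.relationships if r.kind == "Spouse"]
```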


Scale of Things for KB

Emphasis on: People, Places, Organisations, Events

About 50 types of sports
• Their events
• Types of people in these sports (referees, players, etc.)
• Types of locations for these sports
• Variety of teams for these sports
• And the relationships between all of them!

Similarly for Entertainment and News


Outsourcing KB – Linked Data Cloud (LDC)

• Where do we get all this knowledge from?
• We don’t want it in free-text form but in a semantic structure
• It has to be comprehensive and accurate
• Free, open, extractable, evolving
• Uniform Resource Identifiers (URIs) and the Resource Description Framework (RDF) language are at the heart of Linked Open Data (LOD)


Linked Data

“The term Linked Data is used to describe a method of exposing, sharing, and connecting data via dereferenceable URIs on the Web”

“The Semantic Web isn't just about putting data on the web. It is about making links, so that a person or machine can explore the web of data. With linked data, when you have some of it, you can find other, related, data.”


Linked Data cloud

31/03/2008


DBpedia

• Epicentre of the Linked Data Cloud
• Generated primarily from the Wikipedia infoboxes and improved with linkage to other sources in the cloud
• The DBpedia knowledge base currently describes more than 2.6 million things, including at least 213,000 persons, 328,000 places, 57,000 music albums, 36,000 films and 20,000 companies
• Many organisations and researchers are using it


Linking Open Data Community

Community effort to
• Publish existing open-license datasets as Linked Data on the Web
• Interlink things between different data sources
• Develop clients that consume Linked Data from the Web


Organizations participating in the LOD community

Companies

•Press Association (UK)

•New York Times (USA)

•Thomson Reuters (USA) – OpenCalais

•BBC (UK) – Music Beta website, BBC Earth

• MusicBrainz

• Yahoo Microsearch

• OpenLink (UK)

• Talis (UK)

• Zitgist (USA)

• Garlik (UK)

• Mondeca (FR)

• Renault (FR)

• Boab Interactive (AUS)

• … others who are indirect consumers

Universities and Research Institutes

• Massachusetts Institute of Technology (USA)
• University of Southampton (UK)
• DERI (IRE)
• KMi, Open University (UK)
• University of London (UK)
• Universität Hannover (DE)
• University of Pennsylvania (USA)
• Universität Leipzig (DE)
• Universität Karlsruhe (DE)
• Joanneum (AT)
• Freie Universität Berlin (DE)
• Cyc Foundation (USA)
• SouthEast University (CN)


Interested in Linking up?

1. Use URIs as names for things

2. Use HTTP URIs so that people can look up those names

3. When someone looks up a URI, provide useful RDF information

4. Include RDF statements that link to other URIs so that they can discover related things

Tim Berners-Lee 2007 http://www.w3.org/DesignIssues/LinkedData.html
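Rules 2 and 3 amount to an ordinary HTTP lookup in which the client asks for RDF rather than HTML. A minimal sketch using only the Python standard library is given below; the helper name `rdf_request` is hypothetical, the DBpedia resource URI is only an example, and the request is built but not sent so that the sketch runs offline.

```python
from urllib.request import Request


def rdf_request(uri):
    """Build (but do not send) an HTTP GET request that asks for RDF/XML."""
    # The Accept header is how a client signals that it wants RDF, not HTML.
    return Request(uri, headers={"Accept": "application/rdf+xml"})


# Example URI only; any dereferenceable HTTP URI naming a thing would do.
req = rdf_request("http://dbpedia.org/resource/Nottingham")

# Passing req to urllib.request.urlopen would perform the actual lookup;
# that call is omitted here so the sketch needs no network access.
method = req.get_method()
accept = req.get_header("Accept")
```

Rule 4 is then satisfied by the returned RDF itself, which should contain statements pointing at further URIs that can be looked up in the same way.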


Our approach for LDC utilisation

Why not DBpedia as it is?
• It contains a great deal of noisy data; if we stored it as it is, the storage requirement would be huge
• DBpedia is less formally structured
• The data quality is too low for production-scale use, and there are some inconsistencies within DBpedia
• We have our own domains and our own view of them

Our approach combines the advantages of both worlds: we interlink DBpedia with hand-crafted ontologies such as the PA Images ontology, which enables applications to use the formal knowledge from these ontologies together with the data from DBpedia.


Linked Data Cloud

Ontology Mapping - Map the ontology and the data will follow

[Mapping diagram: the PA Images Ontology is linked by sameAs to DBpedia, YAGO, Geonames and other sources; the links yield the knowledge base/data for our ontology, and similar entities and their features.]


SPARQL CONSTRUCT

PREFIX dbpedia-ont: <http://dbpedia.org/ontology/>
PREFIX db: <http://dbpedia.org/>
PREFIX pa: <http://localhost/pa/images/media/entities.owl#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

# CONSTRUCT template: PA Images ontology
CONSTRUCT {
  ?newLoc a pa:City .
  ?newLoc pa:locationName ?name .
  ?newLoc pa:latitutedegrees ?lat
}
# WHERE pattern: DBpedia
WHERE {
  ?newLoc a dbpedia-ont:City .
  ?newLoc foaf:name ?name .
  ?newLoc dbpedia-ont:latitutedegrees ?lat
}
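What the CONSTRUCT query does can be emulated on a handful of in-memory triples. The sketch below is plain Python rather than a SPARQL engine; the prefixed names are kept as ordinary strings, the sample triples (including the latitude value) are made up, and `construct_cities` is a hypothetical helper name.

```python
RDF_TYPE = "rdf:type"

# Made-up source triples (subject, predicate, object) standing in for DBpedia.
source = [
    ("db:Nottingham", RDF_TYPE, "dbpedia-ont:City"),
    ("db:Nottingham", "foaf:name", "Nottingham"),
    ("db:Nottingham", "dbpedia-ont:latitutedegrees", "52"),
    ("db:Trent", RDF_TYPE, "dbpedia-ont:River"),
    ("db:Trent", "foaf:name", "River Trent"),
]


def construct_cities(triples):
    """Emulate the CONSTRUCT query: rewrite DBpedia cities into pa: terms."""
    result = set()  # a CONSTRUCT result is a graph, i.e. a set of triples
    cities = {s for s, p, o in triples
              if p == RDF_TYPE and o == "dbpedia-ont:City"}
    for city in cities:
        names = [o for s, p, o in triples
                 if s == city and p == "foaf:name"]
        lats = [o for s, p, o in triples
                if s == city and p == "dbpedia-ont:latitutedegrees"]
        # As in the WHERE clause, every pattern must match: a city without
        # a name or without a latitude contributes nothing to the output.
        for name in names:
            for lat in lats:
                result.add((city, RDF_TYPE, "pa:City"))
                result.add((city, "pa:locationName", name))
                result.add((city, "pa:latitutedegrees", lat))
    return sorted(result)


pa_triples = construct_cities(source)
```

The river is dropped because it does not match the city pattern, which is the sense in which mapping the ontology lets the data follow.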


Has City -> City Of Country

PREFIX dbpedia-ont: <http://dbpedia.org/ontology/>
PREFIX pa: <http://localhost/pa/images/media/entities.owl#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX db-prop: <http://dbpedia.org/property/>

CONSTRUCT {
  ?newLoc a pa:City .
  ?newLoc pa:cityOfCountry ?country .
  ?newLoc pa:locationName ?name .
  ?country pa:hasCity ?newLoc
}
WHERE {
  ?newLoc a dbpedia-ont:City .
  ?newLoc db-prop:subdivisionName ?country .
  ?country a <http://dbpedia.org/ontology/Country> .
  ?newLoc foaf:name ?name
}
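The query writes the city-to-country link in both directions, pa:cityOfCountry and pa:hasCity. A small Python sketch of deriving the inverse direction from existing triples is given below; `add_inverse` is a hypothetical helper and the sample triples are made up.

```python
def add_inverse(triples, prop, inverse):
    """Return the triples plus an inverse triple for every use of `prop`."""
    inverted = {(o, inverse, s) for s, p, o in triples if p == prop}
    return set(triples) | inverted


# Made-up triples in the PA Images vocabulary.
cities = {
    ("db:Nottingham", "pa:cityOfCountry", "db:United_Kingdom"),
    ("db:Leeds", "pa:cityOfCountry", "db:United_Kingdom"),
}

graph = add_inverse(cities, "pa:cityOfCountry", "pa:hasCity")

# Browsing from the country side now works without scanning every city.
uk_cities = sorted(o for s, p, o in graph
                   if s == "db:United_Kingdom" and p == "pa:hasCity")
```

Storing both directions, as the query does, trades some storage for simpler lookups when browsing from either the city or the country.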


People - Total > 200000

• Footballers -> 24k
• Cricketers -> 4k
• American Footballers -> 8k
• Actors -> 12k
• Music Artists -> 22k
• Baseball players -> 1200
• Basketball players -> 1200
• British Royalty -> 800
• Cyclists -> 2300
• Politicians -> 15k
• F1 Racing Drivers -> 1100
• …


Groups - Total > 50k

• National Football Teams -> 400
• Bands -> 16000
• Companies -> 24k
• Clubs -> 800


Work - Total > 200000

• Albums -> 80k
• Films -> 80k
• Singles -> 27k
• Books -> 17k
• …

And..
• Events -> 2000
• Locations -> 200000


Conclusions

• Linked data is very exciting
• The intention is that we move from a web of documents to a web of data: the Web as a database
• PA knowledge base generation using the Linked Data Cloud
• A complete product that utilises semantic technologies to lower the cost of annotation and improve the search experience


Acknowledgement

KTP Project, Press Association & Nottingham Trent University
