
Populating Ontologies for the Semantic Web

Alexiei Dingli

What’s the problem?

Towards a solution … (1)

Ask intelligent agents to do the job for us!!

But they don’t understand the WWW!!!

But there’s another way in which this can be achieved, by supplying the missing semantic information

For the Web to reach its full potential, it must evolve into a Semantic Web, providing a universally accessible platform that allows data to be shared and processed by automated tools as well as by people.

(W3C Web Guru)

Creating the Semantic Web !!

Why do many believe this solution will fail?

It requires lots of time and effort

It needs lots of people willing to do it

Not everyone can do it

Our approaches

• Active learning to reduce annotation burden
  • Supervised learning
  • Adaptive IE
  • The Melita methodology

• Automatic annotation of large repositories
  • Largely unsupervised
  • Armadillo

Adaptive IE

What is AIE?

• Performs the tasks of traditional IE
• Exploits the power of Machine Learning in order to adapt to
  • complex domains having large amounts of domain-dependent data
  • different sub-language features
  • different text genres
• Treats the usability and accessibility of the system as important

Amilcare

• Tool for adaptive IE from Web-related texts
• Specifically designed for document annotation
• Based on the (LP)² algorithm
  • Covering algorithm based on Lazy NLP
  • Trains with a limited amount of examples
  • Effective on different text types
    • free texts
    • semi-structured texts
    • structured texts
• Uses Gate and Annie for preprocessing

CMU: detailed results

             (LP)²   BWI    HMM    SRV    Rapier   Whisk
speaker      77.6    67.7   76.6   56.3   53.0     18.3
location     75.0    76.7   78.6   72.3   72.7     66.4
stime        99.0    99.6   98.5   98.5   93.4     92.6
etime        95.5    93.9   62.1   77.9   96.2     86.0
All Slots    86.0    83.9   82.0   77.1   77.3     64.9

(LP)²:
1. Best overall accuracy
2. Best result on speaker field
3. No results below 75%

General Architecture for Text Engineering (GATE)

• Provides a software infrastructure for researchers and developers working in NLP
• Contains
  • Tokeniser
  • Gazetteers
  • Sentence Splitter
  • POS Tagger
  • Semantic Tagger (ANNIE)
  • Orthographic Coreference
  • Pronominal Coreference
  • Multi-lingual support
  • Protégé
  • WEKA
  • many more exist and can be added
• http://www.gate.ac.uk

Annotation

Current practice of annotation for knowledge identification and extraction

• is time consuming
• needs annotation by experts
• is complex

Reduce the burden of text annotation for Knowledge Management

Different Annotation Systems

SGML, TEX, Xanadu, CoNote, ComMentor, JotBot, Third Voice, Annotate.net, The Annotation Engine, Visual Text, Alembic, Annotea, CritLink, The Gate Annotation Tool, iMarkup, MnM, S-CREAM, Yawas

Melita

• Tool for assisted automatic annotation
• Uses an Adaptive IE engine to learn how to annotate (no rule writing needed to adapt the system)
• Users: annotate document samples
• IE System:
  • Trains while users annotate
  • Generalizes over seen cases
  • Provides preliminary annotation for new documents
  • Performs smart ordering of documents
• Advantages
  • Annotates trivial or previously seen cases
  • Focuses slow/expensive user activity on unseen cases
  • User mainly validates extracted information
    • Simpler & less error prone / Speeds up corpus annotation
  • The system learns how to improve its capabilities

Methodology: Melita Bootstrap Phase

[Diagram labels: Bare Text; User Annotates; Amilcare Learns in background]

Methodology: Melita Checking Phase

[Diagram labels: Bare Text; User Annotates; Amilcare Annotates; Learning in background from missing tags, mistakes]

Methodology: Melita Support Phase

[Diagram labels: Bare Text; Amilcare Annotates; User Corrects; Corrections used to retrain]

Intrusivity

• An evolving system is difficult to control
• Goal:
  • Avoiding unwelcome/unreliable suggestions
  • Adapting proactivity to user’s needs
• Method:
  • Allow users to tune proactivity
  • Monitor user reactions to suggestions
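
One way to read "allow users to tune proactivity" and "monitor user reactions" is as a confidence threshold on suggestions that drifts with accepted and rejected suggestions. The sketch below is an assumption made for illustration, not Melita's actual mechanism: the names `Suggestion`, `filter_suggestions` and `adjust_threshold`, the availability of a confidence score, and the sample data are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    text: str          # the span the IE engine proposes to tag
    tag: str           # proposed tag, e.g. "speaker"
    confidence: float  # engine confidence in [0, 1] (assumed to be available)


def filter_suggestions(suggestions, threshold):
    """Show only suggestions at or above the user-tuned threshold."""
    return [s for s in suggestions if s.confidence >= threshold]


def adjust_threshold(threshold, accepted, step=0.05):
    """Monitor the user's reaction: lower the bar slightly after an accepted
    suggestion, raise it after a rejected one. The result stays in [0, 1]."""
    new = threshold - step if accepted else threshold + step
    return min(1.0, max(0.0, new))


# Made-up suggestions for a seminar announcement.
suggestions = [
    Suggestion("Fabio Ciravegna", "speaker", 0.92),
    Suggestion("Room 101", "location", 0.55),
    Suggestion("3pm", "stime", 0.80),
]
shown = filter_suggestions(suggestions, threshold=0.75)
print([s.text for s in shown])
```

A higher threshold makes the system less intrusive at the cost of fewer suggestions; the step size controls how quickly it reacts to the user.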

Smart ordering of Documents

[Diagram labels: Bare Text; User Annotates; Learns annotations; Tries to annotate all the documents and selects the document with partial annotations]
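
The smart-ordering idea (try to annotate every remaining document, then hand the user one that is only partially annotated) could be sketched as follows. The ranking rule is an assumption chosen for illustration: partially annotated documents come first, and among those the one with the fewest tags found. `pick_next_document` and the sample data are hypothetical.

```python
def pick_next_document(found_tags_by_doc, expected_tags):
    """Return the id of the document to show the user next.

    found_tags_by_doc maps a document id to the set of tag types the current
    engine managed to annotate in it. Documents where some, but not all,
    expected tags were found (partial annotations) are preferred.
    """
    expected = set(expected_tags)

    def rank(item):
        doc_id, found = item
        covered = len(expected & set(found))
        partial = 0 < covered < len(expected)
        # Sort key: partial documents first, then fewest tags found,
        # then document id so the result is deterministic.
        return (not partial, covered, doc_id)

    return min(found_tags_by_doc.items(), key=rank)[0]


found = {
    "doc1": {"speaker", "location", "stime", "etime"},  # fully annotated
    "doc2": {"stime"},                                   # partial
    "doc3": set(),                                       # nothing found
    "doc4": {"stime", "etime", "location"},              # partial
}
next_doc = pick_next_document(found, ["speaker", "location", "stime", "etime"])
print(next_doc)
```

Fully annotated documents teach the learner little, and documents with nothing found give it no foothold, which is why this sketch ranks both behind the partial ones.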

Methodology: Melita

[Screenshot labels: Ontology defining concepts; Control Panel; Document]

Results

Tag        Amount of texts needed for training   Prec   Rec
stime      20                                    84     63
etime      20                                    96     72
location   30                                    82     61
speaker    100                                   75     70

[Figure: Location tag. x-axis: training examples (0 to 150); series: original order vs. selected order]

Future Work

• Research better ways of annotating concepts in documents
• Optimise document ordering to maximise the discovery of new tags
• Allow users to edit the rules
• Learn to discover relationships!!
• Not only suggest but also correct user annotations!!

Annotation for the Semantic Web

• Semantic Web requires document annotation
• Current approaches
  • Manual (e.g. Ontomat) or semi-automatic (MnM, S-CREAM, Melita)
• BUT: manual/semi-automatic annotation of
  • large diverse repositories
  • containing different and sparse information
  is unfeasible
  • E.g. a Web site (So: 1,600 pages)

Redundancy

Information on the Web (or large repositories) is redundant

• Information repeated in different superficial formats
  • Databases/ontologies
  • Structured pages (e.g. produced by databases)
  • Largely structured pages (bibliography pages)
  • Unstructured pages (free texts)

Our Proposal

• Largely unsupervised annotation of documents
  • Based on Adaptive Information Extraction
  • Bootstrapped using redundancy of information
• Method
  • Use the structured information (easier to extract) to bootstrap learning on less structured sources (more difficult to extract)

Example: Extracting Bibliographies

• Mines web sites to extract bibliographies from personal pages
• Tasks:
  • Finding people’s names
  • Finding home pages
  • Finding personal bibliography pages
  • Extracting bibliographic references
• Sources
  • NE Recognition (Gate’s Annie)
  • Citeseer/Unitrier (largely incomplete bibliographies)
  • Google
  • Homepagesearch

AKT Reference Ontology

• Developed by the AKT partners
• Represents the knowledge used in the CS AKTive Portal testbed
• Consists of several sub-ontologies
• Available in several flavours …
  • DAML+OIL
  • OWL
• Has 9,000,000 RDF triples!!
• Available at
  • Ontology: http://www.aktors.org/publications/ontology/
  • RDF Triples: http://triplestore.aktors.org/

Mining Web sites (1)

• Mines the site looking for people’s names
• Uses
  • Generic patterns (NER)
  • Citeseer for likely bigrams
• Looks for structured lists of names
• Annotates known names
• Trains on annotations to discover the HTML structure of the page
• Recovers all names and hyperlinks
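
The last three steps (annotate known names, discover the HTML structure around them, recover all names and hyperlinks) can be illustrated with a deliberately simple stand-in. Armadillo trains an adaptive IE learner for this; the sketch below swaps in exact tag-path matching instead, using only Python's standard `html.parser`. The class `PathTextCollector`, the function `recover_names` and the sample page are hypothetical, and void tags such as `<br>` are not handled.

```python
from html.parser import HTMLParser


class PathTextCollector(HTMLParser):
    """Record each text node with the path of open tags that encloses it."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags
        self.hrefs = []  # href for each open tag (None unless it is an <a>)
        self.nodes = []  # (path, text, href) triples

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.hrefs.append(dict(attrs).get("href") if tag == "a" else None)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed tags).
            while self.stack:
                self.hrefs.pop()
                if self.stack.pop() == tag:
                    break

    def handle_data(self, data):
        text = data.strip()
        if text:
            href = next((h for h in reversed(self.hrefs) if h), None)
            self.nodes.append(("/".join(self.stack), text, href))


def recover_names(html, known_names):
    collector = PathTextCollector()
    collector.feed(html)
    # "Annotate known names": find the tag paths where seed names occur.
    seed_paths = {p for p, text, _ in collector.nodes if text in known_names}
    # "Recover all names and hyperlinks": everything that shares those paths.
    return [(text, href) for p, text, href in collector.nodes if p in seed_paths]


# Made-up staff page; only one name is known in advance.
page = """
<html><body>
<h1>Staff</h1>
<ul>
  <li><a href="/~alice">Alice Example</a></li>
  <li><a href="/~bob">Bob Example</a></li>
  <li><a href="/~carol">Carol Example</a></li>
</ul>
<p>Contact the department office.</p>
</body></html>
"""
recovered = recover_names(page, {"Alice Example"})
print(recovered)
```

A single seed name is enough here because the list is regular; on real pages a learned wrapper generalises where exact path matching would break.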

Experimental Results (1): People

Discovering who works in the department using Information Integration

• Total present in site: 129
• Using generic patterns + online repositories
  • 48 correct, 3 wrong
  • Precision: 48 / 51 = 94%
  • Recall: 48 / 129 = 37%
  • F-measure: 51%
• Errors
  • A. Schriffin
  • Eugenio Moggi
  • Peter Gray
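
The precision and recall figures above follow directly from the counts, as the short sketch below shows. It assumes the balanced F-measure (F1, the harmonic mean of precision and recall); the slide does not say which weighting was used. Computed this way F1 comes out at roughly 53%, slightly above the 51% reported, which may reflect a different weighting or rounding.

```python
def precision_recall_f1(correct, wrong, total_present):
    """Compute precision, recall and balanced F-measure from raw counts."""
    extracted = correct + wrong              # everything the system returned
    precision = correct / extracted          # share of returned items that are right
    recall = correct / total_present         # share of true items that were found
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


p, r, f = precision_recall_f1(correct=48, wrong=3, total_present=129)
print(f"precision={p:.1%} recall={r:.1%} F1={f:.1%}")
```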

Experimental Results (2): People

Using Information Extraction

• Total present in site: 129
• Errors
  Speech and Hearing European Network Department Of
  Position Paper The Network To System

Mining Web sites (2)

• Annotates known papers
• Trains on annotations to discover the HTML structure
• Recovers co-authoring information

Experimental Results (1): Papers

Discovering publications in the department using Information Integration

• Total present in site: 320
• Using generic patterns + online repositories
• Errors: garbage in database!!

  @misc{ computer-mining,
    author = "Department Of Computer",
    title = "Mining Web Sites Using Adaptive Information Extraction Alexiei Dingli and Fabio Ciravegna and David Guthrie and Yorick Wilks",
    url = "citeseer.nj.nec.com/582939.html" }

Experimental Results (2): Papers

Using Information Extraction

• Total present in site: 320
• Errors
  • Wrong boundaries in detection of paper names!
  • Names of workshops mistaken as paper names!

User Role

• Providing …
  • A URL
  • List of services
    • Already wrapped (e.g. Google is in default library)
    • Train wrappers using examples
  • Examples of fillers (e.g. project names)
• In case …
  • Correcting intermediate results
  • Reactivating Armadillo when paused

Armadillo

• Library of known services (e.g. Google, Citeseer)
• Tools for training learners for other structured sources
• Tools for bootstrapping learning
  • From un/structured sources
  • No user annotation
  • Multi-strategy acquisition of information using redundancy
• User-driven revision of results
  • With re-learning after user correction

Rationale

• Armadillo learns how to extract information
  • From large repositories
  • By integrating information from diverse and distributed resources
• Use:
  • Ontology population
  • Information highlighting
  • Document enrichment
  • Enhancing user experience

Data Navigation (1)

Data Navigation (2)

Data Navigation (3)

What’s so new about Armadillo?

• In other systems …
  • User-defined examples are used
  • Generic patterns are used that work independently of the site
• In our system …
  • We also make use of generic patterns & some user-defined examples
  • We learn page-specific patterns
  • And we integrate information from different sources

IE for SW: The Vision

• Automatic annotation services
  • For a specific ontology
  • Constantly re-indexing/re-annotating documents
  • Semantic search engine
• Effects:
  • No annotation in the document
    • As today’s indexes are not stored in the documents
  • No legacy with the past
    • Annotation with the latest version of the ontology
    • Multiple annotations for a single document
  • Simplifies maintenance
    • Page changed but not re-annotated
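
The effects listed in this vision (annotations kept outside the document, several annotation sets per document, detecting a page that changed but was not re-annotated) suggest an index of stand-off annotations. The sketch below is one assumed way to organise such an index, not a description of any AKT service: `AnnotationIndex`, `AnnotationSet`, the version labels, the concept name `Person` and the example URL are all hypothetical.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class AnnotationSet:
    ontology_version: str
    content_hash: str  # hash of the page text at annotation time
    annotations: list = field(default_factory=list)  # (start, end, concept) spans


class AnnotationIndex:
    """Keeps annotations outside the documents, like a search engine index."""

    def __init__(self):
        self._store = {}  # (url, ontology_version) -> AnnotationSet

    @staticmethod
    def _hash(text):
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def annotate(self, url, text, ontology_version, annotations):
        """Store (or overwrite) the annotation set for one ontology version."""
        self._store[(url, ontology_version)] = AnnotationSet(
            ontology_version, self._hash(text), list(annotations))

    def is_stale(self, url, text, ontology_version):
        """True if the page was never annotated for this ontology version,
        or its text has changed since it was annotated."""
        entry = self._store.get((url, ontology_version))
        return entry is None or entry.content_hash != self._hash(text)


index = AnnotationIndex()
url = "http://example.org/staff.html"
index.annotate(url, "Alexiei Dingli works on IE.", "v1", [(0, 14, "Person")])

unchanged = index.is_stale(url, "Alexiei Dingli works on IE.", "v1")
page_changed = index.is_stale(url, "Alexiei Dingli works on adaptive IE.", "v1")
new_ontology = index.is_stale(url, "Alexiei Dingli works on IE.", "v2")
print(unchanged, page_changed, new_ontology)
```

Keying the store on both URL and ontology version is what allows multiple annotations for a single document, and the stored hash is what flags a page that needs re-annotating.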

Links

• Melita: http://nlp.shef.ac.uk/melita/
• Armadillo: http://nlp.shef.ac.uk/armadillo/
• Amilcare: http://nlp.shef.ac.uk/amilcare/
• Gate: http://www.gate.ac.uk
• AKT Reference Ontology: http://www.aktors.org/publications/ontology/
• AKT 3Store: http://triplestore.aktors.org/
• More than 40 semantic web technologies: http://www.aktors.org/technologies/
  • Most of them can be freely downloaded
  • Range from IE tools, semantic portals, annotation tools, semantic web services, dialogue systems, etc.
