
LREC 2008, Marrakech, Morocco

Ontology Learning and Semantic Annotation: a necessary symbiosis

Emiliano Giovannetti, Simone Marchi, Simonetta Montemagni, Roberto Bartolini

ILC-CNR, Pisa, Italy

The knowledge acquisition paradox

• Technologies in the area of knowledge management and information access are confronted with a typical acquisition paradox:

– access to content requires understanding the linguistic structures that represent it in text at a considerable level of detail

– processing linguistic structures at the depth needed for content understanding presupposes that a considerable amount of domain knowledge is already in place

The knowledge acquisition paradox

[Diagram relating three elements: corpus; ontology as a formal representation of domain knowledge; semantically annotated text]

– advanced linguistic annotation needs an ontology

– ontology learning needs a linguistically annotated corpus

Turning a vicious circle into a “virtuous circle”

[Diagram labels: Text (implicit knowledge); Structured content (explicit knowledge); Dynamic Content Structuring; Knowledge Extraction; Linguistic annotation]

Turning a vicious circle into a “virtuous circle”: a first step

[Diagram labels: Text (implicit knowledge); Structured content (explicit knowledge); Dynamic Content Structuring; Syntactic parsing; Terminology extraction and structuring; ontology]

Turning a vicious circle into a “virtuous circle”: a first step

[Diagram labels: Text (implicit knowledge); Structured content (explicit knowledge); Dynamic Content Structuring; Syntactic parsing; Domain entity and relation extraction; ontology-driven semantic annotation]

A case study: semantic annotation of product catalogues

The challenge

• product descriptions appear as semi-structured texts, which also include portions of running text

• product catalogues do not contain continuous, linguistically well-formed text (descriptions are typically nominal)

• this task requires the combination of different types of evidence and techniques

The system for ontology-based semantic annotation of product catalogues

[Architecture diagram labels]

– input catalogue

– ontology

– NLP Modules: Tokenizer, Morpho Analyzer, Chunker, Dependency Parser

– ontology learning component: Product catalogues Terminology Processor

– semantic annotation component: Product catalogues Italian Semantic Annotator

– semantic annotation of product descriptions:

<entity data_id="26"> <name>SANELA</name></entity>
<entity data_id="33"> <part>fodera</part></entity>
<entity data_id="34"> <material>cotone</material></entity>
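The annotation fragment shown on this slide can be read back with a standard XML parser. The sketch below is only an illustration: the wrapping `<annotations>` root element is an assumption added here so that the three `<entity>` elements form a well-formed document, and the slide does not say how the system's real output is wrapped.

```python
import xml.etree.ElementTree as ET

# The three entity elements shown on the slide (with straight quotes).
fragment = (
    '<entity data_id="26"> <name>SANELA</name></entity>'
    '<entity data_id="33"> <part>fodera</part></entity>'
    '<entity data_id="34"> <material>cotone</material></entity>'
)

# Assumption: a hypothetical <annotations> root wraps the entities.
root = ET.fromstring(f"<annotations>{fragment}</annotations>")

# Each entity has one child whose tag is the entity type and whose text is the value.
entities = [
    (entity.get("data_id"), entity[0].tag, entity[0].text)
    for entity in root.iter("entity")
]

for data_id, entity_type, value in entities:
    print(data_id, entity_type, value)
```

Each tuple pairs the `data_id` attribute with the entity type (`name`, `part`, `material`) and its surface form.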

The Product catalogues Terminology Processor (PTP) for Ontology Learning

• Customised version of T2K (Text-to-Knowledge), a hybrid system combining linguistic technologies and statistical techniques

• Two stages: Term Extraction and Semantic Structuring

• Domain terminology (examples): Leg, Door, Top, Frame, Element, Shelf, Part, Sliding door, Cover, Support, Drawer

PTP: semantic structuring – identification of relations

• Horizontal relations: identified on the basis of dynamic distributionally-based similarity measures

• Vertical relations: identified on the basis of head-sharing (a multi-word term is placed under the term that is its head, e.g. “sliding door” under “door”)
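Head-sharing can be illustrated with a very small sketch. This is not the T2K implementation: it assumes that the head is a single word, that it is the last word of an English term and the first word of an Italian one, and that a vertical relation is proposed only when the head is itself an extracted term. The function name `vertical_relations` and the term lists are illustrative.

```python
def vertical_relations(terms, head_final=True):
    """Propose (term, "is_a", head) triples for multi-word terms.

    head_final=True  : head is the last word  (English: "sliding door" -> "door")
    head_final=False : head is the first word (Italian: "vetro temprato" -> "vetro")
    """
    vocabulary = {term.lower() for term in terms}
    relations = []
    for term in sorted(vocabulary):
        words = term.split()
        if len(words) < 2:
            continue  # single-word terms have no narrower head to attach to
        head = words[-1] if head_final else words[0]
        if head in vocabulary:
            relations.append((term, "is_a", head))
    return relations


english_terms = ["door", "sliding door", "steel", "stainless steel",
                 "wood", "solid wood", "drawer"]
italian_terms = ["vetro", "vetro temprato", "schienale", "schienale regolabile"]

print(vertical_relations(english_terms))
print(vertical_relations(italian_terms, head_final=False))
```

With these lists the sketch proposes “sliding door” is_a “door”, “solid wood” is_a “wood” and “stainless steel” is_a “steel”, and, for the Italian terms, “schienale regolabile” is_a “schienale” and “vetro temprato” is_a “vetro”.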

First step: semantic structuring - clustering

• definition of root concepts: colour, material

• definition of sub-concepts, each linked to its root concept by an is_a relation:

– colour: bianco (white), beige, scuro (dark), grigio (grey), blu (blue), rosso (red)

– material: acciaio (steel), pino (pine), betulla (birch), alluminio (aluminium), rovere (oak), plastica (plastic), faggio (beech), vetro (glass)

PTP: the final ontology

[Ontology diagram]

• root concepts: material, part, colour

• isa links: steel isa material; wood isa material; stainless steel isa steel; solid wood isa wood; door isa part; base isa part; sliding door isa door; blue isa colour; light blue isa blue

• properties: hasPartColour, hasPartMaterial

Semantic annotation: the approach

pattern matching + NLP

• pattern matching: used to isolate individual product descriptions within the textual flow and to identify their basic building blocks

• ontology-driven NLP: for each identified product, the natural-language description is processed by a battery of NLP tools that identify relevant entities (e.g. color, material, parts of a given product) and the relations holding between them (e.g. part_of, color_of)

Product catalogues Italian Semantic Annotator (PISA): ontology-driven semantic annotation

[Architecture diagram labels: input catalogue; ontology; NLP tools; PISA (RegExp Manager, NLP Manager)]

domain entities
– product
– part
– name
– id
– type
– category
– material
– color
– price
– height
– width
– depth
– weight
– diameter

relations between identified entities
– part_of ( product part )
– name_of ( (product | series) name )
– id_of ( product id )
– type_of ( product type )
– category_of ( (product | series | part) category )
– made_of ( (product | series | part) material )
– color_of ( (product | series | part) color )
– price_of ( product price )
– height_of ( product height )
– width_of ( product width )
– depth_of ( product depth )
– weight_of ( product weight )
– diameter_of ( product diameter )

PISA: semantic annotation - pattern matching

([A-Z]{3,}\s)+(.+)?(€[\d,\/\spz]+\.)([\w|\s|\.]+)(Cm\s\d{1,3}.\d{1,3}\.)(\d{3}\.\d{3}\.\d{2})

The six capturing groups correspond, in order, to: name, type, price, description, dimensions, product id.

The description is then processed by the NLP manager to extract entities and relations about parts, materials, colours, etc.
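As a rough illustration of this pattern-matching step, the sketch below applies the regular expression from the slide with Python's `re` module and maps its six groups to the six labels. The sample entry (including its price, dimensions and product id) is invented for the example and is not taken from any real catalogue; the names `split_entry` and `FIELDS` are likewise illustrative. Note that the pattern expects the product id to follow the dimensions without intervening whitespace.

```python
import re

# The regular expression from the slide, split over two lines for readability.
CATALOGUE_ENTRY = re.compile(
    r"([A-Z]{3,}\s)+(.+)?(€[\d,\/\spz]+\.)([\w|\s|\.]+)"
    r"(Cm\s\d{1,3}.\d{1,3}\.)(\d{3}\.\d{3}\.\d{2})"
)

# Labels for the six capturing groups, in order.
FIELDS = ("name", "type", "price", "description", "dimensions", "product_id")


def split_entry(text):
    """Return the building blocks of one catalogue entry, or None if no match."""
    match = CATALOGUE_ENTRY.search(text)
    if match is None:
        return None
    return {
        field: (value or "").strip()
        for field, value in zip(FIELDS, match.groups())
    }


# Hypothetical entry, made up for this example.
sample = ("SANELA Fodera per cuscino €9,99/pz. "
          "Fodera in cotone. Cm 50x50.123.456.78")
fields = split_entry(sample)

for field in FIELDS:
    print(f"{field}: {fields[field]}")
```

Only the `description` field would then be passed on to the NLP tools; the other fields are already typed by their position in the pattern.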

PISA: ontology for semantic annotation (entity recognition)

[Ontology diagram]

• root concepts: material, part, product

• isa links: glass isa material; wood isa material; tempered glass isa glass; solid wood isa wood; door isa part; base isa part; sliding door isa door; table isa product

• properties: hasPart, hasPartMaterial

Sample chunker output:

[ [ CC: N_C] [ AGR: @FP] [ POTGOV: ANTA#S@FP]]
[ [ CC: P_C] [ AGR: @MS] [ PREP: IN#E] [ POTGOV: VETRO__TEMPRARE|TEMPRATO#S@MS]]
[ [ CC: PUNC_C] [ PUNCTYPE: .#@]]
{. }

PISA: ontology for semantic annotation (relation extraction)

“Sedia in plastica con schienale regolabile” (plastic chair with adjustable back)

Where to attach “schienale regolabile”: to “sedia” or to “plastica”?

[Ontology diagram labels: materiale, parte, prodotto; vetro, plastica, vetro temprato; schienale, schienale regolabile, base, bevel edged plate; sedia; property hasPart]

• “sedia” (chair) is a kind of “prodotto” (product)

• “plastica” (plastic) is a kind of “materiale” (material)

• “schienale regolabile” (adjustable back) is a kind of “parte” (part)

There is no property linking a Material to a Part, but there is one linking a Product to a Part, so the correct interpretation is that “schienale regolabile” is a part of “sedia”.
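The attachment decision for “schienale regolabile” can be sketched as a lookup over the ontology's properties. The code below is a minimal illustration, not the PISA implementation: the `ISA` and `PROPERTIES` tables are hand-written for this one example, the intermediate concept “schienale” is assumed to sit between “schienale regolabile” and “parte”, and the domain and range given for `hasPartMaterial` (Part to Material) are an assumption based on its name.

```python
# Hand-written fragment of the ontology for this example (assumptions noted above).
ISA = {
    "sedia": "prodotto",
    "plastica": "materiale",
    "schienale": "parte",
    "schienale regolabile": "schienale",
}

# Properties indexed by (domain root concept, range root concept).
PROPERTIES = {
    ("prodotto", "parte"): "hasPart",
    ("parte", "materiale"): "hasPartMaterial",
}


def root_concept(term):
    """Follow isa links upwards until a root concept is reached."""
    while term in ISA:
        term = ISA[term]
    return term


def attach(dependent, candidate_heads):
    """Keep only the heads that the ontology can link to the dependent."""
    dependent_root = root_concept(dependent)
    attachments = []
    for head in candidate_heads:
        key = (root_concept(head), dependent_root)
        if key in PROPERTIES:
            attachments.append((head, PROPERTIES[key], dependent))
    return attachments


attachments = attach("schienale regolabile", ["sedia", "plastica"])
print(attachments)
```

Only “sedia” survives as a head: no property has Material as its domain and Part as its range, so the attachment to “plastica” is discarded, which matches the interpretation given on the slide.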

An example of semantic annotation

An example of semantic annotation: entities annotation

An example of semantic annotation: relations annotation

Evaluation of acquired results

• Preliminary evaluation was carried out:

– “task-based” evaluation of the ontology learning component, provided in terms of correctness in supporting semantic annotation

– evaluation of the semantic annotation component: a “gold-standard” reference corpus was created by randomly extracting and manually annotating about 100 IKEA products

precision:

$$\mathrm{PRE} = \frac{\mathrm{COR} + 0.5 \cdot \mathrm{PAR}}{\mathrm{ACT}}$$

recall:

$$\mathrm{REC} = \frac{\mathrm{COR} + 0.5 \cdot \mathrm{PAR}}{\mathrm{POS}}$$

F-measure:

$$F = \frac{2 \cdot \mathrm{PRE} \cdot \mathrm{REC}}{\mathrm{PRE} + \mathrm{REC}}$$

where COR is the number of correct annotations, PAR the number of partially correct annotations, ACT the total number of annotations produced (correct + incorrect + partially correct), and POS the total number of annotations in the gold standard (correct + partially correct + missing)

Sem. annotation                        precision  recall  F-measure
pattern matching                       0.99       0.94    0.96
ontology-driven linguistic analysis    0.89       0.70    0.78
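The three measures can be written out as a short sketch: precision and recall both give half credit to partially correct annotations, and the F-measure is their harmonic mean. The function and parameter names are illustrative. The loop recomputes the F-measure from the precision and recall values reported in the table above; the slides do not give the underlying annotation counts, so none are assumed here.

```python
def precision(cor, par, act):
    """(correct + 0.5 * partially correct) / annotations produced."""
    return (cor + 0.5 * par) / act


def recall(cor, par, pos):
    """(correct + 0.5 * partially correct) / annotations in the gold standard."""
    return (cor + 0.5 * par) / pos


def f_measure(pre, rec):
    """Harmonic mean of precision and recall."""
    return 2 * pre * rec / (pre + rec)


# Precision and recall as reported in the evaluation table.
reported = {
    "pattern matching": (0.99, 0.94),
    "ontology-driven linguistic analysis": (0.89, 0.70),
}

for component, (pre, rec) in reported.items():
    print(f"{component}: F = {f_measure(pre, rec):.2f}")
```

The recomputed values (0.96 and 0.78) agree with the F-measure column of the table.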

Further directions of research

– system portability to other product catalogues:

• “Zanotta” furniture catalogue: a subset of 30 product descriptions was extracted and manually annotated as a “gold-standard” reference

Sem. annotation                        precision  recall  F-measure
pattern matching                       1.00       0.86    0.92
ontology-driven linguistic analysis    1.00       0.50    0.66

• product catalogues in other domains

– application of the methodology to other domains and to non-structured (free) corpora

– more steps towards triggering the “virtuous circle”:

• next step: exploiting the results of the semantic annotation to enrich the ontology

THANK YOU!