SEMANTIC NETS

For contemporary lexicography

I. Chiari, Linguistica computazionale, a.a. 2009/2010 (slides dated 12/03/2010)

Some slides come from Semeraro (http://www.di.uniba.it/~semeraro/GCI/WordNet.pdf), others from Piek Vossen.

Wordnets


WordNet

http://wordnet.princeton.edu

A linguistic ontology that represents human linguistic knowledge explicitly and formally.

The idea originated in 1985 with a group of linguists and psycholinguists at Princeton University.
- Goal: conceptual search in dictionaries
- Result: definition of a lexical database
- Line of research: human lexical memory

WordNet is a top-level linguistic ontology.

Linguistic knowledge
- is common-sense knowledge
- can be used in any domain

WordNet does not cover words such as: of, an, the, and, about, above, because, etc.

Each word meaning is represented by the set of word forms that can be used to express it: the synset.

A synset associated with a word form lets the user infer the semantics of that word form, provided the user knows the semantics of at least one word form listed in the synset.
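As a minimal sketch of the idea above, a synset can be modeled as a set of word forms, with an index from each word form to the synsets that contain it. The toy entries and the helper name `synsets_for` are illustrative assumptions, not WordNet data or its API.

```python
from collections import defaultdict

# Toy synsets (hypothetical data): each word meaning is the set of
# word forms that can express it.
SYNSETS = [
    frozenset({"car", "auto", "automobile"}),
    frozenset({"board", "plank"}),
    frozenset({"board", "committee"}),
]

# Index: word form -> all synsets that contain it.
INDEX = defaultdict(list)
for synset in SYNSETS:
    for form in synset:
        INDEX[form].append(synset)


def synsets_for(form):
    """Return every synset (word meaning) the word form can express."""
    return list(INDEX.get(form, []))


# A reader who knows "plank" or "committee" can tell the senses of "board" apart.
for synset in synsets_for("board"):
    print(sorted(synset))
```

A polysemous word form such as "board" maps to more than one synset; the other members of each synset are what disambiguate it.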

Relations

LEXICAL RELATIONS hold between word forms (synonymy, antonymy, morphological relations).

SEMANTIC RELATIONS hold between word meanings (hyponymy/hypernymy and meronymy/holonymy).

From Semeraro's slides…

Nouns

WordNet divides nouns into 25 distinct semantic fields (animal, substance, …).

Within each semantic field, nouns are organized into a lexical tree according to the hypernymy relation.

The principle of inheritance applies.

A noun (canary) can be associated with:
- Attributes of the noun (small and yellow)
- Parts of the noun (beak and wings)
- Functions of the noun (sings and flies)

Many of the attributes, parts, and activities of a term are inherited from its direct hypernym.
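The inheritance principle can be sketched by walking up the hypernym chain and merging what is stored at each level. The hierarchy, the property values, and the function name `inherited_properties` below are toy assumptions chosen to mirror the canary example, not actual WordNet content.

```python
# Hypothetical hypernym links and locally stored properties (toy data).
HYPERNYM = {"canary": "bird", "bird": "animal"}

PROPERTIES = {
    "animal": {"functions": {"breathes"}},
    "bird": {"parts": {"beak", "wings"}, "functions": {"flies"}},
    "canary": {"attributes": {"small", "yellow"}, "functions": {"sings"}},
}


def inherited_properties(noun):
    """Collect the properties of a noun and of every hypernym above it."""
    merged = {"attributes": set(), "parts": set(), "functions": set()}
    current = noun
    while current is not None:
        for kind, values in PROPERTIES.get(current, {}).items():
            merged[kind] |= values
        current = HYPERNYM.get(current)  # None once the root is reached
    return merged


canary = inherited_properties("canary")
print(sorted(canary["parts"]))      # stored on "bird", inherited by "canary"
print(sorted(canary["functions"]))
```

Storing "beak" and "wings" once on the hypernym, rather than on every bird, is the saving that inheritance buys.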

Online querying

http://wordnetweb.princeton.edu/perl/webwn

WordNet statistics

Polysemy in WordNet

Verbs in the database (Semeraro)

About 10,000 forms and 20,000 senses.

A verb is the core on which the semantics of a sentence is built.

The meaning of a verb changes depending on the noun it is associated with.

To resolve this ambiguity, one could imagine inserting, in each verb synset, a pointer to the synset of the noun to which the verb's meaning refers.

Once the idea proposed above was abandoned, verbs were instead divided into several semantic categories (files).

With this organization, the meaning of a verb within a category is no longer subject to ambiguity, because it is tied to the semantic category itself.

Verb relations

V1 ENTAILS V2
when "Someone V1" (logically) entails "Someone V2"
- e.g. snore entails sleep

TROPONYMY
when "To do V1" is "To do V2 in some manner"
- e.g. limp is a troponym of walk

Hypernym: fly -> travel
Troponym: walk -> stroll
Entails: snore -> sleep
Antonym: increase -> decrease
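The verb relations listed above can be stored as labeled, directed links. The sketch below uses only the example pairs from the slide; the triple layout and the helper name `related` are assumptions made for illustration.

```python
# Directed verb relations from the examples above: (source, relation, target).
RELATIONS = [
    ("fly", "hypernym", "travel"),
    ("walk", "troponym", "stroll"),
    ("walk", "troponym", "limp"),
    ("snore", "entails", "sleep"),
    ("increase", "antonym", "decrease"),
]


def related(verb, relation):
    """Return the verbs reached from `verb` through `relation`."""
    return [target for source, rel, target in RELATIONS
            if source == verb and rel == relation]


print(related("snore", "entails"))
print(related("walk", "troponym"))
```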

Differences in wordnet structures

[Figure: the hierarchy under "object" in WordNet 1.5 compared with the Dutch Wordnet. WordNet 1.5 nodes: object; natural object (an object occurring naturally); artifact, artefact (a man-made object); instrumentality; block; body; container; device; implement; tool; instrument; bag; spoon; box. Dutch Wordnet nodes: voorwerp (object), lepel (spoon), werktuig (tool), tas (bag), bak (box), blok (block), lichaam (body).]

- Artificial classes versus lexicalized classes: instrumentality, natural object
- Lexicalization differences of classes: container and artifact (object) are not lexicalized in Dutch

Applications of WordNet

http://www.lexiology.com

Memidex: a WordNet application

MultiWordNet

http://multiwordnet.fbk.eu

- lexical relations between words
- semantic relations between lexical concepts (synsets)
- correspondences between Italian and English lexical concepts
- semantic fields (domains)

The latest version of MultiWordNet (1.39) contains around 58,000 Italian word senses and 41,500 lemmas, organized into 32,700 synsets aligned, whenever possible, with Princeton WordNet English synsets.

Semantic and lexical relations

Applications of MWN

- Information Retrieval: synonymy relations are used for query expansion to improve the recall of IR; cross-language correspondences between Italian and English synsets are used for Cross-Language Information Retrieval.
- Semantic tagging: MultiWordNet constitutes a large-coverage sense inventory, which is the basis for semantic tagging, i.e. texts are tagged with synset identifiers.
- Disambiguation: semantic relationships are used to measure the semantic distance between words, which can be used to disambiguate the meaning of words in texts. Semantic fields have also proved very useful for the disambiguation task.
- Ontologies: MultiWordNet can be seen as an ontology to be used for a variety of knowledge-based NLP tasks.
- Terminologies: MultiWordNet constitutes a robust framework supporting the development of specific structured terminologies.
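The Information Retrieval item above can be sketched as follows: each query term is replaced by all word forms in its synsets, optionally adding the aligned forms in the other language. The synset identifiers, the word lists, and the function `expand_query` are hypothetical and do not reflect MultiWordNet's actual data or interface.

```python
# Hypothetical aligned synsets: one identifier, Italian and English
# word forms for the same lexical concept (toy data).
SYNSETS = {
    "n#001": {"it": {"automobile", "macchina", "auto"},
              "en": {"car", "automobile"}},
    "n#002": {"it": {"casa", "abitazione"},
              "en": {"house", "dwelling"}},
}


def expand_query(terms, source="it", target=None):
    """Add the synonyms of each query term; optionally add target-language forms."""
    expanded = set(terms)
    for term in terms:
        for synset in SYNSETS.values():
            if term in synset[source]:
                expanded |= synset[source]       # monolingual expansion
                if target is not None:
                    expanded |= synset[target]   # cross-language expansion
    return expanded


print(sorted(expand_query(["macchina"])))
print(sorted(expand_query(["casa"], target="en")))
```

Expansion of this kind raises recall, but a polysemous term pulls in the synonyms of every one of its senses, which is why it is usually paired with disambiguation.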

ItalWordNet

ItalWordNet (IWN) is a lexical-semantic database developed within two distinct research projects: EuroWordNet (EWN) and Sistema Integrato per il Trattamento Automatico del Linguaggio (SI-TAL), a national project devoted to creating large linguistic resources and software tools for processing written and spoken Italian.

The IWN database

- a wordnet containing about 47,000 lemmas, 50,000 synsets, and 130,000 semantic relations (the most important encoded relations are hypernymy/hyponymy, antonymy, meronymy, cause relations, role relations, etc.);
- an Inter-Lingual Index (ILI), which is an unstructured version of WN1.5; this module, used in EWN to link wordnets of different languages, was retained in IWN so that the resource can be used in multilingual applications;
- the Top Ontology (TO), a hierarchy of language-independent concepts reflecting fundamental semantic distinctions, built within EWN and partially modified in IWN to account for adjectives (not covered in EWN);
- the TO consists of language-independent aspects that may (or may not) be lexicalized in various ways, or according to different patterns, in different languages [Rodriguez et al. 1998]; through the ILI, all concepts in the wordnet are directly or indirectly linked to the TO.
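A rough sketch of how an unstructured index can connect wordnets: each language-specific synset points to an ILI record, and ILI records point to Top Ontology concepts. All identifiers, the concept labels, and both function names below are hypothetical; the real EWN/IWN data model is richer than this.

```python
# Toy records (hypothetical identifiers): (language, synset) -> ILI record.
ILI_OF = {
    ("it", "cane"): "ili-dog",
    ("nl", "hond"): "ili-dog",
    ("it", "cucchiaio"): "ili-spoon",
    ("nl", "lepel"): "ili-spoon",
}

# ILI record -> Top Ontology concepts (labels are illustrative only).
TOP_CONCEPTS = {
    "ili-dog": {"Animal", "Living"},
    "ili-spoon": {"Artifact", "Instrument"},
}


def equivalents(lang, synset, target_lang):
    """Find target-language synsets linked to the same ILI record."""
    ili = ILI_OF.get((lang, synset))
    if ili is None:
        return []
    return sorted(name for (lng, name), record in ILI_OF.items()
                  if lng == target_lang and record == ili)


def top_concepts(lang, synset):
    """Reach the Top Ontology through the ILI; empty set if unlinked."""
    return TOP_CONCEPTS.get(ILI_OF.get((lang, synset)), set())


print(equivalents("it", "cane", "nl"))
print(sorted(top_concepts("nl", "lepel")))
```

Because every wordnet links only to the index, adding a new language requires no pairwise mapping to the languages already present.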

Mangiare (v), "to eat"

FrameNet

The Berkeley FrameNet project is creating an on-line lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results.

Database

The FrameNet lexical database currently contains more than 11,600 lexical units (defined below), more than 6,800 of which are fully annotated, in more than 960 semantic frames, exemplified in more than 150,000 annotated sentences.

A multilingual perspective

Aligning wordnets

[Figure: alignment of the Dutch wordnet with the English wordnet. Dutch nodes: muziekinstrument, orgel, hammond orgel, orgaan. English nodes: object, natural object, artifact, instrument, musical instrument, organ (shown three times), hammond organ.]

General criteria

- Maximize the overlap with wordnets of other languages
- Maximize semantic consistency within and across wordnets
- Focus manual effort where it is needed
- Make maximal use of automatic techniques

Top-down methodology

- Develop a core wordnet (5,000 synsets):
  - all the semantic building blocks, or foundation, needed to define the relations for all other, more specific synsets, e.g. building -> house, church, school
  - provide a formal and explicit semantics
- Validate the core wordnet:
  - does it include the most frequent words?
  - are semantic constraints violated?
- Extend the core wordnet (5,000 synsets or more):
  - automatic techniques for more specific concepts with high-confidence results
  - add other levels of hyponymy
  - add specific domains
  - add 'easy' derivational words
  - add 'easy' translation equivalence
- Validate the complete wordnet
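The two validation questions above can be approximated with simple checks. The sketch below assumes, as one example of a semantic constraint, that hypernymy must be acyclic; the slides do not say which constraints were actually tested, and the function names and data shapes are illustrative.

```python
def uncovered_frequent_words(frequent_words, lexicon):
    """Frequent words that no synset in the core wordnet contains."""
    return [word for word in frequent_words if word not in lexicon]


def has_hypernym_cycle(hypernyms):
    """True if following hypernym links from some synset leads back to it.

    `hypernyms` maps a synset name to a list of its direct hypernyms.
    """
    for start in hypernyms:
        seen = set()
        stack = list(hypernyms.get(start, ()))
        while stack:
            node = stack.pop()
            if node == start:
                return True
            if node not in seen:
                seen.add(node)
                stack.extend(hypernyms.get(node, ()))
    return False


core = {"house": ["building"], "church": ["building"], "building": ["artifact"]}
print(uncovered_frequent_words(["house", "school"], {"house", "church", "building"}))
print(has_hypernym_cycle(core))
```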

Developing a core wordnet

- Define a set of concepts (so-called Base Concepts) that play an important role in wordnets:
  - high position in the hierarchy & high connectivity
  - represented as English WordNet synsets
  - Common base concepts: shared by various wordnets in different languages
  - Local base concepts: not shared
  - EuroWordNet: 1,024 synsets, shared by 2 or more languages
  - BalkaNet: 5,000 synsets (including the 1,024)
- Common semantic framework for all Base Concepts, in the form of a Top-Ontology
- Manually translate all Base Concepts (English WordNet synsets) to synsets in the local languages (this was applied for 13 wordnets)
- Manually build and verify the hypernym relations for the Base Concepts
- All 13 wordnets are developed from a similar semantic core, closely related to the English WordNet
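The two selection criteria for Base Concepts, high position in the hierarchy and high connectivity, can be sketched as a filter over depth and relation count. The thresholds, the toy hierarchy, and the function name are assumptions; the slides give no numeric cut-offs, and the sketch assumes a single-hypernym tree without cycles.

```python
def base_concept_candidates(hypernym, relations, max_depth, min_relations):
    """Synsets close to the top of the hierarchy that have many relations.

    `hypernym` maps a synset to its single direct hypernym (tree assumed);
    `relations` maps a synset to the number of relations it takes part in.
    """
    def depth(synset):
        steps = 0
        while synset in hypernym:   # climb until the root is reached
            synset = hypernym[synset]
            steps += 1
        return steps

    return sorted(
        synset for synset, count in relations.items()
        if depth(synset) <= max_depth and count >= min_relations
    )


# Toy data: relation counts are invented for illustration.
hypernym = {"artifact": "object", "building": "artifact",
            "house": "building", "church": "building"}
relations = {"object": 40, "artifact": 25, "building": 12, "house": 3, "church": 2}
print(base_concept_candidates(hypernym, relations, max_depth=2, min_relations=10))
```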

[Figure: Top-down methodology. Labels: Top-Ontology (63 TCs); 1,024 CBCs; Inter-Lingual-Index (WordNet 1.5 synsets); and, for each of two wordnets, Hyperonyms, CBC Representatives, Local BCs, WMs related via non-hyponymy, First Level Hyponyms, Remaining Hyponyms.]

[Figure: Top-down methodology, from the English Wordnet to the Arabic Wordnet. Labels: Sumo Ontology; WordNet Domains; Domain "chemics"; Named Entities; WordNet Synsets; Base Concepts (EuroWordNet, BalkaNet; CBC, SBC, ABC); Hypernyms; Next Level Hyponyms; More Hyponyms; 1,000 Synsets; 5,000 Synsets; English Arabic Lexicon; Arabic word frequency; Arabic roots & derivation rules; Easy Translations.]

Advantages of the approach

- Well-defined semantics that can be inherited down to more specific concepts
- Apply consistency checks
- Automatic techniques can use the semantic basis
- Most frequent concepts and words are covered
- High overlap and compatibility with other wordnets
- Manual effort is focused on the most difficult concepts and words

WordNet Domains

Domain            Concepts  Proportion
acoustics              104       0.092
administration        2974       2.624
aeronautic             154       0.136
agriculture            306       0.270
alimentation            28       0.025
anatomy               2705       2.387
anthropology           896       0.791
applied_science         28       0.025
archaeology             68       0.060
archery                  5       0.004
architecture           255       0.225
art                    420       0.371
artisanship            148       0.131
astrology               17       0.015
astronautics            29       0.026
astronomy              376       0.332
athletics               22       0.019
linguistics           1545       1.363
literature             686       0.605
mathematics            575       0.507
mechanics              532       0.469
medicine              2690       2.374
merchant_navy          485       0.428
meteorology            231       0.204
metrology             1409       1.243
military              1490       1.315
money                  624       0.551
mountaineering          28       0.025
music                  985       0.869
mythology              314       0.277
number                 220       0.194
numismatics             43       0.038
occultism               52       0.046
oceanography            10       0.009

Page 2: Semantica e lessico - Alphabit.net · et.pdf) altre da Piek Vossen 2 Wordnets ... all’utente di inferire la semantica della word form in esame purché conosca la semantica di almeno

12032010

2

Wordnet

I Chiari Linguistica computazionale - aa 20092010

3

httpwordnetprincetonedu

Ontologia linguistica che rappresenta in maniera

esplicita e formale la conoscenza linguistica umana

Lrsquoidea nasce nel 1985 da un gruppo di linguisti e

psicolinguisti dellrsquouniversitagrave di Princeton

1048708 Obiettivo ricerca concettuale nei dizionari

1048708 Risultato definizione di un database lessicale

1048708 Linea di ricerca memoria lessicale umana

I Chiari Linguistica computazionale - aa 20092010

4

WordNet egrave unrsquoontologia linguistica toplevel

La conoscenza linguistica

1048708 egrave conoscenza di senso comune

1048708 puograve essere utilizzata in qualsiasi dominio

Wordnet non tratta parole come

of an the and about above because etc

12032010

3

I Chiari Linguistica computazionale - aa 20092010

5

Ogni word meaning egrave rappresentata dallrsquoinsieme

delle word form che possono essere usate per

esprimerla synset

Un synset associato ad una word form consente

allrsquoutente di inferire la semantica della word form in

esame purcheacute conosca la semantica di almeno una

word form elencata nel synset

Relazioni

I Chiari Linguistica computazionale - aa 20092010

6

LE RELAZIONI LESSICALI Si instaurano tra word

form (synonymy antonymy morphological)

LE RELAZIONI SEMANTICHE Si instaurano tra

word meaning (hyponymy hypernymy e

meronymy holonymy)

12032010

4

Da Diapositive Semerarohellip

I Chiari Linguistica computazionale - aa 20092010

7

Sostantivi

I Chiari Linguistica computazionale - aa 20092010

8

WordNet suddivide i nomi in 25 campi semantici distinti (animale sostanzahellip)

In ogni campo semantico i nomi sono organizzati in un albero lessicale secondo la relazione hypernymy

Vale il principio di ereditarietagrave

Ad un nome (canarino) si possono associare

1048708 Attributi del nome (piccolo e giallo)

1048708 Parti del nome (becco e ali)

1048708 Funzioni del nome (canta e vola)

Molti degli attributi delle parti e delle attivitagrave di un termine sono ereditate dal diretto hypernym

12032010

5

Interrogazione online

I Chiari Linguistica computazionale - aa 20092010

9

httpwordnetwebprincetoneduperlwebwn

I Chiari Linguistica computazionale - aa 2009201010

12032010

6

Statistiche su Wordnet

I Chiari Linguistica computazionale - aa 20092010

11

Polisemia in Wordnet

I Chiari Linguistica computazionale - aa 20092010

12

12032010

7

200405ANLE

13

Verbi nel database (Semeraro)

About 10000 forms 20000 senses

Un verbo egrave il nucleo su cui si basa la semantica

associata ad una frase

Il significato dei verbi cambia a seconda del nome

con cui i verbi stessi sono associati

Per risolvere lrsquoambiguitagrave si potrebbe immaginare di

inserire in ogni synset di verbi un puntatore al

synset del nome a cui il significato del verbo egrave

riferito

I Chiari Linguistica computazionale - aa 20092010

14

Abbandonata lrsquoidea proposta precedentemente si

egrave pensato di suddividere i verbi in varie categorie

semantiche (file)

Con tale organizzazione il significato di un verbo in

una categoria non egrave piugrave soggetto ad ambiguitagrave

percheacute legato alla categoria semantica stessa

12032010

8

200405ANLE

15

Relazioni verbali

V1 ENTAILS V2

when Someone V1 (logically) entails Someone V2

- eg snore entails sleep

TROPONYMY

when To do V1 is To do V2 in some manner

- eg limp is a troponym of walk

Hypernym fly-gt travel

Troponym Walk -gt stroll

Entails Snore -gt sleep

Antonym Increase -gt decrease

Differences in wordnet structures

voorwerp

object

lepel

spoon

werktuig

tool

tas

bag

bak

box

blok

block

lichaam

body

Wordnet15 Dutch Wordnet

bagspoonbox

object

natural object (an

object occurring

naturally)

artifact artefact

(a man-made object)

instrumentalityblock body

containerdeviceimplement

tool instrument

- Artificial Classes versus Lexicalized Classes

instrumentality natural object

- Lexicalization differences of classes

container and artifact (object) are not lexicalized in Dutch

12032010

9

Applicazioni di Wordnet

I Chiari Linguistica computazionale - aa 20092010

17

httpwwwlexiologycom

I Chiari Linguistica computazionale - aa 2009201018

12032010

10

Memidex applicazione Wordnet

I Chiari Linguistica computazionale - aa 20092010

19

I Chiari Linguistica computazionale - aa 2009201020

12032010

11

I Chiari Linguistica computazionale - aa 2009201021

Multiwordnet22

I Chiari Linguistica computazionale - aa 20092010

12032010

12

I Chiari Linguistica computazionale - aa 20092010

23

httpmultiwordnetfbkeu

lexical relations between words

semantic relations between lexical concepts

(synsets)

correspondences between Italian and English lexical

concepts

semantic fields (domains)

I Chiari Linguistica computazionale - aa 20092010

24

The lastest version of MultiWordNet (139) contains

around 58000 Italian word senses and 41500

lemmas organized into 32700 synsets aligned

whenever possible with Princeton WordNet English

synsets

12032010

13

Relazioni semantiche e lessicali

I Chiari Linguistica computazionale - aa 20092010

25

I Chiari Linguistica computazionale - aa 2009201026

12032010

14

I Chiari Linguistica computazionale - aa 2009201027

I Chiari Linguistica computazionale - aa 2009201028

12032010

15

Le applicazioni di MWN

I Chiari Linguistica computazionale - aa 20092010

29

Information Retrieval synonymy relations are used for query expansion to improve the recall of IR cross language correspondences between Italian and English synsets are used for Cross Language Information Retrieval

Semantic tagging MultiWordNet constitutes a large coverage sense inventory which is the basis for semantic tagging ie texts are tagged with synset identifiers

Disambiguation Semantic relationships are used to measure the semantic distance between words which can be used to disambiguate the meaning of words in texts Also semantic fields have proved to be very useful for the disambiguation task

Ontologies MultiWordNet can be seen as an ontology to be used for a variety of knowledge-based NLP tasks

Terminologies MultiWordNet constitutes a robust framework supporting the development of specific structured terminologies

ItalWordNet30

I Chiari Linguistica computazionale - aa 20092010

12032010

16

I Chiari Linguistica computazionale - aa 20092010

31

ItalWordNet (IWN) egrave un database semantico-

lessicale sviluppato nellambito di due progetti di

ricerca distinti EuroWordNet (EWN)1 e Sistema

Integrato per il Trattamento Automatico del

Linguaggio (SI-TAL) un progetto nazionale dedicato

alla creazione di ampie risorse linguistiche e di

strumenti software per lelaborazione dellitaliano

scritto e parlato

il database IWN

I Chiari Linguistica computazionale - aa 20092010

32

un wordnet contenente circa 47000 lemmi 50000 synset e 130000 relazioni semantiche (tra le relazioni codificate le piugrave importanti sono le seguenti iperonimiaiponimia antonimia meronimia relazioni di causa relazioni di ruolo etc)

un Inter-Lingual Index (ILI) che egrave una versione non strutturata di WN15questo modulo usato in EWN per collegare wordnet di diverse lingue egrave stato mantenuto anche in IWN per rendere la risorsa utilizzabile in applicazioni multilingue

la Top Ontology (TO) una gerarchia di concetti indipendenti dalla lingua che riflette fondamentali distinzioni semantiche costruita nellambito di EWN e parzialmente modificata in IWN per spiegare gli aggettivi (non trattati in EWN)

la TO egrave costituita da aspetti indipendenti dalla lingua che possono (o non possono) essere lessicalizzati in vari modi o secondo diversi modelli in diverse lingue [Rodriguez et al 1998] attraverso lILI tutti i concetti del wordnet sono direttamente o indirettamente collegati alla TO

12032010

17

Mangiare (v)

I Chiari Linguistica computazionale - aa 20092010

33

FrameNet34

I Chiari Linguistica computazionale - aa 20092010

12032010

18

Framenet

I Chiari Linguistica computazionale - aa 20092010

35

The Berkeley FrameNet project is creating an on-

line lexical resource for English based on frame

semantics and supported by corpus evidence The

aim is to document the range of semantic and

syntactic combinatory possibilities (valences) of

each word in each of its senses through computer-

assisted annotation of example sentences and

automatic tabulation and display of the annotation

results

database

I Chiari Linguistica computazionale - aa 20092010

36

the FrameNet lexical database currently contains

more than 11600 lexical units (defined below)

more than 6800 of which are fully annotated in

more than 960 semantic frames exemplified in

more than 150000 annotated sentences

12032010

19

I Chiari Linguistica computazionale - aa 2009201037

I Chiari Linguistica computazionale - aa 2009201038

12032010

20

I Chiari Linguistica computazionale - aa 2009201039

I Chiari Linguistica computazionale - aa 2009201040

12032010

21

In ottica multilingue41

I Chiari Linguistica computazionale - aa 20092010

Aligning wordnets

muziekinstrument

orgel

hammond orgel

organ organ organ

hammond organ

musical instrument

instrument

artifact object natural object

objectDutch wordnetEnglish wordnet

orgaan

orgel

12032010

22

Criteri generali

Massimizzare la sovrapposizione con altri wordnet

di altre lingue

Massimizzare la consistenza semantica allrsquointerno e

attraverso i wordnet

Focalizzare lo sforzo manuale dove necessario

Sfruttare massimamente le tecniche automatiche

Top-down methodology

Develop a core wordnet (5000 synsets)

all the semantic building blocks or foundation to define the relations for all other more specific synsets eg building -gt house church school

provide a formal and explicit semantics

Validate the core wordnet

does it include the most frequent words

are semantic constraints violated

Extend the core wordnet (5000 synsets or more)

automatic techniques for more specific concepts with high-confidence results

add other levels of hyponymy

add specific domains

add lsquoeasyrsquo derivational words

add lsquoeasyrsquo translation equivalence

Validate the complete wordnet

12032010

23

Developing a core wordnet

Define a set of concepts(so-called Base Concepts) that play an important role in wordnets

high position in the hierarchy amp high connectivity

represented as English WordNet synsets

Common base concepts shared by various wordnets in different languages

Local base concepts not shared

EuroWordNet 1024 synsets shared by 2 or more languages

BalkaNet 5000 synsets (including 1024)

Common semantic framework for all Base Concepts in the form of a Top-Ontology

Manually translate all Base Concepts (English Wordnet synsets) to synsets in the local languages (was applied for 13 Wordnets)

Manually build and verify the hypernym relations for the Base Concepts

All 13 Wordnets are developed from a similar semantic core closely related to the English Wordnet

63TCs

1024 CBCs

First Level Hyponyms

Remaining

Hyponyms

Hypero

nyms

CBC

Represen-

tatives

Local

BCs

WMs

related via

non-hypo

nymy

Top-Ontology

Inter-Lingual-Index

Remaining

Hyponyms

Hypero

nyms

CBC

Repre-

senta

Local

BCs

WMs

related via

non-hypo

nymyFirst Level HyponymsRemaining

WordNet15

Synsets

Top-down methodology

12032010

24

DomainNamed

Entities

Next Level

Hyponyms

Sumo

Ontology

WordNet

Synsets

1000

SynsetsSBC

CBC

Hyper

nyms

ABCEuroWordNet

BalkaNet

Base Concepts

5000

SynsetsEnglish

Arabic

LexiconWordNet

Domains

Domainldquochemicsrdquo

WordNet

Synsets

English Wordnet Arabic Wordnet

Arabic

word

frequency

Arabic

roots

amp

derivation

rules

Top-down methodology

More

Hyponyms

Easy

Translations

Named

Entities

=

Advantages of the approach

Well-defined semantics that can be inherited down to more specific concepts

Apply consistency checks

Automatic techniques can use semantic basis

Most frequent concepts and words are covered

High overlap and compatibility with other wordnets

Manual effort is focussed on the most difficult concepts and words

12032010

25

Wordnet

Domains Concepts Proportion

Wordnet

Domains Concepts Proportion

acoustics 104 0092 linguistics 1545 1363

administration 2974 2624 literature 686 0605

aeronautic 154 0136 mathematics 575 0507

agriculture 306 0270 mechanics 532 0469

alimentation 28 0025 medicine 2690 2374

anatomy 2705 2387 merchant_navy 485 0428

anthropology 896 0791 meteorology 231 0204

applied_science 28 0025 metrology 1409 1243

archaeology 68 0060 military 1490 1315

archery 5 0004 money 624 0551

architecture 255 0225 mountaineering 28 0025

art 420 0371 music 985 0869

artisanship 148 0131 mythology 314 0277

astrology 17 0015 number 220 0194

astronautics 29 0026 numismatics 43 0038

astronomy 376 0332 occultism 52 0046

athletics 22 0019 oceanography 10 0009

Page 3: Semantica e lessico - Alphabit.net · et.pdf) altre da Piek Vossen 2 Wordnets ... all’utente di inferire la semantica della word form in esame purché conosca la semantica di almeno

12032010

3

I Chiari Linguistica computazionale - aa 20092010

5

Ogni word meaning egrave rappresentata dallrsquoinsieme

delle word form che possono essere usate per

esprimerla synset

Un synset associato ad una word form consente

allrsquoutente di inferire la semantica della word form in

esame purcheacute conosca la semantica di almeno una

word form elencata nel synset

Relazioni

I Chiari Linguistica computazionale - aa 20092010

6

LE RELAZIONI LESSICALI Si instaurano tra word

form (synonymy antonymy morphological)

LE RELAZIONI SEMANTICHE Si instaurano tra

word meaning (hyponymy hypernymy e

meronymy holonymy)

12032010

4

Da Diapositive Semerarohellip

I Chiari Linguistica computazionale - aa 20092010

7

Sostantivi

I Chiari Linguistica computazionale - aa 20092010

8

WordNet suddivide i nomi in 25 campi semantici distinti (animale sostanzahellip)

In ogni campo semantico i nomi sono organizzati in un albero lessicale secondo la relazione hypernymy

Vale il principio di ereditarietagrave

Ad un nome (canarino) si possono associare

1048708 Attributi del nome (piccolo e giallo)

1048708 Parti del nome (becco e ali)

1048708 Funzioni del nome (canta e vola)

Molti degli attributi delle parti e delle attivitagrave di un termine sono ereditate dal diretto hypernym

12032010

5

Interrogazione online

I Chiari Linguistica computazionale - aa 20092010

9

httpwordnetwebprincetoneduperlwebwn

I Chiari Linguistica computazionale - aa 2009201010

12032010

6

Statistiche su Wordnet

I Chiari Linguistica computazionale - aa 20092010

11

Polisemia in Wordnet

I Chiari Linguistica computazionale - aa 20092010

12

12032010

7

200405ANLE

13

Verbi nel database (Semeraro)

About 10000 forms 20000 senses

Un verbo egrave il nucleo su cui si basa la semantica

associata ad una frase

Il significato dei verbi cambia a seconda del nome

con cui i verbi stessi sono associati

Per risolvere lrsquoambiguitagrave si potrebbe immaginare di

inserire in ogni synset di verbi un puntatore al

synset del nome a cui il significato del verbo egrave

riferito

I Chiari Linguistica computazionale - aa 20092010

14

Abbandonata lrsquoidea proposta precedentemente si

egrave pensato di suddividere i verbi in varie categorie

semantiche (file)

Con tale organizzazione il significato di un verbo in

una categoria non egrave piugrave soggetto ad ambiguitagrave

percheacute legato alla categoria semantica stessa

12032010

8

200405ANLE

15

Relazioni verbali

V1 ENTAILS V2

when Someone V1 (logically) entails Someone V2

- eg snore entails sleep

TROPONYMY

when To do V1 is To do V2 in some manner

- eg limp is a troponym of walk

Hypernym fly-gt travel

Troponym Walk -gt stroll

Entails Snore -gt sleep

Antonym Increase -gt decrease

Differences in wordnet structures

voorwerp

object

lepel

spoon

werktuig

tool

tas

bag

bak

box

blok

block

lichaam

body

Wordnet15 Dutch Wordnet

bagspoonbox

object

natural object (an

object occurring

naturally)

artifact artefact

(a man-made object)

instrumentalityblock body

containerdeviceimplement

tool instrument

- Artificial Classes versus Lexicalized Classes

instrumentality natural object

- Lexicalization differences of classes

container and artifact (object) are not lexicalized in Dutch

12032010

9

Applicazioni di Wordnet

I Chiari Linguistica computazionale - aa 20092010

17

httpwwwlexiologycom

I Chiari Linguistica computazionale - aa 2009201018

12032010

10

Memidex applicazione Wordnet

I Chiari Linguistica computazionale - aa 20092010

19

I Chiari Linguistica computazionale - aa 2009201020

12032010

11

I Chiari Linguistica computazionale - aa 2009201021

Multiwordnet22

I Chiari Linguistica computazionale - aa 20092010

12032010

12

I Chiari Linguistica computazionale - aa 20092010

23

httpmultiwordnetfbkeu

lexical relations between words

semantic relations between lexical concepts

(synsets)

correspondences between Italian and English lexical

concepts

semantic fields (domains)

I Chiari Linguistica computazionale - aa 20092010

24

The lastest version of MultiWordNet (139) contains

around 58000 Italian word senses and 41500

lemmas organized into 32700 synsets aligned

whenever possible with Princeton WordNet English

synsets

12032010

13

Relazioni semantiche e lessicali

I Chiari Linguistica computazionale - aa 20092010

25

I Chiari Linguistica computazionale - aa 2009201026

12032010

14

I Chiari Linguistica computazionale - aa 2009201027

I Chiari Linguistica computazionale - aa 2009201028

12032010

15

Le applicazioni di MWN

I Chiari Linguistica computazionale - aa 20092010

29

Information Retrieval: synonymy relations are used for query expansion to improve the recall of IR; cross-language correspondences between Italian and English synsets are used for Cross-Language Information Retrieval.

Semantic tagging: MultiWordNet constitutes a large-coverage sense inventory, which is the basis for semantic tagging, i.e. texts are tagged with synset identifiers.

Disambiguation: semantic relationships are used to measure the semantic distance between words, which can be used to disambiguate the meaning of words in texts. Semantic fields have also proved to be very useful for the disambiguation task.

Ontologies: MultiWordNet can be seen as an ontology to be used for a variety of knowledge-based NLP tasks.

Terminologies: MultiWordNet constitutes a robust framework supporting the development of specific structured terminologies.
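
The query-expansion use of synonymy listed under Information Retrieval can be sketched as follows. This is a minimal illustration only: the tiny synset table and the function name `expand_query` are invented here and do not reproduce MultiWordNet data or any MultiWordNet API.

```python
# Toy synsets (hypothetical data): each synset id maps to its word forms.
SYNSETS = {
    "n#car": {"car", "auto", "automobile"},
    "n#bank_institution": {"bank", "depository"},
    "n#bank_river": {"bank", "riverbank"},
}


def expand_query(terms, synsets=SYNSETS):
    """Return the query terms plus every synonym found in a shared synset."""
    expanded = set(terms)
    for term in terms:
        for words in synsets.values():
            if term in words:
                expanded |= words
    return sorted(expanded)


print(expand_query(["car", "rental"]))
```

A polysemous term such as "bank" pulls in synonyms from both of its synsets in this sketch, which is one reason why expansion is usually discussed together with disambiguation.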

ItalWordNet


ItalWordNet (IWN) is a lexical-semantic database developed within two distinct research projects: EuroWordNet (EWN) and Sistema Integrato per il Trattamento Automatico del Linguaggio (SI-TAL), a national project devoted to creating large linguistic resources and software tools for processing written and spoken Italian.

The IWN database


- a wordnet containing about 47,000 lemmas, 50,000 synsets and 130,000 semantic relations (the most important of the encoded relations are hypernymy/hyponymy, antonymy, meronymy, cause relations, role relations, etc.)

- an Inter-Lingual Index (ILI), which is an unstructured version of WN 1.5; this module, used in EWN to link the wordnets of different languages, was kept in IWN so that the resource can be used in multilingual applications

- the Top Ontology (TO), a hierarchy of language-independent concepts that reflects fundamental semantic distinctions, built within EWN and partially modified in IWN to account for adjectives (not covered in EWN)

The TO consists of language-independent features that may (or may not) be lexicalized in various ways, or according to different patterns, in different languages [Rodriguez et al. 1998]. Through the ILI, all concepts in the wordnet are linked, directly or indirectly, to the TO.
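
The way synsets of different languages meet in the ILI, and reach the Top Ontology through it, can be sketched with two lookup tables. Everything in the example is hypothetical: the ILI identifiers, the synset ids and the ontology labels are made up for illustration and are not real IWN or EWN records.

```python
# Hypothetical records: language-specific synsets point to an ILI record,
# and ILI records point to Top Ontology concepts.
SYNSET_TO_ILI = {
    ("it", "cane#1"): "ILI-0001",
    ("en", "dog#1"): "ILI-0001",
    ("it", "mangiare#1"): "ILI-0002",
    ("en", "eat#1"): "ILI-0002",
}
ILI_TO_TOP = {
    "ILI-0001": ["Animal", "Living"],
    "ILI-0002": ["Dynamic", "Agentive"],
}


def equivalents(lang, synset, target_lang):
    """Synsets of target_lang that share an ILI record with the given synset."""
    ili = SYNSET_TO_ILI.get((lang, synset))
    if ili is None:
        return []
    return sorted(
        other
        for (lng, other), record in SYNSET_TO_ILI.items()
        if record == ili and lng == target_lang
    )


def top_concepts(lang, synset):
    """Top Ontology concepts reached through the ILI (empty if unlinked)."""
    return ILI_TO_TOP.get(SYNSET_TO_ILI.get((lang, synset)), [])


print(equivalents("it", "cane#1", "en"))
print(top_concepts("it", "mangiare#1"))
```

Because the two wordnets never point at each other directly, adding a third language in this sketch only requires new entries in `SYNSET_TO_ILI`.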


Mangiare (v., "to eat")


FrameNet

The Berkeley FrameNet project is creating an on-line lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results.

The FrameNet database


The FrameNet lexical database currently contains more than 11,600 lexical units, more than 6,800 of which are fully annotated, in more than 960 semantic frames, exemplified in more than 150,000 annotated sentences.


From a multilingual perspective


Aligning wordnets

[Figure: aligning the Dutch and English wordnets. Dutch wordnet nodes: muziekinstrument, orgel, hammond orgel, orgaan, orgel. English wordnet nodes: object, artifact object, natural object, instrument, musical instrument, organ (three nodes), hammond organ.]


General criteria

- Maximize the overlap with wordnets of other languages

- Maximize semantic consistency within and across wordnets

- Focus manual effort where it is needed

- Make the most of automatic techniques

Top-down methodology

Develop a core wordnet (5000 synsets)
- all the semantic building blocks, or foundation, to define the relations for all other, more specific synsets, e.g. building -> house, church, school
- provide a formal and explicit semantics

Validate the core wordnet
- does it include the most frequent words?
- are semantic constraints violated?

Extend the core wordnet (5000 synsets or more)
- automatic techniques for more specific concepts with high-confidence results
- add other levels of hyponymy
- add specific domains
- add 'easy' derivational words
- add 'easy' translation equivalence

Validate the complete wordnet
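
The two validation questions for the core wordnet can be sketched as simple checks. This is only an illustration under stated assumptions: the slides do not say which semantic constraints are tested, so the example picks one plausible constraint (the hypernym hierarchy must contain no cycles), assumes a single hypernym per synset, and uses invented toy data and function names.

```python
def missing_frequent_words(frequent_words, synsets):
    """Frequent words that appear in no synset of the core wordnet."""
    covered = set()
    for words in synsets.values():
        covered.update(words)
    return [w for w in frequent_words if w not in covered]


def has_hypernym_cycle(hypernym_of):
    """True if following hypernym links from some synset revisits a synset."""
    for start in hypernym_of:
        seen = {start}
        node = hypernym_of.get(start)
        while node is not None:
            if node in seen:
                return True
            seen.add(node)
            node = hypernym_of.get(node)
    return False


# Toy core wordnet (hypothetical): synset id -> word forms, and hypernym links.
CORE = {
    "building#1": {"building", "edifice"},
    "house#1": {"house"},
    "church#1": {"church"},
}
HYPERNYM_OF = {"house#1": "building#1", "church#1": "building#1"}

print(missing_frequent_words(["house", "school", "building"], CORE))
print(has_hypernym_cycle(HYPERNYM_OF))
```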


Developing a core wordnet

Define a set of concepts (so-called Base Concepts) that play an important role in wordnets:
- high position in the hierarchy & high connectivity
- represented as English WordNet synsets
- Common base concepts: shared by various wordnets in different languages
- Local base concepts: not shared
- EuroWordNet: 1024 synsets, shared by 2 or more languages
- BalkaNet: 5000 synsets (including the 1024)

Common semantic framework for all Base Concepts in the form of a Top-Ontology

Manually translate all Base Concepts (English WordNet synsets) to synsets in the local languages (applied for 13 wordnets)

Manually build and verify the hypernym relations for the Base Concepts

All 13 wordnets are developed from a similar semantic core closely related to the English WordNet
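
The two selection criteria for Base Concepts (high position in the hierarchy, high connectivity) can be sketched as a filter over a toy hierarchy. The thresholds `max_depth` and `min_relations`, the relation counts and the function names are assumptions made for this example; the slides do not state how the criteria were actually measured.

```python
def depth(synset, hypernym_of):
    """Number of hypernym links between a synset and the top of its hierarchy.

    Assumes the hierarchy is acyclic and each synset has one hypernym.
    """
    steps = 0
    while synset in hypernym_of:
        synset = hypernym_of[synset]
        steps += 1
    return steps


def select_base_concepts(hypernym_of, relation_count, max_depth=1, min_relations=3):
    """Synsets that sit high in the hierarchy and have many relations."""
    synsets = set(hypernym_of) | set(hypernym_of.values()) | set(relation_count)
    return sorted(
        s
        for s in synsets
        if depth(s, hypernym_of) <= max_depth
        and relation_count.get(s, 0) >= min_relations
    )


# Toy data (hypothetical): hypernym links and number of relations per synset.
HYPERNYM_OF = {"artifact": "object", "building": "artifact", "house": "building"}
RELATION_COUNT = {"object": 12, "artifact": 9, "building": 5, "house": 1}

print(select_base_concepts(HYPERNYM_OF, RELATION_COUNT))
```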

Top-down methodology

[Figure: two wordnets built top-down and connected through shared resources. Labels: 63 TCs; 1024 CBCs; Top-Ontology; Inter-Lingual-Index; WordNet 1.5 synsets; and, repeated for each of the two wordnets, hypernyms, CBC representatives, local BCs, first-level hyponyms, remaining hyponyms, WMs related via non-hyponymy.]

Top-down methodology

[Figure: top-down construction of an Arabic wordnet alongside the English wordnet. Labels: SUMO ontology; WordNet synsets; 1000 synsets; SBC; CBC; ABC; hypernyms; EuroWordNet and BalkaNet Base Concepts (5000 synsets); next-level hyponyms; more hyponyms; easy translations; named entities; WordNet Domains (domain "chemics"); English-Arabic lexicon; Arabic word frequency; Arabic roots & derivation rules.]

Advantages of the approach

Well-defined semantics that can be inherited down to more specific concepts

Apply consistency checks

Automatic techniques can build on the semantic basis

Most frequent concepts and words are covered

High overlap and compatibility with other wordnets

Manual effort is focussed on the most difficult concepts and words


WordNet Domains: concepts per domain

WordNet Domain    Concepts  Proportion (%)    WordNet Domain    Concepts  Proportion (%)
acoustics              104           0.092    linguistics           1545           1.363
administration        2974           2.624    literature             686           0.605
aeronautic             154           0.136    mathematics            575           0.507
agriculture            306           0.270    mechanics              532           0.469
alimentation            28           0.025    medicine              2690           2.374
anatomy               2705           2.387    merchant_navy          485           0.428
anthropology           896           0.791    meteorology            231           0.204
applied_science         28           0.025    metrology             1409           1.243
archaeology             68           0.060    military              1490           1.315
archery                  5           0.004    money                  624           0.551
architecture           255           0.225    mountaineering          28           0.025
art                    420           0.371    music                  985           0.869
artisanship            148           0.131    mythology              314           0.277
astrology               17           0.015    number                 220           0.194
astronautics            29           0.026    numismatics             43           0.038
astronomy              376           0.332    occultism               52           0.046
athletics               22           0.019    oceanography            10           0.009


From Semeraro's slides...


Nouns


WordNet divides nouns into 25 distinct semantic fields (animal, substance...)

Within each semantic field, nouns are organized into a lexical tree according to the hypernymy relation

The inheritance principle holds

A noun (canary) can be associated with:

- attributes of the noun (small and yellow)

- parts of the noun (beak and wings)

- functions of the noun (sings and flies)

Many of the attributes, parts and activities of a term are inherited from its direct hypernym
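
The inheritance principle can be sketched by walking up the hypernym chain and merging what each level contributes. The toy hierarchy below, the placement of each feature on a given level, and the "breathes" feature are assumptions made for illustration; they are not taken from WordNet data.

```python
# Toy hierarchy and features (hypothetical placement of each feature).
HYPERNYM_OF = {"canary": "bird", "bird": "animal"}
FEATURES = {
    "canary": {"attributes": {"small", "yellow"}, "functions": {"sings"}},
    "bird": {"parts": {"beak", "wings"}, "functions": {"flies"}},
    "animal": {"functions": {"breathes"}},
}


def inherited_features(noun, hypernym_of=HYPERNYM_OF, features=FEATURES):
    """Merge a noun's own features with those of all its hypernyms."""
    merged = {"attributes": set(), "parts": set(), "functions": set()}
    while noun is not None:
        for kind, values in features.get(noun, {}).items():
            merged[kind] |= values
        noun = hypernym_of.get(noun)
    return merged


print(inherited_features("canary"))
```

In this sketch only "small", "yellow" and "sings" are stored on the canary itself; the parts and the remaining functions arrive through its hypernyms.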


Online querying


http://wordnetweb.princeton.edu/perl/webwn


WordNet statistics


Polysemy in WordNet


Verbs in the database (Semeraro)

About 10,000 forms, 20,000 senses

A verb is the core on which the semantics of a sentence is built

The meaning of a verb changes according to the noun it is combined with

To resolve the ambiguity, one could imagine adding to each verb synset a pointer to the synset of the noun that the verb's meaning refers to


Once the idea proposed earlier was abandoned, verbs were instead divided into several semantic categories (files)

With this organization, the meaning of a verb within a category is no longer ambiguous, because it is tied to the semantic category itself


Verb relations

V1 ENTAILS V2
when "Someone V1" (logically) entails "Someone V2"
- e.g. snore entails sleep

TROPONYMY
when "To do V1" is "To do V2 in some manner"
- e.g. limp is a troponym of walk

Hypernym: fly -> travel
Troponym: walk -> stroll
Entails: snore -> sleep
Antonym: increase -> decrease
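
The four example pairs can be stored as typed, directed links and queried by relation. This is a minimal sketch: the triple layout and the function name `related` are choices made for this example and do not reflect WordNet's own storage format.

```python
# (relation, source verb, target verb), copied from the example pairs above.
VERB_RELATIONS = [
    ("hypernym", "fly", "travel"),
    ("troponym", "walk", "stroll"),
    ("entails", "snore", "sleep"),
    ("antonym", "increase", "decrease"),
]


def related(verb, relation, relations=VERB_RELATIONS):
    """Targets linked to `verb` by `relation`, in the direction stored."""
    return [target for rel, source, target in relations
            if rel == relation and source == verb]


print(related("snore", "entails"))
print(related("walk", "troponym"))
```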

Differences in wordnet structures

[Figure: the same objects classified in the Dutch wordnet and in WordNet 1.5.]

Dutch wordnet nodes: voorwerp (object); lepel (spoon); werktuig (tool); tas (bag); bak (box); blok (block); lichaam (body)

WordNet 1.5 nodes: object; natural object (an object occurring naturally); artifact, artefact (a man-made object); instrumentality; container, device, implement; tool, instrument; bag, spoon, box; block, body

- Artificial classes versus lexicalized classes: instrumentality; natural object

- Lexicalization differences of classes: container and artifact (object) are not lexicalized in Dutch



Page 6: Semantica e lessico - Alphabit.net · et.pdf) altre da Piek Vossen 2 Wordnets ... all’utente di inferire la semantica della word form in esame purché conosca la semantica di almeno

12032010

6

Statistiche su Wordnet

I Chiari Linguistica computazionale - aa 20092010

11

Polisemia in Wordnet

I Chiari Linguistica computazionale - aa 20092010

12

12032010

7

200405ANLE

13

Verbi nel database (Semeraro)

About 10000 forms 20000 senses

Un verbo egrave il nucleo su cui si basa la semantica

associata ad una frase

Il significato dei verbi cambia a seconda del nome

con cui i verbi stessi sono associati

Per risolvere lrsquoambiguitagrave si potrebbe immaginare di

inserire in ogni synset di verbi un puntatore al

synset del nome a cui il significato del verbo egrave

riferito

I Chiari Linguistica computazionale - aa 20092010

14

Abbandonata lrsquoidea proposta precedentemente si

egrave pensato di suddividere i verbi in varie categorie

semantiche (file)

Con tale organizzazione il significato di un verbo in

una categoria non egrave piugrave soggetto ad ambiguitagrave

percheacute legato alla categoria semantica stessa

12032010

8

200405ANLE

15

Relazioni verbali

V1 ENTAILS V2

when Someone V1 (logically) entails Someone V2

- eg snore entails sleep

TROPONYMY

when To do V1 is To do V2 in some manner

- eg limp is a troponym of walk

Hypernym fly-gt travel

Troponym Walk -gt stroll

Entails Snore -gt sleep

Antonym Increase -gt decrease

Differences in wordnet structures

[Figure: two hierarchies compared. Dutch Wordnet nodes: voorwerp (object); lepel (spoon), werktuig (tool), tas (bag), bak (box), blok (block), lichaam (body). WordNet 1.5 nodes: object; natural object (an object occurring naturally); artifact, artefact (a man-made object); instrumentality, block, body; container, device, implement; tool, instrument; bag, spoon, box.]

- Artificial classes versus lexicalized classes: instrumentality, natural object
- Lexicalization differences of classes: container and artifact (object) are not lexicalized in Dutch

Applications of WordNet

http://www.lexiology.com

Memidex, a WordNet application

MultiWordNet

http://multiwordnet.fbk.eu

- lexical relations between words
- semantic relations between lexical concepts (synsets)
- correspondences between Italian and English lexical concepts
- semantic fields (domains)

The latest version of MultiWordNet (1.39) contains around 58,000 Italian word senses and 41,500 lemmas organized into 32,700 synsets, aligned whenever possible with Princeton WordNet English synsets.

Semantic and lexical relations

Applications of MWN

- Information Retrieval: synonymy relations are used for query expansion to improve the recall of IR; cross-language correspondences between Italian and English synsets are used for Cross-Language Information Retrieval.
- Semantic tagging: MultiWordNet constitutes a large-coverage sense inventory, which is the basis for semantic tagging, i.e. texts are tagged with synset identifiers.
- Disambiguation: semantic relationships are used to measure the semantic distance between words, which can be used to disambiguate the meaning of words in texts. Semantic fields have also proved to be very useful for the disambiguation task.
- Ontologies: MultiWordNet can be seen as an ontology to be used for a variety of knowledge-based NLP tasks.
- Terminologies: MultiWordNet constitutes a robust framework supporting the development of specific structured terminologies.
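The Information Retrieval use listed above can be sketched as follows. This is a minimal illustration on invented data, not the MultiWordNet API: the synset identifiers, the `SYNSETS` table and both function names are assumptions made for the example.

```python
# Hypothetical aligned synsets: synset id -> lemmas per language.
SYNSETS = {
    "n#car": {"en": ["car", "automobile"],
              "it": ["automobile", "macchina", "auto"]},
    "n#house": {"en": ["house"], "it": ["casa", "abitazione"]},
}


def synsets_of(word, lang):
    """Return the ids of all synsets that contain `word` in `lang`."""
    return [sid for sid, lemmas in SYNSETS.items()
            if word in lemmas.get(lang, [])]


def expand_query(terms, source_lang, target_lang=None):
    """Expand each query term with the lemmas of its synsets.

    With target_lang unset, this is monolingual synonym expansion.
    With a different target_lang, the aligned synsets act as a bridge
    for cross-language retrieval.
    """
    target = target_lang or source_lang
    expanded = []
    for term in terms:
        candidates = [term] if target == source_lang else []
        for sid in synsets_of(term, source_lang):
            candidates.extend(SYNSETS[sid].get(target, []))
        if not candidates:
            candidates = [term]  # unknown term: keep it unchanged
        for candidate in candidates:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded
```

A polysemous term would pull in the lemmas of all its synsets, so in practice expansion is usually combined with the disambiguation step listed in the same slide.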

ItalWordNet

ItalWordNet (IWN) is a lexical-semantic database developed within two distinct research projects: EuroWordNet (EWN) and Sistema Integrato per il Trattamento Automatico del Linguaggio (SI-TAL), a national project devoted to the creation of large linguistic resources and software tools for processing written and spoken Italian.

The IWN database

- a wordnet containing about 47,000 lemmas, 50,000 synsets and 130,000 semantic relations (the most important of the encoded relations are hyperonymy/hyponymy, antonymy, meronymy, cause relations, role relations, etc.);
- an Inter-Lingual Index (ILI), which is an unstructured version of WN 1.5; this module, used in EWN to link wordnets of different languages, was kept in IWN as well, to make the resource usable in multilingual applications;
- the Top Ontology (TO), a hierarchy of language-independent concepts reflecting fundamental semantic distinctions, built within EWN and partially modified in IWN to account for adjectives (not covered in EWN);
- the TO consists of language-independent features that may (or may not) be lexicalized in various ways, or according to different patterns, in different languages [Rodriguez et al. 1998]; through the ILI, all concepts in the wordnet are directly or indirectly linked to the TO.
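How the three modules fit together can be sketched with a flat index of records that each language-specific wordnet points into. The record identifiers, the Top Ontology labels attached to them and the function names are illustrative assumptions. The Dutch and English words are the orgel/orgaan/organ example used elsewhere in these slides.

```python
# Unstructured ILI: a flat set of records with no relations among them.
# Each record may carry Top Ontology features (labels are illustrative).
ILI = {
    "ili-organ-instrument": {"top_ontology": ["Artifact", "Instrument"]},
    "ili-organ-body-part": {"top_ontology": ["Living", "Natural"]},
}

# Each wordnet links its own word meanings to ILI records.
WORDNETS = {
    "nl": {"orgel": ["ili-organ-instrument"],
           "orgaan": ["ili-organ-body-part"]},
    "en": {"organ": ["ili-organ-instrument", "ili-organ-body-part"]},
}


def equivalents(word, source, target):
    """Words in `target` that share at least one ILI record with `word`."""
    ili_ids = set(WORDNETS[source].get(word, []))
    return sorted(other for other, ids in WORDNETS[target].items()
                  if ili_ids & set(ids))


def top_features(word, lang):
    """Top Ontology features reached from `word` through the ILI."""
    features = set()
    for ili_id in WORDNETS[lang].get(word, []):
        features.update(ILI[ili_id]["top_ontology"])
    return sorted(features)
```

Because the ILI carries no structure of its own, each wordnet keeps its language-specific hierarchy while cross-language lookup stays a simple join on record identifiers.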

Mangiare (v), "to eat"

FrameNet

The Berkeley FrameNet project is creating an on-line lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results.

Database

The FrameNet lexical database currently contains more than 11,600 lexical units (defined below), more than 6,800 of which are fully annotated, in more than 960 semantic frames, exemplified in more than 150,000 annotated sentences.

From a multilingual perspective

Aligning wordnets

[Figure: aligning the Dutch wordnet with the English wordnet. Dutch side: muziekinstrument, orgel, hammond orgel, orgaan. English side: object, natural object, artifact, instrument, musical instrument, organ (three senses), hammond organ.]

General criteria

- Maximize the overlap with wordnets of other languages
- Maximize semantic consistency within and across wordnets
- Focus manual effort where it is needed
- Make maximal use of automatic techniques

Top-down methodology

- Develop a core wordnet (5,000 synsets)
  - all the semantic building blocks or foundation to define the relations for all other, more specific synsets, e.g. building -> house, church, school
  - provide a formal and explicit semantics
- Validate the core wordnet
  - does it include the most frequent words?
  - are semantic constraints violated?
- Extend the core wordnet (5,000 synsets or more)
  - automatic techniques for more specific concepts with high-confidence results
  - add other levels of hyponymy
  - add specific domains
  - add 'easy' derivational words
  - add 'easy' translation equivalence
- Validate the complete wordnet
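The two validation questions above can be sketched as simple checks. This is an illustration under stated assumptions: each synset has at most one hypernym, the only semantic constraint tested is that the hypernym hierarchy contains no cycle, and the function names and data layout are hypothetical.

```python
def missing_frequent_words(lemmas_by_synset, frequent_words):
    """Frequent words that no synset in the core wordnet covers."""
    covered = {lemma
               for lemmas in lemmas_by_synset.values()
               for lemma in lemmas}
    return sorted(set(frequent_words) - covered)


def synsets_in_hypernym_cycle(hypernym_of):
    """Synsets that reach themselves again by following hypernym links.

    `hypernym_of` maps a synset to its single hypernym (an assumption;
    real wordnets also allow multiple hypernyms).
    """
    offenders = []
    for start in hypernym_of:
        seen = set()
        node = hypernym_of.get(start)
        while node is not None and node not in seen:
            if node == start:
                offenders.append(start)
                break
            seen.add(node)
            node = hypernym_of.get(node)
    return sorted(offenders)
```

For the slide's own example, the mapping `{"house": "building", "church": "building", "school": "building"}` contains no cycle, so the second check returns an empty list.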

Developing a core wordnet

- Define a set of concepts (so-called Base Concepts) that play an important role in wordnets:
  - high position in the hierarchy & high connectivity
  - represented as English WordNet synsets
  - Common base concepts: shared by various wordnets in different languages
  - Local base concepts: not shared
  - EuroWordNet: 1,024 synsets, shared by 2 or more languages
  - BalkaNet: 5,000 synsets (including the 1,024)
- Common semantic framework for all Base Concepts, in the form of a Top-Ontology
- Manually translate all Base Concepts (English WordNet synsets) to synsets in the local languages (was applied for 13 wordnets)
- Manually build and verify the hypernym relations for the Base Concepts
- All 13 wordnets are developed from a similar semantic core, closely related to the English WordNet
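The two selection criteria for Base Concepts (high position in the hierarchy, high connectivity) can be sketched as a filter over a small relation graph. The thresholds `max_depth` and `min_links`, the function names and the toy data are assumptions for illustration. The slides do not state the exact criteria used in EuroWordNet or BalkaNet.

```python
from collections import Counter


def depth(synset, hypernym_of):
    """Number of hypernym links between a synset and the top of its chain."""
    steps = 0
    seen = {synset}
    while synset in hypernym_of:
        synset = hypernym_of[synset]
        if synset in seen:  # guard against cycles in malformed data
            break
        seen.add(synset)
        steps += 1
    return steps


def select_base_concepts(hypernym_of, other_relations, max_depth, min_links):
    """Keep synsets that are both high in the hierarchy and well connected.

    Connectivity counts direct hyponyms plus any other (non-hyponymy)
    relation in which the synset takes part.
    """
    link_count = Counter(hypernym_of.values())
    for source, target in other_relations:
        link_count[source] += 1
        link_count[target] += 1
    synsets = set(hypernym_of) | set(hypernym_of.values())
    for pair in other_relations:
        synsets.update(pair)
    return sorted(s for s in synsets
                  if depth(s, hypernym_of) <= max_depth
                  and link_count[s] >= min_links)
```

With a small hierarchy such as object > artifact > building > house, church, school, a filter with `max_depth=2` and `min_links=2` keeps artifact and building and drops the leaves.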

[Figure: top-down methodology. 63 TCs and 1,024 CBCs (Common Base Concepts) are connected to the Top-Ontology and to the Inter-Lingual-Index (WordNet 1.5 synsets). For each wordnet the layers shown are: hypernyms, CBC representatives, local BCs, WMs related via non-hyponymy, first-level hyponyms, remaining hyponyms.]

[Figure: top-down methodology, English Wordnet and Arabic Wordnet. Labels shown: SUMO Ontology; EuroWordNet and BalkaNet Base Concepts; SBC, CBC, ABC; 1,000 synsets; 5,000 synsets; hypernyms; next-level hyponyms; more hyponyms; WordNet synsets; WordNet Domains; domain "chemics"; named entities; easy translations; English-Arabic lexicon; Arabic word frequency; Arabic roots & derivation rules.]

Advantages of the approach

- Well-defined semantics that can be inherited down to more specific concepts
- Apply consistency checks
- Automatic techniques can use semantic basis
- Most frequent concepts and words are covered
- High overlap and compatibility with other wordnets
- Manual effort is focussed on the most difficult concepts and words

WordNet Domains Concepts  Proportion (%)    WordNet Domains Concepts  Proportion (%)
acoustics            104           0.092    linguistics         1545           1.363
administration      2974           2.624    literature           686           0.605
aeronautic           154           0.136    mathematics          575           0.507
agriculture          306           0.270    mechanics            532           0.469
alimentation          28           0.025    medicine            2690           2.374
anatomy             2705           2.387    merchant_navy        485           0.428
anthropology         896           0.791    meteorology          231           0.204
applied_science       28           0.025    metrology           1409           1.243
archaeology           68           0.060    military            1490           1.315
archery                5           0.004    money                624           0.551
architecture         255           0.225    mountaineering        28           0.025
art                  420           0.371    music                985           0.869
artisanship          148           0.131    mythology            314           0.277
astrology             17           0.015    number               220           0.194
astronautics          29           0.026    numismatics           43           0.038
astronomy            376           0.332    occultism             52           0.046
athletics             22           0.019    oceanography          10           0.009



Page 9: Semantica e lessico - Alphabit.net · et.pdf) altre da Piek Vossen 2 Wordnets ... all’utente di inferire la semantica della word form in esame purché conosca la semantica di almeno

12032010

9

Applicazioni di Wordnet

I Chiari Linguistica computazionale - aa 20092010

17

httpwwwlexiologycom

I Chiari Linguistica computazionale - aa 2009201018

12032010

10

Memidex applicazione Wordnet

I Chiari Linguistica computazionale - aa 20092010

19

I Chiari Linguistica computazionale - aa 2009201020

12032010

11

I Chiari Linguistica computazionale - aa 2009201021

Multiwordnet22

I Chiari Linguistica computazionale - aa 20092010

12032010

12

I Chiari Linguistica computazionale - aa 20092010

23

httpmultiwordnetfbkeu

lexical relations between words

semantic relations between lexical concepts

(synsets)

correspondences between Italian and English lexical

concepts

semantic fields (domains)

I Chiari Linguistica computazionale - aa 20092010

24

The lastest version of MultiWordNet (139) contains

around 58000 Italian word senses and 41500

lemmas organized into 32700 synsets aligned

whenever possible with Princeton WordNet English

synsets

12032010

13

Relazioni semantiche e lessicali

I Chiari Linguistica computazionale - aa 20092010

25

I Chiari Linguistica computazionale - aa 2009201026

12032010

14

I Chiari Linguistica computazionale - aa 2009201027

I Chiari Linguistica computazionale - aa 2009201028

12032010

15

Le applicazioni di MWN

I Chiari Linguistica computazionale - aa 20092010

29

Information Retrieval synonymy relations are used for query expansion to improve the recall of IR cross language correspondences between Italian and English synsets are used for Cross Language Information Retrieval

Semantic tagging MultiWordNet constitutes a large coverage sense inventory which is the basis for semantic tagging ie texts are tagged with synset identifiers

Disambiguation Semantic relationships are used to measure the semantic distance between words which can be used to disambiguate the meaning of words in texts Also semantic fields have proved to be very useful for the disambiguation task

Ontologies MultiWordNet can be seen as an ontology to be used for a variety of knowledge-based NLP tasks

Terminologies MultiWordNet constitutes a robust framework supporting the development of specific structured terminologies

ItalWordNet

ItalWordNet (IWN) is a lexical-semantic database developed within two distinct research projects: EuroWordNet (EWN) and Sistema Integrato per il Trattamento Automatico del Linguaggio (SI-TAL), a national project devoted to creating large linguistic resources and software tools for processing written and spoken Italian.

The IWN database

a wordnet containing about 47,000 lemmas, 50,000 synsets and 130,000 semantic relations (the most important of the encoded relations are hypernymy/hyponymy, antonymy, meronymy, cause relations, role relations, etc.)

an Inter-Lingual Index (ILI), which is an unstructured version of WordNet 1.5; this module, used in EWN to link wordnets of different languages, was retained in IWN so that the resource can be used in multilingual applications

the Top Ontology (TO), a hierarchy of language-independent concepts that reflects fundamental semantic distinctions, built within EWN and partially modified in IWN to account for adjectives (not covered in EWN)

The TO consists of language-independent features that may (or may not) be lexicalized in various ways, or according to different patterns, in different languages [Rodriguez et al. 1998]; through the ILI, all concepts in the wordnet are linked, directly or indirectly, to the TO.
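As a rough illustration of how the three modules fit together, the sketch below looks up the Top Ontology labels of a synset through its ILI record, climbing the hypernym chain when the synset itself has no labels. This is only one possible reading of "directly or indirectly"; the synsets, ILI identifiers and label assignments are invented for the example and do not reproduce the IWN data.

```python
# Hypothetical miniature of the three modules: a wordnet with hypernym
# links, ILI records, and Top Ontology labels attached to some ILI records.
HYPERNYM = {"cane": "animale", "animale": "essere_vivente"}  # synset -> hypernym
ILI_LINK = {
    "cane": "ili:dog",
    "animale": "ili:animal",
    "essere_vivente": "ili:living_thing",
}
TOP_ONTOLOGY = {
    "ili:animal": ["Animal"],
    "ili:living_thing": ["Living", "Natural"],
}


def top_concepts(synset):
    """Return (TO labels, number of hypernym steps needed to reach them).

    Zero steps means the synset is linked to the TO directly through its
    own ILI record; more steps mean the link is inherited from a hypernym.
    """
    steps = 0
    seen = set()  # guards against accidental loops in the hypernym chain
    while synset is not None and synset not in seen:
        seen.add(synset)
        labels = TOP_ONTOLOGY.get(ILI_LINK.get(synset))
        if labels:
            return labels, steps
        synset = HYPERNYM.get(synset)
        steps += 1
    return [], steps


print(top_concepts("animale"))  # direct link
print(top_concepts("cane"))     # inherited from the hypernym
```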

Mangiare (v), "to eat"

FrameNet

The Berkeley FrameNet project is creating an on-line lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results.

Database

The FrameNet lexical database currently contains more than 11,600 lexical units (defined below), more than 6,800 of which are fully annotated, in more than 960 semantic frames, exemplified in more than 150,000 annotated sentences.
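The "automatic tabulation" of valences can be pictured with a small sketch: given annotated example sentences for one lexical unit, count how often each combination of frame element and phrase type occurs. The record layout, the sentences and the labels below are assumptions made for the example; they are not FrameNet's file format or data.

```python
from collections import Counter

# Toy annotations for one lexical unit; the frame name, frame-element names
# and sentences are illustrative assumptions, not records copied from FrameNet.
ANNOTATIONS = [
    {"lu": "eat.v", "frame": "Ingestion",
     "text": "The children ate the cake",
     "elements": [("Ingestor", "NP"), ("Ingestibles", "NP")]},
    {"lu": "eat.v", "frame": "Ingestion",
     "text": "We ate early",
     "elements": [("Ingestor", "NP")]},
    {"lu": "eat.v", "frame": "Ingestion",
     "text": "She ate an apple",
     "elements": [("Ingestor", "NP"), ("Ingestibles", "NP")]},
]


def valence_patterns(annotations, lexical_unit):
    """Count each combination of (frame element, phrase type) for a lexical unit."""
    patterns = Counter()
    for sentence in annotations:
        if sentence["lu"] == lexical_unit:
            patterns[tuple(sentence["elements"])] += 1
    return patterns


# Display the tabulated patterns, most frequent first.
for pattern, count in valence_patterns(ANNOTATIONS, "eat.v").most_common():
    print(count, " + ".join(f"{fe}:{pt}" for fe, pt in pattern))
```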

In a multilingual perspective

Aligning wordnets

[Figure: alignment of the Dutch wordnet with the English wordnet. Dutch nodes: muziekinstrument, orgel, hammond orgel, orgaan. English nodes: object, artifact, natural object, instrument, musical instrument, organ (three senses), hammond organ.]
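The alignment idea can be sketched as a join on a shared, language-neutral index: two synsets from different wordnets are equivalent when they point to the same index record. The record identifiers and the sense numbers below are invented for the example, and only two of the English senses of "organ" are modelled.

```python
# Each wordnet maps its own synsets to language-neutral index records.
# The record ids and sense numbers are invented for this sketch.
DUTCH = {
    "muziekinstrument": "ili:musical-instrument",
    "orgel": "ili:organ-instrument",
    "hammond orgel": "ili:hammond-organ",
    "orgaan": "ili:organ-body-part",
}
ENGLISH = {
    "musical instrument": "ili:musical-instrument",
    "organ#1": "ili:organ-instrument",
    "organ#2": "ili:organ-body-part",
    "hammond organ": "ili:hammond-organ",
}


def align(source, target):
    """Pair synsets of two wordnets that point to the same index record."""
    by_record = {}
    for synset, record in target.items():
        by_record.setdefault(record, []).append(synset)
    return {synset: by_record.get(record, [])
            for synset, record in source.items()}


for dutch, english in align(DUTCH, ENGLISH).items():
    print(dutch, "->", english)
```

The point of the example is that the single English word form "organ" is split across two Dutch synsets, orgel (the instrument) and orgaan (the body part), so alignment has to work at the level of synsets rather than word forms.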

General criteria

Maximize the overlap with wordnets of other languages

Maximize semantic consistency within and across wordnets

Focus manual effort where it is needed

Make maximal use of automatic techniques

Top-down methodology

Develop a core wordnet (5,000 synsets):

all the semantic building blocks, or foundation, needed to define the relations for all other, more specific synsets, e.g. building -> house, church, school

provide a formal and explicit semantics

Validate the core wordnet:

does it include the most frequent words?

are semantic constraints violated?

Extend the core wordnet (5,000 synsets or more):

automatic techniques for more specific concepts with high-confidence results

add other levels of hyponymy

add specific domains

add 'easy' derivational words

add 'easy' translation equivalence

Validate the complete wordnet
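The two validation questions can be turned into simple automatic checks. The sketch below assumes a core wordnet stored as a synset-to-hypernym table plus the words of each synset, and it treats one specific constraint (the hypernym chain must be free of loops and must stay inside the wordnet) as a stand-in for "semantic constraints". The data and function names are hypothetical.

```python
# Hypothetical core wordnet: synset -> hypernym (None marks a top node),
# plus the words that each synset lexicalizes.
HYPERNYM = {
    "artifact": None,
    "building": "artifact",
    "house": "building",
    "church": "building",
    "school": "building",
}
WORDS = {
    "artifact": ["artifact"],
    "building": ["building", "edifice"],
    "house": ["house"],
    "church": ["church"],
    "school": ["school"],
}


def missing_frequent_words(frequent_words):
    """Frequent words that no synset in the core wordnet covers."""
    covered = {word for words in WORDS.values() for word in words}
    return [w for w in frequent_words if w not in covered]


def hypernym_violations():
    """Synsets whose hypernym chain loops or points outside the wordnet."""
    bad = []
    for start in HYPERNYM:
        seen, node = set(), start
        while node is not None:
            if node in seen or node not in HYPERNYM:
                bad.append(start)
                break
            seen.add(node)
            node = HYPERNYM[node]
    return bad


print(missing_frequent_words(["house", "school", "car"]))
print(hypernym_violations())
```

Checks of this kind are cheap to rerun after each extension step, which is one way to support the final "validate the complete wordnet" stage.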

Developing a core wordnet

Define a set of concepts (so-called Base Concepts) that play an important role in wordnets:

high position in the hierarchy and high connectivity

represented as English WordNet synsets

Common base concepts: shared by various wordnets in different languages

Local base concepts: not shared

EuroWordNet: 1,024 synsets, shared by 2 or more languages

BalkaNet: 5,000 synsets (including the 1,024)

Common semantic framework for all Base Concepts, in the form of a Top-Ontology

Manually translate all Base Concepts (English WordNet synsets) to synsets in the local languages (applied for 13 wordnets)

Manually build and verify the hypernym relations for the Base Concepts

All 13 wordnets are developed from a similar semantic core, closely related to the English WordNet
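A minimal sketch of the selection criteria follows, assuming that "high position" is measured as a small number of hypernym steps to the top and "high connectivity" as a large number of relations. The thresholds, the hierarchy fragment and the relation counts are hypothetical; the slides do not state how the two criteria were actually quantified.

```python
# Hypothetical fragment: synset -> hypernym, plus a count of all relations
# (of any type) in which each synset takes part.
HYPERNYM = {
    "entity": None,
    "object": "entity",
    "artifact": "object",
    "building": "artifact",
    "house": "building",
    "bungalow": "house",
}
RELATION_COUNT = {
    "entity": 40, "object": 35, "artifact": 30,
    "building": 12, "house": 9, "bungalow": 1,
}


def depth(synset):
    """Number of hypernym steps from `synset` up to its top node."""
    steps = 0
    while HYPERNYM[synset] is not None:
        synset = HYPERNYM[synset]
        steps += 1
    return steps


def base_concepts(max_depth, min_relations):
    """Synsets that sit high in the hierarchy and are highly connected."""
    return [s for s in HYPERNYM
            if depth(s) <= max_depth and RELATION_COUNT[s] >= min_relations]


def common_base_concepts(selections, min_languages=2):
    """Concepts selected by at least `min_languages` wordnets."""
    counts = {}
    for selected in selections.values():
        for concept in set(selected):
            counts[concept] = counts.get(concept, 0) + 1
    return sorted(c for c, n in counts.items() if n >= min_languages)


print(base_concepts(max_depth=3, min_relations=10))
print(common_base_concepts({
    "nl": ["object", "building"],
    "it": ["object", "house"],
    "es": ["building"],
}))
```

The default `min_languages=2` mirrors the "shared by 2 or more languages" criterion quoted for the EuroWordNet common base concepts; concepts selected by a single language correspond to local base concepts.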

[Figure: Top-down methodology. Labels: Top-Ontology (63 TCs); Inter-Lingual-Index (1024 CBCs, WordNet 1.5 synsets); for each of two wordnets: hypernyms, CBC representatives, local BCs, WMs related via non-hyponymy, first-level hyponyms, remaining hyponyms.]

[Figure: Top-down methodology, English Wordnet and Arabic Wordnet. Labels: SUMO Ontology; WordNet synsets; EuroWordNet and BalkaNet Base Concepts (5,000 synsets); 1,000 synsets; SBC, CBC, ABC; hypernyms; next-level hyponyms; more hyponyms; easy translations; named entities; English-Arabic lexicon; WordNet Domains; domain "chemics"; Arabic word frequency; Arabic roots and derivation rules.]

Advantages of the approach

Well-defined semantics that can be inherited down to more specific concepts

Consistency checks can be applied

Automatic techniques can build on the semantic basis

Most frequent concepts and words are covered

High overlap and compatibility with other wordnets

Manual effort is focussed on the most difficult concepts and words

WordNet Domains

Domain           Concepts  Proportion (%)    Domain           Concepts  Proportion (%)
acoustics             104           0.092    linguistics          1545           1.363
administration       2974           2.624    literature            686           0.605
aeronautic            154           0.136    mathematics           575           0.507
agriculture           306           0.270    mechanics             532           0.469
alimentation           28           0.025    medicine             2690           2.374
anatomy              2705           2.387    merchant_navy         485           0.428
anthropology          896           0.791    meteorology           231           0.204
applied_science        28           0.025    metrology            1409           1.243
archaeology            68           0.060    military             1490           1.315
archery                 5           0.004    money                 624           0.551
architecture          255           0.225    mountaineering         28           0.025
art                   420           0.371    music                 985           0.869
artisanship           148           0.131    mythology             314           0.277
astrology              17           0.015    number                220           0.194
astronautics           29           0.026    numismatics            43           0.038
astronomy             376           0.332    occultism              52           0.046
athletics              22           0.019    oceanography           10           0.009


Semantic and lexical relations


Applications of MultiWordNet (MWN)

Information Retrieval: synonymy relations are used for query expansion to improve the recall of IR; cross-language correspondences between Italian and English synsets are used for Cross-Language Information Retrieval.

Semantic tagging: MultiWordNet constitutes a large-coverage sense inventory, which is the basis for semantic tagging, i.e. texts are tagged with synset identifiers.

Disambiguation: semantic relationships are used to measure the semantic distance between words, which can be used to disambiguate the meaning of words in texts. Semantic fields have also proved to be very useful for the disambiguation task.

Ontologies: MultiWordNet can be seen as an ontology to be used for a variety of knowledge-based NLP tasks.

Terminologies: MultiWordNet constitutes a robust framework supporting the development of specific structured terminologies.
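The first application, query expansion with synonyms and with cross-language correspondences, can be sketched roughly as follows. The synset identifiers, the word lists and the function `expand_query` are all invented for the example; this is not the MultiWordNet API or its data format, and no word sense disambiguation is attempted.

```python
# Toy aligned synsets: one identifier per concept, with the word forms that
# express it in each language. All entries are invented examples.
SYNSETS = {
    "n#0001": {
        "it": ["casa", "abitazione", "dimora"],
        "en": ["house", "dwelling", "home"],
    },
    "n#0002": {"it": ["chiesa"], "en": ["church"]},
}


def expand_query(terms, source_lang, target_lang=None):
    """Replace each query term by all word forms of its synsets.

    With target_lang set, the word forms of the aligned synsets in that
    language are returned instead (cross-language expansion).
    """
    lang = target_lang or source_lang
    expanded = []
    for term in terms:
        forms = [
            form
            for synset in SYNSETS.values()
            if term in synset[source_lang]
            for form in synset[lang]
        ]
        # A term found in no synset is kept as-is, so the query never shrinks.
        for form in forms or [term]:
            if form not in expanded:
                expanded.append(form)
    return expanded


print(expand_query(["casa"], "it"))
print(expand_query(["casa", "nuova"], "it", "en"))
```

Because every synset containing the term contributes its word forms, an ambiguous term would pull in synonyms for all of its senses; this raises recall at the cost of precision, which is where the disambiguation application comes in.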

ItalWordNet

ItalWordNet (IWN) is a lexical-semantic database developed within two distinct research projects: EuroWordNet (EWN) and Sistema Integrato per il Trattamento Automatico del Linguaggio (SI-TAL), a national project devoted to creating large linguistic resources and software tools for processing written and spoken Italian.

The IWN database

a wordnet containing about 47,000 lemmas, 50,000 synsets and 130,000 semantic relations (the most important of the encoded relations are hypernymy/hyponymy, antonymy, meronymy, cause relations, role relations, etc.);

an Inter-Lingual Index (ILI), which is an unstructured version of WN 1.5; this module, used in EWN to link the wordnets of different languages, has been kept in IWN to make the resource usable in multilingual applications;

the Top Ontology (TO), a hierarchy of language-independent concepts reflecting fundamental semantic distinctions, built within EWN and partially modified in IWN to account for adjectives (not covered in EWN);

the TO consists of language-independent features that may (or may not) be lexicalized in various ways, or according to different patterns, in different languages [Rodriguez et al. 1998]; through the ILI, all concepts in the wordnet are directly or indirectly linked to the TO.
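How the three modules fit together can be sketched with plain dictionaries. Everything below is an assumption made for illustration: the identifiers, the lemmas, the Top Ontology labels and the two helper functions are hypothetical and do not reproduce the actual IWN or EWN data model.

```python
# Language-independent ILI records; only some carry Top Ontology labels.
# The labels are placeholders, not verified IWN content.
ILI = {
    "ili:building": {"top_concepts": ["Artifact", "Object"]},
    "ili:house": {"top_concepts": []},
}

# Two toy wordnets; each synset points to an ILI record and to its hypernym.
WORDNETS = {
    "it": {
        "it:edificio": {
            "lemmas": ["edificio", "costruzione"],
            "ili": "ili:building",
            "hypernym": None,
        },
        "it:casa": {
            "lemmas": ["casa", "abitazione"],
            "ili": "ili:house",
            "hypernym": "it:edificio",
        },
    },
    "en": {
        "en:building": {
            "lemmas": ["building", "edifice"],
            "ili": "ili:building",
            "hypernym": None,
        },
        "en:house": {
            "lemmas": ["house"],
            "ili": "ili:house",
            "hypernym": "en:building",
        },
    },
}


def equivalents(lang, synset_id, other_lang):
    """Lemmas of the synsets in other_lang that share the same ILI record."""
    ili = WORDNETS[lang][synset_id]["ili"]
    return [
        lemma
        for synset in WORDNETS[other_lang].values()
        if synset["ili"] == ili
        for lemma in synset["lemmas"]
    ]


def top_concepts(lang, synset_id):
    """Top Ontology labels reached directly, or indirectly via hypernyms."""
    current = synset_id
    while current is not None:
        synset = WORDNETS[lang][current]
        labels = ILI[synset["ili"]]["top_concepts"]
        if labels:
            return labels
        current = synset["hypernym"]
    return []


print(equivalents("it", "it:casa", "en"))
print(top_concepts("it", "it:casa"))
```

In this sketch the wordnets never reference each other: the ILI is the only meeting point, and a synset whose ILI record has no labels inherits them from the nearest hypernym that has some, which is one way to read "directly or indirectly linked to the TO".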


Mangiare (v.), "to eat"

FrameNet

The Berkeley FrameNet project is creating an on-line lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results.

Database

The FrameNet lexical database currently contains more than 11,600 lexical units (defined below), more than 6,800 of which are fully annotated, in more than 960 semantic frames, exemplified in more than 150,000 annotated sentences.


A multilingual perspective

Aligning wordnets

[Figure: aligning the Dutch wordnet and the English wordnet. Dutch labels: muziekinstrument, orgel, hammond orgel, orgaan. English labels: object, natural object, artifact, instrument, musical instrument, organ (several senses), hammond organ.]

General criteria

Maximize the overlap with wordnets of other languages

Maximize semantic consistency within and across wordnets

Focus manual effort where it is needed

Make the most of automatic techniques

Top-down methodology

Develop a core wordnet (5,000 synsets):

  all the semantic building blocks, or foundation, to define the relations for all other, more specific synsets, e.g. building -> house, church, school

  provide a formal and explicit semantics

Validate the core wordnet:

  does it include the most frequent words?

  are semantic constraints violated?

Extend the core wordnet (5,000 synsets or more):

  automatic techniques for more specific concepts with high-confidence results

  add other levels of hyponymy

  add specific domains

  add 'easy' derivational words

  add 'easy' translation equivalence

Validate the complete wordnet
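The two validation questions (are the most frequent words included, are semantic constraints violated) lend themselves to simple automatic checks. The sketch below is only one possible reading: the toy wordnet, the frequency list and both function names are assumptions, and the only constraint checked is that every hypernym chain ends at a root without unknown parents or cycles.

```python
# Toy core wordnet: lemma -> hypernym lemma (None marks the root).
CORE = {
    "entity": None,
    "object": "entity",
    "building": "object",
    "house": "building",
    "church": "building",
}

# Hypothetical list of frequent words, e.g. taken from a corpus.
FREQUENT_WORDS = ["house", "school", "church", "object"]


def missing_frequent_words(core, frequent):
    """Frequent words that the core wordnet does not cover yet."""
    return [word for word in frequent if word not in core]


def hypernym_violations(core):
    """Entries whose hypernym chain is broken: unknown parent or a cycle."""
    problems = []
    for word in core:
        seen = {word}
        parent = core[word]
        while parent is not None:
            if parent not in core:
                problems.append((word, "unknown hypernym: " + parent))
                break
            if parent in seen:
                problems.append((word, "cycle through: " + parent))
                break
            seen.add(parent)
            parent = core[parent]
    return problems


print(missing_frequent_words(CORE, FREQUENT_WORDS))
print(hypernym_violations(CORE))
```

The same two checks can be rerun unchanged after the extension phase, which corresponds to the final "validate the complete wordnet" step.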

Page 17: Semantica e lessico - Alphabit.net · et.pdf) altre da Piek Vossen 2 Wordnets ... all’utente di inferire la semantica della word form in esame purché conosca la semantica di almeno

12032010

17

Mangiare (v)

I Chiari Linguistica computazionale - aa 20092010

33

FrameNet34

I Chiari Linguistica computazionale - aa 20092010

12032010

18

Framenet

I Chiari Linguistica computazionale - aa 20092010

35

The Berkeley FrameNet project is creating an online lexical resource for English, based on frame semantics and supported by corpus evidence. The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses, through computer-assisted annotation of example sentences and automatic tabulation and display of the annotation results.

Database

The FrameNet lexical database currently contains more than 11,600 lexical units (defined below), more than 6,800 of which are fully annotated, in more than 960 semantic frames, exemplified in more than 150,000 annotated sentences.
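To make the notions of lexical unit, frame and annotated sentence more concrete, here is a minimal Python sketch. It assumes that a lexical unit pairs a lemma with one frame and that each annotated sentence labels some spans with role names. The class names, the example sentences and the role labels are illustrative assumptions, not FrameNet's actual data format.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedSentence:
    text: str
    # Maps a role (frame element) name to the span of text that fills it.
    roles: dict


@dataclass
class LexicalUnit:
    lemma: str
    pos: str
    frame: str
    sentences: list = field(default_factory=list)

    def valence_patterns(self):
        """Tabulate which role combinations occur in the annotated sentences."""
        counts = {}
        for sentence in self.sentences:
            pattern = tuple(sorted(sentence.roles))
            counts[pattern] = counts.get(pattern, 0) + 1
        return counts


# Toy lexical unit with two hand-made annotations (illustrative only).
eat = LexicalUnit("eat", "v", "Ingestion")
eat.sentences.append(AnnotatedSentence(
    "The child ate an apple",
    {"Ingestor": "The child", "Ingestibles": "an apple"},
))
eat.sentences.append(AnnotatedSentence(
    "We ate early",
    {"Ingestor": "We", "Time": "early"},
))

print(eat.valence_patterns())
```

The tabulation step is a simplified stand-in for the automatic tabulation of annotation results that the project description mentions.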


A multilingual perspective

Aligning wordnets

[Figure: a fragment of the Dutch wordnet (muziekinstrument, orgel, hammond orgel, orgaan) aligned with a fragment of the English wordnet (object, natural object, artifact object, instrument, musical instrument, organ, hammond organ).]
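One way to picture the alignment is to let synsets of both languages point to records of a shared index. The Python sketch below assumes that Dutch "orgel" is the musical instrument and "orgaan" the body part, while English "organ" covers both senses. The index identifiers and the sense labels after "#" are hypothetical.

```python
# Hypothetical index identifiers; real inter-lingual index records differ.
ENGLISH = {
    "organ#instrument": "ili:1",
    "organ#body-part": "ili:2",
    "hammond organ": "ili:3",
    "musical instrument": "ili:4",
}
DUTCH = {
    "orgel": "ili:1",
    "orgaan": "ili:2",
    "hammond orgel": "ili:3",
    "muziekinstrument": "ili:4",
}


def align(source, target):
    """Pair synsets of two wordnets that point to the same index record."""
    by_index = {}
    for synset, index_id in target.items():
        by_index.setdefault(index_id, []).append(synset)
    return {
        synset: by_index.get(index_id, [])
        for synset, index_id in source.items()
    }


for dutch, english in align(DUTCH, ENGLISH).items():
    print(dutch, "->", english)
```

Going through an index rather than linking language pairs directly means each wordnet only has to be mapped once, whatever the number of languages.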


General criteria

Maximize the overlap with the wordnets of other languages

Maximize semantic consistency within and across wordnets

Focus manual effort where it is needed

Make maximal use of automatic techniques

Top-down methodology

1. Develop a core wordnet (5,000 synsets)
   - all the semantic building blocks, or foundation, needed to define the relations for all other, more specific synsets, e.g. building -> house, church, school
   - provide a formal and explicit semantics

2. Validate the core wordnet
   - does it include the most frequent words?
   - are semantic constraints violated?

3. Extend the core wordnet (5,000 synsets or more)
   - automatic techniques for more specific concepts with high-confidence results
   - add other levels of hyponymy
   - add specific domains
   - add 'easy' derivational words
   - add 'easy' translation equivalence

4. Validate the complete wordnet
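The two validation questions for the core wordnet can be sketched as simple checks. This is a minimal Python illustration on toy data. It assumes a single hypernym per synset and uses the absence of hypernym loops as one example of a semantic constraint; the function names and the data are hypothetical.

```python
def missing_frequent_words(synsets, frequent_words):
    """Return the frequent words that no synset in the core wordnet contains."""
    covered = {word for words in synsets.values() for word in words}
    return [word for word in frequent_words if word not in covered]


def has_hypernym_cycle(hypernym):
    """Check one possible semantic constraint: hypernym chains must not loop."""
    for start in hypernym:
        seen = {start}
        node = hypernym.get(start)
        while node is not None:
            if node in seen:
                return True
            seen.add(node)
            node = hypernym.get(node)
    return False


# Toy data, illustrative only.
synsets = {"s1": ["building", "edifice"], "s2": ["house"], "s3": ["church"]}
hypernym = {"s2": "s1", "s3": "s1"}

print(missing_frequent_words(synsets, ["house", "school"]))
print(has_hypernym_cycle(hypernym))
print(has_hypernym_cycle({"a": "b", "b": "a"}))
```

A real validation step would draw the frequent words from a corpus frequency list and would test further constraints, but the shape of the checks stays the same.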


Developing a core wordnet

Define a set of concepts (so-called Base Concepts) that play an important role in wordnets:
   - high position in the hierarchy and high connectivity
   - represented as English WordNet synsets
   - Common Base Concepts: shared by various wordnets in different languages
   - Local Base Concepts: not shared
   - EuroWordNet: 1,024 synsets, shared by 2 or more languages
   - BalkaNet: 5,000 synsets (including the 1,024)

Common semantic framework for all Base Concepts in the form of a Top-Ontology

Manually translate all Base Concepts (English WordNet synsets) to synsets in the local languages (this was applied for 13 wordnets)

Manually build and verify the hypernym relations for the Base Concepts

All 13 wordnets are developed from a similar semantic core closely related to the English WordNet
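The distinction between common and local Base Concepts can be sketched as follows. The Python example assumes that each language team has selected its own set of important concepts, expressed as English WordNet synset labels, and treats a concept as common when at least two languages selected it. The language codes and the selections are hypothetical toy data.

```python
from collections import Counter


def split_base_concepts(selections, min_languages=2):
    """Split selected synsets into common (shared) and local base concepts."""
    counts = Counter(
        synset for chosen in selections.values() for synset in chosen
    )
    common = {synset for synset, n in counts.items() if n >= min_languages}
    local = {lang: set(chosen) - common for lang, chosen in selections.items()}
    return common, local


# Hypothetical selections, expressed as English WordNet synset labels.
selections = {
    "nl": {"object", "building", "instrument"},
    "it": {"object", "building", "food"},
    "es": {"object", "vehicle"},
}

common, local = split_base_concepts(selections)
print(sorted(common))
print({lang: sorted(items) for lang, items in local.items()})
```

The threshold of two languages follows the "shared by 2 or more languages" criterion stated for EuroWordNet; everything below it stays language-specific.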

[Figure: top-down construction of the wordnets. Labels: 63 TCs; 1024 CBCs; Top-Ontology; Inter-Lingual-Index; WordNet 1.5 synsets; and, for each wordnet, hyperonyms, CBC representatives, Local BCs, WMs related via non-hyponymy, first-level hyponyms, remaining hyponyms.]

Top-down methodology

[Figure: top-down methodology, English Wordnet and Arabic Wordnet. Labels: SUMO Ontology; WordNet synsets; EuroWordNet and BalkaNet Base Concepts (5000 synsets); 1000 synsets; SBC, CBC, ABC; hypernyms; next-level hyponyms; more hyponyms; named entities; easy translations; WordNet Domains (domain "chemics"); English-Arabic lexicon; Arabic word frequency; Arabic roots and derivation rules.]
