Wordnet, EuroWordNet, Global Wordnet

Piek Vossen [email protected]

http://www.globalwordnet.org

Overview

Princeton WordNet (1980 - ongoing)

EuroWordNet (1996 - 1999)

The database design

The general building strategy

Towards a universal index of meaning

Global WordNet Association (2001 - ongoing)

Other wordnets

BalkaNet (2001 - 2004)

IndoWordnet (2002 - ongoing)

Meaning (2002 - 2005)

WordNet1.5

• Developed at Princeton by George Miller and his team as a model of the mental lexicon.

• Semantic network in which concepts are defined in terms of relations to other concepts.

• Structure: organized around the notion of synsets (sets of synonymous words), with basic semantic relations between these synsets.

• Initially no glosses.

• Main revision after tagging the Brown corpus with word meanings: SemCor.

http://www.cogsci.princeton.edu/~wn/w3wn.html

Structure of WordNet1.5

[Figure: fragment of WordNet1.5 with two relation types.

hyperonym chain: {cruiser; squad car; patrol car; police car; prowl car} and {cab; taxi; hack; taxicab} -> {car; auto; automobile; machine; motorcar} -> {motor vehicle; automotive vehicle} -> {vehicle} -> {conveyance; transport}

meronyms of {car; auto; automobile; machine; motorcar}: {bumper}, {car door}, {car window}, {car mirror}

meronyms of {car door}: {hinge; flexible joint}, {doorlock}, {armrest}]
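The fragment above can be read as a small labelled graph. The following is a minimal sketch of that reading, not WordNet's own storage format: synsets are abbreviated to one member, the edge directions are an assumed interpretation of the figure, and the helper names (`targets`, `hyperonym_chain`, `inherited_parts`) are hypothetical.

```python
# Relation triples (source synset, relation, target synset). Synsets are
# abbreviated to one member; edge directions are an assumed reading of the figure.
TRIPLES = [
    ("cruiser", "hyperonym", "car"),
    ("cab", "hyperonym", "car"),
    ("car", "hyperonym", "motor vehicle"),
    ("motor vehicle", "hyperonym", "vehicle"),
    ("vehicle", "hyperonym", "conveyance"),
    ("car", "meronym", "bumper"),
    ("car", "meronym", "car door"),
    ("car", "meronym", "car window"),
    ("car", "meronym", "car mirror"),
    ("car door", "meronym", "hinge"),
    ("car door", "meronym", "doorlock"),
    ("car door", "meronym", "armrest"),
]


def targets(synset, relation):
    """Return the synsets directly linked to `synset` by `relation`."""
    return [t for s, r, t in TRIPLES if s == synset and r == relation]


def hyperonym_chain(synset):
    """Follow hyperonym links upwards until no hyperonym is left."""
    chain = []
    parents = targets(synset, "hyperonym")
    while parents:
        chain.append(parents[0])
        parents = targets(parents[0], "hyperonym")
    return chain


def inherited_parts(synset):
    """Collect direct meronyms of the synset and of all its hyperonyms."""
    parts = []
    for node in [synset] + hyperonym_chain(synset):
        parts.extend(targets(node, "meronym"))
    return parts


print(hyperonym_chain("cab"))
print(inherited_parts("cab"))
```

Inheriting parts along the hyperonym chain is one inference such a network supports; whether it is always valid is a modelling decision that the slides do not settle.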

EuroWordNet

• The development of a multilingual database with wordnets for several European languages

• Funded by the European Commission, DG XIII, Luxembourg, as projects LE2-4003 and LE4-8328

• March 1996 - September 1999

• 2.5 million euro

• URL: http://www.hum.uva.nl/~ewn

Objectives of EuroWordNet

Languages covered:

• EuroWordNet-1 (LE2-4003): English, Dutch, Spanish, Italian

• EuroWordNet-2 (LE4-8328): German, French, Czech, Estonian

Size of vocabulary:

• EuroWordNet-1: 30,000 concepts - 50,000 word meanings

• EuroWordNet-2: 15,000 concepts - 25,000 word meanings

Type of vocabulary:

• the most frequent words of the languages

• all concepts needed to relate more specific concepts

Consortium

Organization                                    Country  Task
University of Amsterdam                         NL       Project Coordinator & Build the Dutch wordnet
Istituto Di Linguistica Computazionale, Pisa    IT       Build the Italian wordnet
Fundacion Universidad Empresa                   ES       Build the Spanish wordnet
Université d'Avignon and Memodata at Avignon    FR       Build the French wordnet
Universität Tübingen                            DE       Build the German wordnet
University of Masaryk at Brno                   CZ       Build the Czech wordnet
University of Tartu, Estonia                    EE       Build the Estonian wordnet
University of Sheffield                         GB       Adapt the English wordnet
Novell Belgium NV                               BE       User; Build the common database
Xerox Research Centre, Meylan                   FR       User
Bertin & Cie, Plaisir, Paris                    FR       User

The basic principles of EuroWordNet

• the structure of the Princeton WordNet

• the design of the EuroWordNet database

• wordnets as language-specific structures

• the language-internal relations

• the multilingual relations

Specific features of EuroWordNet

• it contains semantic lexicons for languages other than English.

• each wordnet reflects the relations as a language-internal system, maintaining cultural and linguistic differences in the wordnets.

• it contains multilingual relations from each wordnet to English meanings, which makes it possible to compare the wordnets, tracking down inconsistencies and cross-linguistic differences.

• each wordnet is linked to a language-independent top-ontology and to domain labels.

Autonomous & Language-Specific

[Figure: two hierarchies side by side.

Dutch Wordnet: voorwerp {object} directly above lepel {spoon}, werktuig {tool}, tas {bag}, bak {box}, blok {block}, lichaam {body}.

WordNet1.5: object; natural object (an object occurring naturally); artifact, artefact (a man-made object); instrumentality; block; body; container; device; implement; bag; spoon; box; tool; instrument.]

Differences in structure

• Artificial Classes versus Lexicalized Classes: instrumentality; natural object

• Lexicalization differences of classes: container and artifact (object) are not lexicalized in Dutch

• What is the purpose of different hierarchies?

• Should we include all lexicalized classes from all (8) languages?

Linguistic versus Conceptual Ontologies

Conceptual ontology: A particular level or structuring may be required to achieve better control or performance, or a more compact and coherent structure.

• introduce artificial levels for concepts which are not lexicalized in a language (e.g. instrumentality, hand tool),

• neglect levels which are lexicalized but not relevant for the purpose of the ontology (e.g. tableware, silverware, merchandise).

What properties can we infer for spoons?

spoon -> container; artifact; hand tool; object; made of metal or plastic; for eating, pouring or cooking

Linguistic versus Conceptual Ontologies

Linguistic ontology: Exactly reflects the relations between all the lexicalized words and expressions in a language. It therefore captures valuable information about the lexical capacity of languages: what is the available fund of words and expressions in a language.

What words can be used to name spoons?

spoon -> object, tableware, silverware, merchandise, cutlery

Separate Wordnets and Ontologies

[Figure: a language-neutral ontology layer above language-specific wordnets.

Language-Neutral Ontology:

Reference Ontology Classes: BOX ContainerProduct; SolidTangibleThing

EuroWordNet Top-Ontology: Form: Cubic; Function: Contain; Origin: Artifact; Composition: Whole

Language-Specific Wordnets:

WordNet1.5: object, container, box

Dutch Wordnet: voorwerp, doos]

Wordnets versus ontologies

Wordnets: autonomous language-specific lexicalization patterns in a relational network. Usage: to predict substitution in text for information retrieval, text generation, machine translation, word-sense disambiguation.

Ontologies: data structure with formally defined concepts. Usage: making semantic inferences.

Wordnets as Linguistic Ontologies

Classical Substitution Principle:

Any word that is used to refer to something can be replaced by its synonyms, hyperonyms and hyponyms:

horse -> stallion, mare, pony, mammal, animal, being.

It cannot be referred to by co-hyponyms and co-hyponyms of its hyperonyms:

horse -/-> cat, dog, camel, fish, plant, person, object.

Conceptual Distance Measurement:

The number of hierarchical nodes between words is a measurement of closeness, where the level and the local density of nodes are additional factors.
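Both ideas can be tried on a toy hierarchy. The sketch below is illustrative only: the `HYPERONYM` table is an assumed arrangement of the example words, synonyms are left out, and `path_length` counts plain hyperonym edges without the level and density factors mentioned above.

```python
# Toy hyperonym table (word -> its hyperonym); an assumed arrangement of the
# example words, not data taken from any wordnet.
HYPERONYM = {
    "stallion": "horse", "mare": "horse", "pony": "horse",
    "horse": "mammal", "cat": "mammal", "dog": "mammal", "camel": "mammal",
    "mammal": "animal", "fish": "animal",
    "animal": "being", "plant": "being", "person": "being",
}


def ancestors(word):
    """All hyperonyms of `word`, nearest first."""
    chain = []
    while word in HYPERONYM:
        word = HYPERONYM[word]
        chain.append(word)
    return chain


def descendants(word):
    """All direct and indirect hyponyms of `word`."""
    return {w for w in HYPERONYM if word in ancestors(w)}


def can_substitute(word, candidate):
    """Substitution principle: only hyperonyms and hyponyms may replace a word."""
    return candidate in ancestors(word) or candidate in descendants(word)


def path_length(a, b):
    """Hyperonym edges on the path through the nearest shared node, or None."""
    up_a = [a] + ancestors(a)
    up_b = [b] + ancestors(b)
    for steps_a, node in enumerate(up_a):
        if node in up_b:
            return steps_a + up_b.index(node)
    return None


print(can_substitute("horse", "mare"), can_substitute("horse", "cat"))
print(path_length("horse", "cat"), path_length("horse", "fish"))
```

A fuller measure would weight each edge by its depth in the hierarchy and by how many siblings surround it; the edge count is the simplest baseline.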

Linguistic Principles for deriving relations

1. Substitution tests (Cruse 1986):

1 a. It is a fiddle therefore it is a violin.
  b. It is a violin therefore it is a fiddle.

2 a. It is a dog therefore it is an animal.
  b. *It is an animal therefore it is a dog.

3 a. to kill (/a murder) causes to die (/death)
     to kill (/a murder) has to die (/death) as a consequence
  b. *to die / death causes to kill
     *to die / death has to kill as a consequence

Linguistic Principles for deriving relations

2. Principle of Economy (Dik 1978):

If a word W1 (animal) is the hyperonym of W2 (mammal) and W2 is the hyperonym of W3 (dog) then W3 (dog) should not be linked to W1 (animal) but to W2 (mammal).

3. Principle of Compatibility

If a word W1 is related to W2 via relation R1, W1 and W2 cannot be related via relation Rn, where Rn is defined as a distinct relation from R1.
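The Principle of Economy lends itself to an automatic check. The following is a minimal sketch, assuming hyperonym links are given as (word, hyperonym) pairs; the function name `redundant_links` is hypothetical and not part of any EuroWordNet tool.

```python
def redundant_links(links):
    """Return direct hyperonym links already implied by a longer chain."""
    direct = {}
    for word, hyper in links:
        direct.setdefault(word, set()).add(hyper)

    def reachable(start):
        # Every node that can be reached from `start` by following hyperonym links.
        seen, stack = set(), list(direct.get(start, ()))
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(direct.get(node, ()))
        return seen

    redundant = []
    for word, hyper in links:
        # A link is redundant if another direct hyperonym of the word leads to it.
        for other in direct[word] - {hyper}:
            if hyper in reachable(other):
                redundant.append((word, hyper))
                break
    return redundant


links = [("dog", "mammal"), ("mammal", "animal"), ("dog", "animal")]
print(redundant_links(links))
```

Here the dog-to-animal link is flagged because dog already reaches animal through mammal.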

Architecture of the EuroWordNet Data Base

[Figure: four language-specific wordnets, each with a Lexical Items Table, arranged around the Inter-Lingual-Index.

Dutch: bewegen, gaan, rijden, berijden

Italian: muoversi, andare, guidare, cavalcare

English: move, go, drive, ride

Spanish: mover, transitar, conducir, cabalgar, jinetear

Inter-Lingual-Index: ILI-record {drive}

Ontology: 2OrderEntity; Location; Dynamic

Domains: Traffic; Air; Road

Link types: I = Language Independent link; II = Link from Language Specific to Inter-Lingual-Index; III = Language Dependent Link]

The mono-lingual design of EuroWordNet

Language Internal Relations

WN 1.5 starting point

The ‘synset’ as a weak notion of synonymy: “two expressions are synonymous in a linguistic context C if the substitution of one for the other in C does not alter the truth value.” (Miller et al. 1993)

Relations between synsets:

Relation     POS-combination          Example
ANTONYMY     adjective-to-adjective
             verb-to-verb             open / close
HYPONYMY     noun-to-noun             car / vehicle
             verb-to-verb             walk / move
MERONYMY     noun-to-noun             head / nose
ENTAILMENT   verb-to-verb             buy / pay
CAUSE        verb-to-verb             kill / die

Differences EuroWordNet/WordNet1.5

• Added Features to relations

• Cross-Part-Of-Speech relations

• New relations to differentiate shallow hierarchies

• New interpretations of relations

EWN Relationship Labels

Disjunction/Conjunction of multiple relations of the same type

WordNet1.5

door 1 -- (a swinging or sliding barrier that will close the entrance to a room or building; "he knocked on the door"; "he slammed the door as he left")
PART OF: doorway, door, entree, entry, portal, room access

door 6 -- (a swinging or sliding barrier that will close off access into a car; "she forgot to lock the doors of her car")
PART OF: car, auto, automobile, machine, motorcar.

EWN Relationship Labels

{airplane}
HAS_MERO_PART: conj1 {door}
HAS_MERO_PART: conj2 disj1 {jet engine}
HAS_MERO_PART: conj2 disj2 {propeller}

{door}
HAS_HOLO_PART: disj1 {car}
HAS_HOLO_PART: disj2 {room}
HAS_HOLO_PART: disj3 {entrance}

{dog}
HAS_HYPERONYM: conj1 {mammal}
HAS_HYPERONYM: conj2 {pet}

{albino}
HAS_HYPERONYM: disj1 {plant}
HAS_HYPERONYM: disj2 {animal}

Default Interpretation: non-exclusive disjunction
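One possible reading of the conj/disj labels is that every conjunct must be satisfied, while the targets sharing a conjunct are alternatives. The sketch below encodes that reading for the {airplane} example; the tuple layout and the function names are assumptions made for illustration, not the EuroWordNet database format.

```python
from collections import defaultdict

# (target synset, conjunction label, disjunction label) for {airplane} HAS_MERO_PART.
AIRPLANE_PARTS = [
    ("door", "conj1", None),
    ("jet engine", "conj2", "disj1"),
    ("propeller", "conj2", "disj2"),
]


def group_by_conjunct(relations):
    """Group alternative targets under the conjunct they belong to."""
    groups = defaultdict(list)
    for target, conj, _disj in relations:
        groups[conj].append(target)
    return dict(groups)


def satisfies(relations, present):
    """True if each conjunct has at least one of its alternatives present.

    Disjunction is non-exclusive: having several alternatives is also fine.
    """
    groups = group_by_conjunct(relations)
    return all(any(t in present for t in targets) for targets in groups.values())


print(group_by_conjunct(AIRPLANE_PARTS))
print(satisfies(AIRPLANE_PARTS, {"door", "propeller"}))
print(satisfies(AIRPLANE_PARTS, {"door"}))
```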

EWN Relationship Labels

Disjunction/Conjunction of multiple relations of the same type

{dog}
HAS_HYPONYM: dis1 {poodle}
HAS_HYPONYM: dis1 {labrador}
HAS_HYPONYM: {sheep dog} (Orthogonal)
HAS_HYPONYM: {watch dog} (Orthogonal)

Default Interpretation: non-exclusive disjunction

EWN Relationship Labels

Factive/Non-factive CAUSES (Lyons 1977)

factive (default interpretation):

“to kill causes to die”: {kill} CAUSES {die}

non-factive: E1 probably or likely causes event E2, or E1 is intended to cause some event E2:

“to search may cause to find”: {search} CAUSES {find} non-factive

EWN Relationship Labels

Reversed

In the database every relation must have a reverse counterpart, but there is a difference between relations which are explicitly coded as reverse and automatically reversed relations:

{finger} HAS_HOLONYM {hand}
{hand} HAS_MERONYM {finger}
{paper-clip} HAS_MER_MADE_OF {metal}
{metal} HAS_HOL_MADE_OF {paper-clip} reversed

Negation

{monkey} HAS_MERO_PART {tail}
{ape} HAS_MERO_PART {tail} not
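The distinction between explicitly coded and automatically reversed relations can be sketched as follows. This is an illustration under assumptions: the 4-tuple layout, the `REVERSE` table and the function `add_reverses` are hypothetical, and only the relation names that appear in the examples above are covered.

```python
# Reverse counterpart of each relation used in the examples above.
REVERSE = {
    "HAS_HOLONYM": "HAS_MERONYM",
    "HAS_MERONYM": "HAS_HOLONYM",
    "HAS_MER_MADE_OF": "HAS_HOL_MADE_OF",
    "HAS_HOL_MADE_OF": "HAS_MER_MADE_OF",
}


def add_reverses(relations):
    """Add a counterpart labelled 'reversed' for every relation that lacks one.

    relations: list of (source, relation, target, labels), labels a frozenset.
    """
    explicit = {(s, r, t) for s, r, t, _ in relations}
    result = list(relations)
    for s, r, t, _ in relations:
        back = (t, REVERSE[r], s)
        if back not in explicit:
            # Generated automatically, so it carries the label 'reversed'.
            result.append(back + (frozenset({"reversed"}),))
    return result


coded = [
    ("finger", "HAS_HOLONYM", "hand", frozenset()),
    ("hand", "HAS_MERONYM", "finger", frozenset()),
    ("paper-clip", "HAS_MER_MADE_OF", "metal", frozenset()),
]
for row in add_reverses(coded):
    print(row)
```

The finger/hand pair is coded in both directions, so nothing is added for it; only the metal/paper-clip counterpart is generated and labelled.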

Cross-Part-Of-Speech relations

WordNet1.5: nouns and verbs are not interrelated by basic semantic relations such as hyponymy and synonymy:

adornment 2: change of state -- (the act of changing something)
adorn 1: change, alter -- (cause to change; make different)

EuroWordNet: words of different parts of speech can be inter-linked with explicit xpos-synonymy, xpos-antonymy and xpos-hyponymy relations:

{adorn V} XPOS_NEAR_SYNONYM {adornment N}

Cross-Part-Of-Speech relations

The advantages of such explicit cross-part-of-speech relations are:

• similar words with different parts of speech are grouped together.

• the same information can be coded in an NP or in a sentence. By unifying higher-order nouns and verbs in the same ontology it will be possible to match expressions with very different syntactic structures but comparable content.

• by merging verbs and abstract nouns we can more easily link mismatches across languages that involve a part-of-speech shift. Dutch nouns such as “afsluiting”, “gehuil” are translated with the English verbs “close” and “cry”, respectively.

Entailment in WordNet

WordNet1.5: Entailment indicates the direction of the implication or entailment:

a. + Temporal Inclusion (the two situations partially or totally overlap)
   a.1 co-extensiveness (e.g., to limp/to walk): hyponymy/troponymy
   a.2 proper inclusion (e.g., to snore/to sleep): entailment

b. - Temporal Exclusion (the two situations are temporally disjoint)
   b.1 backward presupposition (e.g., to succeed/to try): entailment
   b.2 cause (e.g., to give/to have)

Subevents in EuroWordNet

Direction of the entailment is expressed by the labels factive and reversed:

{to succeed} is_caused_by {to try} factive
{to try} causes {to succeed} non-factive

Proper inclusion is described by the has_subevent/is_subevent_of relation in combination with the label reversed:

{to snore} is_subevent_of {to sleep}
{to sleep} has_subevent {to snore} reversed
{to buy} has_subevent {to pay}
{to pay} is_subevent_of {to buy} reversed

The interpretation of the CAUSE relation

WordNet1.5: The causal relation only holds between verbs and it should only apply to temporally disjoint situations.

EuroWordNet: the causal relation will also be applied across different parts of speech:

{to kill} v causes {death} n
{death} n is_caused_by {to kill} v reversed
{to kill} v causes {dead} a
{dead} a is_caused_by {to kill} v reversed
{murder} n causes {death} n
{death} n is_caused_by {murder} n reversed

The interpretation of the CAUSE relation

Various temporal relationships between the (dynamic/non-dynamic) situations may hold:

• Temporally disjoint: there is no time point when dS1 takes place and also S2 (which is caused by dS1) (e.g. to shoot/to hit);

• Temporally overlapping: there is at least one time point when both dS1 and S2 take place, and there is at least one time point when dS1 takes place and S2 (which is caused by dS1) does not yet take place (e.g. to teach/to learn);

• Temporally co-extensive: whenever dS1 takes place also S2 (which is caused by dS1) takes place and there is no time point when dS1 takes place and S2 does not take place, and vice versa (e.g. to feed/to eat).

Role relations

In the case of many verbs and nouns the most salient relation is not the hyperonym but the relation between the event and the involved participants. These relations are expressed as follows:

{hammer} ROLE_INSTRUMENT {to hammer}
{to hammer} INVOLVED_INSTRUMENT {hammer} reversed
{school} ROLE_LOCATION {to teach}
{to teach} INVOLVED_LOCATION {school} reversed

These relations are typically used when other relations, mainly hyponymy, do not clarify the position of the concept in the network, but the word is still closely related to another word.

Co_Role relations

guitar player
HAS_HYPERONYM player
CO_AGENT_INSTRUMENT guitar

player
HAS_HYPERONYM person
ROLE_AGENT to play music
CO_AGENT_INSTRUMENT musical instrument

to play music
HAS_HYPERONYM to make
ROLE_INSTRUMENT musical instrument

guitar
HAS_HYPERONYM musical instrument
CO_INSTRUMENT_AGENT guitar player

ice saw
HAS_HYPERONYM saw
CO_INSTRUMENT_PATIENT ice

saw
HAS_HYPERONYM saw
ROLE_INSTRUMENT to saw

ice
CO_PATIENT_INSTRUMENT ice saw REVERSED

Co_Role relations

Examples of the other relations are:

criminal CO_AGENT_PATIENT victim
novel writer/poet CO_AGENT_RESULT novel/poem
dough CO_PATIENT_RESULT pastry/bread
photographic camera CO_INSTRUMENT_RESULT photo

BE_IN_STATE and STATE_OF

Example: the poor are the ones to whom the state poor applies

Effect:
poor N HAS_HYPERONYM person N
poor N BE_IN_STATE poor A
poor A STATE_OF poor N reversed

IN_MANNER and MANNER_OF

Example: to slurp is to eat in a noisy manner

Effect:
slurp V HAS_HYPERONYM eat V
slurp V IN_MANNER noisily Adverb
noisily Adverb MANNER_OF slurp V reversed

Overview of the Language Internal relations in EuroWordnet

Same Part of Speech relations:

NEAR_SYNONYMY              apparatus - machine
HYPERONYMY/HYPONYMY        car - vehicle
ANTONYMY                   open - close
HOLONYMY/MERONYMY          head - nose

Cross-Part-of-Speech relations:

XPOS_NEAR_SYNONYMY         dead - death; to adorn - adornment
XPOS_HYPERONYMY/HYPONYMY   to love - emotion
XPOS_ANTONYMY              to live - dead
CAUSE                      die - death
SUBEVENT                   buy - pay; sleep - snore
ROLE/INVOLVED              write - pencil; hammer - hammer
STATE                      the poor - poor
MANNER                     to slurp - noisily
BELONG_TO_CLASS            Rome - city

Thematic networks

[Figure: a thematic network in the Dutch wordnet.

Nodes: behandelen (treat), zieke (sick person, patient), genezen (to get well), arts (doctor), scalpel, opereren (operate), persoon (person), wezen (being), organisme (organism), orgaan (organ), maag (stomach), maagaandoening (stomach disease), ziekte (disease).

Edge labels: Agent, Patient, Causes, Involves, Instrument, Part of.]

The multi-lingual design of EuroWordNet

The Multilingual Design

• Inter-Lingual-Index: unstructured fund of concepts to provide an efficient mapping across the languages;

• Index-records are mainly based on WordNet1.5 synsets and consist of synonyms, glosses and source references;

• Various types of complex equivalence relations are distinguished;

• Equivalence relations from synsets to index records: not on a word-to-word basis;

• Indirect matching of synsets linked to the same index items.
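Indirect matching through shared index records can be illustrated with a lookup table. The sketch below is hypothetical: the words come from the architecture figure, but the specific links to the ILI records "drive" and "ride" are assumed for illustration, and `equivalents` is not a function of the EuroWordNet database.

```python
# (language, synset) -> ILI record. Words come from the architecture figure;
# the individual links are assumptions made for this example.
EQ_SYNONYM = {
    ("nl", "rijden"): "drive",
    ("it", "guidare"): "drive",
    ("es", "conducir"): "drive",
    ("en", "drive"): "drive",
    ("nl", "berijden"): "ride",
    ("it", "cavalcare"): "ride",
    ("es", "cabalgar"): "ride",
    ("en", "ride"): "ride",
}


def equivalents(language, synset, target_language):
    """Find target-language synsets linked to the same ILI record."""
    record = EQ_SYNONYM.get((language, synset))
    if record is None:
        return []
    return sorted(
        word
        for (lang, word), ili in EQ_SYNONYM.items()
        if ili == record and lang == target_language
    )


print(equivalents("nl", "rijden", "it"))
print(equivalents("es", "cabalgar", "en"))
```

No language pair is linked directly: adding a new wordnet only requires links to the index, after which it can be matched against every other wordnet.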

EWN Interlingual Relations

• EQ_SYNONYM: there is a direct match between a synset and an ILI-record.

• EQ_NEAR_SYNONYM: a synset matches multiple ILI-records simultaneously.

• HAS_EQ_HYPERONYM: a synset is more specific than any available ILI-record.

• HAS_EQ_HYPONYM: a synset can only be linked to more specific ILI-records.

• other relations: CAUSES/IS_CAUSED_BY, EQ_SUBEVENT/EQ_ROLE, EQ_IS_STATE_OF/EQ_BE_IN_STATE

Equivalent Near Synonym

1. Multiple Targets

One sense for Dutch schoonmaken (to clean) which simultaneously matches with at least 4 senses of clean in WordNet1.5:

• {make clean by removing dirt, filth, or unwanted substances from}

• {remove unwanted substances from, such as feathers or pits, as of chickens or fruit}

• {remove in making clean; "Clean the spots off the rug"}

• {remove unwanted substances from - (as in chemistry)}

The Dutch synset schoonmaken will thus be linked with an eq_near_synonym relation to all these senses of clean.

Equivalent Near Synonym

2. Multiple Source meanings

Synsets inter-linked by a near_synonym relation can be linked to the same target ILI-record(s), either with an eq_synonym or an eq_near_synonym relation:

Dutch wordnet: toestel near_synonym apparaat

ILI-records: {machine}; {device}; {apparatus}; {tool}

Equivalent Hyponymy

has_eq_hyperonym

Typically used for gaps in WordNet1.5 or in English:

• genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which is a kind of gin made out of lemon skin,

• pragmatic, in the sense that the concept is known but is not expressed by a single lexicalized form in English, e.g.: Dutch hoofd only refers to human head and Dutch kop only refers to animal head, English uses head for both.

has_eq_hyponym

Used when WordNet1.5 only provides narrower terms. In this case there can only be a pragmatic difference, not a genuine cultural gap, e.g.: Spanish dedo can be used to refer to both finger and toe.

Complex mappings across languages

[Figure: index records linked to four wordnets.

Index records: { toe : part of foot }, { finger : part of hand }, { dedo , dito : finger or toe }, { head : part of body }, { hoofd : human head }, { kop : animal head }

GB-Net: toe, finger, head

NL-Net: hoofd, kop

IT-Net: dito

ES-Net: dedo

Legend: normal equivalence; eq_has_hyponym; eq_has_hyperonym]

The methodologies for building wordnets

Overall Building Process

[Figure: flow chart of the building process, with phases labelled Ia, Ib, Ic, II and III. Boxes in the chart:

• Machine Readable Dictionaries, Wordnets, Taxonomies, Corpora, loaded in local databases

• Specification of selection criteria

• Subset of word meanings

• Encoding of language internal and equivalence relations

• Wordnet fragment with links to WordNet1.5 in local database

• Load wordnet in the EuroWordNet Database

• Wordnet fragment in EuroWordNet database

• Comparing and restructuring the wordnet

• Improve and extend the wordnet fragments

• Adjust coverage, improve encoding

• Verification by users

• Demonstration in Information Retrieval

• Verification Report]

Main Methods

Expand approach: translate WordNet1.5 synsets to another language and take over the structure

• easier and more efficient method

• compatible structure with WordNet1.5

• structure is close to WordNet1.5 but also biased by it

Merge approach: create an independent wordnet in another language and align the separate hierarchies by generating the appropriate translations

• more complex and labour intensive

• different structure from WordNet1.5

• language-specific patterns can be maintained

Methods for extracting language-internal relations

• editors and database for manually encoding relations;

• comparison with WordNet1.5 structure;

• definition patterns in monolingual dictionaries;

• co-occurrences in corpora;

• morphology;

• bilingual dictionaries;

• lexical semantic substitution tests

Methods for extracting equivalence relations

• extract monosemous translations of English synsets, e.g. a Spanish word has only one translation to an English word which has only one sense, and vice versa;

• disambiguation of multiple ambivalent translations by measuring the conceptual distance between the senses of these translations in the WordNet1.5 hierarchy (Rigau and Aguirre, 95);

• disambiguation of ambivalent translations by measuring the conceptual distance directly in the WordNet1.5 hierarchy between alternative translations and the translations of the direct semantic context in the source wordnet;

• disambiguation of ambivalent translations by measuring the overlap in top-concepts inherited in the source wordnet and inherited for the different senses of translations in WordNet1.5.
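The first method, extracting monosemous translations, is simple enough to sketch. The code below is an illustration only: the bilingual entries and the sense counts are invented toy values, and `monosemous_links` is a hypothetical helper rather than a tool from the project.

```python
def monosemous_links(bilingual, sense_count):
    """Return (source word, English word) pairs that are safe to link.

    bilingual: dict mapping a source word to its set of English translations.
    sense_count: dict mapping an English word to its number of senses.
    A pair qualifies if the source word has one translation, that English
    word has one sense, and no other source word translates to it.
    """
    back = {}
    for src, translations in bilingual.items():
        for en in translations:
            back.setdefault(en, set()).add(src)

    links = []
    for src, translations in bilingual.items():
        if len(translations) != 1:
            continue
        (en,) = translations
        if sense_count.get(en) == 1 and back[en] == {src}:
            links.append((src, en))
    return sorted(links)


# Toy data: both the dictionary entries and the sense counts are invented.
bilingual = {
    "armario": {"wardrobe"},
    "cabalgar": {"ride"},
    "dedo": {"finger", "toe"},
}
sense_count = {"wardrobe": 1, "ride": 3, "finger": 2, "toe": 2}
print(monosemous_links(bilingual, sense_count))
```

Everything that fails these tests is left for the distance-based and overlap-based methods in the remaining bullets.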

Aligning wordnets

[Figure: aligning a Dutch hierarchy with WordNet1.5.

Dutch: muziekinstrument, orgel, hammond orgel

WordNet1.5: several candidate senses of organ (marked "?"), hammond organ, musical instrument, instrument, artifact, natural object, object]

Inheriting Semantic Features

hart 1 -> orgaan 1 (Living Part) -> deel 2 (Part) -> iets 1 LEAF

-----------------------------------------------------------------------------------------------------

heart 1 -> playing card 1 -> card 1 (Artifact Function Object) -> paper 6 (Artifact Solid) -> material 5 (Substance) -> matter 1 -> inanimate object 1 -> entity 1 LEAF

heart 2 -> disposition 2 (Dynamic Experience Mental) -> nature 1 -> trait 1 (Property) -> attribute 1 (Property) -> abstraction 1 LEAF

heart 3 -> bravery 1 -> spirit 1 -> character 1 -> trait 1 (Property) -> attribute 1 (Property) -> abstraction 1 LEAF

heart 4 -> internal organ 1 -> organ 4 (Living Part) -> body part 1 (Living Part) -> part 10 -> entity 1 LEAF
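The chains above can be used to pick the English sense whose inherited top-concepts overlap most with those of the Dutch word. The following is a minimal sketch of that comparison; the feature sets are read off the chains, while the scoring by plain set intersection and the name `best_sense` are assumptions.

```python
# Top-concepts inherited along each hyperonym chain shown above.
SOURCE = {"Living", "Part"}  # hart 1
CANDIDATES = {
    "heart 1": {"Artifact", "Function", "Object", "Solid", "Substance"},
    "heart 2": {"Dynamic", "Experience", "Mental", "Property"},
    "heart 3": {"Property"},
    "heart 4": {"Living", "Part"},
}


def best_sense(source_features, candidates):
    """Score each candidate sense by the number of shared top-concepts."""
    scores = {
        sense: len(source_features & features)
        for sense, features in candidates.items()
    }
    return max(scores, key=scores.get), scores


sense, scores = best_sense(SOURCE, CANDIDATES)
print(sense, scores)
```

With these sets only heart 4 shares any top-concept with hart 1, so it is selected as the equivalent.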

Reliability of Equivalence Relations

Spanish wordnet

Confidence (Variants)   Nouns   Verbs   Total
100% (Manual)            7819    8394   16213
>96%                      382       0     382
>94%                     2948       0    2948
>92%                     1364       0    1364
>85%                    23113       0   23113
>84%                     4156       0    4156
Total                   39782    8394   48176

Reliability of Equivalence Relations

Dutch wordnet

                 Nouns                                 Verbs
Matching Type    No of synsets  Perc.    Reliability   No of synsets  Perc.    Reliability
manual/ok        4138           17.00%   100%          3383           37.07%   100%
1 match          4846           19.91%   86%           763            8.36%    78%
2 matches        3059           12.57%   68%           652            7.15%    71%
3-9 matches      5408           22.22%   65%           2471           27.08%   49%
10+ matches      1864           7.66%    54%           980            10.74%   23%
0 matches        5022           20.64%   n.a.          876            9.60%    n.a.
Total            24337                                 9125

Conflicting Starting points

1. There should be a maximum of flexibility:

• the wordnets should be able to reflect language-specific relations and patterns

• the wordnets should be built relatively independently because each site has different starting points:
  • different tools, databases and resources (Machine Readable Dictionaries)
  • differences in the languages

2. The wordnets have to be compatible in terms of coverage and relations to be useful for multilingual information retrieval and translation tools, and to be able to compare the wordnets.

Measures to achieve maximal compatibility

• The results are loaded into a common Multilingual Database (Polaris):
  • consistency checks and types of incompatibility
  • specific comparison options to measure consistency and overlap in coverage

• User-guides for building wordnets in each language:
  • the steps to encode the relations for a word meaning
  • common tests and criteria for all the relations
  • overview of problems and solutions

• A set of common Base-Concepts which are shared by all the sites, having:
  • most relations and the most important positions in the wordnets
  • most meanings and badly defined

• Classification of the common Base Concepts in terms of a Top-Ontology of 63 basic Semantic Distinctions

• Top-Down Approach, where first the Base Concepts and their direct context are (manually) encoded and next the wordnets are (semi-automatically) extended top-down to include more specific concepts that depend on these Base Concepts.

Top-Ontology and Base Concepts

Top-Ontology with 63 higher-level concepts

Existing Ontologies:

• WordNet1.5 top-levels

• Aktions-Art models (Vendler, Verkuyl)

• Acquilex and Sift ontologies (EC-projects)

• Qualia-structure (Pustejovsky)

• Upper-Model, MikroKosmos, Cyc, Ad Hoc ANSI-Committee on ontologies

The ontology was adapted to represent the variety of concepts in the set of Common Base Concepts, across the 4 languages:

• homogeneous Base-Concept Clusters

• average size of Base Concept Cluster

• apply to both nouns and verbs

Set of 1024 common Base Concepts making up the core of the separate wordnets.

Base Concepts

Procedure:

• Each site determined the set of word meanings with most relations (up to 15% of all relations) and high positions in the hierarchy.

• This set was extended with all meanings used to define the first selection.

• The local selection was translated to WordNet1.5 equivalences: 4 lists of WordNet1.5 synsets (between 450 and 2000 synsets per selection).

• These sets of WordNet1.5 translations have been compared.

Concepts selected by all sites: 30 synsets (24 noun synsets, 6 verb synsets). Explanations:

• The individual selections are not representative enough.

• There are major differences in the way meanings are classified, which have an effect on the frequency of the relations.

• The translations of the selection to WordNet1.5 synsets are not reliable.

• The resources cover very different vocabularies.

Concepts selected by at least two sites: intersections of pairs

         NOUNS                        VERBS
         NL     ES    IT    GB/WN     NL    ES    IT    GB/WN
NL       1027   103   182   333       323   36    42    86
ES       103    523   45    284       36    128   18    43
IT       182    45    334   167       42    18    104   39
GB/WN    333    284   167   1296      86    43    39    236

Total Set of shared Base Concepts: Union of intersection pairs

                    Nouns   Verbs   Total
1stOrderEntities    491             491
2ndOrderEntities    272     228     500
3rdOrderEntities    33              33
Total               796     228     1024
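The shared set is the union of all pairwise intersections of the site selections, which is much larger than the intersection of all four. The sketch below shows both computations on toy data; the synset identifiers are invented and `shared_base_concepts` is a hypothetical name.

```python
from itertools import combinations


def shared_base_concepts(selections):
    """Union of pairwise intersections: concepts chosen by at least two sites."""
    shared = set()
    for (_, first), (_, second) in combinations(selections.items(), 2):
        shared |= first & second
    return shared


def selected_by_all(selections):
    """Concepts chosen by every site."""
    return set.intersection(*selections.values())


# Toy selections per site; the identifiers stand in for WordNet1.5 synsets.
selections = {
    "NL": {"object#1", "animal#1", "move#1", "tool#1"},
    "ES": {"object#1", "move#1", "plant#1"},
    "IT": {"object#1", "animal#1", "place#1"},
    "GB/WN": {"object#1", "tool#1", "plant#1"},
}
print(sorted(shared_base_concepts(selections)))
print(sorted(selected_by_all(selections)))
```

In the toy data five concepts are shared by at least two sites but only one by all four, which mirrors the gap between 1024 and 30 reported in the slides.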

Table 4: Number of Common BCs represented in the local wordnets

       Related to CBCs   Eq_synonym relations   Eq_near_synonym relations   CBCs without direct equivalent
AMS    992               725                    269                         97
FUE    1012              1009                   0                           15
PSA    878               759                    191                         9

Table 5: BC4 Gaps in at least two wordnets (10 synsets)

body covering#1
mental object#1; cognitive content#1; content#2
body substance#1
natural object#1
social control#1
place of business#1; business establishment#1
change of magnitude#1
plant organ#1
contractile organ#1
Plant part#1
psychological feature#1
spatial property#1; spatiality#1

Table 6: Local senses with complex equivalence relations to CBCs

                    NL   ES   IT
Eq_has_hyperonym    61   40   4
eq_has_hyponym      34   14   20
Eq_has_holonym      2    0
Eq_has_meronym      3    2
Eq_involved         3
Eq_is_caused_by     3
Eq_is_state_of      1

Example of complex relation

CBC: cause to feel unwell#1, Verb

Closest Dutch concept: {onwel#1}, Adjective (sick)

Equivalence relation: eq_is_caused_by

Adaptation of Base Concepts in EuroWordNet-2

• A similar selection of fundamental concepts has been made in EuroWordNet-2

• The selected concepts have been compared among German, French, Czech and Estonian and with the EuroWordNet-1 selection

• The EuroWordNet-1 set has been extended to 1310 Base Concepts

• A distinction has been made between Hard and Soft Base Concepts:
  • Hard: represented by only a single Index-record
  • Soft: represented by several close Index-records

• The final set has been used as starting point in EuroWordNet-2

Comparison of Base Concept Selections

NOUNS

                                                        Local NBCs   Intersection with NBC-ewn1 (905)   % of NBC-ewn1   % of Local BCs   New BCs
FR                                                      787          787                                99.24%          100.00%          0
DE                                                      460          202                                25.47%          43.91%           258
CZ                                                      726          271                                34.17%          37.33%           455
EE                                                      703          389                                49.05%          55.33%           314
Union (selected by at least 1 site)                     1727         811                                102.27%         46.96%           916
Union of Intersections (selected by at least 2 sites)   619          516                                65.07%          83.36%           105
Intersection (selected by 4 sites)                      70           70                                 8.83%           100.00%

VERBS

                                                        Local VBCs   Intersection with VBC-ewn1 (239)   % of VBC-ewn1   % of Local BCs   New BCs
FR                                                      225          225                                94.14%          100.00%          0
DE                                                      321          98                                 41.00%          30.53%           223
EE                                                      459          145                                60.67%          31.80%           314
CZ                                                      260          71                                 29.71%          27.31%           189
Union (selected by at least 1 site)                     872          233                                97.49%          26.72%           639
Union of Intersections (selected by at least 2 sites)   258          179                                74.90%          69.38%           61
Intersection (selected by 4 sites)                      30           30                                 12.55%          100.00%

Revised Set of Base Concepts

         EWN1                   EWN2                   EWN12
         Total   Hard   Soft    Total   Hard   Soft    Total   Hard   Soft
NOUNS    905     575    330     105     20     85      1010    595    415
VERBS    239     164    75      61      23     38      300     187    113

Table 7: Proposed, Missing and Selected Noun Base Concepts for EWN2

                 HARD      SOFT
     Local BCs   Missing   Total   Partial   Missing   Unique BCs   Shared BCs
FR   787         24        199     112       87        0            787
DE   460         427       322     97        225       199          216
EE   703         293       252     160       92        238          465
CZ   726         339       260     153       107       375          351

Table 8: Proposed, Missing and Selected Verb Base Concepts for EWN2

             HARD      SOFT
     Total   Missing   Total   Partial   Missing   Unique BCs   Shared BCs
FR   225     30        45      11        34        0            225
DE   321     91        70      36        34        182          139
EE   459     52        43      36        7         254          205
CZ   260     126       76      35        41        162          98

Starting points for the Top-Ontology

• The ontology should support the building and encoding of semantic networks as linguistic ontologies: networks of lexicalized words and expressions in a language.

• The classification of the Base Concepts in terms of the Top Ontology should apply to all the involved languages.

• Enforce uniformity and compatibility of the different wordnets, by providing a common framework. Divide the Base Concepts (BCs) into coherent clusters to enable contrastive analysis and discussion of closely related word meanings.

• Customize the database by assigning features to the top-concepts, irrespective of language-specific structures.

• Provide an anchor point for connecting other ontologies to the Inter-Lingual-Index, such as CYC, MikroKosmos, the Upper-Model, by linking them to the corresponding ILI-records.

Principles for Principles for deciding on the distinctionsdeciding on the distinctions

Starting point is that the wordnets are linguistic ontologies:

• Semantic classifications common in linguistic paradigms: Aktionsart models [Vendler 1967, Verkuyl 1972, Verkuyl 1989, Pustejovsky 1991], entity-orders [Lyons 1977], Aristotle’s Qualia-structure [Pustejovsky 1995].

• Ontologies developed in previous EC-projects, which had a similar basis and are well-known in the project consortium: Acquilex (BRA 3030, 7315), Sift (LE-62030) [Vossen and Bon 1996].

• The ontology should be capable of reflecting the diversity of the set of common BCs, across the 4 languages. In this sense the classification of the common BCs in terms of the top-concepts should result in:

Homogeneous Base Concept Clusters: classifications in WordNet1.5 and the other wordnets.

Average-sized Base Concept Clusters: not extremely large or small.

Other important characteristics:

The distinctions apply to nouns, verbs and adjectives alike, because these can be related in the language-specific wordnets via an xpos_synonymy relation, and the ILI-records can be related to any part-of-speech.

The top-concepts are hierarchically ordered by means of a subsumption relation but there can only be one super-type linked to each top-concept: multiple inheritance between top-concepts is not allowed.

In addition to the subsumption relation top-concepts can have an opposition-relation to indicate that certain distinctions are disjunct, whereas others may overlap.

There may be multiple relations from ILI-records to top-concepts: the Base Concepts can be cross-classified in terms of multiple top-concepts (as long as these have no opposition-relation between them), i.e. multiple inheritance from Top-Concept to Base Concept is allowed.

Result: the TCs function as cross-classifying features rather than conceptual classes.

Meanings for bodyparts are not linked to a single class BodyPart but to two features: Living and Part.
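The rules above (single super-type per top-concept, opposition between some top-concepts, free cross-classification of Base Concepts otherwise) can be sketched in a few lines of Python. This is only an illustration: the `PARENT` fragment follows the hierarchy shown on the slides, but the `OPPOSED` pairs and the function names are assumptions made for the example, not part of the EuroWordNet specification.

```python
from itertools import combinations

# Single-parent subsumption among top-concepts (no multiple inheritance).
# Only a small fragment of the top-ontology is listed here.
PARENT = {
    "Origin": "1stOrderEntity",
    "Natural": "Origin",
    "Living": "Natural",
    "Artifact": "Origin",
    "Composition": "1stOrderEntity",
    "Part": "Composition",
    "Group": "Composition",
}

# Opposition pairs: hypothetical choices for illustration only.
OPPOSED = {frozenset({"Natural", "Artifact"}), frozenset({"Part", "Group"})}


def ancestors(tc):
    """Return the chain of super-types of a top-concept, nearest first."""
    chain = []
    while tc in PARENT:
        tc = PARENT[tc]
        chain.append(tc)
    return chain


def is_valid_classification(tcs):
    """A Base Concept may carry several top-concepts unless two of them
    (or two of their super-types) stand in an opposition relation."""
    expanded = set()
    for tc in tcs:
        expanded.add(tc)
        expanded.update(ancestors(tc))
    return not any(
        frozenset(pair) in OPPOSED for pair in combinations(expanded, 2)
    )


# A body part is cross-classified as Living and Part.
print(is_valid_classification({"Living", "Part"}))
# Living inherits Natural, which is opposed to Artifact in this sketch.
print(is_valid_classification({"Living", "Artifact"}))
```

Treating the top-concepts as a feature set rather than a single class is what lets a body part carry both Living and Part without a dedicated BodyPart node.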

The EuroWordNet Top-Ontology: 63 concepts (excluding the top)

First Level [Lyons 1977]:

1stOrderEntity (491 BC synsets, all nouns): Any concrete entity (publicly) perceivable by the senses and located at any point in time, in a three-dimensional space.

2ndOrderEntity (500 BC synsets, 272 nouns and 228 verbs): Any Static Situation (property, relation) or Dynamic Situation, which cannot be grasped, heard, seen or felt as an independent physical thing. They can be located in time and occur or take place rather than exist; e.g. continue, occur, apply.

3rdOrderEntity (33 BC synsets, all nouns): An unobservable proposition that exists independently of time and space. They can be true or false rather than real. They can be asserted or denied, remembered or forgotten. E.g. idea, thought, information, theory, plan.

Third-order entities cannot occur and have no temporal duration, and therefore fail on both tests:

a. The same person was here again to-day

b. The same thing happened/occurred again to-day

*? The idea, fact, expectation, etc. was here/occurred/took place

A positive test for a 3rdOrderEntity is based on the properties that can be predicated:

ok The idea, fact, expectation, etc. is true, is denied, forgotten

The first division of the ontology is disjoint: BCs cannot be classified as combinations of these TCs. This distinction cuts across the different parts of speech in that:

1stOrderEntities are always (concrete) nouns. 2ndOrderEntities can be nouns, verbs and adjectives, where adjectives are always non-dynamic (refer to states and situations not involving a change of state). 3rdOrderEntities are always (abstract) nouns.

Test to distinguish 1st, 2nd and 3rd OrderEntities

Base Concepts classified as 3rdOrderEntities

theory; idea; structure; evidence; procedure; doctrine; policy; data point; content; plan of action; concept; plan; communication; knowledge base; cognitive content; know-how; category; information; abstract; info;

1stOrderEntity 1

  Origin 0 (the way in which an entity has come about)
    Natural 21
      Living 30
        Plant 18
        Human 106
        Creature 2
        Animal 123
    Artifact 144

  Function 0 (the typical activity or role that is associated with an entity)
    Vehicle 8
    Occupation 23
    Covering 8
    Garment 3
    Software 4
    Furniture 6
    Place 45
    Container 12
    Comestible 32
    Instrument 18
    Building 13
    Representation 12
      MoneyRepresentation 10
      LanguageRepresentation 34
      ImageRepresentation 9

  Form 0 (amorphous or fixed shape)
    Substance 32
      Solid 63
      Liquid 13
      Gas 1
    Object 62

  Composition 0 (group of self-contained wholes or a part of such a whole)
    Part 86
    Group 63

Conjunctive classes of 1stOrderEntities

Frequent combinations

5 Comestible;Solid;Artifact
5 Container;Part;Solid;Living
5 Furniture;Object;Artifact
5 Instrument;Artifact
5 Living
5 Plant
6 Liquid
6 Object;Artifact
6 Part;Living
6 Place;Part;Solid
7 Building;Object;Artifact
7 Group
7 LanguageRepresentation
7 Vehicle;Object;Artifact
10 Instrument;Object;Artifact
12 Part
14 Place
14 Place;Part
15 Substance
19 LanguageRepresentation;Artifact
20 Occupation;Object;Human
22 Object;Animal; Function
38 Group;Human
42 Object;Human

Conjunctive classes of 1stOrderEntities

Low Frequent combinations

fruit: Comestible (Function); Object (Form); Part (Composition); Plant (Natural, Origin)

life: Group (Composition); Living (Natural, Origin)

cell: Part (Composition); Living (Natural, Origin)

skin: Covering (Function); Solid (Form); Part (Composition); Living (Natural, Origin)

arms: Instrument (Function); Group (Composition); Object (Form); Artifact (Origin)

1stOrderEntities classified as Function only

barrier 1; belonging 2; building material 1; causal agency 1; commodity 1; consumer goods 1; creation 3; curative 1; decoration 2; device 4; fastener 1; force 6; force 7; form 5; impediment 1; medicament 1; piece of work 1; possession 1; protection 4; remains 2; restraint 2; support 6; support 7; supporting structure 1; thing 3

2ndOrderEntity 0

  SituationType 6 (the event-structure in terms of which a situation can be characterized as a conceptual unit over time; disjoint features)
    Dynamic 134 (he sat down quickly. a quick meeting)
      BoundedEvent 183
      UnboundedEvent 48
    Static 28 (?he sits quickly.)
      Property 61
      Relation 38

  SituationComponent 0 (the most salient semantic component(s) that characterize(s) a situation; conjunctive features)
    Cause 67; Communication 50; Condition 62; Physical 140
    Agentive 170; Existence 27; Experience 43; Possession 23
    Phenomenal 17; Location 76; Manner 21; Purpose 137
    Stimulating 25; Mental 90; Modal 10; Quantity 39
    Social 102; Time 24; Usage 8

Conjunctive classes of 2ndOrderEntities

Static

5 Property;Physical;Condition
5 Property;Stimulating;Physical
5 Relation
5 Relation;Social
6 Static;Quantity
7 Property;Condition
8 Relation;Location
9 Property
10 Relation;Physical;Location: adjoin 1; aim 4; blank space 1; course 7; direction 8; distance 1; elbow room 1; path 3; spatial property 1; spatial relation 1

Conjunctive classes of 2ndOrderEntities

Dynamic

5 BoundedEvent;Cause;Physical
5 BoundedEvent;Cause;Physical;Location
5 BoundedEvent;Time
5 Dynamic
5 Dynamic;Location
5 Dynamic;Phenomenal
5 Dynamic;Phenomenal;Physical
6 BoundedEvent;Agentive
6 BoundedEvent;Location
6 BoundedEvent;Physical;Location
6 Dynamic;Agentive;Communication
6 Dynamic;Cause
8 BoundedEvent;Agentive;Mental;Purpose
8 BoundedEvent;Quantity;Time
9 BoundedEvent;Cause
9 Dynamic;Experience;Mental

experience 7; find 3; affect 5; arouse 5; excite 2; cognition 1; desire 2; disposition 2; disposition 4; disturbance 7; emotion 1; feeling 1; humor 3; pleasance 1; process 4; look 8; phenomenon 1; cause to appear 1; perception 2; sensation 1; feel 12; experience 8; trouble 3; reality 1

Top-Down Building Procedure

1) Construction of a core wordnet from the common set of Base Concepts

• Find Representatives in the local language for the Common Base Concepts (1310 synsets)
• Add local Base Concepts that are not selected as Common Base Concepts
• Specify the hyperonyms of the local and common Base Concepts

2) Extend the Core Wordnets

• Add the first level of hyponyms to the core wordnets
• Add other hyponyms which have many sub-hyponyms
• Add other types of relations: XPOS, roles, meronymy, subevents, causes.

3) Verify the Selection

• Corpus frequency: Parole lexicons and corpora
• Top-Concept clustering
• Intersection of ILI-records
• Overlap in ILI-chains

Top-Down Building

[Diagram: two wordnets built top-down and connected through the Inter-Lingual-Index and the Top-Ontology. Labels: 63 TCs; 1310 CBCs; 149 new ILIs; CBC Representatives; Local BCs; Hyperonyms; First Level Hyponyms; Remaining Hyponyms; WMs related via non-hyponymy; Remaining WordNet1.5 Synsets.]

The current wordnets

Language  Synsets  No. of   Sens./  Entries  Sens./  LIRels.  LIRels/  EQRels-  EQRels/  Synsets
                   senses   syns.            entry            syns     ILI      syn      without ILI
Dutch     44015    70201    1.59    56283    1.25    111639   2.54     53448    1.21     7203
Spanish   23370    50526    2.16    27933    1.81    55163    2.36     21236    0.91     0
Italian   40428    48499    1.20    32978    1.47    117068   2.90     71789    1.78     1561
French    22745    32809    1.44    18777    1.75    49494    2.18     22730    1.00     20
German    15132    20453    1.35    17098    1.20    34818    2.30     16347    1.08     0
Czech     12824    19949    1.56    12283    1.62    26259    2.05     12824    1.00     0
Estonian  7678     13839    1.80    10961    1.26    16318    2.13     9004     1.17     0
English   16361    40588    2.48    17320    2.34    42140    2.58     n.a.     n.a.     n.a.
WN15      94515    187602   1.98    126617   1.48    211375   2.24     n.a.     n.a.     n.a.
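The ratio columns in this table are plain quotients of the count columns. A minimal Python sketch, using three rows copied from the table (the function and variable names are made up for the example), recomputes two of them:

```python
# (synsets, senses, entries) copied from the table for three wordnets.
COUNTS = {
    "Dutch": (44015, 70201, 56283),
    "Spanish": (23370, 50526, 27933),
    "Italian": (40428, 48499, 32978),
}


def sense_ratios(synsets, senses, entries):
    """Return (senses per synset, senses per entry), rounded to 2 decimals."""
    return round(senses / synsets, 2), round(senses / entries, 2)


for language, counts in COUNTS.items():
    per_synset, per_entry = sense_ratios(*counts)
    print(f"{language}: {per_synset:.2f} senses/synset, {per_entry:.2f} senses/entry")
```

A high senses-per-synset figure indicates synsets with many synonyms; a high senses-per-entry figure indicates more polysemous entries.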

Comparison of wordnets

• In depth comparison of major semantic fields

• Comparison of the intersection of the associated ILI-records

• Distribution of the associated ILI-records over the different top ontology clusters

• Comparison of the hyponymy relations in the wordnets, projected on the associated ILI-records

Intersection of the associated ILI-records

              Nouns                                    Verbs
              frequency  % of            % of          frequency  % of            % of
                         (WN,IT,NL,ES)   (IT,NL,ES)               (WN,IT,NL,ES)   (IT,NL,ES)
Total                    62780           32520                    12215           7455
ES            24596      39.2%           75.6%         4654       38.1%           62.4%
IT            14272      22.7%           43.9%         4673       38.3%           62.7%
NL            21259      33.9%           65.4%         6416       52.5%           86.1%
(ES, IT)      10907      17.4%           33.5%         3272       26.8%           43.9%
(ES, NL)      14773      23.5%           45.4%         3870       31.7%           51.9%
(IT, NL)      9862       15.7%           30.3%         3950       32.3%           53.0%
(ES, IT, NL)  8183       13.0%           25.2%         3051       25.0%           40.9%
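The intersection figures can be read as set operations over the ILI-records each wordnet links to. The sketch below uses invented toy sets, and the helper names `pct` and `overlap` are assumptions for illustration. It computes a shared-record count and its percentage of the union, then applies the same arithmetic to the noun figures reported in the table.

```python
def pct(part, whole):
    """Percentage rounded to one decimal, as in the table."""
    return round(100 * part / whole, 1)


def overlap(ili_sets, langs):
    """ILI-records shared by every wordnet named in `langs`."""
    return set.intersection(*(ili_sets[lang] for lang in langs))


# Toy ILI-record ids per wordnet (invented for illustration).
ili_sets = {
    "ES": {1, 2, 3, 4},
    "IT": {3, 4, 5},
    "NL": {4, 5, 6},
}
union = set().union(*ili_sets.values())

shared = overlap(ili_sets, ["ES", "IT", "NL"])
print(len(shared), pct(len(shared), len(union)))

# The same arithmetic applied to the noun figures for (ES, IT, NL).
print(pct(8183, 62780), pct(8183, 32520))
```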

Distribution over the top ontology clusters

Top-Concept           WN                 NL                          ES                          IT
                      TC-Tokens  % of wn TC-Tokens  % of nl  % of wn TC-Tokens  % of es  % of wn TC-Tokens  % of it  % of wn
Animal                14068      3.99%   1193       0.97%    8.5%    2458       1.81%    17.5%   1122       1.44%    8.0%
Artifact              19562      5.55%   10803      8.83%    55.2%   9969       7.36%    51.0%   6494       8.34%    33.2%
Building              1022       0.29%   707        0.58%    69.2%   628        0.46%    61.4%   434        0.56%    42.5%
Comestible            3377       0.96%   1393       1.14%    41.2%   1614       1.19%    47.8%   624        0.80%    18.5%
Container             1725       0.49%   778        0.64%    45.1%   799        0.59%    46.3%   432        0.55%    25.0%
Covering              2030       0.58%   1208       0.99%    59.5%   1027       0.76%    50.6%   690        0.89%    34.0%
Creature              664        0.19%   159        0.13%    23.9%   254        0.19%    38.3%   27         0.03%    4.1%
Function              34081      9.68%   17668      14.44%   51.8%   18904      13.96%   55.5%   11043      14.18%   32.4%
Furniture             298        0.08%   171        0.14%    57.4%   147        0.11%    49.3%   87         0.11%    29.2%
Garment               756        0.21%   494        0.40%    65.3%   426        0.31%    56.3%   292        0.37%    38.6%
Gas                   93         0.03%   67         0.05%    72.0%   62         0.05%    66.7%   49         0.06%    52.7%
Group                 27805      7.90%   3357       2.74%    12.1%   3630       2.68%    13.1%   2337       3.00%    8.4%
Human                 11543      3.28%   6372       5.21%    55.2%   7683       5.67%    66.6%   4488       5.76%    38.9%
ImageRepresentation   780        0.22%   412        0.34%    52.8%   426        0.31%    54.6%   294        0.38%    37.7%
Instrument            7036       2.00%   4102       3.35%    58.3%   3590       2.65%    51.0%   2564       3.29%    36.4%
LanguageRepresent.    2844       0.81%   1273       1.04%    44.8%   1218       0.90%    42.8%   691        0.89%    24.3%
Liquid                1629       0.46%   617        0.50%    37.9%   500        0.37%    30.7%   339        0.44%    20.8%
Living                47104      13.37%  10225      8.36%    21.7%   13661      10.08%   29.0%   7408       9.51%    15.7%

Distribution over the top ontology clusters

Top-Concept           WN                 NL                          ES                          IT
                      TC-Tokens  % of wn TC-Tokens  % of nl  % of wn TC-Tokens  % of es  % of wn TC-Tokens  % of it  % of wn
MoneyRepresentation   372        0.11%   190        0.16%    51.1%   183        0.14%    49.2%   111        0.14%    29.8%
Natural               68370      19.41%  21948      17.94%   32.1%   24556      18.13%   35.9%   14400      18.49%   21.1%
Object                48162      13.68%  20206      16.51%   42.0%   22608      16.69%   46.9%   13242      17.00%   27.5%
Occupation            2059       0.58%   1209       0.99%    58.7%   1395       1.03%    67.8%   824        1.06%    40.0%
Part                  12083      3.43%   4806       3.93%    39.8%   5819       4.30%    48.2%   2586       3.32%    21.4%
Place                 5281       1.50%   2072       1.69%    39.2%   2439       1.80%    46.2%   1227       1.58%    23.2%
Plant                 18874      5.36%   1534       1.25%    8.1%    2012       1.49%    10.7%   1121       1.44%    5.9%
Representation        934        0.27%   560        0.46%    60.0%   577        0.43%    61.8%   302        0.39%    32.3%
Software              201        0.06%   80         0.07%    39.8%   91         0.07%    45.3%   49         0.06%    24.4%
Solid                 6319       1.79%   2845       2.33%    45.0%   2721       2.01%    43.1%   1406       1.81%    22.3%
Substance             12365      3.51%   5447       4.45%    44.1%   5599       4.13%    45.3%   2847       3.66%    23.0%
Vehicle               747        0.21%   466        0.38%    62.4%   466        0.34%    62.4%   352        0.45%    47.1%
Total                 352184             122362              34.7%   135462              38.5%   77882               22.1%

Comparison of the hyponymy relations, projected on the associated ILI-records

To be able to compare hyponymy chains, each word sense in the chain has been replaced by the ILI-records that are linked to these synsets which gives the following result:

veranderen (change)                        00064108
bewegen (move intransitive)                01046072
bewegen (move reflexive)                   01046072
voortbewegen (move location)               01046072
verplaatsen (move from A to B)             01055491
stijgen (move to a higher position)        01094615
opstijgen (take off)                       00257753
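The projection step can be sketched as a simple lookup from word senses to ILI-record ids, using the Dutch chain above as data. The function names are hypothetical, and collapsing consecutive repeats of the same ILI-record is an assumption added here to make chains from different wordnets easier to compare; the slides do not say whether this was done.

```python
from itertools import groupby

# Dutch hyponymy chain from the slide, top to bottom, with the
# ILI-record each word sense is linked to.
CHAIN = [
    ("veranderen", "change"),
    ("bewegen", "move intransitive"),
    ("bewegen", "move reflexive"),
    ("voortbewegen", "move location"),
    ("verplaatsen", "move from A to B"),
    ("stijgen", "move to a higher position"),
    ("opstijgen", "take off"),
]
SENSE_TO_ILI = dict(zip(CHAIN, [
    "00064108", "01046072", "01046072", "01046072",
    "01055491", "01094615", "00257753",
]))


def project(chain, sense_to_ili):
    """Replace every word sense in a chain by its ILI-record id."""
    return [sense_to_ili[sense] for sense in chain]


def collapse(ili_chain):
    """Drop consecutive repeats of the same ILI-record (an assumption)."""
    return [ili for ili, _ in groupby(ili_chain)]


projected = project(CHAIN, SENSE_TO_ILI)
print(projected)
print(collapse(projected))
```

Once every wordnet's chains are expressed as ILI-record sequences, they can be compared against each other and against the WordNet1.5 hierarchy without reference to the original words.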

Coverage of complete noun chains projected over WN1.5 structure

             nodes (53467)         edges (53467)
             frequency  %          frequency  %
ES           14221      26.60      14221      26.60
NL           650        1.22       17         0.03
IT           2760       5.16       49         0.09
(ES,NL)      352        0.66       10         0.02
(ES,IT)      1563       2.92       34         0.06
(NL,IT)      190        0.36       0          0.00
(ES,NL,IT)   136        0.25       0          0.00

Partial noun chains projected over WN1.5

LENGTH  ES     NL     IT     (ES,NL)  (ES,IT)  (NL,IT)  (ES,NL,IT)  WN
1       53467  53213  53456  53148    53452    52862    52803       53467
2       53385  43161  47346  41959    47138    40893    40636       53467
3       51541  26862  44076  25162    42764    21573    21089       53434
4       47930  15032  27878  13106    26260    7808     7112        52913
5       42049  6771   21019  5454     19433    2996     2506        50693
6       27582  2781   14817  1929     12552    949      799         45029
7       16789  967    7865   726      6259     169      148         32299
8       8337   196    3526   87       2648     17       12          20558
9       3800   6      1062   3        779      -        -           11821
10      1647   -      380    -        311      -        -           5881
11      647    -      82     -        73       -        -           2576
12      299    -      28     -        25       -        -           1176
13      115    -      -      -        -        -        -           659
14      19     -      -      -        -        -        -           295
15      2      -      -      -        -        -        -           82

Partial noun chains with 1 gap projected over WN1.5

LENGTH  ES     NL     IT     (ES,NL)  (ES,IT)  (NL,IT)  (ES,NL,IT)  WN
3       7804   29355  12152  28312    11619    20886    20439       53434
4       7776   26152  11616  24655    11086    17228    16775       52913
5       7333   18633  10480  16712    9652     11136    10561       50693
6       6296   12019  7782   10158    6879     6023     5262        45029
7       5017   5326   4602   3866     4119     2531     1960        32299
8       3392   1891   2456   1046     2131     704      560         20558
9       1914   487    1166   268      986      115      98          11821
10      1038   83     538    32       485      11       7           5881
11      564    2      173    1        163      -        -           2576
12      232    -      108    -        101      -        -           1176
13      98     -      35     -        4        -        -           659
14      43     -      2      -        -        -        -           295
15      5      -      -      -        -        -        -           82
16      2      -      -      -        -        -        -           7

Independently of the wordnet structures in each language, we can manipulate the mapping across languages via the ILI.

We can use the information of all the languages to correct incompleteness and inconsistencies of the individual resources

Ultimately, we should try to find a minimal and sufficient set of concepts to provide an efficient mapping.

Towards an efficient, condensed and universal index of sense-distinctions

Characteristics of the Inter-Lingual-Index

The Inter-Lingual-Index (ILI) is an unstructured fund of concepts with the sole purpose of providing an efficient mapping of senses across languages. Requirements:

1. efficient level of granularity

ILI                                      Wordnets
{break} "He broke the glass"             breken (Dutch)
{break; cause to break}                  breken (Dutch)
{break; damage} inflict damage upon      romper (Spanish), rompere (Italian)

2. superset of concepts that occur across languages

ILI                 Relation        Wordnets
{cashier}           eq_hyperonym    cassière (Dutch), cajera (Spanish)
{female cashier}    eq_synonym      cassière (Dutch), cajera (Spanish)

A Minimal and Efficient set of concepts

• Globalizing the sense-differentiation:
  • create metonymic clusters
  • abstract from contextual specialization and grammatical perspectives
  • abstract from part-of-speech realization
  • abstract from productive and predictable meanings

• Extending the Inter-Lingual-Index to become the superset of concepts occurring in two or more wordnets only if:
  • concepts are unpredictable and unproductive
  • concepts cannot be linked exhaustively and uniquely to the ILI

Under-specified concepts: Metonymic clusters

[Diagram: the ILI cluster "club" groups the records "metonym# club: organization" and "metonym# club: building". Synsets shown: {vereniging}NL, {club; verenigingsgebouw}NL, {club}EN. Links shown: eq_synonym, eq_metonym.]

Under-specified concepts: Generalization and Diathesis clusters

[Diagram: the ILI cluster "break" groups the records "diathesis# break: inchoative" and "diathesis# break: causative". Synsets shown: {rompere}IT, {rompersi}IT, {breken; kapotgaan}NL, {breken; kapotmaken}NL. Links shown: eq_synonym, eq_diathesis.]

Under-specified for POS

[Diagram: the ILI cluster "depart" groups the records "xpos# departure" and "xpos# depart". Synsets shown: {vertrekken V}NL, {vertrek N}NL, {depart V}EN, {departure N}EN. Links shown: eq_synonym, eq_xpos_synonym.]

Overview of equivalence relations to the ILI

Relation            POS             Sources : Targets    Example
eq_synonym          same            1 : 1                auto : voiture car
eq_near_synonym     any             many : many          apparaat, machine, toestel : apparatus, machine, device
eq_hyperonym        same            many : 1 (usually)   citroenjenever : gin
eq_hyponym          same (usually)  1 : many             dedo : toe, finger
eq_metonymy         same            many/1 : 1           universiteit, universiteitsgebouw : university
eq_diathesis        same            many/1 : 1           raken (cause), raken : hit
eq_generalization   same            many/1 : 1           schoonmaken : clean
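One way to picture how these typed links are used is as records pointing from a language-specific synset to an ILI-record. Cross-language lookup then goes through the ILI rather than directly between wordnets. The sketch below is a hypothetical data model, not the EuroWordNet database format: the class name, field names and the language codes attached to the examples are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EqLink:
    lang: str        # wordnet the synset belongs to
    synset: tuple    # variants (words) in the synset
    relation: str    # equivalence relation type
    ili: str         # target ILI-record


# Examples taken from the overview table; language codes are assumed.
LINKS = [
    EqLink("NL", ("auto",), "eq_synonym", "car"),
    EqLink("FR", ("voiture",), "eq_synonym", "car"),
    EqLink("NL", ("citroenjenever",), "eq_hyperonym", "gin"),
    EqLink("ES", ("dedo",), "eq_hyponym", "toe"),
    EqLink("ES", ("dedo",), "eq_hyponym", "finger"),
]


def equivalents(links, lang, word, target_lang, relation="eq_synonym"):
    """Words in `target_lang` that reach the same ILI-record as `word`
    through the given equivalence relation."""
    ili_ids = {
        link.ili for link in links
        if link.lang == lang and word in link.synset and link.relation == relation
    }
    return sorted({
        variant
        for link in links
        if link.lang == target_lang and link.relation == relation and link.ili in ili_ids
        for variant in link.synset
    })


print(equivalents(LINKS, "NL", "auto", "FR"))
print(equivalents(LINKS, "NL", "citroenjenever", "FR"))
```

Restricting the lookup to `eq_synonym` keeps only exact matches; a looser application could also follow `eq_near_synonym` or `eq_hyperonym` links to return approximate equivalents.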

Progress on restructuring the ILI

Clusters added manually and automatically based on:

• structural properties of WN1.5
• mapping to other sources: Levin’s classes, WN1.6
• cross-lingual mapping

         clusters  words  word senses  synsets
Nouns    1703      1398   3205         2895
Verbs    2905      1799   5134         3839

New ILIs from other wordnets have not yet been added. We estimated that for verbs hardly any new ILIs are needed, for nouns about 30% of non-translated concepts (2,000 synsets based on Dutch).

Effects of ILI-clusters

Intersection of ILI-references for Dutch, Spanish, Italian and English

Nouns: 2895 clustered synsets (4.6% of 62780 WN1.5 noun synsets); intersection increased from 7736 (23.8%) to 8183 (25.2%) out of the union of 32520 synsets

Verbs: 3839 clustered synsets (31.4% of 12215 WN1.5 verb synsets); intersection increased from 1632 (21.9%) to 3051 (40.9%) out of the union of 7455 synsets
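Why clustering raises the intersection can be shown with a toy example: when wordnets link the same word to different fine-grained ILI-records, mapping those records onto a shared cluster makes them count as a match. The record names and the `cluster_of` mapping below are invented, and how the project actually counted matches after clustering is not stated on the slides, so this is only a sketch of the idea.

```python
def shared_records(ili_sets, cluster_of=None):
    """Intersect ILI-record sets, optionally after replacing each record
    by the cluster it belongs to (records without a cluster are kept)."""
    cluster_of = cluster_of or {}
    normalised = [{cluster_of.get(r, r) for r in records} for records in ili_sets]
    return set.intersection(*normalised)


# Toy data: three wordnets pick different senses of "break".
nl = {"break#1", "car#1"}
es = {"break#2", "car#1"}
it = {"break#2", "car#1"}

# Hypothetical diathesis cluster joining the two senses of "break".
cluster_of = {"break#1": "break*", "break#2": "break*"}

print(sorted(shared_records([nl, es, it])))
print(sorted(shared_records([nl, es, it], cluster_of)))
```

The larger gain reported for verbs is consistent with the much higher share of verb synsets that were clustered.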

Superset of all concepts

Procedure:

• Initially, the ILI will only contain WordNet1.5 synsets.

• A site that cannot find a proper equivalent among the available ILI-concepts will link the meaning to another ILI-record using a so-called complex-equivalence relation and will generate a potential new ILI-record:

Dutch Meaning   Definition           Complex-equivalence   Target concept
klunen          to walk on skates    has_eq_hyperonym      walk

• After a building-phase all potentially-new ILI-records are collected and verified for overlap by one site.

• A proposal for updating the ILI is distributed to all sites and has to be verified.

• The ILI is updated and all sites have to reconsider the equivalence relations for all meanings that can potentially be linked to the new ILI-records.

Filling gaps in the ILI

Types of gaps:

1. Genuine, cultural gaps for things not known in English culture, e.g. citroenjenever, which is a kind of gin made out of lemon skin.

• Non-productive
• Non-compositional

2. Pragmatic gaps, in the sense that the concept is known but is not expressed by a single lexicalized form in English, e.g.: container, borrower, cajera (female cashier).

• Productive
• Compositional

Universality of gaps: concepts occurring in at least 2 languages

Productive and Predictable Lexicalizations exhaustively linked to the ILI

[Diagram: language-specific synsets linked to several ILI-records at once. ILI-records shown: beat, stamp, kick, kill, cashier, female, young, fish. Synsets shown: {doodslaan V}NL, {doodschoppen V}NL, {doodstampen V}NL, {tottrampeln V}DE, {totschlagen V}DE, {cajera N}ES, {casière}NL, {alevín N}ES. Links shown: eq_has_hyperonym, eq_in_state.]

WordNet gaps across languages

                         ILI REFs (mostly hyperonyms)   ILIVars
                         Nouns   Verbs                  Nouns   Verbs
NL                       491     99                     551     82
DE                       109     9                      144     10
IT                       45      22                     77      66
NL&DE                    10      0                      2       0
NL&IT                    6       3                      1       0
DE&IT                    5       1                      0       0
NL&DE&IT                 3       0                      0       0
Union of intersections   15      4                      3       0

Towards an efficient, condensed and universal index of sense-distinctions

[Diagram: WordNet1.5 (90,000 concepts) reduced to a set of universal core meanings that are POS-independent and non-predictable. Layers shown: productive derivations and compounds linked exhaustively; metonymy/generalization clusters; universal systematic polysemy and level of granularity; language and domain specific lexicalizations that do not occur in a large variety of languages; language specific realizations in grammatical forms.]

The EuroWordNet database

1. The actual wordnets in Flaim database format: an indexing and compression format of Novell.

2. Polaris (Louw 1997): Re-implementation of the Novell ConceptNet toolkit (Díez-Orzas et al 1995) adapted to the EuroWordNet architecture.

• import and export wordnets or wordnet selections from/to ASCII files
• resolve links for imported concepts
• edit and add concepts, variants and relations in the wordnets
• access the ILI and ontologies and switch between the wordnets and ontologies via the ILI
• extract, import and export clusters of senses based on relations
• project synsets or clusters from one wordnet to another wordnet
• compare clusters of synsets
• import new or adapted ILI-records
• update ILI-references to the updated ILI

3. Periscope (Cuypers and Adriaens 1997): a graphical interface for viewing the EuroWordNet database.

Global Wordnet Association

http://www.globalwordnet.org

provide a standardized framework to link, compare and build complete wordnets for all the European languages and dialects.

initialize the development of wordnets in non-European languages

develop more specific definitions, tests and procedures for evaluating and developing wordnets.

extend the specification of EuroWordNet to lexical units which are not yet covered (adjectives/adverbs, lexicalized phrases and multi-words).

develop (axiomatized) ontologies for Domains and World-Knowledge that can be shared by all languages via the ILI.

develop an efficient ILI for linking, sharing, consistency checking and cross-language technology applications. This ILI could function as a gold-standard of sense-distinctions.

organize an (annual/bi-annual) workshop or conference.

2nd Global Wordnet Conference

Location: Masaryk University, Brno (Czech Republic), January 20 - 23, 2004. http://www.fi.muni.cz/gwc2004/

Other wordnet initiatives

Danish, Norwegian, Swedish, Portuguese, Arabic, Korean, Russian, Welsh, Basque, Catalan, Chinese, BalkaNet, IndoWordnet, Meaning

BalkaNet

• Funded by the European Union as project IST-2000-29388.
• 3-year project: 2001 - 2004
• Follows a strict EuroWordNet approach: expanded set of base concepts; top-down building approach
• EWN database extended with: Greek, Romanian, Serbian, Turkish, Bulgarian, Czech
• Development of new wordnet database system: VisDic
• http://www.ceid.upatras.gr/Balkanet/

IndoWordnet

Current Wordnet development in India: Hindi and Marathi at IIT Bombay, Tamil at Anna University-K.B. Chandrashekhar Research Centre (AU-KBC) Chennai and Tamil University Tanjavur, Gujarati at MS University Baroda, Oriya at Utkal University Bhubaneswar and Bengali at IIT Kharagpur.

The Hindi WordNet is at an advanced stage of development with about 11000 semantically linked synsets and with associated software and user interface.

IndoWordnet

By the end of 2003 a WordNet of 5000 synsets will be created for each Indian language. These will cover about 2000 of the most frequent content words in each language. Use will be made of the wordlist sorted by frequency, available from the CIIL.

Language specific WordNets developed by the following institutions:

• CIIL, Mysore: Kannada, Kashmiri, Punjabi, Urdu, Himachali, Malayalam
• IIT Bombay: Hindi, Marathi and Konkani
• AU-KBC Chennai and Tamil University Tanjavur: Tamil and Malayalam
• University of Hyderabad: Telugu
• University of Baroda: Gujarati
• Utkal University Bhubaneswar: Oriya
• IIT Kharagpur: Bengali

Research groups have to be identified for building the WordNets of Assamese, Nepali and Languages of the North East.

Developing Multilingual Web-scale Language Technologies

http://www.lsi.upc.es/~nlp/meaning/

Meaning

Meaning Objectives

• Funded by the European Union as project IST-2001-34460
• 3-year project: April 2002 - April 2005
• Large-scale (Lexical) Knowledge Bases
• Automatic enrichment of EWN
• Mixed approach (KB + ML)
• Applied to Q/A, CLIR

Problem: structural and lexical ambiguity

Meaning Approach

• Automatic collection of sense examples (Leacock et al. 98, Mihalcea and Moldovan 99)
• Large-scale WSD (Boosting, SVM, transductive methods)
• Large-scale Knowledge Acquisition (McCarthy 01, Agirre & Martinez 02)

Meaning Architecture

[Diagram: a Multilingual Central Repository connected to the Italian, Basque, Spanish, English and Catalan EWN wordnets, each paired with a web corpus for its language. Arrows labelled ACQ, UPLOAD, PORT and WSD.]

A combination of unsupervised Knowledge-based and supervised Machine Learning techniques that will provide a high-precision system that is able to tag running text with word senses

A system that acquires a huge number of examples per word from the web

The use of sophisticated linguistic information, such as, syntactic relations, semantic classes, selectional restrictions, subcategorization information, domain, etc.

Efficient margin-based Machine Learning algorithms.

Novel algorithms that combine tagged examples with huge amounts of untagged examples in order to increase the precision of the system.

Meaning WP6: Word Sense Disambiguation

THE END...