Page 1: Piek Vossen VU University Amsterdam

Guest lecture, Language Engineering Applications, February, 26th 2009, Leuven

From WordNet, to EuroWordNet, to the Global Wordnet Grid: anchoring languages to universal meaning

Piek Vossen

VU University Amsterdam

Page 2: Piek Vossen VU University Amsterdam

Overview

• Wordnet
• EuroWordNet
• Global Wordnet Grid
• Stevin project Cornetto
• 7th Framework project KYOTO

Page 3: Piek Vossen VU University Amsterdam

WordNet
http://wordnet.princeton.edu/

• Lexical semantic database for English
• Developed by George Miller and his team at Princeton University, as the implementation of a mental model of the lexicon
• Organized around the notion of a synset: a set of synonyms in a language that represent a single concept
• Semantic relations between concepts (synsets) and not between words
• Currently covers over 117,000 concepts (synsets) and over 150,000 English words

Page 4: Piek Vossen VU University Amsterdam

Relational model of meaning

[Figure: the words man, woman, boy, girl, cat, kitten, dog, puppy and animal, arranged by their semantic relations]

Page 5: Piek Vossen VU University Amsterdam

Wordnet: a network of semantically related words

[Figure: synset network around {car; auto; automobile; machine; motorcar}]
hyper(o)nyms: {conveyance; transport}, {vehicle}, {motor vehicle; automotive vehicle}
hyponyms: {cruiser; squad car; patrol car; police car; prowl car}, {cab; taxi; hack; taxicab}
meronyms: {bumper}, {car door}, {car window}, {car mirror}, {armrest}, {doorlock}, {hinge; flexible joint}

Hyponymy and meronymy relations are:
• transitive
• directed
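The transitivity and directedness of hyponymy can be sketched in a few lines of Python. This is a minimal illustration, not WordNet's own API: `HYPERNYM`, `hypernym_chain` and `is_a` are hypothetical names, and the chain from car up to conveyance is an assumed simplification built from the synsets on the slide.

```python
# Toy hypernym graph: each synset label points to its direct hypernym.
# The labels come from the slide; the exact chain is an illustrative assumption.
HYPERNYM = {
    "taxi": "car",
    "police car": "car",
    "car": "motor vehicle",
    "motor vehicle": "vehicle",
    "vehicle": "conveyance",
}


def hypernym_chain(synset, graph=HYPERNYM):
    """Follow the directed hypernym links upward and return every ancestor."""
    chain = []
    while synset in graph:
        synset = graph[synset]
        chain.append(synset)
    return chain


def is_a(specific, general, graph=HYPERNYM):
    """Transitivity: specific is a kind of general if general is any ancestor."""
    return general in hypernym_chain(specific, graph)
```

Because the links are directed, `is_a("taxi", "vehicle")` holds while `is_a("vehicle", "taxi")` does not.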

Page 6: Piek Vossen VU University Amsterdam

Wordnet Semantic Relations
WN 1.5 starting point

The ‘synset’ as a weak notion of synonymy: “two expressions are synonymous in a linguistic context C if the substitution of one for the other in C does not alter the truth value.” (Miller et al. 1993)

Relations between synsets:

Relation     Parts of speech           Example
HYPONYMY     noun-to-noun              car / vehicle
             verb-to-verb              walk / move
MERONYMY     noun-to-noun              head / nose
ANTONYMY     adjective-to-adjective    good / bad
             verb-to-verb              open / close
ENTAILMENT   verb-to-verb              buy / pay
CAUSE        verb-to-verb              kill / die

Page 7: Piek Vossen VU University Amsterdam

Wordnet Data Model

[Figure: three columns, Vocabulary of a language, Concepts and Relations, with numbered sense links (1, 2) from words to concept records]

Vocabulary of a language: bank, fiddle, violin, violist, fiddler, string

Concepts:
rec: 12345 - financial institute
rec: 54321 - side of a river
rec: 9876 - small string instrument
rec: 65438 - musician playing violin
rec: 42654 - musician
rec: 25876 - string instrument
rec: 35576 - string of instrument
rec: 29551 - underwear

Relations: type-of, type-of, part-of

Figure labels: polysemy; polysemy & synonymy
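The data model above can be sketched as a mapping from words to concept records. This is a minimal sketch under stated assumptions: the record numbers are those on the slide, but which word links to which record is my reading of the figure, and the names `SENSES`, `polysemous_words` and `synsets` are hypothetical.

```python
from collections import defaultdict

# Word -> concept records it can express (assumed reading of the figure).
SENSES = {
    "bank": [12345, 54321],    # financial institute / side of a river
    "fiddle": [9876],          # small string instrument
    "violin": [9876],
    "violist": [65438],        # musician playing violin
    "fiddler": [65438],
    "string": [35576, 29551],  # string of instrument / underwear
}


def polysemous_words(senses):
    """Polysemy: one word linked to more than one concept record."""
    return sorted(word for word, records in senses.items() if len(records) > 1)


def synsets(senses):
    """Synonymy: invert the mapping so each record lists the words naming it."""
    by_record = defaultdict(set)
    for word, records in senses.items():
        for record in records:
            by_record[record].add(word)
    return dict(by_record)
```

Semantic relations such as type-of and part-of would then be stored between record numbers, not between words.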

Page 8: Piek Vossen VU University Amsterdam

Some observations on Wordnet

• synsets are more compact representations for concepts than word meanings in traditional lexicons
• synonyms and hypernyms are substitutional variants:
– begin – commence
– I once had a canary. The bird got sick. The poor animal died.
• hyponymy and meronymy chains are important transitive relations for predicting properties and explaining textual properties: object -> artifact -> vehicle -> 4-wheeled vehicle -> car
• strict separation of part of speech although concepts are closely related (bed – sleep) and are similar (dead – death)
• lexicalization patterns reveal important mental structures

Page 9: Piek Vossen VU University Amsterdam

Lexicalization patterns

[Figure: hyponymy chains running from the 25 unique beginners down to specific concepts, with the basic level concepts marked. Labels: entity, object, organism, animal, bird, canary, common canary, crocodile, dog, plant, tree, flower, rose, artifact, building, church, abbey, garbage, waste, threat]

Basic level concepts:
• balance of two principles:
– predict most features
– apply to most subclasses
• where most concepts are created
• amalgamate most parts
• most abstract level to draw a picture

Page 10: Piek Vossen VU University Amsterdam


Wordnet top level

Page 11: Piek Vossen VU University Amsterdam

Meronymy & pictures

[Figure labels: beak, tail, leg]

Page 12: Piek Vossen VU University Amsterdam


Meronymy & pictures

Page 13: Piek Vossen VU University Amsterdam

Wordnet 3.0 statistics

POS         Unique Strings   Synsets   Total Word-Sense Pairs
Noun        117,798          82,115    146,312
Verb        11,529           13,767    25,047
Adjective   21,479           18,156    30,002
Adverb      4,481            3,621     5,580
Totals      155,287          117,659   206,941

Page 14: Piek Vossen VU University Amsterdam

Wordnet 3.0 statistics

POS         Monosemous Words and Senses   Polysemous Words   Polysemous Senses
Noun        101,863                       15,935             44,449
Verb        6,277                         5,252              18,770
Adjective   16,503                        4,976              14,399
Adverb      3,748                         733                1,832
Totals      128,391                       26,896             79,450
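The two statistics slides can be cross-checked with a little arithmetic. The sketch below copies the counts from the slides (the names `STATS`, `average_polysemy` and `strings_add_up` are hypothetical) and computes the average number of senses per polysemous word, a figure the slides themselves do not list.

```python
# Counts per part of speech, copied from the Wordnet 3.0 statistics slides:
# (unique strings, monosemous words, polysemous words, polysemous senses)
STATS = {
    "Noun": (117798, 101863, 15935, 44449),
    "Verb": (11529, 6277, 5252, 18770),
    "Adjective": (21479, 16503, 4976, 14399),
    "Adverb": (4481, 3748, 733, 1832),
}


def average_polysemy(pos):
    """Average number of senses per polysemous word for one part of speech."""
    _, _, poly_words, poly_senses = STATS[pos]
    return poly_senses / poly_words


def strings_add_up(pos):
    """Monosemous plus polysemous words should equal the unique strings."""
    unique, mono, poly_words, _ = STATS[pos]
    return mono + poly_words == unique


for pos in STATS:
    print(pos, strings_add_up(pos), round(average_polysemy(pos), 2))
```

On these counts the monosemous and polysemous words sum to the unique strings for every part of speech, and verbs come out as the most polysemous class (about 3.57 senses per polysemous verb, against about 2.79 for nouns).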

Page 15: Piek Vossen VU University Amsterdam


http://www.visuwords.com


Page 17: Piek Vossen VU University Amsterdam

Usage of Wordnet

• Most widely used database in language technology
• Enormous impact on language technology development
• Large
• Free and downloadable
• English

Page 18: Piek Vossen VU University Amsterdam

Usage of Wordnet

• Improve recall of text-based analysis:
– Query -> Index
  • Synonyms: commence – begin
  • Hypernyms: taxi -> car
  • Hyponyms: car -> taxi
  • Meronyms: trunk -> elephant
  • Lexical entailments: gun -> shoot
• Inferencing:
– what things can burn?
• Expression in language generation and translation:
– alternative words and paraphrases
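The query-to-index expansion listed above can be sketched as a lookup over relation tables. This is a toy illustration only: `RELATIONS` holds just the word pairs from the slide, and `expand_query` is a hypothetical helper, not part of any WordNet library.

```python
# Toy relation tables built from the examples on the slide.
RELATIONS = {
    "synonym": {"commence": ["begin"], "begin": ["commence"]},
    "hypernym": {"taxi": ["car"]},
    "hyponym": {"car": ["taxi"]},
}


def expand_query(terms, relation_types=("synonym", "hyponym")):
    """Add related words to a query so that more indexed documents match."""
    expanded = list(terms)
    for term in terms:
        for relation in relation_types:
            for related in RELATIONS.get(relation, {}).get(term, []):
                if related not in expanded:
                    expanded.append(related)
    return expanded
```

A query for "car" then also matches documents indexed under "taxi". Which relation types to follow is a design choice: synonyms and hyponyms tend to keep results on topic, while hypernyms broaden the query.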

Page 19: Piek Vossen VU University Amsterdam

Improve recall

• Information retrieval:
– effective on small databases without redundancy, e.g. image captions, video text
• Text classification:
– expand small training sets
– reduce training effort
• Question & Answer systems:
– question classification: who, where, what, when
– match answers to question types

Page 20: Piek Vossen VU University Amsterdam

Improve recall

• Anaphora resolution:
– The girl fell off the table. She....
– The glass fell off the table. It...
• Coreference resolution:
– When he moved the furniture, the antique table got damaged.
• Information extraction (unstructured text to structured databases):
– generic forms or patterns "vehicle" -> text with specific cases "car"

Page 21: Piek Vossen VU University Amsterdam

Improve recall

• Summarizers:
– Sentence selection based on word counts -> concept counts
– Avoid repetition in summary -> language generation, pick out another synonym or hypernym
• Limited inferencing: detect locations, people, organisations, etc.

Page 22: Piek Vossen VU University Amsterdam

Enabling technologies

• Semantic similarity: what sentences or expressions are semantically similar?
• Semantic relatedness and textual entailment: smoke entails fire, fire entails damage
• Word-Sense Disambiguation
• Erwin Marsi, University of Tilburg, http://daeso.uvt.nl/demos/index.html


Page 25: Piek Vossen VU University Amsterdam

Recall & Precision

[Figure: Venn diagram of found documents, relevant documents and their intersection for the query “cell”. Labels: “cellphone”, “mobile phones”, “nerve cell”, “police cell”, “jail”, “neuron”]

recall = intersection / relevant
precision = intersection / found

Recall < 20% for basic search engines! (Blair & Maron 1985)
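The two ratios can be computed directly from sets of document identifiers. The sketch below is a minimal illustration; the function name `recall_precision` and the document ids are hypothetical.

```python
def recall_precision(found, relevant):
    """Return (recall, precision) for a set of found and a set of relevant documents."""
    found, relevant = set(found), set(relevant)
    intersection = found & relevant
    recall = len(intersection) / len(relevant) if relevant else 0.0
    precision = len(intersection) / len(found) if found else 0.0
    return recall, precision


# Hypothetical example: 4 documents found, 5 relevant, 2 in both sets.
found = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d2", "d5", "d6", "d7"}
print(recall_precision(found, relevant))
```

In this example two of the five relevant documents are found (recall 0.4) and two of the four found documents are relevant (precision 0.5).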

Page 26: Piek Vossen VU University Amsterdam

Many others

• Data sparseness for machine learning: hapaxes can be replaced by semantic classes that match classes from the training set
• Use redundancy for more robustness: spelling correction and speech recognition can build semantic expectations using Wordnet and make better choices
• Sentiment and opinion mining
• Natural language learning

Page 27: Piek Vossen VU University Amsterdam

EuroWordNet

• The development of a multilingual database with wordnets for several European languages
• Funded by the European Commission, DG XIII, Luxembourg as projects LE2-4003 and LE4-8328
• March 1996 - September 1999
• 2.5 million euro
• http://www.hum.uva.nl/~ewn
• http://www.illc.uva.nl/EuroWordNet/finalresults-ewn.html

Page 28: Piek Vossen VU University Amsterdam

EuroWordNet

• Languages covered:
– EuroWordNet-1 (LE2-4003): English, Dutch, Spanish, Italian
– EuroWordNet-2 (LE4-8328): German, French, Czech, Estonian
• Size of vocabulary:
– EuroWordNet-1: 30,000 concepts - 50,000 word meanings
– EuroWordNet-2: 15,000 concepts - 25,000 word meanings
• Type of vocabulary:
– the most frequent words of the languages
– all concepts needed to relate more specific concepts

Page 29: Piek Vossen VU University Amsterdam

EuroWordNet Model

[Figure: four language-specific Lexical Items Tables linked through the Inter-Lingual-Index (ILI-record {drive}), which is in turn linked to the Ontology (2OrderEntity, Location, Dynamic) and the Domains (Traffic, Air, Road)]

Lexical Items Tables:
Italian: cavalcare, andare, muoversi, guidare
Dutch: bewegen, gaan, rijden, berijden
English: drive, ride, move, go
Spanish: cabalgar, jinetear, conducir, mover, transitar

I = Language Independent link
II = Link from Language Specific to Inter-Lingual Index
III = Language Dependent Link

Page 30: Piek Vossen VU University Amsterdam

Differences in relations between EuroWordNet and WordNet

• Added Features to relations

• Cross-Part-Of-Speech relations

• New relations to differentiate shallow hierarchies

• New interpretations of relations

Page 31: Piek Vossen VU University Amsterdam

EWN Relationship Labels

{airplane}
HAS_MERO_PART: conj1 {door}
HAS_MERO_PART: conj2 disj1 {jet engine}
HAS_MERO_PART: conj2 disj2 {propeller}

{door}
HAS_HOLO_PART: disj1 {car}
HAS_HOLO_PART: disj2 {room}
HAS_HOLO_PART: disj3 {entrance}

Default Interpretation: non-exclusive disjunction

Page 32: Piek Vossen VU University Amsterdam

Overview of the Language Internal relations in EuroWordnet

Same Part of Speech relations:
HYPERONYMY/HYPONYMY        car - vehicle
ANTONYMY                   open - close
HOLONYMY/MERONYMY          head – nose
NEAR_SYNONYMY              apparatus - machine

Cross-Part-of-Speech relations:
XPOS_NEAR_SYNONYMY         dead - death; to adorn - adornment
XPOS_HYPERONYMY/HYPONYMY   to love - emotion
XPOS_ANTONYMY              to live - dead
CAUSE                      die - death
SUBEVENT                   buy - pay; sleep - snore
ROLE/INVOLVED              write - pencil; hammer - hammer
STATE                      the poor - poor
MANNER                     to slurp - noisily
BELONG_TO_CLASS            Rome - city

Page 33: Piek Vossen VU University Amsterdam

Co_Role relations

criminal              CO_AGENT_PATIENT       victim
novel writer/ poet    CO_AGENT_RESULT        novel/ poem
dough                 CO_PATIENT_RESULT      pastry/ bread
photographic camera   CO_INSTRUMENT_RESULT   photo

guitar player
HAS_HYPERONYM player
CO_AGENT_INSTRUMENT guitar

player
HAS_HYPERONYM person
ROLE_AGENT to play music
CO_AGENT_INSTRUMENT musical instrument

to play music
HAS_HYPERONYM to make
ROLE_INSTRUMENT musical instrument

guitar
HAS_HYPERONYM musical instrument
CO_INSTRUMENT_AGENT guitar player

Page 34: Piek Vossen VU University Amsterdam

Horizontal & vertical semantic relations

[Figure: semantic network around "patient". Vertical links: HYPONYM. Horizontal links: ρ-PATIENT, ρ-AGENT, ρ-PROCEDURE, ρ-LOCATION, ρ-CAUSE, STATE, co-ρ-AGENT-PATIENT. Nodes: chronic patient; mental patient; patient; child; cure; treat; doctor; child doctor; disease; disorder; stomach disease, kidney disorder; physiotherapy, medicine, etc.; hospital, etc.]

Page 35: Piek Vossen VU University Amsterdam

The Multilingual Design

• Inter-Lingual-Index: unstructured fund of concepts to provide an efficient mapping across the languages;
• Index-records are mainly based on WordNet synsets and consist of synonyms, glosses and source references;
• Various types of complex equivalence relations are distinguished;
• Equivalence relations from synsets to index records: not on a word-to-word basis;
• Indirect matching of synsets linked to the same index items.
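Indirect matching through the Inter-Lingual-Index can be sketched as follows. The sketch assumes hypothetical identifiers: the ILI record names (`ili:drive`, `ili:move`), the choice of which word links to which record, and the function `translations` are illustrative and not taken from the EuroWordNet database.

```python
# (language, synset) -> ILI records it is linked to by an equivalence relation.
# Words are from the EuroWordNet model slide; the links are assumed for illustration.
EQ_LINKS = {
    ("nl", "rijden"): {"ili:drive"},
    ("en", "drive"): {"ili:drive"},
    ("it", "guidare"): {"ili:drive"},
    ("es", "conducir"): {"ili:drive"},
    ("nl", "bewegen"): {"ili:move"},
    ("en", "move"): {"ili:move"},
}


def translations(lang, synset, target_lang, links=EQ_LINKS):
    """Find target-language synsets that share at least one ILI record."""
    records = links.get((lang, synset), set())
    return sorted(
        other
        for (other_lang, other), other_records in links.items()
        if other_lang == target_lang and other_records & records
    )
```

No wordnet stores a direct link to another wordnet: two synsets match only because both point to the same index record, so adding a language requires links to the index alone.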

Page 36: Piek Vossen VU University Amsterdam

Equivalent Near Synonym

1. Multiple Targets (1:many)
Dutch wordnet: schoonmaken (to clean) matches with 4 senses of clean in WordNet1.5:
• make clean by removing dirt, filth, or unwanted substances from
• remove unwanted substances from, such as feathers or pits, as of chickens or fruit
• remove in making clean; "Clean the spots off the rug"
• remove unwanted substances from - (as in chemistry)

2. Multiple Sources (many:1)
Dutch wordnet: versiersel near_synonym versiering
ILI-Record: decoration

3. Multiple Targets and Sources (many:many)
Dutch wordnet: toestel near_synonym apparaat
ILI-records: machine; device; apparatus; tool

Page 37: Piek Vossen VU University Amsterdam

Equivalent Hyperonymy

Typically used for gaps in English WordNet:

• genuine, cultural gaps for things not known in English culture:

– Dutch: klunen, to walk on skates over land from one frozen water to the other

• pragmatic, in the sense that the concept is known but is not expressed by a single lexicalized form in English:

– Dutch: kunststof = artifact substance <=> artifact object

Page 38: Piek Vossen VU University Amsterdam

EuroWordNet statistics

Language  Synsets  No. of senses  Sens./syns.  Entries  Sens./entry  LIRels.  LIRels/syns  EQRels-ILI  EQRels/syn  Synsets without ILI
Dutch     44015    70201          1.59         56283    1.25         111639   2.54         53448       1.21        7203
Spanish   23370    50526          2.16         27933    1.81         55163    2.36         21236       0.91        0
Italian   40428    48499          1.20         32978    1.47         117068   2.90         71789       1.78        1561
French    22745    32809          1.44         18777    1.75         49494    2.18         22730       1.00        20
German    15132    20453          1.35         17098    1.20         34818    2.30         16347       1.08        0
Czech     12824    19949          1.56         12283    1.62         26259    2.05         12824       1.00        0
Estonian  7678     13839          1.80         10961    1.26         16318    2.13         9004        1.17        0
English   16361    40588          2.48         17320    2.34         42140    2.58         n.a.        n.a.        n.a.
WN15      94515    187602         1.98         126617   1.48         211375   2.24         n.a.        n.a.        n.a.

Page 39: Piek Vossen VU University Amsterdam

Wordnets as semantic structures

• Wordnets are unique language-specific structures:
– same organizational principles: synset structure and same set of semantic relations
– different lexicalizations
– differences in synonymy and homonymy:
  • "decoration" in English versus "versiersel/versiering" in Dutch
  • "bank" in English (money/river) versus "bank" in Dutch (money/furniture)
• BUT also different relations for similar synsets

Page 40: Piek Vossen VU University Amsterdam

Autonomous & Language-Specific

[Figure: two hierarchies under "object" shown side by side]

Wordnet1.5: object; natural object (an object occurring naturally); artifact, artefact (a man-made object); instrumentality; block; body; container; device; implement; tool; instrument; bag; spoon; box

Dutch Wordnet: voorwerp {object}; lepel {spoon}; werktuig {tool}; tas {bag}; bak {box}; blok {block}; lichaam {body}

Page 41: Piek Vossen VU University Amsterdam

Linguistic versus Artificial Ontologies

Artificial ontology:
• better control or performance, or a more compact and coherent structure
• introduce artificial levels for concepts which are not lexicalized in a language (e.g. instrumentality, hand tool)
• neglect levels which are lexicalized but not relevant for the purpose of the ontology (e.g. tableware, silverware, merchandise)

What properties can we infer for spoons?
spoon -> container; artifact; hand tool; object; made of metal or plastic; for eating, pouring or cooking

Page 42: Piek Vossen VU University Amsterdam

Linguistic versus Artificial Ontologies

Linguistic ontology:
• Exactly reflects the relations between all the lexicalized words and expressions in a language.
• Captures valuable information about the lexical capacity of languages: what is the available fund of words and expressions in a language.

What words can be used to name spoons?
spoon -> object, tableware, silverware, merchandise, cutlery

Page 43: Piek Vossen VU University Amsterdam

Wordnets versus ontologies

• Wordnets:
  • autonomous language-specific lexicalization patterns in a relational network
  • Usage: to predict substitution in text for information retrieval, text generation, machine translation, word-sense disambiguation
• Ontologies:
  • data structure with formally defined concepts
  • Usage: making semantic inferences

Page 44: Piek Vossen VU University Amsterdam

From EuroWordNet to Global WordNet

• EuroWordNet ended in 1999
• Global Wordnet Association was founded in 2000 to maintain the framework: http://www.globalwordnet.org
• Currently, wordnets exist for more than 50 languages, including:
– Arabic, Bantu, Basque, Chinese, Bulgarian, Estonian, Hebrew, Icelandic, Japanese, Kannada, Korean, Latvian, Nepali, Persian, Romanian, Sanskrit, Tamil, Thai, Turkish, Zulu...
• Many languages are genetically and typologically unrelated

Page 45: Piek Vossen VU University Amsterdam

Some downsides of the EuroWordNet model

• Construction is not done uniformly
• Coverage differs
• Not all wordnets can communicate with one another, i.e. they are linked to different versions of the English wordnet
• Proprietary rights restrict free access and usage
• A lot of semantics is duplicated
• Complex and obscure equivalence relations due to linguistic differences between English and other languages

Page 46: Piek Vossen VU University Amsterdam

Next step: Global WordNet Grid

[Figure: language-specific word hierarchies linked by numbered links (1, 2, 3) to a shared Inter-Lingual Ontology with the concepts Object, Device and TransportDevice]

Language   Words
English    vehicle; car, train
Czech      dopravní prostředník; auto, vlak
French     véhicule; voiture, train
Estonian   liiklusvahend; auto, killavoor
German     Fahrzeug; Auto, Zug
Spanish    vehículo; auto, tren
Italian    veicolo; auto, treno
Dutch      voertuig; auto, trein

Page 47: Piek Vossen VU University Amsterdam


GWNG: Main Features

• Construct separate wordnets for each Grid language

• Contributors from each language encode the same core set of concepts plus culture/language-specific ones

• Synsets (concepts) are mapped crosslinguistically via an ontology instead of just the English Wordnet

Page 48: Piek Vossen VU University Amsterdam


The Ontology: Main Features

• List of concepts is not just based on the lexicon of a particular language (unlike in EuroWordNet) but uses ontological observations

• Ontology contains only upper and mid-level concepts

• Concepts are related in a type hierarchy
• Concepts are defined with axioms

Page 49: Piek Vossen VU University Amsterdam

The Ontology: Main Features

• Minimal set of concepts (Reductionist view):
– to express equivalence across languages
– to support inferencing
• Ontology need not and cannot provide a concept for all concepts found in the Grid languages
– Lexicalization in a language is not sufficient to warrant inclusion in the ontology
– Lexicalization in all or many languages may be sufficient
• Ontological observations will be used to define the concepts in the ontology
• Ontological framework still must be powerful enough to encode all concepts that are lexically expressed in any of the Grid languages
• Additional lexicalized concepts are related to the ontology through complex relations

Page 50: Piek Vossen VU University Amsterdam

Ontological observations

• Identity criteria as used in OntoClean (Guarino & Welty 2002):
– rigidity: to what extent are properties true for entities in all worlds? You are always a human, but you can be a student for a short while.
– essence: what properties are essential for an entity? Shape is essential for a statue but not for the clay it is made of.
– unicity: what represents a whole and what entities are parts of these wholes? An ocean is a whole but the water it contains is not.

Page 51: Piek Vossen VU University Amsterdam

Type-role distinction

Current WordNet treatment, hyponyms of dog:
• lapdog:1 # toy dog:1, toy:4 # hunting dog:1 # working dog:1, etc.
• dalmatian:2, coach dog:1, carriage dog:1 # Leonberg:1 # Newfoundland:1 # poodle:1, poodle dog:1, etc.

(1) a husky is a kind of dog (type)
(2) a husky is a kind of working dog (role)

• What’s wrong? (2) is defeasible, (1) is not:
*This husky is not a dog
This husky is not a working dog

Page 52: Piek Vossen VU University Amsterdam

Ontology and lexicon

• Hierarchy of disjunct types:
Canine PoodleDog; NewfoundlandDog; GermanShepherdDog; Husky
• Lexicon:
– NAMES for TYPES:
{poodle}EN, {poedel}NL, {pudoru}JP
(instance x Poodle)
– LABELS for ROLES:
{watchdog}EN, {waakhond}NL, {banken}JP
((instance x Canine) and (role x GuardingProcess))
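The split between names for types and labels for roles can be sketched with a small lexicon whose entries point to ontology expressions. This is an illustrative sketch only: the tuple encoding of the expressions and the names `LEXICON`, `is_role_label` and `equivalents` are assumptions, not the Grid's actual format.

```python
# (word, language) -> ontology expression as a list of (predicate, variable, concept).
LEXICON = {
    ("poodle", "EN"): [("instance", "x", "Poodle")],
    ("poedel", "NL"): [("instance", "x", "Poodle")],
    ("watchdog", "EN"): [("instance", "x", "Canine"), ("role", "x", "GuardingProcess")],
    ("waakhond", "NL"): [("instance", "x", "Canine"), ("role", "x", "GuardingProcess")],
}


def is_role_label(expression):
    """A role label is defined by a role predicate, not by a type of its own."""
    return any(predicate == "role" for predicate, _, _ in expression)


def equivalents(word, lang, lexicon=LEXICON):
    """Words in other entries that carry exactly the same ontology expression."""
    expression = lexicon[(word, lang)]
    return sorted(
        key for key, other in lexicon.items()
        if other == expression and key != (word, lang)
    )
```

Here "watchdog" and "waakhond" are equivalent because they share one expression, while the ontology itself needs no separate Watchdog type.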

Page 53: Piek Vossen VU University Amsterdam

Ontology and lexicon

• Hierarchy of disjunct types:
River; Clay; etc…
• Lexicon:
– NAMES for TYPES:
{river}EN, {rivier, stroom}NL
(instance x River)
– LABELS for dependent concepts:
{rivierwater}NL (water from a river => water is not a unit)
((instance x water) and (instance y River) and (portion x y))
{kleibrok}NL (irregularly shaped piece of clay => non-essential)
((instance x Object) and (instance y Clay) and (portion x y) and (shape x Irregular))

Page 54: Piek Vossen VU University Amsterdam

KIF expression for gender marking

• {teacher}EN
((instance x Human) and (agent x TeachingProcess))
• {Lehrer}DE
((instance x Man) and (agent x TeachingProcess))
• {Lehrerin}DE
((instance x Woman) and (agent x TeachingProcess))

Page 55: Piek Vossen VU University Amsterdam

KIF expression for perspective

sell: subj(x), direct obj(z), indirect obj(y)
versus
buy: subj(y), direct obj(z), indirect obj(x)

(and (instance x Human) (instance y Human) (instance z Entity) (instance e FinancialTransaction) (source x e) (destination y e) (patient e))

The same process, but a different perspective through subject and object realization: Russian has two verbs for marry, and French apprendre can mean both teach and learn.

Page 56: Piek Vossen VU University Amsterdam

Advantages of the Global Wordnet Grid

• Shared and uniform world knowledge:
– universal inferencing
– uniform text analysis and interpretation
• More compact and less redundant databases
• Clearer notion of how languages map to the knowledge
– better criteria for expressing knowledge
– better criteria for understanding variation

Page 57: Piek Vossen VU University Amsterdam

CORNETTO (STEVIN TENDER)

Combinatorial and Relational Network as Toolkit for Dutch Language Technology
http://www2.let.vu.nl/oz/cornetto

Page 58: Piek Vossen VU University Amsterdam

Goals of the Cornetto project

• Goal: to develop a lexical semantic database for Dutch:
– 40K Entries: generic and central part of the language
– Rich horizontal and vertical semantic relations
– Combinatoric information
– Ontological information
• Method: merge data from Dutch Wordnet (DWN) and Referentie bestand Nederlands (RBN)
• April 2006-March 2008, extended to July 2008
• The data of the final results of the Cornetto project are available through the TST-centrale of the Nederlandse Taalunie (free for research).

Page 59: Piek Vossen VU University Amsterdam

Database

• Collections:
▪ Lexical Units (LU): mainly derived from the RBN
▪ Synsets (SY): mainly derived from DWN
▪ Terms (TE) and axioms: mainly derived from SUMO and MILO
▪ Domains (DM): based on Wordnet domains
• Mappings:
▪ LU <-> SY
▪ SY <-> SY (within Dutch and from Dutch to English)
▪ SY <-> TE
▪ SY <-> DM

Page 60: Piek Vossen VU University Amsterdam

Data Organization

[Figure: Cornetto data organization]
• Lexical Unit (LU): corresponds to a word-meaning pair; holds form, morphology, syntax, semantics, pragmatics, usage examples
• Synset: synonyms; models meaning relations (internal relations)
• Collection of Terms and Axioms: SUMO, MILO
• Linked resources: Princeton Wordnet, Wordnet Domains, Spanish Wordnet, Czech Wordnet, German Wordnet, French Wordnet, Korean Wordnet, Arabic Wordnet

Page 61: Piek Vossen VU University Amsterdam

Database

• Implemented in DebVisDic:
– http://deb.fi.muni.cz/index.php
• Demo version available:
– http://www2.let.vu.nl/oz/cornetto/demo.html


Page 64: Piek Vossen VU University Amsterdam

Overview of results

                      ALL       NOUNS    VERBS    ADJ      ADV     OTHERS
Synsets               70,371    52,847   9,017    7,689    220     598
Lexical Units         119,108   85,449   17,314   15,712   475     158
Lemmas (form+pos)     92,686    70,315   9,051    12,288   1,032   n.a.
Synonyms in synsets   103,762   75,476   14,138   12,914   408     826
CID records           104,556   76,537   14,214   13,132   483     190
Synonym per synset    1.47      1.43     1.57     1.68     1.85    1.38
Senses per lemma      1.29      1.22     1.91     1.28     0.46    n.a.

Page 65: Piek Vossen VU University Amsterdam

Mapping relations

No status value       55976    53.54%
Status value          48580    46.46%
  manual              10108    9.67%
  B-95                4944     4.73%
  BM-90               4215     4.03%
  D-55 adjectives     171      0.16%
  D-58 verbs          774      0.74%
  D-75 nouns          2085     1.99%
  M-97                25236    24.14%
  RESUME-75           1047     1.00%
TOTAL                 104556

DWN and RBN matches   35,289   37.74%
LUs only in DWN       54,983   58.81%
LUs only in RBN       3,223    3.45%
Total                 93,495

Page 66: Piek Vossen VU University Amsterdam

Overview of synset data

Synsets                    70371
Synonyms                   103762
InternalRelations          153370
EquivalenceRelations       86830
Definitions                35620
WordNet Domains mappings   93822
Sumo mappings              70654
Base Level Concepts        8828

Page 67: Piek Vossen VU University Amsterdam

English Wordnet to SUMO mapping through two-place relations

• = the synset is equivalent to the SUMO concept, circle (= Circle)

• + the synset is subsumed by the SUMO concept, branch (+ PlantBranch)

• @ the synset is an instance of the SUMO concept, Amsterdam (@ City)

Page 68: Piek Vossen VU University Amsterdam

Cornetto SUMO Mappings through triplets

• Equality:
– cirkel: (=, 0, Circle) or (=, , Circle)
• Subsumption:
– tak: (+, 0, PlantBranch) or (+, , PlantBranch)
• Related:
– blad: (part, 0, PlantBranch) or (part, , PlantBranch)
• Axiomatized:
– theewater:
(instance, 0, Water) (instance, 1, Making) (instance, 2, Tea) (resource, 0, 1) (result, 2, 1)
OR
(instance, , Water) (instance, 1, Making) (instance, 2, Tea) (resource, , 1) (result, 2, 1)
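The triplets above map naturally onto tuples. The sketch below assumes that variable 0 (or an empty slot) refers to the synset itself and that other integers are extra variables introduced by the axiom; this reading, and the names `MAPPINGS`, `sumo_types` and `is_axiomatized`, are illustrative assumptions and not the Cornetto database format.

```python
# Dutch synset -> list of (relation, variable, target) triplets from the slide.
# Assumption: variable 0 stands for the synset itself.
MAPPINGS = {
    "cirkel": [("=", 0, "Circle")],
    "tak": [("+", 0, "PlantBranch")],
    "blad": [("part", 0, "PlantBranch")],
    "theewater": [
        ("instance", 0, "Water"),
        ("instance", 1, "Making"),
        ("instance", 2, "Tea"),
        ("resource", 0, 1),
        ("result", 2, 1),
    ],
}


def sumo_types(word, mappings=MAPPINGS):
    """SUMO terms that directly type the synset itself (variable 0)."""
    return [
        target
        for relation, variable, target in mappings[word]
        if variable == 0 and relation in ("=", "+", "instance")
    ]


def is_axiomatized(word, mappings=MAPPINGS):
    """An axiomatized mapping needs more than one triplet to define the synset."""
    return len(mappings[word]) > 1
```

Under this reading theewater (tea water) is typed as Water and further defined as the resource of a Making process whose result is Tea, while blad (leaf) has no direct type and is only related to PlantBranch as a part.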

Page 69: Piek Vossen VU University Amsterdam

Ontology mapping: female/male variants

teacher (a person whose occupation is teaching)
SUMO: equivalent to Teacher

In Dutch: no neutral form
leraar (male teacher): (+, , Teacher), (+, , Man)
lerares (female teacher): (+, , Teacher), (+, , Woman)

Page 70: Piek Vossen VU University Amsterdam

KYOTO (ICT-211423)
Yielding Ontologies for Transition-Based Organization
FP7: Intelligent Content and Semantics

http://www.kyoto-project.eu/

Page 71: Piek Vossen VU University Amsterdam

KYOTO (ICT-211423) Overview

• Title: Yielding Ontologies for Transition-Based Organization
• Funded:
– 7th Framework Program-ICT of the European Union: Intelligent Content and Semantics
– Taiwan and Japan funded by national grants
• Goal:
– Platform for knowledge sharing across languages and cultures
– Enables knowledge transition and information search across different target groups, transgressing linguistic, cultural and geographic boundaries
– Open text mining and deep semantic search
– Wiki environment that allows people in the field to maintain their knowledge and agree on meaning without knowledge engineering skills
• URL: http://www.kyoto-project.eu/
• Duration:
– March 2008 – March 2011
• Effort:
– 364 person months of work

Page 72: Piek Vossen VU University Amsterdam

KYOTO cycle

frog; endemic frogs; common frog; poison frog; Golden poison frog; gopher frog; Dusky gopher frog; forest frog

Garden ponds are havens for wildlife. They provide food and shelter for frogs, newts and aquatic insects, including damselflies and dragonflies,

(garden pond, haven, wildlife)
(garden pond, has_food, frog)
(garden pond, has_food, newt)
(garden pond, has_food, aquatic insect)
(garden pond, is_shelter, frog)
(garden pond, is_shelter, newt)
(garden pond, is_shelter, aquatic insect)

Page 73: Piek Vossen VU University Amsterdam

[Figure: the KYOTO cycle, numbered 1 to 6, over distributed, diverse & dynamic data]

Steps in the figure, in order:
1. Capture text: "Sudden increase of CO2 emissions in 2008 in Europe"
2. CO2 emission
3. Wikyoto: maintain terms & concepts
4. Index facts: Process: Increase; Involves: CO2 emission; When: 2008; Where: Europe
5. Text & Fact Index
6. Semantic Search

Components: Tybot (term yielding robot); Kybot (knowledge yielding robot); Wordnets; Ontology
Ontology levels: Top, Middle, Domain
Ontology labels: Abstract, Process, Physical, Substance, H2O, CO2, CO2 Emission, H2O Pollution, Greenhouse Gas
Users: Environmental organizations, Citizens, Governments, Companies

Page 74: Piek Vossen VU University Amsterdam

Kyoto main application

• Wikyoto (Wiki platform)
– Connects people with shared interest as a community
– Upload documents and sources
– View and edit terms and concepts learned from these documents
– Combines concepts with other taxonomies
– Discuss and agree with others in the community, different languages, regions and cultures

Page 75: Piek Vossen VU University Amsterdam

Kyoto main application

• Tybots
– Learns terms and concepts from document collection
– Organizes terms as a hierarchy
– Connects terms to other hierarchies
– Defines:
  • definitions
  • relations to other terms
  • properties and criteria for terms

Page 76: Piek Vossen VU University Amsterdam

Kyoto main application

• Kybot:
– Detects facts of interest in text and combines these in a comprehensive overview
– Uses knowledge represented for terms to detect facts in any document, regardless of language
– Allows you to specify any collection of types of knowledge of your interest

Page 77: Piek Vossen VU University Amsterdam

Kyoto databases

• Database of users that forms the community
• Database of sources and documents provided by the users
• Database of terms, presented as a domain wordnet in each language
• Database of concepts (so-called ontology) that connects the terms of the different languages
• Databases of facts derived from various document and source collections provided by the user

Page 78: Piek Vossen VU University Amsterdam

Thank you for your attention