Post on 27-Mar-2015
The Global Wordnet Grid: anchoring languages to universal meaning
Piek Vossen
Irion Technologies/Free University of Amsterdam
and
Christiane Fellbaum
Princeton University
Overview
• Wordnet, EuroWordNet background
• Architecture of the Global Wordnet Grid
• Mapping wordnets to the Grid
• Kyoto: an implementation of the Grid
WordNet1.5
• Developed at Princeton by George Miller and his team as a model of the mental lexicon.
• Semantic network in which concepts are defined in terms of relations to other concepts.
• Structure:
– organized around the notion of synsets (sets of synonymous words)
– basic semantic relations between these synsets
http://www.cogsci.princeton.edu/~wn/w3wn.html
Structure of WordNet
[Figure: fragment of the WordNet network. Synsets shown: {conveyance; transport}, {vehicle}, {motor vehicle; automotive vehicle}, {car; auto; automobile; machine; motorcar}, {cruiser; squad car; patrol car; police car; prowl car}, {cab; taxi; hack; taxicab}, {bumper}, {car door}, {car window}, {car mirror}, {hinge; flexible joint}, {doorlock}, {armrest}. Edge labels: hyperonym, meronym.]
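The synset structure on this slide can be sketched as a small labelled graph. This is a minimal Python illustration, not WordNet's own data format: the `SynsetGraph` class and its method names are hypothetical, and the edges are an assumed reading of the figure, since the flattened transcript does not preserve which nodes each link connects.

```python
class SynsetGraph:
    """Toy semantic network: nodes are synsets, edges are labelled relations."""

    def __init__(self):
        self.edges = {}  # (source synset, relation) -> list of target synsets

    def add(self, source, relation, target):
        self.edges.setdefault((source, relation), []).append(target)

    def targets(self, source, relation):
        return self.edges.get((source, relation), [])

    def hyperonym_chain(self, synset):
        """Follow the first hyperonym link upward until none is left."""
        chain = []
        current = synset
        while self.targets(current, "hyperonym"):
            current = self.targets(current, "hyperonym")[0]
            if current in chain or current == synset:  # guard against cycles
                break
            chain.append(current)
        return chain

    def inherited_meronyms(self, synset):
        """Parts of the synset itself plus parts of its hyperonyms."""
        parts = []
        for node in [synset] + self.hyperonym_chain(synset):
            parts.extend(self.targets(node, "meronym"))
        return parts


# Synsets are tuples of synonymous words, taken from the slide.
CAR = ("car", "auto", "automobile", "machine", "motorcar")
MOTOR_VEHICLE = ("motor vehicle", "automotive vehicle")
VEHICLE = ("vehicle",)
CONVEYANCE = ("conveyance", "transport")
POLICE_CAR = ("cruiser", "squad car", "patrol car", "police car", "prowl car")
CAR_DOOR = ("car door",)
BUMPER = ("bumper",)

# Assumed edges (see the caveat above).
g = SynsetGraph()
g.add(POLICE_CAR, "hyperonym", CAR)
g.add(CAR, "hyperonym", MOTOR_VEHICLE)
g.add(MOTOR_VEHICLE, "hyperonym", VEHICLE)
g.add(VEHICLE, "hyperonym", CONVEYANCE)
g.add(CAR, "meronym", CAR_DOOR)
g.add(CAR, "meronym", BUMPER)
g.add(CAR_DOOR, "meronym", ("doorlock",))
```

With these edges, `g.hyperonym_chain(POLICE_CAR)` walks up through car, motor vehicle, vehicle and conveyance, and `g.inherited_meronyms(POLICE_CAR)` returns the parts a police car inherits from car.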
EuroWordNet
• The development of a multilingual database with wordnets for several European languages
• Funded by the European Commission, DG XIII, Luxembourg as projects LE2-4003 and LE4-8328
• March 1996 - September 1999
• 2.5 Million EURO.
• http://www.hum.uva.nl/~ewn
• http://www.illc.uva.nl/EuroWordNet/finalresults-ewn.html
[Figure: language-specific wordnets linked through a shared Inter-Lingual-Index (ILI). ILI records such as Vehicle, Car and Train are connected to Domains (Transport: Road, Air, Water) and to the Top Ontology (Object, Device, TransportDevice). Words shown per wordnet:
English: vehicle; car, train
Czech: dopravní prostředník; auto, vlak
French: véhicule; voiture, train
Estonian: liiklusvahend; auto, killavoor
German: Fahrzeug; Auto, Zug
Spanish: vehículo; auto, tren
Italian: veicolo; auto, treno
Dutch: voertuig; auto, trein]
EuroWordnet architecture
EuroWordNet
• Wordnets are unique language-specific structures:
– different lexicalizations
– differences in synonymy and homonymy
– different relations between synsets
– same organizational principles: synset structure and same set of semantic relations.
• Language-independent knowledge is assigned to the ILI and can thus be shared by all languages linked to the ILI: both an ontology and a domain hierarchy
Autonomous & Language-Specific
[Figure: the same objects in two hierarchies.
Dutch Wordnet: voorwerp {object}, with lepel {spoon}, werktuig {tool}, tas {bag}, bak {box}, blok {block}, lichaam {body}.
Wordnet1.5: object; natural object (an object occurring naturally); artifact, artefact (a man-made object); instrumentality, block, body; container, device, implement; tool, instrument; bag, spoon, box.]
Linguistic versus Artificial Ontologies
Artificial ontology:
• better control or performance, or a more compact and coherent structure.
• introduce artificial levels for concepts which are not lexicalized in a language (e.g. instrumentality, hand tool),
• neglect levels which are lexicalized but not relevant for the purpose of the ontology (e.g. tableware, silverware, merchandise).
What properties can we infer for spoons?
spoon -> container; artifact; hand tool; object; made of metal or plastic; for eating, pouring or cooking
Linguistic versus Artificial Ontologies
Linguistic ontology:
• Exactly reflects the relations between all the lexicalized words and expressions in a language.
• Captures valuable information about the lexical capacity of languages: what is the available fund of words and expressions in a language.
What words can be used to name spoons?
spoon -> object, tableware, silverware, merchandise, cutlery
Wordnets versus ontologies
• Wordnets:
– autonomous language-specific lexicalization patterns in a relational network.
– Usage: to predict substitution in text for information retrieval, text generation, machine translation, word-sense-disambiguation.
• Ontologies:
– data structure with formally defined concepts.
– Usage: making semantic inferences.
The Multilingual Design
• Inter-Lingual-Index: unstructured fund of concepts to provide an efficient mapping across the languages;
• Index-records are mainly based on WordNet synsets and consist of synonyms, glosses and source references;
• Various types of complex equivalence relations are distinguished;
• Equivalence relations from synsets to index records: not on a word-to-word basis;
• Indirect matching of synsets linked to the same index items;
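Indirect matching through the index can be sketched in a few lines of Python. This is an illustration under stated assumptions: the record identifiers (`ili-vehicle`, `ili-train`), the `EQ_LINKS` table and the function names are hypothetical, and real EuroWordNet equivalence links also carry a relation type, which is omitted here.

```python
from collections import defaultdict

# Hypothetical equivalence links: (language, synset) -> ILI record ids.
EQ_LINKS = {
    ("nl", "voertuig"): ["ili-vehicle"],
    ("nl", "trein"): ["ili-train"],
    ("es", "vehículo"): ["ili-vehicle"],
    ("es", "tren"): ["ili-train"],
    ("it", "veicolo"): ["ili-vehicle"],
}


def build_index(links):
    """Invert the links: ILI record -> set of (language, synset) pairs."""
    index = defaultdict(set)
    for (lang, synset), records in links.items():
        for record in records:
            index[record].add((lang, synset))
    return index


def match(lang, synset, target_lang, links=EQ_LINKS):
    """Synsets of target_lang that share at least one ILI record with the source synset."""
    index = build_index(links)
    found = set()
    for record in links.get((lang, synset), []):
        for other_lang, other_synset in index[record]:
            if other_lang == target_lang:
                found.add(other_synset)
    return sorted(found)
```

No wordnet refers to another wordnet directly: `match("nl", "voertuig", "es")` finds `vehículo` only because both synsets point at the same index record.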
Equivalent Near Synonym
1. Multiple Targets (1:many)
Dutch wordnet: schoonmaken (to clean) matches with 4 senses of clean in WordNet1.5:
• make clean by removing dirt, filth, or unwanted substances from
• remove unwanted substances from, such as feathers or pits, as of chickens or fruit
• remove in making clean; "Clean the spots off the rug"
• remove unwanted substances from - (as in chemistry)
2. Multiple Sources (many:1)
Dutch wordnet: versiersel near_synonym versiering
ILI-Record: decoration.
3. Multiple Targets and Sources (many:many)
Dutch wordnet: toestel near_synonym apparaat
ILI-records: machine; device; apparatus; tool
Equivalent Hyperonymy
Typically used for gaps in English WordNet:
• genuine, cultural gaps for things not known in English culture:
– Dutch: klunen, to walk on skates over land from one frozen body of water to the other
– Dutch: citroenjenever, which is a kind of gin made out of lemon skin
• pragmatic, in the sense that the concept is known but is not expressed by a single lexicalized form in English:
– Dutch: kunstproduct = artifact substance <=> artifact object
– Dutch: hoofd = human head and Dutch: kop = animal head; English uses head for both.
From EuroWordNet to Global WordNet
• Currently, wordnets exist for more than 40 languages, including:
• Arabic, Bantu, Basque, Chinese, Bulgarian, Estonian, Hebrew, Icelandic, Japanese, Kannada, Korean, Latvian, Nepali, Persian, Romanian, Sanskrit, Tamil, Thai, Turkish, Zulu...
• Many languages are genetically and typologically unrelated
• http://www.globalwordnet.org
Some downsides
• Construction is not done uniformly
• Coverage differs
• Not all wordnets can communicate with one another
• Proprietary rights restrict free access and usage
• A lot of semantics is duplicated
• Complex and obscure equivalence relations due to linguistic differences between English and other languages
[Figure: the language-specific wordnets (English, Czech, French, Estonian, German, Spanish, Italian, Dutch, each with its words for vehicle, car and train) linked to an Inter-Lingual Ontology with the types Object, Device and TransportDevice.]
Next step: Global WordNet Grid
GWNG: Main Features
• Construct separate wordnets for each Grid language
• Contributors from each language encode the same core set of concepts plus culture/language-specific ones
• Synsets (concepts) can be mapped crosslinguistically via an ontology
• No license constraints, freely available
The Ontology: Main Features
• Formal, artificial ontology serves as universal index of concepts
• List of concepts is not just based on the lexicon of a particular language (unlike in EuroWordNet) but uses ontological observations
• Concepts are related in a type hierarchy
• Concepts are defined with axioms
The Ontology: Main Features
• In addition to high-level (“primitive”) concepts, the ontology needs to express low-level concepts lexicalized in the Grid languages
• Additional concepts can be defined with expressions in Knowledge Interchange Format (KIF), based on first-order predicate calculus and atomic elements
The Ontology: Main Features
• Minimal set of concepts (Reductionist view):
– to express equivalence across languages
– to support inferencing
• Ontology must be powerful enough to encode all concepts that are lexically expressed in any of the Grid languages
The Ontology: Main Features
• Ontology need not and cannot provide a linguistic encoding for all concepts found in the Grid languages
– Lexicalization in a language is not sufficient to warrant inclusion in the ontology
– Lexicalization in all or many languages may be sufficient
• Ontological observations will be used to define the concepts in the ontology
Ontological observations
• Identity criteria as used in OntoClean (Guarino & Welty 2002):
– rigidity: to what extent are properties true for entities in all worlds? You are always a human, but you can be a student for a short while.
– essence: what properties are essential for an entity? Shape is essential for a statue but not for the clay it is made of.
– unicity: what represents a whole and what entities are parts of these wholes? An ocean is a whole but the water it contains is not.
Type-role distinction
• Current WordNet treatment:
(1) a husky is a kind of dog (type)
(2) a husky is a kind of working dog (role)
• What’s wrong? (2) is defeasible, (1) is not:
*This husky is not a dog
This husky is not a working dog
• Other roles: watchdog, sheepdog, herding dog, lapdog, etc.
Ontology and lexicon
• Hierarchy of disjunct types:
Canine: PoodleDog; NewfoundlandDog; GermanShepherdDog; Husky
• Lexicon:
– NAMES for TYPES:
{poodle}EN, {poedel}NL, {pudoru}JP
(instance x Poodle)
– LABELS for ROLES:
{watchdog}EN, {waakhond}NL, {banken}JP
((instance x Canine) and (role x GuardingProcess))
Ontology and lexicon
• Hierarchy of disjunct types:
River; Clay; etc.
• Lexicon:
– NAMES for TYPES:
{river}EN, {rivier, stroom}NL
(instance x River)
– LABELS for dependent concepts:
{rivierwater}NL (water from a river => water is not a Unit)
((instance x water) and (instance y River) and (portion x y))
{kleibrok}NL (irregularly shaped piece of clay => Non-essential)
((instance x Object) and (instance y Clay) and (portion x y) and (shape x Irregular))
Rigidity
• The “primitive” concepts represented in the ontology are rigid types
• Entities with non-rigid properties will be represented with KIF statements
• But: ontology may include some universal, core concepts referring to roles like father, mother
Properties of the Ontology
• Minimal: terms are distinguished by essential properties only
• Comprehensive: includes all distinct concept types of all Grid languages
• Allows definitions via KIF of all lexemes that express non-rigid, non-essential properties of types
• Logically valid, allows inferencing
Mapping Grid Languages onto the Ontology
• Explicit and precise equivalence relations among synsets in different languages, which is somewhat easier because:
– the type hierarchy is minimal
– subtle differences can be encoded in KIF expressions
• The Grid database contains wordnets with synsets that label
– either “primitive” types in the hierarchies,
– or words relating to these types in ways made explicit in KIF expressions
• If two languages create the same KIF expression, this is a statement of equivalence!
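The last point, equivalence by identical KIF expression, can be sketched as follows. This is a minimal Python illustration, not a KIF parser: a conjunction is modelled as an unordered set of (predicate, variable, type) triples, the `LEXICON` table and function names are hypothetical, and it assumes both wordnets already use the same variable names (a real implementation would have to normalize them). The entries themselves are taken from the slides.

```python
def conj(*literals):
    """A conjunction of literals; order is irrelevant, so use a frozenset."""
    return frozenset(literals)


# (language, word) -> expression over ontology types.
LEXICON = {
    ("en", "poodle"): conj(("instance", "x", "Poodle")),
    ("nl", "poedel"): conj(("instance", "x", "Poodle")),
    ("en", "watchdog"): conj(("instance", "x", "Canine"),
                             ("role", "x", "GuardingProcess")),
    # Same literals written in a different order: still the same expression.
    ("nl", "waakhond"): conj(("role", "x", "GuardingProcess"),
                             ("instance", "x", "Canine")),
    ("jp", "banken"): conj(("instance", "x", "Canine"),
                           ("role", "x", "GuardingProcess")),
    ("nl", "straathond"): conj(("instance", "x", "Canine"),
                               ("habitat", "x", "Street")),
}


def equivalents(lang, word, lexicon=LEXICON):
    """All other entries whose expression is identical to that of (lang, word)."""
    expr = lexicon[(lang, word)]
    return sorted(key for key, other in lexicon.items()
                  if other == expr and key != (lang, word))
```

Here `equivalents("en", "watchdog")` yields the Japanese and Dutch entries, while `straathond` has no equivalent because no other language in this toy lexicon created the same expression.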
How to construct the GWNG
• Take an existing ontology as starting point;
• Use English WordNet to maximize the number of disjunct types in the ontology;
• Link English WordNet synsets as names to the disjunct types;
• Provide KIF expressions for all other English words and synsets
How to construct the GWNG
• Copy the relations from the English WordNet to the ontology to the other languages, including the KIF statements built for English
• Revise the KIF statements to make the mapping more precise
• Map all words and synsets that are not, and cannot be, mapped to English WordNet onto the ontology:
– propose extensions to the type hierarchy
– create KIF expressions for all non-rigid concepts
Initial Ontology: SUMO (Niles and Pease)
SUMO = Suggested Upper Merged Ontology
--consistent with good ontological practice
--fully mapped to WordNet(s): 1000 equivalence mappings, the rest through subsumption
--freely and publicly available
--allows data interoperability
--allows NLP
--allows reasoning/inferencing
Mapping Grid languages onto the Ontology
• Check existing SUMO mappings to Princeton WordNet -> extend the ontology with rigid types for specific concepts
• Extend it to many other WordNet synsets
• Observe OntoClean principles! (Synsets referring to non-rigid, non-essential, non-unicitous concepts must be expressed in KIF)
Lexicalizations not mapped to WordNet
• Not added to the type hierarchy:
{straathond}NL (a dog that lives in the streets)
((instance x Canine) and (habitat x Street))
• Added to the type hierarchy:
{klunen}NL (to walk on skates from one frozen body of water to the next over land)
KluunProcess => WalkProcess
Axioms:
(and (instance x Human) (instance y Walk) (instance z Skates) (wear x z) (instance s1 Skate) (instance s2 Skate) (before s1 y) (before y s2) etc…
• National dishes, customs, games, ...
Most mismatching concepts are not new types
• Refer to sets of types in specific circumstances, or to concepts that are dependent on these types. Next to {rivierwater}NL there are many others:
{theewater}NL (water used for making tea)
{koffiewater}NL (water used for making coffee)
{bluswater}NL (water used for extinguishing fire)
• Relate to linguistic phenomena:
– gender, perspective, aspect, diminutives, politeness, pejoratives, part-of-speech constraints
KIF expression for gender marking
• {teacher}EN ((instance x Human) and (agent x TeachingProcess))
• {Lehrer}DE ((instance x Man) and (agent x TeachingProcess))
• {Lehrerin}DE ((instance x Woman) and (agent x TeachingProcess))
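How such gender-marked entries relate to the gender-neutral one can be sketched with a subsumption check over the type hierarchy. This is a hedged Python illustration: the `SUBCLASS` fragment (Man and Woman below Human), the dictionary encoding and the function names are assumptions made for the example, not part of the ontology described in the slides.

```python
# Hypothetical fragment of the type hierarchy: child type -> parent type.
SUBCLASS = {"Man": "Human", "Woman": "Human"}

# The three lexicon entries from the slide, as role -> type.
LEXICON = {
    "teacher": {"instance": "Human", "agent": "TeachingProcess"},
    "Lehrer": {"instance": "Man", "agent": "TeachingProcess"},
    "Lehrerin": {"instance": "Woman", "agent": "TeachingProcess"},
}


def is_a(cls, ancestor):
    """True if cls equals ancestor or lies below it in the hierarchy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS.get(cls)
    return False


def subsumes(general, specific):
    """True if every referent of `specific` is also a referent of `general`."""
    g, s = LEXICON[general], LEXICON[specific]
    return g["agent"] == s["agent"] and is_a(s["instance"], g["instance"])
```

Under these assumptions the English word covers both German words, but neither German word covers the English one or the other German one, which makes the difference between the languages explicit.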
KIF expression for perspective
sell: subj(x), direct obj(z), indirect obj(y)
versus
buy: subj(y), direct obj(z), indirect obj(x)
(and (instance x Human) (instance y Human) (instance z Entity) (instance e FinancialTransaction) (source x e) (destination y e) (patient z e))
The same process seen from a different perspective, through subject and object realization. Other examples: Russian has two verbs for marry; French apprendre can mean both teach and learn.
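The perspective idea can be sketched as one event with two linkings from grammatical functions to event roles. This is a minimal Python illustration: the participants (`Anna`, `Ben`, `bike`) are invented, the `PERSPECTIVE` table is a hypothetical encoding of the subj/obj assignments on the slide, and treating z as the patient of the transaction is an assumption.

```python
# One ontology-level event; role names follow the KIF sketch on the slide.
event = {
    "type": "FinancialTransaction",
    "source": "Anna",       # x
    "destination": "Ben",   # y
    "patient": "bike",      # z (assumed to be the patient)
}

# Hypothetical per-verb linking of grammatical functions to event roles.
PERSPECTIVE = {
    "sell": {"subj": "source", "direct_obj": "patient", "indirect_obj": "destination"},
    "buy": {"subj": "destination", "direct_obj": "patient", "indirect_obj": "source"},
}


def realize(verb, event):
    """Fill the grammatical functions of `verb` from the roles of the event."""
    linking = PERSPECTIVE[verb]
    return {function: event[role] for function, role in linking.items()}
```

`realize("sell", event)` and `realize("buy", event)` describe the same transaction; only the subject and the indirect object swap, so the ontology needs a single process for both verbs.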
Part-of-speech mismatches
• {bankdrukken-V}NL vs.{bench press-N}EN
• {gehuil-N}NL vs. {cry-V}EN
• {afsluiting-N}NL vs. {close-V}EN
• Process in the ontology is neutral with respect to POS!
Parallel Noun and Verb hierarchy
Nouns:
event
  act
    deed
      sale
      promise
  change
    movement
      change of location
Verbs:
to happen
  to act
    to do
      to sell
      to promise
  to change
    to move
      to move position
Encoded once as a Process in the ontology!
Mixed Noun and Adjective hierarchy
• Colour: red, blue, green, etc.
• Height: high, low
• Size: big, small
• Emotion: sad, angry, happy, anxious
• etc.
Encoded once as attributes in the ontology!
Aspectual variants
• Slavic languages: two members of a verb pair for an ongoing event and a completed event.
• English: can mark perfectivity with particles, as in the phrasal verbs eat up and read through.
• Romance languages: mark aspect by verb conjugations on the same verb.
• Dutch: verbs with marked aspect can be created by prefixing a verb with door: doorademen, dooreten, doorfietsen, doorlezen, doorpraten (continue to breathe/eat/bike/read/talk).
• These verbs are restrictions on phases of the same process
• This does NOT warrant the extension of the ontology with separate processes for each aspectual variant
Aspectual lexicalization
• Regular compositional verb structures:
doorademen: (lit. through+breathe, continue to breathe)
doorbetalen: (lit. through+pay, continue to pay)
doorlopen: (lit. through+walk, continue to walk)
doorfietsen: (lit. through+cycle, continue to cycle)
doorrijden: (lit. through+drive, continue to drive)
(and (instance x BreathProcess) (instance y Time) (instance z Time) (end x z) (expected (end x y)) (after z y))
Lexicalization of Resultatives
• MORE GENERAL VERBS:
openmaken: (lit. open+make, to cause to be open)
dichtmaken: (lit. closed+make, to cause to be closed)
• MORE SPECIFIC VERBS:
openknijpen: (lit. open+squeeze, to open by squeezing)
has_hyperonym knijpen (to squeeze) & openmaken (to open)
opendraaien: (lit. open+turn, to open by turning)
has_hyperonym draaien (to turn) & openmaken (to open)
dichtknijpen: (lit. closed+squeeze, to close by squeezing)
has_hyperonym knijpen (to squeeze) & dichtmaken (to close)
dichtdraaien: (lit. closed+turn, to close by turning)
has_hyperonym draaien (to turn) & dichtmaken (to close)
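Because these resultative verbs are compositional, their two hyperonyms can be derived from the parts. The following Python sketch shows the idea under stated assumptions: the `RESULT` and `MANNER` tables and the `hyperonyms` function are hypothetical, and plain string splitting stands in for real morphological analysis.

```python
# Result prefix -> the more general resultative verb (from the slide).
RESULT = {"open": "openmaken", "dicht": "dichtmaken"}

# Manner verbs that can combine with a result prefix (from the slide).
MANNER = {"knijpen": "to squeeze", "draaien": "to turn"}


def hyperonyms(verb):
    """Return [manner verb, general resultative verb] for a compound, else []."""
    for prefix, general in RESULT.items():
        base = verb[len(prefix):]
        if verb.startswith(prefix) and base in MANNER:
            return [base, general]
    return []
```

For example, `hyperonyms("openknijpen")` gives `knijpen` and `openmaken`, matching the two has_hyperonym links listed for that verb; the general verbs themselves return an empty list because `maken` is not a manner verb here.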
Kinship relations in Arabic
• عَمّ (Eam~) father's brother, paternal uncle.
• خَال (xaAl) mother's brother, maternal uncle.
• عَمَّة (Eam~ap) father's sister, paternal aunt.
• خَالَة (xaAlap) mother's sister, maternal aunt.
Kinship relations in Arabic
• .........
• شَقِيقَة ($aqiyqap) full sister, sister on the paternal and maternal side (as distinct from أُخْت (<uxot): 'sister', which may refer to a sister from the paternal or maternal side, or both sides).
• ثَكْلَان (vakolAna) father bereaved of a child (as opposed to يَتِيم (yatiym), or يَتِيمَة (yatiymap) for the feminine: 'orphan', a person whose father or mother died, or both father and mother died).
• ثَكْلَى (vakolaYa) mother bereaved of a child (as opposed to يَتِيم (yatiym), or يَتِيمَة (yatiymap) for the feminine: 'orphan', a person whose father or mother died, or both father and mother died).
Complex Kinship concepts
father's brother, paternal uncle
WORDNET
paternal uncle => uncle => brother of ....????
ONTOLOGY
(=> (paternalUncle ?P ?UNC) (exists (?F) (and (father ?P ?F) (brother ?F ?UNC))))
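The ontology-side condition can be evaluated over a small fact base. This Python sketch rests on several assumptions: the names and facts are invented, `(father ?P ?F)` is read as "?F is the father of ?P" and `(brother ?F ?UNC)` as "?UNC is a brother of ?F", and the condition is used as a definition in both directions, whereas the slide states only the implication from paternalUncle.

```python
# Hypothetical facts: (person, father) and (person, brother) pairs.
FATHER = {("Layla", "Omar")}
BROTHER = {("Omar", "Hassan"), ("Omar", "Karim")}


def paternal_uncles(person, father=FATHER, brother=BROTHER):
    """All UNC such that some F has (father person F) and (brother F UNC)."""
    fathers = {f for (p, f) in father if p == person}
    return sorted(u for (f, u) in brother if f in fathers)
```

The existential variable ?F never appears in the result: `paternal_uncles("Layla")` returns the uncles, with the father serving only as the link, which is exactly what a plain hyperonym chain in a wordnet cannot express.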
Fine tune equivalence relations
• {rivier}NL (and (instance x River) (instance y RiverMouth) (instance z Country) (part y x) (location y z))
• {stroom}NL (and (instance x River) (instance y RiverMouth) (instance p RiverPart) (not (equal p y)) (instance z Country) (location p z) (not (location y z)))
Universality as evidence
• If lexicalization of the specific process is more universal, this can be seen as evidence that the specific processes should be listed in the ontology and not the generic verb:
– The English verb cut abstracts from the precise process, but there are troponyms that imply the manner: snip and clip imply scissors; chop and hack imply a large knife or an axe.
– In Dutch there is no general verb, only specific verbs: knippen “clip, snip, cut with scissors or a scissor-like tool”, snijden “cut with a knife or knife-like tool”, hakken “chop, hack, to cut with an axe or similar tool”.
• If Father is lexicalized in most languages we add it to the ontology even when it is NOT Rigid!
Universality as evidence
• Artifact substance is lexicalized in Dutch and other languages => ArtifactObject in SUMO needs to be generalized to Artifact so that it can be applied to both substances and objects
Open Questions/Challenges
• What is a word, i.e., a lexical unit?
• What is the status of complex lexemes like English lightning rod, word of mouth, find out, kick the bucket?
• What is the status of compounds in Germanic languages and Chinese?
– "hottentottententententoonstelling" (exposition of tents of the "hottentotten" (African tribe))
• What is a semantic unit, i.e. a concept?
Open Questions/Challenges
• Is there a core inventory of concepts that are universally encoded?
• If so, what are these concepts?
• How can crosslinguistic equivalence be verified?
• Is there systematicity to the language-specific extensions?
• What are the lexicalization patterns of individual languages?
• Are lexical gaps accidental or systematic?
Coverage: what belongs in a universal lexical database?
• Formal, linguistic criteria for inclusion
• Informal, cultural criteria
• Both are difficult to define and apply!
Concrete goals for GWG
• Global Wordnet Association website: http://www.globalwordnet.org/gwa/gwa_grid.htm
• 5000 Base Concepts or more:
– English
– Spanish
– Catalan
– Czech, Polish, Dutch, other wordnets
• 7th Framework project Kyoto
KYOTO Project
• 7th Framework project (under negotiation)
• Knowledge Yielding Ontologies for Transition-based Organisations
• Goal:
– Global Wordnet Grid = ontology + wordnets
– AutoCons = Automatic concept extractors
– Kybots = Knowledge yielding robots
– Wiki environment for encoding domain knowledge in expert groups
– Index and retrieval software for deep semantic search
• Languages: Dutch, English, Spanish, Basque, Italian, Chinese and Japanese
• Domain of application: environmental organisations
• Period: March/April 2008 - 2011
KYOTO Consortium
Universities
• Vrije Universiteit Amsterdam, Amsterdam, Netherlands
• Consiglio Nazionale delle Ricerche, Pisa, Italy
• Berlin-Brandenburg Academy of Sciences and Humanities, Berlin, Germany
• Euskal Herriko Unibertsitatea, San Sebastian, Spain
• Academia Sinica, Taipei, Taiwan
• National Institute of Information and Communications Technology, Kyoto, Japan
• Masaryk University, Brno, Czech Republic
Companies
• Irion Technologies, Delft, Netherlands
• Synthema, Pisa, Italy
Users
• European Centre for Nature Conservation, Tilburg, Netherlands
• World Wide Fund for Nature, Zeist, Netherlands
[Figure: KYOTO overview. Labels: Environmental organizations; Docs, URLs, Experts, Images; Capture; Concept Mining; Fact Mining; Index; Search; Dialogue; Universal Ontology with levels Top, Middle and Domain (Abstract, Physical, Substance, water, CO2, CO2 emission, water pollution); Wordnets.]
[Figure: KYOTO process diagram with numbered steps 1 to 8. Labels: Environmental organizations; Citizens, Governors, Companies; Domain Wiki; Capture; source data; Text & Meta data in XML Format; Concept Miners; term hierarchy; term relations; wordnet; ontology; domain wordnet; domain ontology; Manual Revision; Wiki DEB Client; DEB Server; Kybots; Data & Facts in XML Format; Indexing; Index; Access end-users; User scenarios; Manual Test; Benchmark data; Benchmarking.]
[Figure: Ontology and Wordnets. Labels: Generic and Domain levels; Abstract, Physical; Process, Substance; Chemical Reaction; water, CO2; CO2 emission, water pollution; Logical Expressions; Linguistic Miners or Kybots; words of each wordnet.]
END