
Page 1: Annotating Words using WordNet Semantic Glosses Julian Szymański Department of Computer Systems Architecture, Faculty of Electronics, Telecommunications

Annotating Words using WordNet Semantic Glosses

Julian Szymański
Department of Computer Systems Architecture, Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology, Poland

[email protected]

Włodzisław Duch
Department of Informatics, Nicolaus Copernicus University, Toruń, Poland

School of Computer Engineering, Nanyang Technological University, Singapore
Google: W. Duch


Outline

Motivation for Word Sense Disambiguation

“Semantic Glosses” approach

SG algorithm

SG in action

Aggregated results from small experiments

Conclusions, problems and (possible) solutions

Deliverables


Introduction

Ambiguity of natural language is the source of many problems in automatic text processing. It is evident, for example, in classification or clustering of documents represented by features derived from word frequencies.

Automatic semantic annotation is still a great challenge, requiring a solution to the word sense disambiguation (WSD) problem.

WSD addresses many issues: How to distinguish and represent word meanings? How to create the Semantic Web?

Manually: introduction of elementary atoms of meaning.

Set the level of granularity of senses and their relations to each other.

Synonyms and/or homonyms must be considered when acquiring word senses in an automatic way. So far the most successful approach: Latent Semantic Indexing.

Semantic annotations make it possible to go beyond the bag-of-words representation.


Our approach

Focus on word sense disambiguation during the initial text processing phase: map words from texts to structures that carry elementary meanings and may be treated as semantic atoms (senses).

WordNet synsets group words into sets of synonyms related to word definitions, provide sense identifiers, and record semantic relations between synsets.

Employ synsets to use the WordNet semantic network formed by relations between synsets. Text annotated at a higher abstraction level can be clustered better because similarities between texts are clearer.

Enhance document representation with superordinate categories. This works even better for clustering, simulating the spreading of neural activation responsible for associations and simple inferences taking place in the reader’s brain.
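The idea of enhancing a document with superordinate categories can be illustrated with a minimal sketch. The taxonomy, the synset identifiers and the decay factor below are hypothetical stand-ins chosen for illustration; they are not WordNet data and not the weighting used in the project.

```python
from collections import Counter

# Hypothetical toy taxonomy: synset id -> direct superordinate (hypernym).
HYPERNYM = {
    "horse.n.01": "equine.n.01",
    "zebra.n.01": "equine.n.01",
    "equine.n.01": "mammal.n.01",
    "dog.n.01": "canine.n.01",
    "canine.n.01": "mammal.n.01",
}

def enrich(synsets, decay=0.5):
    """Return synset weights extended with all superordinate categories.

    Each level up the taxonomy contributes `decay` times the weight of the
    level below (an assumed weighting, chosen only for illustration).
    """
    weights = Counter()
    for s in synsets:
        w = 1.0
        node = s
        while node is not None:
            weights[node] += w
            node = HYPERNYM.get(node)
            w *= decay
    return weights

def shared_mass(a, b):
    """Sum of the smaller weight over features present in both documents."""
    return sum(min(a[k], b[k]) for k in a.keys() & b.keys())

doc_a = ["horse.n.01"]
doc_b = ["zebra.n.01"]
plain = shared_mass(Counter(doc_a), Counter(doc_b))  # no common synsets
rich = shared_mass(enrich(doc_a), enrich(doc_b))     # common superordinates
print(plain, rich)
```

Two texts that share no synsets directly become comparable once their common superordinate categories are added, which is the effect the slide attributes to annotation at a higher abstraction level.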

The main issue is how to map words into synsets.


Semantic Atlas (Atlas Semantyczny)
http://dico.isc.cnrs.fr/en/index.html

spirit: 79 words, 69 cliques = minimal units with specific meaning.

Synset = collection of synonyms in WordNet.


Typical approaches to WSD select the proper sense of a given word by employing a hierarchy of taxonomical relations, or by analysing the context of the disambiguated word to find features that allow its proper meaning to be selected (e.g. the Lesk algorithm).
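For comparison, a context-based method of the Lesk type can be sketched as follows. This is a simplified variant that only counts words shared between the context and each gloss; the sense identifiers, glosses and stopword list are hypothetical and written for this illustration, not taken from WordNet.

```python
import re

# Hypothetical sense inventory: sense id -> gloss text (not WordNet data).
GLOSSES = {
    "shell_turtle": "hard outer covering protecting the body of a turtle or other animal",
    "shell_egg": "thin brittle covering of a bird egg protecting the embryo",
    "shell_ammo": "metal case holding explosive ammunition fired from a gun",
}

STOPWORDS = {"a", "an", "the", "of", "or", "from", "to", "in", "and", "other"}

def tokens(text):
    """Lowercase word set with stopwords removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS

def simplified_lesk(context, glosses):
    """Pick the sense whose gloss shares the most words with the context."""
    ctx = tokens(context)
    scores = {sense: len(ctx & tokens(gloss)) for sense, gloss in glosses.items()}
    best = max(scores, key=scores.get)
    return best, scores

best, scores = simplified_lesk("the egg shell protects the embryo of a bird", GLOSSES)
print(best, scores)
```

The sketch works on raw word overlap, so it depends on the exact wording of the glosses and of the context.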

Starting with version 3.0, WordNet also provides a semantically annotated, disambiguated gloss corpus. Glosses are short definitions providing the proper meanings of words and thus of whole synsets. The gloss annotations also cover concepts and collocations (multiword forms), tagging discontinuous spans of text. For example, “personal or business relationship” is converted to “personal_relationship”, “business_relationship”. Glosses have been linked manually to the context-appropriate sense in WordNet, disambiguating the corpus.

The Semantic Glosses (SG) approach employs relations between synsets, or more precisely relations obtained from references between synsets that are related to their definitions. They form a network of conceptually related synsets, as opposed to a structured hierarchy.


The algorithm

The disambiguated word W is mapped to its possible meanings (synsets) {Ts(W)}.

For each synset from the {Ts(W)} set, retrieve all synsets Tgs that may be derived from its glosses.

Rank all Ts synsets according to the number of relations with glosses in Tgs.
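The three steps above can be sketched as follows. The slide does not spell out exactly which relations are counted, so this sketch assumes one plausible reading: a candidate synset scores one point for each context synset tagged in its gloss, and one point for each context synset whose own gloss refers back to the candidate. The graph, the lexicon and the function name rank_senses are hypothetical.

```python
# Hypothetical gloss-reference graph: synset -> synsets tagged in its gloss.
GLOSS_LINKS = {
    "horse.animal": {"mammal", "hoof", "ride", "domesticated"},
    "horse.gym": {"gymnastics", "apparatus", "vault", "leg"},
    "horse.chess": {"chess", "piece", "knight", "move"},
    "saddle": {"seat", "ride", "horse.animal"},
    "vault": {"gymnastics", "jump", "horse.gym"},
}

# Hypothetical lexicon: word -> candidate synsets, i.e. Ts(W).
LEXICON = {
    "horse": ["horse.animal", "horse.gym", "horse.chess"],
    "saddle": ["saddle"],
    "ride": ["ride"],
    "vault": ["vault"],
    "gymnastics": ["gymnastics"],
}

def rank_senses(word, context_words):
    """Rank the candidate synsets of `word` by gloss links to the context."""
    context_synsets = set()
    for w in context_words:
        context_synsets.update(LEXICON.get(w, []))

    ranking = []
    for ts in LEXICON.get(word, []):
        tgs = GLOSS_LINKS.get(ts, set())
        # Outgoing: context synsets tagged in the candidate's gloss.
        outgoing = len(tgs & context_synsets)
        # Incoming: context synsets whose own gloss refers to the candidate.
        incoming = sum(1 for c in context_synsets
                       if ts in GLOSS_LINKS.get(c, set()))
        ranking.append((ts, outgoing + incoming))
    # Stable sort: ties keep the lexicon order of the candidates.
    return sorted(ranking, key=lambda pair: pair[1], reverse=True)

ranked = rank_senses("horse", ["saddle", "ride"])
print(ranked)
```

With the context words "vault" and "gymnastics" the same function puts the gymnastic apparatus sense first, because the evidence comes from gloss references rather than from surface word overlap.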


Example

First, create test sets for multi-sense words.

Each sense has its own text.

We compare our approach (SG) against the Stanford parser (SP).


Horse may mean …


Aggregated results

The SG approach has been evaluated on a test set of eight multi-sense words. For the different senses of these words, 51 test texts have been prepared and manually evaluated by annotating the proper senses.


Conclusions I

Good:

The algorithm that employs semantically annotated glosses provides quite promising results.

So far it has been evaluated only on a small test set of 8 multi-sense words (51 different meanings).

As the preliminary results are promising, the method is now being tested on a larger scale; many improvements will be introduced.


Conclusions: problems

Different meanings of the same word in one sentence, e.g.:

Turtle’s shells provide protection to parts of the animal’s body, like an egg shell protects a bird’s embryo.

The first ‘shell’ is related to the turtle shell, the second to the egg shell. Disambiguating such cases is relatively easy for humans because, using semantic memory, collocations are easily discovered and require a much smaller context for proper sense classification.

Experiments with variable context length, dependent on the number of identical words with different meanings in one sentence, will be performed to check how to deal with such difficulties.
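One way such a variable context length could be set up is sketched below. The rule used here, dividing a base window radius by the number of occurrences of the target word, is purely an assumption for illustration; the slides do not say how the context length will be chosen. The function name context_windows and the simplified sentence are also hypothetical.

```python
def context_windows(tokens, target, base_window=6):
    """Return one context window per occurrence of `target`.

    Assumed rule (illustrative only): the window radius is the base radius
    divided by the number of occurrences of the target in the sentence, so
    repeated words get narrower, less overlapping contexts.
    """
    positions = [i for i, t in enumerate(tokens) if t == target]
    if not positions:
        return []
    radius = max(1, base_window // len(positions))
    windows = []
    for p in positions:
        left = tokens[max(0, p - radius):p]
        right = tokens[p + 1:p + 1 + radius]
        windows.append(left + right)
    return windows

# Simplified, lowercased version of the example sentence.
sentence = ("turtle shell protects parts of the animal body "
            "like egg shell protects bird embryo").split()
windows = context_windows(sentence, "shell")
for w in windows:
    print(w)
```

With two occurrences of "shell" the radius shrinks from 6 to 3, so the first window contains "turtle" but not "egg", and the second contains "egg" but not "turtle".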


Conclusions: more problems

Some WordNet synsets are larger and have more relations than others; the distribution is very uneven.

This causes a preference for larger synsets that may confuse many algorithms, degrading results for meanings that correspond to synsets with a small number of relations.

To simulate the effects of spreading activation, weighted relations between synsets may be introduced, describing patterns of more and less important activations.
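The bias towards larger synsets, and one possible weighting that counteracts it, can be shown on a toy case. The relation sets are hypothetical, and the logarithmic weighting is an assumed choice made for this sketch, not the weighting proposed by the authors.

```python
import math

# Hypothetical gloss-derived relation sets for two candidate senses.
RELATIONS = {
    "big_sense": {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l"},
    "small_sense": {"x", "y", "z"},
}

def raw_score(sense, context):
    """Number of the sense's relations that occur in the context."""
    return len(RELATIONS[sense] & context)

def weighted_score(sense, context):
    """Raw score with every relation weighted by 1 / log(1 + relation count).

    The logarithmic damping is an assumed choice; any monotone penalty on
    the number of relations would serve the same illustrative purpose.
    """
    weight = 1.0 / math.log(1 + len(RELATIONS[sense]))
    return weight * raw_score(sense, context)

context = {"a", "b", "c", "x", "y"}
for sense in RELATIONS:
    print(sense, raw_score(sense, context), round(weighted_score(sense, context), 3))
```

The large synset matches three context items out of twelve relations and wins on the raw count, while the small synset matches two of its three relations and wins once relations are weighted.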


Few more ideas

Explore the use of WordNet structural information given in predefined relations that extend the network of relations between synsets.

Use references between glosses obtained from higher-order relations, which should have smaller weights.

Employ additional relations mined from Wikipedia hyperlinks to introduce more relations between synsets. This task first requires a mapping between WordNet synsets and Wikipedia articles. Results of the semi-automatic approach to such mapping are quite good.

Challenge: use of negative knowledge about words that are present in glosses but do not appear in the wider context.
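The idea of giving higher-order gloss references smaller weights can be sketched as a breadth-first traversal with a decaying weight. The graph, the decay value of 0.5 and the cut-off at second-order relations are all assumptions made for this illustration.

```python
from collections import deque

# Hypothetical gloss-reference graph: synset -> synsets tagged in its gloss.
GLOSS_LINKS = {
    "horse.animal": ["mammal", "hoof"],
    "mammal": ["vertebrate", "milk"],
    "hoof": ["foot"],
    "vertebrate": ["backbone"],
}

def spread(start, max_order=2, decay=0.5):
    """Breadth-first spreading over gloss references with decaying weights.

    A synset reached through k references gets weight decay**(k-1), so
    first-order relations weigh 1.0, second-order 0.5, and so on. Both the
    decay value and the cut-off order are assumptions.
    """
    weights = {}
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, order = queue.popleft()
        if order == max_order:
            continue  # do not expand beyond the cut-off order
        for nxt in GLOSS_LINKS.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                weights[nxt] = decay ** order
                queue.append((nxt, order + 1))
    return weights

activation = spread("horse.animal")
print(activation)
```

Synsets tagged directly in the gloss keep full weight, synsets reached through one intermediate gloss get half of it, and anything further away is ignored under the chosen cut-off.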


Deliverables

The application for disambiguation and evaluation can be downloaded free from:

http://kask.eti.pg.gda.pl/semagloss/annotations.zip

This project also resulted in the development of an API in C# and Java for the WordNet semantically annotated gloss corpus. The API is available for download:

http://kask.eti.pg.gda.pl/semagloss/index.html

Associating WordNet with Wikipedia:

http://kask.eti.pg.gda.pl/CompWiki => WordNet tab.


Thank you for lending your ears

http://kask.eti.pg.gda.pl/CompWiki Google: W Duch => Papers