An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues


Page 1: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

http://www.hytex.info/

HyTex

Hypertextualisierung auf textgrammatischer Grundlage (Engl. hypertextualisation on a text-grammatical basis)

Text Technological Modelling of Information

An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Irene Cramer & Marc Finthammer

Faculty of Cultural Studies, Technische Universität Dortmund, Germany

[email protected]

Page 2: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Outline

• Project Context and Motivation

• Lexical Chaining – Evaluation Steps

1. Preprocessing and Coverage

2. Sense Disambiguation

3. Semantic Relatedness/Similarity

4. Application

• Open Issues and Future Work

Page 3: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Project Context and Motivation

• Research project HyTex funded by DFG (German Research Foundation) – part of the research unit "Text Technological Modelling of Information"

• Research objective in HyTex: text-grammatical foundations for the (semi-) automated text-to-hypertext conversion

• One focus of our research: topic-based linking strategies using lexical and topic chains/topic views

Page 4: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Project Context and Motivation

• Lexical chains
– based on the concept of lexical cohesion,
– regarded as partial text representation,
– valuable resource for many NLP applications, such as text summarization, dialog systems etc.
– (to our knowledge) 2 lexical chainers for German (Mehler, 2006 and Cramer/Finthammer), in addition: research on semantic similarity based on GermaNet by Gurevych et al.

• Topic chains/views

Page 5: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Project Context and Motivation

• Lexical chains

• Topic chains/views
– based on a selection of central words, so-called topic words,
– intended to support the user's orientation and navigation.

Steps:

1. Lexical chains are used to select topic words (1-3 topic words per passage),

2. topic words are used to construct the topic view (~"thematic index"),

3. topic words are re-connected via semantically meaningful edges (as in lexical chaining) to construct topic chains.

Page 6: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Project Context and Motivation

[Figure: Topic View. The running text of Chapter 1.1 and Chapter 1.2 is shown with topic words embedded in it; next to it, the topic view lists each chapter (Chapter 1.1, Chapter 1.2, Chapter 1.3 …, Chapter 2 …, Chapter 3.1 …) together with its topic words 1 to 3.]

Page 7: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Lexical Chaining – Evaluation Steps

• To evaluate our chainer, called GLexi, test data is required;

• experiments to develop such a gold standard for German emphasize:
– manual annotation of lexical chains is very demanding,
– rich interaction between various principles to achieve a cohesive text structure distracts annotators;

• results of these experiments partially reported in Stührenberg et al., 2007.

• Our conclusion: Evaluation of a lexical chainer might be best performed in several steps.

Page 8: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Lexical Chaining – Evaluation Steps

• Our conclusion: Evaluation of a lexical chainer might be best performed in several steps.
– Evaluation of coverage
– Evaluation of disambiguation quality
– Evaluation of semantic relatedness measures
– Evaluation of chains with respect to a specific application

Page 9: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Lexical Chaining – Evaluation Steps

• Remainder of talk:
– very short presentation of GLexi's architecture and
– exemplary demonstration of applicability of evaluation procedure

• Resources used:
– GermaNet (version 5.0)
– HyTex corpus (specialized text) – approx. 29,000 (noun) tokens
– set of word pairs + results of human judgement experiment
– German word frequency list (thanks to Sabine Schulte im Walde)

Page 10: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Lexical Chaining - GLexi

Basic modules:
– preprocessing of corpora: tokenization, POS tagging, chunking; chaining candidate selection
– core algorithm: lexical semantic look-up, scoring of relation, sense disambiguation
– output creation: rating of chain strength; application specific representation

Page 11: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Lexical Chaining - GLexi

Preprocessing

Page 12: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Lexical Chaining - GLexi

Core algorithm:

lexical semantic look-up

Page 13: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Outline

• Project Context and Motivation

• Lexical Chaining – Evaluation Steps

1. Preprocessing and Coverage

2. Sense Disambiguation

3. Semantic Relatedness/Similarity

4. Application

• Open Issues and Future Work

Page 14: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Step 1: Coverage

approx. 29,000 (noun) tokens in our corpus split into:
– 56 % in GermaNet
– 44 % not in GermaNet, of these:
   – 15 % inflected
   – 12 % compounds
   – 17 % remaining, uncovered nouns

• Coverage without preprocessing: approx. 56 %

• Approx. 44 % not included in chaining

preprocessing necessary to improve coverage!

Page 15: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Step 1: Coverage

• Coverage without preprocessing: approx. 56 %

• Lemmatization: increase coverage to approx. 71 %
– theoretical value – open issue, e.g. Medien – Medium (Engl. media – psychic or data carrier)

• Compound analysis: increase coverage to approx. 83 %
– theoretical value – open issue, e.g. Agrarproduktion (Engl. agricultural production) analysed as Agrar (Engl. agricultural) + Produkt (Engl. artifact) + Ion (Engl. ion [chem.])

• Simple Named Entity Recognition in preprocessing phase

• Open issues: abbreviations, foreign words, nominalized verbs

remaining, uncovered nouns split into:
– 15 % Named Entities
– 30 % foreign words
– 25 % abbreviations
– 20 % nominalized verbs

future work: include TermNet (domain specific language) as a resource – for more information: talk by Lüngen et al. – tomorrow, session 6, 10:40 h …
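The staged look-up described on this slide (direct look-up, then lemmatization, then compound analysis) can be sketched roughly as follows. This is only an illustration, not GLexi's implementation: `LEXICON` and `LEMMAS` are hypothetical toy stand-ins for GermaNet and a lemmatizer, and `split_compound`, `lookup_stage` and `coverage` are names invented for this sketch.

```python
from collections import Counter

# Hypothetical toy resources; a real system would query GermaNet and a lemmatizer.
LEXICON = {"text", "medium", "agrar", "produktion", "produkt", "ion"}
LEMMAS = {"texte": "text", "medien": "medium"}  # inflected form -> lemma


def split_compound(word, lexicon):
    """Return the split with the fewest parts that are all in the lexicon, or None."""
    best = None
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head not in lexicon:
            continue
        rest = [tail] if tail in lexicon else split_compound(tail, lexicon)
        if rest is not None and (best is None or 1 + len(rest) < len(best)):
            best = [head] + rest
    return best


def lookup_stage(token):
    """Name the first preprocessing stage at which the token becomes coverable."""
    word = token.lower()
    if word in LEXICON:
        return "direct"
    if LEMMAS.get(word) in LEXICON:
        return "lemmatized"
    if split_compound(word, LEXICON) is not None:
        return "compound"
    return "uncovered"


def coverage(tokens):
    """Share of tokens per stage, as fractions of all tokens."""
    stages = Counter(lookup_stage(t) for t in tokens)
    return {stage: count / len(tokens) for stage, count in stages.items()}


print(coverage(["Text", "Texte", "Agrarproduktion", "WWW"]))
# Over-splitting as in the slide's example: without "produktion" in the lexicon,
# the only full split is Agrar + Produkt + Ion.
print(split_compound("agrarproduktion", LEXICON - {"produktion"}))
```

Preferring the split with the fewest parts is one simple heuristic against the Agrar + Produkt + Ion reading. It does not address the lemma ambiguity of Medien – Medium, which is why the slide labels both percentages as theoretical values.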

Page 17: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Step 2: Chaining-based WSD

• Approx. 45 % of tokens in our corpus have more than 1 synset

• Basis for chaining-based WSD: manual annotation

• Compare manually annotated data and disambiguation decision of semantic measure

word A   word B      sense A   sense B   Wu-Palmer   rank
Text     Hypertext   1         1         0.9231      1
Text     Hypertext   2         1         0.8333      2

best value therefore rank 1

manually annotated word senses:

word A   word B      sense A   sense B
Text     Hypertext   1         1

compare with manual annotation

correct word senses (word A, sense A = 1 and word B, sense B = 1) of word pair on rank 1 (semantic measure Wu-Palmer best value): correct disambiguation
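The comparison shown on this slide can be sketched in a few lines: rank all sense combinations of a word pair by the measure's value, take rank 1 as the disambiguation decision, and check it against the manual annotation. The values are those given on the slide; the function names are invented for this sketch, and GLexi's actual data structures are not known from the slides.

```python
def rank_sense_pairs(scores):
    """Sort (sense A, sense B) combinations by relatedness value, best first."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def is_correct(scores, gold):
    """True if the rank-1 sense combination equals the manually annotated one."""
    best_pair, _ = rank_sense_pairs(scores)[0]
    return best_pair == gold


# Wu-Palmer values for the word pair Text / Hypertext, as given on the slide.
wu_palmer = {(1, 1): 0.9231, (2, 1): 0.8333}
gold = (1, 1)  # manually annotated senses

print(rank_sense_pairs(wu_palmer))
print(is_correct(wu_palmer, gold))
```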

Page 19: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Step 2: Chaining-based WSD

• Performance of chaining-based WSD: mediocre!

• Best semantic measures (Resnik, Wu-Palmer, and Lin):
– approx. 50-60 % correct disambiguation compared to manual annotation
– majority voting increased performance to approx. 63-65 %

• Future work:
– include WSD in preprocessing?
– machine learning based new measure?
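The slides do not say how the majority vote was implemented. The following is a minimal sketch under the assumption that each measure proposes one sense combination per word pair and that ties go to the measure listed first. The votes and gold annotations below are made up for illustration and are not results from the talk.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent proposal; ties go to the earliest-listed measure."""
    counts = Counter(votes)
    top = max(counts.values())
    for vote in votes:
        if counts[vote] == top:
            return vote


def accuracy(predictions, gold):
    """Fraction of word pairs whose predicted sense combination matches the gold one."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Hypothetical proposals of three measures (e.g. Resnik, Wu-Palmer, Lin) for two word pairs.
votes_per_pair = [
    [(1, 1), (2, 1), (1, 1)],
    [(1, 2), (1, 2), (3, 1)],
]
gold = [(1, 1), (1, 1)]  # hypothetical manual annotation

voted = [majority_vote(votes) for votes in votes_per_pair]
print(voted, accuracy(voted, gold))
```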

Page 20: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Step 3: Semantic Relatedness

• Implemented 11 semantic relatedness measures (SRM) (GermaNet: 8 measures, Google co-occurrence counts: 3 measures)

focus of this talk: GermaNet measures

• Evaluation of SRM performance used results of human judgement experiment:
– list of 100 word pairs, subjects' judgement of "semantic distance" (35 subjects) on a 5-level scale
– compare human judgement and SRM values
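One way to compare human judgements with SRM values is a correlation coefficient over the word pairs. The slides do not name the coefficient or the aggregation used, so this sketch assumes that ratings are averaged per word pair and compared with Pearson's r. All numbers below are invented.

```python
from math import sqrt


def mean_judgement(ratings):
    """Average the subjects' ratings for one word pair."""
    return sum(ratings) / len(ratings)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical data: ratings of three subjects on a 5-level scale for four word pairs,
# and the value some SRM assigns to the same pairs.
human = [mean_judgement(r) for r in ([0, 0, 1], [4, 4, 3], [2, 1, 2], [4, 3, 4])]
srm = [0.10, 0.85, 0.60, 0.40]

print(round(pearson(human, srm), 3))
```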

Page 21: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Step 3: Semantic Relatedness

almost 2/3 extreme values (not related / strongly related)

Page 22: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Step 3: Semantic Relatedness

human judgement experiment: example of the results

[Figure: results for example word pairs, glossed as Engl. printer / Engl. fin and Engl. fluid / Engl. water]

Page 23: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Step 3: Semantic Relatedness

[Figure: relatedness values (y-axis "Relatedness", 0.00 to 1.00) for word pairs ordered by relatedness value (x-axis), series: Human Judgement and Resnik]

all SRM values scatter

correlation between human judgement and SRM values low

Page 24: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Step 3: Semantic Relatedness

Open issues – semantic relatedness:
– continuous SRM values necessary / helpful?
   instead: classes (e.g. 3 classes: not related, related, strongly related)
   machine learning (ML) experiment using parameters of SRM
– interactions between SRM quality and disambiguation quality?
– combination of GermaNet and Google co-occurrence based measures (and further resources) useful?
   integration in ML experiment?
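The idea of replacing continuous SRM values by a few classes can be illustrated with a simple thresholding function. The thresholds 0.33 and 0.66 are arbitrary placeholders chosen for this sketch and assume values normalised to [0, 1]. The slides leave open how class boundaries would be determined, e.g. by the proposed ML experiment.

```python
def relatedness_class(value, low=0.33, high=0.66):
    """Map a normalised SRM value in [0, 1] to one of three classes.

    The thresholds are hypothetical placeholders, not values from the talk.
    """
    if value < low:
        return "not related"
    if value < high:
        return "related"
    return "strongly related"


for value in (0.10, 0.50, 0.90):
    print(value, relatedness_class(value))
```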

Page 25: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Step 4: Application-oriented Evaluation

Example: newspaper article about child poverty in Germany

Topic words according to lexical chaining results

Kind, Engl. child

Geld, Engl. money

Deutschland, Engl. Germany

Page 26: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Step 4: Application-oriented Evaluation

• Features used in calculation of topic words and views:
– chain / meta-chain info: link density, link strength
– in addition to chains: frequency (relative passage and document), mark-up

• application-oriented evaluation: gold standard of topic words, topic views, and topic chains necessary

• manual annotation of topic words and topic views – work in progress – current annotation agreement > 75 % (before accordance)

initial results show: link density and frequency are most relevant features
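How the features are combined is not stated on the slides. As a rough illustration, the sketch below scores each candidate with a weighted sum of link density and relative frequency (the two features reported as most relevant) and keeps the best three candidates per passage. The `Candidate` class, the equal weights and all feature values are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    word: str
    link_density: float   # hypothetical, normalised to [0, 1]
    rel_frequency: float  # hypothetical, normalised to [0, 1]


def select_topic_words(candidates, k=3, w_density=0.5, w_frequency=0.5):
    """Return the k best-scoring words under a weighted feature sum (weights are placeholders)."""
    def score(c):
        return w_density * c.link_density + w_frequency * c.rel_frequency

    ranked = sorted(candidates, key=score, reverse=True)
    return [c.word for c in ranked[:k]]


# Invented feature values for a passage of the child poverty example.
passage = [
    Candidate("Jahr", 0.1, 0.3),
    Candidate("Kind", 0.9, 0.8),
    Candidate("Deutschland", 0.5, 0.6),
    Candidate("Geld", 0.7, 0.5),
]
print(select_topic_words(passage))
```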

Page 27: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Outlook and Future Work

To sum up:
– Application: use lexical chaining for construction of topic views
– Lexical chaining for German corpora: several challenges (coverage, disambiguation, SRM)
– room for improvement: disambiguation and SRM
   possible solutions: WSD as preprocessing step, alternative SRM (potentially ML based), additional resources
– initial results using lexical chains for construction of topic views: chaining useful!

Page 28: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Thank you!

Comments, ideas, questions are very welcome.

Page 29: An Evaluation Procedure for Word Net Based Lexical Chaining: Methods and Issues

Literature (back-up slide)

• Alexander Budanitsky and Graeme Hirst. 2001. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Workshop on WordNet and Other Lexical Resources at NAACL-2001, Pittsburgh, PA, June 2001.

• M. A. K. Halliday and Ruqaiya Hasan. 1976. Cohesion in English. Longman, London.

• Graeme Hirst and David St-Onge. 1998. Lexical chains as representations of context for the detection and correction of malapropisms. In C. Fellbaum, editor, WordNet: An electronic lexical database, chapter 13, pages 305–332. The MIT Press, Cambridge, MA.

• Alexander Mehler. 2005. Lexical chaining as a source of text chaining. In Proceedings of the 1st Computational Systemic Functional Grammar Conference, Sydney.

• Gregory H. Silber and Kathleen F. McCoy. 2002. Efficiently computed lexical chains as an intermediate representation for automatic text summarization. Computational Linguistics, 28(4):487–496.