
Page 1

Noun Homograph Disambiguation Using Local Context in Large Text Corpora

Marti A. Hearst

Presented by: Heng Ji, Mar. 29, 2004

Page 2

Outline

Introduction
Motivations of Algorithm
Feature Selection
Crucial Problem and Detailed Algorithm
Experiment Results
Conclusions & Discussions

Page 3

Introduction

What is a Homograph?
One of two or more words that are spelled alike but differ in meaning.

What is Noun Homograph Disambiguation?
Determining which of a set of pre-determined senses should be assigned to that noun.

Why is Noun Homograph Disambiguation useful?

Page 4

Noun Compound Interpretation

Page 5

Noun Compound Interpretation

Improve Information Retrieval Results
[Figure: retrieved results for an ambiguous query word, each labeled with its sense ("ORG" vs. "stick")]

Page 6

Extend key words?

[Figure: expanded query results, labeled with the senses "ORG" and "stick"]

Page 7

How to do it? -- Motivations

Intuition 1: Humans can identify word senses from local context.
Intuition 2: Humans' identification ability comes from familiarity with frequent contexts.
Intuition 3: Different senses can be distinguished by:
-- different high-frequency contexts
-- different syntactic, orthographic, or lexical features

Combining Intuitions 1, 2, and 3: similar-sense terms will tend to have similar contexts!

Page 8

Feature Selection

Principles: Selective & General

Example: "bank"
"Numerous residences, banks, and libraries" (parallel buildings)
"They use holes in trees, banks, or rocks for nests" (parallel natural objects)
"... are found on the west bank of the Nile" ("direction" + bank of "proper name")
"Headed the Chase Manhattan Bank in New York" (Name + Capitalization)

Neighboring words alone are not enough; syntactic information is needed!
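The slides argue that nearby words plus light syntactic and orthographic cues make good features. Below is a minimal sketch of that kind of local-context feature extraction for a target noun; the window size, feature names, and POS-tag conventions are illustrative assumptions, not the exact feature set of Hearst's paper (which follows on the next slide).

```python
# Minimal sketch of local-context feature extraction for a target noun.
# Window size, feature names, and POS-tag conventions are illustrative
# assumptions, not the exact feature set used in Hearst (1991).

def extract_features(tokens, pos_tags, target_index, window=3):
    """Return a set of local-context features for the noun at target_index."""
    features = set()
    n = len(tokens)

    # Lexical features: nearby words, keyed by position relative to the target.
    for offset in range(-window, window + 1):
        i = target_index + offset
        if offset != 0 and 0 <= i < n:
            features.add(f"word[{offset:+d}]={tokens[i].lower()}")

    # Orthographic cue: capitalization of the target (e.g. "Chase Manhattan Bank").
    if tokens[target_index][0].isupper():
        features.add("target_capitalized")

    # Light syntactic cues of the kind the slide calls for.
    if target_index + 1 < n and pos_tags[target_index + 1] == "NNP":
        features.add("followed_by_proper_noun")
    if target_index >= 1 and pos_tags[target_index - 1] == "IN":
        features.add(f"prev_prep={tokens[target_index - 1].lower()}")

    return features
```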

Page 9

Feature Set

Page 10

Crucial Problem: is a large annotated corpus needed?

Problem:
The cost of manual tagging is high.
The size of the corpus is usually large.
Statistics vary a great deal across different domains.
Automating the tagging of the training corpus results in a "circularity problem" (Dagan and Itai, 1994).

Solution: construct the training corpus incrementally.
An initial model M1 is trained using a small corpus C1.
M1 is used to disambiguate the rest of the ambiguous words.
All words that can be disambiguated with strong confidence are combined with C1 to form C2.
M2 is trained using C2; repeat.
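A minimal sketch of this incremental training loop, assuming hypothetical helpers train_model (builds a model from labeled sentences) and classify (returns a sense plus a confidence score); the confidence threshold and stopping rule are illustrative, not taken from the paper.

```python
# Minimal sketch of the incremental (bootstrapping) training loop: train M1 on
# the hand-labeled corpus C1, label the remaining sentences, and fold the
# confidently labeled ones back into the training corpus. The helper functions
# and the confidence threshold are illustrative assumptions.

def bootstrap(labeled_seed, unlabeled, train_model, classify,
              confidence_threshold=0.3, max_rounds=5):
    """labeled_seed: list of (sentence, sense) pairs (the small corpus C1).
    unlabeled: sentences containing the ambiguous noun, not yet tagged.
    classify(model, sentence) -> (sense, confidence)."""
    corpus = list(labeled_seed)            # C1
    remaining = list(unlabeled)

    for _ in range(max_rounds):
        model = train_model(corpus)        # M1, M2, ...
        confident, unsure = [], []
        for sentence in remaining:
            sense, confidence = classify(model, sentence)
            if confidence >= confidence_threshold:
                confident.append((sentence, sense))
            else:
                unsure.append(sentence)
        if not confident:                  # nothing new to learn from
            break
        corpus.extend(confident)           # C2 = C1 + confidently labeled data
        remaining = unsure

    return train_model(corpus)
```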

Page 11

Algorithm

[Flowchart]
Training: manually label a small set of samples, segment them into phrases with POS tagging, and record the context features.
Test: for each input sentence, check the context features of the target noun, compare the evidence, and choose the sense with the most evidence as the output.
Samples classified with high comparative evidence are fed back into training.

Page 12

Comparative Evidence

Definition: choose the sense that maximizes CE, where E_i and CE_i are given by the formulas below.

CE_i: comparative evidence for sense i
n: number of senses
m: number of evidence features found in the test sentence
f_ij: frequency with which feature j is recorded in sentences containing sense i

Procedure: choose the sense with the maximum comparative evidence. If the largest CE is not larger than the second largest CE by a threshold (margin), the sentence cannot be classified.

E_i = \sum_{j=1}^{m} f_{ij}

CE_i = \frac{E_i}{\sum_{i=0}^{n-1} E_i}
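A minimal sketch of sense selection by comparative evidence as described above: E_i sums the recorded frequencies f_ij of the evidence features found in the test sentence, CE_i normalizes E_i over all senses, and a margin test rejects unclear cases. The normalization and the exact margin test are assumptions for illustration.

```python
# Minimal sketch of sense choice by comparative evidence. freq[sense][feature]
# holds f_ij from training. The normalization over all senses and the margin
# test are assumptions for illustration.

def choose_sense(test_features, freq, margin=0.1):
    """Return the winning sense, or None if the sentence cannot be classified."""
    # E_i: sum of f_ij over the evidence features found in the test sentence.
    evidence = {
        sense: sum(feature_freqs.get(f, 0) for f in test_features)
        for sense, feature_freqs in freq.items()
    }
    total = sum(evidence.values())
    if total == 0:
        return None                        # no evidence at all

    # CE_i: evidence for sense i, normalized over all senses.
    ce = {sense: e / total for sense, e in evidence.items()}
    ranked = sorted(ce.items(), key=lambda kv: kv[1], reverse=True)

    # Margin test: the best sense must beat the runner-up by the threshold.
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        return None                        # too close to call
    return ranked[0][0]
```

For example, with the hypothetical counts freq = {"money": {"Chase": 3, "loan": 5}, "river": {"Nile": 4}} and test features {"loan", "Chase"}, the money sense gets CE = 1.0 and is chosen.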

Page 13

Experiment Result – “tank”

[Chart: Results for word "tank" -- accuracy (%) vs. training size (20-70), Supervised Learning vs. Supervised + Unsupervised]

Page 14

Experiment Result – “bank”

[Chart: Results for word "bank" -- accuracy (%) vs. training size (10-50), Supervised Learning vs. Supervised + Unsupervised]

Page 15

Experiment Result – “bass”

[Chart: Results for word "bass" -- accuracy (%) vs. training size (10-25), Supervised Learning vs. Supervised + Unsupervised]

Page 16

Experiment Result – “country”

[Chart: Results for word "country" -- accuracy (%) vs. training size (10-40), Supervised Learning vs. Supervised + Unsupervised]

Page 17

Experiment Result – “Record”

[Chart: Results for "Record" with supervised learning -- accuracy (%) vs. training size (20-40), comparing Record1 and Record2]
Record1: "archived event" vs. "pinnacle achievement"; Record2: "archived event" vs. "musical disk"

Page 18

Conclusions and Future Work

Main advantage: bootstrapping is used to alleviate the tagging bottleneck, so no sizable sense-tagged corpus is needed.
Results show the method is successful.
Unsupervised learning helps improve results for general words, but has limitations on difficult words like "country"; it also helps reduce the amount of manual work.
Use of partial syntactic information: richer than common statistical techniques.

Proposed Improvements
Bootstrapping from bilingual corpora
Improve the evidence metric (adjust weights automatically; weight over the entire corpus and each sense; add more feature types)
Integrate WordNet

Page 19

Discussion 1: Initial Training

A good training base needs to be obtained first, i.e. initial hand tagging is required. But once training is complete, noun homograph disambiguation is fast.
However, this initial set is still large (20-30 occurrences for each sense), so the cost of tagging is still high!

Page 20

Discussion 2: Resources

Advantages of an unrestricted corpus compared to dictionaries: it includes sufficient contextual variety and can automatically integrate unfamiliar words.

Assumption: the context around an instance of a sense of the homograph is meaningfully related to that sense.

Need a semantic lexicon?
"Numerous residences, banks, and libraries" (parallel buildings)
"They use holes in trees, banks, or rocks for nests" (parallel natural objects)

Page 21

References

Marti A. Hearst (1991). Noun Homograph Disambiguation Using Local Context in Large Text Corpora.

Yarowsky (1992). Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora.

Chin (1999). Word Sense Disambiguation Using Statistical Techniques.

Peh and Ng (1997). Domain-Specific Semantic Class Disambiguation Using WordNet.

Dagan, I. and Itai, A. (1994). Word Sense Disambiguation Using a Second Language Monolingual Corpus.