
Page 1

Noun Homograph Disambiguation Using Local Context in Large Text Corpora

Marti A. Hearst

Presented by: Heng Ji, Mar. 29, 2004

Page 2

Outline

Introduction
Motivations of Algorithm
Feature Selection
Crucial Problem and Detailed Algorithm
Experiment Results
Conclusions & Discussions

Page 3

Introduction

What is a Homograph?
One of two or more words that are spelled alike but differ in meaning.

What is Noun Homograph Disambiguation?
Determining which of a set of pre-determined senses should be assigned to that noun.

Why is Noun Homograph Disambiguation useful?

Page 4

Noun Compound Interpretation

Page 5

Noun Compound Interpretation

Improve Information Retrieval Results
[Figure: retrieved results for an ambiguous query word, each labeled with its sense ("ORG" vs. "stick")]

Page 6

Extend key words?

[Figure: expanded query results, labeled with the senses "ORG" and "stick"]

Page 7

How to do it? -- Motivations

Intuition 1: Humans can identify word senses from local context.
Intuition 2: Humans' identification ability comes from familiarity with frequent contexts.
Intuition 3: Different senses can be distinguished by:
-- different high-frequency contexts
-- different syntactic, orthographic, or lexical features

Combining Intuitions 1, 2, and 3: similar-sense terms will tend to have similar contexts!

Page 8

Feature Selection

Principles: Selective & General

Example: "bank"
"Numerous residences, banks, and libraries" (parallel buildings)
"They use holes in trees, banks, or rocks for nests" (parallel natural objects)
"... are found on the west bank of the Nile" ("direction" + bank of "proper name")
"Headed the Chase Manhattan Bank in New York" (Name + Capitalization)

Neighboring words alone are not enough; syntactic information is needed!
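The slides argue that nearby words plus light syntactic and orthographic cues make good features. Below is a minimal sketch of that kind of local-context feature extraction for a target noun; the window size, feature names, and POS-tag conventions are illustrative assumptions, not the exact feature set of Hearst's paper (which follows on the next slide).

```python
# Minimal sketch of local-context feature extraction for a target noun.
# Window size, feature names, and POS-tag conventions are illustrative
# assumptions, not the exact feature set used in Hearst (1991).

def extract_features(tokens, pos_tags, target_index, window=3):
    """Return a set of local-context features for the noun at target_index."""
    features = set()
    n = len(tokens)

    # Lexical features: nearby words, keyed by position relative to the target.
    for offset in range(-window, window + 1):
        i = target_index + offset
        if offset != 0 and 0 <= i < n:
            features.add(f"word[{offset:+d}]={tokens[i].lower()}")

    # Orthographic cue: capitalization of the target (e.g. "Chase Manhattan Bank").
    if tokens[target_index][0].isupper():
        features.add("target_capitalized")

    # Light syntactic cues of the kind the slide calls for.
    if target_index + 1 < n and pos_tags[target_index + 1] == "NNP":
        features.add("followed_by_proper_noun")
    if target_index >= 1 and pos_tags[target_index - 1] == "IN":
        features.add(f"prev_prep={tokens[target_index - 1].lower()}")

    return features
```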

Page 9

Feature Set

Page 10

Crucial Problem: is a large annotated corpus needed?

Problem:
The cost of manual tagging is high.
The size of the corpus is usually large.
Statistics vary a great deal across different domains.
Automating the tagging of the training corpus results in a "circularity problem" (Dagan and Itai, 1994).

Solution: construct the training corpus incrementally.
An initial model M1 is trained using a small corpus C1.
M1 is used to disambiguate the rest of the ambiguous words.
All words that can be disambiguated with strong confidence are combined with C1 to form C2.
M2 is trained using C2; repeat.
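A minimal sketch of this incremental training loop, assuming hypothetical helpers train_model (builds a model from labeled sentences) and classify (returns a sense plus a confidence score); the confidence threshold and stopping rule are illustrative, not taken from the paper.

```python
# Minimal sketch of the incremental (bootstrapping) training loop: train M1 on
# the hand-labeled corpus C1, label the remaining sentences, and fold the
# confidently labeled ones back into the training corpus. The helper functions
# and the confidence threshold are illustrative assumptions.

def bootstrap(labeled_seed, unlabeled, train_model, classify,
              confidence_threshold=0.3, max_rounds=5):
    """labeled_seed: list of (sentence, sense) pairs (the small corpus C1).
    unlabeled: sentences containing the ambiguous noun, not yet tagged.
    classify(model, sentence) -> (sense, confidence)."""
    corpus = list(labeled_seed)            # C1
    remaining = list(unlabeled)

    for _ in range(max_rounds):
        model = train_model(corpus)        # M1, M2, ...
        confident, unsure = [], []
        for sentence in remaining:
            sense, confidence = classify(model, sentence)
            if confidence >= confidence_threshold:
                confident.append((sentence, sense))
            else:
                unsure.append(sentence)
        if not confident:                  # nothing new to learn from
            break
        corpus.extend(confident)           # C2 = C1 + confidently labeled data
        remaining = unsure

    return train_model(corpus)
```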

Page 11

Algorithm

[Flowchart]
Training: manually label a small set of samples, segment them into phrases with POS tagging, and record the context features.
Test: for each input sentence, check the context features of the target noun, compare the evidence, and choose the sense with the most evidence as the output.
Samples classified with high comparative evidence are fed back into training.

Page 12

Comparative Evidence

Definition: choose the sense that maximizes CE, where E_i and CE_i are given by the formulas below.

CE_i: comparative evidence for sense i
n: number of senses
m: number of evidence features found in the test sentence
f_ij: frequency with which feature j is recorded in sentences containing sense i

Procedure: choose the sense with the maximum comparative evidence. If the largest CE is not larger than the second largest CE by a threshold (margin), the sentence cannot be classified.

E_i = \sum_{j=1}^{m} f_{ij}

CE_i = \frac{E_i}{\sum_{i=0}^{n-1} E_i}
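A minimal sketch of sense selection by comparative evidence as described above: E_i sums the recorded frequencies f_ij of the evidence features found in the test sentence, CE_i normalizes E_i over all senses, and a margin test rejects unclear cases. The normalization and the exact margin test are assumptions for illustration.

```python
# Minimal sketch of sense choice by comparative evidence. freq[sense][feature]
# holds f_ij from training. The normalization over all senses and the margin
# test are assumptions for illustration.

def choose_sense(test_features, freq, margin=0.1):
    """Return the winning sense, or None if the sentence cannot be classified."""
    # E_i: sum of f_ij over the evidence features found in the test sentence.
    evidence = {
        sense: sum(feature_freqs.get(f, 0) for f in test_features)
        for sense, feature_freqs in freq.items()
    }
    total = sum(evidence.values())
    if total == 0:
        return None                        # no evidence at all

    # CE_i: evidence for sense i, normalized over all senses.
    ce = {sense: e / total for sense, e in evidence.items()}
    ranked = sorted(ce.items(), key=lambda kv: kv[1], reverse=True)

    # Margin test: the best sense must beat the runner-up by the threshold.
    if len(ranked) > 1 and ranked[0][1] - ranked[1][1] < margin:
        return None                        # too close to call
    return ranked[0][0]
```

For example, with the hypothetical counts freq = {"money": {"Chase": 3, "loan": 5}, "river": {"Nile": 4}} and test features {"loan", "Chase"}, the money sense gets CE = 1.0 and is chosen.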

Page 13

Experiment Result – “tank”

[Chart: Results for word "tank" -- accuracy (%) vs. training size (20-70), Supervised Learning vs. Supervised + Unsupervised]

Page 14

Experiment Result – “bank”

[Chart: Results for word "bank" -- accuracy (%) vs. training size (10-50), Supervised Learning vs. Supervised + Unsupervised]

Page 15

Experiment Result – “bass”

[Chart: Results for word "bass" -- accuracy (%) vs. training size (10-25), Supervised Learning vs. Supervised + Unsupervised]

Page 16

Experiment Result – “country”

[Chart: Results for word "country" -- accuracy (%) vs. training size (10-40), Supervised Learning vs. Supervised + Unsupervised]

Page 17

Experiment Result – “Record”

[Chart: Results for "Record" with supervised learning -- accuracy (%) vs. training size (20-40), comparing Record1 and Record2]
Record1: "archived event" vs. "pinnacle achievement"; Record2: "archived event" vs. "musical disk"

Page 18

Conclusions and Future Work

Main advantage: bootstrapping is used to alleviate the tagging bottleneck, so no sizable sense-tagged corpus is needed.
Results show the method is successful.
Unsupervised learning helps improve results for general words, but has limitations on difficult words like "country"; it also helps reduce the amount of manual work.
Use of partial syntactic information: richer than common statistical techniques.

Proposed Improvements
Bootstrapping from bilingual corpora
Improve the evidence metric (adjust weights automatically; weight over the entire corpus and each sense; add more feature types)
Integrate WordNet

Page 19

Discussion 1: Initial Training

A good training base needs to be obtained first, i.e. initial hand tagging is required. But once training is complete, noun homograph disambiguation is fast.
However, this initial set is still large (20-30 occurrences for each sense), so the cost of tagging is still high!

Page 20

Discussion 2: Resources

Advantages of an unrestricted corpus compared to dictionaries: it includes sufficient contextual variety and can automatically integrate unfamiliar words.

Assumption: the context around an instance of a sense of the homograph is meaningfully related to that sense.

Need a semantic lexicon?
"Numerous residences, banks, and libraries" (parallel buildings)
"They use holes in trees, banks, or rocks for nests" (parallel natural objects)

Page 21

References

Marti A. Hearst (1991). Noun Homograph Disambiguation Using Local Context in Large Text Corpora.

Yarowsky (1992). Word-Sense Disambiguation Using Statistical Models of Roget's Categories Trained on Large Corpora.

Chin (1999). Word Sense Disambiguation Using Statistical Techniques.

Peh and Ng (1997). Domain-Specific Semantic Class Disambiguation Using WordNet.

Dagan, I. and Itai, A. (1994). Word Sense Disambiguation Using a Second Language Monolingual Corpus.