
Page 1: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Page 2: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Outline

• Task definition
  – What does “minimally” supervised mean?

• Bootstrapping algorithms
  – Co-training
  – Self-training
  – Yarowsky algorithm

• Using the Web for Word Sense Disambiguation
  – Web as a corpus
  – Web as collective mind

Page 3: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Task Definition

• Supervised WSD = learning sense classifiers starting with annotated data

• Minimally supervised WSD = learning sense classifiers from a small amount of annotated data, with minimal human supervision

• Examples
  – Automatically bootstrap a corpus starting with a few human-annotated examples
  – Use monosemous relatives / dictionary definitions to automatically construct sense-tagged data
  – Rely on Web users + active learning for corpus annotation

Page 4: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Outline

• Task definition
  – What does “minimally” supervised mean?

• Bootstrapping algorithms
  – Co-training
  – Self-training
  – Yarowsky algorithm

• Using the Web for Word Sense Disambiguation
  – Web as a corpus
  – Web as collective mind

Page 5: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Bootstrapping WSD Classifiers

• Build sense classifiers with little training data
  – Expand applicability of supervised WSD

• Bootstrapping approaches
  – Co-training
  – Self-training
  – Yarowsky algorithm

Page 6: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Bootstrapping Recipe

• Ingredients
  – (Some) labeled data
  – (Large amounts of) unlabeled data
  – (One or more) basic classifiers

• Output
  – A classifier that improves over the basic classifiers

Page 7: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Co-training / Self-training

• Given:
  – A set L of labeled training examples
  – A set U of unlabeled examples
  – Classifiers Ci

• 1. Create a pool of examples U'
  – choose P random examples from U

• 2. Loop for I iterations
  – Train Ci on L and label U'
  – Select the G most confident examples and add them to L
    • maintain the class distribution in L
  – Refill U' with examples from U
    • keep U' at constant size P
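The recipe above maps directly onto a short program. A minimal sketch, assuming scikit-learn-style classifiers with fit/predict_proba (all names are illustrative, and class-distribution maintenance in L is omitted for brevity):

```python
import random

def bootstrap(L, U, classifiers, P=1000, G=10, I=20):
    """Generic co-training / self-training loop.
    L: list of (features, sense) pairs; U: list of unlabeled feature
    vectors (hashable, e.g. tuples); classifiers: one classifier for
    self-training, or two trained on different views for co-training."""
    U = list(U)
    random.shuffle(U)
    pool, U = U[:P], U[P:]                 # 1. pool U' of P random examples
    for _ in range(I):                     # 2. loop for I iterations
        X, y = zip(*L)
        for clf in classifiers:
            clf.fit(list(X), list(y))      # train Ci on L
        scored = []                        # label U', record confidences
        for x in pool:
            for clf in classifiers:
                probs = clf.predict_proba([x])[0]
                scored.append((probs.max(), x, clf.classes_[probs.argmax()]))
        scored.sort(key=lambda t: t[0], reverse=True)
        added = 0
        for confidence, x, sense in scored:
            if added == G:                 # take the G most confident
                break
            if x in pool:
                pool.remove(x)
                L.append((x, sense))
                added += 1
        while len(pool) < P and U:         # refill U' to constant size P
            pool.append(U.pop())
    return classifiers
```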

Page 8: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Co-training

• (Blum and Mitchell 1998)

• Two classifiers
  – independent views
  – [the independence condition can be relaxed]

• Co-training in Natural Language Learning
  – Statistical parsing (Sarkar 2001)
  – Co-reference resolution (Ng and Cardie 2003)
  – Part-of-speech tagging (Clark, Curran and Osborne 2003)
  – ...

Page 9: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Self-training

• (Nigam and Ghani 2000)

• One single classifier

• Retrain on its own output

• Self-training for Natural Language Learning
  – Part-of-speech tagging (Clark, Curran and Osborne 2003)
  – Co-reference resolution (Ng and Cardie 2003)
    • several classifiers through bagging

Page 10: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Parameter Setting for Co-training/Self-training

• 1. Create a pool of examples U'
  – choose P random examples from U    [P: pool size]

• 2. Loop for I iterations    [I: number of iterations]
  – Train Ci on L and label U'
  – Select the G most confident examples and add them to L    [G: growth size]
    • maintain the class distribution in L
  – Refill U' with examples from U
    • keep U' at constant size P

• A major drawback of bootstrapping
  – “No principled method for selecting optimal values for these parameters” (Ng and Cardie 2003)

Page 11: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Experiments with Co-training / Self-training for WSD

• (Mihalcea 2004)

• Training / test data
  – Senseval-2 nouns (29 ambiguous nouns)
  – Average corpus size: 95 training examples, 48 test examples

• Raw data
  – British National Corpus
  – Average corpus size: 7,085 examples

• Co-training
  – Two classifiers: local and topical classifiers

• Self-training
  – One classifier: global classifier

Page 12: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Optimal Parameter Settings

• Optimized on the test set
  – Upper bound on co-training/self-training performance

• Parameter ranges
  – P = {1, 100, 500, 1000, 1500, 2000, 5000}
  – G = {1, 10, 20, 30, 40, 50, 100, 150, 200}
  – I = {1, ..., 40}

• 29 nouns → 120,000 runs

• Accuracy:
  – Basic classifier: 53.84%
  – Optimal self-training: 65.61%
  – Optimal co-training: 65.75%
  – ~25% error reduction

• Example: lady
  – basic = 61.53%
  – self-training = 84.61% [G/P/I = 20/100/39]
  – co-training = 82.05% [G/P/I = 1/1000/3]
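Finding these optimal settings amounts to an exhaustive grid search over the ranges above. A sketch, where evaluate_bootstrapping is a hypothetical stand-in for running co-training or self-training with a given setting and scoring it on the test set:

```python
from itertools import product

P_VALUES = [1, 100, 500, 1000, 1500, 2000, 5000]
G_VALUES = [1, 10, 20, 30, 40, 50, 100, 150, 200]
I_VALUES = range(1, 41)

def optimal_setting(evaluate_bootstrapping):
    """Exhaustive search over (P, G, I) for one word.
    evaluate_bootstrapping(P, G, I) -> test accuracy (hypothetical)."""
    best_acc, best = -1.0, None
    for P, G, I in product(P_VALUES, G_VALUES, I_VALUES):
        acc = evaluate_bootstrapping(P, G, I)
        if acc > best_acc:
            best_acc, best = acc, (P, G, I)
    return best_acc, best
```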

Page 13: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Empirical Parameter Settings

• How to detect parameter settings in practice?

• 20% training data → validation set

• Same range of parameter values

• Method 1: Per-word parameter setting
  – Identify the best parameter setting for each word
  – No improvement over the basic classifier
    • Basic = 53.84%
    • Co-training = 51.73%
    • Self-training = 52.88%

Page 14: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Empirical Parameter Settings

• Method 2: Overall parameter setting
  – For each parameter setting (P, G, I):
    • Determine the total relative growth in performance
    • Select the “best” setting
  – Co-training:
    • G = 1, P = 1500, I = 2
    • Basic = 53.84%, Co-training = 55.67%
  – Self-training:
    • G = 1, P = 1, I = 1
    • Basic = 53.84%, Self-training = 54.16%

Page 15: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Empirical Parameter Settings

• Method 3: Smoothed co-training
  – Combine iterations of co-training with voting (see the sketch below)
  – Effect:
    • similar learning-curve shape
    • “smoothed” learning curve
    • larger range of settings with better-than-baseline performance

• Results (avg.)
  – Basic = 53.84%
  – Co-training, global setting
    • basic = 55.67%
    • smoothed = 58.35%
  – Co-training, per-word setting
    • basic = 51.73%
    • smoothed = 56.68%

[Figure: learning curves, accuracy range 40–60%, for basic vs. smoothed co-training]
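The voting step can be pictured as follows: a sketch assuming we keep, for each co-training iteration, the labels assigned to the test examples (the data layout is hypothetical):

```python
from collections import Counter

def smoothed_labels(labels_per_iteration):
    """Majority vote over the label each example received across
    co-training iterations. labels_per_iteration: list of dicts, one
    per iteration, mapping example id -> predicted sense."""
    votes = {}
    for labels in labels_per_iteration:
        for example_id, sense in labels.items():
            votes.setdefault(example_id, Counter())[sense] += 1
    return {ex: c.most_common(1)[0][0] for ex, c in votes.items()}
```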

Page 16: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Yarowsky Algorithm

• (Yarowsky 1995)

• Similar to co-training

• Differs in the basic assumption
  – “view independence” (co-training) vs. “precision independence” (Yarowsky algorithm)
  – (Abney 2002)

• Relies on a decision list and two heuristics
  – One sense per collocation:
    • Nearby words provide strong and consistent clues to the sense of a target word
  – One sense per discourse:
    • The sense of a target word is highly consistent within a single document

Page 17: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Learning Algorithm

• A decision list is used to classify instances of the target word:

“the loss of animal and plant species through extinction …”

• Classification is based on the highest-ranking rule that matches the target context

LogL   Collocation                  Sense
…      …                            …
9.31   flower (within +/- k words)  A (living)
9.24   job (within +/- k words)     B (factory)
9.03   fruit (within +/- k words)   A (living)
9.02   plant species                A (living)
…      …                            …
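In code, a decision list is just a score-sorted rule list where the first matching rule wins. A sketch using the rules from the table above (rule forms are illustrative; in Yarowsky's method the scores are log-likelihood ratios estimated from training data):

```python
# Each rule: (score, test, sense), kept sorted by score, highest first.
# Scores correspond to log( P(sense A | collocation) / P(sense B | collocation) ).

def within_k(word, k=5):
    def test(tokens, i):
        return word in tokens[max(0, i - k):i] + tokens[i + 1:i + k + 1]
    return test

def next_word_is(word):
    def test(tokens, i):
        return i + 1 < len(tokens) and tokens[i + 1] == word
    return test

RULES = sorted([
    (9.31, within_k("flower"), "A (living)"),
    (9.24, within_k("job"), "B (factory)"),
    (9.03, within_k("fruit"), "A (living)"),
    (9.02, next_word_is("species"), "A (living)"),   # "plant species"
], key=lambda r: r[0], reverse=True)

def classify(tokens, i, default="A (living)"):
    """Return the sense given by the highest-scoring rule that matches."""
    for score, test, sense in RULES:
        if test(tokens, i):
            return sense
    return default

tokens = "the loss of animal and plant species through extinction".split()
print(classify(tokens, tokens.index("plant")))   # -> A (living)
```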

Page 18: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Bootstrapping Algorithm

• All occurrences of the target word are identified

• A small training set of seed data is tagged with word sense

[Figure: occurrences of the target word, with a small seed set labeled Sense-A: life and Sense-B: factory; all other occurrences unlabeled]

Page 19: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Bootstrapping Algorithm

• Iterative procedure (sketched in code after the convergence slide below):
  – Train the decision list algorithm on the seed set
  – Classify the residual data with the decision list
  – Create a new seed set by identifying samples that are tagged with a probability above a certain threshold
  – Retrain the classifier on the new seed set

• Selecting training seeds
  – The initial training set should accurately distinguish among the possible senses
  – Strategies:
    • Select a single, defining seed collocation for each possible sense. Ex: “life” and “manufacturing” for the target plant
    • Use words from dictionary definitions
    • Hand-label the most frequent collocates

Page 20: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Bootstrapping Algorithm

Seed set grows and residual set shrinks ….

Page 21: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Bootstrapping Algorithm

Convergence: Stop when residual set stabilizes
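Putting the last few slides together, a compact sketch of the bootstrapping loop (assuming a decision-list learner with scikit-learn-style fit/predict_proba; the confidence threshold is illustrative):

```python
def yarowsky_bootstrap(seed, residual, learner, threshold=0.95):
    """seed: list of (features, sense) pairs tagged so far;
    residual: list of untagged feature vectors;
    learner: decision-list classifier with fit/predict_proba."""
    while True:
        X, y = zip(*seed)
        learner.fit(list(X), list(y))          # train on the current seed set
        keep, moved = [], 0
        for x in residual:                     # classify the residual data
            probs = learner.predict_proba([x])[0]
            if probs.max() >= threshold:       # confident -> join the seed set
                seed.append((x, learner.classes_[probs.argmax()]))
                moved += 1
            else:
                keep.append(x)
        residual = keep
        if moved == 0:                         # residual set has stabilized
            return learner, seed, residual
```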

Page 22: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

One Sense per Discourse

Algorithm can be improved by applying “One Sense per Discourse” constraint

• After the algorithm has converged:
  – Identify tokens tagged with low confidence and relabel them with the dominant tag of their document

• After each iteration:
  – Extend a tag to all examples in a document once enough examples in that document are tagged with a single sense
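A sketch of the post-convergence variant, assuming each tagged occurrence records its document, sense, and confidence (field names and the threshold are hypothetical):

```python
from collections import Counter

def one_sense_per_discourse(occurrences, low_confidence=0.7):
    """occurrences: list of dicts with keys 'doc', 'sense', 'confidence'.
    Relabel low-confidence occurrences with the dominant sense of their
    document (the post-convergence variant described above)."""
    dominant = {}
    for occ in occurrences:
        dominant.setdefault(occ['doc'], Counter())[occ['sense']] += 1
    for occ in occurrences:
        if occ['confidence'] < low_confidence:
            occ['sense'] = dominant[occ['doc']].most_common(1)[0][0]
    return occurrences
```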

Page 23: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Evaluation

• Test corpus: extracted from a 460 million word corpus drawn from multiple sources (news articles, transcripts, novels, etc.)

• Performance compared with:
  – supervised decision lists
  – the unsupervised learning algorithm of Schütze (1992), based on the alignment of clusters with word senses

Word      Senses             Supervised   Unsupervised   Unsupervised
                                          (Schütze)      (Bootstrapping)
plant     living/factory     97.7         92             98.6
space     volume/outer       93.9         90             93.6
tank      vehicle/container  97.1         95             96.5
motion    legal/physical     98.0         92             97.9
…         …                  …            …              …
Average   –                  96.1         92.2           96.5

Page 24: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Outline

• Task definition
  – What does “minimally” supervised mean?

• Bootstrapping algorithms
  – Co-training
  – Self-training
  – Yarowsky algorithm

• Using the Web for Word Sense Disambiguation
  – Web as a corpus
  – Web as collective mind

Page 25: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

The Web as a Corpus

• Use the Web as a large textual corpus
  – Build annotated corpora using monosemous relatives
  – Bootstrap annotated corpora starting with a few seeds

• Use the (semi)automatically tagged data to train WSD classifiers

Page 26: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Monosemous Relatives

• IDEA: determine a Search Phrase (SP) which uniquely identifies the sense of a word (W#i)

1. Determine one or more Search Phrases from a machine-readable dictionary, using several heuristics
2. Search the Internet using the Search Phrases from step 1
3. Replace the Search Phrases in the examples gathered at step 2 with W#i

• Output: a sense-annotated corpus for the word sense W#i

Page 27: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Heuristics to Identify Monosemous Relatives

• Heuristic 1
  – Determine a monosemous synonym
  – remember#1 has recollect as a monosemous synonym → SP = recollect

• Heuristic 2
  – Parse the gloss and determine the set of single-phrase definitions
  – produce#5 has the definition “bring onto the market or release” → 2 definitions: “bring onto the market” and “release” → eliminate “release” as being ambiguous → SP = bring onto the market

• Heuristic 3
  – Parse the gloss and determine the set of single-phrase definitions
  – Replace the stop words with the NEAR operator
  – Strengthen the query: concatenate the words from the current synset using the AND operator
  – produce#6 has the synset {grow, raise, farm, produce} and the definition “cultivate by growing” → SP = cultivate NEAR growing AND (grow OR raise OR farm OR produce)

Page 28: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Heuristics to Identify Monosemous Relatives (continued)

• Heuristic 4
  – Parse the gloss and determine the set of single-phrase definitions
  – Keep only the head phrase
  – Strengthen the query: concatenate the words from the current synset using the AND operator
  – company#5 has the synset {party, company} and the definition “band of people associated in some activity” → SP = band of people AND (company OR party)
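Heuristic 1 is easy to reproduce with WordNet. A sketch using NLTK (assumes nltk with the WordNet data installed; sense numbering in current WordNet versions may differ from the version used in the original experiments):

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def monosemous_synonyms(word, sense_number, pos=None):
    """Heuristic 1: return synonyms of word#sense_number that have only a
    single sense of their own, so they can serve as Search Phrases."""
    synset = wn.synsets(word, pos=pos)[sense_number - 1]  # senses are 1-based
    relatives = []
    for lemma in synset.lemmas():
        name = lemma.name()
        if name.lower() == word.lower():
            continue
        if len(wn.synsets(name, pos=pos)) == 1:           # monosemous
            relatives.append(name.replace('_', ' '))
    return relatives

# e.g. candidate Search Phrases for remember#1:
print(monosemous_synonyms('remember', 1, pos=wn.VERB))
```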

Page 29: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Example

• Building annotated corpora for the noun interest:

#  Synset                         Definition
1  {interest#1, involvement}      sense of concern with and curiosity about someone or something
2  {interest#2, interestingness}  the power of attracting or holding one’s interest
3  {sake, interest#3}             reason for wanting something done
4  {interest#4}                   fixed charge for borrowing money; usually a percentage of the amount borrowed
5  {pastime, interest#5}          a subject or pursuit that occupies one’s time and thoughts
6  {interest#6, stake}            a right or legal share of something; financial involvement with something
7  {interest#7, interest group}   a social group whose members control some field of activity and who have common aims

Sense #  Search phrase(s)
1        sense of concern AND (interest OR involvement)
2        interestingness
3        reason for wanting AND (interest OR sake)
4        fixed charge AND interest
         percentage of amount AND interest
5        pastime
6        right share AND (interest OR stake)
         legal share AND (interest OR stake)
         financial involvement AND (interest OR stake)
7        interest group

Page 30: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Example

• Gathered 5,404 examples
• Checked the first 70 examples: 67 correct → 95.7% accuracy

1. I appreciate the genuine interest#1 which motivated you to write your message.
2. The webmaster of this site warrants neither accuracy, nor interest#2.
3. He forgives us not only for our interest#3, but for his own.
4. Interest#4 coverage, including rents, was 3.6x.
5. As an interest#5, she enjoyed gardening and taking part into church activities.
6. Voted on issues, they should have abstained because of direct and indirect personal interests#6 in the matters of hand.
7. The Adam Smith Society is a new interest#7 organized within the APA.

Page 31: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Experimental Evaluation

• Tests on 20 words
  – 7 nouns, 7 verbs, 3 adjectives, 3 adverbs (120 word meanings)
  – Manually checked the first 10 examples of each sense of a word → 91% accuracy
  – (Mihalcea 1999)

Word        Polysemy   Examples    Total examples   Examples manually   Correct
            count      in SemCor   acquired         checked             examples
interest    7          139         5,404            70                  67
report      7          71          4,196            70                  63
company     9          90          6,292            80                  77
school      7          146         2,490            59                  54
produce     7          148         4,982            67                  60
remember    8          166         3,573            67                  57
write       8          285         2,914            69                  67
speak       4          147         4,279            40                  39
small       14         192         10,954           107                 92
clearly     4          48          4,031            29                  28
TOTAL       120        2,582       80,741           1,080               978
(20 words)

Page 32: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Web-based Bootstrapping

• Similar to the Yarowsky algorithm
• Relies on data gathered from the Web
• (Mihalcea 2002)

1. Create a set of seeds (phrases) consisting of:
  – Sense-tagged examples in SemCor
  – Sense-tagged examples from WordNet
  – Additional sense-tagged examples, if available (created with the substitution method or the Open Mind method)
  A phrase here has:
  – At least two open-class words
  – Words involved in a semantic relation (e.g. noun phrase, verb-object, verb-subject, etc.)

2. Search the Web using queries formed with the seed expressions found at step 1
  – Add to the generated corpus a maximum of N text passages
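A small sketch of step 2, assuming a search(query) function that returns text passages for an exact-phrase query (the seed triples and the search interface are hypothetical):

```python
def harvest(seeds, search, max_passages=100):
    """seeds: list of (phrase, word, sense) triples, where the phrase
    contains the target word used in the given sense. Retrieve up to
    max_passages passages per seed and label them with the seed's sense."""
    corpus = []
    for phrase, word, sense in seeds:
        for i, passage in enumerate(search('"%s"' % phrase)):
            if i >= max_passages:
                break
            corpus.append((passage, word, sense))
    return corpus

# Hypothetical seeds for the noun "plant":
seeds = [("manufacturing plant", "plant", "plant#factory"),
         ("plant species", "plant", "plant#living")]
```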

Page 33: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

The Web as Collective Mind

• Two different views of the Web:
  – a collection of Web pages
  – a very large group of Web users

• Millions of Web users can contribute their knowledge to a data repository

• Open Mind Word Expert (Chklovski and Mihalcea, 2002)

• Fast growth rate:
  – Started in April 2002
  – Currently more than 100,000 examples of noun senses in several languages

Page 34: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

OMWE online

http://teach-computers.org

Page 35: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

Open Mind Word Expert: Quantity and Quality

• Data
  – A mix of different corpora: Treebank, Open Mind Common Sense, Los Angeles Times, British National Corpus

• Word senses
  – Based on WordNet definitions

• Active learning to select the most informative examples for learning (see the sketch below)
  – Use two classifiers trained on the existing annotated data
  – Select the items where the two classifiers disagree for human annotation

• Quality:
  – Two tags per item
  – One tag per item per contributor

• Evaluations:
  – Agreement rates of about 65%, comparable to the agreement rates obtained when collecting data for Senseval-2 with trained lexicographers
  – Replicability: tests on 1,600 examples of “interest” led to 90%+ replicability
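A sketch of the disagreement-based selection, with two scikit-learn classifiers standing in for the ones used by Open Mind Word Expert (feature extraction is omitted and the model choices are illustrative):

```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

def select_for_annotation(X_labeled, y_labeled, X_unlabeled):
    """Train two different classifiers on the existing annotated data and
    return indices of unlabeled items where they disagree; those items
    are the most informative ones to send to human annotators."""
    clf_a = MultinomialNB().fit(X_labeled, y_labeled)   # expects count features
    clf_b = LogisticRegression(max_iter=1000).fit(X_labeled, y_labeled)
    pred_a = clf_a.predict(X_unlabeled)
    pred_b = clf_b.predict(X_unlabeled)
    return [i for i, (a, b) in enumerate(zip(pred_a, pred_b)) if a != b]
```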

Page 36: Part 5. Minimally Supervised Methods for Word Sense Disambiguation

References

• (Abney 2002) Abney, S. Bootstrapping. Proceedings of ACL 2002.

• (Blum and Mitchell 1998) Blum, A. and Mitchell, T. Combining labeled and unlabeled data with co-training. Proceedings of COLT 1998.

• (Chklovski and Mihalcea 2002) Chklovski, T. and Mihalcea, R. Building a sense tagged corpus with Open Mind Word Expert. Proceedings of ACL 2002 workshop on WSD.

• (Clark, Curran and Osborne 2003) Clark, S., Curran, J.R. and Osborne, M. Bootstrapping POS taggers using unlabelled data. Proceedings of CoNLL 2003.

• (Mihalcea 1999) Mihalcea, R. An automatic method for generating sense tagged corpora. Proceedings of AAAI 1999.

• (Mihalcea 2002) Mihalcea, R. Bootstrapping large sense tagged corpora. Proceedings of LREC 2002.

• (Mihalcea 2004) Mihalcea, R. Co-training and Self-training for Word Sense Disambiguation. Proceedings of CoNLL 2004.

• (Ng and Cardie 2003) Ng, V. and Cardie, C. Weakly supervised natural language learning without redundant views. Proceedings of HLT-NAACL 2003.

• (Nigam and Ghani 2000) Nigam, K. and Ghani, R. Analyzing the effectiveness and applicability of co-training. Proceedings of CIKM 2000.

• (Sarkar 2001) Sarkar, A. Applying co-training methods to statistical parsing. Proceedings of NAACL 2001.

• (Yarowsky 1995) Yarowsky, D. Unsupervised word sense disambiguation rivaling supervised methods. Proceedings of ACL 1995.