part 4: supervised methods of word sense disambiguation

Part 4:

Supervised Methods of Word Sense Disambiguation

Outline

• What is Supervised Learning?

• Task Definition

• Single Classifiers– Naïve Bayesian Classifiers

– Decision Lists and Trees

• Ensembles of Classifiers

What is Supervised Learning?

• Collect a set of examples that illustrate the various possible classifications or outcomes of an event.

• Identify patterns in the examples associated with each particular class of the event.

• Generalize those patterns into rules.

• Apply the rules to classify a new event.

Learn from these examples :“when do I go to the store?”

Day Go to Store? Hot Outside? Slept Well? Ate Well?

1 YES YES NO NO

2 NO YES NO YES

3 YES NO NO NO

4 NO NO NO YES

Outline


• Task Definition

• Single Classifiers– Naïve Bayesian Classifiers

– Decision Lists and Trees


Task Definition

• Supervised WSD: Class of methods that induces a classifier from manually sense-tagged text using machine learning techniques.

• Resources– Sense Tagged Text

– Dictionary (implicit source of sense inventory)

– Syntactic Analysis (POS tagger, Chunker, Parser, …)

• Scope– Typically one target word per context

– Part of speech of target word resolved

– Lends itself to “lexical sample” formulation

• Reduces WSD to a classification problem where a target word is assigned the most appropriate sense from a given set of possibilities based on the context in which it occurs

Sense Tagged Text

Bonnie and Clyde are two really famous criminals, I think they were bank/1 robbers

My bank/1 charges too much for an overdraft.

I went to the bank/1 to deposit my check and get a new ATM card.

The University of Minnesota has an East and a West Bank/2 campus right on the Mississippi River.

My grandfather planted his pole in the bank/2 and got a great big catfish!

The bank/2 is pretty muddy, I can’t walk there.

Two Bags of Words(Co-occurrences in the “window of context”)

FINANCIAL_BANK_BAG:

a an and are ATM Bonnie card charges check Clyde criminals deposit famous for get I much My new overdraft really robbers the they think to too two went were

RIVER_BANK_BAG:

a an and big campus cant catfish East got grandfather great has his I in is Minnesota Mississippi muddy My of on planted pole pretty right River The the there University walk West

Simple Supervised Approach

Given a sentence S containing “bank”:

For each word Wi in S

If Wi is in FINANCIAL_BANK_BAG then Sense_1 = Sense_1 + 1;

If Wi is in RIVER_BANK_BAG thenSense_2 = Sense_2 + 1;

If Sense_1 > Sense_2 then print “Financial” else if Sense_2 > Sense_1 then print “River”

else print “Can’t Decide”;

Supervised Methodology

• Create a sample of training data where a given target word is manually annotated with a sense from a predetermined set of possibilities.– One tagged word per instance/lexical sample disambiguation

• Select a set of features with which to represent context.– co-occurrences, collocations, POS tags, verb-obj relations, etc...

• Convert sense-tagged training instances to feature vectors.• Apply a machine learning algorithm to induce a classifier.

– Form – structure or relation among features– Parameters – strength of feature interactions

• Convert a held out sample of test data into feature vectors.– “correct” sense tags are known but not used

• Apply classifier to test instances to assign a sense tag.

Outline


• Task Definition

• Naïve Bayesian Classifier

• Decision Lists and Trees


Naïve Bayesian Classifier

• Naïve Bayesian Classifier well known in Machine Learning community for good performance across a range of tasks (e.g., Domingos and Pazzani, 1997)

…Word Sense Disambiguation is no exception

• Assumes conditional independence among features, given the sense of a word.– The form of the model is assumed, but parameters are estimated

from training instances

• When applied to WSD, features are often “a bag of words” that come from the training data– Usually thousands of binary features that indicate if a word is

present in the context of the target word (or not)

Bayesian Inference

• Given observed features, what is most likely sense?

• Estimate probability of observed features given sense

• Estimate unconditional probability of sense

• Unconditional probability of features is a normalizing term, doesn’t affect sense classification

),...,3,2,1(

)()*|,...,3,2,1(),...,3,2,1| (

FnFFFp

SpSFnFFFpFnFFFSp

Naïve Bayesian Model

S

F2 F3 F4 F1 Fn

)|(*...*)|2(*)|1()|,...,2,1( SFnpSFpSpFSFnFFP

The Naïve Bayesian Classifier

– Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense)

• P(S=1) = 1,500/2000 = .75• P(S=2) = 500/2,000 = .25

– Given “credit” occurs 200 times with bank/1 and 4 times with bank/2.• P(F1=“credit”) = 204/2000 = .102• P(F1=“credit”|S=1) = 200/1,500 = .133• P(F1=“credit”|S=2) = 4/500 = .008

– Given a test instance that has one feature “credit”• P(S=1|F1=“credit”) = .133*.75/.102 = .978• P(S=2|F1=“credit”) = .008*.25/.102 = .020

)(*)|(*...*)|1(argmax SpSFnpSFpsenseSsense

Comparative Results

• (Leacock, et. al. 1993) compared Naïve Bayes with a Neural Network and a Context Vector approach when disambiguating six senses of line…

• (Mooney, 1996) compared Naïve Bayes with a Neural Network, Decision Tree/List Learners, Disjunctive and Conjunctive Normal Form learners, and a perceptron when disambiguating six senses of line…

• (Pedersen, 1998) compared Naïve Bayes with Decision Tree, Rule Based Learner, Probabilistic Model, etc. when disambiguating line and 12 other words…

• …All found that Naïve Bayesian Classifier performed as well as any of the other methods!

Outline


• Task Definition

• Naïve Bayesian Classifiers



Decision Lists and Trees

• Very widely used in Machine Learning.

• Decision trees used very early for WSD research (e.g., Kelly and Stone, 1975; Black, 1988).

• Represent disambiguation problem as a series of questions (presence of feature) that reveal the sense of a word.– List decides between two senses after one positive answer

– Tree allows for decision among multiple senses after a series of answers

• Uses a smaller, more refined set of features than “bag of words” and Naïve Bayes.– More descriptive and easier to interpret.

Decision List for WSD (Yarowsky, 1994)

• Identify collocational features from sense tagged data.

• Word immediately to the left or right of target :– I have my bank/1 statement.

– The river bank/2 is muddy.

• Pair of words to immediate left or right of target :– The world’s richest bank/1 is here in New York.

– The river bank/2 is muddy.

• Words found within k positions to left or right of target, where k is often 10-50 :– My credit is just horrible because my bank/1 has made several

mistakes with my account and the balance is very low.

Building the Decision List

• Sort order of collocation tests using log of conditional probabilities.

• Words most indicative of one sense (and not the other) will be ranked highly.

))|2(

)|1((log

inCollocatioiFSpinCollocatioiFSp

Abs

Computing DL score

– Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense)

• P(S=1) = 1,500/2,000 = .75• P(S=2) = 500/2,000 = .25

– Given “credit” occurs 200 times with bank/1 and 4 times with bank/2.

• P(F1=“credit”) = 204/2,000 = .102• P(F1=“credit”|S=1) = 200/1,500 = .133• P(F1=“credit”|S=2) = 4/500 = .008

– From Bayes Rule… • P(S=1|F1=“credit”) = .133*.75/.102 = .978• P(S=2|F1=“credit”) = .008*.25/.102 = .020

– DL Score = abs (log (.978/.020)) = 3.89

Using the Decision List

• Sort DL-score, go through test instance looking for matching feature. First match reveals sense…

DL-score Feature Sense

3.89 credit within bank Bank/1 financial

2.20 bank is muddy Bank/2 river

1.09 pole within bank Bank/2 river

0.00 of the bank N/A

Using the Decision List

CREDIT?

BANK/1 FINANCIAL IS MUDDY?

POLE? BANK/2 RIVER

BANK/2 RIVER

Learning a Decision Tree

• Identify the feature that most “cleanly” divides the training data into the known senses.– “Cleanly” measured by information gain or gain ratio. – Create subsets of training data according to feature values.

• Find another feature that most cleanly divides a subset of the training data.

• Continue until each subset of training data is “pure” or as clean as possible.

• Well known decision tree learning algorithms include ID3 and C4.5 (Quillian, 1986, 1993)

• In Senseval-1 a modified decision list (which supported some conditional branching) was most accurate for English Lexical Sample task (Yarowsky, 2000)

Supervised WSD with Individual Classifiers

• Most supervised Machine Learning algorithms have been applied to Word Sense Disambiguation, most work reasonably well.

• Features tend to differentiate among methods more than the learning algorithms.

• Good sets of features tend to include:– Co-occurrences or keywords (global)– Collocations (local)– Bigrams (local and global)– Part of speech (local)– Predicate-argument relations

• Verb-object, subject-verb,

– Heads of Noun and Verb Phrases

Convergence of Results

• Accuracy of different systems applied to the same data tends to converge on a particular value, no one system shockingly better than another.– Senseval-1, a number of systems in range of 74-78% accuracy for

English Lexical Sample task.

– Senseval-2, a number of systems in range of 61-64% accuracy for English Lexical Sample task.

– Senseval-3, a number of systems in range of 70-73% accuracy for English Lexical Sample task…

• What to do next?

Outline


• Task Definition

• Naïve Bayesian Classifiers



Ensembles of Classifiers

• Classifier error has two components (Bias and Variance)– Some algorithms (e.g., decision trees) try and build a

representation of the training data – Low Bias/High Variance

– Others (e.g., Naïve Bayes) assume a parametric form and don’t represent the training data – High Bias/Low Variance

• Combining classifiers with different bias variance characteristics can lead to improved overall accuracy

• “Bagging” a decision tree can smooth out the effect of small variations in the training data (Breiman, 1996)– Sample with replacement from the training data to learn multiple

decision trees.

– Outliers in training data will tend to be obscured/eliminated.

Ensemble Considerations

• Must choose different learning algorithms with significantly different bias/variance characteristics.– Naïve Bayesian Classifier versus Decision Tree

• Must choose feature representations that yield significantly different (independent?) views of the training data.– Lexical versus syntactic features

• Must choose how to combine classifiers. – Simple Majority Voting

– Averaging of probabilities across multiple classifier output

– Maximum Entropy combination (e.g., Klein, et. al., 2002)

Ensemble Results

• (Pedersen, 2000) achieved state of art for interest and line data using ensemble of Naïve Bayesian Classifiers.– Many Naïve Bayesian Classifiers trained on varying sized

windows of context / bags of words.

– Classifiers combined by a weighted vote

• (Florian and Yarowsky, 2002) achieved state of the art for Senseval-1 and Senseval-2 data using combination of six classifiers.– Rich set of collocational and syntactic features.

– Combined via linear combination of top three classifiers.

• Many Senseval-2 and Senseval-3 systems employed ensemble methods.

References

• (Black, 1988) E. Black (1988) An experiment in computational discrimination of English word senses. IBM Journal of Research and Development (32) pg. 185-194.

• (Breiman, 1996) L. Breiman. (1996) The heuristics of instability in model selection. Annals of Statistics (24) pg. 2350-2383.

• (Domingos and Pazzani, 1997) P. Domingos and M. Pazzani. (1997) On the Optimality of the Simple Bayesian Classifier under Zero-One Loss, Machine Learning (29) pg. 103-130.

• (Domingos, 2000) P. Domingos. (2000) A Unified Bias Variance Decomposition for Zero-One and Squared Loss. In Proceedings of AAAI. Pg. 564-569.

• (Florian an dYarowsky, 2002) R. Florian and D. Yarowsky. (2002) Modeling Consensus: Classifier Combination for Word Sense Disambiguation. In Proceedings of EMNLP, pp 25-32.

• (Kelly and Stone, 1975). E. Kelly and P. Stone. (1975) Computer Recognition of English Word Senses, North Holland Publishing Co., Amsterdam.

• (Klein, et. al., 2002) D. Klein, K. Toutanova, H. Tolga Ilhan, S. Kamvar, and C. Manning, Combining Heterogeneous Classifiers for Word-Sense Disambiguation, Proceedings of Senseval-2. pg. 87-89.

• (Leacock, et. al. 1993) C. Leacock, J. Towell, E. Voorhees. (1993) Corpus based statistical sense resolution. In Proceedings of the ARPA Workshop on Human Language Technology. pg. 260-265.

• (Mooney, 1996) R. Mooney. (1996) Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning. Proceedings of EMNLP. pg. 82-91.

• (Pedersen, 1998) T. Pedersen. (1998) Learning Probabilistic Models of Word Sense Disambiguation. Ph.D. Dissertation. Southern Methodist University.

• (Pedersen, 2000) T. Pedersen (2000) A simple approach to building ensembles of Naive Bayesian classifiers for word sense disambiguation. In Proceedings of NAACL.

• (Quillian, 1986). J.R. Quillian (1986) Induction of Decision Trees. Machine Learning (1). pg. 81-106.• (Quillian, 1993). J.R. Quillian (1993) C4.5 Programs for Machine Learning. San Francisco, Morgan Kaufmann.• (Yarowsky, 1994) D. Yarowsky. (1994) Decision lists for lexical ambiguity resolution: Application to accent

restoration in Spanish and French. In Proceedings of ACL. pp. 88-95.• (Yarowsky, 2000) D. Yarowsky. (2000) Hierarchical decision lists for word sense disambiguation. Computers and the

Humanities, 34.

part 4: supervised methods of word sense disambiguation

Documents