Part 4:
Supervised Methods of Word Sense Disambiguation
Outline
• What is Supervised Learning?
• Task Definition
• Single Classifiers
– Naïve Bayesian Classifiers
– Decision Lists and Trees
• Ensembles of Classifiers
What is Supervised Learning?
• Collect a set of examples that illustrate the various possible classifications or outcomes of an event.
• Identify patterns in the examples associated with each particular class of the event.
• Generalize those patterns into rules.
• Apply the rules to classify a new event.
Learn from these examples: “when do I go to the store?”
Day   Go to Store?   Hot Outside?   Slept Well?   Ate Well?
1     YES            YES            NO            NO
2     NO             YES            NO            YES
3     YES            NO             NO            NO
4     NO             NO             NO            YES
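The four steps above can be sketched on this table. The snippet below is a minimal illustration, not a real learning algorithm: it looks for any single yes/no feature whose value agrees with the "Go to Store?" outcome in every example. The dictionary keys and the function name `consistent_rules` are assumptions made for this sketch.

```python
# The four training examples from the table (True = YES, False = NO).
examples = [
    {"hot_outside": True,  "slept_well": False, "ate_well": False, "go_to_store": True},
    {"hot_outside": True,  "slept_well": False, "ate_well": True,  "go_to_store": False},
    {"hot_outside": False, "slept_well": False, "ate_well": False, "go_to_store": True},
    {"hot_outside": False, "slept_well": False, "ate_well": True,  "go_to_store": False},
]

def consistent_rules(examples, target="go_to_store"):
    """Return (feature, value_that_predicts_YES) pairs that fit every example."""
    rules = []
    features = [name for name in examples[0] if name != target]
    for feature in features:
        for yes_value in (True, False):
            # The rule "go to the store iff feature == yes_value" must hold everywhere.
            if all((ex[feature] == yes_value) == ex[target] for ex in examples):
                rules.append((feature, yes_value))
    return rules

rules = consistent_rules(examples)
print(rules)  # [('ate_well', False)]
```

On these four days the only rule that fits is "go to the store when I did not eat well"; the other two features do not separate the outcomes.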
Task Definition
• Supervised WSD: Class of methods that induces a classifier from manually sense-tagged text using machine learning techniques.
• Resources
– Sense Tagged Text
– Dictionary (implicit source of sense inventory)
– Syntactic Analysis (POS tagger, Chunker, Parser, …)
• Scope
– Typically one target word per context
– Part of speech of target word resolved
– Lends itself to “lexical sample” formulation
• Reduces WSD to a classification problem where a target word is assigned the most appropriate sense from a given set of possibilities based on the context in which it occurs
Sense Tagged Text
Bonnie and Clyde are two really famous criminals, I think they were bank/1 robbers
My bank/1 charges too much for an overdraft.
I went to the bank/1 to deposit my check and get a new ATM card.
The University of Minnesota has an East and a West Bank/2 campus right on the Mississippi River.
My grandfather planted his pole in the bank/2 and got a great big catfish!
The bank/2 is pretty muddy, I can’t walk there.
Two Bags of Words (Co-occurrences in the “window of context”)
FINANCIAL_BANK_BAG:
a an and are ATM Bonnie card charges check Clyde criminals deposit famous for get I much My new overdraft really robbers the they think to too two went were
RIVER_BANK_BAG:
a an and big campus cant catfish East got grandfather great has his I in is Minnesota Mississippi muddy My of on planted pole pretty right River The the there University walk West
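One way such bags could be collected from the sense-tagged sentences is sketched below. The tokenization (letters only, apostrophes dropped, case preserved) and the function name `build_bags` are assumptions of this sketch, so its output is close to, but not necessarily identical to, the bags listed above.

```python
import re
from collections import defaultdict

tagged_sentences = [
    "Bonnie and Clyde are two really famous criminals, I think they were bank/1 robbers",
    "My bank/1 charges too much for an overdraft.",
    "I went to the bank/1 to deposit my check and get a new ATM card.",
    "The University of Minnesota has an East and a West Bank/2 campus right on the Mississippi River.",
    "My grandfather planted his pole in the bank/2 and got a great big catfish!",
    "The bank/2 is pretty muddy, I can't walk there.",
]

def build_bags(sentences, target="bank"):
    """Map each sense tag to the set of words that co-occur with the target."""
    bags = defaultdict(set)
    pattern = re.compile(rf"\b{re.escape(target)}/(\d+)", re.IGNORECASE)
    for sentence in sentences:
        match = pattern.search(sentence)
        if match is None:
            continue  # no tagged target word in this sentence
        sense = match.group(1)
        # Drop the tagged target itself, then keep alphabetic tokens only.
        context = pattern.sub(" ", sentence).replace("'", "")
        bags[sense].update(re.findall(r"[A-Za-z]+", context))
    return bags

bags = build_bags(tagged_sentences)
print(sorted(bags["1"], key=str.lower))
print(sorted(bags["2"], key=str.lower))
```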
Simple Supervised Approach
Given a sentence S containing “bank”:
  Sense_1 = 0; Sense_2 = 0;
  For each word Wi in S
    If Wi is in FINANCIAL_BANK_BAG then Sense_1 = Sense_1 + 1;
    If Wi is in RIVER_BANK_BAG then Sense_2 = Sense_2 + 1;
  If Sense_1 > Sense_2 then print “Financial”
  else if Sense_2 > Sense_1 then print “River”
  else print “Can’t Decide”;
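The counting procedure above translates almost directly into Python. This is a sketch under simple assumptions: whitespace tokenization, punctuation stripped from word edges, and case-sensitive matching against the two bags exactly as listed earlier. The function name `classify` is chosen for illustration.

```python
FINANCIAL_BANK_BAG = set(
    "a an and are ATM Bonnie card charges check Clyde criminals deposit famous "
    "for get I much My new overdraft really robbers the they think to too two "
    "went were".split()
)
RIVER_BANK_BAG = set(
    "a an and big campus cant catfish East got grandfather great has his I in "
    "is Minnesota Mississippi muddy My of on planted pole pretty right River "
    "The the there University walk West".split()
)

def classify(sentence):
    """Count bag overlaps for each sense and report the larger count."""
    sense_1 = sense_2 = 0
    for word in sentence.split():
        word = word.strip(".,!?;:\"")
        if word in FINANCIAL_BANK_BAG:
            sense_1 += 1
        if word in RIVER_BANK_BAG:
            sense_2 += 1
    if sense_1 > sense_2:
        return "Financial"
    if sense_2 > sense_1:
        return "River"
    return "Can't Decide"

print(classify("I need to deposit a check at the bank"))  # Financial
print(classify("The bank is muddy and I got a catfish"))  # River
```

Words such as "a", "I", and "the" appear in both bags and add to both counters, so only the sense-specific words end up deciding the outcome.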
Supervised Methodology
• Create a sample of training data where a given target word is manually annotated with a sense from a predetermined set of possibilities.
– One tagged word per instance/lexical sample disambiguation
• Select a set of features with which to represent context.
– co-occurrences, collocations, POS tags, verb-obj relations, etc.
• Convert sense-tagged training instances to feature vectors.
• Apply a machine learning algorithm to induce a classifier.
– Form: structure or relation among features
– Parameters: strength of feature interactions
• Convert a held out sample of test data into feature vectors.
– “correct” sense tags are known but not used
• Apply classifier to test instances to assign a sense tag.
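The "convert to feature vectors" steps can be illustrated with binary co-occurrence features. The sketch below assumes contexts are already tokenized and that the vocabulary is taken from the training data only; the toy instances and the helper names `build_vocabulary` and `to_feature_vector` are made up for this example.

```python
def build_vocabulary(training_contexts):
    """Collect every word seen in the training contexts, in a fixed order."""
    return sorted({word for context in training_contexts for word in context})

def to_feature_vector(context, vocabulary):
    """Binary vector: 1 if the vocabulary word occurs in the context, else 0."""
    present = set(context)
    return [1 if word in present else 0 for word in vocabulary]

# Toy sense-tagged training instances: (context words, sense tag).
train = [
    (["my", "charges", "overdraft"], "1"),
    (["muddy", "river"], "2"),
]
vocabulary = build_vocabulary(context for context, _ in train)
train_vectors = [(to_feature_vector(c, vocabulary), sense) for c, sense in train]

# A held-out test instance is converted with the same vocabulary;
# words never seen in training ("the", "is") are simply ignored.
test_vector = to_feature_vector(["the", "river", "is", "muddy"], vocabulary)
print(vocabulary)   # ['charges', 'muddy', 'my', 'overdraft', 'river']
print(test_vector)  # [0, 1, 0, 0, 1]
```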
Naïve Bayesian Classifier
• The Naïve Bayesian Classifier is well known in the Machine Learning community for good performance across a range of tasks (e.g., Domingos and Pazzani, 1997), and Word Sense Disambiguation is no exception.
• Assumes conditional independence among features, given the sense of a word.
– The form of the model is assumed, but parameters are estimated from training instances
• When applied to WSD, features are often “a bag of words” that come from the training data
– Usually thousands of binary features that indicate if a word is present in the context of the target word (or not)
Bayesian Inference
• Given observed features, what is most likely sense?
• Estimate probability of observed features given sense
• Estimate unconditional probability of sense
• Unconditional probability of features is a normalizing term, doesn’t affect sense classification
$$p(S \mid F_1, F_2, F_3, \ldots, F_n) = \frac{p(F_1, F_2, F_3, \ldots, F_n \mid S) \times p(S)}{p(F_1, F_2, F_3, \ldots, F_n)}$$
Naïve Bayesian Model
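[Figure: the sense node S linked to the feature nodes F1, F2, F3, F4, …, Fn]
$$p(F_1, F_2, \ldots, F_n \mid S) = p(F_1 \mid S) \times p(F_2 \mid S) \times \cdots \times p(F_n \mid S)$$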
The Naïve Bayesian Classifier
– Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense)
• P(S=1) = 1,500/2,000 = .75
• P(S=2) = 500/2,000 = .25
– Given “credit” occurs 200 times with bank/1 and 4 times with bank/2.
• P(F1=“credit”) = 204/2,000 = .102
• P(F1=“credit”|S=1) = 200/1,500 = .133
• P(F1=“credit”|S=2) = 4/500 = .008
– Given a test instance that has one feature “credit”
• P(S=1|F1=“credit”) = .133*.75/.102 = .978
• P(S=2|F1=“credit”) = .008*.25/.102 = .020
$$\text{sense} = \operatorname*{argmax}_{S \in \text{senses}} \; p(F_1 \mid S) \times \cdots \times p(F_n \mid S) \times p(S)$$
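The “credit” example can be reproduced from the raw counts. The sketch below is a minimal illustration with no smoothing, so a feature missing from `feature_counts` would raise an error; the dictionary layout and the function name `posterior` are assumptions. Because it works with unrounded fractions it gives about .980 and .020, while the .978 above comes from using the rounded value .133.

```python
# Counts from the worked example: 2,000 instances of "bank".
sense_counts = {"bank/1": 1500, "bank/2": 500}
feature_counts = {"credit": {"bank/1": 200, "bank/2": 4}}

def posterior(features):
    """Return p(sense | features) under the naive independence assumption."""
    total = sum(sense_counts.values())
    scores = {}
    for sense, n_sense in sense_counts.items():
        score = n_sense / total                          # p(S)
        for feature in features:
            score *= feature_counts[feature][sense] / n_sense  # p(F | S)
        scores[sense] = score
    # Dividing by the sum plays the role of the normalizing term p(F1, ..., Fn).
    normalizer = sum(scores.values())
    return {sense: score / normalizer for sense, score in scores.items()}

probs = posterior(["credit"])
best_sense = max(probs, key=probs.get)
print({sense: round(p, 3) for sense, p in probs.items()})  # {'bank/1': 0.98, 'bank/2': 0.02}
print(best_sense)  # bank/1
```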
Comparative Results
• (Leacock et al., 1993) compared Naïve Bayes with a Neural Network and a Context Vector approach when disambiguating six senses of line.
• (Mooney, 1996) compared Naïve Bayes with a Neural Network, Decision Tree/List Learners, Disjunctive and Conjunctive Normal Form learners, and a perceptron when disambiguating six senses of line.
• (Pedersen, 1998) compared Naïve Bayes with Decision Tree, Rule Based Learner, Probabilistic Model, etc. when disambiguating line and 12 other words.
• All found that the Naïve Bayesian Classifier performed as well as any of the other methods!
Decision Lists and Trees
• Very widely used in Machine Learning.
• Decision trees used very early for WSD research (e.g., Kelly and Stone, 1975; Black, 1988).
• Represent disambiguation problem as a series of questions (presence of feature) that reveal the sense of a word.
– List decides between two senses after one positive answer
– Tree allows for decision among multiple senses after a series of answers
• Uses a smaller, more refined set of features than “bag of words” and Naïve Bayes.
– More descriptive and easier to interpret.
Decision List for WSD (Yarowsky, 1994)
• Identify collocational features from sense tagged data.
• Word immediately to the left or right of target:
– I have my bank/1 statement.
– The river bank/2 is muddy.
• Pair of words to immediate left or right of target:
– The world’s richest bank/1 is here in New York.
– The river bank/2 is muddy.
• Words found within k positions to left or right of target, where k is often 10-50:
– My credit is just horrible because my bank/1 has made several mistakes with my account and the balance is very low.
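The three kinds of collocational feature listed above could be extracted roughly as follows. This is a sketch assuming a pre-tokenized sentence and a known target position; the feature names in the returned dictionary and the function name `collocation_features` are illustrative, not taken from (Yarowsky, 1994).

```python
def collocation_features(tokens, target_index, k=10):
    """Collect simple collocational features around the target word."""
    left = tokens[:target_index]
    right = tokens[target_index + 1:]
    return {
        # Word immediately to the left or right of the target.
        "left_word": left[-1] if left else None,
        "right_word": right[0] if right else None,
        # Pair of words to the immediate left or right of the target.
        "left_pair": tuple(left[-2:]),
        "right_pair": tuple(right[:2]),
        # Any word within k positions on either side.
        "window": set(left[-k:]) | set(right[:k]),
    }

tokens = "The river bank is muddy".split()
features = collocation_features(tokens, tokens.index("bank"), k=10)
print(features["left_word"], features["right_word"])  # river is
print(features["right_pair"])                         # ('is', 'muddy')
```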
Building the Decision List
• Sort order of collocation tests using log of conditional probabilities.
• Words most indicative of one sense (and not the other) will be ranked highly.
$$\text{Abs}\left(\log \frac{p(S=1 \mid F_i = \text{Collocation}_i)}{p(S=2 \mid F_i = \text{Collocation}_i)}\right)$$
Computing DL score
– Given 2,000 instances of “bank”, 1,500 for bank/1 (financial sense) and 500 for bank/2 (river sense)
• P(S=1) = 1,500/2,000 = .75
• P(S=2) = 500/2,000 = .25
– Given “credit” occurs 200 times with bank/1 and 4 times with bank/2.
• P(F1=“credit”) = 204/2,000 = .102
• P(F1=“credit”|S=1) = 200/1,500 = .133
• P(F1=“credit”|S=2) = 4/500 = .008
– From Bayes Rule…
• P(S=1|F1=“credit”) = .133*.75/.102 = .978
• P(S=2|F1=“credit”) = .008*.25/.102 = .020
– DL Score = abs (log (.978/.020)) = 3.89 (natural log)
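The score can be checked in a couple of lines. The sketch below assumes the natural logarithm, which is what reproduces 3.89 from the rounded probabilities; the function names are made up for illustration. Computing directly from the counts (200 versus 4) gives a slightly different value, about 3.91, because nothing is rounded along the way. A count of zero would need smoothing, which this sketch leaves out.

```python
import math

def dl_score_from_probs(p_sense1, p_sense2):
    """Absolute log-ratio of the two conditional sense probabilities."""
    return abs(math.log(p_sense1 / p_sense2))

def dl_score_from_counts(count_sense1, count_sense2):
    """Same score, computed from how often the feature occurs with each sense."""
    total = count_sense1 + count_sense2
    return dl_score_from_probs(count_sense1 / total, count_sense2 / total)

rounded = dl_score_from_probs(0.978, 0.020)
exact = dl_score_from_counts(200, 4)
print(round(rounded, 2))  # 3.89
print(round(exact, 2))    # 3.91
```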
Using the Decision List
• Sort by DL-score, then go through the test instance looking for a matching feature. The first match reveals the sense.
DL-score   Feature              Sense
3.89       credit within bank   Bank/1 financial
2.20       bank is muddy        Bank/2 river
1.09       pole within bank     Bank/2 river
0.00       of the bank          N/A
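Applying the list is a first-match loop. The sketch below encodes the four rows of the table with two hypothetical test types: "within" (the word occurs anywhere in the context) and "phrase" (the words occur adjacently in that order). Rows with no sense, such as the 0.00 entry, are skipped since they carry no evidence. The encoding and the function names are assumptions of this sketch.

```python
DECISION_LIST = [
    (3.89, ("within", "credit"), "bank/1 financial"),
    (2.20, ("phrase", ["bank", "is", "muddy"]), "bank/2 river"),
    (1.09, ("within", "pole"), "bank/2 river"),
    (0.00, ("phrase", ["of", "the", "bank"]), None),
]

def contains_phrase(tokens, phrase):
    """True if the phrase occurs as a contiguous run of tokens."""
    n = len(phrase)
    return any(tokens[i:i + n] == phrase for i in range(len(tokens) - n + 1))

def classify(tokens):
    """Return the sense of the highest-scoring matching test, or None."""
    tokens = [t.lower() for t in tokens]
    for score, (kind, value), sense in sorted(DECISION_LIST, key=lambda row: -row[0]):
        if sense is None:
            continue  # uninformative entry
        matched = value in tokens if kind == "within" else contains_phrase(tokens, value)
        if matched:
            return sense
    return None

print(classify("My credit is bad because my bank made mistakes".split()))  # bank/1 financial
print(classify("The bank is muddy".split()))                               # bank/2 river
```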
Using the Decision List
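[Figure: the decision list drawn as a chain of tests. CREDIT? leads to BANK/1 FINANCIAL; otherwise IS MUDDY? leads to BANK/2 RIVER; otherwise POLE? leads to BANK/2 RIVER.]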
Learning a Decision Tree
• Identify the feature that most “cleanly” divides the training data into the known senses.
– “Cleanly” measured by information gain or gain ratio.
– Create subsets of training data according to feature values.
• Find another feature that most cleanly divides a subset of the training data.
• Continue until each subset of training data is “pure” or as clean as possible.
• Well known decision tree learning algorithms include ID3 and C4.5 (Quinlan, 1986, 1993)
• In Senseval-1 a modified decision list (which supported some conditional branching) was most accurate for English Lexical Sample task (Yarowsky, 2000)
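Information gain, the "cleanliness" measure mentioned above, can be computed as the drop in sense entropy after splitting on a feature. The sketch below uses four toy instances invented for illustration; it shows only the scoring of one split, not a full ID3 or C4.5 implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of sense labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(instances, feature):
    """Entropy of the senses minus the weighted entropy after splitting on feature."""
    labels = [sense for _, sense in instances]
    gain = entropy(labels)
    for value in {features[feature] for features, _ in instances}:
        subset = [sense for features, sense in instances if features[feature] == value]
        gain -= len(subset) / len(labels) * entropy(subset)
    return gain

# Toy data: does the context contain "credit"? does it contain "the"?
instances = [
    ({"credit": True,  "the": True},  "bank/1"),
    ({"credit": True,  "the": False}, "bank/1"),
    ({"credit": False, "the": True},  "bank/2"),
    ({"credit": False, "the": False}, "bank/2"),
]
gain_credit = information_gain(instances, "credit")
gain_the = information_gain(instances, "the")
print(gain_credit, gain_the)
```

Here "credit" splits the data into two pure subsets (gain of 1 bit), while "the" leaves both subsets as mixed as before (gain of 0), so a tree learner would test "credit" first.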
Supervised WSD with Individual Classifiers
• Most supervised Machine Learning algorithms have been applied to Word Sense Disambiguation, and most work reasonably well.
• Features tend to differentiate among methods more than the learning algorithms.
• Good sets of features tend to include:
– Co-occurrences or keywords (global)
– Collocations (local)
– Bigrams (local and global)
– Part of speech (local)
– Predicate-argument relations (verb-object, subject-verb)
– Heads of Noun and Verb Phrases
Convergence of Results
• Accuracy of different systems applied to the same data tends to converge on a particular value, with no one system shockingly better than another.
– Senseval-1, a number of systems in range of 74-78% accuracy for English Lexical Sample task.
– Senseval-2, a number of systems in range of 61-64% accuracy for English Lexical Sample task.
– Senseval-3, a number of systems in range of 70-73% accuracy for English Lexical Sample task.
• What to do next?
Ensembles of Classifiers
• Classifier error has two components (Bias and Variance)
– Some algorithms (e.g., decision trees) try to build a representation of the training data: Low Bias/High Variance
– Others (e.g., Naïve Bayes) assume a parametric form and don’t represent the training data: High Bias/Low Variance
• Combining classifiers with different bias/variance characteristics can lead to improved overall accuracy
• “Bagging” a decision tree can smooth out the effect of small variations in the training data (Breiman, 1996)
– Sample with replacement from the training data to learn multiple decision trees.
– Outliers in training data will tend to be obscured/eliminated.
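The resampling step of bagging can be sketched as follows. To keep the example short, the base learner here is a hypothetical stand-in (`learn_word_votes`, which maps each context word to the sense it most often appears with), not a decision tree; any learner that returns a classify function could be plugged in instead. The model count and seed are arbitrary choices.

```python
import random
from collections import Counter

def learn_word_votes(instances):
    """Stand-in base learner (NOT a decision tree): each word votes for its usual sense."""
    counts = {}
    for words, sense in instances:
        for word in words:
            counts.setdefault(word, Counter())[sense] += 1
    default = Counter(sense for _, sense in instances).most_common(1)[0][0]
    table = {word: c.most_common(1)[0][0] for word, c in counts.items()}

    def classify(words):
        votes = Counter(table[w] for w in words if w in table)
        return votes.most_common(1)[0][0] if votes else default

    return classify

def bag(instances, learn, n_models=11, seed=0):
    """Train one model per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    return [learn(rng.choices(instances, k=len(instances))) for _ in range(n_models)]

def predict(models, words):
    """Combine the bagged models by simple majority vote."""
    return Counter(model(words) for model in models).most_common(1)[0][0]

train = [
    (("credit", "account"), "bank/1"),
    (("deposit", "check"), "bank/1"),
    (("muddy", "river"), "bank/2"),
    (("pole", "catfish"), "bank/2"),
]
models = bag(train, learn_word_votes)
print(predict(models, ("credit", "check")))
```

Each bootstrap sample omits some training instances and repeats others, so an unusual instance influences only some of the models and tends to be outvoted.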
Ensemble Considerations
• Must choose different learning algorithms with significantly different bias/variance characteristics.
– Naïve Bayesian Classifier versus Decision Tree
• Must choose feature representations that yield significantly different (independent?) views of the training data.
– Lexical versus syntactic features
• Must choose how to combine classifiers.
– Simple Majority Voting
– Averaging of probabilities across multiple classifier output
– Maximum Entropy combination (e.g., Klein et al., 2002)
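The first two combination schemes can give different answers on the same outputs, as the sketch below shows. The three probability distributions are invented for illustration, and the function names are assumptions; Maximum Entropy combination is not covered here.

```python
from collections import Counter

def majority_vote(predictions):
    """Pick the sense predicted by the most classifiers."""
    return Counter(predictions).most_common(1)[0][0]

def average_probabilities(distributions):
    """Average each sense's probability across classifiers, then take the best."""
    senses = set().union(*distributions)
    averaged = {
        sense: sum(d.get(sense, 0.0) for d in distributions) / len(distributions)
        for sense in senses
    }
    return max(averaged, key=averaged.get), averaged

# Hypothetical outputs of three classifiers for one test instance.
outputs = [
    {"bank/1": 0.90, "bank/2": 0.10},
    {"bank/1": 0.40, "bank/2": 0.60},
    {"bank/1": 0.45, "bank/2": 0.55},
]
votes = [max(d, key=d.get) for d in outputs]
voted_sense = majority_vote(votes)
averaged_sense, averaged = average_probabilities(outputs)
print(voted_sense)     # bank/2
print(averaged_sense)  # bank/1
```

Voting counts two weak preferences for bank/2 against one strong preference for bank/1, while averaging lets the confident classifier outweigh the two uncertain ones.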
Ensemble Results
• (Pedersen, 2000) achieved state of the art for interest and line data using ensemble of Naïve Bayesian Classifiers.
– Many Naïve Bayesian Classifiers trained on varying sized windows of context / bags of words.
– Classifiers combined by a weighted vote
• (Florian and Yarowsky, 2002) achieved state of the art for Senseval-1 and Senseval-2 data using combination of six classifiers.
– Rich set of collocational and syntactic features.
– Combined via linear combination of top three classifiers.
• Many Senseval-2 and Senseval-3 systems employed ensemble methods.
References
• (Black, 1988) E. Black. (1988) An experiment in computational discrimination of English word senses. IBM Journal of Research and Development (32) pg. 185-194.
• (Breiman, 1996) L. Breiman. (1996) Heuristics of instability and stabilization in model selection. Annals of Statistics (24) pg. 2350-2383.
• (Domingos and Pazzani, 1997) P. Domingos and M. Pazzani. (1997) On the Optimality of the Simple Bayesian Classifier under Zero-One Loss. Machine Learning (29) pg. 103-130.
• (Domingos, 2000) P. Domingos. (2000) A Unified Bias Variance Decomposition for Zero-One and Squared Loss. In Proceedings of AAAI. pg. 564-569.
• (Florian and Yarowsky, 2002) R. Florian and D. Yarowsky. (2002) Modeling Consensus: Classifier Combination for Word Sense Disambiguation. In Proceedings of EMNLP. pg. 25-32.
• (Kelly and Stone, 1975) E. Kelly and P. Stone. (1975) Computer Recognition of English Word Senses. North Holland Publishing Co., Amsterdam.
• (Klein et al., 2002) D. Klein, K. Toutanova, H. Tolga Ilhan, S. Kamvar, and C. Manning. (2002) Combining Heterogeneous Classifiers for Word-Sense Disambiguation. In Proceedings of Senseval-2. pg. 87-89.
• (Leacock et al., 1993) C. Leacock, J. Towell, E. Voorhees. (1993) Corpus based statistical sense resolution. In Proceedings of the ARPA Workshop on Human Language Technology. pg. 260-265.
• (Mooney, 1996) R. Mooney. (1996) Comparative experiments on disambiguating word senses: An illustration of the role of bias in machine learning. In Proceedings of EMNLP. pg. 82-91.
• (Pedersen, 1998) T. Pedersen. (1998) Learning Probabilistic Models of Word Sense Disambiguation. Ph.D. Dissertation. Southern Methodist University.
• (Pedersen, 2000) T. Pedersen. (2000) A simple approach to building ensembles of Naive Bayesian classifiers for word sense disambiguation. In Proceedings of NAACL.
• (Quinlan, 1986) J.R. Quinlan. (1986) Induction of Decision Trees. Machine Learning (1) pg. 81-106.
• (Quinlan, 1993) J.R. Quinlan. (1993) C4.5: Programs for Machine Learning. San Francisco, Morgan Kaufmann.
• (Yarowsky, 1994) D. Yarowsky. (1994) Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of ACL. pg. 88-95.
• (Yarowsky, 2000) D. Yarowsky. (2000) Hierarchical decision lists for word sense disambiguation. Computers and the Humanities, 34.