

Word Sense Disambiguation WSD Naive Bayes Classifier Conclusion

Naive Bayes Classifier Approach to Word Sense Disambiguation

Daniel Jurafsky and James H. Martin

Chapter 20: Computational Lexical Semantics

Sections 1 to 2

Seminar in Methodology and Statistics 3/June/2009

Daniel Jurafsky and James H. Martin Naive Bayes Classifier Approach to Word Sense Disambiguation 1/18


Outline

1 Word Sense Disambiguation (WSD)
  What is WSD?
  Variants of WSD

2 Naive Bayes Classifier
  Statistics difficulty
  Get around the problem
  Assumption
  Substitution
  Intuition of Naive Bayes Classifier for WSD

3 Conclusion


What is WSD? Variants of WSD

What is WSD?

WSD is the task of automatically assigning the appropriate meaning to a polysemous word within a given context.

Polysemy is the ambiguity of an individual word or phrase that can be used (in different contexts) to express two or more different meanings.

Here WSD is discussed in relation to computational lexical semantics


Example of polysemous word

Figure: Example sentences of the polysemous word bar


Variants of generic WSD

Many WSD algorithms rely on contextual similarity to help choose the proper sense of a word in context.

Two variants of WSD include:

1 The all-words approach, and
2 The supervised, or lexical sample, approach


Unsupervised WSD approach

All words WSD approach

A system is given entire texts and a lexicon with an inventory of senses for each entry, and the system is required to disambiguate every context word in the text. Disadvantages:

1 Training data for each word in the test set may not be available
2 The approach of training one classifier per term is not practical


Supervised WSD approach

Supervised WSD approach or lexical sample WSD approach

Takes as input a word in context along with a fixed inventory of potential word senses, and outputs the correct word sense for that use.

The input data is hand-labeled with correct word senses.

Unlabeled target words in context can then be labeled using such a trained classifier.


Collecting features for Supervised WSD

Input for supervised WSD is collected in feature vectors.

A feature vector consists of numeric or nominal values that encode linguistic information as input to most ML algorithms.

Two classes of feature vectors extracted from the neighbouring context are:

1 Bag-of-words feature vectors, and
2 Collocational feature vectors


Classes of feature vectors

Bag-of-words feature vectors

These are unordered sets of words, with their exact positions ignored.
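As a minimal sketch of such a vector: the vocabulary and example context below are invented for illustration, not taken from the slides.

```python
def bag_of_words_vector(context_words, vocabulary):
    """Binary bag-of-words vector: 1 if the vocabulary word occurs
    anywhere in the context, 0 otherwise; order and position ignored."""
    context = set(context_words)
    return [1 if w in context else 0 for w in vocabulary]

# Hypothetical vocabulary and context for the target word "bass":
vocab = ["fishing", "big", "sound", "player", "fly", "guitar"]
context = "an electric guitar and bass player stand off to one side".split()
print(bag_of_words_vector(context, vocab))  # [0, 0, 0, 1, 0, 1]
```

Only "player" and "guitar" from the vocabulary appear in the context, so only those two components are 1.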


Classes of feature vectors

Collocational feature vectors

A collocation is a word or phrase in a position of specific relationship to a target word. Thus a collocation encodes information about specific positions located to the left or right of the target word.

E.g., take bass as the target word in:

An electric guitar and bass player stand off to one side, ...

A collocational feature vector extracted from a window of two words to the right and left of the target word, made up of the words themselves and their respective POS tags, that is:

[w_{i-2}, POS_{i-2}, w_{i-1}, POS_{i-1}, w_{i+1}, POS_{i+1}, w_{i+2}, POS_{i+2}]

would yield the following vector:

[guitar, NN, and, CC, player, NN, stand, VB]
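This extraction can be sketched as follows; the POS tags are assumed to be supplied already (the slides do not specify a tagger), and the token list is the slide's example sentence tagged by hand.

```python
def collocational_features(tagged_tokens, target_index, window=2):
    """Build [w_{i-2}, POS_{i-2}, w_{i-1}, POS_{i-1}, w_{i+1}, ...] from a
    window of `window` words on each side of the target word."""
    feats = []
    for offset in range(-window, window + 1):
        if offset == 0:          # skip the target word itself
            continue
        j = target_index + offset
        if 0 <= j < len(tagged_tokens):
            word, pos = tagged_tokens[j]
            feats.extend([word, pos])
    return feats

# The slide's example sentence, hand-tagged; target "bass" is at index 4.
sent = [("An", "DT"), ("electric", "JJ"), ("guitar", "NN"), ("and", "CC"),
        ("bass", "NN"), ("player", "NN"), ("stand", "VB"), ("off", "RP")]
print(collocational_features(sent, target_index=4))
# ['guitar', 'NN', 'and', 'CC', 'player', 'NN', 'stand', 'VB']
```

The output matches the vector on the slide.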


Statistics difficulty Get around the problem Assumption Substitution Intuition of Naive Bayes Classifier for WSD

Naive Bayes Classifier

Because of the feature vector annotations we can use a Naive Bayes classifier approach to WSD

This approach is based on the premise that:

Choosing the best sense s out of the set of possible senses S for a feature vector ~f amounts to choosing the most probable sense given that vector.

This is to say:

s = argmax_{s ∈ S} P(s | ~f)    (1)


Statistics difficulty

Collecting reasonable statistics for the above equation is difficult.

For example:

Consider that a binary bag-of-words vector defined over a vocabulary of 20 words would have

2^20 = 1,048,576    (2)

possible feature vectors.
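The count can be checked directly:

```python
# Each of the 20 vocabulary words contributes an independent 0/1
# component, so the number of distinct binary vectors is 2**20.
n_vectors = 2 ** 20
print(n_vectors)  # 1048576
```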


To get around the problem

Equation 1 is reformulated in the usual Bayesian manner:

s = argmax_{s ∈ S} P(~f | s) P(s) / P(~f)    (3)

Data that associates a specific ~f with each sense is sparse. But information about individual feature-value pairs in the context of specific senses is available in a tagged training set.


Assumption

We naively assume that the features are independent of one another, i.e. that the features are conditionally independent given the word sense.

This yields the following approximation for P(~f | s):

P(~f | s) ≈ ∏_{j=1}^{n} P(f_j | s)    (4)

The probability of an entire vector given a sense can thus be estimated as the product of the probabilities of its individual features given that sense.


Naive Bayes Classifier for WSD

Since P(~f) is the same for all possible senses, it does not affect the final ranking of senses. This leaves us with the following formulation when we substitute for P(~f | s) in equation 3 above:

s = argmax_{s ∈ S} P(s) ∏_{j=1}^{n} P(f_j | s)    (5)


Training a Naive Bayes Classifier

We can estimate each of the probabilities in equation 5 as shown below:

Prior probability of each sense P(s)

This probability is estimated as the proportion of instances of word w_j that carry sense s_i:

P(s_i) = count(s_i, w_j) / count(w_j)    (6)

Individual feature probabilities P(f_j | s)

P(f_j | s) = count(f_j, s) / count(s)    (7)
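The two estimates can be sketched as maximum-likelihood counts over a hand-labeled training set. The toy training data below is invented for illustration, not from the slides.

```python
from collections import Counter

def train_naive_bayes(labeled_instances):
    """MLE estimates for equations 6 and 7.
    labeled_instances: list of (sense, features) pairs, where features
    is the list of feature values observed in one labeled context."""
    sense_counts = Counter()
    pair_counts = Counter()  # count(f_j, s): feature seen with sense s
    for sense, feats in labeled_instances:
        sense_counts[sense] += 1
        for f in feats:
            pair_counts[(f, sense)] += 1
    total = sum(sense_counts.values())
    prior = {s: c / total for s, c in sense_counts.items()}      # eq. 6
    likelihood = {(f, s): c / sense_counts[s]
                  for (f, s), c in pair_counts.items()}          # eq. 7
    return prior, likelihood

# Invented hand-labeled contexts for the target word "bass":
data = [("music", ["guitar", "player"]),
        ("music", ["guitar", "sound"]),
        ("fish", ["fishing", "river"])]
prior, likelihood = train_naive_bayes(data)
print(prior["music"])                   # 2/3: two of three instances
print(likelihood[("guitar", "music")])  # 1.0: "guitar" in both music contexts
```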


Intuition of Naive Bayes Classifier for WSD

Take a target word in context.

Extract the specified features, e.g. neighbouring words, POS tags, position.

Compute P(s) ∏_{j=1}^{n} P(f_j | s) for each sense.

Return the sense associated with the highest score.
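The steps above can be sketched end to end. The probability tables are invented toy numbers, and the floor value for unseen (feature, sense) pairs is a simplification standing in for proper smoothing, which the slides do not cover.

```python
def classify(features, prior, likelihood, senses, unseen=1e-6):
    """Return argmax_s P(s) * prod_j P(f_j | s), i.e. equation 5.
    Unseen (feature, sense) pairs get a small floor probability."""
    best_sense, best_score = None, -1.0
    for s in senses:
        score = prior[s]
        for f in features:
            score *= likelihood.get((f, s), unseen)
        if score > best_score:
            best_sense, best_score = s, score
    return best_sense

# Invented toy tables for the target word "bass":
prior = {"music": 2 / 3, "fish": 1 / 3}
likelihood = {("guitar", "music"): 1.0, ("player", "music"): 0.5,
              ("fishing", "fish"): 1.0, ("river", "fish"): 1.0}
print(classify(["guitar", "player"], prior, likelihood, ["music", "fish"]))
# prints "music": 2/3 * 1.0 * 0.5 beats the fish score, which is tiny
```

In practice one would sum log probabilities instead of multiplying, to avoid numeric underflow on long feature vectors.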


Conclusion

We discussed the Naive Bayes classifier for WSD, based on Bayes' theorem, and showed that it is possible to disambiguate word senses in context. But we have not discussed:

1 Evaluation of such systems, and
2 Disambiguation of phrases

To find out, come to my TabuDag presentation.
