Supervised, Semi-supervised and Unsupervised Approaches for Word Sense Disambiguation


TRANSCRIPT

Page 1: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

SUPERVISED, SEMI-SUPERVISED AND UNSUPERVISED APPROACHES FOR WORD SENSE DISAMBIGUATION

Slides by Arindam Chatterjee & Salil Joshi
Under the guidance of Prof. Pushpak Bhattacharyya
May 01, 2010

Page 2: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

ROADMAP
1. Bird's Eye View
2. Supervised Approaches
3. Semi-supervised Approaches
4. Unsupervised Approaches
5. Summary

Page 3: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

BIRD'S EYE VIEW

[Diagram: taxonomy of WSD approaches. WSD approaches divide into Machine Learning approaches (Supervised, Semi-supervised, Unsupervised), Knowledge Based approaches, and Hybrid approaches.]

For each class of approaches, we discuss:
• The unifying thread of operation.
• Distinguishing features of the algorithms.

Page 4: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

SUPERVISED APPROACHES

Page 5: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

SUPERVISED APPROACHES

[Diagram: Training phase: a model is trained from sense-annotated data; the training instances (words) are grouped into classes, where classes = senses. Class 1 (SENSE 1): water, river; Class 2 (SENSE 2): money, finance; Class 3 (SENSE 3): blood, plasma. Testing phase: a new instance such as "money, finance" is classified into one of these classes based on its feature vector, using the model trained from the training data.]

In WSD, classes = senses.

Page 6: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

FEATURE VECTOR FOR WSD

In supervised WSD, the feature vector of a target word w consists of four features:
1. Part of speech (POS) of w.
2. Semantic & syntactic features of w.
3. Collocation vector (set of words around it): typically consists of the next word (+1), the next-to-next word (+2), the words at -2 and -1, and their POS's.
4. Co-occurrence vector (number of times w occurs in a bag of words around it).
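As an aside (not on the slide), feature 4 can be made concrete with a small Python sketch; the window size and the fixed vocabulary are assumptions:

from collections import Counter

def cooccurrence_vector(tokens, target_index, vocab, window=5):
    # Bag of words around the target: `window` words on each side, target excluded.
    lo, hi = max(0, target_index - window), target_index + 1 + window
    bag = Counter(tokens[lo:target_index] + tokens[target_index + 1:hi])
    return [bag[w] for w in vocab]  # count of each vocabulary word in the bag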

Page 7: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

SUPERVISED APPROACHES

Unifying thread of operation:
1. Use of annotated corpora.
2. They are all target-word WSD approaches.
3. Representation of words as feature vectors.

Algorithms:
1. Decision List
2. Decision Tree
3. Naïve Bayes
4. Exemplar Based Approach
5. Support Vector Machines
6. Neural Networks
7. Ensemble Methods

Page 8: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

1. DECISION LISTS

1. Based on the 'one sense per collocation' property: nearby words provide strong and consistent clues to the sense of a target word.
2. A decision list is an ordered set of if-then-else rules: if (feature X) then sense (S_i).
3. Each rule is weighted by a score.
4. In the training phase, the decision list is built from evidence in the corpus.
5. In the testing phase, the sense with the highest score wins.

Page 9: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

1. DECISION LISTS (CONTD.)

TRAINING PHASE

For a particular word:
1. Features are extracted from the corpus.
2. An ordered decision list of the form {feature-value, sense, score} is created.
3. The score of a sense S_i is the log-likelihood ratio of the sense given the feature:

   score(S_i) = max_f log( P(S_i | f) / Σ_{j≠i} P(S_j | f) )

Page 10: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

1. DECISION LISTS (CONTD.)

The decision list for the word bank (courtesy Navigli, 2009):

Feature               Prediction      Score
account with bank     bank/FINANCE    4.83
standing in bank      bank/FINANCE    3.35
bank of blood         bank/SUPPLY     2.48
work in bank          bank/FINANCE    2.33
the left river bank   bank/RIVER      1.12
of the bank           -               0.01

Test sentence: I went for a walk along the river bank. The highest-scoring rule whose feature matches the sentence fires; here the river collocation matches, so bank/RIVER is the winner sense.
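To make the testing phase concrete, here is a small illustrative Python sketch (not the slides' code); the rules mirror the table above, with the river rule shortened to the substring "river bank" so that plain substring matching finds it in the test sentence:

rules = [  # (feature, sense, score), from the decision list above
    ("account with bank", "bank/FINANCE", 4.83),
    ("standing in bank", "bank/FINANCE", 3.35),
    ("bank of blood", "bank/SUPPLY", 2.48),
    ("work in bank", "bank/FINANCE", 2.33),
    ("river bank", "bank/RIVER", 1.12),
]

def disambiguate(sentence, rules, default=None):
    # The first rule (in decreasing score order) whose feature occurs wins.
    for feature, sense, score in sorted(rules, key=lambda r: -r[2]):
        if feature in sentence:
            return sense
    return default

print(disambiguate("I went for a walk along the river bank", rules))
# -> bank/RIVER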

Page 11: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

2. SUPPORT VECTOR MACHINES

E.g., if a word has 4 senses, one binary SVM is trained per sense (A = the positive class, B = the remaining senses):

SVM   A    B
1     S1   S2, S3, S4
2     S2   S1, S3, S4
3     S3   S1, S2, S4
4     S4   S1, S2, S3

The distance from the separating hyperplane gives the confidence score for each SVM. The SVM with the highest confidence score becomes the winner sense.
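A compact sketch of this one-vs-rest setup, assuming scikit-learn (the slides name no library) and numpy arrays X (feature vectors) and y (sense labels):

import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_rest(X, y, senses):
    # One binary SVM per sense: that sense (A) vs. all other senses (B).
    return {s: LinearSVC().fit(X, (y == s).astype(int)) for s in senses}

def winner_sense(svms, x):
    # Signed distance to each hyperplane acts as the confidence score.
    scores = {s: float(clf.decision_function(x.reshape(1, -1))[0])
              for s, clf in svms.items()}
    return max(scores, key=scores.get)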

Page 12: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

3. ENSEMBLE METHODS

A collection of classifiers (C1, C2, ..., Cn) is combined to improve the overall accuracy of the WSD system.

[Diagram: the ensemble components (classifiers C1, C2, C3) each score the senses S1 and S2; a score function combines them into Total_Score(S1) and Total_Score(S2).]

For each approach, the score function varies.

Page 13: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

A. MAJORITY VOTING

Here the score function is a vote function: each ensemble component votes for one sense of the target word, and the sense with the largest number of votes is selected as the winner sense:

   Ŝ = argmax_{S_i ∈ Senses_D(w)} |{ j : vote(C_j) = S_i }|

[Diagram: C1, C2, C3 each vote for S1 or S2; the majority sense is the winner sense.]
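A one-function Python sketch of this vote (illustrative):

from collections import Counter

def majority_vote(votes):
    # votes: the sense predicted by each ensemble component.
    return Counter(votes).most_common(1)[0][0]

print(majority_vote(["S1", "S2", "S1"]))  # -> S1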

Page 14: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

B. PROBABILITY MIXTURE

The score function is a confidence score. Each classifier's confidence scores are normalized by that classifier's maximum score:

   P_{C_j}(S_i) = score(C_j, S_i) / max_k score(C_j, S_k)

The normalized scores are summed up, and the sense with the maximum sum is selected as the winner sense (normalized scores rounded to one decimal place):

Classifier   Sense   Confidence score   Normalized score
C1           S1      0.6                0.6/0.6 = 1.0
C1           S2      0.4                0.4/0.6 = 0.7
C2           S1      0.7                0.7/0.7 = 1.0
C2           S2      0.3                0.3/0.7 = 0.4
C3           S1      0.8                0.8/0.8 = 1.0
C3           S2      0.2                0.2/0.8 = 0.3

Total_Score(S1) = 1.0 + 1.0 + 1.0 = 3.0
Total_Score(S2) = 0.7 + 0.4 + 0.3 = 1.4
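The same computation as a short illustrative sketch; the confidence scores are taken from the table above:

def probability_mixture(scores):
    # scores: {classifier: {sense: confidence}}
    totals = {}
    for conf in scores.values():
        top = max(conf.values())
        for sense, c in conf.items():
            totals[sense] = totals.get(sense, 0.0) + c / top
    return max(totals, key=totals.get)

scores = {"C1": {"S1": 0.6, "S2": 0.4},
          "C2": {"S1": 0.7, "S2": 0.3},
          "C3": {"S1": 0.8, "S2": 0.2}}
print(probability_mixture(scores))
# -> S1 (3.0 vs. about 1.35; 1.4 with the slide's rounded scores)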

Page 15: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

B. PROBABILITY MIXTURE (CONTD.)

[Diagram: C1, C2, C3 with confidence/normalized score pairs (C1: 0.6/1.0 for S1, 0.4/0.7 for S2; C2: 0.7/1.0, 0.3/0.4; C3: 0.8/1.0, 0.2/0.3) feeding senses S1 and S2. S1 scores 3.0, S2 scores 1.4, so S1 is the winner sense.]

Page 16: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

C. RANK BASED COMBINATION

The score function is the rank of each sense under each classifier. The ranks are negated and summed up; the sense with the highest sum wins:

   Ŝ = argmax_{S_i ∈ Senses_D(w)} Σ_{j=1}^{m} -Rank_{C_j}(S_i)

Classifier   Sense   Rank   Negated Rank
C1           S1      1      -1
C1           S2      2      -2
C2           S1      2      -2
C2           S2      1      -1
C3           S1      1      -1
C3           S2      2      -2

Total_Score: S1 = (-1) + (-2) + (-1) = -4, S2 = (-2) + (-1) + (-2) = -5
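An illustrative sketch of negated-rank summation, reproducing the totals above:

def rank_combination(rankings):
    # rankings: {classifier: [senses ordered best-first]}
    totals = {}
    for order in rankings.values():
        for rank, sense in enumerate(order, start=1):
            totals[sense] = totals.get(sense, 0) - rank
    return max(totals, key=totals.get)

rankings = {"C1": ["S1", "S2"], "C2": ["S2", "S1"], "C3": ["S1", "S2"]}
print(rank_combination(rankings))  # -> S1 (-4 vs. -5)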

Page 17: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

C. RANK BASED COMBINATION (CONTD.)

[Diagram: C1, C2, C3 with rank/negated-rank pairs (1/-1 and 2/-2) feeding senses S1 and S2. S1 totals -4, S2 totals -5, so S1 is the winner sense.]

Page 18: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

SEMI-SUPERVISED APPROACHES

Page 19: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

SEMI-SUPERVISED APPROACHES

Where supervised approaches need large amounts of annotated data, semi-supervised approaches use minimal annotated data: the amount of data required is reduced.

Page 20: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

SEMI-SUPERVISED APPROACHES

Unifying thread of operation:
1. Use of minimal annotated corpora.
2. Use of unannotated data for tuning.

Algorithms:
1. Bootstrapping
2. Monosemous Relatives

Page 21: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

1. BOOTSTRAPPING

Page 22: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

1. BOOTSTRAPPING

[Figure: An example of Yarowsky's algorithm. At each iteration, new examples are labeled with class a or b and added to the set A of sense-tagged examples. Courtesy Navigli, 2009.]
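A skeleton of this iteration in Python, as a sketch under assumptions: train and classify are user-supplied (e.g., a decision-list learner), examples are hashable, and a fixed confidence threshold decides which newly labeled examples join the sense-tagged set A:

def self_train(seed, unlabeled, train, classify, threshold=0.9, max_iters=10):
    # seed: [(example, sense)], the initial sense-tagged set A.
    # train(labeled) -> classifier; classify(clf, x) -> (sense, confidence).
    labeled = list(seed)
    for _ in range(max_iters):
        clf = train(labeled)
        newly = {}
        for x in unlabeled:
            sense, conf = classify(clf, x)
            if conf >= threshold:
                newly[x] = sense              # confidently labeled this round
        if not newly:
            break
        labeled += list(newly.items())        # grow the sense-tagged set A
        unlabeled = [x for x in unlabeled if x not in newly]
    return train(labeled)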

Page 23: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

UNSUPERVISED APPROACHES

Page 24: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

UNSUPERVISED APPROACHES

Input data:
• Circles of different sizes and colors.
• No associated background knowledge.
• The implicit features are the size and color of the balls.

Unsupervised Approach I: clustering based on the size of the balls yields size clusters.
Unsupervised Approach II: clustering based on the color of the balls yields color clusters.

Page 25: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

HYPERLEX (1/2)

Example showing the co-occurrence graph for the context of the word वीज (electricity/lightning).

• For each high-density component, the highest-degree node is selected as a hub.
• The procedure is iterated by removing the hub with its neighbors.
• For this example, the hubs will be ज्वलन (combustion) and चमक (shine).

[Graph nodes: धन (positive), मुक्तता (discharge), प्रभार (charge), चमक (shine), वादळ (thunder), ऋण (negative), ऊर्जा (energy), उष्णता (heat), इंधन (fuel), वाफ (steam), ज्वलन (combustion), जनित्र (turbine), निर्माण (produce).]

Page 26: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

HYPERLEX (2/2)

Example:
जनित्रे वाफ वापरून वीज प्रभार निर्माण करतात.
(Turbines use steam to produce electricity.)

Scores of the context words for वीज, found using the earlier graph:

Context word         ज्वलन (combustion)   चमक (shine)
जनित्र (turbine)       0.70                 0.00
वाफ (steam)           1.00                 0.00
निर्माण (produce)      0.55                 0.00
प्रभार (charge)        0.00                 0.75
Total                 2.25                 0.75

ज्वलन becomes the winner sense in this case.
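An illustrative sketch of this scoring step, with the table's hub-to-context-word scores hard-coded as an assumed dictionary:

hub_scores = {  # score of each context word with respect to each hub
    "ज्वलन": {"जनित्र": 0.70, "वाफ": 1.00, "निर्माण": 0.55, "प्रभार": 0.00},
    "चमक":  {"जनित्र": 0.00, "वाफ": 0.00, "निर्माण": 0.00, "प्रभार": 0.75},
}

def best_hub(context_words, hub_scores):
    # Sum each hub's scores over the context words; the highest total wins.
    totals = {hub: sum(s.get(w, 0.0) for w in context_words)
              for hub, s in hub_scores.items()}
    return max(totals, key=totals.get)

print(best_hub(["जनित्र", "वाफ", "निर्माण", "प्रभार"], hub_scores))
# -> ज्वलन (total 2.25 vs. 0.75)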

Page 27: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

SUMMARY

Supervised algorithms:
1. Based on human supervision, hence the name.
2. Use corpus evidence instead of relying on knowledge bases.
3. Build classifiers to classify words, where senses are classes.

Semi-supervised algorithms:
1. Use less information than supervised approaches.
2. Create the required information as a part of the algorithm.

Unsupervised algorithms:
1. Cluster instances based on inherent features.

Page 28: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

SUMMARY

Supervised algorithms:
1. Perform better than all other approaches, especially knowledge based ones. E.g., they can pick up clues from components like proper nouns, unlike knowledge based approaches.
2. Depend heavily on large amounts of tagged data.
3. Suffer from data sparsity.

Semi-supervised algorithms:
1. Tend to partially eradicate the knowledge acquisition bottleneck.
2. Work at par with supervised approaches.

Unsupervised algorithms:
1. Performance is good only for a limited set of target words.

Page 29: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

REFERENCES
1. AGIRRE, E., AND MARTINEZ, D. Exploring automatic word sense disambiguation with decision lists and the web. In Proc. of COLING-2000 (2000).
2. BOSER, B. E., GUYON, I. M., AND VAPNIK, V. N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory (1992), pp. 144-152.
3. COST, S., AND SALZBERG, S. A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning 10, 1 (1993), 57-78.
4. ESCUDERO, G., MARQUEZ, L., AND RIGAU, G. Naive Bayes and exemplar-based approaches to word sense disambiguation revisited. Arxiv preprint cs/0007011 (2000).
5. FELLBAUM, C., ET AL. WordNet: An electronic lexical database. MIT Press, Cambridge, MA, 1998.
6. FREUND, Y., SCHAPIRE, R., AND ABE, N. A short introduction to boosting. Journal of the Japanese Society for Artificial Intelligence 14 (1999), 771-780.
7. KHAPRA, M. M., BHATTACHARYYA, P., CHAUHAN, S., NAIR, S., AND SHARMA, A. Domain specific iterative word sense disambiguation in a multilingual setting.
8. KILGARRIFF, A., AND GREFENSTETTE, G. Introduction to the special issue on the web as corpus. Computational Linguistics 29, 3 (2003), 333-347.

Page 30: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

REFERENCES (CONTD.)
9. KILGARRIFF, A., AND YALLOP, C. What's in a thesaurus? In Proceedings of the Second International Conference on Language Resources and Evaluation (2000), pp. 1371-1379.
10. LITTLESTONE, N. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning 2, 4 (1988), 285-318.
11. MALLERY, J. C. Thinking about foreign policy: Finding an appropriate role for artificially intelligent computers. Master's Thesis, MIT Political Science Department, Cambridge (1988).
12. MCCULLOCH, W. S., AND PITTS, W. A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biology 5, 4 (1943), 115-133.
13. MILLER, G., BECKWITH, R., FELLBAUM, C., GROSS, D., AND MILLER, K. J. WordNet: an on-line lexical database. International Journal of Lexicography 3, 4 (1990), 235-312.
14. NAVIGLI, R. Word sense disambiguation: A survey. ACM Comput. Surv. 41, 2 (2009).
15. NAVIGLI, R., AND VELARDI, P. Learning domain ontologies from document warehouses and dedicated web sites. Computational Linguistics 30, 2 (2004), 151-179.

Page 31: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

REFERENCES (CONTD.)
16. NG, H. T., ET AL. Exemplar-based word sense disambiguation: Some recent improvements. In Proceedings of the Second Conference on Empirical Methods in Natural Language Processing (1997), pp. 208-213.
17. PEDERSEN, T. A simple approach to building ensembles of naive Bayesian classifiers for word sense disambiguation. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference (2000), pp. 63-69.
18. QUINLAN, J. R. Induction of decision trees. Machine Learning 1, 1 (1986), 81-106.
19. QUINLAN, J. R. C4.5: Programs for machine learning. Morgan Kaufmann, 1993.
20. ROGET, P. M. Roget's International Thesaurus, 1st ed. Cromwell, New York, 1911.
21. ROTH, D., YANG, M., AND AHUJA, N. A SNoW-based face detector. In Neural Information Processing (2000), vol. 12.
22. SCHAPIRE, R. E., AND SINGER, Y. Improved boosting algorithms using confidence-rated predictions. Machine Learning 37, 3 (1999), 297-336.
23. YAROWSKY, D. Decision lists for lexical ambiguity resolution: Application to accent restoration in Spanish and French. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (1994), pp. 88-95.
24. YAROWSKY, D. Unsupervised word sense disambiguation rivaling supervised methods. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (1995), pp. 189-196.

Page 32: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

THANK YOU

Page 33: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

APPENDIX

Page 34: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

1. WSD: VARIANTS

Lexical Sample (Targeted WSD): the system is required to disambiguate a restricted set of target words, usually occurring one per sentence. Employs supervised techniques using hand-labeled instances as the training set and then an unlabeled test set.

All-words WSD: systems are expected to disambiguate all open-class words in a text (i.e., nouns, verbs, adjectives, and adverbs); these are wide-coverage systems. Suffers from the data sparseness problem, as large knowledge sources are not available.

Page 35: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

2. COLLOCATION VECTOR

• Set of words around the target word.
• Typically consists of the next word (+1), the next-to-next word (+2), the words at -2 and -1, and their POS's:
  [w_i-2, POS_i-2, w_i-1, POS_i-1, w_i+1, POS_i+1, w_i+2, POS_i+2]
• For example, the sentence "I usually have grilled bass on Sunday" and the target word bass would yield the following vector:
  [have, VB, grilled, ADJ, on, PREP, Sunday, NN]
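A minimal illustrative sketch that builds this ±2 collocation vector from a POS-tagged sentence; the tags are hard-coded to match the slide's example:

def collocation_vector(tagged, i):
    # tagged: list of (word, POS) pairs; i: index of the target word.
    feats = []
    for off in (-2, -1, +1, +2):
        j = i + off
        word, pos = tagged[j] if 0 <= j < len(tagged) else (None, None)
        feats += [word, pos]  # pad with None at sentence boundaries
    return feats

tagged = [("I", "PRON"), ("usually", "ADV"), ("have", "VB"),
          ("grilled", "ADJ"), ("bass", "NN"), ("on", "PREP"), ("Sunday", "NN")]
print(collocation_vector(tagged, 4))
# -> ['have', 'VB', 'grilled', 'ADJ', 'on', 'PREP', 'Sunday', 'NN']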

Page 36: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

3. DECISION TREES
1. Feature vectors are represented in the form of a tree.
2. The tree is built using the ID3 (C4.5) algorithm.
3. Corresponding to the input sentence, the tree is traversed.
4. The sense at the leaf node reached is the winner sense.

4. NAÏVE BAYES
Applying Bayes' rule and the naive independence assumption on the features:

   ŝ = argmax_{s ∈ senses} Pr(s) · Π_{i=1}^{n} Pr(v_i | s)

where v_1, ..., v_n are the feature values of the target word w.
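A toy sketch of this rule (illustrative, not the slides' code), with add-one smoothing added as an assumption to avoid zero probabilities:

import math
from collections import Counter, defaultdict

def train_nb(examples):
    # examples: list of (feature_values, sense) pairs.
    prior = Counter(s for _, s in examples)
    cond = defaultdict(Counter)                 # cond[sense][value] = count
    for feats, s in examples:
        cond[s].update(feats)
    return prior, cond, len(examples)

def classify_nb(feats, prior, cond, n):
    best, best_lp = None, float("-inf")
    for s in prior:
        lp = math.log(prior[s] / n)             # log Pr(s)
        total = sum(cond[s].values())
        vocab = len(cond[s]) + 1
        for v in feats:                         # log Pr(v_i | s), smoothed
            lp += math.log((cond[s][v] + 1) / (total + vocab))
        if lp > best_lp:
            best, best_lp = s, lp
    return best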

Page 37: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

5. EXEMPLAR BASED APPROACH

• Also known as the Memory Based or Instance Based Learning approach.
• Unlike other supervised approaches, it builds a classification model by keeping all the training instances in memory.
• Typically implemented using the kNN algorithm.
• Instances are represented as points in feature space.
• New examples are classified by computing the distance to all training set examples.
• The k nearest neighbors are found; the class contributing the largest number of neighbors is selected as the winner sense.

Page 38: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

5. EXEMPLAR BASED APPROACH (CONTD.)

The Hamming distance between the points is calculated using:

   Δ(x, x_i) = Σ_{j=1}^{m} w_j · δ(x_j, x_ij)

where:
• x is the instance to be classified.
• x_i is the i-th training example.
• w_j is the weight of the j-th feature, calculated using the gain ratio measure [Quinlan, 1993] or the modified value difference metric [Cost & Salzberg, 1993].
• δ(x_j, x_ij) is zero if x_j = x_ij and 1 otherwise.
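An illustrative sketch of the weighted Hamming distance and the kNN step; the feature weights are assumed to be given:

from collections import Counter

def hamming(x, xi, weights):
    # Weighted Hamming distance: sum of the weights of differing features.
    return sum(w for a, b, w in zip(x, xi, weights) if a != b)

def knn_sense(x, training, weights, k=3):
    # training: [(feature_vector, sense)]; majority sense among the k nearest.
    nearest = sorted(training, key=lambda t: hamming(x, t[0], weights))[:k]
    return Counter(s for _, s in nearest).most_common(1)[0][0]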

Page 39: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

6. NEURAL NETWORKS

• WSD is treated as a sequence labeling task.
• The class space is reduced by using WordNet's super senses instead of actual senses.
• A discriminative HMM is trained using the following features:
  – POS of w as well as the POS of neighboring words.
  – Local collocations.
  – Shape of the word and neighboring words, e.g. for s = "Merrill Lynch & Co", shape(s) = Xx*Xx*&Xx.
• Lends itself well to NER, as labels like "person", "location", "time" etc. are included in the super sense tag set.

Page 40: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

7. MONOSEMOUS RELATIVES

• Uses the web as corpus.
• Selects a seed of data from the web; the seed data is minimal.
• Then bootstraps and builds large annotated data.

Page 41: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

8. AN ITERATIVE APPROACH TO WSD

• Uses semantic relations (synonymy and hypernymy) from WordNet.
• Extracts collocational and contextual information from WordNet (gloss) and a small amount of tagged data.
• Monosemous words in the context serve as a seed set of disambiguated words.
• In each iteration, new words are disambiguated based on their semantic distance from already disambiguated words.
• It would be interesting to exploit other semantic relations available in WordNet.

Page 42: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

9. RESULTS: SUPERVISED

Page 43: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

10. RESULTS: SEMI-SUPERVISED

Page 44: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

11. RESULTS: HYBRID

Page 45: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

INTRODUCTION

Q: What is Word Sense Disambiguation (WSD)?

Example: "John has a bank account". Senses of the word "bank": Domain 1: FINANCE, Domain 2: GEOGRAPHY, Domain 3: SUPPLY. Target word: bank; context word: account. Winner sense: bank/FINANCE (cued by the context word account).

WSD: Definitions
1. Generally: WSD is the ability to identify the sense (meaning) of words in context in a computational manner.
2. Formally: WSD is a mapping A from words to senses, such that A(i) ⊆ Senses_D(w_i), where Senses_D(w_i) is the set of senses encoded in a dictionary D for word w_i, and A(i) is the subset of the senses of w_i which are appropriate in the context T.
3. As a classification problem: senses are classes.

Page 46: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

MOTIVATION

1. WSD as the heart of NLP: it feeds applications such as MT, NER, SA, SP, SRL, CLIR and TE. [Diagram: WSD at the center, linked to these applications.]

2. WSD is an AI-complete problem: it is as hard as the hardest problems in AI, like the representation of common sense.

SRL: Semantic Role Labeling; TE: Text Entailment; CLIR: Cross Lingual Information Retrieval; NER: Named Entity Recognition; MT: Machine Translation; SP: Shallow Parsing; SA: Sentiment Analysis; WSD: Word Sense Disambiguation.

Page 47: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

D. ADABOOST

i. Constructs a strong classifier as a linear combination of two or more weak classifiers.
ii. The method is adaptive because it adjusts the weak classifiers so that they correctly classify previously misclassified instances.
iii. The algorithm iterates m times if there are m classifiers.

STEPS
1. Each instance is assigned equal weight initially.
2. In each pass of the iteration, the weights of misclassified instances are increased.
3. A value α_j is calculated for each classifier C_j, as a function of its classification error.

Page 48: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

D. ADABOOST (CONTD.)

STEPS (CONTD.)
4. The classifiers are then combined by the function H for instance x:

   H(x) = sign( Σ_{j=1}^{m} α_j · C_j(x) )

• H is the strong classifier: the sign of a linear combination of the weak classifiers.
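An illustrative sketch of steps 1-4, assuming binary labels in {-1, +1} and a pool of candidate weak classifiers to choose from each round (both assumptions):

import math

def adaboost(X, y, weak_pool, rounds):
    # X: feature vectors; y: labels in {-1, +1}.
    # weak_pool: candidate weak classifiers, callables h(x) -> -1 or +1.
    n = len(X)
    w = [1.0 / n] * n                                  # step 1: equal weights
    ensemble = []
    for _ in range(rounds):                            # slide: m iterations
        # choose the weak classifier with the lowest weighted error
        h = min(weak_pool, key=lambda h: sum(
            wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi))
        err = sum(wi for wi, xi, yi in zip(w, X, y) if h(xi) != yi)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)        # step 3: alpha_j
        ensemble.append((alpha, h))
        # step 2: raise the weights of misclassified instances, renormalize
        w = [wi * math.exp(-alpha * yi * h(xi)) for wi, xi, yi in zip(w, X, y)]
        z = sum(w)
        w = [wi / z for wi in w]
    # step 4: H(x) = sign(sum_j alpha_j * C_j(x))
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1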

Page 49: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

FUTURE DIRECTIONS

1. Development of better sense recognition systems.
2. Eradication of the knowledge acquisition bottleneck.
3. More attention needs to be paid to domain-specific approaches in WSD.
4. If larger annotated corpora can be built, the accuracy of supervised approaches will rise further.

Page 50: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

2. SUPPORT VECTOR MACHINES

• An SVM is a binary classifier which finds the hyperplane with the largest margin that separates the training examples into 2 classes.
• As SVMs are binary classifiers, a separate classifier is built for each sense of the word.
• Training phase: using a tagged corpus, an SVM is trained for every sense of the word using the features.
• Testing phase: given a test sentence, a test example is constructed using the features and fed as input to each binary classifier.
• The correct sense is selected based on the label returned by each classifier.
• In case of a clash, the sense of the SVM with the higher confidence score is returned.

Page 51: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

HYBRID APPROACHES

Page 52: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

HYBRID APPROACHES

[Diagram: a knowledge base combined with human supervision (annotated data) yields the hybrid approach.]

Page 53: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

HYBRID APPROACHES

Unifying thread of operation:
1. Combine information obtained from multiple knowledge sources.
2. Use a very small amount of tagged data.

Algorithms:
1. Sense Learner
2. Iterative WSD

Page 54: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

1. SENSE LEARNER

• Uses some tagged data to build a semantic language model for words seen in the training corpus.
• Uses WordNet to derive semantic generalizations for words which are not observed in the corpus.

Semantic Language Model
• Each training example is represented as a feature vector and a class label, which is a (word, sense) pair.
• In the testing phase, a similar feature vector is constructed for each test sentence.
• The trained classifier is used to predict the word and the sense.
• If the predicted word is the same as the observed word, the predicted sense is selected as the correct sense.

Page 55: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

1. SENSE LEARNER (CONTD.)

Semantic Generalizations
• Uses semantic dependencies from WordNet.
• Labels a more general concept, higher in the WordNet hierarchy, so that more training data can be found.
• For example, if "drink water" is observed in the corpus, then using the hypernymy tree we can derive the syntactic dependency "take-in liquid".
• "take-in liquid" can then be used to disambiguate an instance of the word tea, as in "take tea", by using the hypernymy-hyponymy relations.

Page 56: Supervised, semi-supervised and unsupervised approaches for word sense disambiguation

1. BOOTSTRAPPING

I. Based on Yarowsky's supervised algorithm that uses decision lists.
II. Uses two heuristics:
  1. 'One sense per discourse': a word is referred to by the same sense throughout a discourse (document).
  2. 'One sense per collocation': nearby words provide strong and consistent clues to the sense of a target word.
III. Co-training: the classifiers are alternated between iterations. Self-training: only one classifier is used (Yarowsky).