Similarity-Based Methods for Word Sense Disambiguation
Ido Dagan, Lillian Lee, Fernando Pereira
• The problem: how to estimate the probability (and hence the likely sense) of word pairs that never appear in the training set.
Eg: "I want to be a scientist." / "Robbed the bank."
• They compared four similarity-based estimation methods:
1. KL divergence
2. Total divergence to the average
3. L1 norm
4. Confusion probability
against two well-established methods:
1. Katz's back-off scheme
2. Maximum likelihood estimation (MLE)
• Katz's back-off scheme (1987), widely used in bigram language modeling, estimates the probability of an unseen bigram by backing off to (discounted) unigram estimates.
Eg: {make, take} plans.
• Because the estimate for an unseen bigram depends only on unigram frequencies, this has the undesirable result of assigning the same probability to unseen bigrams whose component unigrams have the same frequency.
Eg: {a b} and {c b}
• In class-based methods, words of similar meaning are grouped statistically into a class.
• A group of words then has a single representative: its class.
• A word is therefore modeled by the average behavior of many words.
• When in doubt between two words, consult the training data for other words of the same classes.
Eg: {a, b, c, d, e} & {f, g, h, i}
• Because a word is modeled by the average behavior of many words, the uniqueness of the word's meaning is lost.
Eg: "Thanda" (a Hindi word meaning "cold," used generically for any cold drink)
• Initially the probability of unseen word pairs remains zero, which leads to extremely inaccurate estimates of word-pair probabilities.
Eg: "periodic table"
• Estimates from the words most compatible (most similar) to a word w are combined; the evidence provided by each word w' is weighted by a function of its compatibility with w.
• No word pair is dropped, even a very rare one, as happens in Katz's back-off scheme.
• Similarity-based estimation involves three components:
1. A scheme for deciding which word pairs require similarity based estimation.
2. A method for combining information from similar words.
3. A function measuring the similarity between words.
• The good points of Katz's back-off scheme and MLE are combined.
• The maximum likelihood estimate is P_ML(w2|w1) = c(w1, w2) / c(w1).
• The similarity-based model uses:
  P(w2|w1) = P_d(w2|w1)             if c(w1, w2) > 0 (seen pair)
  P(w2|w1) = α(w1) · P_r(w2|w1)     if c(w1, w2) = 0 (unseen pair)
  where P_d is a discounted estimate for seen pairs, P_r is the similarity-based estimate, and α(w1) is a normalization factor.
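The piecewise estimator above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical toy counts; the discounting of P_d and the computation of α(w1) are omitted, and P_r is passed in as a stub.

```python
from collections import Counter

# Hypothetical toy counts for illustration only.
bigram_counts = Counter({("make", "plans"): 3, ("take", "action"): 2})
unigram_counts = Counter({"make": 5, "take": 4, "plans": 3, "action": 2})

def p_ml(w2, w1):
    """Maximum likelihood estimate P_ML(w2|w1) = c(w1, w2) / c(w1)."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

def p_combined(w2, w1, p_r, alpha):
    """Seen pairs use the (here undiscounted) MLE; unseen pairs fall
    back to the similarity-based estimate p_r, scaled by alpha."""
    if bigram_counts[(w1, w2)] > 0:
        return p_ml(w2, w1)        # seen pair (discounting omitted)
    return alpha * p_r(w2, w1)     # unseen pair
```

In the full model, α(w1) is chosen so the conditional distribution P(·|w1) sums to one.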
• Similarity-based models assume that if a word w1' is similar to word w1, then w1' can yield information about the probability of unseen word pairs involving w1.
• The intuition is that w2 is more likely to occur with w1 if it tends to occur with the words most similar to w1.
• They used a weighted average of the evidence provided by similar words, where the weight given to a particular word depends on its similarity to w1.
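The weighted average described above can be sketched as follows. This is an illustrative skeleton: `neighbors`, `p`, and `weight` are hypothetical callables standing in for the set of similar words, the conditional probability estimate, and the similarity-derived weight function.

```python
def p_sim(w2, w1, neighbors, p, weight):
    """Similarity-based estimate of P(w2|w1): a weighted average of
    P(w2|w1') over words w1' similar to w1, with weights normalized
    to sum to one."""
    total = sum(weight(w1, w1p) for w1p in neighbors(w1))
    return sum(weight(w1, w1p) / total * p(w2, w1p)
               for w1p in neighbors(w1))
```

Because the weights are normalized, p_sim returns a proper probability whenever p does.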
• The number of words considered similar to a word w1 is capped at a threshold, because with a large training set using all words would consume very large amounts of resources.
• The number of similar words (k) and the dissimilarity threshold between words (t) are tuned experimentally.
• These word-similarity functions can be derived automatically from the statistics of the training data, as opposed to functions derived from manually constructed word classes:
1. KL divergence
2. Total divergence to the average
3. L1 norm
4. Confusion probability
• KL divergence is the standard measure of dissimilarity between two probability mass functions:
  D(w1 || w1') = Σ_w2 P(w2|w1) log [ P(w2|w1) / P(w2|w1') ]
• For D to be defined, P(w2|w1') > 0 is required whenever P(w2|w1) > 0.
• This condition often fails to hold, so smoothing is required, which is very expensive for large vocabularies.
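The definition above translates directly into code. A minimal sketch, with distributions represented as dicts mapping words to probabilities; note it raises an error exactly when the support condition fails (q[x] missing or zero while p(x) > 0), which is why smoothing is needed in practice.

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)).
    Undefined when q(x) == 0 for some x with p(x) > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)
```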
• It is a relative measure based on the total KL divergence to the average of the two distributions:
  A(w1, w1') = D(w1 || (w1 + w1')/2) + D(w1' || (w1 + w1')/2)
  where (w1 + w1')/2 denotes the pointwise average of the distributions P(·|w1) and P(·|w1').
• A(w1, w1') is bounded, ranging between 0 and 2 log 2.
• Smoothed estimates are not required, because probability ratios are not involved.
• Calculating A(w1, w1') requires summing only over those w2 for which P(w2|w1) and P(w2|w1') are both nonzero, which makes the computation quite fast.
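A small sketch of the measure, with distributions as dicts; since the average m is nonzero wherever either distribution is, no smoothing is needed, and the two bounds (0 for identical distributions, 2 log 2 for disjoint supports) follow directly.

```python
import math

def total_div_to_avg(p, q):
    """A(p, q) = D(p || m) + D(q || m), where m = (p + q) / 2.
    Bounded in [0, 2 * log 2]."""
    support = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in support}
    def d(a):  # KL divergence to the average; m[x] > 0 wherever a[x] > 0
        return sum(ax * math.log(ax / m[x]) for x, ax in a.items() if ax > 0)
    return d(p) + d(q)
```

Up to a constant factor this is the Jensen–Shannon divergence.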
• The L1 norm is defined as
  L(w1, w1') = Σ_w2 | P(w2|w1) − P(w2|w1') |
  where the sum needs to range only over those w2 with nonzero probability under either distribution.
• Like A, it is bounded, ranging between 0 and 2.
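The L1 norm is the simplest of the measures to compute; a minimal sketch with distributions as dicts:

```python
def l1_distance(p, q):
    """L(p, q) = sum_x |p(x) - q(x)|; bounded in [0, 2] for
    probability distributions (2 when the supports are disjoint)."""
    support = set(p) | set(q)
    return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)
```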
• Confusion probability estimates whether a word w1' can be substituted for word w1.
• Unlike D, A, and L, under this measure w1 may not be "closest" to itself; that is, there may exist a word w1' such that P_C(w1'|w1) > P_C(w1|w1).
• Because the sense distinctions provided by a dictionary may be too fine or too coarse, and sense-tagging the training data correctly would require a large amount of resources, the experiment was done on pseudo-words.
Eg: {make, take} plans vs. {make, take} action, where {make, take} is a pseudo-word tested with the objects "plans" and "action".
• Each method in the experiment is given a noun and two verbs, and must decide which verb is more likely to take the noun as its direct object.
• The experiment used 587,833 bigrams to build the bigram language model.
• Testing used 17,152 unseen bigrams, divided into five equal parts, T1 to T5.
• Error rate was used as the performance metric.
• Because back-off consistently performed worse than MLE, back-off was dropped from the comparisons.
• Because the experiments used only unsmoothed data, KL divergence was also excluded.
• Similarity-based methods performed up to 40% better than the back-off and MLE methods.
• Singletons should not be omitted from the training data for similarity-based methods.
• The total-divergence-to-the-average method (A) performed best in all cases.