Similarity-Based Methods for Word Sense Disambiguation
Ido Dagan, Lillian Lee, Fernando Pereira
• The problem: how to estimate the probability (and hence the likely sense) of word pairs that never appear in the training set.
Eg: "I want to be a scientist." / "Robbed the bank."
• They compared four similarity-based estimation methods:
1. KL divergence
2. Total divergence to the average
3. L1 norm
4. Confusion probability
against two well-established methods:
1. Katz's back-off scheme
2. Maximum likelihood estimation (MLE)
• Katz's back-off scheme (1987), widely used in bigram language modeling, estimates the probability of an unseen bigram by backing off to (discounted) unigram estimates.
Eg: {make, take} plans.
• Because the estimate for an unseen bigram depends only on unigram frequencies, this has the undesirable result of assigning the same probability to unseen bigrams whose component unigrams have the same frequency.
Eg: {a b} and {c b}
• In class-based methods, words of similar meaning are grouped statistically into a class.
• A group of words then has a single representative: its class.
• A word is therefore modeled by the average behavior of many words.
• When in doubt between two words, consult the training data for other words of the same classes.
Eg: {a, b, c, d, e} & {f, g, h, i}
• Because a word is modeled by the average behavior of many words, the uniqueness of the word's meaning is lost.
Eg: "Thanda" (a Hindi word meaning "cold," used generically for any cold drink)
• Initially the probability of unseen word pairs remains zero, which leads to extremely inaccurate estimates of word-pair probabilities.
Eg: "periodic table"
• Estimates from the words most compatible (most similar) to a word w are combined; the evidence provided by each word w' is weighted by a function of its compatibility with w.
• No word pair is dropped, even a very rare one, as happens in Katz's back-off scheme.
• Similarity-based estimation involves three components:
1. A scheme for deciding which word pairs require similarity based estimation.
2. A method for combining information from similar words.
3. A function measuring the similarity between words.
• The good points of Katz's back-off scheme and MLE are combined.
• The maximum likelihood estimate is P_ML(w2|w1) = c(w1, w2) / c(w1).
• The similarity-based model uses:
  P(w2|w1) = P_d(w2|w1)             if c(w1, w2) > 0 (seen pair)
  P(w2|w1) = α(w1) · P_r(w2|w1)     if c(w1, w2) = 0 (unseen pair)
  where P_d is a discounted estimate for seen pairs, P_r is the similarity-based estimate, and α(w1) is a normalization factor.
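The piecewise estimator above can be sketched in a few lines of Python. This is a minimal illustration with hypothetical toy counts; the discounting of P_d and the computation of α(w1) are omitted, and P_r is passed in as a stub.

```python
from collections import Counter

# Hypothetical toy counts for illustration only.
bigram_counts = Counter({("make", "plans"): 3, ("take", "action"): 2})
unigram_counts = Counter({"make": 5, "take": 4, "plans": 3, "action": 2})

def p_ml(w2, w1):
    """Maximum likelihood estimate P_ML(w2|w1) = c(w1, w2) / c(w1)."""
    return bigram_counts[(w1, w2)] / unigram_counts[w1]

def p_combined(w2, w1, p_r, alpha):
    """Seen pairs use the (here undiscounted) MLE; unseen pairs fall
    back to the similarity-based estimate p_r, scaled by alpha."""
    if bigram_counts[(w1, w2)] > 0:
        return p_ml(w2, w1)        # seen pair (discounting omitted)
    return alpha * p_r(w2, w1)     # unseen pair
```

In the full model, α(w1) is chosen so the conditional distribution P(·|w1) sums to one.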
• Similarity-based models assume that if a word w1' is similar to word w1, then w1' can yield information about the probability of unseen word pairs involving w1.
• The intuition is that w2 is more likely to occur with w1 if it tends to occur with the words most similar to w1.
• They used a weighted average of the evidence provided by similar words, where the weight given to a particular word depends on its similarity to w1.
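The weighted average described above can be sketched as follows. This is an illustrative skeleton: `neighbors`, `p`, and `weight` are hypothetical callables standing in for the set of similar words, the conditional probability estimate, and the similarity-derived weight function.

```python
def p_sim(w2, w1, neighbors, p, weight):
    """Similarity-based estimate of P(w2|w1): a weighted average of
    P(w2|w1') over words w1' similar to w1, with weights normalized
    to sum to one."""
    total = sum(weight(w1, w1p) for w1p in neighbors(w1))
    return sum(weight(w1, w1p) / total * p(w2, w1p)
               for w1p in neighbors(w1))
```

Because the weights are normalized, p_sim returns a proper probability whenever p does.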
• The number of words considered similar to a word w1 is capped at a threshold, because with a large training set using all words would consume very large amounts of resources.
• The number of similar words (k) and the dissimilarity threshold between words (t) are tuned experimentally.
• These word-similarity functions can be derived automatically from the statistics of the training data, as opposed to functions derived from manually constructed word classes:
1. KL divergence
2. Total divergence to the average
3. L1 norm
4. Confusion probability
• KL divergence is the standard measure of dissimilarity between two probability mass functions:
  D(w1 || w1') = Σ_w2 P(w2|w1) log [ P(w2|w1) / P(w2|w1') ]
• For D to be defined, P(w2|w1') > 0 is required whenever P(w2|w1) > 0.
• This condition often fails to hold, so smoothing is required, which is very expensive for large vocabularies.
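The definition above translates directly into code. A minimal sketch, with distributions represented as dicts mapping words to probabilities; note it raises an error exactly when the support condition fails (q[x] missing or zero while p(x) > 0), which is why smoothing is needed in practice.

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log(p(x) / q(x)).
    Undefined when q(x) == 0 for some x with p(x) > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)
```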
• It is a relative measure based on the total KL divergence to the average of the two distributions:
  A(w1, w1') = D(w1 || (w1 + w1')/2) + D(w1' || (w1 + w1')/2)
  where (w1 + w1')/2 denotes the pointwise average of the distributions P(·|w1) and P(·|w1').
• A(w1, w1') is bounded, ranging between 0 and 2 log 2.
• Smoothed estimates are not required, because probability ratios are not involved.
• Calculating A(w1, w1') requires summing only over those w2 for which P(w2|w1) and P(w2|w1') are both nonzero, which makes the computation quite fast.
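A small sketch of the measure, with distributions as dicts; since the average m is nonzero wherever either distribution is, no smoothing is needed, and the two bounds (0 for identical distributions, 2 log 2 for disjoint supports) follow directly.

```python
import math

def total_div_to_avg(p, q):
    """A(p, q) = D(p || m) + D(q || m), where m = (p + q) / 2.
    Bounded in [0, 2 * log 2]."""
    support = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in support}
    def d(a):  # KL divergence to the average; m[x] > 0 wherever a[x] > 0
        return sum(ax * math.log(ax / m[x]) for x, ax in a.items() if ax > 0)
    return d(p) + d(q)
```

Up to a constant factor this is the Jensen–Shannon divergence.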
• The L1 norm is defined as
  L(w1, w1') = Σ_w2 | P(w2|w1) − P(w2|w1') |
  where the sum needs to range only over those w2 with nonzero probability under either distribution.
• Like A, it is bounded, ranging between 0 and 2.
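The L1 norm is the simplest of the measures to compute; a minimal sketch with distributions as dicts:

```python
def l1_distance(p, q):
    """L(p, q) = sum_x |p(x) - q(x)|; bounded in [0, 2] for
    probability distributions (2 when the supports are disjoint)."""
    support = set(p) | set(q)
    return sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)
```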
• Confusion probability estimates whether a word w1' can be substituted for word w1.
• Unlike D, A, and L, under this measure w1 may not be "closest" to itself; that is, there may exist a word w1' such that P_C(w1'|w1) > P_C(w1|w1).
• Because the sense distinctions provided by a dictionary may be too fine or too coarse, and sense-tagging the training data correctly would require a large amount of resources, the experiment was done on pseudo-words.
Eg: {make, take} plans vs. {make, take} action, where {make, take} is a pseudo-word tested with the objects "plans" and "action".
• Each method in the experiment is given a noun and two verbs, and must decide which verb is more likely to take the noun as its direct object.
• The experiment used 587,833 bigrams to build the bigram language model.
• Testing used 17,152 unseen bigrams, divided into five equal parts, T1 to T5.
• Error rate was used as the performance metric.
• Because back-off consistently performed worse than MLE, back-off was dropped from the comparisons.
• Because the experiments used only unsmoothed data, KL divergence was also excluded.
• Similarity-based methods performed up to 40% better than the back-off and MLE methods.
• Singletons should not be omitted from the training data for similarity-based methods.
• The total-divergence-to-the-average method (A) performed best in all cases.