Multilingual Word Sense Disambiguation using Wikipedia

Bharath Dandala (University of North Texas)

Rada Mihalcea (University of North Texas)

Razvan Bunescu (Ohio University)

IJCNLP, Oct 16, 2013

Word Sense Disambiguation

• Select the correct sense of a word based on the context:
  – The word "bar" has multiple senses:
    • bar (counter)
    • bar (law)
    • bar (landform)
    • bar (establishment)
    • bar (music)

Sumner was admitted to the bar at the age of twenty-three, and entered private practice in Boston.

Word Sense Disambiguation

• Use a repository of senses such as WordNet:
  – Static resource, short glosses, too fine-grained.
• Unsupervised:
  – Similarity between context and sense definition or gloss.
• Supervised:
  – Train on text manually tagged with word senses.
    • Limited amount of manually labeled data.
• Use Wikipedia for WSD:
  – Large sense repository, continuously growing.
  – Large training dataset.
  – Support for multilingual WSD.

Three WSD Systems

• WikiMonoSense:
  – Address the sense-tagged data bottleneck problem by using Wikipedia hyperlinks as a source of sense annotations.
  – Use the sense-annotated corpora to train monolingual WSD classifiers.
• WikiTransSense:
  – The sense-tagged corpus extracted for the reference language is machine translated into a number of supporting languages.
  – The word alignments between the reference sentences and the supporting translations are used to generate complementary features in our first approach to multilingual WSD.

Three WSD Systems

• WikiMuSense:
  – The reliance on machine translation (MT) is significantly reduced during training for this second approach to multilingual WSD.
  – Sense-tagged corpora in the supporting languages are created through the interlingual links available in Wikipedia.

Wikipedia for WSD

Wiki markup:
'''Palermo''' is a city in [[Southern Italy]], the [[capital city | capital]] of the [[autonomous area | autonomous region]] of [[Sicily]].

Rendered HTML:
Palermo is a city in Southern Italy, the capital of the autonomous region of Sicily.

Possible link targets for the anchor word "capital":
• capital city
• capital (economics)
• financial capital
• human capital
• capital (architecture)
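The link markup above already pairs an anchor word with a sense label. A minimal sketch of how such (anchor, target) pairs might be pulled out of raw wiki text follows; the regular expression and the function name `extract_links` are illustrative assumptions, not the authors' extraction code, and templates or nested links are not handled.

```python
import re

# Matches [[target]] and [[target | anchor]] (assumed simplified link syntax).
WIKILINK = re.compile(r"\[\[([^\[\]|]+?)(?:\|([^\[\]]+?))?\]\]")

def extract_links(wikitext):
    """Return (anchor, target) pairs for every internal link in the markup."""
    pairs = []
    for match in WIKILINK.finditer(wikitext):
        target = match.group(1).strip()
        # When no anchor text is given, the title itself is the anchor.
        anchor = (match.group(2) or target).strip()
        pairs.append((anchor, target))
    return pairs

sentence = ("'''Palermo''' is a city in [[Southern Italy]], the "
            "[[capital city | capital]] of the "
            "[[autonomous area | autonomous region]] of [[Sicily]].")
links = extract_links(sentence)
```

Here the anchor "capital" comes out paired with the title "capital city", which is the sense annotation used as a training label.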

A Monolingual Dataset through Wikipedia Links

1) Collect all WP titles that are linked from the anchor word "bar".
   => Bar (law), Bar (music), Bar (establishment), …
2) Create a sense repository from all titles that have sufficient support in WP (ignore named entities, resolve redirects):
   => {Bar (law), Bar (music), Bar (establishment), Bar (counter), Bar (landform)}

Use a subset of ambiguous words from Senseval 2 & 3:
• Avoid words with only one Wikipedia label.
   => English (30), Spanish (25), Italian (25), German (25).
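The two steps above can be sketched roughly as follows. The threshold `min_support`, the redirect table, and the example counts are hypothetical, since the slides do not say what counts as "sufficient support"; named-entity filtering is left out.

```python
from collections import Counter

def build_sense_repository(link_pairs, word, redirects=None, min_support=2):
    """Collect titles linked from `word` and keep those seen often enough.

    link_pairs:  iterable of (anchor, target_title) pairs
    redirects:   optional map from redirect title to canonical title
    min_support: assumed cut-off for "sufficient support"
    """
    redirects = redirects or {}
    counts = Counter()
    for anchor, target in link_pairs:
        if anchor.lower() != word.lower():
            continue
        # Resolve redirects so that all links to one article share a label.
        counts[redirects.get(target, target)] += 1
    return {title for title, n in counts.items() if n >= min_support}

# Made-up link counts, for illustration only.
pairs = ([("bar", "Bar (law)")] * 3 + [("bar", "Bar (music)")] * 2 +
         [("bar", "Pub")] * 2 + [("bar", "Bar (establishment)")] +
         [("bar", "Bar (rare)")] + [("Bar", "Bar (law)")])
repository = build_sense_repository(
    pairs, "bar", redirects={"Pub": "Bar (establishment)"})
```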

The WikiMonoSense Learning Framework

• For each word, use WP links as examples and train a classifier to distinguish between alternative senses:
  – Each WP sense acts as a different label in the classification model.
• Each word context is represented as a vector of features:
  – Current word and its part-of-speech.
  – Local context of three words to the left and to the right.
  – Parts-of-speech of the surrounding words.
  – Verb and noun before and after the ambiguous word.
  – A global context implemented through sense-specific keywords, determined as a list of all words occurring at least three times in the contexts defining a certain word sense.
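A rough sketch of the feature set listed above, assuming the context is already tokenized and tagged as (word, POS) pairs. The tag names NOUN and VERB and the feature naming scheme are assumptions made for illustration; the slides do not specify a tagger or an encoding.

```python
from collections import Counter

def sense_keywords(contexts_by_sense, min_count=3):
    """Per sense, the words seen at least `min_count` times in its contexts."""
    keywords = {}
    for sense, contexts in contexts_by_sense.items():
        counts = Counter(w.lower() for tokens in contexts for w, _pos in tokens)
        keywords[sense] = {w for w, n in counts.items() if n >= min_count}
    return keywords

def extract_features(tokens, index, keywords, window=3):
    """tokens: list of (word, pos); index: position of the ambiguous word."""
    word, pos = tokens[index]
    feats = {"word=" + word.lower(): 1, "pos=" + pos: 1}
    # Local context: words and POS tags within `window` positions.
    for offset in range(-window, window + 1):
        j = index + offset
        if offset == 0 or not 0 <= j < len(tokens):
            continue
        feats[f"w[{offset}]={tokens[j][0].lower()}"] = 1
        feats[f"p[{offset}]={tokens[j][1]}"] = 1
    # Nearest noun and verb before and after the ambiguous word.
    for tag in ("NOUN", "VERB"):
        before = [w for w, p in tokens[:index] if p == tag]
        after = [w for w, p in tokens[index + 1:] if p == tag]
        if before:
            feats[f"{tag}_before={before[-1].lower()}"] = 1
        if after:
            feats[f"{tag}_after={after[0].lower()}"] = 1
    # Global context: sense-specific keywords present in this sentence.
    present = {w.lower() for w, _pos in tokens}
    for sense, words in keywords.items():
        for w in words & present:
            feats[f"kw[{sense}]={w}"] = 1
    return feats

tokens = [("Sumner", "PROPN"), ("was", "AUX"), ("admitted", "VERB"),
          ("to", "ADP"), ("the", "DET"), ("bar", "NOUN"),
          ("at", "ADP"), ("the", "DET"), ("age", "NOUN")]
keywords = sense_keywords({"Bar (law)": [tokens] * 3})
features = extract_features(tokens, 5, keywords)
```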

A Multilingual Dataset through Machine Translation

• Treat each of the 4 languages as a reference language:
  – Use Google Translate to translate the data from the reference language into the other 3 supporting languages.
• Translate into French as an additional supporting language.
  => each reference sentence is translated into 4 supporting languages.

En: An airline seat is a chair on an airliner in which passengers are accommodated for the duration of the journey.
De: Ein Flugzeugsitz ist ein Stuhl auf einem Flugzeug, in dem Passagiere für die Dauer der Reise untergebracht sind.

En: For a year after graduation, Stanley served as chair of belles-lettres at Christian College in Hustonville.
De: Seit einem Jahr nach dem Abschluss, diente Stanley als Vorsitzender Belletristik bei Christian College in Hustonville.

Benefits of Machine Translation

1) Knowledge of the target word translation can help in disambiguation:
   – Two different senses of the target ambiguous word may be translated into different words in the supporting language.
   – Assuming access to word alignments.
2) Features extracted from the translated sentence can be used to enrich the feature space:
   – For example, the two senses "(unit)" and "(establishment)" of the English word "bar" translate to the same German word "Bar".
   – In cases like this, words in the context of the German translation may help in identifying the correct English meaning.
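As an illustration of the first benefit, the sketch below reads the translation of the target word off a word alignment and turns it into a feature. The alignment format (pairs of token indices), the shortened sentences, and the function name are assumptions; a real system would obtain alignments from an aligner or from the MT output.

```python
def translation_feature(target_index, translation_tokens, alignment, lang):
    """Feature(s) naming the word(s) aligned to the ambiguous target word.

    alignment: list of (reference_index, translation_index) pairs (assumed).
    """
    aligned = [translation_tokens[j] for i, j in alignment if i == target_index]
    return {f"{lang}_trans={w.lower()}": 1 for w in aligned}

# "chair" as furniture: aligned to "Stuhl".
en_1 = "An airline seat is a chair on an airliner".split()
de_1 = "Ein Flugzeugsitz ist ein Stuhl auf einem Flugzeug".split()
feat_1 = translation_feature(en_1.index("chair"), de_1, [(5, 4)], "de")

# "chair" as a position: aligned to "Vorsitzender".
en_2 = "Stanley served as chair of belles-lettres".split()
de_2 = "diente Stanley als Vorsitzender Belletristik".split()
feat_2 = translation_feature(en_2.index("chair"), de_2, [(3, 3)], "de")
```

The two senses of "chair" yield different feature names, which is exactly the signal a classifier can exploit.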

The WikiTransSense Learning Framework

• Extract the same type of features $\Phi$ as in WikiMonoSense.

• Append features from supporting languages to the vector of features from the reference language:
  – $\Phi'_{EN} = [\Phi_{EN} \mid \Phi_{SP} ; \Phi_{IT} ; \Phi_{DE} ; \Phi_{FR}]$.

• Train a multilingual WSD classifier using the augmented feature vectors.
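One simple way to realize the augmented vector with sparse features is to prefix every feature name with its language, so that features from different languages never collide. The sketch below assumes features are stored as dictionaries; the prefix scheme is an illustrative choice, not taken from the slides.

```python
def augment_features(reference_lang, reference_feats, supporting_feats):
    """Concatenate reference and supporting feature dictionaries.

    supporting_feats: {language_code: feature_dict} for each translation.
    """
    combined = {f"{reference_lang}:{name}": value
                for name, value in reference_feats.items()}
    for lang, feats in supporting_feats.items():
        for name, value in feats.items():
            combined[f"{lang}:{name}"] = value
    return combined

phi_en = {"word=bar": 1, "w[-1]=the": 1}
phi_support = {"DE": {"word=bar": 1, "w[-1]=die": 1},
               "FR": {"word=barreau": 1}}
phi_aug = augment_features("EN", phi_en, phi_support)
```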

A Multilingual Dataset through Wikipedia Interlingua Links

• Wikipedia articles on the same topic in different languages are often connected through interlingual links.

• Use interlingua links to project the sense repository in the reference language to a sense repository in the supporting language.
  – Given that the reference sense repository for the word "bar" in English is:
    • EN = {bar (establishment), bar (landform), bar (law), bar (music)}
  – the projected supporting sense repository in German will be:
    • DE = {Bar (Lokal), Sandbank, NIL, Takt (Musik)}

• Use projected repositories in supporting languages to train additional WSD classifiers for reference language senses.
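The projection step can be sketched as a lookup through a table of interlingual links, with NIL marking reference senses that have no counterpart. The link table below is hand-written to mirror the "bar" example; in practice it would come from Wikipedia's language links.

```python
NIL = "NIL"

def project_repository(reference_senses, interlingual_links):
    """Map each reference sense to its title in the supporting language."""
    return [interlingual_links.get(sense, NIL) for sense in reference_senses]

en_senses = ["bar (establishment)", "bar (landform)", "bar (law)", "bar (music)"]
# Assumed link table; "bar (law)" has no German counterpart in this example.
en_to_de = {"bar (establishment)": "Bar (Lokal)",
            "bar (landform)": "Sandbank",
            "bar (music)": "Takt (Musik)"}
de_senses = project_repository(en_senses, en_to_de)
```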

Two Problematic Issues for Interlingua Links

1) There may be reference language senses that do not have interlingua links to the supporting language:
   – Randomly sample a number of examples for that sense in the reference language.
   – Use Google Translate (GT) to create examples in the supporting language.
2) The distribution of examples per sense in the corpus for the supporting language may be different from the corresponding distribution for the reference language:
   – Use the distribution of the reference language as the true distribution and calculate the number of examples to be considered per sense from the supporting languages using [Agirre & Martinez, 2004].
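For the second issue, one plausible way to compute per-sense quotas is to find the largest supporting corpus whose sense proportions match the reference distribution without asking for more examples than are available. This is a sketch under that assumption and is not necessarily the exact procedure of [Agirre & Martinez, 2004]; it also assumes that senses without interlingua links have already been filled in through translation. The counts are made up.

```python
from fractions import Fraction

def examples_per_sense(reference_counts, supporting_counts):
    """Number of supporting examples to keep per sense.

    Keeps the reference proportions and never exceeds the available
    supporting examples for any sense.
    """
    total_ref = sum(reference_counts.values())
    shares = {s: Fraction(n, total_ref)
              for s, n in reference_counts.items() if n > 0}
    if not shares:
        return {}
    # Largest corpus size N with N * share(s) <= available(s) for every s.
    size = min(Fraction(supporting_counts.get(s, 0)) / share
               for s, share in shares.items())
    return {s: int(size * share) for s, share in shares.items()}

reference = {"law": 60, "music": 30, "establishment": 10}
supporting = {"law": 100, "music": 90, "establishment": 40}
quotas = examples_per_sense(reference, supporting)
```

With these numbers the "law" sense is the bottleneck, so all 100 of its supporting examples are kept and the other senses are subsampled to preserve the 60/30/10 proportions.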

The WikiMuSense Learning Framework

• Given an ambiguous word in the reference language, at training time:
  – Train a probabilistic classifier PR for the reference language:
    • use the same WP sense repository developed for WikiMonoSense and WikiTransSense.
  – Train a probabilistic classifier PS for each supporting language:
    • use the reference sense repository projected in the supporting language.
  – Use the same types of features as in WikiMonoSense, for each classifier.

Five probabilistic classifiers:
  – One from the reference language (PR).
  – Four from the supporting languages (PS).

The WikiMuSense Learning Framework

• Given an ambiguous word in the reference language, at test time:
  – Use GT to translate the reference sentence into all supporting languages.
  – Run probabilistic classifier PR on the reference sentence and classifiers PS on the supporting sentences.
  – Combine the 5 probabilistic outputs into one disambiguation score P, where:
    • DR = the set of training examples in reference language R.
    • DS = the set of training examples in supporting language S.
  – WSD = select the sense that maximizes score P.
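The exact combination formula is not preserved in this text. The sketch below shows one way the five outputs could be combined: a sum in which each supporting classifier is weighted by min(1, |DS| / |DR|), so that a language with little training data counts less. That weighting is an assumption suggested by the definitions of DR and DS, and the probabilities are invented.

```python
def combine_scores(p_ref, p_sup, n_ref, n_sup):
    """Combine sense distributions from the reference and supporting classifiers.

    p_ref: {sense: probability} from PR
    p_sup: {language: {sense: probability}} from each PS
    n_ref: |DR|, number of reference training examples
    n_sup: {language: |DS|}
    """
    scores = dict(p_ref)
    for lang, dist in p_sup.items():
        # Assumed weight: down-weight languages with fewer examples than DR.
        weight = min(1.0, n_sup[lang] / n_ref)
        for sense, prob in dist.items():
            scores[sense] = scores.get(sense, 0.0) + weight * prob
    return scores

def disambiguate(scores):
    """Select the sense that maximizes the combined score."""
    return max(scores, key=scores.get)

p_ref = {"law": 0.6, "music": 0.4}
p_sup = {"de": {"law": 0.3, "music": 0.7},
         "fr": {"law": 0.8, "music": 0.2}}
scores = combine_scores(p_ref, p_sup, n_ref=100, n_sup={"de": 50, "fr": 200})
best = disambiguate(scores)
```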

WikiMuSense vs. WikiTransSense

• WikiMuSense significantly reduces the # of sentence translations required to create the multilingual dataset.

• Features extracted from each supporting language are more diverse, as sentences are natural, as opposed to translated:
  – although this may lead to a potential mismatch between training and testing distributions.

Experimental Evaluation

• Used a subset of ambiguous words from Senseval 2 & 3:
  – Avoid words with only one Wikipedia label.
  => English (30), Spanish (25), Italian (25), German (25).

Experimental Evaluation: Macro & Micro

[Table: macro- and micro-averaged results.]

Experimental Evaluation: Macro Results

• WikiMonoSense better than the most frequent sense (MFS) baseline on 76 out of 105 words:
  – Average relative error reduction of 44%, 38%, 44%, and 28%.
• WikiTransSense better than MFS on 83 out of 105 words:
  – Average relative error reduction over WikiMonoSense of 13.7%.
  => utility of using features from translated contexts.
• WikiMuSense better than MFS on 89 out of 105 words:
  – Average relative error reduction over WikiMonoSense of 16.5%.
  => multilingual WP data can successfully replace the MT component during training.
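For readers unfamiliar with the metrics, the sketch below shows the usual definitions of relative error reduction and of macro versus micro averaging over words. The slides do not spell these out, so the definitions are the standard ones assumed here, and the numbers are invented rather than taken from the results.

```python
def relative_error_reduction(baseline_acc, system_acc):
    """Fraction of the baseline's errors that the system removes."""
    baseline_err = 1.0 - baseline_acc
    system_err = 1.0 - system_acc
    return (baseline_err - system_err) / baseline_err

def macro_micro(per_word):
    """per_word: {word: (correct, total)}.

    Macro: mean of per-word accuracies (every word counts equally).
    Micro: accuracy over all test instances pooled together.
    """
    macro = sum(c / t for c, t in per_word.values()) / len(per_word)
    micro = (sum(c for c, _t in per_word.values()) /
             sum(t for _c, t in per_word.values()))
    return macro, micro

# Hypothetical accuracies: baseline 80%, system 90%.
rer = relative_error_reduction(0.80, 0.90)
# Hypothetical per-word results.
macro, micro = macro_micro({"bar": (90, 100), "chair": (5, 10)})
```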

Varying the Number of Supporting Languages

Varying the Amount of Supporting Language Data

Dip likely due to suboptimal combination of classifiers in the WikiMuSense score P.

[Future Work]: train weights for each supporting language.

Varying the Amount of Supporting Language Data

Peak likely due to suboptimal combination of classifiers in the WikiMuSense score P.

[Future Work]: train weights for each supporting language.

# of supporting examples = # of reference examples.

Future Work

1. Train weights for each supporting language when combining classifier outputs in WikiMuSense.

2. Reduce the number of translations in WikiMuSense by choosing, from the 280 languages in WP, those supporting languages with the largest number of examples per sense.

3. Exploit directly the distributions used inside an MT system: eliminate MT altogether from WikiMuSense.

Conclusion

• WikiMonoSense:
  – Use Wikipedia hyperlinks to train monolingual WSD classifiers.
• WikiTransSense:
  – The sense-tagged corpus extracted for the reference language is machine translated into a number of supporting languages.
  – Use aligned sentences to generate additional features in a first approach to multilingual WSD.
• WikiMuSense:
  – Use the Wikipedia interlingual links to reduce reliance on MT.
  – Train and combine multiple probabilistic classifiers, in a second approach to multilingual WSD.

Questions

?
