ranlp 2009 – september 12-18, 2009, borovets, bulgaria a knowledge-rich approach to measuring the...

31
RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav Nakov, Sofia University "St. Kliment Ohridski" Elena Paskaleva, Bulgarian Academy of Sciences Svetlin Nakov, Sofia University "St. Kliment Ohridski" Workshop “Multilingual Resources, Technologies and Evaluation for Central and Eastern European Languages”, RANLP 2009

Upload: alfred-strickland

Post on 14-Dec-2015

216 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

A Knowledge-Rich Approach to Measuring the Similarity

between Bulgarian and Russian Words

Preslav Nakov, Sofia University "St. Kliment Ohridski"

Elena Paskaleva, Bulgarian Academy of Sciences

Svetlin Nakov, Sofia University "St. Kliment Ohridski"

Workshop “Multilingual Resources, Technologies and Evaluation for Central and Eastern European

Languages”, RANLP 2009

Page 2: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Introduction Objective

Measure the extent to which a Bulgarian and a Russian word are perceived as similar by a person who is fluent in both languages

Orthographic similarity

Modified to account typical cross-lingual correspondences between Bulgarian and Russian, e.g. transformations of inflections

Example Bulgarian афектирахме and Russian

аффектировались are orthographically different but perceived as similar

Page 3: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Orthographic Similarity Minimum Edit Distance Ratio (MEDR)

MED(s1, s2) = the minimum number of INSERT / REPLACE / DELETE operations for transforming s1 to s2 (Levenshtein distance)

MEDR is also known as normalized edit distance (NED)

Longest Common Subsequence Ratio (LCSR)

Maximal length subsequence common to both words, normalized by the longer word

Page 4: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Modified Minimum Edit Distance Ratio (MMEDR)

Our MMEDR similarity algorithm

Reduces the Russian word to an intermediate Bulgarian-sounding form

Applies a set of linguistically motivated transformation rules

Compares orthographically the modified Russian word with the Bulgarian word

Calculates weighted Levenshtein distance

Page 5: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Linguistic Motivation behind the MMEDR

Algorithm

Page 6: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Linguistic Motivation Transliteration from Cyrillic to Cyrillic

Full coincidence (equality) of letters

Regular letter transitions

Transformations of n-grams

Lemmatization

Transformation Weights

Page 7: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Transliteration What is transliteration?

Transition of sounds and their letter correspondences in one language to letters in another language

Russian → Bulgarian transliteration Full coincidence (equality) of letters

E.g. a → a (азбука – азбука) Russian letters missing in Bulgarian

E.g. ы → и, э → е (рыба – риба, поэт – поет) Removing a Russian letter

E.g. пальто → палто Regular letter transitions

E.g. муж → мъж, хлеб → хляб, сон → сън

Page 8: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Transformation of n-grams Regular sound-letter transitions from Russian

to Bulgarian

Transformations originating from spelling Double consonants, e.g. процесс → процес

Voiceless to voiced consonants, e.g. бессмертный → безсмъртен

Transformations of morphological origin

Removing agglutinative morphemes (ся and сь), e.g. веселиться → веселить

Transforming endings, e.g. стенной → стенен

Page 9: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Transformation of Russian Adjectives

Russian Ending

Bulgarian Ending

Example

-нный -нен военный → военен

-ный -ен вечный → вечен

-нний -нен ранний → ранен

-ний -ен вечерний → вечерен

-ский -ски вражеский → вражески

-ый -и стрелковый → стрелкови

-нной -нен стенной → стенен

-ной -ен родной → роден

-ой -и деловой → делови

Page 10: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Transformation of Russian Verbs

Russian

Ending

Bulgarian

EndingExamples

-овать -амдекорировать →

декорирам

-ить -я бродить → бродя

-ять -я блеять → блея

-ать -ам давать → давам

-уть -а гаснуть → гасна

-еть -ея белеть → белея

Page 11: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Lemmatization Bulgarian and Russian are highly-

inflectional languages Variety of endings express the different

forms of the same word

What is lemmatization? Replacement of inflected wordforms

with their lemmata

E.g. късният → късен (Bulgarian), равняющимся → равнять (Russian)

Lemmatization can handle inflections

Page 12: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Transformation Weights We use weights for letter substitutions

when measuring Levenshtein distance We account regular phonetic and spelling

letter correspondences

Some substitutions are unlikely E.g. о → у is more likely than о → щ

Replacing letter with itself has cost 0 Regular letter substitution cost is 1 Consonants and vowels with similar

sequences of distinctive phonetic features have less substitution cost (e.g. б → в)

Page 13: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Transformation Weights

а

w(а, е)=0.7; w(а, и)=0.8; w(а, о)=0.7; w(а, у)=0.6; w(а, ъ)=0.5; w(а, ю)=0.8; w(а, я)=0.5

б w(б, в)=0.8; w(б, п)=0.6

в w(в, ф)=0.6

г w(г, х)=0.5

д w(д, т)=0.6

еw(е, и)=0.6; w(е, о)=0.7; w(е, у)=0.8; w(е, ъ)=0.5; w(е, ю)=0.8; w(е, я)=0.5

ж w(ж, з)=0.8; w(ж, ш)=0.6

з w(з, с)=0.5

иw(и, й)=0.6; w(и, о)=0.8; w(и, у)=0.8; w(и, ъ)=0.8; w(и, ю)=0.7; w(и, я)=0.7

й w(й, ю)=0.7; w(й, я)=0.7

к w(к, т)=0.8; w(к, х)=0.6

м w(м, н)=0.7

о

w(о, у)=0.6; w(о, ъ)=0.8;w(о, ю)=0.7; w(о, я)=0.8

пw(п, ф)=0.8; w(п, х)=0.9

сw(с, ц)=0.6; w(с, ш)=0.9

тw(т, ф)=0.8; w(т, х)=0.9;w(т, ц)=0.9

уw(у, ъ)=0.5; w(у, ю)=0.6; w(у, я)=0.8

ф w(ф, ц)=0.8

х w(х, ш)=0.9

ц w(ц, ч)=0.8

ч w(ч, ш)=0.9

ъw(ъ, ю)=0.8; w(ъ, я)=0.8

ю w(ю, я)=0.8

Page 14: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

The MMEDR Algorithm in Details

Page 15: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

The MMEDR Algorithm

MMEDR algorithm steps (order is important):

1. Lemmatize the Bulgarian word

2. Lemmatize the Russian word

3. Transform the Russian word’s ending

4. Transliterate the Russian word

5. Remove some double consonants in the Russian word

6. Calculate weighted Levenshtein distance

7. Normalize and calculate the MMEDR value

Page 16: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Lemmatizing Bulgarian and Russian Words

How to perform lemmatization?

Use of large morphological dictionaries

Wordforms are replaced with corresponding lemmata

Lemmatization if optional step in MMEDR

For each word it is either performed or not

When multiple lemmata are found, all of them are considered

Highest value of MMEDR is taken

Page 17: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Transforming the Russian Endings

The following endings are replaced in the Russian words:

нный → нен; ный → ен; нний → нен;

ний → ен; ий → и; ый → и; нной → нен;

ной → ен; ой → и; ский → ски; ься → ь;

овать → ам; ить → я; ять → я;

ать → ам; уть → а; еть → ея

Page 18: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Removing Double Consonants The following substitutions are performed

in the Russian words:

бб → б; жж → ж; кк → к; лл → л;

мм → м; пп → п; рр → р; сс → с;

тт → т; фф → ф

Note that not all double consonants are replaced, e.g. дд is left дд E.g. наддавать → наддавам

Page 19: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Calculating Weighted Levenshtein Distance

Starting from classical Levenshtein distance (MED) we modify it to use weights for letter substitutions (MMED)

We use the previously discussed linguistically motivated weights

We calculate MMEDR as follows:

Page 20: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Calculating the Final Result

The final MMEDR value is calculated by maximum of all MMEDR values: with / without lemmatization of the

Bulgarian word

with / without lemmatization of the Russian word

with / without transformation of the Russian word ending

Lemmatization sometimes produces multiple lemmata, so all of them are considered

Page 21: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

MMEDR Algorithm: Example Bulgarian word: афектирахме

Russian word: аффектировались

Traditional MEDR similarity MED(афектирахме, аффектировались)

= 7

Apply normalization MEDR = 1–(7/15) = 8/15 ≈ 53%

Even though these words "sound similar" to Bulgarian / Russian fluent speakers

Page 22: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

MMEDR Algorithm: Example (2) Our improved MMEDR similarity:

Lemmatization produces афектирам and аффектировать

We replace the double Russian consonant -фф- by -ф-

We obtain афектирам and афектировать

We replace the Russian ending -овать by the Bulgarian ending -ам

We obtain identical words: афектирам and афектирам

Thus our MMEDR similarity is 100%

Page 23: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Another MMEDR Example Bulgarian word избягам and the Russian

word отбегать (both meaning ‘to run out’)

MED(избягам,отбегать) = 5

MEDR = 1 – (5/8) = 3/8 = 37.5%

MMEDR first transforms отбегать to отбегам

MMED(избягам, отбегам) = 0.8 + 1 + 0.5 = 2.3

MMEDR = 1 – (2.3/7) = 47/70 ≈ 67%

Page 24: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Experiments and Evaluation

Page 25: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Experimental Setup Model the problem as information retrieval

(IR) task: Retrieve all similar pairs of words from

Bulgarian and Russian lists of words Measure similarity between 200 x 200 =

40,000 Bulgarian-Russian pairs of words 163 pairs annotated as similar by linguist 39,837 considered unrelated

Rank the 40,000 pairs by MMEDR algorithm Evaluate the quality of the ranking with

11pt interpolated average precision

Page 26: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Resources Textual resources

The first 200 words from the Russian novel The Lord of the World (Властелин мира) by Alexander Belyayev

The first 200 words form the Bulgarian translation of the novel

Grammatical resources (for lemmatization) Grammatical dictionary of Bulgarian

1M wordforms and 70,000 lemmata

Grammatical dictionary of Russian 1.5M wordforms and 100,000 lemmata

Page 27: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Results MMEDR significantly outperforms traditional

orthographic similarity measures:

Algorithm11-pt interpolated average precision

LCSR 69.06%

MEDR 72.30%

MMEDR 90.58%

Page 28: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Results – Produced Ranking# Bulgarian

wordRussian word

MMEDR

Similar?

Precision Recall

1 беляев беляев 1.0000 Yes 100.00% 0.68%

2 на на 1.0000 Yes 100.00% 1.37%

3 глава глава 1.0000 Yes 100.00% 2.05%

4 кандидат кандидат 1.0000 Yes 100.00% 2.74%

5 за за 1.0000 Yes 100.00% 3.42%

6 наполеон наполеоны 1.0000 Yes 100.00% 4.11%

7 не не 1.0000 Yes 100.00% 4.79%

8 ми нас 1.0000 No 87.50% 4.79%

9 ми мой 1.0000 Yes 88.89% 5.48%

10 ми мы 1.0000 Yes 90.00% 6.16%

... ... ... ... ... ... ...

93 четвъртият четвертым 0.9375 Yes 94.57% 59.59%

94 оставят остается 0.9286 Yes 94.62% 60.27%

... ... ... ... ... ... ...

39998 са в 0.0000 No 0.37% 100%

39999 са к 0.0000 No 0.37% 100%

40000 боядисвали к 0.0000 No 0.37% 100%

Page 29: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Conclusion We proposed orthographical similarity

measure algorithm for Bulgarian / Russian

Outperforms traditional orthographic similarity measures

Accuracy is still far from 100%

Evaluation performed with stop words included

No publications on orthographic similarity for Bulgarian / Russian

Can not compare the results with others

Page 30: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Future Work Combine the ideas of MMEDR with machine

learning techniques

Automatically learning transformation rules for n-grams correspondences

Perform evaluation with stop words excluded

Evaluation for different pairs of languages

Page 31: RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria A Knowledge-Rich Approach to Measuring the Similarity between Bulgarian and Russian Words Preslav

RANLP 2009 – September 12-18, 2009, Borovets, Bulgaria

Questions?

A Knowledge-Rich Approach to Measuring the Similarity between

Bulgarian and Russian Words