RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria

Cognate or False Friend? Ask the Web!

Svetlin Nakov, Sofia University "St. Kliment Ohridski"
Preslav Nakov, University of California, Berkeley
Elena Paskaleva, Bulgarian Academy of Sciences

A Workshop on Acquisition and Management of Multilingual Lexicons


Page 1

Cognate or False Friend? Ask the Web!

Svetlin Nakov, Sofia University "St. Kliment Ohridski"

Preslav Nakov, University of California, Berkeley

Elena Paskaleva, Bulgarian Academy of Sciences

A Workshop on Acquisition and Management of Multilingual Lexicons

Page 2

Introduction

Cognates and false friends

Cognates are pairs of words in different languages that sound similar and are translations of each other

False friends are pairs of words in two languages that sound similar but differ in their meanings

The problem

Design an algorithm that can distinguish between cognates and false friends

Page 3

Cognates and False Friends

Examples of cognates

ден in Bulgarian = день in Russian (day)

idea in English = идея in Bulgarian (idea)

Examples of false friends

майка in Bulgarian (mother) ≠ майка in Russian (vest)

prost in German (cheers) ≠ прост in Bulgarian (stupid)

gift in German (poison) ≠ gift in English (present)

Page 4

The Paper in One Slide

Measuring semantic similarity

Analyze the words' local contexts

Use the Web as a corpus

Similar contexts → similar words

Context translation → cross-lingual similarity

Evaluation

200 pairs of words

100 cognates and 100 false friends

11pt average precision: 95.84%

Page 5

Contextual Web Similarity

What is local context?

A few words before and after the target word

The words in the local context of a given word are semantically related to it

Need to exclude the stop words: prepositions, pronouns, conjunctions, etc.

Stop words appear in all contexts

Need for a sufficiently big corpus

Same day delivery of fresh flowers, roses, and unique gift baskets from our online boutique. Flower delivery online by local florists for birthday flowers.
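The context-extraction step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tiny stop-word list and the window size of 3 are assumptions (the paper uses full Bulgarian and Russian stop-word lists).

```python
import re

# Hypothetical stop-word list; the paper uses 599 Bulgarian / 508 Russian stop words.
STOP_WORDS = {"of", "and", "for", "from", "our", "by", "the", "a", "an", "in", "on"}

def local_context(text, target, window=3):
    """Collect the non-stop-words within `window` positions of each
    occurrence of `target`, with their frequencies."""
    tokens = [t.lower() for t in re.findall(r"\w+", text)]
    context = {}
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            w = tokens[j]
            if j != i and w not in STOP_WORDS:
                context[w] = context.get(w, 0) + 1
    return context

excerpt = ("Same day delivery of fresh flowers, roses, and unique gift baskets "
           "from our online boutique. Flower delivery online by local florists "
           "for birthday flowers.")
print(local_context(excerpt, "flowers"))
```

On the excerpt above this keeps content words such as "fresh", "roses" and "delivery" while dropping stop words like "of" and "for".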

Page 6

Contextual Web Similarity

Web as a corpus

The Web can be used as a corpus to extract the local context of a given word

The Web is the largest possible corpus

Contains big corpora in any language

Searching for a word in Google can return up to 1 000 text excerpts

The target word is given along with its local context: a few words before and after it

Target language can be specified

Page 7

Contextual Web Similarity

Web as a corpus

Example: Google query for "flower"

Flowers, Plants, Gift Baskets - 1-800-FLOWERS.COM - Your Florist ...

Flowers, balloons, plants, gift baskets, gourmet food, and teddy bears presented by 1-800-FLOWERS.COM, Your Florist of Choice for over 30 years.

Margarita Flowers - Delivers in Bulgaria for you! - gifts, flowers, roses ...

Wide selection of BOUQUETS, FLORAL ARRANGEMENTS, CHRISTMAS DECORATIONS, PLANTS, CAKES and GIFTS appropriate for various occasions. CREDIT cards acceptable.

Flowers, plants, roses, & gifts. Flowers delivery with fewer ...

Flowers, roses, plants and gift delivery. Order flowers from ProFlowers once, and you will never use flowers delivery from florists again.

Page 8

Contextual Web Similarity

Measuring semantic similarity

Given two words, their local contexts are extracted from the Web

A set of words and their frequencies

Semantic similarity is measured as similarity between these local contexts

Local contexts are represented as frequency vectors over a given set of words

The cosine between the frequency vectors is calculated
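A minimal sketch of this similarity computation; the word counts below are illustrative, not the paper's data:

```python
import math

def cosine(v1, v2):
    """Cosine between two sparse frequency vectors given as word->count dicts."""
    shared = set(v1) & set(v2)
    dot = sum(v1[w] * v2[w] for w in shared)
    n1 = math.sqrt(sum(c * c for c in v1.values()))
    n2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Toy context vectors (made-up counts):
flower = {"fresh": 217, "order": 204, "rose": 183, "delivery": 165}
computer = {"Internet": 291, "PC": 286, "technology": 252, "order": 185}
print(cosine(flower, computer))
```

Unrelated words like "flower" and "computer" share few context words, so their cosine is low; a word compared with itself scores 1.0.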

Page 9

Contextual Web Similarity

Example of context word frequencies

word: flower
word       count
fresh      217
order      204
rose       183
delivery   165
gift       124
welcome    98
red        87
...        ...

word: computer
word        count
Internet    291
PC          286
technology  252
order       185
new         174
Web         159
site        146
...         ...

Page 10

Contextual Web Similarity

Example of frequency vectors

Similarity = cosine(v1, v2)

v1: flower
#     word       freq.
0     alias      3
1     alligator  2
2     amateur    0
3     apple      5
...   ...        ...
4999  zap        0
5000  zoo        6

v2: computer
#     word       freq.
0     alias      7
1     alligator  0
2     amateur    8
3     apple      133
...   ...        ...
4999  zap        3
5000  zoo        0

Page 11

Cross-Lingual Similarity

We are given two words in different languages L1 and L2

We have a bilingual glossary G of translation pairs {p ∈ L1, q ∈ L2}

Measuring cross-lingual similarity:

1. We extract the local contexts of the target words from the Web: C1 ∈ L1 and C2 ∈ L2

2. We translate the context C1 into L2 using the glossary G, obtaining C1*

3. We measure the similarity between C1* and C2
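The translation and comparison steps can be sketched as below. The glossary fragment and counts are invented for illustration, and dropping unglossed words is one design choice among several (the paper may handle them differently).

```python
import math

# Hypothetical BG->RU glossary fragment (illustrative entries only).
glossary = {"цвете": "цветок", "доставка": "доставка", "подарък": "подарок"}

def translate_context(c1, g):
    """Map the L1 context vector into L2 via the glossary G, summing counts
    when two L1 words share a translation; unglossed words are dropped."""
    out = {}
    for w, n in c1.items():
        if w in g:
            out[g[w]] = out.get(g[w], 0) + n
    return out

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

c1 = {"цвете": 40, "доставка": 25, "слон": 3}   # L1 (Bulgarian) context
c2 = {"цветок": 35, "доставка": 20}             # L2 (Russian) context
print(cosine(translate_context(c1, glossary), c2))
```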

Page 12

Reverse Context Lookup

Local context extracted from the Web can contain arbitrary parasite words like "online", "home", "search", "click", etc.

Internet terms appear in any Web page

Such words are not likely to be associated with the target word

Example (for the word flowers)

"send flowers online", "flowers here", "order flowers here"

Will the word "flowers" appear in the local context of "send", "online" and "here"?

Page 13

Reverse Context Lookup

If two words are semantically related, both should appear in the local contexts of each other

Let #(x, y) = number of occurrences of x in the local context of y

For any word w and a word from its local context wc, we define their strength of semantic association p(w,wc) as follows:

p(w, wc) = min{ #(w, wc), #(wc, w) }

We use p(w,wc) as vector coordinates when measuring semantic similarity
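A sketch of this association strength over a toy occurrence table; the counts are invented to show how a parasite word like "online" is suppressed:

```python
def strength(w, wc, occurrences):
    """p(w, wc) = min(#(w, wc), #(wc, w)), where occurrences[(x, y)] is the
    number of times x appears in the local context of y."""
    return min(occurrences.get((w, wc), 0), occurrences.get((wc, w), 0))

# Toy counts: "online" occurs often near "flowers", but "flowers" rarely
# appears in the local context of "online" -> weak association.
occ = {("online", "flowers"): 120, ("flowers", "online"): 2,
       ("rose", "flowers"): 80, ("flowers", "rose"): 75}
print(strength("online", "flowers", occ))  # parasite word: min(120, 2) = 2
print(strength("rose", "flowers", occ))    # related word: min(80, 75) = 75
```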

Page 14

Web Similarity Using Seed Words

Adaptation of the Fung & Yee '98 algorithm*

We have a bilingual glossary G: L1 → L2 of translation pairs, and target words w1, w2

We search Google for co-occurrences of the target words with the glossary entries

Compare the co-occurrence vectors

for each {p, q} ∈ G compare

max( google#("w1 p"), google#("p w1") )

with

max( google#("w2 q"), google#("q w2") )

* P. Fung and L. Y. Yee. An IR approach for translating from nonparallel, comparable texts. In Proceedings of ACL, volume 1, pages 414–420, 1998
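The seed-word comparison can be sketched as follows; `hits` stands in for the Google phrase-hit counts google#(...), and all words and numbers below are invented placeholders:

```python
import math

def seed_vector(word, seeds, hits):
    """Co-occurrence vector over glossary seed words: one coordinate per
    seed, max of the two phrase orders, as in the slide's formula."""
    return [max(hits.get((word, s), 0), hits.get((s, word), 0)) for s in seeds]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# Glossary pairs {p in L1, q in L2}; w1 is compared over the p's, w2 over the q's.
pairs = [("flower_bg", "flower_ru"), ("gift_bg", "gift_ru")]
hits = {("w1", "flower_bg"): 12, ("flower_bg", "w1"): 30,
        ("w2", "flower_ru"): 25,
        ("gift_bg", "w1"): 7, ("w2", "gift_ru"): 5}

v1 = seed_vector("w1", [p for p, _ in pairs], hits)
v2 = seed_vector("w2", [q for _, q in pairs], hits)
print(cosine(v1, v2))
```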

Page 15

Evaluation Data Set

We use 200 Bulgarian/Russian pairs of words:

100 cognates and 100 false friends

Manually assembled by a linguist

Manually checked in several large monolingual and bilingual dictionaries

Limited to nouns only

Page 16

Experiments

We tested several modifications of our contextual Web similarity algorithm:

Use of TF.IDF weighting

Preserve the stop words

Use of lemmatization of the context words

Use of different context sizes (2, 3, 4 and 5)

Use small and large bilingual glossary

Compared it with the seed words algorithm

Compared with traditional orthographic similarity measures: LCSR and MEDR

Page 17

Experiments

BASELINE: random

MEDR: minimum edit distance ratio

LCSR: longest common subsequence ratio

SEED: the "seed words" algorithm

WEB3: the Web-based similarity algorithm with the default parameters: context size = 3, small glossary, stop words filtering, no lemmatization, no reverse context lookup, no TF.IDF-weighting

NO-STOP: WEB3 without stop words removal

WEB1, WEB2, WEB4 and WEB5: WEB3 with context size of 1, 2, 4 and 5

LEMMA: WEB3 with lemmatization

HUGEDICT: WEB3 with the huge glossary

REVERSE: the "reverse context lookup" algorithm

COMBINED: WEB3 + lemmatization + huge glossary + reverse context lookup

Page 18

Resources

We used the following resources:

Bilingual Bulgarian/Russian glossary: 3 794 translation pairs

Huge bilingual glossary: 59 583 word pairs

A list of 599 Bulgarian stop words

A list of 508 Russian stop words

Bulgarian lemma dictionary: 1 000 000 wordforms and 70 000 lemmata

Russian lemma dictionary: 1 500 000 wordforms and 100 000 lemmata

Page 19

Evaluation

We order the pairs of words from the testing dataset by the calculated similarity

False friends are expected to appear at the top and cognates at the bottom

We evaluate the 11pt average precision of the obtained ordering
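The 11pt average precision over such an ordering can be sketched as below, treating false friends as the "relevant" class; this is the standard interpolated formulation, assumed here to match the paper's:

```python
def avg_precision_11pt(labels):
    """11-point average precision: labels[i] is True if the item at rank
    i+1 is relevant (here: a false friend). Precision at each recall level
    r = 0.0, 0.1, ..., 1.0 is interpolated as the maximum precision at any
    recall >= r."""
    total = sum(labels)
    if total == 0:
        return 0.0
    precs, rel = [], 0
    for i, is_rel in enumerate(labels, start=1):
        if is_rel:
            rel += 1
            precs.append((rel / total, rel / i))  # (recall, precision) at rank i
    points = [max((p for rec, p in precs if rec >= k / 10), default=0.0)
              for k in range(11)]
    return sum(points) / 11

# Perfect ordering: all false friends ranked before all cognates.
print(avg_precision_11pt([True] * 5 + [False] * 5))  # -> 1.0
```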

Page 20

Results (11pt Average Precision)

Comparing BASELINE, LCSR, MEDR, SEED and WEB3 algorithms

Page 21

Results (11pt Average Precision)

Comparing different context sizes; keeping the stop words

Page 22

Results (11pt Average Precision)

Comparing different improvements of the WEB3 algorithm

Page 23

Results (Precision-Recall Graph)

Comparing the recall-precision graphs of the evaluated algorithms

Page 24

Results: The Ordering for WEB3

r    Candidate            BG Sense   RU Sense  Sim.    Cogn.?  P@r      R@r
1    муфта                gratis     muff      0,0085  no      100.00%  1.00%
2    багрене / багренье   mottle     gaff      0,0130  no      100.00%  2.00%
3    добитък / добыток    livestock  income    0,0143  no      100.00%  3.00%
4    мраз / мразь         chill      crud      0,0175  no      100.00%  4.00%
5    плет / плеть         hedge      whip      0,0182  no      100.00%  5.00%
…    …                    …          …         …       …       …        …
99   вулкан               volcano    volcano   0,2099  yes     81.82%   81.00%
100  година               year       time      0,2101  no      82.00%   82.00%
101  бут                  leg        rubble    0,2130  no      82.18%   83.00%
…    …                    …          …         …       …       …        …
196  финанси / финансы    finance    finance   0,8017  yes     51.28%   100.00%
197  сребро / серебро     silver     silver    0,8916  yes     50.76%   100.00%
198  наука                science    science   0,9028  yes     50.51%   100.00%
199  флора                flora      flora     0,9171  yes     50.25%   100.00%
200  красота              beauty     beauty    0,9684  yes     50.00%   100.00%

Page 25

Discussion

Our approach is original because:

Introduces a semantic similarity measure

Not orthographic or phonetic

Uses the Web as a corpus

Does not rely on any preexisting corpora

Uses reverse-context lookup

Significant improvement in quality

Is applied to an original problem

Classification of almost identically spelled true/false friends

Page 26

Discussion

Very good accuracy: over 95%

It is not 100% accurate

Typical mistakes are synonyms, hyponyms, and words influenced by cultural, historical and geographical differences

The Web as a corpus introduces noise

Google returns only the first 1 000 results

Google ranks news portals, travel agencies and retail sites higher than books, articles and forum posts

Local context can contain noise

Page 27

Conclusion and Future Work

Conclusion

Algorithm that can distinguish between cognates and false friends

Analyzes words' local contexts, using the Web as a corpus

Future Work

Better glossaries

Automatically augmenting the glossary

Different language pairs

Page 28

Questions?

Cognate or False Friend? Ask the Web!