TRANSCRIPT
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Cognate or False Friend? Ask the Web!
Svetlin Nakov, Sofia University "St. Kliment Ohridski"
Preslav Nakov, University of California, Berkeley
Elena Paskaleva, Bulgarian Academy of Sciences
A Workshop on Acquisition and Management of Multilingual Lexicons
Introduction
Cognates and false friends
Cognates are pairs of words in different languages that sound similar and are translations of each other
False friends are pairs of words in two languages that sound similar but differ in their meanings
The problem
Design an algorithm that can distinguish between cognates and false friends
Cognates and False Friends
Examples of cognates
ден in Bulgarian = день in Russian (day)
idea in English = идея in Bulgarian (idea)
Examples of false friends
майка in Bulgarian (mother) ≠ майка in Russian (vest)
prost in German (cheers) ≠ прост in Bulgarian (stupid)
gift in German (poison) ≠ gift in English (present)
The Paper in One Slide
Measuring semantic similarity
Analyze the words' local contexts
Use the Web as a corpus
Similar contexts ⇒ similar words
Context translation ⇒ cross-lingual similarity
Evaluation
200 pairs of words
100 cognates and 100 false friends
11pt average precision: 95.84%
Contextual Web Similarity
What is local context?
A few words before and after the target word
The words in the local context of a given word are semantically related to it
Need to exclude stop words: prepositions, pronouns, conjunctions, etc.
Stop words appear in all contexts
Need a sufficiently big corpus
Same day delivery of fresh flowers, roses, and unique gift baskets
from our online boutique. Flower delivery online by local florists for
birthday flowers.
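The context extraction and stop-word filtering described above can be sketched in Python. This is only an illustration: the stop-word list, window size, and tokenization below are assumptions for compactness, not the paper's exact settings.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list (the paper uses full per-language lists)
STOP_WORDS = {"the", "a", "an", "and", "of", "for", "by", "from", "our", "in", "with"}

def local_context(excerpts, target, window=3):
    """Count non-stop words appearing within `window` words of `target`
    across a list of text excerpts (e.g. search-engine snippets)."""
    context = Counter()
    for text in excerpts:
        words = re.findall(r"\w+", text.lower())
        for i, w in enumerate(words):
            if w == target:
                nearby = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
                context.update(x for x in nearby if x not in STOP_WORDS)
    return context

snippets = ["Same day delivery of fresh flowers, roses, and unique gift baskets "
            "from our online boutique."]
print(local_context(snippets, "flowers"))
```

Running this on the excerpt above keeps content words like "delivery", "fresh" and "roses" while dropping "of" and "and".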
Contextual Web Similarity
Web as a corpus
The Web can be used as a corpus to extract the local context of a given word
The Web is the largest possible corpus
Contains big corpora in any language
Searching for a word in Google returns up to 1 000 text excerpts
The target word is returned along with its local context: a few words before and after it
Target language can be specified
Contextual Web Similarity
Web as a corpus
Example: Google query for "flower"
Flowers, Plants, Gift Baskets - 1-800-FLOWERS.COM - Your Florist ...
Flowers, balloons, plants, gift baskets, gourmet food, and teddy bears presented by 1-800-FLOWERS.COM, Your Florist of Choice for over 30 years.
Margarita Flowers - Delivers in Bulgaria for you! - gifts, flowers, roses ...
Wide selection of BOUQUETS, FLORAL ARRANGEMENTS, CHRISTMAS DECORATIONS, PLANTS, CAKES and GIFTS appropriate for various occasions. CREDIT cards acceptable.
Flowers, plants, roses, & gifts. Flowers delivery with fewer ...
Flowers, roses, plants and gift delivery. Order flowers from ProFlowers once, and you will never use flowers delivery from florists again.
Contextual Web Similarity
Measuring semantic similarity
Given two words, their local contexts are extracted from the Web
A set of words and their frequencies
Semantic similarity is measured as the similarity between these local contexts
Local contexts are represented as frequency vectors over a given set of words
The cosine between the frequency vectors in Euclidean space is calculated
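The cosine computation over context frequency vectors can be sketched as follows; the context words and counts below are toy values, not the paper's data.

```python
from math import sqrt

def cosine(v1, v2):
    """Cosine between two frequency vectors stored as word -> count dicts."""
    dot = sum(v1[w] * v2[w] for w in set(v1) & set(v2))
    n1 = sqrt(sum(c * c for c in v1.values()))
    n2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Toy context vectors (counts are illustrative)
flower = {"fresh": 217, "order": 204, "rose": 183, "delivery": 165}
computer = {"Internet": 291, "PC": 286, "order": 185, "new": 174}
print(round(cosine(flower, computer), 3))  # low: only "order" is shared
```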
Contextual Web Similarity
Example of context word frequencies

word: flower
  word      count
  fresh     217
  order     204
  rose      183
  delivery  165
  gift      124
  welcome   98
  red       87
  ...       ...

word: computer
  word        count
  Internet    291
  PC          286
  technology  252
  order       185
  new         174
  Web         159
  site        146
  ...         ...
Contextual Web Similarity
Example of frequency vectors

v1: flower
  #     word       freq.
  0     alias      3
  1     alligator  2
  2     amateur    0
  3     apple      5
  ...   ...        ...
  4999  zap        0
  5000  zoo        6

v2: computer
  #     word       freq.
  0     alias      7
  1     alligator  0
  2     amateur    8
  3     apple      133
  ...   ...        ...
  4999  zap        3
  5000  zoo        0

Similarity = cosine(v1, v2)
Cross-Lingual Similarity
We are given two words in different languages L1 and L2
We have a bilingual glossary G of translation pairs {p ∈ L1, q ∈ L2}
Measuring cross-lingual similarity:
1. We extract the local contexts of the target words from the Web: C1 ∈ L1 and C2 ∈ L2
2. We translate the context C1 into L2 using the glossary G, obtaining C1*
3. We measure the similarity between C1* and C2
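The three steps above can be sketched as follows. This is a minimal illustration: the glossary is a plain word-to-word dict and the contexts are made-up counts, both assumptions for compactness.

```python
from math import sqrt

def cosine(v1, v2):
    dot = sum(v1[w] * v2[w] for w in set(v1) & set(v2))
    n1 = sqrt(sum(c * c for c in v1.values()))
    n2 = sqrt(sum(c * c for c in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def translate_context(c1, glossary):
    """Step 2: map L1 context counts into L2 via the glossary G.
    Context words missing from the glossary are simply dropped."""
    c1_star = {}
    for word, count in c1.items():
        if word in glossary:
            c1_star[glossary[word]] = c1_star.get(glossary[word], 0) + count
    return c1_star

def cross_lingual_similarity(c1, c2, glossary):
    """Step 3: similarity between the translated context C1* and C2."""
    return cosine(translate_context(c1, glossary), c2)

# Toy example: Bulgarian context of "идея" vs. English context of "idea"
glossary = {"нова": "new", "добра": "good", "ден": "day"}
c1 = {"нова": 12, "добра": 7, "ден": 2}
c2 = {"new": 15, "good": 9, "bright": 4}
print(round(cross_lingual_similarity(c1, c2, glossary), 3))
```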
Reverse Context Lookup
Local context extracted from the Web can contain arbitrary parasite words like "online", "home", "search", "click", etc.
Internet terms appear in any Web page
Such words are not likely to be semantically associated with the target word
Example (for the word flowers)
"send flowers online", "flowers here", "order flowers here"
Will the word "flowers" appear in the local context of "send", "online" and "here"?
Reverse Context Lookup
If two words are semantically related, both should appear in the local contexts of each other
Let #(x, y) = the number of occurrences of x in the local context of y
For any word w and a word wc from its local context, we define their strength of semantic association p(w, wc) as follows:
p(w, wc) = min{ #(w, wc), #(wc, w) }
We use p(w, wc) as vector coordinates when measuring semantic similarity
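A minimal sketch of the association strength; the count table below is hypothetical, whereas in the paper the counts come from Web-extracted contexts.

```python
# Hypothetical counts: counts[(x, y)] = occurrences of x in y's local context
counts = {("flowers", "send"): 1, ("send", "flowers"): 40,
          ("flowers", "rose"): 25, ("rose", "flowers"): 30}

def count_in_context(x, y):
    return counts.get((x, y), 0)

def association_strength(w, wc):
    """p(w, wc) = min{ #(w, wc), #(wc, w) }"""
    return min(count_in_context(w, wc), count_in_context(wc, w))

print(association_strength("flowers", "send"))  # parasite word: low score
print(association_strength("flowers", "rose"))  # related word: high score
```

Taking the minimum penalizes parasite words: "flowers" often has "send" nearby, but "send" rarely has "flowers" nearby, so the pair gets a low score.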
Web Similarity Using Seed Words
Adaptation of the Fung & Yee '98 algorithm*
We have a bilingual glossary G: L1 → L2 of translation pairs and target words w1, w2
We search in Google for the co-occurrences of the target words with the glossary entries
Compare the co-occurrence vectors:
for each {p, q} ∈ G, compare
max( google#("w1 p"), google#("p w1") ) with
max( google#("w2 q"), google#("q w2") )
* P. Fung and L. Y. Yee. An IR approach for translating from nonparallel, comparable texts. In Proceedings of ACL, volume 1, pages 414–420, 1998
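The seed-words comparison can be sketched as below; `hits` stands in for the Google phrase-hit count `google#("...")` and is stubbed with made-up numbers.

```python
from math import sqrt

def cosine_list(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))
    n1 = sqrt(sum(a * a for a in v1))
    n2 = sqrt(sum(b * b for b in v2))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Stub for google#("...") -- made-up hit counts for illustration
HITS = {'"w1 p1"': 120, '"p1 w1"': 90, '"w2 q1"': 110, '"q1 w2"': 95,
        '"w1 p2"': 5, '"p2 w1"': 8, '"w2 q2"': 7, '"q2 w2"': 3}

def hits(phrase):
    return HITS.get(phrase, 0)

def seed_similarity(w1, w2, glossary_pairs):
    """One coordinate per glossary pair {p, q}: the max over
    the two phrase orders, as on the slide."""
    v1 = [max(hits(f'"{w1} {p}"'), hits(f'"{p} {w1}"')) for p, q in glossary_pairs]
    v2 = [max(hits(f'"{w2} {q}"'), hits(f'"{q} {w2}"')) for p, q in glossary_pairs]
    return cosine_list(v1, v2)

print(round(seed_similarity("w1", "w2", [("p1", "q1"), ("p2", "q2")]), 3))
```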
Evaluation Data Set
We use 200 Bulgarian/Russian pairs of words:
100 cognates and 100 false friends
Manually assembled by a linguist
Manually checked in several large monolingual and bilingual dictionaries
Limited to nouns only
Experiments
We tested a few modifications of our contextual Web similarity algorithm:
Use of TF.IDF weighting
Preserve the stop words
Use of lemmatization of the context words
Use different context size (2, 3, 4 and 5)
Use small and large bilingual glossary
Compared it with the seed words algorithm
Compared with traditional orthographic similarity measures: LCSR and MEDR
Experiments
BASELINE: random
MEDR: minimum edit distance ratio
LCSR: longest common subsequence ratio
SEED: the "seed words" algorithm
WEB3: the Web-based similarity algorithm with the default parameters: context size = 3, small glossary, stop words filtering, no lemmatization, no reverse context lookup, no TF.IDF-weighting
NO-STOP: WEB3 without stop-word removal
WEB1, WEB2, WEB4 and WEB5: WEB3 with context size of 1, 2, 4 and 5
LEMMA: WEB3 with lemmatization
HUGEDICT: WEB3 with the huge glossary
REVERSE: the "reverse context lookup" algorithm
COMBINED: WEB3 + lemmatization + huge glossary + reverse context lookup
Resources We used the following resources:
Bilingual Bulgarian / Russian glossary: 3 794 pairs of translation words
Huge bilingual glossary: 59 583 word pairs
A list of 599 Bulgarian stop words
A list of 508 Russian stop words
Bulgarian lemma dictionary: 1 000 000 wordforms and 70 000 lemmata
Russian lemma dictionary: 1 500 000 wordforms and 100 000 lemmata
Evaluation
We order the pairs of words from the test dataset by the calculated similarity
False friends are expected to appear at the top and cognates at the bottom
We evaluate the 11pt average precision of the obtained ordering
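The 11pt average precision can be computed as sketched below (the standard interpolated definition: the average of the maximum precision at recall levels 0.0, 0.1, ..., 1.0). The ranked labels in the example are made up.

```python
def eleven_pt_avg_precision(ranked_labels):
    """ranked_labels: booleans in ranked order, True = false friend (relevant).
    Returns the average of interpolated precision at 11 recall levels."""
    total = sum(ranked_labels)
    prec, rec, hits = [], [], 0
    for i, rel in enumerate(ranked_labels, 1):
        hits += rel
        prec.append(hits / i)
        rec.append(hits / total)
    points = []
    for level in (i / 10 for i in range(11)):
        candidates = [p for p, r in zip(prec, rec) if r >= level]
        points.append(max(candidates) if candidates else 0.0)
    return sum(points) / 11

# A perfect ordering puts all false friends before all cognates
print(eleven_pt_avg_precision([True, True, False, False]))  # 1.0
```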
Results (11pt Average Precision)
Comparing BASELINE, LCSR, MEDR, SEED and WEB3 algorithms
Results (11pt Average Precision)
Comparing different context sizes; keeping the stop words
Results (11pt Average Precision)
Comparing different improvements of the WEB3 algorithm
Results (Precision-Recall Graph)
Comparing the recall-precision graphs of evaluated algorithms
Results: The Ordering for WEB3

  r    Candidate            BG Sense   RU Sense   Sim.     Cogn.?  P@r      R@r
  1    муфта                gratis     muff       0.0085   no      100.00%  1.00%
  2    багрене / багренье   mottle     gaff       0.0130   no      100.00%  2.00%
  3    добитък / добыток    livestock  income     0.0143   no      100.00%  3.00%
  4    мраз / мразь         chill      crud       0.0175   no      100.00%  4.00%
  5    плет / плеть         hedge      whip       0.0182   no      100.00%  5.00%
  …    …                    …          …          …        …       …        …
  99   вулкан               volcano    volcano    0.2099   yes     81.82%   81.00%
  100  година               year       time       0.2101   no      82.00%   82.00%
  101  бут                  leg        rubble     0.2130   no      82.18%   83.00%
  …    …                    …          …          …        …       …        …
  196  финанси / финансы    finance    finance    0.8017   yes     51.28%   100.00%
  197  сребро / серебро     silver     silver     0.8916   yes     50.76%   100.00%
  198  наука                science    science    0.9028   yes     50.51%   100.00%
  199  флора                flora      flora      0.9171   yes     50.25%   100.00%
  200  красота              beauty     beauty     0.9684   yes     50.00%   100.00%
Discussion
Our approach is original because it:
Introduces a semantic similarity measure
Not orthographic or phonetic
Uses the Web as a corpus
Does not rely on any preexisting corpora
Uses reverse-context lookup
Significant improvement in quality
Is applied to an original problem
Classification of almost identically spelled true/false friends
Discussion
Very good accuracy: over 95%
It is not 100% accurate
Typical mistakes are synonyms, hyponyms, and words influenced by cultural, historical and geographical differences
The Web as a corpus introduces noise
Google returns only the first 1 000 results
Google ranks news portals, travel agencies and retail sites higher than books, articles and forum posts
Local context can contain noise
Conclusion and Future Work
Conclusion
An algorithm that can distinguish between cognates and false friends
Analyzes the words' local contexts, using the Web as a corpus
Future Work
Better glossaries
Automatically augmenting the glossary
Different language pairs
Questions?
Cognate or False Friend? Ask the Web!