svetlin nakov - improved word alignments using the web as a corpus
DESCRIPTION
Nakov P., Nakov S., Paskaleva E., Improved Word Alignments Using the Web as a Corpus, Proceedings of the International Conference RANLP 2007 (Recent Advances in Natural Language Processing), pp. 400-405, ISBN 978-954-91743-7-3, Borovets, Bulgaria, 27-29 September 2007TRANSCRIPT
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Improved Word Alignments Using the
Web as a Corpus
Preslav Nakov, University of California, Berkeley
Svetlin Nakov, Sofia University "St. Kliment Ohridski"
Elena Paskaleva, Bulgarian Academy of Sciences
International Conference RANLP 2007(Recent Advances in Natural Language Processing)
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Statistical Machine Translation (SMT) 1988 – IBM models 1, 2, 3, 4 and 5
Start with bilingual parallel sentence-aligned corpus
Learn translation probabilities of individual words
2004 – PHARAOH model Learn translation probabilities for phrases
Alignment template approach – extracts translation phrases from word alignments
Improved word alignments in sentences improve translation quality!
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Word Alignments The word alignments problem
Given a bilingual parallel sentence-aligned corpus align the words in each sentence with corresponding words in its translation
Example English sentence
Example Bulgarian sentence
Try our same day delivery of fresh flowers, roses, and
unique gift baskets.
Опитайте нашите свежи цветя, рози и уникални
кошници с подаръци с доставка на същия ден.
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Word Alignments – Exampletry
our
same
day
delivery
of
fresh
flowers
roses
and
unique
gift
baskets
опитайте
нашите
свежи
цветя
рози
и
уникални
кошници
с
подаръци
с
доставка
на
същия
ден
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Our Method Use combination of
Orthographic similarity measure
Semantic similarity measure
Competitive linking
Orthographic similarity measure Modified weighted minimum-edit-distance
Semantic similarity measure Analyses the co-occurring words in the
local contexts of the target words using the Web as a corpus
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Orthographic Similarity Minimum Edit Distance Ratio (MEDR)
MED(s1, s2) = the minimum number of INSERT / REPLACE / DELETE operations for transforming s1 to s2
Longest Common Subsequence Ratio (LCSR)
LCS(s1, s2) = the longest common subsequence of s1 and s2
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Orthographic Similarity Modified Minimum Edit Distance Ratio
(MMEDR) for Bulgarian / Russian
1. Normalize the strings
2. Assign weights for the edit operations
Normalizing the strings
Hand-crafted rules
Strip the Russian letters "ь" and "ъ"
Remove the Russian "й" at the endings
Remove the definite article in Bulgarian (e.g. "ът", "ят" at the endings)
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Orthographic Similarity Assigning weights for the edit operations
0.5-0.9 for the vowel to vowel substitutions, e.g. 0.5 for е о
0.5-0.9 for some consonant-consonant replacements, e.g. с з
1.0 for all other edit operations
Example: Bulgarian първият and the Russian первый (first)
Normalization produces първи and перви, thus MMED = 0.5 (weight 0.5 for ъ о)
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Semantic Similarity What is local context?
Few words before and after the target word
The words in the local context of given word are semantically related to it
Need to exclude the stop words: prepositions, pronouns, conjunctions, etc.
Stop words appear in all contexts
Need of sufficiently big corpus
Same day delivery of fresh flowers, roses, and unique gift baskets
from our online boutique. Flower delivery online by local florists for
birthday flowers.
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Semantic Similarity Web as a corpus
The Web can be used as a corpus to extract the local context for given word
The Web is the largest possible corpus
Contains big corpora in any language
Searching some word in Google can return up to 1 000 excerpts of texts
The target word is given along with its local context: few words before and after it
Target language can be specified
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Semantic Similarity Web as a corpus
Example: Google query for "flower"
Flowers, Plants, Gift Baskets - 1-800-FLOWERS.COM - Your Florist ...
Flowers, balloons, plants, gift baskets, gourmet food, and teddy bears presented by 1-800-FLOWERS.COM, Your Florist of Choice for over 30 years.
Margarita Flowers - Delivers in Bulgaria for you! - gifts, flowers, roses ...
Wide selection of BOUQUETS, FLORAL ARRANGEMENTS, CHRISTMAS ECORATIONS, PLANTS, CAKES and GIFTS appropriate for various occasions. CREDIT cards acceptable.
Flowers, plants, roses, & gifts. Flowers delivery with fewer ...
Flowers, roses, plants and gift delivery. Order flowers from ProFlowers once, and you will never use flowers delivery from florists again.
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Semantic Similarity Measuring semantic similarity
For given two words their local contexts are extracted from the Web A set of words and their frequencies
Apply lemmatization
Semantic similarity is measured as similarity between these local contexts Local contexts are represented as
frequency vectors for given set of words
Cosine between the frequency vectors in the Euclidean space is calculated
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Semantic Similarity Example of context words frequencies
word countfresh 217
order 204
rose 183
delivery 165
gift 124
welcome 98
red 87
... ...
word: flower
word countInternet 291
PC 286
technology 252
order 185
new 174
Web 159
site 146
... ...
word: computer
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Semantic Similarity Example of frequency vectors
Similarity = cosine(v1, v2)
# word freq.0 alias 3
1 alligator 2
2 amateur 0
3 apple 5
... ... ...
4999 zap 0
5000 zoo 6
v1: flower
# word freq.0 alias 7
1 alligator 0
2 amateur 8
3 apple 133
... ... ...
4999 zap 3
5000 zoo 0
v2: computer
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Cross-Lingual Semantic Similarity
We are given two words in different languages L1 and L2
We have a bilingual glossary G of translation pairs {p ∈ L1, q ∈ L2}
Measuring cross-lingual similarity:
1. We extract the local contexts of the target words from the Web: C1 ∈ L1 and C2 ∈ L2
2. We translate the context
3. We measure similarity between C1* and C2
C1*C1G
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Competitive Linking What is competitive linking?
One-to-one bi-directional word alignments algorithm
Greedy "best first" approach
Links the most probable pair first, removes it, and repeats the same for the rest
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Applying Competitive Linking1. Make all words lowercase
2. Remove punctuation
3. Remove the stop words: prepositions, pronouns, conjunctions, etc. We don't align them
4. Align the most similar pair of words Using the orthographic similarity
combined with the semantic similarity
5. Remove the aligned words
6. Align the rest of the sentences
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Our Method – Example Bulgarian sentence
Russian sentence
Процесът на създаването на такива рефлекси е по-
сложен, но същността им е еднаква.
Процесс создания таких рефлексов сложнее, но
существо то же.
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Out Method – Example1. Remove the stop words
Bulgarian: на, на, такива, е, но, им, е Russian: таких, но, то
2. Align рефлекси and рефлексов (semantic similarity = 0.989)
3. Align по-сложен and сложнее (orthographic similarity = 0.750)
4. Align процесът and процесс (orthographic similarity = 0.714)
5. Align създаването and создания (orthographic similarity = 0.544)
6. Align процесът and процесс (orthographic similarity = 0.536)
7. Not aligned: еднаква
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Our Method – Exampleпроцесът
на
създаването
на
такива
рефлекси
е
по-сложен
но
същността
им
е
еднаква
процесс
создания
таких
рефлексов
сложнее
но
существо
то
же
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Evaluation We evaluated the following algorithms
BASELINE: the traditional alignment algorithm (IBM model 4)
LCSR, MEDR, MMEDR: orthographic similarity algorithms
WEB-ONLY: semantic similarity algorithm WEB-AVG: average of WEB-ONLY and
MMEDR WEB-MAX: maximum of WEB-ONLY and
MMEDR WEB-CUT: 1 if MMEDR(s1, s2) >= α (0 < α <
1), or WEB-ONLY(s1, s2) otherwise
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Testing Data and Experiments Testing data set
A corpus of 5 827 parallel sentences
Training set: 4 827 sentences
Tuning set: 500 sentences
Testing set: 500 sentences
Experiments Manual evaluation of WEB-CUT AER for competitive linking Translation quality: BLEU / NIST
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Manual Evaluation of WEB-CUT Aligned the texts of the testing data set
Used competitive linking and WEB-CUT for α=0.62
Obtained 14,246 distinct word pairs
Manually evaluated the aligned pairs as: Correct
Rough (considered incorrect)
Wrong (considered incorrect)
Calculated precision and recall For the case MMEDR < 0.62
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Manual Evaluation of WEB-CUT Precision-recall curve
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Evaluation of Alignment Error Rate
Gold standard for alignment
For the first 100 sentences
Created manually by a linguist
Stop words and punctuation were removed
Evaluated the alignment error rate (AER) for competitive linking
Evaluated for all the algorithms
LCSR, MEDR, MMEDR, WEB-ONLY, WEB-AVG, WEB-MAX and WEB-CUT
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Evaluation of Alignment Error Rate
AER for competitive linking
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Evaluation of Translation Quality Built a Russian Bulgarian statistical
machine translation (SMT) system Extracted from the training set the distinct
word pairs aligned with competitive linking Added them twice as additional “sentence”
pairs to the training corpus Trained log-linear model for SMT with
standard feature functions Used minimum error rate training on the
tuning set
Evaluated BLUE and NIST score on the testing set
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Evaluation of Translation Quality
Translation quality: BLEU
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Evaluation of Translation Quality
Translation quality: NIST
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Resources We used the following resources:
Bulgarian-Russian parallel corpus: 5 827 sentences
Bilingual Bulgarian / Russian glossary: 3 794 pairs of translation words
A list of 599 Bulgarian / 508 Russian stop words
Bulgarian lemma dictionary: 1 000 000 wordforms and 70 000 lemmata
Russian lemma dictionary: 1 500 000 wordforms and 100 000 lemmata
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Conclusion and Future Work Conclusion
Semantic similarity extracted from the Web can improve statistical machine translation
For similar languages like Bulgarian and Russian orthographic similarity is useful
Future Work Improve MMED with automatic leaned rules Improve the semantic similarity algorithm
Filter parasite words like "site", "click", etc.
Replace competitive linking with maximum weight bipartite matching
RANLP 2007 – September 27-29, 2007, Borovets, Bulgaria
Questions?
Improved Word Alignments Using the Web as a Corpus