Dr. Preslav Nakov — Combining, Adapting and Reusing Bi-texts between Related Languages: Application to Statistical Machine Translation
DESCRIPTION
Bilingual sentence-aligned parallel corpora, or bi-texts, are a useful resource for solving many computational linguistics problems, including part-of-speech tagging, syntactic parsing, named entity recognition, word sense disambiguation, sentiment analysis, etc.; they are also a critical resource for some real-world applications such as statistical machine translation (SMT) and cross-language information retrieval. Unfortunately, building large bi-texts is hard, and thus most of the 6,500+ world languages remain resource-poor in bi-texts. However, many resource-poor languages are related to some resource-rich language, with which they overlap in vocabulary and share cognates; this offers opportunities for reusing the bi-texts of the resource-rich language. We explore various options for bi-text reuse: (i) direct combination of bi-texts, (ii) combination of models trained on such bi-texts, and (iii) a sophisticated combination of (i) and (ii). We further explore the idea of generating bi-texts for a resource-poor language by adapting a bi-text for a resource-rich language. We build a lattice of adaptation options for each word and phrase, and we then decode it using a language model for the resource-poor language. We compare word- and phrase-level adaptation, and we further make use of cross-language morphology. For the adaptation, we experiment with (a) a standard phrase-based SMT decoder, and (b) a specialized beam-search adaptation decoder. Finally, we observe that for closely related languages, many of the differences are at the subword level. Thus, we explore the idea of reducing translation to character-level transliteration. We further demonstrate the potential of combining word- and character-level models.

TRANSCRIPT
Combining, Adapting and Reusing Bi-texts between Related Languages:
Application to Statistical Machine Translation
Preslav Nakov, Qatar Computing Research Institute
(collaborators: Jörg Tiedemann, Pidong Wang, Hwee Tou Ng)
Yandex seminar, August 13, 2014, Moscow, Russia
Plan
• Part I - Introduction to Statistical Machine Translation
• Part II - Combining, Adapting and Reusing Bi-texts between Related Languages: Application to Statistical Machine Translation
• Part III - Further Discussion on SMT
Statistical Machine Translation
Statistical Machine Translation (SMT)

Reach Out to Asia (ROTA) has announced its fifth Wheels ‘n’ Heels, Qatar’s largest annual community event, which will promote ROTA’s partnership with the Qatar Japan 2012 Committee. Held at the Museum of Islamic Art Park on 10 February, the event will celebrate 40 years of cordial relations between the two countries. Essa Al Mannai, ROTA Director, said: “A group of 40 Japanese students are traveling to Doha especially to take part in our event.
English
SMT systems:
- learn from human-generated translations
- extract useful knowledge and build models
- use the models to translate new sentences
SMT: The Noisy Channel Model
Translation as Decoding

• 1947, Warren Weaver, Rockefeller Foundation:
One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: ‘This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.’
Example:
- Это действительно написано по-английски .
- This is really written in English .
The Basic Components of an SMT System
Look for the best English translation that both conveys the French meaning
and is grammatical.
Components of an SMT System
• Language Model - English text e → P(e)
  o good English → high probability
  o bad English → low probability
• Translation Model - pair <f,e> → P(f|e)
  o <f,e> are translations → high probability
  o <f,e> are not translations → low probability
• Decoder - given P(e), P(f|e), and f, we look for the e that maximizes P(e)·P(f|e)
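The decoder's argmax can be illustrated with a toy example. This is a minimal sketch: the probability tables `p_e` and `p_f_given_e` are made-up numbers for a handful of candidate translations, not estimates from data.

```python
import math

# Toy noisy-channel decoder: pick the English hypothesis e that
# maximizes P(e) * P(f|e). All probabilities here are hypothetical.
p_e = {"a red flower": 0.20, "a flower red": 0.001, "a red dog": 0.15}
p_f_given_e = {"a red flower": 0.30, "a flower red": 0.25, "a red dog": 0.0001}

def decode(candidates):
    # argmax over candidates of P(e) * P(f|e); log-space avoids underflow
    return max(candidates, key=lambda e: math.log(p_e[e]) + math.log(p_f_given_e[e]))

print(decode(p_e.keys()))  # → a red flower
```

A real decoder searches over exponentially many hypotheses rather than a fixed candidate list, which is why efficient beam search matters.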
Combining P(e) and P(f|e)
How do we translate into English the Russian phrase “красный цветок”?

candidate        P(e)   P(f|e)   P(e)·P(f|e)
a flower red      ↓       ↑         ↓
red flower a      ↓       ↑         ↓
flower red a      ↓       ↑         ↓
a red dog         ↑       ↓         ↓
dog cat mouse     ↓       ↓         ↓
a red flower      ↑       ↑         ↑
SMT: The Language Model P(e)
Language Model
• Goal: prefer “good” to “bad” English
  - “good” ≠ grammatical
  - “bad” ≈ unlikely
• Examples (grammaticality):
  - I do not like strong tea.     good
  - I do not like powerful tea.   bad
  - I like strong tea not.        bad
  - Like not tea strong do I.     bad
Example: Grammatical but Low-probability Text
Eye halve a spelling checker
It came with my pea sea
It plainly marks four my revue
Miss steaks eye kin knot sea.

Eye strike a key and type a word
And weight four it two say
Weather eye am wrong oar write
It shows me a strait a weigh.

As soon as a mist ache is maid
It nose bee fore two long
And eye can put the error rite
Its rare lea ever wrong.

Eye have run this poem threw it
I am shore your pleased two no
Its letter perfect awl the weigh
My checker tolled me sew.
Торопыжка был голодный - проглотил утюг холодный. (roughly: “Toropyzhka was hungry - he swallowed a cold flat iron.”)
Language Model: Learned from Monolingual Text
Bigram Language Model

Chain rule:
  P(e) = P(w1 w2 ... wn) = P(w1) · P(w2|w1) · P(w3|w1 w2) · P(w4|w1 w2 w3) · ...

First-order Markov model (approximation):
  P(e) ≈ P(w1) · P(w2|w1) · P(w3|w2) · P(w4|w3) · ...

Andrei Markov
Bigram Language Model

Bigram probabilities are estimated from counts:
  P(wi | wi-1) = C(wi-1 wi) / C(wi-1)

so that:
  P(w1 w2 ... wn) ≈ P(w1) · ∏i P(wi | wi-1)
P(“I eat an apple …”) = P(I | <S>) . P(eat | I) . P(an | eat) . P(apple | an) …
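The count-based estimate above is straightforward to sketch in code. This is a minimal illustration on a tiny hand-made two-sentence corpus with a `<S>` start symbol (both are assumptions for the example; real language models use large corpora and smoothing):

```python
from collections import Counter

# Maximum-likelihood bigram estimates:
#   P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})
corpus = [["<S>", "I", "eat", "an", "apple"],
          ["<S>", "I", "eat", "an", "orange"]]

unigrams, bigrams = Counter(), Counter()
for sent in corpus:
    unigrams.update(sent[:-1])            # context counts C(w_{i-1})
    bigrams.update(zip(sent, sent[1:]))   # pair counts C(w_{i-1} w_i)

def p(word, prev):
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(sent):
    prob = 1.0
    for prev, word in zip(sent, sent[1:]):
        prob *= p(word, prev)
    return prob

print(p("apple", "an"))                                   # → 0.5
print(sentence_prob(["<S>", "I", "eat", "an", "apple"]))  # 1 · 1 · 1 · 0.5 = 0.5
```

Note that any unseen bigram gets probability zero under this raw estimate, which is why practical n-gram models add smoothing.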
SMT: The Translation Model P(f|e)
Modeling P(f|e) – Sentence Level
Batman did not fight any cat woman .
Бэтмен не вел бой с никакой женщиной кошкой .
•Cannot be estimated directly
Modeling P(f|e)
Batman did not fight any cat woman .
Бэтмен не вел бой с никакой женщиной кошкой .
•Broken into smaller steps
IBM Model 4: Generation (Brown et al., CL 1993)
Batman did not fight any cat woman .
Batman not fight fight any cat woman .
Batman not fight fight NULL any cat woman .
Бэтмен не вел бой с никакой кошкой женщиной .
Бэтмен не вел бой с никакой женщиной кошкой .
n(3|fight) - fertility: “fight” generates three target words
P-NULL - probability of inserting a spurious word from NULL
t(не|not) - lexical translation probability
d(8|7) - distortion (reordering) probability
• All these probabilities could be learned if word alignments were available.
• We can learn word alignments using EM.
Translation Model: Learned from a Bi-Text

Reach Out to Asia (ROTA) has announced its fifth Wheels ‘n’ Heels, Qatar’s largest annual community event, which will promote ROTA’s partnership with the Qatar Japan 2012 Committee. Held at the Museum of Islamic Art Park on 10 February, the event will celebrate 40 years of cordial relations between the two countries. Essa Al Mannai, ROTA Director, said: “A group of 40 Japanese students are traveling to Doha especially to take part in our event.
100 Sentence Pairs
1000 Sentence Pairs
10,000 Sentences = 1 Book
100,000 Sentences = Stack of Books
1,000,000 Sentences = Shelf of Books
10 Million Sentences = Large Shelf of Books
The Large Data Trend Continues
Alignment Levels
- Document
- Paragraph
- Sentence
  o Gale & Church algorithm
- Word
  o IBM models
Learning Word Alignments Using Expectation Maximization (EM)
… красивые цветы … красивые красные цветы … красивые девушки …
… beautiful flowers … beautiful red flowers … beautiful girls …
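The EM procedure this slide illustrates can be sketched with a minimal IBM Model 1 implementation on the slide's toy bi-text. Assumptions: no NULL word and uniform initialization, unlike the full IBM models; a real aligner (e.g. GIZA++) does considerably more.

```python
from collections import defaultdict

# Toy bi-text from the slide.
bitext = [(["красивые", "цветы"], ["beautiful", "flowers"]),
          (["красивые", "красные", "цветы"], ["beautiful", "red", "flowers"]),
          (["красивые", "девушки"], ["beautiful", "girls"])]

src_vocab = {w for f_sent, _ in bitext for w in f_sent}
t = defaultdict(lambda: 1.0 / len(src_vocab))  # t(f|e), uniform start

for _ in range(50):  # EM iterations
    count, total = defaultdict(float), defaultdict(float)
    for f_sent, e_sent in bitext:
        for f in f_sent:  # E-step: collect expected alignment counts
            z = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:
                count[(f, e)] += t[(f, e)] / z
                total[e] += t[(f, e)] / z
    for (f, e), c in count.items():  # M-step: renormalize
        t[(f, e)] = c / total[e]

# On this separable toy data, t("красивые"|"beautiful") converges toward 1.0
print(round(t[("красивые", "beautiful")], 3))
```

The key point the slides make is visible here: no alignments are ever observed, yet the co-occurrence statistics alone pull the translation probabilities toward the correct word pairs.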
Phrase-based SMT
Phrase-Based SMT
• Sentence is broken into phrases
  - contiguous token sequences
  - not linguistic units
• Each phrase is translated in isolation
• Translated phrases are reordered
Batman has not fought a cat woman yet . Бэтмен пока не сражался с женщиной кошкой .
(Koehn&al., HLT-NAACL 2003)
Phrase-Based Translation
• Multiple words → multiple words
• Models context
• Handles non-compositional phrases
• More data → longer phrases
Phrase-Based SMT: Sample Bulgarian-English Phrases
Sample Phrases: главен
главни прокурори chief prosecutors
главни счетоводители chief accountants
главни архитекти chief architects
главни щабове main staffs
главни улици main streets
главни методисти senior instructors
главно предизвикателство major challenge
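Translating phrase-by-phrase, as described above, can be sketched with a toy monotone translator that greedily matches the longest phrase in a hand-made table. A few entries are taken from the sample phrases above; the greedy, score-free lookup is purely illustrative (a real decoder searches over segmentations, reorderings, and model scores).

```python
# Toy phrase table: source token tuple -> target string.
phrase_table = {
    ("главни", "прокурори"): "chief prosecutors",
    ("главни", "улици"): "main streets",
    ("главно",): "major",
    ("предизвикателство",): "challenge",
}

def translate(tokens, table, max_len=3):
    out, i = [], 0
    while i < len(tokens):
        # Try the longest matching phrase first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + n])
            if phrase in table:
                out.append(table[phrase])
                i += n
                break
        else:
            out.append(tokens[i])  # pass unknown words through
            i += 1
    return " ".join(out)

print(translate(["главно", "предизвикателство"], phrase_table))  # → major challenge
```

Because "главни улици" is matched as a unit, the table can encode context-dependent choices (main vs. chief) that word-by-word translation would miss.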
Sample Phrases: както
• както физическа , така и психическа ||| both physical and psychological
• както целият регион ||| like the whole region
• както те са определени ||| as defined
• както и размера ||| as well as the size
• както и предишните редовни доклади ||| in line with previous regular reports
• както и по други ||| and in other
Phrase-Based SMT: Sample Russian-Bulgarian Phrases
Sample Phrases: заявление
• заявление ||| молба ||| 0.25 0.166667 1 1 2.718
• заявление об ||| молба за ||| 1 0.00524692 1 0.53125 2.718
• заявление об образовании ||| молба за образуването ||| 1 0.005 ...
• заявления ||| заявление ||| 1 1 0.5 0.666667 2.718
• заявления ||| заявление от ||| 1 0.500677 0.5 0.222222 2.718
• заявляю ||| заявявам ||| 0.333333 0.6 1 1 2.718
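Entries like the ones above follow the Moses-style "source ||| target ||| scores" layout; a minimal parser might look like this. The interpretation of the individual score columns (translation probabilities, lexical weights, and the constant phrase penalty 2.718 ≈ e) is an assumption based on the samples shown, not a spec.

```python
# Parse one line in the "src ||| tgt ||| scores" phrase-table format.
def parse_phrase_entry(line):
    src, tgt, scores = (field.strip() for field in line.split("|||"))
    return src, tgt, [float(s) for s in scores.split()]

src, tgt, scores = parse_phrase_entry(
    "заявление ||| молба ||| 0.25 0.166667 1 1 2.718")
print(src, tgt, scores)
```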
Sample Phrases: звонок, звук
• звонка ||| звънец ||| 1 1 0.4 0.5 2.718
• звонка ||| звънеца ||| 0.25 0.2 0.4 0.5 2.718
• звонка ||| на звънеца ||| 1 0.2 0.2 0.128199 2.718
• звонки ||| звънци ||| 0.4 0.4 1 1 2.718
• звонко ||| звънко ||| 0.333333 0.428571 1 1 2.718
• звонков ||| звънци ||| 0.4 0.4 1 1 2.718
• звонку ||| звънеца ||| 0.25 0.2 1 1 2.718
• звонок ||| звънеца ||| 0.375 0.3 0.375 0.3 2.718
• звонок ||| звънецът ||| 1 1 0.125 0.1 2.718
• звонок ||| иззвъня ||| 0.6 0.625 0.375 0.5 2.718

• звук ||| звук ||| 0.666667 0.666667 1 1 2.718
• звука ||| звук ||| 0.333333 0.333333 0.666667 0.4 2.718
• звука ||| звука ||| 1 0.666667 0.333333 0.4 2.718
• звуки ||| звуци ||| 1 1 1 1 2.718
Sample Phrases: здание
• здание ||| здание ||| 1 1 0.4 0.4 2.718
• здание ||| зданието ||| 0.75 0.5 0.6 0.6 2.718
• здания ||| зданието ||| 0.25 0.5 0.2 0.375 2.718
• здания ||| зданието на ||| 1 0.250861 0.4 0.140625 2.718
• здания ||| сградите ||| 1 1 0.2 0.25 2.718
• здания ||| сградите на ||| 1 0.500861 0.2 0.09375 2.718
Sample Phrases: здравствуй
• здравствуй ||| добро утро ||| 1 0.75 0.333 0.0625 2.718
• здравствуй ||| здравей ||| 1 1 0.666667 0.5 2.718
•здравствуйте ||| здравейте ||| 1 1 1 1 2.718
•здравствует ||| живее ||| 0.4 0.333333 1 1 2.718
Sample Phrases: необычайное

• необычайное ||| необикновено ||| 0.176471 0.142857 0.75 0.75 2.718
• необычайное ||| необикновеното ||| 0.333333 0.333333 0.25 0.25 2.718
• необычайно ||| извънредно ||| 1 0.4 0.125 0.117647 2.718
• необычайно ||| необикновена ||| 0.222222 0.166667 0.125 0.117647 2.718
• необычайно ||| необикновено ||| 0.588235 0.476191 0.625 0.588235 2.718
• необычайно ||| необичайно ||| 1 1 0.0625 0.117647 2.718
• необычайной ||| необикновена ||| 0.333333 0.416667 0.5 0.625 2.718
• необычайной ||| необикновено ||| 0.0588235 0.047619 0.166667 0.125 2.718
• необычайной ||| с необикновена ||| 1 0.209808 0.333333 0.15625 2.718
• необычайные ||| необикновени ||| 0.5 0.5 1 1 2.718
• необычайный ||| необикновен ||| 0.222222 0.222222 0.5 0.5 2.718
• необычайный ||| необикновеният ||| 0.5 0.5 0.25 0.25 2.718
• необычайный ||| необичайни ||| 0.333333 0.25 0.25 0.25 2.718

• необычное ||| необикновеното ||| 0.666667 0.666667 1 1 2.718
• необычные ||| необичайни ||| 0.666667 0.5 1 1 2.718

• неожиданной ||| неочакваната ||| 0.333333 0.333333 0.25 0.25 2.718
• неожиданной ||| неочаквана ||| 0.666667 0.6 0.75 0.75 2.718
SMT: Evaluation
How MT Evaluation is NOT Done…
• Backtranslation
  - A “mythical” example (Hutchins, 1995):
    o En: The spirit is willing, but the flesh is weak.
    o Ru: Дух бодр, но плоть слаба.
    o En: The vodka is good, but the meat is rotten.
  - Not used; can be gamed easily:
    o En: The spirit is willing, but the flesh is weak.
    o Ru: The spirit is willing, but the flesh is weak.
    o En: The spirit is willing, but the flesh is weak.
The BLEU Evaluation Metric (Papineni et al., ACL 2002)
Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.
• BLEU4 formula (counts n-grams up to length 4):

  BLEU4 = exp(1.0·log p1 + 0.5·log p2 + 0.25·log p3 + 0.125·log p4
              - max(words-in-reference / words-in-machine - 1, 0))

  p1 = 1-gram precision
  p2 = 2-gram precision
  p3 = 3-gram precision
  p4 = 4-gram precision
• Correlates well with human judgments
• Very hard to “game” it
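The formula on this slide can be turned directly into code. This is a minimal single-reference sketch with no smoothing, so any zero n-gram precision makes the whole score zero; real BLEU implementations handle multiple references and smoothing.

```python
import math
from collections import Counter

def ngram_precision(hyp, ref, n):
    # Clipped n-gram precision: hypothesis n-grams matched in the reference.
    hyp_ngrams = Counter(zip(*[hyp[i:] for i in range(n)]))
    ref_ngrams = Counter(zip(*[ref[i:] for i in range(n)]))
    overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return overlap / max(sum(hyp_ngrams.values()), 1)

def bleu4(hyp, ref, weights=(1.0, 0.5, 0.25, 0.125)):
    precisions = [ngram_precision(hyp, ref, n) for n in (1, 2, 3, 4)]
    if min(precisions) == 0:
        return 0.0  # unsmoothed: a zero precision zeroes the score
    score = sum(w * math.log(p) for w, p in zip(weights, precisions))
    # Brevity penalty term from the slide: max(len(ref)/len(hyp) - 1, 0).
    return math.exp(score - max(len(ref) / len(hyp) - 1, 0))

ref = "the cat sat on the mat".split()
print(bleu4("the cat sat on the mat".split(), ref))  # → 1.0 for an exact match
```

Note that the slide's weights (1.0, 0.5, 0.25, 0.125) differ from the uniform 0.25 weights of the standard BLEU definition; the code follows the slide.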
BLEU: Multiple Reference Translations
Reference translation 1: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .
Reference translation 3: The US International Airport of Guam and its office has received an email from a self-claimed Arabian millionaire named Laden , which threatens to launch a biochemical attack on such public places as airport . Guam authority has been on alert .
Reference translation 4: US Guam International Airport and its office received an email from Mr. Bin Laden and other rich businessman from Saudi Arabia . They said there would be biochemistry air raid to Guam Airport and other public places . Guam needs to be in high precaution about this matter .
Reference translation 2: Guam International Airport and its offices are maintaining a high state of alert after receiving an e-mail that was from a person claiming to be the wealthy Saudi Arabian businessman Bin Laden and that threatened to launch a biological and chemical attack on the airport and other public places .
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.
Phrase-Based SMT: Parameter Tuning
The Basic Model, Revisited
argmax_e P(e|f)
  = argmax_e P(e) × P(f|e) / P(f)
  = argmax_e P(e) × P(f|e)

Works better:
  argmax_e P(e)^2.4 × P(f|e)

Rewards longer hypotheses, since they are unfairly penalized by P(e):
  argmax_e P(e)^2.4 × P(f|e) × #words(e)^1.1

Add more features:
  ... × P(e|f)^1.1 × Plex(f|e)^1.3 × Plex(e|f)^0.9 × #phrases(e,f)^0.5 ...
(Och, ACL 2003)
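The weighted product above is a log-linear model: a weighted sum of log-feature values. A minimal scoring sketch, where the feature names and weights follow the slide but the feature values for the candidate are hypothetical:

```python
import math

# Slide's exponents become feature weights in log space:
#   log score = 2.4·log P(e) + 1.0·log P(f|e) + 1.1·log #words(e)
weights = {"lm": 2.4, "tm": 1.0, "word_penalty": 1.1}

def score(features, weights):
    # Weighted sum of log-features == log of the weighted product.
    return sum(weights[name] * math.log(value) for name, value in features.items())

features = {"lm": 0.01, "tm": 0.2, "word_penalty": 5}  # hypothetical values
print(score(features, weights))
```

Tuning then means adjusting `weights` so that the highest-scoring hypotheses also score well under BLEU, which is exactly what MERT on the next slide does.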
Maximum BLEU Training (Och, ACL 2003)
[Diagram: a French input enters the translation system (automatic, trainable), which produces English MT output; a translation quality evaluator (automatic) compares it with English reference translations (sample “right answers”) and outputs a BLEU score, which is fed back to the system. The system combines Language Model #1, Language Model #2, a Translation Model, a Length Model, and other features.]

MERT: Minimum Error Rate Training (optimizes BLEU directly)
Statistical Phrase-Based Translation
1. Training:
   1. P(e): n-gram language model
   2. P(f|e):
      1. Generate word alignments
      2. Build a phrase table
2. Tuning:
   1. Use MERT to tune the parameters
3. Evaluation:
   1. Run the system on test data
   2. Calculate BLEU