![Page 1: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/1.jpg)
What’s New in Statistical Machine Translation
Kevin Knight USC/Information Sciences Institute
USC/Computer Science Department
![Page 2: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/2.jpg)
Machine Translation
美国关岛国际机场及其办公室均接获一名自称沙地阿拉伯富商拉登等发出的电子邮件,威胁将会向机场等公众地方发动生化袭击後,关岛经保持高度戒备。
The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport.
![Page 3: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/3.jpg)
Thousands of Languages Are Spoken
MANDARIN 885,000,000
SPANISH 332,000,000
ENGLISH 322,000,000
BENGALI 189,000,000
HINDI 182,000,000
PORTUGUESE 170,000,000
RUSSIAN 170,000,000
JAPANESE 125,000,000
GERMAN 98,000,000
WU (China) 77,175,000
JAVANESE 75,500,800
KOREAN 75,000,000
FRENCH 72,000,000
VIETNAMESE 67,662,000
TELUGU 66,350,000
YUE (China) 66,000,000
MARATHI 64,783,000
TAMIL 63,075,000
TURKISH 59,000,000
URDU 58,000,000
MIN NAN (China) 49,000,000
JINYU (China) 45,000,000
GUJARATI 44,000,000
POLISH 44,000,000
ARABIC 42,500,000
UKRAINIAN 41,000,000
ITALIAN 37,000,000
XIANG (China) 36,015,000
MALAYALAM 34,022,000
HAKKA (China) 34,000,000
KANNADA 33,663,000
ORIYA 31,000,000
PANJABI 30,000,000
SUNDA 27,000,000
Source: Ethnologue
![Page 4: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/4.jpg)
Recent Progress in Statistical MT
2002:
insistent Wednesday may recurred her trips to Libya tomorrow for flying
Cairo 6-4 ( AFP ) - An official announced today in the Egyptian lines company for flying Tuesday is a company "insistent for flying" may resumed a consideration of a day Wednesday tomorrow her trips to Libya of Security Council decision trace international the imposed ban comment.
2003:
Egyptair Has Tomorrow to Resume Its Flights to Libya
Cairo 4-6 (AFP) - Said an official at the Egyptian Aviation Company today that the company egyptair may resume as of tomorrow, Wednesday its flights to Libya after the International Security Council resolution to the suspension of the embargo imposed on Libya.
![Page 5: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/5.jpg)
news broadcast
foreign language speech recognition
English translation
searchable archive
2005
![Page 6: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/6.jpg)
Warren Weaver (1947)
ingcmpnqsnwf cv fpn owoktvcv
hu ihgzsnwfv rqcffnw cw owgcnwf kowazoanv ...
![Page 7: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/7.jpg)
Warren Weaver (1947)
(dots mark letters not yet deciphered)
.e....e..e.. .. ..e ........
ingcmpnqsnwf cv fpn owoktvcv
.. .....e... .....e. .. ....e..
hu ihgzsnwfv rqcffnw cw owgcnwf
.......e.
kowazoanv ...
![Page 8: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/8.jpg)
Warren Weaver (1947)
.e....e..e.. .. the ........
ingcmpnqsnwf cv fpn owoktvcv
.. .....e... .....e. .. ....e..
hu ihgzsnwfv rqcffnw cw owgcnwf
.......e.
kowazoanv ...
![Page 9: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/9.jpg)
Warren Weaver (1947)
.e...he..e.. .. the ........
ingcmpnqsnwf cv fpn owoktvcv
.. .....e... .....e. .. ....e.t
hu ihgzsnwfv rqcffnw cw owgcnwf
.......e.
kowazoanv ...
![Page 10: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/10.jpg)
Warren Weaver (1947)
.e...he..e.. of the ........
ingcmpnqsnwf cv fpn owoktvcv
.. .....e... .....e. .. ....e.t
hu ihgzsnwfv rqcffnw cw owgcnwf
.......e.
kowazoanv ...
![Page 11: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/11.jpg)
Warren Weaver (1947)
.e...he..e.. of the .....fof
ingcmpnqsnwf cv fpn owoktvcv
.. .....e..f ..o..e. o. ...oe.t
hu ihgzsnwfv rqcffnw cw owgcnwf
.......ef
kowazoanv ...
![Page 12: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/12.jpg)
Warren Weaver (1947)
.e...he..e.. of the ........
ingcmpnqsnwf cv fpn owoktvcv
.. .....e... .....e. .. ....e.t
hu ihgzsnwfv rqcffnw cw owgcnwf
.......e.
kowazoanv ...
![Page 13: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/13.jpg)
Warren Weaver (1947)
.e...he..e.. is the .....sis
ingcmpnqsnwf cv fpn owoktvcv
.. .....e..s ..i..e. i. ...ie.t
hu ihgzsnwfv rqcffnw cw owgcnwf
.......es
kowazoanv ...
![Page 14: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/14.jpg)
Warren Weaver (1947)
decipherment is the analysis
ingcmpnqsnwf cv fpn owoktvcv
of documents written in ancient
hu ihgzsnwfv rqcffnw cw owgcnwf
languages ...
kowazoanv ...
![Page 15: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/15.jpg)
Warren Weaver (1947)
The non-Turkish guy next to me is even deciphering Turkish! All he needs is a statistical table of letter-pair frequencies in Turkish …
Collected mechanically from a Turkish body of text, or corpus
Can this be computerized?
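As a rough illustration of how such a table could be collected mechanically, here is a minimal sketch that counts adjacent letter pairs in a body of text. The function name `letter_pair_frequencies` and the tiny stand-in corpus are assumptions made for this example (the corpus is English rather than Turkish); nothing here comes from the talk itself.

```python
from collections import Counter


def letter_pair_frequencies(text):
    """Return relative frequencies of adjacent letter pairs inside words."""
    counts = Counter()
    for word in text.lower().split():
        # Keep letters only, so punctuation does not create spurious pairs.
        letters = [ch for ch in word if ch.isalpha()]
        for first, second in zip(letters, letters[1:]):
            counts[first + second] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


# Stand-in corpus, used purely for illustration.
corpus = "decipherment is the analysis of documents written in ancient languages"
table = letter_pair_frequencies(corpus)

# Show the five most frequent letter pairs.
for pair, freq in sorted(table.items(), key=lambda kv: -kv[1])[:5]:
    print(pair, round(freq, 3))
```

With a large corpus in the target language, a table like this is what a decipherer would compare against the letter-pair statistics of the ciphertext.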
![Page 16: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/16.jpg)
“When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”
- Warren Weaver, March 1947
![Page 17: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/17.jpg)
“When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”
- Warren Weaver, March 1947
“... as to the problem of mechanical translation, I frankly am afraid that the [semantic] boundaries of words in different languages are too vague ... to make any quasi-mechanical translation scheme very hopeful.”
- Norbert Wiener, April 1947
![Page 18: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/18.jpg)
Spanish/English corpus
1a. Garcia and associates .
1b. Garcia y asociados .
2a. Carlos Garcia has three associates .
2b. Carlos Garcia tiene tres asociados .
3a. his associates are not strong .
3b. sus asociados no son fuertes .
4a. Garcia has a company also .
4b. Garcia tambien tiene una empresa .
5a. its clients are angry .
5b. sus clientes estan enfadados .
6a. the associates are also angry .
6b. los asociados tambien estan enfadados .
7a. the clients and the associates are enemies .
7b. los clientes y los asociados son enemigos .
8a. the company has three groups .
8b. la empresa tiene tres grupos .
9a. its groups are in Europe .
9b. sus grupos estan en Europa .
10a. the modern groups sell strong pharmaceuticals .
10b. los grupos modernos venden medicinas fuertes .
11a. the groups do not sell zenzanine .
11b. los grupos no venden zanzanina .
12a. the small groups are not modern .
12b. los grupos pequenos no son modernos .
![Page 19: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/19.jpg)
Spanish/English corpus
Translate: Clients do not sell pharmaceuticals in Europe.
![Page 20: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/20.jpg)
Centauri/Arcturan [Knight 97]
1a. ok-voon ororok sprok .
1b. at-voon bichat dat .
2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .
3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .
4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .
5a. wiwok farok izok stok .
5b. totat jjat quat cat .
6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .
7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .
8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .
9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .
10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .
11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .
12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
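One mechanical way to start on this assignment is to ask, for each Centauri word, which Arcturan words appear in every sentence pair that contains it. The sketch below does only that intersection step; the helper name `candidates` is hypothetical, and this is an illustration of the idea, not the procedure used in the talk.

```python
# Sentence pairs copied from the slide: (Centauri, Arcturan), final periods dropped.
pairs = [
    ("ok-voon ororok sprok", "at-voon bichat dat"),
    ("ok-drubel ok-voon anok plok sprok", "at-drubel at-voon pippat rrat dat"),
    ("erok sprok izok hihok ghirok", "totat dat arrat vat hilat"),
    ("ok-voon anok drok brok jok", "at-voon krat pippat sat lat"),
    ("wiwok farok izok stok", "totat jjat quat cat"),
    ("lalok sprok izok jok stok", "wat dat krat quat cat"),
    ("lalok farok ororok lalok sprok izok enemok", "wat jjat bichat wat dat vat eneat"),
    ("lalok brok anok plok nok", "iat lat pippat rrat nnat"),
    ("wiwok nok izok kantok ok-yurp", "totat nnat quat oloat at-yurp"),
    ("lalok mok nok yorok ghirok clok", "wat nnat gat mat bat hilat"),
    ("lalok nok crrrok hihok yorok zanzanok", "wat nnat arrat mat zanzanat"),
    ("lalok rarok nok izok hihok mok", "wat nnat forat arrat vat gat"),
]


def candidates(word, pairs):
    """Arcturan words present in every pair whose Centauri side contains `word`."""
    sets = [set(arc.split()) for cen, arc in pairs if word in cen.split()]
    return set.intersection(*sets) if sets else set()


for word in "farok crrrok hihok yorok clok kantok ok-yurp".split():
    print(word, sorted(candidates(word, pairs)))
```

Words seen in several sentence pairs (such as farok and hihok) narrow down to a single candidate, while words seen only once stay ambiguous and need further reasoning across the corpus.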
![Page 21: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/21.jpg)
Centauri/Arcturan [Knight 97]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
![Page 22: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/22.jpg)
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
Centauri/Arcturan [Knight 97]
![Page 23: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/23.jpg)
Centauri/Arcturan [Knight 97]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
???
![Page 24: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/24.jpg)
Centauri/Arcturan [Knight 97]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
![Page 25: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/25.jpg)
Centauri/Arcturan [Knight 97]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
![Page 26: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/26.jpg)
Centauri/Arcturan [Knight 97]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
![Page 27: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/27.jpg)
Centauri/Arcturan [Knight 97]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
???
![Page 28: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/28.jpg)
Centauri/Arcturan [Knight 97]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
![Page 29: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/29.jpg)
Centauri/Arcturan [Knight 97]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
process of elimination
![Page 30: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/30.jpg)
Centauri/Arcturan [Knight 97]
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
cognate?
![Page 31: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/31.jpg)
Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }
Centauri/Arcturan [Knight 97]
zero fertility
![Page 32: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/32.jpg)
“When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”
- Warren Weaver, March 1947
The required statistical tables have millions of entries…?
Too much for the computers of Weaver’s day.
Not enough RAM!
![Page 33: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/33.jpg)
IBM Candide Project (1988-1994)
• How to get quantities of human translation in computer-readable form?
– parallel corpus
Canadian bureaucrat
IBM’s John Cocke, inventor of CKY parsing & RISC processors
![Page 36: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/36.jpg)
IBM Candide Project [Brown et al 93]
French → Broken English → English
French/English bilingual text → statistical analysis (for the French → Broken English step)
English text → statistical analysis (for the Broken English → English step)
J’ai si faim → What hunger have I, Hungry I am so, I am so hungry, Have me that hunger … → I am so hungry
![Page 37: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/37.jpg)
Mathematical Formulation
French → Broken English → English
J’ai si faim → I am so hungry
Translation Model: P(f | e)
Language Model: P(e)
Decoding algorithm: $\arg\max_e P(e) \cdot P(f \mid e)$
Given source sentence f:
$$\arg\max_e P(e \mid f) = \arg\max_e \frac{P(f \mid e) \cdot P(e)}{P(f)} \quad \text{(by Bayes' rule)}$$
$$= \arg\max_e P(f \mid e) \cdot P(e) \quad \text{(}P(f)\text{ is the same for all } e\text{)}$$
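The decoding rule can be sketched in a few lines once both models are available as lookup tables. In the sketch below the candidate sentences are the ones shown on the Candide slide, but every probability is a made-up number chosen only for illustration, and `decode` is a hypothetical helper that searches an explicit candidate list rather than the full space of English sentences.

```python
# Candidate English outputs for f = "J' ai si faim" (from the Candide slide).
candidates = [
    "What hunger have I",
    "Hungry I am so",
    "I am so hungry",
    "Have me that hunger",
]

# P(e): hypothetical language-model scores (invented for this example).
language_model = {
    "What hunger have I": 1e-6,
    "Hungry I am so": 1e-7,
    "I am so hungry": 1e-4,
    "Have me that hunger": 1e-8,
}

# P(f | e): hypothetical translation-model scores (invented for this example).
translation_model = {
    "What hunger have I": 3e-3,
    "Hungry I am so": 2e-3,
    "I am so hungry": 1e-3,
    "Have me that hunger": 4e-3,
}


def decode(candidates, lm, tm):
    """Return the candidate e that maximizes P(e) * P(f | e)."""
    return max(candidates, key=lambda e: lm[e] * tm[e])


best = decode(candidates, language_model, translation_model)
print(best)
```

Note that the translation model alone would prefer "Have me that hunger" under these invented numbers; it is the language model that pulls the choice toward fluent English.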
![Page 38: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/38.jpg)
Language Modeling
Goal of a language model for MT:
He is on the soccer field / He is in the soccer field
Is table the on cup the / The cup is on the table
American shrine / American company
Need to make these decisions, because translation model may not have a lot of context information!
![Page 39: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/39.jpg)
The Classic Language Model: Word Bigrams
Process model of English: Generate each word based only on the previous word.
P(I saw water on the table) =
P(I | START) · P(saw | I) · P(water | saw) · P(on | water) · P(the | on) · P(table | the) · P(END | table)
Probabilities can be tabulated from an online English corpus … just like Weaver’s Turkish case.
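The tabulation step can be sketched as plain counting over a corpus. The three-sentence corpus and the function names below are assumptions made for illustration, and the sketch uses unsmoothed relative frequencies, so any unseen word pair gets probability zero.

```python
from collections import Counter

START, END = "<s>", "</s>"


def train_bigrams(sentences):
    """Estimate P(cur | prev) by relative frequency over the corpus."""
    pair_counts, context_counts = Counter(), Counter()
    for sentence in sentences:
        words = [START] + sentence.split() + [END]
        for prev, cur in zip(words, words[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return {pair: n / context_counts[pair[0]] for pair, n in pair_counts.items()}


def sentence_probability(sentence, bigram):
    """Multiply bigram probabilities, including the START and END transitions."""
    words = [START] + sentence.split() + [END]
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        prob *= bigram.get((prev, cur), 0.0)  # unseen pair: probability 0
    return prob


# Toy corpus, invented for this example.
corpus = [
    "I saw water on the table",
    "I saw the cup on the table",
    "the cup is on the table",
]
bigram = train_bigrams(corpus)
print(sentence_probability("I saw water on the table", bigram))
print(sentence_probability("table the on water saw I", bigram))
```

A practical language model would add smoothing so that a single unseen pair does not zero out the whole sentence.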
![Page 40: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/40.jpg)
Trigram Language Model
to the said royal purchase plan trustco part operations of its its is
international expand banking
[Soricut & Marcu, 05]
![Page 41: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/41.jpg)
Trigram Language Model
to the said royal purchase plan trustco part operations of its its is
international expand banking
the banking trustco is said to expand its purchase part of its royal international plan operations
[Soricut & Marcu, 05]
![Page 42: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/42.jpg)
Trigram Language Model
to the said royal purchase plan trustco part operations of its its is
international expand banking
royal trustco said the purchase is part of its plan to expand its international banking operations
the banking trustco is said to expand its purchase part of its royal international plan operations
[Soricut & Marcu, 05]
N-grams have a lot of semantics in them!
![Page 43: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/43.jpg)
Trigram Language Model
to the said royal purchase plan trustco part operations of its its is
international expand banking
royal trustco said the purchase is part of its plan to expand its international banking operations
[Soricut & Marcu, 05]
the banking trustco is said to expand its purchase part of its royal international plan operations
with the stressed relationship part
own longstanding its its for chinese boeing , ,
for its part, stressed the longstanding relationship with its own, chinese boeing
boeing, for its part, stressed its own longstanding relationship with the chinese
![Page 44: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/44.jpg)
Translation Model?
Mary did not slap the green witch
Maria no dió una bofetada a la bruja verde
Source-language morphological analysis
Source parse tree
Semantic representation
Generate target structure
Process model of translation:
![Page 45: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/45.jpg)
Translation Model?
Mary did not slap the green witch
Maria no dió una bofetada a la bruja verde
Source-language morphological analysis
Source parse tree
Semantic representation
Generate target structure
What are all the possible moves and what probability tables control those moves?
Process model of translation:
![Page 46: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/46.jpg)
The Classic Translation Model: Word Substitution/Permutation [Brown et al., 1993]
Mary did not slap the green witch
Mary not slap slap slap the green witch
Maria no dió una bofetada a la bruja verde
Mary not slap slap slap NULL the green witch
Maria no dió una bofetada a la verde bruja
Trainable
Process model of translation:
n(3|slap) 50k entries
d(j|i) 2500 entries
P-Null 1 entry
t(la|the) 25m entries
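The four steps of this process model can be traced in code. The sketch below hard-codes one choice at each step so that it reproduces the intermediate strings on the slide; in the real model each choice would be governed by the n, P-Null, t, and d tables. The variable names and the fixed choices are assumptions made for illustration.

```python
english = "Mary did not slap the green witch".split()

# Step 1, fertility n(phi | e): how many foreign words each English word yields.
# "did" gets fertility 0 and disappears; "slap" gets fertility 3.
fertility = {"Mary": 1, "did": 0, "not": 1, "slap": 3,
             "the": 1, "green": 1, "witch": 1}
expanded = [word for word in english for _ in range(fertility[word])]

# Step 2, NULL insertion (P-Null): add a slot for a spurious foreign word.
null_index = 5  # fixed here; the model would decide this probabilistically
with_null = expanded[:null_index] + ["NULL"] + expanded[null_index:]

# Step 3, word translation t(f | e): one foreign word per slot, chosen by hand.
translated = ["Maria", "no", "dió", "una", "bofetada", "a", "la", "verde", "bruja"]

# Step 4, distortion d(j | i): the target position of each translated word.
target_positions = [0, 1, 2, 3, 4, 5, 6, 8, 7]  # swaps "verde" and "bruja"
spanish = [None] * len(translated)
for word, j in zip(translated, target_positions):
    spanish[j] = word

print(" ".join(expanded))
print(" ".join(with_null))
print(" ".join(translated))
print(" ".join(spanish))
```

Each printed line corresponds to one row of the slide, ending with the observed Spanish sentence.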
![Page 47: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/47.jpg)
The Classic Translation Model: Word Substitution/Permutation [Brown et al., 1993]
Mary did not slap the green witch
Maria no dió una bofetada a la bruja verde
Still trainable!
Process model of translation:
n(3|slap) 50k entries
d(j|i) 2500 entries
P-Null 1 entry
t(la|the) 25m entries
?
![Page 48: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/48.jpg)
Classic Formula for P(f | e)
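$$
\begin{aligned}
P(f \mid e) = \sum_{a} \; & \binom{m - \phi_0}{\phi_0} \cdot P_{\text{Null}}^{\,m - 2\phi_0} \cdot (1 - P_{\text{Null}})^{\phi_0} \cdot \prod_{i=0}^{l} \phi_i! \cdot \frac{1}{\phi_0!} \\
& \cdot \prod_{i=1}^{l} n(\phi_i \mid e_i) \cdot \prod_{j=1}^{m} t(f_j \mid e_{a_j}) \cdot \prod_{j=1,\; a_j \neq 0}^{m} d(j \mid a_j, l, m)
\end{aligned}
$$
Labels: the sum over a ranges over alignment possibilities; the binomial and P-Null factors are the NULL stuff; n is fertility; t is word translation; d is re-ordering.
Set parameter values so that the formula assigns the highest possible probability to observed human translations. This is a 25-million-dimensional search space.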
![Page 49: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/49.jpg)
Unsupervised EM Training
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
All P(french-word | english-word) equally likely
![Page 50: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/50.jpg)
Unsupervised EM Training
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
“la” and “the” observed to co-occur frequently, so P(la | the) is increased.
![Page 51: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/51.jpg)
Unsupervised EM Training
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
“maison” co-occurs with both “the” and “house”, but P(maison | house) can be raised without limit, to 1.0, while P(maison | the) is limited because of “la” (pigeonhole principle)
![Page 52: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/52.jpg)
Unsupervised EM Training
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
settling down after another iteration
![Page 53: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/53.jpg)
Unsupervised EM Training
… la maison … la maison bleue … la fleur …
… the house … the blue house … the flower …
Inherent hidden structure revealed by EM training!
• “A Statistical MT Tutorial Workbook” (Knight, 1999). Promises free beer.
• “The Mathematics of Statistical Machine Translation” (Brown et al, 1993)
• Software: GIZA++
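The EM walkthrough on the preceding slides can be sketched in a few lines. This is a minimal illustration in the style of IBM Model 1 (word translation probabilities only, no NULL word, fertility, or re-ordering), run on the toy corpus from the slides. The function name `train_model1` and the iteration count are assumptions made for this sketch; it is not the GIZA++ implementation.

```python
from collections import defaultdict

# Toy parallel corpus from the slides: (French words, English words).
corpus = [
    ("la maison".split(), "the house".split()),
    ("la maison bleue".split(), "the blue house".split()),
    ("la fleur".split(), "the flower".split()),
]

def train_model1(corpus, iterations=100):
    """Estimate t[(f, e)] = P(f | e) with EM (simplified Model 1, no NULL word)."""
    f_vocab = {f for fs, _ in corpus for f in fs}
    # Start with all P(french-word | english-word) equally likely.
    t = defaultdict(lambda: 1.0 / len(f_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected number of times e generates f
        total = defaultdict(float)  # expected number of words generated by e
        # E-step: split each French word's count over the English words
        # in the same sentence pair, in proportion to the current t values.
        for fs, es in corpus:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize expected counts into probabilities.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

t = train_model1(corpus)
for f, e in [("la", "the"), ("maison", "house"), ("maison", "the")]:
    print(f"P({f} | {e}) = {t[(f, e)]:.3f}")
```

Because “la” absorbs most of the probability mass of “the”, P(maison | the) is squeezed down while P(maison | house) is free to grow, which is the pigeonhole effect described on the slides.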
![Page 54: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/54.jpg)
Translation Model: Sample Translation Probabilities [Brown et al 93]

e          f              P(f | e)
national   nationale      0.47
national   national       0.42
national   nationaux      0.05
national   nationales     0.03
the        le             0.50
the        la             0.21
the        les            0.16
the        l’             0.09
the        ce             0.02
the        cette          0.01
farmers    agriculteurs   0.44
farmers    les            0.42
farmers    cultivateurs   0.05
farmers    producteurs    0.02
![Page 55: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/55.jpg)
Translation Model
Input: new French sentence f and potential translation e
Output: P(f | e)
![Page 56: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/56.jpg)
Translation Model and Language Model
Input: new French sentence f and potential translation e
Translation Model output: P(f | e)
Language Model output: P(e)

Language Model: sample probabilities

w1     w2        P(w2 | w1)
of     the       0.13
of     a         0.09
of     another   0.01
of     some      0.01
hong   kong      0.98
hong   said      0.01
hong   stated    0.01
![Page 57: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/57.jpg)
Translation Model and Language Model
Input: new French sentence f and potential translation e
Translation Model output: P(f | e)
Language Model output: P(e)
Score for e: P(f | e) · P(e)
![Page 58: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/58.jpg)
Search for Best Translation
voulez – vous vous taire !
![Page 59: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/59.jpg)
Search for Best Translation
voulez – vous vous taire !
you – you you quiet !
![Page 60: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/60.jpg)
Search for Best Translation
voulez – vous vous taire !
you – you quiet !
![Page 61: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/61.jpg)
Search for Best Translation
voulez – vous vous taire !
quiet you – you you !
![Page 62: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/62.jpg)
Search for Best Translation
voulez – vous vous taire !
shut you – you you !
![Page 63: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/63.jpg)
Search for Best Translation
voulez – vous vous taire !
you shut !
![Page 64: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/64.jpg)
Search for Best Translation
voulez – vous vous taire !
you shut up !
![Page 65: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/65.jpg)
Classic Decoding Algorithm
Given f, find the English string e that maximizes P(e) · P(f | e)
NP-Complete [Knight 99].
Brown et al 93: “In this paper, we focus on the translation modeling problem. We hope to deal with the [decoding] problem in a later paper.”
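To make the objective concrete, here is a minimal sketch that scores a handful of candidate translations with P(e) · P(f | e) and keeps the best one. It sidesteps the hard part (searching over all English strings) by enumerating candidates by hand. The translation probabilities are taken from the sample table on the earlier slides; the bigram language model values, the unseen-bigram floor, and the simplified Model-1-style P(f | e) are assumptions made up for this illustration.

```python
# t(f | e): values from the sample translation table [Brown et al 93].
T = {
    ("le", "the"): 0.50, ("la", "the"): 0.21, ("les", "the"): 0.16,
    ("agriculteurs", "farmers"): 0.44, ("les", "farmers"): 0.42,
}

# Bigram language model P(w2 | w1): hypothetical values for illustration only.
LM = {
    ("<s>", "the"): 0.2, ("the", "farmers"): 0.01, ("farmers", "</s>"): 0.1,
    ("<s>", "farmers"): 0.0005, ("farmers", "the"): 0.01, ("the", "</s>"): 0.0001,
}

def p_e(e):
    """Language model score P(e) as a product of bigram probabilities."""
    words = ["<s>"] + e + ["</s>"]
    p = 1.0
    for w1, w2 in zip(words, words[1:]):
        p *= LM.get((w1, w2), 1e-6)  # small floor for unseen bigrams (assumption)
    return p

def p_f_given_e(f, e):
    """Simplified translation model: each French word is generated by a
    uniformly chosen English word (no NULL, fertility, or re-ordering)."""
    p = 1.0
    for fj in f:
        p *= sum(T.get((fj, ei), 0.0) for ei in e) / len(e)
    return p

def best_translation(f, candidates):
    """Return the candidate e that maximizes P(e) * P(f | e)."""
    return max(candidates, key=lambda e: p_e(e) * p_f_given_e(f, e))

f = "les agriculteurs".split()
candidates = [c.split() for c in ["farmers", "farmers the", "the farmers"]]
print(" ".join(best_translation(f, candidates)))
```

Note the division of labor: this toy translation model gives “the farmers” and “farmers the” the same P(f | e), and it is the language model that prefers the well-ordered string.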
![Page 66: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/66.jpg)
Beam Search Decoding [Brown et al US Patent #5,477,451]
[Diagram: hypotheses arranged in columns from start to end (1st English word, 2nd English word, 3rd English word, 4th English word), ending when all source words are covered.]
Each partial translation hypothesis contains:
- Last English word chosen + source words covered by it
- Next-to-last English word chosen
- Entire coverage vector (so far) of source sentence
- Language model and translation model scores (so far)
[Jelinek 69; Och, Ueffing, and Ney, 01]
![Page 67: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/67.jpg)
Beam Search Decoding [Brown et al US Patent #5,477,451]
[Diagram as on the previous slide, with each hypothesis now linked to its best predecessor.]
[Jelinek 69; Och, Ueffing, and Ney, 01]
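The hypothesis record listed on the slide maps naturally onto a small data structure. The sketch below is a heavily simplified beam search under stated assumptions: every source word translates to exactly one English word, hypotheses are grouped by the number of source words covered, and the names `Hypothesis`, `beam_decode`, `options`, and `lm_logprob` are hypothetical. It is not the patented decoder or any published system.

```python
import math
from dataclasses import dataclass
from typing import Optional

@dataclass
class Hypothesis:
    last: str                      # last English word chosen
    prev: str                      # next-to-last English word chosen
    coverage: frozenset            # source positions covered so far
    score: float                   # log LM + log TM score so far
    back: Optional["Hypothesis"]   # best predecessor link

def beam_decode(source, options, lm_logprob, beam_size=5):
    """options[f] lists (english_word, log t(f | e)) pairs;
    lm_logprob(prev, last, word) returns a log language model score."""
    start = Hypothesis("<s>", "<s>", frozenset(), 0.0, None)
    stacks = [[start]] + [[] for _ in source]
    for n in range(len(source)):
        best = {}
        for hyp in stacks[n]:
            for j, f in enumerate(source):
                if j in hyp.coverage:
                    continue
                for e, log_t in options[f]:
                    score = hyp.score + log_t + lm_logprob(hyp.prev, hyp.last, e)
                    cov = hyp.coverage | {j}
                    # Recombination: hypotheses that agree on the last two
                    # words and the coverage vector are interchangeable for
                    # future scoring, so keep only the best one.
                    key = (hyp.last, e, cov)
                    if key not in best or score > best[key].score:
                        best[key] = Hypothesis(e, hyp.last, cov, score, hyp)
        # Prune: keep only the top beam_size hypotheses in this column.
        stacks[n + 1] = sorted(best.values(), key=lambda h: h.score, reverse=True)[:beam_size]
    hyp = max(stacks[-1], key=lambda h: h.score)
    words = []
    while hyp.back is not None:    # follow best predecessor links back to start
        words.append(hyp.last)
        hyp = hyp.back
    return words[::-1]

# Toy inputs (made-up probabilities).
options = {"la": [("the", math.log(0.9))],
           "maison": [("house", math.log(0.9))],
           "bleue": [("blue", math.log(0.9))]}
GOOD = {("<s>", "the"), ("the", "blue"), ("blue", "house")}

def toy_lm(prev, last, word):
    # Bigram stand-in that ignores prev; a trigram model would use it.
    return math.log(0.5 if (last, word) in GOOD else 0.01)

print(beam_decode("la maison bleue".split(), options, toy_lm))
```

Keeping the next-to-last word in the state is what would let a trigram language model score the next extension; the toy model here only looks at the last word.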
![Page 68: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/68.jpg)
Classic Results
• nous avons signé le protocole . (Foreign Original)
• we did sign the memorandum of agreement . (Human Translation)
• we have signed the protocol . (MT)

• où était le plan solide ? (Foreign Original)
• but where was the solid plan ? (Human Translation)
• where was the economic base ? (MT)

the Ministry of Foreign Trade and Economic Cooperation, including foreign direct investment 40 billion US dollars today provide data include that year to November china actually using foreign 46.959 billion US dollars and
very slow = one page per day
![Page 69: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/69.jpg)
Okay!
I know, so far, this talk should be called …
What’s Old in Statistical
Machine Translation!!
![Page 70: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/70.jpg)
Further Developments
• Follow-on projects
– Hong Kong
– Aachen
– Behavior Design Corporation
• JHU Summer Workshop 1999
– Build & distribute statistical MT tools
– Create standard training & testing data
– Disseminate tutorial material
– “MT in a Day”
– Ask new questions
![Page 71: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/71.jpg)
How Much Data Do We Need?
[Plot: x-axis: amount of bilingual training data; y-axis: quality of automatically trained machine translation system]
![Page 72: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/72.jpg)
Advances in Statistical MT, 2000-2004
![Page 73: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/73.jpg)
Ready-to-Use Online Bilingual Data
[Chart: millions of words (English side, y-axis from 0 to 180) of ready-to-use bilingual data by year, 1994 to 2004, for Chinese/English, Arabic/English, and French/English.]
(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn.)
![Page 74: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/74.jpg)
Ready-to-Use Online Bilingual Data
[Chart as on the previous slide, with an added annotation:]
+ European parliament data [Koehn 05]
![Page 75: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/75.jpg)
Reference (human) translation: The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport .
Machine translation: The American [?] international airport and its the office all receives one calls self the sand Arab rich business [?] and so on electronic mail , which sends out ; The threat will be able after public place and so on the airport to start the biochemistry attack , [?] highly alerts after the maintenance.
BLEU Evaluation Metric (Papineni et al 02)
• N-gram precision (score is between 0 & 1)
What percentage of machine n-grams can be found in the reference translation?
Gross measure over 1000 test sentences.
Not allowed to use same portion of reference translation twice (can’t cheat by typing out “the the the the the”)
Brevity penalty: can’t just type out single word “the” (and get precision 1.0)
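The two safeguards named on the slide (clipped n-gram counts and the brevity penalty) can be sketched as follows. This is a simplified single-reference, single-sentence version written for illustration; the metric as described on the slide is computed as a gross measure over a whole test set, and the function names here are assumptions.

```python
import math
from collections import Counter

def ngrams(words, n):
    """Count all n-grams of length n in a list of words."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def modified_precision(candidate, reference, n):
    """Fraction of candidate n-grams found in the reference, where each
    reference n-gram can be matched at most as often as it occurs."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    total = sum(cand.values())
    return matched / total if total else 0.0

def brevity_penalty(candidate, reference):
    """Penalize candidates that are shorter than the reference."""
    c, r = len(candidate), len(reference)
    if c == 0:
        return 0.0
    return 1.0 if c >= r else math.exp(1 - r / c)

def bleu(candidate, reference, max_n=4):
    """Brevity penalty times the geometric mean of 1- to max_n-gram precisions."""
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    return brevity_penalty(candidate, reference) * math.exp(log_mean)

reference = "the gunman was shot to death by the police .".split()
print(modified_precision("the the the the the".split(), reference, 1))  # clipped to 2 of 5
print(bleu("the gunman was shot dead by the police .".split(), reference))
```

Typing “the” five times earns only 2/5 unigram precision because the reference contains “the” twice, and the single word “the” is crushed by the brevity penalty even though its unigram precision is 1.0.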
![Page 76: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/76.jpg)
BLEU in Action
枪手被警方击毙。 (Foreign Original)
the gunman was shot to death by the police . (Reference Translation)
the gunman was police kill . #1
wounded police jaya of #2
the gunman was shot dead by the police . #3
the gunman arrested by police kill . #4
the gunmen were killed . #5
the gunman was shot to death by the police . #6
gunmen were killed by police ?SUB>0 ?SUB>0 #7
al by the police . #8
the ringer is killed by the police . #9
police killed the gunman . #10
![Page 77: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/77.jpg)
BLEU in Action
green = 4-gram match (good!)
red = word not matched (bad!)
![Page 78: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/78.jpg)
BLEU Tends to Predict Human Judgments
[Scatter plot: BLEU score (y-axis) against human judgments (x-axis), both on a scale from -2.5 to 2.5, with linear fits for Adequacy and Fluency. The two fits are labeled R² = 88.0% and R² = 90.2%.]
slide from G. Doddington (NIST)
![Page 79: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/79.jpg)
Experiment-Driven Progress
[Plot: BLEU (y-axis, 15 to 35) over time (Mar 1 to May 1, 2005) for ISI Syntax-Based MT, Chinese/English, NIST 2002 Test Set.]
Evaluate new MT research ideas every day! (and be alerted about bugs…)
![Page 80: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/80.jpg)
Draw Learning Curves
[Plot: BLEU score (y-axis, 0 to 0.35) against # of sentence pairs used in training (x-axis: 10k, 20k, 40k, 80k, 160k, 320k) for Swedish/English, French/English, German/English, and Finnish/English.]
Experiments by Philipp Koehn
![Page 81: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/81.jpg)
Flaws of Word-Based MT
• Can’t translate multiple English words to one French word
• Can’t translate phrases
– “real estate”, “note that”, “interest in”
• Isn’t sensitive to syntax
– Adjectives/nouns should swap order
– Verb comes at the beginning in Arabic
• Doesn’t understand the meaning (?)
![Page 82: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/82.jpg)
The MT Triangle
[Diagram: a triangle with SOURCE on one side and TARGET on the other. Levels on each side: words, syntax, logical form, with interlingua at the top.]
![Page 83: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/83.jpg)
The MT Swimming Pool
syntax syntax
interlingua
words words
logical form logical form
![Page 84: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/84.jpg)
Commercial Rule-Based Systems
[MT triangle diagram: SOURCE and TARGET sides with levels words, syntax, logical form, and interlingua.]
![Page 85: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/85.jpg)
Knight et al 95
- meaning-based translation
- composition rules
[MT triangle diagram: SOURCE and TARGET sides with levels words, syntax, logical form, and interlingua, plus a Language Model component.]
![Page 86: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/86.jpg)
Wu 97, Alshawi 98
- inducing syntactic structure as a by-product of aligning words in bilingual text
[MT triangle diagram: SOURCE and TARGET sides with levels words, syntax, logical form, and interlingua, plus a Language Model component.]
![Page 87: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/87.jpg)
Yamada/Knight (01,02)
- tree/string model
- used existing target language parser
[MT triangle diagram: SOURCE and TARGET sides with levels words, syntax, logical form, and interlingua, plus a Language Model component.]
![Page 88: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/88.jpg)
Well, these all seem like good ideas.
Which one had the most dramatic effect on MT quality?
None of them!
![Page 89: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/89.jpg)
Phrases
[MT triangle diagram with a new level: SOURCE phrases and TARGET phrases, alongside words, syntax, logical form, and interlingua.]
How do you translate “real estate” into French?
real estate, real number, dance number, dance card, memory card, memory stick, …
![Page 90: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/90.jpg)
Phrase-Based Statistical MT
• Foreign input segmented into phrases
– “phrase” just means “word sequence”
• Each phrase is probabilistically translated into English
– P(to the conference | zur Konferenz)
– P(into the meeting | zur Konferenz)
• Phrases are probabilistically re-ordered
See [Koehn et al, 2003] for an overview.
Morgen fliege ich nach Kanada zur Konferenz
Tomorrow I will fly to the conference in Canada
![Page 91: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/91.jpg)
How to Learn the Phrase Translation Table?
• One method: “alignment templates” [Och et al 99]
• Start with word alignment
• Collect all phrase pairs that are consistent with the word alignment
![Page 92: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/92.jpg)
Word Alignment Induced Phrases
Mary did not slap the green witch
Maria no dió una bofetada a la bruja verde
[Word alignment grid between the two sentences.]
![Page 93: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/93.jpg)
Word Alignment Induced Phrases
Mary did not slap the green witch
Maria no dió una bofetada a la bruja verde
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
![Page 94: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/94.jpg)
Word Alignment Induced Phrases
Mary did not slap the green witch
Maria no dió una bofetada a la bruja verde
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap) (bruja verde, green witch)
![Page 95: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/95.jpg)
Word Alignment Induced Phrases
Mary did not slap the green witch
Maria no dió una bofetada a la bruja verde
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap) (bruja verde, green witch)
(Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the)
![Page 96: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/96.jpg)
Word Alignment Induced Phrases
Mary did not slap the green witch
Maria no dió una bofetada a la bruja verde
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap)
(Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the)
(bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap)
(a la bruja verde, the green witch) …
![Page 97: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/97.jpg)
Word Alignment Induced Phrases
Mary did not slap the green witch
Maria no dió una bofetada a la bruja verde
(Maria, Mary) (no, did not) (dió una bofetada, slap) (la, the) (bruja, witch) (verde, green)
(a la, the) (dió una bofetada a, slap)
(Maria no, Mary did not) (no dió una bofetada, did not slap) (dió una bofetada a la, slap the)
(bruja verde, green witch) (Maria no dió una bofetada, Mary did not slap)
(a la bruja verde, the green witch) …
(Maria no dió una bofetada a la bruja verde, Mary did not slap the green witch)
![Page 98: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/98.jpg)
Phrase Pair Probabilities
• A certain phrase pair (f-f-f, e-e-e) may appear many times across the bilingual corpus.
• No EM training
• Just relative frequency:
$$
P(\text{f-f-f} \mid \text{e-e-e}) = \frac{\mathrm{count}(\text{f-f-f}, \text{e-e-e})}{\mathrm{count}(\text{e-e-e})}
$$
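The relative-frequency estimate is a one-pass count over the extracted phrase pairs. A minimal sketch, assuming the pairs arrive as a flat list of (foreign phrase, English phrase) tuples collected across the corpus; the counts in the example are made up.

```python
from collections import Counter

def phrase_table(extracted_pairs):
    """Map (f_phrase, e_phrase) to count(f, e) / count(e)."""
    pair_count = Counter(extracted_pairs)
    e_count = Counter(e for _, e in extracted_pairs)
    return {(f, e): c / e_count[e] for (f, e), c in pair_count.items()}

# Hypothetical extraction output: each tuple is one occurrence in the corpus.
extracted = (
    [("zur Konferenz", "to the conference")] * 3
    + [("zu der Konferenz", "to the conference")]
    + [("zur Konferenz", "into the meeting")]
)

table = phrase_table(extracted)
print(table[("zur Konferenz", "to the conference")])
```

No iteration is involved: unlike the word-based models, the counts are read directly off the extracted pairs.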
![Page 99: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/99.jpg)
Phrase-Based MT
• This is currently the best way to do Statistical MT!
• What took so long to move from words to phrases?
– Missing RAM
   • 25m parameters → billions of parameters
   • Trick idea: build test-corpus-specific phrase table (takes 5 hours!)
   • Now solved in commercial deployments
– Missing computing power
– Many competing ideas to shake out
   • Koehn 03 summarizes several variations
– Empirical effectiveness even better than intuition would predict
• This is not building a ladder to the moon!
– If you can’t translate “real estate” into French, you are sunk
![Page 100: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/100.jpg)
Advanced Training Methods
$$
\arg\max_{e} P(e \mid f) = \arg\max_{e} \frac{P(e) \times P(f \mid e)}{P(f)} = \arg\max_{e} P(e) \times P(f \mid e)
$$
![Page 101: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/101.jpg)
Advanced Training Methods
$$
\arg\max_{e} P(e \mid f) = \arg\max_{e} \frac{P(e) \times P(f \mid e)}{P(f)}
$$
$$
\arg\max_{e} P(e)^{2.4} \times P(f \mid e) \quad \text{… works better!}
$$
![Page 102: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/102.jpg)
Advanced Training Methods
$$
\arg\max_{e} P(e \mid f) = \arg\max_{e} \frac{P(e) \times P(f \mid e)}{P(f)}
$$
$$
\arg\max_{e} P(e)^{2.4} \times P(f \mid e) \times \mathrm{length}(e)^{1.1}
$$
Rewards longer hypotheses, since these are unfairly punished by P(e)
![Page 103: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/103.jpg)
Advanced Training Methods
$$
\arg\max_{e} P(e)^{2.4} \times P(f \mid e) \times \mathrm{length}(e)^{1.1} \times \mathrm{FEAT}^{3.7} \cdots
$$
Lots of features vote on every potential translation.
Exponential model.
Problem: How to set the exponent weights?
IDEA 1: maximize probability of the data
IDEA 2: maximize BLEU score of MT system
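A product of features raised to exponent weights becomes a weighted sum of log feature values, which is how such a model is usually scored in practice. The sketch below uses the exponents shown on the slides (2.4, 1.0, 1.1); the feature values for the two hypotheses and the names `WEIGHTS`, `loglinear_score`, and `best_hypothesis` are invented for illustration.

```python
import math

# Exponent weights from the slides: P(e)^2.4 x P(f | e)^1.0 x length(e)^1.1
WEIGHTS = {"lm": 2.4, "tm": 1.0, "length": 1.1}

def loglinear_score(features, weights=WEIGHTS):
    """Log of the weighted product: sum of weight * log(feature value)."""
    return sum(weights[name] * math.log(value) for name, value in features.items())

def best_hypothesis(hypotheses, weights=WEIGHTS):
    """Pick the hypothesis with the highest weighted score."""
    return max(hypotheses, key=lambda h: loglinear_score(h["features"], weights))

# Hypothetical feature values for two candidate translations.
hypotheses = [
    {"text": "you shut up !", "features": {"lm": 1e-4, "tm": 1e-3, "length": 4}},
    {"text": "you you quiet !", "features": {"lm": 1e-6, "tm": 5e-3, "length": 4}},
]

print(best_hypothesis(hypotheses)["text"])
```

Adding a feature means adding one entry to each hypothesis and one weight; the open question on the slide is how to choose those weights.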
![Page 104: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/104.jpg)
[Plot by Emil Ettelaie: BLEU score with WTM fixed at 1.0; marked values 20.64% BLEU and 17.96% BLEU.]
![Page 105: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/105.jpg)
Maximum BLEU Training
• Novel algorithm developed by [Och 03]
• Opened gates to “feature hacking”
– Word-based feature to smooth phrase pair counts (“Model1 Inverse”)
– Phrase-specific propensities to re-order
• Currently limited to ~25 features
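The idea of choosing weights by the quality of the resulting translations, rather than by data likelihood, can be illustrated with a brute-force stand-in. The sketch below is not the algorithm of [Och 03]: it tries each value on a grid for a single language model weight, re-ranks fixed n-best lists, and keeps the value with the best metric score. The n-best data, the exact-match metric used in place of BLEU, and all names are assumptions.

```python
def rerank(nbest, lm_weight):
    """Return the text of the best hypothesis under the given LM weight
    (translation model weight fixed at 1.0; features are log probabilities)."""
    return max(nbest, key=lambda h: lm_weight * h["lm"] + h["tm"])["text"]

def tune_weight(nbest_lists, references, metric, grid):
    """Grid search: keep the weight whose re-ranked output scores best."""
    best_w, best_score = None, float("-inf")
    for w in grid:
        outputs = [rerank(nbest, w) for nbest in nbest_lists]
        score = metric(outputs, references)
        if score > best_score:
            best_w, best_score = w, score
    return best_w, best_score

def exact_match(outputs, references):
    """Toy metric standing in for BLEU: fraction of outputs equal to the reference."""
    return sum(o == r for o, r in zip(outputs, references)) / len(references)

# One source sentence with a two-entry n-best list (made-up log scores).
nbest_lists = [[
    {"text": "you shut up !", "lm": -1.0, "tm": -5.0},
    {"text": "you you quiet !", "lm": -4.0, "tm": -1.0},
]]
references = ["you shut up !"]

print(tune_weight(nbest_lists, references, exact_match, grid=[0.5, 1.0, 1.5, 2.0]))
```

With a low language model weight the disfluent hypothesis wins on its translation model score; raising the weight past the crossover point flips the ranking, and the metric is what detects that.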
![Page 106: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/106.jpg)
Advances in Statistical MT, 2005
![Page 107: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/107.jpg)
Google’s Language Model
• Previously, largest language model was trained on 1b words of English
• 20b words of news → significant impact on news translation
• 200b words of web → helpful
![Page 108: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/108.jpg)
Maryland’s Hiero system [Chiang 05]
• Previously:
– ne mange pas → does not eat
• New phrase pairs with variables and reordering
– ne X pas → does not X
– le X1 du X2 → X2 's X1
• Nesting
– “does not X” itself becomes an X
• CKY decoder
John Cocke
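Phrase pairs with variables, and the nesting they allow, can be illustrated with a tiny recursive rule matcher. This is only a sketch: it takes the first rule that matches, has no probabilities, and is a top-down matcher rather than a CKY decoder. The two variable rules come from the slide; the single-word rules and all function names are assumptions added so that the example runs end to end.

```python
VARIABLES = {"X1", "X2"}

# (source pattern, target pattern). The first two rules are from the slide;
# the lexical rules below are hypothetical additions for this example.
RULES = [
    (["ne", "X1", "pas"], ["does", "not", "X1"]),
    (["le", "X1", "du", "X2"], ["X2", "'s", "X1"]),
    (["mange"], ["eat"]),
    (["chat"], ["cat"]),
    (["voisin"], ["neighbor"]),
]

def match(pattern, words):
    """Yield bindings {variable: sub-list of words} for which pattern covers words."""
    if not pattern:
        if not words:
            yield {}
        return
    head, rest = pattern[0], pattern[1:]
    if head in VARIABLES:
        # A variable covers one or more words, leaving enough for the rest.
        for k in range(1, len(words) - len(rest) + 1):
            for bindings in match(rest, words[k:]):
                yield {head: words[:k], **bindings}
    elif words and words[0] == head:
        yield from match(rest, words[1:])

def translate(words):
    """Translate with the first applicable rule; variables are translated recursively."""
    for source, target in RULES:
        for bindings in match(source, words):
            subs = {var: translate(span) for var, span in bindings.items()}
            if any(sub is None for sub in subs.values()):
                continue
            output = []
            for token in target:
                output.extend(subs[token] if token in subs else [token])
            return output
    return None  # no rule covers this span

print(" ".join(translate("ne mange pas".split())))
print(" ".join(translate("le chat du voisin".split())))
```

The recursion in `translate` is the nesting from the slide: whatever a variable covers is itself translated by rules, so the output of one rule can fill the X of another.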
![Page 109: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/109.jpg)
ISI’s Syntax-Based MT System
• First strong showing for an SMT system that knows what nouns and verbs are!
• Why syntax?
“Frequent high-tech exports are bright spots for foreign trade growth of Guangdong has made important contributions.”
– Need much more grammatical output
– Need accurate control over re-ordering
– Need accurate insertion of function words
![Page 110: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/110.jpg)
String Output
Chinese input: 枪手被警方击毙 .
English output: The gunman killed by police .
![Page 111: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/111.jpg)
Tree Output
Chinese input: 枪手被警方击毙 .
English output: The gunman killed by police .
Part-of-speech tags: DT NN VBD IN NN
Higher tree nodes: NPB, PP, NP-C, VP, S
![Page 112: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/112.jpg)
Tree Output
Chinese input: 枪手被警方击毙 .
English output: Gunman by police shot .
Part-of-speech tags: NN IN NN VBD
Higher tree nodes: NPB, PP, NP-C, VP, S
![Page 113: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/113.jpg)
Tree Output
Chinese input: 枪手被警方击毙 .
English output: The gunman was killed by police .
Part-of-speech tags: DT NN AUX VBN IN NN
Higher tree nodes: NPB, PP, NP-C, VP, S
![Page 114: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/114.jpg)
Sample Rules Learned from Data

VP(VB(said) SBAR(IN(that) x0:S)) →
   "说" "," x0          0.57
   "说" x0              0.09
   "他" "说" "," x0     0.02
   "指出" "," x0        0.02
   x0                   0.02

NP(x0:NP PP(IN(from) x1:NP)) →
   x1 x0                0.27
   "来自" x1 x0         0.15
   x1 "的" x0           0.06
   "从" x1 x0           0.06
   "来自" x1 "的" x0    0.06
![Page 115: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/115.jpg)
Sample Rules Learned from Data

S(x0:NP VP(x1:VB x2:NP)) → (Chinese/English)
   x0 x1 x2        0.82
   x0 x1 "," x2    0.02

S(x0:NP VP(x1:VB x2:NP)) → (Arabic/English)
   x0 x1 x2        0.54
   x1 x0 x2        0.44   (subject-verb inversion)
![Page 116: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/116.jpg)
Format is Expressive
• Multilevel Re-Ordering: S(x0:NP VP(x1:VB x2:NP2)) → x1, x0, x2
• Non-constituent Phrases: S(PRO(there) VP(VB(are) x0:NP)) → hay, x0
• Lexicalized Re-Ordering: NP(x0:NP PP(P(of) x1:NP)) → x1, , x0
• Phrasal Translation: VP(VBZ(is) VBG(singing)) → está, cantando
• Non-contiguous Phrases: VP(VB(put) x0:NP PRT(on)) → poner, x0
• Context-Sensitive Word Insertion: NPB(DT(the) x0:NNS) → x0
[Knight & Graehl, 2005]
![Page 117: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/117.jpg)
Story Gets More Interesting…
Applications: MT
Automata Theory: Tree Transducers (Rounds 70)
![Page 118: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/118.jpg)
Story Gets More Interesting…
Applications: MT
Automata Theory: Tree Transducers (Rounds 70)
Linguistic Theory: Transformational Grammar (Chomsky 57)
![Page 119: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/119.jpg)
Story Gets More Interesting…
Applications: MT (05), Compression (01), QA (03), Generation (00)
Automata Theory: Tree Transducers (Rounds 70)
Linguistic Theory: Transformational Grammar (Chomsky 57)
![Page 120: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/120.jpg)
Story Gets More Interesting…
Applications: MT (05), Compression (01), QA (03), Generation (00)
Algorithms: Efficient Transducer Algorithms, Generic Tree Toolkits
Automata Theory: Tree Transducers (Rounds 70)
Linguistic Theory: Transformational Grammar (Chomsky 57)
![Page 121: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/121.jpg)
Summary
• Making good progress
• Algorithms + Data + Evaluation + Computers
• Interdisciplinary work
– Natural language processing
– Machine learning
– Linguistics
– Automata theory
• Hope that more people will join!
![Page 122: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/122.jpg)
Thank you
![Page 123: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/123.jpg)
Syntax-Based vs Phrase-Based
[Plot: BLEU (y-axis, 15 to 35) over time (Mar 1 to May 1, 2005), Chinese/English, NIST 2002 Test Set, with the phrase-based system marked for comparison.]
![Page 124: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/124.jpg)
Future PhD Theses?
“Syntax-based Language Models for Improving Statistical MT”
“Discriminative Training of Millions of Features for MT”
“Semantic Representations Induced from Multilingual EU and UN Data”
“What Makes One Language Pair More Difficult to Translate Than Another”
“A State-of-the-Art MT System Based on Syntactic Transformations”
“New Training Methods for High-Quality Word Alignment”
+ many unpredictable ones…
![Page 125: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/125.jpg)
Summary
• Phrase-based models are state-of-the-art
– Word alignments
– Phrase pair extraction & probabilities
– N-gram language models
– Beam search decoding
– Feature functions & learning weights
• But the output is not English
– Fluency must be improved
– Better translation of person names, organizations, locations
– More automatic acquisition of parallel data, exploitation of monolingual data across a variety of domains/languages
– Need good accuracy across a variety of domains/languages
![Page 126: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/126.jpg)
Available Resources
• Bilingual corpora
– 100m+ words of Chinese/English and Arabic/English, LDC (www.ldc.upenn.edu)
– Lots of French/English, Spanish/French/English, LDC
– European Parliament (sentence-aligned), 11 languages, Philipp Koehn, ISI (www.isi.edu/~koehn/publications/europarl)
– 20m words (sentence-aligned) of English/French, Ulrich Germann, ISI (www.isi.edu/natural-language/download/hansard/)
• Sentence alignment
– Dan Melamed, NYU (www.cs.nyu.edu/~melamed/GMA/docs/README.htm)
– Xiaoyi Ma, LDC (Champollion)
• Word alignment
– GIZA, JHU Workshop ’99 (www.clsp.jhu.edu/ws99/projects/mt/)
– GIZA++, RWTH Aachen (www-i6.Informatik.RWTH-Aachen.de/web/Software/GIZA++.html)
– Manually word-aligned test corpus (500 French/English sentence pairs), RWTH Aachen
– Shared task, NAACL-HLT’03 workshop
• Decoding
– ISI ReWrite Model 4 decoder (www.isi.edu/licensed-sw/rewrite-decoder/)
– ISI Pharaoh phrase-based decoder
• Statistical MT Tutorial Workbook, ISI (www.isi.edu/~knight/)
• Annual common-data evaluation, NIST (www.nist.gov/speech/tests/mt/index.htm)
![Page 127: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/127.jpg)
Some Papers Referenced on Slides
• ACL
– [Och, Tillmann, & Ney, 1999]
– [Och & Ney, 2000]
– [Germann et al, 2001]
– [Yamada & Knight, 2001, 2002]
– [Papineni et al, 2002]
– [Alshawi et al, 1998]
– [Collins, 1997]
– [Koehn & Knight, 2003]
– [Al-Onaizan & Knight, 2002]
– [Och & Ney, 2002]
– [Och, 2003]
– [Koehn et al, 2003]
• EMNLP
– [Marcu & Wong, 2002]
– [Fox, 2002]
– [Munteanu & Marcu, 2002]
• AI Magazine
– [Knight, 1997]
• www.isi.edu/~knight
– [MT Tutorial Workbook]
• AMTA
– [Soricut et al, 2002]
– [Al-Onaizan & Knight, 1998]
• EACL
– [Cmejrek et al, 2003]
• Computational Linguistics
– [Brown et al, 1993]
– [Knight, 1999]
– [Wu, 1997]
• AAAI
– [Koehn & Knight, 2000]
• IWNLG
– [Habash, 2002]
• MT Summit
– [Charniak, Knight, Yamada, 2003]
• NAACL
– [Koehn, Marcu, Och, 2003]
– [Germann, 2003]
– [Graehl & Knight, 2004]
– [Galley, Hopkins, Knight, Marcu, 2004]
![Page 128: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/128.jpg)
Ready-to-Use Online Bilingual Data
[Chart: millions of words (English side), y-axis 0 to 140, by year from 1994 to 2004, for Chinese/English, Arabic/English, and French/English.]
(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn.)
![Page 129: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/129.jpg)
Ready-to-Use Online Bilingual Data
[Chart: millions of words (English side), y-axis 0 to 180, by year from 1994 to 2004, for Chinese/English, Arabic/English, and French/English.]
(Data stripped of formatting, in sentence-pair format, available from the Linguistic Data Consortium at UPenn.)
+ 1m-20m words for many language pairs
![Page 130: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/130.jpg)
Ready-to-Use Online Bilingual Data
[Chart: millions of words (English side), y-axis 0 to 180, by year from 1994 to 2004, for Chinese/English, Arabic/English, and French/English.]
One Billion? ???
![Page 131: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/131.jpg)
From No Data to Sentence Pairs
• Easy way: Linguistic Data Consortium (LDC)
• Really hard way: pay $$$
– Suppose one billion words of parallel data were sufficient
– At 20 cents/word, that’s $200 million
• Pretty hard way: Find it, and then earn it!
– De-formatting
– Remove strange characters
– Character code conversion
– Document alignment
– Sentence alignment
– Tokenization (also called Segmentation)
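The first three clean-up steps in the list above can be sketched in a few lines of Python. This is only an illustrative sketch, not the pipeline used for any of the corpora mentioned here: the function names, the regex-based tag stripping, and the `latin-1` default encoding are all assumptions made for the example. Document alignment, sentence alignment, and tokenization are not covered.

```python
import re
import unicodedata


def convert_character_code(raw: bytes, encoding: str) -> str:
    """Character code conversion: decode source bytes into Unicode text."""
    return raw.decode(encoding, errors="replace")


def deformat(text: str) -> str:
    """De-formatting: crude removal of markup tags (hypothetical, regex-based)."""
    return re.sub(r"<[^>]+>", " ", text)


def remove_strange_characters(text: str) -> str:
    """Drop control and other non-printing characters, keeping newlines and tabs."""
    return "".join(
        ch for ch in text
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )


def clean(raw: bytes, encoding: str = "latin-1") -> str:
    """Run the three steps in order and squeeze runs of spaces."""
    text = convert_character_code(raw, encoding)
    text = remove_strange_characters(deformat(text))
    return re.sub(r"[ \t]+", " ", text).strip()


if __name__ == "__main__":
    sample = "<p>El viejo est\xe1 feliz.\x00</p>".encode("latin-1")
    print(clean(sample))
```

Real web data usually needs an actual HTML parser and per-document encoding detection; the regex here is only enough to show where each step sits.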
![Page 132: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/132.jpg)
Sentence Alignment
The old man is happy. He has fished many times. His wife talks to him. The fish are jumping. The sharks await.
El viejo está feliz porque ha pescado muchas veces. Su mujer habla con él. Los tiburones esperan.
![Page 133: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/133.jpg)
Sentence Alignment
1. The old man is happy.
2. He has fished many times.
3. His wife talks to him.
4. The fish are jumping.
5. The sharks await.
1. El viejo está feliz porque ha pescado muchas veces.
2. Su mujer habla con él.
3. Los tiburones esperan.
![Page 134: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/134.jpg)
![Page 135: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/135.jpg)
Sentence Alignment
1. The old man is happy. He has fished many times.
2. His wife talks to him.
3. The sharks await.
1. El viejo está feliz porque ha pescado muchas veces.
2. Su mujer habla con él.
3. Los tiburones esperan.
Note that unaligned sentences are thrown out, and sentences are merged in n-to-m alignments (n, m > 0).
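The post-processing rule in the note above can be sketched as follows. The alignment itself is assumed to be given: `beads` is a hypothetical hand-written list of (English indices, Spanish indices) groups for the example on these slides, not the output of any particular aligner. The function merges the sentences inside each n-to-m group and drops any group where one side is empty.

```python
def merge_beads(src_sents, tgt_sents, beads):
    """Turn alignment groups into sentence pairs.

    Each bead is (list of source indices, list of target indices).
    Beads with an empty side (unaligned sentences) are thrown out;
    the sentences inside every other bead are merged with a space.
    """
    pairs = []
    for src_idx, tgt_idx in beads:
        if not src_idx or not tgt_idx:
            continue  # unaligned sentence: discard
        src = " ".join(src_sents[i] for i in src_idx)
        tgt = " ".join(tgt_sents[j] for j in tgt_idx)
        pairs.append((src, tgt))
    return pairs


english = [
    "The old man is happy.",
    "He has fished many times.",
    "His wife talks to him.",
    "The fish are jumping.",
    "The sharks await.",
]
spanish = [
    "El viejo está feliz porque ha pescado muchas veces.",
    "Su mujer habla con él.",
    "Los tiburones esperan.",
]
# Hypothetical alignment: 2-to-1, 1-to-1, 1-to-0 (dropped), 1-to-1.
beads = [([0, 1], [0]), ([2], [1]), ([3], []), ([4], [2])]

if __name__ == "__main__":
    for en, es in merge_beads(english, spanish, beads):
        print(en, "|||", es)
```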
![Page 136: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/136.jpg)
Tokenization (or Segmentation)
• English
– Input (some byte stream): "There," said Bob.
– Output (7 “tokens” or “words”): " There , " said Bob .
• Chinese
– Input (byte stream): 美国关岛国际机场及其办公室均接获一名自称沙地阿拉伯富商拉登等发出的电子邮件。
– Output: 美国 关岛国 际机 场 及其 办公 室均接获 一名 自称 沙地 阿拉 伯 富 商拉登 等发 出 的 电子邮件。
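A minimal sketch of the English case, assuming a simple regular-expression rule (runs of word characters, or single punctuation marks). The pattern and the function name `tokenize` are illustrative choices, not the tokenizer used for the systems described in these slides.

```python
import re

# A token is either a run of word characters or one non-space, non-word symbol.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text: str) -> list[str]:
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    tokens = tokenize('"There," said Bob.')
    print(len(tokens), " ".join(tokens))
```

This rule does not help with Chinese: the input has no spaces, so a run of characters comes back as a single token, and a dedicated word segmenter is needed instead.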
![Page 137: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/137.jpg)
Lower-Casing
• English
– Input (7 words): " There , " said Bob .
– Output (7 words): " there , " said bob .
Idea of tokenizing and lower-casing: The, the, “The, and “the all map to the.
Smaller vocabulary size. More robust counting and learning.
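The vocabulary effect can be seen on the four surface forms from this slide. The sketch below assumes the same simple regex tokenizer as an illustration; the helper name `tokenize_and_lowercase` is made up for the example.

```python
import re
from collections import Counter

# The four surface forms from the slide, as a whitespace split would see them.
raw_forms = ["The", "the", "“The", "“the"]


def tokenize_and_lowercase(text: str) -> list[str]:
    """Split off punctuation, then lower-case every token."""
    return [tok.lower() for tok in re.findall(r"\w+|[^\w\s]", text)]


# Before: every distinct surface form is its own vocabulary entry.
before = Counter(raw_forms)

# After: keep only the word tokens to see what the four forms collapse to.
after = Counter(
    tok
    for form in raw_forms
    for tok in tokenize_and_lowercase(form)
    if tok.isalpha()
)

if __name__ == "__main__":
    print(len(before), "types before:", dict(before))
    print(len(after), "type after:", dict(after))
```

Four entries with a count of one each become a single entry with a count of four, which is the "more robust counting" point in miniature.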
![Page 138: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/138.jpg)
Recent Progress in Statistical MT
• Why is that?
– Better algorithms that learn patterns from data
– More data
– Faster, cheaper computers with more RAM
– Community-wide test sets
– Novel automated evaluation methods
– Shared software tools
![Page 139: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/139.jpg)
Three Problems for Statistical MT
• Translation model
– Given a pair of strings <f,e>, assigns P(f | e) by formula
– <f,e> look like translations → high P(f | e)
– <f,e> don’t look like translations → low P(f | e)
• Language model
– Given an English string e, assigns P(e) by formula
– good English string → high P(e)
– random word sequence → low P(e)
• Decoding algorithm
– Given a language model, a translation model, and a new sentence f … find translation e maximizing P(e) · P(f | e)
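How the three pieces fit together can be shown with a toy example. Everything here is an assumption for illustration: the candidate list is fixed by hand, and the probabilities in `language_model` and `translation_model` are made-up numbers, not values from a trained system. A real decoder searches a very large space of candidates instead of scoring a short list.

```python
def decode(f, candidates, language_model, translation_model):
    """Return the candidate e that maximizes P(e) * P(f | e)."""
    def score(e):
        p_e = language_model.get(e, 0.0)                  # P(e)
        p_f_given_e = translation_model.get((f, e), 0.0)  # P(f | e)
        return p_e * p_f_given_e
    return max(candidates, key=score)


f = "Su mujer habla con él."
candidates = [
    "His wife talks to him.",   # good English, looks like a translation
    "Him to talks wife his.",   # right words, but a random word order
    "The sharks await.",        # good English, wrong content
]

# Made-up probabilities, for illustration only.
language_model = {
    "His wife talks to him.": 1e-2,
    "Him to talks wife his.": 1e-5,
    "The sharks await.": 1e-2,
}
translation_model = {
    (f, "His wife talks to him."): 2e-1,
    (f, "Him to talks wife his."): 2e-1,
    (f, "The sharks await."): 1e-4,
}

if __name__ == "__main__":
    print(decode(f, candidates, language_model, translation_model))
```

The language model rules out the scrambled candidate and the translation model rules out the fluent but unrelated one; only the product favors the correct translation.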
![Page 140: What’s New in Statistical Machine Translation](https://reader038.vdocuments.net/reader038/viewer/2022110210/56812de8550346895d934218/html5/thumbnails/140.jpg)
Web Language Models
French input → ?
– She has a lot of nerve. [20]
– It has a lot of nerve. [3]
Used by Google in 2005 to increase performance of their research MT system!
[Soricut, Knight, Marcu, 02]
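A toy sketch of the selection step, under the assumption that the bracketed numbers are counts of how often each candidate string was found on the web. The slide does not say how the counts were obtained, so `web_counts` is a hard-coded stand-in, and the relative-frequency score is an illustrative choice rather than the method of the cited paper.

```python
# Assumed interpretation: bracketed numbers on the slide are web hit counts.
web_counts = {
    "She has a lot of nerve.": 20,
    "It has a lot of nerve.": 3,
}


def pick_by_count(counts):
    """Return the candidate with the highest count and its relative frequency."""
    total = sum(counts.values())
    best = max(counts, key=counts.get)
    return best, counts[best] / total


if __name__ == "__main__":
    best, share = pick_by_count(web_counts)
    print(best, round(share, 2))
```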