Luísa Coheur - PT-STAR Project

Upload: i-conferencia-internacional-de-traducao-e-tecnologia

Post on 11-Jun-2015

DESCRIPTION

Presentation by Dr. Luísa Coheur at the I Conferência Internacional de Tradução e Tecnologia (1st International Conference on Translation and Technology), 13 and 14 May, Faculdade de Letras do Porto.

TRANSCRIPT

Page 1: Luísa Coheur - PT-STAR Project

Speech-to-Speech Machine Translation in the PT-STAR Project

Luísa Coheur (L2F/INESC-ID)

Page 2: Luísa Coheur - PT-STAR Project

INESC-ID and L2F

Page 3: Luísa Coheur - PT-STAR Project

INESC-ID

Brief history

– Established January 2000 (owned by IST and INESC)

– Private not-for-profit research institute of public interest

– Associated Laboratory since December 2004

Facilities

– Alameda

– Tagus Park

Page 4: Luísa Coheur - PT-STAR Project

The Spoken Language Systems Lab

History

– Work on speech processing for Portuguese since the 90s

– Creation: 2001

Mission

– Creating technology to bridge the gap between natural spoken language and the underlying semantic information.

Interdisciplinary background: signal processing, natural language processing, linguistics, etc.

Page 5: Luísa Coheur - PT-STAR Project

Core Technologies

Speech processing

– Text-to-speech synthesis

    – Automatic process for building new voices

    – Limited domain synthesis

    – Expressive speech synthesis

    – Audio-visual synthesis

– Automatic speech recognition

    – Robust speech recognition

    – Speaker adaptation

    – Large vocabulary continuous recognition

    – Rich transcription of spontaneous speech

– Speech coding

– Speech enhancement

– Speaker and language identification

Text processing

– Morphological analysis

– Syntactic analysis

– Semantic analysis

– Discourse analysis

– NL Generation

– Named entity extraction

– Information retrieval

– Summarization

– Question answering

– Machine translation

Spoken language processing

– Speech understanding

– Spoken dialog systems

– Speech-to-Speech machine translation

– Summarization of spoken documents

– Question answering on spoken documents

– Classification of multimedia documents

– Language tutoring

– etc.

Page 6: Luísa Coheur - PT-STAR Project

Statistical Machine Translation

Page 7: Luísa Coheur - PT-STAR Project

Statistical Machine Translation

Automatic translators aim to maximize two properties:

– Faithfulness (fidelity): how close the meaning of the translation is to the meaning of the original

– Fluency (naturalness): how natural the translation is, considering only its fluency in the target language

The statistical approach was developed by researchers from IBM

$$\hat{T} = \operatorname*{argmax}_{T} \; \mathrm{fluency}(T) \cdot \mathrm{faithfulness}(T, S)$$

Page 8: Luísa Coheur - PT-STAR Project

Statistical Machine Translation

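$$\hat{T} = \operatorname*{argmax}_{T} \; \mathrm{fluency}(T) \cdot \mathrm{faithfulness}(T, S)$$

fluency(T) is given by the Language Model; faithfulness(T, S) is given by the Translation Model.

Candidate translations of "Estou cansado":

Translation       Fluency   Fidelity
I’m exhausted     5         3
Tired me          2         5
I love cookies    5         0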
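As a worked illustration of the argmax above, the following minimal sketch multiplies the fluency and fidelity scores listed on this slide for each candidate and keeps the best one. The variable and function names are illustrative assumptions; real systems use model probabilities rather than hand-assigned scores.

```python
# Scores copied from the slide: (fluency, fidelity) for each candidate
# translation of "Estou cansado". The names below are illustrative only.
candidates = {
    "I'm exhausted": (5, 3),
    "Tired me": (2, 5),
    "I love cookies": (5, 0),
}


def combined_score(scores):
    """Product of fluency and faithfulness, as in the slide's formula."""
    fluency, faithfulness = scores
    return fluency * faithfulness


# argmax over the candidate translations
best = max(candidates, key=lambda t: combined_score(candidates[t]))
print(best)
```

A fluent but unfaithful candidate ("I love cookies") scores zero, and a faithful but disfluent one ("Tired me") loses to the candidate that balances both.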

Page 9: Luísa Coheur - PT-STAR Project

Language model: fluency

Which sentence is the most fluent? This becomes: “which one is the most probable?”

We can use language models built from N-grams, for example

Advantage: this is monolingual knowledge!
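To make "the most fluent is the most probable" concrete, here is a minimal sketch of a bigram language model with add-one smoothing, trained on a tiny made-up monolingual corpus. The corpus, the function names, and the choice of smoothing are assumptions for illustration, not the models used in the project.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Count context words and bigrams over a monolingual corpus."""
    contexts, bigrams, vocab = Counter(), Counter(), {"<s>", "</s>"}
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        vocab.update(tokens)
        contexts.update(tokens[:-1])            # every token that starts a bigram
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, len(vocab)


def log_prob(sentence, model):
    """Log-probability of a sentence under the add-one smoothed bigram model."""
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens, tokens[1:])
    )


# Toy monolingual (English-only) training data, invented for this example.
model = train_bigram_lm(["i am tired", "i am exhausted", "he is tired"])
fluent = log_prob("i am tired", model)
disfluent = log_prob("tired am i", model)
print(fluent > disfluent)
```

Only target-language text is needed to train this model, which is the advantage the slide points out.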

Page 10: Luísa Coheur - PT-STAR Project

Translation model: fidelity

Which sentence is the most faithful? Here we have to observe how sentences in the source language are translated into the target language.

Problem: this requires parallel corpora

– European Parliament

– TED Talks

Page 11: Luísa Coheur - PT-STAR Project

Centauri/Arcturan [Knight 97]

1a. ok-voon ororok sprok . 1b. at-voon bichat dat .

2a. ok-drubel ok-voon anok plok sprok . 2b. at-drubel at-voon pippat rrat dat .

3a. erok sprok izok hihok ghirok . 3b. totat dat arrat vat hilat .

4a. ok-voon anok drok brok jok . 4b. at-voon krat pippat sat lat .

5a. wiwok farok izok stok . 5b. totat jjat quat cat .

6a. lalok sprok izok jok stok . 6b. wat dat krat quat cat .

7a. lalok farok ororok lalok sprok izok enemok . 7b. wat jjat bichat wat dat vat eneat .

8a. lalok brok anok plok nok . 8b. iat lat pippat rrat nnat .

9a. wiwok nok izok kantok ok-yurp . 9b. totat nnat quat oloat at-yurp .

10a. lalok mok nok yorok ghirok clok . 10b. wat nnat gat mat bat hilat .

11a. lalok nok crrrok hihok yorok zanzanok . 11b. wat nnat arrat mat zanzanat .

12a. lalok rarok nok izok hihok mok . 12b. wat nnat forat arrat vat gat .

Translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
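One way to attack the puzzle above by hand is to look at every sentence pair that contains a Centauri word and keep only the Arcturan words that appear in all of them. The sketch below does exactly that. It is a simple co-occurrence heuristic written for this example (the function name is an assumption), not a full alignment model; the sentence-final full stops are dropped.

```python
# The twelve Centauri/Arcturan sentence pairs from the slide, without the full stops.
corpus = [
    ("ok-voon ororok sprok", "at-voon bichat dat"),
    ("ok-drubel ok-voon anok plok sprok", "at-drubel at-voon pippat rrat dat"),
    ("erok sprok izok hihok ghirok", "totat dat arrat vat hilat"),
    ("ok-voon anok drok brok jok", "at-voon krat pippat sat lat"),
    ("wiwok farok izok stok", "totat jjat quat cat"),
    ("lalok sprok izok jok stok", "wat dat krat quat cat"),
    ("lalok farok ororok lalok sprok izok enemok", "wat jjat bichat wat dat vat eneat"),
    ("lalok brok anok plok nok", "iat lat pippat rrat nnat"),
    ("wiwok nok izok kantok ok-yurp", "totat nnat quat oloat at-yurp"),
    ("lalok mok nok yorok ghirok clok", "wat nnat gat mat bat hilat"),
    ("lalok nok crrrok hihok yorok zanzanok", "wat nnat arrat mat zanzanat"),
    ("lalok rarok nok izok hihok mok", "wat nnat forat arrat vat gat"),
]


def shared_translations(word, pairs):
    """Arcturan words present in every pair whose Centauri side contains `word`."""
    candidate_sets = [
        set(target.split()) for source, target in pairs if word in source.split()
    ]
    if not candidate_sets:
        return set()
    return set.intersection(*candidate_sets)


print(shared_translations("farok", corpus))   # a single candidate remains
print(shared_translations("ororok", corpus))  # still ambiguous
```

For "farok" only "jjat" survives, while "ororok" stays ambiguous between "bichat" and "dat" because it always occurs together with "sprok"; resolving such cases is what statistical alignment models are for.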

Page 14: Luísa Coheur - PT-STAR Project

Spanish/English corpus

1a. Garcia and associates . 1b. Garcia y asociados .

2a. Carlos Garcia has three associates . 2b. Carlos Garcia tiene tres asociados .

3a. his associates are not strong . 3b. sus asociados no son fuertes .

4a. Garcia has a company also . 4b. Garcia tambien tiene una empresa .

5a. its clients are angry . 5b. sus clientes estan enfadados .

6a. the associates are also angry . 6b. los asociados tambien estan enfadados .

7a. the clients and the associates are enemies . 7b. los clientes y los asociados son enemigos .

8a. the company has three groups . 8b. la empresa tiene tres grupos .

9a. its groups are in Europe . 9b. sus grupos estan en Europa .

10a. the modern groups sell strong pharmaceuticals . 10b. los grupos modernos venden medicinas fuertes .

11a. the groups do not sell zenzanine . 11b. los grupos no venden zanzanina .

12a. the small groups are not modern . 12b. los grupos pequenos no son modernos .
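A statistical system learns word correspondences from a corpus like this one automatically. As a hedged illustration, the sketch below runs a few iterations of expectation-maximization in the style of IBM Model 1 (lexical translation probabilities only, no NULL word) over the twelve pairs. The function names and iteration count are assumptions for this example, and "clients" in 7b is written as "clientes"; it is not the training pipeline of any particular toolkit.

```python
from collections import defaultdict

# (English, Spanish) pairs from the slide; 7b uses the corrected "clientes".
corpus = [
    ("Garcia and associates .", "Garcia y asociados ."),
    ("Carlos Garcia has three associates .", "Carlos Garcia tiene tres asociados ."),
    ("his associates are not strong .", "sus asociados no son fuertes ."),
    ("Garcia has a company also .", "Garcia tambien tiene una empresa ."),
    ("its clients are angry .", "sus clientes estan enfadados ."),
    ("the associates are also angry .", "los asociados tambien estan enfadados ."),
    ("the clients and the associates are enemies .",
     "los clientes y los asociados son enemigos ."),
    ("the company has three groups .", "la empresa tiene tres grupos ."),
    ("its groups are in Europe .", "sus grupos estan en Europa ."),
    ("the modern groups sell strong pharmaceuticals .",
     "los grupos modernos venden medicinas fuertes ."),
    ("the groups do not sell zenzanine .", "los grupos no venden zanzanina ."),
    ("the small groups are not modern .", "los grupos pequenos no son modernos ."),
]


def train_model1(pairs, iterations=20):
    """Estimate t[(spanish, english)] = P(spanish word | english word) with EM."""
    pairs = [(e.split(), f.split()) for e, f in pairs]
    f_vocab = {w for _, f_sent in pairs for w in f_sent}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialization
    for _ in range(iterations):
        count, total = defaultdict(float), defaultdict(float)
        for e_sent, f_sent in pairs:
            for f in f_sent:
                norm = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    fraction = t[(f, e)] / norm  # expected alignment count
                    count[(f, e)] += fraction
                    total[e] += fraction
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


def best_translation(english_word, t):
    """Spanish word with the highest probability given the English word."""
    options = {f: p for (f, e), p in t.items() if e == english_word}
    return max(options, key=options.get)


table = train_model1(corpus)
print(best_translation("associates", table), best_translation("groups", table))
```

The only supervision is the sentence pairing itself, which is why the previous slide lists parallel corpora as the key requirement.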

Page 15: Luísa Coheur - PT-STAR Project

Speech to Speech Machine Translation

Page 16: Luísa Coheur - PT-STAR Project

Speech to speech machine translation

Speech-to-Speech Machine Translation (S2SMT) technologies aim at enabling natural language communication between people who do not share the same language.

Page 17: Luísa Coheur - PT-STAR Project

Speech to speech machine translation

S2SMT can be seen as a cascade of three major components:

– Automatic Speech Recognition

– Machine Translation

– Text-to-Speech Synthesis
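The cascade can be pictured as three functions applied in sequence. The sketch below is only a structural illustration: the class name, the function names, and the toy stand-ins for the recognizer, translator, and synthesizer are all hypothetical and do not correspond to the project's actual components.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpeechToSpeechPipeline:
    """Cascade: audio -> source text -> target text -> audio."""
    recognize: Callable[[bytes], str]    # Automatic Speech Recognition
    translate: Callable[[str], str]      # Machine Translation
    synthesize: Callable[[str], bytes]   # Text-to-Speech Synthesis

    def run(self, audio: bytes) -> bytes:
        source_text = self.recognize(audio)
        target_text = self.translate(source_text)
        return self.synthesize(target_text)


# Toy stand-ins (assumptions): "audio" is just UTF-8 encoded text here.
def toy_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def toy_mt(text: str) -> str:
    return {"estou cansado": "i am tired"}.get(text, text)


def toy_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = SpeechToSpeechPipeline(toy_asr, toy_mt, toy_tts)
output_audio = pipeline.run("estou cansado".encode("utf-8"))
print(output_audio)
```

In such a cascade each stage only sees the single output of the previous one, so recognition errors propagate into translation and translation errors into synthesis.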

Page 18: Luísa Coheur - PT-STAR Project

Speech to speech machine translation

Page 19: Luísa Coheur - PT-STAR Project

The PT-STAR project

Page 20: Luísa Coheur - PT-STAR Project

The PT-STAR project

Team:

– L2F/INESC-ID

– LTI/CMU

– UBI

– FLUL

Page 21: Luísa Coheur - PT-STAR Project

The PT-STAR project

One of the main problems of S2SMT is the still weak integration between the three components.

The main goal of PT-STAR (Speech Translation Advanced Research to and from Portuguese) is to improve speech translation systems for Portuguese by strengthening this integration.

Page 22: Luísa Coheur - PT-STAR Project

Task 1: ASR/MT

TASK 1

Page 23: Luísa Coheur - PT-STAR Project

Task 1: ASR/MT

Challenge

– Improve the insertion of full stops and commas

    – Segmentation is a hard problem in automatic translation

– Improve capitalization

    – Important for disambiguation (e.g., a proper name translated word by word as "Pedro Steps Rabbit")

– Detect interrogatives

    – Important if you target synthesis

– Port everything to English

    – Try to make everything as language-independent as possible
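To illustrate the capitalization step, the following minimal sketch restores casing in lowercase recognizer output by choosing, for each word, the casing most often seen in cased training text. This frequency baseline, the function names, and the two training sentences are assumptions for illustration; the slides do not describe the method actually used in the project.

```python
from collections import Counter, defaultdict


def train_truecaser(cased_sentences):
    """Map each lowercased word to its most frequent cased form."""
    forms = defaultdict(Counter)
    for sentence in cased_sentences:
        for position, token in enumerate(sentence.split()):
            if position == 0:
                continue  # sentence-initial capitals say nothing about the word itself
            forms[token.lower()][token] += 1
    return {lower: counts.most_common(1)[0][0] for lower, counts in forms.items()}


def truecase(lowercased_text, model):
    """Apply the learned casing; unseen words are left unchanged."""
    return " ".join(model.get(token, token) for token in lowercased_text.split())


# Invented cased training sentences.
model = train_truecaser([
    "O ministro das Finanças visitou Portugal",
    "A crise chegou a Portugal e a Espanha",
])
restored = truecase("o ministro das finanças de portugal", model)
print(restored)
```

Restored capitals help the translation step treat names such as "Portugal" as proper nouns instead of translating them as common words.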

Page 24: Luísa Coheur - PT-STAR Project

Rich transcriptions

boa tarde o governo considera que as medidas de austeridade aprovadas e em vigor só para já adequadas às necessidades financeiras de portugal o ministro das finanças mostra-se confiante com as metas traçadas no programa de estabilidade e crescimento apesar de não fechar as portas à hipótese de medidas adicionais de controlo orçamental em dois mil e doze é desta forma que teixeira dos santos responde a pressão dos países da moeda única querem que portugal e espanha avança com mais medidas de austeridade dentro de ano e meio ainda em mês passou diz que o governo decidiu apertar o cinto aos portugueses e já europa vem pedir mais para depois de dois mil e onze o ministro das finanças não fecha a porta, mas defende cada ano a seu tempo acho que estamos de em condições de alimentar digamos confessa estar confiantes de que o objectivo para dois mil e dez vai ser conseguido com as medidas adicionais que foram entretanto já decididas

Page 25: Luísa Coheur - PT-STAR Project

Rich transcriptions

[anchor 150] Boa tarde o governo considera que as medidas de austeridade aprovadas e em vigor. Só para já adequadas às necessidades financeiras de Portugal. O ministro das Finanças mostra-se confiante com as metas traçadas no programa de Estabilidade e Crescimento. Apesar de não fechar as portas à hipótese de medidas adicionais de controlo orçamental, em dois mil e doze. É desta forma que Teixeira dos Santos responde a pressão dos países da moeda única, querem que Portugal e Espanha avança com mais medidas de austeridade, dentro de ano e meio.

[spk 2000] Ainda em mês passou diz que o Governo decidiu apertar o cinto aos portugueses e já Europa vem pedir mais para depois de dois mil e onze. O ministro das Finanças não fecha a porta, mas defende cada ano, a seu tempo.

[spk 1000] Acho que estamos de em condições de alimentar, digamos confessa estar confiantes, de que o objectivo para dois mil e dez, vai ser conseguido com as medidas adicionais que foram entretanto já decididas.

Tópicos: Política; Economia; Nacional;

Page 27: Luísa Coheur - PT-STAR Project

Translation

[anchor 150] Good afternoon, the government believes that the austerity measures approved and in force. Only for already suited to financial needs of Portugal. The finance minister seems confident with the targets set out in the stability and growth programme. Despite not close the door to the possibility of additional measures of budgetery control in two thousand, twelve. This is the way that Teixeira dos Santos responds the pressure of the countries of the single currency, they want Spain and Portugal progresses with more austerity measures, within a year and a half.

[spk 2000] Still in month passed says that the government has decided to tighten their belts the Portuguese and already Europe comes to ask for more for after two thousand and eleven. The finance minister is not closes the door, but defends each year, the his time.

[spk 1000] I think that we are in conditions of food, say admits be trusted, that the objective for two thousand, ten, will be achieved with the additional measures that were in the meantime, has already decided.

Topic: Politics; Economy; National;

Page 28: Luísa Coheur - PT-STAR Project

Task 1: ASR/MT

Challenge

– Take advantage of in-domain texts to build domain-adapted language models for ASR and MT

– Domain adaptation is one of the major problems in SMT (if a word is not seen during training, the system will not be able to translate it)
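One common way to adapt a language model to a domain is to interpolate a general model with a model trained on in-domain text. The sketch below shows this for unigram models only. The interpolation weight, the two toy corpora, and the function names are assumptions for illustration; the slides do not say which adaptation technique the project used.

```python
from collections import Counter


def unigram_model(text):
    """Relative frequencies of the words in a text."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def interpolate(in_domain, general, weight):
    """P(w) = weight * P_in_domain(w) + (1 - weight) * P_general(w)."""
    vocabulary = set(in_domain) | set(general)
    return {
        word: weight * in_domain.get(word, 0.0) + (1 - weight) * general.get(word, 0.0)
        for word in vocabulary
    }


# Invented toy corpora: general news text versus in-domain (flight) text.
general = unigram_model("the government approved the budget")
in_domain = unigram_model("the flight departs from the gate")
adapted = interpolate(in_domain, general, weight=0.5)
print(adapted["flight"], general.get("flight", 0.0))
```

After interpolation, in-domain words that the general model has never seen receive non-zero probability, while general vocabulary is still covered.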

Page 29: Luísa Coheur - PT-STAR Project

Task 1: ASR/MT

Challenge

– Take advantage of imperfect transcriptions (in which annotations do not include laughter, applause, filled pauses, repetitions, or other disfluencies, and sometimes contain errors) to build acoustic models for ASR

Example:

… In my opinion the many options to solve the...

… In my opinion ++BREATH++ the ++UH++ many options to solve the...
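The relation between the two example lines can be shown with a small sketch: removing the event markers from the detailed line yields the imperfect transcription. The `++NAME++` pattern is taken from the example above; the function names are assumptions, and this is not the project's alignment procedure.

```python
import re

# Matches event markers written as ++NAME++, as in the example above.
EVENT = re.compile(r"\+\+[A-Z_]+\+\+")


def strip_events(transcription):
    """Remove event markers and collapse the leftover whitespace."""
    return " ".join(EVENT.sub(" ", transcription).split())


def list_events(transcription):
    """Return the event markers in order of appearance."""
    return EVENT.findall(transcription)


detailed = "In my opinion ++BREATH++ the ++UH++ many options to solve the"
print(strip_events(detailed))
print(list_events(detailed))
```

The acoustic-modeling challenge is the reverse direction: the audio contains the breaths and filled pauses, but the available transcription does not mark them.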

Page 30: Luísa Coheur - PT-STAR Project

Task 2: MT/TTS

TASK 2

Page 31: Luísa Coheur - PT-STAR Project

Task 2: MT/TTS

Challenges

– Build statistical parametric synthetic voices for Portuguese

– How to deal with translation errors when you target synthesis?

    – Techniques for optimal synchronization using the MT N-best list

    – Grammar-based phrasing strategies to improve synthesis of disfluent MT output

– Voice morphing

    – Cross-lingual voice morphing to match the source speaker

Page 32: Luísa Coheur - PT-STAR Project

Task 3: MT

TASK 3

Page 33: Luísa Coheur - PT-STAR Project

Task 3: MT

Challenges

– Alignments

    – New algorithms to generate the well-known lexicalized reordering model using weighted alignment matrices

– Geppetto: a toolkit for word alignments and phrase extraction

    – Users can improve the phrase extraction algorithm because its key control points can be manipulated

    – Available on Google Code
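For readers unfamiliar with phrase extraction, the sketch below shows the standard consistency criterion used in phrase-based SMT: a source span and a target span form a phrase pair if no word inside either span is aligned to a word outside the other. It is a simplified generic version (no extension over unaligned words) and is not Geppetto's API; the function name, the example sentence pair, and the hand-written alignment are assumptions.

```python
def extract_phrases(source, target, alignment, max_len=4):
    """Return phrase pairs consistent with a word alignment.

    `alignment` is a set of (source_index, target_index) links.
    """
    pairs = set()
    for s_start in range(len(source)):
        for s_end in range(s_start, min(s_start + max_len, len(source))):
            linked = [t for s, t in alignment if s_start <= s <= s_end]
            if not linked:
                continue  # skip spans with no aligned word
            t_start, t_end = min(linked), max(linked)
            if t_end - t_start >= max_len:
                continue  # target phrase would be too long
            # Consistency: every link into the target span comes from the source span.
            consistent = all(
                s_start <= s <= s_end for s, t in alignment if t_start <= t <= t_end
            )
            if consistent:
                pairs.add((
                    " ".join(source[s_start:s_end + 1]),
                    " ".join(target[t_start:t_end + 1]),
                ))
    return pairs


source = "Garcia has a company also".split()
target = "Garcia tambien tiene una empresa".split()
# Hand-written alignment for this example: (English index, Spanish index).
alignment = {(0, 0), (1, 2), (2, 3), (3, 4), (4, 1)}
phrases = extract_phrases(source, target, alignment)
for pair in sorted(phrases):
    print(pair)
```

The span "Garcia has" yields no phrase pair here, because its target span would include "tambien", which is aligned to "also" outside the source span.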

Page 34: Luísa Coheur - PT-STAR Project

Task 3: MT

Challenges

– Error analysis

    – Taxonomy and detailed analysis of Moses vs. Google

– From BP to EP (Brazilian Portuguese to European Portuguese)

    – Built the BP2EP translator

– Corpora:

    – TAP-UP corpus

        – Flight magazine with parallel corpora PT/EN

    – 6000 questions translated into PT

        – Original corpus in EN, from TREC

        – Translation model adapted with the questions corpus

        – Important BLEU improvements (EN/PT: 9, PT/EN: 8)

Page 35: Luísa Coheur - PT-STAR Project

Task 3: MT

Challenges

– Participated in the IWSLT 2010 evaluation campaign

    – CN-EN, EN-CN

    – FR-EN

Page 36: Luísa Coheur - PT-STAR Project

Task 4: Proof of concept

TASK 4

Page 37: Luísa Coheur - PT-STAR Project

Proof-of-concept

Prototype development (pt, en, cn)

– Broadcast news (S2T)

– TED Talks (S2S)

– Real-time demo (S2S)