Generalising lexical translation strategies for MT using comparable corpora

Bogdan Babych, Serge Sharoff, Anthony Hartley
Centre for Translation Studies, University of Leeds, Leeds, UK
{b.babych,s.sharoff,a.hartley}@leeds.ac.uk

29 May 2008, LREC 2008

Overview
• Indirect translation equivalents in MT: current limitations
• Increasing the range of translation equivalents used by MT
  – Equivalent-oriented vs. strategy-oriented approaches
  – Methodology for discovering translation strategies using comparable corpora
  – Applications for terminology research
• Conclusions and future work

Indirect equivalents in MT
Data-driven MT (statistical & example-based)
  – Reusing equivalents learnt from parallel corpora
  – Problem: Lack of generalisation
    • Equivalents expressed as word patterns
    • Do not generalise beyond lemmas
  – Cannot generate indirect equivalents for ‘unseen’ expressions
    • Difficult to maintain many specific patterns
    • Fundamental limits on the range of translation solutions generated by MT

Indirect equivalents: Change of perspective
Problems for MT: non-fluent translations & mistranslations
• Ru: Из кризисов такого рода как парламентский можно выходить за счет демократических методов.
  – lit.: 'From crises of such type as parliamentary it is possible to go out by means of democratic methods'
  – RBMT: Such as parliamentary it is possible to leave crises due to democratic methods.
  – SMT: This kind of crisis as a parliamentary, can go through democratic methods.
• HT: We can escape crises like these through democratic means

From equivalents to lexical translation strategies
• Indirect equivalents = ‘creative’ solutions to non-trivial problems
• Parallel corpora: too small, sparse and specialised
  – The same problem often solved idiosyncratically: no clear statistical model
  – Set of ‘indirect’ translation problems is open
• Our solution: higher order model
  – Generalising classes of equivalents as strategies
    • By similarity of usage in comparable corpora
    • Equivalents to unseen expressions are generated from discovered strategies

Current methodology
• One fixed strategy: rephrasing words using similarity of ‘collocation vectors’ ~ near-synonyms
• Generator of equivalents from ASSIST project
  – выходить из кризиса (go out of crisis) ~ {to approach, to face, to get over} crisis
    • Выходить(go out).sim задходить(come).dict + collocations of (crisis) to approach
• No other strategies yet implemented
  – Transposition (change of syntactic perspective)
  – Modulation (change of lexical perspective) …
  – Further goal: to find ~ escape from crisis … via
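The re-phrasing idea on this slide, finding near-synonyms by comparing ‘collocation vectors’, can be illustrated with a minimal sketch. It assumes cosine similarity over raw collocate counts; the slides do not say which similarity measure or weighting the ASSIST generator uses, and the function names and the toy (verb, object) pairs below are invented for illustration.

```python
from collections import Counter
from math import sqrt


def collocation_vector(word, pairs):
    """Count the collocates seen with `word` in a list of (word, collocate) pairs."""
    return Counter(collocate for w, collocate in pairs if w == word)


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def near_synonyms(word, vocabulary, pairs, top_n=3):
    """Rank other words by similarity of their collocation vectors to that of `word`."""
    target = collocation_vector(word, pairs)
    scored = [
        (cosine(target, collocation_vector(w, pairs)), w)
        for w in vocabulary
        if w != word
    ]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [w for score, w in scored[:top_n] if score > 0]


# Toy, invented (verb, object) pairs standing in for corpus collocations.
PAIRS = [
    ("overcome", "crisis"), ("overcome", "difficulty"), ("overcome", "problem"),
    ("face", "crisis"), ("face", "difficulty"), ("face", "problem"),
    ("approach", "crisis"), ("approach", "problem"),
    ("eat", "apple"), ("eat", "bread"),
]

print(near_synonyms("overcome", ["overcome", "face", "approach", "eat"], PAIRS))
```

Verbs that share objects with the query verb rank highest, and verbs with no shared collocates are dropped. A real system would build the vectors from large monolingual corpora and probably weight the counts by an association measure.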

Strategy evaluation
• Coverage of problems vs. coverage of solutions
• Several strategies cover the same problem (variation)
  – Ru: Механизм принятия решений будет публичным. (lit.: 'The mechanism of making decisions will be public')
  – публичный механизм (‘public mechanism’)
    • Public process / … a greater public interaction (Current re-phrasing strategy)
    • The answer will come from the people. (Change-of-perspective strategy)
• It is harder to match solutions: diversity of strategies

Coverage of translation problems by re-phrasing strategy
• Characterising linguistic productivity of the strategy
• Experiment: 12 translators suggest indirect solutions to the same set of problems
  – 36 translation problems (25 Ru & 11 En)
  – 210 different human solutions (5.83 solutions / problem)
• Task of the system: to generate a possible solution for each problem

Coverage of translation problems by re-phrasing strategy

• For 75% of problems: at least 1 match by re-phrasing strategy
• Average coverage of a set of human solutions: 34.7%

[Chart: problems grouped by number of matches (no matches, one, two, three, four, five); axis from 0 to 10]
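The two figures reported here (share of problems with at least one match, and average coverage of the human solution sets) could be computed along the following lines. This is a sketch under assumptions: it treats a match as exact string equality and averages coverage per problem, neither of which is confirmed by the slides, and the function name and toy data are hypothetical.

```python
def coverage(human_solutions, system_solutions):
    """Return (share of problems with >= 1 match, mean per-problem coverage).

    Both arguments map a problem id to a set of solution strings.
    Assumption: a match is exact string equality.
    """
    if not human_solutions:
        raise ValueError("no problems to evaluate")
    per_problem = {}
    for problem, human in human_solutions.items():
        generated = system_solutions.get(problem, set())
        per_problem[problem] = len(human & generated) / len(human) if human else 0.0
    matched = sum(1 for share in per_problem.values() if share > 0)
    n = len(per_problem)
    return matched / n, sum(per_problem.values()) / n


# Toy, invented data: four problems with sets of human solutions.
human = {"p1": {"a", "b", "c", "d"}, "p2": {"e", "f"}, "p3": {"g", "h"}, "p4": {"i"}}
system = {"p1": {"a", "x"}, "p2": {"e", "f"}, "p3": {"z"}}

share_matched, mean_coverage = coverage(human, system)
print(share_matched, mean_coverage)  # 0.5 0.3125
```

On the toy data, two of four problems have a match, and the per-problem coverages (0.25, 1.0, 0.0, 0.0) average to 0.3125.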

Coverage of translation solutions by re-phrasing strategy
• Comparing coverage of indirect equivalents by:
  – (1) bilingual dictionary solutions (Oxford Russian)
  – (2) solutions extracted from word alignment in parallel corpus:
    • Training Set: Ru-En news, 700k wd.
    • Test Set: Euronews Ru-En interviews, 100k wd.
  – (3) strategy-based (i.e. re-phrasing) solutions:
    • Collocation vectors from monolingual corpora (BNC, RNC) ~ 100M
    • Filtered by co-occurrence in news corpora ~200M

Coverage of solutions by re-phrasing strategy

• Task of the system: to generate an exact solution for each problem

                          Training Set    Test Set
                          (Ru-En News)    (Euronews)
Bilingual dictionary      6.70%           4.60%
Giza++ word alignment     13.90%          3.40%
Rephrasing strategy       21.90%          19.50%

Coverage of solutions by re-phrasing strategy

Conclusions
• Learning individual equivalents is not efficient
  – Low coverage of unseen problems
  – Lower generalisation of idiosyncratic alignments
• Re-phrasing strategy: productive but not sufficient

                          Training Set    Test Set
                          (Ru-En News)    (Euronews)
Bilingual dictionary      6.70%           4.60%
Giza++ word alignment     13.90%          3.40%
Rephrasing strategy       21.90%          19.50%
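One way to read the generalisation claim off the table is to compare how much of each method's training-set coverage survives on the test set. The ratio below is not a metric from the slides; it is a simple derived figure, and the name `retention` is introduced here only for illustration.

```python
# Coverage figures copied from the table: (training set %, test set %).
COVERAGE = {
    "Bilingual dictionary": (6.70, 4.60),
    "Giza++ word alignment": (13.90, 3.40),
    "Rephrasing strategy": (21.90, 19.50),
}


def retention(training, test):
    """Share of training-set coverage that is kept on the test set."""
    return test / training


for method, (training, test) in COVERAGE.items():
    print(f"{method}: {retention(training, test):.2f}")
# Bilingual dictionary: 0.69
# Giza++ word alignment: 0.24
# Rephrasing strategy: 0.89
```

Word alignment keeps about a quarter of its coverage on unseen data, while the re-phrasing strategy keeps close to nine tenths, which is consistent with the conclusion that individually learnt equivalents generalise poorly.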

On-going project: beyond re-phrasing strategy

Modelling transposition and modulation strategies
  – Learning strategies from parallel data
  – Aligning ‘indirect’ solutions (discontinuous MWEs)
    • выходить из кризиса (go out of crisis) <~> escape crisis
  – Generalising equivalents with similarity classes
  – Covering unseen expressions:
    • {Выходить / выводить…} из {конфликта / застоя / депрессии…} (go out / lead out from crisis, stagnation, depression) <~> to escape conflict/ controversy, to flee difficulty, to survive disaster/ tragedy …
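The step from one aligned pair to unseen expressions can be pictured as expanding both sides of the pair into similarity classes. The sketch below hard-codes small classes taken from the examples on this slide; in the approach described they would be discovered from comparable corpora. The data layout, the `generate` function and the noun glosses are assumptions made for illustration, and the sketch skips the collocation filtering a real generator would need.

```python
# Hypothetical similarity classes, seeded from the slide's examples.
SOURCE_VERBS = {"выходить", "выводить"}
SOURCE_NOUNS = {  # genitive form -> English gloss (assumed dictionary lookup)
    "кризиса": "crisis",
    "конфликта": "conflict",
    "застоя": "stagnation",
    "депрессии": "depression",
}
TARGET_VERBS = ["escape", "flee", "survive"]


def generate(expression):
    """Propose indirect equivalents for a '<verb> из <noun>' expression.

    Returns an empty list when the expression falls outside the classes.
    No filtering by target-language collocation is applied here.
    """
    parts = expression.lower().split()
    if len(parts) != 3 or parts[1] != "из":
        return []
    verb, _, noun = parts
    if verb not in SOURCE_VERBS or noun not in SOURCE_NOUNS:
        return []
    return [f"to {target} {SOURCE_NOUNS[noun]}" for target in TARGET_VERBS]


# An expression that is not the seed pair itself, but falls within the classes.
print(generate("выводить из застоя"))
```

Without a filter the sketch over-generates (for example "to flee stagnation"); checking candidates against target-language co-occurrence data would be the natural next step.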

MT-oriented evaluation
Improvements for incomprehensible translations and mistranslations:
• MT: Es verdad que empezamos vacilantes pero era lógico. (lit: started hesitant)
• HT: Of course we had our doubts to begin with but that's normal
• SMT: It is true that we started to waver but was logical (unacceptable literal translation)
  – empezar vacilante ~ begin doubt (modulation)
  – Indirect solutions: we had our fears/ doubts to start with; we began with fear/ scepticism/ worries...; we were not convinced then; after our early scepticism; we were soon/gradually/quickly convinced

Application to terminological research
• Terminological equivalents are usually direct
  – Rarely change lexical or syntactic perspective
  – Standard fixed equivalents preferred
• Distributional similarity framework
  – Yields a network of related terms (not paraphrases)
  – Useful for automating terminological research
• Prototype terminological workbench for translators
  – English-French corpora in a specialised domain (2M words in total); Giza alignments; termbanks
  – Translators explore systems of related terms

Terminological interface for translators

• French term plan and the English term plain

Conclusions and future work

• Making testable predictions for indirect equivalents
  – Model for re-phrasing, transposition & modulation strategies
  – Match human translators’ solutions for unseen phrases
• Future work
  – Automatic identification of phrases which need non-literal translation
  – Building fluent equivalents around solutions
  – Integrating strategy-based generator into SMT decoder
  – Evaluation of the improvement in coverage
  – Evaluation of the productivity / reusability of strategies