Generalising lexical translation strategies for MT using comparable corpora
Bogdan Babych, Serge Sharoff, Anthony Hartley
Centre for Translation Studies, University of Leeds, Leeds, UK
{b.babych,s.sharoff,a.hartley}@leeds.ac.uk
29 May 2008, LREC 2008: Generalising Lexical Translation Strategies for MT

Overview
• Indirect translation equivalents in MT: current limitations
• Increasing the range of translation equivalents used by MT
  – Equivalent-oriented vs. strategy-oriented approaches
  – Methodology for discovering translation strategies using comparable corpora
  – Applications for terminology research
• Conclusions and future work

Indirect equivalents in MT
Data-driven MT (statistical & example-based)
– Reusing equivalents learnt from parallel corpora
– Problem: Lack of generalisation
  • Equivalents expressed as word patterns
  • Do not generalise beyond lemmas
– Cannot generate indirect equivalents for ‘unseen’ expressions
  • Difficult to maintain many specific patterns
  • Fundamental limits on the range of translation solutions generated by MT

Indirect equivalents: Change of perspective
Problems for MT: non-fluent translations & mistranslations
• Ru: Из кризисов такого рода как парламентский можно выходить за счет демократических методов.
  – lit.: ‘From crises of such type as parliamentary it is possible to go out by means of democratic methods’
  – RBMT: Such as parliamentary it is possible to leave crises due to democratic methods.
  – SMT: This kind of crisis as a parliamentary, can go through democratic methods.
• HT: We can escape crises like these through democratic means

From equivalents to lexical translation strategies
• Indirect equivalents = ‘creative’ solutions to non-trivial problems
• Parallel corpora: too small, sparse and specialised
  – The same problem often solved idiosyncratically: no clear statistical model
  – Set of ‘indirect’ translation problems is open
• Our solution: higher order model
  – Generalising classes of equivalents as strategies
    • By similarity of usage in comparable corpora
    • Equivalents to unseen expressions are generated from discovered strategies

Current methodology
• One fixed strategy: rephrasing words using similarity of ‘collocation vectors’ ~ near-synonyms
• Generator of equivalents from ASSIST project
  – выходить из кризиса (go out of crisis) ~ {to approach, to face, to get over} crisis
    • Выходить(go out).sim задходить(come).dict + collocations of (crisis) to approach
• No other strategies yet implemented
  – Transposition (change of syntactic perspective); Modulation (change of lexical perspective) …
  – Further goal: to find ~ escape from crisis … via …
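
The re-phrasing step on this slide (.sim, then .dict, then a collocation filter) can be read as a three-stage pipeline. The following is a minimal Python sketch of that reading, not the ASSIST implementation: the toy collocation vectors, the small dictionary, the target collocate set, the example word преодолевать and the 0.5 threshold are all assumptions invented for illustration.

```python
from math import sqrt

# Toy data, invented for illustration only (not from the ASSIST project).
# Source-side collocation vectors: word -> {collocate: count}.
SRC_VECTORS = {
    "выходить": {"кризис": 4, "тупик": 3, "дом": 1},
    "преодолевать": {"кризис": 5, "тупик": 2, "трудность": 3},
    "читать": {"книга": 6, "газета": 4},
}
# Bilingual dictionary: source word -> target translations.
DICTIONARY = {
    "выходить": ["go out", "leave"],
    "преодолевать": ["get over", "overcome"],
    "читать": ["read"],
}
# Target-side verbs attested as collocates of the translated noun.
TARGET_COLLOCATES = {"crisis": {"get over", "overcome", "face", "approach"}}


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def rephrase(src_verb, tgt_noun, threshold=0.5):
    """Similar source words -> dictionary lookup -> target collocation filter."""
    candidates = set()
    for word, vec in SRC_VECTORS.items():
        if cosine(SRC_VECTORS[src_verb], vec) < threshold:
            continue  # not similar enough to the problem word (.sim step)
        for translation in DICTIONARY.get(word, []):  # .dict step
            if translation in TARGET_COLLOCATES.get(tgt_noun, set()):
                candidates.add(translation)  # attested with the target noun
    return sorted(candidates)


print(rephrase("выходить", "crisis"))  # ['get over', 'overcome']
```

In this toy run the direct dictionary translations of выходить (go out, leave) are discarded because they are not attested with crisis, while translations of a distributionally similar verb survive the filter.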

Strategy evaluation
• Coverage of problems vs. coverage of solutions
• Several strategies cover the same problem (variation)
  – Ru: Механизм принятия решений будет публичным. (lit.: ‘The mechanism of making decisions will be public’)
  – публичный механизм (‘public mechanism’)
    • Public process / … a greater public interaction (Current re-phrasing strategy)
    • The answer will come from the people. (Change-of-perspective strategy)
• It is harder to match solutions: diversity of strategies

Coverage of translation problems by re-phrasing strategy
• Characterising linguistic productivity of the strategy
• Experiment: 12 translators suggest indirect solutions to the same set of problems
  – 36 translation problems (25 Ru & 11 En)
  – 210 different human solutions (5.83 solutions / problem)
• Task of the system: to generate a possible solution for each problem

Coverage of translation problems by re-phrasing strategy
• For 75% of problems: at least 1 match by re-phrasing strategy
• Average coverage of a set of human solutions: 34.7%
[Chart: problems grouped by number of matches (no matches, one, two, three, four, five); axis 0 to 10]
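
The two figures on this slide can be read as two different coverage measures: the share of problems with at least one match, and the average share of each problem's human solutions that the system reproduces. The sketch below shows one plausible way to compute both. The exact definitions used in the experiment are an assumption here, and the problem IDs and solution strings are placeholders, not the experimental data.

```python
def coverage(human, system):
    """Return (problem coverage, average solution coverage).

    human, system: dict mapping a problem id to a set of solution strings.
    Every problem must have at least one human solution.
    """
    # Human solutions that the system reproduced exactly, per problem.
    matched = {p: human[p] & system.get(p, set()) for p in human}
    # Share of problems with at least one match.
    problem_cov = sum(1 for m in matched.values() if m) / len(human)
    # Mean, over problems, of the fraction of human solutions matched.
    solution_cov = sum(len(matched[p]) / len(human[p]) for p in human) / len(human)
    return problem_cov, solution_cov


# Placeholder data: four problems with letter-coded solutions.
human = {
    "p1": {"a", "b", "c", "d"},
    "p2": {"e", "f"},
    "p3": {"g", "h", "i", "j"},
    "p4": {"k", "l"},
}
system = {
    "p1": {"a", "b", "x"},
    "p2": {"e"},
    "p3": {"g", "y"},
    "p4": {"z"},
}

print(coverage(human, system))  # (0.75, 0.3125)
```

The two measures can diverge considerably: a system may match something for most problems while still reproducing only a minority of the solutions translators proposed.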

Coverage of translation solutions by re-phrasing strategy
• Comparing coverage of indirect equivalents by:
  – (1) bilingual dictionary solutions (Oxford Russian)
  – (2) solutions extracted from word alignment in parallel corpus:
    • Training Set: Ru-En news, 700k wd.
    • Test Set: Euronews Ru-En interviews, 100k wd.
  – (3) strategy-based (i.e. re-phrasing) solutions:
    • Collocation vectors from monolingual corpora (BNC, RNC) ~100M
    • Filtered by co-occurrence in news corpora ~200M

Coverage of solutions by re-phrasing strategy
• Task of the system: to generate an exact solution for each problem

                         Training Set    Test Set
                         (Ru-En News)    (Euronews)
Bilingual dictionary     6.70%           4.60%
Giza++ word alignment    13.90%          3.40%
Rephrasing strategy      21.90%          19.50%

Coverage of solutions by re-phrasing strategy: Conclusions
• Learning individual equivalents is not efficient
  – Low coverage of unseen problems
  – Lower generalisation of idiosyncratic alignments
• Re-phrasing strategy: productive but not sufficient

On-going project: beyond re-phrasing strategy
Modelling transposition and modulation strategies
– Learning strategies from parallel data
– Aligning ‘indirect’ solutions (discontinuous MWEs)
  • выходить из кризиса (go out of crisis) <~> escape crisis
– Generalising equivalents with similarity classes
– Covering unseen expressions:
  • {Выходить / выводить…} из {конфликта / застоя / депрессии…} (go out / lead out from conflict, stagnation, depression) <~> to escape conflict/ controversy, to flee difficulty, to survive disaster/ tragedy …
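
One way to picture "generalising equivalents with similarity classes" is a strategy stored as a pair of slot classes on the source side and a class of verbs on the target side, so that an unseen verb-noun combination still yields candidates. The Python sketch below is a hypothetical illustration of that idea only: the data structures, the lemma-based class membership, the noun dictionary and the set of attested target pairs are all invented, and the project's actual model may differ.

```python
# Hypothetical strategy learnt from an aligned pair such as
# "выходить из кризиса" <~> "escape crisis". All data here is invented.
STRATEGY = {
    "src_verbs": {"выходить", "выводить"},  # source verb similarity class
    "src_nouns": {"кризис", "конфликт", "застой", "депрессия"},  # noun class (lemmas)
    "tgt_verbs": ["escape", "flee", "survive"],  # target verb similarity class
}
# Direct dictionary equivalents for the nouns.
NOUN_DICT = {
    "кризис": "crisis",
    "конфликт": "conflict",
    "застой": "stagnation",
    "депрессия": "depression",
}
# Verb-noun pairs assumed to be attested in a target-language corpus.
ATTESTED = {
    ("escape", "crisis"),
    ("escape", "conflict"),
    ("survive", "crisis"),
    ("survive", "depression"),
}


def generate(verb, noun):
    """Generate target candidates for a (possibly unseen) source verb-noun pair."""
    if verb not in STRATEGY["src_verbs"] or noun not in STRATEGY["src_nouns"]:
        return []  # the strategy does not apply to this expression
    tgt_noun = NOUN_DICT[noun]
    # Keep only combinations attested in the target corpus.
    return [f"{v} {tgt_noun}" for v in STRATEGY["tgt_verbs"] if (v, tgt_noun) in ATTESTED]


print(generate("выводить", "конфликт"))  # ['escape conflict']
```

The pair выводить + конфликт never has to appear in the parallel data: membership of both words in the source classes is enough to trigger the strategy.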

MT-oriented evaluation
Improvements for incomprehensible translations and mistranslations:
• MT: Es verdad que empezamos vacilantes pero era lógico. (lit: started hesitant)
• HT: Of course we had our doubts to begin with but that's normal
• SMT: It is true that we started to waver but was logical (unacceptable literal translation)
  – empezar vacilante ~ begin doubt (modulation)
  – Indirect solutions: we had our fears/ doubts to start with; we began with fear/ scepticism/ worries...; we were not convinced then; after our early scepticism; we were soon/gradually/quickly convinced

Application to terminological research
• Terminological equivalents are usually direct
  – Rarely change lexical or syntactic perspective
  – Standard fixed equivalents preferred
• Distributional similarity framework
  – Yields a network of related terms (not paraphrases)
  – Useful for automating terminological research
• Prototype terminological workbench for translators
  – English-French corpora in a specialised domain (2M words in total); Giza alignments; termbanks
  – Translators explore systems of related terms
Terminological interface for translators
• French term plan and the English term plain

Conclusions and future work
• Making testable predictions for indirect equivalents
  – Model for re-phrasing, transposition & modulation strategies
  – Match human translators’ solutions for unseen phrases
• Future work
  – Automatic identification of phrases which need non-literal translation
  – Building fluent equivalents around solutions
  – Integrating strategy-based generator into SMT decoder
  – Evaluation of the improvement in coverage
  – Evaluation of the productivity / reusability of strategies