Learning Transfer Rules for Machine Translation with Limited Data


Page 1: Learning Transfer Rules for Machine Translation with Limited Data

Learning Transfer Rules for Machine Translation with Limited Data

Thesis Defense

Katharina Probst

Committee:

Alon Lavie (Chair)

Jaime Carbonell

Lori Levin

Bonnie Dorr, University of Maryland

Page 2: Learning Transfer Rules for Machine Translation with Limited Data

2

Introduction (I)

• Why has Machine Translation been applied to only a few language pairs?
  – Bilingual corpora are available for only a few language pairs (English-French, Japanese-English, etc.)
  – Natural Language Processing tools are available for only a few languages (English, German, Spanish, Japanese, etc.)
  – Scaling to other languages is often difficult, time-consuming, and knowledge-intensive
• What can we do to change this?

Page 3: Learning Transfer Rules for Machine Translation with Limited Data

3

Introduction (II)

• This thesis presents a framework for the automatic inference of transfer rules
• Transfer rules capture syntactic and morphological mappings between languages
• The rules are learned from a small, word-aligned training corpus
• Rules are learned for unbalanced language pairs, where more data and tools are available for one language (L1) than for the other (L2)

Page 4: Learning Transfer Rules for Machine Translation with Limited Data

4

Training Data Example

SL: the widespread interest in the election
TL: h &niin h rxb b h bxirwt  [the interest the widespread in the election]
Alignment: ((1,1),(1,3),(2,4),(3,2),(4,5),(5,6),(6,7))
Type: NP
Parse: (<NP> (DET the-1) (ADJ widespread-2) (N interest-3) (<PP> (PREP in-4) (<NP> (DET the-5) (N election-6))))

[Parse tree figure: NP → DET ADJ N PP over "the widespread interest ...", with PP → PREP NP and the inner NP → DET N over "in the election"]
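For readers who want to manipulate such records programmatically, here is a minimal sketch of one way to hold this example in memory and invert its word alignment; the field names are illustrative, not the thesis's actual format.

from collections import defaultdict

# Hedged sketch of a training-example record; field names are illustrative.
example = {
    "sl": "the widespread interest in the election".split(),
    "tl": "h &niin h rxb b h bxirwt".split(),
    "alignment": [(1, 1), (1, 3), (2, 4), (3, 2), (4, 5), (5, 6), (6, 7)],  # (SL index, TL index), 1-based
    "type": "NP",
    "parse": "(<NP> (DET the-1) (ADJ widespread-2) (N interest-3) "
             "(<PP> (PREP in-4) (<NP> (DET the-5) (N election-6))))",
}

# Invert the alignment to see which SL word(s) each TL word came from.
tl_to_sl = defaultdict(list)
for s, t in example["alignment"]:
    tl_to_sl[t].append(s)
print(dict(tl_to_sl))   # {1: [1], 3: [1], 4: [2], 2: [3], 5: [4], 6: [5], 7: [6]}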

Page 5: Learning Transfer Rules for Machine Translation with Limited Data

5

Transfer Rule Formalism

;;L2: h &niin h rxb b h bxirwt
;;L1: the widespread interest in the election

NP::NP [“h” N “h” Adj PP] -> [“the” Adj N PP]
((X1::Y1) (X2::Y3) (X3::Y1) (X4::Y2) (X5::Y4)
 ((Y3 num) = (X2 num))
 ((X2 num) = sg)
 ((X2 gen) = m))

Labeled parts of the rule (from the slide callouts): the ;; comment lines are the training example; NP::NP is the rule type; the two bracketed sequences are the L2 and L1 component sequences; the (Xi::Yj) pairs are the component alignments; ((Y3 num) = (X2 num)) is an agreement constraint; ((X2 num) = sg) and ((X2 gen) = m) are value constraints.
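To make the formalism concrete, the following is a minimal sketch of how such a rule could be represented in code; the class and field names are assumptions made for illustration, not the thesis's implementation.

from dataclasses import dataclass, field

@dataclass
class TransferRule:
    rule_type: str                     # e.g. "NP::NP"
    x_side: list                       # L2 component sequence, e.g. ['"h"', 'N', '"h"', 'Adj', 'PP']
    y_side: list                       # L1 component sequence, e.g. ['"the"', 'Adj', 'N', 'PP']
    alignments: list = field(default_factory=list)   # [(x_index, y_index), ...]
    constraints: list = field(default_factory=list)  # agreement and value constraints as strings

# The rule from the slide, expressed in this illustrative layout.
example_rule = TransferRule(
    rule_type="NP::NP",
    x_side=['"h"', 'N', '"h"', 'Adj', 'PP'],
    y_side=['"the"', 'Adj', 'N', 'PP'],
    alignments=[(1, 1), (2, 3), (3, 1), (4, 2), (5, 4)],
    constraints=["((Y3 num) = (X2 num))", "((X2 num) = sg)", "((X2 gen) = m)"],
)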

Page 6: Learning Transfer Rules for Machine Translation with Limited Data

6

Research Goals (I)

1. Develop a framework for learning transfer rules from bilingual data
   • Training corpus: a set of sentences/phrases in one language with translations into the other language (= a bilingual corpus), word-aligned
   • Rules include a) a context-free backbone and b) unification constraints
2. Improve the grammaticality of MT output with automatically learned rules
   • The learned rules improve translation quality in the run-time system

Page 7: Learning Transfer Rules for Machine Translation with Limited Data

7

Research Goals (II)

3. Learn rules in the absence of a parser for one of the languages
   • Infer syntactic knowledge about the minor language using a) projection from the major language, b) analysis of word alignments, c) morphology information, and d) a bilingual dictionary
4. Combine a set of different knowledge sources in a meaningful way
   • Resources (parser, morphology modules, dictionary, etc.) often disagree
   • Combine the conflicting knowledge sources

Page 8: Learning Transfer Rules for Machine Translation with Limited Data

8

Research Goals (III)

5. Address limited-data scenarios with ‘frugal’ techniques
   • “Unbalanced” language pairs with little or no bilingual data
   • The training corpus is small (~120 sentences and phrases), but carefully designed
6. Push MT research in the direction of incorporating syntax into statistical systems
   • Infer rich linguistic information and incorporate it with a statistical decoder in a hybrid system

Page 9: Learning Transfer Rules for Machine Translation with Limited Data

9

Thesis Statement (I)

• Given bilingual, word-aligned data, and given a parser for one of the languages in the translation pair, we can learn a set of syntactic transfer rules for MT.

• The rules consist of a context-free backbone and unification constraints, learned in two separate stages.

• The resulting rules form a syntactic translation grammar for the language pair and are used in a statistical transfer system to translate unseen examples.

Page 10: Learning Transfer Rules for Machine Translation with Limited Data

10

Thesis Statement (II)

• The translation quality of a run-time system that uses the learned rules is
  – superior to that of a system that does not use the learned rules
  – comparable to the performance of a small manual grammar written by an expert, on Hebrew→English and Hindi→English translation tasks.
• The thesis presents a new approach to learning transfer rules for Machine Translation: the system learns syntactic models from text in a novel way and in a rich hypothesis space, aiming to emulate a human grammar writer.

Page 11: Learning Transfer Rules for Machine Translation with Limited Data

11

Talk Overview

• Setting the Stage: related work, system overview, training data

• Rule Learning
  – Step 1: Seed Generation
  – Step 2: Compositionality
  – Step 3: Unification Constraints
• Experimental Results
• Conclusions

Page 12: Learning Transfer Rules for Machine Translation with Limited Data

12

Related Work: MT overview

[Figure: the MT pyramid, relating depth of analysis to approach — statistical MT and EBMT analyze the word sequence, syntax-based MT analyzes structure, and semantics-based MT analyzes meaning, on the path from source language to target language.]

Page 13: Learning Transfer Rules for Machine Translation with Limited Data

13

Related Work (I)

• Traditional transfer-based MT: analysis, transfer, generation (Hutchins and Somers 1992, Senellart et al. 2001)
• Data-driven MT:
  – EBMT: store a database of examples, possibly generalized (Sato and Nagao 1990, Brown 1997)
  – SMT: usually a noisy-channel model: translation model + target language model (Vogel et al. 2003, Och and Ney 2002, Brown 2004)
• Hybrid (Knight et al. 1995, Habash and Dorr 2002)

Page 14: Learning Transfer Rules for Machine Translation with Limited Data

14

Related Work (II)

• Structure/syntax for MT
  – EBMT (Alshawi et al. 2000, Watanabe et al. 2002)
  – SMT (Yamada and Knight 2001, Wu 1997)
  – Other approaches (Habash and Dorr 2002, Menezes and Richardson 2001)
• Learning from elicited data / small datasets (Nirenburg 1998, McShane et al. 2003, Jones and Havrilla 1998)

Page 15: Learning Transfer Rules for Machine Translation with Limited Data

15

Training Data Example

SL: the widespread interest in the election
TL: h &niin h rxb b h bxirwt  [the interest the widespread in the election]
Alignment: ((1,1),(1,3),(2,4),(3,2),(4,5),(5,6),(6,7))
Type: NP
Parse: (<NP> (DET the-1) (ADJ widespread-2) (N interest-3) (<PP> (PREP in-4) (<NP> (DET the-5) (N election-6))))

[Parse tree figure: NP → DET ADJ N PP over "the widespread interest ...", with PP → PREP NP and the inner NP → DET N over "in the election"]

Page 16: Learning Transfer Rules for Machine Translation with Limited Data

16

Transfer Rule Formalism

;;L2: h &niin h rxb b h bxirwt
;;[the interest the widespread in the election]
;;L1: the widespread interest in the election

NP::NP [“h” N “h” Adj PP] -> [“the” Adj N PP]
((X1::Y1) (X2::Y3) (X3::Y1) (X4::Y2) (X5::Y4)
 ((Y3 num) = (X2 num))
 ((X2 num) = sg)
 ((X2 gen) = m))

The slide callouts label the same rule parts as on Page 5: training example, rule type, component sequences, component alignments, agreement constraint, value constraints.

Page 17: Learning Transfer Rules for Machine Translation with Limited Data

17

Training Data Collection

• Elicitation Corpora
  – Generally designed to cover major linguistic phenomena
  – A bilingual user translates and word-aligns the corpus
• Structural Elicitation Corpus
  – Designed to cover a wide variety of structural phenomena (Probst and Lavie 2004)
  – 120 sentences and phrases
  – Targets specific constituent types: AdvP, AdjP, NP, PP, SBAR, S, with subtypes
  – Translated into Hebrew and Hindi

Page 18: Learning Transfer Rules for Machine Translation with Limited Data

18

Resources


• L1 parses: Either from statistical parser (Charniak 1999), or use data from Penn Treebank

• L1 morphology: Can be obtained or created (I created one for English)

• L1 language model: Trained on a large amount of monolingual data

• L2 morphology: If available, use morphology module. If not, use automated techniques, such as (Goldsmith 2001) or (Probst 2003).

• Bilingual lexicon: gives word-level correspondences, created from training data or previously existing

Page 19: Learning Transfer Rules for Machine Translation with Limited Data

19

Development and Testing Environment

• Syntactic transfer engine: takes rules and lexicon and produces all possible partial translations

• Statistical decoder: uses word-to-word probabilities and TL language model to extract best combination of partial translations (Vogel et al. 2003)


Page 20: Learning Transfer Rules for Machine Translation with Limited Data

20

System Overview

[System overview diagram. Training time: the bilingual training data, L1 parses & morphology, L2 morphology, and the bilingual lexicon feed the Rule Learner, which produces the Learned Rules. Run time: L2 test data is processed by the Transfer Engine using the learned rules and the bilingual lexicon, producing a Lattice; the Statistical Decoder combines the lattice with the L1 Language Model to produce the Final Translation.]

Page 21: Learning Transfer Rules for Machine Translation with Limited Data

21

Overview of Learning Phases

1. Seed Generation: create initial guesses at rules based on specific training examples
2. Compositionality: add context-free structure to the rules so that rules can combine
3. Constraint Learning: learn appropriate unification constraints

Page 22: Learning Transfer Rules for Machine Translation with Limited Data

22

Seed Generation

• “Training example in rule format”
• Produce rules that closely reflect the training examples
• But: generalize to the POS level when words are 1-1 aligned
• The rules are fully functional, but offer little generalization
• Seed rules are intended as input for the two later learning phases

Page 23: Learning Transfer Rules for Machine Translation with Limited Data

23

Seed Generation – Sample Learned Rule

;;L2: TKNIT H @IPWL H HTNDBWTIT
;;[plan the care the voluntary]
;;L1: THE VOLUNTARY CARE PLAN
;;C-Structure: (<NP> (DET the-1) (<ADJP> (ADJ voluntary-2)) (N care-3) (N plan-4))

NP::NP [N "H" N "H" ADJ] -> ["THE" ADJ N N]
((X1::Y4) (X3::Y3) (X5::Y2))

Page 24: Learning Transfer Rules for Machine Translation with Limited Data

24

Seed Generation Algorithm

• For a given training example, produce a seed rule
• For all 1-1 aligned words, enter the POS tag (e.g. “N”) into the component sequences
  – Get the POS tags from the morphology module and the parse
  – Hypothesis: on unseen data, any word of this POS can fill this slot
• For all words that are not 1-1 aligned, put the actual words into the component sequences
• The L2 and L1 types are the parse's root label
• Derive the alignments from the training example (a code sketch follows below)
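A minimal, hedged sketch of this step is given below, using an illustrative data layout rather than the thesis's actual structures; it also omits the refinement on the later pseudocode slide (leaving a 1-1 aligned pair lexicalized when its two POS tags disagree).

from collections import defaultdict

def seed_rule(l2_words, l1_words, alignment, l2_pos, l1_pos, root_label):
    """alignment: (l2_index, l1_index) pairs, 1-based; all names are illustrative."""
    l2_to_l1 = defaultdict(list)
    l1_to_l2 = defaultdict(list)
    for i, j in alignment:
        l2_to_l1[i].append(j)
        l1_to_l2[j].append(i)

    def generalize(words, pos_tags, links, reverse_links):
        seq = []
        for k, word in enumerate(words, 1):
            targets = links.get(k, [])
            # 1-1 aligned words are generalized to their POS tag ...
            if len(targets) == 1 and reverse_links.get(targets[0], []) == [k]:
                seq.append(pos_tags[k - 1])
            else:
                seq.append('"%s"' % word)       # ... all other words stay lexicalized
        return seq

    return {
        "type": "%s::%s" % (root_label, root_label),   # L2 and L1 type = parse root label
        "x_side": generalize(l2_words, l2_pos, l2_to_l1, l1_to_l2),
        "y_side": generalize(l1_words, l1_pos, l1_to_l2, l2_to_l1),
        "alignments": list(alignment),                  # alignments carried over from the example
    }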

Page 25: Learning Transfer Rules for Machine Translation with Limited Data

25

Compositionality

• Generalize seed rules to reflect structure
• Infer a partial constituent grammar for L2
• Rules map a mixture of
  – Lexical items (LIT)
  – Parts of speech (PT)
  – Constituents (NT)
• Analyze the L1 parse to find generalizations
• The produced rules are context-free

Page 26: Learning Transfer Rules for Machine Translation with Limited Data

26

Compositionality - Example

;;L2: $ BTWK H M&@PH HIH $M
;;[that inside the envelope was name]
;;L1: THAT INSIDE THE ENVELOPE WAS A NAME
;;C-Structure: (<SBAR> (SUBORD that-1) (<SINV> (<PP> (PREP inside-2) (<NP> (DET the-3) (N envelope-4))) (<VP> (V was-5)) (<NP> (DET a-6) (N name-7))))

SBAR::SBAR [SUBORD PP V NP] -> [SUBORD PP V NP]
((X1::Y1) (X2::Y2) (X3::Y3) (X4::Y4))

Page 27: Learning Transfer Rules for Machine Translation with Limited Data

27

Basic Compositionality Algorithm

• Traverse the parse tree in order to partition the sentence
• For each subtree, if a previously learned rule can account for the subtree and its translation, introduce a compositional element
• The compositional element is the subtree's root label, for both L1 and L2
• Adjust the alignments
• Note: there is a preference for maximum generalization, because the tree is traversed from the top (a code sketch follows below)
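The sketch below is a minimal, hedged illustration of this traversal. The parse representation (nested (label, children) tuples) and the two callables l2_chunk_for and can_translate, which stand in for the alignment lookup and the transfer-engine check, are assumptions made for the example, not the thesis's implementation.

def leaves(node):
    """A parse node is (label, children); a leaf is (pos_tag, [word])."""
    label, children = node
    if children and isinstance(children[0], str):
        return list(children)                 # leaf: the word(s) it covers
    return [w for child in children for w in leaves(child)]

def compositional_elements(node, l2_chunk_for, can_translate, top=True):
    """Return (label, l1_words) spans that become compositional elements."""
    label, children = node
    l1_words = leaves(node)
    l2_words = l2_chunk_for(l1_words)         # corresponding L2 chunk, via the word alignments
    # The root itself is not replaced; covered subtrees below it are.
    if not top and can_translate(l1_words, l2_words):
        return [(label, l1_words)]            # introduce a compositional element, stop descending
    if children and isinstance(children[0], str):
        return []                             # plain leaf with no covering rule
    found = []
    for child in children:
        found += compositional_elements(child, l2_chunk_for, can_translate, top=False)
    return found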

Page 28: Learning Transfer Rules for Machine Translation with Limited Data

28

Maximum Compositionality

• Assume that lower-level rules exist; the assumption is correct if the training data is completely compositional
• Introduce compositional elements for the direct children of the parse root node
• Results in a higher level of compositionality, and thus higher generalization power
• Can overgeneralize, but because of the strong decoder this is generally preferable

Page 29: Learning Transfer Rules for Machine Translation with Limited Data

29

Other Advanced Compositionality Techniques

• Techniques that allow generalization to the POS level for words that are not 1-1 aligned
• Techniques that enhance the dictionary based on the training data
• Techniques that deal with noun compounds
• Rule filters that ensure that no learned rule violates the axioms

Page 30: Learning Transfer Rules for Machine Translation with Limited Data

30

Constraint Learning

• Annotate the context-free compositional rules with unification constraints that
  a) limit the applicability of rules to certain contexts (thereby limiting parsing ambiguity)
  b) ensure the passing of a feature value from the source to the target language (thereby limiting transfer ambiguity)
  c) disallow certain target-language outputs (thereby limiting generation ambiguity)
• Value constraints and agreement constraints are learned separately

Page 31: Learning Transfer Rules for Machine Translation with Limited Data

31

Constraint Learning - Overview

1. Introduce basic constraints: use the morphology module(s) and the parses to introduce constraints for the words in the training example
2. Create agreement constraints (where appropriate) by merging basic constraints
3. Retain appropriate value constraints: these help restrict a rule to certain contexts or restrict the output

Page 32: Learning Transfer Rules for Machine Translation with Limited Data

32

Constraint Learning – Agreement Constraints (I)

• For example: in an NP, do the adjective and the noun agree in number?
• In Hebrew, "the good boys":
  – Correct: H ILDIM @WBIM
    (the.det.def boy.pl.m good.pl.m) "the good boys"
  – Incorrect: H ILDIM @WB
    (the.det.def boy.pl.m good.sg.m) "the good boys"

Page 33: Learning Transfer Rules for Machine Translation with Limited Data

33

Constraint Learning – Agreement Constraints (II)

• E.g. number on a determiner and the corresponding noun
• Use a likelihood-ratio test to determine which value constraints can be merged into agreement constraints
• The log-likelihood ratio is defined by proposing distributions that could have given rise to the data:
  – Null Hypothesis: the values are independently distributed.
  – Alternative Hypothesis: the values are not independently distributed.
• For sparse data, use a heuristic test: introduce the agreement constraint if there is more evidence for it than against it

Page 34: Learning Transfer Rules for Machine Translation with Limited Data

34

Constraint Learning – Agreement Constraints (III)

• Collect all instances in the training data where an adjective and a noun mark for number
• Count how often the feature values are the same and how often they differ
• The feature values are distributed by
  – two multinomial distributions (if they are independent, i.e. the null hypothesis)
  – one multinomial distribution (if they should agree, i.e. the alternative hypothesis)
• Compute the log-likelihood under each scenario and perform the LL-ratio or heuristic test (a code sketch follows below)
• Generalize to the cross-lingual case
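The slide describes comparing a one-multinomial (agreement) model against a two-multinomial (independence) model. As a hedged, self-contained illustration, the sketch below computes a standard log-likelihood-ratio (G) statistic for independence over the joint value counts; this is one plausible instantiation of such a test, not necessarily the thesis's exact formulation.

import math
from collections import Counter

def g_statistic(pairs):
    """pairs: observed (value_of_word1, value_of_word2) pairs, e.g. ('pl', 'pl')."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    g = 0.0
    for (a, b), observed in joint.items():
        expected = left[a] * right[b] / n       # expected count under independence
        g += 2.0 * observed * math.log(observed / expected)
    return g                                     # compare against a chi-square critical value

# Toy usage: number marking on adjective/noun pairs.
data = [("pl", "pl")] * 8 + [("sg", "sg")] * 7 + [("pl", "sg")] * 1
print(g_statistic(data))   # a large value is evidence against independence, i.e. for agreement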

Page 35: Learning Transfer Rules for Machine Translation with Limited Data

35

Constraint Learning – Value Constraints

Retain value constraints to distinguish otherwise identical rules:

;;L2: ild @wb
;;[boy good]
;;L1: a good boy
NP::NP [N ADJ] -> [“A” ADJ N]
(... ((X1 NUM) = SG) ((X2 NUM) = SG) ...)

;;L2: ildim @wbim
;;[boys good]
;;L1: good boys
NP::NP [N ADJ] -> [ADJ N]
(... ((X1 NUM) = PL) ((X2 NUM) = PL) ...)

Page 36: Learning Transfer Rules for Machine Translation with Limited Data

36

Constraint Learning – Value Constraints

• Retain those value constraints that determine the structure of the translation
• If two rules have
  – the same L2 component sequence,
  – different L1 component sequences,
  – and differ in only a value constraint,
  then retain that value constraint to distinguish them (as in the example on the previous slide; a sketch follows below)
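One simplified way to operationalize this criterion is sketched below, assuming a dictionary-style rule representation with x_side (L2), y_side (L1), and value_constraints fields; it simply collects the value constraints on which such rule pairs disagree. This is an illustration of the idea, not the thesis's procedure.

def retained_value_constraints(rules):
    """rules: dicts with "x_side", "y_side", and "value_constraints" (illustrative layout)."""
    keep = set()
    for i, a in enumerate(rules):
        for b in rules[i + 1:]:
            # Same L2 sequence, different L1 sequences: a value constraint is what selects between them.
            if a["x_side"] == b["x_side"] and a["y_side"] != b["y_side"]:
                keep |= set(a["value_constraints"]) ^ set(b["value_constraints"])
    return keep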

Page 37: Learning Transfer Rules for Machine Translation with Limited Data

37

Constraint Learning – Sample Learned Rule

;;L2: ANI AIN@LIGN@I
;;[I intelligent]
;;L1: I AM INTELLIGENT

S::S [NP ADJP] -> [NP “AM” ADJP]
((X1::Y1) (X2::Y3)
 ((X1 NUM) = (X2 NUM))
 ((Y1 NUM) = (X1 NUM))
 ((Y1 PER) = (X1 PER))
 (Y0 = Y2))

Page 38: Learning Transfer Rules for Machine Translation with Limited Data

38

Dimensions of Evaluation

• Learning Phases / Settings: default, Seed Generation only, Compositionality, Constraint Learning

• Evaluation: rule-based evaluation + pruning
• Test Corpora: Test Set, Test Suite
• Run-time Settings: length limit
• Portability: Hindi→English translation

Page 39: Learning Transfer Rules for Machine Translation with Limited Data

39

Test Corpora

• Test corpora:
  1. Test Corpus: newspaper text (Haaretz), 65 sentences, 1 reference translation
  2. Test Suite: specific phenomena, 138 sentences, 1 reference translation
  3. Hindi: 245 sentences, 4 reference translations
• Compare: the statistical system only, the system with the manually written grammar, and the system with the learned grammar
• Manually written grammar: written by an expert within about a month (for both Hebrew and Hindi)

Page 40: Learning Transfer Rules for Machine Translation with Limited Data

40

Test Corpus Evaluation, Default Settings (I)

Grammar                               BLEU     METEOR
No Grammar                            0.0565   0.3019
Manual Grammar                        0.0817   0.3241
Learned Grammar (With Constraints)    0.0780   0.3293

Page 41: Learning Transfer Rules for Machine Translation with Limited Data

41

Test Corpus Evaluation, Default Settings (II)

The learned grammar performs statistically significantly better than the baseline (see the resampling sketch below).

• Performed a one-tailed paired t-test
• BLEU with resampling: t-value 81.98, p-value 0 (df = 999) → significant (p ≈ 0)
  – Median of differences: -0.0217, with 95% confidence interval [-0.0383, -0.0056]
• METEOR: t-value 1.73, p-value 0.044 (df = 61) → significant at the 95% level
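The resampling-based BLEU test reported here can be illustrated with the hedged sketch below: resample the test set with replacement, score both systems on each draw, and run a paired t-test over the resulting paired scores (1000 resamples give df = 999, as on the slide). corpus_bleu is a placeholder for whatever BLEU scorer is used; this is an illustration, not the thesis's exact procedure.

import random
from statistics import mean, stdev

def paired_bootstrap_t(system_a, system_b, references, corpus_bleu, n_samples=1000):
    """Paired t statistic over bootstrap-resampled corpus scores of two systems."""
    idx = list(range(len(references)))
    diffs = []
    for _ in range(n_samples):
        sample = [random.choice(idx) for _ in idx]        # resample the test set with replacement
        a = corpus_bleu([system_a[i] for i in sample], [references[i] for i in sample])
        b = corpus_bleu([system_b[i] for i in sample], [references[i] for i in sample])
        diffs.append(a - b)
    t = mean(diffs) / (stdev(diffs) / len(diffs) ** 0.5)  # one-sample t on the paired differences
    return t, diffs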

Page 42: Learning Transfer Rules for Machine Translation with Limited Data

42

Test Corpus Evaluation, Default Settings (III)


Page 43: Learning Transfer Rules for Machine Translation with Limited Data

43

Test Corpus Evaluation, Different Settings (I)

Grammar                               BLEU     METEOR
No Grammar                            0.0565   0.3019
Manual Grammar                        0.0817   0.3241
Learned Grammar (Seed Generation)     0.0741   0.3239
Learned Grammar (Compositionality)    0.0777   0.3360
Learned Grammar (With Constraints)    0.0780   0.3293

Page 44: Learning Transfer Rules for Machine Translation with Limited Data

44

Test Corpus Evaluation, Different Settings (II)

System times (in seconds) and lattice sizes:

Grammar                               Transfer Engine (s)   Lattice size (MB)   Decoder (s)
Learned Grammar (Compositionality)    54.98                 187                 3123.38
Learned Grammar (With Constraints)    33.28                 140                 2287.47

→ roughly a 25% reduction in lattice size

Page 45: Learning Transfer Rules for Machine Translation with Limited Data

45

Evaluation with Rule Scoring (I)

• Estimate the translation power of the rules
• Use the training data: most training examples are actually unseen data for a given rule
• Match each arc against the reference translation
• A rule's score is the average of all its arcs' scores
• Order the rules by precision score, then prune (a code sketch follows below)
• Goal of rule scoring: limit run time
• Note the trade-off with decoder power
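A rough, hedged sketch of the scoring-and-pruning idea: a rule's score is the average of its arcs' match scores against the reference, and the lowest-scoring rules are dropped. arc_precision is a stand-in for the actual arc-vs-reference measure, and the data layout is illustrative.

from statistics import mean

def score_rules(arcs_by_rule, references, arc_precision):
    """arcs_by_rule: {rule_id: [(arc_text, sentence_id), ...]} (illustrative layout)."""
    return {rule: mean(arc_precision(text, references[sid]) for text, sid in arcs)
            for rule, arcs in arcs_by_rule.items()}

def prune(scores, keep_fraction=0.75):
    """Keep the top fraction of rules by score (cf. the 25% / 50% / 75% settings)."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:max(1, int(len(ranked) * keep_fraction))])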

Page 46: Learning Transfer Rules for Machine Translation with Limited Data

46

Evaluation with Rule Scoring (II)

Grammar                   BLEU     ModBLEU   METEOR
No Grammar                0.0565   0.1362    0.3019
Manual Grammar            0.0817   0.1546    0.3241
Learned Grammar (25%)     0.0565   0.1362    0.3019
Learned Grammar (50%)     0.0592   0.1389    0.3075
Learned Grammar (75%)     0.0800   0.1533    0.3296
Learned Grammar (full)    0.0780   0.1524    0.3293

Page 47: Learning Transfer Rules for Machine Translation with Limited Data

47

Evaluation with Rule Scoring (III)

Grammar                   Transfer Engine (s)   Lattice size    Decoder (s)
Learned Grammar (25%)     1.02                  330,342         22.55
Learned Grammar (50%)     1.81                  13,431,206      189.89
Learned Grammar (75%)     5.91                  29,242,597      397.06
Learned Grammar (full)    33.28                 149,713,589     2287.47

Page 48: Learning Transfer Rules for Machine Translation with Limited Data

48

Test Suite Evaluation (I)

• The test suite is designed to target specific constructions:
  – conjunctions of PPs
  – adverb phrases
  – reordering of adjectives and nouns
  – AdjP embedded in NP
  – possessives
  – …
• Designed in English, translated into Hebrew
• 138 sentences, one reference translation

Page 49: Learning Transfer Rules for Machine Translation with Limited Data

49

Test Suite Evaluation (II)

Grammar           BLEU     METEOR
Baseline          0.0746   0.4146
Manual Grammar    0.1179   0.4471
Learned Grammar   0.1199   0.4655

Page 50: Learning Transfer Rules for Machine Translation with Limited Data

50

Test Suite Evaluation (III)

The learned grammar performs statistically significantly better than the baseline.

• Performed a one-tailed paired t-test
• BLEU with resampling: t-value 122.53, p-value 0 (df = 999) → significant (p ≈ 0)
  – Median of differences: -0.0462, with 95% confidence interval [-0.0721, -0.0245]
• METEOR: t-value 47.20, p-value 0.0 (df = 137) → significant (p ≈ 0)

Page 51: Learning Transfer Rules for Machine Translation with Limited Data

51

Test Suite Evaluation (IV)


Page 52: Learning Transfer Rules for Machine Translation with Limited Data

52

Hindi-English Portability Test (I)

Grammar           BLEU     METEOR
Baseline          0.1003   0.3659
Manual Grammar    0.1052   0.3676
Learned Grammar   0.1033   0.3685

Page 53: Learning Transfer Rules for Machine Translation with Limited Data

53

Hindi-English Portability Test (II)

The learned grammar performs statistically significantly better than the baseline.

• Performed a one-tailed paired t-test
• BLEU with resampling: t-value 37.20, p-value 0 (df = 999) → significant (p ≈ 0)
  – Median of differences: -0.0024, with 80% confidence interval [-0.0052, 0.0001]
• METEOR: t-value 1.72, p-value 0.043 (df = 244) → significant at the 95% level

Page 54: Learning Transfer Rules for Machine Translation with Limited Data

54

Hindi-English Portability Test (III)


Page 55: Learning Transfer Rules for Machine Translation with Limited Data

55

Discussion of Results

• Performance is superior to a standard SMT system
• The learned grammar is comparable to the manual grammar
• The learned grammar achieves a higher METEOR score, indicating that it is more general
• Constraints: slightly lower performance in exchange for higher run-time efficiency
• Pruning: slightly lower performance in exchange for higher run-time efficiency

Page 56: Learning Transfer Rules for Machine Translation with Limited Data

56

Conclusions and Contributions

1. Framework for learning transfer rules from bilingual data

2. Improvement of translation output in hybrid transfer and statistical system

3. Addressing limited-data scenarios with ‘frugal’ techniques

4. Combining different knowledge sources in a meaningful way

5. Pushing MT research in the direction of incorporating syntax into statistical-based systems

6. Human-readable rules that can be improved by an expert


Page 57: Learning Transfer Rules for Machine Translation with Limited Data

57

Summary

“Take a bilingual word-aligned corpus, and learn transfer rules with constituent transfer and unification constraints.”

“Is it a big corpus?”

“Ahem. No.”

“Do I have a parser for both languages?”

“No, just for one.”

“… So I can use a dictionary, morphology modules, a parser … But these are all imperfect resources. How do I combine them?”

“We can do it!”

“Ok.”


Page 58: Learning Transfer Rules for Machine Translation with Limited Data

58

Thank you!

Page 59: Learning Transfer Rules for Machine Translation with Limited Data

59

Additional Slides

Page 60: Learning Transfer Rules for Machine Translation with Limited Data

60

References (I)

Ayan, Fazil, Bonnie J. Dorr, and Nizar Habash. 2004. Application of Alignment to Real-World Data: Combining Linguistic and Statistical Techniques for Adaptable MT. Proceedings of AMTA-2004.

Baldwin, Timothy and Aline Villavicencio. 2002. Extracting the Unextractable: A case study on verb-particles. Proceedings of CoNLL-2002.

Brown, Ralf D., A Modified Burrows-Wheeler Transform for Highly-Scalable Example-Based Translation, Proceedings of AMTA-2004.

Charniak, Eugene, Kevin Knight and Kenji Yamada. 2003. Syntax-based Language Models for Statistical Machine Translation. Proceedings of MT-Summit IX.

Page 61: Learning Transfer Rules for Machine Translation with Limited Data

61

References (II)

Hutchins, John W. and Harold L. Somers. 1992. An Introduction to Machine Translation. Academic Press, London.

Jones, Douglas and R. Havrilla. 1998. Twisted Pair Grammar: Support for Rapid Development of Machine Translation for Low Density Languages. Proceedings of AMTA-98.

Menezes, Arul and Stephen D. Richardson. 2001. A best-first alignment algorithm for automatic extraction of transfer mappings from bilingual corpora. Proceedings of the Workshop on Data-driven Machine Translation at ACL-2001.

Nirenburg, Sergei. 1998. Project Boas: A Linguist in the Box as a Multi-Purpose Language Resource. Proceedings of LREC-98.

Page 62: Learning Transfer Rules for Machine Translation with Limited Data

62

References (III)

Orasan, Constantin and Richard Evans. 2001. Learning to identify animate references. Proceedings of CoNLL-2001.

Probst, Katharina. 2003. Using ‘smart’ bilingual projection to feature-tag a monolingual dictionary. Proceedings of CoNLL-2003.

Probst, Katharina and Alon Lavie. 2004. A Structurally Diverse Minimal Corpus for Eliciting Structural Mappings between Languages. Proceedings of AMTA-04.

Probst, Katharina and Lori Levin. 2002. Challenges in Automated Elicitation of a Controlled Bilingual Corpus. Proceedings of TMI-02.

Page 63: Learning Transfer Rules for Machine Translation with Limited Data

63

References (IV)

Senellart, Jean, Mirko Plitt, Christophe Bailly, and Francoise Cardoso. 2001. Resource Alignment and Implicit Transfer. Proceedings of MT-Summit VIII.

Vogel, Stephan and Alicia Tribble. 2002. Improving Statistical Machine Translation for a Speech-to-Speech Translation Task. Proceedings of ICSLP-2002.

Watanabe, Hideo, Sadao Kurohashi, and Eiji Aramaki. 2000. Finding Structural Correspondences from Bilingual Parsed Corpus for Corpus-based Translation. Proceedings of COLING-2000.

Page 64: Learning Transfer Rules for Machine Translation with Limited Data

64

Log-likelihood test for agreement constraints (I)

• Create a list of all possible index pairs that should be considered for an agreement constraint:
• L1-only constraints:
  – the list of all head-head pairs that ever occur with the same feature (not necessarily the same value), and all head/non-head pairs in the same constituent that occur with the same feature (not necessarily the same value)
  – For example, a possible agreement constraint: NUM agreement between a Det and an N in an NP where the Det is a dependent of the N
• L2-only constraints: same as the L1-only constraints above
• L2→L1 constraints: all situations where two aligned indices mark the same feature

Page 65: Learning Transfer Rules for Machine Translation with Limited Data

65

Log-likelihood test for agreement constraints (II)

• Hypothesis 0: the values are independently distributed.
• Hypothesis 1: the values are not independently distributed.
• Under the null hypothesis: [formula shown as an image on the slide; not reproduced in the transcript]
• Under the alternative hypothesis: [formula shown as an image on the slide; not reproduced in the transcript]
  where ind is 1 if vxi1 = vxi2 and 0 otherwise.

Page 66: Learning Transfer Rules for Machine Translation with Limited Data

66

Log-likelihood test for agreement constraints (III)

• i1 and i2 are drawn from a multinomial distribution: [formula shown as an image on the slide; not reproduced in the transcript]
  where cvi is the number of times the value vi was encountered for the given feature (e.g. PERS), and k is the number of possible values for the feature (e.g. 1st, 2nd, 3rd).
• If there is strong evidence against Hypothesis 0 (i.e. for Hypothesis 1), introduce the agreement constraint (a hedged reconstruction of the formulas follows below)
• For cases where there is not enough evidence either way (n < 10), use the heuristic test
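Because the formulas on the two preceding slides were images and did not survive in this transcript, the following is a hedged reconstruction of the quantities being compared, using the slide's notation (counts c_{v_i}, sample size n, k possible values); the exact formulation in the thesis, in particular how the agreement indicator ind enters the alternative hypothesis, may differ.

% Hedged reconstruction, not the thesis's exact formulas.
% Log-likelihood of n observations of one feature under a multinomial with MLE parameters:
\log L = \sum_{i=1}^{k} c_{v_i} \log \hat{p}_{v_i}, \qquad \hat{p}_{v_i} = \frac{c_{v_i}}{n}
% Under H_0 (independence) the two index positions contribute two such terms, one multinomial
% per position; under H_1 (agreement) a single shared multinomial is used. The decision
% statistic is the log-likelihood ratio
\lambda = 2\left( \log L_{H_1} - \log L_{H_0} \right)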

Page 67: Learning Transfer Rules for Machine Translation with Limited Data

67

Lexicon Enhancement for Hebrew Adverbs (I)

• Example 1: “B” “$MX” → “happily”
• Example 2: “IWTR” “GBWH” → “taller”
• These are not necessarily in the dictionary
• Both processes are productive
• How can we add these and similar entries to the lexicon? Automatically?

Page 68: Learning Transfer Rules for Machine Translation with Limited Data

68

Lexicon Enhancement for Hebrew Adverbs (II)

For all 1-2 (L1-L2) alignments in the training data {
  1. Extract all cases with at least 2 instances where one word is constant
     (constant word: wL2c; non-constant words: wL2v, wL1v)
  2. For each word wL2v {
     2.1. Get all L1 translations
     2.2. Find the closest match wL1match to wL1v
     2.3. Learn the replacement rule wL1match -> wL1v }
  3. For each word wL2POS of the same POS as wL2c {
     3.1. For each possible translation wL1POS {
        3.1.1. Apply all applicable replacement rules: wL1POS -> wL1POSmod
        3.1.2. For each applied replacement rule, insert the lexicon entry:
               [“wL2c” wL2POS] -> [wL1POSmod] } } }

Page 69: Learning Transfer Rules for Machine Translation with Limited Data

69

Lexicon Enhancement for Hebrew Adverbs (III)

• Example: B $MX -> happily
• Possible translations of $MX:
  – joy
  – happiness

• Use edit distance to find that happiness is wL1match for happily

• Learn replacement rule ness->ly
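
A small, hedged sketch of the closest-match and replacement-rule steps, reproducing the ness → ly example on this slide; SequenceMatcher is used as a stand-in for an edit-distance computation, and the function names are illustrative.

from difflib import SequenceMatcher

def closest_match(candidates, target):
    # Similarity ratio as a stand-in for (inverse) edit distance.
    return max(candidates, key=lambda c: SequenceMatcher(None, c, target).ratio())

def suffix_rule(source, target):
    """Return (old_suffix, new_suffix) after stripping the longest common prefix."""
    i = 0
    while i < min(len(source), len(target)) and source[i] == target[i]:
        i += 1
    return source[i:], target[i:]

match = closest_match(["joy", "happiness"], "happily")   # -> "happiness"
print(suffix_rule(match, "happily"))                      # -> ('ness', 'ly')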

Page 70: Learning Transfer Rules for Machine Translation with Limited Data

70

Lexicon Enhancement for Hebrew Adverbs (IV)

• For all L2 Nouns in the dictionary, get all possible L1 translations, and apply the replacement rule

• If replacement rule can be applied, add lexicon entry

• Examples of new adverbs added to the lexicon:

ADV::ADV |: ["B" "$APTNWT"] -> ["AMBITIOUSLY"]

ADV::ADV |: ["B" "$BIRWT"] -> ["BRITTLELY"]

ADV::ADV |: ["B" "$G&WN"] -> ["MADLY"]

ADV::ADV |: ["B" "$I@TIWT"] -> ["METHODICALLY"]

Page 71: Learning Transfer Rules for Machine Translation with Limited Data

71

Lexicon Enhancement for Hebrew Comparatives

• Same process as for the adverbs
• Examples of new comparatives added to the lexicon:

ADJ::ADJ |: ["IWTR" "MLA"] -> ["FULLER"]
ADJ::ADJ |: ["IWTR" "MPGR"] -> ["SLOWER"]
ADJ::ADJ |: ["IWTR" "MQCH"] -> ["HEATER"]

• All words are checked against the BNC
• Comment: this is an automatic process and thus far from perfect

Page 72: Learning Transfer Rules for Machine Translation with Limited Data

72

Some notation

• SL: Source Language, the language translated from
• TL: Target Language, the language translated into
• L1: the language for which abundant information is available
• L2: the language for which less information is available
• (Here:) SL = L2 = Hebrew, Hindi
• (Here:) TL = L1 = English
• POS: part of speech, e.g. noun, adjective, verb
• Parse: structural (tree) analysis of a sentence
• Lattice: a list of partial translations, arranged by length and start index

Page 73: Learning Transfer Rules for Machine Translation with Limited Data

73

Training Data Example

SL: the widespread interest in the election
TL: h &niin h rxb b h bxirwt
Alignment: ((1,1),(1,3),(2,4),(3,2),(4,5),(5,6),(6,7))
Type: NP
Parse: (<NP> (DET the-1) (ADJ widespread-2) (N interest-3) (<PP> (PREP in-4) (<NP> (DET the-5) (N election-6))))

Page 74: Learning Transfer Rules for Machine Translation with Limited Data

74

Seed Generation Algorithm

for all training examples {
  for all 1-1 aligned words {
    get the L1 POS tag from the parse
    get the L2 POS tag from the morphology module and the dictionary
    if the L1 POS and the L2 POS tags are not the same, leave both words lexicalized }
  for all other words { leave the words lexicalized }
  create the rule word alignments from the training example
  set the L2 type and the L1 type to be the parse root's label }

Page 75: Learning Transfer Rules for Machine Translation with Limited Data

75

Taxonomy of Structural Mappings (I)

• Non-terminals (NT):
  – used in two rule parts:
    • the type definition of a rule (both for SL and TL, i.e. X0 and Y0)
    • the constituent sequences for both languages
  – any label that can be the type of a rule
  – describe higher-level structures such as sentences (S), noun phrases (NP), or prepositional phrases (PP)
  – can be filled with more than one word: filled by other rules

Page 76: Learning Transfer Rules for Machine Translation with Limited Data

76

Taxonomy of Structural Mappings (II)

• Pre-terminals (PT):
  – used only in the constituent sequences of the rules, not as X0 or Y0 types
  – filled with only one word (except for phrasal lexicon entries): filled by lexical entries, not by other grammar rules
• Terminals (LIT):
  – lexicalized entries in the constituent sequences
  – can be used on both the x- and the y-side
  – can only be filled by the specified terminal itself

Page 77: Learning Transfer Rules for Machine Translation with Limited Data

77

Taxonomy of Structural Mappings (III)

• NTs must not be aligned 1-0 or 0-1
• PTs must not be aligned 1-0 or 0-1
• Any word in the bilingual training pair must participate in exactly one LIT, PT, or NT
• An L1 NT is assumed to translate into the same NT in L2

Page 78: Learning Transfer Rules for Machine Translation with Limited Data

78

Taxonomy of Structural Mappings (IV)

• Transformation I (SL type into SL component sequence). NT → (NT | PT | LIT)+

• Transformation II (SL type into TL type). NTi → NTi (same type of NT)

• Transformation III (TL type into TL component sequence). NT → (NT | PT | LIT)+

• Transformation IV (SL components into TL components).
  NTi → NTi+ (same type of NT)
  PT → PT+
  LIT → ε
  ε → LIT

Page 79: Learning Transfer Rules for Machine Translation with Limited Data

79

Basic Compositionality Pseudocode

traverse the parse top-down
for each node i in the parse {
  extract the subtree rooted at i
  extract the L1 chunk cL1 rooted at i and the corresponding L2 chunk cL2 (using the alignments)
  if the transfer engine can translate cL1 into cL2 using previously learned rules {
    introduce a compositional element:
      replace the POS sequences for cL1 and cL2 with the label of node i
      adjust the alignments }
  do not traverse an already covered subtree }

Page 80: Learning Transfer Rules for Machine Translation with Limited Data

80

Co-Embedding Resolution, Iterative Type Learning

• Problem: the learner looks for previously learned rules
  – It must determine the optimal learning order
• Co-Embedding Resolution:
  – Tag each training example with the depth of its tree, i.e. how many embedded elements it contains
  – Then learn from lowest to highest
• Iterative Type Learning:
  – Some types (e.g. PPs) are frequently embedded in others (e.g. NPs)
  – Pre-determine the order in which the types are learned

Page 81: Learning Transfer Rules for Machine Translation with Limited Data

81

Compositionality – Sample Learned Rules (II)

;;L2: RQ AM H RKBT TGI&
;;L1: ONLY IF THE TRAIN ARRIVES
;;C-Structure: (<SBAR> (<ADVP> (ADV only-1)) (SUBORD if-2) (<S> (<NP> (DET the-3) (N train-4)) (<VP> (V arrives-5))))

SBAR::SBAR [ADVP SUBORD S] -> [ADVP SUBORD S]
((X1::Y1) (X2::Y2) (X3::Y3))

Page 82: Learning Transfer Rules for Machine Translation with Limited Data

82

Taxonomy of Constraints (I)

Parameter            Possible values
value or agreement   value, agreement
level                POS, constituent, POS/constituent
language             L2, L1, L2→L1
constrains head      head, non-head, head+non-head

Page 83: Learning Transfer Rules for Machine Translation with Limited Data

83

Co-Embedding Resolution, Iterative Type Learning

find the highest co-embedding score in the training data
find the number of types to learn, ntypes
for (i = 0; i < maxco-embedding; i++) {
  for (j = 0; j < ntypes; j++) {
    for all training examples with co-embedding score i and of type j {
      perform Seed Generation
      perform Compositionality Learning } } }

Page 84: Learning Transfer Rules for Machine Translation with Limited Data

84

Taxonomy of Constraints (II)

Subtype   Value/Agreement   Language   Level       Comment
1         value             x          POS         Group 1-2
2         value             x          const       Group 1-2
3         value             x          POS/const   can't exist
4         value             y          POS         Group 4
5         value             y          const       Group 5
6         value             y          POS/const   can't exist
7         value             xy         POS         can't exist
8         value             xy         const       can't exist
9         value             xy         POS/const   can't exist
10        agreement         x          POS         Group 10-12
11        agreement         x          const       Group 10-12
12        agreement         x          POS/const   Group 10-12
13        agreement         y          POS         Group 13-15
14        agreement         y          const       Group 13-15
15        agreement         y          POS/const   Group 13-15
16        agreement         xy         POS         Group 16-18
17        agreement         xy         const       Group 16-18
18        agreement         xy         POS/const   Group 16-18

Page 85: Learning Transfer Rules for Machine Translation with Limited Data

85

Taxonomy of Constraints (III)

Subtype       Value/Agreement   Language   Level
1, 2          value             x          POS or const
4, 5          value             y          POS or const
10, 11, 12    agreement         x          POS, const, or POS/const
13, 14, 15    agreement         y          POS, const, or POS/const
16, 17, 18    agreement         xy         POS, const, or POS/const

Page 86: Learning Transfer Rules for Machine Translation with Limited Data

86

Constraint Learning – Sample Learned Rules (II)

;;L2: H ILD AKL KI HWA HIH R&B
;;L1: THE BOY ATE BECAUSE HE WAS HUNGRY

S::S [NP V SBAR] -> [NP V SBAR]
((X1::Y1) (X2::Y2) (X3::Y3)
 (X0 = X2)
 ((X1 GEN) = (X2 GEN))
 ((X1 NUM) = (X2 NUM))
 ((Y1 NUM) = (X1 NUM))
 ((Y2 TENSE) = (X2 TENSE))
 ((Y3 NUM) = (X3 NUM))
 ((Y3 TENSE) = (X3 TENSE))
 (Y0 = Y2))

Page 87: Learning Transfer Rules for Machine Translation with Limited Data

87

Evaluation with Different Length Limits (I)

METEOR scores at length limits 1–6:

Grammar           1        2        3        4        5        6
No Grammar        0.1710   0.2962   0.3016   0.3012   0.3019   0.3019
Manual Grammar    0.1744   0.2970   0.3141   0.3182   0.3232   0.3241
Learned Grammar   0.1710   0.2995   0.3072   0.3252   0.3282   0.3293

Page 88: Learning Transfer Rules for Machine Translation with Limited Data

88

Evaluation with Different Length Limits (II)

(METEOR score)


Page 89: Learning Transfer Rules for Machine Translation with Limited Data

89

Discussion of Results: Comparison of Translations

(back to Hebrew-English)

No grammar: the doctor helps to patients his
Learned grammar: the doctor helps to his patients
Reference translation: The doctor helps his patients

No grammar: the soldier writes many letters to the family of he
Learned grammar: the soldier writes many letters to his family
Reference translation: The soldier writes many letters to his family

Page 90: Learning Transfer Rules for Machine Translation with Limited Data

90

Time Complexity of Algorithms

• Seed Generation: O(n)
• Compositionality:
  – Basic: O(n · max(tree_depth))
  – Maximum Compositionality: O(n · max(num_children))
• Constraint Learning: O(n · max(num_basic_constraints))
• In practice: not an issue

Page 91: Learning Transfer Rules for Machine Translation with Limited Data

91

If I had 6 more months…

• Application to larger datasets
  – Training data enhancement to obtain training examples at different levels (NPs, PPs, etc.)
  – More emphasis on rule scoring (more noise)
  – More emphasis on context learning: constraints
• Constraint learning as a version-space learning problem
• Integrate the rules into the statistical system more directly, without producing the full lattice