
  • Proceedings of

    SSST-4

    Fourth Workshop on

    Syntax and Structure in Statistical Translation

    Dekai Wu (editor)

    COLING 2010 / SIGMT Workshop
    23rd International Conference on Computational Linguistics

    Beijing, China
    28 August 2010

  • Produced by
    Chinese Information Processing Society of China
    All rights reserved for Coling 2010 CD production.

    To order the CD of Coling 2010 and its Workshop Proceedings, please contact:

    Chinese Information Processing Society of China
    No. 4, Southern Fourth Street
    Haidian District, Beijing, 100190
    China
    Tel: +86-010-62562916
    Fax: [email protected]


  • Introduction

The Fourth Workshop on Syntax and Structure in Statistical Translation (SSST-4) was held on 28 August 2010 following the Coling 2010 conference in Beijing. Like the first three SSST workshops in 2007, 2008, and 2009, it aimed to bring together researchers from different communities working in the rapidly growing field of statistical, tree-structured models of natural language translation.

We were honored to have Martin Kay deliver this year's invited keynote talk. This field is indebted to Martin Kay for not one but two of the classic cornerstone ideas that inspired bilingual tree-structured models for statistical machine translation: first, chart parsing, and second, parallel text alignment.

Tabular approaches to parsing, using dynamic programming and/or memoization, were heavily influenced by Kay's (1980) charts (or forests, packed forests, well-formed substring tables, or WFSTs). Today's biparsing models—the bilingual generalizations of this influential work—lie at the heart of numerous alignment and training algorithms for inversion transduction grammars or ITGs—including all syntax-directed transduction grammars or SDTGs (or synchronous CFGs) of binary rank or ternary rank, such as those learned by hierarchical phrase-based translation approaches.

At the same time, Kay and Röscheisen's (1988) seminal work on alignment of parallel texts led the way in statistical machine translation's basic paradigm of integrating the simultaneous learning of translation lexicons with aligning parallel texts. Today's biparsing models generalize this by simultaneously learning tree structures as well. Once again, cross-pollination of ideas across different areas and disciplines, empirical and theoretical, has provided mutual inspiration.

We selected 15 papers for this year's workshop. Studies emphasizing formal and algorithmic aspects include a method for intersecting synchronous/transduction grammars (S/TGs) with finite-state transducers (Dymetman and Cancedda), and a comparison of linear transduction grammars (LTGs) with ITGs (Saers, Nivre and Wu). Experiments on using syntactic features and constraints within flat phrase-based translation models include studies by Jiang, Du and Way; by Cao, Finch and Sumita; by Kolachina, Venkatapathy, Bangalore, Kolachina and PVS; and by Zhechev and van Genabith. Dependency constraints are also used to improve HMM word alignments for both flat phrase-based as well as S/TG based translation models (Ma and Way). Extensions to the features and parameterizations in two S/TG based translation models, as well as methods for merging models, are studied by Zollmann and Vogel. The potential of incorporating LFG-style deep syntax within S/TGs is explored by Graham and van Genabith. A tree transduction based approach is presented by Khalilov and Sima'an. Meanwhile, Lo and Wu empirically compare n-gram, syntactic, and semantic structure based MT evaluation approaches. An encouraging trend is an uptick in work on low-resource language pairs and from underrepresented regions, including English-Persian (Mohaghegh, Sarrafzadeh and Moir), Manipuri-English (Singh and Bandyopadhyay), Tunisia (Khemakhem, Jamoussi and Ben hamadou), and English-Hindi (Venkatapathy, Sangal, Joshi, and Gali).

Once again this year, thanks are due to our authors and our Program Committee for making the SSST workshop another success.


  • Acknowledgements

This work was supported in part by the Defense Advanced Research Projects Agency (DARPA) under GALE Contract Nos. HR0011-06-C-0022, subcontract BBN Technologies, and HR0011-06-C-0023, subcontract SRI International. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the Defense Advanced Research Projects Agency.


  • Organizers:

    Dekai WU, Hong Kong University of Science and Technology (HKUST), Hong Kong

    Program Committee:

Srinivas BANGALORE, AT&T Research, USA
Marine CARPUAT, Hong Kong University of Science and Technology (HKUST), Hong Kong
David CHIANG, USC Information Sciences Institute, USA
Pascale FUNG, Hong Kong University of Science and Technology (HKUST), Hong Kong
Daniel GILDEA, University of Rochester, USA
Dan KLEIN, University of California at Berkeley, USA
Kevin KNIGHT, USC Information Sciences Institute, USA
Jonas KUHN, University of Potsdam, Germany
Yang LIU, Institute of Computing Technology, Chinese Academy of Sciences, China
Yanjun MA, Dublin City University, Ireland
Daniel MARCU, USC Information Sciences Institute, USA
Yuji MATSUMOTO, Nara Institute of Science and Technology, Japan
Hermann NEY, RWTH Aachen, Germany
Owen RAMBOW, Columbia University, USA
Philip RESNIK, University of Maryland, USA
Stefan RIEZLER, Google Inc., USA
Libin SHEN, BBN Technologies, USA
Christoph TILLMANN, IBM T. J. Watson Research Center, USA
Stephan VOGEL, Carnegie Mellon University, USA
Taro WATANABE, NTT Communication Science Laboratories, Japan
Andy WAY, Dublin City University, Ireland
Yuk-Wah WONG, Google Inc., USA
Richard ZENS, Google Inc., USA
Chengqing ZONG, Institute of Automation, Chinese Academy of Sciences, China


  • Table of Contents

Intersecting Hierarchical and Phrase-Based Models of Translation: Formal Aspects and Algorithms
    Marc Dymetman and Nicola Cancedda ........................................ 1

A Systematic Comparison between Inversion Transduction Grammar and Linear Transduction Grammar for Word Alignment
    Markus Saers, Joakim Nivre and Dekai Wu ................................. 10

Source-side Syntactic Reordering Patterns with Functional Words for Improved Phrase-based SMT
    Jie Jiang, Jinhua Du and Andy Way ....................................... 19

Syntactic Constraints on Phrase Extraction for Phrase-Based Machine Translation
    Hailong Cao, Andrew Finch and Eiichiro Sumita ........................... 28

Phrase Based Decoding using a Discriminative Model
    Prasanth Kolachina, Sriram Venkatapathy, Srinivas Bangalore, Sudheer Kolachina and Avinesh PVS ... 34

Seeding Statistical Machine Translation with Translation Memory Output through Tree-Based Structural Alignment
    Ventsislav Zhechev and Josef van Genabith ............................... 43

Semantic vs. Syntactic vs. N-gram Structure for Machine Translation Evaluation
    Chi-kiu Lo and Dekai Wu ................................................. 52

Arabic morpho-syntactic feature disambiguation in a translation context
    Ines Turki Khemakhem, Salma Jamoussi and Abdelmajid Ben hamadou ......... 61

A Discriminative Approach for Dependency Based Statistical Machine Translation
    Sriram Venkatapathy, Rajeev Sangal, Aravind Joshi and Karthik Gali ...... 66

Improved Language Modeling for English-Persian Statistical Machine Translation
    Mahsa Mohaghegh, Abdolhossein Sarrafzadeh and Tom Moir .................. 75

Manipuri-English Bidirectional Statistical Machine Translation Systems using Morphology and Dependency Relations
    Thoudam Doren Singh and Sivaji Bandyopadhyay ............................ 83

A Discriminative Syntactic Model for Source Permutation via Tree Transduction
    Maxim Khalilov and Khalil Sima'an ....................................... 92

HMM Word-to-Phrase Alignment with Dependency Constraints
    Yanjun Ma and Andy Way ................................................. 101

New Parameterizations and Features for PSCFG-Based Machine Translation
    Andreas Zollmann and Stephan Vogel ..................................... 110

Deep Syntax Language Models and Statistical Machine Translation
    Yvette Graham and Josef van Genabith ................................... 118


  • Conference Program

    Saturday, August 28, 2010

9:00–10:40   Open Tutorial: Tree-Structured and Syntactic SMT
             Dekai Wu

10:40–11:05  Coffee break

11:05–12:05  Invited Keynote
             Martin Kay

12:05–12:25  Intersecting Hierarchical and Phrase-Based Models of Translation: Formal Aspects and Algorithms
             Marc Dymetman and Nicola Cancedda

12:25–12:55  A Systematic Comparison between Inversion Transduction Grammar and Linear Transduction Grammar for Word Alignment
             Markus Saers, Joakim Nivre and Dekai Wu

12:55–14:00  Lunch break

14:00–14:20  Source-side Syntactic Reordering Patterns with Functional Words for Improved Phrase-based SMT
             Jie Jiang, Jinhua Du and Andy Way

14:20–14:40  Syntactic Constraints on Phrase Extraction for Phrase-Based Machine Translation
             Hailong Cao, Andrew Finch and Eiichiro Sumita

14:40–15:00  Phrase Based Decoding using a Discriminative Model
             Prasanth Kolachina, Sriram Venkatapathy, Srinivas Bangalore, Sudheer Kolachina and Avinesh PVS

15:00–15:20  Seeding Statistical Machine Translation with Translation Memory Output through Tree-Based Structural Alignment
             Ventsislav Zhechev and Josef van Genabith

15:20–15:40  Semantic vs. Syntactic vs. N-gram Structure for Machine Translation Evaluation
             Chi-kiu Lo and Dekai Wu

15:40–16:05  Posters / Coffee break

             Arabic morpho-syntactic feature disambiguation in a translation context
             Ines Turki Khemakhem, Salma Jamoussi and Abdelmajid Ben hamadou


  • Saturday, August 28, 2010 (continued)

             A Discriminative Approach for Dependency Based Statistical Machine Translation
             Sriram Venkatapathy, Rajeev Sangal, Aravind Joshi and Karthik Gali

             Improved Language Modeling for English-Persian Statistical Machine Translation
             Mahsa Mohaghegh, Abdolhossein Sarrafzadeh and Tom Moir

             Manipuri-English Bidirectional Statistical Machine Translation Systems using Morphology and Dependency Relations
             Thoudam Doren Singh and Sivaji Bandyopadhyay

16:05–16:25  A Discriminative Syntactic Model for Source Permutation via Tree Transduction
             Maxim Khalilov and Khalil Sima'an

16:25–16:45  HMM Word-to-Phrase Alignment with Dependency Constraints
             Yanjun Ma and Andy Way

16:45–17:05  New Parameterizations and Features for PSCFG-Based Machine Translation
             Andreas Zollmann and Stephan Vogel

17:05–17:25  Deep Syntax Language Models and Statistical Machine Translation
             Yvette Graham and Josef van Genabith

17:25–17:40  Discussion


  • Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation, pages 1–9, COLING 2010, Beijing, August 2010.

Intersecting Hierarchical and Phrase-Based Models of Translation: Formal Aspects and Algorithms

Marc Dymetman and Nicola Cancedda
Xerox Research Centre Europe
{marc.dymetman,nicola.cancedda}@xrce.xerox.com

    Abstract

We address the problem of constructing hybrid translation systems by intersecting a Hiero-style hierarchical system with a phrase-based system, and present formal techniques for doing so. We model the phrase-based component by introducing a variant of weighted finite-state automata, called σ-automata, provide a self-contained description of a general algorithm for intersecting weighted synchronous context-free grammars with finite-state automata, and extend these constructs to σ-automata. We end by briefly discussing complexity properties of the presented algorithms.

    1 Introduction

Phrase-based (Och and Ney, 2004; Koehn et al., 2007) and hierarchical (Hiero-style) (Chiang, 2007) models are two mainstream approaches for building Statistical Machine Translation systems, with different characteristics. Phrase-based systems allow a direct capture of correspondences between surface-level lexical patterns, but at the cost of a simplistic handling of re-ordering; hierarchical systems are better able to constrain re-ordering, especially for distant language pairs, but tend to produce sparser rules and often lag behind phrase-based systems for less distant language pairs. It might therefore make sense to capitalize on the complementary advantages of the two approaches by combining them in some way.

This paper attempts to lay out the formal prerequisites for doing so, by developing techniques for intersecting a hierarchical model and a phrase-based model. One first difficulty has to be overcome: while hierarchical systems are based on the mathematically well-understood formalism of weighted synchronous CFGs, phrase-based systems do not correspond to any classical formal model; they are loosely connected to weighted finite-state transducers, but crucially go beyond these by allowing phrase re-orderings.

One might try to address this issue by limiting a priori the amount of re-ordering, in the spirit of (Kumar and Byrne, 2005), which would allow one to approximate a phrase-based model by a standard transducer, but this would introduce further issues. First, limiting the amount of reordering in the phrase-based model runs contrary to the underlying intuitions behind the intersection, namely that the hierarchical model should be mainly responsible for controlling re-ordering, and the phrase-based model mainly responsible for lexical choice. Second, the transducer resulting from the operation could be large. Third, even if we could represent the phrase-based model through a finite-state transducer, intersecting this transducer with the synchronous CFG would actually be intractable in the general case, as we indicate later.

We then take another route. For a fixed source sentence x, we show how to construct an automaton that represents all the (weighted) target sentences that can be produced by applying the phrase-based model to x. However, this "σ-automaton" is non-standard in the sense that each transition is decorated with a set of source sentence tokens and that the only valid paths are those that do not traverse two sets containing the same token (in other words, valid paths cannot "consume" the same source token twice).

The reason we are interested in σ-automata is the following. First, it is known that intersecting a synchronous grammar simultaneously with the source sentence x and a (standard) target automaton results in another synchronous grammar; we provide a self-contained description of an algorithm for performing this intersection, in the general weighted case, and where x is generalized to an arbitrary source automaton. Second, we extend this algorithm to σ-automata. The resulting weighted synchronous grammar represents, as in Hiero, the "parse forest" (or "hypergraph") of all weighted derivations (that is, of all translations) that can be built over x, but where the weights incorporate knowledge of the phrase-based component; it can therefore form the basis of a variety of dynamic programming or sampling algorithms (Chiang, 2007; Blunsom and Osborne, 2008), as is the case with standard Hiero-type representations. While in the worst case the intersected grammar can contain an exponential number of nonterminals, we argue that such combinatorial explosion will not happen in practice, and we also briefly indicate formal conditions under which it will not be allowed to happen.

2 Intersecting weighted synchronous CFGs with weighted automata

We assume that the notions of weighted finite-state automaton [W-FSA] and weighted synchronous grammar [W-SCFG] are known (for short descriptions see (Mohri et al., 1996) and (Chiang, 2006)), and we consider:

1. A W-SCFG G, with associated source grammar Gs (resp. target grammar Gt); the terminals of Gs (resp. Gt) vary over the source vocabulary Vs (resp. target vocabulary Vt).

2. A W-FSA As over the source vocabulary Vs, with initial state s# and final state s$.

3. A W-FSA At over the target vocabulary Vt, with initial state t# and final state t$.

The grammar G defines a weighted synchronous language LG over (Vs, Vt), the automaton As a weighted language Ls over Vs, and the automaton At a weighted language Lt over Vt. We then define the intersection language L′ between these three languages as the synchronous language denoted L′ = Ls ∩ LG ∩ Lt over (Vs, Vt) such that, for any pair (x, y) of a source and a target sentence, the weight L′(x, y) is defined by L′(x, y) ≡ Ls(x) · LG(x, y) · Lt(y), where Ls(x), LG(x, y), Lt(y) are the weights associated to each of the component languages.

It is natural to ask whether there exists a synchronous grammar G′ generating the language L′, which we will now show to be the case.¹

Our approach is inspired by the construction in (Bar-Hillel et al., 1961) for the intersection of a CFG and an FSA and by the observation in (Lang, 1994) relating this construction to parse forests, and also partially by (Satta, 2008), although, by contrast to that work, our construction (i) is done simultaneously rather than as the sequence of intersecting As with G and then the resulting grammar with At, and (ii) handles weighted formalisms rather than non-weighted ones.

We will describe the construction of G′ based on an example, from which the general construction follows easily. Consider a W-SCFG grammar G for translating between French and English, with initial nonterminal S, and containing among others the following rule:

N → A manque à B / B misses A : θ,    (1)

where the source and target right-hand sides are separated by a slash symbol, and where θ is a non-negative real weight (interpreted multiplicatively) associated with the rule.

Now let's consider the following "rule scheme":

    ^{t_0}_{s_0}N^{t_3}_{s_4} → ^{t_2}_{s_0}A^{t_3}_{s_1}  _{s_1}manque_{s_2}  _{s_2}à_{s_3}  ^{t_0}_{s_3}B^{t_1}_{s_4}  /  ^{t_0}_{s_3}B^{t_1}_{s_4}  ^{t_1}misses^{t_2}  ^{t_2}_{s_0}A^{t_3}_{s_1}    (2)

¹ We will actually only need the application of this result to the case where As is a "degenerate" automaton describing a single source sentence x, but the general construction is not harder to do than this special case, and the resulting format for G′ is well-suited to our needs below.


This scheme consists of an "indexed" version of the original rule, where the bottom indices s_i correspond to states of As ("source states"), and the top indices t_i to states of At ("target states"). The nonterminals are associated with two source and two target indices, and for the same nonterminal, these four indices have to match across the source and the target RHSs of the rule. As for the original terminals, they are replaced by "indexed terminals", where source (resp. target) terminals have two source (resp. target) indices. The source indices appear sequentially on the source RHS of the rule, in the pattern s_0, s_1, s_1, s_2, s_2, ..., s_{m−1}, s_m, with the nonterminal on the LHS receiving source indices s_0 and s_m, and similarly the target indices appear sequentially on the target RHS of the rule, in the pattern t_0, t_1, t_1, t_2, t_2, ..., t_{n−1}, t_n, with the nonterminal on the LHS receiving target indices t_0 and t_n. To clarify, the operation of associating indices to terminals and nonterminals can be decomposed into three steps:

    _{s_0}N_{s_4} → _{s_0}A_{s_1}  _{s_1}manque_{s_2}  _{s_2}à_{s_3}  _{s_3}B_{s_4}  /  B misses A

    ^{t_0}N^{t_3} → A manque à B  /  ^{t_0}B^{t_1}  ^{t_1}misses^{t_2}  ^{t_2}A^{t_3}

    ^{t_0}_{s_0}N^{t_3}_{s_4} → ^{t_2}_{s_0}A^{t_3}_{s_1}  _{s_1}manque_{s_2}  _{s_2}à_{s_3}  ^{t_0}_{s_3}B^{t_1}_{s_4}  /  ^{t_0}_{s_3}B^{t_1}_{s_4}  ^{t_1}misses^{t_2}  ^{t_2}_{s_0}A^{t_3}_{s_1}

where the first two steps correspond to handling the source and target indices separately, and the third step then assembles the indices in order to get the same four indices on the two copies of each RHS nonterminal. The rule scheme (2) now generates a family of rules, each of which corresponds to an arbitrary instantiation of the source and target indices to states of the source and target automata respectively. With every such rule instantiation, we associate a weight θ′ which is defined as:

    θ′ ≡ θ · ∏_{_{s_i}s-term_{s_{i+1}}} θ_{A_s}(s_i, s-term, s_{i+1}) · ∏_{^{t_j}t-term^{t_{j+1}}} θ_{A_t}(t_j, t-term, t_{j+1}),    (3)

where the first product is over the indexed source terminals _{s_i}s-term_{s_{i+1}}, and the second product over the indexed target terminals ^{t_j}t-term^{t_{j+1}}; θ_{A_s}(s_i, s-term, s_{i+1}) is the weight of the transition (s_i, s-term, s_{i+1}) according to As, and similarly for θ_{A_t}(t_j, t-term, t_{j+1}). In these products, it may happen that θ_{A_s}(s_i, s-term, s_{i+1}) is null (and similarly for At), and in such a case, the corresponding rule instantiation is considered not to be realized. Let us consider the multiset of all the weighted rule instantiations for (1) computed in this way, and for each rule in the collection, let us "forget" the indices associated to the terminals. In this way, we obtain a collection of weighted synchronous rules over the vocabularies Vs and Vt, but where each nonterminal is now indexed by four states.²

When we apply this procedure to all the rules of the grammar G, we obtain a new weighted synchronous CFG G′, with start symbol ^{t_#}_{s_#}S^{t_$}_{s_$}, for which we have the following Fact, whose proof we omit for lack of space.

Fact 1. The synchronous language LG′ associated with G′ is equal to L′ = Ls ∩ LG ∩ Lt.
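To make the instantiation step concrete, the following is a minimal, naive Python sketch of it; the data layout, function name, and rule encoding are our own illustrative choices, not the paper's. A W-FSA is encoded as a dict from (state, word, state) to weight, and each nonterminal occurrence carries a slot id that co-indexes its source and target copies. The sketch enumerates every assignment of automaton states to index positions, which is exponential in rule length; a practical implementation would instead add only productive instantiations bottom-up, as the text describes next, and would conflate duplicate rules by adding their weights (footnote 2).

```python
from itertools import product

def instantiate_rule(theta, lhs, src_rhs, tgt_rhs, A_s, A_t, src_states, tgt_states):
    """Yield indexed instantiations of one W-SCFG rule, weighted per Eq. (3).

    src_rhs/tgt_rhs items are ("term", word) or ("nt", label, slot); the same
    slot occurs once on each side. A_s/A_t map (state, word, state) -> weight.
    """
    m, n = len(src_rhs), len(tgt_rhs)
    for s in product(src_states, repeat=m + 1):       # source indices s_0..s_m
        for t in product(tgt_states, repeat=n + 1):   # target indices t_0..t_n
            w, spans = theta, {}
            for k, item in enumerate(src_rhs):
                if item[0] == "term":                 # source terminal weight
                    w *= A_s.get((s[k], item[1], s[k + 1]), 0.0)
                else:                                 # remember the source span
                    spans[item[2]] = [s[k], s[k + 1], None, None]
            for j, item in enumerate(tgt_rhs):
                if item[0] == "term":                 # target terminal weight
                    w *= A_t.get((t[j], item[1], t[j + 1]), 0.0)
                else:                                 # attach the target span
                    spans[item[2]][2:] = [t[j], t[j + 1]]
            if w > 0.0:                               # null weight: not realized
                yield (lhs, s[0], s[m], t[0], t[n]), spans, w

# The rule N -> A manque à B / B misses A of Eq. (1) would be encoded as:
src = [("nt", "A", 1), ("term", "manque"), ("term", "à"), ("nt", "B", 2)]
tgt = [("nt", "B", 2), ("term", "misses"), ("nt", "A", 1)]
```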

The grammar G′ that we have just constructed does fulfill the goal of representing the bilateral intersection that we were looking for, but it has a serious defect: most of its nonterminals are improductive, that is, they can never produce a bi-sentence. If a rule refers to such an improductive nonterminal, it can be eliminated from the grammar. This is the analogue for an SCFG of the classical operation of reduction for CFGs; while, conceptually, we could start from G′ and perform the reduction by deleting the many rules containing improductive nonterminals, it is equivalent but much more efficient to do the reverse, namely to incrementally add the productive nonterminals and rules of G′, starting from an initially empty set of rules and proceeding bottom-up from the terminals. We do not detail this process, which is relatively straightforward.³

² It is possible that the multiset obtained by this simplifying operation contains duplicates of certain rules (possibly with different weights), due to the non-determinism of the automata: for instance, two sequences such as ‘_{s_1}manque_{s_2} _{s_2}à_{s_3}’ and ‘_{s_1}manque_{s′_2} _{s′_2}à_{s_3}’ become indistinguishable after the operation. Rather than producing multiple instances of rules in this way, one can "conflate" them together and add their weights.
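The bottom-up reduction alluded to here can be sketched in a few lines; this fixpoint formulation is our own, under the assumption that terminals and weights play no role in deciding productivity:

```python
def reduce_grammar(rules):
    """rules: list of (lhs, rhs_nonterminals); a purely terminal rule has an
    empty rhs_nonterminals list. Keep only rules built from productive symbols."""
    productive, changed = set(), True
    while changed:                                    # fixpoint iteration
        changed = False
        for lhs, rhs_nts in rules:
            if lhs not in productive and all(n in productive for n in rhs_nts):
                productive.add(lhs)                   # all children productive
                changed = True
    return [(l, r) for l, r in rules
            if l in productive and all(n in productive for n in r)]
```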

A note on intersecting SCFGs with transducers  Another way to write Ls ∩ LG ∩ Lt is as the intersection (Ls × Lt) ∩ LG. (Ls × Lt) can be seen as a rational language (a language generated by a finite-state transducer) of an especially simple form over Vs × Vt. It is then natural to ask whether our previous construction can be generalized to the intersection of G with an arbitrary finite-state transducer. However, this is not the case. Deciding the emptiness problem for the intersection between two finite-state transducers is already undecidable, by reduction to Post's Correspondence Problem (Berstel, 1979, p. 90), and we have extended the proof of this fact to show that the intersection between a synchronous CFG and a finite-state transducer also has an undecidable emptiness problem (the proof relies on the fact that a finite-state transducer can be simulated by a synchronous grammar). A fortiori, this intersection cannot be represented through an (effectively constructed) synchronous CFG.

    3 Phrase-based models and σ-automata

    3.1 σ-automata: definition

Let Vs be a source vocabulary and Vt a target vocabulary. Let x = x_1, ..., x_M be a fixed sequence of words over the source vocabulary Vs. Let us denote by z a token in the sequence x, and by Z the set of the M tokens in x. A σ-automaton over x has the general form of a standard weighted automaton over the target vocabulary, but where the edges are also decorated with elements of P(Z), the powerset of Z (see Fig. 1). An edge in the σ-automaton between two states q and q′ then carries a label of the form (α, β), where α ∈ P(Z) and β ∈ Vt (note that here we do not allow β to be the empty string ε). A path from the initial state of the automaton to its final state is defined to be valid iff each token of x appears in exactly one label of the path, but not necessarily in the same order as in x. As usual, the output associated with the path is the ordered sequence of target labels on that path, and the weight of the path is the product of the weights on its edges.

³ This bottom-up process is analogous to chart-parsing, but here we have decomposed the construction into first building a semantics-preserving grammar and then reducing it, which we think is formally neater.
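As an illustration of the validity condition, here is a small Python sketch of a σ-automaton; the encoding is our own, not the paper's. Each edge carries a frozenset of source-token positions and one target word, and a depth-first enumeration rejects any extension that would consume a token twice. It assumes every cycle consumes at least one token, which holds for the phrase-based construction below.

```python
class SigmaAutomaton:
    """Edges: state -> list of (frozenset_of_token_ids, target_word, next_state, weight)."""

    def __init__(self, n_source_tokens):
        self.edges = {}
        self.n = n_source_tokens

    def valid_outputs(self, start, final):
        """Enumerate (target sentence, weight) over valid paths: each source
        token is consumed exactly once, in any order, and never twice."""
        stack = [(start, frozenset(), (), 1.0)]
        while stack:
            q, used, out, w = stack.pop()
            if q == final and len(used) == self.n:    # all tokens consumed once
                yield " ".join(out), w
            for toks, word, q2, w2 in self.edges.get(q, []):
                if used.isdisjoint(toks):             # never consume a token twice
                    stack.append((q2, used | toks, out + (word,), w * w2))
```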

σ-automata and phrase-based translation  A mainstream phrase-based translation system such as Moses (Koehn et al., 2007) can be accounted for in terms of σ-automata in the following way. To simplify exposition, we assume that the language model used is a bigram model, but any n-gram model can be accommodated. Then, given a source sentence x, decoding works by attempting to construct a sequence of phrase-pairs of the form (x̃_1, ỹ_1), ..., (x̃_k, ỹ_k) such that each x̃_i corresponds to a contiguous subsequence of tokens of x, and the x̃_i's do not overlap and completely cover x, but may appear in a different order than that of x; the output associated with the sequence is simply the concatenation of all the ỹ_i's in that sequence.⁴ The weight associated with the sequence of phrase-pairs is then the product (when we work with probabilities rather than log-probabilities) of the weight of each (x̃_{i+1}, ỹ_{i+1}) in the context of the previous (x̃_i, ỹ_i), which consists of the product of several elements: (i) the "out-of-context" weight of the phrase-pair (x̃_{i+1}, ỹ_{i+1}) as determined by its features in the phrase table, (ii) the language model probability of finding ỹ_{i+1} following ỹ_i,⁵ and (iii) the contextual weight of (x̃_{i+1}, ỹ_{i+1}) relative to (x̃_i, ỹ_i), corresponding to the distortion cost of "jumping" from the token sequence x̃_i to the token sequence x̃_{i+1} when these two sequences may not be consecutive in x.⁶

Such a model can be represented by a σ-automaton, where each phrase-pair (x̃, ỹ) — for x̃ a sequence of tokens in x and (x̃, ỹ) an entry in the global phrase table — is identified with a state of the automaton, and where the fact that the phrase-pair (x̃′, ỹ′) = ((x_1, ..., x_k), (y_1, ..., y_l)) follows (x̃, ỹ) in the decoding sequence is modeled by introducing l "internal" transitions with labels (σ, y_1), (∅, y_2), ..., (∅, y_l), where σ = {x_1, ..., x_k}, and where the first transition connects the state (x̃, ỹ) to some unique "internal state" q_1, the second transition the state q_1 to some unique internal state q_2, and the last transition q_{l−1} to the state (x̃′, ỹ′).⁷ Thus, a state (x̃′, ỹ′) essentially encodes the previous phrase-pair used during decoding, and it is easy to see that it is possible to account for the different weights associated with the phrase-based model by weights associated to the transitions of the σ-automaton.⁸

⁴ We assume here that the phrase-pairs (x̃_i, ỹ_i) are such that ỹ_i is not the empty string (this constraint could be removed by an adaptation of the ε-removal operation (Mohri, 2002) to σ-automata).

⁵ This is where the bigram assumption is relevant: for a trigram model, we may need to encode in the automaton not only the immediately preceding phrase-pair, but also the previous one, and so on for higher-order models. An alternative is to keep the n-gram language model outside the σ-automaton and intersect it later with the grammar G′ obtained in section 4, possibly using approximation techniques such as cube-pruning (Chiang, 2007).

⁶ Any distortion model — in particular "lexicalized re-ordering" — that only depends on comparing two consecutive phrase-pairs can be implemented in this way.

⁷ For simplicity, we have chosen to collect the set of all the source tokens {x_1, ..., x_k} on the first transition, but we could distribute it over the l transitions arbitrarily (keeping the subsets disjoint) without changing the semantics of what we do. This is because once we have entered one of the l internal transitions, we will always have to traverse the remaining internal transitions and collect the full set of source tokens.

⁸ By creating states such as ((x̃, ỹ), (x̃′, ỹ′)) that encode the two previous phrase-pairs used during decoding, it is possible in principle to account for a trigram language model, and similarly for higher-order LMs. This is similar to implementing n-gram language models by automata whose states encode the n−1 words previously generated.


Figure 1: On the top: a σ-automaton with two valid paths shown. Each box denotes a state corresponding to a phrase pair, while states internal to a phrase pair (such as tcl1 and tcl2) are not boxed. Above each transition we have indicated the corresponding target word, and below it the corresponding set of source tokens. We use a terminal symbol $ to denote the end of sentence both on the source and on the target. The solid path corresponds to the output these totally corrupt lawyers are finished, the dotted path to the output these brown avocadoes are cooked. Note that the source tokens are not necessarily consumed in the order given by the source, and that, for example, there exists a valid path generating these are totally corrupt lawyers finished and moving according to h → r → tcl1 → tcl2 → tcl → f. Note, however, that this does not mean that if a biphrase such as (marrons avocats, avocado chestnuts) existed in the phrase table, it would be applicable to the source sentence here: because the source words in this biphrase would not match the order of the source tokens in the sentence, the biphrase would not be included in the σ-automaton at all. On the bottom: the target W-FSA automaton At associated with the σ-automaton, where we are ignoring the source tokens (but keeping the same weights).

Example  Let us consider the following French source sentence x: ces avocats marrons sont cuits (an idiomatic expression for these totally corrupt lawyers are finished). Let's assume that the phrase table contains the following phrase pairs:

h:   (ces, these)
a:   (avocats, avocados)
b:   (marrons, brown)
tcl: (avocats marrons, totally corrupt lawyers)
r:   (sont, are)
k:   (cuits, cooked)
f:   (cuits, finished)

An illustration of the corresponding σ-automaton SA is shown at the top of Figure 1, with only a few transitions made explicit, and with no weights shown.⁹

⁹ Only two (valid) paths are shown. If we had shown the full σ-automaton, then the graph would have been "complete" in the sense that for any two box states B, B′, we would have shown a connection B → B′_1 → ... → B′_{k−1} → B′, where the B′_i are internal states, and k is the length of the target side of the biphrase B′.
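To connect this example to the construction above, here is a hedged sketch (our own helper, with illustrative names) of how one phrase-pair transition of the σ-automaton could be built: l internal transitions, with the full source-token set σ on the first one and the empty set on the rest.

```python
def add_phrase_transition(edges, prev_state, pair_id, src_positions, tgt_words, weight=1.0):
    """Connect prev_state to pair_id's state through len(tgt_words) transitions.
    edges: state -> list of (token_set, target_word, next_state, weight)."""
    sigma = frozenset(src_positions)
    chain = ([prev_state]
             + [(prev_state, pair_id, i) for i in range(1, len(tgt_words))]
             + [pair_id])
    for i, word in enumerate(tgt_words):
        toks = sigma if i == 0 else frozenset()       # collect sigma on first edge
        w = weight if i == 0 else 1.0                 # contextual weight up front
        edges.setdefault(chain[i], []).append((toks, word, chain[i + 1], w))

# E.g., biphrase tcl after h, with avocats and marrons as token positions 1, 2:
edges = {}
add_phrase_transition(edges, "h", "tcl", {1, 2}, ["totally", "corrupt", "lawyers"])
```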


4 Intersecting a synchronous grammar with a σ-automaton

Intersection of a W-SCFG with a σ-automaton  If SA is a σ-automaton over input x, with each valid path in SA we associate a weight in the same way as we do for a weighted automaton. For any target word sequence in V*_t we can then associate the sum of the weights of all valid paths outputting that sequence. The weighted language L_{SA,x} over Vt obtained in this way is called the language associated with SA. Let G be a W-SCFG over Vs, Vt, and let us denote by L_{G,x} the weighted language over Vs, Vt corresponding to the intersection {x} ∩ G ∩ V*_t, where {x} denotes the language giving weight 1 to x and weight 0 to other sequences in V*_s, and V*_t denotes the language giving weight 1 to all sequences in V*_t. Note that non-null bi-sentences in L_{G,x} have their source projection equal to x, and therefore L_{G,x} can be identified with a weighted language over Vt. The intersection of the languages L_{SA,x} and L_{G,x} is denoted by L_{SA,x} ∩ L_{G,x}.

Example  Let us consider the following W-SCFG (where again, weights are not explicitly shown, and where we use a terminal symbol $ to denote the end of a sentence, a technicality needed for making the grammar compatible with the SA automaton of Figure 1):

S  → NP VP $ / NP VP $
NP → ces N A / these A N
VP → sont A / are A
A  → marrons / brown
A  → marrons / totally corrupt
A  → cuits / cooked
A  → cuits / finished
N  → avocats / avocadoes
N  → avocats / lawyers

It is easy to see that, for instance, the sentences these brown avocadoes are cooked $, these brown avocadoes are finished $, and these totally corrupt lawyers are finished $ all belong to the intersection L_{SA,x} ∩ L_{G,x}, while the sentences these avocadoes brown are cooked $ and totally corrupt lawyers are finished these $ belong only to L_{SA,x}.

Building the intersection  We now describe how to build a W-SCFG that represents the intersection L_{SA,x} ∩ L_{G,x}. We base our explanations on the example just given.

A Relaxation of the Intersection  At the bottom of Figure 1, we show how we can associate an automaton At with the σ-automaton SA: we simply "forget" the source sides of the labels carried by the transitions, and retain all the weights. As before, note that we are only showing a subset of the transitions here.

All valid paths for SA map into valid paths for At (with the same weights), but the reverse is not true, because some valid At paths can correspond to traversals of SA that either consume the same source token several times or do not consume all source tokens. For instance, the sentence these brown avocadoes brown are $ belongs to the language of At, but cannot be produced by SA. Let's however consider the intersection {x} ∩ G ∩ At, where, with a slight abuse of notation, we have notated {x} the "degenerate" automaton representing the sentence x, namely the automaton (with weights on all transitions equal to 1):

Source automaton:

0 --ces--> 1 --avocats--> 2 --marrons--> 3 --sont--> 4 --cuits--> 5 --$--> 6

This is a relaxation of the true intersection, but one that we can represent through a W-SCFG, as we know from section 2.¹⁰
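For concreteness, this degenerate automaton {x} maps directly onto the (state, word, state) → weight encoding used in the instantiation sketch of section 2; a tiny illustrative helper (our own):

```python
def degenerate_automaton(words):
    """The 'degenerate' W-FSA for a sentence: states 0..M, all weights 1."""
    return {(i, w, i + 1): 1.0 for i, w in enumerate(words)}

A_s = degenerate_automaton(["ces", "avocats", "marrons", "sont", "cuits", "$"])
# e.g. A_s[(0, "ces", 1)] == 1.0; the final state is 6
```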

This being noted, we now move to the construction of the full intersection.

The full intersection  We discussed in section 2 how to modify a synchronous grammar rule in order to produce the indexed rule scheme (2) representing the bilateral intersection of the grammar with two automata. Let us redo that construction here, in the case of our example W-SCFG, of the target automaton represented at the bottom of Figure 1, and of the source automaton {x}.

¹⁰ Note that, in the case of our very simple example, any target string that belongs to this relaxed intersection (which consists of the eight sentences these {brown | totally corrupt} {avocadoes | lawyers} are {cooked | finished}) actually belongs to the full intersection, as none of these sentences corresponds to a path in SA that violates the token-consumption constraint. More generally, it may often be the case in practice that the W-SCFG, by itself, provides enough "control" of the possible target sentences to prevent generation of sentences that would violate the token-consumption constraints, so that there may be little difference in practice between performing the relaxed intersection {x} ∩ G ∩ At and performing the full intersection {x} ∩ G ∩ L_{SA,x}.

    The construction is then done in three steps:

    _{s_0}NP_{s_3} → _{s_0}ces_{s_1}  _{s_1}N_{s_2}  _{s_2}A_{s_3}  /  these A N

    ^{t_0}NP^{t_3} → ces N A  /  ^{t_0}these^{t_1}  ^{t_1}A^{t_2}  ^{t_2}N^{t_3}

    ^{t_0}_{s_0}NP^{t_3}_{s_3} → _{s_0}ces_{s_1}  ^{t_2}_{s_1}N^{t_3}_{s_2}  ^{t_1}_{s_2}A^{t_2}_{s_3}  /  ^{t_0}these^{t_1}  ^{t_1}_{s_2}A^{t_2}_{s_3}  ^{t_2}_{s_1}N^{t_3}_{s_2}

In order to adapt that construction to the case where we want the intersection to be with a σ-automaton, what we need to do is to further specialize the nonterminals. Rather than specializing a nonterminal X in the form ^{t}_{s}X^{t′}_{s′}, we specialize it in the form ^{t}_{s}X^{t′,σ}_{s′}, where σ represents a set of source tokens that corresponds to "collecting" the source tokens in the σ-automaton along a path connecting the states t and t′.¹¹

We then proceed to define a new rule scheme associated to our rule, which is obtained as before in three steps, as follows.

    _{s_0}NP_{s_3} → _{s_0}ces_{s_1}  _{s_1}N_{s_2}  _{s_2}A_{s_3}  /  these A N

    ^{t_0}NP^{t_3,σ_{03}} → ces N A  /  ^{t_0}these^{t_1,σ_{01}}  ^{t_1}A^{t_2,σ_{12}}  ^{t_2}N^{t_3,σ_{23}}

    ^{t_0}_{s_0}NP^{t_3,σ_{03}}_{s_3} → _{s_0}ces_{s_1}  ^{t_2}_{s_1}N^{t_3,σ_{23}}_{s_2}  ^{t_1}_{s_2}A^{t_2,σ_{12}}_{s_3}  /  ^{t_0}these^{t_1,σ_{01}}  ^{t_1}_{s_2}A^{t_2,σ_{12}}_{s_3}  ^{t_2}_{s_1}N^{t_3,σ_{23}}_{s_2}

The only difference with our previous technique is in the addition of the σ's to the top indices. Let us focus on the second step of the annotation process:

    ^{t_0}NP^{t_3,σ_{03}} → ces N A  /  ^{t_0}these^{t_1,σ_{01}}  ^{t_1}A^{t_2,σ_{12}}  ^{t_2}N^{t_3,σ_{23}}

¹¹ To avoid a possible confusion, it is important to note right away that σ is not necessarily related to the tokens appearing between the positions s and s′ in the source sentence (that is, between these states in the associated source automaton), but is defined solely in terms of the source tokens along the t, t′ path. See the example with "persons" and "people" below.

Conceptually, when instantiating this scheme, the t_i's may range over all possible states of the σ-automaton, and the σ_ij over all subsets of the source tokens, but under the following constraints: the RHS σ's (here σ_01, σ_12, σ_23) must be disjoint and their union must be equal to the σ on the LHS (here σ_03). Additionally, a σ associated with a target terminal (as σ_01 here) must be equal to the token set associated to the transition that this terminal realizes between σ-automaton states (here, this means that σ_01 must be equal to the token set {ces} associated with the transition between t_0, t_1 labelled with 'these'). If we perform all these instantiations, compute their weights according to equation (3), and finally remove the indices associated with terminals in the rules (by adding the weights of the rules only differing by the indices of terminals, as done previously), we obtain a very large "raw" grammar, but one for which one can prove a direct counterpart of Fact 1. Let us call, as before, G′ the raw W-SCFG obtained, its start symbol being ^{t_#}_{s_#}S^{t_$,σ_all}_{s_$}, with σ_all the set of all source tokens in x.

Fact 2. The synchronous language LG′ associated with G′ is equal to ({x}, L_{SA,x} ∩ L_{G,x}).
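The disjoint-union constraint on the σ-indices is easy to state in code; this checker is our own illustration (the additional condition that a target terminal's σ must equal the token set of the transition it realizes would be checked the same way against the automaton's edges):

```python
def sigma_constraint_ok(lhs_sigma, rhs_sigmas):
    """RHS token sets must be pairwise disjoint and exactly cover the LHS set."""
    union = set()
    for s in rhs_sigmas:
        if union & s:              # overlap: a source token consumed twice
            return False
        union |= s
    return union == lhs_sigma      # the disjoint union must equal the LHS sigma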

The grammar that is obtained this way, despite correctly representing the intersection, contains a lot of useless rules, this being due to the fact that many nonterminals cannot produce any output. The situation is wholly similar to the case of section 2, and the same bottom-up techniques can be used for activating nonterminals and rules.

The algorithm is illustrated in Figure 2, where we have shown the result of the process of activating in turn the nonterminals (abbreviated by) N1, A1, A2, NP1, VP1, S1. As a consequence of these activations, the original grammar rule NP → ces N A / these A N (for instance) becomes instantiated as the rule:

    ^{#}_{0}NP^{tcl,{ces,avocats,marrons}}_{3} → _{0}ces_{1}  ^{tcl_2}_{1}N^{tcl,∅}_{2}  ^{h}_{2}A^{tcl_2,{avocats,marrons}}_{3}  /  ^{#}these^{h,{ces}}  ^{h}_{2}A^{tcl_2,{avocats,marrons}}_{3}  ^{tcl_2}_{1}N^{tcl,∅}_{2}

Figure 2: Building the intersection. The bottom of the figure shows some active non-terminals associated with the source sequence, at the top these same non-terminals associated with a sequence of transitions in the σ-automaton, corresponding to the target sequence these totally corrupt lawyers are finished $. To avoid cluttering the drawing, we have used the abbreviations shown on the right. Note that while A1 only spans marrons in the bottom chart, it is actually decorated with the source token set {avocats, marrons}: such a "disconnect" between the views that the W-SCFG and the σ-automaton have of the source tokens is not ruled out.

that is, after removal of the indices on terminals:

    ^{#}_{0}NP^{tcl,{ces,avocats,marrons}}_{3} → ces  ^{tcl_2}_{1}N^{tcl,∅}_{2}  ^{h}_{2}A^{tcl_2,{avocats,marrons}}_{3}  /  these  ^{h}_{2}A^{tcl_2,{avocats,marrons}}_{3}  ^{tcl_2}_{1}N^{tcl,∅}_{2}

Note that while the nonterminal ^{tcl_2}_{1}N^{tcl,∅}_{2} by itself consumes no source token (it is associated with the empty token set), any actual use of this nonterminal (in this specific rule or possibly in some other rule using it) does require traversing the internal node tcl2 and therefore all the internal nodes "belonging" to the biphrase tcl (because otherwise the path from # to $ would be disconnected); in particular this involves consuming all the tokens on the source side of tcl, including 'avocats'.¹²

¹² In particular there is no risk that a derivation relative to the intersected grammar generates a target containing two instances of 'lawyers', one associated to the expansion of ^{tcl_2}_{1}N^{tcl,∅}_{2} and consuming no source token, and another one associated with a different nonterminal and consuming the source token 'avocats': this second instance would involve not traversing tcl1, which is impossible as soon as ^{tcl_2}_{1}N^{tcl,∅}_{2} is used.

Complexity considerations  The bilateral intersection that we defined between a W-SCFG and two W-FSAs in section 2 can be shown to be of polynomial complexity, in the sense that it takes polynomial time and space, relative to the sum of the sizes of the two automata and of the grammar, to construct the (reduced) intersected grammar G′, under the condition that the grammar right-hand sides have length bounded by a constant.¹³

The situation here is different, because the construction of the intersection can in principle introduce nonterminals indexed not only by states of the automata, but also by arbitrary subsets of source tokens, and this may lead in extreme cases to an exponential number of rules. Such problems, however, can only happen in situations where, in a nonterminal ^{t}_{s}X^{t′,σ}_{s′}, the set σ is allowed to contain tokens that are "unrelated" to the token set appearing between s and s′ in the source automaton. An illustration of such a situation is given by the following example. Suppose that the source sentence contains the two tokens personnes and gens between positions i, i+1 and j, j+1 respectively, with i and j far from each other; that the phrase table contains the two phrase pairs (personnes, persons) and (gens, people); but that the synchronous grammar only contains the two rules X → personnes/people and Y → gens/persons, with these phrases and rules exhausting the possibilities for translating gens and personnes. Then the intersected grammar will contain such nonterminals as ^{t}_{i}X^{t′,{gens}}_{i+1} and ^{r}_{j}Y^{r′,{personnes}}_{j+1}, where in the first case the token set {gens} in the first nonterminal is unrelated to the tokens appearing between i, i+1, and similarly in the second case.

¹³ If this condition is removed, and for the simpler case where the source (resp. target) automaton encodes a single sentence x (resp. y), (Satta and Peserico, 2005) have shown that the problem of deciding whether (x, y) is recognized by G is NP-hard relative to the sum of the sizes. A consequence is then that the grammar G′ cannot be constructed in polynomial time unless P = NP.

Without experimentation on real cases, it is impossible to say whether such phenomena would empirically lead to combinatorial explosion, or whether the synchronous grammar would sufficiently constrain the phrase-based component (whose re-ordering capabilities are responsible in fine for the potential NP-hardness of the translation process) to avoid it. Another possible approach is to prevent a priori a possible combinatorial explosion by adding formal constraints to the intersection mechanism. One such constraint is the following: disallow the introduction of ^{t}_{i}X^{t′,σ}_{j} when the symmetric difference between σ and the set of tokens between positions i and j in the source sentence has cardinality larger than a small constant. Such a constraint could be interpreted as keeping the SCFG and phrase-based components "in sync", and would be better adapted to the spirit of our approach than limiting the amount of re-ordering permitted to the phrase-based component, which would contradict the reason for using a hierarchical component in the first place.
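As a sketch of this "in sync" constraint (the threshold value here is our own placeholder, not one proposed in the paper):

```python
def in_sync(sigma, i, j, max_diff=2):
    """Allow nonterminal ^t_i X^{t',sigma}_j only if sigma stays close to the
    source span [i, j): bounded symmetric difference. max_diff is illustrative."""
    span = set(range(i, j))
    return len(sigma ^ span) <= max_diff
```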

    5 Conclusion

Intersecting hierarchical and phrase-based models of translation could allow us to capitalize on complementarities between the two approaches. Thus, one might train the hierarchical component on corpora represented at the part-of-speech level (or at a level where lexical units are abstracted into some kind of classes), while the phrase-based component would focus on the translation of lexical material. The present paper does not have the ambition to demonstrate that such an approach would improve translation performance, but only to provide some formal means for advancing towards that goal.

References

Bar-Hillel, Y., M. Perles, and E. Shamir. 1961. On formal properties of simple phrase structure grammars. Zeitschrift für Phonetik, Sprachwissenschaft und Kommunikationsforschung, 14:143–172.

Berstel, Jean. 1979. Transductions and Context-Free Languages. Teubner, Stuttgart.

Blunsom, P. and M. Osborne. 2008. Probabilistic inference for machine translation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 215–223. Association for Computational Linguistics.

Chiang, David. 2006. An introduction to synchronous grammars. www.isi.edu/~chiang/papers/synchtut.pdf, June.

Chiang, David. 2007. Hierarchical phrase-based translation. Computational Linguistics, 33:201–228.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL. The Association for Computer Linguistics.

Kumar, Shankar and William Byrne. 2005. Local phrase reordering models for statistical machine translation. In Proc. HLT/EMNLP.

Lang, Bernard. 1994. Recognition can be harder than parsing. Computational Intelligence, 10:486–494.

Mohri, Mehryar, Fernando Pereira, and Michael Riley. 1996. Weighted automata in text and speech processing. In ECAI-96 Workshop on Extended Finite State Models of Language.

Mohri, Mehryar. 2002. Generic epsilon-removal and input epsilon-normalization algorithms for weighted transducers. International Journal of Foundations of Computer Science, 13:129–143.

Och, Franz Josef and Hermann Ney. 2004. The alignment template approach to statistical machine translation. Computational Linguistics, 30(4):417–449.

Satta, Giorgio and Enoch Peserico. 2005. Some computational complexity results for synchronous context-free grammars. In HLT '05: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 803–810, Morristown, NJ, USA. Association for Computational Linguistics.

Satta, Giorgio. 2008. Translation algorithms by means of language intersection. Submitted. www.dei.unipd.it/~satta/publ/paper/inters.pdf.


  • Proceedings of SSST-4, Fourth Workshop on Syntax and Structure in Statistical Translation, pages 10–18, COLING 2010, Beijing, August 2010.

A Systematic Comparison between Inversion Transduction Grammar and Linear Transduction Grammar for Word Alignment

Markus Saers and Joakim Nivre
Dept. of Linguistics & Philology
Uppsala University
[email protected]

Dekai Wu
Human Language Technology Center
Dept. of Computer Science & Engineering
Hong Kong Univ. of Science & Technology (HKUST)
[email protected]

    Abstract

We present two contributions to grammar-driven translation. First, since both Inversion Transduction Grammars and Linear Inversion Transduction Grammars have been shown to produce better alignments than the standard word alignment tool, we investigate how the trade-off between speed and end-to-end translation quality extends to the choice of grammar formalism. Second, we prove that Linear Transduction Grammars (LTGs) generate the same transductions as Linear Inversion Transduction Grammars, and present a scheme for arriving at LTGs by bilingualizing Linear Grammars. We also present a method for obtaining Inversion Transduction Grammars from Linear (Inversion) Transduction Grammars, which can speed up grammar induction from parallel corpora dramatically.

    1 Introduction

In this paper we introduce Linear Transduction Grammars (LTGs), which are the bilingual case of Linear Grammars (LGs). We also show that LTGs are equal to Linear Inversion Transduction Grammars (Saers et al., 2010). To be able to induce transduction grammars directly from parallel corpora, an approximate search for parses is needed. The trade-off between speed and end-to-end translation quality is investigated and compared to Inversion Transduction Grammars (Wu, 1997) and the standard tool for word alignment, GIZA++ (Brown et al., 1993; Vogel et al., 1996; Och and Ney, 2003). A heuristic for converting stochastic bracketing LTGs into stochastic bracketing ITGs is presented, and fitted into the speed–quality trade-off.

In section 3 we give an overview of transduction grammars, introduce LTGs and show that they are equal to LITGs. In section 4 we give a short description of the rationale for the transduction grammar pruning used. In section 5 we describe a way of seeding a stochastic bracketing ITG with the rules and probabilities of a stochastic bracketing LTG. Section 6 describes the setup, and results are given in section 7. Finally, some conclusions are offered in section 8.

    2 Background

Any form of automatic translation that relies on generalizations of observed translations needs to align these translations on a sub-sentential level. The standard way of doing this is by aligning words, which works well for languages that use white space separators between words. The standard method is a combination of the family of IBM models (Brown et al., 1993) and Hidden Markov Models (Vogel et al., 1996). These methods all arrive at a function (A) from language 1 (F) to language 2 (E). By running the process in both directions, two functions can be estimated and then combined to form an alignment. The simplest of these combinations are intersection and union, but usually, the intersection is heuristically extended. Transduction grammars, on the other hand, impose a shared structure on the sentence pairs, thus forcing a consistent alignment in both directions. This method has proved successful in the settings in which it has been tried (Zhang et al., 2008; Saers and Wu, 2009; Haghighi et al., 2009; Saers et al., 2009; Saers et al., 2010). Most efforts focus on cutting down time complexity so that larger data sets than toy examples can be processed.

    3 Transduction Grammars

Transduction grammars were first introduced in Lewis and Stearns (1968), and further developed in Aho and Ullman (1972). The original notation called for regular CFG rules in language F with rephrased E productions, either in curly brackets or comma-separated. The bilingual version of CFGs is called Syntax-Directed Transduction Grammars (SDTGs). To differentiate identical nonterminal symbols, indices were used (the bag of nonterminals for the two productions is equal by definition).

A → B(1) a B(2) {x B(1) B(2)}
  = A → B(1) a B(2), x B(1) B(2)

The semantics of the rules is that one nonterminal rewrites into a bag of nonterminals that is distributed independently in the two languages, and interspersed with any number of terminal symbols in the respective languages. As with CFGs, the terminal symbols can be factored out into preterminals, with the added twist that they are shared between the two languages, since preterminals are formally nonterminals. The above rule can thus be rephrased as

A → B(1) Xa/x B(2), Xa/x B(1) B(2)
Xa/x → a, x

In this way, rules producing nonterminals and rules producing terminals can be separated. Since only nonterminals are allowed to move, their movement can be represented as the original sequence of nonterminals and a permutation vector as follows:

A → B Xa/x B ; 1, 0, 2
Xa/x → a, x

To keep the reordering as monotone as possible, the terminals a and x can be produced separately, but doing so eliminates any possibility of parameterizing their lexical relationship. Instead, the individual terminals are paired up with the empty string (ε).

A → Xx B Xa B ; 0, 1, 2, 3
Xa → a, ε
Xx → ε, x

Lexical rules involving the empty string are referred to as singletons. Whenever a preterminal is used to pair up two terminal symbols, we refer to that pair of terminals as a biterminal, which will be written as e/f.

Any SDTG can be rephrased to contain permuted nonterminal productions and biterminal productions only, and we will call this the normal form of SDTGs. Note that it is not possible to produce a two-normal form for SDTGs, as there are some rules that are not binarizable (Wu, 1997; Huang et al., 2009). This is an important point to make, since efficient parsing for CFGs is based either on restricting parsing to handle only binary grammars (Cocke, 1969; Kasami, 1965; Younger, 1967) or on on-the-fly binarization (Earley, 1970). When translating with a grammar, parsing only has to be done in F, which is binarizable (since it is a CFG), and can therefore be computed in polynomial time (O(n^3)). Once there is a parse tree for F, the corresponding tree for E can be easily constructed. When inducing a grammar from examples, however, biparsing (finding an analysis that is consistent across a sentence pair) is needed. The time complexity for biparsing with SDTGs is O(n^(2n+2)), which is clearly intractable.

Inversion Transduction Grammars or ITGs (Wu, 1997) are transduction grammars that have a two-normal form, thus guaranteeing binarizability. Defining the rank of a rule as the number of nonterminals in its production, and the rank of a grammar as the highest rank of any rule in the rule set, ITGs are a) any SDTG of rank two, b) any SDTG of rank three, or c) any SDTG in which no rule has a permutation vector other than the identity permutation or the inversion permutation. It follows from this definition that ITGs have a two-normal form, which is usually expressed as SDTG rules with brackets around the production to distinguish the different kinds of rules from each other.

A → B C ; 0, 1  =  A → [ B C ]
A → B C ; 1, 0  =  A → 〈 B C 〉
A → e/f         =  A → e/f
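The effect of the two bracket types can be spelled out in a few lines of Python (a sketch under our own encoding, not from the paper):

    # A sketch of what straight '[]' and inverted '<>' ITG rules mean:
    # both share the E-side order; they differ only on the F side.

    def f_side_order(rule_type, b, c):
        """Return the F-side order of the two constituents b and c."""
        if rule_type == '[]':    # straight: same order in both languages
            return (b, c)
        if rule_type == '<>':    # inverted: reversed order in language F
            return (c, b)
        raise ValueError('unknown rule type: %r' % rule_type)

    assert f_side_order('[]', 'B', 'C') == ('B', 'C')   # A -> [ B C ]
    assert f_side_order('<>', 'B', 'C') == ('C', 'B')   # A -> < B C >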

By guaranteeing binarizability, biparsing time complexity becomes O(n^6).

There is an even more restricted version of SDTGs called Simple Transduction Grammars (STGs), where no permutation at all is allowed; these can also biparse a sentence pair in O(n^6) time.

A Linear Transduction Grammar (LTG) is a bilingual version of a Linear Grammar (LG).

    Definition 1. An LG in normal form is a tuple

G_L = 〈N, Σ, R, S〉

where N is a finite set of nonterminal symbols, Σ is a finite set of terminal symbols, R is a finite set of rules and S ∈ N is the designated start symbol. The rule set is constrained so that

R ⊆ N × ((Σ ∪ {ε}) N (Σ ∪ {ε}) ∪ {ε})

where ε is the empty string.

To bilingualize a linear grammar, we take the same approach as when a finite-state automaton is bilingualized into a finite-state transducer: we replace all terminal symbols with biterminal symbols.

    Definition 2. An LTG in normal form is a tuple

TG_L = 〈N, Σ, Δ, R, S〉

where N is a finite set of nonterminal symbols, Σ is a finite set of terminal symbols in language E, Δ is a finite set of terminal symbols in language F, R is a finite set of linear transduction rules and S ∈ N is the designated start symbol. The rule set is constrained so that

R ⊆ N × (Ψ N Ψ ∪ {〈ε, ε〉})

where Ψ = (Σ ∪ {ε}) × (Δ ∪ {ε}) is the set of biterminals and ε is the empty string.

Graphically, we will represent LTG rules as production rules with biterminals:

〈A, 〈x, p〉 B 〈y, q〉〉 = A → x/p B y/q
〈A, 〈ε, ε〉〉 = A → ε/ε

Like STGs, LTGs do not allow any reordering and are monotone, but because they are linear, this has no impact on expressiveness, as we shall see later.

Linear Inversion Transduction Grammars (LITGs) were introduced in Saers et al. (2010), and represent ITGs that are allowed to have at most one nonterminal symbol in each production. They are attractive because they can biparse a sentence pair in O(n^4) time, which can be further reduced to linear time by severely pruning the search space. This makes them tractable for, and a viable way to induce transduction grammars from, large parallel corpora.

    Definition 3. An LITG in normal form is a tuple

TG_LI = 〈N, Σ, Δ, R, S〉

where N is a finite set of nonterminal symbols, Σ is a finite set of terminal symbols from language E, Δ is a finite set of terminal symbols from language F, R is a set of rules and S ∈ N is the designated start symbol. The rule set is constrained so that

R ⊆ N × {[], 〈〉} × (Ψ N ∪ N Ψ ∪ {〈ε, ε〉})

where [] represents the identity permutation, 〈〉 represents the inversion permutation, Ψ = (Σ ∪ {ε}) × (Δ ∪ {ε}) is the set of (possibly empty) biterminals, and ε is the empty string.

Graphically, a rule will be represented as an ITG rule:

〈A, [], B 〈e, f〉〉 = A → [ B e/f ]
〈A, 〈〉, 〈e, f〉 B〉 = A → 〈 e/f B 〉
〈A, [], 〈ε, ε〉〉 = A → ε/ε

As with ITGs, productions containing only biterminals will be represented without their permutation, as any such rule can be trivially rewritten in inverted or identity form.


Definition 4. An ε-free LITG is an LITG where no rule may rewrite one nonterminal into another nonterminal only. Formally, the rule set is constrained so that

R ∩ (N × {[], 〈〉} × ({〈ε, ε〉} N ∪ N {〈ε, ε〉})) = ∅

The LITG presented in Saers et al. (2010) is thus an ε-free LITG in normal form, since it has the following thirteen rule forms (of which 8 are meaningful, 1 is only used to terminate generation, and 4 are redundant):

A → [ e/f B ]
A → 〈 e/f B 〉
A → [ B e/f ]
A → 〈 B e/f 〉
A → [ e/ε B ]  |  A → 〈 e/ε B 〉
A → [ B e/ε ]  |  A → 〈 B e/ε 〉
A → [ ε/f B ]  |  A → 〈 B ε/f 〉
A → [ B ε/f ]  |  A → 〈 ε/f B 〉
A → ε/ε

All the singleton rules can be expressed either in straight or in inverted form, but the results of applying the two variants are the same.

Lemma 1. Any LITG in normal form can be expressed as an LTG in normal form.

Proof. The above LITG can be rewritten in LTG form as follows:

A → [ e/f B ] = A → e/f B
A → 〈 e/f B 〉 = A → e/ε B ε/f
A → [ B e/f ] = A → B e/f
A → 〈 B e/f 〉 = A → ε/f B e/ε
A → [ e/ε B ] = A → e/ε B
A → [ B e/ε ] = A → B e/ε
A → [ ε/f B ] = A → ε/f B
A → [ B ε/f ] = A → B ε/f
A → ε/ε = A → ε/ε

To account for all LITGs in normal form, the following two non-ε-free rules also need to be handled:

A → [ B ] = A → B
A → 〈 B 〉 = A → B
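The rewrite table of Lemma 1 is mechanical enough to state as code. The following Python sketch (with an assumed rule encoding, not from the paper) maps a lexical LITG normal-form rule to the pair of biterminals surrounding the nonterminal in the corresponding LTG rule A → x/p B y/q:

    EPS = ''  # the empty string stands in for epsilon

    def litg_to_ltg(rule_type, biterminal, nonterminal_first):
        """Map an LITG normal-form rule to the LTG rule A -> x/p B y/q,
        returned as the biterminals (x, p) and (y, q) around the
        nonterminal. rule_type is '[]' or '<>'; biterminal is (e, f);
        nonterminal_first is True for A -> [ B e/f ]-shaped rules."""
        e, f = biterminal
        if rule_type == '[]':
            # straight: the biterminal stays on its own side of B
            if nonterminal_first:
                return ((EPS, EPS), (e, f))   # A -> [ B e/f ] = A -> B e/f
            return ((e, f), (EPS, EPS))       # A -> [ e/f B ] = A -> e/f B
        # inverted: the biterminal is split into two singletons
        if nonterminal_first:
            return ((EPS, f), (e, EPS))       # A -> < B e/f > = A -> eps/f B e/eps
        return ((e, EPS), (EPS, f))           # A -> < e/f B > = A -> e/eps B eps/f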

Lemma 2. Any LTG in normal form can be expressed as an LITG in normal form.

Proof. An LTG in normal form has two rule forms, which can be rewritten in LITG form, either as straight or as inverted rules, as follows:

A → x/p B y/q = A → [ x/p B̄ ]
                B̄ → [ B y/q ]

              = A → 〈 x/q B̄ 〉
                B̄ → 〈 B y/p 〉

A → ε/ε = A → ε/ε
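Conversely, the Lemma 2 construction introduces one fresh nonterminal per rule; here is a Python sketch (same assumed encoding as above) of the straight variant:

    def ltg_to_litg(lhs, pre, b, post, fresh):
        """Split the LTG rule lhs -> pre B post (pre = x/p, post = y/q
        biterminals) into two straight LITG normal-form rules via a fresh
        nonterminal (the B-bar of the proof). Rules are encoded as
        (lhs, type, left_child, right_child)."""
        return [
            (lhs,   '[]', pre, fresh),   # A     -> [ x/p B-bar ]
            (fresh, '[]', b,   post),    # B-bar -> [ B y/q ]
        ]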

Theorem 1. LTGs in normal form and LITGs in normal form express the same class of transductions.

Proof. Follows from Lemmas 1 and 2.

By Theorem 1, everything concerning LTGs is also applicable to LITGs; an LTG can be expressed in LITG form when convenient, and vice versa.

    4 Pruning the Alignment Space

The alignment space for a transduction grammar is the combination of the parse spaces of the two sentences. Let e be the E sentence and f the F sentence. The parse spaces are O(|e|^2) and O(|f|^2) respectively, and the combination of these spaces is O(|e|^2 × |f|^2), or O(n^4) if we assume n to be proportional to the sentence lengths. In the case of LTGs, this space is searched linearly, giving time complexity O(n^4); in the case of ITGs there is branching within both parse spaces, adding an order of magnitude each, for a total time complexity of O(n^6). There is, in other words, a tight connection between the alignment space and the time complexity of the biparsing algorithm. Furthermore, most of this alignment space is clearly useless. Consider the case where the entire F sentence is deleted and the entire E sentence is simply inserted. Although this may be allowed by the grammar, it should have negligible probability (since it is clearly a translation strategy that generalizes poorly), and can, for all practical purposes, be ignored.


Language pair     Bisentences   Tokens
Spanish–English   108,073       1,466,132
French–English    95,990        1,340,718
German–English    115,323       1,602,781

Table 1: Size of training data.

Saers et al. (2009) present a scheme for pruning away most of the points in the alignment space. Parse items are binned according to coverage (the total number of words covered), and each bin is restricted to carry a maximum of b items. Any items that do not fit in the bins are excluded from further analysis. To decide which items to keep, inside probability is used. This pruning scheme effectively linearizes the alignment space, as it will be of size O(nb), regardless of which type of grammar is used. An ITG can thus be biparsed in cubic time, and an LTG in linear time.
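A minimal Python sketch of this pruning scheme, assuming parse items expose a coverage count and an inside probability (hypothetical attribute names, not from the papers cited):

    import heapq
    from collections import defaultdict

    def prune(items, b):
        """Bin parse items by coverage (total words covered) and keep at
        most b items per bin, ranked by inside probability; everything
        else is excluded from further analysis."""
        bins = defaultdict(list)
        for item in items:
            bins[item.coverage].append(item)
        kept = []
        for bucket in bins.values():
            kept.extend(heapq.nlargest(b, bucket,
                                       key=lambda it: it.inside_prob))
        return kept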

5 Seeding an ITG with an LTG

Since LTGs are a subclass of ITGs, it is possible to convert an LTG to an ITG. This could save a lot of time, since LTGs are much faster to induce from corpora than ITGs.

Converting a BLTG to a BITG is fairly straightforward. Consider the BLTG rule

    X → [ e/f X ]

To convert it to a BITG in two-normal form, the biterminal has to be factored out. Replacing the biterminal with a temporary symbol X̄, and introducing a rule that rewrites this temporary symbol to the replaced biterminal, produces two rules:

X → [ X̄ X ]
X̄ → e/f

This is no longer a bracketing grammar, since there are two nonterminals, but equating X̄ with X restores this property. An analogous procedure can be applied in the case where the nonterminal comes before the biterminal, as well as for the inverting cases.
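In code, the structural part of the conversion is a one-step rewrite; the sketch below (our own rule encoding) covers the case discussed above, with the temporary symbol already equated with X:

    def factor_biterminal(rule_type, biterminal):
        """Factor the biterminal out of a bracketing LTG rule such as
        X -> [ e/f X ], yielding the two-normal-form BITG rules
        X -> [ X X ] and X -> e/f (the temporary X-bar is equated
        with X to keep the grammar bracketing)."""
        structural = ('X', rule_type, ('X', 'X'))  # X -> [ X X ] or X -> < X X >
        lexical = ('X', biterminal)                # X -> e/f
        return structural, lexical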

When converting stochastic LTGs, the probability mass of each SBLTG rule has to be distributed over two SBITG rules. The fact that the LTG rule X → ε/ε lacks a correspondence in ITGs has to be weighted in as well. In this paper we took the maximum-entropy approach and distributed the probability mass uniformly. This means defining the probability mass function p′ of the new SBITG from the probability mass function p of the original SBLTG such that:

p′(X → [ X X ]) = Σ_{e/f} [ √( p(X → [ e/f X ]) / (1 − p(X → ε/ε)) ) + √( p(X → [ X e/f ]) / (1 − p(X → ε/ε)) ) ]

p′(X → 〈 X X 〉) = Σ_{e/f} [ √( p(X → 〈 e/f X 〉) / (1 − p(X → ε/ε)) ) + √( p(X → 〈 X e/f 〉) / (1 − p(X → ε/ε)) ) ]

p′(X → e/f) = √( p(X → [ e/f X ]) / (1 − p(X → ε/ε)) ) + √( p(X → [ X e/f ]) / (1 − p(X → ε/ε)) ) + √( p(X → 〈 e/f X 〉) / (1 − p(X → ε/ε)) ) + √( p(X → 〈 X e/f 〉) / (1 − p(X → ε/ε)) )
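The redistribution can be computed directly from the SBLTG's rule table. Below is a Python sketch under an assumed encoding (ours, not the authors'): lexical rules are keyed by (type, position, biterminal), where type is '[]' or '<>' and position says whether the biterminal precedes or follows the nonterminal, and the rule X → ε/ε is keyed by 'eps'.

    from math import sqrt
    from collections import defaultdict

    def seed_sbitg(p):
        """Turn SBLTG rule probabilities p into SBITG probabilities p'
        via the uniform (maximum-entropy) split above: each LTG rule
        contributes sqrt(p / (1 - p_eps)) to one structural and one
        lexical ITG rule, so their product recovers p / (1 - p_eps)."""
        z = 1.0 - p.get('eps', 0.0)   # mass left after dropping X -> eps/eps
        p_new = defaultdict(float)
        for rule, prob in p.items():
            if rule == 'eps':
                continue
            rtype, position, biterminal = rule
            share = sqrt(prob / z)
            p_new[('struct', rtype)] += share      # X -> [ X X ] or X -> < X X >
            p_new[('lex', biterminal)] += share    # X -> e/f
        return dict(p_new)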

    6 Setup

The aim of this paper is to compare the alignments from SBITGs and SBLTGs to those from GIZA++, and to study the impact of pruning on efficiency and translation quality. Initial grammars are estimated by counting cooccurrences in the training corpus, after which expectation-maximization (EM) is used to refine the initial estimate. At the last iteration, the one-best parse of each sentence pair is taken as the word alignment of that sentence pair.
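Schematically, the training procedure looks as follows; this is a sketch in which the helper functions are hypothetical placeholders for the components described above, not an actual toolkit API:

    def train_word_aligner(corpus, init_grammar, em_step, one_best_biparse,
                           iterations=10):
        """Estimate an initial grammar from cooccurrence counts, refine it
        with EM, and read word alignments off the final one-best parses."""
        grammar = init_grammar(corpus)             # cooccurrence-based estimate
        for _ in range(iterations):
            grammar = em_step(grammar, corpus)     # one EM iteration
        return [one_best_biparse(grammar, e, f)    # alignment per sentence pair
                for e, f in corpus]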

In order to keep the experiments comparable, relatively small corpora were used. If larger corpora were used, it would not be possible to get any results for unpruned SBITGs because of the prohibitive time complexity. The Europarl corpus (Koehn, 2005) was used as a starting point, and all sentence pairs in which one of the sentences was longer than 10 tokens were filtered out (see Table 1).


Figure 1: Trade-offs between translation quality (as measured by BLEU) and biparsing time (in seconds, plotted on a logarithmic scale) for SBLTGs, SBITGs and the combination.

                                  Beam size
System      1        10       25       50       75       100      ∞

BLEU
SBITG       0.1234   0.2608   0.2655   0.2653   0.2661   0.2671   0.2663
SBLTG       0.2574   0.2645   0.2631   0.2624   0.2625   0.2633   0.2628
GIZA++      0.2597   0.2597   0.2597   0.2597   0.2597   0.2597   0.2597

NIST
SBITG       3.9705   6.6439   6.7312   6.7101   6.7329   6.7445   6.6793
SBLTG       6.6023   6.6800   6.6657   6.6637   6.6714   6.6863   6.6765
GIZA++      6.6464   6.6464   6.6464   6.6464   6.6464   6.6464   6.6464

Training times
SBITG       03:10    17:00    38:00    1:20:00  2:00:00  2:40:00  3:20:00
SBLTG       35       1:49     3:40     7:33     9:44     12:13    11:59

Table 2: Results for the Spanish–English translation task.

The GIZA++ system was built according to the instructions for creating a baseline system for the Fifth Workshop on Statistical Machine Translation (WMT'10),¹ except that the above corpora were used instead of those supplied by the workshop. This includes word alignment with GIZA++, a 5-gram language model built with SRILM (Stolcke, 2002) and parameter tuning with MERT (Och, 2003). To carry out the actual translations, Moses (Koehn et al., 2007) was used. The SBITG and SBLTG systems were built in exactly the same way, except that the alignments from GIZA++ were replaced by those from the respective grammars.

In addition to trying out exhaustive biparsing for SBITGs and SBLTGs on three different translation tasks, several different levels of pruning were tried (1, 10, 25, 50, 75 and 100). We also used the grammar induced from SBLTGs with a beam size of 25 to seed SBITGs (see section 5), which were then run for an additional iteration of EM, also with beam size 25.

¹ http://www.statmt.org/wmt10/

All systems are evaluated with BLEU (Papineni et al., 2002) and NIST (Doddington, 2002).

    7 Results

The results for the three translation tasks are presented in Tables 2, 3 and 4. It is interesting to note that the trends they portray are quite similar. When the beam is very narrow, GIZA++ is better, but already at beam size 10, both transduction grammars are superior.


                                  Beam size
System      1        10       25       50       75       100      ∞

BLEU
SBITG       0.1268   0.2632   0.2654   0.2669   0.2668   0.2655   0.2663
SBLTG       0.2600   0.2638   0.2651   0.2668   0.2672   0.2662   0.2649
GIZA++      0.2603   0.2603   0.2603   0.2603   0.2603   0.2603   0.2603

NIST
SBITG       4.0849   6.7136   6.7913   6.8065   6.8068   6.8088   6.8151
SBLTG       6.6814   6.7608   6.7656   6.7992   6.8020   6.7925   6.7784
GIZA++      6.6907   6.6907   6.6907   6.6907   6.6907   6.6907   6.6907

Training times
SBITG       03:25    17:00    42:00    1:25:00  2:10:00  2:45:00  3:10:00
SBLTG       31       1:41     3:25     7:06     9:35     13:56    10:52

Table 3: Results for the French–English translation task.

                                  Beam size
System      1        10       25       50       75       100      ∞

BLEU
SBITG       0.0926   0.2050   0.2091   0.2090   0.2091   0.2094   0.2113
SBLTG       0.2015   0.2067   0.2066   0.2073   0.2080   0.2066   0.2088
GIZA++      0.2059   0.2059   0.2059   0.2059   0.2059   0.2059   0.2059

NIST
SBITG       3.4297   5.8743   5.9292   5.8947   5.8955   5.9086   5.9380
SBLTG       5.7799   5.8819   5.8882   5.8963   5.9252   5.8757   5.9311
GIZA++      5.8668   5.8668   5.8668   5.8668   5.8668   5.8668   5.8668

Training times
SBITG       03:20    17:00    41:00    1:25:00  2:10:00  2:45:00  3:40:00
SBLTG       38       1:58     4:52     8:08     11:42    16:05    13:32

Table 4: Results for the German–English translation task.

Consistent with Saers et al. (2009), SBITG shows a sharp rise in quality going from beam size 1 to 10, then a gentle slope up to beam size 25, after which it levels out. SBLTG, on the other hand, starts out at a respectable level and climbs a gentle slope from beam size 1 to 10, after which it levels out. This is an interesting observation, as it suggests that SBLTG reaches its optimum with a lower beam size (although that optimum is lower than that of SBITG). The trade-off between quality and time can now be extended beyond beam size to include grammar choice. In Figure 1, run times are plotted against BLEU scores to illustrate this trade-off.

It is clear that SBLTGs are indeed much faster than SBITGs; the only exception is when SBITGs are run with b = 1, but then the BLEU score is so low that it is not worth considering.

The times may seem inconsistent between b = 100 and b = ∞ for SBLTG, but the extra time for the tighter beam is due to beam management, which the exhaustive search does not need.

In Table 5 we compare the pure approaches to one where an LTG was trained for 10 iterations of EM and then used to seed an SBITG (see section 5), which was then trained for one further iteration of EM.


Translation task   System   BLEU     NIST     Total time
Spanish–English    SBLTG    0.2631   6.6657   36:40
                   SBITG    0.2655   6.7312   6:20:00
                   Both     0.2660   6.7124   1:14:40
French–English     SBLTG    0.2651   6.7656   34:10
                   SBITG    0.2654   6.7913   7:00:00
                   Both     0.2625   6.7609   1:16:10
German–English     SBLTG    0.2066   5.8882   48:52
                   SBITG    0.2091   5.9292   6:50:00
                   Both     0.2095   5.9224   1:29:40

Table 5: Results for seeding an SBITG with an SBLTG (Both) compared to the pure approaches. Total time refers to 10 iterations of EM training for SBITG and SBLTG respectively, and 10 iterations of SBLTG plus one iteration of SBITG training for the combined system.

Although the differences are fairly small, German–English and Spanish–English seem to reach the level of SBITG, whereas French–English is actually hurt. The big difference is in time, since the combined system needs about a fifth of the time of the SBITG-based system. This phenomenon needs to be examined more thoroughly.

It is also worth noting that GIZA++ was beaten by an aligner that used less than 20 minutes (less than 2 minutes per iteration and at most 10 iterations) to align the corpus.

    8 Conclusions

In this paper we have introduced the bilingual version of linear grammars, Linear Transduction Grammars, and found that they generate the same class of transductions as Linear Inversion Transduction Grammars. We have also compared Stochastic Bracketing versions of ITGs and LTGs to GIZA++ on three word alignment tasks. The efficiency issues with transduction grammars have been addressed by pruning, and the conclusion is that there is a trade-off between run time and translation quality. Part of the trade-off is choosing which grammar framework to use, as LTGs are faster but not as good as ITGs. It also seems possible to take a short-cut in this trade-off by starting out with an LTG and converting it to an ITG. We have also shown that it is possible to beat the translation quality of GIZA++ with a quite fast transduction grammar.

    Acknowledgments

This work was funded by the Swedish National Graduate School of Language Technology (GSLT), the Defense Advanced Research Projects Agency (DARPA) under GALE Contracts No. HR0011-06-C-0022 and No. HR0011-06-C-0023, and the Hong Kong Research Grants Council (RGC) under research grants GRF621008, DAG03/04.EG09, RGC6256/00E, and RGC6083/99E. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Defense Advanced Research Projects Agency. The computations were performed on UPPMAX resources under project p2007020.

    References

Aho, Alfred V. and Jeffrey D. Ullman. 1972. The Theory of Parsing, Translation, and Compiling. Prentice-Hall, Inc., Upper Saddle River, NJ.

Brown, Peter F., Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. 1993. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311.

Cocke, John. 1969. Programming languages and their compilers: Preliminary notes. Courant Institute of Mathematical Sciences, New York University.

Doddington, George. 2002. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Human Language Technology Conference (HLT-2002), San Diego, California.

Earley, Jay. 1970. An efficient context-free parsing algorithm. Communications of the Association for Computing Machinery, 13(2):94–102.

Haghighi, Aria, John Blitzer, John DeNero, and Dan Klein. 2009. Better word alignments with supervised ITG models. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 923–931, Suntec, Singapore, August.

Huang, Liang, Hao Zhang, Daniel Gildea, and Kevin Knight. 2009. Binarization of synchronous context-free grammars. Computational Linguistics, 35(4):559–595.

Kasami, Tadao. 1965. An efficient recognition and syntax analysis algorithm for context-free languages. Technical Report AFCRL-65-00143, Air Force Cambridge Research Laboratory.

Koehn, Philipp, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic, June.

Koehn, Philipp. 2005. Europarl: A parallel corpus for statistical machine translation. In Machine Translation Summit X, Phuket, Thailand, September.

Lewis, Philip M. and Richard E. Stearns. 1968. Syntax-directed transduction. Journal of the Association for Computing Machinery, 15(3):465–488.

Och, Franz Josef and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.

Och, Franz Josef. 2003. Minimum error rate training in statistical machine translation. In 41st Annual Meeting of the Association for Computational Linguistics, pages 160–167, Sapporo, Japan, July.

Papineni, Kishore, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, July.

Saers, Markus and Dekai Wu. 2009. Improving phrase-based translation via word alignments from Stochastic Inversion Transduction Grammars. In Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation (SSST-3) at NAACL HLT 2009, pages 28–36, Boulder, Colorado, June.

Saers, Markus, Joakim Nivre, and Dekai Wu. 2009. Learning Stochastic Bracketing Inversion Transduction Grammars with a cubic time biparsing algorithm. In Proceedings of the 11th International Conference on Parsing Technologies (IWPT'09), pages 29–32, Paris, France, October.

Saers, Markus, Joakim Nivre, and Dekai Wu. 2010. Word alignment with Stochastic Bracketing Linear Inversion Transduction Grammar. In Proceedings of Human Language Technologies: The 11th Annual Conference of the North American Chapter of the Association for Computational Linguistics, Los Angeles, California, June.

Stolcke, Andreas. 2002. SRILM – an extensible language modeling toolkit. In International Conference on Spoken Language Processing, Denver, Colorado, September.

Vogel, Stephan, Hermann Ney, and Christoph Tillmann. 1996. HMM-based word alignment in statistical translation. In Proceedings of the 16th Conference on Computational Linguistics, pages 836–841, Morristown, New Jersey.

Wu, Dekai. 1997. Stochastic inversion transduction grammars and bilingual parsing of parallel corpora. Computational Linguistics, 23(3):377–403.

Younger, Daniel H. 1967. Recognition and parsing of context-free languages in time n^3. Information and Control, 10(2):189–208.

Zhang, Hao, Chris Quirk, Robert C. Moore, and Daniel Gildea. 2008. Bayesian learning of non-compositional phrases with synchronous parsing. In Proceedings of ACL-08: HLT, pages 97–105, Columbus, Ohio, June.



Source-side Syntactic Reordering Patterns with Functional Words for Improved Phrase-based SMT

Jie Jiang, Jinhua Du, Andy Way
CNGL, School of Computing, Dublin City University
{jjiang,jdu,away}@computing.dcu.ie

    Abstract

Inspired by previous source-side syntactic reordering methods for SMT, this paper focuses on using automatically learned syntactic reordering patterns with functional words which indicate structural reorderings between the source and target language. This approach takes advantage of phrase alignments and source-side parse trees for pattern extraction, and then filters out those patterns without functional words. Word lattices transformed by the generated patterns are fed into PBSMT systems to incorporate potential reorderings from the inputs. Experiments are carried out on a medium-sized corpus for a Chinese–English SMT task. The proposed method outperforms the baseline system by 1.38% relative on a randomly selected testset and 10.45% relative on the NIST 2008 testset in terms of BLEU score. Furthermore, a system with just 61.88% of the patterns, filtered by functional words, obtains comparable performance to the unfiltered one on the randomly selected testset, and achieves a 1.74% relative improvement on the NIST 2008 testset.

    1 Introduction

Previous work has shown that the problem of structural differences between language pairs in SMT can be alleviated by source-side syntactic reordering. With respect to their integration with SMT systems, these methods can be divided into two kinds of approaches (Elming, 2008): the deterministic reordering approach and the non-deterministic reordering approach.

In the deterministic approach, syntactic reordering is performed uniformly on the training set, devset and testset before they are fed into the SMT systems, so that only reordered source sentences are dealt with when building the SMT system. In this case, most work focuses on methods to extract and apply syntactic reordering patterns, which come from manually created rules (Collins et al., 2005; Wang et al., 2007a) or from an automatic extraction process taking advantage of parse trees (Collins et al., 2005; Habash, 2007). Because reordering of the source sentence cannot be undone by the SMT decoders (Al-Onaizan et al., 2006), which implies a systematic error for this approach, classifiers (Chang et al., 2009b; Du & Way, 2010) are utilized to obtain high-performance reordering for some specialized syntactic structures (e.g. the DE construction in Chinese).

On the other hand, the non-deterministic approach leaves it to the decoders to choose appropriate source-side reorderings. This is more flexible because both the original and reordered source sentences are presented in the inputs. Word lattices generated from syntactic structures for N-gram-based SMT are presented in (Crego et al., 2007). In (Zhang et al., 2007a; Zhang et al., 2007b), chunks and POS tags are used to extract reordering rules, while the generated word lattices are weighted by language models and reordering models. Rules created from a syntactic parser are also utilized to form weighted n-best lists which are fed into the decoder (Li et al., 2007). Furthermore, (Elming, 2008; Elming, 2009) use syntactic rules to score the output word order, both on English–Danish and English–Arabic tasks. Syntactic reordering information is also considered as an extra feature to improve PBSMT in (Chang et al., 2009b) for the Chinese–English task. These results confirm the effectiveness of syntactic reorderings.

However, for the particular case of Chinese source inputs, although the DE construction has been addressed for both PBSMT and HPBSMT systems in (Chang et al., 2009b; Du & Way, 2010), as indicated by (Wang et al., 2007a), there are still many unexamined structures that imply source-side reordering, especially in the non-deterministic approach. As specified in (Xue, 2005), these include the bei-construction, the ba-construction, three kinds of de-construction (including the DE construction) and general preposition constructions. Such structures are referred to with functional words in this paper, and all of these constructions can be identified by their corresponding tags in the Penn Chinese TreeBank. It is interesting to investigate these functional words for the syntactic reordering task, since most of them tend to produce structural reordering between the source and target sentences.

Another related piece of work filters bilingual phrase pairs with closed-class words (Sánchez-Martínez, 2009). By taking account of word alignments and word types, the filtering process reduces the phrase tables by up to a third, but still provides a system with competitive performance compared to the baseline. Similarly, our idea is to use a special type of words to filter the syntactic reordering patterns.

In this paper, our objective is to exploit these functional words for source-side syntactic reordering in Chinese–English SMT in the non-deterministic approach. Our assumption is that syntactic reordering patterns with functional words are the most effective ones, and that the others can be pruned for both speed and performance.

To validate this assumption, three systems are compared in this paper: a baseline PBSMT system, a syntactic reordering system with all patterns extracted from a corpus, and a syntactic reordering system with patterns filtered by functional words. To accomplish this, firstly the lattice scoring approach (Jiang et al., 2010) is utilized to discover non-monotonic phrase alignments, and then syntactic reordering patterns are extracted from source-side parse trees. After that, functional word tags specified in (Xue, 2005) are adopted to perform pattern filtering. Finally, both the unfiltered pattern set and the filtered one are used to transform inputs into word lattices that present potential reorderings for improving the PBSMT system. A comparison between the three systems is carried out to examine the performance of syntactic reordering as well as the usefulness of functional words for pattern filtering.

The rest of this paper is organized as follows: in section 2 we describe the extraction process for syntactic reordering patterns, including the lattice scoring approach and the extraction procedures. Section 3 presents the filtering process used to obtain patterns with functional words. After that, section 4 shows the generation of word lattices with patterns, and the experimental setup and results, including related discussion, are presented in section 5. Finally, we give our conclusions and avenues for future work in section 6.

2 Syntactic reordering pattern extraction

Instead of top-down approaches such as (Wang et al., 2007a; Chang et al., 2009a), we use a bottom-up approach similar to (Xia et al., 2004; Crego et al., 2007) to extract syntactic reordering patterns from non-monotonic phrase alignments and source-side parse trees. The following steps are carried out to extract syntactic reordering patterns: 1) the lattice scoring approach proposed in (Jiang et al., 2010) is used to obtain phrase alignments from the training corpus; 2) reordering regions from the non-monotonic phrase alignments are used to identify minimum treelets for pattern extraction; and 3) the treelets are transformed into syntactic reordering patterns, which are then weighted by their occurrences in the training corpus. Details of each of these steps are presented in the rest of this section.

    2.1 Lattice scoring for phrase alignments

The lattice scoring approach is proposed in (Jiang et al., 2010) for the SMT data cleaning task.


To clean the training corpus, word alignments are used to obtain approximate decoding results, which are then used to calculate BLEU (Papineni et al., 2002) scores to filter out low-scoring sentence pairs. The following steps are taken in the lattice scoring approach: 1) train an initial PBSMT model; 2) collect anchor pairs containing source and target phrase positions from word alignments generated in the training phase; 3) build source-side lattices from the anchor pairs and the translation model; 4) search the source-side lattices to obtain approximate decoding results; 5) calculate BLEU scores for the purpose of data cleaning.

Note that the source-side lattices in step 3 come from anchor pairs, so each edge in the lattices contains both the source and target phrase positions. Thus the outputs of step 4 contain phrase alignments on the training corpus. These phrase alignments are used to identify non-monotonic areas for the extraction of reordering patterns.
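As a rough illustration (our own simplification, not the authors' code), non-monotonic neighbours in a source-sorted phrase alignment can be detected as follows; each such pair yields a candidate reordering region:

    def non_monotonic_pairs(alignment):
        """alignment: list of (src_span, tgt_span) pairs sorted by source
        position, where each span is (start, end). Adjacent pairs whose
        target order is inverted relative to their source order are
        returned as candidate reordering regions."""
        regions = []
        for (src_a, tgt_a), (src_b, tgt_b) in zip(alignment, alignment[1:]):
            if tgt_b[0] < tgt_a[0]:   # target order inverted: swap A B => B A
                regions.append((src_a, src_b))
        return regions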

    2.2 Reordering patterns

Non-monotonic regions of the phrase alignments are examined as potential source-side reorderings. By taking a bottom-up approach, the reordering regions are identified and mapped to minimum treelets on the source parse trees. After that, syntactic reordering patterns are derived from these minimum treelets.

In this paper, only reordering regions A and B indicating swapping operations on the source side are considered as potential source-side reorderings. Thus, given reordering regions AB, this implies (1):

AB ⇒ BA    (1)

on the source-side word sequences. Referring to the phrase alignment extraction in the last section, each non-monotonic phrase alignment produces one reordering region. Furthermore, for each r