TRANSCRIPT
Gappy Phrasal Alignment by Agreement
Mohit Bansal, Chris Quirk, and Robert C. Moore
Summer 2010, NLP Group
Introduction
• High-level motivation:
– Word alignment is a pervasive problem
• Crucial component in MT systems
– To build phrase tables
– To extract synchronous syntactic rules
• Also used in other NLP problems:
– entailment
– paraphrase
– question answering
– summarization
– spell correction, etc.
Introduction
• The limitation to words is obviously wrong
• People have tried to correct this for a while now
– Phrase-based alignment
– Pseudo-words
• Our contribution: a clean, fast phrasal alignment model
– a hidden semi-Markov model (observations can be phrases…)
– for phrase-to-phrase alignment (…and states…)
– using alignment by agreement (…meaningful states… no phrase penalty)
– allowing subsequences (…finally, with ne .. pas!)
Related work
Two major influences:
1) Conditional phrase-based alignment models
– Word-to-phrase HMM is one approach (Deng & Byrne '05)
• models subsequent words from the same state using a bigram model
• changes only the parameterization, not the set of possible alignments
– Phrase-to-phrase alignments (Daumé & Marcu '04; DeNero et al. '06)
• unconstrained model may overfit using unusual segmentations
– Phrase-based hidden semi-Markov model (Ferrer & Juan '09)
• interpolates with Model 1 and is monotonic (no reordering)
Related work
2) Alignment by agreement (Liang et al. '06)
• soft intersection cleans up and symmetrizes word HMM alignments
• symmetric portion of HMM space is only word-to-word
[Figure: F→E and E→F alignment grids over states e1–e3 and observations f1–f3]
Here we can't capture f1 ~ e1e2 after agreement, because the two states e1 and e2 cannot align to the same observation f1 in the E→F direction.
Word-to-word constraint
The model of F given E can't represent the phrasal alignment {e1,e2} ~ {f1}: its probability mass is distributed between {f1} ~ {e1} and {f1} ~ {e2}. Agreement of the forward and backward HMM alignments therefore places less mass on phrasal links and more mass on word-to-word links.
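To see this effect with concrete numbers, here is a minimal sketch (the toy posterior values are invented for illustration, not taken from the talk) of how multiplying the two directional posteriors suppresses a phrasal link that only one direction can represent:

```python
# Toy link posteriors (invented for illustration).  In the model of F
# given E, English words are states, so observation f1 must choose ONE
# of e1/e2 on each path: its posterior mass is split across the phrase.
p_f_given_e = {("e1", "f1"): 0.5, ("e2", "f1"): 0.5, ("e3", "f2"): 0.9}

# In the model of E given F, French words are states, so observations
# e1 and e2 may both align to state f1: no split is forced.
p_e_given_f = {("e1", "f1"): 0.9, ("e2", "f1"): 0.9, ("e3", "f2"): 0.9}

# Agreement: multiply the directional posteriors link by link.
for link in p_f_given_e:
    print(link, round(p_f_given_e[link] * p_e_given_f[link], 2))
# ('e1', 'f1') 0.45  <- each half of the phrasal link {e1,e2} ~ {f1}
# ('e2', 'f1') 0.45     is dragged down by the forced split in F|E
# ('e3', 'f2') 0.81  <- a clean one-to-one link survives agreement
```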
Contributions
• Unite phrasal alignment and alignment by agreement
– Allow phrases on both the state and observation sides
– Agreement favors alignments meaningful in both directions
• With word alignment, agreement removes phrases
• With phrase-to-phrase alignment, agreement reinforces meaningful phrases – avoids overfitting
[Figure: F→E and E→F grids over e2–e3 and f1–f4, illustrating a phrasal link kept by agreement]
Contributions
• Furthermore, we can allow subsequences
States: qui ne le exposera pas
Observations: which will not lay him open
– State space extended to include gappy phrases
– Gappy observation phrases approximated as multiple observation words emitted from a single state
[Figure: F→E and E→F grids over e1–e2 and f1–f3 for the gappy case]
Complexity still ~ $\mathcal{O}(n^2 \cdot m)$
HMM Alignment
State S: e1 e2 e3 e4 e5 e6
Observation O: f1 f2 f3 f4 f5 f6
Alignment A: 1 5 3 4 6 6
Model Parameters
– Emission/Translation model: $P(O_2 = f_2 \mid S_{A_2} = e_5)$
– Transition/Distortion model: $P(A_2 = 5 \mid A_1 = 1)$
EM Training (Baum-Welch)
E-Step: $q(\mathbf{Z}; \mathbf{X}) \leftarrow P(\mathbf{Z} \mid \mathbf{X}; \theta)$
– computed via forward $\alpha$, backward $\beta$, posteriors $\gamma$, $\xi$
M-Step: $\theta' = \arg\max_{\theta} \sum_{\mathbf{X},\mathbf{Z}} q(\mathbf{Z}; \mathbf{X}) \log P(\mathbf{X}, \mathbf{Z}; \theta)$
– re-estimates the translation model $p_t(o \mid s)$ and the distortion model $p_d(s' \mid s; I)$
~5 iterations
Baum-Welch (Forward-Backward)
Forward:
$\alpha_j(s) = \sum_{s'} \alpha_{j-1}(s') \cdot p_d(s \mid s'; I) \cdot p_t(o_j \mid s)$
[previous $\alpha$ · distortion model · translation model]
$\alpha_j(s)$ = probability of generating observations $o_1$ to $o_j$ such that state $s$ generates $o_j$
Initialization: $\alpha_0(s) = 1$
Backward:
$\beta_j(s) = \sum_{s'} \beta_{j+1}(s') \cdot p_d(s' \mid s; I) \cdot p_t(o_{j+1} \mid s')$
[next $\beta$ · distortion model · translation model]
$\beta_j(s)$ = probability of generating observations $o_{j+1}$ to $o_J$ such that state $s$ generates $o_j$
Initialization: $\beta_J(s) = 1$
Baum-Welch
$P(\mathbf{o} \mid \mathbf{s})$ = probability of the full observation sentence $\mathbf{o} = o_1^J$ given the full state sentence $\mathbf{s} = s_1^I$:
$P(\mathbf{o} \mid \mathbf{s}) = \sum_s \alpha_J(s) \cdot 1 = \sum_s 1 \cdot \beta_0(s) = \sum_s \alpha_j(s) \cdot \beta_j(s)$
Node posterior:
$\gamma_j(s) = \alpha_j(s) \cdot \beta_j(s) \,/\, P(\mathbf{o} \mid \mathbf{s})$
$\gamma_j(s)$ = probability, given $\mathbf{o}$, that observation $o_j$ is generated by state $s$
Edge posterior:
$\xi_j(s', s) = \alpha_j(s') \cdot p_d(s \mid s'; I) \cdot p_t(o_{j+1} \mid s) \cdot \beta_{j+1}(s) \,/\, P(\mathbf{o} \mid \mathbf{s})$
$\xi_j(s', s)$ = probability, given $\mathbf{o}$, that observations $o_j$ and $o_{j+1}$ are generated by states $s'$ and $s$ respectively
Parameter re-estimation (normalized fractional counts):
$p_t(o \mid s) \propto \sum_{j \,:\, o_j = o} \gamma_j(s)$
$p_d(s \mid s'; I) \propto \sum_j \xi_j(s', s)$
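As a concrete reference, here is a minimal NumPy sketch of one E-step and the count accumulation behind the M-step. This is not the authors' code: the matrix layout and function names are choices of this sketch, and a real aligner would rescale $\alpha$/$\beta$ or work in log space to avoid underflow on long sentences.

```python
import numpy as np

def e_step(p_trans, p_emit, obs):
    """Forward-backward for one sentence pair, following the recursions
    above: p_trans[s_prev, s] = p_d(s | s_prev; I), p_emit[s, o] =
    p_t(o | s), obs = list of observation word ids (o_1 .. o_J)."""
    I, J = p_trans.shape[0], len(obs)
    alpha = np.ones((J + 1, I))              # alpha_0(s) = 1
    beta = np.ones((J + 1, I))               # beta_J(s) = 1
    for j in range(1, J + 1):                # alpha_j(s), j = 1..J
        alpha[j] = (alpha[j - 1] @ p_trans) * p_emit[:, obs[j - 1]]
    for j in range(J - 1, -1, -1):           # beta_j(s), j = J-1..0
        beta[j] = p_trans @ (beta[j + 1] * p_emit[:, obs[j]])
    z = alpha[J].sum()                       # P(o | s)
    gamma = alpha[1:] * beta[1:] / z         # node posteriors gamma_j(s)
    xi = (alpha[1:J, :, None] * p_trans[None, :, :] *
          (p_emit[:, obs[1:]].T * beta[2:])[:, None, :]) / z
    return gamma, xi                         # xi_j(s', s), j = 1..J-1

def m_step_counts(gamma, xi, obs, n_obs_types):
    """Fractional counts for one sentence pair.  Summing these over the
    corpus and normalizing each row yields the new p_t and p_d."""
    c_trans = xi.sum(axis=0)                 # counts for p_d(s | s'; I)
    c_emit = np.zeros((gamma.shape[1], n_obs_types))
    for j, o in enumerate(obs):              # gamma_j(s) credits p_t(o_j | s)
        c_emit[:, o] += gamma[j]
    return c_trans, c_emit
```

The talk's regimen runs about five such EM iterations per direction.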
Hidden Semi-Markov Model
State S: e1 e2 e3 e4 e5
Observation O: f1 f2 f3 f4
Alignment A: {2,3,4} 1 {4,5}
Model Parameters
– Emissions: $P(O_1 = f_1 \mid S_{A_1} = e_2 e_3 e_4)$ [k–1 alignment: phrasal states], $P(O_2 = f_2 f_3 \mid S_{A_2} = e_1)$ [1–k alignment: phrasal observation (semi-Markov), Fertility($s_1$) = 2]
– Transitions: $P(A_2 = 1 \mid A_1 = \{2,3,4\})$ [jump (4 → 1)]
Forward Algorithm (HSMM)
$\alpha_j(s, k) = \sum_{s', k'} \alpha_{j-k}(s', k') \cdot p_d(s \mid s'; I) \cdot P(k \mid s) \cdot e^{-k} \cdot e^{-|s|} \cdot p_t(o_{j-k+1}^{j} \mid s)$
[previous $\alpha$ · distortion model · fertility model · penalty for observation phrase length · penalty for state phrase length · translation model]
$\alpha_j(s, k)$ = probability of generating observations $o_1$ to $o_j$ such that the last observation phrase $o_{j-k+1}^{j}$ is generated by state $s$
Base case: $\alpha_k(s, k) = p_{\mathrm{init}}(s) \cdot P(k \mid s) \cdot e^{-k} \cdot e^{-|s|} \cdot p_t(o_1^{k} \mid s)$
Initialize: $\alpha_0(s, 0) = 1$
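A direct transliteration of this recursion follows. It is a sketch, not the paper's implementation: no pruning, probability space rather than log space, and `p_d`, `p_fert`, and `p_t` are assumed to be callables supplied by the model; states are tuples of source words, so `len(s)` gives the state phrase length $|s|$.

```python
import math
from collections import defaultdict

def hsmm_forward(states, obs, p_d, p_fert, p_t, max_k=3):
    """alpha[j][(s, k)] = probability of generating o_1..o_j with the
    last observation phrase o_{j-k+1}..o_j emitted by state s."""
    J = len(obs)
    alpha = [defaultdict(float) for _ in range(J + 1)]
    for s in states:
        alpha[0][(s, 0)] = 1.0                    # alpha_0(s, 0) = 1
    for j in range(1, J + 1):
        for k in range(1, min(max_k, j) + 1):     # obs-phrase length
            span = tuple(obs[j - k:j])            # o_{j-k+1} .. o_j
            for s in states:
                # fertility * length penalties e^-k, e^-|s| * translation
                score = (p_fert(k, s) * math.exp(-k) *
                         math.exp(-len(s)) * p_t(span, s))
                if score == 0.0:
                    continue
                # marginalize over all predecessor states and lengths
                total = sum(a * p_d(s_prev, s)
                            for (s_prev, _k), a in alpha[j - k].items())
                alpha[j][(s, k)] += total * score
    return alpha   # sentence probability = sum(alpha[J].values())
```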
Agreement
• Multiply posteriors of both directions E→F and F→E after every E-step of EM
• $q(\mathbf{Z}; \mathbf{X}) \leftarrow \prod_{i,j} P_1(z_{ij} \mid \mathbf{X}; \theta_1) \cdot P_2(z_{ij} \mid \mathbf{X}; \theta_2)$
F→E E-Step: $\gamma_{\theta}^{F \to E}$ → F→E M-Step: $\theta^{F \to E}$
E→F E-Step: $\gamma_{\theta}^{E \to F}$ → E→F M-Step: $\theta^{E \to F}$
After each E-step, both directions share the product of their posteriors:
$\gamma_{\theta}^{F \to E} = \gamma_{\theta}^{E \to F} = \gamma_{\theta}^{F \to E} \cdot \gamma_{\theta}^{E \to F}$
Agreement
• Word agreement:
$\gamma^{F \to E}(f_j, e_i) = \gamma^{E \to F}(e_i, f_j) = \gamma^{F \to E}(f_j, e_i) \cdot \gamma^{E \to F}(e_i, f_j)$
• Phrase agreement: need phrases on both the observation and state sides
$\gamma^{F \to E}(f_j^{j'}, e_i) = \gamma^{E \to F}(e_i, f_j^{j'}) = \gamma^{F \to E}(f_j^{j'}, e_i) \cdot \gamma^{E \to F}(e_i, f_j^{j'})$
[Figures: F→E and E→F posterior grids for word links (e1–e3, f1–f3) and phrase links (e1–e3, f1–f4), illustrating which links survive agreement]
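In code, the agreement step itself is just an element-wise product over a shared link index. A minimal sketch, assuming both directions expose their phrase posteriors keyed by the same (source span, target span) pair:

```python
def agree(gamma_f2e, gamma_e2f):
    """Multiply the two directional posteriors for every link.  A link
    missing from either direction gets posterior zero, which is exactly
    how agreement suppresses phrases meaningful in only one direction."""
    return {link: g * gamma_e2f.get(link, 0.0)
            for link, g in gamma_f2e.items()}
```

Both directions then continue EM from the same agreed posteriors.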
Gappy Agreement
State S: e1 e2 e3 e4
Observation O: f1 f2 f3 f4
Alignment A: 1 2<*>4 3
• Asymmetry
– Computing the posterior of a gappy observation phrase is inefficient
– Hence, approximate the posterior of e_k ~ {f_i <*> f_j} using the posteriors of e_k ~ {f_i} and e_k ~ {f_j}
Gappy Agreement
• Modified agreement, with approximate computation of the posterior
– In the reverse direction, a gapped state corresponds to a revisited state emitting two discontiguous observations
$\gamma^{F \to E}(f_i <*> f_j,\ e_k) \;\mathrel{*}=\; \min\nolimits^{\ddagger}\big(\gamma^{E \to F}(e_k, f_i),\ \gamma^{E \to F}(e_k, f_j)\big)$
$\gamma^{E \to F}(e_k, f_i) \;\mathrel{*}=\; \gamma^{F \to E}(f_i, e_k) \;+\; \sum_{h<i} \gamma^{F \to E}(f_h <*> f_i,\ e_k) \;+\; \sum_{i<j} \gamma^{F \to E}(f_i <*> f_j,\ e_k)$
‡ The min is an upper bound on the posterior that both observations f_i and f_j align to e_k: every path that passes through both e_k ~ f_i and e_k ~ f_j must pass through e_k ~ f_i, so the posterior of the conjunction is at most that of e_k ~ f_i, and likewise at most that of e_k ~ f_j.
[Figure: a gapped link f_i <*> f_j ~ e_k in the F→E direction and the corresponding revisited state in E→F]
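A sketch of the two update rules above; the dict-based indexing, 1-based positions up to J, and the argument orders are assumptions of this sketch, not the talk's data structures:

```python
def gappy_agree(g_word_f2e, g_gap_f2e, g_e2f, J):
    """g_gap_f2e[(i, j, k)] = F|E posterior of f_i <*> f_j ~ e_k;
    g_word_f2e[(i, k)] and g_e2f[(k, i)] = word-level posteriors.
    Both rules read pre-update values, so they apply simultaneously."""
    old_gap = dict(g_gap_f2e)
    # Rule 1: damp each gapped F|E link by the min of the two E|F word
    # posteriors (an upper bound on "both words align to e_k").
    for (i, j, k), g in old_gap.items():
        g_gap_f2e[(i, j, k)] = g * min(g_e2f.get((k, i), 0.0),
                                       g_e2f.get((k, j), 0.0))
    # Rule 2: each word-level E|F link collects support from every F|E
    # link, word or gapped, that touches observation f_i and state e_k.
    for (k, i), g in list(g_e2f.items()):
        support = g_word_f2e.get((i, k), 0.0)
        support += sum(old_gap.get((h, i, k), 0.0) for h in range(1, i))
        support += sum(old_gap.get((i, j, k), 0.0)
                       for j in range(i + 1, J + 1))
        g_e2f[(k, i)] = g * support
    return g_gap_f2e, g_e2f
```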
Allowed Phrases
• Only allow certain "good" phrases instead of all possible ones
• Run the word-to-word HMM on the full data
• Get observation phrases (contiguous and gapped) aligned to a single state, i.e., $f_i^j \sim e$, for both languages/directions
• Weight the phrases $f_i^j$ by the discounted probability $\max\big(0,\ P(f_i^j \sim e) - \delta\big) \,/\, P(f_i^j)$ and choose the top X phrases
[Figure: example of a gapped observation phrase f1 <*> f3 aligned to a single state]
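A minimal sketch of this pruning step, using relative frequencies from the word-HMM alignments in place of the probabilities, so that $\delta$ acts as an absolute discount on counts; the default values for `delta` and `top_x` are placeholders, not the talk's settings:

```python
def allowed_phrases(joint_count, span_count, delta=1.0, top_x=100000):
    """joint_count[span] = how often the (possibly gapped) observation
    span aligned to a single state in the word-HMM Viterbi alignments;
    span_count[span] = how often the span occurred at all.  Scores
    follow the slide: max(0, P(span ~ e) - delta) / P(span)."""
    scored = sorted(((max(0.0, joint - delta) / span_count[span], span)
                     for span, joint in joint_count.items()),
                    reverse=True)
    return [span for score, span in scored if score > 0.0][:top_x]
```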
Complexity
Model 1: $\prod_{j=1}^{m} p_t(f_j \mid e_{a_j})$ – $\mathcal{O}(n \cdot m)$
w2w HMM: $\prod_{j=1}^{m} p_d(a_j \mid a_{j-1}, n) \cdot p_t(f_j \mid e_{a_j})$ – $\mathcal{O}(n^2 \cdot m)$
HSMM (phrasal observations of bounded length k): $\mathcal{O}(n^2 \cdot mk)$
+ Phrasal states of bounded length k: $\mathcal{O}\big((nk)^2 \cdot mk\big) = \mathcal{O}(k^3 n^2 m)$
+ Gappy phrasal states of the form w <*> w: $\mathcal{O}\big((nk + n^2)^2 \cdot mk\big) = \mathcal{O}(k\, n^4 m)$
Phrases (observations and states, contiguous and gappy) from pruned lists: $\mathcal{O}\big((n + X)^2 \cdot (m + X)\big) \sim \mathcal{O}(n^2 \cdot m)$
Still smaller than exact ITG: $\mathcal{O}(n^6)$
(n = state sentence length, m = observation sentence length, X = number of allowed phrases)
Evaluation
$\mathrm{Precision} = \dfrac{|A \cap P|}{|A|} \times 100\%, \quad \mathrm{Recall} = \dfrac{|A \cap S|}{|S|} \times 100\%$
where S, P, and A are the gold-sure, gold-possible, and predicted edge sets respectively
$F_{\beta} = \dfrac{(1 + \beta^2) \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}$
$\mathrm{BLEU}_4 = \min\Big(1, \dfrac{\text{output-length}}{\text{reference-length}}\Big) \cdot \exp\Big(\sum_{n=1}^{4} w_n \log p_n\Big)$
$p_n = \dfrac{\sum_{\text{n-gram} \in \text{Candidates}} \mathrm{Count}_{\mathrm{clip}}(\text{n-gram})}{\sum_{\text{n-gram} \in \text{Candidates}} \mathrm{Count}(\text{n-gram})}$
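The alignment metrics translate directly into a few lines of Python. This sketch assumes non-empty edge sets and S ⊆ P, and uses β = 0.5 as an illustrative default rather than the talk's setting:

```python
def alignment_scores(predicted, sure, possible, beta=0.5):
    """Precision against the possible set P, recall against the sure
    set S, and F_beta, all as percentages, per the definitions above.
    Edge sets are sets of (i, j) position pairs."""
    a, s, p = set(predicted), set(sure), set(possible)
    precision = len(a & p) / len(a)
    recall = len(a & s) / len(s)
    f_beta = ((1 + beta ** 2) * precision * recall /
              (beta ** 2 * precision + recall))
    return 100 * precision, 100 * recall, 100 * f_beta
```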
Evaluation
• Datasets
– French-English: Hansards, NAACL 2003 shared task
• 1.1M sentence pairs
• Hand alignments from Och & Ney '03
• 137 dev-set, 347 test-set (Liang '06)
– German-English: Europarl from WMT 2010
• 1.6M sentence pairs
• Hand alignments from ChrisQ
• 102 dev-set, 258 test-set
• Training Regimen:
– 5 iterations of Model 1 (independent training)
– 5 iterations of w2w HMM (independent training)
– Initialize the p2p model using phrase extraction from w2w Viterbi alignments
– Minimality: only allow 1-K or K-1 alignments, since 2-3 alignments can generally be decomposed into 1-1 ∪ 1-2, etc.
Translation BLEU Results
• Phrase-based system using only contiguous phrases consistent with the potentially gappy alignment
– 4 channel models, lexicalized reordering model
– word and phrase count features, distortion penalty
– 5-gram language model (feature weights tuned by MERT)
• Parameters tuned on dev-set BLEU using grid search
• A syntax-based or non-contiguous phrasal system (Galley and Manning, 2010) may benefit more from gappy phrases
Conclusion
• Start with HMM alignment by agreement
• Allow phrasal observations (HSMM)
• Allow phrasal states
• Allow gappy phrasal states
• Agreement between F→E and E→F finds meaningful phrases and makes a phrase penalty almost unnecessary
• Maintain ~$\mathcal{O}(n^2 \cdot m)$ complexity
Future Work
• Limiting the gap length also prevents combinatorial explosion
• A translation system using discontinuous mappings at runtime (Chiang, 2007; Galley and Manning, 2010) may make better use of discontinuous alignments
• Apply the model at the morpheme or character level, allowing joint inference of segmentation and alignment
• The state space could be expanded to include more possibilities: states with multiple gaps might be useful for alignment in languages with template morphology, such as Arabic or Hebrew
• A better distortion model might place a stronger distribution on the likely starting and ending points of phrases