Shuffling Non-Constituents
Jason Eisner, with David A. Smith and Roy Tromble. ACL SSST Workshop, June 2008. A syntactically-flavored reordering model and syntactically-flavored reordering search methods. Transcript of the slides.
Shuffling Non-Constituents
Jason Eisner
ACL SSST Workshop, June 2008
with David A. Smith and Roy Tromble
a syntactically-flavored reordering model
syntactically-flavored reordering search methods
Eisner, D.A.Smith, Tromble - SSST Workshop - June 2008 2
Starting point: Synchronous alignment
Synchronous grammars are very pretty. But does parallel text actually have parallel structure?
It depends on what kind of parallel text: free translations? noisy translations? were the parsers trained on parallel annotation schemes?
It depends on what kind of parallel structure: what kinds of divergences can your synchronous grammar formalism capture? E.g., wh-movement versus wh in situ.
Synchronous Tree Substitution Grammar
Two training trees, showing a free translation from French to English.
[Tree diagram: “beaucoup d’enfants donnent un baiser à Sam” (beaucoup “lots”, d’ “of”, enfants “kids”, donnent “give”, un “a”, baiser “kiss”, à “to”) paired with “kids kiss Sam quite often”.]
Synchronous Tree Substitution Grammar
[Tree diagram: the same pair decomposed into aligned elementary trees: Start; NP pairs such as beaucoup d’enfants with kids and Sam with Sam; the donnent / un baiser / à fragment paired with kiss; and null / Adv pairs for quite and often.]
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange.
Synchronous Tree Substitution Grammar
[Tree diagram: the same pair under a different decomposition.]
Two training trees, showing a free translation from French to English. A possible alignment is shown in orange. A much worse alignment …
Grammar = Set of Elementary Trees
[Tree diagram: the elementary trees extracted above: Start; beaucoup d’enfants / kids; Sam / Sam; donnent / un baiser / à / kiss; and null / Adv trees for quite and often.]
But many examples are harder
[Dependency diagram: German “Auf diese Frage habe ich leider keine Antwort bekommen” (gloss: to this question have I alas no answer received), aligned with English “I did not unfortunately receive an answer to this question”; some words align to NULL.]
The same example: a displaced modifier (negation).
The same example: a displaced argument (here, because of the projective parser).
The same example: head-swapping (here, from different annotation conventions).
Free Translation
[Dependency diagram: German “Tschernobyl könnte dann etwas später an die Reihe kommen” (Tschernobyl “Chernobyl”, könnte “could”, dann “then”, etwas “something”, später “later”, an “on”, die “the”, Reihe “queue”, kommen “come”) aligned with English “Then we could deal with Chernobyl some time later”; some words align to NULL.]
The same example: probably not systematic (but the words are correctly aligned).
The same example: an erroneous parse.
What to do? Current practice:
Don’t try to model all systematic phenomena! Just use non-syntactic alignments (Giza++).
Only care about the fragments that recur often: phrases or gappy phrases, sometimes even syntactic constituents (you can favor these, e.g., Marton & Resnik 2008).
Use these (gappy) phrases in a decoder, phrase-based or hierarchical.
What to do? Current practice:
Use non-syntactic alignments (Giza++); keep frequent phrases for a decoder.
But could syntax give us better alignments? It would have to be “loose” syntax …
Why do we want better alignments?
1. Throw away less of the parallel training data.
2. Help learn a smarter, syntactic reordering model. That could help decoding: less reliance on the LM.
3. Some applications care about full alignments.
Quasi-synchronous grammar
How do we handle “loose” syntax? Translation story: generate target English by a monolingual grammar. Any grammar formalism is okay; pick a dependency grammar formalism for now.
Example: “I did not unfortunately receive an answer to this question”, with factors such as
P(PRP | no previous left children of “did”) and P(I | did, PRP).
Parsing: O(n³).
Quasi-synchronous grammar
How do we handle “loose” syntax? Translation story: generate target English by a monolingual grammar, but the probabilities are influenced by the source sentence. Each English node is aligned to some source node, and the model prefers to generate children aligned to nearby source nodes.
Example: “I did not unfortunately receive an answer to this question”. Parsing: O(n³).
QCFG generative story (source side observed)
[Diagram: German “Auf diese Frage habe ich leider keine Antwort bekommen” aligned with “I did not unfortunately receive an answer to this question”; NULL alignments allowed.]
The factors now also condition on the aligned source words, e.g. P(PRP | no previous left children of “did”, habe) and P(I | did, PRP, ich), plus P(parent-child) and P(breakage) terms for the alignment configuration.
Aligned parsing: O(m²n³).
What’s a “nearby node”?
Given the parent’s alignment, where might the child be aligned? [Diagram of configurations: the synchronous-grammar case, plus “none of the above”.]
Quasi-synchronous grammar (Target | Source)
How do we handle “loose” syntax? Translation story: generate target English by a monolingual grammar, but the probabilities are influenced by the source sentence.
Useful analogies:
1. Generative grammar with latent word senses.
2. MEMM: generate an n-gram tag sequence, but the probabilities are influenced by the word sequence.
Useful analogies (continued):
3. IBM Model 1: source nodes can be freely reused or unused. Future work: enforce 1-to-1 to allow good decoding (NP-hard to do exactly).
Some results: Quasi-synchronous Dependency Grammar
Alignment (D. Smith & Eisner 2006): quasi-synchronous is much better than synchronous, and maybe also better than IBM Model 4.
Question answering (Wang et al. 2007): align a question with a potential answer. Mean average precision: 43% (previous state of the art) → 48% (+ QG) → 60% (+ lexical features).
Bootstrapping a parser for a new language (D. Smith & Eisner 2007 & ongoing): learn how parsed parallel text influences target dependencies, along with many other features (cf. co-training). Unsupervised: German 30% → 69%, Spanish 26% → 65%.
Summary of part I
Current practice: use non-syntactic alignments (Giza++); some bits align nicely; use the frequent bits in a decoder.
Suggestion: let syntax influence alignments.
So far, loose-syntax methods are like IBM Model 1: it is NP-hard to enforce 1-to-1 alignment in any interesting model.
Rest of talk: How can we enforce 1-to-1 in interesting models? Can we do something smarter than beam search?
Shuffling Non-Constituents
Jason Eisner
ACL SSST Workshop, June 2008
with David A. Smith and Roy Tromble
a syntactically-flavored reordering model
syntactically-flavored reordering search methods
Motivation
MT is really easy! Just use a finite-state transducer! Phrases, morphology, the works!
Permutation search in MT
Initial order (French): 1 2 3 4 5 6 = Marie/NNP ne/NEG m’/PRP a/AUX pas/NEG vu/VBN
Best order (French’): 1 4 2 5 6 3
Then an easy transduction yields “Mary hasn’t seen me”.
Motivation
MT is really easy! Just use a finite-state transducer! Phrases, morphology, the works! We just have to fix that pesky word order first.
Framing it this way lets us enforce 1-to-1 exactly at the permutation step. Deletion and fertility > 1 are still allowed in the subsequent transduction.
We often want to find an optimal permutation …
Machine translation: reorder French into French-prime (Brown et al. 1992), so it’s easier to align or translate.
MT eval: how much do you need to rearrange MT output so that it scores well under an LM derived from the reference translations?
Discourse generation, e.g., multi-document summarization: order the output sentences (Lapata 2003) so they flow nicely.
Reconstructing the temporal order of events after information extraction.
Learning rule ordering or constraint ranking for phonology?
Multi-word anagrams that score well under an LM.
Permutation search: The problem
How can we find this needle in the haystack of N! possible permutations?
Initial order: 1 2 3 4 5 6. Best order according to some cost function: 1 4 2 5 6 3.
Traditional approach: Beam search
Approximate the best path through a really big FSA: N! paths, one for each permutation, but only 2^N states. A state remembers what we’ve generated so far (but not in what order), and an arc weight is, e.g., the cost of picking 5 next if we’ve seen {1,2,4} so far.
An alternative: Local search (“hill-climbing”)
The SWAP neighborhood of 1 2 3 4 5 6 (cost=22): 2 1 3 4 5 6 (cost=26), 1 3 2 4 5 6 (cost=20), 1 2 4 3 5 6 (cost=19), 1 2 3 5 4 6 (cost=25), …
We move to the best neighbor, 1 2 4 3 5 6 (cost=19).
Local search is like the “greedy decoder” of Germann et al. 2001. Repeatedly taking the best swap: cost = 22 → 19 → 17 → 16 → …
Why are the costs always going down? Because we pick the best swap.
How long does it take to pick the best swap? O(N), if you’re careful.
How many swaps might you need to reach the answer? O(N²).
What if you get stuck in a local min? Random restarts.
Larger neighborhood
[The same example as before, motivating moves beyond adjacent swaps.]
Larger neighborhood (well-known in the literature; reportedly works well): INSERT
From 1 2 3 4 5 6 (cost=22) an INSERT move reaches cost=17.
Fewer local minima? Yes: 3 can move past 4 to get past 5.
Graph diameter (max #moves needed)? O(N) rather than O(N²).
How many neighbors? O(N²) rather than O(N).
How long to find the best neighbor? O(N²) rather than O(N).
Even larger neighborhood: BLOCK
From 1 2 3 4 5 6 (cost=22) a BLOCK move reaches cost=14.
Fewer local minima? Yes: 2 can get past 4 5 without having to cross 3 or move 3 first.
Graph diameter? Still O(N).
How many neighbors? O(N³) rather than O(N) or O(N²).
How long to find the best neighbor? O(N³) rather than O(N) or O(N²).
Larger yet: via dynamic programming??
Graph diameter (max #moves needed): logarithmic. Number of neighbors: exponential. Time to find the best neighbor: polynomial.
Unifying/generalizing the neighborhoods so far
Exchange two adjacent blocks, of max widths w ≤ w’. A move is defined by an (i,j,k) triple: exchange the blocks spanning positions i..j and j..k.
SWAP: w=1, w’=1; runtime = #neighbors = O(ww’N) = O(N).
INSERT: w=1, w’=N; O(N²).
BLOCK: w=N, w’=N; O(N³).
Everything in this talk can be generalized to other values of w, w’.
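The (i,j,k) parameterization can be made concrete with a small move generator. This is a sketch; the exact width convention for w ≤ w’ is my reading of the slide.

```python
def block_moves(n, w, wp):
    """Enumerate (i, j, k) moves: exchange the adjacent blocks order[i:j]
    and order[j:k], where the two block widths fit within w and w'
    (in either order).  There are O(w * w' * N) such moves."""
    for i in range(n):
        for j in range(i + 1, n):
            for k in range(j + 1, n + 1):
                a, b = j - i, k - j  # the two block widths
                if (a <= w and b <= wp) or (a <= wp and b <= w):
                    yield (i, j, k)

def apply_move(order, i, j, k):
    """Exchange the blocks order[i:j] and order[j:k]."""
    return order[:i] + order[j:k] + order[i:j] + order[k:]
```

With w = w’ = 1 this is exactly the SWAP neighborhood; with w = w’ = N it is BLOCK.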
Very large-scale neighborhoods
What if we consider multiple simultaneous exchanges that are “independent”? This is the DYNASEARCH neighborhood (Potts & van de Velde 1995; Congram 2000): from 1 2 3 4 5 6 we can reach, e.g., 2 1 4 3 6 5 in a single move by swapping three disjoint adjacent pairs.
The lowest-cost neighbor is the lowest-cost path through a small graph over positions, where the cost of a swap arc is the Δcost of swapping that pair, e.g. (4,5), here < 0.
Very large-scale neighborhoods
Why would this be a good idea?
Help get out of bad local minima? No; they’re still local minima.
Help avoid getting into bad local minima? Yes: it’s less greedy. For example, with
B =
 0 -20 0 80
 0 0 -30 0
 0 0 0 -20
 0 0 0 0
DYNASEARCH takes two independent swaps (-20 + -20) where greedy SWAP takes the single best swap (-30).
Why would this be a good idea?
Help get out of bad local minima? No; they’re still local minima.
Help avoid getting into bad local minima? Yes: less greedy.
More efficient? Yes! A shortest-path algorithm finds the best set of swaps in O(N) time, as fast as the best single swap. Up to N moves as fast as 1 move: no penalty for “parallelism”! It globally optimizes over exponentially many neighbors (paths).
Can we extend this idea (up to N moves in parallel by dynamic programming) to neighborhoods beyond SWAP?
Exchange two adjacent blocks, of max widths w ≤ w’; a move is an (i,j,k) triple. SWAP: w=1, w’=1, O(N). INSERT: w=1, w’=N, O(N²). BLOCK: w=N, w’=N, O(N³). Runtime = #neighbors = O(ww’N).
Yes. The asymptotic runtime is always unchanged.
Let’s define each neighbor by a “colored tree”: just like ITG!
[Diagram: a binary tree over 1 2 3 4 5 6; a colored node means “swap children”.]
[Diagram: the same tree with colored (swap) nodes, reordering the string.]
This is like the BLOCK neighborhood, but with multiple block exchanges, which may be nested.
If that was the optimal neighbor (1 4 5 6 2 3) … now look for its optimal neighbor, with a new tree!
If that in turn was the optimal neighbor (5 6 1 4 2 3) … again look for its optimal neighbor, with a new tree!
… and repeat until we reach a local optimum.
Each tree defines a neighbor. At each step, optimize over all possible trees by dynamic programming (CKY parsing). Use your favorite parsing speedups (pruning, best-first, …).
Very-large-scale versions of SWAP, INSERT, and BLOCK all follow from the algorithm we just saw …
Its runtime was O(N³) because we considered O(N³) distinct (i,j,k) triples. More generally, restricting to only the O(ww’N) triples of interest defines a smaller neighborhood with runtime O(ww’N). (Yes, the dynamic-programming recurrences go through.)
How many steps to get from here to there?
Initial order: 8 4 6 2 5 3 7 1. Best order: 1 2 3 4 5 6 7 8.
One twisted-tree step? No: as you probably know, 3 1 4 2 → 1 2 3 4 is impossible.
Can you get to the answer in one step?
On German-English Giza++ alignments: often (yay, big neighborhood), but not always (yay, local search).
How many steps to the answer in the worst case? (What is the diameter of the search space?)
8 4 6 2 5 3 7 1 → 1 2 3 4 5 6 7 8
Claim: only log2 N steps at worst (if you know where to step). Let’s sketch the proof!
Quicksort anything into, e.g., 1 2 3 4 5 6 7 8
[Diagram: a right-branching tree over 8 4 6 2 5 3 7 1 performs one partitioning step.]
Quicksort anything into, e.g., 1 2 3 4 5 6 7 8
[Diagram: a sequence of right-branching trees, each round partitioning every block.]
Only log2 N steps to get to 1 2 3 4 5 6 7 8 … or to anywhere!
How can we find this needle in the haystack of N! possible permutations? Initial order: 1 2 3 4 5 6; best order according to some cost function: 1 4 2 5 6 3.
Defining “best order”: What class of cost functions can we handle efficiently? How fast can we compute a subtree’s cost from its child subtrees?
Defining “best order”: What class of cost functions?
The “Traveling Salesperson Problem” (TSP): the cost of the order 1 4 2 5 6 3 is a14 + a42 + a25 + a56 + a63 (+ a31 to close the tour), with
A =
 0 15 22 80 5 -7
 -30 0 -76 24 63 -44
 15 28 0 -15 71 -99
 12 8 -31 0 54 -6
 7 -9 41 24 0 82
 6 5 -22 8 93 0
Defining “best order”: What class of cost functions?
The “Linear Ordering Problem” (LOP): b26 = the cost of 2 preceding 6. Add up all n(n-1)/2 such costs; any order will incur either b26 or b62. Best order: 1 4 2 5 6 3, with
B =
 0 5 -22 93 8 6
 12 0 8 -31 -6 54
 -7 41 0 -9 24 82
 88 17 -6 0 12 -60
 11 -17 10 -59 0 23
 5 4 -12 6 55 0
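The two cost families can be written down directly (the slides' order 1 4 2 5 6 3 is written 0-indexed below):

```python
def tsp_cost(order, A, close_tour=False):
    """TSP-style cost: sum A[u][v] over consecutive pairs u, v in the order
    (optionally adding the arc from the last item back to the first)."""
    c = sum(A[order[i]][order[i + 1]] for i in range(len(order) - 1))
    if close_tour:
        c += A[order[-1]][order[0]]
    return c

def lop_cost(order, B):
    """LOP cost: for every pair, add B[u][v] if u precedes v in the order.
    Any order incurs exactly one of B[u][v] or B[v][u] per pair."""
    return sum(B[order[i]][order[j]]
               for i in range(len(order)) for j in range(i + 1, len(order)))
```

TSP costs only see adjacencies; LOP costs see every non-local pair, which is what makes the combination step of the dynamic program interesting later.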
Defining “best order”: What class of cost functions?
TSP and LOP are both NP-complete; in fact, they are believed to be inapproximable (hard even to achieve C × the optimal cost, for any C ≥ 1).
Practical approaches:
Correct answer, typically fast: branch-and-bound, ILP, …
Fast answer, typically close to correct: beam search, this talk, …
Defining “best order”: What class of cost functions?
Initial order 1 2 3 4 5 6 → 1 4 2 5 6 3. The cost of this order:
1. Does my favorite WFSA like this string of numbers? (generalizes TSP)
2. Is the non-local pair order ok? (4 before 3 …? This is the LOP.)
3. Is the non-local triple order ok? (1…2…3?)
We can add these all up …
Costs are derived from source-sentence features
Initial order (French): 1 2 3 4 5 6 = Marie/NNP ne/NEG m’/PRP a/AUX pas/NEG vu/VBN, with the A and B matrices as above.
For example: ne would like to be brought adjacent to the next NEG word.
Costs are derived from source-sentence features
For example, the B entry for vu preceding Marie is 75 because:
50: a verb (e.g., vu) shouldn’t precede its subject (e.g., Marie)
+27: words at a distance of 5 shouldn’t swap order
-2: words with a PRP between them ought to swap
… = 75
We can also include phrase boundary symbols in the input!
Costs are derived from source-sentence features
FSA costs: a distortion model, and a language model, which looks ahead to the next step! (Does this order permit a good finite-state translation into good English?)
The dynamic program must pick the tree that leads to the lowest-cost permutation: initial order 1 2 3 4 5 6 → 1 4 2 5 6 3. The cost of this order:
1. Does my favorite WFSA like it as a string?
Scoring with a weighted FSA
This particular WFSA implements TSP scoring for N=3:
- After you read 1, you’re in state 1.
- After you read 2, you’re in state 2.
- After you read 3, you’re in state 3.
- …
and this state determines the cost of the next symbol you read.
We’ll handle a WFSA with Q states by using a fancier grammar, with nonterminals. (Now runtime goes up to O(N³Q³) …)
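The TSP-style scoring just described can be sketched in a few lines (the function name, cost table, and values below are invented for illustration, not taken from the talk): the WFSA’s state is simply the last symbol read, and that state determines the cost of the next symbol.

```python
# TSP-style WFSA scoring: after reading symbol s you are in state s,
# and that state fixes the cost of the next symbol you read.
# `arc_cost[prev][nxt]` is a hypothetical cost table; `INIT` is the start state.

INIT = 0  # start state, before any symbol has been read

def wfsa_score(permutation, arc_cost):
    """Sum the arc costs along the unique accepting path for this string."""
    total = 0
    state = INIT
    for symbol in permutation:
        total += arc_cost[state][symbol]
        state = symbol  # the WFSA remembers only the last symbol read
    return total

# Example with N=3: states INIT, 1, 2, 3 and a small illustrative cost table.
costs = {
    INIT: {1: 0, 2: 5, 3: 9},
    1:    {2: 2, 3: 7},
    2:    {1: 4, 3: 1},
    3:    {1: 8, 2: 6},
}
print(wfsa_score([1, 2, 3], costs))  # 0 + 2 + 1 = 3
print(wfsa_score([2, 1, 3], costs))  # 5 + 4 + 7 = 16
```

With Q states in general, each parse item must also track the WFSA state at each end of its span, which is where the O(N³Q³) runtime above comes from.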
![Page 67: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/67.jpg)
Including WFSA costs via nonterminals
words:        1   2   3   4   5   6
preterminals: 61  42  23  14  I5  56

A possible preterminal for word 2 is an arc in A that’s labeled with 2.
The preterminal 42 rewrites as word 2, with a cost equal to the arc’s cost.
![Page 68: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/68.jpg)
Including WFSA costs via nonterminals

words:        1   2   3   4   5   6
preterminals: 61  42  23  14  I5  56

[tree diagrams: higher nonterminals such as I3, 13, 43, 63, I5, I6 record the WFSA states at the two ends of each constituent’s span]

This constituent’s total cost is the total cost of the best 63 path.
The root’s total cost is the cost of the new permutation.
![Page 69: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/69.jpg)
Dynamic program must pick the tree that leads to the lowest-cost permutation.
initial order: 1 2 3 4 5 6
new order: 1 5 4 2 6 3 – cost of this order:
1. Does my favorite WFSA like it as a string?
2. Non-local pair order ok? (4 before 3 …?)
![Page 70: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/70.jpg)
Incorporating the pairwise ordering costs
So this hypothesis must add costs 5 < 1, 5 < 2, 5 < 3, 5 < 4, 6 < 1, 6 < 2, 6 < 3, 6 < 4, 7 < 1, 7 < 2, 7 < 3, 7 < 4
Uh-oh! So now it takes O(N²) time to combine two subtrees, instead of O(1) time?
Nope – dynamic programming to the rescue again!
positions: 1 2 3 4 5 6 7
This puts {5,6,7} before {1,2,3,4}.
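The naive accounting above can be sketched as follows (the function and the `pair_cost` table are hypothetical names for illustration): putting a right block before a left block adds one pairwise ordering cost for every (j, i) pair, which is O(N²) per block move.

```python
# Naive O(|left| * |right|) cost of a block move that puts `right` before `left`.
# `pair_cost[(j, i)]` is a hypothetical table: the cost of word j preceding word i.

def block_move_cost(left, right, pair_cost):
    """Sum pair costs for every j in `right` that now precedes every i in `left`."""
    return sum(pair_cost[(j, i)] for j in right for i in left)

# Example: moving {5,6,7} before {1,2,3,4} adds 3 * 4 = 12 pairwise costs.
pair_cost = {(j, i): 1 for j in (5, 6, 7) for i in (1, 2, 3, 4)}
print(block_move_cost([1, 2, 3, 4], [5, 6, 7], pair_cost))  # 12 pairs, each cost 1 -> 12
```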
![Page 71: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/71.jpg)
Computing LOP cost of a block move
positions: 1 2 3 4 5 6 7
This puts {5,6,7} before {1,2,3,4}.
So we have to add O(N²) costs just to consider this single neighbor!

[diagrams: five pair-cost grids over rows {5,6,7} and columns {1,2,3,4}; the needed rectangle of costs is written as a signed combination (= + - +) of rectangles already computed at earlier steps of parsing]

Reuse work from other, “narrower” block moves … the new cost is computed in O(1)!
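One standard way to get this constant-time combination (a sketch under the assumption that the pairwise costs sit in an explicit matrix; the talk’s own recurrence reuses narrower block moves directly during parsing) is a 2-D prefix-sum table: the summed cost of any rectangular block of pairs becomes a signed four-term combination, mirroring the inclusion-exclusion picture above.

```python
# 2-D prefix sums over a pair-cost matrix: after O(N^2) preprocessing,
# the summed cost of any rectangular block of pairs is O(1) via
# inclusion-exclusion, so each block move can be scored in constant time.

def prefix_sums(cost):  # cost is an N x N list of lists
    n = len(cost)
    P = [[0] * (n + 1) for _ in range(n + 1)]
    for r in range(n):
        for c in range(n):
            P[r + 1][c + 1] = cost[r][c] + P[r][c + 1] + P[r + 1][c] - P[r][c]
    return P

def rect_sum(P, r1, r2, c1, c2):
    """Sum of cost[r][c] for r1 <= r < r2, c1 <= c < c2 (signed 4-term sum)."""
    return P[r2][c2] - P[r1][c2] - P[r2][c1] + P[r1][c1]

cost = [[r * 10 + c for c in range(7)] for r in range(7)]  # toy 7x7 pair-cost table
P = prefix_sums(cost)
# cost of all pairs with j in rows {4,5,6} preceding i in columns {0,1,2,3} (0-based):
print(rect_sum(P, 4, 7, 0, 4))  # 618
```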
![Page 72: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/72.jpg)
Incorporating 3-way ordering costs
See the initial paper (Eisner & Tromble 2006).
A little tricky, but comes “for free” if you’re willing to accept a certain restriction on these costs; more expensive without that restriction, but possible.
![Page 73: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/73.jpg)
Another option: Markov chain Monte Carlo
Random walk in the space of permutations
- interpret a permutation’s cost as a log-probability
- sample a permutation from the neighborhood instead of always picking the most probable
Why?
- Simulated annealing might beat greedy-with-random-restarts
- When learning the parameters of the distribution, can use sampling to compute the feature expectations
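A toy Metropolis-style version of that random walk (illustrative only; the talk’s samplers draw from a very large tree-structured neighborhood, not adjacent swaps) interprets a permutation’s negated cost as a log-probability, p(π) ∝ exp(−cost(π)/T), and accepts or rejects proposed swaps accordingly:

```python
import math
import random

def metropolis_permutation_walk(perm, cost, steps, temperature=1.0, seed=0):
    """Random walk over permutations: propose an adjacent swap, accept with
    the Metropolis rule under p(pi) proportional to exp(-cost(pi)/T)."""
    rng = random.Random(seed)
    perm = list(perm)
    for _ in range(steps):
        i = rng.randrange(len(perm) - 1)
        neighbor = perm[:]
        neighbor[i], neighbor[i + 1] = neighbor[i + 1], neighbor[i]
        delta = cost(neighbor) - cost(perm)
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            perm = neighbor  # accept the proposed swap
    return perm

# Toy cost: number of out-of-order pairs, so the identity order is most probable.
def kendall_cost(p):
    return sum(p[i] > p[j] for i in range(len(p)) for j in range(i + 1, len(p)))

sample = metropolis_permutation_walk([6, 5, 4, 3, 2, 1], kendall_cost,
                                     steps=2000, temperature=0.3)
print(sample)  # low-cost (near-identity) permutations are sampled most often
```

Lowering the temperature over time turns this sampler into the simulated annealing mentioned above.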
![Page 74: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/74.jpg)
Another option: Markov chain Monte Carlo
Random walk in the space of permutations
- interpret a permutation’s cost as a log-probability
- sample a permutation from the neighborhood instead of always picking the most probable
How? Pitfall: sampling a permutation ≠ sampling a tree
- Spurious ambiguity: some permutations have many trees
- Solution: exclude some trees, leaving 1 per permutation
- A normal form has long been known for colored trees
- For restricted colored trees (which limit the size of blocks to swap), we have devised a more complicated normal form
![Page 75: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/75.jpg)
Learning the costs
Where do these costs come from? If we have some examples on which we know the true permutation, could try to learn them.

A =
  0   5 -22  93   8   6
 12   0   8 -31  -6  54
 -7  41   0  -9  24  82
 88  17  -6   0  12 -60
 11 -17  10 -59   0  23
 75   4 -12   6  55   0

B =
  0  15  22  80   5  -7
-30   0 -76  24  63 -44
 15  28   0 -15  71 -99
 12   8 -31   0  54  -6
  7  -9  41  24   0  82
  6   5 -22   8  93   0
![Page 76: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/76.jpg)
Learning the costs
(cost matrices A and B as on the previous slide)
Where do these costs come from? If we have some examples on which we know the true permutation, could try to learn them. More precisely, try to learn these weights θ (the knowledge that’s reused across examples):

50: a verb (e.g., vu) shouldn’t precede its subject (e.g., Marie)
+27: words at a distance of 5 shouldn’t swap order
-2: words with PRP between them ought to swap
…
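How such weights combine into a single pairwise cost can be sketched as follows (the function name, feature names, and firing set are illustrative stand-ins echoing the slide’s examples): each pair of words fires some features, and the pair’s cost is the sum of the corresponding weights in θ.

```python
# Pairwise ordering cost derived from source-sentence features:
# cost(j precedes i) = sum of theta-weights for features firing on (j, i).
# Feature names and weights below are illustrative, echoing the slide.

def pair_cost(theta, firing_features):
    """Sum the weights of the features that fire on this ordered word pair."""
    return sum(theta[f] for f in firing_features if f in theta)

theta = {
    "verb-precedes-its-subject": 50,   # e.g., vu before Marie
    "swap-at-distance-5": 27,
    "PRP-between-swapped-words": -2,
}

# Features that would fire if "vu" (position 6) preceded "Marie" (position 1):
firing = ["verb-precedes-its-subject", "swap-at-distance-5",
          "PRP-between-swapped-words"]
print(pair_cost(theta, firing))  # 50 + 27 - 2 = 75
```

Filling the matrices A and B amounts to evaluating this sum for every ordered word pair of the source sentence.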
![Page 77: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/77.jpg)
Experimenting with training LOP params (LOP is quite fast: O(n³) with no grammar constant)
PDS VMFIN PPER ADV APPR ART NN PTKNEG VVINF $.
Das kann ich so aus dem Stand nicht sagen .
B[7,9]
![Page 78: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/78.jpg)
LOP feature templates
![Page 79: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/79.jpg)
LOP feature templates
Only LOP features so far
And they’re unnecessarily simple (don’t examine syntactic constituency)
And the input sequence is only words (not interspersed with syntactic brackets)
![Page 80: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/80.jpg)
Learning LOP Costs for MT
Define German’ to be German in English word order.
To get German’ for training data, use Giza++ to align all German positions to English positions (disallow NULL).

[pipeline: German –LOP→ German’ –MOSES→ English; MOSES baseline: German –MOSES→ English]

(interesting, if odd, to try to reorder with only the LOP costs)
![Page 81: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/81.jpg)
Learning LOP Costs for MT (interesting, if odd, to try to reorder with only the LOP costs)
Easy first try: Naïve Bayes
- Treat each feature in θ as independent
- Count and normalize over the training data
- No real improvement over baseline
![Page 82: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/82.jpg)
Learning LOP Costs for MT (interesting, if odd, to try to reorder with only the LOP costs)
Easy second try: Perceptron

[diagram: local search path π0 → π1 → … → πn, with labels: search error, model error, global optimum, local optimum, update, gold standard, π*]

Note: Search error can be beneficial, e.g., just take 1 step from the identity permutation.
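The perceptron update itself can be sketched in a few lines (the feature extractor, feature names, and toy counts are hypothetical; sign conventions also depend on whether θ scores or costs permutations): when search returns a permutation other than the gold one, move the weights toward the gold standard’s features and away from the prediction’s.

```python
# Structured-perceptron update for reordering (sketch).
# Feature dicts map feature names to counts; the data here is a toy stand-in.
from collections import defaultdict

def perceptron_update(theta, gold_feats, pred_feats, lr=1.0):
    """Move weights toward the gold features, away from the predicted features."""
    for f, v in gold_feats.items():
        theta[f] += lr * v
    for f, v in pred_feats.items():
        theta[f] -= lr * v
    return theta

theta = defaultdict(float)
gold_feats = {"verb-after-subject": 1.0, "dist-5-kept": 1.0}   # hypothetical counts
pred_feats = {"verb-after-subject": 0.0, "dist-5-kept": -1.0}  # hypothetical counts
perceptron_update(theta, gold_feats, pred_feats)
print(dict(theta))  # {'verb-after-subject': 1.0, 'dist-5-kept': 2.0}
```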
![Page 83: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/83.jpg)
Benefit from reordering
| Learning method | BLEU vs. German′ | BLEU vs. English |
| --- | --- | --- |
| No reordering | 49.65 | 25.55 |
| Naïve Bayes—POS | 49.21 | |
| Naïve Bayes—POS+lexical | 49.75 | |
| Perceptron—POS | 50.05 | 25.92 |
| Perceptron—POS+lexical | 51.30 | 26.34 |

Obviously, not yet unscrambling German: need more features.
![Page 84: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/84.jpg)
Contrastive estimation (Smith & Eisner 2005)
Maximize the probability of the desired permutation relative to its ITG neighborhood.
- Requires summing all permutations in a neighborhood
- Must use normal-form trees here
Stochastic gradient descent.

[diagram: 1-step very-large-scale neighborhood around the gold standard π*; alternatively, work back from the gold standard]
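The contrastive objective for one example can be sketched as follows (the neighborhood and cost function here are toys; in the talk, the neighborhood sum is computed by dynamic programming over normal-form trees rather than by enumeration): interpret the negated cost as a log-probability and normalize only over the neighborhood.

```python
import math

def contrastive_log_prob(gold, neighborhood, cost):
    """log p(gold | neighborhood) with p(pi) proportional to exp(-cost(pi))."""
    log_z = math.log(sum(math.exp(-cost(p)) for p in neighborhood))
    return -cost(gold) - log_z

# Toy neighborhood: the gold order plus its two adjacent-swap neighbors.
gold = (1, 2, 3)
neighborhood = [(1, 2, 3), (2, 1, 3), (1, 3, 2)]
cost = lambda p: sum(p[i] > p[j] for i in range(3) for j in range(i + 1, 3))
print(contrastive_log_prob(gold, neighborhood, cost))
```

Stochastic gradient descent then raises this quantity, pushing probability mass from the neighbors onto the gold permutation.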
![Page 85: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/85.jpg)
k-best MIRA in the neighborhood
Make the gold standard beat its local competitors; beat the bad ones by a bigger margin.
- Good = close to gold in swap distance?
- Good = close to gold using BLEU?
- Good = translates into English that’s close to the reference?

[diagram: 1-step very-large-scale neighborhood with the gold standard π* and the current winners in the neighborhood; alternatively, work back from the gold standard]
![Page 86: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/86.jpg)
Alternatively, train each iterate

[diagram: iterates π(0) → π(1) → … → π(n), each compared to π*, with an update at every step using the model best in the neighborhood of π(0) and the oracle in the neighborhood of π(0)]

Or could do a k-best MIRA version of this, too; even use a loss measure based on lookahead to π(n).
![Page 87: Shuffling Non-Constituents](https://reader035.vdocuments.net/reader035/viewer/2022062410/56815809550346895dc57820/html5/thumbnails/87.jpg)
Summary of part II
Local search is fun and easy
- Popular elsewhere in AI
- Closely related to MCMC sampling
Probably useful for translation
- Maybe other NP-hard problems too
Can efficiently use huge local neighborhoods
- Algorithms are closely related to parsing and FSMs
- Our community knows that stuff better than anyone!