On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing
TRANSCRIPT
Alexander M. Rush, David Sontag, Michael Collins, and Tommi Jaakkola
Dynamic Programming
Dynamic programming is a dominant technique in NLP.
- Fast
- Exact
- Easy to implement
Examples:
- Viterbi algorithm for hidden Markov models
- CKY algorithm for weighted context-free grammars
y* = argmax_y f(y) ← Decoding
Model Complexity
Unfortunately, dynamic programming algorithms do not scale well with model complexity.
As our models become more complex, these algorithms can explode in terms of computational or implementational complexity.
Integration:
- f ← Easy
- g ← Easy
- f + g ← Hard
Integration (1)
[Parse tree: [S [NP [N Red]] [VP [V flies] [NP [D some] [A large] [N jet]]]]]
Red1 flies2 some3 large4 jet5
N V D A N
f(y) + g(z)
- Classical problem in NLP.
- The dynamic programming intersection is prohibitively slow and complicated to implement.
Integration (2)
[Parse tree: [S [NP [N Red]] [VP [V flies] [NP [D some] [A large] [N jet]]]]]
*0 Red1 flies2 some3 large4 jet5
f(y) + g(z)
- Important for improving parsing accuracy.
- The dynamic programming intersection is slow and complicated to implement.
Dual Decomposition
A general technique for constructing decoding algorithms
Solve complicated models
y* = argmax_y f(y)
by decomposing them into smaller problems.
Upshot: we can utilize a toolbox of combinatorial algorithms.
- Dynamic programming
- Minimum spanning tree
- Shortest path
- Min-cut
- ...
Dual Decomposition Algorithms
Simple - Uses basic dynamic programming algorithms
Efficient - Faster than full dynamic programming intersections
Strong Guarantees - Gives a certificate of optimality when exact
In experiments, we find the global optimum on 99% of examples.
Widely Applicable - Similar techniques extend to other problems
Roadmap
Algorithm
Experiments
LP Relaxations
Integrated Parsing and Tagging
Red1 flies2 some3 large4 jet5

HMM tagging:  N V D A N
CFG parse:  [S [NP [N Red]] [VP [V flies] [NP [D some] [A large] [N jet]]]]

Dual Decomposition combines the HMM and the CFG.
HMM for Tagging
Red1 flies2 some3 large4 jet5
N V D A N
- Let Z be the set of all valid taggings of a sentence and g(z) be a scoring function,
  e.g. g(z) = log p(Red1|N) + log p(V|N) + ...
z* = argmax_{z∈Z} g(z) ← Viterbi decoding
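The Viterbi decoding step can be sketched in a few lines. This is a minimal, self-contained version for a toy HMM tagger; the model dictionaries `log_trans`, `log_emit`, and `log_init` are hypothetical stand-ins for whatever trained model supplies the log-probability scores g(z) sums over.

```python
import math

def viterbi(words, tags, log_trans, log_emit, log_init):
    """Viterbi decoding: z* = argmax_{z in Z} g(z) for an HMM tagger,
    where g(z) sums log transition and log emission scores."""
    n = len(words)
    # best[i][t] = best log score of any tagging of words[:i+1] ending in tag t
    best = [{t: log_init[t] + log_emit[t].get(words[0], -1e9) for t in tags}]
    back = [{}]
    for i in range(1, n):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda s: best[i - 1][s] + log_trans[s][t])
            best[i][t] = (best[i - 1][prev] + log_trans[prev][t]
                          + log_emit[t].get(words[i], -1e9))
            back[i][t] = prev
    # follow backpointers from the best final tag
    last = max(tags, key=lambda t: best[n - 1][t])
    z = [last]
    for i in range(n - 1, 0, -1):
        z.append(back[i][z[-1]])
    return list(reversed(z))
```

The chart stores one entry per (position, tag) pair, so decoding runs in O(n t^2) time, which is what makes Viterbi a cheap subproblem later on.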
CFG for Parsing
[Parse tree: [S [NP [N Red]] [VP [V flies] [NP [D some] [A large] [N jet]]]]]
- Let Y be the set of all valid parse trees for a sentence and f(y) be a scoring function,
  e.g. f(y) = log p(S → NP VP|S) + log p(NP → N|NP) + ...
y* = argmax_{y∈Y} f(y) ← CKY algorithm
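A minimal CKY decoder for a weighted grammar in Chomsky normal form gives the flavor of this step. The rule tables `unary` and `binary` here are hypothetical stand-ins for a trained grammar; the scores play the role of the log probabilities above.

```python
from collections import defaultdict

def cky(words, unary, binary):
    """CKY decoding: y* = argmax_{y in Y} f(y) for a weighted CFG in CNF.
    unary:  {(A, word): score}  for rules A -> word
    binary: {(A, B, C): score}  for rules A -> B C
    Returns (best score of an S-rooted tree, bracketed tree string)."""
    chart = defaultdict(lambda: (float("-inf"), None))
    n = len(words)
    for i, w in enumerate(words):
        for (A, word), s in unary.items():
            if word == w and s > chart[(i, i + 1, A)][0]:
                chart[(i, i + 1, A)] = (s, f"[{A} {w}]")
    for span in range(2, n + 1):            # widen spans bottom-up
        for i in range(n - span + 1):
            k = i + span
            for j in range(i + 1, k):       # split point
                for (A, B, C), s in binary.items():
                    lb, lt = chart[(i, j, B)]
                    rb, rt = chart[(j, k, C)]
                    if lt and rt and s + lb + rb > chart[(i, k, A)][0]:
                        chart[(i, k, A)] = (s + lb + rb, f"[{A} {lt} {rt}]")
    return chart[(0, n, "S")]
```

For a sentence of length n and grammar with G rules this is O(n^3 G), already noticeably heavier than Viterbi, which is why the intersected dynamic program is painful.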
Problem Definition
[Parse tree: [S [NP [N Red]] [VP [V flies] [NP [D some] [A large] [N jet]]]]]
- Find the parse tree that optimizes
  score(S → NP VP) + score(VP → V NP) + ... + score(Red1, N) + score(V, N) + ...
- Conventional approach (Bar-Hillel et al., 1961): replace rules like S → NP VP with rules like S_{N,N} → NP_{N,V} VP_{V,N}.
  Painful: O(t^6) increase in complexity for trigram tagging.
The Integrated Parsing and Tagging Problem
Find argmax_{y∈Y, z∈Z} f(y) + g(z)
such that for all i, t: y(i, t) = z(i, t)
(Y: trees, scored by the CFG; Z: taggings, scored by the HMM; the equalities are the agreement constraints)
Where y(i, t) = 1 if the parse includes tag t at position i,
and z(i, t) = 1 if the tagging includes tag t at position i.
Algorithm Sketch
Set all penalty weights u(i, t) equal to 0.
For k = 1 to K:
  y(k) ← Decode(f(y) + penalty) by the CKY algorithm
  z(k) ← Decode(g(z) − penalty) by Viterbi decoding
  If y(k)(i, t) = z(k)(i, t) for all i, t: Return (y(k), z(k))
  Else: Update penalty weights based on y(k)(i, t) − z(k)(i, t)
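The sketch above is a subgradient algorithm on the dual. Here is a minimal generic driver, assuming `decode_f` and `decode_g` are black-box decoders (e.g. CKY and Viterbi) that return the set of (position, tag) pairs chosen by their argmax solutions; the `step / (k + 1)` step-size rule is one common choice, not the only one.

```python
def dual_decompose(decode_f, decode_g, positions, tags, max_iters=50, step=1.0):
    """Dual decomposition by subgradient descent on the penalties u(i, t).
    decode_f(u) returns the parse's tag set {(i, t)} maximizing f(y) + u.y
    decode_g(u) returns the tagging  {(i, t)} maximizing g(z) - u.z"""
    u = {(i, t): 0.0 for i in positions for t in tags}
    y = z = None
    for k in range(max_iters):
        y = decode_f(u)          # e.g. CKY over f(y) + sum u(i,t) y(i,t)
        z = decode_g(u)          # e.g. Viterbi over g(z) - sum u(i,t) z(i,t)
        if y == z:
            return y, z, True    # agreement: certificate of optimality
        rate = step / (k + 1)    # decreasing step size
        for key in u:
            # subgradient update: penalize tags the parser chose alone,
            # reward tags the tagger chose alone
            u[key] -= rate * ((key in y) - (key in z))
    return y, z, False
```

With this sign convention one round of disagreement produces exactly the kind of table on the slides, e.g. u(1, A) = −1 when the parser alone chose tag A at position 1 and u(1, N) = +1 when the tagger alone chose N there.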
CKY Parsing / Viterbi Decoding (worked example)

y* = argmax_{y∈Y} (f(y) + Σ_{i,t} u(i, t) y(i, t))   ← CKY parsing
z* = argmax_{z∈Z} (g(z) − Σ_{i,t} u(i, t) z(i, t))   ← Viterbi decoding

Sentence: Red1 flies2 some3 large4 jet5

Penalties: u(i, t) = 0 for all i, t initially.
Iteration 1: u(1, A) = -1, u(1, N) = +1, u(2, N) = -1, u(2, V) = +1, u(5, V) = -1, u(5, N) = +1
Iteration 2: u(5, V) = -1, u(5, N) = +1
Converged: y* = argmax_{y∈Y} f(y) + g(y)

Key: f(y) ⇐ CFG, g(z) ⇐ HMM, Y ⇐ parse trees, Z ⇐ taggings,
y(i, t) = 1 if y contains tag t at position i
Guarantees
Theorem: If at any iteration y(k)(i, t) = z(k)(i, t) for all i, t, then (y(k), z(k)) is the global optimum.
In experiments, we find the global optimum on 99% of examples.
If we do not converge to a match, we can still get a result (more in the paper).
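The theorem is an instance of weak duality; a sketch of the argument in the slides' notation:

```latex
L(u) \;=\; \max_{y \in Y}\Bigl(f(y) + \sum_{i,t} u(i,t)\,y(i,t)\Bigr)
     \;+\; \max_{z \in Z}\Bigl(g(z) - \sum_{i,t} u(i,t)\,z(i,t)\Bigr)
\;\ge\; \max_{\substack{y \in Y,\, z \in Z \\ y(i,t) = z(i,t)\ \forall i,t}} f(y) + g(z)
```

The inequality holds for every choice of penalties u, because for any agreeing pair the two penalty sums cancel. If the two maximizers themselves agree at iteration k, the penalties again cancel, so L(u) = f(y(k)) + g(z(k)): the pair attains the upper bound and is therefore a global optimum of the constrained problem.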
Integrated CFG and Dependency Parsing
Red1 flies2 some3 large4 jet5

Dependency model:  *0 Red1 flies2 some3 large4 jet5
Lexicalized CFG:  [S(flies) [NP [N Red]] [VP(flies) [V flies] [NP(jet) [D some] [A large] [N jet]]]]

Dual Decomposition combines the dependency model and the lexicalized CFG.
Dependency Parsing
*0 Red1 flies2 some3 large4 jet5
- Let Z be the set of all valid dependency parses of a sentence and g(z) be a scoring function,
  e.g. g(z) = log p(some3|jet5, large4) + ...
z* = argmax_{z∈Z} g(z) ← Eisner (2000) algorithm
Lexicalized PCFG
[Parse tree: [S(flies) [NP [N Red]] [VP(flies) [V flies] [NP(jet) [D some] [A large] [N jet]]]]]
- Let Y be the set of all valid lexicalized parse trees for a sentence and f(y) be a scoring function,
  e.g. f(y) = log p(S(flies) → NP(Red) VP(flies)|S(flies)) + ...
y* = argmax_{y∈Y} f(y) ← Modified CKY algorithm
The Integrated Constituency and Dependency Parsing Problem
Find argmax_{y∈Y, z∈Z} f(y) + g(z)
such that for all i, j: y(i, j) = z(i, j)
(Y: parse trees, scored by the CFG; Z: dependency trees, scored by the dependency model; the equalities are the agreement constraints)
Where y(i, j) = 1 if the parse includes a dependency from word i to word j,
and z(i, j) = 1 if the dependency parse includes a dependency from word i to word j.
CKY Parsing / Dependency Parsing (worked example)

y* = argmax_{y∈Y} (f(y) + Σ_{i,j} u(i, j) y(i, j))   ← CKY parsing (lexicalized CFG)
z* = argmax_{z∈Z} (g(z) − Σ_{i,j} u(i, j) z(i, j))   ← dependency parsing

Sentence: *0 Red1 flies2 some3 large4 jet5

Penalties: u(i, j) = 0 for all i, j initially.
Iteration 1: u(2, 3) = -1, u(5, 3) = +1
Converged: y* = argmax_{y∈Y} f(y) + g(y)

Key: f(y) ⇐ CFG, g(z) ⇐ dependency model, Y ⇐ parse trees, Z ⇐ dependency trees,
y(i, j) = 1 if y contains dependency (i, j)
![Page 63: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022021207/620628638c2f7b173004f8f5/html5/thumbnails/63.jpg)
Roadmap
Algorithm
Experiments
LP Relaxations
![Page 64: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022021207/620628638c2f7b173004f8f5/html5/thumbnails/64.jpg)
Experiment
Properties:
I Exactness
I Parsing Accuracy
Experiments on:
I English Penn Treebank
Models
I Collins (1997) Model 1
I Semi-Supervised Dependency Parser (Koo, 2008)
I Trigram Tagger (Toutanova, 2000)
![Page 65: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022021207/620628638c2f7b173004f8f5/html5/thumbnails/65.jpg)
How quickly do the models converge?
[Plots: percentage of examples converged vs. number of iterations (≤1, ≤2, ≤3, ≤4, ≤10, ≤20, ≤50), for Integrated Dependency Parsing and Integrated POS Tagging]
![Page 66: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022021207/620628638c2f7b173004f8f5/html5/thumbnails/66.jpg)
Integrated Constituency and Dependency Parsing: Accuracy
[Bar chart: F1 scores in the 87–92 range for Collins, Dep, and Dual]
I Collins (1997) Model 1
I Fixed, First-best Dependencies from Koo (2008)
I Dual Decomposition
![Page 67: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022021207/620628638c2f7b173004f8f5/html5/thumbnails/67.jpg)
Integrated Parsing and Tagging: Accuracy
[Bar chart: F1 scores in the 87–92 range for Fixed and Dual]
I Fixed, First-Best Tags From Toutanova (2000)
I Dual Decomposition
![Page 68: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022021207/620628638c2f7b173004f8f5/html5/thumbnails/68.jpg)
Roadmap
Algorithm
Experiments
LP Relaxations
![Page 69: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022021207/620628638c2f7b173004f8f5/html5/thumbnails/69.jpg)
Dual Decomposition and Linear Programming Relaxations
Theorem
I If the dual decomposition algorithm converges, then (y^(k), z^(k)) is the global optimum.
Questions
I What problem is dual decomposition solving?
I Why doesn't the algorithm always converge?
Dual decomposition searches over a linear programming relaxation of the original problem.
![Page 70: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022021207/620628638c2f7b173004f8f5/html5/thumbnails/70.jpg)
Convex Hulls for CKY
A parse tree can be represented as a binary vector y ∈ Y, with y(A → B C, i, j, k) = 1 if rule A → B C is used over span (i, j, k).

[Diagram: the discrete set Y, its convex hull conv(Y), and the weight vector w pointing to the optimal vertex y∗]

I If f is linear, arg max_{μ ∈ conv(Y)} f(μ) is a linear program.
I A linear program attains its optimum at a vertex, so CKY solves this LP.
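The second bullet can be spelled out. This is a standard LP fact, written here in the slide's notation under the assumption f(μ) = w · μ: a linear objective over a polytope attains its maximum at a vertex, and the vertices of conv(Y) are exactly the discrete parses, so

```latex
\max_{\mu \in \mathrm{conv}(\mathcal{Y})} w \cdot \mu
  \;=\; \max_{y \in \mathcal{Y}} w \cdot y .
```

The combinatorial algorithm (CKY) and the LP therefore have the same optimal value.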
![Page 74: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022021207/620628638c2f7b173004f8f5/html5/thumbnails/74.jpg)
Combined Problem
Q = {(y, z) : y ∈ Y, z ∈ Z, y(i, t) = z(i, t) for all (i, t)}

Q′ = {(μ, ν) : μ ∈ conv(Y), ν ∈ conv(Z), μ(i, t) = ν(i, t) for all (i, t)}

[Diagram: Q, conv(Q), and the outer bound Q′ ⊇ conv(Q), with possible optima (y∗, z∗) for different weight vectors w]

Dual decomposition searches over Q′.

Depending on the weight vector, (y∗, z∗) ∈ Q′ could lie in Q or strictly in the outer bound.
![Page 81: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022021207/620628638c2f7b173004f8f5/html5/thumbnails/81.jpg)
Are there points strictly in the outer bound?
[Diagram: a possible (y∗, z∗) in Q′ strictly outside conv(Q)]

Taggings: 0.5 × (w1/A w2/A w3/A) + 0.5 × (w1/A w2/B w3/B)

Parses: 0.5 × (X (A w1) (X (A w2) (B w3))) + 0.5 × (X (A w1) (X (B w2) (A w3)))

The best result can be a fractional solution: a convex combination of these structures.
![Page 82: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022021207/620628638c2f7b173004f8f5/html5/thumbnails/82.jpg)
Summary
A Dual Decomposition algorithm for integrated decoding
Simple - Uses only off-the-shelf dynamic programming algorithms to solve a harder problem.

Efficient - Faster than classical methods for dynamic programming intersection.

Strong Guarantees - Solves a linear programming relaxation which gives a certificate of optimality. Finds the exact solution on 99% of the examples.

Widely Applicable - Similar techniques extend to other problems.
![Page 83: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022021207/620628638c2f7b173004f8f5/html5/thumbnails/83.jpg)
Appendix
![Page 84: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022021207/620628638c2f7b173004f8f5/html5/thumbnails/84.jpg)
Iterative Progress
[Plot: F1 score, % certificates, and % match with K = 50, as a function of the maximum number of dual decomposition iterations (0–50); y-axis is percentage, 50–100]
![Page 85: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022021207/620628638c2f7b173004f8f5/html5/thumbnails/85.jpg)
Deriving the Algorithm

Goal: y∗ = arg max_{y ∈ Y} f(y) + g(y)

Rewrite: arg max_{y ∈ Y, z ∈ Z} f(y) + g(z)
s.t. y(i, j) = z(i, j) for all i, j

Lagrangian: L(u, y, z) = f(y) + g(z) + ∑_{i,j} u(i, j) (y(i, j) − z(i, j))

The dual problem is to find min_u L(u), where

L(u) = max_{y ∈ Y, z ∈ Z} L(u, y, z)
     = max_{y ∈ Y} ( f(y) + ∑_{i,j} u(i, j) y(i, j) ) + max_{z ∈ Z} ( g(z) − ∑_{i,j} u(i, j) z(i, j) )

The dual is an upper bound: L(u) ≥ f(y∗) + g(y∗) for any u.
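The upper-bound claim follows in one line: for any u, the optimal y∗ of the original problem gives a feasible choice y = z = y∗ inside the max, and the penalty term cancels there:

```latex
L(u) \;=\; \max_{y \in \mathcal{Y},\, z \in \mathcal{Z}} L(u, y, z)
      \;\ge\; L(u, y^{*}, y^{*})
      \;=\; f(y^{*}) + g(y^{*}) + \sum_{i,j} u(i,j)\,\bigl(y^{*}(i,j) - y^{*}(i,j)\bigr)
      \;=\; f(y^{*}) + g(y^{*}).
```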
![Page 87: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022021207/620628638c2f7b173004f8f5/html5/thumbnails/87.jpg)
A Subgradient Algorithm for Minimizing L(u)

L(u) = max_{y ∈ Y} ( f(y) + ∑_{i,j} u(i, j) y(i, j) ) + max_{z ∈ Z} ( g(z) − ∑_{i,j} u(i, j) z(i, j) )

L(u) is convex, but not differentiable. A subgradient of L(u) at u is a vector g_u such that for all v,

L(v) ≥ L(u) + g_u · (v − u)

Subgradient methods use updates u′ = u − α g_u.

In fact, for our L(u), g_u(i, j) = y∗(i, j) − z∗(i, j), where y∗ and z∗ are the maximizers of the two subproblems at u.
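The subgradient inequality can be checked numerically on a toy two-structure problem with made-up scores (not the talk's models), using the sign convention from the main slides: the y-subproblem receives +u and the z-subproblem −u, so g_u(i, j) = y∗(i, j) − z∗(i, j).

```python
# Numeric check of the subgradient inequality L(v) >= L(u) + g_u . (v - u)
# on a toy two-structure problem with made-up scores.
ARCS = [(2, 3), (5, 3)]
CANDS = [{(2, 3): 1, (5, 3): 0},
         {(2, 3): 0, (5, 3): 1}]
F = [1.0, 0.8]   # hypothetical f scores
G = [0.5, 1.0]   # hypothetical g scores

def dot(u, s):
    return sum(u[a] * s[a] for a in ARCS)

def L_and_subgrad(u):
    # Solve both subproblems by enumeration; return L(u) and a subgradient g_u.
    iy = max(range(2), key=lambda k: F[k] + dot(u, CANDS[k]))
    iz = max(range(2), key=lambda k: G[k] - dot(u, CANDS[k]))
    value = (F[iy] + dot(u, CANDS[iy])) + (G[iz] - dot(u, CANDS[iz]))
    grad = {a: CANDS[iy][a] - CANDS[iz][a] for a in ARCS}   # y*(i,j) - z*(i,j)
    return value, grad

u = {(2, 3): 0.0, (5, 3): 0.0}
v = {(2, 3): -1.0, (5, 3): 1.0}
L_u, g_u = L_and_subgrad(u)
L_v, _ = L_and_subgrad(v)
# L upper-bounds the best combined score (1.8 on this instance),
# and the subgradient inequality holds for this pair (u, v).
assert L_u >= 1.8 and L_v >= 1.8
assert L_v >= L_u + sum(g_u[a] * (v[a] - u[a]) for a in ARCS)
```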
![Page 88: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing](https://reader030.vdocuments.net/reader030/viewer/2022021207/620628638c2f7b173004f8f5/html5/thumbnails/88.jpg)
Related Work
I Methods that use general-purpose linear programming or integer linear programming solvers (Martins et al., 2009; Riedel and Clarke, 2006; Roth and Yih, 2005)
I Dual decomposition/Lagrangian relaxation in combinatorial optimization (Dantzig and Wolfe, 1960; Held and Karp, 1970; Fisher, 1981)
I Dual decomposition for inference in MRFs (Komodakis et al., 2007; Wainwright et al., 2005)
I Methods that incorporate combinatorial solvers within loopy belief propagation (Duchi et al., 2007; Smith and Eisner, 2008)