on dual decomposition and linear programming relaxations for natural language processing

88
On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing Alexander M. Rush, David Sontag, Michael Collins, and Tommi Jaakkola

Upload: others

Post on 11-Feb-2022

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

On Dual Decompositionand Linear Programming Relaxations

for Natural Language Processing

Alexander M. Rush, David Sontag,Michael Collins, and Tommi Jaakkola

Page 2: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Dynamic Programming

Dynamic programming is a dominant technique in NLP.

I Fast

I Exact

I Easy to implement

Examples:

I Viterbi algorithm for hidden Markov models

I CKY algorithm for weighted context-free grammars

y∗ = arg maxy

f (y) ← Decoding

Page 3: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Model Complexity

Unfortunately, dynamic programming algorithms do not scale wellwith model complexity.

As our models become complex, these algorithms can explode interms of computational or implementational complexity.

Integration:

I f ← Easy

I g ← Easy

I f + g ← Hard

Page 4: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Integration (1)

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

Red1 flies2 some3 large4 jet5

N V D A N

f (y) + g(z)

I Classical problem in NLP.

I The dynamic programming intersection is prohibitively slowand complicated to implement.

Page 5: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Integration (2)

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

*0 Red1 flies2 some3 large4 jet5

f (y) + g(z)

I Important for improving parsing accuracy.

I The dynamic programming intersection is slow andcomplicated to implement.

Page 6: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Dual Decomposition

A general technique for constructing decoding algorithms

Solve complicated models

y∗ = arg maxy

f (y)

by decomposing into smaller problems.

Upshot: Can utilize a toolbox of combinatorial algorithms.

I Dynamic programming

I Minimum spanning tree

I Shortest path

I Min-Cut

I ...

Page 7: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Dual Decomposition Algorithms

Simple - Uses basic dynamic programming algorithms

Efficient - Faster than full dynamic programming intersections

Strong Guarantees - Gives a certificate of optimality when exact

In experiments, we find the global optimum on 99% of examples.

Widely Applicable - Similar techniques extend to other problems

Page 8: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Roadmap

Algorithm

Experiments

LP Relaxations

Page 9: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Integrated Parsing and Tagging

Red1 flies2 some3 large4 jet5

Red1 flies2 some3 large4 jet5

N V D A N

Red1 flies2 some3 large4 jet5

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

HMM

Dual Decomposition

CFG

Page 10: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Integrated Parsing and Tagging

Red1 flies2 some3 large4 jet5

Red1 flies2 some3 large4 jet5

N V D A N

Red1 flies2 some3 large4 jet5

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

HMM

Dual Decomposition

CFG

Page 11: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

HMM for Tagging

Red1 flies2 some3 large4 jet5

N V D A N

I Let Z be the set of all valid taggings of a sentence

and g(z) be a scoring function.

e.g. g(z) = log p(Red1|N) + log p(V|N) + ...

z∗ = arg maxz∈Z

g(z) ← Viterbi decoding

Page 12: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

HMM for Tagging

Red1 flies2 some3 large4 jet5

N V D A N

I Let Z be the set of all valid taggings of a sentence

and g(z) be a scoring function.

e.g. g(z) = log p(Red1|N) + log p(V|N) + ...

z∗ = arg maxz∈Z

g(z) ← Viterbi decoding

Page 13: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CFG for Parsing

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

I Let Y be the set of all valid parse trees for a sentence and

f (y) be a scoring function.

e.g. f (y) = log p(S → NP VP|S) + log p(NP → N|NP) + ...

y∗ = arg maxy∈Y

f (y) ← CKY Algorithm

Page 14: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CFG for Parsing

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

I Let Y be the set of all valid parse trees for a sentence and

f (y) be a scoring function.

e.g. f (y) = log p(S → NP VP|S) + log p(NP → N|NP) + ...

y∗ = arg maxy∈Y

f (y) ← CKY Algorithm

Page 15: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Problem DefinitionS

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

I Find parse tree that optimizes

score(S → NP VP) + score(VP → V NP) +

...+ score(Red1,N) + score(V,N) + ...

I Conventional Approach (Bar Hillel et al., 1961)I Replace rules like S → NP VP

with rules like SN,N → NPN,V VPV ,N

Painful. O(t6) increase in complexity for trigram tagging.

Page 16: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

The Integrated Parsing and Tagging Problem

Find argmax

y∈ Y, z∈ Zf (y) + g(z)

such that for all i , t, y(i , t) = z(i , t)

Trees Taggings

CFG HMM

Constraints

Where y(i , t) = 1 if parse includes tag t at position i

z(i , t) = 1 if tagging includes tag t at position i

Page 17: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

The Integrated Parsing and Tagging Problem

Find argmax

y∈ Y, z∈ Zf (y) + g(z)

such that for all i , t, y(i , t) = z(i , t)

Trees

Taggings

CFG HMM

Constraints

Where y(i , t) = 1 if parse includes tag t at position i

z(i , t) = 1 if tagging includes tag t at position i

Page 18: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

The Integrated Parsing and Tagging Problem

Find argmax

y∈ Y, z∈ Zf (y) + g(z)

such that for all i , t, y(i , t) = z(i , t)

Trees Taggings

CFG HMM

Constraints

Where y(i , t) = 1 if parse includes tag t at position i

z(i , t) = 1 if tagging includes tag t at position i

Page 19: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

The Integrated Parsing and Tagging Problem

Find argmax

y∈ Y, z∈ Zf (y) + g(z)

such that for all i , t, y(i , t) = z(i , t)

Trees Taggings

CFG

HMM

Constraints

Where y(i , t) = 1 if parse includes tag t at position i

z(i , t) = 1 if tagging includes tag t at position i

Page 20: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

The Integrated Parsing and Tagging Problem

Find argmax

y∈ Y, z∈ Zf (y) + g(z)

such that for all i , t, y(i , t) = z(i , t)

Trees Taggings

CFG HMM

Constraints

Where y(i , t) = 1 if parse includes tag t at position i

z(i , t) = 1 if tagging includes tag t at position i

Page 21: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

The Integrated Parsing and Tagging Problem

Find argmax

y∈ Y, z∈ Zf (y) + g(z)

such that for all i , t, y(i , t) = z(i , t)

Trees Taggings

CFG HMM

Constraints

Where y(i , t) = 1 if parse includes tag t at position i

z(i , t) = 1 if tagging includes tag t at position i

Page 22: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Algorithm Sketch

Set penalty weights equal to 0 for the tag at each position.

For k = 1 to K

y (k) ← Decode (f (y) + penalty) by CKY Algorithm

z(k) ← Decode (g(z)− penalty) by Viterbi Decoding

If y (k)(i , t) = z(k)(i , t) for all i , t Return (y (k), z(k))

Page 23: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Algorithm Sketch

Set penalty weights equal to 0 for the tag at each position.

For k = 1 to K

y (k) ← Decode (f (y) + penalty) by CKY Algorithm

z(k) ← Decode (g(z)− penalty) by Viterbi Decoding

If y (k)(i , t) = z(k)(i , t) for all i , t Return (y (k), z(k))

Page 24: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Algorithm Sketch

Set penalty weights equal to 0 for the tag at each position.

For k = 1 to K

y (k) ← Decode (f (y) + penalty) by CKY Algorithm

z(k) ← Decode (g(z)− penalty) by Viterbi Decoding

If y (k)(i , t) = z(k)(i , t) for all i , t Return (y (k), z(k))

Page 25: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Algorithm Sketch

Set penalty weights equal to 0 for the tag at each position.

For k = 1 to K

y (k) ← Decode (f (y) + penalty) by CKY Algorithm

z(k) ← Decode (g(z)− penalty) by Viterbi Decoding

If y (k)(i , t) = z(k)(i , t) for all i , t Return (y (k), z(k))

Page 26: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Algorithm Sketch

Set penalty weights equal to 0 for the tag at each position.

For k = 1 to K

y (k) ← Decode (f (y) + penalty) by CKY Algorithm

z(k) ← Decode (g(z)− penalty) by Viterbi Decoding

If y (k)(i , t) = z(k)(i , t) for all i , t Return (y (k), z(k))

Else Update penalty weights based on y (k)(i , t)− z(k)(i , t)

Page 27: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,t

u(i , t)y(i , t))

Viterbi Decoding

Red1 flies2 some3 large4 jet5

N V D A NN V D A NA N D A NA N D A NN V D A N

z∗ = arg maxz∈Z

(g(z)−∑i ,t

u(i , t)z(i , t))

Penaltiesu(i , t) = 0 for all i ,t

Iteration 1

u(1,A) -1

u(1,N) 1

u(2,N) -1

u(2,V ) 1

u(5,V ) -1

u(5,N) 1

Iteration 2

u(5,V ) -1

u(5,N) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ HMMY ⇐ Parse Trees Z ⇐ Taggingsy(i , t) = 1 if y contains tag t at position i

Page 28: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,t

u(i , t)y(i , t))

Viterbi Decoding

Red1 flies2 some3 large4 jet5

N V D A NN V D A NA N D A NA N D A NN V D A N

z∗ = arg maxz∈Z

(g(z)−∑i ,t

u(i , t)z(i , t))

Penaltiesu(i , t) = 0 for all i ,t

Iteration 1

u(1,A) -1

u(1,N) 1

u(2,N) -1

u(2,V ) 1

u(5,V ) -1

u(5,N) 1

Iteration 2

u(5,V ) -1

u(5,N) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ HMMY ⇐ Parse Trees Z ⇐ Taggingsy(i , t) = 1 if y contains tag t at position i

Page 29: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,t

u(i , t)y(i , t))

Viterbi Decoding

Red1 flies2 some3 large4 jet5

N V D A N

N V D A NA N D A NA N D A NN V D A N

z∗ = arg maxz∈Z

(g(z)−∑i ,t

u(i , t)z(i , t))

Penaltiesu(i , t) = 0 for all i ,t

Iteration 1

u(1,A) -1

u(1,N) 1

u(2,N) -1

u(2,V ) 1

u(5,V ) -1

u(5,N) 1

Iteration 2

u(5,V ) -1

u(5,N) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ HMMY ⇐ Parse Trees Z ⇐ Taggingsy(i , t) = 1 if y contains tag t at position i

Page 30: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,t

u(i , t)y(i , t))

Viterbi Decoding

Red1 flies2 some3 large4 jet5

N V D A N

N V D A N

A N D A NA N D A NN V D A N

z∗ = arg maxz∈Z

(g(z)−∑i ,t

u(i , t)z(i , t))

Penaltiesu(i , t) = 0 for all i ,t

Iteration 1

u(1,A) -1

u(1,N) 1

u(2,N) -1

u(2,V ) 1

u(5,V ) -1

u(5,N) 1

Iteration 2

u(5,V ) -1

u(5,N) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ HMMY ⇐ Parse Trees Z ⇐ Taggingsy(i , t) = 1 if y contains tag t at position i

Page 31: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,t

u(i , t)y(i , t))

Viterbi Decoding

Red1 flies2 some3 large4 jet5

N V D A N

N V D A N

A N D A NA N D A NN V D A N

z∗ = arg maxz∈Z

(g(z)−∑i ,t

u(i , t)z(i , t))

Penaltiesu(i , t) = 0 for all i ,t

Iteration 1

u(1,A) -1

u(1,N) 1

u(2,N) -1

u(2,V ) 1

u(5,V ) -1

u(5,N) 1

Iteration 2

u(5,V ) -1

u(5,N) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ HMMY ⇐ Parse Trees Z ⇐ Taggingsy(i , t) = 1 if y contains tag t at position i

Page 32: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,t

u(i , t)y(i , t))

Viterbi Decoding

Red1 flies2 some3 large4 jet5

N V D A NN V D A NA N D A NA N D A NN V D A N

z∗ = arg maxz∈Z

(g(z)−∑i ,t

u(i , t)z(i , t))

Penaltiesu(i , t) = 0 for all i ,t

Iteration 1

u(1,A) -1

u(1,N) 1

u(2,N) -1

u(2,V ) 1

u(5,V ) -1

u(5,N) 1

Iteration 2

u(5,V ) -1

u(5,N) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ HMMY ⇐ Parse Trees Z ⇐ Taggingsy(i , t) = 1 if y contains tag t at position i

Page 33: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,t

u(i , t)y(i , t))

Viterbi Decoding

Red1 flies2 some3 large4 jet5

N V D A NN V D A NA N D A NA N D A NN V D A N

z∗ = arg maxz∈Z

(g(z)−∑i ,t

u(i , t)z(i , t))

Penaltiesu(i , t) = 0 for all i ,t

Iteration 1

u(1,A) -1

u(1,N) 1

u(2,N) -1

u(2,V ) 1

u(5,V ) -1

u(5,N) 1

Iteration 2

u(5,V ) -1

u(5,N) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ HMMY ⇐ Parse Trees Z ⇐ Taggingsy(i , t) = 1 if y contains tag t at position i

Page 34: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,t

u(i , t)y(i , t))

Viterbi Decoding

Red1 flies2 some3 large4 jet5

N V D A NN V D A N

A N D A N

A N D A NN V D A N

z∗ = arg maxz∈Z

(g(z)−∑i ,t

u(i , t)z(i , t))

Penaltiesu(i , t) = 0 for all i ,t

Iteration 1

u(1,A) -1

u(1,N) 1

u(2,N) -1

u(2,V ) 1

u(5,V ) -1

u(5,N) 1

Iteration 2

u(5,V ) -1

u(5,N) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ HMMY ⇐ Parse Trees Z ⇐ Taggingsy(i , t) = 1 if y contains tag t at position i

Page 35: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,t

u(i , t)y(i , t))

Viterbi Decoding

Red1 flies2 some3 large4 jet5

N V D A NN V D A NA N D A N

A N D A N

N V D A N

z∗ = arg maxz∈Z

(g(z)−∑i ,t

u(i , t)z(i , t))

Penaltiesu(i , t) = 0 for all i ,t

Iteration 1

u(1,A) -1

u(1,N) 1

u(2,N) -1

u(2,V ) 1

u(5,V ) -1

u(5,N) 1

Iteration 2

u(5,V ) -1

u(5,N) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ HMMY ⇐ Parse Trees Z ⇐ Taggingsy(i , t) = 1 if y contains tag t at position i

Page 36: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,t

u(i , t)y(i , t))

Viterbi Decoding

Red1 flies2 some3 large4 jet5

N V D A NN V D A NA N D A NA N D A NN V D A N

z∗ = arg maxz∈Z

(g(z)−∑i ,t

u(i , t)z(i , t))

Penaltiesu(i , t) = 0 for all i ,t

Iteration 1

u(1,A) -1

u(1,N) 1

u(2,N) -1

u(2,V ) 1

u(5,V ) -1

u(5,N) 1

Iteration 2

u(5,V ) -1

u(5,N) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ HMMY ⇐ Parse Trees Z ⇐ Taggingsy(i , t) = 1 if y contains tag t at position i

Page 37: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,t

u(i , t)y(i , t))

Viterbi Decoding

Red1 flies2 some3 large4 jet5

N V D A NN V D A NA N D A NA N D A NN V D A N

z∗ = arg maxz∈Z

(g(z)−∑i ,t

u(i , t)z(i , t))

Penaltiesu(i , t) = 0 for all i ,t

Iteration 1

u(1,A) -1

u(1,N) 1

u(2,N) -1

u(2,V ) 1

u(5,V ) -1

u(5,N) 1

Iteration 2

u(5,V ) -1

u(5,N) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ HMMY ⇐ Parse Trees Z ⇐ Taggingsy(i , t) = 1 if y contains tag t at position i

Page 38: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,t

u(i , t)y(i , t))

Viterbi Decoding

Red1 flies2 some3 large4 jet5

N V D A NN V D A NA N D A NA N D A N

N V D A N

z∗ = arg maxz∈Z

(g(z)−∑i ,t

u(i , t)z(i , t))

Penaltiesu(i , t) = 0 for all i ,t

Iteration 1

u(1,A) -1

u(1,N) 1

u(2,N) -1

u(2,V ) 1

u(5,V ) -1

u(5,N) 1

Iteration 2

u(5,V ) -1

u(5,N) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ HMMY ⇐ Parse Trees Z ⇐ Taggingsy(i , t) = 1 if y contains tag t at position i

Page 39: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

A

Red

N

flies

D

some

A

large

VP

V

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

S

NP

N

Red

VP

V

flies

NP

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,t

u(i , t)y(i , t))

Viterbi Decoding

Red1 flies2 some3 large4 jet5

N V D A NN V D A NA N D A NA N D A N

N V D A N

z∗ = arg maxz∈Z

(g(z)−∑i ,t

u(i , t)z(i , t))

Penaltiesu(i , t) = 0 for all i ,t

Iteration 1

u(1,A) -1

u(1,N) 1

u(2,N) -1

u(2,V ) 1

u(5,V ) -1

u(5,N) 1

Iteration 2

u(5,V ) -1

u(5,N) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)Key

f (y) ⇐ CFG g(z) ⇐ HMMY ⇐ Parse Trees Z ⇐ Taggingsy(i , t) = 1 if y contains tag t at position i

Page 40: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Guarantees

TheoremIf at any iteration y (k)(i , t) = z(k)(i , t) for all i , t, then (y (k), z(k))is the global optimum.

In experiments, we find the global optimum on 99% of examples.

If we do not converge to a match, we can still get a result (more inpaper).

Page 41: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Guarantees

TheoremIf at any iteration y (k)(i , t) = z(k)(i , t) for all i , t, then (y (k), z(k))is the global optimum.

In experiments, we find the global optimum on 99% of examples.

If we do not converge to a match, we can still get a result (more inpaper).

Page 42: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Integrated CFG and Dependency Parsing

Red1 flies2 some3 large4 jet5

*0 Red1 flies2 some3 large4 jet5

Red1 flies2 some3 large4 jet5

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

Dependency Model

Dual Decomposition

Lexicalized CFG

Page 43: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Integrated CFG and Dependency Parsing

Red1 flies2 some3 large4 jet5

*0 Red1 flies2 some3 large4 jet5

Red1 flies2 some3 large4 jet5

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

Dependency Model

Dual Decomposition

Lexicalized CFG

Page 44: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Dependency Parsing

*0 Red1 flies2 some3 large4 jet5

I Let Z be the set of all valid dependency parses of a sentenceand g(z) be a scoring function.

e.g. g(z) = log p(some3|jet5, large4) + ...

z∗ = arg maxz∈Z

g(z) ← Eisner (2000) algorithm

Page 45: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Dependency Parsing

*0 Red1 flies2 some3 large4 jet5

I Let Z be the set of all valid dependency parses of a sentenceand g(z) be a scoring function.

e.g. g(z) = log p(some3|jet5, large4) + ...

z∗ = arg maxz∈Z

g(z) ← Eisner (2000) algorithm

Page 46: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Lexicalized PCFG

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

I Let Y be the set of all valid dependency parses of a sentenceand f (y) be a scoring function.

e.g. f (y) = log p(S(flies) → NP(Red) VP(flies)|S(flies)) + ...

y∗ = arg maxy∈Y

f (y) ← Modified CKY algorithm

Page 47: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Lexicalized PCFG

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

I Let Y be the set of all valid dependency parses of a sentenceand f (y) be a scoring function.

e.g. f (y) = log p(S(flies) → NP(Red) VP(flies)|S(flies)) + ...

y∗ = arg maxy∈Y

f (y) ← Modified CKY algorithm

Page 48: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

The Integrated Constituency and Dependency Parsing Problem

Find argmax

y∈ Y, z∈ Zf (y) + g(z)

such that for all i , j , y(i , j) = z(i , j)

Trees Dependency Trees

CFG Dependency

Constraints

Where y(i , j) = 1 if parse includes dependency from word i to j

z(i , j) = 1 if parse includes dependency from word i to j

Page 49: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

The Integrated Constituency and Dependency Parsing Problem

Find argmax

y∈ Y, z∈ Zf (y) + g(z)

such that for all i , j , y(i , j) = z(i , j)

Trees

Dependency Trees

CFG Dependency

Constraints

Where y(i , j) = 1 if parse includes dependency from word i to j

z(i , j) = 1 if parse includes dependency from word i to j

Page 50: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

The Integrated Constituency and Dependency Parsing Problem

Find argmax

y∈ Y, z∈ Zf (y) + g(z)

such that for all i , j , y(i , j) = z(i , j)

Trees Dependency Trees

CFG Dependency

Constraints

Where y(i , j) = 1 if parse includes dependency from word i to j

z(i , j) = 1 if parse includes dependency from word i to j

Page 51: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

The Integrated Constituency and Dependency Parsing Problem

Find argmax

y∈ Y, z∈ Zf (y) + g(z)

such that for all i , j , y(i , j) = z(i , j)

Trees Dependency Trees

CFG

Dependency

Constraints

Where y(i , j) = 1 if parse includes dependency from word i to j

z(i , j) = 1 if parse includes dependency from word i to j

Page 52: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

The Integrated Constituency and Dependency Parsing Problem

Find argmax

y∈ Y, z∈ Zf (y) + g(z)

such that for all i , j , y(i , j) = z(i , j)

Trees Dependency Trees

CFG Dependency

Constraints

Where y(i , j) = 1 if parse includes dependency from word i to j

z(i , j) = 1 if parse includes dependency from word i to j

Page 53: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

The Integrated Constituency and Dependency Parsing Problem

Find argmax

y∈ Y, z∈ Zf (y) + g(z)

such that for all i , j , y(i , j) = z(i , j)

Trees Dependency Trees

CFG Dependency

Constraints

Where y(i , j) = 1 if parse includes dependency from word i to j

z(i , j) = 1 if parse includes dependency from word i to j

Page 54: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing

S(flies)

NP

N

Red

VP(flies)

V

flies

D

some

NP(jet)

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

D

some

NP(jet)

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,j

u(i , j)y(i , j))

Dependency Parsing

*0 Red1 flies2 some3 large4 jet5

z∗ = arg maxz∈Z

(g(z)−∑i ,j

u(i , j)z(i , j))

Penaltiesu(i , j) = 0 for all i ,j

Iteration 1

u(2, 3) -1

u(5, 3) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ Dependency ModelY ⇐ Parse Trees Z ⇐ Dependency Treesy(i , j) = 1 if y contains dependency i , j

Page 55: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing S(flies)

NP

N

Red

VP(flies)

V

flies

D

some

NP(jet)

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

D

some

NP(jet)

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,j

u(i , j)y(i , j))

Dependency Parsing

*0 Red1 flies2 some3 large4 jet5

z∗ = arg maxz∈Z

(g(z)−∑i ,j

u(i , j)z(i , j))

Penaltiesu(i , j) = 0 for all i ,j

Iteration 1

u(2, 3) -1

u(5, 3) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ Dependency ModelY ⇐ Parse Trees Z ⇐ Dependency Treesy(i , j) = 1 if y contains dependency i , j

Page 56: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing S(flies)

NP

N

Red

VP(flies)

V

flies

D

some

NP(jet)

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

D

some

NP(jet)

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,j

u(i , j)y(i , j))

Dependency Parsing

*0 Red1 flies2 some3 large4 jet5

z∗ = arg maxz∈Z

(g(z)−∑i ,j

u(i , j)z(i , j))

Penaltiesu(i , j) = 0 for all i ,j

Iteration 1

u(2, 3) -1

u(5, 3) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ Dependency ModelY ⇐ Parse Trees Z ⇐ Dependency Treesy(i , j) = 1 if y contains dependency i , j

Page 57: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing

S(flies)

NP

N

Red

VP(flies)

V

flies

D

some

NP(jet)

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

D

some

NP(jet)

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,j

u(i , j)y(i , j))

Dependency Parsing

*0 Red1 flies2 some3 large4 jet5

z∗ = arg maxz∈Z

(g(z)−∑i ,j

u(i , j)z(i , j))

Penaltiesu(i , j) = 0 for all i ,j

Iteration 1

u(2, 3) -1

u(5, 3) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ Dependency ModelY ⇐ Parse Trees Z ⇐ Dependency Treesy(i , j) = 1 if y contains dependency i , j

Page 58: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing

S(flies)

NP

N

Red

VP(flies)

V

flies

D

some

NP(jet)

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

D

some

NP(jet)

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,j

u(i , j)y(i , j))

Dependency Parsing

*0 Red1 flies2 some3 large4 jet5

z∗ = arg maxz∈Z

(g(z)−∑i ,j

u(i , j)z(i , j))

Penaltiesu(i , j) = 0 for all i ,j

Iteration 1

u(2, 3) -1

u(5, 3) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ Dependency ModelY ⇐ Parse Trees Z ⇐ Dependency Treesy(i , j) = 1 if y contains dependency i , j

Page 59: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing

S(flies)

NP

N

Red

VP(flies)

V

flies

D

some

NP(jet)

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

D

some

NP(jet)

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,j

u(i , j)y(i , j))

Dependency Parsing

*0 Red1 flies2 some3 large4 jet5

z∗ = arg maxz∈Z

(g(z)−∑i ,j

u(i , j)z(i , j))

Penaltiesu(i , j) = 0 for all i ,j

Iteration 1

u(2, 3) -1

u(5, 3) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ Dependency ModelY ⇐ Parse Trees Z ⇐ Dependency Treesy(i , j) = 1 if y contains dependency i , j

Page 60: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing

S(flies)

NP

N

Red

VP(flies)

V

flies

D

some

NP(jet)

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

D

some

NP(jet)

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,j

u(i , j)y(i , j))

Dependency Parsing

*0 Red1 flies2 some3 large4 jet5

z∗ = arg maxz∈Z

(g(z)−∑i ,j

u(i , j)z(i , j))

Penaltiesu(i , j) = 0 for all i ,j

Iteration 1

u(2, 3) -1

u(5, 3) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ Dependency ModelY ⇐ Parse Trees Z ⇐ Dependency Treesy(i , j) = 1 if y contains dependency i , j

Page 61: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing

S(flies)

NP

N

Red

VP(flies)

V

flies

D

some

NP(jet)

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

D

some

NP(jet)

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,j

u(i , j)y(i , j))

Dependency Parsing

*0 Red1 flies2 some3 large4 jet5

z∗ = arg maxz∈Z

(g(z)−∑i ,j

u(i , j)z(i , j))

Penaltiesu(i , j) = 0 for all i ,j

Iteration 1

u(2, 3) -1

u(5, 3) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ Dependency ModelY ⇐ Parse Trees Z ⇐ Dependency Treesy(i , j) = 1 if y contains dependency i , j

Page 62: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

CKY Parsing

S(flies)

NP

N

Red

VP(flies)

V

flies

D

some

NP(jet)

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

D

some

NP(jet)

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

S(flies)

NP

N

Red

VP(flies)

V

flies

NP(jet)

D

some

A

large

N

jet

y∗ = arg maxy∈Y

(f (y) +∑i ,j

u(i , j)y(i , j))

Dependency Parsing

*0 Red1 flies2 some3 large4 jet5

z∗ = arg maxz∈Z

(g(z)−∑i ,j

u(i , j)z(i , j))

Penaltiesu(i , j) = 0 for all i ,j

Iteration 1

u(2, 3) -1

u(5, 3) 1

Convergedy∗ = arg max

y∈Yf (y) + g(y)

Keyf (y) ⇐ CFG g(z) ⇐ Dependency ModelY ⇐ Parse Trees Z ⇐ Dependency Treesy(i , j) = 1 if y contains dependency i , j

Page 63: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Roadmap

Algorithm

Experiments

LP Relaxations

Page 64: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Experiment

Properties:

I Exactness

I Parsing Accuracy

Experiments on:

I English Penn Treebank

Models

I Collins (1997) Model 1I Semi-Supervised Dependency Parser (Koo, 2008)I Trigram Tagger (Toutanova, 2000)

Page 65: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

How quickly do the models converge?

0

20

40

60

80

100

<=1<=2

<=3<=4

<=10<=20

<=50

% e

xam

ples

con

verg

ed

number of iterations

Integrated Dependency Parsing

0

20

40

60

80

100

<=1<=2

<=3<=4

<=10<=20

<=50

% e

xam

ples

con

verg

ed

number of iterations

Integrated POS Tagging

Page 66: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Integrated Constituency and Dependency Parsing: Accuracy

87

88

89

90

91

92Collins

DepDual

F1 Score

I Collins (1997) Model 1

I Fixed, First-best Dependencies from Koo (2008)

I Dual Decomposition

Page 67: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Integrated Parsing and Tagging: Accuracy

87

88

89

90

91

92FixedDual

F1 Score

I Fixed, First-Best Tags From Toutanova (2000)

I Dual Decomposition

Page 68: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Roadmap

Algorithm

Experiments

LP Relaxations

Page 69: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Dual Decomposition and Linear Programming Relaxations

Theorem

I If the dual decomposition algorithm converges, then(y (k), z(k)) is the global optimum.

Questions

I What problem is dual decomposition solving?

I How come the algorithm doesn’t always converge?

Dual decomposition searches over a linear programming relaxationof the original problem.

Page 70: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Convex Hulls for CKY

A parse tree can be represented as a binary vector y ∈ Y.y(A→ B C , i , j , k) = 1 if rule A→ B C is used at span i , j , k.

Parsing

y∗

w

Y

conv(Y)

I If f is linear, arg maxy∈conv(Y)

f (y) is a linear program.

I The best point in an LP is a vertex. So CKY solves this LP.

Page 71: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Convex Hulls for CKY

A parse tree can be represented as a binary vector y ∈ Y.y(A→ B C , i , j , k) = 1 if rule A→ B C is used at span i , j , k.

Parsing

y∗

w

Y

conv(Y)

I If f is linear, arg maxy∈conv(Y)

f (y) is a linear program.

I The best point in an LP is a vertex. So CKY solves this LP.

Page 72: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Convex Hulls for CKY

A parse tree can be represented as a binary vector y ∈ Y.y(A→ B C , i , j , k) = 1 if rule A→ B C is used at span i , j , k.

Parsing

y∗

w

conv(Y)

I If f is linear, arg maxy∈conv(Y)

f (y) is a linear program.

I The best point in an LP is a vertex. So CKY solves this LP.

Page 73: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Convex Hulls for CKY

A parse tree can be represented as a binary vector y ∈ Y.y(A→ B C , i , j , k) = 1 if rule A→ B C is used at span i , j , k.

Parsing

y∗

w

conv(Y)

I If f is linear, arg maxy∈conv(Y)

f (y) is a linear program.

I The best point in an LP is a vertex. So CKY solves this LP.

Page 74: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Combined Problem

Q = {(y , z): y ∈ Y, z ∈ Z,y(i , t) = z(i , t) for all (i , t)}

Q

conv(Q)Q′Q′ conv(Q)

Possible (y∗, z∗)

wPossible (y∗, z∗)

w

Q′ = {(µ, ν): µ ∈ conv(Y), ν ∈ conv(Z),

µ(i , t) = ν(i , t) for all (i , t)}

Dual decomposition searches over Q′

Depending on the weight vector, (y∗, z∗) ∈ Q′ could be in Q or inthe strict outer bound.

Page 75: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Combined Problem

Q = {(y , z): y ∈ Y, z ∈ Z,y(i , t) = z(i , t) for all (i , t)}

Q

conv(Q)Q′Q′ conv(Q)

Possible (y∗, z∗)

wPossible (y∗, z∗)

w

Q′ = {(µ, ν): µ ∈ conv(Y), ν ∈ conv(Z),

µ(i , t) = ν(i , t) for all (i , t)}

Dual decomposition searches over Q′

Depending on the weight vector, (y∗, z∗) ∈ Q′ could be in Q or inthe strict outer bound.

Page 76: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Combined Problem

Q = {(y , z): y ∈ Y, z ∈ Z,y(i , t) = z(i , t) for all (i , t)}

conv(Q)

Q′Q′ conv(Q)

Possible (y∗, z∗)

wPossible (y∗, z∗)

w

Q′ = {(µ, ν): µ ∈ conv(Y), ν ∈ conv(Z),

µ(i , t) = ν(i , t) for all (i , t)}

Dual decomposition searches over Q′

Depending on the weight vector, (y∗, z∗) ∈ Q′ could be in Q or inthe strict outer bound.

Page 77: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Combined Problem

Q = {(y , z): y ∈ Y, z ∈ Z,y(i , t) = z(i , t) for all (i , t)}

conv(Q)

Q′

Q′ conv(Q)

Possible (y∗, z∗)

wPossible (y∗, z∗)

w

Q′ = {(µ, ν): µ ∈ conv(Y), ν ∈ conv(Z),

µ(i , t) = ν(i , t) for all (i , t)}

Dual decomposition searches over Q′

Depending on the weight vector, (y∗, z∗) ∈ Q′ could be in Q or inthe strict outer bound.

Page 78: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Combined Problem

Q = {(y , z): y ∈ Y, z ∈ Z,y(i , t) = z(i , t) for all (i , t)}

conv(Q)Q′

Q′ conv(Q)

Possible (y∗, z∗)

wPossible (y∗, z∗)

w

Q′ = {(µ, ν): µ ∈ conv(Y), ν ∈ conv(Z),

µ(i , t) = ν(i , t) for all (i , t)}

Dual decomposition searches over Q′

Depending on the weight vector, (y∗, z∗) ∈ Q′ could be in Q or inthe strict outer bound.

Page 79: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Combined Problem

Q = {(y , z): y ∈ Y, z ∈ Z,y(i , t) = z(i , t) for all (i , t)}

conv(Q)Q′

Q′ conv(Q)

Possible (y∗, z∗)

w

Possible (y∗, z∗)

w

Q′ = {(µ, ν): µ ∈ conv(Y), ν ∈ conv(Z),

µ(i , t) = ν(i , t) for all (i , t)}

Dual decomposition searches over Q′

Depending on the weight vector, (y∗, z∗) ∈ Q′ could be in Q or inthe strict outer bound.

Page 80: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Combined Problem

Q = {(y , z): y ∈ Y, z ∈ Z,y(i , t) = z(i , t) for all (i , t)}

conv(Q)Q′

Q′ conv(Q)

Possible (y∗, z∗)

w

Possible (y∗, z∗)

w

Q′ = {(µ, ν): µ ∈ conv(Y), ν ∈ conv(Z),

µ(i , t) = ν(i , t) for all (i , t)}

Dual decomposition searches over Q′

Depending on the weight vector, (y∗, z∗) ∈ Q′ could be in Q or inthe strict outer bound.

Page 81: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Are there points strictly in the outer bound?

Q′

Possible (y∗, z∗)?

Taggings 0.5x

w1 w2 w3

A A A

+ 0.5x

w1 w2 w3

A B B

Parses 0.5 x

X

A

w1

X

A

w2

B

w3

+ 0.5 x

X

A

w1

X

B

w2

A

w3

Best result can be afractional solution.

Convexcombination ofthese structures.

Page 82: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Summary

A Dual Decomposition algorithm for integrated decoding

Simple - Uses only simple, off-the-shelf dynamic programmingalgorithms to solve a harder problem.

Efficient - Faster than classical methods for dynamic programmingintersection.

Strong Guarantees - Solves a linear programming relaxation whichgives a certificate of optimality.

Finds the exact solution on 99% of the examples.

Widely Applicable - Similar techniques extend to other problems

Page 83: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Appendix

Page 84: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Iterative Progress

50

60

70

80

90

100

0 10 20 30 40 50

Per

cent

age

Maximum Number of Dual Decomposition Iterations

f score% certificates

% match K=50

Page 85: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Deriving the Algorithm

Goal:y∗ = arg max

y∈Yf (y)

Rewrite:arg max

z∈Z,y∈Yf (z) + g(y)

s.t. z(i , j) = y(i , j) for all i , j

Lagrangian: L(u, y , z) = f (z) + g(y) +∑i,j

u(i , j) (y(i , j)− z(i , j))

The dual problem is to find minu

L(u) where

L(u) = maxy∈Y,z∈Z

L(u, y , z) = maxz∈Z

f (z) +∑i,j

u(i , j)z(i , j)

+ max

y∈Y

g(y)−∑i,j

u(i , j)y(i , j)

Dual is an upper bound: L(u) ≥ f (z∗) + g(y∗) for any u

Page 86: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Deriving the Algorithm

Goal:y∗ = arg max

y∈Yf (y)

Rewrite:arg max

z∈Z,y∈Yf (z) + g(y)

s.t. z(i , j) = y(i , j) for all i , j

Lagrangian: L(u, y , z) = f (z) + g(y) +∑i,j

u(i , j) (y(i , j)− z(i , j))

The dual problem is to find minu

L(u) where

L(u) = maxy∈Y,z∈Z

L(u, y , z) = maxz∈Z

f (z) +∑i,j

u(i , j)z(i , j)

+ max

y∈Y

g(y)−∑i,j

u(i , j)y(i , j)

Dual is an upper bound: L(u) ≥ f (z∗) + g(y∗) for any u

Page 87: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

A Subgradient Algorithm for Minimizing L(u)

L(u) = maxz∈Z

f (z) +∑i,j

u(i , j)y(i , j)

+ maxy∈Y

g(y)−∑i,j

u(i , j)z(i , j)

L(u) is convex, but not differentiable. A subgradient of L(u) at uis a vector gu such that for all v ,

L(v) ≥ L(u) + gu · (v − u)

Subgradient methods use updates u′ = u − αgu

In fact, for our L(u), gu(i , j) = z∗(i , j)− y∗(i , j)

Page 88: On Dual Decomposition and Linear Programming Relaxations for Natural Language Processing

Related Work

I Methods that use general purpose linear programming orinteger linear programming solvers (Martins et al. 2009;Riedel and Clarke 2006; Roth and Yih 2005)

I Dual decomposition/Lagrangian relaxation in combinatorialoptimization (Dantzig and Wolfe, 1960; Held and Karp, 1970;Fisher 1981)

I Dual decomposition for inference in MRFs (Komodakis et al.,2007; Wainwright et al., 2005)

I Methods that incorporate combinatorial solvers within loopybelief propagation (Duchi et al. 2007; Smith and Eisner 2008)