
Advanced Dynamic Programming in CL:

Theory, Algorithms, and Applications

Liang Huang, University of Pennsylvania

[figure: parse chart over the input w0 w1 ... wn-1, with goal item (S, 0, n)]

Liang Huang (Penn) Dynamic Programming

A Little Bit of History...
• Who invented Dynamic Programming? And when was it invented?
• R. Bellman (1940s-50s)
• A. Viterbi (1967)
• E. Dijkstra (1959)
• Hart, Nilsson, and Raphael (1968): Dijkstra => A* Algorithm
• D. Knuth (1977): Dijkstra on Grammars (Hypergraphs)

[photos: Richard Bellman, Andrew Viterbi, A. Turing]

Liang Huang (Penn) Dynamic Programming

Dynamic Programming
• Dynamic Programming is everywhere in NLP
• Viterbi Algorithm for Hidden Markov Models
• CKY Algorithm for Parsing and Machine Translation
• Forward-Backward and Inside-Outside Algorithms
• Also everywhere in AI/ML
• Reinforcement Learning, Planning (POMDP)
• AI Search: Uniform-cost, A*, etc.
• This tutorial: a unified theoretical view of DP
• Focusing on Optimization Problems

Liang Huang (Penn) Dynamic Programming

Review: DP Basics
• DP = Divide-and-Conquer + Two Principles:
• [required] Optimal Subproblem Property
• [recommended] Sharing of Common Subproblems
• Structure of the Search Space
• Incremental => Graph
• Knapsack, Edit Distance, Sequence Alignment
• Branching => Hypergraph
• Matrix-Chain, Polygon Triangulation, Optimal BST

Liang Huang (Penn) Dynamic Programming

Two Dimensional Survey

                                     traversing order
search space                         topological (acyclic)    best-first (superior)
graphs with semirings                Viterbi                  Dijkstra
hypergraphs with weight functions    Generalized Viterbi      Knuth

Liang Huang (Penn) Dynamic Programming

Graphs in NLP

[figures: part-of-speech tagging; a lattice in speech]

Liang Huang (Penn) Dynamic Programming

Semirings on Graphs
• in a weighted graph, we need two operators:
• extension (multiplicative, ⊗) and summary (additive, ⊕)
• the weight of a path is the product of its edge weights
• the weight of a vertex is the summary of its path weights

[figure: paths from s to t through u and v, with edges e1, e2, e3]

d(π1) = ⊗_{ei ∈ π1} w(ei) = w(e1) ⊗ w(e2) ⊗ w(e3)
d(t) = ⊕_{πi} w(πi) = w(π1) ⊕ w(π2) ⊕ ···

Liang Huang (Penn) Dynamic Programming

Semiring Definitions

A monoid is a triple (A, ⊗, 1) where
1. ⊗ is a closed associative binary operator on the set A,
2. 1 is the identity element for ⊗, i.e., for all a ∈ A, a ⊗ 1 = 1 ⊗ a = a.
A monoid is commutative if ⊗ is commutative.
examples: ([0, 1], +, 0), ([0, 1], ×, 1), ([0, 1], max, 0)

A semiring is a 5-tuple R = (A, ⊕, ⊗, 0, 1) such that
1. (A, ⊕, 0) is a commutative monoid.
2. (A, ⊗, 1) is a monoid.
3. ⊗ distributes over ⊕: for all a, b, c in A,
(a ⊕ b) ⊗ c = (a ⊗ c) ⊕ (b ⊗ c),
c ⊗ (a ⊕ b) = (c ⊗ a) ⊕ (c ⊗ b).
4. 0 is an annihilator for ⊗: for all a in A, 0 ⊗ a = a ⊗ 0 = 0.
examples: ([0, 1], max, ×, 0, 1), ([0, 1], +, ×, 0, 1)

Liang Huang (Penn) Dynamic Programming

Examples

Semiring   Set          ⊕    ⊗   0    1   intuition/application
Boolean    {0, 1}       ∨    ∧   0    1   logical deduction, recognition
Viterbi    [0, 1]       max  ×   0    1   prob. of the best derivation
Inside     R+ ∪ {+∞}    +    ×   0    1   prob. of a string
Real       R ∪ {+∞}     min  +   +∞   0   shortest-distance
Tropical   R+ ∪ {+∞}    min  +   +∞   0   shortest-distance with non-negative weights
Counting   N            +    ×   0    1   number of paths
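Since everything below just swaps in different ⊕/⊗ pairs, here is a minimal Python sketch (not from the slides) of a few rows of this table as plug-in operator bundles; the `Semiring` class and instance names are illustrative.

```python
# A minimal sketch (not from the slides): semirings as plug-in bundles of
# (plus, times, zero, one). Instance names are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Semiring:
    plus: Callable[[float, float], float]   # summary operator (additive)
    times: Callable[[float, float], float]  # extension operator (multiplicative)
    zero: float                             # identity of plus, annihilator of times
    one: float                              # identity of times

VITERBI = Semiring(max, lambda a, b: a * b, 0.0, 1.0)               # prob. of the best derivation
TROPICAL = Semiring(min, lambda a, b: a + b, float("inf"), 0.0)     # shortest distance
COUNTING = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0, 1)   # number of paths
```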

Liang Huang (Penn) Dynamic Programming

Ordering
• idempotent
• comparison
• examples: boolean, viterbi, tropical, real, ...
• total-order for optimization problems
• examples: all of the above

A semiring (A, ⊕, ⊗, 0, 1) is idempotent if for all a in A, a ⊕ a = a.
(a ≤ b) iff (a ⊕ b = a) defines a partial ordering.
A semiring is totally-ordered if ≤ defines a total ordering.

examples: ({0, 1}, ∨, ∧, 0, 1), ([0, 1], max, ×, 0, 1), (R+ ∪ {+∞}, min, +, +∞, 0), (R ∪ {+∞}, min, +, +∞, 0)

Liang Huang (Penn) Dynamic Programming

Monotonicity
• monotonicity
• optimal substructure in dynamic programming
• idempotent => monotone (from distributivity)
• (a ⊕ b) ⊗ c = (a ⊗ c) ⊕ (b ⊗ c); if a ≤ b, then a ⊕ b = a, so a ⊗ c = (a ⊗ c) ⊕ (b ⊗ c)
• by def. of comparison, a ⊗ c ≤ b ⊗ c

Let K = (A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A. We say K is monotonic if for all a, b, c ∈ A
(a ≤ b) => (a ⊗ c ≤ b ⊗ c)    (a ≤ b) => (c ⊗ a ≤ c ⊗ b)

[figure: if subderivations B: b and C: c give A: b ⊗ c, then replacing b with b’ ≤ b gives A: b’ ⊗ c ≤ b ⊗ c]

Liang Huang (Penn) Dynamic Programming

DP on Graphs
• optimization problems on graphs => generic shortest-path problem
• weighted directed graph G = (V, E) with a function w that assigns each edge a weight from a semiring
• compute the best weight of the target vertex t
• generic update along edge (u, v):  d(v) ⊕= d(u) ⊗ w(u, v),  i.e.,  d(v) ← d(v) ⊕ (d(u) ⊗ w(u, v))
• how to avoid cyclic updates?
• only update when d(u) is fixed

Liang Huang (Penn) Dynamic Programming

Two Dimensional Survey

                                                  traversing order
search space                                      topological (acyclic)    best-first (superior)
graphs with semirings (e.g., FSMs)                Viterbi                  Dijkstra
hypergraphs with weight functions (e.g., CFGs)    Generalized Viterbi      Knuth

Liang Huang (Penn) Dynamic Programming

Viterbi Algorithm for DAGs
1. topological sort
2. visit each vertex v in sorted order and do updates
• for each incoming edge (u, v) in E
• use d(u) to update d(v):  d(v) ⊕= d(u) ⊗ w(u, v)
• key observation: d(u) is fixed to optimal at this time
• time complexity: O(V + E)
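A small Python sketch (not from the slides) of this algorithm, reusing the hypothetical `Semiring` bundle above; the adjacency format and function name are illustrative.

```python
# Viterbi for DAGs, assuming the Semiring sketch above. `incoming[v]` lists
# edges (u, w(u, v)); `order` is a topological order of the vertices.
def viterbi_dag(order, incoming, source, semiring):
    d = {v: semiring.zero for v in order}
    d[source] = semiring.one
    for v in order:                                  # d(u) is already fixed for every u before v
        for u, w in incoming.get(v, []):
            d[v] = semiring.plus(d[v], semiring.times(d[u], w))  # d(v) += d(u) * w(u, v), semiring-wise
    return d

# example: best-path probability under the Viterbi semiring
# order = ["s", "a", "b", "t"]
# incoming = {"a": [("s", 0.5)], "b": [("s", 0.4), ("a", 0.9)], "t": [("a", 0.3), ("b", 0.8)]}
# viterbi_dag(order, incoming, "s", VITERBI)["t"]   # -> 0.36
```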

Liang Huang (Penn) Dynamic Programming

Variant 1: forward-update
1. topological sort
2. visit each vertex v in sorted order and do updates
• for each outgoing edge (v, u) in E
• use d(v) to update d(u):  d(u) ⊕= d(v) ⊗ w(v, u)
• key observation: d(v) is fixed to optimal at this time
• time complexity: O(V + E)

Liang Huang (Penn) Dynamic Programming

Examples
• [Number of Paths in a DAG]
• just use the counting semiring (N, +, ×, 0, 1)
• note: this is not an optimization problem!
• [Longest Path in a DAG]
• just use the semiring (R ∪ {−∞}, max, +, −∞, 0)
• [Part-of-Speech Tagging with a Hidden Markov Model]
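A usage sketch (not from the slides) showing how the two DAG examples above fall out of the hypothetical `viterbi_dag`/`Semiring` sketches by swapping the semiring; the toy graph is made up.

```python
# Same code, different semirings: counting paths vs. the longest path in a DAG.
LONGEST = Semiring(max, lambda a, b: a + b, float("-inf"), 0.0)   # (R, max, +, -inf, 0)

order = ["s", "a", "b", "t"]
edges = {"a": [("s", 1.0)], "b": [("s", 2.0), ("a", 4.0)], "t": [("a", 1.0), ("b", 3.0)]}
unit = {v: [(u, 1) for u, _ in es] for v, es in edges.items()}    # every edge weighs 1 for counting

print(viterbi_dag(order, unit, "s", COUNTING)["t"])   # 3 paths: s-a-t, s-b-t, s-a-b-t
print(viterbi_dag(order, edges, "s", LONGEST)["t"])   # longest path 1 + 4 + 3 = 8.0
```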

Liang Huang (Penn) Dynamic Programming

Example: Speech Alignment

time complexity: O(n^2)
also used in: edit distance, biological sequence alignment

Liang Huang (Penn) Dynamic Programming

Example: Word Alignment
• key difference
• reorderings in translation!
• sequence/speech alignment is always monotonic
• complexity under HMM
• word alignment is O(n^3)
• for every (i, j), enumerate all (i-1, k)
• sequence alignment is O(n^2)

[figure: word alignment between “I love you .” and “Je t’ aime .”]

Liang Huang (Penn) Dynamic Programming

Chinese Word Segmentation

下 雨 天 地 面 积 水
xia yu tian di mian ji shui

民主 min-zhu
people-dominate
“democracy”

江泽民 主席 jiang-ze-min zhu-xi
... - ... - people dominate-podium
“President Jiang Zemin”

this was 5 years ago. now Google is good at segmentation!

graph search

Huang and Chiang Forest Rescoring

Phrase-based Decoding

yu Shalong juxing le huitan
与 沙龙 举行 了 会谈
held a talk with Sharon

[figure: phrase options over the source (“with Sharon”, “held a talk”, “talks”, “Sharon”, “held”, “with”) and partial hypotheses grown strictly left-to-right on the target side, each with a coverage vector such as _ _ _ _ _, ●●●_ _, ●_●●●, or ●●●●●]

source-side: coverage vector
target-side: grow hypotheses strictly left-to-right
space: O(2^n), time: O(2^n n^2) -- cf. traveling salesman problem

Huang and Chiang Forest Rescoring

Traveling Salesman Problem & MT
• a classical NP-hard problem
• goal: visit each city once and only once
• exponential-time dynamic programming
• state: cities visited so far (bit-vector)
• search in this O(2^n) transformed graph
• MT: each city is a source-language word
• restrictions in reordering can reduce complexity => distortion limit
• => syntax-based MT

(Held and Karp, 1962; Knight, 1999)
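A compact Python sketch (not from the slides) of the Held-Karp bit-vector DP described above; the cost matrix and the convention of starting and ending at city 0 are illustrative.

```python
# Held-Karp sketch: d[S][j] = best cost of visiting exactly the set S of cities
# (a bitmask) and ending at city j, starting from city 0.
def held_karp(cost):
    n = len(cost)
    INF = float("inf")
    d = [[INF] * n for _ in range(1 << n)]
    d[1][0] = 0.0                               # only city 0 visited, ending at 0
    for S in range(1 << n):
        for j in range(n):
            if d[S][j] == INF:
                continue
            for k in range(n):                  # extend to an unvisited city k
                if not S & (1 << k):
                    T = S | (1 << k)
                    d[T][k] = min(d[T][k], d[S][j] + cost[j][k])
    full = (1 << n) - 1
    return min(d[full][j] + cost[j][0] for j in range(1, n))   # close the tour

# cost = [[0, 2, 9, 10], [1, 0, 6, 4], [15, 7, 0, 8], [6, 3, 12, 0]]
# held_karp(cost)   # -> 21  (tour 0 -> 2 -> 3 -> 1 -> 0)
```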

Huang and Chiang Forest Rescoring

Adding a Bigram Model
• “refined” graph: annotated with language model words
• still dynamic programming, just a larger search space

[figure: coverage-vector states refined with the last target word, e.g. (_ _●●●, ... talk) extended by the phrase “with Sharon” to (●●●●●, ... Sharon) with a bigram scored at the boundary; alternative refined states include (... talks), (... meeting), (... Shalong)]

space: O(2^n), time: O(2^n n^2)  =>  space: O(2^n V^(m-1)), time: O(2^n V^(m-1) n^2)
for m-gram language models

Liang Huang (Penn) Dynamic Programming

Two Dimensional Survey (roadmap recap; see the table above)

Liang Huang (Penn) Dynamic Programming

Dijkstra Algorithm
• Dijkstra does not require acyclicity
• instead of topological order, we use best-first order
• but this requires superiority of the semiring
• intuition: combination always gets worse
• contrast: monotonicity: combination preserves order

Let K = (A, ⊕, ⊗, 0, 1) be a semiring, and ≤ a partial ordering over A. We say K is superior if for all a, b ∈ A
a ≤ a ⊗ b,  b ≤ a ⊗ b.

examples: ({0, 1}, ∨, ∧, 0, 1), ([0, 1], max, ×, 0, 1), (R+ ∪ {+∞}, min, +, +∞, 0), (R ∪ {+∞}, min, +, +∞, 0)

[figure: extending d(u) along an edge e gives d(u) ⊗ w(e)]

Liang Huang (Penn) Dynamic Programming

Dijkstra Algorithm
• keep a cut (S : V - S) where S vertices are fixed
• maintain a priority queue Q of V - S vertices
• each iteration choose the best vertex v from Q
• move v to S, and use d(v) to forward-update others:  d(u) ⊕= d(v) ⊗ w(v, u)

time complexity: O((V + E) lg V) (binary heap); O(V lg V + E) (Fibonacci heap)

[figure: cut (S : V - S) with s ∈ S; the popped vertex v updates a neighbor u along edge w(v, u)]
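A minimal Python sketch (not from the slides) of this procedure for the tropical (min, +) case with a binary heap; the adjacency-list format is illustrative.

```python
# Dijkstra sketch for the (min, +) case. `adj[v]` lists outgoing edges
# (u, w(v, u)) with non-negative weights.
import heapq

def dijkstra(adj, source):
    d = {source: 0.0}
    fixed = set()                             # the set S of already-popped vertices
    queue = [(0.0, source)]
    while queue:
        dv, v = heapq.heappop(queue)
        if v in fixed:
            continue                          # stale queue entry
        fixed.add(v)
        for u, w in adj.get(v, []):           # forward-update: d(u) min= d(v) + w(v, u)
            if u not in fixed and dv + w < d.get(u, float("inf")):
                d[u] = dv + w
                heapq.heappush(queue, (d[u], u))
    return d

# adj = {"s": [("a", 1), ("b", 4)], "a": [("b", 2), ("t", 6)], "b": [("t", 3)]}
# dijkstra(adj, "s")["t"]   # -> 6  (s-a-b-t: 1 + 2 + 3)
```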

Liang Huang (Penn) Dynamic Programming

Viterbi vs. Dijkstra
• structural vs. algebraic constraints
• Dijkstra only applicable to optimization problems

[Venn diagram: within monotonic optimization problems, one circle is “acyclic: Viterbi” and the other is “superior: Dijkstra”; many NLP problems lie in their intersection. Outside the circles: forward-backward (Inside semiring), non-probabilistic models, cyclic FSMs/grammars]

Liang Huang (Penn) Dynamic Programming

What if both fail?
• generalized Bellman-Ford (CLR, 1990; Mohri, 2002)
• or, first do strongly-connected components (SCC), which gives a DAG; use Viterbi globally on this SCC-DAG; use Bellman-Ford locally within each SCC

Liang Huang (Penn) Dynamic Programming

What if both work?
• full Dijkstra is slower than Viterbi: O((V + E) lg V) vs. O(V + E)
• but it can finish as early as the target vertex is popped: a (V + E) lg V vs. V + E
• Q: how to (magically) reduce a?

Liang Huang (Penn) Dynamic Programming

A* Search: Intuition
• Dijkstra is “blind” about how far the target is
• may get “trapped” by obstacles
• can we be more intelligent about the future?
• idea: prioritize by s-v distance + v-t estimate

[figure: source s, intermediate vertices u and v, and target t]

Liang Huang (Penn) Dynamic Programming

A* Heuristic
• h(v): the distance from v to target t
• ĥ(v) must be an optimistic estimate of h(v): ĥ(v) ≤ h(v)
• Dijkstra is a special case where ĥ(v) = 1̄ (0 for distances)
• now, prioritize the queue by d(v) ⊗ ĥ(v)
• can stop when the target gets popped -- why?
• optimal subpaths should pop earlier than non-optimal ones
• d(v) ⊗ ĥ(v) ≤ d(v) ⊗ h(v) ≤ d(t) ≤ non-optimal paths of t

[figure: s →(d(v))→ v →(h(v), estimated by ĥ(v))→ t]
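A small Python sketch (not from the slides) of this rule for the (min, +) case: the queue is keyed by d(v) + ĥ(v), and the search may stop as soon as the target is popped. The heuristic `h_hat` is supplied by the caller and assumed admissible (ĥ ≤ h).

```python
# A* sketch for (min, +) weights: same loop as the Dijkstra sketch above, but the
# heap is keyed by d(v) + h_hat(v), and we return as soon as the target pops.
import heapq

def a_star(adj, source, target, h_hat):
    d = {source: 0.0}
    fixed = set()
    queue = [(h_hat(source), source)]
    while queue:
        _, v = heapq.heappop(queue)
        if v == target:
            return d[v]                       # with an admissible h_hat this is optimal
        if v in fixed:
            continue
        fixed.add(v)
        for u, w in adj.get(v, []):
            if u not in fixed and d[v] + w < d.get(u, float("inf")):
                d[u] = d[v] + w
                heapq.heappush(queue, (d[u] + h_hat(u), u))
    return float("inf")
```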

Liang Huang (Penn) Dynamic Programming

How to design a heuristic?
• more of an art than a science
• basic idea: projection into a coarser space
• cluster: w’(U, V) = min { w(u, v) | u ∈ U, v ∈ V }
• exact cost in the coarser graph is an estimate for the finer graph

[figure: vertices grouped into clusters U and V; each cluster-level edge takes the minimum original edge weight]

(Raphael, 2001)
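A sketch (not from the slides) of this clustering idea for (min, +) graphs: collapse vertices with a hypothetical `proj` map, take the minimum weight between clusters, solve the coarse graph exactly, and read off ĥ(v) as the coarse distance from proj(v) to the coarse target. It reuses the `dijkstra` sketch above.

```python
# Projection-based heuristic sketch: w'(U, V) = min { w(u, v) | u in U, v in V }.
# Build the coarse graph with edges reversed, run Dijkstra from the coarse
# target, and use those exact coarse distances as an admissible h_hat.
def coarse_heuristic(adj, proj, target):
    rev = {}                                           # rev[V][U] = min weight of coarse edge U -> V
    for u, edges in adj.items():
        for v, w in edges:
            U, V = proj(u), proj(v)
            rev.setdefault(V, {})
            rev[V][U] = min(w, rev[V].get(U, float("inf")))
    rev_adj = {V: list(es.items()) for V, es in rev.items()}
    dist = dijkstra(rev_adj, proj(target))             # exact coarse distances to the target
    return lambda v: dist.get(proj(v), float("inf"))   # coarse cost <= true cost, so admissible
```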

Liang Huang (Penn) Dynamic Programming

Viterbi or A*?
• A* intuition: d(t) ⊗ ĥ(t) ranks higher among the d(v) ⊗ ĥ(v)
• can finish early if lucky
• actually, d(t) ⊗ ĥ(t) = d(t) ⊗ h(t) = d(t) ⊗ 1̄ = d(t)
• with the price of maintaining a priority queue - O(log V)
• Q: how early? worth the price?
• if the rank is r, then A* is better when (r/V) log V < 1, i.e., r < V / log V

[figure: Dijkstra pops from a pool of d(v) values of size V; A* pops from a pool of d(v) ⊗ ĥ(v) values in which d(t) has rank r]

Liang Huang (Penn) Dynamic Programming

Two Dimensional Survey (roadmap recap; see the table above)

Liang Huang (Penn) Dynamic Programming

Background: CFG and Parsing

[figure: a CKY-style chart over the input w0 w1 ... wn-1, with goal item (S, 0, n)]

Liang Huang (Penn) Dynamic Programming

(Directed) Hypergraphs
• a generalization of graphs
• edge => hyperedge: several tail vertices to one head vertex
• e = (T(e), h(e), fe); arity |e| = |T(e)|
• a totally-ordered weight set R
• we borrow the ⊕ operator to be the comparison
• weight function fe : R^|e| → R
• generalizes the ⊗ operator in semirings

[figure: hyperedge e with tails u1, u2 and head v; the update is d(v) ⊕= fe(d(u1), d(u2)); e.g., item Xi,k built from tails Yi,j and Zj,k]
simple case: fe(a, b) = a ⊗ b ⊗ w(e)
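A tiny Python sketch (not from the slides) of one possible hyperedge representation matching this definition; the field names are illustrative and are reused by the hypergraph sketches further below.

```python
# One possible hyperedge representation: tails T(e), head h(e), and a weight
# function fe mapping the tails' weights to a weight for the head.
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass(frozen=True)
class Hyperedge:
    tails: Tuple[str, ...]           # T(e); arity |e| = len(tails)
    head: str                        # h(e)
    fe: Callable[..., float]         # weight function, e.g. semiring-composed a * b * w(e)

# semiring-composed case for a CKY-style rule with rule weight 0.5 (made up):
# Hyperedge(("Y_i_j", "Z_j_k"), "X_i_k", lambda a, b: a * b * 0.5)
```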

Liang Huang (Penn) Dynamic Programming

Hypergraphs and Deduction
(Nederhof, 2003)

[figure: a hyperedge fe with tails u1 : a, u2 : b and head v : fe(a, b) corresponds to a deduction step: antecedents (B, i, k) : a and (C, k, j) : b derive the consequent (A, i, j) : a × b × Pr(A → B C) via the rule A → B C]

Liang Huang (Penn) Dynamic Programming

Related Formalisms

[figure: a hyperedge e from vertices u1, u2 to v; viewed as an AND/OR graph, the vertices are OR-nodes and the hyperedge is an AND-node]

Liang Huang (Penn) Dynamic Programming

Packed Forests
• a compact representation of many parses
• by sharing common sub-derivations
• polynomial-space encoding of an exponentially large set

(Klein and Manning, 2001; Huang and Chiang, 2005)

[figure: a packed forest (a hypergraph with nodes and hyperedges) for “0 I 1 saw 2 him 3 with 4 a 5 mirror 6”]

Liang Huang (Penn) Dynamic Programming

Weight Functions and Semirings

[figure: hyperedge fe with tails u1, ..., uk and head v computes fe(a1, ..., ak)]

special case (“semiring-composed”): fe(a1, ..., ak) = a1 ⊗ ... ⊗ ak ⊗ w(e)
graph case: fe(a) = a ⊗ w(e), recovering the edge update d(u) ⊗ w(e)

can also extend monotonicity and superiority to general weight functions

Liang Huang (Penn) Dynamic Programming

Generalizing Semiring Properties
• monotonicity
• semiring: a ≤ b => a ⊗ c ≤ b ⊗ c
• for all weight functions f, for all a1 ... ak, for all i: if a’i ≤ ai then f(a1 ... a’i ... ak) ≤ f(a1 ... ai ... ak)
• superiority
• semiring: a ≤ a ⊗ b, b ≤ a ⊗ b
• for all f, for all a1 ... ak, for all i: ai ≤ f(a1, ..., ak)
• acyclicity
• degenerate a hypergraph back into a graph

Liang Huang (Penn) Dynamic Programming

Two Dimensional Survey (roadmap recap; see the table above)

Liang Huang (Penn) Dynamic Programming

Viterbi Algorithm for DAGs (recap of the graph algorithm above)

Liang Huang (Penn) Dynamic Programming

Viterbi Algorithm for DAHs (directed acyclic hypergraphs)
1. topological sort
2. visit each vertex v in sorted order and do updates
• for each incoming hyperedge e = ((u1, ..., u|e|), v, fe)
• use the d(ui)’s to update d(v):  d(v) ⊕= fe(d(u1), ..., d(u|e|))
• key observation: the d(ui)’s are fixed to optimal at this time
• time complexity: O(V + E) (assuming constant arity)
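A Python sketch (not from the slides) of this generalized Viterbi, reusing the hypothetical `Hyperedge` representation above; it assumes the Viterbi case (⊕ = max) and that `order` is topological.

```python
# Generalized Viterbi over an acyclic hypergraph. `incoming[v]` lists the
# hyperedges whose head is v; `axioms` holds the initial weights of leaf vertices.
def viterbi_dah(order, incoming, axioms):
    d = dict(axioms)
    for v in order:
        for e in incoming.get(v, []):
            cand = e.fe(*(d[u] for u in e.tails))   # tails are already fixed to optimal
            d[v] = max(d.get(v, 0.0), cand)         # summary operator is max in the Viterbi case
    return d
```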

Liang Huang (Penn) Dynamic Programming

Example: CKY Parsing
• parsing with CFGs in Chomsky Normal Form (CNF)
• typical instance of the generalized Viterbi for DAHs
• many variants of CKY ~ various topological orderings: bottom-up, left-to-right, right-to-left
• time complexity: O(n^3 |P|)

[figure: three CKY charts reaching the goal item (S, 0, n) in bottom-up, left-to-right, and right-to-left order]
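A compact Python sketch (not from the slides) of Viterbi CKY for a CNF PCFG, i.e., the bottom-up ordering mentioned above; the grammar format (dicts of rule probabilities) is illustrative.

```python
# Viterbi CKY sketch for a PCFG in CNF. `lex[(A, word)]` and `binary[(A, B, C)]`
# are rule probabilities; returns best[(i, j, A)] = prob. of the best A over w[i:j].
def cky(words, lex, binary):
    n = len(words)
    best = {}
    for i, w in enumerate(words):                      # width-1 spans (lexical rules)
        for (A, word), p in lex.items():
            if word == w:
                best[(i, i + 1, A)] = max(best.get((i, i + 1, A), 0.0), p)
    for width in range(2, n + 1):                      # bottom-up over span widths
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):                  # split point
                for (A, B, C), p in binary.items():
                    b = best.get((i, k, B), 0.0)
                    c = best.get((k, j, C), 0.0)
                    if b and c:
                        cand = b * c * p               # fe(b, c) = b * c * Pr(A -> B C)
                        best[(i, j, A)] = max(best.get((i, j, A), 0.0), cand)
    return best

# words = ["I", "saw", "him"]
# lex = {("NP", "I"): 0.4, ("V", "saw"): 1.0, ("NP", "him"): 0.3}
# binary = {("S", "NP", "VP"): 1.0, ("VP", "V", "NP"): 1.0}
# cky(words, lex, binary)[(0, 3, "S")]   # -> 0.12
```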

Liang Huang (Penn) Dynamic Programming

Example: Syntax-based MT
• synchronous context-free grammars (SCFGs)
• context-free grammar in two dimensions
• generating pairs of strings/trees simultaneously
• co-indexed nonterminals further rewritten as a unit

VP → PP(1) VP(2), VP(2) PP(1)
VP → juxing le huitan, held a meeting
PP → yu Shalong, with Sharon

[figure: paired trees: source VP(PP(yu Shalong) VP(juxing le huitan)), target VP(VP(held a meeting) PP(with Sharon))]

Liang Huang (Penn) Dynamic Programming

Translation as Parsing
• translation with SCFGs => monolingual parsing
• parse the source input with the source projection
• build the corresponding target sub-strings in parallel

VP → PP(1) VP(2), VP(2) PP(1)
VP → juxing le huitan, held a meeting
PP → yu Shalong, with Sharon

[figure: chart items PP1,3 and VP3,6 over “yu Shalong juxing le huitan” combine into VP1,6; their target sides “with Sharon” and “held a talk” are reordered into “held a talk with Sharon”]

complexity: same as CKY parsing -- O(n^3)

Liang Huang (Penn) Dynamic Programming

Adding a Bigram Model

[figure: chart items refined with target boundary words: PP1,3 (with ... Sharon) and VP3,6 (held ... talk) combine, scoring a bigram at the boundary, into VP1,6 (held ... Sharon); alternative refined items include (with ... Shalong), (along ... Sharon), (held ... meeting), (hold ... talks)]

complexity: O(n^3 V^(4(m-1))) for m-gram language models

Liang Huang (Penn) Dynamic Programming

Two Dimensional Survey (roadmap recap; see the table above)

Liang Huang (Penn) Dynamic Programming

Viterbi Algorithm for DAHs (recap of the algorithm above)

Liang Huang (Penn) Dynamic Programming

Forward Variant for DAHs
1. topological sort
2. visit each vertex v in sorted order and do updates
• for each outgoing hyperedge e = ((u1, ..., u|e|), h(e), fe)
• if the d(ui)’s have all been fixed to optimal
• use the d(ui)’s to update d(h(e))
• time complexity: O(V + E)

Q: how to avoid repeated checking? maintain a counter r[e] for each e (how many tails are yet to be fixed?) and fire this hyperedge only when r[e] = 0

[figure: v is one tail ui of hyperedge e; once all tails are fixed, e updates its head h(e)]

Liang Huang (Penn) Dynamic Programming

Dijkstra Algorithm (recap of the graph algorithm above)

Liang Huang (Penn) Dynamic Programming

Knuth (1977) Algorithm
• keep a cut (S : V - S) where S vertices are fixed
• maintain a priority queue Q of V - S vertices
• each iteration choose the best vertex v from Q
• move v to S, and use d(v) to forward-update others

time complexity: O((V + E) lg V) (binary heap); O(V lg V + E) (Fibonacci heap)

[figure: when the popped vertex v completes the tails of a hyperedge e (together with u1 ∈ S), fe fires and updates the head h(e)]
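A Python sketch (not from the slides) of this Knuth-style best-first search for the Viterbi (max, ×)-like case, reusing the hypothetical `Hyperedge` representation; the tail counter from the forward variant is reused so each hyperedge fires only once all its tails are fixed.

```python
# Knuth (1977)-style best-first search on a hypergraph: repeatedly pop the best
# unfixed vertex, fix it, and fire any hyperedge whose tails are now all fixed.
import heapq

def knuth_best_first(edges, axioms):
    d = dict(axioms)                                     # axiom vertices and their weights
    fixed = set()
    remaining = {id(e): len(e.tails) for e in edges}     # tails of e not yet fixed
    by_tail = {}
    for e in edges:
        for u in e.tails:
            by_tail.setdefault(u, []).append(e)
    queue = [(-w, v) for v, w in d.items()]              # max-heap via negated weights
    heapq.heapify(queue)
    while queue:
        _, v = heapq.heappop(queue)
        if v in fixed:
            continue
        fixed.add(v)                                     # v moves to the fixed side of the cut
        for e in by_tail.get(v, []):
            remaining[id(e)] -= 1
            if remaining[id(e)] == 0:                    # all tails fixed: fire e
                cand = e.fe(*(d[u] for u in e.tails))
                if e.head not in fixed and cand > d.get(e.head, float("-inf")):
                    d[e.head] = cand
                    heapq.heappush(queue, (-cand, e.head))
    return d
```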

Liang Huang (Penn) Dynamic Programming

Example: Best-First/A* Parsing
• Knuth for parsing: best-first (Caraballo & Charniak, 1998)
• further speed-up: use A* heuristics
• showed significant speed-up with carefully designed heuristic functions (Klein and Manning, 2003)
• heuristic function: an estimate of the outside cost

[open problem] can you still define a heuristic function if the weight functions are not semiring-composed?

[figure: chart with goal item (S, 0, n)]

Liang Huang (Penn) Dynamic Programming

Outside Cost in Hypergraph
• outside cost: what is yet to be paid to reach the goal
• let’s only consider the semiring-composed case
• and only acyclic hypergraphs
• after computing d(v) for all v, bottom-up
• backward Viterbi from the top down (outside-in):
h(S0,n) = 1̄
h(v) ⊕= h(u) ⊗ w(e) ⊗ d(v’)   for each hyperedge e with head u and tails v, v’

Q: d(v) ⊗ h(v) = ?

[figure: analogy with graphs: on a graph, d(v) is the s-to-v cost and h(v) the v-to-t cost; in the hypergraph, v sits under head u via hyperedge e with sibling v’, below the goal S0,n]
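A Python sketch (not from the slides) of this backward (outside) pass for the Viterbi, semiring-composed case, reusing the hypothetical `Hyperedge` representation plus a made-up `weights` map from hyperedges to their rule weights w(e).

```python
# Outside-cost sketch (Viterbi case): after the inside pass has filled d[v],
# sweep heads in reverse topological order and push outside costs to each tail:
# h(tail) gets max'ed with h(head) * w(e) * (product of d over the sibling tails).
def outside(order, incoming, d, weights, goal):
    h = {v: 0.0 for v in order}
    h[goal] = 1.0                                  # h(S0,n) = 1
    for v in reversed(order):                      # top-down (outside-in); h(v) is final here
        for e in incoming.get(v, []):
            for i, u in enumerate(e.tails):
                siblings = 1.0
                for j, u2 in enumerate(e.tails):
                    if j != i:
                        siblings *= d[u2]          # the d(v') factors
                h[u] = max(h.get(u, 0.0), h[v] * weights[id(e)] * siblings)
    return h
```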

Liang Huang (Penn) Dynamic Programming

Projection-based Heuristics
• how to guess? project onto a coarser-grained space
• and parse with the coarser grammar
• outside cost of the coarser item serves as the heuristic
• ĥ(VBD2,3) = h’(V2,3)

(Klein and Manning, 2003)

Liang Huang (Penn) Dynamic Programming

Analogy with Graphs

Liang Huang (Penn) Dynamic Programming

More on Coarse-to-Fine
• multilevel coarse-to-fine A*
• heuristic = exact outside cost in the previous stage
• ĥi(v) = hi-1(proj i-1(v))
• VBD > V > X:  ĥi(VBD1,5) = hi-1(V1,5);  ĥi-1(V1,5) = hi-2(X1,5)
• multilevel coarse-to-fine Viterbi w/ beam search
• Viterbi + beam pruning in each stage
• prune according to merit: d(v) ⊗ h(v) ⊘ d(TOP)
• hard to derive a provably correct threshold
• in practice: use a preset threshold (but works well!)

Liang Huang (Penn) Dynamic Programming

Same Picture Again

[Venn diagram: within monotonic optimization problems, one circle is “acyclic: Viterbi” and the other is “superior: Knuth”; many NLP problems, e.g. PCFG parsing with CNF, lie in their intersection. Outside the circles: the Inside-Outside Algorithm (Inside semiring), non-probabilistic (discriminative) parsing, cyclic grammars; when both fail, a generalized generalized Bellman-Ford is open]

Liang Huang (Penn) Dynamic Programming

Take Home Message
• Dynamic Programming is cool, easy, and universal!
• two frameworks and two types of algorithms
• monotonicity; acyclicity and/or superiority
• topological (Viterbi) vs. best-first style (Dijkstra/Knuth/A*)
• when to choose which: A* can finish early if lucky
• graph (lattice) vs. hypergraph (forest)
• incremental, finite-state vs. branching, context-free
• covered many typical NLP applications
• a better understanding of theory helps in practice

THE END - Thanks!
Final slides will be available on my website.
Questions? Comments?