Probabilistic Models of Nonprojective Dependency Trees

David A. Smith
Center for Language and Speech Processing
Computer Science Dept.
Johns Hopkins University

Noah A. Smith
Language Technologies Institute
Machine Learning Dept.
School of Computer Science
Carnegie Mellon University

EMNLP-CoNLL, 28 June 2007
See Also

• "On the Complexity of Non-Projective Data-Driven Dependency Parsing." R. McDonald and G. Satta. IWPT 2007.
• "Structured Prediction Models via the Matrix-Tree Theorem." T. Koo, A. Globerson, X. Carreras, and M. Collins. EMNLP-CoNLL 2007. Coming up next!
Nonprojective Syntax

Latin: ROOT ista meam norit gloria canitiem
       that(NOM) my(ACC) may-know glory(NOM) going-gray(ACC)
       "That glory shall last till I go gray"

English: ROOT I 'll give a talk tomorrow on bootstrapping

How would we parse these? In both sentences the dependency edges cross: "ista … gloria" and "meam … canitiem" interleave, and "on bootstrapping" attaches to "talk" across "tomorrow".
Edge-Factored Models (McDonald et al., 2005)
Score each possible edge in isolation:

$$s(i, j) = \exp[\mathbf{w} \cdot \mathbf{f}(i, j)]$$

a nonnegative score for the edge attaching child i to parent j. Collect the scores in a matrix, rows indexed by parents and columns by children (column 0 is zero because ROOT takes no parent):

$$\begin{bmatrix}
0 & s(1,0) & s(2,0) & \cdots & s(n,0) \\
0 & 0 & s(2,1) & \cdots & s(n,1) \\
0 & s(1,2) & 0 & \cdots & s(n,2) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & s(1,n) & s(2,n) & \cdots & 0
\end{bmatrix}$$

Decoding finds the tree with the greatest product of edge scores among all legal trees:

$$\max_{y \in \mathcal{T}} \prod_{(i,j) \in y} s(i, j)$$

• Score edges in isolation
• Find the maximum spanning tree with Chu-Liu-Edmonds
• NP-hard to add sibling or degree constraints, or hidden node variables
What about training? (Unlabeled parsing, for now.)
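To make the setup concrete, here is a minimal sketch (not the authors' code) in Python; `feats` is a hypothetical per-edge feature function standing in for the paper's real feature templates.

```python
import numpy as np

def edge_score(w, feats, i, j):
    # s(i, j) = exp[w . f(i, j)] -- strictly positive by construction
    return np.exp(w @ feats(i, j))

def tree_score(w, feats, parents):
    # parents[i - 1] = j encodes the edge (child i, parent j);
    # a tree's score is the product of its edge scores
    return np.prod([edge_score(w, feats, i, j)
                    for i, j in enumerate(parents, start=1)])
```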
If Only It Were Projective…
ROOT I 'll give a talk tomorrow on bootstrapping
An Inside-Outside algorithm gives us:

• The normalizing constant for globally normalized models:

$$p(y \mid x) = Z_x^{-1} \, e^{\mathbf{w} \cdot \mathbf{f}(x, y)}$$

• Posterior probabilities of edges

• Sums over hidden variables:

$$p(y \mid x) = \sum_z p(y, z \mid x)$$
But we can’t use Inside-Outside for nonprojective parsing!
Graph Theory to the Rescue!
Tutte’s Matrix-Tree Theorem (1948)
The determinant of the Kirchhoff (a.k.a. Laplacian) matrix of a directed graph G, with row and column r removed, equals the sum of the scores of all directed spanning trees of G rooted at node r.
Exactly the Z we need!
O(n³) time!
Building the Kirchhoff (Laplacian) Matrix

Start from the edge-score matrix and negate every score:

$$\begin{bmatrix}
0 & -s(1,0) & -s(2,0) & \cdots & -s(n,0) \\
0 & 0 & -s(2,1) & \cdots & -s(n,1) \\
0 & -s(1,2) & 0 & \cdots & -s(n,2) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & -s(1,n) & -s(2,n) & \cdots & 0
\end{bmatrix}$$

Replace each diagonal entry with the sum of the scores in its column (each column gathers the candidate parents of one child):

$$\begin{bmatrix}
0 & -s(1,0) & -s(2,0) & \cdots & -s(n,0) \\
0 & \sum_{j \ne 1} s(1,j) & -s(2,1) & \cdots & -s(n,1) \\
0 & -s(1,2) & \sum_{j \ne 2} s(2,j) & \cdots & -s(n,2) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & -s(1,n) & -s(2,n) & \cdots & \sum_{j \ne n} s(n,j)
\end{bmatrix}$$

Strike the root row and column, leaving the n × n Kirchhoff matrix K:

$$K = \begin{bmatrix}
\sum_{j \ne 1} s(1,j) & -s(2,1) & \cdots & -s(n,1) \\
-s(1,2) & \sum_{j \ne 2} s(2,j) & \cdots & -s(n,2) \\
\vdots & \vdots & \ddots & \vdots \\
-s(1,n) & -s(2,n) & \cdots & \sum_{j \ne n} s(n,j)
\end{bmatrix}$$

In short:
• Negate edge scores
• Sum columns (children) onto the diagonal
• Strike the root row and column
• Take the determinant
N.B.: This construction allows multiple children of the root; for a single-root variant, see Koo et al. 2007.
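Here is a minimal sketch (not the authors' code) of the construction above in NumPy, with a brute-force check of Tutte's theorem on a toy example. It assumes a hypothetical (n+1) × (n+1) array S with S[j, i] = s(i, j) ≥ 0, the score of attaching child i to parent j, where index 0 is ROOT.

```python
import itertools
import numpy as np

def kirchhoff_minor(S):
    A = -S.copy()                     # negate edge scores
    np.fill_diagonal(A, 0.0)          # no self-loops
    A[:, 0] = 0.0                     # ROOT has no parent
    A[np.diag_indices_from(A)] = -A.sum(axis=0)   # column sums on the diagonal
    return A[1:, 1:]                  # strike the root row and column

def partition_function(S):
    # |K| = Z, the total score of all spanning trees rooted at ROOT;
    # for long sentences prefer np.linalg.slogdet to avoid overflow
    return np.linalg.det(kirchhoff_minor(S))

def _reaches_root(parents, i):
    seen = set()
    while i != 0:
        if i in seen:
            return False              # cycle: not a tree
        seen.add(i)
        i = parents[i - 1]
    return True

def brute_force_Z(S):
    # verify the theorem by enumerating every parent assignment and
    # keeping only the cycle-free ones (i.e., the spanning trees)
    n = S.shape[0] - 1
    total = 0.0
    for parents in itertools.product(range(n + 1), repeat=n):
        if any(j == i + 1 for i, j in enumerate(parents)):
            continue                  # no word is its own parent
        if all(_reaches_root(parents, i) for i in range(1, n + 1)):
            total += np.prod([S[j, i] for i, j in enumerate(parents, start=1)])
    return total

rng = np.random.default_rng(0)
S = rng.uniform(0.5, 2.0, size=(4, 4))            # 3 words + ROOT
print(partition_function(S), brute_force_Z(S))    # the two agree
```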
Why Should This Work?

Analogy to Chu-Liu-Edmonds: every node selects its best parent; if cycles form, contract them and recur. The determinant of K (the matrix above) decomposes the same way. Let

K′ ≡ K with the edge (1, 2) contracted
K″ ≡ K({1,2} | {1,2}), i.e. K with rows and columns 1 and 2 deleted

Then

$$|K| = s(1,2)\,|K'| + |K''|$$

This is clear for a 1 × 1 matrix; use induction for larger ones. (Classical for the undirected case; the directed case needs special handling of the root.)
When You Have a Hammer…
Matrix-Tree Theorem
• Sequence-normalized log-linear models (Lafferty et al. '01)
• Minimum Bayes-risk parsing (cf. Goodman '96)
• Hidden-variable models
• O(n) inference with length constraints (cf. N. Smith & Eisner '05)
• Minimum-risk training (D. Smith & Eisner '06)
• Tree (Rényi) entropy (Hwa '01; S & E '07)
Analogy to Other Models
                     Projective                  Nonprojective
Generative           PCFGs                       ·
Action-based         Shift-reduce                Parent-predicting (K. Hall '07)
Max-margin
(or error-driven)    (e.g. McDonald, Collins)    (e.g. McDonald)
Global log-linear    ·                           This work

(The original slide also drew the parallel to sequence models.)
More Machinery: The Gradient
Since

$$\frac{\partial \log |A|}{\partial [A]_{j,i}} = [A^{-1}]_{i,j},$$

the gradient with respect to each edge score is available in closed form:

$$\frac{\partial \log Z_\theta(x)}{\partial \log s(i,j)} = s(i,j)\left([K^{-1}]_{i,i} - [K^{-1}]_{i,j}\right), \quad i = 1, \ldots, n, \; j = 0, \ldots, n$$

(the $[K^{-1}]_{i,j}$ term vanishes when j = 0, the root). Invert the Kirchhoff matrix K in O(n³) time via LU factorization.

The edge gradient is also the edge posterior probability.
Use the chain rule to backpropagate into s(i,j), whatever its internal structure may be.
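A sketch of that computation, reusing the `kirchhoff_minor` helper and conventions from the earlier sketch (S[j, i] = s(i, j), index 0 = ROOT):

```python
import numpy as np

def edge_posteriors(S):
    """P[j, i] = p((i, j) in y | x), the posterior that j parents i."""
    n = S.shape[0] - 1
    Kinv = np.linalg.inv(kirchhoff_minor(S))   # O(n^3), via LU factorization
    P = np.zeros_like(S)
    for i in range(1, n + 1):                  # child i; node v maps to index v - 1
        P[0, i] = S[0, i] * Kinv[i - 1, i - 1]
        for j in range(1, n + 1):              # candidate parent j
            if i != j:
                P[j, i] = S[j, i] * (Kinv[i - 1, i - 1] - Kinv[i - 1, j - 1])
    return P

# sanity check: every word's parent distribution sums to one
rng = np.random.default_rng(0)
S = rng.uniform(0.5, 2.0, size=(5, 5))
print(edge_posteriors(S).sum(axis=0))          # ~[0., 1., 1., 1., 1.]
```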
Nonprojective Conditional Log-Linear Training
train   Arabic   Czech   Danish   Dutch
MIRA    79.9     81.4    86.6     90.0
CL      80.4     80.2    87.5     90.0
• CoNLL 2006 Danish and Dutch; CoNLL 2007 Arabic and Czech
• Features from McDonald et al. 2005
• Compared with MSTParser's MIRA max-margin training
• Trained conditional log-linear (CL) weights with stochastic gradient descent
• Same number of iterations and stopping criteria as MIRA
• Significance assessed with a paired permutation test
$$p(y \mid x) = Z_x^{-1} \, e^{\mathbf{w} \cdot \mathbf{f}(x, y)}$$
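A schematic sketch (not the authors' training code) of one stochastic gradient step for this objective; `feats(i, j)` is a hypothetical per-edge feature function, and `edge_posteriors` is the sketch from the gradient slide above:

```python
import numpy as np

def sgd_step(w, gold_parents, feats, eta=0.1):
    n = len(gold_parents)
    S = np.zeros((n + 1, n + 1))
    for j in range(n + 1):                    # candidate parent
        for i in range(1, n + 1):             # child
            if i != j:
                S[j, i] = np.exp(w @ feats(i, j))
    P = edge_posteriors(S)
    # grad log p(y|x) = observed features - expected features
    grad = np.zeros_like(w)
    for i, j in enumerate(gold_parents, start=1):
        grad += feats(i, j)                   # gold tree's features
    for j in range(n + 1):
        for i in range(1, n + 1):
            if i != j:
                grad -= P[j, i] * feats(i, j) # model expectation
    return w + eta * grad
```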
Minimum Bayes-Risk Parsing
parse   train   Arabic   Czech   Danish   Dutch
map     MIRA    79.9     81.4    86.6     90.0
map     CL      80.4     80.2    87.5     90.0
mBr     MIRA    79.4     80.3    85.0     87.2
mBr     CL      80.5     80.4    87.5     90.0
Select not the tree with the highest probability, but the tree with the most expected correct edges:
$$\hat{y} = \operatorname*{argmin}_{y} \; \mathbb{E}_{p(y' \mid x)}\!\left[\sum_{i=1}^{n} -\delta(y'_i, y_i)\right] = \operatorname*{argmax}_{y} \sum_{i=1}^{n} p(y_i \in \text{parse of } x \mid x)$$
Plug the posteriors into MST decoding. (MIRA doesn't estimate probabilities.)
N.B. One could do mBr inside MIRA.
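A sketch of this decoder under the earlier conventions, reusing `edge_posteriors` and letting networkx's Chu-Liu-Edmonds implementation find the maximum spanning arborescence over the posteriors:

```python
import networkx as nx

def mbr_parse(S):
    # the arborescence that maximizes total posterior weight is the tree
    # with the most expected correct edges
    P = edge_posteriors(S)
    n = S.shape[0] - 1
    G = nx.DiGraph()
    for j in range(n + 1):                    # parent
        for i in range(1, n + 1):             # child
            if i != j:
                G.add_edge(j, i, weight=P[j, i])
    tree = nx.maximum_spanning_arborescence(G)
    return {i: j for j, i in tree.edges()}    # child -> parent map
```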
Edge Clustering
(Supervised) labeled dependency parsing assigns a role to each edge, e.g.

Franz <--SUBJ-- loves --OBJ--> Milena

Replace the gold labels with latent clusters (A, B, C, … or X, Y, Z, …):

Franz <--A-- loves --B--> Milena

Simple idea: conjoin each model feature with a cluster. Sum out all possible edge labelings if we don't care about the labels per se; a small sketch follows.
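A sketch of the label-summed edge score, with a hypothetical labeled feature function `feats(i, j, c)`. Nothing else in the Matrix-Tree machinery changes, because each edge still carries a single nonnegative score:

```python
import numpy as np

def clustered_edge_score(w, feats, i, j, n_clusters):
    # s(i, j) = sum_c exp[w . f(i, j, c)], marginalizing the cluster out
    return sum(np.exp(w @ feats(i, j, c)) for c in range(n_clusters))
```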
Edge Clustering
[Bar chart: unlabeled attachment accuracy (74-92%) on Arabic, Danish, and Dutch with 1, 2, 16, and 32 edge clusters.]
No significant gains or losses from clustering
What’s Wrong with Edge Clustering?
Edge labels don’t interact
Unlike clusters on PCFG nonterminals (e.g. Matsuzaki et al.’05)
Cf. small/no gains for unlabeled accuracy from supervised labeled parsers
NP-A
NP-B NP-A
Franz loves Milena
A B
No interaction
Interaction in rewrite rule
Constraints on Link Length
Example with L = 1, R = 2 (all edge scores 1):

$$\begin{bmatrix}
0 & -1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 & -1 \\
0 & 2 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & 3 & -1 & 0 & 0 & 0 & 0 & 0 & 0 \\
0 & -1 & -1 & 4 & -1 & 0 & 0 & 0 & 0 & 0 \\
0 & 0 & -1 & -1 & 4 & -1 & 0 & 0 & 0 & 0 \\
0 & 0 & 0 & -1 & -1 & 4 & -1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & -1 & 4 & -1 & 0 & 0 \\
0 & 0 & 0 & 0 & 0 & -1 & -1 & 4 & -1 & 0 \\
0 & 0 & 0 & 0 & 0 & 0 & -1 & -1 & 3 & -1 \\
0 & 0 & 0 & 0 & 0 & 0 & 0 & -1 & -1 & 2
\end{bmatrix}$$
• Maximum left/right child distances L and R (cf. Eisner & N. Smith '05)
• The Kirchhoff matrix is band-diagonal once the root row and column are removed
• Inversion in O(min(L³R², L²R³) · n) time, as sketched below
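A small illustrative sketch of where the band comes from, reusing the `kirchhoff_minor` helper and conventions from the earlier sketches. It only exhibits the band structure; it does not implement the fast banded inversion itself:

```python
import numpy as np

def banded_scores(n, L, R):
    S = np.ones((n + 1, n + 1))          # all edge scores 1, as in the example
    for j in range(1, n + 1):            # parent j (ROOT row 0 is unrestricted)
        for i in range(1, n + 1):        # child i
            if (i < j and j - i > L) or (i > j and i - j > R):
                S[j, i] = 0.0            # attachment too distant
    return S

K = kirchhoff_minor(banded_scores(9, L=1, R=2))
print(K)   # band-diagonal, like the example above (up to transposition)
```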
Conclusions
• O(n³) inference for edge-factored nonprojective dependency models
• Performance closely comparable to MIRA
• Learned edge clustering doesn't seem to help unlabeled parsing
• Many other applications to hit with this hammer
Thanks
Jason Eisner
Keith Hall
Sanjeev Khudanpur
The Anonymous Reviewers
Ryan McDonald, Michael Collins, and colleagues, for sharing drafts