Structured Prediction: A Large Margin Approach
Ben Taskar, University of Pennsylvania
Joint work with:
V. Chatalbashev, M. Collins, C. Guestrin, M. Jordan, D. Klein, D. Koller, S. Lacoste-Julien, C. Manning
“Don’t worry, Howard. The big questions are multiple choice.”
Handwriting Recognition
x → y: image of a handwritten word → “brace”
Sequential structure
Object Segmentation
x → y: scene → per-region labels
Spatial structure
Natural Language Parsing
The screen was a sea of red
x → y: sentence → parse tree
Recursive structure
Bilingual Word Alignment
What is the anticipated cost of collecting fees under the new proposal?
En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?
x → y: sentence pair → word-by-word alignment
Combinatorial structure
Protein Structure and Disulfide Bridges
Protein: 1IMT
AVITGACERDLQCGKGTCCAVSLWIKSVRVCTPVGTSGEDCHPASHKIPFSGQRMHHTCPCAPNLACVQTSPKKFKCLSK
Local Prediction
Classify using local information
Ignores correlations & constraints!
[Figure: letters classified independently, e.g. “b r e a c”; point cloud labels: building, tree, shrub, ground]
Structured Prediction
Use local information
Exploit correlations
[Figure: letters classified jointly; point cloud labels: building, tree, shrub, ground]
Outline
Structured prediction models
  Sequences (CRFs)
  Trees (CFGs)
  Associative Markov networks (special MRFs)
  Matchings
Structured large margin estimation
  Margins and structure
  Min-max formulation
  Linear programming inference
  Certificate formulation
Structured Models
Mild assumption: the scoring function is a linear combination of features,
  $s(x, y) = w^\top f(x, y)$,
and prediction maximizes the score over the space of feasible outputs,
  $y^* = \arg\max_{y \in \mathcal{Y}(x)} w^\top f(x, y)$
Chain Markov Net (aka CRF*)
[Figure: chain over letter variables $y_1, \dots, y_5 \in \{a, \dots, z\}$, each connected to its image $x_j$; node potentials score $(x_j, y_j)$, edge potentials score $(y_j, y_{j+1})$]
  $s(x, y) = \sum_j w_n^\top f_n(x, y_j) + \sum_j w_e^\top f_e(y_j, y_{j+1})$
*Lafferty et al. 01
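Below is a minimal sketch (illustrative, not part of the original slides) of max-score decoding in such a chain by Viterbi dynamic programming; `node_scores` and `edge_scores` are assumed to be precomputed from $w$ and the features:

```python
import numpy as np

def viterbi_decode(node_scores, edge_scores):
    """Max-scoring label sequence in a chain model.

    node_scores: (T, K) array, score of label k at position t
    edge_scores: (K, K) array, score of transition k -> k'
    Returns the argmax label sequence as a list of ints.
    """
    T, K = node_scores.shape
    best = np.zeros((T, K))             # best score of a prefix ending in label k
    back = np.zeros((T, K), dtype=int)  # backpointers
    best[0] = node_scores[0]
    for t in range(1, T):
        cand = best[t - 1][:, None] + edge_scores  # (prev label, next label)
        back[t] = cand.argmax(axis=0)
        best[t] = cand.max(axis=0) + node_scores[t]
    # trace the argmax path backwards
    path = [int(best[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```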
Associative Markov Nets
Point features: spin-images, point height
Edge features: length of edge, edge orientation
[Figure: network over point labels $y_j$, $y_k$ with node potentials $\theta_j$ and edge potentials $\theta_{jk}$]
“Associative” restriction: edge potentials reward agreement, $\theta_{jk}(k, k) \ge 0$ and $\theta_{jk}(k, l) = 0$ for $k \neq l$
CFG Parsing
Features count rule productions and lexical emissions:
  #(NP → DT NN), …, #(PP → IN NP), …, #(NN → ‘sea’)
Bilingual Word Alignment
Features: position, orthography, association
[Figure: alignment matrix between the English sentence (‘What is the anticipated cost of collecting fees under the new proposal?’) and the French sentence (‘En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?’), with a candidate edge (j, k)]
Disulfide Bonds: Non-bipartite Matching
[Figure: cysteines numbered 1–6 along the sequence RSCCPCYWGGCPWGQNCYPEGCSGPKV; the disulfide bonds form a perfect non-bipartite matching over the cysteines]
Fariselli & Casadio ’01; Baldi et al. ’04
Scoring Function
[Figure: candidate bond between a cysteine pair (j, k) in RSCCPCYWGGCPWGQNCYPEGCSGPKV; the score of a matching is the sum of its bond scores]
Bond features: amino acid identities, physical/chemical properties
Structured Models
Mild assumptions: the scoring function is a linear combination of features and decomposes into a sum of part scores,
  $s(x, y) = w^\top f(x, y) = \sum_{p \in y} w^\top f(x, p)$,
and prediction maximizes the score over the space of feasible outputs,
  $y^* = \arg\max_{y \in \mathcal{Y}(x)} w^\top f(x, y)$
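Below is a minimal sketch (illustrative; `part_features` is a hypothetical precomputed feature map) of a part-factored linear scoring function matching the assumptions above:

```python
import numpy as np

def score(w, part_features, y):
    """Score a structured output y as a sum of part scores.

    w: (d,) weight vector
    part_features: dict mapping a part (e.g. a node or edge assignment)
                   to its (d,) feature vector f(x, p)
    y: iterable of parts that make up the output
    """
    f = sum(part_features[p] for p in y)  # f(x, y) = sum of part features
    return float(np.dot(w, f))
```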
Supervised Structured Prediction
Learning: estimate w from data
Prediction: $\arg\max_{y \in \mathcal{Y}(x)} w^\top f(x, y)$
  Example: weighted matching; generally, combinatorial optimization
Model estimation: likelihood (intractable) vs. margin; local (ignores structure) vs. structured
Outline
Structured prediction models
  Sequences (CRFs)
  Trees (CFGs)
  Associative Markov networks (special MRFs)
  Matchings
Structured large margin estimation
  Margins and structure
  Min-max formulation
  Linear programming inference
  Certificate formulation
OCR Example
We want:
  $w^\top f(x, \text{“brace”}) > w^\top f(x, y) \quad \forall\, y \neq \text{“brace”}$
Equivalently:
  $w^\top f(x, \text{“brace”}) > w^\top f(x, \text{“aaaaa”})$
  $w^\top f(x, \text{“brace”}) > w^\top f(x, \text{“aaaab”})$
  …
  $w^\top f(x, \text{“brace”}) > w^\top f(x, \text{“zzzzz”})$
  (a lot of constraints!)
Parsing Example
We want: for $x$ = ‘It was red’ with correct parse $y^*$,
  $w^\top f(x, y^*) > w^\top f(x, y) \quad \forall\, y \neq y^*$
Equivalently: the correct tree must outscore every alternative parse of ‘It was red’ (a lot of constraints!)
[Figure: the correct tree S → (A B, C D) compared against alternatives such as S → (A B, D F) and S → (E F, G H)]
Alignment Example
We want: for $x$ = (‘What is the’, ‘Quel est le’) with correct alignment $y^*$,
  $w^\top f(x, y^*) > w^\top f(x, y) \quad \forall\, y \neq y^*$
Equivalently: the correct alignment must outscore every alternative matching of the word positions 1, 2, 3 (a lot of constraints!)
[Figure: the correct alignment 1–1, 2–2, 3–3 compared against alternative matchings]
Structured Loss
Hamming distance counts per-position mistakes against the true output “brace”:
  ℓ(“brace”, “bcare”) = 2,  ℓ(“brace”, “brore”) = 2,  ℓ(“brace”, “broce”) = 1,  ℓ(“brace”, “brace”) = 0
[Figure: analogous losses 0, 1, 2, 2 for alignments of ‘What is the’ / ‘Quel est le’]
[Figure: analogous losses 0, 1, 2, 3 for parse trees of ‘It was red’]
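A minimal sketch (illustrative) of the Hamming loss used in these examples:

```python
def hamming_loss(y_true, y_pred):
    """Number of positions where two equal-length sequences disagree."""
    assert len(y_true) == len(y_pred)
    return sum(a != b for a, b in zip(y_true, y_pred))

# e.g. hamming_loss("brace", "broce") == 1
```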
Large margin estimation
Given training examples $(x^i, y^i)$, we want:
  $w^\top f(x^i, y^i) > w^\top f(x^i, y) \quad \forall\, y \neq y^i$
Maximize margin $\gamma$:
  $\max_{\|w\| \le 1} \gamma \quad \text{s.t.} \quad w^\top f(x^i, y^i) \ge w^\top f(x^i, y) + \gamma\, \ell(y^i, y) \;\; \forall i, y$
Mistake-weighted margin: $\ell(y^i, y)$ = # of mistakes in $y$
*Collins 02; Altun et al. 03; Taskar 03
Large margin estimation
Eliminate $\gamma$ (fix the margin, minimize the norm):
  $\min \tfrac{1}{2}\|w\|^2 \quad \text{s.t.} \quad w^\top f(x^i, y^i) \ge w^\top f(x^i, y) + \ell(y^i, y) \;\; \forall i, y$
Add slacks $\xi_i$ for the inseparable case:
  $\min \tfrac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t.} \quad w^\top f(x^i, y^i) \ge w^\top f(x^i, y) + \ell(y^i, y) - \xi_i \;\; \forall i, y$
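Below is a minimal sketch (illustrative; `feats`, `loss`, and `loss_aug_argmax` are assumed user-supplied) of the slack implied by this constraint set: the hinge for one example is its worst-case loss-augmented violation.

```python
import numpy as np

def structured_hinge(w, feats, loss, loss_aug_argmax, x, y_true):
    """Slack (hinge) for one example under the margin constraints above.

    feats(x, y)              -> (d,) joint feature vector f(x, y)
    loss(y_true, y)          -> structured loss, e.g. Hamming
    loss_aug_argmax(w, x, y) -> y maximizing w.f(x, y) + loss(y_true, y)
    """
    y_hat = loss_aug_argmax(w, x, y_true)  # loss-augmented inference
    viol = (np.dot(w, feats(x, y_hat)) + loss(y_true, y_hat)
            - np.dot(w, feats(x, y_true)))
    return max(0.0, float(viol))
```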
Large margin estimation
  Option 1: brute force enumeration of constraints
  Option 2: min-max formulation, which ‘plugs in’ a linear program for inference
Min-max formulation
Key step: replace the exponentially many constraints with a single max,
  $w^\top f(x^i, y^i) \ge \max_{y} \big[\, w^\top f(x^i, y) + \ell(y^i, y) \,\big] - \xi_i$
With structured (Hamming) loss, the max decomposes over parts, so inference can be recast: discrete optimization becomes LP inference (continuous optimization)
Outline
Structured prediction models
  Sequences (CRFs)
  Trees (CFGs)
  Associative Markov networks (special MRFs)
  Matchings
Structured large margin estimation
  Margins and structure
  Min-max formulation
  Linear programming inference
  Certificate formulation
y ↔ z Map for Markov Nets
[Figure: a labeling y over letters a–z is encoded as indicator variables, node marginals $z_j(\alpha) = \mathbb{1}[y_j = \alpha]$ and edge marginals $z_{jk}(\alpha, \beta) = \mathbb{1}[y_j = \alpha,\, y_k = \beta]$]
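A minimal sketch (illustrative) of this y → z map for a chain with K labels:

```python
import numpy as np

def encode(y, K):
    """Map a chain labeling y (ints in 0..K-1) to indicator 'marginals'."""
    T = len(y)
    z_node = np.zeros((T, K))
    z_edge = np.zeros((T - 1, K, K))
    for t, label in enumerate(y):
        z_node[t, label] = 1.0           # z_j(alpha) = 1[y_j = alpha]
    for t in range(T - 1):
        z_edge[t, y[t], y[t + 1]] = 1.0  # z_jk(a, b) = 1[y_j = a, y_k = b]
    return z_node, z_edge
```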
Markov Net Inference LP
  $\max_{z \ge 0} \; \sum_{j, \alpha} \theta_j(\alpha)\, z_j(\alpha) + \sum_{jk, \alpha\beta} \theta_{jk}(\alpha, \beta)\, z_{jk}(\alpha, \beta)$
  s.t.  $\sum_\alpha z_j(\alpha) = 1$  (normalization)
        $\sum_\beta z_{jk}(\alpha, \beta) = z_j(\alpha)$  (agreement)
Has integral solutions z for chains and trees
Can be fractional for untriangulated networks
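Below is a minimal sketch (illustrative, toy scores) of this inference LP for a two-node, two-label chain using scipy.optimize.linprog; for a chain the solution comes out integral, as the slide states.

```python
import numpy as np
from scipy.optimize import linprog

# Node scores theta_j(alpha) and edge scores theta_12(alpha, beta)
theta1, theta2 = np.array([1.0, 0.2]), np.array([0.3, 0.8])
theta12 = np.array([[0.0, 0.0], [0.0, 1.5]])  # rewards y1 = y2 = 1

# Variable order: z1(0), z1(1), z2(0), z2(1), z12(00), z12(01), z12(10), z12(11)
c = -np.concatenate([theta1, theta2, theta12.ravel()])  # linprog minimizes

A_eq = np.array([
    [1, 1, 0, 0, 0, 0, 0, 0],   # normalization: sum_a z1(a) = 1
    [0, 0, 1, 1, 0, 0, 0, 0],   # normalization: sum_a z2(a) = 1
    [-1, 0, 0, 0, 1, 1, 0, 0],  # agreement: sum_b z12(0,b) = z1(0)
    [0, -1, 0, 0, 0, 0, 1, 1],  # agreement: sum_b z12(1,b) = z1(1)
    [0, 0, -1, 0, 1, 0, 1, 0],  # agreement: sum_a z12(a,0) = z2(0)
    [0, 0, 0, -1, 0, 1, 0, 1],  # agreement: sum_a z12(a,1) = z2(1)
])
b_eq = np.array([1, 1, 0, 0, 0, 0])

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * 8)
print(res.x)  # integral on a chain: selects y1 = 1, y2 = 1
```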
Associative MN Inference LP
For K = 2, solutions are always integral (optimal)
For K > 2, within a factor of 2 of optimal
(uses the “associative” restriction on edge potentials)
CFG Chart
CNF tree = set of two types of parts: constituents (A, s, e) and CF rules (A → B C, s, m, e)
CFG Inference LP
  Constraints: inside/outside consistency between rule and constituent variables, plus a single root constituent
  Has integral solutions z
Matching Inference LP
  $\max_z \; \sum_{jk} s_{jk}\, z_{jk}$  s.t.  $\sum_k z_{jk} \le 1$, $\sum_j z_{jk} \le 1$  (degree constraints),  $0 \le z_{jk} \le 1$
Has integral solutions z
[Figure: bipartite alignment between ‘What is the anticipated cost of collecting fees under the new proposal?’ and ‘En vertu des nouvelles propositions, quel est le coût prévu de perception des droits?’, candidate edge (j, k)]
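A minimal sketch (illustrative, toy scores) of the integral solution this LP recovers for a bipartite alignment, using scipy.optimize.linear_sum_assignment:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# s[j, k] = score of aligning source word j to target word k (toy values)
s = np.array([[3.0, 0.1, 0.2],
              [0.5, 2.0, 0.3],
              [0.1, 0.4, 1.5]])

# linear_sum_assignment minimizes cost, so negate to maximize score
rows, cols = linear_sum_assignment(-s)
print(list(zip(rows, cols)))  # [(0, 0), (1, 1), (2, 2)]
```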
LP Duality
  Variables ↔ constraints; constraints ↔ variables
  Optimal values are the same when both feasible regions are bounded
Min-max Formulation
By LP duality, the inner inference LP (a max) is replaced by its dual (a min), collapsing the min-max into a single compact convex program
Min-max formulation summary
Formulation produces a concise QP for:
  Low-treewidth Markov networks
  Associative MNs (K = 2)
  Context-free grammars
  Bipartite matchings
Approximate for untriangulated MNs and for AMNs with K > 2
*Taskar et al. 04
Unfactored Primal/Dual
  By QP duality: exponentially many constraints/variables
Factored Primal/Dual
  By QP duality: the dual inherits structure from the problem-specific inference LP
  Its variables correspond to a decomposition of the variables of the flat case
The Connection
[Figure: the dual distribution over outputs “bcare”, “brore”, “broce” (losses 2, 2, 1) decomposes into factored node and edge marginals with fractional values such as .2, .15, .25, .4, .35, .65, .8]
Duals and Kernels
Kernel trick works in the factored dual: local functions (log-potentials) can use kernels
Alternatives: Perceptron
  Simple iterative method
  Unstable for structured output: fewer instances, big updates
  May not converge if non-separable; noisy
  Voted / averaged perceptron [Freund & Schapire 99; Collins 02]: regularize / reduce variance by aggregating over iterations (see the sketch below)
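Below is a minimal sketch (illustrative; `feats` and `argmax` are assumed user-supplied) of the averaged structured perceptron:

```python
import numpy as np

def averaged_structured_perceptron(data, feats, argmax, d, epochs=5):
    """data: list of (x, y) pairs; feats(x, y) -> (d,) feature vector;
    argmax(w, x) -> highest-scoring output under the current weights."""
    w = np.zeros(d)
    w_sum = np.zeros(d)
    for _ in range(epochs):
        for x, y in data:
            y_hat = argmax(w, x)                    # inference with current w
            if y_hat != y:
                w += feats(x, y) - feats(x, y_hat)  # perceptron update
            w_sum += w                              # accumulate for averaging
    return w_sum / (epochs * len(data))             # averaged weights
```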
Alternatives: Constraint Generation [Collins 02; Altun et al. 03; Tsochantaridis et al. 04]
  Add the most violated constraint, re-solve, repeat (see the sketch below)
  Handles more general loss functions
  Only a polynomial # of constraints needed
  Need to re-solve the QP many times
  Worst-case # of constraints larger than factored
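Below is a minimal sketch (illustrative; `solve_qp` and `loss_aug_argmax` are hypothetical user-supplied routines) of the constraint-generation loop:

```python
def constraint_generation(data, feats, loss, loss_aug_argmax, solve_qp,
                          tol=1e-4, max_iters=100):
    """Grow a working set of margin constraints until none is violated.

    solve_qp(constraints) -> (w, xi): minimizes ||w||^2/2 + C*sum(xi)
    subject to the working-set constraints (assumed external QP solver;
    xi is an array of per-example slacks).
    """
    constraints = []               # working set of (i, x, y, y_hat) tuples
    w, xi = solve_qp(constraints)  # start with no constraints
    for _ in range(max_iters):
        added = False
        for i, (x, y) in enumerate(data):
            y_hat = loss_aug_argmax(w, x, y)  # most violated output
            margin = w @ (feats(x, y) - feats(x, y_hat))
            if loss(y, y_hat) - margin > xi[i] + tol:
                constraints.append((i, x, y, y_hat))
                added = True
        if not added:
            break                      # all constraints satisfied
        w, xi = solve_qp(constraints)  # re-solve QP with the new working set
    return w
```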
Handwriting Recognition
Length: ~8 chars; letters: 16×8 pixels
10-fold train/test: 5000/50000 letters, 600/6000 words
Models: multiclass SVMs*, CRFs, M^3 nets
*Crammer & Singer 01
[Bar chart: test error (average per-character) for CRFs, MC-SVMs, and M^3 nets using raw pixels, quadratic kernel, and cubic kernel; scale 0–30%, lower is better]
45% error reduction over linear CRFs
33% error reduction over multiclass SVMs
Hypertext Classification
WebKB dataset*: four CS department websites, 1300 pages / 3500 links
Classify each page: faculty, course, student, project, other
Train on three universities, test on the fourth
[Bar chart: test error for SVMs, RMNs (trained with loopy belief propagation), and M^3Ns (relaxed dual); scale 0–20%, lower is better]
53% error reduction over SVMs
38% error reduction over RMNs
*Taskar et al. 02
3D Mapping
Laser Range Finder
GPS
IMU
Data provided by: Michael Montemerlo & Sebastian Thrun
Labels: ground, building, tree, shrub
Training: 30 thousand points; testing: 3 million points
Segmentation results
Hand-labeled 180K test points

Model   Accuracy
SVM     68%
V-SVM   73%
M^3N    93%
Fly-through
Word Alignment Results

Model                         *Error
Local learning + matching     10.0
Our approach                   8.5
GIZA/IBM4 [Och & Ney 03]       6.5
+Local learning + matching     5.4
+Our approach                  4.9
+Our approach + QAP            4.5

*Error: weighted combination of precision/recall [Lacoste-Julien, Taskar et al. 06]
Data: Hansards (Canadian Parliament); features induced on 1 million unsupervised sentences; trained on 100 sentences (10,000 edges); tested on 350 sentences (35,000 edges)
[Taskar et al. 05]
Outline
Structured prediction models
  Sequences (CRFs)
  Trees (CFGs)
  Associative Markov networks (special MRFs)
  Matchings
Structured large margin estimation
  Margins and structure
  Min-max formulation
  Linear programming inference
  Certificate formulation
Certificate formulation
  Non-bipartite matchings: O(n³) combinatorial algorithm, but no polynomial-size LP known
  Spanning trees: no polynomial-size LP known, but a simple certificate of optimality
Intuition: verifying optimality is easier than optimizing
Compact optimality condition of $y^i$ with respect to $w$
[Figure: matching over nodes 1–6 with candidate edges (i, j), (k, l)]
Certificate for non-bipartite matching
Alternating cycle: every other edge is in the matching
Augmenting alternating cycle: the score of the edges not in the matching is greater than that of the edges in the matching
Negate the scores of edges not in the matching: an augmenting alternating cycle becomes a negative-length alternating cycle
Matching is optimal ⇔ no negative alternating cycles
[Figure: alternating cycle over nodes 1–6]
Edmonds ’65
Certificate for non-bipartite matching
Pick any node r as root; let $d_j$ = length of the shortest alternating path from r to j
Triangle inequality: $d_k \le d_j + \mathrm{len}(j, k)$
Theorem: no negative-length cycle ⇔ such a distance function d exists
Can be expressed as linear constraints: O(n) distance variables, O(n²) constraints
[Figure: matching over nodes 1–6]
Certificate formulation
Formulation produces a compact QP for:
  Spanning trees
  Non-bipartite matchings
  Any problem with a compact optimality condition
*Taskar et al. 05
Disulfide Bonding Prediction
Data: Swiss-Prot 39; 450 sequences (4–10 cysteines)
Features: windows around C–C pairs; physical/chemical properties

Model                                   *Acc
Local learning + matching               41%
Recursive Neural Net [Baldi et al. 04]  52%
Our approach (certificate)              55%

*Accuracy: % of proteins with all bonds correct
[Taskar et al. 05]
Formulation summary
  Brute force enumeration
  Min-max formulation: ‘plug-in’ convex program for inference
  Certificate formulation: directly guarantee optimality of $y^i$
Omissions
  Kernels: non-parametric models
  Structured generalization bounds: bounds on Hamming loss
  Scalable algorithms (no QP solver needed):
    Structured SMO (works for chains, trees) [Taskar 04]
    Structured ExpGrad (works for chains, trees) [Bartlett et al. 04]
    Structured ExtraGrad (works for matchings, AMNs) [Taskar et al. 06]
Open questions
  Statistical consistency: hinge loss is not consistent for non-binary output [see Tewari & Bartlett 05; McAllester 07]
  Learning with approximate inference: does constant-factor approximate inference guarantee anything about learning? No [see Kulesza & Pereira 07]; perhaps other assumptions are needed
  Discriminative structure learning: using sparsifying priors
Conclusion
Two general techniques for structured large-margin estimation:
Exact, compact, convex formulations
Allow efficient use of kernels
Tractable when other estimation methods are not
Efficient learning algorithms
Empirical success on many domains
References
Y. Altun, I. Tsochantaridis, and T. Hofmann. Hidden Markov support vector machines. ICML 2003.
M. Collins. Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms. EMNLP 2002.
K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR 2001.
J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. ICML 2001.
More papers at http://www.cis.upenn.edu/~taskar
Modeling First Order Effects
Monotonicity, local inversion, local fertility
QAP is NP-complete, but sentences (30 words, 1k variables) solve in a few seconds (Mosek)
Learning: use the LP relaxation
Testing: using the LP, 83.5% of sentences and 99.85% of edges are integral
Segmentation Model: Min-Cut
  Local evidence (node potentials for labels 0/1); spatial smoothness (edge potentials)
  Computing the argmax is hard in general, but if the edge potentials are attractive, a min-cut algorithm applies; use multiway cut (via LP relaxation) for the multiclass case
  [Greig et al. 89; Boykov et al. 99; Kolmogorov & Zabih 02; Taskar et al. 04]
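Below is a minimal sketch (illustrative, assuming the standard s-t graph construction for binary attractive models and the networkx max-flow API) of MAP inference by min-cut:

```python
import networkx as nx

def mincut_map(unary, edges):
    """MAP for a binary MRF with attractive (nonnegative) pairwise weights.

    unary: dict node -> (cost_label0, cost_label1)
    edges: dict (i, j) -> weight w_ij >= 0 penalizing y_i != y_j
    Returns dict node -> label in {0, 1}.
    """
    G = nx.DiGraph()
    for i, (c0, c1) in unary.items():
        G.add_edge('s', i, capacity=c0)  # cut (paying c0) if i takes label 0
        G.add_edge(i, 't', capacity=c1)  # cut (paying c1) if i takes label 1
    for (i, j), w in edges.items():
        G.add_edge(i, j, capacity=w)     # cut if the two labels disagree
        G.add_edge(j, i, capacity=w)
    _, (source_side, _) = nx.minimum_cut(G, 's', 't')
    # nodes on the source side pay the i->t edge: label 1
    return {i: int(i in source_side) for i in unary}
```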
Scalable Algorithms
  Batch and online; linear in the size of the data
  Iterate until convergence: for each example in the training sample, run inference using the current parameters (varies by method)
    Online: update parameters using the computed example values
    Batch: update parameters using the computed sample values
  Structured SMO [Taskar et al. 03; Taskar 04]; Structured Exponentiated Gradient [Bartlett et al. 04]; Structured Extragradient [Taskar et al. 05]
Experimental Setup
  Standard Penn Treebank split (sections 2–21 / 22 / 23)
  Generative baselines: Klein & Manning 03 and Collins 99
  Discriminative: Basic = max-margin version of K&M 03; Lexical and Lexical + Aux
  Lexical features (on constituent parts only): predicted tags $t_{s-1}\,[t_s \dots t_e]\,t_{e+1}$ and words $x_{s-1}\,[x_s \dots x_e]\,x_{e+1}$
  Auxiliary features: a flat classifier using the same features, and the prediction of K&M 03 on each span
Results for sentences ≤ 40 words

Model         LP     LR     F1
Generative    86.37  85.27  85.82
Lexical+Aux*  87.56  86.85  87.20
Collins 99*   85.33  85.94  85.73

*Trained only on sentences ≤ 20 words
[Taskar et al. 04]
Example
The Egyptian president said he would visit Libya today to resume the talks.
Generative model: ‘Libya today’ is a base NP
Lexical model: ‘today’ is a one-word constituent