structured prediction cascades ben taskar david weiss, ben sapp, alex toshev university of...
Post on 15-Jan-2016
220 Views
Preview:
TRANSCRIPT
Structured Prediction CascadesBen Taskar
David Weiss, Ben Sapp, Alex Toshev
University of Pennsylvania
Supervised Learning
Learn from
• Regression:
• Binary Classification:
• Multiclass Classification:
• Structured Prediction:
Handwriting Recognition
`structured’
x y
Machine Translation
x y
‘Ce n'est pas un autreproblème de classification.’
‘This is not another classification problem.’
Pose Estimation
x y
© Arthur Gretton
Structured Models
space of feasible outputs
scoring function
parts = cliques, productions
Complexity of inference depends on “part” structure
Supervised Structured Prediction
Learning Prediction
Discriminative estimation of θ
Data
Model:
Intractable/impracticalfor complex models
Intractable/impractical for complex models
3rd order OCR model = 1 mil states * lengthBerkeley English grammar = 4 mil productions * length^3Tree-based pose model = 100 mil states * # joints
Approximation vs. Computation
• Usual trade-off: approximation vs. estimation – Complex models need more data– Error from over-fitting
• New trade-off: approximation vs. computation– Complex models need more time/memory – Error from inexact inference
• This talk: enable complex models via cascades
An Inspiration: Viola Jones Face Detector
Scanning window at every location, scale and orientation
Classifier Cascade
• Most patches are non-face• Filter out easy cases fast!!• Simple features first• Low precision, high recall• Learned layer-by-layer • Next layer more complex
C1
C2
C3
Cn
Non-face
Non-face
Non-face
Non-face Face
Related Work
• Global Thresholding and Multiple-Pass Parsing. J Goodman, 97• A maximum-entropy-inspired parser. E. Charniak, 00.• Coarse-to-fine n-best parsing and MaxEnt discriminative
reranking, E Charniak & M Johnson, 05• TAG, dynamic pro- gramming, and the perceptron for efficient,
feature-rich parsing, X. Carreras, M. Collins, and T. Koo, 08• Coarse-to-Fine Natural Language Processing, S. Petrov, 09• Coarse-to-fine face detection. Fleuret, F., Geman, D, 01• Robust real-time object detection. Viola, P., Jones, M, 02• Progressive search space reduction for human pose estimation,
Ferrari, V., Marin-Jimenez, M., Zisserman, A, 08
What’s a Structured Cascade?
• What to filter?– Clique assignments
• What are the layers?– Higher order models– Higher resol. models
• How to learn filters?– Novel convex loss– Simple online algorithm– Generalization bounds
F1
F2
F3
Fn
??????
??????
??????
?????? ‘structured’
Trade-off in Learning Cascades
• Accuracy: Minimize the number of errors incurred by each level
• Efficiency: Maximize the number of filtered assignments at each level
Filter 1
Filter 2
Filter D
Predict
Max-marginals (Sequences)
a
b
c
d
a
b
c
d
a
b
c
d
a
b
c
d
Score of an output:
Computemax (*bc*)
Max marginal:
Filtering with Max-marginals
• Set threshold• Filter clique assignment if
a
b
c
d
a
b
c
d
a
b
c
d
a
b
c
d
Filtering with Max-marginals
• Set threshold• Filter clique assignment if
a
b
c
d
a
b
c
d
a
b
c
d
a
b
c
dRemove edge
bc
Why Max-marginals?
• Valid path guarantee– Filtering leaves at least one valid global assignment
• Faster inference – No exponentiation, O(k) vs. O(k log k) in some cases
• Convex estimation– Simple stochastic subgradient algorithm
• Generalization bounds for error/efficiency– Guarantees on expected trade-off
Choosing a Threshold
• Threshold must be specific to input x– Max-marginal scores are not normalized
• Keep top-K max-marginal assignments?– Hard(er) to handle and analyze
• Convex alternative: max-mean-max function:
max score mean max marginalm = #edges
Choosing a Threshold (cont)
Max marginal scores
Mean max marginalMax score
Score of truth
Range of possible thresholdsα = efficiency level
Example OCR Cascade
a b c d e f g h i j k l …144 269 -137 80 52 -42 49 360 -27 -774 368 -24
Mean ≈ 0 Max = 368 α = 0.55 Threshold ≈ 204
a c d e f g i j l144 -137 80 52 -42 49 -27 -774 -24
b h k …b h k
Example OCR Cascade
b h k
a e n u r
a b d g
a c e g n o u
a e m n u w
b h k
b h k
Example OCR Cascade
a e n u r
a b d g
a c e g n o u
▪b ▪h ▪k
ba ha ka be he …
aa ab ad ag ea …
aa ac ae ag an …
a e m n u waa ae am an au …
Example OCR Cascade
▪b ▪h ▪k
ba ha ka be he …
aa ab ad ag ea …
aa ac ae ag an …
aa ae am an au …
1440 1558 1575
1440 1558 1575 1413 1553
1400 1480 1575 1397 1393
1257 1285 1302 1294 1356
1336 1306 1390 1346 1306
Mean ≈ 1412
Max = 1575
α = 0.44
Threshold ≈ 1502
▪b
ba be
aa ab ag ea
aa ac ae ag an
aa ae am an au
▪b
ba be
aa ab ag ea
aa ac ae ag an
aa ae am an au
Example OCR Cascade
▪h ▪k
ha ka he …
ad …
…
…
1440 1558 1575
1440 1558 1575 1413 1553
1400 1480 1575 1397 1393
1257 1285 1302 1294 1356
1336 1306 1390 1346 1306
Mean ≈ 1412
Max = 15751440
1440 1413
1400 1480 1397 1393
1257 1285 1302 1294 1356
1336 1306 1390 1346 1306
α = 0.44
Threshold ≈ 1502
Example OCR Cascade
ka
▪h ▪k
ha he …
ad …
…
…
Mean ≈ 1412
Max = 1575▪h ▪k
ha ka he ke
ad ed
kn
do
ow om
nd
α = 0.44
Threshold ≈ 1502
Example OCR Cascade
▪h ▪k
ha ka he ke
ad ed
kn
do
ow om
nd
…
…
▪▪h ▪▪k
▪ha ▪ka ▪he ▪ke
had kad hed ked
ado edo ndo
dow dom
10x10x12 10x10x24 40x40x24 80x80x24 110×122×24
Example: Pictorial Structure Cascade
Upperarm
Lowerarm
H
T ULA
LLA
URA
LRA
k ≈ 320,000k2 ≈ 100mil
• Filter loss
If score(truth) > threshold, all true states are safe• Efficiency loss
Proportion of unfiltered clique assignments
Quantifying Loss
Learning One Cascade Level
• Fix α, solve convex problem for θ
Minimize filter mistakes at efficiency level α
No filter mistakes Margin w/ slack
An Online Algorithm
• Stochastic sub-gradient update:
Features of truth
Convex combination: Features of best guess+ Average features of max marginal “witnesses”
Generalization Bounds
• W.h.p. (1-δ), filtering and efficiency loss observed on training set generalize to new data
Expected loss on true distribution
)-ramp(
Generalization Bounds
• W.h.p. (1-δ), filtering and efficiency loss observed on training set generalize to new data
-Empirical -ramp upper bound
0
1
Generalization Bounds
• W.h.p. (1-δ), filtering and efficiency loss observed on training set generalize to new data
n number of examples m number of clique assignments number of cliquesB
Generalization Bounds
• W.h.p. (1-δ), filtering and efficiency loss observed on training set generalize to new data
Similar bound holds for Le and all ® 2 [0,1]
OCR Experiments
Dataset: http://www.cis.upenn.edu/~taskar/ocr
Character Error Word Error0
10
20
30
40
50
60
70
80
22.5
73.35
14.31
50.56
12.05
26.17
7.7515.54
1st order2nd order3rd order4rd order
% E
rror
Efficiency Experiments
• POS Tagging (WSJ + CONLL datasets)– English, Bulgarian, Portuguese
• Compare 2nd order model filters– Structured perceptron (max-sum)– SCP w/ ® 2 [0, 0.2,0.4,0.6,0.8] (max-sum)– CRF log-likelihood (sum-product marginals)
• Tightly controlled– Use random 40% of training set– Use development set to fit regularization parameter ¸
and α for all methods
POS Tagging Results
Error Cap (%) on Development Set
Max-sum (SCP)Max-sum (0-1)Sum-product (CRF)
English
POS Tagging Results
Error Cap (%) on Development Set
Max-sum (SCP)Max-sum (0-1)Sum-product (CRF)
Portuguese
POS Tagging Results
Error Cap (%) on Development Set
Max-sum (SCP)Max-sum (0-1)Sum-product (CRF)
Bulgarian
English POS Cascade (2nd order)
Full SCP CRF Taglist
Accuracy (%) 96.83 96.82 96.84 ---
Filter Loss (%) 0 0.12 0.024 0.118
Test Time (ms) 173.28 1.56 4.16 10.6
Avg. Num States 1935.7 3.93 11.845 95.39
The cascade is efficient.
DT NN VRB ADJ
method torso head upper arms
lower arms total
Ferrari et al 08 -- -- -- -- 61.4%
Ferrari et al. 09 -- -- -- -- 74.5%
Andriluka et al. 09 98.3% 95.7% 86.8% 51.7% 78.8%
Eichner et al. 09 98.72% 97.87% 92.8% 59.79% 80.1%
CPS (ours) 100.00% 99.57% 96.81% 62.34% 86.31%
Pose Estimation (Buffy Dataset)
“Ferrari score:” percent of parts correct within radius
Pose Estimation
Cascade Efficiency vs. Accuracy
Features: Shape/Segmentation Match
Features: Contour Continuation
Conclusions
• Novel loss significantly improves efficiency
• Principled learning of accurate cascades
• Deep cascades focus structured inference and allow rich models
• Open questions: – How to learn cascade structure– Dealing with intractable models– Other applications: factored dynamic models, grammars
Thanks!
Structured Prediction Cascades, Weiss & Taskar, AISTATS10
Cascaded Models for Articulated Pose Estimation, Sapp, Toshev & Taskar, ECCV10
top related