DeeperCut: A Deeper, Stronger, and Faster Multi-Person Pose Estimation Model
Eldar Insafutdinov1, Leonid Pishchulin1, Bjoern Andres1, Mykhaylo Andriluka1,2, and Bernt Schiele1
1Max Planck Institute for Informatics, Saarbrücken, Germany 2Stanford University, Stanford, USA
Goal
• Multi-person pose estimation in monocular images
State of the Art
• DeepCut [5]: joint body part labeling and grouping
+ joint reasoning at finest level of details
– weak pairwise terms based on geometry only
– infeasible run-time: inference takes hours per image
Contributions
• A deeper, stronger and faster multi-person model
+ “deeper”: strong part detectors based on ResNet [3]
+ “stronger”: novel image-conditioned pairwise terms
+ “faster”: dramatic speed-ups due to strong pairwise terms and incremental optimization
+ NEW: heuristic solver for real-time inference
Unary Terms
• deeper architectures based on Residual Networks [3]
• dilated convolutions and de-convolution reduce the output stride to 8 px
• intermediate supervision: auxiliary losses at intermediate layers
• joint training of classification and regression tasks
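As a rough illustration of the stride arithmetic behind the unary network: a vanilla ResNet halves resolution five times (output stride 32); replacing the strides of the last stages with dilated convolutions keeps the stride at 8 px. The stage strides below are assumptions about a ResNet-style backbone, not the exact configuration from the paper.

```python
# Back-of-envelope sketch: output stride of a CNN is the product of the
# per-stage strides. Stage strides below are illustrative assumptions.

def output_stride(stage_strides):
    s = 1
    for st in stage_strides:
        s *= st
    return s

vanilla = [2, 2, 2, 2, 2]   # conv1, pool, and three striding stages
dilated = [2, 2, 2, 1, 1]   # last two stages: stride 1, dilation instead

print(output_stride(vanilla), output_stride(dilated))  # 32 8
```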
DeeperCut Overview
• Joint part labeling and grouping via 0/1 variables
[Figure: pipeline — (I) detection candidates → (II) dense graph → (III) body part labeling → (IV) person clusters]

$$\min_{(x,y)\in X_{DC}}\;\sum_{d\in D}\sum_{c\in C}\alpha_{dc}\,x_{dc}\;+\;\sum_{dd'\in\binom{D}{2}}\sum_{c,c'\in C}\beta_{dd'cc'}\,x_{dc}\,x_{d'c'}\,y_{dd'}$$

where $x_{dc}\in\{0,1\}$ labels detection $d$ with part class $c$ (part labeling), $y_{dd'}\in\{0,1\}$ groups detections into person clusters (subset partitioning), $\alpha_{dc}$ and $\beta_{dd'cc'}$ are unary and pairwise costs, and $X_{DC}$ is the set of 0/1 labelings satisfying the linear constraints of the ILP below.
I. Unary terms
• Body part detection candidates
• Capture distribution of scores over all part classes
II. Pairwise terms
• Capture part relationships within/across people
– proximity: same body part class (c = c′)
– kinematic relations: different part classes (c ≠ c′)
III. Integer Linear Program (ILP)
• Substitute $z_{dd'cc'} = x_{dc}\,x_{d'c'}\,y_{dd'}$ to linearize the objective
• NP-hard problem solved via branch-and-cut (to within a 1% optimality gap)
• Linear constraints on 0/1 labelings: plausible poses
– uniqueness: $\forall d\in D:\ \sum_{c\in C} x_{dc}\le 1$
– consistency: $\forall dd'\in\binom{D}{2}:\ y_{dd'}\le\sum_{c\in C}x_{dc}$ and $y_{dd'}\le\sum_{c\in C}x_{d'c}$
– transitivity: $\forall dd'd''\in\binom{D}{3}:\ y_{dd'}+y_{d'd''}-1\le y_{dd''}$
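The ILP above can be sketched with a brute-force toy solver: enumerate all 0/1 labelings, keep those satisfying the uniqueness, consistency, and transitivity constraints, and pick the cheapest. All costs and candidate counts are illustrative assumptions; the paper solves far larger instances with branch-and-cut.

```python
# Toy brute-force sketch of the DeeperCut ILP (illustrative costs only).
import itertools

D = range(3)                      # detection candidates (toy size)
C = range(2)                      # part classes (toy size)
pairs = list(itertools.combinations(D, 2))

# hypothetical unary costs alpha[d][c] (negative = confident detection)
alpha = [[-2.0, 1.0], [1.0, -1.5], [0.5, 0.5]]
# hypothetical pairwise costs: reward linking different part classes
beta = {(d, dp, c, cp): (-0.3 if c != cp else 0.2)
        for d, dp in pairs for c in C for cp in C}

def feasible(x, y):
    # uniqueness: each detection gets at most one part label
    if any(sum(x[d][c] for c in C) > 1 for d in D):
        return False
    # consistency: a link requires both endpoints to be labeled
    for d, dp in pairs:
        if y[(d, dp)] > min(sum(x[d][c] for c in C),
                            sum(x[dp][c] for c in C)):
            return False
    # transitivity of the person-clustering variables (all three orderings)
    for a, b, e in itertools.combinations(D, 3):
        if (y[(a, b)] + y[(b, e)] - 1 > y[(a, e)] or
                y[(a, b)] + y[(a, e)] - 1 > y[(b, e)] or
                y[(a, e)] + y[(b, e)] - 1 > y[(a, b)]):
            return False
    return True

def cost(x, y):
    unary = sum(alpha[d][c] * x[d][c] for d in D for c in C)
    pairw = sum(beta[(d, dp, c, cp)] * x[d][c] * x[dp][cp] * y[(d, dp)]
                for d, dp in pairs for c in C for cp in C)
    return unary + pairw

best = None
for xs in itertools.product((0, 1), repeat=len(D) * len(C)):
    x = [list(xs[d * len(C):(d + 1) * len(C)]) for d in D]
    for ys in itertools.product((0, 1), repeat=len(pairs)):
        y = dict(zip(pairs, ys))
        if feasible(x, y) and (best is None or cost(x, y) < best[0]):
            best = (cost(x, y), x, y)
print(best[0])
```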
Pairwise Terms
• image conditioned pairwise using CNN regression
– train CNN to regress body part locations
– use regressed offsets and angles as features to train a logistic regression that outputs the pairwise probability
[Figure: location regression from the left shoulder, from the right knee, and from all parts; pairwise vs. unary-only predictions]
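The second step above can be sketched as fitting a logistic regression on pairwise geometric features. The features (offset distance, angle difference) and the synthetic training pairs below are toy assumptions, not the paper's exact feature set.

```python
# Hedged sketch: logistic regression mapping CNN-regressed offset features
# for a detection pair to the probability that the pair should be linked.
import numpy as np

rng = np.random.default_rng(0)

# toy pairs: [offset distance, |angle difference|] -> linked (1) or not (0)
n = 200
dist = np.concatenate([rng.normal(1.0, 0.3, n), rng.normal(4.0, 0.8, n)])
ang = np.concatenate([rng.normal(0.2, 0.1, n), rng.normal(1.5, 0.4, n)])
X = np.column_stack([dist, ang, np.ones(2 * n)])   # bias column
y = np.concatenate([np.ones(n), np.zeros(n)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# batch gradient descent on the logistic loss
w = np.zeros(3)
for _ in range(2000):
    p = sigmoid(X @ w)
    w -= 0.1 * X.T @ (p - y) / len(y)

# probability that a nearby, well-aligned pair belongs to the same person
p_link = sigmoid(np.array([1.0, 0.2, 1.0]) @ w)
print(round(float(p_link), 3))
```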
Multi-stage optimization
• speed-up inference via incremental optimization
1. solve for head and shoulder locations
2. add elbows/wrists to stage 1 solution, re-optimize
3. add rest of body parts to stage 2 solution, re-optimize
– Stage 1: head, shoulder
– Stage 2: elbow, wrist
– Stage 3: hip, knee, ankle
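The incremental scheme above can be sketched as follows; `solve_ilp` is a hypothetical stub standing in for the branch-and-cut solver, which in the real system is warm-started from the previous stage's solution.

```python
# Sketch of three-stage incremental optimization: grow the active part set
# stage by stage and re-optimize, reusing the previous solution.

STAGES = [
    ["head", "shoulder"],
    ["elbow", "wrist"],
    ["hip", "knee", "ankle"],
]

def solve_ilp(parts, warm_start=None):
    # hypothetical solver stub: the real system runs branch-and-cut over
    # detections of `parts`, initialized from `warm_start`
    return {"parts": list(parts), "warm": warm_start is not None}

solution = None
active_parts = []
for stage_parts in STAGES:
    active_parts += stage_parts        # stage k adds its parts to the set
    solution = solve_ilp(active_parts, warm_start=solution)

print(solution["parts"])
```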
Quantitative Multi-Person Results
• MPII Multi-Person [1]
– Mean Average Precision (mAP) metric
Setting Head Sho Elb Wri Hip Knee Ank mAP s/frame
subset of 288 images
DeepCut [5] 73.4 71.8 57.9 39.9 56.7 44.0 32.0 54.1 57995
DeeperCut
+image cond. pw. 83.1 75.8 64.6 54.0 60.6 52.0 44.9 62.6 2336
+deeper archit. 83.3 79.4 66.1 57.9 63.5 60.5 49.9 66.2 1333
+multi-st. opt. 87.5 82.8 70.2 61.6 66.0 60.6 56.5 69.7 230
Iqbal&Gall, ECCVw’16 70.0 65.2 56.4 46.1 52.7 47.9 44.5 54.7 10
full set
DeeperCut 79.1 72.2 59.7 50.0 56.0 51.0 44.6 59.4 485
+heuristic solver 79.6 74.0 62.8 52.5 60.0 53.3 44.6 61.4 0.15
FR-CNN [6] + unary 64.9 62.9 53.4 44.1 50.7 43.1 35.2 51.0 1
Iqbal&Gall, ECCVw’16 58.4 53.9 44.5 35.0 42.2 36.7 31.1 43.1 10
• We are Family (WAF) [2]
– Percentage of Correct Parts (PCP) metric
Setting Head U Arms L Arms Torso mPCP AOP s/frame
DeepCut [5] 99.3 81.5 79.5 87.1 84.7 86.5 22000
DeeperCut 99.3 83.8 81.9 87.1 86.3 88.1 13
Ghiasi et al., CVPR’14 - - - - 63.6 74.0 -
Eichner&Ferrari, ECCV’10 97.6 68.2 48.1 86.1 69.4 80.0 -
Chen&Yuille, CVPR’15 98.5 77.2 71.3 88.5 80.7 84.9 -
Qualitative Multi-Person Results
• Successful cases
• Failure cases
– limbs swapped across symmetry, hard poses, confusion between people
Single Person Results
• Percentage of Correct Keypoints (PCK) metric
• MPII Single Person dataset [1]
Setting Head Sho Elb Wri Hip Knee Ank PCKh AUC
DeepCut [5] (unary) 94.1 90.2 83.4 77.3 82.6 75.7 68.6 82.4 56.5
DeeperCut (unary) 96.6 94.6 88.5 84.4 87.6 83.9 79.4 88.3 60.7
Newell et al., ECCV’16 98.2 96.3 91.2 87.1 90.1 87.4 83.6 90.9 62.9
• Leeds Sports Poses (LSP) [4]
Setting Head Sho Elb Wri Hip Knee Ank PCK AUC
DeepCut [5] (unary) 97.0 91.0 83.8 78.1 91.0 86.7 82.0 87.1 63.5
DeeperCut (unary) 97.4 92.7 87.5 84.4 91.5 89.9 87.2 90.1 66.1
Bulat&Tzimir., ECCV’16 97.2 92.1 88.1 85.2 92.2 91.4 88.7 90.7 63.4
• More comparisons at human-pose.mpi-inf.mpg.de
References
[1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR'14.
[2] M. Eichner and V. Ferrari. We are family: Joint pose estimation of multiple persons. In
ECCV’10.
[3] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv’15.
[4] S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human
pose estimation. In BMVC’10.
[5] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele.
Deepcut: Joint subset partition and labeling for multi person pose estimation. In CVPR’16.
[6] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with
region proposal networks. In NIPS’15.