pose machine
05/01/2023 1
Pose Machines: Estimating Articulated Pose from Images
Robotics Institute, Carnegie Mellon University
Convolutional Pose Machines. Shih-En Wei, Varun Ramakrishna, Takeo Kanade, and Yaser Sheikh. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
Pose Machines: Articulated Pose Estimation via Inference Machines. Varun Ramakrishna, Daniel Munoz, Martial Hebert, J.A. Bagnell, Yaser Sheikh. In ECCV 2014 (Oral presentation).
Goal: Articulated Pose Estimation
https://www.youtube.com/watch?v=Oi_ycvFHd64&index=6&list=PLNh5A7HtLRcpsMfvyG0DED-Dr4zW5Lpcg
https://www.youtube.com/watch?v=MsZkLK0Wcmk&list=PLNh5A7HtLRcpsMfvyG0DED-Dr4zW5Lpcg&index=1
Which part corresponds to a body part?
• Local evidence is weak
• Part context is a strong cue
• Top-down cues are helpful
Using Local Image Evidence: Multi-Class Classification of Patches
[Figure: a predictor g1 maps the image features x_z of the patch at image location z in the input image to per-part confidences]
Requires a high-capacity supervised predictor capable of handling multi-modal data
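The patch classification step above can be sketched as follows. This is a minimal illustration, not the authors' predictor: it assumes raw patch pixels as image features and a linear softmax classifier standing in for g1, and the names `extract_patch`, `stage1_confidence_maps`, `weights` and `biases` are hypothetical.

```python
import math

def extract_patch(image, r, c, half):
    """Flatten the (2*half+1)^2 patch centred at (r, c); pixels outside the image count as 0."""
    h, w = len(image), len(image[0])
    return [
        image[i][j] if 0 <= i < h and 0 <= j < w else 0.0
        for i in range(r - half, r + half + 1)
        for j in range(c - half, c + half + 1)
    ]

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def stage1_confidence_maps(image, weights, biases, half=1):
    """Apply a linear multi-class classifier (an assumed stand-in for g1) at every location z.
    Returns maps[p][r][c]: confidence that class p (a part or background) lies at (r, c)."""
    h, w = len(image), len(image[0])
    n_classes = len(weights)
    maps = [[[0.0] * w for _ in range(h)] for _ in range(n_classes)]
    for r in range(h):
        for c in range(w):
            feats = extract_patch(image, r, c, half)
            scores = [
                sum(wt * x for wt, x in zip(class_weights, feats)) + b
                for class_weights, b in zip(weights, biases)
            ]
            for p, conf in enumerate(softmax(scores)):
                maps[p][r][c] = conf
    return maps
```

A real system would replace the linear classifier with the high-capacity predictor the slide calls for; the loop structure, one multi-class prediction per image location, is the part being illustrated.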
Using Local Image Evidence: A Classical Sliding Window Detection Pipeline
Image → Feature Extraction → Classification
Local Image Evidence is Weak
• Certain parts are easier to detect than others
[Figure: detection examples for head, neck, l.shoulder, l.elbow, l.wrist]
Part Context is a Strong Cue
Part detection confidences provide spatial context cues
[Figure: image with confidence maps for Neck, L-Shoulder and L-Elbow]
Tree Structures vs Loopy Graphs
Tree Structures
• Fast and exact inference
• Double counting
Loopy Graphs
• Rich context
• Approximate inference
Designing Context Representations
Context features encode responses of a previous prediction stage
[Figure: patch features and offset features computed on the image's confidence maps]
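One way to read the patch and offset features named on this slide is sketched below. The exact definitions are assumptions: here a patch feature is the window of confidence values around a location, and an offset feature is the displacement from that location to the strongest peaks of a confidence map (without non-maximum suppression). All function names are hypothetical.

```python
def patch_features(conf_map, r, c, half=1):
    """Confidence values in the window centred at (r, c); 0 outside the map."""
    h, w = len(conf_map), len(conf_map[0])
    return [
        conf_map[i][j] if 0 <= i < h and 0 <= j < w else 0.0
        for i in range(r - half, r + half + 1)
        for j in range(c - half, c + half + 1)
    ]

def top_peaks(conf_map, k):
    """Locations of the k highest confidences (simplification: no non-maximum suppression)."""
    cells = [(v, i, j) for i, row in enumerate(conf_map) for j, v in enumerate(row)]
    cells.sort(key=lambda t: -t[0])  # stable sort: ties keep row-major order
    return [(i, j) for _, i, j in cells[:k]]

def offset_features(conf_map, r, c, k=1):
    """Row/column offsets from (r, c) to each of the k strongest peaks."""
    feats = []
    for pr, pc in top_peaks(conf_map, k):
        feats.extend([pr - r, pc - c])
    return feats

def context_features(conf_maps, r, c, half=1, k=1):
    """Concatenate patch and offset features over the confidence maps of all parts."""
    feats = []
    for m in conf_maps:
        feats.extend(patch_features(m, r, c, half))
        feats.extend(offset_features(m, r, c, k))
    return feats
```

The resulting vector would be concatenated with the image features at the same location and passed to the next stage's predictor.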
[Figure: a three-stage pose machine. Predictor g1 produces Stage I confidence maps from image features; g2 and g3 produce Stage II and Stage III confidence maps from image features and context features. Confidence maps are shown at each stage for Head, Neck, L-Shoulder, L-Elbow and L-Wrist]
Top-Down Cues are Helpful
Larger composite parts can be easier to detect
Level 1: parts; Level 2: poselets; Level 3: full body
[Bourdev et al., CVPR 2009] [Sun et al., CVPR 2012] [Duan et al., BMVC 2012] [Singh et al., ECCV 2012] [Pishchulin et al., CVPR 2013] etc.
Incorporating Hierarchical Cues
• Each level of the hierarchy uses a separate predictor
• Context features are computed on the outputs of the previous stage
• Spatial context information is passed across layers via context features
[Figure: a grid of predictors g, one per level (1, 2, ..., L) and per stage (t = 1, 2, ..., T, with T = 3); each predictor receives image features, and predictors after stage 1 also receive context features]
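The three points above can be sketched as an inference loop. This is an illustrative skeleton under stated assumptions: `predictors[t][l]` is a hypothetical callable for stage t and level l, and `context_fn` is a hypothetical function that turns the outputs of all levels of one stage into the context passed to the next stage.

```python
def run_hierarchical_pose_machine(image_features, predictors, context_fn):
    """Run every level's predictor at every stage.

    predictors[t][l](image_features, context) -> confidence maps for level l
    context_fn(stage_outputs) -> context for the next stage
    Returns outputs[t][l] for all stages and levels."""
    outputs = []
    context = None  # the first stage sees image features only
    for stage in predictors:
        # a separate predictor per level, all fed the same context
        stage_out = [g(image_features, context) for g in stage]
        outputs.append(stage_out)
        # context is computed on the outputs of this stage, across all levels
        context = context_fn(stage_out)
    return outputs
```

Because the context is built from all levels of the previous stage, information from a coarse level (for example a full-body detector) can reach the fine-level part predictors at the next stage.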
[Figure: Level 1 confidence maps at Stages I, II and III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank. and Bkgd.]
[Figure: Level 2 confidence maps at Stages I, II and III for Head+Sho, L.Arm, R.Arm, Torso, L.Leg, R.Leg and Bkgd.]
[Figure: Level 3 confidence maps at Stages I, II and III for Torso and Bkgd.]
Fully Connected Model
[Figure: predictors for levels 1 to L at stages t = 1 to T (T = 3), each fed image features and context features]
Pose Machines: Sequential Prediction with Spatial Context
Training reduces to training multiple supervised classifiers
No structured loss function
No specialized solvers
No handcrafted spatial model
Spatial model is learned implicitly by the classifiers in a data-driven fashion
[Figure: predictors g1, g2, g3 produce Stage I, II and III confidence maps from image features and context features]
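The claim that training reduces to training multiple supervised classifiers can be sketched as a stagewise loop. This is a deliberately simplified illustration: `fit` and `context_fn` are hypothetical placeholders, context is computed per sample rather than from a spatial neighbourhood, and safeguards against overfitting to the previous stage's training predictions are omitted.

```python
def train_pose_machine(samples, labels, n_stages, fit, context_fn):
    """Train one supervised classifier per stage.

    samples:    list of image-feature vectors, one per training location
    labels:     list of part labels, one per sample
    fit(X, y):  hypothetical learner; returns predictor(x) -> list of confidences
    context_fn: hypothetical map from confidences to a list of context features"""
    predictors = []
    inputs = [list(x) for x in samples]  # stage 1: image features only
    for _ in range(n_stages):
        g = fit(inputs, labels)          # an ordinary supervised learning problem
        predictors.append(g)
        beliefs = [g(x) for x in inputs]
        # next stage: image features concatenated with context from this stage
        inputs = [list(x) + context_fn(b) for x, b in zip(samples, beliefs)]
    return predictors
```

Nothing in the loop encodes a spatial model or a structured loss; whatever spatial regularities exist must be picked up by the later-stage classifiers from the context features.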
Learning Feature Representations
• Convolutional Architectures for Feature Embedding
Learning Context Representations
• Large Receptive Fields as a Design Criterion
• Large Receptive Fields Improve Pose Estimation
Convolutional Pose Machines
• Designing a Convolutional Architecture
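Since the receptive field is the design criterion here, it helps to see how it is computed for a stack of convolution and pooling layers. The sketch below uses the standard recurrence for kernel size and stride; the example stack is a hypothetical configuration assembled for illustration (pooling is assumed to have kernel 2 and stride 2), not a layer list taken from the slides.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs from input to output.
    Returns the side length, in input pixels, of the region seen by one output unit."""
    rf, jump = 1, 1  # jump: distance in input pixels between adjacent units of the current layer
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Hypothetical stack: three 9x9 convolutions each followed by 2x pooling,
# then 5x5, 9x9, 1x1 and 1x1 convolutions.
CONV9, POOL2 = (9, 1), (2, 2)
example = [CONV9, POOL2, CONV9, POOL2, CONV9, POOL2, (5, 1), (9, 1), (1, 1), (1, 1)]
print(receptive_field(example))  # 160
```

Pooling layers matter more than they appear to: each one doubles the `jump`, so every later kernel covers twice as many input pixels as it would have before.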
Learning
• Joint Training with Intermediate Supervision
Loss: Euclidean distance between the ground truth and the prediction at stage $t$
$f_t = \lVert \text{groundtruth} - \text{prediction} \rVert_2^2$
A network without intermediate supervision suffers from vanishing gradients
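A minimal sketch of the loss under intermediate supervision follows. It assumes that confidence maps are nested lists of equal shape and that the overall training objective is the sum of the per-stage losses; the function names are hypothetical.

```python
def stage_loss(prediction, ground_truth):
    """Squared Euclidean distance between two confidence maps of the same shape."""
    return sum(
        (p - g) ** 2
        for pred_row, gt_row in zip(prediction, ground_truth)
        for p, g in zip(pred_row, gt_row)
    )

def total_loss(stage_predictions, ground_truth):
    """Assumed overall objective: every stage is supervised with the same target,
    and the per-stage losses are summed."""
    return sum(stage_loss(p, ground_truth) for p in stage_predictions)
```

Because each stage contributes its own loss term, gradients enter the network at every stage's output rather than only at the final layer.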
Learning: Intermediate Supervision Addresses Vanishing Gradients
[Figure: histograms of gradient magnitude (× 10⁻³) during training at epochs 1, 2 and 3, with and without intermediate supervision, shown from the input (layer 1) through stages 1 to 3 to the output (layer 18), with supervision marked at each of the three stages]
[Figure: convolutional architecture for stages 1 to 3 at level 1. An h × w × 3 input image passes through convolution (C) layers with kernels from 1×1 to 15×15 and 2× pooling (P) layers; each stage produces h′ × w′ confidence maps and has its own loss]
Learning: Comparison of Learning Methods
[Figure: PCK total, LSP OC; detection rate (%) vs. normalized distance (0 to 0.2)]
(i) With Intermediate Supervision (IS)
(ii) Stagewise
(iii) IS + Stagewise Pretrain
(iv) Without Intermediate Supervision
Qualitative Results
Evaluation: Qualitative Examples on LEEDS (Person-centric)
Evaluation: Qualitative Examples on MPI (Person-centric)
Resolving Symmetric Confusions
[Figure: confidence maps for the left and right wrists at stages t = 1, t = 2 and t = 3]
Ablative Spatial Analysis
[Figure: predicted pose and Level 1 part confidences at Stages I, II and III for Head, Neck, L.Sho., L.Elb., L.Wri., R.Sho., R.Elb., R.Wri., L.Hip, L.Knee, L.Ank., R.Hip, R.Knee, R.Ank. and Bkgd.]
Context from the confidence map of the head is removed
Predicted confidences are resilient to missing context (of one part)
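The ablation can be sketched as zeroing one part's confidence map before the next stage computes its context features. Reading the predicted pose as the per-part arg-max of the final confidence maps is an assumption made here for illustration; both function names are hypothetical.

```python
def ablate_part(conf_maps, part):
    """Return a copy of a stage's confidence maps with one part's map set to zero,
    so the next stage receives no context from that part."""
    return {
        name: [[0.0] * len(row) for row in m] if name == part else [row[:] for row in m]
        for name, m in conf_maps.items()
    }

def predicted_pose(conf_maps):
    """Assumed read-out: each part is placed at the arg-max of its confidence map."""
    pose = {}
    for name, m in conf_maps.items():
        cells = ((v, r, c) for r, row in enumerate(m) for c, v in enumerate(row))
        _, best_r, best_c = max(cells, key=lambda t: t[0])
        pose[name] = (best_r, best_c)
    return pose
```

Comparing `predicted_pose` on the next stage's output with and without `ablate_part` applied to its input is one way to measure how much the prediction depends on a single part's context.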
Evaluation: PCK Performance Comparison on FLIC dataset
[Figure: PCK wrist, FLIC and PCK elbow, FLIC; detection rate (%) vs. normalized distance (0 to 0.2)]
Methods compared: Ours 3-Stage 2-Level; Tompson et al., CVPR'15; Tompson et al., NIPS'14; Chen & Yuille, NIPS'14; Toshev et al., CVPR'14; Sapp et al., CVPR'13
Evaluation: PCK Performance Comparison on LEEDS dataset (Person-centric)
[Figure: PCK total, PCK wrist & elbow, PCK knee, PCK ankle and PCK hip on LSP PC; detection rate (%) vs. normalized distance (0 to 0.2)]
Methods compared: Ours 3-Stage 2-Level; Tompson et al., NIPS'14; Pishchulin et al., ICCV'13; Chen & Yuille, NIPS'14; Wang et al., CVPR'13
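The curves in these evaluation slides plot detection rate against normalized distance. A minimal sketch of such a PCK-style computation is given below; the choice of normalisation length (for example torso size) differs between benchmarks and is left to the caller here, and the function name and argument layout are hypothetical.

```python
import math

def pck(predictions, ground_truth, norm_sizes, threshold):
    """Percentage of keypoints whose error is within threshold * norm_size.

    predictions, ground_truth: per-sample lists of (x, y) keypoints
    norm_sizes: per-sample normalisation length (an assumption; benchmark-specific)
    threshold: normalized distance, e.g. 0.05 to 0.2 as on the plots' x-axis"""
    hits = 0
    total = 0
    for pred, gt, size in zip(predictions, ground_truth, norm_sizes):
        for (px, py), (gx, gy) in zip(pred, gt):
            if math.hypot(px - gx, py - gy) <= threshold * size:
                hits += 1
            total += 1
    return 100.0 * hits / total
```

Sweeping `threshold` over a range of normalized distances and plotting the returned percentage gives one curve per method.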