![Page 1: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/1.jpg)
VNect: Real-time 3D Human Pose Estimation with a Single
RGB Camera By Dushyant Mehta, Srinath Sridhar, Oleksandr Sotnychenko, Helge Rhodin,
Mohammad Shafiei, Hans-Peter Seidel, Weipeng Xu, Dan Casas, Christian Theobalt
Max Planck Institute for Informatics, Saarland University, Universidad Rey Juan Carlos
Presented by Asbjoern Fintland Lystrup and Marcus Loo Vergara
![Page 2: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/2.jpg)
Goal Real-time markerless 3D pose estimation from single RGB camera
Temporal stability
Invariant to background and body shape
Invariant to input image size
![Page 3: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/3.jpg)
Excerpt starting at: https://youtu.be/W1ZNFfftx2E?t=0m5s
![Page 4: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/4.jpg)
Previous Works
![Page 5: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/5.jpg)
Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation
Deep learning methods represent the state-of-the-art
Issues ◦ 2D pose estimation is not sufficient for certain tasks e.g. virtual avatar control
◦ Typically assumes tight bounding boxes
Advantages ◦ Real-time
◦ High accuracy
![Page 6: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/6.jpg)
RGB-D 3D pose estimation using RGB with depth (e.g. Microsoft Kinect)
Issues ◦ Does not work well outdoors due to sunlight interference
◦ Bulkier, more expensive, not as widely available, higher power consumption, limited resolution, field-of-view and range
Advantages ◦ Tracking of deformable objects
◦ Template-free reconstruction
RGB Depth Segmented
![Page 7: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/7.jpg)
Multi-view 3D pose estimation using multiple cameras
Issues ◦ Needs elaborate setup
◦ Offline computation ◦ Typically not real-time
Advantages ◦ Attains high accuracy
◦ Can reach real-time with approximations
![Page 8: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/8.jpg)
Monocular 3D Pose Estimation Previous work in monocular 3D pose estimation uses deep learning
Issues ◦ Typically offline
◦ Often reconstructs 3D joint positions individually per image
◦ Temporally unstable when applied to sequences of images
◦ Does not enforce constant bone lengths
![Page 9: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/9.jpg)
Method
![Page 10: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/10.jpg)
Overview
![Page 11: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/11.jpg)
Bounding Box Tracker Goal: Efficiently create a tight bound around the person
Want to avoid slow scale-space search for bounding box (BB)
![Page 12: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/12.jpg)
Bounding Box Tracker First frames
◦ Scale-space search
Then ◦ Use previous keypoints to compute the smallest BB containing all keypoints
◦ Add 20% to the height and 40% to the width
◦ Shift BB horizontally to the centroid of the keypoints
◦ Corners of the BB are updated using a weighted average with the previous frame’s corners
◦ Finally, BB is resized to 368x368 px
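The tracking steps above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the function name `update_bounding_box`, the even split of the padding between both sides, and the default `momentum` of 0.75 for the weighted average are assumptions made for illustration. The final resize of the crop to 368x368 px is an image operation and is left out.

```python
def update_bounding_box(keypoints, prev_box=None, momentum=0.75):
    """Return a padded box (x_min, y_min, x_max, y_max) around 2D keypoints.

    keypoints: list of (x, y) tuples from the previous frame.
    prev_box:  previous frame's box, or None for the first tracked frame.
    momentum:  weight of the previous box in the average (assumed value).
    """
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    width, height = x_max - x_min, y_max - y_min

    # Add 40% to the width and 20% to the height (assumed: half on each side).
    new_width = width + 0.4 * width
    pad_y = 0.2 * height / 2

    # Shift the box horizontally so it is centred on the keypoint centroid.
    centroid_x = sum(xs) / len(xs)
    box = (centroid_x - new_width / 2, y_min - pad_y,
           centroid_x + new_width / 2, y_max + pad_y)

    # Smooth the corners with a weighted average of the previous frame's corners.
    if prev_box is not None:
        box = tuple(momentum * p + (1 - momentum) * b
                    for p, b in zip(prev_box, box))
    return box
```

A higher `momentum` gives a steadier crop but reacts more slowly when the person moves quickly.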
![Page 13: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/13.jpg)
Excerpt starting at: https://youtu.be/W1ZNFfftx2E?t=1m24s
![Page 14: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/14.jpg)
CNN Pose Regression Goal: Predict 2D and 3D joint positions
◦ 2D: Image space heatmap formulation
◦ 3D: Position relative to the root (pelvis)
![Page 15: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/15.jpg)
CNN Pose Regression Approach
◦ Generate heatmap 𝐻𝑗 for each joint 𝑗
◦ Generate location-maps 𝑋𝑗 , 𝑌𝑗 , 𝑍𝑗 for each joint 𝑗
◦ Captures root-relative locations 𝑥𝑗 , 𝑦𝑗 , 𝑧𝑗
◦ 𝑥𝑗 , 𝑦𝑗 , 𝑧𝑗 are read from 𝑋𝑗 , 𝑌𝑗 , 𝑍𝑗 at the respective position in the heatmap 𝐻𝑗
![Page 16: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/16.jpg)
CNN Pose Regression Modified ResNet50
◦ Remove the 3 last residual blocks
◦ Replace with the following architecture
![Page 17: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/17.jpg)
CNN Pose Regression Custom residual block
![Page 18: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/18.jpg)
CNN Pose Regression ∆𝑋𝑗 , ∆𝑌𝑗 , ∆𝑍𝑗: Parent-relative location-maps
𝐵𝐿: Bone-length maps ◦ $BL_j = \sqrt{\Delta X_j \odot \Delta X_j + \Delta Y_j \odot \Delta Y_j + \Delta Z_j \odot \Delta Z_j}$
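As a rough illustration, a bone-length map can be computed per pixel from the three parent-relative location-maps. The sketch below uses NumPy and takes the Euclidean length (square root of the summed element-wise squares); the function name `bone_length_map` and the array shapes are assumptions.

```python
import numpy as np

def bone_length_map(delta_x, delta_y, delta_z):
    """Per-pixel bone length from parent-relative location-maps of equal shape."""
    # Element-wise products play the role of the Hadamard product in the formula.
    return np.sqrt(delta_x * delta_x + delta_y * delta_y + delta_z * delta_z)
```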
![Page 19: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/19.jpg)
CNN Pose Regression Image features
![Page 20: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/20.jpg)
CNN Pose Regression Concatenate intermediate features
◦ Idea: Parent-relative positions and bone lengths help guide the network
![Page 21: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/21.jpg)
CNN Pose Regression Final heatmap 𝐻𝑗 and location-maps 𝑋𝑗 , 𝑌𝑗 , 𝑍𝑗
![Page 22: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/22.jpg)
CNN Pose Regression Now we can extract 2D keypoints and 3D joint positions
◦ Keypoint position 𝐾𝑗 simply given by max value in heatmap
◦ 𝑥𝑗 , 𝑦𝑗 , 𝑧𝑗 are read from 𝑋𝑗 , 𝑌𝑗 , 𝑍𝑗 at their respective 𝐾𝑗
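The read-out described above amounts to an argmax per heatmap followed by an index into the three location-maps. A minimal NumPy sketch follows, assuming all maps are stacked into arrays of shape (joints, height, width); the function name `read_joint_positions` and the (row, column) keypoint convention are choices made for this example.

```python
import numpy as np

def read_joint_positions(heatmaps, loc_x, loc_y, loc_z):
    """Extract 2D keypoints and root-relative 3D positions.

    All inputs have shape (num_joints, height, width).
    Returns keypoints (num_joints, 2) as (row, col) and positions (num_joints, 3).
    """
    num_joints, height, width = heatmaps.shape
    # Keypoint K_j: location of the maximum of heatmap H_j.
    flat_idx = heatmaps.reshape(num_joints, -1).argmax(axis=1)
    rows, cols = np.unravel_index(flat_idx, (height, width))
    # Read x_j, y_j, z_j from X_j, Y_j, Z_j at K_j.
    joints = np.arange(num_joints)
    positions = np.stack([loc_x[joints, rows, cols],
                          loc_y[joints, rows, cols],
                          loc_z[joints, rows, cols]], axis=1)
    keypoints = np.stack([rows, cols], axis=1)
    return keypoints, positions
```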
![Page 23: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/23.jpg)
Training Loss term
$\mathrm{Loss}(x_j) = \lVert H_j^{GT} \odot (X_j - X_j^{GT}) \rVert_2$
where ⨀ is the element-wise (Hadamard) product and 𝐺𝑇 denotes ground truth
Enforce that we are only interested in 𝑥𝑗 , 𝑦𝑗 , 𝑧𝑗 at the respective 𝐻𝑗 ’s 2D location ◦ That is, the loss is weighted more strongly around the joint’s 2D location
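To make the weighting concrete, here is a small NumPy sketch of the loss for one joint and one coordinate: the ground-truth heatmap masks the location-map error, so errors far from the joint contribute nothing. The function name `location_map_loss` is hypothetical, and a real training setup would use a deep learning framework's tensors rather than NumPy arrays.

```python
import numpy as np

def location_map_loss(heatmap_gt, loc_pred, loc_gt):
    """L2 norm of the location-map error, weighted by the ground-truth heatmap."""
    weighted_error = heatmap_gt * (loc_pred - loc_gt)  # element-wise product
    return np.linalg.norm(weighted_error)

# Example: an error located away from the joint is ignored by the weighting.
heatmap_gt = np.zeros((3, 3))
heatmap_gt[1, 1] = 1.0                # joint sits at the centre pixel
loc_gt = np.zeros((3, 3))
loc_pred = np.zeros((3, 3))
loc_pred[0, 0] = 10.0                 # large error, but far from the joint
far_loss = location_map_loss(heatmap_gt, loc_pred, loc_gt)
loc_pred[1, 1] = 2.0                  # error at the joint location
near_loss = location_map_loss(heatmap_gt, loc_pred, loc_gt)
```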
![Page 24: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/24.jpg)
Temporal Filtering 2D keypoints and 3D positions are filtered with the 1 Euro filter [Casiez et al. 2012] ◦ A temporal smoothing filter
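For reference, a scalar 1 Euro filter can be sketched as below, following the published description by Casiez et al.: an exponential low-pass filter whose cutoff frequency rises with the estimated signal speed. The parameter defaults here are placeholders rather than the values used in VNect, and in practice one filter instance would be kept per coordinate of each joint.

```python
import math

class OneEuroFilter:
    """Speed-adaptive low-pass filter for a scalar signal sampled at `rate` Hz."""

    def __init__(self, rate, min_cutoff=1.0, beta=0.0, d_cutoff=1.0):
        self.rate = rate              # sampling rate in Hz
        self.min_cutoff = min_cutoff  # cutoff at zero speed (less jitter when low)
        self.beta = beta              # speed coefficient (less lag when high)
        self.d_cutoff = d_cutoff      # cutoff used when smoothing the derivative
        self.x_prev = None
        self.dx_prev = 0.0

    def _alpha(self, cutoff):
        # Smoothing factor of an exponential filter with the given cutoff.
        tau = 1.0 / (2.0 * math.pi * cutoff)
        return 1.0 / (1.0 + tau * self.rate)

    def __call__(self, x):
        if self.x_prev is None:       # first sample passes through unchanged
            self.x_prev = x
            return x
        # Low-pass filter the derivative of the signal.
        dx = (x - self.x_prev) * self.rate
        a_d = self._alpha(self.d_cutoff)
        dx_hat = a_d * dx + (1.0 - a_d) * self.dx_prev
        # Fast motion raises the cutoff, which reduces lag.
        cutoff = self.min_cutoff + self.beta * abs(dx_hat)
        a = self._alpha(cutoff)
        x_hat = a * x + (1.0 - a) * self.x_prev
        self.x_prev, self.dx_prev = x_hat, dx_hat
        return x_hat
```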
![Page 25: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/25.jpg)
Kinematic Skeleton Fitting 1. Retarget skeleton to the underlying model
2. Fit final skeleton using the Levenberg-Marquardt algorithm
![Page 26: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/26.jpg)
Kinematic Skeleton Fitting Final skeleton $P_t^G = P_t^G(\theta, d)$, parameterized by 𝜃 and 𝑑
◦ 𝜃: Vector of joint angles
◦ 𝑑: Root joint’s location in camera space
Non-linear optimization problem
![Page 27: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/27.jpg)
Kinematic Skeleton Fitting Objective energy
$E_{total}(\theta, d) = E_{IK}(\theta, d) + E_{proj}(\theta, d) + E_{smooth}(\theta, d) + E_{depth}(\theta, d)$
Idea: Fit a skeleton which minimizes this energy ◦ That is, solve the minimization problem
$\arg\min_{\theta, d} E_{total}(\theta, d)$
![Page 28: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/28.jpg)
Kinematic Skeleton Fitting The inverse kinematics term
$E_{IK}(\theta, d) = \lVert (P_t^G - d) - P_t^L \rVert_2$
◦ 𝑑: Skeleton root
◦ $P_t^G$: Final 3D pose
◦ $P_t^L$: Predicted 3D pose
![Page 29: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/29.jpg)
Kinematic Skeleton Fitting The projection term
$E_{proj}(\theta, d) = \lVert \Pi(P_t^G) - K_t \rVert_2$
◦ $\Pi(\cdot)$: Projection function from 3D to the 2D image plane
◦ $P_t^G$: Final 3D pose
◦ $K_t$: Predicted 2D keypoints
![Page 30: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/30.jpg)
Kinematic Skeleton Fitting The smoothing term
$E_{smooth}(\theta, d) = \lVert \ddot{P}_t^G \rVert_2$
◦ $\ddot{P}_t^G$: Acceleration of $P_t^G$
![Page 31: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/31.jpg)
Kinematic Skeleton Fitting The depth term
$E_{depth}(\theta, d) = \lVert [\dot{P}_t^G]_z \rVert_2$
◦ $\dot{P}_t^G$: Velocity of $P_t^G$
◦ $[\dot{P}_t^G]_z$: Z-component of the 3D velocity
![Page 32: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/32.jpg)
Kinematic Skeleton Fitting Apply the Levenberg–Marquardt algorithm, also known as damped least squares (DLS), to obtain the final pose $P_t^G$ by solving
$\arg\min_{\theta, d} E_{total}(\theta, d)$
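To show what a damped least-squares fit looks like, the sketch below runs a basic Levenberg–Marquardt loop on a toy problem: recovering the two joint angles of a planar two-bone chain from target joint positions, which loosely mirrors the inverse kinematics term. Everything here is an illustrative assumption (the toy chain, the finite-difference Jacobian, the damping schedule, and all function names); it is not the skeleton model or solver used in VNect.

```python
import numpy as np

def forward_kinematics(angles, bone_lengths=(1.0, 1.0)):
    """Joint positions of a planar two-bone chain rooted at the origin."""
    a1, a2 = angles
    l1, l2 = bone_lengths
    p1 = np.array([l1 * np.cos(a1), l1 * np.sin(a1)])
    p2 = p1 + np.array([l2 * np.cos(a1 + a2), l2 * np.sin(a1 + a2)])
    return np.concatenate([p1, p2])

def numerical_jacobian(residual_fn, params, eps=1e-6):
    """Forward-difference Jacobian of the residual vector."""
    r0 = residual_fn(params)
    jac = np.zeros((r0.size, params.size))
    for i in range(params.size):
        step = np.zeros_like(params)
        step[i] = eps
        jac[:, i] = (residual_fn(params + step) - r0) / eps
    return jac

def levenberg_marquardt(residual_fn, params0, iterations=100, damping=1e-2):
    """Minimize the sum of squared residuals with adaptive damping."""
    params = np.asarray(params0, dtype=float)
    cost = np.sum(residual_fn(params) ** 2)
    for _ in range(iterations):
        r = residual_fn(params)
        jac = numerical_jacobian(residual_fn, params)
        # Damped normal equations: (J^T J + lambda * I) delta = -J^T r
        lhs = jac.T @ jac + damping * np.eye(params.size)
        delta = np.linalg.solve(lhs, -jac.T @ r)
        new_cost = np.sum(residual_fn(params + delta) ** 2)
        if new_cost < cost:   # accept the step, trust the local model more
            params, cost = params + delta, new_cost
            damping *= 0.5
        else:                 # reject the step, lean towards gradient descent
            damping *= 2.0
    return params, cost

# Toy usage: recover the angles that produced the target joint positions.
target = forward_kinematics(np.array([0.5, 0.3]))
fitted, final_cost = levenberg_marquardt(
    lambda angles: forward_kinematics(angles) - target, [0.0, 0.0])
```

In a full fit, the residual vector would stack the contributions of all four energy terms instead of the single toy term used here.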
![Page 33: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/33.jpg)
Training
![Page 34: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/34.jpg)
About Training CNN regressor is the only part that needs training
How to train network to predict keypoints and location-maps?
![Page 35: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/35.jpg)
About Training The network is pretrained for 2D pose estimation on MPII and LSP
◦ MPII ◦ 25K images containing over 40K people with annotated body joints
◦ Wide range of activities
◦ LSP ◦ 2K images of sports activities
The paper does not go into details on how this pretraining is done
![Page 36: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/36.jpg)
About Training Train for 3D pose estimation using MPI-INF-3DHP and Human3.6m
◦ MPI-INF-3DHP ◦ Generated using multi-view markerless motion capture system
◦ From all 14 cameras there are ~1.3𝑀 frames
◦ Captured on a greenscreen with background augmentation
◦ Use data from 5 chest-high cameras, 2 head-high cameras and 1 knee-high camera
◦ The sampled frames have at least one joint move by > 200mm between them
◦ Human3.6m ◦ 3.6 million 3D human poses and corresponding images
◦ 11 professional actors
◦ 17 scenarios
![Page 37: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/37.jpg)
Results
![Page 39: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/39.jpg)
Comparisons Comparison to state of the art on MPI-INF-3DHP test set using ground-truth bounding boxes: ◦ PCK: Percentage of Correct Keypoints (3D joint positions)
◦ AUC: Area Under the Curve
◦ MPJPE: Mean Per Joint Position Error (mm)
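The three metrics can be sketched as follows for predictions and ground truth given as arrays of 3D joint positions in millimetres. The 150 mm PCK threshold and the 0 to 150 mm threshold range for AUC are assumptions chosen for this example, as are the function names; the evaluation protocol of the benchmark should be checked before comparing numbers.

```python
import numpy as np

def joint_errors(pred, gt):
    """Euclidean distance per joint; inputs have shape (..., num_joints, 3)."""
    return np.linalg.norm(pred - gt, axis=-1)

def mpjpe(pred, gt):
    """Mean Per Joint Position Error, in the units of the input (mm)."""
    return joint_errors(pred, gt).mean()

def pck(pred, gt, threshold=150.0):
    """Percentage of joints whose error is within the threshold."""
    return 100.0 * (joint_errors(pred, gt) <= threshold).mean()

def auc(pred, gt, thresholds=np.linspace(0.0, 150.0, 31)):
    """Area under the PCK curve, approximated as the mean PCK over thresholds."""
    return np.mean([pck(pred, gt, t) for t in thresholds]) / 100.0
```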
![Page 40: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/40.jpg)
Comparisons Overall better pose quality
◦ Particularly for end effectors
◦ Occasional large mispredictions
![Page 41: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/41.jpg)
Comparisons Using 3D pose vastly improves PCK
Additional improvement from filtering and combining 2D and 3D constraints
![Page 42: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/42.jpg)
Limitations Self-occlusion
Poses far from the training data are hard
Multiple people ◦ Lack of training data
Occluded faces
Fast motion
High-end hardware
![Page 43: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/43.jpg)
Related Work DensePose [Güler et al. 2018]
◦ Surface-based representation of human pose
◦ Using a region-based convolutional neural network and ROI-Align pooling to obtain part labels
◦ Multi-person
![Page 45: VNect: Real-time 3D Human Pose Estimation with a Single ... · Monocular 2D Pose Estimation Early work mostly on monocular 2D pose estimation Deep learning methods represents the](https://reader034.vdocuments.net/reader034/viewer/2022042302/5ecd5f5bca840f6107767346/html5/thumbnails/45.jpg)
Summary Real-time 3D pose estimation from single RGB camera
Better than offline, state-of-the-art solutions in some categories
Temporal filtering and skeleton fitting improve quality
Limited availability of annotated datasets