Simultaneous Segmentation and 3D Pose Estimation of Humans
or Detection + Segmentation = Tracking?
Philip H.S. Torr, Pawan Kumar, Pushmeet Kohli, Matt Bray
Oxford Brookes University
Andrew Zisserman, Oxford
Arasanathan Thayananthan, Bjorn Stenger, Roberto Cipolla, Cambridge
Algebra
Unifying Conjecture
Tracking = Detection = Recognition
Detection = Segmentation
• therefore Tracking (pose estimation) = Segmentation?
Objective
Image → Segmentation → Pose Estimate??
Aim to get a clean segmentation of a human…
Developments
ICCV 2003, pose estimation as fast nearest neighbour plus dynamics (inspired by Gavrila and Toyama & Blake)
BMVC 2004, parts based chamfer to make space of templates more flexible (a la pictorial structures of Huttenlocher)
CVPR 2005, ObjCut combining segmentation and detection.
ECCV 2006, interpolation of poses using the MVRVM (Agarwal and Triggs)
ECCV 2006 combination of pose estimation and segmentation using graph cuts.
Tracking as Detection (Stenger et al ICCV 2003)
Detection has become very efficient, e.g. real-time face detection, pedestrian detection
Example: Pedestrian detection [Gavrila & Philomin, 1999]: Find match among large number of exemplar templates
Issues:
• Number of templates needed
• Efficient search
• Robust cost function
Cascaded Classifiers
First filter: 19.8% of patches remaining
Filter 10: 0.74% of patches remaining
Filter 20: 0.06% of patches remaining
Filter 30: 0.01% of patches remaining
Filter 70: 0.007% of patches remaining
1280x1024 image, 11 subsampling levels, 80 s; average number of filters per patch: 6.7
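The cascade idea above can be sketched in a few lines; the stages and thresholds here are made up for illustration and stand in for the talk's learned filters:

```python
# Hypothetical detection cascade: each stage rejects patches whose score
# falls below its threshold, so most patches are discarded after only a
# few stage evaluations.

def make_stage(threshold):
    """A cascade stage: accept a patch whose score reaches the threshold."""
    def stage(patch_score):
        return patch_score >= threshold
    return stage

def run_cascade(stages, patches, score):
    """Pass every patch through the stages in order; early rejection keeps
    the average number of stages evaluated per patch small."""
    survivors, evaluations = [], 0
    for p in patches:
        passed = True
        for stage in stages:
            evaluations += 1
            if not stage(score(p)):
                passed = False
                break  # reject: later, more expensive stages never run
        if passed:
            survivors.append(p)
    return survivors, evaluations / len(patches)
```

With three stages of increasing strictness, only the strongest patch survives, and the average number of stage evaluations per patch stays well below the number of stages, mirroring the 6.7 filters per patch reported above.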
Hierarchical Detection
Efficient template matching (Huttenlocher & Olson, Gavrila)
Idea: when matching similar objects, speed up by forming a template hierarchy found by clustering. Match prototypes first; descend into a sub-tree only if the cost is below a threshold.
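A minimal sketch of that hierarchy search, assuming toy vector "templates" and a sum-of-absolute-differences cost standing in for a real chamfer cost:

```python
# Hierarchical template matching sketch: each tree node holds a prototype
# template; a sub-tree is visited only when the prototype's matching cost
# is below the threshold, pruning whole clusters of templates at once.

def cost(template, observation):
    """Matching cost: sum of absolute differences (stand-in for chamfer)."""
    return sum(abs(t - o) for t, o in zip(template, observation))

def search(node, observation, threshold, matches):
    """node = (template, children). Depth-first search with pruning."""
    template, children = node
    c = cost(template, observation)
    if c >= threshold:
        return  # prune the whole sub-tree below this prototype
    if not children:
        matches.append((c, template))  # leaf: an actual exemplar template
    for child in children:
        search(child, observation, threshold, matches)
```

On a small tree, an observation near one cluster visits only that cluster's leaves; the distant prototype cuts off its entire sub-tree.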
Trees
These search trees are the same as used for efficient nearest neighbour.
Add a dynamic model, and Detection = Tracking = Recognition.
Evaluation at Multiple Resolutions
One traversal of tree per time step
Evaluation at Multiple Resolutions
Tree: 9000 templates of a pointing hand (rigid)
Templates at Level 1
Templates at Level 2
Templates at Level 3
Comparison with Particle Filters
This method is grid-based:
• No need to render the model online
• Like efficient search
• Can always use this as a proposal process for a particle filter if need be.
Interpolation, MVRVM, ECCV 2006
Code available.
Energy being Optimized, link to graph cuts
Combination of:
• Edge term (quickly evaluated using chamfer)
• Interior term (quickly evaluated using integral images)
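Why the interior term is cheap to evaluate: with an integral image, any rectangular sum costs four lookups after a single pass over the image. A pure-Python sketch (real systems operate on full-size image arrays):

```python
# Integral image sketch: one O(h*w) pass, then O(1) rectangle sums.

def integral_image(img):
    """ii[y][x] = sum of img over the rectangle [0, y) x [0, x)."""
    h, w = len(img), len(img[0])
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def rect_sum(ii, y0, x0, y1, x1):
    """Sum of img[y0:y1, x0:x1] in four lookups."""
    return ii[y1][x1] - ii[y0][x1] - ii[y1][x0] + ii[y0][x0]
```

Evaluating a template's interior term then reduces to a handful of `rect_sum` calls, independent of the template's area.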
Note that possible templates are a bit like cuts that we put down, one could think of this whole process as a constrained search for the best graph cut.
Likelihood : Edges
Robust edge matching: edge detection on the input image, projected contours from the 3D model.
Chamfer Matching
Input image → Canny edges → distance transform; match the projected contours against it.
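The chamfer pipeline can be sketched directly: distance-transform the edge map once, then score any projected contour by averaging the distances under its points. The brute-force Manhattan-distance transform below is for clarity only; real systems use a fast two-pass algorithm.

```python
# Chamfer matching sketch on a binary edge map.

def distance_transform(edges):
    """edges: 2D 0/1 map. Returns per-pixel distance to the nearest edge
    pixel (Manhattan distance, computed by brute force for clarity)."""
    pts = [(y, x) for y, row in enumerate(edges)
           for x, v in enumerate(row) if v]
    return [[min(abs(y - py) + abs(x - px) for py, px in pts)
             for x in range(len(edges[0]))]
            for y in range(len(edges))]

def chamfer_cost(dt, contour):
    """Mean distance-transform value under the template's contour points:
    low when the projected contour lies on image edges."""
    return sum(dt[y][x] for y, x in contour) / len(contour)
```

A contour that lands on the detected edges scores near zero; one that misses them is penalised by its average distance to the nearest edge.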
Likelihood : Colour
Skin colour model applied to the input image; projected silhouette from the 3D model.
Template Matching = constrained search for a cut/segmentation?
Detection = Segmentation?
Objective
Image → Segmentation → Pose Estimate??
Aim to get a clean segmentation of a human…
MRF for Interactive Image Segmentation, Boykov and Jolly [ICCV 2001]
Energy (MRF): E(x) = Σ_i φ(D|x_i) + Σ_ij ψ(x_i, x_j)
Unary likelihood φ from the data D; pairwise terms combine a contrast term with a uniform prior (Potts model).
Maximum-a-posteriori (MAP) solution: x* = arg min_x E(x)
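A toy version of this energy, with exhaustive enumeration standing in for the graph cut (viable only on tiny grids; the unary costs in the test are invented):

```python
# Segmentation energy E(x) = sum_i phi(D|x_i) + sum_ij psi(x_i, x_j):
# unary likelihoods plus a Potts smoothness term over 4-neighbour pairs.
from itertools import product

def energy(labels, unary, potts_weight, h, w):
    """labels: dict (y, x) -> 0/1; unary: dict (y, x) -> (cost_bg, cost_fg)."""
    e = sum(unary[p][labels[p]] for p in labels)
    for y in range(h):
        for x in range(w):
            for dy, dx in ((0, 1), (1, 0)):  # right and down neighbours
                q = (y + dy, x + dx)
                if q in labels and labels[(y, x)] != labels[q]:
                    e += potts_weight  # Potts: pay only for label changes
    return e

def map_labeling(unary, potts_weight, h, w):
    """Exhaustive x* = argmin_x E(x); a graph cut computes the same
    minimum in low-order polynomial time for this class of energies."""
    pixels = [(y, x) for y in range(h) for x in range(w)]
    best = min(product((0, 1), repeat=len(pixels)),
               key=lambda assign: energy(dict(zip(pixels, assign)),
                                         unary, potts_weight, h, w))
    return dict(zip(pixels, best))
```

On a 1x3 strip with one pixel pulled towards foreground, the Potts term keeps the middle pixel with its cheaper side rather than paying for two label changes.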
However…
This energy formulation rarely provides realistic (target-like) results.
Shape-Priors and Segmentation
Combine object detection with segmentation:
• Obj-Cut, Kumar et al., CVPR ’05
• Zhao and Davis, ICCV ’05
Obj-Cut:
• Shape prior: Layered Pictorial Structure (LPS)
• Learned exemplars for parts of the LPS model
• Obtained impressive results
Layer 1 + Layer 2 = LPS model
LPS for Detection
Learning: learnt automatically using a set of examples.
Detection: a tree of chamfer matchers detects parts, which are assembled with a pictorial structure and belief propagation.
Solve via Integer Programming
SDP formulation (Torr 2001, AI stats)
SOCP formulation (Kumar, Torr & Zisserman this conference)
LBP (Huttenlocher, many)
Obj-Cut: image likelihood ratio (colour) + distance from shape prior.
Integrating Shape-Prior in MRFs
MRF for segmentation:
E(x) = Σ_i φ(D|x_i) + Σ_ij ψ(x_i, x_j)
Unary potential φ(D|x_i); pairwise potential ψ (Potts-model prior); labels x_i over pixels i.
Integrating Shape-Prior in MRFs
Pose-specific MRF:
E(x, Θ) = Σ_i [ φ(D|x_i) + φ(x_i|Θ) ] + Σ_ij ψ(x_i, x_j)
The unary potential now includes the shape prior φ(x_i|Θ) with pose parameters Θ; pairwise potential ψ (Potts-model prior); labels x_i over pixels i.
LPS model: Layer 1 + Layer 2 under transformations; e.g. a transformation Θ1 with P(Θ1) = 0.9 generates a cow instance.
Do we really need accurate models?
Segmentation boundary can be extracted from edges
Rough 3D Shape-prior enough for region disambiguation
Energy of the Pose-specific MRF
Energy to be minimized:
E(x, θ) = Σ_i [ φ(D|x_i) + φ(x_i|θ) ] + Σ_ij ψ(x_i, x_j)
Unary term φ(D|x_i); shape prior φ(x_i|θ); pairwise potential ψ (Potts model).
But what should be the value of θ?
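One plausible reading of the shape-prior unary term, sketched with invented distance values: labeling a pixel foreground far outside the silhouette that pose θ projects to, or background deep inside it, is penalised in proportion to the pixel's distance from the silhouette boundary.

```python
# Sketch of the shape-prior unary phi(x_i | theta). In a real system,
# dist_outside/dist_inside would come from distance transforms of the
# silhouette projected under pose theta; here they are plain numbers.

def shape_prior_unary(dist_outside, dist_inside, weight):
    """Return (cost_bg, cost_fg) for one pixel: the background label is
    costly deep inside the silhouette, the foreground label far outside."""
    return (weight * dist_inside, weight * dist_outside)

def pose_unary(appearance, dist_outside, dist_inside, weight):
    """Combine the appearance cost phi(D|x_i) with the shape prior
    phi(x_i|theta); this sum is what the pose-specific MRF minimises."""
    sp = shape_prior_unary(dist_outside, dist_inside, weight)
    return (appearance[0] + sp[0], appearance[1] + sp[1])
```

With an ambiguous appearance cost of (1, 1), a pixel three units outside the silhouette is pushed towards background, and one four units inside towards foreground, which is how a rough shape prior disambiguates regions the colour model cannot.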
The different terms of the MRF (figure panels):
• Original image
• Likelihood of being foreground given a foreground histogram
• Grimson-Stauffer segmentation
• Shape-prior model
• Shape prior (distance transform)
• Likelihood of being foreground given all the terms
• Resulting graph-cuts segmentation
Can segment multiple views simultaneously
Solve via gradient descent:
• Comparable to level-set methods
• Could use other approaches (e.g. ObjCut)
• Need a graph cut per function evaluation
Formulating the Pose Inference Problem
But… to compute the MAP of E(x) w.r.t. the pose, the unary terms change at EACH iteration and the max-flow must be recomputed!
However… Kohli and Torr showed how dynamic graph cuts can be used to efficiently find MAP solutions for MRFs that change minimally from one time instant to the next: Dynamic Graph Cuts (ICCV 2005).
Dynamic Graph Cuts
Problem A (P_A) is solved with a computationally expensive operation to obtain solution S_A. If A and B are similar, the differences between A and B yield a simpler problem P_B*, from which the solution S_B follows by a cheaper operation.
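A simplified version of the flow-recycling idea can be demonstrated on a toy graph with a plain Edmonds-Karp solver: solve problem A, edit the residual graph with the capacity differences (here only capacity increases, the easy case; the dynamic graph cut machinery also handles decreases), and resume augmenting instead of restarting.

```python
# Max-flow with an explicit residual graph (adjacency matrix), so that a
# slightly changed problem can be solved by resuming from the residual.
from collections import deque

def augment(res, s, t):
    """Push flow along shortest augmenting paths until none remain;
    res is mutated into the final residual graph. Returns flow added."""
    n, total = len(res), 0
    while True:
        parent = [-1] * n
        parent[s] = s
        q = deque([s])
        while q and parent[t] == -1:
            u = q.popleft()
            for v in range(n):
                if res[u][v] > 0 and parent[v] == -1:
                    parent[v] = u
                    q.append(v)
        if parent[t] == -1:
            return total  # no augmenting path left
        v, bottleneck = t, float("inf")
        while v != s:  # find the bottleneck capacity along the path
            bottleneck = min(bottleneck, res[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:  # push flow: update residual capacities
            res[parent[v]][v] -= bottleneck
            res[v][parent[v]] += bottleneck
            v = parent[v]
        total += bottleneck
```

After solving the first problem, increasing one edge's capacity and calling `augment` again pushes only the extra flow; the answer matches a from-scratch solve of the updated graph, which is the point of recycling the flow.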
Dynamic Image Segmentation
First segmentation problem: image → graph G_a → maximum flow → flows in n-edges, MAP solution, segmentation obtained.
Our algorithm: for the second segmentation problem G_b, take the residual graph G_r of G_a, apply the difference between G_a and G_b to obtain an updated residual graph G', and resume the maximum-flow computation.
Dynamic Graph Cut vs Active Cuts
• Our method: flow recycling
• Active Cuts: cut recycling
• Both methods: tree recycling
Experimental Analysis
MRF consisting of 2×10^5 latent variables connected in a 4-neighbourhood.
Running time of the dynamic algorithm
Segmentation Comparison
Methods compared: Grimson-Stauffer, Bathia04, our method.
Face Detector and ObjCut
Segmentation
Conclusion
Combining pose inference and segmentation worth investigating.
Tracking = Detection
Detection = Segmentation
Tracking = Segmentation
Segmentation = SFM??