Human Pose Recognition

Contents

1. Introduction

2. Article [1]: Real Time Motion Capture Using a Single TOF Camera (2010)

3. Article [2]: Real Time Human Pose Recognition in Parts from Single Depth Images (2011)

1.1 What Is Pose Recognition?

Fig From [2] (labels: input image, arm, torso, head)

1.2 Motivation

Why do we need this?

Robotics

Smart surveillance

Virtual reality

Motion analysis

Gaming - Kinect

Kinect – Project Natal

Microsoft Xbox 360 console

“You are the controller”

Launched 4 November 2010

Sold over 8M units in its first 60 days on the market (Guinness world record)

http://www.youtube.com/watch?v=p2qlHoxPioM

1.3 Challenges

Real Time???

Full Solution??

Cheap???

Occlusions?

Light?

Shadows?

Clothes?

What is the problem???

1.4 Previous Technology

Mocap using markers – expensive.

Multi-view camera systems – limited applicability.

Monocular – simplified problems.

1.5 New Technology

Time-of-Flight (TOF) camera

Dense depth

High frame rate (100 Hz)

Robust to:

Lighting

shadows

other problems.

2. Article [1]

Real Time Motion Capture Using a Single Time-of-Flight Camera

(V. Ganapathi et al., CVPR 2010)

Article Contents

2.1 previous work

2.2 What’s new?

2.3 Overview

2.4 results

2.5 limitations & future work

2.6 Evaluation

2.1 Previous work

Many, many articles… (Moeslund et al. 2006 covered 350 articles.)

Figures from earlier work: (2006), (2006), (1998)

2.2 What’s new?

TOF technology

Propagating information up the kinematic chain.

Probabilistic model using the unscented transform.

Multiple GPUs.

2.3 Overview

1. Probabilistic Model

2. Algorithm Overview:

Model Based Hill Climbing Search

Evidence Propagation

Full Algorithm

1. Probabilistic Model

15 body parts

DAG – Directed Acyclic Graph

DBN – Dynamic Bayesian Network

$X_t = \{X_t^i\}_{i=1}^{N}$ – pose

$V_t$ – speed

$z_t$ – range scan

Assumptions:

$P(X_t^i \mid V_t^i, X_{t-1}^i) = 1$

$V_t \mid V_{t-1} \sim N(V_{t-1}, \Sigma)$

Use ray casting to evaluate distance from measurement.

Goal: find the most likely state given the previous frame's MAP, i.e.:

$\hat{X}_t, \hat{V}_t = \arg\max_{X_t, V_t} \log P(z_t \mid X_t, V_t) + \log P(X_t, V_t \mid \hat{X}_{t-1}, \hat{V}_{t-1})$

Fig From [1] (label: $z^k$)

2. Algorithm Overview

1. Hill climbing search (HC)

2. Evidence propagation (EP)

2.1 Hill Climbing Search (HC)

Fig From [1]

For body part $i$:

1. Sample $V_t^i$: grid around $P(V_t^i \mid V_{t-1}^i)$, 0.05 m spacing.

2. Calculate $X_t$ from $V_t$, $\hat{X}_{t-1}$.

3. Evaluate likelihood, choose best point!

Coarse-to-fine grids.
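The coarse-to-fine grid search can be sketched on a toy problem. This is a minimal illustration, not the implementation from [1]: the function names, the two-dimensional state, the toy likelihood, the grid radius and the finer spacings are all assumptions made here; only the 0.05 m starting spacing comes from the slide.

```python
import itertools


def hill_climb(log_likelihood, start, spacings=(0.05, 0.025, 0.0125),
               radius=2, max_steps=100):
    """Coarse-to-fine grid hill climbing (hypothetical helper).

    At each spacing, evaluate every grid point around the current state,
    move to the best one, and stop when no grid point improves the score.
    """
    center = tuple(start)
    center_score = log_likelihood(center)
    for h in spacings:                       # coarse grid first, then finer
        offsets = [k * h for k in range(-radius, radius + 1)]
        for _ in range(max_steps):
            candidates = [
                tuple(c + d for c, d in zip(center, delta))
                for delta in itertools.product(offsets, repeat=len(center))
            ]
            best = max(candidates, key=log_likelihood)
            best_score = log_likelihood(best)
            if best_score <= center_score:   # local optimum at this spacing
                break
            center, center_score = best, best_score
    return center, center_score


def toy_log_likelihood(state):
    """Toy stand-in for the measurement likelihood, peaked at (0.3, -0.2)."""
    x, y = state
    return -((x - 0.3) ** 2 + (y + 0.2) ** 2)


pose, score = hill_climb(toy_log_likelihood, (0.0, 0.0))
```

Because every move only accepts a strictly better neighbor, the search stops at the nearest local optimum, which is exactly the weakness that evidence propagation is meant to address.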

2.1 Hill Climbing Search (HC)

The good:

Simple

Fast

Runs in parallel on GPUs

The Bad:

Local optima

Ridges, plateaus, alleys

Can lose track when motion is fast or occlusions occur.

2.2 Evidence Propagation

Also has 3 stages:

1. Body part detection (C. Plagemann et al 2010)

2. Probabilistic Inverse Kinematics

3. Data association and inference

2.2.1 Body Part Detection

Bottom up approach:

1. Locate interest points with AGEX – Accumulative Geodesic Extrema.

2. Find orientation.

3. Classify the head, feet and hands using local shape descriptors.
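The idea behind geodesic extrema can be sketched on a small graph: repeatedly pick the node that is farthest (along the surface graph) from everything selected so far. The sketch below assumes a weighted adjacency-list graph and uses multi-source Dijkstra; the function names and the toy stick-figure graph are hypothetical and are not taken from [3].

```python
import heapq


def geodesic_distances(graph, sources):
    """Multi-source Dijkstra: shortest path length from any source node."""
    dist = {node: float("inf") for node in graph}
    heap = []
    for s in sources:
        dist[s] = 0.0
        heap.append((0.0, s))
    heapq.heapify(heap)
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue                      # stale queue entry
        for neighbor, weight in graph[node]:
            nd = d + weight
            if nd < dist[neighbor]:
                dist[neighbor] = nd
                heapq.heappush(heap, (nd, neighbor))
    return dist


def geodesic_extrema(graph, start, k):
    """Return k interest points, each farthest from all earlier ones."""
    sources = [start]
    extrema = []
    for _ in range(k):
        dist = geodesic_distances(graph, sources)
        farthest = max(dist, key=dist.get)
        extrema.append(farthest)
        sources.append(farthest)          # later searches also start here
    return extrema


# Toy "stick figure": edge weights are surface distances from the torso.
limbs = {"head": 1.0, "l_hand": 2.0, "r_hand": 2.5, "l_foot": 3.0, "r_foot": 3.5}
graph = {"torso": [(name, w) for name, w in limbs.items()]}
for name, w in limbs.items():
    graph[name] = [("torso", w)]

points = geodesic_extrema(graph, "torso", 5)
```

On a real depth image the graph nodes would be surface points, so the extrema tend to land on the ends of limbs and the head, which is why they make good candidates for the classification step.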

Fig From [3]

2.2.1 Body Part Detection

Results:

Fig From [3]

2.2.2 Probabilistic inverse kinematics (EP)

Model parts $\{p_i\}_{i=1}^{5}$ = {Head, Hands, Legs} of $X$; detections $\hat{p}_j$ $(j = 1, \ldots, N)$. Which detection belongs to which part?

Assume correspondence $(p_i, \hat{p}_j)$.

Need new MAP conditioned on $\hat{X}_{t-1}, \hat{p}_j$.

Problem – $p_i(V_t, \hat{X}_{t-1}, \hat{V}_{t-1})$ isn’t linear!

Solution: linearize with the unscented Kalman filter.

Easy to determine $P(V_t \mid \hat{V}_{t-1}, \hat{X}_{t-1}, \hat{p}_j)$.
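The unscented transform behind the unscented Kalman filter can be sketched in a few lines: push a small set of sigma points through the nonlinear function and recover a Gaussian from the results. The sketch below uses the basic formulation with a single scaling parameter `kappa`; the helper names and the one-joint `hand_position` function are hypothetical and only stand in for the part-position function of [1].

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def unscented_transform(mean, cov, f, kappa=1.0):
    """Approximate the mean and covariance of f(x) for x ~ N(mean, cov)."""
    n = len(mean)
    scale = n + kappa
    low = cholesky([[scale * c for c in row] for row in cov])
    points = [list(mean)]                         # central sigma point
    for j in range(n):
        col = [low[i][j] for i in range(n)]
        points.append([m + c for m, c in zip(mean, col)])
        points.append([m - c for m, c in zip(mean, col)])
    weights = [kappa / scale] + [1.0 / (2.0 * scale)] * (2 * n)
    ys = [f(p) for p in points]                   # propagate through f
    dim = len(ys[0])
    y_mean = [sum(w * y[i] for w, y in zip(weights, ys)) for i in range(dim)]
    y_cov = [[sum(w * (y[i] - y_mean[i]) * (y[j] - y_mean[j])
                  for w, y in zip(weights, ys))
              for j in range(dim)] for i in range(dim)]
    return y_mean, y_cov


def hand_position(state):
    """Toy nonlinear map: joint angle -> tip of a unit-length planar limb."""
    (theta,) = state
    return [math.cos(theta), math.sin(theta)]


mean_y, cov_y = unscented_transform([0.5], [[0.04]], hand_position)
```

For a linear function the transform reproduces the exact mean and covariance, which is a convenient sanity check before applying it to a nonlinear kinematic model.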

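2.3 Full Algorithm

Inputs: previous MAP, depth image.

1. HC from the previous MAP → $X_{best}$

2. Part detection on the depth image

3. Remove explained suggestions

4. Correspond remaining suggestions by body parts: $\{(p_i, \hat{p}_j)\}$

5. EP → $X'$

6. HC → $X'$

7. $X' > X_{best}$? If so, $X'$ becomes $X_{best}$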
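The control flow of the combined HC + EP tracker can be sketched as one function per frame. Everything here is an assumption for illustration: the stages are passed in as callables with hypothetical names, and the one-dimensional toy stages below only exist to make the flow runnable; none of it is the code of [1].

```python
def track_frame(prev_map, depth_image, hill_climb, detect_parts, is_explained,
                correspond, evidence_propagation, likelihood):
    """One frame of a hypothetical HC + EP loop."""
    x_best = hill_climb(prev_map, depth_image)           # local search first
    suggestions = [p for p in detect_parts(depth_image)
                   if not is_explained(p, x_best)]        # drop explained ones
    for part, detection in correspond(suggestions, x_best):
        x_new = evidence_propagation(x_best, part, detection)
        x_new = hill_climb(x_new, depth_image)            # refine the proposal
        if likelihood(x_new, depth_image) > likelihood(x_best, depth_image):
            x_best = x_new                                # keep the better pose
    return x_best


# Toy 1-D stand-ins: the "pose" is a number, the "image" is the true value.
def toy_hc(x, z):
    """Local search that can move at most one unit toward the target."""
    return x + max(-1.0, min(1.0, z - x))


result = track_frame(
    prev_map=0.0,
    depth_image=10.0,
    hill_climb=toy_hc,
    detect_parts=lambda z: [z],
    is_explained=lambda p, x: abs(p - x) < 0.5,
    correspond=lambda suggestions, x: [("hand", p) for p in suggestions],
    evidence_propagation=lambda x, part, detection: detection,
    likelihood=lambda x, z: -abs(x - z),
)
```

In the toy run, hill climbing alone only reaches 1.0, while the detection lets the tracker jump to the correct value; this mirrors how a bottom-up detection can recover a track that local search has lost.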

2.4 Results

Experiments:

28 real depth image sequences.

Ground truth – tracking markers.

$\epsilon_{avg} = \frac{1}{M} \sum_{i=1}^{M} \| m_i - \hat{m}_i \|$

$m_i$ – real marker position

$\hat{m}_i$ – estimated position

$\epsilon_{avg} \le 0.1$ m – perfect tracks.

$\epsilon_{avg} \ge 0.3$ m – fault tracking.

Compared 3 algorithms: EP, HC, HC+EP.
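The average marker error is the mean Euclidean distance between real and estimated marker positions. A minimal sketch, assuming markers are given as coordinate tuples in meters; the function name and the sample values are made up for illustration.

```python
import math


def average_marker_error(true_markers, estimated_markers):
    """Mean Euclidean distance between matching marker positions."""
    if not true_markers or len(true_markers) != len(estimated_markers):
        raise ValueError("need two non-empty marker lists of equal length")
    total = sum(math.dist(m, m_hat)
                for m, m_hat in zip(true_markers, estimated_markers))
    return total / len(true_markers)


# Two markers: one estimate is off by 0.1 m, the other by 0.3 m.
truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
estimate = [(0.0, 0.0, 0.1), (1.0, 0.3, 0.0)]
error = average_marker_error(truth, estimate)
```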

2.4 Results

Best: HC+EP; worst: EP.

Runs close to real time.

HC: 6 frames per second.

HC+EP: 4-6 frames per second.

Fig From [1] (annotations: bigger difference, harder)

2.4 Results

Extreme case – 27:

Fig From [1]: HC vs. HC+EP (annotation: lose track)

2.5 Limitations & Future work

Limitations:

Manual initialization.

No tracking of more than one person at a time.

Using temporal data – consumes more time, reinitialization problem.

Future work:

Improving the speed.

Combining with color cameras.

Fully automatic model initialization.

Tracking more than one person.

2.6 Evaluation

Well written

Self contained

Novel combination of existing parts

New technology

Achieving goals (real time)

Missing examples on probabilistic model.

Not clear how $X_0$ is defined.

Extensively validated:

Data set and code available

Not enough visual examples in article

No comparison to different algorithms

3. Article [2]

Real Time Human Pose Recognition in Parts from Single Depth Images

(Shotton et al. & Xbox Incubation, Microsoft Research, 2011)

Article Contents

2.1 previous work

2.2 What’s new?

2.3 Overview

2.4 results

2.5 limitations & future work

2.6 Evaluation

2.1 Previous work

Same as Article [1].

2.2 What’s new?

Using no temporal information – robust and fast (200 frames per second).

Object recognition approach.

Per-pixel classification.

Large and highly varied training dataset.

Fig From [2]

2.3 Overview

1. Database construction

2. Body part inference and joint proposals:

Goals:

computational efficiency and robustness

1. Database

Pose estimation has often had to overcome a lack of training data… why?

Huge color and texture variability.

Computer simulations don’t produce the range of volitional motions of a human subject.

1. Database

Fig From [2]: 100k mocap frames → synthetic rendering pipeline

1. Database

Real data

Synthetic data

Which is real???

Fig From [2]

2. Body part inference

1. Body part labeling

2. Depth image features

3. Randomized decision forests

4. Joint position proposals

2.1 Body part labeling

31 body parts labeled.

The problem can now be solved by efficient classification algorithms.

Fig From [2] (example labels: Head Up Right, Head Up Left)

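2.2 Depth comparison features

Simple depth comparison features:

$f_\theta(I, x) = d_I\left(x + \frac{u}{d_I(x)}\right) - d_I\left(x + \frac{v}{d_I(x)}\right)$   (1)

$d_I(x)$ – depth at pixel $x$ in image $I$; $\theta = (u, v)$ – offsets.

Offset normalization by $1 / d_I(x)$ – depth invariant.

Computational efficiency: no preprocessing.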

Fig From [2]
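A depth comparison feature reads the depth at two probe pixels around x and subtracts them, with both offsets divided by the depth at x so that the probes cover the same part of the body at any distance. A minimal sketch, assuming the image is a list of rows of depths in meters; the function names, the rounding to whole pixels and the `BACKGROUND` constant for probes that leave the image are assumptions of this example.

```python
BACKGROUND = 1e6  # assumed large depth for probes that fall outside the image


def depth_at(image, pixel):
    """Depth at (row, col), or BACKGROUND when the pixel is out of bounds."""
    row, col = pixel
    if 0 <= row < len(image) and 0 <= col < len(image[0]):
        return image[row][col]
    return BACKGROUND


def depth_feature(image, x, u, v):
    """Difference of two depth probes whose offsets are scaled by 1 / depth(x)."""
    d = depth_at(image, x)

    def probe(offset):
        target = (x[0] + round(offset[0] / d), x[1] + round(offset[1] / d))
        return depth_at(image, target)

    return probe(u) - probe(v)


# Flat surface at 2 m with one farther pixel on the right edge.
image = [
    [2.0, 2.0, 2.0, 2.0, 2.0],
    [2.0, 2.0, 2.0, 2.0, 3.0],
    [2.0, 2.0, 2.0, 2.0, 2.0],
]
value = depth_feature(image, (1, 2), (0, 4), (0, -4))
```

With a depth of 2 m at x, the offsets of 4 shrink to 2 pixels, so the probes land on the right and left edges of the row and the feature responds to the 1 m depth step.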

2.3 Randomized Decision forests

How does it work?

Node = feature $f_\theta$ and a threshold $\tau$

Classify pixel $x$:

$P(c \mid I, x) = \frac{1}{T} \sum_{t=1}^{T} P_t(c \mid I, x)$

Fig From [2] (label: pixel $x$)
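Classifying a pixel with a forest means walking each tree from the root, comparing one feature against the node's threshold at every split, and averaging the class distributions stored at the leaves that are reached. A minimal sketch with hand-built trees; the dictionary layout, the key names and the toy feature values are assumptions of this example.

```python
def classify_with_tree(tree, feature_value):
    """Descend one tree; feature_value(theta) returns the feature for this pixel."""
    node = tree
    while "leaf" not in node:
        go_left = feature_value(node["theta"]) < node["tau"]
        node = node["left"] if go_left else node["right"]
    return node["leaf"]                      # class distribution at the leaf


def forest_posterior(forest, feature_value):
    """Average the per-tree distributions over all T trees."""
    leaves = [classify_with_tree(tree, feature_value) for tree in forest]
    classes = set().union(*leaves)
    return {c: sum(leaf.get(c, 0.0) for leaf in leaves) / len(leaves)
            for c in classes}


# Two depth-1 trees over two hypothetical features "a" and "b".
forest = [
    {"theta": "a", "tau": 0.0,
     "left": {"leaf": {"head": 0.8, "arm": 0.2}},
     "right": {"leaf": {"head": 0.1, "arm": 0.9}}},
    {"theta": "b", "tau": 1.0,
     "left": {"leaf": {"head": 0.6, "arm": 0.4}},
     "right": {"leaf": {"head": 0.2, "arm": 0.8}}},
]
features = {"a": -0.5, "b": 2.0}
posterior = forest_posterior(forest, features.get)
```

Each pixel is classified independently of all others, which is what makes the method easy to run in parallel.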


Training algorithm: 1M images – 2000 pixels per image

$\phi^* = \arg\max_{\phi} G(\phi)$, with $\phi = (\theta, \tau)$

$G(\phi) = H(Q) - \sum_{s \in \{l, r\}} \frac{|Q_s(\phi)|}{|Q|} H(Q_s(\phi))$

$H$ – entropy
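Training picks, at every node, the split that most reduces the entropy of the labels: the gain is the entropy of the whole sample set minus the size-weighted entropies of the left and right subsets. A minimal sketch for a single feature with candidate thresholds; the function names and the four toy samples are assumptions of this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * math.log2(k / n) for k in Counter(labels).values())


def information_gain(samples, tau):
    """Gain of splitting (feature_value, label) samples at threshold tau."""
    left = [label for value, label in samples if value < tau]
    right = [label for value, label in samples if value >= tau]
    gain = entropy([label for _, label in samples])
    for subset in (left, right):
        if subset:                           # empty subsets contribute nothing
            gain -= len(subset) / len(samples) * entropy(subset)
    return gain


def best_threshold(samples, thresholds):
    """Candidate threshold with the largest information gain."""
    return max(thresholds, key=lambda tau: information_gain(samples, tau))


samples = [(-1.0, "head"), (-0.5, "head"), (0.5, "arm"), (1.0, "arm")]
tau = best_threshold(samples, [-0.75, 0.0, 0.75])
gain = information_gain(samples, tau)
```

A threshold of 0.0 separates the two classes perfectly, so it removes the full one bit of entropy in this toy set.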

Training 3 trees, depth 20, 1M images ~ 1 day (1000 core cluster)

1M images × 2000 pixels × 2000 features × 50 thresholds = $2 \cdot 10^{14}$ computations


Fig From [2]

Trained tree:

2.4 Joint Position Proposal

Local mode finding approach based on mean shift with a weighted Gaussian kernel.

Density estimator:

$f_c(\hat{x}) \propto \sum_{i=1}^{N} w_{ic} \exp\left(-\left\| \frac{\hat{x} - \hat{x}_i}{b_c} \right\|^2\right)$

$w_{ic} = P(c \mid I, x_i) \cdot d_I(x_i)^2$

Fig From [4] (labels: outliers, center of mass)
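Mean shift finds a mode of the weighted density by repeatedly replacing the current estimate with the kernel-weighted mean of the points. A minimal sketch with a Gaussian kernel of bandwidth b; the function name, the stopping rule and the toy points are assumptions of this example, and the weights are simply passed in rather than computed from pixel probabilities and depths.

```python
import math


def mean_shift_mode(points, weights, start, bandwidth, max_iters=50, tol=1e-6):
    """Iterate the Gaussian mean-shift update from start until it stops moving."""
    x = list(start)
    for _ in range(max_iters):
        kernel = [
            w * math.exp(-sum(((a - b) / bandwidth) ** 2 for a, b in zip(x, p)))
            for p, w in zip(points, weights)
        ]
        total = sum(kernel)
        if total == 0.0:                 # no point close enough to matter
            break
        new_x = [sum(k * p[d] for k, p in zip(kernel, points)) / total
                 for d in range(len(x))]
        shift = math.dist(x, new_x)
        x = new_x
        if shift < tol:
            break
    return x


# A tight cluster near (0, 0, 2) plus one far outlier.
points = [(0.0, 0.0, 2.0), (0.02, 0.0, 2.0), (-0.02, 0.0, 2.0),
          (0.0, 0.02, 2.0), (0.0, -0.02, 2.0), (1.0, 1.0, 2.0)]
weights = [1.0] * len(points)
mode = mean_shift_mode(points, weights, start=(0.05, 0.05, 2.0), bandwidth=0.1)
center_of_mass = [sum(p[d] for p in points) / len(points) for d in range(3)]
```

The outlier drags the center of mass away from the cluster, while the mode stays on it, which is the reason for preferring mode finding over a plain average.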

2.4 Results

Experiments:

8800 frames of real depth images.

5000 synthetic depth images.

Also evaluated on the Article [1] dataset.

Measures:

1. Classification accuracy – confusion matrix.

2. Joint accuracy – mean Average Precision (mAP): results within D = 0.1 m count as true positives (TP); anything else is a fault.

Fig From [2]

2.4 Results - Classification accuracy

High correlation between real and synthetic.

Depth of tree – most effective

Fig From [2]

2.4 Results - Joint Prediction

Comparing the algorithm on:

real set (red) – mAP 0.731

ground truth set (blue) – mAP 0.914

mAP 0.984 – upper body

Fig From [2]

2.4 Results - Joint Prediction

Comparing the algorithm to ideal Nearest Neighbor matching and to a realistic NN: Chamfer NN.

Fig From [2]

2.4 Results - Joint Prediction

Comparison to Article [1]:

Run on the same dataset

Better results (even without temporal data)

Runs 10x faster.

Fig From [2]

2.4 Results - Joint Prediction

Full rotations and multiple people

Right-left ambiguity

mAP of 0.655 (good for our uses)

Result video

Fig From [2]

2.5 Limitations & Future work

Future work:

Better synthesis pipeline

Is there an efficient approach that directly regresses joint positions? (Already done in follow-up work: Efficient offset regression of body joint positions)

http://research.microsoft.com/apps/pubs/?id=154457

2.6 Evaluation

Well written

Self Contained

Novel combination of existing parts

New technology

Achieving goals (real time)

Extensively validated:

Used in real console

Many results graphs and examples

(Another PDF of supplementary material)

Broad comparison to other algorithms

Data set and code not available

References

[1] Real Time Motion Capture Using a Single Time-of-Flight Camera (V. Ganapathi et al. 2010)

[2] Real Time Human Pose Recognition in Parts from Single Depth Images (Shotton et al. & Xbox Incubation 2011)

[3] Real Time Identification and Localization of Body Parts from Depth Images (C. Plagemann et al. 2010)

[4] Computer Graphics course (046746), Technion.

Questions?
