modeling mutual context of object and human pose in human-object interaction activities

Modeling Mutual Context of Object and Human Pose in Human-Object

Interaction Activities

Bangpeng Yao and Li Fei-FeiComputer Science Department, Stanford University

{bangpeng,feifeili}@cs.stanford.edu

1

Robots interact with objects

Automatic sports commentary

“Kobe is dunking the ball.”

2

Human-Object Interaction

Medical care

3

Vs.


Playing saxophone

Playing bassoon

Playing saxophone

Grouplet is a generic feature for structured objects, or interactions of groups of objects.

(Previous talk: Grouplet)

Caltech101Caltech101

HOI activity: Tennis Forehand

Holistic image based classification

Detailed understanding and reasoning

Berg & Malik, 2005 Grauman & Darrell, 2005 Gehler & Nowozin, 2009 OURS

48% 59% 77% 62%

4


TorsoRight-armLeft-a

rmRigh

t-leg

Left-leg

Head

• Human pose estimation



5


Tennis racket




• Object detection

6







rmRigh

t-leg

Left-leg

Head

Tennis racket

HOI activity: Tennis Forehand

• Background and Intuition• Mutual Context of Object and Human Pose

Model Representation Model Learning Model Inference

• Experiments• Conclusion

Outline

7




Outline

8

• Felzenszwalb & Huttenlocher, 2005• Ren et al, 2005• Ramanan, 2006• Ferrari et al, 2008• Yang & Mori, 2008• Andriluka et al, 2009• Eichner & Ferrari, 2009

Difficult part appearance

Self-occlusion

Image region looks like a body part

Human pose estimation & Object detection

9

Human pose estimation is challenging.


10

Human pose estimation is challenging.

• Felzenszwalb & Huttenlocher, 2005• Ren et al, 2005• Ramanan, 2006• Ferrari et al, 2008• Yang & Mori, 2008• Andriluka et al, 2009• Eichner & Ferrari, 2009


11

FacilitateFacilitate

Given the object is detected.

• Viola & Jones, 2001• Lampert et al, 2008• Divvala et al, 2009• Vedaldi et al, 2009

Small, low-resolution, partially occluded

Image region similar to detection target


12

Object detection is challenging


13

Object detection is challenging

• Viola & Jones, 2001• Lampert et al, 2008• Divvala et al, 2009• Vedaldi et al, 2009


14

FacilitateFacilitate

Given the pose is estimated.


15

Mutual ContextMutual Context

• Hoiem et al, 2006• Rabinovich et al, 2007• Oliva & Torralba, 2007• Heitz & Koller, 2008• Desai et al, 2009• Divvala et al, 2009

• Murphy et al, 2003• Shotton et al, 2006• Harzallah et al, 2009• Li, Socher & Fei-Fei, 2009• Marszalek et al, 2009• Bao & Savarese, 2010

Context in Computer Vision

~3-4%

with context

without context

Helpful, but only moderately outperform better

Previous work – Use context cues to facilitate object detection:

• Viola & Jones, 2001• Lampert et al, 2008

16

Context in Computer VisionOur approach – Two challenging tasks serve as mutual context of each other:

With mutual context:

Without context:

17

~3-4%

with context

without context

Helpful, but only moderately outperform better

Previous work – Use context cues to facilitate object detection:

• Hoiem et al, 2006• Rabinovich et al, 2007• Oliva & Torralba, 2007• Heitz & Koller, 2008• Desai et al, 2009• Divvala et al, 2009

• Murphy et al, 2003• Shotton et al, 2006• Harzallah et al, 2009• Li, Socher & Fei-Fei, 2009• Marszalek et al, 2009• Bao & Savarese, 2010




Outline

18

19

H

A

Mutual Context Model Representation

• More than one H for each A;• Unobserved during training.

A:

Croquet shot

Volleyball smash

Tennis forehand

Intra-class variations

Activity

Object

Human pose

Body parts

lP: location; θP: orientation; sP: scale.

Croquet mallet

Volleyball

Tennis racket

O:

H:

P:

f: Shape context. [Belongie et al, 2002]

P1

Image evidence

fO

f1 f2 fN

O

P2 PN

20


( , )e O H

( , )e A O( , )e A H

e ee E

w

Markov Random Field

Clique potential

Clique weight

O

P1 PN

fO

H

A

P2

f1 f2 fN

( , )e A O ( , )e A H ( , )e O H• , , : Frequency of co-occurrence between A, O, and H.

21

A

f1 f2 fN


( , )e nO P

( , )e m nP P

fO

P1 PNP2

O

H• , , : Spatial relationship among object and body parts.

( , )e nO P ( , )e m nP P( , )e nH P

bin binn n nO P O P O Pl l s s

location orientation size( , )e nH P

e ee E

w

Markov Random Field

Clique potential

Clique weight


22

H

A

f1 f2 fN


Obtained by structure learning

fO

PNP1 P2

O

• Learn structural connectivity among the body parts and the object.


• , , : Spatial relationship among object and body parts.

( , )e nO P ( , )e m nP P( , )e nH P


location orientation size ( , )e nO P

( , )e m nP P

( , )e nH P

e ee E

w

Markov Random Field

Clique potential

Clique weight

23

H

O

A

fO

f1 f2 fN

P1 P2 PN


• and : Discriminative part detection scores.

( , )e OO f ( , )ne n PP f

[Andriluka et al, 2009]

Shape context + AdaBoost

• Learn structural connectivity among the body parts and the object.

[Belongie et al, 2002][Viola & Jones, 2001]

( , )e OO f

( , )ne n PP f


• , , : Spatial relationship among object and body parts.

( , )e nO P ( , )e m nP P( , )e nH P


location orientation size

e ee E

w

Markov Random Field

Clique potential

Clique weight




Outline

24

25

Model Learning

H

O

A

fO

f1 f2 fN

P1 P2 PN

e ee E

w

cricket shot

cricket bowling

Input:

Goals:Hidden human poses

26

Model Learning

H

O

A

fO

f1 f2 fN

P1 P2 PN

Input:

Goals:Hidden human posesStructural connectivity

e ee E

w

cricket shot

cricket bowling

e ee E

w

27

Model Learning

Goals:Hidden human posesStructural connectivityPotential parametersPotential weights

H

O

A

fO

f1 f2 fN

P1 P2 PN

Input:

cricket shot

cricket bowling

28

Model Learning

Goals:

Parameter estimation

Hidden variablesStructure learning

H

O

A

fO

f1 f2 fN

P1 P2 PN

Input:e e

e E

w

cricket shot

cricket bowling

Hidden human posesStructural connectivityPotential parametersPotential weights

29

Model Learning

Goals:

H

O

A

fO

f1 f2 fN

P1 P2 PN

Approach:

croquet shot

e ee E

w


30

Model Learning

Goals:

H

O

A

fO

f1 f2 fN

P1 P2 PN

Approach:

22max

2e eeE e

Ew

Joint density of the model

Gaussian priori of the edge number

Add an

ed

ge

Remove

an edge

Add an

ed

ge

Remove

an edge

Hill-climbing

e ee E

w


31

Model Learning

Goals:

H

O

A

fO

f1 f2 fN

P1 P2 PN

Approach:

( , )e O H( , )e A O ( , )e A H( , )e nO P ( , )e m nP P( , )e nH P

( , )e OO f ( , )ne n PP f

• Maximum likelihood

• Standard AdaBoost

e ee E

w


32

Model Learning

Goals:

H

O

A

fO

f1 f2 fN

P1 P2 PN

Approach:

Max-margin learning

2

2,

1min2 r i

r i w

w

• xi: Potential values of the i-th image.• wr: Potential weights of the r-th pose.• y(r): Activity of the r-th pose.• ξi: A slack variable for the i-th image.

Notations

s.t. , where ,

1

, 0i

i

c i r i i

i

i r y r y c

i

w x w x

e ee E

w


33

Learning Results

Cricket defensive

shot

Cricket bowling

Croquet shot

34

Learning Results

Tennis serve

Volleyball smash

Tennis forehand




Outline

35

I

36

Model Inference

The learned models

I

37

Model Inference

The learned models

Head detection

Torso detection

Tennis racket detection

Layout of the object and body parts.

Compositional Inference

[Chen et al, 2007]

* *1 1 1 1,, , , n nA H O P

I

38

Model Inference

The learned models

* *1 1 1 1,, , , n nA H O P * *

,, , ,K K K K n nA H O P

Output




Outline

39

40

Dataset and Experiment Setup

• Object detection;• Pose estimation;• Activity classification.

Tasks:

[Gupta et al, 2009]

Cricket defensive shot

Cricket bowling

Croquet shot

Tennis forehand

Tennis serve

Volleyball smash

Sport data set: 6 classes180 training (supervised with object and part locations) & 120 testing images

[Gupta et al, 2009]


Cricket bowling

Croquet shot

Tennis forehand

Tennis serve

Volleyball smash

Sport data set: 6 classes

41



Tasks:

180 training (supervised with object and part locations) & 120 testing images

0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

Recall

Pre

cisi

on

Object Detection Results

Cricket bat

42

Valid region

Croquet mallet Tennis racket Volleyball

0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

Recall

Pre

cisi

on

Cricket ball

Our Method

Sliding window

Pedestrian context

[Andriluka et al, 2009]

[Dalal & Triggs, 2006]

Object Detection Results

43

430 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

Recall

Pre

cisi

on

Volleyball

0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

Recall

Pre

cisi

on

Cricket ball

0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

RecallP

reci

sion

Our MethodPedestrian as contextScanning window detector

0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

Recall

Pre

cisi

on


0 0.2 0.4 0.6 0.8 10

0.2

0.4

0.6

0.8

1

Recall

Pre

cisi

on


Sliding window Pedestrian context Our method

Smal

l obj

ect

Bac

kgro

und

clut

ter

44



Tasks:

[Gupta et al, 2009]


Cricket bowling

Croquet shot

Tennis forehand

Tennis serve

Volleyball smash

Sport data set: 6 classes180 training & 120 testing images

45

Human Pose Estimation Results

Method Torso Upper Leg Lower Leg Upper Arm Lower Arm Head

Ramanan, 2006 .52 .22 .22 .21 .28 .24 .28 .17 .14 .42

Andriluka et al, 2009 .50 .31 .30 .31 .27 .18 .19 .11 .11 .45

Our full model .66 .43 .39 .44 .34 .44 .40 .27 .29 .58

46



Ramanan, 2006 .52 .22 .22 .21 .28 .24 .28 .17 .14 .42

Andriluka et al, 2009 .50 .31 .30 .31 .27 .18 .19 .11 .11 .45

Our full model .66 .43 .39 .44 .34 .44 .40 .27 .29 .58

Andriluka et al, 2009

Our estimation result

Tennis serve model

Andriluka et al, 2009

Our estimation result

Volleyball smash model

47



Ramanan, 2006 .52 .22 .22 .21 .28 .24 .28 .17 .14 .42

Andriluka et al, 2009 .50 .31 .30 .31 .27 .18 .19 .11 .11 .45

Our full model .66 .43 .39 .44 .34 .44 .40 .27 .29 .58

One pose per class .63 .40 .36 .41 .31 .38 .35 .21 .23 .52

Estimation result

Estimation result

Estimation result

Estimation result

48



Tasks:

[Gupta et al, 2009]


Cricket bowling

Croquet shot

Tennis forehand

Tennis serve

Volleyball smash

Sport data set: 6 classes180 training & 120 testing images

Activity Classification Results

49

Gupta et al, 2009

Our model

Bag-of-Words

83.3%

Cla

ssifi

catio

n ac

cura

cy 78.9%

52.5%

0.9

0.8

0.7

0.6

0.5

No scene No scene informationinformation Scene is Scene is

critical!!critical!! Cricket shot

Tennis forehand

Bag-of-wordsSIFT+SVM

Gupta et al, 2009

Our model

50

ConclusionHuman-Object Interaction

Next Steps

Vs.

• Pose estimation & Object detection on PPMI images.

• Modeling multiple objects and humans.

Grouplet representation

Mutual context model

Acknowledgment• Stanford Vision Lab reviewers:

– Barry Chai (1985-2010)– Juan Carlos Niebles– Hao Su

• Silvio Savarese, U. Michigan• Anonymous reviewers

51

53






How to beat this???


rm

Right-l

eg

Left-leg

Head

Tennis racket

image patchesimage patches

body partsbody parts

human posehuman pose

objectobject

Hier

arch

ical

repr

esen

tatio

nof

imag

es

human-object interaction activityhuman-object interaction activity


54

H

O

A

fO

f1 f2 fN

P1 P2 PN

modeling mutual context of object and human pose in human-object interaction activities

Documents

ferrari et

lampert et

divvala et

andriluka et

ren et

vedaldi et

marszalek et

rabinovich et