modeling mutual context of object and human pose in human-object interaction activities
DESCRIPTION
Modeling Mutual Context of Object and Human Pose in Human-Object Interaction Activities. {bangpeng,feifeili}@cs.stanford.edu. Bangpeng Yao and Li Fei-Fei Computer Science Department, Stanford University. Human-Object Interaction. Robots interact with objects. Automatic sports commentary. - PowerPoint PPT PresentationTRANSCRIPT
Modeling Mutual Context of Object and Human Pose in Human-Object
Interaction Activities
Bangpeng Yao and Li Fei-FeiComputer Science Department, Stanford University
{bangpeng,feifeili}@cs.stanford.edu
1
Robots interact with objects
Automatic sports commentary
“Kobe is dunking the ball.”
2
Human-Object Interaction
Medical care
3
Vs.
Human-Object Interaction
Playing saxophone
Playing bassoon
Playing saxophone
Grouplet is a generic feature for structured objects, or interactions of groups of objects.
(Previous talk: Grouplet)
Caltech101Caltech101
HOI activity: Tennis Forehand
Holistic image based classification
Detailed understanding and reasoning
Berg & Malik, 2005 Grauman & Darrell, 2005 Gehler & Nowozin, 2009 OURS
48% 59% 77% 62%
4
Human-Object Interaction
TorsoRight-armLeft-a
rmRigh
t-leg
Left-leg
Head
• Human pose estimation
Holistic image based classification
Detailed understanding and reasoning
5
Human-Object Interaction
Tennis racket
• Human pose estimation
Holistic image based classification
Detailed understanding and reasoning
• Object detection
6
Human-Object Interaction
• Human pose estimation
Holistic image based classification
Detailed understanding and reasoning
• Object detection
TorsoRight-armLeft-a
rmRigh
t-leg
Left-leg
Head
Tennis racket
HOI activity: Tennis Forehand
• Background and Intuition• Mutual Context of Object and Human Pose
Model Representation Model Learning Model Inference
• Experiments• Conclusion
Outline
7
• Background and Intuition• Mutual Context of Object and Human Pose
Model Representation Model Learning Model Inference
• Experiments• Conclusion
Outline
8
• Felzenszwalb & Huttenlocher, 2005• Ren et al, 2005• Ramanan, 2006• Ferrari et al, 2008• Yang & Mori, 2008• Andriluka et al, 2009• Eichner & Ferrari, 2009
Difficult part appearance
Self-occlusion
Image region looks like a body part
Human pose estimation & Object detection
9
Human pose estimation is challenging.
Human pose estimation & Object detection
10
Human pose estimation is challenging.
• Felzenszwalb & Huttenlocher, 2005• Ren et al, 2005• Ramanan, 2006• Ferrari et al, 2008• Yang & Mori, 2008• Andriluka et al, 2009• Eichner & Ferrari, 2009
Human pose estimation & Object detection
11
FacilitateFacilitate
Given the object is detected.
• Viola & Jones, 2001• Lampert et al, 2008• Divvala et al, 2009• Vedaldi et al, 2009
Small, low-resolution, partially occluded
Image region similar to detection target
Human pose estimation & Object detection
12
Object detection is challenging
Human pose estimation & Object detection
13
Object detection is challenging
• Viola & Jones, 2001• Lampert et al, 2008• Divvala et al, 2009• Vedaldi et al, 2009
Human pose estimation & Object detection
14
FacilitateFacilitate
Given the pose is estimated.
Human pose estimation & Object detection
15
Mutual ContextMutual Context
• Hoiem et al, 2006• Rabinovich et al, 2007• Oliva & Torralba, 2007• Heitz & Koller, 2008• Desai et al, 2009• Divvala et al, 2009
• Murphy et al, 2003• Shotton et al, 2006• Harzallah et al, 2009• Li, Socher & Fei-Fei, 2009• Marszalek et al, 2009• Bao & Savarese, 2010
Context in Computer Vision
~3-4%
with context
without context
Helpful, but only moderately outperform better
Previous work – Use context cues to facilitate object detection:
• Viola & Jones, 2001• Lampert et al, 2008
16
Context in Computer VisionOur approach – Two challenging tasks serve as mutual context of each other:
With mutual context:
Without context:
17
~3-4%
with context
without context
Helpful, but only moderately outperform better
Previous work – Use context cues to facilitate object detection:
• Hoiem et al, 2006• Rabinovich et al, 2007• Oliva & Torralba, 2007• Heitz & Koller, 2008• Desai et al, 2009• Divvala et al, 2009
• Murphy et al, 2003• Shotton et al, 2006• Harzallah et al, 2009• Li, Socher & Fei-Fei, 2009• Marszalek et al, 2009• Bao & Savarese, 2010
• Background and Intuition• Mutual Context of Object and Human Pose
Model Representation Model Learning Model Inference
• Experiments• Conclusion
Outline
18
19
H
A
Mutual Context Model Representation
• More than one H for each A;• Unobserved during training.
A:
Croquet shot
Volleyball smash
Tennis forehand
Intra-class variations
Activity
Object
Human pose
Body parts
lP: location; θP: orientation; sP: scale.
Croquet mallet
Volleyball
Tennis racket
O:
H:
P:
f: Shape context. [Belongie et al, 2002]
P1
Image evidence
fO
f1 f2 fN
O
P2 PN
20
Mutual Context Model Representation
( , )e O H
( , )e A O( , )e A H
e ee E
w
Markov Random Field
Clique potential
Clique weight
O
P1 PN
fO
H
A
P2
f1 f2 fN
( , )e A O ( , )e A H ( , )e O H• , , : Frequency of co-occurrence between A, O, and H.
21
A
f1 f2 fN
Mutual Context Model Representation
( , )e nO P
( , )e m nP P
fO
P1 PNP2
O
H• , , : Spatial relationship among object and body parts.
( , )e nO P ( , )e m nP P( , )e nH P
bin binn n nO P O P O Pl l s s
location orientation size( , )e nH P
e ee E
w
Markov Random Field
Clique potential
Clique weight
( , )e A O ( , )e A H ( , )e O H• , , : Frequency of co-occurrence between A, O, and H.
22
H
A
f1 f2 fN
Mutual Context Model Representation
Obtained by structure learning
fO
PNP1 P2
O
• Learn structural connectivity among the body parts and the object.
( , )e A O ( , )e A H ( , )e O H• , , : Frequency of co-occurrence between A, O, and H.
• , , : Spatial relationship among object and body parts.
( , )e nO P ( , )e m nP P( , )e nH P
bin binn n nO P O P O Pl l s s
location orientation size ( , )e nO P
( , )e m nP P
( , )e nH P
e ee E
w
Markov Random Field
Clique potential
Clique weight
23
H
O
A
fO
f1 f2 fN
P1 P2 PN
Mutual Context Model Representation
• and : Discriminative part detection scores.
( , )e OO f ( , )ne n PP f
[Andriluka et al, 2009]
Shape context + AdaBoost
• Learn structural connectivity among the body parts and the object.
[Belongie et al, 2002][Viola & Jones, 2001]
( , )e OO f
( , )ne n PP f
( , )e A O ( , )e A H ( , )e O H• , , : Frequency of co-occurrence between A, O, and H.
• , , : Spatial relationship among object and body parts.
( , )e nO P ( , )e m nP P( , )e nH P
bin binn n nO P O P O Pl l s s
location orientation size
e ee E
w
Markov Random Field
Clique potential
Clique weight
• Background and Intuition• Mutual Context of Object and Human Pose
Model Representation Model Learning Model Inference
• Experiments• Conclusion
Outline
24
25
Model Learning
H
O
A
fO
f1 f2 fN
P1 P2 PN
e ee E
w
cricket shot
cricket bowling
Input:
Goals:Hidden human poses
26
Model Learning
H
O
A
fO
f1 f2 fN
P1 P2 PN
Input:
Goals:Hidden human posesStructural connectivity
e ee E
w
cricket shot
cricket bowling
e ee E
w
27
Model Learning
Goals:Hidden human posesStructural connectivityPotential parametersPotential weights
H
O
A
fO
f1 f2 fN
P1 P2 PN
Input:
cricket shot
cricket bowling
28
Model Learning
Goals:
Parameter estimation
Hidden variablesStructure learning
H
O
A
fO
f1 f2 fN
P1 P2 PN
Input:e e
e E
w
cricket shot
cricket bowling
Hidden human posesStructural connectivityPotential parametersPotential weights
29
Model Learning
Goals:
H
O
A
fO
f1 f2 fN
P1 P2 PN
Approach:
croquet shot
e ee E
w
Hidden human posesStructural connectivityPotential parametersPotential weights
30
Model Learning
Goals:
H
O
A
fO
f1 f2 fN
P1 P2 PN
Approach:
22max
2e eeE e
Ew
Joint density of the model
Gaussian priori of the edge number
Add an
ed
ge
Remove
an edge
Add an
ed
ge
Remove
an edge
Hill-climbing
e ee E
w
Hidden human posesStructural connectivityPotential parametersPotential weights
31
Model Learning
Goals:
H
O
A
fO
f1 f2 fN
P1 P2 PN
Approach:
( , )e O H( , )e A O ( , )e A H( , )e nO P ( , )e m nP P( , )e nH P
( , )e OO f ( , )ne n PP f
• Maximum likelihood
• Standard AdaBoost
e ee E
w
Hidden human posesStructural connectivityPotential parametersPotential weights
32
Model Learning
Goals:
H
O
A
fO
f1 f2 fN
P1 P2 PN
Approach:
Max-margin learning
2
2,
1min2 r i
r i w
w
• xi: Potential values of the i-th image.• wr: Potential weights of the r-th pose.• y(r): Activity of the r-th pose.• ξi: A slack variable for the i-th image.
Notations
s.t. , where ,
1
, 0i
i
c i r i i
i
i r y r y c
i
w x w x
e ee E
w
Hidden human posesStructural connectivityPotential parametersPotential weights
33
Learning Results
Cricket defensive
shot
Cricket bowling
Croquet shot
34
Learning Results
Tennis serve
Volleyball smash
Tennis forehand
• Background and Intuition• Mutual Context of Object and Human Pose
Model Representation Model Learning Model Inference
• Experiments• Conclusion
Outline
35
I
36
Model Inference
The learned models
I
37
Model Inference
The learned models
Head detection
Torso detection
Tennis racket detection
Layout of the object and body parts.
Compositional Inference
[Chen et al, 2007]
* *1 1 1 1,, , , n nA H O P
I
38
Model Inference
The learned models
* *1 1 1 1,, , , n nA H O P * *
,, , ,K K K K n nA H O P
Output
• Background and Intuition• Mutual Context of Object and Human Pose
Model Representation Model Learning Model Inference
• Experiments• Conclusion
Outline
39
40
Dataset and Experiment Setup
• Object detection;• Pose estimation;• Activity classification.
Tasks:
[Gupta et al, 2009]
Cricket defensive shot
Cricket bowling
Croquet shot
Tennis forehand
Tennis serve
Volleyball smash
Sport data set: 6 classes180 training (supervised with object and part locations) & 120 testing images
[Gupta et al, 2009]
Cricket defensive shot
Cricket bowling
Croquet shot
Tennis forehand
Tennis serve
Volleyball smash
Sport data set: 6 classes
41
Dataset and Experiment Setup
• Object detection;• Pose estimation;• Activity classification.
Tasks:
180 training (supervised with object and part locations) & 120 testing images
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
Recall
Pre
cisi
on
Object Detection Results
Cricket bat
42
Valid region
Croquet mallet Tennis racket Volleyball
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
Recall
Pre
cisi
on
Cricket ball
Our Method
Sliding window
Pedestrian context
[Andriluka et al, 2009]
[Dalal & Triggs, 2006]
Object Detection Results
43
430 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
Recall
Pre
cisi
on
Volleyball
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
Recall
Pre
cisi
on
Cricket ball
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
RecallP
reci
sion
Our MethodPedestrian as contextScanning window detector
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
Recall
Pre
cisi
on
Our MethodPedestrian as contextScanning window detector
0 0.2 0.4 0.6 0.8 10
0.2
0.4
0.6
0.8
1
Recall
Pre
cisi
on
Our MethodPedestrian as contextScanning window detector
Sliding window Pedestrian context Our method
Smal
l obj
ect
Bac
kgro
und
clut
ter
44
Dataset and Experiment Setup
• Object detection;• Pose estimation;• Activity classification.
Tasks:
[Gupta et al, 2009]
Cricket defensive shot
Cricket bowling
Croquet shot
Tennis forehand
Tennis serve
Volleyball smash
Sport data set: 6 classes180 training & 120 testing images
45
Human Pose Estimation Results
Method Torso Upper Leg Lower Leg Upper Arm Lower Arm Head
Ramanan, 2006 .52 .22 .22 .21 .28 .24 .28 .17 .14 .42
Andriluka et al, 2009 .50 .31 .30 .31 .27 .18 .19 .11 .11 .45
Our full model .66 .43 .39 .44 .34 .44 .40 .27 .29 .58
46
Human Pose Estimation Results
Method Torso Upper Leg Lower Leg Upper Arm Lower Arm Head
Ramanan, 2006 .52 .22 .22 .21 .28 .24 .28 .17 .14 .42
Andriluka et al, 2009 .50 .31 .30 .31 .27 .18 .19 .11 .11 .45
Our full model .66 .43 .39 .44 .34 .44 .40 .27 .29 .58
Andriluka et al, 2009
Our estimation result
Tennis serve model
Andriluka et al, 2009
Our estimation result
Volleyball smash model
47
Human Pose Estimation Results
Method Torso Upper Leg Lower Leg Upper Arm Lower Arm Head
Ramanan, 2006 .52 .22 .22 .21 .28 .24 .28 .17 .14 .42
Andriluka et al, 2009 .50 .31 .30 .31 .27 .18 .19 .11 .11 .45
Our full model .66 .43 .39 .44 .34 .44 .40 .27 .29 .58
One pose per class .63 .40 .36 .41 .31 .38 .35 .21 .23 .52
Estimation result
Estimation result
Estimation result
Estimation result
48
Dataset and Experiment Setup
• Object detection;• Pose estimation;• Activity classification.
Tasks:
[Gupta et al, 2009]
Cricket defensive shot
Cricket bowling
Croquet shot
Tennis forehand
Tennis serve
Volleyball smash
Sport data set: 6 classes180 training & 120 testing images
Activity Classification Results
49
Gupta et al, 2009
Our model
Bag-of-Words
83.3%
Cla
ssifi
catio
n ac
cura
cy 78.9%
52.5%
0.9
0.8
0.7
0.6
0.5
No scene No scene informationinformation Scene is Scene is
critical!!critical!! Cricket shot
Tennis forehand
Bag-of-wordsSIFT+SVM
Gupta et al, 2009
Our model
50
ConclusionHuman-Object Interaction
Next Steps
Vs.
• Pose estimation & Object detection on PPMI images.
• Modeling multiple objects and humans.
Grouplet representation
Mutual context model
Acknowledgment• Stanford Vision Lab reviewers:
– Barry Chai (1985-2010)– Juan Carlos Niebles– Hao Su
• Silvio Savarese, U. Michigan• Anonymous reviewers
51
52
53
Human-Object Interaction
• Human pose estimation
Holistic image based classification
Detailed understanding and reasoning
• Object detection
How to beat this???
TorsoRight-armLeft-a
rm
Right-l
eg
Left-leg
Head
Tennis racket
image patchesimage patches
body partsbody parts
human posehuman pose
objectobject
Hier
arch
ical
repr
esen
tatio
nof
imag
es
human-object interaction activityhuman-object interaction activity
Mutual Context Model Representation
54
H
O
A
fO
f1 f2 fN
P1 P2 PN