learning interactions and relationships between movie ... · d. dataset analysis fig.11and...
TRANSCRIPT
Learning Interactions and Relationships between Movie CharactersSUPPLEMENTARY MATERIAL
Anna Kukleva1,2 Makarand Tapaswi1 Ivan Laptev1
1Inria Paris, France 2Max-Planck-Institute for Informatics, Saarbrucken, [email protected], {makarand.tapaswi,ivan.laptev}@inria.fr
We provide additional analysis of our task and modelsincluding confusion matrices, prediction examples for allour models, skewed distribution of number of samples forour classes, and diagrams depicting how we grouped theinteraction and relationship classes.
A. Impact of ModalitiesWe analyze the impact of modalities by presenting quali-
tative examples where using multiple modalities help predictthe correct interactions. Qualitative results presented here,refer to the quantitative performance indicated in Table 1 ofthe main paper. Fig. 1 shows that using dialog can help toimprove predictions, Fig. 2 demonstrates the necessity ofvisual clip information and highlights that the two modalitiesare complementary. Finally, Fig. 3 shows that focusing ontracks (visual representations in which the two charactersappear) provides further improvements to our model. Fur-thermore, Fig. 4 shows top-5 interaction classes that benefitmost from using additional modalities.Analyzing modalities. We also analyze the two modelstrained on only visual or only dialog cues (first two rowsof Table 1). Some interactions can be recognized only withvisual (v) features: rides 63% (v) / 0% (d), walks 29% (v) /0% (d), runs 26% (v) / 0% (d); while others only with dialog(d) cues: apologizes 0% (v) / 66% (d), compliments 0% (v) /26% (d), agrees 0% (v) / 25% (d).
Interactions that achieve non-zero accuracy with bothmodalities are: hits 64% (v) / 5% (d), greets 12% (v) / 57%(d), explains 25% (v) / 51% (d).
Additionally, the top-5 predicted classes for visual cuesare asks 77%, hits 64%, rides 63%, watches 49%, talks onphone 41%; and dialog cues are asks 75%, apologizes 66%,greets 57%, explains 51%, watches 30%. As asks is the mostcommon class, and watches is the second most common,these interactions work well with both modalities.
B. Joint Interaction and RelationshipsConfusion matrices. Fig. 5 shows the confusion matrix inthe top-15 most commonly occurring interactions on the
validation and test sets. We see that multiple dialog basedinteractions (e.g. talks to, informs, and explains) are oftenconfused. We also present confusion matrices for relation-ships in Fig. 6. A large part of the confusion is due to lackof sufficient data to model the tail of relationship classes.Qualitative examples. Related to Table 2 of the main paper,Fig. 7 shows some examples where interaction predictionsimprove by jointly learning to model both interactions andrelationships. Similarly, Fig. 8 shows how relationship clas-sification benefits from our multi-task training setup.
C. Examples for Who is InteractingEmpirical evaluation shows that the knowledge about the
relationship is important for localizing the pair of characters(Table 6 of the main paper). In Fig. 9, we illustrate anexample where the dad walks into a room, sees his daughterwith someone, and asks questions (see figure caption fordetails).
Finally, in Fig. 10, we show an example where the modelis able to correctly predict all components (interaction class,relationship type and the pair of tracks) in a complex situa-tion with more than 2 people appearing in the clip.
D. Dataset AnalysisFig. 11 and Fig. 12 show normalized distributions for the
number of samples in each class for train, validation andtest sets of interactions and relationships respectively. Ascan be seen the most common classes appear many moretimes than the others. Data from a complete movie belongsto one of the three train/val/test sets to avoid model bias onthe plot and characters behaviour. Notably, this means thatthe relative ratios between number of samples per class arealso not necessarily consistent making the dataset and taskeven more challenging.
In the main paper, we described our approach to groupover 300 interactions into 101 classes, and over 100 rela-tionships into 15. We use radial tree diagrams to depict thegroupings for interaction and relationship labels, visualizedin Fig. 13 and 14 respectively.
1
Visual
- Focker, I'm not gonna tell you again! Jinx cannot flush the toilet. He's a cat, for chrissakes! The animal doesn't even have thumbs, Focker.
watches 0.71asks 0.70
informs 0.66orders 0.64
explains 0.64
explains 0.78talks to 0.72reminds 0.68informs 0.67shows 0.64
Textual
Visual
Input Modalities
Figure 1: Improvement in prediction of interactions by including textual modality in addition to visual. The model learns to recognizesubtle differences between interactions based on dialog. The example is from Meet the Parents (2000).
- What a bruiser. - The Cardinals just refuse to go quietly into the desert night. I'm
just trying to be a little poetic.- A tough hit on Tidwell. I'd have made a touchdown. That's his
sixth catch tonight. A first down at the 19 yard line.- Another savage hit on Tidwell. They're really working Tidwell.
explains 0.73hears 0.65
watches 0.65informs 0.65argues 0.64
watches 0.75explains 0.67
hears 0.67talks to 0.62plays 0.58
Textual
Textual
Visual
Input Modalities
Figure 2: Improvement in prediction of interactions by including visual modality in addition to textual. The top-5 predicted interactionsreflect the impact of visual input rather than relying only on the dialog. The example is from Jerry Maguire (1996).
Visual- Hey, Jackie.- Wanna play, Pops?
Let's play.
informs 0.70suggests 0.69
greets 0.69watches 0.67
asks 0.67
watches 0.69greets 0.67
suggests 0.67informs 0.66asks 0.64
Textual
Tracks
Visual Textual
Input Modalities
Figure 3: Improvement in prediction of interactions by including the pair of tracks modality in addition to visual and textual cues. Themodel can concentrate its attention on visual cues for the two people of interest instead of looking only at the clip level. The example is fromMeet the Parents (2000).
2
0 5 10 15 20 25 30# improved samples
suggests
asks
watches
informs
explains
(vis) -> (vis + txt)
0 5 10# improved samples
runs
informs
explains
kisses
watches
(txt) -> (vis + txt)
0 5 10# improved samples
talksto
explains
hears
informs
watches
(vis + txt) -> (vis + txt + tr)
Figure 4: Each plot shows 5 interaction classes that have the most number of improved instances by including an additional modality.Specifically, the x-axis denotes the number of samples in which interaction prediction performance improves. Left: From only visual cliprepresentation to visual and textual. As expected, using dialogues in addition to video frames boosts performance for classes that rely ondialog e.g. explains, informs. Middle: From only textual clip representation to visual and textual. Visual clip representations influenceclasses as kisses, runs during which people usually do not talk (dialog modality filled with zeros). Right: Finally, including all threemodalities visual, textual, tracks improves performance over using visual and textual. Track pair localization improves recognition of classestypically used in group activities.
Figure 5: Confusion matrices for top-15 most common interactions for validation set (left) and test set (right). Model corresponds to the “Int.only” performance of 26.1% shown in Table 2. Numbers on the right axis indicate number of samples for each class.
stran
ger
frien
dlov
erco
lleag
uesib
ling
acqu
aintan
cepa
rent
worke
rch
ildbo
sskb
rcu
stomer
manag
erex
-love
ren
emy
Prediction
strangerfriendlover
colleaguesibling
acquaintanceparentworker
childboss
kbrcustomermanagerex-lover
enemy
Grou
nd T
ruth
VAL
2813151617182021242847607196
stran
ger
frien
dco
lleag
ueac
quain
tance
lover
paren
two
rker
child
manag
ercu
stomer
boss
kbr
enem
ysib
ling
ex-lo
ver
Prediction
strangerfriend
colleagueacquaintance
loverparentworker
childmanagercustomer
bosskbr
enemysibling
ex-lover
Grou
nd T
ruth
TEST
1112022242424262933505160120211
Figure 6: Confusion matrices for all relationships for validation set (left) and test set (right). Model corresponds to the “Rel. only”performance of 26.8% shown in Table 2. Numbers on the right axis indicate number of samples for each class.
3
Interactions
only
Model
watches 0.64suggests 0.61explains 0.61
asks 0.58informs 0.54
JointInts Rels
asks 0.64watches 0.62assures 0.59explains 0.57
suggests 0.51
- If anybody else wants to come with me, this is a chance for something real, and fun, and inspiring in this godforsaken business,and we will do it together. Who's coming with me? Who's coming with me? Who's coming with me besides Flipper here? This is embarrassing.
Ints Prediction
colleaguesrepresented as collection of N clips with the same two people
Interactions
only
Model
watches 0.70leaves 0.66greets 0.65walks 0.59stays 0.59
JointInts Rels
leaves 0.62greets 0.61
watches 0.58stays 0.50walks 0.50
no dialog
Ints Prediction
friends
represented as collection of N clips with the same two people
Figure 7: We show examples where training to predict interactions and relationships jointly helps improve the performance of interactions.Top: In the example from Jerry Maguire (1996), the joint model looks at several clips between Dorothy and Jerry and is able to reasonabout them being colleagues. This in turn helps refine the interaction prediction to asks. Bottom: In the example from Four Weddings andFuneral (1994), the model observes several clips from the entire movie where Charles and Tom are friends, and reasons that the interactionshould be leave (which contains the leave together class). Note that there is no dialog for this clip.
4
Relationshipsonly
Model
stranger 0.58lover 0.49parent 0.48
Rels Prediction
complains kisses informsRels Ints
Jointlover 0.63
stranger 0.62parent 0.55
Relationships
Rels Ints
only
Joint
Model
stranger 0.59worker 0.59
boss 0.58
Rels Prediction
customer 0.66manager 0.62stranger 0.60
explains explains
Figure 8: We show examples where training to predict interactions and relationships jointly helps improve the performance of relationships.Top: In the movie Four Weddings and Funeral (1994), clips between Bernard and Lydia exhibit a variety of interactions (e.g. kisses) that aremore typical between lovers than strangers. Bottom: In the movie The Firm (1993), Frank and Mitch meet only once for a consultation, andare involved in two clips with the same interaction label explains. Our model is able to reason about this interaction, and it encourages therelationship to be customer and manager, instead of stranger.
5
Clip
Possible track pairs
- Not in my own den. What are you two doing in here?
(Pam, None) (None, Pam)
(Greg, Pam) (Pam, Greg)
(Greg, Jack)
(Pam, Jack)
(Jack, Greg)
(Jack, Pam)
(Jack, None) (None, Jack)
(Greg, None) (None, Greg)
PredictGiven
Who?asks
parent
(Jack, Pam)
Figure 9: We illustrate an example from the movie Meet the Parents (2000) where a father (Jack) walks into a room while his daughter (Pam)and the guy (Greg) are kissing. Our goal is to predict the two characters when the interaction and relationship labels are provided. In thisparticular example, we see that Dad asks Pam a question (What are you two doing in here?). Note that their relationship is encoded as (Pam→ child → Jack), or equivalently, (Jack → parent → Pam). When searching for the pair of characters with a given interaction asks andrelationship as parent, our model is able to focus on the question at the clip level as it is asked by Jack in the interaction, and correctlypredict (Jack, Pam) as the ordered character pair. Note that our model not only considers all possible directed track pairs (e.g. (Greg, Pam)and (Pam, Greg)) between characters, but also singleton tracks (e.g. (Jack, None)) to deal with situations when a person is absent due tofailure in tracking or does not appear in the scene.
6
- Oh, hi. Hi, Jess.- Uh, this is my work friend, David.- David is an accountant.- David, this is Jessica, my babysitter.- Uh So, you know, everything looks
great.- See you at work.- Yeah, see you at work.
Possible track pairs
Clip
(Jess, David) (David, Jess)
(Jess, Emily)
(Emily, David)
(Emily, Jess)
(David, Emily)
(Jess, None) (None, Jess)
(David, None) (None, David)
(Emily, None) (None, Emily)
Emily Jess
Joint prediction
introduces
boss
Figure 10: We present an example where our model is able to correctly and jointly predict all three components: track pair, interaction classand relationship type for the clip obtained from the movie Crazy, Stupid, Love (2011). This clip contains three characters which leads to 12possible track pairs (including singletons to deal with situations when a person is absent due to failure in tracking or does not appear in thescene). The model is able to correctly predict the two characters, their order, interaction and relationship. In this case, Emily introducesDavid to Jess. Jess is also her hired babysitter, and thus their relationship is – Emily is boss of Jess.
7
asks
watc
hes
expl
ains
info
rms
talk
s to
sugg
ests
orde
rsgr
eets
hear
sta
lks a
bout
answ
ers
leav
esco
mpl
imen
tsst
ays
reas
sure
swa
lks
kiss
esye
llsar
gues
agre
eshu
gsta
lks o
n th
e ph
one
disa
gree
s with
give
ssh
ows
play
sea
ts w
ithhe
lps
runs hits sits
cong
ratu
late
sm
eets
teas
esrid
esad
mits
follo
wsap
olog
izes
intro
duce
sas
sure
swo
rks w
ithac
cuse
sjo
kes
laug
hsre
fuse
sco
mpl
ains
rem
inds
conv
ince
sca
llsig
nore
sru
ns fr
omen
cour
ages
appr
oach
esin
sults
hold
spu
shes
shoo
tsle
ads
disc
usse
sth
reat
ens
danc
es w
ithta
kes
reve
als
com
es h
ome
smile
s at
look
s for
rest
rain
swa
kes u
pan
noun
ces
brin
gsno
tices
read
slie
sth
inks
fishe
stri
esco
mm
its c
rime
wish
essa
ves
rem
embe
rsvi
sits
phot
ogra
phs
take
s car
e of
shus
hes
walk
s awa
y fro
mre
ceiv
es n
ews f
rom
open
shi
des f
rom
stat
esgo
ssip
spa
sses
by
poin
ts a
tth
rows
chec
kspu
tspu
llssin
gs to
send
s awa
yre
peat
sha
s sex
obey
s
Norm
alize
d #s
ampl
es
trainvaltest
Figure 11: Distribution of interaction labels in train/val/test sets. Sorted by descending order based on train set.
stran
ger
collea
gue
friend
lover
boss
enem
ywork
er
acqua
intan
cepa
rent
kbr
manag
erchi
ld
custom
ersib
ling
ex-lo
ver
Norm
alize
d #s
ampl
es trainvaltest
Figure 12: Distribution of relationship labels in train/val/test sets. Sorted by descending order based on train set.
8
INTERACTIONSleavesleaves
leave togethergets out of the house kisses
kissestries to kiss walks
walkswalks with
accompanieswalks to
walks down the hallwaywalks away from
walks away fromtakes off
follows
follows
hugs
hugs
puts arm around
cuddles
embraces
sits
sits
sits near
sits at table
sits together
passes by
passes by
passes in the street
comes home
comes home
tries to enter the room
arrives
meets
meets
bumps into
hits
hits
fights
slaps
attacks
punches
hurts
kicks
approaches
approaches
draws near to
stays
stays
stands near
stays near
stays with
stands on rooftop
waits for
spends time w
ith
waits for the m
edication
plays
plays
plays with
plays cards
plays poker
wakes up
wakes up
runs
runsruns to
runs with
rusheschases
runs after
runs from
runs from
runs away
rides
ridesrides w
ithdrives
picks up
rides together
moves van
hold
s
hold
s
hold
s in arm
shold
scatch
eshold
s han
d
visits
visits
dance
s with
dance
s with
push
es
push
es
touch
es
push
es a
way
push
es to
go to
the ro
om
shoots
shoots
poin
ts gun
poin
ts a g
un a
t
eats w
ith
eats w
ithdrin
ks with
feeds
has d
inner
take
sta
kes
gra
bs
buys fro
mta
kes
from
take
s aw
ay
opens
opens
opens
the d
oor
for
opens
door
bri
ngs
bri
ngs
carr
ies
serv
es
poin
ts a
t
poin
tsat
hid
es
from
hid
es
from
trie
s
trie
s
puts
puts
thro
ws
thro
ws
thro
ws
at
pulls
pulls
dra
gs
sends
away
sends
away
has
sex
has
sex
fishe
s
fishe
s
give
s
give
spa
ysgi
ves
a gi
ft
give
s m
oney
to
offer
s fo
od to
give
s ph
one
sells
to
expl
ains
expl
ains
expl
ains
to
inst
ruct
s
expl
ains
situ
atio
n
excu
ses
self
teac
hes
desc
ribes
answ
ers
answ
ers
resp
onds
repl
ies
orde
rs
orde
rsre
ques
ts
dem
ands
begs
orde
rs a
bout
wor
k
greet
s
gree
ts
welco
mes
shak
es h
ands
waves
at
says
goo
dbye
reass
ures
reass
ures
supports
comforts
conso
les
suggests
suggests
advises
warns
invites
offers
offers
to help
gives opinion
proposes
corrects
offers to
asks out
calls out to
compliments
compliments
praises
flatters
admires
sympathizes with
flirtsthanks
helps
helpsassists
encourages
encourages
apologizes
apologizes
reminds
reminds
convinces
convincespersuadesurgesinsiststries to convincecalls for
assuresassurespromisesswears
congratulatescongratulatesapplaudscheerscheers forcelebratesapplauds totoasts to
leads
leadsguidesdirects
agrees
agreesconfirms toconfirmsacceptsallowsgives permission
admits
admitsadmits toconfesses to
announces
announcesbreaks news to
declares love to
receives news from
receives news from
receives news about
wishes
wisheshopes for
saves
savesdefends
checks
checkschecks on
takes care of
takes care of
petstreats
asks
asksquestions
asks about
asks for help
asks about behavior
asks about situation
asks permission
asks for advice
asks about health
informs
informs
informs about w
ork
signals
reports
watches
watches
looks at
sees
stares at
watches the opera
watches tv
watch
talks to
talks to
speaks to
tellschats
chitchats
gives speech
converses with
saysw
hispers
talks abou
t
talks about
talks abou
t work
talks abou
t relationsh
ips
talks abou
t self
men
tions
hears
hears
listens to
show
s
show
s
joke
s
joke
s
calls
calls
ignore
s
ignore
savo
ids
tries to
avoid
intro
duce
s
intro
duce
sin
troduce
s self
works w
ith
works w
ith
reveals
reveals
confides in
opens u
pta
lks
on
the p
hone
talks o
n th
e p
hone
calls
on p
hone
hangs
up
laughs
laughs
laughs
at
laughs
wit
hgig
gle
s w
ith
smile
s at
smile
s at
reads
reads
reads
to
looks
for
looks
for
searc
hes
dis
cuss
es
dis
cuss
es
negoti
ate
s
dis
cuss
es
pro
ble
m
not
ices
noti
ces
reco
gniz
es
unders
tands
spot
s
rem
ember
s
rem
ember
s
know
s ab
out
shus
hes
shus
hes
calm
s
trie
s to
cal
m
sing
s to
sing
s to
thin
ks
thin
ks
thin
ks o
f
belie
ves
in
assu
mes
won
ders
stat
es
stat
es
pret
ends
pron
ounc
es
conc
lude
s
repe
ats
repe
ats
obey
s
obey
s
goss
ips
goss
ips
goss
ips with
phot
ogra
phs
phot
ogra
phs
phot
ogra
phs with
rest
rain
s
rest
rain
s
tries
to st
opstop
s
argues
argues
confro
ntsar
gues
argues w
ith
yells
yells
scre
amssc
olds
snaps a
t
disagrees w
ith
disagrees w
ithprotests
critic
izes
interrupts
dislikes
disbelieves
disapproves o
f
complains
complains
complains about
regrets
accuses
accusesblames
reprimands
reproaches
insults
insultsoffendsscoffs atdisrespects
teases
teasesmocksmakes fun of
threatens
threatens
refuses
refusesdeclinesrefuses helpdismissesrefuses to helpdenies
lies
lies
commits crime
commits crimekillssteals
Figure 13: Diagram depicting how we group 324 interaction classes (outer circle) into 101 (inner circle). Best seen on the screen.
9
RELATIONSHIPS
strangerstranger
friend
friend
colleague
colleague
lover
lover
pare
nt
pare
nt
boss
boss
siblin
g
siblin
g
kbr
knows by reputation
spousecollaborator
enemy
enemy
customer
customer
child
child
acquaintance
acquaintance
best friend
antagonist
empl
oyer
of
work
er
em
plo
yed b
y
would like to know
engaged
aunt/u
ncle
man
ager
teach
er
neighbor
operative system
business partnerco
usin
owner
moth
er-in
-law
men
tor
patientfiancee
roomm
ate
uncl
e
foster-son
ex-lover
ex-loverdivorced
siste
r/bro
ther-i
n-law
aunt
robber
ex-spouse
ex-neighbor
gra
ndpare
nt
super
ior
girlfriend
pare
nt-
in-law
fath
er-
in-law
mistress
supporters
grandchild
godfa
ther
land
lord
publ
ic offi
cial
appre
ntice
niece/nephew
law
yer supporter
couple
slave
vet
inte
rvie
wee
inte
rvie
wer
guard
ian
agent
classmate
family friend
godson
brother in
-law
cusomer
replacement
work
er
empl
oyer
customers
fian
ce
aid
e
heard about
spon
sor
alle
ged lo
ver
fan
host
students
fam
ily
classmate
s
ex-fiance
student
boyfrie
nd
psyc
hiat
rist
goddaughter
ex-girlfriend
supe
rviso
rtra
iner
hostage
babysitte
rdocto
rnanny
one n
ight sta
nd
distant c
ousin
instructor
ex-boyfriend
nursetenant
competitor
killer
step
-mot
her
close friend
Figure 14: Diagram depicting how we group 107 relationship classes (outer circle) into 15 (inner circle).
10