weakly-supervised learning from videos and...
TRANSCRIPT
![Page 1: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/1.jpg)
Ivan Laptev
WILLOW, INRIA/ENS/CNRS, Paris
Weakly-supervised learning from
videos and scripts
ERC ALLEGRO workshop
INRIA Grenoble
July 23, 2014
Joint work with: Piotr Bojanowski – Rémi Lajugie – Francis Bach –
Jean Ponce – Cordelia Schmid – Josef Sivic
![Page 2: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/2.jpg)
![Page 3: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/3.jpg)
Where to get training data?
Shoot actions in the lab•
KTH dataset
Weizman dataset,…
- Limited variability
- Unrealistic
Manually annotate existing content•
HMDB, Olympic Sports,
UCF50, UCF101, …
- Very time-consuming
Use readily-available video scripts•
www.dailyscript.com, www.movie-page.com, www.weeklyscript.com
- Scripts are available for 1000’s of hours of movies and TV-series
- Scripts describe dynamic and static content of videos
![Page 4: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/4.jpg)
As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa...
5
![Page 5: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/5.jpg)
As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam.Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa...
6
![Page 6: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/6.jpg)
As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa...
7
![Page 7: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/7.jpg)
As the headwaiter takes them to a table they pass by the piano, and the woman looks at Sam. Sam, with a conscious effort, keeps his eyes on the keyboard as they go past. The headwaiter seats Ilsa...
8
![Page 8: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/8.jpg)
Scripts as weak supervision
Un
ce
rta
inty
24:25
24:51
Imprecise temporal localization•
No explicit spatial localization •
NLP problems, scripts ≠ training labels•
“… Will gets out of the Chevrolet. …”
“… Erin exits her new truck…”vs. Get-out-car
Challenges:
![Page 9: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/9.jpg)
Previous work
Sivic, Everingham, and Zisserman,
''Who are you?'' -- Learning Person Specific
Classifiers from Video, In CVPR 2009.
Buehler, Everingham, and Zisserman "Learning
sign language by watching TV (using weakly
aligned subtitles)", In CVPR 2009.
Duchenne, Laptev, Sivic, Bach and Ponce,
"Automatic Annotation of Human Actions in
Video", In ICCV 2009.
…wanted to know about the history of the trees
![Page 10: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/10.jpg)
Joint Learning of Actors and Actions
Rick? Rick?
Walks?Walks?
[Bojanowski et al. ICCV 2013]
Rick walks up behind Ilsa
![Page 11: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/11.jpg)
Rick
Walks
Rick walks up behind Ilsa
Joint Learning of Actors and Actions[Bojanowski et al. ICCV 2013]
![Page 12: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/12.jpg)
Formulation: Cost function
Rick
Ilsa
Sam
Actor labels Actor image features
Actor classifier
![Page 13: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/13.jpg)
Formulation: Cost function
Person p appears at
least once in clip N :
p = Rick
Weak supervision
from scripts:
![Page 14: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/14.jpg)
Action a appears at
least once in clip N :
a = Walk
Weak supervision
from scripts:
Formulation: Cost function
![Page 15: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/15.jpg)
Formulation: Cost function
Action a
appears
in clip N :
Weak supervision
from scripts:
Person p
appears in
clip N :
Person p
and
Action a
appear in
clip N :
![Page 16: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/16.jpg)
22
Image and video features
• Facial features
[Everingham’06]
• HOG descriptor on
normalized face image
• Dense Trajectory
features in person
bounding box
[Wang et al.,’11]
Face features
Action features
![Page 17: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/17.jpg)
23
Results for Person Labelling
American beauty (11 character names)Casablanca (17 character names)
![Page 18: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/18.jpg)
24
Results for Person + Action Labelling
Casablanca,
Walking
![Page 19: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/19.jpg)
Finding Actions and Actors in Movies
[Bojanowski, Bach, Laptev, Ponce, Sivic, Schmid, 2013]
![Page 20: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/20.jpg)
26
Action Learning with
Ordering Constraints[Bojanowski et al. ECCV 2014]
![Page 21: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/21.jpg)
27
Action Learning with
Ordering Constraints[Bojanowski et al. ECCV 2014]
![Page 22: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/22.jpg)
Cost Function
Weak supervision from ordering constraints on Z:
Action
label
Action
index
2
4
1
2
3
2
Video time intervals
![Page 23: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/23.jpg)
Cost Function
Weak supervision from ordering constraints on Z:
Action
label
Action
index
2
4
1
2
3
2
Video time intervals
![Page 24: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/24.jpg)
Cost Function
Weak supervision from ordering constraints on Z:
Action
label
Action
index
2
4
1
2
3
2
Video time intervals
![Page 25: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/25.jpg)
Is the optimization tractable?
• Path constraints are implicit
• Cannot use off-the-shelf solvers
• Frank-Wolfe optimization algorithm
![Page 26: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/26.jpg)
Results
937 video clips from 60 Hollywood movies•
16 action classes•
Each clip is annotated by a sequence of n actions (2≤n≤11)•
![Page 27: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/27.jpg)
![Page 28: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/28.jpg)
Summary
Reason about action sequences.•
Weakly-supervised learning
using time ordering constraints. •
Action learning with ordering constraints
Reason about individual people.•
Joint Learning of Actors and Actions
Weakly-supervised learning of
actions and names.•
![Page 29: Weakly-supervised learning from videos and scriptslear.inrialpes.fr/workshop/allegro/slides/laptev.pdf · Weakly-supervised learning from videos and scripts ERC ALLEGRO workshop INRIA](https://reader033.vdocuments.net/reader033/viewer/2022042209/5eac814ee6b6a47cb90740d2/html5/thumbnails/29.jpg)
Limitations / Future work
No spatial localization. Want to answer questions:•
- Who is doing what?
- Who interacts with whom?
Actions are modeled at short time intervals (15 frames).•
Sequences of action labels are given manually. Want to jointly
cluster videos and scripts. •
Action learning with ordering constraints
No temporal localization of actions within
person tracks.•
Joint Learning of Actors and Actions
Finding people in movies is still a big challenge.•
Extracting action labels from scripts is a major
(NLP+vision?) challenge.•