learning how-to knowledge from the webyukez/talks/learning_how_to... · 2020. 4. 25. · humans...
Learning How-To Knowledge from the Web
Yuke Zhu
IROS 2019
Advances in Artificial Intelligence
Visual Recognition, Machine Translation, Question Answering
The Unsung Hero: Web Data
Visual Recognition: ImageNet [Deng et al. 2009], 14 million web images annotated by AMT workers
Machine Translation: Google NMT [Wu et al. 2016], WMT En→Fr dataset with 36M sentence pairs
Question Answering: SQuAD QA Dataset [Rajpurkar et al. 2016], 100,000+ questions posed by crowdworkers on a set of Wikipedia articles
Traditional form of automation Intelligent robots in real world
The Unsung Hero: Web Data
?
What’s the role of web data in improving robot intelligence?
What knowledge do we need for robotics?
“To accelerate or to brake?”
What knowledge do we need for robotics?
Knowledge of “That-Is”
car: Heavy & Fast
bike: Slow
Declarative knowledge: Understanding the world
• Easy to articulate (conscious)
• Describes facts of the world
What knowledge do we need for robotics?
Declarative knowledge: Understanding the world
• Easy to articulate (conscious)
• Describes facts of the world
Procedural knowledge: Interacting with the world
• Describes how to perform tasks
• Hard to pinpoint (unconscious)
Knowledge of “How-To”
Robotics
Declarative Knowledge (“That-Is”): Understanding the World
Procedural Knowledge (“How-To”): Interacting with the World
Learning Declarative (“That-Is”) Knowledge from the Web
Understanding the world is the cornerstone of interacting with the world.
The Visual Genome Project
A large-scale visual knowledge base of
structured image concepts
Krishna, Zhu, Groth, Johnson, Hata, Kravitz, Chen, Kalantidis, Li, Shamma, Bernstein, and Fei-Fei, IJCV 2017
Visual Genome
Scene Graph: Objects + Attributes + Relationships
Objects: knob, bowl, bowl, drawer, holder, knife, counter
Attributes: large, openable, metal, graspable, black
Relationships: has, on, with, next to, on, in
Questions
1. Q: What’s the color of the counter? A: Black.
2. Q: How many drawers can you see? A: Two.
3. Q: What’s the material of the pots? A: Metal.
……
Region Descriptions
1. There is a green bowl on the black counter.
2. The cabinet door is closed.
3. Six knives are placed in the knife holder.
……
Visual Genome
108K Images
1.7M Questions 5.4M Region Descriptions
3.8M Objects
2.8M Attributes
2.3M Relationships
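The scene graph on this slide (objects, attributes, relationships) can be pictured as a small data structure. The following is a minimal sketch, not the Visual Genome data format or API: the `SceneGraph` class and its method names are hypothetical, and only the facts stated in the slide's questions and region descriptions are encoded.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class SceneGraph:
    """Hypothetical container: objects with attribute sets, plus
    (subject, predicate, object) relationship triples."""
    attributes: Dict[str, Set[str]] = field(default_factory=dict)
    relationships: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_object(self, name: str, *attrs: str) -> None:
        # Create the object if needed and attach any attributes to it.
        self.attributes.setdefault(name, set()).update(attrs)

    def relate(self, subject: str, predicate: str, obj: str) -> None:
        self.relationships.append((subject, predicate, obj))

    def subjects(self, predicate: str, obj: str) -> List[str]:
        """Return every subject s such that (s, predicate, obj) is stored."""
        return [s for s, p, o in self.relationships if p == predicate and o == obj]


graph = SceneGraph()
graph.add_object("counter", "black")   # "What's the color of the counter? Black."
graph.add_object("bowl", "green")      # "a green bowl on the black counter"
graph.add_object("knife")
graph.add_object("holder")
graph.relate("bowl", "on", "counter")
graph.relate("knife", "in", "holder")  # "knives are placed in the knife holder"

print(graph.attributes["counter"])       # attribute lookup answers the color question
print(graph.subjects("on", "counter"))   # relationship lookup: what is on the counter?
```

Storing relationships as triples keeps queries such as "what is on the counter?" a simple filter, which is the kind of declarative fact the slide's question-answer pairs rely on.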
green onions sitting on the counter
a big white bowl
knives in a holder
wooden drawer is closed
two ceramic jars
Johnson et al. CVPR’16; Krishna, Zhu, et al. IJCV’17
Q: When was the picture taken? A: In the daytime.
Q: How many drawer knobs can you see? A: Seven.
Q: What color is the countertop? A: Black.
Zhu et al. CVPR’16, Zhu et al. CVPR’17
Xu, Zhu, Choy, Fei-Fei, CVPR’17
[Figure: scene graph with objects (knob, bowl, drawer, holder, knife, counter), attributes (large, openable, metal, graspable, black), and relationships (has, on, with, next to, in)]
Visual Genome learns Declarative Knowledge from the web.
We built a large-scale visual knowledge base via online crowdsourcing.
[Diagram: Declarative Knowledge (“That-Is”): Understanding the World; Procedural Knowledge (“How-To”): Interacting with the World]
Learning Procedural Knowledge needs new methodology.
It is hard to pinpoint and difficult to describe verbally.
[Diagram: Declarative Knowledge (“That-Is”): Understanding the World; Procedural Knowledge (“How-To”): Interacting with the World]
Learning Procedural (“How-To”) Knowledge from the Web
Three Key Questions
• What’s a good representation of procedural knowledge?
• How do we learn procedural knowledge from the web?
• How can robots take advantage of such knowledge?
Part I: Learning from Video Demonstrations
Part II: Learning from Crowd Teleoperation
Web videos supply massive knowledge of how to solve new tasks.
Source: The Verge, Pew Research Center
Humans learn efficiently from video demonstrations.
Meltzoff & Moore 1977; Meltzoff & Moore 1989, Meltzoff 1988
Imitation of Televised Models by Infants
Andrew N. Meltzoff, Child Development 1988
Babies (14-24 months) can learn by imitating
demonstrations from the TV screen.
[Figure: task hierarchy. Top level: prepare dinner. Sub-tasks: cook food, wash dishes. Primitive skills: grasp, wash, place, cut, boil]
Our Goal: Learning procedural knowledge as compositional task structures
from video demonstrations of a task
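A compositional task structure of this kind can be illustrated as a tree whose leaves are primitive skills. The sketch below is only an illustration: the slide names the tasks and primitives but not which primitive belongs to which sub-task, so the assignment in `SUBTASKS` is a hypothetical choice.

```python
from typing import Dict, List

# Hypothetical decomposition: the names come from the slide, but the
# assignment of primitives to sub-tasks is assumed for illustration.
SUBTASKS: Dict[str, List[str]] = {
    "prepare dinner": ["cook food", "wash dishes"],
    "cook food": ["grasp", "cut", "boil"],
    "wash dishes": ["grasp", "wash", "place"],
}


def expand(task: str) -> List[str]:
    """Recursively expand a task into its sequence of primitive skills.

    A task with no entry in SUBTASKS is treated as a primitive.
    """
    if task not in SUBTASKS:
        return [task]
    primitives: List[str] = []
    for sub in SUBTASKS[task]:
        primitives.extend(expand(sub))
    return primitives


print(expand("prepare dinner"))
# ['grasp', 'cut', 'boil', 'grasp', 'wash', 'place']
```

The same primitive (`grasp`) is reused under two sub-tasks, which is the point of a compositional representation: new tasks can be assembled from skills that already exist.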
One-Shot Imitation Learning from Videos
Xu*, Nair*, Zhu, Gao, Garg, Fei-Fei, Savarese. ICRA 2018
single video demonstration → meta-learning model → policy for the demonstrated task
One-Shot Imitation Learning from Videos
a lot of training videos (seen tasks) → meta-learning model → policy for the demonstrated task (supervision)
Xu*, Nair*, Zhu, Gao, Garg, Fei-Fei, Savarese. ICRA 2018
One-Shot Imitation Learning from Videos
single test video (unseen task) → meta-learning model → policy for the demonstrated task
Xu*, Nair*, Zhu, Gao, Garg, Fei-Fei, Savarese. ICRA 2018
One-Shot Imitation Learning from Videos
modeling demonstration as a flat sequence [Duan et al. 2017; Finn et al. 2017; Wang et al. 2017; Yu et al. 2018]
Xu*, Nair*, Zhu, Gao, Garg, Fei-Fei, Savarese. ICRA 2018
One-Shot Imitation Learning from Videos
modeling demonstration as a flat sequence [Duan et al. 2017; Finn et al. 2017; Wang et al. 2017; Yu et al. 2018]
modeling demonstration as a compositional structure
Xu*, Nair*, Zhu, Gao, Garg, Fei-Fei, Savarese. ICRA 2018
[Figure: NTP execution trace for block stacking. Each step takes an environment observation and an input task specification, and produces an output task specification, the next program (Pout) for the current program (Pin), optional arguments (Args), and an end-of-program flag (EOP).]
Steps shown in the figure:
Pin: block_stacking, Pout: pick_and_place, EOP: False
Pin: pick_and_place, Pout: pick, EOP: False
Pin: pick, Pout: move_to, EOP: False, Args: block_E → move_to(block_E)
Pin: pick, Pout: grip, EOP: True, Args: block_E → grip(block_E), return
Pin: pick_and_place, Pout: drop, EOP: True
Pin: place, Pout: move_to, EOP: False, Args: block_B → move_to(block_B)
Pin: place, Pout: release, EOP: True, Args: N/A → release(), return
… …
Program hierarchy: Block Stacking → Pick and Place → Pick, Place → Move_to(Blue), Grip(Blue), Move_to(Red), Release()
Neural Task Programming (NTP): Hierarchical Policy Learning as Neural Program Induction
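The calling convention in the figure (a current program Pin, a next program Pout, arguments, and an end-of-program flag) can be sketched as a small recursive interpreter. This is a toy illustration and not the NTP model: the hand-written `SCRIPT` table is an assumed stand-in for the learned network, the sub-program name `place` follows the hierarchy labels, and the top-level program stops after one pick-and-place for brevity.

```python
from typing import Dict, List, Tuple

# One step of a program: (next program Pout, arguments, end-of-program flag EOP).
Step = Tuple[str, Tuple[str, ...], bool]

# Assumption: a fixed lookup table replaces the neural network that would
# predict these steps from the observation and the task specification.
SCRIPT: Dict[str, List[Step]] = {
    "block_stacking": [("pick_and_place", ("block_E", "block_B"), True)],
    "pick_and_place": [("pick", ("block_E",), False), ("place", ("block_B",), True)],
    "pick": [("move_to", ("block_E",), False), ("grip", ("block_E",), True)],
    "place": [("move_to", ("block_B",), False), ("release", (), True)],
}

# Primitive programs stand for robot API calls; here they are only logged.
PRIMITIVES = {"move_to", "grip", "release"}


def run(program: str, log: List[str]) -> None:
    """Execute a program: recurse into sub-programs, log primitive calls,
    and return to the caller when the end-of-program flag is set."""
    for p_out, args, eop in SCRIPT[program]:
        if p_out in PRIMITIVES:
            log.append(f"{p_out}({', '.join(args)})")
        else:
            run(p_out, log)
        if eop:
            return


trace: List[str] = []
run("block_stacking", trace)
print(trace)
# ['move_to(block_E)', 'grip(block_E)', 'move_to(block_B)', 'release()']
```

The hierarchy lives entirely in which program calls which; only the leaves touch the robot, which is why the motor skills can be treated as fixed API calls.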
One-Shot Imitation Learning from Videos: Neural Task Programming (NTP)
[Figure: NTP as the meta-learning model that maps a demonstration to a policy. Inputs: task demonstration and state observation. An end-to-end neural network (LSTM) takes the current program, e.g. pick_place(blue, green), and outputs the next program, e.g. pick(blue), which is executed through the Robot API in the environment. Task-conditional output policies are shown for Task 1 and Task 2 with their final states.]
Xu*, Nair*, Zhu, Gao, Garg, Fei-Fei, Savarese. ICRA 2018
One-Shot Imitation Learning from Videos: Neural Task Programming (NTP)
[Figure: NTP architecture. The demonstration and the observation feed an end-to-end neural network (LSTM) that maps the current program, e.g. pick_place(blue, green), to the next program, e.g. pick(blue).]
Training supervision: {(video demonstration, hierarchical program trace)}
Xu*, Nair*, Zhu, Gao, Garg, Fei-Fei, Savarese. ICRA 2018
One-Shot Imitation Learning from Videos: Neural Task Programming (NTP)
Qualitative: Object Sorting [Video panels: Demo, Autonomous Execution; 8x]
Quantitative (the higher the better): [Plot: unseen task success rate (0 to 1) vs. number of training tasks (50, 100, 400, 1000); series: Flat, NTP (Ours); two entries marked N/A]
Better generalization with less training data than flat baselines
Xu*, Nair*, Zhu, Gao, Garg, Fei-Fei, Savarese. ICRA 2018
One-Shot Imitation Learning from Videos: Neural Task Programming (NTP)
demonstration → meta-learning model (end-to-end neural network, LSTM, with compositional model prior) → policy
Xu*, Nair*, Zhu, Gao, Garg, Fei-Fei, Savarese. ICRA 2018
One-Shot Imitation Learning from Videos: Neural Task Graphs (NTG)
[Figure: NTG architecture. The Task Graph Generator builds a Neural Task Graph from the task demonstration; the Task Graph Executor takes the graph and the current observation and acts through the Robot API. Together they form the meta-learning model that maps a demonstration to a policy.]
Huang*, Nair*, Xu*, Zhu, Garg, Fei-Fei, Savarese, Niebles. CVPR 2019
One-Shot Imitation Learning from Videos: Neural Task Graphs (NTG)
Task Graph: Nodes = States (infinite valid states), Edges = Actions
Example edges: pick(red), place(red), pick(green), place(green), pick(orange)
Conjugate Task Graph: Nodes = Actions (finite), Edges = States (Preconditions)
Example nodes: place(green), pick(green), pick(orange), pick(red), place(red)
Huang*, Nair*, Xu*, Zhu, Garg, Fei-Fei, Savarese, Niebles. CVPR 2019
One-Shot Imitation Learning from Videos: Neural Task Graphs (NTG)
[Figure: conjugate task graph with nodes place(green), pick(green), pick(orange), pick(red), place(red)]
current observation → node localizer → selected node → edge classifier → selected edge → next action: pick(red)
Huang*, Nair*, Xu*, Zhu, Garg, Fei-Fei, Savarese, Niebles. CVPR 2019
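The execution step on this slide (localize the current node, then pick an outgoing edge whose precondition holds) can be sketched over a conjugate task graph, where nodes are actions and edges carry state preconditions. This is a hand-written illustration, not the NTG model: the edges, the precondition strings, the retry self-loop, and the use of the last action as a stand-in for the learned node localizer are all assumptions.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

# Conjugate task graph: node = action, outgoing edge = (next action, precondition).
# Hypothetical edges and preconditions, chosen only for illustration.
Graph = Dict[str, List[Tuple[str, FrozenSet[str]]]]

GRAPH: Graph = {
    "start": [("pick(red)", frozenset())],
    "pick(red)": [
        ("place(red)", frozenset({"holding red"})),
        ("pick(red)", frozenset()),  # retry edge if the grasp failed
    ],
    "place(red)": [("pick(green)", frozenset({"red placed"}))],
    "pick(green)": [("place(green)", frozenset({"holding green"}))],
    "place(green)": [],
}


def localize_node(last_action: Optional[str]) -> str:
    """Stand-in for the node localizer: use the last executed action."""
    return last_action if last_action in GRAPH else "start"


def classify_edge(node: str, observation: Set[str]) -> Optional[str]:
    """Stand-in for the edge classifier: take the first outgoing edge
    whose precondition is satisfied by the observation."""
    for action, precondition in GRAPH[node]:
        if precondition <= observation:
            return action
    return None


def next_action(last_action: Optional[str], observation: Set[str]) -> Optional[str]:
    return classify_edge(localize_node(last_action), observation)


print(next_action(None, set()))                    # pick(red)
print(next_action("pick(red)", {"holding red"}))   # place(red)
print(next_action("pick(red)", set()))             # pick(red), retry after a failed grasp
```

Because the next action depends on the observed state rather than on a fixed step counter, a failed grasp leads back to the same action instead of derailing the rest of the sequence.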
One-Shot Imitation Learning from Videos: Neural Task Graphs (NTG)
Training supervision: {(video demonstration, action sequence)}
[Figure: NTG architecture. Task Graph Generator → Neural Task Graph → Task Graph Executor, mapping a demonstration and the current observation to a policy.]
Huang*, Nair*, Xu*, Zhu, Garg, Fei-Fei, Savarese, Niebles. CVPR 2019
One-Shot Imitation Learning from Videos: Neural Task Graphs (NTG)
Qualitative: Recovery from Intermediate Failures [Video: Autonomous Execution; 20x]
Quantitative (the higher the better): [Plot: unseen task success rate (0 to 1) vs. number of training tasks (50, 100, 400, 1000); series: Flat, NTP (Ours), NTG (Ours); two entries marked N/A]
Weaker supervision, less training data, and better generalization
Huang*, Nair*, Xu*, Zhu, Garg, Fei-Fei, Savarese, Niebles. CVPR 2019
One-Shot Imitation Learning from Videos: Neural Task Graphs (NTG)
Huang*, Nair*, Xu*, Zhu, Garg, Fei-Fei, Savarese, Niebles. CVPR 2019
[Figure: video frames with the predicted path and the predicted graph over the steps Orienting Needle, Positioning Needle, Pushing Needle through Tissue, Pulling Suture with Left Hand]
Applying NTG to the real-world surgical video dataset JIGSAWS
Next Goal: Learning task knowledge from web videos
Summary - Part I
Extracting how-to knowledge about the compositional task
structure of complex tasks from video demonstrations
Meta-learning models with compositional priors generalize
better than black-box models
task graph vs. black box
NTP and NTG learn how-to knowledge in the form of compositional task
structures while motor skills are abstracted away.
[Figure: task hierarchy. prepare dinner → cook food, wash dishes → grasp, wash, place, cut, boil. The lowest-level skills are modeled as pre-defined “API calls”.]
NTP and NTG learn how-to knowledge in the form of compositional task
structures while motor skills are abstracted away.
How can we collect data
for learning motor skills
from the web?
Manually defining motor skills is intractable.
We need to learn from data.
Part I: Learning from Video Demonstrations
Part II: Learning from Crowd Teleoperation
Data is critical for learning robot motor skills.
Imitation Learning: Vecerik et al. 2017: 100 demos; Finn et al. 2017: 30 demos; Rajeswaran et al. 2018: 25 demos; Zhu et al. 2018: 30 demos
Large demonstration datasets are hard to collect.
Humans need to demonstrate, not label.
Reinforcement & Self-Supervised Learning: Levine et al. 2016; Pinto et al. 2016; Kalashnikov et al. 2018; Fang et al. 2018
Data can be low quality due to the lack of an expert.
Data is critical for learning robot skills.
How to scale up high-quality human supervision for robotics?
Provide a natural way for anyone to provide demonstrations
RoboTurk in action
+
roboturk.stanford.edu Mandlekar, Zhu, Garg, Booher, Spero, Tung, Gao, Emmons, Gupta, Orbay, Savarese, Fei-Fei, CoRL 2018
Web-based Crowd Teleoperation with RoboTurk
RoboTurk: Crowdsourcing Platform for Large-Scale Demonstration Collection
real-time streaming
from remote robot
6-DoF
controller
[Figure: system diagram with users, a cloud server, and remote robots]
roboturk.stanford.edu Mandlekar, Zhu, Garg, Booher, Spero, Tung, Gao, Emmons, Gupta, Orbay, Savarese, Fei-Fei, CoRL 2018
Web-based Crowd Teleoperation with RoboTurk
[Figure: system diagram with users, a cloud server, and remote robots; the user interface is shown in the web browser view]
roboturk.stanford.edu Mandlekar, Zhu, Garg, Booher, Spero, Tung, Gao, Emmons, Gupta, Orbay, Savarese, Fei-Fei, CoRL 2018
Web-based Crowd Teleoperation with RoboTurk
RoboTurk Pilot Dataset
137.5 hours of demonstrations
22 hours of total platform usage
2218 successful demonstrations
surreal.stanford.edu Zhu*, Fan*, Zhu, Liu, Zeng, Gupta, Creus-Costa, Savarese, Fei-Fei, CoRL 2018
teleoperated demonstrations
roboturk.stanford.edu Mandlekar, Zhu, Garg, Booher, Spero, Tung, Gao, Emmons, Gupta, Orbay, Savarese, Fei-Fei, CoRL 2018
Web-based Crowd Teleoperation with RoboTurk
Bin Picking (Can) Nut Assembly (Round)
Policy Learning from Teleoperated Demonstrations
Learning from the Masses
Reinforcement and Imitation Learning: Data
RoboTurk Pilot Dataset
137.5 hours of demonstrations
22 hours of total platform usage
2218 successful demonstrations
[Plot: task performance (reward, 0 to 1000) vs. number of demonstrations (0, 1, 10, 100, 1000) for reinforcement and imitation learning; labels: Pure RL, assembly, pick & place]
surreal.stanford.edu Zhu*, Fan*, Zhu, Liu, Zeng, Gupta, Creus-Costa, Savarese, Fei-Fei, CoRL 2018
roboturk.stanford.edu Mandlekar, Zhu, Garg, Booher, Spero, Tung, Gao, Emmons, Gupta, Orbay, Savarese, Fei-Fei, CoRL 2018
Webcam
Kinect
Robot
RoboTurk on Physical Robots
Real-time
Scalable Data Collection
![Page 58: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore](https://reader036.vdocuments.net/reader036/viewer/2022071016/5fcf5c2da406951c9c35a0d8/html5/thumbnails/58.jpg)
Dataset Size Comparison (dataset size in hours)
JIGSAWS (Gao et al. 2014): 1.66
Deep Imitation (Zhang et al. 2018): 2.35
DAML (Yu, Finn et al. 2018): 4.08
MIME (Sharma et al. 2018): 13.7
Mandlekar, Booher, Spero, Tung, Gupta, Zhu, Garg, Savarese, Fei-Fei, IROS 2019
![Page 59: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore](https://reader036.vdocuments.net/reader036/viewer/2022071016/5fcf5c2da406951c9c35a0d8/html5/thumbnails/59.jpg)
Dataset Size Comparison (dataset size in hours)
JIGSAWS (Gao et al. 2014): 1.66
Deep Imitation (Zhang et al. 2018): 2.35
DAML (Yu, Finn et al. 2018): 4.08
MIME (Sharma et al. 2018): 13.7
Ours: 111.25
Chart annotation: 10x
Mandlekar, Booher, Spero, Tung, Gupta, Zhu, Garg, Savarese, Fei-Fei, IROS 2019
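The comparison on this slide can be made concrete with a little arithmetic. The sketch below uses only the hour counts shown on the chart and computes how many times larger the dataset labeled "Ours" is than each prior dataset. The dictionary and function names are illustrative and do not come from the talk.

```python
# Dataset sizes in hours, as read off the slide's bar chart.
dataset_hours = {
    "JIGSAWS": 1.66,
    "Deep Imitation": 2.35,
    "DAML": 4.08,
    "MIME": 13.7,
    "Ours": 111.25,
}


def size_ratios(hours, reference="Ours"):
    """Return how many times larger the reference dataset is than each other one."""
    ref = hours[reference]
    return {name: ref / h for name, h in hours.items() if name != reference}


ratios = size_ratios(dataset_hours)

# Print the ratios from the closest prior dataset to the smallest one.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"Ours vs {name}: {ratio:.1f}x")
```

Sorting by ratio puts the largest prior dataset (MIME) first, which is the comparison the chart annotation refers to.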
![Page 60: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore](https://reader036.vdocuments.net/reader036/viewer/2022071016/5fcf5c2da406951c9c35a0d8/html5/thumbnails/60.jpg)
RoboTurk for
everyone, everywhere
![Page 61: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore](https://reader036.vdocuments.net/reader036/viewer/2022071016/5fcf5c2da406951c9c35a0d8/html5/thumbnails/61.jpg)
Summary - Part II
RoboTurk scales up demonstration collection with teleoperated crowdsourcing from web users.
Large-scale crowdsourced data enables us to train more effective motor skill learning algorithms.
![Page 62: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore](https://reader036.vdocuments.net/reader036/viewer/2022071016/5fcf5c2da406951c9c35a0d8/html5/thumbnails/62.jpg)
Learn More about RoboTurk?
Come to our IROS Presentation
RoboTurk: Human Reasoning and Dexterity for Large-Scale Dataset Creation
Tuesday 15:45-16:00, Award Session II: Paper TuBT4.5
![Page 63: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore](https://reader036.vdocuments.net/reader036/viewer/2022071016/5fcf5c2da406951c9c35a0d8/html5/thumbnails/63.jpg)
Part I: Learning from Web Videos
Part II: Learning from Crowd Teleoperation
Extracting compositional task structures from video data
Crowdsourcing teleoperated demonstrations for skill learning
![Page 64: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore](https://reader036.vdocuments.net/reader036/viewer/2022071016/5fcf5c2da406951c9c35a0d8/html5/thumbnails/64.jpg)
Conclusions
What’s a good representation of procedural knowledge?
How do we learn procedural knowledge from the web?
How can robots take advantage of such knowledge?
High-level task structures & low-level motor skills
Large-scale web videos & crowd teleoperation from online users
Machine learning algorithms, e.g., meta-learning & imitation learning
![Page 65: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore](https://reader036.vdocuments.net/reader036/viewer/2022071016/5fcf5c2da406951c9c35a0d8/html5/thumbnails/65.jpg)
Conclusions
Open Question:
How to integrate procedural knowledge and declarative knowledge into a unified knowledge ontology for building intelligent algorithms in robotics?
![Page 66: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore](https://reader036.vdocuments.net/reader036/viewer/2022071016/5fcf5c2da406951c9c35a0d8/html5/thumbnails/66.jpg)
Acknowledgements
Fei-Fei Li, Silvio Savarese, Animesh Garg, Danfei Xu, De-An Huang, Ajay Mandlekar
![Page 67: Learning How-To Knowledge from the Webyukez/talks/learning_how_to... · 2020. 4. 25. · Humans learn efficiently from video demonstrations. Meltzoff & Moore 1977; Meltzoff & Moore](https://reader036.vdocuments.net/reader036/viewer/2022071016/5fcf5c2da406951c9c35a0d8/html5/thumbnails/67.jpg)
Robotics
Procedural Knowledge (“How-To”)
Declarative Knowledge (“That-Is”)
Understanding the World
Interacting with the World
http://ai.stanford.edu/~yukez/