

Learning to Play with Intrinsically-Motivated Self-Aware Agents

Nick Haber (1,2,3)*, Damian Mrowca (4)*, Stephanie Wang (4), Li Fei-Fei (4), Daniel L. K. Yamins (1,4,5)

Department of Psychology (1), Pediatrics (2), Biomedical Data Science (3), Computer Science (4), and Wu Tsai Neurosciences Institute (5), Stanford, CA 94305

Introduction

Artificially intelligent robots cannot learn in unstructured environments without supervision. Babies, in contrast, learn in unstructured environments through intrinsically motivated, curious play.

Learning in a physically realistic embodied environment: agents move around and interact with objects.

Approach: Self-Model vs. World-Model

Curiosity-based learning mechanism
- The world model (blue) solves a dynamics prediction problem.
- The self-model (red) seeks to predict the world model's loss and is learned simultaneously.
- Actions are chosen to antagonize the world model, leading to novel and surprising events in the environment (black).
- This creates a virtuous cycle in which the agent chooses novel but predictable actions.
- Playful behavior emerges as the agent pushes the boundaries of what its world-model prediction system can achieve.
- As world-modeling capacity improves, what used to be novel becomes old hat, and the cycle repeats.
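To make this loop concrete, here is a minimal PyTorch sketch under simplifying assumptions; it is an illustration, not the authors' implementation. The world model below predicts forward dynamics on flat state vectors, whereas the poster's variants use inverse-dynamics (ID) or latent-future (LF) prediction on images; the self-model predicts a single-step loss rather than the full sequence L_{t+1}, ..., L_{t+T}; and all class and function names are invented for illustration.

```python
# Minimal, hypothetical sketch of the world-model / self-model curiosity loop.
import torch
import torch.nn as nn


class WorldModel(nn.Module):
    """Learns a dynamics task; its prediction error is what the agent tries to provoke."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, state_dim))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


class SelfModel(nn.Module):
    """Predicts the world model's loss for a candidate action taken in the current state."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1)).squeeze(-1)


def choose_action(self_model, state, candidate_actions):
    """Adversarial policy: pick the candidate whose predicted world-model loss is largest.

    state: 1-D tensor (state_dim,); candidate_actions: tensor of shape (K, action_dim).
    """
    with torch.no_grad():
        predicted_loss = self_model(state.expand(len(candidate_actions), -1),
                                    candidate_actions)
    return candidate_actions[predicted_loss.argmax()]


def curiosity_update(state, action, next_state, world_model, self_model, wm_opt, sm_opt):
    """One step on a batch of transitions: the world model fits the dynamics,
    and the self-model fits the world model's per-sample loss."""
    per_sample_wm_loss = ((world_model(state, action) - next_state) ** 2).mean(dim=-1)
    wm_loss = per_sample_wm_loss.mean()
    wm_opt.zero_grad()
    wm_loss.backward()
    wm_opt.step()

    sm_loss = nn.functional.mse_loss(self_model(state, action), per_sample_wm_loss.detach())
    sm_opt.zero_grad()
    sm_loss.backward()
    sm_opt.step()
    return wm_loss.item(), sm_loss.item()
```

In the interaction loop the agent repeatedly calls choose_action, executes the chosen action in the environment, observes the next state, and calls curiosity_update on the transition. The antagonism lives entirely in choose_action: once the world model predicts a family of transitions well, their predicted loss drops, the policy stops selecting them, and attention shifts to whatever is still surprising.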

[Architecture diagram, panels (a) and (b): the world model ω_φ receives states s_{t-2}, s_{t-1}, s_t, s_{t+1} from the environment together with ego-motion and object actions (α^ego, α^obj1, α^obj2) at each time step and is trained on a dynamics prediction loss; the self-model Λ_ψ takes the current state s_t and candidate actions and predicts the world model's future losses L_t, L_{t+1}, ..., L_{t+T}; the next action is chosen by an argmax over these predicted losses.]

Intrinsically-motivated self-aware architecture.

Emergent Behaviors

Emerging behaviors across different models. Different combinations of world-model tasks and policy mechanisms are compared. ID = Inverse Dynamics prediction; RW = Random World model; RP = Random Policy; SP = Self-aware Policy; LF = Latent Future prediction.

A sequence of behaviors emerges: ego motion prediction, object attention, navigation towards objects, improved object dynamics prediction, improvement on task transfers, and multi-object gathering.

[Figure: emergent behaviors over training steps (in thousands), comparing ID-SP-40, ID-SP-5, ID-RP, and LF-SP. Panel titles: ego motion learning, emergence of object attention, 1-object learning, and 2-object learning. Plotted quantities: training loss (1-object and 2-object loss), 1-object frequency, 2-object frequency, and agent-object distance.]

Planning

Navigation and planning behavior for 12 consecutive time steps. Red force vectors on the objects depict the predicted actions. Ego-motion self-prediction maps are drawn at the agent's position.

The agent starts off without seeing an object and turns around to explore for one. Once an object is in view, the agent approaches or turns towards the object to keep it in view.
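One way such an ego-motion self-prediction map could be computed, as a hedged sketch that reuses the hypothetical SelfModel from the earlier code and assumes a 2D ego-motion action (direction and speed), is to evaluate the self-model's predicted world-model loss over a grid of candidate ego motions at the current state:

```python
# Hypothetical sketch: an ego-motion self-prediction map.
# Assumes the SelfModel from the earlier sketch with 2D ego-motion actions.
import torch


def ego_motion_map(self_model, state, n_angles=16, n_speeds=8, max_speed=1.0):
    """Predicted world-model loss over a grid of candidate ego-motion actions.

    state: 1-D tensor (state_dim,). Returns a (n_angles, n_speeds) grid of losses.
    """
    angles = torch.linspace(0.0, 2.0 * torch.pi, n_angles)
    speeds = torch.linspace(0.0, max_speed, n_speeds)
    ang, spd = torch.meshgrid(angles, speeds, indexing="ij")            # (A, S) each
    actions = torch.stack([spd * torch.cos(ang), spd * torch.sin(ang)], dim=-1)
    flat_actions = actions.reshape(-1, 2)                               # (A*S, 2)
    with torch.no_grad():
        losses = self_model(state.expand(len(flat_actions), -1), flat_actions)
    return losses.reshape(n_angles, n_speeds)   # render as a heat map at the agent's position
```

High-loss regions of such a map mark ego motions the world model still finds hard to predict, which is where the self-aware policy steers.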


Task Transfers

Model comparison on task transfers including object presence (binary classification), localization (pixel-wise 2D centroid position), and recognition (16-way categorization). Linear estimators were trained on top of the output features of each model. ID = Inverse Dynamics; RW = Random World model; RP = Random Policy; SP = Self-aware Policy; LF = Latent Future.
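As a concrete illustration of this readout protocol, here is a hedged sketch rather than the authors' evaluation code: linear estimators are fit on frozen features, with synthetic stand-in data and a random-projection encode() in place of a trained world-model encoder.

```python
# Hypothetical sketch of the linear-readout transfer evaluation on frozen features.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(0)
PROJ = rng.standard_normal((64 * 64 * 3, 128))      # stand-in for a frozen encoder


def encode(images):
    """Frozen feature extractor (here a fixed random projection), returns (N, D)."""
    return images.reshape(len(images), -1) @ PROJ


# Synthetic placeholder data in lieu of rendered frames and their labels.
train_images, test_images = rng.random((200, 64, 64, 3)), rng.random((50, 64, 64, 3))
presence_train, presence_test = rng.integers(0, 2, 200), rng.integers(0, 2, 50)
centroids_train, centroids_test = rng.random((200, 2)) * 64, rng.random((50, 2)) * 64
category_train, category_test = rng.integers(0, 16, 200), rng.integers(0, 16, 50)

f_train, f_test = encode(train_images), encode(test_images)

# Object presence: binary classification with a linear readout (error in %).
presence_clf = LogisticRegression(max_iter=1000).fit(f_train, presence_train)
presence_err = 100 * (1 - presence_clf.score(f_test, presence_test))

# Localization: linear regression of the 2D pixel centroid (mean error in pixels).
loc_reg = Ridge().fit(f_train, centroids_train)
loc_err = np.abs(loc_reg.predict(f_test) - centroids_test).mean()

# Recognition: 16-way categorization with a linear readout (accuracy in %).
recog_clf = LogisticRegression(max_iter=1000).fit(f_train, category_train)
recog_acc = 100 * recog_clf.score(f_test, category_test)
```

Applying the same protocol to the features of each model variant (ID/LF world models with SP/RP policies, plus the RW baseline) means differences in readout performance reflect differences in the learned representations rather than in the readout itself.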

[Figure: Transfer to other Tasks. Panels report accuracy in % for Ego (Easy), Ego (Hard), Actions (Hard), Play Frequency, and Recognition; error in % for Object Presence; and error in pixels for Localization. ID-SP trained with 2 objects.]

Future Work

Next steps include:
- more photorealistic simulation
- more realistic agent embodiment
- curiosity towards the novel but learnable
- animate attention and theory of mind
- comparison to human developmental data

Physically and visually realistic simulation environment.

Acknowledgments: This work was supported by grants from the James S. McDonnell Foundation, Simons Foundation, and Sloan Foundation (DLKY), a Berry Foundation postdoctoral fellowship and Stanford Department of Biomedical Data Science NLM T-15 LM007033-35 (NH), ONR - MURI (Stanford Lead) N00014-16-1-2127 and ONR - MURI (UCLA Lead) 1015 G TA275 (LF).