
Page 1: Deep Learning and Reinforcement Learning

Deep Learning & Reinforcement Learning

Renārs Liepiņš
Lead Researcher, LUMII & LETA
[email protected]

At “Riga AI, Machine Learning and Bots”, February 16, 2017

Page 2: Deep Learning and Reinforcement Learning

Outline

• Current State

• Deep Learning

• Reinforcement Learning

Page 3: Deep Learning and Reinforcement Learning

Outline

• Current State

• Deep Learning

• Reinforcement Learning

Page 5: Deep Learning and Reinforcement Learning

“Machine learning is a core transformative way by which we are rethinking everything we are doing.” – Sundar Pichai (CEO, Google), 2015

Source

Page 8: Deep Learning and Reinforcement Learning

Why such optimism?

Page 9: Deep Learning and Reinforcement Learning

Artificial Intelligence

Computer systems able to perform tasks normally requiring human intelligence

Page 10: Deep Learning and Reinforcement Learning

[Chart: ImageNet classification error (%) by year, 2010–2016. Classic computer vision: roughly 29, 28, and 27.8. Deep learning: 16.4, 11.7, 6.7, 3.57, and 3.08 – eventually falling below the Human Level line.]

Page 11: Deep Learning and Reinforcement Learning
Page 12: Deep Learning and Reinforcement Learning

Human Level

Page 13: Deep Learning and Reinforcement Learning
Page 14: Deep Learning and Reinforcement Learning
Page 15: Deep Learning and Reinforcement Learning

Nice, but so what?

Page 16: Deep Learning and Reinforcement Learning

First Universal Learning Algorithm

Page 17: Deep Learning and Reinforcement Learning

Before Deep Learning

Features for machine learning (Andrew Ng):

• Images: Image → Vision features → Detection

• Audio: Audio → Audio features → Speaker ID

• Text: Text → Text features → Web search, …

Source

Page 18: Deep Learning and Reinforcement Learning

With Deep Learning

[Diagram (Andrew Ng): Image / Audio / Text fed directly into a neural network that produces the output (Detection, Speaker ID, Web search, …)]

Page 19: Deep Learning and Reinforcement Learning

Universal Learning Algorithm

[Diagram (Andrew Ng): “Neurons in the brain” – Deep Learning: a neural network with an input, many hidden layers (… …), and an Output]

Page 20: Deep Learning and Reinforcement Learning

Universal Learning Algorithm – Speech Recognition

A yellow bus driving down….

Bi-directional Recurrent Neural Network (BDRNN) – Baidu Deep Speech: “T h e _ q u i c k …”

Source

Page 21: Deep Learning and Reinforcement Learning

Universal Learning Algorithm – Translation

Dzeltens autobuss brauc pa ceļu…. (Latvian)

A yellow bus driving down….

Source

Page 22: Deep Learning and Reinforcement Learning

Universal Learning Algorithm – Self driving cars


Source

Page 23: Deep Learning and Reinforcement Learning

Universal Learning Algorithm

A yellow bus driving down….

Page 24: Deep Learning and Reinforcement Learning

Universal Learning Algorithm – Image captions

Data (image) → Caption: “A yellow bus driving down….”

(Andrew Ng, “The limitations of supervised learning”)

Source

Page 25: Deep Learning and Reinforcement Learning

Chinese captions (Andrew Ng):

[Chinese caption] (A baseball player getting ready to bat.)

[Chinese caption] (A person surfing on the ocean.)

[Chinese caption] (A double-decker bus driving on a street.)

Page 26: Deep Learning and Reinforcement Learning

Universal Learning Algorithm – X-ray reports


Source

Page 27: Deep Learning and Reinforcement Learning

Universal Learning Algorithm – Photo localisation

Deep Learning in Computer Vision: Image Localization

PlaNet is able to determine the location of almost any image with superhuman ability.

Source

Page 28: Deep Learning and Reinforcement Learning

Universal Learning Algorithm – Style Transfer


Source

Page 29: Deep Learning and Reinforcement Learning
Page 30: Deep Learning and Reinforcement Learning

Universal Learning Algorithm – Semantic Face Transforms

[Figures from the Deep Feature Interpolation (DFI) paper:]

Figure 1. Aging a 400x400 face with Deep Feature Interpolation, before and after the artifact removal step, showcasing the quality of the method. In this figure (and no other) a mask was applied to preserve the background. Although the input image was 400x400, all source and target images used in the transformation were only 100x100.

Figure 2. An example Deep Feature Interpolation transformation of a test image (Silvio Berlusconi, left) towards six categories: older, mouth open, eyes open, smiling, facial hair, spectacles. Each transformation was performed via linear interpolation in a deep feature space composed of pre-trained VGG features.

Source

Page 31: Deep Learning and Reinforcement Learning


Page 32: Deep Learning and Reinforcement Learning

Universal Learning Algorithm – Lipreading

A yellow bus driving down….

Deep Learning in Computer Vision: LipNet – Sentence-level Lipreading

LipNet achieves 93.4% accuracy, outperforming experienced human lipreaders and the previous 79.6% state-of-the-art accuracy.

Source

Page 33: Deep Learning and Reinforcement Learning

Universal Learning Algorithm – Sketch Vectorisation


Source

Page 34: Deep Learning and Reinforcement Learning

Universal Learning Algorithm – Handwriting Generation

A yellow bus driving down….

Source

Page 35: Deep Learning and Reinforcement Learning

Deep Learning in Computer Vision: Image Generation – Handwriting

This LSTM recurrent neural network is able to generate highly realistic cursive handwriting in a wide variety of styles, simply by predicting one data point at a time.

Source

Page 36: Deep Learning and Reinforcement Learning

Universal Learning Algorithm – Image upscaling


Source

Page 37: Deep Learning and Reinforcement Learning

Google – Saving you bandwidth through machine learning

Source

Page 38: Deep Learning and Reinforcement Learning

First Universal Learning Algorithm

Page 39: Deep Learning and Reinforcement Learning

Not Magic

• Simply downloading and “applying” open-source software won’t work.

• Needs to be customised to your business context and data.

• Needs lots of examples and computing power for training

Source

Page 40: Deep Learning and Reinforcement Learning
Page 41: Deep Learning and Reinforcement Learning

Outline

• Current State

• Deep Learning

• Reinforcement Learning

Page 42: Deep Learning and Reinforcement Learning

Outline

• Current State

• Deep Learning

• Reinforcement Learning

Page 43: Deep Learning and Reinforcement Learning

Neuron

[Diagram: cell body, output axon, synapse]

Source

Page 44: Deep Learning and Reinforcement Learning

Neuron

[Diagram: cell body, output axon, synapse]

Artificial Neuron

Source

Page 46: Deep Learning and Reinforcement Learning

[Diagram (Andrew Ng): deep neural network]

Page 47: Deep Learning and Reinforcement Learning

What is a neural network? (Andrew Ng)

Data (image) → x1 → x2 → x3 → x4 → x5 → Yes/No (Mug or not?)

Layer activations are vectors, e.g. x1 ∈ ℝ⁵, x2 ∈ ℝ⁵, connected by weight matrices W1, W2, W3, W4:

x2 = (W1 · x1)+    x3 = (W2 · x2)+    …

(the subscript “+” denotes the positive part, i.e. a ReLU-style non-linearity)
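A minimal Python/NumPy sketch of this forward pass (the layer sizes, the random weights, and the choice of ReLU as the “( · )+” non-linearity are assumptions for illustration, not taken from the slide):

import numpy as np

def relu(z):
    # the "( . )+" from the slide: keep the positive part, zero out the rest
    return np.maximum(z, 0.0)

def forward(x1, weights):
    """Propagate an input through the layers: x_{k+1} = (W_k . x_k)+."""
    x = x1
    for W in weights:
        x = relu(W @ x)
    return x

rng = np.random.default_rng(0)
x1 = rng.random(5)                                   # a 5-dimensional input, x1 in R^5
W1, W2, W3, W4 = (rng.standard_normal((5, 5)) for _ in range(4))
x5 = forward(x1, [W1, W2, W3, W4])                   # final activations, used for the Yes/No decision
print(x5)

In practice the final activations would be squashed to a probability (for example with a sigmoid) to give the Yes/No “mug or not?” answer.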

Page 48: Deep Learning and Reinforcement Learning

Training

[Same network as before: Data (image) → x1 → x2 → x3 → x4 → x5 → Yes/No (Mug or not?), with weights W1…W4]

Training data example:

output | true out | error
0.9    | 1.0      | 0.1
0.3    | 0.0      | 0.3
0.2    | 1.0      | 0.8

error backpropagation
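A hedged sketch of what one training step does. A single weight vector with a sigmoid output and a squared-error loss stand in for the full network; the slide itself only names error backpropagation:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy training data: three 5-dimensional inputs with true outputs 1.0 / 0.0 / 1.0,
# mirroring the "true out" column on the slide.
rng = np.random.default_rng(1)
X = rng.random((3, 5))
y_true = np.array([1.0, 0.0, 1.0])

W = rng.standard_normal(5) * 0.1    # one weight layer standing in for W1..W4
lr = 0.5                            # learning rate (assumed)

for step in range(2000):
    y_pred = sigmoid(X @ W)                      # forward pass: network outputs
    error = y_true - y_pred                      # plays the role of the slide's error column
    # Backpropagation for this single layer: gradient of the squared error
    # with respect to W, followed by a gradient-descent update.
    grad = -(error * y_pred * (1 - y_pred)) @ X
    W -= lr * grad

print(np.round(sigmoid(X @ W), 2))               # predictions move towards 1, 0, 1

In a deep network the same idea is applied layer by layer: the error is propagated backwards and every weight matrix W1…W4 is nudged a little.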

Page 49: Deep Learning and Reinforcement Learning

Features for machine learning (Andrew Ng):

• Images: Image → Vision features → Detection

• Audio: Audio → Audio features → Speaker ID

• Text: Text → Text features → Web search, …

Page 50: Deep Learning and Reinforcement Learning

WHAT MAKES DEEP LEARNING DEEP?

[Diagram: many layers between Input and Result]

Today’s largest networks: ~10 layers, ~1B parameters, ~10M images, ~30 exaflops, ~30 GPU-days. The human brain has trillions of parameters – only about 1,000× more.

Page 51: Deep Learning and Reinforcement Learning

What is Deep Learning?

[Image of a cat → deep neural network → “cat”]

• Loosely based on (what little) we know about the brain

Page 52: Deep Learning and Reinforcement Learning

Demo

Page 53: Deep Learning and Reinforcement Learning

http://playground.tensorflow.org/

Page 54: Deep Learning and Reinforcement Learning

https://transcranial.github.io/keras-js/

Page 55: Deep Learning and Reinforcement Learning

Why Now?

Page 56: Deep Learning and Reinforcement Learning

A Brief History: a long time ago… (1943, 1956)

1958 – Perceptron

1969 – Perceptron criticized

1974 – Backpropagation … awkward silence (AI Winter)

1995 – SVM reigns

1998 – Convolutional Neural Networks for Handwriting Recognition

2006 – Restricted Boltzmann Machine

2012 – Google Brain project on 16k cores; AlexNet wins ImageNet → Deep Learning

… 2017

Page 57: Deep Learning and Reinforcement Learning

Why Now?

Page 58: Deep Learning and Reinforcement Learning

Computational Power

Big Data


Algorithms

Page 59: Deep Learning and Reinforcement Learning
Page 60: Deep Learning and Reinforcement Learning

Current Situation

Page 61: Deep Learning and Reinforcement Learning

Outline

• Current State

• Deep Learning

• Reinforcement Learning


Page 62: Deep Learning and Reinforcement Learning

Outline

• Current State

• Deep Learning

• Reinforcement Learning

Learning from Experience

Page 63: Deep Learning and Reinforcement Learning
Page 64: Deep Learning and Reinforcement Learning
Page 67: Deep Learning and Reinforcement Learning

What is Reinforcement Learning?

Action (A1)

State (S1)

Reward (R1)

(Gorila – General Reinforcement Learning Architecture: 10x faster than Nature DQN on 38 out of 49 Atari games; applied to recommender systems within Google)

Agent Environment

Page 68: Deep Learning and Reinforcement Learning

What is Reinforcement Learning?

Action (A2)

State (S2)

Reward (R2)


Agent Environment

Page 69: Deep Learning and Reinforcement Learning

What is Reinforcement Learning?


Agent Environment

Action (Ai)

State (Si)

Reward (Ri)

Page 70: Deep Learning and Reinforcement Learning

What is Reinforcement Learning?


Agent Environment

Goal: Maximize Accumulated Rewards: R1 + R2 + R3 + … = ∑ᵢ Rᵢ

Action (Ai)

State (Si)

Reward (Ri)
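The loop in the diagram, as a small Python sketch. The toy environment and the random agent are assumptions for illustration only:

import random

class ToyEnv:
    """Toy environment: every step returns the next state and a +1/-1 reward."""
    def reset(self):
        self.t = 0
        return 0                                  # initial state S1

    def step(self, action):
        self.t += 1
        state = self.t                            # next state
        reward = +1 if action == 1 else -1        # reward Ri for the chosen action
        done = self.t >= 10                       # episode ends after 10 steps
        return state, reward, done

env = ToyEnv()
state, done, total_reward = env.reset(), False, 0
while not done:
    action = random.choice([0, 1])                # the agent picks an action Ai
    state, reward, done = env.step(action)        # the environment returns the next Si and Ri
    total_reward += reward                        # accumulate the sum of rewards

print("accumulated reward:", total_reward)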

Page 71: Deep Learning and Reinforcement Learning

Pong Example

States (S)

Actions (A): up / down

Rewards (R): +1 / −1 / 0

Environment ⇄ Agent

Goal: Maximize Accumulated Rewards

Page 72: Deep Learning and Reinforcement Learning

Reinforcement Agent

Agent


Page 73: Deep Learning and Reinforcement Learning

Reinforcement Agent = Policy Function

Agent = Policy Function: π(S) -> A
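In deep reinforcement learning the policy π(S) -> A is usually represented by a neural network that maps a state to a distribution over actions. A minimal sketch (state size, layer sizes and the softmax parameterisation are assumptions):

import numpy as np

rng = np.random.default_rng(0)
N_STATE_FEATURES, N_ACTIONS, HIDDEN = 8, 3, 16           # assumed sizes, for illustration
W1 = rng.standard_normal((HIDDEN, N_STATE_FEATURES)) * 0.1
W2 = rng.standard_normal((N_ACTIONS, HIDDEN)) * 0.1

def policy(state):
    """pi(S) -> A: map a state vector to an action index."""
    h = np.maximum(W1 @ state, 0.0)                       # hidden layer with ReLU
    logits = W2 @ h
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                                  # softmax over the actions
    return int(rng.choice(N_ACTIONS, p=probs))            # sample an action

action = policy(rng.random(N_STATE_FEATURES))
print(action)

Training then means adjusting W1 and W2 so that the sampled actions collect more reward.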

Page 74: Deep Learning and Reinforcement Learning

Pong Example

π( Si ) -> Ai

[Pong screenshots from the DQN Nature paper (2015), Extended Data Figure 2b: the learned action-value function on Pong. At time point 1, the ball is moving towards the paddle controlled by the agent on the right side of the screen and the values of all actions are around 0.7. At time point 2, the agent starts moving the paddle towards the ball; the value of the ‘up’ action stays high while the value of the ‘down’ action falls to −0.9, since pressing ‘down’ would lead to the agent losing the ball and incurring a reward of −1. At time point 3, the agent hits the ball by pressing ‘up’ and the expected reward keeps increasing until time point 4, when the ball reaches the left edge of the screen and the values of all actions reflect that the agent is about to receive a reward of +1. With permission from Atari Interactive, Inc.]

Page 75: Deep Learning and Reinforcement Learning

Pong Example

π( Si ) -> Ai

[Pong screenshot from the DQN Nature paper]

Page 76: Deep Learning and Reinforcement Learning

Pong Example

π( Si ) -> Ai

[Pong screenshot from the DQN Nature paper]

Page 77: Deep Learning and Reinforcement Learning

Pong Example

States (S)

Actions (A): up / down

Rewards (R): +1 / −1 / 0

Environment ⇄ Agent

Goal: Maximize Accumulated Rewards

π(S) -> A

Page 78: Deep Learning and Reinforcement Learning

Reinforcement Learning Problem

Find a Policy Function π(S) -> A that maximizes the Accumulated Rewards ∑ᵢ Rᵢ

Page 79: Deep Learning and Reinforcement Learning

How to Find

π(S) -> A

?

Page 80: Deep Learning and Reinforcement Learning

Reinforcement Learning Algorithms

• Q-Learning

• Actor-Critic methods

• Policy Gradient

Page 81: Deep Learning and Reinforcement Learning

Reinforcement Learning Algorithms

• Q-Learning

• Actor-Critic methods

• Policy Gradient
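As a sketch of the first item on the list above, here is tabular Q-learning on a tiny corridor world. The environment, reward, and hyper-parameters are toy assumptions, not something from the talk:

import random

# Corridor with states 0..3; action 0 moves left, action 1 moves right.
# Reaching state 3 ends the episode with reward +1.
N_STATES, N_ACTIONS, GOAL = 4, 2, 3
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.5, 0.9, 0.2      # learning rate, discount, exploration

def step(s, a):
    s_next = max(0, s - 1) if a == 0 else min(GOAL, s + 1)
    reward = 1.0 if s_next == GOAL else 0.0
    return s_next, reward, s_next == GOAL

for episode in range(500):
    s = random.randrange(GOAL)             # exploring starts
    done = False
    while not done:
        # epsilon-greedy action selection
        a = random.randrange(N_ACTIONS) if random.random() < epsilon else Q[s].index(max(Q[s]))
        s_next, r, done = step(s, a)
        # Q-learning update: move Q(s,a) towards r + gamma * max_a' Q(s',a')
        target = r + (0.0 if done else gamma * max(Q[s_next]))
        Q[s][a] += alpha * (target - Q[s][a])
        s = s_next

print([[round(q, 2) for q in row] for row in Q])   # "right" ends up best in states 0-2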

Page 82: Deep Learning and Reinforcement Learning

Episode

[Sequence of Pong frames from the DQN Nature paper]

R1 = 0   R2 = 0   R3 = +1  😁

Game Over

👍 👍 👍

∑ᵢ Rᵢ = +1
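Spelled out for this episode, the accumulated reward is simply the sum of the per-step rewards:

rewards = [0, 0, +1]             # R1, R2, R3 from the episode above
episode_return = sum(rewards)    # the sum of Ri over the episode
print(episode_return)            # -> 1, the quantity the agent tries to maximize

The thumbs-up marks hint at the policy-gradient intuition: every action taken during this winning episode gets credited with the positive return.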

Page 83: Deep Learning and Reinforcement Learning

😭

[Sequence of Pong game frames from one episode (screenshots reproduced from the 2015 Nature DQN paper, Extended Data Figure 2)]

Episode

R1=0  R2=0  R3=-1  Game Over

👎 👎 👎

∑ᵢ Rᵢ = −1
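To make the episode idea on these slides concrete, here is a minimal sketch in Python. It assumes a Gym-style environment object with reset() and step() and a policy function that picks an action from a state; those names are illustrative and do not come from the slides. The only learning signal is the summed reward at "Game Over": +1 for a won rally, −1 for a lost one, as in the two episodes above.

```python
def play_episode(env, policy):
    """Run one episode; return the visited states, the taken actions and the total reward."""
    states, actions, rewards = [], [], []
    state, done = env.reset(), False
    while not done:
        action = policy(state)                  # e.g. 'up' or 'down' in Pong
        states.append(state)                    # state the action was chosen in
        actions.append(action)
        state, reward, done = env.step(action)  # assumed (next_state, reward, done) interface
        rewards.append(reward)                  # 0, 0, ..., then +1 (win) or -1 (loss)
    return states, actions, sum(rewards)        # sum(rewards) is the episode's total reward ∑ᵢ Rᵢ
```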

Page 84: Deep Learning and Reinforcement Learning

How to Find π(S) -> A ?

Page 85: Deep Learning and Reinforcement Learning

How to Find π(S) -> A ?

1. Change π to Stochastic: π(S) -> P(A)
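A minimal sketch of this first step, assuming Pong's two actions 'up' and 'down' (the fixed 0.7/0.3 numbers are placeholders, not learned values): the policy now returns a probability for each action, and the agent samples its action from that distribution.

```python
import random

def pi(state):
    # A stochastic policy returns a probability for each action...
    return {'up': 0.7, 'down': 0.3}

def act(state):
    # ...and the agent samples an action from that distribution.
    probs = pi(state)
    actions, weights = zip(*probs.items())
    return random.choices(actions, weights=weights, k=1)[0]
```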

Page 86: Deep Learning and Reinforcement Learning

Pong Example

π( [Pong game frame] ) ->


Page 87: Deep Learning and Reinforcement Learning

π( [Pong game frame] ) -> Action Probability (bar chart, scale 0 to 1)


Page 88: Deep Learning and Reinforcement Learning

π( [Pong game frame] ) -> Action Probability (bar chart, scale 0 to 1)


Page 89: Deep Learning and Reinforcement Learning

π( [Pong game frame] ) -> Action Probability (bar chart, scale 0 to 1)


Page 90: Deep Learning and Reinforcement Learning

How to Find π(S) -> A ?

1. Change π to Stochastic: π(S) -> P(A)

2. Approximate π with NeuralNet: π(S, θ) -> P(A)
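A minimal sketch of the second step: the policy π(S, θ) is a small neural network whose parameters θ are the weights and biases. The layer sizes, the flattened 80x80 frame and the two-action output are illustrative assumptions, not taken from the slides.

```python
import numpy as np

STATE_DIM, HIDDEN, N_ACTIONS = 80 * 80, 200, 2    # flattened frame; 'up' / 'down'

def init_theta(seed=0):
    """Initialize theta: the weights and biases of a one-hidden-layer policy network."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 0.01, (HIDDEN, STATE_DIM)),
        "b1": np.zeros(HIDDEN),
        "W2": rng.normal(0.0, 0.01, (N_ACTIONS, HIDDEN)),
        "b2": np.zeros(N_ACTIONS),
    }

def pi(state, theta):
    """Map a flattened game frame to a probability for each action: π(S, θ) -> P(A)."""
    h = np.maximum(0.0, theta["W1"] @ state + theta["b1"])   # ReLU hidden layer
    logits = theta["W2"] @ h + theta["b2"]
    exp = np.exp(logits - logits.max())                       # softmax over actions
    return exp / exp.sum()
```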

Page 91: Deep Learning and Reinforcement Learning

π( Si ) ->


Action Probability (bar chart, scale 0 to 1)

Page 92: Deep Learning and Reinforcement Learning

π( Si, θ ) ->


Action Probability (bar chart, scale 0 to 1)

Page 93: Deep Learning and Reinforcement Learning

π( Si, [image] ) ->


Action Probability (bar chart, scale 0 to 1)

Page 94: Deep Learning and Reinforcement Learning

Si


Action Probability (bar chart, scale 0 to 1)

Page 95: Deep Learning and Reinforcement Learning

π(Si, θ)

θ

Page 96: Deep Learning and Reinforcement Learning

Si


Action Probability (bar chart, scale 0 to 1)

Page 97: Deep Learning and Reinforcement Learning

Si


Action Probability (bar chart, scale 0 to 1)

Page 98: Deep Learning and Reinforcement Learning

How to Find π(Si, θ) -> P(A) ?

How to Find θ

Loss Function…
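The slides only announce a loss function, so the concrete form below is an assumption: a standard REINFORCE-style policy-gradient objective, which weights the log-probability of every action actually taken by the episode's total reward. It reuses the pi(state, theta) sketch from above and assumes actions are stored as integer indices.

```python
import numpy as np

def episode_loss(theta, states, actions, total_reward, pi):
    """Policy-gradient style loss for one finished episode (assumed form, not from the slides)."""
    log_probs = [np.log(pi(s, theta)[a]) for s, a in zip(states, actions)]
    return -total_reward * np.sum(log_probs)
```

Minimizing this loss raises the probability of the actions from an episode that ended with ∑ᵢ Rᵢ = +1 and lowers it for an episode that ended with ∑ᵢ Rᵢ = −1, which is what the winning and losing episodes on the surrounding slides illustrate.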

Page 99: Deep Learning and Reinforcement Learning

[Pong game frame with action-probability bar chart]

R1=0

[Pong game frame with action-probability bar chart]

R2=0

[Pong game frame with action-probability bar chart]

Game Over

R3=+1


∑ᵢⁿ Rᵢ = +1

😁

👍 👍 👍
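Plugging the episode from this slide into the pieces sketched earlier (the reward values are the ones shown on the slide; everything else keeps the hedged assumptions from the previous sketches):

```python
# The episode on this slide, plugged into the pieces above (illustrative only):
rewards = [0, 0, +1]                 # R1, R2, R3, then "Game Over"
total_reward = sum(rewards)          # +1, so every action of this episode gets reinforced
# With the loss above, -total_reward * sum(log pi(Ai | Si, theta)) is minimized by
# raising the probability of each action that was actually taken; for the losing
# episode (total reward -1) the same update lowers those probabilities instead.
```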


Page 100: Deep Learning and Reinforcement Learning

π(Ai | Si, θ)

θk

[Pong game frame with action-probability bar chart]
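Here π(Ai | Si, θ) is the probability the current network assigns to the action Ai that was actually taken in state Si, and a natural reading of θk is the parameter vector after k updates: with a loss L(θ) like the one sketched earlier, each finished episode yields one gradient step, θk+1 = θk − α · ∇θ L(θk), for some small learning rate α. The slides do not state the update rule explicitly, so this form is an assumption.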


π(Si, θ | Ai) }  👍
                 👎

Δπ(Si, θ | Ai) / Δθk
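Taken together, the annotations sketch the policy-gradient idea: each weight θk is nudged along Δπ(Si, θ | Ai)/Δθk so that actions which eventually paid off become more probable (👍) and the rest less probable (👎). Below is a minimal REINFORCE-style sketch of that update for a toy softmax policy (plain NumPy; the 4-number state encoding, two actions, and learning rate are illustrative assumptions, not from the slides):

import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions = 4, 2               # toy state encoding; actions: up / down
theta = np.zeros((n_features, n_actions))  # the policy parameters theta_k
lr = 0.01

def policy(state, theta):
    # pi(A | S, theta): softmax over a linear score for each action.
    logits = state @ theta
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def reinforce_step(state, action, advantage, theta):
    # Move theta along d log pi / d theta, scaled by how good the action turned out.
    # advantage > 0 -> the chosen action becomes more probable (thumbs up)
    # advantage < 0 -> it becomes less probable (thumbs down)
    probs = policy(state, theta)
    grad_log_pi = np.outer(state, np.eye(n_actions)[action] - probs)
    return theta + lr * advantage * grad_log_pi

state = rng.normal(size=n_features)   # stand-in for a preprocessed Pong frame
theta = reinforce_step(state, action=0, advantage=+1.0, theta=theta)  # 'up' led to a win
print(policy(state, theta))           # the probability of 'up' has increased slightly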

Page 101: Deep Learning and Reinforcement Learning


[Pong frames with action-probability bars, scale 0.0–1.0]

R1 = 0

R2 = 0

R3 = +1 (Game Over)


∑ i=1..n Ri = +1   😁

👍 👍 👍
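The sum is the episode return: the rally's rewards add up to +1, so every action taken along the way gets reinforced (the three thumbs up). A minimal sketch of that bookkeeping (plain Python; the undiscounted sum matches the slide, though practical implementations usually discount later rewards):

def episode_return(rewards):
    # Undiscounted return: R1 + R2 + ... + Rn, as written on the slide.
    return sum(rewards)

rewards = [0, 0, +1]          # R1, R2, R3 -- the last step ends the rally ("Game Over")
G = episode_return(rewards)   # +1: a won rally
# In REINFORCE, G scales the gradient of every action taken during the episode,
# so all three actions are pushed towards higher probability.
print(G)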


Page 102: Deep Learning and Reinforcement Learning
Page 103: Deep Learning and Reinforcement Learning

[Pong screenshots with action-probability bars, scale 0.0–1.0]

Page 104: Deep Learning and Reinforcement Learning

[Pong screenshots with action-probability bars, scale 0.0–1.0]

Page 105: Deep Learning and Reinforcement Learning

Reinforcement Learning

Page 106: Deep Learning and Reinforcement Learning

Outline

• Current State

• Deep Learning

• Reinforcement Learning

• Conclusions

Page 107: Deep Learning and Reinforcement Learning

Conclusions

1. [neural network diagram]

2.

3.

Page 108: Deep Learning and Reinforcement Learning