
Page 1: Reinforcement Learning - Function approximation

Reinforcement Learning: Function approximation

Daniel Hennes, 19.06.2017

University Stuttgart - IPVS - Machine Learning & Robotics

1

Page 2: Reinforcement Learning - Function approximation

Today

• Eligibility traces
  • n-step TD returns
  • forward and backward view

• Function approximation
  • stochastic gradient descent (SGD)
  • on-policy prediction and control

2

Page 3: Reinforcement Learning - Function approximation

Scaling-up reinforcement learning

• Sparse rewards
  • e.g. gridworld

• Large state spaces:
  • Go: 10^170 states
  • camera input, e.g. 1280 × 720 × 24 ≈ 2 × 10^7 bits per frame (3 color channels, 8 bits each)
  • continuous spaces: e.g. inverted pendulum, mobile robot, etc.

3

Page 4: Reinforcement Learning - Function approximation

Recall: n-step return

• n-step returns for n = 1, 2, ..., ∞:

  n = 1 (TD):   R_{t+1} + γ V(S_{t+1})
  n = 2:        R_{t+1} + γ R_{t+2} + γ² V(S_{t+2})
  n = 3:        R_{t+1} + γ R_{t+2} + γ² R_{t+3} + γ³ V(S_{t+3})
  ...
  n = ∞ (MC):   R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... + γ^{T−t−1} R_T

• n-step return:

  G_{t:t+n} = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n V(S_{t+n})

• n-step temporal-difference update:

  V(S_t) ← V(S_t) + α (G_{t:t+n} − V(S_t))

4
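A minimal tabular sketch of this n-step update; the episode buffers (`states[t] = S_t`, `rewards[t] = R_t`) and their indexing are assumptions made for the illustration, not part of the slides.

```python
def n_step_td_update(V, states, rewards, tau, n, T, alpha=0.1, gamma=0.99):
    """Apply one n-step TD update to V(S_tau) from buffered episode data.

    states[t] = S_t, rewards[t] = R_t (rewards[0] is unused), T = episode length.
    """
    # G_{tau:tau+n} = R_{tau+1} + gamma R_{tau+2} + ...   (truncated at the episode end T)
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    # bootstrap with the current estimate V(S_{tau+n}) if the episode has not ended yet
    if tau + n < T:
        G += gamma ** n * V[states[tau + n]]
    V[states[tau]] += alpha * (G - V[states[tau]])
    return V
```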

Page 5: Reinforcement Learning - Function approximation

Example: n-step Sarsa

n-step Sarsa for estimating Q ≈ q*, or Q ≈ q_π for a given π

Initialize Q(s, a) arbitrarily, for all s ∈ S, a ∈ A
Initialize π to be ε-greedy with respect to Q, or to a fixed given policy
Parameters: step size α ∈ (0, 1], small ε > 0, a positive integer n
All store and access operations (for S_t, A_t, and R_t) can take their index mod n

Repeat (for each episode):
  Initialize and store S_0 ≠ terminal
  Select and store an action A_0 ~ π(·|S_0)
  T ← ∞
  For t = 0, 1, 2, ...:
    If t < T, then:
      Take action A_t
      Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
      If S_{t+1} is terminal, then:
        T ← t + 1
      else:
        Select and store an action A_{t+1} ~ π(·|S_{t+1})
    τ ← t − n + 1   (τ is the time whose estimate is being updated)
    If τ ≥ 0:
      G ← Σ_{i=τ+1}^{min(τ+n, T)} γ^{i−τ−1} R_i
      If τ + n < T, then G ← G + γ^n Q(S_{τ+n}, A_{τ+n})
      Q(S_τ, A_τ) ← Q(S_τ, A_τ) + α [G − Q(S_τ, A_τ)]
      If π is being learned, then ensure that π(·|S_τ) is ε-greedy with respect to Q
  Until τ = T − 1

[Figure 7.4: Gridworld example of the speedup of policy learning due to the use of n-step methods. The first panel shows the path taken by an agent in a single episode, ending at a location of high reward, marked by G. In this example the values were all initially 0, and all rewards were zero except for a positive reward at G. The arrows in the other panels show which action values were strengthened as a result of this path by one-step and n-step Sarsa methods. The one-step method strengthens only the last action of the sequence of actions that led to the high reward, whereas the n-step method strengthens the last n actions of the sequence, so that much more is learned from the one episode. (Panel labels: path taken; action values increased by one-step Sarsa; by 10-step Sarsa; by Sarsa(λ) with λ = 0.9.)]

• How to choose n?

5

Page 6: Reinforcement Learning - Function approximation

λ-return

• The λ-return G^λ_t combines all n-step returns:

  G^λ_t = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} G_{t:t+n}

A backup that averages simpler component backups is called a compound backup. The backup diagram for a compound backup consists of the backup diagrams for each of the component backups with a horizontal line above them and the weighting fractions below. For example, the compound backup for the case mentioned at the start of this section, mixing half of a two-step backup and half of a four-step backup, has the diagram shown to the right. A compound backup can only be done when the longest of its component backups is complete. The backup at the right, for example, could only be done at time t + 4 for the estimate formed at time t. In general one would like to limit the length of the longest backup because of the corresponding delay in the updates.

The TD(λ) algorithm can be understood as one particular way of averaging n-step backups. This average contains all the n-step backups, each weighted proportionally to λ^{n−1}, where λ ∈ [0, 1], and normalized by a factor of 1 − λ to ensure that the weights sum to 1 (see Figure 12.1). The resulting backup is toward a return, called the λ-return, defined by

  G^λ_t = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} G^{(n)}_t.   (12.2)

Figure 12.2 further illustrates the weighting on the sequence of n-step returns in the λ-return. The one-step return is given the largest weight, 1 − λ; the two-step return is given the next largest weight, (1 − λ)λ; the three-step return is given the weight (1 − λ)λ²; and so on. The weight fades by λ with each additional step. After a terminal state has been reached, all subsequent n-step returns are equal to G_t.

[Figure 12.1: The backup diagram for TD(λ). If λ = 0, then the overall backup reduces to its first component, the one-step TD backup, whereas if λ = 1, then the overall backup reduces to its last component, the Monte Carlo backup.]

6
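The definition above can be computed directly once an episode has terminated. A short sketch under the convention that `rewards[k] = R_{k+1}` and `values[k] = V(S_k)` for a finished episode of length T; these arrays and their indexing are assumptions for the illustration.

```python
def lambda_return(rewards, values, t, lam=0.9, gamma=0.99):
    """Compute the lambda-return G^lambda_t for a terminated episode."""
    T = len(rewards)  # episode length; rewards[k] = R_{k+1}, values[k] = V(S_k)

    def n_step_return(n):
        # G_{t:t+n} = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n}), truncated at T
        end = min(t + n, T)
        G = sum(gamma ** (k - t) * rewards[k] for k in range(t, end))
        if t + n < T:
            G += gamma ** n * values[t + n]
        return G

    # G^lambda_t = (1 - lam) * sum_{n=1}^{T-t-1} lam^{n-1} G_{t:t+n} + lam^{T-t-1} G_t
    G_lam = (1 - lam) * sum(lam ** (n - 1) * n_step_return(n) for n in range(1, T - t))
    G_lam += lam ** (T - t - 1) * n_step_return(T - t)  # the full Monte Carlo return G_t
    return G_lam
```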

Page 7: Reinforcement Learning - Function approximation

λ-return weighting function

  G^λ_t = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} G_{t:t+n}

[Figure 12.2: Weighting given in the λ-return to each of the n-step returns. The one-step return gets weight 1 − λ, decaying by λ with each additional step (the 3-step return gets (1 − λ)λ²); the actual, final return gets weight λ^{T−t−1}; the total area is 1.]

If we want, we can separate these post-termination terms from the main sum, yielding

  G^λ_t = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} G^{(n)}_t + λ^{T−t−1} G_t,   (12.3)

as indicated in the figures. This equation makes it clearer what happens when λ = 1. In this case the main sum goes to zero, and the remaining term reduces to the conventional return, G_t. Thus, for λ = 1, backing up according to the λ-return is a Monte Carlo algorithm. On the other hand, if λ = 0, then the λ-return reduces to G^{(1)}_t, the one-step return. Thus, for λ = 0, backing up according to the λ-return is a one-step TD method.

Exercise 12.1: The parameter λ characterizes how fast the exponential weighting in Figure 12.2 falls off, and thus how far into the future the λ-return algorithm looks in determining its backup. But a rate factor such as λ is sometimes an awkward way of characterizing the speed of the decay. For some purposes it is better to specify a time constant, or half-life. What is the equation relating λ and the half-life, τ_λ, the time by which the weighting sequence will have fallen to half of its initial value?

We are now ready to define our first learning algorithm based on the λ-return: the off-line λ-return algorithm. As an off-line algorithm, it makes no changes to the weight vector during the episode. Then, at the end of the episode, a whole sequence of off-line updates are made according to our usual semi-gradient rule, using the λ-return as the target:

  θ_{t+1} = θ_t + α [G^λ_t − v̂(S_t, θ_t)] ∇v̂(S_t, θ_t),   t = 0, ..., T − 1.   (12.4)

The λ-return gives us an alternative way of moving smoothly between Monte Carlo and one-step TD methods that can be compared with the n-step TD way of Chapter 7. There we assessed effectiveness on a 19-state random walk task (Example 7.1). Figure 12.3 shows the performance of the off-line λ-return algorithm on this task alongside that of the n-step methods (repeated from Figure 7.2).

7

Page 8: Reinforcement Learning - Function approximation

Forward view

[Figure 12.3: 19-state random walk results (Example 7.1): performance of the off-line λ-return algorithm (λ = 0, .4, .8, .9, .95, .975, .99, 1) alongside that of the n-step TD methods from Chapter 7 (n = 1, 2, 4, ..., 512), measured as the RMS error at the end of the episode over the first 10 episodes. In both cases, intermediate values of the bootstrapping parameter (λ or n) performed best. The results with the off-line λ-return algorithm are slightly better at the best values of α and λ, and at high α.]

The experiment was just as described earlier except that for the λ-return algorithm we varied λ instead of n. The performance measure used is the estimated root-mean-squared error between the correct and estimated values of each state measured at the end of the episode, averaged over the first 10 episodes and the 19 states. Note that the overall performance of the off-line λ-return algorithm is comparable to that of the n-step algorithms. In both cases we get the best performance with an intermediate value of the bootstrapping parameter, n for n-step methods and λ for the off-line λ-return algorithm.

The approach that we have been taking so far is what we call the theoretical, or forward, view of a learning algorithm. For each state visited, we look forward in time to all the future rewards and decide how best to combine them. We might imagine ourselves riding the stream of states, looking forward from each state to determine its update, as suggested by Figure 12.4. After looking forward from and updating one state, we move on to the next and never have to work with the preceding state again. Future states, on the other hand, are viewed and processed repeatedly, once from each vantage point preceding them.

[Figure 12.4: The forward view. We decide how to update each state by looking forward to future rewards and states.]

• Update values by looking forward to future rewards and states
• Update values towards the λ-return
• Can only be computed for terminated sequences

8

Page 9: Reinforcement Learning - Function approximation

Backward view

Semi-gradient TD(λ) for estimating v̂ ≈ v_π

Input: the policy π to be evaluated
Input: a differentiable function v̂ : S⁺ × R^n → R such that v̂(terminal, ·) = 0

Initialize value-function weights θ arbitrarily (e.g., θ = 0)
Repeat (for each episode):
  Initialize S
  e ← 0 (an n-dimensional vector)
  Repeat (for each step of episode):
    Choose A ~ π(·|S)
    Take action A, observe R, S′
    e ← γλe + ∇v̂(S, θ)
    δ ← R + γ v̂(S′, θ) − v̂(S, θ)
    θ ← θ + α δ e
    S ← S′
  until S′ is terminal

We might imagine ourselves riding along the stream of states, computing TD errors, and shouting them back to the previously visited states, as suggested by Figure 12.5. Where the TD error and traces come together, we get the update given by (12.7).

To better understand the backward view, consider what happens at various values of λ. If λ = 0, then by (12.5) the trace at t is exactly the value gradient corresponding to S_t. Thus the TD(λ) update (12.7) reduces to the one-step semi-gradient TD update treated in Chapter 9 (and, in the tabular case, to the simple TD rule (6.2)). This is why that algorithm was called TD(0). In terms of Figure 12.5, TD(0) is the case in which only the one state preceding the current one is changed by the TD error. For larger values of λ, but still λ < 1, more of the preceding states are changed, but each more temporally distant state is changed less because its eligibility trace is smaller.

[Figure 12.5: The backward or mechanistic view. Each update depends on the current TD error combined with eligibility traces of past events.]

• Forward view provides the theory
• Backward view provides a mechanism for performing the updates
• Update every step; works for incomplete sequences

9

Page 10: Reinforcement Learning - Function approximation

Eligibility traces

Credit assignment problem:

• Frequency heuristic: assign credit to the most frequently visited states
• Recency heuristic: assign credit to the most recently visited states

Eligibility trace update (accumulating traces):

  ∀s : e(s) ← γλ e(s)
  e(S_t) ← e(S_t) + 1

10

Page 11: Reinforcement Learning - Function approximation

Backward view


[Figure 12.5: The backward or mechanistic view. Each update depends on the current TD error combined with eligibility traces of past events.]

• Keep an eligibility trace e(s) for every state s

• Update the value V(s) of every state s:

  δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t)
  ∀s : V(s) ← V(s) + α δ_t e_t(s)

11
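A minimal tabular sketch of backward-view TD(λ) combining the trace update from the previous slide with this value update; the Gym-style environment interface (`env.reset()`, `env.step(a)` returning `(s', r, done)`) and the `policy` callable are assumptions.

```python
from collections import defaultdict

def tabular_td_lambda(env, policy, num_episodes=100, alpha=0.1, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda) prediction with accumulating eligibility traces."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        e = defaultdict(float)              # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            delta = r + gamma * (0.0 if done else V[s_next]) - V[s]
            # decay all traces, then bump the trace of the state just visited
            for k in e:
                e[k] *= gamma * lam
            e[s] += 1.0
            # broadcast the TD error to every state in proportion to its trace
            for k, trace in e.items():
                V[k] += alpha * delta * trace
            s = s_next
    return V
```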

Page 12: Reinforcement Learning - Function approximation

TD(λ) and TD(0)

• When λ = 0:

  e(s) = 1 if s = S_t, and 0 otherwise

  δ_t = R_{t+1} + γ V(S_{t+1}) − V(S_t)
  ∀s : V(s) ← V(s) + α δ_t e_t(s)

• This is the same as TD(0):

  V(S_t) ← V(S_t) + α [R_{t+1} + γ V(S_{t+1}) − V(S_t)]

12

Page 13: Reinforcement Learning - Function approximation

Large state spaces

• Backgammon: 10^20 states

• Go: 10^170 states

• Continuous spaces:
  • inverted pendulum
  • mountain car
  • helicopter
  • ...

13

Page 14: Reinforcement Learning - Function approximation

Small domains: tabular representation

• So far V and Q were just lookup tables:

[Figure: tabular values for a 4×3 gridworld after k = 50 iterations — one panel with a single state value V(s) per cell, and one panel with one action value Q(s, a) per action per cell; the terminal cells have values +1.00 and −1.00.]

• Problems with large MDPs:
  • too many states to store in memory
  • too slow to learn

• How can we scale the tabular solution methods to arbitrarily large state/action spaces?
  • generalize from previous encounters of similar states (or state-action pairs)
  • function approximation (supervised learning)

14

Page 15: Reinforcement Learning - Function approximation

Example: mountain car

15

Page 16: Reinforcement Learning - Function approximation

State aggregation in mountain car

[Figure: plots over the mountain car state space (position × velocity, each axis discretized 0–10), with a color scale from 0 to 300.]

16

Page 17: Reinforcement Learning - Function approximation

State aggregation in mountain car

[Figure: the mountain car state space (position × velocity, each axis discretized 0–10) under state aggregation.]

17
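State aggregation can be written as a one-hot feature vector with a single active component per grid cell. A sketch with illustrative bin counts and bounds (the ranges roughly follow the classic mountain car formulation and are assumptions here):

```python
import numpy as np

def aggregated_features(position, velocity, n_bins=10,
                        pos_range=(-1.2, 0.6), vel_range=(-0.07, 0.07)):
    """One-hot feature vector over an n_bins x n_bins position/velocity grid."""
    p = int(np.clip((position - pos_range[0]) / (pos_range[1] - pos_range[0]) * n_bins,
                    0, n_bins - 1))
    v = int(np.clip((velocity - vel_range[0]) / (vel_range[1] - vel_range[0]) * n_bins,
                    0, n_bins - 1))
    x = np.zeros(n_bins * n_bins)
    x[p * n_bins + v] = 1.0        # exactly one active feature: the cell (p, v)
    return x

# with linear weights w, all states inside a cell share the same value: v_hat = w @ x
```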

Page 18: Reinforcement Learning - Function approximation

Idea of value function approximation

• Parameterized functional form with weight vector w ∈ R^d:

  v̂(s, w) ≈ v_π(s)

• Generally, far fewer weights than states: d ≪ |S|
  • obvious for continuous state spaces
  • changing a single weight changes the value estimates of many states
  • when one state is updated, the change generalizes to many states

• Update w with MC or TD learning

18

Page 19: Reinforcement Learning - Function approximation

Function approximators

• Linear combinations of features

  • state aggregation
  • tile coding
  • polynomials
  • radial basis functions (RBFs)
  • Fourier bases
  • ...

• Neural networks

• Decision trees

• Non-parametric approaches

19

Page 20: Reinforcement Learning - Function approximation

Tile coding

In tile coding the receptive fields of the features are grouped into partitions of the input space. Each such partition is called a tiling, and each element of the partition is called a tile. For example, the simplest tiling of a two-dimensional state space is a uniform grid such as that shown on the left side of Figure 9.9. The tiles or receptive fields here are squares rather than the circles in Figure 9.6. If just this single tiling were used, then the state indicated by the white spot would be represented by the single feature whose tile it falls within; generalization would be complete to all states within the same tile and nonexistent to states outside it. With just one tiling, we would not have coarse coding but just a case of state aggregation.

To get the strengths of coarse coding requires overlapping receptive fields, and by definition the tiles of a partition do not overlap. To get true coarse coding with tile coding, multiple tilings are used, each offset by a fraction of a tile width. A simple case with four tilings is shown on the right side of Figure 9.9. Every state, such as that indicated by the white spot, falls in exactly one tile in each of the four tilings. These four tiles correspond to four features that become active when the state occurs. Specifically, the feature vector φ(s) has one component for each tile in each tiling. In this example there are 4 × 4 × 4 = 64 components, all of which will be 0 except for the four corresponding to the tiles that s falls within. Figure 9.10 shows the advantage of multiple offset tilings (coarse coding) over a single tiling on the 1000-state random walk example.

An immediate practical advantage of tile coding is that, because it works with partitions, the overall number of features that are active at one time is the same for any state. Exactly one feature is present in each tiling, so the total number of features present is always the same as the number of tilings. This allows the step-size parameter α to be set in an easy, intuitive way. For example, choosing α = 1/m, where m is the number of tilings, results in exact one-trial learning. If the example s ↦ v is trained on, then whatever the prior estimate v̂(s, θ_t), the new estimate will be v̂(s, θ_{t+1}) = v. Usually one wishes to change more slowly than this, to allow for generalization and stochastic variation in target outputs. For example, one might choose α = 1/(10m), in which case the estimate for the trained state would move one-tenth of the way to the target.

[Figure 9.9: Multiple, overlapping grid tilings on a limited two-dimensional space. These tilings are offset from one another by a uniform amount in each dimension; four active tiles/features overlap the point to be represented and are used to represent it.]

20
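A simplified 2-D tile coder in the spirit of the excerpt above: `num_tilings` tilings of `tiles_per_dim × tiles_per_dim` tiles, each offset by a uniform fraction of a tile width. This is a sketch only (production implementations, e.g. Sutton's tile-coding software, typically use hashing); the input range `[lo, hi)` per dimension is an assumption.

```python
def active_tiles(x, y, num_tilings=4, tiles_per_dim=4, lo=0.0, hi=1.0):
    """Return the indices of the active tiles (exactly one per tiling) for a 2-D state."""
    tile_width = (hi - lo) / tiles_per_dim
    active = []
    for k in range(num_tilings):
        # each tiling is shifted by a fraction of a tile width in both dimensions
        offset = k * tile_width / num_tilings
        ix = int((x - lo + offset) // tile_width) % tiles_per_dim
        iy = int((y - lo + offset) // tile_width) % tiles_per_dim
        active.append(k * tiles_per_dim ** 2 + ix * tiles_per_dim + iy)
    return active  # the binary feature vector has a 1 at each returned index

# with 4 tilings of 4x4 tiles there are 4 * 4 * 4 = 64 features, 4 of them active per state
```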

Page 21: Reinforcement Learning - Function approximation

Function approximators for RL

• Differentiable function approximators, e.g.:

  • linear combinations of features
  • neural networks

• RL-specific problems:
  • non-stationarity
  • non-i.i.d. data
  • bootstrapping
  • delayed targets

21

Page 22: Reinforcement Learning - Function approximation

Stochastic gradient descent (SGD)

• Approximate value function v̂(s, w)
  • differentiable for all s ∈ S

• Weight vector w = (w_1, w_2, ..., w_d)^T
  • w_t denotes the weight vector at time t = 0, 1, 2, ...

• Gradient of f(w):

  ∇f(w) = (∂f(w)/∂w_1, ∂f(w)/∂w_2, ..., ∂f(w)/∂w_d)^T

• Adjust w in the direction that reduces the error on the sample S_t ↦ v_π(S_t):

  w_{t+1} = w_t − ½ α ∇[v_π(S_t) − v̂(S_t, w_t)]²        (squared sample error)
          = w_t + α [v_π(S_t) − v̂(S_t, w_t)] ∇v̂(S_t, w_t)

• α is a step-size parameter
• Why not use α = 1 and eliminate the full error on each sample?

22
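The update above, written out for a generic differentiable approximator. A minimal sketch; `value_fn(s, w)` and `grad_fn(s, w)` are assumed callables returning v̂(s, w) and ∇v̂(s, w), and `v_target` stands in for v_π(S_t).

```python
def sgd_value_update(w, s, v_target, value_fn, grad_fn, alpha=0.01):
    """One step of w <- w + alpha * [v_pi(s) - v_hat(s, w)] * grad v_hat(s, w)."""
    error = v_target - value_fn(s, w)      # sample error v_pi(S_t) - v_hat(S_t, w_t)
    return w + alpha * error * grad_fn(s, w)
```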

Page 23: Reinforcement Learning - Function approximation

Linear methods

• Special case where v̂(·,w) is linear in the weights

• Feature vector x(s) represents state s:

  x(s) = (x_1(s), x_2(s), ..., x_d(s))^T

• Each component x_i(s) is a feature
• Examples:
  • distance of a robot to landmarks
  • presence of a piece on a specific square of a chess board

• Value function is simply the inner product of w and x(s):

  v̂(s, w) = w^T x(s) = Σ_{i=1}^{d} w_i x_i(s)

• Value function is represented as a linear combination of features

• Gradient is simply ∇v̂(s, w) = x(s)

23
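In the linear case both pieces become one line each (NumPy sketch; the feature vector `x_s` is assumed to come from some feature function such as state aggregation or tile coding):

```python
import numpy as np

def v_hat(w, x_s):
    return w @ x_s                          # v_hat(s, w) = w^T x(s)

def linear_sgd_update(w, x_s, v_target, alpha=0.01):
    # for linear v_hat the gradient with respect to w is just x(s)
    return w + alpha * (v_target - w @ x_s) * x_s
```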

Page 24: Reinforcement Learning - Function approximation

Prediction with function approximation

• We assumed the true value function v_π(S_t) is known

• Substitute a target U_t for v_π(S_t):

  w_{t+1} = w_t + α [U_t − v̂(S_t, w_t)] ∇v̂(S_t, w_t)

• U_t might be a noisy or bootstrapped approximation of the true value

• Monte Carlo: U_t = G_t
• TD(0): U_t = R_{t+1} + γ v̂(S_{t+1}, w)
• TD(λ): U_t = G^λ_t

24

Page 25: Reinforcement Learning - Function approximation

Monte Carlo with function approximation

• Target is unbiased by definition:

  E[U_t | S_t = s] = E[G_t | S_t = s] = v_π(s)

• Training data:

  D = {⟨S_1, G_1⟩, ⟨S_2, G_2⟩, ..., ⟨S_{T−1}, G_{T−1}⟩, ⟨S_T, 0⟩}

• Using SGD, w is guaranteed to converge to a local optimum
• MC prediction exhibits local convergence with linear and non-linear function approximation

• SGD update for sample S_t ↦ G_t:

  w ← w + α [G_t − v̂(S_t, w)] ∇v̂(S_t, w)

25
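A sketch of gradient Monte Carlo prediction with linear features. The training-data layout (each episode given as a list of (feature vector, return) pairs, with returns computed beforehand) is an assumption for the illustration.

```python
import numpy as np

def gradient_mc_prediction(episodes, d, alpha=0.01):
    """episodes: iterable of episodes, each a list of (x_s, G_t) pairs."""
    w = np.zeros(d)
    for episode in episodes:
        for x_s, G in episode:
            # w <- w + alpha * [G_t - v_hat(S_t, w)] * grad v_hat = ... * x(S_t)
            w += alpha * (G - w @ x_s) * x_s
    return w
```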

Page 26: Reinforcement Learning - Function approximation

TD with function approximation

• The TD target U_t = R_{t+1} + γ v̂(S_{t+1}, w) is a biased sample of the true value v_π(S_t)

• Training data:

  D = {⟨S_1, R_2 + γ v̂(S_2, w)⟩, ⟨S_2, R_3 + γ v̂(S_3, w)⟩, ..., ⟨S_{T−1}, R_T⟩}

• Semi-gradient update for sample S_t ↦ R_{t+1} + γ v̂(S_{t+1}, w):

  w ← w + α [R_{t+1} + γ v̂(S_{t+1}, w) − v̂(S_t, w)] ∇v̂(S_t, w)

• Linear TD(0):

  w ← w + α [R_{t+1} + γ w^T x(S_{t+1}) − w^T x(S_t)] x(S_t)

26
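The linear TD(0) update as a prediction loop; the Gym-style environment interface, the `policy` callable and the `features` function are assumptions.

```python
import numpy as np

def linear_td0_prediction(env, policy, features, d, num_episodes=100,
                          alpha=0.01, gamma=0.99):
    """Semi-gradient TD(0) prediction with linear value features."""
    w = np.zeros(d)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            x, x_next = features(s), features(s_next)
            v_next = 0.0 if done else w @ x_next       # v_hat(terminal, w) = 0
            # w <- w + alpha * [R + gamma * w^T x(S') - w^T x(S)] * x(S)
            w += alpha * (r + gamma * v_next - w @ x) * x
            s = s_next
    return w
```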

Page 27: Reinforcement Learning - Function approximation

Example: TD-Gammon

• By Gerald Tesauro (1992)
• A multi-layer perceptron (neural network) represents the value function
• Reward is only given at the end of the game (1, 0.5, 0)
• Self-play: use the current policy to sample moves for both sides!
• The same ideas are used in recent work on Go, chess and shogi:
  • AlphaGo Zero (2017)
  • AlphaZero (2017)

27

Page 28: Reinforcement Learning - Function approximation

Control with function approximation

• Control via generalized policy iteration (GPI):
  • policy evaluation: approximate policy evaluation, q̂(·, ·, w) ≈ q_π
  • policy improvement: ε-greedy policy improvement

28

Page 29: Reinforcement Learning - Function approximation

Types of action–value function approximation

• Action as input: q̂(s, a, w) ≈ q_π(s, a)
• Multiple action-value outputs, one per action: q̂_a(s, w) ≈ q_π(s, a)

29

Page 30: Reinforcement Learning - Function approximation

Action-value function approximation

• Approximate the action-value function:

  q̂(s, a, w) ≈ q_π(s, a)

• Linear case:

  q̂(s, a, w) = w^T x(s, a) = Σ_{i=1}^{d} w_i x_i(s, a)
  ∇q̂(s, a, w) = x(s, a)

• Minimize the squared error on samples S_t, A_t ↦ q_π(S_t, A_t):

  [q_π(S_t, A_t) − q̂(S_t, A_t, w)]²

• Use SGD to find a local minimum:

  w_{t+1} = w_t − ½ α ∇[q_π(S_t, A_t) − q̂(S_t, A_t, w_t)]²
          = w_t + α [q_π(S_t, A_t) − q̂(S_t, A_t, w_t)] ∇q̂(S_t, A_t, w_t)

30

Page 31: Reinforcement Learning - Function approximation

Control with function approximation

• Again, we must substitute a target U_t for the true action value q_π(S_t, A_t):

  w_{t+1} = w_t + α [U_t − q̂(S_t, A_t, w_t)] ∇q̂(S_t, A_t, w_t)

• Monte Carlo: U_t = G_t

• One-step Sarsa: U_t = R_{t+1} + γ q̂(S_{t+1}, A_{t+1}, w)

31
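Putting the pieces together gives episodic semi-gradient one-step Sarsa with linear action-value features. A sketch under assumed interfaces: `features(s, a)` returns x(s, a), `actions` is the finite action set, and `env` follows the Gym-style reset/step convention.

```python
import numpy as np

def semi_gradient_sarsa(env, features, actions, d, num_episodes=200,
                        alpha=0.01, gamma=0.99, eps=0.1):
    """Episodic semi-gradient one-step Sarsa with q_hat(s, a, w) = w^T x(s, a)."""
    w = np.zeros(d)

    def q(s, a):
        return w @ features(s, a)

    def eps_greedy(s):
        if np.random.rand() < eps:
            return actions[np.random.randint(len(actions))]
        return max(actions, key=lambda a: q(s, a))

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                target = r                              # no bootstrapping at the end
            else:
                a_next = eps_greedy(s_next)
                target = r + gamma * q(s_next, a_next)  # one-step Sarsa target
            w += alpha * (target - q(s, a)) * features(s, a)
            if not done:
                s, a = s_next, a_next
    return w
```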

Page 32: Reinforcement Learning - Function approximation

Batch reinforcement learning

• Gradient descent is simple

• But it is not sample efficient

• Batch methods seek to find the best-fitting value function

• Given the agent’s experience (“training data”)

32

Page 33: Reinforcement Learning - Function approximation

Least squares prediction

• Approximate value function v̂(s,w) ≈ vπ(s)

• Given experience D sampled using π:

  D = {⟨s_1, v_π(s_1)⟩, ⟨s_2, v_π(s_2)⟩, ..., ⟨s_T, v_π(s_T)⟩}

• Which parameters w give the best fit for v̂?

• Least squares algorithms find the parameter vector w minimising the sum-squared error between v̂(s_i, w) and the true values v_π(s_i):

  Σ_{i=1}^{T} (v_π(s_i) − v̂(s_i, w))² = E_D[(v_π(s) − v̂(s, w))²]

• Sampling ⟨s, v_π(s)⟩ ~ D from experience and applying SGD converges to the least squares solution:

  w_π = argmin_w Σ_{i=1}^{T} (v_π(s_i) − v̂(s_i, w))²

33

Page 34: Reinforcement Learning - Function approximation

Linear least squares prediction (LSTD)

• Recall the linear TD update

  w ← w + α [R_{t+1} + γ w^T x(S_{t+1}) − w^T x(S_t)] x(S_t)

• At the least squares solution w_π the expected update must be zero:

  E_D[α (U − w^T x(S)) x(S)] = 0

  α Σ_{i=1}^{T} (u_i − w^T x(s_i)) x(s_i) = 0

  Σ_{i=1}^{T} x(s_i) u_i = Σ_{i=1}^{T} x(s_i) x(s_i)^T w

  w = (Σ_{i=1}^{T} x(s_i) x(s_i)^T)^{−1} Σ_{i=1}^{T} x(s_i) u_i

• Direct solution is O(d³)

• Incremental solution using the Sherman-Morrison equality is O(d²)

34
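With the TD target U = R + γ w^T x(S′) substituted for u and the w-dependent part moved to the left-hand side, the fixed point above becomes the standard LSTD(0) system A w = b with A = Σ x(s)(x(s) − γ x(s′))^T and b = Σ x(s) r. A direct-solve sketch (O(d³)); the transition tuples of precomputed features are an assumed data layout, and the small ridge term is added only to keep A invertible.

```python
import numpy as np

def lstd(transitions, d, gamma=0.99, reg=1e-6):
    """Solve A w = b for linear TD(0) prediction from (x, r, x_next, done) tuples."""
    A = reg * np.eye(d)                    # regularization keeps A invertible
    b = np.zeros(d)
    for x, r, x_next, done in transitions:
        x_next = np.zeros(d) if done else x_next   # v_hat(terminal) = 0
        A += np.outer(x, x - gamma * x_next)
        b += r * x
    return np.linalg.solve(A, b)           # direct solution, O(d^3)
```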

Page 35: Reinforcement Learning - Function approximation

Recent successes of deep reinforcement learning

images from Human-level control through deep reinforcement learning (Google DeepMind / Nature)

35

Page 36: Reinforcement Learning - Function approximation

Deep (supervised) learning

• A deep representation is a composition of many functions:

  x → h_1 → h_2 → ... → h_n → y,   with weights w_1, w_2, ..., w_{n+1} on the arrows

• Linear transformations followed by non-linear activation functions h_k
• Weight sharing:
  • recurrent neural networks: across time steps
  • convolutional neural networks: across spatial (or temporal) regions

• Weights w optimized by stochastic gradient descent (SGD)
• Powerful function approximation and representation learning:
  • finds compact low-dimensional representations (features)
  • state-of-the-art for image, text and audio data

36

Page 37: Reinforcement Learning - Function approximation

Naive deep Q-learning

• Q-learning update rule:

  Q(s, a) ← Q(s, a) + α (r + γ max_{a′} Q(s′, a′) − Q(s, a))

• Q is represented by a neural network with weights w: q̂(s, a, w)

• Loss is the mean squared TD error:

  L(w) = E[(r + γ max_{a′} q̂(s′, a′, w) − q̂(s, a, w))²]

• Minimize the sample errors with SGD

37
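The loss above, computed for a mini-batch of Q-value predictions (arrays only; the network that produces `q_sa` and `q_next` is left out of this sketch):

```python
import numpy as np

def naive_q_loss(q_sa, q_next, rewards, dones, gamma=0.99):
    """Mean squared TD error L(w) = E[(r + gamma max_a' q(s', a') - q(s, a))^2].

    q_sa:   q_hat(s, a, w) for the actions actually taken, shape (batch,)
    q_next: q_hat(s', ., w) for all actions in the next states, shape (batch, n_actions)
    """
    targets = rewards + gamma * (1.0 - dones) * q_next.max(axis=1)
    return np.mean((targets - q_sa) ** 2)
```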

Page 38: Reinforcement Learning - Function approximation

Stability

Naive Q-learning with neural networks oscillates or diverges:

1. Data is not i.i.d.!
   • trajectories
   • samples are correlated (generated by interaction)
2. Policy changes rapidly with slight changes to Q-values
   • policy may oscillate
3. Reward range is unknown
   • gradients can be large
   • instabilities during back-propagation
4. Maximization bias

38

Page 39: Reinforcement Learning - Function approximation

Deep Q-networks (DQN)

Deep Q-networks (DQN) address instabilities through:

• Experience replay
  • store transitions ⟨S_t, A_t, R_t, S_{t+1}⟩
  • sample random mini-batches
  • removes correlation, restores the i.i.d. property

• Target network
  • a second Q-network (second set of parameters)
  • parameters of the target network are held fixed
  • periodically update the target network parameters

• Reward clipping / normalization
  • clip rewards to r ∈ [−1, 1]
  • batch normalization

39
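A sketch of the experience-replay component; the transition layout matches the bullet above, and the target network would simply be a periodically refreshed copy of the online Q-network's parameters (the networks themselves are not shown).

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of transitions <S_t, A_t, R_t, S_{t+1}, done>."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # uniform random sampling breaks the temporal correlation of trajectories
        batch = random.sample(self.buffer, batch_size)
        return tuple(map(list, zip(*batch)))   # lists of states, actions, rewards, ...

    def __len__(self):
        return len(self.buffer)

# target-network refresh cadence (every C gradient steps), sketched:
#   if step % C == 0:
#       target_params = online_params.copy()
```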

Page 40: Reinforcement Learning - Function approximation

DQN in Atari

"End-to-end" learning:
• state: stack of 4 frames, raw pixels
• action: joystick commands (18 discrete actions)
• reward: change in score

... difficult and engaging for human players. We used the same network architecture, hyperparameter values (see Extended Data Table 1) and learning procedure throughout, taking high-dimensional data (210 × 160 colour video at 60 Hz) as input, to demonstrate that our approach robustly learns successful policies over a variety of games based solely on sensory inputs with only very minimal prior knowledge (that is, merely the input data were visual images, and the number of actions available in each game, but not their correspondences; see Methods). Notably, our method was able to train large neural networks using a reinforcement learning signal and stochastic gradient descent in a stable manner, illustrated by the temporal evolution of two indices of learning (the agent's average score-per-episode and average predicted Q-values; see Fig. 2 and Supplementary Discussion for details).

We compared DQN with the best performing methods from the reinforcement learning literature on the 49 games where results were available (refs 12, 15). In addition to the learned agents, we also report scores for a professional human games tester playing under controlled conditions and a policy that selects actions uniformly at random (Extended Data Table 2 and Fig. 3, denoted by 100% (human) and 0% (random) on the y axis; see Methods). Our DQN method outperforms the best existing reinforcement learning methods on 43 of the games without incorporating any of the additional prior knowledge about Atari 2600 games used by other approaches (for example, refs 12, 15). Furthermore, our DQN agent performed at a level that was comparable to that of a professional human games tester across the set of 49 games, achieving more than 75% of the human score on more than half of the games (29 games; ...

[Figure 1: Schematic illustration of the convolutional neural network. The details of the architecture are explained in the Methods. The input to the neural network consists of an 84 × 84 × 4 image produced by the preprocessing map φ, followed by three convolutional layers (the snaking blue line symbolizes the sliding of each filter across the input image) and two fully connected layers with a single output for each valid action. Each hidden layer is followed by a rectifier nonlinearity (that is, max(0, x)).]

[Figure 2: Training curves tracking the agent's average score and average predicted action-value. a, Each point is the average score achieved per episode after the agent is run with an ε-greedy policy (ε = 0.05) for 520k frames on Space Invaders. b, Average score achieved per episode for Seaquest. c, Average predicted action-value on a held-out set of states on Space Invaders; each point on the curve is the average of the action-value Q computed over the held-out set of states. Note that Q-values are scaled due to clipping of rewards (see Methods). d, Average predicted action-value on Seaquest. See Supplementary Discussion for details.]


image from Human-level control through deep reinforcement learning (Google DeepMind / Nature)

40

Page 41: Reinforcement Learning - Function approximation

DQN in Atari

image from Human-level control through deep reinforcement learning (Google DeepMind / Nature) 41