
Reinforcement Learning: RL in finite MDPs

Marcello Restelli

March–April, 2015


RL techniques

Model–free vs Model–based
On–policy vs Off–policy
Online vs Offline
Tabular vs Function Approximation
Value–based vs Policy–based vs Actor–Critic


RL problems

Model–free Prediction: estimate the value function of an unknown MRP (MDP + policy)
Model–free Control: optimize the value function of an unknown MDP


Monte–Carlo Reinforcement Learning

MC methods learn directly from episodes of experience
MC is model–free: no knowledge of MDP transitions/rewards
MC learns from complete episodes: no bootstrapping
MC uses the simplest possible idea: value = mean return
Caveat: MC can only be applied to episodic MDPs

All episodes must terminate


Monte Carlo for Prediction and Control

MC can be used for prediction:
Input: episodes of experience {s_1, a_1, r_2, ..., s_T} generated by following policy π in a given MDP
or: episodes of experience {s_1, a_1, r_2, ..., s_T} generated by an MRP
Output: value function V_π

Or for control:
Input: episodes of experience {s_1, a_1, r_2, ..., s_T} in a given MDP
Output: optimal value function V*
Output: optimal policy π*


Estimation of Mean: Monte Carlo

Let X be a random variable with mean µ = E[X] and variance σ² = Var[X]. Let x_i ∼ X, i = 1, ..., n be n i.i.d. realizations of X.
Empirical mean of X:

µ̂_n = (1/n) Σ_{i=1}^{n} x_i

We have E[µ̂_n] = µ and Var[µ̂_n] = Var[X]/n
Weak law of large numbers: µ̂_n →^P µ  (lim_{n→∞} P(|µ̂_n − µ| > ε) = 0)
Strong law of large numbers: µ̂_n →^{a.s.} µ  (P(lim_{n→∞} µ̂_n = µ) = 1)
Central limit theorem: √n (µ̂_n − µ) →^D N(0, Var[X])


Monte–Carlo Policy Evaluation

Goal: learn V_π from experience under policy π

s_1, a_1, r_2, ..., s_T ∼ π

Recall that the return is the total discounted reward:

v_t = r_{t+1} + γ r_{t+2} + · · · + γ^{T−1} r_{t+T}

Recall that the value function is the expected return:

V_π(s) = E[v_t | s_t = s]

Monte Carlo policy evaluation uses the empirical mean return instead of the expected return

first visit: average returns only for the first time s is visited (unbiased estimator)
every visit: average returns for every time s is visited (biased but consistent estimator)


First–Visit Monte–Carlo Policy Evaluation

Initialize:
    π ← policy to be evaluated
    V ← an arbitrary state–value function
    Returns(s) ← an empty list, for all s ∈ S
loop
    Generate an episode using π
    for each state s in the episode do
        R ← return following the first occurrence of s
        Append R to Returns(s)
        V(s) ← average(Returns(s))
    end for
end loop
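
As a concrete illustration, here is a minimal Python sketch of first–visit Monte–Carlo policy evaluation. It is not from the slides: the sample_episode callback, the dictionary-based tables, and the discount handling are assumptions made for the example. Every–visit MC (next slide) differs only in that the first-visit check is dropped.

```python
from collections import defaultdict

def first_visit_mc(sample_episode, num_episodes, gamma=1.0):
    """First-visit MC policy evaluation.

    sample_episode() is assumed to return a list of (state, reward) pairs,
    where reward is the reward received after leaving that state.
    """
    returns = defaultdict(list)   # Returns(s)
    V = defaultdict(float)        # arbitrary initial value function

    for _ in range(num_episodes):
        episode = sample_episode()
        # Compute the return following each time step (backwards pass).
        G = 0.0
        returns_from = [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            _, r = episode[t]
            G = r + gamma * G
            returns_from[t] = G
        # Average only the return following the FIRST visit of each state.
        seen = set()
        for t, (s, _) in enumerate(episode):
            if s not in seen:
                seen.add(s)
                returns[s].append(returns_from[t])
                V[s] = sum(returns[s]) / len(returns[s])
    return V
```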


Every–Visit Monte–Carlo Policy Evaluation

Initialize:
    π ← policy to be evaluated
    V ← an arbitrary state–value function
    Returns(s) ← an empty list, for all s ∈ S
loop
    Generate an episode using π
    for each state s in the episode do
        for each occurrence of state s in the episode do
            R ← return following this occurrence of s
            Append R to Returns(s)
            V(s) ← average(Returns(s))
        end for
    end for
end loop


First–Visit vs Every–Visit

Model: [diagram: two states, S and T; from S the process either loops back to S or terminates in T, with probabilities 1 − p and p]

Reward is +1 on every step

Sample Path: S → S → S → S → T

V_FV(S) = 4,   V_EV(S) = 2.5

Max Likelihood Model: [diagram: S loops back to S with probability 0.75 and terminates in T with probability 0.25]


First–Visit vs Every–Visit: Crossover


Blackjack Example

Goal: have your card sum be greater than the dealer's without exceeding 21
States (200 of them):
    current sum (12–21)
    dealer's showing card (ace–10)
    do I have a usable ace?
Reward: +1 for winning, 0 for a draw, −1 for losing
Actions: stand (stop receiving cards), hit (receive another card)
Policy: stand if my sum is 20 or 21, else hit


Blackjack Example: After Monte–Carlo Learning


Incremental Mean

The mean µ̂_1, µ̂_2, ... of a sequence x_1, x_2, ... can be computed incrementally:

µ̂_k = (1/k) Σ_{j=1}^{k} x_j
     = (1/k) ( x_k + Σ_{j=1}^{k−1} x_j )
     = (1/k) ( x_k + (k − 1) µ̂_{k−1} )
     = µ̂_{k−1} + (1/k) ( x_k − µ̂_{k−1} )
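
The recursion can be checked with a few lines of Python; this tiny sketch (function name and test values are mine) just verifies that the incremental update reproduces the batch mean.

```python
def incremental_mean(xs):
    """Running means mu_1, mu_2, ... via the incremental update."""
    mu = 0.0
    means = []
    for k, x in enumerate(xs, start=1):
        mu = mu + (x - mu) / k      # mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k
        means.append(mu)
    return means

xs = [2.0, 4.0, 9.0, 1.0]
print(incremental_mean(xs)[-1])    # 4.0
print(sum(xs) / len(xs))           # 4.0, same as the batch mean
```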


Incremental Monte–Carlo Updates

Update V(s) incrementally after episode s_1, a_1, r_2, ..., s_T
For each state s_t with return v_t:

N(s_t) ← N(s_t) + 1
V(s_t) ← V(s_t) + (1/N(s_t)) (v_t − V(s_t))

In non–stationary problems, it is useful to track a running mean, i.e., forget old episodes:

V(s_t) ← V(s_t) + α (v_t − V(s_t))


Stochastic Approximation: Estimation of Mean

Let X be a random variable in [0, 1] with mean µ = E[X]. Let x_i ∼ X, i = 1, ..., n be n i.i.d. realizations of X.
Consider the estimator (exponential average)

µ̂_i = (1 − α_i) µ̂_{i−1} + α_i x_i,

with µ̂_1 = x_1, where the α_i are step–size parameters or learning rates.

Proposition
If Σ_{i≥0} α_i = ∞ and Σ_{i≥0} α_i² < ∞, then µ̂_n →^{a.s.} µ, i.e., the estimator µ̂_n is consistent.

Note: the step sizes α_i = 1/i satisfy the above conditions. In this case, the exponential average gives us the empirical mean µ̂_n = (1/n) Σ_{i=1}^{n} x_i, which is consistent according to the strong law of large numbers.


Monte–Carlo Backups

Entire episode included
Only one choice at each state (unlike DP)
MC does not bootstrap
Time required to estimate one state does not depend on the total number of states


Temporal Difference Learning

TD methods learn directly from episodes of experience
TD is model–free: no knowledge of MDP transitions/rewards
TD learns from incomplete episodes: bootstrapping
TD updates a guess towards a guess


TD Prediction

Goal: learn V_π online from experience under policy π
Recall: incremental every–visit Monte Carlo

V(s_t) ← V(s_t) + α (v_t − V(s_t))

Simplest temporal–difference learning algorithm: TD(0)
Update value V(s_t) towards the estimated return r_{t+1} + γ V(s_{t+1}):

V(s_t) ← V(s_t) + α (r_{t+1} + γ V(s_{t+1}) − V(s_t))

r_{t+1} + γ V(s_{t+1}) is called the TD target
δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t) is called the TD error
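
A minimal TD(0) prediction loop might look as follows in Python; the env interface (reset/step returning (next_state, reward, done)) and the tabular dictionary are assumptions of the sketch, not something defined in the slides.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=0.99):
    """TD(0): update V(s_t) towards the TD target r_{t+1} + gamma * V(s_{t+1}).

    Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done).
    """
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            target = r + (0.0 if done else gamma * V[s_next])  # TD target
            V[s] += alpha * (target - V[s])                    # alpha times TD error
            s = s_next
    return V
```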


Driving Home Example

State                Elapsed Time (min)   Predicted Time to Go   Predicted Total Time
leaving office       0                    30                     30
reach car, raining   5                    35                     40
exit highway         20                   15                     35
behind truck         30                   10                     40
home street          40                   3                      43
arrive home          43                   0                      43


Driving Home Example: MC vs TD

Changes recommended by Monte Carlo methods (α = 1)

Changes recommended by TD methods (α = 1)


Comparison between MC and TD

TD can learn before knowing the final outcome
TD can learn online after every step
MC must wait until the end of the episode before the return is known

TD can learn without the final outcome
TD can learn from incomplete sequences
MC can only learn from complete sequences
TD works in continuing (non–terminating) environments
MC only works for episodic (terminating) environments


Bias–Variance Trade–Off: Conceptual Definition

Error due to bias: the difference between the expected prediction of our model and the true value we want to predict
Error due to variance: the variability of a model prediction for a given data point


Bias–Variance Trade–Off: Graphical Definition


Bias–Variance Trade–Off: Mathematical Definition

We want to predict Y = f(X) + ε, where the error term ε is normally distributed, ε ∼ N(0, σ_ε)
We estimate a model f̂(X), with the following prediction error at point x:

Err(x) = E[(Y − f̂(x))²]

This error can be decomposed into bias and variance:

Err(x) = (f(x) − E[f̂(x)])² + E[(E[f̂(x)] − f̂(x))²] + σ_ε²

Err(x) = Bias² + Variance + Irreducible Error


Bias–Variance Trade–Off: MC vs TD

The return v_t = r_{t+1} + γ r_{t+2} + · · · + γ^{T−1} r_{t+T} is an unbiased estimate of V_π(s_t)
The TD target r_{t+1} + γ V(s_{t+1}) is a biased estimate of V_π(s_t)
    Unless V(s_{t+1}) = V_π(s_{t+1})
But the TD target has much lower variance:
    The return depends on many random actions, transitions, and rewards
    The TD target depends on one random action, transition, and reward


Bias–Variance comparison between MC and TD

MC has high variance, zero bias
    Good convergence properties
    Works well with function approximation
    Not very sensitive to initial values
    Very simple to understand and use

TD has low variance, some bias
    Usually more efficient than MC
    TD(0) converges to V_π(s)
    Problems with function approximation
    More sensitive to initial values


Random Walk Example


Random Walk: MC vs TD


Batch MC and TD

MC and TD converge: V(s) → V_π(s) as experience → ∞
But what about the batch solution for finite experience?

s_1^1, a_1^1, r_2^1, ..., s_{T_1}^1
...
s_1^K, a_1^K, r_2^K, ..., s_{T_K}^K

e.g., repeatedly sample episode k ∈ [1, K]
Apply MC or TD(0) to episode k


AB Example

Two states A, B
Undiscounted
8 episodes of experience:

A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0

What is V(A), V(B)?

[Diagram: A moves to B with probability 100%; from B the episode terminates with reward 1 (75% of the time) or reward 0 (25% of the time)]


Certainty Equivalence

MC converges to the solution with minimum mean–squared error
Best fit to the observed returns:

Σ_{k=1}^{K} Σ_{t=1}^{T_k} (v_t^k − V(s_t^k))²

In the AB example, V(A) = 0
TD(0) converges to the solution of the maximum likelihood Markov model
Solution to the MDP 〈S, A, P̂, R̂, γ, µ〉 that best fits the data:

P̂(s′|s, a) = (1/N(s, a)) Σ_{k=1}^{K} Σ_{t=1}^{T_k} 1(s_t^k, a_t^k, s_{t+1}^k = s, a, s′)
R̂(s, a) = (1/N(s, a)) Σ_{k=1}^{K} Σ_{t=1}^{T_k} 1(s_t^k, a_t^k = s, a) r_t^k

In the AB example, V(A) = 0.75


Comparison between MC and TD: Markov Property

TD exploits the Markov property
    Usually more efficient in Markov environments

MC does not exploit the Markov property
    Usually more efficient in non–Markov environments


MC vs TD vs DP

MC: V(s_t) ← V(s_t) + α [v_t − V(s_t)],  where v_t is the actual return following state s_t
TD: V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]
DP: V(s_t) ← E_π[r_{t+1} + γ V(s_{t+1})]

        Bootstrapping   Sampling
MC      No              Yes
TD      Yes             Yes
DP      Yes             No


n–Step Prediction

Let TD target look n steps into the future


n–Step Return

Consider the following n–step returns for n = 1, 2, ..., ∞:

n = 1 (TD):   v_t^(1) = r_{t+1} + γ V(s_{t+1})
n = 2:        v_t^(2) = r_{t+1} + γ r_{t+2} + γ² V(s_{t+2})
...
n = ∞ (MC):   v_t^(∞) = r_{t+1} + γ r_{t+2} + · · · + γ^{T−1} r_T

Define the n–step return

v_t^(n) = r_{t+1} + γ r_{t+2} + · · · + γ^{n−1} r_{t+n} + γ^n V(s_{t+n})

n–step temporal–difference learning:

V(s_t) ← V(s_t) + α (v_t^(n) − V(s_t))
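
For reference, a small helper that computes v_t^(n) from a stored trajectory might look like this; the list layout (rewards[k] = r_{k+1}, values[k] = V(s_k)) is an assumption of the sketch.

```python
def n_step_return(rewards, values, t, n, gamma):
    """v_t^(n) = r_{t+1} + ... + gamma^{n-1} r_{t+n} + gamma^n V(s_{t+n}).

    rewards[k] is the reward received when leaving step k; values[k] is the
    current estimate V(s_k). If t + n runs past the end of the episode, the
    return truncates to the Monte-Carlo return (no bootstrap term).
    """
    T = len(rewards)
    G = 0.0
    steps = min(n, T - t)
    for k in range(steps):
        G += (gamma ** k) * rewards[t + k]
    if t + n < len(values):          # bootstrap only if s_{t+n} exists
        G += (gamma ** n) * values[t + n]
    return G
```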


Large Random Walk Example


Averaging n–step Returns

We can average n–step returns over different n
e.g., average the 2–step and 4–step returns:

(1/2) v^(2) + (1/2) v^(4)

Combines information from two different time–steps
Can we efficiently combine information from all time–steps?


λ–return

The λ–return v_t^λ combines all n–step returns v_t^(n)
Using weight (1 − λ) λ^{n−1}:

v_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} v_t^(n)

Forward–view TD(λ):

V(s_t) ← V(s_t) + α (v_t^λ − V(s_t))


TD(λ) Weighting Function

v_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} v_t^(n)


Forward–view TD(λ)

Update the value function towards the λ–return
The forward view looks into the future to compute v_t^λ
Like MC, it can only be computed from complete episodes


Forward–view TD(λ) on Large Random Walk


Backward–view TD(λ)

Forward view provides theory
Backward view provides mechanism
Update online, every step, from incomplete sequences


Eligibility Traces

Credit assignment problem: did bell or light cause shock?

Frequency heuristic: assign credit to the most frequent states

Recency heuristics: assign credit to the most recent states

Eligibility traces combine both heuristics

et+1(s) = γλet(s) + 1(s = st)


Backward–view TD(λ)

Update value V(s) for every state s
In proportion to the TD error δ_t and the eligibility trace e_t(s):

e_0(s) = 0
e_t(s) = γλ e_{t−1}(s) + 1(s = s_t)
V(s) ← V(s) + α δ_t e_t(s)


Backward–view TD(λ) Algorithm

Initialize V(s) arbitrarily
for all episodes do
    e(s) = 0, ∀s ∈ S
    Initialize s
    repeat
        a ← action given by π for s
        Take action a, observe reward r and next state s′
        δ ← r + γ V(s′) − V(s)
        e(s) ← e(s) + 1
        for all s ∈ S do
            V(s) ← V(s) + α δ e(s)
            e(s) ← γλ e(s)
        end for
        s ← s′
    until s is terminal
end for
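
A direct Python transcription of this backward–view TD(λ) loop, with the same accumulating traces, could look like the sketch below; the env/policy interfaces are assumed, as in the earlier TD(0) sketch.

```python
from collections import defaultdict

def td_lambda(env, policy, num_episodes, alpha=0.1, gamma=0.99, lam=0.9):
    """Backward-view TD(lambda) with accumulating eligibility traces."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        e = defaultdict(float)            # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
            e[s] += 1.0                   # accumulating trace (replacing trace: e[s] = 1.0)
            for state in list(e.keys()):  # update every state in proportion to its trace
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam
            s = s_next
    return V
```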


TD(λ) and TD(0)

When λ = 0, only the current state is updated:

e_t(s) = 1(s = s_t)
V(s) ← V(s) + α δ_t e_t(s)

This is exactly equivalent to the TD(0) update:

V(s_t) ← V(s_t) + α δ_t


Telescoping in TD(1)

When λ = 1, the sum of TD errors telescopes into the MC error:

δ_t + γ δ_{t+1} + γ² δ_{t+2} + · · · + γ^{T−1} δ_{t+T−1}
    = r_{t+1} + γ V(s_{t+1}) − V(s_t)
    + γ r_{t+2} + γ² V(s_{t+2}) − γ V(s_{t+1})
    + γ² r_{t+3} + γ³ V(s_{t+3}) − γ² V(s_{t+2})
    ...
    + γ^{T−1} r_{t+T} + γ^T V(s_{t+T}) − γ^{T−1} V(s_{t+T−1})
    = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + · · · + γ^{T−1} r_{t+T} − V(s_t)
    = v_t − V(s_t)


TD(λ) and TD(1)

TD(1) is roughly equivalent to every–visit Monte–Carlo
Error is accumulated online, step–by–step
If the value function is only updated offline at the end of the episode, then the total update is exactly the same as MC
If the value function is updated online after every step, then TD(1) may have a different total update from MC


Telescoping in TD(λ)

For general λ, the TD errors also telescope to the λ–error, v_t^λ − V(s_t):

v_t^λ − V(s_t) = −V(s_t) + (1 − λ)λ⁰ (r_{t+1} + γ V(s_{t+1}))
                        + (1 − λ)λ¹ (r_{t+1} + γ r_{t+2} + γ² V(s_{t+2}))
                        + (1 − λ)λ² (r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ V(s_{t+3}))
                        + ...
              = −V(s_t) + (γλ)⁰ (r_{t+1} + γ V(s_{t+1}) − γλ V(s_{t+1}))
                        + (γλ)¹ (r_{t+2} + γ V(s_{t+2}) − γλ V(s_{t+2}))
                        + (γλ)² (r_{t+3} + γ V(s_{t+3}) − γλ V(s_{t+3}))
                        + ...
              = (γλ)⁰ (r_{t+1} + γ V(s_{t+1}) − V(s_t))
              + (γλ)¹ (r_{t+2} + γ V(s_{t+2}) − V(s_{t+1}))
              + (γλ)² (r_{t+3} + γ V(s_{t+3}) − V(s_{t+2}))
              + ...
              = δ_t + γλ δ_{t+1} + (γλ)² δ_{t+2} + ...


Forwards and Backwards TD(λ)

Consider an episode where s is visited once at time–step k
The TD(λ) eligibility trace discounts time since the visit:

e_t(s) = γλ e_{t−1}(s) + 1(s_t = s)
       = { 0             if t < k
         { (γλ)^{t−k}    if t ≥ k

Backward TD(λ) updates accumulate error online:

Σ_{t=1}^{T} α δ_t e_t(s) = α Σ_{t=k}^{T} (γλ)^{t−k} δ_t = α (v_k^λ − V(s_k))

By the end of the episode it accumulates the total error for the λ–return
For multiple visits to s, e_t(s) accumulates many errors


Equivalence of Forward and Backward TD

Theorem
The sum of offline updates is identical for forward–view and backward–view TD(λ):

Σ_{t=1}^{T} α δ_t e_t(s) = Σ_{t=1}^{T} α (v_t^λ − V(s_t)) 1(s_t = s)

In practice, the value function is updated online by TD(λ)
But if α is small then the equivalence is almost exact


Replacing Traces

Using accumulating traces, frequently visited states can have eligibilities greater than 1
This can be a problem for convergence
Replacing traces: instead of adding 1 when you visit a state, set that trace to 1

e_t(s) = { γλ e_{t−1}(s)   if s ≠ s_t
         { 1               if s = s_t


Use of Model–Free Control

Some example problems that can be modeled as MDPs:

Elevator
Parallel Parking
Ship Steering
Bioreactor
Helicopter
Airplane Logistics
Robocup Soccer
Quake
Portfolio management
Protein folding
Robot walking
Game of Go

For most of these problems, either:
    the MDP model is unknown, but experience can be sampled
    the MDP model is known, but is too big to use, except by samples

Model–free control can solve these problems


On and Off–Policy Learning

On–policy learning
    “Learn on the job”
    Learn about policy π from experience sampled from π

Off–policy learning
    “Learn over someone's shoulder”
    Learn about policy π from experience sampled from a different behavior policy π̄


Generalized Policy Iteration (Refresher)

Policy evaluation: estimate V_π
    e.g., iterative policy evaluation
Policy improvement: generate π′ ≥ π
    e.g., greedy policy improvement


Generalized Policy Iteration with Monte–Carlo Evaluation

Policy Evaluation: Monte–Carlo policy evaluation, V = V_π?
Policy Improvement: greedy policy improvement?


Model–Free Policy Iteration Using the Action–Value Function

Greedy policy improvement over V(s) requires a model of the MDP:

π′(s) = argmax_{a∈A} { R(s, a) + γ Σ_{s′∈S} P(s′|s, a) V(s′) }

Greedy policy improvement over Q(s, a) is model–free:

π′(s) = argmax_{a∈A} Q(s, a)


Generalized Policy Iteration with Monte–Carlo Evaluation

Policy Evaluation: Monte–Carlo policy evaluation, Q = Q_π

Policy Improvement: greedy policy improvement?


On–Policy Exploration

There are two doors in front of you

You open the left door and get reward 0, V(left) = 0

You open the right door and get reward +1, V(right) = +1

You open the right door and get reward +3, V(right) = +2

You open the right door and get reward +2, V(right) = +2

...

Are you sure you've chosen the best door?


ε–Greedy Exploration

Simplest idea for ensuring continual exploration
All m actions are tried with non–zero probability
With probability 1 − ε choose the greedy action
With probability ε choose an action at random

π(s, a) = { ε/m + 1 − ε   if a = a* = argmax_{a′∈A} Q(s, a′)
          { ε/m           otherwise
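
In code, ε–greedy selection is only a few lines; the sketch below assumes a tabular Q stored as a dict keyed by (state, action). Drawing the random action uniformly over all m actions gives the greedy action total probability ε/m + 1 − ε, matching the formula above.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon pick a uniformly random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)          # exploration: any of the m actions
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploitation
```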


ε–Greedy Policy Improvement

Theorem
For any ε–greedy policy π, the ε–greedy policy π′ with respect to Q_π is an improvement.

Q_π(s, π′(s)) = Σ_{a∈A} π′(a|s) Q_π(s, a)
             = (ε/m) Σ_{a∈A} Q_π(s, a) + (1 − ε) max_{a∈A} Q_π(s, a)
             ≥ (ε/m) Σ_{a∈A} Q_π(s, a) + (1 − ε) Σ_{a∈A} [(π(a|s) − ε/m) / (1 − ε)] Q_π(s, a)
             = Σ_{a∈A} π(a|s) Q_π(s, a) = V_π(s)

Therefore, from the policy improvement theorem, V_{π′}(s) ≥ V_π(s)


Monte–Carlo Policy Iteration

Policy Evaluation: Monte–Carlo policy evaluation, Q = Q_π

Policy Improvement: ε–greedy policy improvement


Monte–Carlo Control

Every episode:
Policy Evaluation: Monte–Carlo policy evaluation, Q ≈ Q_π

Policy Improvement: ε–greedy policy improvement


GLIE

Definition
Greedy in the Limit with Infinite Exploration (GLIE)
All state–action pairs are explored infinitely many times:

lim_{k→∞} N_k(s, a) = ∞

The policy converges on a greedy policy:

lim_{k→∞} π_k(a|s) = 1(a = argmax_{a′∈A} Q_k(s, a′))


GLIE Monte–Carlo Control

Sample the k–th episode using π: {s_1, a_1, r_2, ..., s_T} ∼ π
For each state s_t and action a_t in the episode:

N(s_t, a_t) ← N(s_t, a_t) + 1
Q(s_t, a_t) ← Q(s_t, a_t) + (1/N(s_t, a_t)) (v_t − Q(s_t, a_t))

Improve the policy based on the new action–value function:

ε ← 1/k
π ← ε–greedy(Q)

Theorem
GLIE Monte–Carlo control converges to the optimal action–value function, Q(s, a) → Q*(s, a)
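
A possible tabular implementation of GLIE Monte–Carlo control with ε = 1/k is sketched below; the run_episode callback and the every–visit style of the incremental update are assumptions of the sketch, not prescribed by the slides.

```python
import random
from collections import defaultdict

def glie_mc_control(run_episode, actions, num_episodes, gamma=1.0):
    """GLIE Monte-Carlo control with epsilon = 1/k.

    run_episode(policy) is assumed to return a list of (state, action, reward)
    triples obtained by following the given policy in the environment.
    """
    Q = defaultdict(float)
    N = defaultdict(int)

    def policy(s, epsilon):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for k in range(1, num_episodes + 1):
        eps = 1.0 / k                                    # GLIE schedule
        episode = run_episode(lambda s: policy(s, eps))
        G = 0.0
        for s, a, r in reversed(episode):                # incremental (every-visit) updates
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return Q
```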


Relevant Time Scales

There are three main time scales:
1. Behavioral time scale 1/(1 − γ) (discount factor)
2. Sampling in the estimation of the Q–function, α (learning rate)
3. Exploration, ε (e.g., for the ε–greedy strategy)

1 − γ ≫ α ≫ ε

Initially 1 − γ ≈ α ≈ ε is possible
Then decrease ε faster than α
Practically, you can choose a number of trials M < ∞ and set α ∼ 1 − m/M and ε ∼ (1 − m/M)², m = 1, ..., M
In some cases, γ should be initialized to low values and then gradually moved towards its correct value


MC vs TD Control

Temporal–Difference (TD) learning has several advantages over Monte–Carlo (MC):
    Lower variance
    Online
    Incomplete sequences

Natural idea: use TD instead of MC in our control loop
    Apply TD to Q(s, a)
    Use ε–greedy policy improvement
    Update every time–step


On–Policy Control with SARSA

Q(s,a)← Q(s,a) + α(r + γQ(s′,a′)−Q(s,a))

Every time–step:
Policy Evaluation: SARSA, Q ≈ Q_π

Policy Improvement: ε–greedy policy improvement


SARSA Algorithm for On–Policy Control

Initialize Q(s, a) arbitrarily
loop
    Initialize s
    Choose a from s using the policy derived from Q (e.g., ε–greedy)
    repeat
        Take action a, observe r, s′
        Choose a′ from s′ using the policy derived from Q (e.g., ε–greedy)
        Q(s, a) ← Q(s, a) + α [r + γ Q(s′, a′) − Q(s, a)]
        s ← s′; a ← a′
    until s is terminal
end loop
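
The same algorithm in Python, under the assumed env.reset()/env.step() interface used in the earlier sketches:

```python
import random
from collections import defaultdict

def sarsa(env, actions, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA; env.reset() -> state, env.step(a) -> (next_state, reward, done)."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])   # SARSA update
            s, a = s_next, a_next
    return Q
```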


Convergence of SARSA

Theorem
SARSA converges to the optimal action–value function, Q(s, a) → Q*(s, a), under the following conditions:
    GLIE sequence of policies π_t(s, a)
    Robbins–Monro sequence of step–sizes α_t:

    Σ_{t=1}^{∞} α_t = ∞,   Σ_{t=1}^{∞} α_t² < ∞


SARSA Example: Windy Gridworld


SARSA Example: Results in the Windy Gridworld


SARSA with Eligibility Traces

Forward view: update the action–value Q(s, a) towards the λ–return v_t^λ
Backward view: use eligibility traces for state–action pairs:

e_t(s, a) = γλ e_{t−1}(s, a) + 1(s_t, a_t = s, a)


SARSA(λ) Algorithm

Initialize Q(s, a) arbitrarily
loop
    e(s, a) = 0, for all s, a
    Initialize s, a
    repeat
        Take action a, observe r, s′
        Choose a′ from s′ using the policy derived from Q (e.g., ε–greedy)
        δ ← r + γ Q(s′, a′) − Q(s, a)
        e(s, a) ← e(s, a) + 1
        for all s, a do
            Q(s, a) ← Q(s, a) + α δ e(s, a)
            e(s, a) ← γλ e(s, a)
        end for
        s ← s′; a ← a′
    until s is terminal
end loop


SARSA(λ) Gridworld Example


Off–Policy Learning

Learn about target policy π(a|s)
While following behavior policy π̄(a|s)
Why is this important?
    Learn from observing humans or other agents
    Re–use experience generated from old policies π_1, π_2, ..., π_{t−1}
    Learn about the optimal policy while following an exploratory policy
    Learn about multiple policies while following one policy


Importance Sampling

Estimate the expectation under a distribution different from the one used to draw samples:

E_{x∼P}[f(x)] = Σ_x P(x) f(x)
             = Σ_x Q(x) (P(x)/Q(x)) f(x)
             = E_{x∼Q}[(P(x)/Q(x)) f(x)]
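
A small numerical check of this identity (the distributions, f, and sample size below are chosen arbitrarily for the example):

```python
import random

def importance_sampling_mean(f, sample_q, p_pdf, q_pdf, n):
    """Estimate E_{x~P}[f(x)] using n samples drawn from Q, weighted by P(x)/Q(x)."""
    total = 0.0
    for _ in range(n):
        x = sample_q()
        total += (p_pdf(x) / q_pdf(x)) * f(x)
    return total / n

# Toy check: P uniform on {0,1,2,3}, Q biased; estimate E_P[x] = 1.5.
values = [0, 1, 2, 3]
q_probs = [0.1, 0.2, 0.3, 0.4]
est = importance_sampling_mean(
    f=lambda x: x,
    sample_q=lambda: random.choices(values, weights=q_probs)[0],
    p_pdf=lambda x: 0.25,
    q_pdf=lambda x: q_probs[x],
    n=100_000,
)
print(est)   # close to 1.5
```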


Importance Sampling for Off–Policy Monte–Carlo

Use returns generated from the behavior policy π̄ to evaluate π
Weight the return v_t according to the similarity between the policies
Multiply importance sampling corrections along the whole episode:

v_t^{π̄} = (π(a_t|s_t)/π̄(a_t|s_t)) (π(a_{t+1}|s_{t+1})/π̄(a_{t+1}|s_{t+1})) · · · (π(a_T|s_T)/π̄(a_T|s_T)) v_t

Update the value towards the corrected return:

Q(s_t, a_t) ← Q(s_t, a_t) + α (v_t^{π̄} − Q(s_t, a_t))

Cannot be used if π̄ is zero where π is non–zero
Importance sampling can dramatically increase variance


Importance Sampling for Off–Policy Monte–Carlo: Derivation

Off–policy MC is derived from the expected return:

Q_π(s, a) = E_π[v_t | s_t = s, a_t = a]
          = Σ P_π[s_1, a_1, r_2, ..., s_T] v_t
          = Σ P[s_1] ( Π_{t=1}^{T} π̄(s_t, a_t) P(s_t|s_{t−1}, a_{t−1}) π(s_t, a_t)/π̄(s_t, a_t) ) v_t
          = E_{π̄}[ Π_{t=1}^{T} (π(s_t, a_t)/π̄(s_t, a_t)) v_t | s_t = s, a_t = a ]


Off–Policy MC Control

Initialize, for all s ∈ S, a ∈ A:
    Q(s, a) ← arbitrary
    N(s, a) ← 0
    D(s, a) ← 0
    π ← an arbitrary deterministic policy
loop
    Using a behavior policy π̄, generate an episode s_0, a_0, r_1, s_1, a_1, r_2, ..., s_{T−1}, a_{T−1}, r_T, s_T
    τ ← latest time at which a_τ ≠ π(s_τ)
    for all pairs s, a appearing in the episode after τ do
        t ← the time of first occurrence (after τ) of s, a
        w ← Π_{k=t+1}^{T−1} 1/π̄(s_k, a_k)
        N(s, a) ← N(s, a) + w R_t   (R_t is the return following time t)
        D(s, a) ← D(s, a) + w
        Q(s, a) ← N(s, a)/D(s, a)
    end for
    for all s ∈ S do
        π(s) ← argmax_{a∈A} Q(s, a)
    end for
end loop


Importance Sampling for Off–Policy SARSA

Use TD targets generated from π̄ to evaluate π
Weight the TD target r + γ Q(s′, a′) according to the similarity between the policies
Only a single importance sampling correction is needed:

Q(s_t, a_t) ← Q(s_t, a_t) + α ( r_{t+1} + γ (π(a_{t+1}|s_{t+1})/π̄(a_{t+1}|s_{t+1})) Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t) )

Much lower variance than Monte–Carlo importance sampling
Policies only need to be similar over a single step


Importance Sampling for Off–Policy SARSA: Bellman Expectation Equation

Off–policy SARSA comes from the Bellman expectation equation for Q_π(s, a):

Q_π(s, a) = E_π[r_{t+1} + γ Q_π(s_{t+1}, a_{t+1}) | s_t = s, a_t = a]
          = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) Σ_{a′∈A} π(a′|s′) Q_π(s′, a′)
          = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) Σ_{a′∈A} π̄(a′|s′) (π(a′|s′)/π̄(a′|s′)) Q_π(s′, a′)
          = E_{π̄}[r_{t+1} + γ (π(a_{t+1}|s_{t+1})/π̄(a_{t+1}|s_{t+1})) Q_π(s_{t+1}, a_{t+1}) | s_t = s, a_t = a]


Off–Policy Control with Q–learning

Learn about the optimal policy π = π*
From experience sampled from a behavior policy π̄
Estimate Q(s, a) ≈ Q*(s, a)
The behavior policy can depend on Q(s, a)
    e.g., π̄ could be ε–greedy with respect to Q(s, a)
As Q(s, a) → Q*(s, a), the behavior policy π̄ improves


Q–learning Algorithm

Q(s, a) ← Q(s, a) + α (r + γ max_{a′∈A} Q(s′, a′) − Q(s, a))

Initialize Q(s, a) arbitrarily
loop
    Initialize s
    repeat
        Choose a from s using the policy derived from Q (e.g., ε–greedy)
        Take action a, observe r, s′
        Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]
        s ← s′
    until s is terminal
end loop
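
A tabular Q–learning sketch under the same assumed environment interface; note that the behavior policy is ε–greedy while the update bootstraps from the greedy (max) action, which is what makes it off–policy:

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning: epsilon-greedy behavior, greedy (max) target."""
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```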


SARSA vs Q–learning: Cliff Walking


Q–learning vs SARSA

SARSA: Q(s, a) ← Q(s, a) + α [r + γ Q(s′, a′) − Q(s, a)]  (on–policy)
Q–learning: Q(s, a) ← Q(s, a) + α [r + γ max_{a′} Q(s′, a′) − Q(s, a)]  (off–policy)
In the cliff–walking task:
    Q–learning learns the optimal policy along the edge
    SARSA learns a safe, non–optimal policy away from the edge
With an ε–greedy algorithm:
    For ε ≠ 0, SARSA performs better online
    If ε → 0 gradually, both converge to the optimal policy


Q–Learning, the TD(λ) way

How can we extend this to Q–learning?
If you mark every state–action pair as eligible, you back up over a non–greedy policy
Watkins:
    Zero out the eligibility trace after a non–greedy action
    Do the max when backing up at the first non–greedy choice

e_t(s, a) = { 1 + γλ e_{t−1}(s, a)   if s = s_t, a = a_t, Q_{t−1}(s_t, a_t) = max_a Q_{t−1}(s_t, a)
            { 0                      if Q_{t−1}(s_t, a_t) ≠ max_a Q_{t−1}(s_t, a)
            { γλ e_{t−1}(s, a)       otherwise

Q_{t+1}(s, a) = Q_t(s, a) + α δ_t e_t(s, a)
δ_t = r_{t+1} + γ max_{a′} Q_t(s_{t+1}, a′) − Q_t(s_t, a_t)


Peng’s Q(λ)

Disadvantage of Watkins' method:
    Early in learning, the eligibility trace will be “cut” (zeroed out) frequently, resulting in little advantage from traces
Peng:
    Back up the max action except at the end
    Never cut traces
Disadvantage:
    Complicated to implement


Relationship Between DP and TD

Full Backup (DP)                                      Sample Backup (TD)
Iterative Policy Evaluation                           TD Learning
V(s) ← E_π[r + γ V(s′) | s]                           V(s) ←_α r + γ V(s′)
Q–Policy Iteration                                    SARSA
Q(s, a) ← E_π[r + γ Q(s′, a′) | s, a]                 Q(s, a) ←_α r + γ Q(s′, a′)
Q–Value Iteration                                     Q–learning
Q(s, a) ← E[r + γ max_{a′∈A} Q(s′, a′) | s, a]        Q(s, a) ←_α r + γ max_{a′∈A} Q(s′, a′)

(←_α denotes an incremental update with step size α)


Actor–Critic Method

Explicit representation of the policy as well as the value function
Minimal computation to select actions
Can learn an explicit stochastic policy
Can put constraints on policies
Appealing as psychological and neural models


Actor–Critic Method: Some Details

The TD error is used to evaluate actions:

δ_t = r_{t+1} + γ V(s_{t+1}) − V(s_t)

If actions are determined by preferences p(s, a) as follows:

π_t(a|s) = e^{p(s,a)} / Σ_b e^{p(s,b)}

then the preferences can be updated as follows:

p(s_t, a_t) ← p(s_t, a_t) + α δ_t

An example of two time–scale stochastic approximation
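
A minimal tabular actor–critic sketch along these lines (softmax over preferences, critic and actor both driven by the TD error); the interfaces and step sizes are assumptions of the example:

```python
import math
import random
from collections import defaultdict

def actor_critic(env, actions, num_episodes,
                 alpha_critic=0.1, alpha_actor=0.1, gamma=0.99):
    """One-step actor-critic with a softmax-over-preferences policy (tabular)."""
    V = defaultdict(float)     # critic
    p = defaultdict(float)     # actor preferences p(s, a)

    def sample_action(s):
        prefs = [math.exp(p[(s, a)]) for a in actions]   # softmax over preferences
        return random.choices(actions, weights=prefs)[0]

    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = sample_action(s)
            s_next, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # TD error
            V[s] += alpha_critic * delta                  # critic update
            p[(s, a)] += alpha_actor * delta              # actor preference update
            s = s_next
    return p, V
```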


Average Reward Optimality

Average reward of state s under policy π:

ρ_π(s) = lim_{N→∞} (1/N) E[ Σ_{t=0}^{N−1} R(s_t, π(s_t)) | s_0 = s, π ]

If the MDP is unichain, then the average reward of policy π is

ρ_π = lim_{N→∞} (1/N) E[ Σ_{t=0}^{N−1} R(s_t, π(s_t)) | π ]

Optimal policy: the policy π* with maximum average reward

ρ_{π*} ≥ ρ_π, ∀π


Average Adjusted Value Function

Differential (or Relative, or Average–Adjusted) Value Function:

V_π(s) = Σ_{k=1}^{∞} E[r_{t+k} − ρ_π | s_t = s]

Theorem
For any unichain MDP, there exists a value function V* and a scalar ρ* satisfying the equation

V*(s) + ρ* = max_{a∈A} [ R(s, a) + Σ_{s′∈S} P(s′|s, a) V*(s′) ]

such that the greedy policy π* resulting from V* achieves the optimal average reward ρ* = ρ_{π*}, where ρ_{π*} ≥ ρ_π over all policies π


R–Learning

Initialize ρ and Q(s, a), for all s, a, arbitrarily
loop
    s ← current state
    Choose action a in s using a behavior policy (e.g., ε–greedy)
    Take action a, observe r, s′
    Q(s, a) ← Q(s, a) + α [r − ρ + max_{a′} Q(s′, a′) − Q(s, a)]
    if Q(s, a) = max_a Q(s, a) then
        ρ ← ρ + β [r − ρ + max_{a′} Q(s′, a′) − max_a Q(s, a)]
    end if
end loop


Model–Based and Model–Free RL

Model–Free RL
    No model
    Learn a policy and/or value function from experience

Model–Based RL
    Learn a model from experience
    Plan a policy and/or value function from the model


Pros and Cons of Model–Based RL

Advantages:
    Can efficiently learn a model by supervised learning methods
    Can reason about model uncertainty
Disadvantages:
    First learn a model, then construct a value function
        → two sources of approximation error
    Learning a value function directly has only one source of approximation error


Model–Based RL


What is a Model?

A model M_θ is a representation of an MDP 〈S, A, P, R, γ, µ〉, parameterized by θ
We will assume the state space S and action space A are known
So a model M = 〈P_θ, R_θ〉 represents state transitions P_θ ≈ P and rewards R_θ ≈ R:

s_{t+1} ∼ P_θ(s_{t+1}|s_t, a_t)
r_{t+1} = R_θ(r_{t+1}|s_t, a_t)

We can assume conditional independence between state transitions and rewards:

P[s_{t+1}, r_{t+1}|s_t, a_t] = P[s_{t+1}|s_t, a_t] P[r_{t+1}|s_t, a_t]


Model Learning

Goal: estimate a model M_θ from experience {s_1, a_1, r_2, ..., s_T}
This is a supervised learning problem:

s_1, a_1 → r_2, s_2
s_2, a_2 → r_3, s_3
...
s_{T−1}, a_{T−1} → r_T, s_T

Learning s, a → r is a regression problem
Learning s, a → s′ is a density estimation problem
Pick a loss function, e.g., mean–squared error, KL divergence, ...
Find parameters θ that minimize the empirical loss


Examples of Models

Table Lookup Model
Linear Expectation Model
Linear Gaussian Model
Gaussian Process Model
Deep Belief Network Model
...


Table Lookup Model

The model is an explicit MDP, P̂, R̂
Count visits N(s, a) to each state–action pair:

P̂(s′|s, a) = (1/N(s, a)) Σ_{t=1}^{T} 1(s_t, a_t, s_{t+1} = s, a, s′)
R̂(s, a) = (1/N(s, a)) Σ_{t=1}^{T} 1(s_t, a_t = s, a) r_t

Alternatively:
At each timestep t, record the experience tuple 〈s_t, a_t, r_{t+1}, s_{t+1}〉
To sample the model, randomly pick a tuple matching 〈s, a, ·, ·〉
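
A count-based table lookup model is straightforward to code; this sketch (class and method names are mine) keeps the visit counts and exposes both the estimated P̂/R̂ and a sample() method for planning. It assumes sample() is only called for (s, a) pairs that have been visited at least once.

```python
import random
from collections import defaultdict

class TableLookupModel:
    """Count-based table lookup model: P_hat and R_hat from visit counts."""

    def __init__(self):
        self.counts = defaultdict(int)                             # N(s, a)
        self.next_counts = defaultdict(lambda: defaultdict(int))   # N(s, a, s')
        self.reward_sums = defaultdict(float)                      # sum of rewards for (s, a)

    def update(self, s, a, r, s_next):
        self.counts[(s, a)] += 1
        self.next_counts[(s, a)][s_next] += 1
        self.reward_sums[(s, a)] += r

    def transition_prob(self, s, a, s_next):
        n = self.counts[(s, a)]
        return self.next_counts[(s, a)][s_next] / n if n else 0.0

    def expected_reward(self, s, a):
        n = self.counts[(s, a)]
        return self.reward_sums[(s, a)] / n if n else 0.0

    def sample(self, s, a):
        """Sample (s', r_hat) from the estimated model, as used for planning."""
        succ = self.next_counts[(s, a)]
        states, weights = zip(*succ.items())
        s_next = random.choices(states, weights=weights)[0]
        return s_next, self.expected_reward(s, a)
```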


Planning with a Model

Given a model M_θ = 〈P_θ, R_θ〉
Solve the MDP 〈S, A, P_θ, R_θ〉
Using your favorite planning algorithm:
    Value iteration
    Policy iteration
    Tree search
    ...


Planning with an Inaccurate Model

Given an imperfect model 〈P_θ, R_θ〉 ≠ 〈P, R〉
The performance of model–based RL is limited to the optimal policy for the approximate MDP 〈S, A, P_θ, R_θ〉
i.e., model–based RL is only as good as the estimated model
When the model is inaccurate, the planning process will compute a suboptimal policy
Solution 1: reason explicitly about model uncertainty
Solution 2: when the model is wrong, use model–free RL


Bayesian Model–Learning

Given a prior distribution over models P[θ]
and experience D = {s_1, a_1, r_2, ..., s_T}
Compute the posterior distribution over models P[θ|D]
Using Bayes' rule:

P[θ|D] ∝ P[D|θ] P[θ]

The posterior distribution P[θ|D] explicitly represents uncertainty


Bayesian Model–Based RL

Given the posterior model distribution P[θ|D]
Plan with respect to the full distribution
    Model–based RL: one–world model
    Bayesian model–based RL: many–worlds model
Integrate the value over many possible worlds, e.g.

V_bayes^π(s) = E_{P[θ|D]}[ E_π[v_t | s_t = s, M_θ] ]
V_bayes^*(s) = max_π V_bayes^π(s)

V_bayes^* is Bayes–optimal w.r.t. the prior P[θ]
This finds the optimal stationary policy π*
    Ignoring value–of–information
But the planning problem is harder than before


Real and Simulated Experience

We consider two sources of experience
Real experience: sampled from the environment (true MDP)

s′ ∼ P(s′|s, a)
r = R(s, a)

Simulated experience: sampled from the model (approximate MDP)

s′ ∼ P_θ(s′|s, a)
r = R_θ(s, a)


Integrating Learning and Planning

Model–free RL
    No model
    Learn a policy and/or value function from real experience

Model–based RL
    Learn a model from real experience
    Plan a policy and/or value function from simulated experience

Dyna
    Learn a model from real experience
    Learn and plan a policy and/or value function from real and simulated experience


Dyna Architecture


Dyna–Q Algorithm

Initialize Q(s,a) and Model(s,a) for all s ∈ S and a ∈ A(s)loop

s ← current (nonterminal) statea← ε–greedy(s,Q)Execute action a; observe resultant state, s′, and reward, rQ(s,a)← Q(s,a) + α[r + γmaxa′ Q(s′,a′)−Q(s,a)]Model(s,a)← s′, r (assuming deterministic environment)for N times do

s ← random previously observed statea← random action previously taken in ss′, r ← Model(s,a)Q(s,a)← Q(s,a) + α[r + γmaxa′ Q(s′,a′)−Q(s,a)]

end forend loop
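
A compact Python sketch of Dyna–Q under the same assumed environment interface; like the pseudocode, it assumes a deterministic environment so the model can store a single (s′, r) outcome per state–action pair:

```python
import random
from collections import defaultdict

def dyna_q(env, actions, num_steps, n_planning=10,
           alpha=0.1, gamma=0.95, epsilon=0.1):
    """Tabular Dyna-Q: direct RL + model learning + planning from simulated experience."""
    Q = defaultdict(float)
    model = {}                               # Model(s, a) -> (s', r, done)

    def q_update(s, a, r, s_next, done):
        best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    s = env.reset()
    for _ in range(num_steps):
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        s_next, r, done = env.step(a)
        q_update(s, a, r, s_next, done)      # direct RL step
        model[(s, a)] = (s_next, r, done)    # model learning step (deterministic)
        for _ in range(n_planning):          # planning: replay simulated experience
            (ps, pa), (ps_next, pr, pdone) = random.choice(list(model.items()))
            q_update(ps, pa, pr, ps_next, pdone)
        s = env.reset() if done else s_next
    return Q
```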


Dyna–Q on a Simple Maze


Dyna–Q with an Inaccurate Model: Hard Case


Dyna–Q with an Inaccurate Model: Easy Case