Q-learning
Watkins, C. J. C. H., and Dayan, P., Q-learning, Machine Learning, 8: 279-292 (1992)


TRANSCRIPT

Page 1:

Q-learning

Watkins, C. J. C. H., and Dayan, P., Q-learning, Machine Learning, 8: 279-292 (1992)

Page 2:

Q value

When an agent takes action a_t in state s_t at time t, the predicted future reward is defined as Q(s_t, a_t):

Q(s_t, a_t) = E[ r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + γ^3 r_{t+4} + … ]

[Figure: a trajectory s_t, a_t → r_{t+1}, s_{t+1}, a_{t+1} → r_{t+2}, s_{t+2}, a_{t+2} → …, with three candidate actions a_t^1, a_t^2, a_t^3 branching from s_t.]

Example) Q_1(s_t, a_t^1) = 2, Q_2(s_t, a_t^2) = 1, Q_3(s_t, a_t^3) = 0.
Generally speaking, the agent should take action a_t^1, because the corresponding Q value Q_1(s_t, a_t^1) is the maximum.
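As a small illustration (not from the slides), the Python sketch below computes the discounted reward sum that Q(s_t, a_t) is defined to predict, for one sampled trajectory; the reward values and γ = 0.9 are assumed.

    # Discounted return for one sampled reward sequence (illustrative values).
    gamma = 0.9
    rewards = [1.0, 0.0, 2.0, 1.0]   # assumed samples of r_{t+1}, r_{t+2}, r_{t+3}, r_{t+4}

    discounted_return = sum(gamma**k * r for k, r in enumerate(rewards))
    print(discounted_return)   # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*1.0 ≈ 3.349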

Page 3:

Q learning

First, the Q value can be transformed as follows:

Q(s_t, a_t) = E[ r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + γ^3 r_{t+4} + … ]
            = E[ Σ_{k=0}^{∞} γ^k r_{t+k+1} ]
            = E[ r_{t+1} + γ Σ_{k=0}^{∞} γ^k r_{t+k+2} ]
            = E[ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) ]          … ①

As a result, the Q value at time t can be calculated simply from r_{t+1} and the Q value of the next step.
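A small numerical check of equation ① without the expectation (the reward sequence and γ are assumed values): the full discounted sum from time t equals r_{t+1} plus γ times the discounted sum from time t+1.

    import math

    gamma = 0.9
    rewards = [1.0, 0.0, 2.0, 1.0, 0.5]   # assumed samples of r_{t+1}, r_{t+2}, ...

    def discounted_sum(rs, gamma):
        # sum_k gamma^k * rs[k]
        return sum(gamma**k * r for k, r in enumerate(rs))

    lhs = discounted_sum(rewards, gamma)                            # return from time t
    rhs = rewards[0] + gamma * discounted_sum(rewards[1:], gamma)   # r_{t+1} + γ * (return from t+1)
    print(math.isclose(lhs, rhs))   # True: the two sides agree (up to rounding)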

Page 4:

Q learning

The Q value is updated at every step. When the agent takes action a_t in state s_t and receives reward r, the Q value is updated as follows:

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]

Here r + γ max_a Q(s_{t+1}, a) is the target value, Q(s_t, a_t) is the current value, and their difference is the TD error.

α: step size parameter (learning rate)
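As a minimal sketch of this update rule (the array-based state/action encoding and the values of α and γ are assumptions made for illustration):

    import numpy as np

    # One-step Q-learning update; table sizes, alpha and gamma are illustrative.
    n_states, n_actions = 5, 2
    Q = np.zeros((n_states, n_actions))
    alpha, gamma = 0.1, 0.9

    def q_update(Q, s, a, r, s_next):
        # Q(s,a) <- Q(s,a) + alpha * [ r + gamma * max_a' Q(s',a') - Q(s,a) ]
        td_target = r + gamma * np.max(Q[s_next])   # target value
        td_error = td_target - Q[s, a]              # TD error
        Q[s, a] += alpha * td_error                 # move current value toward the target
        return td_error

    # Example: one transition (s=0, a=1, r=1.0, s'=2)
    q_update(Q, 0, 1, 1.0, 2)
    print(Q[0, 1])   # 0.1 = alpha * (1.0 + 0.9*0 - 0)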

Page 5:

Q learning algorithm

Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        Choose a from s using policy derived from Q (e.g., greedy, ε-greedy)
        Take action a, observe r, s'
        Q(s,a) ← Q(s,a) + α [ r + γ max_{a'} Q(s',a') - Q(s,a) ]
        s ← s'
    until s is terminal
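A runnable sketch of this algorithm on a toy chain environment; the environment (5 states in a row, actions left/right, reward +1 at the right end), the parameter values, and the episode count are assumptions made for the illustration, not from the slides.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions = 5, 2
    alpha, gamma, epsilon = 0.1, 0.9, 0.1

    def step(s, a):
        # Toy dynamics: 0 = left, 1 = right; state 4 is terminal and gives reward +1 on entry.
        s_next = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
        r = 1.0 if s_next == n_states - 1 else 0.0
        return s_next, r, s_next == n_states - 1

    Q = np.zeros((n_states, n_actions))              # initialize Q(s,a) arbitrarily (here: zeros)

    for episode in range(500):                       # repeat for each episode
        s = int(rng.integers(n_states - 1))          # initialize s (random non-terminal start)
        done = False
        while not done:                              # repeat for each step of episode
            # choose a from s using an ε-greedy policy derived from Q
            a = int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))
            s_next, r, done = step(s, a)             # take action a, observe r, s'
            # Q(s,a) <- Q(s,a) + α [ r + γ max_a' Q(s',a') - Q(s,a) ]
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next                               # s <- s'

    print(np.argmax(Q, axis=1))   # learned greedy policy: should prefer "right" (1) in non-terminal states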

Page 6:

n-step return (reward)

[Figure: backup diagrams of action/state sequences from the initial state (time t) to the terminal state (time T), comparing the 1-step return (Q-learning), the 2-step return, the n-step return, and the Monte Carlo return. The 1-step return uses boot-strapping; the Monte Carlo return is the complete-experience-based method.]

R_t^(1) = r_{t+1} + γ Q(s_{t+1}, a_{t+1})

R_t^(2) = r_{t+1} + γ r_{t+2} + γ^2 Q(s_{t+2}, a_{t+2})

R_t^(n) = r_{t+1} + γ r_{t+2} + … + γ^{n-1} r_{t+n} + γ^n Q(s_{t+n}, a_{t+n})

R_t = r_{t+1} + γ r_{t+2} + … + γ^{T-t-1} r_T          (Monte Carlo return)
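As a small sketch of the n-step return formula (the reward sequence, γ, and the bootstrap Q estimate are assumed values, not from the slides):

    gamma = 0.9

    def n_step_return(rewards, q_bootstrap, n, gamma):
        # R_t^(n) = r_{t+1} + γ r_{t+2} + ... + γ^{n-1} r_{t+n} + γ^n Q(s_{t+n}, a_{t+n})
        # rewards[k] holds r_{t+k+1}; q_bootstrap plays the role of Q(s_{t+n}, a_{t+n}).
        ret = sum(gamma**k * rewards[k] for k in range(n))
        return ret + gamma**n * q_bootstrap

    rewards = [1.0, 0.0, 2.0]   # r_{t+1}, r_{t+2}, r_{t+3}
    # Example: the 2-step return with an assumed bootstrap estimate Q(s_{t+2}, a_{t+2}) = 0.5
    print(n_step_return(rewards, q_bootstrap=0.5, n=2, gamma=gamma))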

Page 7:

n-step return (reward)

Expanding the Q value one step at a time, every n-step expansion has the same expectation:

Q(s_t, a_t) = E[ r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + γ^3 r_{t+4} + … ]
            = E[ r_{t+1} + γ Q(s_{t+1}, a_{t+1}) ]
            = E[ r_{t+1} + γ r_{t+2} + γ^2 Q(s_{t+2}, a_{t+2}) ]
            = …
            = E[ r_{t+1} + γ r_{t+2} + … + γ^{n-1} r_{t+n} + γ^n Q(s_{t+n}, a_{t+n}) ]
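A small sketch illustrating this page's point on a single trajectory (rewards and γ are assumed values): if the bootstrap term is the exact discounted sum of the remaining rewards, every n-step expansion gives the same value.

    gamma = 0.9
    rewards = [1.0, 0.0, 2.0, 1.0, 0.5]   # assumed samples of r_{t+1}, r_{t+2}, ...

    def tail_value(k):
        # Exact discounted sum of rewards from step t+k+1 onward;
        # this plays the role of Q(s_{t+k}, a_{t+k}) in the expansion.
        return sum(gamma**i * r for i, r in enumerate(rewards[k:]))

    for n in range(1, len(rewards) + 1):
        expansion = sum(gamma**k * rewards[k] for k in range(n)) + gamma**n * tail_value(n)
        print(n, round(expansion, 6))      # the same value for every n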

Page 8:

λ-return (trace-decay parameter)

[Figure: the λ-return averages the 1-step, 2-step, 3-step, …, n-step, and Monte Carlo returns, with weights (1-λ), (1-λ)λ, (1-λ)λ^2, …, and λ^{T-t-1} for the Monte Carlo return.]

R_t^λ = (1-λ) Σ_{n=1}^{∞} λ^{n-1} R_t^(n)

For an episode that terminates at time T:

R_t^λ = (1-λ) Σ_{n=1}^{T-t-1} λ^{n-1} R_t^(n) + λ^{T-t-1} R_t
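A small sketch of the episodic λ-return formula; the rewards, the assumed Q estimates used to bootstrap each n-step return, and the values of γ and λ are made up for the illustration.

    gamma, lam = 0.9, 0.8

    def lambda_return(rewards, q_values, gamma, lam):
        # R_t^λ = (1-λ) Σ_{n=1}^{T-t-1} λ^{n-1} R_t^(n) + λ^{T-t-1} R_t
        # rewards[k] is r_{t+k+1}; q_values[k] is the estimate Q(s_{t+k+1}, a_{t+k+1})
        # used to bootstrap the (k+1)-step return; len(rewards) = T - t.
        horizon = len(rewards)
        n_step = [sum(gamma**k * rewards[k] for k in range(n)) + gamma**n * q_values[n - 1]
                  for n in range(1, horizon)]
        mc = sum(gamma**k * rewards[k] for k in range(horizon))   # Monte Carlo return R_t
        weighted = (1 - lam) * sum(lam**(n - 1) * r for n, r in enumerate(n_step, start=1))
        return weighted + lam**(horizon - 1) * mc

    rewards = [1.0, 0.0, 2.0]     # r_{t+1}, r_{t+2}, r_T (so T - t = 3)
    q_values = [0.5, 0.4, 0.0]    # assumed Q estimates along the trajectory
    print(lambda_return(rewards, q_values, gamma, lam))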

Page 9:

λ-return (trace-decay parameter)

[Figure: the 3-step return as one component of the λ-return.]

Page 10:

Eligibility trace and Replacing trace

Eligibility and replacing traces are useful for calculating the n-step return. These traces show how often each state is visited.

Eligibility trace:
e_t(s) = γλ e_{t-1}(s)          (s ≠ s_t)
e_t(s) = γλ e_{t-1}(s) + 1      (s = s_t)

Replacing trace:
e_t(s) = γλ e_{t-1}(s)          (s ≠ s_t)
e_t(s) = 1                      (s = s_t)

[Figure: eligibility trace vs. replacing trace over time as the same state is visited repeatedly.]
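A small sketch of the two trace-update rules for a tabular state space; the visited-state sequence and the values of γ and λ are assumptions for illustration.

    import numpy as np

    gamma, lam = 0.9, 0.8
    n_states = 3
    e_acc = np.zeros(n_states)   # eligibility (accumulating) trace
    e_rep = np.zeros(n_states)   # replacing trace

    visits = [0, 1, 1, 2, 1]     # s_t at successive time steps (assumed)
    for s_t in visits:
        e_acc *= gamma * lam
        e_acc[s_t] += 1.0        # eligibility trace: γλ e(s) + 1 at the visited state
        e_rep *= gamma * lam
        e_rep[s_t] = 1.0         # replacing trace: reset to 1 at the visited state
        print(e_acc.round(3), e_rep.round(3))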

Page 11:

Q(λ) algorithm

Q-learning:
Q(s_t, a_t) ← Q(s_t, a_t) + α [ r + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t) ]
(target value: r + γ max_a Q(s_{t+1}, a); current value: Q(s_t, a_t))

Q(λ) with replacing trace:
δ ← r + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t)
e_t(s_t) ← 1
for all s, a:
    Q(s,a) ← Q(s,a) + α δ e(s,a)
    e(s,a) ← γλ e(s,a)

Page 12:

Q(λ) algorithm

Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g., ε-greedy)
        a* ← argmax_b Q(s',b)   (if a' ties for the max, then a* ← a')
        δ ← r + γ Q(s',a*) - Q(s,a)
        e(s,a) ← 1
        For all s, a:
            Q(s,a) ← Q(s,a) + α δ e(s,a)
            If a' = a*, then e(s,a) ← γλ e(s,a)
            else e(s,a) ← 0
        s ← s'; a ← a'
    until s is terminal
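A runnable sketch of this Q(λ) algorithm with replacing traces, reusing the toy chain environment assumed in the earlier Q-learning sketch; the environment, parameter values, and episode count are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    n_states, n_actions = 5, 2
    alpha, gamma, lam, epsilon = 0.1, 0.9, 0.8, 0.1

    def step(s, a):
        # Same toy chain as before: 0 = left, 1 = right, reward +1 on reaching terminal state 4.
        s_next = max(s - 1, 0) if a == 0 else min(s + 1, n_states - 1)
        return s_next, (1.0 if s_next == n_states - 1 else 0.0), s_next == n_states - 1

    def epsilon_greedy(Q, s):
        return int(rng.integers(n_actions)) if rng.random() < epsilon else int(np.argmax(Q[s]))

    Q = np.zeros((n_states, n_actions))          # initialize Q(s,a) arbitrarily
    for episode in range(500):
        e = np.zeros((n_states, n_actions))      # e(s,a) = 0 for all s, a
        s = int(rng.integers(n_states - 1))      # initialize s (random non-terminal start)
        a = epsilon_greedy(Q, s)                 # initialize a
        done = False
        while not done:
            s_next, r, done = step(s, a)         # take action a, observe r, s'
            a_next = epsilon_greedy(Q, s_next)   # choose a' from s'
            a_star = int(np.argmax(Q[s_next]))
            if Q[s_next, a_next] == Q[s_next, a_star]:
                a_star = a_next                  # if a' ties for the max, then a* <- a'
            delta = r + gamma * Q[s_next, a_star] - Q[s, a]
            e[s, a] = 1.0                        # replacing trace
            Q += alpha * delta * e               # Q(s,a) <- Q(s,a) + α δ e(s,a), for all s, a
            if a_next == a_star:
                e *= gamma * lam                 # decay traces after a greedy action
            else:
                e[:] = 0.0                       # cut traces after an exploratory action
            s, a = s_next, a_next

    print(np.argmax(Q, axis=1))                  # greedy policy: should prefer "right" in non-terminal states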