Q-learning: Watkins, C. J. C. H., and Dayan, P., Q-learning, Machine Learning, 8: 279-292 (1992)
Q-learning
Watkins, C. J. C. H., and Dayan, P., Q-learning,
Machine Learning, 8: 279-292 (1992)
Q value

When an agent takes action $a_t$ in state $s_t$ at time $t$, the predicted future reward is defined as $Q(s_t, a_t)$.
$$Q(s_t, a_t) = E\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots \right]$$
[Figure: from state $s_t$, three candidate actions $a_t^1$, $a_t^2$, $a_t^3$ lead to successor states $s_{t+1}, s_{t+2}, \ldots$ with rewards $r_t, r_{t+1}, \ldots$; example values $Q_1(s_t,a_t)=2$, $Q_2(s_t,a_t)=1$, $Q_3(s_t,a_t)=0$]

Example: generally speaking, the agent should take action $a_t^1$, because the corresponding Q value $Q_1(s_t,a_t)=2$ is the maximum.
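As a minimal illustration of this greedy choice (a sketch assuming a tabular Q function stored as a NumPy array; the variable names are ours, not the paper's):

```python
import numpy as np

# Q values for the three candidate actions in state st,
# taken from the slide's example: Q1 = 2, Q2 = 1, Q3 = 0.
q_values = np.array([2.0, 1.0, 0.0])

# The greedy agent picks the action with the maximum Q value.
greedy_action = int(np.argmax(q_values))
print(greedy_action)  # -> 0, i.e. action a1_t
```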
Q learning

First, the Q value can be rewritten as follows.
$$
\begin{aligned}
Q(s_t, a_t) &= E\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots \right] \\
&= E\left[ r_{t+1} + \sum_{k=0}^{\infty} \gamma^{k+1} r_{t+k+2} \right] \\
&= E\left[ r_{t+1} \right] + \gamma E\left[ \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+2} \right] \\
&= E\left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) \right] \qquad \cdots ①
\end{aligned}
$$
As a result, the Q value at time $t$ can be computed from $r_{t+1}$ and the Q value of the next step (equation ①).
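A quick numeric check of equation ① on a fixed reward sequence (a sketch with an assumed γ = 0.9 and deterministic rewards, so the expectations drop out):

```python
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 1.0, 3.0]  # r_{t+1}, r_{t+2}, ...

def discounted_return(rs):
    """Sum of gamma^k * r over the reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rs))

lhs = discounted_return(rewards)                           # Q(st, at)
rhs = rewards[0] + gamma * discounted_return(rewards[1:])  # r_{t+1} + γ Q(s_{t+1}, a_{t+1})
assert abs(lhs - rhs) < 1e-12
```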
Q learning

Q values are updated at every step. When the agent takes action $a_t$ in state $s_t$ and receives reward $r$, the Q value is updated as follows.
$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ \underbrace{r + \gamma \max_{a} Q(s_{t+1}, a)}_{\text{target value}} - \underbrace{Q(s_t, a_t)}_{\text{current value}} \right]$$

The difference between the target value and the current value is the TD error, and $\alpha$ is the step-size parameter (learning rate).
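The update rule translates directly into code. A minimal sketch, assuming a tabular Q function stored as a (num_states, num_actions) NumPy array (the function name q_update is ours):

```python
import numpy as np

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning update on a tabular Q array."""
    target = r + gamma * np.max(Q[s_next])  # target value
    td_error = target - Q[s, a]             # TD error (target - current)
    Q[s, a] += alpha * td_error             # step toward the target by α
    return Q
```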
Q learning algorithm
```
Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        Choose a from s using policy derived from Q (e.g., greedy, ε-greedy)
        Take action a, observe r, s'
        Q(s,a) ← Q(s,a) + α[ r + γ max_a' Q(s',a') − Q(s,a) ]
        s ← s'
    until s is terminal
```
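The full loop might look as follows in Python; this is an illustrative sketch assuming a Gym-style environment with tabular observation_space.n / action_space.n and the classic reset()/step() API, none of which appear in the paper:

```python
import numpy as np

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.9, eps=0.1):
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(num_episodes):
        s = env.reset()          # initialize s
        done = False
        while not done:          # repeat for each step of episode
            # choose a from s using an ε-greedy policy derived from Q
            if np.random.rand() < eps:
                a = np.random.randint(env.action_space.n)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)   # take action a, observe r, s'
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next           # s ← s'
    return Q
```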
n-step return (reward)

[Figure: backup diagrams over states and actions from the initial state (time $t$) to the terminal state (time $T$), comparing the 1-step (Q-learning), 2-step, ..., n-step, and Monte Carlo backups]

$$
\begin{aligned}
R_t^{(1)} &= r_{t+1} + \gamma Q_t(s_{t+1}, a_{t+1}) \\
R_t^{(2)} &= r_{t+1} + \gamma r_{t+2} + \gamma^2 Q_t(s_{t+2}, a_{t+2}) \\
R_t^{(n)} &= r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n Q_t(s_{t+n}, a_{t+n}) \\
R_t &= r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{T-t-1} r_T
\end{aligned}
$$

The n-step returns are bootstrapping estimates: they truncate after $n$ rewards and fill in the rest with the current Q value. The Monte Carlo return $R_t$ is the complete-experience-based method, using only observed rewards up to the terminal state.
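In code, the n-step and Monte Carlo returns are short helpers (a sketch; the names and the assumed γ = 0.9 default are ours):

```python
def n_step_return(rewards, q_bootstrap, n, gamma=0.9):
    """R_t^(n): n discounted rewards plus a bootstrapped tail.

    rewards     -- [r_{t+1}, r_{t+2}, ...] observed from time t
    q_bootstrap -- Q_t(s_{t+n}, a_{t+n}), the current Q estimate
    """
    head = sum(gamma**k * rewards[k] for k in range(n))
    return head + gamma**n * q_bootstrap

def monte_carlo_return(rewards, gamma=0.9):
    """R_t: the complete-experience return, all rewards to time T."""
    return sum(gamma**k * r for k, r in enumerate(rewards))
```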
n-step return (reward)

$$
\begin{aligned}
Q(s_t, a_t) &= E\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots \right] \\
&= E\left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) \right] \\
&= E\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 Q(s_{t+2}, a_{t+2}) \right] \\
&= E\left[ r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n Q(s_{t+n}, a_{t+n}) \right]
\end{aligned}
$$

In expectation, every n-step return equals $Q(s_t, a_t)$, so any of them can serve as the update target.
λ-return (trace-decay parameter)

[Figure: backup diagram of the λ-return, mixing the 1-step, 2-step, 3-step, ..., n-step, and Monte Carlo returns with weights $(1-\lambda)$, $(1-\lambda)\lambda$, $(1-\lambda)\lambda^2$, ..., $\lambda^{T-t-1}$]

$$R_t^{\lambda} = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t$$

Each n-step return $R_t^{(n)}$ receives weight $(1-\lambda)\lambda^{n-1}$, and the remaining weight $\lambda^{T-t-1}$ goes to the complete (Monte Carlo) return $R_t$; the weights sum to 1.
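The weighting scheme is easy to express directly (a sketch; lambda_return is a hypothetical helper, not the paper's notation):

```python
def lambda_return(n_step_returns, mc_return, lam):
    """Mix R_t^(1) ... R_t^(T-t-1) and the Monte Carlo return R_t.

    n_step_returns -- [R_t^(1), ..., R_t^(T-t-1)]
    mc_return      -- R_t
    """
    mixed = (1 - lam) * sum(lam**(n - 1) * R
                            for n, R in enumerate(n_step_returns, start=1))
    return mixed + lam**len(n_step_returns) * mc_return
```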
λ-return (trace-decay parameter)

[Figure: illustration of the 3-step return as a component of the λ-return]
Eligibility trace and Replacing trace
Eligibility and replacing traces are useful for calculating the n-step return. These traces record how recently and how often each state has been visited.
$$\text{Eligibility trace:}\quad e_t(s) = \begin{cases} \gamma \lambda e_{t-1}(s) & \text{if } s \neq s_t \\ \gamma \lambda e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$$

$$\text{Replacing trace:}\quad e_t(s) = \begin{cases} \gamma \lambda e_{t-1}(s) & \text{if } s \neq s_t \\ 1 & \text{if } s = s_t \end{cases}$$

[Figure: time course of an eligibility (accumulating) trace versus a replacing trace as the same state is visited repeatedly]
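Both trace updates share the same decay step and differ only at the visited state. A minimal sketch over a NumPy vector of per-state traces (the replacing flag is our own device for showing both variants):

```python
import numpy as np

def update_traces(e, s_t, gamma=0.9, lam=0.9, replacing=True):
    """Decay every state's trace, then bump the visited state s_t."""
    e *= gamma * lam      # e(s) ← γλ e(s) for all s
    if replacing:
        e[s_t] = 1.0      # replacing trace: reset to 1
    else:
        e[s_t] += 1.0     # eligibility (accumulating) trace: add 1
    return e
```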
Q(λ) algorithm
Q-learning:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$
Q(λ) with replacing trace
$$\delta = r + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)$$

where $r + \gamma \max_a Q(s_{t+1}, a)$ is the target value and $Q(s_t, a_t)$ is the current value. The trace of the visited pair is reset, $e(s_t, a_t) \leftarrow 1$, and then, for all $s, a$:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \delta e(s, a)$$

$$e(s, a) \leftarrow \gamma \lambda e(s, a)$$
Q(λ) algorithm
```
Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g., ε-greedy)
        a* ← argmax_b Q(s',b)   (if a' ties for the max, then a* ← a')
        δ ← r + γ Q(s',a*) − Q(s,a)
        e(s,a) ← 1
        For all s, a:
            Q(s,a) ← Q(s,a) + α δ e(s,a)
            If a' = a*, then e(s,a) ← γλ e(s,a)
            else e(s,a) ← 0
        s ← s'; a ← a'
    until s is terminal
```
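Put together, the pseudocode above might be rendered in Python as below; this is an illustrative sketch under the same Gym-style environment assumptions as the earlier Q-learning loop, and eps_greedy is a hypothetical helper:

```python
import numpy as np

def eps_greedy(Q, s, eps):
    """ε-greedy action selection from the tabular Q function."""
    if np.random.rand() < eps:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def watkins_q_lambda(env, num_episodes=500, alpha=0.1, gamma=0.9,
                     lam=0.9, eps=0.1):
    nS, nA = env.observation_space.n, env.action_space.n
    Q = np.zeros((nS, nA))
    for _ in range(num_episodes):
        e = np.zeros((nS, nA))                  # e(s,a) = 0 for all s, a
        s = env.reset()
        a = eps_greedy(Q, s, eps)
        done = False
        while not done:
            s2, r, done, _ = env.step(a)        # take a, observe r, s'
            a2 = eps_greedy(Q, s2, eps)         # choose a' from s'
            a_star = int(np.argmax(Q[s2]))
            if Q[s2, a2] == Q[s2, a_star]:      # a' ties for the max
                a_star = a2
            delta = r + gamma * Q[s2, a_star] - Q[s, a]
            e[s, a] = 1.0                       # replacing trace
            Q += alpha * delta * e              # update all (s,a) at once
            if a2 == a_star:
                e *= gamma * lam                # decay all traces
            else:
                e[:] = 0.0                      # cut traces after exploring
            s, a = s2, a2
    return Q
```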