an introduction to reinforcement learning (rl)
DESCRIPTION
Slides from Aditya Tarigoppula's talk at NYC Machine Learning on December 13th.
![Page 1: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/1.jpg)
An Introduction to Reinforcement Learning (RL) and RL Brain Machine Interface (RL-BMI)
Aditya Tarigoppula www.joefrancislab.com
SUNY Downstate Medical Center
![Page 2: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/2.jpg)
Outline
RL Examples
Environment
Value functions
Optimality
Methods for attaining optimality
DP MC TD
BMI & RL-BMI
Eligibility Traces
START / END
![Page 3: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/3.jpg)
RL Examples
Stanford Autonomous Helicopter: http://heli.stanford.edu/
Reinforcement Learning Brain Machine Interface: Joe Francis Lab.
![Page 4: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/4.jpg)
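Environment model - Markov decision process
1) States ‘S’
2) Actions ‘A’
3) State transition probabilities: $P^{a}_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s,\ a_t = a\}$, with $\sum_{s'} P^{a}_{ss'} = 1$
4) Reward: $R_t = r_{t+1} + r_{t+2} + r_{t+3} + \dots + r_T$, or with discounting $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$, where $0 \le \gamma \le 1$
5) Policy $\pi : s \to a$ (deterministic, non-stationary policy)
RL Problem: The decision maker, the ‘agent’, needs to learn the optimum policy in an ‘environment’ to maximize the total amount of reward it receives over the long term.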
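The reward sum that the agent tries to maximize can be sketched in a few lines of Python. This is only an illustration: the function name `discounted_return` and the sample reward list are assumptions made for this example, not something taken from the slides.

```python
def discounted_return(rewards, gamma=1.0):
    """Return R_t = r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ...

    `rewards` lists r_{t+1}, r_{t+2}, ... in the order they were received.
    gamma = 1.0 gives the plain (undiscounted) episodic sum.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    total = 0.0
    # Work backwards so each step applies R_t = r_{t+1} + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical reward sequence, chosen only for illustration.
rewards = [1.0, 0.0, 2.0]
undiscounted = discounted_return(rewards)           # 1 + 0 + 2
discounted = discounted_return(rewards, gamma=0.5)  # 1 + 0.5*0 + 0.25*2
```

With a discount factor below 1, later rewards count less, which is what keeps the sum finite for tasks that never terminate.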
![Page 5: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/5.jpg)
• Agent performs the action under the policy being followed.
• Environment is everything else other than the agent.
Figure label: $P^{a}_{ss'}$ (state transition probabilities)
![Page 6: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/6.jpg)
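Value Functions:
State Value Function
$V^{\pi}(s) = E_{\pi}\{R_t \mid s_t = s\} = E_{\pi}\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\}$
$\quad = E_{\pi}\{r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_t = s\}$
$\quad = \sum_{a} \pi(s,a) \sum_{s'} P^{a}_{ss'} [R^{a}_{ss'} + \gamma V^{\pi}(s')]$
State-Action Value Function
$Q^{\pi}(s,a) = E_{\pi}\{R_t \mid s_t = s,\ a_t = a\} = E_{\pi}\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s,\ a_t = a\}$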
![Page 7: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/7.jpg)
Optimal Value Function:
Optimal Policy: a policy that is better than or equal to all other policies (in the sense of maximizing expected reward) is called an optimal policy.
Optimal state value function
$V^{*}(s) = \max_{\pi} V^{\pi}(s)$
Optimal state-action value function
$Q^{*}(s,a) = \max_{\pi} Q^{\pi}(s,a)$
Bellman optimality equation
$V^{*}(s) = \max_{a} E\{r_{t+1} + \gamma V^{*}(s') \mid s_t = s,\ a_t = a\}$
$Q^{*}(s,a) = E\{r_{t+1} + \gamma \max_{a'} Q^{*}(s',a') \mid s_t = s,\ a_t = a\}$
Here $s'$ denotes the next state $s_{t+1}$.
![Page 8: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/8.jpg)
EXAMPLE
At time = t:
Acquire brain state
Decoder: action selection (trying to execute an optimum action)
Action executed
At time = t+1:
Observe reward
Update the decoder
![Page 9: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/9.jpg)
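EXAMPLE
Transition probabilities for the chosen action: Pr = 0.8 to $S_1$, Pr = 0.1 to $S_2$, Pr = 0.1 to $S_3$, Pr = 0 to $S_4$.
$V(s) = [\,0.8\,(R(s_1,a) + \gamma V(s_1)) + 0.1\,(R(s_2,a) + \gamma V(s_2)) + 0.1\,(R(s_3,a) + \gamma V(s_3))\,]$
Prof. Andrew Ng, Lecture 16, Machine learning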
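The weighted sum in this example can be checked numerically with a short sketch. The 0.8 / 0.1 / 0.1 / 0 probabilities come from the slide; the rewards, the successor values and the discount factor below are made-up numbers, and `expected_backup` is a hypothetical helper name.

```python
def expected_backup(transitions, gamma):
    """Sum over successor states of Pr * (reward + gamma * V(successor))."""
    return sum(p * (r + gamma * v) for p, r, v in transitions)


# (probability, reward, current value estimate) for S1..S4.
# Probabilities follow the slide; rewards and values are illustrative only.
transitions = [
    (0.8, 0.0, 1.0),   # S1
    (0.1, 0.0, 0.5),   # S2
    (0.1, -1.0, 0.0),  # S3
    (0.0, 0.0, 0.0),   # S4 is never reached
]
gamma = 0.9
v = expected_backup(transitions, gamma)
```

Because S4 has probability 0 it contributes nothing, which is why it drops out of the slide's equation.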
![Page 10: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/10.jpg)
![Page 11: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/11.jpg)
Solution Methods for RL problem
◦ Dynamic Programming (DP): a method for optimizing problems that exhibit overlapping subproblems and optimal substructure.
◦ Monte Carlo method (MC): requires only experience, that is, sample sequences of states, actions, and rewards from interaction with an environment.
◦ Temporal Difference learning (TD): a method that combines the better aspects of DP (estimation) and MC (experience) without incorporating the ‘troublesome’ aspects of either.
![Page 12: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/12.jpg)
Dynamic Programming: Policy Evaluation
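A minimal sketch of iterative policy evaluation, assuming the model is known and stored as a dictionary. The layout `P[s][a] = [(prob, next_state, reward), ...]`, the stochastic policy table `policy[s][a]`, and the two-step chain used as an example are all assumptions made for illustration; the slide's own algorithm box is not reproduced here.

```python
def policy_evaluation(P, policy, gamma=0.9, theta=1e-10):
    """Repeatedly apply the Bellman expectation backup until values settle.

    P[s][a] is a list of (prob, next_state, reward) triples.
    policy[s][a] is the probability of taking action a in state s.
    States with no actions (P[s] == {}) are terminal and keep value 0.
    """
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in P[s]
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new  # in-place sweep
        if delta < theta:
            return V


# Hypothetical chain A -> B -> end with rewards 1 and 2.
P = {
    "A": {"go": [(1.0, "B", 1.0)]},
    "B": {"go": [(1.0, "end", 2.0)]},
    "end": {},
}
policy = {"A": {"go": 1.0}, "B": {"go": 1.0}}
V = policy_evaluation(P, policy, gamma=0.5)
```

The update bootstraps: each state's new value is built from the current estimates of its successors rather than from complete sampled returns.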
![Page 13: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/13.jpg)
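Dynamic Programming: Policy Improvement
$Q^{\pi}(s, \pi'(s)) \ge V^{\pi}(s)$
$\pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \dots \xrightarrow{I} \pi^{*} \xrightarrow{E} V^{*}$
E – Policy Evaluation I – Policy Improvement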
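The alternation of evaluation (E) and improvement (I) steps can be sketched as follows, assuming a known model in the hypothetical layout `P[s][a] = [(prob, next_state, reward), ...]` and a deterministic policy stored as `policy[s] = action`. The one-state example MDP is invented for illustration.

```python
def evaluate(P, policy, gamma, theta=1e-10):
    """E step: iterative evaluation of a deterministic policy."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:  # terminal state: value stays 0
                continue
            v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V


def improve(P, V, gamma):
    """I step: act greedily with respect to the current value function."""
    policy = {}
    for s in P:
        if P[s]:
            policy[s] = max(
                P[s],
                key=lambda a: sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a]),
            )
    return policy


def policy_iteration(P, policy, gamma=0.9):
    """Alternate E and I steps until the policy stops changing."""
    while True:
        V = evaluate(P, policy, gamma)
        new_policy = improve(P, V, gamma)
        if new_policy == policy:
            return policy, V
        policy = new_policy


# Hypothetical MDP: staying earns nothing, going reaches the goal with reward 1.
P = {
    "s0": {"stay": [(1.0, "s0", 0.0)], "go": [(1.0, "goal", 1.0)]},
    "goal": {},
}
best_policy, best_V = policy_iteration(P, {"s0": "stay"})
```

Starting from the poor policy "stay", one improvement step switches to "go", and the next evaluation confirms that no further change is possible.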
![Page 14: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/14.jpg)
Dynamic Programming: Policy Iteration vs. Value Iteration
Replace entire section with:
$V(s) \leftarrow \max_{a} \sum_{s'} P^{a}_{ss'} [R^{a}_{ss'} + \gamma V(s')]$
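A minimal value iteration sketch, again assuming a known model stored as `P[s][a] = [(prob, next_state, reward), ...]` (a hypothetical layout chosen for this example). The "safe" versus "risky" MDP is invented to make the max over actions visible.

```python
def value_iteration(P, gamma=0.9, theta=1e-10):
    """Apply V(s) <- max_a sum_s' P * (R + gamma * V(s')) until convergence."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:  # terminal state: value stays 0
                continue
            v_new = max(
                sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for outcomes in P[s].values()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V


# Hypothetical MDP: "safe" ends with reward 1; "risky" pays 4 half the time
# and otherwise returns to s0 with no reward.
P = {
    "s0": {
        "safe": [(1.0, "goal", 1.0)],
        "risky": [(0.5, "goal", 4.0), (0.5, "s0", 0.0)],
    },
    "goal": {},
}
V = value_iteration(P, gamma=0.9)
```

For this example the fixed point satisfies V = 2 + 0.45 V, so the risky action is optimal and V(s0) is about 3.64.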
![Page 15: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/15.jpg)
Monte Carlo vs. DP
◦ The estimates for each state are independent. In other words, MC methods do not "bootstrap".
◦ The DP backup diagram includes only one-step transitions, whereas the MC diagram goes all the way to the end of the episode.
◦ With MC, the computational expense of estimating the value of a single state is independent of the number of states, which makes MC attractive when one requires the value of only a subset of the states.
![Page 16: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/16.jpg)
Monte Carlo Policy Evaluation
Every visit MC First visit MC
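The difference between the two variants can be sketched as below. The episode format (a list of `(state, reward)` pairs, where the reward is the one received on leaving that state) and the function name `mc_evaluate` are assumptions made for this example.

```python
from collections import defaultdict


def mc_evaluate(episodes, gamma=1.0, first_visit=True):
    """Average sampled returns per state.

    first_visit=True averages only the return after the first visit to a
    state in each episode; first_visit=False averages the return after
    every visit.
    """
    returns = defaultdict(list)
    for episode in episodes:
        # Compute the return following each time step, working backwards.
        G = 0.0
        returns_at = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            returns_at.append((state, G))
        returns_at.reverse()

        seen = set()
        for state, G in returns_at:
            if first_visit and state in seen:
                continue
            seen.add(state)
            returns[state].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


# Hypothetical episode in which state A is visited twice.
episodes = [[("A", 1.0), ("B", 0.0), ("A", 2.0)]]
v_first = mc_evaluate(episodes, first_visit=True)
v_every = mc_evaluate(episodes, first_visit=False)
```

In this episode the returns following the two visits to A are 3 and 2, so first-visit MC estimates V(A) = 3 while every-visit MC estimates V(A) = 2.5.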
![Page 17: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/17.jpg)
Monte Carlo
-> Without a model, we need Q value estimates.
-> All state-action pairs should be visited.
-> Exploration techniques: 1) Exploring starts 2) ε-greedy policy (next slide)
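An ε-greedy policy can be sketched in a few lines. The dictionary layout for the Q estimates and the `rng` parameter are assumptions made for this example.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one.

    q_values maps each available action to its current Q estimate.
    """
    actions = list(q_values)
    if rng.random() < epsilon:
        return rng.choice(actions)         # explore
    return max(actions, key=q_values.get)  # exploit


# Hypothetical Q estimates for three actions in one state.
q = {"left": 0.1, "stay": 0.7, "right": 0.3}
rng = random.Random(0)
picks = [epsilon_greedy(q, 0.1, rng) for _ in range(10000)]
greedy_fraction = picks.count("stay") / len(picks)
```

Every action keeps a nonzero selection probability, so all state-action pairs continue to be visited while the greedy action is still chosen most of the time.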
![Page 18: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/18.jpg)
Monte Carlo
As promised, this is the “NEXT SLIDE”!
![Page 19: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/19.jpg)
Temporal Difference Methods
◦ Like MC, TD methods can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).
$V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$
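The one-step TD update can be written as a small function. The name `td0_update`, the dictionary value table and the sample numbers are assumptions made for illustration.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=1.0):
    """Move V[s] toward the bootstrapped target r + gamma * V[s_next]."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error


# Hypothetical value table and a single observed transition A -> B.
V = {"A": 0.0, "B": 1.0}
err = td0_update(V, "A", r=0.5, s_next="B", alpha=0.1, gamma=0.9)
```

The update is applied right after each transition, so no episode has to finish before learning starts.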
![Page 20: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/20.jpg)
TD(λ)
λ: trace decay parameter
As λ increases: bias decreases, variance increases
Bias-Variance Tradeoff
Intuition: start with a large λ and then decrease it over time
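A minimal tabular sketch of TD(λ), assuming accumulating eligibility traces (one common choice; the slides do not say which variant was used). The transition format `(s, r, s_next)` and the function name are assumptions made for this example.

```python
def td_lambda_episode(V, episode, alpha=0.1, gamma=1.0, lam=0.9):
    """Run one episode of tabular TD(lambda) with accumulating traces.

    episode is a list of (s, r, s_next) transitions. Every state that
    appears, including the terminal one, must already be a key of V.
    """
    e = {s: 0.0 for s in V}  # eligibility trace per state
    for s, r, s_next in episode:
        td_error = r + gamma * V[s_next] - V[s]
        e[s] += 1.0  # accumulating trace for the state just visited
        for x in V:
            V[x] += alpha * td_error * e[x]
            e[x] *= gamma * lam  # traces decay at rate gamma * lambda
    return V


# Hypothetical episode A -> B -> T with a reward of 1 on the last step.
episode = [("A", 0.0, "B"), ("B", 1.0, "T")]
V_lam = td_lambda_episode({"A": 0.0, "B": 0.0, "T": 0.0}, episode,
                          alpha=0.5, gamma=1.0, lam=0.5)
V_zero = td_lambda_episode({"A": 0.0, "B": 0.0, "T": 0.0}, episode,
                           alpha=0.5, gamma=1.0, lam=0.0)
```

With λ = 0 only the most recent state (B) is credited for the final reward; with λ = 0.5 the earlier state A also receives a share, scaled by its decayed trace.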
![Page 21: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/21.jpg)
SARSA
Q Learning
Difference
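The difference between the two updates can be sketched as follows: SARSA (on-policy) bootstraps from the action actually taken next, while Q-learning (off-policy) bootstraps from the best available next action. The nested-dictionary Q table and the sample numbers are assumptions made for illustration.

```python
import copy


def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: the target uses the action a_next that will be executed."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])


def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9):
    """Off-policy: the target uses the greedy action in s_next."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])


# Hypothetical Q table; the agent's next action "l" is not the greedy one.
Q0 = {"s": {"l": 0.0, "r": 0.0}, "s2": {"l": 1.0, "r": 2.0}}
Q_sarsa = copy.deepcopy(Q0)
Q_qlearn = copy.deepcopy(Q0)
sarsa_update(Q_sarsa, "s", "l", 0.0, "s2", "l")
q_learning_update(Q_qlearn, "s", "l", 0.0, "s2")
```

When the next action happens to be the greedy one, the two updates coincide; they differ only on exploratory steps.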
![Page 22: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/22.jpg)
![Page 23: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/23.jpg)
Eligibility Traces
OR
![Page 24: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/24.jpg)
![Page 25: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/25.jpg)
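Online/closed loop RL-BMI architecture
Input: neural signal; network units: tanh(.); feedback: reward
$\text{action} = \text{output\_index}(\max_i [Q_i(s_t)])$
$Q(s_t, a_t) = Q(s_t, \text{action})$
$\text{TD\_err} = r_{t+1} + \gamma \cdot Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$
$\text{delta} = \text{TD\_err} \cdot \text{e\_trace}$
‘delta’ used for updating the weights through back-propagation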
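The signal flow on this slide can be sketched numerically, leaving out the network itself. Everything below is an assumption made for illustration: the Q outputs are plain lists standing in for the network's output layer, and the reward, discount factor and trace value are made-up numbers.

```python
def select_action(q_outputs):
    """Index of the largest Q output, i.e. output_index(max_i Q_i(s_t))."""
    return max(range(len(q_outputs)), key=lambda i: q_outputs[i])


def delta_signal(r, q_next, q_now, e_trace, gamma=0.9):
    """TD error scaled by the eligibility trace: delta = TD_err * e_trace."""
    td_err = r + gamma * q_next - q_now
    return td_err * e_trace


# Hypothetical Q outputs of the decoder at time t and t+1.
q_t = [0.1, 0.7, -0.2]
q_t1 = [0.3, 0.2, 0.0]
a_t = select_action(q_t)
a_t1 = select_action(q_t1)
delta = delta_signal(r=1.0, q_next=q_t1[a_t1], q_now=q_t[a_t], e_trace=0.5)
```

In the architecture described on the slide, this delta is the quantity propagated back through the network to adjust the decoder weights.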
![Page 26: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/26.jpg)
BMI SET UP
Scott, S. H. (1999). "Apparatus for measuring and perturbing shoulder and elbow joint positions and torques during reaching." J Neurosci Methods 89(2): 119-27.
![Page 27: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/27.jpg)
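Autonomous Helicopter (Stanford Uni) http://heli.stanford.edu/papers/iser04-invertedflight.pdf
Position, orientation, velocity and angular velocity: $(x, y, z, \phi, \theta, \omega, \dot{x}, \dot{y}, \dot{z}, \dot{\phi}, \dot{\theta}, \dot{\omega})$
$(\phi, \theta, \dot{x}, \dot{y}, \dot{z}, \dot{\phi}, \dot{\theta}, \dot{\omega})$
[Figure: states S1, S2 with actions a1, a2 and rewards R1, R2; transitions produced by the dynamics model, driven by a random generator]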
![Page 28: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/28.jpg)
Actor-Critic Model
http://drugabuse.gov/researchreports/methamph/meth04.gif
![Page 29: An introduction to reinforcement learning (rl)](https://reader033.vdocuments.net/reader033/viewer/2022061203/547debc9b47959bb508b4a6e/html5/thumbnails/29.jpg)
References
Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto
Prof. Andrew Ng’s Machine Learning Lectures
http://heli.stanford.edu
Dr. Joseph T. Francis, www.joefrancislab.com
Prof. Peter Dayan
Dr. Justin Sanchez Group, http://www.bme.miami.edu/nrg/