an introduction to reinforcement learning (rl)

An Introduction to Reinforcement Learning (RL) and RL Brain Machine Interface (RL-BMI)

Aditya Tarigoppula www.joefrancislab.com

SUNY Downstate Medical Center

Outline

RL Examples

Environment

Value functions

Optimality

Methods for attaining optimality

DP MC TD

BMI & RL-BMI

Eligibility Traces

START / END

RL Examples

Stanford Autonomous Helicopter: http://heli.stanford.edu/

Reinforcement Learning Brain Machine Interface: Joe Francis Lab.

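Environment model: Markov decision process (MDP)

1) States ‘S’

2) Actions ‘A’

3) State transition probabilities: $P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$, with $\sum_{s'} P^a_{ss'} = 1$

4) Reward: the return from time $t$ is $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots$ with discounting, or $R_t = r_{t+1} + r_{t+2} + r_{t+3} + \dots + r_T$ for an episode ending at time $T$

5) Discount factor $\gamma$, $0 \le \gamma \le 1$

Deterministic, non-stationary policy $\pi : s \to a$

RL Problem: The decision maker, the ‘agent’, needs to learn the optimal policy in an ‘environment’ so as to maximize the total reward it receives over the long term.

• The agent performs the action prescribed by the policy being followed.

• The environment is everything other than the agent.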
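The long-term reward that the agent maximizes can be computed directly from a sequence of rewards. The following is a minimal sketch, assuming a discount factor `gamma` between 0 and 1 (with `gamma = 1` giving the plain sum over an episode); the function name and the reward values are made up for illustration.

```python
def discounted_return(rewards, gamma=1.0):
    """Sum r_{t+1} + gamma*r_{t+2} + gamma^2*r_{t+3} + ... over a finite reward list."""
    total = 0.0
    # Work backwards through the rewards: R = r + gamma * (return that follows).
    for r in reversed(rewards):
        total = r + gamma * total
    return total


rewards = [1.0, 0.0, 2.0]  # hypothetical r_{t+1}, r_{t+2}, r_{t+3}
undiscounted = discounted_return(rewards)           # episodic sum
discounted = discounted_return(rewards, gamma=0.5)  # later rewards count less
print(undiscounted, discounted)
```

With `gamma` below 1, rewards further in the future contribute less, which keeps the return finite for tasks that never terminate.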

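Value Functions

State Value Function:

$$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\Big\{ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \,\Big|\, s_t = s \Big\} = \sum_a \pi(s,a) \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V^\pi(s') \big]$$

State-Action Value Function:

$$Q^\pi(s,a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\Big\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \,\Big|\, s_t = s, a_t = a \Big\}$$

Here $\pi(s,a)$ is the probability of taking action $a$ in state $s$, and $R^a_{ss'}$ is the expected reward on the transition from $s$ to $s'$ under action $a$.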
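The recursive form of the state value function (the value of a state expressed through the values of its successor states) can be turned into an iterative computation. The sketch below is one hedged way to do this, not the deck's own code: the dictionary layout `P[s][a] = [(probability, next_state, reward), ...]`, the two-state MDP, and all numbers are assumptions chosen for illustration.

```python
def policy_evaluation(P, policy, gamma=0.9, tol=1e-10):
    """Iterate the Bellman expectation backup until the values stop changing.

    P[s][a]      -> list of (probability, next_state, reward) outcomes
    policy[s][a] -> probability of choosing action a in state s
    """
    V = {s: 0.0 for s in P}
    while True:
        biggest_change = 0.0
        for s in P:
            new_v = 0.0
            for a, pi_sa in policy[s].items():
                for prob, s_next, reward in P[s][a]:
                    new_v += pi_sa * prob * (reward + gamma * V[s_next])
            biggest_change = max(biggest_change, abs(new_v - V[s]))
            V[s] = new_v
        if biggest_change < tol:
            return V


# Hypothetical two-state MDP: "stay" keeps the state, "go" switches it.
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
policy = {"A": {"go": 1.0}, "B": {"stay": 1.0}}
V = policy_evaluation(P, policy)
print(V)
```

Under this policy, state B collects a reward of 2 forever, so its value approaches 2 / (1 - 0.9) = 20, and state A approaches 1 + 0.9 * 20 = 19.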

Optimal Value Function

Optimal policy: a policy that is better than or equal to all other policies, in the sense of maximizing expected reward, is called an optimal policy.

Optimal state value function: $V^*(s) = \max_\pi V^\pi(s)$

Optimal state-action value function: $Q^*(s,a) = \max_\pi Q^\pi(s,a)$

Bellman optimality equations:

$$V^*(s) = \max_a E\{ r_{t+1} + \gamma V^*(s_{t+1}) \mid s_t = s, a_t = a \}$$

$$Q^*(s,a) = E\{ r_{t+1} + \gamma \max_{a'} Q^*(s_{t+1}, a') \mid s_t = s, a_t = a \}$$

At time = t:

1) Acquire brain state

2) Decoder: action selection (trying to execute an optimal action)

3) Action executed

At time = t + 1:

4) Observe reward

5) Update the decoder

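EXAMPLE

The chosen action leads to state S1 with probability 0.8, to S2 and S3 with probability 0.1 each, and to S4 with probability 0.

$$V(s) = 0.8\,[R(s_1, a) + \gamma V(s_1)] + 0.1\,[R(s_2, a) + \gamma V(s_2)] + 0.1\,[R(s_3, a) + \gamma V(s_3)]$$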

Prof. Andrew Ng, Lecture 16, Machine learning
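To make the arithmetic of this example concrete, the expected one-step backup can be evaluated numerically. This is only a sketch: the probabilities 0.8, 0.1, 0.1 and 0 come from the example, while the rewards, the successor values, and `gamma` are made-up numbers.

```python
def expected_backup(outcomes, gamma):
    """Probability-weighted sum of (reward + gamma * value of the next state)."""
    return sum(p * (r + gamma * v_next) for p, r, v_next in outcomes)


# (probability, reward, value of next state); rewards and values are hypothetical.
outcomes = [
    (0.8, 1.0, 10.0),   # intended move, S1
    (0.1, 0.0, 5.0),    # slip to S2
    (0.1, 0.0, 0.0),    # slip to S3
    (0.0, -1.0, 0.0),   # S4 is never reached, so it contributes nothing
]
value = expected_backup(outcomes, gamma=0.9)
print(value)
```

With these numbers the three reachable outcomes contribute 0.8 * 10, 0.1 * 4.5 and 0.1 * 0, for a total of about 8.45.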


Solution Methods for RL problem

◦ Dynamic Programming (DP): a method for optimizing problems that exhibit overlapping subproblems and optimal substructure.

◦ Monte Carlo method (MC): requires only experience, that is, sample sequences of states, actions, and rewards from interaction with an environment.

◦ Temporal Difference learning (TD): a method that combines the better aspects of DP (estimation) and MC (experience) without incorporating the ‘troublesome’ aspects of either (DP's need for a model of the environment, MC's need to wait until the end of an episode).

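Dynamic Programming: Policy Evaluation

Dynamic Programming: Policy Improvement

$$Q^\pi(s, \pi'(s)) \ge V^\pi(s)$$

$$\pi_0 \xrightarrow{E} V^{\pi_0} \xrightarrow{I} \pi_1 \xrightarrow{E} V^{\pi_1} \xrightarrow{I} \dots \xrightarrow{I} \pi^* \xrightarrow{E} V^*$$

E: Policy Evaluation, I: Policy Improvement

Policy Iteration vs. Value Iteration

Value iteration: replace the entire policy evaluation section of policy iteration with

$$V(s) \leftarrow \max_a \sum_{s'} P^a_{ss'} \big[ R^a_{ss'} + \gamma V(s') \big]$$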
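The max-over-actions backup used by value iteration can be sketched in a few lines. This is an illustrative implementation under assumptions, not the deck's code: the dictionary layout `P[s][a] = [(probability, next_state, reward), ...]`, the helper name `backup`, and the two-state MDP are all hypothetical.

```python
def backup(P, V, s, a, gamma):
    """Expected one-step return of taking action a in state s, given estimates V."""
    return sum(prob * (reward + gamma * V[s_next]) for prob, s_next, reward in P[s][a])


def value_iteration(P, gamma=0.9, tol=1e-10):
    V = {s: 0.0 for s in P}
    while True:
        biggest_change = 0.0
        for s in P:
            # Replace V(s) with the best one-step lookahead over all actions.
            best = max(backup(P, V, s, a, gamma) for a in P[s])
            biggest_change = max(biggest_change, abs(best - V[s]))
            V[s] = best
        if biggest_change < tol:
            break
    # Policy improvement in one shot: act greedily with respect to the final V.
    greedy = {s: max(P[s], key=lambda a: backup(P, V, s, a, gamma)) for s in P}
    return V, greedy


# Hypothetical two-state MDP: "stay" keeps the state, "go" switches it.
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
V_opt, policy_opt = value_iteration(P)
print(V_opt, policy_opt)
```

In this toy problem the greedy policy moves from A to B and then stays in B, since staying in B earns the largest reward at every step.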

Monte Carlo vs. DP

◦ The estimates for each state are independent. In other words, MC methods do not "bootstrap".

◦ The DP backup diagram includes only one-step transitions, whereas the MC diagram goes all the way to the end of the episode.

◦ The computational expense of estimating the value of a single state is independent of the number of states, which makes MC attractive when one requires the value of only a subset of the states.

Monte Carlo Policy Evaluation

Every-visit MC vs. first-visit MC

-> Without a model, we need Q value estimates.

-> All state-action pairs should be visited.

-> Exploration techniques: 1) Exploring starts 2) ε-greedy policy
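The difference between first-visit and every-visit Monte Carlo evaluation shows up clearly in code. The sketch below estimates state values for simplicity (the same averaging applies to state-action pairs when Q estimates are needed); the episode format, function name, and sample episodes are assumptions made for illustration.

```python
from collections import defaultdict


def mc_policy_evaluation(episodes, gamma=1.0, first_visit=True):
    """Average sampled returns per state.

    Each episode is a list of (state, reward) pairs, where reward is the
    reward received on leaving that state.
    """
    returns = defaultdict(list)
    for episode in episodes:
        # Compute the return following every time step by walking backwards.
        G = 0.0
        per_step = []
        for state, reward in reversed(episode):
            G = reward + gamma * G
            per_step.append((state, G))
        per_step.reverse()

        seen = set()
        for state, G in per_step:
            if first_visit and state in seen:
                continue  # first-visit MC ignores later visits in the same episode
            seen.add(state)
            returns[state].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


episodes = [
    [("A", 1.0), ("B", 0.0), ("A", 2.0)],  # hypothetical episode visiting A twice
    [("B", 1.0)],
]
v_first = mc_policy_evaluation(episodes, first_visit=True)
v_every = mc_policy_evaluation(episodes, first_visit=False)
print(v_first, v_every)
```

Only state A is visited twice within an episode, so the two variants differ only in their estimate for A.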


Temporal Difference Methods

◦ Like MC, TD methods can learn directly from raw experience without a model of the environment's dynamics. Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (they bootstrap).

$$V(s_t) \leftarrow V(s_t) + \alpha \big[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \big]$$
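A single TD update needs only one observed transition. The following minimal sketch assumes the standard TD(0) form, with a step size `alpha` and discount factor `gamma`; the state names and numbers are hypothetical.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Move V[s] toward the bootstrapped target r + gamma * V[s_next]."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return td_error


V = {"A": 0.0, "B": 1.0}  # hypothetical current estimates
err = td0_update(V, "A", r=1.0, s_next="B", alpha=0.5, gamma=0.9)
print(err, V["A"])
```

The update happens immediately after the transition, using the current estimate for the next state in place of the rest of the episode.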

TD($\lambda$)

$\lambda$: trace decay parameter

Bias-Variance Tradeoff: as $\lambda$ increases, bias decreases and variance increases.

Intuition: start with a large $\lambda$ and then decrease it over time.
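The role of the trace decay parameter is easiest to see in a tabular implementation. This is a hedged sketch of TD(lambda) with accumulating eligibility traces, one common variant; the transition format, function name, and toy episode are assumptions for illustration.

```python
def td_lambda_episode(V, episode, alpha=0.1, gamma=0.9, lam=0.8):
    """Run one episode of tabular TD(lambda) with accumulating traces.

    episode is a list of (state, reward, next_state) transitions;
    next_state is None when the episode terminates.
    """
    e = {s: 0.0 for s in V}  # eligibility trace per state
    for s, r, s_next in episode:
        v_next = 0.0 if s_next is None else V[s_next]
        delta = r + gamma * v_next - V[s]
        e[s] += 1.0  # mark the visited state as eligible
        for state in V:
            V[state] += alpha * delta * e[state]
            e[state] *= gamma * lam  # traces fade at rate gamma * lambda
    return V


episode = [("A", 0.0, "B"), ("B", 1.0, None)]  # hypothetical two-step episode
v_mc_like = td_lambda_episode({"A": 0.0, "B": 0.0}, episode, alpha=0.5, gamma=1.0, lam=1.0)
v_td0_like = td_lambda_episode({"A": 0.0, "B": 0.0}, episode, alpha=0.5, gamma=1.0, lam=0.0)
print(v_mc_like, v_td0_like)
```

With `lam = 1.0` the final reward is credited back to state A as well (Monte Carlo-like behavior), while with `lam = 0.0` only the most recent state B is updated (one-step TD).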

SARSA

Q Learning

Difference
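The difference between the two methods lies in the update target: SARSA (on-policy) bootstraps from the action actually taken next, while Q-learning (off-policy) bootstraps from the best available next action. The sketch below uses the standard textbook updates; the table layout `Q[(state, action)]`, the function names, and the numbers are assumptions for illustration.

```python
import random


def epsilon_greedy(Q, s, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])


def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # Target uses the action a_next that the behaviour policy actually chose.
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    # Target uses the greedy action in s_next, whatever is executed next.
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


actions = ["L", "R"]
start = {("s0", "L"): 0.0, ("s0", "R"): 0.0, ("s1", "L"): 1.0, ("s1", "R"): 3.0}

Q_sarsa = dict(start)
sarsa_update(Q_sarsa, "s0", "L", 0.0, "s1", "L", alpha=0.5, gamma=1.0)

Q_qlearn = dict(start)
q_learning_update(Q_qlearn, "s0", "L", 0.0, "s1", actions, alpha=0.5, gamma=1.0)

greedy_action = epsilon_greedy(start, "s1", actions, epsilon=0.0)
print(Q_sarsa[("s0", "L")], Q_qlearn[("s0", "L")], greedy_action)
```

When the next action is exploratory (here "L" in state s1), SARSA's estimate reflects that exploration, whereas Q-learning's estimate ignores it.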


Eligibility Traces

[Figure: two alternative eligibility-trace illustrations, joined by "OR"]


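Online/closed loop RL-BMI architecture

[Diagram labels: NEURAL SIGNAL (input), tanh(.) units, reward]

Action selection:

$$\text{action} = \text{output\_index}\big(\max_i [Q_i(s_t)]\big), \qquad Q(s_t, a_t) = Q(s_t, \text{action})$$

TD error and weight update:

$$\text{TD\_err} = r_{t+1} + \gamma \, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)$$

$$\text{delta} = \text{TD\_err} \cdot \text{e\_trace}$$

‘delta’ is used for updating the weights through back-propagation.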
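The quantities named in this architecture (the greedy output index, the TD error, and 'delta' as the TD error scaled by an eligibility trace) can be sketched numerically. This is a hedged illustration only: the Q outputs, reward, trace value, and `gamma` are made-up numbers, and the network that would produce the Q outputs from neural signals is not modeled.

```python
def select_action(q_values):
    """Index of the largest Q output, i.e. the greedy action."""
    return max(range(len(q_values)), key=lambda i: q_values[i])


def td_error(reward, q_next, q_current, gamma=0.9):
    """TD_err = reward + gamma * Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)."""
    return reward + gamma * q_next - q_current


q_t = [0.1, 0.7, -0.2]   # hypothetical Q outputs for the brain state at time t
a_t = select_action(q_t)
q_t1 = [0.3, 0.2, 0.0]   # hypothetical Q outputs at time t + 1
a_t1 = select_action(q_t1)

reward = 1.0             # hypothetical reward observed after executing a_t
e_trace = 0.5            # hypothetical eligibility trace value
td_err = td_error(reward, q_t1[a_t1], q_t[a_t], gamma=0.9)
delta = td_err * e_trace  # error signal that would be back-propagated
print(a_t, a_t1, td_err, delta)
```

Here the TD error is 1.0 + 0.9 * 0.3 - 0.7, roughly 0.57, so the resulting 'delta' is about 0.285.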

Scott, S. H. (1999). "Apparatus for measuring and perturbing shoulder and elbow joint positions and torques during reaching." J Neurosci Methods 89(2): 119-27.

BMI SET UP

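Autonomous Helicopter (Stanford Uni) http://heli.stanford.edu/papers/iser04-invertedflight.pdf

State: position, orientation, velocity and angular velocity, $(x, y, z, \phi, \theta, \omega, \dot{x}, \dot{y}, \dot{z}, \dot{\phi}, \dot{\theta}, \dot{\omega})$

[Second tuple: 8 components, including one $(x, y, z)$ triple; remaining symbols not recoverable]

[Figure: simulated rollout. States S1, S2; actions a1, a2; rewards R1, R2; each step passes through a Dynamics block fed by a random number generator (Random Gen)]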

Actor-Critic Model

http://drugabuse.gov/researchreports/methamph/meth04.gif



References

Reinforcement Learning: An Introduction, Richard S. Sutton & Andrew G. Barto

Prof. Andrew Ng’s Machine Learning Lectures

http://heli.stanford.edu

Dr. Joseph T. Francis, www.joefrancislab.com

Prof. Peter Dayan

Dr. Justin Sanchez Group, http://www.bme.miami.edu/nrg/
