
AN INTRODUCTION TO REINFORCEMENT LEARNING IN ROBOTICS

Nemec Bojan, Jozef Stefan Institute

Humanoid robots

A humanoid robot is a robot whose appearance is similar to the human body, allowing it to interact with tools and environments made for humans. Key requirements:

• self-maintenance
• autonomous learning
• avoiding harmful situations to people, property, and itself
• safe interaction with human beings and the environment

A humanoid robot is an autonomous robot that can adapt to changes in its environment or itself

Supervised Learning

• learning approaches to regression & classification, neural networks
• learning from examples, learning from a teacher

Unsupervised Learning

Reinforcement Learning

• learning approaches to sequential decision making
• learning from a critic, learning from delayed reward

Robot learning

Reinforcement learning algorithms attempt to find a policy that maps states of the world to the actions the agent ought to take in those states.

• environment states S

• actions A

• rewards R.

For each state s ∈ S:

• value V(s)
• action-value Q(s, a)

Reinforcement learning

[Figure: sequence of states S0 → S1 → S2 → ...]

x_t, ẋ_t, ẍ_t → x_{t+1}, ẋ_{t+1}, ẍ_{t+1}    is MDP

x_t → x_{t+1}    is NOT MDP

Agent – driver
Environment – car on a road
States – car position, velocity, acceleration (x, ẋ, ẍ)
Action – steering wheel angle; the action is the result of a policy
Reward – success or fail

Markov decision process (MDP) - A probabilistic model of a sequential decision problem, where states can be perceived exactly, and the current state and action selected determine a probability distribution on future states. The outcome of applying an action to a state depends only on the current action and state (and not on preceding actions or states).

The environment is typically formulated as a finite-state Markov decision process
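To make this concrete, here is a minimal sketch of a finite MDP stored as transition tables; the states, actions, probabilities and rewards are invented for illustration and are not from the slides.

import random

# Finite MDP sketch: P[s][a] is a list of (probability, next_state, reward) outcomes.
# All states, actions, probabilities and rewards below are illustrative.
P = {
    "s0": {"left":  [(1.0, "s0", 0.0)],
           "right": [(0.8, "s1", 0.0), (0.2, "s0", 0.0)]},
    "s1": {"left":  [(1.0, "s0", 0.0)],
           "right": [(1.0, "s2", 1.0)]},   # reaching the terminal state s2 gives reward 1
    "s2": {},                              # terminal state
}

def step(state, action):
    """Sample one transition: the outcome depends only on the current state and action."""
    outcomes = P[state][action]
    probs = [p for p, _, _ in outcomes]
    _, next_state, reward = random.choices(outcomes, weights=probs, k=1)[0]
    return next_state, reward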

The basic reinforcement learning model applied to MDPs consists of:


• a set of environment states S;

• a set of actions A; and

• a set of scalar "rewards" in R.

At each time t, the agent perceives its state s_t ∈ S and the set of possible actions A(s_t). It chooses an action and receives from the environment the new state s_{t+1} and a reward r_t. Based on these interactions, the reinforcement learning agent must develop a policy π : S → A which maximizes the quantity

R = r_0 + r_1 + ... + r_n

for MDPs which have a terminal state, or the quantity

R = Σ_t γ^t r_t

for MDPs without terminal states.

A policy π determines which action should be performed in each state; a policy is a mapping from states to actions
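The interaction loop described above can be sketched as follows; the environment interface (env.reset(), env.step(a) returning the new state, reward and a termination flag) and the episode cap are assumptions for illustration.

def run_episode(env, policy, gamma=1.0, max_steps=1000):
    """Follow a policy pi: S -> A and accumulate the (discounted) return R."""
    s = env.reset()
    R, discount = 0.0, 1.0
    for _ in range(max_steps):
        a = policy(s)               # action selected by the policy in state s
        s, r, done = env.step(a)    # environment returns new state and reward
        R += discount * r
        discount *= gamma
        if done:                    # terminal state reached
            break
    return R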

Value Function

The value of a state is defined as the sum of the reinforcements received when starting in that state and following some fixed policy to a terminal state:

V^π(s) = E[ R_t | s_t = s ] = E[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s ]

Action Value Function

Q^π(s, a) = E[ R_t | s_t = s, a_t = a ] = E[ Σ_{k=0}^∞ γ^k r_{t+k+1} | s_t = s, a_t = a ]

The optimal value function V* is the value function obtained when following the optimal policy π* : S → A, i.e. the policy maximizing V (or Q).
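Given an estimate of the action-value function Q, the corresponding greedy policy is simply the argmax over actions; a minimal sketch, assuming Q is stored as a dictionary keyed by (state, action) pairs.

def greedy_policy(Q, state, actions):
    """pi(s) = argmax_a Q(s, a): choose the action with the largest action value."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))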

Value function update methods

Monte Carlo Method

Repeat forever:
• choose a random policy
• generate an entire episode
• for each state s appearing in the episode, compute the return R = R + r(s)
• the value function V(s) is the average return

V(s_t) ← V(s_t) + α (R_t − V(s_t))

α – learning rate, R – return, V – value function

Drawback: V can only be updated after R has been computed from a complete simulation run, so the method can only be applied to episodic tasks.
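A minimal sketch of the Monte Carlo update above, assuming a completed episode is given as a list of (state, reward) pairs; the function name and default learning rate are illustrative.

def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """Monte Carlo update after a complete episode:
    V(s) <- V(s) + alpha * (R - V(s)), where R is the return observed from s onward."""
    G = 0.0
    # Walk the episode backwards so the return-to-go G can be accumulated incrementally.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        v = V.get(state, 0.0)
        V[state] = v + alpha * (G - v)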

Temporal difference learning

TD learning is a prediction method that combines Monte Carlo ideas with dynamic programming (DP) ideas. Neuroscience research has found evidence of this kind of learning in human and animal brains.

The TD method updates the value or action-value function immediately after each visit to a new state:

V(s_t) ← V(s_t) + α (r_{t+1} + γ V(s_{t+1}) − V(s_t))

Benefits: suitable for non-episodic tasks, suitable for on-line implementation, better convergence.
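A sketch of one TD(0) episode implementing the update above; the environment interface and parameter values are assumptions, matching the loop sketched earlier.

def td0_episode(env, V, policy, alpha=0.1, gamma=1.0):
    """TD(0): update V(s_t) immediately after each transition,
    using the bootstrapped target r_{t+1} + gamma * V(s_{t+1})."""
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step(policy(s))
        target = r + (0.0 if done else gamma * V.get(s_next, 0.0))
        V[s] = V.get(s, 0.0) + alpha * (target - V.get(s, 0.0))
        s = s_next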

Eligibility traces – TD(λ)

An eligibility trace is a temporary record of the occurrence of an event, such as the visiting of a state or the taking of an action. Eligibility traces are a bridge from TD to Monte Carlo methods.

Update algorithm, repeated for all states s at each step:

V(s) ← V(s) + α (r_{t+1} + γ V(s_{t+1}) − V(s_t)) e_t(s)

e_t(s) = γ λ e_{t−1}(s) + 1   if s = s_t
e_t(s) = γ λ e_{t−1}(s)       if s ≠ s_t
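A sketch of TD(λ) with accumulating eligibility traces, implementing the two update rules above under the same assumed environment interface.

def td_lambda_episode(env, V, policy, alpha=0.1, gamma=1.0, lam=0.9):
    """TD(lambda): each state keeps an eligibility trace e(s) that is bumped by 1 when
    the state is visited and decays by gamma*lambda every step; the TD error is
    applied to all states in proportion to their traces."""
    e = {}                                   # eligibility traces, default 0
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step(policy(s))
        delta = r + (0.0 if done else gamma * V.get(s_next, 0.0)) - V.get(s, 0.0)
        e[s] = e.get(s, 0.0) + 1.0           # e_t(s_t) = gamma*lambda*e_{t-1}(s_t) + 1
        for state, trace in e.items():       # repeat for all states in each step
            V[state] = V.get(state, 0.0) + alpha * delta * trace
            e[state] = gamma * lam * trace   # decay, so the next visit adds +1 on top
        s = s_next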

Random walk example

[Figure: random-walk value function V after 10 steps and after 100 steps (states 1-5 on the x-axis, V on the y-axis), comparing the true values with the TD, eligibility-trace (ET), and Monte Carlo (MC1, MC2) estimates.]

Direct Q-Learning methods

SARSA

Q(s_t, a_t) ← Q(s_t, a_t) + α (r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t))

Because the value of the action actually taken in the next state is used for the state-action update, SARSA learns the Q values associated with the policy it follows itself (on-policy method).

Example: ε-greedy policy

a = argmax_a Q(s, a)   with probability 1 − ε   (exploitation)
a = rand(A)            with probability ε        (exploration)
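A sketch of one SARSA episode with the ε-greedy policy shown above; the environment interface, Q-table layout and parameter defaults are assumptions.

import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability eps explore (random action), otherwise exploit (argmax_a Q(s, a))."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def sarsa_episode(env, Q, actions, alpha=0.1, gamma=1.0, eps=0.1):
    """On-policy SARSA: the bootstrap target uses the action actually taken next."""
    s = env.reset()
    a = epsilon_greedy(Q, s, actions, eps)
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = epsilon_greedy(Q, s2, actions, eps)
        target = r + (0.0 if done else gamma * Q.get((s2, a2), 0.0))
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
        s, a = s2, a2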

Q-Learning

Q(s_t, a_t) ← Q(s_t, a_t) + α (r_{t+1} + γ max_a Q(s_{t+1}, a) − Q(s_t, a_t))

Because the maximum future value is used for the state-action update, Q-learning learns the Q values of the exploitation (greedy) policy while following an exploration/exploitation policy (off-policy method).
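The corresponding Q-learning sketch differs from SARSA only in the bootstrap target, which takes the maximum over actions in the next state instead of the action actually taken; same assumed interface.

import random

def q_learning_episode(env, Q, actions, alpha=0.1, gamma=1.0, eps=0.1):
    """Off-policy Q-learning: behave epsilon-greedily, bootstrap with max_a Q(s', a)."""
    s = env.reset()
    done = False
    while not done:
        # epsilon-greedy behaviour policy (exploration/exploitation)
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda b: Q.get((s, b), 0.0))
        s2, r, done = env.step(a)
        best_next = 0.0 if done else max(Q.get((s2, b), 0.0) for b in actions)
        Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (r + gamma * best_next - Q.get((s, a), 0.0))
        s = s2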

Cliff-walking problem

[Figure: cliff-walking gridworld comparing the paths learned by Q-learning and SARSA; stepping into the cliff gives R = −100, every other step gives R = −1.]

Case study – Ball in a cup game

Goal: swing the ball on a rope to the desired height with zero velocity.

State variables

hand position and velocity, rope angle and angular velocity

State boxing

x1 = [0.3 0];
x2 = [-1 0 1];
x3 = [0 : 360];    % 18 values
x4 = [-1.5 0 1.5];
Number of states = 3*3*18*3 = 342

Reward function

if (x > 0.65) || (x < -0.05)
    r = -500 - 5*abs(x_dot);
elseif (abs(theta-D1) < 0.2) && (abs(theta_dot) < 0.3)
    r = 1000;
else
    r = theta^2;
end

Learning algorithm

Q-learning, α = 0.4, γ = 0.99, λ = 0.9

States – a discrete representation of the continuous problem (x, ẋ, θ, θ̇)

Actions – acceleration [-1,-0.5,0,+0.5,+1]
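A sketch of how the continuous state (x, ẋ, θ, θ̇) could be boxed into a discrete index usable as a Q-table key; the bin edges below are illustrative and not the exact boxing used on the slide.

import numpy as np

# Illustrative bin edges (not the exact boxing from the slide)
X_BINS     = np.linspace(-0.05, 0.65, 4)    # hand position
XDOT_BINS  = np.array([-1.0, 0.0, 1.0])     # hand velocity
THETA_BINS = np.linspace(0.0, 360.0, 19)    # rope angle, 18 boxes of 20 degrees
TDOT_BINS  = np.array([-1.5, 0.0, 1.5])     # rope angular velocity

def discretize(x, x_dot, theta, theta_dot):
    """Map the continuous state to a tuple of bin indices (a discrete state)."""
    return (int(np.digitize(x, X_BINS)),
            int(np.digitize(x_dot, XDOT_BINS)),
            int(np.digitize(theta % 360.0, THETA_BINS)),
            int(np.digitize(theta_dot, TDOT_BINS)))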

Simulation results of Q-learning

Experimental results of Q-learning

A central issue in the trajectory generation and modification is the choice of the representation (or encoding) of the trajectory

Dynamic Motion Primitives (DMPs)

A DMP encodes a trajectory as a second-order system plus a trajectory modulation term. The free parameters are the goal (g) and the kernel function weights (w); discrete-time and continuous-time formulations exist.

Trajectory representation: second-order system, trajectory modulation, time evolution.

Learning of Dynamic Motion Primitives can be accomplished by regression.
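A minimal sketch of a one-dimensional discrete DMP: a second-order system attracted to the goal g, modulated by a weighted sum of Gaussian kernels driven by a decaying phase variable. The gains, kernel placement and time constants are conventional illustrative choices, not values from the slides.

import numpy as np

def dmp_rollout(w, y0, g, tau=1.0, dt=0.01, alpha_z=25.0, beta_z=6.25, alpha_x=3.0):
    """Integrate tau*dz = alpha_z*(beta_z*(g - y) - z) + f(x), tau*dy = z,
    where f(x) is the kernel-weighted trajectory modulation term."""
    w = np.asarray(w, dtype=float)
    n = len(w)
    centers = np.exp(-alpha_x * np.linspace(0.0, 1.0, n))    # kernel centres in phase x
    widths = 1.0 / (np.diff(centers, append=centers[-1] / 2.0) ** 2 + 1e-8)
    y, z, x = float(y0), 0.0, 1.0                             # position, scaled velocity, phase
    traj = []
    for _ in range(int(tau / dt)):
        psi = np.exp(-widths * (x - centers) ** 2)            # Gaussian kernels
        f = (psi @ w) / (psi.sum() + 1e-8) * x * (g - y0)     # forcing / modulation term
        z += dt / tau * (alpha_z * (beta_z * (g - y) - z) + f)
        y += dt / tau * z
        x += dt / tau * (-alpha_x * x)                        # time evolution of the phase
        traj.append(y)
    return np.array(traj)

Fitting the weights w to a demonstrated trajectory then reduces to the regression mentioned above (e.g. locally weighted or ridge regression on the desired forcing term).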

Policy gradient methods

Policy gradient methods are a type of reinforcement learning technique that relies on optimizing parameterized policies with respect to the expected return

J(h) = E{ Σ_{k≥0} γ^k r_k },   goal: max_h J(h)

γ – discount factor in [0, 1], r – reward

The parameters h are updated using the gradient update rule

h_{i+1} = h_i + α ∇_h J

α – learning rate

Gradient ascent converges only to a local maximum!

Gradient estimation technique

Finite-difference Methods

Apply random variations Δh_i to the parameters h and evaluate the return experimentally:

ΔJ_i = J(h + Δh_i) − J_ref

Gradient estimation using regression:

∇_h J ≈ (ΔH^T ΔH)^{−1} ΔH^T ΔJ

where the rows of ΔH are the parameter variations Δh_1, Δh_2, ..., Δh_k (one episode per variation) and ΔJ collects the corresponding return differences.

Parameter update:

h_{i+1} = h_i + α ∇_h J
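A sketch of the finite-difference estimator described above: perturb the policy parameters h, measure the resulting changes in return, and recover the gradient by least-squares regression. evaluate_return is a placeholder for running one rollout/episode with the given parameters; the perturbation scale and learning rate are illustrative.

import numpy as np

def finite_difference_gradient(h, evaluate_return, n_perturbations=20, sigma=0.1):
    """Estimate grad_h J from dJ_i = J(h + dh_i) - J_ref via the regression
    grad ~= (dH^T dH)^{-1} dH^T dJ (solved here with least squares)."""
    h = np.asarray(h, dtype=float)
    J_ref = evaluate_return(h)                              # reference rollout
    dH = sigma * np.random.randn(n_perturbations, len(h))   # random parameter variations
    dJ = np.array([evaluate_return(h + dh) - J_ref for dh in dH])
    grad, *_ = np.linalg.lstsq(dH, dJ, rcond=None)          # solves dH @ grad ~= dJ
    return grad

def policy_gradient_step(h, evaluate_return, learning_rate=0.05):
    """Gradient ascent h <- h + alpha * grad_h J (converges to a local maximum)."""
    return np.asarray(h, dtype=float) + learning_rate * finite_difference_gradient(h, evaluate_return)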

Liquid pouring learning - simulation

One transition; the return J is estimated using a balance.

Experimental results

Goal learning
