Reinforcement Learning
Rafy Michaeli
Assaf Naor
Supervisor:
Yaakov Engel
Visit project’s home page at:
http://www.technion.ac.il/~smily/rl/index.html
FOR MORE INFO...
Project Goals
Study the field of Reinforcement Learning (RL)
Gain practical experience with implementing RL algorithms
Examine the influence of various parameters on the performance of RL algorithms
Overview
Reinforcement Learning
In RL problems, an agent (a decision-maker) attempts to control a dynamic system by choosing an action at every time interval
Overview, cont.
Reinforcement Learning
The agent receives feedback with every action it executes
Overview, cont.
Reinforcement Learning
The ultimate goal of the agent is to learn a strategy for selecting actions such that the overall performance is optimized according to a given criterion
Overview, cont.
The Value function
Given a fixed policy π, which determines the action to be performed at a given state, this function assigns a value to every state in the state space (all possible states the system can have)
Overview, cont.
The Value function
The value of a state is defined as the weighted sum of the reinforcements received when starting at that state and following the given policy to a final state, where short-term reinforcements are weighted more heavily than long-term ones
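In standard notation this reads as follows (a sketch: the slides do not show the formula, so the discount factor $\gamma$ and the reinforcements $r_t$ are assumed notation):

$V^\pi(s) = \mathrm{E}\left[\, \sum_{t=0}^{T} \gamma^t r_t \,\middle|\, s_0 = s \right]$, where $0 \le \gamma \le 1$ weights short-term reinforcements more heavily and $T$ is the time at which a final state is reached.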
Overview, cont.
The Action Value Function or Q-Function
Given a fixed policy π, this function assigns a value to every (state, action) pair in the (state, action) space
Overview, cont.
The Action Value Function or Q-Function
The value of a pair (state s, action a) is defined as the weighted sum of the reinforcements due to executing action a at state s and then following the given policy for selecting actions in subsequent states
Overview, cont.
The Action Value Function or Q-Function
Or mathematically:
$Q(s_t, a_t) = r_{t+1} + \gamma\, Q(s_{t+1}, \pi(s_{t+1}))$
Overview, cont.
The learning algorithm
Uses experiences to progressively learn the optimal value function, which is the function that predicts the best long-term outcome an agent can receive from a given state
Overview, cont.
The learning algorithm
The agent learns the optimal value function by continually exercising the current, non-optimal estimate of the value function and improving this estimate after every experience
Overview, cont.
The learning algorithm
Given the optimal value function, the agent can then derive the optimal policy by computing:
$\pi^*(s_t) = \arg\max_{a \in A} \{ R(s_t, a) + \gamma\, V^*(T(s_t, a)) \}$
where $T(s_t, a)$ is the state reached by executing action $a$ at state $s_t$.
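As a minimal MATLAB sketch of this one-step lookahead (the handles R, T and V below are hypothetical stand-ins for the reward model, the transition model and the learned value estimate, not the project's code):

    % Hypothetical models, defined only so that the sketch runs end to end.
    R = @(s, a) -1;                  % constant step cost (assumption)
    T = @(s, a) s + 0.1 * a;         % toy deterministic transition (assumption)
    V = @(s) -abs(s);                % toy value estimate (assumption)
    gamma = 1;                       % discount factor, 1 as in the project
    s = 0.5;                         % current state
    actions = [-1, 0, 1];            % e.g. reverse / coast / forward

    best_val = -Inf;  best_a = actions(1);
    for a = actions                  % evaluate every action and keep the best
        v = R(s, a) + gamma * V(T(s, a));
        if v > best_val
            best_val = v;  best_a = a;
        end
    end
    % best_a is the greedy action pi(s)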
Description
Surveyed the field of learning in general, focusing on RL algorithms
Implemented various RL algorithms on a chosen task, aiming to teach the agent the best way to perform the task
Description, cont.
The task of the agent: Given a car's initial location and velocity, bring it to the desired location with zero velocity, as quickly as possible!
Description, cont.
The task of the agent
System description:
• The car can move either forwards or backwards
• The agent can control the car's acceleration at any time interval
Description, cont.
The task of the agent
System description:
• Walls are placed on both sides of the track
• When the car hits a wall, it bounces back at the same speed it had prior to the collision
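A single simulation step of this system might look as follows in MATLAB (a sketch only: the time step, wall positions and frictionless point-mass dynamics are assumptions, since the slides do not give the actual model):

    % One simulated time step of the car with bounce-back walls (sketch).
    dt = 0.1;  x_min = -1;  x_max = 1;   % step size and walls (assumed)
    x = 0;  v = 0;  a = 1;               % position, velocity, chosen acceleration

    v = v + a * dt;                      % the agent controls the acceleration
    x = x + v * dt;
    if x < x_min || x > x_max            % hit a wall: bounce back at the
        x = min(max(x, x_min), x_max);   % same speed as before the collision
        v = -v;
    end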
Description, cont.
The code was written in MATLAB
Performed experiments to determine the influence of different parameters on the learning algorithm (mainly on convergence and on how quickly the system learns the optimal policy)
Tested the performance of CMAC as a function approximator (tested for both 1D and 2D functions)
Implementation issues
Function approximators - Representing the Value/Q-Function
– Lookup Tables
• A finite ordered set of elements (a possible implementation would be an array). Each element is uniquely associated with an index, and the element is accessed through its index.
• Each region in a continuous state space is mapped into an element of the lookup table. Thus, all states within a region are aggregated into one table element and are therefore assigned the same value.
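A uniform version of this mapping could look as follows in MATLAB (a sketch; the grid sizes and state-space ranges are illustrative assumptions):

    % Map a continuous (position, velocity) state to a lookup-table cell.
    n_x = 20;  n_v = 20;                 % cells per dimension (assumed)
    x_rng = [-1, 1];  v_rng = [-2, 2];   % state-space ranges (assumed)
    x = 0.3;  v = -0.5;                  % the continuous state to be mapped

    ix = 1 + floor((x - x_rng(1)) / diff(x_rng) * n_x);   % index per axis
    iv = 1 + floor((v - v_rng(1)) / diff(v_rng) * n_v);
    ix = min(max(ix, 1), n_x);           % clamp states on the upper boundary
    iv = min(max(iv, 1), n_v);

    Vtable = zeros(n_x, n_v);            % one stored value per region
    value  = Vtable(ix, iv);             % all states in the region share it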
Implementation issues, cont.
Function approximators - Representing the Value/Q-Function
– Lookup Tables
• This mapping from the state space to the Lookup Table can be uniform or non-uniform.
An example of a uniform mapping of the state space to cells
Implementation issues, cont.
Function approximators - Representing the Value/Q-Function
– Cerebellar Model Articulation Controller (CMAC)
• Each state activates a specific set of memory locations (features). The arithmetic sum of their values is the value of the stored function.
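For a 1D input, the idea can be sketched in a few lines of MATLAB (the number of tilings, the tile count and the input range are illustrative assumptions):

    % CMAC read-out: each tiling contributes one active feature (sketch).
    n_tilings = 8;  n_tiles = 16;         % overlapping, mutually offset tilings
    lo = -1;  hi = 1;                     % input range (assumed)
    w = zeros(n_tilings, n_tiles);        % one weight per memory location
    s = 0.37;                             % query state

    tile_w = (hi - lo) / (n_tiles - 1);   % width of a single tile
    val = 0;
    for k = 1:n_tilings
        offset = (k - 1) / n_tilings * tile_w;       % each tiling is shifted
        j = min(max(1 + floor((s - lo + offset) / tile_w), 1), n_tiles);
        val = val + w(k, j);              % stored value = sum over active tiles
    end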
Implementation issues, cont.
Learning the optimal Value Function
We wish to learn the optimal Value Function, from which we can deduce the optimal action policy
Implementation issues, cont.
Learning the optimal Value Function
Our learning algorithm was based on methods of Temporal Difference, or TD for short
Implementation issues, cont.
Learning the optimal Value Function
We define the temporal difference as:
$\delta_t(s_t) = R(s_t, \pi(s_t)) + \gamma\, \hat{V}(s_{t+1}) - \hat{V}(s_t)$
Implementation issues, cont.
Learning the optimal Value Function
At each time step we update the estimated Value Function by calculating:
$\hat{V}_{t+1}(s_t) = \hat{V}_t(s_t) + \alpha\, \delta(s_t)$
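With a lookup table, these two formulas reduce to a few MATLAB lines (a sketch; the learning rate alpha is not given on the slides, so its value here is an assumption):

    % One TD(0) update of a lookup-table value estimate (sketch).
    alpha = 0.5;  gamma = 1;          % learning rate (assumed) and discount
    V = zeros(1, 10);                 % toy table over 10 aggregated states
    s = 3;  s_next = 4;  r = -1;      % observed transition and reinforcement

    delta = r + gamma * V(s_next) - V(s);   % the temporal difference
    V(s)  = V(s) + alpha * delta;           % move the estimate toward the target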
Implementation issues, cont.
Learning the optimal Value Function
By definition, the optimal policy satisfies:
$\pi^*(s_t) = \arg\max_{a \in A} \{ R(s_t, a) + \gamma\, V^*(T(s_t, a)) \}$
Implementation issues, cont.
Learning the optimal Value Function – TD(λ) and Eligibility Traces
The TD rule as presented above is really an instance of a more general class of algorithms called TD(λ), with λ = 0.
Implementation issues, cont.
Learning the optimal Value Function – TD(λ) and Eligibility Traces
The general TD(λ) rule is similar to the TD rule given above:
$\hat{V}_{t+1}(s) = \hat{V}_t(s) + \alpha\, \delta_t\, e_t(s)$
where the eligibility trace $e_t(s)$ is defined as:
$e_t(s) = \begin{cases} \lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \\ \lambda\, e_{t-1}(s) & \text{if } s \neq s_t \end{cases}$
and λ is taken to be $0 \le \lambda \le 1$.
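The same update written with eligibility traces, again as a MATLAB sketch over a lookup table (parameter values are illustrative; the trace decay is written with lambda alone because gamma = 1 in this project):

    % One TD(lambda) step: all states are updated in proportion to their trace.
    alpha = 0.5;  gamma = 1;  lambda = 0.9;
    V = zeros(1, 10);  e = zeros(1, 10);    % value table and one trace per state
    s = 3;  s_next = 4;  r = -1;            % observed transition

    e = lambda * e;                         % decay every trace ...
    e(s) = e(s) + 1;                        % ... and bump the visited state
    delta = r + gamma * V(s_next) - V(s);
    V = V + alpha * delta * e;              % credit flows back along the trace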
Implementation issues, cont.
Look-Up Table Implementation
– We used a Look-Up Table to represent the Value Function, and acquired the optimal policy by applying the TD(λ) algorithm.
– We used a non-uniform mapping of the state space to cells in the Look-Up Table, which enabled us to keep a rather small number of cells while still having a fine quantization around the origin.
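One way to realize such a non-uniform mapping is to warp the state through a nonlinear function before binning uniformly; the sketch below uses a square-root warp, which is an assumption, as the slides do not specify the project's actual grid:

    % Finer cells near the origin: warp, then bin uniformly in warped space.
    n_x = 20;  x_max = 1;                  % cells and range (assumed)
    x = 0.05;                              % a state close to the origin

    u  = sign(x) * sqrt(abs(x) / x_max);   % stretches the region around 0
    ix = 1 + floor((u + 1) / 2 * n_x);     % uniform binning of u in [-1, 1]
    ix = min(max(ix, 1), n_x);             % clamp the boundary state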
Implementation issues, cont.
CMAC Implementation - 1
– CMAC is used to represent the Value Function and TD(λ) is the learning algorithm.
CMAC Implementation - 2
– CMAC is used to represent the Q-Function and TD(λ) is the learning algorithm.
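A single TD update of the CMAC-backed Q-Function might look as follows (a sketch: one weight slice per action is an assumed layout, and random tile indices stand in for the tiling computation shown earlier):

    % TD update of a Q-Function stored in a CMAC (sketch).
    alpha = 0.5;  gamma = 1;
    n_actions = 3;  n_tilings = 8;  n_tiles = 16;
    W = zeros(n_actions, n_tilings, n_tiles);    % one weight per (action, feature)

    a = 2;       idx      = randi(n_tiles, 1, n_tilings);  % current pair
    a_next = 1;  idx_next = randi(n_tiles, 1, n_tilings);  % successor pair
    r = -1;

    q = 0;  q_next = 0;
    for k = 1:n_tilings                          % Q = sum of active features
        q      = q      + W(a, k, idx(k));
        q_next = q_next + W(a_next, k, idx_next(k));
    end
    delta = r + gamma * q_next - q;
    for k = 1:n_tilings                          % spread the correction evenly
        W(a, k, idx(k)) = W(a, k, idx(k)) + (alpha / n_tilings) * delta;
    end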
Implementation issues, cont.
System simulation description
We simulated each of the three implementations for different values of λ: 0, 0.5, 0.9.
For each value of λ we tested the system for different values of the learning rate α: 0.2, 0.5, 0.8.
The discount factor γ was taken to be 1 throughout the simulations.
Simulation Results
We define:
– Success rate: The percentage of all tries in which the agent successfully brought the car to its destination with zero velocity
Simulation Results
We define:
– Average way: The average number of time intervals it took the agent to bring the car to its destination
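From a log of tries, both measures can be computed directly (a sketch; the log format, one success flag and one step count per try, is an assumption):

    % Success rate and average way from logged tries (sketch).
    success = [0 0 1 0 1 1 1];           % 1 where the car reached the goal
    steps   = [150 150 90 150 70 55 40]; % time intervals used in each try

    success_rate = 100 * mean(success);        % percentage of successful tries
    average_way  = mean(steps(success == 1));  % mean steps over the successes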
Simulation Results
Look-Up Table results
– A common result for all parameter variants is the improvement of the success rate and the shortening of the average way to goal as learning progresses.
– For a given λ, it is hard to observe any differences between the results for different values of α.
– As λ increases, the learning process is better, i.e. for a given try number, the results for the success rate and average way are better.
Simulation Results
Look-Up Table results
– It is noted that eventually, in all cases, the success rate reaches 100%, i.e. the agent successfully brought the car to its goal.
Simulation Results
CMAC Q-Function results
– A common result for all parameter variants is the improvement of the success rate and the shortening of the average way to goal as learning progresses.
– For a given λ, better results were obtained for a bigger α.
– As λ increases, the learning process is generally better.
Simulation Results
CMAC Q-Function results
– In most cases, a 100% success rate is not reached, though it does reach 100% in some cases.
– In some cases the success rate decreases over the course of the tries and then increases again.
Simulation Results
CMAC Value Iteration results
– The following figure shows the results obtained by the CMAC Value Iteration implementation compared to the results already obtained for the CMAC Q-Learning implementation. The results are for the best pair of (λ, α) as obtained from the previous results.
Simulation Results
CMAC Value Iteration results
A comparison between CMAC Q-Learning and CMAC Value Iteration performance
Simulation Results
Learning process examples
– In figure 1 we show the process of learning for a specific starting state and learning parameters. The figure shows the movement of the car after every few tries, for 150 consecutive time intervals.
– In figure 2 we demonstrate the system's ability (at the end of learning) to direct the car to its goal starting from different states.
Figure 2: The system's performance from different starting states after try 20, with the parameters λ = 0.9, α = 0.8
Conclusions
In this project we implemented a family of RL algorithms, TD(λ), with two different function approximators: CMAC and Look-Up Table.
Conclusions
We examined the effect of the learning parameters, λ and α, on the overall performance of the system.
– In the Look-Up Table implementation: α does not have a significant impact on the results; as λ increases, the success rate increases more rapidly.
Conclusions
We examined the effect of the learning parameters, λ and α, on the overall performance of the system.
– In the CMAC implementation: as λ or α increases, the success rate increases and the average way decreases more rapidly.