ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 20: Approximate & Neuro-Dynamic Programming, Policy Gradient Methods
Dr. Itamar Arel
College of Engineering
Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2010
November 15, 2010
Introduction
We will discuss different methods that heuristically approximate the dynamic programming problem:
  Approximate Dynamic Programming (ADP)
  Direct Neuro-Dynamic Programming (NDP)
Assumptions:
  DP methods assume fully observed systems
  ADP methods rely on no model and (usually) assume only partial observability (POMDPs)
Relationship to classical control theories:
  Optimal control – in the linear case, a solved problem; the state vector is estimated using Kalman filter methodologies
  Adaptive control – answers the question: what can we do when the dynamics of the system are unknown?
    We have a model of the plant but lack its parameter values
    Often focuses on stability properties rather than performance
Reference to classic control theories (cont.)
Robust control
  Attempts to find a controller design that guarantees stability, i.e. that the plant will not "blow up", regardless of what the unknown parameter values are
  e.g. Lyapunov-based analysis (e.g. queueing systems)
Adaptive control
  Attempts to adapt the controller in real time, based on real-time observations of how the plant actually behaves
  ADP may be viewed as an adaptive control framework
  "Neural observers" – used to predict the next set of observations, based on which the controller acts
Core principles of ADP
The following three general principles are at the core of ADP:
  Value approximation – instead of solving for V(s) exactly, we can use a universal approximation function V(s, W) ≈ V(s) (see the sketch after this list)
  Alternate starting points – instead of always starting from the Bellman equation directly, we can start from related recurrence equations
  Hybrid design – combining multiple ADP systems into more complex hybrid designs
    Usually in order to scale better
    Mixture of continuous and discrete variables
    Multiple spatio-temporal scales
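To make the value-approximation principle concrete, here is a minimal sketch (my own illustration, not from the lecture) of a linear approximator V(s, W) = W·φ(s) trained with a TD(0) update; the class name, feature map φ, and constants are assumptions.

```python
import numpy as np

class LinearValueApprox:
    """Value approximation: store parameters W instead of a table of V(s),
    so the estimate V(s, W) generalizes across states via features phi(s)."""

    def __init__(self, n_features, alpha=0.01, gamma=0.95):
        self.W = np.zeros(n_features)   # learnable parameters W
        self.alpha = alpha              # learning rate
        self.gamma = gamma              # discount factor

    def value(self, phi_s):
        return float(self.W @ phi_s)    # V(s, W) = W . phi(s)

    def td_update(self, phi_s, reward, phi_s_next, done):
        """One TD(0) step: move V(s, W) toward r + gamma * V(s', W)."""
        target = reward + (0.0 if done else self.gamma * self.value(phi_s_next))
        td_error = target - self.value(phi_s)
        self.W += self.alpha * td_error * phi_s   # gradient of V wrt W is phi(s)
        return td_error
```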
Direct Neuro-Dynamic Programming (Direct NDP)
Motivation
  The intuitive appeal of Reinforcement Learning, in particular the actor/critic design
  The power of the calculus of variations, applied in the form of backpropagation to solve optimal control problems
  Can inherently deal with POMDPs (using RNNs, for example)
The method is considered "direct" in that
  It does not have an explicit state representation
  Temporal progression – everything is a function of time rather than of state/action sets
It is also model-free, as it does not assume a model or attempt to directly estimate the model's dynamics/structure
Direct NDP Architecture
Critic network: estimates the future reward-to-go (i.e. the value function), J
Action network: adjusts the action to minimize the difference between the estimated J and the ultimate objective Uc
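A rough PyTorch sketch of how the two networks could be wired (my own illustration, not code from the lecture): the critic is fit to the recurrence γJ(t) ≈ J(t-1) - r(t), and the actor is updated by backpropagating through the critic so that J(t) approaches the objective Uc. Layer sizes, the choice Uc = 0, and the discount are assumptions.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, Uc = 4, 1, 0.95, 0.0   # illustrative values

actor = nn.Sequential(nn.Linear(obs_dim, 16), nn.Tanh(),
                      nn.Linear(16, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 16), nn.Tanh(),
                       nn.Linear(16, 1))
opt_actor = torch.optim.SGD(actor.parameters(), lr=1e-3)
opt_critic = torch.optim.SGD(critic.parameters(), lr=1e-2)

def direct_ndp_step(x_t, r_t, J_prev):
    """One time step: critic fit, then actor correction through the critic."""
    # Critic: fit the recurrence gamma*J(t) ~ J(t-1) - r(t)
    u_t = actor(x_t).detach()                      # action treated as given here
    J_t = critic(torch.cat([x_t, u_t], dim=-1))
    critic_loss = 0.5 * (gamma * J_t - (J_prev - r_t)).pow(2).mean()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Actor: backpropagate through the critic so the estimated J(t) approaches Uc
    u_t = actor(x_t)
    J_t = critic(torch.cat([x_t, u_t], dim=-1))
    actor_loss = 0.5 * (J_t - Uc).pow(2).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
    return J_t.detach()
```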
Direct NDP vs. Classic RL
[Diagram: the direct NDP agent – an action network producing u(t) and a critic network producing J(t), with error signal J(t-1) - r(t) and objective Uc(t) – shown alongside the classic RL loop in which agent and environment exchange state, reward, and action.]
Inverted Helicopter Flight (Ng / Stanford, 2004)
Solving POMDPs with RNNs
Case study: a framework for obtaining an optimal policy in model-free POMDPs using Recurrent Neural Networks
  Uses an NDP version of Q-Learning
  TRTRL (an efficient version of RTRL) is employed
Goal: investigate a scenario in which two states have the same observation (yet different optimal actions)
Method: RNNs in a TD framework (more later)
Model is unknown!
Direct NDP architecture using RNNs
[Diagram: an RNN receives the observation O_t and candidate action a_t and approximates Q(s_t, a_t); a softmax over the Q estimates selects the final action sent to the environment; the reward r_t drives TD (Q-learning) updates.]
Method is good for small action sets. Q: why?
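A minimal PyTorch sketch of the architecture above (my own illustration; the case study uses a TRTRL-trained RNN, whereas the GRU cell and all sizes here are stand-in assumptions). It also hints at the answer to the question: the network emits, and the softmax compares, one Q value per action, which is only practical for a small discrete action set.

```python
import torch
import torch.nn as nn

class RecurrentQ(nn.Module):
    """An RNN maps each observation O_t, plus its memory of past observations,
    to Q(s_t, a) estimates for a small discrete action set."""
    def __init__(self, obs_dim, n_actions, hidden=32):
        super().__init__()
        self.cell = nn.GRUCell(obs_dim, hidden)      # recurrent state = temporal context
        self.q_head = nn.Linear(hidden, n_actions)   # one Q output per action
        self.h = torch.zeros(1, hidden)

    def forward(self, obs):
        self.h = self.cell(obs.view(1, -1), self.h)
        return self.q_head(self.h).squeeze(0)        # Q values for every action

def select_action(q_values, temperature=1.0):
    """Softmax over per-action Q estimates; feasible only for small action sets."""
    probs = torch.softmax(q_values / temperature, dim=-1)
    return torch.multinomial(probs, 1).item()

def q_learning_target(r_t, q_next, gamma=0.95, done=False):
    """TD target r_t + gamma * max_a Q(s_{t+1}, a) used to train the RNN."""
    return r_t if done else r_t + gamma * float(q_next.max())
```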
Simulation results – 10 neurons
Training Robots (1st-gen AIBOs) to walk (faster)
1st-generation AIBOs were used (internal CPU)
Fundamental motor capabilities were prescribed
  e.g. apply torque to a given joint, turn in a given direction
  In other words, a finite action set
Observations were limited to distance (a.k.a. radar view)
The goal was to cross the field in a short time
  The reward grew increasingly negative as time progressed
  Large positive reward when the goal was met
Multiple robots were trained to observe variability in the learning process
The general RL approach revisited
RL will solve all of your problems, but …
  We need lots of experience to train from
  Taking random actions can be dangerous
  It can take a long time to learn
  Not all problems fit into the NDP framework
An alternative approach to RL is to reward whole policies, rather than individual actions
  Run the whole policy, then receive a single reward
  The reward measures the success of the entire policy
If there are a small number of policies, we can exhaustively try them all
  However, this is not possible in most interesting problems
Policy Gradient Methods
Assume that our policy, π, has a set of n real-valued parameters, θ = {θ_1, θ_2, θ_3, ..., θ_n}
  Running the policy with a particular θ results in a reward, r
  Estimate the reward gradient, ∂r/∂θ_i, for each θ_i
Then update each parameter by gradient ascent on the reward:
  θ_i ← θ_i + α · ∂r/∂θ_i
where α is another learning rate.
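As a concrete illustration of this update (a sketch under my own assumptions): one simple way to estimate ∂r/∂θ_i is by finite differences, perturbing one parameter at a time. `run_policy` is a hypothetical function returning the reward earned by running the policy with the given parameters, and the step sizes are invented; the next slides describe a more sample-efficient perturbation scheme.

```python
import numpy as np

def policy_gradient_step(theta, run_policy, alpha=0.05, eps=0.1):
    """Estimate dr/dtheta_i numerically, then take the gradient-ascent step
    theta_i <- theta_i + alpha * dr/dtheta_i."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        up, down = theta.copy(), theta.copy()
        up[i] += eps
        down[i] -= eps
        grad[i] = (run_policy(up) - run_policy(down)) / (2 * eps)  # central difference
    return theta + alpha * grad
```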
Policy Gradient Methods (cont.)
This results in hill-climbing in policy space
  So, it's subject to all the problems of hill-climbing
  But we can also use tricks from search theory, like random restarts and momentum terms
This is a good approach if you have a parameterized policy
  Let's assume we have a "reasonable" starting policy
  Typically faster than value-based methods
  "Safe" exploration, if you have a good policy
  Learns locally-best parameters for that policy
An Example: Learning to Walk
RoboCup 4-legged league
  Walking quickly is a big advantage
  Until recently, this was tuned manually
Robots have a parameterized gait controller
  12 parameters
  Controls step length, height, etc.
The robot walks across the soccer field and is timed
  Reward is a function of the time taken
  They know when to stop (distance measure)
An Example: Learning to Walk (cont.)
Basic idea (a code sketch of this loop follows):
1. Pick an initial θ = {θ_1, θ_2, ..., θ_12}
2. Generate N testing parameter settings by perturbing θ: θ_j = {θ_1 + δ_1, θ_2 + δ_2, ..., θ_12 + δ_12}, with each δ_i ∈ {-ε, 0, +ε}
3. Test each setting and observe its reward: θ_j → r_j
4. For each θ_i, compute the average reward of the settings in which δ_i was +ε, 0, and -ε, and set
   θ'_i = θ_i + ε if the +ε group's average reward is largest
   θ'_i = θ_i      if the 0 group's average reward is largest
   θ'_i = θ_i - ε if the -ε group's average reward is largest
5. Set θ ← θ' and go to step 2
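A hedged code sketch of steps 1 through 5 above. The function `evaluate_gait` is hypothetical (it stands for running the robot with the given 12 parameters and returning its reward), and N, ε, and the iteration count are illustrative values, not those used in the actual experiments.

```python
import numpy as np

def learn_gait(theta0, evaluate_gait, N=15, eps=0.05, iterations=20):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iterations):
        # 2. N perturbed settings: each component moved by -eps, 0, or +eps
        deltas = np.random.choice([-eps, 0.0, eps], size=(N, theta.size))
        # 3. Test each setting and record its reward
        rewards = np.array([evaluate_gait(theta + d) for d in deltas])
        # 4. Per parameter, compare the average reward of the +eps / 0 / -eps groups
        new_theta = theta.copy()
        for i in range(theta.size):
            avgs = {s: (rewards[deltas[:, i] == s].mean()
                        if np.any(deltas[:, i] == s) else -np.inf)
                    for s in (eps, 0.0, -eps)}
            best = max(avgs, key=avgs.get)   # which perturbation did best on average
            new_theta[i] += best             # shift by +eps, 0, or -eps accordingly
        # 5. Adopt the new parameter vector and repeat
        theta = new_theta
    return theta
```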
An Example: Learning to Walk (cont.)
[Videos: the gait before learning (Initial) and after learning (Final)]
Q: Can we translate policy gradient into a direct-policy, actor/critic Neuro-Dynamic Programming system?
Value Function or Policy Gradient?
When should I use policy gradient?
  When there is a parameterized policy
  When there is a high-dimensional state space
  When we expect the gradient to be smooth
  Typically on episodic tasks (e.g. AIBO walking)
When should I use a value-based method?
  When there is no parameterized policy
  When we have no idea how to solve the problem (i.e. no known structure)
Direct NDP with RNNs – Backpropagation through a model
RNNs have memory and can create temporal context
Applies to both actor and critic
Much harder to train (time and logic/memory resources)
  e.g. RTRL issues
Consolidated Actor-Critic Model (Z. Liu, I. Arel, 2007)
A single network (FF or RNN) is sufficient for both actor and critic functions
Two passes (TD-style) for both action and value-estimate corrections
Training via standard techniques, e.g. BP
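A very rough sketch of the consolidated idea, assuming a single feed-forward network with a shared body and two heads, one for the action and one for the value estimate, trainable with ordinary backprop. The slide does not specify the actual CAC structure or its two-pass TD-style update, so everything below is an assumption for illustration only.

```python
import torch
import torch.nn as nn

class ConsolidatedActorCritic(nn.Module):
    """One network serving both roles: shared hidden layer, an action head
    (actor) and a value head (critic). Separate backward passes for the TD
    value error and the action correction would reuse the same weights."""
    def __init__(self, obs_dim, act_dim, hidden=32):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.action_head = nn.Sequential(nn.Linear(hidden, act_dim), nn.Tanh())
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs):
        z = self.shared(obs)
        return self.action_head(z), self.value_head(z)  # (action, value) from one network
```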