ECE 517: Reinforcement Learning in Artificial Intelligence
Lecture 20: Approximate & Neuro-Dynamic Programming, Policy Gradient Methods
Dr. Itamar Arel
College of Engineering
Department of Electrical Engineering and Computer Science
The University of Tennessee
Fall 2010
November 15, 2010
Introduction
We will discuss different methods that heuristically approximate the dynamic programming problem:
  Approximate Dynamic Programming (ADP)
  Direct Neuro-Dynamic Programming (NDP)
Assumptions:
  DP methods assume fully observed systems
  ADP methods rely on no model and (usually) assume only partial observability (POMDPs)
Relationship to classical control theories:
  Optimal control – in the linear case, a solved problem; the state vector is estimated using Kalman filter methodologies
  Adaptive control – answers the question: what can we do when the dynamics of the system are unknown?
    We have a model of the plant but lack its parameter values
    Often focuses on stability properties rather than performance
Reference to classic control theories (cont.)
Robust control
  Attempts to find a controller design that guarantees stability, i.e. that the plant will not "blow up", regardless of what the unknown parameter values are
  e.g. Lyapunov-based analysis (e.g. queueing systems)
Adaptive control
  Attempts to adapt the controller in real time, based on real-time observations of how the plant actually behaves
  ADP may be viewed as an adaptive control framework
  "Neural observers" – used to predict the next set of observations, based on which the controller acts
Core principles of ADP
The following three general principles are at the core of ADP:
  Value approximation – instead of solving for V(s) exactly, we can use a universal approximation function V(s, W) ≈ V(s) (see the sketch after this list)
  Alternate starting points – instead of always starting from the Bellman equation directly, we can start from related recurrence equations
  Hybrid design – combining multiple ADP systems into more complex hybrid designs
    Usually in order to scale better
    Mixture of continuous and discrete variables
    Multiple spatio-temporal scales
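To make the value-approximation principle concrete, here is a minimal sketch (my own illustration, not from the lecture) of a linear approximator V(s, W) = W·φ(s) trained with a TD(0) update; the class name, feature map φ, and constants are assumptions.

```python
import numpy as np

class LinearValueApprox:
    """Value approximation: store parameters W instead of a table of V(s),
    so the estimate V(s, W) generalizes across states via features phi(s)."""

    def __init__(self, n_features, alpha=0.01, gamma=0.95):
        self.W = np.zeros(n_features)   # learnable parameters W
        self.alpha = alpha              # learning rate
        self.gamma = gamma              # discount factor

    def value(self, phi_s):
        return float(self.W @ phi_s)    # V(s, W) = W . phi(s)

    def td_update(self, phi_s, reward, phi_s_next, done):
        """One TD(0) step: move V(s, W) toward r + gamma * V(s', W)."""
        target = reward + (0.0 if done else self.gamma * self.value(phi_s_next))
        td_error = target - self.value(phi_s)
        self.W += self.alpha * td_error * phi_s   # gradient of V wrt W is phi(s)
        return td_error
```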
Direct Neuro-Dynamic Programming (Direct NDP)
Motivation
  The intuitive appeal of Reinforcement Learning, in particular the actor/critic design
  The power of the calculus of variations, applied in the form of backpropagation to solve optimal control problems
  Can inherently deal with POMDPs (using RNNs, for example)
The method is considered "direct" in that
  It does not have an explicit state representation
  Temporal progression – everything is a function of time rather than of state/action sets
It is also model-free, as it does not assume a model or attempt to directly estimate the model's dynamics/structure
Direct NDP Architecture
Critic network: estimates the future reward-to-go (i.e. the value function), J
Action network: adjusts the action to minimize the difference between the estimated J and the ultimate objective Uc
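A rough PyTorch sketch of how the two networks could be wired (my own illustration, not code from the lecture): the critic is fit to the recurrence γJ(t) ≈ J(t-1) - r(t), and the actor is updated by backpropagating through the critic so that J(t) approaches the objective Uc. Layer sizes, the choice Uc = 0, and the discount are assumptions.

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma, Uc = 4, 1, 0.95, 0.0   # illustrative values

actor = nn.Sequential(nn.Linear(obs_dim, 16), nn.Tanh(),
                      nn.Linear(16, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 16), nn.Tanh(),
                       nn.Linear(16, 1))
opt_actor = torch.optim.SGD(actor.parameters(), lr=1e-3)
opt_critic = torch.optim.SGD(critic.parameters(), lr=1e-2)

def direct_ndp_step(x_t, r_t, J_prev):
    """One time step: critic fit, then actor correction through the critic."""
    # Critic: fit the recurrence gamma*J(t) ~ J(t-1) - r(t)
    u_t = actor(x_t).detach()                      # action treated as given here
    J_t = critic(torch.cat([x_t, u_t], dim=-1))
    critic_loss = 0.5 * (gamma * J_t - (J_prev - r_t)).pow(2).mean()
    opt_critic.zero_grad(); critic_loss.backward(); opt_critic.step()

    # Actor: backpropagate through the critic so the estimated J(t) approaches Uc
    u_t = actor(x_t)
    J_t = critic(torch.cat([x_t, u_t], dim=-1))
    actor_loss = 0.5 * (J_t - Uc).pow(2).mean()
    opt_actor.zero_grad(); actor_loss.backward(); opt_actor.step()
    return J_t.detach()
```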
Direct NDP vs. Classic RL
[Diagram: the direct NDP agent – an action network producing u(t) and a critic network producing J(t), with error signal J(t-1) - r(t) and objective Uc(t) – shown alongside the classic RL loop in which agent and environment exchange state, reward, and action.]
Inverted Helicopter Flight (Ng / Stanford, 2004)
Solving POMDPs with RNNs
Case study: a framework for obtaining an optimal policy in model-free POMDPs using Recurrent Neural Networks
  Uses an NDP version of Q-Learning
  TRTRL (an efficient version of RTRL) is employed
Goal: investigate a scenario in which two states have the same observation (yet different optimal actions)
Method: RNNs in a TD framework (more later)
Model is unknown!
Direct NDP architecture using RNNs
[Diagram: an RNN receives the observation O_t and candidate action a_t and approximates Q(s_t, a_t); a softmax over the Q estimates selects the final action sent to the environment; the reward r_t drives TD (Q-learning) updates.]
Method is good for small action sets. Q: why?
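A minimal PyTorch sketch of the architecture above (my own illustration; the case study uses a TRTRL-trained RNN, whereas the GRU cell and all sizes here are stand-in assumptions). It also hints at the answer to the question: the network emits, and the softmax compares, one Q value per action, which is only practical for a small discrete action set.

```python
import torch
import torch.nn as nn

class RecurrentQ(nn.Module):
    """An RNN maps each observation O_t, plus its memory of past observations,
    to Q(s_t, a) estimates for a small discrete action set."""
    def __init__(self, obs_dim, n_actions, hidden=32):
        super().__init__()
        self.cell = nn.GRUCell(obs_dim, hidden)      # recurrent state = temporal context
        self.q_head = nn.Linear(hidden, n_actions)   # one Q output per action
        self.h = torch.zeros(1, hidden)

    def forward(self, obs):
        self.h = self.cell(obs.view(1, -1), self.h)
        return self.q_head(self.h).squeeze(0)        # Q values for every action

def select_action(q_values, temperature=1.0):
    """Softmax over per-action Q estimates; feasible only for small action sets."""
    probs = torch.softmax(q_values / temperature, dim=-1)
    return torch.multinomial(probs, 1).item()

def q_learning_target(r_t, q_next, gamma=0.95, done=False):
    """TD target r_t + gamma * max_a Q(s_{t+1}, a) used to train the RNN."""
    return r_t if done else r_t + gamma * float(q_next.max())
```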
Simulation results – 10 neurons
Training Robots (1st-gen AIBOs) to walk (faster)
1st-generation AIBOs were used (internal CPU)
Fundamental motor capabilities were prescribed
  e.g. apply torque to a given joint, turn in a given direction
  In other words, a finite action set
Observations were limited to distance (a.k.a. radar view)
The goal was to cross the field in a short time
  The reward grew increasingly negative as time progressed
  Large positive reward when the goal was met
Multiple robots were trained to observe variability in the learning process
The general RL approach revisited
RL will solve all of your problems, but …
  We need lots of experience to train from
  Taking random actions can be dangerous
  It can take a long time to learn
  Not all problems fit into the NDP framework
An alternative approach to RL is to reward whole policies, rather than individual actions
  Run the whole policy, then receive a single reward
  The reward measures the success of the entire policy
If there are a small number of policies, we can exhaustively try them all
  However, this is not possible in most interesting problems
Policy Gradient Methods
Assume that our policy, π, has a set of n real-valued parameters, θ = {θ_1, θ_2, θ_3, ..., θ_n}
  Running the policy with a particular θ results in a reward, r
  Estimate the reward gradient, ∂r/∂θ_i, for each θ_i
Then update each parameter by gradient ascent on the reward:
  θ_i ← θ_i + α · ∂r/∂θ_i
where α is another learning rate.
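As a concrete illustration of this update (a sketch under my own assumptions): one simple way to estimate ∂r/∂θ_i is by finite differences, perturbing one parameter at a time. `run_policy` is a hypothetical function returning the reward earned by running the policy with the given parameters, and the step sizes are invented; the next slides describe a more sample-efficient perturbation scheme.

```python
import numpy as np

def policy_gradient_step(theta, run_policy, alpha=0.05, eps=0.1):
    """Estimate dr/dtheta_i numerically, then take the gradient-ascent step
    theta_i <- theta_i + alpha * dr/dtheta_i."""
    grad = np.zeros_like(theta)
    for i in range(len(theta)):
        up, down = theta.copy(), theta.copy()
        up[i] += eps
        down[i] -= eps
        grad[i] = (run_policy(up) - run_policy(down)) / (2 * eps)  # central difference
    return theta + alpha * grad
```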
Policy Gradient Methods (cont.)
This results in hill-climbing in policy space
  So, it's subject to all the problems of hill-climbing
  But we can also use tricks from search theory, like random restarts and momentum terms
This is a good approach if you have a parameterized policy
  Let's assume we have a "reasonable" starting policy
  Typically faster than value-based methods
  "Safe" exploration, if you have a good policy
  Learns locally-best parameters for that policy
An Example: Learning to Walk
RoboCup 4-legged league
  Walking quickly is a big advantage
  Until recently, this was tuned manually
Robots have a parameterized gait controller
  12 parameters
  Controls step length, height, etc.
The robot walks across the soccer field and is timed
  Reward is a function of the time taken
  They know when to stop (distance measure)
An Example: Learning to Walk (cont.)
Basic idea (a code sketch of this loop follows):
1. Pick an initial θ = {θ_1, θ_2, ..., θ_12}
2. Generate N testing parameter settings by perturbing θ: θ_j = {θ_1 + δ_1, θ_2 + δ_2, ..., θ_12 + δ_12}, with each δ_i ∈ {-ε, 0, +ε}
3. Test each setting and observe its reward: θ_j → r_j
4. For each θ_i, compute the average reward of the settings in which δ_i was +ε, 0, and -ε, and set
   θ'_i = θ_i + ε if the +ε group's average reward is largest
   θ'_i = θ_i      if the 0 group's average reward is largest
   θ'_i = θ_i - ε if the -ε group's average reward is largest
5. Set θ ← θ' and go to step 2
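A hedged code sketch of steps 1 through 5 above. The function `evaluate_gait` is hypothetical (it stands for running the robot with the given 12 parameters and returning its reward), and N, ε, and the iteration count are illustrative values, not those used in the actual experiments.

```python
import numpy as np

def learn_gait(theta0, evaluate_gait, N=15, eps=0.05, iterations=20):
    theta = np.asarray(theta0, dtype=float)
    for _ in range(iterations):
        # 2. N perturbed settings: each component moved by -eps, 0, or +eps
        deltas = np.random.choice([-eps, 0.0, eps], size=(N, theta.size))
        # 3. Test each setting and record its reward
        rewards = np.array([evaluate_gait(theta + d) for d in deltas])
        # 4. Per parameter, compare the average reward of the +eps / 0 / -eps groups
        new_theta = theta.copy()
        for i in range(theta.size):
            avgs = {s: (rewards[deltas[:, i] == s].mean()
                        if np.any(deltas[:, i] == s) else -np.inf)
                    for s in (eps, 0.0, -eps)}
            best = max(avgs, key=avgs.get)   # which perturbation did best on average
            new_theta[i] += best             # shift by +eps, 0, or -eps accordingly
        # 5. Adopt the new parameter vector and repeat
        theta = new_theta
    return theta
```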
An Example: Learning to Walk (cont.)
[Videos: the gait before learning (Initial) and after learning (Final)]
Q: Can we translate policy gradient into a direct-policy, actor/critic Neuro-Dynamic Programming system?
Value Function or Policy Gradient?
When should I use policy gradient?
  When there is a parameterized policy
  When there is a high-dimensional state space
  When we expect the gradient to be smooth
  Typically on episodic tasks (e.g. AIBO walking)
When should I use a value-based method?
  When there is no parameterized policy
  When we have no idea how to solve the problem (i.e. no known structure)
Direct NDP with RNNs – Backpropagation through a model
RNNs have memory and can create temporal context
Applies to both actor and critic
Much harder to train (time and logic/memory resources)
  e.g. RTRL issues
Consolidated Actor-Critic Model (Z. Liu, I. Arel, 2007)
A single network (FF or RNN) is sufficient for both actor and critic functions
Two passes (TD-style) for both action and value-estimate corrections
Training via standard techniques, e.g. BP
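A very rough sketch of the consolidated idea, assuming a single feed-forward network with a shared body and two heads, one for the action and one for the value estimate, trainable with ordinary backprop. The slide does not specify the actual CAC structure or its two-pass TD-style update, so everything below is an assumption for illustration only.

```python
import torch
import torch.nn as nn

class ConsolidatedActorCritic(nn.Module):
    """One network serving both roles: shared hidden layer, an action head
    (actor) and a value head (critic). Separate backward passes for the TD
    value error and the action correction would reuse the same weights."""
    def __init__(self, obs_dim, act_dim, hidden=32):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(obs_dim, hidden), nn.Tanh())
        self.action_head = nn.Sequential(nn.Linear(hidden, act_dim), nn.Tanh())
        self.value_head = nn.Linear(hidden, 1)

    def forward(self, obs):
        z = self.shared(obs)
        return self.action_head(z), self.value_head(z)  # (action, value) from one network
```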