Between MDPs and Semi-MDPs:
Learning, Planning and Representing Knowledge at Multiple Temporal Scales

Richard S. Sutton, Doina Precup
University of Massachusetts

Satinder Singh
University of Colorado

with thanks to Andy Barto,
Amy McGovern, Andrew Fagg, Ron Parr, Csaba Szepesvári
Related Work

“Classical” AI
Fikes, Hart & Nilsson (1972); Newell & Simon (1972); Sacerdoti (1974, 1977)

Macro-Operators
Korf (1985); Minton (1988); Iba (1989); Kibler & Ruby (1992)

Qualitative Reasoning
Kuipers (1979); de Kleer & Brown (1984); DeJong (1994)

Laird et al. (1986); Drescher (1991); Levinson & Fuchs (1994); Say & Selahatin (1996); Brafman & Moshe (1997)

Robotics and Control Engineering
Brooks (1986); Maes (1991); Koza & Rice (1992); Brockett (1993); Grossman et al. (1993); Dorigo & Colombetti (1994); Asada et al. (1996); Uchibe et al. (1996); Huber & Grupen (1997); Kalmar et al. (1997); Matarić (1997); Sastry (1997); Toth et al. (1997)

Reinforcement Learning and MDP Planning
Mahadevan & Connell (1992); Singh (1992); Lin (1993); Dayan & Hinton (1993); Kaelbling (1993); Chrisman (1994); Bradtke & Duff (1995); Ring (1995); Sutton (1995); Thrun & Schwartz (1995); Boutilier et al. (1997); Dietterich (1997); Wiering & Schmidhuber (1997); Precup, Sutton & Singh (1997); McGovern & Sutton (1998); Parr & Russell (1998); Drummond (1998); Hauskrecht et al. (1998); Meuleau et al. (1998); Ryan and Pendrith (1998)
Abstraction in Learning and Planning

• A long-standing, key problem in AI!

• How can we give abstract knowledge a clear semantics?
  e.g. “I could go to the library”

• How can different levels of abstraction be related?
  spatial: states
  temporal: time scales

• How can we handle stochastic, closed-loop, temporally extended courses of action?

• Use RL/MDPs to provide a theoretical foundation
Outline
• RL and Markov Decision Processes (MDPs)
• Options and Semi-MDPs
• Rooms Example
• Between MDPs and Semi-MDPs
Termination Improvement
Intra-option Learning
Subgoals
RL is Learning from Interaction
[Diagram: agent-environment loop: the Agent sends actions to the Environment, which returns perceptions and rewards]
• complete agent
• temporally situated
• continual learning and planning
• object is to affect environment
• environment is stochastic and uncertain
RL is like Life!
More Formally: Markov Decision Problems (MDPs)

An MDP is defined by <S, A, p, r, γ>:

S – set of states of the environment
A(s) – set of actions possible in state s
p^a_{ss'} – probability of transition from s to s' when executing a
r^a_s – expected reward when executing a in s
γ – discount rate for expected reward

Assumption: discrete time t = 0, 1, 2, ...

Trajectory: s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, ...
The Objective

Find a policy (way of acting) that gets a lot of reward in the long run

Find a policy π : S × A → [0,1] that maximizes the value (expected future reward) for each state:

V^π(s) = E{ r_1 + γ r_2 + ... + γ^{t−1} r_t + ... | s_0 = s }
Q^π(s,a) = E{ r_1 + γ r_2 + ... + γ^{t−1} r_t + ... | s_0 = s, a_0 = a }

V*(s) = max_π V^π(s)
Q*(s,a) = max_π Q^π(s,a)

These are called value functions (cf. evaluation functions)
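As a minimal sketch of what V^π means in practice, the following computes it by iterative policy evaluation for a tiny hypothetical 2-state MDP (the transition table `P` and uniform policy `pi` are illustrative assumptions, not from the talk):

```python
GAMMA = 0.9

# Hypothetical MDP: P[s][a] -> list of (prob, next_state, reward)
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.5, 1: 0.5}}  # uniform random policy

def policy_evaluation(P, pi, gamma=GAMMA, tol=1e-10):
    """Sweep Bellman backups for V^pi until the updates become negligible."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v = sum(pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

V = policy_evaluation(P, pi)
```

Here the policy earns an expected reward of 0.5 per step, so V^π(s) = 0.5 / (1 − γ) = 5 in both states.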
Outline
• RL and Markov Decision Processes (MDPs)
• Options and Semi-MDPs
• Rooms Example
• Between MDPs and Semi-MDPs
Termination Improvement
Intra-option Learning
Subgoals
Options

A generalization of actions to include courses of action

An option is a triple o = <I, π, β>
• I ⊆ S is the set of states in which o may be started
• π : S × A → [0,1] is the policy followed during o
• β : S → [0,1] is the probability of terminating in each state

Option execution is assumed to be call-and-return
Options can take a variable number of steps

Example: docking
π : hand-crafted controller
β : terminate when docked or charger not visible
I : all states in which charger is in sight
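The triple <I, π, β> maps directly onto a data structure. A minimal sketch (the toy chain environment and option below are illustrative assumptions):

```python
import random
from dataclasses import dataclass
from typing import Callable, Set

@dataclass
class Option:
    initiation: Set[int]              # I: states where the option may start
    policy: Callable[[int], int]      # pi: state -> action (deterministic here for brevity)
    beta: Callable[[int], float]      # termination probability in each state

    def can_start(self, s: int) -> bool:
        return s in self.initiation

def run_option(env_step, o: Option, s: int, rng: random.Random):
    """Call-and-return execution: follow o's policy until beta says to stop."""
    assert o.can_start(s)
    total, k = 0.0, 0
    while True:
        s, r = env_step(s, o.policy(s))
        total, k = total + r, k + 1
        if rng.random() < o.beta(s):
            return s, total, k

# Toy usage: a chain environment where every step moves right and pays 1,
# and an option that terminates once state 3 is reached.
rng = random.Random(0)
step = lambda s, a: (s + 1, 1.0)
o = Option({0}, lambda s: 0, lambda s: 1.0 if s >= 3 else 0.0)
s_end, reward, length = run_option(step, o, 0, rng)
```

Note that the option runs for a variable number of steps (here 3) before returning control, which is exactly what makes the induced process semi-Markov.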
Rooms Example

[Figure: 4-room gridworld with hallways connecting the rooms; two candidate goal locations G are marked]

4 rooms, 4 hallways
4 unreliable primitive actions (up, down, left, right), which fail 33% of the time
8 multi-step options (to each room's 2 hallways)

Given a goal location, quickly plan the shortest route
Goal states are given a terminal value of 1; γ = .9; all rewards zero
Options define a Semi-Markov Decision Process (SMDP)

[Figure: state trajectories over time at three levels]
MDP: discrete time, homogeneous discount
SMDP: continuous time, discrete events, interval-dependent discount
Options over MDP: discrete time, overlaid discrete events, interval-dependent discount

A discrete-time SMDP overlaid on an MDP can be analyzed at either level
MDP + Options = SMDP

Theorem: For any MDP, and any set of options, the decision process that chooses among the options, executing each to termination, is an SMDP.

Thus all Bellman equations and DP results extend for value functions over options and models of options (cf. SMDP theory).
What does the SMDP connection give us?

• Policies over options: μ : S × O → [0,1]
• Value functions over options: V^μ(s), Q^μ(s,o), V*_O(s), Q*_O(s,o)
• Learning methods: Bradtke & Duff (1995), Parr (1998)
• Models of options
• Planning methods: e.g. value iteration, policy iteration, Dyna...
• A coherent theory of learning and planning with courses of action at variable time scales, yet at the same level

A theoretical foundation for what we really need!

But the most interesting issues are beyond SMDPs...
Value Functions for Options

Define value functions for options, similar to the MDP case:

V^μ(s) = E{ r_{t+1} + γ r_{t+2} + ... | E(μ, s, t) }
Q^μ(s,o) = E{ r_{t+1} + γ r_{t+2} + ... | E(oμ, s, t) }

Now consider policies μ ∈ Π(O) restricted to choose only from options in O:

V*_O(s) = max_{μ∈Π(O)} V^μ(s)
Q*_O(s,o) = max_{μ∈Π(O)} Q^μ(s,o)
Models of Options

Knowing how an option is executed is not enough for reasoning about it, or planning with it. We need information about its consequences.

The model of the consequences of starting option o in state s has:

• a reward part:
  r_s^o = E{ r_1 + γ r_2 + ... + γ^{k−1} r_k | s_0 = s, o taken in s_0, lasts k steps }

• a next-state part:
  p_{ss'}^o = E{ γ^k δ_{s_k s'} | s_0 = s, o taken in s_0, lasts k steps }
  where δ_{s_k s'} = 1 if s' = s_k is the termination state, 0 otherwise

This form follows from SMDP theory. Such models can be used interchangeably with models of primitive actions in Bellman equations.
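Both parts of an option model can be estimated by simple Monte Carlo averaging over executions. A sketch under toy assumptions (the `execute` callback and the fixed 2-step option are hypothetical, not from the talk):

```python
import random
from collections import defaultdict

GAMMA = 0.9

def estimate_model(execute, s, n_runs, rng):
    """Estimate r_s^o and p_ss'^o from n_runs executions of option o.

    execute(s, rng) -> (list_of_rewards, terminal_state) for one run of o.
    Note the next-state part is a *discounted* termination distribution:
    each run contributes gamma^k to its terminal state's entry.
    """
    r_so = 0.0
    p_so = defaultdict(float)
    for _ in range(n_runs):
        rewards, s_k = execute(s, rng)
        k = len(rewards)
        r_so += sum(GAMMA**i * r for i, r in enumerate(rewards)) / n_runs
        p_so[s_k] += GAMMA**k / n_runs
    return r_so, dict(p_so)

# Toy option: always lasts 2 steps with rewards [1, 1], ends in "hall".
rng = random.Random(0)
fixed = lambda s, rng: ([1.0, 1.0], "hall")
r, p = estimate_model(fixed, "start", 10, rng)
```

For this deterministic toy case, r_s^o = 1 + γ = 1.9 and p converges to γ² = 0.81 on the termination state, matching the definitions above.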
Outline
• RL and Markov Decision Processes (MDPs)
• Options and Semi-MDPs
• Rooms Example
• Between MDPs and Semi-MDPs
Termination Improvement
Intra-option Learning
Subgoals
Rooms Example

[Figure: 4-room gridworld with hallways connecting the rooms; two candidate goal locations G are marked]

4 rooms, 4 hallways
4 unreliable primitive actions (up, down, left, right), which fail 33% of the time
8 multi-step options (to each room's 2 hallways)

Given a goal location, quickly plan the shortest route
Goal states are given a terminal value of 1; γ = .9; all rewards zero
Example: Synchronous Value Iteration Generalized to Options

Initialize: V_0(s) ← 0, ∀s ∈ S

Iterate: V_{k+1}(s) ← max_{o∈O} [ r_s^o + Σ_{s'∈S} p_{ss'}^o V_k(s') ], ∀s ∈ S

The algorithm converges to the optimal value function given the options:
lim_{k→∞} V_k = V*_O

Once V*_O is computed, μ*_O is readily determined.

If O = A, the algorithm reduces to conventional value iteration.
If A ⊆ O, then V*_O = V*.
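This backup is a few lines of code once option models are in hand. A sketch on a hypothetical 3-state corridor with two made-up options ("step" and "sprint"); the models fold the discount into the p-part, as on the Models of Options slide:

```python
GAMMA = 0.9

# model[o][s] = (r_s^o, {s': p_ss'^o}); discounting is inside p (p ~ gamma^k)
model = {
    "step":   {0: (0.0, {1: GAMMA}),     1: (1.0, {2: GAMMA}), 2: (0.0, {2: 0.0})},
    "sprint": {0: (1.0, {2: GAMMA**2}),  1: (1.0, {2: GAMMA}), 2: (0.0, {2: 0.0})},
}

def svi(model, n_iters):
    """Synchronous value iteration over options, using option models."""
    states = {s for m in model.values() for s in m}
    V = {s: 0.0 for s in states}
    for _ in range(n_iters):
        V = {s: max(r + sum(p * V[s2] for s2, p in pr.items())
                    for r, pr in (model[o][s] for o in model))
             for s in states}
    return V

V = svi(model, 50)
```

From state 0, "sprint" reaches the (zero-valued) terminal state directly with reward 1, beating "step"'s discounted 0.9, so V(0) = V(1) = 1 and V(2) = 0.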
Rooms Example

[Figure: value propagation from V(goal) = 1]
Iterations #0, #1, #2 with cell-to-cell primitive actions
Iterations #0, #1, #2 with room-to-room options
Example with Goal ≠ Subgoal
(both primitive actions and options)

[Figure: initial values and Iterations #1 to #5]
What does the SMDP connection give us?

• Policies over options: μ : S × O → [0,1]
• Value functions over options: V^μ(s), Q^μ(s,o), V*_O(s), Q*_O(s,o)
• Learning methods: Bradtke & Duff (1995), Parr (1998)
• Models of options
• Planning methods: e.g. value iteration, policy iteration, Dyna...
• A coherent theory of learning and planning with courses of action at variable time scales, yet at the same level

A theoretical foundation for what we really need!

But the most interesting issues are beyond SMDPs...
Advantages of Dual MDP/SMDP View

At the SMDP level:
Compute value functions and policies over options, with the benefit of increased speed / flexibility

At the MDP level:
Learn how to execute an option for achieving a given goal

Between the MDP and SMDP level:
Improve over existing options (e.g. by terminating early)
Learn about the effects of several options in parallel, without executing them to termination
Outline
• RL and Markov Decision Processes (MDPs)
• Options and Semi-MDPs
• Rooms Example
• Between MDPs and Semi-MDPs
Termination Improvement
Intra-option Learning
Subgoals
Between MDPs and SMDPs

• Termination Improvement
  Improving the value function by changing the termination conditions of options

• Intra-Option Learning
  Learning the values of options in parallel, without executing them to termination
  Learning the models of options in parallel, without executing them to termination

• Tasks and Subgoals
  Learning the policies inside the options
Termination Improvement

Idea: We can do better by sometimes interrupting ongoing options, forcing them to terminate before β says to.

Theorem: For any policy over options μ : S × O → [0,1], suppose we interrupt its options one or more times, when

Q^μ(s,o) < Q^μ(s, μ(s)), where s is the state at that time and o is the ongoing option,

to obtain μ' : S × O' → [0,1]. Then μ' ≥ μ (it attains more or equal reward everywhere).

Application: Suppose we have determined Q*_O and thus μ = μ*_O. Then μ' is guaranteed better than μ*_O, and is available with no additional computation.
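The interruption test itself is a one-line comparison against the Q-values we already have. A sketch with a toy, hand-filled Q-table (the state and option names are illustrative assumptions):

```python
# Hypothetical Q*-values over two options in two states.
Q = {
    ("doorway", "go-north-hall"): 0.5,
    ("doorway", "go-east-hall"):  0.8,
    ("corridor", "go-north-hall"): 0.9,
    ("corridor", "go-east-hall"):  0.4,
}
options = ["go-north-hall", "go-east-hall"]

def should_interrupt(Q, s, o):
    """Terminate the ongoing option o early in s if continuing it is
    worse than the best available option from s."""
    return Q[(s, o)] < max(Q[(s, o2)] for o2 in options)

interrupt_at_doorway = should_interrupt(Q, "doorway", "go-north-hall")
interrupt_in_corridor = should_interrupt(Q, "corridor", "go-north-hall")
```

In the doorway, switching to the east-hall option looks better (0.8 > 0.5), so the ongoing option is interrupted; in the corridor it is not.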
Landmarks Task

[Figure: start S, goal G, landmarks, and the circular range (input set) of each run-to-landmark controller]

Task: navigate from S to G as fast as possible

4 primitive actions, for taking tiny steps up, down, left, right
7 controllers for going straight to each one of the landmarks, from within a circular region where the landmark is visible

In this task, planning at the level of primitive actions is computationally intractable; we need the controllers
Termination Improvement for Landmarks Task

[Figure: paths from S to G; SMDP solution (600 steps) vs. termination-improved solution (474 steps)]

Allowing early termination based on models improves the value function at no additional cost!
Illustration: Reconnaissance Mission Planning (Problem)

• Mission: Fly over (observe) most valuable sites and return to base
• Stochastic weather affects observability (cloudy or clear) of sites
• Limited fuel
• Intractable with classical optimal control methods
• Temporal scales:
  Actions: which direction to fly now
  Options: which site to head for
• Options compress space and time
  Reduce steps from ~600 to ~6
  Reduce states from ~10^11 to ~10^6

Q*_O(s,o) = r_s^o + Σ_{s'} p_{ss'}^o V*_O(s')

(Q*_O is needed at any state, ~10^6 of them; V*_O only at the sites, 6 of them.)

[Figure: map of sites with observation rewards (10, 50, 50, 50, 100, 25, 15, 5, 25, 8), the base, ~100 decision steps per mission, and the mean time between weather changes]
Illustration: Reconnaissance Mission Planning (Results)

• SMDP planner:
  Assumes options followed to completion
  Plans optimal SMDP solution

• SMDP planner with re-evaluation:
  Plans as if options must be followed to completion
  But actually takes them for only one step
  Re-picks a new option on every step

• Static re-planner:
  Assumes weather will not change
  Plans optimal tour among clear sites
  Re-plans whenever weather changes

[Figure: expected reward/mission at low and high fuel for the SMDP planner, the static re-planner, and the SMDP planner with re-evaluation of options on each step]

Temporal abstraction finds a better approximation than the static planner, with little more computation than the SMDP planner
Intra-Option Learning Methods for Markov Options

Idea: take advantage of each fragment of experience

SMDP Q-learning:
• execute option to termination, keeping track of reward along the way
• at the end, update only the option taken, based on reward and value of state in which option terminates

Intra-option Q-learning:
• after each primitive action, update all the options that could have taken that action, based on the reward and the expected value from the next state on

Proven to converge to correct values, under same assumptions as 1-step Q-learning
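A sketch of the intra-option update for one primitive transition (s, a, r, s'). The two toy options and tables are illustrative assumptions; the key point is that every option consistent with the action taken gets updated:

```python
GAMMA, ALPHA = 0.9, 0.5

# Each option: (policy: s -> a, beta: s -> termination probability)
options = {
    "east": (lambda s: "right", lambda s: 1.0 if s == "hall" else 0.0),
    "west": (lambda s: "left",  lambda s: 1.0 if s == "hall" else 0.0),
}
Q = {(s, o): 0.0 for s in ["room", "hall"] for o in options}

def intra_option_update(Q, s, a, r, s2):
    for o, (pi, beta) in options.items():
        if pi(s) != a:
            continue  # option o could not have taken this action in s
        # Target blends continuing o from s' with terminating there
        cont = (1 - beta(s2)) * Q[(s2, o)]
        term = beta(s2) * max(Q[(s2, o2)] for o2 in options)
        Q[(s, o)] += ALPHA * (r + GAMMA * (cont + term) - Q[(s, o)])

# One primitive step: action "right" taken in "room", reward 1, next state "hall"
intra_option_update(Q, "room", "right", 1.0, "hall")
```

Only the "east" option's policy matches the action taken, so only its value is updated, and no option had to run to termination.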
Intra-Option Learning Methods for Markov Options

Idea: take advantage of each fragment of experience

SMDP Learning: execute option to termination, then update only the option taken

Intra-Option Learning: after each primitive action, update all the options that could have taken that action

Proven to converge to correct values, under same assumptions as 1-step Q-learning
Example of Intra-Option Value Learning

Random start, goal in right hallway, random actions

[Figure: learned vs. true option values (upper-hallway and left-hallway options) over 6000 episodes, and the average value of the greedy policy approaching the value of the optimal policy]

Intra-option methods learn correct values without ever taking the options! SMDP methods are not applicable here
Intra-Option Value Learning Is Faster Than SMDP Value Learning

Random start, goal in right hallway, choice from A ∪ H (primitive actions and hallway options), 90% greedy
Intra-Option Model Learning

Random start state, no goal, pick randomly among all options

[Figure: state-prediction error and reward-prediction error (max and average) over 100,000 options executed, comparing intra-option learning, SMDP learning, and SMDP learning with 1/t step sizes]

Intra-option methods work much faster than SMDP methods
Tasks and Subgoals

It is natural to define options as solutions to subtasks
e.g. treat hallways as subgoals, learn shortest paths

We have defined subgoals as pairs <G, g>:
G ⊆ S is the set of states treated as subgoals
g : G → ℜ are their subgoal values (can be both good and bad)

Each subgoal has its own set of value functions, e.g.:

V_g^o(s) = E{ r_1 + γ r_2 + ... + γ^{k−1} r_k + γ^k g(s_k) | s_0 = s, o, s_k ∈ G }
V_g*(s) = max_o V_g^o(s)

Policies inside options can be learned from subgoals, in an intra-option, off-policy manner.
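The subgoal-conditioned return above is the ordinary discounted return of one option run plus γ^k times the subgoal value at the termination state. A minimal sketch (the hallway names and toy numbers are illustrative assumptions):

```python
GAMMA = 0.9

# Hypothetical subgoal values g on the subgoal set G: one good, one bad
g = {"upper-hall": 1.0, "left-hall": -1.0}

def subgoal_return(rewards, s_k, g, gamma=GAMMA):
    """Discounted in-option rewards plus gamma^k * g(s_k) at termination."""
    k = len(rewards)
    disc = sum(gamma**i * r for i, r in enumerate(rewards))
    return disc + gamma**k * g[s_k]

# Two 2-step, zero-reward runs ending at different subgoal states
good = subgoal_return([0.0, 0.0], "upper-hall", g)
bad = subgoal_return([0.0, 0.0], "left-hall", g)
```

With zero in-option rewards, the return reduces to γ² g(s_k) = ±0.81, so an option policy maximizing it steers toward the good subgoal and away from the bad one.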
Options Depend on Outcome Values

[Figure: learned option policies under large (g = 10) vs. small (g = 1) outcome values, with g = 0 elsewhere and small negative rewards on each step: one policy takes shortest paths, the other avoids the negative rewards]
Between MDPs and SMDPs

• Termination Improvement
  Improving the value function by changing the termination conditions of options

• Intra-Option Learning
  Learning the values of options in parallel, without executing them to termination
  Learning the models of options in parallel, without executing them to termination

• Tasks and Subgoals
  Learning the policies inside the options
Summary: Benefits of Options

• Transfer
  Solutions to sub-tasks can be saved and reused
  Domain knowledge can be provided as options and subgoals

• Potentially much faster learning and planning
  By representing action at an appropriate temporal scale

• Models of options are a form of knowledge representation
  Expressive, clear, suitable for learning and planning

• Much more to learn than just one policy, one set of values
  A framework for “constructivism”: for finding models of the world that are useful for rapid planning and learning