Stochastic Dynamic Programming with Factored Representations
Presentation by Dafna Shahaf (Boutilier, Dearden, Goldszmidt 2000)
The Problem
- Standard MDP algorithms require explicit state space enumeration
- Curse of dimensionality
- Need: a compact representation (intuition: STRIPS)
- Need: versions of the standard dynamic programming algorithms for it
A Glimpse of the Future
[Figure: a Policy Tree and a Value Tree]
A Glimpse of the Future: Some Experimental Results
Roadmap
- MDPs: Reminder
- Structured Representation for MDPs: Bayesian Nets, Decision Trees
- Algorithms for the Structured Representation
- Experimental Results
- Extensions
MDPs: Reminder
- An MDP is a tuple $\langle S, A, T, R \rangle$: states, actions, transitions, rewards
- Discounted infinite-horizon criterion
- Stationary policies $\pi: S \to A$ (an action to take at state s)
- Value functions: $V^k(s)$ is the k-stage-to-go value function for $\pi$
Roadmap
- MDPs: Reminder
- Structured Representation for MDPs: Bayesian Nets, Decision Trees
- Algorithms for the Structured Representation
- Experimental Results
- Extensions
Representing MDPs as Bayesian Networks: Coffee World
Variables:
- O: Robot is in office
- W: Robot is wet
- U: Robot has an umbrella
- R: It is raining
- HCR: Robot has coffee
- HCO: Owner has coffee
Actions:
- Go: Switch location
- BuyC: Buy coffee
- DelC: Deliver coffee
- GetU: Get umbrella
The effects of the actions might be noisy, so we need to provide a distribution for each effect.
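To make this concrete, here is a minimal Python sketch (assumed, not the paper's code) of one such CPT encoded as a decision tree over the pre-action state; the nested-tuple encoding and the 0.8 success probability are illustrative placeholders.

```python
# A tree is either a leaf holding Pr(variable = true after the action),
# or a tuple (test_variable, subtree_if_true, subtree_if_false) testing
# the pre-action state.

# Illustrative CPT for HCO under DelC: if the owner already has coffee it
# stays true; otherwise delivery can succeed only if the robot is in the
# office (O) holding coffee (HCR). The 0.8 is an assumed success rate.
delc_cpt_hco = ("HCO", 1.0,
                ("O", ("HCR", 0.8, 0.0),
                      0.0))

def prob_true(tree, state):
    """Walk a CPT tree given a pre-action state (dict: variable -> bool)."""
    while isinstance(tree, tuple):
        var, if_true, if_false = tree
        tree = if_true if state[var] else if_false
    return tree

print(prob_true(delc_cpt_hco, {"HCO": False, "O": True, "HCR": True}))  # 0.8
```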
Representing Actions: DelC
[Figure: the two-slice action network for DelC with its conditional probability tables]
Representing Actions: Interesting Points
- No need to provide a marginal distribution over pre-action variables
- Markov property: we need only the previous state
- For now, no synchronic arcs
- The frame problem?
- A single network vs. a network for each action
- Why decision trees?
Representing Reward
Generally determined by a subset of features.
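The same nested-tuple encoding can represent the reward; a sketch with illustrative values (the slide states only that the reward depends on a subset of the features):

```python
# A reward tree: numeric rewards at the leaves. Only HCO and W appear;
# every other variable is irrelevant to the reward. Values are assumed.
reward_tree = ("HCO", ("W", 0.9, 1.0),
                      ("W", -0.1, 0.0))

def tree_value(tree, state):
    """Evaluate a value/reward tree at a state (dict: variable -> bool)."""
    while isinstance(tree, tuple):
        var, if_true, if_false = tree
        tree = if_true if state[var] else if_false
    return tree

print(tree_value(reward_tree, {"HCO": True, "W": False}))  # 1.0
```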
Policies and Value Functions
[Figure: a Policy Tree and a Value Tree. Internal nodes test features (e.g., HCR=T / HCR=F); leaves hold actions in the policy tree and values in the value tree.]
The optimal choice may depend only on certain variables (given the values of others).
Roadmap
- MDPs: Reminder
- Structured Representation for MDPs: Bayesian Nets, Decision Trees
- Algorithms for the Structured Representation
- Experimental Results
- Extensions
Bellman Backup
Q-Function: the value of performing a in s, given value function v:
$$Q_a^v(s) = R(s) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, v(s')$$

Value Iteration: Reminder
$$V^{k+1}(s) = \max_a Q_a^{V^k}(s)$$
equivalently,
$$V^{k+1}(s) = \max_a \Big\{ R(s) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^k(s') \Big\}$$
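For contrast with the structured algorithm that follows, a standard flat value-iteration sketch over an explicitly enumerated state space (the inputs P, R, and the stopping threshold are hypothetical):

```python
import numpy as np

def flat_value_iteration(P, R, gamma=0.9, eps=1e-6):
    """P: one |S| x |S| transition matrix per action; R: length-|S| rewards."""
    R = np.asarray(R, dtype=float)
    V = np.zeros(len(R))
    while True:
        # Q[a, s] = R(s) + gamma * sum_{s'} Pr(s' | s, a) V(s')
        Q = np.stack([R + gamma * Pa @ V for Pa in P])
        V_new = Q.max(axis=0)              # Bellman backup: max over actions
        if np.abs(V_new - V).max() < eps:  # converged to (near) V*
            return V_new
        V = V_new
```

Every iteration touches every state, which is exactly what the factored version avoids.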
Structured Value Iteration: Overview
Input: Tree(R). Output: Tree($V^*$).
1. Set Tree($V^0$) = Tree(R)
2. Repeat
   (a) Compute Tree($Q_a^{V^k}$) = Regress(Tree($V^k$), a) for each action a
   (b) Merge (via maximization) the trees Tree($Q_a^{V^k}$) to obtain Tree($V^{k+1}$)
   until the termination criterion holds. Return Tree($V^{k+1}$).
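A high-level sketch of this loop in Python; `regress` and `merge_max` are sketched after the corresponding slides below, and `trees_equal`, a structural-equality test, is assumed:

```python
from functools import reduce

def structured_value_iteration(reward_tree, actions, gamma=0.9):
    """All work happens on decision trees, never on enumerated states."""
    v_tree = reward_tree                                   # 1. Tree(V0) = Tree(R)
    while True:
        q_trees = [regress(v_tree, a, gamma, reward_tree)  # 2a. one Q-tree per action
                   for a in actions]
        v_new = reduce(merge_max, q_trees)                 # 2b. merge via maximization
        if trees_equal(v_new, v_tree):                     # termination criterion
            return v_new
        v_tree = v_new
```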
Example World
Step 2a: Calculating Q-Functions
$$Q_a^V(s) = R(s) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V(s')$$
1. Expected future value
2. Discounting the future value
3. Adding the immediate reward
How can we use the structure of the trees?
Tree($Q_a^V$) should distinguish only conditions under which a makes a branch of Tree(V) true with different odds.
Calculating Tree($Q_a^1$):
[Figure: Tree($V^0$)]
[Figure: PTree($Q_a^1$)] Finding the conditions under which a will have distinct expected value with respect to $V^0$.
[Figure: FVTree($Q_a^1$)] The undiscounted expected future value of performing action a with one stage to go; each leaf is computed as in $1 \cdot 10 + 0 \cdot 0 = 10$.
[Figure: Tree($Q_a^1$)] Discounting FVTree (by 0.9) and adding the immediate reward function.
An Alternative View (a more complicated example)
[Figure: Tree($V^1$); partial PTree($Q_a^2$); unsimplified PTree($Q_a^2$); PTree($Q_a^2$); FVTree($Q_a^2$); Tree($Q_a^2$)]
The Algorithm: Regress
Input: Tree(V), action a. Output: Tree($Q_a^V$).
1. PTree($Q_a^V$) = PRegress(Tree(V), a) (simplified)
2. Construct FVTree($Q_a^V$): for each branch b of PTree, with leaf node l(b),
   (a) $\Pr_b$ = the product of the individual distributions from l(b)
   (b) $v_b = \sum_{b' \in \mathrm{Tree}(V)} \Pr_b(b')\, V(b')$
   (c) re-label leaf l(b) with $v_b$.
3. Discount FVTree($Q_a^V$) with $\gamma$, append Tree(R)
4. Return FVTree($Q_a^V$)
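A sketch of Regress in the same encoding. PTree leaves are taken to be dicts mapping each variable tested in Tree(V) to its probability of being true (which the real PRegress guarantees by construction); `pregress` is sketched after the next slide, and `add_tree`, a leafwise sum of two trees, is assumed:

```python
def branches(tree, path=None):
    """Yield (assignment, leaf_value) for every branch of a tree."""
    path = path or {}
    if not isinstance(tree, tuple):
        yield path, tree
        return
    var, if_true, if_false = tree
    yield from branches(if_true, {**path, var: True})
    yield from branches(if_false, {**path, var: False})

def expected_value(dist, v_tree):
    """Step 2(b): v_b = sum over branches b' of Tree(V) of Pr_b(b') * V(b')."""
    total = 0.0
    for assignment, value in branches(v_tree):
        p = 1.0
        for var, val in assignment.items():
            p *= dist[var] if val else 1.0 - dist[var]
        total += p * value
    return total

def map_leaves(tree, fn):
    """Apply fn to every leaf of a tree, keeping the internal structure."""
    if not isinstance(tree, tuple):
        return fn(tree)
    var, if_true, if_false = tree
    return (var, map_leaves(if_true, fn), map_leaves(if_false, fn))

def regress(v_tree, action, gamma, reward_tree):
    p_tree = pregress(v_tree, action)                      # step 1
    fv_tree = map_leaves(p_tree,                           # step 2: FVTree
                         lambda dist: expected_value(dist, v_tree))
    discounted = map_leaves(fv_tree, lambda v: gamma * v)  # step 3: discount...
    return add_tree(discounted, reward_tree)               # ...and append Tree(R)
```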
The Algorithm: PRegress
Input: Tree(V), action a. Output: PTree($Q_a^V$).
1. If Tree(V) is a single node, return the empty tree
2. X = the variable at the root of Tree(V); $T^P_X$ = the tree for CPT(X) (label the leaves with X)
3. $T^V_{X=t}, T^V_{X=f}$ = the subtrees of Tree(V) for X=t and X=f
4. $T^P_{X=t}, T^P_{X=f}$ = the results of calling PRegress on $T^V_{X=t}, T^V_{X=f}$
5. For each leaf l in $T^P_X$, add $T^P_{X=t}$, $T^P_{X=f}$, or both, according to the distribution at l (use union to combine labels)
6. Return $T^P_X$
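A rough sketch of PRegress, the subtlest step. `cpt_tree(X, a)` is an assumed lookup returning the decision tree for CPT(X) under action a, with leaves of the form `{X: Pr(X = true)}`; `map_leaves` is from the Regress sketch above. Note this simplifies the slide's label-union step: when both subtrees are grafted and they mention the same variable, a real implementation must combine the distributions rather than overwrite.

```python
def is_leaf(tree):
    return not isinstance(tree, tuple)

def graft(tree, subtree):
    """Append `subtree` below every leaf of `tree`, unioning leaf labels."""
    if is_leaf(tree):
        if is_leaf(subtree):
            return {**tree, **subtree}    # union of the label dictionaries
        var, if_true, if_false = subtree
        return (var, graft(tree, if_true), graft(tree, if_false))
    var, if_true, if_false = tree
    return (var, graft(if_true, subtree), graft(if_false, subtree))

def pregress(v_tree, action):
    if is_leaf(v_tree):                    # 1. single node: empty tree
        return {}
    x, v_true, v_false = v_tree            # 2. root variable X...
    p_tree = cpt_tree(x, action)           #    ...and the tree for CPT(X)
    pt_true = pregress(v_true, action)     # 3-4. recurse on the X=t and
    pt_false = pregress(v_false, action)   #      X=f subtrees of Tree(V)
    def extend(leaf):                      # 5. add the subtrees each CPT
        grown, p = leaf, leaf[x]           #    leaf can reach: t, f, or both
        if p > 0.0:
            grown = graft(grown, pt_true)
        if p < 1.0:
            grown = graft(grown, pt_false)
        return grown
    return map_leaves(p_tree, extend)      # 6. return the extended tree
```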
Step 2b: Maximization
Merge the trees Tree($Q_a^{V^k}$) by taking, at each leaf, the maximum over the actions, yielding Tree($V^{k+1}$). Value iteration is now complete.
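A sketch of the maximization merge for two value trees in this encoding. It is naive: the result may contain redundant tests that a real implementation would prune, but the leafwise maximum is computed correctly; the per-action Q-trees are folded together with `functools.reduce(merge_max, q_trees)` as in the loop above.

```python
def restrict(tree, var, val):
    """Specialize a tree to the assignment var = val."""
    if not isinstance(tree, tuple):
        return tree
    v, if_true, if_false = tree
    if v == var:
        return restrict(if_true if val else if_false, var, val)
    return (v, restrict(if_true, var, val), restrict(if_false, var, val))

def merge_max(t1, t2):
    """Leafwise maximum of two value trees."""
    if not isinstance(t1, tuple) and not isinstance(t2, tuple):
        return max(t1, t2)                 # two leaves: keep the larger value
    if not isinstance(t1, tuple):
        t1, t2 = t2, t1                    # ensure t1 has a test at its root
    var, if_true, if_false = t1
    return (var,
            merge_max(if_true, restrict(t2, var, True)),
            merge_max(if_false, restrict(t2, var, False)))
```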
Roadmap
- MDPs: Reminder
- Structured Representation for MDPs: Bayesian Nets, Decision Trees
- Algorithms for the Structured Representation
- Experimental Results
- Extensions
Experimental Results
[Figures: worst-case and best-case examples]
Roadmap
- MDPs: Reminder
- Structured Representation for MDPs: Bayesian Nets, Decision Trees
- Algorithms for the Structured Representation
- Experimental Results
- Extensions
Extensions
- Synchronic edges
- POMDPs
- Rewards
- Approximation
Questions?
Backup slides
Here be dragons.
Regression through a Policy
Improving Policies: Example
Maximization Step, Improved Policy