Markov Decision Process (MDP)
1
Markov Decision Process (MDP)
Ruti Glick, Bar-Ilan University
2
Policy
A policy is similar to a plan: it is generated ahead of time.
Unlike traditional plans, it is not a sequence of actions that the agent must execute.
If there are failures in execution, the agent can continue to follow the policy.
A policy prescribes an action for every state.
It maximizes expected reward, rather than just reaching a goal state.
3
Utility and Policy
Utility: compute for every state "What is the utility of this state for the overall task?"
Policy: a complete mapping from states to actions, answering "In which state should I perform which action?" (policy: state → action)
4
The Optimal Policy
If we know the utilities, we can easily compute the optimal policy. The problem is to compute the correct utilities for all states.

$\pi(s) = \arg\max_a \sum_{s'} T(s, a, s')\, U(s')$

T(s, a, s') = probability of reaching state s' from state s by taking action a
U(s') = utility of state s'
5
Finding π
Value iteration
Policy iteration
6
Value iteration: process
Calculate the utility of each state.
Use the values to select an optimal action.
7
Bellman Equation
$U(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, U(s')$

For example, for state (1,1) of the grid world shown below:
U(1,1) = -0.04 + γ max{ 0.8 U(1,2) + 0.1 U(2,1) + 0.1 U(1,1),   (Up)
                        0.9 U(1,1) + 0.1 U(1,2),                (Left)
                        0.9 U(1,1) + 0.1 U(2,1),                (Down)
                        0.8 U(2,1) + 0.1 U(1,2) + 0.1 U(1,1) }  (Right)
[Figure: the 4x3 grid world, with a START state, terminal states +1 and -1, and the four actions Up, Left, Down, Right]
8
Bellman Equation: properties
$U(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, U(s')$
n equations, one for each state; n variables.
Problem: max is not a linear operator, so these are non-linear equations.
Solution: an iterative approach.
9
Value iteration algorithm
Start with arbitrary initial values for the utilities.
Update the utility of each state from its neighbors:

$U_{i+1}(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\, U_i(s')$

This iteration step is called a Bellman update.
Repeat until convergence.
10
Value Iteration properties
The equilibrium is a unique solution. One can prove that the value iteration process converges. We don't need the exact values.
11
Convergence: value iteration is a contraction
A contraction is a function of one argument that, when applied to two inputs, produces values that are "closer together". It has only one fixed point, and each application brings the value closer to that fixed point.
We are not going to prove the last point: that the iteration converges to the correct values.
12
Value Iteration Algorithm
function VALUE-ITERATION(mdp) returns a utility function
  inputs: mdp, an MDP with states S, transition model T,
          reward function R, discount γ
  local variables: U, U', vectors of utilities for the states in S,
          initially identical to R
  repeat
    U ← U'
    for each state s in S do
      U'[s] ← R[s] + γ max_a Σ_{s'} T(s, a, s') U[s']
  until CLOSE-ENOUGH(U, U')
  return U
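A minimal Python sketch of the same algorithm (the data layout, with T[s][a] as a list of (probability, next_state) pairs and a set of terminal states whose utilities stay fixed at their rewards, is an assumption made for illustration, not the lecture's own code):

# Sketch of value iteration; data layout is an assumption.
# T[s][a]: list of (probability, next_state) pairs; R[s]: reward of state s.
def value_iteration(S, A, T, R, terminals, gamma=1.0, eps=1e-4):
    U = {s: R[s] for s in S}                  # utilities start out identical to R
    while True:
        U_new, delta = {}, 0.0
        for s in S:
            if s in terminals:                # goal states keep their reward
                U_new[s] = R[s]
            else:                             # Bellman update
                U_new[s] = R[s] + gamma * max(
                    sum(p * U[s2] for p, s2 in T[s][a]) for a in A)
            delta = max(delta, abs(U_new[s] - U[s]))
        U = U_new
        if delta < eps:                       # "close enough" test
            return U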
13
Example
A small version of our main example: a 2x2 world. The agent starts in (1,1). States (2,1) and (2,2) are goal states. If blocked by a wall, the agent stays in place. The rewards are written on the board:

(1,2): R = -0.04    (2,2): R = +1
(1,1): R = -0.04    (2,1): R = -1
14
Example (cont.): first iteration

U(1,1) = R(1,1) + γ max{ 0.8 U(1,2) + 0.1 U(1,1) + 0.1 U(2,1),   (Up)
                         0.9 U(1,1) + 0.1 U(1,2),                (Left)
                         0.9 U(1,1) + 0.1 U(2,1),                (Down)
                         0.8 U(2,1) + 0.1 U(1,1) + 0.1 U(1,2) }  (Right)
  = -0.04 + 1 × max{ 0.8×(-0.04) + 0.1×(-0.04) + 0.1×(-1),
                     0.9×(-0.04) + 0.1×(-0.04),
                     0.9×(-0.04) + 0.1×(-1),
                     0.8×(-1) + 0.1×(-0.04) + 0.1×(-0.04) }
  = -0.04 + max{ -0.136, -0.04, -0.136, -0.808 } = -0.08

U(1,2) = R(1,2) + γ max{ 0.9 U(1,2) + 0.1 U(2,2),                (Up)
                         0.9 U(1,2) + 0.1 U(1,1),                (Left)
                         0.8 U(1,1) + 0.1 U(2,2) + 0.1 U(1,2),   (Down)
                         0.8 U(2,2) + 0.1 U(1,2) + 0.1 U(1,1) }  (Right)
  = -0.04 + 1 × max{ 0.9×(-0.04) + 0.1×1,
                     0.9×(-0.04) + 0.1×(-0.04),
                     0.8×(-0.04) + 0.1×1 + 0.1×(-0.04),
                     0.8×1 + 0.1×(-0.04) + 0.1×(-0.04) }
  = -0.04 + max{ 0.064, -0.04, 0.064, 0.792 } = 0.752

The goal states remain unchanged.
Board at the start of the iteration (utilities initialized to the rewards):
(1,2): U = -0.04, R = -0.04    (2,2): U = +1, R = +1
(1,1): U = -0.04, R = -0.04    (2,1): U = -1, R = -1
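These numbers can be checked mechanically. The sketch below encodes the 2x2 world's transition model as reconstructed from the expressions above (0.8 in the intended direction, 0.1 to each perpendicular side, staying in place when a wall blocks the move; this encoding is an assumption, not the lecture's code) and applies one Bellman update:

S = ['11', '12', '21', '22']            # (2,2) = +1 and (2,1) = -1 are goal states
R = {'11': -0.04, '12': -0.04, '21': -1.0, '22': 1.0}
A = ['Up', 'Left', 'Down', 'Right']
T = {                                    # (probability, next_state) per action
    '11': {'Up':    [(0.8, '12'), (0.1, '11'), (0.1, '21')],
           'Left':  [(0.9, '11'), (0.1, '12')],
           'Down':  [(0.9, '11'), (0.1, '21')],
           'Right': [(0.8, '21'), (0.1, '11'), (0.1, '12')]},
    '12': {'Up':    [(0.9, '12'), (0.1, '22')],
           'Left':  [(0.9, '12'), (0.1, '11')],
           'Down':  [(0.8, '11'), (0.1, '22'), (0.1, '12')],
           'Right': [(0.8, '22'), (0.1, '12'), (0.1, '11')]},
}
U = dict(R)                              # utilities initialized to the rewards
for s in ['11', '12']:
    u = R[s] + max(sum(p * U[s2] for p, s2 in T[s][a]) for a in A)
    print(s, round(u, 4))                # prints: 11 -0.08  and  12 0.752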
15
Example (cont.): second iteration (same four action expressions as in the first iteration)

U(1,1) = -0.04 + 1 × max{ 0.8×0.752 + 0.1×(-0.08) + 0.1×(-1),
                          0.9×(-0.08) + 0.1×0.752,
                          0.9×(-0.08) + 0.1×(-1),
                          0.8×(-1) + 0.1×(-0.08) + 0.1×0.752 }
  = -0.04 + max{ 0.4936, 0.0032, -0.172, -0.7328 } = 0.4536

U(1,2) = -0.04 + 1 × max{ 0.9×0.752 + 0.1×1,
                          0.9×0.752 + 0.1×(-0.08),
                          0.8×(-0.08) + 0.1×1 + 0.1×0.752,
                          0.8×1 + 0.1×0.752 + 0.1×(-0.08) }
  = -0.04 + max{ 0.7768, 0.6688, 0.1112, 0.8672 } = 0.8272
Board after the first iteration:
(1,2): U = 0.752, R = -0.04    (2,2): U = +1, R = +1
(1,1): U = -0.08, R = -0.04    (2,1): U = -1, R = -1
16
Example (cont.): third iteration

U(1,1) = -0.04 + 1 × max{ 0.8×0.8272 + 0.1×0.4536 + 0.1×(-1),
                          0.9×0.4536 + 0.1×0.8272,
                          0.9×0.4536 + 0.1×(-1),
                          0.8×(-1) + 0.1×0.4536 + 0.1×0.8272 }
  = -0.04 + max{ 0.6071, 0.491, 0.3082, -0.6719 } = 0.5676

U(1,2) = -0.04 + 1 × max{ 0.9×0.8272 + 0.1×1,
                          0.9×0.8272 + 0.1×0.4536,
                          0.8×0.4536 + 0.1×1 + 0.1×0.8272,
                          0.8×1 + 0.1×0.8272 + 0.1×0.4536 }
  = -0.04 + max{ 0.8444, 0.7898, 0.5456, 0.9281 } = 0.8881
Board after the second iteration:
(1,2): U = 0.8272, R = -0.04    (2,2): U = +1, R = +1
(1,1): U = 0.4536, R = -0.04    (2,1): U = -1, R = -1
17
Example (cont.): continue to the next iteration…
Finish when "close enough". Here the last change was 0.114, which is close enough.

Board after the third iteration:
(1,2): U = 0.8881, R = -0.04    (2,2): U = +1, R = +1
(1,1): U = 0.5676, R = -0.04    (2,1): U = -1, R = -1
18
"Close enough"
We will not go deeply into this issue; there are different ways to detect convergence.
RMS error: the root-mean-square error of the utility values compared with the correct values. Require RMS(U, U') < ε, where ε is the maximum error allowed in the utility of any state in an iteration:
$\mathrm{RMS}(U, U') = \sqrt{\dfrac{1}{|S|} \sum_{i=1}^{|S|} \big(U(i) - U'(i)\big)^2}$
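A direct transcription of this measure as a sketch, applied to two successive utility vectors (assuming U and U_new are dictionaries over the same states, as in the earlier sketch):

from math import sqrt

def rms(U, U_new):
    # Root-mean-square difference between two utility vectors.
    return sqrt(sum((U[s] - U_new[s]) ** 2 for s in U) / len(U))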
19
"Close enough" (cont.)
Policy loss: the difference between the expected utility obtained by following the computed policy and the expected utility obtained by the optimal policy. Stop when

$\| U_{i+1} - U_i \| < \epsilon (1-\gamma) / \gamma$

where $\|U\| = \max_s |U(s)|$,
ε is the maximum error allowed in the utility of any state in an iteration, and
γ is the discount factor.
20
Finding the policy
Once the true utilities have been found, we search for the optimal policy:

for each s in S do
  π[s] ← argmax_a Σ_{s'} T(s, a, s') U(s')
return π
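A sketch in Python, reusing the conventions of the earlier value-iteration sketch (an illustrative helper, not the lecture's own code):

def best_policy(S, A, T, U, terminals):
    # Greedy policy extraction: in each non-goal state, pick the action
    # with the highest expected utility of the successor states.
    pi = {}
    for s in S:
        if s in terminals:
            continue
        pi[s] = max(A, key=lambda a: sum(p * U[s2] for p, s2 in T[s][a]))
    return pi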
21
Example (cont.): find the optimal policy

π(1,1) = argmax_a { 0.8 U(1,2) + 0.1 U(1,1) + 0.1 U(2,1),   (Up)
                    0.9 U(1,1) + 0.1 U(1,2),                (Left)
                    0.9 U(1,1) + 0.1 U(2,1),                (Down)
                    0.8 U(2,1) + 0.1 U(1,1) + 0.1 U(1,2) }  (Right)
  = argmax_a { 0.8×0.8881 + 0.1×0.5676 + 0.1×(-1),
               0.9×0.5676 + 0.1×0.8881,
               0.9×0.5676 + 0.1×(-1),
               0.8×(-1) + 0.1×0.5676 + 0.1×0.8881 }
  = argmax_a { 0.6672, 0.5996, 0.4108, -0.6512 } = Up

π(1,2) = argmax_a { 0.9 U(1,2) + 0.1 U(2,2),                (Up)
                    0.9 U(1,2) + 0.1 U(1,1),                (Left)
                    0.8 U(1,1) + 0.1 U(2,2) + 0.1 U(1,2),   (Down)
                    0.8 U(2,2) + 0.1 U(1,2) + 0.1 U(1,1) }  (Right)
  = argmax_a { 0.9×0.8881 + 0.1×1,
               0.9×0.8881 + 0.1×0.5676,
               0.8×0.5676 + 0.1×1 + 0.1×0.8881,
               0.8×1 + 0.1×0.8881 + 0.1×0.5676 }
  = argmax_a { 0.8993, 0.8561, 0.6429, 0.9456 } = Right
Final board:
(1,2): U = 0.8881, R = -0.04    (2,2): U = +1, R = +1
(1,1): U = 0.5676, R = -0.04    (2,1): U = -1, R = -1
22
Summary – value iteration
[Figure: four panels. 1: the given environment (the grid world with terminal states +1 and -1). 2: calculate utilities (values such as 0.912, 0.868, 0.812, 0.762, 0.660, 0.705, 0.655, 0.611, 0.388). 3: extract the optimal policy. 4: execute actions.]
23
Example – convergence as a function of the error allowed
24
Policy iteration
Pick a policy, then calculate the utility of each state given that policy (the value iteration step).
Update the policy at each state using the utilities of the successor states.
Repeat until the policy stabilizes.
25
Policy iteration: for each state in each step
Policy evaluation: given policy π_i, calculate the utility U_i of each state if π_i were to be executed.
Policy improvement: calculate a new policy π_{i+1} based on the utilities U_i:

$\pi_{i+1}[s] \leftarrow \arg\max_a \sum_{s'} T(s, a, s')\, U_i(s')$
26
Policy iteration algorithm

function POLICY-ITERATION(mdp) returns a policy
  inputs: mdp, an MDP with states S, transition model T
  local variables: U, a vector of utilities for the states in S, initially identical to R
                   π, a policy vector indexed by states, initially random
  repeat
    U ← POLICY-EVALUATION(π, U, mdp)
    unchanged ← true
    for each state s in S do
      if max_a Σ_{s'} T(s, a, s') U[s'] > Σ_{s'} T(s, π[s], s') U[s'] then
        π[s] ← argmax_a Σ_{s'} T(s, a, s') U[s']
        unchanged ← false
  until unchanged
  return π
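A corresponding Python sketch follows; here POLICY-EVALUATION is realized with the simplified iterative update from the "Simplified policy iteration" slide below, using a fixed number of sweeps (k=20, an assumption), whereas the worked example that follows instead solves the linear equations exactly:

import random

def policy_evaluation(pi, U, S, T, R, terminals, gamma=1.0, k=20):
    # Simplified value iteration for a fixed policy (no max over actions).
    for _ in range(k):
        for s in S:
            if s in terminals:
                U[s] = R[s]
            else:
                U[s] = R[s] + gamma * sum(p * U[s2] for p, s2 in T[s][pi[s]])
    return U

def policy_iteration(S, A, T, R, terminals, gamma=1.0):
    pi = {s: random.choice(A) for s in S if s not in terminals}
    U = dict(R)                                   # utilities start at the rewards
    while True:
        U = policy_evaluation(pi, U, S, T, R, terminals, gamma)
        unchanged = True
        for s in S:
            if s in terminals:
                continue
            q = {a: sum(p * U[s2] for p, s2 in T[s][a]) for a in A}
            best = max(q, key=q.get)
            if q[best] > q[pi[s]]:                # strictly better action found
                pi[s] = best
                unchanged = False
        if unchanged:
            return pi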
27
Example
Back to our last example… the 2x2 world. The agent starts in (1,1); states (2,1) and (2,2) are goal states; if blocked by a wall, the agent stays in place; the rewards are written on the board. Initial policy: Up (for every state).

(1,2): R = -0.04    (2,2): R = +1
(1,1): R = -0.04    (2,1): R = -1
28
Example (cont.): first iteration – policy evaluation

Under the current policy (Up everywhere):
U(1,1) = R(1,1) + γ × (0.8 U(1,2) + 0.1 U(1,1) + 0.1 U(2,1))
U(1,2) = R(1,2) + γ × (0.9 U(1,2) + 0.1 U(2,2))
U(2,1) = R(2,1)
U(2,2) = R(2,2)

With γ = 1:
U(1,1) = -0.04 + 0.8 U(1,2) + 0.1 U(1,1) + 0.1 U(2,1)
U(1,2) = -0.04 + 0.9 U(1,2) + 0.1 U(2,2)
U(2,1) = -1
U(2,2) = 1

Rearranged as a linear system:
 0.04 = -0.9 U(1,1) + 0.8 U(1,2) + 0.1 U(2,1) + 0 U(2,2)
 0.04 =  0 U(1,1) - 0.1 U(1,2) + 0 U(2,1) + 0.1 U(2,2)
 -1   =  0 U(1,1) + 0 U(1,2) + 1 U(2,1) + 0 U(2,2)
  1   =  0 U(1,1) + 0 U(1,2) + 0 U(2,1) + 1 U(2,2)

Solution: U(1,1) = 0.3778, U(1,2) = 0.6, U(2,1) = -1, U(2,2) = 1.
Board and policy before the update:
(1,2): U = -0.04, R = -0.04    (2,2): U = +1, R = +1
(1,1): U = -0.04, R = -0.04    (2,1): U = -1, R = -1
Policy: π(1,1) = Up, π(1,2) = Up

In matrix form:
$$\begin{pmatrix} -0.9 & 0.8 & 0.1 & 0 \\ 0 & -0.1 & 0 & 0.1 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} U(1,1) \\ U(1,2) \\ U(2,1) \\ U(2,2) \end{pmatrix} = \begin{pmatrix} 0.04 \\ 0.04 \\ -1 \\ 1 \end{pmatrix}$$
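Because the system is linear, it can be solved directly; a small sketch with NumPy, using the variable ordering U(1,1), U(1,2), U(2,1), U(2,2) from the matrix above:

import numpy as np

A = np.array([[-0.9,  0.8, 0.1, 0.0],    # 0.04 = -0.9 U11 + 0.8 U12 + 0.1 U21
              [ 0.0, -0.1, 0.0, 0.1],    # 0.04 = -0.1 U12 + 0.1 U22
              [ 0.0,  0.0, 1.0, 0.0],    # -1   =  U21
              [ 0.0,  0.0, 0.0, 1.0]])   #  1   =  U22
b = np.array([0.04, 0.04, -1.0, 1.0])
print(np.linalg.solve(A, b))             # -> [0.3778, 0.6, -1.0, 1.0] (approx.)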
29
Example (cont.): first iteration – policy improvement

π(1,1) = argmax_a { 0.8 U(1,2) + 0.1 U(1,1) + 0.1 U(2,1),   (Up)
                    0.9 U(1,1) + 0.1 U(1,2),                (Left)
                    0.9 U(1,1) + 0.1 U(2,1),                (Down)
                    0.8 U(2,1) + 0.1 U(1,1) + 0.1 U(1,2) }  (Right)
  = argmax_a { 0.8×0.6 + 0.1×0.3778 + 0.1×(-1),
               0.9×0.3778 + 0.1×0.6,
               0.9×0.3778 + 0.1×(-1),
               0.8×(-1) + 0.1×0.3778 + 0.1×0.6 }
  = argmax_a { 0.4178, 0.4, 0.24, -0.7022 } = Up → no update needed

π(1,2) = argmax_a { 0.9 U(1,2) + 0.1 U(2,2),                (Up)
                    0.9 U(1,2) + 0.1 U(1,1),                (Left)
                    0.8 U(1,1) + 0.1 U(2,2) + 0.1 U(1,2),   (Down)
                    0.8 U(2,2) + 0.1 U(1,2) + 0.1 U(1,1) }  (Right)
  = argmax_a { 0.9×0.6 + 0.1×1,
               0.9×0.6 + 0.1×0.3778,
               0.8×0.3778 + 0.1×1 + 0.1×0.6,
               0.8×1 + 0.1×0.6 + 0.1×0.3778 }
  = argmax_a { 0.64, 0.5778, 0.4622, 0.8978 } = Right → update
Board and policy:
(1,2): U = 0.6, R = -0.04       (2,2): U = +1, R = +1
(1,1): U = 0.3778, R = -0.04    (2,1): U = -1, R = -1
Policy: π(1,1) = Up, π(1,2) = Up
30
Example (cont.): second iteration – policy evaluation

Under the updated policy (π(1,1) = Up, π(1,2) = Right):
U(1,1) = R(1,1) + γ × (0.8 U(1,2) + 0.1 U(1,1) + 0.1 U(2,1))
U(1,2) = R(1,2) + γ × (0.1 U(1,2) + 0.8 U(2,2) + 0.1 U(1,1))
U(2,1) = R(2,1)
U(2,2) = R(2,2)

With γ = 1:
U(1,1) = -0.04 + 0.8 U(1,2) + 0.1 U(1,1) + 0.1 U(2,1)
U(1,2) = -0.04 + 0.1 U(1,2) + 0.8 U(2,2) + 0.1 U(1,1)
U(2,1) = -1
U(2,2) = 1

Rearranged:
 0.04 = -0.9 U(1,1) + 0.8 U(1,2) + 0.1 U(2,1) + 0 U(2,2)
 0.04 =  0.1 U(1,1) - 0.9 U(1,2) + 0 U(2,1) + 0.8 U(2,2)
 -1   =  1 U(2,1)
  1   =  1 U(2,2)

Solution: U(1,1) = 0.5413, U(1,2) = 0.7843, U(2,1) = -1, U(2,2) = 1.
Board and policy before the update:
(1,2): U = 0.6, R = -0.04       (2,2): U = +1, R = +1
(1,1): U = 0.3778, R = -0.04    (2,1): U = -1, R = -1
Policy: π(1,1) = Up, π(1,2) = Right

In matrix form:
$$\begin{pmatrix} -0.9 & 0.8 & 0.1 & 0 \\ 0.1 & -0.9 & 0 & 0.8 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} U(1,1) \\ U(1,2) \\ U(2,1) \\ U(2,2) \end{pmatrix} = \begin{pmatrix} 0.04 \\ 0.04 \\ -1 \\ 1 \end{pmatrix}$$
31
Example (cont.): second iteration – policy improvement (same four action expressions as before)

π(1,1) = argmax_a { 0.8×0.7843 + 0.1×0.5413 + 0.1×(-1),     (Up)
                    0.9×0.5413 + 0.1×0.7843,                (Left)
                    0.9×0.5413 + 0.1×(-1),                  (Down)
                    0.8×(-1) + 0.1×0.5413 + 0.1×0.7843 }    (Right)
  = argmax_a { 0.5816, 0.5656, 0.3871, -0.6674 } = Up → no update needed

π(1,2) = argmax_a { 0.9×0.7843 + 0.1×1,                     (Up)
                    0.9×0.7843 + 0.1×0.5413,                (Left)
                    0.8×0.5413 + 0.1×1 + 0.1×0.7843,        (Down)
                    0.8×1 + 0.1×0.7843 + 0.1×0.5413 }       (Right)
  = argmax_a { 0.8059, 0.76, 0.6115, 0.9326 } = Right → no update needed
Board and policy:
(1,2): U = 0.7843, R = -0.04    (2,2): U = +1, R = +1
(1,1): U = 0.5413, R = -0.04    (2,1): U = -1, R = -1
Policy: π(1,1) = Up, π(1,2) = Right
32
Example (cont.)
No change in the policy was found, so we finish.
The optimal policy: π(1,1) = Up, π(1,2) = Right.
Policy iteration must terminate, since the number of distinct policies is finite.
33
Simplified policy iteration
We can focus on a subset of the states, finding their utilities by simplified value iteration:

$U_{i+1}(s) = R(s) + \gamma \sum_{s'} T(s, \pi(s), s')\, U_i(s')$

or by policy improvement.
Guaranteed to converge under certain conditions on the initial policy and utility values.
34
Policy iteration properties
Linear equations: easy to solve.
Fast convergence in practice.
Proved to be optimal.
35
Value vs Policy Iteration
Which to use? Policy iteration is more expensive per iteration, but in practice it requires fewer iterations.
π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π
27
ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)
R=-004
R=+1
R=-004
R=-1
28
Example (cont)First iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 03778U(12) = 06U(21) = -1U(22) = 1
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1Policy
Π(11) = Up
Π(12) = Up09 08 01 0 (11) 004
0 01 0 01 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
29
Example (cont)First iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)
09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)
= argmaxa 04178 04 024 -07022 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)
= argmaxa 064 05778 04622 08978 = Right update
U= 06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Up
30
Example (cont)Second iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1
U=06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right09 08 01 0 (11) 004
01 09 0 08 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
31
Example (cont)Second iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)
09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)
= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)
= argmaxa 08059 076 06115 09326 = Right donrsquot have to update
U=07843R=-004
U=+1R=+1
U=05413R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right
32
Example (cont)
No change in the policy has found finish
The optimal policyπ(11) = Upπ(12) = Right
Policy iteration must terminate since policyrsquos number is finite
33
Simplify Policy iteration
Can focus of subset of stateFind utility by simplified value iteration
Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))
ORPolicy Improvement
Guaranteed to converge under certain conditions on initial polity and utility values
34
Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal
35
Value vs Policy Iteration
Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations
7
Bellman Equation
Bellman EquationU(s) = R(s) + γmaxa sum (T(s a srsquo) U(srsquo))
For exampleU(11) = -004+γmax 08U(12)+01U(21)+01U(11)
09U(11)+01U(12)09U(11)+01U(21)08U(21)+01U(12)+01U(11)
START
+1
-1
Up
Left
Down
Right
8
Bellman Equationproperties
U(s) = R(s) + γ maxa sum (T(s a srsquo) U(srsquo))n equations One for each stepn vaiables
ProblemOperator max is not a linear operatorNon-linear equations
SolutionIterative approach
9
Value iteration algorithm
Initial arbitrary values for utilities Update utility of each state from itrsquos neighbors
Ui+1(s) R(s) + γ maxa sum (T(s a srsquo) Ui(srsquo))
Iteration step called Bellman update
Repeat till converges
10
Value Iteration properties
This equilibrium is a unique solution Can prove that the value iteration process convergesDonrsquot need exact values
11
convergence Value iteration is contraction
Function of one argumentWhen applied on to inputs produces value that are ldquocloser togetherHave only one fixed pointWhen applied the value must be closer to fixed point
Wersquoll not going to prove last pointconverge to correct value
12
Value Iteration Algorithm
function VALUE_ITERATION (mdp) returns a utility functioninput mdp MDP with states S transition model T
reward function R discount γ local variables U Ursquo vectors of utilities for states in S
initially identical to R
repeat U Ursquo for each state s in S do
Ursquo[s] R[s] + γmaxa srsquo T(s a srsquo)U[srsquo]until close-enough(UUrsquo)return U
13
ExampleSmall version of our main example2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the board
R=-004 R=+1
R=-004 R=-1
14
Example (cont)First iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(-004) + 01x(-004) + 01x(-1)09x(-004) + 01x(-004)09x(-004) + 01x(-1)08x(-1) + 01x(-004) + 01x(-004)
=-004 + max -0136 -004 -0136 -0808 =-008
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(-004)+ 01x109x(-004) + 01x(-004)08x(-004) + 01x1 + 01x(-004)08x1 + 01x(-004) + 01x(-004)
=-004 + max 0064 -004 0064 0792 =0752
Goal states remain the same
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1
15
Example (cont)Second iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(0752) + 01x(-008) + 01x(-1)09x(-008) + 01x(0752)09x(-008) + 01x(-1)08x(-1) + 01x(-008) + 01x(0752)
=-004 + max 04936 00032 -0172 -03728 =04536
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(0752)+ 01x109x(0752) + 01x(-008)08x(-008) + 01x1 + 01x(0752)08x1 + 01x(0752) + 01x(-008)
=-004 + max 07768 06688 01112 08672 = 08272
U=0752
R=-004
U=+1R=+1
U=-008R=-004
U=-1R=-1
16
Example (cont)Third iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(08272) + 01x(04536) + 01x(-1)09x(04536) + 01x(08272)09x(04536) + 01x(-1)08x(-1) + 01x(04536) + 01x(08272)
=-004 + max 06071 0491 03082 -06719 =05676
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(08272)+ 01x109x(08272) + 01x(04536)08x(04536) + 01x1 + 01x(08272)08x1 + 01x(08272) + 01x(04536)
=-004 + max 08444 07898 05456 09281 = 08881
U=08272
R=-004
U=+1R=+1
U= 04536
R=-004
U=-1R=-1
17
Example (cont)Continue to next iterationhellip
Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough
U= 08881
R=-004
U=+1R=+1
U=-05676
R=-004
U=-1R=-1
18
ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence
RMS error ndashroot mean square error of the utility value compare to the correct values
demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of
any state in an iteration
| | 2
1
1( ( ) ( ))
| |
S
iRMS U i U i
S
19
ldquoclose enoughrdquo (cont)
Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ
When ||U|| = maxa |U(s)|
ε ndash maximum error allowed in utility of any state in an iteration
γ ndash the discount factor
20
Finding the policy
True utilities have foundedNew search for the optimal policy
For each s in S doπ[s] argmaxa sumsrsquo T(s a
srsquo)U(srsquo)Return π
21
Example (cont)Find the optimal police
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)
09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)
= argmaxa 06672 05996 04108 -06512 = Up
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)
= argmaxa 08993 08561 06429 09456 = Right
U= 08881R=-004
U=+1R=+1
U=-05676R=-004
U=-1R=-1
22
+1
-1
0705 0655 0611 0388
0762 0660
091208680812+1
-1
+1
-1
+1
-1
3 Extract optimal policy 4 Execute actions
1 The given environment 2 Calculate utilities
Summery ndash value iteration
23
Example - convergence Error allowed
24
Policy iteration
picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize
25
Policy iterationFor each state in each step
Policy evaluationGiven policy πi
Calculate the utility Ui of each state if π were to be execute
Policy improvementCalculate new policy πi+1
Based on πi
π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)
26
Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy
input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R
π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π
27
ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)
R=-004
R=+1
R=-004
R=-1
28
Example (cont)First iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 03778U(12) = 06U(21) = -1U(22) = 1
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1Policy
Π(11) = Up
Π(12) = Up09 08 01 0 (11) 004
0 01 0 01 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
29
Example (cont)First iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)
09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)
= argmaxa 04178 04 024 -07022 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)
= argmaxa 064 05778 04622 08978 = Right update
U= 06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Up
30
Example (cont)Second iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1
U=06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right09 08 01 0 (11) 004
01 09 0 08 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
31
Example (cont)Second iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)
09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)
= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)
= argmaxa 08059 076 06115 09326 = Right donrsquot have to update
U=07843R=-004
U=+1R=+1
U=05413R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right
32
Example (cont)
No change in the policy has found finish
The optimal policyπ(11) = Upπ(12) = Right
Policy iteration must terminate since policyrsquos number is finite
33
Simplify Policy iteration
Can focus of subset of stateFind utility by simplified value iteration
Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))
ORPolicy Improvement
Guaranteed to converge under certain conditions on initial polity and utility values
34
Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal
35
Value vs Policy Iteration
Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations
8
Bellman Equationproperties
U(s) = R(s) + γ maxa sum (T(s a srsquo) U(srsquo))n equations One for each stepn vaiables
ProblemOperator max is not a linear operatorNon-linear equations
SolutionIterative approach
9
Value iteration algorithm
Initial arbitrary values for utilities Update utility of each state from itrsquos neighbors
Ui+1(s) R(s) + γ maxa sum (T(s a srsquo) Ui(srsquo))
Iteration step called Bellman update
Repeat till converges
10
Value Iteration properties
This equilibrium is a unique solution Can prove that the value iteration process convergesDonrsquot need exact values
11
convergence Value iteration is contraction
Function of one argumentWhen applied on to inputs produces value that are ldquocloser togetherHave only one fixed pointWhen applied the value must be closer to fixed point
Wersquoll not going to prove last pointconverge to correct value
12
Value Iteration Algorithm
function VALUE_ITERATION (mdp) returns a utility functioninput mdp MDP with states S transition model T
reward function R discount γ local variables U Ursquo vectors of utilities for states in S
initially identical to R
repeat U Ursquo for each state s in S do
Ursquo[s] R[s] + γmaxa srsquo T(s a srsquo)U[srsquo]until close-enough(UUrsquo)return U
13
ExampleSmall version of our main example2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the board
R=-004 R=+1
R=-004 R=-1
14
Example (cont)First iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(-004) + 01x(-004) + 01x(-1)09x(-004) + 01x(-004)09x(-004) + 01x(-1)08x(-1) + 01x(-004) + 01x(-004)
=-004 + max -0136 -004 -0136 -0808 =-008
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(-004)+ 01x109x(-004) + 01x(-004)08x(-004) + 01x1 + 01x(-004)08x1 + 01x(-004) + 01x(-004)
=-004 + max 0064 -004 0064 0792 =0752
Goal states remain the same
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1
15
Example (cont)Second iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(0752) + 01x(-008) + 01x(-1)09x(-008) + 01x(0752)09x(-008) + 01x(-1)08x(-1) + 01x(-008) + 01x(0752)
=-004 + max 04936 00032 -0172 -03728 =04536
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(0752)+ 01x109x(0752) + 01x(-008)08x(-008) + 01x1 + 01x(0752)08x1 + 01x(0752) + 01x(-008)
=-004 + max 07768 06688 01112 08672 = 08272
U=0752
R=-004
U=+1R=+1
U=-008R=-004
U=-1R=-1
16
Example (cont)Third iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(08272) + 01x(04536) + 01x(-1)09x(04536) + 01x(08272)09x(04536) + 01x(-1)08x(-1) + 01x(04536) + 01x(08272)
=-004 + max 06071 0491 03082 -06719 =05676
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(08272)+ 01x109x(08272) + 01x(04536)08x(04536) + 01x1 + 01x(08272)08x1 + 01x(08272) + 01x(04536)
=-004 + max 08444 07898 05456 09281 = 08881
U=08272
R=-004
U=+1R=+1
U= 04536
R=-004
U=-1R=-1
17
Example (cont)Continue to next iterationhellip
Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough
U= 08881
R=-004
U=+1R=+1
U=-05676
R=-004
U=-1R=-1
18
ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence
RMS error ndashroot mean square error of the utility value compare to the correct values
demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of
any state in an iteration
| | 2
1
1( ( ) ( ))
| |
S
iRMS U i U i
S
19
ldquoclose enoughrdquo (cont)
Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ
When ||U|| = maxa |U(s)|
ε ndash maximum error allowed in utility of any state in an iteration
γ ndash the discount factor
20
Finding the policy
True utilities have foundedNew search for the optimal policy
For each s in S doπ[s] argmaxa sumsrsquo T(s a
srsquo)U(srsquo)Return π
21
Example (cont)Find the optimal police
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)
09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)
= argmaxa 06672 05996 04108 -06512 = Up
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)
= argmaxa 08993 08561 06429 09456 = Right
U= 08881R=-004
U=+1R=+1
U=-05676R=-004
U=-1R=-1
22
+1
-1
0705 0655 0611 0388
0762 0660
091208680812+1
-1
+1
-1
+1
-1
3 Extract optimal policy 4 Execute actions
1 The given environment 2 Calculate utilities
Summery ndash value iteration
23
Example - convergence Error allowed
24
Policy iteration
picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize
25
Policy iterationFor each state in each step
Policy evaluationGiven policy πi
Calculate the utility Ui of each state if π were to be execute
Policy improvementCalculate new policy πi+1
Based on πi
π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)
26
Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy
input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R
π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π
27
ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)
R=-004
R=+1
R=-004
R=-1
28
Example (cont)First iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 03778U(12) = 06U(21) = -1U(22) = 1
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1Policy
Π(11) = Up
Π(12) = Up09 08 01 0 (11) 004
0 01 0 01 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
29
Example (cont)First iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)
09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)
= argmaxa 04178 04 024 -07022 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)
= argmaxa 064 05778 04622 08978 = Right update
U= 06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Up
30
Example (cont)Second iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1
U=06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right09 08 01 0 (11) 004
01 09 0 08 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
31
Example (cont)Second iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)
09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)
= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)
= argmaxa 08059 076 06115 09326 = Right donrsquot have to update
U=07843R=-004
U=+1R=+1
U=05413R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right
32
Example (cont)
No change in the policy has found finish
The optimal policyπ(11) = Upπ(12) = Right
Policy iteration must terminate since policyrsquos number is finite
33
Simplify Policy iteration
Can focus of subset of stateFind utility by simplified value iteration
Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))
ORPolicy Improvement
Guaranteed to converge under certain conditions on initial polity and utility values
34
Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal
35
Value vs Policy Iteration
Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations
9
Value iteration algorithm
Initial arbitrary values for utilities Update utility of each state from itrsquos neighbors
Ui+1(s) R(s) + γ maxa sum (T(s a srsquo) Ui(srsquo))
Iteration step called Bellman update
Repeat till converges
10
Value Iteration properties
This equilibrium is a unique solution Can prove that the value iteration process convergesDonrsquot need exact values
11
convergence Value iteration is contraction
Function of one argumentWhen applied on to inputs produces value that are ldquocloser togetherHave only one fixed pointWhen applied the value must be closer to fixed point
Wersquoll not going to prove last pointconverge to correct value
12
Value Iteration Algorithm
function VALUE_ITERATION (mdp) returns a utility functioninput mdp MDP with states S transition model T
reward function R discount γ local variables U Ursquo vectors of utilities for states in S
initially identical to R
repeat U Ursquo for each state s in S do
Ursquo[s] R[s] + γmaxa srsquo T(s a srsquo)U[srsquo]until close-enough(UUrsquo)return U
13
ExampleSmall version of our main example2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the board
R=-004 R=+1
R=-004 R=-1
14
Example (cont)First iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(-004) + 01x(-004) + 01x(-1)09x(-004) + 01x(-004)09x(-004) + 01x(-1)08x(-1) + 01x(-004) + 01x(-004)
=-004 + max -0136 -004 -0136 -0808 =-008
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(-004)+ 01x109x(-004) + 01x(-004)08x(-004) + 01x1 + 01x(-004)08x1 + 01x(-004) + 01x(-004)
=-004 + max 0064 -004 0064 0792 =0752
Goal states remain the same
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1
15
Example (cont)Second iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(0752) + 01x(-008) + 01x(-1)09x(-008) + 01x(0752)09x(-008) + 01x(-1)08x(-1) + 01x(-008) + 01x(0752)
=-004 + max 04936 00032 -0172 -03728 =04536
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(0752)+ 01x109x(0752) + 01x(-008)08x(-008) + 01x1 + 01x(0752)08x1 + 01x(0752) + 01x(-008)
=-004 + max 07768 06688 01112 08672 = 08272
U=0752
R=-004
U=+1R=+1
U=-008R=-004
U=-1R=-1
16
Example (cont)Third iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(08272) + 01x(04536) + 01x(-1)09x(04536) + 01x(08272)09x(04536) + 01x(-1)08x(-1) + 01x(04536) + 01x(08272)
=-004 + max 06071 0491 03082 -06719 =05676
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(08272)+ 01x109x(08272) + 01x(04536)08x(04536) + 01x1 + 01x(08272)08x1 + 01x(08272) + 01x(04536)
=-004 + max 08444 07898 05456 09281 = 08881
U=08272
R=-004
U=+1R=+1
U= 04536
R=-004
U=-1R=-1
17
Example (cont)Continue to next iterationhellip
Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough
U= 08881
R=-004
U=+1R=+1
U=-05676
R=-004
U=-1R=-1
18
ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence
RMS error ndashroot mean square error of the utility value compare to the correct values
demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of
any state in an iteration
| | 2
1
1( ( ) ( ))
| |
S
iRMS U i U i
S
19
ldquoclose enoughrdquo (cont)
Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ
When ||U|| = maxa |U(s)|
ε ndash maximum error allowed in utility of any state in an iteration
γ ndash the discount factor
20
Finding the policy
True utilities have foundedNew search for the optimal policy
For each s in S doπ[s] argmaxa sumsrsquo T(s a
srsquo)U(srsquo)Return π
21
Example (cont)Find the optimal police
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)
09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)
= argmaxa 06672 05996 04108 -06512 = Up
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)
= argmaxa 08993 08561 06429 09456 = Right
U= 08881R=-004
U=+1R=+1
U=-05676R=-004
U=-1R=-1
22
+1
-1
0705 0655 0611 0388
0762 0660
091208680812+1
-1
+1
-1
+1
-1
3 Extract optimal policy 4 Execute actions
1 The given environment 2 Calculate utilities
Summery ndash value iteration
23
Example - convergence Error allowed
24
Policy iteration
picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize
25
Policy iterationFor each state in each step
Policy evaluationGiven policy πi
Calculate the utility Ui of each state if π were to be execute
Policy improvementCalculate new policy πi+1
Based on πi
π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)
26
Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy
input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R
π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π
27
ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)
R=-004
R=+1
R=-004
R=-1
28
Example (cont)First iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 03778U(12) = 06U(21) = -1U(22) = 1
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1Policy
Π(11) = Up
Π(12) = Up09 08 01 0 (11) 004
0 01 0 01 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
29
Example (cont)First iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)
09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)
= argmaxa 04178 04 024 -07022 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)
= argmaxa 064 05778 04622 08978 = Right update
U= 06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Up
30
Example (cont)Second iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1
U=06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right09 08 01 0 (11) 004
01 09 0 08 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
31
Example (cont)Second iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)
09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)
= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)
= argmaxa 08059 076 06115 09326 = Right donrsquot have to update
U=07843R=-004
U=+1R=+1
U=05413R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right
32
Example (cont)
No change in the policy has found finish
The optimal policyπ(11) = Upπ(12) = Right
Policy iteration must terminate since policyrsquos number is finite
33
Simplify Policy iteration
Can focus of subset of stateFind utility by simplified value iteration
Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))
ORPolicy Improvement
Guaranteed to converge under certain conditions on initial polity and utility values
34
Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal
35
Value vs Policy Iteration
Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations
10
Value Iteration properties
This equilibrium is a unique solution Can prove that the value iteration process convergesDonrsquot need exact values
11
convergence Value iteration is contraction
Function of one argumentWhen applied on to inputs produces value that are ldquocloser togetherHave only one fixed pointWhen applied the value must be closer to fixed point
Wersquoll not going to prove last pointconverge to correct value
12
Value Iteration Algorithm
function VALUE_ITERATION (mdp) returns a utility functioninput mdp MDP with states S transition model T
reward function R discount γ local variables U Ursquo vectors of utilities for states in S
initially identical to R
repeat U Ursquo for each state s in S do
Ursquo[s] R[s] + γmaxa srsquo T(s a srsquo)U[srsquo]until close-enough(UUrsquo)return U
13
ExampleSmall version of our main example2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the board
R=-004 R=+1
R=-004 R=-1
14
Example (cont)First iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(-004) + 01x(-004) + 01x(-1)09x(-004) + 01x(-004)09x(-004) + 01x(-1)08x(-1) + 01x(-004) + 01x(-004)
=-004 + max -0136 -004 -0136 -0808 =-008
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(-004)+ 01x109x(-004) + 01x(-004)08x(-004) + 01x1 + 01x(-004)08x1 + 01x(-004) + 01x(-004)
=-004 + max 0064 -004 0064 0792 =0752
Goal states remain the same
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1
15
Example (cont)Second iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(0752) + 01x(-008) + 01x(-1)09x(-008) + 01x(0752)09x(-008) + 01x(-1)08x(-1) + 01x(-008) + 01x(0752)
=-004 + max 04936 00032 -0172 -03728 =04536
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(0752)+ 01x109x(0752) + 01x(-008)08x(-008) + 01x1 + 01x(0752)08x1 + 01x(0752) + 01x(-008)
=-004 + max 07768 06688 01112 08672 = 08272
U=0752
R=-004
U=+1R=+1
U=-008R=-004
U=-1R=-1
16
Example (cont)Third iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(08272) + 01x(04536) + 01x(-1)09x(04536) + 01x(08272)09x(04536) + 01x(-1)08x(-1) + 01x(04536) + 01x(08272)
=-004 + max 06071 0491 03082 -06719 =05676
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(08272)+ 01x109x(08272) + 01x(04536)08x(04536) + 01x1 + 01x(08272)08x1 + 01x(08272) + 01x(04536)
=-004 + max 08444 07898 05456 09281 = 08881
U=08272
R=-004
U=+1R=+1
U= 04536
R=-004
U=-1R=-1
17
Example (cont)Continue to next iterationhellip
Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough
U= 08881
R=-004
U=+1R=+1
U=-05676
R=-004
U=-1R=-1
18
ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence
RMS error ndashroot mean square error of the utility value compare to the correct values
demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of
any state in an iteration
| | 2
1
1( ( ) ( ))
| |
S
iRMS U i U i
S
19
ldquoclose enoughrdquo (cont)
Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ
When ||U|| = maxa |U(s)|
ε ndash maximum error allowed in utility of any state in an iteration
γ ndash the discount factor
20
Finding the policy
True utilities have foundedNew search for the optimal policy
For each s in S doπ[s] argmaxa sumsrsquo T(s a
srsquo)U(srsquo)Return π
21
Example (cont)Find the optimal police
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)
09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)
= argmaxa 06672 05996 04108 -06512 = Up
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)
= argmaxa 08993 08561 06429 09456 = Right
U= 08881R=-004
U=+1R=+1
U=-05676R=-004
U=-1R=-1
22
+1
-1
0705 0655 0611 0388
0762 0660
091208680812+1
-1
+1
-1
+1
-1
3 Extract optimal policy 4 Execute actions
1 The given environment 2 Calculate utilities
Summery ndash value iteration
23
Example - convergence Error allowed
24
Policy iteration
picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize
25
Policy iterationFor each state in each step
Policy evaluationGiven policy πi
Calculate the utility Ui of each state if π were to be execute
Policy improvementCalculate new policy πi+1
Based on πi
π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)
26
Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy
input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R
π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π
27
ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)
R=-004
R=+1
R=-004
R=-1
28
Example (cont)First iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 03778U(12) = 06U(21) = -1U(22) = 1
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1Policy
Π(11) = Up
Π(12) = Up09 08 01 0 (11) 004
0 01 0 01 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
29
Example (cont)First iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)
09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)
= argmaxa 04178 04 024 -07022 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)
= argmaxa 064 05778 04622 08978 = Right update
U= 06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Up
30
Example (cont)Second iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1
U=06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right09 08 01 0 (11) 004
01 09 0 08 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
31
Example (cont)Second iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)
09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)
= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)
= argmaxa 08059 076 06115 09326 = Right donrsquot have to update
U=07843R=-004
U=+1R=+1
U=05413R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right
32
Example (cont)
No change in the policy has found finish
The optimal policyπ(11) = Upπ(12) = Right
Policy iteration must terminate since policyrsquos number is finite
33
Simplify Policy iteration
Can focus of subset of stateFind utility by simplified value iteration
Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))
ORPolicy Improvement
Guaranteed to converge under certain conditions on initial polity and utility values
34
Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal
35
Value vs Policy Iteration
Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations
11
convergence Value iteration is contraction
Function of one argumentWhen applied on to inputs produces value that are ldquocloser togetherHave only one fixed pointWhen applied the value must be closer to fixed point
Wersquoll not going to prove last pointconverge to correct value
12
Value Iteration Algorithm
function VALUE_ITERATION (mdp) returns a utility functioninput mdp MDP with states S transition model T
reward function R discount γ local variables U Ursquo vectors of utilities for states in S
initially identical to R
repeat U Ursquo for each state s in S do
Ursquo[s] R[s] + γmaxa srsquo T(s a srsquo)U[srsquo]until close-enough(UUrsquo)return U
13
ExampleSmall version of our main example2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the board
R=-004 R=+1
R=-004 R=-1
14
Example (cont)First iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(-004) + 01x(-004) + 01x(-1)09x(-004) + 01x(-004)09x(-004) + 01x(-1)08x(-1) + 01x(-004) + 01x(-004)
=-004 + max -0136 -004 -0136 -0808 =-008
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(-004)+ 01x109x(-004) + 01x(-004)08x(-004) + 01x1 + 01x(-004)08x1 + 01x(-004) + 01x(-004)
=-004 + max 0064 -004 0064 0792 =0752
Goal states remain the same
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1
15
Example (cont)Second iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(0752) + 01x(-008) + 01x(-1)09x(-008) + 01x(0752)09x(-008) + 01x(-1)08x(-1) + 01x(-008) + 01x(0752)
=-004 + max 04936 00032 -0172 -03728 =04536
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(0752)+ 01x109x(0752) + 01x(-008)08x(-008) + 01x1 + 01x(0752)08x1 + 01x(0752) + 01x(-008)
=-004 + max 07768 06688 01112 08672 = 08272
U=0752
R=-004
U=+1R=+1
U=-008R=-004
U=-1R=-1
16
Example (cont)Third iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(08272) + 01x(04536) + 01x(-1)09x(04536) + 01x(08272)09x(04536) + 01x(-1)08x(-1) + 01x(04536) + 01x(08272)
=-004 + max 06071 0491 03082 -06719 =05676
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(08272)+ 01x109x(08272) + 01x(04536)08x(04536) + 01x1 + 01x(08272)08x1 + 01x(08272) + 01x(04536)
=-004 + max 08444 07898 05456 09281 = 08881
U=08272
R=-004
U=+1R=+1
U= 04536
R=-004
U=-1R=-1
17
Example (cont)Continue to next iterationhellip
Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough
U= 08881
R=-004
U=+1R=+1
U=-05676
R=-004
U=-1R=-1
18
ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence
RMS error ndashroot mean square error of the utility value compare to the correct values
demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of
any state in an iteration
| | 2
1
1( ( ) ( ))
| |
S
iRMS U i U i
S
19
ldquoclose enoughrdquo (cont)
Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ
When ||U|| = maxa |U(s)|
ε ndash maximum error allowed in utility of any state in an iteration
γ ndash the discount factor
20
Finding the policy
True utilities have foundedNew search for the optimal policy
For each s in S doπ[s] argmaxa sumsrsquo T(s a
srsquo)U(srsquo)Return π
21
Example (cont)Find the optimal police
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)
09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)
= argmaxa 06672 05996 04108 -06512 = Up
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)
= argmaxa 08993 08561 06429 09456 = Right
U= 08881R=-004
U=+1R=+1
U=-05676R=-004
U=-1R=-1
22
+1
-1
0705 0655 0611 0388
0762 0660
091208680812+1
-1
+1
-1
+1
-1
3 Extract optimal policy 4 Execute actions
1 The given environment 2 Calculate utilities
Summery ndash value iteration
23
Example - convergence Error allowed
24
Policy iteration
picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize
25
Policy iterationFor each state in each step
Policy evaluationGiven policy πi
Calculate the utility Ui of each state if π were to be execute
Policy improvementCalculate new policy πi+1
Based on πi
π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)
26
Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy
input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R
π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π
27
ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)
R=-004
R=+1
R=-004
R=-1
28
Example (cont)First iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 03778U(12) = 06U(21) = -1U(22) = 1
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1Policy
Π(11) = Up
Π(12) = Up09 08 01 0 (11) 004
0 01 0 01 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
29
Example (cont)First iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)
09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)
= argmaxa 04178 04 024 -07022 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)
= argmaxa 064 05778 04622 08978 = Right update
U= 06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Up
30
Example (cont)Second iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1
U=06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right09 08 01 0 (11) 004
01 09 0 08 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
31
Example (cont)Second iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)
09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)
= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)
= argmaxa 08059 076 06115 09326 = Right donrsquot have to update
U=07843R=-004
U=+1R=+1
U=05413R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right
32
Example (cont)
No change in the policy has found finish
The optimal policyπ(11) = Upπ(12) = Right
Policy iteration must terminate since policyrsquos number is finite
33
Simplify Policy iteration
Can focus of subset of stateFind utility by simplified value iteration
Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))
ORPolicy Improvement
Guaranteed to converge under certain conditions on initial polity and utility values
34
Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal
35
Value vs Policy Iteration
Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations
12
Value Iteration Algorithm
function VALUE_ITERATION (mdp) returns a utility functioninput mdp MDP with states S transition model T
reward function R discount γ local variables U Ursquo vectors of utilities for states in S
initially identical to R
repeat U Ursquo for each state s in S do
Ursquo[s] R[s] + γmaxa srsquo T(s a srsquo)U[srsquo]until close-enough(UUrsquo)return U
13
ExampleSmall version of our main example2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the board
R=-004 R=+1
R=-004 R=-1
14
Example (cont)First iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(-004) + 01x(-004) + 01x(-1)09x(-004) + 01x(-004)09x(-004) + 01x(-1)08x(-1) + 01x(-004) + 01x(-004)
=-004 + max -0136 -004 -0136 -0808 =-008
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(-004)+ 01x109x(-004) + 01x(-004)08x(-004) + 01x1 + 01x(-004)08x1 + 01x(-004) + 01x(-004)
=-004 + max 0064 -004 0064 0792 =0752
Goal states remain the same
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1
15
Example (cont)Second iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(0752) + 01x(-008) + 01x(-1)09x(-008) + 01x(0752)09x(-008) + 01x(-1)08x(-1) + 01x(-008) + 01x(0752)
=-004 + max 04936 00032 -0172 -03728 =04536
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(0752)+ 01x109x(0752) + 01x(-008)08x(-008) + 01x1 + 01x(0752)08x1 + 01x(0752) + 01x(-008)
=-004 + max 07768 06688 01112 08672 = 08272
U=0752
R=-004
U=+1R=+1
U=-008R=-004
U=-1R=-1
16
Example (cont)Third iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(08272) + 01x(04536) + 01x(-1)09x(04536) + 01x(08272)09x(04536) + 01x(-1)08x(-1) + 01x(04536) + 01x(08272)
=-004 + max 06071 0491 03082 -06719 =05676
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(08272)+ 01x109x(08272) + 01x(04536)08x(04536) + 01x1 + 01x(08272)08x1 + 01x(08272) + 01x(04536)
=-004 + max 08444 07898 05456 09281 = 08881
U=08272
R=-004
U=+1R=+1
U= 04536
R=-004
U=-1R=-1
17
Example (cont)Continue to next iterationhellip
Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough
U= 08881
R=-004
U=+1R=+1
U=-05676
R=-004
U=-1R=-1
18
ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence
RMS error ndashroot mean square error of the utility value compare to the correct values
demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of
any state in an iteration
| | 2
1
1( ( ) ( ))
| |
S
iRMS U i U i
S
19
ldquoclose enoughrdquo (cont)
Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ
When ||U|| = maxa |U(s)|
ε ndash maximum error allowed in utility of any state in an iteration
γ ndash the discount factor
20
Finding the policy
True utilities have foundedNew search for the optimal policy
For each s in S doπ[s] argmaxa sumsrsquo T(s a
srsquo)U(srsquo)Return π
21
Example (cont)Find the optimal police
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)
09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)
= argmaxa 06672 05996 04108 -06512 = Up
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)
= argmaxa 08993 08561 06429 09456 = Right
U= 08881R=-004
U=+1R=+1
U=-05676R=-004
U=-1R=-1
22
+1
-1
0705 0655 0611 0388
0762 0660
091208680812+1
-1
+1
-1
+1
-1
3 Extract optimal policy 4 Execute actions
1 The given environment 2 Calculate utilities
Summery ndash value iteration
23
Example - convergence Error allowed
24
Policy iteration
picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize
25
Policy iterationFor each state in each step
Policy evaluationGiven policy πi
Calculate the utility Ui of each state if π were to be execute
Policy improvementCalculate new policy πi+1
Based on πi
π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)
26
Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy
input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R
π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π
27
ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)
R=-004
R=+1
R=-004
R=-1
28
Example (cont)First iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 03778U(12) = 06U(21) = -1U(22) = 1
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1Policy
Π(11) = Up
Π(12) = Up09 08 01 0 (11) 004
0 01 0 01 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
29
Example (cont)First iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)
09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)
= argmaxa 04178 04 024 -07022 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)
= argmaxa 064 05778 04622 08978 = Right update
U= 06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Up
30
Example (cont)Second iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1
U=06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right09 08 01 0 (11) 004
01 09 0 08 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
31
Example (cont)Second iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)
09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)
= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)
= argmaxa 08059 076 06115 09326 = Right donrsquot have to update
U=07843R=-004
U=+1R=+1
U=05413R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right
32
Example (cont)
No change in the policy has found finish
The optimal policyπ(11) = Upπ(12) = Right
Policy iteration must terminate since policyrsquos number is finite
33
Simplify Policy iteration
Can focus of subset of stateFind utility by simplified value iteration
Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))
ORPolicy Improvement
Guaranteed to converge under certain conditions on initial polity and utility values
34
Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal
35
Value vs Policy Iteration
Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations
13
ExampleSmall version of our main example2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the board
R=-004 R=+1
R=-004 R=-1
14
Example (cont)First iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(-004) + 01x(-004) + 01x(-1)09x(-004) + 01x(-004)09x(-004) + 01x(-1)08x(-1) + 01x(-004) + 01x(-004)
=-004 + max -0136 -004 -0136 -0808 =-008
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(-004)+ 01x109x(-004) + 01x(-004)08x(-004) + 01x1 + 01x(-004)08x1 + 01x(-004) + 01x(-004)
=-004 + max 0064 -004 0064 0792 =0752
Goal states remain the same
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1
15
Example (cont)Second iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(0752) + 01x(-008) + 01x(-1)09x(-008) + 01x(0752)09x(-008) + 01x(-1)08x(-1) + 01x(-008) + 01x(0752)
=-004 + max 04936 00032 -0172 -03728 =04536
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(0752)+ 01x109x(0752) + 01x(-008)08x(-008) + 01x1 + 01x(0752)08x1 + 01x(0752) + 01x(-008)
=-004 + max 07768 06688 01112 08672 = 08272
U=0752
R=-004
U=+1R=+1
U=-008R=-004
U=-1R=-1
16
Example (cont)Third iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(08272) + 01x(04536) + 01x(-1)09x(04536) + 01x(08272)09x(04536) + 01x(-1)08x(-1) + 01x(04536) + 01x(08272)
=-004 + max 06071 0491 03082 -06719 =05676
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(08272)+ 01x109x(08272) + 01x(04536)08x(04536) + 01x1 + 01x(08272)08x1 + 01x(08272) + 01x(04536)
=-004 + max 08444 07898 05456 09281 = 08881
U=08272
R=-004
U=+1R=+1
U= 04536
R=-004
U=-1R=-1
17
Example (cont)Continue to next iterationhellip
Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough
U= 08881
R=-004
U=+1R=+1
U=-05676
R=-004
U=-1R=-1
18
ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence
RMS error ndashroot mean square error of the utility value compare to the correct values
demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of
any state in an iteration
| | 2
1
1( ( ) ( ))
| |
S
iRMS U i U i
S
19
ldquoclose enoughrdquo (cont)
Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ
When ||U|| = maxa |U(s)|
ε ndash maximum error allowed in utility of any state in an iteration
γ ndash the discount factor
20
Finding the policy
True utilities have foundedNew search for the optimal policy
For each s in S doπ[s] argmaxa sumsrsquo T(s a
srsquo)U(srsquo)Return π
21
Example (cont)Find the optimal police
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)
09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)
= argmaxa 06672 05996 04108 -06512 = Up
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)
= argmaxa 08993 08561 06429 09456 = Right
U= 08881R=-004
U=+1R=+1
U=-05676R=-004
U=-1R=-1
22
+1
-1
0705 0655 0611 0388
0762 0660
091208680812+1
-1
+1
-1
+1
-1
3 Extract optimal policy 4 Execute actions
1 The given environment 2 Calculate utilities
Summery ndash value iteration
23
Example - convergence Error allowed
24
Policy iteration
picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize
25
Policy iterationFor each state in each step
Policy evaluationGiven policy πi
Calculate the utility Ui of each state if π were to be execute
Policy improvementCalculate new policy πi+1
Based on πi
π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)
26
Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy
input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R
π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π
27
ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)
R=-004
R=+1
R=-004
R=-1
28
Example (cont)First iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 03778U(12) = 06U(21) = -1U(22) = 1
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1Policy
Π(11) = Up
Π(12) = Up09 08 01 0 (11) 004
0 01 0 01 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
29
Example (cont)First iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)
09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)
= argmaxa 04178 04 024 -07022 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)
= argmaxa 064 05778 04622 08978 = Right update
U= 06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Up
30
Example (cont)Second iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1
U=06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right09 08 01 0 (11) 004
01 09 0 08 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
31
Example (cont)Second iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)
09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)
= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)
= argmaxa 08059 076 06115 09326 = Right donrsquot have to update
U=07843R=-004
U=+1R=+1
U=05413R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right
32
Example (cont)
No change in the policy has found finish
The optimal policyπ(11) = Upπ(12) = Right
Policy iteration must terminate since policyrsquos number is finite
33
Simplify Policy iteration
Can focus of subset of stateFind utility by simplified value iteration
Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))
ORPolicy Improvement
Guaranteed to converge under certain conditions on initial polity and utility values
34
Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal
35
Value vs Policy Iteration
Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations
14
Example (cont)First iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(-004) + 01x(-004) + 01x(-1)09x(-004) + 01x(-004)09x(-004) + 01x(-1)08x(-1) + 01x(-004) + 01x(-004)
=-004 + max -0136 -004 -0136 -0808 =-008
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(-004)+ 01x109x(-004) + 01x(-004)08x(-004) + 01x1 + 01x(-004)08x1 + 01x(-004) + 01x(-004)
=-004 + max 0064 -004 0064 0792 =0752
Goal states remain the same
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1
15
Example (cont)Second iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(0752) + 01x(-008) + 01x(-1)09x(-008) + 01x(0752)09x(-008) + 01x(-1)08x(-1) + 01x(-008) + 01x(0752)
=-004 + max 04936 00032 -0172 -03728 =04536
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(0752)+ 01x109x(0752) + 01x(-008)08x(-008) + 01x1 + 01x(0752)08x1 + 01x(0752) + 01x(-008)
=-004 + max 07768 06688 01112 08672 = 08272
U=0752
R=-004
U=+1R=+1
U=-008R=-004
U=-1R=-1
16
Example (cont)Third iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(08272) + 01x(04536) + 01x(-1)09x(04536) + 01x(08272)09x(04536) + 01x(-1)08x(-1) + 01x(04536) + 01x(08272)
=-004 + max 06071 0491 03082 -06719 =05676
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(08272)+ 01x109x(08272) + 01x(04536)08x(04536) + 01x1 + 01x(08272)08x1 + 01x(08272) + 01x(04536)
=-004 + max 08444 07898 05456 09281 = 08881
U=08272
R=-004
U=+1R=+1
U= 04536
R=-004
U=-1R=-1
17
Example (cont)Continue to next iterationhellip
Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough
U= 08881
R=-004
U=+1R=+1
U=-05676
R=-004
U=-1R=-1
18
ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence
RMS error ndashroot mean square error of the utility value compare to the correct values
demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of
any state in an iteration
| | 2
1
1( ( ) ( ))
| |
S
iRMS U i U i
S
19
ldquoclose enoughrdquo (cont)
Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ
When ||U|| = maxa |U(s)|
ε ndash maximum error allowed in utility of any state in an iteration
γ ndash the discount factor
20
Finding the policy
True utilities have foundedNew search for the optimal policy
For each s in S doπ[s] argmaxa sumsrsquo T(s a
srsquo)U(srsquo)Return π
21
Example (cont)Find the optimal police
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)
09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)
= argmaxa 06672 05996 04108 -06512 = Up
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)
= argmaxa 08993 08561 06429 09456 = Right
U= 08881R=-004
U=+1R=+1
U=-05676R=-004
U=-1R=-1
22
+1
-1
0705 0655 0611 0388
0762 0660
091208680812+1
-1
+1
-1
+1
-1
3 Extract optimal policy 4 Execute actions
1 The given environment 2 Calculate utilities
Summery ndash value iteration
23
Example - convergence Error allowed
24
Policy iteration
picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize
25
Policy iterationFor each state in each step
Policy evaluationGiven policy πi
Calculate the utility Ui of each state if π were to be execute
Policy improvementCalculate new policy πi+1
Based on πi
π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)
26
Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy
input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R
π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π
27
ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)
R=-004
R=+1
R=-004
R=-1
28
Example (cont)First iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 03778U(12) = 06U(21) = -1U(22) = 1
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1Policy
Π(11) = Up
Π(12) = Up09 08 01 0 (11) 004
0 01 0 01 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
29
Example (cont)First iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)
09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)
= argmaxa 04178 04 024 -07022 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)
= argmaxa 064 05778 04622 08978 = Right update
U= 06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Up
30
Example (cont)Second iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1
U=06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right09 08 01 0 (11) 004
01 09 0 08 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
31
Example (cont)Second iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)
09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)
= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)
= argmaxa 08059 076 06115 09326 = Right donrsquot have to update
U=07843R=-004
U=+1R=+1
U=05413R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right
32
Example (cont)
No change in the policy has found finish
The optimal policyπ(11) = Upπ(12) = Right
Policy iteration must terminate since policyrsquos number is finite
33
Simplify Policy iteration
Can focus of subset of stateFind utility by simplified value iteration
Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))
ORPolicy Improvement
Guaranteed to converge under certain conditions on initial polity and utility values
34
Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal
35
Value vs Policy Iteration
Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations
15
Example (cont)Second iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(0752) + 01x(-008) + 01x(-1)09x(-008) + 01x(0752)09x(-008) + 01x(-1)08x(-1) + 01x(-008) + 01x(0752)
=-004 + max 04936 00032 -0172 -03728 =04536
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(0752)+ 01x109x(0752) + 01x(-008)08x(-008) + 01x1 + 01x(0752)08x1 + 01x(0752) + 01x(-008)
=-004 + max 07768 06688 01112 08672 = 08272
U=0752
R=-004
U=+1R=+1
U=-008R=-004
U=-1R=-1
16
Example (cont)Third iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(08272) + 01x(04536) + 01x(-1)09x(04536) + 01x(08272)09x(04536) + 01x(-1)08x(-1) + 01x(04536) + 01x(08272)
=-004 + max 06071 0491 03082 -06719 =05676
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(08272)+ 01x109x(08272) + 01x(04536)08x(04536) + 01x1 + 01x(08272)08x1 + 01x(08272) + 01x(04536)
=-004 + max 08444 07898 05456 09281 = 08881
U=08272
R=-004
U=+1R=+1
U= 04536
R=-004
U=-1R=-1
17
Example (cont)Continue to next iterationhellip
Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough
U= 08881
R=-004
U=+1R=+1
U=-05676
R=-004
U=-1R=-1
18
ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence
RMS error ndashroot mean square error of the utility value compare to the correct values
demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of
any state in an iteration
| | 2
1
1( ( ) ( ))
| |
S
iRMS U i U i
S
19
ldquoclose enoughrdquo (cont)
Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ
When ||U|| = maxa |U(s)|
ε ndash maximum error allowed in utility of any state in an iteration
γ ndash the discount factor
20
Finding the policy
True utilities have foundedNew search for the optimal policy
For each s in S doπ[s] argmaxa sumsrsquo T(s a
srsquo)U(srsquo)Return π
21
Example (cont)Find the optimal police
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)
09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)
= argmaxa 06672 05996 04108 -06512 = Up
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)
= argmaxa 08993 08561 06429 09456 = Right
U= 08881R=-004
U=+1R=+1
U=-05676R=-004
U=-1R=-1
22
+1
-1
0705 0655 0611 0388
0762 0660
091208680812+1
-1
+1
-1
+1
-1
3 Extract optimal policy 4 Execute actions
1 The given environment 2 Calculate utilities
Summery ndash value iteration
23
Example - convergence Error allowed
24
Policy iteration
picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize
25
Policy iterationFor each state in each step
Policy evaluationGiven policy πi
Calculate the utility Ui of each state if π were to be execute
Policy improvementCalculate new policy πi+1
Based on πi
π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)
26
Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy
input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R
π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π
27
ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)
R=-004
R=+1
R=-004
R=-1
28
Example (cont)First iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 03778U(12) = 06U(21) = -1U(22) = 1
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1Policy
Π(11) = Up
Π(12) = Up09 08 01 0 (11) 004
0 01 0 01 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
29
Example (cont)First iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)
09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)
= argmaxa 04178 04 024 -07022 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)
= argmaxa 064 05778 04622 08978 = Right update
U= 06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Up
30
Example (cont)Second iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1
U=06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right09 08 01 0 (11) 004
01 09 0 08 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
31
Example (cont)Second iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)
09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)
= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)
= argmaxa 08059 076 06115 09326 = Right donrsquot have to update
U=07843R=-004
U=+1R=+1
U=05413R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right
32
Example (cont)
No change in the policy has found finish
The optimal policyπ(11) = Upπ(12) = Right
Policy iteration must terminate since policyrsquos number is finite
33
Simplify Policy iteration
Can focus of subset of stateFind utility by simplified value iteration
Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))
ORPolicy Improvement
Guaranteed to converge under certain conditions on initial polity and utility values
34
Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal
35
Value vs Policy Iteration
Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations
16
Example (cont)Third iteration
U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)
= -004 + 1 x max 08x(08272) + 01x(04536) + 01x(-1)09x(04536) + 01x(08272)09x(04536) + 01x(-1)08x(-1) + 01x(04536) + 01x(08272)
=-004 + max 06071 0491 03082 -06719 =05676
U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)
= -004 + 1 x max 09x(08272)+ 01x109x(08272) + 01x(04536)08x(04536) + 01x1 + 01x(08272)08x1 + 01x(08272) + 01x(04536)
=-004 + max 08444 07898 05456 09281 = 08881
U=08272
R=-004
U=+1R=+1
U= 04536
R=-004
U=-1R=-1
17
Example (cont)Continue to next iterationhellip
Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough
U= 08881
R=-004
U=+1R=+1
U=-05676
R=-004
U=-1R=-1
18
ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence
RMS error ndashroot mean square error of the utility value compare to the correct values
demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of
any state in an iteration
| | 2
1
1( ( ) ( ))
| |
S
iRMS U i U i
S
19
ldquoclose enoughrdquo (cont)
Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ
When ||U|| = maxa |U(s)|
ε ndash maximum error allowed in utility of any state in an iteration
γ ndash the discount factor
20
Finding the policy
True utilities have foundedNew search for the optimal policy
For each s in S doπ[s] argmaxa sumsrsquo T(s a
srsquo)U(srsquo)Return π
21
Example (cont)Find the optimal police
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)
09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)
= argmaxa 06672 05996 04108 -06512 = Up
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)
= argmaxa 08993 08561 06429 09456 = Right
U= 08881R=-004
U=+1R=+1
U=-05676R=-004
U=-1R=-1
22
+1
-1
0705 0655 0611 0388
0762 0660
091208680812+1
-1
+1
-1
+1
-1
3 Extract optimal policy 4 Execute actions
1 The given environment 2 Calculate utilities
Summery ndash value iteration
23
Example - convergence Error allowed
24
Policy iteration
picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize
25
Policy iterationFor each state in each step
Policy evaluationGiven policy πi
Calculate the utility Ui of each state if π were to be execute
Policy improvementCalculate new policy πi+1
Based on πi
π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)
26
Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy
input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R
π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π
27
ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)
R=-004
R=+1
R=-004
R=-1
28
Example (cont)First iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 03778U(12) = 06U(21) = -1U(22) = 1
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1Policy
Π(11) = Up
Π(12) = Up09 08 01 0 (11) 004
0 01 0 01 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
29
Example (cont)First iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)
09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)
= argmaxa 04178 04 024 -07022 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)
= argmaxa 064 05778 04622 08978 = Right update
U= 06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Up
30
Example (cont)Second iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1
U=06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right09 08 01 0 (11) 004
01 09 0 08 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
31
Example (cont)Second iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)
09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)
= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)
= argmaxa 08059 076 06115 09326 = Right donrsquot have to update
U=07843R=-004
U=+1R=+1
U=05413R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right
32
Example (cont)
No change in the policy has found finish
The optimal policyπ(11) = Upπ(12) = Right
Policy iteration must terminate since policyrsquos number is finite
33
Simplify Policy iteration
Can focus of subset of stateFind utility by simplified value iteration
Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))
ORPolicy Improvement
Guaranteed to converge under certain conditions on initial polity and utility values
34
Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal
35
Value vs Policy Iteration
Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations
17
Example (cont)Continue to next iterationhellip
Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough
U= 08881
R=-004
U=+1R=+1
U=-05676
R=-004
U=-1R=-1
18
ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence
RMS error ndashroot mean square error of the utility value compare to the correct values
demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of
any state in an iteration
| | 2
1
1( ( ) ( ))
| |
S
iRMS U i U i
S
19
ldquoclose enoughrdquo (cont)
Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ
When ||U|| = maxa |U(s)|
ε ndash maximum error allowed in utility of any state in an iteration
γ ndash the discount factor
20
Finding the policy
True utilities have foundedNew search for the optimal policy
For each s in S doπ[s] argmaxa sumsrsquo T(s a
srsquo)U(srsquo)Return π
21
Example (cont)Find the optimal police
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)
09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)
= argmaxa 06672 05996 04108 -06512 = Up
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)
= argmaxa 08993 08561 06429 09456 = Right
U= 08881R=-004
U=+1R=+1
U=-05676R=-004
U=-1R=-1
22
+1
-1
0705 0655 0611 0388
0762 0660
091208680812+1
-1
+1
-1
+1
-1
3 Extract optimal policy 4 Execute actions
1 The given environment 2 Calculate utilities
Summery ndash value iteration
23
Example - convergence Error allowed
24
Policy iteration
picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize
25
Policy iterationFor each state in each step
Policy evaluationGiven policy πi
Calculate the utility Ui of each state if π were to be execute
Policy improvementCalculate new policy πi+1
Based on πi
π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)
26
Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy
input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R
π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π
27
ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)
R=-004
R=+1
R=-004
R=-1
28
Example (cont)First iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 03778U(12) = 06U(21) = -1U(22) = 1
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1Policy
Π(11) = Up
Π(12) = Up09 08 01 0 (11) 004
0 01 0 01 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
29
Example (cont)First iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)
09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)
= argmaxa 04178 04 024 -07022 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)
= argmaxa 064 05778 04622 08978 = Right update
U= 06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Up
30
Example (cont)Second iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1
U=06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right09 08 01 0 (11) 004
01 09 0 08 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
31
Example (cont)Second iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)
09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)
= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)
= argmaxa 08059 076 06115 09326 = Right donrsquot have to update
U=07843R=-004
U=+1R=+1
U=05413R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right
32
Example (cont)
No change in the policy has found finish
The optimal policyπ(11) = Upπ(12) = Right
Policy iteration must terminate since policyrsquos number is finite
33
Simplify Policy iteration
Can focus of subset of stateFind utility by simplified value iteration
Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))
ORPolicy Improvement
Guaranteed to converge under certain conditions on initial polity and utility values
34
Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal
35
Value vs Policy Iteration
Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations
18
ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence
RMS error ndashroot mean square error of the utility value compare to the correct values
demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of
any state in an iteration
| | 2
1
1( ( ) ( ))
| |
S
iRMS U i U i
S
19
ldquoclose enoughrdquo (cont)
Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ
When ||U|| = maxa |U(s)|
ε ndash maximum error allowed in utility of any state in an iteration
γ ndash the discount factor
20
Finding the policy
True utilities have foundedNew search for the optimal policy
For each s in S doπ[s] argmaxa sumsrsquo T(s a
srsquo)U(srsquo)Return π
21
Example (cont)Find the optimal police
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)
09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)
= argmaxa 06672 05996 04108 -06512 = Up
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)
= argmaxa 08993 08561 06429 09456 = Right
U= 08881R=-004
U=+1R=+1
U=-05676R=-004
U=-1R=-1
22
+1
-1
0705 0655 0611 0388
0762 0660
091208680812+1
-1
+1
-1
+1
-1
3 Extract optimal policy 4 Execute actions
1 The given environment 2 Calculate utilities
Summery ndash value iteration
23
Example - convergence Error allowed
24
Policy iteration
picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize
25
Policy iterationFor each state in each step
Policy evaluationGiven policy πi
Calculate the utility Ui of each state if π were to be execute
Policy improvementCalculate new policy πi+1
Based on πi
π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)
26
Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy
input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R
π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π
27
ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)
R=-004
R=+1
R=-004
R=-1
28
Example (cont)First iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 03778U(12) = 06U(21) = -1U(22) = 1
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1Policy
Π(11) = Up
Π(12) = Up09 08 01 0 (11) 004
0 01 0 01 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
29
Example (cont)First iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)
09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)
= argmaxa 04178 04 024 -07022 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)
= argmaxa 064 05778 04622 08978 = Right update
U= 06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Up
30
Example (cont)Second iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1
U=06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right09 08 01 0 (11) 004
01 09 0 08 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
31
Example (cont)Second iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)
09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)
= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)
= argmaxa 08059 076 06115 09326 = Right donrsquot have to update
U=07843R=-004
U=+1R=+1
U=05413R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Right
32
Example (cont)
No change in the policy has found finish
The optimal policyπ(11) = Upπ(12) = Right
Policy iteration must terminate since policyrsquos number is finite
33
Simplify Policy iteration
Can focus of subset of stateFind utility by simplified value iteration
Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))
ORPolicy Improvement
Guaranteed to converge under certain conditions on initial polity and utility values
34
Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal
35
Value vs Policy Iteration
Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations
19
ldquoclose enoughrdquo (cont)
Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ
When ||U|| = maxa |U(s)|
ε ndash maximum error allowed in utility of any state in an iteration
γ ndash the discount factor
20
Finding the policy
True utilities have foundedNew search for the optimal policy
For each s in S doπ[s] argmaxa sumsrsquo T(s a
srsquo)U(srsquo)Return π
21
Example (cont)Find the optimal police
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)
09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)
= argmaxa 06672 05996 04108 -06512 = Up
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)
= argmaxa 08993 08561 06429 09456 = Right
U= 08881R=-004
U=+1R=+1
U=-05676R=-004
U=-1R=-1
22
+1
-1
0705 0655 0611 0388
0762 0660
091208680812+1
-1
+1
-1
+1
-1
3 Extract optimal policy 4 Execute actions
1 The given environment 2 Calculate utilities
Summery ndash value iteration
23
Example - convergence Error allowed
24
Policy iteration
picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize
25
Policy iterationFor each state in each step
Policy evaluationGiven policy πi
Calculate the utility Ui of each state if π were to be execute
Policy improvementCalculate new policy πi+1
Based on πi
π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)
26
Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy
input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R
π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π
27
ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)
R=-004
R=+1
R=-004
R=-1
28
Example (cont)First iteration ndash policy evaluation
U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)
U(11) = 03778U(12) = 06U(21) = -1U(22) = 1
U=-004R=-
004
U=+1R=+1
U=-004R=-
004
U=-1R=-1Policy
Π(11) = Up
Π(12) = Up09 08 01 0 (11) 004
0 01 0 01 (12) 004
0 0 1 0 (21) 1
0 0 0 1 (22) 1
U
U
U
U
29
Example (cont)First iteration ndash policy improvement
Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down
08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)
09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)
= argmaxa 04178 04 024 -07022 = Up donrsquot have to update
Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left
08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right
= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)
= argmaxa 064 05778 04622 08978 = Right update
U= 06R=-004
U=+1R=+1
U=03778R=-004
U=-1R=-1
Policy
Π(11) = Up
Π(12) = Up
30
Example (cont.)
Second iteration – policy evaluation

U(1,1) = R(1,1) + γ·(0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1))
U(1,2) = R(1,2) + γ·(0.1U(1,2) + 0.8U(2,2) + 0.1U(1,1))
U(2,1) = R(2,1)
U(2,2) = R(2,2)

With γ = 1:
U(1,1) = -0.04 + 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1)
U(1,2) = -0.04 + 0.1U(1,2) + 0.8U(2,2) + 0.1U(1,1)
U(2,1) = -1
U(2,2) = +1

Rearranged as a linear system:
-0.9U(1,1) + 0.8U(1,2) + 0.1U(2,1) + 0·U(2,2) = 0.04
 0.1U(1,1) - 0.9U(1,2) + 0·U(2,1) + 0.8U(2,2) = 0.04
 U(2,1) = -1
 U(2,2) = +1

Solution: U(1,1) = 0.5413, U(1,2) = 0.7843, U(2,1) = -1, U(2,2) = 1

Board (utilities from the first iteration):
(1,2): U=0.6, R=-0.04       (2,2): U=+1, R=+1
(1,1): U=0.3778, R=-0.04    (2,1): U=-1, R=-1

Policy: π(1,1) = Up, π(1,2) = Right

In matrix form:
[ -0.9   0.8   0.1   0   ]       [ 0.04 ]
[  0.1  -0.9   0     0.8 ]  U =  [ 0.04 ]
[  0     0     1     0   ]       [ -1   ]
[  0     0     0     1   ]       [  1   ]
31
Example (cont.)
Second iteration – policy improvement

π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1)    (Up)
                    0.9U(1,1) + 0.1U(1,2)                (Left)
                    0.9U(1,1) + 0.1U(2,1)                (Down)
                    0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }  (Right)
       = argmax_a { 0.8×0.7843 + 0.1×0.5413 + 0.1×(-1),
                    0.9×0.5413 + 0.1×0.7843,
                    0.9×0.5413 + 0.1×(-1),
                    0.8×(-1) + 0.1×0.5413 + 0.1×0.7843 }
       = argmax_a { 0.5816, 0.5656, 0.3871, -0.6674 } = Up → don't have to update

π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2)                (Up)
                    0.9U(1,2) + 0.1U(1,1)                (Left)
                    0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2)    (Down)
                    0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }  (Right)
       = argmax_a { 0.9×0.7843 + 0.1×1,
                    0.9×0.7843 + 0.1×0.5413,
                    0.8×0.5413 + 0.1×1 + 0.1×0.7843,
                    0.8×1 + 0.1×0.7843 + 0.1×0.5413 }
       = argmax_a { 0.8059, 0.76, 0.6115, 0.9326 } = Right → don't have to update

Board (utility / reward):
(1,2): U=0.7843, R=-0.04    (2,2): U=+1, R=+1
(1,1): U=0.5413, R=-0.04    (2,1): U=-1, R=-1

Policy: π(1,1) = Up, π(1,2) = Right
32
Example (cont.)
No change in the policy has been found → finished.
The optimal policy: π(1,1) = Up, π(1,2) = Right
Policy iteration must terminate: each improvement step produces a policy at least as good as the previous one, and there are only finitely many distinct policies, so none can repeat.
33
Simplified policy iteration
Can focus on a subset of states. For each chosen state, either:
- find its utility by simplified value iteration:
  Ui+1(s) = R(s) + γ Σ_s' T(s, π(s), s') Ui(s')
- OR apply a policy-improvement step (a code sketch follows below).
Guaranteed to converge under certain conditions on the initial policy and utility values.
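One way to read this slide as code, under our own assumptions and reusing the 2x2-world helpers from before (the slide leaves open which states to touch and which of the two operations to apply, so this sketch chooses at random):

import random

def async_round(subset, pi, U, R, gamma=1.0):
    # One asynchronous round over a chosen subset of non-terminal states.
    for s in subset:
        if random.random() < 0.5:
            # Simplified value iteration under the current policy:
            # U_{i+1}(s) = R(s) + gamma * sum_s' T(s, pi(s), s') U_i(s')
            U[s] = R[s] + gamma * sum(p * U[s2] for s2, p in transitions(s, pi[s]).items())
        else:
            # ...OR a policy-improvement step at s.
            pi[s] = max(MOVES, key=lambda a: sum(p * U[s2] for s2, p in transitions(s, a).items()))
    return pi, U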
34
Policy iteration properties
- Linear equations – easy to solve.
- Fast convergence in practice.
- Proved to converge to an optimal policy.
35
Value vs. policy iteration
Which to use?
- Policy iteration is more expensive per iteration.
- In practice, policy iteration requires fewer iterations.