
Slide 1: Markov Decision Process (MDP)
Ruti Glick, Bar-Ilan University

Slide 2: Policy

- A policy is similar to a plan: it is generated ahead of time.
- Unlike traditional plans, it is not a sequence of actions that the agent must execute.
- If there are failures in execution, the agent can continue to execute the policy.
- A policy prescribes an action for every state.
- It maximizes expected reward rather than just reaching a goal state.

Slide 3: Utility and Policy

- Utility: compute for every state "What is the use (utility) of this state for the overall task?"
- Policy: a complete mapping from states to actions – "In which state should I perform which action?" (policy: state → action)

Slide 4: The Optimal Policy

- If we know the utilities, we can easily compute the optimal policy.
- The problem is to compute the correct utilities for all states.

π(s) = argmax_a Σ_s' T(s, a, s') U(s')

where T(s, a, s') is the probability of reaching state s' from state s when taking action a, and U(s') is the utility of state s'.

Slide 5: Finding π

- Value iteration
- Policy iteration

Slide 6: Value Iteration – Process

- Calculate the utility of each state.
- Use the values to select an optimal action.

Slide 7: Bellman Equation

U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s')

For example:
U(1,1) = -0.04 + γ max{ 0.8U(1,2) + 0.1U(2,1) + 0.1U(1,1),   (Up)
                        0.9U(1,1) + 0.1U(1,2),                (Left)
                        0.9U(1,1) + 0.1U(2,1),                (Down)
                        0.8U(2,1) + 0.1U(1,2) + 0.1U(1,1) }   (Right)

[Figure: the main grid-world example, with a START state and terminal states labeled +1 and -1; the available actions are Up, Left, Down, Right.]
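
As a small illustration (not from the slides; the names and data layout are assumed), the Bellman equation for a single state translates almost literally into code, given a transition table T[s][a] holding (probability, successor) pairs, a reward table R, current utilities U, and a discount gamma:

def bellman_backup(s, U, R, T, gamma):
    # U(s) = R(s) + gamma * max_a sum_s' T(s, a, s') * U(s')
    return R[s] + gamma * max(
        sum(p * U[s2] for p, s2 in outcomes)    # expected utility of one action
        for outcomes in T[s].values())          # maximize over the available actions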

Slide 8: Bellman Equation – Properties

U(s) = R(s) + γ max_a Σ_s' T(s, a, s') U(s')

- n equations – one for each state – and n variables (the utilities).
- Problem: max is not a linear operator, so these are non-linear equations.
- Solution: an iterative approach.

Slide 9: Value Iteration Algorithm

- Start with arbitrary initial values for the utilities.
- Update the utility of each state from its neighbors:
  U_{i+1}(s) ← R(s) + γ max_a Σ_s' T(s, a, s') U_i(s')
- This iteration step is called a Bellman update.
- Repeat until the values converge.

Slide 10: Value Iteration Properties

- The equilibrium is a unique solution.
- It can be proved that the value iteration process converges.
- We don't need exact values.

Slide 11: Convergence – Value Iteration Is a Contraction

- A contraction is a function of one argument that, when applied to two inputs, produces values that are "closer together".
- A contraction has only one fixed point.
- Each application of the contraction brings the value closer to that fixed point.
- We are not going to prove the last point: convergence to the correct values.

Slide 12: Value Iteration Algorithm

function VALUE-ITERATION(mdp) returns a utility function
  inputs: mdp, an MDP with states S, transition model T,
          reward function R, discount γ
  local variables: U, U', vectors of utilities for states in S,
                   initially identical to R
  repeat
    U ← U'
    for each state s in S do
      U'[s] ← R[s] + γ max_a Σ_s' T(s, a, s') U[s']
  until CLOSE-ENOUGH(U, U')
  return U
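
A minimal Python rendering of this pseudocode, as a sketch only: it assumes the MDP is supplied as a list of states, an actions(s) function (empty for terminal states, whose utility then stays at R[s]), a transition function T(s, a) returning (probability, successor) pairs, a reward dict R, and a discount gamma; the names are illustrative, not part of the slides.

def value_iteration(states, actions, T, R, gamma, epsilon=1e-4):
    """Repeat Bellman updates until the largest change drops below epsilon."""
    U1 = {s: R[s] for s in states}               # initially identical to R
    while True:
        U, delta = dict(U1), 0.0                 # U <- U'
        for s in states:
            acts = actions(s)
            if acts:                             # non-terminal: full Bellman update
                U1[s] = R[s] + gamma * max(
                    sum(p * U[s2] for p, s2 in T(s, a)) for a in acts)
            else:                                # terminal state keeps its reward
                U1[s] = R[s]
            delta = max(delta, abs(U1[s] - U[s]))
        if delta < epsilon:                      # the "close enough" test
            return U1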

Slide 13: Example

- A small version of our main example: a 2x2 world.
- The agent is placed in (1,1).
- States (2,1) and (2,2) are goal states.
- If the agent is blocked by a wall, it stays in place.
- The rewards are written on the board:

  (1,2): R = -0.04   (2,2): R = +1
  (1,1): R = -0.04   (2,1): R = -1
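
To make the example concrete, here is a sketch of this 2x2 world in Python (the representation is assumed, not given on the slides): the chosen action succeeds with probability 0.8, slips to each perpendicular direction with probability 0.1, and moves into a wall leave the agent in place. One Bellman sweep starting from U = R reproduces the first-iteration values computed on the next slide.

ACTIONS = {"Up": (0, 1), "Down": (0, -1), "Left": (-1, 0), "Right": (1, 0)}
PERP = {"Up": ("Left", "Right"), "Down": ("Left", "Right"),
        "Left": ("Up", "Down"), "Right": ("Up", "Down")}
STATES = [(1, 1), (1, 2), (2, 1), (2, 2)]        # (column, row); (1,1) is bottom-left
TERMINAL = {(2, 1), (2, 2)}                      # goal states
R = {(1, 1): -0.04, (1, 2): -0.04, (2, 1): -1.0, (2, 2): +1.0}
GAMMA = 1.0

def move(s, a):
    """Deterministic successor of s in direction a; stay in place if blocked."""
    nxt = (s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1])
    return nxt if nxt in STATES else s

def T(s, a):
    """Transition distribution for action a in state s as (probability, successor) pairs."""
    return [(0.8, move(s, a)), (0.1, move(s, PERP[a][0])), (0.1, move(s, PERP[a][1]))]

# One Bellman sweep starting from U = R:
U = dict(R)
U_next = {s: R[s] if s in TERMINAL else
             R[s] + GAMMA * max(sum(p * U[s2] for p, s2 in T(s, a)) for a in ACTIONS)
          for s in STATES}
print(U_next)   # U(1,1) ≈ -0.08 and U(1,2) ≈ 0.752, matching the first iteration below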

Slide 14: Example (cont.) – First Iteration

U(1,1) = R(1,1) + γ max{ 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
                         0.9U(1,1) + 0.1U(1,2),
                         0.9U(1,1) + 0.1U(2,1),
                         0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }
       = -0.04 + 1 × max{ 0.8×(-0.04) + 0.1×(-0.04) + 0.1×(-1),
                          0.9×(-0.04) + 0.1×(-0.04),
                          0.9×(-0.04) + 0.1×(-1),
                          0.8×(-1) + 0.1×(-0.04) + 0.1×(-0.04) }
       = -0.04 + max{ -0.136, -0.04, -0.136, -0.808 } = -0.08

U(1,2) = R(1,2) + γ max{ 0.9U(1,2) + 0.1U(2,2),
                         0.9U(1,2) + 0.1U(1,1),
                         0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
                         0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }
       = -0.04 + 1 × max{ 0.9×(-0.04) + 0.1×1,
                          0.9×(-0.04) + 0.1×(-0.04),
                          0.8×(-0.04) + 0.1×1 + 0.1×(-0.04),
                          0.8×1 + 0.1×(-0.04) + 0.1×(-0.04) }
       = -0.04 + max{ 0.064, -0.04, 0.064, 0.792 } = 0.752

The goal states remain the same.

Board:
  (1,2): U = -0.04, R = -0.04   (2,2): U = +1, R = +1
  (1,1): U = -0.04, R = -0.04   (2,1): U = -1, R = -1

Slide 15: Example (cont.) – Second Iteration

U(1,1) = R(1,1) + γ max{ 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
                         0.9U(1,1) + 0.1U(1,2),
                         0.9U(1,1) + 0.1U(2,1),
                         0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }
       = -0.04 + 1 × max{ 0.8×0.752 + 0.1×(-0.08) + 0.1×(-1),
                          0.9×(-0.08) + 0.1×0.752,
                          0.9×(-0.08) + 0.1×(-1),
                          0.8×(-1) + 0.1×(-0.08) + 0.1×0.752 }
       = -0.04 + max{ 0.4936, 0.0032, -0.172, -0.3728 } = 0.4536

U(1,2) = R(1,2) + γ max{ 0.9U(1,2) + 0.1U(2,2),
                         0.9U(1,2) + 0.1U(1,1),
                         0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
                         0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }
       = -0.04 + 1 × max{ 0.9×0.752 + 0.1×1,
                          0.9×0.752 + 0.1×(-0.08),
                          0.8×(-0.08) + 0.1×1 + 0.1×0.752,
                          0.8×1 + 0.1×0.752 + 0.1×(-0.08) }
       = -0.04 + max{ 0.7768, 0.6688, 0.1112, 0.8672 } = 0.8272

Board:
  (1,2): U = 0.752, R = -0.04   (2,2): U = +1, R = +1
  (1,1): U = -0.08, R = -0.04   (2,1): U = -1, R = -1

Slide 16: Example (cont.) – Third Iteration

U(1,1) = R(1,1) + γ max{ 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),
                         0.9U(1,1) + 0.1U(1,2),
                         0.9U(1,1) + 0.1U(2,1),
                         0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }
       = -0.04 + 1 × max{ 0.8×0.8272 + 0.1×0.4536 + 0.1×(-1),
                          0.9×0.4536 + 0.1×0.8272,
                          0.9×0.4536 + 0.1×(-1),
                          0.8×(-1) + 0.1×0.4536 + 0.1×0.8272 }
       = -0.04 + max{ 0.6071, 0.491, 0.3082, -0.6719 } = 0.5676

U(1,2) = R(1,2) + γ max{ 0.9U(1,2) + 0.1U(2,2),
                         0.9U(1,2) + 0.1U(1,1),
                         0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),
                         0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }
       = -0.04 + 1 × max{ 0.9×0.8272 + 0.1×1,
                          0.9×0.8272 + 0.1×0.4536,
                          0.8×0.4536 + 0.1×1 + 0.1×0.8272,
                          0.8×1 + 0.1×0.8272 + 0.1×0.4536 }
       = -0.04 + max{ 0.8444, 0.7898, 0.5456, 0.9281 } = 0.8881

Board:
  (1,2): U = 0.8272, R = -0.04   (2,2): U = +1, R = +1
  (1,1): U = 0.4536, R = -0.04   (2,1): U = -1, R = -1

Slide 17: Example (cont.)

- Continue to the next iteration…
- Finish when the values are "close enough". Here the last change was 0.114 – close enough.

Board:
  (1,2): U = 0.8881, R = -0.04   (2,2): U = +1, R = +1
  (1,1): U = 0.5676, R = -0.04   (2,1): U = -1, R = -1

Slide 18: "Close Enough"

- We will not go into this issue in depth; there are different ways to detect convergence.
- RMS error: the root-mean-square error of the utility values compared to the correct values.
- Demand RMS(U, U') < ε, where ε is the maximum error allowed in the utility of any state in an iteration.

RMS(U, U') = sqrt( (1/|S|) Σ_{i=1}^{|S|} (U(i) − U'(i))² )
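
A tiny helper for this criterion (a sketch; the utilities are assumed to be stored in a dict keyed by state):

from math import sqrt

def rms_error(U, U_prev):
    """Root-mean-square difference between two utility vectors."""
    return sqrt(sum((U[s] - U_prev[s]) ** 2 for s in U) / len(U))

# stop the iteration once rms_error(U, U_prev) < epsilon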

Slide 19: "Close Enough" (cont.)

- Policy loss: the difference between the expected utility obtained by following the policy and the expected utility obtained by the optimal policy.
- Stop when || U_{i+1} − U_i || < ε(1−γ)/γ,
  where ||U|| = max_s |U(s)|,
  ε is the maximum error allowed in the utility of any state in an iteration,
  and γ is the discount factor.
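
The corresponding termination test, as a sketch (it assumes 0 < γ < 1, since the bound becomes 0 when γ = 1):

def max_norm(U, U_prev):
    """||U - U'|| = max over states s of |U(s) - U'(s)|."""
    return max(abs(U[s] - U_prev[s]) for s in U)

def close_enough(U, U_prev, epsilon, gamma):
    # stop when ||U_{i+1} - U_i|| < epsilon * (1 - gamma) / gamma
    return max_norm(U, U_prev) < epsilon * (1 - gamma) / gamma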

Slide 20: Finding the Policy

- The true utilities have been found; now search for the optimal policy:

for each state s in S do
  π[s] ← argmax_a Σ_s' T(s, a, s') U(s')
return π
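
In code this policy-extraction loop might look like the following sketch (same assumed MDP representation as before: actions(s) lists the available actions, T(s, a) yields (probability, successor) pairs):

def extract_policy(states, actions, T, U):
    """Greedy policy w.r.t. U: pi[s] = argmax_a sum_s' T(s, a, s') * U(s')."""
    pi = {}
    for s in states:
        acts = actions(s)
        if acts:                                  # terminal states need no action
            pi[s] = max(acts, key=lambda a: sum(p * U[s2] for p, s2 in T(s, a)))
    return pi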

Slide 21: Example (cont.) – Finding the Optimal Policy

π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),   (Up)
                    0.9U(1,1) + 0.1U(1,2),                (Left)
                    0.9U(1,1) + 0.1U(2,1),                (Down)
                    0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }   (Right)
       = argmax{ 0.8×0.8881 + 0.1×0.5676 + 0.1×(-1),
                 0.9×0.5676 + 0.1×0.8881,
                 0.9×0.5676 + 0.1×(-1),
                 0.8×(-1) + 0.1×0.5676 + 0.1×0.8881 }
       = argmax{ 0.6672, 0.5996, 0.4108, -0.6512 } = Up

π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),                (Up)
                    0.9U(1,2) + 0.1U(1,1),                (Left)
                    0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),    (Down)
                    0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }   (Right)
       = argmax{ 0.9×0.8881 + 0.1×1,
                 0.9×0.8881 + 0.1×0.5676,
                 0.8×0.5676 + 0.1×1 + 0.1×0.8881,
                 0.8×1 + 0.1×0.8881 + 0.1×0.5676 }
       = argmax{ 0.8993, 0.8561, 0.6429, 0.9456 } = Right

Board:
  (1,2): U = 0.8881, R = -0.04   (2,2): U = +1, R = +1
  (1,1): U = 0.5676, R = -0.04   (2,1): U = -1, R = -1

Slide 22: Summary – Value Iteration

[Figure: the main grid-world example shown in four panels – 1. the given environment (with the +1 and -1 terminal states); 2. calculate utilities (values such as 0.705, 0.655, 0.611, 0.388, 0.762, 0.660, 0.812, 0.868, 0.912); 3. extract the optimal policy; 4. execute actions.]

Slide 23: Example – Convergence

[Figure: convergence of the utility values as a function of the error allowed.]

Policy iteration

picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize

25

Policy iterationFor each state in each step

Policy evaluationGiven policy πi

Calculate the utility Ui of each state if π were to be execute

Policy improvementCalculate new policy πi+1

Based on πi

π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)

Slide 26: Policy Iteration Algorithm

function POLICY-ITERATION(mdp) returns a policy
  inputs: mdp, an MDP with states S, transition model T
  local variables: U, a vector of utilities for states in S, initially identical to R
                   π, a policy vector indexed by states, initially random
  repeat
    U ← POLICY-EVALUATION(π, U, mdp)
    unchanged? ← true
    for each state s in S do
      if max_a Σ_s' T(s, a, s') U[s'] > Σ_s' T(s, π[s], s') U[s'] then
        π[s] ← argmax_a Σ_s' T(s, a, s') U[s']
        unchanged? ← false
  until unchanged?
  return π
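
A rough Python counterpart of this pseudocode, under the same assumed MDP representation as before; policy_evaluation stands for whatever evaluation routine is used (an exact linear solve as on the next slides, or a few iterative sweeps), and the names are illustrative.

import random

def policy_iteration(states, actions, T, R, policy_evaluation):
    """Alternate policy evaluation and greedy improvement until no action changes."""
    pi = {s: random.choice(actions(s)) for s in states if actions(s)}   # initially random
    U = {s: R[s] for s in states}                                       # initially identical to R
    while True:
        U = policy_evaluation(pi, U)
        unchanged = True
        for s in pi:
            q = {a: sum(p * U[s2] for p, s2 in T(s, a)) for a in actions(s)}
            best = max(q, key=q.get)
            if q[best] > q[pi[s]]:               # a strictly better action exists
                pi[s] = best
                unchanged = False
        if unchanged:
            return pi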

Slide 27: Example

- Back to our last example: the 2x2 world.
- The agent is placed in (1,1); states (2,1) and (2,2) are goal states.
- If blocked by a wall, the agent stays in place; the rewards are written on the board.
- Initial policy: Up (for every state).

  (1,2): R = -0.04   (2,2): R = +1
  (1,1): R = -0.04   (2,1): R = -1

Slide 28: Example (cont.) – First Iteration – Policy Evaluation

Current policy: π(1,1) = Up, π(1,2) = Up.

U(1,1) = R(1,1) + γ (0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1))
U(1,2) = R(1,2) + γ (0.9U(1,2) + 0.1U(2,2))
U(2,1) = R(2,1)
U(2,2) = R(2,2)

With γ = 1 and the rewards above:
U(1,1) = -0.04 + 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1)
U(1,2) = -0.04 + 0.9U(1,2) + 0.1U(2,2)
U(2,1) = -1
U(2,2) = 1

Rearranged as a linear system (which can also be written in matrix form):
 0.04 = -0.9U(1,1) + 0.8U(1,2) + 0.1U(2,1) + 0·U(2,2)
 0.04 =  0·U(1,1) − 0.1U(1,2) + 0·U(2,1) + 0.1U(2,2)
   -1 =  U(2,1)
    1 =  U(2,2)

Solution: U(1,1) = 0.3778, U(1,2) = 0.6, U(2,1) = -1, U(2,2) = 1

Board:
  (1,2): U = -0.04, R = -0.04   (2,2): U = +1, R = +1
  (1,1): U = -0.04, R = -0.04   (2,1): U = -1, R = -1
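
This linear system is small enough to solve directly; a sketch using NumPy (variable order U(1,1), U(1,2), U(2,1), U(2,2)) reproduces the numbers above:

import numpy as np

# Rows encode:  0.04 = -0.9 U11 + 0.8 U12 + 0.1 U21
#               0.04 =          - 0.1 U12            + 0.1 U22
#                 -1 =                        1.0 U21
#                  1 =                                  1.0 U22
A = np.array([[-0.9,  0.8,  0.1,  0.0],
              [ 0.0, -0.1,  0.0,  0.1],
              [ 0.0,  0.0,  1.0,  0.0],
              [ 0.0,  0.0,  0.0,  1.0]])
b = np.array([0.04, 0.04, -1.0, 1.0])
print(np.linalg.solve(A, b))   # ≈ [0.3778, 0.6, -1.0, 1.0]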

Slide 29: Example (cont.) – First Iteration – Policy Improvement

π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),   (Up)
                    0.9U(1,1) + 0.1U(1,2),                (Left)
                    0.9U(1,1) + 0.1U(2,1),                (Down)
                    0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }   (Right)
       = argmax{ 0.8×0.6 + 0.1×0.3778 + 0.1×(-1),
                 0.9×0.3778 + 0.1×0.6,
                 0.9×0.3778 + 0.1×(-1),
                 0.8×(-1) + 0.1×0.3778 + 0.1×0.6 }
       = argmax{ 0.4178, 0.4, 0.24, -0.7022 } = Up → no need to update

π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),                (Up)
                    0.9U(1,2) + 0.1U(1,1),                (Left)
                    0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),    (Down)
                    0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }   (Right)
       = argmax{ 0.9×0.6 + 0.1×1,
                 0.9×0.6 + 0.1×0.3778,
                 0.8×0.3778 + 0.1×1 + 0.1×0.6,
                 0.8×1 + 0.1×0.6 + 0.1×0.3778 }
       = argmax{ 0.64, 0.5778, 0.4622, 0.8978 } = Right → update

Board:
  (1,2): U = 0.6, R = -0.04      (2,2): U = +1, R = +1
  (1,1): U = 0.3778, R = -0.04   (2,1): U = -1, R = -1
Policy (before the update): π(1,1) = Up, π(1,2) = Up

Slide 30: Example (cont.) – Second Iteration – Policy Evaluation

Current policy: π(1,1) = Up, π(1,2) = Right.

U(1,1) = R(1,1) + γ (0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1))
U(1,2) = R(1,2) + γ (0.1U(1,2) + 0.8U(2,2) + 0.1U(1,1))
U(2,1) = R(2,1)
U(2,2) = R(2,2)

With γ = 1 and the rewards above:
U(1,1) = -0.04 + 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1)
U(1,2) = -0.04 + 0.1U(1,2) + 0.8U(2,2) + 0.1U(1,1)
U(2,1) = -1
U(2,2) = 1

Rearranged as a linear system (which can also be written in matrix form):
 0.04 = -0.9U(1,1) + 0.8U(1,2) + 0.1U(2,1) + 0·U(2,2)
 0.04 =  0.1U(1,1) − 0.9U(1,2) + 0·U(2,1) + 0.8U(2,2)
   -1 =  U(2,1)
    1 =  U(2,2)

Solution: U(1,1) = 0.5413, U(1,2) = 0.7843, U(2,1) = -1, U(2,2) = 1

Board:
  (1,2): U = 0.6, R = -0.04      (2,2): U = +1, R = +1
  (1,1): U = 0.3778, R = -0.04   (2,1): U = -1, R = -1

Slide 31: Example (cont.) – Second Iteration – Policy Improvement

π(1,1) = argmax_a { 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1),   (Up)
                    0.9U(1,1) + 0.1U(1,2),                (Left)
                    0.9U(1,1) + 0.1U(2,1),                (Down)
                    0.8U(2,1) + 0.1U(1,1) + 0.1U(1,2) }   (Right)
       = argmax{ 0.8×0.7843 + 0.1×0.5413 + 0.1×(-1),
                 0.9×0.5413 + 0.1×0.7843,
                 0.9×0.5413 + 0.1×(-1),
                 0.8×(-1) + 0.1×0.5413 + 0.1×0.7843 }
       = argmax{ 0.5816, 0.5656, 0.3871, -0.6674 } = Up → no need to update

π(1,2) = argmax_a { 0.9U(1,2) + 0.1U(2,2),                (Up)
                    0.9U(1,2) + 0.1U(1,1),                (Left)
                    0.8U(1,1) + 0.1U(2,2) + 0.1U(1,2),    (Down)
                    0.8U(2,2) + 0.1U(1,2) + 0.1U(1,1) }   (Right)
       = argmax{ 0.9×0.7843 + 0.1×1,
                 0.9×0.7843 + 0.1×0.5413,
                 0.8×0.5413 + 0.1×1 + 0.1×0.7843,
                 0.8×1 + 0.1×0.7843 + 0.1×0.5413 }
       = argmax{ 0.8059, 0.76, 0.6115, 0.9326 } = Right → no need to update

Board:
  (1,2): U = 0.7843, R = -0.04   (2,2): U = +1, R = +1
  (1,1): U = 0.5413, R = -0.04   (2,1): U = -1, R = -1
Policy: π(1,1) = Up, π(1,2) = Right

Slide 32: Example (cont.)

- No change in the policy has been found, so we finish.
- The optimal policy: π(1,1) = Up, π(1,2) = Right.
- Policy iteration must terminate, since the number of distinct policies is finite.

Slide 33: Simplified Policy Iteration

- We can focus on a subset of the states.
- Find utilities by simplified value iteration:
  U_{i+1}(s) = R(s) + γ Σ_s' T(s, π(s), s') U_i(s')
  OR by policy improvement.
- Guaranteed to converge under certain conditions on the initial policy and utility values.
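
A sketch of this simplified (iterative) policy evaluation, again using the assumed dictionary-based MDP representation; a handful of sweeps is usually enough for a small example like ours.

def simplified_policy_evaluation(pi, U, states, T, R, gamma, sweeps=20):
    """Approximate policy evaluation: repeated simplified Bellman updates
    U_{i+1}(s) = R(s) + gamma * sum_s' T(s, pi(s), s') * U_i(s')
    instead of solving the linear system exactly."""
    for _ in range(sweeps):
        nxt = {}
        for s in states:
            if s in pi:                           # state controlled by the policy
                nxt[s] = R[s] + gamma * sum(p * U[s2] for p, s2 in T(s, pi[s]))
            else:                                 # terminal state keeps its reward
                nxt[s] = R[s]
        U = nxt
    return U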

Slide 34: Policy Iteration Properties

- Linear equations – easy to solve.
- Fast convergence in practice.
- Proved to be optimal.

Slide 35: Value vs. Policy Iteration

- Which to use?
- Policy iteration is more expensive per iteration.
- In practice, policy iteration requires fewer iterations.


When ||U|| = maxa |U(s)|

ε ndash maximum error allowed in utility of any state in an iteration

γ ndash the discount factor

20

Finding the policy

True utilities have foundedNew search for the optimal policy

For each s in S doπ[s] argmaxa sumsrsquo T(s a

srsquo)U(srsquo)Return π

21

Example (cont)Find the optimal police

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)

09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)

= argmaxa 06672 05996 04108 -06512 = Up

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)

= argmaxa 08993 08561 06429 09456 = Right

U= 08881R=-004

U=+1R=+1

U=-05676R=-004

U=-1R=-1

22

+1

-1

0705 0655 0611 0388

0762 0660

091208680812+1

-1

+1

-1

+1

-1

3 Extract optimal policy 4 Execute actions

1 The given environment 2 Calculate utilities

Summery ndash value iteration

23

Example - convergence Error allowed

24

Policy iteration

picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize

25

Policy iterationFor each state in each step

Policy evaluationGiven policy πi

Calculate the utility Ui of each state if π were to be execute

Policy improvementCalculate new policy πi+1

Based on πi

π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)

26

Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy

input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R

π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π

27

ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)

R=-004

R=+1

R=-004

R=-1

28

Example (cont)First iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 03778U(12) = 06U(21) = -1U(22) = 1

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1Policy

Π(11) = Up

Π(12) = Up09 08 01 0 (11) 004

0 01 0 01 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

29

Example (cont)First iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)

09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)

= argmaxa 04178 04 024 -07022 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)

= argmaxa 064 05778 04622 08978 = Right update

U= 06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Up

30

Example (cont)Second iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1

U=06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right09 08 01 0 (11) 004

01 09 0 08 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

31

Example (cont)Second iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)

09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)

= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)

= argmaxa 08059 076 06115 09326 = Right donrsquot have to update

U=07843R=-004

U=+1R=+1

U=05413R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right

32

Example (cont)

No change in the policy has found finish

The optimal policyπ(11) = Upπ(12) = Right

Policy iteration must terminate since policyrsquos number is finite

33

Simplify Policy iteration

Can focus of subset of stateFind utility by simplified value iteration

Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))

ORPolicy Improvement

Guaranteed to converge under certain conditions on initial polity and utility values

34

Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal

35

Value vs Policy Iteration

Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations

Page 7: Markov Decision Process (MDP)

7

Bellman Equation

Bellman EquationU(s) = R(s) + γmaxa sum (T(s a srsquo) U(srsquo))

For exampleU(11) = -004+γmax 08U(12)+01U(21)+01U(11)

09U(11)+01U(12)09U(11)+01U(21)08U(21)+01U(12)+01U(11)

START

+1

-1

Up

Left

Down

Right

8

Bellman Equationproperties

U(s) = R(s) + γ maxa sum (T(s a srsquo) U(srsquo))n equations One for each stepn vaiables

ProblemOperator max is not a linear operatorNon-linear equations

SolutionIterative approach

9

Value iteration algorithm

Initial arbitrary values for utilities Update utility of each state from itrsquos neighbors

Ui+1(s) R(s) + γ maxa sum (T(s a srsquo) Ui(srsquo))

Iteration step called Bellman update

Repeat till converges

10

Value Iteration properties

This equilibrium is a unique solution Can prove that the value iteration process convergesDonrsquot need exact values

11

convergence Value iteration is contraction

Function of one argumentWhen applied on to inputs produces value that are ldquocloser togetherHave only one fixed pointWhen applied the value must be closer to fixed point

Wersquoll not going to prove last pointconverge to correct value

12

Value Iteration Algorithm

function VALUE_ITERATION (mdp) returns a utility functioninput mdp MDP with states S transition model T

reward function R discount γ local variables U Ursquo vectors of utilities for states in S

initially identical to R

repeat U Ursquo for each state s in S do

Ursquo[s] R[s] + γmaxa srsquo T(s a srsquo)U[srsquo]until close-enough(UUrsquo)return U

13

ExampleSmall version of our main example2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the board

R=-004 R=+1

R=-004 R=-1

14

Example (cont)First iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(-004) + 01x(-004) + 01x(-1)09x(-004) + 01x(-004)09x(-004) + 01x(-1)08x(-1) + 01x(-004) + 01x(-004)

=-004 + max -0136 -004 -0136 -0808 =-008

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(-004)+ 01x109x(-004) + 01x(-004)08x(-004) + 01x1 + 01x(-004)08x1 + 01x(-004) + 01x(-004)

=-004 + max 0064 -004 0064 0792 =0752

Goal states remain the same

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1

15

Example (cont)Second iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(0752) + 01x(-008) + 01x(-1)09x(-008) + 01x(0752)09x(-008) + 01x(-1)08x(-1) + 01x(-008) + 01x(0752)

=-004 + max 04936 00032 -0172 -03728 =04536

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(0752)+ 01x109x(0752) + 01x(-008)08x(-008) + 01x1 + 01x(0752)08x1 + 01x(0752) + 01x(-008)

=-004 + max 07768 06688 01112 08672 = 08272

U=0752

R=-004

U=+1R=+1

U=-008R=-004

U=-1R=-1

16

Example (cont)Third iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(08272) + 01x(04536) + 01x(-1)09x(04536) + 01x(08272)09x(04536) + 01x(-1)08x(-1) + 01x(04536) + 01x(08272)

=-004 + max 06071 0491 03082 -06719 =05676

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(08272)+ 01x109x(08272) + 01x(04536)08x(04536) + 01x1 + 01x(08272)08x1 + 01x(08272) + 01x(04536)

=-004 + max 08444 07898 05456 09281 = 08881

U=08272

R=-004

U=+1R=+1

U= 04536

R=-004

U=-1R=-1

17

Example (cont)Continue to next iterationhellip

Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough

U= 08881

R=-004

U=+1R=+1

U=-05676

R=-004

U=-1R=-1

18

ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence

RMS error ndashroot mean square error of the utility value compare to the correct values

demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of

any state in an iteration

| | 2

1

1( ( ) ( ))

| |

S

iRMS U i U i

S

19

ldquoclose enoughrdquo (cont)

Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ

When ||U|| = maxa |U(s)|

ε ndash maximum error allowed in utility of any state in an iteration

γ ndash the discount factor

20

Finding the policy

True utilities have foundedNew search for the optimal policy

For each s in S doπ[s] argmaxa sumsrsquo T(s a

srsquo)U(srsquo)Return π

21

Example (cont)Find the optimal police

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)

09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)

= argmaxa 06672 05996 04108 -06512 = Up

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)

= argmaxa 08993 08561 06429 09456 = Right

U= 08881R=-004

U=+1R=+1

U=-05676R=-004

U=-1R=-1

22

+1

-1

0705 0655 0611 0388

0762 0660

091208680812+1

-1

+1

-1

+1

-1

3 Extract optimal policy 4 Execute actions

1 The given environment 2 Calculate utilities

Summery ndash value iteration

23

Example - convergence Error allowed

24

Policy iteration

picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize

25

Policy iterationFor each state in each step

Policy evaluationGiven policy πi

Calculate the utility Ui of each state if π were to be execute

Policy improvementCalculate new policy πi+1

Based on πi

π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)

26

Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy

input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R

π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π

27

ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)

R=-004

R=+1

R=-004

R=-1

28

Example (cont)First iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 03778U(12) = 06U(21) = -1U(22) = 1

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1Policy

Π(11) = Up

Π(12) = Up09 08 01 0 (11) 004

0 01 0 01 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

29

Example (cont)First iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)

09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)

= argmaxa 04178 04 024 -07022 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)

= argmaxa 064 05778 04622 08978 = Right update

U= 06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Up

30

Example (cont)Second iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1

U=06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right09 08 01 0 (11) 004

01 09 0 08 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

31

Example (cont)Second iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)

09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)

= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)

= argmaxa 08059 076 06115 09326 = Right donrsquot have to update

U=07843R=-004

U=+1R=+1

U=05413R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right

32

Example (cont)

No change in the policy has found finish

The optimal policyπ(11) = Upπ(12) = Right

Policy iteration must terminate since policyrsquos number is finite

33

Simplify Policy iteration

Can focus of subset of stateFind utility by simplified value iteration

Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))

ORPolicy Improvement

Guaranteed to converge under certain conditions on initial polity and utility values

34

Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal

35

Value vs Policy Iteration

Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations

Page 8: Markov Decision Process (MDP)

8

Bellman Equationproperties

U(s) = R(s) + γ maxa sum (T(s a srsquo) U(srsquo))n equations One for each stepn vaiables

ProblemOperator max is not a linear operatorNon-linear equations

SolutionIterative approach

9

Value iteration algorithm

Initial arbitrary values for utilities Update utility of each state from itrsquos neighbors

Ui+1(s) R(s) + γ maxa sum (T(s a srsquo) Ui(srsquo))

Iteration step called Bellman update

Repeat till converges

10

Value Iteration properties

This equilibrium is a unique solution Can prove that the value iteration process convergesDonrsquot need exact values

11

convergence Value iteration is contraction

Function of one argumentWhen applied on to inputs produces value that are ldquocloser togetherHave only one fixed pointWhen applied the value must be closer to fixed point

Wersquoll not going to prove last pointconverge to correct value

12

Value Iteration Algorithm

function VALUE_ITERATION (mdp) returns a utility functioninput mdp MDP with states S transition model T

reward function R discount γ local variables U Ursquo vectors of utilities for states in S

initially identical to R

repeat U Ursquo for each state s in S do

Ursquo[s] R[s] + γmaxa srsquo T(s a srsquo)U[srsquo]until close-enough(UUrsquo)return U

13

ExampleSmall version of our main example2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the board

R=-004 R=+1

R=-004 R=-1

14

Example (cont)First iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(-004) + 01x(-004) + 01x(-1)09x(-004) + 01x(-004)09x(-004) + 01x(-1)08x(-1) + 01x(-004) + 01x(-004)

=-004 + max -0136 -004 -0136 -0808 =-008

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(-004)+ 01x109x(-004) + 01x(-004)08x(-004) + 01x1 + 01x(-004)08x1 + 01x(-004) + 01x(-004)

=-004 + max 0064 -004 0064 0792 =0752

Goal states remain the same

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1

15

Example (cont)Second iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(0752) + 01x(-008) + 01x(-1)09x(-008) + 01x(0752)09x(-008) + 01x(-1)08x(-1) + 01x(-008) + 01x(0752)

=-004 + max 04936 00032 -0172 -03728 =04536

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(0752)+ 01x109x(0752) + 01x(-008)08x(-008) + 01x1 + 01x(0752)08x1 + 01x(0752) + 01x(-008)

=-004 + max 07768 06688 01112 08672 = 08272

U=0752

R=-004

U=+1R=+1

U=-008R=-004

U=-1R=-1

16

Example (cont)Third iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(08272) + 01x(04536) + 01x(-1)09x(04536) + 01x(08272)09x(04536) + 01x(-1)08x(-1) + 01x(04536) + 01x(08272)

=-004 + max 06071 0491 03082 -06719 =05676

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(08272)+ 01x109x(08272) + 01x(04536)08x(04536) + 01x1 + 01x(08272)08x1 + 01x(08272) + 01x(04536)

=-004 + max 08444 07898 05456 09281 = 08881

U=08272

R=-004

U=+1R=+1

U= 04536

R=-004

U=-1R=-1

17

Example (cont)Continue to next iterationhellip

Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough

U= 08881

R=-004

U=+1R=+1

U=-05676

R=-004

U=-1R=-1

18

ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence

RMS error ndashroot mean square error of the utility value compare to the correct values

demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of

any state in an iteration

| | 2

1

1( ( ) ( ))

| |

S

iRMS U i U i

S

19

ldquoclose enoughrdquo (cont)

Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ

When ||U|| = maxa |U(s)|

ε ndash maximum error allowed in utility of any state in an iteration

γ ndash the discount factor

20

Finding the policy

True utilities have foundedNew search for the optimal policy

For each s in S doπ[s] argmaxa sumsrsquo T(s a

srsquo)U(srsquo)Return π

21

Example (cont)Find the optimal police

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)

09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)

= argmaxa 06672 05996 04108 -06512 = Up

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)

= argmaxa 08993 08561 06429 09456 = Right

U= 08881R=-004

U=+1R=+1

U=-05676R=-004

U=-1R=-1

22

+1

-1

0705 0655 0611 0388

0762 0660

091208680812+1

-1

+1

-1

+1

-1

3 Extract optimal policy 4 Execute actions

1 The given environment 2 Calculate utilities

Summery ndash value iteration

23

Example - convergence Error allowed

24

Policy iteration

picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize

25

Policy iterationFor each state in each step

Policy evaluationGiven policy πi

Calculate the utility Ui of each state if π were to be execute

Policy improvementCalculate new policy πi+1

Based on πi

π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)

26

Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy

input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R

π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π

27

ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)

R=-004

R=+1

R=-004

R=-1

28

Example (cont)First iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 03778U(12) = 06U(21) = -1U(22) = 1

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1Policy

Π(11) = Up

Π(12) = Up09 08 01 0 (11) 004

0 01 0 01 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

29

Example (cont)First iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)

09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)

= argmaxa 04178 04 024 -07022 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)

= argmaxa 064 05778 04622 08978 = Right update

U= 06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Up

30

Example (cont)Second iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1

U=06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right09 08 01 0 (11) 004

01 09 0 08 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

31

Example (cont)Second iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)

09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)

= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)

= argmaxa 08059 076 06115 09326 = Right donrsquot have to update

U=07843R=-004

U=+1R=+1

U=05413R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right

32

Example (cont)

No change in the policy has found finish

The optimal policyπ(11) = Upπ(12) = Right

Policy iteration must terminate since policyrsquos number is finite

33

Simplify Policy iteration

Can focus of subset of stateFind utility by simplified value iteration

Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))

ORPolicy Improvement

Guaranteed to converge under certain conditions on initial polity and utility values

34

Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal

35

Value vs Policy Iteration

Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations

Page 9: Markov Decision Process (MDP)

9

Value iteration algorithm

Initial arbitrary values for utilities Update utility of each state from itrsquos neighbors

Ui+1(s) R(s) + γ maxa sum (T(s a srsquo) Ui(srsquo))

Iteration step called Bellman update

Repeat till converges

10

Value Iteration properties

This equilibrium is a unique solution Can prove that the value iteration process convergesDonrsquot need exact values

11

convergence Value iteration is contraction

Function of one argumentWhen applied on to inputs produces value that are ldquocloser togetherHave only one fixed pointWhen applied the value must be closer to fixed point

Wersquoll not going to prove last pointconverge to correct value

12

Value Iteration Algorithm

function VALUE_ITERATION (mdp) returns a utility functioninput mdp MDP with states S transition model T

reward function R discount γ local variables U Ursquo vectors of utilities for states in S

initially identical to R

repeat U Ursquo for each state s in S do

Ursquo[s] R[s] + γmaxa srsquo T(s a srsquo)U[srsquo]until close-enough(UUrsquo)return U

13

ExampleSmall version of our main example2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the board

R=-004 R=+1

R=-004 R=-1

14

Example (cont)First iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(-004) + 01x(-004) + 01x(-1)09x(-004) + 01x(-004)09x(-004) + 01x(-1)08x(-1) + 01x(-004) + 01x(-004)

=-004 + max -0136 -004 -0136 -0808 =-008

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(-004)+ 01x109x(-004) + 01x(-004)08x(-004) + 01x1 + 01x(-004)08x1 + 01x(-004) + 01x(-004)

=-004 + max 0064 -004 0064 0792 =0752

Goal states remain the same

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1

15

Example (cont)Second iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(0752) + 01x(-008) + 01x(-1)09x(-008) + 01x(0752)09x(-008) + 01x(-1)08x(-1) + 01x(-008) + 01x(0752)

=-004 + max 04936 00032 -0172 -03728 =04536

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(0752)+ 01x109x(0752) + 01x(-008)08x(-008) + 01x1 + 01x(0752)08x1 + 01x(0752) + 01x(-008)

=-004 + max 07768 06688 01112 08672 = 08272

U=0752

R=-004

U=+1R=+1

U=-008R=-004

U=-1R=-1

16

Example (cont)Third iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(08272) + 01x(04536) + 01x(-1)09x(04536) + 01x(08272)09x(04536) + 01x(-1)08x(-1) + 01x(04536) + 01x(08272)

=-004 + max 06071 0491 03082 -06719 =05676

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(08272)+ 01x109x(08272) + 01x(04536)08x(04536) + 01x1 + 01x(08272)08x1 + 01x(08272) + 01x(04536)

=-004 + max 08444 07898 05456 09281 = 08881

U=08272

R=-004

U=+1R=+1

U= 04536

R=-004

U=-1R=-1

17

Example (cont)Continue to next iterationhellip

Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough

U= 08881

R=-004

U=+1R=+1

U=-05676

R=-004

U=-1R=-1

18

ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence

RMS error ndashroot mean square error of the utility value compare to the correct values

demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of

any state in an iteration

| | 2

1

1( ( ) ( ))

| |

S

iRMS U i U i

S

19

ldquoclose enoughrdquo (cont)

Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ

When ||U|| = maxa |U(s)|

ε ndash maximum error allowed in utility of any state in an iteration

γ ndash the discount factor

20

Finding the policy

True utilities have foundedNew search for the optimal policy

For each s in S doπ[s] argmaxa sumsrsquo T(s a

srsquo)U(srsquo)Return π

21

Example (cont)Find the optimal police

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)

09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)

= argmaxa 06672 05996 04108 -06512 = Up

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)

= argmaxa 08993 08561 06429 09456 = Right

U= 08881R=-004

U=+1R=+1

U=-05676R=-004

U=-1R=-1

22

+1

-1

0705 0655 0611 0388

0762 0660

091208680812+1

-1

+1

-1

+1

-1

3 Extract optimal policy 4 Execute actions

1 The given environment 2 Calculate utilities

Summery ndash value iteration

23

Example - convergence Error allowed

24

Policy iteration

picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize

25

Policy iterationFor each state in each step

Policy evaluationGiven policy πi

Calculate the utility Ui of each state if π were to be execute

Policy improvementCalculate new policy πi+1

Based on πi

π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)

26

Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy

input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R

π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π

27

ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)

R=-004

R=+1

R=-004

R=-1

28

Example (cont)First iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 03778U(12) = 06U(21) = -1U(22) = 1

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1Policy

Π(11) = Up

Π(12) = Up09 08 01 0 (11) 004

0 01 0 01 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

29

Example (cont)First iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)

09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)

= argmaxa 04178 04 024 -07022 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)

= argmaxa 064 05778 04622 08978 = Right update

U= 06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Up

30

Example (cont)Second iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1

U=06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right09 08 01 0 (11) 004

01 09 0 08 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

31

Example (cont)Second iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)

09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)

= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)

= argmaxa 08059 076 06115 09326 = Right donrsquot have to update

U=07843R=-004

U=+1R=+1

U=05413R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right

32

Example (cont)

No change in the policy has found finish

The optimal policyπ(11) = Upπ(12) = Right

Policy iteration must terminate since policyrsquos number is finite

33

Simplify Policy iteration

Can focus of subset of stateFind utility by simplified value iteration

Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))

ORPolicy Improvement

Guaranteed to converge under certain conditions on initial polity and utility values

34

Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal

35

Value vs Policy Iteration

Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations

Page 10: Markov Decision Process (MDP)

10

Value Iteration properties

This equilibrium is a unique solution Can prove that the value iteration process convergesDonrsquot need exact values

11

convergence Value iteration is contraction

Function of one argumentWhen applied on to inputs produces value that are ldquocloser togetherHave only one fixed pointWhen applied the value must be closer to fixed point

Wersquoll not going to prove last pointconverge to correct value

12

Value Iteration Algorithm

function VALUE_ITERATION (mdp) returns a utility functioninput mdp MDP with states S transition model T

reward function R discount γ local variables U Ursquo vectors of utilities for states in S

initially identical to R

repeat U Ursquo for each state s in S do

Ursquo[s] R[s] + γmaxa srsquo T(s a srsquo)U[srsquo]until close-enough(UUrsquo)return U

13

ExampleSmall version of our main example2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the board

R=-004 R=+1

R=-004 R=-1

14

Example (cont)First iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(-004) + 01x(-004) + 01x(-1)09x(-004) + 01x(-004)09x(-004) + 01x(-1)08x(-1) + 01x(-004) + 01x(-004)

=-004 + max -0136 -004 -0136 -0808 =-008

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(-004)+ 01x109x(-004) + 01x(-004)08x(-004) + 01x1 + 01x(-004)08x1 + 01x(-004) + 01x(-004)

=-004 + max 0064 -004 0064 0792 =0752

Goal states remain the same

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1

15

Example (cont)Second iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(0752) + 01x(-008) + 01x(-1)09x(-008) + 01x(0752)09x(-008) + 01x(-1)08x(-1) + 01x(-008) + 01x(0752)

=-004 + max 04936 00032 -0172 -03728 =04536

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(0752)+ 01x109x(0752) + 01x(-008)08x(-008) + 01x1 + 01x(0752)08x1 + 01x(0752) + 01x(-008)

=-004 + max 07768 06688 01112 08672 = 08272

U=0752

R=-004

U=+1R=+1

U=-008R=-004

U=-1R=-1

16

Example (cont)Third iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(08272) + 01x(04536) + 01x(-1)09x(04536) + 01x(08272)09x(04536) + 01x(-1)08x(-1) + 01x(04536) + 01x(08272)

=-004 + max 06071 0491 03082 -06719 =05676

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(08272)+ 01x109x(08272) + 01x(04536)08x(04536) + 01x1 + 01x(08272)08x1 + 01x(08272) + 01x(04536)

=-004 + max 08444 07898 05456 09281 = 08881

U=08272

R=-004

U=+1R=+1

U= 04536

R=-004

U=-1R=-1

17

Example (cont)Continue to next iterationhellip

Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough

U= 08881

R=-004

U=+1R=+1

U=-05676

R=-004

U=-1R=-1

18

ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence

RMS error ndashroot mean square error of the utility value compare to the correct values

demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of

any state in an iteration

| | 2

1

1( ( ) ( ))

| |

S

iRMS U i U i

S

19

ldquoclose enoughrdquo (cont)

Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ

When ||U|| = maxa |U(s)|

ε ndash maximum error allowed in utility of any state in an iteration

γ ndash the discount factor

20

Finding the policy

True utilities have foundedNew search for the optimal policy

For each s in S doπ[s] argmaxa sumsrsquo T(s a

srsquo)U(srsquo)Return π

21

Example (cont)Find the optimal police

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)

09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)

= argmaxa 06672 05996 04108 -06512 = Up

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)

= argmaxa 08993 08561 06429 09456 = Right

U= 08881R=-004

U=+1R=+1

U=-05676R=-004

U=-1R=-1

22

+1

-1

0705 0655 0611 0388

0762 0660

091208680812+1

-1

+1

-1

+1

-1

3 Extract optimal policy 4 Execute actions

1 The given environment 2 Calculate utilities

Summery ndash value iteration

23

Example - convergence Error allowed

24

Policy iteration

picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize

25

Policy iterationFor each state in each step

Policy evaluationGiven policy πi

Calculate the utility Ui of each state if π were to be execute

Policy improvementCalculate new policy πi+1

Based on πi

π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)

26

Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy

input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R

π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π

27

ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)

R=-004

R=+1

R=-004

R=-1

28

Example (cont)First iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 03778U(12) = 06U(21) = -1U(22) = 1

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1Policy

Π(11) = Up

Π(12) = Up09 08 01 0 (11) 004

0 01 0 01 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

29

Example (cont)First iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)

09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)

= argmaxa 04178 04 024 -07022 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)

= argmaxa 064 05778 04622 08978 = Right update

U= 06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Up

30

Example (cont)Second iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1

U=06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right09 08 01 0 (11) 004

01 09 0 08 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

31

Example (cont)Second iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)

09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)

= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)

= argmaxa 08059 076 06115 09326 = Right donrsquot have to update

U=07843R=-004

U=+1R=+1

U=05413R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right

32

Example (cont)

No change in the policy has found finish

The optimal policyπ(11) = Upπ(12) = Right

Policy iteration must terminate since policyrsquos number is finite

33

Simplify Policy iteration

Can focus of subset of stateFind utility by simplified value iteration

Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))

ORPolicy Improvement

Guaranteed to converge under certain conditions on initial polity and utility values

34

Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal

35

Value vs Policy Iteration

Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations

Page 11: Markov Decision Process (MDP)

11

convergence Value iteration is contraction

Function of one argumentWhen applied on to inputs produces value that are ldquocloser togetherHave only one fixed pointWhen applied the value must be closer to fixed point

Wersquoll not going to prove last pointconverge to correct value

12

Value Iteration Algorithm

function VALUE_ITERATION (mdp) returns a utility functioninput mdp MDP with states S transition model T

reward function R discount γ local variables U Ursquo vectors of utilities for states in S

initially identical to R

repeat U Ursquo for each state s in S do

Ursquo[s] R[s] + γmaxa srsquo T(s a srsquo)U[srsquo]until close-enough(UUrsquo)return U

13

ExampleSmall version of our main example2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the board

R=-004 R=+1

R=-004 R=-1

14

Example (cont)First iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(-004) + 01x(-004) + 01x(-1)09x(-004) + 01x(-004)09x(-004) + 01x(-1)08x(-1) + 01x(-004) + 01x(-004)

=-004 + max -0136 -004 -0136 -0808 =-008

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(-004)+ 01x109x(-004) + 01x(-004)08x(-004) + 01x1 + 01x(-004)08x1 + 01x(-004) + 01x(-004)

=-004 + max 0064 -004 0064 0792 =0752

Goal states remain the same

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1

15

Example (cont)Second iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(0752) + 01x(-008) + 01x(-1)09x(-008) + 01x(0752)09x(-008) + 01x(-1)08x(-1) + 01x(-008) + 01x(0752)

=-004 + max 04936 00032 -0172 -03728 =04536

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(0752)+ 01x109x(0752) + 01x(-008)08x(-008) + 01x1 + 01x(0752)08x1 + 01x(0752) + 01x(-008)

=-004 + max 07768 06688 01112 08672 = 08272

U=0752

R=-004

U=+1R=+1

U=-008R=-004

U=-1R=-1

16

Example (cont)Third iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(08272) + 01x(04536) + 01x(-1)09x(04536) + 01x(08272)09x(04536) + 01x(-1)08x(-1) + 01x(04536) + 01x(08272)

=-004 + max 06071 0491 03082 -06719 =05676

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(08272)+ 01x109x(08272) + 01x(04536)08x(04536) + 01x1 + 01x(08272)08x1 + 01x(08272) + 01x(04536)

=-004 + max 08444 07898 05456 09281 = 08881

U=08272

R=-004

U=+1R=+1

U= 04536

R=-004

U=-1R=-1

17

Example (cont)Continue to next iterationhellip

Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough

U= 08881

R=-004

U=+1R=+1

U=-05676

R=-004

U=-1R=-1

18

ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence

RMS error ndashroot mean square error of the utility value compare to the correct values

demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of

any state in an iteration

| | 2

1

1( ( ) ( ))

| |

S

iRMS U i U i

S

19

ldquoclose enoughrdquo (cont)

Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ

When ||U|| = maxa |U(s)|

ε ndash maximum error allowed in utility of any state in an iteration

γ ndash the discount factor

20

Finding the policy

True utilities have foundedNew search for the optimal policy

For each s in S doπ[s] argmaxa sumsrsquo T(s a

srsquo)U(srsquo)Return π

21

Example (cont)Find the optimal police

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)

09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)

= argmaxa 06672 05996 04108 -06512 = Up

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)

= argmaxa 08993 08561 06429 09456 = Right

U= 08881R=-004

U=+1R=+1

U=-05676R=-004

U=-1R=-1

22

+1

-1

0705 0655 0611 0388

0762 0660

091208680812+1

-1

+1

-1

+1

-1

3 Extract optimal policy 4 Execute actions

1 The given environment 2 Calculate utilities

Summery ndash value iteration

23

Example - convergence Error allowed

24

Policy iteration

picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize

25

Policy iterationFor each state in each step

Policy evaluationGiven policy πi

Calculate the utility Ui of each state if π were to be execute

Policy improvementCalculate new policy πi+1

Based on πi

π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)

26

Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy

input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R

π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π

27

ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)

R=-004

R=+1

R=-004

R=-1

28

Example (cont)First iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 03778U(12) = 06U(21) = -1U(22) = 1

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1Policy

Π(11) = Up

Π(12) = Up09 08 01 0 (11) 004

0 01 0 01 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

29

Example (cont.): First iteration – policy improvement

π(1,1) = argmaxa { 0.8·U(1,2) + 0.1·U(1,1) + 0.1·U(2,1)    (Up)
                   0.9·U(1,1) + 0.1·U(1,2)                  (Left)
                   0.9·U(1,1) + 0.1·U(2,1)                  (Down)
                   0.8·U(2,1) + 0.1·U(1,1) + 0.1·U(1,2) }   (Right)
        = argmaxa { 0.8×0.6 + 0.1×0.3778 + 0.1×(-1),
                    0.9×0.3778 + 0.1×0.6,
                    0.9×0.3778 + 0.1×(-1),
                    0.8×(-1) + 0.1×0.3778 + 0.1×0.6 }
        = argmaxa { 0.4178, 0.4, 0.24, -0.7022 } = Up → don't have to update

π(1,2) = argmaxa { 0.9·U(1,2) + 0.1·U(2,2)                  (Up)
                   0.9·U(1,2) + 0.1·U(1,1)                   (Left)
                   0.8·U(1,1) + 0.1·U(2,2) + 0.1·U(1,2)      (Down)
                   0.8·U(2,2) + 0.1·U(1,2) + 0.1·U(1,1) }    (Right)
        = argmaxa { 0.9×0.6 + 0.1×1,
                    0.9×0.6 + 0.1×0.3778,
                    0.8×0.3778 + 0.1×1 + 0.1×0.6,
                    0.8×1 + 0.1×0.6 + 0.1×0.3778 }
        = argmaxa { 0.64, 0.5778, 0.4622, 0.8978 } = Right → update

[Grid: (1,2) U=0.6, R=-0.04    (2,2) U=+1, R=+1
       (1,1) U=0.3778, R=-0.04    (2,1) U=-1, R=-1]

Policy: π(1,1) = Up, π(1,2) = Up

30

Example (cont.): Second iteration – policy evaluation

[Grid at the start of the iteration: (1,2) U=0.6, R=-0.04    (2,2) U=+1, R=+1
                                     (1,1) U=0.3778, R=-0.04    (2,1) U=-1, R=-1]
Current policy: π(1,1) = Up, π(1,2) = Right

U(1,1) = R(1,1) + γ × (0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1))
U(1,2) = R(1,2) + γ × (0.1U(1,2) + 0.8U(2,2) + 0.1U(1,1))
U(2,1) = R(2,1)
U(2,2) = R(2,2)

With γ = 1:
U(1,1) = -0.04 + 0.8U(1,2) + 0.1U(1,1) + 0.1U(2,1)
U(1,2) = -0.04 + 0.1U(1,2) + 0.8U(2,2) + 0.1U(1,1)
U(2,1) = -1
U(2,2) = 1

Rearranged as a linear system (again a 4x4 system A·U = b with b = (0.04, 0.04, -1, 1)):
 0.04 = -0.9U(1,1) + 0.8U(1,2) + 0.1U(2,1) + 0U(2,2)
 0.04 =  0.1U(1,1) - 0.9U(1,2) + 0U(2,1) + 0.8U(2,2)
   -1 =  0U(1,1) + 0U(1,2) + 1U(2,1) + 0U(2,2)
    1 =  0U(1,1) + 0U(1,2) + 0U(2,1) + 1U(2,2)

Solving gives:
U(1,1) = 0.5413, U(1,2) = 0.7843, U(2,1) = -1, U(2,2) = 1

31

Example (cont.): Second iteration – policy improvement

π(1,1) = argmaxa { 0.8·U(1,2) + 0.1·U(1,1) + 0.1·U(2,1)    (Up)
                   0.9·U(1,1) + 0.1·U(1,2)                  (Left)
                   0.9·U(1,1) + 0.1·U(2,1)                  (Down)
                   0.8·U(2,1) + 0.1·U(1,1) + 0.1·U(1,2) }   (Right)
        = argmaxa { 0.8×0.7843 + 0.1×0.5413 + 0.1×(-1),
                    0.9×0.5413 + 0.1×0.7843,
                    0.9×0.5413 + 0.1×(-1),
                    0.8×(-1) + 0.1×0.5413 + 0.1×0.7843 }
        = argmaxa { 0.5816, 0.5656, 0.3871, -0.6674 } = Up → don't have to update

π(1,2) = argmaxa { 0.9·U(1,2) + 0.1·U(2,2)                  (Up)
                   0.9·U(1,2) + 0.1·U(1,1)                   (Left)
                   0.8·U(1,1) + 0.1·U(2,2) + 0.1·U(1,2)      (Down)
                   0.8·U(2,2) + 0.1·U(1,2) + 0.1·U(1,1) }    (Right)
        = argmaxa { 0.9×0.7843 + 0.1×1,
                    0.9×0.7843 + 0.1×0.5413,
                    0.8×0.5413 + 0.1×1 + 0.1×0.7843,
                    0.8×1 + 0.1×0.7843 + 0.1×0.5413 }
        = argmaxa { 0.8059, 0.76, 0.6115, 0.9326 } = Right → don't have to update

[Grid: (1,2) U=0.7843, R=-0.04    (2,2) U=+1, R=+1
       (1,1) U=0.5413, R=-0.04    (2,1) U=-1, R=-1]

Policy: π(1,1) = Up, π(1,2) = Right

32

Example (cont.)

No change in the policy has been found – finish.

The optimal policy:
π(1,1) = Up
π(1,2) = Right

Policy iteration must terminate, since the number of distinct policies is finite.

33

Simplified Policy iteration

Can focus on a subset of the states. For those states, find the utility by simplified value iteration:

Ui+1(s) ← R(s) + γ Σs' T(s, π(s), s') Ui(s')

or apply Policy Improvement to them.

Guaranteed to converge under certain conditions on the initial policy and utility values.
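As a rough sketch of what focusing on a subset of the states can look like in code (same assumed data layout as the earlier snippets; the function name and subset argument are illustrative):

def partial_sweep(pi, U, subset, T, R, gamma):
    # One simplified-value-iteration sweep restricted to a chosen subset of states;
    # utilities of states outside the subset are left untouched.
    U = dict(U)
    for s in subset:
        U[s] = R[s] + gamma * sum(p * U[s2] for p, s2 in T[(s, pi[s])])
    return U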

34

Policy Iteration properties

Linear equations – easy to solve
Fast convergence in practice
Proven to converge to the optimal policy

35

Value vs Policy Iteration

Which to use?
Policy iteration is more expensive per iteration.
In practice, policy iteration requires fewer iterations.

Page 12: Markov Decision Process (MDP)

12

Value Iteration Algorithm

function VALUE_ITERATION (mdp) returns a utility functioninput mdp MDP with states S transition model T

reward function R discount γ local variables U Ursquo vectors of utilities for states in S

initially identical to R

repeat U Ursquo for each state s in S do

Ursquo[s] R[s] + γmaxa srsquo T(s a srsquo)U[srsquo]until close-enough(UUrsquo)return U

13

ExampleSmall version of our main example2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the board

R=-004 R=+1

R=-004 R=-1

14

Example (cont)First iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(-004) + 01x(-004) + 01x(-1)09x(-004) + 01x(-004)09x(-004) + 01x(-1)08x(-1) + 01x(-004) + 01x(-004)

=-004 + max -0136 -004 -0136 -0808 =-008

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(-004)+ 01x109x(-004) + 01x(-004)08x(-004) + 01x1 + 01x(-004)08x1 + 01x(-004) + 01x(-004)

=-004 + max 0064 -004 0064 0792 =0752

Goal states remain the same

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1

15

Example (cont)Second iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(0752) + 01x(-008) + 01x(-1)09x(-008) + 01x(0752)09x(-008) + 01x(-1)08x(-1) + 01x(-008) + 01x(0752)

=-004 + max 04936 00032 -0172 -03728 =04536

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(0752)+ 01x109x(0752) + 01x(-008)08x(-008) + 01x1 + 01x(0752)08x1 + 01x(0752) + 01x(-008)

=-004 + max 07768 06688 01112 08672 = 08272

U=0752

R=-004

U=+1R=+1

U=-008R=-004

U=-1R=-1

16

Example (cont)Third iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(08272) + 01x(04536) + 01x(-1)09x(04536) + 01x(08272)09x(04536) + 01x(-1)08x(-1) + 01x(04536) + 01x(08272)

=-004 + max 06071 0491 03082 -06719 =05676

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(08272)+ 01x109x(08272) + 01x(04536)08x(04536) + 01x1 + 01x(08272)08x1 + 01x(08272) + 01x(04536)

=-004 + max 08444 07898 05456 09281 = 08881

U=08272

R=-004

U=+1R=+1

U= 04536

R=-004

U=-1R=-1

17

Example (cont)Continue to next iterationhellip

Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough

U= 08881

R=-004

U=+1R=+1

U=-05676

R=-004

U=-1R=-1

18

ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence

RMS error ndashroot mean square error of the utility value compare to the correct values

demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of

any state in an iteration

| | 2

1

1( ( ) ( ))

| |

S

iRMS U i U i

S

19

ldquoclose enoughrdquo (cont)

Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ

When ||U|| = maxa |U(s)|

ε ndash maximum error allowed in utility of any state in an iteration

γ ndash the discount factor

20

Finding the policy

True utilities have foundedNew search for the optimal policy

For each s in S doπ[s] argmaxa sumsrsquo T(s a

srsquo)U(srsquo)Return π

21

Example (cont)Find the optimal police

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)

09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)

= argmaxa 06672 05996 04108 -06512 = Up

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)

= argmaxa 08993 08561 06429 09456 = Right

U= 08881R=-004

U=+1R=+1

U=-05676R=-004

U=-1R=-1

22

+1

-1

0705 0655 0611 0388

0762 0660

091208680812+1

-1

+1

-1

+1

-1

3 Extract optimal policy 4 Execute actions

1 The given environment 2 Calculate utilities

Summery ndash value iteration

23

Example - convergence Error allowed

24

Policy iteration

picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize

25

Policy iterationFor each state in each step

Policy evaluationGiven policy πi

Calculate the utility Ui of each state if π were to be execute

Policy improvementCalculate new policy πi+1

Based on πi

π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)

26

Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy

input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R

π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π

27

ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)

R=-004

R=+1

R=-004

R=-1

28

Example (cont)First iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 03778U(12) = 06U(21) = -1U(22) = 1

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1Policy

Π(11) = Up

Π(12) = Up09 08 01 0 (11) 004

0 01 0 01 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

29

Example (cont)First iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)

09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)

= argmaxa 04178 04 024 -07022 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)

= argmaxa 064 05778 04622 08978 = Right update

U= 06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Up

30

Example (cont)Second iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1

U=06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right09 08 01 0 (11) 004

01 09 0 08 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

31

Example (cont)Second iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)

09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)

= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)

= argmaxa 08059 076 06115 09326 = Right donrsquot have to update

U=07843R=-004

U=+1R=+1

U=05413R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right

32

Example (cont)

No change in the policy has found finish

The optimal policyπ(11) = Upπ(12) = Right

Policy iteration must terminate since policyrsquos number is finite

33

Simplify Policy iteration

Can focus of subset of stateFind utility by simplified value iteration

Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))

ORPolicy Improvement

Guaranteed to converge under certain conditions on initial polity and utility values

34

Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal

35

Value vs Policy Iteration

Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations

Page 13: Markov Decision Process (MDP)

13

ExampleSmall version of our main example2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the board

R=-004 R=+1

R=-004 R=-1

14

Example (cont)First iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(-004) + 01x(-004) + 01x(-1)09x(-004) + 01x(-004)09x(-004) + 01x(-1)08x(-1) + 01x(-004) + 01x(-004)

=-004 + max -0136 -004 -0136 -0808 =-008

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(-004)+ 01x109x(-004) + 01x(-004)08x(-004) + 01x1 + 01x(-004)08x1 + 01x(-004) + 01x(-004)

=-004 + max 0064 -004 0064 0792 =0752

Goal states remain the same

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1

15

Example (cont)Second iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(0752) + 01x(-008) + 01x(-1)09x(-008) + 01x(0752)09x(-008) + 01x(-1)08x(-1) + 01x(-008) + 01x(0752)

=-004 + max 04936 00032 -0172 -03728 =04536

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(0752)+ 01x109x(0752) + 01x(-008)08x(-008) + 01x1 + 01x(0752)08x1 + 01x(0752) + 01x(-008)

=-004 + max 07768 06688 01112 08672 = 08272

U=0752

R=-004

U=+1R=+1

U=-008R=-004

U=-1R=-1

16

Example (cont)Third iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(08272) + 01x(04536) + 01x(-1)09x(04536) + 01x(08272)09x(04536) + 01x(-1)08x(-1) + 01x(04536) + 01x(08272)

=-004 + max 06071 0491 03082 -06719 =05676

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(08272)+ 01x109x(08272) + 01x(04536)08x(04536) + 01x1 + 01x(08272)08x1 + 01x(08272) + 01x(04536)

=-004 + max 08444 07898 05456 09281 = 08881

U=08272

R=-004

U=+1R=+1

U= 04536

R=-004

U=-1R=-1

17

Example (cont)Continue to next iterationhellip

Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough

U= 08881

R=-004

U=+1R=+1

U=-05676

R=-004

U=-1R=-1

18

ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence

RMS error ndashroot mean square error of the utility value compare to the correct values

demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of

any state in an iteration

| | 2

1

1( ( ) ( ))

| |

S

iRMS U i U i

S

19

ldquoclose enoughrdquo (cont)

Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ

When ||U|| = maxa |U(s)|

ε ndash maximum error allowed in utility of any state in an iteration

γ ndash the discount factor

20

Finding the policy

True utilities have foundedNew search for the optimal policy

For each s in S doπ[s] argmaxa sumsrsquo T(s a

srsquo)U(srsquo)Return π

21

Example (cont)Find the optimal police

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)

09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)

= argmaxa 06672 05996 04108 -06512 = Up

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)

= argmaxa 08993 08561 06429 09456 = Right

U= 08881R=-004

U=+1R=+1

U=-05676R=-004

U=-1R=-1

22

+1

-1

0705 0655 0611 0388

0762 0660

091208680812+1

-1

+1

-1

+1

-1

3 Extract optimal policy 4 Execute actions

1 The given environment 2 Calculate utilities

Summery ndash value iteration

23

Example - convergence Error allowed

24

Policy iteration

picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize

25

Policy iterationFor each state in each step

Policy evaluationGiven policy πi

Calculate the utility Ui of each state if π were to be execute

Policy improvementCalculate new policy πi+1

Based on πi

π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)

26

Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy

input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R

π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π

27

ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)

R=-004

R=+1

R=-004

R=-1

28

Example (cont)First iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 03778U(12) = 06U(21) = -1U(22) = 1

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1Policy

Π(11) = Up

Π(12) = Up09 08 01 0 (11) 004

0 01 0 01 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

29

Example (cont)First iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)

09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)

= argmaxa 04178 04 024 -07022 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)

= argmaxa 064 05778 04622 08978 = Right update

U= 06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Up

30

Example (cont)Second iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1

U=06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right09 08 01 0 (11) 004

01 09 0 08 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

31

Example (cont)Second iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)

09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)

= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)

= argmaxa 08059 076 06115 09326 = Right donrsquot have to update

U=07843R=-004

U=+1R=+1

U=05413R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right

32

Example (cont)

No change in the policy has found finish

The optimal policyπ(11) = Upπ(12) = Right

Policy iteration must terminate since policyrsquos number is finite

33

Simplify Policy iteration

Can focus of subset of stateFind utility by simplified value iteration

Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))

ORPolicy Improvement

Guaranteed to converge under certain conditions on initial polity and utility values

34

Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal

35

Value vs Policy Iteration

Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations

Page 14: Markov Decision Process (MDP)

14

Example (cont)First iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(-004) + 01x(-004) + 01x(-1)09x(-004) + 01x(-004)09x(-004) + 01x(-1)08x(-1) + 01x(-004) + 01x(-004)

=-004 + max -0136 -004 -0136 -0808 =-008

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(-004)+ 01x109x(-004) + 01x(-004)08x(-004) + 01x1 + 01x(-004)08x1 + 01x(-004) + 01x(-004)

=-004 + max 0064 -004 0064 0792 =0752

Goal states remain the same

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1

15

Example (cont)Second iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(0752) + 01x(-008) + 01x(-1)09x(-008) + 01x(0752)09x(-008) + 01x(-1)08x(-1) + 01x(-008) + 01x(0752)

=-004 + max 04936 00032 -0172 -03728 =04536

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(0752)+ 01x109x(0752) + 01x(-008)08x(-008) + 01x1 + 01x(0752)08x1 + 01x(0752) + 01x(-008)

=-004 + max 07768 06688 01112 08672 = 08272

U=0752

R=-004

U=+1R=+1

U=-008R=-004

U=-1R=-1

16

Example (cont)Third iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(08272) + 01x(04536) + 01x(-1)09x(04536) + 01x(08272)09x(04536) + 01x(-1)08x(-1) + 01x(04536) + 01x(08272)

=-004 + max 06071 0491 03082 -06719 =05676

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(08272)+ 01x109x(08272) + 01x(04536)08x(04536) + 01x1 + 01x(08272)08x1 + 01x(08272) + 01x(04536)

=-004 + max 08444 07898 05456 09281 = 08881

U=08272

R=-004

U=+1R=+1

U= 04536

R=-004

U=-1R=-1

17

Example (cont)Continue to next iterationhellip

Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough

U= 08881

R=-004

U=+1R=+1

U=-05676

R=-004

U=-1R=-1

18

ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence

RMS error ndashroot mean square error of the utility value compare to the correct values

demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of

any state in an iteration

| | 2

1

1( ( ) ( ))

| |

S

iRMS U i U i

S

19

ldquoclose enoughrdquo (cont)

Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ

When ||U|| = maxa |U(s)|

ε ndash maximum error allowed in utility of any state in an iteration

γ ndash the discount factor

20

Finding the policy

True utilities have foundedNew search for the optimal policy

For each s in S doπ[s] argmaxa sumsrsquo T(s a

srsquo)U(srsquo)Return π

21

Example (cont)Find the optimal police

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)

09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)

= argmaxa 06672 05996 04108 -06512 = Up

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)

= argmaxa 08993 08561 06429 09456 = Right

U= 08881R=-004

U=+1R=+1

U=-05676R=-004

U=-1R=-1

22

+1

-1

0705 0655 0611 0388

0762 0660

091208680812+1

-1

+1

-1

+1

-1

3 Extract optimal policy 4 Execute actions

1 The given environment 2 Calculate utilities

Summery ndash value iteration

23

Example - convergence Error allowed

24

Policy iteration

picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize

25

Policy iterationFor each state in each step

Policy evaluationGiven policy πi

Calculate the utility Ui of each state if π were to be execute

Policy improvementCalculate new policy πi+1

Based on πi

π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)

26

Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy

input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R

π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π

27

ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)

R=-004

R=+1

R=-004

R=-1

28

Example (cont)First iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 03778U(12) = 06U(21) = -1U(22) = 1

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1Policy

Π(11) = Up

Π(12) = Up09 08 01 0 (11) 004

0 01 0 01 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

29

Example (cont)First iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)

09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)

= argmaxa 04178 04 024 -07022 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)

= argmaxa 064 05778 04622 08978 = Right update

U= 06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Up

30

Example (cont)Second iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1

U=06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right09 08 01 0 (11) 004

01 09 0 08 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

31

Example (cont)Second iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)

09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)

= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)

= argmaxa 08059 076 06115 09326 = Right donrsquot have to update

U=07843R=-004

U=+1R=+1

U=05413R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right

32

Example (cont)

No change in the policy has found finish

The optimal policyπ(11) = Upπ(12) = Right

Policy iteration must terminate since policyrsquos number is finite

33

Simplify Policy iteration

Can focus of subset of stateFind utility by simplified value iteration

Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))

ORPolicy Improvement

Guaranteed to converge under certain conditions on initial polity and utility values

34

Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal

35

Value vs Policy Iteration

Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations

Page 15: Markov Decision Process (MDP)

15

Example (cont)Second iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(0752) + 01x(-008) + 01x(-1)09x(-008) + 01x(0752)09x(-008) + 01x(-1)08x(-1) + 01x(-008) + 01x(0752)

=-004 + max 04936 00032 -0172 -03728 =04536

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(0752)+ 01x109x(0752) + 01x(-008)08x(-008) + 01x1 + 01x(0752)08x1 + 01x(0752) + 01x(-008)

=-004 + max 07768 06688 01112 08672 = 08272

U=0752

R=-004

U=+1R=+1

U=-008R=-004

U=-1R=-1

16

Example (cont)Third iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(08272) + 01x(04536) + 01x(-1)09x(04536) + 01x(08272)09x(04536) + 01x(-1)08x(-1) + 01x(04536) + 01x(08272)

=-004 + max 06071 0491 03082 -06719 =05676

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(08272)+ 01x109x(08272) + 01x(04536)08x(04536) + 01x1 + 01x(08272)08x1 + 01x(08272) + 01x(04536)

=-004 + max 08444 07898 05456 09281 = 08881

U=08272

R=-004

U=+1R=+1

U= 04536

R=-004

U=-1R=-1

17

Example (cont)Continue to next iterationhellip

Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough

U= 08881

R=-004

U=+1R=+1

U=-05676

R=-004

U=-1R=-1

18

ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence

RMS error ndashroot mean square error of the utility value compare to the correct values

demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of

any state in an iteration

| | 2

1

1( ( ) ( ))

| |

S

iRMS U i U i

S

19

ldquoclose enoughrdquo (cont)

Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ

When ||U|| = maxa |U(s)|

ε ndash maximum error allowed in utility of any state in an iteration

γ ndash the discount factor

20

Finding the policy

True utilities have foundedNew search for the optimal policy

For each s in S doπ[s] argmaxa sumsrsquo T(s a

srsquo)U(srsquo)Return π

21

Example (cont)Find the optimal police

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)

09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)

= argmaxa 06672 05996 04108 -06512 = Up

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)

= argmaxa 08993 08561 06429 09456 = Right

U= 08881R=-004

U=+1R=+1

U=-05676R=-004

U=-1R=-1

22

+1

-1

0705 0655 0611 0388

0762 0660

091208680812+1

-1

+1

-1

+1

-1

3 Extract optimal policy 4 Execute actions

1 The given environment 2 Calculate utilities

Summery ndash value iteration

23

Example - convergence Error allowed

24

Policy iteration

picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize

25

Policy iterationFor each state in each step

Policy evaluationGiven policy πi

Calculate the utility Ui of each state if π were to be execute

Policy improvementCalculate new policy πi+1

Based on πi

π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)

26

Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy

input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R

π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π

27

ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)

R=-004

R=+1

R=-004

R=-1

28

Example (cont)First iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 03778U(12) = 06U(21) = -1U(22) = 1

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1Policy

Π(11) = Up

Π(12) = Up09 08 01 0 (11) 004

0 01 0 01 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

29

Example (cont)First iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)

09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)

= argmaxa 04178 04 024 -07022 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)

= argmaxa 064 05778 04622 08978 = Right update

U= 06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Up

30

Example (cont)Second iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1

U=06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right09 08 01 0 (11) 004

01 09 0 08 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

31

Example (cont)Second iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)

09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)

= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)

= argmaxa 08059 076 06115 09326 = Right donrsquot have to update

U=07843R=-004

U=+1R=+1

U=05413R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right

32

Example (cont)

No change in the policy has found finish

The optimal policyπ(11) = Upπ(12) = Right

Policy iteration must terminate since policyrsquos number is finite

33

Simplify Policy iteration

Can focus of subset of stateFind utility by simplified value iteration

Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))

ORPolicy Improvement

Guaranteed to converge under certain conditions on initial polity and utility values

34

Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal

35

Value vs Policy Iteration

Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations

Page 16: Markov Decision Process (MDP)

16

Example (cont)Third iteration

U(11) = R(11) + γ max 08U(12) + 01U(11) + 01U(21)09U(11) + 01U(12)09U(11) + 01U(21)08U(21) + 01U(11) + 01U(12)

= -004 + 1 x max 08x(08272) + 01x(04536) + 01x(-1)09x(04536) + 01x(08272)09x(04536) + 01x(-1)08x(-1) + 01x(04536) + 01x(08272)

=-004 + max 06071 0491 03082 -06719 =05676

U(12) = R(12) + γ max 09U(12) + 01U(22)09U(12) + 01U(11)08U(11) + 01U(22) + 01U(12)08U(22) + 01U(12) + 01U(11)

= -004 + 1 x max 09x(08272)+ 01x109x(08272) + 01x(04536)08x(04536) + 01x1 + 01x(08272)08x1 + 01x(08272) + 01x(04536)

=-004 + max 08444 07898 05456 09281 = 08881

U=08272

R=-004

U=+1R=+1

U= 04536

R=-004

U=-1R=-1

17

Example (cont)Continue to next iterationhellip

Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough

U= 08881

R=-004

U=+1R=+1

U=-05676

R=-004

U=-1R=-1

18

ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence

RMS error ndashroot mean square error of the utility value compare to the correct values

demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of

any state in an iteration

| | 2

1

1( ( ) ( ))

| |

S

iRMS U i U i

S

19

ldquoclose enoughrdquo (cont)

Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ

When ||U|| = maxa |U(s)|

ε ndash maximum error allowed in utility of any state in an iteration

γ ndash the discount factor

20

Finding the policy

True utilities have foundedNew search for the optimal policy

For each s in S doπ[s] argmaxa sumsrsquo T(s a

srsquo)U(srsquo)Return π

21

Example (cont)Find the optimal police

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)

09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)

= argmaxa 06672 05996 04108 -06512 = Up

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)

= argmaxa 08993 08561 06429 09456 = Right

U= 08881R=-004

U=+1R=+1

U=-05676R=-004

U=-1R=-1

22

+1

-1

0705 0655 0611 0388

0762 0660

091208680812+1

-1

+1

-1

+1

-1

3 Extract optimal policy 4 Execute actions

1 The given environment 2 Calculate utilities

Summery ndash value iteration

23

Example - convergence Error allowed

24

Policy iteration

picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize

25

Policy iterationFor each state in each step

Policy evaluationGiven policy πi

Calculate the utility Ui of each state if π were to be execute

Policy improvementCalculate new policy πi+1

Based on πi

π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)

26

Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy

input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R

π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π

27

ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)

R=-004

R=+1

R=-004

R=-1

28

Example (cont)First iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 03778U(12) = 06U(21) = -1U(22) = 1

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1Policy

Π(11) = Up

Π(12) = Up09 08 01 0 (11) 004

0 01 0 01 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

29

Example (cont)First iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)

09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)

= argmaxa 04178 04 024 -07022 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)

= argmaxa 064 05778 04622 08978 = Right update

U= 06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Up

30

Example (cont)Second iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1

U=06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right09 08 01 0 (11) 004

01 09 0 08 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

31

Example (cont)Second iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)

09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)

= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)

= argmaxa 08059 076 06115 09326 = Right donrsquot have to update

U=07843R=-004

U=+1R=+1

U=05413R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right

32

Example (cont)

No change in the policy has found finish

The optimal policyπ(11) = Upπ(12) = Right

Policy iteration must terminate since policyrsquos number is finite

33

Simplify Policy iteration

Can focus of subset of stateFind utility by simplified value iteration

Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))

ORPolicy Improvement

Guaranteed to converge under certain conditions on initial polity and utility values

34

Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal

35

Value vs Policy Iteration

Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations

Page 17: Markov Decision Process (MDP)

17

Example (cont)Continue to next iterationhellip

Finish if ldquoclose enoughrdquoHere last change was 0114 ndash close enough

U= 08881

R=-004

U=+1R=+1

U=-05676

R=-004

U=-1R=-1

18

ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence

RMS error ndashroot mean square error of the utility value compare to the correct values

demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of

any state in an iteration

| | 2

1

1( ( ) ( ))

| |

S

iRMS U i U i

S

19

ldquoclose enoughrdquo (cont)

Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ

When ||U|| = maxa |U(s)|

ε ndash maximum error allowed in utility of any state in an iteration

γ ndash the discount factor

20

Finding the policy

True utilities have foundedNew search for the optimal policy

For each s in S doπ[s] argmaxa sumsrsquo T(s a

srsquo)U(srsquo)Return π

21

Example (cont)Find the optimal police

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)

09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)

= argmaxa 06672 05996 04108 -06512 = Up

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)

= argmaxa 08993 08561 06429 09456 = Right

U= 08881R=-004

U=+1R=+1

U=-05676R=-004

U=-1R=-1

22

+1

-1

0705 0655 0611 0388

0762 0660

091208680812+1

-1

+1

-1

+1

-1

3 Extract optimal policy 4 Execute actions

1 The given environment 2 Calculate utilities

Summery ndash value iteration

23

Example - convergence Error allowed

24

Policy iteration

picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize

25

Policy iterationFor each state in each step

Policy evaluationGiven policy πi

Calculate the utility Ui of each state if π were to be execute

Policy improvementCalculate new policy πi+1

Based on πi

π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)

26

Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy

input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R

π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π

27

ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)

R=-004

R=+1

R=-004

R=-1

28

Example (cont)First iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 03778U(12) = 06U(21) = -1U(22) = 1

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1Policy

Π(11) = Up

Π(12) = Up09 08 01 0 (11) 004

0 01 0 01 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

29

Example (cont)First iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)

09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)

= argmaxa 04178 04 024 -07022 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)

= argmaxa 064 05778 04622 08978 = Right update

U= 06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Up

30

Example (cont)Second iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1

U=06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right09 08 01 0 (11) 004

01 09 0 08 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

31

Example (cont)Second iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(07843) + 01x(05413) + 01x(-1)

09x(05413) + 01x(07843) 09x(05413) + 01x(-1) 08x(-1) + 01x(05413) + 01x(07843)

= argmaxa 05816 05656 03871 -06674 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(07843)+ 01x1 09x(07843) + 01x(05413) 08x(05413) + 01x1 + 01x(07843) 08x1 + 01x(07843) + 01x(05413)

= argmaxa 08059 076 06115 09326 = Right donrsquot have to update

U=07843R=-004

U=+1R=+1

U=05413R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right

32

Example (cont)

No change in the policy has found finish

The optimal policyπ(11) = Upπ(12) = Right

Policy iteration must terminate since policyrsquos number is finite

33

Simplify Policy iteration

Can focus of subset of stateFind utility by simplified value iteration

Ui+1(s) = R(s) + γ sumsrsquo (T(s π(s) srsquo) Ui(srsquo))

ORPolicy Improvement

Guaranteed to converge under certain conditions on initial polity and utility values

34

Policy Iteration propertiesLinear equation ndash easy to solve Fast convergence in practice Proved to be optimal

35

Value vs Policy Iteration

Which to usePolicy iteration in more expensive per iterationIn practice Policy iteration require fewer iterations

Page 18: Markov Decision Process (MDP)

18

ldquoclose enough rdquoWe will not go down deeply to this issueDifferent possibilities to detect convergence

RMS error ndashroot mean square error of the utility value compare to the correct values

demand of RMS(U Ursquo) lt ε when ε ndash maximum error allowed in utility of

any state in an iteration

| | 2

1

1( ( ) ( ))

| |

S

iRMS U i U i

S

19

ldquoclose enoughrdquo (cont)

Policy Loss difference between the expected utility using the policy to the expected utility obtain by the optimal policy || Ui+1 ndash Ui || lt ε (1-γ) γ

When ||U|| = maxa |U(s)|

ε ndash maximum error allowed in utility of any state in an iteration

γ ndash the discount factor

20

Finding the policy

True utilities have foundedNew search for the optimal policy

For each s in S doπ[s] argmaxa sumsrsquo T(s a

srsquo)U(srsquo)Return π

21

Example (cont)Find the optimal police

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(08881) + 01x(05676) + 01x(-1)

09x(05676) + 01x(08881) 09x(05676) + 01x(-1) 08x(-1) + 01x(05676) + 01x(08881)

= argmaxa 06672 05996 04108 -06512 = Up

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(08881)+ 01x1 09x(08881) + 01x(05676) 08x(05676) + 01x1 + 01x(08881) 08x1 + 01x(08881) + 01x(05676)

= argmaxa 08993 08561 06429 09456 = Right

U= 08881R=-004

U=+1R=+1

U=-05676R=-004

U=-1R=-1

22

+1

-1

0705 0655 0611 0388

0762 0660

091208680812+1

-1

+1

-1

+1

-1

3 Extract optimal policy 4 Execute actions

1 The given environment 2 Calculate utilities

Summery ndash value iteration

23

Example - convergence Error allowed

24

Policy iteration

picking a policy then calculating the utility of each state given that policy (value iteration step)Update the policy at each state using the utilities of the successor statesRepeat until the policy stabilize

25

Policy iterationFor each state in each step

Policy evaluationGiven policy πi

Calculate the utility Ui of each state if π were to be execute

Policy improvementCalculate new policy πi+1

Based on πi

π i+1[s] argmaxa sumsrsquo T(s a srsquo)U(srsquo)

26

Policy iteration Algorithmfunction POLICY_ITERATION (mdp) returns a policy

input mdp an MDP with states S transition model T local variables U Ursquo vectors of utilities for states in S initially identical to R

π a policy vector indexed by states initially random repeat U Policy-Evaluation(πUmdp) unchanged true for each state s in S do if maxa srsquo T(s a srsquo) U[srsquo] gt srsquoT(s π[s] srsquo) U[srsquo] then π[s] argmaxa srsquo T(s a srsquo) U[j] unchanged false end until unchangedreturn π

27

ExampleBack to our last examplehellip2x2 worldThe agent is placed in (11)States (21) (22) are goal statesIf blocked by the wall ndash stay in placeThe reward are written in the boardInitial policy Up (for every step)

R=-004

R=+1

R=-004

R=-1

28

Example (cont)First iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (09U(12) + 01U(22))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 09U(12) + 01U(22)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 0U(11) ndash 01U(12) + 0U(21) + 01U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 03778U(12) = 06U(21) = -1U(22) = 1

U=-004R=-

004

U=+1R=+1

U=-004R=-

004

U=-1R=-1Policy

Π(11) = Up

Π(12) = Up09 08 01 0 (11) 004

0 01 0 01 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

29

Example (cont)First iteration ndash policy improvement

Π(11) = argmaxa 08U(12) + 01U(11) + 01U(21) Up 09U(11) + 01U(12) Left 09U(11) + 01U(21) Down

08U(21) + 01U(11) + 01U(12) Right = argmaxa 08x(06) + 01x(03778) + 01x(-1)

09x(03778) + 01x(06) 09x(03778) + 01x(-1) 08x(-1) + 01x(03778) + 01x(06)

= argmaxa 04178 04 024 -07022 = Up donrsquot have to update

Π(12) = argmaxa 09U(12) + 01U(22) Up 09U(12) + 01U(11) Left

08U(11) + 01U(22) + 01U(12) Down 08U(22) + 01U(12) + 01U(11) Right

= argmaxa 09x(06)+ 01x1 09x(06) + 01x(03778) 08x(03778) + 01x1 + 01x(06) 08x1 + 01x(06) + 01x(03778)

= argmaxa 064 05778 04622 08978 = Right update

U= 06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Up

30

Example (cont)Second iteration ndash policy evaluation

U(11) = R(11) + γ x (08U(12) + 01U(11) + 01U(21))U(12) = R(12) + γ x (01U(12) + 08U(22) + 01U(11))U(21) = R(21)U(22) = R(22)U(11) = -004 + 08U(12) + 01U(11) + 01U(21)U(12) = -004 + 01U(12) + 08U(22) + 01U(11)U(21) = -1U(22) = 1004 = -09U(11) + 08U(12) + 01U(21) + 0U(22)004 = 01U(11) ndash 09U(12) + 0U(21) + 08U(22)-1 = 0U(11) + 0U(12) - 1U(21) + 0U(22) 1 = 0U(11) + 0U(12) - 0U(21) + 1U(22)

U(11) = 05413U(12) = 07843U(21) = -1U(22) = 1

U=06R=-004

U=+1R=+1

U=03778R=-004

U=-1R=-1

Policy

Π(11) = Up

Π(12) = Right09 08 01 0 (11) 004

01 09 0 08 (12) 004

0 0 1 0 (21) 1

0 0 0 1 (22) 1

U

U

U

U

31

Example (cont)Second iteration ndash policy improvement

Π(1,1) = argmaxa { 0.8 U(1,2) + 0.1 U(1,1) + 0.1 U(2,1)   (Up)
                   0.9 U(1,1) + 0.1 U(1,2)                 (Left)
                   0.9 U(1,1) + 0.1 U(2,1)                 (Down)
                   0.8 U(2,1) + 0.1 U(1,1) + 0.1 U(1,2) }  (Right)
        = argmaxa { 0.8×0.7843 + 0.1×0.5413 + 0.1×(-1),
                    0.9×0.5413 + 0.1×0.7843,
                    0.9×0.5413 + 0.1×(-1),
                    0.8×(-1) + 0.1×0.5413 + 0.1×0.7843 }
        = argmaxa { 0.5816, 0.5656, 0.3871, -0.6674 } = Up → don't have to update

Π(1,2) = argmaxa { 0.9 U(1,2) + 0.1 U(2,2)                 (Up)
                   0.9 U(1,2) + 0.1 U(1,1)                 (Left)
                   0.8 U(1,1) + 0.1 U(2,2) + 0.1 U(1,2)    (Down)
                   0.8 U(2,2) + 0.1 U(1,2) + 0.1 U(1,1) }  (Right)
        = argmaxa { 0.9×0.7843 + 0.1×1,
                    0.9×0.7843 + 0.1×0.5413,
                    0.8×0.5413 + 0.1×1 + 0.1×0.7843,
                    0.8×1 + 0.1×0.7843 + 0.1×0.5413 }
        = argmaxa { 0.8059, 0.76, 0.6115, 0.9326 } = Right → don't have to update

[Figure: the board after the second evaluation – (1,2): U = 0.7843; (2,2): U = +1; (1,1): U = 0.5413; (2,1): U = -1 – with the policy Π(1,1) = Up, Π(1,2) = Right.]

32

Example (cont)

No change in the policy has been found, so we are finished.

The optimal policy:
π(1,1) = Up
π(1,2) = Right

Policy iteration must terminate, since the number of distinct policies is finite.

33

Simplified Policy iteration

Can focus on a subset of states.
Find utilities by simplified value iteration:

Ui+1(s) = R(s) + γ Σs' T(s, π(s), s') Ui(s')

OR apply the policy-improvement step to that subset.

Guaranteed to converge under certain conditions on the initial policy and utility values.
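A sketch of this simplified evaluation step, in the same dictionary layout as the earlier policy_iteration sketch: it runs a fixed number of Bellman backups under the frozen policy instead of solving the linear system exactly (the function name and the sweep count k are illustrative).

```python
def iterative_policy_evaluation(policy, U, mdp, k=20):
    """Approximate U^pi with k sweeps of U_{i+1}(s) = R(s) + gamma * sum_s' T(s, pi(s), s') U_i(s')."""
    U = dict(U)
    for _ in range(k):
        new_U = dict(U)
        for s in mdp["states"]:
            if s in mdp["terminals"]:
                new_U[s] = mdp["R"][s]      # terminal utility stays at its reward
                continue
            new_U[s] = mdp["R"][s] + mdp["gamma"] * sum(
                p * U[s2] for p, s2 in mdp["T"][(s, policy[s])])
        U = new_U
    return U
```

Swapping this in for the exact policy_evaluation above gives an approximate (modified) policy iteration, which is where the caveat about the initial policy and utility values comes in.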

34

Policy Iteration properties
Linear equations – easy to solve
Fast convergence in practice
Proved to converge to an optimal policy

35

Value vs Policy Iteration

Which to use?
Policy iteration is more expensive per iteration.
In practice, policy iteration requires fewer iterations.
