Markov Decision Processes
& One-stage-look-ahead
Gideon Jager
July 14, 2016
Bachelor Thesis Mathematics
Supervisor: prof. dr. R. Nunez Queija
Korteweg-de Vries Instituut voor Wiskunde
Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Universiteit van Amsterdam
Abstract
One must make decisions many times in one’s life. Certain decisions work out better
than others in specific situations. In this thesis, we consider games where one has to make
decisions in order to maximize the probability of achieving a certain goal. We will start
with a discussion of Markov chains in discrete time and Markov decision processes. In
particular, we state the Bellman optimality condition. Then we demonstrate how this
theory works for the game Roulette. Our main goal is to use the Markov decision theory
to maximize the probability of winning the game Piglet and determine the corresponding
optimal strategies. Finally, we study a different class of decision rules, the one-stage-look-
ahead rules and apply them to Piglet. We compare the one-stage-look-ahead rule with the
results obtained from the Bellman optimality condition.
Title: Markov Decision Processes
& One-stage-look-ahead
Author: Gideon Jager, [email protected], 10565744
Supervisor: prof. dr. R. Nunez Queija
Second grader: prof. dr. M. R. H. Mandjes
Date: July 14, 2016
Korteweg-de Vries Instituut voor Wiskunde
Universiteit van Amsterdam
Science Park 904, 1098 XH Amsterdam
http://www.science.uva.nl/math
Contents
Prelude
1 Markov decision processes
  1.1 Discrete time Markov chains
  1.2 Decision rules
  1.3 Bellman's condition
2 Roulette
  2.1 Roulette as a Markov decision process
  2.2 Roulette with a finite horizon
  2.3 Roulette with an infinite horizon
3 Piglet
  3.1 The coin flipping game
  3.2 Piglet with a finite horizon
  3.3 Piglet with an infinite horizon
  3.4 A small horizon
4 One-stage-look-ahead
  4.1 One-stage-look-ahead decision rules
  4.2 One-stage-look-ahead for Piglet
  4.3 Optimality of one-stage-look-ahead
5 Conclusion
6 Popular summary (in Dutch)
A Matlab code
  A.1 Roulette
  A.2 Graphs
  A.3 Stopping criterion
  A.4 Piglet
  A.5 Various graphs for Piglet
Bibliography
Prelude
In life, people have to make decisions many times. Some decisions involve a certain risk.
Hill [1] gives the example of when to stop watching and start evacuating when a hurricane
is approaching and states that bad timing can be ruinous. So we could ask ourselves the
question: when is it optimal to stop? Other questions considering optimality appear in
game theory such as: what is the maximum probability of winning given the current state
of the process? In this thesis we will consider problems where we want to find an optimal
strategy: a strategy that maximizes the probability of achieving a certain goal. To this end,
we use the theory of Markov chains, first developed in the early twentieth century
by the Russian mathematician Andrey Markov.
Andrey Markov (1856 - 1922).
It will be our goal to find these maximum probabilities and optimal actions for the game
called Piglet. In order to do this we will begin with formulating the necessary theory of
Markov chains and extend this with the theory of Markov decision processes. In the second
chapter we show how this theory works by looking at another game, Roulette. Next, in
the third chapter we look at the main problem, the Piglet game. In the fourth chapter,
we will again consider the Piglet game but with a different type of strategy, the so-called
one-stage-look-ahead rule, and compare this to the results from the third chapter.
Gideon Jager,
Amsterdam, July 14, 2016
1 Markov decision processes
In this chapter we will discuss the theory that will be necessary in the next two chapters.
Before we can discuss Markov decision processes we need to look at some concepts of
Markov chains. We will discuss these concepts in the first section by following Norris [2].
1.1 Discrete time Markov chains
We start this paragraph with some notation, before stating a formal definition of
a Markov chain. To this end, we recall the definitions of a probability distribution and of a
stochastic matrix.
Definition 1.1 (State space and distribution). Let I be a countable set. We call I
the state space and every i ∈ I a state. λ = (λ_i : i ∈ I) is a measure on I if λ_i ∈ [0, ∞) for
all i ∈ I. We call a measure λ a distribution if ∑_{i∈I} λ_i = 1.
Definition 1.2 (Stochastic matrix). A matrix P = (p_{ij} : i, j ∈ I) is called stochastic if
(p_{ij} : j ∈ I) is a distribution for all i ∈ I, i.e. the sum of the elements in every row equals 1.
Now that we have specified our notation, we introduce Markov chains. Norris [2] gives many
examples of Markov chains. One of these examples is about a flea hopping between the
vertices of a triangle. The idea of this example is that the vertex the flea will jump
to next depends only on the vertex where the flea currently sits and not on the vertices
it visited in the past. This leads to the following definition of a Markov chain.
Definition 1.3 (Markov chain). Let (X_n)_{n≥0} be a sequence of random variables. (X_n)_{n≥0} is
a Markov chain with initial distribution λ and transition matrix P if X_0 has distribution λ
and for n ≥ 0, conditional on X_n = i, X_{n+1} has distribution (p_{ij} : j ∈ I) and is independent
of X_0, . . . , X_{n−1}.
We can also state this using the notation introduced above as follows:
for n ≥ 0 and i_1, . . . , i_{n+1} ∈ I,

P(X_0 = i_1) = λ_{i_1};
P(X_{n+1} = i_{n+1} | X_0 = i_1, . . . , X_n = i_n) = p_{i_n i_{n+1}}.
Basically, a Markov chain can be represented as a directed graph where the vertices are
the states and the edges represent the possible jumps, as Norris [2] shows in his examples.
In the upcoming chapters, we will look at problems where we can take certain actions
whenever we enter a state. If we stop we receive a terminal reward and if we continue
some costs might be included. Such problems where we try to maximize the reward or
minimize the costs over time are called optimal stopping problems. Examples of optimal
stopping problems are given in Tijms [5]. In the next two chapters we try to maximize
certain probabilities with respect to stopping rules. To that end, we define Markov
decision processes and discuss several optimality conditions and solution methods for
determining an optimal decision rule for such optimal stopping problems. In the next two
paragraphs, we study this theory following Nunez Queija [3].
1.2 Decision rules
This paragraph gives an introduction into decision theory. In decision theory we often
use Markov decision processes. A Markov decision process (MDP) is similar to a Markov
chain. We again assume that the state space I is countable. An MDP also has a finite
action space A and in an MDP we are allowed to choose an action and each action leads to
different transition probabilities. In other words, we can choose the transition probabilities
from a collection of probability distributions. If the process (Xn)n≥0 has an action space
with only one element such that we always choose the same action and always have the
same transition probabilities, then the MDP reduces to a regular Markov chain.
Let An denote the action taken at time n. Regarding the states and actions we further
assume that
P(X_{n+1} = i_{n+1} | X_0 = i_0, . . . , X_n = i_n, A_0 = a_0, . . . , A_n = a_n)
= P(X_{n+1} = i_{n+1} | X_n = i_n, A_n = a_n)

for i_0, . . . , i_{n+1} ∈ I and a_0, . . . , a_n ∈ A. From now on, we will denote the right hand
side of this equation by p_{a_n}(i_n, i_{n+1}). From this equation we can conclude something
similar to the definition of a Markov chain. Once we know the state and the action at
time n, the state at time n + 1 is independent of the history before time n, which we write as

H_{n−1} := (X_0, A_0, . . . , X_{n−1}, A_{n−1}).

We also add a reward, r_a(i, j), when action a is taken and the process moves from state
i to state j. The expected reward when we choose action a in state i is therefore

r_a(i) = ∑_{j∈I} p_a(i, j) r_a(i, j).

We assume that this expected reward is uniformly bounded, which means that there exists
an M > 0 such that |r_a(i)| < M for all i ∈ I, a ∈ A.
In short, in this thesis a Markov decision process (MDP) consists of a countable state
space I, a finite action space A, transition probabilities for each action and a reward for
all possible transitions given an action. If we have a decision problem over a period T ∈ N,
then we have a finite horizon decision problem.
In the next sections, it will be our goal to find an optimal strategy for optimal stopping
problems, like the Piglet game in chapter 3. Therefore, we will define what a strategy is
and how it is related to decision rules.
Definition 1.4 (Decision rule). Let h_{n−1} be a realization of H_{n−1}, which consists of all
states that have been visited and all actions that have been taken up to and including
time n − 1. A strategy is a sequence

s_n^{a_n}(h_{n−1}, i_n) ∈ [0, 1], n ∈ Z_{≥0},

which gives the probability that action a_n is taken at time n. We call a strategy a decision
rule if for all n, the whole history and all states,

s_n^{a_n}(h_{n−1}, i_n) = 1

holds for precisely one a_n ∈ A. Then we write f = (f_0, f_1, f_2, . . .) for the decision rule where
f_n(h_{n−1}, i_n) = a_n. Whenever f_n(h_{n−1}, i_n) = f_n(i_n), the decision rule does not depend on
the history and the decision rule is a Markov decision rule.
1.3 Bellman’s condition
In the upcoming games we want to find a decision rule that maximizes the total reward.
Thus, we want to maximize the function
V sT (i) :=
T−1∑n=0
Es[rAn(Xn, Xn+1) | X0 = 1
]+ Es [q(XT ) | X0 = i]
where q is the function that defines the final reward when the process ends in a certain
state. There exists a recursive algorithm that appears in dynamic programming which
maximizes the function V sT over s and is called Bellman’s optimality condition. More
on dynamic programming can be found in Parmigiani [4]. We will state this algorithm in
order to use it on the problems in the upcoming chapters.
Theorem 1.5 (Bellman's optimality condition). Let V_0(i) := q(i) and let

V_n(i) := max_{a∈A} { r_a(i) + ∑_{j∈I} p_a(i, j) V_{n−1}(j) }   for all n ∈ N.

Then, V_n(i) maximizes the function V_n^s(i) over all strategies s, for all n ∈ N. The
corresponding decision rule is f = (f_n, f_{n−1}, f_{n−2}, . . . , f_1), given by

f_n(i) = arg max_{a∈A} { r_a(i) + ∑_{j∈I} p_a(i, j) V_{n−1}(j) }.
For a proof we refer to Nunez Queija [3].
Remark 1.6. This decision rule does not tell us what to do at time n, but it gives the
action we should take when we have n steps to go in the process. So, first we take the
action given by fn, then take the action given by fn−1, etc.
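The backward recursion of Theorem 1.5 and Remark 1.6 is easy to implement. The following is an illustrative sketch in Python (the thesis's own programs are written in Matlab and listed in appendix A; the function name `bellman` and the dictionary-based data layout below are our own assumptions):

```python
def bellman(q, reward, prob, states, actions, T):
    """Backward recursion of Bellman's optimality condition.

    q[i]          -- terminal reward q(i)
    reward[a][i]  -- expected one-step reward r_a(i)
    prob[a][i][j] -- transition probability p_a(i, j)
    Returns V_T and the decision rule f = (f_T, ..., f_1);
    f_T is used first (with T steps to go), then f_{T-1}, etc.
    """
    V = dict(q)                                   # V_0(i) = q(i)
    rules = []
    for n in range(1, T + 1):
        V_new, f = {}, {}
        for i in states:
            # value of each action: r_a(i) + sum_j p_a(i, j) V_{n-1}(j)
            vals = {a: reward[a][i]
                       + sum(prob[a][i][j] * V[j] for j in states)
                    for a in actions}
            f[i] = max(vals, key=vals.get)        # the arg max of Theorem 1.5
            V_new[i] = vals[f[i]]
        V = V_new
        rules.insert(0, f)                        # so that rules = [f_T, ..., f_1]
    return V, rules
```

On a toy two-state chain where a "risky" action reaches an absorbing goal state with probability 1/2 and a "safe" action stays put, two iterations give V_2 = 0.75 for the starting state, in line with the recursion above.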
Determining an optimal decision rule that maximizes the reward is not the only use of
Bellman’s optimality condition. We conclude this chapter by showing how it could also be
used to maximize the probability of a certain goal. Suppose we want to reach a certain
state j before a given time T. Then, we want to maximize the entrance probability of that
state,

P_T^s(i) := P_s(T_j ≤ T | X_0 = i),

where T_j = inf{n ≥ 1 : X_n = j} is the first passage time of state j. To prevent
the entrance probability from exceeding 1, we set p_a(j, j) = 1; then the process cannot
escape state j once it enters this state. We set the rewards r_a(i, k) equal to 1 if k = j, and
0 otherwise. By setting n = T and q(i) = 0 in Bellman's optimality condition, we get

P_T^s(i) = V_T^s(i).
Thus, we can use Bellman’s condition to maximize entrance probabilities which will be
done frequently in the next two chapters.
2 Roulette
In this chapter we consider the problem one faces at the roulette table. It is a good example
to show how the theory of the previous chapter works. If one plays Roulette, should one
bet, and if so, how much? Using Bellman’s optimality condition, we will determine the
decision rule. Suppose we start with €75,- and that we want to go home with €200,-. The
roulette table has eighteen red holes, eighteen black holes and one white hole. We assume
that we are only allowed to bet on red or black. Each turn we bet an integer amount of
euros, which may differ from turn to turn. If we win in a single turn, we get back double
the amount we bet that turn. If we lose, we lose the whole amount we bet that turn.
We will now formulate this in terms of an MDP.
2.1 Roulette as a Markov decision process
This problem has a finite state space
I = {0, 1, . . . , 200} .
The action space consists of all possible bets
A = {0, 1, . . . , 199}.
For each state i we define the subset Ai of the action space by
Ai = {0, 1, . . . ,min{i, 200− i}},
since there is no reason to bet more euros than necessary to reach i = 200. If we bet a ∈ A_i
euros when we are in state i, we either go to state i + a with probability p_a(i, i + a) = 18/37
or to state i − a with probability p_a(i, i − a) = 19/37. Then we want to find the functions
(V_n)_{n=0}^∞ with

V_{n+1}(i) = max_{a∈A_i} { (18/37) V_n(i + a) + (19/37) V_n(i − a) },   (2.1)

where we set

V_0(i) = { 1 if i = 200,
           0 if i < 200.        (2.2)
These functions V_n(i) denote the probability of going home with €200,- if we have n
rounds left to play and i euros in our possession. Equation (2.1) is obtained
from the standard Bellman equation by setting r_a(i) = 0.
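To illustrate how equations (2.1) and (2.2) can be evaluated, here is a sketch of the recursion in Python (the program of appendix A is in Matlab; this is not that code, and the function name `roulette_values` is our own). Ties between equally good bets are broken toward the smaller bet:

```python
def roulette_values(T, goal=200, p_win=18/37):
    """Iterate equation (2.1) T times, starting from V_0 of equation (2.2).

    Returns V_T and, for each capital i, the optimal first action
    (the smallest optimal bet whenever the optimum is not unique).
    """
    V = [1.0 if i == goal else 0.0 for i in range(goal + 1)]  # V_0
    first_action = [0] * (goal + 1)
    for _ in range(T):
        V_new = list(V)
        for i in range(1, goal):
            best_v, best_a = V[i], 0                  # a = 0: bet nothing
            for a in range(1, min(i, goal - i) + 1):  # the action set A_i
                v = p_win * V[i + a] + (1 - p_win) * V[i - a]
                if v > best_v + 1e-12:                # strict improvement only
                    best_v, best_a = v, a
            V_new[i], first_action[i] = best_v, best_a
        V = V_new
    return V, first_action
```

For T = 2 this reproduces the hand calculations of the next section: V_2(75) = (18/37)² ≈ 0.2367 with smallest optimal first action 25, and V_2(150) = 1008/1369 ≈ 0.7363 with first action 50.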
2.2 Roulette with a finite horizon
In this section we take a finite horizon, i.e. T < ∞. To demonstrate how the Bellman
equation works, we consider the following example.
Example 2.1. Suppose we start with €125,- and bet only once. Then we only need V_0
and V_1. V_0 is the same as in equation (2.2). For V_1 we find

V_1(125) = max_{a≤75} { (18/37) V_0(125 + a) + (19/37) V_0(125 − a) }
         = max_{a≤75} { (18/37) V_0(125 + a) }
         = 18/37,

by the definition of V_0. Thus, the maximum probability of going home with €200,- equals
18/37 and is achieved when action a = 75 is taken. So the optimal strategy is f_1(125) = 75.
Now that we have seen an example where the initial state is given, we look at the problem
where we have a capital of i euros. We take horizon T = 2. We can write down V_0, V_1 and
V_2 and find the optimal actions. V_0 is defined as in equation (2.2). Using equation (2.1)
we get the following function when we have one round left to play:

V_1(i) = { 1      if i = 200,
           18/37  if i ∈ {100, 101, . . . , 199},
           0      if i < 100.

The first and third cases are obviously true, because we either have already reached our
goal or it is impossible to reach €200,-. The second case is true, because
max_{a∈A_i} { (18/37) V_0(i + a) + (19/37) V_0(i − a) }
= max_{a∈A_i} { (18/37) V_0(i + a) + (19/37) · 0 }
= max_{a∈A_i} { (18/37) V_0(i + a) }
= 18/37.

This function is maximized when we bet a = 200 − i euros.
With the same reasoning we can write down the function when we have two rounds left
to play:
V_2(i) = { 1          if i = 200,
           1008/1369  if i ∈ {150, 151, . . . , 199},
           18/37      if i ∈ {100, 101, . . . , 149},
           (18/37)^2  if i ∈ {50, 51, . . . , 99},
           0          if i < 50.
Again, the first and fifth cases are obviously true. The third case is the same as the
second case of V_1. In this case it is optimal to bet 200 − i euros, because we can reach
€200,- in one round, and playing a second round would decrease our probability of reaching
€200,-, since we would need to win twice instead of once. For the second case we get
i − a ∈ {100, 101, . . . , 199}, so we see that
max_{a∈A_i} { (18/37) V_1(i + a) + (19/37) V_1(i − a) } = (18/37) · 1 + (19/37) · (18/37) = 1008/1369.
For the fourth case we can calculate that

max_{a∈A_i} { (18/37) V_1(i + a) + (19/37) V_1(i − a) }
= max_{a∈A_i} { (18/37) V_1(i + a) + (19/37) · 0 }
= max_{a∈A_i} { (18/37) V_1(i + a) }
= (18/37)^2,

where we use that, given an amount of euros greater than or equal to 50 and less than
100, after one round we will have at most €198,-. Hence V_1(i + a) ≤ 18/37.
Returning to the original problem where we start with €75,- and want to go home with
€200,-, we have to play more than one round: we simply cannot double €75,- to €200,-.
Using the same V_0 as in the previous example, we find

V_2(75) = max_{a≤75} { (18/37) V_1(75 + a) + (19/37) V_1(75 − a) }
        = max_{a≤75} { (18/37) V_1(75 + a) }
        = (18/37)^2.
The second equality holds because V_1(75 − a) = 0 for all a ≤ 75, and if we bet at least €25,-
then V_1(75 + a) = 18/37; if we bet less than €25,- we cannot reach €200,- in two rounds,
and the probability of going home with €200,- equals 0. Hence we get the decision rule

f_2(75) = a_2,   f_1(i) = { 200 − i if i ∈ [100, 150],
                            0       if i < 100,

with a_2 ∈ [25, 75] the amount of money we bet when we have two more rounds to play.
Note that, when we have one more round to play, the optimal action is not unique.
We decide to choose the smallest optimal action, so that the loss is minimal in case we lose.
For large T it becomes more laborious and time-consuming to calculate V_T by hand.
So, in this case, we write a computer program in Matlab that calculates these
functions and the decision rules. This program is given in appendix A.1. The table below
contains the probabilities and optimal first actions for a few initial capitals.
Initial capital 0 50 100 150 200
Maximum probability 0 0.2367 0.4865 0.7363 1
Optimal first action 0 50 0 50 0
Table 2.1: The maximum probabilities and optimal first actions for
some initial capitals with horizon 2.
The scatterplots in Figure 2.1 below give us more information. They give the maximum
probabilities and the smallest optimal first actions for every initial capital.
Figure 2.1: Maximum probabilities and optimal first actions for
T = 2.
The red graph describes the maximum probability of going home with €200,- if one has
an amount of euros equal to the initial state. These maximum probabilities are attained if
one plays the optimal actions. The smallest optimal first actions are described in the blue
graph for each initial state. For the same reason as before, we choose the smallest optimal
actions. We will keep choosing the smallest optimal action in this game whenever the
optimal action is not unique.
The graph showing the maximum probabilities coincides with our calculations in paragraph
2.2. The graph of the optimal first actions tells us a few things about the game. We
see that for each initial state smaller than 50 the smallest optimal first action equals
zero, which is logical because even by doubling twice we cannot reach €200,-. For the initial
states greater than or equal to 100 and smaller than 150, the smallest optimal first action is
also equal to zero: the probability of reaching €200,- in two steps equals the probability
of reaching €200,- in one step, so it is not necessary to play two rounds and the
optimal first action is 0. When in a state i greater than or equal to 150 and less than 200,
it is not optimal to bet more than 200 − i euros, since we set the goal of reaching precisely
€200,-. Therefore, we see that the optimal first action decreases starting from the initial
state 150. A similar argument explains the decrease from the initial states 50 up to and
including 99.
The Matlab program works for every finite horizon. So, we will consider T = 3 too.
Again, we start by giving a few results from the program.
Initial capital 0 25 75 125 175 200
Maximum probability 0 0.1151 0.3582 0.6080 0.8646 1
Optimal first action 0 25 25 25 25 0
Table 2.2: The maximum probabilities and optimal first actions for
some initial capitals with horizon 3.
The graphs for T = 3 on the next page can be explained in a similar way as for T = 2.
We see that the maximum probability can now take nine different values. Also, the graph
of optimal first actions now contains four strictly decreasing subsequences. This behaviour
is explained in a similar way as for T = 2.
Figure 2.2: Maximum probabilities and optimal first actions for
T = 3.
2.3 Roulette with an infinite horizon
For large values of T we used a computer program. When we take T =∞ we will have to
add a stopping criterion. The program must calculate a probabilty over an infinite amount
of rounds but must give us that probability at a certain time. Therefore we use the stopping
criterion of the Successive Approximation Algorithm which is stated in Nunez Queija [3].
The code for this stopping criterion is given in appendix A.1.3. Using this program we
calculate the maximum probabilities and optimal first actions in Table 2.3.
Initial capital 0 16 59 99 163 200
Maximum probability 0 0.0716 0.2783 0.4806 0.8010 1
Optimal first action 0 0 9 1 12 0
Table 2.3: The maximum probabilities and optimal first actions for
some initial capitals with an infinite horizon.
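A minimal sketch of value iteration with such a stopping criterion, in Python: iterate the Bellman operator until successive value functions differ by less than a tolerance ε in every state. This is a plain sup-norm test; the precise criterion of the Successive Approximation Algorithm in Nunez Queija [3] is more refined, so the sketch below only approximates the approach of appendix A.

```python
def roulette_values_infinite(goal=200, p_win=18/37, eps=1e-9):
    """Iterate equation (2.1) until sup_i |V_{n+1}(i) - V_n(i)| < eps."""
    V = [1.0 if i == goal else 0.0 for i in range(goal + 1)]  # V_0 of (2.2)
    while True:
        V_new = list(V)
        for i in range(1, goal):
            # maximize over the action set A_i = {0, ..., min(i, goal - i)}
            V_new[i] = max(p_win * V[i + a] + (1 - p_win) * V[i - a]
                           for a in range(min(i, goal - i) + 1))
        diff = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if diff < eps:
            return V
```

Since V_0 is the terminal reward and the Bellman operator is monotone, the iterates increase towards the infinite-horizon value, and in this game the increments shrink quickly, so the loop terminates after a modest number of iterations.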
Using the code in appendix A.2 we get the scatterplots in Figure 2.3 on the next page:
Figure 2.3: Maximum probabilities and optimal first actions for
T =∞.
Now, the red line gives the probability of going home with €200,- for each initial capital
if we only stop when we either reach our goal or go broke. The blue graph gives the optimal
first action we must take in order to achieve this maximum probability. We noticed that
for T = 3 the maximum probability can take on more different values than for T = 2.
As expected, this number of values greatly increases when we consider an infinite horizon.
Once again, the blue graph gives the smallest optimal first actions for each initial state.
Now that we have seen how the theory of Markov decision processes works, we are ready
to use it for Piglet in the next chapter.
3 Piglet
In the previous chapters we studied Markov decision theory and its application at the
roulette table. In this chapter we will use this theory in order to achieve the main goal:
finding the decision rule that maximizes the probability of winning the Piglet game with
an infinite horizon. First, we explain the game and formulate it as an MDP.
3.1 The coin flipping game
Two players are involved in the Piglet game. Each round one player flips a fair coin,
possibly more than once. If heads, the flipper gets one point. If tails, the player loses
all points obtained in the current round and the next round starts, in which the other
player flips the coin. Each round, the player who flips the coin may decide to stop flipping
and freeze the points obtained in the current round. Freezing means that those points
can no longer be lost. After stopping, the next round starts and the other player may
flip the coin. In this game it is the goal of each player to be the first to obtain 10 points.
Notice that this game is two-dimensional. The game has the state space
I = {(i, j) ∈ {0, 1, . . . , 10} × {0, 1, . . . , 10}} \ {(10, 10)},
where i denotes the number of points of the player who is currently flipping the coin and
j denotes the number of points of the player who is waiting for his turn. Therefore, the
roles of i and j change when the game enters the next round. The action space consists of
the number of times we can flip the coin in one round
A = {0, 1, 2, . . . , 10}.
Once again, we will define a subset of the action space. Let A(i,j) ⊂ A be the action space
of the player who is going to flip the coin if he or she has i points and the opponent has j
points. We define A(i,j) = {0, 1, 2, . . . , 10− i} if j < 10. We also define A(i,10) = {0} such
that the game ends at the moment that one of the players has obtained 10 points.
For the transition probabilities, we get

p_a((i, j), (j, i + a)) = (1/2)^a   and   p_a((i, j), (j, i)) = 1 − (1/2)^a.
The first one is the probability of successfully flipping heads a times in a row and freezing
the a obtained points. The second one is the probability that one of the flips results in
tails, yielding no points.
Just as in the case of Roulette, we set the reward r(i) after each round equal to 0.
We need the functions (V_n)_{n∈Z≥0}, where V_n((i, j)) is the maximum probability of
winning when n more rounds are left to play for both players together, in order to
determine V_∞. The Bellman condition gives these functions:

V_{n+1}((i, j)) = max_{a∈A_{(i,j)}} { (1/2)^a (1 − V_n((j, i + a))) + (1 − (1/2)^a)(1 − V_n((j, i))) },   (3.1)

where the terms 1 − V_n appear because after the current round the roles of the two
players are interchanged, and where we set

V_0((i, j)) = { 0    if i < 10 and j = 10,
                1/2  for all i < 10, j < 10.        (3.2)
The first case of V0 means that one of the players has won in the previous round and
that the other player has lost the game. The second case of V0 defines a tie when the game
ends and none of the players have enough points to be the winner of the game.
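Equations (3.1) and (3.2) can be iterated directly. The following Python sketch is illustrative only (the thesis's Piglet program in appendix A is in Matlab and is based on an equivalent reformulation of the game; the names and data layout here are our own):

```python
def piglet_values(T, goal=10):
    """Iterate the Piglet recursion (3.1) T times, starting from (3.2).

    V[(i, j)] -- max probability of winning for the player about to flip,
                 holding i points while the opponent holds j points.
    """
    # V_0 of equation (3.2): 0 if the opponent has already won, 1/2 otherwise
    V = {(i, j): 0.0 if j == goal else 0.5
         for i in range(goal) for j in range(goal + 1)}
    # states (goal, j): the flipping player already won (never read below,
    # added only for completeness of the state space)
    V.update({(goal, j): 1.0 for j in range(goal)})
    action = {}
    for _ in range(T):
        V_new = dict(V)
        for i in range(goal):
            for j in range(goal):          # A_(i,10) = {0}: nothing to update
                vals = {a: (0.5 ** a) * (1 - V[(j, i + a)])
                           + (1 - 0.5 ** a) * (1 - V[(j, i)])
                        for a in range(goal - i + 1)}   # A_(i,j)
                a_star = max(vals, key=vals.get)
                V_new[(i, j)], action[(i, j)] = vals[a_star], a_star
        V = V_new
    return V, action
```

One iteration reproduces equation (3.3) of the next section: for T = 1 and i = 6 the value is (1/2)^4 + 1/2 − (1/2)^5 = 0.53125 with optimal action a = 4.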
3.2 Piglet with a finite horizon
In this section, we start with showing how to calcalute the functions (Vn)n∈Z≥0by hand.
Therefore, we now consider a finite horizon. We return to the infinite horizon problem in
the next paragraph.
Suppose we take T = 1. Then we want to calculate the maximal probability of reaching
state (10, j) in one round with j ≤ 9 and find the decision rule corresponding to this
maximal probability. We let V0((i, j)) as in equation (3.2) and will write down the function
V1((i, j)). We calculate that
max_{a∈A_{(i,j)}} { (1/2)^a (1 − V_0((j, i + a))) + (1 − (1/2)^a)(1 − V_0((j, i))) }

equals

max_{a∈A_{(i,j)}} { (1/2)^a (1 − V_0((j, i + a))) + 1/2 − (1/2)^{a+1} }.

We easily see that

(1/2)^a (1 − V_0((j, i + a))) + 1/2 − (1/2)^{a+1}
  = { (1/2)^{a+1} + 1/2 − (1/2)^{a+1}        if a < 10 − i,
      (1/2)^{10−i} + 1/2 − (1/2)^{10−i+1}    if a = 10 − i.

The first case equals 1/2, which is less than the second case. This implies that

V_1((i, j)) = (1/2)^{10−i} + 1/2 − (1/2)^{10−i+1}.   (3.3)
Thus the maximum probability of winning equals (3.3) and is attained when action a =
10− i is taken.
Just as for Roulette, this was easy to calculate, but the computation time increases as T
grows. We want to be able to give the maximum probabilities and the optimal first
actions for large T, where T may be infinite. Again, we write a computer program in
Matlab. This program is based on a slightly different, but equivalent, version of the
game. Using the first program in appendix A.4 we get for T = 1 and i = 6 that the
maximum probability of winning equals 0.5313 and that the optimal first action is a = 4.
These are exactly the numbers obtained by substituting i = 6 into equation (3.3), as the
reader can easily verify.
Using the program we show the numbers in Table 3.1, where we take the horizon to be
equal to 5, such that the reader can verify the graphs that follow.
Initial points player 0 1 4 6 7 7
Initial points observer 0 0 4 5 7 9
Maximum probability 0.5009 0.5028 0.5123 0.5665 0.5504 0.2874
Optimal first action 6 5 4 3 3 3
Table 3.1: The maximum probabilities and optimal first actions for
some initial states and T = 5.
The first program in appendix A.5 gives the graphs in Figure 3.1. The graph on the
left shows the maximum probabilities and the graph on the right shows the optimal first
actions.
Consider the graph that gives the maximum probabilities given the initial states of the
two contestants. We notice that whenever the initial state of the player (the contestant
who is flipping the coin in the current turn) increases, the maximum probability of winning
also increases which is to be expected. Also to be expected, the graph shows that whenever
the initial state of the observer (the contestant who flipped the coin in the previous turn
and will flip in the next turn) increases, the maximum probability of winning decreases.
Figure 3.1: Maximum probabilities and optimal first actions for
T = 5.
We now consider the graph that gives the optimal first actions. We notice that the
optimal first action, too, is influenced by the number of points of each player. In Figure
3.2 two more scatterplots are shown. These scatterplots show in what way the optimal
first action changes whenever the initial states of either the player or the observer increase
and the other score is fixed. Here we set this fixed score equal to 3.
Figure 3.2: Optimal first actions for T = 5 with fixed score of opponent.
The graph tells us that for larger initial states of the player, the optimal first action
decreases. This makes sense since the player is closer to his or her goal. Likewise, if the
initial state of the observer becomes relatively large, the optimal first action for the player
increases since he or she needs more heads to catch up with the observer.
3.3 Piglet with an infinite horizon
In the previous section we have seen how the game works and how to calculate V1 for T ≥ 1.
Now, we are interested in the infinite horizon case. We will use a computer program similar
to the one before but, just as in the case of an infinite horizon in Roulette, we add the
stopping criterion of appendix A1.2. This results in the second computer program given
in appendix A1.4. Again, we start by giving some numbers such that the reader can verify
the graphs that follow. We take the same initial states as for T = 5.
Initial points player 0 1 4 6 7 7
Initial points observer 0 0 4 5 7 9
Maximum probability 0.5264 0.5922 0.5310 0.6641 0.5455 0.2222
Optimal first action 2 2 2 1 1 3
Table 3.2: The maximum probabilities and optimal first actions for
some initial states and T =∞.
Using the first program in appendix A.5, keeping Remark A.5.1 in mind, we obtain the
following graph.
Figure 3.3: Maximum probabilities for T =∞.
This graph describes the maximum probabilities of ever winning Piglet given the initial
states for both contestants. The graph shows the same, but smoother, behaviour as in the
case T = 5, and this behaviour is explained in exactly the same way as before.
The graph of the optimal first actions also shows the same behaviour as before, but
contrary to the graph of the optimal first actions for T = 5, the graph for the infinite
horizon problem is less smooth, see Figure 3.4.
Figure 3.4: Optimal first actions for T =∞.
To give a better illustration of the influence of each contestant's initial state on the
optimal first action, we again take two sections. We fix one contestant's score at 3 and
consider the influence of the other contestant's score, see Figure 3.5 on the next page.
Figure 3.5: Optimal first actions for T = ∞ with fixed score of opponent.
The initial state of the player influences the optimal first action only slightly. However,
the initial state of the observer greatly increases the value of the optimal first action
once it becomes larger than seven. This is due to the fact that we allow our contestants
to continue playing until one of them has flipped heads ten times.
We have determined optimal actions for Piglet. In this chapter we used the Bellman
condition in equation (3.1). In the next chapter we will determine a one-stage-look-ahead
rule for Piglet and compare it to the optimal actions from this chapter.
3.4 A small horizon
If we play Piglet with a small horizon we may need to flip the coin mores times in one
round than when we play with a larger horizon and consider the same pair of initial states,
the points we have and the points our opponent has. For T ≤ 4 we get remarkable results
compared to T ≥ 5. We will show this for T = 4. The graph of the maximum probability
of winning resembles the same graph for T = 5, but the graph of the optimal first actions
is completely different than the same graph for T = 5, see Figure 3.6 on the next page.
Figure 3.6: Maximum probabilities and optimal first actions for
T = 4.
The differences are also visible in the sections below, where we fix the other player's score at the same value as for T = 5.
Figure 3.7: Optimal first actions T = 4 with fixed score of oppo-
nent.
From these graphs we conclude that, under the optimal actions, we have to take a larger risk when playing with a small horizon, whereas for a larger horizon the optimal actions in the same initial states are less risky.
4 One-stage-look-ahead
In the previous chapters we considered decision rules obtained from the Bellman optimality condition. In this chapter we introduce another type of decision rule. The strategies obtained in the previous chapter are useful if the state space is relatively small, as it was there. But if the state space becomes larger, it becomes harder to calculate and use the optimal strategies determined with the Bellman condition. Therefore, we consider the one-stage-look-ahead rule, also known as the one-step-look-ahead rule.
4.1 One-stage-look-ahead decision rules
This section briefly explains the theory of one-stage-look-ahead. We will follow Tijms [6]
in this section.
From now on, in each state there are only two possible actions: stop, or continue one more round and then make this comparison once more. One-stage-look-ahead determines whether stopping now is at least as good as continuing one more step. When the process is in state i and we choose to stop, we receive a terminal reward R(i). When we continue, we pay a cost c(i). We assume that one of the following conditions holds:
(i) sup_{i∈I} R(i) < ∞ and inf_{i∈I} c(i) > 0;

(ii) sup_{i∈I} R(i) < ∞ and c(i) = 0 for all i ∈ I, and there is a set S, consisting of the states where stopping is mandatory, which is reached within a finite amount of time when the action of continuing is always chosen in the states i ∉ S.
In this chapter, we will always assume that the second condition holds. Now, we define the set B of states in which the one-stage-look-ahead rule tells us that stopping is at least as good as continuing one more round and then stopping:

B = {i ∈ I : R(i) ≥ Σ_{j∈I} p_ij R(j)}.     (4.1)
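To illustrate (4.1), consider a toy stopping problem (an example of our own, not from the chapters above): roll a fair die, stop to collect the face value R(i) = i, or reroll, which moves to each face with probability 1/6 and costs nothing, so condition (ii) holds. A short Python sketch computes the set B:

```python
def one_stage_look_ahead(states, R, p):
    # States where stopping now is at least as good as
    # continuing one more step and then stopping, as in (4.1).
    return {i for i in states
            if R(i) >= sum(p(i, j) * R(j) for j in states)}

# Fair die: the expected value of one more roll is 3.5,
# so B consists of the faces 4, 5 and 6.
faces = range(1, 7)
B = one_stage_look_ahead(faces, lambda i: i, lambda i, j: 1 / 6)
```

The function names and the die example are our own illustration; the thesis itself applies (4.1) only to Piglet.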
We will use this theory to create a one-stage-look-ahead rule for Piglet and try to deter-
mine whether it is optimal or not.
4.2 One-stage-look-ahead for Piglet
We want to determine the set B specified in (4.1) for Piglet. To do so, we must add an extra rule to our game to make it an optimal stopping problem. From now on, each player is allowed to press a button at the end of his or her round. By pressing this button, the other player is allowed to play only one more round. If he or she overtakes the player who pressed the button, he or she wins; catching up results in a tie. Otherwise, the player who pressed the button wins.
Below, we state a one-stage-look-ahead rule for Piglet. The idea behind this rule is inspired by backward induction, as explained in Hill [1].
Our one-stage-look-ahead-rule is based on the probability of being overtaken by our
opponent. First of all, we determine the set S. S consists of all states in which stopping
is mandatory, so
S = {(10, j) : 0 ≤ j ≤ 9} ∪ {(i, 10) : 0 ≤ i ≤ 9} .
Secondly, we define the following stopping rule. The button must be pressed if

P(overtaken or caught up by opponent) < 1/2.     (4.2)

In other words,

(1/2)^(i−j) < 1/2.     (4.3)
Here we take 0 ≤ j < i ≤ 9. All states not mentioned so far are states in which we are not ahead of our opponent; in those states, stopping could never be optimal. Now, we define the set

B′ := {(i, j) : (1/2)^(i−j) < 1/2}.

Then it immediately follows that

B′ = {(i, j) : i − j ≥ 2}.
We must also consider the set of states in which we want to throw once more and then stop. Since we decide either to stop at the end of this round or to continue one more round and then stop, we flip at least once more between the moment of this decision and the moment we press the button. Therefore, we define

B = {(i, j) : i − j ≥ 1}.
Using this set B we define the following one-stage-look-ahead rule. If we have entered the set B, we throw only once more and stop at the end of this round; otherwise, we throw (i − j + 1) times and continue one more round.
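As a quick sanity check (a sketch of our own, not part of the derivation), the reduction of inequality (4.3) to i − j ≥ 2, and the shift by one flip that yields B, can be verified directly:

```python
# All states with 0 <= j < i <= 9, i.e. the states in which we lead.
leading = [(i, j) for i in range(10) for j in range(i)]

# B' collects the leading states satisfying (1/2)^(i-j) < 1/2 ...
B_prime = {(i, j) for (i, j) in leading if 0.5 ** (i - j) < 0.5}
# ... which for integer scores is exactly the condition i - j >= 2.
assert B_prime == {(i, j) for (i, j) in leading if i - j >= 2}

# Shifting the required lead by one flip gives the set B of the rule.
B = {(i, j) for (i, j) in leading if i - j >= 1}
```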
4.3 Optimality of one-stage-look-ahead
In the previous section we determined the set B that tells us whether to stop after the
next flip or to continue after the next flip. In this section we will determine whether the
one-stage-look-ahead strategy defined by this set is optimal or not. One way of determin-
ing whether such a set gives an optimal one-stage-look-ahead rule was given by Richard
Weber. In [7] he proved that the one-stage-look-ahead rule is optimal if the set B is closed. Unfortunately, his theorem is not applicable here: the states of the set B determined in the previous paragraph do not even communicate. For a definition of communicating states, we refer to Norris [2].
So, we have to find another way to determine optimality. We will compare the set B to
the optimal first actions given by Bellman’s optimality condition. It is optimal to stop if,
for the pair of initial states, the optimal first action is a = 1. Therefore, we will determine
the set C of pairs of initial states with optimal first action a = 1 using Figure 3.4. We
notice that
B \ C = ∅ and C \B = {(8, 7)} .
Hence our one-stage-look-ahead rule gives an optimal strategy in many states, but is not
completely optimal.
5 Conclusion
We began with the study of Markov decision processes (MDPs) and used this theory to determine optimal strategies. The first MDP we considered was the game Roulette, where we used the Bellman condition to calculate both the maximum probability of going home with €200 and the corresponding optimal first actions. In the case of horizon T = 3, which means that we have three more rounds to play, we obtained the maximum probabilities and optimal first actions using a computer program. We saw that the optimal first action was to bet such an amount that, should we win, we can reach €200 by repeatedly doubling. For example, if we have €20, we bet €5 on the first turn. If we win, we have €25, which we can keep doubling to reach €200. If we lose, we are left with €15 and have to bet €10 to reach €25.
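The doubling argument can be checked with a few lines of arithmetic (a sketch of our own; the amounts are those from the example above):

```python
def doubles_to(amount, target=200):
    # Does repeatedly doubling the amount land exactly on the target?
    while 0 < amount < target:
        amount *= 2
    return amount == target

assert doubles_to(25)       # winning the first bet: 25 -> 50 -> 100 -> 200
assert doubles_to(15 + 10)  # after a loss: 15 plus a bet of 10 aims at 25 again
assert not doubles_to(20)   # 20 itself is not on a doubling path to 200
```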
Next, we considered the case where we may play forever. We took the computer program for the finite horizon problem, made some small changes and added a stopping criterion in order to obtain the maximum probabilities of ever going home with €200 given a certain amount of money, and the optimal first actions achieving these maximum probabilities. These probabilities and actions are shown in Figure 5.1.
Figure 5.1: Maximum probabilities and optimal first actions for
T =∞.
In Chapter 3 we used the knowledge of the first two chapters to consider our main problem, the Piglet game, another MDP. We wrote a computer program for the finite horizon problem and modified it in the same way as in Chapter 2 to obtain a program for the infinite horizon problem. As for Roulette, this program used the Bellman condition to derive the maximum probabilities, the corresponding optimal actions and the graphs that show these probabilities and actions, see Figure 5.2.
Figure 5.2: Maximum probabilities and optimal first actions for
Piglet if T =∞.
We concluded that the function showing the maximal probabilities is an increasing func-
tion of the initial state of the player and a decreasing function of the initial state of the
observer. Here, the player is the contestant who flips the coin in the current turn and
the observer is the contestant who will flip the coin in the next turn. Also, the graph of
the optimal first actions is a decreasing function of the initial state of the player and an
increasing function of the initial state of the observer.
In Chapter 4, we considered a different type of decision rule, the one-stage-look-ahead rules, to derive possible optimal actions for Piglet. We defined a one-stage-look-ahead rule based on the probability that our opponent can catch up in points. If that probability is small enough, we decided to stop and press a button so that our opponent was only allowed to play one more round; otherwise, we decided to take the optimal action of the Bellman condition in Chapter 3. We concluded the chapter by determining whether this strategy is optimal. To that end, we considered the set of initial states for which the Bellman condition gives optimal first action 1, and compared it with the set of states for which our one-stage-look-ahead rule tells us to flip the coin once more and then press the button. We noticed that these sets differ in only one pair of initial states, which means that the one-stage-look-ahead rule is optimal in most states, but not completely optimal.
6 Popular summary (translated from Dutch)
Markov chains are processes that, given the present, are independent of their history. That is, the only thing that influences the next state of the process is the state the process is in now. Markov chains can be extended to Markov decision processes; we do this by adding a set of actions. Decision rules tell us which action to take, and these decision rules can be defined in terms of strategies. A strategy gives the probability with which we choose a certain action. If this probability equals 1 for exactly one of the actions, we call the strategy a decision rule. If a decision rule is independent of the previous states and the actions taken, we call it a Markov decision rule; the theory of such rules is called Markov decision theory. We can use this theory to determine optimal actions in various games of chance. In these games we may assume either that we play a fixed number of rounds or that we may keep playing indefinitely. The first case, where the number of rounds is fixed at the start, is called a finite horizon problem. The other case, where we may keep playing indefinitely, is called an infinite horizon problem.
One way to determine optimal actions is to use the Bellman optimality condition. This recipe computes the maximum probability of reaching a certain goal within a certain number of rounds from the maximum probability of reaching that same goal when playing one round fewer. The actions that maximize this winning probability are the optimal actions.
With this Bellman optimality condition we determined the maximum probability and the corresponding optimal actions for the games Roulette and Piglet. In Roulette, the goal was to go home with €200 when we start gambling with an amount smaller than €200.
Piglet is a coin-flipping game played against an opponent. In each round, one of the players flips a fair coin, possibly several times. If the coin lands heads, the thrower earns a point. On tails, the thrower loses all points earned in the current round, and the next round begins, in which the opponent may flip. In each round, the thrower may decide after every flip to stop flipping and bank the points earned. These points can then no longer be lost, and a new round begins, which means that it is the opponent's turn.
For every combination of our points and our opponent's points we determined the maximum winning probability and the optimal actions. We displayed this information in graphs, so that for every possible situation in the game we know which action to choose.
These graphs show a nice pattern. As our number of points increases, the maximum winning probability rises. Likewise, as our opponent's number of points increases, the maximum winning probability falls.
Another way to find actions is one-stage-look-ahead. The name speaks for itself: the one-stage-look-ahead rule determines whether it is at least as good to stop now as to play one more round and then stop. For Piglet this means that at the start of a round we decide either to stop at the end of the current round or to play one more round and then make the same trade-off again. We therefore want to determine a set of states in which stopping at the end of the current round is at least as good as playing one more round afterwards. If the process is in this set, the one-stage-look-ahead rule says that we must stop at the end of the current round. To determine whether the one-stage-look-ahead rule is optimal for Piglet (in the sense that it maximizes the winning probability), we compare this set with the graph of the optimal first actions obtained from the Bellman optimality condition. It turns out that, playing according to one-stage-look-ahead, the set of states in which we throw once more and then stop differs in only one possible state from the set of states in which throwing once is the optimal action according to the Bellman optimality condition.
A Matlab code
In this appendix we state the Matlab code for Roulette and for the Piglet game.
A.1 Roulette
In this section we state the program used for Roulette with a finite horizon.
function[Tstepprobability,Tstepaction] = rouletteFH(T,initial)
for n=[1:200]
if n==200
vold(n)=1;
else
vold(n)=0;
end
end
while T>0;
for i=[1:200];
largest = 0;
maxaction = 0;
if i==200
largest=1;
else
for a=[0:min(i,200-i)]
if a<i
maxprob = (18/37)*vold(i+a)+(19/37)*vold(i-a);
else
maxprob = (18/37)*vold(i+a);
end
if maxprob>largest;
largest=maxprob;
maxaction=a;
end
end
end
vnew(i)=largest;
anew(i)=maxaction;
end
vold=vnew;
T=T-1;
end
Tstepprobability = vnew(initial)
Tstepaction = anew(initial)
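For readers without Matlab, the same value iteration can be sketched in Python (a translation of our own; names and layout differ from the program above):

```python
def roulette_fh(T):
    # v[i] is the maximum probability of reaching a fortune of 200
    # from fortune i; an even-money bet is won with probability 18/37.
    v = [0.0] * 201
    v[200] = 1.0
    a_opt = [0] * 201
    for _ in range(T):
        w = [0.0] * 201
        w[200] = 1.0
        for i in range(1, 200):
            best, best_a = 0.0, 0
            for a in range(0, min(i, 200 - i) + 1):
                p = (18 / 37) * v[i + a]
                if a < i:  # only then does losing leave a positive fortune
                    p += (19 / 37) * v[i - a]
                if p > best:
                    best, best_a = p, a
            w[i], a_opt[i] = best, best_a
        v = w
    return v, a_opt
```

With one round left, the only way to reach 200 from 100 is to stake everything, and indeed roulette_fh(1) gives probability 18/37 and action 100 in state 100.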
A.2 Graphs
The following code produces the graphs throughout the second and third chapters. The code below is specific to Roulette with a finite horizon; for the case of an infinite horizon we apply the function rouletteIH instead of rouletteFH.
initial=1:200;
[Tstepprobability,Tstepaction] = rouletteFH(3,initial);
y=Tstepprobability;
z=Tstepaction;
figure
axis([0 200 0 1])
scatter(initial,y,3,'red','filled')
xlabel('initial state')
ylabel('probability')
title('Maximal probability of reaching 200')
hold on
figure
axis([0 200 0 100])
scatter(initial,z,3,'blue','filled')
xlabel('initial state')
ylabel('first action')
title('Optimal first action')
A.3 Stopping criterion
For the case of an infinite horizon we need a stopping criterion, as mentioned in Section 2.3, where we chose the stopping criterion of the Successive Approximation Algorithm. The following code computes this criterion.
maxdiff=max(vnew-vold);
mindiff=min(vnew-vold);
difference=maxdiff-mindiff;
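In Python the same span criterion would read (a sketch of our own, for value vectors stored as lists):

```python
def span(v_new, v_old):
    # Span of the difference: maximum minus minimum over all states.
    diffs = [a - b for a, b in zip(v_new, v_old)]
    return max(diffs) - min(diffs)

# The iteration stops once span(v_new, v_old) drops below the
# chosen upper bound.
```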
In order to use this criterion we need to make a slight change in the while loop. Instead of the horizon T, we pass an upper bound as input argument to the Matlab function. We also replace “while T > 0” by
difference = 1;
while difference >= upperbound;
The remaining code is unchanged. This results in the following code for an infinite
horizon.
function[probability,action] = rouletteIH(upperbound,initial)
for n=[1:200]
if n==200
vold(n)=1;
else
vold(n)=0;
end
end
difference = 1;
counter = 0;
while difference >= upperbound;
counter = counter + 1;
for i=[1:200];
largest = 0;
maxaction = 0;
if i==200
largest=1;
else
for a=[0:min(i,200-i)]
if a<i
maxprob = (18/37)*vold(i+a)+(19/37)*vold(i-a);
else
maxprob = (18/37)*vold(i+a);
end
if maxprob>largest;
largest=maxprob;
maxaction=a;
end
end
end
vnew(i)=largest;
anew(i)=maxaction;
end
maxdiff=max(vnew-vold);
mindiff=min(vnew-vold);
difference=maxdiff-mindiff;
vold=vnew;
end
counter = counter
probability = vnew(initial)
action = anew(initial)
A.4 Piglet
We start with the code for the finite horizon problem.
function[Tstepprob,Tstepact]=pigletFH(T,initialplayer,initialobserver);
initialplayer = initialplayer+1;
initialobserver = initialobserver+1;
for n=[1:11]
for m=[1:10]
if n==11
vold(n,m)=1;
else
vold(n,m)=(1/2);
end
end
end
while T>0;
for i=[1:11];
for j=[1:10];
largest = 0;
if i == 11;
largest=1;
maxaction=0;
else
for a=[1:11-i]
if a == 11-i;
maxprob = (1/2)^a+(1-(1/2)^a)*(1-vold(j,i));
else
maxprob = (1/2)^a*(1-vold(j,i+a))+(1-(1/2)^a)*(1-vold(j,i));
end
if maxprob>largest;
largest=maxprob;
maxaction=a;
end
end
end
vnew(i,j)=largest;
anew(i,j)=maxaction;
end
end
vold=vnew;
T=T-1;
end
Tstepprob = vnew(initialplayer,initialobserver)
Tstepact = anew(initialplayer,initialobserver)
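As with Roulette, the finite-horizon backup can be sketched in Python (a translation of our own, with 0-based scores; it is not part of the Matlab programs used in the text):

```python
def piglet_fh(T):
    # v[i][j]: maximum probability that the current thrower, holding
    # i points, wins against an observer holding j points; target is 10.
    v = [[1.0 if i == 10 else 0.5 for _ in range(10)] for i in range(11)]
    a_opt = [[0] * 10 for _ in range(11)]
    for _ in range(T):
        w = [[1.0 if i == 10 else 0.0 for _ in range(10)] for i in range(11)]
        for i in range(10):
            for j in range(10):
                best, best_a = 0.0, 0
                for a in range(1, 11 - i):  # aim for a more heads this round
                    if i + a == 10:         # reaching 10 wins immediately
                        p = 0.5 ** a + (1 - 0.5 ** a) * (1 - v[j][i])
                    else:                   # otherwise the roles are swapped
                        p = (0.5 ** a * (1 - v[j][i + a])
                             + (1 - 0.5 ** a) * (1 - v[j][i]))
                    if p > best:
                        best, best_a = p, a
                w[i][j] = best
                a_opt[i][j] = best_a
        v = w
    return v, a_opt
```

For example, a thrower with 9 points against an observer with 0 points and one round left wins with probability 1/2 + 1/2 · 1/2 = 3/4.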
With the same changes as we made to the Roulette code, we obtain the code for the main problem: Piglet with an infinite horizon.
function[prob,act]=pigletIH(upperbound,initialplayer,initialobserver);
initialplayer = initialplayer+1;
initialobserver = initialobserver+1;
for n=[1:11]
for m=[1:10]
if n==11
vold(n,m)=1;
else
vold(n,m)=(1/2);
end
end
end
difference=1;
counter=0;
while difference>upperbound;
counter = counter + 1;
for i=[1:11];
for j=[1:10];
largest = 0;
if i == 11;
largest=1;
maxaction=0;
else
for a=[1:11-i]
if a == 11-i;
maxprob = (1/2)^a+(1-(1/2)^a)*(1-vold(j,i));
else
maxprob = (1/2)^a*(1-vold(j,i+a))+(1-(1/2)^a)*(1-vold(j,i));
end
if maxprob>largest;
largest=maxprob;
maxaction=a;
end
end
end
vnew(i,j)=largest;
anew(i,j)=maxaction;
end
end
maxdiff=max(max(vnew-vold)); % vnew-vold is a matrix, so take the overall maximum
mindiff=min(min(vnew-vold));
difference=maxdiff-mindiff;
vold=vnew;
end
counter = counter
prob = vnew(initialplayer,initialobserver)
act = anew(initialplayer,initialobserver)
A.5 Various graphs for Piglet
To produce the graphs for the probabilities and the corresponding optimal first actions we
use the following code. We use the plot option surf to show the behaviour of the function.
Note that the corners of the rectangles give the probabilities and actions.
Remark A.1. The following code is for the finite horizon problem. In the case of an
infinite horizon problem we simply replace the function pigletFH by the function pigletIH.
x1=0:10;
x2=0:9;
[x1g,x2g]=meshgrid(x2,x1);
[Tstepprob,Tstepact]=pigletFH(4,x1,x2);
y=Tstepprob;
z=Tstepact;
figure
axis([0 9 0 10 0 1])
surf(x1g,x2g,y)
xlabel('initial state observer')
ylabel('initial state player')
zlabel('probability')
title('Maximal probability of winning Piglet')
hold on
figure
axis([0 9 0 10 0 10])
surf(x1g,x2g,z)
xlabel('initial state observer')
ylabel('initial state player')
zlabel('first action')
title('Optimal first action')
The following code provides a section of the graph of optimal actions where we fix the
score of the observer for T = 4.
x1=0:10;
[Tstepprob,Tstepact] = pigletFH(4,x1,3);
z=Tstepact;
figure
axis([0 10 0 10])
scatter(x1,z,3,'blue','filled')
xlabel('initial state player')
ylabel('first action')
title('Optimal first action')
The following code provides a section of the graph of optimal actions where we fix the
score of the player for T = 4.
x2=0:9;
[Tstepprob,Tstepact] = pigletFH(4,3,x2);
z=Tstepact;
figure
axis([0 10 0 10])
scatter(x2,z,3,'blue','filled')
xlabel('initial state observer')
ylabel('first action')
title('Optimal first action')
The following code provides a section of the graph of optimal actions where we fix the
score of the observer for T =∞.
x1=0:10;
[prob,act] = pigletIH(0.000001,x1,3);
z=act;
figure
axis([0 10 0 10])
scatter(x1,z,3,'blue','filled')
xlabel('initial state player')
ylabel('first action')
title('Optimal first action')
The following code provides a section of the graph of optimal actions where we fix the
score of the player for T =∞.
x2=0:9;
[prob,act] = pigletIH(0.000001,3,x2);
z=act;
figure
axis([0 10 0 10])
scatter(x2,z,3,'blue','filled')
xlabel('initial state observer')
ylabel('first action')
title('Optimal first action')
Bibliography
[1] Hill, T. P. (2007). “Knowing when to stop”, American Scientist, 97, 126-133.
[2] Norris, J. R. (1997). Markov Chains, Cambridge University Press, Cambridge, United Kingdom.
[3] Nunez Queija, R. (2011). Markov Decision Processes, The Netherlands.
[4] Parmigiani, G., Inoue, L. (2009). Decision Theory, United Kingdom.
[5] Tijms, H. C. (2012). Understanding Probability, third edition, Cambridge University
Press, New York.
[6] Tijms, H. C. (2015). Optimal Stopping and the One-Stage-Look-Ahead Rule, VU Ams-
terdam, The Netherlands.
[7] Weber, R. (2014). Optimization and Control, Class Notes, University of Cambridge, http://www.statslab.cam.ac.uk/~rrw1/oc/oc2014.pdf