Markov Decision Processes
& One-stage-look-ahead
Gideon Jager
July 14, 2016
Bachelor Thesis Mathematics
Supervisor: prof. dr. R. Nunez Queija
Korteweg-de Vries Instituut voor Wiskunde
Faculteit der Natuurwetenschappen, Wiskunde en Informatica
Universiteit van Amsterdam
Abstract
One must make decisions many times in one’s life. Certain decisions work out better
than others in specific situations. In this thesis, we consider games where one has to make
decisions in order to maximize the probability of achieving a certain goal. We will start
with a discussion of Markov chains in discrete time and Markov decision processes. In
particular, we state the Bellman optimality condition. Then we demonstrate how this
theory works for the game Roulette. Our main goal is to use the Markov decision theory
to maximize the probability of winning the game Piglet and determine the corresponding
optimal strategies. Finally, we study a different class of decision rules, the one-stage-look-
ahead rules and apply them to Piglet. We compare the one-stage-look-ahead rule with the
results obtained from the Bellman optimality condition.
Title: Markov Decision Processes
& One-stage-look-ahead
Author: Gideon Jager, [email protected], 10565744
Supervisor: prof. dr. R. Nunez Queija
Second grader: prof. dr. M. R. H. Mandjes
Date: July 14, 2016
Korteweg-de Vries Instituut voor Wiskunde
Universiteit van Amsterdam
Science Park 904, 1098 XH Amsterdam
http://www.science.uva.nl/math
Contents
Prelude
1 Markov decision processes
  1.1 Discrete time Markov chains
  1.2 Decision rules
  1.3 Bellman's condition
2 Roulette
  2.1 Roulette as a Markov decision process
  2.2 Roulette with a finite horizon
  2.3 Roulette with an infinite horizon
3 Piglet
  3.1 The coin flipping game
  3.2 Piglet with a finite horizon
  3.3 Piglet with an infinite horizon
  3.4 A small horizon
4 One-stage-look-ahead
  4.1 One-stage-look-ahead decision rules
  4.2 One-stage-look-ahead for Piglet
  4.3 Optimality of one-stage-look-ahead
5 Conclusion
6 Popular summary (in Dutch)
A Matlab code
  A.1 Roulette
  A.2 Graphs
  A.3 Stopping criterion
  A.4 Piglet
  A.5 Various graphs for Piglet
Bibliography
Prelude
In life, people have to make decisions many times. Some decisions involve a certain risk.
Hill [1] gives the example of when to stop watching and start evacuating when a hurricane
is approaching and states that bad timing can be ruinous. So we could ask ourselves the
question: when is it optimal to stop? Other questions considering optimality appear in
game theory such as: what is the maximum probability of winning given the current state
of the process? In this thesis we will consider problems where we want to find an optimal
strategy: a strategy that maximizes the probability of achieving a certain goal. To this end,
we use the theory of Markov chains, first developed in the early twentieth century
by the Russian mathematician Andrey Markov.
Andrey Markov (1856 - 1922).
It will be our goal to find these maximum probabilities and optimal actions for the game
called Piglet. In order to do this we will begin with formulating the necessary theory of
Markov chains and extend this with the theory of Markov decision processes. In the second
chapter we show how this theory works by looking at another game, Roulette. Next, in
the third chapter we look at the main problem, the Piglet game. In the fourth chapter,
we will again consider the Piglet game but with a different type of strategy, the so-called
one-stage-look-ahead rule, and compare this to the results from the third chapter.
Gideon Jager,
Amsterdam, July 14, 2016
1 Markov decision processes
In this chapter we will discuss the theory that will be necessary in the next two chapters.
Before we can discuss Markov decision processes we need to look at some concepts of
Markov chains. We will discuss these concepts in the first section by following Norris [2].
1.1 Discrete time Markov chains
We start this paragraph with some notation, before stating a formal definition of
a Markov chain. To this end, we recall the definitions of a probability distribution and of a
stochastic matrix.
Definition 1.1 (State space and distribution). Let I be a countable set. We call I
the state space and every i ∈ I a state. λ = (λ_i : i ∈ I) is a measure on I if λ_i ∈ [0, ∞) for
all i ∈ I. We call a measure λ a distribution if ∑_{i∈I} λ_i = 1.
Definition 1.2 (Stochastic matrix). A matrix P = (p_{ij} : i, j ∈ I) is called stochastic if
(p_{ij} : j ∈ I) is a distribution for all i ∈ I, i.e. the sum of the elements in every row equals 1.
Now that we have specified our notation, we introduce Markov chains. Norris [2] gives many
examples of Markov chains. One of these examples is about a flea hopping between the
vertices of a triangle. The idea of this example is that the vertex the flea will jump
to next depends only on the vertex where the flea currently sits and not on the vertices
it visited in the past. This leads to the following definition of a Markov chain.
Definition 1.3 (Markov chain). Let (X_n)_{n≥0} be a sequence of random variables. (X_n)_{n≥0} is
a Markov chain with initial distribution λ and transition matrix P if X_0 has distribution λ
and for n ≥ 0, conditional on X_n = i, X_{n+1} has distribution (p_{ij} : j ∈ I) and is independent
of X_0, . . . , X_{n−1}.
We can also state this using the notation introduced above as follows:
for n ≥ 0 and i_1, . . . , i_{n+1} ∈ I,

P(X_0 = i_1) = λ_{i_1};
P(X_{n+1} = i_{n+1} | X_0 = i_1, . . . , X_n = i_n) = p_{i_n i_{n+1}}.
Basically, a Markov chain can be represented as a directed graph where the vertices are
the states and the edges represent the possible jumps, as Norris [2] shows in his examples.
In the upcoming chapters, we will look at problems where we can take certain actions
whenever we enter a state. If we stop we receive a terminal reward and if we continue
some costs might be included. Such problems where we try to maximize the reward or
minimize the costs over time are called optimal stopping problems. Examples of optimal
stopping problems are given in Tijms [5]. In the next two chapters we try to maximize
certain probabilities with respect to stopping rules. To that end, we define Markov
decision processes and discuss several optimality conditions and solution methods for
determining an optimal decision rule for such optimal stopping problems. In the next two
paragraphs, we study this theory following Nunez Queija [3].
1.2 Decision rules
This paragraph gives an introduction into decision theory. In decision theory we often
use Markov decision processes. A Markov decision process (MDP) is similar to a Markov
chain. We again assume that the state space I is countable. An MDP also has a finite
action space A and in an MDP we are allowed to choose an action and each action leads to
different transition probabilities. In other words, we can choose the transition probabilities
from a collection of probability distributions. If the process (Xn)n≥0 has an action space
with only one element such that we always choose the same action and always have the
same transition probabilities, then the MDP reduces to a regular Markov chain.
Let An denote the action taken at time n. Regarding the states and actions we further
assume that
P(X_{n+1} = i_{n+1} | X_0 = i_0, . . . , X_n = i_n, A_0 = a_0, . . . , A_n = a_n)
= P(X_{n+1} = i_{n+1} | X_n = i_n, A_n = a_n)

for i_0, . . . , i_{n+1} ∈ I and a_0, . . . , a_n ∈ A. From now on, we will denote the right hand
side of this equation by p_{a_n}(i_n, i_{n+1}). From this equation we can conclude something
similar to the definition of a Markov chain. Once we know the state and the action at
time n, the state at time n + 1 is independent of the history before time n, which we write as

H_{n−1} := (X_0, A_0, . . . , X_{n−1}, A_{n−1}).

We also add a reward, r_a(i, j), when action a is taken and the process moves from state
i to state j. The expected reward when we choose action a in state i is therefore

r_a(i) = ∑_{j∈I} p_a(i, j) r_a(i, j).

We assume that this expected reward is uniformly bounded, which means that there exists
an M > 0 such that |r_a(i)| < M for all i ∈ I, a ∈ A.
In short, in this thesis a Markov decision process (MDP) consists of a countable state
space I, a finite action space A, transition probabilities for each action and a reward for
all possible transitions given an action. If we have a decision problem over a period T ∈ N,
then we have a finite horizon decision problem.
In the next sections, it will be our goal to find an optimal strategy for optimal stopping
problems, like the Piglet game in chapter 3. Therefore, we will define what a strategy is
and how it is related to decision rules.
Definition 1.4 (Decision rule). Let h_{n−1} be a realization of H_{n−1}, which consists of all
states that have been visited and all actions that have been taken up to and including
time n − 1. A strategy is a sequence

s_n^{a_n}(h_{n−1}, i_n) ∈ [0, 1], n ∈ Z_{≥0},

which gives the probability that action a_n is taken at time n. We call a strategy a decision
rule if for all n, the whole history and all states,

s_n^{a_n}(h_{n−1}, i_n) = 1

holds for precisely one a_n ∈ A. Then we write f = (f_0, f_1, f_2, . . .) for the decision rule where
f_n(h_{n−1}, i_n) = a_n. Whenever f_n(h_{n−1}, i_n) = f_n(i_n), the decision rule does not depend on
the history and the decision rule is a Markov decision rule.
1.3 Bellman’s condition
In the upcoming games we want to find a decision rule that maximizes the total reward.
Thus, we want to maximize the function
V sT (i) :=
T−1∑n=0
Es[rAn(Xn, Xn+1) | X0 = 1
]+ Es [q(XT ) | X0 = i]
where q is the function that defines the final reward when the process ends in a certain
state. There exists a recursive algorithm that appears in dynamic programming which
maximizes the function V sT over s and is called Bellman’s optimality condition. More
on dynamic programming can be found in Parmigiani [4]. We will state this algorithm in
order to use it on the problems in the upcoming chapters.
Theorem 1.5 (Bellman's optimality condition). Let V_0(i) := q(i) and let

V_n(i) := max_{a∈A} { r_a(i) + ∑_{j∈I} p_a(i, j) V_{n−1}(j) }   for all n ∈ N.

Then, V_n(i) maximizes the function V_n^s(i) over all strategies s, for all n ∈ N. The
corresponding decision rule is f = (f_n, f_{n−1}, f_{n−2}, . . . , f_1), given by

f_n(i) = arg max_{a∈A} { r_a(i) + ∑_{j∈I} p_a(i, j) V_{n−1}(j) }.
For a proof we refer to Nunez Queija [3].
Remark 1.6. This decision rule does not tell us what to do at time n, but it gives the
action we should take when we have n steps to go in the process. So, first we take the
action given by fn, then take the action given by fn−1, etc.
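The backward recursion of Theorem 1.5 and Remark 1.6 is easy to implement. The following is an illustrative sketch in Python (the thesis's own programs are written in Matlab and listed in appendix A; the function name `bellman` and the dictionary-based data layout below are our own assumptions):

```python
def bellman(q, reward, prob, states, actions, T):
    """Backward recursion of Bellman's optimality condition.

    q[i]          -- terminal reward q(i)
    reward[a][i]  -- expected one-step reward r_a(i)
    prob[a][i][j] -- transition probability p_a(i, j)
    Returns V_T and the decision rule f = (f_T, ..., f_1);
    f_T is used first (with T steps to go), then f_{T-1}, etc.
    """
    V = dict(q)                                   # V_0(i) = q(i)
    rules = []
    for n in range(1, T + 1):
        V_new, f = {}, {}
        for i in states:
            # value of each action: r_a(i) + sum_j p_a(i, j) V_{n-1}(j)
            vals = {a: reward[a][i]
                       + sum(prob[a][i][j] * V[j] for j in states)
                    for a in actions}
            f[i] = max(vals, key=vals.get)        # the arg max of Theorem 1.5
            V_new[i] = vals[f[i]]
        V = V_new
        rules.insert(0, f)                        # so that rules = [f_T, ..., f_1]
    return V, rules
```

On a toy two-state chain where a "risky" action reaches an absorbing goal state with probability 1/2 and a "safe" action stays put, two iterations give V_2 = 0.75 for the starting state, in line with the recursion above.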
Determining an optimal decision rule that maximizes the reward is not the only use of
Bellman’s optimality condition. We conclude this chapter by showing how it could also be
used to maximize the probability of a certain goal. Suppose we want to reach a certain
state j before a given time T. Then, we want to maximize the entrance probability of that
state,

P_T^s(i) := P_s(T_j ≤ T | X_0 = i),

where T_j = inf{n ≥ 1 : X_n = j} is the first passage time of state j. To prevent
the entrance probability from exceeding 1, we set p_a(j, j) = 1; then the process cannot
escape state j once it enters this state. We set the rewards r_a(i, k) equal to 1 if k = j, and
0 otherwise. By setting n = T and q(i) = 0 in Bellman's optimality condition, we get

P_T^s(i) = V_T^s(i).
Thus, we can use Bellman’s condition to maximize entrance probabilities which will be
done frequently in the next two chapters.
2 Roulette
In this chapter we consider the problem one faces at the roulette table. It is a good example
to show how the theory of the previous chapter works. If one plays Roulette, should one
bet, and if so, how much? Using Bellman’s optimality condition, we will determine the
decision rule. Suppose we start with €75,- and that we want to go home with €200,-. The
roulette table has eighteen red holes, eighteen black holes and one white hole. We assume
that we are only allowed to bet on red or black. Each turn we bet an integer amount of
euros, which may differ from turn to turn. If we win in a single turn, we get back double
the amount we bet that turn. If we lose, we lose the whole amount we bet that turn.
We will now formulate this in terms of an MDP.
2.1 Roulette as a Markov decision process
This problem has a finite state space
I = {0, 1, . . . , 200} .
The action space consists of all possible bets
A = {0, 1, . . . , 199}.
For each state i we define the subset Ai of the action space by
Ai = {0, 1, . . . ,min{i, 200− i}},
since there is no reason to bet more euros than necessary to reach i = 200. If we bet a ∈ A_i
euros when we are in state i, we either go to state i + a with probability p_a(i, i + a) = 18/37
or to state i − a with probability p_a(i, i − a) = 19/37. Then we want to find the functions
(V_n)_{n=0}^∞ with

V_{n+1}(i) = max_{a∈A_i} { (18/37) V_n(i + a) + (19/37) V_n(i − a) },   (2.1)

where we set

V_0(i) = { 1 if i = 200,
           0 if i < 200.        (2.2)
These functions V_n(i) denote the probability of going home with €200,- if we have n
rounds left to play and i euros in our possession. Equation (2.1) is obtained
from the standard Bellman equation by setting r_a(i) = 0.
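To illustrate how equations (2.1) and (2.2) can be evaluated, here is a sketch of the recursion in Python (the program of appendix A is in Matlab; this is not that code, and the function name `roulette_values` is our own). Ties between equally good bets are broken toward the smaller bet:

```python
def roulette_values(T, goal=200, p_win=18/37):
    """Iterate equation (2.1) T times, starting from V_0 of equation (2.2).

    Returns V_T and, for each capital i, the optimal first action
    (the smallest optimal bet whenever the optimum is not unique).
    """
    V = [1.0 if i == goal else 0.0 for i in range(goal + 1)]  # V_0
    first_action = [0] * (goal + 1)
    for _ in range(T):
        V_new = list(V)
        for i in range(1, goal):
            best_v, best_a = V[i], 0                  # a = 0: bet nothing
            for a in range(1, min(i, goal - i) + 1):  # the action set A_i
                v = p_win * V[i + a] + (1 - p_win) * V[i - a]
                if v > best_v + 1e-12:                # strict improvement only
                    best_v, best_a = v, a
            V_new[i], first_action[i] = best_v, best_a
        V = V_new
    return V, first_action
```

For T = 2 this reproduces the hand calculations of the next section: V_2(75) = (18/37)² ≈ 0.2367 with smallest optimal first action 25, and V_2(150) = 1008/1369 ≈ 0.7363 with first action 50.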
2.2 Roulette with a finite horizon
In this section we take a finite horizon, i.e. T < ∞. To demonstrate how the Bellman
equation works, we consider the following example.
Example 2.1. Suppose we start with €125,- and bet only once. Then we only need V_0
and V_1. V_0 is the same as in equation (2.2). For V_1 we find

V_1(125) = max_{a≤75} { (18/37) V_0(125 + a) + (19/37) V_0(125 − a) }
         = max_{a≤75} { (18/37) V_0(125 + a) }
         = 18/37,

by the definition of V_0. Thus, the maximum probability of going home with €200,- equals
18/37 and is achieved when action a = 75 is taken. So the optimal strategy is f_1(125) = 75.
Now that we have seen an example where the initial state is given, we look at the problem
where we have a capital of i euros. We take horizon T = 2. We can write down V_0, V_1 and
V_2 and find the optimal actions. V_0 is defined as in equation (2.2). Using equation (2.1)
we get the following function when we have one round left to play:

V_1(i) = { 1      if i = 200,
           18/37  if i ∈ {100, 101, . . . , 199},
           0      if i < 100.

The first and third cases are obviously true, because we either have already reached our
goal or it is impossible to reach €200,-. The second case is true, because
max_{a∈A_i} { (18/37) V_0(i + a) + (19/37) V_0(i − a) }
= max_{a∈A_i} { (18/37) V_0(i + a) + (19/37) · 0 }
= max_{a∈A_i} { (18/37) V_0(i + a) }
= 18/37.

This function is maximized when we bet a = 200 − i euros.
With the same reasoning we can write down the function when we have two rounds left
to play:
V_2(i) = { 1          if i = 200,
           1008/1369  if i ∈ {150, 151, . . . , 199},
           18/37      if i ∈ {100, 101, . . . , 149},
           (18/37)^2  if i ∈ {50, 51, . . . , 99},
           0          if i < 50.
Again, the first and fifth cases are obviously true. The third case is the same as the
second case of V_1. In this case it is optimal to bet 200 − i euros, because we can reach
€200,- in one round, and playing a second round would decrease our probability of reaching
€200,-, since we would need to win twice instead of once. For the second case we get
i − a ∈ {100, 101, . . . , 199}, so we see that
max_{a∈A_i} { (18/37) V_1(i + a) + (19/37) V_1(i − a) } = (18/37) · 1 + (19/37) · (18/37) = 1008/1369.
For the fourth case we can calculate that

max_{a∈A_i} { (18/37) V_1(i + a) + (19/37) V_1(i − a) }
= max_{a∈A_i} { (18/37) V_1(i + a) + (19/37) · 0 }
= max_{a∈A_i} { (18/37) V_1(i + a) }
= (18/37)^2,

where we use that, given an amount of euros greater than or equal to 50 and less than
100, after one round we will have at most €198,-. Hence V_1(i + a) ≤ 18/37.
Returning to the original problem where we start with €75,- and want to go home with
€200,-, we have to play more than one round: we simply cannot double €75,- to €200,-.
Using the same V_0 as in the previous example, we find

V_2(75) = max_{a≤75} { (18/37) V_1(75 + a) + (19/37) V_1(75 − a) }
        = max_{a≤75} { (18/37) V_1(75 + a) }
        = (18/37)^2.
The second equality holds because V_1(75 − a) = 0 for all a ≤ 75, and if we bet at least €25,-
then V_1(75 + a) = 18/37; if we bet less than €25,- we cannot reach €200,- in two rounds,
and the probability of going home with €200,- equals 0. Hence we get the decision rule

f_2(75) = a_2,   f_1(i) = { 200 − i if i ∈ [100, 150],
                            0       if i < 100,

with a_2 ∈ [25, 75] the amount of money we bet when we have two more rounds to play.
Note that, when we have one more round to play, the optimal action is not unique.
We decide to choose the smallest optimal action, so that the loss is minimal in case we lose.
For large T it becomes more laborious and time-consuming to calculate V_T by hand.
So, in this case, we write a computer program in Matlab that calculates these
functions and the decision rules. This program is given in appendix A.1. The table below
contains the probabilities and optimal first actions for a few initial capitals.
Initial capital 0 50 100 150 200
Maximum probability 0 0.2367 0.4865 0.7363 1
Optimal first action 0 50 0 50 0
Table 2.1: The maximum probabilities and optimal first actions for
some initial capitals with horizon 2.
The scatterplots in Figure 2.1 below give us more information. They give the maximum
probabilities and the smallest optimal first actions for every initial capital.
Figure 2.1: Maximum probabilities and optimal first actions for
T = 2.
The red graph describes the maximum probability of going home with €200,- if one has
an amount of euros equal to the initial state. These maximum probabilities are attained if
one plays the optimal actions. The smallest optimal first actions are described in the blue
graph for each initial state. For the same reason as before, we choose the smallest optimal
actions. We will keep choosing the smallest optimal action in this game whenever the
optimal action is not unique.
The graph showing the maximum probabilities coincides with our calculations in paragraph
2.2. The graph of the optimal first actions tells us a few things about the game. We
see that for each initial state smaller than 50 the smallest optimal first action equals
zero, which is logical because even by doubling twice we cannot reach €200,-. For the initial
states greater than or equal to 100 and smaller than 150, the smallest optimal first action is
also equal to zero: the probability of reaching €200,- in two steps equals the probability
of reaching €200,- in one step, so it is not necessary to play two rounds and the
optimal first action is 0. When in a state i greater than or equal to 150 and less than 200,
it is not optimal to bet more than 200 − i euros, since we set the goal of reaching precisely
€200,-. Therefore, we see that the optimal first action decreases starting from the initial
state 150. A similar argument explains the decrease from the initial states 50 up to and
including 99.
The Matlab program works for every finite horizon. So, we will consider T = 3 too.
Again, we start by giving a few results from the program.
Initial capital 0 25 75 125 175 200
Maximum probability 0 0.1151 0.3582 0.6080 0.8646 1
Optimal first action 0 25 25 25 25 0
Table 2.2: The maximum probabilities and optimal first actions for
some initial capitals with horizon 3.
The graphs for T = 3 on the next page can be explained in a similar way as for T = 2.
We see that the maximum probability can now take nine different values. Also, the graph
of optimal first actions now contains four strictly decreasing subsequences. This behaviour
is explained in a similar way as for T = 2.
Figure 2.2: Maximum probabilities and optimal first actions for
T = 3.
2.3 Roulette with an infinite horizon
For large values of T we used a computer program. When we take T =∞ we will have to
add a stopping criterion. The program must calculate a probabilty over an infinite amount
of rounds but must give us that probability at a certain time. Therefore we use the stopping
criterion of the Successive Approximation Algorithm which is stated in Nunez Queija [3].
The code for this stopping criterion is given in appendix A.1.3. Using this program we
calculate the maximum probabilities and optimal first actions in Table 2.3.
Initial capital 0 16 59 99 163 200
Maximum probability 0 0.0716 0.2783 0.4806 0.8010 1
Optimal first action 0 0 9 1 12 0
Table 2.3: The maximum probabilities and optimal first actions for
some initial capitals with an infinite horizon.
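A minimal sketch of value iteration with such a stopping criterion, in Python: iterate the Bellman operator until successive value functions differ by less than a tolerance ε in every state. This is a plain sup-norm test; the precise criterion of the Successive Approximation Algorithm in Nunez Queija [3] is more refined, so the sketch below only approximates the approach of appendix A.

```python
def roulette_values_infinite(goal=200, p_win=18/37, eps=1e-9):
    """Iterate equation (2.1) until sup_i |V_{n+1}(i) - V_n(i)| < eps."""
    V = [1.0 if i == goal else 0.0 for i in range(goal + 1)]  # V_0 of (2.2)
    while True:
        V_new = list(V)
        for i in range(1, goal):
            # maximize over the action set A_i = {0, ..., min(i, goal - i)}
            V_new[i] = max(p_win * V[i + a] + (1 - p_win) * V[i - a]
                           for a in range(min(i, goal - i) + 1))
        diff = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if diff < eps:
            return V
```

Since V_0 is the terminal reward and the Bellman operator is monotone, the iterates increase towards the infinite-horizon value, and in this game the increments shrink quickly, so the loop terminates after a modest number of iterations.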
Using the code in appendix A.2 we get the scatterplots in Figure 2.3 on the next page:
Figure 2.3: Maximum probabilities and optimal first actions for
T =∞.
Now, the red line gives the probability of going home with €200,- for each initial capital
if we only stop when we either reach our goal or go broke. The blue graph gives the optimal
first action we must take in order to achieve this maximum probability. We noticed that
for T = 3 the maximum probability can take on more different values than for T = 2.
As expected, this number of values greatly increases when we consider an infinite horizon.
Once again, the blue graph gives the smallest optimal first actions for each initial state.
Now that we have seen how the theory of Markov decision processes works, we are ready
to use it for Piglet in the next chapter.
3 Piglet
In the previous chapters we studied Markov decision theory and its application at the
roulette table. In this chapter we will use this theory in order to achieve the main goal:
finding the decision rule that maximizes the probability of winning the Piglet game with
an infinite horizon. First, we explain the game and formulate it as an MDP.
3.1 The coin flipping game
Two players are involved in the Piglet game. Each round one player flips a fair coin,
possibly more than once. If heads, the flipper gets one point. If tails, the player loses
all points obtained in the current round and the next round starts, in which the other
player flips the coin. Each round, the player who flips the coin may decide to stop flipping
and freeze the points obtained in the current round. Freezing means that those points
can no longer be lost. After stopping, the next round starts and the other player may
flip the coin. In this game it is the goal of each player to be the first to obtain 10 points.
Notice that this game is two-dimensional. The game has the state space
I = {(i, j) ∈ {0, 1, . . . , 10} × {0, 1, . . . , 10}} \ {(10, 10)},
where i denotes the number of points of the player who is currently flipping the coin and
j denotes the number of points of the player who is waiting for his turn. Therefore, the
roles of i and j change when the game enters the next round. The action space consists of
the number of times we can flip the coin in one round
A = {0, 1, 2, . . . , 10}.
Once again, we will define a subset of the action space. Let A(i,j) ⊂ A be the action space
of the player who is going to flip the coin if he or she has i points and the opponent has j
points. We define A(i,j) = {0, 1, 2, . . . , 10− i} if j < 10. We also define A(i,10) = {0} such
that the game ends at the moment that one of the players has obtained 10 points.
For the transition probabilities, we get

p_a((i, j), (j, i + a)) = (1/2)^a   and   p_a((i, j), (j, i)) = 1 − (1/2)^a.
The first one is the probability of successfully flipping heads a times in a row and freezing
the a obtained points. The second one is the probability that one of the flips results in
tails, yielding no points.
Just as in the case of Roulette, we set the reward r(i) after each round equal to 0.
We need the functions (V_n)_{n∈Z≥0}, where V_n((i, j)) is the maximum probability of
winning when n more rounds are left to play for both players together, in order to
determine V_∞. The Bellman condition gives these functions:

V_{n+1}((i, j)) = max_{a∈A_{(i,j)}} { (1/2)^a (1 − V_n((j, i + a))) + (1 − (1/2)^a)(1 − V_n((j, i))) },   (3.1)

where the terms 1 − V_n appear because after the current round the roles of the two
players are interchanged, and where we set

V_0((i, j)) = { 0    if i < 10 and j = 10,
                1/2  for all i < 10, j < 10.        (3.2)
The first case of V0 means that one of the players has won in the previous round and
that the other player has lost the game. The second case of V0 defines a tie when the game
ends and none of the players have enough points to be the winner of the game.
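Equations (3.1) and (3.2) can be iterated directly. The following Python sketch is illustrative only (the thesis's Piglet program in appendix A is in Matlab and is based on an equivalent reformulation of the game; the names and data layout here are our own):

```python
def piglet_values(T, goal=10):
    """Iterate the Piglet recursion (3.1) T times, starting from (3.2).

    V[(i, j)] -- max probability of winning for the player about to flip,
                 holding i points while the opponent holds j points.
    """
    # V_0 of equation (3.2): 0 if the opponent has already won, 1/2 otherwise
    V = {(i, j): 0.0 if j == goal else 0.5
         for i in range(goal) for j in range(goal + 1)}
    # states (goal, j): the flipping player already won (never read below,
    # added only for completeness of the state space)
    V.update({(goal, j): 1.0 for j in range(goal)})
    action = {}
    for _ in range(T):
        V_new = dict(V)
        for i in range(goal):
            for j in range(goal):          # A_(i,10) = {0}: nothing to update
                vals = {a: (0.5 ** a) * (1 - V[(j, i + a)])
                           + (1 - 0.5 ** a) * (1 - V[(j, i)])
                        for a in range(goal - i + 1)}   # A_(i,j)
                a_star = max(vals, key=vals.get)
                V_new[(i, j)], action[(i, j)] = vals[a_star], a_star
        V = V_new
    return V, action
```

One iteration reproduces equation (3.3) of the next section: for T = 1 and i = 6 the value is (1/2)^4 + 1/2 − (1/2)^5 = 0.53125 with optimal action a = 4.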
3.2 Piglet with a finite horizon
In this section, we start with showing how to calcalute the functions (Vn)n∈Z≥0by hand.
Therefore, we now consider a finite horizon. We return to the infinite horizon problem in
the next paragraph.
Suppose we take T = 1. Then we want to calculate the maximal probability of reaching
state (10, j) in one round with j ≤ 9 and find the decision rule corresponding to this
maximal probability. We let V0((i, j)) as in equation (3.2) and will write down the function
V1((i, j)). We calculate that
max_{a∈A_{(i,j)}} { (1/2)^a (1 − V_0((j, i + a))) + (1 − (1/2)^a)(1 − V_0((j, i))) }

equals

max_{a∈A_{(i,j)}} { (1/2)^a (1 − V_0((j, i + a))) + 1/2 − (1/2)^{a+1} }.

We easily see that

(1/2)^a (1 − V_0((j, i + a))) + 1/2 − (1/2)^{a+1}
  = { (1/2)^{a+1} + 1/2 − (1/2)^{a+1}        if a < 10 − i,
      (1/2)^{10−i} + 1/2 − (1/2)^{10−i+1}    if a = 10 − i.

The first case equals 1/2, which is less than the second case. This implies that

V_1((i, j)) = (1/2)^{10−i} + 1/2 − (1/2)^{10−i+1}.   (3.3)
Thus the maximum probability of winning equals (3.3) and is attained when action a =
10− i is taken.
Just as for Roulette, this was easy to calculate, but the computation time increases as T
grows. We want to be able to give the maximum probabilities and the optimal first
actions for large T, where T may be infinite. Again, we write a computer program in
Matlab. This program is based on a slightly different, but equivalent, version of the
game. Using the first program in appendix A.4 we get for T = 1 and i = 6 that the
maximum probability of winning equals 0.5313 and that the optimal first action is a = 4.
These are exactly the numbers obtained by substituting i = 6 into equation (3.3), as the
reader can easily verify.
Using the program we show the numbers in Table 3.1, where we take the horizon to be
equal to 5, such that the reader can verify the graphs that follow.
Initial points player 0 1 4 6 7 7
Initial points observer 0 0 4 5 7 9
Maximum probability 0.5009 0.5028 0.5123 0.5665 0.5504 0.2874
Optimal first action 6 5 4 3 3 3
Table 3.1: The maximum probabilities and optimal first actions for
some initial states and T = 5.
The first program in appendix A.5 gives the graphs in Figure 3.1. The graph on the
left shows the maximum probabilities and the graph on the right shows the optimal first
actions.
Consider the graph that gives the maximum probabilities given the initial states of the
two contestants. We notice that whenever the initial state of the player (the contestant
who is flipping the coin in the current turn) increases, the maximum probability of winning
also increases which is to be expected. Also to be expected, the graph shows that whenever
the initial state of the observer (the contestant who flipped the coin in the previous turn
and will flip in the next turn) increases, the maximum probability of winning decreases.
Figure 3.1: Maximum probabilities and optimal first actions for
T = 5.
We now consider the graph that gives the optimal first actions. We notice that the
optimal first action, too, is influenced by the number of points of each player. In Figure
3.2 two more scatterplots are shown. These scatterplots show in what way the optimal
first action changes whenever the initial states of either the player or the observer increase
and the other score is fixed. Here we set this fixed score equal to 3.
Figure 3.2: Optimal first actions for T = 5 with fixed score of opponent.
The graph tells us that for larger initial states of the player, the optimal first action
decreases. This makes sense since the player is closer to his or her goal. Likewise, if the
initial state of the observer becomes relatively large, the optimal first action for the player
increases since he or she needs more heads to catch up with the observer.
3.3 Piglet with an infinite horizon
In the previous section we have seen how the game works and how to calculate V1 for T ≥ 1.
Now, we are interested in the infinite horizon case. We will use a computer program similar
to the one before but, just as in the case of an infinite horizon in Roulette, we add the
stopping criterion of appendix A1.2. This results in the second computer program given
in appendix A1.4. Again, we start by giving some numbers such that the reader can verify
the graphs that follow. We take the same initial states as for T = 5.
Initial points player 0 1 4 6 7 7
Initial points observer 0 0 4 5 7 9
Maximum probability 0.5264 0.5922 0.5310 0.6641 0.5455 0.2222
Optimal first action 2 2 2 1 1 3
Table 3.2: The maximum probabilities and optimal first actions for
some initial states and T =∞.
Using the first program in appendix A.5, keeping Remark A.5.1 in mind, we obtain the
following graph.
Figure 3.3: Maximum probabilities for T =∞.
This graph describes the maximum probabilities of ever winning Piglet given the initial
states for both contestants. The graph shows the same, but smoother, behaviour as in the
case T = 5, and this behaviour is explained in exactly the same way as before.
The graph of the optimal first actions also shows the same behaviour as before, but
contrary to the graph of the optimal first actions for T = 5, the graph for the infinite
horizon problem is less smooth, see Figure 3.4.
Figure 3.4: Optimal first actions for T =∞.
To give a better illustration of the influence of each contestant's initial state on the
optimal first action, we again take two sections. We fix one contestant's score at 3 and
consider the influence of the other contestant's score, see Figure 3.5 on the next page.
Figure 3.5: Optimal first actions for T = ∞ with fixed score of opponent.
The initial state of the player influences the optimal first action only slightly. However,
the initial state of the observer greatly increases the value of the optimal first action
once it becomes larger than seven. This is due to the fact that we allow our contestants
to continue playing until one of them has flipped heads ten times.
We have determined optimal actions for Piglet. In this chapter we used the Bellman
condition in equation (3.1). In the next chapter we will determine a one-stage-look-ahead
rule for Piglet and compare it to the optimal actions from this chapter.
3.4 A small horizon
If we play Piglet with a small horizon we may need to flip the coin mores times in one
round than when we play with a larger horizon and consider the same pair of initial states,
the points we have and the points our opponent has. For T ≤ 4 we get remarkable results
compared to T ≥ 5. We will show this for T = 4. The graph of the maximum probability
of winning resembles the same graph for T = 5, but the graph of the optimal first actions
is completely different than the same graph for T = 5, see Figure 3.6 on the next page.
Figure 3.6: Maximum probabilities and optimal first actions for
T = 4.
The differences are also visible in the sections below, where we fix the other player's score at the same value as for T = 5.
Figure 3.7: Optimal first actions T = 4 with fixed score of oppo-
nent.
From these graphs we conclude that, under the optimal actions, we have to take a larger risk when playing with a small horizon, whereas for a larger horizon the optimal actions in the same initial states are less risky.
4 One-stage-look-ahead
In the previous chapters we considered decision rules obtained from the Bellman optimality condition. In this chapter we introduce another type of decision rule. The strategies obtained in the previous chapter are useful if the state space is relatively small, as it was there. But if the state space becomes larger, it becomes harder to calculate and use the optimal strategies determined with the Bellman condition. Therefore, we consider the one-stage-look-ahead rule, also known as the one-step-look-ahead rule.
4.1 One-stage-look-ahead decision rules
This section briefly explains the theory of one-stage-look-ahead. We will follow Tijms [6]
in this section.
From now on, in each state there are only two possible actions: stop, or continue one more round and then make this comparison once more. One-stage-look-ahead determines whether stopping now is at least as good as continuing one more step. When the process is in state i and we choose to stop, we receive a terminal reward R(i). When we continue, we pay a cost c(i). We assume that one of the following conditions holds:
(i) sup_{i∈I} R(i) < ∞ and inf_{i∈I} c(i) > 0;

(ii) sup_{i∈I} R(i) < ∞ and c(i) = 0 for all i ∈ I, and there is a set S, consisting of the states where stopping is mandatory, which is reached within a finite amount of time when the action of continuing is always chosen in the states i ∉ S.
In this chapter, we will always assume that the second condition holds. Now, we define the set B of states in which the one-stage-look-ahead rule tells us that stopping is at least as good as continuing one more round and then stopping:

B = {i ∈ I : R(i) ≥ Σ_{j∈I} p_ij R(j)}.     (4.1)
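To illustrate (4.1), consider a toy stopping problem (an example of our own, not from the chapters above): roll a fair die, stop to collect the face value R(i) = i, or reroll, which moves to each face with probability 1/6 and costs nothing, so condition (ii) holds. A short Python sketch computes the set B:

```python
def one_stage_look_ahead(states, R, p):
    # States where stopping now is at least as good as
    # continuing one more step and then stopping, as in (4.1).
    return {i for i in states
            if R(i) >= sum(p(i, j) * R(j) for j in states)}

# Fair die: the expected value of one more roll is 3.5,
# so B consists of the faces 4, 5 and 6.
faces = range(1, 7)
B = one_stage_look_ahead(faces, lambda i: i, lambda i, j: 1 / 6)
```

The function names and the die example are our own illustration; the thesis itself applies (4.1) only to Piglet.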
We will use this theory to create a one-stage-look-ahead rule for Piglet and try to deter-
mine whether it is optimal or not.
4.2 One-stage-look-ahead for Piglet
We want to determine the set B specified in (4.1) for Piglet. To do so, we must add an extra rule to our game to make it an optimal stopping problem. From now on, each player is allowed to press a button at the end of his or her round. By pressing this button, the other player is allowed to play only one more round. If he or she overtakes the player who pressed the button, he or she wins; catching up results in a tie. Otherwise, the player who pressed the button wins.
Below, we state a one-stage-look-ahead rule for Piglet. The idea behind this rule is inspired by backward induction, as explained in Hill [1].
Our one-stage-look-ahead-rule is based on the probability of being overtaken by our
opponent. First of all, we determine the set S. S consists of all states in which stopping
is mandatory, so
S = {(10, j) : 0 ≤ j ≤ 9} ∪ {(i, 10) : 0 ≤ i ≤ 9} .
Secondly, we define the following stopping rule. The button must be pressed if

P(overtaken or caught up by opponent) < 1/2.     (4.2)

In other words,

(1/2)^(i−j) < 1/2.     (4.3)
Here we take 0 ≤ j < i ≤ 9. All states not mentioned so far are states in which we are not ahead of our opponent; in those states, stopping could never be optimal. Now, we define the set

B′ := {(i, j) : (1/2)^(i−j) < 1/2}.

Then it immediately follows that

B′ = {(i, j) : i − j ≥ 2}.
We must also consider the set of states in which we want to throw once more and then stop. Since we decide either to stop at the end of this round or to continue one more round and then stop, we flip at least once more between the moment of this decision and the moment we press the button. Therefore, we define

B = {(i, j) : i − j ≥ 1}.
Using this set B we define the following one-stage-look-ahead rule. If we have entered the set B, we throw only once more and stop at the end of this round; otherwise, we throw (i − j + 1) times and continue one more round.
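As a quick sanity check (a sketch of our own, not part of the derivation), the reduction of inequality (4.3) to i − j ≥ 2, and the shift by one flip that yields B, can be verified directly:

```python
# All states with 0 <= j < i <= 9, i.e. the states in which we lead.
leading = [(i, j) for i in range(10) for j in range(i)]

# B' collects the leading states satisfying (1/2)^(i-j) < 1/2 ...
B_prime = {(i, j) for (i, j) in leading if 0.5 ** (i - j) < 0.5}
# ... which for integer scores is exactly the condition i - j >= 2.
assert B_prime == {(i, j) for (i, j) in leading if i - j >= 2}

# Shifting the required lead by one flip gives the set B of the rule.
B = {(i, j) for (i, j) in leading if i - j >= 1}
```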
4.3 Optimality of one-stage-look-ahead
In the previous section we determined the set B that tells us whether to stop after the
next flip or to continue after the next flip. In this section we will determine whether the
one-stage-look-ahead strategy defined by this set is optimal or not. One way of determin-
ing whether such a set gives an optimal one-stage-look-ahead rule was given by Richard
Weber. In [7] he proved that the one-stage-look-ahead rule is optimal if the set B is closed. Unfortunately, his theorem is not applicable here: the states of the set B determined in the previous paragraph do not even communicate. For a definition of communicating states, we refer to Norris [2].
So, we have to find another way to determine optimality. We will compare the set B to
the optimal first actions given by Bellman’s optimality condition. It is optimal to stop if,
for the pair of initial states, the optimal first action is a = 1. Therefore, we will determine
the set C of pairs of initial states with optimal first action a = 1 using Figure 3.4. We
notice that
B \ C = ∅ and C \B = {(8, 7)} .
Hence our one-stage-look-ahead rule gives an optimal strategy in many states, but is not
completely optimal.
5 Conclusion
We began with the study of Markov decision processes (MDPs) and used this theory to determine optimal strategies. The first MDP we considered was the game Roulette, where we used the Bellman condition to calculate both the maximum probability of going home with €200 and the corresponding optimal first actions. In the case of horizon T = 3, which means that we have three more rounds to play, we obtained the maximum probabilities and optimal first actions using a computer program. We saw that the optimal first action was to bet such an amount that, should we win, we can reach €200 by repeatedly doubling. For example, if we have €20, we bet €5 on the first turn. If we win, we have €25, which we can keep doubling to reach €200. If we lose, we are left with €15 and have to bet €10 to reach €25.
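The doubling argument can be checked with a few lines of arithmetic (a sketch of our own; the amounts are those from the example above):

```python
def doubles_to(amount, target=200):
    # Does repeatedly doubling the amount land exactly on the target?
    while 0 < amount < target:
        amount *= 2
    return amount == target

assert doubles_to(25)       # winning the first bet: 25 -> 50 -> 100 -> 200
assert doubles_to(15 + 10)  # after a loss: 15 plus a bet of 10 aims at 25 again
assert not doubles_to(20)   # 20 itself is not on a doubling path to 200
```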
Next, we considered the case where we may play forever. We took the computer program for the finite horizon problem, made some small changes and added a stopping criterion in order to obtain the maximum probabilities of ever going home with €200 given a certain amount of money, and the optimal first actions achieving these maximum probabilities. These probabilities and actions are shown in Figure 5.1.
Figure 5.1: Maximum probabilities and optimal first actions for
T =∞.
In Chapter 3 we used the knowledge of the first two chapters to consider our main problem, the Piglet game, another MDP. We wrote a computer program for the finite horizon problem and modified it in the same way as in Chapter 2 to obtain a program for the infinite horizon problem. As for Roulette, this program used the Bellman condition to derive the maximum probabilities, the corresponding optimal actions and the graphs that show these probabilities and actions, see Figure 5.2.
Figure 5.2: Maximum probabilities and optimal first actions for
Piglet if T =∞.
We concluded that the function showing the maximal probabilities is an increasing func-
tion of the initial state of the player and a decreasing function of the initial state of the
observer. Here, the player is the contestant who flips the coin in the current turn and
the observer is the contestant who will flip the coin in the next turn. Also, the graph of
the optimal first actions is a decreasing function of the initial state of the player and an
increasing function of the initial state of the observer.
In Chapter 4, we considered a different type of decision rule, the one-stage-look-ahead rules, to derive possible optimal actions for Piglet. We defined a one-stage-look-ahead rule based on the probability that our opponent can catch up in points. If that probability is small enough, we decided to stop and press a button so that our opponent was only allowed to play one more round; otherwise, we decided to take the optimal action of the Bellman condition in Chapter 3. We concluded the chapter by determining whether this strategy is optimal. To that end, we considered the set of initial states for which the Bellman condition gives optimal first action 1, and compared it with the set of states for which our one-stage-look-ahead rule tells us to flip the coin once more and then press the button. We noticed that these sets differ in only one pair of initial states, which means that the one-stage-look-ahead rule is optimal in most states, but not completely optimal.
6 Popular summary (translated from Dutch)
Markov chains are processes that, given the present, are independent of their history. That is, the only thing that influences the next state of the process is the state the process is in now. Markov chains can be extended to Markov decision processes; we do this by adding a set of actions. Decision rules tell us which action to take, and these decision rules can be defined in terms of strategies. A strategy gives the probability with which we choose a certain action. If this probability equals 1 for exactly one of the actions, we call the strategy a decision rule. If a decision rule is independent of the previous states and the actions taken, we call it a Markov decision rule; the theory of such rules is called Markov decision theory. We can use this theory to determine optimal actions in various games of chance. In these games we may assume either that we play a fixed number of rounds or that we may keep playing indefinitely. The first case, where the number of rounds is fixed at the start, is called a finite horizon problem. The other case, where we may keep playing indefinitely, is called an infinite horizon problem.
One way to determine optimal actions is to use the Bellman optimality condition. This recipe computes the maximum probability of reaching a certain goal within a certain number of rounds from the maximum probability of reaching that same goal when playing one round fewer. The actions that maximize this winning probability are the optimal actions.
With this Bellman optimality condition we determined the maximum probability and the corresponding optimal actions for the games Roulette and Piglet. In Roulette, the goal was to go home with €200 when we start gambling with an amount smaller than €200.
Piglet is a coin-flipping game played against an opponent. In each round, one of the players flips a fair coin, possibly several times. If the coin lands heads, the thrower earns a point. On tails, the thrower loses all points earned in the current round, and the next round begins, in which the opponent may flip. In each round, the thrower may decide after every flip to stop flipping and bank the points earned. These points can then no longer be lost, and a new round begins, which means that it is the opponent's turn.
For every combination of our points and our opponent's points we determined the maximum winning probability and the optimal actions. We displayed this information in graphs, so that for every possible situation in the game we know which action to choose.
These graphs show a nice pattern. As our number of points increases, the maximum winning probability rises. Likewise, as our opponent's number of points increases, the maximum winning probability falls.
Another way to find actions is one-stage-look-ahead. The name speaks for itself: the one-stage-look-ahead rule determines whether it is at least as good to stop now as to play one more round and then stop. For Piglet this means that at the start of a round we decide either to stop at the end of the current round or to play one more round and then make the same trade-off again. We therefore want to determine a set of states in which stopping at the end of the current round is at least as good as playing one more round afterwards. If the process is in this set, the one-stage-look-ahead rule says that we must stop at the end of the current round. To determine whether the one-stage-look-ahead rule is optimal for Piglet (in the sense that it maximizes the winning probability), we compare this set with the graph of the optimal first actions obtained from the Bellman optimality condition. It turns out that, playing according to one-stage-look-ahead, the set of states in which we throw once more and then stop differs in only one possible state from the set of states in which throwing once is the optimal action according to the Bellman optimality condition.
A Matlab code
In this appendix we state the Matlab code for Roulette and for the Piglet game.
A.1 Roulette
In this section we state the program used for Roulette with a finite horizon.
function[Tstepprobability,Tstepaction] = rouletteFH(T,initial)
for n=[1:200]
if n==200
vold(n)=1;
else
vold(n)=0;
end
end
while T>0;
for i=[1:200];
largest = 0;
maxaction = 0;
if i==200
largest=1;
else
for a=[0:min(i,200-i)]
if a<i
maxprob = (18/37)*vold(i+a)+(19/37)*vold(i-a);
else
maxprob = (18/37)*vold(i+a);
end
if maxprob>largest;
largest=maxprob;
maxaction=a;
end
end
end
vnew(i)=largest;
anew(i)=maxaction;
end
vold=vnew;
T=T-1;
end
Tstepprobability = vnew(initial)
Tstepaction = anew(initial)
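For readers without Matlab, the same value iteration can be sketched in Python (a translation of our own; names and layout differ from the program above):

```python
def roulette_fh(T):
    # v[i] is the maximum probability of reaching a fortune of 200
    # from fortune i; an even-money bet is won with probability 18/37.
    v = [0.0] * 201
    v[200] = 1.0
    a_opt = [0] * 201
    for _ in range(T):
        w = [0.0] * 201
        w[200] = 1.0
        for i in range(1, 200):
            best, best_a = 0.0, 0
            for a in range(0, min(i, 200 - i) + 1):
                p = (18 / 37) * v[i + a]
                if a < i:  # only then does losing leave a positive fortune
                    p += (19 / 37) * v[i - a]
                if p > best:
                    best, best_a = p, a
            w[i], a_opt[i] = best, best_a
        v = w
    return v, a_opt
```

With one round left, the only way to reach 200 from 100 is to stake everything, and indeed roulette_fh(1) gives probability 18/37 and action 100 in state 100.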
A.2 Graphs
The following code produces the graphs throughout the second and third chapters. The code below is specific to Roulette with a finite horizon; for the case of an infinite horizon we apply the function rouletteIH instead of rouletteFH.
initial=1:200;
[Tstepprobability,Tstepaction] = rouletteFH(3,initial);
y=Tstepprobability;
z=Tstepaction;
figure
axis([0 200 0 1])
scatter(initial,y,3,'red','filled')
xlabel('initial state')
ylabel('probability')
title('Maximal probability of reaching 200')
hold on
figure
axis([0 200 0 100])
scatter(initial,z,3,'blue','filled')
xlabel('initial state')
ylabel('first action')
title('Optimal first action')
A.3 Stopping criterion
For the case of an infinite horizon we need a stopping criterion, as mentioned in Section 2.3, where we chose the stopping criterion of the Successive Approximation Algorithm. The following code computes this criterion.
maxdiff=max(vnew-vold);
mindiff=min(vnew-vold);
difference=maxdiff-mindiff;
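In Python the same span criterion would read (a sketch of our own, for value vectors stored as lists):

```python
def span(v_new, v_old):
    # Span of the difference: maximum minus minimum over all states.
    diffs = [a - b for a, b in zip(v_new, v_old)]
    return max(diffs) - min(diffs)

# The iteration stops once span(v_new, v_old) drops below the
# chosen upper bound.
```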
In order to use this criterion we need to make a slight change in the while loop. Instead of the horizon T, we pass an upper bound as input argument to the Matlab function. We also replace “while T > 0” by
difference = 1;
while difference >= upperbound;
The remaining code is unchanged. This results in the following code for an infinite
horizon.
function[probability,action] = rouletteIH(upperbound,initial)
for n=[1:200]
if n==200
vold(n)=1;
else
vold(n)=0;
end
end
difference = 1;
counter = 0;
while difference >= upperbound;
counter = counter + 1;
for i=[1:200];
largest = 0;
maxaction = 0;
if i==200
largest=1;
else
for a=[0:min(i,200-i)]
if a<i
maxprob = (18/37)*vold(i+a)+(19/37)*vold(i-a);
else
maxprob = (18/37)*vold(i+a);
end
if maxprob>largest;
largest=maxprob;
maxaction=a;
end
end
end
vnew(i)=largest;
anew(i)=maxaction;
end
maxdiff=max(vnew-vold);
mindiff=min(vnew-vold);
difference=maxdiff-mindiff;
vold=vnew;
end
counter = counter
probability = vnew(initial)
action = anew(initial)
A.4 Piglet
We start with the code for the finite horizon problem.
function[Tstepprob,Tstepact]=pigletFH(T,initialplayer,initialobserver);
initialplayer = initialplayer+1;
initialobserver = initialobserver+1;
for n=[1:11]
for m=[1:10]
if n==11
vold(n,m)=1;
else
vold(n,m)=(1/2);
end
end
end
while T>0;
for i=[1:11];
for j=[1:10];
largest = 0;
if i == 11;
largest=1;
maxaction=0;
else
for a=[1:11-i]
if a == 11-i;
maxprob = (1/2)^a+(1-(1/2)^a)*(1-vold(j,i));
else
maxprob = (1/2)^a*(1-vold(j,i+a))+(1-(1/2)^a)*(1-vold(j,i));
end
if maxprob>largest;
largest=maxprob;
maxaction=a;
end
end
end
vnew(i,j)=largest;
anew(i,j)=maxaction;
end
end
vold=vnew;
T=T-1;
end
Tstepprob = vnew(initialplayer,initialobserver)
Tstepact = anew(initialplayer,initialobserver)
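As with Roulette, the finite-horizon backup can be sketched in Python (a translation of our own, with 0-based scores; it is not part of the Matlab programs used in the text):

```python
def piglet_fh(T):
    # v[i][j]: maximum probability that the current thrower, holding
    # i points, wins against an observer holding j points; target is 10.
    v = [[1.0 if i == 10 else 0.5 for _ in range(10)] for i in range(11)]
    a_opt = [[0] * 10 for _ in range(11)]
    for _ in range(T):
        w = [[1.0 if i == 10 else 0.0 for _ in range(10)] for i in range(11)]
        for i in range(10):
            for j in range(10):
                best, best_a = 0.0, 0
                for a in range(1, 11 - i):  # aim for a more heads this round
                    if i + a == 10:         # reaching 10 wins immediately
                        p = 0.5 ** a + (1 - 0.5 ** a) * (1 - v[j][i])
                    else:                   # otherwise the roles are swapped
                        p = (0.5 ** a * (1 - v[j][i + a])
                             + (1 - 0.5 ** a) * (1 - v[j][i]))
                    if p > best:
                        best, best_a = p, a
                w[i][j] = best
                a_opt[i][j] = best_a
        v = w
    return v, a_opt
```

For example, a thrower with 9 points against an observer with 0 points and one round left wins with probability 1/2 + 1/2 · 1/2 = 3/4.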
With the same changes as we made to the Roulette code, we obtain the code for the main problem: Piglet with an infinite horizon.
function[prob,act]=pigletIH(upperbound,initialplayer,initialobserver);
initialplayer = initialplayer+1;
initialobserver = initialobserver+1;
for n=[1:11]
for m=[1:10]
if n==11
vold(n,m)=1;
else
vold(n,m)=(1/2);
end
end
end
difference=1;
counter=0;
while difference>upperbound;
counter = counter + 1;
for i=[1:11];
for j=[1:10];
largest = 0;
if i == 11;
largest=1;
maxaction=0;
else
for a=[1:11-i]
if a == 11-i;
maxprob = (1/2)^a+(1-(1/2)^a)*(1-vold(j,i));
else
maxprob = (1/2)^a*(1-vold(j,i+a))+(1-(1/2)^a)*(1-vold(j,i));
end
if maxprob>largest;
largest=maxprob;
maxaction=a;
end
end
end
vnew(i,j)=largest;
anew(i,j)=maxaction;
end
end
maxdiff=max(max(vnew-vold)); % vnew-vold is a matrix, so take the overall maximum
mindiff=min(min(vnew-vold));
difference=maxdiff-mindiff;
vold=vnew;
end
counter = counter
prob = vnew(initialplayer,initialobserver)
act = anew(initialplayer,initialobserver)
A.5 Various graphs for Piglet
To produce the graphs for the probabilities and the corresponding optimal first actions we
use the following code. We use the plot option surf to show the behaviour of the function.
Note that the corners of the rectangles give the probabilities and actions.
Remark A.1. The following code is for the finite horizon problem. In the case of an
infinite horizon problem we simply replace the function pigletFH by the function pigletIH.
x1=0:10;
x2=0:9;
[x1g,x2g]=meshgrid(x2,x1);
[Tstepprob,Tstepact]=pigletFH(4,x1,x2);
y=Tstepprob;
z=Tstepact;
figure
axis([0 9 0 10 0 1])
surf(x1g,x2g,y)
xlabel('initial state observer')
ylabel('initial state player')
zlabel('probability')
title('Maximal probability of winning Piglet')
hold on
figure
axis([0 9 0 10 0 10])
surf(x1g,x2g,z)
xlabel('initial state observer')
ylabel('initial state player')
zlabel('first action')
title('Optimal first action')
The following code provides a section of the graph of optimal actions where we fix the
score of the observer for T = 4.
x1=0:10;
[Tstepprob,Tstepact] = pigletFH(4,x1,3);
z=Tstepact;
figure
axis([0 10 0 10])
scatter(x1,z,3,'blue','filled')
xlabel('initial state player')
ylabel('first action')
title('Optimal first action')
The following code provides a section of the graph of optimal actions where we fix the
score of the player for T = 4.
x2=0:9;
[Tstepprob,Tstepact] = pigletFH(4,3,x2);
z=Tstepact;
figure
axis([0 10 0 10])
scatter(x2,z,3,'blue','filled')
xlabel('initial state observer')
ylabel('first action')
title('Optimal first action')
The following code provides a section of the graph of optimal actions where we fix the
score of the observer for T =∞.
x1=0:10;
[prob,act] = pigletIH(0.000001,x1,3);
z=act;
figure
axis([0 10 0 10])
scatter(x1,z,3,'blue','filled')
xlabel('initial state player')
ylabel('first action')
title('Optimal first action')
The following code provides a section of the graph of optimal actions where we fix the
score of the player for T =∞.
x2=0:9;
[prob,act] = pigletIH(0.000001,3,x2);
z=act;
figure
axis([0 10 0 10])
scatter(x2,z,3,'blue','filled')
xlabel('initial state observer')
ylabel('first action')
title('Optimal first action')
Bibliography
[1] Hill, T. P. (2007). “Knowing when to stop”, American Scientist, 97, 126-133.
[2] Norris, J. R. (1997). Markov Chains, Cambridge University Press, Cambridge, United Kingdom.
[3] Nunez Queija, R. (2011). Markov Decision Processes, The Netherlands.
[4] Parmigiani, G., Inoue, L. (2009). Decision Theory, United Kingdom.
[5] Tijms, H. C. (2012). Understanding Probability, third edition, Cambridge University
Press, New York.
[6] Tijms, H. C. (2015). Optimal Stopping and the One-Stage-Look-Ahead Rule, VU Ams-
terdam, The Netherlands.
[7] Weber, R. (2014). Optimization and Control, Class Notes, University of Cambridge, http://www.statslab.cam.ac.uk/~rrw1/oc/oc2014.pdf