
Planning under uncertainty: Markov decision processes

Christos Dimitrakakis

Chalmers

August 31, 2014

Contents

Subjective probability and utility
   Subjective probability
   Rewards and preferences

Bandit problems
   Introduction
   Bernoulli bandits

Markov decision processes and reinforcement learning
   Markov processes
   Markov decision processes
   Value functions
   Examples

Episodic problems
   Policy evaluation
   Backwards induction

Continuing, discounted problems
   Markov chain theory for discounted problems
   Infinite horizon MDP algorithms

Bayesian reinforcement learning
   Reinforcement learning
   Bounds on the utility
   Properties of ABC

Objective Probability

Figure: The double slit experiment

What about everyday life?

Subjective probability

Making decisions requires making predictions.
Outcomes of decisions are uncertain.
How can we represent this uncertainty?

Subjective probability
Describes which events we think are more likely.
We quantify this with probability.

Why probability?
Quantifies uncertainty in a “natural” way.
A framework for drawing conclusions from data.
Computationally convenient for decision making.

Assumptions about our beliefs

Our beliefs must be consistent. This can be achieved if they satisfy some assumptions:

Assumption 1 (SP1)
It is always possible to say whether one event is more likely than the other.

Assumption 2 (SP2)
If we can split events A, B in such a way that each part of A is less likely than its counterpart in B, then A is less likely than B.

There are also a couple of technical assumptions.

Resulting properties of relative likelihoods

Theorem 1 (Transitivity)
If A, B, D are such that A ≾ B and B ≾ D, then A ≾ D.

Theorem 2 (Complement)
For any A, B: A ≾ B iff A∁ ≿ B∁.

Theorem 3 (Fundamental property of relative likelihoods)
If A ⊂ B then A ≾ B. Furthermore, ∅ ≾ A ≾ S for any event A.

Theorem 4
For a given likelihood relation between events, there exists a unique probability distribution P such that

   P(A) ≥ P(B) ⇔ A ≿ B.

Similar results can be derived for conditional likelihoods and probabilities.


Rewards

We are going to receive a reward r from a set R of possible rewards.

We prefer some rewards to others.

Example 5 (Possible sets of rewards R)

R is a set of tickets to different musical events.

R is a set of financial commodities.

When we cannot select rewards directly

In most problems, we cannot just choose which reward to receive.
We can only specify a distribution on rewards.

Example 6 (Route selection)
Each reward r ∈ R is the time it takes to travel from A to B.
Route P1 is faster than P2 in heavy traffic and vice-versa.
Which route should be preferred, given a certain probability for heavy traffic?

In order to choose between random rewards, we use the concept of utility.

Utility

Definition 7 (Utility)
The utility is a function U : R → ℝ such that, for all a, b ∈ R,

   a ≿∗ b iff U(a) ≥ U(b).   (1.1)

The expected utility of a distribution P on R is

   E_P(U) = ∫_R U(r) dP(r).   (1.2)

Assumption 3 (The expected utility hypothesis)
The utility of P is equal to the expected utility of the reward under P. Consequently,

   P ≿∗ Q iff E_P(U) ≥ E_Q(U).   (1.3)

Example 8

r                      U(r)   P     Q
did not enter           0     1     0
paid 1 CU and lost     −1     0     0.99
paid 1 CU and won 10    9     0     0.01

Table: A simple gambling problem

            P     Q
E(U | ·)    0    −0.9

Table: Expected utility for the gambling problem
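To make the expected utility hypothesis concrete, here is a minimal Python sketch (not part of the original slides) that recomputes the table above; the outcome names and numbers are taken directly from Example 8.

# Expected utility for the gambling example: a minimal sketch.
# Each distribution assigns probabilities to the three outcomes; U maps outcomes to utilities.
U = {"did not enter": 0, "paid 1 CU and lost": -1, "paid 1 CU and won 10": 9}
P = {"did not enter": 1.0, "paid 1 CU and lost": 0.0, "paid 1 CU and won 10": 0.0}
Q = {"did not enter": 0.0, "paid 1 CU and lost": 0.99, "paid 1 CU and won 10": 0.01}

def expected_utility(dist, U):
    """E_dist(U) = sum_r dist(r) * U(r)."""
    return sum(dist[r] * U[r] for r in dist)

print(expected_utility(P, U))  # 0.0
print(expected_utility(Q, U))  # -0.9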

The St. Petersburg Paradox

A simple game [Bernoulli, 1713]
A fair coin is tossed until a head is obtained.
If the first head is obtained on the n-th toss, our reward will be 2^n currency units.
How much are you willing to pay to play this game once?

The probability to stop at round n is 2^−n. Thus, the expected monetary gain of the game is

   ∑_{n=1}^∞ 2^n 2^−n = ∞.

If your utility function were linear, you would be willing to pay any amount to play.
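The paradox is easy to reproduce numerically. The following Python sketch (an illustration, not from the slides) simulates the game: the sample average of the monetary payoff keeps drifting upwards as more games are played, while a concave utility of the payoff, here the logarithm, has a small and stable average, which is essentially Bernoulli's resolution.

import math
import random

# St. Petersburg game: first head on toss n pays 2^n currency units.
def play_once(rng):
    n = 1
    while rng.random() < 0.5:   # tails with probability 1/2: keep tossing
        n += 1
    return 2 ** n

rng = random.Random(0)
payoffs = [play_once(rng) for _ in range(100_000)]
print("average payoff:", sum(payoffs) / len(payoffs))                              # large and unstable
print("average log-utility:", sum(math.log(x) for x in payoffs) / len(payoffs))    # close to 2 log 2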


Summary

We can subjectively indicate which events we think are more likely.

Using relative likelihoods, we can define a subjective probability P for all events.

Similarly, we can subjectively indicate preferences for rewards.

We can determine a utility function for all rewards.

Hypothesis: we prefer the probability distribution (over rewards) with the highest expected utility.

Concave utility functions imply risk aversion (and convex, risk-taking).


Experimental design and Markov decision processes

The following problems

Shortest path problems.

Optimal stopping problems.

Reinforcement learning problems.

Experiment design (clinical trial) problems

Advertising.

can all be formalised as Markov decision processes.

Applications

Robotics.

Economics.

Automatic control.

Resource allocation


Bandit problems

Applications
Efficient optimisation.
Online advertising.
Clinical trials.
Robot scientist.

[Figure: plot omitted (x axis roughly −1 to 2, y axis 0 to 4).]

[Figure: Ultrasound]

The stochastic n-armed bandit problem

Actions and rewards
A set of actions A = {1, . . . , n}.
Each action gives you a random reward with distribution P(r_t | a_t = i).
The expected reward of the i-th arm is ρ_i ≜ E(r_t | a_t = i).

Utility
The utility is the sum of the rewards obtained,

   U ≜ ∑_t r_t.


Policy

Definition 9 (Policies)

A policy π is an algorithm for taking actions given the observed history.

Pπ(at+1 | a1, r1, . . . , at , rt)

is the probability of the next action at+1.

Bernoulli bandits

Example 10 (Bernoulli bandits)
Consider n Bernoulli distributions with parameters ω_i (i = 1, . . . , n), such that r_t | a_t = i ∼ Bern(ω_i). Then

   P(r_t = 1 | a_t = i) = ω_i,   P(r_t = 0 | a_t = i) = 1 − ω_i.   (2.1)

Then the expected reward for the i-th bandit is ρ_i ≜ E(r_t | a_t = i) = ω_i.

Exercise 1 (The optimal policy under perfect knowledge)
If we know ω_i for all i, what is the best policy?

A  At every step, play the bandit i with the greatest ω_i.
B  At every step, play the bandit i with probability increasing with ω_i.
C  There is no right answer. It depends on the horizon T.
D  It is too complicated.

The unknown reward case

Say you keep a running average of the reward obtained by each arm,

   ρ_{t,i} = R_{t,i} / n_{t,i},

where n_{t,i} is the number of times you played arm i and R_{t,i} is the total reward received from i, so that whenever you play a_t = i:

   R_{t+1,i} = R_{t,i} + r_t,   n_{t+1,i} = n_{t,i} + 1.

You could then choose to play the greedy strategy

   a_t = argmax_i ρ_{t,i}.

What should the initial values n_{0,i}, R_{0,i} be?
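As an illustration of the question above, here is a small Python sketch (not from the slides) of the greedy policy on two Bernoulli arms; the arm parameters 0.6 and 0.4 and the horizon are arbitrary choices. With R_{0,i} = n_{0,i}, larger initial values act as optimistic prior counts and force more initial exploration before the greedy choice locks in.

import random

def greedy_bandit(omegas, T=1000, n0=0.0, R0=0.0, seed=0):
    rng = random.Random(seed)
    n = [n0] * len(omegas)          # play counts, including the prior count n0
    R = [R0] * len(omegas)          # total rewards, including the prior reward R0
    total = 0.0
    for t in range(T):
        # greedy choice: the arm with the highest running average rho = R / n
        means = [R[i] / n[i] if n[i] > 0 else float("inf") for i in range(len(omegas))]
        a = max(range(len(omegas)), key=lambda i: means[i])
        r = 1.0 if rng.random() < omegas[a] else 0.0
        R[a] += r
        n[a] += 1
        total += r
    return total / T                # average reward over the run

for n0 in (0.0, 1.0, 10.0):         # the initializations compared on the next slides
    print("n0 = R0 =", n0, "-> average reward", greedy_bandit([0.6, 0.4], n0=n0, R0=n0))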

The uniform policy

[Figure: the running average reward ∑_{k=1}^t r_k / t over 1000 steps, compared with the arm means ρ_1, ρ_2.]

The greedy policy

[Figure: the same plot for the greedy policy with n_{0,i} = R_{0,i} = 0.]
[Figure: the same plot for the greedy policy with n_{0,i} = R_{0,i} = 1.]
[Figure: the same plot for the greedy policy with n_{0,i} = R_{0,i} = 10.]


Markov processes

Markov process

s_{t−1} → s_t → s_{t+1}

Definition 11 (Markov process, or Markov chain)
The sequence {s_t | t = 1, . . .} of random variables s_t : Ω → S is a Markov process if

   P(s_{t+1} | s_t, . . . , s_1) = P(s_{t+1} | s_t).   (3.1)

s_t is the state of the Markov process at time t.
P(s_{t+1} | s_t) is the transition kernel of the process.

The state of an algorithm
Observe that the R, n vectors of our greedy bandit algorithm form a Markov process. They also summarise our belief about which arm is the best.
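A short illustration of the definition (not from the slides): given a transition kernel, written here as a hypothetical 2×2 matrix of rows, simulating the chain only ever needs the current state.

import random

P = [[0.9, 0.1],   # hypothetical kernel: from state 0, stay with probability 0.9
     [0.5, 0.5]]   # from state 1, move to 0 or stay with probability 0.5 each

def simulate(P, s0=0, T=10, seed=0):
    rng, s, path = random.Random(seed), s0, [s0]
    for _ in range(T):
        s = rng.choices(range(len(P)), weights=P[s])[0]  # sample s_{t+1} ~ P(. | s_t)
        path.append(s)
    return path

print(simulate(P))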

Reinforcement learning

The reinforcement learning problem
Learning to act in an unknown environment, by interaction and reinforcement.
The environment has a changing state s_t.
The agent observes the state s_t (simplest case).
The agent takes action a_t.
It receives rewards r_t.

The goal (informally)
Maximise the total reward ∑_t r_t.

Types of environments
Markov decision processes (MDPs).
Partially observable MDPs (POMDPs).
(Partially observable) Markov games.

We first deal with the case when the environment µ is known.

Markov decision processes

Markov decision processes (MDPs)
At each time step t:
We observe state s_t ∈ S.
We take action a_t ∈ A.
We receive a reward r_t ∈ R.

[Figure: the action a_t and state s_t determine the reward r_t and the next state s_{t+1}.]

Markov property of the reward and state distribution

   P_µ(s_{t+1} | s_t, a_t)   (transition distribution)
   P_µ(r_t | s_t, a_t)   (reward distribution)

The agent

The agent’s policy π

   P^π(a_t | s_t, . . . , s_1, a_{t−1}, . . . , a_1)   (history-dependent policy)
   P^π(a_t | s_t)   (Markov policy)

Definition 12 (Utility)
Given a horizon T, the utility can be defined as

   U_t ≜ ∑_{k=0}^{T−t} r_{t+k}.   (3.2)

The agent wants to find a π maximising the expected total future reward

   E^π_µ U_t = E^π_µ ∑_{k=0}^{T−t} r_{t+k}.   (expected utility)

State value function

   V^π_{µ,t}(s) ≜ E^π_µ(U_t | s_t = s)   (3.3)

The optimal policy π∗

   π∗(µ) :  V^{π∗(µ)}_{t,µ}(s) ≥ V^π_{t,µ}(s)   ∀π, t, s   (3.4)

dominates all other policies π everywhere in S. The optimal value function V∗,

   V∗_{t,µ}(s) ≜ V^{π∗(µ)}_{t,µ}(s),   (3.5)

is the value function of the optimal policy π∗.

Deterministic shortest-path problems

[Figure: a gridworld maze with goal state X.]

Properties
T → ∞.
r_t = −1 unless s_t = X, in which case r_t = 0.
P_µ(s_{t+1} = X | s_t = X) = 1.
A = {North, South, East, West}.
Transitions are deterministic and walls block.

[Figure: the same maze annotated with the optimal cost-to-go from each cell, from 0 at the goal up to 28 in the farthest cell.]

Properties
γ = 1, T → ∞.
r_t = −1 unless s_t = X, in which case r_t = 0.
The length of the shortest path from s equals the negative value of the optimal policy.
This is also called the cost-to-go.

Stochastic shortest path problem with a pit

[Figure: the maze with a pit O and goal X.]

Properties
T → ∞.
r_t = −1, but r_t = 0 at X and −100 at O, and the problem ends there.
P_µ(s_{t+1} = X | s_t = X) = 1.
A = {North, South, East, West}.
The agent moves in a random direction with probability ω. Walls block.

Figure: Pit maze solutions for two values of ω: (a) ω = 0.1, (b) ω = 0.5, (c) the value function (colour scale from about −120 to 0).

Exercise 2
Why should we only take the shortcut in (a)?
Why does the agent commit suicide at the bottom?

Episodic problems

How to evaluate a policy

   V^π_{µ,t}(s) ≜ E^π_µ(U_t | s_t = s)   (4.1)
              = ∑_{k=0}^{T−t} E^π_µ(r_{t+k} | s_t = s)   (4.2)
              = E^π_µ(r_t | s_t = s) + E^π_µ(U_{t+1} | s_t = s)   (4.3)
              = E^π_µ(r_t | s_t = s) + ∑_{i∈S} V^π_{µ,t+1}(i) P^π_µ(s_{t+1} = i | s_t = s).   (4.4)

This derivation directly gives a number of policy evaluation algorithms.

Monte-Carlo policy evaluation

for s ∈ S do
   for k = 1, . . . , K do
      Execute policy π from s and record the total reward:
         R_k(s) = ∑_{t=1}^T r_{t,k}.
   end for
   Calculate the estimate:
      v_1(s) = (1/K) ∑_{k=1}^K R_k(s).
end for
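The following Python sketch implements the loop above on a small, made-up 3-state MDP with a uniform random policy and horizon T = 5; only the algorithm itself is taken from the slide.

import random

S, A, T = [0, 1, 2], [0, 1], 5
P = {  # P[s][a] = list of (next_state, probability); a hypothetical transition kernel
    0: {0: [(0, 0.7), (1, 0.3)], 1: [(1, 1.0)]},
    1: {0: [(0, 0.5), (2, 0.5)], 1: [(2, 1.0)]},
    2: {0: [(2, 1.0)], 1: [(2, 1.0)]},
}
r = {0: 0.0, 1: 1.0, 2: 0.0}         # reward depends only on the state here

def policy(s, rng):                   # uniform random policy pi(a | s)
    return rng.choice(A)

def mc_evaluate(K=2000, seed=0):
    rng = random.Random(seed)
    v = {}
    for s0 in S:
        total = 0.0
        for _ in range(K):            # K independent rollouts from s0
            s, ret = s0, 0.0
            for _ in range(T):
                ret += r[s]
                a = policy(s, rng)
                nxt, probs = zip(*P[s][a])
                s = rng.choices(nxt, weights=probs)[0]
            total += ret
        v[s0] = total / K             # v1(s) = (1/K) sum_k R_k(s)
    return v

print(mc_evaluate())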

Backwards induction policy evaluation

for state s ∈ S, t = T, . . . , 1 do
   Update values:
      v_t(s) = E^π_µ(r_t | s_t = s) + ∑_{j∈S} P^π_µ(s_{t+1} = j | s_t = s) v_{t+1}(j).   (4.5)
end for

[Figure: a small tree over s_t, a_t, r_t, s_{t+1} with branch probabilities 0.5/0.5 (policy), 0.7/0.3 and 0.4/0.6, worked through over several slides: the two action values come out as 0.7 and 1.4, and v_t(s_t) = 0.5 · 0.7 + 0.5 · 1.4 = 1.05.]
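Equation (4.5) in code: a backwards induction sweep on a small illustrative 2-state, 2-action MDP with a uniform Markov policy (the MDP and policy are made up; only the update rule comes from the slide).

T = 3
S, A = [0, 1], [0, 1]
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],   # P[(s, a)] = distribution over next states
     (1, 0): [0.5, 0.5], (1, 1): [0.0, 1.0]}
r = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}   # E(r_t | s, a)
pi = {s: {a: 0.5 for a in A} for s in S}                   # uniform Markov policy

v = {s: 0.0 for s in S}                 # v_{T+1} = 0
for t in range(T, 0, -1):               # t = T, ..., 1
    # v_t(s) = sum_a pi(a|s) [ r(s,a) + sum_j P(j|s,a) v_{t+1}(j) ]
    v = {s: sum(pi[s][a] * (r[s, a] + sum(P[s, a][j] * v[j] for j in S))
                for a in A)
         for s in S}
print(v)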

Backwards induction policy optimization

for state s ∈ S, t = T, . . . , 1 do
   Update values:
      v_t(s) = max_a { E_µ(r_t | s_t = s, a_t = a) + ∑_{j∈S} P_µ(s_{t+1} = j | s_t = s, a_t = a) v_{t+1}(j) }.   (4.6)
end for

[Figure: the same tree; taking the maximum over the two action values 0.7 and 1.4 gives v_t(s_t) = 1.4, and the optimal policy puts probability 1 on the better action.]
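The same sweep with the max over actions (equation 4.6), again on an illustrative toy MDP; it also records the greedy (optimal) action at each stage.

T = 3
S, A = [0, 1], [0, 1]
P = {(0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],   # P[(s, a)] = distribution over next states
     (1, 0): [0.5, 0.5], (1, 1): [0.0, 1.0]}
r = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}

v = {s: 0.0 for s in S}                          # v_{T+1} = 0
policy = {}
for t in range(T, 0, -1):
    # q_t(s, a) = r(s, a) + sum_j P(j | s, a) v_{t+1}(j)
    q = {(s, a): r[s, a] + sum(P[s, a][j] * v[j] for j in S) for s in S for a in A}
    policy[t] = {s: max(A, key=lambda a: q[s, a]) for s in S}   # greedy action
    v = {s: q[s, policy[t][s]] for s in S}                      # v_t(s) = max_a q_t(s, a)
print(v, policy[1])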

Continuing, discounted problems

Discounted total reward

   U_t = lim_{T→∞} ∑_{k=t}^T γ^k r_k,   γ ∈ (0, 1).

Definition 13
A policy π is stationary if π(a_t | s_t) does not depend on t.

Remark 1
We can use the Markov chain kernel P_{µ,π} to write the expected utility vector as

   v^π = ∑_{t=0}^∞ γ^t P^t_{µ,π} r.   (5.1)

Theorem 14
For any stationary policy π, v^π is the unique solution of the fixed-point equation

   v = r + γ P_{µ,π} v.   (5.2)

In addition, the solution is

   v^π = (I − γ P_{µ,π})^{−1} r.   (5.3)

Example 15
This is similar to the geometric series

   ∑_{t=0}^∞ α^t = 1 / (1 − α).
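Equation (5.3) in code: a Python/NumPy sketch that evaluates a stationary policy by solving the linear system, for an illustrative 3-state chain (the kernel and rewards are made up).

import numpy as np

gamma = 0.9
P_pi = np.array([[0.8, 0.2, 0.0],    # Markov chain kernel P_{mu,pi} induced by the policy
                 [0.1, 0.8, 0.1],
                 [0.0, 0.2, 0.8]])
r = np.array([0.0, 1.0, 0.0])        # expected reward in each state under the policy

v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r)   # v_pi = (I - gamma P)^{-1} r
print(v_pi)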

Backward induction for discounted infinite horizon problems

We can also apply backwards induction to the infinite-horizon case.
The resulting policy is stationary.
So memory does not grow with T.

Value iteration
for n = 1, 2, . . . and s ∈ S do
   v_n(s) = max_a { r(s, a) + γ ∑_{s′∈S} P_µ(s′ | s, a) v_{n−1}(s′) }
end for
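A NumPy sketch of the value iteration loop above, on an illustrative 2-state, 2-action MDP (the transitions, rewards and γ = 0.9 are made up); the iteration stops when the update becomes negligible.

import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[s, a, s']
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 1.0],                  # r[s, a]
              [0.0, 2.0]])

v = np.zeros(2)
for n in range(1000):
    q = r + gamma * (P @ v)                # q[s, a] = r(s, a) + gamma * sum_s' P(s'|s,a) v(s')
    v_new = q.max(axis=1)                  # v_n(s) = max_a q[s, a]
    delta = np.max(np.abs(v_new - v))
    v = v_new
    if delta < 1e-8:
        break
print(v, q.argmax(axis=1))                 # values and a greedy policy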

Policy Iteration

Input µ, S. Initialise v_0.
for n = 1, 2, . . . do
   π_{n+1} = argmax_π { r + γ P_π v_n }   // policy improvement
   v_{n+1} = V^{π_{n+1}}_µ   // policy evaluation
   break if π_{n+1} = π_n
end for
Return π_n, v_n.
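A matching NumPy sketch of policy iteration on the same kind of toy MDP: exact policy evaluation via equation (5.3) alternates with greedy policy improvement until the policy stops changing.

import numpy as np

gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[s, a, s'] (illustrative)
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 1.0], [0.0, 2.0]])     # r[s, a]
n_states = 2

pi = np.zeros(n_states, dtype=int)         # initial policy: action 0 everywhere
while True:
    # policy evaluation: v = (I - gamma P_pi)^{-1} r_pi
    P_pi = P[np.arange(n_states), pi]      # rows P(. | s, pi(s))
    r_pi = r[np.arange(n_states), pi]
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    # policy improvement: greedy with respect to v
    q = r + gamma * (P @ v)
    pi_new = q.argmax(axis=1)
    if np.array_equal(pi_new, pi):         # break if pi_{n+1} = pi_n
        break
    pi = pi_new
print(pi, v)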

Summary

Markov decision processes model controllable dynamical systems.
Optimal policies maximise expected utility and can be found with:
   Backwards induction / value iteration.
   Policy iteration.
The MDP state can be seen as:
   The state of a dynamic controllable process.
   The internal state of an agent.

Bayesian reinforcement learning

The reinforcement learning problem

Learning to act in an unknown world, by interaction and reinforcement.

World µ; policy π; at time t:
µ generates observation x_t ∈ X.
We take action a_t ∈ A using π.
µ gives us reward r_t ∈ R.

Definition 16 (Expected utility)

   E^π_µ U_t = E^π_µ ∑_{k=t}^T r_k.

When µ is known, we can calculate max_π E^π_µ U. But knowing µ is contrary to the problem definition.

When µ is not known

Bayesian idea: use a subjective belief ξ(µ) on M.
Initial belief ξ(µ).
The probability of observing history h is P^π_µ(h).
We can use this to adjust our belief via Bayes’ theorem:

   ξ(µ | h, π) ∝ P^π_µ(h) ξ(µ).

We can thus conclude which µ is more likely.

The subjective expected utility

   U∗_ξ ≜ max_π E^π_ξ U = max_π ∑_µ (E^π_µ U) ξ(µ).

This integrates planning and learning, and the exploration-exploitation trade-off.
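A minimal sketch of the Bayesian idea with a finite model class: two hypothetical Bernoulli-bandit models, a belief ξ updated by Bayes' theorem from an invented history, and the subjective expected utility of two simple fixed-arm policies. A real Bayes-optimal policy would also plan over future belief changes; this only illustrates the belief update and the expectation over models.

models = {                        # mu -> success probabilities of the two arms (made up)
    "mu1": [0.7, 0.3],
    "mu2": [0.2, 0.8],
}
xi = {"mu1": 0.5, "mu2": 0.5}     # prior belief xi(mu)

history = [(0, 1), (0, 1), (0, 0), (1, 0)]   # observed (action, reward) pairs

# Bayes' theorem: xi(mu | h, pi) is proportional to P^pi_mu(h) xi(mu).
# The policy's action probabilities are the same for every mu, so they cancel
# on normalisation and only the reward likelihoods matter here.
posterior = {}
for mu, omega in models.items():
    lik = 1.0
    for a, reward in history:
        lik *= omega[a] if reward == 1 else 1 - omega[a]
    posterior[mu] = lik * xi[mu]
z = sum(posterior.values())
posterior = {mu: p / z for mu, p in posterior.items()}
print("posterior:", posterior)

# Subjective expected utility of each fixed-arm policy over T more steps:
# E^pi_xi U = sum_mu xi(mu) E^pi_mu U, with E^pi_mu U = T * omega_mu[arm].
T = 10
for arm in (0, 1):
    eu = sum(posterior[mu] * T * models[mu][arm] for mu in models)
    print(f"always play arm {arm}: expected utility {eu:.2f}")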

Bounds on the ξ-optimal utility U∗_ξ ≜ max_π E^π_ξ U

[Figure: expected utility EU plotted against the belief ξ over two MDPs, µ1 (no trap) and µ2 (trap). Each policy π1, π2, π∗_{ξ1} gives a line in ξ; U∗_{µ1} and U∗_{µ2} appear at the endpoints, U∗_{ξ1} is the value of the ξ1-optimal policy at ξ1, and U∗_ξ is the upper envelope of the policy lines.]

ABC (Approximate Bayesian Computation) RL¹

How to deal with an arbitrary model space M
The models µ ∈ M may be non-probabilistic simulators.
We may not know how to choose the simulator parameters.

Overview of the approach
Place a prior on the simulator parameters.
Observe some data h on the real system.
Approximate the posterior by statistics on simulated data.
Calculate a near-optimal policy for the posterior.

Results
We prove soundness under general conditions on the statistics.
In practice, it can require much less data than a general model.

¹ Dimitrakakis, Tziortiotis. ABC Reinforcement Learning. ICML 2013.

Cover tree Bayesian reinforcement learning

The model idea
Cover the space using a cover tree.
A linear model for each set.
The tree defines a distribution on piecewise-linear models.

Algorithm overview
Build the tree online.
Do Bayesian inference on the tree.
Sample a model from the tree.
Get a policy for the model.

[Figures: the cover tree sets c0, . . . , c4; a graphical model over s_t, a_t, c_t, θ_t, s_{t+1}; and a scatter plot of 10^4 samples of (s_t, s_{t+1}).]

A comparison

ABC RL

Any simulator can be used ⇒ enables detailed prior knowledge

Our theoretical results prove soundness of ABC.

Downside: Computationally intensive.

Cover Tree Bayesian RL

Very general model.

Inference in logarithmic time due to the tree structure.

Downside: Hard to insert domain-specific prior knowledge.

Future work

Advanced algorithms (e.g. tree or gradient methods) for policy optimisation.

Unknown MDPs can be handled in a Bayesian framework. This defines a belief-augmented MDP with:
   A state for the MDP.
   A state for the agent’s belief.
The Bayes-optimal utility is convex, enabling approximations.
A big problem is specifying the “right” prior.

Questions?

Page 108: Markov decision processes Christos Dimitrakakis · Planning under uncertainty Markov decision processes Christos Dimitrakakis Chalmers August 31, 2014. . . .

. . . . . .

ABC (Approximate Bayesian Computation)

When there is no probabilistic model (Pµ is not available): ABC!

A prior ξ on a class of simulatorsM History h ∈ H from policy π.

Statistic f : H → (W, ∥ · ∥) Threshold ϵ > 0.

ABC-RL using Thompson sampling

do µ ∼ ξ, h′ ∼ Pπµ // sample a model and history

until ∥f (h′)− f (h)∥ ≤ ϵ // until the statistics are close

µ(k) = µ // approximate posterior sample µ(k) ∼ ξϵ(· | ht) π(k) ≈ argmaxEπ

µ(k) Ut // approximate optimal policy for sample


The approximate posterior ξ_ϵ(· | h)

Corollary 17

If f is a sufficient statistic and ϵ = 0, then ξ(· | h) = ξ_ϵ(· | h).

Assumption 4 (A1. Lipschitz log-probabilities)

For the policy π, there exists L > 0 such that for all h, h′ ∈ H and all µ ∈ M,

|ln [P^π_µ(h) / P^π_µ(h′)]| ≤ L ∥f(h) − f(h′)∥.

Theorem 18 (The approximate posterior ξ_ϵ(· | h) is close to ξ(· | h))

If A1 holds, then for all ϵ > 0,

D(ξ(· | h) ∥ ξ_ϵ(· | h)) ≤ 2Lϵ + ln |A_{h,ϵ}|, (6.1)

where A_{h,ϵ} ≜ {z ∈ H : ∥f(z) − f(h)∥ ≤ ϵ}.
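As a numerical illustration of the bound (the numbers are chosen purely for illustration and assume a finite acceptance set): with L = 1, ϵ = 0.1 and |A_{h,ϵ}| = 50, the bound (6.1) gives

\[
D\bigl(\xi(\cdot \mid h)\,\big\|\,\xi_\epsilon(\cdot \mid h)\bigr)
\;\le\; 2 \cdot 1 \cdot 0.1 + \ln 50 \;\approx\; 0.2 + 3.9 \;=\; 4.1 \ \text{nats}.
\]

Decreasing ϵ reduces the first term and cannot increase the second, since the acceptance set A_{h,ϵ} only shrinks.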

