A complexity analysis: Solving Markov Decision Processes using Policy Iteration


TRANSCRIPT

Page 1:

A complexity analysis

Solving Markov Decision Processes using Policy Iteration

Romain Hollanders, UCLouvain

Joint work with: Balázs Gerencsér, Jean-Charles Delvenne and Raphaël Jungers

Seminar at Loria – Inria, Nancy, February 2015

Page 2:

Policy Iteration to solve Markov Decision Processes

Two powerful tools for the analysis

Acyclic Unique Sink Orientations

Order-Regular matrices

Page 11:

How much will we pay, from a given starting state?

Pages 12–14:

How much will we pay, from a given starting state?

Total-cost criterion: $\mathbb{E}\big[\sum_{t=0}^{T} c(x_t)\big]$ as the horizon $T \to \infty$

Average-cost criterion: $\lim_{T\to\infty} \frac{1}{T}\, \mathbb{E}\big[\sum_{t=0}^{T} c(x_t)\big]$

Discounted-cost criterion: $\mathbb{E}\big[\sum_{t=0}^{\infty} \gamma^t\, c(x_t)\big]$ with discount factor $0 < \gamma < 1$

where $c$ is the cost vector and $x_t$ the state visited at time $t$.
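To make the criteria concrete, here is a minimal NumPy sketch (all numbers invented for illustration) that evaluates a fixed chain under the discounted and average criteria; the total-cost criterion is omitted since it only stays finite with an absorbing, cost-free target state.

```python
import numpy as np

# A hypothetical 3-state Markov chain, for illustration only.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.0, 0.2, 0.8]])  # P[i, j] = transition probability i -> j
c = np.array([2.0, 1.0, 4.0])    # cost vector: cost paid when visiting each state
gamma = 0.9                      # discount factor

# Discounted cost: v = c + gamma * P @ v, hence v = (I - gamma * P)^{-1} c
v_discounted = np.linalg.solve(np.eye(3) - gamma * P, c)

# Average cost: mu @ c, with mu the stationary distribution (mu = mu @ P)
eigvals, eigvecs = np.linalg.eig(P.T)
mu = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1.0))])
mu = mu / mu.sum()

print("discounted cost from each starting state:", v_discounted)
print("average cost per time step:", mu @ c)
```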

Page 15:

Markov chains

Page 16:

Markov chains: one action per state

Markov Decision Processes: several actions per state in general

Page 18:

[figure: an MDP, annotated with an action, its action cost, and its transition probabilities]

Goal: find the optimal policy

Evaluate a policy using an objective function: Total-cost, Average-cost or Discounted-cost

Proposition: an optimal policy always exists. That is what we aim for!
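As a sketch of how such a model can be written down (all names and numbers here are invented for illustration): for each state, each available action carries an action cost and a transition probability vector.

```python
# A hypothetical 2-state MDP with two actions per state, for illustration.
# mdp[state][action] = (action_cost, transition_probabilities_to_each_state)
mdp = {
    0: {"a": (1.0, [0.9, 0.1]),   # cheap, tends to stay in state 0
        "b": (4.0, [0.2, 0.8])},  # expensive, tends to jump to state 1
    1: {"a": (2.0, [0.5, 0.5]),
        "b": (0.5, [0.0, 1.0])},  # cheap, but we stay in state 1 forever
}

# A stationary deterministic policy picks one action per state;
# finding the optimal one is the goal stated above.
policy = {0: "a", 1: "b"}
```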

Page 19:

How do we solve a Markov Decision Process ?

Policy Iteration

Page 20:

POLICY ITERATION

Pages 21–23:

POLICY ITERATION

Choose an initial policy $\pi_0$.
while the policy keeps changing:
    1. Evaluate: compute the value $v_{\pi_t}$ of the current policy $\pi_t$
    2. Improve: $\pi_{t+1}$ picks the best action in each state according to $v_{\pi_t}$
end

Stop! We found the optimal policy.
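A minimal runnable sketch of this loop for the discounted criterion (NumPy; the toy MDP is randomly generated for illustration, and the improvement step uses the greedy all-switches rule matching the pseudocode above):

```python
import numpy as np

# Toy MDP: n states, m actions per state; all entries invented for illustration.
n, m, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n), size=(n, m))  # P[s, a] = next-state distribution
c = rng.uniform(0.0, 5.0, size=(n, m))      # c[s, a] = cost of action a in state s

def evaluate(pi):
    """Value of a stationary policy: v = c_pi + gamma * P_pi @ v."""
    P_pi = P[np.arange(n), pi]               # n x n transition matrix under pi
    c_pi = c[np.arange(n), pi]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, c_pi)

def improve(v):
    """Greedy (all-switches) improvement: best action in each state w.r.t. v."""
    q = c + gamma * np.einsum("sat,t->sa", P, v)   # q[s, a] = one-step lookahead
    return np.argmin(q, axis=1)                    # costs, so we minimize

pi = np.zeros(n, dtype=int)                  # initial policy pi_0
while True:
    v = evaluate(pi)                         # 1. Evaluate
    new_pi = improve(v)                      # 2. Improve
    if np.array_equal(new_pi, pi):
        break                                # Stop! We found the optimal policy.
    pi = new_pi
print("optimal policy:", pi, "with value:", evaluate(pi))
```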

Page 24:

Markov Decision Processes

Page 25:

Markov Decision Processes: one player

Turn-Based Stochastic Games: two players

Page 27:

minimizer versus maximizer

STRATEGY ITERATION

Pages 28–30:

STRATEGY ITERATION (minimizer versus maximizer)

Fix the maximizer's strategy and find the minimizer's best response against it, using POLICY ITERATION.
Then fix the minimizer's strategy and find the maximizer's best response against it, using POLICY ITERATION.
Repeat until nothing changes.
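Putting that loop in code form, a hedged sketch (the `game` object and the `best_response_via_policy_iteration` helper are hypothetical stand-ins, not from the talk):

```python
def strategy_iteration(game, tau):
    """Sketch of STRATEGY ITERATION for a turn-based stochastic game.

    `game` and `best_response_via_policy_iteration` are hypothetical:
    fixing one player's strategy turns the game into an ordinary MDP
    for the other player, which POLICY ITERATION solves exactly.
    """
    while True:
        # minimizer: best response against the maximizer's strategy tau
        sigma = best_response_via_policy_iteration(game, fixed=tau, player="min")
        # maximizer: best response against the minimizer's strategy sigma
        new_tau = best_response_via_policy_iteration(game, fixed=sigma, player="max")
        if new_tau == tau:           # repeat until nothing changes
            return sigma, tau
        tau = new_tau
```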

Page 31:

What is the complexity of Policy Iteration ?

Pages 32–34:

Total-cost criterion: Exponential [Friedmann '09, Fearnley '10]

Average-cost criterion: Exponential [Friedmann '09, Fearnley '10]

Discounted-cost criterion: Exponential [H. et al. '12]

Page 35:

Exponential in general !

But…

Page 36:

Fearnley’s example is pathological

Page 37:

Discounted-cost criterion with a fixed discount rate: Polynomial [Ye '10, Hansen et al. '11, Scherrer '13]

Deterministic MDPs: Polynomial for a close variant [Post & Ye '12, Scherrer '13]

MDPs with only positive costs: ???

Page 38:

Let us find upper bounds for the general case !

Page 44:

Acyclic Unique Sink Orientation (AUSO): orient the edges of the hypercube so that every subcube has a unique sink and the orientation is acyclic.

[figure: an oriented cube with its sink marked]

Let us find the sink with POLICY ITERATION

Pages 45–49:

Let us find the sink with POLICY ITERATION

Start from an initial policy $\pi_0$, a vertex of the cube. At each step, let $S_t$ be the set of dimensions of the improvement edges at the current vertex; the next policy $\pi_{t+1}$ is obtained by flipping the coordinates in $S_t$. In the example, we converge in 5 vertex evaluations; $\pi_0, \pi_1, \ldots$ is the PI-sequence.
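This jump rule is easy to state in code. A minimal sketch (the orientation below is an invented AUSO whose edges all point toward the vertex with more zeros; `outgoing` returns the dimensions of the improvement edges):

```python
def policy_iteration_on_cube(outgoing, v0):
    """Follow the PI-sequence on a cube orientation.

    outgoing(v) must return the set of dimensions of the improvement
    (outgoing) edges at vertex v; the empty set marks the sink.
    """
    v, evaluations, sequence = tuple(v0), 1, [tuple(v0)]
    S = outgoing(v)
    while S:                        # until we reach the sink
        # flip every coordinate that carries an improvement edge
        v = tuple(1 - b if k in S else b for k, b in enumerate(v))
        sequence.append(v)
        evaluations += 1
        S = outgoing(v)
    return sequence, evaluations

# Invented AUSO on the 3-cube: orient every edge toward the endpoint with
# more zeros, so the unique sink of every subcube is its minimal vertex.
outgoing = lambda v: {k for k, b in enumerate(v) if b == 1}

seq, evals = policy_iteration_on_cube(outgoing, (1, 0, 1))
print("PI-sequence:", seq, "- converged in", evals, "vertex evaluations")
```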

Page 50:

Two properties to derive an upper bound

Page 51:

Two properties to derive an upper bound

1. There exists a path connecting the policies of the PI-sequence

2.

Page 52:

A new upper bound

The policies of the PI-sequence are connected by a path in the cube, and the cube has only $2^n$ vertices, the total number of policies. Therefore we cannot have too many large $S_t$'s in a PI-sequence. We prove a bound of this kind, and a new upper bound on the number of iterations of Policy Iteration follows.

Page 53:

Can we do even better?

Page 54:

The matrix whose rows are the policies of the PI-sequence is “Order-Regular”

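As a working sketch only: assume order-regularity means that for every pair of rows i < j there is a column k on which row i differs from row i+1 while row j agrees with row i+1. This exact formulation is an assumption, not a quote from the talk; a checker under that assumption:

```python
import numpy as np

def is_order_regular(A):
    """Check an ASSUMED order-regularity condition on a 0/1 matrix A:
    for every pair of rows i < j, some column k must satisfy
    A[i, k] != A[i+1, k] and A[j, k] == A[i+1, k].
    (This formulation is an assumption, not taken from the talk.)"""
    m, _ = A.shape
    for i in range(m - 1):
        switched = A[i] != A[i + 1]          # the improvement set S_i
        for j in range(i + 1, m):
            if not np.any(switched & (A[j] == A[i + 1])):
                return False
    return True

# The rows of the PI-sequence from the cube example, stacked as a matrix:
A = np.array([[1, 0, 1],
              [0, 0, 0]])
print(is_order_regular(A))  # True
```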

Page 71:

How large are the largest Order-Regular matrices that we can build?

Page 72:

The answer of exhaustive search

Conjecture (Hansen & Zwick, 2012): the number of rows of the largest Order-Regular matrices with $n$ columns is a Fibonacci number, and hence grows like $\varphi^n$, where $\varphi \approx 1.618$ is the golden ratio.
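For intuition on the growth rate in the conjecture, a quick check that Fibonacci numbers grow like powers of the golden ratio:

```python
phi = (1 + 5 ** 0.5) / 2          # the golden ratio, ~1.618

fib = [1, 1]                      # the Fibonacci sequence
while len(fib) < 16:
    fib.append(fib[-1] + fib[-2])

for n in range(2, 15):
    # consecutive ratios converge to phi, so fib(n) grows like phi**n
    print(n, fib[n], round(fib[n + 1] / fib[n], 4))
print("phi =", phi)
```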

Page 73:

The answer of exhaustive search

Theorem (H. et al., 2014): the exact largest sizes are known for small $n$.

(Proof: a “smart” exhaustive search)

Page 74:

How large are the largest Order-Regular matrices that we can build?

Page 75:

A constructive approach

Pages 76–79:

Iterate and build matrices of size

Page 80:

Can we do better ?

Page 81:

Yes!

We can build matrices of size

Page 82:

So, what do we know about Order-Regular matrices ?

Order-Regular matrix vs. Acyclic Unique Sink Orientation

Page 83:

Let's recap!

Page 84:

PART 1: Policy Iteration for Markov Decision Processes

Efficient in practice, but not in the worst case

PART 2: The Acyclic Unique Sink Orientation point of view

Leads to a new upper bound

PART 3: Order-Regular matrices, towards new bounds

The Fibonacci conjecture fails

Page 85:

A complexity analysis

Solving Markov Decision Processes using Policy Iteration

Romain Hollanders, UCLouvain

Joint work with: Balázs Gerencsér, Jean-Charles Delvenne and Raphaël Jungers

Seminar at Loria – Inria, Nancy, February 2015