Optimism in the Face of Uncertainty: A Unifying Approach
István Szita & András Lőrincz
Eötvös Loránd University
Hungary
Outline
background
quick overview of exploration methods
construction of the new algorithm
analysis & experimental results
outlook
Background
Markov decision processes: finite, discounted (…but wait until the end of the talk)
value function-based methods: Q(x,a) values
the efficient exploration problem
Basic exploration: ε-greedy
extremely simple
sufficient for convergence in the limit for many classical methods like Q-learning, Dyna, Sarsa
…under suitable conditions
extremely inefficient
Advanced exploration
in case of uncertainty, be optimistic! …details vary
we will use concepts from: R-max, optimistic initial values, exploration bonus methods, model-based interval estimation
there are many others: Bayesian methods, UCT, delayed Q-learning, …
R-max
builds model from observations
uses an optimistic model: unknown transitions go to the “garden of Eden” (a hypothetical state with max. reward)
transitions declared known after polynomially many visits
+poly-time convergence
−slow in practice
(Brafman & Tennenholtz, 2001)
Optimistic initial values
set initial values high:
no extra work usually combined with
other techniques with very high initial
values, no need for additional exploration
+no extra work
−wears out slowly
only model-free
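A minimal sketch of the idea for a tabular Q-learner; the sizes, gamma and Rmax values below are illustrative, not from the talk:

```python
import numpy as np

n_states, n_actions = 10, 4          # illustrative sizes
gamma, R_MAX = 0.95, 1.0             # illustrative discount and reward bound

# optimistic initialization: every (x, a) starts at the highest possible return,
# so unvisited pairs keep looking attractive to a purely greedy policy
Q = np.full((n_states, n_actions), R_MAX / (1.0 - gamma))

def q_update(x, a, r, y, alpha=0.1):
    """Standard Q-learning backup; exploration comes only from the optimistic start."""
    Q[x, a] += alpha * (r + gamma * np.max(Q[y]) - Q[x, a])
```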
Exploration bonus methods
bonus reward for “interesting” states: rarely visited, large TD-error, etc.
exact size/form varies; can oscillate wildly
regular/bonus rewards accumulated in separate value functions
+can be efficient in practice
−ad-hoc method
−bonuses do not converge
(e.g. Meuleau & Bourgine, 1999; many others)
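One common count-based variant, sketched with an inverse-square-root bonus; the exact form and the constant beta are illustrative assumptions, since the talk only notes that the size/form varies:

```python
import numpy as np

n_states, n_actions = 10, 4                 # illustrative sizes
beta = 0.2                                  # illustrative bonus scale
visits = np.zeros((n_states, n_actions))    # visit counts per (x, a)

def bonus_reward(x, a):
    """Extra reward for rarely visited pairs; shrinks as the visit count grows."""
    return beta / np.sqrt(visits[x, a] + 1.0)
```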
Model-based interval estimation
builds model from observations
estimates confidence intervals of state values
exploration bonus: half-widths of intervals
+poly-time convergence
−???
(Wiering, 1998; Strehl & Littman, 2006)
Assembling the new algorithm
model estimation from empirical counts:
sum of rewards for all (x,a,y) transitions up to t
number of visits to (x,a,y) up to t
number of visits to (x,a) up to t
(estimated transition probabilities and rewards are the corresponding ratios and averages)
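A minimal sketch of these counters for a small finite MDP, assuming integer-indexed states and actions; the array and function names are illustrative:

```python
import numpy as np

n_states, n_actions = 10, 4                               # illustrative sizes

reward_sum = np.zeros((n_states, n_actions, n_states))    # sum of rewards on (x,a,y) up to t
visits_xay = np.zeros((n_states, n_actions, n_states))    # number of visits to (x,a,y) up to t
visits_xa  = np.zeros((n_states, n_actions))              # number of visits to (x,a) up to t

def record(x, a, r, y):
    """Update the counters after observing the transition (x, a) -> y with reward r."""
    reward_sum[x, a, y] += r
    visits_xay[x, a, y] += 1
    visits_xa[x, a] += 1

def estimated_model():
    """Maximum-likelihood model: transition probabilities and mean rewards from the counters."""
    P = visits_xay / np.maximum(visits_xa[:, :, None], 1)
    R = reward_sum / np.maximum(visits_xay, 1)
    return P, R
```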
Assembling the new algorithm II
Optimistic initial model: a single visit to xE from each (x,a)
really optimistic!
cf. optimistic initial values: no extra work after initialization
cf. R-max: hypothetical “Eden” state with max. reward
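A sketch of the optimistic initialization under the same counter layout, assuming the “garden of Eden” state xE is appended as one extra state index; the R_MAX value is illustrative:

```python
import numpy as np

n_states, n_actions = 10, 4    # illustrative sizes of the real MDP
R_MAX = 1.0                    # the algorithm's parameter (illustrative value)
EDEN = n_states                # index of the hypothetical max-reward state xE

# counters over the extended state space: the real states plus xE
visits_xay = np.zeros((n_states + 1, n_actions, n_states + 1))
reward_sum = np.zeros((n_states + 1, n_actions, n_states + 1))
visits_xa  = np.zeros((n_states + 1, n_actions))

# a single fictitious visit to xE from each (x, a), carrying the maximal reward;
# since this also covers x = xE, the Eden state is absorbing with reward R_MAX
visits_xay[:, :, EDEN] = 1
reward_sum[:, :, EDEN] = R_MAX
visits_xa[:, :] = 1
# no further work after initialization: real observations simply keep updating the counters
```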
Assembling the new algorithm III
in each step t, at := greedy with respect to Qt(xt, ·)
perform at, observe next state and reward
update counters and model parameters
solve the model MDP
…can be done incrementally & fast, e.g. a few steps of value iteration, or asynchronously by prioritized sweeping
get new value function Qt+1
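A sketch of one step of this loop. The env.step interface, and the record/estimated_model helpers from the counter sketch above, are assumptions for illustration:

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, sweeps=5, Q=None):
    """A few synchronous Bellman sweeps over the estimated model
    (prioritized sweeping would do the same work asynchronously)."""
    n_states, n_actions = P.shape[:2]
    if Q is None:
        Q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        V = Q.max(axis=1)
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
    return Q

def oim_step(env, x, Q, gamma=0.95):
    """One interaction step: greedy action, model update, partial re-solve."""
    a = int(np.argmax(Q[x]))           # a_t := greedy w.r.t. Q_t(x_t, .)
    y, r = env.step(a)                 # perform a_t, observe next state and reward (assumed API)
    record(x, a, r, y)                 # update counters / model parameters (see sketch above)
    P, R = estimated_model()
    Q = value_iteration(P, R, gamma, sweeps=5, Q=Q)   # a few incremental sweeps -> Q_{t+1}
    return y, Q
```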
Assembling the new algorithm IV
Potential problem: Rmax is too large! separate real/bonus rewards!
real part: initialized to 0, the “real” rewards are added
bonus part: initialized to 0 or Rmax, nothing is added
the real part can be used at any time!
cf. exploration bonus methods: the bonus part plays the role of the exploration bonus
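A sketch of the split, assuming the estimated model is available as arrays P, R_real, R_bonus (observed rewards vs. Eden rewards); the real part is backed up with observed rewards only, the bonus part with the Eden rewards only, and the greedy policy uses their sum:

```python
import numpy as np

def split_backup(P, R_real, R_bonus, gamma=0.95, sweeps=5):
    """Back up two value functions under the same estimated model.
    Q_real:  initialized to 0, receives only the observed ("real") rewards.
    Q_bonus: receives only the optimistic Eden rewards, so it shrinks as visits accumulate."""
    n_states, n_actions = P.shape[:2]
    Q_real = np.zeros((n_states, n_actions))
    Q_bonus = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        # both parts are evaluated along the policy that is greedy in their sum
        greedy = (Q_real + Q_bonus).argmax(axis=1)
        V_real = Q_real[np.arange(n_states), greedy]
        V_bonus = Q_bonus[np.arange(n_states), greedy]
        Q_real = (P * (R_real + gamma * V_real[None, None, :])).sum(axis=2)
        Q_bonus = (P * (R_bonus + gamma * V_bonus[None, None, :])).sum(axis=2)
    # act greedily on Q_real + Q_bonus; Q_real can be reported at any time
    return Q_real, Q_bonus
```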
Convergence results
One parameter: Rmax
for large Rmax, converges to near-optimum (with high probability)
proof is based on MBIE’s proof (and R-max, E^3)
by the time the bonus becomes small ⇒ numVisits is large ⇒ model estimate is accurate
bonus is ≈ Rmax/(numVisits+1) (instead of MBIE’s ≈ 1/√numVisits)
looser bound (but polynomial!)
Experimental results I
“RiverSwim”
“SixArms”
(Strehl & Littman, 2006)
Experimental results II
“Chain”
“Loop”
(Meuleau & Bourgine, 1999; Strens, 2000; Dearden, 2000)
Experimental results III
“FlagMaze”
(Meuleau & Bourgine, 1999; Strens, 2000; Dearden, 2000)
Experimental results IV
“Maze with subgoals”
(Wiering & Schmidhuber, 1998)
(maze figure: subgoal rewards of +500, +500 and +1000)
Outlook
extension to factored MDPs: almost ready (we need benchmarks)
extension to general function approximation: in progress
Advantages of OIM
polynomial-time convergence (to near-optimum, with high probability)
convincing performance in practice
extremely simple to implement: all work done at initialization, decision making is always greedy
Matlab source code to be released soon
Thank you for your attention!
check our web pages at http://szityu.web.eotvos.elte.hu and http://inf.elte.hu/lorincz
or my reinforcement learning blog “Gimme Reward” at http://gimmereward.wordpress.com
Full pseudocode of the OIM algorithm
Exact statement of the convergence theorem