Optimism in the Face of Uncertainty: a Unifying approach István Szita & András Lőrincz Eötvös Loránd University Hungary

Post on 12-Jan-2016


Page 1: Optimism in the Face of Uncertainty: a Unifying approach

Optimism in the Face of Uncertainty:

a Unifying approach

István Szita & András Lőrincz

Eötvös Loránd University

Hungary

Page 2: Optimism in the Face of Uncertainty: a Unifying approach

Outline

background

quick overview of exploration methods

construction of the new algorithm

analysis & experimental results

outlook

Page 3: Optimism in the Face of Uncertainty: a Unifying approach

Background

Markov decision processes: finite, discounted (…but wait until the end of the talk)

value function-based methods: Q(x,a) values

the efficient exploration problem

Page 4: Optimism in the Face of Uncertainty: a Unifying approach

Basic exploration: ε-greedy

extremely simple

sufficient for convergence in the limit for many classical methods like Q-learning, Dyna, Sarsa

…under suitable conditions

extremely inefficient
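
A minimal sketch of ε-greedy action selection for one state, assuming the Q(x,a) values are held in a plain list indexed by action. The function name and arguments are illustrative, not taken from the talk.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index given the Q(x, a) values of the current state."""
    if rng.random() < epsilon:
        # Explore: uniformly random action, regardless of the estimates.
        return rng.randrange(len(q_values))
    # Exploit: first action with the highest estimated value.
    return q_values.index(max(q_values))
```

The random action is drawn without regard to what has already been tried, which is why exploration of this kind is undirected.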

Page 5: Optimism in the Face of Uncertainty: a Unifying approach

Advanced exploration

in case of uncertainty, be optimistic!

…details vary

we will use concepts from: R-max, optimistic initial values, exploration bonus methods, model-based interval estimation

there are many others: Bayesian methods, UCT, delayed Q-learning, …

Page 6: Optimism in the Face of Uncertainty: a Unifying approach

R-max

builds model from observations

uses an optimistic model: unknown transitions go to the “garden of Eden” (hypothetical state with max. reward)

transitions declared known after O(nVisits^3) steps

+poly-time convergence

−slow in practice

(Brafman & Tennenholtz, 2001)
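
A rough sketch of how an R-max style optimistic model could be built from visit counts. The threshold m (at least 1) for declaring a state-action pair "known", the data layout, and all names are assumptions made for illustration.

```python
def rmax_model(counts, reward_sums, m, r_max):
    """Build an optimistic model (P, R) from observations.

    counts[x][a][y]   : observed transitions x --a--> y
    reward_sums[x][a] : summed rewards observed for (x, a)
    m                 : hypothetical "known" threshold, must be >= 1
    The extra state with index len(counts) plays the role of Eden.
    """
    n_states = len(counts)
    n_actions = len(counts[0])
    eden = n_states
    P = [[[0.0] * (n_states + 1) for _ in range(n_actions)]
         for _ in range(n_states + 1)]
    R = [[r_max] * n_actions for _ in range(n_states + 1)]
    for x in range(n_states + 1):
        for a in range(n_actions):
            n = sum(counts[x][a]) if x != eden else 0
            if n >= m:
                # Known pair: use the empirical estimates.
                for y in range(n_states):
                    P[x][a][y] = counts[x][a][y] / n
                R[x][a] = reward_sums[x][a] / n
            else:
                # Unknown pair (and Eden itself): go to Eden, collect r_max.
                P[x][a][eden] = 1.0
    return P, R
```

Planning in this model makes under-visited pairs look maximally rewarding, so a greedy planner is drawn toward them.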

Page 7: Optimism in the Face of Uncertainty: a Unifying approach

Optimistic initial values

set initial values high

no extra work

usually combined with other techniques

with very high initial values, no need for additional exploration

+no extra work

−wears out slowly

only model-free
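
A small sketch of optimistic initialization combined with greedy Q-learning. It assumes rewards are bounded by r_max, so that r_max / (1 - gamma) is an upper bound on any discounted return. The function names and the learning rate alpha are illustrative.

```python
def optimistic_q_table(n_states, n_actions, r_max, gamma):
    """Initialize every Q(x, a) to the largest possible discounted return."""
    q0 = r_max / (1.0 - gamma)
    return [[q0] * n_actions for _ in range(n_states)]


def q_learning_step(Q, x, a, r, y, alpha, gamma):
    """One Q-learning update for the observed transition (x, a, r, y)."""
    target = r + gamma * max(Q[y])
    Q[x][a] += alpha * (target - Q[x][a])
    return Q[x][a]
```

Each tried action loses some of its initial optimism, so a purely greedy agent moves on to the untried ones.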

Page 8: Optimism in the Face of Uncertainty: a Unifying approach

Exploration bonus methods

bonus reward for “interesting” states: rarely visited, large TD-error, etc.

exact size/form varies

can oscillate fervently

regular/bonus rewards accumulated in separate value functions

+can be efficient in practice

−ad-hoc method

bonuses do not converge

(e.g. Meuleau & Bourgine, 1999; many others)

Page 9: Optimism in the Face of Uncertainty: a Unifying approach

Model-based interval estimation

builds model from observations

estimates confidence intervals of state values

exploration bonus: half-widths of intervals

+poly-time convergence

−???

(Wiering, 1998; Strehl & Littman, 2006)
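
For reference, one common way to write an interval-based bonus is the exploration-bonus form of MBIE, sketched below. The symbols β (a confidence parameter) and n(x,a) (the visit count of the pair) are notation assumed here, not taken from the slides.

```latex
\tilde{Q}(x,a) \;=\; \hat{R}(x,a)
  \;+\; \gamma \sum_{y} \hat{P}(y \mid x,a)\, \max_{a'} \tilde{Q}(y,a')
  \;+\; \frac{\beta}{\sqrt{n(x,a)}}
```

The last term shrinks like the half-width of a confidence interval as the pair is visited more often.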

Page 10: Optimism in the Face of Uncertainty: a Unifying approach

Assembling the new algorithm

model estimation:

sum of rewards for all (x,a,y) up to t

number of visits to (x,a,y) up to t

number of visits to (x,a) up to t
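
The three counters above suggest the usual empirical estimates. In the sketch below the symbols C_t, N_t, \hat{P}_t and \hat{R}_t are assumed names for the reward sum, the visit counts and the resulting model:

```latex
\hat{P}_t(x,a,y) \;=\; \frac{N_t(x,a,y)}{N_t(x,a)},
\qquad
\hat{R}_t(x,a,y) \;=\; \frac{C_t(x,a,y)}{N_t(x,a,y)}
```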

Page 12: Optimism in the Face of Uncertainty: a Unifying approach

Assembling the new algorithm II

Optimistic initial model: a single visit to x_E from each (x,a)

really optimistic!

cf. optimistic initial values: no extra work after initialization

cf. R-max: hypothetical “Eden” state with max. reward

Page 13: Optimism in the Face of Uncertainty: a Unifying approach

Assembling the new algorithm III

in each step t:

a_t := greedy with respect to Q_t(x_t, ·)

perform a_t, observe next state and reward

update counters, model parameters

solve model MDP

... can be done incrementally & fast, e.g.: a few steps of value iteration; asynchronously, by prioritized sweeping

get new value function Q_{t+1}
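
The loop above can be sketched in tabular form as follows. This is an illustration under stated assumptions, not the authors' code: the Eden state is taken to be absorbing and to pay r_max forever, the model MDP is re-solved with a few full sweeps of value iteration rather than prioritized sweeping, and all class and attribute names are hypothetical.

```python
class OIMAgent:
    """Tabular sketch of an optimistic-initial-model agent."""

    def __init__(self, n_states, n_actions, r_max, gamma=0.9, sweeps=5):
        self.n_states, self.n_actions = n_states, n_actions
        self.gamma, self.sweeps = gamma, sweeps
        self.eden = n_states  # index of the hypothetical Eden state
        # Assumption: Eden is absorbing and pays r_max in every step.
        self.v_eden = r_max / (1.0 - gamma)
        size = n_states + 1
        # N[x][a][y]: visits to (x,a,y); C[x][a][y]: summed rewards on (x,a,y).
        self.N = [[[0] * size for _ in range(n_actions)]
                  for _ in range(n_states)]
        self.C = [[[0.0] * size for _ in range(n_actions)]
                  for _ in range(n_states)]
        for x in range(n_states):
            for a in range(n_actions):
                # Optimistic initial model: one pretend visit to Eden.
                self.N[x][a][self.eden] = 1
                self.C[x][a][self.eden] = r_max
        self.Q = [[self.v_eden] * n_actions for _ in range(n_states)]

    def _value(self, y):
        return self.v_eden if y == self.eden else max(self.Q[y])

    def _backup(self, x, a):
        # Q(x,a) = sum_y P(x,a,y) * (R(x,a,y) + gamma * V(y)), with
        # P = N(x,a,y) / N(x,a) and R = C(x,a,y) / N(x,a,y).
        n_xa = sum(self.N[x][a])
        total = 0.0
        for y, n in enumerate(self.N[x][a]):
            if n > 0:
                total += self.C[x][a][y] + self.gamma * n * self._value(y)
        return total / n_xa

    def act(self, x):
        q = self.Q[x]
        return q.index(max(q))  # always greedy; ties go to the lowest index

    def update(self, x, a, r, y):
        # Update counters, then run a few sweeps of value iteration.
        self.N[x][a][y] += 1
        self.C[x][a][y] += r
        for _ in range(self.sweeps):
            for s in range(self.n_states):
                for b in range(self.n_actions):
                    self.Q[s][b] = self._backup(s, b)
```

A driver loop would call act(x), apply the action in the environment, and pass the observed reward and next state to update. There is no separate exploration step: the pretend Eden visit keeps rarely tried pairs attractive until real data outweighs it.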

Page 15: Optimism in the Face of Uncertainty: a Unifying approach

Assembling the new algorithm IV

Potential problem: Rmax is too large!

separate real/bonus rewards!

initialize to 0, add “real” rewards

initialize to 0 or Rmax, add nothing

we can use it at any time!

cf. exploration bonus methods

exploration bonus!
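
One way to read the split, sketched under assumptions: keep two reward sums, C^r for observed rewards (starting at 0) and C^e for the bonus (Rmax on the Eden transition, 0 elsewhere, never updated), and back them up separately with the same greedy action. The superscripts r and e and the symbols \hat{P}, \hat{R}^r, \hat{R}^e are notation introduced here.

```latex
\begin{aligned}
Q^{r}(x,a) &= \sum_{y} \hat{P}(x,a,y)\,\bigl[\hat{R}^{r}(x,a,y) + \gamma\, Q^{r}(y,a_y)\bigr] \\
Q^{e}(x,a) &= \sum_{y} \hat{P}(x,a,y)\,\bigl[\hat{R}^{e}(x,a,y) + \gamma\, Q^{e}(y,a_y)\bigr] \\
a_y &= \arg\max_{a'} \bigl[\,Q^{r}(y,a') + Q^{e}(y,a')\,\bigr],
\qquad Q = Q^{r} + Q^{e}
\end{aligned}
```

Adding the two lines gives back the ordinary backup for Q, so the behaviour is unchanged while the bonus part Q^e can be inspected on its own.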

Page 16: Optimism in the Face of Uncertainty: a Unifying approach

Convergence results

One parameter: Rmax

for large Rmax, converges to near-optimum (with high probability)

proof is based on MBIE’s proof (and R-max, E³)

by the time the bonus becomes small → numVisits is large → model estimate is accurate

bonus is O(1/numVisits) (instead of MBIE’s O(1/√numVisits))

looser bound (but polynomial!)

Page 17: Optimism in the Face of Uncertainty: a Unifying approach

Experimental results I

“RiverSwim”

“SixArms”

(Strehl & Littman, 2006)

Page 18: Optimism in the Face of Uncertainty: a Unifying approach

Experimental results II

“Chain”

“Loop”

(Meuleau & Bourgine, 1999; Strens, 2000; Dearden, 2000)

Page 19: Optimism in the Face of Uncertainty: a Unifying approach

Experimental results III

“FlagMaze”

(Meuleau & Bourgine, 1999; Strens, 2000; Dearden, 2000)

Page 20: Optimism in the Face of Uncertainty: a Unifying approach

Experimental results IV

“Maze with subgoals”

(Wiering & Schmidhuber, 1998)

[figure: maze with reward labels +500, +500, +1000]

Page 21: Optimism in the Face of Uncertainty: a Unifying approach

Outlook

extension to factored MDPs: almost ready (we need benchmarks)

extension to general function approximation: in progress

Page 22: Optimism in the Face of Uncertainty: a Unifying approach

Advantages of OIM

polynomial-time convergence (to near-optimum, with high probability)

convincing performance in practice

extremely simple to implement

all work done at initialization

decision making is always greedy

Matlab source code to be released soon

Page 23: Optimism in the Face of Uncertainty: a Unifying approach

Thank you for your attention!

check our web pages at http://szityu.web.eotvos.elte.hu and http://inf.elte.hu/lorincz

or my reinforcement learning blog “Gimme Reward” at http://gimmereward.wordpress.com

Page 24: Optimism in the Face of Uncertainty: a Unifying approach

Full pseudocode of the OIM algorithm

Page 25: Optimism in the Face of Uncertainty: a Unifying approach

Exact statement of the convergence theorem