Optimism in the Face of Uncertainty: A Unifying Approach
István Szita & András Lőrincz
Eötvös Loránd University
Hungary
Outline
background
quick overview of exploration methods
construction of the new algorithm
analysis & experimental results
outlook
Background
Markov decision processes: finite, discounted (…but wait until the end of the talk)
value function-based methods: Q(x,a) values
the efficient exploration problem
Basic exploration: ε-greedy
extremely simple
sufficient for convergence in the limit for many classical methods like Q-learning, Dyna, Sarsa
…under suitable conditions
extremely inefficient
Advanced exploration
in case of uncertainty, be optimistic! …details vary
we will use concepts from: R-max, optimistic initial values, exploration bonus methods, model-based interval estimation
there are many others: Bayesian methods, UCT, delayed Q-learning, …
R-max
builds model from observations
uses an optimistic model: unknown transitions go to the “garden of Eden” (a hypothetical state with max. reward)
transitions declared known after polynomially many visits
+poly-time convergence
−slow in practice
(Brafman & Tennenholtz, 2001)
Optimistic initial values
set initial values high:
no extra work usually combined with
other techniques with very high initial
values, no need for additional exploration
+no extra work
−wears out slowly
only model-free
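A minimal sketch of the idea for a tabular Q-learner; the sizes, gamma and Rmax values below are illustrative, not from the talk:

```python
import numpy as np

n_states, n_actions = 10, 4          # illustrative sizes
gamma, R_MAX = 0.95, 1.0             # illustrative discount and reward bound

# optimistic initialization: every (x, a) starts at the highest possible return,
# so unvisited pairs keep looking attractive to a purely greedy policy
Q = np.full((n_states, n_actions), R_MAX / (1.0 - gamma))

def q_update(x, a, r, y, alpha=0.1):
    """Standard Q-learning backup; exploration comes only from the optimistic start."""
    Q[x, a] += alpha * (r + gamma * np.max(Q[y]) - Q[x, a])
```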
Exploration bonus methods
bonus reward for “interesting” states: rarely visited, large TD-error, etc.
exact size/form varies; can oscillate wildly
regular/bonus rewards accumulated in separate value functions
+can be efficient in practice
−ad-hoc method
−bonuses do not converge
(e.g. Meuleau & Bourgine, 1999; many others)
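One common count-based variant, sketched with an inverse-square-root bonus; the exact form and the constant beta are illustrative assumptions, since the talk only notes that the size/form varies:

```python
import numpy as np

n_states, n_actions = 10, 4                 # illustrative sizes
beta = 0.2                                  # illustrative bonus scale
visits = np.zeros((n_states, n_actions))    # visit counts per (x, a)

def bonus_reward(x, a):
    """Extra reward for rarely visited pairs; shrinks as the visit count grows."""
    return beta / np.sqrt(visits[x, a] + 1.0)
```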
Model-based interval estimation
builds model from observations
estimates confidence intervals of state values
exploration bonus: half-widths of intervals
+poly-time convergence
−???
(Wiering, 1998; Strehl & Littman, 2006)
Assembling the new algorithm
model estimation from empirical counts:
sum of rewards for all (x,a,y) transitions up to t
number of visits to (x,a,y) up to t
number of visits to (x,a) up to t
(estimated transition probabilities and rewards are the corresponding ratios and averages)
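A minimal sketch of these counters for a small finite MDP, assuming integer-indexed states and actions; the array and function names are illustrative:

```python
import numpy as np

n_states, n_actions = 10, 4                               # illustrative sizes

reward_sum = np.zeros((n_states, n_actions, n_states))    # sum of rewards on (x,a,y) up to t
visits_xay = np.zeros((n_states, n_actions, n_states))    # number of visits to (x,a,y) up to t
visits_xa  = np.zeros((n_states, n_actions))              # number of visits to (x,a) up to t

def record(x, a, r, y):
    """Update the counters after observing the transition (x, a) -> y with reward r."""
    reward_sum[x, a, y] += r
    visits_xay[x, a, y] += 1
    visits_xa[x, a] += 1

def estimated_model():
    """Maximum-likelihood model: transition probabilities and mean rewards from the counters."""
    P = visits_xay / np.maximum(visits_xa[:, :, None], 1)
    R = reward_sum / np.maximum(visits_xay, 1)
    return P, R
```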
Assembling the new algorithm II
Optimistic initial model: a single visit to xE from each (x,a)
really optimistic!
cf. optimistic initial values: no extra work after initialization
cf. R-max: hypothetical “Eden” state with max. reward
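A sketch of the optimistic initialization under the same counter layout, assuming the “garden of Eden” state xE is appended as one extra state index; the R_MAX value is illustrative:

```python
import numpy as np

n_states, n_actions = 10, 4    # illustrative sizes of the real MDP
R_MAX = 1.0                    # the algorithm's parameter (illustrative value)
EDEN = n_states                # index of the hypothetical max-reward state xE

# counters over the extended state space: the real states plus xE
visits_xay = np.zeros((n_states + 1, n_actions, n_states + 1))
reward_sum = np.zeros((n_states + 1, n_actions, n_states + 1))
visits_xa  = np.zeros((n_states + 1, n_actions))

# a single fictitious visit to xE from each (x, a), carrying the maximal reward;
# since this also covers x = xE, the Eden state is absorbing with reward R_MAX
visits_xay[:, :, EDEN] = 1
reward_sum[:, :, EDEN] = R_MAX
visits_xa[:, :] = 1
# no further work after initialization: real observations simply keep updating the counters
```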
Assembling the new algorithm III
in each step t, at := greedy with respect to Qt(xt, ·)
perform at, observe next state and reward
update counters and model parameters
solve the model MDP
…can be done incrementally & fast, e.g. a few steps of value iteration, or asynchronously by prioritized sweeping
get new value function Qt+1
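A sketch of one step of this loop. The env.step interface, and the record/estimated_model helpers from the counter sketch above, are assumptions for illustration:

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, sweeps=5, Q=None):
    """A few synchronous Bellman sweeps over the estimated model
    (prioritized sweeping would do the same work asynchronously)."""
    n_states, n_actions = P.shape[:2]
    if Q is None:
        Q = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        V = Q.max(axis=1)
        Q = (P * (R + gamma * V[None, None, :])).sum(axis=2)
    return Q

def oim_step(env, x, Q, gamma=0.95):
    """One interaction step: greedy action, model update, partial re-solve."""
    a = int(np.argmax(Q[x]))           # a_t := greedy w.r.t. Q_t(x_t, .)
    y, r = env.step(a)                 # perform a_t, observe next state and reward (assumed API)
    record(x, a, r, y)                 # update counters / model parameters (see sketch above)
    P, R = estimated_model()
    Q = value_iteration(P, R, gamma, sweeps=5, Q=Q)   # a few incremental sweeps -> Q_{t+1}
    return y, Q
```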
Assembling the new algorithm IV
Potential problem: Rmax is too large! separate real/bonus rewards!
real part: initialized to 0, the “real” rewards are added
bonus part: initialized to 0 or Rmax, nothing is added
the real part can be used at any time!
cf. exploration bonus methods: the bonus part plays the role of the exploration bonus
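A sketch of the split, assuming the estimated model is available as arrays P, R_real, R_bonus (observed rewards vs. Eden rewards); the real part is backed up with observed rewards only, the bonus part with the Eden rewards only, and the greedy policy uses their sum:

```python
import numpy as np

def split_backup(P, R_real, R_bonus, gamma=0.95, sweeps=5):
    """Back up two value functions under the same estimated model.
    Q_real:  initialized to 0, receives only the observed ("real") rewards.
    Q_bonus: receives only the optimistic Eden rewards, so it shrinks as visits accumulate."""
    n_states, n_actions = P.shape[:2]
    Q_real = np.zeros((n_states, n_actions))
    Q_bonus = np.zeros((n_states, n_actions))
    for _ in range(sweeps):
        # both parts are evaluated along the policy that is greedy in their sum
        greedy = (Q_real + Q_bonus).argmax(axis=1)
        V_real = Q_real[np.arange(n_states), greedy]
        V_bonus = Q_bonus[np.arange(n_states), greedy]
        Q_real = (P * (R_real + gamma * V_real[None, None, :])).sum(axis=2)
        Q_bonus = (P * (R_bonus + gamma * V_bonus[None, None, :])).sum(axis=2)
    # act greedily on Q_real + Q_bonus; Q_real can be reported at any time
    return Q_real, Q_bonus
```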
Convergence results
One parameter: Rmax
for large Rmax, converges to near-optimum (with high probability)
proof is based on MBIE’s proof (and R-max, E^3)
by the time the bonus becomes small ⇒ numVisits is large ⇒ model estimate is accurate
bonus is ≈ Rmax/(numVisits+1) (instead of MBIE’s ≈ 1/√numVisits)
looser bound (but polynomial!)
Experimental results I
“RiverSwim”
“SixArms”
(Strehl & Littman, 2006)
Experimental results II
“Chain”
“Loop”
(Meuleau & Bourgine, 1999; Strens, 2000; Dearden, 2000)
Experimental results III
“FlagMaze”
(Meuleau & Bourgine, 1999; Strens, 2000; Dearden, 2000)
Experimental results IV
“Maze with subgoals”
(Wiering & Schmidhuber, 1998)
(maze figure: subgoal rewards of +500, +500 and +1000)
Outlook
extension to factored MDPs: almost ready (we need benchmarks)
extension to general function approximation: in progress
Advantages of OIM
polynomial-time convergence (to near-optimum, with high probability)
convincing performance in practice
extremely simple to implement: all work done at initialization, decision making is always greedy
Matlab source code to be released soon
Thank you for your attention!
check our web pages at http://szityu.web.eotvos.elte.hu and http://inf.elte.hu/lorincz
or my reinforcement learning blog “Gimme Reward” at http://gimmereward.wordpress.com
Full pseudocode of the OIM algorithm
Exact statement of the convergence theorem