
Optimal Nudging: A new approach to solving SMDPs

Reinaldo Uribe M.
Universidad de los Andes — Oita University

Colorado State University

Nov. 11, 2013

Snakes & Ladders

Player advances the number of steps indicated by a die.

Landing on a snake’s mouth sends the player back to the tail.

Landing on a ladder’s bottom moves the player forward to the top.

Goal: reaching state 100.

Boring! (No skill required, only luck.)

Variation: Decision Snakes and Ladders

Sets of “win” and “loss” terminal states.

Actions: either “advance” or “go back,” to be decided before throwing the die.

Reinforcement Learning: finding an optimal policy.

“Natural” rewards: ±1 on “win”/“lose”, 0 otherwise.

The optimal policy maximizes total expected reward.

Dynamic programming quickly finds the optimal policy (a generic sketch follows below).

Probability of winning: pw = 0.97222…
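The dynamic-programming step can be sketched generically. The board layout (snakes, ladders, and the win/lose terminal sets) is not given in this transcript, so the transition function step and the terminal-reward table below are placeholders the reader must supply; this is only a minimal sketch of undiscounted value iteration under the ±1/0 reward scheme above.

    def value_iteration(states, actions, step, terminal_reward, n_die=6, tol=1e-9):
        # states: non-terminal board positions; actions: e.g. ("advance", "go back").
        # step(s, a, d): position reached from s after choosing a and rolling d.
        # terminal_reward: dict mapping terminal positions to +1 ("win") or -1 ("lose").
        V = {s: 0.0 for s in states}

        def q(s, a):
            # Expected (undiscounted) return of taking a in s under a fair die.
            total = 0.0
            for d in range(1, n_die + 1):
                nxt = step(s, a, d)
                total += terminal_reward[nxt] if nxt in terminal_reward else V[nxt]
            return total / n_die

        while True:
            delta = 0.0
            for s in states:
                best = max(q(s, a) for a in actions)
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                break
        return V, {s: max(actions, key=lambda a: q(s, a)) for s in states}

When the game terminates with probability 1, the value at the start state equals pw − (1 − pw) = 2 pw − 1, which is how a figure like pw = 0.97222 can be read off the dynamic-programming solution.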

We know a lot!

Markov Decision Process: states, actions, transition probabilities, rewards.

Policies and policy value.

Max winning probability ≠ max earnings.

Taking an action costs (in units different from rewards).

Different actions may have different costs.

Semi-Markov model with average rewards (formalized below).
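For the later slides it helps to have the gain criterion written out. A standard formulation of the average reward of a policy π in this semi-Markov setting, with per-transition rewards r_t and costs c_t (notation mine, not quoted from the slides), is

    \rho^{\pi} \;=\; \lim_{T \to \infty}
      \frac{\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} r_t\right]}
           {\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} c_t\right]},

i.e. expected reward earned per unit of expected cost, which reduces to the usual average reward per step when every action has unit cost.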


Better than optimal?

(Old optimal policy)

(Optimal policy) with average reward ρ = 0.08701.

pw = 0.48673 (was 0.97222 — 50.06%).

d = 11.17627 (was 84.58333 — 13.21%).

This policy maximizes the ratio pw / d.


So, how are average-reward optimal policies found?

Algorithm 1 Generic SMDP solver

Initialize
repeat forever
    Act.
    Do RL to find the value of the current π (usually 1-step Q-learning).
    Update ρ.

Average-adjusted Q-learning:

Q_{t+1}(s_t, a_t) ← (1 − γ_t) Q_t(s_t, a_t) + γ_t ( r_{t+1} − ρ_t c_{t+1} + max_a Q_t(s_{t+1}, a) )
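As a minimal sketch (the tabular dict-of-dicts representation of Q and all variable names are mine, not from the slides), the average-adjusted update above can be written as:

    def average_adjusted_q_update(Q, s, a, r, c, s_next, rho, gamma_t):
        # Q_{t+1}(s, a) <- (1 - gamma_t) * Q_t(s, a)
        #                  + gamma_t * (r - rho * c + max_a' Q_t(s_next, a'))
        # Q is a dict of dicts: Q[state][action] -> value.
        target = r - rho * c + max(Q[s_next].values())
        Q[s][a] = (1.0 - gamma_t) * Q[s][a] + gamma_t * target
        return Q[s][a]

The only difference from ordinary 1-step Q-learning is the nudging term −ρ_t c_{t+1} subtracted from the sampled reward.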

Generic Learning Algorithm

Table of algorithms: ARRL (average-reward reinforcement learning).

Algorithm — Gain update

AAC (Jalali and Ferguson 1989): ρ_{t+1} ← ( Σ_{i=0..t} r(s_i, π_i(s_i)) ) / (t + 1)

R-Learning (Schwartz 1993): ρ_{t+1} ← (1 − α) ρ_t + α ( r_{t+1} + max_a Q_t(s_{t+1}, a) − max_a Q_t(s_t, a) )

H-Learning (Tadepalli and Ok 1998): ρ_{t+1} ← (1 − α_t) ρ_t + α_t ( r_{t+1} − H_t(s_t) + H_t(s_{t+1}) ), with α_{t+1} ← α_t / (α_t + 1)

SSP Q-Learning (Abounadi et al. 2001): ρ_{t+1} ← ρ_t + α_t min_a Q_t(s, a)

HAR (Ghavamzadeh and Mahadevan 2007): ρ_{t+1} ← ( Σ_{i=0..t} r(s_i, π_i(s_i)) ) / (t + 1)

Generic Learning Algorithm

Table of algorithms: SMDP RL.

Algorithm — Gain update

SMART (Das et al. 1999): ρ_{t+1} ← ( Σ_{i=0..t} r(s_i, π_i(s_i)) ) / ( Σ_{i=0..t} c(s_i, π_i(s_i)) )

MAX-Q (Ghavamzadeh and Mahadevan 2001)
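To make the tables concrete, here is a hedged sketch of three of the listed gain updates: the sample-average rule shared by AAC and HAR, the R-Learning rule, and SMART's reward-to-cost ratio. The symbols follow the tables; everything else is illustrative.

    def gain_sample_average(rewards):
        # AAC / HAR: rho_{t+1} = (sum of rewards observed so far) / (t + 1).
        return sum(rewards) / len(rewards)

    def gain_r_learning(rho, alpha, r, Q, s, s_next):
        # R-Learning: rho <- (1 - alpha) * rho
        #                    + alpha * (r + max_a Q(s', a) - max_a Q(s, a)).
        return (1.0 - alpha) * rho + alpha * (
            r + max(Q[s_next].values()) - max(Q[s].values()))

    def gain_smart(rewards, costs):
        # SMART: rho_{t+1} = (sum of rewards so far) / (sum of costs so far).
        return sum(rewards) / sum(costs)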

Nudging

Algorithm 2 Nudged Learning

Initialize (π, ρ, Q)
repeat
    Set reward scheme to (r − ρc).
    Solve by any RL method.
    Update ρ.
until Qπ(sI) = 0

Note: ‘by any RL method’ refers to a well-studied problem for which better algorithms (both practical and with theoretical guarantees) exist.

ρ can (and will) be updated optimally.
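A schematic of the outer loop, with both the inner solver and the ρ update left abstract; solve_mdp and update_rho below are placeholders, not APIs from the paper:

    def nudged_learning(solve_mdp, update_rho, s_initial, rho=0.0, tol=1e-6):
        # Repeatedly solve the task under the nudged reward (r - rho * c),
        # then update rho, until the optimal value at the start state vanishes.
        while True:
            Q, pi = solve_mdp(lambda r, c: r - rho * c)   # any RL method
            if abs(max(Q[s_initial].values())) < tol:     # Q_pi(s_I) = 0: stop
                return pi, rho
            rho = update_rho(rho, Q, pi)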


The w − l space: Definition

(Policy π has expected average reward vπ and expected average cost cπ. Let D be a bound on the absolute value of vπ.)

wπ = (D + vπ) / (2 cπ),    lπ = (D − vπ) / (2 cπ).

[Figure: the task’s policies plotted as points in the w–l plane; axes l and w, both bounded by D.]
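The change of coordinates itself is a one-liner; a sketch, with v, c and D as in the definition above:

    def to_wl(v, c, D):
        # w = (D + v) / (2c),  l = (D - v) / (2c);  requires c > 0 and |v| <= D.
        return (D + v) / (2.0 * c), (D - v) / (2.0 * c)

Note that w + l = D/c and w − l = v/c, so both coordinates are non-negative and all policies with the same gain v/c lie on a common line w − l = constant, consistent with the “linear problems in w − l” remark later in the talk.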

The w − l space: Value and cost

(wπ and lπ defined as above.)

[Figure: level sets of policy value (from −D to D) and of policy cost in the w–l plane.]

The w − l space: Nudged value

(wπ and lπ defined as above.)

[Figure: the same policy points in the w–l plane, annotated with the nudged value; axis ticks at −D/2, 0, D/2.]

The w − l space: As a projective transformation

[Figure: the transformation from (episode length, policy value) coordinates, with value in [−D, D] and length starting at 1, onto the w–l plane.]

Sample task: two states, continuous actions

State s1: action a1 ∈ [0, 1], reward r1 = 1 + (a1 − 0.5)², cost c1 = 1 + a1.

State s2: action a2 ∈ [0, 1], reward r2 = 1 + a2, cost c2 = 1 + (a2 − 0.5)².
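As a quick check of the construction, a sketch that evaluates a deterministic policy (a1, a2) of this task. The assumption that the process simply alternates s1 → s2 → s1 → … is my reading of the example, not stated in the transcript, and D = 2 is just some bound on |v|:

    def evaluate(a1, a2, D=2.0):
        # Per-visit rewards and costs, as defined on the slide.
        r1, c1 = 1 + (a1 - 0.5) ** 2, 1 + a1
        r2, c2 = 1 + a2, 1 + (a2 - 0.5) ** 2
        # Assuming the chain alternates between the two states, the average
        # reward and cost per transition are the two-state means.
        v, c = (r1 + r2) / 2.0, (c1 + c2) / 2.0
        w, l = (D + v) / (2 * c), (D - v) / (2 * c)
        return v, c, w, l

For instance, evaluate(0.0, 1.0) gives v = 1.625 and c = 1.125.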

Sample task: policy space (actions)

[Figure: the policy space is the unit square 0 ≤ a1 ≤ 1, 0 ≤ a2 ≤ 1.]

Sample task: policy values and costs

[Figure: policy value plotted against policy cost over the policy space.]

Sample task: policy manifold in w − l

[Figure: the image of the policy space in the w–l plane; axes l and w, scale D/2.]

And the rest...

Neat geometry: problems become linear in w − l, and are easily exploited using straightforward algebra and calculus.

Updating the average reward between iterations can be optimized.

It becomes finding the (or rather an) intersection between two conics (illustrated below),

which can be solved in O(1) time.

Worst case, the uncertainty is halved; typically much better than that.

Little extra complexity added to already-PAC methods.
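The conic-intersection step can be illustrated generically. The two conics below are arbitrary placeholders (a circle and a hyperbola), not the actual bounds arising in the method; the point is only that such a fixed-degree polynomial system has a closed-form solution, independent of the size of the task:

    import sympy as sp

    w, l = sp.symbols('w l', real=True)
    conic_a = w ** 2 + l ** 2 - 1          # placeholder conic: unit circle
    conic_b = w * l - sp.Rational(1, 4)    # placeholder conic: rectangular hyperbola
    # Intersecting two conics is a constant-size algebraic problem.
    print(sp.solve([conic_a, conic_b], [w, l]))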


Thank you. r-uribe@uniandes.edu.co

Untitled by Li Wei, School of Design, Oita University, 2009.
