
Optimal Nudging: A new approach to solving SMDPs

Reinaldo Uribe M.
Universidad de los Andes — Oita University

Colorado State University

Nov. 11, 2013

Snakes & Ladders

Player advances the number of steps indicated by a die.

Landing on a snake’s mouth sends the player back to the tail.

Landing on a ladder’s bottom moves the player forward to the top.

Goal: reaching state 100.

Boring! (No skill required, only luck.)

Variation: Decision Snakes and Ladders

Sets of “win” and “loss” terminal states.

Actions: either “advance” or “go back,” to be decided before throwing the die.

Reinforcement Learning: finding an optimal policy.

“Natural” rewards: ±1 on “win”/“lose”, 0 otherwise.

The optimal policy maximizes total expected reward.

Dynamic programming quickly finds the optimal policy (a generic sketch follows below).

Probability of winning: pw = 0.97222…
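The dynamic-programming step can be sketched generically. The board layout (snakes, ladders, and the win/lose terminal sets) is not given in this transcript, so the transition function step and the terminal-reward table below are placeholders the reader must supply; this is only a minimal sketch of undiscounted value iteration under the ±1/0 reward scheme above.

    def value_iteration(states, actions, step, terminal_reward, n_die=6, tol=1e-9):
        # states: non-terminal board positions; actions: e.g. ("advance", "go back").
        # step(s, a, d): position reached from s after choosing a and rolling d.
        # terminal_reward: dict mapping terminal positions to +1 ("win") or -1 ("lose").
        V = {s: 0.0 for s in states}

        def q(s, a):
            # Expected (undiscounted) return of taking a in s under a fair die.
            total = 0.0
            for d in range(1, n_die + 1):
                nxt = step(s, a, d)
                total += terminal_reward[nxt] if nxt in terminal_reward else V[nxt]
            return total / n_die

        while True:
            delta = 0.0
            for s in states:
                best = max(q(s, a) for a in actions)
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                break
        return V, {s: max(actions, key=lambda a: q(s, a)) for s in states}

When the game terminates with probability 1, the value at the start state equals pw − (1 − pw) = 2 pw − 1, which is how a figure like pw = 0.97222 can be read off the dynamic-programming solution.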

We know a lot!

Markov Decision Process: states, actions, transition probabilities, rewards.

Policies and policy value.

Max winning probability ≠ max earnings.

Taking an action costs (in units different from rewards).

Different actions may have different costs.

Semi-Markov model with average rewards (formalized below).
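For the later slides it helps to have the gain criterion written out. A standard formulation of the average reward of a policy π in this semi-Markov setting, with per-transition rewards r_t and costs c_t (notation mine, not quoted from the slides), is

    \rho^{\pi} \;=\; \lim_{T \to \infty}
      \frac{\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} r_t\right]}
           {\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} c_t\right]},

i.e. expected reward earned per unit of expected cost, which reduces to the usual average reward per step when every action has unit cost.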


Better than optimal?

(Old optimal policy)

(Optimal policy) with average reward ρ = 0.08701.

pw = 0.48673 (was 0.97222 — 50.06%).

d = 11.17627 (was 84.58333 — 13.21%).

This policy maximizes the ratio pw / d.


So, how are average-reward optimal policies found?

Algorithm 1 Generic SMDP solver

Initialize
repeat forever
    Act.
    Do RL to find the value of the current π (usually 1-step Q-learning).
    Update ρ.

Average-adjusted Q-learning:

Q_{t+1}(s_t, a_t) ← (1 − γ_t) Q_t(s_t, a_t) + γ_t ( r_{t+1} − ρ_t c_{t+1} + max_a Q_t(s_{t+1}, a) )
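As a minimal sketch (the tabular dict-of-dicts representation of Q and all variable names are mine, not from the slides), the average-adjusted update above can be written as:

    def average_adjusted_q_update(Q, s, a, r, c, s_next, rho, gamma_t):
        # Q_{t+1}(s, a) <- (1 - gamma_t) * Q_t(s, a)
        #                  + gamma_t * (r - rho * c + max_a' Q_t(s_next, a'))
        # Q is a dict of dicts: Q[state][action] -> value.
        target = r - rho * c + max(Q[s_next].values())
        Q[s][a] = (1.0 - gamma_t) * Q[s][a] + gamma_t * target
        return Q[s][a]

The only difference from ordinary 1-step Q-learning is the nudging term −ρ_t c_{t+1} subtracted from the sampled reward.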

Generic Learning Algorithm

Table of algorithms: ARRL (average-reward reinforcement learning).

Algorithm — Gain update

AAC (Jalali and Ferguson 1989): ρ_{t+1} ← ( Σ_{i=0..t} r(s_i, π_i(s_i)) ) / (t + 1)

R-Learning (Schwartz 1993): ρ_{t+1} ← (1 − α) ρ_t + α ( r_{t+1} + max_a Q_t(s_{t+1}, a) − max_a Q_t(s_t, a) )

H-Learning (Tadepalli and Ok 1998): ρ_{t+1} ← (1 − α_t) ρ_t + α_t ( r_{t+1} − H_t(s_t) + H_t(s_{t+1}) ), with α_{t+1} ← α_t / (α_t + 1)

SSP Q-Learning (Abounadi et al. 2001): ρ_{t+1} ← ρ_t + α_t min_a Q_t(s, a)

HAR (Ghavamzadeh and Mahadevan 2007): ρ_{t+1} ← ( Σ_{i=0..t} r(s_i, π_i(s_i)) ) / (t + 1)

Generic Learning Algorithm

Table of algorithms: SMDP RL.

Algorithm — Gain update

SMART (Das et al. 1999): ρ_{t+1} ← ( Σ_{i=0..t} r(s_i, π_i(s_i)) ) / ( Σ_{i=0..t} c(s_i, π_i(s_i)) )

MAX-Q (Ghavamzadeh and Mahadevan 2001)
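To make the tables concrete, here is a hedged sketch of three of the listed gain updates: the sample-average rule shared by AAC and HAR, the R-Learning rule, and SMART's reward-to-cost ratio. The symbols follow the tables; everything else is illustrative.

    def gain_sample_average(rewards):
        # AAC / HAR: rho_{t+1} = (sum of rewards observed so far) / (t + 1).
        return sum(rewards) / len(rewards)

    def gain_r_learning(rho, alpha, r, Q, s, s_next):
        # R-Learning: rho <- (1 - alpha) * rho
        #                    + alpha * (r + max_a Q(s', a) - max_a Q(s, a)).
        return (1.0 - alpha) * rho + alpha * (
            r + max(Q[s_next].values()) - max(Q[s].values()))

    def gain_smart(rewards, costs):
        # SMART: rho_{t+1} = (sum of rewards so far) / (sum of costs so far).
        return sum(rewards) / sum(costs)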

Nudging

Algorithm 2 Nudged Learning

Initialize (π, ρ, Q)
repeat
    Set reward scheme to (r − ρc).
    Solve by any RL method.
    Update ρ.
until Qπ(sI) = 0

Note: ‘by any RL method’ refers to a well-studied problem for which better algorithms (both practical and with theoretical guarantees) exist.

ρ can (and will) be updated optimally.
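A schematic of the outer loop, with both the inner solver and the ρ update left abstract; solve_mdp and update_rho below are placeholders, not APIs from the paper:

    def nudged_learning(solve_mdp, update_rho, s_initial, rho=0.0, tol=1e-6):
        # Repeatedly solve the task under the nudged reward (r - rho * c),
        # then update rho, until the optimal value at the start state vanishes.
        while True:
            Q, pi = solve_mdp(lambda r, c: r - rho * c)   # any RL method
            if abs(max(Q[s_initial].values())) < tol:     # Q_pi(s_I) = 0: stop
                return pi, rho
            rho = update_rho(rho, Q, pi)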


The w − l space: Definition

(Policy π has expected average reward vπ and expected average cost cπ. Let D be a bound on the absolute value of vπ.)

wπ = (D + vπ) / (2 cπ),    lπ = (D − vπ) / (2 cπ).

[Figure: the task’s policies plotted as points in the w–l plane; axes l and w, both bounded by D.]
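The change of coordinates itself is a one-liner; a sketch, with v, c and D as in the definition above:

    def to_wl(v, c, D):
        # w = (D + v) / (2c),  l = (D - v) / (2c);  requires c > 0 and |v| <= D.
        return (D + v) / (2.0 * c), (D - v) / (2.0 * c)

Note that w + l = D/c and w − l = v/c, so both coordinates are non-negative and all policies with the same gain v/c lie on a common line w − l = constant, consistent with the “linear problems in w − l” remark later in the talk.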

The w − l space: Value and cost

(wπ and lπ defined as above.)

[Figure: level sets of policy value (from −D to D) and of policy cost in the w–l plane.]

The w − l space: Nudged value

(wπ and lπ defined as above.)

[Figure: the same policy points in the w–l plane, annotated with the nudged value; axis ticks at −D/2, 0, D/2.]

The w − l space: As a projective transformation

[Figure: the transformation from (episode length, policy value) coordinates, with value in [−D, D] and length starting at 1, onto the w–l plane.]

Sample task: two states, continuous actions

State s1: action a1 ∈ [0, 1], reward r1 = 1 + (a1 − 0.5)², cost c1 = 1 + a1.

State s2: action a2 ∈ [0, 1], reward r2 = 1 + a2, cost c2 = 1 + (a2 − 0.5)².
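As a quick check of the construction, a sketch that evaluates a deterministic policy (a1, a2) of this task. The assumption that the process simply alternates s1 → s2 → s1 → … is my reading of the example, not stated in the transcript, and D = 2 is just some bound on |v|:

    def evaluate(a1, a2, D=2.0):
        # Per-visit rewards and costs, as defined on the slide.
        r1, c1 = 1 + (a1 - 0.5) ** 2, 1 + a1
        r2, c2 = 1 + a2, 1 + (a2 - 0.5) ** 2
        # Assuming the chain alternates between the two states, the average
        # reward and cost per transition are the two-state means.
        v, c = (r1 + r2) / 2.0, (c1 + c2) / 2.0
        w, l = (D + v) / (2 * c), (D - v) / (2 * c)
        return v, c, w, l

For instance, evaluate(0.0, 1.0) gives v = 1.625 and c = 1.125.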

Sample task: policy space (actions)

[Figure: the policy space is the unit square 0 ≤ a1 ≤ 1, 0 ≤ a2 ≤ 1.]

Sample task: policy values and costs

[Figure: policy value plotted against policy cost over the policy space.]

Sample task: policy manifold in w − l

[Figure: the image of the policy space in the w–l plane; axes l and w, scale D/2.]

And the rest...

Neat geometry: problems become linear in w − l, and are easily exploited using straightforward algebra and calculus.

Updating the average reward between iterations can be optimized.

It becomes finding the (or rather an) intersection between two conics (illustrated below),

which can be solved in O(1) time.

Worst case, the uncertainty is halved; typically much better than that.

Little extra complexity added to already-PAC methods.
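The conic-intersection step can be illustrated generically. The two conics below are arbitrary placeholders (a circle and a hyperbola), not the actual bounds arising in the method; the point is only that such a fixed-degree polynomial system has a closed-form solution, independent of the size of the task:

    import sympy as sp

    w, l = sp.symbols('w l', real=True)
    conic_a = w ** 2 + l ** 2 - 1          # placeholder conic: unit circle
    conic_b = w * l - sp.Rational(1, 4)    # placeholder conic: rectangular hyperbola
    # Intersecting two conics is a constant-size algebraic problem.
    print(sp.solve([conic_a, conic_b], [w, l]))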


Thank you. r-uribe@uniandes.edu.co

Untitled by Li Wei, School of Design, Oita University, 2009.
