Reinforcement Learning
Lecture 5: Bandit optimisation
Alexandre Proutiere, Sadegh Talebi, Jungseul Ok
KTH, The Royal Institute of Technology
Objectives of this lecture
Introduce bandit optimisation: the most basic RL problem without
dynamics. Optimising the exploration vs exploitation trade-off.
• Regret lower bounds
• Algorithms based on the "optimism in the face of uncertainty" principle
• Thompson Sampling algorithm
• Structured bandits
Lecture 5: Outline
1. Classifying bandit problems
2. Regret lower bounds
3. Algorithms based on the "optimism in the face of uncertainty" principle
4. Thompson Sampling algorithm
5. Structured bandits
Bandit Optimisation
• Interact with an i.i.d. or adversarial environment
• Set of available actions A with unknown sequences of rewards
r_t(a), t = 1, 2, . . .
• The reward is the only feedback – bandit feedback
• Stochastic vs. adversarial bandits
- i.i.d. environment: r_t(a) is a random variable with mean θ_a
- adversarial environment: r_t(a) is arbitrary!
• Objective: develop an action selection rule π maximising the
expected cumulative reward up to step T
Remark: π must select an action depending on the entire history of
observations!
Regret
• Difference between the cumulative reward of an "Oracle" policy and
that of the agent π
• Regret quantifies the price to pay for learning
• Exploration vs. exploitation trade-off: we need to probe all actions
in order to identify the best one and play it later
Applications
Clinical trial, Thompson 1933
- Two available treatments with unknown cure probabilities (outcome:
'Live' or 'Die')
- Bandit feedback: after administering the treatment to a patient, we
observe whether she survives or dies
- Goal: design a treatment selection scheme π maximising the number
of patients cured after treatment
Applications
Rate adaptation in 802.11 wireless systems
- The AP sequentially sends packets to the receiver and has K
available encoding rates r_1 < r_2 < . . . < r_K
- The probability θ_k that a packet sent at rate r_k is received is unknown
- Goal: design a rate selection scheme that learns the θ_k's and quickly
converges to the rate r_{k*} maximising µ_k = r_k θ_k over k
Applications
Search engines
- The engine should list relevant webpages depending on the request
’jaguar’
- The CTRs (Click-Through Rates) are unknown
- Goal: design a list selection scheme that learns the list maximising
its global CTR
Bandit Taxonomy
• Stochastic bandits: the sequence of rewards (r_t(a), a ∈ A)_{t≥1} is
generated according to an i.i.d. process – the average rewards are
unknown
• Adversarial bandits: arbitrary sequence of rewards
Most bandit problems in engineering are stochastic ...
Stochastic Bandit Taxonomy
Unstructured problems: the average rewards are not related:
θ = (θ_1, . . . , θ_K) ∈ Θ = ∏_k [a_k, b_k]
Structured problems: the decision maker knows that the average rewards
are related; she knows Θ. The rewards observed for a given arm provide
side information about the other arms:
θ = (θ_1, . . . , θ_K) ∈ Θ, where Θ is not a hyper-rectangle
Stochastic Bandit Taxonomy
Lecture 5: Outline
1. Classifying bandit problems
2. Regret lower bounds
3. Algorithms based on the "optimism in the face of uncertainty" principle
4. Thompson Sampling algorithm
5. Structured bandits
Unstructured Stochastic Bandits – Robbins 1952
• Finite set of actions A = {1, . . . , K}
• (Unknown) rewards of action a ∈ A: (r_t(a), t ≥ 1) i.i.d. Bernoulli
with E[r_t(a)] = θ_a, θ ∈ Θ = [0, 1]^K
• Optimal action a* ∈ arg max_a θ_a
• Online policy π: select action a^π_t at time t depending on
a^π_1, r_1(a^π_1), . . . , a^π_{t−1}, r_{t−1}(a^π_{t−1})
• Regret up to time T: R^π(T) = T θ_{a*} − ∑_{t=1}^T θ_{a^π_t}
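To make the interaction loop concrete, here is a minimal Python sketch (not part of the original slides) of a Bernoulli bandit environment together with the pseudo-regret T θ_{a*} − ∑_t θ_{a_t}. The class names and the select/update policy interface are illustrative choices; the same interface is reused in the algorithm sketches further below.

import numpy as np

class BernoulliBandit:
    """Stochastic bandit with K arms and Bernoulli rewards of means theta."""
    def __init__(self, theta, seed=0):
        self.theta = np.asarray(theta, dtype=float)
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        # Bernoulli reward with mean theta[a]
        return float(self.rng.random() < self.theta[a])

def run(env, policy, T):
    """Run a policy for T rounds and return its pseudo-regret."""
    best = env.theta.max()
    regret = 0.0
    for t in range(T):
        a = policy.select(t)           # the action may depend on the whole history
        r = env.pull(a)                # bandit feedback: only the chosen arm's reward
        policy.update(a, r)
        regret += best - env.theta[a]  # T*theta_star - sum_t theta_{a_t}
    return regret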
Problem-specific Regret Lower Bound
Uniformly good algorithms: An algorithm π is uniformly good if, for all
θ ∈ Θ and any sub-optimal arm a, the number of times N_a(t) arm a is
selected up to round t satisfies E[N_a(t)] = o(t^α) for all α > 0.
Theorem (Lai-Robbins 1985)
For any uniformly good algorithm π:
lim inf_{T→∞} R^π(T) / log(T) ≥ ∑_{a ≠ a*} (θ_{a*} − θ_a) / KL(θ_a, θ_{a*})
where KL(a, b) = a log(a/b) + (1 − a) log((1 − a)/(1 − b)) is the KL divergence
between Bernoulli distributions of means a and b
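The constant on the right-hand side is easy to evaluate for a given Bernoulli instance; a small illustrative Python sketch (the function names are ours, not from the lecture):

import math

def kl_bernoulli(a, b, eps=1e-12):
    # KL divergence between Bernoulli(a) and Bernoulli(b), with clipping to avoid log(0)
    a = min(max(a, eps), 1 - eps)
    b = min(max(b, eps), 1 - eps)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def lai_robbins_constant(theta):
    # Coefficient of log(T) in the Lai-Robbins lower bound
    best = max(theta)
    return sum((best - th) / kl_bernoulli(th, best) for th in theta if th < best)

print(lai_robbins_constant([0.9, 0.8, 0.5]))  # asymptotic regret / log(T) for this instance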
Minimax Regret Lower Bound
Theorem (Auer et al. 2002)
For any T, we can find a problem (depending on T) such that for any
algorithm π,
R^π(T) ≥ √(KT) (1 − 1/K).
Unified proofs
• Change-of-measure: θ → ν
• Log-likelihood ratio L: E_θ[L] = ∑_j E_θ[N_j(T)] KL(θ_j, ν_j)
• Data processing inequality. For any event A ∈ F_T,
P_ν(A) = E_θ[exp(−L) 1_A]. Jensen's inequality yields:
P_ν(A) ≥ exp(−E_θ[L | A]) P_θ(A)
P_ν(A^c) ≥ exp(−E_θ[L | A^c]) P_θ(A^c)
Hence E_θ[L] ≥ KL(P_θ(A), P_ν(A))
• Data processing inequality v2. For all F_T-measurable Z with values in [0, 1],
E_θ[L] ≥ KL(E_θ(Z), E_ν(Z))
Proof – problem-specific lower bound
Change-of-measure: θ → ν with ν_j = θ_j for all j ≠ a, and ν_a = θ_{a*} + ε.
E_θ[L] = E_θ[N_a(T)] KL(θ_a, θ_{a*} + ε) ≥ KL(P_θ(A), P_ν(A))
Select A = {N_{a*}(T) ≤ T − √T}. Markov's inequality yields (for uniformly
good algorithms): lim_{T→∞} P_θ[A] = 0 = lim_{T→∞} P_ν[A^c], hence
lim inf_{T→∞} E_θ[N_a(T)] / log(T) ≥ 1 / KL(θ_a, θ_{a*} + ε)
Proof – minimax lower bound
Change-of-measure: θ_a = 1/2 for all a. Then there exists an arm a such that
E_θ[N_a(T)] ≤ T/K. Set ν_i = θ_i for all i ≠ a, and ν_a = 1/2 + ε.
E_θ[L] = E_θ[N_a(T)] KL(1/2, 1/2 + ε) ≥ KL(E_θ(Z), E_ν(Z))
where KL(1/2, 1/2 + ε) = (1/2) log(1/(1 − 4ε²)). Select Z = N_a(T)/T.
Pinsker's inequality yields:
(E_θ[N_a(T)]/2) log(1/(1 − 4ε²)) ≥ 2 (E_ν[N_a(T)]/T − E_θ[N_a(T)]/T)²
Hence, with E_θ[N_a(T)] ≤ T/K,
E_ν[N_a(T)]/T ≤ 1/K + (1/2) √((T/K) log(1/(1 − 4ε²)))
Proof – minimax lower bound
Now R_ν(T) = Tε (1 − E_ν[N_a(T)]/T), and we conclude, choosing ε = √(K/T),
that:
R_ν(T) ≥ √(KT) (1 − 1/K).
Lecture 5: Outline
1. Classifying bandit problems
2. Regret lower bounds
3. Algorithms based on the "optimism in the face of uncertainty"
principle
4. Thompson Sampling algorithm
5. Structured bandits
Concentration
The main tools in the design and analysis of algorithms for stochastic
bandits are concentration-of-measure results.
Let X_1, X_2, . . . be i.i.d. real-valued random variables with mean µ and with
log-moment generating function G(λ) = log(E[e^{λ(X_n − µ)}]). Let S_n = ∑_{i=1}^n X_i.
• Strong law of large numbers: P[lim_{n→∞} S_n/n = µ] = 1
• Concentration inequality: for δ, λ > 0,
P[S_n − nµ ≥ δ] = P[e^{λ(S_n − nµ)} ≥ e^{λδ}]
≤ e^{−λδ} E[e^{λ(S_n − nµ)}]
= e^{−λδ} ∏_{i=1}^n E[e^{λ(X_i − µ)}]
= e^{nG(λ) − λδ}
Optimising over λ: P[S_n − nµ ≥ δ] ≤ e^{−sup_{λ>0}(λδ − nG(λ))}
Concentration
P[S_n − nµ ≥ δ] ≤ e^{−sup_{λ>0}(λδ − nG(λ))}
• Bounded r.v. X_n ∈ [a, b]: G(λ) ≤ λ²(b − a)²/8
Hoeffding's inequality:
P[S_n − nµ ≥ δ] ≤ e^{−2δ²/(n(b−a)²)}
• Sub-Gaussian r.v.: G(λ) ≤ σ²λ²/2
• Bernoulli r.v.: G(λ) = log(µ e^{λ(1−µ)} + (1 − µ) e^{−λµ})
Chernoff's inequality:
P[S_n − nµ ≥ δ] ≤ e^{−n KL(µ + δ/n, µ)}
where KL(a, b) = a log(a/b) + (1 − a) log((1 − a)/(1 − b)) (KL divergence)
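As a quick, purely illustrative sanity check of Hoeffding's bound for Bernoulli variables (the parameter values below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
mu, n, delta, trials = 0.3, 200, 15.0, 100_000

# Empirical frequency of the deviation event {S_n - n*mu >= delta}
samples = rng.random((trials, n)) < mu
freq = np.mean(samples.sum(axis=1) - n * mu >= delta)

bound = np.exp(-2 * delta**2 / n)   # Hoeffding with [a, b] = [0, 1]
print(freq, bound)                  # the empirical frequency should lie below the bound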
Algorithms
Estimating the average reward of arm a:
θ_a(t) = (1 / N_a(t)) ∑_{n=1}^t r_n(a) 1_{a(n) = a}
• ε-greedy. In each round t:
- with probability 1 − ε, select the best empirical arm
a*(t) ∈ arg max_a θ_a(t)
- with probability ε, select an arm uniformly at random
The algorithm has linear regret (not uniformly good)
Algorithms
• ε_t-greedy. In each round t:
- with probability 1 − ε_t, select the best empirical arm
a*(t) ∈ arg max_a θ_a(t)
- with probability ε_t, select an arm uniformly at random
The algorithm has logarithmic regret for Bernoulli rewards and
ε_t = min(1, K/(t δ²)), where δ = min_{a ≠ a*} (θ_{a*} − θ_a)
Sketch of proof. For a ≠ a* to be selected in round t, we need (most
often) θ_a(t) ≥ θ_a + δ. By Hoeffding's inequality, the probability that this
occurs is less than exp(−2δ² N_a(t)). But N_a(t) is of order log(t)/δ².
Summing over t yields the result.
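A minimal sketch of ε_t-greedy with the schedule above; the class name and the select/update interface are the illustrative ones introduced earlier, and the constant c is an assumed tuning knob (c = 1 matches the slide). Plain ε-greedy is recovered by keeping ε_t constant.

import numpy as np

class EpsGreedy:
    """epsilon_t-greedy: explore uniformly w.p. eps_t, otherwise play the empirical best arm."""
    def __init__(self, K, delta, c=1.0, seed=0):
        self.K, self.delta, self.c = K, delta, c
        self.means = np.zeros(K)     # empirical means theta_a(t)
        self.counts = np.zeros(K)    # N_a(t)
        self.rng = np.random.default_rng(seed)

    def select(self, t):
        eps_t = min(1.0, self.c * self.K / (max(t, 1) * self.delta ** 2))
        if self.rng.random() < eps_t:
            return int(self.rng.integers(self.K))   # exploration
        return int(np.argmax(self.means))           # exploitation

    def update(self, a, r):
        self.counts[a] += 1
        self.means[a] += (r - self.means[a]) / self.counts[a]

Note that the schedule requires knowing the minimal gap δ, which is rarely available in practice.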
Algorithms
Optimism in the Face of Uncertainty
Upper Confidence Bound (UCB) algorithm:
b_a(t) = θ_a(t) + √(2 log(t) / N_a(t))
θ_a(t): empirical reward of a up to t
N_a(t): number of times a has been played up to t
In each round t, select the arm with the highest index b_a(t)
Under UCB, the number of times a ≠ a* is selected satisfies:
E[N_a(T)] ≤ 8 log(T) / (θ_{a*} − θ_a)² + π²/3
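A minimal sketch of UCB with the index above (illustrative; each arm is played once before the index is used):

import numpy as np

class UCB1:
    """UCB index b_a(t) = theta_a(t) + sqrt(2 log t / N_a(t))."""
    def __init__(self, K):
        self.K = K
        self.means = np.zeros(K)
        self.counts = np.zeros(K)
        self.t = 0

    def select(self, t=None):
        self.t += 1
        if self.t <= self.K:                 # initialisation: play each arm once
            return self.t - 1
        bonus = np.sqrt(2.0 * np.log(self.t) / self.counts)
        return int(np.argmax(self.means + bonus))

    def update(self, a, r):
        self.counts[a] += 1
        self.means[a] += (r - self.means[a]) / self.counts[a]

With the earlier run() sketch, run(BernoulliBandit([0.9, 0.8, 0.5]), UCB1(3), T=10_000) should exhibit regret growing logarithmically in T.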
Regret analysis of UCB (Auer et al. 2002)
Write ℓ for a threshold to be chosen later, θ*_s for the empirical mean of a*
after s plays, and θ_{a,s} for the empirical mean of a after s plays. Then
N_a(T) = 1 + ∑_{t=K+1}^T 1{a(t) = a}
       ≤ ℓ + ∑_{t=K+1}^T 1{a(t) = a, N_a(t) ≥ ℓ}
       ≤ ℓ + ∑_{t=K+1}^T 1{N_a(t) ≥ ℓ} 1{b_{a*}(t − 1) ≤ b_a(t − 1)}
       ≤ ℓ + ∑_{t=K+1}^T 1{ min_{s<t} (θ*_s + √(2 log(t−1)/s)) ≤ max_{ℓ≤s′<t} (θ_{a,s′} + √(2 log(t−1)/s′)) }
       ≤ ℓ + ∑_{t=K+1}^T ∑_{s<t, ℓ≤s′<t} 1{ θ*_s + √(2 log(t−1)/s) ≤ θ_{a,s′} + √(2 log(t−1)/s′) }
Regret analysis of UCB (Auer et al. 2002)
θ*_s + √(2 log(t−1)/s) ≤ θ_{a,s′} + √(2 log(t−1)/s′) implies that at least one of
the following events holds:
A : θ*_s ≤ θ* − √(2 log(t−1)/s),  or  B : θ_{a,s′} ≥ θ_a + √(2 log(t−1)/s′),
or  C : θ* < θ_a + 2 √(2 log(t−1)/s′)
Hoeffding's inequality yields P[A] ≤ t^{−4} and P[B] ≤ t^{−4}.
For ℓ = 8 log(T)/(θ* − θ_a)² and s′ ≥ ℓ, the event C cannot happen. We
conclude by:
E[N_a(T)] ≤ 8 log(T)/(θ* − θ_a)² + ∑_{t≥1} ∑_{s,s′=1}^t 2 t^{−4}
          ≤ 8 log(T)/(θ* − θ_a)² + π²/3
Algorithms
KL-UCB algorithm:
b_a(t) = max{ q ≤ 1 : N_a(t) KL(θ_a(t), q) ≤ f(t) }
where f(t) = log(t) + 3 log log(t) is the confidence level.
In each round t, select the arm with the highest index b_a(t)
Under KL-UCB, the number of times a ≠ a* is selected satisfies: for all
δ < θ_{a*} − θ_a,
E[N_a(T)] ≤ log(T) / KL(θ_a + δ, θ_{a*}) + C log log(T) + δ^{−2}
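Since q ↦ KL(θ_a(t), q) is increasing for q ≥ θ_a(t), the KL-UCB index can be computed by bisection; a short illustrative sketch (the function names are ours):

import math

def kl_bernoulli(a, b, eps=1e-12):
    a = min(max(a, eps), 1 - eps)
    b = min(max(b, eps), 1 - eps)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def klucb_index(mean, count, t, iters=30):
    """b_a(t) = max{ q <= 1 : N_a(t) * KL(theta_a(t), q) <= f(t) }, by bisection."""
    f_t = max(0.0, math.log(max(t, 2)) + 3 * math.log(math.log(max(t, 2))))
    level = f_t / count
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= level:
            lo = mid      # mid still satisfies the constraint: move the lower end up
        else:
            hi = mid
    return lo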
Lecture 5: Outline
1. Classifying bandit problems
2. Regret lower bounds
3. Algorithms based on the "optimism in the face of uncertainty" principle
4. Thompson Sampling algorithm
5. Structured bandits
Algorithms
Bayesian framework: put a prior distribution on the parameters θ
Example: Bernoulli rewards with a uniform prior on [0, 1]; we have observed
p successes ('1') and q failures ('0'). Then the posterior is θ ∼ Beta(p + 1, q + 1),
i.e., its density is proportional to θ^p (1 − θ)^q.
Thompson Sampling algorithm: Assume that at round t, arm a has had
p_a(t) successes and q_a(t) failures. Sample b_a(t) ∼ Beta(p_a(t) + 1, q_a(t) + 1).
The algorithm selects the arm a with the highest b_a(t).
Under Thompson Sampling, for any suboptimal arm a, we have:
lim sup_{T→∞} E[N_a(T)] / log(T) = 1 / KL(θ_a, θ_{a*})
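A minimal sketch of Beta-Bernoulli Thompson Sampling as described above (illustrative names; the select/update interface matches the earlier sketches):

import numpy as np

class ThompsonSampling:
    """Sample b_a(t) ~ Beta(p_a(t)+1, q_a(t)+1) for each arm and play the argmax."""
    def __init__(self, K, seed=0):
        self.succ = np.zeros(K)   # p_a(t): observed successes of arm a
        self.fail = np.zeros(K)   # q_a(t): observed failures of arm a
        self.rng = np.random.default_rng(seed)

    def select(self, t=None):
        samples = self.rng.beta(self.succ + 1, self.fail + 1)
        return int(np.argmax(samples))

    def update(self, a, r):
        if r > 0.5:               # Bernoulli reward: 1 = success, 0 = failure
            self.succ[a] += 1
        else:
            self.fail[a] += 1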
Illustration: UCB vs. KL-UCB
Illustration: Thompson Sampling
Performance
Lecture 5: Outline
1. Classifying bandit problems
2. Regret lower bounds
3. Algorithms based on the "optimism in the face of uncertainty" principle
4. Thompson Sampling algorithm
5. Structured bandits
Structured bandits
• Unstructured bandits: best possible regret ∼ K log(T)
• How to exploit a known structure to speed up the learning process?
• Structure: the decision maker knows that θ ∈ Θ, where Θ is not a
hyper-rectangle
• Examples: the average reward is a (convex, unimodal, Lipschitz, ...)
function of the arm, e.g.
Θ = {θ : a ↦ θ_a is unimodal}
Regret lower bound
Theorem (Application of Graves-Lai 1997)
For any uniformly good algorithm π:
lim inf_{T→∞} R^π(T) / log(T) ≥ c(θ)
where c(θ) is the value of the optimisation problem:
inf_{n_a ≥ 0, a ≠ a*}  ∑_{a ≠ a*} n_a (θ_{a*} − θ_a)
s.t.  inf_{λ ∈ B(θ)} ∑_{a ≠ a*} n_a KL(θ_a, λ_a) ≥ 1,
and B(θ) = {λ ∈ Θ : a* is not optimal under λ, λ_{a*} = θ_{a*}}.
Graphically unimodal bandits
Arms = vertices of a known graph G
The average rewards are G-unimodal: from any vertex, there is a path in
the graph to a* along which the average rewards increase. Notation: θ ∈ Θ_G
G-unimodal bandits: Regret lower bound
N(k): set of neighbours of k in G
Theorem. For any uniformly good algorithm π:
lim inf_{T→∞} R^π(T) / log(T) ≥ c_G(θ) = ∑_{k ∈ N(a*)} (θ_{a*} − θ_k) / KL(θ_k, θ_{a*})
G-unimodal bandits: Optimal algorithm
The algorithm is defined through the maximum degree γ of G, the empirical
means θ_a(t), the leader L(t) (the arm with the highest empirical mean), the
number of times l_a(t) each arm a has been the leader up to t, and the KL-UCB
indexes
b_a(t) = sup{ q : N_a(t) KL(θ_a(t), q) ≤ log(l_{L(t)}(t)) }
Algorithm OAS (Optimal Action Sampling)
1. For t = 1, . . . , K, select action a(t) = t (each action once)
2. For t ≥ K + 1, select action
a(t) = L(t) if (l_{L(t)}(t) − 1)/(γ + 1) ∈ ℕ, and
a(t) = arg max_{k ∈ N(L(t))} b_k(t) otherwise
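A rough illustrative Python sketch of OAS under these definitions, with the graph given as an adjacency list. The helper names are ours, the KL-UCB index is computed by bisection as in the earlier sketch, and, exactly as stated above, exploration is restricted to the neighbours of the current leader.

import math
import numpy as np

def kl_bernoulli(a, b, eps=1e-12):
    a = min(max(a, eps), 1 - eps); b = min(max(b, eps), 1 - eps)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def klucb(mean, count, level, iters=30):
    # max{ q : count * KL(mean, q) <= level } by bisection
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if count * kl_bernoulli(mean, mid) <= level else (lo, mid)
    return lo

class OAS:
    """Optimal Action Sampling on a graph G given as an adjacency list nbrs."""
    def __init__(self, nbrs):
        self.nbrs = nbrs                          # nbrs[a] = list of neighbours of a in G
        self.K = len(nbrs)
        self.gamma = max(len(n) for n in nbrs)    # maximum degree of G
        self.means = np.zeros(self.K)             # empirical means theta_a(t)
        self.counts = np.zeros(self.K)            # N_a(t)
        self.leader_counts = np.zeros(self.K)     # l_a(t)
        self.t = 0

    def select(self, t=None):
        self.t += 1
        if self.t <= self.K:                      # play each action once
            return self.t - 1
        L = int(np.argmax(self.means))            # current leader
        self.leader_counts[L] += 1
        lL = self.leader_counts[L]
        if (lL - 1) % (self.gamma + 1) == 0:      # exploit the leader on a fixed fraction of rounds
            return L
        cands = self.nbrs[L]
        level = math.log(max(lL, 1.0))
        idx = [klucb(self.means[k], self.counts[k], level) for k in cands]
        return cands[int(np.argmax(idx))]

    def update(self, a, r):
        self.counts[a] += 1
        self.means[a] += (r - self.means[a]) / self.counts[a]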
G-unimodal bandits: OAS optimality
Theorem. For any θ ∈ Θ_G,
lim sup_{T→∞} R^{OAS}(T) / log(T) ≤ c_G(θ)
Rate adaptation in 802.11
Goal: adapt the modulation scheme to the channel quality
802.11 a/b/g: select the rate only

rate                  r_1   r_2   . . .   r_K
success probability   µ_1   µ_2   . . .   µ_K
throughput            θ_1   θ_2   . . .   θ_K   (θ_k = r_k µ_k)

Structure: unimodality of k ↦ θ_k, and µ_1 ≥ µ_2 ≥ . . . ≥ µ_K
Rate adaptation in 802.11
802.11 n/ac: select the rate and a MIMO mode
Structure: example with two modes, unimodality w.r.t. a graph G
802.11g – stationary channels
Smooth throughput decay w.r.t. rate
802.11g – stationary channels
Steep throughput decay w.r.t. rate
802.11g – nonstationary channels
Traces
802.11g – nonstationary channels
Performance of OAS with sliding window
References
Discrete unstructured bandits
• Thompson, On the likelihood that one unknown probability exceeds
another in view of the evidence of two samples, 1933
• Robbins, Some aspects of the sequential design of experiments, 1952
• Lai and Robbins. Asymptotically efficient adaptive allocation rules,
1985
• Lai. Adaptive treatment allocation and the multi-armed bandit
problem, 1987
• Gittins, Bandit Processes and Dynamic Allocation Indices, 1989
• Auer, Cesa-Bianchi and Fischer, Finite-time analysis of the
multiarmed bandit problem, 2002
• Garivier and Moulines, On upper-confidence bound policies for
non-stationary bandit problems, 2008
• Slivkins and Upfal, Adapting to a changing environment: the
Brownian restless bandits, 2008
References
• Garivier and Cappé, The KL-UCB algorithm for bounded stochastic
bandits and beyond, 2011
• Honda and Takemura, An Asymptotically Optimal Bandit Algorithm
for Bounded Support Models, 2010
Discrete structured bandits
• Anantharam, Varaiya, and Walrand, Asymptotically efficient
allocation rules for the multiarmed bandit problem with multiple
plays, 1987
• Graves and Lai, Asymptotically efficient adaptive choice of control
laws in controlled Markov chains, 1997
• György, Linder, Lugosi and Ottucsák, The on-line shortest path
problem under partial monitoring, 2007
• Yu and Mannor, Unimodal bandits, 2011
• Cesa-Bianchi and Lugosi, Combinatorial bandits, 2012
References
• Chen, Wang and Yuan. Combinatorial multi-armed bandit: General
framework and applications, 2013
• Combes and Proutiere, Unimodal bandits: Regret lower bounds and
optimal algorithms, 2014
• Magureanu, Combes, and Proutiere. Lipschitz bandits: Regret lower
bounds and optimal algorithms, 2014
Thompson sampling
• Chapelle and Li, An Empirical Evaluation of Thompson Sampling,
2011
• Korda, Kaufmann and Munos, Thompson Sampling: an
asymptotically optimal finite-time analysis, 2012
• Korda, Kaufmann and Munos, Thompson Sampling for
one-dimensional exponential family bandits, 2013
• Agrawal and Goyal, Further optimal regret bounds for Thompson
Sampling, 2013.
Lecture 5: Appendix
Adversarial bandits
Adversarial Optimisation
• Finite set of actions A = {1, . . . , K}
• Unknown and arbitrary rewards of action a ∈ A: (r_t(a), t ≥ 1),
decided by an adversary at time 0
• Best empirical action a* ∈ arg max_a ∑_{t=1}^T r_t(a)
• Online policy π: select action a^π_t at time t depending on:
- Expert setting: (r_1(a), . . . , r_{t−1}(a))_{a=1,...,K}
- Bandit setting: a^π_1, r_1(a^π_1), . . . , a^π_{t−1}, r_{t−1}(a^π_{t−1})
• Regret up to time T: R^π(T) = ∑_{t=1}^T r_t(a*) − E[∑_{t=1}^T r_t(a^π_t)]
Expert Setting
• Let S_t(a) = ∑_{n=1}^t r_n(a) be the cumulative reward of action a
• Multiplicative update algorithm (Littlestone-Warmuth 1994):
select arm a with probability
p_t(a) = e^{η S_{t−1}(a)} / ∑_b e^{η S_{t−1}(b)}
• Regret: the multiplicative update algorithm π has sublinear (vanishing
per-round) regret:
∀T, R^π(T) ≤ Tη/8 + log(K)/η
For η = √(8 log(K)/T), R^π(T) ≤ √(T log(K)/2)
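A minimal sketch of the multiplicative update rule in the expert setting (illustrative; rewards is assumed to be a T × K numpy array with entries in [0, 1], one row revealed in full after each round):

import numpy as np

def multiplicative_update(rewards, eta, seed=0):
    """Expert setting: p_t(a) proportional to exp(eta * S_{t-1}(a)), full feedback."""
    T, K = rewards.shape
    rng = np.random.default_rng(seed)
    S = np.zeros(K)                      # cumulative rewards S_t(a)
    total = 0.0
    for t in range(T):
        w = np.exp(eta * (S - S.max()))  # subtract the max for numerical stability
        p = w / w.sum()
        a = rng.choice(K, p=p)
        total += rewards[t, a]
        S += rewards[t]                  # all rewards of round t are observed
    return rewards.sum(axis=0).max() - total   # regret against the best fixed action

# eta = sqrt(8 * log(K) / T) gives expected regret at most sqrt(T * log(K) / 2)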
Bandit Setting
• Let S_t(a) = ∑_{n=1}^t r_n(a) be the cumulative reward of action a
• Build an unbiased estimator of S_t(a): Ŝ_t(a) = ∑_{n=1}^t X_n(a), where
X_n(a) = 1_{a^π(n)=a} r_n(a) / p_n(a)
• Multiplicative update algorithm (Littlestone-Warmuth 1994):
select arm a with probability
p_t(a) = e^{η Ŝ_{t−1}(a)} / ∑_b e^{η Ŝ_{t−1}(b)}
• Regret: the multiplicative update algorithm π has sublinear regret:
∀T, R^π(T) ≤ TKη/2 + log(K)/η
For η = √(2 log(K)/(KT)), R^π(T) ≤ √(2K log(K) T)
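A minimal sketch of the bandit version: the same multiplicative update, run on the importance-weighted estimates Ŝ_t(a). As on the slide, no extra uniform exploration term is added; the function name and the reward_fn callback are illustrative.

import numpy as np

def exp_weights_bandit(reward_fn, T, K, eta, seed=0):
    """Bandit setting: only the chosen arm's reward is observed each round."""
    rng = np.random.default_rng(seed)
    S_hat = np.zeros(K)                      # estimated cumulative rewards
    total = 0.0
    for t in range(T):
        w = np.exp(eta * (S_hat - S_hat.max()))
        p = w / w.sum()
        a = rng.choice(K, p=p)
        r = reward_fn(t, a)                  # bandit feedback, assumed in [0, 1]
        S_hat[a] += r / p[a]                 # X_t(a) = 1{a_t = a} * r_t(a) / p_t(a)
        total += r
    return total

# eta = sqrt(2 * log(K) / (K * T)) gives expected regret at most sqrt(2 * K * log(K) * T)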
Adversarial Convex Bandits
At the beginning of each year, Volvo has to select a vector x (in a convex
set) representing the relative efforts in producing various models (S60,
V70, V90, . . .). The reward is an arbitrarily varying and unknown
concave function of x. How to maximise reward over say 50 years?
Adversarial Convex Bandits
• Continuous set of actions A = [0, 1]
• (Unknown) arbitrary but concave rewards of action x ∈ A: r_t(x)
• Online policy π: select action x^π_t at time t depending on
x^π_1, r_1(x^π_1), . . . , x^π_{t−1}, r_{t−1}(x^π_{t−1})
• Regret up to time T (defined w.r.t. the best fixed action in hindsight):
R^π(T) = max_{x ∈ [0,1]} ∑_{t=1}^T r_t(x) − ∑_{t=1}^T r_t(x^π_t)
Can we do something smart at all? Achieve sublinear regret?
Adversarial Convex Bandits
• If r_t(·) = r(·) for all t and r(·) were known, we could apply a gradient
ascent algorithm
• One-point gradient estimate: define the smoothed function
f̂(x) = E_{v ∈ B}[f(x + δv)], B = {x : ‖x‖₂ ≤ 1}
Then E_{u ∈ S}[f(x + δu) u] = δ ∇f̂(x), S = {x : ‖x‖₂ = 1}
• Simulated Gradient Ascent algorithm: at each step t,
- draw u_t uniformly at random in S
- play y_t = x_t + δ u_t and observe r_t(y_t)
- update x_{t+1} = x_t + α r_t(y_t) u_t
• Regret: R(T) = O(T^{5/6})
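A minimal illustrative sketch of this simulated gradient ascent on A = [0, 1], where the unit sphere is S = {−1, +1}. A clipping step (not written on the slide) keeps the played point inside [0, 1], and δ and α would have to be tuned as functions of T to obtain the O(T^{5/6}) regret.

import numpy as np

def simulated_gradient_ascent(reward_fn, T, delta=0.05, alpha=0.01, seed=0):
    """One-point gradient estimate: play y_t = x_t + delta*u_t, move along r_t(y_t)*u_t."""
    rng = np.random.default_rng(seed)
    x, total = 0.5, 0.0
    for t in range(T):
        u = rng.choice([-1.0, 1.0])              # uniform on the 1-d unit sphere S
        y = x + delta * u                        # perturbed action actually played
        r = reward_fn(t, y)                      # bandit feedback only
        x = float(np.clip(x + alpha * r * u, delta, 1 - delta))
        total += r
    return total

# Example with a fixed concave reward (illustrative):
# simulated_gradient_ascent(lambda t, y: 1 - (y - 0.7) ** 2, T=10_000)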