Reinforcement Learning
Lecture 5: Bandit optimisation
Alexandre Proutiere, Sadegh Talebi, Jungseul Ok
KTH, The Royal Institute of Technology
Objectives of this lecture
Introduce bandit optimisation: the most basic RL problem without
dynamics. Optimising the exploration vs exploitation trade-off.
• Regret lower bounds
• Algorithms based on the "optimism in the face of uncertainty" principle
• Thompson Sampling algorithm
• Structured bandits
Lecture 5: Outline
1. Classifying bandit problems
2. Regret lower bounds
3. Algorithms based on the "optimism in the face of uncertainty" principle
4. Thompson Sampling algorithm
5. Structured bandits
Bandit Optimisation
• Interact with an i.i.d. or adversarial environment
• Set of available actions A with unknown sequences of rewards
r_t(a), t = 1, 2, . . .
• The reward is the only feedback – bandit feedback
• Stochastic vs. adversarial bandits
- i.i.d. environment: r_t(a) is a random variable with mean θ_a
- adversarial environment: r_t(a) is arbitrary!
• Objective: develop an action selection rule π maximising the
expected cumulative reward up to step T
Remark: π must select an action depending on the entire history of
observations!
Regret
• Difference between the cumulative reward of an "Oracle" policy and
that of the agent π
• Regret quantifies the price to pay for learning
• Exploration vs. exploitation trade-off: we need to probe all actions
in order to identify the best one and play it later
Applications
Clinical trial, Thompson 1933
- Two available treatments with unknown cure probabilities (outcome:
'Live' or 'Die')
- Bandit feedback: after administering the treatment to a patient, we
observe whether she survives or dies
- Goal: design a treatment selection scheme π maximising the number
of patients cured after treatment
Applications
Rate adaptation in 802.11 wireless systems
- The AP sequentially sends packets to the receiver and has K
available encoding rates r_1 < r_2 < . . . < r_K
- The probability θ_k that a packet sent at rate r_k is received is unknown
- Goal: design a rate selection scheme that learns the θ_k's and quickly
converges to the rate r_{k*} maximising µ_k = r_k θ_k over k
Applications
Search engines
- The engine should list relevant webpages depending on the request
’jaguar’
- The CTRs (Click-Through Rates) are unknown
- Goal: design a list selection scheme that learns the list maximising
its global CTR
Bandit Taxonomy
• Stochastic bandits: the sequence of rewards (r_t(a), a ∈ A)_{t≥1} is
generated according to an i.i.d. process – the average rewards are
unknown
• Adversarial bandits: arbitrary sequence of rewards
Most bandit problems in engineering are stochastic ...
Stochastic Bandit Taxonomy
Unstructured problems: the average rewards are not related:
θ = (θ_1, . . . , θ_K) ∈ Θ = ∏_k [a_k, b_k]
Structured problems: the decision maker knows that the average rewards
are related; she knows Θ. The rewards observed for a given arm provide
side information about the other arms:
θ = (θ_1, . . . , θ_K) ∈ Θ, where Θ is not a hyper-rectangle
Stochastic Bandit Taxonomy
Lecture 5: Outline
1. Classifying bandit problems
2. Regret lower bounds
3. Algorithms based on the "optimism in the face of uncertainty" principle
4. Thompson Sampling algorithm
5. Structured bandits
Unstructured Stochastic Bandits – Robbins 1952
• Finite set of actions A = {1, . . . , K}
• (Unknown) rewards of action a ∈ A: (r_t(a), t ≥ 1) i.i.d. Bernoulli
with E[r_t(a)] = θ_a, θ ∈ Θ = [0, 1]^K
• Optimal action a* ∈ arg max_a θ_a
• Online policy π: select action a^π_t at time t depending on
a^π_1, r_1(a^π_1), . . . , a^π_{t−1}, r_{t−1}(a^π_{t−1})
• Regret up to time T: R^π(T) = T θ_{a*} − ∑_{t=1}^T θ_{a^π_t}
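To make the interaction loop concrete, here is a minimal Python sketch (not part of the original slides) of a Bernoulli bandit environment together with the pseudo-regret T θ_{a*} − ∑_t θ_{a_t}. The class names and the select/update policy interface are illustrative choices; the same interface is reused in the algorithm sketches further below.

import numpy as np

class BernoulliBandit:
    """Stochastic bandit with K arms and Bernoulli rewards of means theta."""
    def __init__(self, theta, seed=0):
        self.theta = np.asarray(theta, dtype=float)
        self.rng = np.random.default_rng(seed)

    def pull(self, a):
        # Bernoulli reward with mean theta[a]
        return float(self.rng.random() < self.theta[a])

def run(env, policy, T):
    """Run a policy for T rounds and return its pseudo-regret."""
    best = env.theta.max()
    regret = 0.0
    for t in range(T):
        a = policy.select(t)           # the action may depend on the whole history
        r = env.pull(a)                # bandit feedback: only the chosen arm's reward
        policy.update(a, r)
        regret += best - env.theta[a]  # T*theta_star - sum_t theta_{a_t}
    return regret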
Problem-specific Regret Lower Bound
Uniformly good algorithms: An algorithm π is uniformly good if, for all
θ ∈ Θ and any sub-optimal arm a, the number of times N_a(t) arm a is
selected up to round t satisfies E[N_a(t)] = o(t^α) for all α > 0.
Theorem (Lai-Robbins 1985)
For any uniformly good algorithm π:
lim inf_{T→∞} R^π(T) / log(T) ≥ ∑_{a ≠ a*} (θ_{a*} − θ_a) / KL(θ_a, θ_{a*})
where KL(a, b) = a log(a/b) + (1 − a) log((1 − a)/(1 − b)) is the KL divergence
between Bernoulli distributions of means a and b
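The constant on the right-hand side is easy to evaluate for a given Bernoulli instance; a small illustrative Python sketch (the function names are ours, not from the lecture):

import math

def kl_bernoulli(a, b, eps=1e-12):
    # KL divergence between Bernoulli(a) and Bernoulli(b), with clipping to avoid log(0)
    a = min(max(a, eps), 1 - eps)
    b = min(max(b, eps), 1 - eps)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def lai_robbins_constant(theta):
    # Coefficient of log(T) in the Lai-Robbins lower bound
    best = max(theta)
    return sum((best - th) / kl_bernoulli(th, best) for th in theta if th < best)

print(lai_robbins_constant([0.9, 0.8, 0.5]))  # asymptotic regret / log(T) for this instance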
Minimax Regret Lower Bound
Theorem (Auer et al. 2002)
For any T, we can find a problem (depending on T) such that for any
algorithm π,
R^π(T) ≥ √(KT) (1 − 1/K).
Unified proofs
• Change-of-measure: θ → ν
• Log-likelihood ratio L: E_θ[L] = ∑_j E_θ[N_j(T)] KL(θ_j, ν_j)
• Data processing inequality. For any event A ∈ F_T,
P_ν(A) = E_θ[exp(−L) 1_A]. Jensen's inequality yields:
P_ν(A) ≥ exp(−E_θ[L | A]) P_θ(A)
P_ν(A^c) ≥ exp(−E_θ[L | A^c]) P_θ(A^c)
Hence E_θ[L] ≥ KL(P_θ(A), P_ν(A))
• Data processing inequality v2. For all F_T-measurable Z with values in [0, 1],
E_θ[L] ≥ KL(E_θ(Z), E_ν(Z))
Proof – problem-specific lower bound
Change-of-measure: θ → ν with ν_j = θ_j for all j ≠ a, and ν_a = θ_{a*} + ε.
E_θ[L] = E_θ[N_a(T)] KL(θ_a, θ_{a*} + ε) ≥ KL(P_θ(A), P_ν(A))
Select A = {N_{a*}(T) ≤ T − √T}. Markov's inequality yields (for uniformly
good algorithms): lim_{T→∞} P_θ[A] = 0 = lim_{T→∞} P_ν[A^c], hence
lim inf_{T→∞} E_θ[N_a(T)] / log(T) ≥ 1 / KL(θ_a, θ_{a*} + ε)
Proof – minimax lower bound
Change-of-measure: θ_a = 1/2 for all a. Then there exists an arm a such that
E_θ[N_a(T)] ≤ T/K. Set ν_i = θ_i for all i ≠ a, and ν_a = 1/2 + ε.
E_θ[L] = E_θ[N_a(T)] KL(1/2, 1/2 + ε) ≥ KL(E_θ(Z), E_ν(Z))
where KL(1/2, 1/2 + ε) = (1/2) log(1/(1 − 4ε²)). Select Z = N_a(T)/T.
Pinsker's inequality yields:
(E_θ[N_a(T)]/2) log(1/(1 − 4ε²)) ≥ 2 (E_ν[N_a(T)]/T − E_θ[N_a(T)]/T)²
Hence, with E_θ[N_a(T)] ≤ T/K,
E_ν[N_a(T)]/T ≤ 1/K + (1/2) √((T/K) log(1/(1 − 4ε²)))
Proof – minimax lower bound
Now R_ν(T) = Tε (1 − E_ν[N_a(T)]/T), and we conclude, choosing ε = √(K/T),
that:
R_ν(T) ≥ √(KT) (1 − 1/K).
Lecture 5: Outline
1. Classifying bandit problems
2. Regret lower bounds
3. Algorithms based on the "optimism in the face of uncertainty"
principle
4. Thompson Sampling algorithm
5. Structured bandits
Concentration
The main tools in the design and analysis of algorithms for stochastic
bandits are concentration-of-measure results.
Let X_1, X_2, . . . be i.i.d. real-valued random variables with mean µ and with
log-moment generating function G(λ) = log(E[e^{λ(X_n − µ)}]). Let S_n = ∑_{i=1}^n X_i.
• Strong law of large numbers: P[lim_{n→∞} S_n/n = µ] = 1
• Concentration inequality: for δ, λ > 0,
P[S_n − nµ ≥ δ] = P[e^{λ(S_n − nµ)} ≥ e^{λδ}]
≤ e^{−λδ} E[e^{λ(S_n − nµ)}]
= e^{−λδ} ∏_{i=1}^n E[e^{λ(X_i − µ)}]
= e^{nG(λ) − λδ}
Optimising over λ: P[S_n − nµ ≥ δ] ≤ e^{−sup_{λ>0}(λδ − nG(λ))}
Concentration
P[S_n − nµ ≥ δ] ≤ e^{−sup_{λ>0}(λδ − nG(λ))}
• Bounded r.v. X_n ∈ [a, b]: G(λ) ≤ λ²(b − a)²/8
Hoeffding's inequality:
P[S_n − nµ ≥ δ] ≤ e^{−2δ²/(n(b−a)²)}
• Sub-Gaussian r.v.: G(λ) ≤ σ²λ²/2
• Bernoulli r.v.: G(λ) = log(µ e^{λ(1−µ)} + (1 − µ) e^{−λµ})
Chernoff's inequality:
P[S_n − nµ ≥ δ] ≤ e^{−n KL(µ + δ/n, µ)}
where KL(a, b) = a log(a/b) + (1 − a) log((1 − a)/(1 − b)) (KL divergence)
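As a quick, purely illustrative sanity check of Hoeffding's bound for Bernoulli variables (the parameter values below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
mu, n, delta, trials = 0.3, 200, 15.0, 100_000

# Empirical frequency of the deviation event {S_n - n*mu >= delta}
samples = rng.random((trials, n)) < mu
freq = np.mean(samples.sum(axis=1) - n * mu >= delta)

bound = np.exp(-2 * delta**2 / n)   # Hoeffding with [a, b] = [0, 1]
print(freq, bound)                  # the empirical frequency should lie below the bound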
Algorithms
Estimating the average reward of arm a:
θ_a(t) = (1 / N_a(t)) ∑_{n=1}^t r_n(a) 1_{a(n) = a}
• ε-greedy. In each round t:
- with probability 1 − ε, select the best empirical arm
a*(t) ∈ arg max_a θ_a(t)
- with probability ε, select an arm uniformly at random
The algorithm has linear regret (not uniformly good)
Algorithms
• ε_t-greedy. In each round t:
- with probability 1 − ε_t, select the best empirical arm
a*(t) ∈ arg max_a θ_a(t)
- with probability ε_t, select an arm uniformly at random
The algorithm has logarithmic regret for Bernoulli rewards and
ε_t = min(1, K/(t δ²)), where δ = min_{a ≠ a*} (θ_{a*} − θ_a)
Sketch of proof. For a ≠ a* to be selected in round t, we need (most
often) θ_a(t) ≥ θ_a + δ. By Hoeffding's inequality, the probability that this
occurs is less than exp(−2δ² N_a(t)). But N_a(t) is of order log(t)/δ².
Summing over t yields the result.
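A minimal sketch of ε_t-greedy with the schedule above; the class name and the select/update interface are the illustrative ones introduced earlier, and the constant c is an assumed tuning knob (c = 1 matches the slide). Plain ε-greedy is recovered by keeping ε_t constant.

import numpy as np

class EpsGreedy:
    """epsilon_t-greedy: explore uniformly w.p. eps_t, otherwise play the empirical best arm."""
    def __init__(self, K, delta, c=1.0, seed=0):
        self.K, self.delta, self.c = K, delta, c
        self.means = np.zeros(K)     # empirical means theta_a(t)
        self.counts = np.zeros(K)    # N_a(t)
        self.rng = np.random.default_rng(seed)

    def select(self, t):
        eps_t = min(1.0, self.c * self.K / (max(t, 1) * self.delta ** 2))
        if self.rng.random() < eps_t:
            return int(self.rng.integers(self.K))   # exploration
        return int(np.argmax(self.means))           # exploitation

    def update(self, a, r):
        self.counts[a] += 1
        self.means[a] += (r - self.means[a]) / self.counts[a]

Note that the schedule requires knowing the minimal gap δ, which is rarely available in practice.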
Algorithms
Optimism in the Face of Uncertainty
Upper Confidence Bound (UCB) algorithm:
b_a(t) = θ_a(t) + √(2 log(t) / N_a(t))
θ_a(t): empirical reward of a up to t
N_a(t): number of times a has been played up to t
In each round t, select the arm with the highest index b_a(t)
Under UCB, the number of times a ≠ a* is selected satisfies:
E[N_a(T)] ≤ 8 log(T) / (θ_{a*} − θ_a)² + π²/3
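A minimal sketch of UCB with the index above (illustrative; each arm is played once before the index is used):

import numpy as np

class UCB1:
    """UCB index b_a(t) = theta_a(t) + sqrt(2 log t / N_a(t))."""
    def __init__(self, K):
        self.K = K
        self.means = np.zeros(K)
        self.counts = np.zeros(K)
        self.t = 0

    def select(self, t=None):
        self.t += 1
        if self.t <= self.K:                 # initialisation: play each arm once
            return self.t - 1
        bonus = np.sqrt(2.0 * np.log(self.t) / self.counts)
        return int(np.argmax(self.means + bonus))

    def update(self, a, r):
        self.counts[a] += 1
        self.means[a] += (r - self.means[a]) / self.counts[a]

With the earlier run() sketch, run(BernoulliBandit([0.9, 0.8, 0.5]), UCB1(3), T=10_000) should exhibit regret growing logarithmically in T.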
Regret analysis of UCB (Auer et al. 2002)
Write ℓ for a threshold to be chosen later, θ*_s for the empirical mean of a*
after s plays, and θ_{a,s} for the empirical mean of a after s plays. Then
N_a(T) = 1 + ∑_{t=K+1}^T 1{a(t) = a}
       ≤ ℓ + ∑_{t=K+1}^T 1{a(t) = a, N_a(t) ≥ ℓ}
       ≤ ℓ + ∑_{t=K+1}^T 1{N_a(t) ≥ ℓ} 1{b_{a*}(t − 1) ≤ b_a(t − 1)}
       ≤ ℓ + ∑_{t=K+1}^T 1{ min_{s<t} (θ*_s + √(2 log(t−1)/s)) ≤ max_{ℓ≤s′<t} (θ_{a,s′} + √(2 log(t−1)/s′)) }
       ≤ ℓ + ∑_{t=K+1}^T ∑_{s<t, ℓ≤s′<t} 1{ θ*_s + √(2 log(t−1)/s) ≤ θ_{a,s′} + √(2 log(t−1)/s′) }
Regret analysis of UCB (Auer et al. 2002)
θ*_s + √(2 log(t−1)/s) ≤ θ_{a,s′} + √(2 log(t−1)/s′) implies that at least one of
the following events holds:
A : θ*_s ≤ θ* − √(2 log(t−1)/s),  or  B : θ_{a,s′} ≥ θ_a + √(2 log(t−1)/s′),
or  C : θ* < θ_a + 2 √(2 log(t−1)/s′)
Hoeffding's inequality yields P[A] ≤ t^{−4} and P[B] ≤ t^{−4}.
For ℓ = 8 log(T)/(θ* − θ_a)² and s′ ≥ ℓ, the event C cannot happen. We
conclude by:
E[N_a(T)] ≤ 8 log(T)/(θ* − θ_a)² + ∑_{t≥1} ∑_{s,s′=1}^t 2 t^{−4}
          ≤ 8 log(T)/(θ* − θ_a)² + π²/3
Algorithms
KL-UCB algorithm:
b_a(t) = max{ q ≤ 1 : N_a(t) KL(θ_a(t), q) ≤ f(t) }
where f(t) = log(t) + 3 log log(t) is the confidence level.
In each round t, select the arm with the highest index b_a(t)
Under KL-UCB, the number of times a ≠ a* is selected satisfies: for all
δ < θ_{a*} − θ_a,
E[N_a(T)] ≤ log(T) / KL(θ_a + δ, θ_{a*}) + C log log(T) + δ^{−2}
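Since q ↦ KL(θ_a(t), q) is increasing for q ≥ θ_a(t), the KL-UCB index can be computed by bisection; a short illustrative sketch (the function names are ours):

import math

def kl_bernoulli(a, b, eps=1e-12):
    a = min(max(a, eps), 1 - eps)
    b = min(max(b, eps), 1 - eps)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def klucb_index(mean, count, t, iters=30):
    """b_a(t) = max{ q <= 1 : N_a(t) * KL(theta_a(t), q) <= f(t) }, by bisection."""
    f_t = max(0.0, math.log(max(t, 2)) + 3 * math.log(math.log(max(t, 2))))
    level = f_t / count
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl_bernoulli(mean, mid) <= level:
            lo = mid      # mid still satisfies the constraint: move the lower end up
        else:
            hi = mid
    return lo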
Lecture 5: Outline
1. Classifying bandit problems
2. Regret lower bounds
3. Algorithms based on the "optimism in the face of uncertainty" principle
4. Thompson Sampling algorithm
5. Structured bandits
Algorithms
Bayesian framework: put a prior distribution on the parameters θ
Example: Bernoulli rewards with a uniform prior on [0, 1]; we have observed
p successes ('1') and q failures ('0'). Then the posterior is θ ∼ Beta(p + 1, q + 1),
i.e., its density is proportional to θ^p (1 − θ)^q.
Thompson Sampling algorithm: Assume that at round t, arm a has had
p_a(t) successes and q_a(t) failures. Sample b_a(t) ∼ Beta(p_a(t) + 1, q_a(t) + 1).
The algorithm selects the arm a with the highest b_a(t).
Under Thompson Sampling, for any suboptimal arm a, we have:
lim sup_{T→∞} E[N_a(T)] / log(T) = 1 / KL(θ_a, θ_{a*})
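A minimal sketch of Beta-Bernoulli Thompson Sampling as described above (illustrative names; the select/update interface matches the earlier sketches):

import numpy as np

class ThompsonSampling:
    """Sample b_a(t) ~ Beta(p_a(t)+1, q_a(t)+1) for each arm and play the argmax."""
    def __init__(self, K, seed=0):
        self.succ = np.zeros(K)   # p_a(t): observed successes of arm a
        self.fail = np.zeros(K)   # q_a(t): observed failures of arm a
        self.rng = np.random.default_rng(seed)

    def select(self, t=None):
        samples = self.rng.beta(self.succ + 1, self.fail + 1)
        return int(np.argmax(samples))

    def update(self, a, r):
        if r > 0.5:               # Bernoulli reward: 1 = success, 0 = failure
            self.succ[a] += 1
        else:
            self.fail[a] += 1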
Illustration: UCB vs. KL-UCB
Illustration: Thompson Sampling
Performance
Lecture 5: Outline
1. Classifying bandit problems
2. Regret lower bounds
3. Algorithms based on the "optimism in the face of uncertainty" principle
4. Thompson Sampling algorithm
5. Structured bandits
Structured bandits
• Unstructured bandits: best possible regret ∼ K log(T)
• How to exploit a known structure to speed up the learning process?
• Structure: the decision maker knows that θ ∈ Θ, where Θ is not a
hyper-rectangle
• Examples: the average reward is a (convex, unimodal, Lipschitz, ...)
function of the arm, e.g.
Θ = {θ : a ↦ θ_a is unimodal}
Regret lower bound
Theorem (Application of Graves-Lai 1997)
For any uniformly good algorithm π:
lim inf_{T→∞} R^π(T) / log(T) ≥ c(θ)
where c(θ) is the value of the optimisation problem:
inf_{n_a ≥ 0, a ≠ a*}  ∑_{a ≠ a*} n_a (θ_{a*} − θ_a)
s.t.  inf_{λ ∈ B(θ)} ∑_{a ≠ a*} n_a KL(θ_a, λ_a) ≥ 1,
and B(θ) = {λ ∈ Θ : a* is not optimal under λ, λ_{a*} = θ_{a*}}.
Graphically unimodal bandits
Arms = vertices of a known graph G
The average rewards are G-unimodal: from any vertex, there is a path in
the graph to a* along which the average rewards increase. Notation: θ ∈ Θ_G
G-unimodal bandits: Regret lower bound
N(k): set of neighbours of k in G
Theorem. For any uniformly good algorithm π:
lim inf_{T→∞} R^π(T) / log(T) ≥ c_G(θ) = ∑_{k ∈ N(a*)} (θ_{a*} − θ_k) / KL(θ_k, θ_{a*})
G-unimodal bandits: Optimal algorithm
The algorithm is defined through the maximum degree γ of G, the empirical
means θ_a(t), the leader L(t) (the arm with the highest empirical mean), the
number of times l_a(t) each arm a has been the leader up to t, and the KL-UCB
indexes
b_a(t) = sup{ q : N_a(t) KL(θ_a(t), q) ≤ log(l_{L(t)}(t)) }
Algorithm OAS (Optimal Action Sampling)
1. For t = 1, . . . , K, select action a(t) = t (each action once)
2. For t ≥ K + 1, select action
a(t) = L(t) if (l_{L(t)}(t) − 1)/(γ + 1) ∈ ℕ, and
a(t) = arg max_{k ∈ N(L(t))} b_k(t) otherwise
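A rough illustrative Python sketch of OAS under these definitions, with the graph given as an adjacency list. The helper names are ours, the KL-UCB index is computed by bisection as in the earlier sketch, and, exactly as stated above, exploration is restricted to the neighbours of the current leader.

import math
import numpy as np

def kl_bernoulli(a, b, eps=1e-12):
    a = min(max(a, eps), 1 - eps); b = min(max(b, eps), 1 - eps)
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def klucb(mean, count, level, iters=30):
    # max{ q : count * KL(mean, q) <= level } by bisection
    lo, hi = mean, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if count * kl_bernoulli(mean, mid) <= level else (lo, mid)
    return lo

class OAS:
    """Optimal Action Sampling on a graph G given as an adjacency list nbrs."""
    def __init__(self, nbrs):
        self.nbrs = nbrs                          # nbrs[a] = list of neighbours of a in G
        self.K = len(nbrs)
        self.gamma = max(len(n) for n in nbrs)    # maximum degree of G
        self.means = np.zeros(self.K)             # empirical means theta_a(t)
        self.counts = np.zeros(self.K)            # N_a(t)
        self.leader_counts = np.zeros(self.K)     # l_a(t)
        self.t = 0

    def select(self, t=None):
        self.t += 1
        if self.t <= self.K:                      # play each action once
            return self.t - 1
        L = int(np.argmax(self.means))            # current leader
        self.leader_counts[L] += 1
        lL = self.leader_counts[L]
        if (lL - 1) % (self.gamma + 1) == 0:      # exploit the leader on a fixed fraction of rounds
            return L
        cands = self.nbrs[L]
        level = math.log(max(lL, 1.0))
        idx = [klucb(self.means[k], self.counts[k], level) for k in cands]
        return cands[int(np.argmax(idx))]

    def update(self, a, r):
        self.counts[a] += 1
        self.means[a] += (r - self.means[a]) / self.counts[a]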
G-unimodal bandits: OAS optimality
Theorem. For any θ ∈ Θ_G,
lim sup_{T→∞} R^{OAS}(T) / log(T) ≤ c_G(θ)
Rate adaptation in 802.11
Goal: adapt the modulation scheme to the channel quality
802.11 a/b/g: select the rate only

rate                  r_1   r_2   . . .   r_K
success probability   µ_1   µ_2   . . .   µ_K
throughput            θ_1   θ_2   . . .   θ_K   (θ_k = r_k µ_k)

Structure: unimodality of k ↦ θ_k, and µ_1 ≥ µ_2 ≥ . . . ≥ µ_K
Rate adaptation in 802.11
802.11 n/ac: select the rate and a MIMO mode
Structure: example with two modes, unimodality w.r.t. a graph G
802.11g – stationary channels
Smooth throughput decay w.r.t. rate
802.11g – stationary channels
Steep throughput decay w.r.t. rate
802.11g – nonstationary channels
Traces
802.11g – nonstationary channels
Performance of OAS with sliding window
References
Discrete unstructured bandits
• Thompson, On the likelihood that one unknown probability exceeds
another in view of the evidence of two samples, 1933
• Robbins, Some aspects of the sequential design of experiments, 1952
• Lai and Robbins. Asymptotically efficient adaptive allocation rules,
1985
• Lai. Adaptive treatment allocation and the multi-armed bandit
problem, 1987
• Gittins, Bandit Processes and Dynamic Allocation Indices, 1989
• Auer, Cesa-Bianchi and Fischer, Finite-time analysis of the
multiarmed bandit problem, 2002
• Garivier and Moulines, On upper-confidence bound policies for
non-stationary bandit problems, 2008
• Slivkins and Upfal, Adapting to a changing environment: the
Brownian restless bandits, 2008
References
• Garivier and Cappé, The KL-UCB algorithm for bounded stochastic
bandits and beyond, 2011
• Honda and Takemura, An Asymptotically Optimal Bandit Algorithm
for Bounded Support Models, 2010
Discrete structured bandits
• Anantharam, Varaiya, and Walrand, Asymptotically efficient
allocation rules for the multiarmed bandit problem with multiple
plays, 1987
• Graves and Lai, Asymptotically efficient adaptive choice of control
laws in controlled Markov chains, 1997
• György, Linder, Lugosi and Ottucsák, The on-line shortest path
problem under partial monitoring, 2007
• Yu and Mannor, Unimodal bandits, 2011
• Cesa-Bianchi and Lugosi, Combinatorial bandits, 2012
References
• Chen, Wang and Yuan. Combinatorial multi-armed bandit: General
framework and applications, 2013
• Combes and Proutiere, Unimodal bandits: Regret lower bounds and
optimal algorithms, 2014
• Magureanu, Combes, and Proutiere. Lipschitz bandits: Regret lower
bounds and optimal algorithms, 2014
Thompson sampling
• Chapelle and Li, An Empirical Evaluation of Thompson Sampling,
2011
• Korda, Kaufmann and Munos, Thompson Sampling: an
asymptotically optimal finite-time analysis, 2012
• Korda, Kaufmann and Munos, Thompson Sampling for
one-dimensional exponential family bandits, 2013
• Agrawal and Goyal, Further optimal regret bounds for Thompson
Sampling, 2013.
Lecture 5: Appendix
Adversarial bandits
Adversarial Optimisation
• Finite set of actions A = {1, . . . , K}
• Unknown and arbitrary rewards of action a ∈ A: (r_t(a), t ≥ 1),
decided by an adversary at time 0
• Best empirical action a* ∈ arg max_a ∑_{t=1}^T r_t(a)
• Online policy π: select action a^π_t at time t depending on:
- Expert setting: (r_1(a), . . . , r_{t−1}(a))_{a=1,...,K}
- Bandit setting: a^π_1, r_1(a^π_1), . . . , a^π_{t−1}, r_{t−1}(a^π_{t−1})
• Regret up to time T: R^π(T) = ∑_{t=1}^T r_t(a*) − E[∑_{t=1}^T r_t(a^π_t)]
Expert Setting
• Let S_t(a) = ∑_{n=1}^t r_n(a) be the cumulative reward of action a
• Multiplicative update algorithm (Littlestone-Warmuth 1994):
select arm a with probability
p_t(a) = e^{η S_{t−1}(a)} / ∑_b e^{η S_{t−1}(b)}
• Regret: the multiplicative update algorithm π has sublinear (vanishing
per-round) regret:
∀T, R^π(T) ≤ Tη/8 + log(K)/η
For η = √(8 log(K)/T), R^π(T) ≤ √(T log(K)/2)
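A minimal sketch of the multiplicative update rule in the expert setting (illustrative; rewards is assumed to be a T × K numpy array with entries in [0, 1], one row revealed in full after each round):

import numpy as np

def multiplicative_update(rewards, eta, seed=0):
    """Expert setting: p_t(a) proportional to exp(eta * S_{t-1}(a)), full feedback."""
    T, K = rewards.shape
    rng = np.random.default_rng(seed)
    S = np.zeros(K)                      # cumulative rewards S_t(a)
    total = 0.0
    for t in range(T):
        w = np.exp(eta * (S - S.max()))  # subtract the max for numerical stability
        p = w / w.sum()
        a = rng.choice(K, p=p)
        total += rewards[t, a]
        S += rewards[t]                  # all rewards of round t are observed
    return rewards.sum(axis=0).max() - total   # regret against the best fixed action

# eta = sqrt(8 * log(K) / T) gives expected regret at most sqrt(T * log(K) / 2)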
Bandit Setting
• Let S_t(a) = ∑_{n=1}^t r_n(a) be the cumulative reward of action a
• Build an unbiased estimator of S_t(a): Ŝ_t(a) = ∑_{n=1}^t X_n(a), where
X_n(a) = 1_{a^π(n)=a} r_n(a) / p_n(a)
• Multiplicative update algorithm (Littlestone-Warmuth 1994):
select arm a with probability
p_t(a) = e^{η Ŝ_{t−1}(a)} / ∑_b e^{η Ŝ_{t−1}(b)}
• Regret: the multiplicative update algorithm π has sublinear regret:
∀T, R^π(T) ≤ TKη/2 + log(K)/η
For η = √(2 log(K)/(KT)), R^π(T) ≤ √(2K log(K) T)
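A minimal sketch of the bandit version: the same multiplicative update, run on the importance-weighted estimates Ŝ_t(a). As on the slide, no extra uniform exploration term is added; the function name and the reward_fn callback are illustrative.

import numpy as np

def exp_weights_bandit(reward_fn, T, K, eta, seed=0):
    """Bandit setting: only the chosen arm's reward is observed each round."""
    rng = np.random.default_rng(seed)
    S_hat = np.zeros(K)                      # estimated cumulative rewards
    total = 0.0
    for t in range(T):
        w = np.exp(eta * (S_hat - S_hat.max()))
        p = w / w.sum()
        a = rng.choice(K, p=p)
        r = reward_fn(t, a)                  # bandit feedback, assumed in [0, 1]
        S_hat[a] += r / p[a]                 # X_t(a) = 1{a_t = a} * r_t(a) / p_t(a)
        total += r
    return total

# eta = sqrt(2 * log(K) / (K * T)) gives expected regret at most sqrt(2 * K * log(K) * T)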
Adversarial Convex Bandits
At the beginning of each year, Volvo has to select a vector x (in a convex
set) representing the relative efforts in producing various models (S60,
V70, V90, . . .). The reward is an arbitrarily varying and unknown
concave function of x. How to maximise reward over say 50 years?
Adversarial Convex Bandits
• Continuous set of actions A = [0, 1]
• (Unknown) arbitrary but concave rewards of action x ∈ A: r_t(x)
• Online policy π: select action x^π_t at time t depending on
x^π_1, r_1(x^π_1), . . . , x^π_{t−1}, r_{t−1}(x^π_{t−1})
• Regret up to time T (defined w.r.t. the best fixed action in hindsight):
R^π(T) = max_{x ∈ [0,1]} ∑_{t=1}^T r_t(x) − ∑_{t=1}^T r_t(x^π_t)
Can we do something smart at all? Achieve sublinear regret?
Adversarial Convex Bandits
• If r_t(·) = r(·) for all t and r(·) were known, we could apply a gradient
ascent algorithm
• One-point gradient estimate: define the smoothed function
f̂(x) = E_{v ∈ B}[f(x + δv)], B = {x : ‖x‖₂ ≤ 1}
Then E_{u ∈ S}[f(x + δu) u] = δ ∇f̂(x), S = {x : ‖x‖₂ = 1}
• Simulated Gradient Ascent algorithm: at each step t,
- draw u_t uniformly at random in S
- play y_t = x_t + δ u_t and observe r_t(y_t)
- update x_{t+1} = x_t + α r_t(y_t) u_t
• Regret: R(T) = O(T^{5/6})
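A minimal illustrative sketch of this simulated gradient ascent on A = [0, 1], where the unit sphere is S = {−1, +1}. A clipping step (not written on the slide) keeps the played point inside [0, 1], and δ and α would have to be tuned as functions of T to obtain the O(T^{5/6}) regret.

import numpy as np

def simulated_gradient_ascent(reward_fn, T, delta=0.05, alpha=0.01, seed=0):
    """One-point gradient estimate: play y_t = x_t + delta*u_t, move along r_t(y_t)*u_t."""
    rng = np.random.default_rng(seed)
    x, total = 0.5, 0.0
    for t in range(T):
        u = rng.choice([-1.0, 1.0])              # uniform on the 1-d unit sphere S
        y = x + delta * u                        # perturbed action actually played
        r = reward_fn(t, y)                      # bandit feedback only
        x = float(np.clip(x + alpha * r * u, delta, 1 - delta))
        total += r
    return total

# Example with a fixed concave reward (illustrative):
# simulated_gradient_ascent(lambda t, y: 1 - (y - 0.7) ** 2, T=10_000)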