Bandit Algorithms
Tor Lattimore & Csaba Szepesvari
Bandits
Time: 1 2 3 4 5 6 7 8 9 10 11 12
Left arm rewards so far: $1 $0 $1 $1 $0
Right arm rewards so far: $1 $0
Five rounds to go. Which arm would you play next?
Overview
• What are bandits, and why you should care
• Finite-armed stochastic bandits
• Finite-armed adversarial bandits
What's in a name? A tiny bit of history

First bandit algorithm proposed by Thompson (1933)

Bush and Mosteller (1953) were interested in how mice behaved in a T-maze
Why care about bandits?
1. Many applications
2. They isolate an important component of reinforcement learning: exploration-vs-exploitation
3. Rich and beautiful (we think) mathematically
Applications
• Clinical trials/dose discovery
• Recommendation systems (movies/news/etc.)
• Advert placement
• A/B testing
• Network routing
• Dynamic pricing (e.g., for Amazon products)
• Waiting problems (when to auto-logout your computer)
• Ranking (e.g., for search)
• A component of game-playing algorithms (MCTS)
• Resource allocation
• A way of isolating one interesting part of reinforcement learning
Lots for you to do!
Finite-armed bandits
• $K$ actions
• $n$ rounds
• In each round $t$ the learner chooses an action $A_t \in \{1, 2, \ldots, K\}$
• Observes reward $X_t \sim P_{A_t}$, where $P_1, P_2, \ldots, P_K$ are unknown distributions
Distributional assumptions

While $P_1, P_2, \ldots, P_K$ are not known in advance, we make some assumptions:
• $P_i$ is Bernoulli with unknown bias $\mu_i \in [0, 1]$
• $P_i$ is Gaussian with unit variance and unknown mean $\mu_i \in \mathbb{R}$
• $P_i$ is subgaussian
• $P_i$ is supported on $[0, 1]$
• $P_i$ has variance less than one
• ...

As usual, stronger assumptions lead to stronger bounds.

This tutorial: all reward distributions are Gaussian (or subgaussian) with unit variance.
Example: A/B testing
• Business wants to optimize their webpage
• Actions correspond to 'A' and 'B'
• Users arrive at the webpage sequentially
• Algorithm chooses either 'A' or 'B'
• Receives activity feedback (the reward)
Measuring performance – the regret
• Let $\mu_i$ be the mean reward of distribution $P_i$
• $\mu^* = \max_i \mu_i$ is the maximum mean
• The regret is

$$R_n = n\mu^* - \mathbb{E}\left[\sum_{t=1}^n X_t\right]$$

• Policies for which the regret is sublinear are learning
• Of course we would like to make it as 'small as possible'
Measuring performance – the regret

Let $\Delta_i = \mu^* - \mu_i$ be the suboptimality gap for the $i$th arm and $T_i(n)$ be the number of times arm $i$ is played over all $n$ rounds.

Lemma: $R_n = \sum_{i=1}^K \Delta_i \mathbb{E}[T_i(n)]$

Proof: Let $\mathbb{E}_t[\cdot] = \mathbb{E}[\cdot \mid A_1, X_1, \ldots, X_{t-1}, A_t]$. Then

$$\begin{aligned}
R_n &= n\mu^* - \mathbb{E}\left[\sum_{t=1}^n X_t\right] = n\mu^* - \sum_{t=1}^n \mathbb{E}[\mathbb{E}_t[X_t]] = n\mu^* - \sum_{t=1}^n \mathbb{E}[\mu_{A_t}] \\
&= \sum_{t=1}^n \mathbb{E}[\Delta_{A_t}] = \mathbb{E}\left[\sum_{t=1}^n \Delta_{A_t}\right] = \mathbb{E}\left[\sum_{t=1}^n \sum_{i=1}^K \mathbf{1}(A_t = i)\Delta_i\right] \\
&= \mathbb{E}\left[\sum_{i=1}^K \Delta_i \sum_{t=1}^n \mathbf{1}(A_t = i)\right] = \mathbb{E}\left[\sum_{i=1}^K \Delta_i T_i(n)\right] = \sum_{i=1}^K \Delta_i \mathbb{E}[T_i(n)]
\end{aligned}$$
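The lemma can be sanity-checked numerically. The sketch below (function and parameter names are illustrative, not from the slides) simulates a uniformly random policy on a 3-armed unit-variance Gaussian bandit and estimates the regret both directly and via the decomposition $\sum_i \Delta_i \mathbb{E}[T_i(n)]$:

```python
import random

def regret_two_ways(mu, n, runs=20000, seed=0):
    """Estimate the regret of the uniformly-random policy two ways:
    directly as n*mu_star - E[sum of rewards], and via the lemma
    as sum_i Delta_i * E[T_i(n)]. Rewards are Gaussian, unit variance."""
    rng = random.Random(seed)
    K = len(mu)
    mu_star = max(mu)
    total_reward = 0.0
    pulls = [0] * K
    for _ in range(runs):
        for _t in range(n):
            i = rng.randrange(K)            # the policy: pick an arm uniformly
            total_reward += rng.gauss(mu[i], 1.0)
            pulls[i] += 1
    direct = n * mu_star - total_reward / runs
    via_lemma = sum((mu_star - mu[i]) * pulls[i] / runs for i in range(K))
    return direct, via_lemma

direct, via_lemma = regret_two_ways([0.5, 0.3, 0.0], n=30)
# Both estimates agree up to Monte Carlo noise.
```

With $\mu = (0.5, 0.3, 0.0)$ and $n = 30$, each arm is pulled about $n/3 = 10$ times in expectation, so both estimates should be close to $10 \cdot 0.2 + 10 \cdot 0.5 = 7$.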
A simple policy: Explore-Then-Commit

1. Choose each action $m$ times
2. Find the empirically best action $I \in \{1, 2, \ldots, K\}$
3. Choose $A_t = I$ for all remaining rounds

In order to analyse this policy we need to bound the probability of committing to a suboptimal action.
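As a concrete illustration, here is a minimal simulation sketch of Explore-Then-Commit on a two-armed unit-variance Gaussian bandit (all names and parameter values below are illustrative choices, not from the slides):

```python
import random

def explore_then_commit(mu, n, m, rng):
    """Explore-Then-Commit on a Gaussian bandit with means `mu`:
    play each of the K arms m times, then commit to the empirical best."""
    K = len(mu)
    sums = [0.0] * K
    reward = 0.0
    for i in range(K):                  # exploration phase: m pulls per arm
        for _ in range(m):
            x = rng.gauss(mu[i], 1.0)
            sums[i] += x
            reward += x
    best = max(range(K), key=lambda i: sums[i] / m)   # empirically best arm
    for _ in range(n - K * m):          # commit phase
        reward += rng.gauss(mu[best], 1.0)
    return reward

rng = random.Random(1)
mu = [0.6, 0.4]                         # gap Delta = 0.2
n, runs = 1000, 2000
avg = sum(explore_then_commit(mu, n, m=50, rng=rng) for _ in range(runs)) / runs
regret = n * max(mu) - avg
```

With $m = 50$ the exploration phase alone contributes $m\Delta = 10$ to the regret, and the commit phase contributes roughly $(n - 2m)\Delta\, P(\text{wrong arm})$.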
A crash course in concentration

Let $Z, Z_1, Z_2, \ldots, Z_n$ be a sequence of independent and identically distributed random variables with mean $\mu \in \mathbb{R}$ and variance $\sigma^2 < \infty$, and let

$$\hat\mu_n = \frac{1}{n}\sum_{t=1}^n Z_t$$

be the empirical mean. How close is $\hat\mu_n$ to $\mu$? Classical statistics says:

1. (law of large numbers) $\lim_{n\to\infty} \hat\mu_n = \mu$ almost surely
2. (central limit theorem) $\sqrt{n}(\hat\mu_n - \mu) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$
3. (Chebyshev's inequality) $P(|\hat\mu_n - \mu| \ge \varepsilon) \le \dfrac{\sigma^2}{n\varepsilon^2}$

We need something nonasymptotic and stronger than Chebyshev's. Not possible without assumptions.
A crash course in concentration

Random variable $Z$ is $\sigma$-subgaussian if for all $\lambda \in \mathbb{R}$,

$$M_Z(\lambda) := \mathbb{E}[\exp(\lambda Z)] \le \exp(\lambda^2\sigma^2/2)$$

Lemma: If $Z, Z_1, \ldots, Z_n$ are independent and $\sigma$-subgaussian, then
• $aZ$ is $|a|\sigma$-subgaussian for any $a \in \mathbb{R}$
• $\sum_{t=1}^n Z_t$ is $\sqrt{n}\sigma$-subgaussian
• $\hat\mu_n$ is $n^{-1/2}\sigma$-subgaussian
A crash course in concentration

Theorem: If $Z_1, \ldots, Z_n$ are independent and $\sigma$-subgaussian, then

$$P\left(\hat\mu_n \ge \sqrt{\frac{2\sigma^2\log(1/\delta)}{n}}\right) \le \delta$$

Proof: We use Chernoff's method. Let $\varepsilon > 0$ and $\lambda = \varepsilon n/\sigma^2$. Then

$$\begin{aligned}
P(\hat\mu_n \ge \varepsilon) &= P(\exp(\lambda\hat\mu_n) \ge \exp(\lambda\varepsilon)) \\
&\le \mathbb{E}[\exp(\lambda\hat\mu_n)]\exp(-\lambda\varepsilon) \qquad \text{(Markov's)} \\
&\le \exp\left(\sigma^2\lambda^2/(2n) - \lambda\varepsilon\right) = \exp\left(-n\varepsilon^2/(2\sigma^2)\right)
\end{aligned}$$

Choosing $\varepsilon$ so that the right-hand side equals $\delta$ gives the claim.
A crash course in concentration
• Which distributions are $\sigma$-subgaussian? Gaussian, Bernoulli, bounded support
• And not: exponential, power law
• Comparing Chebyshev's with the subgaussian bound:

  Chebyshev's: $\sqrt{\dfrac{\sigma^2}{n\delta}}$   Subgaussian: $\sqrt{\dfrac{2\sigma^2\log(1/\delta)}{n}}$

• Typically $\delta \ll 1/n$ in our use-cases

The results that follow hold when the distribution associated with each arm is 1-subgaussian.
Analysing Explore-Then-Commit
• Standard convention: assume $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_K$
• Algorithms are symmetric and do not exploit this fact
• Means that the first arm is optimal
• Remember, Explore-Then-Commit chooses each arm $m$ times
• Then commits to the arm with the largest payoff
• We consider only $K = 2$
Analysing Explore-Then-Commit

Step 1: Let $\hat\mu_i$ be the average reward of arm $i$ after exploring. The algorithm commits to the wrong arm if

$$\hat\mu_2 \ge \hat\mu_1 \iff (\hat\mu_2 - \mu_2) + (\mu_1 - \hat\mu_1) \ge \Delta$$

Observation: $(\mu_1 - \hat\mu_1) + (\hat\mu_2 - \mu_2)$ is $\sqrt{2/m}$-subgaussian

Step 2: The regret is

$$\begin{aligned}
R_n = \mathbb{E}\left[\sum_{t=1}^n \Delta_{A_t}\right] &= \mathbb{E}\left[\sum_{t=1}^{2m} \Delta_{A_t}\right] + \mathbb{E}\left[\sum_{t=2m+1}^{n} \Delta_{A_t}\right] \\
&= m\Delta + (n - 2m)\Delta\, P(\text{commit to the wrong arm}) \\
&= m\Delta + (n - 2m)\Delta\, P\big((\hat\mu_2 - \mu_2) + (\mu_1 - \hat\mu_1) \ge \Delta\big) \\
&\le m\Delta + n\Delta \exp\left(-\frac{m\Delta^2}{4}\right)
\end{aligned}$$

where the last step applies the subgaussian tail bound with $\sigma^2 = 2/m$.
Analysing Explore-Then-Commit

$$R_n \le \underbrace{m\Delta}_{(A)} + \underbrace{n\Delta\exp(-m\Delta^2/4)}_{(B)}$$

(A) is monotone increasing in $m$ while (B) is monotone decreasing in $m$.

Exploration/Exploitation dilemma: exploring too much ($m$ large) makes (A) big, while exploring too little makes (B) large.

The bound is minimised by $m = \left\lceil \dfrac{4}{\Delta^2}\log\left(\dfrac{n\Delta^2}{4}\right) \right\rceil$, leading to

$$R_n \le \Delta + \frac{4}{\Delta}\log\left(\frac{n\Delta^2}{4}\right) + \frac{4}{\Delta}$$
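The trade-off between (A) and (B) can be inspected numerically. This sketch (the instance $n = 10000$, $\Delta = 0.5$ is an arbitrary illustration) compares the stated choice of $m$ against a brute-force minimiser of the bound:

```python
import math

# Evaluate the ETC regret bound  m*Delta + n*Delta*exp(-m*Delta^2/4)
# and compare its numerical minimiser with the stated choice of m.
n, Delta = 10000, 0.5

def bound(m):
    return m * Delta + n * Delta * math.exp(-m * Delta ** 2 / 4)

m_star = math.ceil(4 / Delta ** 2 * math.log(n * Delta ** 2 / 4))  # stated choice
m_best = min(range(1, n // 2), key=bound)                          # brute force
closed_form = Delta + (4 / Delta) * math.log(n * Delta ** 2 / 4) + 4 / Delta
# m_best lands within one of m_star, and the bound at the minimiser
# is at most the closed-form expression above.
```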
Analysing Explore-Then-Commit

Last slide:

$$R_n \le \Delta + \frac{4}{\Delta}\log\left(\frac{n\Delta^2}{4}\right) + \frac{4}{\Delta}$$

What happens when $\Delta$ is very small?

$$R_n \le \min\left\{ n\Delta,\ \Delta + \frac{4}{\Delta}\log\left(\frac{n\Delta^2}{4}\right) + \frac{4}{\Delta} \right\}$$

[Figure: the regret bound plotted as a function of $\Delta \in [0, 1]$; the regret axis runs from 0 to 30]
Analysing Explore-Then-Commit

Does this figure make sense? Why is the regret largest when $\Delta$ is small, but not too small?

Small $\Delta$ makes identification hard, but the cost of failure is low.
Large $\Delta$ makes the cost of failure high, but identification easy.
The worst case is $\Delta \approx \sqrt{1/n}$, with $R_n \approx \sqrt{n}$.
Limitations of Explore-Then-Commit
• Need advance knowledge of the horizon $n$
• Optimal tuning depends on $\Delta$
• Does not behave well with $K > 2$
• Issues arise when using data to adapt the commitment time
• All variants of ETC are at least a factor of 2 from being optimal
• Better approaches now exist, but Explore-Then-Commit is often a good place to start when analysing a bandit problem
Optimism principle
Informal illustration

Visiting a new region. Shall I try the local cuisine?
Optimist: Yes!
Pessimist: No!

Optimism leads to exploration, pessimism prevents it.
Exploration is necessary, but how much?
Optimism principle
• Let $\hat\mu_i(t) = \frac{1}{T_i(t)} \sum_{s=1}^t \mathbf{1}(A_s = i) X_s$
• Formalise the intuition using confidence intervals
• Optimistic estimate of the mean of an arm = 'largest value it could plausibly be'
• Suggests

$$\text{optimistic estimate} = \hat\mu_i(t-1) + \sqrt{\frac{2\log(1/\delta)}{T_i(t-1)}}$$

• $\delta \in (0, 1)$ determines the level of optimism
Upper confidence bound algorithm

1. Choose each action once
2. Choose the action maximising

$$A_t = \operatorname{argmax}_i\ \hat\mu_i(t-1) + \sqrt{\frac{2\log(t^3)}{T_i(t-1)}}$$

3. Go to 2

Corresponds to $\delta = 1/t^3$. This is quite a conservative choice; more on this later.
The algorithm does not depend on the horizon $n$ (it is anytime).
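The two steps above translate directly into code. The following is a minimal sketch of this UCB variant on a unit-variance Gaussian bandit (function names, the instance, and the run counts are illustrative assumptions):

```python
import math
import random

def ucb(mu, n, rng):
    """UCB on a Gaussian bandit: play each arm once, then in round t play
    the arm maximising  mu_hat_i + sqrt(2*log(t^3) / T_i)."""
    K = len(mu)
    sums = [rng.gauss(m, 1.0) for m in mu]   # one initial pull per arm
    counts = [1] * K
    total = sum(sums)
    for t in range(K + 1, n + 1):
        # index = empirical mean + confidence width with delta = 1/t^3
        idx = [sums[i] / counts[i] + math.sqrt(2 * math.log(t ** 3) / counts[i])
               for i in range(K)]
        i = max(range(K), key=lambda j: idx[j])
        x = rng.gauss(mu[i], 1.0)
        sums[i] += x
        counts[i] += 1
        total += x
    return total

rng = random.Random(0)
mu = [0.5, 0.0]
n, runs = 2000, 200
regret = n * max(mu) - sum(ucb(mu, n, rng) for _ in range(runs)) / runs
# The regret grows like log(n)/Delta, far below the linear worst case n*Delta.
```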
Demonstration
Regret of UCB

Theorem: The regret of UCB is at most

$$R_n = O\left(\sum_{i:\Delta_i>0}\left(\Delta_i + \frac{\log(n)}{\Delta_i}\right)\right)$$

Furthermore,

$$R_n = O\left(\sqrt{Kn\log(n)}\right)$$

Bounds of the first kind are called problem dependent or instance dependent.
Bounds like the second are called distribution free or worst case.
Regret analysis

Rewrite the regret as

$$R_n = \sum_{i=1}^K \Delta_i \mathbb{E}[T_i(n)]$$

Only need to show that $\mathbb{E}[T_i(n)]$ is not too large for suboptimal arms.
Regret analysis

Key insight: arm $i$ is only played if its index is larger than the index of the optimal arm.

Need to show two things:
(A) The index of the optimal arm is larger than its actual mean with high probability
(B) The index of suboptimal arms falls below the mean of the optimal arm after only a few plays

$$\gamma_i(t-1) = \underbrace{\hat\mu_i(t-1) + \sqrt{\frac{2\log(t^3)}{T_i(t-1)}}}_{\text{index of arm } i \text{ in round } t}$$
Analysis intuition

[Figure: two arms separated by the gap $\Delta$, comparing the true mean and the empirical mean of Arm 1 and Arm 2]
Regret analysis

To make this intuition a reality we decompose the 'pull-count':

$$\begin{aligned}
\mathbb{E}[T_i(n)] &= \mathbb{E}\left[\sum_{t=1}^n \mathbf{1}(A_t = i)\right] = \sum_{t=1}^n P(A_t = i) \\
&= \sum_{t=1}^n P\big(A_t = i \text{ and } (\gamma_1(t-1) \le \mu_1 \text{ or } \gamma_i(t-1) \ge \mu_1)\big) \\
&\le \underbrace{\sum_{t=1}^n P(\gamma_1(t-1) \le \mu_1)}_{\text{index of opt. arm too small?}} + \underbrace{\sum_{t=1}^n P(A_t = i \text{ and } \gamma_i(t-1) \ge \mu_1)}_{\text{index of subopt. arm large?}}
\end{aligned}$$

The middle equality holds because $A_t = i$ implies $\gamma_i(t-1) \ge \gamma_1(t-1)$, so either $\gamma_1(t-1) \le \mu_1$ or $\gamma_i(t-1) \ge \mu_1$.
Regret analysis

We want to show that $P(\gamma_1(t-1) \le \mu_1)$ is small. Tempting to use the concentration theorem...

$$P(\gamma_1(t-1) \le \mu_1) = P\left(\hat\mu_1(t-1) + \sqrt{\frac{2\log(t^3)}{T_1(t-1)}} \le \mu_1\right) \overset{?}{\le} \frac{1}{t^3}$$

What's wrong with this? $T_1(t-1)$ is a random variable! Letting $\hat\mu_{1,s}$ denote the empirical mean of the first $s$ rewards from arm 1,

$$\begin{aligned}
P\left(\hat\mu_1(t-1) + \sqrt{\frac{2\log(t^3)}{T_1(t-1)}} \le \mu_1\right) &\le P\left(\exists s < t : \hat\mu_{1,s} + \sqrt{\frac{2\log(t^3)}{s}} \le \mu_1\right) \\
&\le \sum_{s=1}^{t-1} P\left(\hat\mu_{1,s} + \sqrt{\frac{2\log(t^3)}{s}} \le \mu_1\right) \\
&\le \sum_{s=1}^{t-1} \frac{1}{t^3} \le \frac{1}{t^2}
\end{aligned}$$
Regret analysis

For the second term, using $2\log(t^3) = 6\log(t)$:

$$\begin{aligned}
\sum_{t=1}^n P(A_t = i \text{ and } \gamma_i(t-1) \ge \mu_1) &= \mathbb{E}\left[\sum_{t=1}^n \mathbf{1}\left(A_t = i \text{ and } \hat\mu_i(t-1) + \sqrt{\frac{6\log(t)}{T_i(t-1)}} \ge \mu_1\right)\right] \\
&\le \mathbb{E}\left[\sum_{t=1}^n \mathbf{1}\left(A_t = i \text{ and } \hat\mu_i(t-1) + \sqrt{\frac{6\log(n)}{T_i(t-1)}} \ge \mu_1\right)\right] \\
&\le \mathbb{E}\left[\sum_{s=1}^n \mathbf{1}\left(\hat\mu_{i,s} + \sqrt{\frac{6\log(n)}{s}} \ge \mu_1\right)\right] \\
&= \sum_{s=1}^n P\left(\hat\mu_{i,s} + \sqrt{\frac{6\log(n)}{s}} \ge \mu_1\right)
\end{aligned}$$
Regret analysis

Let $u = \dfrac{24\log(n)}{\Delta_i^2}$. Then

$$\begin{aligned}
\sum_{s=1}^n P\left(\hat\mu_{i,s} + \sqrt{\frac{6\log(n)}{s}} \ge \mu_1\right) &\le u + \sum_{s=u+1}^n P\left(\hat\mu_{i,s} + \sqrt{\frac{6\log(n)}{s}} \ge \mu_1\right) \\
&\le u + \sum_{s=u+1}^n P\left(\hat\mu_{i,s} \ge \mu_i + \frac{\Delta_i}{2}\right) \\
&\le u + \sum_{s=u+1}^{\infty} \exp\left(-\frac{s\Delta_i^2}{8}\right) \le 1 + u + \frac{8}{\Delta_i^2}
\end{aligned}$$

where the second inequality holds because $\sqrt{6\log(n)/s} < \Delta_i/2$ for $s > u$.
Regret analysis

Combining the two parts we have

$$\mathbb{E}[T_i(n)] \le 3 + \frac{8}{\Delta_i^2} + \frac{24\log(n)}{\Delta_i^2}$$

So the regret is bounded by

$$R_n = \sum_{i:\Delta_i>0} \Delta_i \mathbb{E}[T_i(n)] \le \sum_{i:\Delta_i>0}\left(3\Delta_i + \frac{8}{\Delta_i} + \frac{24\log(n)}{\Delta_i}\right)$$
Distribution free bounds

Let $\Delta > 0$ be some constant to be chosen later.

$$\begin{aligned}
R_n = \sum_{i:\Delta_i>0} \Delta_i \mathbb{E}[T_i(n)] &\le n\Delta + \sum_{i:\Delta_i>\Delta} \Delta_i \mathbb{E}[T_i(n)] \\
&\lesssim n\Delta + \sum_{i:\Delta_i>\Delta} \frac{\log(n)}{\Delta_i} \le n\Delta + \frac{K\log(n)}{\Delta} \lesssim \sqrt{nK\log(n)}
\end{aligned}$$

where in the last step we tuned $\Delta = \sqrt{K\log(n)/n}$.
Improvements
• The constants in the algorithm/analysis can be improved quite significantly:

$$A_t = \operatorname{argmax}_i\ \hat\mu_i(t-1) + \sqrt{\frac{2\log(t)}{T_i(t-1)}}$$

• With this choice:

$$\lim_{n\to\infty} \frac{R_n}{\log(n)} = \sum_{i:\Delta_i>0} \frac{2}{\Delta_i}$$

• The distribution-free regret is also improvable:

$$A_t = \operatorname{argmax}_i\ \hat\mu_i(t-1) + \sqrt{\frac{4}{T_i(t-1)}\log\left(1 + \frac{t}{KT_i(t-1)}\right)}$$

• With this index we save a log factor in the distribution free bound: $R_n = O(\sqrt{nK})$
Lower bounds
• Two kinds of lower bound: distribution free (worst case) and instance-dependent
• What could an instance-dependent lower bound look like?
• What about algorithms that always choose a fixed action?
Worst case lower bound

Theorem: For every algorithm and every $n$ and $K \le n$ there exists a $K$-armed Gaussian bandit such that $R_n \ge \sqrt{(K-1)n}/27$

Proof sketch:
• $\mu = (\Delta, 0, \ldots, 0)$
• $i = \operatorname{argmin}_{i>1} \mathbb{E}_\mu[T_i(n)]$
• $\mathbb{E}[T_i(n)] \le n/(K-1)$
• $\mu' = (\Delta, 0, \ldots, 0, 2\Delta, 0, \ldots, 0)$, with the $2\Delta$ in coordinate $i$
• Environments indistinguishable if $\Delta \approx \sqrt{K/n}$
• The algorithm suffers $n\Delta$ regret on one of them
Instance-dependent lower bounds

An algorithm is consistent on a class of bandits E if R_n = o(n) for all bandits in E.

Theorem If an algorithm is consistent for the class of Gaussian bandits, then

  lim inf_{n→∞} R_n / log(n) ≥ ∑_{i:∆_i>0} 2/∆_i

• Consistency rules out stupid algorithms like the algorithm that always chooses a fixed action
• Consistency is asymptotic, so it is not surprising that the lower bound we derive from it is asymptotic
• A non-asymptotic version of consistency leads to non-asymptotic lower bounds
What else is there?

• All kinds of variants of UCB for different noise models: Bernoulli, exponential families, heavy tails, Gaussian with unknown mean and variance, ...
• A twist on UCB that replaces classical confidence bounds with Bayesian confidence bounds – offers empirical improvements
• Thompson sampling: each round, sample a mean from the posterior for each arm and choose the arm with the largest sample
• All manner of twists on the setup: non-stationarity, delayed rewards, playing multiple arms each round, moving beyond expected regret (high-probability bounds)
• Different objectives: simple regret, risk aversion
The adversarial viewpoint

• Replace random rewards with an adversary
• At the start of the game the adversary secretly chooses losses y_1, y_2, . . . , y_n where y_t ∈ [0, 1]^K
• The learner chooses actions A_t and suffers loss y_{tA_t}
• The regret is

  R_n = E[ ∑_{t=1}^n y_{tA_t} ] − min_i ∑_{t=1}^n y_{ti}
        (learner's loss)       (loss of the best arm)

• Mission Make the regret small, regardless of the adversary
• There exists an algorithm such that R_n ≤ 2√(Kn)
The adversarial viewpoint

• The trick is in the definition of regret
• The adversary cannot be too mean:

  R_n = E[ ∑_{t=1}^n y_{tA_t} ] − min_i ∑_{t=1}^n y_{ti}
        (learner's loss)       (loss of the best arm)

  y = ( 1 · · · 1 0 · · · 0
        0 · · · 0 1 · · · 1 )

• The following alternative objective is hopeless:

  R′_n = E[ ∑_{t=1}^n y_{tA_t} ] − ∑_{t=1}^n min_i y_{ti}
         (learner's loss)        (loss of the best sequence)

• Randomisation is crucial in adversarial bandits
Tackling the adversarial bandit

• The learner chooses a distribution P_t over the K actions
• Samples A_t ∼ P_t
• Observes Y_t = y_{tA_t}
• The expected regret is

  R_n = max_i E[ ∑_{t=1}^n (y_{tA_t} − y_{ti}) ] = max_{p∈∆_K} E[ ∑_{t=1}^n ⟨P_t − p, y_t⟩ ]

• This looks a lot like online linear optimisation on a simplex
• Only y_t is not observed
Online convex optimisation (linear losses)

• K ⊂ R^d is a convex set
• The adversary secretly chooses y_1, . . . , y_n ∈ K° = {u : sup_{x∈K} |⟨x, u⟩| ≤ 1}
• The learner chooses x_t ∈ K
• Suffers loss ⟨x_t, y_t⟩, and the regret with respect to x ∈ K is

  R_n(x) = ∑_{t=1}^n ⟨x_t − x, y_t⟩ .

• How to choose x_t? The simplest idea is ‘follow-the-leader’:

  x_t = argmin_{x∈K} ∑_{s=1}^{t−1} ⟨x, y_s⟩ .

• This fails miserably: with K = [−1, 1] and losses y_1 = 1/2, y_2 = −1, y_3 = 1, . . . the leader plays x_1 = ?, x_2 = −1, x_3 = 1, . . ., leading to R_n(0) ≈ n.
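The failure is easy to check numerically. A minimal sketch of the slide's one-dimensional example (the function name is our own):

```python
import numpy as np

def ftl_regret(n):
    """Follow-the-leader on K = [-1, 1] against the losses
    y_1 = 1/2, y_2 = -1, y_3 = 1, ...; returns the regret against x = 0."""
    y = np.array([0.5] + [(-1.0) ** t for t in range(1, n)])
    cum = 0.0      # cumulative loss coefficient sum_{s<t} y_s
    regret = 0.0   # sum_t <x_t - 0, y_t>
    for t in range(n):
        # argmin over [-1, 1] of x * cum is an endpoint (x_1 is arbitrary; take 0)
        x = 0.0 if cum == 0 else -np.sign(cum)
        regret += x * y[t]
        cum += y[t]
    return regret

print(ftl_regret(100))  # 99.0: linear regret, as claimed
```

The leader always commits fully to whichever endpoint looked best so far, and the alternating losses make that choice wrong every round.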
Follow the regularised leader

• New idea Add regularisation to stabilise follow-the-leader
• Let F be a convex function, let η > 0 be the learning rate, and set

  x_t = argmin_{x∈K} ( F(x) + η ∑_{s=1}^{t−1} ⟨x, y_s⟩ )

• The Bregman divergence induced by F is

  D_F(x, y) = F(x) − F(y) − ⟨∇F(y), x − y⟩

[Figure: D_F(b, a) is the vertical gap at b between F(x) and the tangent F(a) + ∇F(a)(x − a)]
Follow the regularised leader

Theorem The regret of follow the regularised leader satisfies

  R_n(x) ≤ (F(x) − F(x_1))/η + ∑_{t=1}^n ( ⟨x_t − x_{t+1}, y_t⟩ − (1/η) D_F(x_{t+1}, x_t) )
         ≤ (F(x) − F(x_1))/η + (η/2) ∑_{t=1}^n ‖y_t‖²_{t∗}

Tradeoffs How much to regularise?

Let z ∈ [x_t, x_{t+1}] be such that D_F(x_t, x_{t+1}) = ½ ‖x_t − x_{t+1}‖²_{∇²F(z)} and ‖·‖_t = ‖·‖_{∇²F(z)}. Then

  ⟨x_t − x_{t+1}, y_t⟩ − D_F(x_{t+1}, x_t)/η
    ≤ ‖y_t‖_{t∗} ‖x_t − x_{t+1}‖_t − D_F(x_{t+1}, x_t)/η
    = ‖y_t‖_{t∗} √(2 D_F(x_{t+1}, x_t)) − D_F(x_{t+1}, x_t)/η
    ≤ (η/2) ‖y_t‖²_{t∗}
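The final inequality maximises over the value of the divergence. Writing a = ‖y_t‖_{t∗} and D = D_F(x_{t+1}, x_t), a quick check:

```latex
g(D) = a\sqrt{2D} - \frac{D}{\eta}, \qquad
g'(D) = \frac{a}{\sqrt{2D}} - \frac{1}{\eta} = 0
\;\Longrightarrow\; D^\star = \frac{a^2\eta^2}{2}, \qquad
g(D^\star) = a^2\eta - \frac{a^2\eta}{2} = \frac{\eta a^2}{2}.
```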
Let Φ_t(x) = F(x)/η + ∑_{s=1}^t ⟨x, y_s⟩

  R_n(x) = ∑_{t=1}^n ⟨x_t − x, y_t⟩ = ∑_{t=1}^n ⟨x_{t+1} − x, y_t⟩ + ∑_{t=1}^n ⟨x_t − x_{t+1}, y_t⟩

Then using D_{Φ_t}(·, ·) = D_F(·, ·) and x_{t+1} = argmin_x Φ_t(x):

  ∑_{t=1}^n ⟨x_{t+1} − x, y_t⟩
    = F(x)/η + ∑_{t=1}^n (Φ_t(x_{t+1}) − Φ_{t−1}(x_{t+1})) − Φ_n(x)
    = F(x)/η − Φ_0(x_1) + Φ_n(x_{n+1}) − Φ_n(x)   [≤ 0]
      + ∑_{t=0}^{n−1} (Φ_t(x_{t+1}) − Φ_t(x_{t+2}))
    ≤ (F(x) − F(x_1))/η + ∑_{t=0}^{n−1} (Φ_t(x_{t+1}) − Φ_t(x_{t+2}))
    = (F(x) − F(x_1))/η − ∑_{t=0}^{n−1} ( D_{Φ_t}(x_{t+2}, x_{t+1}) + ⟨∇Φ_t(x_{t+1}), x_{t+2} − x_{t+1}⟩ [≥ 0] )
Follow the regularised leader for bandits

• Estimate y_t with the unbiased importance-weighted estimator Y_t:

  Y_ti = 1(A_t = i) y_ti / P_ti

• Then the expected regret satisfies

  E[R_n] = max_i E[ ∑_{t=1}^n (y_{tA_t} − y_{ti}) ] = max_i E[ ∑_{t=1}^n ⟨P_t − e_i, Y_t⟩ ]

• Choosing P_t = argmin_p ( F(p)/η + ∑_{s=1}^{t−1} ⟨p, Y_s⟩ ) leads to

  E[R_n] ≤ (F(e_i) − F(P_1))/η + (η/2) ∑_{t=1}^n ‖Y_t‖²_{t∗}

• We just need to choose F carefully
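Unbiasedness of the importance-weighted estimator is easy to check by simulation. A sketch; the loss vector and sampling distribution below are made-up illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0.2, 0.7, 0.5])   # one round's (hidden) loss vector
P = np.array([0.5, 0.3, 0.2])   # learner's sampling distribution P_t

# Y_i = 1(A = i) * y_i / P_i, computed for many independent draws of A
m = 200_000
A = rng.choice(3, size=m, p=P)
Y = np.zeros((m, 3))
Y[np.arange(m), A] = y[A] / P[A]

print(Y.mean(axis=0))  # close to y: E[Y_i] = P_i * (y_i / P_i) = y_i
```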
Follow the regularised leader for bandits

• We showed E[R_n] ≤ E[ (F(e_i) − F(P_1))/η + (η/2) ∑_{t=1}^n ‖Y_t‖²_{t∗} ]
• Let's choose the unnormalised negentropy

  F(p) = ∑_{i=1}^K (p_i log(p_i) − p_i)

• An ‘easy’ calculation shows that

  P_ti = exp(−η ∑_{s=1}^{t−1} Y_si) / ∑_{j=1}^K exp(−η ∑_{s=1}^{t−1} Y_sj)

• Then F(e_i) − F(P_1) ≤ log(K). For the dual norm,

  ∇²F(p) = diag(1/p)  =⇒  ‖y‖²_{t∗} = ∑_{i=1}^K p_i y_i²  for some p ∈ [P_t, P_{t+1}]

• Y_ti is non-negative and Y_ti = 0 unless A_t = i, so P_{t+1,A_t} ≤ P_{tA_t} and ‖Y_t‖²_{t∗} ≤ P_{tA_t} Y²_{tA_t}
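Putting the pieces together gives exponential weights on importance-weighted losses, i.e. an Exp3-style algorithm. A sketch under the slides' assumptions (losses in [0, 1], fixed horizon); the function name and the test losses are our own:

```python
import numpy as np

def exp3(losses, eta, seed=0):
    """FTRL with unnormalised negentropy on an n x K loss matrix.

    Plays P_ti proportional to exp(-eta * sum_{s<t} Y_si) and returns
    the learner's total realised loss."""
    rng = np.random.default_rng(seed)
    n, K = losses.shape
    S = np.zeros(K)   # cumulative importance-weighted losses
    total = 0.0
    for t in range(n):
        w = np.exp(-eta * (S - S.min()))  # shift S for numerical stability
        P = w / w.sum()
        A = rng.choice(K, p=P)
        total += losses[t, A]
        S[A] += losses[t, A] / P[A]       # Y_tA = y_tA / P_tA (other coords are 0)
    return total

# Arm 0 always has loss 0; realised loss should stay near sqrt(2 n K log K)
n, K = 2000, 3
losses = np.ones((n, K)); losses[:, 0] = 0.0
eta = np.sqrt(2 * np.log(K) / (n * K))   # standard learning-rate choice
print(exp3(losses, eta))
```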
Follow the regularised leader for bandits

• Now we have

  E[R_n] ≤ log(K)/η + (η/2) E[ ∑_{t=1}^n P_{tA_t} Y²_{tA_t} ]
         = log(K)/η + (η/2) E[ ∑_{t=1}^n y²_{tA_t} / P_{tA_t} ]
         ≤ log(K)/η + (η/2) E[ ∑_{t=1}^n 1 / P_{tA_t} ]
         = log(K)/η + (η/2) E[ ∑_{t=1}^n ∑_{i=1}^K P_ti · (1/P_ti) ]
         = log(K)/η + ηnK/2
         ≤ √(2nK log(K))   (choosing η = √(2 log(K)/(nK)))
Adversarial bandits

• Instance-dependence?
• Moving beyond expected regret (high-probability bounds)
• Why bother with stochastic bandits?
• Best of both worlds? Bubeck and Slivkins (2012); Seldin and Lugosi (2017); Auer and Chiang (2016)
• Big myth Adversarial bandits do not address nonstationarity
Resources

• Book by Bubeck and Cesa-Bianchi (2012)
• Book by Cesa-Bianchi and Lugosi (2006)
• The Bayesian books by Gittins et al. (2011) and Berry and Fristedt (1985). Both worth reading.
• Our online notes: http://banditalgs.com
• Notes by Aleksandrs Slivkins: http://slivkins.com/work/MAB-book.pdf
• We will soon release a 450-page book (“Bandit Algorithms”, to be published by Cambridge)
Historical notes

• The first paper on bandits is by Thompson (1933). He proposed an algorithm for two-armed Bernoulli bandits and ran some simulations by hand (Thompson sampling)
• Popularised enormously by Robbins (1952)
• Confidence bounds first used by Lai and Robbins (1985) to derive an asymptotically optimal algorithm
• UCB by Katehakis and Robbins (1995) and Agrawal (1995). Finite-time analysis by Auer et al. (2002)
• Adversarial bandits: Auer et al. (1995)
• Minimax optimal algorithm by Audibert and Bubeck (2009)
References I

Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, pages 1054–1078.

Audibert, J.-Y. and Bubeck, S. (2009). Minimax policies for adversarial and stochastic bandits. In Proceedings of the Conference on Learning Theory (COLT), pages 217–226.

Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47:235–256.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. (1995). Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 322–331. IEEE.

Auer, P. and Chiang, C. (2016). An algorithm with nearly optimal pseudo-regret for both stochastic and adversarial bandits. In Proceedings of the 29th Conference on Learning Theory (COLT), pages 116–120.

Berry, D. and Fristedt, B. (1985). Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London; New York.

Bubeck, S. and Cesa-Bianchi, N. (2012). Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Foundations and Trends in Machine Learning. Now Publishers.

Bubeck, S. and Slivkins, A. (2012). The best of both worlds: Stochastic and adversarial bandits. In COLT, pages 42.1–42.23.

Bush, R. R. and Mosteller, F. (1953). A stochastic model with applications to learning. The Annals of Mathematical Statistics, pages 559–585.

Cesa-Bianchi, N. and Lugosi, G. (2006). Prediction, Learning, and Games. Cambridge University Press.
References II

Gittins, J., Glazebrook, K., and Weber, R. (2011). Multi-armed Bandit Allocation Indices. John Wiley & Sons.

Katehakis, M. N. and Robbins, H. (1995). Sequential choice from several populations. Proceedings of the National Academy of Sciences of the United States of America, 92(19):8584.

Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22.

Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58(5):527–535.

Seldin, Y. and Lugosi, G. (2017). An improved parametrization and analysis of the EXP3++ algorithm for stochastic and adversarial bandits. In COLT, pages 1743–1759.

Thompson, W. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.
Random concentration failure

Let X_1, X_2, . . . be a sequence of independent and identically distributed standard Gaussians. For any fixed n we have

  P( ∑_{t=1}^n X_t ≥ √(2n log(1/δ)) ) ≤ δ

We want to show this can fail if n is replaced by a random variable T.

The law of the iterated logarithm says that

  lim sup_{n→∞} ∑_{t=1}^n X_t / √(2n log log(n)) = 1 almost surely

Let T = min{n : ∑_{t=1}^n X_t ≥ √(2n log(1/δ))}. Then P(T < ∞) = 1 and

  P( ∑_{t=1}^T X_t ≥ √(2T log(1/δ)) ) = 1 .

Contradiction! (The bound does hold if T is independent of X_1, X_2, . . . though.)
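The failure is also easy to see numerically. A Monte Carlo sketch (δ, the horizon, and the number of runs are arbitrary illustrative choices): at a fixed time the exceedance frequency respects the bound, while stopping at the first crossing already violates it within a finite horizon:

```python
import numpy as np

rng = np.random.default_rng(1)
delta, n_max, runs = 0.1, 1000, 5000
# Boundary sqrt(2 n log(1/delta)) for n = 1, ..., n_max
bound = np.sqrt(2 * np.arange(1, n_max + 1) * np.log(1 / delta))

S = rng.standard_normal((runs, n_max)).cumsum(axis=1)  # partial sums of the X_t

fixed_freq = (S[:, -1] >= bound[-1]).mean()     # exceedance at the fixed time n_max
stopped_freq = (S >= bound).any(axis=1).mean()  # crossing by time n_max (truncated T)

print(fixed_freq, stopped_freq)
```

The fixed-time frequency stays below δ, while the stopped frequency already exceeds δ at this horizon (and tends to 1 as the horizon grows, by the law of the iterated logarithm).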