Online Learning with Experts & Multiplicative Weights Algorithms

CS 159 lecture #2

Stephan Zheng, April 1, 2016

Caltech
Table of contents
1. Online Learning with Experts
With a perfect expert
Without perfect experts
2. Multiplicative Weights algorithms
Regret of Multiplicative Weights
3. Learning with Experts revisited
Continuous vector predictions
Subtlety in the discrete case
Today
What should you take away from this lecture?
• How should you base your prediction on expert predictions?
• What are the characteristics of multiplicative weights algorithms?
References
• Online Learning [Ch 2-4], Gabor Bartok, David Pal, Csaba Szepesvari, and Istvan Szita.
• The Multiplicative Weights Update Method: A Meta-Algorithm and Applications, Sanjeev Arora, Elad Hazan, and Satyen Kale. Theory of Computing, 8, 121-164, 2012.
• http://www.yisongyue.com/courses/cs159/
Online Learning with Experts
Leo at the Oscars: an online binary prediction problem
Will Leo win an Oscar this year? (running question since ∼1997...)
Leo at the Oscars: an online binary prediction problem
... No
2005 No
2006 No
2007 No
2008 No
2009 No
2010 No
Leo at the Oscars: an online binary prediction problem
2011 No
2012 No
2013 No
2014 No
2015 No
2016 Yes!
Leo at the Oscars: an online binary prediction problem
A frivolous extrapolation:
2016 Yes!
2017 No
2018 Yes!
2019 No
2020 Yes
2021 Yes
Leo at the Oscars: an online binary prediction problem
Imagine you’re a showbiz fan and want to predict the answer every year t.
But, you don’t know all the ins and outs of Hollywood.
Instead, you are lucky enough to have access to experts i ∈ I, that each make a (possibly wrong!) prediction pi,t.
• How do you use the experts for your own prediction?
• How do you incorporate the feedback from Nature?
• How do we achieve sublinear regret?
Formal description
Online Learning with Experts
Formal setup: there are T rounds in which we predict a binary label pt ∈ Y = {0, 1}.
At every timestep t:
1. Each expert i = 1 . . . n predicts pi,t ∈ Y
2. You make a prediction pt ∈ Y
3. Nature reveals yt ∈ Y
4. We suffer a loss l(pt, yt) = 1[pt ≠ yt]
5. Each expert suffers a loss l(pi,t, yt) = 1[pi,t ≠ yt]

Loss: LT = Σ_{t=1}^T l(pt, yt), Li,T = Σ_{t=1}^T l(pi,t, yt) (the # of mistakes made)

Regret: RT = LT − mini Li,T
How do you decide what pt to predict? How do you incorporate the feedbackfrom Nature? How do we achieve sublinear regret?
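To make these definitions concrete, here is a minimal Python sketch that counts mistakes and computes the regret; the prediction sequences are hypothetical illustrations, not data from the lecture:

```python
# Loss and regret in the experts setting: the learner's mistake count
# minus the mistake count of the best single expert in hindsight.
# All prediction sequences below are made-up illustrations.

def zero_one_loss(p, y):
    return 1 if p != y else 0        # l(p, y) = 1[p != y]

y_seq   = [0, 0, 1, 0, 1]            # Nature's answers y_t
learner = [0, 1, 1, 0, 0]            # our predictions p_t
experts = [[0, 0, 0, 0, 0],          # expert 1 always predicts "no"
           [0, 0, 1, 0, 1]]          # expert 2 happens to be perfect here

L_T = sum(zero_one_loss(p, y) for p, y in zip(learner, y_seq))
L_i = [sum(zero_one_loss(p, y) for p, y in zip(e, y_seq)) for e in experts]
R_T = L_T - min(L_i)                 # R_T = L_T - min_i L_{i,T}
print(L_T, L_i, R_T)                 # 2 [2, 0] 2
```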
Online Learning with Experts
• "Experts" is a way to abstract a hypothesis class.
• For the most part, we'll deal with a finite, discrete number of experts, because that's easier to analyze.
• In general, there can be a continuous space of experts = using a standard hypothesis class.
• Boosting uses a collection of n weak classifiers as "experts". At every time t it adds 1 weak classifier with weight αt to the ensemble classifier h1:t. The prediction pt is based on the ensemble h1:t that we have collected so far.
Weighted Majority: the halving algorithm
Let's assume that there is a perfect expert i∗, that is guaranteed to know the right answer (say a mind-reader that can read the minds of the voting Academy members). That is, ∀t : pi∗,t = yt and li∗,t = 0.

Keep regret (= # mistakes) small −→ find the perfect expert quickly with few mistakes.
• Eliminate experts that make a mistake
• Take a majority vote of "alive" experts
Let’s define some groups of experts:
The "alive" set: Et = {i : i did not make a mistake until time t}, E0 = {1, . . . , n}

The nay-sayers: E^0_{t−1} = {i ∈ Et−1 : pi,t = 0}

The yay-sayers: E^1_{t−1} = {i ∈ Et−1 : pi,t = 1}
Weighted Majority: the halving algorithm
t     lt  yt    (columns pt and p1,t . . . p12,t shown on the slide)
2011  0   No
2012  0   No
2013  0   No
2014  0   No
2015  0   No
2016  1   Yes!
2017  0   No
2018  0   Yes!
2019  0   No
2020  1   Yes
2021  0   Yes
Weighted Majority: the halving algorithm
The halving algorithm
1: for t = 1 . . . T do
2:   Receive expert predictions (p1,t . . . pn,t)
3:   Split Et−1 into E^0_{t−1} ∪ E^1_{t−1}
4:   if |E^1_{t−1}| > |E^0_{t−1}| then    ▷ follow the majority
5:     Predict pt = 1
6:   else
7:     Predict pt = 0
8:   end if
9:   Receive Nature's answer yt (and incur loss l(pt, yt))
10:  Update Et with experts that continue to be right
11: end for
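The steps above can be sketched as a short Python function; the expert streams in the demo are hypothetical, with one perfect expert planted among them:

```python
# Halving algorithm: predict with the majority of still-"alive" experts,
# then eliminate every expert that was wrong this round. With a perfect
# expert present, the learner makes at most floor(log2 n) mistakes.

def halving(expert_preds, y):
    """expert_preds[i][t] in {0, 1}; y[t] in {0, 1}. Returns # mistakes."""
    alive = set(range(len(expert_preds)))
    mistakes = 0
    for t, yt in enumerate(y):
        yays = sum(1 for i in alive if expert_preds[i][t] == 1)
        p_t = 1 if yays > len(alive) - yays else 0   # majority vote
        mistakes += (p_t != yt)
        # keep only the experts that were right at time t
        alive = {i for i in alive if expert_preds[i][t] == yt}
    return mistakes

# 4 hypothetical experts; expert 4 is perfect on this sequence.
y = [1, 0, 1, 1]
experts = [[0, 0, 0, 0], [1, 1, 1, 0], [0, 1, 1, 1], [1, 0, 1, 1]]
print(halving(experts, y))   # 1 mistake, within the bound log2(4) = 2
```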
Weighted Majority: the halving algorithm
Notice that if halving makes a mistake, then at least half of the experts in Et−1 were wrong:

Wt = |Et| = |E^{yt}_{t−1}| ≤ |Et−1|/2.
Since there is always a perfect expert, the algorithm makes no more than ⌊log2 |E0|⌋ = ⌊log2 n⌋ mistakes −→ sublinear regret.
Qualitatively speaking:
• Wt multiplicatively decreases when halving makes a mistake. If Wt doesn't shrink too much, then halving can't make too many mistakes.
• There is a lower bound on Wt for all t, since there is a perfect expert: Wt can't shrink "too much" starting from its initial value.
We’ll see similar behavior later.
Weighted Majority
So what should we do if we don't know anything about the experts? The majority vote might be always wrong!
Give each of them a weight wi,t, initialized as wi,1 = 1.
Decide based on a weighted sum of experts:
pt = 1 [weighted sum of yay-sayers > weighted sum of nay-sayers]
pt = 1[ Σ_{i=1}^n wi,t−1 pi,t > Σ_{i=1}^n wi,t−1 (1 − pi,t) ]
Weighted Majority
Recall: with a perfect expert we had
The halving algorithm
1: for t = 1 . . . T do
2:   Receive expert predictions (p1,t . . . pn,t)
3:   Split Et−1 into E^0_{t−1} ∪ E^1_{t−1}
4:   if |E^1_{t−1}| > |E^0_{t−1}| then    ▷ follow the majority
5:     Predict pt = 1
6:   else
7:     Predict pt = 0
8:   end if
9:   Receive Nature's answer yt (and incur loss l(pt, yt))
10:  Update Et with experts that continue to be right
11: end for
Weighted Majority
Without knowledge about the experts: choose a decay factor β ∈ [0, 1)
The Weighted Majority algorithm
1: Initialize wi,1 = 1, W1 = n.
2: for t = 1 . . . T do
3:   Receive expert predictions (p1,t . . . pn,t)
4:   if Σ_{i=1}^n wi,t−1 pi,t > Σ_{i=1}^n wi,t−1 (1 − pi,t) then
5:     Predict pt = 1
6:   else
7:     Predict pt = 0
8:   end if
9:   Receive Nature's answer yt (and incur loss l(pt, yt))
10:  wi,t ← wi,t−1 · β^{1[pi,t ≠ yt]}
11: end for
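A minimal Python sketch of this update; β = 0.5 and the expert streams are illustrative choices, not values from the lecture:

```python
# Weighted Majority: take a weighted vote, then multiply the weight of
# each wrong expert by beta in [0, 1). beta = 0 recovers halving.

def weighted_majority(expert_preds, y, beta=0.5):
    n = len(expert_preds)
    w = [1.0] * n                     # w_{i,1} = 1
    mistakes = 0
    for t, yt in enumerate(y):
        yay = sum(w[i] for i in range(n) if expert_preds[i][t] == 1)
        nay = sum(w[i] for i in range(n) if expert_preds[i][t] == 0)
        p_t = 1 if yay > nay else 0   # weighted majority vote
        mistakes += (p_t != yt)
        # multiplicative penalty for every expert that was wrong
        w = [w[i] * (beta if expert_preds[i][t] != yt else 1.0)
             for i in range(n)]
    return mistakes

y = [1, 0, 1]
experts = [[1, 1, 1], [0, 0, 0], [1, 0, 1]]   # expert 3 is always right
print(weighted_majority(experts, y))          # 0 mistakes on this toy run
```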
Weighted Majority
• halving is equivalent to choosing wi,t = 1 if i is "alive" (β = 0).
• What happens as β −→ 1? Faulty experts are not punished that much.
Weighted Majority
Define the potential Wt = Σ_{i=1}^n wi,t (the weighted size of the set of experts).
Theorem 1:
• Wt ≤ Wt−1
• If pt ≠ yt then Wt ≤ ((1 + β)/2) Wt−1.
Compare with before:
• Wt multiplicatively decreases when the algorithm makes a mistake.
• Is there always a non-zero lower bound on Wt for all t? Yes, for finite T. No, over infinite time all weights could decay towards 0. Note that we didn't normalize the wi,t; we'll fix this in the next part.
• What about the regret behavior? Is it sublinear? We’ll leave this for now.
Multiplicative Weights algorithms
Multiplicative Weights
High-level intuition:
• More general case: choose from n options.
• Stochastic strategies are allowed.
• Experts are not always present.
Multiplicative Weights
Setup: at each timestep t = 1 . . . T:
1. you want to take a decision d ∈ D = {1, . . . , n}
2. you choose a distribution pt over D and sample randomly from it
3. Nature reveals the cost vector ct, where each cost ci,t is bounded in [−1, 1]
4. the expected cost is then E_{i∼pt}[ci,t] = ct · pt.

Goal: minimize regret with respect to the best decision in hindsight, mini Σ_{t=1}^T ci,t.
In halving, decision = "choose expert", cost = "mistake", and pt was 1 for the majority of alive experts (note that p is not "prediction" here).
The Multiplicative Weights algorithm
Comes in many forms!
The multiplicative weights algorithm
1: Fix η ≤ 1/2.
2: Initialize wi,1 = 1
3: Initialize W1 = n
4: for t = 1 . . . T do
5:   Wt = Σi wi,t
6:   Choose a decision i ∼ pt = wi,t / Wt
7:   Nature reveals ct
8:   wi,t+1 ← wi,t (1 − η ci,t)
9: end for
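One round of this algorithm can be sketched in Python; the random costs and the choice η = 0.25 below are purely illustrative:

```python
import random

# One Multiplicative Weights step: sample a decision from the normalized
# weights, then decay each weight by (1 - eta * cost). Costs must lie in
# [-1, 1] and eta <= 1/2, so every factor (1 - eta*c) stays positive.

def mw_step(w, costs, eta, rng):
    W = sum(w)
    p = [wi / W for wi in w]                          # distribution p_t
    i = rng.choices(range(len(w)), weights=p)[0]      # decision i ~ p_t
    w_new = [wi * (1 - eta * c) for wi, c in zip(w, costs)]
    return w_new, p, i

rng = random.Random(0)
n, eta = 4, 0.25
w = [1.0] * n                                         # w_{i,1} = 1
for _ in range(100):
    costs = [rng.uniform(-1, 1) for _ in range(n)]    # illustrative costs
    w, p, i = mw_step(w, costs, eta, rng)
```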
Regret of Multiplicative Weights
Regret of multiplicative weights
Assume cost ci,t is bounded in [−1, 1] and η ≤ 1/2. Then after T rounds, for any decision i:

Σ_{t=1}^T ct · pt ≤ Σ_{t=1}^T ci,t + η Σ_{t=1}^T |ci,t| + (log n)/η
Regret of Multiplicative Weights
Regret of multiplicative weights
Assume cost ci,t is bounded in [−1, 1] and η ≤ 1/2. Then after T rounds, for any decision i:

Σ_{t=1}^T ct · pt ≤ Σ_{t=1}^T ci,t + η Σ_{t=1}^T |ci,t| + (log n)/η
• The cost of any fixed decision i
• Relates to how quickly the MWA updates itself after observing the costs at time t
• How conservative the MWA should be: if η is too large, we "overfit" too quickly
Regret of Multiplicative Weights
Regret of multiplicative weights
Assume cost ci,t is bounded in [−1, 1] and η ≤ 1/2. Then after T rounds, for any decision i:

Σ_{t=1}^T ct · pt ≤ Σ_{t=1}^T ci,t + η Σ_{t=1}^T |ci,t| + (log n)/η
• η ≤ 1/2 is needed to make some inequalities work.
• Generality: no assumptions about the sequence of events (they may be correlated, costs may be adversarial)!
• Holds for all i: taking a min over decisions, we see:

RT ≤ (log n)/η + η mini Σ_{t=1}^T |ci,t|
Regret of Multiplicative Weights
RT ≤ (log n)/η + η mini Σ_{t=1}^T |ci,t|
What is the regret behavior?
Since the sum is O(T), the regret is sub-linear with the appropriate choice η ∼ √(log n / T). Then

RT ≤ O(√(T log n))

What is the issue with this?

We need to know the horizon T up front!

If T is unknown, use ηt ∼ min(√(log n / t), 1/2) or the doubling trick.
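The tuned bound can be checked numerically. A small simulation (random costs and the values n = 8, T = 2000 are hypothetical choices) compares MW's expected cost against the best decision in hindsight:

```python
import math
import random

# Empirical check of the tuned bound: with eta = sqrt(log n / T) and
# |c_{i,t}| <= 1, the regret theorem guarantees
#   regret <= log(n)/eta + eta*T = 2*sqrt(T log n).

random.seed(1)
n, T = 8, 2000
eta = math.sqrt(math.log(n) / T)          # ~0.032, safely <= 1/2
w = [1.0] * n
exp_cost, cum = 0.0, [0.0] * n
for _ in range(T):
    W = sum(w)
    p = [wi / W for wi in w]
    costs = [random.uniform(-1, 1) for _ in range(n)]
    exp_cost += sum(pi * c for pi, c in zip(p, costs))   # c_t . p_t
    cum = [a + c for a, c in zip(cum, costs)]            # per-decision totals
    w = [wi * (1 - eta * c) for wi, c in zip(w, costs)]

regret = exp_cost - min(cum)              # vs. best decision in hindsight
bound = math.log(n) / eta + eta * T       # = 2 * sqrt(T log n)
```

By the regret theorem the final assertion `regret <= bound` holds for any cost sequence, not just this random one.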
Regret of Multiplicative Weights
RT ≤ (log n)/η + η mini Σ_{t=1}^T |ci,t|
What if we don't choose η ∼ √(1/T)?

• What if η is constant w.r.t. T?
• What if η < 1/√T?
• Fundamental tension between fitting to experts and learning from Nature
• Optimality of MW: in the online setting you can't do better: the regret is lower bounded as RT ≥ Ω(√(T log n)). See Theorem 4.1 in Arora.
Regret of Multiplicative Weights
RT ≤ (log n)/η + η mini Σ_{t=1}^T |ci,t|
Why can’t you achieve O(log T) regret as with FTL in this case?
• Recall for Follow The Leader with strongly convex loss, the difference between two consecutive decisions and losses scaled as p∗t − p∗t−1 = O(1/t)
• In MW, the loss-differential scales with pt − pt−1 = O(η), which is O(1/√t), so the MW algorithm needs to be prepared to make much bigger changes than FTL over time.
Proof of MW
The steps in the proof are:
1. Upper bound Wt in terms of cumulative decay factors
2. Lower bound Wt by using convexity arguments
3. Combine the upper and lower bounds on Wt to get the answer
Let’s go through the proof carefully.
Proof of MW: Getting the upper bound on Wt
Wt+1 = Σi wi,t+1
     = Σi wi,t (1 − η ci,t)
     = Wt − η Wt Σi ci,t pi,t        (using wi,t = Wt pi,t)
     = Wt (1 − η ct · pt)
     ≤ Wt exp(−η ct · pt)            (convexity: ∀x ∈ R, 1 − x ≤ e^{−x})

Recursively using this inequality:

WT+1 ≤ W1 exp(−η Σ_{t=1}^T ct · pt) = n · exp(−η Σ_{t=1}^T ct · pt)
Proof of MW: Getting the lower bound on Wt
WT+1 ≥ wi,T+1
     = Π_{t≤T} (1 − η ci,t)        (multiplicative updates and wi,1 = 1)
     = Π_{t: ci,t≥0} (1 − η ci,t) · Π_{t: ci,t<0} (1 − η ci,t)        (split positive and negative costs)

Now we use the facts that:

• ∀x ∈ [0, 1] : (1 − η)^x ≤ 1 − ηx
• ∀x ∈ [−1, 0] : (1 + η)^{−x} ≤ 1 − ηx

Since ci,t ∈ [−1, 1], we can apply this to each factor in the product above:

WT+1 ≥ (1 − η)^{Σ_{t: ci,t≥0} ci,t} (1 + η)^{−Σ_{t: ci,t<0} ci,t}

This is the lower bound we want.
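Both convexity facts can be sanity-checked numerically; a small grid check (the grid of η and x values is an arbitrary choice):

```python
# Numeric sanity check of the two convexity facts used in the lower
# bound, over a grid of eta values and x in [0, 1] resp. [-1, 0].

for eta in (0.1, 0.25, 0.5):
    for k in range(101):
        x = k / 100.0                 # x in [0, 1]
        assert (1 - eta) ** x <= 1 - eta * x + 1e-12
        xn = -k / 100.0               # x in [-1, 0]
        assert (1 + eta) ** (-xn) <= 1 - eta * xn + 1e-12

checked = True                        # all grid points passed
```

Both hold with equality at the interval endpoints, which is exactly the convexity (function below its chord) argument.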
Proof of MW: Combine the upper and lower bounds
We get:

(1 − η)^{Σ≥0 ci,t} (1 + η)^{−Σ<0 ci,t} ≤ WT+1 ≤ n · exp(−η Σ_{t=1}^T ct · pt)

Take logs:

Σ≥0 ci,t log(1 − η) − Σ<0 ci,t log(1 + η) ≤ log n − η Σ_{t=1}^T ct · pt
Proof of MW: Combine the upper and lower bounds
Σ≥0 ci,t log(1 − η) − Σ<0 ci,t log(1 + η) ≤ log n − η Σ_{t=1}^T ct · pt

Now we'll massage this into the form we want:

η Σ_{t=1}^T ct · pt ≤ log n − Σ≥0 ci,t log(1 − η) + Σ<0 ci,t log(1 + η)
                    = log n + Σ≥0 ci,t log(1/(1 − η)) + Σ<0 ci,t log(1 + η)
Proof of MW: Combine the upper and lower bounds
Since η ≤ 1/2, we can use log(1/(1 − η)) ≤ η + η² and log(1 + η) ≥ η − η²:

η Σ_{t=1}^T ct · pt
  ≤ log n + Σ≥0 ci,t log(1/(1 − η)) + Σ<0 ci,t log(1 + η)
  ≤ log n + Σ≥0 ci,t (η + η²) + Σ<0 ci,t (η − η²)        (use inequalities)
  = log n + η Σ_{t=1}^T ci,t + η² Σ≥0 ci,t − η² Σ<0 ci,t
  = log n + η Σ_{t=1}^T ci,t + η² Σ_{t=1}^T |ci,t|        (combine sums)
38
![Page 50: Online Learning with Experts & Multiplicative Weights ... · Online Learning with Experts & Multiplicative Weights Algorithms CS159lecture#2 StephanZheng April1,2016 Caltech](https://reader033.vdocuments.net/reader033/viewer/2022042915/5f521c4e53467943256132d4/html5/thumbnails/50.jpg)
Proof of MW: Combine the upper and lower bounds
Since η ≤ 12 , we can use log
(1
1−η
)≤ η + η2 and log (1+ η) ≥ η − η2:
η
T∑t=1
ct · pt
≤ logn+∑≥0
ci,t log(
11− η
)+∑<0
ci,t log (1+ η)
≤ logn+∑≥0
ci,t(η + η2
)+∑<0
ci,t(η − η2
)use inequalities
= logn+ η
T∑t=1
ci,t + η2∑≥0
ci,t − η2∑<0
ci,t
= logn+ ηT∑t=1
ci,t + η2T∑t=1|ci,t| combine sums
38
Proof of MW: Combine the upper and lower bounds

Dividing by $\eta$, we get what we want:

$$\sum_{t=1}^{T} c_t \cdot p_t \le \frac{\log n}{\eta} + \sum_{t=1}^{T} c_{i,t} + \eta \sum_{t=1}^{T} |c_{i,t}|$$
Some remarks

• Matrix form of MW
• Gains instead of losses

See Arora's paper for more details.
Learning with Experts revisited
Multiplicative Weights with continuous label spaces.
Exponential Weighted Average: continuous prediction case

Formal setup: there are $T$ rounds in which you predict a label $p_t \in C$. $C$ is a convex subset of some vector space.

At every timestep $t$:

1. Each expert $i = 1 \ldots n$ predicts $p_{i,t} \in C$
2. You make a prediction $p_t \in C$
3. Nature reveals $y_t \in Y$
4. We suffer a loss $l(p_t, y_t)$
5. Each expert suffers a loss $l(p_{i,t}, y_t)$

Loss: $L_T = \sum_{t=1}^{T} l(p_t, y_t)$, $\;L_{i,T} = \sum_{t=1}^{T} l(p_{i,t}, y_t)$

Regret: $R_T = L_T - \min_i L_{i,T}$
Exponential Weighted Average: continuous prediction case

We assume that the loss $l(p_t, y_t)$, as a function $l : C \times Y \to \mathbb{R}$:

• is bounded: $\forall p \in C, y \in Y : l(p, y) \in [0, 1]$
• $l(\cdot, y)$ is convex for any fixed $y \in Y$
Exponential Weighted Average

This brings us to a familiar algorithm.

Choose an $\eta > 0$ (we'll make this choice more precise later).

Exponential Weighted Average algorithm
1: Initialize $w_{i,0} = 1$, $W_0 = n$.
2: for $t = 1 \ldots T$ do
3:   Receive expert predictions $(p_{1,t}, \ldots, p_{n,t})$
4:   Predict $p_t = \frac{\sum_{i=1}^{n} w_{i,t-1}\, p_{i,t}}{W_{t-1}}$
5:   Receive Nature's answer $y_t$ (and incur loss $l(p_t, y_t)$)
6:   $w_{i,t} \leftarrow w_{i,t-1}\, e^{-\eta\, l(p_{i,t}, y_t)}$
7:   $W_t \leftarrow \sum_{i=1}^{n} w_{i,t}$
8: end for
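The loop above translates almost line-for-line into code. A minimal sketch, not from the slides: the `ewa` helper, the two toy experts, and the choice of squared loss on $[0,1]$ (which is bounded and convex in $p$) are all illustrative assumptions.

```python
import numpy as np

def ewa(expert_preds, outcomes, eta, loss):
    """Exponential Weighted Average.

    expert_preds: (T, n) array; row t holds the n expert predictions p_{i,t}.
    outcomes:     length-T array of Nature's answers y_t.
    loss:         vectorized loss l(p, y) with values in [0, 1].
    Returns the learner's predictions p_t, one per round.
    """
    T, n = expert_preds.shape
    w = np.ones(n)                                  # w_{i,0} = 1
    preds = np.empty(T)
    for t in range(T):
        preds[t] = w @ expert_preds[t] / w.sum()    # weighted average (step 4)
        w = w * np.exp(-eta * loss(expert_preds[t], outcomes[t]))  # step 6
    return preds

# Toy run: two experts, one of which is always right.
sq = lambda p, y: (p - y) ** 2       # squared loss, bounded on [0, 1]
T = 200
outcomes = np.full(T, 0.8)
experts = np.column_stack([np.full(T, 0.8), np.full(T, 0.1)])
eta = np.sqrt(8 * np.log(2) / T)     # the horizon-dependent choice below
preds = ewa(experts, outcomes, eta, sq)
```

The algorithm itself only needs the loss to be bounded and convex in its first argument; on this toy run the weight of the bad expert decays geometrically, so the learner's predictions converge to the perfect expert's.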
Exponential Weighted Average

Regret of EWA
Assume that the loss $l$ is a function $l : C \times Y \to [0, 1]$, where $C$ is convex and $l(\cdot, y)$ is convex for any fixed $y \in Y$. Then

$$R_T = L_T - \min_i L_{i,T} \le \frac{\log n}{\eta} + \frac{\eta T}{8}$$

Hence, if we choose $\eta = \sqrt{\frac{8 \log n}{T}}$, the regret $R_T \le \sqrt{\frac{T}{2} \log n} = O(\sqrt{T})$.

The proof follows from an application of Jensen's inequality and Hoeffding's lemma. See chapter 3 of Bartok for details.

Again, note that we need to know the horizon $T$ in advance.
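The stated choice of $\eta$ is exactly the minimizer of this bound; a short check (not on the original slide):

$$\frac{d}{d\eta}\left(\frac{\log n}{\eta} + \frac{\eta T}{8}\right) = -\frac{\log n}{\eta^2} + \frac{T}{8} = 0 \quad\Longrightarrow\quad \eta = \sqrt{\frac{8 \log n}{T}},$$

and substituting back gives

$$\frac{\log n}{\eta} + \frac{\eta T}{8} = 2\sqrt{\frac{T \log n}{8}} = \sqrt{\frac{T}{2} \log n}.$$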
So how about the regret in the discrete case?
Weighted Majority

Regret of Weighted Majority
Let $C = Y$, let $Y$ have at least 2 elements, and let $l(p, y) = \mathbb{1}(p \neq y)$. Let $L^*_{i,T} = \min_i L_{i,T}$. Then

$$R_T = L_T - \min_i L_{i,T} \le \frac{\left(\log_2\left(\frac{1}{\beta}\right) - \log_2\left(\frac{2}{1+\beta}\right)\right) L^*_{i,T} + \log_2 n}{\log_2\left(\frac{2}{1+\beta}\right)}$$

This bound is of the form $R_T = a L^*_{i,T} + b = O(T)$!

• The discrete case is harder than the continuous case if we stick to deterministic algorithms.
• The worst-case regret is achieved by a set of 2 experts and an outcome sequence $y_t$ such that $L_T = T$.

See chapter 4 of Bartok for details.
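The two-expert worst case in the last bullet is easy to exhibit in code. A hypothetical sketch (not from the slides): binary labels, one constant-0 and one constant-1 expert, and an adversary that always announces the opposite of the learner's deterministic weighted vote.

```python
def weighted_majority_adversarial(T, beta=0.5):
    """Deterministic Weighted Majority against an adversarial label sequence."""
    w = [1.0, 1.0]                # expert 0 always says 0, expert 1 always says 1
    learner_loss = 0
    expert_loss = [0, 0]
    for _ in range(T):
        pred = 1 if w[1] > w[0] else 0   # weighted vote; ties go to label 0
        y = 1 - pred                      # adversary picks the opposite label
        learner_loss += 1                 # so the learner is wrong every round
        for i, p in enumerate((0, 1)):
            if p != y:
                expert_loss[i] += 1
                w[i] *= beta              # multiplicative penalty for a mistake
    return learner_loss, min(expert_loss)
```

For even $T$ the adversary alternates which expert is wrong, so the call returns $(T,\, T/2)$: the learner loses every round while the better constant expert loses only half of them, giving regret $T/2 = \Omega(T)$, matching the bullet above.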
Today

What should you (at least) take away from this lecture?

How should you base your prediction on expert predictions?

• Can use a weighted majority algorithm, which is a type of MWA.
• General framework for online learning with experts (e.g. boosting).

What are the characteristics of multiplicative weights algorithms?

• Few assumptions on costs / Nature (can be adversarial), therefore broadly applicable.
• Fundamental tension between fitting to decisions and learning from Nature.
• Tends to have instantaneous loss-differential $O(\frac{1}{\sqrt{t}})$, worse than the version of FTL $O(\frac{1}{t})$ that we saw in lecture 1.
Questions?