Maya R. Gupta, Google - Stanford University: Multi-Task Averaging: Theory and Practice
TRANSCRIPT

[Slide 1]
Multi-Task Averaging: Theory and Practice
Maya R. Gupta, Google Research, Univ. Washington
Sergey Feldman, Univ. Washington
Bela Frigyik, Univ. Pecs
[Slide 2]
Aristotle

The idea of a mean is old:
"By the mean of a thing I denote a point equally distant from either extreme..." - Aristotle

$v = \frac{y_{\min} + y_{\max}}{2}, \qquad v - y_{\min} = y_{\max} - v$
[Slide 3]
Tycho Brahe (16th century)

Averaged to reduce measurement error:

$\bar{y} = \frac{1}{N} \sum_{i=1}^{N} y_i$
[Slide 4]
Legendre (1805)

Legendre noted the mean minimizes squared error:

$\bar{y} = \arg\min_{\mu} \sum_{i=1}^{N} (y_i - \mu)^2$
[Slide 5]
Legendre (1805)

Legendre noted the mean minimizes squared error:

$\bar{y} = \arg\min_{\mu} \sum_{i=1}^{N} (y_i - \mu)^2$

Banerjee et al. 2005: the mean minimizes any Bregman divergence:

$\bar{y} = \arg\min_{\mu} \sum_{i=1}^{N} \psi(y_i, \mu)$

Frigyik et al. 2008: the mean minimizes any functional Bregman divergence.
[Slide 6]
Gauss (1809)

The average was central to Gauss's construction of the normal distribution. His goals:
- a smooth distribution
- whose likelihood peak was at the sample mean.
[Slide 7]
Fisher 1922

"...no other statistic which can be calculated from the same sample provides any additional information as to the value of the parameter..." - R. A. Fisher
[Slide 8]
Stein's Paradox 1956

Total squared error can be reduced by estimating each of the means of T Gaussian random variables using data sampled from all of them, even if the random variables are independent and have different means.

(Figure: T tasks, $t = 1, 2, \ldots, T$, with means $\mu_1, \mu_2, \mu_3, \ldots$)
[Slide 9]
Stein Estimation: One Sample Case

Problem: estimate means $\{\mu_t\}$ of T Gaussian random variables.
Given: random sample $Y_t \sim N(\mu_t, \sigma^2)$ for $t = 1, \ldots, T$.
[Slide 10]
Stein Estimation: One Sample Case

Problem: estimate means $\{\mu_t\}$ of T Gaussian random variables.
Given: random sample $Y_t \sim N(\mu_t, \sigma^2)$ for $t = 1, \ldots, T$.

Maximum likelihood estimate: $\hat{\mu}_t = Y_t$
[Slide 11]
Stein Estimation: One Sample Case

Problem: estimate means $\{\mu_t\}$ of T Gaussian random variables.
Given: random sample $Y_t \sim N(\mu_t, \sigma^2)$ for $t = 1, \ldots, T$.

Maximum likelihood estimate: $\hat{\mu}_t = Y_t$

James-Stein estimate:

$\hat{\mu}_t^{JS} = \left(1 - \frac{(T-2)\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$
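The one-sample shrinkage above is a few lines of NumPy. This is a minimal sketch, not the speaker's code; the function name `james_stein` is my own:

```python
import numpy as np

def james_stein(y, sigma2):
    """One-sample James-Stein estimate: shrink each observation toward 0.

    y: length-T array with one observation Y_t per task, Y_t ~ N(mu_t, sigma2).
    Requires T > 2 for the shrinkage factor to be meaningful.
    """
    T = len(y)
    shrink = 1.0 - (T - 2) * sigma2 / np.sum(y ** 2)
    return shrink * y

# The MLE keeps each Y_t as-is; James-Stein scales them all toward zero.
y = np.array([1.0, 2.0, 3.0])
mu_js = james_stein(y, sigma2=1.0)   # shrink factor = 1 - 1/14 = 13/14
```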
[Slide 12]
James-Stein Estimator Derivation: Efron and Morris 1972 Empirical Bayes Argument

Key assumptions:
- $\mu_t \sim N(0, \tau^2)$, with $\tau^2$ unknown
- $Y_t \sim N(\mu_t, \sigma^2)$, with $\sigma^2$ known

James-Stein estimate:

$\hat{\mu}_t^{JS} = \left(1 - \frac{(T-2)\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$
[Slide 13]
James-Stein Estimator Derivation: Efron and Morris 1972 Empirical Bayes Argument

Key assumptions:
- $\mu_t \sim N(0, \tau^2)$, with $\tau^2$ unknown
- $Y_t \sim N(\mu_t, \sigma^2)$, with $\sigma^2$ known

$E[\mu_t \mid Y_t] = \left(1 - \frac{\sigma^2}{\tau^2 + \sigma^2}\right) Y_t$

James-Stein estimate:

$\hat{\mu}_t^{JS} = \left(1 - \frac{(T-2)\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$
[Slide 14]
James-Stein Estimator Derivation: Efron and Morris 1972 Empirical Bayes Argument

Key assumptions:
- $\mu_t \sim N(0, \tau^2)$, with $\tau^2$ unknown
- $Y_t \sim N(\mu_t, \sigma^2)$, with $\sigma^2$ known

$E[\mu_t \mid Y_t] = \left(1 - \frac{\sigma^2}{\tau^2 + \sigma^2}\right) Y_t$

$E\left[\frac{(T-2)\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right] = \frac{\sigma^2}{\tau^2 + \sigma^2}$

James-Stein estimate:

$\hat{\mu}_t^{JS} = \left(1 - \frac{(T-2)\sigma^2}{\sum_{r=1}^{T} Y_r^2}\right) Y_t$
[Slide 15]
A More General JSE (Bock, 1972)

Assumptions and notation:
- $\mu_t \sim N(\xi, \tau^2)$, with $\tau^2$ and $\xi$ unknown
- $Y_{ti} \sim N(\mu_t, \sigma_t^2)$, $i = 1, \ldots, N_t$, with $\sigma_t^2$ unknown
- $\xi = \frac{1}{T} \sum_{r=1}^{T} \bar{Y}_r$
- positive part: $(x)_+ = \max(x, 0)$
- diagonal $\Sigma$ with $\Sigma_{tt} = \frac{\sigma_t^2}{N_t}$
- $\bar{Y}$ is a length-T vector with t-th entry $\bar{Y}_t$

General James-Stein estimate:

$\hat{\mu}_t^{JS} = \xi + \left(1 - \frac{T-3}{(\bar{Y} - \xi)^\top \Sigma^{-1} (\bar{Y} - \xi)}\right)_+ \left(\bar{Y}_t - \xi\right)$
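Bock's positive-part estimator translates directly to NumPy. A sketch under the slide's notation (the function name and argument layout are mine):

```python
import numpy as np

def general_james_stein(ybar, sigma2, n):
    """Bock-style positive-part James-Stein estimate.

    ybar: length-T sample averages; sigma2: per-task variances sigma_t^2;
    n: per-task sample counts N_t. Shrinks toward the grand mean xi.
    """
    T = len(ybar)
    xi = ybar.mean()                           # xi = (1/T) sum_r Ybar_r
    d = ybar - xi
    quad = np.sum(d ** 2 / (sigma2 / n))       # (Ybar - xi)^T Sigma^{-1} (Ybar - xi)
    shrink = max(1.0 - (T - 3) / quad, 0.0)    # positive part (x)_+
    return xi + shrink * d
```

When the sample averages are far apart relative to their variances (large quadratic term), the shrinkage factor approaches 1 and the estimates stay near the per-task averages.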
[Slide 16]
James-Stein Dominates

James's and Stein's theorem (1961): for T > 3, the general JSE dominates the sample average (MLE):

$E[\|\mu - \hat{\mu}^{JS}\|_2^2] \le E[\|\mu - \bar{Y}\|_2^2]$

for every choice of $\mu$. This is also written as $R(\hat{\mu}^{JS}) \le R(\bar{Y})$.
[Slide 17]
James-Stein estimation <-> empirical Bayes

Multi-task averaging (MTA) <-> empirical loss minimization (empirical Vapnik) with regularization (Tikhonov regularization)
[Slide 18]
Multi-Task Averaging (Feldman et al. 2012)

Problem: estimate means $\{\mu_t\}$ of T random variables.
Given: $N_t$ IID samples $\{y_{ti}\}_{i=1}^{N_t}$ from each random variable.
Data model: $Y_{ti}$ drawn IID from $\nu_t$ with finite mean $\mu_t$.
[Slide 19]
Building the MTA Objective

"Single-task" averaging:

$\bar{y}_t = \arg\min_{\mu_t} \sum_{i=1}^{N_t} (y_{ti} - \mu_t)^2$
[Slide 20]
Building the MTA Objective

"Single-task" averaging, added across tasks:

$\{\bar{y}_t\}_{t=1}^{T} = \arg\min_{\{\mu_t\}_{t=1}^{T}} \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N_t} (y_{ti} - \mu_t)^2$
[Slide 21]
Building the MTA Objective

"Single-task" averaging, with each task's squared error scaled to a Mahalanobis distance:

$\{\bar{y}_t\}_{t=1}^{T} = \arg\min_{\{\mu_t\}_{t=1}^{T}} \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N_t} \frac{(y_{ti} - \mu_t)^2}{\sigma_t^2}$
[Slide 22]
The MTA Objective

"Multi-task" averaging:

$\{\mu_t^{MTA}\}_{t=1}^{T} = \arg\min_{\{\mu_t\}_{t=1}^{T}} \frac{1}{T} \sum_{t=1}^{T} \sum_{i=1}^{N_t} \frac{(y_{ti} - \mu_t)^2}{\sigma_t^2} + \frac{\gamma}{T^2} \sum_{r=1}^{T} \sum_{s=1}^{T} A_{rs} (\mu_r - \mu_s)^2$

The first term is the Mahalanobis distance to the samples from task t; $A_{rs}$ is the similarity between tasks r and s.

[Slide 23]
The same objective, with an example pair of tasks. Task 1: estimate the average movie ticket price. Task 2: estimate the mean age of kids at summer camp.

[Slide 24]
The same objective. Task 1: estimate the average movie ticket price. Task 2: estimate the price of tea in China?

[Slides 25-30]
The same objective, illustrated for two tasks: the sample averages $\bar{y}_1$ and $\bar{y}_2$ are pulled toward each other, giving estimates $\mu_1^{MTA}$ and $\mu_2^{MTA}$ that lie between them.

[Slide 31]
The same objective, annotated: the empirical loss lowers bias; the regularizer lowers estimation variance.
[Slide 32]
MTA Closed Form Solution

For non-negative A, the multi-task averaging objective has the closed-form solution

$\mu^{MTA} = \left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{y}$

where $\mu^{MTA}$ is the vector of T MTA solutions, $\bar{y}$ is the vector of T sample averages with t-th entry $\bar{y}_t$, $L$ is the graph Laplacian of $A + A^\top$, and $\Sigma$ is the diagonal matrix of sample-mean variances $\Sigma_{tt} = \frac{\sigma_t^2}{N_t}$.
[Slide 33]
MTA Closed Form Solution

$\mu^{MTA} = \left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{y}$

Lemma: this inverse always exists if $A_{rs} \ge 0$, $\gamma \ge 0$, and $N_t \ge 1$.
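The closed form is a single linear solve. A minimal NumPy sketch of the formula (the helper name `mta` is mine), using the slide's definitions of $\Sigma$ and $L$:

```python
import numpy as np

def mta(ybar, sigma2, n, A, gamma=1.0):
    """Multi-task averaging: mu = (I + (gamma/T) Sigma L)^{-1} ybar.

    ybar: T sample averages; sigma2, n: per-task variances and sample counts;
    A: nonnegative T x T task-similarity matrix.
    """
    T = len(ybar)
    B = A + A.T
    L = np.diag(B.sum(axis=1)) - B          # graph Laplacian of A + A^T
    Sigma = np.diag(sigma2 / n)             # diagonal sample-mean variances
    return np.linalg.solve(np.eye(T) + (gamma / T) * Sigma @ L, ybar)
```

With A = 0 the regularizer vanishes and the estimates are exactly the sample averages; as gamma grows, the estimates are pulled together.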
[Slide 34]
MTA Closed Form Solution

$\mu^{MTA} = \left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{y} = W \bar{y}$

The MTA estimates are a linear combination of the T sample averages, through the $T \times T$ matrix $W$.
[Slide 35]
MTA Closed Form Solution

$\mu^{MTA} = \left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{y} = W \bar{y}$

Theorem: $W$ is a right-stochastic matrix, so each MTA estimate is a convex combination of the T sample averages.
[Slide 36]
When is MTA Better than the Sample Means?

Two tasks: $Y_{1i} = \mu_1 + \epsilon_1$ with $N_1$ samples and variance $\sigma_1^2$; $Y_{2i} = \mu_2 + \epsilon_2$ with $N_2$ samples and variance $\sigma_2^2$.

Task 1: estimate the average movie ticket price. Task 2: estimate the mean age of kids at summer camp.
[Slide 37]
When is MTA Better than the Sample Means?

Two tasks: $Y_{1i} = \mu_1 + \epsilon_1$ with $N_1$ samples and variance $\sigma_1^2$; $Y_{2i} = \mu_2 + \epsilon_2$ with $N_2$ samples and variance $\sigma_2^2$.

MTA estimate $\left(I + \frac{\gamma}{T} \Sigma L\right)^{-1} \bar{Y}$:

$\mu_1^{MTA} = \left(\frac{T + \frac{\sigma_2^2}{N_2} A_{12}}{T + \frac{\sigma_1^2}{N_1} A_{12} + \frac{\sigma_2^2}{N_2} A_{12}}\right) \bar{Y}_1 + \left(\frac{\frac{\sigma_1^2}{N_1} A_{12}}{T + \frac{\sigma_1^2}{N_1} A_{12} + \frac{\sigma_2^2}{N_2} A_{12}}\right) \bar{Y}_2$
[Slide 38]
When is MTA Better than the Sample Means?

The MTA estimate is biased, but has smaller error variance than the sample averages:

$\text{Risk}[\mu_1^{MTA}] < \text{Risk}[\bar{Y}_1] \quad \text{if} \quad (\mu_1 - \mu_2)^2 < \frac{4}{A_{12}} + \frac{\sigma_1^2}{N_1} + \frac{\sigma_2^2}{N_2}$
[Slide 39]
Equivalently:

$\text{Risk}[\mu^{MTA}] < \text{Risk}[\bar{Y}] \quad \text{if} \quad (\mu_1 - \mu_2)^2 - \frac{\sigma_1^2}{N_1} - \frac{\sigma_2^2}{N_2} < \frac{4}{A_{12}}$
[Slide 40]
Optimal A for T = 2

Example: two tasks with sample distributions $Y_1$, $Y_2$ and means $\mu_1$, $\mu_2$.

Answer: the optimal task similarity in terms of MSE:

$A_{12}^* = \frac{2}{(\mu_1 - \mu_2)^2}$
[Slide 41]
Estimated Optimal A for T = 2

The optimal task similarity in terms of MSE, and its estimated (plug-in) version:

$A_{12}^* = \frac{2}{(\mu_1 - \mu_2)^2}, \qquad \hat{A}_{12}^* = \frac{2}{(\bar{y}_1 - \bar{y}_2)^2}$
[Slide 42]
(Plot: risk as a function of $A_{12}$ for $\sigma_1^2 = \sigma_2^2 = 1$, $\mu_1 = 0$, $\mu_2 = 1$; the optimal similarity $A_{12}^* = \frac{2}{(\mu_1 - \mu_2)^2}$ is marked.)
[Slide 43]
Optimal A for T > 2

- Bad news: no simple analytical minimization of risk for T > 2.
[Slide 44]
Optimal A for T > 2

- Bad news: no simple analytical minimization of risk for T > 2.
- One solution: use the pairwise estimate to populate A:

$\hat{A}_{rs}^* = \frac{2}{(\bar{y}_r - \bar{y}_s)^2}$
[Slide 45]
Optimal A for T > 2

- Bad news: no simple analytical minimization of risk for T > 2.
- One solution: use the pairwise estimate to populate A.
- Better solution: constrain $A = a \mathbf{1}\mathbf{1}^\top$ and optimize over the single similarity a.
[Slide 46]
Optimal A for T > 2

- Bad news: no simple analytical minimization of risk for T > 2.
- One solution: use the pairwise estimate to populate A.
- Better solution: constrain $A = a \mathbf{1}\mathbf{1}^\top$ and optimize over a.
- Analyzable: the optimal similarity a is

$a^* = \frac{2}{\frac{1}{T(T-1)} \sum_{r=1}^{T} \sum_{s=1}^{T} (\mu_r - \mu_s)^2}$

(the denominator is the average squared distance between the T means).
Optimal A for T > 2
47
aa
a
a
a
² Bad news: no simple analytical minimization of risk for T > 2.
² One solution: use pairwise estimate to populate A
² Better solution: constrain A = a11T and optimize over a.
² Analyzable: optimal similarity a is
a¤ = 21
T (T¡1)Tr=1
Ts=1(¹r¡¹s)2
a¤ = 21
T (T¡1)Tr=1
Ts=1(¹yr¡¹ys)2
² In practice, we estimate
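The constant plug-in similarity is a one-liner in NumPy; a sketch with a helper name of my own choosing:

```python
import numpy as np

def estimated_constant_similarity(ybar):
    """a*-hat = 2 / (average squared distance between the T sample averages)."""
    T = len(ybar)
    diffs = ybar[:, None] - ybar[None, :]        # all pairwise differences
    msd = np.sum(diffs ** 2) / (T * (T - 1))     # (1/(T(T-1))) sum_r sum_s (...)^2
    return 2.0 / msd
```

For T = 2 this reduces to the pairwise estimate $2 / (\bar{y}_1 - \bar{y}_2)^2$.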
[Slide 48]
How to Set the Similarity Matrix A?

- Choose A to minimize expected total squared error.
- Choose A to minimize worst-case total squared error.
Minimax A for T = 2To ¯nd minimax A, we need:
1. A constraint set for f¹tg: ¹t 2 [bl; bu].
(bl; bu)
(bl; bl)
(bu; bu)
(bu; bl)
![Page 50: Maya R. Gupta, Google - Stanford University · Multi‐Task Averaging: Theory and Practice Maya R. Gupta, Google Research, Univ. Washington 1 Sergey Feldman Univ. Washington Bela](https://reader031.vdocuments.net/reader031/viewer/2022022721/5c66e58b09d3f20f218cef52/html5/thumbnails/50.jpg)
Minimax A for T = 2To ¯nd minimax A, we need:
1. A constraint set for f¹tg: ¹t 2 [bl; bu].
(bl; bl)
(bu; bu)(bl; bu)
(bu; bl)
![Page 51: Maya R. Gupta, Google - Stanford University · Multi‐Task Averaging: Theory and Practice Maya R. Gupta, Google Research, Univ. Washington 1 Sergey Feldman Univ. Washington Bela](https://reader031.vdocuments.net/reader031/viewer/2022022721/5c66e58b09d3f20f218cef52/html5/thumbnails/51.jpg)
Minimax A for T = 2To ¯nd minimax A, we need:
1. A constraint set for f¹tg: ¹t 2 [bl; bu].
2. A least favorable prior (LFP):
p(¹1; ¹2) =
8><>:12; if (¹1; ¹2) = (bl; bu)
12; if (¹1; ¹2) = (bu; bl)
0; otherwise.
(bl; bu)
(bl; bl)
(bu; bu)
(bu; bl)
![Page 52: Maya R. Gupta, Google - Stanford University · Multi‐Task Averaging: Theory and Practice Maya R. Gupta, Google Research, Univ. Washington 1 Sergey Feldman Univ. Washington Bela](https://reader031.vdocuments.net/reader031/viewer/2022022721/5c66e58b09d3f20f218cef52/html5/thumbnails/52.jpg)
Minimax A for T = 2To ¯nd minimax A, we need:
1. A constraint set for f¹tg: ¹t 2 [bl; bu].
2. A least favorable prior (LFP).
3. A Bayes-optimal estimator withconstant risk w.r.t. the LFP:
MTA with sim Amm12 = 2
(bu¡bl)2.
(bl; bu)
(bl; bl)
(bu; bu)
(bu; bl)
[Slide 53]
Minimax A for T = 2

To find the minimax A, we need:
1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$.
2. A least favorable prior (LFP).
3. A Bayes-optimal estimator with constant risk w.r.t. the LFP: MTA with similarity $A_{12}^{mm} = \frac{2}{(b_u - b_l)^2}$.

In practice, the range is estimated from the sample averages: $b_l = \min_t \bar{y}_t$, $b_u = \max_t \bar{y}_t$.
[Slide 54]
Minimax A for T > 2

To find the minimax A, we need:
1. A constraint set for $\{\mu_t\}$: $\mu_t \in [b_l, b_u]$.
2. A least favorable prior (LFP).
3. A Bayes-optimal estimator with constant risk w.r.t. the LFP: MTA with constant similarity $a^{mm} = \frac{2}{(b_u - b_l)^2}$.

In practice, $b_l = \min_t \bar{y}_t$ and $b_u = \max_t \bar{y}_t$.
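The minimax similarity needs only the estimated range of the means; a one-line sketch (helper name mine):

```python
import numpy as np

def minimax_similarity(ybar):
    """a_mm = 2 / (b_u - b_l)^2, with the range taken from the sample averages."""
    b_l, b_u = np.min(ybar), np.max(ybar)
    return 2.0 / (b_u - b_l) ** 2
```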
[Slide 55]
Estimator Summary
[Slide 56]
Simulations

Gaussian simulations:
- $\mu_t \sim N(0, \sigma_\mu^2)$
- $\sigma_t^2 \sim \text{Gamma}(0.9, 1.0) + 0.1$
- $N_t \sim U\{2, \ldots, 100\}$
- $y_{ti} \sim N(\mu_t, \sigma_t^2)$

Uniform simulations:
- $\mu_t \sim U(-\sqrt{3\sigma_\mu^2}, \sqrt{3\sigma_\mu^2})$
- $\sigma_t^2 \sim U(0.1, 2.0)$
- $N_t \sim U\{2, \ldots, 100\}$
- $y_{ti} \sim U[\mu_t - \sqrt{3\sigma_t^2}, \mu_t + \sqrt{3\sigma_t^2}]$
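The Gaussian simulation column can be sampled as follows; this is a sketch of the stated generative model (helper name and seed handling are mine):

```python
import numpy as np

def gaussian_simulation(T, sigma_mu2=1.0, seed=0):
    """One Gaussian-simulation dataset: per-task means, variances, and samples."""
    rng = np.random.default_rng(seed)
    mu = rng.normal(0.0, np.sqrt(sigma_mu2), size=T)   # mu_t ~ N(0, sigma_mu^2)
    sigma2 = rng.gamma(0.9, 1.0, size=T) + 0.1         # sigma_t^2 ~ Gamma(0.9, 1) + 0.1
    N = rng.integers(2, 101, size=T)                   # N_t ~ U{2, ..., 100}
    y = [rng.normal(mu[t], np.sqrt(sigma2[t]), size=N[t]) for t in range(T)]
    return mu, sigma2, N, y
```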
[Slide 57]
5-fold randomized Cross-Validation
[Slide 58]
Gaussian Simulation, T = 5 (results plot; lower is better)

[Slide 59]
Gaussian Simulation, T = 25 (results plot)

[Slide 60]
Gaussian Simulation, T = 500 (results plot)

[Slide 61]
Uniform Simulation, T = 5 (results plot)

[Slide 62]
Uniform Simulation, T = 25 (results plot)

[Slide 63]
Uniform Simulation, T = 500 (results plot)
[Slide 64]
Scales O(T)

Constant and minimax MTA weight matrices can be rewritten, letting $Z = I + aT\Sigma$ (diagonal) and $z = a\Sigma\mathbf{1}$:

$W = (I + \Sigma L(a\mathbf{1}\mathbf{1}^\top))^{-1}$
$\quad = (I + \Sigma(aT I - a\mathbf{1}\mathbf{1}^\top))^{-1}$
$\quad = (I + aT\Sigma - a\Sigma\mathbf{1}\mathbf{1}^\top)^{-1}$
$\quad = (Z - z\mathbf{1}^\top)^{-1}$
$\quad = Z^{-1} + \frac{Z^{-1} z \mathbf{1}^\top Z^{-1}}{1 - \mathbf{1}^\top Z^{-1} z}$ (Sherman-Morrison formula)

$Z$ is diagonal, so $Z^{-1}$, $Z^{-1}z$, and $\mathbf{1}^\top Z^{-1}$ can all be computed in $O(T)$; thus $W\bar{Y}$ is $O(T)$.
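The Sherman-Morrison route can be checked numerically. This sketch (helper name mine) applies W to $\bar{Y}$ in O(T) without ever forming W, where $Z = I + aT\Sigma$ is diagonal and $z = a\Sigma\mathbf{1}$:

```python
import numpy as np

def constant_mta_fast(ybar, sigma2, n, a):
    """Apply W = (Z - z 1^T)^{-1} to ybar in O(T).

    Z = I + a*T*Sigma is diagonal and z = a * diag(Sigma), so the
    Sherman-Morrison identity avoids any T x T inverse.
    """
    T = len(ybar)
    s = sigma2 / n                         # diagonal of Sigma
    Zinv = 1.0 / (1.0 + a * T * s)         # Z^{-1}, elementwise
    z = a * s
    base = Zinv * ybar                     # Z^{-1} ybar
    return base + Zinv * z * base.sum() / (1.0 - (Zinv * z).sum())
```

Every operation is elementwise or a sum, so the cost is O(T) rather than the O(T^3) of a dense solve.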
[Slide 65]
Application: Class Grades

Problem: estimate final grades $\{\mu_t\}$ of T students.
Given: N homework grades $\{y_{ti}\}_{i=1}^{N}$ from each student.
[Slide 66]
Application: Class Grades

Problem: estimate final grades $\{\mu_t\}$ of T students.
Given: N homework grades $\{y_{ti}\}_{i=1}^{N}$ from each student.

- 16 classrooms -> 16 datasets.
- Uncurved grades normalized to be between 0 and 100.
- Pooled variance used for all tasks in a dataset.
- Final class grades include homeworks, projects, labs, quizzes, midterms, and the final exam.
![Page 67: Maya R. Gupta, Google - Stanford University · Multi‐Task Averaging: Theory and Practice Maya R. Gupta, Google Research, Univ. Washington 1 Sergey Feldman Univ. Washington Bela](https://reader031.vdocuments.net/reader031/viewer/2022022721/5c66e58b09d3f20f218cef52/html5/thumbnails/67.jpg)
67
Percent change in risk vs. single-task.
Application: Product Sales
71
Exp 1: How much will the t-th customer spend on their next order?
Given: dollar amounts {y_ti}, i = 1, …, N_t, that the t-th customer spent on N_t orders.
• T = 477
• y_ti ranged from $15 to $480.
• N_t ranged from 2 to 17.
Application: Product Sales
72
Exp 2: If you bought the t-th puzzle, how much will you spend on your next order?
Given: dollar amounts {y_ti}, i = 1, …, N_t, that each of N_t customers spent after buying the t-th puzzle.
• T = 77
• y_ti ranged from $0 to $480.
• N_t ranged from 8 to 348.
Application: Product Sales
73
No ground truth → use sample means from all the data as μ_t, and use a random half of the data to get ȳ_t.

(Figure: order histories for Customer 1, Customer 2, …, Customer T.)
Application: Product Sales
76
Percent change in risk vs. single-task, averaged over 1000 random splits.
(Lower is better.)
Model Mismatch: 2008 Election
78
Problem: What percent of the t-th state's vote will go to Obama and McCain on election day?
Given: N_t pre-election polls {y_ti}, i = 1, …, N_t, from each state.
Model Mismatch: 2008 Election
79
Problem: What percent of the t-th state's vote will go to Obama and McCain on election day?
Given: N_t pre-election polls {y_ti}, i = 1, …, N_t, from each state.

Percent change in average risk vs. single-task. (Lower is better.)
MTA Applied to Kernel Density Estimation
80
KDE: Given that events {x_i} happened, estimate the probability of event z as

p(z) = (1/N) Σ_{i=1}^N K(x_i, z)

(Figure: samples x_1, …, x_5 around a query point z.)
MTA Applied to Kernel Density Estimation
82
KDE: Given that events {x_i} happened, estimate the probability of event z as

p(z) = (1/N) Σ_{i=1}^N K(x_i, z)

Equivalently,

argmin_{y(z)} Σ_{i=1}^N (K(x_i, z) − y(z))^2
MTA Applied to Kernel Density Estimation
83
KDE: Given that events {x_i} happened, estimate the probability of event z as

p(z) = (1/N) Σ_{i=1}^N K(x_i, z)

Equivalently,

argmin_{y(z)} Σ_{i=1}^N (K(x_i, z) − y(z))^2

Use MTA to form a multi-task KDE:

argmin_{{y_t(z_t)}_{t=1}^T}  Σ_{t=1}^T Σ_{i=1}^{N_t} (K_t(x_ti, z_t) − y_t(z_t))^2  +  γ Σ_{r=1}^T Σ_{s=1}^T A_rs (y_r(z_r) − y_s(z_s))^2
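This objective has a closed-form solution: setting its gradient to zero (for symmetric A) gives (diag(N) + 2γL) y = diag(N) ȳ, where ȳ_t is the t-th single-task KDE value. A minimal NumPy sketch (our own illustrative code, not the paper's):

```python
import numpy as np

def gauss_kernel(x, z, h=1.0):
    # Gaussian kernel evaluations between samples x and query point z
    return np.exp(-0.5 * ((x - z) / h) ** 2) / (h * np.sqrt(2 * np.pi))

def mt_kde(samples_per_task, z, A, gamma):
    """Multi-task KDE at query z: solve (diag(N) + 2*gamma*L) y = diag(N) ybar,
    the first-order condition of the MT-KDE objective (symmetric A assumed)."""
    N = np.array([len(s) for s in samples_per_task], dtype=float)
    ybar = np.array([gauss_kernel(s, z).mean() for s in samples_per_task])
    Lap = np.diag(A.sum(axis=1)) - A          # graph Laplacian of A
    return np.linalg.solve(np.diag(N) + 2.0 * gamma * Lap, N * ybar)

rng = np.random.default_rng(1)
tasks = [rng.normal(0.0, 1.0, 50), rng.normal(0.2, 1.0, 8)]  # data-rich and data-poor tasks
A = np.array([[0.0, 1.0], [1.0, 0.0]])
print(mt_kde(tasks, z=0.0, A=A, gamma=0.0))   # gamma = 0: the two single-task KDE values
print(mt_kde(tasks, z=0.0, A=A, gamma=5.0))   # gamma > 0: estimates pulled toward each other
```

With γ = 0 each task keeps its own kernel average; as γ grows, the data-poor task borrows strength from the similar, data-rich one.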
MT‐KDE for Terrorism Risk Assessment

84

Problem: Estimate the probability of terrorist events at 40,000 locations in Jerusalem; each location z, x_i ∈ R^74.

T = 7 terrorist groups.

Task similarity matrix A from terrorism expert Mohammed Hafez.
MT‐KDE for Terrorism Risk Assessment
85
Mean Reciprocal Rank of a Left-Out Event

               Suicides (T = 17)   Bombings (T = 11)
Single task         .145               .1096
James-Stein         .145               .1096
MTA constant        .1897              .1096
MTA minimax         .1897              .1096
Expert sim          .1292              .0089
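The metric can be sketched as follows (our reading of "mean reciprocal rank of a left-out event"; the tie-handling detail is an assumption): rank all candidate locations by estimated density, and average the reciprocal rank of each held-out event's true location.

```python
import numpy as np

def mean_reciprocal_rank(scores, true_idx):
    """scores: (E, M) estimated densities over M candidate locations for
    each of E held-out events; true_idx: each event's actual location index."""
    rr = []
    for s, j in zip(scores, true_idx):
        rank = 1 + np.sum(s > s[j])   # rank 1 = highest-scoring location
        rr.append(1.0 / rank)
    return float(np.mean(rr))

# Toy example: 2 held-out events, 3 candidate locations
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2]])
print(mean_reciprocal_rank(scores, [1, 2]))   # (1/1 + 1/3) / 2
```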
MTA is an intuitive, simple, accurate approach to estimating multiple means jointly.
When can you estimate multiple means at once?
Can you estimate the task similarities better?
Learn more: see our 2012 NIPS paper or email me for the journal paper ([email protected])
86
Last Slide
MTA: μ = W ȳ, with right-stochastic W = (I + (γ/T) Σ L)^{-1}, diagonal Σ with Σ_tt ≥ 0, A_rs ≥ 0, γ ≥ 0.

Convex-combination estimators (J-S, more): μ_t = λ ȳ_t + (1 − λ) Σ_{r=1}^T α_r ȳ_r, with 0 < λ ≤ 1, Σ_{r=1}^T α_r = 1, and α_r ≥ 0 for all r.
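The right-stochastic claim is easy to check numerically: L 1 = 0 implies (I + (γ/T) Σ L) 1 = 1, so W 1 = 1. A small sketch (our code, random symmetric similarities):

```python
import numpy as np

# L @ 1 = 0 implies (I + (gamma/T) Sigma L) @ 1 = 1, so W @ 1 = 1.
rng = np.random.default_rng(0)
T, gamma = 6, 2.0
A = rng.uniform(0.0, 1.0, (T, T))
A = (A + A.T) / 2.0
np.fill_diagonal(A, 0.0)                      # symmetric similarities, A_rs >= 0
Lap = np.diag(A.sum(axis=1)) - A              # graph Laplacian
Sigma = np.diag(rng.uniform(0.5, 2.0, T))     # diagonal, nonnegative
W = np.linalg.inv(np.eye(T) + (gamma / T) * Sigma @ Lap)
print(W.sum(axis=1))                          # every row sums to 1
```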
Bayesian Analysis: IGMRFs
88
Recall that

(1/2) Σ_{r=1}^T Σ_{s=1}^T A_rs (y_r − y_s)^2 = (1/2) y^T L_S y,   where L = D − A.

The above regularizer can be thought of as coming from an intrinsic (improper) GMRF prior (Rue and Held, '05):

p(y) = (2π)^{-T/2} |L_S|^{1/2} exp(−(1/2) y^T L_S y)

Usually used for graphical models when L_S is sparse.
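The quadratic-form identity behind this can be verified in a few lines (our derivation, assuming symmetric A; under that reading L_S corresponds to 2L, the Laplacian of A + A^T):

```latex
\begin{aligned}
\frac{1}{2}\sum_{r=1}^{T}\sum_{s=1}^{T} A_{rs}\,(y_r - y_s)^2
  &= \frac{1}{2}\sum_{r,s} A_{rs}\bigl(y_r^2 - 2 y_r y_s + y_s^2\bigr) \\
  &= \sum_{r}\Bigl(\sum_{s} A_{rs}\Bigr) y_r^2 \;-\; \sum_{r,s} A_{rs}\, y_r y_s \\
  &= y^T D\, y - y^T A\, y \;=\; y^T (D - A)\, y \;=\; y^T L\, y .
\end{aligned}
```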
Bayesian Analysis
Assuming differences are independent:

p(y) ∝ Π_{r=1}^T Π_{s=1}^T exp(−γ A_rs (y_r − y_s)^2)

For T = 3:

Y_1 − Y_2 ~ N(0, 1/(2γA_12))
Y_2 − Y_3 ~ N(0, 1/(2γA_23))
Y_1 − Y_3 ~ N(0, 1/(2γA_13))

But the differences are linearly dependent, so their variances must add:

Y_1 − Y_3 = (Y_1 − Y_2) + (Y_2 − Y_3)  →  1/A_13 = 1/A_12 + 1/A_23
Y_1 − Y_2 = (Y_1 − Y_3) + (Y_3 − Y_2)  →  1/A_12 = 1/A_13 + 1/A_32
Y_2 − Y_3 = (Y_2 − Y_1) + (Y_1 − Y_3)  →  1/A_23 = 1/A_21 + 1/A_13

Impossible to satisfy all three with any finite A!
89
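Adding the first two consistency conditions makes the impossibility explicit (a short check, using A_23 = A_32):

```latex
\frac{1}{A_{13}} + \frac{1}{A_{12}}
  = \Bigl(\frac{1}{A_{12}} + \frac{1}{A_{23}}\Bigr)
  + \Bigl(\frac{1}{A_{13}} + \frac{1}{A_{32}}\Bigr)
\;\;\Longrightarrow\;\; 0 = \frac{2}{A_{23}},
```

which no finite A_23 can satisfy.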
Related Multi‐Task Regularizers
90
Σ_{r=1}^T ‖β_r − (1/T) Σ_{s=1}^T β_s‖_2^2 : distance to mean (Evgeniou and Pontil, 2004)

‖β‖_* : trace norm (Abernethy et al., 2009)

tr(β^T D^{-1} β) : learned, shared feature covariance matrix (Argyriou et al., 2008)

tr(β Σ^{-1} β^T) : learned task covariance matrix (Jacob et al., 2008; Zhang and Yeung, 2010)

Σ_{r=1}^T Σ_{s=1}^T A_rs ‖β_r − β_s‖_2^2 : pairwise distance regularizer (Sheldon, 2008) or constraint (Kato et al., 2007)
Gaussian Simulation, T = 2
91
(Lower is better.)
Uniform Simulation, T = 2
92
(Lower is better.)
Pairwise T=5 Results
93
(Lower is better.)
Oracle T=5 Results
94
(Lower is better.)
Stein’s Unbiased Risk Estimate
95
• The true A*_12 depends on the unknown μ_t. We plugged in ȳ_t to get:

  A*_12 = 2 / (ȳ_1 − ȳ_2)^2

• Another approach: minimize Stein's unbiased risk estimate (SURE), an empirical proxy Q such that E[Q] = risk.

• Result:

  A^SURE_12 = ( 2 / ((ȳ_1 − ȳ_2)^2 − σ_1^2/N_1 − σ_2^2/N_2) )_+
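Written as code (a sketch; the function names are ours, and we read the slide's (·)_+ as clipping the whole expression at zero):

```python
# Two estimates of the optimal pairwise similarity for T = 2 tasks.
def a12_plugin(ybar1, ybar2):
    # plug the sample means into A*_12 = 2 / (mu1 - mu2)^2
    return 2.0 / (ybar1 - ybar2) ** 2

def a12_sure(ybar1, ybar2, var1, var2, n1, n2):
    # SURE subtracts the sampling noise in (ybar1 - ybar2)^2,
    # then the positive part clips negative values to zero
    denom = (ybar1 - ybar2) ** 2 - var1 / n1 - var2 / n2
    return max(2.0 / denom, 0.0) if denom != 0 else 0.0

print(a12_plugin(1.0, 2.0))               # 2.0
print(a12_sure(1.0, 2.0, 1.0, 1.0, 4, 4)) # denom = 1 - 0.25 - 0.25, so 4.0
```

Note that the SURE denominator removes the sampling variance of ȳ_1 − ȳ_2, so it estimates (μ_1 − μ_2)^2 rather than the noisy observed gap.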
SURE T=2 Experiments
96
(Lower is better.)
Alternative Formulation
97
MTA is:

  μ = (I + (γ/T) Σ L)^{-1} ȳ,  with optimal sim A*_12 = 2 / (μ_1 − μ_2)^2.

MTA variant is:

  μ = Σ^{1/2} (I + (γ/T) L)^{-1} Σ^{-1/2} ȳ,  with optimal sim A*_12 = 2 / (μ_1/σ_1 − μ_2/σ_2)^2.

Different notions of distance! What if μ_1 = 2, σ_1 = 1, μ_2 = 4, σ_2 = 2? (Then μ_1/σ_1 = μ_2/σ_2, so the variant treats the two tasks as identical, while the original does not.)
98
Alternative Formulation T=2 Results
(Lower is better.)
MTA Closed Form Solution
99
"Multi-task" averaging:

{μ_t^MTA}_{t=1}^T = argmin_{{μ_t}_{t=1}^T}  (1/T) Σ_{t=1}^T Σ_{i=1}^{N_t} (y_ti − μ_t)^2 / σ_t^2  +  (γ/T^2) Σ_{r=1}^T Σ_{s=1}^T A_rs (μ_r − μ_s)^2

Closed form:

μ^MTA = (I + (γ/T) Σ L)^{-1} ȳ

This is a more general form of the regularized Laplacian kernel (Smola and Kondor, 2003).
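A numerical sketch of this closed form's limiting behavior (our illustrative code, with a uniform similarity matrix): γ = 0 returns the single-task means, and γ → ∞ pools all tasks toward one common value, since the Laplacian's null space is spanned by the constant vector.

```python
import numpy as np

# Limiting behavior of mu = (I + (gamma/T) Sigma L)^{-1} ybar
rng = np.random.default_rng(2)
T = 4
ybar = rng.normal(size=T)
A = np.ones((T, T)) - np.eye(T)            # uniform similarity
Lap = np.diag(A.sum(axis=1)) - A           # graph Laplacian, Lap @ 1 = 0
Sigma = np.diag(rng.uniform(0.5, 2.0, T))  # diagonal, nonnegative

def mta(gamma):
    return np.linalg.solve(np.eye(T) + (gamma / T) * Sigma @ Lap, ybar)

print(mta(0.0))    # gamma = 0: no regularization, returns ybar
print(mta(1e8))    # gamma -> infinity: all tasks share one pooled value
```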