TRANSCRIPT
Theorem: ℓ_e and ℓ_r in the table below are equivalent, that is, there exists strictly increasing f such that ℓ_e(S, θ) = f(ℓ_r(R_2^[m], θ)), ∀S, ∀θ. Generalizes Nock et al. (ICML'15), Patrini et al. (IJCAI'16).
Interesting case: when there exist b ∈ R and increasing f_e, f_r such that L_e(z) = L_r(z) + b and (2):
  −L_e(z) = f_e(ℓ_e(z)), with ℓ_e defined over coordinates of z (example loss),
  −L_r(z) = f_r(ℓ_r(z)), with ℓ_r defined over subset sums of coordinates of z (rado loss).
In this case, ℓ_e and ℓ_r are equivalent.
On Regularizing Rademacher Observation Losses
Richard Nock
A: Two recent papers show, with two different proofs, that two popular example losses (the log and square losses) are equivalent to Rademacher observation (rado) losses for linear classifiers. Rados help (i) protect the privacy of examples and (ii) learn from distributed data without entity matching. But why these two losses? Is anything hidden? Are there more rado applications? Contributions: (a) a general theory of equivalent example and rado losses, built on an adversarial formulation of loss functions, yielding more equivalences; (b) an extension to regularization: any regularization of the example loss transfers to the rados (i.e., the data) in the rado loss, without altering the rado loss itself! (c) a formal boosting algorithm for the exp-rado loss with various regularizers (ridge, lasso, SLOPE, ℓ∞, ...), which boosts the log loss over examples with the same regularizers!
examples: S = {(x_i, y_i)}_{i∈[m]}, with x_i ∈ R^d and y_i ∈ {−1, 1}
rados: Σ_m := {−1, 1}^m indexes the 2^m candidate rados; learning uses a subset U ⊆ 2^[m] with |U| = n
// the I's can be (non-)random, (non-)i.i.d., learned from data, etc.
Rademacher observation (= 1 rado): π_I := Σ_{i∈I} y_i · x_i, for I ∈ U; set of rados R_U := {π_I : I ∈ U}
notation: z_i := θ^⊤(y_i · x_i), with θ ∈ R^d a linear classifier
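For concreteness, a minimal numpy sketch of how rados are built from examples (not from the poster's released code; dataset and variable names are illustrative):

import numpy as np

rng = np.random.default_rng(0)
m, d, n = 8, 5, 16                       # m examples, d features, n rados
X = rng.normal(size=(m, d))              # x_i in R^d
y = rng.choice([-1.0, 1.0], size=m)      # y_i in {-1, 1}
edges = y[:, None] * X                   # row i is y_i * x_i

# one rado per subset I in U: pi_I = sum_{i in I} y_i * x_i
U = [rng.choice([False, True], size=m) for _ in range(n)]   # random subsets I
rados = np.stack([edges[I].sum(axis=0) for I in U])         # shape (n, d)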
example game vs rado game: the adversary fits p (example game) or q (rado game); the learner builds θ to maximize the resulting optimum:
  L_e(p, z) := Σ_{i∈[m]} p_i·z_i + μ_e·Σ_{i∈[m]} φ_e(p_i)
  L_r(q, z) := Σ_{I⊆[m]} q_I·Σ_{i∈I} z_i + μ_r·Σ_{I⊆[m]} φ_r(q_I)
  L_e(z) := L_e(p*(z), z), with p*(z) := argmin_{p∈R^m} L_e(p, z)
  L_r(z) := L_r(q*(z), z), with q*(z) := argmin_{q∈H_{2^m}} L_r(q, z), H_{2^m} := {q ∈ R^{2^m} : 1^⊤q = 1}
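To make the adversarial formulation concrete: with the illustrative choice φ_e(p) = p² (mine, not the poster's), the inner minimization has the closed form p*_i = −z_i/(2μ_e), so L_e(z) = −Σ_i z_i²/(4μ_e). A quick numeric check:

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
z = rng.normal(size=6)
mu_e = 0.5

L_e = lambda p: p @ z + mu_e * np.sum(p**2)      # phi_e(p) = p^2
p_star = minimize(L_e, np.zeros_like(z)).x       # numeric argmin over R^m
assert np.allclose(p_star, -z / (2 * mu_e), atol=1e-4)
assert np.isclose(L_e(p_star), -np.sum(z**2) / (4 * mu_e), atol=1e-6)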
Theorem: when φ_e and φ_r are strictly convex and differentiable, p*(z) = G_m·q*(z) holds iff (1):
  φ_e(z) = (μ_r/μ_e)·φ_s(r)(z) + (b/μ_e), where φ_s(r)(z) := φ_r(z) + φ_r(1 − z).
Theorem: if (1) holds, then L_e(z) = L_r(z) + b (b ∈ R).
example loss ℓ_e(S, θ)                          rado loss ℓ_r(R_U, θ)
ℓ_log = E_i[log(1 + exp(−θ^⊤(y_i·x_i)))]        ℓ_exp = E_I[exp(−θ^⊤π_I)]
ℓ_sql = E_i[(1 − θ^⊤(y_i·x_i))²]                ℓ_M = −(E_I[θ^⊤π_I] − (1/2)·V_I[θ^⊤π_I])
ℓ_H = E_i[max{0, −θ^⊤(y_i·x_i)}]                ℓ_a = max{0, max_{I⊆[m]} {−θ^⊤π_I}}
ℓ_U = E_i[−θ^⊤(y_i·x_i)]                        ℓ_U' = E_I[−θ^⊤π_I]
Definition: a pair (ℓ_e, ℓ_r) of an example loss and a rado loss is equivalent iff there exists strictly increasing f : R → R such that ℓ_e(S, θ) = f(ℓ_r(R_2^[m], θ)), ∀S, ∀θ.
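The first table row can be checked numerically: over the full rado set R_2^[m], Σ_I exp(−θ^⊤π_I) = Π_i (1 + exp(−z_i)), so ℓ_log(S, θ) = (1/m)·(log ℓ_exp(R_2^[m], θ) + m·log 2), a strictly increasing f. A sketch of that check (the factorization is the identity behind the log/exp equivalence of Nock et al., ICML'15):

import itertools
import numpy as np

rng = np.random.default_rng(2)
m, d = 6, 4
X = rng.normal(size=(m, d))
y = rng.choice([-1.0, 1.0], size=m)
theta = rng.normal(size=d)
z = (y[:, None] * X) @ theta                       # z_i = theta^T (y_i x_i)

# all 2^m rados: theta^T pi_I = sum_{i in I} z_i
subsets = itertools.product([0, 1], repeat=m)
rado_edges = [sum(z[i] for i in range(m) if I[i]) for I in subsets]

l_exp = np.mean([np.exp(-v) for v in rado_edges])  # E_I[exp(-theta^T pi_I)]
l_log = np.mean(np.log1p(np.exp(-z)))              # E_i[log(1+exp(-z_i))]

# equivalence: l_log = (1/m) * (log l_exp + m*log 2)
assert np.isclose(l_log, (np.log(l_exp) + m * np.log(2)) / m)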
Theorem: If (1) holds and f_e in (2) is affine, f_e(z) = a_e·z + b_e, then the regularized example loss
  ℓ_e(S, θ, Ω) := ℓ_e(S, θ) + Ω(θ)
is equivalent to the rado loss ℓ_r(R_2^[m]^{Ω,θ}, θ), where
  R_2^[m]^{Ω,θ} := {π_I − (a_e·Ω(θ)/‖θ‖₂²)·θ : π_I ∈ R_2^[m]}.
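A sketch of the theorem's mechanics for the exp-rado loss with a lasso Ω (an illustration, names mine): shifting every rado by (a_e·Ω(θ)/‖θ‖₂²)·θ multiplies each exp term by exp(a_e·Ω(θ)), i.e. the additive penalty lands in the data while the loss keeps its form:

import numpy as np

rng = np.random.default_rng(3)
n, d = 16, 5
rados = rng.normal(size=(n, d))                    # stand-in rados pi_I
theta = rng.normal(size=d)
a_e, Om = 1.0, np.abs(theta).sum()                 # lasso penalty Omega(theta)

shift = (a_e * Om / (theta @ theta)) * theta       # pi_I -> pi_I - shift
l_exp = np.mean(np.exp(-(rados @ theta)))
l_exp_reg = np.mean(np.exp(-((rados - shift) @ theta)))

# regularization moved into the rados: log l_exp_reg = log l_exp + a_e*Omega
assert np.isclose(np.log(l_exp_reg), np.log(l_exp) + a_e * Om)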
Rados & losses
Q: Why care?
G_m := [ 0_{2^(m−1)}^⊤   1_{2^(m−1)}^⊤ ]
       [ G_{m−1}          G_{m−1}      ]   (∈ {0, 1}^{m×2^m}),   with G_1 := [0 1]
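The recursion is easy to instantiate; the columns of G_m enumerate the indicator vectors of all subsets I ⊆ [m], which is what makes p*(z) = G_m·q*(z) a statement about subset sums (a sketch, names mine):

import numpy as np

def G(m):
    # G_1 = [0 1];  G_m = [[0...0 1...1], [G_{m-1} G_{m-1}]]
    if m == 1:
        return np.array([[0, 1]])
    prev = G(m - 1)
    half = prev.shape[1]
    top = np.concatenate([np.zeros(half, int), np.ones(half, int)])
    return np.vstack([top, np.hstack([prev, prev])])

Gm = G(4)
assert Gm.shape == (4, 2**4)
# each column is the indicator vector of one subset I of [m]
assert len({tuple(c) for c in Gm.T}) == 2**4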
Boosting with variously regularized rado losses
one master boosting routine fits all applicable regularizers
Algorithm 1: Ω-WL, for Ω ∈ {‖.‖₁, ‖.‖²_Γ, ‖.‖∞, ‖.‖_Φ}
Input: set of rados S_r := {π_1, π_2, ..., π_n}; weights w ∈ Δ_n; parameters γ ∈ (0, 1), ω ∈ R₊; classifier θ;
Step 1: pick weak feature ι* ∈ [d];
Step 2: if Ω = ‖.‖²_Γ then
    r* ← r_{ι*} if r_{ι*} ∈ [−γ, γ], sign(r_{ι*})·γ otherwise;
  else r* ← r_{ι*};
Return (ι*, r*);
Algorithm 2: Ω-R.AdaBoost
Input: set of rados S_r := {π_1, π_2, ..., π_n}; parameters T ∈ N*, γ ∈ (0, 1), ω ∈ R₊;
Step 1: let θ_0 ← 0, w_0 ← (1/n)·1;
Step 2: for t = 1, 2, ..., T
  Step 2.1: call the weak learner: (ι(t), r_t) ← Ω-WL(S_r, w_{t−1}, γ, ω, θ_{t−1});
  Step 2.2: update parameters α_{ι(t)} and δ_t (with π*_k := max_j |π_{jk}| and θ_t := Σ_{t'=1}^{t} α_{ι(t')}·1_{ι(t')}):
    α_{ι(t)} ← (1/(2π*_{ι(t)}))·log((1 + r_t)/(1 − r_t))
    δ_t ← ω·(Ω(θ_t) − Ω(θ_{t−1}));
  Step 2.3: update and normalize weights: for j = 1, 2, ..., n,
    w_{tj} ← w_{(t−1)j}·exp(−α_{ι(t)}·π_{jι(t)} + δ_t)/Z_t;
Return θ_T;
Code available at: http://users.cecs.anu.edu.au/~rnock/
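A compact numpy sketch of both routines, as one reading of the pseudocode above (not the released code): the choice of weak feature ι* inside Ω-WL is left open on the poster, so argmax of |r_k| is used here as an illustrative heuristic, and only the lasso regularizer and the ridge clipping rule are wired in:

import numpy as np

def omega_wl(rados, w, gamma, ridge=False):
    # r_k = (1/pi*_k) * sum_j w_j * pi_{jk}, with pi*_k = max_j |pi_{jk}|
    pi_star = np.abs(rados).max(axis=0)
    r = (w @ rados) / pi_star
    i = int(np.argmax(np.abs(r)))                 # pick of iota*: a heuristic
    return i, (float(np.clip(r[i], -gamma, gamma)) if ridge else float(r[i]))

def omega_r_adaboost(rados, T, gamma=0.9, omega=0.1,
                     Omega=lambda th: np.abs(th).sum()):  # lasso by default
    n, d = rados.shape
    theta, w = np.zeros(d), np.full(n, 1.0 / n)   # theta_0, w_0
    for _ in range(T):
        i, r = omega_wl(rados, w, gamma)
        r = np.clip(r, -0.999, 0.999)             # numerical safeguard
        alpha = np.log((1 + r) / (1 - r)) / (2 * np.abs(rados[:, i]).max())
        omega_old = Omega(theta)
        theta[i] += alpha                         # theta_t = theta_{t-1} + alpha*1_i
        delta = omega * (Omega(theta) - omega_old)            # delta_t
        w = w * np.exp(-alpha * rados[:, i] + delta)
        # exp(delta) is constant in j, so it cancels in the normalization below;
        # it enters the analysis through the normalizer Z_t
        w /= w.sum()
    return theta

theta = omega_r_adaboost(np.random.default_rng(4).normal(size=(32, 6)), T=50)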
γ_wl-Weak Learning Assumption: Ω-WL meets the γ_wl-WLA iff every r_{ι(t)} := (1/π*_{ι(t)})·Σ_{j=1}^{n} w_{tj}·π_{jι(t)} (∈ [−1, 1]) satisfies |r_{ι(t)}| ≥ γ_wl.

Theorem: if the γ_wl-WLA holds, then after T boosting rounds the exp-rado loss satisfies
  ℓ_exp(R_U^{Ω,θ}, θ_T) ≤ exp(−a·γ_wl²·T/2),
for all applicable regularizers Ω ∈ {‖.‖₁, ‖.‖²_Γ, ‖.‖∞, ‖.‖_Φ}, with 0 < a < 1/5 and mild conditions on ω, e.g.:
  Ω(θ) = ‖θ‖²_Γ: ω < (2a·min_k max_j π²_jk)/(T·λ_Γ), with λ_Γ the max. eigenvalue of Γ;
  Ω(θ) = ‖θ‖_Φ: a := min{3γ_wl/11, Φ^{−1}(1 − q/(2d))/(min_k max_j |π_jk|)}, with q the False Discovery Rate in SLOPE.
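To get a sense of the rate (a back-of-the-envelope computation; a = 0.1 and γ_wl = 0.1 are illustrative values within the theorem's range, not from the poster), driving the bound below 10⁻² needs

\[
\exp(-a\,\gamma_{\mathrm{wl}}^2\,T/2) \le 10^{-2}
\iff
T \ge \frac{2\ln 100}{a\,\gamma_{\mathrm{wl}}^2}
= \frac{2 \times 4.605}{0.1 \times 0.01}
\approx 9211\ \text{rounds}.
\]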
Experiments
(see the paper for all Ω)
domain       m       d    AdaBoost    Ω-R.AdaBoost (reg.-AdaBoost)
                          err±σ       ω = 0        Ω = ‖.‖²_Id      Ω = ‖.‖₁        Ω = ‖.‖∞        Ω = ‖.‖_Φ
                                      err±σ        err±σ    δ       err±σ    δ      err±σ    δ      err±σ    δ
Fertility 100 9 40.00±14.1 40.00±14.9 41.00±16.6 8.00 • 41.00±14.5 4.00 ◦ 41.00±21.3 6.00 38.00±14.0 7.00
Sonar 208 60 24.57±9.11 • 27.88±4.33 25.05±7.56 8.14 24.05±8.41 4.83 24.52±8.65 10.12 ◦ 25.00±13.4 3.83
Spectf 267 44 45.67±11.0 ◦ 44.96±8.27 • 43.43±11.7 3.35 44.57±12.5 1.85 43.09±11.0 1.85 43.79±13.9 3.05
Ionosphere 351 33 13.11±6.36 • 14.51±7.36 13.64±5.99 5.43 14.24±6.15 2.83 ◦ 13.38±4.44 3.15 ◦ 14.25±5.04 3.41
Breastwisc 699 9 3.00±1.96 3.43±2.25 2.57±1.62 1.14 ◦ 3.29±2.24 0.86 2.86±2.13 0.86 • 3.00±2.18 0.29
Transfusion 748 4 39.17±7.01 ◦ 37.97±7.42 37.57±5.60 2.40 ◦ 36.50±6.78 2.14 ◦ 37.43±8.08 1.21 • 36.10±8.06 3.21
Qsar 1 055 31 22.09±3.73 24.37±4.06 22.47±3.84 3.41 • 23.13±2.74 2.75 • 23.23±3.64 2.64 ◦ 23.80±3.79 2.55
Hill-nonoise 1 212 100 47.52±5.14 45.63±6.68 45.71±6.61 0.33 45.46±6.88 0.66 ◦ 45.87±6.69 0.49 • 45.62±7.26 0.42
Hill-noise 1 212 100 47.61±3.48 ◦ 45.05±2.98 44.80±2.86 0.99 44.97±3.37 0.66 44.88±2.82 0.66 • 44.64±3.26 0.74
Winered 1 599 11 26.33±2.75 ◦ 28.02±3.32 • 27.83±3.95 1.19 • 27.45±4.17 1.00 ◦ 27.58±3.76 1.12 • 27.45±3.34 1.25
Abalone 4 177 10 22.98±2.70 • 26.57±2.31 ◦ 24.18±2.51 0.00 24.13±2.48 0.14 ◦ 24.18±2.51 0.00 24.11±2.59 0.07
Statlog 4 435 36 4.49±0.61 22.41±2.20 • 4.71±0.82 0.25 ◦ 20.43±1.89 0.23 4.69±0.72 0.45 20.00±1.80 0.18
Winewhite 4 898 11 30.73±2.20 • 32.63±2.52 • 31.85±1.66 1.18 32.16±1.73 1.31 32.16±2.02 0.90 ◦ 31.97±2.26 1.12
Smartphone 7 352 561 0.00±0.00 ◦ 0.67±0.25 0.19±0.22 0.00 ◦ 0.44±0.29 0.03 • 0.20±0.24 0.01 0.19±0.22 0.04
Firmteacher 10 800 16 44.44±1.34 40.58±4.87 • 40.89±3.95 2.35 39.81±4.37 2.89 ◦ 38.91±4.51 3.56 ◦ 38.01±6.15 5.02
Eeg 14 980 14 45.38±2.04 • 44.09±2.32 ◦ 44.01±1.48 0.40 • 43.89±2.19 0.89 ◦ 44.07±2.02 0.81 • 43.87±1.40 0.95
Magic 19 020 10 21.07±1.09 ◦ 37.51±0.46 • 22.11±1.32 0.28 ◦ 26.41±1.08 0.00 23.00±1.71 0.66 ◦ 26.41±1.08 0.00
Hardware 28 179 96 16.77±0.73 ◦ 9.41±0.71 6.43±0.74 0.18 ◦ 11.72±1.24 0.41 • 6.50±0.67 0.10 6.42±0.69 0.13
Marketing 45 211 27 30.68±1.01 27.70±0.69 27.33±0.73 0.33 ◦ 28.02±0.47 0.00 • 27.19±0.87 0.51 ◦ 28.02±0.47 0.00
Kaggle 120 269 10 47.80±0.47 • 39.22±8.47 ◦ 16.90±0.51 0.00 16.90±0.51 0.00 16.89±0.50 0.01 16.90±0.51 0.00
• = sparsest solution, ◦ = least sparse (among all Ω-R.AdaBoost); shaded = best among Ω-R.AdaBoost
Ω(θ) = ω · {
  ‖θ‖₁  := |θ|^⊤·1                        (lasso)
  ‖θ‖²_Γ := θ^⊤·Γ·θ                       (ridge)
  ‖θ‖∞  := max_k |θ_k|                    (ℓ∞)
  ‖θ‖_Φ := max_{M∈S_d} (M·|θ|)^⊤·ξ        (SLOPE; ξ user-fixed)
}
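The four regularizers are straightforward to compute; for SLOPE, assuming S_d denotes the d x d permutation matrices, the max over M is attained by pairing the sorted |θ| with the sorted ξ, by the rearrangement inequality (a sketch; Gamma and xi are user inputs):

import numpy as np

def lasso(theta):
    return np.abs(theta).sum()                 # |theta|^T 1

def ridge(theta, Gamma):
    return theta @ Gamma @ theta               # theta^T Gamma theta

def l_inf(theta):
    return np.abs(theta).max()                 # max_k |theta_k|

def slope(theta, xi):
    # max_M (M|theta|)^T xi: sort both decreasingly and take the inner product
    return np.sort(np.abs(theta))[::-1] @ np.sort(xi)[::-1]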
Rados vs examples
(see the paper for the non-convex, non-differentiable cases)