

On Regularizing Rademacher Observation Losses
Richard Nock

Q: Why care?
A: Two recent papers show, with two different proofs, that two popular example losses (the log loss and the square loss) are equivalent to Rademacher observation (rado) losses for linear classifiers. Rados help (i) to protect the privacy of the examples and (ii) to learn from distributed data without entity matching. But why these two losses? Is anything hidden; are there more rado applications?
Contribution: (a) a general theory of equivalent example and rado losses, built on an adversarial formulation of loss functions, yielding more equivalences; (b) an extension to regularization: any regularization of the example loss transfers to the rados (i.e. to the data) in the rado loss and does not alter the rado loss itself; (c) a formal boosting algorithm for the exp-rado loss with various regularizers (ridge, lasso, SLOPE, $\ell_\infty$, ...), which boosts the log loss over examples with the same regularizers.

Rados & losses

Examples: $S = \{(x_i, y_i)\}_{i \in [m]}$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$; write $\Sigma_m \doteq \{-1, 1\}^m$.

Rademacher observation (= 1 rado): $\pi_I \doteq \sum_{i \in I} y_i \cdot x_i$, for $I \in \mathcal{U}$, $\mathcal{U} \subseteq 2^{[m]}$. There are $2^m$ possible rados; the learner receives a rado set $\mathcal{R}_{\mathcal{U}}$ with $|\mathcal{U}| = n$. The subsets $I$ can be (non) random, (non) i.i.d., learned from data, etc.

For a linear classifier $\theta \in \mathbb{R}^d$, let $z_i \doteq \theta^\top (y_i \cdot x_i)$.
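To make the definitions concrete, here is a minimal NumPy sketch (toy data and subset choices are made up for illustration) that builds a rado set $\mathcal{R}_{\mathcal{U}}$ from a sample $S$ and a collection $\mathcal{U}$ of index subsets:

```python
import numpy as np

def rados(X, y, subsets):
    """One Rademacher observation pi_I = sum_{i in I} y_i * x_i per subset I."""
    return np.array([(y[list(I)][:, None] * X[list(I)]).sum(axis=0) for I in subsets])

# toy data, for illustration only
rng = np.random.default_rng(0)
m, d = 6, 3
X = rng.normal(size=(m, d))            # examples x_i in R^d
y = rng.choice([-1, 1], size=m)        # labels y_i in {-1, 1}

U = [(0, 1, 2), (1, 3), (0, 2, 4, 5)]  # n = |U| = 3 subsets of [m]
R_U = rados(X, y, U)                   # rado set, shape (3, d)
```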

Example game vs. rado game: the adversary fits the weights, the learner builds $\theta$ to maximize. The adversary's objectives are

$L_e(p, z) \doteq \sum_{i \in [m]} p_i z_i + \mu_e \sum_{i \in [m]} \varphi_e(p_i)$ (example game),
$L_r(q, z) \doteq \sum_{I \subseteq [m]} q_I \sum_{i \in I} z_i + \mu_r \sum_{I \subseteq [m]} \varphi_r(q_I)$ (rado game),

minimized at $p^\star(z) \doteq \arg\min_{p \in \mathbb{R}^m} L_e(p, z)$ and $q^\star(z) \doteq \arg\min_{q \in \mathcal{H}_{2^m}} L_r(q, z)$, where $\mathcal{H}_{2^m} \doteq \{q \in \mathbb{R}^{2^m} : 1^\top q = 1\}$. Write $L_e(z) \doteq L_e(p^\star(z), z)$ and $L_r(z) \doteq L_r(q^\star(z), z)$. The two games are tied through the condition

(1)  $p^\star(z) = G_m \, q^\star(z)$,

where $G_m \in \{0, 1\}^{m \times 2^m}$ is defined recursively by $G_1 \doteq [0 \ 1]$ and $G_m \doteq \begin{bmatrix} 0^\top_{2^{m-1}} & 1^\top_{2^{m-1}} \\ G_{m-1} & G_{m-1} \end{bmatrix}$.

Theorem: when $\varphi_e$ and $\varphi_r$ are strictly convex and differentiable, (1) holds iff $\varphi_e(z) = (\mu_r / \mu_e) \cdot \varphi_{s(r)}(z) + (b / \mu_e)$, where $\varphi_{s(r)}(z) \doteq \varphi_r(z) + \varphi_r(1 - z)$.

Theorem: if (1) holds, then $L_e(z) = L_r(z) + b$ (for some $b \in \mathbb{R}$).
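To see what (1) says, note that the columns of $G_m$ are the 0/1 indicator vectors of all subsets $I \subseteq [m]$, so (1) reads $p^\star_i = \sum_{I \ni i} q^\star_I$. A small sketch of the recursive construction (my own helper, written only to illustrate the definition):

```python
import numpy as np

def G(m):
    """Build G_m: G_1 = [0 1]; G_m = [[0...0 1...1], [G_{m-1} G_{m-1}]]."""
    g = np.array([[0, 1]])
    for _ in range(m - 1):
        k = g.shape[1]
        top = np.concatenate([np.zeros(k, dtype=int), np.ones(k, dtype=int)])
        g = np.vstack([top, np.hstack([g, g])])
    return g                      # shape (m, 2**m); columns = subset indicators

print(G(3))
```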

Interesting case (2): there exist increasing $f_e, f_r$ such that $-L_e(z) = f_e(\ell_e(z))$ and $-L_r(z) = f_r(\ell_r(z))$, with $\ell_e$ defined over the coordinates of $z$ (an example loss) and $\ell_r$ defined over subset sums of the coordinates of $z$ (a rado loss). In this case, $\ell_e$ and $\ell_r$ are equivalent.

Example loss $\ell_e(S, \theta)$ and matching rado loss $\ell_r(\mathcal{R}_{\mathcal{U}}, \theta)$:

$\ell_{\log} = \mathbb{E}_i[\log(1 + \exp(-\theta^\top (y_i \cdot x_i)))]$  ↔  $\ell_{\exp} = \mathbb{E}_I[\exp(-\theta^\top \pi_I)]$
$\ell_{\mathrm{sql}} = \mathbb{E}_i[(1 - \theta^\top (y_i \cdot x_i))^2]$  ↔  $\ell_M = -(\mathbb{E}_I[\theta^\top \pi_I] - (1/2) \cdot \mathbb{V}_I[\theta^\top \pi_I])$
$\ell_H = \mathbb{E}_i[\max\{0, -\theta^\top (y_i \cdot x_i)\}]$  ↔  $\ell_a = \max\{0, \max_{I \subseteq [m]} \{-\theta^\top \pi_I\}\}$
$\ell_U = \mathbb{E}_i[-\theta^\top (y_i \cdot x_i)]$  ↔  $\ell_{U'} = \mathbb{E}_I[-\theta^\top \pi_I]$

Theorem: every pair $(\ell_e, \ell_r)$ in the table is equivalent, that is, there exists a strictly increasing $f : \mathbb{R} \to \mathbb{R}$ such that $\ell_e(S, \theta) = f(\ell_r(\mathcal{R}_{2^{[m]}}, \theta))$, $\forall S$, $\forall \theta$. Generalizes Nock et al. (ICML'15) and Patrini et al. (IJCAI'16).
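For the pair $(\ell_{\log}, \ell_{\exp})$ the monotone link can be written down: expanding $\prod_i (1 + e^{-z_i}) = \sum_{I \subseteq [m]} e^{-\sum_{i \in I} z_i}$ gives $\ell_{\log}(S, \theta) = \log 2 + (1/m) \log \ell_{\exp}(\mathcal{R}_{2^{[m]}}, \theta)$, which matches the theorem's strictly increasing $f$. A small numerical sanity check (my own script, on made-up data):

```python
import numpy as np
from itertools import chain, combinations

rng = np.random.default_rng(1)
m, d = 8, 4
X = rng.normal(size=(m, d))
y = rng.choice([-1, 1], size=m)
theta = rng.normal(size=d)

z = (y[:, None] * X) @ theta                       # z_i = theta^T (y_i x_i)
ell_log = np.mean(np.log1p(np.exp(-z)))            # example log loss

# exp-rado loss over the full rado set R_{2^[m]} (2^m rados)
all_I = chain.from_iterable(combinations(range(m), k) for k in range(m + 1))
pis = np.array([(y[list(I)][:, None] * X[list(I)]).sum(axis=0) for I in all_I])
ell_exp = np.mean(np.exp(-pis @ theta))

print(ell_log, np.log(2) + np.log(ell_exp) / m)    # the two values coincide
```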

Theorem: if (1) holds and $f_e(z) = a_e \cdot z + b_e$ in (2), then the regularized example loss

$\ell_e(S, \theta, \Omega) \doteq \ell_e(S, \theta) + \Omega(\theta)$

is equivalent to the rado loss $\ell_r(\mathcal{R}^{\Omega, \theta}_{2^{[m]}}, \theta)$, where

$\mathcal{R}^{\Omega, \theta}_{2^{[m]}} \doteq \left\{ \pi_I - \frac{a_e \Omega(\theta)}{\|\theta\|_2^2} \cdot \theta \, : \, \pi_I \in \mathcal{R}_{2^{[m]}} \right\}$.

Regularizing the example loss thus transfers entirely to the rados, i.e. to the data: the rado loss itself is unchanged, only its input rados are shifted along $\theta$.
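A minimal sketch of this data shift (the scalar $a_e$ depends on the loss pair and is left as a parameter here; the value $\Omega(\theta)$ is passed in precomputed):

```python
import numpy as np

def regularized_rados(pis, theta, omega_of_theta, a_e):
    """R^{Omega,theta}: pi_I - (a_e * Omega(theta) / ||theta||_2^2) * theta, for every rado."""
    shift = a_e * omega_of_theta / np.dot(theta, theta)
    return pis - shift * theta             # broadcasts the same shift over all rados

# example with Omega(theta) = ||theta||_2^2 (ridge with Gamma = Id), a_e set to 1 for illustration
rng = np.random.default_rng(2)
pis = rng.normal(size=(5, 3))              # 5 toy rados in R^3
theta = rng.normal(size=3)
shifted = regularized_rados(pis, theta, omega_of_theta=theta @ theta, a_e=1.0)
```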

Boosting with variously regularized rado losses

One master boosting routine fits all applicable regularizers.

Algorithm $\Omega$-wl, for $\Omega \in \{\|.\|_1, \|.\|^2_\Gamma, \|.\|_\infty, \|.\|_\Phi\}$
Input: set of rados $\mathcal{S}_r \doteq \{\pi_1, \pi_2, ..., \pi_n\}$; weights $w \in \triangle_n$; parameters $\gamma \in (0, 1)$, $\omega \in \mathbb{R}_+$; classifier $\theta$;
Step 1: pick weak feature $\iota^\star \in [d]$;
Step 2: if $\Omega = \|.\|^2_\Gamma$ then $r^\star \leftarrow r_{\iota^\star}$ if $r_{\iota^\star} \in [-\gamma, \gamma]$, and $r^\star \leftarrow \mathrm{sign}(r_{\iota^\star}) \cdot \gamma$ otherwise; else $r^\star \leftarrow r_{\iota^\star}$;
Return $(\iota^\star, r^\star)$;
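A literal NumPy reading of $\Omega$-wl (a sketch only: how Step 1 picks the weak feature is not specified above, so this version picks the feature with the largest normalized edge $|r_k|$, computed as in the weak learning assumption further down; $\omega$ and $\theta$ are accepted to match the call signature but unused here):

```python
import numpy as np

def omega_wl(pis, w, gamma, omega, theta, reg="lasso"):
    """Weak learner: return a feature index iota* and its (possibly clipped) edge r*."""
    pi_star = np.max(np.abs(pis), axis=0)          # pi*_k = max_j |pi_{jk}|
    r = (w @ pis) / pi_star                        # normalized edges r_k in [-1, 1]
    iota = int(np.argmax(np.abs(r)))               # Step 1: pick a weak feature
    r_star = r[iota]
    if reg == "ridge":                             # Step 2: clip only for Omega = ||.||^2_Gamma
        if abs(r_star) > gamma:
            r_star = np.sign(r_star) * gamma
    return iota, r_star
```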

Algorithm $\Omega$-R.AdaBoost
Input: set of rados $\mathcal{S}_r \doteq \{\pi_1, \pi_2, ..., \pi_n\}$; parameters $T \in \mathbb{N}_*$, $\gamma \in (0, 1)$, $\omega \in \mathbb{R}_+$;
Step 1: let $\theta_0 \leftarrow 0$, $w_0 \leftarrow (1/n) 1$;
Step 2: for $t = 1, 2, ..., T$
Step 2.1: call the weak learner: $(\iota(t), r_t) \leftarrow \Omega\text{-wl}(\mathcal{S}_r, w_t, \gamma, \omega, \theta_{t-1})$;
Step 2.2: update parameters $\alpha_{\iota(t)}$ and $\delta_t$ (with $\pi^\star_k \doteq \max_j |\pi_{jk}|$):
$\alpha_{\iota(t)} \leftarrow (1 / (2 \pi^\star_{\iota(t)})) \log((1 + r_t) / (1 - r_t))$,
$\delta_t \leftarrow \omega \cdot (\Omega(\theta_t) - \Omega(\theta_{t-1}))$;
Step 2.3: update and normalize weights: for $j = 1, 2, ..., n$,
$w_{tj} \leftarrow w_{(t-1)j} \cdot \exp(-\alpha_{\iota(t)} \pi_{j \iota(t)} + \delta_t) / Z_t$;
Return $\theta_T$;

Here $\theta_t \doteq \sum_{t'=1}^{t} \alpha_{\iota(t')} \cdot 1_{\iota(t')}$, with $1_k$ the $k$-th canonical basis vector of $\mathbb{R}^d$.

Code available at: http://users.cecs.anu.edu.au/~rnock/
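For completeness, a toy reimplementation of the main loop under the same assumptions (a sketch using the `omega_wl` helper above, not the code released at the URL; the lasso is used as the regularizer):

```python
import numpy as np

def omega_r_adaboost(pis, T, gamma, omega, reg="lasso"):
    """Sketch of Omega-R.AdaBoost on a rado set `pis` of shape (n, d)."""
    n, d = pis.shape
    theta = np.zeros(d)                                        # theta_0
    w = np.full(n, 1.0 / n)                                    # w_0 = (1/n) 1
    Omega = lambda t: np.abs(t).sum()                          # lasso; swap in another regularizer if desired
    for _ in range(T):
        iota, r = omega_wl(pis, w, gamma, omega, theta, reg)   # Step 2.1
        pi_star = np.max(np.abs(pis[:, iota]))
        alpha = np.log((1 + r) / (1 - r)) / (2 * pi_star)      # Step 2.2: leveraging coefficient
        omega_old = Omega(theta)
        theta[iota] += alpha                                   # theta_t = theta_{t-1} + alpha * 1_iota
        delta = omega * (Omega(theta) - omega_old)             # delta_t
        w = w * np.exp(-alpha * pis[:, iota] + delta)          # Step 2.3
        w = w / w.sum()                                        # normalization by Z_t
    return theta
```

With $\omega = 0$, a standard AdaBoost-style argument shows each round multiplies the empirical exp-rado loss $\mathbb{E}_I[\exp(-\theta^\top \pi_I)]$ by at most $\sqrt{1 - r_t^2}$, so it is non-increasing in $T$.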

$\gamma_{\mathrm{wl}}$-Weak Learning Assumption: $\Omega$-wl meets the $\gamma_{\mathrm{wl}}$-WLA iff all $r_{\iota(t)} = \frac{1}{\pi^\star_{\iota(t)}} \sum_{j=1}^{n} w_{tj} \pi_{j \iota(t)}$ ($\in [-1, 1]$) satisfy $|r_{\iota(t)}| \geq \gamma_{\mathrm{wl}}$.

Theorem: if the $\gamma_{\mathrm{wl}}$-WLA holds, then after $T$ boosting rounds the exp-rado loss satisfies $\ell_{\exp}(\mathcal{R}^{\Omega, \theta}_{\mathcal{U}}, \theta) \leq \exp(-a \gamma_{\mathrm{wl}}^2 T / 2)$, with, for the applicable regularizers $\Omega$: for $\Omega(\theta) = \|\theta\|_\Phi$ (SLOPE), $q \geq 2 \cdot \max_k \{(1 - \Phi(3 \gamma_{\mathrm{wl}} / (11 \cdot \max_j |\pi_{jk}|))) \cdot (k/d)\}$ and $a \doteq \min\{3 \gamma_{\mathrm{wl}} / 11, \ \Phi^{-1}(1 - q/(2d)) / \min_k \max_j |\pi_{jk}|\}$ ($q$ is tied to the False Discovery Rate in SLOPE); for $\Omega(\theta) = \|\theta\|^2_\Gamma$ (ridge), $\omega < (2 a \min_k \max_j \pi^2_{jk}) / (T \lambda_\Gamma)$ with $0 < a < 1/5$, where $\lambda_\Gamma$ is the maximum eigenvalue of $\Gamma$.

Experiments (see paper for all $\Omega$)

Columns: domain, $m$, $d$, then test error (err $\pm \sigma$) for AdaBoost, for $\Omega$-R.AdaBoost without regularization ($\omega = 0$), and for regularized $\Omega$-R.AdaBoost with $\Omega = \|.\|^2_{\mathrm{Id}}$, $\Omega = \|.\|_1$, $\Omega = \|.\|_\infty$ and $\Omega = \|.\|_\Phi$ (each followed by a $\Delta$ column).

Fertility 100 9 40.00±14.1 40.00±14.9 41.00±16.6 8.00 • 41.00±14.5 4.00 ◦ 41.00±21.3 6.00 38.00±14.0 7.00
Sonar 208 60 24.57±9.11 • 27.88±4.33 25.05±7.56 8.14 24.05±8.41 4.83 24.52±8.65 10.12 ◦ 25.00±13.4 3.83
Spectf 267 44 45.67±11.0 ◦ 44.96±8.27 • 43.43±11.7 3.35 44.57±12.5 1.85 43.09±11.0 1.85 43.79±13.9 3.05
Ionosphere 351 33 13.11±6.36 • 14.51±7.36 13.64±5.99 5.43 14.24±6.15 2.83 ◦ 13.38±4.44 3.15 ◦ 14.25±5.04 3.41
Breastwisc 699 9 3.00±1.96 3.43±2.25 2.57±1.62 1.14 ◦ 3.29±2.24 0.86 2.86±2.13 0.86 • 3.00±2.18 0.29
Transfusion 748 4 39.17±7.01 ◦ 37.97±7.42 37.57±5.60 2.40 ◦ 36.50±6.78 2.14 ◦ 37.43±8.08 1.21 • 36.10±8.06 3.21
Qsar 1 055 31 22.09±3.73 24.37±4.06 22.47±3.84 3.41 • 23.13±2.74 2.75 • 23.23±3.64 2.64 ◦ 23.80±3.79 2.55
Hill-nonoise 1 212 100 47.52±5.14 45.63±6.68 45.71±6.61 0.33 45.46±6.88 0.66 ◦ 45.87±6.69 0.49 • 45.62±7.26 0.42
Hill-noise 1 212 100 47.61±3.48 ◦ 45.05±2.98 44.80±2.86 0.99 44.97±3.37 0.66 44.88±2.82 0.66 • 44.64±3.26 0.74
Winered 1 599 11 26.33±2.75 ◦ 28.02±3.32 • 27.83±3.95 1.19 • 27.45±4.17 1.00 ◦ 27.58±3.76 1.12 • 27.45±3.34 1.25
Abalone 4 177 10 22.98±2.70 • 26.57±2.31 ◦ 24.18±2.51 0.00 24.13±2.48 0.14 ◦ 24.18±2.51 0.00 24.11±2.59 0.07
Statlog 4 435 36 4.49±0.61 22.41±2.20 • 4.71±0.82 0.25 ◦ 20.43±1.89 0.23 4.69±0.72 0.45 20.00±1.80 0.18
Winewhite 4 898 11 30.73±2.20 • 32.63±2.52 • 31.85±1.66 1.18 32.16±1.73 1.31 32.16±2.02 0.90 ◦ 31.97±2.26 1.12
Smartphone 7 352 561 0.00±0.00 ◦ 0.67±0.25 0.19±0.22 0.00 ◦ 0.44±0.29 0.03 • 0.20±0.24 0.01 0.19±0.22 0.04
Firmteacher 10 800 16 44.44±1.34 40.58±4.87 • 40.89±3.95 2.35 39.81±4.37 2.89 ◦ 38.91±4.51 3.56 ◦ 38.01±6.15 5.02
Eeg 14 980 14 45.38±2.04 • 44.09±2.32 ◦ 44.01±1.48 0.40 • 43.89±2.19 0.89 ◦ 44.07±2.02 0.81 • 43.87±1.40 0.95
Magic 19 020 10 21.07±1.09 ◦ 37.51±0.46 • 22.11±1.32 0.28 ◦ 26.41±1.08 0.00 23.00±1.71 0.66 ◦ 26.41±1.08 0.00
Hardware 28 179 96 16.77±0.73 ◦ 9.41±0.71 6.43±0.74 0.18 ◦ 11.72±1.24 0.41 • 6.50±0.67 0.10 6.42±0.69 0.13
Marketing 45 211 27 30.68±1.01 27.70±0.69 27.33±0.73 0.33 ◦ 28.02±0.47 0.00 • 27.19±0.87 0.51 ◦ 28.02±0.47 0.00
Kaggle 120 269 10 47.80±0.47 • 39.22±8.47 ◦ 16.90±0.51 0.00 16.90±0.51 0.00 16.89±0.50 0.01 16.90±0.51 0.00

• = sparsest solution, ◦ = least sparse (among all $\Omega$-R.AdaBoost); shading in the original poster marks the best among $\Omega$-R.AdaBoost.

Applicable regularizers, with user-fixed $\omega \in \mathbb{R}_+$ (and, below, user-fixed $\Gamma$ and $\xi$):

$\Omega(\theta) = \omega \cdot \|\theta\|_1$, with $\|\theta\|_1 \doteq |\theta|^\top 1$ (lasso);
$\Omega(\theta) = \omega \cdot \|\theta\|^2_\Gamma$, with $\|\theta\|^2_\Gamma \doteq \theta^\top \Gamma \theta$ (ridge);
$\Omega(\theta) = \omega \cdot \|\theta\|_\infty$, with $\|\theta\|_\infty \doteq \max_k |\theta_k|$ ($\ell_\infty$);
$\Omega(\theta) = \omega \cdot \|\theta\|_\Phi$, with $\|\theta\|_\Phi \doteq \max_{M \in \mathcal{S}_d} (M |\theta|)^\top \xi$ (SLOPE), where $\mathcal{S}_d$ is the set of $d \times d$ permutation matrices and $\xi$ the SLOPE coefficient sequence.
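Written out in code (a sketch; for SLOPE, the maximum over permutation matrices is attained by matching the sorted $|\theta|$ with the coefficients $\xi$ sorted the same way, by the rearrangement inequality):

```python
import numpy as np

def regularizer(theta, kind, omega=1.0, Gamma=None, xi=None):
    """Omega(theta) = omega * ||theta|| for the four applicable norms."""
    a = np.abs(theta)
    if kind == "lasso":                               # ||theta||_1 = |theta|^T 1
        val = a.sum()
    elif kind == "ridge":                             # ||theta||^2_Gamma = theta^T Gamma theta
        G = np.eye(theta.size) if Gamma is None else Gamma
        val = theta @ G @ theta
    elif kind == "linf":                              # ||theta||_inf = max_k |theta_k|
        val = a.max()
    elif kind == "slope":                             # max_M (M|theta|)^T xi over permutations M
        val = np.sort(a)[::-1] @ np.sort(np.asarray(xi))[::-1]
    else:
        raise ValueError(kind)
    return omega * val
```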

Rados vs examples (see paper for the non-convex, non-differentiable losses).