

On Regularizing Rademacher Observation Losses
Richard Nock

Q: Why care?
A: Two recent papers show, with two different proofs, that two popular example losses (the log loss and the square loss) are equivalent to Rademacher observation (rado) losses for linear classifiers. Rados help (i) to protect the privacy of the examples and (ii) to learn from distributed data without entity matching. But why these two losses? Is anything hidden; are there more rado applications?
Contribution: (a) a general theory of equivalent example and rado losses, built on an adversarial formulation of loss functions, yielding more equivalences; (b) an extension to regularization: any regularization of the example loss transfers to the rados (i.e. to the data) in the rado loss and does not alter the rado loss itself; (c) a formal boosting algorithm for the exp-rado loss with various regularizers (ridge, lasso, SLOPE, $\ell_\infty$, ...), which boosts the log loss over examples with the same regularizers.

Rados & losses

Examples: $S = \{(x_i, y_i)\}_{i \in [m]}$, with $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$; write $\Sigma_m \doteq \{-1, 1\}^m$.

Rademacher observation (= 1 rado): $\pi_I \doteq \sum_{i \in I} y_i \cdot x_i$, for $I \in \mathcal{U}$, $\mathcal{U} \subseteq 2^{[m]}$. There are $2^m$ possible rados; the learner receives a rado set $\mathcal{R}_{\mathcal{U}}$ with $|\mathcal{U}| = n$. The subsets $I$ can be (non) random, (non) i.i.d., learned from data, etc.

For a linear classifier $\theta \in \mathbb{R}^d$, let $z_i \doteq \theta^\top (y_i \cdot x_i)$.
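To make the definitions concrete, here is a minimal NumPy sketch (toy data and subset choices are made up for illustration) that builds a rado set $\mathcal{R}_{\mathcal{U}}$ from a sample $S$ and a collection $\mathcal{U}$ of index subsets:

```python
import numpy as np

def rados(X, y, subsets):
    """One Rademacher observation pi_I = sum_{i in I} y_i * x_i per subset I."""
    return np.array([(y[list(I)][:, None] * X[list(I)]).sum(axis=0) for I in subsets])

# toy data, for illustration only
rng = np.random.default_rng(0)
m, d = 6, 3
X = rng.normal(size=(m, d))            # examples x_i in R^d
y = rng.choice([-1, 1], size=m)        # labels y_i in {-1, 1}

U = [(0, 1, 2), (1, 3), (0, 2, 4, 5)]  # n = |U| = 3 subsets of [m]
R_U = rados(X, y, U)                   # rado set, shape (3, d)
```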

Example game vs. rado game: the adversary fits the weights, the learner builds $\theta$ to maximize. The adversary's objectives are

$L_e(p, z) \doteq \sum_{i \in [m]} p_i z_i + \mu_e \sum_{i \in [m]} \varphi_e(p_i)$ (example game),
$L_r(q, z) \doteq \sum_{I \subseteq [m]} q_I \sum_{i \in I} z_i + \mu_r \sum_{I \subseteq [m]} \varphi_r(q_I)$ (rado game),

minimized at $p^\star(z) \doteq \arg\min_{p \in \mathbb{R}^m} L_e(p, z)$ and $q^\star(z) \doteq \arg\min_{q \in \mathcal{H}_{2^m}} L_r(q, z)$, where $\mathcal{H}_{2^m} \doteq \{q \in \mathbb{R}^{2^m} : 1^\top q = 1\}$. Write $L_e(z) \doteq L_e(p^\star(z), z)$ and $L_r(z) \doteq L_r(q^\star(z), z)$. The two games are tied through the condition

(1)  $p^\star(z) = G_m \, q^\star(z)$,

where $G_m \in \{0, 1\}^{m \times 2^m}$ is defined recursively by $G_1 \doteq [0 \ 1]$ and $G_m \doteq \begin{bmatrix} 0^\top_{2^{m-1}} & 1^\top_{2^{m-1}} \\ G_{m-1} & G_{m-1} \end{bmatrix}$.

Theorem: when $\varphi_e$ and $\varphi_r$ are strictly convex and differentiable, (1) holds iff $\varphi_e(z) = (\mu_r / \mu_e) \cdot \varphi_{s(r)}(z) + (b / \mu_e)$, where $\varphi_{s(r)}(z) \doteq \varphi_r(z) + \varphi_r(1 - z)$.

Theorem: if (1) holds, then $L_e(z) = L_r(z) + b$ (for some $b \in \mathbb{R}$).
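To see what (1) says, note that the columns of $G_m$ are the 0/1 indicator vectors of all subsets $I \subseteq [m]$, so (1) reads $p^\star_i = \sum_{I \ni i} q^\star_I$. A small sketch of the recursive construction (my own helper, written only to illustrate the definition):

```python
import numpy as np

def G(m):
    """Build G_m: G_1 = [0 1]; G_m = [[0...0 1...1], [G_{m-1} G_{m-1}]]."""
    g = np.array([[0, 1]])
    for _ in range(m - 1):
        k = g.shape[1]
        top = np.concatenate([np.zeros(k, dtype=int), np.ones(k, dtype=int)])
        g = np.vstack([top, np.hstack([g, g])])
    return g                      # shape (m, 2**m); columns = subset indicators

print(G(3))
```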

Interesting case (2): there exist increasing $f_e, f_r$ such that $-L_e(z) = f_e(\ell_e(z))$ and $-L_r(z) = f_r(\ell_r(z))$, with $\ell_e$ defined over the coordinates of $z$ (an example loss) and $\ell_r$ defined over subset sums of the coordinates of $z$ (a rado loss). In this case, $\ell_e$ and $\ell_r$ are equivalent.

Example loss $\ell_e(S, \theta)$ and matching rado loss $\ell_r(\mathcal{R}_{\mathcal{U}}, \theta)$:

$\ell_{\log} = \mathbb{E}_i[\log(1 + \exp(-\theta^\top (y_i \cdot x_i)))]$  ↔  $\ell_{\exp} = \mathbb{E}_I[\exp(-\theta^\top \pi_I)]$
$\ell_{\mathrm{sql}} = \mathbb{E}_i[(1 - \theta^\top (y_i \cdot x_i))^2]$  ↔  $\ell_M = -(\mathbb{E}_I[\theta^\top \pi_I] - (1/2) \cdot \mathbb{V}_I[\theta^\top \pi_I])$
$\ell_H = \mathbb{E}_i[\max\{0, -\theta^\top (y_i \cdot x_i)\}]$  ↔  $\ell_a = \max\{0, \max_{I \subseteq [m]} \{-\theta^\top \pi_I\}\}$
$\ell_U = \mathbb{E}_i[-\theta^\top (y_i \cdot x_i)]$  ↔  $\ell_{U'} = \mathbb{E}_I[-\theta^\top \pi_I]$

Theorem: every pair $(\ell_e, \ell_r)$ in the table is equivalent, that is, there exists a strictly increasing $f : \mathbb{R} \to \mathbb{R}$ such that $\ell_e(S, \theta) = f(\ell_r(\mathcal{R}_{2^{[m]}}, \theta))$, $\forall S$, $\forall \theta$. Generalizes Nock et al. (ICML'15) and Patrini et al. (IJCAI'16).
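For the pair $(\ell_{\log}, \ell_{\exp})$ the monotone link can be written down: expanding $\prod_i (1 + e^{-z_i}) = \sum_{I \subseteq [m]} e^{-\sum_{i \in I} z_i}$ gives $\ell_{\log}(S, \theta) = \log 2 + (1/m) \log \ell_{\exp}(\mathcal{R}_{2^{[m]}}, \theta)$, which matches the theorem's strictly increasing $f$. A small numerical sanity check (my own script, on made-up data):

```python
import numpy as np
from itertools import chain, combinations

rng = np.random.default_rng(1)
m, d = 8, 4
X = rng.normal(size=(m, d))
y = rng.choice([-1, 1], size=m)
theta = rng.normal(size=d)

z = (y[:, None] * X) @ theta                       # z_i = theta^T (y_i x_i)
ell_log = np.mean(np.log1p(np.exp(-z)))            # example log loss

# exp-rado loss over the full rado set R_{2^[m]} (2^m rados)
all_I = chain.from_iterable(combinations(range(m), k) for k in range(m + 1))
pis = np.array([(y[list(I)][:, None] * X[list(I)]).sum(axis=0) for I in all_I])
ell_exp = np.mean(np.exp(-pis @ theta))

print(ell_log, np.log(2) + np.log(ell_exp) / m)    # the two values coincide
```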

Theorem: if (1) holds and $f_e(z) = a_e \cdot z + b_e$ in (2), then the regularized example loss

$\ell_e(S, \theta, \Omega) \doteq \ell_e(S, \theta) + \Omega(\theta)$

is equivalent to the rado loss $\ell_r(\mathcal{R}^{\Omega, \theta}_{2^{[m]}}, \theta)$, where

$\mathcal{R}^{\Omega, \theta}_{2^{[m]}} \doteq \left\{ \pi_I - \frac{a_e \Omega(\theta)}{\|\theta\|_2^2} \cdot \theta \, : \, \pi_I \in \mathcal{R}_{2^{[m]}} \right\}$.

Regularizing the example loss thus transfers entirely to the rados, i.e. to the data: the rado loss itself is unchanged, only its input rados are shifted along $\theta$.
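A minimal sketch of this data shift (the scalar $a_e$ depends on the loss pair and is left as a parameter here; the value $\Omega(\theta)$ is passed in precomputed):

```python
import numpy as np

def regularized_rados(pis, theta, omega_of_theta, a_e):
    """R^{Omega,theta}: pi_I - (a_e * Omega(theta) / ||theta||_2^2) * theta, for every rado."""
    shift = a_e * omega_of_theta / np.dot(theta, theta)
    return pis - shift * theta             # broadcasts the same shift over all rados

# example with Omega(theta) = ||theta||_2^2 (ridge with Gamma = Id), a_e set to 1 for illustration
rng = np.random.default_rng(2)
pis = rng.normal(size=(5, 3))              # 5 toy rados in R^3
theta = rng.normal(size=3)
shifted = regularized_rados(pis, theta, omega_of_theta=theta @ theta, a_e=1.0)
```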

Boosting with variously regularized rado losses

One master boosting routine fits all applicable regularizers.

Algorithm $\Omega$-wl, for $\Omega \in \{\|.\|_1, \|.\|^2_\Gamma, \|.\|_\infty, \|.\|_\Phi\}$
Input: set of rados $\mathcal{S}_r \doteq \{\pi_1, \pi_2, ..., \pi_n\}$; weights $w \in \triangle_n$; parameters $\gamma \in (0, 1)$, $\omega \in \mathbb{R}_+$; classifier $\theta$;
Step 1: pick weak feature $\iota^\star \in [d]$;
Step 2: if $\Omega = \|.\|^2_\Gamma$ then $r^\star \leftarrow r_{\iota^\star}$ if $r_{\iota^\star} \in [-\gamma, \gamma]$, and $r^\star \leftarrow \mathrm{sign}(r_{\iota^\star}) \cdot \gamma$ otherwise; else $r^\star \leftarrow r_{\iota^\star}$;
Return $(\iota^\star, r^\star)$;
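A literal NumPy reading of $\Omega$-wl (a sketch only: how Step 1 picks the weak feature is not specified above, so this version picks the feature with the largest normalized edge $|r_k|$, computed as in the weak learning assumption further down; $\omega$ and $\theta$ are accepted to match the call signature but unused here):

```python
import numpy as np

def omega_wl(pis, w, gamma, omega, theta, reg="lasso"):
    """Weak learner: return a feature index iota* and its (possibly clipped) edge r*."""
    pi_star = np.max(np.abs(pis), axis=0)          # pi*_k = max_j |pi_{jk}|
    r = (w @ pis) / pi_star                        # normalized edges r_k in [-1, 1]
    iota = int(np.argmax(np.abs(r)))               # Step 1: pick a weak feature
    r_star = r[iota]
    if reg == "ridge":                             # Step 2: clip only for Omega = ||.||^2_Gamma
        if abs(r_star) > gamma:
            r_star = np.sign(r_star) * gamma
    return iota, r_star
```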

Algorithm $\Omega$-R.AdaBoost
Input: set of rados $\mathcal{S}_r \doteq \{\pi_1, \pi_2, ..., \pi_n\}$; parameters $T \in \mathbb{N}_*$, $\gamma \in (0, 1)$, $\omega \in \mathbb{R}_+$;
Step 1: let $\theta_0 \leftarrow 0$, $w_0 \leftarrow (1/n) 1$;
Step 2: for $t = 1, 2, ..., T$
Step 2.1: call the weak learner: $(\iota(t), r_t) \leftarrow \Omega\text{-wl}(\mathcal{S}_r, w_t, \gamma, \omega, \theta_{t-1})$;
Step 2.2: update parameters $\alpha_{\iota(t)}$ and $\delta_t$ (with $\pi^\star_k \doteq \max_j |\pi_{jk}|$):
$\alpha_{\iota(t)} \leftarrow (1 / (2 \pi^\star_{\iota(t)})) \log((1 + r_t) / (1 - r_t))$,
$\delta_t \leftarrow \omega \cdot (\Omega(\theta_t) - \Omega(\theta_{t-1}))$;
Step 2.3: update and normalize weights: for $j = 1, 2, ..., n$,
$w_{tj} \leftarrow w_{(t-1)j} \cdot \exp(-\alpha_{\iota(t)} \pi_{j \iota(t)} + \delta_t) / Z_t$;
Return $\theta_T$;

Here $\theta_t \doteq \sum_{t'=1}^{t} \alpha_{\iota(t')} \cdot 1_{\iota(t')}$, with $1_k$ the $k$-th canonical basis vector of $\mathbb{R}^d$.

Code available at: http://users.cecs.anu.edu.au/~rnock/
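For completeness, a toy reimplementation of the main loop under the same assumptions (a sketch using the `omega_wl` helper above, not the code released at the URL; the lasso is used as the regularizer):

```python
import numpy as np

def omega_r_adaboost(pis, T, gamma, omega, reg="lasso"):
    """Sketch of Omega-R.AdaBoost on a rado set `pis` of shape (n, d)."""
    n, d = pis.shape
    theta = np.zeros(d)                                        # theta_0
    w = np.full(n, 1.0 / n)                                    # w_0 = (1/n) 1
    Omega = lambda t: np.abs(t).sum()                          # lasso; swap in another regularizer if desired
    for _ in range(T):
        iota, r = omega_wl(pis, w, gamma, omega, theta, reg)   # Step 2.1
        pi_star = np.max(np.abs(pis[:, iota]))
        alpha = np.log((1 + r) / (1 - r)) / (2 * pi_star)      # Step 2.2: leveraging coefficient
        omega_old = Omega(theta)
        theta[iota] += alpha                                   # theta_t = theta_{t-1} + alpha * 1_iota
        delta = omega * (Omega(theta) - omega_old)             # delta_t
        w = w * np.exp(-alpha * pis[:, iota] + delta)          # Step 2.3
        w = w / w.sum()                                        # normalization by Z_t
    return theta
```

With $\omega = 0$, a standard AdaBoost-style argument shows each round multiplies the empirical exp-rado loss $\mathbb{E}_I[\exp(-\theta^\top \pi_I)]$ by at most $\sqrt{1 - r_t^2}$, so it is non-increasing in $T$.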

$\gamma_{\mathrm{wl}}$-Weak Learning Assumption: $\Omega$-wl meets the $\gamma_{\mathrm{wl}}$-WLA iff all $r_{\iota(t)} = \frac{1}{\pi^\star_{\iota(t)}} \sum_{j=1}^{n} w_{tj} \pi_{j \iota(t)}$ ($\in [-1, 1]$) satisfy $|r_{\iota(t)}| \geq \gamma_{\mathrm{wl}}$.

Theorem: if the $\gamma_{\mathrm{wl}}$-WLA holds, then after $T$ boosting rounds the exp-rado loss satisfies $\ell_{\exp}(\mathcal{R}^{\Omega, \theta}_{\mathcal{U}}, \theta) \leq \exp(-a \gamma_{\mathrm{wl}}^2 T / 2)$, with, for the applicable regularizers $\Omega$: for $\Omega(\theta) = \|\theta\|_\Phi$ (SLOPE), $q \geq 2 \cdot \max_k \{(1 - \Phi(3 \gamma_{\mathrm{wl}} / (11 \cdot \max_j |\pi_{jk}|))) \cdot (k/d)\}$ and $a \doteq \min\{3 \gamma_{\mathrm{wl}} / 11, \ \Phi^{-1}(1 - q/(2d)) / \min_k \max_j |\pi_{jk}|\}$ ($q$ is tied to the False Discovery Rate in SLOPE); for $\Omega(\theta) = \|\theta\|^2_\Gamma$ (ridge), $\omega < (2 a \min_k \max_j \pi^2_{jk}) / (T \lambda_\Gamma)$ with $0 < a < 1/5$, where $\lambda_\Gamma$ is the maximum eigenvalue of $\Gamma$.

Experiments (see paper for all $\Omega$)

Columns: domain, $m$, $d$, then test error (err $\pm \sigma$) for AdaBoost, for $\Omega$-R.AdaBoost without regularization ($\omega = 0$), and for regularized $\Omega$-R.AdaBoost with $\Omega = \|.\|^2_{\mathrm{Id}}$, $\Omega = \|.\|_1$, $\Omega = \|.\|_\infty$ and $\Omega = \|.\|_\Phi$ (each followed by a $\Delta$ column).

Fertility 100 9 40.00±14.1 40.00±14.9 41.00±16.6 8.00 • 41.00±14.5 4.00 ◦ 41.00±21.3 6.00 38.00±14.0 7.00
Sonar 208 60 24.57±9.11 • 27.88±4.33 25.05±7.56 8.14 24.05±8.41 4.83 24.52±8.65 10.12 ◦ 25.00±13.4 3.83
Spectf 267 44 45.67±11.0 ◦ 44.96±8.27 • 43.43±11.7 3.35 44.57±12.5 1.85 43.09±11.0 1.85 43.79±13.9 3.05
Ionosphere 351 33 13.11±6.36 • 14.51±7.36 13.64±5.99 5.43 14.24±6.15 2.83 ◦ 13.38±4.44 3.15 ◦ 14.25±5.04 3.41
Breastwisc 699 9 3.00±1.96 3.43±2.25 2.57±1.62 1.14 ◦ 3.29±2.24 0.86 2.86±2.13 0.86 • 3.00±2.18 0.29
Transfusion 748 4 39.17±7.01 ◦ 37.97±7.42 37.57±5.60 2.40 ◦ 36.50±6.78 2.14 ◦ 37.43±8.08 1.21 • 36.10±8.06 3.21
Qsar 1 055 31 22.09±3.73 24.37±4.06 22.47±3.84 3.41 • 23.13±2.74 2.75 • 23.23±3.64 2.64 ◦ 23.80±3.79 2.55
Hill-nonoise 1 212 100 47.52±5.14 45.63±6.68 45.71±6.61 0.33 45.46±6.88 0.66 ◦ 45.87±6.69 0.49 • 45.62±7.26 0.42
Hill-noise 1 212 100 47.61±3.48 ◦ 45.05±2.98 44.80±2.86 0.99 44.97±3.37 0.66 44.88±2.82 0.66 • 44.64±3.26 0.74
Winered 1 599 11 26.33±2.75 ◦ 28.02±3.32 • 27.83±3.95 1.19 • 27.45±4.17 1.00 ◦ 27.58±3.76 1.12 • 27.45±3.34 1.25
Abalone 4 177 10 22.98±2.70 • 26.57±2.31 ◦ 24.18±2.51 0.00 24.13±2.48 0.14 ◦ 24.18±2.51 0.00 24.11±2.59 0.07
Statlog 4 435 36 4.49±0.61 22.41±2.20 • 4.71±0.82 0.25 ◦ 20.43±1.89 0.23 4.69±0.72 0.45 20.00±1.80 0.18
Winewhite 4 898 11 30.73±2.20 • 32.63±2.52 • 31.85±1.66 1.18 32.16±1.73 1.31 32.16±2.02 0.90 ◦ 31.97±2.26 1.12
Smartphone 7 352 561 0.00±0.00 ◦ 0.67±0.25 0.19±0.22 0.00 ◦ 0.44±0.29 0.03 • 0.20±0.24 0.01 0.19±0.22 0.04
Firmteacher 10 800 16 44.44±1.34 40.58±4.87 • 40.89±3.95 2.35 39.81±4.37 2.89 ◦ 38.91±4.51 3.56 ◦ 38.01±6.15 5.02
Eeg 14 980 14 45.38±2.04 • 44.09±2.32 ◦ 44.01±1.48 0.40 • 43.89±2.19 0.89 ◦ 44.07±2.02 0.81 • 43.87±1.40 0.95
Magic 19 020 10 21.07±1.09 ◦ 37.51±0.46 • 22.11±1.32 0.28 ◦ 26.41±1.08 0.00 23.00±1.71 0.66 ◦ 26.41±1.08 0.00
Hardware 28 179 96 16.77±0.73 ◦ 9.41±0.71 6.43±0.74 0.18 ◦ 11.72±1.24 0.41 • 6.50±0.67 0.10 6.42±0.69 0.13
Marketing 45 211 27 30.68±1.01 27.70±0.69 27.33±0.73 0.33 ◦ 28.02±0.47 0.00 • 27.19±0.87 0.51 ◦ 28.02±0.47 0.00
Kaggle 120 269 10 47.80±0.47 • 39.22±8.47 ◦ 16.90±0.51 0.00 16.90±0.51 0.00 16.89±0.50 0.01 16.90±0.51 0.00

• = sparsest solution, ◦ = least sparse (among all $\Omega$-R.AdaBoost); shading in the original poster marks the best among $\Omega$-R.AdaBoost.

Applicable regularizers, with user-fixed $\omega \in \mathbb{R}_+$ (and, below, user-fixed $\Gamma$ and $\xi$):

$\Omega(\theta) = \omega \cdot \|\theta\|_1$, with $\|\theta\|_1 \doteq |\theta|^\top 1$ (lasso);
$\Omega(\theta) = \omega \cdot \|\theta\|^2_\Gamma$, with $\|\theta\|^2_\Gamma \doteq \theta^\top \Gamma \theta$ (ridge);
$\Omega(\theta) = \omega \cdot \|\theta\|_\infty$, with $\|\theta\|_\infty \doteq \max_k |\theta_k|$ ($\ell_\infty$);
$\Omega(\theta) = \omega \cdot \|\theta\|_\Phi$, with $\|\theta\|_\Phi \doteq \max_{M \in \mathcal{S}_d} (M |\theta|)^\top \xi$ (SLOPE), where $\mathcal{S}_d$ is the set of $d \times d$ permutation matrices and $\xi$ the SLOPE coefficient sequence.
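Written out in code (a sketch; for SLOPE, the maximum over permutation matrices is attained by matching the sorted $|\theta|$ with the coefficients $\xi$ sorted the same way, by the rearrangement inequality):

```python
import numpy as np

def regularizer(theta, kind, omega=1.0, Gamma=None, xi=None):
    """Omega(theta) = omega * ||theta|| for the four applicable norms."""
    a = np.abs(theta)
    if kind == "lasso":                               # ||theta||_1 = |theta|^T 1
        val = a.sum()
    elif kind == "ridge":                             # ||theta||^2_Gamma = theta^T Gamma theta
        G = np.eye(theta.size) if Gamma is None else Gamma
        val = theta @ G @ theta
    elif kind == "linf":                              # ||theta||_inf = max_k |theta_k|
        val = a.max()
    elif kind == "slope":                             # max_M (M|theta|)^T xi over permutations M
        val = np.sort(a)[::-1] @ np.sort(np.asarray(xi))[::-1]
    else:
        raise ValueError(kind)
    return omega * val
```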

Rados vs examples (see paper for the non-convex, non-differentiable losses).