Generalization Error of Linear Neural Networks in an Empirical Bayes Approach


Page 1: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Generalization Error of Linear Neural Networks in an Empirical Bayes Approach

Shinichi Nakajima, Sumio Watanabe

Tokyo Institute of Technology
Nikon Corporation

Page 2: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Contents

- Backgrounds
  - Regular models
  - Unidentifiable models
  - Superiority of Bayes to ML
  - What's the purpose?
- Setting
  - Model
  - Subspace Bayes (SB) approach
- Analysis
  - (James-Stein estimator)
  - Solution
  - Generalization error
- Discussion & Conclusions

Page 3: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Conventional Learning Theory: Regular Models

(Asymptotically) normal likelihood for ANY true parameter.

Regular models: det(Fisher information) > 0 everywhere.
- Mean estimation
- Linear regression: $y = \sum_{k=1}^{K} a_k x_k$

1. Asymptotic normalities of the distribution of the ML estimator and of the Bayes posterior
   -> Model selection methods (AIC, BIC, MDL):
      $\mathrm{AIC} = 2\hat{G}(n) + 2K$,   $\mathrm{BIC},\ \mathrm{MDL} = 2\hat{F}(n) + K\log n$

2. Asymptotic generalization error: (ML) = (Bayes), i.e. $2\lambda = K$
   GE: $G(n) = \lambda n^{-1} + o(n^{-1})$
   FE: $F(n) = \lambda \log n + o(\log n)$

K : dimensionality of the parameter space
n : # of samples
x : input,  y : output
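As a quick numerical illustration of these two penalties (a minimal Python sketch; the data and dimensions are made up, and the noise variance is assumed known and equal to 1):

```python
# Minimal sketch: AIC and BIC/MDL penalties for a regular model (linear regression
# y = sum_k a_k x_k + noise), assuming unit-variance Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
n, K = 100, 5                                   # samples, parameter dimensionality
X = rng.normal(size=(n, K))
y = X @ rng.normal(size=K) + rng.normal(size=n)

a_ml, *_ = np.linalg.lstsq(X, y, rcond=None)    # ML estimate of the parameters
neg2loglik = np.sum((y - X @ a_ml) ** 2) + n * np.log(2 * np.pi)

aic = neg2loglik + 2 * K                        # AIC penalty: 2K
bic = neg2loglik + K * np.log(n)                # BIC/MDL penalty: K log n
print(f"AIC = {aic:.1f}, BIC = {bic:.1f}")
```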

Page 4: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Unidentifiable Models

NON-normal likelihood when the true parameter is on singularities.

Unidentifiable models: singularities exist, where det(Fisher information) = 0.
- Neural networks
- Bayesian networks
- Mixture models
- Hidden Markov models

Example: $y = \sum_{h=1}^{H} b_h a_h^{t} x$,  $a_h \in R^M$, $b_h \in R^N$, $x \in R^M$, $y \in R^N$,  H : # of components

Unidentifiable set: $\{\,a_h = 0 \ \text{or}\ b_h = 0\,\}$

[Figure: the unidentifiable set in the $(a_h, b_h)$ plane.]

1. Asymptotic normalities do NOT hold.
   -> No (penalized-likelihood-type) information criterion.

Page 5: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Superiority of Bayes to ML

Unidentifiable models: singularities exist, where det(Fisher information) = 0.
- Neural networks, Bayesian networks, mixture models, hidden Markov models

1. Asymptotic normalities do NOT hold.
   -> No (penalized-likelihood-type) information criterion.
   -> How do singularities work in learning?

2. Bayes has an advantage: G(Bayes) < G(ML).

When the true parameter is on singularities:
- The increase of the neighborhood of the true parameter accelerates overfitting: in ML, $2\lambda > K$.
- The increase of the population of parameters denoting the true distribution suppresses overfitting (only in Bayes): in Bayes, $2\lambda < K$.

[Figure: singularities in the $(a_h, b_h)$ plane.]

Page 6: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

What's the purpose?

Bayes provides good generalization, but it is expensive (needs Markov chain Monte Carlo).

Is there any approximation with good generalization and tractability?

- Variational Bayes (VB) [Hinton&vanCamp93; MacKay95; Attias99; Ghahramani&Beal00]
  -> analyzed in another paper [Nakajima&Watanabe05].
- Subspace Bayes (SB) -> analyzed here.

Page 7: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Contents

- Backgrounds
  - Regular models
  - Unidentifiable models
  - Superiority of Bayes to ML
  - What's the purpose?
- Setting
  - Model
  - Subspace Bayes (SB) approach
- Analysis
  - (James-Stein estimator)
  - Solution
  - Generalization error
- Discussion & Conclusions

Page 8: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Linear Neural Networks (LNNs)

LNN with M inputs, N outputs, and H hidden units:

$p(y|x, A, B) = \frac{1}{(2\pi)^{N/2}} \exp\left(-\frac{\|y - BAx\|^2}{2}\right)$

A : input parameter (H x M) matrix,  $A = (a_1, \ldots, a_H)^{t}$,  $a_h \in R^M$
B : output parameter (N x H) matrix,  $B = (b_1, \ldots, b_H)$,  $b_h \in R^N$

$BA = \sum_{h=1}^{H} b_h a_h^{t}$

Trivial redundancy: $BA = (BT)(T^{-1}A)$ for any non-singular T.

Essential parameter dimensionality: $K = H(M + N) - H^2$

True map: $B^*A^*$ with rank $H^*$ ($\le H$).

Generalization coefficient $\lambda$:

                               H* < H    H* = H
  ML    [Fukumizu99]           > K/2     = K/2
  Bayes [Aoyagi&Watanabe03]    < K/2     = K/2

Page 9: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Maximum Likelihood estimator [Baldi&Hornik95]

$R = \frac{1}{n}\sum_{i=1}^{n} y_i x_i^{t}$,   $Q = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^{t}$

The ML estimator is given by

$\widehat{BA}^{MLE} = \sum_{h=1}^{H} \hat{b}_h^{MLE}\hat{a}_h^{MLE\,t} + O_p(n^{-1})$,
where $\hat{b}_h^{MLE}\hat{a}_h^{MLE\,t} = \gamma_h\,\omega_{b_h}\omega_{a_h}^{t}\,Q^{-1/2}$.

Here
$\gamma_h$ : h-th largest singular value of $RQ^{-1/2}$,
$\omega_{a_h}$ : right singular vector,  $\omega_{b_h}$ : left singular vector,
i.e. $\gamma_h \omega_{b_h} = RQ^{-1/2}\omega_{a_h}$,  $\gamma_h \omega_{a_h} = Q^{-1/2}R^{t}\omega_{b_h}$.
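For concreteness, a minimal numpy sketch of this estimator (illustrative only; the dimensions, the true map, and the unit noise variance are assumptions made for the example):

```python
# Reduced-rank (rank-H) ML estimator for an LNN, via the SVD of R Q^{-1/2}.
import numpy as np

rng = np.random.default_rng(0)
M, N, H, H_true, n = 5, 4, 3, 2, 10000

# True low-rank map B*A* with rank H_true, plus unit-variance Gaussian noise.
BA_true = rng.normal(size=(N, H_true)) @ rng.normal(size=(H_true, M))
X = rng.normal(size=(n, M))
Y = X @ BA_true.T + rng.normal(size=(n, N))

Q = X.T @ X / n                                  # (1/n) sum_i x_i x_i^t
R = Y.T @ X / n                                  # (1/n) sum_i y_i x_i^t

# Symmetric inverse square root of Q.
evals, evecs = np.linalg.eigh(Q)
Q_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T

# gamma_h: singular values; omega_b_h: columns of U; omega_a_h: rows of Vt.
U, gamma, Vt = np.linalg.svd(R @ Q_inv_sqrt)

# Keep the H largest components: BA_ML = sum_h gamma_h omega_b_h omega_a_h^t Q^{-1/2}.
BA_ml = (U[:, :H] * gamma[:H]) @ Vt[:H, :] @ Q_inv_sqrt
print("||ML - true|| =", np.linalg.norm(BA_ml - BA_true))
```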

Page 10: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Bayes estimation

x : input,  y : output,  w : parameter
True: $q(y|x)$,   Learner: $p(y|x, w)$,   Prior: $\varphi(w)$
n training samples: $(X^n, Y^n) = \{(x_i, y_i);\ i = 1, \ldots, n\}$

Marginal likelihood: $Z(Y^n|X^n) = \int \prod_{i=1}^{n} p(y_i|x_i, w)\,\varphi(w)\, dw$

Posterior: $p(w|X^n, Y^n) = \frac{1}{Z(Y^n|X^n)}\,\varphi(w)\prod_{i=1}^{n} p(y_i|x_i, w)$

Predictive: $p(y|x, X^n, Y^n) = \int p(y|x, w)\, p(w|X^n, Y^n)\, dw$

In ML (or MAP): predict with one model.
In Bayes: predict with an ensemble of models.

Page 11: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Empirical Bayes (EB) approach [Efron&Morris73]

x : input,  y : output,  w : parameter,  $\tau$ : hyperparameter
True: $q(y|x)$,   Learner: $p(y|x, w)$,   Prior: $\varphi(w|\tau)$
n training samples: $(X^n, Y^n) = \{(x_i, y_i);\ i = 1, \ldots, n\}$

Marginal likelihood: $Z(Y^n|X^n, \tau) = \int \prod_{i=1}^{n} p(y_i|x_i, w)\,\varphi(w|\tau)\, dw$

The hyperparameter is estimated by maximizing the marginal likelihood:
$\hat{\tau}(X^n, Y^n) = \arg\max_{\tau} Z(Y^n|X^n, \tau)$

Posterior: $p(w|X^n, Y^n) = \frac{1}{Z(Y^n|X^n, \hat{\tau})}\,\varphi(w|\hat{\tau})\prod_{i=1}^{n} p(y_i|x_i, w)$

Predictive: $p(y|x, X^n, Y^n) = \int p(y|x, w)\, p(w|X^n, Y^n)\, dw$
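As a toy instance of this recipe (a sketch, not from the slides; it uses K-dimensional mean estimation with learner $z \sim N(w, I_K)$ and prior $\varphi(w|\tau) = N(0, \tau I_K)$, for which everything is closed-form):

```python
# Toy empirical Bayes: K-dim mean, learner z_i ~ N(w, I), prior w ~ N(0, tau I).
# The sample mean has marginal zbar ~ N(0, (tau + 1/n) I), so the maximizer of the
# marginal likelihood is tau_hat = max(0, ||zbar||^2/K - 1/n); the EB posterior
# mean then shrinks zbar toward 0 (the connection to James-Stein noted later).
import numpy as np

rng = np.random.default_rng(1)
K, n = 10, 50
w_true = np.zeros(K)                        # true mean at the origin
Z = w_true + rng.normal(size=(n, K))        # samples z_i ~ N(w_true, I)
zbar = Z.mean(axis=0)

tau_hat = max(0.0, zbar @ zbar / K - 1.0 / n)        # EB hyperparameter estimate
w_eb = tau_hat / (tau_hat + 1.0 / n) * zbar          # EB posterior mean (shrunk)
print("ML estimate :", np.round(zbar, 3))
print("EB estimate :", np.round(w_eb, 3))
```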

Page 12: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Subspace Bayes (SB) approach

SB is an EB approach in which a part of the parameters is regarded as hyperparameters.

a) MIP (Marginalizing in Input Parameter space) version
   A : parameter,  B : hyperparameter

   Learner: $p(y|x, A;\, B) = \frac{1}{(2\pi)^{N/2}}\exp\left(-\frac{\|y - BAx\|^2}{2}\right)$
   Prior:   $\varphi(A) = \frac{1}{(2\pi)^{MH/2}}\exp\left(-\frac{\operatorname{tr}(A^{t}A)}{2}\right)$

b) MOP (Marginalizing in Output Parameter space) version
   A : hyperparameter,  B : parameter

Marginalization can be done analytically in LNNs.

Page 13: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Intuitive explanation

[Figure: Bayes posterior vs. SB posterior in the $(a_h, b_h)$ plane. For a redundant component, SB optimizes the hyperparameter.]

Page 14: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Contents

- Backgrounds
  - Regular models
  - Unidentifiable models
  - Superiority of Bayes to ML
  - What's the purpose?
- Setting
  - Model
  - Subspace Bayes (SB) approach
- Analysis
  - (James-Stein estimator)
  - Solution
  - Generalization error
- Discussion & Conclusions

Page 15: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Free energy (a.k.a. evidence, stochastic complexity)

Free energy: $F(n) = -\log Z(Y^n|X^n)$

An important quantity used for model selection [Akaike80; MacKay92].

We minimize the free energy by optimizing the hyperparameter.

Page 16: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Generalization error

Generalization error: $G(n) = \big\langle D(\mathrm{True} \,\|\, \mathrm{Predictive}) \big\rangle_{q(X^n, Y^n)}$

where
  $D(q\|p)$ : Kullback-Leibler divergence between q and p
  $\langle V \rangle_{q}$ : expectation of V over q

Asymptotic expansion: $G(n) = \lambda n^{-1} + o(n^{-1})$
  $\lambda$ : generalization coefficient

In regular models $2\lambda = K$; in unidentifiable models $2\lambda \ne K$ in general.

Page 17: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

James-Stein (JS) estimator

K-dimensional mean estimation (regular model)
  $z_1, \ldots, z_n$ : samples,  $\nu$ : true mean

ML estimator (arithmetic mean): $\bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i$

Domination of $\alpha$ over $\beta$: $G_\alpha \le G_\beta$ for any true parameter, and $G_\alpha < G_\beta$ for a certain true parameter.

ML is efficient (never dominated by any unbiased estimator), but is inadmissible (dominated by a biased estimator) when $K \ge 3$ [Stein56].

James-Stein estimator [James&Stein61]:
$\hat{\nu}^{js} = \left(1 - \frac{K - 2}{n\|\bar{z}\|^2}\right)\bar{z}$

[Figure: generalization error of JS (K = 3) vs. ML as a function of the true mean.]

A certain relation between EB and JS was discussed in [Efron&Morris73].

Page 18: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Positive-part JS type (PJS) estimator

$\hat{\nu}^{pjs} = \left(1 - \frac{K - 2}{n\|\bar{z}\|^2}\right)\theta\!\left(n\|\bar{z}\|^2 > K - 2\right)\bar{z}
 = \left(1 - \frac{K - 2}{\chi}\right)\bar{z}$,  where $\chi = \max\{K - 2,\ n\|\bar{z}\|^2\}$

and $\theta(\text{event}) = 1$ if the event is true, $0$ if the event is false.

Thresholding -> model selection.

PJS is a model-selecting, shrinkage estimator.
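A minimal numpy sketch of this PJS estimator (illustrative; the sample sizes and true means below are made up, and the noise variance is assumed to be 1):

```python
# Positive-part James-Stein (PJS) estimator for a K-dimensional mean.
# It shrinks the sample mean toward 0 and sets it to exactly 0 when the signal
# is too weak; i.e. it both selects and shrinks.
import numpy as np

def pjs_estimator(Z: np.ndarray) -> np.ndarray:
    """Z: (n, K) array of samples with unit-variance noise."""
    n, K = Z.shape
    zbar = Z.mean(axis=0)
    chi = max(K - 2, n * float(zbar @ zbar))    # chi = max{K-2, n ||zbar||^2}
    return (1.0 - (K - 2) / chi) * zbar         # equals 0 when n||zbar||^2 <= K-2

rng = np.random.default_rng(2)
n, K = 20, 5
Z_null = rng.normal(size=(n, K))                 # true mean 0
Z_signal = 2.0 + rng.normal(size=(n, K))         # true mean (2,...,2)
print("true mean 0 ->", np.round(pjs_estimator(Z_null), 3))
print("true mean 2 ->", np.round(pjs_estimator(Z_signal), 3))
```

The shrinkage factor is exactly zero whenever $n\|\bar z\|^2 \le K - 2$, which is the thresholding (model selection) behavior mentioned above.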

Page 19: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Hyperparameter optimization

Assume orthonormality: $\int q(x)\, x x^{t}\, dx = I_M$   ($I_d$ : d x d identity matrix)

The free energy decomposes into per-component terms:
$F(Y^n|X^n; B) = \sum_{h=1}^{H} F_h(Y^n|X^n; b_h)$

Optimum hyperparameter value (MIP):
$\hat{b}_h = 0$ if $\gamma_h^2 \le M/n$;   $\hat{b}_h = \sqrt{\gamma_h^2/M - 1/n}\;\omega_{b_h}$ if $\gamma_h^2 > M/n$.

[Figure: $F_h$ as a function of $b_h$ in the two cases, with the minimum at $b_h = 0$ or at the nonzero optimum.]

Analytically solved in LNNs!
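A sketch of where this threshold comes from (under the assumptions above: unit noise variance, unit prior variance, orthonormal inputs; an illustrative calculation, not copied from the slides). For a single component in the MIP version, writing $b_h = \beta\,\omega_{b_h}$:

```latex
% Per-component free energy as a function of the hyperparameter magnitude beta
% (terms independent of b_h dropped):
F_h(\beta) \;\simeq\; \frac{M}{2}\log\!\left(1 + n\beta^2\right)
           \;-\; \frac{n^2 \beta^2 \gamma_h^2}{2\left(1 + n\beta^2\right)}
% Setting the derivative with respect to n\beta^2 to zero gives
% 1 + n\beta^2 = n\gamma_h^2 / M, i.e. \beta^2 = \gamma_h^2/M - 1/n,
% which is positive only when \gamma_h^2 > M/n; otherwise the minimum is at
% \beta = 0 and the component is discarded.
```

Plugging the optimum back into the posterior mean of $a_h$ reproduces (in this sketch) the $(1 - M/(n\gamma_h^2))$ shrinkage factor of Theorem 1 on the next slide.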

Page 20: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

SB solution (Theorem 1, Lemma 1)

Lemma 1: The posterior is localized, so that we can substitute the model at the SB estimator for the predictive distribution.

Theorem 1: The SB estimator is given by
$\widehat{BA}^{SB} = \sum_{h=1}^{H} \left(1 - \frac{L}{\hat{L}_h}\right)\hat{b}_h^{MLE}\hat{a}_h^{MLE\,t} + O_p(n^{-1})$,
where $\hat{L}_h = \max\{L,\ n\gamma_h^2\}$.

L : dimensionality of the marginalized subspace (per component), i.e. L = M in MIP, or L = N in MOP.

SB is asymptotically equivalent to PJS estimation (compare with $\hat{\nu}^{pjs}$ above).
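Continuing the numerical sketch from the ML-estimator slide, the SB map can be obtained by shrinking each singular component with the PJS-type factor of Theorem 1 (again illustrative; the dimensions and the true map are made up, and the MIP choice L = M is assumed):

```python
# SB estimator for an LNN as PJS-type shrinkage of the ML singular components:
# each component is multiplied by (1 - L / max{L, n gamma_h^2}).
import numpy as np

rng = np.random.default_rng(0)
M, N, H, H_true, n = 5, 4, 3, 2, 10000
L = M                                            # MIP version: marginalize A

BA_true = rng.normal(size=(N, H_true)) @ rng.normal(size=(H_true, M))
X = rng.normal(size=(n, M))
Y = X @ BA_true.T + rng.normal(size=(n, N))

Q = X.T @ X / n
R = Y.T @ X / n
evals, evecs = np.linalg.eigh(Q)
Q_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
U, gamma, Vt = np.linalg.svd(R @ Q_inv_sqrt)

L_hat = np.maximum(L, n * gamma[:H] ** 2)        # max{L, n gamma_h^2}
shrink = 1.0 - L / L_hat                         # PJS factor, small for weak components
BA_sb = (U[:, :H] * (shrink * gamma[:H])) @ Vt[:H, :] @ Q_inv_sqrt
BA_ml = (U[:, :H] * gamma[:H]) @ Vt[:H, :] @ Q_inv_sqrt

print("shrinkage factors:", np.round(shrink, 4))
print("||ML - true|| =", np.linalg.norm(BA_ml - BA_true))
print("||SB - true|| =", np.linalg.norm(BA_sb - BA_true))
```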

  

Page 21: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Generalization error (Theorem 2)

Theorem 2: The SB generalization coefficient is given by

$2\lambda = H^*(M + N - H^*) + \sum_{h=1}^{H - H^*}\left\langle \theta\!\left(\mu_h^2 > L\right)\frac{(\mu_h^2 - L)^2}{\mu_h^2} \right\rangle$,

where $\mu_h^2$ : h-th largest eigenvalue of a matrix subject to the Wishart distribution $W_{N-H^*}(M-H^*,\ I_{N-H^*})$, and $\langle\cdot\rangle$ denotes the expectation over the Wishart distribution.
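As a rough numerical sketch (illustrative; it assumes the form of Theorem 2 as written above, takes the dimensions M = 50, N = 30, H = 20 used in the results slides, and picks H* = 5 arbitrarily), the Wishart expectation can be evaluated by Monte Carlo:

```python
# Monte Carlo evaluation of the Wishart expectation in Theorem 2 (as stated above):
# sample W ~ W_{N-H*}(M-H*, I), take its H-H* largest eigenvalues mu_h^2, and
# average theta(mu_h^2 > L) (mu_h^2 - L)^2 / mu_h^2 over many draws.
import numpy as np

def sb_coefficient(M, N, H, H_star, L, n_draws=5000, seed=0):
    rng = np.random.default_rng(seed)
    dim, dof = N - H_star, M - H_star
    total = 0.0
    for _ in range(n_draws):
        G = rng.normal(size=(dim, dof))
        mu2 = np.linalg.eigvalsh(G @ G.T)[::-1][: H - H_star]   # largest H-H* eigenvalues
        total += np.sum(np.where(mu2 > L, (mu2 - L) ** 2 / mu2, 0.0))
    return H_star * (M + N - H_star) + total / n_draws          # estimate of 2*lambda

M, N, H, H_star = 50, 30, 20, 5
print("2*lambda (MIP, L = M):", round(sb_coefficient(M, N, H, H_star, L=M), 1))
print("K =", H * (M + N) - H**2)
```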

Page 22: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Large scale approximation (Theorem 3)

Theorem 3: In the large-scale limit where M, N, H, and H* go to infinity with their ratios fixed, the generalization coefficient $2\lambda$ converges to a closed-form expression written in terms of the quantities $(M - H^*)$, $(N - H^*)$, $(H - H^*)$, $L$ and of functions $J(s;\,\cdot)$ involving $\cos^{-1}$ terms (the explicit formula is given in the paper).

Page 23: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Results 1 (true rank dependence)

[Figure: $2\lambda/K$ versus the true rank $H^*$, for Bayes, SB(MIP), SB(MOP), and ML. Learner: M = 50, N = 30, H = 20.]

SB provides good generalization.

Note: this does NOT mean domination of SB over Bayes; discussion of domination needs consideration of a delicate situation (see paper).

Page 24: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Results 2 (redundant rank dependence)

[Figure: $2\lambda/K$ versus the number of hidden units H, for Bayes, SB(MIP), SB(MOP), and ML. M = 50, N = 30, true rank H* = 0.]

The SB generalization coefficient depends on H similarly to ML's, i.e., SB also has an ML-like property.

Page 25: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Contents

- Backgrounds
  - Regular models
  - Unidentifiable models
  - Superiority of Bayes to ML
  - What's the purpose?
- Setting
  - Model
  - Subspace Bayes (SB) approach
- Analysis
  - (James-Stein estimator)
  - Solution
  - Generalization error
- Discussion & Conclusions

Page 26: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Features of SB

SB:
- provides good generalization (in LNNs, asymptotically equivalent to PJS estimation);
- requires smaller computational costs: the marginalized space is reduced, and in some models the marginalization can be done analytically;
- is related to the variational Bayes (VB) approach.

Page 27: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Variational Bayes (VB) Solution [Nakajima&Watanabe05]

VB results in the same solution as SB(MIP). VB automatically selects the larger dimension to marginalize.

[Figure: Bayes posterior and VB posterior in the $(a_h, b_h)$ plane; the VB posterior is similar to the SB posterior for redundant components ($H^* < h \le H$).]

Page 28: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Conclusions

- We have introduced a subspace Bayes (SB) approach.
- We have proved that, in LNNs, SB is asymptotically equivalent to a shrinkage (PJS) estimation.
- Even asymptotically, the SB estimator for redundant components converges not to the ML estimator but to a smaller value, which means suppression of overfitting.
- Interestingly, the MIP version of SB is asymptotically equivalent to VB.
- We have clarified the SB generalization error.
- SB has Bayes-like and ML-like properties, i.e., shrinkage and acceleration of overfitting by basis selection.

Page 29: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Future work

- Analysis of other models (neural networks, Bayesian networks, mixture models, etc.).
- Analysis of variational Bayes (VB) in other models.

Page 30: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach


Thank you!