Generalization Error of Linear Neural Networks in an Empirical Bayes Approach


Page 1: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Generalization Error of Linear Neural Networks in an Empirical Bayes Approach

Shinichi Nakajima, Sumio Watanabe

Tokyo Institute of Technology
Nikon Corporation

Page 2: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Contents

- Backgrounds
  - Regular models
  - Unidentifiable models
  - Superiority of Bayes to ML
  - What's the purpose?
- Setting
  - Model
  - Subspace Bayes (SB) approach
- Analysis
  - (James-Stein estimator)
  - Solution
  - Generalization error
- Discussion & Conclusions

Page 3: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Conventional Learning Theory: Regular Models

(Asymptotically) normal likelihood for ANY true parameter.

Regular models: det(Fisher information) > 0 everywhere.
- Mean estimation
- Linear regression: $y = \sum_{k=1}^{K} a_k x_k$

1. Asymptotic normalities of the distribution of the ML estimator and of the Bayes posterior
   -> Model selection methods (AIC, BIC, MDL):
      $\mathrm{AIC} = 2\hat{G}(n) + 2K$,   $\mathrm{BIC},\ \mathrm{MDL} = 2\hat{F}(n) + K\log n$

2. Asymptotic generalization error: (ML) = (Bayes), i.e. $2\lambda = K$
   GE: $G(n) = \lambda n^{-1} + o(n^{-1})$
   FE: $F(n) = \lambda \log n + o(\log n)$

K : dimensionality of the parameter space
n : # of samples
x : input,  y : output
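As a quick numerical illustration of these two penalties (a minimal Python sketch; the data and dimensions are made up, and the noise variance is assumed known and equal to 1):

```python
# Minimal sketch: AIC and BIC/MDL penalties for a regular model (linear regression
# y = sum_k a_k x_k + noise), assuming unit-variance Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
n, K = 100, 5                                   # samples, parameter dimensionality
X = rng.normal(size=(n, K))
y = X @ rng.normal(size=K) + rng.normal(size=n)

a_ml, *_ = np.linalg.lstsq(X, y, rcond=None)    # ML estimate of the parameters
neg2loglik = np.sum((y - X @ a_ml) ** 2) + n * np.log(2 * np.pi)

aic = neg2loglik + 2 * K                        # AIC penalty: 2K
bic = neg2loglik + K * np.log(n)                # BIC/MDL penalty: K log n
print(f"AIC = {aic:.1f}, BIC = {bic:.1f}")
```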

Page 4: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Unidentifiable Models

NON-normal likelihood when the true parameter is on singularities.

Unidentifiable models: singularities exist, where det(Fisher information) = 0.
- Neural networks
- Bayesian networks
- Mixture models
- Hidden Markov models

Example: $y = \sum_{h=1}^{H} b_h a_h^{t} x$,  $a_h \in R^M$, $b_h \in R^N$, $x \in R^M$, $y \in R^N$,  H : # of components

Unidentifiable set: $\{\,a_h = 0 \ \text{or}\ b_h = 0\,\}$

[Figure: the unidentifiable set in the $(a_h, b_h)$ plane.]

1. Asymptotic normalities do NOT hold.
   -> No (penalized-likelihood-type) information criterion.

Page 5: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Superiority of Bayes to ML

Unidentifiable models: singularities exist, where det(Fisher information) = 0.
- Neural networks, Bayesian networks, mixture models, hidden Markov models

1. Asymptotic normalities do NOT hold.
   -> No (penalized-likelihood-type) information criterion.
   -> How do singularities work in learning?

2. Bayes has an advantage: G(Bayes) < G(ML).

When the true parameter is on singularities:
- The increase of the neighborhood of the true parameter accelerates overfitting: in ML, $2\lambda > K$.
- The increase of the population of parameters denoting the true distribution suppresses overfitting (only in Bayes): in Bayes, $2\lambda < K$.

[Figure: singularities in the $(a_h, b_h)$ plane.]

Page 6: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

What's the purpose?

Bayes provides good generalization, but it is expensive (needs Markov chain Monte Carlo).

Is there any approximation with good generalization and tractability?

- Variational Bayes (VB) [Hinton&vanCamp93; MacKay95; Attias99; Ghahramani&Beal00]
  -> analyzed in another paper [Nakajima&Watanabe05].
- Subspace Bayes (SB) -> analyzed here.

Page 7: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Contents

- Backgrounds
  - Regular models
  - Unidentifiable models
  - Superiority of Bayes to ML
  - What's the purpose?
- Setting
  - Model
  - Subspace Bayes (SB) approach
- Analysis
  - (James-Stein estimator)
  - Solution
  - Generalization error
- Discussion & Conclusions

Page 8: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Linear Neural Networks (LNNs)

LNN with M inputs, N outputs, and H hidden units:

$p(y|x, A, B) = \frac{1}{(2\pi)^{N/2}} \exp\left(-\frac{\|y - BAx\|^2}{2}\right)$

A : input parameter (H x M) matrix,  $A = (a_1, \ldots, a_H)^{t}$,  $a_h \in R^M$
B : output parameter (N x H) matrix,  $B = (b_1, \ldots, b_H)$,  $b_h \in R^N$

$BA = \sum_{h=1}^{H} b_h a_h^{t}$

Trivial redundancy: $BA = (BT)(T^{-1}A)$ for any non-singular T.

Essential parameter dimensionality: $K = H(M + N) - H^2$

True map: $B^*A^*$ with rank $H^*$ ($\le H$).

Generalization coefficient $\lambda$:

                               H* < H    H* = H
  ML    [Fukumizu99]           > K/2     = K/2
  Bayes [Aoyagi&Watanabe03]    < K/2     = K/2

Page 9: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Maximum Likelihood estimator [Baldi&Hornik95]

$R = \frac{1}{n}\sum_{i=1}^{n} y_i x_i^{t}$,   $Q = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^{t}$

The ML estimator is given by

$\widehat{BA}^{MLE} = \sum_{h=1}^{H} \hat{b}_h^{MLE}\hat{a}_h^{MLE\,t} + O_p(n^{-1})$,
where $\hat{b}_h^{MLE}\hat{a}_h^{MLE\,t} = \gamma_h\,\omega_{b_h}\omega_{a_h}^{t}\,Q^{-1/2}$.

Here
$\gamma_h$ : h-th largest singular value of $RQ^{-1/2}$,
$\omega_{a_h}$ : right singular vector,  $\omega_{b_h}$ : left singular vector,
i.e. $\gamma_h \omega_{b_h} = RQ^{-1/2}\omega_{a_h}$,  $\gamma_h \omega_{a_h} = Q^{-1/2}R^{t}\omega_{b_h}$.
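For concreteness, a minimal numpy sketch of this estimator (illustrative only; the dimensions, the true map, and the unit noise variance are assumptions made for the example):

```python
# Reduced-rank (rank-H) ML estimator for an LNN, via the SVD of R Q^{-1/2}.
import numpy as np

rng = np.random.default_rng(0)
M, N, H, H_true, n = 5, 4, 3, 2, 10000

# True low-rank map B*A* with rank H_true, plus unit-variance Gaussian noise.
BA_true = rng.normal(size=(N, H_true)) @ rng.normal(size=(H_true, M))
X = rng.normal(size=(n, M))
Y = X @ BA_true.T + rng.normal(size=(n, N))

Q = X.T @ X / n                                  # (1/n) sum_i x_i x_i^t
R = Y.T @ X / n                                  # (1/n) sum_i y_i x_i^t

# Symmetric inverse square root of Q.
evals, evecs = np.linalg.eigh(Q)
Q_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T

# gamma_h: singular values; omega_b_h: columns of U; omega_a_h: rows of Vt.
U, gamma, Vt = np.linalg.svd(R @ Q_inv_sqrt)

# Keep the H largest components: BA_ML = sum_h gamma_h omega_b_h omega_a_h^t Q^{-1/2}.
BA_ml = (U[:, :H] * gamma[:H]) @ Vt[:H, :] @ Q_inv_sqrt
print("||ML - true|| =", np.linalg.norm(BA_ml - BA_true))
```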

Page 10: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Bayes estimation

x : input,  y : output,  w : parameter
True: $q(y|x)$,   Learner: $p(y|x, w)$,   Prior: $\varphi(w)$
n training samples: $(X^n, Y^n) = \{(x_i, y_i);\ i = 1, \ldots, n\}$

Marginal likelihood: $Z(Y^n|X^n) = \int \prod_{i=1}^{n} p(y_i|x_i, w)\,\varphi(w)\, dw$

Posterior: $p(w|X^n, Y^n) = \frac{1}{Z(Y^n|X^n)}\,\varphi(w)\prod_{i=1}^{n} p(y_i|x_i, w)$

Predictive: $p(y|x, X^n, Y^n) = \int p(y|x, w)\, p(w|X^n, Y^n)\, dw$

In ML (or MAP): predict with one model.
In Bayes: predict with an ensemble of models.

Page 11: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Empirical Bayes (EB) approach [Efron&Morris73]

x : input,  y : output,  w : parameter,  $\tau$ : hyperparameter
True: $q(y|x)$,   Learner: $p(y|x, w)$,   Prior: $\varphi(w|\tau)$
n training samples: $(X^n, Y^n) = \{(x_i, y_i);\ i = 1, \ldots, n\}$

Marginal likelihood: $Z(Y^n|X^n, \tau) = \int \prod_{i=1}^{n} p(y_i|x_i, w)\,\varphi(w|\tau)\, dw$

The hyperparameter is estimated by maximizing the marginal likelihood:
$\hat{\tau}(X^n, Y^n) = \arg\max_{\tau} Z(Y^n|X^n, \tau)$

Posterior: $p(w|X^n, Y^n) = \frac{1}{Z(Y^n|X^n, \hat{\tau})}\,\varphi(w|\hat{\tau})\prod_{i=1}^{n} p(y_i|x_i, w)$

Predictive: $p(y|x, X^n, Y^n) = \int p(y|x, w)\, p(w|X^n, Y^n)\, dw$
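As a toy instance of this recipe (a sketch, not from the slides; it uses K-dimensional mean estimation with learner $z \sim N(w, I_K)$ and prior $\varphi(w|\tau) = N(0, \tau I_K)$, for which everything is closed-form):

```python
# Toy empirical Bayes: K-dim mean, learner z_i ~ N(w, I), prior w ~ N(0, tau I).
# The sample mean has marginal zbar ~ N(0, (tau + 1/n) I), so the maximizer of the
# marginal likelihood is tau_hat = max(0, ||zbar||^2/K - 1/n); the EB posterior
# mean then shrinks zbar toward 0 (the connection to James-Stein noted later).
import numpy as np

rng = np.random.default_rng(1)
K, n = 10, 50
w_true = np.zeros(K)                        # true mean at the origin
Z = w_true + rng.normal(size=(n, K))        # samples z_i ~ N(w_true, I)
zbar = Z.mean(axis=0)

tau_hat = max(0.0, zbar @ zbar / K - 1.0 / n)        # EB hyperparameter estimate
w_eb = tau_hat / (tau_hat + 1.0 / n) * zbar          # EB posterior mean (shrunk)
print("ML estimate :", np.round(zbar, 3))
print("EB estimate :", np.round(w_eb, 3))
```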

Page 12: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Subspace Bayes (SB) approach

SB is an EB approach in which a part of the parameters is regarded as hyperparameters.

a) MIP (Marginalizing in Input Parameter space) version
   A : parameter,  B : hyperparameter

   Learner: $p(y|x, A;\, B) = \frac{1}{(2\pi)^{N/2}}\exp\left(-\frac{\|y - BAx\|^2}{2}\right)$
   Prior:   $\varphi(A) = \frac{1}{(2\pi)^{MH/2}}\exp\left(-\frac{\operatorname{tr}(A^{t}A)}{2}\right)$

b) MOP (Marginalizing in Output Parameter space) version
   A : hyperparameter,  B : parameter

Marginalization can be done analytically in LNNs.

Page 13: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Intuitive explanation

[Figure: Bayes posterior vs. SB posterior in the $(a_h, b_h)$ plane. For a redundant component, SB optimizes the hyperparameter.]

Page 14: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Contents

- Backgrounds
  - Regular models
  - Unidentifiable models
  - Superiority of Bayes to ML
  - What's the purpose?
- Setting
  - Model
  - Subspace Bayes (SB) approach
- Analysis
  - (James-Stein estimator)
  - Solution
  - Generalization error
- Discussion & Conclusions

Page 15: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Free energy (a.k.a. evidence, stochastic complexity)

Free energy: $F(n) = -\log Z(Y^n|X^n)$

An important quantity used for model selection [Akaike80; MacKay92].

We minimize the free energy by optimizing the hyperparameter.

Page 16: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Generalization error

Generalization error: $G(n) = \big\langle D(\mathrm{True} \,\|\, \mathrm{Predictive}) \big\rangle_{q(X^n, Y^n)}$

where
  $D(q\|p)$ : Kullback-Leibler divergence between q and p
  $\langle V \rangle_{q}$ : expectation of V over q

Asymptotic expansion: $G(n) = \lambda n^{-1} + o(n^{-1})$
  $\lambda$ : generalization coefficient

In regular models $2\lambda = K$; in unidentifiable models $2\lambda \ne K$ in general.

Page 17: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

James-Stein (JS) estimator

K-dimensional mean estimation (regular model)
  $z_1, \ldots, z_n$ : samples,  $\nu$ : true mean

ML estimator (arithmetic mean): $\bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i$

Domination of $\alpha$ over $\beta$: $G_\alpha \le G_\beta$ for any true parameter, and $G_\alpha < G_\beta$ for a certain true parameter.

ML is efficient (never dominated by any unbiased estimator), but is inadmissible (dominated by a biased estimator) when $K \ge 3$ [Stein56].

James-Stein estimator [James&Stein61]:
$\hat{\nu}^{js} = \left(1 - \frac{K - 2}{n\|\bar{z}\|^2}\right)\bar{z}$

[Figure: generalization error of JS (K = 3) vs. ML as a function of the true mean.]

A certain relation between EB and JS was discussed in [Efron&Morris73].

Page 18: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Positive-part JS type (PJS) estimator

$\hat{\nu}^{pjs} = \left(1 - \frac{K - 2}{n\|\bar{z}\|^2}\right)\theta\!\left(n\|\bar{z}\|^2 > K - 2\right)\bar{z}
 = \left(1 - \frac{K - 2}{\chi}\right)\bar{z}$,  where $\chi = \max\{K - 2,\ n\|\bar{z}\|^2\}$

and $\theta(\text{event}) = 1$ if the event is true, $0$ if the event is false.

Thresholding -> model selection.

PJS is a model-selecting, shrinkage estimator.
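A minimal numpy sketch of this PJS estimator (illustrative; the sample sizes and true means below are made up, and the noise variance is assumed to be 1):

```python
# Positive-part James-Stein (PJS) estimator for a K-dimensional mean.
# It shrinks the sample mean toward 0 and sets it to exactly 0 when the signal
# is too weak; i.e. it both selects and shrinks.
import numpy as np

def pjs_estimator(Z: np.ndarray) -> np.ndarray:
    """Z: (n, K) array of samples with unit-variance noise."""
    n, K = Z.shape
    zbar = Z.mean(axis=0)
    chi = max(K - 2, n * float(zbar @ zbar))    # chi = max{K-2, n ||zbar||^2}
    return (1.0 - (K - 2) / chi) * zbar         # equals 0 when n||zbar||^2 <= K-2

rng = np.random.default_rng(2)
n, K = 20, 5
Z_null = rng.normal(size=(n, K))                 # true mean 0
Z_signal = 2.0 + rng.normal(size=(n, K))         # true mean (2,...,2)
print("true mean 0 ->", np.round(pjs_estimator(Z_null), 3))
print("true mean 2 ->", np.round(pjs_estimator(Z_signal), 3))
```

The shrinkage factor is exactly zero whenever $n\|\bar z\|^2 \le K - 2$, which is the thresholding (model selection) behavior mentioned above.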

Page 19: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Hyperparameter optimization

Assume orthonormality: $\int q(x)\, x x^{t}\, dx = I_M$   ($I_d$ : d x d identity matrix)

The free energy decomposes into per-component terms:
$F(Y^n|X^n; B) = \sum_{h=1}^{H} F_h(Y^n|X^n; b_h)$

Optimum hyperparameter value (MIP):
$\hat{b}_h = 0$ if $\gamma_h^2 \le M/n$;   $\hat{b}_h = \sqrt{\gamma_h^2/M - 1/n}\;\omega_{b_h}$ if $\gamma_h^2 > M/n$.

[Figure: $F_h$ as a function of $b_h$ in the two cases, with the minimum at $b_h = 0$ or at the nonzero optimum.]

Analytically solved in LNNs!
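A sketch of where this threshold comes from (under the assumptions above: unit noise variance, unit prior variance, orthonormal inputs; an illustrative calculation, not copied from the slides). For a single component in the MIP version, writing $b_h = \beta\,\omega_{b_h}$:

```latex
% Per-component free energy as a function of the hyperparameter magnitude beta
% (terms independent of b_h dropped):
F_h(\beta) \;\simeq\; \frac{M}{2}\log\!\left(1 + n\beta^2\right)
           \;-\; \frac{n^2 \beta^2 \gamma_h^2}{2\left(1 + n\beta^2\right)}
% Setting the derivative with respect to n\beta^2 to zero gives
% 1 + n\beta^2 = n\gamma_h^2 / M, i.e. \beta^2 = \gamma_h^2/M - 1/n,
% which is positive only when \gamma_h^2 > M/n; otherwise the minimum is at
% \beta = 0 and the component is discarded.
```

Plugging the optimum back into the posterior mean of $a_h$ reproduces (in this sketch) the $(1 - M/(n\gamma_h^2))$ shrinkage factor of Theorem 1 on the next slide.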

Page 20: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

SB solution (Theorem 1, Lemma 1)

Lemma 1: The posterior is localized, so that we can substitute the model at the SB estimator for the predictive distribution.

Theorem 1: The SB estimator is given by
$\widehat{BA}^{SB} = \sum_{h=1}^{H} \left(1 - \frac{L}{\hat{L}_h}\right)\hat{b}_h^{MLE}\hat{a}_h^{MLE\,t} + O_p(n^{-1})$,
where $\hat{L}_h = \max\{L,\ n\gamma_h^2\}$.

L : dimensionality of the marginalized subspace (per component), i.e. L = M in MIP, or L = N in MOP.

SB is asymptotically equivalent to PJS estimation (compare with $\hat{\nu}^{pjs}$ above).
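Continuing the numerical sketch from the ML-estimator slide, the SB map can be obtained by shrinking each singular component with the PJS-type factor of Theorem 1 (again illustrative; the dimensions and the true map are made up, and the MIP choice L = M is assumed):

```python
# SB estimator for an LNN as PJS-type shrinkage of the ML singular components:
# each component is multiplied by (1 - L / max{L, n gamma_h^2}).
import numpy as np

rng = np.random.default_rng(0)
M, N, H, H_true, n = 5, 4, 3, 2, 10000
L = M                                            # MIP version: marginalize A

BA_true = rng.normal(size=(N, H_true)) @ rng.normal(size=(H_true, M))
X = rng.normal(size=(n, M))
Y = X @ BA_true.T + rng.normal(size=(n, N))

Q = X.T @ X / n
R = Y.T @ X / n
evals, evecs = np.linalg.eigh(Q)
Q_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
U, gamma, Vt = np.linalg.svd(R @ Q_inv_sqrt)

L_hat = np.maximum(L, n * gamma[:H] ** 2)        # max{L, n gamma_h^2}
shrink = 1.0 - L / L_hat                         # PJS factor, small for weak components
BA_sb = (U[:, :H] * (shrink * gamma[:H])) @ Vt[:H, :] @ Q_inv_sqrt
BA_ml = (U[:, :H] * gamma[:H]) @ Vt[:H, :] @ Q_inv_sqrt

print("shrinkage factors:", np.round(shrink, 4))
print("||ML - true|| =", np.linalg.norm(BA_ml - BA_true))
print("||SB - true|| =", np.linalg.norm(BA_sb - BA_true))
```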

  

Page 21: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Generalization error (Theorem 2)

Theorem 2: The SB generalization coefficient is given by

$2\lambda = H^*(M + N - H^*) + \sum_{h=1}^{H - H^*}\left\langle \theta\!\left(\mu_h^2 > L\right)\frac{(\mu_h^2 - L)^2}{\mu_h^2} \right\rangle$,

where $\mu_h^2$ : h-th largest eigenvalue of a matrix subject to the Wishart distribution $W_{N-H^*}(M-H^*,\ I_{N-H^*})$, and $\langle\cdot\rangle$ denotes the expectation over the Wishart distribution.
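As a rough numerical sketch (illustrative; it assumes the form of Theorem 2 as written above, takes the dimensions M = 50, N = 30, H = 20 used in the results slides, and picks H* = 5 arbitrarily), the Wishart expectation can be evaluated by Monte Carlo:

```python
# Monte Carlo evaluation of the Wishart expectation in Theorem 2 (as stated above):
# sample W ~ W_{N-H*}(M-H*, I), take its H-H* largest eigenvalues mu_h^2, and
# average theta(mu_h^2 > L) (mu_h^2 - L)^2 / mu_h^2 over many draws.
import numpy as np

def sb_coefficient(M, N, H, H_star, L, n_draws=5000, seed=0):
    rng = np.random.default_rng(seed)
    dim, dof = N - H_star, M - H_star
    total = 0.0
    for _ in range(n_draws):
        G = rng.normal(size=(dim, dof))
        mu2 = np.linalg.eigvalsh(G @ G.T)[::-1][: H - H_star]   # largest H-H* eigenvalues
        total += np.sum(np.where(mu2 > L, (mu2 - L) ** 2 / mu2, 0.0))
    return H_star * (M + N - H_star) + total / n_draws          # estimate of 2*lambda

M, N, H, H_star = 50, 30, 20, 5
print("2*lambda (MIP, L = M):", round(sb_coefficient(M, N, H, H_star, L=M), 1))
print("K =", H * (M + N) - H**2)
```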

Page 22: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Large scale approximation (Theorem 3)

Theorem 3: In the large-scale limit where M, N, H, and H* go to infinity with their ratios fixed, the generalization coefficient $2\lambda$ converges to a closed-form expression written in terms of the quantities $(M - H^*)$, $(N - H^*)$, $(H - H^*)$, $L$ and of functions $J(s;\,\cdot)$ involving $\cos^{-1}$ terms (the explicit formula is given in the paper).

Page 23: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Results 1 (true rank dependence)

[Figure: $2\lambda/K$ versus the true rank $H^*$, for Bayes, SB(MIP), SB(MOP), and ML. Learner: M = 50, N = 30, H = 20.]

SB provides good generalization.

Note: this does NOT mean domination of SB over Bayes; discussion of domination needs consideration of a delicate situation (see paper).

Page 24: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Results 2 (redundant rank dependence)

[Figure: $2\lambda/K$ versus the number of hidden units H, for Bayes, SB(MIP), SB(MOP), and ML. M = 50, N = 30, true rank H* = 0.]

The SB generalization coefficient depends on H similarly to ML's, i.e., SB also has an ML-like property.

Page 25: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Contents

- Backgrounds
  - Regular models
  - Unidentifiable models
  - Superiority of Bayes to ML
  - What's the purpose?
- Setting
  - Model
  - Subspace Bayes (SB) approach
- Analysis
  - (James-Stein estimator)
  - Solution
  - Generalization error
- Discussion & Conclusions

Page 26: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Features of SB

SB:
- provides good generalization (in LNNs, asymptotically equivalent to PJS estimation);
- requires smaller computational costs: the marginalized space is reduced, and in some models the marginalization can be done analytically;
- is related to the variational Bayes (VB) approach.

Page 27: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Variational Bayes (VB) Solution [Nakajima&Watanabe05]

VB results in the same solution as SB(MIP). VB automatically selects the larger dimension to marginalize.

[Figure: Bayes posterior and VB posterior in the $(a_h, b_h)$ plane; the VB posterior is similar to the SB posterior for redundant components ($H^* < h \le H$).]

Page 28: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Conclusions

- We have introduced a subspace Bayes (SB) approach.
- We have proved that, in LNNs, SB is asymptotically equivalent to a shrinkage (PJS) estimation.
- Even asymptotically, the SB estimator for redundant components converges not to the ML estimator but to a smaller value, which means suppression of overfitting.
- Interestingly, the MIP version of SB is asymptotically equivalent to VB.
- We have clarified the SB generalization error.
- SB has Bayes-like and ML-like properties, i.e., shrinkage and acceleration of overfitting by basis selection.

Page 29: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach

Future work

- Analysis of other models (neural networks, Bayesian networks, mixture models, etc.).
- Analysis of variational Bayes (VB) in other models.

Page 30: Generalization Error of  Linear Neural Networks in  an Empirical Bayes Approach


Thank you!