Generalization Error of Linear Neural Networks in an Empirical Bayes Approach


1

Generalization Error of Linear Neural Networks in

an Empirical Bayes Approach

Shinichi Nakajima Sumio Watanabe 

Tokyo Institute of Technology

Nikon Corporation

2

Contents

- Backgrounds: Regular models / Unidentifiable models / Superiority of Bayes to ML / What's the purpose?
- Setting: Model / Subspace Bayes (SB) approach
- Analysis: James-Stein estimator / Solution / Generalization error
- Discussion & Conclusions

3

Conventional Learning Theory: Regular Models

(Asymptotically) normal likelihood for ANY true parameter.

Regular models: det(Fisher information) > 0 everywhere.
- Mean estimation
- Linear regression: $y = \sum_{k=1}^{K} a_k x_k$ ($x$: input, $y$: output)

$K$: dimensionality of the parameter space, $n$: # of samples.

1. Asymptotic normality of the distributions of the ML estimator and of the Bayes posterior
-> Model selection methods (AIC, BIC, MDL):
$\mathrm{AIC} = -2\log(\text{max likelihood}) + 2K$, $\quad \mathrm{BIC}, \mathrm{MDL} = -2\log(\text{max likelihood}) + K\log n$.

2. Asymptotic generalization error: (ML) = (Bayes)
GE: $G(n) = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right)$
FE: $F(n) = \lambda\log n + o(\log n)$
with $2\lambda = K$ in regular models.
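For concreteness, here is a minimal sketch of these criteria for the linear-regression example above, assuming Gaussian noise with known unit variance so that $-2\times$(max log-likelihood) reduces to the residual sum of squares up to an additive constant:

```python
# Minimal sketch: AIC and BIC/MDL for the linear-regression model above,
# assuming Gaussian noise with known unit variance, so that -2 * (max
# log-likelihood) is the residual sum of squares up to an additive constant.
import numpy as np

rng = np.random.default_rng(0)
n, K = 200, 5                                  # n: # of samples, K: parameter dimensionality
X = rng.normal(size=(n, K))
a_true = rng.normal(size=K)
y = X @ a_true + rng.normal(size=n)

a_ml = np.linalg.lstsq(X, y, rcond=None)[0]    # ML estimate of the coefficients
neg2loglik = np.sum((y - X @ a_ml) ** 2)       # -2 * max log-likelihood (+ const)

aic = neg2loglik + 2 * K                       # AIC penalty: 2K
bic = neg2loglik + K * np.log(n)               # BIC / MDL penalty: K log n
print(f"AIC = {aic:.1f}, BIC/MDL = {bic:.1f}")
```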

4

Unidentifiable Models

NON-normal likelihood when the true parameter is on singularities.

Unidentifiable models: singularities exist, where det(Fisher information) = 0.
- Neural networks
- Bayesian networks
- Mixture models
- Hidden Markov models

Example (three-layer neural network): $y = \sum_{h=1}^{H} b_h \tanh(a_h^{t} x)$
$x \in \mathbb{R}^M$, $y \in \mathbb{R}^N$, $a_h \in \mathbb{R}^M$, $b_h \in \mathbb{R}^N$, $H$: # of components.

Unidentifiable set: $\{\, a_h = 0 \ \text{or}\ b_h = 0 \,\}$

1. Asymptotic normalities do NOT hold.
-> No (penalized-likelihood-type) information criterion.
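To make det(Fisher information) = 0 concrete, the following sketch (an illustration assuming a single tanh unit, scalar output, and unit Gaussian noise, so that the Fisher information is $\langle \nabla f\, \nabla f^t\rangle$) estimates the Fisher information at a point with $b_1 = 0$ and shows that it is singular:

```python
# Fisher information of y = b * tanh(a^t x) + N(0, 1) noise (scalar output),
# estimated by Monte Carlo as <grad f grad f^t>, at the singular point b = 0.
# The gradient with respect to a vanishes there, so det(Fisher) = 0.
import numpy as np

rng = np.random.default_rng(0)
M = 3                                    # input dimension
a, b = rng.normal(size=M), 0.0           # a point on the unidentifiable set (b = 0)

xs = rng.normal(size=(100_000, M))       # inputs drawn from q(x) = N(0, I_M)
t = np.tanh(xs @ a)
grad_a = (b * (1 - t**2))[:, None] * xs  # df/da, identically zero when b = 0
grad_b = t[:, None]                      # df/db
grads = np.hstack([grad_a, grad_b])      # per-sample gradient of f w.r.t. (a, b)

fisher = grads.T @ grads / len(xs)       # Monte Carlo estimate of the Fisher matrix
print("det(Fisher information) =", np.linalg.det(fisher))   # ~ 0
```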

5

Superiority of Bayes to ML

Unidentifiable models: singularities exist, where det(Fisher information) = 0.
- Neural networks, Bayesian networks, mixture models, hidden Markov models

1. Asymptotic normalities do NOT hold.
-> No (penalized-likelihood-type) information criterion.
-> How do singularities work in learning?

2. Bayes has an advantage: G(Bayes) < G(ML).
When the true parameter is on singularities:
- In ML, the increased neighborhood of the true parameter accelerates overfitting: $2\lambda > K$.
- In Bayes, the increased population denoting the true parameter suppresses overfitting (only in Bayes): $2\lambda < K$.

6

What's the purpose?

Bayes provides good generalization, but is expensive (needs Markov chain Monte Carlo).

Is there any approximation with good generalization and tractability?
- Variational Bayes (VB) [Hinton&vanCamp93; MacKay95; Attias99; Ghahramani&Beal00]: analyzed in another paper [Nakajima&Watanabe05].
- Subspace Bayes (SB): this talk.

7

Contents

- Backgrounds: Regular models / Unidentifiable models / Superiority of Bayes to ML / What's the purpose?
- Setting: Model / Subspace Bayes (SB) approach
- Analysis: James-Stein estimator / Solution / Generalization error
- Discussion & Conclusions

8

Linear Neural Networks (LNNs)

LNN with $M$ inputs, $N$ outputs, and $H$ hidden units:
$p(y|x, A, B) = \frac{1}{(2\pi)^{N/2}} \exp\!\left(-\frac{\|y - BAx\|^2}{2}\right)$

- $A$: input parameter ($H \times M$) matrix, $A = (a_1, \dots, a_H)^t$, $a_h \in \mathbb{R}^M$
- $B$: output parameter ($N \times H$) matrix, $B = (b_1, \dots, b_H)$, $b_h \in \mathbb{R}^N$
- $BA = \sum_{h=1}^{H} b_h a_h^t$

Trivial redundancy: $BA = (BT)(T^{-1}A)$ for any non-singular $H \times H$ matrix $T$.
Essential parameter dimensionality: $K = H(M + N) - H^2$.

True map: $B^*A^*$ with rank $H^*$ ($\le H$); the learner has $H$ hidden units, the true map rank $H^*$.

Known generalization coefficients:

                               $H^* < H$         $H^* = H$
  ML    [Fukumizu99]           $\lambda > K/2$   $\lambda = K/2$
  Bayes [Aoyagi&Watanabe03]    $\lambda < K/2$   $\lambda = K/2$
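A short sketch of the two facts above, the trivial redundancy and the essential dimensionality (illustrative code with arbitrary small dimensions):

```python
# The LNN map y = BAx: its trivial redundancy and its essential dimensionality.
import numpy as np

rng = np.random.default_rng(0)
M, N, H = 5, 4, 2
A = rng.normal(size=(H, M))              # input parameter matrix  (H x M)
B = rng.normal(size=(N, H))              # output parameter matrix (N x H)
T = rng.normal(size=(H, H))              # any non-singular H x H matrix

# (BT, T^{-1}A) defines exactly the same map as (B, A): trivial redundancy.
assert np.allclose(B @ A, (B @ T) @ (np.linalg.inv(T) @ A))

K = H * (M + N) - H**2                   # essential dimensionality = H(M + N - H)
print("naive parameter count:", H * (M + N), "  essential dimensionality K:", K)
```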

9

Maximum Likelihood estimator [Baldi&Hornik95]

Define $R_n = \frac{1}{n}\sum_{i=1}^{n} y_i x_i^t$ and $Q_n = \frac{1}{n}\sum_{i=1}^{n} x_i x_i^t$.

The ML estimator is given by
$(\widehat{BA})^{\mathrm{MLE}} = \sum_{h=1}^{H} \hat{b}_h^{\mathrm{MLE}}\, \hat{a}_h^{t\,\mathrm{MLE}} + O_p(n^{-1})$,
where
$\hat{b}_h^{\mathrm{MLE}}\, \hat{a}_h^{t\,\mathrm{MLE}} = \omega_{b_h} \omega_{b_h}^t R_n Q_n^{-1}$.

Here $\gamma_h$ is the $h$-th largest singular value of $R_n Q_n^{-1/2}$, and $\omega_{a_h}$ and $\omega_{b_h}$ are the corresponding right and left singular vectors:
$R_n Q_n^{-1/2}\, \omega_{a_h} = \gamma_h \omega_{b_h}$,  $\omega_{b_h}^t R_n Q_n^{-1/2} = \gamma_h \omega_{a_h}^t$.
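In code, this reduced-rank least-squares solution can be sketched as follows (assuming unit noise variance; the helper ml_estimator and the toy dimensions are illustrative, not taken from the paper):

```python
# Rank-H ML estimate of the map BA from data, via the SVD of R_n Q_n^{-1/2}.
import numpy as np

def ml_estimator(X, Y, H):
    """X: (n, M) inputs, Y: (n, N) outputs; returns the rank-H estimate of BA."""
    n = len(X)
    R = Y.T @ X / n                               # R_n = (1/n) sum y_i x_i^t  (N x M)
    Q = X.T @ X / n                               # Q_n = (1/n) sum x_i x_i^t  (M x M)
    vals, vecs = np.linalg.eigh(Q)
    Q_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    U, gamma, Vt = np.linalg.svd(R @ Q_inv_sqrt)  # gamma_h, omega_{b_h}, omega_{a_h}
    # keep the H largest singular components, then undo the Q^{-1/2} whitening
    return (U[:, :H] * gamma[:H]) @ Vt[:H] @ Q_inv_sqrt

# toy check: a rank-2 true map is recovered up to O(n^{-1/2}) fluctuations
rng = np.random.default_rng(0)
M, N, H_true, n = 6, 4, 2, 5000
BA_true = rng.normal(size=(N, H_true)) @ rng.normal(size=(H_true, M))
X = rng.normal(size=(n, M))
Y = X @ BA_true.T + rng.normal(size=(n, N))
print("error:", np.linalg.norm(ml_estimator(X, Y, H=2) - BA_true))
```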

10

Bayes estimation

$x$: input, $y$: output, $w$: parameter.
True: $q(y|x)$.  Learner: $p(y|x, w)$.  Prior: $\varphi(w)$.
$n$ training samples: $(X^n, Y^n) = \{(x_i, y_i);\ i = 1, \dots, n\}$.

Marginal likelihood: $Z_n(Y^n|X^n) = \int \prod_{i=1}^{n} p(y_i|x_i, w)\,\varphi(w)\,dw$

Posterior: $p(w|X^n, Y^n) = \frac{1}{Z_n(Y^n|X^n)} \prod_{i=1}^{n} p(y_i|x_i, w)\,\varphi(w)$

Predictive: $p(y|x, X^n, Y^n) = \int p(y|x, w)\, p(w|X^n, Y^n)\, dw$

In ML (or MAP): predict with one model.
In Bayes: predict with an ensemble of models.

11

Empirical Bayes (EB) approach [Efron&Morris73]

True: $q(y|x)$.  Learner: $p(y|x, w)$.  Prior: $\varphi(w|\tau)$ with hyperparameter $\tau$.
$n$ training samples: $(X^n, Y^n) = \{(x_i, y_i);\ i = 1, \dots, n\}$.

Marginal likelihood: $Z_n(Y^n|X^n, \tau) = \int \prod_{i=1}^{n} p(y_i|x_i, w)\,\varphi(w|\tau)\,dw$

The hyperparameter is estimated by maximizing the marginal likelihood:
$\hat{\tau}(X^n, Y^n) = \arg\max_{\tau} Z_n(Y^n|X^n, \tau)$

Posterior: $p(w|X^n, Y^n) = \frac{1}{Z_n(Y^n|X^n, \hat{\tau})} \prod_{i=1}^{n} p(y_i|x_i, w)\,\varphi(w|\hat{\tau})$

Predictive: $p(y|x, X^n, Y^n) = \int p(y|x, w)\, p(w|X^n, Y^n)\, dw$

12

Subspace Bayes (SB) approach

SB is an EB approach where part of the parameters are regarded as hyperparameters.

a) MIP (Marginalizing in Input Parameter space) version
Learner: $p(y|x, A; B) = \frac{1}{(2\pi)^{N/2}} \exp\!\left(-\frac{\|y - BAx\|^2}{2}\right)$
Prior: $\varphi(A) \propto \exp\!\left(-\frac{\mathrm{tr}(A A^t)}{2}\right)$
$A$: parameter, $B$: hyperparameter.

b) MOP (Marginalizing in Output Parameter space) version
$A$: hyperparameter, $B$: parameter.

Marginalization can be done analytically in LNNs.
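One way to see that the MIP marginalization is analytic (a sketch under the assumptions of unit noise variance and a standard normal prior on $A$; this Gaussian-integral route is an illustration, not code from the paper): since $y_i = (x_i^t \otimes B)\,\mathrm{vec}(A) + \varepsilon_i$, the stacked outputs are jointly Gaussian for fixed $B$, so the log marginal likelihood is a multivariate normal log density.

```python
# MIP marginal likelihood of an LNN for a fixed hyperparameter B, assuming unit
# noise variance and vec(A) ~ N(0, I).  Since y_i = (x_i^t kron B) vec(A) + eps_i,
# the stacked outputs are Gaussian with mean 0 and covariance Phi Phi^t + I.
import numpy as np

def mip_log_marginal(X, Y, B):
    n, M = X.shape
    N = Y.shape[1]
    Phi = np.vstack([np.kron(x[None, :], B) for x in X])   # (nN) x (MH)
    cov = Phi @ Phi.T + np.eye(n * N)
    y = Y.reshape(-1)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (n * N * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(cov, y))

rng = np.random.default_rng(0)
M, N, H, n = 4, 3, 2, 30
X = rng.normal(size=(n, M))
BA_true = rng.normal(size=(N, H)) @ rng.normal(size=(H, M))
Y = X @ BA_true.T + rng.normal(size=(n, N))
B = rng.normal(size=(N, H))                                 # candidate hyperparameter
print("log Z_n(Y^n | X^n, B) =", mip_log_marginal(X, Y, B))
```

In practice one would apply the Woodbury identity to work in the $MH$-dimensional space instead of the $nN$-dimensional one; the point here is only that no MCMC is needed.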

13

Intuitive explanation

[Figure: Bayes posterior vs. SB posterior in the $(a_h, b_h)$ plane for a redundant component. The SB posterior keeps the Bayesian form along the marginalized direction, while the hyperparameter direction is optimized (point-estimated).]

14

Contents

- Backgrounds: Regular models / Unidentifiable models / Superiority of Bayes to ML / What's the purpose?
- Setting: Model / Subspace Bayes (SB) approach
- Analysis: James-Stein estimator / Solution / Generalization error
- Discussion & Conclusions

15

Free energy (a.k.a. evidence, stochastic complexity)

Free energy: $F(n) = -\log Z_n(Y^n|X^n)$

An important quantity used for model selection [Akaike80; MacKay92].

We minimize the free energy, thereby optimizing the hyperparameter.

16

Generalization error

Generalization error: $G(n) = \left\langle D(\text{True} \,\|\, \text{Predictive}) \right\rangle_{q(X^n, Y^n)}$

where
- $D(q\|p)$: Kullback-Leibler divergence between $q$ and $p$,
- $\langle V \rangle_q$: expectation of $V$ over $q$.

Asymptotic expansion: $G(n) = \frac{\lambda}{n} + o\!\left(\frac{1}{n}\right)$, where $\lambda$ is the generalization coefficient.

In regular models, $2\lambda = K$; in unidentifiable models, $2\lambda \ne K$ in general.
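A quick Monte Carlo check of $2\lambda = K$ in a regular model (an illustrative sketch; it uses the ML plug-in predictive, which has the same coefficient asymptotically): with true $q(y) = N(0, I_K)$ and predictive $N(\bar{z}, I_K)$, the KL divergence is $\|\bar{z}\|^2/2$, and its average over datasets is $K/(2n)$.

```python
# Monte Carlo check that 2*lambda = K in a regular model: true q(y) = N(0, I_K),
# ML plug-in predictive N(zbar, I_K), so D(q || predictive) = ||zbar||^2 / 2.
import numpy as np

rng = np.random.default_rng(0)
K, n, trials = 6, 100, 5000
z = rng.normal(0.0, 1.0, size=(trials, n, K))     # `trials` independent datasets
zbar = z.mean(axis=1)
G = 0.5 * (zbar**2).sum(axis=1).mean()            # Monte Carlo estimate of G(n)
print("2 n G(n) =", round(2 * n * G, 2), " vs.  K =", K)
```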

17

James-Stein (JS) estimator

K-dimensional mean estimation (a regular model): samples $z_1, \dots, z_n$.

ML estimator (arithmetic mean): $\bar{z} = \frac{1}{n}\sum_{i=1}^{n} z_i$

Domination of $\alpha$ over $\beta$: $G_\alpha \le G_\beta$ for any true parameter, and $G_\alpha < G_\beta$ for a certain true parameter.

ML is efficient (never dominated by any unbiased estimator), but it is inadmissible (dominated by a biased estimator) when $K \ge 3$ [Stein56].

James-Stein estimator [James&Stein61]:
$\hat{\mu}^{\mathrm{js}} = \left(1 - \frac{K-2}{n\|\bar{z}\|^2}\right)\bar{z}$

[Figure: shrinkage of the ML estimate toward the origin by JS (K = 3), compared with ML and the true mean.]

A certain relation between EB and JS was discussed in [Efron&Morris73].
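A small simulation of the domination statement (illustrative code; the true mean, K, and n are arbitrary choices):

```python
# Risk comparison of ML (sample mean) and James-Stein for K-dimensional
# mean estimation with K >= 3.
import numpy as np

rng = np.random.default_rng(0)
K, n, trials = 5, 10, 50_000
mu = np.full(K, 0.5)                                # true mean

zbar = rng.normal(mu, 1.0, size=(trials, n, K)).mean(axis=1)   # ML, one per trial
shrink = 1.0 - (K - 2) / (n * (zbar**2).sum(axis=1))
js = shrink[:, None] * zbar                                    # James-Stein estimate

risk_ml = ((zbar - mu) ** 2).sum(axis=1).mean()
risk_js = ((js - mu) ** 2).sum(axis=1).mean()
print(f"risk(ML) = {risk_ml:.4f}   (theory: K/n = {K/n:.4f})")
print(f"risk(JS) = {risk_js:.4f}   (strictly smaller for K >= 3)")
```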

18

Positive-part JS estimator

Positive-part JS estimator:
$\hat{\mu}^{\mathrm{pjs}} = \theta\!\left(n\|\bar{z}\|^2 > K - 2\right)\left(1 - \frac{K-2}{n\|\bar{z}\|^2}\right)\bar{z} = \max\!\left(0,\ 1 - \frac{K-2}{n\|\bar{z}\|^2}\right)\bar{z}$,

where $\theta(\text{event}) = 1$ if the event is true and $0$ if it is false. The positive-part JS type (PJS) estimator takes the same form with $K - 2$ replaced by a general constant.

Thresholding -> model selection: PJS is a model-selecting, shrinkage estimator.
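The model-selecting behavior can be seen in simulation (an illustration with the shrinkage constant set to $K-2$ and arbitrary dimensions): the positive-part estimator returns exactly zero, i.e. selects the null model, for a noticeable fraction of datasets when the true mean is zero, and essentially never when it is clearly nonzero.

```python
# Frequency with which the positive-part estimator returns exactly zero,
# i.e. selects the null model mu = 0.
import numpy as np

def pjs(zbar, n, L):
    """Positive-part JS-type estimator with shrinkage constant L."""
    factor = np.maximum(0.0, 1.0 - L / (n * (zbar**2).sum(axis=-1, keepdims=True)))
    return factor * zbar

rng = np.random.default_rng(0)
K, n, trials = 5, 10, 10_000
for mu in (0.0, 1.0):                              # true mean: zero vs. clearly nonzero
    zbar = rng.normal(mu, 1.0 / np.sqrt(n), size=(trials, K))   # distribution of zbar
    est = pjs(zbar, n, L=K - 2)
    zeroed = np.mean(np.all(est == 0.0, axis=1))
    print(f"true mean = {mu}: estimate is exactly zero in {100 * zeroed:.1f}% of trials")
```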

19

Hyperparameter optimization

Assume orthonormality: $\langle x x^t \rangle_{q(x)} = I_M$ ($I_d$: the $d \times d$ identity matrix).

Then the free energy decomposes componentwise and can be analytically solved in LNNs:
$F(Y^n|X^n; B) = \sum_{h=1}^{H} F_h(Y^n|X^n; b_h)$

Optimum hyperparameter value (MIP): $\hat{b}_h = 0$ if $\gamma_h^2 \le M/n$; otherwise $\hat{b}_h$ is non-zero and proportional to $\omega_{b_h}$.

[Plots: $F_h$ as a function of $b_h$ in the two cases $\gamma_h^2 \le M/n$ and $\gamma_h^2 > M/n$.]

20

SB solution (Theorem 1, Lemma 1)

Lemma 1: The posterior is localized, so that for prediction the predictive distribution can be replaced by the model at the SB estimator.

Theorem 1: The SB estimator is given by
$(\widehat{BA})^{\mathrm{SB}} = \sum_{h=1}^{H} \hat{\mathcal{L}}_h\, \hat{b}_h^{\mathrm{MLE}}\, \hat{a}_h^{t\,\mathrm{MLE}} + O_p(n^{-1})$, where $\hat{\mathcal{L}}_h = \max\!\left(0,\ 1 - \frac{L}{n\gamma_h^2}\right)$

and $L$ is the dimensionality of the marginalized subspace (per component), i.e., $L = M$ in MIP and $L = N$ in MOP.

Hence SB is asymptotically equivalent to PJS estimation.
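Read operationally, Theorem 1 suggests the following sketch (illustrative code, not the authors'; it assumes unit noise variance and orthonormal inputs, $Q_n \approx I_M$, so that $\gamma_h$ and the singular vectors come directly from $R_n$):

```python
# SB estimate as PJS shrinkage of the ML singular components, assuming unit
# noise variance and orthonormal inputs (Q_n ~ I_M), so gamma_h and the
# singular vectors come from the SVD of R_n alone.
import numpy as np

def sb_estimator(X, Y, H, version="MIP"):
    n, M = X.shape
    N = Y.shape[1]
    L = M if version == "MIP" else N             # marginalized-subspace dimensionality
    R = Y.T @ X / n                              # ~ R_n Q_n^{-1/2} when Q_n ~ I_M
    U, gamma, Vt = np.linalg.svd(R)
    shrink = np.maximum(0.0, 1.0 - L / (n * gamma[:H] ** 2))   # factor from Theorem 1
    return (U[:, :H] * (shrink * gamma[:H])) @ Vt[:H], shrink

rng = np.random.default_rng(0)
M, N, H, H_true, n = 8, 6, 4, 1, 500             # learner rank H > true rank H_true
BA_true = rng.normal(size=(N, H_true)) @ rng.normal(size=(H_true, M))
X = rng.normal(size=(n, M))
Y = X @ BA_true.T + rng.normal(size=(n, N))

BA_sb, shrink = sb_estimator(X, Y, H)
# the factor is ~1 for the true component and much smaller (possibly exactly 0)
# for the redundant ones, which is the suppression of overfitting
print("shrinkage factors:", np.round(shrink, 3))
print("error:", np.linalg.norm(BA_sb - BA_true))
```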

21

Generalization error (Theorem 2)

2

*

1

2

2

2

2** 12

hq

HH

hh

hh

LLHNMH

2h : h-th largest eigenvalue of matrix

subject to WN-H* (M-H*, IN-H* ).

Expectation over Wishart distribution.

Theorem 2: SB generalization coefficient is given by

22

Large-scale approximation (Theorem 3)

Theorem 3: In the large-scale limit, where $M$, $N$, $H$, and $H^*$ tend to infinity with their ratios kept fixed, the generalization coefficient $2\lambda$ converges to a closed-form expression: the leading term $H^*(M + N - H^*)$ plus correction terms for the redundant components, written with functions $J(s;\kappa)$ involving $\cos^{-1} s$ and $s\sqrt{1 - s^2}$. (The explicit form is given in the paper.)

23

Results 1 (true rank dependence)

[Plot: generalization coefficient $2\lambda / K$ versus the true rank $H^*$, for Bayes, SB(MIP), SB(MOP), and ML; learner with M = 50 inputs, N = 30 outputs, and H = 20 hidden units.]

SB provides good generalization.

Note: this does NOT mean domination of SB over Bayes. Discussion of domination needs consideration of a delicate situation. (See paper.)

24

Results 2 (redundant rank dependence)

[Plot: generalization coefficient $2\lambda / K$ versus the number of hidden units $H$ of the learner, for Bayes, SB(MIP), SB(MOP), and ML; M = 50, N = 30, true rank $H^* = 0$.]

SB(MIP) depends on H similarly to ML; SB(MOP) also has a property similar to ML.

25

Contents

- Backgrounds: Regular models / Unidentifiable models / Superiority of Bayes to ML / What's the purpose?
- Setting: Model / Subspace Bayes (SB) approach
- Analysis: James-Stein estimator / Solution / Generalization error
- Discussion & Conclusions

26

Features of SB

- Provides good generalization; in LNNs it is asymptotically equivalent to PJS.
- Requires smaller computational cost: the marginalized space is reduced, and in some models the marginalization can be done analytically.
- Is related to the variational Bayes (VB) approach.

27

Variational Bayes (VB) Solution [Nakajima&Watanabe05]

VB results in the same solution as SB(MIP). VB automatically selects the larger dimension to marginalize.

[Figure: Bayes posterior vs. VB posterior in the $(a_h, b_h)$ plane, for $N \le M$ and a redundant component ($H^* < h \le H$); the VB posterior is similar to the SB posterior (axis scales $O(1)$ and $O(n^{-1})$ are indicated).]

28

Conclusions

- We have introduced a subspace Bayes (SB) approach.
- We have proved that, in LNNs, SB is asymptotically equivalent to a shrinkage (PJS) estimation. Even asymptotically, the SB estimate for redundant components converges not to the ML value but to a smaller one, which means suppression of overfitting.
- Interestingly, the MIP version of SB is asymptotically equivalent to VB.
- We have clarified the SB generalization error. SB has both Bayes-like and ML-like properties, i.e., shrinkage and acceleration of overfitting by basis selection.

29

Future work

Analysis of SB in other models (neural networks, Bayesian networks, mixture models, etc.).

Analysis of variational Bayes (VB) in other models.

30

Thank you!
