1
Generalization Error of Linear Neural Networks in
an Empirical Bayes Approach
Shinichi Nakajima (Nikon Corporation)
Sumio Watanabe (Tokyo Institute of Technology)
2
Contents
  Backgrounds: Regular models / Unidentifiable models / Superiority of Bayes to ML / What's the purpose?
  Setting: Model / Subspace Bayes (SB) approach
  Analysis: James-Stein estimator / Solution / Generalization error
  Discussion & Conclusions
3
Conventional Learning Theory: Regular Models

(Asymptotically) normal likelihood for ANY true parameter.

Regular models: det(Fisher information) > 0 everywhere.
Examples: mean estimation, linear regression y(x) = Σ_{k=1}^{K} a_k x_k
(x: input, y: output, K: dimensionality of the parameter space, n: # of samples).

1. Asymptotic normality of the distributions of the ML estimator and of the Bayes posterior
   → model selection methods (AIC, BIC, MDL): AIC penalizes the fit by 2K, while BIC/MDL penalize it by K log n.
2. Asymptotic generalization error: (ML) = (Bayes), 2λ = K.
   GE: G(n) = K/(2n) + o(1/n)
   FE: F(n) = (K/2) log n + o(log n)
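As a concrete illustration of these penalized-likelihood criteria, here is a minimal sketch (not part of the original slides) that computes AIC and BIC/MDL scores from a model's maximized log-likelihood; the function name and example numbers are made up for illustration.

```python
import numpy as np

def aic_bic(log_likelihood: float, K: int, n: int):
    """Penalized-likelihood model selection criteria for a regular model.

    log_likelihood: maximized log-likelihood of the fitted model
    K: dimensionality of the parameter space
    n: number of samples
    """
    aic = -2.0 * log_likelihood + 2 * K          # AIC penalty: 2K
    bic = -2.0 * log_likelihood + K * np.log(n)  # BIC/MDL penalty: K log n
    return aic, bic

# Hypothetical comparison of two nested regression models on n = 100 samples:
print(aic_bic(log_likelihood=-120.3, K=5, n=100))
print(aic_bic(log_likelihood=-118.9, K=10, n=100))
```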
4
Unidentifiable Models

NON-normal likelihood when the true parameter is on singularities.

Unidentifiable models: singularities exist, where det(Fisher information) = 0.
Examples: neural networks, Bayesian networks, mixture models, hidden Markov models.

Model (three-layer neural network form): y = Σ_{h=1}^{H} b_h ψ(a_h^t x)
(ψ: hidden-unit activation; x ∈ R^M, y ∈ R^N, a_h ∈ R^M, b_h ∈ R^N, H: # of components).
Unidentifiable set: { a_h = 0 or b_h = 0 } (where the h-th component vanishes).

1. Asymptotic normalities do NOT hold.
   → No (penalized-likelihood-type) information criterion.
5
Superiority of Bayes to ML

Unidentifiable models: singularities exist, where det(Fisher information) = 0.
Examples: neural networks, Bayesian networks, mixture models, hidden Markov models.

1. Asymptotic normalities do NOT hold.
   → No (penalized-likelihood-type) information criterion.
2. Bayes has an advantage: G(Bayes) < G(ML).

When the true parameter is on singularities:
  the increase of the neighborhood of the true parameter accelerates overfitting → in ML, 2λ > K;
  the increase of the population denoting the true parameter suppresses overfitting (only in Bayes) → in Bayes, 2λ < K.

How do singularities work in learning?
6
What's the Purpose?

Bayes provides good generalization, but it is expensive (needs Markov chain Monte Carlo).
Is there any approximation with good generalization and tractability?

Variational Bayes (VB) [Hinton&vanCamp93; MacKay95; Attias99; Ghahramani&Beal00]
  → analyzed in another paper [Nakajima&Watanabe05].
Subspace Bayes (SB) → this talk.
7
Contents
  Backgrounds: Regular models / Unidentifiable models / Superiority of Bayes to ML / What's the purpose?
  Setting: Model / Subspace Bayes (SB) approach
  Analysis: James-Stein estimator / Solution / Generalization error
  Discussion & Conclusions
8
Linear Neural Networks (LNNs)

LNN with M inputs, N outputs, and H hidden units:
p(y|x, A, B) = (2π)^{-N/2} exp(-||y - BAx||²/2)

A: input parameter (H x M) matrix, A = (a_1, ..., a_H)^t, a_h ∈ R^M.
B: output parameter (N x H) matrix, B = (b_1, ..., b_H), b_h ∈ R^N.
BA = Σ_{h=1}^{H} b_h a_h^t.

Trivial redundancy: BA = (BT)(T^{-1}A) for any non-singular H x H matrix T.
Essential parameter dimensionality: K = H(M + N) - H².

True map: B*A* with rank H* (≤ H).

Known generalization coefficients (learner with H hidden units, true rank H*):
                              H* < H     H* = H
  ML    [Fukumizu99]          λ > K/2    λ = K/2
  Bayes [Aoyagi&Watanabe03]   λ < K/2    λ = K/2
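A minimal numerical sketch (my own illustration, not part of the slides) of the LNN parameterization, its trivial redundancy, and the essential dimensionality K; all sizes and variable names are chosen arbitrarily.

```python
import numpy as np

M, N, H = 50, 30, 20              # input dim, output dim, hidden units (illustrative)
rng = np.random.default_rng(0)

A = rng.standard_normal((H, M))   # input parameter matrix  (H x M)
B = rng.standard_normal((N, H))   # output parameter matrix (N x H)

x = rng.standard_normal(M)
y_mean = B @ A @ x                # mean output of the LNN: y = BAx + Gaussian noise

# Trivial redundancy: (B T)(T^{-1} A) realizes exactly the same map for invertible T.
T = rng.standard_normal((H, H)) + H * np.eye(H)
assert np.allclose((B @ T) @ (np.linalg.inv(T) @ A) @ x, y_mean)

# Essential parameter dimensionality of the set of rank-<=H maps:
K = H * (M + N) - H ** 2
print(K)                          # 1200 for these sizes
```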
9
Maximum Likelihood Estimator [Baldi&Hornik95]

The ML estimator is given by
(BA)^{MLE} = Σ_{h=1}^{H} b̂_h^{MLE} â_h^{MLE t},
where
b̂_h^{MLE} â_h^{MLE t} = ω_{b_h} ω_{b_h}^t R Q^{-1}.

Here
R = (1/n) Σ_{i=1}^{n} y_i x_i^t,   Q = (1/n) Σ_{i=1}^{n} x_i x_i^t,
γ_h: h-th largest singular value of R Q^{-1/2},
ω_{a_h}: corresponding right singular vector,
ω_{b_h}: corresponding left singular vector
(so that R Q^{-1/2} ω_{a_h} = γ_h ω_{b_h} and ω_{b_h}^t R Q^{-1/2} = γ_h ω_{a_h}^t).
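The reduced-rank ML estimator above can be computed from an SVD, as in the following sketch (my own illustration, not the authors' code); the data-generating choices and all variable names are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, H, n = 50, 30, 20, 1000                    # illustrative sizes
X = rng.standard_normal((n, M))                  # inputs x_i as rows
BA_true = rng.standard_normal((N, 3)) @ rng.standard_normal((3, M))   # rank-3 true map
Y = X @ BA_true.T + rng.standard_normal((n, N))  # outputs with unit-variance noise

R = Y.T @ X / n                                  # R = (1/n) sum_i y_i x_i^t   (N x M)
Q = X.T @ X / n                                  # Q = (1/n) sum_i x_i x_i^t   (M x M)

# Symmetric inverse square root of Q, then SVD of R Q^{-1/2}.
evals, evecs = np.linalg.eigh(Q)
Q_inv_half = evecs @ np.diag(evals ** -0.5) @ evecs.T
U, gamma, _ = np.linalg.svd(R @ Q_inv_half)      # gamma[h] = gamma_h, U[:, h] = omega_{b_h}

# ML (reduced-rank regression) estimator: project the OLS map R Q^{-1}
# onto the top-H left singular directions.
ols = R @ np.linalg.inv(Q)
BA_mle = sum(np.outer(U[:, h], U[:, h]) for h in range(H)) @ ols
print(gamma[:5])                                 # the largest values carry the rank-3 signal
```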
10
Bayes Estimation

True: q(y|x);  n training samples (X^n, Y^n) = {(x_i, y_i); i = 1, ..., n}.
Learner: p(y|x, w);  Prior: φ(w).   (x: input, y: output, w: parameter)

Marginal likelihood: Z_n(Y^n|X^n) = ∫ Π_{i=1}^{n} p(y_i|x_i, w) φ(w) dw
Posterior: p(w|X^n, Y^n) = (1/Z_n(Y^n|X^n)) Π_{i=1}^{n} p(y_i|x_i, w) φ(w)
Predictive: p(y|x, X^n, Y^n) = ∫ p(y|x, w) p(w|X^n, Y^n) dw

In ML (or MAP): predict with one model.
In Bayes: predict with an ensemble of models.
11
Empirical Bayes (EB) Approach [Efron&Morris73]

True: q(y|x);  n training samples (X^n, Y^n) = {(x_i, y_i); i = 1, ..., n}.
Learner: p(y|x, w);  Prior: φ(w|τ);  Hyperparameter: τ.

Marginal likelihood: Z_n(Y^n|X^n, τ) = ∫ Π_{i=1}^{n} p(y_i|x_i, w) φ(w|τ) dw
The hyperparameter is estimated by maximizing the marginal likelihood:
τ̂(X^n, Y^n) = argmax_τ Z_n(Y^n|X^n, τ)
Posterior: p(w|X^n, Y^n) = (1/Z_n(Y^n|X^n, τ̂)) Π_{i=1}^{n} p(y_i|x_i, w) φ(w|τ̂)
Predictive: p(y|x, X^n, Y^n) = ∫ p(y|x, w) p(w|X^n, Y^n) dw
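To make the EB mechanics concrete, here is a small self-contained sketch (my own illustration, not from the slides) for a K-dimensional Gaussian mean model whose prior variance plays the role of the hyperparameter τ; the closed-form maximizer used below is specific to this conjugate toy setup.

```python
import numpy as np

def empirical_bayes_mean(Z: np.ndarray):
    """Empirical Bayes for z_i ~ N(w, I_K) with prior w ~ N(0, tau * I_K).

    The marginal likelihood of the sample mean z_bar is N(0, (tau + 1/n) I_K);
    maximizing it over tau >= 0 gives tau_hat = max(0, ||z_bar||^2 / K - 1/n).
    The posterior mean under tau_hat then shrinks the sample mean toward zero.
    """
    n, K = Z.shape
    z_bar = Z.mean(axis=0)
    tau_hat = max(0.0, z_bar @ z_bar / K - 1.0 / n)   # EB hyperparameter estimate
    shrink = tau_hat / (tau_hat + 1.0 / n)            # posterior-mean shrinkage factor
    return tau_hat, shrink * z_bar

rng = np.random.default_rng(2)
Z = rng.standard_normal((100, 5)) + 0.1               # weak true mean
print(empirical_bayes_mean(Z))
```

In this toy setup the EB posterior mean equals max(0, 1 - K/(n||z̄||²)) z̄, which already has the positive-part James-Stein form discussed on the later slides; this is the kind of EB/JS relation attributed to [Efron&Morris73].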
12
Subspace Bayes (SB) Approach

SB is an EB approach where part of the parameters are regarded as hyperparameters.

a) MIP (Marginalizing in Input Parameter space) version:
   Learner: p(y|x, A; B) = (2π)^{-N/2} exp(-||y - BAx||²/2)
   Prior:   φ(A) = (2π)^{-MH/2} exp(-tr(A A^t)/2)
   A: parameter;  B: hyperparameter.

b) MOP (Marginalizing in Output Parameter space) version:
   A: hyperparameter;  B: parameter.

Marginalization can be done analytically in LNNs.
13
Intuitive Explanation

(Figure: Bayes posterior vs. SB posterior over (a_h, b_h) for a redundant component; in SB, the hyperparameter is optimized rather than integrated out.)
14
Contents
  Backgrounds: Regular models / Unidentifiable models / Superiority of Bayes to ML / What's the purpose?
  Setting: Model / Subspace Bayes (SB) approach
  Analysis: James-Stein estimator / Solution / Generalization error
  Discussion & Conclusions
15
Free Energy (a.k.a. Evidence, Stochastic Complexity)

Free energy: F(n) = -log Z_n(Y^n|X^n)
An important quantity used for model selection [Akaike80; Mackay92].
We minimize the free energy by optimizing the hyperparameter.
16
Generalization Error

Generalization error: G(n) = ⟨ D(True || Predictive) ⟩_{q(X^n, Y^n)},
where D(q||p) is the Kullback-Leibler divergence between q and p, and ⟨V⟩_q is the expectation of V over q.

Asymptotic expansion: G(n) = λ/n + o(1/n)
λ: generalization coefficient.
In regular models, 2λ = K; in unidentifiable models, 2λ ≠ K in general.
17
James-Stein (JS) Estimator

K-dimensional mean estimation (a regular model): samples z_1, ..., z_n;
ML estimator (arithmetic mean): z̄ = (1/n) Σ_{i=1}^{n} z_i.

Domination of α over β: G_α ≤ G_β for any true parameter, and G_α < G_β for a certain true parameter.

ML is efficient (never dominated by any unbiased estimator), but is inadmissible (dominated by a biased estimator) when K ≥ 3 [Stein56].

James-Stein estimator [James&Stein61]:
ẑ^{JS} = (1 - (K - 2)/(n ||z̄||²)) z̄

(Figure: estimates of ML and JS (K = 3) relative to the true mean.)

A certain relation between EB and JS was discussed in [Efron&Morris73].
18
Positive-Part JS Estimator

Positive-part JS type (PJS) estimator:
ẑ^{PJS} = θ(n ||z̄||² > L) (1 - L/(n ||z̄||²)) z̄ = (1 - L/max(L, n ||z̄||²)) z̄,
where θ(event) = 1 (if the event is true), 0 (if the event is false), and the constant L sets the degree of shrinkage.

Thresholding ⇒ model selection.
PJS is a model-selecting, shrinkage estimator.
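A small numerical sketch (my own, not from the slides) of the plain JS estimator from the previous slide and its positive-part/thresholding variant; the shrinkage degree L and the data are purely illustrative.

```python
import numpy as np

def js_estimator(Z: np.ndarray) -> np.ndarray:
    """Plain James-Stein estimator of a K-dimensional mean (useful for K >= 3)."""
    n, K = Z.shape
    z_bar = Z.mean(axis=0)
    return (1.0 - (K - 2) / (n * (z_bar @ z_bar))) * z_bar

def pjs_estimator(Z: np.ndarray, L: float) -> np.ndarray:
    """Positive-part JS type estimator: sets the estimate to zero when
    n * ||z_bar||^2 <= L (model selection), otherwise shrinks it."""
    n, K = Z.shape
    z_bar = Z.mean(axis=0)
    scale = n * (z_bar @ z_bar)
    if scale <= L:
        return np.zeros(K)
    return (1.0 - L / scale) * z_bar

rng = np.random.default_rng(3)
Z = rng.standard_normal((50, 4))        # true mean is zero, so PJS should prune it
print(js_estimator(Z))
print(pjs_estimator(Z, L=4.0))
```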
19
Hyperparameter Optimization

Assume orthonormality: ∫ q(x) x x^t dx = I_M   (I_d: the d x d identity matrix).

The free energy then decomposes over components,
F(Y^n|X^n; B) = Σ_{h=1}^{H} F_h(Y^n|X^n; b_h),
and each F_h depends on b_h only through ||b_h||² and γ_h.

Optimum hyperparameter value (MIP): b̂_h = 0 if n γ_h² ≤ M; otherwise b̂_h is nonzero and parallel to ω_{b_h}.

Analytically solved in LNNs!
20
SB Solution (Theorem 1, Lemma 1)

Theorem 1: The SB estimator is given by
(BA)^{SB} = Σ_{h=1}^{H} (1 - L/L̂_h) b̂_h^{MLE} â_h^{MLE t} + O_p(n^{-1}),
where L̂_h = max(L, n γ_h²).

L: dimensionality of the marginalized subspace (per component), i.e., L = M in MIP, or L = N in MOP.

Lemma 1: The posterior is localized, so that we can substitute the model at the SB estimator for the predictive distribution.

SB is asymptotically equivalent to PJS estimation: the factor 1 - L/L̂_h is exactly the positive-part shrinkage θ(n γ_h² > L)(1 - L/(n γ_h²)).
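Continuing the ML-estimator sketch given earlier, here is a hedged illustration (not the authors' code) of Theorem 1: each rank-one ML component is multiplied by the positive-part factor 1 - L/L̂_h. It reuses the names U, gamma, ols from that earlier snippet and assumes the MIP choice L = M.

```python
import numpy as np

def sb_estimator(U, gamma, ols, H, n, L):
    """PJS-type shrinkage of the rank-one ML components (sketch of Theorem 1).

    U[:, h]  : left singular vector omega_{b_h} of R Q^{-1/2}
    gamma[h] : corresponding singular value gamma_h
    ols      : the unconstrained least-squares map R Q^{-1}
    L        : dimensionality of the marginalized subspace (M for MIP, N for MOP)
    """
    BA_sb = np.zeros_like(ols)
    for h in range(H):
        L_hat = max(L, n * gamma[h] ** 2)
        factor = 1.0 - L / L_hat        # equals 0 when n * gamma_h^2 <= L: component pruned
        BA_sb += factor * np.outer(U[:, h], U[:, h]) @ ols
    return BA_sb

# Usage with the quantities from the ML-estimator sketch (there M = 50, so MIP takes L = 50):
# BA_sb = sb_estimator(U, gamma, ols, H=20, n=1000, L=50)
```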
21
Generalization Error (Theorem 2)

Theorem 2: The SB generalization coefficient has the form
2λ = H*(M + N - H*) + Σ_{h=1}^{H-H*} ⟨ positive-part shrinkage term in L and γ'_h ⟩,
where γ'_h² is the h-th largest eigenvalue of a random matrix subject to the Wishart distribution W_{N-H*}(M-H*, I_{N-H*}), and ⟨·⟩ denotes the expectation over that Wishart distribution. (See the paper for the explicit form of the shrinkage term.)
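The Wishart expectation in Theorem 2 can be estimated by Monte Carlo. The sketch below (my own, using the sizes N = 30, M = 50, H = 20 that appear in the later result slides and an arbitrary H*) samples W_{N-H*}(M-H*, I_{N-H*}) and averages a positive-part shrinkage term over its leading eigenvalues; the expectand `pjs_term` is an assumed stand-in for the exact expression of the theorem, not a reproduction of it.

```python
import numpy as np

def wishart_eigs(p: int, df: int, rng) -> np.ndarray:
    """Eigenvalues (descending) of one draw from the Wishart distribution W_p(df, I_p)."""
    G = rng.standard_normal((df, p))
    return np.sort(np.linalg.eigvalsh(G.T @ G))[::-1]

def redundant_term(N, M, H, H_star, L, n_draws=2000, seed=0):
    """Monte Carlo average, over W_{N-H*}(M-H*, I), of an assumed PJS-style
    shrinkage term evaluated on the leading H - H* eigenvalues."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_draws):
        eigs = wishart_eigs(N - H_star, M - H_star, rng)[: H - H_star]
        # Assumed expectand: theta(eig > L) * (sqrt(eig) - L / sqrt(eig))^2
        pjs_term = np.where(eigs > L, (np.sqrt(eigs) - L / np.sqrt(eigs)) ** 2, 0.0)
        total += pjs_term.sum()
    return total / n_draws

# Example sizes from the result slides (MIP: L = M):
print(redundant_term(N=30, M=50, H=20, H_star=5, L=50))
```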
22
Large-Scale Approximation (Theorem 3)

Theorem 3: In the large-scale limit where M, N, H, and H* tend to infinity with their ratios fixed, the generalization coefficient converges to a closed-form limit expressed through functions J(s; ·) involving arccos and square-root terms of the limiting eigenvalue ratios. (See the paper for the explicit expression.)
23
Results 1 (True Rank Dependence)

(Figure: generalization coefficients 2λ of Bayes, SB(MIP), SB(MOP), and ML plotted against the true rank H*; learner with M = 50 inputs, N = 30 outputs, H = 20 hidden units.)

SB provides good generalization.
Note: this does NOT mean domination of SB over Bayes; discussion of domination needs consideration of a delicate situation. (See paper.)
24
Results 2 (Redundant Rank Dependence)

(Figure: generalization coefficients 2λ of Bayes, SB(MIP), SB(MOP), and ML plotted against the learner rank H; true rank H* = 0, M = 50, N = 30.)

The SB generalization coefficient depends on H similarly to ML's, i.e., SB also has a property similar to ML.
25
Contents
  Backgrounds: Regular models / Unidentifiable models / Superiority of Bayes to ML / What's the purpose?
  Setting: Model / Subspace Bayes (SB) approach
  Analysis: James-Stein estimator / Solution / Generalization error
  Discussion & Conclusions
26
Features of SB

SB
- provides good generalization: in LNNs, asymptotically equivalent to PJS;
- requires smaller computational costs: reduction of the marginalized space, and in some models the marginalization can be done analytically;
- is related to the variational Bayes (VB) approach.
27
Variational Bayes (VB) Solution [Nakajima&Watanabe05]

For M ≥ N (and redundant components H* < h ≤ H), VB results in the same solution as MIP: VB automatically selects the larger dimension to marginalize.

(Figure: Bayes posterior vs. VB posterior over (a_h, b_h); the VB posterior, with spreads of order O(1) and O(n^{-1}) in the two directions, is similar to the SB posterior.)
28
Conclusions

We have introduced a subspace Bayes (SB) approach.
We have proved that, in LNNs, SB is asymptotically equivalent to a shrinkage (PJS) estimation.
Even asymptotically, SB for redundant components converges not to ML but to a smaller value, which means suppression of overfitting.
Interestingly, MIP of SB is asymptotically equivalent to VB.
We have clarified the SB generalization error.
SB has Bayes-like and ML-like properties, i.e., shrinkage and acceleration of overfitting by basis selection.
29
Future work
Analysis of other models (neural networks, Bayesian networks, mixture models, etc.).
Analysis of variational Bayes (VB) in other models.
30
Thank you!