
Statistical Learning by Natural Gradient Descent

H.H. Yang and S. Amari

1 Department of Electrical and Computer Engineering Oregon Graduate Institute of Science and Technology 20000 NW, Walker Rd. Beaverton, OR 97006 U.S.A.

2 Laboratory for Information Synthesis RIKEN Brain Science Institute 2-1 Hirosawa, Wako-Shi Saitama 351-01 Japan

Abstract. Based on stochastic perceptron models and statistical inference, we train single-layer and two-layer perceptrons by natural gradient descent. We have discovered an efficient scheme to represent the Fisher information matrix of a stochastic two-layer perceptron. Based on this scheme, we have designed an algorithm to compute the natural gradient. When the input dimension n is much larger than the number of hidden neurons, the complexity of this algorithm is of order O(n). It is confirmed by simulations that the natural gradient descent learning rule is not only efficient but also robust.

1 Introduction

Artificial Neural Networks (ANNs) or perceptrons are widely used to model both low level neural activities and high level cognitive functions. Statistical inference provides an objective way to derive learning algorithms for training ANNs and to evaluate the performance of the trained ANNs.

An ANN has input nodes for taking data, hidden nodes for the internal representation and output nodes for displaying patterns. The goal of learning is to find an input-output relation with a generalization error as small as possible.

We assume a stochastic model for two-layer perceptrons. Considering a Riemannian parameter space in which the Fisher information matrix is a metric tensor, we apply the natural gradient learning rule to train single-layer and two-layer perceptrons. One direct way to compute the natural gradient is to compute the inverse of the Fisher information matrix. But this causes a heavy computational burden when the input dimension is high. To overcome this difficulty, we propose a fast algorithm with lower complexity to compute the inverse of the Fisher information matrix in order to compute the natural gradient.


We have discovered a new scheme to represent the Fisher information matrix in a semi-analytic form in which some coefficients are integrals that need to be evaluated by numerical methods. When the scaled error function is chosen as the sigmoid function for the hidden neurons, most of those integrals can be evaluated analytically. In particular, for a committee machine we obtain an analytic form for the Fisher information matrix.

When the input dimension $n$ is much larger than the number of hidden neurons, the time complexity of computing the natural gradient based on the new scheme is $O(n)$ (Yang and Amari, 1998) [11]. In contrast, when the number of hidden neurons is of the same order as the input dimension, we gave an alternative approach (Yang and Amari, 1997) [10] to implement the natural gradient learning rule. The idea is to compute the natural gradient without inverting the Fisher information matrix.

2 A Stochastic Two-Layer Perceptron

Assume the following model of a stochastic two-layer perceptron:
\[ z = \sum_{i=1}^m a_i\varphi(w_i^Tx + b_i) + \xi \tag{1} \]
where $(\cdot)^T$ denotes the transpose, $\xi\sim N(0,\sigma^2)$ is a Gaussian random variable, and $\varphi(x)$ is a differentiable output function of the hidden neurons, such as the hyperbolic tangent function or the error function. Assume the two-layer network has an $n$-dimensional input, $m$ hidden neurons, a one-dimensional output, and $m\le n$. Denote by $a = (a_1,\cdots,a_m)^T$ the weight vector of the output neuron, by $w_i = (w_{1i},\cdots,w_{ni})^T$ the weight vector of the $i$-th hidden neuron, and by $b = (b_1,\cdots,b_m)^T$ the vector of thresholds for the hidden neurons. Let $W = [w_1,\cdots,w_m]$ be the matrix formed by the column weight vectors $w_i$; then (1) can be rewritten as $z = a^T\varphi(W^Tx + b) + \xi$. Here, the scalar function $\varphi$ operates on each component of the vector $W^Tx + b$.
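To make the generative model concrete, here is a minimal NumPy sketch of sampling from (1); the sizes, parameter values, and the choice $\varphi = \tanh$ are illustrative assumptions, not values from the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

n, m, sigma = 8, 2, 0.1          # illustrative sizes and noise level
W = rng.normal(size=(n, m))      # columns w_1, ..., w_m
a = rng.normal(size=m)           # output weights
b = rng.normal(size=m)           # hidden thresholds

def sample(num):
    """Draw (x, z) pairs from the stochastic perceptron z = a^T phi(W^T x + b) + xi."""
    x = rng.normal(size=(num, n))            # x ~ N(0, I)
    xi = sigma * rng.normal(size=num)        # xi ~ N(0, sigma^2)
    z = np.tanh(x @ W + b) @ a + xi          # phi = tanh here
    return x, z

x, z = sample(5)
print(z.shape)   # (5,)
```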

The joint probability density function (pdf) of the input and the output is
\[ p(x, z; W, a, b) = p(z|x; W, a, b)\,p(x). \]

Define a loss function:
\[ L(x, z;\theta) = -\log p(x, z;\theta) = l(z|x;\theta) - \log p(x) \]
where $\theta = (w_1^T,\cdots,w_m^T, a^T, b^T)^T$ includes all the parameters to be determined and
\[ l(z|x;\theta) = -\log p(z|x;\theta) = \frac{1}{2\sigma^2}\big(z - a^T\varphi(W^Tx + b)\big)^2. \]

Since $\frac{\partial L}{\partial\theta} = \frac{\partial l}{\partial\theta}$, the Fisher information matrix is defined by
\[ G(\theta) = E\Big[\frac{\partial l}{\partial\theta}\Big(\frac{\partial l}{\partial\theta}\Big)^T\Big]. \tag{2} \]


The inverse of $G(\theta)$ is often used in the Cramer-Rao inequality:
\[ E[(\hat\theta - \theta^*)(\hat\theta - \theta^*)^T] \ge G^{-1}(\theta^*), \]
where $\hat\theta$ is an unbiased estimator of a true parameter $\theta^*$. For the on-line estimator $\hat\theta_t$ based on the independent examples $\{(x_s, z_s),\ s = 1,\cdots,t\}$ drawn from the probability law $p(x, z;\theta^*)$, the Cramer-Rao inequality for the on-line estimator is
\[ E[\|\hat\theta_t - \theta^*\|^2] \ge \frac{1}{t}\,\mathrm{Tr}(G^{-1}(\theta^*)). \tag{3} \]

We now formulate an on-line estimator proposed by Amari (1998) [2] which attains the Cramer-Rao lower bound.

3 Natural Gradient Learning

Consider a parameter space $\Theta = \{\theta\}$ in which the divergence between two points $\theta_1$ and $\theta_2$ is given by the Kullback-Leibler divergence
\[ D(\theta_1, \theta_2) = KL\big[p(x, z;\theta_1)\,\|\,p(x, z;\theta_2)\big]. \]
When the two points are infinitesimally close, we have the quadratic form
\[ D(\theta, \theta + d\theta) = \frac{1}{2}\,d\theta^TG(\theta)\,d\theta. \tag{4} \]

This is regarded as the square of the length of $d\theta$. Since $G(\theta)$ depends on $\theta$, the parameter space is regarded as a Riemannian space in which the local distance is defined by (4). Here, the Fisher information matrix $G(\theta)$ plays the role of the Riemannian metric tensor.

It is shown by Amari that the steepest descent direction of a loss function $C(\theta)$ in the Riemannian space $\Theta$ is
\[ -\tilde\nabla C(\theta) = -G^{-1}(\theta)\nabla C(\theta). \]
The natural gradient descent method decreases the loss function by updating the parameter vector along this direction. By multiplying by $G^{-1}(\theta)$, the covariant gradient $\nabla C(\theta)$ is converted into its contravariant form, namely $G^{-1}(\theta)\nabla C(\theta)$, which is consistent with the contravariant differential form $dC(\theta)$.

Since $\sigma$ is unknown, instead of using $l(z|x;\theta)$ we use the following loss function:
\[ l_1(z|x;\theta) = \frac{1}{2}\big(z - a^T\varphi(W^Tx + b)\big)^2. \]
We show in the next section that $G(\theta) = \frac{1}{\sigma^2}A(\theta)$, where $A(\theta)$ does not depend on $\sigma$ explicitly. Since $\frac{\partial l}{\partial\theta}$ also contains the factor $\frac{1}{\sigma^2}$, we have
\[ G^{-1}(\theta)\frac{\partial l}{\partial\theta} = A^{-1}(\theta)\frac{\partial l_1}{\partial\theta}. \]


The on-line learning algorithms based on the gradient $\frac{\partial l_1}{\partial\theta}$ and the natural gradient $A^{-1}(\theta)\frac{\partial l_1}{\partial\theta}$ are, respectively,
\[ \theta_{t+1} = \theta_t - \mu_t\,\frac{\partial l_1}{\partial\theta}(z_t|x_t;\theta_t), \tag{5} \]
\[ \theta_{t+1} = \theta_t - \tilde\mu_t\,A^{-1}(\theta_t)\frac{\partial l_1}{\partial\theta}(z_t|x_t;\theta_t), \tag{6} \]
where $\mu_t$ and $\tilde\mu_t$ are learning rates. When the negative log-likelihood function is chosen as the loss function, the natural gradient descent algorithm (6) gives a Fisher efficient on-line estimator (Amari, 1998), i.e., the asymptotic variance of $\hat\theta_t$ driven by (6) satisfies

\[ E[(\hat\theta_t - \theta^*)(\hat\theta_t - \theta^*)^T \mid \theta^*] \approx \frac{1}{t}G^{-1}(\theta^*), \tag{7} \]
which gives the mean square error attaining the lower bound:
\[ E[\|\hat\theta_t - \theta^*\|^2 \mid \theta^*] \approx \frac{1}{t}\,\mathrm{Tr}(G^{-1}(\theta^*)). \tag{8} \]
Note that the algorithm (5) may not attain this limit.

Remark 1 The local minima problem still exists in natural gradient learning. The equilibrium points are the same for the averaged versions of the dynamics (5) and (6), but the inverse of the Fisher information matrix optimizes the learning dynamics.

Remark 2 The natural gradient learning algorithm is efficient, but its complexity is generally high due to the computation of the matrix inverse. In later sections, we shall explore the structure of the Fisher information matrix for a multi-layer perceptron and design a low-complexity algorithm to compute $G^{-1}(\theta)$. The dimension of $G(\theta)$ is $(nm+2m)\times(nm+2m)$; the time complexity of computing $G^{-1}(\theta)$ by commonly used algorithms is $O(n^3)$. This is too high for the on-line natural gradient learning algorithm. Our method reduces the time complexity of computing $G^{-1}(\theta)$ to $O(n^2)$ and the complexity of computing the natural gradient to $O(n)$.
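The remark above can be illustrated with the following hedged sketch, which contrasts an ordinary gradient step (5) with a naive natural gradient step (6); `grad` and `fisher` are hypothetical callables standing for $\partial l_1/\partial\theta$ and $G(\theta)$, and the dense solve below is exactly the $O(n^3)$-order cost that the structured method of the later sections is designed to avoid.

```python
import numpy as np

def gd_step(theta, grad, lr):
    """Ordinary on-line gradient step, cf. (5)."""
    return theta - lr * grad(theta)

def natural_gd_step(theta, grad, fisher, lr):
    """Natural gradient step, cf. (6): theta <- theta - lr * G(theta)^{-1} grad(theta).

    A dense linear solve is used here for clarity; for an N-dimensional
    parameter it costs on the order of N^3 flops, which is the burden the
    paper's structured inverse is meant to reduce.
    """
    return theta - lr * np.linalg.solve(fisher(theta), grad(theta))

# toy usage on a quadratic loss 0.5*theta^T H theta, treating H as the metric
H = np.diag([1.0, 100.0])
theta = np.array([1.0, 1.0])
for _ in range(20):
    theta = natural_gd_step(theta, lambda t: H @ t, lambda t: H, lr=0.5)
print(theta)  # converges quickly despite the ill-conditioned H
```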

4 Fisher Information Matrix of a Multi-Layer Perceptron

4.1 Likelihood Function of the Stochastic Multi-Layer Perceptron

In the previous sections, we did not make any specific assumption about the input. To compute an explicit expression of the Fisher information matrix, we now assume that the input $x$ is subject to the standard Gaussian distribution $N(0, I)$, where $I$ is the identity matrix.

From $l(z|x;\theta) = \frac{1}{2\sigma^2}\big(z - \sum_{i=1}^m a_i\varphi(w_i^Tx + b_i)\big)^2$, we obtain
\[ \frac{\partial l}{\partial w_{ji}} = -\frac{1}{\sigma^2}\Big(z - \sum_{k=1}^m a_k\varphi(w_k^Tx + b_k)\Big)a_i\varphi'(w_i^Tx + b_i)\,x_j, \tag{9} \]
\[ \frac{\partial l}{\partial a_i} = -\frac{1}{\sigma^2}\Big(z - \sum_{k=1}^m a_k\varphi(w_k^Tx + b_k)\Big)\varphi(w_i^Tx + b_i), \tag{10} \]
\[ \frac{\partial l}{\partial b_i} = -\frac{1}{\sigma^2}\Big(z - \sum_{k=1}^m a_k\varphi(w_k^Tx + b_k)\Big)a_i\varphi'(w_i^Tx + b_i). \tag{11} \]

When $(x, z)$ is the input-output pair of the stochastic multi-layer perceptron, the residual $z - \sum_{k=1}^m a_k\varphi(w_k^Tx + b_k)$ equals the noise $\xi$, and the Equations (9)-(11) become
\[ \frac{\partial l}{\partial w_{ji}} = -\frac{\xi}{\sigma^2}a_i\varphi'(w_i^Tx + b_i)\,x_j, \tag{12} \]
\[ \frac{\partial l}{\partial a_i} = -\frac{\xi}{\sigma^2}\varphi(w_i^Tx + b_i), \tag{13} \]
\[ \frac{\partial l}{\partial b_i} = -\frac{\xi}{\sigma^2}a_i\varphi'(w_i^Tx + b_i), \tag{14} \]
and, in vector forms,
\[ \frac{\partial l}{\partial w_i} = -\frac{\xi}{\sigma^2}a_i\varphi'(w_i^Tx + b_i)\,x, \quad i = 1,\cdots,m, \tag{15} \]
\[ \frac{\partial l}{\partial a} = -\frac{\xi}{\sigma^2}\varphi(W^Tx + b), \tag{16} \]
\[ \frac{\partial l}{\partial b} = -\frac{\xi}{\sigma^2}\{a\circ\varphi'(W^Tx + b)\}, \tag{17} \]

where $\circ$ denotes the Hadamard product, which is the componentwise product of vectors or matrices. The above three vector forms can be written as one:
\[ \frac{\partial l}{\partial\theta} = -\frac{\xi}{\sigma^2}\begin{bmatrix}(a\circ\varphi'(W^Tx+b))\otimes x\\ \varphi(W^Tx+b)\\ a\circ\varphi'(W^Tx+b)\end{bmatrix} \tag{18} \]
where $\otimes$ denotes the Kronecker product. For two matrices $A = (a_{ij})$ and $B = (b_{ij})$, $A\otimes B$ is defined as the larger matrix consisting of the block matrices $[a_{ij}B]$. For two column vectors $a$ and $b$, $a\otimes b$ gives the higher-dimensional column vector $[a_1b^T,\cdots,a_nb^T]^T$.
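A hedged NumPy transcription of (18), assuming $\varphi = \tanh$ (so $\varphi' = 1 - \tanh^2$); the function name and argument layout are ours.

```python
import numpy as np

def grad_l(W, a, b, x, z, sigma=1.0):
    """Stacked gradient dl/dtheta of Eq. (18) for phi = tanh.

    W: (n, m) matrix with columns w_i; a, b: (m,) vectors.
    Returns a vector of length n*m + 2*m ordered as (w_1, ..., w_m, a, b).
    """
    h = W.T @ x + b                         # pre-activations w_i^T x + b_i
    phi, dphi = np.tanh(h), 1.0 - np.tanh(h) ** 2
    xi = z - a @ phi                        # residual; equals the noise for the true model
    g_w = np.kron(a * dphi, x)              # (a o phi'(W^T x + b)) (x) x
    g_a = phi
    g_b = a * dphi
    return -(xi / sigma ** 2) * np.concatenate([g_w, g_a, g_b])
```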

4.2 Block Structure in the Fisher Information Matrix

The parameter $\theta$ consists of $m+2$ blocks,
\[ \theta = [w_1^T,\cdots,w_m^T,\,a^T,\,b^T]^T, \]


where the first $m$ blocks $w_i$ are $n$-dimensional and the last two blocks $a$ and $b$ are $m$-dimensional. From (18), the Fisher information matrix $G(\theta) = \frac{1}{\sigma^2}A(\theta)$ is of dimension $(n+2)m\times(n+2)m$. Here, the matrix $A(\theta)$ does not depend on $\sigma^2$. It is partitioned into $(m+2)\times(m+2)$ blocks of sub-matrices corresponding to the partition of $\theta$. We denote each sub-matrix by $A_{ij}$. The Fisher information matrix is written as
\[ G(\theta) = \frac{1}{\sigma^2}\,[A_{ij}]_{(m+2)\times(m+2)}. \]
The partition diagram for the matrix $A(\theta)$ is given below:
\[
A(\theta) = \begin{array}{c|ccccc}
 & w_1^T & \cdots & w_m^T & a & b \\ \hline
w_1 & A_{1,1} & \cdots & A_{1,m} & A_{1,m+1} & A_{1,m+2} \\
\vdots & \vdots & & \vdots & \vdots & \vdots \\
w_m & A_{m,1} & \cdots & A_{m,m} & A_{m,m+1} & A_{m,m+2} \\
a & A_{m+1,1} & \cdots & A_{m+1,m} & A_{m+1,m+1} & A_{m+1,m+2} \\
b & A_{m+2,1} & \cdots & A_{m+2,m} & A_{m+2,m+1} & A_{m+2,m+2}
\end{array}
\]
where

\[ A_{ij} = a_ia_j\,E[\varphi'(w_i^Tx+b_i)\varphi'(w_j^Tx+b_j)\,xx^T] \quad\text{for } 1\le i,j\le m, \]
\[ A_{i,m+1} = a_i\,E[\varphi'(w_i^Tx+b_i)\,x\,\varphi(x^TW+b^T)] \quad\text{and}\quad A_{m+1,i} = A_{i,m+1}^T \quad\text{for } 1\le i\le m, \]
\[ A_{m+1,m+1} = E[\varphi(W^Tx+b)\varphi(x^TW+b^T)], \]
\[ A_{i,m+2} = E[a_i\varphi'(w_i^Tx+b_i)\,x\,(a^T\circ\varphi'(x^TW+b^T))] \quad\text{and}\quad A_{m+2,i} = A_{i,m+2}^T \quad\text{for } 1\le i\le m, \]
\[ A_{m+1,m+2} = A_{m+2,m+1}^T = E[\varphi(W^Tx+b)(a^T\circ\varphi'(x^TW+b^T))], \tag{19} \]
\[ A_{m+2,m+2} = (aa^T)\circ E[\varphi'(W^Tx+b)\varphi'(x^TW+b^T)]. \tag{20} \]

Here, $E[\cdot]$ denotes the expectation with respect to $p(x)$. From the above expressions, we see that $A(\theta) = [A_{ij}]$ does not depend on $\sigma^2$. The explicit forms of these matrix blocks in the Fisher information matrix are characterized by the three lemmas in the next section.
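Since $\partial l/\partial\theta = -(\xi/\sigma^2)v(x)$ with $v(x)$ the bracketed vector of (18), we have $A(\theta) = E_x[v(x)v(x)^T]$, so the block expressions above can be sanity-checked by a plain Monte Carlo average over $x\sim N(0,I)$. A sketch (again assuming $\varphi = \tanh$; names are ours):

```python
import numpy as np

def stacked_v(W, a, b, x):
    """The bracketed vector v(x) in Eq. (18) for phi = tanh."""
    h = W.T @ x + b
    dphi = 1.0 - np.tanh(h) ** 2
    return np.concatenate([np.kron(a * dphi, x), np.tanh(h), a * dphi])

def mc_A(W, a, b, num=50_000, seed=0):
    """Monte Carlo estimate of A(theta) = E_x[v(x) v(x)^T], x ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    n = W.shape[0]
    dim = W.size + 2 * W.shape[1]
    A = np.zeros((dim, dim))
    for _ in range(num):
        v = stacked_v(W, a, b, rng.normal(size=n))
        A += np.outer(v, v)
    return A / num

# example with n = 4, m = 2, so A is 12 x 12
rng = np.random.default_rng(1)
A_hat = mc_A(rng.normal(size=(4, 2)), np.ones(2), np.zeros(2))
print(A_hat.shape)   # (12, 12)
```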

4.3 Structure of Blocks in the Fisher Information Matrix

Lemma 1 For $i = 1,\cdots,m$, the diagonal blocks are given by
\[ A_{ii} = a_i^2\big[d_1(w_i,b_i)I + \{d_2(w_i,b_i) - d_1(w_i,b_i)\}u_iu_i^T\big] \tag{21} \]
where $w_i = \|w_i\|$ denotes the Euclidean norm of $w_i$, $u_i = w_i/w_i$,
\[ d_1(w,b) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\big(\varphi'(wx+b)\big)^2e^{-\frac{x^2}{2}}\,dx > 0, \tag{22} \]
\[ d_2(w,b) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{\infty}\big(\varphi'(wx+b)\big)^2x^2e^{-\frac{x^2}{2}}\,dx > 0. \tag{23} \]


When $a_i\ne 0$, the inverse of the matrix $A_{ii}$ is given by
\[ A_{ii}^{-1} = \frac{1}{a_i^2}\Big[\frac{1}{d_1(w_i,b_i)}I + \Big\{\frac{1}{d_2(w_i,b_i)} - \frac{1}{d_1(w_i,b_i)}\Big\}u_iu_i^T\Big]. \tag{24} \]

The proof of Lemma 1 is given in Appendix 1. In particular, for the single-layer stochastic perceptron
\[ z = \varphi(w^Tx + b) + \xi, \]
the expressions (21) and (24) give the Fisher information matrix
\[ G(w) = \frac{1}{\sigma^2}\big[d_1(w,b)I + \{d_2(w,b) - d_1(w,b)\}uu^T\big] \tag{25} \]
and its inverse formula
\[ G^{-1}(w) = \sigma^2\Big[\frac{1}{d_1(w,b)}I + \Big\{\frac{1}{d_2(w,b)} - \frac{1}{d_1(w,b)}\Big\}uu^T\Big], \tag{26} \]
where $w = \|w\|$ and $u = w/w$.
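A small numerical check of (21)-(26), evaluating the integrals (22)-(23) by Gauss-Hermite quadrature for $\varphi = \tanh$ and verifying the rank-one inverse formula (24); the helper names and parameter values are illustrative.

```python
import numpy as np

def d1_d2(w, b, k=80):
    """Numerically evaluate the integrals (22)-(23) for phi = tanh via Gauss-Hermite."""
    t, wt = np.polynomial.hermite.hermgauss(k)
    x = np.sqrt(2.0) * t                 # E[f(X)] = sum(wt * f(sqrt(2) t)) / sqrt(pi), X ~ N(0,1)
    dphi2 = (1.0 - np.tanh(w * x + b) ** 2) ** 2
    d1 = np.sum(wt * dphi2) / np.sqrt(np.pi)
    d2 = np.sum(wt * dphi2 * x ** 2) / np.sqrt(np.pi)
    return d1, d2

# verify the rank-one inverse formula (24) for one diagonal block
rng = np.random.default_rng(1)
wi, ai, bi = rng.normal(size=5), 0.7, 0.3
w_norm = np.linalg.norm(wi)
u = (wi / w_norm)[:, None]
d1, d2 = d1_d2(w_norm, bi)
A_ii = ai**2 * (d1 * np.eye(5) + (d2 - d1) * u @ u.T)
A_ii_inv = (1 / ai**2) * ((1 / d1) * np.eye(5) + (1 / d2 - 1 / d1) * u @ u.T)
print(np.allclose(A_ii @ A_ii_inv, np.eye(5)))   # True
```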

When $b = 0$, the above formula was found by Amari in [1]. To obtain the explicit forms of the other blocks $A_{ij}$ in $G$, we need to introduce two bases in $\mathbb{R}^n$ which possess certain properties. Except for a set in $\mathbb{R}^n$ with zero Lebesgue measure, the vectors $\{u_1,\cdots,u_m\}$ are linearly independent. It is easy to supplement $n-m$ vectors $u_{m+1},\cdots,u_n$ to them, such that together they form a basis $\{u_1,\cdots,u_m;\,u_{m+1},\cdots,u_n\}$ of $\mathbb{R}^n$ with the following properties:
\[ u_j \perp \mathcal{L}(u_1,\cdots,u_m) \quad\text{for } j > m, \tag{27} \]
\[ u_j^Tu_k = \delta_{j,k} \quad\text{for } j,k > m, \tag{28} \]
where $\mathcal{L}(u_1,\cdots,u_m)$ is the vector space spanned by $\{u_1,\cdots,u_m\}$. Let $U = [u_1\cdots u_n]$ and $V = (U^{-1})^T = [v_1\cdots v_n]$. The identity
\[ V^TU = U^TV = I \]
implies
\[ v_i^Tu_j = \delta_{i,j}. \tag{29} \]

Hence, $\{v_1,\cdots,v_n\}$ is the orthogonal conjugate basis of $\{u_1,\cdots,u_n\}$. From (27), (28) and (29), we can easily prove that
\[ v_j = u_j \quad\text{for } j > m. \]
The random input $x$ is represented in dual ways by
\[ x = \sum_{i=1}^n x_iu_i = \sum_{i=1}^n x'_iv_i \]


where $(x_1,\cdots,x_n)$ and $(x'_1,\cdots,x'_n)$ are the coordinates of $x$ on the bases $(u_1,\cdots,u_n)$ and $(v_1,\cdots,v_n)$, respectively. It follows from (29) that
\[ x_i = v_i^Tx, \qquad x'_i = u_i^Tx. \tag{30} \]

We denote an $m\times n$ matrix by $(a_{ij})_{m\times n}$. The subscript $m\times n$ may be omitted when the dimension of the matrix can be determined by the context. Define an $n\times n$ matrix $R = (r_{ij}) = (u_i^Tu_j)$. Let $R^{-1} = (r^{ij})$. Noticing that
\[ R = (u_i^Tu_j) = U^TU \]
and $V = (U^{-1})^T$, we have
\[ R^{-1} = U^{-1}(U^{-1})^T = V^TV, \quad\text{i.e.,}\quad r^{ij} = v_i^Tv_j. \]
Due to the properties (27) and (28),
\[ R = \begin{bmatrix}(r_{ij})_{m\times m} & 0 \\ 0 & I\end{bmatrix}_{n\times n}, \qquad R^{-1} = \begin{bmatrix}(r^{ij})_{m\times m} & 0 \\ 0 & I\end{bmatrix}_{n\times n}, \]
where $0$ and $I$ are zero and identity matrices with the dimensions admissible in the above partitions of $R$ and $R^{-1}$. By the definition of $R$, we have $x'_j = u_j^Tx = u_j^T\sum_{k=1}^n x_ku_k = \sum_{k=1}^n r_{jk}x_k$ and $x_j = \sum_{k=1}^n r^{jk}x'_k$. Therefore, for $1\le j\le m$, $x'_j = \sum_{k=1}^m r_{jk}x_k$ and $x_j = \sum_{k=1}^m r^{jk}x'_k$, and for $j = m+1,\cdots,n$, $x_j = x'_j$. With the above notations, we have an expression of the matrix $A_{ij}$ for $1\le i\ne j\le m$ in the following lemma.

Lemma 2 For $1\le i,j\le m$,
\[ A_{ij} = a_ia_j\Big(c_{ij}\,\Omega_0 + \sum_{l,k=1}^m c_{ij}^{lk}\,u_lv_k^T\Big) \tag{31} \]
where
\[ \Omega_0 = \sum_{k=m+1}^n u_ku_k^T = I - \sum_{k=1}^m u_kv_k^T, \tag{32} \]
\[ c_{ij} = E[\varphi'(w_ix'_i + b_i)\varphi'(w_jx'_j + b_j)], \tag{33} \]
\[ c_{ij}^{lk} = E\Big[\varphi'(w_ix'_i + b_i)\varphi'(w_jx'_j + b_j)\Big(\sum_{s=1}^m r^{ls}x'_s\Big)x'_k\Big]. \tag{34} \]
The proof of Lemma 2 is given in Appendix 2.

Remark 3 The vectors $\{u_{m+1},\cdots,u_n\}$ are only used for the theoretical analysis. They are not needed in the Equation (31) because of the Equation (32). To compute $\{v_1,\cdots,v_m\}$, we define $U_1 = [u_1,\cdots,u_m]$ and $V_1 = [v_1,\cdots,v_m]$. From $U^TV = I$, we then have
\[ V_1^T = (U_1^TU_1)^{-1}U_1^T, \tag{35} \]
which is the generalized inverse of $U_1$.
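The computation described in Remark 3 amounts to one normalization and one small $m\times m$ inverse. A sketch (NumPy, our variable names):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 10, 3
W = rng.normal(size=(n, m))                 # columns w_1, ..., w_m
U1 = W / np.linalg.norm(W, axis=0)          # columns u_i = w_i / ||w_i||
V1 = U1 @ np.linalg.inv(U1.T @ U1)          # V_1^T = (U_1^T U_1)^{-1} U_1^T, see (35)
print(np.allclose(U1.T @ V1, np.eye(m)))    # dual-basis property u_i^T v_j = delta_ij: True
Omega0 = np.eye(n) - U1 @ V1.T              # Omega_0 = I - sum_k u_k v_k^T, see (32)
```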

Remark 4 When $i = j$, the Equation (31) gives another representation of $A_{ii}$, which is more complicated than the representation (21). But the new representation of $A_{ii}$ is needed to derive some properties of the Fisher information matrix. To compute the inverse of $A_{ii}$, we use the Equation (24). To compute the inverse of the Fisher information matrix, we also need to compute the inverse of matrices which have the same structure as $A_{ij}$ for $i\ne j$. This will be explained in detail later, when we give an algorithm to compute the inverse of the Fisher information matrix.

It follows from (30) that both $x_i$ and $x'_i$ are Gaussian random variables with zero mean and
\[ E[x'_ix'_j] = r_{ij}, \qquad E[x_ix_j] = r^{ij}. \tag{36} \]
Therefore,
\[ (x'_1,\cdots,x'_n) \sim N(0, R), \tag{37} \]
\[ (x_1,\cdots,x_n) \sim N(0, R^{-1}), \tag{38} \]
and from (33) and (34), $c_{ij}$ is a function of $b_i$, $b_j$, $w_i$, $w_j$ and $r_{ij}$, and $c_{ij}^{lk}$ is a function of $b_i$, $b_j$, $w_i$, $w_j$ and $(r_{ij})_{m\times m}$.

Lemma 3 For $1\le i\le m$,
\[ A_{i,m+1} = A_{m+1,i}^T = \Big(\sum_{k=1}^m c_{i1}^kv_k,\cdots,\sum_{k=1}^m c_{im}^kv_k\Big) \tag{39} \]
where
\[ c_{ij}^k = E[\varphi'(w_ix'_i + b_i)\varphi(w_jx'_j + b_j)\,x'_k], \quad 1\le i,j,k\le m. \tag{40} \]
$A_{i,m+2}$ has the same structure as $A_{i,m+1}$:
\[ A_{i,m+2} = A_{m+2,i}^T = \Big(\sum_{k=1}^m \bar c_{i1}^kv_k,\cdots,\sum_{k=1}^m \bar c_{im}^kv_k\Big) \tag{41} \]

where $\bar c_{ij}^k = a_jE[\varphi'(w_ix'_i + b_i)\varphi'(w_jx'_j + b_j)\,x'_k]$.


\[ A_{m+1,m+1} = (b_{ij})_{m\times m} \tag{42} \]
with $b_{ij} = E[\varphi(w_ix'_i + b_i)\varphi(w_jx'_j + b_j)]$, a function of $b_i$, $b_j$, $w_i$, $w_j$ and $r_{ij}$.
\[ A_{m+1,m+2} = A_{m+2,m+1}^T = (\bar b_{ij})_{m\times m} \tag{43} \]
with $\bar b_{ij} = a_jE[\varphi(w_ix'_i + b_i)\varphi'(w_jx'_j + b_j)]$.
\[ A_{m+2,m+2} = (b'_{ij})_{m\times m} \tag{44} \]
with $b'_{ij} = a_ia_jE[\varphi'(w_ix'_i + b_i)\varphi'(w_jx'_j + b_j)]$.
The proof of Lemma 3 is given in Appendix 3.

Remark 5 When the input is a Gaussian random vector, the matrix $G$ can be calculated analytically by Lemmas 1-3, except for some numerical integrations for the coefficients $d_1$, $d_2$, $c_{ij}^{lk}$, etc. When the pdf of the input is unknown, we can estimate the Fisher information matrix based on the empirical distribution of the input. However, this is time consuming and also needs a large number of input examples. It is difficult to implement the natural gradient descent method as an on-line algorithm in this way. When the pdf of the input is known but different from the standard Gaussian distribution, either non-standard Gaussian or non-Gaussian, we need some preprocessing procedures.

4.4 Preprocessing

In the previous sections, the explicit form of $G$ is found by assuming a standard Gaussian input. This assumption facilitates the computation of $G$. In fact, the explicit form of $G$ is still useful for non-Gaussian inputs if a preprocessing procedure is applied. This is equivalent to adding one preprocessing layer in front of the stochastic multi-layer perceptron defined by (1).

Assume the sampled input $x_t$ is an i.i.d. process. When the input is not a standard Gaussian process, we can use a linear or non-linear mapping to transform the input into a Gaussian process.

If the input is Gaussian but the covariance matrix is not an identity matrix, we can apply a linear transform $u_t = Bx_t$ to obtain a standard Gaussian process $u_t$. Here, the matrix $B$ can be found either by a batch algorithm or by an on-line algorithm. For the batch algorithm, we first compute the sample covariance matrix
\[ \hat R_x = \frac{1}{T-1}\sum_{t=1}^T(x_t - \bar x)(x_t - \bar x)^T, \]
then compute the Cholesky factorization $\hat R_x = FF^T$ and find $B = F^{-1}$.

We can also apply the following on-line whitening algorithm from [4]:
\[ B_{t+1} = B_t - \mu(u_tu_t^T - I)B_t, \tag{45} \]
where $u_t = B_tx_t$ and $\mu > 0$ is a learning rate.

If the input is not Gaussian, we need a non-linear function to transform the input $x_t$ into a Gaussian process. Let $F_x(\cdot)$ be the cumulative distribution function (cdf) of $x_t$ and $F_N(\cdot)$ be the cdf of an $n$-dimensional Gaussian random variable; then we obtain a standard Gaussian process
\[ u_t = F_N^{-1}(F_x(x_t)). \]
If $F_x(\cdot)$ is unknown, it can be approximated by the empirical distribution based on the data set. This method is applicable to inputs with an arbitrary distribution, but it is not applicable to on-line processing since the empirical distribution is computed by a batch algorithm. If the probability density function (pdf) of the input can be approximated by some expansion, such as the Gram-Charlier expansion or the Edgeworth expansion [9], $F_x(\cdot)$ can be approximated on-line by using an adaptive algorithm to compute the moments or cumulants in these expansions.

After preprocessing, we use the data set $\tilde D_T = \{(u_t, z_t),\ t = 1,\cdots,T\}$ instead of $D_T = \{(x_t, z_t),\ t = 1,\cdots,T\}$ to train the multi-layer perceptron.
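A hedged sketch of the batch preprocessing step for a Gaussian input with non-identity covariance (sample covariance, Cholesky factor, $B = F^{-1}$); the centering by the sample mean is an extra detail added for the illustration, and the on-line and non-Gaussian variants described above are not shown.

```python
import numpy as np

def batch_whitener(X):
    """Return (mean, B) such that u = B (x - mean) is approximately N(0, I).

    X: (T, n) array of inputs. Uses the sample covariance and its Cholesky
    factor R_x = F F^T, with B = F^{-1}, as in the batch algorithm above.
    """
    mu = X.mean(axis=0)
    Rx = np.cov(X, rowvar=False)        # (n, n) sample covariance, 1/(T-1) normalization
    F = np.linalg.cholesky(Rx)          # lower-triangular factor
    return mu, np.linalg.inv(F)

rng = np.random.default_rng(3)
C = np.array([[2.0, 0.8], [0.8, 1.0]])
X = rng.multivariate_normal(np.zeros(2), C, size=5000)
mu, B = batch_whitener(X)
U = (X - mu) @ B.T
print(np.round(np.cov(U, rowvar=False), 2))   # close to the identity matrix
```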

5 Inverse of Fisher Information Matrix

Now we calculate the inverse of the Fisher information matrix, which is needed in the natural gradient descent method.

We have already shown the explicit form (24) of the inverse of $A_{ii}$ for $1\le i\le m$. To calculate the entire inverse of $G$, we need to compute the inverse of a matrix with a representation similar to that of $A_{ij}$ for $1\le i\ne j\le m$.

To this end, let us define some sets:
\[ Gl(m,\mathbb{R}) = \{A\in\mathbb{R}^{m\times m} : \det(A)\ne 0\}, \]
\[ M = \Big\{a_0\Omega_0 + \sum_{i,j=1}^m a_{ij}u_iv_j^T : a_0\ne 0,\ A = (a_{ij})\in Gl(m,\mathbb{R})\Big\}, \]
\[ \bar M = \Big\{a_0\Omega_0 + \sum_{i,j=1}^m a_{ij}u_iv_j^T\Big\} \quad\text{(the closure of } M\text{)}, \]
\[ \tilde M = \{[a_0, A] : a_0\ne 0,\ A\in Gl(m,\mathbb{R})\}, \]
and recall the definition $\Omega_0 = \sum_{j=m+1}^n u_ju_j^T$. Here, $Gl(m,\mathbb{R})$ is the set of all non-singular $m\times m$ matrices. The set $\tilde M$ is a product space of $\mathbb{R}\setminus\{0\}$ and $Gl(m,\mathbb{R})$.


5.1 A Group Structure

Define a mapping $\psi: \tilde M\to\mathbb{R}^{n\times n}$ by
\[ \psi(\mathcal{A}) = a_0\Omega_0 + \sum_{i,j=1}^m a_{ij}u_iv_j^T \in M, \quad\text{for } \mathcal{A} = [a_0, (a_{ij})]\in\tilde M. \]

Then $M$ is the image of the set $\tilde M$ under the mapping $\psi$. For $\mathcal{A} = [a_0, A]$ and $\mathcal{B} = [b_0, B]\in\tilde M$, we define a multiplication in $\tilde M$ by
\[ \mathcal{A}*\mathcal{B} = [a_0b_0,\ AB]. \]
It is easy to verify that $\tilde M$ is a group. For $\mathcal{A} = [a_0, A]\in\tilde M$, its inverse element is $\mathcal{A}^{-1} = [\frac{1}{a_0}, A^{-1}]$. The following lemma gives the relation between $\tilde M$ and $M$.

Lemma 4 The mapping $\psi$ is an isomorphism between $\tilde M$ and $M$, which implies that $M$ is a sub-group of $Gl(n,\mathbb{R})$ and that, for $C = a_0\Omega_0 + \sum_{i,j=1}^m a_{ij}u_iv_j^T\in M$,
\[ C^{-1} = \frac{1}{a_0}\Omega_0 + \sum_{i,j=1}^m a^{ij}u_iv_j^T, \tag{46} \]
where the matrix $(a^{ij}) = (a_{ij})^{-1}$ is the inverse of the matrix $(a_{ij})$.

The proof of Lemma 4 is given in Appendix 4. Regarding the relation between $\bar M$ and $M$, we have Lemma 5.

Lemma 5
\[ M = \bar M\cap Gl(n,\mathbb{R}). \tag{47} \]

The proof of Lemma 5 is given in Appendix 5. Let $B_{[m\times m]} = [B_{ij}]_{m\times m}$ be a matrix consisting of $m\times m$ blocks. To invert the Fisher information matrix, we need the following lemma regarding the inverse of a block matrix $B_{[m\times m]}$ whose blocks belong to $\bar M$.

Lemma 6 If $B_{[m\times m]}$ is a positive definite matrix with blocks $B_{ij}\in\bar M$, then the inverse $B_{[m\times m]}^{-1}$ has the same structure as $B_{[m\times m]}$, namely, $B_{[m\times m]}^{-1} = [B^{ij}]_{m\times m}$ in which $B^{ij}\in M$.

The proof of Lemma 6 is given here since it includes procedures to compute the inverse of a block matrix with blocks in $\bar M$. These procedures will be referred to several times in later sections.


Proof of Lemma 6:

We shall repeatedly apply the following 4-block matrix inverse formula in this paper:
\[ \begin{bmatrix}B_{11} & B_{12}\\ B_{21} & B_{22}\end{bmatrix}^{-1} = \begin{bmatrix}B^{11} & B^{12}\\ B^{21} & B^{22}\end{bmatrix} \]
where
\[ B^{11} = B_{11}^{-1} + B_{11}^{-1}B_{12}B_{22,1}^{-1}B_{21}B_{11}^{-1}, \qquad B_{22,1} = B_{22} - B_{21}B_{11}^{-1}B_{12}, \]
\[ B^{22} = B_{22,1}^{-1}, \qquad\text{and}\qquad B^{12} = (B^{21})^T = -B_{11}^{-1}B_{12}B_{22,1}^{-1}. \]

Since $B_{[m\times m]}$ is positive definite, the diagonal blocks $B_{ii}$ in $B_{[m\times m]}$ and $B_{[m\times m]}^{-1}$ are also positive definite. Let us first compute the inverse of $[B_{ij}]_{2\times 2}$. By Lemma 5, if an invertible matrix belongs to $\bar M$, it also belongs to $M$. So $B_{11}\in M$, and by Lemma 4, $B_{11}^{-1}\in M$. Since the equality (63) in Appendix 4 still holds in $\bar M$, the set $\bar M$ is closed under the matrix product. It is easy to see that $\bar M$ is also closed under operations such as matrix summation and scaling. By Lemma 5, $B_{22,1}\in M$, and so $B_{22,1}^{-1}\in M$ again by Lemma 4. Hence, $B^{ij}\in M$ for $i\ne j$. The proposition of Lemma 6 is proved for $m = 2$. Assume the proposition is true for $m = k-1$.

Again, we apply the 4-block matrix inverse formula to the following $2\times 2$ block matrix
\[ B_{[k\times k]} = \begin{bmatrix}B'_{11} & B'_{12}\\ B'_{21} & B'_{22}\end{bmatrix} \]
where
\[ B'_{11} = B_{[(k-1)\times(k-1)]}, \qquad B'_{12} = \begin{bmatrix}B_{1,k}\\ \vdots\\ B_{k-1,k}\end{bmatrix}, \qquad B'_{21} = [B_{k,1},\cdots,B_{k,k-1}], \qquad B'_{22} = B_{kk}. \]
Because $\bar M$ is closed under the matrix product, summation and scaling, $B'_{22,1} = B'_{22} - B'_{21}B'^{-1}_{11}B'_{12}$ belongs to $\bar M$ by the inductive assumption that the proposition is true for $m = k-1$. Since $B_{[k\times k]}$ is positive definite, so is $B'_{22,1}$, and hence $B'^{22} = B'^{-1}_{22,1}\in M$. Similarly, we can prove that the blocks in $B'^{12}$, $B'^{21}$ and $B'^{11}$ belong to $M$. Therefore, we conclude that the proposition is also true for $m = k$. Q.E.D.
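For reference, here is a generic NumPy sketch of the 4-block inverse formula used in this proof, applied to dense blocks; it does not exploit the parametric form of the blocks in $M$, which is where the paper's actual savings come from.

```python
import numpy as np

def block4_inverse(B11, B12, B21, B22):
    """Inverse of [[B11, B12], [B21, B22]] via the Schur complement B_{22,1}."""
    B11_inv = np.linalg.inv(B11)
    S = B22 - B21 @ B11_inv @ B12                              # B_{22,1}
    S_inv = np.linalg.inv(S)                                   # B^{22}
    T12 = -B11_inv @ B12 @ S_inv                               # B^{12}
    T21 = -S_inv @ B21 @ B11_inv                               # B^{21}
    T11 = B11_inv + B11_inv @ B12 @ S_inv @ B21 @ B11_inv      # B^{11}
    return np.block([[T11, T12], [T21, S_inv]])

# check on a random symmetric positive definite matrix
rng = np.random.default_rng(4)
X = rng.normal(size=(6, 6)); P = X @ X.T + 6 * np.eye(6)
inv = block4_inverse(P[:4, :4], P[:4, 4:], P[4:, :4], P[4:, 4:])
print(np.allclose(inv, np.linalg.inv(P)))    # True
```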


5.2 Inverse of $A_{ij}$

To compute $A_{ij}^{-1}$ for $1\le i,j\le m$, we use the expression (31). Define the matrices
\[ C_{ij} = (c_{ij}^{lk})_{m\times m} \]
for $1\le i\ne j\le m$. Here, the $c_{ij}^{lk}$ are defined by (34). Applying Lemmas 4-5, from Lemma 2 we have the following theorem.

Theorem 1 For $1\le i\ne j\le m$, $A_{ij}$ is invertible iff $c_{ij}\ne 0$ and the matrix $C_{ij}$ is invertible. When $A_{ij}$ is invertible, we have the following inverse formula
\[ A_{ij}^{-1} = \frac{1}{a_ia_j}\Big(\frac{1}{c_{ij}}\Omega_0 + \sum_{l,k=1}^m \bar c_{ij}^{lk}\,u_lv_k^T\Big), \tag{48} \]
where $(\bar c_{ij}^{lk}) = C_{ij}^{-1}$.

5.3 Inverse of the Fisher Information Matrix

To compute the inverse of the Fisher information matrix $G = \frac{1}{\sigma^2}[A_{ij}]_{(m+2)\times(m+2)}$, we consider the following partition of $G$:
\[ G = \frac{1}{\sigma^2}\begin{bmatrix}G_{11} & G_{12}\\ G_{21} & G_{22}\end{bmatrix} \tag{49} \]
where $G_{11}$ and $G_{22}$ are the $m\times m$ and $2\times 2$ block matrices on the left-upper and right-lower corners of $G$, respectively. The sub-blocks of the $G_{ij}$ are defined in Equation (20). We first apply the 4-block matrix inverse formula to the above partition of $G$ and then apply the procedures in the proof of Lemma 6 to compute the inverse of $G_{11}$.

5.4 Complexity of Computing the Inverse of the Fisher Information Matrix and the Natural Gradient

The method for computing $G^{-1}$ given in the above section has low complexity when $m\ll n$. We define the complexity of an algorithm by the time and space needed to run the algorithm. We count the time and space complexity in flops and units. A flop is one multiplication or one addition. A unit is the memory space for one variable.

One advantage of using the formulas (24) and (48) is to reduce the space complexity for storing $G$. To store $G$ directly, we need $\frac{1}{2}(mn+2m)(mn+2m+1)$ units. But, by using the representation (31), when $m\ll n$ the space complexity for storing $G$ is reduced from $O(n^2)$ to $O(n)$, since the dual bases are shared by all blocks in $G_{11}$. Here, $G_{11}$ is defined in the partition (49) of $G$.

The dual bases $\{u_1,\cdots,u_n\}$ and $\{v_1,\cdots,v_n\}$ are used in the theoretical analysis. The vectors $\{u_{m+1},\cdots,u_n\}$ are not needed in the representation (31) of $A_{ij}$ or in the inverse formula (48). Normalizing the weight vectors $\{w_1,\cdots,w_m\}$, we obtain $\{u_1,\cdots,u_m\}$ with time complexity $O(n)$. By the algorithm given in Remark 3, we can compute $V_1 = [v_1,\cdots,v_m]$. The time complexity is also $O(n)$.

In the partition (49), the dimension of $G_{11}$ is $mn\times mn$ and that of $G_{22}$ is $2m\times 2m$. When $m\ll n$, the time complexity of computing $G^{-1}$ is determined by that of computing $G_{11}^{-1}$, which is computed by the procedures in the proof of Lemma 6, where the 4-block matrix inverse formula and Lemmas 4-5 are repeatedly applied. In this process, $G_{11}^{-1}$ is recursively computed. The size of the working memory needed is still of order $O(n^2)$ since $m\ll n$. But, when $m\ll n$, the time complexity of computing $G_{11}^{-1}$ is significantly reduced from $O(n^3)$ by conventional methods, such as the one in [7], to $O(n^2)$ by our method. In summary, we have the following table comparing our method with the conventional methods:

Table 1. The complexity of our method and the conventional methods.

  Cost                       Conventional methods   Our method
  Space to store G           O(n^2)                 O(n)
  Time to compute G^{-1}     O(n^3)                 O(n^2)
  Working memory needed      O(n^2)                 O(n^2)

Since the explicit expressions (24) and (48) are suitable for parallel matrix computations, we can further reduce the time complexity by using a computer with distributed-memory multiprocessors.

Based on the structure of the inverse of the Fisher information matrix, we compute the natural gradient. It is shown in [11] that the complexity of computing the natural gradient is $O(n)$.

5.5 The Cramer-Rao Lower Bound for a Committee Machine

For a committee machine, we can further simplify the factor $\mathrm{Tr}(G^{-1}(\theta^*))$ in the Cramer-Rao lower bound inequality (3). When $a_i = 1$, $b_i = 0$, $i = 1,\cdots,m$, the multi-layer perceptron (1) becomes a model for a committee machine. The dimension of the parameter space of this committee machine is $nm$.

The Fisher information matrix for the committee machine is
\[ G(\theta) = \frac{1}{\sigma^2}A_{[m\times m]} = \frac{1}{\sigma^2}\begin{bmatrix}A_{11} & \cdots & A_{1m}\\ \vdots & & \vdots\\ A_{m1} & \cdots & A_{mm}\end{bmatrix} \]
where $\theta = (w_1^T,\cdots,w_m^T)^T$ and the $A_{ij}$ are the same as those for the two-layer perceptron except that $a_i = 1$.

By Lemma 6,
\[ G(\theta)^{-1} = \sigma^2 A_{[m\times m]}^{-1} = \sigma^2[A^{ij}]_{m\times m}, \]
where $A^{ii}\in M$ for all $i$ and $A^{ij}\in M$ for $i\ne j$.

Note that for $A = a_0\Omega_0 + \sum_{k,l=1}^m a_{kl}u_kv_l^T\in\bar M$,
\[ \mathrm{Tr}(A) = a_0(n-m) + \mathrm{Tr}((a_{kl})) = a_0(n-m) + \sum_{k=1}^m a_{kk} \]
because of the properties (28) and (29). Let $A^{ii} = a_0^i\Omega_0 + \sum_{k,l=1}^m a_{ii}^{kl}u_kv_l^T$. Since $\mathrm{Tr}(u_kv_l^T) = v_l^Tu_k = \delta_{l,k}$, we have
\[ \mathrm{Tr}(A^{ii}) = a_0^i(n-m) + \sum_{k=1}^m a_{ii}^{kk}. \]

Hence,
\[ \mathrm{Tr}(G(\theta)^{-1}) = \sigma^2\,\mathrm{Tr}(A_{[m\times m]}^{-1}) = \sigma^2\sum_{i=1}^m \mathrm{Tr}(A^{ii}) = \sigma^2\big(c_0(\theta) + c_1(\theta)\,n\big) \]
where
\[ c_1(\theta) = \sum_{i=1}^m a_0^i, \qquad c_0(\theta) = \sum_{i,k=1}^m a_{ii}^{kk} - c_1(\theta)\,m. \]

So we have the following theorem, which gives an order estimate of the Cramer-Rao lower bound for the committee machine.

Theorem 2 Assume $m\ll n$ and $G(\theta)$ is non-singular around $\theta^*$; then the Cramer-Rao lower bound inequality is
\[ E_{\theta^*}[\|\hat\theta_t - \theta^*\|^2] \ge \frac{\sigma^2}{t}(c_0 + c_1n), \tag{50} \]
where $c_0 = c_0(\theta^*)$ and $c_1 = c_1(\theta^*)$ are two constants not depending on the input dimension $n$.


It is shown in [2] that the natural gradient descent learning rule (6) is Fisher efficient. In particular, for the committee machine, by Theorem 2 the natural gradient descent rule gives the mean square error
\[ E_{\theta^*}[\|\hat\theta_t - \theta^*\|^2] \approx \frac{\sigma^2}{t}(c_0 + c_1n). \tag{51} \]

6 Closed Forms

If we choose $\varphi(x) = \mathrm{erf}(\frac{x}{\sqrt 2})$ as the sigmoid function for the hidden neurons, where $\mathrm{erf}(x) = \frac{2}{\sqrt\pi}\int_0^x e^{-t^2}dt$ is the error function, then most of the expectations (integrals) in Lemma 2 and Lemma 3 can be evaluated analytically. In particular, for a one-layer perceptron and a committee machine the Fisher information matrix can be evaluated analytically. This is inspired by Saad and Solla's work [6], where the error function was used to obtain an analytic expression of the generalization error.

For the one-layer stochastic perceptron $z = \varphi(w^Tx + b) + \xi$, the explicit form of the Fisher information matrix is found to be
\[ G(\theta) = \frac{1}{\sigma^2}\begin{bmatrix}A_{11} & a_{12}\\ a_{21} & a_{22}\end{bmatrix} \]
where $\theta = (w^T, b)^T$,
\[ A_{11} = d_0(w,b)I + (d_2(w,b) - d_0(w,b))\,ww^T/w^2, \qquad w = \|w\|, \]
\[ a_{12} = a_{21}^T = d_1(w,b)\,w/w, \qquad a_{22} = d_0(w,b), \]
\[ d_0(w,b) = \frac{\sqrt 2}{\pi\sqrt{w^2+0.5}}\exp\Big\{-b^2 + \frac{b^2w^2}{w^2+0.5}\Big\}, \]
\[ d_2(w,b) = \frac{d_0(w,b)}{w^2+0.5}\Big(\frac{1}{2} + \frac{w^2b^2}{w^2+0.5}\Big), \]
\[ d_1(w,b) = \frac{\sqrt 2\,wb}{\pi(w^2+0.5)^{3/2}}\exp\Big\{-b^2 + \frac{b^2w^2}{w^2+0.5}\Big\}. \]

For a committee machine, after evaluating some integrals with respect to a multinormal distribution, we can also obtain a closed form of the Fisher information matrix.
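A hedged numerical check of the closed forms above for $\varphi(x) = \mathrm{erf}(x/\sqrt 2)$, for which $\varphi'(x) = \sqrt{2/\pi}\,e^{-x^2/2}$; the quadrature helper and test values are ours.

```python
import numpy as np

def d0_d2_closed(w, b):
    """Closed forms of d_0 and d_2 given in the text."""
    s = w ** 2 + 0.5
    e = np.exp(-b ** 2 + b ** 2 * w ** 2 / s)
    d0 = np.sqrt(2.0) / (np.pi * np.sqrt(s)) * e
    d2 = d0 / s * (0.5 + w ** 2 * b ** 2 / s)
    return d0, d2

def d0_d2_quadrature(w, b, k=60):
    """Same quantities as E[(phi'(wX+b))^2] and E[(phi'(wX+b))^2 X^2], X ~ N(0,1)."""
    t, wt = np.polynomial.hermite.hermgauss(k)
    x = np.sqrt(2.0) * t
    dphi2 = (2.0 / np.pi) * np.exp(-(w * x + b) ** 2)   # (phi')^2 for phi = erf(./sqrt(2))
    d0 = np.sum(wt * dphi2) / np.sqrt(np.pi)
    d2 = np.sum(wt * dphi2 * x ** 2) / np.sqrt(np.pi)
    return d0, d2

print(np.allclose(d0_d2_closed(1.3, 0.4), d0_d2_quadrature(1.3, 0.4)))  # True
```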

7 Other Methods to Compute Natural Gradient

In Section 5, we gave a fast algorithm to compute $A^{-1}(\theta_t)$ and the natural gradient. However, this algorithm is only effective when the input dimension $n$ is much larger than the number of hidden neurons $m$.


When $n$ and $m$ are of the same order, we propose an alternative method to compute the natural gradient. The idea is to compute $A^{-1}(\theta_t)\frac{\partial l_1}{\partial\theta}(z_t|x_t;\theta_t)$ without computing $A^{-1}(\theta_t)$. The problem is to solve the linear equation
\[ Ay = b \tag{52} \]
for the unknown $y$ without inverting $A$. Here, we denote $A = A(\theta_t)$ and $b = \frac{\partial l_1}{\partial\theta}(z_t|x_t;\theta_t)$.

Solving the linear Equation (52) is equivalent to minimizing the quadratic form
\[ P(y) = \frac{1}{2}y^TAy - b^Ty, \]
since $A$ is symmetric and positive definite. The steepest descent learning rule is
\[ y_{k+1} = y_k + \alpha_k(b - Ay_k), \]
where $\alpha_k = v_k^Tv_k/(v_k^TAv_k)$ and $v_k = b - A(\theta_t)y_k$. Here, $\alpha_k$ is chosen by minimizing $P(y_k + \alpha v_k)$.

Since the steepest descent algorithm is usually slow, we use the conjugate gradient algorithm (see [8], for example) to find the solution of the linear Equation (52). By the conjugate gradient method, the solution can be found within $O(n)$ iterations. However, since a matrix-vector multiplication needing $O(n)$ flops is performed in each iteration, the time complexity of the conjugate gradient algorithm is still of order $O(n^2)$.
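A hedged sketch of solving (52) by the standard conjugate gradient iteration (any symmetric positive definite $A$ will do; this is generic CG, not the paper's specific implementation):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, max_iter=None):
    """Solve A y = b for symmetric positive definite A without forming A^{-1}."""
    y = np.zeros_like(b)
    r = b - A @ y                 # residual
    p = r.copy()                  # search direction
    rs = r @ r
    for _ in range(max_iter or len(b)):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        y += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return y

# natural-gradient direction y = A(theta)^{-1} dl1/dtheta, here with a random SPD A
rng = np.random.default_rng(5)
X = rng.normal(size=(30, 30)); A = X @ X.T + 30 * np.eye(30)
g = rng.normal(size=30)
print(np.allclose(conjugate_gradient(A, g), np.linalg.solve(A, g)))  # True
```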

Amari et al. [3] recently proposed an adaptive method to realize natural gradient learning. The idea is to approximate the inverse of the Fisher information matrix and the natural gradient on-line.

8 Natural Gradient Vector Field

The behaviors of the two learning dynamics (5) and (6) are characterized, respectively, by the mean vector fields
\[ v_1(\theta) = E_{\theta^*}\Big[-\frac{\partial l}{\partial\theta}(z|x;\theta)\Big], \qquad v_2(\theta) = G^{-1}(\theta)\,v_1(\theta). \]


Again, the expectation with respect to $p(x,z;\theta^*)$ is denoted by $E_{\theta^*}[\cdot]$, and the samples $(x,z)$ are independently drawn from the pdf $p(x,z;\theta^*)$ for the teacher network:
\[ z = \sum_{j=1}^m a_j^*\varphi(w_j^{*T}x + b_j^*) + \xi. \]
From (9)-(11), we have
\[ E_{\theta^*}\Big[-\frac{\partial l}{\partial w_i}(z|x;\theta)\Big] = \frac{1}{\sigma^2}E\Big[\sum_{j=1}^m\delta_j(x;\theta^*,\theta)\,a_i\varphi'(w_i^Tx+b_i)\,x\Big], \]
\[ E_{\theta^*}\Big[-\frac{\partial l}{\partial a_i}(z|x;\theta)\Big] = \frac{1}{\sigma^2}E\Big[\sum_{j=1}^m\delta_j(x;\theta^*,\theta)\,\varphi(w_i^Tx+b_i)\Big], \qquad E_{\theta^*}\Big[-\frac{\partial l}{\partial b_i}(z|x;\theta)\Big] = \frac{1}{\sigma^2}E\Big[\sum_{j=1}^m\delta_j(x;\theta^*,\theta)\,a_i\varphi'(w_i^Tx+b_i)\Big], \]
where $\delta_j(x;\theta^*,\theta) = a_j^*\varphi(w_j^{*T}x + b_j^*) - a_j\varphi(w_j^Tx + b_j)$.

The above two equations are used to compute the vector field $v_1(\theta)$. Note that the $\sigma^2$ in $v_1(\theta)$ and $G^{-1}(\theta)$ cancel in computing $v_2(\theta)$, so the vector field $v_2(\theta)$ does not depend on the noise level $\sigma^2$.

The following dynamical systems,
\[ \frac{d\theta}{dt} = v_1(\theta), \tag{53} \]
\[ \frac{d\theta}{dt} = v_2(\theta) = G^{-1}(\theta)\,v_1(\theta), \tag{54} \]
characterize the averaged behaviors of the gradient descent algorithm (5) and the natural gradient descent algorithm (6), respectively.

To show the difference between the two mean vector fields $v_1(\theta)$ and $v_2(\theta)$, we consider a simple committee machine with 2-dimensional input, 2 hidden neurons, $a_1 = a_2 = 1$, thresholds $b_1 = b_2 = 0$, and weight vectors $w_i$ of unit length $\|w_i\| = 1$. The weight vectors are reparameterized by
\[ w_i = \begin{bmatrix}\cos(\alpha_i)\\ \sin(\alpha_i)\end{bmatrix}, \quad i = 1, 2. \]
Let $\varphi(x) = \tanh(x)$, $\sigma^2 = 1$, and $\alpha_1^* = 0$ and $\alpha_2^* = \frac{3\pi}{4}$ be the parameters for the teacher. Let $\theta = (\alpha_1, \alpha_2)$. We calculate the mean vector fields $v_1(\alpha_1,\alpha_2)$ and $v_2(\alpha_1,\alpha_2)$ in the same way as in deriving (53) and (54). The two mean vector fields are plotted in Figures 1 and 2.

Although the gradient descent system (53) and the natural gradient descent system (54) have the same set of fixed points, they generate different gradient flows. The convergence of the ordinary gradient flow towards $\theta^*$ is slower than that of the natural gradient flow, especially along the diagonal line in the $(\alpha_1,\alpha_2)$-space. On average, the learning dynamics of the natural gradient descent algorithm is optimized to achieve the Cramer-Rao lower bound.

Fig. 1. The mean vector field $v_1(\alpha_1,\alpha_2)$ for $\alpha_1^* = 0$, $\alpha_2^* = 3\pi/4$.

Fig. 2. The mean vector field $v_2(\alpha_1,\alpha_2)$ for $\alpha_1^* = 0$, $\alpha_2^* = 3\pi/4$.

9 Simulations

In this section, we give some simulation results to demonstrate that the natural gradient descent algorithm is efficient and robust.

9.1 Single-Layer Perceptron

Assume 7-dimensional inputs $x_t\sim N(0,I)$ and $\varphi(u) = \mathrm{erf}(\frac{u}{\sqrt 2})$. For the single-layer perceptron $z = \varphi(w^Tx)$, the on-line gradient descent (GD) and natural GD algorithms are, respectively,
\[ \Delta w_t = \mu_0(t)\big(z_t - \varphi(w_t^Tx_t)\big)\varphi'(w_t^Tx_t)\,x_t, \tag{55} \]
\[ \Delta\tilde w_t = \mu_1(t)\,A^{-1}(\tilde w_t)\big(z_t - \varphi(\tilde w_t^Tx_t)\big)\varphi'(\tilde w_t^Tx_t)\,x_t, \tag{56} \]



where
\[ A^{-1}(w) = \frac{1}{d_0(w)}I + \Big(\frac{1}{d_2(w)} - \frac{1}{d_0(w)}\Big)\frac{ww^T}{w^2}, \tag{57} \]
\[ d_0(w) = d_0(w,0) = \frac{\sqrt 2}{\pi\sqrt{w^2+0.5}}, \tag{58} \]
\[ d_2(w) = d_2(w,0) = \frac{1}{\sqrt 2\,\pi(w^2+0.5)^{3/2}}, \tag{59} \]
and $\mu_0(t)$ and $\mu_1(t)$ are two learning rate schedules defined by $\mu_i(t) = \mu(\eta_i, c_i, \tau_i; t)$, $i = 0, 1$. Here,
\[ \mu(\eta, c, \tau; t) = \eta\Big(1 + \frac{c}{\eta}\frac{t}{\tau}\Big)\Big/\Big(1 + \frac{c}{\eta}\frac{t}{\tau} + \frac{t^2}{\tau}\Big) \tag{60} \]
is the search-then-converge schedule proposed by Darken and Moody [5]. Note that $t < \tau$ is a "search phase" and $t > \tau$ is a "converge phase". When $\tau_i = 1$, the learning rate function $\mu_i(t)$ has no search phase. Asymptotically, $\mu_i(t)$ decreases as $c_i/t$.

Randomly choose a 7-dimensional vector as $w^*$ for the teacher network. Choose $\eta_0 = 0.75$, $\eta_1 = 0.07$, $c_0 = 8.75$, $c_1 = 1$, and $\tau_0 = \tau_1 = 1$. These parameters are selected by trial and error to optimize the performance of the GD and the natural GD methods at the noise level $\sigma = 0.2$.


The training examples $\{(x_t, z_t)\}$ are generated by
\[ z_t = \varphi(w^{*T}x_t) + \xi_t, \]
where $\xi_t\sim N(0,\sigma^2)$ and $\sigma^2$ is unknown to the algorithms. Let $w_t$ and $\tilde w_t$ be the weight vectors driven by the Equations (55) and (56), respectively. $\|w_t - w^*\|$ and $\|\tilde w_t - w^*\|$ are the error functions for the GD and the natural GD.

Denote $w^* = \|w^*\|$. From the Equation (57), we obtain the Cramer-Rao Lower Bound (CRLB) for the deviation at the true weight vector $w^*$:
\[ \mathrm{CRLB}(t) = \frac{\sigma}{\sqrt t}\sqrt{\frac{n-1}{d_0(w^*)} + \frac{1}{d_2(w^*)}}. \tag{61} \]

The performance of the natural GD algorithm is compared with that of the GD algorithm in Figure 3. The parameters in the learning rate schedules are optimized such that the two algorithms perform almost the same at the noise level $\sigma = 0.2$, as shown in the upper sub-figure of Figure 3. To examine the robustness of the two algorithms, we change the noise level for the teacher network but do not change the learning rate schedules optimized at $\sigma = 0.2$. It is shown by the two lower sub-figures in Figure 3 that the natural gradient descent algorithm is robust against the additive noise in the training examples.

Fig. 3. Performance of the GD and the natural GD at different noise levels $\sigma = 0.2, 0.4, 1$.
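A condensed, hedged re-implementation of this single-layer experiment (NumPy; $\varphi(u) = \mathrm{erf}(u/\sqrt 2)$ and $A^{-1}(w)$ from (57)-(59)). The random initialization and the crude $1/t$ annealing below replace the schedule (60), so this is only an illustration of the two update rules, not a reproduction of Figure 3.

```python
import numpy as np
from math import erf, pi, sqrt, exp

phi = lambda u: erf(u / sqrt(2.0))
dphi = lambda u: sqrt(2.0 / pi) * exp(-u * u / 2.0)
d0 = lambda w: sqrt(2.0) / (pi * sqrt(w * w + 0.5))
d2 = lambda w: 1.0 / (sqrt(2.0) * pi * (w * w + 0.5) ** 1.5)

def A_inv(w_vec):
    """Inverse Fisher factor (57) for the single-layer perceptron with b = 0."""
    w = np.linalg.norm(w_vec)
    u = w_vec / w
    return np.eye(len(w_vec)) / d0(w) + (1.0 / d2(w) - 1.0 / d0(w)) * np.outer(u, u)

def run(natural, sigma=0.2, T=500, seed=0):
    rng = np.random.default_rng(seed)
    n = 7
    w_star = rng.normal(size=n)
    w = 0.1 * rng.normal(size=n)          # small random start so ||w|| > 0
    err = []
    for t in range(1, T + 1):
        x = rng.normal(size=n)
        z = phi(w_star @ x) + sigma * rng.normal()
        g = (z - phi(w @ x)) * dphi(w @ x) * x      # per-example gradient direction
        lr = (1.0 if natural else 0.75) / (t + 5)   # crude annealing, not schedule (60)
        w = w + lr * (A_inv(w) @ g if natural else g)
        err.append(np.linalg.norm(w - w_star))
    return err

print(run(False)[-1], run(True)[-1])   # final deviations of GD and natural GD
```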

9.2 Two-Layer Perceptron

Let us consider a simple two-layer perceptron with 2-dimensional input and 2 hidden neurons. The problem is to train the committee machine $y = \varphi(w_1^Tx) + \varphi(w_2^Tx)$ based on the examples $\{(x_t,z_t),\ t = 1,\cdots,T\}$ generated by the stochastic committee machine $z_t = \varphi(w_1^{*T}x_t) + \varphi(w_2^{*T}x_t) + \xi_t$. Assume $\|w_i^*\| = 1$. We can reparameterize the weight vectors to decrease the dimension of the parameter space from 4 to 2:
\[ w_i = \begin{bmatrix}\cos(\alpha_i)\\ \sin(\alpha_i)\end{bmatrix}, \qquad w_i^* = \begin{bmatrix}\cos(\alpha_i^*)\\ \sin(\alpha_i^*)\end{bmatrix}, \quad i = 1, 2. \]

The parameter space is $\{\theta = (\alpha_1, \alpha_2)\}$. Assume that the true parameters are $\alpha_1^* = 0$ and $\alpha_2^* = \frac{3\pi}{4}$. Due to the symmetry, both $\theta_1^* = (0, \frac{3\pi}{4})$ and $\theta_2^* = (\frac{3\pi}{4}, 0)$ are true parameters. Let $\theta_t$ and $\tilde\theta_t$ be computed by the GD algorithm and the natural GD algorithm, respectively. The errors are measured by
\[ \varepsilon_t = \min\{\|\theta_t - \theta_1^*\|,\ \|\theta_t - \theta_2^*\|\} \quad\text{and}\quad \tilde\varepsilon_t = \min\{\|\tilde\theta_t - \theta_1^*\|,\ \|\tilde\theta_t - \theta_2^*\|\}. \]
In this simulation, using $\theta_0 = (0.1, 0.2)$ as an initial estimate, we first start the GD algorithm and run it for 80 iterations. Then, we use the estimate obtained from the GD algorithm at the 80-th iteration as an initial estimate for the natural GD algorithm and run the latter algorithm for 420 iterations. The noise level is $\sigma = 0.05$. $N$ independent runs are conducted to obtain the errors $\varepsilon_t(j)$ and $\tilde\varepsilon_t(j)$, $j = 1,\cdots,N$. Define the root mean square errors
\[ \eta_t = \sqrt{\frac{1}{N}\sum_{j=1}^N(\varepsilon_t(j))^2} \quad\text{and}\quad \tilde\eta_t = \sqrt{\frac{1}{N}\sum_{j=1}^N(\tilde\varepsilon_t(j))^2}. \]

Based on $N = 10$ independent runs, the errors $\eta_t$ and $\tilde\eta_t$ are computed and compared with the CRLB in Figure 4. The search-then-converge learning schedule (60) is used in the GD algorithm, while the learning rate for the natural GD algorithm is simply the annealing rate $\frac{1}{t}$.

Fig. 4. The GD vs. the natural GD.

10 Conclusions

The natural gradient descent learning rule is statistically efficient. It can be used to train any adaptive system.


But the complexity of this learning rule depends on the architecture of the learning machine. The main difficulty in implementing this learning rule is to compute the natural gradient. For a two-layer perceptron, when the input dimension is much larger than the number of hidden neurons, we found an efficient scheme to represent the Fisher information matrix. Based on this scheme, we formulated a fast algorithm to compute the inverse of the Fisher information matrix and the natural gradient. The complexities for computing the former and the latter are $O(n^2)$ and $O(n)$, respectively.

When the input dimension is of the same order as the number of hidden neurons, the above scheme is not effective. In this case, we proposed to use the conjugate gradient algorithm to compute the natural gradient without inverting the high-dimensional Fisher information matrix. The time complexity of this algorithm is $O(n^2)$, while the complexity of conventional algorithms for the same purpose is of order $O(n^3)$.

The simulation results have confirmed the fast convergence and statistical efficiency of the natural gradient descent learning rule.


They have also verified that this learning rule is robust against changes of the noise levels in the training examples and of the parameters in the learning rate schedules.

Appendix

Appendix 1 Proof of Lemma 1:

Let $u_1 = w_i/w_i$ and extend it to an orthonormal basis $\{u_1,\cdots,u_n\}$ of $\mathbb{R}^n$. The random input $x$ is decomposed as
\[ x = \sum_{j=1}^n c_ju_j \]
where $c_j = x^Tu_j$. Because $x\sim N(0,I)$, the $c_j$ are i.i.d. with $N(0,1)$. Noticing
\[ xx^T = c_1^2u_1u_1^T + \sum_{j=2}^n c_jc_1(u_ju_1^T + u_1u_j^T) + \sum_{j,k=2}^n c_jc_ku_ju_k^T; \qquad E[c_j] = 0; \qquad E[c_jc_k] = \delta_{j,k}, \]
we have
\[
\begin{aligned}
E[(\varphi'(w_i^Tx + b_i))^2xx^T]
&= E[(\varphi'(w_ic_1+b_i))^2c_1^2]\,u_1u_1^T + E[(\varphi'(w_ic_1+b_i))^2c_1]\sum_{j=2}^n E[c_j]\,(u_ju_1^T + u_1u_j^T)\\
&\quad + E[(\varphi'(w_ic_1+b_i))^2]\sum_{j,k=2}^n E[c_jc_k]\,u_ju_k^T\\
&= E[(\varphi'(w_ic_1+b_i))^2c_1^2]\,u_1u_1^T + E[(\varphi'(w_ic_1+b_i))^2]\sum_{j=2}^n u_ju_j^T\\
&= d_2(w_i,b_i)\,u_1u_1^T + d_1(w_i,b_i)\sum_{j=2}^n u_ju_j^T.
\end{aligned}
\]
Since the $\{u_j\}$ are orthonormal,
\[ \sum_{j=2}^n u_ju_j^T = I - u_1u_1^T. \]
Therefore,
\[ A_{ii} = a_i^2\,E[(\varphi'(w_i^Tx + b_i))^2xx^T] = a_i^2\big[d_1(w_i,b_i)I + \{d_2(w_i,b_i) - d_1(w_i,b_i)\}u_iu_i^T\big]. \]
Q.E.D.


Appendix 2 Proof of Lemma 2:

Here we use the notations introduced before Lemma 2. Since $u_j = v_j$ for $j > m$, we have $x_j = x'_j$ for $j > m$, and $E[x_ix_j] = E[x_ix'_j] = \delta_{i,j}$ for $i, j > m$.

When $k\ne l$, $E[x_kx'_l] = E[v_k^Txx^Tu_l] = v_k^Tu_l = 0$. Since $x_k$ and $x'_l$ are Gaussian, they are independent when $k\ne l$. Therefore, $\{x'_1,\cdots,x'_m\}$ and $\{x_{m+1},\cdots,x_n\}$ are independent. Applying this result, we simplify the following:
\[
\begin{aligned}
E[\varphi'(w_i^Tx+b_i)\varphi'(w_j^Tx+b_j)\,xx^T]
&= E\Big[\varphi'(w_iu_i^Tx+b_i)\varphi'(w_ju_j^Tx+b_j)\Big(\sum_{l=1}^n x_lu_l\Big)\Big(\sum_{k=1}^n x'_kv_k^T\Big)\Big]\\
&= E\Big[\varphi'(w_ix'_i+b_i)\varphi'(w_jx'_j+b_j)\Big(\sum_{l,k=1}^m + \sum_{l=1}^m\sum_{k=m+1}^n + \sum_{l=m+1}^n\sum_{k=1}^m + \sum_{l,k=m+1}^n\Big)x_lx'_k\,u_lv_k^T\Big]\\
&= \sum_{l,k=1}^m c_{ij}^{lk}u_lv_k^T + c_{ij}\sum_{l,k=m+1}^n\delta_{l,k}\,u_lv_k^T + \mathrm{Term1} + \mathrm{Term2}\\
&= c_{ij}\sum_{k=m+1}^n u_ku_k^T + \sum_{l,k=1}^m c_{ij}^{lk}u_lv_k^T + \mathrm{Term1} + \mathrm{Term2},
\end{aligned}
\]
where $c_{ij}$ and $c_{ij}^{lk}$ are defined by (33) and (34). Here,
\[ \mathrm{Term1} = \sum_{l=1}^m\sum_{k=m+1}^n E[\varphi'(w_ix'_i+b_i)\varphi'(w_jx'_j+b_j)x_lx'_k]\,u_lv_k^T \]
and
\[ \mathrm{Term2} = \sum_{l=m+1}^n\sum_{k=1}^m E[\varphi'(w_ix'_i+b_i)\varphi'(w_jx'_j+b_j)x_lx'_k]\,u_lv_k^T \]
are zero for three reasons: 1) $x_l = \sum_{s=1}^m r^{ls}x'_s$ for $1\le l\le m$; 2) $\{x'_1,\cdots,x'_m\}$ and $\{x_{m+1},\cdots,x_n\}$ are independent; and 3) $E[x_k] = 0$. Finally, multiplying by $a_ia_j$ and using $\Omega_0 = \sum_{k=m+1}^n u_ku_k^T = I - \sum_{k=1}^m u_kv_k^T$, we obtain (31).
Q.E.D.

Appendix 3 Proof of Lemma 3:

Again, we apply the property that $\{x'_1,\cdots,x'_m\}$ and $\{x_{m+1},\cdots,x_n\}$ are independent to simplify $A_{i,m+1}$ for $1\le i\le m$:
\[
\begin{aligned}
A_{i,m+1} &= E\Big[\varphi'(w_ix'_i+b_i)\Big(\sum_{k=1}^m x'_kv_k + \sum_{k=m+1}^n x_kv_k\Big)\big(\varphi(w_1x'_1+b_1),\cdots,\varphi(w_mx'_m+b_m)\big)\Big]\\
&= E\Big[\varphi'(w_ix'_i+b_i)\sum_{k=1}^m x'_kv_k\big(\varphi(w_1x'_1+b_1),\cdots,\varphi(w_mx'_m+b_m)\big)\Big]\\
&\quad + E\Big[\varphi'(w_ix'_i+b_i)\sum_{k=m+1}^n x_kv_k\big(\varphi(w_1x'_1+b_1),\cdots,\varphi(w_mx'_m+b_m)\big)\Big].
\end{aligned}
\tag{62}
\]
The first term on the right hand side (RHS) of the above Equation (62) becomes
\[ \Big(\sum_{k=1}^m c_{i1}^kv_k,\cdots,\sum_{k=1}^m c_{im}^kv_k\Big), \]
where the $c_{ij}^k$ are defined by (40). Since $E[x_k] = 0$, the second term on the RHS of (62) is zero. Denote
\[ \bar c_{ij}^k = a_jE[\varphi'(w_ix'_i+b_i)\varphi'(w_jx'_j+b_j)\,x'_k]. \]
Similar to the derivation of $A_{i,m+1}$, we obtain the expression (41) for $A_{i,m+2}$. It is easy to obtain $A_{m+1,m+1} = (b_{ij})_{m\times m}$, where the $b_{ij} = E[\varphi(w_ix'_i+b_i)\varphi(w_jx'_j+b_j)]$ are functions of $b_i$, $b_j$, $w_i$, $w_j$ and $r_{ij}$, since $E[x'_ix'_j] = r_{ij}$ for $1\le i,j\le m$. Similarly, we obtain (43) and (44).
Q.E.D.

Appendix 4 Proof of Lemma 4:

For $\mathcal{A} = [a_0, (a_{ij})]\in\tilde M$, define
\[ \psi(\mathcal{A}) = a_0\Omega_0 + \sum_{i,j=1}^m a_{ij}u_iv_j^T. \]
To show that $\psi: \tilde M\to M$ is one-to-one, we assume $\mathcal{B} = [b_0, (b_{ij})]\in\tilde M$ and $\psi(\mathcal{A}) = \psi(\mathcal{B})$, i.e.,
\[ a_0\Omega_0 + \sum_{i,j=1}^m a_{ij}u_iv_j^T = b_0\Omega_0 + \sum_{i,j=1}^m b_{ij}u_iv_j^T. \]
Since $\Omega_0u_iv_j^T = 0$ for $1\le i,j\le m$, from the above equation we first obtain $a_0 = b_0$ and then $a_{ij} = b_{ij}$ after applying the property (29). Therefore, $\psi$ is a one-to-one mapping from $\tilde M$ to $M$. Let $C = \psi(\mathcal{A})$ and $C' = \psi(\mathcal{B})$; then
\[
\begin{aligned}
CC' &= a_0b_0\Omega_0 + \sum_{i,j,l,k=1}^m a_{ij}b_{lk}\,u_iv_j^Tu_lv_k^T
= a_0b_0\Omega_0 + \sum_{i,j,k=1}^m a_{ij}b_{jk}\,u_iv_k^T\\
&= a_0b_0\Omega_0 + \sum_{i,k=1}^m d_{ik}\,u_iv_k^T,
\end{aligned}
\]
where $(d_{ij}) = (a_{ij})(b_{ij})$. So $CC' = \psi([a_0b_0, (a_{ij})(b_{ij})])$, i.e.,
\[ \psi(\mathcal{A})\psi(\mathcal{B}) = \psi(\mathcal{A}*\mathcal{B}). \tag{63} \]
Hence, $\psi$ is an isomorphic mapping between $\tilde M$ and $M$, and $M$ is a sub-group of $Gl(n,\mathbb{R})$. For any element in $M$ of the form $C = a_0\Omega_0 + \sum_{i,j=1}^m a_{ij}u_iv_j^T = \psi([a_0,(a_{ij})])$, to obtain the inverse of $C$ we first compute the inverse in $\tilde M$,
\[ \mathcal{A}^{-1} = \Big[\frac{1}{a_0},\ (a_{ij})^{-1}\Big], \]
then the inverse in $M$,
\[ C^{-1} = \psi(\mathcal{A}^{-1}) = \frac{1}{a_0}\Omega_0 + \sum_{i,j=1}^m a^{ij}u_iv_j^T, \]
by the relation $(\psi(\mathcal{A}))^{-1} = \psi(\mathcal{A}^{-1})$, where $(a^{ij}) = (a_{ij})^{-1}$.
Q.E.D.

Appendix 5 Proof of Lemma 5:

It follows from Lemma 4 that $M\subset\bar M\cap Gl(n,\mathbb{R})$. Let $A\in\bar M$ and $\det(A)\ne 0$. From $A = a_0\Omega_0 + \sum_{i,j=1}^m a_{ij}u_iv_j^T$, we have
\[ Au_k = a_0u_k \quad\text{for } k = m+1,\cdots,n, \]
and
\[ Au_k = [u_1,\cdots,u_m][a_{1k},\cdots,a_{mk}]^T \quad\text{for } k = 1,\cdots,m. \]
So $A[u_{m+1},\cdots,u_n] = a_0[u_{m+1},\cdots,u_n]$ and $A[u_1,\cdots,u_m] = [u_1,\cdots,u_m]B$ where $B = (a_{ij})$. Hence,
\[ AU = A[u_1,\cdots,u_m,u_{m+1},\cdots,u_n] = U\begin{bmatrix}B & 0\\ 0 & a_0I\end{bmatrix}. \]
Therefore, $\det(B)\,a_0^{n-m}\det(U) = \det(AU)\ne 0$, which implies $\det(B)\ne 0$ and $a_0\ne 0$. So $A\in M$.
Q.E.D.


References

1. Amari, S. (1997), "Neural learning in structured parameter spaces - natural Riemannian gradient," in Mozer, M.C., Jordan, M.I., and Petsche, T. (Eds.), Advances in Neural Information Processing Systems 9, MIT Press, Cambridge, MA, pp. 127-133.

2. Amari, S. (1998), "Natural gradient works efficiently in learning," Neural Computation, vol. 10, pp. 251-276.

3. Amari, S., Park, H., and Fukumizu, K. (1998), "Adaptive method of realizing natural gradient learning," Technical Report 1953, RIKEN Brain Science Institute.

4. Cardoso, J.-F. and Laheld, B. (1996), "Equivariant adaptive source separation," IEEE Trans. on Signal Processing, vol. 44, no. 12, pp. 3017-3030, December.

5. Darken, C. and Moody, J. (1992), "Towards faster stochastic gradient search," in Moody, Hanson, and Lippmann (Eds.), Advances in Neural Information Processing Systems 4, Morgan Kaufmann, San Mateo, pp. 1009-1016.

6. Saad, D. and Solla, S.A. (1995), "On-line learning in soft committee machines," Physical Review E, vol. 52, pp. 4225-4243.

7. Stewart, G.W. (1973), Introduction to Matrix Computations, Academic Press, New York.

8. Stewart, W.J. (1994), Introduction to the Numerical Solution of Markov Chains, Princeton University Press.

9. Stuart, A. and Ord, J.K. (1994), Kendall's Advanced Theory of Statistics, Edward Arnold.

10. Yang, H.H. and Amari, S. (1997), "Training multi-layer perceptrons by natural gradient descent," ICONIP'97 Proceedings, New Zealand.

11. Yang, H.H. and Amari, S. (1998), "Complexity issues in natural gradient descent method for training multi-layer perceptrons," Neural Computation, vol. 10, pp. 2137-2157.