


Learning Dynamics of Linear Denoising Autoencoders

Arnu Pretorius 1 2 Steve Kroon 1 2 Herman Kamper 3

Abstract

Denoising autoencoders (DAEs) have proven useful for unsupervised representation learning, but a thorough theoretical understanding is still lacking of how the input noise influences learning. Here we develop theory for how noise influences learning in DAEs. By focusing on linear DAEs, we are able to derive analytic expressions that exactly describe their learning dynamics. We verify our theoretical predictions with simulations as well as experiments on MNIST and CIFAR-10. The theory illustrates how, when tuned correctly, noise allows DAEs to ignore low variance directions in the inputs while learning to reconstruct them. Furthermore, in a comparison of the learning dynamics of DAEs to standard regularised autoencoders, we show that noise has a similar regularisation effect to weight decay, but with faster training dynamics. We also show that our theoretical predictions approximate learning dynamics on real-world data and qualitatively match observed dynamics in nonlinear DAEs.*

1. Introduction

The goal of unsupervised learning is to uncover hidden structure in unlabelled data, often in the form of latent feature representations. One popular type of model, an autoencoder, does this by trying to reconstruct its input (Bengio et al., 2007). Autoencoders have been used in various forms to address problems in machine translation (Chandar et al., 2014; Tu et al., 2017), speech processing (Elman & Zipser, 1987; Zeiler et al., 2013), and computer vision (Rifai et al., 2011; Larsson, 2017), to name just a few areas. Denoising autoencoders (DAEs) are an extension of autoencoders which learn latent features by reconstructing data from corrupted versions of the inputs (Vincent et al., 2008). Although this corruption step typically leads to improved performance over standard autoencoders, a theoretical understanding of its effects remains incomplete. In this paper, we provide new insights into the inner workings of DAEs by analysing the learning dynamics of linear DAEs.

1Computer Science Division, Stellenbosch University, South Africa, 2CSIR/SU Centre for Artificial Intelligence Research, 3Department of Electrical and Electronic Engineering, Stellenbosch University, South Africa. Correspondence to: Steve Kroon <[email protected]>.

Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, PMLR 80, 2018. Copyright 2018 by the author(s).

*Code to reproduce all the results in this paper is available at: https://github.com/arnupretorius/lindaedynamics_icml2018

We specifically build on the work of Saxe et al. (2013a;b), who studied the learning dynamics of deep linear networks in a supervised regression setting. By analysing the gradient descent weight update steps as time-dependent differential equations (in the limit as the learning rate approaches a small value), Saxe et al. (2013a) were able to derive exact solutions for the learning trajectory of these networks as a function of training time. Here we extend their approach to linear DAEs. To do this, we use the expected reconstruction loss over the noise distribution as an objective (requiring a different decomposition of the input covariance) as a tractable way to incorporate noise into our analytic solutions. This approach yields exact equations which can predict the learning trajectory of a linear DAE.

Our work here shares the motivation of many recent studies (Advani & Saxe, 2017; Pennington & Worah, 2017; Pennington & Bahri, 2017; Nguyen & Hein, 2017; Dinh et al., 2017; Louart et al., 2017; Swirszcz et al., 2017; Lin et al., 2017; Neyshabur et al., 2017; Soudry & Hoffer, 2017; Pennington et al., 2017) working towards a better theoretical understanding of neural networks and their behaviour. Although we focus here on a theory for linear networks, such networks have learning dynamics that are in fact nonlinear. Furthermore, analyses of linear networks have also proven useful in understanding the behaviour of nonlinear neural networks (Saxe et al., 2013a; Advani & Saxe, 2017).

First we introduce linear DAEs (§2). We then derive analytic expressions for their nonlinear learning dynamics (§3), and verify our solutions in simulations (§4) which show how noise can influence the shape of the loss surface and change the rate of convergence for gradient descent optimisation. We also find that an appropriate amount of noise can help DAEs ignore low variance directions in the input while learning the reconstruction mapping. In the remainder of the paper, we compare DAEs to standard regularised autoencoders and show that our theoretical predictions match both simulations (§5) and experimental results on MNIST and CIFAR-10 (§6). We specifically find that while the noise in a DAE has an equivalent effect to standard weight decay, the DAE exhibits faster learning dynamics. We also show that our observations hold qualitatively for nonlinear DAEs.

2. Linear Denoising Autoencoders

We first give the background of linear DAEs. Given training data consisting of pairs {(x̃i, xi), i = 1, ..., N}, where x̃ represents a corrupted version of the training data x ∈ R^D, the reconstruction loss for a single hidden layer DAE with activation function φ is given by

L = (1/(2N)) Σ_{i=1}^N ||xi − W2 φ(W1 x̃i)||².

Here, W1 ∈ R^{H×D} and W2 ∈ R^{D×H} are the weights of the network with hidden dimensionality H. The learned feature representations correspond to the latent variable z = φ(W1 x̃).

To corrupt an input x, we sample a noise vector ε, where each component is drawn i.i.d. from a pre-specified noise distribution with mean zero and variance s². We define the corrupted version of the input as x̃ = x + ε. This ensures that the expectation over the noise remains unbiased, i.e. Eε(x̃) = x.

Restricting our scope to linear neural networks, with φ(a) = a, the loss in expectation over the noise distribution is

Eε[L] = (1/(2N)) Σ_{i=1}^N ||xi − W2W1 xi||² + (s²/2) tr(W2 W1 W1^T W2^T).   (1)

See the supplementary material for the full derivation.
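The expectation in (1) can also be sanity-checked numerically: averaging the ordinary reconstruction loss over fresh noise draws should reproduce the closed form. A minimal sketch with Gaussian corruption (all dimensions, scales and sample counts below are arbitrary choices, not the paper's settings); note that tr(W2W1W1^T W2^T) = tr(WW^T) for the composite map W = W2W1:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, H, s = 200, 5, 3, 0.3

X = rng.standard_normal((N, D))            # rows are the inputs x_i
W1 = 0.5 * rng.standard_normal((H, D))
W2 = 0.5 * rng.standard_normal((D, H))
W = W2 @ W1                                # composite linear map

# Closed form, Eq. (1): reconstruction term plus noise penalty
closed = (0.5 / N) * np.sum((X - X @ W.T) ** 2) \
         + 0.5 * s**2 * np.trace(W @ W.T)

# Monte Carlo: corrupt the inputs afresh each draw and average the plain loss
draws = [
    (0.5 / N) * np.sum((X - (X + s * rng.standard_normal((N, D))) @ W.T) ** 2)
    for _ in range(2000)
]
mc = float(np.mean(draws))
print(closed, mc)   # the two values agree up to Monte Carlo error
```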

3. Learning Dynamics of Linear DAEs

Here we derive the learning dynamics of linear DAEs, beginning with a brief outline to build some intuition.

The weight update equations for a linear DAE can be formulated as time-dependent differential equations in the limit as the gradient descent learning rate becomes small (Saxe et al., 2013a). The task of an ordinary (undercomplete) linear autoencoder is to learn the identity mapping that reconstructs the original input data. The matrix corresponding to this learned map will essentially be an approximation of the full identity matrix that is of rank equal to the input dimension. It turns out that tracking the temporal updates of this mapping represents a difficult problem that involves dealing with coupled differential equations, since both the on-diagonal and off-diagonal elements of the weight matrices need to be considered in the approximation dynamics at each time step.

To circumvent this issue and make the analysis tractable, we follow the methodology introduced in Saxe et al. (2013a), which is to: (1) decompose the input covariance matrix using an eigenvalue decomposition; (2) rotate the weight matrices to align with these computed directions of variation; and (3) use an orthogonal initialisation strategy to diagonalise the composite weight matrix W = W2W1. The important difference in our setting is that additional constraints are brought about through the injection of noise.

The remainder of this section outlines this derivation for the exact solutions to the learning dynamics of linear DAEs.

3.1. Gradient descent update

Consider a continuous time limit approach to studying the learning dynamics of linear DAEs. This is achieved by choosing a sufficiently small learning rate α for optimising the loss in (1) using gradient descent. The update for W1 in a single gradient descent step then takes the form of a time-dependent differential equation

τ (d/dt) W1 = Σ_{i=1}^N W2^T (xi xi^T − W2 W1 xi xi^T) − ε W2^T W2 W1
            = W2^T (Σxx − W2 W1 Σxx) − ε W2^T W2 W1.

Here t is the time measured in epochs, τ = N/α, ε = Ns², and Σxx = Σ_{i=1}^N xi xi^T represents the input covariance matrix.

Let the eigenvalue decomposition of the input covariance be Σxx = V Λ V^T, where V is an orthogonal matrix, and denote the eigenvalues λj = [Λ]jj, with λ1 ≥ λ2 ≥ · · · ≥ λD. The update can then be rewritten as

τ (d/dt) W1 = W2^T V (Λ − V^T W2 W1 V Λ) V^T − ε W2^T W2 W1.

The weight matrices can be rotated to align with the directions of variation in the input by performing the rotations W̄1 = W1 V and W̄2 = V^T W2. Following a similar derivation for W2, the weight updates become

τ (d/dt) W̄1 = W̄2^T (Λ − W̄2 W̄1 Λ) − ε W̄2^T W̄2 W̄1

τ (d/dt) W̄2 = (Λ − W̄2 W̄1 Λ) W̄1^T − ε W̄2 W̄1 W̄1^T.

3.2. Orthogonal initialisation and scalar dynamics

To decouple the dynamics, we can set W2 = V D2 R^T and W1 = R D1 V^T, where R is an arbitrary orthogonal matrix and D2 and D1 are diagonal matrices. This causes the product of the realigned weight matrices

W̄2 W̄1 = V^T V D2 R^T R D1 V^T V = D2 D1

to become diagonal. The updates now reduce to the following scalar dynamics that apply independently to each pair of diagonal elements w1j and w2j of D1 and D2 respectively:

τ (d/dt) w1j = w2j λj (1 − w2j w1j) − ε w2j² w1j   (2)

τ (d/dt) w2j = w1j λj (1 − w2j w1j) − ε w2j w1j².   (3)
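The decoupling can be verified directly: with W2 = V D2 R^T and W1 = R D1 V^T, the composite map is exactly diagonal in the eigenbasis V. A small sketch, using arbitrary orthogonal matrices obtained from QR factorisations:

```python
import numpy as np

rng = np.random.default_rng(0)
D = H = 4

V, _ = np.linalg.qr(rng.standard_normal((D, D)))   # eigenbasis of some covariance
R, _ = np.linalg.qr(rng.standard_normal((H, H)))   # arbitrary orthogonal matrix
D1 = np.diag(rng.uniform(0.1, 1.0, H))
D2 = np.diag(rng.uniform(0.1, 1.0, H))

W1 = R @ D1 @ V.T                                  # orthogonal initialisation
W2 = V @ D2 @ R.T
Wbar = V.T @ (W2 @ W1) @ V                         # rotated composite map

off_diag = np.max(np.abs(Wbar - np.diag(np.diag(Wbar))))
print(off_diag)   # ~0: the dynamics decouple into D independent scalar modes
```

Since W̄2W̄1 = D2D1, each diagonal entry evolves on its own according to (2) and (3).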

Note that the same dynamics stem from gradient descent on the loss given by

ℓ = Σ_{j=1}^D (λj/(2τ)) (1 − w2j w1j)² + Σ_{j=1}^D (ε/(2τ)) (w2j w1j)².   (4)

By examining (4), it is evident that the degree to which the first term can be reduced depends on the magnitude of the associated eigenvalue λj. For directions in the input covariance Σxx with relatively little variation, however, the decrease in the loss from learning the identity map will be negligible, and learning them is likely to result in overfitting (since little to no signal is being captured by these eigenvalues). The second term in (4) is the result of the input corruption and acts as a suppressant on the magnitude of the weights in the learned mapping. Our interest is to better understand the interplay between these two terms during learning by studying their scalar learning dynamics.
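This interplay can be simulated with discrete gradient descent on a single mode, using the scalar gradients implied by (2) and (3). A sketch (λ, ε, the step size and initial weights are arbitrary choices); the high-variance mode retains most of its mapping while the low-variance mode is suppressed towards λ/(λ + ε), anticipating the fixed point analysis of §4:

```python
def train_mode(lam, eps, steps=20_000, lr=1e-2, w1=0.05, w2=0.08):
    """Gradient descent on one scalar mode of the loss (4)."""
    for _ in range(steps):
        r = 1.0 - w2 * w1                         # reconstruction residual
        g1 = -lam * w2 * r + eps * w2**2 * w1     # gradient for w1, cf. Eq. (2)
        g2 = -lam * w1 * r + eps * w2 * w1**2     # gradient for w2, cf. Eq. (3)
        w1, w2 = w1 - lr * g1, w2 - lr * g2
    return w2 * w1

# High-variance direction: the mapping is largely preserved
high = train_mode(lam=1.0, eps=0.1)
# Low-variance direction: the same noise level suppresses the mapping
low = train_mode(lam=0.05, eps=0.1)
print(high, low)   # approaches lam / (lam + eps) in each case
```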

3.3. Exact solutions to the dynamics of learning

As noted above, the dynamics of learning are dictated by the value of w = w2w1 over time. An expression can be derived for w(t) by using a hyperbolic change of coordinates in (2) and (3), letting θ parameterise points along a dynamics trajectory represented by the conserved quantity w2² − w1² = ±c0. This relies on the fact that ℓ is invariant under a scaling of the weights such that w = (w1/c)(c w2) = w2w1 for any constant c (Saxe et al., 2013a). Starting at any initial point (w1, w2) the dynamics are

w(t) = (c0/2) sinh(θt),   (5)

with

θt = 2 tanh⁻¹ [ ((1 − E)(ζ² − β² − 2βδ) − 2(1 + E)ζδ) / ((1 − E)(2β + 4δ) − 2(1 + E)ζ) ]

where β = c0(1 + ε/λ), ζ = √(β² + 4), δ = tanh(θ0/2) and E = e^{ζλt/τ}. Here θ0 depends on the initial weights w1 and w2 through the relationship θ0 = sinh⁻¹(2w/c0). The derivation for θt involves rewriting τ (d/dt)w in terms of θ, integrating over the interval θ0 to θt, and finally rearranging terms to get an expression for θ(t) ≡ θt (see the supplementary material for full details). To derive the learning dynamics for different noise distributions, the corresponding ε must be computed and used to determine β and ζ. For example, sampling noise from a Gaussian distribution such that ε ∼ N(0, σ²I) gives ε = Nσ². Alternatively, if ε is distributed according to a zero-mean Laplace distribution with scale parameter b, then ε = 2Nb².
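The conserved quantity underpinning this parameterisation can be checked numerically: under the discrete updates (2)–(3), w2² − w1² should be constant up to the O(α²) discretisation error of gradient descent. A quick sketch with arbitrary constants:

```python
lam, eps, lr = 1.0, 0.5, 1e-4
w1, w2 = 0.3, 0.7
c0 = w2**2 - w1**2          # conserved along the continuous-time trajectory

for _ in range(50_000):
    r = 1.0 - w2 * w1
    g1 = -lam * w2 * r + eps * w2**2 * w1   # cf. Eq. (2)
    g2 = -lam * w1 * r + eps * w2 * w1**2   # cf. Eq. (3)
    w1, w2 = w1 - lr * g1, w2 - lr * g2

drift = abs((w2**2 - w1**2) - c0)
print(drift)   # small: the trajectory stays near the hyperbola w2^2 - w1^2 = c0
```

Expanding d/dt(w2² − w1²) = 2w2ẇ2 − 2w1ẇ1 using (2) and (3) shows the terms cancel exactly, so only the finite step size causes any drift.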

4. The Effects of Noise: a Simulation Study

Since the learning dynamics of a linear DAE in (5) evolve independently for each direction of variation in the input, it is enough to study the effect that noise has on learning for a single eigenvalue λ. To do this we trained a scalar linear DAE to minimise the loss ℓλ = (λ/2)(1 − w2w1)² + (ε/2)(w2w1)² with λ = 1 using gradient descent. Starting from several different randomly initialised weights w1 and w2, we compare the simulated dynamics with those predicted by equation (5). The top row in Figure 1 shows the exact fit between the predictions and numerical simulations for different noise levels, ε = 0, 1, 5.

The trajectories in the top row of Figure 1 converge to the optimal solution at different rates depending on the amount of injected noise. Specifically, adding more noise results in faster convergence. However, the trade-off in (4) ensures that the fixed point solution also diminishes in magnitude.

To gain further insight, we also visualise the associated loss surfaces for each experiment in the bottom row of Figure 1. Note that even though the scalar product w2w1 defines a linear mapping, the minimisation of ℓλ with respect to w1 and w2 is a non-convex optimisation problem. The loss surfaces in Figure 1 each have an unstable saddle point at w2 = w1 = 0 (red star), with all remaining fixed points lying on a minimum loss manifold (cyan curve). This manifold corresponds to the different possible combinations of w2 and w1 that minimise ℓλ. The paths that gradient descent follows from various initial starting weights down to points situated on the manifold are represented by dashed orange lines.

For a fixed value of λ, adding noise warps the loss surface, producing steeper slopes and pulling the minimum loss manifold in towards the saddle point. Steeper descent directions therefore cause learning to converge at a faster rate to fixed points that are smaller in magnitude. This is the result of a more sharply curving loss surface and the minimum loss manifold lying closer to the origin.

We can compute the fixed point solution for any pair of initial starting weights (not on the saddle point) by taking the derivative

dℓλ/dw = −(λ/τ)(1 − w) + (ε/τ)w,

and setting it equal to zero to find w∗ = λ/(λ + ε). This solution reveals the interaction between the input variance associated with λ and the noise ε. For large eigenvalues for which λ ≫ ε, the fixed point will remain relatively unaffected by adding noise, i.e., w∗ ≈ 1. In contrast, if λ ≪ ε, the noise will result in w∗ ≈ 0. This means that over a distribution of eigenvalues, an appropriate amount of noise can help a DAE to ignore low variance directions in the input data while learning the reconstruction. In a practical setting, this motivates the tuning of noise levels on a development set to prevent overfitting.

Figure 1. Learning dynamics, loss surface and gradient descent paths for linear denoising autoencoders. Top: Learning dynamics for each simulated run (dashed orange lines) together with the theoretically predicted learning dynamics (solid green lines). The red line in each plot indicates the final value of the resulting fixed point solution w∗. Bottom: The loss surface corresponding to the loss ℓλ = (λ/2)(1 − w2w1)² + (ε/2)(w2w1)² for λ = 1, as well as the gradient descent paths (dashed orange lines) for randomly initialised weights. The cyan hyperbolas represent the global minimum loss manifold that corresponds to all possible combinations of w2 and w1 that minimise ℓλ. Left: ε = 0, w∗ = 1. Middle: ε = 1, w∗ = 0.5. Right: ε = 5, w∗ = 1/6.

5. The Relationship Between Noise and Weight Decay

It is well known that adding noise to the inputs of a neural network is equivalent to a form of regularisation (Bishop, 1995). Therefore, to further understand the role of noise in linear DAEs we compare the dynamics of noise to those of explicit regularisation in the form of weight decay (Krogh & Hertz, 1992). The reconstruction loss for a linear weight decayed autoencoder (WDAE) is given by

(1/(2N)) Σ_{i=1}^N ||xi − W2W1 xi||² + (γ/2)(||W1||² + ||W2||²)   (6)

where γ is the penalty parameter that controls the amount of regularisation applied during learning. Provided that the weights of the network are initialised to be small, it is also possible (see supplementary material) to derive scalar dynamics of learning from (6) as

wγ(t) = ξ Eγ / (Eγ − 1 + ξ/w0),   (7)

where ξ = (1 − Nγ/λ) and Eγ = e^{2ξt/τ}.

Figure 2 compares the learning trajectories of linear DAEs and WDAEs over time (as measured in training epochs) for λ = 2.5, 1, 0.5 and 0.1. The dynamics for both noise and weight decay exhibit a sigmoidal shape with an initial period of inactivity followed by rapid learning, finally reaching a plateau at the fixed point solution. Figure 2 illustrates that the learning time associated with an eigenvalue is negatively correlated with its magnitude. Thus, the eigenvalue corresponding to the largest amount of variation explained is the quickest to escape inactivity during learning.

Figure 2. Theoretically predicted learning dynamics for noise compared to weight decay for linear autoencoders. Top: Noise dynamics (green), darker line colours correspond to larger amounts of added noise. Bottom: Weight decay dynamics (orange), darker line colours correspond to larger amounts of regularisation. Left to right: Eigenvalues λ = 2.5, 1 and 0.5, associated with high to low variance.

Figure 3. Learning dynamics for optimal discrete time learning rates (λ = 1). Left: Dynamics of DAEs (green) vs. WDAEs (orange), where darker line colours correspond to larger amounts of noise or weight decay. Middle: Optimal learning rate as a function of noise ε for DAEs, and for WDAEs using an equivalent amount of regularisation γ = λε/(λ + ε). Right: Difference in mapping over time.

The colour intensity of the lines in Figure 2 corresponds to the amount of noise or regularisation applied in each run, with darker lines indicating larger amounts. In the continuous time limit with equal learning rates, when compared with noise dynamics, weight decay experiences a delay in learning such that the initial inactive period becomes extended for every eigenvalue, whereas adding noise has no effect on learning time. In other words, starting from small weights, noise injected learning is capable of providing an equivalent regularisation mechanism to that of weight decay in terms of a constrained fixed point mapping, but with zero time delay.

However, this analysis does not take into account the practice of using well-tuned stable learning rates for discrete optimisation steps. We therefore consider the impact on training time when using optimised learning rates for each approach. By using second order information from the Hessian as in Saxe et al. (2013a) (here, of the expected reconstruction loss with respect to the scalar weights), we relate the optimal learning rates for linear DAEs and WDAEs, where each optimal rate is inversely related to the amount of noise/regularisation applied during training (see supplementary material). The ratio of the optimal DAE rate to that for the WDAE is

R = (2λ + γ) / (2λ + 3ε).   (8)

Note that the ratio in (8) will essentially be equal to one for eigenvalues that are significantly larger than both ε and γ, with deviations from unity only manifesting for smaller values of λ.

Furthermore, weight decay and noise injected learning result in equivalent scalar solutions when their parameters are related by γ = λε/(λ + ε) (see supplementary material). This leads to the following two observations. First, it shows that adding noise during learning can be interpreted as a form of weight decay where the penalty parameter γ adapts to each direction of variation in the data. In other words, noise essentially makes use of the statistical structure of the input data to influence the amount of shrinkage that is being applied in various directions during learning. Second, together with (8), we can theoretically compare the learning dynamics of DAEs and WDAEs when both equivalent regularisation and the relative differences in optimal learning rates are taken into account.

Figure 4. The effect of noise versus weight decay on the norm of the weights during learning. Left: Two-dimensional loss surface ℓλ = (λ/2)(1 − w2w1)² + (ε/2)(w2w1)² + (γ/2)(w2² + w1²). Gradient descent paths (orange/magenta dashed lines), minimum loss manifold (cyan curves), saddle point (red star). Middle: Simulated learning dynamics. Right: Norm of the weights over time for each simulated run. Top: Noise with λ = 1, ε = 0.1 and γ = 0. Bottom: Weight decay with λ = 1, ε = 0 and γ = λ(0.1)/(λ + 0.1) ≈ 0.091. The magenta line in each plot corresponds to a simulated run with small initialised weights.
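The correspondence γ = λε/(λ + ε) can be checked in the scalar setting of §4: gradient descent with input noise ε and gradient descent with the matched weight decay settle at the same mapping w = w2w1. A sketch using λ = 1 and ε = 0.1, the values used in Figure 4 (the step count, learning rate and initial weights are arbitrary):

```python
def train(lam, eps=0.0, gamma=0.0, steps=60_000, lr=5e-3, w1=0.3, w2=0.4):
    """Scalar GD on lam/2 (1 - w2 w1)^2 + eps/2 (w2 w1)^2 + gamma/2 (w1^2 + w2^2)."""
    for _ in range(steps):
        r = 1.0 - w2 * w1
        g1 = -lam * w2 * r + eps * w2**2 * w1 + gamma * w1
        g2 = -lam * w1 * r + eps * w2 * w1**2 + gamma * w2
        w1, w2 = w1 - lr * g1, w2 - lr * g2
    return w2 * w1

lam, eps = 1.0, 0.1
w_noise = train(lam, eps=eps)                         # noise-injected learning
w_decay = train(lam, gamma=lam * eps / (lam + eps))   # matched weight decay
print(w_noise, w_decay)   # both settle at lam / (lam + eps) ~= 0.909
```

Setting the stationarity conditions of either loss to zero gives w = λ/(λ + ε) in both cases, which is what the simulation recovers.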

The effects of optimal learning rates (for λ = 1) are shown in Figure 3. DAEs still exhibit faster dynamics (left panel), even when taking into account the difference in the learning rate as a function of noise, or equivalent weight decay (middle panel). In addition, for equivalent regularisation effects, the ratio of the optimal rates R can be shown to be a monotonically decreasing function of the noise level, where the rate of decay depends on the size of λ. This means that for any amount of added noise, the DAE will require a slower learning rate than that of the WDAE. Even so, a faster rate for the WDAE does not seem to compensate for its slower dynamics, and the difference in learning time is also shown to grow as more noise (regularisation) is applied during training (right panel).

5.1. Exploiting invariance in the loss function

A primary motivation for weight decay as a regulariser is that it provides solutions with smaller weight norms, producing smoother models that have better generalisation performance. Figure 4 shows the effect of noise (top row) compared to weight decay (bottom row) on the norm of the weights during learning. Looking at the loss surface for weight decay (bottom left panel), the penalty on the size of the weights acts by shrinking the minimum loss manifold down from a long curving valley to a single point (associated with a small norm solution). Interestingly, this results in gradient descent following a trajectory towards an “invisible” minimum loss manifold similar to the one associated with noise. However, once on this manifold, weight decay begins to exploit invariances in the loss function to changes in the weights, so as to move along the manifold down towards smaller norm solutions. This means that even when the two approaches learn the exact same mapping over time (as shown by the learning dynamics in the middle column of Figure 4), additional epochs will cause weight decay to further reduce the size of the weights (bottom right panel). This happens in a stage-like manner where the optimisation first focuses on reducing the reconstruction loss by learning the optimal mapping and then reduces the regularisation loss through invariance.

5.2. Small weight initialisation and early stopping

It is common practice to initialise the weights of a network with small values. In fact, this strategy has recently been theoretically shown to help, along with early stopping, to ensure good generalisation performance for neural networks in certain high-dimensional settings (Advani & Saxe, 2017). In our analysis, however, what we find interesting about small weight initialisation is that it removes some of the differences in the learning behaviour of DAEs compared to regularised autoencoders that use weight decay.

To see this, the magenta lines in Figure 4 show the learning dynamics for the two approaches where the weights of both networks were initialised to small random starting values. The learning dynamics are almost identical in terms of their temporal trajectories and have equal fixed points. However, what is interesting is the implicit regularisation that is brought about through the small initialisation. By starting small and making incremental updates to the weights, the scalar solutions in both cases end up being equal to the minimum norm solution. In other words, the path that gradient descent takes from the initialisation to the minimum loss manifold reaches the manifold where the norm of the weights happens to also be small. This means that the second phase of weight decay (where the invariance of the loss function would be exploited to reduce the regularisation penalty) is not only no longer necessary, but also does not result in a norm that is appreciably smaller than that obtained by learning with added noise. Therefore, in this case, learning with explicit regularisation provides no additional benefit over learning with noise in terms of reducing the norm of the weights during training.

When initialising small, early stopping can also serve as a form of implicit regularisation by ensuring that the weights do not change past the point where the validation loss starts to increase (Bengio et al., 2007). In the context of learning dynamics, early stopping for DAEs can be viewed as a method that effectively selects only the directions of variation deemed useful for generalisation during reconstruction, considering the remaining eigenvalues to carry no additional signal.

6. Experimental Results

To verify the dynamics of learning on real-world data sets we compared theoretical predictions with actual learning on MNIST and CIFAR-10. In our experiments we considered the following linear autoencoder networks: a regular AE, a WDAE and a DAE.

For MNIST, we trained each autoencoder with small randomly initialised weights, using N = 50000 training samples for 5000 epochs, with a learning rate α = 0.01 and a hidden layer width of H = 256. For the WDAE, the penalty parameter was set at γ = 0.5 and for the DAE, σ² = 0.5. The results are shown in Figure 5 (left column).

The theoretical predictions (solid lines) in Figure 5 show good agreement with the actual learning dynamics (points). As predicted, both regularisation (orange) and noise (green) suppress the fixed point value associated with the different eigenvalues, and whereas regularisation delays learning (fewer fixed points are reached by the WDAE during training when compared to the DAE), the use of noise has no effect on training time.
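This qualitative behaviour, per-mode fixed points suppressed towards λj/(λj + s²) when the covariance is normalised by N, can be reproduced end-to-end on synthetic data. A sketch (not the paper's MNIST setup; the spectrum, noise level, learning rate and step count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
D = H = 4
eigs = np.array([3.0, 2.0, 1.0, 0.25])   # spectrum of the (normalised) covariance
s2 = 0.5                                  # corruption variance

Q, _ = np.linalg.qr(rng.standard_normal((D, D)))
C = Q @ np.diag(eigs) @ Q.T               # synthetic input covariance

W1 = 0.3 * rng.standard_normal((H, D))
W2 = 0.3 * rng.standard_normal((D, H))
lr = 0.05
for _ in range(5000):
    W = W2 @ W1
    G = W @ (C + s2 * np.eye(D)) - C      # gradient of the expected loss wrt W
    g1, g2 = W2.T @ G, G @ W1.T           # chain rule through the two factors
    W1 -= lr * g1
    W2 -= lr * g2

mapping = np.diag(Q.T @ (W2 @ W1) @ Q)    # learned map in the eigenbasis
print(mapping, eigs / (eigs + s2))        # each mode suppressed towards lam/(lam + s2)
```

High-variance modes stay close to one while the smallest mode is pulled towards zero, mirroring the suppression seen in Figure 5.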

Similar agreement is shown for CIFAR-10 in the right column of Figure 5. Here, we trained each network with small randomly initialised weights using N = 30000 training samples for 5000 epochs, with a learning rate α = 0.001 and a hidden dimension H = 512. For the WDAE, the penalty parameter was set at γ = 0.5 and for the DAE, σ² = 0.5.

Figure 5. Learning dynamics for MNIST and CIFAR-10. Solid lines represent theoretical dynamics and 'x' markers simulated dynamics. Shown are the mappings associated with the set of eigenvalues {λ_i, i = 1, 4, 8, 16, 32}, where the remaining eigenvalues were excluded to improve readability. Top: Noise: AE (blue) vs. DAE with σ² = 0.5 (green). Bottom: Weight decay: AE (blue) vs. WDAE with γ = 0.5 (orange). Left: MNIST. Right: CIFAR-10.

Figure 6. Learning dynamics for nonlinear networks using ReLU activation. AE (blue), WDAE (orange) and DAE (green). Shown are the mappings associated with the first four eigenvalues, i.e. {λ_i, i = 1, 2, 3, 4}. Left: MNIST. Right: CIFAR-10.

Next, we investigated whether these dynamics are also qualitatively present in nonlinear autoencoder networks. Figure 6 shows the dynamics of learning for nonlinear AEs, WDAEs and DAEs, using ReLU activations, trained on MNIST (N = 50000) and CIFAR-10 (N = 30000) with equal learning rates. For the DAE, the input was corrupted using sampled Gaussian noise with mean zero and σ² = 3. For the WDAE, the amount of weight decay was manually tuned to γ = 0.0045 to ensure that both autoencoders displayed roughly the same degree of regularisation in terms of the fixed points reached. During the course of training, the identity mapping associated with each eigenvalue was estimated (see supplementary material) at equally spaced intervals of 10 epochs.

The learning dynamics are qualitatively similar to the dynamics observed in the linear case. Both noise and weight decay result in a shrinkage of the identity mapping associated with each eigenvalue. Furthermore, in terms of the number of training epochs, the DAE is seen to learn as quickly as a regular AE, whereas the WDAE incurs a delay in learning time. Although these experimental results stem from a single training run for each autoencoder, we note that wall-clock times for training may still differ, because DAEs require some additional time for sampling noise. Similar results were observed when using a tanh nonlinearity and are provided in the supplementary material.

7. Related Work

There have been many studies aiming to provide a better theoretical understanding of DAEs. Vincent et al. (2008) analysed DAEs from several different perspectives, including manifold learning and information filtering, by establishing an equivalence between different criteria for learning and the original training criterion that seeks to minimise the reconstruction loss. Subsequently, Vincent (2011) showed that under a particular set of conditions, the training of DAEs can also be interpreted as a type of score matching. This connection provided a probabilistic basis for DAEs. Following this, a more in-depth analysis of DAEs as a possible generative model suitable for arbitrary loss functions and multiple types of data was given by Bengio et al. (2013).

In contrast to a probabilistic understanding of DAEs, we present here an analysis of the learning process. Specifically inspired by Saxe et al. (2013a), as well as by earlier work on supervised neural networks (Opper, 1988; Sanger, 1989; Baldi & Hornik, 1989; Saad & Solla, 1995), we provide a theoretical investigation of the temporal behaviour of linear DAEs using derived equations that exactly describe their dynamics of learning. Specifically for the linear case, the squared error loss for the reconstruction contractive autoencoder (RCAE) introduced in Alain & Bengio (2014) is equivalent to the expected loss (over the noise) for the DAE. Therefore, the learning dynamics described in this paper also apply to linear RCAEs.

For our analysis to be tractable, we used a marginalised reconstruction loss where the gradient descent dynamics are viewed in expectation over the noise distribution. Whereas our motivation is analytical in nature, marginalising the reconstruction loss tends to be more commonly motivated from the point of view of learning useful and robust feature representations at a significantly lower computational cost (Chen et al., 2014; 2015). This approach has also been investigated in the context of supervised learning (van der Maaten et al., 2013; Wang & Manning, 2013; Wager et al., 2013). Also related to our work is the analysis by Poole et al. (2014), who showed that training autoencoders with noise (added at different levels of the network architecture) is closely connected to training with explicit regularisation, and who proposed a marginalised noise framework for noisy autoencoders.
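For the linear case, this marginalisation has a simple closed form: for isotropic corruption ε ~ N(0, σ²I), E‖x − W(x+ε)‖² = ‖x − Wx‖² + σ²‖W‖²_F, i.e. the expected DAE loss is the plain reconstruction loss plus a Tikhonov-style penalty on W (cf. Bishop, 1995). A quick Monte Carlo check of this identity (our own illustration, not the paper's code):

```python
import numpy as np

# Verify numerically that, for eps ~ N(0, sigma2 * I),
#   E || x - W (x + eps) ||^2 = || x - W x ||^2 + sigma2 * ||W||_F^2.
rng = np.random.default_rng(1)
d, sigma2 = 8, 0.5
x = rng.standard_normal(d)
W = rng.standard_normal((d, d)) / np.sqrt(d)

# Monte Carlo estimate of the expected corrupted reconstruction loss.
mc = np.mean([np.sum((x - W @ (x + np.sqrt(sigma2) * rng.standard_normal(d)))**2)
              for _ in range(100000)])
# Closed form: plain reconstruction loss plus an explicit weight penalty.
closed_form = np.sum((x - W @ x)**2) + sigma2 * np.sum(W**2)
print(mc, closed_form)   # agree up to Monte Carlo error
```

This identity is what allows the gradient descent dynamics to be studied in expectation over the noise, rather than per corrupted sample.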

8. Conclusion and Future Work

This paper analysed the learning dynamics of linear denoising autoencoders (DAEs) with the aim of providing a better understanding of the role of noise during training. By deriving exact time-dependent equations for learning, we showed how noise influences the shape of the loss surface as well as the rate of convergence to fixed point solutions. We also compared the learning behaviour of added input noise to that of weight decay, an explicit form of regularisation. We found that while the two have similar regularisation effects, the use of noise for regularisation results in faster training. We compared our theoretical predictions with actual learning dynamics on real-world data sets, observing good agreement. In addition, we also provided evidence (on both MNIST and CIFAR-10) that our predictions hold qualitatively for nonlinear DAEs.

This work provides a solid basis for further investigation. Our analysis could be extended to nonlinear DAEs, potentially using recent work on nonlinear random matrix theory for neural networks (Pennington & Worah, 2017; Louart et al., 2017). Our findings indicate that appropriate noise levels help DAEs ignore low variance directions in the input; we also obtained new insights into the training time of DAEs. Therefore, future work might consider how these insights could be used for tuning noise levels and predicting the training time of DAEs. This would require further validation and empirical experiments, also on other data sets. Finally, our analysis only considers the training dynamics; a better understanding of generalisation, and of what influences the quality of feature representations during testing, is also of prime importance.

Acknowledgements

We would like to thank Andrew Saxe for early discussions that got us interested in this work, as well as the reviewers for insightful comments and suggestions. We would like to thank the CSIR/SU Centre for Artificial Intelligence Research (CAIR), South Africa, for financial support. AP would also like to thank the MIH Media Lab at Stellenbosch University and Praelexis (Pty) Ltd for providing stimulating working environments for a portion of this work.


References

Advani, M. S. and Saxe, A. M. High-dimensional dynamics of generalization error in neural networks. arXiv:1710.03667, 2017.

Alain, G. and Bengio, Y. What regularized auto-encoders learn from the data-generating distribution. The Journal of Machine Learning Research, 15(1):3563–3593, 2014.

Baldi, P. and Hornik, K. Neural networks and principal component analysis: Learning from examples without local minima. Neural Networks, 2(1):53–58, 1989.

Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, pp. 153–160, 2007.

Bengio, Y., Yao, L., Alain, G., and Vincent, P. Generalized denoising auto-encoders as generative models. In Advances in Neural Information Processing Systems, pp. 899–907, 2013.

Bishop, C. M. Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1):108–116, 1995.

Chandar, S., Lauly, S., Larochelle, H., Khapra, M., Ravindran, B., Raykar, V. C., and Saha, A. An autoencoder approach to learning bilingual word representations. In Advances in Neural Information Processing Systems, pp. 1853–1861, 2014.

Chen, M., Weinberger, K., Sha, F., and Bengio, Y. Marginalized denoising auto-encoders for nonlinear representations. In International Conference on Machine Learning, pp. 1476–1484, 2014.

Chen, M., Weinberger, K., Xu, Z., and Sha, F. Marginalizing stacked linear denoising autoencoders. Journal of Machine Learning Research, 16:3849–3875, 2015.

Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. Sharp minima can generalize for deep nets. arXiv:1703.04933, 2017.

Elman, J. L. and Zipser, D. Learning the hidden structure of speech. Journal of the Acoustical Society of America, 83:1615–1626, 1987.

Krogh, A. and Hertz, J. A. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, pp. 950–957, 1992.

Larsson, G. Discovery of visual semantics by unsupervised and self-supervised representation learning. arXiv:1708.05812, 2017.

Lin, H. W., Tegmark, M., and Rolnick, D. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168:1223–1247, 2017.

Louart, C., Liao, Z., and Couillet, R. A random matrix approach to neural networks. arXiv:1702.05419v2, 2017.

Neyshabur, B., Tomioka, R., Salakhutdinov, R., and Srebro, N. Geometry of optimization and implicit regularization in deep learning. arXiv:1705.03071, 2017.

Nguyen, Q. and Hein, M. The loss surface of deep and wide neural networks. arXiv:1704.08045, 2017.

Opper, M. Learning times of neural networks: exact solution for a perceptron algorithm. Physical Review A, 38(7):3824, 1988.

Pennington, J. and Bahri, Y. Geometry of neural network loss surfaces via random matrix theory. In International Conference on Machine Learning, pp. 2798–2806, 2017.

Pennington, J. and Worah, P. Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems, pp. 2634–2643, 2017.

Pennington, J., Schoenholz, S., and Ganguli, S. Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. In Advances in Neural Information Processing Systems, pp. 4788–4798, 2017.

Poole, B., Sohl-Dickstein, J., and Ganguli, S. Analyzing noise in autoencoders and deep networks. arXiv:1406.1831, 2014.

Rifai, S., Vincent, P., Muller, X., Glorot, X., and Bengio, Y. Contractive auto-encoders: Explicit invariance during feature extraction. In International Conference on Machine Learning, pp. 833–840, 2011.

Saad, D. and Solla, S. A. Exact solution for on-line learning in multilayer neural networks. Physical Review Letters, 74(21):4337, 1995.

Sanger, T. D. Optimal unsupervised learning in a single-layer linear feedforward neural network. Neural Networks, 2(6):459–473, 1989.

Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120, 2013a.

Saxe, A. M., McClelland, J. L., and Ganguli, S. Learning hierarchical category structure in deep neural networks. In Proceedings of the Cognitive Science Society, pp. 1271–1276, 2013b.

Soudry, D. and Hoffer, E. Exponentially vanishing sub-optimal local minima in multilayer neural networks. arXiv:1702.05777, 2017.

Swirszcz, G., Czarnecki, W. M., and Pascanu, R. Local minima in training of neural networks. arXiv:1611.06310, 2017.

Tu, Z., Liu, Y., Shang, L., Liu, X., and Li, H. Neural machine translation with reconstruction. In AAAI Conference on Artificial Intelligence, pp. 3097–3103, 2017.

van der Maaten, L., Chen, M., Tyree, S., and Weinberger, K. Learning with marginalized corrupted features. In International Conference on Machine Learning, pp. 410–418, 2013.

Vincent, P. A connection between score matching and denoising autoencoders. Neural Computation, 23(7):1661–1674, 2011.

Vincent, P., Larochelle, H., Bengio, Y., and Manzagol, P.-A. Extracting and composing robust features with denoising autoencoders. In International Conference on Machine Learning, pp. 1096–1103, 2008.

Wager, S., Wang, S., and Liang, P. S. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems, pp. 351–359, 2013.

Wang, S. and Manning, C. Fast dropout training. In International Conference on Machine Learning, pp. 118–126, 2013.

Zeiler, M., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q., Nguyen, P., Senior, A., Vanhoucke, V., Dean, J., and Hinton, G. E. On rectified linear units for speech processing. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.