

Train faster, generalize better: Stability of stochastic gradient descent

Moritz Hardt MRTZ@GOOGLE.COM    Benjamin Recht BRECHT@BERKELEY.EDU    Yoram Singer SINGER@GOOGLE.COM

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the author(s).

Abstract

We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically stable in the sense of Bousquet and Elisseeff. Our analysis only employs elementary tools from convex and continuous optimization. We derive stability bounds for both convex and non-convex optimization under standard Lipschitz and smoothness assumptions.

Applying our results to the convex case, we provide new insights for why multiple epochs of stochastic gradient methods generalize well in practice. In the non-convex case, we give a new interpretation of common practices in neural networks, and formally show that popular techniques for training large deep models are indeed stability-promoting. Our findings conceptually underscore the importance of reducing training time beyond its obvious benefit.

1. Introduction

The most widely used optimization method in machine learning practice is the stochastic gradient method (SGM). Stochastic gradient methods aim to minimize the empirical risk of a model by repeatedly computing the gradient of a loss function on a single training example, or a batch of few examples, and updating the model parameters accordingly. SGM is scalable, robust, and performs well across many different domains ranging from smooth and strongly convex problems to complex non-convex objectives.

In a nutshell, our results establish that: Any model trained with the stochastic gradient method in a reasonable amount of time attains small generalization error.

As training time is inevitably limited in practice, our results help to explain the strong generalization performance of stochastic gradient methods observed in practice. More concretely, we bound the generalization error of a model in terms of the number of iterations that the stochastic gradient method took in order to train the model. Our main analysis tool is the notion of algorithmic stability due to Bousquet and Elisseeff (2002). We demonstrate that the stochastic gradient method is stable provided that the objective is relatively smooth and the number of steps taken is sufficiently small.

It is common in practice to perform a number of steps that is linear in the size of the sample and to access each data point multiple times. Our results show, in a broad range of settings, that provided the number of iterations is linear in the number of data points, the generalization error is bounded by a vanishing function of the sample size. The results hold true even for complex models with a large number of parameters and no explicit regularization term in the objective. Namely, fast training time by itself is sufficient to prevent overfitting.

Our bounds are algorithm specific: since the number of iterations we allow can be larger than the sample size, an arbitrary algorithm could easily achieve small training error by memorizing all training data with no generalization ability whatsoever. In contrast, if the stochastic gradient method manages to fit the training data in a reasonable number of iterations, it is guaranteed to generalize.

Conceptually, we show that minimizing training time is not only beneficial for obvious computational advantages, but also has the important byproduct of decreasing generalization error. Consequently, it may make sense for practitioners to focus on minimizing training time, for instance, by designing model architectures for which the stochastic gradient method converges fastest to a desired error level.

1.1. Our contributions

Our focus is on generating generalization bounds for models learned with stochastic gradient descent. Recall that the generalization bound is the expected difference between the error a model incurs on a training set versus the error incurred on a new data point, sampled from the same distribution that generated the training data. Throughout, we assume we are training models using $n$ sampled data points.

Our results build on a fundamental connection between the generalization error of an algorithm and its stability properties. Roughly speaking, an algorithm is stable if the training error it achieves varies only slightly if we change any single training data point. The precise notion of stability we use is known as uniform stability, due to Bousquet and Elisseeff (2002). It states that a randomized algorithm $A$ is uniformly stable if, for all data sets differing in only one element, the learned models produce nearly the same predictions. We review this method in Section 2, and provide a new adaptation of this theory to iterative algorithms.

In Section 3, we show that stochastic gradient is uniformly stable, and our techniques mimic its convergence proofs. For convex loss functions, we prove that the stability measure decreases as a function of the sum of the step sizes. For strongly convex loss functions, we show that stochastic gradient is stable even if we train for an arbitrarily long time. We can combine our bounds on the generalization error of the stochastic gradient method with optimization bounds quantifying the convergence of the empirical loss achieved by SGM. In Section 5, we show that models trained for multiple epochs match classic bounds for stochastic gradient (Nemirovski & Yudin, 1978; 1983).

More surprisingly, our results carry over to the case where the loss function is non-convex. In this case we show that the method generalizes provided the steps are sufficiently small and the number of iterations is not too large. More specifically, we show that the number of steps of stochastic gradient can grow as $n^c$ for a small $c > 1$. This provides some explanation as to why neural networks can be trained for multiple epochs of stochastic gradient and still exhibit excellent generalization. In Section 4, we furthermore show that various heuristics used in practice, especially in the deep learning community, help to increase the stability of the stochastic gradient method. For example, the popular dropout scheme (Krizhevsky et al., 2012; Srivastava et al., 2014) improves all of our bounds. Similarly, $\ell_2$-regularization improves the exponent of $n$ in our non-convex result. In fact, we can drive the exponent arbitrarily close to $1/2$ while preserving the non-convexity of the problem.

1.2. Related work

There is a venerable line of work on stability and generalization dating back more than thirty years (Devroye & Wagner, 1979; Kearns & Ron, 1999; Bousquet & Elisseeff, 2002; Mukherjee et al., 2006; Shalev-Shwartz et al., 2010). The landmark work of Bousquet and Elisseeff (2002) introduced the notion of uniform stability that we rely on. They showed that several important classification techniques are uniformly stable. In particular, under certain regularity assumptions, it was shown that the optimizer of a regularized empirical loss minimization problem is uniformly stable. Previous work generally applies only to the exact minimizer of specific optimization problems. It is not immediately evident how to compute a generalization bound for an approximate minimizer such as one found by using stochastic gradient. Subsequent work studied stability bounds for randomized algorithms but focused on random perturbations of the cost function, such as those induced by bootstrapping or bagging (Elisseeff et al., 2005). This manuscript differs from this foundational work in that it derives stability bounds for the learning procedure, analyzing algorithmic properties that induce stability.

Classic results by Nemirovski and Yudin show that the stochastic gradient method is nearly optimal for empirical risk minimization of convex loss functions (Nemirovski & Yudin, 1978; 1983; Nemirovski et al., 2009; Frostig et al., 2015). These results have been extended by many machine learning researchers, yielding tighter bounds and probabilistic guarantees (Hazan et al., 2006; Hazan & Kale, 2014; Rakhlin et al., 2012). However, there is an important limitation of all of this prior art: the derived generalization bounds only hold for single passes over the data. That is, in order for the bounds to be valid, each training example must be used no more than once in a stochastic gradient update. In practice, of course, one tends to run multiple epochs of the stochastic gradient method. Our results resolve this issue by combining stability with optimization error. We use the foundational results to estimate the error on the empirical risk and then use stability to derive a deviation from the true risk. This enables us to study the risk incurred by multiple epochs and provide simple analyses of regularization methods for convex stochastic gradient. We compare our results to this related work in Section 5. We note that Rosasco and Villa (2014) obtain risk bounds for least squares minimization with an incremental gradient method in terms of the number of epochs. These bounds are akin to our study in Section 5, although our results are incomparable due to various different assumptions.

Finally, we note that in the non-convex case, the stochastic gradient method is remarkably successful for training large neural networks (Bottou, 1998; Krizhevsky et al., 2012). However, our theoretical understanding of this method is limited. Several authors have shown that the stochastic gradient method finds a stationary point of non-convex cost functions (Kushner & Yin, 2003; Ghadimi & Lan, 2013). Beyond asymptotic convergence to stationary points, little is known about finding models with low training or generalization error in the non-convex case. There have recently been several important studies investigating optimal training of neural nets. For example, Livni et al. (2014) show that networks with polynomial activations can be learned in a greedy fashion. Janzamin et al. (2015) show that two-layer neural networks can be learned using tensor methods. Arora et al. (2015) show that two-layer sparse coding dictionaries can be learned via stochastic gradient. Our work complements these developments: rather than providing new insights into mechanisms that yield low training error, we provide insights into mechanisms that yield low generalization error. If one can achieve low training error quickly on a non-convex problem with stochastic gradient, our results guarantee that the resulting model generalizes well.

2. Stability of randomized iterative algorithms

Consider the following general setting of supervised learning. There is an unknown distribution $\mathcal{D}$ over examples from some space $Z$. We receive a sample $S = (z_1, \ldots, z_n)$ of $n$ examples drawn i.i.d. from $\mathcal{D}$. Our goal is to find a model $w$ with small population risk, defined as $R[w] \stackrel{\mathrm{def}}{=} \mathbb{E}_{z \sim \mathcal{D}}\, f(w; z)$, where $f$ is a loss function and $f(w; z)$ designates the loss of the model described by $w$ encountered on example $z$.

Since we cannot measure the objective $R[w]$ directly, we instead use a sample-averaged proxy, the empirical risk, defined as $R_S[w] \stackrel{\mathrm{def}}{=} \frac{1}{n}\sum_{i=1}^{n} f(w; z_i)$. The generalization error of a model $w$ is the difference
\[
R_S[w] - R[w]. \tag{2.1}
\]

When $w = A(S)$ is chosen as a function of the data by a potentially randomized algorithm $A$, it makes sense to consider the expected generalization error
\[
\epsilon_{\mathrm{gen}} \stackrel{\mathrm{def}}{=} \mathbb{E}_{S,A}\big[R_S[A(S)] - R[A(S)]\big], \tag{2.2}
\]
where the expectation is over the randomness of $A$ and the sample $S$.
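To make these definitions concrete, the following sketch (our illustration, not part of the original paper) estimates the empirical risk $R_S[w]$, approximates the population risk $R[w]$ with a large fresh sample, and reports the difference in (2.1); the linear model, squared loss, and Gaussian data are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w, z):
    """Squared-error loss f(w; z) for a linear model, with z = (x, y)."""
    x, y = z
    return 0.5 * (x @ w - y) ** 2

def risk(w, sample):
    """Average loss over a sample; for the training set this is R_S[w]."""
    return float(np.mean([loss(w, z) for z in sample]))

# A synthetic distribution D: y = <x, w_true> + noise.
d, n = 10, 100
w_true = rng.normal(size=d)
def draw(m):
    X = rng.normal(size=(m, d))
    y = X @ w_true + 0.1 * rng.normal(size=m)
    return list(zip(X, y))

S = draw(n)              # training sample of n i.i.d. examples
fresh = draw(20_000)     # large fresh sample approximating the population risk R[w]

w = rng.normal(size=d)   # some model, e.g. the output of SGM trained on S
gen_error = risk(w, S) - risk(w, fresh)   # the difference R_S[w] - R[w] of (2.1), up to sampling error
print(f"estimated generalization error: {gen_error:.4f}")
```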

In order to bound the generalization error of an algorithm, we employ the following notion of uniform stability, in which we allow randomized algorithms as well.

Definition 2.1. A randomized algorithm $A$ is $\epsilon$-uniformly stable if for all data sets $S, S' \in Z^n$ such that $S$ and $S'$ differ in at most one example, we have
\[
\sup_z\; \mathbb{E}_A\big[f(A(S); z) - f(A(S'); z)\big] \le \epsilon\,. \tag{2.3}
\]
Here, the expectation is taken only over the internal randomness of $A$. We will denote by $\epsilon_{\mathrm{stab}}(A, n)$ the infimum over all $\epsilon$ for which (2.3) holds. We will omit the tuple $(A, n)$ when it is clear from the context.

We recall the important theorem that uniform stability implies generalization in expectation. The proof is based on an argument in Lemma 7 of (Bousquet & Elisseeff, 2002) and is very similar to Lemma 11 in (Shalev-Shwartz et al., 2010).

Theorem 2.2. Let $A$ be $\epsilon$-uniformly stable. Then,
\[
\big|\mathbb{E}_{S,A}\big[R_S[A(S)] - R[A(S)]\big]\big| \le \epsilon\,.
\]

Theorem 2.2 proves that if an algorithm is uniformly stable, then its generalization error is small. We now turn to some properties of iterative algorithms that control their uniform stability.

2.1. Properties of update rules

We consider general update rules of the form $G: \Omega \to \Omega$ which map a point $w \in \Omega$ in the parameter space to another point $G(w)$. The most common update is the gradient update rule $G(w) = w - \alpha \nabla f(w)$, where $\alpha \ge 0$ is a step size and $f: \Omega \to \mathbb{R}$ is a function that we want to optimize.

The canonical update rule we will consider in this manuscript is an incremental gradient update, where $G(w) = w - \alpha \nabla f(w)$ for some convex function $f$. We will return to a detailed discussion of this specific update in the sequel, but the reader should keep this particular example in mind throughout the remainder of this section.

The following two definitions provide the foundation of our analysis of how two different sequences of update rules diverge when iterated from the same starting point. These definitions will ultimately be useful when analyzing the stability of stochastic gradient descent.

Definition 2.3. An update rule is $\eta$-expansive if for all $v, w \in \Omega$, $\|G(v) - G(w)\| \le \eta\|v - w\|$. It is $\sigma$-bounded if $\|w - G(w)\| \le \sigma$.
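As an informal illustration (ours, not from the paper), the sketch below implements the gradient update rule on a quadratic and numerically estimates the expansiveness and boundedness constants of Definition 2.3 over random pairs of points; the particular quadratic and step size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
H = A.T @ A + np.eye(5)     # positive-definite Hessian: f(w) = 0.5 * w^T H w is smooth and convex

def G(w, alpha=0.05):
    """Gradient update rule G(w) = w - alpha * grad f(w) for the quadratic above."""
    return w - alpha * (H @ w)

# Estimate eta = sup ||G(v) - G(w)|| / ||v - w|| (expansiveness) and
# sigma = sup ||w - G(w)|| (boundedness) over random Gaussian draws.
pairs = [(rng.normal(size=5), rng.normal(size=5)) for _ in range(1000)]
eta = max(np.linalg.norm(G(v) - G(w)) / np.linalg.norm(v - w) for v, w in pairs)
sigma = max(np.linalg.norm(w - G(w)) for _, w in pairs)
print(f"estimated eta (expansiveness): {eta:.3f}, sigma over the sampled points: {sigma:.3f}")
```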

With these two properties, we can establish the following lemma on how a sequence of updates to a model diverges when the training set is perturbed.

Lemma 2.4 (Growth recursion). Fix an arbitrary sequence of updates $G_1, \ldots, G_T$ and another sequence $G'_1, \ldots, G'_T$. Let $w_0 = w'_0$ be a starting point in $\Omega$ and define $\delta_t = \|w'_t - w_t\|$, where $w_t, w'_t$ are defined recursively through
\[
w_{t+1} = G_t(w_t), \qquad w'_{t+1} = G'_t(w'_t) \qquad (t > 0).
\]
Then, we have the recurrence relation $\delta_0 = 0$ and
\[
\delta_{t+1} \le
\begin{cases}
\eta\,\delta_t & \text{if } G_t = G'_t \text{ is } \eta\text{-expansive,}\\
\min(\eta, 1)\,\delta_t + 2\sigma_t & \text{if } G_t \text{ and } G'_t \text{ are } \sigma_t\text{-bounded and } G_t \text{ is } \eta\text{-expansive.}
\end{cases}
\]

3. Stability of Stochastic Gradient Method

Given $n$ labeled examples $S = (z_1, \ldots, z_n)$ where $z_i \in Z$, consider a decomposable objective function $f(w) = \frac{1}{n}\sum_{i=1}^{n} f(w; z_i)$, where $f(w; z_i)$ denotes the loss of $w$ on the example $z_i$. The stochastic gradient update for this problem with learning rate $\alpha_t > 0$ is given by $w_{t+1} = w_t - \alpha_t \nabla_w f(w_t; z_{i_t})$. The stochastic gradient method (SGM) is the algorithm resulting from performing stochastic gradient updates $T$ times where the indices $i_t$ are randomly chosen. There are two popular schemes for choosing the examples' indices. One is to pick $i_t$ uniformly at random in $\{1, \ldots, n\}$ at each step. The other is to choose a random permutation over $\{1, \ldots, n\}$ and cycle through the examples repeatedly in the order determined by the permutation. Our results hold for both variants.
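For concreteness, here is a minimal sketch of the SGM loop with both index-selection schemes; this is our own illustration (not the authors' code), and `grad_loss`, the data, and the step-size schedule are placeholders.

```python
import numpy as np

def sgm(grad_loss, data, w0, alphas, rng, scheme="uniform"):
    """Stochastic gradient method: w_{t+1} = w_t - alpha_t * grad f(w_t; z_{i_t})."""
    n, w = len(data), w0.copy()
    perm = rng.permutation(n)                  # used by the permutation scheme
    for t, alpha in enumerate(alphas):
        if scheme == "uniform":
            i = rng.integers(n)                # pick i_t uniformly at random in {1, ..., n}
        else:
            i = perm[t % n]                    # cycle repeatedly through one random permutation
        w = w - alpha * grad_loss(w, data[i])
    return w

# Example usage with least-squares losses f(w; (x, y)) = 0.5 * (<x, w> - y)^2.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(50, 5)), rng.normal(size=50)
grad_loss = lambda w, z: (z[0] @ w - z[1]) * z[0]
alphas = [0.5 / (t + 1) for t in range(200)]
w_T = sgm(grad_loss, list(zip(X, y)), np.zeros(5), alphas, rng, scheme="permutation")
```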

In parallel with the previous section, the stochastic gradient method is akin to applying the gradient update rule defined as follows.

Definition 3.1. For a nonnegative step size $\alpha \ge 0$ and a function $f: \Omega \to \mathbb{R}$, we define the gradient update rule $G_{f,\alpha}$ as $G_{f,\alpha}(w) = w - \alpha\nabla f(w)$.

3.1. Proof idea: Stability of stochastic gradient method

In order to prove that the stochastic gradient method is stable, we will analyze the output of the algorithm on two data sets that differ in precisely one location. Note that if the loss function is $L$-Lipschitz for every example $z$, we have $\mathbb{E}\,|f(w; z) - f(w'; z)| \le L\,\mathbb{E}\,\|w - w'\|$ for all $w$ and $w'$. Hence, it suffices to analyze how $w_t$ and $w'_t$ diverge in the domain as a function of time $t$. Recalling that $w_t$ is obtained from $w_{t-1}$ via a gradient update, our goal is to bound $\delta_t = \|w_t - w'_t\|$ recursively and in expectation as a function of $\delta_{t-1}$.

There are two cases to consider. In the first case, the example that SGM selects at step $t$ is identical in $S$ and $S'$. Unfortunately, it could still be the case that $\delta_t$ grows, since $w_t$ and $w'_t$ differ and so the gradients at these two points may still differ. Below, we will show how to control $\delta_t$ in terms of the convexity and smoothness properties of the stochastic gradients.

The second case to consider is when SGM selects the one example on which $S$ and $S'$ differ. Note that this happens only with probability $1/n$ if examples are selected randomly. In this case, we simply bound the increase in $\delta_t$ by the norms of the two gradients $\nabla f(w_{t-1}; z)$ and $\nabla f(w'_{t-1}; z')$. The sum of the norms is bounded by $2\alpha_t L$, and we obtain $\delta_t \le \delta_{t-1} + 2\alpha_t L$. Combining the two cases, we can then solve a simple recurrence relation to obtain a bound on $\delta_T$.

This simple approach suffices to obtain the desired result in the convex case, but there are additional difficulties in the non-convex case. Here, we need to use an intriguing stability property of the stochastic gradient method. Specifically, the first time step $t_0$ at which SGM even encounters the example in which $S$ and $S'$ differ is a random variable in $\{1, \ldots, n\}$ which tends to be relatively large. Specifically, for any $m \in \{1, \ldots, n\}$, the probability that $t_0 \le m$ is upper bounded by $m/n$. This allows us to argue that SGM has a long "burn-in period" during which $\delta_t$ does not grow at all. Once $\delta_t$ begins to grow, the step size has already decayed, allowing us to obtain a non-trivial bound.

We now turn to making this argument precise.

3.2. Expansion properties of stochastic gradients

Let us now record some of the core properties of the stochastic gradient update. The gradient update rule is bounded provided that the function $f$ satisfies the following common Lipschitz condition.

Definition 3.2. We say that $f$ is $L$-Lipschitz if for all points $u$ in the domain of $f$ we have $\|\nabla f(u)\| \le L$. This implies that $|f(u) - f(v)| \le L\|u - v\|$.

Lemma 3.3. Assume that $f$ is $L$-Lipschitz. Then, the gradient update $G_{f,\alpha}$ is $(\alpha L)$-bounded.

We now turn to expansiveness. As we will see shortly, different expansion properties are achieved for non-convex, convex, and strongly convex functions.

Definition 3.4. A function $f: \Omega \to \mathbb{R}$ is $\gamma$-strongly convex if for all $u, v \in \Omega$ we have $f(u) \ge f(v) + \langle \nabla f(v), u - v\rangle + \frac{\gamma}{2}\|u - v\|^2$.

We say $f$ is convex if it is 0-strongly convex. The following standard notion of smoothness leads to a bound on how expansive the gradient update is.

Definition 3.5. A function $f: \Omega \to \mathbb{R}$ is $\beta$-smooth if for all $u, v \in \Omega$ we have $\|\nabla f(u) - \nabla f(v)\| \le \beta\|u - v\|$.

In general, smoothness will imply that the gradient updates cannot be overly expansive. When the function is also convex and the step size is sufficiently small, the gradient update becomes non-expansive. When the function is additionally strongly convex, the gradient update becomes contractive in the sense that $\eta$ will be less than one and $u$ and $v$ will actually shrink closer to one another. The majority of the following results can be found in several textbooks and monographs; notable references are Polyak (1987) and Nesterov (2004). We include proofs in the appendix for completeness.

Lemma 3.6. Assume that $f$ is $\beta$-smooth. Then, $G_{f,\alpha}$ is $(1 + \alpha\beta)$-expansive. If $f$ is in addition convex, then for any $\alpha \le 2/\beta$, the update $G_{f,\alpha}$ is 1-expansive. If $f$ is in addition $\gamma$-strongly convex, then for $\alpha \le \frac{2}{\beta+\gamma}$, $G_{f,\alpha}$ is $\left(1 - \frac{\alpha\beta\gamma}{\beta+\gamma}\right)$-expansive.

Henceforth we will no longer mention which random selection rule we use, as the proofs are almost identical for both rules.


3.3. Convex optimization

We begin with a simple stability bound for convex loss minimization via the stochastic gradient method.

Theorem 3.7. Assume that the loss function $f(\cdot\,; z)$ is $\beta$-smooth, convex and $L$-Lipschitz for every $z$. Suppose that we run SGM with step sizes $\alpha_t \le 2/\beta$ for $T$ steps. Then, SGM satisfies uniform stability with
\[
\epsilon_{\mathrm{stab}} \le \frac{2L^2}{n}\sum_{t=1}^{T}\alpha_t\,.
\]

Proof. Let $S$ and $S'$ be two samples of size $n$ differing in only a single example. Consider the gradient updates $G_1, \ldots, G_T$ and $G'_1, \ldots, G'_T$ induced by running SGM on samples $S$ and $S'$, respectively. Let $w_T$ and $w'_T$ denote the corresponding outputs of SGM.

We now fix an example $z \in Z$ and apply the Lipschitz condition on $f(\cdot\,; z)$ to get
\[
\mathbb{E}\,|f(w_T; z) - f(w'_T; z)| \le L\,\mathbb{E}[\delta_T]\,, \tag{3.1}
\]
where $\delta_T = \|w_T - w'_T\|$. Observe that at step $t$, with probability $1 - 1/n$, the example selected by SGM is the same in both $S$ and $S'$. In this case we have that $G_t = G'_t$, and we can use the 1-expansivity of the update rule $G_t$, which follows from Lemma 3.6 using the fact that the objective function is convex and that $\alpha_t \le 2/\beta$. With probability $1/n$ the selected example is different, in which case we use that both $G_t$ and $G'_t$ are $(\alpha_t L)$-bounded as a consequence of Lemma 3.3. Hence, we can apply Lemma 2.4 and linearity of expectation to conclude that for every $t$,
\[
\mathbb{E}[\delta_{t+1}] \le \left(1 - \frac{1}{n}\right)\mathbb{E}[\delta_t] + \frac{1}{n}\,\mathbb{E}[\delta_t] + \frac{2\alpha_t L}{n} = \mathbb{E}[\delta_t] + \frac{2L\alpha_t}{n}\,.
\]
Unraveling the recursion gives $\mathbb{E}[\delta_T] \le \frac{2L}{n}\sum_{t=1}^{T}\alpha_t$. Plugging this back into equation (3.1) gives the desired result. □
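The argument is easy to check numerically. The sketch below is our own illustration (not the authors' code), using an arbitrary logistic-loss example on unit-norm features so that $L \le 1$: it trains SGM with shared randomness on two samples differing in one point and compares the resulting parameter distance with the quantity $\frac{2L}{n}\sum_t \alpha_t$ from the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, T, L = 100, 5, 500, 1.0
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)        # ||x|| = 1, so the logistic loss is 1-Lipschitz
y = rng.choice([-1.0, 1.0], size=n)
alphas = np.full(T, 0.1)                              # constant steps, well below 2/beta for this loss

def grad(w, x, yi):
    """Gradient of the logistic loss log(1 + exp(-y <x, w>))."""
    return -yi * x / (1.0 + np.exp(yi * (x @ w)))

# S' differs from S in a single example (index 0 gets a fresh point).
X2, y2 = X.copy(), y.copy()
x_new = rng.normal(size=d)
X2[0], y2[0] = x_new / np.linalg.norm(x_new), -y[0]

deltas = []
for trial in range(20):
    idx = np.random.default_rng(trial).integers(n, size=T)   # shared randomness for both runs
    w, w2 = np.zeros(d), np.zeros(d)
    for t, i in enumerate(idx):
        w = w - alphas[t] * grad(w, X[i], y[i])
        w2 = w2 - alphas[t] * grad(w2, X2[i], y2[i])
    deltas.append(np.linalg.norm(w - w2))

print("average ||w_T - w'_T|| over trials:", np.mean(deltas))
print("bound (2L/n) * sum_t alpha_t      :", 2 * L / n * alphas.sum())
```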

We refer the reader to the full version of this paper for our results on strongly convex optimization.

3.4. Non-convex optimization

In this section we prove stability results for stochastic gradient methods that do not require convexity. We will still assume that the objective function is smooth and Lipschitz as defined previously.

The crux of the proof is to observe that SGM typically makes several steps before it even encounters the one example on which two data sets in the stability analysis differ.

Theorem 3.8. Assume that $f(\cdot\,; z) \in [0, 1]$ is an $L$-Lipschitz and $\beta$-smooth loss function for every $z$. Suppose that we run SGM for $T$ steps with monotonically non-increasing step sizes $\alpha_t \le c/t$. Then, SGM has uniform stability with
\[
\epsilon_{\mathrm{stab}} \le \frac{1 + 1/(\beta c)}{n - 1}\,\big(2cL^2\big)^{\frac{1}{\beta c + 1}}\, T^{\frac{\beta c}{\beta c + 1}}\,.
\]
In particular, omitting constant factors that depend on $\beta$, $c$, and $L$, we get $\epsilon_{\mathrm{stab}} \lesssim T^{1 - 1/(\beta c + 1)}/n$.
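To get a feel for the rate, one can plug numbers into the bound. The short sketch below is our own illustration with arbitrary constants $\beta = c = L = 1$; with these constants the exponent on $T$ is $1/2$, so even $T = n^{1.2}$ steps leave a stability bound that vanishes with $n$.

```python
def nonconvex_stability_bound(T, n, L=1.0, beta=1.0, c=1.0):
    """Theorem 3.8 bound: (1 + 1/(beta*c))/(n - 1) * (2*c*L**2)**(1/(beta*c + 1)) * T**(beta*c/(beta*c + 1))."""
    q = beta * c / (beta * c + 1)
    return (1 + 1 / (beta * c)) / (n - 1) * (2 * c * L ** 2) ** (1 - q) * T ** q

for n in (10**3, 10**4, 10**5):
    print(n, nonconvex_stability_bound(T=int(n**1.2), n=n))   # decays roughly like n**(-0.4)
```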

4. Stability-inducing operations

In light of our results, it makes sense to look for operations that increase the stability of the stochastic gradient method. We show in this section that, pleasingly, several popular heuristics and methods indeed improve the stability of SGM. Our rather straightforward analyses both strengthen the bounds we previously obtained and help to provide an explanation for the empirical success of these methods.

Weight Decay and Regularization. Weight decay is a simple and effective method that often improves generalization (Krogh & Hertz, 1992).

Definition 4.1. Let $f: \Omega \to \mathbb{R}$ be a differentiable function. We define the gradient update with weight decay at rate $\mu$ as $G_{f,\mu,\alpha}(w) = (1 - \alpha\mu)w - \alpha\nabla f(w)$.

It is easy to verify that the above update rule is equivalent to performing a gradient update on the $\ell_2$-regularized objective $g(w) = f(w) + \frac{\mu}{2}\|w\|^2$.

Lemma 4.2. Assume that $f$ is $\beta$-smooth. Then, $G_{f,\mu,\alpha}$ is $(1 + \alpha(\beta - \mu))$-expansive.

The above lemma shows that the regularization parameter $\mu$ counters the smoothness parameter $\beta$. Once $\mu > \beta$, the gradient update with weight decay becomes contractive. Any theorem we proved in previous sections that has a dependence on $\beta$ leads to a corresponding theorem for stochastic gradient with weight decay in which $\beta$ is replaced with $\beta - \mu$.
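A minimal sketch of the weight-decay update of Definition 4.1 (our illustration; `grad_f` is a placeholder for the gradient of the loss):

```python
import numpy as np

def weight_decay_update(w, grad_f, alpha, mu):
    """G_{f,mu,alpha}(w) = (1 - alpha*mu) * w - alpha * grad f(w),
    i.e. a gradient step on the l2-regularized objective g(w) = f(w) + (mu/2) * ||w||^2."""
    return (1.0 - alpha * mu) * w - alpha * grad_f(w)

# Example: one step on f(w) = 0.5 * ||w||^2, whose gradient is w.
w = weight_decay_update(np.ones(3), lambda v: v, alpha=0.1, mu=0.01)
```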

Gradient Clipping. It is common when training deep neural networks to enforce bounds on the norm of the gradients encountered by SGD. This is often done by either truncation, scaling, or dropping of examples that cause an exceptionally large value of the gradient norm.

Consider for example gradient clipping. In this procedure, a stochastic gradient $\nabla f(w; z)$ is replaced with $\nabla^{c} f(w; z) = \mathrm{clip}(\nabla f(w; z))$, where $\mathrm{clip}(x) = x$ if $\|x\| \le B$ and $\mathrm{clip}(x) = (B/\|x\|)\,x$ otherwise.

Lemma 4.3. If $f$ is $\beta$-smooth, then $\|\nabla^{c} f(w; z)\| \le B$ and $\|\nabla^{c} f(v; z) - \nabla^{c} f(w; z)\| \le \beta\|v - w\|$.

Similar arguments would apply to the different variants of gradient warping. Any such heuristic where the warping operation is Lipschitz directly leads to a bound on the Lipschitz parameter $L$ that appears in our bounds. It is also easy to introduce a varying Lipschitz parameter $L_t$ to account for possibly different values.
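For illustration (ours, not the authors' code), the clipping operation and a clipped SGM step can be written as follows; `grad_loss` is a placeholder for the per-example gradient.

```python
import numpy as np

def clip(g, B):
    """Return g unchanged if ||g|| <= B, otherwise rescale it to have norm exactly B."""
    norm = np.linalg.norm(g)
    return g if norm <= B else (B / norm) * g

def clipped_sgm_step(w, grad_loss, z, alpha, B):
    """One SGM step using the clipped stochastic gradient described above."""
    return w - alpha * clip(grad_loss(w, z), B)
```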


Dropout. Dropout (Srivastava et al., 2014) is a popular and effective heuristic for preventing large neural networks from overfitting. Here we prove that, indeed, dropout improves all of our stability bounds generically.

At each iteration of dropout SGD, we sample a random diagonal matrix $P$ with precisely $s$ diagonal entries equal to 1 and the rest equal to 0. Instead of taking a stochastic gradient step $\nabla f(w; z)$, we instead update with the perturbed gradient $\nabla^{d} f(w; z) := P\nabla f(Pw; z)$. The dropout update is both more bounded and smoother than the standard stochastic gradient update.

Lemma 4.4. Assume that $f$ is $L$-Lipschitz and $\beta$-smooth. Then, the dropout gradient $\nabla^{d} f(w; z)$ with dropout rate $s$ has norm bounded by $L$ and satisfies $\mathbb{E}_P\,\|\nabla^{d} f(v; z) - \nabla^{d} f(w; z)\| \le (\sqrt{s/d})\,\beta\|v - w\|$, where $d$ denotes the dimension of $w$.
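A sketch of the perturbed dropout gradient $\nabla^{d} f(w; z) = P\nabla f(Pw; z)$ (our own illustration of the definition above; `grad_loss` is a placeholder for the per-example gradient):

```python
import numpy as np

def dropout_gradient(grad_loss, w, z, s, rng):
    """Sample a diagonal 0/1 mask with exactly s ones and return P * grad f(P*w; z)."""
    mask = np.zeros(w.shape[0])
    mask[rng.choice(w.shape[0], size=s, replace=False)] = 1.0   # diagonal of P
    return mask * grad_loss(mask * w, z)

# Dropout SGM step: w <- w - alpha * dropout_gradient(grad_loss, w, z, s, rng).
```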

5. Convex risk minimization

We now outline how our generalization bounds lead to bounds on the population risk achieved by SGM in the convex setting. We restrict our attention to the convex case where we can contrast against known results. The main feature of our results is that we show that one can achieve bounds comparable or perhaps better than known results on stochastic gradient for risk minimization by running for multiple passes over the data set.

The key to the analysis in this section is to decompose the risk estimates into an optimization error term and a stability term. The optimization error designates how closely we optimize the empirical risk or a proxy of the empirical risk. By optimizing with stochastic gradient, we will be able to balance this optimization accuracy against how well we generalize. These results are inspired by the work of Bousquet and Bottou, who provided similar analyses for SGM based on uniform convergence (Bottou & Bousquet, 2008). However, our stability results will yield sharper bounds.

Throughout this section, our risk decomposition works as follows. We define the optimization error to be the gap between the empirical risk and the minimum empirical risk in expectation: $\epsilon_{\mathrm{opt}}(w) \stackrel{\mathrm{def}}{=} \mathbb{E}\big[R_S[w] - R_S[w^\star_S]\big]$, where $w^\star_S = \arg\min_w R_S[w]$. By Theorem 2.2, the expected risk of a $w$ output by SGM is bounded as
\[
\mathbb{E}[R[w]] \le \mathbb{E}[R_S[w]] + \epsilon_{\mathrm{stab}} \le \mathbb{E}[R_S[w^\star_S]] + \epsilon_{\mathrm{opt}}(w) + \epsilon_{\mathrm{stab}}\,.
\]
In general, the optimization error decreases with the number of SGM iterations while the stability bound increases. Balancing these two terms will thus provide a reasonable excess risk against the empirical risk minimizer. Note that our analysis involves the expected minimum empirical risk, which could be considerably smaller than the minimum risk. However, as we now show, it can never be larger.

Lemma 5.1. Let $w^\star$ denote the minimizer of the population risk and $w^\star_S$ denote the minimizer of the empirical risk given a sampled data set $S$. Then $\mathbb{E}[R_S[w^\star_S]] \le R[w^\star]$.

To analyze the optimization error, we will make use of a classical result due to Nemirovski and Yudin (1983).

Theorem 5.2. Assume we run stochastic gradient descent with constant step size $\alpha$ on a convex function $R[w] = \mathbb{E}_z[f(w; z)]$. Assume further that $\|\nabla f(w; z)\| \le L$ and $\|w_0 - w^\star\| \le D$ for some minimizer $w^\star$ of $R$. Let $\bar{w}_T$ denote the average of the $T$ iterates of the algorithm. Then we have
\[
R[\bar{w}_T] \le R[w^\star] + \frac{1}{2}\,\frac{D^2}{T\alpha} + \frac{1}{2}\,L^2\alpha\,.
\]

The upper bound stated in the previous theorem is known to be tight even if the function is $\beta$-smooth (Nemirovski & Yudin, 1983).

Theorem 5.2 directly provides a generalization bound for SGM that holds when we make a single pass over the data. The theorem requires fresh samples from the distribution in each update step of SGM. Hence, given $n$ data points, we cannot make more than $n$ steps, and each sample must not be used more than once.

Corollary 5.3. Let $f$ be a convex loss function satisfying $\|\nabla f(w; z)\| \le L$ and let $w^\star$ be a minimizer of the population risk $R[w] = \mathbb{E}_z f(w; z)$. Suppose we make a single pass of SGM over the sample $S = (z_1, \ldots, z_n)$ with fixed step size $\alpha = \frac{D}{L\sqrt{n}}$, starting from a point $w_0$ that satisfies $\|w_0 - w^\star\| \le D$. Then, the average $\bar{w}_n$ of the iterates satisfies $\mathbb{E}[R[\bar{w}_n]] \le R[w^\star] + \frac{DL}{\sqrt{n}}$.

We now contrast this bound with what follows from our results.

Proposition 5.4. Let $S = (z_1, \ldots, z_n)$ be a sample of size $n$. Let $f$ be a $\beta$-smooth convex loss function satisfying $\|\nabla f(w; z)\| \le L$ and let $w^\star_S$ be a minimizer of the empirical risk $R_S[w] = \frac{1}{n}\sum_{i=1}^{n} f(w; z_i)$. Suppose we run $T$ steps of SGM with step size
\[
\alpha = \frac{D}{L\sqrt{n}} \cdot \frac{n}{T} \cdot \sqrt{\frac{T}{n + 2T}}
\]
from a starting point $w_0$ that satisfies $\|w_0 - w^\star_S\| \le D$. Then, the average $\bar{w}_T$ over the iterates satisfies
\[
\mathbb{E}[R[\bar{w}_T]] \le \mathbb{E}[R_S[w^\star_S]] + \frac{DL}{\sqrt{n}}\,\sqrt{\frac{n + 2T}{T}}\,.
\]

Note that the bound from our stability analysis is not directly comparable to Corollary 5.3, as we are comparing against the expected minimum empirical risk rather than the minimum risk. Lemma 5.1 implies that the excess risk in our bound is at most a factor of $\sqrt{3}$ worse than that of Corollary 5.3 when $T = n$. Moreover, the excess risk in our bound tends to a factor of merely $\sqrt{2}$ larger than the Nemirovski-Yudin bound as $T$ goes to infinity. In contrast, the classical bound does not apply when $T > n$.
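The comparison is easy to tabulate. The sketch below (our own, with arbitrary $D = L = 1$) prints the ratio of the Proposition 5.4 excess-risk term to the single-pass term of Corollary 5.3, recovering the $\sqrt{3}$ factor at $T = n$ and the limiting $\sqrt{2}$ factor for large $T$.

```python
import numpy as np

def single_pass_excess(n, D=1.0, L=1.0):
    """Excess risk term of Corollary 5.3: D*L / sqrt(n)."""
    return D * L / np.sqrt(n)

def multi_epoch_excess(T, n, D=1.0, L=1.0):
    """Excess risk term of Proposition 5.4: (D*L / sqrt(n)) * sqrt((n + 2*T) / T)."""
    return D * L / np.sqrt(n) * np.sqrt((n + 2 * T) / T)

n = 10_000
for T in (n, 10 * n, 100 * n):
    print(f"T = {T:>7d}: ratio = {multi_epoch_excess(T, n) / single_pass_excess(n):.3f}")
# Prints ~1.732 (= sqrt(3)) at T = n and approaches sqrt(2) ~ 1.414 as T grows.
```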


6. Experimental Evaluation

The goal of our experiments is to isolate the effect of training time, measured in number of steps, on the stability of SGM. We evaluated a broad variety of neural network architectures and varying step sizes on a number of different datasets.

To measure algorithmic stability we consider two proxies. The first is the Euclidean distance between the parameters of two identical models trained on datasets which differ by a single example. In all of our proofs, we use slow growth of this parameter distance as a way to prove stability. Note that it is not necessary for this parameter distance to grow slowly in order for our models to be algorithmically stable; this is a strictly stronger notion. Our second, weaker proxy is to measure the generalization error directly, in terms of the absolute difference between the test error and training error of the model.

We analyzed four standard machine learning datasets, each with its own corresponding deep architecture. We studied the LeNet architecture for MNIST, the cuda-convnet architecture for CIFAR-10, the AlexNet model for ImageNet, and an LSTM model for Penn Treebank (PTB) language modeling. Full details of our architectures and training procedures can be found below.

In all cases, we ran the following experiment. We choose a random example from the training set and remove it. The remaining examples constitute our set $S$. Then we create a set $S'$ by replacing a random element of $S$ with the element we deleted. We train stochastic gradient descent with the same random seed on datasets $S$ and $S'$. We record the Euclidean distance between the individual layers of the neural network after every 100 SGM updates. We also record the training and testing errors once per epoch.
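The data-substitution step of this protocol, and the normalized parameter distance we report below, can be sketched as follows (our own helper code; the actual models were trained in the respective deep learning frameworks):

```python
import numpy as np

def make_neighboring_datasets(data, rng):
    """Remove a random example to form S, then substitute it back at a random position to form S'."""
    data = list(data)
    removed = data.pop(rng.integers(len(data)))
    S = list(data)
    S_prime = list(S)
    S_prime[rng.integers(len(S_prime))] = removed
    return S, S_prime

def normalized_distance(params, params_prime):
    """Normalized Euclidean distance between two lists of parameter arrays (e.g. one array per layer)."""
    w = np.concatenate([p.ravel() for p in params])
    w2 = np.concatenate([p.ravel() for p in params_prime])
    return float(np.sqrt(np.sum((w - w2) ** 2) / (np.sum(w ** 2) + np.sum(w2 ** 2))))

# Usage: build S and S', train two copies of the model with the same seed on each,
# and call normalized_distance on the layer parameters every 100 SGM updates.
rng = np.random.default_rng(0)
S, S_prime = make_neighboring_datasets([(i, i % 2) for i in range(100)], rng)
```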

Our experiments show four primary findings:

Step size dependence. Doubling the step size no more than doubles the generalization error (see Figure 1). This behavior is fairly consistent both for generalization error defined with respect to classification accuracy and with respect to cross entropy (the loss function used for training). It thus suggests that there is at most a linear dependence on the step size in the generalization error.

Parameter distance. We evaluate the normalized Euclidean distance $\sqrt{\|w - w'\|^2/(\|w\|^2 + \|w'\|^2)}$ between the parameters $w$ and $w'$ of two models trained on two copies of the data differing in a random substitution. We observe that the parameter distance grows sub-linearly even in cases where our theory currently uses an exponential bound. This shows that our bounds are pessimistic.

Parameter distance versus generalization. There is a close correspondence between the parameter distance and generalization error. A priori, it could have been the case that the generalization error is small even though the parameter distance is large. Our experiments show that these two quantities often move in tandem and seem to be closely related.

Late substitution. When measuring parameter distance it is indeed important that SGM does not immediately encounter the random substitution, but only after some progress in training has occurred. If we artificially place the corrupted data point at the first step of SGM, the parameter distance can grow significantly faster subsequently. This effect is most pronounced in the ImageNet experiments, as displayed in Figure 2.

Experiments. We evaluated convolutional neural networks for image classification on three datasets: MNIST, Cifar10 and ImageNet. Starting with Cifar10, we chose a standard model consisting of three convolutional layers, each followed by a pooling operation. This model roughly corresponds to that proposed by Krizhevsky et al. (2012) and available in the "cuda-convnet" code.¹ However, to make the experiments more interpretable, we avoid all forms of regularization such as weight decay or dropout. The learning rate was fixed at 0.01. We also do not employ data augmentation even though this would greatly improve the ultimate test accuracy of the model. Additionally, we use only constant step sizes in our experiments. With these restrictions the model we use converges to below 20% test error. While this is not state of the art on Cifar10, our goal is not to optimize test accuracy but rather to have a simple, interpretable experimental setup.

The situation on MNIST is largely analogous to what we saw on Cifar10. We trained a LeNet-inspired model with two convolutional layers and one fully-connected layer. The first and second convolutional layers have 20 and 50 hidden units, respectively. This model is much smaller and converges significantly faster than the Cifar10 models, typically achieving its best test error in five epochs. We trained with minibatch size 60. As a result, the amount of overfitting is smaller. In the case of MNIST, we also repeated our experiments after replacing the usual cross entropy objective with a squared loss objective; the results are in the full version on arXiv. It turned out that this does not harm convergence at all, while leading to somewhat smaller generalization error and parameter divergence.

On ImageNet, we trained the standard AlexNet architecture (Krizhevsky et al., 2012) using data augmentation, regularization, and dropout. Unlike in the case of Cifar10, we were unable to find a setting of hyperparameters that yielded reasonable performance without using these techniques.

¹ https://code.google.com/archive/p/cuda-convnet


[Figure 1. Experimental results on Cifar. Top: generalization error (absolute difference of train and test error) over epochs for step sizes 0.0008 through 0.0128. Middle: parameter distance (normalized Euclidean distance) over epochs for layers conv1, conv2, conv3, and all layers. Bottom: comparison of train error, test error, generalization error, and parameter divergence over epochs.]

However, for Figure 2 (bottom), we did not use data augmentation, in order to exaggerate the effects of overfitting and to demonstrate the impact of scaling the model size. This figure demonstrates that the model size appears to be a second-order effect with regard to generalization error, while the step size has a considerably stronger impact.

6.1. Recurrent neural networks with LSTM

We also examined the stability of recurrent neural networks. Specifically, we looked at an LSTM architecture for language modeling (Zaremba et al., 2014). We focused on word-level prediction experiments using the Penn Tree Bank (PTB) (Marcus et al., 1993), consisting of 929,000 training words, 73,000 validation words, and 82,000 test words. PTB has 10,000 words in its vocabulary.²

² Data source: http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz

[Figure 2. Experiments on ImageNet. Top left: parameter distance (normalized Euclidean distance) over epochs with early substitution. Top right: late substitution. Bottom: generalization error over epochs for model sizes 1024, 2048, 4096, 8192, and 16384.]

Following Zaremba et al., we trained regularized LSTMs with two layers that were unrolled for 20 steps. We initialize the hidden states to zero. We trained with minibatch size 20. The LSTM has 200 units per layer and its parameters are initialized to have mean zero and standard deviation 0.1. We did not use dropout, to enhance reproducibility; dropout would only increase the stability of our models. The results are displayed in Figure 3.

[Figure 3. Training on the PTB dataset with the LSTM architecture: training loss, test loss, and their absolute difference, together with the parameter distance (normalized Euclidean distance), plotted over epochs.]

Acknowledgements

The authors would like to thank Martin Abadi, Samy Bengio, Thomas Breuel, John Duchi, Vineet Gupta, Kevin Jamieson, Kenneth Marino, Giorgio Patrini, John Platt, Eric Price, Ludwig Schmidt, Nati Srebro, Ilya Sutskever, and Oriol Vinyals for their valuable feedback.


References

Arora, Sanjeev, Ge, Rong, Ma, Tengyu, and Moitra, Ankur. Simple, efficient, and neural algorithms for sparse coding. In Proceedings of the Conference on Learning Theory (COLT), 2015.

Bottou, Leon. Online algorithms and stochastic approximations. In Saad, David (ed.), Online Learning and Neural Networks. Cambridge University Press, Cambridge, UK, 1998.

Bottou, Leon and Bousquet, Olivier. The tradeoffs of large scale learning. In Neural Information Processing Systems, 2008.

Bousquet, Olivier and Elisseeff, Andre. Stability and generalization. Journal of Machine Learning Research, 2:499–526, 2002.

Devroye, Luc P. and Wagner, T. J. Distribution-free performance bounds for potential function rules. IEEE Transactions on Information Theory, 25(5):601–604, September 1979.

Elisseeff, Andre, Evgeniou, Theodoros, and Pontil, Massimiliano. Stability of randomized learning algorithms. Journal of Machine Learning Research, 6:55–79, 2005.

Frostig, R., Ge, R., Kakade, S. M., and Sidford, A. Competing with the empirical risk minimizer in a single pass. In Conference on Learning Theory (COLT), 2015.

Ghadimi, Saeed and Lan, Guanghui. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.

Hazan, Elad and Kale, Satyen. Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. Journal of Machine Learning Research, 15(1):2489–2512, 2014.

Hazan, Elad, Kalai, Adam, Kale, Satyen, and Agarwal, Amit. Logarithmic regret algorithms for online convex optimization. In Proceedings of the 19th Annual Conference on Learning Theory, 2006.

Janzamin, Majid, Sedghi, Hanie, and Anandkumar, Anima. Generalization bounds for neural networks through tensor factorization. Preprint, arXiv:1506.08473, 2015.

Kearns, Michael J. and Ron, Dana. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453, 1999.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In Proc. NIPS, pp. 1097–1105, 2012.

Krogh, A. and Hertz, J. A. A simple weight decay can improve generalization. In Proc. NIPS, pp. 950–957, 1992.

Kushner, Harold J. and Yin, G. George. Stochastic Approximation and Recursive Algorithms and Applications. Springer-Verlag, New York, second edition, 2003.

Livni, Roi, Shalev-Shwartz, Shai, and Shamir, Ohad. On the computational efficiency of training neural networks. In Advances in Neural Information Processing Systems, pp. 855–863, 2014.

Marcus, Mitchell P., Marcinkiewicz, Mary Ann, and Santorini, Beatrice. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

Mukherjee, Sayan, Niyogi, Partha, Poggio, Tomaso, and Rifkin, Ryan M. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Adv. Comput. Math., 25(1-3):161–193, 2006.

Nemirovski, A., Juditsky, A., Lan, G., and Shapiro, A. Robust stochastic approximation approach to stochastic programming. SIAM Journal on Optimization, 19(4):1574–1609, 2009.

Nemirovski, Arkadi and Yudin, David B. On Cezari's convergence of the steepest descent method for approximating saddle point of convex-concave functions. In Soviet Mathematics Doklady, volume 19, 1978.

Nemirovski, Arkadi and Yudin, David B. Problem complexity and method efficiency in optimization. Wiley Interscience, 1983.

Nesterov, Y. Introductory lectures on convex optimization: A basic course, volume 87 of Applied Optimization. Kluwer Academic Publishers, Boston, MA, 2004.

Polyak, B. T. Introduction to optimization. Optimization Software, Inc., 1987.

Rakhlin, Alexander, Shamir, Ohad, and Sridharan, Karthik. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning, 2012. Extended version at arXiv:1109.5647.

Rosasco, Lorenzo and Villa, Silvia. Learning with incremental iterative regularization. Preprint, arXiv:1405.0042v2, 2014.

Shalev-Shwartz, Shai, Shamir, Ohad, Srebro, Nathan, and Sridharan, Karthik. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11:2635–2670, 2010.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Zaremba, Wojciech, Sutskever, Ilya, and Vinyals, Oriol. Recurrent neural network regularization. Preprint, arXiv:1409.2329, 2014.