


DeltaGrad: Rapid retraining of machine learning models

Yinjun Wu 1 Edgar Dobriban 2 Susan B. Davidson 1

Abstract

Machine learning models are not static and may need to be retrained on slightly changed datasets, for instance, with the addition or deletion of a set of datapoints. This has many applications, including privacy, robustness, bias reduction, and uncertainty quantification. However, it is expensive to retrain models from scratch. To address this problem, we propose the DeltaGrad algorithm for rapid retraining of machine learning models based on information cached during the training phase. We provide both theoretical and empirical support for the effectiveness of DeltaGrad, and show that it compares favorably to the state of the art.

1. Introduction

Machine learning models are used increasingly often, and are rarely static. Models may need to be retrained on slightly changed datasets, for instance when datapoints have been added or deleted. This has many applications, including privacy, robustness, bias reduction, and uncertainty quantification. For instance, it may be necessary to remove certain datapoints from the training data for privacy and robustness reasons. Constructing models with some datapoints removed can also be used for constructing bias-corrected models, such as in jackknife resampling (Quenouille, 1956), which requires retraining the model on all leave-one-out datasets. In addition, retraining models on subsets of data can be used for uncertainty quantification, such as constructing statistically valid prediction intervals via conformal prediction, e.g., Shafer & Vovk (2008).

Unfortunately, it is expensive to retrain models from scratch. The most common training mechanisms for large-scale models are based on (stochastic) gradient descent (SGD) and its variants.

1Department of Computer and Information Science, University of Pennsylvania, PA, United States. 2Department of Statistics, University of Pennsylvania, PA, United States. Correspondence to: Yinjun Wu <[email protected]>, Edgar Dobriban <[email protected]>, Susan B. Davidson <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s).

Figure 1. Running time of our DeltaGrad algorithm for retraining a logistic regression model on RCV1 as a function of the fraction of data deleted and added. Our algorithm is faster than training from scratch (Running time BaseL). Also shown is the distance of DeltaGrad and of the model trained on the full data from the correct values (Distance DeltaGrad and Distance BaseL, resp.), illustrating that our algorithm is accurate. See Section 4 for details.

Retraining the models on a slightly different dataset would involve re-computing the entire optimization path. When adding or removing a small number of data points, this can be of the same complexity as the original training process.

However, we expect models on two similar datasets to be similar. If we retrain the models on many different new datasets, it may be more efficient to cache some information about the training process on the original data, and compute the “updates”. Such ideas have been used recently, e.g., Ginart et al. (2019); Guo et al. (2019); Wu et al. (2020). However, the existing approaches have various limitations: they only apply to specialized problems such as k-means (Ginart et al., 2019) or logistic regression (Wu et al., 2020), or they require additional randomization leading to non-standard training algorithms (Guo et al., 2019).

To address this problem, we propose the DeltaGrad algorithm for rapid retraining of machine learning models when slight changes happen in the training dataset, e.g. deletion or addition of samples, based on information cached during training. DeltaGrad addresses several limitations of prior work: it is applicable to general machine learning models defined by empirical risk minimization trained using SGD, and does not require additional randomization. It is based on the idea of “differentiating the optimization path” with respect to the data, and is inspired by ideas from Quasi-Newton methods.



We provide both theoretical and empirical support for the effectiveness of DeltaGrad. We prove that it approximates the true optimization path at a fast rate for strongly convex objectives. We show experimentally that it is accurate and fast on several medium-scale problems on standard datasets, including two-layer neural networks. The speed-ups can be up to 6.5x with negligible accuracy loss (see e.g., Fig. 1). This paves the way toward a large-scale, efficient, general-purpose data deletion/addition machine learning system. We also illustrate how it can be used in several applications described above.

1.1. Related work

There is a great deal of work on model retraining and updating. Recently, this has gotten attention due to worldwide efforts on human-centric AI, data confidentiality and privacy, such as the General Data Protection Regulation (GDPR) in the European Union (European Union, 2016). This mandates that users can ask for their data to be removed from analysis in current AI systems. The required guarantees are thus stronger than what is provided by differential privacy (which may leave a non-vanishing contribution of the datapoints in the model, Dwork et al. (2014)), or by defenses against data poisoning attacks (which only require that the performance of the models does not degrade after poisoning, Steinhardt et al. (2017)).

Efficient data deletion is also crucial for many other applications, e.g. model interpretability and model debugging. For example, repeated retraining while removing different subsets of the training data each time is essential in many existing data systems (Doshi-Velez & Kim, 2017; Krishnan & Wu, 2017) to understand the effect of the removed data on the model behavior. It is also closely related to deletion diagnostics, which aim to locate the most influential datapoints for ML models through deletion from the training set, dating back to (Cook, 1977). Some recent work (Koh & Liang, 2017) targets general ML models, but requires explicitly maintaining Hessian matrices and can only handle the deletion of one sample, and is thus inapplicable for many large-scale applications.

Efficient model updating for adding and removing datapoints is possible for linear models, based on efficient rank-one updates of matrix inverses (e.g., Birattari et al., 1999; Horn & Johnson, 2012; Cao & Yang, 2015, etc). The scope of linear methods is extended if one uses linear feature embeddings, either randomized or learned via pretraining. Updates have been proposed for support vector machines (Syed et al., 1999; Cauwenberghs & Poggio, 2001) and nearest neighbors (Schelter, 2019).

Ginart et al. (2019) propose a definition of data erasure completeness and a quantization-based algorithm for k-means clustering achieving this. They also propose several principles that can enable efficient model updating. Guo et al. (2019) propose a general theoretical condition that guarantees that randomized algorithms can remove data from machine learning models. Their randomized approach needs standard algorithms such as logistic regression to be changed to apply. Bourtoule et al. (2019) propose the SISA (Sharded, Isolated, Sliced, Aggregated) training framework for “un-learning”, which relies on ideas similar to distributed training. Their approach requires dividing the training data into multiple shards such that a training point is included in only a small number of shards.

Our approach relies on large-scale optimization, which has an enormous literature. Stochastic gradient methods date back to Robbins & Monro (1951). More recently, a lot of work (see e.g., Bottou, 1998; 2003; Zhang, 2004; Bousquet & Bottou, 2008; Bottou, 2010; Bottou et al., 2018) focuses on empirical risk minimization.

The convergence proofs for SGD are based on the contraction of the expected residuals. They rely on assumptions such as bounded variances, strong or weak growth conditions, smoothness, and convexity (or Polyak-Lojasiewicz conditions) on the individual and overall loss functions. See e.g., (Gladyshev, 1965; Amari, 1967; Kul'chitskiy & Mozgovoy, 1992; Bertsekas & Tsitsiklis, 1996; Moulines & Bach, 2011; Karimi et al., 2016; Bottou et al., 2018; Gorbunov et al., 2019; Gower et al., 2019), etc, and references therein. Our approach is similar, but the technical details are very different, and more closely related to Quasi-Newton methods such as L-BFGS (Zhu et al., 1997).

Contributions. Our contributions are:

1. DeltaGrad: We propose the DeltaGrad algorithm for fast retraining of (stochastic) gradient descent based machine learning models on small changes of the data (small numbers of added or deleted points).

2. Theoretical support: We provide theoretical results showing the accuracy of DeltaGrad. For both GD and SGD, we show the error is of smaller order than the fraction of points removed.

3. Empirical results: We provide empirical results showing the speed and accuracy of DeltaGrad, for addition, removal, and continuous updates, on a number of standard datasets.

4. Applications: We describe the applications of DeltaGrad to several problems in machine learning, including privacy, robustness, debiasing, and statistical inference.

2. Algorithms

2.1. Setup

The training set $(x_i, y_i)_{i=1}^n$ has $n$ samples. The loss or objective function for a general machine learning model is defined as
$$F(w) = \frac{1}{n}\sum_{i=1}^{n} F_i(w),$$
where $w$ represents the vector of model parameters and $F_i(w)$ is the loss for the $i$-th sample. The gradient and Hessian matrix of $F(w)$ are
$$\nabla F(w) = \frac{1}{n}\sum_{i=1}^{n} \nabla F_i(w), \qquad H(w) = \frac{1}{n}\sum_{i=1}^{n} H_i(w).$$

Suppose the model parameter is updated through mini-batch stochastic gradient descent (SGD) for $t = 1, \ldots, T$:
$$w_{t+1} \leftarrow w_t - \frac{\eta_t}{B}\sum_{i \in B_t} \nabla F_i(w_t),$$
where $B_t$ is a randomly sampled mini-batch of size $B$ and $\eta_t$ is the learning rate at the $t$-th iteration. As a special case of SGD, the update rule of gradient descent (GD) is $w_{t+1} \leftarrow w_t - (\eta_t/n)\sum_{i=1}^{n}\nabla F_i(w_t)$. After training on the full dataset, the training samples with indices $R = \{i_1, i_2, \ldots, i_r\}$ are removed, where $r \ll n$. Our goal is to efficiently update the model parameter to the minimizer of the new empirical loss. Our algorithm also applies when $r$ new datapoints are added.

The naive solution is to apply GD directly over the remaining training samples (we use $w^U$ to denote the corresponding model parameter), i.e. run:
$$w^U_{t+1} \leftarrow w^U_t - \frac{\eta_t}{n-r}\sum_{i \notin R} \nabla F_i(w^U_t), \tag{1}$$
which aims to minimize $F^U(w) = \frac{1}{n-r}\sum_{i \notin R} F_i(w)$.
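For concreteness, the following is a minimal NumPy sketch of this naive retraining loop (the baseline referred to as BaseL below); the gradient callback and its signature are illustrative placeholders, not the released implementation.

```python
import numpy as np

def baseline_retrain(w0, grad_fi, n, R, eta, T):
    """Naive retraining (Equation (1)): rerun full GD over the n - r remaining samples.
    grad_fi(w, i) is assumed to return the gradient of F_i at w (illustrative signature)."""
    removed = set(R)
    keep = [i for i in range(n) if i not in removed]
    w = w0.copy()
    for _ in range(T):
        grad = sum(grad_fi(w, i) for i in keep) / len(keep)  # average gradient over remaining samples
        w = w - eta * grad                                   # GD step on F^U
    return w
```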

2.2. Proposed DeltaGrad Algorithm

To obtain a more efficient method, we rewrite Equation (1) via the following “leave-$r$-out” gradient formula (we use $w^I$ to denote the model parameter derived by DeltaGrad):
$$w^I_{t+1} = w^I_t - \frac{\eta_t}{n-r}\Big[n\nabla F(w^I_t) - \sum_{i \in R}\nabla F_i(w^I_t)\Big]. \tag{2}$$
Computing the sum $\sum_{i \in R}\nabla F_i(w^I_t)$ of a small number of terms is more efficient than computing $\sum_{i \notin R}\nabla F_i(w^I_t)$ when $|R| = r \ll n$. For this we need to approximate $n\nabla F(w^I_t) = \sum_{i=1}^{n}\nabla F_i(w^I_t)$ by leveraging the historical gradient $\nabla F(w_t)$ (recall that $w_t$ is the model parameter before deletions), for each of the $T$ iterations.

Suppose we can cache the model parameters $w_0, \ldots, w_t$ and the gradients $\nabla F(w_0), \ldots, \nabla F(w_t)$ for each iteration

Algorithm 1 DeltaGrad
Input: The full training set $(X, Y)$, model parameters cached during the training phase over the full training samples $w_0, w_1, \ldots, w_t$ and the corresponding gradients $\nabla F(w_0), \nabla F(w_1), \ldots, \nabla F(w_t)$, the indices of the removed training samples $R$, period $T_0$, total iteration number $T$, history size $m$, “burn-in” iteration number $j_0$, learning rate $\eta_t$.
Output: Updated model parameter $w^I_t$
1  Initialize $w^I_0 \leftarrow w_0$
2  Initialize an array $\Delta G = [\,]$
3  Initialize an array $\Delta W = [\,]$
4  for $t = 0$; $t < T$; $t{+}{+}$ do
5    if $[((t - j_0) \bmod T_0) == 0]$ or $t \le j_0$ then
6      compute $\nabla F(w^I_t)$ exactly
7      compute $\nabla F(w^I_t) - \nabla F(w_t)$ based on the cached gradient $\nabla F(w_t)$
8      set $\Delta G[k] = \nabla F(w^I_t) - \nabla F(w_t)$
9      set $\Delta W[k] = w^I_t - w_t$, based on the cached parameters $w_t$
10     $k \leftarrow k + 1$
11     compute $w^I_{t+1}$ by using the exact GD update (Equation (1))
12   else
13     Pass $\Delta W[-m:]$ and $\Delta G[-m:]$ (the last $m$ elements in $\Delta W$ and $\Delta G$, which are from the $j_1$-th, $j_2$-th, $\ldots$, $j_m$-th iterations, where $j_1 < j_2 < \cdots < j_m$ depend on $t$), $v = w^I_t - w_t$, and the history size $m$, to the L-BFGS Algorithm (see Section A.2.1 in the Appendix) to get the approximation of $H(w_t)v$, i.e., $B_{j_m}v$
14     Approximate $\nabla F(w^I_t) = \nabla F(w_t) + B_{j_m}(w^I_t - w_t)$
15     Compute $w^I_{t+1}$ by using the “leave-$r$-out” gradient formula (2), based on the approximated $\nabla F(w^I_t)$
16   end
17 end
18 return $w^I_t$

of training over the original dataset. Suppose that we have been able to approximate $w^I_0, \ldots, w^I_t$. Then at iteration $t+1$, $\nabla F(w^I_t)$ can be approximated using the Cauchy mean-value theorem:
$$\nabla F(w^I_t) = \nabla F(w_t) + H_t \cdot (w^I_t - w_t), \tag{3}$$
in which $H_t$ is an integrated Hessian, $H_t = \int_0^1 H\big(w_t + x(w^I_t - w_t)\big)\,dx$.

Equation (3) requires a Hessian-vector product at every iteration. We leverage the L-BFGS algorithm to approximate this; see e.g. Matthies & Strang (1979); Nocedal (1980); Byrd et al. (1994; 1995); Zhu et al. (1997); Nocedal & Wright (2006); Mokhtari & Ribeiro (2015) and references therein. The L-BFGS algorithm uses past data to approximate the projection of the Hessian matrix in the direction of $w_{t+1} - w_t$. We denote the required historical observations at prior iterations $j$ as: $\Delta w_j = w^I_j - w_j$, $\Delta g_j = \nabla F(w^I_j) - \nabla F(w_j)$.

L-BFGS computes Quasi-Hessians $B_t$ approximating the true Hessians $H_t$ (we follow the notation of the classical L-BFGS papers, e.g., Byrd et al. (1994)). DeltaGrad (Algorithm 1) starts with a “burn-in” period of $j_0$ iterations, where it computes the full gradients $\nabla F(w^I_t)$ exactly. Afterwards, it only computes the full gradients every $T_0$ iterations. For other iterations $t$, it uses the L-BFGS algorithm, maintaining a set of updates at some prior iterations $j_1, j_2, \ldots, j_m$, i.e. $\Delta w_{j_1}, \Delta w_{j_2}, \ldots, \Delta w_{j_m}$ and $\Delta g_{j_1}, \Delta g_{j_2}, \ldots, \Delta g_{j_m}$, where $j_k - j_{k-1} \le T_0$. Then it uses an efficient L-BFGS update from Byrd et al. (1994) (see Appendix A.2.1 for the details of the L-BFGS update).

By approximating $H_t$ with $B_t$ in Equation (3) and plugging Equation (3) into Equation (2), the DeltaGrad update is:
$$w^I_{t+1} - w^I_t = -\frac{\eta_t}{n-r}\cdot
\begin{cases}
\sum_{i \notin R} \nabla F_i(w^I_t), & (t - j_0) \bmod T_0 = 0 \ \text{or}\ t \le j_0,\\[4pt]
n\big[B_{j_m}(w^I_t - w_t) + \nabla F(w_t)\big] - \sum_{i \in R} \nabla F_i(w^I_t), & \text{else.}
\end{cases}$$
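Putting Algorithm 1 and the update above together, a compact sketch of the GD variant could look as follows; `cached_w`, `cached_grad`, `grad_fn_subset` and `hvp_fn` are hypothetical helpers (e.g. the compact L-BFGS routine sketched above), not part of the released code.

```python
import numpy as np

def deltagrad_gd(cached_w, cached_grad, grad_fn_subset, R, n, eta, T0, j0, m, hvp_fn):
    """Sketch of the DeltaGrad GD loop (Algorithm 1).  Assumed interfaces:
    cached_w[t], cached_grad[t]: parameters / full gradients cached while training on all n samples;
    grad_fn_subset(w, idx): average gradient of the F_i over the indices idx, at w;
    hvp_fn(dW, dG, v): L-BFGS approximation of H(w_t) @ v from the cached differences."""
    r = len(R)
    w = cached_w[0].copy()
    dW, dG = [], []
    T = len(cached_w) - 1
    for t in range(T):
        if t <= j0 or (t - j0) % T0 == 0:
            # exact full gradient; cache the parameter/gradient differences for later L-BFGS steps
            full_grad = grad_fn_subset(w, range(n))
            if t > 0:                                  # skip the trivial zero difference at t = 0
                dW.append(w - cached_w[t])
                dG.append(full_grad - cached_grad[t])
                dW, dG = dW[-m:], dG[-m:]
        else:
            # approximate the full gradient via Equation (3), using only the cached history
            full_grad = cached_grad[t] + hvp_fn(dW, dG, w - cached_w[t])
        grad_R = grad_fn_subset(w, R)                  # average gradient over the r removed samples
        w = w - eta / (n - r) * (n * full_grad - r * grad_R)   # "leave-r-out" update, Equation (2)
    return w
```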

2.3. Convergence rate for strongly convex objectives

We provide the convergence rate of DeltaGrad for strongly convex objectives in Theorem 1. We need to introduce some assumptions. The norm used throughout the rest of the paper is the $\ell_2$ norm.

Assumption 1 (Small number of samples removed). The number of removed samples, $r$, is far smaller than the total number of training samples, $n$. There is a small constant $\delta > 0$ such that $r/n \le \delta$.

Assumption 2 (Strong convexity and smoothness). Each $F_i(w)$ ($i = 1, 2, \ldots, n$) is $\mu$-strongly convex and $L$-smooth with $\mu > 0$, so for any $w_1, w_2$:
$$(\nabla F_i(w_1) - \nabla F_i(w_2))^\top (w_1 - w_2) \ge \mu\|w_1 - w_2\|^2,$$
$$\|\nabla F_i(w_1) - \nabla F_i(w_2)\| \le L\|w_1 - w_2\|.$$

Then $F(w)$ and $F^U(w)$ are $L$-smooth and $\mu$-strongly convex. Typical choices of $\eta_t$ are based on the smoothness and strong convexity parameters, so the same choices lead to convergence for both $w_t$ and $w^U_t$. For instance, GD over a strongly convex objective with fixed step size $\eta_t = \eta \le 2/[L+\mu]$ converges geometrically at rate $(L-\mu)/(L+\mu) < 1$. For simplicity, we will use a constant learning rate $\eta_t = \eta \le 2/[L+\mu]$.

We assume bounded gradients and Lipschitz Hessians, which are standard assumptions (Boyd & Vandenberghe, 2004; Bottou et al., 2016). The proofs may be relaxed to weak growth conditions; see the related work for references.

Assumption 3 (Bounded gradients). For any model parameter $w$ in the sequence $[w_0, w_1, w_2, \ldots, w_t, \ldots]$, the norm of the gradient at every sample is bounded by a constant $c_2$, i.e. for all $i, j$:
$$\|\nabla F_i(w_j)\| \le c_2.$$

Assumption 4 (Lipschitz Hessian). The Hessian $H(w)$ is Lipschitz continuous: there exists a constant $c_0$ such that for all $w_1$ and $w_2$,
$$\|H(w_1) - H(w_2)\| \le c_0\|w_1 - w_2\|.$$

An assumption specific to Quasi-Newton methods is the strong independence of the weight updates: the smallest singular value of the normalized weight updates is bounded away from zero (Ortega & Rheinboldt, 1970; Conn et al., 1991). This has sometimes been motivated empirically, as the iterates of certain Quasi-Newton iterations empirically satisfy it (Conn et al., 1988).

Assumption 5 (Strong independence). For any sequence $[\Delta w_{j_1}, \Delta w_{j_2}, \ldots, \Delta w_{j_m}]$, the matrix of normalized vectors
$$\Delta W_{j_1, j_2, \ldots, j_m} = [\Delta w_{j_1}, \Delta w_{j_2}, \ldots, \Delta w_{j_m}]/s_{j_1, j_m},$$
where $s_{j_1, j_m} = \max(\|\Delta w_{j_1}\|, \|\Delta w_{j_2}\|, \ldots, \|\Delta w_{j_m}\|)$, has its minimum singular value $\sigma_{\min}$ bounded away from zero. We have $\sigma_{\min}(\Delta W_{j_1, j_2, \ldots, j_m}) \ge c_1$, where $c_1$ is independent of $(j_1, j_2, \ldots, j_m)$.

Empirically, we find c1 around 0.2 for the MNIST datasetusing our default hyperparameters.
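In practice this constant can be monitored directly from the cached updates; a short NumPy check (illustrative, not a routine from the paper) is:

```python
import numpy as np

def strong_independence_constant(delta_w_list):
    """Smallest singular value of the normalized update matrix from Assumption 5."""
    W = np.stack(delta_w_list, axis=1)                        # p x m matrix of the updates
    W = W / max(np.linalg.norm(dw) for dw in delta_w_list)    # divide by the largest update norm
    return np.linalg.svd(W, compute_uv=False).min()           # sigma_min, to be compared with c_1
```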

2.3.1. RESULTS

Then our first main result is the convergence rate of the DeltaGrad algorithm.

Theorem 1 (Bound between true and incrementally updated iterates). For a large enough iteration counter $t$, the result $w^I_t$ of DeltaGrad (Algorithm 1) approximates the correct iteration values $w^U_t$ at the rate
$$\|w^U_t - w^I_t\| = o\Big(\frac{r}{n}\Big).$$
So $\|w^U_t - w^I_t\|$ is of a lower order than $r/n$.

The baseline error rate between the full model parameters $w_t$ and $w^U_t$ is expected to be of the order $r/n$, as can be seen from the example of the sample mean. This shows that DeltaGrad has a better convergence rate for approximating $w^U_t$. The proof is quite involved. It relies on a delicate analysis of the difference between the approximate Hessians $B_t$ and the true Hessians $H_t$ (see the Appendix, and specifically A.2).

2.4. Complexity analysis

We will do our complexity analysis assuming that the model is given by a computation graph. Suppose the number of model parameters is $p$ and the time complexity of forward propagation is $f(p)$. Then according to the Baur-Strassen theorem (Griewank & Walther, 2008), the time complexity of backpropagation in one step is at most $5f(p)$, and thus the total complexity to compute the derivatives for each training sample is $6f(p)$. In addition, the overhead of computing the product $B_{j_m}(w^I_t - w_t)$ is $O(m^3) + 6mp + p$ according to (Byrd et al., 1994), which means that the total time complexity at the steps where the gradients are approximated is $6rf(p) + O(m^3) + 6mp + p$ (the gradients of the $r$ removed/added samples are explicitly evaluated). This is more efficient than explicit computation of the gradients over the full batch (a time complexity of $6(n-r)f(p)$) when $r \ll n$.

Suppose there are $T$ iterations in the training process. Then the running time of BaseL will be $6(n-r)f(p)T$. DeltaGrad evaluates the gradients exactly for the first $j_0$ iterations and once every $T_0$ iterations afterwards. So its total running time is
$$6(n-r)f(p)\cdot\frac{T-j_0}{T_0} + \big(6rf(p) + O(m^3) + 6mp + p\big)\cdot\Big(1 - \frac{1}{T_0}\Big)(T - j_0),$$
which is close to
$$6nf(p)\cdot\frac{T-j_0}{T_0} + \big(O(m^3) + 6mp + p\big)\cdot\Big(1 - \frac{1}{T_0}\Big)(T - j_0)$$
since $r$ is small. Also, when $n$ is large, the overhead of the approximate computation, i.e. $O(m^3) + 6mp + p$, should be much smaller than that of the explicit computation. Thus speed-ups of a factor of $T_0$ are expected when $j_0$ is far smaller than $T$.
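To get a feel for the orders of magnitude, the following back-of-the-envelope script plugs purely hypothetical sizes into the two running-time expressions above (it is not one of the experiments reported later):

```python
# Illustrative arithmetic only: hypothetical dataset/model sizes plugged into the expressions above.
n, r, p, m = 60_000, 60, 1_000, 2     # samples, deletions, parameters, history size (hypothetical)
f = lambda p: p                       # assume one forward pass costs on the order of p operations
T, T0, j0 = 500, 5, 10

baseL_time  = 6 * (n - r) * f(p) * T
exact_time  = 6 * (n - r) * f(p) * (T - j0) / T0
approx_time = (6 * r * f(p) + m**3 + 6 * m * p + p) * (1 - 1 / T0) * (T - j0)

print(baseL_time / (exact_time + approx_time))   # roughly 5 here, i.e. close to the factor T0
```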

3. Extension to SGD

Consider now mini-batch stochastic gradient descent:
$$w^S_{t+1} = w^S_t - \frac{\eta}{B}\sum_{i \in B_t}\nabla F_i(w^S_t).$$

The naive solution for retraining the model is:
$$w^{U,S}_{t+1} = w^{U,S}_t - \frac{\eta}{B - \Delta B_t}\sum_{i \in B_t,\, i \notin R}\nabla F_i(w^{U,S}_t).$$

Here $\Delta B_t$ is the size of the subset removed from the $t$-th minibatch. If $B - \Delta B_t = 0$, then we do not change the parameters at that iteration. DeltaGrad can be naturally extended to this case:
$$w^{I,S}_{t+1} - w^{I,S}_t = -\frac{\eta_t}{B - \Delta B_t}\cdot
\begin{cases}
\sum_{i \in B_t,\, i \notin R} \nabla F_i(w^{I,S}_t), & t \bmod T_0 = 0 \ \text{or}\ t \le j_0,\\[4pt]
B\Big[B_{j_m}(w^{I,S}_t - w^S_t) + \frac{1}{B}\sum_{i \in B_t}\nabla F_i(w^S_t)\Big] - \sum_{i \in B_t,\, i \in R} \nabla F_i(w^{I,S}_t), & \text{else,}
\end{cases}$$
which relies on a series of historical observations: $\Delta w^S_j = w^{I,S}_j - w^S_j$, $\Delta g^S_j = \frac{1}{B}\sum_{i \in B_j}\nabla F_i(w^{I,S}_j) - \frac{1}{B}\sum_{i \in B_j}\nabla F_i(w^S_j)$.
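A minimal sketch of one naive minibatch retraining step, including the rule of skipping an iteration when $B - \Delta B_t = 0$, could look like this (the gradient callback and names are illustrative, not the released code):

```python
import numpy as np

def naive_sgd_retrain_step(w, batch_idx, R, grad_fi, eta):
    """One step of the naive minibatch retraining above: drop the removed indices R from the
    cached minibatch B_t and rescale by B - Delta_B_t; leave w unchanged if the batch empties."""
    keep = np.setdiff1d(np.asarray(batch_idx), np.asarray(list(R)))
    if len(keep) == 0:                  # B - Delta_B_t = 0
        return w
    grad = sum(grad_fi(w, i) for i in keep) / len(keep)
    return w - eta * grad
```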

3.1. Convergence rate for strongly convex objectives

Recall that $B$ is the mini-batch size, $p$ is the total number of model parameters and $T$ is the number of iterations in SGD. Our main result for SGD is the following.

Theorem 2 (SGD bound for DeltaGrad). With probability at least
$$1 - T\cdot\Bigg[2p\exp\Bigg(-\frac{\log(2p)\sqrt{B}}{4 + \frac{2}{3}\big(\frac{\log^2(2p)}{B}\big)^{1/4}}\Bigg) + (p+1)\exp\Bigg(-\frac{\log(p+1)\sqrt{B}}{4 + \frac{2}{3}\big(\frac{(\log(p+1))^2}{B}\big)^{1/4}}\Bigg) + 2\exp\big(-2\sqrt{B}\big)\Bigg],$$
the result $w^{I,S}_t$ of Algorithm 1 approximates the correct iteration values $w^{U,S}_t$ at the rate
$$\|w^{U,S}_t - w^{I,S}_t\| = o\Big(\frac{r}{n} + \frac{1}{B^{1/4}}\Big).$$

Thus, when $B$ is large and $r/n$ is small, our algorithm accurately approximates the correct iteration values.

Its proof is in the Appendix (Section A.3).

4. Experiments

4.1. Experimental setup

Datasets. We used four datasets for evaluation: MNIST (LeCun et al., 1998), covtype (Blackard & Dean, 1999), HIGGS (Baldi et al., 2014) and RCV1 (Lewis et al., 2004)¹. MNIST contains 60,000 images as the training dataset and 10,000 images as the test dataset; each image has 28×28 features (pixels), containing one digit from 0 to 9. The covtype dataset consists of 581,012 samples with 54 features, each of which may come from one of seven forest cover types; as a test dataset, we randomly picked 10% of the data. HIGGS is a dataset produced by Monte Carlo simulations for binary classification, containing 21 features with 11,000,000 samples in total; 500,000 samples are used as the test dataset. RCV1 is a corpus dataset; we use its binary version, which consists of 679,641 samples and 47,236 features, of which the first 20,242 samples are used for training.

Machine configuration. All experiments are run on a GPU machine with one Intel(R) Core(TM) i9-9920X CPU with 128 GB DRAM and 4 GeForce 2080 Titan RTX GPUs (each GPU has 10 GB DRAM). We implemented DeltaGrad with PyTorch 1.3 and used one GPU for accelerating the tensor computations.

Deletion/Addition benchmark. We run regularized logistic regression over the four datasets with L2 regularization coefficient 0.005 and a fixed learning rate of 0.1. The mini-batch sizes for RCV1 and the other three datasets are 16384 and 10200 respectively (recall that RCV1 only has around 20k training samples). We also evaluated our approach on a two-layer neural network with 300 hidden ReLU neurons over MNIST. There, L2 regularization with rate 0.001 is added, along with a decaying learning rate (the first 10 iterations with learning rate 0.2 and the rest with learning rate 0.1) and with deterministic GD. There are no strong convexity or smoothness guarantees for DNNs. Therefore, we adjusted Algorithm 1 to fit general DNN models (see Algorithm 4 in Appendix C.3). In Algorithm 1, we assume that convexity holds locally where we use the L-BFGS algorithm to estimate the gradients. For all the other regions, we explicitly evaluate the gradients. The details on how to check which regions satisfy convexity for DNN models can be found in Algorithm 4. We also explore the use of DeltaGrad for more complicated neural network models such as ResNet by reusing and fixing the pre-trained parameters in all but the last layer during the training phase, presented in detail in Appendix D.4.

¹We used its binary version from LIBSVM: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#rcv1.binary
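For reference, a minimal PyTorch sketch of the regularized logistic regression configuration described above could look as follows; the feature dimension and the surrounding training loop are illustrative, not the released experiment code.

```python
import torch

n_features, n_classes = 47_236, 2   # e.g. the RCV1 binary task; placeholder sizes
model = torch.nn.Linear(n_features, n_classes)       # logistic regression as a linear layer + CE loss
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, weight_decay=0.005)  # L2 coefficient 0.005
loss_fn = torch.nn.CrossEntropyLoss()

def train_step(x_batch, y_batch):
    optimizer.zero_grad()
    loss = loss_fn(model(x_batch), y_batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```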

We evaluate two cases of addition/deletion: batch and online. Multiple samples are grouped together for addition and deletion in the former, while samples are removed one after another in the latter. Algorithm 1 is slightly modified to fit the online deletion/addition cases (see Algorithm 3 in Appendix C.2). In what follows, unless explicitly specified, Algorithm 1 and Algorithm 3 are used for experiments in the batch addition/deletion case and the online addition/deletion case respectively.

To simulate deleting training samples, w∗ is evaluated over the full training dataset of n samples, which is followed by the random removal of r samples and evaluation over the remaining n − r samples using BaseL or DeltaGrad. To simulate adding training samples, r samples are deleted first. After w∗ is evaluated over the remaining n − r samples, the r samples are added back to the training set for updating the model. The ratio of r to the total number of training samples n is called the Delete rate and the Add rate for the two scenarios, respectively.

Throughout the experiments, the running time of BaseL and DeltaGrad to update the model parameters is recorded. To show the difference between wU∗ (the output of BaseL, and the correct model parameters after deletion or addition) and wI∗ (the output of DeltaGrad), we compute the ℓ2-norm or distance ‖wU∗ − wI∗‖. For comparison, and to justify the theory in Section 2.3, ‖w∗ − wU∗‖ is also recorded (w∗ are the parameters trained over the full training data). Given the same set of added or deleted samples, the experiments are repeated 10 times, with different minibatch randomness each time. After the model updates, wU∗ and wI∗ are evaluated over the test dataset and their prediction performance is reported.

Hyperparameter setup. We set T0 (the period of explicit gradient updates) and j0 (the length of the initial “burn-in”) as follows. For regularized logistic regression, we set T0 = 10, j0 = 10 for RCV1, T0 = 5, j0 = 10 for MNIST and covtype, and T0 = 3, j0 = 300 for HIGGS. For the 2-layer DNN, T0 = 2 is even smaller, and the first quarter of the iterations are used as “burn-in”. The history size m is 2 for all experiments. The effect of the hyperparameters and suggestions on how to choose them are discussed in Appendix D.2.

4.2. Experimental results

4.2.1. BATCH ADDITION/DELETION.

To test the robustness and efficiency of DeltaGrad in batch deletion, we vary the Delete and Add rates from 0 to 0.01. The first three sub-figures in Figures 2 and 3, along with Figure 1, show the running time of BaseL and DeltaGrad (blue and red dotted lines, resp.) and the two distances, ‖wU∗ − w∗‖ and ‖wU∗ − wI∗‖ (blue and red solid lines, resp.), over the four datasets using regularized logistic regression. The results for the 2-layer DNN over MNIST are presented in the last sub-figures of Figures 2 and 3, denoted by MNISTn.

The running time of BaseL and DeltaGrad is almost constant regardless of the delete or add rate, confirming the time complexity analysis of DeltaGrad in Section 2.4: the theoretical running time is free of the number of removed samples r when r is small. For any given delete/add rate, DeltaGrad achieves significant speed-ups (up to 2.6x for MNIST, 2x for covtype, 1.6x for HIGGS, 6.5x for RCV1) compared to BaseL. On the other hand, the distance between wU∗ and wI∗ is quite small; it is less than 0.0001 even when up to 1% of samples are removed or added. When the delete or add rate is close to 0, ‖wU∗ − wI∗‖ is of magnitude 10−6 (10−8 for RCV1), indicating that the approximation error of wI∗ is negligible. Also, ‖wU∗ − wI∗‖ is at least one order of magnitude smaller than ‖wU∗ − w∗‖, confirming our theoretical analysis comparing the bound on ‖wU∗ − wI∗‖ to that on ‖wU∗ − w∗‖.

To investigate whether the tiny difference between wU∗ and wI∗ leads to any difference in prediction behavior, the prediction accuracy using wU∗ and wI∗ is presented in Table 1. Due to space limitations, only results for a very small (0.005%) and the largest (1%) add/delete rate are presented. Due to the randomness in SGD, the standard deviation of the prediction accuracy is also reported. In most cases, the models produced by BaseL and DeltaGrad end up with effectively the same prediction power. In the few cases where the prediction results of wU∗ and wI∗ are not exactly the same (e.g. Add (1%) over MNIST), their confidence intervals overlap, so that statistically wU∗ and wI∗ provide the same prediction results.

For the 2-layer neural network model, where strong convexity does not hold, we use the variant of DeltaGrad mentioned above, i.e. Algorithm 4. See the last sub-figures in Figures 2 and 3. The figures show that DeltaGrad achieves about 1.4x speedup compared to BaseL while maintaining a relatively small difference between wI∗ and wU∗. This suggests that it may be possible to extend our analysis for DeltaGrad beyond strong convexity; this is left for future work.


Figure 2. Running time and distance with varied add rate

Figure 3. Running time and distance with varied delete rate

Table 1. Prediction accuracy of BaseL and DeltaGrad with batch addition/deletion. MNISTn refers to MNIST with a neural net.

                   Dataset   BaseL (%)          DeltaGrad (%)
Add (0.005%)       MNIST     87.530 ± 0.0025    87.530 ± 0.0025
                   MNISTn    92.340 ± 0.002     92.340 ± 0.002
                   covtype   62.991 ± 0.0027    62.991 ± 0.0027
                   HIGGS     55.372 ± 0.0002    55.372 ± 0.0002
                   RCV1      92.222 ± 0.00004   92.222 ± 0.00004
Add (1%)           MNIST     87.540 ± 0.0011    87.542 ± 0.0011
                   MNISTn    92.397 ± 0.001     92.397 ± 0.001
                   covtype   63.022 ± 0.0008    63.022 ± 0.0008
                   HIGGS     55.381 ± 0.0007    55.380 ± 0.0007
                   RCV1      92.233 ± 0.00010   92.233 ± 0.00010
Delete (0.005%)    MNIST     86.272 ± 0.0035    86.272 ± 0.0035
                   MNISTn    92.203 ± 0.004     92.203 ± 0.004
                   covtype   62.966 ± 0.0017    62.966 ± 0.0017
                   HIGGS     52.950 ± 0.0001    52.950 ± 0.0001
                   RCV1      92.241 ± 0.00004   92.241 ± 0.00004
Delete (1%)        MNIST     86.082 ± 0.0046    86.074 ± 0.0048
                   MNISTn    92.373 ± 0.003     92.370 ± 0.003
                   covtype   62.943 ± 0.0007    62.943 ± 0.0007
                   HIGGS     52.975 ± 0.0002    52.975 ± 0.0002
                   RCV1      92.203 ± 0.00007   92.203 ± 0.00007


4.2.2. ONLINE ADDITION/DELETION.

To simulate deletion and addition requests arriving continuously over the training data in an online setting, 100 randomly selected samples are added or deleted sequentially. Each request triggers model updates by either BaseL or DeltaGrad. The running time comparison between the two approaches in this experiment is presented in Figure 4, which shows that DeltaGrad is about 2.5x, 2x, 1.8x and 6.5x faster than BaseL on MNIST, covtype, HIGGS and RCV1 respectively. The accuracy comparison is shown in Table 2. There is essentially no prediction performance difference between wU∗ and wI∗.

Discussion. Comparing the speed-ups brought by DeltaGrad with the choice of T0, we found that the theoretical speed-ups are not fully achieved. One reason is that the approximate L-BFGS computation involves a series of small matrix multiplications, whose computation on GPU vs. CPU cannot bring very significant speed-ups compared to larger matrix operations², which indicates that the overhead of L-BFGS is non-negligible compared to gradient computation. Besides, although r is far smaller than n, when computing the gradients over the r samples, other overhead becomes more significant: copying data from CPU DRAM to GPU DRAM, the time to launch the kernel on the GPU, etc. This leads to a non-negligible explicit gradient computation cost over the r samples. It would be interesting to explore how to adjust DeltaGrad to fully utilize the computational power of the GPU in the future.

²See the matrix computation benchmark on GPU with varied matrix sizes: https://developer.nvidia.com/cublas


Table 2. Distance and prediction performance of BaseL and DeltaGrad in online deletion/addition

Dataset              ‖wU∗ − w∗‖   ‖wI∗ − wU∗‖   BaseL accuracy (%)   DeltaGrad accuracy (%)
MNIST (Addition)     5.7 × 10−3   2 × 10−4      87.548 ± 0.0002      87.548 ± 0.0002
MNIST (Deletion)     5.0 × 10−3   1.4 × 10−4    87.465 ± 0.002       87.465 ± 0.002
covtype (Addition)   8.0 × 10−3   2.0 × 10−5    63.054 ± 0.0007      63.054 ± 0.0007
covtype (Deletion)   7.0 × 10−3   2.0 × 10−5    62.836 ± 0.0002      62.836 ± 0.0002
HIGGS (Addition)     2.1 × 10−5   1.4 × 10−6    55.303 ± 0.0003      55.303 ± 0.0003
HIGGS (Deletion)     2.5 × 10−5   1.7 × 10−6    55.333 ± 0.0008      55.333 ± 0.0008
RCV1 (Addition)      0.0122       3.6 × 10−6    92.255 ± 0.0003      92.255 ± 0.0003
RCV1 (Deletion)      0.0119       3.5 × 10−6    92.229 ± 0.0006      92.229 ± 0.0006


Figure 4. Running time comparison of BaseL and DeltaGrad with 100 continuous deletions/additions

Other experiments with DeltaGrad are in the Appendix (Section D): evaluations with larger delete rates (i.e. when r ≪ n may not hold), comparisons with state-of-the-art work, and studies on the effect of mini-batch sizes, hyperparameters, etc.

5. Applications

Our algorithm has many applications, including privacy-related data deletion, continuous model updating, robustness, bias reduction, and uncertainty quantification (predictive inference). Some of these applications are quite direct, and so due to space limitations we only briefly describe them. Some initial experimental results on how our method can accelerate some of these applications, such as robust learning, are included in Appendix D.5.

5.1. Privacy related data deletion

By adding a bit of noise one can often guarantee differential privacy, the impossibility of distinguishing the presence or absence of a datapoint from the output of an algorithm (Dwork et al., 2014). We leverage and slightly extend a closely related notion, approximate data deletion (Ginart et al., 2019), to guarantee private deletion.

We will consider learning algorithms $A$ that take as input a dataset $D$ and output a model $A(D)$ in the hypothesis space $\mathcal{H}$. With the $i$-th sample removed, the resulting model is thus $A(D_{-i})$. A data deletion operation $R_A$ maps $D$, $A(D)$ and the index of the removed sample $i$ to the model $R_A(D, A(D), i)$. We call $R_A$ an $\varepsilon$-approximate deletion if for all $D$ and measurable subsets $S \subset \mathcal{H}$:
$$\left|\log\frac{P(A(D_{-i}) \in S \mid D_{-i})}{P(R_A(D, A(D), i) \in S \mid D_{-i})}\right| \le \varepsilon.$$

Here, if either of the two probabilities is zero, the other must be zero too. Using the standard Laplace mechanism (Dwork et al., 2014), we can make the output of our algorithm an $\varepsilon$-approximate deletion. We add independent Laplace$(\delta/\varepsilon)$ noise to each coordinate of $w^*$, $w^{U*}$ and $w^{I*}$, where
$$\delta = \frac{\sqrt{p}\,A\,M_1^2\,r^2}{\eta\left(\frac{1}{2}\mu - \frac{r}{n-r}\mu - \frac{c_0 M_1 r}{2n}\right)^2 (n-r)(n/2-r)}$$
is an upper bound on $p^{1/2}\|w^{U*} - w^{I*}\|$. See the appendix for details.
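A minimal sketch of this noise-addition step (the standard Laplace mechanism, with illustrative names) is:

```python
import numpy as np

def epsilon_approximate_release(w, delta, eps, rng=None):
    """Add independent Laplace(delta/eps) noise to each coordinate of w, where delta is the
    bound above on p^(1/2) * ||w_U* - w_I*||.  Illustrative helper, not the released code."""
    rng = np.random.default_rng() if rng is None else rng
    return w + rng.laplace(loc=0.0, scale=delta / eps, size=w.shape)
```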

5.2. Continuous model updating

Continuous model updating is a direct application. In many cases, machine learning models run in production need to be retrained on newly acquired data. DeltaGrad can be used to update the models. Similarly, if there are changes in the data, then we can run DeltaGrad twice: first to remove the original data, then to add the changed data.

5.3. Robustness

Our method has applications to robust statistical learning. The basic idea is that we can identify outliers by fitting a preliminary model. Then we can prune them and re-fit the model. Methods based on this idea are among the most statistically efficient ones for certain problems; see e.g. the review by Yu & Yao (2017).

5.4. Data valuation

Our method can also be used to evaluate the importance of training samples (see Cook (1977) and follow-up works such as Ghorbani & Zou (2019)). One common method to do this is the leave-one-out test, i.e. comparing the difference of the model parameters before and after the deletion of one single training sample of interest. Our method is thus useful to speed up evaluating the model parameters after the deletion operations.


5.5. Bias reduction

Our algorithms can be used directly to speed up existing techniques for bias correction. There are many different techniques based on subsampling (Politis et al., 1999). A basic one is the jackknife (Quenouille, 1956). Suppose we have an estimator $f_n$ computed based on $n$ training datapoints, and defined for both $n$ and $n-1$. The jackknife bias-correction is $f_{\mathrm{jack}} = f_n - \hat b(f_n)$, where $\hat b(f_n)$ is the jackknife estimator of the bias $b(f_n)$ of the estimator $f_n$. This is constructed as
$$\hat b(f_n) = (n-1)\left(n^{-1}\sum_{i=1}^{n} f_{-i} - f_n\right),$$
where $f_{-i}$ is the estimator $f_{n-1}$ computed on the training data with the $i$-th datapoint removed. Our algorithm can be used to recompute the estimator on all subsets of size $n-1$ of the training data. To validate that this works, a good example may be logistic regression with $n$ not much larger than $p$, which will have bias (Sur & Candes, 2018).
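A minimal sketch of the jackknife bias correction, where each leave-one-out refit is exactly the step DeltaGrad accelerates, is given below; the estimator callback is an illustrative placeholder.

```python
import numpy as np

def jackknife_bias_corrected(estimator, data):
    """Jackknife bias correction (Quenouille, 1956): recompute the estimator on every
    leave-one-out dataset and subtract the estimated bias.  estimator maps an (n, d)
    array to a scalar or vector estimate (illustrative signature)."""
    n = len(data)
    f_n = estimator(data)
    loo = np.array([estimator(np.delete(data, i, axis=0)) for i in range(n)])
    bias_hat = (n - 1) * (loo.mean(axis=0) - f_n)
    return f_n - bias_hat
```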

5.6. Uncertainty quantification / Predictive inference

Our algorithm has applications to uncertainty quantification and predictive inference. These are fundamental problems of wide applicability. Techniques based on conformal prediction (e.g., Shafer & Vovk, 2008) rely on retraining models on subsets of the data. As an example, in cross-conformal prediction (Vovk, 2015) we have a predictive model $f$ that can be trained on any subset of the data. We can split the data into $K$ subsets $S_1, \ldots, S_K$ of roughly equal size. We can train $f_{-S_k}$ on the data excluding $S_k$, and compute the cross-validation residuals $R_i = |y_i - f_{-S_k}(x_i)|$ for $i \in S_k$. Then, for a test datapoint $x_{n+1}$, we form a prediction set $C(x_{n+1})$ with all $y$ overlapping $n - (1-\alpha)(n+1)$ of the intervals $f_{-S_k}(x_{n+1}) \pm R_i$. This forms a valid $\beta = 1 - 2\alpha - 2K/n$ level prediction set in the sense that $P(y_{n+1} \in C(x_{n+1})) \ge \beta$ over the randomness in all samples. The “best” (shortest) intervals arise for large $K$, which means a small number of samples is removed to find $f_{-S_k}$. Thus our algorithm is applicable.
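A minimal sketch of cross-conformal prediction along these lines is shown below; the $K$ leave-fold-out refits are where DeltaGrad would be applied, and the `fit` callback and fold handling are illustrative, not the released code.

```python
import numpy as np

def cross_conformal_set(fit, X, y, x_new, y_grid, K=5, alpha=0.1):
    """Cross-conformal prediction sketch (Vovk, 2015): train leaving out each fold S_k,
    compute residuals R_i on the held-out fold, and keep every candidate y covered by
    enough intervals f_{-S_k}(x_new) +/- R_i.  fit(X, y) returns a prediction function."""
    n = len(y)
    folds = np.array_split(np.random.default_rng(0).permutation(n), K)
    centers, radii = [], []
    for fold in folds:
        keep = np.setdiff1d(np.arange(n), fold)
        f_k = fit(X[keep], y[keep])                     # model trained without fold S_k
        preds = np.array([f_k(X[i]) for i in fold])     # predictions on the held-out fold
        centers.extend([f_k(x_new)] * len(fold))        # interval centers f_{-S_k}(x_new)
        radii.extend(np.abs(y[fold] - preds))           # residuals R_i, i in S_k
    centers, radii = np.array(centers), np.array(radii)
    covered = np.abs(y_grid[:, None] - centers[None, :]) <= radii[None, :]
    return y_grid[covered.sum(axis=1) >= n - (1 - alpha) * (n + 1)]
```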

6. Conclusion

In this work, we developed the efficient DeltaGrad algorithm for retraining after slight changes (deletions/additions) of the training dataset, by differentiating the optimization path with a Quasi-Newton method. DeltaGrad provably approximates the result of retraining from scratch, with an error of lower order than the fraction of changed data. Its performance advantage has been empirically demonstrated on several medium-scale public datasets, revealing its great potential for constructing data deletion/addition machine learning systems for various applications. The code for replicating our experiments is available on GitHub: https://github.com/thuwuyinjun/DeltaGrad. Adjusting DeltaGrad to handle smaller mini-batch sizes in SGD and more complicated ML models without strong convexity and smoothness guarantees is important future work.


Acknowledgements

This material is based upon work that is in part supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001117C0047. Partial support was provided by NSF Awards 1547360 and 1733794.

References

Amari, S. A theory of adaptive pattern classifiers. IEEE Transactions on Electronic Computers, (3):299–307, 1967.

Baldi, P., Sadowski, P., and Whiteson, D. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5:4308, 2014.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming, volume 5. Athena Scientific, Belmont, MA, 1996.

Birattari, M., Bontempi, G., and Bersini, H. Lazy learning meets the recursive least squares algorithm. In Advances in Neural Information Processing Systems, pp. 375–381, 1999.

Blackard, J. A. and Dean, D. J. Comparative accuracies of artificial neural networks and discriminant analysis in predicting forest cover types from cartographic variables. Computers and Electronics in Agriculture, 24(3):131–151, 1999.

Bottou, L. Online learning and stochastic approximations. On-line Learning in Neural Networks, 17(9):142, 1998.

Bottou, L. Stochastic learning. In Summer School on Machine Learning, pp. 146–168. Springer, 2003.

Bottou, L. Large-scale machine learning with stochastic gradient descent. In Proceedings of COMPSTAT'2010, pp. 177–186. Springer, 2010.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. arXiv preprint arXiv:1606.04838, 2016.

Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.

Bourtoule, L., Chandrasekaran, V., Choquette-Choo, C., Jia, H., Travers, A., Zhang, B., Lie, D., and Papernot, N. Machine unlearning. arXiv preprint arXiv:1912.03817, 2019.

Bousquet, O. and Bottou, L. The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems, pp. 161–168, 2008.

Boyd, S. and Vandenberghe, L. Convex optimization. Cambridge University Press, 2004.

Byrd, R. H., Nocedal, J., and Schnabel, R. B. Representations of quasi-Newton matrices and their use in limited memory methods. Mathematical Programming, 63(1-3):129–156, 1994.

Byrd, R. H., Lu, P., Nocedal, J., and Zhu, C. A limited memory algorithm for bound constrained optimization. SIAM Journal on Scientific Computing, 16(5):1190–1208, 1995.

Cao, Y. and Yang, J. Towards making systems forget with machine unlearning. In 2015 IEEE Symposium on Security and Privacy, pp. 463–480. IEEE, 2015.

Cauwenberghs, G. and Poggio, T. Incremental and decremental support vector machine learning. In Advances in Neural Information Processing Systems, pp. 409–415, 2001.

Chaudhuri, K. and Monteleoni, C. Privacy-preserving logistic regression. In Advances in Neural Information Processing Systems, pp. 289–296, 2009.

Conn, A. R., Gould, N. I., and Toint, P. L. Testing a class of methods for solving minimization problems with simple bounds on the variables. Mathematics of Computation, 50(182):399–430, 1988.

Conn, A. R., Gould, N. I., and Toint, P. L. Convergence of quasi-Newton matrices generated by the symmetric rank one update. Mathematical Programming, 50(1-3):177–195, 1991.

Cook, R. D. Detection of influential observation in linear regression. Technometrics, 19(1):15–18, 1977.

Doshi-Velez, F. and Kim, B. A roadmap for a rigorous science of interpretability. arXiv preprint arXiv:1702.08608, 150, 2017.

Dwork, C., Roth, A., et al. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3–4):211–407, 2014.

European Union, C. o. Council Regulation (EU) No 2016/679. 2016. URL https://eur-lex.europa.eu/legal-content/EN/ALL/?uri=CELEX:02016R0679-20160504.

Ghorbani, A. and Zou, J. Data Shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning, pp. 2242–2251, 2019.

Ginart, A., Guan, M., Valiant, G., and Zou, J. Y. Making AI forget you: Data deletion in machine learning. In Advances in Neural Information Processing Systems, pp. 3513–3526, 2019.

Gladyshev, E. On stochastic approximation. Theory of Probability & Its Applications, 10(2):275–278, 1965.

Gorbunov, E., Hanzely, F., and Richtarik, P. A unified theory of SGD: Variance reduction, sampling, quantization and coordinate descent. arXiv preprint arXiv:1905.11261, 2019.

Gower, R. M., Loizou, N., Qian, X., Sailanbayev, A., Shulgin, E., and Richtarik, P. SGD: General analysis and improved rates. arXiv preprint arXiv:1901.09401, 2019.

Griewank, A. and Walther, A. Evaluating derivatives: principles and techniques of algorithmic differentiation, volume 105. SIAM, 2008.

Guo, C., Goldstein, T., Hannun, A., and van der Maaten, L. Certified data removal from machine learning models. arXiv preprint arXiv:1911.03030, 2019.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Horn, R. A. and Johnson, C. R. Matrix analysis. Cambridge University Press, 2012.

Karimi, H., Nutini, J., and Schmidt, M. Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 795–811. Springer, 2016.

Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pp. 1885–1894. JMLR.org, 2017.

Krishnan, S. and Wu, E. PALM: Machine learning explanations for iterative debugging. In Proceedings of the 2nd Workshop on Human-In-the-Loop Data Analytics, pp. 4. ACM, 2017.

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. 2009.

Kul'chitskiy, O. Y. and Mozgovoy, A. Estimation of convergence rate for robust identification algorithms. International Journal of Adaptive Control and Signal Processing, 6(3):247–251, 1992.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

Lewis, D. D., Yang, Y., Rose, T. G., and Li, F. RCV1: A new benchmark collection for text categorization research. Journal of Machine Learning Research, 5(Apr):361–397, 2004.

Matthies, H. and Strang, G. The solution of nonlinear finite element equations. International Journal for Numerical Methods in Engineering, 14(11):1613–1626, 1979.

Mokhtari, A. and Ribeiro, A. Global convergence of online limited memory BFGS. The Journal of Machine Learning Research, 16(1):3151–3181, 2015.

Moulines, E. and Bach, F. R. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems, pp. 451–459, 2011.

Nocedal, J. Updating quasi-Newton matrices with limited storage. Mathematics of Computation, 35(151):773–782, 1980.

Nocedal, J. and Wright, S. Numerical optimization. Springer Science & Business Media, 2006.

Oliveira, R. I. Concentration of the adjacency matrix and of the Laplacian in random graphs with independent edges. arXiv preprint arXiv:0911.0600, 2009.

Ortega, J. M. and Rheinboldt, W. C. Iterative solution of nonlinear equations in several variables, volume 30. SIAM, 1970.

Politis, D. N., Romano, J. P., and Wolf, M. Subsampling. Springer Science & Business Media, 1999.

Quenouille, M. H. Notes on bias in estimation. Biometrika, 43(3/4):353–360, 1956.

Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.

Schelter, S. Amnesia – towards machine learning models that can forget user data very fast. In 1st International Workshop on Applied AI for Database Systems and Applications (AIDB19), 2019.

Shafer, G. and Vovk, V. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(Mar):371–421, 2008.

Steinhardt, J., Koh, P. W. W., and Liang, P. S. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems, pp. 3517–3529, 2017.

Sur, P. and Candes, E. J. A modern maximum-likelihood theory for high-dimensional logistic regression. arXiv preprint arXiv:1803.06964, 2018.

Syed, N. A., Huan, S., Kah, L., and Sung, K. Incremental learning with support vector machines. 1999.

Tropp, J. A. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.

Tropp, J. A. The expected norm of a sum of independent random matrices: An elementary approach. In High Dimensional Probability VII, pp. 173–202. Springer, 2016.

Vovk, V. Cross-conformal predictors. Annals of Mathematics and Artificial Intelligence, 74(1-2):9–28, 2015.

Wu, Y., Tannen, V., and Davidson, S. B. PrIU: A provenance-based approach for incrementally updating regression models. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 447–462, 2020.

Yu, C. and Yao, W. Robust linear regression: A review and comparison. Communications in Statistics - Simulation and Computation, 46(8):6261–6282, 2017.

Zhang, T. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In Proceedings of the Twenty-First International Conference on Machine Learning, pp. 116. ACM, 2004.

Zhu, C., Byrd, R. H., Lu, P., and Nocedal, J. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software (TOMS), 23(4):550–560, 1997.


Appendix for DeltaGrad: Rapid retraining of machine learning models

Contents

A Mathematical details
  A.1 Additional notes on setup, preliminaries
    A.1.1 Classical results on GD convergence, SGD convergence
    A.1.2 Notations for DeltaGrad with SGD
    A.1.3 Classical results for random variables
  A.2 Results for deterministic gradient descent
    A.2.1 Quasi-Newton
    A.2.2 Proof that Quasi-Hessians are well-conditioned
    A.2.3 Proof preliminaries
    A.2.4 Main recursions
    A.2.5 Proof of Theorem 4
    A.2.6 Proof of Theorem 5
    A.2.7 Proof of Theorem 6
    A.2.8 Proof of Theorem 7
  A.3 Results for stochastic gradient descent
    A.3.1 Quasi-Newton
    A.3.2 Proof preliminaries
    A.3.3 Main recursions
    A.3.4 Proof of Theorem 10
    A.3.5 Proof of Theorem 11
    A.3.6 Proof of Theorem 12
    A.3.7 Proof of Theorem 13
B Details on applications
  B.1 Privacy related data deletion
C Supplementary algorithm details
  C.1 Extension of DeltaGrad for stochastic gradient descent
  C.2 Extension of DeltaGrad for online deletion/addition
    C.2.1 Convergence rate analysis for online gradient descent version of DeltaGrad
  C.3 Extension of DeltaGrad for non-strongly convex, non-smooth objective functions
D Supplementary experiments
  D.1 Experiments with large deletion rate
  D.2 Influence of hyper-parameters on performance
  D.3 Comparison against the state-of-the-art work
  D.4 Experiments on large ML models
  D.5 Applications of DeltaGrad to robust learning

A. Mathematical details

The main result for DeltaGrad with GD is Theorem 7, proved in Section A.2.8.

A.1. Additional notes on setup, preliminaries

A.1.1. CLASSICAL RESULTS ON GD CONVERGENCE, SGD CONVERGENCE

Lemma 1 (GD convergence, folklore, e.g., (Boyd & Vandenberghe, 2004)). Gradient descent over a strongly convex objective function with fixed step size η_t = η ≤ 2/(L+µ) has an exponential convergence rate, i.e.:

F(w_t) − F(w^*) ≤ c^t (L/2) ‖w_0 − w^*‖²,  (S1)

where c := (L − µ)/(L + µ) < 1.

Recall also that the eigenvalues of the "contraction operator" I − η_t H(w) are bounded as follows.

Lemma 2 (Classical bound on eigenvalues of the "contraction operator"). Under the convergence conditions of gradient descent with fixed step size, i.e. η_t = η ≤ 2/(µ+L), the following inequality holds for any parameter w:

‖I − ηH(w)‖ ≤ 1.  (S2)

This lemma follows directly, because the eigenvalues of I − ηH are bounded as −1 ≤ 1 − ηL ≤ 1 − ηµ ≤ 1.
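As a quick numerical illustration of Lemma 2 (a minimal sketch; the dimension and the values of µ and L below are hypothetical, not taken from the paper), one can check the operator-norm bound directly:

```python
import numpy as np

# Hypothetical strongly convex setting: Hessian eigenvalues lie in [mu, L].
rng = np.random.default_rng(0)
mu, L, p = 0.5, 4.0, 20
Q, _ = np.linalg.qr(rng.standard_normal((p, p)))
H = Q @ np.diag(rng.uniform(mu, L, p)) @ Q.T      # symmetric, mu*I <= H <= L*I

eta = 2.0 / (mu + L)                              # admissible fixed step size
print(np.linalg.norm(np.eye(p) - eta * H, 2))     # spectral norm, at most 1
print((L - mu) / (L + mu))                        # contraction factor c of Lemma 1
```

The first printed value stays below the second, which is exactly the contraction factor behind the rate c in Lemma 1.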

Lemma 3 (SGD convergence, see e.g., (Bottou et al., 2018)). Suppose that the stochastic gradient estimates are correlated with the true gradient and bounded in the following way. There exist two scalars J_1 ≥ J_2 > 0 such that for an arbitrary mini-batch B_t the following two inequalities hold:

∇F(w_t)^T E[(1/B) Σ_{i∈B_t} ∇F_i(w_t)] ≥ J_2 ‖∇F(w_t)‖²,  (S3)

‖E[(1/B) Σ_{i∈B_t} ∇F_i(w_t)]‖ ≤ J_1 ‖∇F(w_t)‖.

Also, assume that for two scalars J_3, J_4 ≥ 0 we have:

Var((1/B) Σ_{i∈B_t} ∇F_i(w_t)) ≤ J_3 + J_4 ‖∇F(w_t)‖².  (S4)

By combining equations (S3)-(S4), the following inequality holds:

E‖(1/B) Σ_{i∈B_t} ∇F_i(w_t)‖² ≤ J_3 + J_5 ‖∇F(w_t)‖², where J_5 = J_4 + J_1² ≥ J_2² ≥ 0.

Then stochastic gradient descent with fixed step size η_t = η ≤ J_2/(L J_5) has the convergence rate:

E[F(w_t) − F(w^*)] ≤ ηLJ_3/(2µJ_2) + (1 − ηµJ_2)^{t−1} (F(w_1) − F(w^*) − ηLJ_3/(2µJ_2)) → ηLJ_3/(2µJ_2).

If the gradient estimates are unbiased, then E[(1/B) Σ_{i∈B_t} ∇F_i(w_t)] = (1/n) Σ_{i=1}^n ∇F_i(w_t) = ∇F(w_t) and thus J_1 = J_2 = 1. Moreover, J_3 ∼ 1/B, where B is the minibatch size, because J_3 bounds the variance of the stochastic gradient. So the convergence condition for a fixed step size becomes η_t = η ≤ 1/(L J_5), in which J_5 = J_4 + J_1² = J_4 + 1 ≥ 1. So η_t = η ≤ 1/(L J_5) ≤ 1/L suffices to ensure convergence.
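The noise-floor term ηLJ_3/(2µJ_2) of Lemma 3 can be seen on a toy problem. The following sketch uses a hypothetical objective with F_i(w) = ½‖w − a_i‖² (so µ = L = 1 and the minibatch gradient is unbiased); the distance to the optimum plateaus at a level that shrinks with the fixed step size:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, B = 1000, 5, 10
a = rng.standard_normal((n, p))
w_star = a.mean(axis=0)            # minimizer of F(w) = (1/n) sum_i 0.5*||w - a_i||^2

def final_distance(eta, iters=2000):
    w = np.zeros(p)
    for _ in range(iters):
        batch = rng.choice(n, size=B, replace=False)
        w -= eta * (w - a[batch].mean(axis=0))   # minibatch gradient of F
    return np.linalg.norm(w - w_star)

for eta in (0.5, 0.05, 0.005):
    # smaller eta -> smaller plateau, consistent with the eta*L*J3/(2*mu*J2) floor
    print(eta, np.mean([final_distance(eta) for _ in range(5)]))
```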

A.1.2. NOTATIONS FOR DELTAGRAD WITH SGD

The SGD parameters trained over the full dataset, explicitly trained over the remaining dataset, and incrementally trained over the remaining dataset are denoted by w^S, w^{U,S} and w^{I,S} respectively. Then given the mini-batch size B, the mini-batch B_t, the number of removed samples from each mini-batch ∆B_t and the set of removed samples R, the update rules for the three parameters are:

w^S_{t+1} = w^S_t − η (1/B) Σ_{i∈B_t} ∇F_i(w^S_t) = w^S_t − η G_{B,S}(w^S_t),  (S5)

w^{U,S}_{t+1} = w^{U,S}_t − η (1/(B−∆B_t)) Σ_{i∈B_t, i∉R} ∇F_i(w^{U,S}_t) = w^{U,S}_t − η G^U_{B−∆B,S}(w^{U,S}_t),  (S6)

w^{I,S}_{t+1} =
  w^{I,S}_t − (η/(B−∆B_t)) Σ_{i∈B_t, i∉R} ∇F_i(w^{I,S}_t),   if (t − j_0) mod T_0 = 0 or t ≤ j_0,
  w^{I,S}_t − (η/(B−∆B_t)) { B [B^S_{j_m}(w^{I,S}_t − w^S_t) + (1/B) Σ_{i∈B_t} ∇F_i(w^S_t)] − Σ_{i∈R, i∈B_t} ∇F_i(w^{I,S}_t) },   otherwise,  (S7)

in which G_{B,S}(w^S_t) and G^U_{B−∆B,S}(w^{U,S}_t) represent the average gradients over the minibatch B_t before and after removing samples.

We assume that the minibatch randomness of w^{U,S} and w^{I,S} is the same as that of w^S. Following Lemma 3, we assume that the gradient estimates of SGD are unbiased, i.e. E[(1/B) Σ_{i∈B_t} ∇F_i(w)] = (1/n) Σ_{i=1}^n ∇F_i(w) = ∇F(w) for any w, which implies:

E[(1/B) Σ_{i∈B_t} ∇F_i(w^S_t)] = (1/n) Σ_{i=1}^n ∇F_i(w^S_t) = ∇F(w^S_t),

E[(1/(B−∆B_t)) Σ_{i∈B_t, i∉R} ∇F_i(w^{U,S}_t)] = (1/(n−r)) Σ_{i∉R} ∇F_i(w^{U,S}_t) = ∇F^U(w^{U,S}_t).
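A minimal sketch of one incremental update following (S7), assuming a hypothetical helper quasi_hessian_vec(v) that returns B^S_{j_m} v (for instance the L-BFGS approximation of Algorithm 2 applied to cached differences); the function and argument names are illustrative, not the paper's code:

```python
import numpy as np

def deltagrad_sgd_step(w_I, w_S_cached, grad_S_cached_sum, batch_idx, removed,
                       grad_fn, quasi_hessian_vec, eta, exact_step):
    """One update of w^{I,S} following (S7).

    w_S_cached        : cached iterate w^S_t from the original training run
    grad_S_cached_sum : cached sum_{i in B_t} grad F_i(w^S_t)
    batch_idx, removed: indices of the minibatch B_t and the removed set R
    grad_fn(i, w)     : per-sample gradient grad F_i(w)
    quasi_hessian_vec : v -> B^S_{j_m} v (quasi-Hessian product from history)
    exact_step        : True when (t - j_0) mod T_0 == 0 or t <= j_0
    """
    kept = [i for i in batch_idx if i not in removed]
    removed_in_batch = [i for i in batch_idx if i in removed]
    B, dB = len(batch_idx), len(removed_in_batch)
    if exact_step:
        # explicit gradient over the remaining samples of the minibatch
        g = sum(grad_fn(i, w_I) for i in kept) / (B - dB)
        return w_I - eta * g
    # approximate the kept-sample gradient with the cached full-minibatch
    # gradient plus a quasi-Hessian correction, as in the second case of (S7)
    correction = B * (quasi_hessian_vec(w_I - w_S_cached) + grad_S_cached_sum / B)
    correction -= sum(grad_fn(i, w_I) for i in removed_in_batch)
    return w_I - eta / (B - dB) * correction
```

The point of the second branch is that only the (few) removed samples require fresh per-sample gradients; everything else is reconstructed from cached information.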

A.1.3. CLASSICAL RESULTS FOR RANDOM VARIABLES

To analyze DeltaGrad with SGD, Bernstein's inequality (Oliveira, 2009; Tropp, 2012; 2016) is necessary. Both its scalar version and its matrix version are stated below.

Lemma 4 (Bernstein's inequality for scalars). Consider a list of independent random variables S_1, S_2, ..., S_k satisfying E(S_i) = 0 and |S_i| ≤ J, and their sum Z = Σ_{i=1}^k S_i. Then the following inequality holds:

Pr(‖Z‖ ≥ x) ≤ exp( −x² / (Σ_{i=1}^k E(S_i²) + Jx/3) ),  for all x ≥ 0.

Lemma 5 (Bernstein's inequality for matrices). Consider a list of independent d_1 × d_2 random matrices S_1, S_2, ..., S_k satisfying E(S_i) = 0 and ‖S_i‖ ≤ J, and their sum Z = Σ_{i=1}^k S_i. Define the deterministic "variance surrogate":

V(Z) = max( ‖Σ_{i=1}^k E(S_iS_i^*)‖, ‖Σ_{i=1}^k E(S_i^*S_i)‖ ).  (S8)

Then the following inequalities hold:

Pr(‖Z‖ ≥ x) ≤ (d_1 + d_2) exp( −x² / (V(Z) + Jx/3) ),  for all x ≥ 0,  (S9)

E(‖Z‖) ≤ √(2V(Z) log(d_1 + d_2)) + (1/3) J log(d_1 + d_2).  (S10)

A.2. Results for deterministic gradient descent

The main result for DeltaGrad with GD is Theorem 7, proved in Section A.2.8.

A.2.1. QUASI-NEWTON

Following equations 1.2 and 1.3 in (Byrd et al., 1994), the Quasi-Hessian update can be written as:

B_{t+1} = B_t − B_t∆w_t∆w_t^T B_t/(∆w_t^T B_t∆w_t) + ∆g_t∆g_t^T/(∆g_t^T∆w_t).  (S11)

We use the indices j_k to index the Quasi-Hessians B_{j_k}. This allows us to see that they correspond to the appropriate parameter gap ∆w_{j_k} and gradient gap ∆g_{j_k}. The indices j_k depend on the iteration number t in the main algorithm, and they are updated by removing the "oldest" entry and adding T_0 at every period.

DeltaGrad uses equation (S11) on the prior updates:

B_{j_{k+1}} = B_{j_k} − B_{j_k}∆w_{j_k}∆w_{j_k}^T B_{j_k}/(∆w_{j_k}^T B_{j_k}∆w_{j_k}) + ∆g_{j_k}∆g_{j_k}^T/(∆g_{j_k}^T∆w_{j_k}),  (S12)

where the initialization is B_{j_0} = [∆g_{j_0}^T∆w_{j_0}/(∆w_{j_0}^T∆w_{j_0})] I.

We use formulas 3.5 and 2.25 from (Byrd et al., 1994) for the Quasi-Newton method, with the caveat that they use slightly different notation.

For the update rule of B_{j_k}, i.e.

B_{j_{k+1}} = B_{j_k} − B_{j_k}∆w_{j_k}∆w_{j_k}^T B_{j_k}/(∆w_{j_k}^T B_{j_k}∆w_{j_k}) + ∆g_{j_k}∆g_{j_k}^T/(∆g_{j_k}^T∆w_{j_k}),  (S13)

there is an equivalent expression for the inverse of B_{j_k}:

B_{j_{k+1}}^{-1} = (I − ∆w_{j_k}∆g_{j_k}^T/(∆g_{j_k}^T∆w_{j_k})) B_{j_k}^{-1} (I − ∆g_{j_k}∆w_{j_k}^T/(∆g_{j_k}^T∆w_{j_k})) + ∆w_{j_k}∆w_{j_k}^T/(∆g_{j_k}^T∆w_{j_k}).  (S14)
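As a sanity check on the rank-two update (S12) and the secant property it enforces (stated as equation (S18) in Section A.2.3), here is a minimal sketch; the quadratic used to generate ∆w and ∆g is hypothetical:

```python
import numpy as np

def quasi_hessian_update(B, dw, dg):
    """One update of (S12): returns B_{j_{k+1}} from B_{j_k}, dw = Delta w_{j_k}, dg = Delta g_{j_k}."""
    Bdw = B @ dw
    return B - np.outer(Bdw, Bdw) / (dw @ Bdw) + np.outer(dg, dg) / (dg @ dw)

rng = np.random.default_rng(2)
p = 10
A = rng.standard_normal((p, p)); A = A @ A.T + np.eye(p)   # Hessian of a strongly convex quadratic
dw = rng.standard_normal(p)
dg = A @ dw                                                # gradient difference for that quadratic
B0 = (dg @ dw) / (dw @ dw) * np.eye(p)                     # initialization used for B_{j_0}
B1 = quasi_hessian_update(B0, dw, dg)
print(np.allclose(B1 @ dw, dg))                            # secant equation: B_{j_{k+1}} dw = dg
```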

See Algorithm 2 for an overview of the L-BFGS algorithm.


Algorithm 2: Overview of the L-BFGS algorithm
Input: the sequence of model parameter differences ∆W = [∆w_0, ∆w_1, ..., ∆w_{m−1}], the sequence of gradient differences ∆G = [∆g_0, ∆g_1, ..., ∆g_{m−1}], a vector v, history size m
Output: approximation of the product H(w_m)v at the point w_m, for the given v, such that ∆w_i ≈ w_i − w_{i−1} for all i

1: Compute ∆W^T∆W
2: Compute ∆W^T∆G, its diagonal matrix D and its strictly lower triangular submatrix L
3: Compute σ = ∆g_{m−1}^T∆w_{m−1}/(∆w_{m−1}^T∆w_{m−1})
4: Compute the Cholesky factorization JJ^T of σ∆W^T∆W + LD^{−1}L^T
5: Compute
   p = [ −D^{1/2}  D^{−1/2}L^T ; 0  J^T ]^{−1} [ D^{1/2}  0 ; −LD^{−1/2}  J ]^{−1} [ ∆G^T v ; σ∆W^T v ]
6: return σv − [∆G  σ∆W] p
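A compact sketch of Algorithm 2, assuming the cached differences are stored as the columns of p×m arrays. Instead of forming the Cholesky factor J, the small 2m×2m middle system of the compact representation (Byrd et al., 1994) is solved directly, which is equivalent for this illustration (m = 2 in the paper's experiments). The test at the end uses a hypothetical quadratic, for which ∆g = A∆w exactly:

```python
import numpy as np

def lbfgs_hvp(dW, dG, v):
    """Approximate H(w_m) v from cached differences via the compact L-BFGS form.
    Assumes the curvature conditions dw_k^T dg_k > 0 hold."""
    sigma = (dG[:, -1] @ dW[:, -1]) / (dW[:, -1] @ dW[:, -1])
    WtG = dW.T @ dG
    D = np.diag(np.diag(WtG))                  # diagonal of Delta W^T Delta G
    L = np.tril(WtG, k=-1)                     # strictly lower triangular part
    M = np.block([[-D, L.T], [L, sigma * (dW.T @ dW)]])
    p_vec = np.linalg.solve(M, np.concatenate([dG.T @ v, sigma * (dW.T @ v)]))
    return sigma * v - np.hstack([dG, sigma * dW]) @ p_vec

# sanity check on a quadratic, where Delta g = A Delta w exactly
rng = np.random.default_rng(3)
p, m = 30, 2
A = rng.standard_normal((p, p)); A = A @ A.T + np.eye(p)
dW = rng.standard_normal((p, m))
dG = A @ dW
v = dW[:, -1]
print(np.allclose(lbfgs_hvp(dW, dG, v), A @ v))   # latest secant direction reproduced exactly
```

Because the final BFGS update enforces the most recent secant equation, the product along the latest ∆w direction is reproduced exactly; for other directions the output is only an approximation of H(w_m)v.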

A.2.2. PROOF THAT QUASI-HESSIANS ARE WELL-CONDITIONED

We show that the Quasi-Hessian matrices computed by L-BFGS are well-conditioned.

Lemma 6 (Bounds on Quasi-Hessians). The Quasi-Hessian matrices B_{j_k} are well-conditioned. There exist two positive constants K_1 and K_2 (depending on the problem parameters µ, L, etc.) such that for any t, any vector z, and all k ∈ {0, 1, ..., m}, the following inequality holds:

K_1‖z‖² ≤ z^T B_{j_k} z ≤ K_2‖z‖².

Proof. We start with the lower bound. Based on equation (S14), ‖B_{j_{k+1}}^{-1}‖ can be bounded by:

‖B_{j_{k+1}}^{-1}‖ ≤ ‖I − ∆w_{j_k}∆g_{j_k}^T/(∆g_{j_k}^T∆w_{j_k})‖ · ‖B_{j_k}^{-1}‖ · ‖I − ∆g_{j_k}∆w_{j_k}^T/(∆g_{j_k}^T∆w_{j_k})‖ + ‖∆w_{j_k}∆w_{j_k}^T/(∆g_{j_k}^T∆w_{j_k})‖,  (S15)

in which, by using the mean value theorem, ‖I − ∆w_{j_k}∆g_{j_k}^T/(∆g_{j_k}^T∆w_{j_k})‖ can be bounded as:

‖I − ∆w_{j_k}∆g_{j_k}^T/(∆g_{j_k}^T∆w_{j_k})‖ ≤ 1 + ‖∆w_{j_k}∆g_{j_k}^T‖/(∆g_{j_k}^T∆w_{j_k}) = 1 + ‖∆w_{j_k}(H_{j_k}∆w_{j_k})^T‖/(∆w_{j_k}^T H_{j_k}∆w_{j_k}) ≤ 1 + ‖∆w_{j_k}‖·‖H_{j_k}‖·‖∆w_{j_k}‖/(µ‖∆w_{j_k}‖²) ≤ 1 + L/µ.  (S16)

In addition, ‖∆w_{j_k}∆w_{j_k}^T/(∆g_{j_k}^T∆w_{j_k})‖ can be bounded as:

‖∆w_{j_k}∆w_{j_k}^T/(∆g_{j_k}^T∆w_{j_k})‖ = ‖∆w_{j_k}∆w_{j_k}^T/(∆w_{j_k}^T H_{j_k}∆w_{j_k})‖ ≤ ‖∆w_{j_k}^T∆w_{j_k}/(µ∆w_{j_k}^T∆w_{j_k})‖ = 1/µ.  (S17)

So by combining Equation (S16) and Equation (S17), Equation (S15) can be bounded by:

‖B_{j_{k+1}}^{-1}‖ ≤ (1 + L/µ)²‖B_{j_k}^{-1}‖ + 1/µ ≤ (1 + L/µ)^{2k}‖B_{j_0}^{-1}‖ + [1 − (1 + L/µ)^{2k}]/[1 − (1 + L/µ)²]·(1/µ) = (1 + L/µ)^{2k}(L/µ) + [1 − (1 + L/µ)^{2k}]/[1 − (1 + L/µ)²]·(1/µ),

which thus implies that z^T B_{j_k} z ≥ K_1‖z‖², where K_1 := [ (1 + L/µ)^{2k}(L/µ) + (1 − (1 + L/µ)^{2k})/(1 − (1 + L/µ)²)·(1/µ) ]^{-1} and 0 ≤ k ≤ m. Recall that m is small (set as m = 2 in the experiments), so the lower bound does not approach zero.

Then, based on Equation (S11), we derive an upper bound for z^T B_{j_k} z as follows:

z^T B_{j_{k+1}} z = z^T B_{j_k} z − z^T B_{j_k}∆w_{j_k}∆w_{j_k}^T B_{j_k} z/(∆w_{j_k}^T B_{j_k}∆w_{j_k}) + z^T∆g_{j_k}∆g_{j_k}^T z/(∆g_{j_k}^T∆w_{j_k})
≤ z^T B_{j_k} z + z^T∆g_{j_k}∆g_{j_k}^T z/(∆g_{j_k}^T∆w_{j_k}) = z^T B_{j_k} z + z^T H_{j_k}∆w_{j_k}∆w_{j_k}^T H_{j_k} z/(∆w_{j_k}^T H_{j_k}∆w_{j_k})
≤ z^T B_{j_k} z + z^T H_{j_k} z · ∆w_{j_k}^T H_{j_k}∆w_{j_k}/(∆w_{j_k}^T H_{j_k}∆w_{j_k}) = z^T B_{j_k} z + z^T H_{j_k} z
≤ z^T B_{j_k} z + L‖z‖².

The first inequality uses the facts that z^T B_{j_k}∆w_{j_k}∆w_{j_k}^T B_{j_k} z = (z^T B_{j_k}∆w_{j_k})² ≥ 0 and ∆w_{j_k}^T B_{j_k}∆w_{j_k} ≥ 0, due to the positive definiteness of B_{j_k}. The second inequality uses the Cauchy–Schwarz inequality for the averaged Hessian, i.e. (a^T H_{j_k} b)² ≤ (a^T H_{j_k} a)(b^T H_{j_k} b).

By applying the formula above recursively, we get z^T B_{j_{k+1}} z ≤ (k + 1)L‖z‖², where 0 ≤ k ≤ m. Again, as m is bounded, we have (k + 1)L ≤ K_2 := (m + 1)L. This finishes the proof.

A.2.3. PROOF PRELIMINARIES

First of all, we provide the bound on δ_t, which is defined as follows.

Lemma 7 (Upper bound on δ_t). Define

δ_t = −(η/(n−r)) ( (r/n) Σ_{i=1}^n ∇F_i(w^U_t) − Σ_{i∈R} ∇F_i(w^U_t) ).

Then ‖δ_t‖ ≤ 2c_2rη/n.

Proof. Based on the definition of δ_t, we can rearrange it a little bit as:

‖δ_t‖ = ‖ −(ηr/(n(n−r))) Σ_{i=1}^n ∇F_i(w^U_t) + (η/(n−r)) Σ_{i∈R} ∇F_i(w^U_t) ‖
= ‖ −(ηr/(n(n−r))) [Σ_{i=1}^n ∇F_i(w^U_t) − Σ_{i∈R} ∇F_i(w^U_t)] + (η/(n−r) − ηr/(n(n−r))) Σ_{i∈R} ∇F_i(w^U_t) ‖
= ‖ −(ηr/(n(n−r))) Σ_{i∉R} ∇F_i(w^U_t) + (η/n) Σ_{i∈R} ∇F_i(w^U_t) ‖.

Then by using the triangle inequality and Assumption 3 (bounded gradients), the formula above can be bounded as:

≤ (ηr/(n(n−r))) Σ_{i∉R} ‖∇F_i(w^U_t)‖ + (η/n) Σ_{i∈R} ‖∇F_i(w^U_t)‖ ≤ (ηr/n)c_2 + (ηr/n)c_2 = (2ηr/n)c_2.

Notice that Algorithm 2 requires 2m vectors as input, i.e. [∆w_{j_0}, ∆w_{j_1}, ..., ∆w_{j_{m−1}}] and [∆g_{j_0}, ∆g_{j_1}, ..., ∆g_{j_{m−1}}], to approximate the product of the Hessian matrix H(w_t) and the input vector ∆w_t at the t-th iteration, where j_{m−1} ≤ t ≤ j_{m−1} + T_0.


Note that by multiplying ∆w_{j_k} on both sides of the Quasi-Hessian update Equation (S12), we obtain the classical secant equation that characterizes Quasi-Newton methods:

B_{j_{k+1}}∆w_{j_k} = ∆g_{j_k}.  (S18)

Next we give a bound on the quantity ‖∆g_{j_k} − B_{j_q}∆w_{j_k}‖, where the intermediate index q lies between the "correct" index k + 1 and the final index m, so m ≥ q ≥ k + 1. This characterizes the error incurred by using a different Quasi-Hessian at some iteration. The proof borrows ideas from (Conn et al., 1991). Unlike (Conn et al., 1991), our proof relies on a preliminary estimate of the bound on ‖w_t − w^I_t‖, which is at the level of O(r/n). The proof of that bound is presented later.

Theorem 3. Suppose that the preliminary estimate ‖w_{j_k} − w^I_{j_k}‖ ≤ (1/(1/2 − r/n)) M_1 r/n holds, where k = 1, 2, ..., m and M_1 = 2c_2/µ. Let e = (L(L+1) + K_2L)/(µK_1), for the upper and lower bounds K_1, K_2 on the eigenvalues of the quasi-Hessian from Lemma 6, the upper bound c_2 on the gradient from Assumption 3, and the Lipschitz constant c_0 of the Hessian. For 1 ≤ k + 1 ≤ q ≤ m, we have:

‖H_{j_k} − H_{j_q}‖ ≤ c_0 d_{j_k,j_q} + c_0 (1/(1/2 − r/n)) M_1 r/n

and

‖∆g_{j_k} − B_{j_q}∆w_{j_k}‖ ≤ [(1 + e)^{q−k−1} − 1] · c_0 (d_{j_k,j_q} + (1/(1/2 − r/n)) M_1 r/n) · s_{j_1,j_m},

where s_{j_1,j_m} = max(‖∆w_a‖)_{a=j_1,j_2,...,j_m} and d is the maximum gap between the iterates of the algorithm over the iterations from j_k to j_q:

d_{j_k,j_q} = max(‖w_a − w_b‖)_{j_k ≤ a ≤ b ≤ j_q}.  (S19)

Proof. Let v_q = ∆g_{j_k} − B_{j_{q+1}}∆w_{j_k}, b_q = ‖v_q‖ and f = c_0 (d_{j_1,j_m+T_0−1} + (1/(1/2 − r/n)) M_1 r/n) s_{j_1,j_m}.

Let us bound the difference between the averaged Hessians, ‖H_{j_k} − H_{j_q}‖ for 1 ≤ k < q ≤ m, using their definition as well as Assumption 4 on the Lipschitzness of the Hessian:

‖H_{j_k} − H_{j_q}‖ = ‖∫_0^1 H(w_{j_k} + x(w^I_{j_k} − w_{j_k}))dx − ∫_0^1 H(w_{j_q} + x(w^I_{j_q} − w_{j_q}))dx‖
= ‖∫_0^1 [H(w_{j_k} + x(w^I_{j_k} − w_{j_k})) − H(w_{j_q} + x(w^I_{j_q} − w_{j_q}))]dx‖
≤ c_0 ∫_0^1 ‖w_{j_k} + x(w^I_{j_k} − w_{j_k}) − [w_{j_q} + x(w^I_{j_q} − w_{j_q})]‖dx
≤ c_0‖w_{j_k} − w_{j_q}‖ + (c_0/2)‖w^I_{j_k} − w_{j_k} − (w^I_{j_q} − w_{j_q})‖
≤ c_0‖w_{j_k} − w_{j_q}‖ + (c_0/2)‖w^I_{j_q} − w_{j_q}‖ + (c_0/2)‖w^I_{j_k} − w_{j_k}‖
≤ c_0 d_{j_k,j_q} + (c_0/(1/2 − r/n)) M_1 r/n ≤ c_0 d_{j_1,j_m+T_0−1} + (c_0/(1/2 − r/n)) M_1 r/n.  (S20)

On the last line we used the definition of d_{j_k,j_q} and the assumed bound on ‖w^I_{j_k} − w_{j_k}‖.

Then, when q = k, the secant equation ∆g_{j_k} = B_{j_{k+1}}∆w_{j_k} holds according to Equation (S18). So ‖∆g_{j_k} − B_{j_{k+1}}∆w_{j_k}‖ = 0, which proves the claim when q = k; in this case v_k = 0 and b_k = 0.

Next, let u_q = ∆g_{j_q} − B_{j_q}∆w_{j_q}. This quantity is closely related to v_{q−1} = ∆g_{j_k} − B_{j_q}∆w_{j_k}; the difference is that in u_q the ∆g, ∆w terms are taken at q, as opposed to the base index k. Then |u_q^T∆w_{j_k}|, where q > k, can be bounded as:

|u_q^T∆w_{j_k}| = |∆g_{j_q}^T∆w_{j_k} − ∆g_{j_k}^T∆w_{j_q} + ∆g_{j_k}^T∆w_{j_q} − ∆w_{j_q}^T B_{j_q}∆w_{j_k}|
≤ |∆g_{j_q}^T∆w_{j_k} − ∆g_{j_k}^T∆w_{j_q}| + |∆w_{j_q}^T v_{q−1}|
≤ |∆g_{j_q}^T∆w_{j_k} − ∆g_{j_k}^T∆w_{j_q}| + ‖∆w_{j_q}‖ · b_{q−1}
= |∆w_{j_q}^T H_{j_q}∆w_{j_k} − ∆w_{j_k}^T H_{j_k}∆w_{j_q}| + ‖∆w_{j_q}‖ · b_{q−1}
= |∆w_{j_q}^T (H_{j_q} − H_{j_k})∆w_{j_k}| + ‖∆w_{j_q}‖ · b_{q−1}
≤ ‖∆w_{j_q}‖ · ‖H_{j_q} − H_{j_k}‖ · ‖∆w_{j_k}‖ + ‖∆w_{j_q}‖ · b_{q−1}
≤ (f + b_{q−1})‖∆w_{j_q}‖,  (S21)

in which the first inequality uses the triangle inequality, the second inequality uses the Cauchy–Schwarz inequality, and the subsequent equality uses the Cauchy mean value theorem. Finally, the third inequality uses Assumption 4 and equation (S20). We also use the following bounds, which hold by definition (note that k, q ≤ m):

‖w_{j_k} − w_{j_q}‖ ≤ d_{j_k,j_q},   ‖∆w_{j_q}‖ ≤ s_{j_1,j_m}.

The argument for the upper bound on b_q proceeds by induction. The claim is true for the base case q = k. Assuming that the claim is true for q − 1, we prove it for q, which is bounded as follows:

b_q = ‖∆g_{j_k} − (B_{j_q} − B_{j_q}∆w_{j_q}∆w_{j_q}^T B_{j_q}/(∆w_{j_q}^T B_{j_q}∆w_{j_q}) + ∆g_{j_q}∆g_{j_q}^T/(∆g_{j_q}^T∆w_{j_q}))∆w_{j_k}‖.  (S22)

By using the triangle inequality, we obtain the following upper bound:

≤ b_{q−1} + ‖(∆g_{j_q}∆g_{j_q}^T/(∆g_{j_q}^T∆w_{j_q}) − B_{j_q}∆w_{j_q}∆w_{j_q}^T B_{j_q}/(∆w_{j_q}^T B_{j_q}∆w_{j_q}))∆w_{j_k}‖.

Now we come to a key and nontrivial step of the argument. By bringing the fractions to a common denominator in the second term, adding and subtracting ∆g_{j_q}∆g_{j_q}^T∆w_{j_q}^T∆g_{j_q} and ∆g_{j_q}(B_{j_q}∆w_{j_q})^T∆w_{j_q}^T∆g_{j_q}, and rearranging to factor out the term −u_q in the numerator of each summand, the formula above can be rewritten as:

= b_{q−1} + ‖[−∆g_{j_q}∆g_{j_q}^T∆w_{j_q}^T u_q + ∆g_{j_q}u_q^T∆w_{j_q}^T∆g_{j_q} + u_q∆w_{j_q}^T B_{j_q}∆w_{j_q}^T∆g_{j_q}]∆w_{j_k}‖ / (∆g_{j_q}^T∆w_{j_q} · ∆w_{j_q}^T B_{j_q}∆w_{j_q}).

Next, using the Cauchy mean value theorem and the fact that the smallest eigenvalues of H_{j_q}, B_{j_q} are lower bounded by µ, K_1 respectively, the formula above is bounded as:

≤ b_{q−1} + ‖[−∆g_{j_q}∆g_{j_q}^T∆w_{j_q}^T u_q + ∆g_{j_q}u_q^T∆w_{j_q}^T∆g_{j_q} + u_q∆w_{j_q}^T B_{j_q}∆w_{j_q}^T∆g_{j_q}]∆w_{j_k}‖ / (µK_1‖∆w_{j_q}‖⁴)
≤ b_{q−1} + (‖∆g_{j_q}‖² · ‖∆w_{j_q}^T u_q∆w_{j_k}‖ + ‖∆g_{j_q}‖ · ‖u_q^T∆w_{j_q}^T∆g_{j_q}∆w_{j_k}‖ + ‖u_q∆w_{j_q}^T B_{j_q}∆w_{j_k}∆w_{j_q}^T∆g_{j_q}‖) / (µK_1‖∆w_{j_q}‖⁴).

Now we bound the last three terms one by one. First, ‖∆g_{j_q}‖²‖∆w_{j_q}^T u_q∆w_{j_k}‖ can be bounded as:

‖∆g_{j_q}‖² · ‖∆w_{j_q}^T u_q∆w_{j_k}‖ = ‖H_{j_q}∆w_{j_q}‖² · |∆w_{j_q}^T u_q| · ‖∆w_{j_q}‖ ≤ L‖∆w_{j_q}‖³ · |∆w_{j_q}^T u_q| ≤ L(f + b_{q−1})‖∆w_{j_q}‖⁴,

in which the first equality uses the Cauchy mean value theorem, the subsequent inequality uses Assumption 3, and the last inequality uses equation (S21), the upper bound on |∆w_{j_q}^T u_q|.

Then for ‖∆g_{j_q}‖ · ‖u_q^T∆w_{j_q}^T∆g_{j_q}∆w_{j_k}‖ we have a very similar argument. The only difference is that we factor out the scalar ∆w_{j_q}^T∆g_{j_q} and bound it by L‖∆w_{j_q}‖², i.e.:

‖∆g_{j_q}‖ · ‖u_q^T∆w_{j_q}^T∆g_{j_q}∆w_{j_k}‖ = ‖H_{j_q}∆w_{j_q}‖ · |∆w_{j_q}^T∆g_{j_q}| · |u_q^T∆w_{j_k}| ≤ L²(f + b_{q−1})‖∆w_{j_q}‖⁴,

in which the first equality uses the Cauchy mean value theorem and the fact that ∆w_{j_q}^T∆g_{j_q} is a scalar, and the last inequality uses Assumption 3 and Equation (S21).

In terms of the bound on ‖u_q∆w_{j_q}^T B_{j_q}∆w_{j_k}∆w_{j_q}^T∆g_{j_q}‖, it is derived as:

‖u_q∆w_{j_q}^T B_{j_q}∆w_{j_k}∆w_{j_q}^T∆g_{j_q}‖ ≤ ‖u_q∆w_{j_q}^T‖ · |∆w_{j_q}^T B_{j_q}∆w_{j_k}| · ‖∆g_{j_q}‖
≤ (f + b_{q−1})‖∆w_{j_q}‖ · |∆w_{j_q}^T B_{j_q}∆w_{j_k}| · ‖H_{j_q}∆w_{j_q}‖
≤ (f + b_{q−1})‖∆w_{j_q}‖ · K_2‖∆w_{j_q}‖² · L‖∆w_{j_q}‖ = K_2L(f + b_{q−1})‖∆w_{j_q}‖⁴,

in which the first inequality uses the Cauchy–Schwarz inequality, the second inequality uses equation (S21) and the third inequality uses Assumption 6.

In summary, for all q ≥ k + 1, Equation (S22) is bounded by:

b_q ≤ b_{q−1} + [(L(L+1) + K_2L)/(µK_1‖∆w_{j_q}‖⁴)] (f + b_{q−1})‖∆w_{j_q}‖⁴ = (1 + e)b_{q−1} + ef.

By recursion and using the fact that b_k = 0, this can be bounded as:

b_q ≤ (1 + e)^{q−k} b_k + Σ_{i=0}^{q−k−1} (1 + e)^i e f = [(1 + e)^{q−k} − 1]/e · ef = [(1 + e)^{q−k} − 1]f.  (S23)

This proves the required claim b_q ≤ [(1 + e)^{q−k} − 1]f and finishes the proof.

Corollary 1 (Approximation accuracy of quasi-Hessian to mean Hessian). Suppose that ‖w_{j_s} − w^I_{j_s}‖ ≤ (1/(1/2 − r/n)) M_1 r/n and ‖w_t − w^I_t‖ ≤ (1/(1/2 − r/n)) M_1 r/n, where s = 1, 2, ..., m. Then for j_m ≤ t ≤ j_m + T_0 − 1,

‖H_t − B_{j_m}‖ ≤ ξ_{j_1,j_m} := A d_{j_1,j_m+T_0−1} + A (1/(1/2 − r/n)) M_1 r/n,  (S24)

where recall again that c_0 is the Lipschitz constant of the Hessian, d_{j_1,j_m+T_0−1} is the maximal gap between the iterates of the GD algorithm on the full data from j_1 to j_m + T_0 − 1 (see equation (S19)), which goes to zero as t → ∞, and A = c_0√m[(1 + e)^m − 1]/c_1 + c_0, in which e is a problem dependent constant defined in Theorem 3 and c_1 is the "strong independence" constant from Assumption 5.

Proof. Based on Theorem 3, b_{q−1} = ‖∆g_{j_k} − B_{j_q}∆w_{j_k}‖ ≤ [(1 + e)^{q−k−1} − 1]f.

Then, based on the "strong linear independence" in Assumption 5, the matrix ∆W_{j_1,j_2,...,j_m} = [∆w_{j_1}/s_{j_1,j_m}, ∆w_{j_2}/s_{j_1,j_m}, ..., ∆w_{j_m}/s_{j_1,j_m}] has its smallest singular value lower bounded by c_1 > 0. Then ‖H_{j_m} − B_{j_m}‖ can be bounded as:

‖H_{j_m} − B_{j_m}‖ ≤ (1/c_1)‖(H_{j_m} − B_{j_m})∆W_{j_1,j_2,...,j_m}‖ ≤ (√m[(1 + e)^m − 1] c_0/c_1)(d_{j_1,j_m+T_0−1} + (1/(1/2 − r/n)) M_1 r/n).  (S25)

The second inequality uses the bound ‖M‖ ≤ √m max_i ‖m_i‖, where M is a matrix with m columns m_i.

So by combining with equation (S25), we can upper bound ‖H_t − B_{j_m}‖ for j_m ≤ t ≤ j_m + T_0 − 1:

‖H_t − B_{j_m}‖ = ‖H_t − H_{j_m} + H_{j_m} − B_{j_m}‖ ≤ ‖H_t − H_{j_m}‖ + ‖H_{j_m} − B_{j_m}‖
≤ c_0(d_{j_m,t} + M_1 r/n) + (√m[(1 + e)^m − 1] c_0/c_1)(d_{j_1,j_m+T_0−1} + (1/(1/2 − r/n)) M_1 r/n)
≤ A d_{j_1,j_m+T_0−1} + A (1/(1/2 − r/n)) M_1 r/n.  (S26)

This finishes the proof.

Note that the upper bound on ‖H_t − B_{j_m}‖ contains the term d_{j_1,j_m+T_0−1}, so we analyze this term next.

Lemma 8 (Contraction of the GD iterates). Recall the definition of d_{j_k,j_q} from Theorem 3:

d_{j_k,j_q} = max(‖w_a − w_b‖)_{j_k ≤ a ≤ b ≤ j_q}.

Then d_{j_k,j_q} ≤ d_{j_k−z,j_q−z} for any positive integer z, and d_{j_k,j_q} ≤ (1 − µη)^{j_k} d_{0,j_q−j_k} for any 0 ≤ j_k ≤ j_q.

Proof. To prove the two inequalities, we compare d_{j_k,j_q} and d_{j_k−z,j_q−z}, where z is a positive integer. For any given j_k ≤ a ≤ b ≤ j_q, the upper bound on ‖w_a − w_b‖ can be derived as:

‖w_a − w_b‖ = ‖w_{a−1} − η∇F(w_{a−1}) − (w_{b−1} − η∇F(w_{b−1}))‖
= ‖w_{a−1} − w_{b−1} − η(∇F(w_{a−1}) − ∇F(w_{b−1}))‖
= ‖w_{a−1} − w_{b−1} − η(1/n)(∫_0^1 Σ_{i=1}^n H_i(w_{a−1} + x(w_{b−1} − w_{a−1}))dx)(w_{a−1} − w_{b−1})‖
= ‖(I − (η/n)∫_0^1 Σ_{i=1}^n H_i(w_{a−1} + x(w_{b−1} − w_{a−1}))dx)(w_{a−1} − w_{b−1})‖.

The derivation above uses the update rule of gradient descent and the Cauchy mean value theorem. Then, according to the Cauchy–Schwarz inequality and strong convexity, it can be further bounded as ‖w_a − w_b‖ ≤ (1 − ηµ)‖w_{a−1} − w_{b−1}‖.

This can be applied iteratively, which yields the following inequality:

‖w_a − w_b‖ ≤ (1 − ηµ)^z‖w_{a−z} − w_{b−z}‖,  (S27)

which indicates that d_{j_k,j_q} ≤ (1 − ηµ)^z d_{j_k−z,j_q−z} and thus d_{j_k,j_q} ≤ d_{j_k−z,j_q−z}. So by setting z = j_k, we obtain d_{j_k,j_q} ≤ (1 − µη)^{j_k} d_{0,j_q−j_k}.

A.2.4. MAIN RECURSIONS

We bound the difference between w^I_t and w^U_t. The proofs of the theorems stated below are given in the following sections.

Our proof starts with the usual approach of showing a contraction for the gradient updates, see e.g. (Bottou et al., 2018). First we bound ‖w_t − w^U_t‖:

Theorem 4 (Bound between iterates on full and the leave-r-out dataset). ‖w_t − w^U_t‖ ≤ M_1 r/n, where M_1 = 2c_2/µ is a positive constant that does not depend on t.

To show that the preliminary estimate on the bound on ‖w^I_t − w_t‖ used in Theorem 3 and Corollary 1 holds, we have the following.

Theorem 5 (Bound between iterates on full data and incrementally updated ones). Consider an iteration t indexed with j_m for which j_m ≤ t < j_m + T_0 − 1, and suppose that we are at the x-th iteration of full gradient updates, so j_1 = j_0 + xT_0 and j_m = j_0 + (m − 1 + x)T_0. Suppose that we have the bounds ‖H_t − B_{j_m}‖ ≤ ξ_{j_1,j_m} = A d_{j_1,j_m+T_0−1} + (1/(1/2 − r/n)) A M_1 r/n (where we recalled the definition of ξ) and ξ_{j_1,j_m} ≤ µ/2 for all iterations x. Then

‖w^I_{t+1} − w_{t+1}‖ ≤ (2rc_2/n) / ((1 − r/n)µ − ξ_{j_0,j_0+(m−1)T_0}) ≤ (1/(1/2 − r/n)) M_1 r/n.

Recall that c_0 is the Lipschitz constant of the Hessian, and that M_1 and A are defined in Theorem 4 and Corollary 1 respectively; they do not depend on t.

For this theorem, note that this inequality depends on the condition ‖H_t − B_{j_m}‖ ≤ ξ_{j_1,j_m}, while in Theorem 3, to prove ‖H_t − B_{j_m}‖ ≤ ξ_{j_1,j_m}, we need the inequality in Theorem 5, i.e. ‖w^I_{t+1} − w_{t+1}‖ ≤ (1/(1/2 − r/n)) M_1 r/n. In what follows, we show that both inequalities hold for all iterations t without relying on other conditions.

We can select the hyper-parameters T_0 and j_0 such that

A(1 − ηµ)^{j_0−m+1} d_{0,(m−1)T_0} + (1/(1/2 − r/n)) A M_1 r/n < min( µ/2, (1 − r/n)µ − c_0 M_1 r(n − r)/(2n²) ),

e.g. when m = 2 and T_0 = 5, which is what we used in our experiments. It is enough that

j_0 > max( log( [µ/2 − (1/(1/2 − r/n)) A M_1 r/n] / (A d_{0,5}) ) / log(1 − ηµ),  log( [(1 − r/n)µ − (1/(1/2 − r/n)) A M_1 r/n] / (A d_{0,5}) ) / log(1 − ηµ) ) + m − 1.

For small enough r/n, this reduces to:

j_0 > log( [µ/2 − (1/(1/2 − r/n)) A M_1 r/n] / (A d_{0,5}) ) / log(1 − ηµ) + m − 1.

Then the following two theorems hold.

Theorem 6 (Bound between iterates on full data and incrementally updated ones (all iterations)). For any j_m < t < j_m + T_0 − 1, ‖w^I_t − w_t‖ ≤ (1/(1/2 − r/n)) M_1 r/n and ‖H_t − B_{j_m}‖ ≤ ξ_{j_1,j_m}.

Then we have the following bound for ‖w^U_t − w^I_t‖, which is our main result.

Theorem 7 (Convergence rate of DeltaGrad). For all iterations t, the result w^I_t of DeltaGrad (Algorithm 1) approximates the correct iteration values w^U_t at the rate

‖w^U_t − w^I_t‖ = o(r/n).

So ‖w^U_t − w^I_t‖ is of a lower order than r/n.

This is proved in Section A.2.8.


A.2.5. PROOF OF THEOREM 4

Proof. By subtracting the GD update from equation (1), we have:

w^U_{t+1} − w_{t+1} = w^U_t − w_t − η( (1/(n−r))(Σ_{i=1}^n ∇F_i(w^U_t) − Σ_{i∈R} ∇F_i(w^U_t)) − (1/n)Σ_{i=1}^n ∇F_i(w_t) ),  (S28)

in which the right-hand side can be rewritten as:

w^U_t − w_t − η(∇F(w^U_t) − ∇F(w_t)) − η( (1/(n−r))(Σ_{i=1}^n ∇F_i(w^U_t) − Σ_{i∈R} ∇F_i(w^U_t)) − (1/n)Σ_{i=1}^n ∇F_i(w^U_t) )
= w^U_t − w_t − η(∇F(w^U_t) − ∇F(w_t)) − (η/(n−r))( (r/n)Σ_{i=1}^n ∇F_i(w^U_t) − Σ_{i∈R} ∇F_i(w^U_t) )
= w^U_t − w_t − η(∇F(w^U_t) − ∇F(w_t)) + δ_t.

Then by applying the Cauchy mean value theorem, the triangle inequality, the Cauchy–Schwarz inequality and Lemma 7 respectively, we have:

‖w_{t+1} − w^U_{t+1}‖ ≤ ‖w_t − w^U_t − η(∫_0^1 H(w_t + x(w^U_t − w_t))dx)(w_t − w^U_t)‖ + ‖δ_t‖
≤ ‖I − η∫_0^1 H(w_t + x(w^U_t − w_t))dx‖ · ‖w_t − w^U_t‖ + 2c_2rη/n.

Then by applying the triangle inequality over integrals and Lemma 2, this can be further bounded as:

≤ ‖∫_0^1 (I − ηH(w_t + x(w^U_t − w_t)))dx‖ · ‖w_t − w^U_t‖ + 2c_2rη/n
≤ (1 − ηµ)‖w_t − w^U_t‖ + 2c_2rη/n.

Then by applying this formula iteratively, we get:

‖w_{t+1} − w^U_{t+1}‖ ≤ (1/(ηµ)) · 2c_2rη/n = (2c_2/µ)(r/n) := M_1 r/n.
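Theorem 4 can be illustrated numerically: run GD from the same initialization on the full objective and on the leave-r-out objective, and track the largest gap along the path. The problem below is a hypothetical one with F_i(w) = ½‖w − a_i‖² (so µ = L = 1), chosen only so that the constants are transparent:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, eta, T = 2000, 5, 0.5, 200
a = rng.standard_normal((n, p))

def gd_path(idx):
    # GD on F(w) = (1/|idx|) * sum_{i in idx} 0.5 * ||w - a_i||^2
    w, path = np.zeros(p), []
    for _ in range(T):
        w = w - eta * (w - a[idx].mean(axis=0))
        path.append(w.copy())
    return np.array(path)

full = gd_path(np.arange(n))
for r in (20, 100, 500):
    gap = np.linalg.norm(full - gd_path(np.arange(r, n)), axis=1).max()
    print(r, gap, gap / (r / n))     # max_t ||w_t - w^U_t|| grows roughly linearly in r/n
```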

A.2.6. PROOF OF THEOREM 5

Proof. The updates for the iterations j_m ≤ t ≤ j_m + T_0 − 1 follow the Quasi-Hessian update. We proceed in a similar way as before, by expanding the recursion:

‖w^I_{t+1} − w_{t+1}‖ = ‖w^I_t − (w_t − η∇F(w_t)) − (η/(n−r))( n[B_{j_m}(w^I_t − w_t) + ∇F(w_t)] − Σ_{i∈R} ∇F_i(w^I_t) )‖
= ‖(I − η(n/(n−r))B_{j_m})(w^I_t − w_t) − (rη/(n−r))∇F(w_t) + (η/(n−r))Σ_{i∈R} ∇F_i(w^I_t)‖.  (S29)

By rearranging the formula above and using the triangle inequality, we get:

= ‖(I − η(n/(n−r))B_{j_m})(w^I_t − w_t) − (rη/(n−r))∇F(w_t) + (η/(n−r))Σ_{i∈R}(H_{t,i}(w^I_t − w_t) + ∇F_i(w_t))‖
≤ ‖(I − η(n/(n−r))B_{j_m})(w^I_t − w_t) + (η/(n−r))Σ_{i∈R} H_{t,i}(w^I_t − w_t)‖ + ‖(rη/(n−r))∇F(w_t)‖ + ‖(η/(n−r))Σ_{i∈R} ∇F_i(w_t)‖,  (S30)

in which we use H_{t,i} to denote ∫_0^1 H_i(w_t + x(w^I_t − w_t))dx (recall that H_i is the Hessian matrix evaluated at the i-th sample). Then the terms in the first norm are rewritten as:

[I − η(n/(n−r))(B_{j_m} − H_t + H_t)](w^I_t − w_t) + (η/(n−r))Σ_{i∈R} H_{t,i}(w^I_t − w_t)
= [I − η(n/(n−r))(B_{j_m} − H_t)](w^I_t − w_t) − (η/(n−r))Σ_{i∉R} H_{t,i}(w^I_t − w_t),

which uses the fact that nH_t = Σ_{i=1}^n H_{t,i} = Σ_{i∉R} H_{t,i} + Σ_{i∈R} H_{t,i}. Then Formula (S30) can be further bounded as:

≤ ‖[I − (η/(n−r))Σ_{i∉R} H_{t,i}](w^I_t − w_t)‖ + (nη/(n−r))‖(B_{j_m} − H_t)(w^I_t − w_t)‖ + ‖(rη/(n−r))∇F(w_t)‖ + ‖(η/(n−r))Σ_{i∈R} ∇F_i(w_t)‖
≤ (1 − ηµ + η(n/(n−r))ξ_{j_1,j_m})‖w^I_t − w_t‖ + rηc_2/(n−r) + ηrc_2/(n−r).  (S31)

Then according to Lemma 8, d_{j_1,j_m+T_0−1} = d_{j_0+xT_0,j_0+(x+m)T_0−1} decreases with increasing x, and thus ξ_{j_1,j_m} = A d_{j_1,j_m+T_0−1} + (1/(1/2 − r/n)) A M_1 r/n is also decreasing with increasing x. So the formula above can be further bounded as:

≤ (1 − ηµ + η(n/(n−r))ξ_{j_0,j_0+(m−1)T_0})‖w^I_t − w_t‖ + 2rηc_2/(n−r).

This is a recurrent inequality for ‖w^I_t − w_t‖. Next, notice that the conditions for deriving the above inequality hold for all j_m ≤ t ≤ j_m + T_0 − 1.

Then, when we reach t = j_m, we have an iteration where the gradient is computed exactly. For these iterations we have w^I_{t+1} = w^I_t − (η/(n−r))Σ_{i∉R} ∇F_i(w^I_t) as well as w_{t+1} = w_t − η∇F(w_t). Using the same argument as in the bound for w_t − w^U_t, we get:

‖w_{t+1} − w^I_{t+1}‖ ≤ (1 − ηµ)‖w_t − w^I_t‖ + 2c_2rη/n.

Therefore, we effectively have ξ = 0 for these iterations. We then continue with t ← t − 1, and use the appropriate bound among the two derived above. This recursive process works until we reach t = 1.

As long as ξ_{j_0,j_0+(m−1)T_0} ≤ µ/2, we have −ηµ + η(n/(n−r))ξ_{j_0,j_0+(m−1)T_0} < −ηµ + η(n/(n−r))µ/2 < 0. Then we get the following inequality:

‖w^I_t − w_t‖ ≤ (2ηrc_2/(n−r)) / (ηµ − η(n/(n−r))ξ_{j_0,j_0+(m−1)T_0}) = (2rc_2/n) / ((1 − r/n)µ − ξ_{j_0,j_0+(m−1)T_0}).

As long as ξ_{j_0,j_0+(m−1)T_0} ≤ µ/2, then

‖w^I_t − w_t‖ ≤ (2rc_2/n) / ((1 − r/n)µ − ξ_{j_0,j_0+(m−1)T_0}) ≤ (2rc_2/n) / ((1 − r/n)µ − µ/2) = (1/(1/2 − r/n)) M_1 r/n.

The last step uses the fact that M_1 = 2c_2/µ.

A.2.7. PROOF OF THEOREM 6

Architecture of the proof. The proof is recursive. Theorem 4 bounds ‖w^U_t − w_t‖ for all t; for t ≤ j_0 the gradients are computed exactly, so the bound on ‖w^I_t − w_t‖ follows directly. Within each period j_0 + xT_0 < t ≤ j_0 + (x + 1)T_0 (with the history indices j_1, ..., j_m shifted forward by T_0 at the start of each period, starting from j_1, ..., j_m = j_0 − m + 1, ..., j_0), Corollary 1 gives the bound on ‖H_{t−1} − B_{j_m}‖ and Theorem 5 gives the bound on ‖w^I_t − w_t‖; these two bounds are applied alternately, period after period.

Proof. First, regarding the bound ξ_{j_1,j_m} ≤ µ/2 required in Theorem 5, we show below that we can adjust the values of j_0 and T_0 such that it holds for all t. When j_1 ≥ j_0, i.e. j_1 = j_0 + xT_0, then

j_2, j_3, ..., j_m = j_0 + (x+1)T_0, j_0 + (x+2)T_0, ..., j_0 + (x+m−1)T_0,

and thus ξ_{j_1,j_m} = ξ_{j_0+xT_0,j_0+(x+m−1)T_0} = A d_{j_0+xT_0,j_0+(x+m)T_0−1} + (1/(1/2 − r/n)) A M_1 r/n. Here d_{j_0+xT_0,j_0+(x+m)T_0−1} decreases with x, and so does ξ_{j_1,j_m} = ξ_{j_0+xT_0,j_0+(x+m−1)T_0}. So the following inequality holds:

d_{j_0+xT_0,j_0+(x+m)T_0−1} ≤ d_{j_0,j_0+mT_0−1} ≤ (1 − µη)^{j_0} d_{0,mT_0−1}.

When j_1 < j_0, there are only m different choices for j_1, j_2, ..., j_m, in which the smallest j_1 used for approximation is j_0 − m + 1. Then the following inequality holds:

d_{j_1,j_m} ≤ (1 − ηµ)^{j_1} d_{0,j_m−j_1} ≤ (1 − ηµ)^{j_0−m+1} d_{0,j_m−j_1}.

For those j_1, j_2, ..., j_m we have j_m − j_1 ≤ (m−1)T_0 and thus

d_{j_1,j_m} ≤ (1 − ηµ)^{j_1} d_{0,j_m−j_1} ≤ (1 − ηµ)^{j_0−m+1} d_{0,(m−1)T_0}.

So ξ_{j_1,j_m} is bounded by A(1 − ηµ)^{j_0−m+1} d_{0,(m−1)T_0} + (1/(1/2 − r/n)) A M_1 r/n. To make sure that ξ_{j_1,j_m} ≤ µ/2, we can adjust j_0, m, T_0 so that A(1 − ηµ)^{j_0−m+1} d_{0,(m−1)T_0} + (1/(1/2 − r/n)) A M_1 r/n is smaller than µ/2.

Then, when t ≤ j_0, the gradient is evaluated explicitly, which means that w^U_t = w^I_t, so the bound clearly holds: from Theorem 4 we have ‖w^U_t − w_t‖ ≤ M_1 r/n and thus ‖w^I_t − w_t‖ = ‖w^U_t − w_t‖ ≤ M_1 r/n ≤ (1/(1/2 − r/n)) M_1 r/n.

When j_0 < t < j_0 + T_0, in order to compute w^I_t we use the history information ∆w_{j_1}, ∆w_{j_2}, ..., ∆w_{j_m}, ∆g_{j_1}, ∆g_{j_2}, ..., ∆g_{j_m} and the corresponding quasi-Hessian matrices B_{j_1}, B_{j_2}, ..., B_{j_m}, where j_1, j_2, ..., j_m = j_0 − m + 1, j_0 − m + 2, ..., j_0 (we suppose m < j_0, which is a natural assumption). Since ‖w^I_t − w_t‖ ≤ M_1 r/n for any t ≤ j_0, the conditions of Corollary 1 (used here with the j_1, ..., j_m described above) hold up to j_0, so when t = j_0 + 1, ‖H_{t−1} − B_{j_m}‖ ≤ ξ_{j_1,j_m}, where

ξ_{j_1,j_m} = ξ_{j_0−m+1,j_0} = A d_{j_1,j_m+T_0−1} + (1/(1/2 − r/n)) A M_1 r/n = A d_{j_0−m+1,j_0+T_0−1} + (1/(1/2 − r/n)) A M_1 r/n.

Moreover, according to Theorem 5, ‖w^I_t − w_t‖ ≤ (2rc_2/n)/((1 − r/n)µ − ξ_{j_0−m+1,j_0}). When ξ_{j_0−m+1,j_0} ≤ µ/2, then

‖w^I_t − w_t‖ ≤ (2rc_2/n)/((1 − r/n)µ − ξ_{j_1,j_m}) ≤ (2rc_2/n)/((1 − r/n)µ − µ/2) = (1/(1/2 − r/n)) M_1 r/n.

So the bound on ‖w^I_t − w_t‖ holds for all t ≤ j_0 + 1. Then, according to the conditions of Corollary 1, when t = j_0 + 2, ‖H_{t−1} − B_{j_m}‖ ≤ ξ_{j_1,j_m} holds. This proceeds recursively until t = j_0 + T_0, at which the gradients are explicitly evaluated; according to Theorem 5,

‖w^I_{j_0+T_0} − w_{j_0+T_0}‖ ≤ (2rc_2/n)/((1 − r/n)µ − ξ_{j_1,j_m}) ≤ (2rc_2/n)/((1 − r/n)µ − µ/2) = (1/(1/2 − r/n)) M_1 r/n.

Next, when j_0 + T_0 < t < j_0 + 2T_0, j_m is updated to j_0 + T_0 while j_1, j_2, ..., j_{m−1} are updated to j_0 − m + 2, j_0 − m + 3, ..., j_0, and we know that ‖w^I_{j_k} − w_{j_k}‖ ≤ (1/(1/2 − r/n)) M_1 r/n. So based on Corollary 1 the following inequality holds:

‖H_t − B_{j_m}‖ ≤ ξ_{j_1,j_m} = ξ_{j_0−m+2,j_0+T_0} = A d_{j_1,j_m+T_0−1} + (1/(1/2 − r/n)) A M_1 r/n = A d_{j_0−m+2,j_0+2T_0−1} + (1/(1/2 − r/n)) A M_1 r/n.

This process proceeds recursively.

When j_0 + xT_0 < t < j_0 + (x+1)T_0, we know that

‖H_t − B_{j_m}‖ ≤ ξ_{j_1,j_m} = A d_{j_1,j_m+T_0−1} + (1/(1/2 − r/n)) A M_1 r/n.

Then, based on Theorem 5, ‖w^I_t − w_t‖ ≤ (1/(1/2 − r/n)) M_1 r/n. Then at iteration j_0 + (x+1)T_0 we update j_1, j_2, ..., j_m as j_m ← j_0 + (x+1)T_0 and j_{i−1} ← j_i (i = 2, 3, ..., m), and thus

‖w^I_{j_k} − w_{j_k}‖ ≤ (1/(1/2 − r/n)) M_1 r/n

still holds for all k = 1, 2, ..., m.

So when j_0 + (x+1)T_0 < t < j_0 + (x+2)T_0, Corollary 1 and Theorem 5 are applied alternately. Then the following two inequalities hold for all iterations t satisfying j_0 + (x+1)T_0 < t < j_0 + (x+2)T_0:

‖H_t − B_{j_m}‖ ≤ ξ_{j_1,j_m} = A d_{j_1,j_m+T_0−1} + (1/(1/2 − r/n)) A M_1 r/n,
‖w^I_{j_k} − w_{j_k}‖ ≤ (1/(1/2 − r/n)) M_1 r/n.

So in the end, we know that

‖w^I_t − w_t‖ ≤ (1/(1/2 − r/n)) M_1 r/n   and   ‖H_t − B_{j_m}‖ ≤ ξ_{j_1,j_m}

hold for all t.

A.2.8. PROOF OF THEOREM 7

Proof. The proof is by induction.

When t ≤ j_0, the gradient is evaluated explicitly, which means that w^U_t = w^I_t, so the bound clearly holds.

From iteration T_0 to iteration t, the difference between w^I_t and w^U_t can be bounded as follows. In these equations we use the update formula w^I_{t+1} = w^I_t − (η/(n−r))[n(B_{j_m}(w^I_t − w_t) + ∇F(w_t)) − Σ_{i∈R} ∇F_i(w^I_t)]. By rearranging terms appropriately, we get:

‖w^I_{t+1} − w^U_{t+1}‖ = ‖w^I_t − w^U_t − (nη/(n−r))[B_{j_m}(w^I_t − w_t) + ∇F(w_t)] + (η/(n−r))Σ_{i∈R} ∇F_i(w^I_t) + (η/(n−r))Σ_{i∉R} ∇F_i(w^U_t)‖.  (S32)

Then by bringing H_t into the expression above, it is rewritten as:

= ‖w^I_t − w^U_t − (nη/(n−r))[(B_{j_m} − H_t)(w^I_t − w_t) + H_t(w^I_t − w_t) + ∇F(w_t)] + (η/(n−r))Σ_{i∈R}(∇F_i(w^I_t) − ∇F_i(w_t) + ∇F_i(w_t)) + (η/(n−r))Σ_{i∉R} ∇F_i(w^U_t)‖.  (S33)

(Here H_t(w^I_t − w_t) denotes the matrix H_t times the vector w^I_t − w_t, not a function evaluation.) Then by applying the Cauchy mean value theorem over each individual ∇F_i(w^I_t) − ∇F_i(w_t) and denoting the corresponding integrated Hessian matrix by H_{t,i} (note that Σ_{i=1}^n H_{t,i} = nH_t), the expression becomes:

= ‖w^I_t − w^U_t − (nη/(n−r))[(B_{j_m} − H_t)(w^I_t − w_t) + H_t(w^I_t − w_t) + ∇F(w_t)] + (η/(n−r))Σ_{i∈R}(H_{t,i}(w^I_t − w_t) + ∇F_i(w_t)) + (η/(n−r))Σ_{i∉R} ∇F_i(w^U_t)‖.

Then by using the facts that Σ_{i∈R} ∇F_i(w_t) + Σ_{i∉R} ∇F_i(w_t) = n∇F(w_t) and Σ_{i∈R} H_{t,i} + Σ_{i∉R} H_{t,i} = nH_t, the expression can be rearranged as:

= ‖w^I_t − w^U_t − (η/(n−r))Σ_{i∉R} H_{t,i}(w^I_t − w_t) − (nη/(n−r))(B_{j_m} − H_t)(w^I_t − w_t) − (η/(n−r))Σ_{i∉R} ∇F_i(w_t) + (η/(n−r))Σ_{i∉R} ∇F_i(w^U_t)‖,

in which (η/(n−r))Σ_{i∈R} ∇F_i(w_t) is canceled out. Then by adding and subtracting w^U_t in the first part, we get:

= ‖w^I_t − w^U_t − (η/(n−r))Σ_{i∉R} H_{t,i}(w^I_t − w^U_t) − (nη/(n−r))(B_{j_m} − H_t)(w^I_t − w^U_t) − (η/(n−r))Σ_{i∉R} H_{t,i}(w^U_t − w_t) − (nη/(n−r))(B_{j_m} − H_t)(w^U_t − w_t) − (η/(n−r))Σ_{i∉R} ∇F_i(w_t) + (η/(n−r))Σ_{i∉R} ∇F_i(w^U_t)‖.

We apply the Cauchy mean value theorem over −(η/(n−r))Σ_{i∉R} ∇F_i(w_t) + (η/(n−r))Σ_{i∉R} ∇F_i(w^U_t), i.e.:

−(η/(n−r))Σ_{i∉R} ∇F_i(w_t) + (η/(n−r))Σ_{i∉R} ∇F_i(w^U_t) = (η/(n−r))[Σ_{i∉R} ∫_0^1 H_i(w_t + x(w^U_t − w_t))dx](w^U_t − w_t).

In addition, note that H_{t,i} = ∫_0^1 H_i(w_t + x(w^I_t − w_t))dx. So the formula above becomes:

= ‖w^I_t − w^U_t − (η/(n−r))Σ_{i∉R} H_{t,i}(w^I_t − w^U_t) − (nη/(n−r))(B_{j_m} − H_t)(w^I_t − w^U_t) − (η/(n−r))Σ_{i∉R}(∫_0^1 H_i(w_t + x(w^I_t − w_t))dx)(w^U_t − w_t) − (nη/(n−r))(B_{j_m} − H_t)(w^U_t − w_t) + (η/(n−r))Σ_{i∉R}(∫_0^1 H_i(w_t + x(w^U_t − w_t))dx)(w^U_t − w_t)‖.

Then by applying the triangle inequality and rearranging appropriately, the expression can be bounded as:

≤ ‖(I − (η/(n−r))Σ_{i∉R} H_{t,i})(w^I_t − w^U_t)‖ + ‖(nη/(n−r))(B_{j_m} − H_t)(w^I_t − w^U_t)‖
+ ‖(η/(n−r))[Σ_{i∉R} ∫_0^1 H_i(w_t + x(w^U_t − w_t))dx − ∫_0^1 H_i(w_t + x(w^I_t − w_t))dx](w^U_t − w_t)‖
+ ‖(nη/(n−r))(B_{j_m} − H_t)(w^U_t − w_t)‖,

in which the first term is the main contraction component that always appears in analyses of gradient-descent-type algorithms. The remaining terms are error terms due to the various sources of error: using a quasi-Hessian, not having a quadratic objective (implicitly assumed by the local model at each step), and using the iterate w^I for our update instead of the correct w^U.

Then by using the following facts:

1. ‖I − ηH_{t,i}‖ ≤ 1 − ηµ;
2. from Theorem 6 on the approximation accuracy of the quasi-Hessian to the mean Hessian, we have the error bound ‖H_t − B_{j_m}‖ ≤ ξ_{j_1,j_m};
3. we can bound the difference of integrated Hessians using the strategy from equation (S20);
4. from Theorem 4, we have the error bound ‖w^U_t − w_t‖ ≤ M_1 r/n (which requires no additional assumptions),

the expression above can be bounded as follows:

≤ (1 − ηµ + (nη/(n−r))ξ_{j_1,j_m})‖w^I_t − w^U_t‖ + (ηc_0/2)‖w^U_t − w^I_t‖‖w^U_t − w_t‖ + (nη/(n−r))ξ_{j_1,j_m}‖w^U_t − w_t‖
≤ (1 − ηµ + (nη/(n−r))ξ_{j_1,j_m} + c_0M_1rη/(2n))‖w^I_t − w^U_t‖ + (M_1rη/(n−r))ξ_{j_1,j_m}.  (S34)

Recall from Corollary 1 that ξ_{j_1,j_m} = ξ_{j_0+xT_0,j_0+(m+x−1)T_0} = A d_{j_0+xT_0,j_0+(m+x)T_0−1} + A(1/(1/2 − r/n))M_1r/n decreases with increasing x. So the formula above can be bounded as:

≤ (1 − ηµ + (nη/(n−r))ξ_{j_0,j_0+(m−1)T_0} + c_0M_1rη/(2n))‖w^I_t − w^U_t‖ + (M_1rη/(n−r))ξ_{j_1,j_m}.  (S35)

Also, by plugging the formula for ξ into the expression above and using Lemma 8 (contraction of the GD updates), we get:

≤ (1 − ηµ + (nη/(n−r))ξ_{j_0,j_0+(m−1)T_0} + c_0M_1rη/(2n))‖w^I_t − w^U_t‖ + (M_1rη/(n−r))(A d_{j_0+xT_0,j_0+(m+x)T_0−1} + A(1/(1/2 − r/n))M_1r/n)
≤ (1 − ηµ + (nη/(n−r))ξ_{j_0,j_0+(m−1)T_0} + c_0M_1rη/(2n))‖w^I_t − w^U_t‖ + (M_1rη/(n−r))(A(1 − ηµ)^{j_0+xT_0}d_{0,mT_0−1} + A(1/(1/2 − r/n))M_1r/n).  (S36)

Now we argue that it is possible to choose the hyperparameters such that ξ_{j_0,j_0+(m−1)T_0} ≤ (1 − r/n)µ − c_0M_1r(n−r)/(2n²). Then 1 − ηµ + (nη/(n−r))ξ_{j_0,j_0+(m−1)T_0} + c_0M_1rη/(2n) is a constant for all t and smaller than 1. Denoting C := µ − (n/(n−r))ξ_{j_0,j_0+(m−1)T_0} − c_0M_1r/(2n), the formula above can be written as:

= (1 − ηC)‖w^I_t − w^U_t‖ + (M_1rη/(n−r))(A(1 − ηµ)^{j_0+xT_0}d_{0,mT_0−1} + A(1/(1/2 − r/n))M_1r/n).

This can be used recursively until iteration j_m = j_0 + (x+m−1)T_0, i.e.:

≤ (1 − ηC)^{t−(j_0+(x+m−1)T_0)−1}‖w^I_{j_0+(x+m−1)T_0+1} − w^U_{j_0+(x+m−1)T_0+1}‖
+ [1 − (1 − ηC)^{t−(j_0+(x+m−1)T_0)}]/(ηC) · (M_1rη/(n−r))(A(1 − ηµ)^{j_0+xT_0}d_{0,mT_0−1} + A(1/(1/2 − r/n))M_1r/n)
≤ (1 − ηC)^{t−(j_0+(x+m−1)T_0)−1}‖w^I_{j_0+(x+m−1)T_0+1} − w^U_{j_0+(x+m−1)T_0+1}‖
+ (M_1r/(C(n−r)))(A(1 − ηµ)^{j_0+xT_0}d_{0,mT_0−1} + A(1/(1/2 − r/n))M_1r/n).

We can set t = j_0 + (y+m)T_0 for any y = 1, 2, ..., x−1; the formula above can then be rewritten as:

‖w^I_{j_0+(y+m)T_0} − w^U_{j_0+(y+m)T_0}‖ ≤ (1 − ηC)^{T_0−1}‖w^I_{j_0+(y+m−1)T_0+1} − w^U_{j_0+(y+m−1)T_0+1}‖ + (M_1r/(C(n−r)))(A(1 − ηµ)^{j_0+yT_0}d_{0,mT_0−1} + A(1/(1/2 − r/n))M_1r/n).

Then at iteration t = j_0 + (y+m−1)T_0 the gradient is explicitly evaluated, which means that:

‖w^I_{j_0+(y+m−1)T_0+1} − w^U_{j_0+(y+m−1)T_0+1}‖ ≤ (1 − ηµ)‖w^I_{j_0+(y+m−1)T_0} − w^U_{j_0+(y+m−1)T_0}‖.

Since C = µ − (n/(n−r))ξ_{j_0,j_0+(m−1)T_0} − c_0M_1r/(2n), we have 1 − ηµ < 1 − ηC and thus

‖w^I_{j_0+(y+m−1)T_0+1} − w^U_{j_0+(y+m−1)T_0+1}‖ ≤ (1 − ηC)‖w^I_{j_0+(y+m−1)T_0} − w^U_{j_0+(y+m−1)T_0}‖,

which can be plugged into the formula above:

‖w^I_{j_0+(y+m)T_0} − w^U_{j_0+(y+m)T_0}‖ ≤ (1 − ηC)^{T_0}‖w^I_{j_0+(y+m−1)T_0} − w^U_{j_0+(y+m−1)T_0}‖ + (M_1r/(C(n−r)))(A(1 − ηµ)^{j_0+yT_0}d_{0,mT_0−1} + A(1/(1/2 − r/n))M_1r/n).

This can be used recursively over y = x−1, x−2, ..., 2, 1:

‖w^I_{j_0+(y+m)T_0} − w^U_{j_0+(y+m)T_0}‖
≤ (1 − ηC)^{yT_0}‖w^I_{j_0+mT_0} − w^U_{j_0+mT_0}‖ + Σ_{p=1}^{y}(1 − ηC)^{(y−p)T_0}(M_1r/(C(n−r)))(A(1 − ηµ)^{j_0+pT_0}d_{0,mT_0−1} + A(1/(1/2 − r/n))M_1r/n)
= (1 − ηC)^{yT_0}‖w^I_{j_0+mT_0} − w^U_{j_0+mT_0}‖ + Σ_{p=1}^{y}(1 − ηC)^{(y−p)T_0}(M_1r/(C(n−r)))A(1 − ηµ)^{j_0+pT_0}d_{0,mT_0−1} + Σ_{p=1}^{y}(1 − ηC)^{(y−p)T_0} AM_1²r²/(C(n−r)(n/2−r)),  (S37)

in which

Σ_{p=1}^{y}(1 − ηC)^{(y−p)T_0}(M_1r/(C(n−r)))A(1 − ηµ)^{j_0+pT_0}d_{0,mT_0−1} = (AM_1r/(C(n−r)))(1 − ηC)^{yT_0}(1 − ηµ)^{j_0}d_{0,mT_0−1} Σ_{p=1}^{y}(1 − ηC)^{−pT_0}(1 − ηµ)^{pT_0}.

Recall that since 1 − ηC > 1 − ηµ, the formula above can be bounded as:

Σ_{p=1}^{y}(1 − ηC)^{(y−p)T_0}(M_1r/(C(n−r)))A(1 − ηµ)^{j_0+pT_0}d_{0,mT_0−1} ≤ (AM_1r/(C(n−r)))(1 − ηC)^{yT_0}(1 − ηµ)^{j_0}d_{0,mT_0−1} · 1/(1 − ((1 − ηµ)/(1 − ηC))^{T_0}).

Also, Σ_{p=1}^{y}(1 − ηC)^{(y−p)T_0}AM_1²r²/(C(n−r)(n/2−r)) can be simplified to:

Σ_{p=1}^{y}(1 − ηC)^{(y−p)T_0}AM_1²r²/(C(n−r)(n/2−r)) = Σ_{p=0}^{y−1}(1 − ηC)^{pT_0}AM_1²r²/(C(n−r)(n/2−r)) ≤ [1/(1 − (1 − ηC)^{T_0})] · AM_1²r²/(C(n−r)(n/2−r)).

So equation (S37) can be further bounded as:

‖w^I_{j_0+(y+m)T_0} − w^U_{j_0+(y+m)T_0}‖ ≤ (1 − ηC)^{yT_0}‖w^I_{j_0+mT_0} − w^U_{j_0+mT_0}‖ + (AM_1r/(C(n−r)))(1 − ηC)^{yT_0}(1 − ηµ)^{j_0}d_{0,mT_0−1} · 1/(1 − ((1 − ηµ)/(1 − ηC))^{T_0}) + [1/(1 − (1 − ηC)^{T_0})] · AM_1²r²/(C(n−r)(n/2−r)).  (S38)

When t → ∞ and thus y → ∞, (1 − ηC)^{yT_0} → 0 and thus

‖w^I_{j_0+(y+m)T_0} − w^U_{j_0+(y+m)T_0}‖ = o(r/n).

A.3. Results for stochastic gradient descent

A.3.1. QUASI-NEWTON

We modify Equations (S13) and (S12) to SGD versions:

B^S_{j_{k+1}} = B^S_{j_k} − B^S_{j_k}∆w^S_{j_k}(∆w^S_{j_k})^T B^S_{j_k}/((∆w^S_{j_k})^T B^S_{j_k}∆w^S_{j_k}) + ∆g^S_{j_k}(∆g^S_{j_k})^T/((∆g^S_{j_k})^T∆w^S_{j_k}),  (S39)

(B^S_{j_{k+1}})^{-1} = (I − ∆w^S_{j_k}(∆g^S_{j_k})^T/((∆g^S_{j_k})^T∆w^S_{j_k})) (B^S_{j_k})^{-1} (I − ∆g^S_{j_k}(∆w^S_{j_k})^T/((∆g^S_{j_k})^T∆w^S_{j_k})) + ∆w^S_{j_k}(∆w^S_{j_k})^T/((∆g^S_{j_k})^T∆w^S_{j_k}).  (S40)

This iteration has the same initialization as B_{j_k} and B_{j_k}^{-1}, but relies on the history information collected from the SGD-based training process, [∆w^S_{j_0}, ∆w^S_{j_1}, ..., ∆w^S_{j_{m−1}}] and [∆g^S_{j_0}, ∆g^S_{j_1}, ..., ∆g^S_{j_{m−1}}], where ∆w^S_{j_x} = w^S_{j_x} − w^{I,S}_{j_x} and ∆g^S_{j_x} = G_{B,S}(w^{I,S}_{j_x}) − G_{B,S}(w^S_{j_x}) (x = 0, 1, 2, ..., m−1). By the same argument as the proof of Lemma 6, the following inequality holds:

K_1‖z‖² ≤ z^T B^S_{j_k} z ≤ K_2‖z‖²,  (S41)

where K_1 := [ (1 + L/µ)^{2m}(L/µ) + (1 − (1 + L/µ)^{2m})/(1 − (1 + L/µ)²) · (1/µ) ]^{-1} and K_2 := (m+1)L, which are both positive values giving a lower and an upper bound on the eigenvalues of B^S_{j_k}.

A.3.2. PROOF PRELIMINARIES

Similar to the argument for the GD version of DeltaGrad, we can give an upper bound on δ_{t,S}.

Lemma 9 (Upper bound on δ_{t,S}). Define δ_{t,S} = G_{B,S}(w^{U,S}_t) − G^U_{B−∆B,S}(w^{U,S}_t). Then ‖δ_{t,S}‖ ≤ 2c_2∆B_t/B. Moreover, with probability higher than 1 − t · 2exp(−2√B),

‖δ_{t′,S}‖ ≤ 2c_2(r/n + 1/B^{1/4})

uniformly over all iterations t′ ≤ t.

Proof. Recall that

G_{B,S}(w^{U,S}_t) = (1/B)Σ_{i∈B_t} ∇F_i(w^{U,S}_t)   and   G^U_{B−∆B,S}(w^{U,S}_t) = (1/(B−∆B_t))Σ_{i∈B_t, i∉R} ∇F_i(w^{U,S}_t).

By subtracting G_{B,S}(w^{U,S}_t) from G^U_{B−∆B,S}(w^{U,S}_t), we have:

‖G^U_{B−∆B,S}(w^{U,S}_t) − G_{B,S}(w^{U,S}_t)‖ = ‖(1/B)Σ_{i∈B_t} ∇F_i(w^{U,S}_t) − (1/(B−∆B_t))Σ_{i∈B_t, i∉R} ∇F_i(w^{U,S}_t)‖
= ‖(1/B)Σ_{i∈B_t, i∈R} ∇F_i(w^{U,S}_t) + (1/B − 1/(B−∆B_t))Σ_{i∈B_t, i∉R} ∇F_i(w^{U,S}_t)‖.

Then by using the triangle inequality and the fact that ‖∇F_i(w^{U,S}_t)‖ ≤ c_2 (Assumption 3), the formula above can be bounded by 2∆B_t c_2/B.

Because of the randomness from SGD, the r removed samples can be viewed as uniformly distributed among all n training samples. Each sample in a mini-batch is one of the removed samples according to the outcome of a Bernoulli(r/n) random variable S_i. Within a single mini-batch B_{t′} at iteration t′, we get E(Σ_{i∈B_{t′}}S_i) = E(∆B_{t′}) = B r/n and Var(Σ_{i∈B_{t′}}S_i) = B (r/n)(1 − r/n). So for the random variable ∆B_{t′}/B, the expectation and variance are E(∆B_{t′}/B) = r/n and Var(∆B_{t′}/B) = (r/(Bn))(1 − r/n).

Then based on Hoeffding's inequality, the following inequality holds:

Pr(|∆B_{t′}/B − r/n| ≤ ε) ≥ 1 − 2exp(−2ε²B).

Then by setting ε = 1/B^{1/4}, the formula above can be written as:

Pr(|∆B_{t′}/B − r/n| ≥ 1/B^{1/4}) ≤ 2exp(−2√B).

Then by taking the union bound over all the iterations before t, we get that with probability higher than 1 − t · 2exp(−2√B),

|∆B_{t′}/B − r/n| ≤ 1/B^{1/4}   and thus   ∆B_{t′}/B ≤ r/n + 1/B^{1/4}  (S42)

for all t′ ≤ t.

In what follows, we use Ψ1 to represent Ψ1 := 2exp(−2√B), which goes to 0 for large B.
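The Hoeffding step in Lemma 9 is easy to check by simulation under the stated sampling model, where ∆B_t is a Binomial(B, r/n) variable; the values of n, r, and B below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(5)
n, r, B, trials = 10000, 100, 1024, 200000
delta_B = rng.binomial(B, r / n, size=trials)       # Delta B_t under the sampling model
eps = B ** (-0.25)
emp_tail = np.mean(np.abs(delta_B / B - r / n) >= eps)
print(emp_tail, 2 * np.exp(-2 * np.sqrt(B)))        # empirical tail vs. Hoeffding bound 2*exp(-2*sqrt(B))
```

Both numbers are essentially zero here; the bound is loose, but that is enough for the union over iterations.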

Next we provide a bound for the sum of randomly sampled Hessian matrices within a minibatch in SGD.

Theorem 8 (Hessian matrix bound in SGD). With probability higher than

1 − 2p · exp( −log(2p)√B / (4 + (2/3)(log²(2p)/B)^{1/4}) ),

for a given iteration t, ‖(1/B)Σ_{i∈B_t} H_i(w^S_t) − H(w^S_t)‖ ≤ L(log²(2p)/B)^{1/4}, where p represents the number of model parameters.


Proof. We use the matrix Bernstein inequality, Lemma 5. Define the random matrices S_i = (H_i(w) − H(w))/B for i ∈ B_t. Due to the randomness from SGD, E(S_i) = E((H_i(w) − H(w))/B) = 0. Using the sum Z as required in Lemma 5, Z = (1/B)Σ_{i∈B_t} H_i(w) − H(w). Also note that H(w) and H_i(w) are both p × p matrices, so d_1 = d_2 = p in Lemma 5.

Furthermore, for each S_i = (H_i(w) − H(w))/B, its norm is bounded by 2L/B based on the smoothness condition, which means that J = 2L/B in Lemma 5. Then we can explicitly calculate the upper bound on E(S_iS_i^*) and V(Z):

‖E(S_iS_i^*)‖ ≤ E(‖S_iS_i^*‖) ≤ E(‖S_i‖‖S_i^*‖) ≤ J² = 4L²/B²,
V(Z) ≤ Σ_{i∈B_t} 4L²/B² = 4L²/B.

Thus by plugging these expressions into equations (S9) and (S10), we get:

Pr(‖Z‖ ≥ x) = Pr(‖(1/B)Σ_{i∈B_t} H_i(w) − H(w)‖ ≥ x) ≤ (d_1 + d_2) exp( −x²/(4L²/B + 2Lx/(3B)) ) = 2p · exp( −x²/(4L²/B + 2Lx/(3B)) ),  for all x ≥ 0,  (S43)

E(‖Z‖) = E(‖(1/B)Σ_{i∈B_t} H_i(w) − H(w)‖) ≤ √(8L² log(d_1 + d_2)/B) + (2L/(3B)) log(d_1 + d_2) = √(8L² log(2p)/B) + (2L/(3B)) log(2p).  (S44)

Then by setting x = L(log²(2p)/B)^{1/4}, Equation (S43) becomes:

Pr(‖Z‖ ≥ L(log²(2p)/B)^{1/4}) = Pr(‖(1/B)Σ_{i∈B_t} H_i(w) − H(w)‖ ≥ L(log²(2p)/B)^{1/4})
≤ 2p · exp( −L² log(2p)/√B / (4L²/B + (2L²/(3B))(log²(2p)/B)^{1/4}) )
= 2p · exp( −log(2p)√B / (4 + (2/3)(log²(2p)/B)^{1/4}) ).  (S45)

For large mini-batch size B, both L(log²(2p)/B)^{1/4} and 2p · exp(−log(2p)√B/(4 + (2/3)(log²(2p)/B)^{1/4})) approach 0.

In what follows, we use Ψ2 to denote the probability Ψ2 := 2p · exp(−log(2p)√B/(4 + (2/3)(log²(2p)/B)^{1/4})).
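The rate L(log²(2p)/B)^{1/4} of Theorem 8 can be compared against a simulation. The sketch below uses hypothetical per-sample Hessians H_i = x_i x_i^T with bounded rows (as for a least-squares loss), so that ‖H_i‖ ≤ L:

```python
import numpy as np

rng = np.random.default_rng(6)
N, p, B = 50000, 10, 1024
X = rng.uniform(-1.0, 1.0, size=(N, p))
H_full = X.T @ X / N                              # population Hessian for a least-squares loss
L = np.max(np.sum(X ** 2, axis=1))                # ||H_i|| = ||x_i||^2 <= L

devs = []
for _ in range(50):
    batch = rng.choice(N, size=B, replace=False)
    devs.append(np.linalg.norm(X[batch].T @ X[batch] / B - H_full, 2))
print(max(devs), L * (np.log(2 * p) ** 2 / B) ** 0.25)   # observed deviation vs. Theorem 8 rate
```

The observed deviations sit well below the bound, which is loose but of the right order in B.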

Based on this result, we can derive an SGD version of Theorem 3 as below, which also relies on a preliminary estimate of the bound on ‖w^{I,S}_t − w^S_t‖.


Theorem 9 (Error in mean Hessian, and in secant equation with incorrect quasi-Hessian for SGD). Suppose that ‖w^S_{t′} − w^{I,S}_{t′}‖ ≤ M_1 (1/(1/2 − r/n − 1/B^{1/4})) (r/n + 1/B^{1/4}) and ‖(1/B)Σ_{i∈B_{t′}} H_i(w^S_{t′}) − H(w^S_{t′})‖ ≤ L(log²(2p)/B)^{1/4} hold for any t′ ≤ t, where M_1 = 2c_2/µ, µ is from Assumption 2 and c_2 is from Assumption 3. Let e = (L(L+1) + K_2L)/(µK_1), for the upper and lower bounds K_1, K_2 on the eigenvalues of the quasi-Hessian from equation (S41) and the Lipschitz constant c_0 of the Hessian. For any t_1, t_2 such that 1 ≤ t_1 < t_2 ≤ t, we have:

‖H^S_{t_1} − H^S_{t_2}‖ ≤ 2L(log²(2p)/B)^{1/4} + c_0 d_{t_1,t_2} + 3c_0M_1 (1/(1/2 − r/n − 1/B^{1/4})) (r/n + 1/B^{1/4}).

For any j_1, j_2, ..., j_m such that j_m ≤ t′ ≤ j_m + T_0 − 1 and t′ ≤ t, we have:

‖∆g^S_{j_k} − B^S_{j_q}∆w^S_{j_k}‖ ≤ [(1 + e)^{j_q−j_k−1} − 1] · [2L(log²(2p)/B)^{1/4} + c_0 d_{j_k,j_q} + (3c_0M_1/(1/2 − r/n − 1/B^{1/4}))(r/n + 1/B^{1/4})] · s_{j_m,j_1}.

Here s_{j_m,j_1} = max(‖∆w^S_a‖)_{a=j_1,j_2,...,j_m}, d_{j_k,j_q} = max(‖w^S_a − w^S_b‖)_{j_k ≤ a ≤ b ≤ j_q}, and H^S_t is the average of the Hessian matrices evaluated between w^S_t and w^{I,S}_t for the samples in mini-batch B_t:

H^S_t = (1/B)Σ_{i∈B_t} ∫_0^1 H_i(w^S_t + x(w^{I,S}_t − w^S_t))dx.

Proof. First, let us bound ‖H^S_{t_1} − ∫_0^1 H(w^S_{t_1} + x(w^{I,S}_{t_1} − w^S_{t_1}))dx‖ by adding and subtracting (1/B)Σ_{i∈B_{t_1}} H_i(w^S_{t_1}) and H(w^S_{t_1}) inside the norm:

‖H^S_{t_1} − ∫_0^1 H(w^S_{t_1} + x(w^{I,S}_{t_1} − w^S_{t_1}))dx‖
= ‖∫_0^1 (1/B)Σ_{i∈B_{t_1}} H_i(w^S_{t_1} + x(w^{I,S}_{t_1} − w^S_{t_1}))dx − ∫_0^1 H(w^S_{t_1} + x(w^{I,S}_{t_1} − w^S_{t_1}))dx‖
= ‖∫_0^1 (1/B)Σ_{i∈B_{t_1}} (H_i(w^S_{t_1} + x(w^{I,S}_{t_1} − w^S_{t_1})) − H_i(w^S_{t_1}))dx + (1/B)Σ_{i∈B_{t_1}} H_i(w^S_{t_1}) − ∫_0^1 (H(w^S_{t_1} + x(w^{I,S}_{t_1} − w^S_{t_1})) − H(w^S_{t_1}))dx − H(w^S_{t_1})‖.

Then by using the triangle inequality and Assumption 4, the formula above can be bounded as:

≤ ∫_0^1 (1/B)Σ_{i∈B_{t_1}} ‖H_i(w^S_{t_1} + x(w^{I,S}_{t_1} − w^S_{t_1})) − H_i(w^S_{t_1})‖dx + ∫_0^1 ‖H(w^S_{t_1} + x(w^{I,S}_{t_1} − w^S_{t_1})) − H(w^S_{t_1})‖dx + ‖(1/B)Σ_{i∈B_{t_1}} H_i(w^S_{t_1}) − H(w^S_{t_1})‖
≤ (1/B)Σ_{i∈B_{t_1}} ∫_0^1 c_0 x‖w^{I,S}_{t_1} − w^S_{t_1}‖dx + ∫_0^1 c_0 x‖w^{I,S}_{t_1} − w^S_{t_1}‖dx + ‖(1/B)Σ_{i∈B_{t_1}} H_i(w^S_{t_1}) − H(w^S_{t_1})‖
≤ c_0‖w^{I,S}_{t_1} − w^S_{t_1}‖ + ‖(1/B)Σ_{i∈B_{t_1}} H_i(w^S_{t_1}) − H(w^S_{t_1})‖.  (S46)

Then based on the above results, we can compute the bound on ‖H^S_{t_1} − H^S_{t_2}‖, for which we use the triangle inequality first:

‖H^S_{t_1} − H^S_{t_2}‖ ≤ ‖H^S_{t_1} − ∫_0^1 H(w^S_{t_1} + x(w^{I,S}_{t_1} − w^S_{t_1}))dx‖
+ ‖∫_0^1 H(w^S_{t_1} + x(w^{I,S}_{t_1} − w^S_{t_1}))dx − ∫_0^1 H(w^S_{t_2} + x(w^{I,S}_{t_2} − w^S_{t_2}))dx‖
+ ‖∫_0^1 H(w^S_{t_2} + x(w^{I,S}_{t_2} − w^S_{t_2}))dx − H^S_{t_2}‖.  (S47)

Then by using the result from Formula (S46), this can be further bounded as:

≤ c_0‖w^{I,S}_{t_1} − w^S_{t_1}‖ + ‖(1/B)Σ_{i∈B_{t_1}} H_i(w^S_{t_1}) − H(w^S_{t_1})‖ + c_0‖w^{I,S}_{t_2} − w^S_{t_2}‖ + ‖(1/B)Σ_{i∈B_{t_2}} H_i(w^S_{t_2}) − H(w^S_{t_2})‖ + ‖∫_0^1 H(w^S_{t_1} + x(w^{I,S}_{t_1} − w^S_{t_1}))dx − ∫_0^1 H(w^S_{t_2} + x(w^{I,S}_{t_2} − w^S_{t_2}))dx‖.

Since ‖(1/B)Σ_{i∈B_{t′}} H_i(w^S_{t′}) − H(w^S_{t′})‖ ≤ L(log²(2p)/B)^{1/4} for any t′ ≤ t, the formula above can be bounded as:

≤ 2L(log²(2p)/B)^{1/4} + c_0‖w^S_{t_1} − w^S_{t_2}‖ + (c_0/2)‖w^S_{t_1} − w^{I,S}_{t_1}‖ + (c_0/2)‖w^{I,S}_{t_2} − w^S_{t_2}‖ + c_0‖w^{I,S}_{t_1} − w^S_{t_1}‖ + c_0‖w^{I,S}_{t_2} − w^S_{t_2}‖
= 2L(log²(2p)/B)^{1/4} + 3c_0M_1 (1/(1/2 − r/n − 1/B^{1/4})) (r/n + 1/B^{1/4}) + c_0 d_{t_1,t_2}.

This finishes the proof of the first inequality. Then by defining

f = ( 2L(log²(2p)/B)^{1/4} + 3c_0M_1 (1/(1/2 − r/n − 1/B^{1/4})) (r/n + 1/B^{1/4}) + c_0 d_{j_k,j_q} ) s_{j_m,j_1}

and using the same argument as Equations (S21)-(S23) (except that ∆w and ∆g are replaced with ∆w^S and ∆g^S), the following inequality holds:

b_{j_q} = ‖∆g^S_{j_k} − (B^S_{j_q} − B^S_{j_q}∆w^S_{j_q}(∆w^S_{j_q})^T B^S_{j_q}/((∆w^S_{j_q})^T B^S_{j_q}∆w^S_{j_q}) + ∆g^S_{j_q}(∆g^S_{j_q})^T/((∆g^S_{j_q})^T∆w^S_{j_q}))∆w^S_{j_k}‖ ≤ [(1 + e)^{j_q−j_k} − 1]f,  (S48)

and thus ‖∆g^S_{j_k} − B^S_{j_q}∆w^S_{j_k}‖ ≤ [(1 + e)^{j_q−j_k−1} − 1]f, which finishes the proof.

For simplicity, we denote M^S_1 := M_1/(1/2 − r/n − 1/B^{1/4}). So the preliminary estimate of the bound on ‖w^S_{t′} − w^{I,S}_{t′}‖ becomes ‖w^S_{t′} − w^{I,S}_{t′}‖ ≤ M^S_1 (r/n + 1/B^{1/4}).

Similarly, we get an SGD version of Corollary 1.

Corollary 2 (Approximation accuracy of quasi-Hessian to mean Hessian). Suppose that ‖w^S_{t′} − w^{I,S}_{t′}‖ ≤ M^S_1 (r/n + 1/B^{1/4}) and ‖(1/B)Σ_{i∈B_{t′}} H_i(w^S_{t′}) − H(w^S_{t′})‖ ≤ L(log²(2p)/B)^{1/4} hold for any t′ ≤ t. M_1 and M^S_1 are as in Theorem 9, i.e. M_1 = 2c_2/µ and M^S_1 = M_1/(1/2 − r/n − 1/B^{1/4}). Then for any t′ and j_m such that j_m ≤ t′ ≤ j_m + T_0 − 1 and t′ ≤ t, the following inequality holds:

‖H^S_{t′} − B^S_{j_m}‖ ≤ ξ^S_{j_1,j_m} := A( d_{j_1,j_m+T_0−1} + 3M^S_1(r/n + 1/B^{1/4}) + (2/c_0)L(log²(2p)/B)^{1/4} ),

where recall again that c_0 is the Lipschitz constant of the Hessian, d_{j_1,j_m+T_0−1} is the maximal gap between the iterates of the SGD algorithm on the full data from j_1 to j_m + T_0 − 1, and A = c_0√m[(1 + e)^m − 1]/c_1 + c_0, in which e is a problem dependent constant defined in Theorem 9 and c_1 is the "strong independence" constant from Assumption 5.

The proof is similar to the proof of Corollary 1. First of all, H, B, ξ_{j_1,j_m} in Corollary 1 are replaced with H^S, B^S, ξ^S_{j_1,j_m}. Second, Theorem 9 holds and thus the following inequality holds:

‖H^S_{t′} − H^S_{j_m}‖ ≤ 2L(log²(2p)/B)^{1/4} + c_0 d_{t′,j_m} + 3c_0 M^S_1(r/n + 1/B^{1/4}).

By using strong independence from Assumption 5, ‖H^S_{j_m} − B^S_{j_m}‖ can be bounded as:

‖H^S_{j_m} − B^S_{j_m}‖ ≤ (√m[(1 + e)^m − 1] c_0/c_1) · ( d_{j_1,j_m+T_0−1} + 3M^S_1(r/n + 1/B^{1/4}) + (2/c_0)L(log²(2p)/B)^{1/4} ).  (S49)

Then by combining the two formulas above, we know that Corollary 2 holds. Note that the definition of ξ^S_{j_1,j_m} can be rewritten as:

ξ^S_{j_1,j_m} = A( d_{j_1,j_m+T_0−1} + 3M^S_1(r/n + 1/B^{1/4}) + (2/c_0)L(log²(2p)/B)^{1/4} ) =: A d_{j_1,j_m+T_0−1} + A_1 r/n + A_2/B^{1/4},  (S50)

in which A_1 := 3AM^S_1 and A_2 := 3AM^S_1 + 2AL(log(2p))^{1/2}/c_0.

We can do a similar analysis to Lemma 8 by simply replacing w_t and F(·) with w^S_t and G_{B,S}.

Lemma 10. Use the definition of d_{k,q} from Theorem 9:

d_{k,q} = max(‖w^S_a − w^S_b‖)_{k ≤ a ≤ b ≤ q},

where k < q ≤ t. Then d_{k,q} ≤ (1 − ηµ)^k d_{0,q−k} + M_1((log(p+1))²/B)^{1/4} holds with probability higher than 1 − t(p+1)exp( −log(p+1)√B / (4 + (2/3)((log(p+1))²/B)^{1/4}) ).

Proof. According to Lemma 5, we can define the random vectors S_i = (1/B)(∇F_i(w^S_a) − ∇F(w^S_a)) for i ∈ B_t, where recall that ∇F(w^S_a) = (1/n)Σ_{i=1}^n ∇F_i(w^S_a). Due to the randomness from SGD, we know that E(S_i) = 0. Based on the definition of Z in Lemma 5, Z = (1/B)Σ_{i∈B_t} ∇F_i(w^S_a) − ∇F(w^S_a). Also note that ∇F_i(w^S_a) and ∇F(w^S_a) are both p × 1 matrices, so d_1 = p and d_2 = 1 in Lemma 5.

Moreover, according to Assumption 3, ‖∇F_i(w^S_a)‖ ≤ c_2. Then we know that V(Z) ≤ 4c_2²/B and ‖S_i‖ ≤ 2c_2/B. So according to Lemma 5, the following inequality holds:

Pr(‖Z‖ ≥ x) = Pr(‖(1/B)Σ_{i∈B_t} ∇F_i(w^S_a) − ∇F(w^S_a)‖ ≥ x) ≤ (d_1 + d_2) exp( −x²/(4c_2²/B + 2c_2x/(3B)) ) = (p+1) exp( −x²/(4c_2²/B + 2c_2x/(3B)) ),  for all x ≥ 0.  (S51)

By setting x = c_2((log(p+1))²/B)^{1/4}, the formula above is evaluated as:

Pr(‖(1/B)Σ_{i∈B_t} ∇F_i(w^S_a) − ∇F(w^S_a)‖ ≥ c_2((log(p+1))²/B)^{1/4}) ≤ (p+1) exp( −log(p+1)√B / (4 + (2/3)((log(p+1))²/B)^{1/4}) ).

So by taking the union bound over the first t iterations, with probability higher than 1 − t(p+1)exp( −log(p+1)√B / (4 + (2/3)((log(p+1))²/B)^{1/4}) ), the following inequality holds for all t′ ≤ t:

‖(1/B)Σ_{i∈B_{t′}} ∇F_i(w^S_a) − ∇F(w^S_a)‖ ≤ c_2((log(p+1))²/B)^{1/4}.  (S52)

Then by using similar arguments to Lemma 8, we get:

‖w^S_a − w^S_b‖ ≤ (1 − ηµ)^z‖w^S_{a−z} − w^S_{b−z}‖ + (2c_2/µ)((log(p+1))²/B)^{1/4} = (1 − ηµ)^z‖w^S_{a−z} − w^S_{b−z}‖ + M_1((log(p+1))²/B)^{1/4},

and thus d_{k,q} ≤ (1 − ηµ)^k d_{0,q−k} + M_1((log(p+1))²/B)^{1/4} holds with probability higher than 1 − t(p+1)exp( −log(p+1)√B / (4 + (2/3)((log(p+1))²/B)^{1/4}) ). In what follows, we use Ψ3 to denote (p+1)exp( −log(p+1)√B / (4 + (2/3)((log(p+1))²/B)^{1/4}) ).

Then, by using the definition of ξ^S_{j_1,j_m}, with probability higher than 1 − tΨ3 the following inequality holds for any x such that j_0 + (x+m−1)T_0 ≤ t:

ξ^S_{j_1,j_m} = ξ^S_{j_0+xT_0,j_0+(x+m−1)T_0} ≤ (1 − ηµ)^{xT_0} A d_{j_0,j_0+mT_0−1} + A_1 r/n + A_2/B^{1/4} + AM_1((log(p+1))²/B)^{1/4}.  (S53)

A.3.3. MAIN RECURSIONS

We bound the difference between w^{I,S}_t and w^{U,S}_t. First we bound ‖w^S_t − w^{U,S}_t‖:

Theorem 10 (Bound between iterates on full and the leave-r-out dataset). When ∆B_{t′}/B ≤ r/n + 1/B^{1/4} holds for all t′ < t, then ‖w^S_t − w^{U,S}_t‖ ≤ (2c_2/µ)(r/n + 1/B^{1/4}). Since ∆B_{t′}/B ≤ r/n + 1/B^{1/4} holds for all t′ < t with probability higher than 1 − t×Ψ1, with the same probability ‖w^S_{t′+1} − w^{U,S}_{t′+1}‖ ≤ M_1(r/n + 1/B^{1/4}) for all iterations t′ < t, where recall that M_1 = 2c_2/µ.

Similarly, we can bound the difference between w^{I,S}_t and w^S_t.

Theorem 11 (Bound between iterates on full data and incrementally updated ones). Suppose that at some iteration t, for any given t′ ≤ t such that j′_m ≤ t′ ≤ j′_m + T_0 − 1, we have the following bounds:

1. ‖H^S_{t′} − B^S_{j′_m}‖ ≤ ξ^S_{j′_1,j′_m} = A d_{j′_1,j′_m+T_0−1} + A (3/(1/2 − r/n)) M_1(r/n + 1/B^{1/4}) + A(2/c_0)L(log²(2p)/B)^{1/4};
2. ∆B_{t′}/B ≤ r/n + 1/B^{1/4};
3. Formula (S53) holds for any x such that j_0 + (x+m−1)T_0 ≤ t;
4. ξ^S_{j_0,j_0+(m−1)T_0} + A·M_1((log(p+1))²/B)^{1/4} ≤ µ/2.

Then

‖w^{I,S}_{t′+1} − w^S_{t′+1}‖ ≤ (2c_2/((1/2 − r/n − 1/B^{1/4})µ))(r/n + 1/B^{1/4}) = M^S_1(r/n + 1/B^{1/4})

for any t′ ≤ t. Recall that c_0 is the Lipschitz constant of the Hessian, and that M_1 and A are defined in Theorem 10 and Corollary 2 respectively and do not depend on t.

In particular, for all t the following inequality holds:

‖w^{I,S}_{t+1} − w^S_{t+1}‖ ≤ (2c_2/((1/2 − r/n − 1/B^{1/4})µ))(r/n + 1/B^{1/4}) = M^S_1(r/n + 1/B^{1/4}).

Similarly, we will show that both inequalities ‖H^S_t − B^S_{j_m}‖ ≤ ξ^S_{j_1,j_m} and ‖w^{I,S}_{t+1} − w^S_{t+1}‖ ≤ M^S_1(r/n + 1/B^{1/4}) hold for all iterations t.

Theorem 12 (Bound between iterates on full data and incrementally updated ones (all iterations)). Suppose that there are T iterations in total for each training phase. Then with probability higher than 1 − T×(Ψ1 + Ψ2 + Ψ3), for any t where j_m < t < j_m + T_0 − 1, ‖w^{I,S}_t − w^S_t‖ ≤ (1/(1/2 − r/n − 1/B^{1/4})) M_1(r/n + 1/B^{1/4}) and ‖H^S_t − B^S_{j_m}‖ ≤ ξ^S_{j_1,j_m}, where ξ^S_{j_1,j_m} is defined in Corollary 2, Ψ1 is defined in Lemma 9, Ψ2 is defined in Theorem 8, and Ψ3 is defined in Lemma 10.

Then we have the following bound for ‖w^{U,S}_t − w^{I,S}_t‖.

Theorem 13 (Main result: Bound between true and incrementally updated iterates for SGD). Suppose that there are T iterations in total for each training phase. Then with probability higher than 1 − T×(Ψ1 + Ψ2 + Ψ3), the result w^{I,S}_t of Algorithm 1 approximates the correct iteration values w^{U,S}_t at the rate

‖w^{U,S}_t − w^{I,S}_t‖ = o(r/n + 1/B^{1/4}).

So ‖w^{U,S}_t − w^{I,S}_t‖ is of a lower order than r/n + 1/B^{1/4}.

A.3.4. PROOF OF THEOREM 10

Proof. By subtracting wSt − wU,St, taking the matrix norm and using the update rule in equation (S5) and (S6), we get:

‖wSt+1 − wU,St+1‖= ‖wSt − ηGB,S(wSt)−

(wU,St − ηGUB−∆B,S(wU,St)

)‖

= ‖wSt − wU,St − η(GB,S(wSt)−GUB−∆B,S(wU,St)

)‖

= ‖wSt − wU,St − η(GB,S(wSt)−GB,S(wU,St)

+GB,S(wU,St)−GUB−∆B,S(wU,St))‖= ‖wSt − wU,St − η

(GB,S(wSt)−GB,S(wU,St)

)+

η(GB,S(wU,St)−GUB−∆B,S(wU,St)

)‖

(S54)

By Cauchy mean-value theorem and the triangle inequality, the above formula becomes:

≤ ‖wSt − wU,St − η(1

B

∫ 1

0

∑i∈Bt

Hi

(wSt + x

(wU,St − wSt

))dx)

(wSt − wU,St

)‖+ η‖δt,S‖

= ‖

(I− η(

1

B

∫ 1

0

∑i∈Bt

Hi

(wSt + x

(wU,St − wSt

))dx)

)(wSt − wU,St

)‖+ η‖δt,S‖

Then by using the Lemma 3 and Lemma 9, the formula above can be bounded as:

≤ (1− ηµ)‖wSt − wU,St‖+ η2c2∆BtB

Page 41: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

Then by using Lemma 9 and using the formula above recursively, we get that with probability higher than 1 − t · Ψ1,‖wSt′ − wU,St′‖ ≤ 2c2

µ ( rn + 1B1/4 ) holds for all iterations t′ ≤ t, which finishes the proof.

A.3.5. PROOF OF THEOREM 11

Proof. For any t′ ≤ t, by subtracting wSt′ by wI,St′ and taking the same argument as equation (S29)-(S31) (except thatwt′ , wI t′ , H, B, n, r are replaced with wSt′ , wI,St′ , HS , BS , B, ∆Bt′), the following equality holds due to the bound on‖HS

t′ − BSjm‖:

‖wI,St′+1 − wSt′+1‖

≤ (1− ηµ+ ηB

B −∆BtξSj1,jm)‖wI,St − wSt‖+

2∆Btηc2B −∆Bt

.(S55)

Since ∆BtB ≤ r

n + 1B1/4 for all iterations between 0 and t, the following two inequalities hold:

2∆Btηc2B −∆Bt

=2ηc2B

∆Bt− 1≤ 2ηc2

1rn+ 1

B1/4

− 1=

2ηc2

1− rn −

1B1/4

(r

n+

1

B1/4), (S56)

B

B −∆Bt=

1

1− ∆BtB

≤ 1

1− ( rn + 1B1/4 )

. (S57)

Moreover, since Formula (S53) holds and ξSj0,j0+(m−1)T0+A×M1

((log(p+1))2

B

)1/4

≤ µ2 , then:

ξSj1,jm = ξSj0+xT0,j0+(x+m−1)T0≤ (1− ηµ)xT0Adj0,j0+mT0−1

+A1r

n+A2

1

B1/4+AM1

((log(p+ 1))2

B

)1/4

≤ Adj0,j0+mT0−1 +A1r

n+A2

1

B1/4+AM1

((log(p+ 1))2

B

)1/4

= ξj0,j0+mT0−1 +AM1

((log(p+ 1))2

B

)1/4

≤ µ

2.

Then the Formula (S55) can be bounded as:

‖wI,St′+1 − wSt′+1‖

≤ (1− ηµ+ ηB

B −∆Bt(ξSj0,j0+(m−1)T0

+A×M1

((log(p+ 1))2

B

)1/4

)‖wI,St′ − wSt′‖+2∆Btηc2B −∆Bt

≤ (1− ηµ+ ηξSj0,j0+(m−1)T0

+A×M1

((log(p+1))2

B

)1/4

1− ( rn + 1B1/4 )

)‖wI,St′ − wSt′‖+2ηc2

1− rn −

1B1/4

(r

n+

1

B1/4),

which uses equation (S56) and (S57). Then applying the formula recursively from iteration t to 0, we can get:

‖wI,St′+1 − wSt′+1‖

≤ 1

η(µ−ξSj0,j0+(m−1)T0

+2c2(

(log(p+1))2

B

)1/4

1−( rn+ 1

B1/4)

)

2ηc2

1− rn −

1B1/4

(r

n+

1

B1/4).

Page 42: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

Then since ξSj0,j0+(m−1)T0≤ µ

2 , the formula above can be further bounded as:

=2c2

(1− rn −

1B1/4 )µ− ξSj0,j0+(m−1)T0

−A×M1

((log(p+1))2

B

)1/4(r

n+

1

B1/4)

≤ 2c2

( 12 −

rn −

1B1/4 )µ

(r

n+

1

B1/4).

A.3.6. PROOF OF THEOREM 12

The proof is the same as the proof of Theorem 6 except that w, wI , n, r, ξj1,jm , H, B need to be replaced by wS , wI,S , B, r,ξSj1,jm , HS , BS and the main theorems that the proof depends on will be replaced by Theorem 11 and Corollary 2. But weneed some careful explanations for the probability, which is shown as:

Proof. We define the following event at a given iteration k:

Ω0(k) = ‖wSk − wU,Sk‖ ≤M1(r

n+

1

B1/4),

Ω1(k) = ‖wSk − wI,Sk‖ ≤MS1 (r

n+

1

B1/4),

Ω2(k) = ‖HSk−1 − BSjm‖ ≤ ξSj1,jm (jm ≤ k − 1 ≤ jm + T0 − 1),

Ω3(k) = ‖

1

B

∑i∈Bk−1

Hi(wSk−1)

−H(wSk−1)‖ ≤ L(

log2(2p)

B

)1/4

,

Ω4(k) = ξSj0+xT0,j0+(x+m−1)T0≤ (1− ηµ)j0+xT0Ad0,mT0−1

+A1r

n+A2

1

B1/4+AM1

((log(p+ 1))2

B

)1/4

where j0 + (x+m− 1)T0 ≤ k − 1 ≤ j0 + (x+m)T0 − 1,

Ω5(k) = ∆Bk−1

B≤ r

n+

1

B1/4.

For all t, according to Corollary 2, the following equation holds:

Pr(

t⋂k=1

Ω2(k)|t−1⋂k=1

Ω1(k),

t⋂k=1

Ω3(k)) = 1.

in which the co-occurrence of multiple events is denoted by⋂

or “,”. So this formula means that the probability that Ω2(k)is true for all k ≤ t given that the events Ω1(k) and Ω3(k) are true at the same time for all k ≤ t is 1.

Similarly, according to Theorem 11, Pr(⋂tk=1 Ω1(k)

∣∣∣⋂tk=1 Ω2(k),⋂tk=1 Ω4(k),

⋂tk=1 Ω5(k)) = 1. Then we know that:

Pr(

t⋂k=1

Ω1(k)∣∣∣ t⋂k=1

Ω2(k),

t⋂k=1

Ω4(k),

t⋂k=1

Ω5(k)) · Pr(

t⋂k=1

Ω2(k)|t−1⋂k=1

Ω1(k),

t⋂k=1

Ω3(k))

= Pr(

t⋂k=1

Ω1(k),

t⋂k=1

Ω2(k)∣∣∣ t⋂k=1

Ω4(k),

t⋂k=1

Ω5(k),

t−1⋂k=1

Ω1(k),

t⋂k=1

Ω3(k)) = 1,

which can be multiplied by

Pr(

t−1⋂k=1

Ω1(k)∣∣∣ t−1⋂k=1

Ω2(k),

t−1⋂k=1

Ω4(k),

t−1⋂k=1

Ω5(k)).

Page 43: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

The result is then multiplied by

Pr(

t−1⋂k=1

Ω2(k)|t−2⋂k=1

Ω1(k),

t−1⋂k=1

Ω3(k)) = 1.

Then the following equality holds:

Pr(

t⋂k=1

Ω1(k),

t⋂k=1

Ω2(k)∣∣∣ t⋂k=1

Ω4(k),

t⋂k=1

Ω5(k),

t−2⋂k=1

Ω1(k),

t⋂k=1

Ω3(k)) = 1

which uses the fact that⋂tk=1 Ωy(k)

⋂⋂t−1k=1 Ωy(k) =

⋂tk=1 Ωy(k) (y = 1, 2, 3, 4, 5). So by repeating this until the

iteration j0, then the following equality holds:

Pr(

t⋂k=1

Ω1(k),

t⋂k=1

Ω2(k)∣∣∣ t⋂k=1

Ω4(k),

t⋂k=1

Ω5(k),

j0⋂k=1

Ω1(k),

t⋂k=1

Ω3(k)) = 1 (S58)

When t ≤ j0, we know that wI,St = wU,St and MS1 ≥ M1, which means that if Ω0(k) holds, then Ω1(k) holds when

wI,St = wU,St, and thus

Pr(

j0⋂k=1

Ω1(k)|j0⋂k=1

Ω0(k)) = 1.

Then according to Theorem 10, we know that:

Pr(

j0⋂k=1

Ω0(k)∣∣∣ j0⋂k=1

Ω5(k)) = 1.

By multiplying the above two formulas, we get:

Pr(

j0⋂k=1

Ω1(k)|j0⋂k=1

Ω0(k)) · Pr(

j0⋂k=1

Ω0(k)∣∣∣ j0⋂k=1

Ω5(k))

= Pr(

j0⋂k=1

Ω1(k),

j0⋂k=1

Ω0(k)∣∣∣ j0⋂k=1

Ω5(k)) = 1

Note that since the probability of two joint events is smaller than that of either of the events, the following inequality holds:

Pr(

j0⋂k=1

Ω1(k),

j0⋂k=1

Ω0(k)∣∣∣ j0⋂k=1

Ω5(k)) ≤ Pr(

j0⋂k=1

Ω1(k)∣∣∣ j0⋂k=1

Ω5(k)) ≤ 1.

So we know that:

Pr(

j0⋂k=1

Ω1(k)∣∣∣ j0⋂k=1

Ω5(k)) = 1.

which can be multiplied by Formula (S58) and thus the following equality holds:

Pr(

t⋂k=1

Ω1(k),

t⋂k=1

Ω2(k)∣∣∣ t⋂k=1

Ω4(k),

t⋂k=1

Ω5(k),

t⋂k=1

Ω3(k)) = 1 (S59)

Page 44: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

Then we can compute the probability of the negation of the joint event (⋂t+1k=1 Ω1(k),

⋂tk=1 Ω2(k)):

Pr(

t⋂k=1

Ω1(k),

t⋂k=1

Ω2(k))

= Pr(

t⋂k=1

Ω1(k),

t⋂k=1

Ω2(k)∣∣∣ t⋂k=1

Ω4(k),

t⋂k=1

Ω5(k),

t⋂k=1

Ω3(k)) · Pr(

t⋂k=1

Ω4(k),

t⋂k=1

Ω5(k),

t⋂k=1

Ω3(k))

+ Pr(

t⋂k=1

Ω1(k),

t⋂k=1

Ω2(k)∣∣∣ t⋂k=1

Ω4(k),

t⋂k=1

Ω5(k),

t⋂k=1

Ω3(k)) · Pr(

t⋂k=1

Ω4(k),

t⋂k=1

Ω5(k),

t⋂k=1

Ω3(k))

≤ Pr(

t⋂k=1

Ω1(k),

t⋂k=1

Ω2(k)∣∣∣ t⋂k=1

Ω4(k),

t⋂k=1

Ω5(k),

t⋂k=1

Ω3(k)) + Pr(

t⋂k=1

Ω4(k),

t⋂k=1

Ω5(k),

t⋂k=1

Ω3(k))

= Pr(

t⋂k=1

Ω4(k),

t⋂k=1

Ω5(k),

t⋂k=1

Ω3(k)).

The last two steps use the fact that

Pr(

t⋂k=1

Ω1(k),

t⋂k=1

Ω2(k)∣∣∣ t⋂k=1

Ω4(k),

t⋂k=1

Ω5(k),

t⋂k=1

Ω3(k)) = 0

and

Pr(

t⋂k=1

Ω1(k),

t⋂k=1

Ω2(k)∣∣∣ t⋂k=1

Ω4(k),

t⋂k=1

Ω5(k),

t⋂k=1

Ω3(k)) ≤ 1.

By further using the property of the probability of the union of multiply events, the formula above is bounded as:

Pr(

t⋂k=1

Ω4(k),

t⋂k=1

Ω5(k),

t⋂k=1

Ω3(k)) ≤ Pr(

t⋂k=1

Ω4(k)⋃ t⋂

k=1

Ω5(k)⋃ t⋂

k=1

Ω3(k))

≤ Pr(

t⋂k=1

Ω4(k)) + Pr(

t⋂k=1

Ω5(k)) + Pr(

t⋂k=1

Ω3(k)).

Then by using Theorem 8, Formula (S53), Lemma 9 and taking the union between iteration 0 and t, we get:

Pr(

t⋂k=1

Ω3(k)) ≤ t×Ψ2,

Pr(

t⋂k=1

Ω4(k)) ≤ tΨ3,

Pr(

t⋂k=1

Ω5(k)) ≤ t×Ψ1.

Then we can know that:

Pr(

t⋂k=1

Ω1(k),

t⋂k=1

Ω2(k)) ≥ 1− t(Ψ2 + Ψ3 + Ψ1)

Page 45: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

and thus

Pr(

t⋂k=1

Ω1(k)) ≥ Pr(

t⋂k=1

Ω1(k),

t⋂k=1

Ω2(k)) ≥ 1− t(Ψ2 + Ψ3 + Ψ1).

This finishes the proof.

Similarly, from Formula (S59), we know that for all T iterations:

Pr(

t⋂k=1

Ω1(k),

t⋂k=1

Ω2(k),

t⋂k=1

Ω4(k),

t⋂k=1

Ω5(k)∣∣∣ t⋂k=1

Ω4(k),

t⋂k=1

Ω5(k),

t⋂k=1

Ω3(k)) = 1. (S60)

Through the same argument, we know that:

Pr(

T⋂k=1

Ω1(k),

T⋂k=1

Ω2(k),

T⋂k=1

Ω4(k),

T⋂k=1

Ω5(k)) ≥ 1− T (Ψ2 + Ψ3 + Ψ1).

A.3.7. PROOF OF THEOREM 13

The proof is the same as the proof of Theorem 6 except that w, wI , n, r, ξj1,jm , H, B need to be replaced by wS , wI,S , B, r,ξSj1,jm , HS , BS and the main theorems that the proof depends on will be replaced by Theorem 11 and Corollary 2. We willshow some key steps below.

First of all, according to the proofs of Theorem 12, we know that the following inequalities hold with probability higherthan 1− T (Ψ2 + Ψ3 + Ψ1):

‖wSk − wI,Sk‖ ≤1

12 −

rn −

1B1/4

M1(r

n+

1

B1/4);

‖HSk − BSjm‖ ≤ ξSj1,jm ;

ξSj0+xT0,j0+(x+m−1)T0≤ (1− ηµ)j0+xT0Ad0,mT0−1 +A1

r

n+A2

1

B1/4+AM1

((log(p+ 1))2

B

)1/4

≤ ξj0,j0+(m−1)T0+AM1

((log(p+ 1))2

B

)1/4

;

∆BkB≤ r

n+

1

B1/4.

Then by subtracting wI,St by wU,St and following the arguments from Formula (S32) to (S34), the following inequalityholds for ‖wI,St − wU,St‖ with probability higher than 1− T × (Ψ2 + Ψ1 + Ψ3):

‖wI,St − wU,St‖

≤ (1− ηµ+Bη

B −∆BtξSj1,jm)‖wI,St − wU,St‖

+ηc02‖wU,St − wI,St‖‖wU,St − wSt‖+

B −∆BtξSj1,jm‖w

U,St − wSt‖

≤(

1− ηµ+Bη

B −∆BtξSj1,jm +

c0M1η

2(r

n+

1

B1/4)

)‖wI,St − wU,St‖+

B −∆BtξSj1,jmM1(

r

n+

1

B1/4).

By using the fact that ∆BtB ≤ r

n + 1B1/4 and ξj1,jm ≤ ξj0,j0+(m−1)T0

+A×M1

((log(p+1))2

B

)1/4

, the formula above can

Page 46: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

be bounded as:

‖wI,St − wU,St‖

≤(

1− ηµ+Bη

B −∆BtξSj1,jm +

c0M1η

2(r

n+

1

B1/4)

)‖wI,St − wU,St‖+

B −∆BtξSj1,jmM1(

r

n+

1

B1/4)

≤ [1− ηµ+η

1− rn −

1B1/4

(ξSj0,j0+(m−1)T0+A×M1

((log(p+ 1))2

B

)1/4

)

+c0M1η

2(r

n+

1

B1/4)]‖wI,St − wU,St‖+

η

1− rn −

1B1/4

ξSj0+xT0,j0+(x+m−1)T0M1(

r

n+

1

B1/4).

Since ξSj0,j0+(m−1)T0+A×M1

((log(p+1))2

B

)1/4

≤ µ2 and B is a large mini-batch size, then

1− (ηµ− η

1− rn −

1B1/4

(ξSj0,j0+(m−1)T0+A×M1

((log(p+ 1))2

B

)1/4

)− c0M1η

2(r

n+

1

B1/4)) < 1.

Then after explicitly using the definition of ξSj1,jm and following the argument of equation (S35) to (S38), we get:

‖wI j0+(y+m)T0− wUj0+(y+m)T0

‖≤ (1− ηC)yT0‖wI j0+mT0 − wUj0+mT0‖

+M1( rn + 1

B1/4 )

C(1− rn −

1B1/4 )

(1− ηC)yT0(1− ηµ)j0d0,mT0−11

1− ( 1−ηµ1−ηC )T0

+1

1− (1− ηC)T0

M1( rn + 1B1/4 )

C(1− rn −

1B1/4 )

(A1r

n+A2

1

B1/4+AM1

((log(p+ 1))2

B

)1/4

)

(S61)

when t → ∞ and thus y → ∞, (1 − ηC)yT0 → 0. Also with large mini-batch value B, A1rn + A2

1B1/4 +

AM1

((log(p+1))2

B

)1/4

is a value of the same order as rn + 1

B1/4 . Thus

‖wI j0+(y+m)T0− wUj0+(y+m)T0

‖ = o(r

n+

1

B1/4)

and

‖wU,St − wI,St‖ ≤ o(r

n+

1

B1/4).

B. Details on applicationsB.1. Privacy related data deletion

The notion of Approximate Data Deletion from the training dataset is proposed in (Ginart et al., 2019):

Definition 1. A data deletion operation RA is a δ−deletion for algorithm A if, for all datasets D and for all measurablesubset S, the following inequality holds:

Pr[A(D−i) ∈ S|D−i] ≥ δPr[RA(D,A(D), i) ∈ S|D−i],

where D is the full training dataset, D−i is the remaining dataset after the ith sample is removed, A(D) and A(D−i)represent the model trained over D and D−i respectively. Also RA is an approximate model update algorithm, whichupdates the model after the sample i is removed.

This definition mimics the classical definition of differential privacy (Dwork et al., 2014):

Page 47: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

Definition 2. A mechanism M is ε-differentially private, where ε ≥ 0 , if for all neighboring databases D0 and D1, i.e., fordatabases differing in only one record, and for all sets S ∈ [M ], where [M ] is the range of M , the following inequalityholds:

Pr[M(D0) ∈ S] ≤ eεPr[M(D1) ∈ S].

By borrowing the notations from (Ginart et al., 2019), we define a version of approximate data deletion, which is slightlymore strict than the one from (Ginart et al., 2019):

Definition 3. RA is an ε−approximate deletion for A if for all D and measurable subset S ⊂ H:

P (A(D−i) ∈ S|D−i) ≤ eεP (RA(D,A(D), i) ∈ S|D−i)

andP (RA(D,A(D), i) ∈ S|D−i) ≤ eεP (A(D−i) ∈ S|D−i).

To satisfy this definition for gradient descent, necessary randomness is added to the output of the BaseL and DeltaGrad. Onesimple way is the Laplace mechanism (Dwork et al., 2014), also following the idea from (Chaudhuri & Monteleoni, 2009)where noise following the Laplace distribution, i.e.

Lap(x| 2

nελ) =

12nελ

exp(− |x|2nελ

),

is added to the each coordinate of the output of the regularized logistic regression. Here p is the number of the parameters,λ is the regularization rate and 2

nλ is the sensitivity of logistic regression (see (Chaudhuri & Monteleoni, 2009) for moredetails).

We can add even smaller noise to w∗, wU ∗ and wI∗, which follows the distribution Lap( δε ) for each coordinate of w∗, wU ∗

and wI∗ and is independent across different coordinates. Here δ >√pδ0 and

δ0 =1

η( 12µ−

rn−rµ−

c0M1r2n )2

M1r

n− r(A

112 −

rn

M1r

n)

(which is an upper bound on ‖wU ∗ − wI∗‖), such that the randomized DeltaGrad preserves ε−approximate deletion.

Proof. We denote the model parameters after adding the random noise over wR, wU,R and wI,R, and vi as the value of v inthe ith coordinate. We have:

w∗ − wR∗,wU ∗ − wU,R∗,wI∗ − wI,R∗ ∼ Lap(δε

)

Given an arbitrary vector z = [z1, z2, . . . , zp], the probability density ratio between Pdf(wU,R∗ = z) and Pdf(wI,R∗ = z)can be calculated as

Pdf(wU,R∗ = z)

Pdf(wI,R∗ = z)=

Πpi=1

εδ exp(− ε|z−wU∗|

δ )

Πpi=1

εδ exp(−ε |zi−wI∗i |

δ )

= Πpi=1 exp(

ε(|zi − wU ∗| − |zi − wI∗i |)δ

)

≤ Πpi=1 exp(

ε(|wI∗i − wU ∗i |)δ

)

= exp(ε(‖wI∗ − wU ∗‖1)

δ)

Since‖wI∗ − wU ∗‖1 ≤

√p‖wI∗ − wU ∗‖2 =

√p‖wI∗ − wU ∗‖

Then,

Page 48: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

Pdf(wU,R∗ = z)

Pdf(wI,R∗ = z)≤ exp(

ε(‖wI∗ − wU ∗‖)δ

)

≤ exp(ε√pδ0

δ) ≤ exp(ε)

Similarly, we can also prove Pdf(wU,R∗=z)Pdf(wI,R∗=z) ≥ exp(ε) by symmetry.

C. Supplementary algorithm detailsIn Section , we only provided the details of DeltaGrad for deterministic gradient descent for the strongly convex and smoothobjective functions in batch deletion/addition scenarios. In this section, we will provide more details on how to extendDeltaGrad to handle stochastic gradient descent, online deletion/addition scenarios and non-strongly convex, non-smoothobjective functions.

C.1. Extension of DeltaGrad for stochastic gradient descent

By using the notations from equations (S5)-(S7), we need to approximately or explicitly compute GB,S , i.e. the averagegradient for a mini-batch in the SGD version of DeltaGrad, instead of∇F , which is the average gradient for all samples.So by replacing wt, wUt, wI t,∇F , B and H with wSt, wU,St, wI,St, GB,S , BS and HS in Algorithm 1, we get the SGDversion of DeltaGrad.

C.2. Extension of DeltaGrad for online deletion/addition

In the online deletion/addition scenario, whenever the model parameters are updated after the deletion or addition of onesample, the history information should be also updated to reflect the changes. By assuming that only one sample is deletedor added each time, the online deletion/addition version of DeltaGrad is provided in Algorithm 3 and the differences relativeto Algorithm 1 are highlighted.

Since the history information needs to be updated every time when new deletion or addition requests arrive, we need to dosome more analysis on the error bound, which is still pretty close to the analysis in Section A.

In what follows, the analysis will be conducted on gradient descent with online deletion. Other similar scenarios, e.g.stochastic gradient descent with online addition, will be left as the future work.

C.2.1. CONVERGENCE RATE ANALYSIS FOR ONLINE GRADIENT DESCENT VERSION OF DELTAGRAD

Additional notes on setup, preliminaries

Let us still denote the model parameters for the original dataset at the tth iteration by wt. During the model update phase forthe kth deletion request at the tth iteration, the model parameters updated by BaseL and DeltaGrad are denoted by wUt(k)and wI t(k) respectively where wUt(0) = wI t(0) = wt. We also assume that the total number of removed samples in alldeletion requests, r, is still far smaller than the total number of samples, n.

Also suppose that the indices of the removed samples are i1, i2, . . . , ir, which are removed at the 1st, 2nd, 3rd, . . . ,, rthdeletion request. This also means that the cumulative number of samples up to the kth deletion request (k ≤ r) is n− k forall 1 ≤ k ≤ r and thus the objective function at the kth iteration will be:

F k(w) =1

n− k∑i 6∈Rk

Fi(w).

where Rk = i1, i2, . . . , ik. Plus, at the kth deletion request, we denote by Hkt the average Hessian matrix of F k(w)

Page 49: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

Algorithm 3 DeltaGrad (online deletion/addition)Input :The full training set (X,Y), model parameters cached during the training phase for the full training samples w0,w1, . . . ,wt

and corresponding gradients ∇F (w0) ,∇F (w1) , . . . ,∇F (wt), the index of the removed training sample or the addedtraining sample ir , period T0, total iteration number T , history size m, warmup iteration number j0, learning rate η

Output :Updated model parameter wI t25 Initialize wI0 ← w0

26 Initialize an array ∆G = []27 Initialize an array ∆W = []28 for t = 0; t < T ; t+ + do29 if [((t− j0) mod T0) == 0] or t ≤ j0 then30 compute∇F

(wI t)

exactly31 compute∇F

(wI t)−∇F (wt) based on the cached gradient∇F (wt)

32 set ∆G [k] = ∇F(wI t)−∇F (wt)

33 set ∆W [k] = wI t − wt, based on the cached parameters wt34 k ← k + 1

35 compute wI t+1 by using exact GD update (equation (1))36 wt ← wI t37 ∇F (wt)← ∇F (wI t)38 else39 Pass ∆W [−m :], ∆G [−m :], the last m elements in ∆W and ∆G, which are from the jth1 , jth2 , . . . , jthm iterations where

j1 < j2 < · · · < jm depend on t, v = wI t − wt, and the history size m, to the L-BFGFS Algorithm (See Supplement) to getthe approximation of H(wt)v, i.e., Bjmv

40 Approximate∇F(wI t)

= ∇F (wt) + Bjm(wI t − wt

)41 Compute wI t+1 by using the ”leave-1-out” gradient formula, based on the approximated∇F (wI t)42 wt ← wI t43 ∇F (wt)← η

n−1[n(Bjm(wI t − wt) +∇F (wt))−∇Fir (wt)]

44 end45 end46 return wI t

evaluated between wI t(k + 1) and wI t(k):

Hkt =

1

n− k∑i 6∈Rk

∫ 1

0

Hi(wI t(k) + x(wI t(k + 1)− wI t(k)))dx

Specifically,

H0t =

1

n

n∑i=1

∫ 1

0

Hi(wI t(0) + x(wI t(1)− wI t(0)))dx.

Also the model parameters and the approximate gradients evaluated by DeltaGrad at the r − 1st deletion request are used atthe rth request, and are denoted by:

wI0(r − 1),wI1(r − 1), . . . ,wI t(r − 1)

andga

(wI0(r − 1)

), ga

(wI1(r − 1)

), . . . , ga

(wI t(r − 1)

).

Note that ga(wI t(k)) (k ≤ r) is not necessarily equal to∇F due to the approximation brought by DeltaGrad. But due tothe periodicity of DeltaGrad, at iteration 0, 1, . . . , j0 and iteration j0 + xT0 (x = 1, 2, . . . ,), the gradients are explicitlyevaluated, i.e.:

ga(wI t(k)) =1

n− k∑i 6∈Rk

∇Fi(wI t(k))

for t = 0, 1, . . . , j0 or t = j0 + xT0 (x ≥ 1) and all k ≤ r.

Page 50: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

Also, due to the periodicity, the sequence [∆gj0 ,∆gj1 , . . . ,∆gjm−1] used in approximating the Hessian matrix always uses

the exact gradient information, which means that:

∆gjq =1

n− k[∑i6∈Rk

∇Fi(wI jq (k))−∑i 6∈Rk

∇Fi(wI jq (k − 1))]

where q = 1, 2, . . . ,m− 1. So Lemma 6 on the bound on the eigenvalues of Bjq holds for all q and k = 1, 2, . . . , r.

But for the iterations where the gradients are not explicitly evaluated, the calculation of ga(wI t(k)) depends on theapproximated Hessian matrix Bk−1

jmand the approximated gradients calculated at the tth iteration at the k − 1-st deletion

request. So the update rule for ga(wI t(k)) is:

ga(wI t(k)) =1

n− k(n− k + 1)[Bk−1

jm(wI t(k)− wI t(k − 1))

+ ga(wI t(k − 1))]−∇Fik(wI t(k)).(S62)

Here the product Bk−1jm· (wI t(k)− wI t(k − 1)) approximates

1

n− k + 1

∑i 6∈Rk−1

∇Fi(wI t(k))−∇Fi(wI t(k − 1))

and ga(wI t(k − 1)) approximates 1n−k+1

∑i 6∈Rk−1

∇Fi(wI t(k − 1)).

Similarly, the online version of ∆w (at the kth iteration) becomes:

∆wjq (k) = wI jq (k)− wI jq (k − 1)

where q = 1, 2, . . . ,m− 1.

Similarly, we use dja,jb(k) to denote the value of the upper bound d on the distance between the iterates at the kth deletionrequest and use Bk−1

jmto denote the approximated Hessian matrix in the kth deletion request, which approximated the

Hessian matrix Hk−1t .

So the update rule for wI t(k) becomes:

wI t+1(k) =

wI t(k)− η

n−k∑i 6∈Rk ∇Fi(wI t(k)), [(t− j0) mod T0 = 0] or t ≤ j0

wI t(k)− η

n− k(n− k + 1)[Bk−1

jm(wI t(k)− wI t(k − 1))

+ ga(wI t(k − 1))]−∇Fik(wI t(k)), else.

(S63)

Proof preliminaries.

On each deletion request, the BaseL model parameters are retrained from scratch on the remaining samples. This impliesthat Theorem 4 still holds, if we replace wUt, wt and r with wUt(k), wUt(k − 1) and 1 respectively:

Theorem 14 (Bound between iterates deleting one datapoint). ‖wUt(r)− wUt(r − 1)‖ ≤M11n where M1 = 2

µc2 is somepositive constant that does not depend on t. Here µ is the strong convexity constant, and c2 is the bound on the individualgradients.

By induction, we have:

‖wUt(r)− wt‖ = ‖wUt(r)− wUt(0)‖ ≤M1r

n. (S64)

Then let us do some analysis on dja,jb(k). We use the notation Mr1

1n for

2M1n

1− r+1n −

2(r−1)n ( 2L+µ

µ ), where Mr

1 is a constant

which does not depend on k.

Page 51: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

Lemma 11. If ‖wI t(k)−wI t(k− 1)‖ ≤2M1n

1− k+1n −

2(k−1)n ( 2L+µ

µ )for all k ≤ r, then dja,jb(r) ≤ dja,jb(0) + 2r ·Mr

11n where

M1 is defined in Theorem 14.

Proof. Recall that dja,jb(k) = max(‖wIy(k) − wIz(k)‖)ja<y<z<jb . Then for two arbitrary iterations y, z, let us bound‖wIy(k)− wIz(k)‖ as below:

‖wIy(k)− wIz(k)‖= ‖wIy(k)− wIz(k) + wIy(k − 1)− wIz(k − 1) + wIz(k − 1)− wIy(k − 1)‖≤ ‖wIy(k)− wIy(k − 1)‖+ ‖wIz(k)− wIz(k − 1)‖+ ‖wIz(k − 1)− wIy(k − 1)‖.

Then by using the bound on ‖wI t(k)− wI t(k − 1)‖, the above formula leads to:

≤ 2 ·2M1

n

1− k+1n −

2(k−1)n ( 2L+µ

µ )+ ‖wIz(k − 1)− wIy(k − 1)‖.

By using that2M1n

1− k+1n −

2(k−1)n ( 2L+µ

µ )≤

2M1n

1− r+1n −

2(r−1)n ( 2L+µ

µ )and applying it recursively for k = 1, 2, . . . , r, we have:

‖wIy(r)− wIz(r)‖ ≤ 2r ·2M1

n

1− r+1n −

2(r−1)n ( 2L+µ

µ )+ ‖wIz(0)− wIy(0)‖.

Then by using the definition of dja,jb(k), the following inequality holds:

dja,jb(r) ≤ dja,jb+T0−1(0) + 2r ·2M1

n

1− r+1n −

2(r−1)n ( 2L+µ

µ ).

Recalling the definition of Mr1

1n , this is exactly the required result.

We also mention that, since2M1n

1− k+1n −

2(k−1)n ( 2L+µ

µ )≤

2M1n

1− r+1n −

2(r−1)n ( 2L+µ

µ ), then ‖wI t(k) − wI t(k − 1)‖ ≤ Mr

11n for any

k ≤ r.

Theorem 15. Suppose that at the kth deletion request, ‖wI jq (k) − wI jq (k − 1)‖ ≤ Mr1

1n , where q = 1, 2, . . . ,m and

M1 = 2c2µ . Let e = L(L+1)+K2L

µK1for the upper and lower bounds K1,K2 on the eigenvalues of the quasi-Hessian from

Lemma 6, and for the Lipshitz constant c0 of the Hessian. For 1 ≤ z + 1 ≤ y ≤ m we have:

‖Hk−1jz−Hk−1

jy‖ ≤ c0djz,jy (k − 1) + c0M

r1

1

n

and

‖∆gjz − Bk−1jy

∆wjz‖ ≤[(1 + e)y−z−1 − 1

]· c0(djz,jy +Mr

1

1

n) · sj1,jm(k − 1)

where sj1,jm(k − 1) = max (‖∆wa(k − 1)‖)a=j1,j2,...,jm= max

(‖wI,Sa(k − 1)− wI,Sa(k − 2)‖

)a=j1,j2,...,jm

. Recallthat d is defined as the maximum gap between the steps of the algorithm for the iterations from jz to jy:

djz,jy (k − 1) = max(‖wIa(k − 1)− wIb(k − 1)‖

)jz≤a≤b≤jy

. (S65)

Proof. Let us bound the difference between the averaged Hessians ‖Hk−1jz−Hk−1

jy‖, where 1 ≤ z < y ≤ m, using their

Page 52: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

definition, as well as using Assumption 4 on the Lipshitzness of the Hessian. First we can get the following equality:

‖Hk−1jy−Hk−1

jz‖

= ‖∫ 1

0

[H(wI jy (k − 1) + x(wI jy (k)− wI jy (k − 1)))]dx

−∫ 1

0

[H(wI jz (k − 1) + x(wI jz (k)− wI jz (k − 1)))]dx‖

= ‖∫ 1

0

[H(wI jy (k − 1) + x(wI jy (k)− wI jy (k − 1)))

−H(wI jz (k − 1) + x(wI jz (k)− wI jz (k − 1)))]dx‖

(S66)

Then we can bound this as:

≤ c0∫ 1

0

‖wI jy (k − 1) + x(wI jy (k)− wI jy (k − 1))

− [wI jz (k − 1) + x(wI jz (k)− wI jz (k − 1))]‖dx≤ c0‖wI jy (k − 1)− wI jz (k − 1)‖

+c02‖wI jy (k)− wI jy (k − 1)− (wI jz (k)− wI jz (k − 1))‖

≤ c0‖wI jy (k − 1)− wI jz (k − 1)‖

+c02‖wI jz (k)− wI jz (k − 1)‖+

c02‖wI jy (k)− wI jy (k − 1)‖

≤ c0djy,jz (k − 1) + c0Mr1

1

n≤ c0dj1,jm+T0−1(k − 1) + c0M

r1

1

n.

On the last line, we have used the definition of djz,jy , and the assumption on the boundedness of ‖wI jz (k)− wI jz (k − 1)‖.

Then by following the rest of the proof of Theorem 3, we get:

‖∆gjz − Bjy∆wjz‖ ≤[(1 + e)y−z−1 − 1

]· c0(djz,jy (k − 1) +Mr

1

1

n) · sj1,jm(k − 1).

Similarly, the online version of Corollary 1 also holds by following the same derivation as the proof of Corollary 1 (exceptthat r, ξj1,jm and dj1,jm+T0−1 is replaced by 1, ξj1,jm(k − 1) and dj1,jm+T0−1(k − 1) respectively), i.e.:

Corollary 3 (Approximation accuracy of quasi-Hessian to mean Hessian (online deletion)). Suppose that at the kth deletionrequest, ‖wI js(k) − wI js(k − 1)‖ ≤ Mr

11n and ‖wI t(k) − wI t(k − 1)‖ ≤ Mr

11n where s = 1, 2, . . . ,m. Then for

jm ≤ t ≤ jm + T0 − 1,

‖Hk−1t − Bk−1

jm‖ ≤ ξj1,jm(k − 1) := Adj1,jm+T0−1(k − 1) +AMr

1

1

n. (S67)

Recall that A = c0√m[(1+e)m−1]

c1+ c0, where c0 is the Lipschitz constant of the Hessian, c1 is the ”strong independence”

constant from Assumption 5, and dj1,jm+T0−1(k − 1) is the maximal gap between the iterates of the GD algorithm on thefull data from j1 to jm + T0 − 1 after the k − 1-st deletion.

Based on this, let us derive a bound on ‖∇Fi(wI t(r))‖, ‖ga(wI t(r))− 1n−r

∑i6∈Rr ∇Fi(wI t(r))‖ and ‖ga(wI t(r))‖.

Lemma 12. Suppose we are at an iteration t such that jm ≤ t ≤ jm + T0 − 1. If the following inequality holds for allk < r:

‖wI t(k)− wI t(k − 1)‖ ≤Mr1

1

n,

then the following inequality holds for all i = 1, 2, . . . , n:

‖∇Fi(wI t(r − 1))‖ ≤Mr1

1

nLr + c2.

Page 53: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

Proof. By adding and subtracting∇Fi(wI t(r − 2)) inside ‖∇Fi(wI t(r − 1))‖, we get:

‖∇Fi(wI t(r − 1))‖= ‖∇Fi(wI t(r − 1))−∇Fi(wI t(r − 2)) +∇Fi(wI t(r − 2))‖≤ ‖∇Fi(wI t(r − 1))−∇Fi(wI t(r − 2))‖+ ‖∇Fi(wI t(r − 2))‖

The last inequality uses the triangle inequality. Then by using the Cauchy mean value theorem, the upper bound on theeigenvalue of the Hessian matrix (i.e. Assumption 2) and the bound on ‖wI t(k) − wI t(k − 1)‖, the formula above isbounded as (recall Hi is an integrated Hessian):

= ‖Hi(wI t(r − 1) + x(wI t(r − 2)− wI t(r − 1))) · (wI t(r − 1)− wI t(r − 2))‖+ ‖∇Fi(wI t(r − 2))‖

≤ LMr1

1

n+ ‖∇Fi(wI t(r − 2))‖.

By using this recursively, we get:

≤r−1∑k=1

Mr1

1

nL+ ‖∇Fi(wI t(0))‖ ≤Mr

1

1

nLr + c2.

Lemma 13. If at a given iteration t such that jm ≤ t ≤ jm + T0 − 1, for all k < r, the following inequalities hold:

‖wI t(k)− wI t(k − 1)‖ ≤Mr1

1

n

andξj1,jm(k − 1) ≤ µ

2,

then we have

‖ 1

n− r + 1

∑i6∈Rr−1

∇Fi(wI t(r − 1))− ga(wI t(r − 1))‖ ≤ rMr1

1

nµ.

Proof. First of all, 1n−r+1

∑i6∈Rr−1

∇Fi(wI t(r − 1)) can be rewritten as below by using the Cauchy mean-value theorem:

1

n− r + 1

∑i 6∈Rr−1

∇Fi(wI t(r − 1)) =1

n− r + 1[∑

i 6∈Rr−2

∇Fi(wI t(r − 1))−∇Fir−1(wI t(r − 1))]

=1

n− r + 1(n− r + 2)[Hr−2

t × (wI t(r − 1)− wI t(r − 2)]

+∑

i 6∈Rr−2

∇Fi(wI t(r − 2))−∇Fir−1(wI t(r − 1)).

By subtracting the above formula from equation (S62), i.e., the update rule for the approximate gradient, the norm of theapproximation error between true and approximate gradients is:

‖ 1

n− r + 1

∑i 6∈Rr−1

∇Fi(wI t(r − 1))− ga(wI t(r − 1))‖

=1

n− r + 1‖(n− r + 2)(Hr−2

t − Br−2jm

)× (wI t(r − 1)− wI t(r − 2))

+∑

i 6∈Rr−2

∇Fi(wI t(r − 2))− (n− r + 2)ga(wI t(r − 2))‖

Page 54: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

Then by using the triangle inequality, Corollary 3 on the approximation accuracy of the quasi-Hessian (where the bound isin terms of ξ), and the bound on ‖wI t(r − 1)− wI t(r − 2)‖, the formula above is bounded as:

≤ n− r + 2

n− r + 1‖Hr−2

t − Br−2jm‖‖wI t(r − 1)− wI t(r − 2)‖

+1

n− r + 1‖∑

i6∈Rr−2

∇Fi(wI t(r − 2))− (n− r + 2)ga(wI t(r − 2))‖

≤ n− r + 2

n− r + 1ξj1,jm(r − 2)Mr

1

1

n+n− r + 2

n− r + 1‖ 1

n− r + 2

∑i 6∈Rr−2

∇Fi(wI t(r − 2))− ga(wI t(r − 2))‖

(S68)

By using that ξj1,jm(r − 2) ≤ µ2 , the formula above is bounded as:

≤ n− r + 2

n− r + 1

µ

2(Mr

1

1

n) +

n− r + 2

n− r + 1‖ 1

n− r + 2

∑i 6∈Rr−2

∇Fi(wI t(r − 2))− ga(wI t(r − 2))‖

We can use this recursively. Note that∇F (wI t(0)) = ga(wI t(0)). In the end, we get the following inequality:

≤r−1∑k=1

n− kn− r

µ

2(Mr

1

1

n) ≤Mr

1

1

n

µ

2

r−1∑k=1

n− kn− r

Also for r n, n−kn−r ≤ 2 (in fact we assumed r/n ≤ δ for a sufficiently small δ, so this holds). So we get the boundrMr

11nµ.

Note that for ‖ 1n−r+1

∑i6∈Rr−1

∇Fi(wI t(r−1))−ga(wI t(r−1))‖, we get a tighter bound when t→∞ by using equation(S68), Lemma 11 (i.e. dja,jb(r) ≤ dja,jb(0) + 2r ·Mr

11n ) and Lemma 8 without using ξj1,jm(r − 1) ≤ µ

2 , which starts bybounding ξj1,jm(k − 1) where k <= r, j1 = j0 + xT0 and jm = j0 + (x+m− 1)T0:

ξj1,jm(k − 1) = Adj1,jm+T0−1(k − 1) +AMr1

1

n

≤ Adj1,jm+T0−1(0) + 2(k − 1)A ·Mr1

1

n+AMr

1

1

n

≤ A(1− µη)j0+xT0d0,mT0−1(0) +A(2k − 1)Mr1

1

n,

(S69)

which can be plugged into Equation (S68), i.e.:

‖ 1

n− r + 1

∑i 6∈Rr−1

∇Fi(wI t(r − 1))− ga(wI t(r − 1))‖

≤ n− r + 2

n− r + 1ξj1,jm(r − 2)Mr

1

1

n

+n− r + 2

n− r + 1‖ 1

n− r + 2

∑i 6∈Rr−2

∇Fi(wI t(r − 2))− ga(wI t(r − 2))‖

≤r−1∑k=1

n− k + 1

n− kξj1,jm(k − 1)Mr

1

1

n

≤r−1∑k=1

n− k + 1

n− k[A(1− µη)j0+xT0d0,mT0−1(0) +A(2k − 1)Mr

1

1

n] ·Mr

1

1

n

≤ 2A(1− µη)j0+xT0d0,mT0−1(0)rMr1

1

n+ 2A(rMr

1

1

n)2

(S70)

Page 55: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

The last step uses that n−k+1n−k ≤ 2 and

∑r−1k=1(2k − 1) <

∑rk=1(2k − 1) = r2. So when t → ∞ and thus x → ∞,

‖ 1n−r

∑i 6∈Rr ∇Fi(wI t(r))− ga(wI t(r))‖ = o( rn ).

Then based on Lemma 12 and 13, the bound on ‖ga(wI t(r))‖ becomes:

‖ga(wI t(r − 1))‖

= ‖ga(wI t(r − 1))− 1

n− r + 1

∑i6∈Rr−1

∇Fi(wI t(r − 1)) +1

n− r + 1

∑i6∈Rr−1

∇Fi(wI t(r − 1))‖

≤ ‖ga(wI t(r − 1))− 1

n− r + 1

∑i6∈Rr−1

∇Fi(wI t(r − 1))‖+ ‖ 1

n− r + 1

∑i 6∈Rr−1

∇Fi(wI t(r − 1))‖

= rMr1

1

nµ+Mr

1

1

nLr + c2 = (rµ+ Lr)Mr

1

1

n+ c2

(S71)

Main resultsTheorem 16 (Bound between iterates on full data and incrementally updated ones (online deletions)). Suppose that forany k < r, ‖wI t(k)− wI t(k − 1)‖ ≤Mr

11n . At the rth deletion request, consider an iteration t indexed with jm for which

jm ≤ t < jm + T0 − 1, and suppose that we are at the x-th iteration of full gradient updates, so j1 = j0 + xT0, jm =j0 +(m−1+x)T0. Suppose that we have the bounds ‖Hr−1

t −Br−1jm‖ ≤ ξj1,jm(r−1) = Adj1,jm+T0−1(r−1)+A(Mr

11n )

(where we recalled the definition of ξ) and

ξj1,jm(r − 1) = Adj1,jm+T0−1(r − 1) +A(Mr1

1

n) ≤ µ

2

for all iterations x. Then

‖wI t+1(r)− wI t+1(r − 1)‖ ≤Mr1

1

n.

Recall that c0 is the Lipshitz constant of the Hessian, M1 and A are defined in Theorem 14 and Corollary 3 respectively,and do not depend on t,

Then by using the same derivation as the proof of Theorem 6, we get the following results at the rth deletion request.

Theorem 17 (Bound between iterates on full data and incrementally updated ones (all iterations, online deletion)). At thedeletion request r, if for all k < r, ‖wI t(k)− wI t(k − 1)‖ ≤Mr

11n holds, then for any jm < t < jm + T0 − 1,

‖wI t(r)− wI t(r − 1)‖ ≤Mr1

1

n

and‖Hr−1

t − Br−1jm‖ ≤ ξj1,jm(r − 1) := Adj1,jm+T0−1(r − 1) +AMr

1

1

n

and‖ 1

n− r + 1

∑i 6∈Rr−1

∇Fi(wI t(r − 1))− ga(wI t(r − 1))‖ ≤ rMr1

1

hold

Then by induction (the base case is similar to Theorem 6), we know that the following theorem holds for all iterations t:

Theorem 18 (Bound between iterates on full data and incrementally updated ones (all iterations, all deletion requests, onlinedeletion)). At the rth deletion request, for any jm < t < jm + T0 − 1,

‖wI t(r)− wI t(r − 1)‖ ≤Mr1

1

n

and‖Hr−1

t − Br−1jm‖ ≤ ξj1,jm(r − 1) := Adj1,jm+T0−1(r − 1) +AMr

1

1

n

Page 56: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

and‖ 1

n− r + 1

∑i 6∈Rr−1

∇Fi(wI t(r − 1))− ga(wI t(r − 1))‖ ≤ rMr1

1

hold

Then by induction (from the rth deletion request to the 1st deletion request), the following inequality holds:

‖wI t(r)− wI t(0)‖ = ‖wI t(r)− wt‖ ≤ r ·Mr1

1

n

Then by using equation (S64), the following inequality holds:

‖wUt(r)− wI t(r − 1)‖ = ‖wUt(r)− wt + wt − wI t(r − 1)‖≤ ‖wUt(r)− wt‖+ ‖wt − wI t(r − 1)‖

≤M1r

n+ (r − 1) ·Mr

1

1

n:= M2

r

n

(S72)

where M2 is a constant which does not depend on t or k.

In the end, we get a similar result for the bound on ‖wI t(r)− wUt(r)‖:Theorem 19 (Convergence rate of DeltaGrad (online deletion)). At the rth deletion request, for all iterations t, the resultwI t(r) of DeltaGrad, Algorithm 3, approximates the correct iteration values wUt(r) at the rate

‖wUt(r)− wI t(r)‖ = o(r

n).

So ‖wUt(r)− wI t(r)‖ is of a lower order than rn .

The proof of Theorem 16

Proof. Note that the approximated update rules for wI t at the rth and the (r − 1)st deletion request are:

wI t+1(r) = wI t(r)−η

n− r(n− r + 1)[Br−1

jm(wI t(r)− wI t(r − 1))

+ ga(wI t(r − 1))]−∇Fir (wI t(r))(S73)

and

wI t+1(r − 1) = wI t(r − 1)− η

n− r + 1(n− r + 2)[Br−2

jm(wI t(r − 1)− wI t(r − 2))

+ ga(wI t(r − 2))]−∇Fir−1(wI t(r − 1)).(S74)

Note that sincega(wI t(r − 1)) =

1

n− r + 1(n− r + 2)[Br−2

jm(wI t(r − 1)− wI t(r − 2))

+ga(wI t(r − 2))]−∇Fir−1(wI t(r − 1)),

then equation (S74) can be rewritten as:

wI t+1(r − 1) = wI t(r − 1)− η

n− r + 1(n− r + 2)[Br−2

jm(wI t(r − 1)− wI t(r − 2))

+ ga(wI t(r − 2))]−∇Fir−1(wI t(r − 1))

= wI t(r − 1)− ηga(wI t(r − 1)).

(S75)

Then by subtracting equation (S74) from equation (S75), the result becomes:

Page 57: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

wI t+1(r)− wI t+1(r − 1)

= (wI t(r)− wI t(r − 1))− η

n− r(n− r + 1)[Br−1

jm(wI t(r)− wI t(r − 1))

+ ga(wI t(r − 1))]−∇Fir (wI t(r))+ ηga(wI t(r − 1)).

Then by adding and subtracting Hr−1t and 1

n−r+1

∑i 6∈Rr−1

∇F (wI t(r − 1)) in the formula above and rearranging theresult properly, it becomes:

wI t+1(r)− wI t+1(r − 1)

= (I− ηn− r + 1

n− r(Br−1jm−Hr−1

t ))(wI t(r)− wI t(r − 1))

− η

n− r(n− r + 1)[Hr−1

t (wI t(r)− wI t(r − 1))

+1

n− r + 1

∑i 6∈Rr−1

∇F (wI t(r − 1))− 1

n− r + 1

∑i 6∈Rr−1

∇F (wI t(r − 1))

+ ga(wI t(r − 1))]−∇Fir (wI t(r))+ ηga(wI t(r − 1)).

(S76)

Then by using the fact that

Hr−1t (wI t(r)− wI t(r − 1)) +

1

n− r + 1

∑i6∈Rr−1

∇F (wI t(r − 1))

=1

n− r + 1

∑i 6∈Rr−1

∇F (wI t(r))

and

(∑

i6∈Rr−1

∇F (wI t(r)))−∇Fir (wI t(r)) =∑i 6∈Rr

∇F (wI t(r)),

Equation (S76) becomes:

wI t+1(r)− wI t+1(r − 1)

= (I− ηn− r + 1

n− r(Br−1jm−Hr−1

t ))(wI t(r)− wI t(r − 1))

− η

n− r[∑i 6∈Rr

∇F (wI t(r))−∑

i 6∈Rr−1

∇F (wI t(r − 1))

+ (n− r + 1)ga(wI t(r − 1))] + ηga(wI t(r − 1)).

(S77)

Also note that by using the Cauchy mean-value theorem, the following equation holds:∑i6∈Rr

∇Fi(wI t(r))−∑

i 6∈Rr−1

∇Fi(wI t(r − 1))

=∑i6∈Rr

∇Fi(wI t(r))−∑i 6∈Rr

∇Fi(wI t(r − 1))−∇Fir (wI t(r − 1))

= [∑i 6∈Rr

∫ 1

0

Hi(wI t(r − 1) + x(wI t(r)− wI t(r − 1)))dx](wI t(r)− wI t(r − 1))−∇Fir (wI t(r − 1)),

Page 58: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

which can be plugged into equation (S77), i.e.:

wI t+1(r)− wI t+1(r − 1)

= (I− ηn− r + 1

n− r(Br−1jm−Hr−1

t ))(wI t(r)− wI t(r − 1))

− η

n− r[∑i6∈Rr

∫ 1

0

Hi(wI t(r − 1) + x(wI t(r)− wI t(r − 1)))dx]

· (wI t(r)− wI t(r − 1))−∇Fir (wI t(r − 1))

+ (n− r + 1)ga(wI t(r − 1))+ ηga(wI t(r − 1)),

(S78)

which can be rearranged as:

wI t+1(r)− wI t+1(r − 1)

= (I− ηn− r + 1

n− r(Br−1jm−Hr−1

t ))(wI t(r)− wI t(r − 1))

− η

n− r[∑i6∈Rr

∫ 1

0

Hi(wI t(r − 1) + x(wI t(r)− wI t(r − 1)))dx]

· (wI t(r)− wI t(r − 1))−∇Fir (wI t(r − 1)) − η

n− rga(wI t(r − 1)).

(S79)

Then by taking the matrix norm on both sides of equation (S79) and using that ‖Hi(wI t(r−1)+x(wI t(r)−wI t(r−1)))‖ ≥µ and ‖Br−1

jm−Hr−1

t ‖ ≤ ξj1,jm(r − 1), equation (S79) can be bounded as:

‖wI t+1(r)− wI t+1(r − 1)‖≤ (1− ηµ)‖wI t(r)− wI t(r − 1)‖

+(n− r + 1)η

n− rξj1,jm(r − 1)‖wI t(r)− wI t(r − 1)‖

n− r‖∇Fir (wI t(r − 1))‖+ ‖ η

n− rga(wI t(r − 1))‖.

Then by using Lemma 12 and equation (S71), the formula above becomes:

≤ (1− ηµ+(n− r + 1)η

n− rξj1,jm(r))‖wI t(r)− wI t(r − 1)‖

n− r(Mr

1

1

nL(r − 1) + c2) +

η

n− r(Mr

1

1

n(r − 1)µ+Mr

1

1

nL(r − 1) + c2).

By using the bound on ξj1,jm(r) and applying the above formula recursively across all iterations, the formula abovebecomes:

≤ 1

ηµ− η(n−r+1)n−r

µ2

n− r(Mr

1

1

nL(r − 1) + c2)

n− r(Mr

1

1

nL(r − 1) +Mr

1

1

nµ(r − 1) + c2)

=2

(n− r − 1)µ((Mr

1

1

n(2L(r − 1) + (r − 1)µ) + 2c2).

Page 59: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

Then by using that M1 = 2c2µ and Mr

11n =

2M1n

1− r+1n −

2(r−1)n ( 2L+µ

µ ), the formula above can be rewritten as:

=2

(n− r − 1)µ

2M1(r−1)n (2L+ µ) + µM1(1− r+1

n −2(r−1)n ( 2L+µ

µ ))

1− r+1n −

2(r−1)n ( 2L+µ

µ )

=2M1

n

1− r+1n −

2(r−1)n ( 2L+µ

µ )= Mr

1

1

n.

This finishes the proof.

The proof of Theorem 19

Proof. Recall that the update rule for wUt(r) is:

wUt+1(r) = wUt(r)− η1

n− r∑i6∈Rr

∇F (wUt(r))

and the update rule for wI t(r) is (where the gradients are explicitly evaluated):

wI t+1(r) = wI t(r)−η

n− r[(n− r + 1)(Br−1

jm(wI t(r)− wI t(r − 1)) + ga(wI t(r − 1)))−∇Fir (wI t(r))].

Then by subtracting wI t+1(r) from wUt+1(r), we get:

‖wI t+1(r)− wUt+1(r)‖

= ‖wI t(r)− wUt(r)−η

n− r(n− r + 1)[Br−1

jm(wI t(r)− wI t(r − 1))

+ ga(wI t(r − 1))]−∇Fir (wI t(r))+η

n− r∑i 6∈Rr

∇F (wUt(r))‖.

Then by bringing in Hr−1t and 1

n−r+1

∑i∈Rr−1

∇F (wI t(r − 1)) into the formula above, we get:

= ‖wI t(r)− wUt(r)−(n− r + 1)η

n− r[(

Br−1jm−Hr−1

t

)(wI t(r)− wI t(r − 1))

+ Hr−1t × (wI t(r)− wI t(r − 1)) + ga(wI t(r − 1))

− 1

n− r + 1

∑i∈Rr−1

∇F (wI t(r − 1)) +1

n− r + 1

∑i∈Rr−1

∇F (wI t(r − 1))]

n− r[∇Fir (wI t(r))−∇Fir (wI t(r − 1)) +∇Fir (wI t(r − 1))] +

η

n− r∑i6∈Rr

∇Fi(wUt(r))‖.

Then by using the triangle inequality and the result from equation (S70), the formula above can be bounded as:

≤ ‖wI t(r)− wUt(r)−(n− r + 1)η

n− r[(

Br−1jm−Hr−1

t

)(wI t(r)− wI t(r − 1))

+ Hr−1t × (wI t(r)− wI t(r − 1)) +

1

n− r + 1

∑i∈Rr−1

∇Fi(wI t(r − 1))

+

η

n− r[∇Fir (wI t(r))−∇Fir (wI t(r − 1)) +∇Fir (wI t(r − 1))]

n− r∑i 6∈Rr

∇Fi(wUt(r))‖+ 2A(1− µη)j0+xT0d0,(m−1)T0(0)rMr

1

1

n+ 2A(rMr

1

1

n)2.

Page 60: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

Note that the first matrix norm in this formula is the same as equation (S33) by replacing n, r, wI t, wUt, wt, Bjm , Ht and∇F (wt) with n− r + 1, 1, wI t(r), wUt(r), wI t(r − 1), Br−1

jm, Hr−1

t and 1n−r+1

∑i 6∈Rr−1

∇Fi(wI t(r − 1)) reps.. So byfollowing the same derivation, the formula above can be bounded as:

≤ ‖(I− η

n− r∑i6∈Rr

Hr−1t,i )(wI t(r)− wUt(r))‖

+ ‖ (n− r + 1)η

n− r[(

Br−1jm−Hr−1

t

)(wI t(r)− wUt(r))

]‖

+ ‖ η

n− r[∑i6∈Rr

∫ 1

0

Hi(wI t(r − 1) + x(wUt(r)− wI t(r − 1)))dx

−∫ 1

0

Hi(wI t(r − 1) + x(wI t(r)− wI t(r − 1)))dx](wUt(r)− wI t(r − 1))‖

+ ‖ (n− r + 1)η

n− r[(

Br−1jm−Hr−1

t

)(wUt(r)− wI t(r − 1))

]‖

+ 2A(1− µη)j0+xT0d0,(m−1)T0(0)rMr

1

1

n+ 2A(rMr

1

1

n)2.

Then by using the following facts:

1. ‖I− ηHr−1t,i ‖ ≤ 1− ηµ;

2. from Theorem 18 on the approximation accuracy of the quasi-Hessian to mean Hessian, we have the error bound‖Hr−1

t − Br−1jm‖ ≤ ξj1,jm(r − 1);

3. we bound the difference of integrated Hessians using the strategy from Equation (S20);

4. from Equation (S72), we have the error bound ‖wUt(r) − wI t(r − 1)‖ ≤ M2rn (and this requires no additional

assumptions),

the expression can be bounded as follows:

≤ (1− ηµ+(n− r + 1)η

n− rξj0,j0+(m−1)T0

(r − 1) +c0M2rη

2n)‖wI t − wUt‖

+M2(n− r + 1)rη

n(n− r)ξj1,jm(r − 1) + 2A(1− µη)j0+xT0d0,(m−1)T0

(0)rMr1

1

n

+ 2A(rMr1

1

n)2,

which is very similar to equation (S36) (except the difference in the coefficient). So by following the derivation afterequation (S36), we know that:

‖wI t(r)− wUt(r)‖ = o(r

n)

when t→∞.

C.3. Extension of DeltaGrad for non-strongly convex, non-smooth objective functions

For the original version of the L-BFGS algorithm, strong convexity is essential to make the secant condition hold. In thissubsection, we present our extension of DeltaGrad to non-strongly convex, non-smooth objectives.

To deal with non-strongly convex objectives, we assume that convexity holds in some local regions. When constructing thearrays ∆G and ∆W , only the model parameters and their gradients where local convexity holds are used.

Page 61: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

For local non-smoothness, we found that even a small distance between wt and wI t can make the estimated gradient∇F (wI t) drift far away from∇F (wt). To deal with this, we explicitly check if the norm of Bjm(wt − wI t) (which equalsto∇F (wI t)−∇F (wt)) is larger than the norm of L(wt − wI t) for a constant L. In our experiments, L is configured as 1.The details of the modifications above are highlighted in Algorithm 4.

Algorithm 4 DeltaGrad (general models)Input :The full training set (X,Y), model parameters cached during the training phase for the full training samples w0,w1, . . . ,wt

and corresponding gradients ∇F (w0) ,∇F (w1) , . . . ,∇F (wt), the removed training sample or the added training sampleR, period T0, total iteration number T , history size m, warmup iteration number j0, learning rate η

Output :Updated model parameter wI t47 Initialize wI0 ← w0

48 Initialize an array ∆G = []49 Initialize an array ∆W = []50 Initialize last t = j051 is explicit = False52 for t = 0; t < T ; t+ + do53 if (t− lastt) mod T0 == 0 or t ≤ j0 then54 is explicit = True55 else56 end57 if is explicit == True or t ≤ j0 then58 last t = t

59 compute∇F(wI t)

exactly60 compute∇F

(wI t)−∇F (wt) based on the cached gradient∇F (wt)

/* check local convexity */

61 if < ∇F(wI t)−∇F (wt) ,wI t − wt >≤ 0 then

62 compute wI t+1 by using exact GD update (equation (1))63 continue64 end65 set ∆G [k] = ∇F

(wI t)−∇F (wt)

66 set ∆W [k] = wI t − wt, based on the cached parameters wt67 k ← k + 1

68 compute wI t+1 by using exact GD update (equation (1))69 else70 Pass ∆W [−m :], ∆G [−m :], the last m elements in ∆W and ∆G, which are from the jth1 , jth2 , . . . , jthm iterations where

j1 < j2 < · · · < jm depend on t, v = wI t − wt, and the history size m, to the L-BFGFS Algorithm (See Supplement) to getthe approximation of H(wt)v, i.e., Bjmv

/* check local smoothness */71 if ‖Bjmv‖ ≥ ‖v‖ then72 go to line 5873 end74 Approximate∇F

(wI t)

= ∇F (wt) + Bjm(wI t − wt

)75 Compute wI t+1 by using the ”leave-r-out” gradient formula, based on the approximated∇F (wI t)76 end77 end78 return wI t

D. Supplementary experimentsIn this section, we present some supplementary experiments that could not be presented in the paper due to space limitations.

D.1. Experiments with large deletion rate

In this experiment, instead of deleting at most 1% of training samples each time as we did in Section 18 in the main paper,we vary the deletion rate from 0 to up to 20% on MNIST dataset and still compare the performance between DeltaGrad(with T0 as 5 and j0 as 10) and BaseL. All other hyper-parameters such as the learning rate and mini-batch size remain thesame as in Section 18 in the main paper.

Page 62: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

Figure S1. Running time and distance with varied deletion rate up to 20%

The experimental results in Figure S1 show that even with the largest deletion rate, i.e. 20%, DeltaGrad can still be 1.67xfaster than BaseL (2.27s VS 1.53s) and the error bound between their resulting model parameters (i.e. wI∗ VS wU∗) are stillacceptable (on the order of 10−3), far smaller than the error bound between wU∗ and w∗ (on the order of 10−1). Such asmall difference between wI∗ and wU∗ also results in almost the same prediction performance, i.e. 87.460± 0.0011% and87.458± 0.0012% respectively. This experiment thus provides some justification for the feasibility of DeltaGrad even whenthe number of the removed samples is not far smaller than the entire training dataset size.

Figure S2. Running time and distance comparison with varying mini-batch size under fixed j0 = 10 and varying T0 (T0 = 20 VST0 = 10 VS T0 = 5)

D.2. Influence of hyper-parameters on performance

To begin with, the influence of different hyper-parameters used in SGD and DeltaGrad is explored. We delete one samplefrom the training set of MNIST by running regularized logistic regression with the same learning rate and regularizationrate as in Section 18 and varying mini-batch sizes (1024 - 60000), T0 (T0 = 20, 10, 5) and j0 (j0 = 5, 10, 50). Theexperimental results are presented in Figure S2-S3. For different mini-batch sizes, we also used different epoch numbers tomake sure that the total number of running iterations/steps in SGD are roughly the same. In what follows, we analyze howthe mini-batch size, the hyper-parameters T0 and j0 influence the performance, thus providing some hints on how to chooseproper hyper-parameters when DeltaGrad is used.

Influence of the mini-batch size. It is clear from Figure S2-S3 that with larger mini-batch sizes, DeltaGrad can gain morespeed with longer running time for both BaseL and DeltaGrad. As discussed in Section 18, to compute the gradients, otherGPU-related overhead (the overhead to copy data from CPU DRAM to GPU DRAM, the time to launch the kernel on

Page 63: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

Figure S3. Running time and distance comparison with varying mini-batch size under fixed T0 = 5 and varying j0 (j0 = 5 VS j0 = 10VS j0 = 50)

Figure S4. Comparison of DeltaGrad and PrIU

GPU) cannot be ignored. This can become more significant when compared against the smaller computational overhead forsmaller mini-batch data. Also notice that, when T0 = 5, with increasing B, the difference between wU and wI becomessmaller and smaller, which matches our conclusion in Theorem 13, i.e. with larger B, the difference o( rn + 1

B14

) is smaller.

Influence of T0. By comparing the three sub-figures in Figure S2, the running time slightly (rather than significantly)decreases with increasing T0 for the same mini-batch size. This is explained by the earlier analysis in Section 18 on the non-ideal performance for GPU computation over small matrices. Interestingly, when T0 = 10 or T0 = 20, ‖wI,S −wU,S‖ doesnot decrease with larger mini-batch sizes. This is because in Formula (S61), one component of the bound of ‖wI,S −wU,S‖is

M1( rn + 1B1/4 )

C(1− rn −

1B1/4 )

(1− ηC)yT0(1− ηµ)j0d0,mT0−11

1− ( 1−ηµ1−ηC )T0

(while the other component is o(( rn + 1

B14

))). Here d0,mT0−1 increases with larger T0 and the term (1 − ηC)yT0 is notarbitrarily approaching 0 since yT0 cannot truly go to infinity. So when T0 = 20 and T0 = 10, this component becomes thedominating term in the bound of ‖wI,S −wU,S‖. So to make the bound o(( rn + 1

B14

)) hold, so that we can adjust the bound

of ‖wI,S − wU,S‖ by varying B, proper choice of T0 is important. For example, T0 = 5 is a good choice for the MNISTdataset. This can achieve speed-ups comparable to larger T0 without sacrificing the closeness between wI,S and wU,S .

Influence of j0. By comparing the three sub-figures in Figure S3, with increasing j0, long “burn-in” iterations are expected,thus incurring more running time. This, however, does not significantly reduce the distance between wI,S and wU,S . It

Page 64: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

indicates that we can select smaller j0, e.g. 5 or 10 for more speed-up.

Discussions on tuning the hyper-parameters for DeltaGrad. Through our extensive experiments, we found that forregularized logistic regression, setting T0 as 5 and j0 as 5− 20 would lead to some of the most favorable trade-offs betweenrunning time and the error ‖wU,S − wI,S‖. But in terms of more complicated models, e.g. 2-layer DNN, higher j0 (evenhalf of the total iteration number) and smaller T0 (2 or 1) are necessary. Similar experiments were also conducted on addingtraining samples, in which similar trends were observed.

D.3. Comparison against the state-of-the-art work

To our knowledge, the closest work to ours is (Wu et al., 2020), which targets simple ML models, i.e. linear regressionand regularized logistic regression with an ad-hoc solution (called PrIU) rather than solutions for general models. Theirsolutions can only deal with the deletion of samples from the training set without supporting the addition of samples. In ourexperiments, we compared DeltaGrad (with T0 = 5 and j0 = 10) against PrIU by running regularized logistic regressionover MNIST and covtype with the same mini-batch size (16384), the same learning rate and regularization rate, but withvarying deletion rates.

Table 1. Memory usage of DeltaGrad and PrIU(GB)

Deletion rate MNIST covtypePrIU DeltaGrad PrIU DeltaGrad

2× 10−5 26.61 2.74 9.30 2.565× 10−5 27.02 2.74 9.30 2.561× 10−4 27.13 2.74 9.30 2.552× 10−4 27.75 2.74 9.31 2.565× 10−4 29.10 2.74 10.67 2.561× 10−3 29.10 2.74 10.67 2.56

The running time and the distance term ‖wU − wI‖ of both PrIU and DeltaGrad with varying deletion rate are presented inFigure S4. First, it shows that DeltaGrad is always faster than PrIU, with more significant speed-ups on MNIST. The reasonis that the time complexity of PrIU is O(rp) for each iteration where p represents the total number of model parameterswhile r represents the reduced dimension after Singular Value Decomposition is conducted over some p× p matrix. This isa large integer for large sparse matrices, e.g. MNIST.

As a result, O(rp) is larger than the time complexity of DeltaGrad. Also, the memory usage of PrIU and DeltaGrad is shownin Table 1. PrIU needs much more DRAM (even 10x in MNIST) than DeltaGrad. The reason is that to prepare for the modelupdate phase, PrIU needs to collect more information during the training phase over the full dataset. This is needed in themodel update phase and is quadratic in the number of the model parameters p. The authors of (Wu et al., 2020) claimedthat their solution cannot provide good performance over sparse datasets in terms of running time, error term wU − wI andmemory usage. In contrast, both the time and space overhead of DeltaGrad are smaller, which thus indicates the potential ofits usage in the realistic, large-scale scenarios.

D.4. Experiments on large ML models

In this section, we compare DeltaGrad with BaseL using the state-of-the-art ResNet152 network (He et al., 2016) (ResNetfor short hereafter) with all but the top layer frozen, for which we use the pre-trained parameters from Pytorch torchvisionlibrary3. The pre-trained layers with fixed parameters are regarded as the feature transformation layer, applied over eachtraining sample as the pre-processing step before the training phase. Those transformed features are then used to train thelast layer of ResNet, which is thus equivalent to training a logistic regression model.

This experiment is conducted on CIFAR-10 dataset (Krizhevsky & Hinton, 2009), which is composed of 60000 32×32 colorimages (50000 of them are training samples while the rest of are test samples). We run SGD with mini-batch size 10000,fixed learning rate 0.05 and L2 regularization rate 0.0001. Similar to the experimental setup introduced in Section 18 in themain paper, the deletion rate is varied from 0 to 1% and the model parameters are updated by using BaseL and DeltaGrad(with T0 as 5 and j0 as 20) respectively after the deletion operations. The experimental results are presented in Figure S5,again showing significant speed-ups for DeltaGrad relative to BaseL (up to 3x speed-ups when the deletion rate is 0.005%)

3https://pytorch.org/docs/stable/torchvision/models.html

Page 65: arxiv.org · DeltaGrad objective function for a general machine learning model is defined as: F(w) = 1 n Xn i=1 F i(w) where w represents a vector of the model parameters and F i(w)

DeltaGrad

Figure S5. Comparison of DeltaGrad and BaseL on the CIFAR-10 dataset with pre-trained ResNet152 network

Figure S6. Comparison of DeltaGrad and BaseL on RCV1 dataset after deleting outliers

with far smaller error bound (up to 4× 10−3) than the baseline error bound (up to 2× 10−2). Since it is quite common toreuse sophisticated pre-trained models in practice, we expect that the use of DeltaGrad in this manner is applicable in manycases.

D.5. Applications of DeltaGrad to robust learning

As Section 18 in the main paper reveals, DeltaGrad has many potential applications. In this section, we explored howDeltaGrad can accelerate the evaluations of the effect of the outliers in robust statistical learning. Here the effect of outliersis represented by the difference of the model parameters before and after the deletion of the outliers (see (Yu & Yao, 2017)).

In the experiments, we start by training a model on the training dataset (RCV1 here) along with some randomly generatedoutliers. Then we remove those outliers and update the model on the remaining training samples by using DeltaGrad andBaseL. We also evaluate the effect of the fraction of outlier: the ratio between the number of the outliers and the trainingdataset size is also defined as the Deletion rate. It is varied from 1% to 10%. According to the experimental results shownin Figure S6, when there are up to 10% outliers in the training dataset, DeltaGrad is at least 2.18x faster than BaseL inevaluating the updated model parameters by only sacrificing little computational accuracy (no more than 5× 10−3), thusreducing the computational overhead on evaluating the effect of the outliers in robust learning.