junyu zhang lin xiao may 12, 2021 arxiv:1908.11468v2 [math

30
Multi-Level Composite Stochastic Optimization via Nested Variance Reduction Junyu Zhang * Lin Xiao May 12, 2021 Abstract We consider multi-level composite optimization problems where each mapping in the compo- sition is the expectation over a family of randomly chosen smooth mappings or the sum of some finite number of smooth mappings. We present a normalized proximal approximate gradient (NPAG) method where the approximate gradients are obtained via nested variance reduction. In order to find an approximate stationary point where the expected norm of its gradient map- ping is less than , the total sample complexity of our method is O( -3 ) in the expectation case, and O(N + N -2 ) in the finite-sum case where N is the total number of functions across all composition levels. In addition, the dependence of our total sample complexity on the number of composition levels is polynomial, rather than exponential as in previous work. keywords: composite stochastic optimization, proximal gradient method, variance reduction 1 Introduction In this paper, we consider multi-level composite stochastic optimization problems of the form minimize xR d E ξm f m,ξm ( ··· E ξ 2 f 22 ( E ξ 1 [f 11 (x)] ) ··· ) + Ψ(x), (1) where ξ 1 ,...,ξ m are independent random variables. We assume that each realization of f i,ξ i : R d i-1 R d i is a smooth vector mapping (with d 0 = d and d m = 1) and the function Ψ is convex but possibly nonsmooth. An interesting special case is when each ξ i follows the uniform distribution over a finite support {1,...,N i }, i.e., minimize xR d 1 N m Nm X j =1 f m,j ··· 1 N 2 N 2 X j =1 f 2,j 1 N 1 N 1 X j =1 f 1,j (x) ··· + Ψ(x). (2) To simplify presentation, we write the above problems as minimize xR d Φ(x) , F (x) + Ψ(x) , (3) * Department of Electrical Engineering, Princeton University, Princeton, New Jersey 08544; and National Univer- sity of Singapore. (Email: [email protected]). Work done at Microsoft Research, Redmond, WA 98052 ([email protected], https://linxiaolx.github.io). 1 arXiv:1908.11468v2 [math.OC] 10 May 2021

Upload: others

Post on 28-Mar-2022

8 views

Category:

Documents


0 download

TRANSCRIPT

Variance Reduction
Abstract
We consider multi-level composite optimization problems where each mapping in the compo- sition is the expectation over a family of randomly chosen smooth mappings or the sum of some finite number of smooth mappings. We present a normalized proximal approximate gradient (NPAG) method where the approximate gradients are obtained via nested variance reduction. In order to find an approximate stationary point where the expected norm of its gradient map- ping is less than ε, the total sample complexity of our method is O(ε−3) in the expectation case, and O(N +
√ Nε−2) in the finite-sum case where N is the total number of functions across all
composition levels. In addition, the dependence of our total sample complexity on the number of composition levels is polynomial, rather than exponential as in previous work.
keywords: composite stochastic optimization, proximal gradient method, variance reduction
1 Introduction
minimize x∈Rd
+ Ψ(x), (1)
where ξ1, . . . , ξm are independent random variables. We assume that each realization of fi,ξi : Rdi−1 → Rdi is a smooth vector mapping (with d0 = d and dm = 1) and the function Ψ is convex but possibly nonsmooth. An interesting special case is when each ξi follows the uniform distribution over a finite support {1, . . . , Ni}, i.e.,
minimize x∈Rd
minimize x∈Rd
†Work done at Microsoft Research, Redmond, WA 98052 ([email protected], https://linxiaolx.github.io).
1
F (x) = fm fm−1 · · · f1(x). (4)
For problems (1) we use the definition fi(·) = Eξi [fi,ξi(·)] and for problem (2), we have fi(·) =
(1/Ni) ∑Ni
j=1 fi,j(·). The formulations (1) and (2) cover a broader range of applications beyond the classical stochastic
programming and empirical risk minimization (ERM) problems, which are special cases of (1) and (2) when m = 1, respectively. Important applications with m = 2 include policy evaluation in reinforcement learning and Markov decision processes (e.g., [34, 7]), risk-averse optimization (e.g., [31, 33]), and stochastic variational inequality (e.g., [17, 19] through a novel formulation in [15]). Other applications in machine learning and statistics include optimization of conditional random fields (CRF), Cox’s partial likelihood and expectation-maximization; see formulations in [6, Section 3.1-3.3]. Multi-level composite risk minimization problems with m > 2 have been considered in, e.g., [10] and [39]. A more recent application of the multi-level problem is adversarial attacks and defenses in deep neural networks; see the survey paper [40].
1.1 Related work
Existing work on composite stochastic optimization traces back to [12]. Several recent progresses have been made for two-level (m = 2) problems, both for the general stochastic formulation (e.g., [35, 36, 39, 15]) and for the finite-sum formulation (e.g., [20, 43, 16, 41]).
In order to cite specific results, we first explain the measure of efficiency for composite stochastic optimization. A well-adopted measure is their sample complexity, i.e., the total number of times that any component mapping fi,ξi(·) or Jacobian f ′i,ξi(·) is evaluated, in order to find some x (output
of a randomized algorithm) such that E[G(x)2] ≤ ε for a given precision ε > 0. Here G(x) is the proximal gradient mapping of the objective function Φ ≡ F + Ψ at x (exact definition given in Section 2). If Ψ ≡ 0, then G(x) = F ′(x) where F ′ denotes the gradient of F and the criteria for ε-approximation becomes E[F ′(x)2] ≤ ε.
For problem (1) with two levels (m = 2) and Ψ ≡ 0, stochastic composite gradient methods were developed in [35] with sample complexities O(ε−4), O(ε−3.5) and O(ε−1.25) for the smooth nonconvex case, smooth convex case and smooth strongly convex case respectively. For nontrivial convex Ψ, improved sample complexities of O(ε−2.25), O(ε−2) and O(ε−1) were obtained in [36] for the three cases respectively. These algorithms employ two-level stochastic updates in different time scales, i.e., the two step-size sequences converge to zero at different rates. Similar algorithms for solving multi-level problems were proposed in [39], with sample complexity O(ε−(m+7)/4). A single time-scale method is proposed in [15] for solving two-level problems, which achieves an O(ε−2) sample complexity, matching the result for single-level stochastic optimization [14].
For the composite finite-sum problem (2) with two levels (m = 2), several recent works ap- plied stochastic variance-reduction techniques to obtain improved sample complexity. Stochas- tic variance-reduction techniques were first developed for one-level finite-sum convex optimization problems (e.g., [32, 18, 8, 38, 4, 24]), and later extended to the one-level nonconvex setting (e.g., [27, 29, 3, 1, 2, 13, 37, 25, 26]) and the two-level finite-sum nonconvex setting (e.g., [20, 43, 16, 41]). For the two-level finite-sum case, the previous best known sample complexity is O(N + N2/3ε−1) where N = N1 +N2, obtained in [16, 41].
Interestingly, better complexity bounds have been obtained recently for the non-composite case of m = 1, specifically, O(ε−1.5) for the expectation case and O(N1 +
√ N1ε
2
case [13, 37, 25, 26]. These results look to be even better than the classical O(ε−2) complexity of stochastic gradient method for single-level smooth nonconvex optimization (e.g., [14]), which is consistent with the statistical lower bound for sample average approximation [10]. Indeed, a stronger smoothness assumption is responsible for the improvement: instead of the classical assumption that the gradient of the expected function is Lipschitz continuous, the improved results require a Mean- Square-Lipschitz (MSL) condition (exact definition given in Section 3).
Different from the one-level case, a major difficulty in the multi-level cases (m ≥ 2) is to deal with biased gradient estimators for the composite function (e.g., [35, 36, 41]). Unbiased gradient estimators for two-level problems are developed in [6], but they do not seem to improve sample complexity. On the other hand, it is shown in [42] that the improved sample complexity of O
( min{ε−1.5,
√ N1ε
−1} )
can be obtained with biased gradient estimator for two-level problems where the outer level is deterministic or without finite-sum. This result indicates that similar improvements are possible for multi-level problems as well.
1.2 Contributions and outline
In this paper, we propose stochastic gradient algorithms with nested variance-reduction for solving problems (1) and (2) for any m ≥ 1, and show that their sample complexity for finding x such that E[G(x)] ≤ ε is O(ε−3) for the expectation case, and O(N +
√ Nε−2) for the finite-sum
case where N = ∑m
i=1Ni. To compare with previous work listed in Section 1.1, we note that a complexity O(ε−a) for E[G(x)2] ≤ ε (measure used in Section 1.1) implies an O(ε−2a) complexity for E[G(x)] ≤ ε (by Jensen’s inequality, but not vice versa); in other words, ε = O(ε2). In this sense, our results are of the same orders as the best known sample complexities for solving one-level stochastic optimization problems [13, 37, 25, 26], and they are strictly better than previous results for solving two and multi-level problems.
The improvement in our results relies on a nested variance-reduction scheme using mini-batches and a stronger uniform Lipschitz assumption. Specifically, we assume that each realization of the random mapping fi,ξi and its Jacobian are Lipschitz continuous. This condition is stronger than the MSL condition assumed in [13, 37, 25, 26] for obtaining the O
( min{ε−3,
√ N1ε
−2} )
sample complexity for one-level problems. Nevertheless, it holds for many applications in machine learning (e.g., [7, 6, 41]), especially when dom Ψ is bounded. The results in [39] also rely on such a uniform smooth condition.
Our contributions can be summarized as follows. In Section 2, we consider solving more general problems of the form (3) without specifying any composition structure, and propose a Normalized Proximal Approximate Gradient (NPAG) method. We show that so long as the Mean Square Errors (MSEs) of the approximate gradients at each iteration are uniformly bounded by ε2, then the iteration complexity of NPAG for finding an x such that EG(x)] ≤ ε is O(ε−2).
In Section 3, we discuss stochastic variance-reduction techniques for constructing the approx- imate gradients in NPAG. In particular, we present a simple and unified perspective on variance reduction under the MSL condition. This perspective clearly explains the reason for periodic restart, and allows us to derive the optimal period length and mini-batch sizes without binding to any particular optimization algorithm. When using the SARAH/Spider estimator ([24] and [13] respectively) in NPAG to solve one-level stochastic or finite-sum problems, we obtain a Prox- Spider method. We also analyze the SVRG [18] and SAGA [8] estimators in the same framework, and obtain the best-known complexities for them in the literature for nonconvex optimization.
In Section 4, we present the multi-level Nested-Spider method and show that under the uniform
3
Lipschitz condition (stronger than MSL), its sample complexity for the expectation is O(m4ε−3) and for the finite-sum cases is O(N +m4
√ Nε−1) where N =
∑m i=1Ni. In particular, these results
are polynomial in the number of levels m, which improve significantly over the previous result O(ε−(m+7)/2) in [39], where m appears in the exponent of ε.
In Section 5, we present numerical experiments to compare different variance reduction methods on two applications, with one-level and two-level composition structure respectively. Finally, we give some concluding remarks in Section 6.
2 Normalized proximal approximate gradient method
In this section, we present a normalized proximal approximate gradient (NPAG) method for solving problems of form (3), which we repeat here for convenience:
minimize x∈Rd
} . (5)
Here we consider a much more general setting, without imposing any compositional structure on F . Formally, we have the following assumption.
Assumption 2.1. The functions F , Ψ and Φ in (5) satisfy:
(a) F : Rd → R is differentiable and its gradient F ′ is L-Lipschitz continuous;
(b) Ψ : Rd → R ∪ {+∞} is convex and lower semi-continuous;
(c) Φ ≡ F + Ψ is bounded below, i.e., there exists Φ∗ such that Φ(x) ≥ Φ∗ for all x ∈ dom Ψ.
Under Assumption 2.1, we define the proximal operater of Ψ, with step size η, as
proxηΨ(x) , arg min y∈Rd
{ Ψ(y) +
1
2η y − x2
} , x ∈ dom Ψ. (6)
Since Ψ is convex, this operator is well defined and unique for any x ∈ dom Ψ. Given the gradient of F at x, denoted as F ′(x), the proximal gradient mapping of Φ at x is defined as
GηΦ(x) , 1
)) . (7)
When Ψ ≡ 0, we have GηΦ(x) = F ′(x) for any η > 0. Additional properties of the proximal gradient mapping are given in [22]. Throughout this paper, we denote the proximal gradient mapping as G(·) by omitting the subscript Φ and superscript η whenever they are clear from the context.
Our goal is to find an ε-stationary point x such that G(x) ≤ ε, or E [ G(x)
] ≤ ε if using a
stochastic or randomized algorithm (see [11] for justifications of G(·) as a termination criterion). If the full gradient oracle F ′(·) is available, we can use the proximal gradient method
xt+1 = proxηΨ ( xt − ηF ′(xt)
) . (8)
By setting η = 1/L where L is the Lipschitz constant of F ′, this algorithm finds an ε-stationary point within O(Lε−2) iterations; see, e.g., [5, Theorem 10.15].
Here we focus on the case where the full gradient oracle F ′(·) is not available; instead, we can compute at each iteration t an approximate gradient vt. A straightforward approach is to replace
4
Algorithm 1: Normalized Proximal Approximate Gradient (NPAG) Method
input: initial point x0 ∈ Rd, proximal step size η > 0, a sequence εt > 0,∀t. for t = 0, ..., T − 1 do
Find an approximate gradient vt at the point xt.
Compute xt+1 = proxηΨ(xt − ηvt). Update xt+1 = xt + γt(x
t+1 − xt) with γt = min {
η εt xt+1−xt , 1
} .
end output: x∈{x0, ..., xT−1}, with xt chosen w.p. pt := εt∑T−1
k=0 εk for 0 ≤ t ≤ T − 1.
F ′(xt) with vt in (8), and then analyze its convergence properties based on certain measures of approximation quality such as E[vt − F ′(xt)2]. In Algorithm 1, we present a more sophisticated variant, which first compute a tentative iterate xt+1 = proxηΨ(xt − ηvt), and then move in the direction of xt+1 − xt with a step size γt, i.e., xt+1 = xt + γt(x
t+1 − xt). The choice of step size
γt = min {
ηεt xt+1−xt , 1
} ensures that the step length xt+1 − xt is always bounded:
xt+1 − xt = γtxt+1 − xt ≤ ηεt, (9)
where εt is a pre-specified parameter. Algorithm 1 can also be written in terms of the approximate gradient mapping, which we define
as
) . (10)
Using this notation, the updates in Algorithm 1 become γt = min {
εt G(xt) , 1
xt − ηεt G(xt)
G(xt) if G(xt)
≤ εt. Notice that when
G(xt) > εt, the algorithm uses the normalized proximal gradient mapping with
a step length ηεt. Therefore, we call it the Normalized Proximal Approximate Gradient (NPAG) method.
Next, we prove a general convergence result concerning Algorithm 1 without specifying how the approximate gradient vt is generated. The only condition we impose is that the Mean-Square Error (MSE), E[vt − F ′(xt)2], is sufficiently small for all t. We first state a useful lemma.
Lemma 2.2. Suppose Assumption 2.1 hold. Then the sequence {xt} generated by Algorithm 1 satisfies
Φ(xt+1) ≤ Φ(xt)− ( γt/η − Lγ2
t
5
Proof. According to the update rule for xt+1, we have
Φ(xt+1) = F (xt + γt(x t+1 − xt)) + Ψ(xt + γt(x
t+1 − xt))
⟩ + Lγ2
t
= F (xt) + Ψ(xt) + Lγ2
⟩ +γt
) , (11)
where the inequality is due to the L-smoothness of F (see, e.g., [23, Lemma 1.2.3]) and the convexity of Ψ. Since xt+1 is the minimizer of a 1
η -strongly convex function, i.e.,
xt+1 = proxηΨ ( xt − ηvt
) = arg min
y∈Rd
2η xt+1 − xt2,
which implies ⟨ vt, xt+1 − xt
⟩ + Ψ(xt+1)−Ψ(xt) ≤ − 1
xt+1 − xt 2 .
Applying the above inequality to the last term in (11) yields
Φ(xt+1) ≤ Φ(xt)− (γt η − Lγ2
t
2
) xt+1 − xt2 + γtF ′(xt)− vt, xt+1 − xt
≤ Φ(xt)− (γt η − Lγ2
1
= Φ(xt)− (γt η − Lγ2
where the second inequality is due to the Cauchy-Schwarz inequality.
We note this result is similar to that of [26, Lemma 3], but we used a slightly looser bound for the inner product in the last step. This bound and our choice of γt allow a simple convergence analysis of the NPAG method, which we present next.
Theorem 2.3. Suppose Assumption 2.1 hold and we set η = 1/2L in Algorithm 1. If it holds that
E [ vt − F ′(xt)2
] ≤ ε2t , ∀ t ≥ 0, (12)
for some sequence {εt}, then the output x of Algorithm 1 satisfies
E [ G(x)
. (13)
6
Proof. Using Lemma 2.2, the assumption (12), and the choice η = 1/2L, we obtain
E [ Φ(xt+1)
] + ε2t 2L , (14)
where the equality is due to (10) and the last inequality is due to 0 ≤ γt ≤ 1, which implies
2γt − γ2 t ≥ γt. Note that γt = min
{ εt
} ≥ εt
G(xt) − 1
4 ε2t ,
where the inequality is due to the fact min { |z|, z2
} ≥ |z| − 1
4 for all z ∈ R (here we take z =
ε−1 t G(xt)). Taking expectations and substituting the above inequality into (14) gives
E [ Φ(xt+1)
. (15)
On the other hand, by (12) and Jessen’s inequality we obtain
E [ vt − F ′(xt)
] ≤ √ E [vt − F ′(xt)2] ≤ εt.
Denote xt+1 = proxηr(xt− ηF ′(xt)) and also note that xt+1 = proxηr(xt− ηvt). Since the proximal operator is nonexpansive (see, e.g., [30, Section 31]), we have
E [ xt+1 − xt+1
] ≤ ηE[vt − F ′(xt)] ≤ ηεt, ∀ t ≥ 0.
Therefore,
] ≤ 1
] +
] ≤ E
εt 4L
E [G(xt)
] ≤ E [ Φ(xt)
] + ε2t L . (16)
Adding the above inequality for k = 0, ..., T − 1 and rearranging the terms, we obtain
E [G(x)
·E [G(xt)
+ 4 ∑T−1
.
7
Remark 2.4. In the simplest case, one can set εt ≡ ε for all t = 0, 1, ..., T − 1 and
T ≥ 4L ( Φ(x0)− Φ∗
E [G(x)
] ≤ 4L(Φ(x0)− Φ∗)
Tε ≤ 5ε.
Theorem 2.3 applies directly to the deterministic proximal gradient method with vt = F ′(xt). In this case, with εt and T set according to Remark 2.4, condition (12) is automatically satisfied and we recover the O(Lε−2) complexity of proximal gradient method for nonconvex optimization [5, Theorem 10.15].
The problem of using a small, constant εt is that we will have very small step size γt in the beginning iterations of the algorithm. Although the order of the sample complexity is optimal, it may lead to slow convergence in the practice. We will see in Section 4 that we can obtain the same orders of samples complexity using a variable sequence {εt}, starting with large values and gradually decreases to the target precision ε. These choices of variable sequence {εt} lead to improved performance in practice and are adopted in our numerical experiments in Section 5.
In general, for the stochastic problems, we do not require the estimates vt to be unbiased. On the other hand, condition (12) looks to be quite restrictive by requiring the MSE of approximate gradient to be less than ε2t , where εt is some predetermined precision. In the rest of this paper, we show how to ensure condition (12) using stochastic variance-reduction techniques and derive the total sample complexity required for solving problems (1) and (2).
3 A general framework of stochastic variance reduction
In this section, we discuss stochastic variance-reduction techniques for smooth nonconvex optimiza- tion. In order to prepare for the multi-level compositional case, we proceed with a general framework of constructing stochastic estimators that satisfies (12) for Lipschitz-continuous vector or matrix mappings. To simplify presentation, we restrict our discussion in this section to the setting where εt ≡ ε in (12), where ε is the target precsion so that the output x satisfies E[G(x)] ≤ O(ε). A more general scheme with variable sequence {εt} is presented in Section 4.
Consider a family of mappings {φξ : Rd → Rp×q} where the index ξ is a random variable. We assume that they satisfy a uniform Lipschitz condition, i.e., there exists L > 0 such that
φξ(x)− φξ(y) ≤ Lx− y, ∀x, y ∈ Rd, ∀ ξ ∈ , (17)
where · denotes the matrix Frobenius norm and is the sample space of ξ. This uniform Lipschitz condition implies the Mean-Squared Lipschitz (MSL) property, i.e.,
Eξ [ φξ(x)− φξ(y)2
which is responsible for obtaining the O ( min{ε−3,
√ N1ε
−2} )
sample complexity for one-level prob- lems in [13, 37, 25, 26]. This property in turn implies that the average mapping, defined as
φ(x) , Eξ[φξ(x)],
8
is L-Lipschitz. In addition, we make the standard assumption that they have bounded variance everywhere, i.e.,
Eξ[φξ(x)− φ(x)2] ≤ σ2, ∀x ∈ Rd. (19)
In the context of solving structured optimization problems of form (5), we can restrict the above assumptions to hold within dom Ψ instead of Rd.
Suppose a stochastic algorithm generates a sequence {x0, x1, . . . , } using some recursive rule
xt+1 = ψt(x t, vt),
where vt is a stochastic estimate of φ(xt). As an example, consider using Algorithm 1 to solve problem (5) where the smooth part F (x) = Eξ[fξ(x)]. In this case, we have φξ(x) = f ′ξ(x) and
φ(x) = F ′(x), and can use the simple estimator vt = f ′ξt(x t) where ξt is a realization of the random
variable at time t. If ε ≥ σ, then assumption (19) directly implies (12) with εt ≡ ε; thus Theorem 2.3 and Remark 2.4 show that the sample complexity of Algorithm 1 for obtaining E[G(x)] ≤ ε is T = O(Lε−2).
However, if the desired accuracy ε is smaller than the uncertainty level σ, then condition (12) is not satisfied and we cannot use Theorem 2.3 directly. A common remedy is to use mini-batches; i.e., at each iteration of the algorithm, we randomly pick a subset Bt of ξ and set
vt = φBt(x t) ,
|Bt| ∑ ξ∈Bt
φξ(x t). (20)
In this case, we have E[vt] = φ(xt) and E[vt − φ(xt)2] = σ2/|Bt|. In order to satisfy (12) with εt ≡ ε, we need |Bt| ≥ σ2/ε2 for all t = 0, . . . , T − 1. Then we can apply Theorem 2.3 and Remark 2.4 to derive its sample complexity for obtaining E[G(x)] ≤ ε, which amounts to T · σ2/ε2 = O(Lσ2ε−4).
Throughout this paper, we assume that the target precision ε < σ. In this case (and assuming Ψ ≡ 0), the total sample complexity O(Lε−2 + σ2ε−4) can be obtained without using mini-batches (see, e.g., [14]). Therefore, using mini-batches alone seems to be insufficient to improve sample complexity (although it can facilitate parallel computing; see, e.g., [9]). Notice that the simple mini-batch estimator (20) does not rely on any Lipschitz condition, so it can also be used with non- Lipschitz mappings such as subgradients of nonsmooth convex functions. Under the MSL condi- tion (18), several variance-reduction techniques that can significantly reduce the sample complexity of stochastic optimization have been developed in the literature (e.g., [32, 18, 8, 38, 4, 24, 13].
In order to better illustrate how Lipschitz continuity can help with efficient variance reduc- tion, we consider the amount of samples required in constructing a number of, say τ , consecutive estimates {v0, . . . , vτ−1} that satisfy
E [vt − φ(xt)
2] ≤ ε2, t = 0, 1, . . . , τ − 1. (21)
If the incremental step lengths xt−xt−1 for t = 1, . . . , τ−1 are sufficiently small and the period τ is not too long, then one can leverage the MSL condition (18) to reduce the total sample complexity, to be much less than using the simple mini-batch estimator (20) for all τ steps. In the rest of this section, we first study the SARAH/Spider estimator proposed in [24] and [13] respectively. Then we also analyze the SVRG [18] and SAGA [8] estimators from the same general framework.
9
3.1 SARAH/SPIDER estimator for stochastic optimization
In order to construct τ consecutive estimates {v0, . . . , vτ−1}, this estimator uses the mini-batch estimator (20) for v0, and then constructs v1 through vτ−1 using a recursion:
v0 = φB0(x0),
(22)
It was first proposed in [24] as a gradient estimator of SARAH (StochAstic Recursive grAdient algoritHm) for finite-sum convex optimization, and later extended to nonconvex and stochastic optimization [25, 26]. The form of (22) for general Lipschitz continuous mappings is proposed in [13], known as Spider (Stochastic Path-Integrated Differential EstimatoR). The following lemma is key to our analysis.
Lemma 3.1. [13, Lemma 1] Suppose the random mappings φξ satisfy (18). Then the MSE of the estimator in (22) can be bounded as
E [vt − φ(xt)
] . (23)
If we can control the step lengths xt − xt−1 ≤ δ for t = 1, . . . , τ − 1, then
E [v0 − φ(x0)
|B0| + τ−1∑ t=1
L2δ2
|Bt| . (24)
To simplify notation, here we allow the batch sizes |Bt| to take fraction values, which does not change the order of required sample complexity. Now, in order to satisfy (21), it suffices to set
|B0| = 2σ2
ε2 , t = 1, . . . , τ − 1. (25)
In this case, the number of samples required over τ steps is
|B0|+ (τ − 1)b ≤ 2σ2
ε2 +
2τ2L2δ2
ε2 .
Therefore, it is possible to choose δ and τ to make the above number smaller than τσ2/ε2, which is the number of samples required for (21) with the simple mini-batching scheme (20).
To make further simplifications, we choose δ = ε/2L, which leads to b = τ/2 and
|B0|+ (τ − 1)b ≤ 2σ2/ε2 + τ2/2.
The ratio between the numbers of samples required by Spider and simple mini-batching is
ρ(τ) ,
( 2σ2
2σ2 .
In order to maximize the efficiency of variance reduction, we should choose τ to minimize this ratio, which is done by setting
τ? = 2σ/ε.
Algorithm 2: Prox-Spider Method
input: initial point x0 ∈ Rd, proximal step size η > 0, and precision ε > 0; Spider epoch length τ , large batch size B and small batch size b.
for t = 0, 1, ..., T − 1 do if t mod τ == 0 then
Randomly sample a batch Bt of ξ with |Bt| = B vt = f ′Bt(x
t)
else Randomly sample a batch Bt of ξ with |Bt| = b vt = vt−1 + f ′Bt(x
t)− f ′Bt(x t−1)
end xt+1 = proxηΨ(xt − ηvt) xt+1 = xt + γt(x
t+1 − xt) with γt = min {
η ε xt+1−xt , 1
} end output: x uniformly sampled from {x0, ..., xT−1}.
Using (25) and δ = ε/2L, the corresponding mini-batch size and optimal ratio of saving are
b? = τ?/2 = σ/ε, ρ(τ?) = 2 ε/σ. (26)
Therefore, significant reduction in sample complexity can be expected when ε σ. The above derivation gives an optimal length τ? for variance reduction over consecutive estimates
{v0, . . . , vτ−1}. If the number of iterations required by the algorithm is lager than τ?, then we need to restart the Spider estimator every τ? iterations. This way, we can keep the optimal ratio ρ(τ?) to obtain the best total sample complexity.
Algorithm 2 is an instantiation of the NPAG method (Algorithm 1) that uses Spider to con- struct vt. It incorporates the periodic restart scheme dictated by the analysis above. The Spider- SFO method (Option II) in [13] can be viewed as a special case of Algorithm 2 with Ψ ≡ 0:
xt+1 = xt − γtη vt, where η = 1
2L and γt = min
} . (27)
Several variants of Spider-SFO, including ones that use constant step sizes and direct extensions of the form xt+1 = proxηΨ(xt−ηvt), have been proposed in the literature [24, 13, 25, 26]. Algorithm 1 is very similar to ProxSARAH [26], but with different choices of γt. In ProxSARAH, the whole sequence {γt}t≥0 is specified off-line in terms of parameters such as L and number of steps to run. Our choice of γt follows closely that of Spider-SFO in (27), which is automatically adjusted in an online fashion, and ensures bounded step length. This property allows us to extend the algorithm further to work with more general gradient estimators, including those for multi-level composite stochastic optimization, all within the common framework of NPAG and Theorem 2.3.
The sample complexity of Algorithm 2 is given in the following corollary.
Corollary 3.2. Consider problem (5) with F (x) = Eξ[fξ(x)]. Suppose Assumption 2.1 holds and the gradient mapping φξ ≡ f ′ξ satisfies (18) and (19) on dom Ψ (instead of Rd). If the parameters in Algorithm 2 are chosen as
η = 1
2L , τ =
⌉ , (28)
11
where d·e denotes the integer ceiling, then the output x satisfies E[G(x)] ≤ ε after O(Lε−2) iterations, and the total sample complexity is O(Lσε−3 + σ2ε−2).
Proof. In Algorithm 2, the step lengths satisfy the same bound in (9), i.e., xt+1−xt ≤ ηε = ε/2L. From the analysis following (25), the parameters in (28) guarantee (12). Therefore we can apply Theorem 2.3 and Remark 2.4 to conclude that E[G(x)] ≤ 5ε provided that T ≥ 4L(Φ(x0)−Φ∗)ε
−2. To estimate the sample complexity, we can simply multiply the number of samples required by
simple mini-batching, T · σ2/ε2, by the optimal ratio in (26) to obtain
T · σ 2
) ,
where we used the assumption ε < σ. Considering that B = O(σ2/ε2) samples are always needed at t = 0, we can include it in the total sample complexity, which becomes O(Lσε−3 + σ2ε−2).
We note that for each t that is not a multiple of τ , we need to evaluate f ′ξ twice for each ξ ∈ Bt, one at xt and the other at xt−1. Thus the computational cost per iteration, measured by the number of evaluations of fξ and f ′ξ, is twice the number of samples. Therefore the overall computational cost is proportional to the sample complexity. This holds for all of our results in this paper.
3.2 SARAH/SPIDER estimator for finite-sum optimization
Now consider a finite number of mappings φi : Rd → Rp×q for i = 1, . . . , n, and define φ(x) = (1/n)
∑n i=1 φi(x) for every x ∈ Rd. We assume that the MSL condition (18) holds, but do not
require a uniform bound on the variance as in (19). Again, we would like to consider the number of samples required to construct τ consecutive estimates {v0, . . . , vτ−1} that satisfy condition (21). Without the uniform variance bound (19), this cannot be treated exactly as a special case of what we did in Section 3.1, and thus deserves separate attention.
We use the same Sarah/Spider estimator (22) and choose B0 = {1, . . . , n}. In this case, we have v0 = φ(x0) and E
[ v0 − φ(x0)2
] = 0. From (23), if we can control the step lengths
xt − xt−1 ≤ ε/2L and use a constant batch size Bt = b for t = 1, . . . , τ − 1, then
E [v1 − φ(x1)
τ−1∑ t=1
4b .
In order to satisfy (21), it suffices to use b = τ/4. The number of samples required for τ steps is
|B0|+ (τ − 1)b ≤ n+ τ2/4.
As before, we compare the above number of samples against the simple mini-batching scheme of using vt = φBt(x
t) for all t = 0, . . . , τ − 1. However, without an assumption like (19), we cannot guarantee (21) unless using Bt = {1, . . . , n} for all t in simple mini-batching, which becomes full- batching. The ratio between the numbers of samples required by Spider and full-batching is
ρ(τ) = n+ τ2/4
nτ =
1
τ +
τ
4n .
By minimizing this ratio, we obtain the optimal epoch length and batch size:
τ? = √
2n/4.
12
In this case, the minimum ratio is ρ(τ?) = 1/ √ n. Hence we expect significant saving in sample
complexity for large n. Similar to the expectation case, we should restart the estimator every τ? =
√ 2n iterations to keep the optimal ratio of reduction.
As an example, we consider using the Prox-Spider method to solve problem (5) when the smooth part of the objective is a finite-sum. The resulting sample complexity is summarized in the following corollary of Theorem 2.3. Its proof is similar to that of Corollary 3.2 and thus omitted.
Corollary 3.3. Consider problem (5) with F (x) = (1/n) ∑n
i=1 fi(x). Suppose Assumption 2.1 holds and the gradient f ′i satisfies (18) on dom Ψ. If we set the parameters in Algorithm 2 as
η = 1/2L, τ = ⌈√
⌉ ,
and use Bt = {1, . . . , n} whenever t mod τ = 0, then the output satisfies E[G(x)] ≤ ε after O(Lε−2) iterations, and the total sample complexity is O
( n+ L
√ nε−2
3.3 SVRG estimator for stochastic optimization
As a general framework, NPAG may also incorporate the SVRG estimator proposed in [18]. We illustrate how SVRG estimator can be applied to construct τ consecutive estimates {v0, ..., vτ−1} that satisfy (21) for stochastic optimization (the expectation case). For the SVRG estimator, we have
v0 = φB0(x0),
(29)
E [vt − φ(xt)
] ,
see [27, Lemma 4]. If we bound the step length by δ, then by the triangle inequality,
xt − x0 ≤ t∑
|Bt| .
To guarantee (21), we can choose δ = ε/2L and |B0| ≥ 2σ2/ε2 and for t = 1, . . . , τ , and |Bt| = b ≥ 2τ2L2δ2/ε2 = τ2/2. Thus the total sample complexity per epoch is
|B0|+ (τ − 1)b = 2σ2
2 .
We can minimize the ratio between the above quantity and τσ2/ε2 (of the naive mini-batch scheme) by choosing τ = (
√ 2σ/ε)2/3 and total sample complexity per epoch becomes O(σ2/ε2), which can
be much smaller than τσ2/ε2. By restarting the gradient estimation every τ iterations, (12) is satisfied with εt ≡ ε. By Theorem 2.3 and Remark 2.4, getting a x s.t. E [G(x)] ≤ 5ε requires T = O(Lε−2) iterations and a total complexity of O
( (T/τ) · (σ2/ε2)
) = O(Lσ4/3ε−10/3), which is
slightly worse than that of the SARAH/Spider estimator in the expectation case. The finite-sum case is similar and is hence omitted. It will have the same sample complexity as
the SAGA estimator as we discuss next.
13
3.4 SAGA estimator for finite-sum optimization
In this section, we discussion how the SAGA estimator proposed in [8] can be incorporated into our NPAG framework. Note that the SAGA estimator is only meaningful for the finite-sum case (sampling with replacement). The SAGA estimator can be written as
v0 = u0 = 1
vt = ut−1 + 1
) .
After each iteration, we update αti by setting αti ← xt if i ∈ Bt and αti ← αt−1 i if i /∈ Bt. Denote
α[n, t] = {αri } r=0,...,t i=1,...,n. Suppose |Bt| ≡ b for all t. Then the MSE of the SAGA estimator is bounded
by
] ≤ L2
nb
2.
Note that the αt−1 i ’s are identically distributed, thus we further have
E [ vt − φ(xt)2
1 2 ] . (30)
When we restrict the step lengths by xr − xr−1 ≤ δ, then we have xt− xr ≤ |t− r|δ for all t, r. Note that all the batches Br are independently and uniformly sampled and αt−1
1 = xt−1 if 1 ∈ Bt−1. That is, αt−1
1 = xt−1 with probability b/n. Similarly, for r < t − 1, we have αt−1 1 = xr if 1 ∈ Br
and 1 /∈ ∪t−1 j=r+1Bj . Hence αt−1
1 = xr with probability (b/n)(1− b/n)t−r−1. Therefore, (30) implies
E [ vt − φ(xt)2
b
n
] (31)
≤ L2
( 1− b
≤ 3n2L2δ2
b3 .
When we choose δ = ε/2L and b = n2/3, we have E [ vt−φ(xt)2
] ≤ ε2 for any t ≥ 0. Consequently,
Theorem 2.3 and Remark 2.4 implies a total sample complexity of O(n2/3ε−2) to output a point x s.t. E
[ G(x)
] ≤ 5ε. This sample complexity is worse than that of the SARAH/Spider estimator.
In this section, we analyzed the normalized-step SARAH/Spider, SVRG and SAGA estimators. Specifically, we described SARAH/Spider and SVRG from the perspective of efficient variance reduction over a period of consecutive iterations in a stochastic algorithm. This perspective allows us to derive the optimal period length and mini-batch sizes of the estimator without binding with
14
any particular algorithm. Using these estimators in the NPAG framework led to the Prox-Spider method, and we obtained its sample complexity by simply combining the iteration complexity of NPAG (Theorem 2.3) and the optimal period and mini-batch sizes of SARAH/Spider. The complexities in these cases on one-level stochastic optimization are not new (see, e.g., [13, 37, 26, 27, 28]). But the NPAG framework and the separate perspective on variance reduction are more powerful and also work for multi-level stochastic optimization, which we show next.
4 Multi-level nested SPIDER
In this section, we present results for the general composite optimization problems (1) and (2) with m ≥ 2. To provide convergence and complexity result, we only need to construct a gradient estimator vt that is sufficiently close to F ′(xt). For the ease of presentation, let us denote
Fk = fk fk−1 · · · f1, k = 1, . . . ,m.
In particular, we have F1 = f1, and Fm = F . Then, by chain rule,
F ′(x) = [ f ′1(x)
]T · · · [f ′m−1(Fm−2(x)) ]T f ′m(Fm−1(x)). (32)
To approximate F ′(xt), we construct m− 1 estimators for the mappings,
yt1 ≈ F1(xt), . . . , ytm−1 ≈ Fm−1(xt), (33)
and m estimators for the Jacobians,
zt1 ≈ f ′1(xt), zt2 ≈ f ′2(yt1), . . . , ztm ≈ f ′m(ytm−1). (34)
(Notice that we do not need to estimate F (xt) = Fm(xt).) Then we can construct
vt = (zt1)T (zt2)T · · · (ztm−1)T ztm. (35)
We use the SARAH/Spider estimator (22) for all m−1 mappings in (33) and m Jacobians in (34), then use the above vt as the approximate gradient in the NPAG method. This leads to the multi- level Nested-Spider method shown in Algorithm 3. To simplify presentation, we adopted the notation yt0 = xt for all t ≥ 0.
Each iteration t in Algorithm 3 is assigned a stage or epoch indicator χ(t) ∈ {1, . . . ,K}, indicating the current variance-reduction stage. Each stage k has length τk which is determined by the variable precision εk. Note that all iterations within the same stage k share the same εχ(t) = εk, which in turn is used to determine the step size γt = min
{ η εχ(t)/xt+1 − xt, 1
} .
For convergence analysis, we make the following assumption.
Assumption 4.1. For the functions and mappings appearing in (1) and (2), we assume:
(a) For each i = 1, . . . ,m and each realization of ξi, mapping fi,ξi : Rdi−1→Rdi is `i-Lipschitz and its Jacobian f ′i,ξi : R
di−1→Rdi×di−1 is Li-Lipschitz.
(b) For each i = 1, . . . ,m − 1, there exists δi such that Eξi [ fi,ξi(y) − fi(y)2
] ≤ δ2
15
Algorithm 3: Multi-level Nested-Spider Method
input: initial point x0 ∈ Rd, proximal step size η > 0, a sequence of precisions {εk}Kk=1, stage lengths {τk}Kk=1, and batch sizes {Bk
i , S k i , b
k i , s
initialize: y0 0 = x0, k = 1, s = 0
for t = 0, 1, ..., T − 1 do χ(t) = k. /* iteration t is in the k-th stage */
if s == 0 then for i = 1, . . . ,m do
Randomly sample batch Sti of ξi with |Sti |=Ski , batch Bti with |Bti |=Bk i
yti = fi,Sti (y t i−1) /* skip if i == m */
zti = f ′ i,Bti
else for i = 1, . . . ,m do
Randomly sample batch Sti of ξi with |Sti |=ski , batch Bti with |Bti |=bki yti = yt−1
i + fi,Sti (y t i−1)− f ′
i,Sti (yt−1 i−1) /* skip if i == m */
zti = zt−1 i + f ′
i,Bti (yti−1)− f ′
end
end vt = (zt1)T (zt2)T · · · (ztm−1)T ztm xt+1 = proxηΨ(xt − ηvt) xt+1 = xt + γt(x
t+1 − xt) with γt = min {
η εχ(t) xt+1−xt , 1
} yt+1
0 = xt+1, s← s+ 1 if s == τk then s← 0, k ← k + 1 /* reset s = 0 for next stage */
end
output: x from {x0, ..., x T−1} with xt sampled with probability
εχ(t)∑K k=1 τkεk
.
(c) For each i = 1, . . . ,m, there exists σi such that Eξi [ f ′i,ξi(y) − f ′i(y)2
] ≤ σ2
i for any y ∈ dom fi.
As a result, the function F = fm · · · f1 and its gradient F ′ are both Lipschitz continuous, and their Lipschitz constants are given in the following lemma.
Lemma 4.2. Suppose Assumption 4.1.(a) holds. Then the composite function F and its gradient F ′
are Lipschitz continuous, with respective Lipschitz constants
`F = m∏ i=1
Li
where we use the convention ∏0 r=1 `r = 1.
The proof of this lemma is presented in Appendix A. For the special case m = 2,
`F = `1`2, LF = `2L1 + `21L2,
16
which have appeared in previous work on two-level problems [15, 41]. In addition, we define two constants:
σ2 F =
δ2 i . (37)
They will be convenient in characterizing the sample complexity of Algorithm 3. As inputs to Algorithm 3, we need to choose the variable precisions εk, epoch length τk and
batch sizes {Bk i , S
k i , b
k i , s
k i } k=1,...,K i=1,...,m to ensure E[vt − F ′(xt)|2] ≤ ε2χ(t). This is done through the
following lemma.
Lemma 4.3. Suppose Assumption 4.1 holds. In Algorithm 3, if we set η = 1 2LF
, τk = `F 2mεk
then E [ vt − F ′(xt)2
] ≤ ε2k for all k and all t ∈ {r : χ(r) = k}.
Proof. First, we bound vt − F ′(xt)2 in terms of the approximation errors of the individual esti- mators in (33) and (34). Denoting Fi(x
t) by F ti for simplicity, we have from (32) and (35),
E [ vt− F ′(xt)2
] = E
[[zt1]T [zt2]T · · · [ztm−1]Tztm − [f ′1(xt)]T [f ′2(F t1)]T · · · f ′m(F tm−1) 2 ] .
To bound this value, we add and subtract intermediate terms inside the norm such that each adjacent pair of products differ at most in one factor. Then we use the inequality
∑r i=1 ai−bi
2 ≤ r ∑r
i=1 ai − bi2 to split the bound. More concretely, we split vt − F ′(xt) into k = 2m− 1 pairs:
E [ vt − F ′(xt)2
] (40)
≤ (2m− 1)
( E [[zt1]T [zt2]T · · · [ztm−1]T ztm − [f ′1(xt)]T [zt2]T · · · [ztm−1]T ztm
2 ]
+E [[f ′1(xt)]T [zt2]T [zt3]T· · · [ztm−1]T ztm − [f ′1(xt)]T [f ′2(yt1)]T [zt3]T· · · [ztm−1]T ztm
2 ]
+E [[f ′1(xt)]T [f ′2(yt1)]T [zt3]T· · ·[ztm−1]Tztm−[f ′1(xt)]T [f ′2(F t1)]T [zt3]T· · ·[ztm−1]T ztm
2 ]
+ · · ·
+E [[f ′1(xt)]T · · · [f ′m−1(F tm−2)]T ztm − [f ′1(xt)]T · · · [f ′m−1(F tm−2)]T f ′m(ytm−1)
2 ]
+E [[f ′1(xt)]T· · ·[f ′m−1(F tm−2)]Tf ′m(ytm−1)−[f ′1(xt)]T· · ·[f ′m−1(F tm−2)]Tf ′m(F tm−1)
2 ])
≤ 2m
)( m∏
r=i+1 ztr2
) ( zti − f ′i(yti−1)2 + 1i≥2f ′i(yti−1)− f ′i(F ti−1)2
)] ≤ 2m
)( zti − f ′i(yti−1)2 + 1i≥2L
2 i yti−1 − F ti−12
)] ,
17
where we used the notation F t0 = yt0 = xt and 1i≥2 = 1 if i ≥ 2 and 0 otherwise. In the last inequality, we used f ′r(·) ≤ `r, which is a consequence of fr being `r-Lipschitz.
Next, we first derive deterministic bounds on zti2 so that they can be moved outside of the expectation in (40). Then we will bound E
[ zti − f ′i(yti−1)2
] and E
] separately.
Without loss of generality, we focus on the first epoch with t = 0, 1, . . . , τ1 − 1, but the results hold for all the epochs. For the ease of notation, we denote ε1, τ1 and B1
i , S 1 i , b
1 i , s
1 i by ε, τ and
Bi, Si, bi, si in this proof. Step 1: Bounding the temporal differences yti−y
t−1 i . For i = 0, we have yt0 = xt and from (9),
xt − xt−1 ≤ ηε = ε/2LF ≤ ε/LF . When i = 1, for any t ≥ 1,
yt1 − yt−1 1 ≤
LF .
In general, for all 1 ≤ i ≤ m− 1, we have
yti−yt−1 i ≤
∏i r=1 `r LF
ε. (41)
∑t r=1 zri − z
r−1 i for all 1 ≤ i ≤ m. Moreover,
zri −zr−1 i ≤
r−1 i−1 )
( i−1∏ r=1
`r
) ε
LF ,
where the last inequality is due to (41). Consequently, for all 0 ≤ t ≤ τ − 1,
zti ≤ z0 i + t ·
( i−1∏ r=1
] and E
E [ zti − f ′i(yti−1)2
] ≤ E
i−1 2 ]
E [ yti − fi(yti−1)2
] ≤ δ2
i
Si +
( i∏
. (44)
Step 4: Bounding E[yti−Fi(xt)2]. We show that the following inequality holds for i = 1, . . . ,m:
E [ yti − Fi(xt)2
] . (45)
18
(Here we use the convention ∏i j=i+1 `j = 1.) We prove this result by induction. With the notations
F1 = f1 and yt0 = xt, the base case of i = 1 is the same as (44), thus already proven. Suppose (45) holds for i = k, for i = k + 1, we have
E [ ytk+1 − Fk+1(xt)2
] ≤ (1 + k)E
] + (1 + k−1)E
] + (1 + k−1)`2k+1E
[ ytk − Fk(xt)2
[ k∑ r=1
[( k∏ j=1
`2j
which completes the induction. Therefore (45) holds.
Now we go back to inequality (40) and first apply the bound on zti in (42). Using
τ = `F
we have for r = 1, . . . ,m and all t ≥ 0,
ztr ≤ `r + τLr r−1∏ j=1
`j ε
LF = `r
2mLF
2m
) `r,
where the last inequality is due to the definition of LF in (36). Consequently,( i−1∏ r=1
`2r
)( m∏
) ≤
∏ r 6=i `2r ,
where we applied the inequality (1 + a−1)a ≤ e ≤ 3 for a > 0. Applying the above inequality to (40), we obtain
E [ vt − F ′(xt)2
] ≤ T1 + T2, (46)
] ,
( ∏ r 6=i+1
Next we bound the two terms T1 and T2 separately.
19
For T1, we use (43) and set Bi = B and bi = b for all i = 1, . . . ,m, which yields
T1 ≤ 6m
)) .
Now recall the definition of σ2 F in (37) and definition of LF in (36), which implies
L2 F =
m+ 1 ε2,
where the equality is due to the choice of parameters in (38). For T2, recalling our notation F ti = Fi(x
t), we use (45) to obtain
T2 ≤ 6m m−1∑ i=1
( ∏ r 6=i+1
δ2 r
`2j ·L2 i+1 ·
i∏ j=r+1
i+1 · m∏
δ2 r
δ2 r
i+1 · i∏
j=1 `4j
( δ2 rL
2 F
) ,
where in the second inequality we used i ≤ m and in the last inequality we used (47). Now setting Sr = S and sr = s for r = 1, . . . ,m− 1, and using the definition of δ2
F in (37), we have
T2 ≤ 6m2 δ 2 F
S + 6m3 τε
m+ 1 ε2,
where the last equality is due to the choice of parameters in (39). Finally, with the above bounds on T1 and T2, we get from (46) that for any t ≥ 0,
E[vt − F ′(xt)2] ≤ T1 + T2 ≤ 1
m+ 1 ε2 +
20
Theorem 4.4. Consider problem (1) with m ≥ 2, and suppose Assumptions 2.1 and 4.1 hold. In Algorithm 3, if we set εk = mθLF
k·`F with θ > 0 and set the other parameters as in Lemma 4.3, then
the output x satisfies E[G(x)] ≤ ε after K = O ( mLF ε·`F
) epochs, and the total sample complexity
is O ( m4LF (σ2
) .
Proof. From Theorem 2.3 and Lemma 4.3, the output of Algorithm 3 satisfies
E [ G(x)
t=0 εχ(t)
+ 4 ∑T−1
. (48)
≤ θLF ln(K).
E [G(x)] ≤ 8mLF
) such that E [G(x)] ≤ ε. With the Bk
i , b k i , S
k i , s
K∑ k=1
m∑ i=1
Next, we present our results of the finite-sum optimization problem.
Theorem 4.5. Consider problem (2) with m ≥ 2, and suppose Assumptions 2.1 and 4.1.(a) hold. In addition, let Nmax = max{N1, . . . , Nm} and assume the target precision ε satisfies
√ Nmax ≤ `F
2mε .
In Algorithm 3, we set η = 1/2LF . For k = 1, ...,K and i = 1, ...,m, we set εk = (
θLF k· √ Nmax
bki ≡ min {
} .
Moreover, for every t at the beginning of each epoch, let Bti = Sti = {1, . . . , Ni} be the full batches for i = 1, . . . ,m. Then the output x satisfies E[G(x)] ≤ ε after K = O
( LF√
i=1Ni +m4LF √ Nmaxε
−2 ) .
21
Proof. Through a similar line of proof of Lemma 4.3, E [ vt − F ′(xt)2
] ≤ ε2χ(t) still holds. Conse-
T−1∑ t=0
εχ(t) = K∑ k=1
τkεk = K∑ k=1
k · √ Nmax
≤ 2θLF ln(K).
Substitute the above inequalities into (48), we get E [G(x)] ≤ O ( ( LF√
Nmax·K ) 1 2
LF√ Nmax·ε2
the total sample complexity is
K∑ k=1
m∑ i=1
( Ni+(τk−1)bki
If the condition that √ Nmax ≤ `F
2mε does not hold, then we can use the setting in Theorem 4.4 and obtain the same order or slightly better sample complexity.
Remark 4.6. In Theorem 4.4, if we set εk ≡ ε, then the Algorithm output x s.t. E[G(x)] ≤ ε after K = O(mLFε·`F ) epochs, and the total sample complexity is O(m4LF (σ2
F + δ2 F + `2F )ε−3). Similarly,
in Theorem 4.5, if we set εk ≡ ε, then the Algorithm output x s.t. E[G(x)] ≤ ε after K = O( LF√
Nmax·ε2 ) epochs, and the total sample complexity is O
(∑m i=1Ni +m4LF
−2 ) . Although
using constant εk ≡ ε improves the sample complexities of Theorem 4.4 and 4.5 by a logarithmic factor, the O(ε) step length may be too consevative in the early stages. In practice, the adaptive setting of εk in Theorem 4.4 and 4.5 can be much more efficient.
5 Numerical experiments
In this section, we present numerical experiments to demonstrate the effectiveness of the proposed algorithms and compare them with related work. We first apply NPAG with different variance- reduced estimators on a noncovex sparse classification problem. Since this is a one-level finite-sum problem, we also compare it with ProxSARAH [26]. Then we consider a sparse portfolio selection problem, which has a two-level composition structure. We compare the Nested-SPIDER method with several other methods for solving two-level problems.
5.1 Sparse binary classification
min w
`(aTi x, bi) + βx1, (49)
where {(ai, bi)}Ni=1 are N labeled examples with feature vectors ai ∈ Rd and binary labels yi ∈ {−1, ,+1}. Specifically, we consider two different loss functions: the logistic difference loss `ld(t, b) =
22
# of samples 10 6
# of samples 10 6
# of samples 10 6
# of samples 10 6
ProxSARAH-vstp
ProxSARAH-cstp
ProxSPIDER
ProxSVRG
ProxSAGA
Figure 1: Experiments on sparse binary classification on mnist dataset, β = 1/12691.
log ( 1 + e−bt
( 1− 1
)2 . For
the `1-norm penalty, we set β = 1/N . This is a standard one-level finite-sum problem, corresponding to the case of m = 1 in (2).
We compare the ProxSPIDER method (Algorithm 2) with the ProxSARAH algorithm [26]. Prox- SARAH can be implemented with two schemes: for the variable stepsize scheme (Theorem 5, [26]) we use the name ProxSARAH-vstp; for the constant stepsize scheme (Theorem 6, [26]) we call it ProxSARAH-cstp. We also include the SVRG and SAGA estimator with normalized stepsize rule as described in Section 3.3 and 3.4 respectivley, which we call ProxSVRG and ProxSAGA.
The parameter setting (batch size, epoch length, etc.) of ProxSARAH-vstp and ProxSARAH- cstp are made to be consistent with the ProxSARAH-A-v1 and ProxSARAH-v1 in [26] where each mini-batch consists of only 1 sample, this is the parameter setting that yields the best empirical performance in [26]. For ProxSPIDER, we set both the batchsize and epoch length to be
⌈√ N ⌉ .
For ProxSVRG and ProxSAGA, we set the batchsize to be ⌈ N
2 3
⌈ N
⌉ .
Finally, for ProxSPIDER, ProxSVRG and ProxSAGA, we use a variable sequence εk as in NPAG (Algorithm 1 and set εk = 10/
√ k.
First, we compare these methods over the mnist1 data sets. To fit the binary classification
1http://yann.lecun.com/exdb/mnist/
23
ProxSARAH-vstp
ProxSARAH-cstp
ProxSPIDER
ProxSVRG
ProxSAGA
Figure 2: Experiments on sparse binary classification on rcv1 dataset, β = 1/20242.
problems, we extract the data of two arbitrary digits, which are 1 and 9 in our experiments. This subset of mnist consists N = 12691 data points. For the ProxSARAH-vstp and ProxSARAH- cstp, the parameter to be tuned is the estimator of the Lipschtiz constant L. The smaller L is, the larger the stepsize is. For the ProxSPIDER, ProxSVRG and ProxSAGA, the parameter to be tuned is the stepsize η, which corresponds to 1/L in this experiment. We choose L among {0.01, 0.1, 1, 10} and it turns out that L = 1 works best for ProxSARAH-vstp and ProxSARAH- cstp for both loss functions. We choose η among {0.1, 1, 10, 100} and it turns out that η = 1 works best for ProxSIDER, ProxSVRG and ProxSAGA. These choices are consistent with the relationship η = 1/L.
Figure 1 shows the norm of the gradient mapping and objective value gap of different methods. For all the compared methods, the curves are averaged over 20 runs of the algorithm. It can be seen that these different methods have similar performance, with ProxSVRG and ProxSAGA slightly worse and ProxSPIDER slightly better than other methods. In theory ProxSARAH and ProxSPIDER have the same sampling complexity, and the empirical results are consistent with the theory. Since the plots are very similar for the norm of gradient mapping and the objective value gap, we will only present plots of the gradient mapping norms for the rest of the experiments.
We also conduct our experiment on the rcv1.binary2 dataset, which is a medium sized dataset with N = 20242 data points and each data point has d = 47236 features. The parameters are chosen in a similar way to the experiments on mnist dataset. In particular, L = 0.01 works best for ProxSARAH-vstp and ProxSARAH-cstp for both loss functions and η = 100 works best for ProxSIDER, ProxSVRG and ProxSAGA (again we have η = 1/L). Figure 2 shows the results of comparison. In this particular case, ProxSAGA perform the best despite a worse sample complexity in theory.
5.2 Sparse portfolio selection problem
In this section, we present the numerical results for a risk-averse portfolio optimization problems, which is a common benchmark example used in many stochastic composite optimization methds (e.g., [16, 41, 42, 21]). Consider the scenario where d assets can be invested during the time periods
2https://www.csie.ntu.edu.tw/ cjlin/libsvmtools/datasets/binary.html
# of samples 10 5
# of samples 10 6
Figure 3: Numerical experiments on the sparse portfolio optimization problem.
{1, ..., N}. We denote the per unit payoff vector at time i by ri := [ri,1, ..., ri,d] T , with ri,j being
the per unit payoff of asset j at time period i. Let x ∈ Rd be the decision variable, where the j-th component xj represents the amount of investment to asset j throughout the N time period, for j = 1, ..., d.
In the portfolio optimization problems, our aim is to maximize the average payoff over N periods. However, a penalty is imposed on the variance of the payoff across the N periods, which accounts for the risk of the investment. In addition, we also apply an `1 penalty to impose sparsity on the portfolio x, which gives the following formulation:
maximize x∈Rd
E [ rξ, x
] − λVar(rξ, x)− βx1,
where the random variable ξ follows the uniform distribution over {1, ..., N}. Using the identity that Var(X) = E[X2]−E[X]2, the problem can be reformulated as
minimize x∈Rd
] − λ ·E
[ rξ, x
]2 + βx1. (50)
If we set f1,ξ(x) : Rd 7→ R2 = [rξ, x rξ, x2]T , f2,ξ2(y, z) : R2 7→ R ≡ −y − λy2 + λz, and Ψ(x) = βx1, then (50) becomes a special case of (1) with m = 2.
In the experiments, we compare our method with the C-SAGA algorithm [41], the CIVR and CIVR-adp algorithms [42], and the VRSC-PG algorithm [16]. We test these algorithms with the real world portfolio datasets from Keneth R. French Data Library3. The first dataset is named Industrial-38, which contains 38 industrial assets payoff over 2552 consecutive time periods; and the second dataset is named Industrial-49, and contains 49 industrial assets over 24400 consecutive time periods. For both datasets, we set λ = 0.2 and β = 0.01 in problem (50). All the curves are averaged over 20 runs of the methods and are plotted against the number of samples of the component function and gradient evaluations.
For these experiments, CIVR and Nested-SPIDER both take the batch size of dN1/2e; CIVR- adp takes the adaptive batch size of Sk = dmin{10k + 1, N1/2}e; VRSC-PG and C-SAGA both take the batch size of dN2/3e. For all these methods, their stepsizes η are chosen as the best from
3http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data library.html
# of samples 10 5
Figure 4: Experiments on Industrial-38 datasets using different step size parameter η.
{0.0001, 0.001, 0.01, 0.1, 1} by experiments. In addition, we set εk = C/ √ k for Nested-SPIDER
with C chosen from {0.1, 1, 10, 50} by experiments. For the Industrial-38 dataset, η = 0.01 works best for CIVR, CIVR-adp, C-SAGA and VRSC-PG, and η = 0.1 and C = 1 work best for Nested- SPIDER. For the Industrial-49 dataset, η = 0.0001 works best for all five tested algorithms and C = 50 works best for the Nested-SPIDER algorithm.
Figure 3 shows that for the Industrial-49 dataset, the CIVR-adp and CIVR perform the best. However, Nested-SPIDER quickly catches up with them after a slightly slower early stage, which is due to the relatively restrictive stepsizes caused by normalization (in early stages of the algorithm). On the contrary, Nested-SPIDER converges much faster than all other methods in the Industrial- 38 dataset because the normalization step enables us to exploit a significantly larger stepsizes (in later stages of the algorithm), under which all the other methods diverge. In Figure 4, we show the following behavior of the Nested-SPIDER on Industrial-38. On the left, we set η = 0.1 for all the algorithms and C = 1 for Nested-SPIDER. All methods except for Nested-SPIDER diverge. On the right, we set η = 0.01 for all the algorithms and C = 50 for Nested-SPIDER. In this case, Nested-SPIDER behaves almost identical to CIVR.
6 Conclusion
We have proposed a normalized proximal approximate gradient (NPAG) method for solving multi- level composite stochastic optimization problems. The approximate gradients at each iteration are obtained via nested variance reduction using the SARAH/Spider estimator. In order to find an approximate stationary point where the expected norm of its gradient mapping is less than ε, the total sample complexity of our method is O(ε−3) in the expectation case, and O(N+
√ Nε−2) in the
finite-sum case where N is the total number of functions across all composition levels. In addition, the dependence of the sample complexity on the number of composition levels is polynomial, rather than exponential as in previous work. Our results rely on a uniform Lipschitz condition on the composite mappings and functions, which is stronger than the mean-squared Lipschitz condition required for obtaining similar complexities for single-level stochastic optimization.
The NPAG method extends the Spider-SFO method [13] to the proximal setting and for solving multi-level composite stochastic optimization problems. In particular, requiring a uniform bound
26
on the MSEs of the approximate gradients allows separate analysis of the optimization algorithm and variance reduction. In addition, the uniform bound on the step lengths at each iteration is the key to enable our analysis of nested variance reduction. More flexible rules for choosing the step sizes or step lengths have been developed to obtain similar sample complexities for single- level stochastic or finite-sum optimization problems, especially those used in the Spider-Boost [37] and Prox-SARAH [26] methods. It is likely that such step size rules may also be developed for multi-level problems, although their complexity analysis may become much more involved.
A Proof of Lemma 4.2
Proof. Under Assumption 4.1(a), it is straightforward to show that $f_i = \mathbb{E}_{\xi_i}[f_{i,\xi_i}]$ is $\ell_i$-Lipschitz and its Jacobian $f_i' = \mathbb{E}_{\xi_i}[f_{i,\xi_i}']$ is $L_i$-Lipschitz. Recall the definitions $F_i = f_i \circ f_{i-1} \circ \cdots \circ f_1$ for $i = 1, \ldots, m$, with the convention $F_0(x) = x$. By the chain rule and a telescoping decomposition that changes one factor at a time from its value at $x$ to its value at $y$, we have
\begin{align*}
\|F'(x) - F'(y)\|
&= \bigl\| [f_1'(x)]^T \cdots [f_{m-1}'(F_{m-2}(x))]^T f_m'(F_{m-1}(x)) - [f_1'(y)]^T \cdots [f_{m-1}'(F_{m-2}(y))]^T f_m'(F_{m-1}(y)) \bigr\| \\
&\le \bigl\| [f_1'(x)]^T [f_2'(F_1(x))]^T \cdots f_m'(F_{m-1}(x)) - [f_1'(y)]^T [f_2'(F_1(x))]^T \cdots f_m'(F_{m-1}(x)) \bigr\| \\
&\quad + \cdots + \bigl\| [f_1'(y)]^T \cdots [f_{m-1}'(F_{m-2}(y))]^T f_m'(F_{m-1}(x)) - [f_1'(y)]^T \cdots [f_{m-1}'(F_{m-2}(y))]^T f_m'(F_{m-1}(y)) \bigr\| \\
&\le \sum_{i=1}^m \Bigl( \prod_{r \neq i} \ell_r \Bigr) L_i \, \|F_{i-1}(x) - F_{i-1}(y)\|.
\end{align*}
On the other hand, we have for any $x$ and $y$ that
\[
\|F_i(x) - F_i(y)\| = \|f_i(F_{i-1}(x)) - f_i(F_{i-1}(y))\| \le \ell_i \|F_{i-1}(x) - F_{i-1}(y)\| \le \prod_{r=1}^{i} \ell_r \, \|x - y\|.
\]
Therefore each $F_i$ is Lipschitz with constant $\prod_{r=1}^{i} \ell_r$; in particular, $F = F_m$ is $\ell_F$-Lipschitz with $\ell_F = \prod_{i=1}^m \ell_i$. Substituting these Lipschitz constants into the bound on $\|F'(x) - F'(y)\|$ yields
\[
\|F'(x) - F'(y)\| \le \sum_{i=1}^m \Bigl( L_i \prod_{r \neq i} \ell_r \prod_{r=1}^{i-1} \ell_r \Bigr) \|x - y\|,
\]
i.e., $F'$ is $L_F$-Lipschitz with $L_F = \sum_{i=1}^m L_i \prod_{r \neq i} \ell_r \prod_{r=1}^{i-1} \ell_r$.
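For instance, in the two-level case $m = 2$, i.e., $F = f_2 \circ f_1$, the bound specializes to
\[
\|F'(x) - F'(y)\| \le \ell_2 L_1 \|x - y\| + \ell_1 L_2 \|F_1(x) - F_1(y)\| \le \bigl( L_1 \ell_2 + L_2 \ell_1^2 \bigr) \|x - y\|,
\]
so that $L_F = L_1 \ell_2 + L_2 \ell_1^2$ when $m = 2$.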
References
[1] Zeyuan Allen-Zhu. Natasha: Faster non-convex stochastic optimization via strongly non-convex parameter. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research, pages 89–97, Sydney, Australia, 2017.
[2] Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. In Advances in Neural Information Processing Systems 31, pages 2675–2686. Curran Associates, Inc., 2018.
[3] Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In Proceedings of the 33rd International Conference on Machine Learning, pages 699–707, 2016.
[4] Zeyuan Allen-Zhu and Yang Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In Proceedings of the 33rd International Conference on Machine Learning (ICML), pages 1080–1089, 2016.
[5] Amir Beck. First-Order Methods in Optimization. MOS-SIAM Series on Optimization. SIAM, 2017.
[6] Jose Blanchet, Donald Goldfarb, Garud Iyengar, Fengpei Li, and Chaoxu Zhou. Unbiased simulation for optimizing stochastic function compositions. Preprint, arXiv:1711.07564, 2017.
[7] Christoph Dann, Gerhard Neumann, and Jan Peters. Policy evaluation with temporal differences: a survey and comparison. Journal of Machine Learning Research, 15(1):809–883, 2014.
[8] Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems 27, pages 1646–1654, 2014.
[9] Ofer Dekel, Ran Gilad-Bachrach, Ohad Shamir, and Lin Xiao. Optimal distributed online prediction using mini-batches. Journal of Machine Learning Research, 13:165–202, 2012.
[10] Darinka Dentcheva, Spiridon Penev, and Andrzej Ruszczynski. Statistical estimation of composite risk functionals and risk optimization problems. Annals of the Institute of Statistical Mathematics, 69(4):737–760, 2017.
[11] Dmitriy Drusvyatskiy and Adrian S. Lewis. Error bounds, quadratic growth, and linear convergence of proximal methods. Mathematics of Operations Research, 43(3):919–948, 2018.
[12] Y. Ermoliev. Methods of Stochastic Programming. Monographs in Optimization and OR. Nauka, Moscow, 1976.
[13] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems 31, pages 689–699. Curran Associates, Inc., 2018.
[14] Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
[15] Saeed Ghadimi, Andrzej Ruszczynski, and Mengdi Wang. A single time-scale stochastic approximation method for nested stochastic optimization. Preprint, arXiv:1812.01094, 2018.
[16] Zhouyuan Huo, Bin Gu, Ji Liu, and Heng Huang. Accelerated method for stochastic composition optimization with nonsmooth regularization. In Proceedings of the 32nd AAAI Conference on Artificial Intelligence, pages 3287–3294, 2018.
[17] A. N. Iusem, A. Jofre, R. I. Oliveira, and P. Thompson. Extragradient method with variance reduction for stochastic variational inequalities. SIAM Journal on Optimization, 27(2):686–724, 2017.
[18] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 26, pages 315–323, 2013.
[19] J. Koshal, A. Nedic, and U. B. Shanbhag. Regularized iterative stochastic approximation methods for stochastic variational inequality problems. IEEE Transactions on Automatic Control, 58(3):594–609, 2013.
[20] Xiangru Lian, Mengdi Wang, and Ji Liu. Finite-sum composition optimization via variance reduced gradient descent. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1159–1167, 2017.
[21] Tianyi Lin, Chenyou Fan, Mengdi Wang, and Michael I Jordan. Improved oracle complexity for stochastic compositional variance reduced gradient. Preprint, arXiv:1806.00458, 2018.
[22] Yurii Nesterov. Gradient methods for minimizing composite functions. Mathematical Programming, 140(1):125–161, 2013.
[23] Yurii Nesterov. Lectures on Convex Optimization. Springer, 2nd edition, 2018.
[24] Lam M. Nguyen, Jie Liu, Katya Scheinberg, and Martin Takac. SARAH: A novel method for machine learning problems using stochastic recursive gradient. In Proceedings of the 34th International Conference on Machine Learning (ICML), volume 70 of Proceedings of Machine Learning Research, pages 2613–2621, 2017.
[25] Lam M. Nguyen, Marten van Dijk, Dzung T. Phan, Phuong Ha Nguyen, Tsui-Wei Weng, and Jayant R. Kalagnanam. Finite-sum smooth optimization with SARAH. Preprint, arXiv:1901.07648, 2019.
[26] Nhan H. Pham, Lam M. Nguyen, Dzung T. Phan, and Quoc Tran-Dinh. ProxSARAH: An efficient algorithmic framework for stochastic composite nonconvex optimization. Preprint, arXiv:1902.05679, 2019.
[27] Sashank J. Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 314–323, New York, New York, USA, 2016.
[28] Sashank J Reddi, Suvrit Sra, Barnabas Poczos, and Alex Smola. Fast incremental method for smooth nonconvex optimization. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 1971–1977. IEEE, 2016.
[29] Sashank J Reddi, Suvrit Sra, Barnabas Poczos, and Alexander J Smola. Proximal stochastic methods for nonsmooth nonconvex finite-sum optimization. In Advances in Neural Information Processing Systems 29, pages 1145–1153, 2016.
[30] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, 1970.
[31] R. Tyrrell Rockafellar. Coherent approaches to risk in optimization under uncertainty. INFORMS TutORials in Operations Research, 2007.
[32] Nicolas L. Roux, Mark Schmidt, and Francis R. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems 25, pages 2663–2671. Curran Associates, Inc., 2012.
[33] Andrzej Ruszczynski. Advances in risk-averse optimization. INFORMS TutORials in Operations Research, 2013.
[34] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
[35] Mengdi Wang, Ethan X Fang, and Han Liu. Stochastic compositional gradient descent: algorithms for minimizing compositions of expected-value functions. Mathematical Programming, 161(1-2):419–449, 2017.
[36] Mengdi Wang, Ji Liu, and Ethan Fang. Accelerating stochastic composition optimization. Journal of Machine Learning Research, 18(105):1–23, 2017.
[37] Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. SpiderBoost: A class of faster variance-reduced algorithms for nonconvex optimization. Preprint, arXiv:1810.10690, 2018.
[38] Lin Xiao and Tong Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4):2057–2075, 2014.
[39] Shuoguang Yang, Mengdi Wang, and Ethan X. Fang. Multilevel stochastic gradient methods for nested composition optimization. SIAM Journal on Optimization, 29(1):616–659, 2019.
[40] Xiaoyong Yuan, Pan He, Qile Zhu, and Xiaolin Li. Adversarial examples: Attacks and defenses for deep learning. IEEE Transactions on Neural Networks and Learning Systems, 30(9):2805–2824, 2019.
[41] Junyu Zhang and Lin Xiao. A composite randomized incremental gradient method. In Proceedings of the 36th International Conference on Machine Learning (ICML), number 97 in Proceedings of Machine Learning Research (PMLR), Long Beach, California, 2019.
[42] Junyu Zhang and Lin Xiao. A stochastic composite gradient method with incremental variance reduction. In Advances in Neural Information Processing Systems 32, pages 9078–9088. Curran Associates, Inc., 2019.
[43] Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems 31, pages 3921–3932. Curran Associates, Inc., 2018.