Optimal step length for the Newton method: case of self-concordant functions

Published in Mathematical Methods of Operations Research, 94 (2021), pp. 253-279. doi:10.1007/s00186-021-00755-9. hal-02571626v2
Roland Hildebrand ∗
Abstract
In path-following methods for conic programming, knowledge of the performance of the (damped) Newton method at finite distances from the minimizer of a self-concordant function is crucial for the tuning of the parameters of the method. The available bounds on the progress and the step length to be used are based on conservative relations between the Hessians at different points and are hence sub-optimal. In this contribution we use methods of optimal control theory to compute the optimal step length of the Newton method on the class of self-concordant functions, as a function of the initial Newton decrement, and the resulting worst-case decrease of the decrement. The exact bounds are expressed in terms of solutions of ordinary differential equations which cannot be integrated explicitly. We provide approximate numerical and analytic expressions which are accurate enough for use in optimization methods. As an application, the neighbourhood of the central path in which the iterates of path-following methods are required to stay can be enlarged, enabling faster progress along the central path during each iteration and hence fewer iterations to achieve a given accuracy.
1 Introduction
The Newton method is a century-old second-order method to find a zero of a vector field or a stationary point of a sufficiently smooth function. It is well known that it is guaranteed to converge only in a neighbourhood of a solution, even on such well-behaved classes of objective functions as the self-concordant functions. In this neighbourhood the method converges quadratically. Farther away the step length has to be decreased to ensure convergence, leading to the damped Newton method. Several rules have been proposed to choose the step length on different classes of cost functions or vector fields. Among these are line searches or path searches until some pre-defined condition is met [2, 9], strategies imported from gradient descent methods [6, p. 37], and explicit formulas [7, p. 24], [6, p. 353]. These rules provide sufficient conditions for convergence, but they are optimal only in the order of convergence [6].
A (damped) Newton step is given by the iteration
$$x_{k+1} = x_k - \gamma_k (F''(x_k))^{-1} F'(x_k),$$
where γk = 1 for a full step and γk ∈ (0, 1) if the step is damped. Here F is the function to be minimized, assumed to be sufficiently smooth with a positive definite Hessian. A convenient measure of the distance from the iterate xk to the minimizer is the Newton decrement [7]
$$\rho_k = \sqrt{F'(x_k)^T (F''(x_k))^{-1} F'(x_k)} = \|F'(x_k)\|^*_{x_k},$$
i.e., the norm of the gradient as measured in the local metric given by the inverse Hessian of F. In this paper we investigate the following problems. Given ρk and γk, what is the largest possible (worst-case) value ρk+1 of the decrement after the iteration? What is the optimal step length γ*, i.e., the value of γk minimizing the upper bound on ρk+1 for given ρk?
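As a concrete illustration (our own, not from the paper), the following Python sketch implements the damped Newton iteration and the Newton decrement for the self-concordant barrier F(x) = −log x − log(1 − x) of the interval (0, 1); the function names are ours.

import numpy as np

def newton_decrement(grad, hess):
    # rho = sqrt(g^T H^{-1} g), the gradient norm in the local metric
    v = np.linalg.solve(hess, grad)
    return np.sqrt(grad @ v)

def damped_newton_step(x, grad, hess, gamma):
    # x_{k+1} = x_k - gamma * H^{-1} g
    return x - gamma * np.linalg.solve(hess, grad)

# barrier of the interval (0, 1) and its derivatives
F_grad = lambda x: np.array([-1/x[0] + 1/(1 - x[0])])
F_hess = lambda x: np.array([[1/x[0]**2 + 1/(1 - x[0])**2]])

x = np.array([0.1])
rho = newton_decrement(F_grad(x), F_hess(x))            # ~0.883
x_next = damped_newton_step(x, F_grad(x), F_hess(x), gamma=1.0)
print(rho, x_next)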
With the application to barrier methods for conic programming in mind we consider the class of self- concordant functions. This class has been introduced in [7], because due to its invariance under affine
∗Univ. Grenoble Alpes, CNRS, Grenoble INP, LJK, 38000 Grenoble, France ([email protected]).
coordinate transformations it is especially well suited for analysis in conjunction with the also affinely invariant Newton method. Self-concordant functions are locally strongly convex C3 functions F which satisfy the inequality
$$|F'''(x)[h, h, h]| \le 2\,(F''(x)[h, h])^{3/2}$$
for all x in the domain of definition and all h in the tangent space at x. We shall now motivate the problem setting defined above by giving a short overview of the layout of path-following methods. For more details see [7]. We consider the basic setup of a short-step primal path-following method.
Consider the convex optimization problem $\min_{x \in X} \langle c, x\rangle$ with linear objective function, where X ⊂ R^n is a closed convex set. Suppose on the interior D = X° of the set a self-concordant function F is defined which tends to infinity at the boundary of the set, F|∂X = +∞. Such a situation arises within the framework of conic programming, in particular, linear, second-order conic, and semi-definite programming. Here X is the feasible set of the primal problem, and F is the restriction of the self-concordant barrier function to this set.
A primal path-following method generates a sequence xk ∈ D of iterates which lie in a neighbourhood of the central path x*(τ) of the problem and at the same time move along (follow) this path. Here for given τ > 0 the point x*(τ) is the unique minimizer of the composite function Fτ(x) = F(x) + τ⟨c, x⟩ on D. For τ → +∞ the central path and hence also the iterates xk converge to the optimal solution x* of the original problem.
In order to generate the next iterate xk+1 from the current iterate xk, a (damped) Newton step is made towards a target point x∗(τk) on the central path by attempting to minimize the composite function Fτk . The performance of the method will depend on how the parameter τk of the target point and the step length γk are chosen. The overall convergence rate of the method depends on how fast the sequence τk tends to +∞. However, the larger τk at any given iterate is chosen, the farther the current point will lie from the target point, and the smaller the worst-case progress of the Newton iterate will be. The right choice of τk should optimally balance these two opposite trends.
Recall that the distance from the current iterate xk to the target point x*(τ) is measured by the Newton decrement $\rho_k(\tau) = \sqrt{d_k^T (F''(x_k))^{-1} d_k}$, where $d_k = F'(x_k) + \tau c$ is the difference between the gradients of F at the current and the target point. Note that ρk²(τ) is a nonnegative quadratic function of τ, with the leading coefficient given by $c^T (F''(x_k))^{-1} c$. The derivative dρk/dτ is therefore bounded by $\sqrt{c^T (F''(x_k))^{-1} c}$.
The target point x*(τk) is chosen such that τk is maximal under the constraint ρk(τk) ≤ λ, where λ is a parameter of the method. The subsequent Newton step from xk to xk+1 moves the iterate closer to the target point. More precisely, we have ρk+1(τk) ≤ λ̄, where λ̄ = λ̄(λ) is some monotone function of λ which denotes an upper bound on the decrement at the next iterate. Since the preceding iteration from xk−1 to xk also satisfied the constraint ρk−1(τk−1) ≤ λ and consequently ρk(τk−1) ≤ λ̄, we may find a lower bound on the progress made by the iteration. Namely, from ρk(τk) = λ and the bound on the derivative of ρk we get
$$\tau_k - \tau_{k-1} \ge \frac{\lambda - \bar\lambda}{\sqrt{c^T (F''(x_k))^{-1} c}}.$$
It now becomes clear that we have to choose the parameter λ in order to maximize the difference λ − λ̄. The smaller λ̄ is as a function of λ, the larger are both the optimal value λ* of λ and the difference λ* − λ̄*, where λ̄* = λ̄(λ*), and the faster the path-following method will progress. Recall that λ denotes the Newton decrement before the Newton step, while λ̄ is an upper bound on the decrement after the Newton step. It is therefore of interest to use bounds λ̄(λ) on the decrement which are as tight as possible. This leads us exactly to the problem setting defined above.
The currently used upper bound λ̄ is based on the following estimate of the Hessian of a self-concordant function F at a point x near the current iterate xk [7, Theorem 2.1.1] (see also [6, Theorem 5.1.7]):
$$(1 - \|x - x_k\|_{x_k})^2 F''(x_k) \preceq F''(x) \preceq (1 - \|x - x_k\|_{x_k})^{-2} F''(x_k).$$
Here ‖x − xk‖xk ∈ (0, 1) is the distance between x and xk as measured in the local metric at xk. Based on this estimate and assuming a full Newton step one obtains [6, Theorem 5.2.2.1]
$$\bar\lambda = \left(\frac{\lambda}{1-\lambda}\right)^2,$$
Figure 1: Upper bounds on the Newton decrement ρk+1 (left: dashed — full Newton step, solid — optimal damped Newton step) and optimal damping coefficient γk (right) as a function of the current Newton decrement ρk.
leading to λ* ≈ 0.2291 and λ* − λ̄* ≈ 0.1408.
If instead one applies a damping coefficient γk = (1 + ρk)/(1 + ρk + ρk²), then one obtains the tighter bound of [6, Theorem 5.2.2.3] on the decrement after the step. Then λ* increases to ≈ 0.2910, the difference λ* − λ̄* to ≈ 0.1638, and the step length is given by (1 + λ*)/(1 + λ* + (λ*)²) ≈ 0.9384.
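The classical values quoted above are easy to reproduce numerically; the following sketch (our own illustration) recovers λ* ≈ 0.2291 by maximizing λ − λ̄(λ) for the full-step bound λ̄ = (λ/(1−λ))²:

import numpy as np

lam = np.linspace(0.0, 0.4, 400001)
gain = lam - (lam/(1.0 - lam))**2      # progress guarantee lambda - bar_lambda
i = np.argmax(gain)
print(lam[i], gain[i])                 # ~0.2291 and ~0.1408, as quoted above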
In this contribution we perform an exact worst-case analysis of the performance of the Newton iterate by reformulating it as an optimal control problem. The values of λ* for the full and the optimal damped Newton step turn out to be given by ≈ 0.3943 and ≈ 0.4429, respectively. The respective values of λ* − λ̄* are given by ≈ 0.2184 and ≈ 0.2300, which represents an improvement by more than 40%. The tight bound on ρk+1 and the optimal damping coefficient γk as functions of ρk are depicted in Fig. 1.
The idea of analyzing iterative algorithms by optimization techniques is not new. An exact worst-case analysis of gradient descent algorithms by semi-definite programming has been performed in [3, 10]. In these papers an arbitrary finite number of steps is analyzed, but the function classes are such that the resulting optimization problem is finite-dimensional.
In [4] a performance analysis of a single step of the Newton method on self-concordant functions is conducted. The class of self-concordant functions is, however, overbounded by a class of functions with Lipschitz-continuous Hessian, and the gradient at the next iterate is measured in the local norm of the previous iterate or in the (algorithmically inaccessible) local norm of the minimizer. This also yields a finite-dimensional optimization problem. In [5] it is shown that the step size γk = 1/(1 + ρk) proposed in [7] maximizes a lower bound on the progress if the latter is measured in terms of the decrease of the function value. Using the techniques of this paper, one can show that this step length is actually optimal for this performance criterion¹.
In our case the properties of the class of self-concordant functions do not allow one to obtain tight constraints on the gradient and Hessian values at different points without taking into consideration all intermediate points, and the problem becomes infinite-dimensional. Our techniques, borrowed from optimal control theory, nevertheless allow us to obtain optimal bounds.
The remainder of the paper is structured as follows. In Section 2 we analyze the Newton iteration for self-concordant functions in one dimension. In this case the problem can be solved analytically. In Section 3 we generalize to an arbitrary number of dimensions. It turns out that the general case is no more difficult than the 2-dimensional one due to the rotational symmetry of the problem, and is described by the solutions of a Hamiltonian dynamical system in a 4-dimensional space. In Section 4 we analyze the solutions of the Hamiltonian system. We establish that for small enough step sizes the 1-dimensional solutions are still optimal, while for larger step sizes (including the optimal one) the solution is more complicated and has no analytic expression anymore. In Section 5 we minimize the upper bound on the Newton decrement ρk+1 with respect to the step length in order to obtain the optimal damped Newton step. It turns out that in this case the system simplifies to ordinary differential equations (ODEs) on the plane. The optimal step length and the optimal bound on the decrement are then described in terms of solutions of these ODEs. In Section 6 we provide approximate numerical and analytic expressions for the upper bound on the decrement and for the optimal step size. In Section 7 we provide numerical values for the optimal parameters in path-following methods, both for the full and the optimally damped Newton step. We illustrate the obtained performance gain on a simple example. In Section 8 we provide a game-theoretic interpretation of our results and summarize our findings.

¹Unpublished joint work with Anastasia S. Ivanova (Moscow Institute of Physics and Technology).
2 One-dimensional case
In this section we analyze the Newton iteration in one dimension. Given a damping coefficient and the value of the Newton decrement at the current iterate, we would like to find the maximal possible value of the decrement at the next iterate. We reformulate this optimization problem as an optimal control problem. The solution of this problem is found by presenting an analytic expression for the Bellman function.
Let F : I → R be a self-concordant function on an interval, i.e., a C³ function satisfying F'' > 0, |F'''| ≤ 2(F'')^{3/2}. Suppose the Newton decrement $\rho_k = \sqrt{F'(x_k)^2/F''(x_k)}$ at some point xk ∈ I equals a ∈ (0, 1). Fix a constant γ ∈ (0, 1) and consider the damped Newton step
$$x_{k+1} = x_k - \gamma\,\frac{F'(x_k)}{F''(x_k)}.$$
The Newton decrement at the next iterate, which we suppose to lie in I, is given by $\rho_{k+1} = \sqrt{F'(x_{k+1})^2/F''(x_{k+1})}$.
Our goal in this section is to find the maximum of ρk+1 as a function of the parameters a, γ.
First of all, we may use the affine invariance of the problem setup to make some simplifications. We move the current iterate to the origin, xk = 0, normalize the Hessian at the initial point to F''(0) = 1 by scaling the coordinate x, and possibly flip the real axis to achieve F'(0) = −a < 0. Then we get xk+1 = aγ. Introducing the functions h = F''(x), g = F'(x), we obtain the optimal control problem
$$g' = h, \qquad h' = 2uh^{3/2}, \qquad u \in U = [-1, 1]$$
with initial conditions g(0) = −a, h(0) = 1 and objective function
$$\sqrt{\frac{g(a\gamma)^2}{h(a\gamma)}} \to \sup.$$
Replacing the state variable g by y = h^{−1/2}g and the independent variable x by t = h^{1/2}·(x − aγ), we obtain dt/dx = h^{1/2}·(1 + ut), and the problem becomes
$$\frac{dy}{dt} = \frac{1 - uy}{1 + ut}, \qquad u \in U = [-1, 1]$$
with initial condition y(−aγ) = −a and objective function |y(0)| → sup. The variable h becomes disconnected from the relevant part of the dynamics and can be discarded. If the control is bang-bang, i.e., u is piece-wise constant with values in {−1, 1}, then the dynamics can be integrated explicitly, with solutions satisfying
$$-u\log|1 - uy| + \mathrm{const} = u\log|1 + ut| \quad\Longrightarrow\quad (1 - uy)(1 + ut) = \mathrm{const}.$$
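The first integral derived above is easy to check numerically; the following sketch (our own, with illustrative tolerances) integrates the dynamics for a constant control u = −1 and prints the conserved quantity:

import numpy as np
from scipy.integrate import solve_ivp

a, gamma, u = 0.5, 1.0, -1.0
sol = solve_ivp(lambda t, y: (1 - u*y)/(1 + u*t), (-a*gamma, 0.0), [-a],
                dense_output=True, rtol=1e-10, atol=1e-12)
t = np.linspace(-a*gamma, 0.0, 5)
y = sol.sol(t)[0]
print((1 - u*y)*(1 + u*t))             # constant along the trajectory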
Following the principles of dynamic programming [1], consider the Bellman function B(t0, y0) denoting the maximal objective value which can be achieved on a system trajectory y(t) satisfying the initial condition y(t0) = y0. It satisfies the Bellman equation
$$\max_{u \in U}\left(\frac{\partial B}{\partial t} + \frac{\partial B}{\partial y}\cdot\frac{1 - uy}{1 + ut}\right) = 0$$
Figure 2: Optimal synthesis of the control problem modeling the one-dimensional case. The switching curve and the dispersion curve are dashed. The level curves of the Bellman function are at the same time the optimal trajectories of the system.
with boundary condition B(0, y) = |y|. After a bit of calculation one obtains the solution
$$B(t, y) = \begin{cases} 1 - (1+y)(1-t), & y \le \frac{2(-1+\sqrt{1+t^3})}{t^2}, \\ 3 + (1-y)(1+t) - 4\sqrt{(1-y)(1+t)}, & \frac{2(-1+\sqrt{1+t^3})}{t^2} \le y \le -t, \\ y - t - ty, & y \ge -t \end{cases}$$
on the domain (t, y) ∈ [−1, 0] × R. The curve y = −t is a switching curve, where the control u switches from +1 to −1. The curve y = 2(−1+√(1+t³))/t² is a dispersion curve: two optimal trajectories with controls u = ±1 emanate from each point of this curve, leading to final points y(0) of different sign. The optimal synthesis of the system is depicted in Fig. 2. The optimal objective value is given by the value of B at the initial point (−aγ, −a).
Let us now consider the result for the full Newton step. In this case γ = 1, and the initial point is given by (t, y) = (−a, −a). It hence lies between the dispersion curve and the switching curve, yielding the upper bound
$$\rho_{k+1} \le B(-a, -a) = 4 - a^2 - 4\sqrt{1 - a^2} = 4 - \rho_k^2 - 4\sqrt{1 - \rho_k^2}.$$
For the optimally damped Newton step, we have to minimize B(t, −a) with respect to t ∈ [−a, 0] for given a. The minimum is attained at the intersection point $\left(\frac{2(1-\sqrt{1+a^3})}{a^2}, -a\right)$ of the line y = −a with the dispersion curve. Hence the optimal damping coefficient is given by
$$\gamma_k = \frac{2(\sqrt{1+\rho_k^3}-1)}{\rho_k^3},$$
and the resulting bound on the decrement is
$$\rho_{k+1} \le a + \frac{2(1-a)(1-\sqrt{1+a^3})}{a^2}.$$
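For reference, the three one-dimensional worst-case quantities derived in this section can be evaluated as follows (a sketch of our own; note that the multidimensional bounds of Table 1 below are slightly larger):

import numpy as np

def bound_full_1d(a):
    # full step: rho_{k+1} <= 4 - a^2 - 4 sqrt(1 - a^2)
    return 4 - a**2 - 4*np.sqrt(1 - a**2)

def gamma_opt_1d(a):
    # optimal damping: 2 (sqrt(1 + a^3) - 1)/a^3
    return 2*(np.sqrt(1 + a**3) - 1)/a**3

def bound_opt_1d(a):
    # optimally damped step: rho_{k+1} <= a + 2(1 - a)(1 - sqrt(1 + a^3))/a^2
    return a + 2*(1 - a)*(1 - np.sqrt(1 + a**3))/a**2

for a in (0.2, 0.4, 0.6):
    print(a, bound_full_1d(a), gamma_opt_1d(a), bound_opt_1d(a))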
3 Reduction to a control problem in the general case
In this section we perform a similar analysis for the case of self-concordant functions F defined on n-dimensional domains. We reduce the problem to an optimal control problem in two state space dimensions. This results in a Hamiltonian system in a 4-dimensional extended phase space.
First we simplify the dynamics as in the previous section. Introduce the vector-valued variable g = F' and the matrix-valued variable h = F'' = WW^T, where W is the lower-triangular factor of h with positive diagonal. Let further P be the set of homogeneous cubic polynomials $p(x) = \sum_{i,j,k=1}^n p_{ijk}x_ix_jx_k$ which are bounded by 1 on the unit sphere S^{n−1} ⊂ R^n. This is a compact convex set. The self-concordance condition then expresses the third derivatives of F in the form
$$\frac{\partial^3 F}{\partial x_i\,\partial x_j\,\partial x_k} = 2\sum_{l,m,n=1}^n p_{lmn}W_{il}W_{jm}W_{kn}, \qquad p \in \mathcal{P}.$$
We shall need also the projection
$$\mathcal{U} = \{U \mid \exists\, p \in \mathcal{P}: U_{ij} = p_{ij1}\}$$
of P. The set U is a compact convex subset of the space of real symmetric matrices S^n. Clearly it is overbounded by the set U′ = {U | −I ⪯ U ⪯ I}. We shall also introduce the sets of lower-triangular matrices
$$\mathcal{V} = \{V \text{ lower-triangular} \mid V + V^T = 2U,\ U \in \mathcal{U}\}, \qquad \mathcal{V}' = \{V \text{ lower-triangular} \mid V + V^T = 2U,\ U \in \mathcal{U}'\},$$
which are compact and convex as well.
Using affine invariance, we may achieve the normalization xk = 0, g(0) = −ae1, h(0) = W(0) = I, xk+1 = aγe1. Here a ∈ (0, 1) is the Newton decrement ρk, γ ∈ (0, 1] the damping coefficient, and e1 the first canonical basis vector. We consider the evolution of the variables g, h, W only on the line segment joining xk and xk+1, and may hence pass to a scalar variable τ ∈ [0, aγ], such that x(τ) = τe1 and g(τ) := g(x(τ)), h(τ) := h(x(τ)). The dynamics of the resulting control system can be written as
$$\frac{dg}{d\tau} = he_1.$$
It follows that
$$\frac{dh}{d\tau} = 2W_{11}\,WUW^T, \qquad U \in \mathcal{U}.$$
Since W^{-1}(dW/dτ) is lower-triangular, and (dW^T/dτ)W^{-T} is its transpose, we finally obtain
$$\frac{dW}{d\tau} = W_{11}WV, \qquad V \in \mathcal{V}.$$
We now replace g by the variable y = W^{-1}g and introduce a new independent variable t = W11·(τ − aγ). This variable then evolves in the interval t ∈ [−aγ, 0], and we have dt/dτ = W11²V11·(τ − aγ) + W11 = W11·(tV11 + 1). The dynamics of the system becomes
$$\frac{dy}{dt} = \frac{-Vy + e_1}{tV_{11} + 1}, \qquad V \in \mathcal{V},$$
with objective function
$$\rho_{k+1} = \sqrt{g^T(x_{k+1})\,h^{-1}(x_{k+1})\,g(x_{k+1})} = \|y(0)\| \to \sup.$$
The matrix-valued variable W becomes disconnected and can be discarded.
For this problem the Bellman function cannot be presented in closed form, however, and we have to employ Pontryagin's maximum principle [8] to solve it. Introduce an adjoint vector-valued variable p; then the Pontryagin function and the Hamiltonian of the system are given by
$$\mathcal{H}(t, y, p, V) = \left\langle p, \frac{-Vy + e_1}{tV_{11} + 1}\right\rangle, \qquad H(t, y, p) = \max_{V \in \mathcal{V}}\left\langle p, \frac{-Vy + e_1}{tV_{11} + 1}\right\rangle,$$
Figure 3: The true set V of controls (left) and the overbounding set V′ (right). The sharp circular edge of both bodies is the circle C.
respectively. The transversality condition is non-trivial at the end-point t = 0 and states that p(0) equals the gradient ∂‖y(0)‖/∂y(0) = y(0)/‖y(0)‖ of the objective function. The optimal control problem is then reduced to the two-point boundary value problem
$$\dot y = \frac{\partial H}{\partial p}, \qquad \dot p = -\frac{\partial H}{\partial y}, \qquad y(-a\gamma) = -ae_1, \qquad p(0) = \frac{y(0)}{\|y(0)\|}. \tag{1}$$
Note that the problem setting is invariant with respect to orthogonal transformations of Rn which leave the distinguished vector e1 invariant. Suppose that at some point on the trajectory the vectors y, p span a plane containing e1. Then at this point the component of the gradient of the Hamiltonian which is orthogonal to the plane vanishes due to symmetry, and the derivatives of y, p are also contained in this plane. Therefore these variables will remain in this plane along the whole trajectory. Since at the end-point t = 0 the vectors y, p are linearly dependent, we may assume without loss of generality that y, p are for all t contained in the plane spanned by the basis vectors e1, e2, or equivalently, that the dimension n equals 2.
Then the set P is given by the set of bi-variate homogeneous cubic polynomials which are bounded by 1 on the unit circle, and can be expressed via the semi-definite representable set of nonnegative univariate trigonometric polynomials. This allows one to obtain an explicit description of the set V. Its boundary is given by the matrices
$$V = \pm\frac{1}{2\cos^3\xi}\begin{pmatrix} \cos\varphi\,(3\cos^2\xi - \cos^2\varphi) & 0 \\ 2\sin\varphi\,(\sin^2\xi - \sin^2\varphi) & \cos\varphi\,(\cos^2\xi - \sin^2\varphi) \end{pmatrix}$$
with ξ ∈ [0, π/3], |φ| ≤ ξ. Recall that the set V is overbounded by the set V′ (see Fig. 3). Both sets share the circle
$$C = \left\{ V = \begin{pmatrix} \cos\theta & 0 \\ 2\sin\theta & -\cos\theta \end{pmatrix} \;\middle|\; \theta \in [0, 2\pi) \right\}.$$
Note that the level sets of the Pontryagin function H are planes, because H is a fractional-linear function of V. Hence the maximum of H over a compact convex set is attained at an extreme point of this set. The maximum of H over the circle C can be computed explicitly and is given by the expression
$$\max_{V \in C} \mathcal{H} = \frac{p_1 + (p_1y_1 - p_2y_2)t + \sqrt{(p_1y_1 - p_2y_2 + p_1t)^2 + 4p_2^2y_1^2(1-t^2)}}{1 - t^2}.$$
Besides C, the set V′ has the extreme points V = ±I, on which H evaluates to (p1 ∓ (p1y1 + p2y2))/(1 ± t). Hence we have max_{V∈V′} H = max_{V∈C} H if
$$\sqrt{(p_1y_1 - p_2y_2 + p_1t)^2 + 4p_2^2y_1^2(1-t^2)} \ge -p_1t - p_1y_1 - p_2y_2 + 2p_2y_2t,$$
$$\sqrt{(p_1y_1 - p_2y_2 + p_1t)^2 + 4p_2^2y_1^2(1-t^2)} \ge p_1t + p_1y_1 + p_2y_2 + 2p_2y_2t.$$
These conditions then also imply max_{V∈V} H = max_{V∈C} H, and hence
$$H(t, y, p) = \frac{p_1 + (p_1y_1 - p_2y_2)t + \sqrt{(p_1y_1 - p_2y_2 + p_1t)^2 + 4p_2^2y_1^2(1-t^2)}}{1 - t^2}. \tag{2}$$
It turns out a posteriori that the conditions are satisfied on the relevant solutions, and by virtue of the necessity of Pontryagin's maximum principle for optimality we obtain the following result.
Theorem 3.1. Let a, γ ∈ (0, 1) be given. Then the upper bound on the Newton decrement ρk+1 after a damped Newton step with damping coefficient γ and initial value of the decrement ρk = a is given by the norm ‖y(0)‖, where (y(t), p(t)), y = (y1, y2), p = (p1, p2), t ∈ [−aγ, 0], is a solution of the boundary value problem (1) with Hamiltonian H defined by (2).
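Problem (1) has no closed-form solution, but it can be attacked by standard shooting: guess y(0), set p(0) = y(0)/‖y(0)‖, integrate the Hamiltonian flow backwards and match y(−aγ) = −ae1. The following sketch (our own; gradients of H by finite differences, illustrative tolerances and starting guess, convergence not guaranteed) outlines this procedure:

import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import fsolve

def H(t, y, p):                         # Hamiltonian (2)
    B = p[0]*y[0] - p[1]*y[1]
    S = np.sqrt((B + p[0]*t)**2 + 4*p[1]**2*y[0]**2*(1 - t**2))
    return (p[0] + B*t + S)/(1 - t**2)

def rhs(t, z, h=1e-7):                  # Hamiltonian vector field, numerically
    y, p = z[:2], z[2:]
    dy, dp = np.zeros(2), np.zeros(2)
    for i in range(2):
        e = np.zeros(2); e[i] = h
        dy[i] = (H(t, y, p + e) - H(t, y, p - e))/(2*h)    #  dH/dp_i
        dp[i] = -(H(t, y + e, p) - H(t, y - e, p))/(2*h)   # -dH/dy_i
    return np.concatenate([dy, dp])

def shoot(y0, a, gamma):                # mismatch at the left end-point
    p0 = y0/np.linalg.norm(y0)
    sol = solve_ivp(rhs, (0.0, -a*gamma), np.concatenate([y0, p0]),
                    rtol=1e-9, atol=1e-11)
    return sol.y[:2, -1] - np.array([-a, 0.0])

a, gamma = 0.3, 1.0
y0 = fsolve(lambda v: shoot(v, a, gamma), [-0.09, 0.03])
print(np.linalg.norm(y0))               # worst-case rho_{k+1}; cf. Table 1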
In the next section we shall analyze the qualitative behaviour of the solutions of this Hamiltonian system.
4 Qualitative behaviour of the solution curves
In this section we analyze the Hamiltonian system obtained in the previous section. It turns out that for small enough damping coefficients the trajectories corresponding to the 1-dimensional solution obtained in Section 2 are optimal. These trajectories are characterized by the condition that the variables y2, p2 are identically zero. If the initial point of the trajectory lies beyond a critical curve in the (t, y1)-plane, however, all entries of y, p participate in the dynamics. This section is devoted to the computation of the critical curve separating the two qualitatively different behaviours of the optimal trajectories.
As in the 1-dimensional case, the Bellman function B(t, y) of the problem is constant on the trajectories of the Hamiltonian system and satisfies the boundary condition B(0, y) = ‖y‖. In order to construct it, we have to integrate the system in backward time with initial condition p(0) = y(0)/‖y(0)‖, launching a trajectory from every point of the y-plane.
from every point of the y-plane. In general, the projections on y-space of trajectories launched from different points may eventually
intersect. In this case the trajectory with the maximal value of the Bellman function along it is retained. Therefore trajectories cease to be optimal at dispersion surfaces where they meet other trajectories with the same value of the Bellman function. In our case the plane y2 = 0 acts as a dispersion surface by virtue of the symmetry (y1, y2, p1, p2) ↦ (y1, −y2, p1, −p2) of the system. Indeed, a trajectory launched from a point with y2(0) ≠ 0 and hitting the plane y2 = 0 at some time will meet there with its image under the symmetry, which necessarily has the same value of the Bellman function. Thus the trajectories which are relevant for Theorem 3.1 are those which either completely evolve on the plane y2 = 0, or which do not cross this plane in the time interval (−aγ, 0].
The first kind of trajectories corresponds to the 1-dimensional system considered in Section 2 and is depicted in Fig. 2. In the regions between the dashed curves and the vertical axis they are given by
$$y(t) = \frac{c+t}{1-t}\,e_1, \tag{3}$$
where c = y1(0) is a parameter.
We now perturb trajectory (3) by launching it from a nearby point y(0) = (c, ε), and consider its evolution up to first order in ε. This can be done by solving the linearization of ODE (1). The coefficient matrix of the linearized system for the transverse perturbations (δy2, δp2) is given by
$$\begin{pmatrix} \frac{\partial^2 H}{\partial p_2\partial y_2} & \frac{\partial^2 H}{\partial p_2^2} \\ -\frac{\partial^2 H}{\partial y_2^2} & -\frac{\partial^2 H}{\partial y_2\partial p_2} \end{pmatrix} = \begin{pmatrix} -\frac{1}{1-t} & \frac{4y_1^2}{p_1(t+y_1)} \\ 0 & \frac{1}{1-t} \end{pmatrix} = \begin{pmatrix} -\frac{1}{1-t} & \frac{4|c|(c+t)^2}{c(c+2t-t^2)(1-t)^2} \\ 0 & \frac{1}{1-t} \end{pmatrix}.$$
Here the first relation is obtained by setting y2 = p2 = 0 in the partial derivatives of H, the second one by inserting the values (3). The linearized system has to be integrated with initial condition
$$\frac{\partial}{\partial\varepsilon}\left(c,\ \varepsilon,\ \frac{c}{\sqrt{c^2+\varepsilon^2}},\ \frac{\varepsilon}{\sqrt{c^2+\varepsilon^2}}\right)\bigg|_{\varepsilon=0} = \left(0,\ 1,\ 0,\ \frac{1}{|c|}\right) \quad\text{at } t = 0.$$
It has the solution
$$\delta y(t) = \delta y_2\,e_2, \qquad \delta p(t) = \frac{1}{|c|(1-t)}\,e_2,$$
where the scalar function δy2 is a solution of the ODE
$$\frac{d(\delta y_2)}{dt} = -\frac{\delta y_2}{1-t} + \frac{4(c+t)^2}{c(c+2t-t^2)(1-t)^3}$$
with initial condition δy2(0) = 1. Integrating the ODE, we obtain
$$\delta y_2(t) = -\frac{(c^2+5c+16)(1-t)}{3c(c+1)} + \frac{4(c+2)}{c(c+1)} - \frac{4}{c(1-t)} + \frac{4(c+1)}{3c(1-t)^2} - \frac{4(1-t)}{c(c+1)}\log\frac{c+2t-t^2}{c(1-t)^2} + \frac{2(c+2)(1-t)}{c(c+1)^{3/2}}\log\frac{(c+2t-t^2)(\sqrt{c+1}+1)^2}{c(\sqrt{c+1}+1-t)^2}$$
for c > −1, and
$$\delta y_2(t) = -\frac{(c^2+5c+16)(1-t)}{3c(c+1)} + \frac{4(c+2)}{c(c+1)} - \frac{4}{c(1-t)} + \frac{4(c+1)}{3c(1-t)^2} - \frac{4(1-t)}{c(c+1)}\log\frac{c+2t-t^2}{c(1-t)^2} + \frac{4(c+2)(1-t)}{c(-1-c)^{3/2}}\left(\arctan\frac{\sqrt{-1-c}}{1-t} - \arctan\sqrt{-1-c}\right)$$
for c < −1.
for c < −1. Setting the variable δy2 to zero, we obtain a critical value of t (a focal point) on trajectory (3) beyond
which other nearby trajectories of the system start to intersect the y2 = 0 plane. At this point trajectory (3) ceases to be optimal. The ensemble of these critical points for all values c ∈ R forms a curve in the (t, y1)-plane which marks the limit of optimality of the synthesis obtained in Section 2. In order to obtain an expression for this curve, we have to use (3) to express c as a function of t, y1. Inserting this expression into the relation δy2 = 0 yields a relation between t and y1.
For c > −1 this relation is given by
$$(-Y^4T^6 + 4Y^4 - 3Y^2T^4 - 12y_1t)Y + 24T^2Y\log T + 6T(YT-1)^2\log\frac{Y-T}{YT-1} + 6T(YT+1)^2\log\frac{YT+1}{Y+T} = 0,$$
where Y = √(1+y1), T = √(1−t). For c = −1 we obtain the point (t*, −1) with t* = 1 − 2^{2/3} ≈ −0.5874, and for c < −1 we get
$$(-Y^4T^3 + 4Y^4 + 3Y^2T^3 - 12T^2y_1t) + 12T^3\left(2\log T + \log\frac{1+Y^2}{Y^2+T^2} + \frac{1-Y^2}{Y}\left(\arctan Y - \arctan\frac{Y}{T}\right)\right) = 0,$$
where Y = √(−(1+y1)(1−t)), T = 1 − t.
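Alternatively, the focal point for a given c can be located numerically by integrating the linearized equation directly; a sketch (our own, with illustrative tolerances):

import numpy as np
from scipy.integrate import solve_ivp

c = -0.5

def rhs(t, v):
    # d(delta y2)/dt = -delta y2/(1-t) + 4(c+t)^2/(c (c+2t-t^2)(1-t)^3)
    return -v[0]/(1 - t) + 4*(c + t)**2/(c*(c + 2*t - t**2)*(1 - t)**3)

def hit_zero(t, v):                     # focal point: delta y2 = 0
    return v[0]
hit_zero.terminal = True

sol = solve_ivp(rhs, (0.0, -0.99), [1.0], events=hit_zero,
                rtol=1e-10, atol=1e-12)
print(sol.t_events[0])                  # focal time on the trajectory with y1(0) = c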
In particular, both the switching curve and the dispersion curve of the 1-dimensional optimal synthesis lie beyond the critical curve and are not part of the full-dimensional solution (see Fig. 4, left).
We obtain the following result.
Theorem 4.1. Let a, γ ∈ (0, 1) be given, and suppose that the point (t, y1) = (−aγ,−a) lies to the right of the critical curve defined above. Then the upper bound on the Newton decrement ρk+1 after a damped Newton step with damping coefficient γ and initial value of the decrement ρk = a is given by a− aγ + a2γ.
Proof. Since in this case the analysis in Section 2 is applicable, the optimal value of ρk+1 is given by B(t, y1) = −y1 + t + ty1 evaluated at (t, y1) = (−aγ, −a).
The trajectories with initial condition y(−aγ) = −ae1 corresponding to a point (t, y1) = (−aγ, −a) beyond the critical curve can only be computed numerically by solving (1). To each such initial condition there correspond two solutions which are taken to each other by the symmetry (y1, y2, p1, p2) ↦ (y1, −y2, p1, −p2). In particular, this will be the case both for the full Newton step and the optimal step length.
Figure 4: Left: Critical curve marking the limit of optimality of the 1-dimensional solution (solid). For comparison the switching curve and the dispersion curve of the 1-dimensional optimal synthesis are also depicted (dashed). For |y| → +∞ the critical curve tends to the line t = 1 − 2^{2/3} (dotted). Right: Projections on the (t, y1)-plane of the trajectories corresponding to a full Newton step for different initial values a of the Newton decrement. The dotted line is the locus of the initial points.
5 Optimal step length and bounds for the damped Newton step
In this section we minimize the upper bound on ρk+1 with respect to the damping coefficient γ for fixed initial values of the decrement ρk = a. The minimizer of this problem yields the optimal damping coefficient for the Newton iterate which leads to the largest guaranteed decrease of the decrement.
Technically, releasing γ is equivalent to releasing the left end of the time interval on which the trajectory of the Hamiltonian system evolves, while leaving the initial state fixed. It is well known that the partial derivative with respect to time of the Bellman function, i.e., the objective achieved by the trajectory, equals the value of the Hamiltonian [8]. Therefore we look for trajectories with starting points lying on the surface H = 0.
Let us first evaluate H on the critical curve obtained in the previous section, more precisely on its arc between the lines y1 = −1 and y1 = 0. There we have p1 < 0, y1 + t < 0. Setting y2 = 0, p2 = 0, we obtain H = p1(1+y1)/(1−t). Thus for a < 1 we have H < 0, and hence the optimal initial time instant t is strictly smaller than the time instant defined by the critical curve. For a = 1, or equivalently y1 = −1, the trajectory of the 1-dimensional system is optimal. As a consequence, for a → 1 the optimal damping coefficient tends to 2^{2/3} − 1 ≈ 0.5874.
Lemma 5.1. The hyper-surface H = 0 is integral for the Hamiltonian system defined by (2).
Proof. One easily computes
$$\frac{\partial H}{\partial t} = \frac{H}{1-t^2}\left(t + \frac{p_1y_1 - p_2y_2 + p_1t}{\sqrt{(p_1y_1 - p_2y_2 + p_1t)^2 + 4p_2^2y_1^2(1-t^2)}}\right),$$
and hence if H = 0 somewhere on a trajectory, then H ≡ 0 everywhere on the trajectory.
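The identity in the proof can be confirmed with a symbolic computation; the following sympy sketch (our own) evaluates the difference between its two sides at a generic rational point:

import sympy as sp

t, y1, y2, p1, p2 = sp.symbols('t y1 y2 p1 p2', real=True)
B = p1*y1 - p2*y2
S = sp.sqrt((B + p1*t)**2 + 4*p2**2*y1**2*(1 - t**2))
H = (p1 + B*t + S)/(1 - t**2)                       # Hamiltonian (2)
identity = sp.diff(H, t) - H*(t + (B + p1*t)/S)/(1 - t**2)
val = identity.subs({t: sp.Rational(1, 5), y1: -sp.Rational(1, 2),
                     y2: sp.Rational(3, 10), p1: -sp.Rational(4, 5),
                     p2: sp.Rational(2, 5)})
print(val.evalf())                                  # 0 up to rounding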
In particular, it follows that at the end-point t = 0 of the trajectory we also have H = 0. From (2) we then obtain by virtue of the transversality condition p(0) = y(0)/‖y(0)‖ that
$$H(0) = \frac{y_1 + y_1^2 + y_2^2}{\|y(0)\|} = 0.$$
The locus of the end-points in y-space is hence given by the circle (y1 + 1/2)² + y2² = 1/4. From now on we assume without loss of generality that y2 ≥ 0 on the trajectory of the system.
Using the homogeneity of the dynamics with respect to the adjoint variable p we may eliminate this variable altogether. On the surface H = 0 we have
$$(y_1^2 - 1)p_1^2 - 2y_1y_2p_1p_2 + (4y_1^2 + y_2^2)p_2^2 = 0, \qquad \frac{p_1}{p_2} = \frac{y_1y_2 + \sqrt{-4y_1^4 + 4y_1^2 + y_2^2}}{y_1^2 - 1},$$
and the dynamics of the system becomes
$$\dot y_1 = \frac{(1 - y_1^2)\sqrt{-4y_1^4 + 4y_1^2 + y_2^2}}{y_2(y_1 + t) + (y_1t + 1)\sqrt{-4y_1^4 + 4y_1^2 + y_2^2}}, \qquad \dot y_2 = \frac{\sqrt{-4y_1^4 + 4y_1^2 + y_2^2}\left(y_1y_2 + \sqrt{-4y_1^4 + 4y_1^2 + y_2^2}\right)}{y_2(y_1 + t) + (y_1t + 1)\sqrt{-4y_1^4 + 4y_1^2 + y_2^2}}.$$
A closer look reveals that the quotient of the two derivatives does not depend on t, and we obtain a planar dynamical system defined by the scalar ODE
$$\frac{dy_2}{dy_1} = \frac{y_1y_2 + \sqrt{-4y_1^4 + 4y_1^2 + y_2^2}}{1 - y_1^2}. \tag{4}$$
We obtain the following result.
Theorem 5.2. Let a ∈ (0, 1) be given, and let σ be the solution curve of ODE (4) through the point y0 = (−a, 0). Then the upper bound on the Newton decrement ρk+1 after a damped Newton step with optimal damping coefficient and initial value ρk = a of the decrement is given by ‖y*‖ = √(−y1*), where y* = (y1*, y2*) is the intersection point of the curve σ with the circle centered at (−1/2, 0) with radius 1/2 in the upper half-plane y2 > 0.
The Riemann surfaces corresponding to the solutions of ODE (4) in the complex plane have an infinite number of quadratic ramification points, and the equation is not integrable in closed form. The optimal bound on ρk+1 as a function of ρk can be computed numerically and is depicted in Fig. 1, left.
In order to obtain the value of the optimal damping coefficient one also has to integrate the linear differential equation
$$\frac{dt}{dy_1} = \frac{y_2(y_1 + t) + (y_1t + 1)\sqrt{-4y_1^4 + 4y_1^2 + y_2^2}}{(1 - y_1^2)\sqrt{-4y_1^4 + 4y_1^2 + y_2^2}} \tag{5}$$
along the solution curves of (4).
Theorem 5.3. Let a ∈ (0, 1) be given, and let σ be the solution curve of ODE (4) through the point y0 = (−a, 0), intersecting the circle (y1 + 1/2)² + y2² = 1/4 in the point y* in the upper half-plane y2 > 0. Then the optimal damping coefficient for the Newton step with initial value ρk = a of the decrement is given by γ = −t(y0)/a, where t(y) is the solution of ODE (5) along the curve σ with initial value t(y*) = 0.
In order to compute the optimal value of γ one hence first has to integrate ODE (4) from y0 to y*, and then ODE (5) back from y* to y0. The result is depicted in Fig. 1, right. The solution curves of the ODEs are depicted in Fig. 5.
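In practice both integrations are routine; the following sketch (our own, with illustrative tolerances; the printed reference values are those of Section 7) computes λ̄ and γ* for a given a:

import numpy as np
from scipy.integrate import solve_ivp

a = 0.442946                            # current decrement lambda

D = lambda y1, y2: np.sqrt(-4*y1**4 + 4*y1**2 + y2**2)

def ode4(y1, y):                        # equation (4)
    return (y1*y[0] + D(y1, y[0]))/(1 - y1**2)

def on_circle(y1, y):                   # stop on (y1 + 1/2)^2 + y2^2 = 1/4
    return y1**2 + y[0]**2 + y1
on_circle.terminal = True

s1 = solve_ivp(ode4, (-a, 0.0), [0.0], events=on_circle,
               rtol=1e-11, atol=1e-13)
y1s, y2s = s1.t[-1], s1.y[0, -1]
print("bar_lambda =", np.sqrt(-y1s))    # ~0.2129 for this a

def ode45(y1, z):                       # equations (4) and (5) jointly, backwards
    y2, t = z
    d = D(y1, y2)
    return [(y1*y2 + d)/(1 - y1**2),
            (y2*(y1 + t) + (y1*t + 1)*d)/((1 - y1**2)*d)]

s2 = solve_ivp(ode45, (y1s, -a), [y2s, 0.0], rtol=1e-11, atol=1e-13)
print("gamma* =", -s2.y[1, -1]/a)       # ~0.9447 for this a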
6 Approximate values of the bound and the optimal step length
As was established above, the worst-case Newton decrement after a Newton step with a given step length can only be computed by solving the two-point boundary value problem (1). The optimal step length and the corresponding bound on the decrement can be computed by solving ODEs (4) and (5). In Table 1 we provide numerical values for the upper bound λ̄ on the decrement for the full Newton step and the optimal Newton step as a function of the decrement λ at the current point, and values for the optimal step length γ*.
In Fig. 4, right, the solution curves of problem (1) corresponding to the value γ = 1 (full Newton step) and different a are depicted. The resulting bound on the decrement after the iteration, listed in column 2 of Table 1, is depicted in Fig. 1, left, as the dashed line. As an analytic approximation of the optimal bound on the interval [0, 2/3] one may use the upper bound
$$\bar\lambda = 1.01\cdot\lambda^2 + 1.02\cdot\lambda^4, \tag{6}$$
λ      λ̄ (full step)   λ̄ (optimal step)  γ*
0.02   0.0004002129    0.0004002020      0.9999959049
0.04   0.0016030061    0.0016027905      0.9999672844
0.05   0.0025070365    0.0025064667      0.9999357402
0.06   0.0036140910    0.0036128237      0.9998883080
0.08   0.0064421367    0.0064376124      0.9997320984
0.10   0.0100985841    0.0100863025      0.9994703910
0.12   0.0145975978    0.0145695698      0.9990735080
0.14   0.0199560957    0.0198993569      0.9985103062
0.15   0.0229637111    0.0228857123      0.9981561975
0.16   0.0261938350    0.0260886230      0.9977481582
0.18   0.0333335449    0.0331510886      0.9967529721
0.20   0.0414011017    0.0411009637      0.9954892620
0.22   0.0504257486    0.0499526517      0.9939202785
0.24   0.0604403592    0.0597204227      0.9920082112
0.25   0.0658302428    0.0649521741      0.9909114838
0.26   0.0714817517    0.0704180523      0.9897144751
0.28   0.0835910576    0.0820584243      0.9870000911
0.30   0.0968141551    0.0946530992      0.9838261704
0.32   0.1112021742    0.1082118504      0.9801545078
0.34   0.1268120901    0.1227421781      0.9759482831
0.35   0.1350948306    0.1303732829      0.9736337787
0.36   0.1437074168    0.1382488115      0.9711728663
0.38   0.1619590244    0.1547332145      0.9657967079
0.40   0.1816461018    0.1721931153      0.9597922905
0.42   0.2028572983    0.1906220829      0.9531371033
0.44   0.2256920826    0.2100091752      0.9458145937
0.45   0.2377527563    0.2200572997      0.9418997667
0.46   0.2502623689    0.2303386829      0.9378150396
0.48   0.2766944756    0.2515899946      0.9291362838
0.50   0.3051314992    0.2737375986      0.9197842716
0.52   0.3357362119    0.2967512350      0.9097733400
0.54   0.3686946267    0.3205961990      0.8991262221
0.55   0.3861220234    0.3328184758      0.8935734256
0.56   0.4042204205    0.3452337887      0.8878737471
0.58   0.4425604721    0.3706218774      0.8760542409
0.60   0.4840018685    0.3967155859      0.8637126540
0.62   0.5288808639    0.4234680202      0.8508994659
0.64   0.5775944788    0.4508310379      0.8376694301
0.65   0.6035336028    0.4647263175      0.8309160437
0.66   0.6306157177    0.4787560083      0.8240802319
0.68   0.6885138337    0.5071945318      0.8101911337
0.70   0.7519817648    0.5360990922      0.7960616771
0.72   0.8218739745    0.5654236219      0.7817504964
0.74   0.8992597480    0.5951239679      0.7673142876
0.75   0.9411738550    0.6101018892      0.7600662717
0.76   0.9855000696    0.6251582547      0.7528069572
0.78   1.0823615911    0.6554871449      0.7382789623
0.80   1.1921910478    0.6860740057      0.7237768386
0.82   1.3181923462    0.7168849922      0.7093429029
0.84   1.4648868491    0.7478890580      0.6950151115
0.85   1.5479540249    0.7634545483      0.6879016840
0.86   1.6389206049    0.7790579083      0.6808270495
0.88   1.8505789254    0.8103659079      0.6668080264
0.90   2.1168852322    0.8417899549      0.6529832527
0.92   2.4687193058    0.8733093323      0.6393740764
0.94   2.9700851395    0.9049055446      0.6259982580
0.95   3.3195687439    0.9207272500      0.6194025003
0.96                   0.9365621489      0.6128702685
0.98                   0.9682645833      0.6000015959
Table 1: Upper bounds on the decrement for a full and an optimal Newton step and optimal damping coefficient.
Figure 5: Left: Solution curves of ODE (4) between the line y2 = 0 and the circle (y1 + 1/2)² + y2² = 1/4 in the upper half-plane. Right: Solutions of ODE (5) on the curves depicted in the left part of the figure. The dotted curves are the locus of the end-points of the trajectories.
which is tight up to an error of 0.013.
For the optimal step length the upper bound λ̄ is depicted in Fig. 1, left, as the solid line. An asymptotic analysis of ODE (4) for small values of a yields the expansion λ̄ = λ² − ¼λ⁴ log λ + O(λ⁴). The analytic approximation
$$\bar\lambda = \lambda^2 - 0.556\cdot\lambda^4\cdot\log\lambda \tag{7}$$
yields an upper bound which is tight up to an error of 0.007.
An asymptotic analysis of ODE (5) for small values of a leads to the expansion γ* = 1 − λ³/2 + O(λ⁴ log λ) of the optimal damping coefficient. The analytic approximation
$$\gamma^* \approx 1 - 0.95\cdot\lambda^3 + 0.53\cdot\lambda^4 \tag{8}$$
has an error of 0.008 on the whole interval [0, 1]. Unlike the situation in one dimension, in multiple dimensions the upper bound on the Newton decrement is a smooth function of the damping coefficient. Therefore a small deviation from the optimal value of γ will result only in an increase of the upper bound λ̄ by a term of second order.
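For convenience, approximations (6)-(8) in executable form (a sketch of our own; compare the printed values with the corresponding rows of Table 1):

import numpy as np

bar_full  = lambda lam: 1.01*lam**2 + 1.02*lam**4          # (6)
bar_opt   = lambda lam: lam**2 - 0.556*lam**4*np.log(lam)  # (7)
gamma_opt = lambda lam: 1 - 0.95*lam**3 + 0.53*lam**4      # (8)

for lam in (0.2, 0.4, 0.6):
    print(lam, bar_full(lam), bar_opt(lam), gamma_opt(lam))
# at lam = 0.4 Table 1 lists 0.18165, 0.17219 and 0.95979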
7 Application to path-following methods
In this section we demonstrate how the analysis of the Newton iterate derived above serves to tune the parameters of a path-following method in a way leading to a significant performance gain.
For ease of implementation we consider a basic primal short-step path-following method for a random dense semi-definite program (SDP). Although it is in general not competitive with long-step methods, it serves to demonstrate the performance gain from an optimized tuning of its parameters. We shall compare four different setups, which differ by the step length γ used and by the policy of updating the parameter τk on the central path (see Section 1 for a description of the method). The updating policy is determined by the choice of the parameter λ, which equals the decrement before the Newton step.
Let us compute this parameter for the full and the optimal Newton step, respectively. We have to maximize the difference λ − λ̄(λ) with respect to λ. A numerical analysis of the optimal upper bounds λ̄(λ) for γ = 1 and γ = γ* yields the following results.
In the case of a full Newton step the optimal parameters λ*, λ̄* are approximately given by 0.394257 and 0.175841, respectively. The maximal difference evaluates to λ* − λ̄* ≈ 0.2184159486, which is a 55% improvement over the respective value presented in Section 1.
For an optimally damped Newton step λ*, λ̄* are approximately given by 0.442946 and 0.212945, respectively. The maximal difference evaluates to λ* − λ̄* ≈ 0.2300010331, which is a 40% improvement with respect to the bound from the intermediate Newton step in [6, Theorem 5.2.2.3]. The optimal step length at λ = λ* is given by γ* ≈ 0.944679.

Figure 6: Performance of a short-step path-following method on a random SDP: growth of the parameter τk for the different choices of γ, λ.
The setups of the path-following methods use the following parameter values:
• traditional method with full step, γ = 1, λ = 0.2291;
• tight method with full step, γ = 1, λ = 0.394257;
• traditional method with intermediate step, γ = 0.9384, λ = 0.2910;
• tight method with optimal step, γ = 0.944679, λ = 0.442946.
The progress along the central path for the four setups is shown in Fig. 6. Optimizing the parameters of the method leads to approximately halving the number of iterations in comparison with the simplest setup.
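For completeness, a minimal sketch of one iteration of the short-step scheme described in Section 1 (our own rendering under the assumptions made there; the barrier, F_grad, and F_hess in the demo are illustrative):

import numpy as np

def pathfollow_step(x, F_grad, F_hess, c, lam_star, gamma):
    g0, H = F_grad(x), F_hess(x)
    Hinv = np.linalg.inv(H)
    # rho^2(tau) = (g0 + tau c)^T H^{-1} (g0 + tau c) is quadratic in tau;
    # take the largest tau with rho(tau) = lam_star
    A = c @ Hinv @ c
    B = 2*(g0 @ Hinv @ c)
    C = g0 @ Hinv @ g0 - lam_star**2
    tau = (-B + np.sqrt(B**2 - 4*A*C))/(2*A)
    # damped Newton step towards the target point x*(tau)
    x_plus = x - gamma*(Hinv @ (g0 + tau*c))
    return x_plus, tau

# demo: minimize <c, x> = x over (0, 1) with barrier -log x - log(1 - x)
F_grad = lambda x: np.array([-1/x[0] + 1/(1 - x[0])])
F_hess = lambda x: np.array([[1/x[0]**2 + 1/(1 - x[0])**2]])
c = np.array([1.0])
x = np.array([0.5])                     # analytic center
for _ in range(5):
    x, tau = pathfollow_step(x, F_grad, F_hess, c, 0.394257, 1.0)
print(x, tau)                           # x approaches 0, tau grows monotonically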
8 Conclusion
We first furnish an interpretation of the results. Let us imagine the worst-case behaviour of the self-concordant function on the segment between the iterates xk, xk+1 as the response of an adversarial player to our choice of the damping coefficient. The goal of this player is to maximize the Newton decrement ρk+1
at the next iterate. Since the control which is at the disposal of the adversarial player affects the third derivative of the function, we can roughly assume that he manipulates the acceleration of the gradient.
First consider the case when the function is defined on an interval. Here the adversarial player has two different options. One is to maximally decelerate the gradient in order to prevent it from reaching zero at the end-point of the interval. This strategy pays off more if we choose a smaller step length. The other strategy is to first maximally accelerate the gradient, in order to give it enough velocity to overshoot. At some point, corresponding to the crossing of the switching curve in Fig. 2, the gradient is again decelerated by decreasing the Hessian, because the effect of a smaller denominator F'' in the objective function outweighs the effect of a larger gradient F', which enters the numerator. This strategy pays off more if we choose a larger step length. Our optimal strategy will therefore be to choose that value of the damping coefficient which results in the same objective for both strategies of the adversarial player, i.e., we choose the initial point (−aγ, −a) on the dispersion curve in Fig. 2.
In the case of a multi-dimensional domain of definition the adversarial player has more options. In addition to acceleration or deceleration of the gradient in the direction of movement, he may boost it in a perpendicular direction. Here he may choose this perpendicular direction arbitrarily, but once it is chosen, it is optimal to keep the acceleration vector in the plane spanned by the direction of movement and this particular direction. If the damping coefficient is large enough, more precisely if it corresponds to an initial point (t, y1) beyond the critical curve in Fig. 4, left, the optimal strategy of the adversarial player is then
indeed a mixture of boosts in the parallel and the perpendicular direction. Here the parallel component may be an acceleration or a deceleration, but the perpendicular component is always increased. For smaller damping coefficients the optimal strategy is a pure deceleration of the gradient in the direction of movement.
Let us now summarize our findings. For a given value λ of the decrement at the initial point and a given step length γ, the worst-case value λ̄ of the decrement after the iteration can be computed by solving the two-point boundary value problem (1). For a full Newton step, i.e., γ = 1, we provided both the numerical values of the bound on a grid (column 2 of Table 1) and the suboptimal analytic approximation (6). To find the optimal value γ* of the damping coefficient and the corresponding minimal λ̄ for a given initial value λ one needs to integrate two scalar ODEs. The numerical values of these quantities can be found in columns 4 and 3 of Table 1, respectively. In addition we provided the analytic approximations (8), (7). Formula (8) or any other reasonably simple approximation of the last column of Table 1 can be used in any context where a self-concordant function is minimized by the Newton method.
Tuning of a path-following method using a predetermined step length γ(λ) can be accomplished by maximizing the difference λ − λ̄(λ) with respect to λ. The parameters to be used are then the maximizer λ*
and the corresponding damping coefficient γ(λ∗). This computation needs to be done only once, thereafter the two obtained values can be encoded in the method. In Section 7 we computed these parameter values for two strategies, namely for the full Newton step γ = 1 and for the optimal Newton step, which minimizes λ with respect to γ.
Acknowledgments
The paper was written in coronavirus quarantine at Moscow Institute of Physics and Technology. The author would like to thank the medical staff for having provided appropriate working conditions.
References
[1] Richard Bellman. Dynamic programming. Dover, 2003.
[2] Oleg P. Burdakov. Some globally convergent modifications of Newton's method for solving systems of nonlinear equations. Soviet Math. Dokl., 22(2):376–379, 1980.
[3] Etienne de Klerk, Francois Glineur, and Adrien Taylor. On the worst-case complexity of the gradient method with exact line search for smooth strongly convex functions. Optimization Letters, 11:1185–1199, 2017.
[4] Etienne de Klerk, Francois Glineur, and Adrien Taylor. Worst-case convergence analysis of gradient and Newton methods through semidefinite programming performance estimation, 2020. Accepted in SIAM J. Optim.
[5] Wenbo Gao and Donald Goldfarb. Quasi-Newton methods: superlinear convergence without line searches for self-concordant functions. Optim. Method Softw., 34(1):194–217, 2019.
[6] Yurii Nesterov. Lectures on Convex Optimization, volume 137 of Springer Optimization and its Applications. Springer, 2nd edition, 2018.
[7] Yurii Nesterov and Arkadii Nemirovskii. Interior-point Polynomial Algorithms in Convex Programming, volume 13 of SIAM Stud. Appl. Math. SIAM, Philadelphia, 1994.
[8] L.S. Pontryagin, V.G. Boltyanskii, R.V. Gamkrelidze, and E.F. Mishchenko. The mathematical theory of optimal processes. Wiley, New York, London, 1962.
[9] Daniel Ralph. Global convergence of damped Newton’s method for nonsmooth equations via the path search. Math. Oper. Res., 19(2):352–389, 1994.
[10] Adrien B. Taylor, Julien M. Hendrickx, and Francois Glineur. Exact worst-case performance of first-order methods for composite convex optimization. SIAM J. Optim., 27(3):1283–1313, 2017.