TRANSCRIPT
V-models for Nonconvex Minimization:
Prerequisite for a VU-algorithm
Robert Mifflin
http://www.math.wsu.edu/faculty/mifflin
Work with Claudia Sagastizabal
MCA 2013, Guanajuato
Grant support: NSF DMS 0707205, AFOSR FA9550-11-1-0139 and SOARD
Introduction
minx∈IRn f(x); f locally Lipschitz
with Clarke subdifferential ∂f and
only one generalized gradient of f
computed by a black box at each x.
Ultimate Goal: Design a VU-algorithm for such f .
A VU-algorithm implicitly exploits
underlying nonsmooth/smooth structure
to achieve rapid convergence.
Current Goal: Define a bundle method that
achieves convergence to stationary points and
produces good V-models for f .
Outline
- Introduction to minimization with VU-algorithms
- Bundle algorithm concepts and notation for V-models
with negative curvature estimates from line search
- General definitions of null and serious points
- Three general V-model subfunction conditions for
achieving stationarity results in the limit
- Single variable minimization for motivating the n-variable
case and for use in a finite line search
- V-model subfunctions for nonconvex objectives on IRn
- Future work for a complete VU-algorithm
V and U subspaces & graphs of f thereon
A nonconvex pdg-structured example
f(x1, x2, x3) = (1/2)x1² + (1/2) ln(1 + √((x1² − 2x2)² + (x3 − x2)²))
x∗ = (0, 0, 0) is a stationary point (minimizer)
zero subgradient ∈ ∂f(x∗)
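The example is easy to probe numerically. A minimal Python sketch (the function name f is mine) that evaluates the example and checks nonnegativity near the stationary point:

```python
import math

def f(x1, x2, x3):
    # Nonconvex pdg-structured example from the slide:
    # f = (1/2)x1^2 + (1/2)ln(1 + sqrt((x1^2 - 2x2)^2 + (x3 - x2)^2))
    return 0.5 * x1**2 + 0.5 * math.log(
        1.0 + math.sqrt((x1**2 - 2 * x2)**2 + (x3 - x2)**2))

# f >= 0 everywhere (both terms are nonnegative) and f(x*) = 0,
# consistent with x* = (0, 0, 0) being a minimizer.
print(f(0, 0, 0))                                   # 0.0
print(min(f(0.1 * i, 0.1 * j, 0.1 * k)
          for i in (-1, 0, 1) for j in (-1, 0, 1)
          for k in (-1, 0, 1)) >= 0.0)              # True
```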
[Figure: 3-D surface plot of the example f; axis tick labels omitted]
In general, for any x
g ∈ ∂f(x), V(x) := lin(∂f(x)− g) and U(x) := V(x)⊥
A view of f on V(x∗)
f(u, 0, 0)
and for u ∈ U(x∗)
'U-Lagrangian' L⁰U(u) = f(χ(u))
with χ(u) a primal track
[MS, Primal-dual gradient structured functions, SIOPT 13(4):1174-1194, 2003]
Lewis and Overton 8-variable half-and-half function
[MS, A VU-algorithm for convex minimization, Math. Prog. 104(2-3), 583-608, 2005]
Sublinear, linear, and superlinear convergence
Bundle algorithm concepts and notation
g(·): one (generalized) gradient in ∂f(·)
H(·): low rank n × n Hessian matrix
s(·): safeguard scalar for convergence
the last two are used for down-shifting; both are zero if f is convex
B: bundle of data items (yi, f(yi), g(yi), H(yi), s(yi))
x: bundle center = some yj in B
g(x, yi): estimate of a gradient at x
e(x, yi): ≥ 0, so that f(x)− e(x, yi) underestimates f at x
both based on data at yi
φB(x + d): V-model (cutting-plane) function for d ∈ Rn
max{f(x) − e(x, yi) + g(x, yi)^T d : yi ∈ B}
If f is known to be convex then
(Cg) g(x, y) := g(y), independent of x,
(Ce) e(x, y) := f(x) − (f(y) + g(y)^T (x − y)) ≥ 0;
with y = yi this gives a good underestimating V-model.
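For a convex instance the choices (Cg) and (Ce) can be checked numerically. A small sketch under assumed names (the test function f, its subgradient g, and phi_B are my choices):

```python
import numpy as np

# Convex test instance (my choice): f(x) = |x|_1 + (1/2)|x|^2
# with one subgradient g(y) = sign(y) + y.
def f(x):
    return np.abs(x).sum() + 0.5 * (x @ x)

def g(y):
    return np.sign(y) + y

def phi_B(x, d, bundle):
    # V-model under (Cg), (Ce):
    #   g(x, y) = g(y),  e(x, y) = f(x) - (f(y) + g(y)^T (x - y)) >= 0
    vals = []
    for y in bundle:
        e = f(x) - (f(y) + g(y) @ (x - y))
        vals.append(f(x) - e + g(y) @ d)
    return max(vals)

x = np.array([1.0, -0.5])
bundle = [x, np.array([0.2, 0.3]), np.array([-1.0, 1.0])]
d = np.array([-0.4, 0.1])
# Convexity makes each piece a supporting plane,
# so the V-model underestimates f at x + d.
print(phi_B(x, d, bundle) <= f(x + d) + 1e-12)   # True
```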
µ: parameter for proximal subproblem
P (µ, x, B): min{(1/2)µ|d|² + φB(x + d) − f(x) : d ∈ Rn}
equivalent to quadratic program
d(x): subproblem solution d = d(x)
δ(x):= f(x)− φB(x+ d(x)): V-model value change
both also dependent on µ and B
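The subproblem P(µ, x, B) can be sketched through its standard epigraph reformulation as a QP. The SLSQP call and all names below are my choices, not the talk's solver:

```python
import numpy as np
from scipy.optimize import minimize

def solve_prox_subproblem(mu, planes):
    # planes: list of (e_i, g_i) describing the V-model pieces, so that
    #   phi_B(x + d) - f(x) = max_i (-e_i + g_i^T d).
    # Epigraph form of P(mu, x, B): min over (d, v) of (1/2)mu|d|^2 + v
    # subject to v >= -e_i + g_i^T d for every bundle piece i.
    n = len(planes[0][1])
    obj = lambda z: 0.5 * mu * (z[:n] @ z[:n]) + z[n]
    cons = [{'type': 'ineq',
             'fun': lambda z, e=e, g=g: z[n] - (-e + g @ z[:n])}
            for e, g in planes]
    res = minimize(obj, np.zeros(n + 1), method='SLSQP', constraints=cons)
    d = res.x[:n]
    delta = -res.x[n]   # delta(x) = f(x) - phi_B(x + d(x)) = -v
    return d, delta

# One-dimensional example: phi_B(x+d) - f(x) = max(d, -0.5 - d).
# With mu = 1 the minimizer is d = -0.25 and delta = 0.25.
d, delta = solve_prox_subproblem(1.0, [(0.0, np.array([1.0])),
                                       (0.5, np.array([-1.0]))])
```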
serious step: gives adequate f-descent for an x-change
null step: gives adequate improvement of V-model φB
line search needed to generate one if f is nonconvex
Bundle algorithm iteration
Loop: Solve subproblem P (µ, x,B) for d(x), δ(x).
If δ(x) = 0, stop with x stationary.
Else (δ(x) > 0 and d(x) ≠ 0), call for a line search
from x along d(x) with stepsize t > 0 so that
either t ↑ ∞ and f(x + td(x)) ↓ −∞
or it stops with t such that the point
y+ := x + td(x)
is either null or serious.
Append (y+, f(y+), g(y+),H(y+), s(y+)) to B,
delete inactive yi-data from B and
if y+ is serious, then choose µ+ > 0 and
replace (x, f(x), µ) by (y+, f(y+), µ+).
Update all relevant V-model functions
f(x) − e(x, yj) + g(x, yj)^T d
and go to Loop.
Definition of H(y(t)) with y(t):= x+ td(x)
Except when the first line search t-value, t = 1, is not
serious, each t-value has an associated 2nd derivative
estimate computed from two difference vectors
∆y:= (x+ td(x))− (x+ t−d(x)) = (t− t−)d(x),
∆g:= g(x+ td(x))− g(x+ t−d(x))
where t− is a particular previous t-value.
For the nonexceptional cases H(y(t)) is a
symmetric rank one update of the zero matrix
using ∆y and ∆g if ∆g^T ∆y < 0.
The safeguard s(y(t)) is discussed below after n = 1 case.
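The nonexceptional H(y(t)) can be sketched directly: an SR1 update of the zero matrix reduces to the rank-one matrix ∆g ∆g^T / (∆g^T ∆y). The zero return for the skip case is my assumption:

```python
import numpy as np

def sr1_from_zero(dy, dg):
    # Symmetric rank-one update of the zero matrix with dy = ∆y, dg = ∆g:
    #   H = (dg - 0·dy)(dg - 0·dy)^T / ((dg - 0·dy)^T dy)
    #     = dg dg^T / (dg^T dy),
    # used only when dg^T dy < 0 (negative curvature detected).
    curv = dg @ dy
    if curv >= 0:
        # Skip case: returning the zero matrix is my choice here.
        return np.zeros((len(dy), len(dy)))
    return np.outer(dg, dg) / curv

dy = np.array([0.5, 0.0])
dg = np.array([-1.0, 0.2])
H = sr1_from_zero(dy, dg)
# Along the displacement, the model curvature dy^T H dy is negative.
print(dy @ H @ dy < 0)   # True
```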
Definitions of null and serious points
Under boundedness assumptions and general g and e
properties assumed below the null point definition given
next is the weakest one known with the property that if
there is an infinite number of consecutive null steps then
the corresponding δ(x)-sequence converges to zero. This
then implies that the last center x is stationary for f .
The null step point definition has a parameter mN ∈ (0, 1)
while the serious point one has a complementary
Armijo-type f-descent parameter mA ∈ (0, mN)
and a stepsize controlling parameter mS ∈ (0, mN − mA).
Both definitions have a new bundle method parameter
mV ∈ [0, 1].
y+ = x + td(x) with t > 0
is a null step point if
−e(x, y+) + g(x, y+)^T d(x) ≥ −mN δ(x) − mV (1/2)µ|d(x)|²;
is a serious step point if
[f(y+) − f(x)]/t ≤ −mA δ(x) − mV (1/2)µ|d(x)|²
and t ≥ 1 or e(x, y+) ≥ mS δ(x).
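The two tests can be sketched as predicates. The default parameter values below are illustrative choices within the stated ranges, not values from the talk:

```python
def is_null(e_xy, gTd, delta, mu, d_sq, mN=0.5, mV=0.5):
    # Null point test:
    #   -e(x,y+) + g(x,y+)^T d(x) >= -mN*delta(x) - mV*(1/2)*mu*|d(x)|^2
    # with mN in (0,1), mV in [0,1]; d_sq = |d(x)|^2.
    return -e_xy + gTd >= -mN * delta - mV * 0.5 * mu * d_sq

def is_serious(fy, fx, t, e_xy, delta, mu, d_sq,
               mA=0.1, mS=0.05, mV=0.5):
    # Serious point test: Armijo-type descent rate plus stepsize control,
    # with mA in (0, mN) and mS in (0, mN - mA).
    descent = (fy - fx) / t <= -mA * delta - mV * 0.5 * mu * d_sq
    return descent and (t >= 1 or e_xy >= mS * delta)

print(is_null(0.0, -0.3, 1.0, 1.0, 1.0))               # True
print(is_serious(-1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0))  # True
```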
Lemma. If f is convex with g and e given by (Cg) and
(Ce) then x+ 1d(x) is either a null or a serious point.
For t < 1 the inequality with parameter mS results from
the need for (1) a finite line search assuming f is
semismooth and (2) proving stationarity convergence
results if there is an infinite sequence of serious steps,
possibly with a serious t-sequence converging to zero.
Three general conditions for g and e to give
stationarity convergence results
For general f assume g and e satisfy
(G1) e(x, y) ≥ 0, e(x, x) = 0 and g(x, x) = g(x) ∈ ∂f(x);
(G2) if {(x, y)} → (x, x) and {g(x, y)} is bounded
then {e(x, y)} → 0;
(G3) if {(x, y)} → (x, y), {g(x, y)} → g and {e(x, y)} → 0
then g ∈ ∂f(x).
[Special case: g(x, y) ∈ ∂f(x) if e(x, y) = 0]
Lemma. If f is convex with g and e given by (Cg) and
(Ce) then conditions (G1), (G2) and (G3) hold.
Let {yℓ} be the bundle algorithm generated sequence of
y+ points and {xk} be the subsequence of these points
that are also serious points and, hence, bundle centers.
Theorem (Stationarity implied by (G1), (G2) and (G3)):
(i) Suppose xk is the last serious point generated and
{g(xk, yℓ)}ℓ is bounded. Then xk is stationary.
(ii) Suppose {xk} is infinite and bounded,
0 < µmin ≤ µk ≤ µmax < ∞ and
{g(xk, yℓ) : yℓ ∈ Bℓk ∪ {xk+1}} is bounded, where subproblem
P (µk, xk, Bℓk) generates y+ = xk+1.
Then every accumulation point of {xk} is stationary.
? What should g and e be if f is nonconvex ?
(Cg+) g = g + nonconvexity correction?
(Ce+) e = e + nonconvexity correction + safeguard?
Geometry of single variable VU-minimization (n = 1)
An interval [x, y] or [y, x] is called VU-model compatible if
f(x) ≤ f(y) and g(x)T (y − x) ≤ 0.
The n = 1 algorithm generates such intervals and updates
x- and y-side 2nd derivative estimates and two associated
quadratic f-approximates using endpoints of a previous
interval [L,R] strictly containing x and y.
A U-model is defined by the x-side quadratic function qx
and the x-side of the V -model is always defined by the
linearization depending on f(x) and g(x).
The y-side of the V -model is defined as illustrated next
if the y-side quadratic function qy is strictly concave;
else, by the linearization depending on f(y) and g(y) as in
previous figure.
superlinearly convergent for certain piecewise C2 functions
when safeguarded properly
Initially define a safeguard scalar s := (1/2)/|R − L| where
[L, R] is the first interval of uncertainty.
If qy(x) > f(x), i.e. no down-shift, reset s to (1/2)/|y − x|.
Choose d so that x + d is the
V - or U-model minimizer that is closer to x.
Set the next iterate to be the
projection of x + d into the safeguard interval
[min(x, y) + s|y − x|², max(x, y) − s|y − x|²].
In order not to prevent the possibility of superlinear
convergence, s is not reset as long as qy(x) ≤ f(x),
since a reset causes a bisection step.
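The safeguard projection is a one-liner; the names below are mine:

```python
def safeguarded_step(x, y, d, s):
    # Project the V- or U-model minimizer x + d into the safeguard interval
    #   [min(x,y) + s|y-x|^2,  max(x,y) - s|y-x|^2].
    lo = min(x, y) + s * (y - x)**2
    hi = max(x, y) - s * (y - x)**2
    return min(max(x + d, lo), hi)

# With s = (1/2)/|y - x| the interval collapses to the midpoint,
# which is the bisection step mentioned above.
print(safeguarded_step(0.0, 1.0, 2.0, 0.5))   # 0.5
print(safeguarded_step(0.0, 1.0, 2.0, 0.1))   # 0.9
```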
Algebraic definitions of g and e for f on IRn
Given a bundle center x and a point y with associated
data g(y), H(y) and s(y) let
h(x, y) := (x − y)^T H(y)(x − y),
g(x, y) := g(y) if h(x, y) ≥ 0
        := g(y) + H(y)(x − y) otherwise,
e(x, y) := max{ē(x, y) − (1/2) min(0, h(x, y)), s(y)|x − y|²}
where ē(x, y) := f(x) − (f(y) + g(y)^T (x − y))
and s(y) ≥ smin ≥ 0,
with safeguard smin > 0 if f is not convex.
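These algebraic definitions can be sketched directly. Argument names are mine, and e_lin stands for the usual linearization error f(x) − (f(y) + g(y)^T(x − y)):

```python
import numpy as np

def g_and_e(x, y, fx, fy, gy, Hy, sy):
    # Nonconvexity-corrected pair (g, e):
    #   h(x,y) = (x-y)^T H(y) (x-y)
    #   g(x,y) = g(y)               if h >= 0
    #          = g(y) + H(y)(x-y)   otherwise
    #   e(x,y) = max{ e_lin - (1/2)min(0, h),  s(y)|x-y|^2 }
    h = (x - y) @ Hy @ (x - y)
    gxy = gy if h >= 0 else gy + Hy @ (x - y)
    e_lin = fx - (fy + gy @ (x - y))
    exy = max(e_lin - 0.5 * min(0.0, h), sy * ((x - y) @ (x - y)))
    return gxy, exy

# Sanity check of (G1) at y = x: e(x, x) = 0 and g(x, x) = g(x).
x = np.array([1.0, 0.0])
gy = np.array([1.0, 1.0])
gxy, exy = g_and_e(x, x, 0.0, 0.0, gy, np.zeros((2, 2)), 0.1)
print(exy)                    # 0.0
print(np.allclose(gxy, gy))   # True
```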
Lemma : If g(·, ·) and e(·, ·) are so defined then a finite line
search exists for each subproblem if f is semismooth,
(G1) holds,
the result of (G2) holds if, in addition,
{s(y)} is bounded from above as {y} → x,
and the result of (G3) holds if, in addition,
smin > 0 and {H(y)} is bounded as {y} → y.
This Lemma and the definition of g(·, ·) then imply that
the stationarity results of the Theorem hold if smin > 0,
0 < µmin ≤ µk ≤ µmax < ∞ and {yℓ}, {s(yℓ)} and {H(yℓ)} are
all appropriately bounded.
Future research
(i) For the exceptional case when y(1) = x+ d(x) does not
satisfy an Armijo descent test, determine conditions for
when H(y(1)) can be an SR1 update of H(yj) for some yj
active in the bundle that generated d(x). This would
include the angle between d(x) and yj − x being small.
(ii) Determine choices for s(y) in the n-variable case;
dependence on x too as in the 1-variable case?
(iii) Using the above bundle algorithm develop a
VU-algorithm for lower-C2 functions [Janin, 1974],
[Rockafellar, 1982].
This involves choosing values for mV ≤ 1 to generate
very good V-models.
Recall, y+ = x + td(x) with t > 0
is a null step point if
−e(x, y+) + g(x, y+)^T d(x) ≥ −mN δ(x) − mV (1/2)µ|d(x)|²;
is a serious step point if
[f(y+) − f(x)]/t ≤ −mA δ(x) − mV (1/2)µ|d(x)|²
and t ≥ 1 or e(x, y+) ≥ mS δ(x).
Line Search with variable t
The search generates a sequence of nested intervals
[tL, tR] where x+ td(x) with t = tL(tR) does (does not)
satisfy the serious point Armijo f-descent condition.
Start with t = 1 in the initial interval [tL, tR) = [0,∞).
If t = 1 =: tR then enter the interpolation
Loop: If x+ td(x) is a serious or null point, exit.
If [tL, tR] is a VU-model compatible interval compute
the next value of t as in the single variable algorithm.
Else replace t by the bisector of [tL, tR].
Replace the appropriate endpoint of [tL, tR] by t
and go to Loop.
Else (t = 1 =: tL),
sequentially increase t until there is an exit with
t =: tL and g(x+ td(x))Td(x) satisfying a Wolfe test, or
t =: tL and t being too large (i.e. f(x+ td(x)) too small),
or
t =: tR.
In the last case an interpolation phase as above could be
entered to find a serious point, possibly better than the
one given by the tL value at entry to interpolation.
Lemma : If f is semismooth then the above line search is
finite.
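The interpolation/extrapolation mechanics can be sketched with bisection standing in for the single-variable interpolation step. The predicates and the toy one-dimensional model below are assumptions for illustration:

```python
def line_search(armijo_ok, accept, t_init=1.0, grow=2.0, max_iter=60):
    # Skeleton of the line search: maintain [tL, tR] where tL satisfies
    # the serious-point Armijo descent test and tR does not.
    #   armijo_ok(t): Armijo descent test at x + t d(x)
    #   accept(t):    x + t d(x) is a null or serious point
    tL, tR = 0.0, float('inf')
    t = t_init
    for _ in range(max_iter):
        if accept(t):
            return t
        if armijo_ok(t):
            tL = t
            # Extrapolate while no failing tR exists, else bisect.
            t = t * grow if tR == float('inf') else 0.5 * (tL + tR)
        else:
            tR = t
            t = 0.5 * (tL + tR)
    raise RuntimeError("line search did not terminate")

# Toy model of f along the ray: phi(t) = -t + t^2 (my choice).
armijo_ok = lambda t: t <= 0.75                  # phi(t) <= -0.25 t
accept = lambda t: armijo_ok(t) and t >= 0.25    # Wolfe-like acceptance
print(line_search(armijo_ok, accept, t_init=0.1))   # 0.4
```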