TRANSCRIPT
V-models for Nonconvex Minimization:
Prerequisite for a VU-algorithm
Robert Mifflin
http://www.math.wsu.edu/faculty/mifflin
Work with Claudia Sagastizabal
MCA 2013, Guanajuato
Grant support: NSF DMS 0707205, AFOSR FA9550-11-1-0139 and SOARD
Introduction
minx∈IRn f(x); f locally Lipschitz
with Clarke subdifferential ∂f and
only one generalized gradient of f
computed by a black box at each x.
Ultimate Goal: Design a VU-algorithm for such f .
A VU-algorithm implicitly exploits
underlying nonsmooth/smooth structure
to achieve rapid convergence.
Current Goal: Define a bundle method that
achieves convergence to stationary points and
produces good V-models for f .
Outline
- Introduction to minimization with VU-algorithms
- Bundle algorithm concepts and notation for V-models
with negative curvature estimates from line search
- General definitions of null and serious points
- Three general V-model subfunction conditions for
achieving stationarity results in the limit
- Single variable minimization for motivating the n-variable
case and for use in a finite line search
- V-model subfunctions for nonconvex objectives on IRn
- Future work for a complete VU-algorithm
V and U subspaces & graphs of f thereon
A nonconvex pdg-structured example
f(x1, x2, x3) = (1/2)x1² + (1/2) ln(1 + √((x1² − 2x2)² + (x3 − x2)²))
x∗ = (0, 0, 0) is a stationary point (minimizer)
zero subgradient ∈ ∂f(x∗)
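The example is easy to probe numerically. A minimal Python sketch (the function name f is mine) that evaluates the example and checks nonnegativity near the stationary point:

```python
import math

def f(x1, x2, x3):
    # Nonconvex pdg-structured example from the slide:
    # f = (1/2)x1^2 + (1/2)ln(1 + sqrt((x1^2 - 2x2)^2 + (x3 - x2)^2))
    return 0.5 * x1**2 + 0.5 * math.log(
        1.0 + math.sqrt((x1**2 - 2 * x2)**2 + (x3 - x2)**2))

# f >= 0 everywhere (both terms are nonnegative) and f(x*) = 0,
# consistent with x* = (0, 0, 0) being a minimizer.
print(f(0, 0, 0))                                   # 0.0
print(min(f(0.1 * i, 0.1 * j, 0.1 * k)
          for i in (-1, 0, 1) for j in (-1, 0, 1)
          for k in (-1, 0, 1)) >= 0.0)              # True
```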
[Figure: 3-D surface plot of the example f; axis tick labels omitted]
In general, for any x
g ∈ ∂f(x), V(x) := lin(∂f(x)− g) and U(x) := V(x)⊥
A view of f on V(x∗)
f(u, 0, 0)
and for u ∈ U(x∗)
'U-Lagrangian' L⁰U(u) = f(χ(u))
with χ(u) a primal track
[MS, Primal-dual gradient structured functions, SIOPT 13(4):1174-1194, 2003]
Lewis and Overton 8-variable half-and-half function
[MS, A VU-algorithm for convex minimization, Math. Prog. 104(2-3), 583-608, 2005]
Sublinear, linear, and superlinear convergence
Bundle algorithm concepts and notation
g(·): one (generalized) gradient in ∂f(·)
H(·): low rank n × n Hessian matrix
s(·): safeguard scalar for convergence
the last two are used for down-shifting; both are zero if f is convex
B: bundle of data items (yi, f(yi), g(yi), H(yi), s(yi))
x: bundle center = some yj in B
g(x, yi): estimate of a gradient at x
e(x, yi): ≥ 0, so that f(x)− e(x, yi) underestimates f at x
both based on data at yi
φB(x + d): V-model (cutting-plane) function for d ∈ Rn
max{f(x) − e(x, yi) + g(x, yi)^T d : yi ∈ B}
If f is known to be convex then
(Cg) g(x, y) := g(y), independent of x,
(Ce) e(x, y) := f(x) − (f(y) + g(y)^T (x − y)) ≥ 0;
with y = yi this gives a good underestimating V-model.
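For a convex instance the choices (Cg) and (Ce) can be checked numerically. A small sketch under assumed names (the test function f, its subgradient g, and phi_B are my choices):

```python
import numpy as np

# Convex test instance (my choice): f(x) = |x|_1 + (1/2)|x|^2
# with one subgradient g(y) = sign(y) + y.
def f(x):
    return np.abs(x).sum() + 0.5 * (x @ x)

def g(y):
    return np.sign(y) + y

def phi_B(x, d, bundle):
    # V-model under (Cg), (Ce):
    #   g(x, y) = g(y),  e(x, y) = f(x) - (f(y) + g(y)^T (x - y)) >= 0
    vals = []
    for y in bundle:
        e = f(x) - (f(y) + g(y) @ (x - y))
        vals.append(f(x) - e + g(y) @ d)
    return max(vals)

x = np.array([1.0, -0.5])
bundle = [x, np.array([0.2, 0.3]), np.array([-1.0, 1.0])]
d = np.array([-0.4, 0.1])
# Convexity makes each piece a supporting plane,
# so the V-model underestimates f at x + d.
print(phi_B(x, d, bundle) <= f(x + d) + 1e-12)   # True
```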
µ: parameter for proximal subproblem
P (µ, x, B): min{(1/2)µ|d|² + φB(x + d) − f(x) : d ∈ Rn}
equivalent to quadratic program
d(x): subproblem solution d = d(x)
δ(x):= f(x)− φB(x+ d(x)): V-model value change
both also dependent on µ and B
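The subproblem P(µ, x, B) can be sketched through its standard epigraph reformulation as a QP. The SLSQP call and all names below are my choices, not the talk's solver:

```python
import numpy as np
from scipy.optimize import minimize

def solve_prox_subproblem(mu, planes):
    # planes: list of (e_i, g_i) describing the V-model pieces, so that
    #   phi_B(x + d) - f(x) = max_i (-e_i + g_i^T d).
    # Epigraph form of P(mu, x, B): min over (d, v) of (1/2)mu|d|^2 + v
    # subject to v >= -e_i + g_i^T d for every bundle piece i.
    n = len(planes[0][1])
    obj = lambda z: 0.5 * mu * (z[:n] @ z[:n]) + z[n]
    cons = [{'type': 'ineq',
             'fun': lambda z, e=e, g=g: z[n] - (-e + g @ z[:n])}
            for e, g in planes]
    res = minimize(obj, np.zeros(n + 1), method='SLSQP', constraints=cons)
    d = res.x[:n]
    delta = -res.x[n]   # delta(x) = f(x) - phi_B(x + d(x)) = -v
    return d, delta

# One-dimensional example: phi_B(x+d) - f(x) = max(d, -0.5 - d).
# With mu = 1 the minimizer is d = -0.25 and delta = 0.25.
d, delta = solve_prox_subproblem(1.0, [(0.0, np.array([1.0])),
                                       (0.5, np.array([-1.0]))])
```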
serious step: gives adequate f-descent for an x-change
null step: gives adequate improvement of V-model φB
line search needed to generate one if f is nonconvex
Bundle algorithm iteration
Loop: Solve subproblem P (µ, x,B) for d(x), δ(x).
If δ(x) = 0, stop with x stationary.
Else (δ(x) > 0 and d(x) ≠ 0), call for a line search
from x along d(x) with stepsize t > 0 so that
either t ↑ ∞ and f(x + td(x)) ↓ −∞
or it stops with t such that the point
y+ := x + td(x)
is either null or serious.
Append (y+, f(y+), g(y+),H(y+), s(y+)) to B,
delete inactive yi-data from B and
if y+ is serious, then choose µ+ > 0 and
replace (x, f(x), µ) by (y+, f(y+), µ+).
Update all relevant V-model functions
f(x) − e(x, yj) + g(x, yj)^T d
and go to Loop.
Definition of H(y(t)) with y(t):= x+ td(x)
Except when the first line search t-value, t = 1, is not
serious, each t-value has an associated 2nd derivative
estimate computed from two difference vectors
∆y:= (x+ td(x))− (x+ t−d(x)) = (t− t−)d(x),
∆g:= g(x+ td(x))− g(x+ t−d(x))
where t− is a particular previous t-value.
For the nonexceptional cases H(y(t)) is a
symmetric rank one update of the zero matrix
using ∆y and ∆g if ∆g^T ∆y < 0.
The safeguard s(y(t)) is discussed below after n = 1 case.
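The nonexceptional H(y(t)) can be sketched directly: an SR1 update of the zero matrix reduces to the rank-one matrix ∆g ∆g^T / (∆g^T ∆y). The zero return for the skip case is my assumption:

```python
import numpy as np

def sr1_from_zero(dy, dg):
    # Symmetric rank-one update of the zero matrix with dy = ∆y, dg = ∆g:
    #   H = (dg - 0·dy)(dg - 0·dy)^T / ((dg - 0·dy)^T dy)
    #     = dg dg^T / (dg^T dy),
    # used only when dg^T dy < 0 (negative curvature detected).
    curv = dg @ dy
    if curv >= 0:
        # Skip case: returning the zero matrix is my choice here.
        return np.zeros((len(dy), len(dy)))
    return np.outer(dg, dg) / curv

dy = np.array([0.5, 0.0])
dg = np.array([-1.0, 0.2])
H = sr1_from_zero(dy, dg)
# Along the displacement, the model curvature dy^T H dy is negative.
print(dy @ H @ dy < 0)   # True
```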
Definitions of null and serious points
Under boundedness assumptions and general g and e
properties assumed below the null point definition given
next is the weakest one known with the property that if
there is an infinite number of consecutive null steps then
the corresponding δ(x)-sequence converges to zero. This
then implies that the last center x is stationary for f .
The null step point definition has a parameter mN ∈ (0, 1)
while the serious point one has a complementary
Armijo-type f-descent parameter mA ∈ (0, mN)
and a stepsize controlling parameter mS ∈ (0, mN − mA).
Both definitions have a new bundle method parameter
mV ∈ [0, 1].
y+ = x + td(x) with t > 0
is a null step point if
−e(x, y+) + g(x, y+)^T d(x) ≥ −mN δ(x) − mV (1/2)µ|d(x)|²;
is a serious step point if
[f(y+) − f(x)]/t ≤ −mA δ(x) − mV (1/2)µ|d(x)|²
and t ≥ 1 or e(x, y+) ≥ mS δ(x).
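The two tests can be sketched as predicates. The default parameter values below are illustrative choices within the stated ranges, not values from the talk:

```python
def is_null(e_xy, gTd, delta, mu, d_sq, mN=0.5, mV=0.5):
    # Null point test:
    #   -e(x,y+) + g(x,y+)^T d(x) >= -mN*delta(x) - mV*(1/2)*mu*|d(x)|^2
    # with mN in (0,1), mV in [0,1]; d_sq = |d(x)|^2.
    return -e_xy + gTd >= -mN * delta - mV * 0.5 * mu * d_sq

def is_serious(fy, fx, t, e_xy, delta, mu, d_sq,
               mA=0.1, mS=0.05, mV=0.5):
    # Serious point test: Armijo-type descent rate plus stepsize control,
    # with mA in (0, mN) and mS in (0, mN - mA).
    descent = (fy - fx) / t <= -mA * delta - mV * 0.5 * mu * d_sq
    return descent and (t >= 1 or e_xy >= mS * delta)

print(is_null(0.0, -0.3, 1.0, 1.0, 1.0))               # True
print(is_serious(-1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0))  # True
```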
Lemma. If f is convex with g and e given by (Cg) and
(Ce) then x+ 1d(x) is either a null or a serious point.
For t < 1 the inequality with parameter mS results from
the need for (1) a finite line search assuming f is
semismooth and (2) proving stationarity convergence
results if there is an infinite sequence of serious steps,
possibly with a serious t-sequence converging to zero.
Three general conditions for g and e to give
stationarity convergence results
For general f assume g and e satisfy
(G1) e(x, y) ≥ 0, e(x, x) = 0 and g(x, x) = g(x) ∈ ∂f(x);
(G2) if {(x, y)} → (x, x) and {g(x, y)} is bounded
then {e(x, y)} → 0;
(G3) if {(x, y)} → (x, y), {g(x, y)} → g and {e(x, y)} → 0
then g ∈ ∂f(x).
[Special case: g(x, y) ∈ ∂f(x) if e(x, y) = 0]
Lemma. If f is convex with g and e given by (Cg) and
(Ce) then conditions (G1), (G2) and (G3) hold.
Let {yℓ} be the bundle algorithm generated sequence of
y+ points and {xk} be the subsequence of these points
that are also serious points and, hence, bundle centers.
Theorem (Stationarity implied by (G1), (G2) and (G3)):
(i) Suppose xk is the last serious point generated and
{g(xk, yℓ)}ℓ is bounded. Then xk is stationary.
(ii) Suppose {xk} is infinite and bounded,
0 < µmin ≤ µk ≤ µmax < ∞ and
{g(xk, yℓ) : yℓ ∈ Bℓk ∪ {xk+1}} is bounded, where subproblem
P (µk, xk, Bℓk) generates y+ = xk+1.
Then every accumulation point of {xk} is stationary.
? What should g and e be if f is nonconvex ?
(Cg+) g = g + nonconvexity correction?
(Ce+) e = e + nonconvexity correction + safeguard?
Geometry of single variable VU-minimization (n = 1)
An interval [x, y] or [y, x] is called VU-model compatible if
f(x) ≤ f(y) and g(x)T (y − x) ≤ 0.
The n = 1 algorithm generates such intervals and updates
x- and y-side 2nd derivative estimates and two associated
quadratic f-approximates using endpoints of a previous
interval [L,R] strictly containing x and y.
A U-model is defined by the x-side quadratic function qx
and the x-side of the V -model is always defined by the
linearization depending on f(x) and g(x).
The y-side of the V -model is defined as illustrated next
if the y-side quadratic function qy is strictly concave;
else, by the linearization depending on f(y) and g(y) as in
previous figure.
superlinearly convergent for certain piecewise C2 functions
when safeguarded properly
Initially define a safeguard scalar s := (1/2)/|R − L| where
[L, R] is the first interval of uncertainty.
If qy(x) > f(x), i.e. no down-shift, reset s to (1/2)/|y − x|.
Choose d so that x + d is the
V - or U-model minimizer that is closer to x.
Set the next iterate to be the
projection of x + d into the safeguard interval
[min(x, y) + s|y − x|², max(x, y) − s|y − x|²].
In order not to prevent the possibility of superlinear
convergence, s is not reset as long as qy(x) ≤ f(x),
since a reset causes a bisection step.
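The safeguard projection is a one-liner; the names below are mine:

```python
def safeguarded_step(x, y, d, s):
    # Project the V- or U-model minimizer x + d into the safeguard interval
    #   [min(x,y) + s|y-x|^2,  max(x,y) - s|y-x|^2].
    lo = min(x, y) + s * (y - x)**2
    hi = max(x, y) - s * (y - x)**2
    return min(max(x + d, lo), hi)

# With s = (1/2)/|y - x| the interval collapses to the midpoint,
# which is the bisection step mentioned above.
print(safeguarded_step(0.0, 1.0, 2.0, 0.5))   # 0.5
print(safeguarded_step(0.0, 1.0, 2.0, 0.1))   # 0.9
```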
Algebraic definitions of g and e for f on IRn
Given a bundle center x and a point y with associated
data g(y), H(y) and s(y) let
h(x, y) := (x − y)^T H(y)(x − y),
g(x, y) := g(y) if h(x, y) ≥ 0
        := g(y) + H(y)(x − y) otherwise,
e(x, y) := max{ē(x, y) − (1/2) min(0, h(x, y)), s(y)|x − y|²}
where ē(x, y) := f(x) − (f(y) + g(y)^T (x − y))
and s(y) ≥ smin ≥ 0,
with safeguard smin > 0 if f is not convex.
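These algebraic definitions can be sketched directly. Argument names are mine, and e_lin stands for the usual linearization error f(x) − (f(y) + g(y)^T(x − y)):

```python
import numpy as np

def g_and_e(x, y, fx, fy, gy, Hy, sy):
    # Nonconvexity-corrected pair (g, e):
    #   h(x,y) = (x-y)^T H(y) (x-y)
    #   g(x,y) = g(y)               if h >= 0
    #          = g(y) + H(y)(x-y)   otherwise
    #   e(x,y) = max{ e_lin - (1/2)min(0, h),  s(y)|x-y|^2 }
    h = (x - y) @ Hy @ (x - y)
    gxy = gy if h >= 0 else gy + Hy @ (x - y)
    e_lin = fx - (fy + gy @ (x - y))
    exy = max(e_lin - 0.5 * min(0.0, h), sy * ((x - y) @ (x - y)))
    return gxy, exy

# Sanity check of (G1) at y = x: e(x, x) = 0 and g(x, x) = g(x).
x = np.array([1.0, 0.0])
gy = np.array([1.0, 1.0])
gxy, exy = g_and_e(x, x, 0.0, 0.0, gy, np.zeros((2, 2)), 0.1)
print(exy)                    # 0.0
print(np.allclose(gxy, gy))   # True
```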
Lemma : If g(·, ·) and e(·, ·) are so defined then a finite line
search exists for each subproblem if f is semismooth,
(G1) holds,
the result of (G2) holds if, in addition,
{s(y)} is bounded from above as {y} → x,
and the result of (G3) holds if, in addition,
smin > 0 and {H(y)} is bounded as {y} → y.
This Lemma and the definition of g(·, ·) then imply that
the stationarity results of the Theorem hold if smin > 0,
0 < µmin ≤ µk ≤ µmax < ∞ and {yℓ}, {s(yℓ)} and {H(yℓ)} are
all appropriately bounded.
Future research
(i) For the exceptional case when y(1) = x+ d(x) does not
satisfy an Armijo descent test, determine conditions for
when H(y(1)) can be an SR1 update of H(yj) for some yj
active in the bundle that generated d(x). This would
include the angle between d(x) and yj − x being small.
(ii) Determine choices for s(y) in the n-variable case;
dependence on x too as in the 1-variable case?
(iii) Using the above bundle algorithm develop a
VU-algorithm for lower-C2 functions [Janin, 1974],
[Rockafellar, 1982].
This involves choosing values for mV ≤ 1 to generate
very good V-models.
Recall, y+ = x + td(x) with t > 0
is a null step point if
−e(x, y+) + g(x, y+)^T d(x) ≥ −mN δ(x) − mV (1/2)µ|d(x)|²;
is a serious step point if
[f(y+) − f(x)]/t ≤ −mA δ(x) − mV (1/2)µ|d(x)|²
and t ≥ 1 or e(x, y+) ≥ mS δ(x).
Line Search with variable t
The search generates a sequence of nested intervals
[tL, tR] where x+ td(x) with t = tL(tR) does (does not)
satisfy the serious point Armijo f-descent condition.
Start with t = 1 in the initial interval [tL, tR) = [0,∞).
If t = 1 =: tR then enter the interpolation
Loop: If x+ td(x) is a serious or null point, exit.
If [tL, tR] is a VU-model compatible interval compute
the next value of t as in the single variable algorithm.
Else replace t by the bisector of [tL, tR].
Replace the appropriate endpoint of [tL, tR] by t
and go to Loop.
Else (t = 1 =: tL),
sequentially increase t until there is an exit with
t =: tL and g(x+ td(x))Td(x) satisfying a Wolfe test, or
t =: tL and t being too large (i.e. f(x+ td(x)) too small),
or
t =: tR.
In the last case an interpolation phase as above could be
entered to find a serious point, possibly better than the
one given by the tL value at entry to interpolation.
Lemma : If f is semismooth then the above line search is
finite.
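The interpolation/extrapolation mechanics can be sketched with bisection standing in for the single-variable interpolation step. The predicates and the toy one-dimensional model below are assumptions for illustration:

```python
def line_search(armijo_ok, accept, t_init=1.0, grow=2.0, max_iter=60):
    # Skeleton of the line search: maintain [tL, tR] where tL satisfies
    # the serious-point Armijo descent test and tR does not.
    #   armijo_ok(t): Armijo descent test at x + t d(x)
    #   accept(t):    x + t d(x) is a null or serious point
    tL, tR = 0.0, float('inf')
    t = t_init
    for _ in range(max_iter):
        if accept(t):
            return t
        if armijo_ok(t):
            tL = t
            # Extrapolate while no failing tR exists, else bisect.
            t = t * grow if tR == float('inf') else 0.5 * (tL + tR)
        else:
            tR = t
            t = 0.5 * (tL + tR)
    raise RuntimeError("line search did not terminate")

# Toy model of f along the ray: phi(t) = -t + t^2 (my choice).
armijo_ok = lambda t: t <= 0.75                  # phi(t) <= -0.25 t
accept = lambda t: armijo_ok(t) and t >= 0.25    # Wolfe-like acceptance
print(line_search(armijo_ok, accept, t_init=0.1))   # 0.4
```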