Lecture 8: Newton Methods
Xiamen University, April 15-17, 2020
Source: math.xmu.edu.cn/group/nona/damc/lecture08.pdf
1 Basic Newton's Method
Consider the problem
$$\min_{x\in\mathbb{R}^n} f(x),$$
where $f:\mathbb{R}^n\mapsto\mathbb{R}$ is Lipschitz twice continuously differentiable.
A second-order Taylor series approximation to $f$ around $x_k$ is
$$f(x_k+d)\approx f(x_k)+\nabla f(x_k)^T d+\frac{1}{2}d^T\nabla^2 f(x_k)\,d.$$
When $\nabla^2 f(x_k)$ is positive definite, the minimizer $d_k$ of the right-hand side is unique; it is
$$d_k=-\nabla^2 f(x_k)^{-1}\nabla f(x_k).$$
Basic Newton's iteration:
$$x_{k+1}=x_k-\nabla^2 f(x_k)^{-1}\nabla f(x_k).$$
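As a concrete illustration of the iteration above, here is a minimal NumPy sketch (the test function, tolerance, and iteration cap are assumptions for this example, not part of the lecture):

```python
import numpy as np

def newton(x0, grad, hess, tol=1e-10, max_iter=50):
    """Basic Newton's method: x_{k+1} = x_k - [hess f(x_k)]^{-1} grad f(x_k)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:          # stop once the gradient is small
            break
        d = np.linalg.solve(hess(x), -g)      # Newton direction (solve, do not invert)
        x = x + d                             # unit step: no line search or trust region
    return x

# Example: minimize f(x) = x1^4 + x2^2 from the starting point (1, 1).
grad = lambda x: np.array([4 * x[0] ** 3, 2 * x[1]])
hess = lambda x: np.array([[12 * x[0] ** 2, 0.0], [0.0, 2.0]])
print(newton(np.array([1.0, 1.0]), grad, hess))
```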
1.1 Newton's Method vs Steepest Descent vs CG
Given SPD $A\in\mathbb{R}^{n\times n}$,
$$A^{-1}b=\arg\min_{x\in\mathbb{R}^n}\ \frac{1}{2}x^T A x-b^T x.$$
Steepest descent iteration:
$$x_{k+1}=x_k-\frac{(Ax_k-b)^T(Ax_k-b)}{(Ax_k-b)^T A(Ax_k-b)}\,(Ax_k-b).$$
Newton's iteration: $x_1=x_0-A^{-1}(Ax_0-b)=A^{-1}b$, i.e., Newton's method reaches the minimizer in a single step.
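A quick numerical check of this contrast, on an assumed random SPD system: steepest descent with exact line search makes slow progress, while one Newton step lands exactly on $A^{-1}b$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)              # SPD matrix (assumption for this demo)
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)

# Steepest descent with exact line search on the quadratic.
x = np.zeros(n)
for _ in range(100):
    r = A @ x - b                        # gradient of (1/2) x^T A x - b^T x
    x = x - (r @ r) / (r @ A @ r) * r
print("steepest descent error after 100 steps:", np.linalg.norm(x - x_star))

# Newton's iteration: a single step from any starting point.
x0 = np.zeros(n)
x1 = x0 - np.linalg.solve(A, A @ x0 - b)
print("Newton error after 1 step:", np.linalg.norm(x1 - x_star))
```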
Theorem 1 (Local quadratic convergence)
Suppose $f(x)$ is twice Lipschitz continuously differentiable with Lipschitz constant $M$, i.e.,
$$\|\nabla^2 f(x)-\nabla^2 f(y)\|\le M\|x-y\|.$$
Suppose that (the second-order sufficient conditions)
$$\nabla f(x_\star)=0\quad\text{and}\quad\nabla^2 f(x_\star)\succeq\gamma I\ \text{for some }\gamma>0,$$
which ensure that $x_\star$ is a local minimizer of $f(x)$. If
$$\|x_0-x_\star\|\le\frac{\gamma}{2M},$$
then the sequence $\{x_k\}_{k=0}^{\infty}$ in Newton's method converges to $x_\star$ at a quadratic rate, with
$$\|x_{k+1}-x_\star\|\le\frac{M}{\gamma}\|x_k-x_\star\|^2,\qquad k=0,1,2,\ldots$$
2 DD + Newton for smooth strongly convex functions
If $f$ is $\gamma$-strongly convex and $\nabla f$ is $L$-Lipschitz continuous, then $\nabla^2 f(x)$ is positive definite and $\gamma I\preceq\nabla^2 f(x)\preceq LI$. The Newton direction
$$d_k=-\nabla^2 f(x_k)^{-1}\nabla f(x_k)$$
is a descent direction satisfying
$$\nabla f(x_k)^T d_k\le-\frac{\gamma}{L}\,\|\nabla f(x_k)\|\,\|d_k\|.$$
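The bound follows directly from $\gamma I\preceq\nabla^2 f(x_k)\preceq LI$; a short verification:
$$\nabla f(x_k)^T d_k=-\nabla f(x_k)^T\nabla^2 f(x_k)^{-1}\nabla f(x_k)\le-\frac{1}{L}\|\nabla f(x_k)\|^2\le-\frac{\gamma}{L}\|\nabla f(x_k)\|\,\|d_k\|,$$
where the first inequality uses $\nabla^2 f(x_k)^{-1}\succeq\frac{1}{L}I$ and the second uses $\|d_k\|=\|\nabla^2 f(x_k)^{-1}\nabla f(x_k)\|\le\frac{1}{\gamma}\|\nabla f(x_k)\|$.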
The DD method using the Newton direction yields $x_k\to x_\star$, where $x_\star$ is the (unique) global minimizer of $f$.
Two-stage method: DD + Newton.
The global sublinear convergence of DD is enhanced to local quadratic convergence if we use $\alpha_k=1$ whenever it satisfies the weak Wolfe conditions.
DD + Newton: global sublinear + local quadratic convergence.
3 DD + Newton for smooth convex functions
If $f$ is convex but not strongly convex and $\nabla f$ is $L$-Lipschitz continuous, then $\nabla^2 f(x)$ may be singular for some $x$, i.e.,
$$0\preceq\nabla^2 f(x)\preceq LI.$$
So the Newton direction may not be well defined.
Consider the modified Newton direction
$$d_k=-[\nabla^2 f(x_k)+\lambda_k I]^{-1}\nabla f(x_k),\qquad\lambda_k>0,$$
which is a descent direction. The DD method using the modified Newton direction yields $x_k\to x_\star$, where $x_\star$ is a minimizer of $f$.
Two-stage method: DD + Newton.
If the minimizer $x_\star$ is unique and $\nabla^2 f(x_\star)$ is positive definite, then $\nabla^2 f(x_k)$ will be positive definite for sufficiently large $k$.
DD + Newton: global sublinear + local quadratic convergence.
4 DD + Newton for smooth nonconvex functions
For smooth nonconvex $f$, the Hessian $\nabla^2 f(x_k)$ may be indefinite for some $k$. The Newton direction may not exist (when $\nabla^2 f(x_k)$ is singular), or it may not be a descent direction (when $\nabla^2 f(x_k)$ has negative eigenvalues). The modified Newton direction
$$d_k=-[\nabla^2 f(x_k)+\lambda_k I]^{-1}\nabla f(x_k)$$
will be a descent direction for $\lambda_k$ sufficiently large. For given $0<\eta<1$, a sufficient condition is
$$\frac{\lambda_k+\lambda_{\min}(\nabla^2 f(x_k))}{\lambda_k+L}\ge\eta.$$
Two-stage method: DD + Newton.
Once again, if the DD iterates $x_k$ enter the neighborhood of a local solution $x_\star$ for which $\nabla^2 f(x_\star)$ is positive definite, some strategy for choosing $\lambda_k$ and $\alpha_k$ recovers the local quadratic convergence.
4.1 Other modified Newton directions
Modified Cholesky factorization: for indefinite $\nabla^2 f(x_k)$, by adding positive elements where needed (to avoid taking the square root of a negative number), the factorization can still proceed. Using the modified factorization in place of $\nabla^2 f(x_k)$ in the calculation of the Newton direction $d_k$, we obtain a new modified Newton direction.
Given the eigenvalue decomposition
$$\nabla^2 f(x_k)=Q_k\Lambda_k Q_k^T,$$
we can define a modified Newton direction
$$d_k=-Q_k\hat{\Lambda}_k^{-1}Q_k^T\nabla f(x_k),$$
where $\hat{\Lambda}_k$, with positive diagonal entries, is a modified version of $\Lambda_k$.
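A minimal NumPy sketch of this eigenvalue-based modification (the flooring rule $\hat\lambda_i=\max(|\lambda_i|,\delta)$ is one common choice and is an assumption here, not the lecture's prescription):

```python
import numpy as np

def modified_newton_direction(H, g, delta=1e-6):
    """d = -Q diag(1/lam_hat) Q^T g, where lam_hat modifies the eigenvalues of H."""
    lam, Q = np.linalg.eigh(H)                # H = Q diag(lam) Q^T for symmetric H
    lam_hat = np.maximum(np.abs(lam), delta)  # modified eigenvalues, positive by construction
    return -Q @ ((Q.T @ g) / lam_hat)

# On an indefinite Hessian the result is still a descent direction: g^T d < 0.
H = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 1.0])
d = modified_newton_direction(H, g)
print(g @ d)   # negative
```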
For more modified Newton directions to ensure descent in a DD framework, see Chapter 3 of NO.
5 Trust region method
The trust-region subproblem: given $g_k$ and symmetric $B_k$,
$$\min_d\ f(x_k)+g_k^T d+\frac{1}{2}d^T B_k d\quad\text{s.t.}\quad\|d\|\le\Delta_k,$$
where $\Delta_k$ is the radius of the trust region in which the quadratic $f(x_k)+g_k^T d+\frac{1}{2}d^T B_k d$ "well" captures the true behavior of $f$.
The solution $d_k$ of the subproblem satisfies the linear system
$$[B_k+\lambda I]\,d_k=-g_k\quad\text{for some }\lambda\ge 0,$$
where $\lambda$ is chosen such that $B_k+\lambda I$ is positive semidefinite and $\lambda(\|d_k\|-\Delta_k)=0$. (Exercise; [Sorensen etc.])
Solving the subproblem thus reduces to a search for the value of $\lambda$. Specialized methods have been devised [Sorensen etc.].
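To make the one-dimensional search concrete, here is an illustrative bisection on $\lambda$ (a rough sketch only: it ignores the so-called hard case, and the specialized methods cited above are considerably more efficient):

```python
import numpy as np

def tr_subproblem_bisection(B, g, Delta, lam_hi=1e6, tol=1e-8):
    """Sketch: find d with [B + lam I] d = -g, lam >= 0, and lam * (||d|| - Delta) = 0."""
    n = len(g)
    lam_min_B = np.linalg.eigvalsh(B)[0]
    if lam_min_B > 0:                              # interior solution if it fits in the region
        d = np.linalg.solve(B, -g)
        if np.linalg.norm(d) <= Delta:
            return d, 0.0
    # ||d(lam)|| decreases as lam grows, so bisect for ||d(lam)|| = Delta.
    lo = max(0.0, -lam_min_B) + 1e-12
    hi = lam_hi
    for _ in range(200):
        lam = 0.5 * (lo + hi)
        d = np.linalg.solve(B + lam * np.eye(n), -g)
        if np.linalg.norm(d) > Delta:
            lo = lam
        else:
            hi = lam
        if hi - lo < tol:
            break
    return d, lam
```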
The trust-region method procedure
Define the ratio $\rho_k$ between the actual decrease in $f$ and the amount of decrease in the quadratic objective:
$$\rho_k=\frac{f(x_k+d_k)-f(x_k)}{\frac{1}{2}(d_k)^T B_k d_k+g_k^T d_k}.$$
If $\rho_k$ is greater than a small tolerance (e.g., 0.1), we accept the step and proceed to the next iteration. Otherwise, the trust-region radius $\Delta_k$ is too large, so we do not take the step; we shrink the trust region and re-solve the new subproblem to obtain a new step.
If $\rho_k$ is close to 1 and the bound $\|d_k\|\le\Delta_k$ is active (i.e., $\|d_k\|=\Delta_k$), we conclude that a larger trust region may hasten progress, so we increase $\Delta_k$ for the next iteration.
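A schematic of this accept/shrink/grow logic (the tolerance 0.1 and the factors 1/4 and 2 are common choices assumed for the sketch; it reuses `tr_subproblem_bisection` from the earlier sketch and measures $\rho_k$ equivalently as a ratio of positive decreases):

```python
import numpy as np

def trust_region(f, grad, hess, x0, Delta0=1.0, Delta_max=100.0,
                 eta=0.1, tol=1e-8, max_iter=200):
    """Trust-region Newton method driven by the ratio rho_k (schematic)."""
    x, Delta = np.asarray(x0, dtype=float), Delta0
    for _ in range(max_iter):
        g, B = grad(x), hess(x)
        if np.linalg.norm(g) <= tol:
            break
        d, _ = tr_subproblem_bisection(B, g, Delta)  # subproblem solver sketched earlier
        pred = -(g @ d + 0.5 * d @ B @ d)            # decrease predicted by the quadratic model
        actual = f(x) - f(x + d)                     # actual decrease in f
        rho = actual / pred
        if rho >= eta:                               # accept the step
            x = x + d
        else:                                        # reject: the region is too large, shrink it
            Delta *= 0.25
        if rho > 0.75 and np.linalg.norm(d) >= 0.99 * Delta:
            Delta = min(2.0 * Delta, Delta_max)      # bound active and model good: grow the region
    return x
```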
5.1 Dogleg method for the trust-region subproblem
For large-scale problems, it may be too expensive to solve the trust-region subproblem near-exactly, since the process may require several factorizations of $B_k+\lambda I$ for different values of $\lambda$.
A popular approach for finding approximate solutions, which can be used when $B_k$ is positive definite, is the dogleg method.
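A compact sketch of the standard dogleg step for positive definite $B_k$: the path runs from the origin to the minimizer along the steepest-descent direction (the Cauchy point) and then toward the full Newton step, truncated at the trust-region boundary.

```python
import numpy as np

def dogleg_step(B, g, Delta):
    """Dogleg approximation to the trust-region subproblem (B positive definite)."""
    d_newton = np.linalg.solve(B, -g)             # full (unconstrained) Newton step
    if np.linalg.norm(d_newton) <= Delta:
        return d_newton
    d_cauchy = -(g @ g) / (g @ B @ g) * g         # minimizer along the steepest-descent direction
    if np.linalg.norm(d_cauchy) >= Delta:
        return -Delta * g / np.linalg.norm(g)     # truncated steepest-descent step
    # Walk from the Cauchy point toward the Newton point until ||d|| = Delta.
    v = d_newton - d_cauchy
    a = v @ v
    b = 2.0 * (d_cauchy @ v)
    c = d_cauchy @ d_cauchy - Delta ** 2
    tau = (-b + np.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)
    return d_cauchy + tau * v
```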
5.2 Trust-region Newton method
The subproblem:
$$\min_d\ f(x_k)+\nabla f(x_k)^T d+\frac{1}{2}d^T\nabla^2 f(x_k)\,d\quad\text{s.t.}\quad\|d\|\le\Delta_k.$$
The trust-region Newton method can "escape" from a saddle point. Suppose $\nabla f(x_k)=0$ and $\nabla^2 f(x_k)$ is indefinite with some strictly negative eigenvalues. Then the solution $d_k$ of the subproblem will be nonzero, and the algorithm will step away from the saddle point $x_k$ in the direction of most negative curvature for $\nabla^2 f(x_k)$. This guarantees that any accumulation points will satisfy second-order necessary conditions.
Another appealing feature of the trust-region Newton approach is that when the sequence $\{x_k\}$ approaches a point $x_\star$ satisfying second-order sufficient conditions, the trust-region bound becomes inactive and the method takes basic Newton steps for all sufficiently large $k$, so it has local quadratic convergence.
5.3 Difference between line-search and trust-region methods
The basic difference between line-search and trust-region methods can be summarized as follows.
Line-search methods first choose a direction $d_k$, then decide how far to move along that direction.
Trust-region methods do the opposite: they choose the distance $\Delta_k$ first, then find the direction that makes the best progress for this step length.
6 Cubic regularization approaches
Assume that $\|\nabla^2 f(x)-\nabla^2 f(y)\|\le M\|x-y\|$. Then
$$T_M(z,x)=f(x)+\nabla f(x)^T(z-x)+\frac{1}{2}(z-x)^T\nabla^2 f(x)(z-x)+\frac{M}{6}\|z-x\|^3\ \ge\ f(z).$$
Approach I: the basic cubic regularization algorithm
$$x_{k+1}=\arg\min_z\ T_M(z,x_k),\qquad k=0,1,2,\ldots$$
Approach II: seek $\hat{x}$ approximately satisfying second-order necessary conditions, that is,
$$\|\nabla f(\hat{x})\|\le\varepsilon_g,\qquad\lambda_{\min}(\nabla^2 f(\hat{x}))\ge-\varepsilon_H,$$
where $\varepsilon_g$ and $\varepsilon_H$ are two small positive constants.
Assume $\nabla^2 f$ is $M$-Lipschitz continuous,
$$\|\nabla^2 f(x)-\nabla^2 f(y)\|\le M\|x-y\|,$$
$\nabla f$ is $L$-Lipschitz continuous,
$$\|\nabla f(x)-\nabla f(y)\|\le L\|x-y\|,$$
and $f$ is lower-bounded, $f(x)\ge\bar{f}$.
(i) If $\|\nabla f(x_k)\|>\varepsilon_g$, set
$$x_{k+1}=x_k-\frac{1}{L}\nabla f(x_k).$$
(ii) If $\|\nabla f(x_k)\|\le\varepsilon_g$ and $\lambda_{\min}(\nabla^2 f(x_k))<-\varepsilon_H$, choose $d_k$ to be the eigenvector corresponding to $\lambda_{\min}(\nabla^2 f(x_k))$. Choose the size and sign of $d_k$ such that
$$\|d_k\|=1\quad\text{and}\quad\nabla f(x_k)^T d_k\le 0.$$
Set
$$x_{k+1}=x_k+\alpha_k d_k,\qquad\text{where }\alpha_k=\frac{2\varepsilon_H}{M}.$$
(iii) If neither of these conditions holds, then $x_k$ satisfies the approximate second-order necessary conditions, so we terminate.
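The three cases translate directly into a short NumPy sketch (a rough transcription assuming `grad` and `hess` callables and given constants $\varepsilon_g$, $\varepsilon_H$, $L$, $M$):

```python
import numpy as np

def approx_second_order_point(x0, grad, hess, eps_g, eps_H, L, M, max_iter=1000):
    """Gradient / negative-curvature steps toward an approximate second-order point."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) > eps_g:             # step (i): steepest-descent step
            x = x - g / L
            continue
        lam, Q = np.linalg.eigh(hess(x))
        if lam[0] < -eps_H:                       # step (ii): negative-curvature step
            d = Q[:, 0]                           # unit eigenvector of the smallest eigenvalue
            if g @ d > 0:                         # flip the sign so that grad^T d <= 0
                d = -d
            x = x + (2.0 * eps_H / M) * d
            continue
        break                                     # step (iii): approximate conditions hold
    return x
```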
For the steepest-descent step (i),
$$f(x_{k+1})\le f(x_k)-\frac{1}{2L}\|\nabla f(x_k)\|^2\le f(x_k)-\frac{\varepsilon_g^2}{2L}.$$
For a step of type (ii),
$$\begin{aligned}
f(x_{k+1})&\le f(x_k)+\alpha_k\nabla f(x_k)^T d_k+\frac{1}{2}\alpha_k^2(d_k)^T\nabla^2 f(x_k)\,d_k+\frac{1}{6}M\alpha_k^3\|d_k\|^3\\
&\le f(x_k)-\frac{1}{2}\left(\frac{2\varepsilon_H}{M}\right)^2\varepsilon_H+\frac{1}{6}M\left(\frac{2\varepsilon_H}{M}\right)^3\\
&=f(x_k)-\frac{2}{3}\,\frac{\varepsilon_H^3}{M^2}.
\end{aligned}$$
We attain a per-iteration decrease in the objective of at least
$$\min\left(\frac{\varepsilon_g^2}{2L},\ \frac{2}{3}\,\frac{\varepsilon_H^3}{M^2}\right).$$
Newton Methods Lecture 8 April 15 - 17 2020 16 16
1 Basic Newtonrsquos Method
Consider the problemminxisinRn
f(x)
where f Rn 983041rarr R is Lipschitz twice continuously differentiable
A second-order Taylor series approximation to f around xk is
f(xk + d) asymp f(xk) +nablaf(xk)Td+1
2dTnabla2f(xk)d
When nabla2f(xk) is positive definite the minimizer dk of theright-hand side is unique it is
dk = minusnabla2f(xk)minus1nablaf(xk)
Basic Newtonrsquos iteration
xk+1 = xk minusnabla2f(xk)minus1nablaf(xk)
Newton Methods Lecture 8 April 15 - 17 2020 2 16
11 Newtonrsquos Method vs Steepest Descent vs CG
Given SPD A isin Rntimesn
Aminus1b = argminxisinRn
1
2xTAxminus bTx
Steepest Descent iteration
xk+1 = xk minus (Axk minus b)T(Axk minus b)
(Axk minus b)TA(Axk minus b)(Axk minus b)
Newtonrsquos iteration x1 = x0 minusAminus1(Ax0 minus b)
Newton Methods Lecture 8 April 15 - 17 2020 3 16
Theorem 1 (Local quadratic convergence)
Suppose f(x) is twice Lipschitz continuously differentiable withLipschitz constant M ie
983042nabla2f(x)minusnabla2f(y)983042 le M983042xminus y983042
Suppose that (the second-order sufficient conditions)
nablaf(x983183) = 0 and nabla2f(x983183) ≽ γI for some γ gt 0
which ensure that x983183 is a local minimizer of f(x) If
983042x0 minus x983183983042 le γ
2M
then the sequence xkinfin0 in Newtonrsquos method converges to x983183 at aquadratic rate with
983042xk+1 minus x983183983042 le M
γ983042xk minus x9831839830422 k = 0 1 2
Newton Methods Lecture 8 April 15 - 17 2020 4 16
2 DD + Newton for smooth strongly convex functions
If f is γ-strongly convex and nablaf is L-Lipschitz continuous thennabla2f(x) is positive definite and γI ≼ nabla2f(x) ≼ LI The Newtondirection
dk = minusnabla2f(xk)minus1nablaf(xk)
is a descent direction satisfying
nablaf(xk)Tdk le minus γ
L983042nablaf(xk)983042983042dk983042
The DD method using the Newton direction yields xk rarr x983183 wherex983183 is the (unique) global minimizer of f
Two stage method DD + Newton
Global sublinear convergence of DD is enhanced to the localquadratic convergence if we use αk = 1 whenever it satisfies theweak Wolfe conditions
DD + Newton global sublinear + local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 5 16
3 DD + Newton for smooth convex functions
If f is convex but not strongly convex and nablaf is L-Lipschitzcontinuous then nabla2f(x) may be singular for some x ie
0 ≼ nabla2f(x) ≼ LI
So the Newton direction may not be well defined
Consider the modified Newton direction
dk = minus[nabla2f(xk) + λkI]minus1nablaf(xk)
which is a descent direction The DD method using the modifiedNewton direction yields xk rarr x983183 where x983183 is a minimizer of f
Two stage method DD + Newton
If the minimizer x983183 is unique and nabla2f(x983183) is positive definite thennabla2f(x983183) will be positive definite for sufficiently large k
DD + Newton global sublinear + local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 6 16
4 DD + Newton for smooth nonconvex functions
For smooth nonconvex f the Hessian nabla2f(xk) may be indefinitefor some k The Newton direction may not exist (when nabla2f(xk) issingular) or it may not be a descent direction (when nabla2f(xk) hasnegative eigenvalues) The modified Newton direction
dk = minus[nabla2f(xk) + λkI]minus1nablaf(xk)
will be a descent direction for λk sufficiently large For given0 lt η lt 1 a sufficient condition is
λk + λmin(nabla2f(xk))
λk + Lge η
Two stage method DD + Newton
Once again if the DD iterates xk enter the neighborhood of a localsolution x983183 for which nabla2f(x983183) is positive definite some strategyfor choosing λk and αk recovers the local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 7 16
41 Other modified Newton directions
Modified Cholesky factorization For indefinite nabla2f(xk) by addingpositive elements if needed to avoid taking the square root of anegative number the factorization continues to proceed Using themodified factorization in place of nabla2f(xk) in the calculation of theNewton direction dk we obtain a new modified Newton direction
Given eigenvalue decomposition
nabla2f(xk) = QkΛkQTk
we can define a modified Newton direction
dk = minusQk983144Λminus1k QT
knablaf(xk)
where 983144Λk with positive diagonal entries is a modified version of Λk
For more modified Newton directions to ensure descent in a DDframework see Chapter 3 of NO
Newton Methods Lecture 8 April 15 - 17 2020 8 16
5 Trust region method
The trust-region subproblem Given gk and symmetric Bk
mind
f(xk) + gTk d+
1
2dTBkd st 983042d983042 le ∆k
where ∆k is the radius of the trust region in which the quadraticf(xk) + gT
k d+ 12d
TBkd ldquowellrdquo captures the true behavior of f
The solution dk of the subproblem satisfies the linear system
[Bk + λI]dk = minusgk for some λ ge 0
where λ is chosen such that Bk + λI is positive semidefinite andλ(983042dk983042 minus∆k) = 0 (Exercise [Sorensen etc])
Solving the subproblem thus reduces to a search for the value of λSpecialized methods have been devised [Sorensen etc]
Newton Methods Lecture 8 April 15 - 17 2020 9 16
The trust-region method procedure
Define the ratio ρk between the actual decrease in f and theamount of decrease in the quadratic objective
ρk =f(xk + dk)minus f(xk)1
2(dk)TBkd
k + gTk d
k
If ρk is at least greater than a small tolerance (eg 01) we acceptthe step and proceed to the next iteration Otherwise the trustregion radius ∆k is too large so we do not take the step shrink thetrust region and resolve the new subproblem to obtain a new step
If ρk is close to 1 and the bound 983042dk983042 le ∆k is active (ie983042dk983042 = ∆k) we conclude that a larger trust region may hastenprogress so we increase ∆k for the next iteration
Newton Methods Lecture 8 April 15 - 17 2020 10 16
51 Dogleg method for trust region subproblem
For large-scale problems it may be too expensive to solvetrust-region subproblem near-exactly since the process mayrequire several factorizations of Bk + λI for different values of λ
A popular approach for finding approximate solutions which canbe used when Bk is positive definite is the dogleg method
Newton Methods Lecture 8 April 15 - 17 2020 11 16
52 Trust-region Newton method
The subproblem
mind
f(xk) +nablaf(xk)Td+1
2dTnabla2f(xk)d st 983042d983042 le ∆k
The trust-region Newton method can ldquoescaperdquo from a saddlepoint Suppose nablaf(xk) = 0 and nabla2f(xk) indefinite with somestrictly negative eigenvalues Then the solution dk of thesubproblem will be nonzero and the algorithm will step away fromthe saddle point xk in the direction of most negative curvature fornabla2f(xk) This guarantees that any accumulation points willsatisfy second-order necessary conditions
Another appealing feature of the trust-region Newton approach isthat when the sequence xk approaches a point x983183 satisfyingsecond-order sufficient conditions the trust region bound becomesinactive and the method takes basic Newton steps for allsufficiently large k so it has local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 12 16
53 Difference between line-search and trust-region methods
The basic difference between line-search and trust-region methodscan be summarized as follows
Line-search methods first choose a direction dk then decide howfar to move along that direction
Trust-region methods do the opposite They choose the distance∆k first then find the direction that makes the best progress forthis step length
6 Cubic regularization approachs
Assume that 983042nabla2f(x)minusnabla2f(y)983042 le M983042xminus y983042 Then
TM (zx) = f(x) +nablaf(x)T(zminus x)
+1
2(zminus x)Tnabla2f(x)(zminus x) +
M
6983042zminus x9830423
ge f(z)
Newton Methods Lecture 8 April 15 - 17 2020 13 16
Approach I the basic cubic regularization algorithm
xk+1 = argminz
TM (zxk) k = 0 1 2
Approach II Seek 983141x approximately satisfying second-ordernecessary conditions that is
983042nablaf(983141x)983042 le εg λmin(nabla2f(983141x)) ge minusεH
where εg and εH are two small positive constants
Assume nabla2f is M -Lipschitz continuous
983042nabla2f(x)minusnabla2f(y)983042 le M983042xminus y983042
nablaf is L-Lipschitz continuous
983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042
and f is lower-bounded f(x) ge f
Newton Methods Lecture 8 April 15 - 17 2020 14 16
(i) If 983042nablaf(xk)983042 gt εg set
xk+1 = xk minus 1
Lnablaf(xk)
(ii) If 983042nablaf(xk)983042 le εg and λmin(nabla2f(xk)) lt minusεH choose dk to be theeigenvector corresponding to λmin(nabla2f(xk)) Choose the size andsign of dk such that
983042dk983042 = 1
andnablaf(xk)Tdk le 0
Set
xk+1 = xk + αkdk where αk =
2εHM
(iii) If neither of these conditions hold then xk satisfies theapproximate second-order necessary conditions so we terminate
Newton Methods Lecture 8 April 15 - 17 2020 15 16
For the steepest-descent step (i)
f(xk+1) le f(xk)minus 1
2L983042nablaf(xk)9830422 le f(xk)minus
ε2g2L
For a step of type (ii)
f(xk+1) 983249 f(xk) + αknablaf(xk)⊤dk
+1
2α2k(d
k)Tnabla2f(xk)dk +1
6Mα3
k983042dk9830423
983249 f(xk)minus 1
2
9830612εHM
9830622
εH +1
6M
9830612εHM
9830623
= f(xk)minus 2
3
ε3HM2
We attain a decrease in the objective of at least
min
983075ε2g2L
2
3
ε3HM2
983076
Newton Methods Lecture 8 April 15 - 17 2020 16 16
11 Newtonrsquos Method vs Steepest Descent vs CG
Given SPD A isin Rntimesn
Aminus1b = argminxisinRn
1
2xTAxminus bTx
Steepest Descent iteration
xk+1 = xk minus (Axk minus b)T(Axk minus b)
(Axk minus b)TA(Axk minus b)(Axk minus b)
Newtonrsquos iteration x1 = x0 minusAminus1(Ax0 minus b)
Newton Methods Lecture 8 April 15 - 17 2020 3 16
Theorem 1 (Local quadratic convergence)
Suppose f(x) is twice Lipschitz continuously differentiable withLipschitz constant M ie
983042nabla2f(x)minusnabla2f(y)983042 le M983042xminus y983042
Suppose that (the second-order sufficient conditions)
nablaf(x983183) = 0 and nabla2f(x983183) ≽ γI for some γ gt 0
which ensure that x983183 is a local minimizer of f(x) If
983042x0 minus x983183983042 le γ
2M
then the sequence xkinfin0 in Newtonrsquos method converges to x983183 at aquadratic rate with
983042xk+1 minus x983183983042 le M
γ983042xk minus x9831839830422 k = 0 1 2
Newton Methods Lecture 8 April 15 - 17 2020 4 16
2 DD + Newton for smooth strongly convex functions
If f is γ-strongly convex and nablaf is L-Lipschitz continuous thennabla2f(x) is positive definite and γI ≼ nabla2f(x) ≼ LI The Newtondirection
dk = minusnabla2f(xk)minus1nablaf(xk)
is a descent direction satisfying
nablaf(xk)Tdk le minus γ
L983042nablaf(xk)983042983042dk983042
The DD method using the Newton direction yields xk rarr x983183 wherex983183 is the (unique) global minimizer of f
Two stage method DD + Newton
Global sublinear convergence of DD is enhanced to the localquadratic convergence if we use αk = 1 whenever it satisfies theweak Wolfe conditions
DD + Newton global sublinear + local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 5 16
3 DD + Newton for smooth convex functions
If f is convex but not strongly convex and nablaf is L-Lipschitzcontinuous then nabla2f(x) may be singular for some x ie
0 ≼ nabla2f(x) ≼ LI
So the Newton direction may not be well defined
Consider the modified Newton direction
dk = minus[nabla2f(xk) + λkI]minus1nablaf(xk)
which is a descent direction The DD method using the modifiedNewton direction yields xk rarr x983183 where x983183 is a minimizer of f
Two stage method DD + Newton
If the minimizer x983183 is unique and nabla2f(x983183) is positive definite thennabla2f(x983183) will be positive definite for sufficiently large k
DD + Newton global sublinear + local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 6 16
4 DD + Newton for smooth nonconvex functions
For smooth nonconvex f the Hessian nabla2f(xk) may be indefinitefor some k The Newton direction may not exist (when nabla2f(xk) issingular) or it may not be a descent direction (when nabla2f(xk) hasnegative eigenvalues) The modified Newton direction
dk = minus[nabla2f(xk) + λkI]minus1nablaf(xk)
will be a descent direction for λk sufficiently large For given0 lt η lt 1 a sufficient condition is
λk + λmin(nabla2f(xk))
λk + Lge η
Two stage method DD + Newton
Once again if the DD iterates xk enter the neighborhood of a localsolution x983183 for which nabla2f(x983183) is positive definite some strategyfor choosing λk and αk recovers the local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 7 16
41 Other modified Newton directions
Modified Cholesky factorization For indefinite nabla2f(xk) by addingpositive elements if needed to avoid taking the square root of anegative number the factorization continues to proceed Using themodified factorization in place of nabla2f(xk) in the calculation of theNewton direction dk we obtain a new modified Newton direction
Given eigenvalue decomposition
nabla2f(xk) = QkΛkQTk
we can define a modified Newton direction
dk = minusQk983144Λminus1k QT
knablaf(xk)
where 983144Λk with positive diagonal entries is a modified version of Λk
For more modified Newton directions to ensure descent in a DDframework see Chapter 3 of NO
Newton Methods Lecture 8 April 15 - 17 2020 8 16
5 Trust region method
The trust-region subproblem Given gk and symmetric Bk
mind
f(xk) + gTk d+
1
2dTBkd st 983042d983042 le ∆k
where ∆k is the radius of the trust region in which the quadraticf(xk) + gT
k d+ 12d
TBkd ldquowellrdquo captures the true behavior of f
The solution dk of the subproblem satisfies the linear system
[Bk + λI]dk = minusgk for some λ ge 0
where λ is chosen such that Bk + λI is positive semidefinite andλ(983042dk983042 minus∆k) = 0 (Exercise [Sorensen etc])
Solving the subproblem thus reduces to a search for the value of λSpecialized methods have been devised [Sorensen etc]
Newton Methods Lecture 8 April 15 - 17 2020 9 16
The trust-region method procedure
Define the ratio ρk between the actual decrease in f and theamount of decrease in the quadratic objective
ρk =f(xk + dk)minus f(xk)1
2(dk)TBkd
k + gTk d
k
If ρk is at least greater than a small tolerance (eg 01) we acceptthe step and proceed to the next iteration Otherwise the trustregion radius ∆k is too large so we do not take the step shrink thetrust region and resolve the new subproblem to obtain a new step
If ρk is close to 1 and the bound 983042dk983042 le ∆k is active (ie983042dk983042 = ∆k) we conclude that a larger trust region may hastenprogress so we increase ∆k for the next iteration
Newton Methods Lecture 8 April 15 - 17 2020 10 16
51 Dogleg method for trust region subproblem
For large-scale problems it may be too expensive to solvetrust-region subproblem near-exactly since the process mayrequire several factorizations of Bk + λI for different values of λ
A popular approach for finding approximate solutions which canbe used when Bk is positive definite is the dogleg method
Newton Methods Lecture 8 April 15 - 17 2020 11 16
52 Trust-region Newton method
The subproblem
mind
f(xk) +nablaf(xk)Td+1
2dTnabla2f(xk)d st 983042d983042 le ∆k
The trust-region Newton method can ldquoescaperdquo from a saddlepoint Suppose nablaf(xk) = 0 and nabla2f(xk) indefinite with somestrictly negative eigenvalues Then the solution dk of thesubproblem will be nonzero and the algorithm will step away fromthe saddle point xk in the direction of most negative curvature fornabla2f(xk) This guarantees that any accumulation points willsatisfy second-order necessary conditions
Another appealing feature of the trust-region Newton approach isthat when the sequence xk approaches a point x983183 satisfyingsecond-order sufficient conditions the trust region bound becomesinactive and the method takes basic Newton steps for allsufficiently large k so it has local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 12 16
53 Difference between line-search and trust-region methods
The basic difference between line-search and trust-region methodscan be summarized as follows
Line-search methods first choose a direction dk then decide howfar to move along that direction
Trust-region methods do the opposite They choose the distance∆k first then find the direction that makes the best progress forthis step length
6 Cubic regularization approachs
Assume that 983042nabla2f(x)minusnabla2f(y)983042 le M983042xminus y983042 Then
TM (zx) = f(x) +nablaf(x)T(zminus x)
+1
2(zminus x)Tnabla2f(x)(zminus x) +
M
6983042zminus x9830423
ge f(z)
Newton Methods Lecture 8 April 15 - 17 2020 13 16
Approach I the basic cubic regularization algorithm
xk+1 = argminz
TM (zxk) k = 0 1 2
Approach II Seek 983141x approximately satisfying second-ordernecessary conditions that is
983042nablaf(983141x)983042 le εg λmin(nabla2f(983141x)) ge minusεH
where εg and εH are two small positive constants
Assume nabla2f is M -Lipschitz continuous
983042nabla2f(x)minusnabla2f(y)983042 le M983042xminus y983042
nablaf is L-Lipschitz continuous
983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042
and f is lower-bounded f(x) ge f
Newton Methods Lecture 8 April 15 - 17 2020 14 16
(i) If 983042nablaf(xk)983042 gt εg set
xk+1 = xk minus 1
Lnablaf(xk)
(ii) If 983042nablaf(xk)983042 le εg and λmin(nabla2f(xk)) lt minusεH choose dk to be theeigenvector corresponding to λmin(nabla2f(xk)) Choose the size andsign of dk such that
983042dk983042 = 1
andnablaf(xk)Tdk le 0
Set
xk+1 = xk + αkdk where αk =
2εHM
(iii) If neither of these conditions hold then xk satisfies theapproximate second-order necessary conditions so we terminate
Newton Methods Lecture 8 April 15 - 17 2020 15 16
For the steepest-descent step (i)
f(xk+1) le f(xk)minus 1
2L983042nablaf(xk)9830422 le f(xk)minus
ε2g2L
For a step of type (ii)
f(xk+1) 983249 f(xk) + αknablaf(xk)⊤dk
+1
2α2k(d
k)Tnabla2f(xk)dk +1
6Mα3
k983042dk9830423
983249 f(xk)minus 1
2
9830612εHM
9830622
εH +1
6M
9830612εHM
9830623
= f(xk)minus 2
3
ε3HM2
We attain a decrease in the objective of at least
min
983075ε2g2L
2
3
ε3HM2
983076
Newton Methods Lecture 8 April 15 - 17 2020 16 16
Theorem 1 (Local quadratic convergence)
Suppose f(x) is twice Lipschitz continuously differentiable withLipschitz constant M ie
983042nabla2f(x)minusnabla2f(y)983042 le M983042xminus y983042
Suppose that (the second-order sufficient conditions)
nablaf(x983183) = 0 and nabla2f(x983183) ≽ γI for some γ gt 0
which ensure that x983183 is a local minimizer of f(x) If
983042x0 minus x983183983042 le γ
2M
then the sequence xkinfin0 in Newtonrsquos method converges to x983183 at aquadratic rate with
983042xk+1 minus x983183983042 le M
γ983042xk minus x9831839830422 k = 0 1 2
Newton Methods Lecture 8 April 15 - 17 2020 4 16
2 DD + Newton for smooth strongly convex functions
If f is γ-strongly convex and nablaf is L-Lipschitz continuous thennabla2f(x) is positive definite and γI ≼ nabla2f(x) ≼ LI The Newtondirection
dk = minusnabla2f(xk)minus1nablaf(xk)
is a descent direction satisfying
nablaf(xk)Tdk le minus γ
L983042nablaf(xk)983042983042dk983042
The DD method using the Newton direction yields xk rarr x983183 wherex983183 is the (unique) global minimizer of f
Two stage method DD + Newton
Global sublinear convergence of DD is enhanced to the localquadratic convergence if we use αk = 1 whenever it satisfies theweak Wolfe conditions
DD + Newton global sublinear + local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 5 16
3 DD + Newton for smooth convex functions
If f is convex but not strongly convex and nablaf is L-Lipschitzcontinuous then nabla2f(x) may be singular for some x ie
0 ≼ nabla2f(x) ≼ LI
So the Newton direction may not be well defined
Consider the modified Newton direction
dk = minus[nabla2f(xk) + λkI]minus1nablaf(xk)
which is a descent direction The DD method using the modifiedNewton direction yields xk rarr x983183 where x983183 is a minimizer of f
Two stage method DD + Newton
If the minimizer x983183 is unique and nabla2f(x983183) is positive definite thennabla2f(x983183) will be positive definite for sufficiently large k
DD + Newton global sublinear + local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 6 16
4 DD + Newton for smooth nonconvex functions
For smooth nonconvex f the Hessian nabla2f(xk) may be indefinitefor some k The Newton direction may not exist (when nabla2f(xk) issingular) or it may not be a descent direction (when nabla2f(xk) hasnegative eigenvalues) The modified Newton direction
dk = minus[nabla2f(xk) + λkI]minus1nablaf(xk)
will be a descent direction for λk sufficiently large For given0 lt η lt 1 a sufficient condition is
λk + λmin(nabla2f(xk))
λk + Lge η
Two stage method DD + Newton
Once again if the DD iterates xk enter the neighborhood of a localsolution x983183 for which nabla2f(x983183) is positive definite some strategyfor choosing λk and αk recovers the local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 7 16
41 Other modified Newton directions
Modified Cholesky factorization For indefinite nabla2f(xk) by addingpositive elements if needed to avoid taking the square root of anegative number the factorization continues to proceed Using themodified factorization in place of nabla2f(xk) in the calculation of theNewton direction dk we obtain a new modified Newton direction
Given eigenvalue decomposition
nabla2f(xk) = QkΛkQTk
we can define a modified Newton direction
dk = minusQk983144Λminus1k QT
knablaf(xk)
where 983144Λk with positive diagonal entries is a modified version of Λk
For more modified Newton directions to ensure descent in a DDframework see Chapter 3 of NO
Newton Methods Lecture 8 April 15 - 17 2020 8 16
5 Trust region method
The trust-region subproblem Given gk and symmetric Bk
mind
f(xk) + gTk d+
1
2dTBkd st 983042d983042 le ∆k
where ∆k is the radius of the trust region in which the quadraticf(xk) + gT
k d+ 12d
TBkd ldquowellrdquo captures the true behavior of f
The solution dk of the subproblem satisfies the linear system
[Bk + λI]dk = minusgk for some λ ge 0
where λ is chosen such that Bk + λI is positive semidefinite andλ(983042dk983042 minus∆k) = 0 (Exercise [Sorensen etc])
Solving the subproblem thus reduces to a search for the value of λSpecialized methods have been devised [Sorensen etc]
Newton Methods Lecture 8 April 15 - 17 2020 9 16
The trust-region method procedure
Define the ratio ρk between the actual decrease in f and theamount of decrease in the quadratic objective
ρk =f(xk + dk)minus f(xk)1
2(dk)TBkd
k + gTk d
k
If ρk is at least greater than a small tolerance (eg 01) we acceptthe step and proceed to the next iteration Otherwise the trustregion radius ∆k is too large so we do not take the step shrink thetrust region and resolve the new subproblem to obtain a new step
If ρk is close to 1 and the bound 983042dk983042 le ∆k is active (ie983042dk983042 = ∆k) we conclude that a larger trust region may hastenprogress so we increase ∆k for the next iteration
Newton Methods Lecture 8 April 15 - 17 2020 10 16
51 Dogleg method for trust region subproblem
For large-scale problems it may be too expensive to solvetrust-region subproblem near-exactly since the process mayrequire several factorizations of Bk + λI for different values of λ
A popular approach for finding approximate solutions which canbe used when Bk is positive definite is the dogleg method
Newton Methods Lecture 8 April 15 - 17 2020 11 16
52 Trust-region Newton method
The subproblem
mind
f(xk) +nablaf(xk)Td+1
2dTnabla2f(xk)d st 983042d983042 le ∆k
The trust-region Newton method can ldquoescaperdquo from a saddlepoint Suppose nablaf(xk) = 0 and nabla2f(xk) indefinite with somestrictly negative eigenvalues Then the solution dk of thesubproblem will be nonzero and the algorithm will step away fromthe saddle point xk in the direction of most negative curvature fornabla2f(xk) This guarantees that any accumulation points willsatisfy second-order necessary conditions
Another appealing feature of the trust-region Newton approach isthat when the sequence xk approaches a point x983183 satisfyingsecond-order sufficient conditions the trust region bound becomesinactive and the method takes basic Newton steps for allsufficiently large k so it has local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 12 16
53 Difference between line-search and trust-region methods
The basic difference between line-search and trust-region methodscan be summarized as follows
Line-search methods first choose a direction dk then decide howfar to move along that direction
Trust-region methods do the opposite They choose the distance∆k first then find the direction that makes the best progress forthis step length
6 Cubic regularization approachs
Assume that 983042nabla2f(x)minusnabla2f(y)983042 le M983042xminus y983042 Then
TM (zx) = f(x) +nablaf(x)T(zminus x)
+1
2(zminus x)Tnabla2f(x)(zminus x) +
M
6983042zminus x9830423
ge f(z)
Newton Methods Lecture 8 April 15 - 17 2020 13 16
Approach I the basic cubic regularization algorithm
xk+1 = argminz
TM (zxk) k = 0 1 2
Approach II Seek 983141x approximately satisfying second-ordernecessary conditions that is
983042nablaf(983141x)983042 le εg λmin(nabla2f(983141x)) ge minusεH
where εg and εH are two small positive constants
Assume nabla2f is M -Lipschitz continuous
983042nabla2f(x)minusnabla2f(y)983042 le M983042xminus y983042
nablaf is L-Lipschitz continuous
983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042
and f is lower-bounded f(x) ge f
Newton Methods Lecture 8 April 15 - 17 2020 14 16
(i) If 983042nablaf(xk)983042 gt εg set
xk+1 = xk minus 1
Lnablaf(xk)
(ii) If 983042nablaf(xk)983042 le εg and λmin(nabla2f(xk)) lt minusεH choose dk to be theeigenvector corresponding to λmin(nabla2f(xk)) Choose the size andsign of dk such that
983042dk983042 = 1
andnablaf(xk)Tdk le 0
Set
xk+1 = xk + αkdk where αk =
2εHM
(iii) If neither of these conditions hold then xk satisfies theapproximate second-order necessary conditions so we terminate
Newton Methods Lecture 8 April 15 - 17 2020 15 16
For the steepest-descent step (i)
f(xk+1) le f(xk)minus 1
2L983042nablaf(xk)9830422 le f(xk)minus
ε2g2L
For a step of type (ii)
f(xk+1) 983249 f(xk) + αknablaf(xk)⊤dk
+1
2α2k(d
k)Tnabla2f(xk)dk +1
6Mα3
k983042dk9830423
983249 f(xk)minus 1
2
9830612εHM
9830622
εH +1
6M
9830612εHM
9830623
= f(xk)minus 2
3
ε3HM2
We attain a decrease in the objective of at least
min
983075ε2g2L
2
3
ε3HM2
983076
Newton Methods Lecture 8 April 15 - 17 2020 16 16
2 DD + Newton for smooth strongly convex functions
If f is γ-strongly convex and nablaf is L-Lipschitz continuous thennabla2f(x) is positive definite and γI ≼ nabla2f(x) ≼ LI The Newtondirection
dk = minusnabla2f(xk)minus1nablaf(xk)
is a descent direction satisfying
nablaf(xk)Tdk le minus γ
L983042nablaf(xk)983042983042dk983042
The DD method using the Newton direction yields xk rarr x983183 wherex983183 is the (unique) global minimizer of f
Two stage method DD + Newton
Global sublinear convergence of DD is enhanced to the localquadratic convergence if we use αk = 1 whenever it satisfies theweak Wolfe conditions
DD + Newton global sublinear + local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 5 16
3 DD + Newton for smooth convex functions
If f is convex but not strongly convex and nablaf is L-Lipschitzcontinuous then nabla2f(x) may be singular for some x ie
0 ≼ nabla2f(x) ≼ LI
So the Newton direction may not be well defined
Consider the modified Newton direction
dk = minus[nabla2f(xk) + λkI]minus1nablaf(xk)
which is a descent direction The DD method using the modifiedNewton direction yields xk rarr x983183 where x983183 is a minimizer of f
Two stage method DD + Newton
If the minimizer x983183 is unique and nabla2f(x983183) is positive definite thennabla2f(x983183) will be positive definite for sufficiently large k
DD + Newton global sublinear + local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 6 16
4 DD + Newton for smooth nonconvex functions
For smooth nonconvex f the Hessian nabla2f(xk) may be indefinitefor some k The Newton direction may not exist (when nabla2f(xk) issingular) or it may not be a descent direction (when nabla2f(xk) hasnegative eigenvalues) The modified Newton direction
dk = minus[nabla2f(xk) + λkI]minus1nablaf(xk)
will be a descent direction for λk sufficiently large For given0 lt η lt 1 a sufficient condition is
λk + λmin(nabla2f(xk))
λk + Lge η
Two stage method DD + Newton
Once again if the DD iterates xk enter the neighborhood of a localsolution x983183 for which nabla2f(x983183) is positive definite some strategyfor choosing λk and αk recovers the local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 7 16
41 Other modified Newton directions
Modified Cholesky factorization For indefinite nabla2f(xk) by addingpositive elements if needed to avoid taking the square root of anegative number the factorization continues to proceed Using themodified factorization in place of nabla2f(xk) in the calculation of theNewton direction dk we obtain a new modified Newton direction
Given eigenvalue decomposition
nabla2f(xk) = QkΛkQTk
we can define a modified Newton direction
dk = minusQk983144Λminus1k QT
knablaf(xk)
where 983144Λk with positive diagonal entries is a modified version of Λk
For more modified Newton directions to ensure descent in a DDframework see Chapter 3 of NO
Newton Methods Lecture 8 April 15 - 17 2020 8 16
5 Trust region method
The trust-region subproblem Given gk and symmetric Bk
mind
f(xk) + gTk d+
1
2dTBkd st 983042d983042 le ∆k
where ∆k is the radius of the trust region in which the quadraticf(xk) + gT
k d+ 12d
TBkd ldquowellrdquo captures the true behavior of f
The solution dk of the subproblem satisfies the linear system
[Bk + λI]dk = minusgk for some λ ge 0
where λ is chosen such that Bk + λI is positive semidefinite andλ(983042dk983042 minus∆k) = 0 (Exercise [Sorensen etc])
Solving the subproblem thus reduces to a search for the value of λSpecialized methods have been devised [Sorensen etc]
Newton Methods Lecture 8 April 15 - 17 2020 9 16
The trust-region method procedure
Define the ratio ρk between the actual decrease in f and theamount of decrease in the quadratic objective
ρk =f(xk + dk)minus f(xk)1
2(dk)TBkd
k + gTk d
k
If ρk is at least greater than a small tolerance (eg 01) we acceptthe step and proceed to the next iteration Otherwise the trustregion radius ∆k is too large so we do not take the step shrink thetrust region and resolve the new subproblem to obtain a new step
If ρk is close to 1 and the bound 983042dk983042 le ∆k is active (ie983042dk983042 = ∆k) we conclude that a larger trust region may hastenprogress so we increase ∆k for the next iteration
Newton Methods Lecture 8 April 15 - 17 2020 10 16
51 Dogleg method for trust region subproblem
For large-scale problems it may be too expensive to solvetrust-region subproblem near-exactly since the process mayrequire several factorizations of Bk + λI for different values of λ
A popular approach for finding approximate solutions which canbe used when Bk is positive definite is the dogleg method
Newton Methods Lecture 8 April 15 - 17 2020 11 16
52 Trust-region Newton method
The subproblem
mind
f(xk) +nablaf(xk)Td+1
2dTnabla2f(xk)d st 983042d983042 le ∆k
The trust-region Newton method can ldquoescaperdquo from a saddlepoint Suppose nablaf(xk) = 0 and nabla2f(xk) indefinite with somestrictly negative eigenvalues Then the solution dk of thesubproblem will be nonzero and the algorithm will step away fromthe saddle point xk in the direction of most negative curvature fornabla2f(xk) This guarantees that any accumulation points willsatisfy second-order necessary conditions
Another appealing feature of the trust-region Newton approach isthat when the sequence xk approaches a point x983183 satisfyingsecond-order sufficient conditions the trust region bound becomesinactive and the method takes basic Newton steps for allsufficiently large k so it has local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 12 16
53 Difference between line-search and trust-region methods
The basic difference between line-search and trust-region methodscan be summarized as follows
Line-search methods first choose a direction dk then decide howfar to move along that direction
Trust-region methods do the opposite They choose the distance∆k first then find the direction that makes the best progress forthis step length
6 Cubic regularization approachs
Assume that 983042nabla2f(x)minusnabla2f(y)983042 le M983042xminus y983042 Then
TM (zx) = f(x) +nablaf(x)T(zminus x)
+1
2(zminus x)Tnabla2f(x)(zminus x) +
M
6983042zminus x9830423
ge f(z)
Newton Methods Lecture 8 April 15 - 17 2020 13 16
Approach I the basic cubic regularization algorithm
xk+1 = argminz
TM (zxk) k = 0 1 2
Approach II Seek 983141x approximately satisfying second-ordernecessary conditions that is
983042nablaf(983141x)983042 le εg λmin(nabla2f(983141x)) ge minusεH
where εg and εH are two small positive constants
Assume nabla2f is M -Lipschitz continuous
983042nabla2f(x)minusnabla2f(y)983042 le M983042xminus y983042
nablaf is L-Lipschitz continuous
983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042
and f is lower-bounded f(x) ge f
Newton Methods Lecture 8 April 15 - 17 2020 14 16
(i) If 983042nablaf(xk)983042 gt εg set
xk+1 = xk minus 1
Lnablaf(xk)
(ii) If 983042nablaf(xk)983042 le εg and λmin(nabla2f(xk)) lt minusεH choose dk to be theeigenvector corresponding to λmin(nabla2f(xk)) Choose the size andsign of dk such that
983042dk983042 = 1
andnablaf(xk)Tdk le 0
Set
xk+1 = xk + αkdk where αk =
2εHM
(iii) If neither of these conditions hold then xk satisfies theapproximate second-order necessary conditions so we terminate
Newton Methods Lecture 8 April 15 - 17 2020 15 16
For the steepest-descent step (i)
f(xk+1) le f(xk)minus 1
2L983042nablaf(xk)9830422 le f(xk)minus
ε2g2L
For a step of type (ii)
f(xk+1) 983249 f(xk) + αknablaf(xk)⊤dk
+1
2α2k(d
k)Tnabla2f(xk)dk +1
6Mα3
k983042dk9830423
983249 f(xk)minus 1
2
9830612εHM
9830622
εH +1
6M
9830612εHM
9830623
= f(xk)minus 2
3
ε3HM2
We attain a decrease in the objective of at least
min
983075ε2g2L
2
3
ε3HM2
983076
Newton Methods Lecture 8 April 15 - 17 2020 16 16
3 DD + Newton for smooth convex functions
If f is convex but not strongly convex and nablaf is L-Lipschitzcontinuous then nabla2f(x) may be singular for some x ie
0 ≼ nabla2f(x) ≼ LI
So the Newton direction may not be well defined
Consider the modified Newton direction
dk = minus[nabla2f(xk) + λkI]minus1nablaf(xk)
which is a descent direction The DD method using the modifiedNewton direction yields xk rarr x983183 where x983183 is a minimizer of f
Two stage method DD + Newton
If the minimizer x983183 is unique and nabla2f(x983183) is positive definite thennabla2f(x983183) will be positive definite for sufficiently large k
DD + Newton global sublinear + local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 6 16
4 DD + Newton for smooth nonconvex functions
For smooth nonconvex f the Hessian nabla2f(xk) may be indefinitefor some k The Newton direction may not exist (when nabla2f(xk) issingular) or it may not be a descent direction (when nabla2f(xk) hasnegative eigenvalues) The modified Newton direction
dk = minus[nabla2f(xk) + λkI]minus1nablaf(xk)
will be a descent direction for λk sufficiently large For given0 lt η lt 1 a sufficient condition is
λk + λmin(nabla2f(xk))
λk + Lge η
Two stage method DD + Newton
Once again if the DD iterates xk enter the neighborhood of a localsolution x983183 for which nabla2f(x983183) is positive definite some strategyfor choosing λk and αk recovers the local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 7 16
41 Other modified Newton directions
Modified Cholesky factorization For indefinite nabla2f(xk) by addingpositive elements if needed to avoid taking the square root of anegative number the factorization continues to proceed Using themodified factorization in place of nabla2f(xk) in the calculation of theNewton direction dk we obtain a new modified Newton direction
Given eigenvalue decomposition
nabla2f(xk) = QkΛkQTk
we can define a modified Newton direction
dk = minusQk983144Λminus1k QT
knablaf(xk)
where 983144Λk with positive diagonal entries is a modified version of Λk
For more modified Newton directions to ensure descent in a DDframework see Chapter 3 of NO
Newton Methods Lecture 8 April 15 - 17 2020 8 16
5 Trust region method
The trust-region subproblem Given gk and symmetric Bk
mind
f(xk) + gTk d+
1
2dTBkd st 983042d983042 le ∆k
where ∆k is the radius of the trust region in which the quadraticf(xk) + gT
k d+ 12d
TBkd ldquowellrdquo captures the true behavior of f
The solution dk of the subproblem satisfies the linear system
[Bk + λI]dk = minusgk for some λ ge 0
where λ is chosen such that Bk + λI is positive semidefinite andλ(983042dk983042 minus∆k) = 0 (Exercise [Sorensen etc])
Solving the subproblem thus reduces to a search for the value of λSpecialized methods have been devised [Sorensen etc]
Newton Methods Lecture 8 April 15 - 17 2020 9 16
The trust-region method procedure
Define the ratio ρk between the actual decrease in f and theamount of decrease in the quadratic objective
ρk =f(xk + dk)minus f(xk)1
2(dk)TBkd
k + gTk d
k
If ρk is at least greater than a small tolerance (eg 01) we acceptthe step and proceed to the next iteration Otherwise the trustregion radius ∆k is too large so we do not take the step shrink thetrust region and resolve the new subproblem to obtain a new step
If ρk is close to 1 and the bound 983042dk983042 le ∆k is active (ie983042dk983042 = ∆k) we conclude that a larger trust region may hastenprogress so we increase ∆k for the next iteration
Newton Methods Lecture 8 April 15 - 17 2020 10 16
51 Dogleg method for trust region subproblem
For large-scale problems it may be too expensive to solvetrust-region subproblem near-exactly since the process mayrequire several factorizations of Bk + λI for different values of λ
A popular approach for finding approximate solutions which canbe used when Bk is positive definite is the dogleg method
Newton Methods Lecture 8 April 15 - 17 2020 11 16
52 Trust-region Newton method
The subproblem
mind
f(xk) +nablaf(xk)Td+1
2dTnabla2f(xk)d st 983042d983042 le ∆k
The trust-region Newton method can ldquoescaperdquo from a saddlepoint Suppose nablaf(xk) = 0 and nabla2f(xk) indefinite with somestrictly negative eigenvalues Then the solution dk of thesubproblem will be nonzero and the algorithm will step away fromthe saddle point xk in the direction of most negative curvature fornabla2f(xk) This guarantees that any accumulation points willsatisfy second-order necessary conditions
Another appealing feature of the trust-region Newton approach isthat when the sequence xk approaches a point x983183 satisfyingsecond-order sufficient conditions the trust region bound becomesinactive and the method takes basic Newton steps for allsufficiently large k so it has local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 12 16
53 Difference between line-search and trust-region methods
The basic difference between line-search and trust-region methodscan be summarized as follows
Line-search methods first choose a direction dk then decide howfar to move along that direction
Trust-region methods do the opposite They choose the distance∆k first then find the direction that makes the best progress forthis step length
6 Cubic regularization approachs
Assume that 983042nabla2f(x)minusnabla2f(y)983042 le M983042xminus y983042 Then
TM (zx) = f(x) +nablaf(x)T(zminus x)
+1
2(zminus x)Tnabla2f(x)(zminus x) +
M
6983042zminus x9830423
ge f(z)
Newton Methods Lecture 8 April 15 - 17 2020 13 16
Approach I the basic cubic regularization algorithm
xk+1 = argminz
TM (zxk) k = 0 1 2
Approach II Seek 983141x approximately satisfying second-ordernecessary conditions that is
983042nablaf(983141x)983042 le εg λmin(nabla2f(983141x)) ge minusεH
where εg and εH are two small positive constants
Assume nabla2f is M -Lipschitz continuous
983042nabla2f(x)minusnabla2f(y)983042 le M983042xminus y983042
nablaf is L-Lipschitz continuous
983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042
and f is lower-bounded f(x) ge f
Newton Methods Lecture 8 April 15 - 17 2020 14 16
(i) If 983042nablaf(xk)983042 gt εg set
xk+1 = xk minus 1
Lnablaf(xk)
(ii) If 983042nablaf(xk)983042 le εg and λmin(nabla2f(xk)) lt minusεH choose dk to be theeigenvector corresponding to λmin(nabla2f(xk)) Choose the size andsign of dk such that
983042dk983042 = 1
andnablaf(xk)Tdk le 0
Set
xk+1 = xk + αkdk where αk =
2εHM
(iii) If neither of these conditions hold then xk satisfies theapproximate second-order necessary conditions so we terminate
Newton Methods Lecture 8 April 15 - 17 2020 15 16
For the steepest-descent step (i)
f(xk+1) le f(xk)minus 1
2L983042nablaf(xk)9830422 le f(xk)minus
ε2g2L
For a step of type (ii)
f(xk+1) 983249 f(xk) + αknablaf(xk)⊤dk
+1
2α2k(d
k)Tnabla2f(xk)dk +1
6Mα3
k983042dk9830423
983249 f(xk)minus 1
2
9830612εHM
9830622
εH +1
6M
9830612εHM
9830623
= f(xk)minus 2
3
ε3HM2
We attain a decrease in the objective of at least
min
983075ε2g2L
2
3
ε3HM2
983076
Newton Methods Lecture 8 April 15 - 17 2020 16 16
4 DD + Newton for smooth nonconvex functions
For smooth nonconvex f the Hessian nabla2f(xk) may be indefinitefor some k The Newton direction may not exist (when nabla2f(xk) issingular) or it may not be a descent direction (when nabla2f(xk) hasnegative eigenvalues) The modified Newton direction
dk = minus[nabla2f(xk) + λkI]minus1nablaf(xk)
will be a descent direction for λk sufficiently large For given0 lt η lt 1 a sufficient condition is
λk + λmin(nabla2f(xk))
λk + Lge η
Two stage method DD + Newton
Once again if the DD iterates xk enter the neighborhood of a localsolution x983183 for which nabla2f(x983183) is positive definite some strategyfor choosing λk and αk recovers the local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 7 16
41 Other modified Newton directions
Modified Cholesky factorization For indefinite nabla2f(xk) by addingpositive elements if needed to avoid taking the square root of anegative number the factorization continues to proceed Using themodified factorization in place of nabla2f(xk) in the calculation of theNewton direction dk we obtain a new modified Newton direction
Given eigenvalue decomposition
nabla2f(xk) = QkΛkQTk
we can define a modified Newton direction
dk = minusQk983144Λminus1k QT
knablaf(xk)
where 983144Λk with positive diagonal entries is a modified version of Λk
For more modified Newton directions to ensure descent in a DDframework see Chapter 3 of NO
Newton Methods Lecture 8 April 15 - 17 2020 8 16
5 Trust region method
The trust-region subproblem Given gk and symmetric Bk
mind
f(xk) + gTk d+
1
2dTBkd st 983042d983042 le ∆k
where ∆k is the radius of the trust region in which the quadraticf(xk) + gT
k d+ 12d
TBkd ldquowellrdquo captures the true behavior of f
The solution dk of the subproblem satisfies the linear system
[Bk + λI]dk = minusgk for some λ ge 0
where λ is chosen such that Bk + λI is positive semidefinite andλ(983042dk983042 minus∆k) = 0 (Exercise [Sorensen etc])
Solving the subproblem thus reduces to a search for the value of λSpecialized methods have been devised [Sorensen etc]
Newton Methods Lecture 8 April 15 - 17 2020 9 16
The trust-region method procedure
Define the ratio ρk between the actual decrease in f and theamount of decrease in the quadratic objective
ρk =f(xk + dk)minus f(xk)1
2(dk)TBkd
k + gTk d
k
If ρk is at least greater than a small tolerance (eg 01) we acceptthe step and proceed to the next iteration Otherwise the trustregion radius ∆k is too large so we do not take the step shrink thetrust region and resolve the new subproblem to obtain a new step
If ρk is close to 1 and the bound 983042dk983042 le ∆k is active (ie983042dk983042 = ∆k) we conclude that a larger trust region may hastenprogress so we increase ∆k for the next iteration
Newton Methods Lecture 8 April 15 - 17 2020 10 16
51 Dogleg method for trust region subproblem
For large-scale problems it may be too expensive to solvetrust-region subproblem near-exactly since the process mayrequire several factorizations of Bk + λI for different values of λ
A popular approach for finding approximate solutions which canbe used when Bk is positive definite is the dogleg method
Newton Methods Lecture 8 April 15 - 17 2020 11 16
52 Trust-region Newton method
The subproblem
mind
f(xk) +nablaf(xk)Td+1
2dTnabla2f(xk)d st 983042d983042 le ∆k
The trust-region Newton method can ldquoescaperdquo from a saddlepoint Suppose nablaf(xk) = 0 and nabla2f(xk) indefinite with somestrictly negative eigenvalues Then the solution dk of thesubproblem will be nonzero and the algorithm will step away fromthe saddle point xk in the direction of most negative curvature fornabla2f(xk) This guarantees that any accumulation points willsatisfy second-order necessary conditions
Another appealing feature of the trust-region Newton approach isthat when the sequence xk approaches a point x983183 satisfyingsecond-order sufficient conditions the trust region bound becomesinactive and the method takes basic Newton steps for allsufficiently large k so it has local quadratic convergence
Newton Methods Lecture 8 April 15 - 17 2020 12 16
53 Difference between line-search and trust-region methods
The basic difference between line-search and trust-region methodscan be summarized as follows
Line-search methods first choose a direction dk then decide howfar to move along that direction
Trust-region methods do the opposite They choose the distance∆k first then find the direction that makes the best progress forthis step length
6 Cubic regularization approachs
Assume that 983042nabla2f(x)minusnabla2f(y)983042 le M983042xminus y983042 Then
TM (zx) = f(x) +nablaf(x)T(zminus x)
+1
2(zminus x)Tnabla2f(x)(zminus x) +
M
6983042zminus x9830423
ge f(z)
Newton Methods Lecture 8 April 15 - 17 2020 13 16
Approach I the basic cubic regularization algorithm
xk+1 = argminz
TM (zxk) k = 0 1 2
Approach II Seek 983141x approximately satisfying second-ordernecessary conditions that is
983042nablaf(983141x)983042 le εg λmin(nabla2f(983141x)) ge minusεH
where εg and εH are two small positive constants
Assume nabla2f is M -Lipschitz continuous
983042nabla2f(x)minusnabla2f(y)983042 le M983042xminus y983042
nablaf is L-Lipschitz continuous
983042nablaf(x)minusnablaf(y)983042 le L983042xminus y983042
and f is lower-bounded f(x) ge f
Newton Methods Lecture 8 April 15 - 17 2020 14 16
(i) If 983042nablaf(xk)983042 gt εg set
xk+1 = xk minus 1
Lnablaf(xk)
(ii) If 983042nablaf(xk)983042 le εg and λmin(nabla2f(xk)) lt minusεH choose dk to be theeigenvector corresponding to λmin(nabla2f(xk)) Choose the size andsign of dk such that
983042dk983042 = 1
andnablaf(xk)Tdk le 0
Set
xk+1 = xk + αkdk where αk =
2εHM
(iii) If neither of these conditions hold then xk satisfies theapproximate second-order necessary conditions so we terminate
Newton Methods Lecture 8 April 15 - 17 2020 15 16
For the steepest-descent step (i)
f(xk+1) le f(xk)minus 1
2L983042nablaf(xk)9830422 le f(xk)minus
ε2g2L
For a step of type (ii)
f(xk+1) 983249 f(xk) + αknablaf(xk)⊤dk
+1
2α2k(d
k)Tnabla2f(xk)dk +1
6Mα3
k983042dk9830423
983249 f(xk)minus 1
2
9830612εHM
9830622
εH +1
6M
9830612εHM
9830623
= f(xk)minus 2
3
ε3HM2
We attain a decrease in the objective of at least
min
983075ε2g2L
2
3
ε3HM2
983076
Newton Methods Lecture 8 April 15 - 17 2020 16 16