Notes on Numerical Optimization - University of Wisconsin



Notes on Numerical Optimization

University of Chicago, 2014

Vivak Patel

October 18, 2014


Contents

Contents 2

List of Algorithms 4

I Fundamentals of Optimization 5

1 Overview of Numerical Optimization 5
1.1 Problem and Classification 5
1.2 Theoretical Properties of Solutions 5
1.3 Algorithm Overview 7
1.4 Fundamental Definitions and Results 8

2 Line Search Methods 9
2.1 Step Length 9
2.2 Step Length Selection Algorithms 11
2.3 Global Convergence and Zoutendjik 12

3 Trust Region Methods 15
3.1 Trust Region Subproblem and Fidelity 15
3.2 Fidelity Algorithms 16
3.3 Approximate Solutions to Subproblem 16

3.3.1 Cauchy Point 16
3.3.2 Dogleg Method 17
3.3.3 Global Convergence of Cauchy Point Methods 18

3.4 Iterative Solutions to Subproblem 20
3.4.1 Exact Solution to Subproblem 20
3.4.2 Newton's Root Finding Iterative Solution 22

3.5 Trust Region Subproblem Algorithms 22

4 Conjugate Gradients 23

II Model Hessian Selection 24

5 Newton’s Method 24

6 Newton’s Method with Hessian Modification 28

7 Quasi-Newton's Method 29
7.1 Rank-2 Update: DFP & BFGS 30
7.2 Rank-1 Update: SR1 31

III Specialized Applications of Optimization 32

8 Inexact Newton’s Method 32

9 Limited Memory BFGS 33


10 Least Squares Regression Methods 34
10.1 Linear Least Squares Regression 34
10.2 Line Search: Gauss-Newton 35
10.3 Trust Region: Levenberg-Marquardt 35

IV Constrained Optimization 36
10.4 Theory of Constrained Optimization 36


List of Algorithms

1 Backtracking Algorithm 11
2 Interpolation Algorithm 11
3 Trust Region Management Algorithm 16
4 Solution Acceptance Algorithm 16
5 Overview of Conjugate Gradient Algorithm 23
6 Conjugate Gradients Algorithm for Convex Quadratic 23
7 Fletcher Reeves CG Algorithm 23
8 Line Search with Modified Hessian 28
9 Overview of Inexact CG Algorithm 32
10 L-BFGS Algorithm 33


Part I

Fundamentals of Optimization

1 Overview of Numerical Optimization

1.1 Problem and Classification

1. Problem:

arg min_{z ∈ Rn} f(z)  subject to  ci(z) = 0, i ∈ E,  and  ci(z) ≥ 0, i ∈ I

(a) f : Rn → R is known as the objective function

(b) E are equality constraints

(c) I are inequality constraints

2. Classifications

(a) Unconstrained vs. Constrained. If E ∪ I = ∅, then it is an unconstrained problem.

(b) Linear Programming vs. Nonlinear Programming. When f(z) and the ci(z) are all linear functions of z, then this is a linear programming problem. Otherwise, it is a nonlinear programming problem.

(c) Note: This is not the same as having linear or nonlinear equations.

1.2 Theoretical Properties of Solutions

1. Global Minimizer. Weak Local Minimizer (Local Minimizer). Strict Local Minimizer. Isolated Local Minimizer.

Definition 1.1. Let f : Rn → R

(a) x∗ is a global minimizer of f if f(x∗) ≤ f(x) for all x ∈ Rn

(b) x∗ is a local minimizer of f if for some neighborhood N(x∗) of x∗, f(x∗) ≤ f(x) for all x ∈ N(x∗).

(c) x∗ is a strict local minimizer of f if for some neighborhood N(x∗), f(x∗) < f(x) for all x ∈ N(x∗) with x ≠ x∗.

(d) x∗ is an isolated local minimizer of f if for some neighborhood N(x∗), there are no other minimizers of f in N(x∗).

Note 1.1. Every isolated local minimizer is a strict minimizer, but it is not true that a strict minimizer is always an isolated local minimizer.

2. Taylor’s Theorem

Theorem 1.1. Suppose f : Rn → R is continuously differentiable. Let p ∈ Rn. Then for some t ∈ [0, 1]:

f(x + p) = f(x) + ∇f(x + tp)ᵀp (1)


If f is twice continuously differentiable, then:

∇f(x + p) = ∇f(x) + ∫₀¹ ∇²f(x + tp) p dt (2)

And for some t ∈ [0, 1]:

f(x + p) = f(x) + ∇f(x)ᵀp + ½ pᵀ∇²f(x + tp) p (3)

3. Necessary Conditions

(a) First Order Necessary Conditions

Lemma 1.1. Let f be continuously differentiable. If x∗ is a minimizer of f, then ∇f(x∗) = 0.

Proof. Suppose ∇f(x∗) ≠ 0. Let p = −∇f(x∗). Then pᵀ∇f(x∗) < 0, and by continuity there is some T > 0 such that for any t ∈ [0, T), pᵀ∇f(x∗ + tp) < 0. We now apply Equation 1 of Taylor's Theorem to get a contradiction.

Let t ∈ (0, T) and q = tp. Then ∃t′ ∈ [0, 1] such that:

f(x∗ + q) = f(x∗) + t∇f(x∗ + t′tp)ᵀp < f(x∗)

(b) Second Order Necessary Conditions

Lemma 1.2. Suppose f is twice continuously differentiable. If x∗ is a minimizer of f, then ∇f(x∗) = 0 and ∇²f(x∗) ⪰ 0 (i.e. positive semi-definite).

Proof. The first conclusion follows from the first order necessary condition. We prove the second by contradiction and Taylor's Theorem. Suppose ∇²f(x∗) is not positive semi-definite. Then there is a p such that:

pT∇2f(x∗)p < 0

By continuity, there is a T > 0 such that pᵀ∇²f(x∗ + tp)p < 0 for t ∈ [0, T]. Fix t ∈ (0, T] and let q = tp. Then, by Equation 3 from Taylor's Theorem (and since ∇f(x∗) = 0), ∃t′ ∈ [0, 1] such that:

f(x∗ + q) = f(x∗) + ½ t²pᵀ∇²f(x∗ + t′tp)p < f(x∗)

4. Second Order Sufficient Condition

Lemma 1.3. Let x∗ ∈ Rn. Suppose ∇²f is continuous, ∇²f(x∗) ≻ 0, and ∇f(x∗) = 0. Then x∗ is a strict local minimizer.


Proof. By continuity, there is a ball B of radius T about x∗ in which ∇²f(x) ≻ 0. Let 0 < ‖p‖ < T. Then there exists a t ∈ [0, 1] such that:

f(x∗ + p) = f(x∗) + ½ pᵀ∇²f(x∗ + tp)p > f(x∗)

5. Convexity and Local Minimizers

Lemma 1.4. When f is convex, any local minimizer x∗ is a global minimizer. If, in addition, f is differentiable, then any stationary point is a global minimizer.

Proof. Suppose x∗ is not a global minimizer, and let z be a global minimizer, so f(x∗) > f(z). By convexity, for any λ ∈ (0, 1]:

f(x∗) > λf(z) + (1 − λ)f(x∗) ≥ f(λz + (1 − λ)x∗)

So as λ → 0, we see that in any neighborhood of x∗ there is a point w such that f(w) < f(x∗). A contradiction.

For the second part, suppose z is as above. Then:

∇f(x∗)ᵀ(z − x∗) = lim_{λ↓0} [f(x∗ + λ(z − x∗)) − f(x∗)] / λ
                ≤ lim_{λ↓0} [λf(z) + (1 − λ)f(x∗) − f(x∗)] / λ
                ≤ f(z) − f(x∗)
                < 0

so a stationary point must in fact be a global minimizer.

1.3 Algorithm Overview

1. In general, algorithms begin with a seed point x0 and locally search for decreases in the objective function, producing iterates xk, until stopping conditions are met.

2. Algorithms typically generate a local model for f near a point xk:

f(xk + p) ≈ mk(p) = fk + pᵀ∇fk + ½ pᵀBk p

3. Different choices of Bk will lead to different methods with different properties:

(a) Bk = 0 will lead to steepest descent methods

(b) Letting Bk be the closest positive definite approximation to ∇²fk leads to Newton's methods

(c) Iterative approximations to the Hessian given by Bk lead to quasi-Newton methods

(d) Conjugate Gradient methods update p without explicitly computing Bk
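The local quadratic model above can be made concrete in a few lines. A minimal pure-Python sketch: the function `f`, its gradient, and Hessian below are illustrative choices (not from the notes); with Bk taken to be the exact Hessian of a quadratic objective, the model reproduces f exactly.

```python
def f(x, y):
    # Illustrative quadratic objective (hypothetical example)
    return (x - 1.0) ** 2 + 4.0 * (y + 2.0) ** 2

def grad(x, y):
    return [2.0 * (x - 1.0), 8.0 * (y + 2.0)]

def hess(x, y):
    return [[2.0, 0.0], [0.0, 8.0]]

def model(xk, p):
    """Evaluate m_k(p) = f_k + p^T grad f_k + 0.5 p^T B_k p at the point xk."""
    fk = f(*xk)
    g = grad(*xk)
    B = hess(*xk)
    lin = g[0] * p[0] + g[1] * p[1]
    quad = sum(p[i] * B[i][j] * p[j] for i in range(2) for j in range(2))
    return fk + lin + 0.5 * quad

xk = (0.0, 0.0)
p = (0.1, -0.2)
print(model(xk, p), f(xk[0] + p[0], xk[1] + p[1]))  # both agree: the model is exact for a quadratic f
```

Replacing `hess` by a cheaper surrogate (Bk = 0, a quasi-Newton update, etc.) gives the other methods listed above.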


1.4 Fundamental Definitions and Results

1. Q-convergence.

Definition 1.2. Let xk → x in Rn.

(a) xk converges Q-linearly if ∃q ∈ (0, 1) such that for sufficiently large k:

‖xk+1 − x‖ / ‖xk − x‖ ≤ q

(b) xk converges Q-superlinearly if:

lim_{k→∞} ‖xk+1 − x‖ / ‖xk − x‖ = 0

(c) xk converges Q-quadratically if ∃q₂ > 0 such that for sufficiently large k:

‖xk+1 − x‖ / ‖xk − x‖² ≤ q₂

2. R-convergence.

Definition 1.3. Let xk → x in Rn

(a) xk converges R-linearly if ∃vk ∈ R converging Q-linearly such that

‖xk − x‖ ≤ vk

(b) xk converges R-superlinearly if ∃vk ∈ R converging Q-superlinearly such that ‖xk − x‖ ≤ vk

(c) xk converges R-quadratically if ∃vk ∈ R converging Q-quadratically such that ‖xk − x‖ ≤ vk
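The Q-convergence definitions are easy to probe numerically. A small illustration (the sequences are illustrative choices, not from the notes): a geometric sequence has constant first-order ratios, while the squaring recursion xk+1 = xk² has bounded second-order ratios.

```python
def q_ratios(seq, limit, order):
    """Return the ratios ||x_{k+1} - x|| / ||x_k - x||**order from Definition 1.2."""
    errs = [abs(x - limit) for x in seq]
    return [errs[k + 1] / errs[k] ** order for k in range(len(errs) - 1)]

linear_seq = [0.5 ** k for k in range(1, 10)]      # x_{k+1} = 0.5 x_k: Q-linear to 0
quad_seq = [0.5 ** (2 ** k) for k in range(1, 6)]  # x_{k+1} = x_k**2:  Q-quadratic to 0

print(q_ratios(linear_seq, 0.0, 1))  # every ratio is 0.5
print(q_ratios(quad_seq, 0.0, 2))    # every ratio is 1.0
```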

3. Sherman-Morrison-Woodbury

Lemma 1.5. If A and Ā are non-singular and related by Ā = A + abᵀ, then:

Ā⁻¹ = A⁻¹ − (A⁻¹ a bᵀ A⁻¹) / (1 + bᵀA⁻¹a)

Proof. We can check this by simply plugging in the formula. However, to "derive" it (expanding a Neumann series, valid when the series converges):

Ā⁻¹ = (A + abᵀ)⁻¹
    = (A[I + A⁻¹abᵀ])⁻¹
    = [I + A⁻¹abᵀ]⁻¹ A⁻¹
    = [I − A⁻¹abᵀ + A⁻¹abᵀA⁻¹abᵀ − · · ·] A⁻¹
    = [I − A⁻¹a(1 − bᵀA⁻¹a + · · ·) bᵀ] A⁻¹
    = A⁻¹ − (A⁻¹abᵀA⁻¹) / (1 + bᵀA⁻¹a)
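The identity can also be verified numerically on a small example. A pure-Python sketch (the matrix A and vectors a, b are arbitrary illustrative choices):

```python
def inv2(A):
    """Inverse of a 2x2 matrix."""
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

A = [[4.0, 1.0], [2.0, 3.0]]
a = [1.0, 2.0]
b = [3.0, 1.0]

Ainv = inv2(A)
A_new = [[A[i][j] + a[i] * b[j] for j in range(2)] for i in range(2)]  # A + a b^T

Aa = [sum(Ainv[i][k] * a[k] for k in range(2)) for i in range(2)]  # A^{-1} a
bA = [sum(b[k] * Ainv[k][j] for k in range(2)) for j in range(2)]  # b^T A^{-1}
denom = 1.0 + sum(b[k] * Aa[k] for k in range(2))                  # 1 + b^T A^{-1} a
smw = [[Ainv[i][j] - Aa[i] * bA[j] / denom for j in range(2)] for i in range(2)]

direct = inv2(A_new)
err = max(abs(smw[i][j] - direct[i][j]) for i in range(2) for j in range(2))
print(err)  # agrees up to rounding
```

This rank-one update formula is what makes the quasi-Newton updates of Part II cheap to invert.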


2 Line Search Methods

2.1 Step Length

1. Descent Direction

Definition 2.1. p ∈ Rn is a descent direction if pT∇fk < 0.

2. Problem: Find αk ∈ arg min φ(α) where φ(α) = f(xk + αpk). However, it is too costly to find the exact minimizer of φ. Instead, we find an α which acceptably reduces φ.

3. Wolfe Condition

(a) Armijo Condition. Curvature Condition. Wolfe Condition. Strong Wolfe Condition.

Definition 2.2. Fix x and a descent direction p. Let φ(α) = f(x + αp), so that φ′(α) = ∇f(x + αp)ᵀp. Fix 0 < c1 < c2 < 1.

i. α satisfies the Armijo Condition if φ(α) ≤ φ(0) + αc1φ′(0). Note l(α) := φ(0) + αc1φ′(0).

ii. α satisfies the Curvature Condition if φ′(α) ≥ c2φ′(0).

iii. α satisfies the Strong Curvature Condition if |φ′(α)| ≤ c2|φ′(0)|.

iv. The Wolfe Condition is the Armijo Condition together with the Curvature Condition.

v. The Strong Wolfe Condition is the Armijo Condition together with the Strong Curvature Condition.

(b) The Armijo Condition guarantees a decrease at the next iterate by ensuring φ(α) ≤ l(α) < φ(0) for α > 0.

(c) Curvature Condition

i. If φ′(α) < c2φ′(0), then φ is still decreasing at α, and so we improve the reduction in f by taking a larger α.

ii. If φ′(α) ≥ c2φ′(0), then either we are closer to a minimum where φ′ = 0, or φ′ > 0, which means we have surpassed the minimum.

iii. The strong condition guarantees that the choice of α is closer to φ′ = 0.

(d) Existence of Step satisfying Wolfe and Strong Wolfe Conditions

Lemma 2.1. Suppose f : Rn → R is continuously differentiable. Let p be a descent direction at x, and assume that f is bounded from below along the ray {x + αp : α > 0}. If 0 < c1 < c2 < 1, there exists an interval of step lengths satisfying the Wolfe and Strong Wolfe Conditions.

Proof.

i. Satisfying the Armijo Condition: Let l(α) = f(x) + c1αpᵀ∇f(x). Since l(α) is decreasing and φ(α) is bounded from below, eventually φ(α) = l(α). Let αA > 0 be the first time this intersection occurs. Then φ(α) < l(α) for α ∈ (0, αA).


ii. Curvature Condition: By the mean value theorem, ∃β ∈ (0, αA) such that (since l(αA) = φ(αA)):

αAφ′(β) = φ(αA) − φ(0) = c1αAφ′(0) > c2αAφ′(0)

And since φ′ is continuous, there is an interval containing β on which this inequality holds.

iii. Strong Curvature Condition: Since φ′(β) = c1φ′(0) < 0, it follows that:

|φ′(β)| < |c2φ′(0)|
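The conditions of Definition 2.2 are straightforward to check programmatically. A minimal sketch, using an illustrative 1-D example f(x) = x² at x = 1 with p = −1 (so φ(α) = (1 − α)²) and the conventional parameter choices c1 = 10⁻⁴, c2 = 0.9 (these constants are common defaults, not from the notes):

```python
def wolfe_report(phi, dphi, alpha, c1=1e-4, c2=0.9):
    """Check the Armijo, curvature, and strong-curvature conditions at alpha."""
    armijo = phi(alpha) <= phi(0.0) + c1 * alpha * dphi(0.0)
    curvature = dphi(alpha) >= c2 * dphi(0.0)
    strong = abs(dphi(alpha)) <= c2 * abs(dphi(0.0))
    return {"armijo": armijo, "curvature": curvature, "strong": strong}

# phi(alpha) = f(x + alpha p) for f(x) = x^2, x = 1, p = -1:
phi = lambda a: (1.0 - a) ** 2
dphi = lambda a: -2.0 * (1.0 - a)

print(wolfe_report(phi, dphi, 0.5))   # all three conditions hold
print(wolfe_report(phi, dphi, 1e-6))  # Armijo holds, curvature fails: step too small
```

The second call illustrates why the curvature condition is needed: arbitrarily tiny steps always satisfy Armijo but make no real progress.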

4. Goldstein Condition

(a) Goldstein Condition

Definition 2.3. Take c ∈ (0, 1/2). The Goldstein Condition is satisfied by α if:

φ(0) + (1 − c)αφ′(0) ≤ φ(α) ≤ φ(0) + cαφ′(0)

(b) Existence of Step satisfying Goldstein Condition

Lemma 2.2. Suppose f : Rn → R is continuously differentiable. Let p be a descent direction at x, and assume that f is bounded from below along the ray {x + αp : α > 0}. Let c ∈ (0, 1/2). Then there exists an interval satisfying the Goldstein condition.

Proof. Let l(α) = φ(0) + (1 − c)αφ′(0) and u(α) = φ(0) + cαφ′(0). For α > 0, l(α) < u(α) since c ∈ (0, 1/2). Since φ is bounded from below, let αu be the smallest point of intersection between u(α) and φ(α) for α > 0, and let αl be the largest point of intersection, less than αu, of l(α) and φ(α). Then, for α ∈ (αl, αu), l(α) < φ(α) < u(α).

5. Backtracking

(a) Backtracking

Definition 2.4. Let ρ ∈ (0, 1) and α > 0. Backtracking checks the sequence α, ρα, ρ²α, . . . until an element is found satisfying the Armijo condition.

(b) Existence of Step satisfying Backtracking

Lemma 2.3. Suppose f : Rn → R is continuously differentiable. Let p be a descent direction at x. Then there is a step produced by the backtracking algorithm which satisfies the Armijo condition.

Proof. By continuity, ∃ε > 0 such that for 0 < α < ε, φ(α) < φ(0) + cαφ′(0). Since ρⁿα → 0, there is an n such that ρⁿα < ε.


2.2 Step Length Selection Algorithms

1. Algorithms which implement the Wolfe or Goldstein Conditions exist and are ideal, but are not discussed herein.

2. Backtracking Algorithm

Algorithm 1: Backtracking Algorithm
input: Reduction rate ρ ∈ (0, 1), Initial estimate ᾱ > 0, c ∈ (0, 1), φ
α ← ᾱ
while φ(α) > φ(0) + αcφ′(0) do
    α ← ρα
end
return α
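Algorithm 1 translates directly into Python. A sketch (the iteration cap is a safeguard I have added; the pseudocode above does not have one):

```python
def backtracking(phi, dphi0, alpha0=1.0, rho=0.5, c=1e-4, max_iter=100):
    """Shrink alpha by rho until the Armijo condition holds; return that alpha."""
    phi0 = phi(0.0)
    alpha = alpha0
    for _ in range(max_iter):
        if phi(alpha) <= phi0 + alpha * c * dphi0:
            return alpha  # Armijo condition satisfied
        alpha *= rho      # backtrack
    raise RuntimeError("no Armijo step found (is p a descent direction?)")

# Example: phi(alpha) = f(x + alpha p) with f(x) = x^2, x = 1, p = -1.
phi = lambda a: (1.0 - a) ** 2
print(backtracking(phi, dphi0=-2.0))  # 1.0: the initial estimate already satisfies Armijo
```

By Lemma 2.3, the loop terminates whenever `dphi0 < 0` and φ is continuously differentiable.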

3. Interpolation with Expensive First Derivatives

(a) Suppose the Interpolation technique produces a sequence of guesses α0, α1, . . ..

(b) Interpolation produces αk by modeling φ with a polynomial m (quadratic or cubic) and then minimizing m to find a new estimate for α. The parameters of the polynomial are determined by requiring:

i. m(0) = φ(0)

ii. m′(0) = φ′(0)

iii. m(αk−1) = φ(αk−1)

iv. m(αk−2) = φ(αk−2)

(c) When k = 1, we model using a quadratic polynomial

4. Interpolation with Inexpensive First Derivatives

(a) Suppose interpolation produces a sequence of guesses α0, α1, . . .

(b) In this case, m is always a cubic polynomial, and αk is computed by requiring:

i. m(αk−1) = φ(αk−1)

ii. m(αk−2) = φ(αk−2)

iii. m′(αk−1) = φ′(αk−1)

iv. m′(αk−2) = φ′(αk−2)

5. Interpolation Algorithm

Algorithm 2: Interpolation Algorithm
input: Feasible Search Region [a, b], Initial estimate α0 > 0, c ∈ (0, 1), φ
α ← 0, β ← α0
while φ(β) ≥ φ(0) + cβφ′(0) do
    Approximate m
    if Inexpensive Derivatives then
        α ← β
    end
    Explicitly compute the minimizer of m in the feasible region; store as β.
end
return β
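For k = 1, the quadratic interpolant determined by m(0) = φ(0), m′(0) = φ′(0), and m(α0) = φ(α0) has a closed-form minimizer. The sketch below uses that standard formula; the example φ is an illustrative choice.

```python
def quad_interp_step(phi0, dphi0, a0, phi_a0):
    """Minimizer of the quadratic m with m(0)=phi0, m'(0)=dphi0, m(a0)=phi_a0.

    m(a) = phi0 + dphi0*a + q*a^2 with q = (phi_a0 - phi0 - dphi0*a0) / a0^2,
    so the minimizer is -dphi0 / (2q).
    """
    denom = 2.0 * (phi_a0 - phi0 - dphi0 * a0)
    return -dphi0 * a0 ** 2 / denom

# Example: phi(alpha) = (1 - alpha)^2 with a0 = 2. Since phi is itself quadratic,
# the interpolant is exact and the step lands on the true minimizer alpha = 1.
phi = lambda a: (1.0 - a) ** 2
print(quad_interp_step(phi(0.0), -2.0, 2.0, phi(2.0)))  # 1.0
```

The cubic case (conditions i-iv above) follows the same pattern with one more interpolation point.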


2.3 Global Convergence and Zoutendjik

1. Globally Convergent. Zoutendjik Condition.

Definition 2.5. Suppose we have an algorithm Ω which produces iterates (xk), and denote ∇f(xk) = ∇fk.

(a) Ω is globally convergent if ‖∇fk‖ → 0. (That is, the xk converge to a stationary point.)

(b) Suppose Ω produces search directions pk such that ‖pk‖ = 1, and let θk be the angle between ∇fk and pk. Ω satisfies the Zoutendjik Condition if

∑_{k=1}^∞ cos²(θk) ‖∇fk‖² < ∞

2. Zoutendjik Condition & Angle Bound implies global convergence

Lemma 2.4. Suppose Ω produces a sequence (xk, pk, ∇fk, θk) such that ∃δ > 0 with cos(θk) ≥ δ for all k ≥ 1. If Ω satisfies the Zoutendjik condition, then Ω is globally convergent.

Proof. By the Zoutendjik condition:

δ² ∑_{k=1}^∞ ‖∇fk‖² < ∞

Therefore, ‖∇fk‖2 → 0.

3. Example of Angle Bound

Example 2.1. Suppose pk = −Bk⁻¹∇fk where the Bk are positive definite and ‖Bk‖‖Bk⁻¹‖ < M < ∞ for all k. Letting ‖·‖ = ‖·‖₂, Bk = XΛXᵀ be the EVD of Bk with eigenvalues λ₁ ≥ · · · ≥ λn > 0, and z = Xᵀ∇fk:

|cos(θk)| = |−∇fkᵀBk⁻¹∇fk| / (‖∇fk‖ ‖Bk⁻¹∇fk‖)
          = |zᵀΛ⁻¹z| / (‖z‖ ‖Λ⁻¹z‖)
          = (∑ᵢ zᵢ²/λᵢ) / (‖z‖ √(∑ᵢ zᵢ²/λᵢ²))
          ≥ ((1/λ₁)‖z‖²) / ((1/λn)‖z‖²)
          = λn/λ₁
          ≥ 1/M

Hence, if Ω satisfies the Zoutendjik Condition, we see that ‖∇fn‖ → 0.


4. Wolfe Condition Line Search Satisfies Zoutendjik Condition.

Theorem 2.1. Suppose we have an objective f satisfying:

(a) f is bounded from below in Rn

(b) Given an initial x0, there is an open set N containing L = {x : f(x) ≤ f(x0)}

(c) f is continuously differentiable on N

(d) ∇f is Lipschitz continuous on N

Suppose we have an algorithm Ω producing (xk, pk,∇fk, θk, αk) such that:

(a) pk is a descent direction (with ‖pk‖ = 1)

(b) αk satisfies the Wolfe Conditions

Then Ω satisfies the Zoutendjik Condition.

Proof. The general strategy is to lower bound αk uniformly by C|∇fkᵀpk|, where C > 0. Then, using the descent condition:

fk+1 ≤ fk − cαk|∇fkᵀpk| ≤ fk − C|∇fkᵀpk|² = fk − C cos²(θk)‖∇fk‖₂²

and, telescoping,

fk+1 ≤ f0 − C ∑_{j=1}^k cos²(θj)‖∇fj‖₂²

Since f is bounded from below by some l, we have f0 − fk+1 ≤ f0 − l < ∞. The result follows.

So now we set out to show that αk ≥ C|∇fkᵀpk|. We do this by leveraging the Curvature Condition.

(a) From the curvature condition, we have that:

(∇fk+1 − ∇fk)ᵀpk ≥ (c2 − 1)∇fkᵀpk

(b) From Lipschitz Continuity, with constant L:

|(∇fk+1 −∇fk)T pk| ≤ ‖∇fk+1 −∇fk‖ ‖pk‖ ≤ Lαk ‖pk‖2 = Lαk

Together these imply that ((1 − c2)/L) |∇fkᵀpk| ≤ αk.

5. Goldstein Condition Line Search satisfies Zoutendjik Condition

Theorem 2.2. Suppose f has the same properties as in Theorem 2.1, and suppose Ω produces (xk, pk, ∇fk, θk, αk) such that:

(a) pk is a descent direction with ‖pk‖ = 1

(b) αk satisfies the Goldstein conditions.

Then Ω satisfies the Zoutendjik Condition.


Proof. We use Taylor’s Theorem and then Lispchitz Continuity to findthe lower bound. By Taylor’s theorem, for tk ∈ (0, 1):

(1− c)αk∇fTk pk ≤ fk+1 − fk = ∇f(xk + tkαkpk)Tαkpk

Therefore,

cαk|∇fTk pk| ≤ |(∇f(xk + tkαkpk)T −∇fTk )pk|

≤ Ltkαk ‖pk‖2

≤ Lαk

The result follows.

6. Backtracking Line Search satisfies Zoutendjik Condition

Theorem 2.3. Suppose f has the same properties as in Theorem 2.1, and suppose Ω produces (xk, pk, ∇fk, θk, αk) such that:

(a) pk is a descent direction with ‖pk‖ = 1

(b) αk is selected by backtracking with α = 1

Then Ω satisfies the Zoutendjik condition.

Proof. Suppose Ω uses ρ ∈ (0, 1). There are two cases to consider: αk = 1 and αk < 1. We denote the first subsequence by k(j) and the second by k(l). For αk(l), we know at least that αk(l)/ρ was rejected, that is:

f(xk + αk(l)pk/ρ) > fk + cαk(l)∇fkᵀpk/ρ

Using the same strategy as in the Goldstein case, ∃tk such that:

∇f(xk + tkαk(l)pk/ρ)ᵀαk(l)pk/ρ > cαk(l)∇fkᵀpk/ρ

Therefore, by Lipschitz continuity:

(1 − c)|∇fkᵀpk| ≤ Ltkαk(l)/ρ ≤ Lαk(l)/ρ

For αk(j) = 1, the Armijo condition gives the sequence:

fk(j+1) ≤ fk(j)+1 ≤ fk(j) + c∇fk(j)ᵀpk(j) = fk(j) − c|cos(θk(j))| ‖∇fk(j)‖

Therefore:

c ∑_j |cos(θk(j))| ‖∇fk(j)‖ < ∞

Then ∃J such that for j ≥ J:

cos²(θk(j)) ‖∇fk(j)‖² ≤ |cos(θk(j))| ‖∇fk(j)‖ < 1

Therefore, the Zoutendjik condition holds.


3 Trust Region Methods

3.1 Trust Region Subproblem and Fidelity

1. The strategy of the trust region method is as follows: at any point x we have a model mx(p) of f(x + p) which we trust to be a good approximation in a region ‖p‖ ≤ ∆. Minimizing this model gives a new iterate, at which the approach is applied again.

2. Trust Region Subproblem

Definition 3.1. Let x ∈ Rn and f : Rn → R be an objective function with model mx(p) in a region ‖p‖ ≤ ∆. The Trust Region Subproblem is to find

arg min {mx(p) : ‖p‖ ≤ ∆}

Note 3.1. Usually mx is taken to be

mx(p) = f(x) + ∇fxᵀp + ½ pᵀBp

where B is the model Hessian and is not necessarily positive definite.

3. Fidelity

Definition 3.2. The Fidelity of mx in a region ‖p‖ ≤ ∆ at a point x + p is

ρ(mx, f, p, ∆) = (f(x + p) − f(x)) / (mx(p) − mx(0))

4. Fidelity and Trust Region. Let 0 < c1 < c2 < 1

(a) If ρ ≥ c2, especially if p is intended for descent, the model is a good approximation to f, and we can expand the trust region ∆

(b) If ρ ≤ c1, especially if p is intended for descent, the model is a poor approximation to f, and we should shrink the trust region ∆

(c) Otherwise, the model is a sufficient approximation to f, and the trust region should not be adjusted.

5. Fidelity and Iterates. Suppose p causes a descent in mx, and let η ∈ (0, 1/4].

(a) If ρ < η, then we do not have a sufficient decrease in the function, and the process should be repeated at x

(b) If ρ ≥ η, then we have a sufficient decrease in f, and we should continue the process from x + p

6. Notation

(a) Let x0, x1, . . . be the iterates produced by successive solutions to the trust region subproblem for some objective f

(b) mxk(p) =: mk(p)

(c) ∆ for a particular mk is denoted ∆k

(d) Accepted solutions to the subproblem at xk are pk, so that xk+1 = xk + pk.


3.2 Fidelity Algorithms

1. Fidelity and Trust Region

Algorithm 3: Trust Region Management Algorithm
input: Thresholds 0 < c1 < c2 < 1, Trust Region ∆, ∆max, ρx
if ρx > c2 then
    ∆ ← min(2∆, ∆max)
else if ρx < c1 then
    ∆ ← ∆/2
else
    Continue
end
return ∆

Note 3.2. Usually c1 = 1/4 and c2 = 3/4.
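The radius update can be rendered directly in Python. A sketch using the thresholds of Section 3.1 (expand when fidelity exceeds 3/4, halve when it falls below 1/4; the doubling/halving factors follow the pseudocode):

```python
def update_radius(rho, delta, delta_max, c1=0.25, c2=0.75):
    """Trust-region management: return the new radius given fidelity rho."""
    if rho > c2:
        return min(2.0 * delta, delta_max)  # model is good: expand, capped at delta_max
    if rho < c1:
        return delta / 2.0                  # model is poor: shrink
    return delta                            # adequate: keep the radius

print(update_radius(0.9, 1.0, 1.5))  # 1.5 (expansion capped at delta_max)
print(update_radius(0.1, 1.0, 1.5))  # 0.5
print(update_radius(0.5, 1.0, 1.5))  # 1.0
```

Solution acceptance (Algorithm 4) is then a one-line test, `x + p if rho >= eta else x`.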

2. Fidelity and Solution Acceptance

Algorithm 4: Solution Acceptance Algorithm
input: Threshold η ∈ (0, 1/4], ρ, x, p
if ρx(p) ≥ η then
    x ← x + p
else
    Continue
end
return x

3.3 Approximate Solutions to Subproblem

3.3.1 Cauchy Point

1. Cauchy Point

Definition 3.3. Let mx be a model of f within ‖p‖ ≤ ∆. Let pS be the direction of steepest descent of mx at 0, normalized so that ‖pS‖ = 1, and let τ be:

arg min {mx(τpS) : 0 ≤ τ ≤ ∆}

Then pC = τpS is the Cauchy Point of mx.

Note 3.3. The Cauchy point is the line search minimizer of mk along the direction of steepest descent, subject to ‖p‖ ≤ ∆.

2. Computing the Cauchy Point

Lemma 3.1. Let mx be a quadratic model of f within ‖p‖ ≤ ∆ such that

mx(p) = f(x) + gᵀp + ½ pᵀBp

Then its Cauchy Point is:

pC = −(g/‖g‖) · min(∆, ‖g‖³/(gᵀBg))   if gᵀBg > 0
pC = −(g/‖g‖) · ∆                      if gᵀBg ≤ 0


Proof. First, we compute the direction of steepest descent:

pS = −g/‖g‖

Then we have that mx(tpS) = f(x) + tgᵀpS + (t²/2)(pS)ᵀBpS. If the quadratic term is negative, then t = ∆ is the minimizer, and so one option is:

pC = −∆g/‖g‖

If the quadratic term is positive, then:

(d/dt) mx(tpS) = gᵀpS + t(pS)ᵀBpS

Setting this to 0 and substituting back in for pS:

t = ‖g‖³/(gᵀBg)

This value could potentially be larger than ∆, so to safeguard against this, in this case:

τ = min(∆, ‖g‖³/(gᵀBg))
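Lemma 3.1 translates directly into code. A pure-Python sketch (the example g and B are illustrative; g is assumed nonzero):

```python
import math

def cauchy_point(g, B, delta):
    """Cauchy point of m(p) = f + g^T p + 0.5 p^T B p on ||p|| <= delta (Lemma 3.1)."""
    n = len(g)
    norm_g = math.sqrt(sum(gi * gi for gi in g))
    gBg = sum(g[i] * B[i][j] * g[j] for i in range(n) for j in range(n))
    if gBg > 0.0:
        tau = min(delta, norm_g ** 3 / gBg)  # interior or boundary minimizer
    else:
        tau = delta                          # nonconvex along -g: go to the boundary
    return [-gi / norm_g * tau for gi in g]

# Example with B = I: the line minimizer along -g has length ||g|| = 5,
# so with a large radius the Cauchy point is simply -g.
g = [3.0, 4.0]
B = [[1.0, 0.0], [0.0, 1.0]]
print(cauchy_point(g, B, delta=10.0))  # [-3.0, -4.0]
```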

3. Using the Cauchy point does not produce the best rate of convergence

3.3.2 Dogleg Method

1. Requirement: B ≻ 0

2. Intuitively, the dogleg method moves toward the minimizer of mx, which is pmin = −B⁻¹g, until it either reaches this point or hits the boundary of the trust region. It does this by first moving to the Cauchy Point, and then along a linear path to pmin.

3. Dogleg Path

Definition 3.4. Let mx be the quadratic model with B ≻ 0 of f with radius ∆. Let pC be the Cauchy Point and pmin the unconstrained minimizer of mx. The Dogleg Path is pDL(t), where:

pDL(t) = t pC                        for t ∈ [0, 1]
pDL(t) = pC + (t − 1)(pmin − pC)     for t ∈ [1, 2]

4. Dogleg Method

Definition 3.5. The Dogleg Method chooses xk+1 = xk + pDL(τ), where τ is the largest t ∈ [0, 2] such that ‖pDL(t)‖ ≤ ∆.
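The dogleg step can be sketched for a 2×2 model (assuming B ≻ 0, as the method requires; the hard-coded 2×2 inverse and the example data are illustrative simplifications):

```python
import math

def dogleg_step(g, B, delta):
    """Dogleg step for a 2x2 positive-definite B (illustrative sketch)."""
    # Unconstrained minimizer p_min = -B^{-1} g.
    (a, b), (c, d) = B
    det = a * d - b * c
    Binv = [[d / det, -b / det], [-c / det, a / det]]
    pmin = [-(Binv[0][0] * g[0] + Binv[0][1] * g[1]),
            -(Binv[1][0] * g[0] + Binv[1][1] * g[1])]
    if math.hypot(*pmin) <= delta:
        return pmin  # the full step fits inside the region
    # Cauchy point (B is positive definite, so g^T B g > 0).
    gBg = sum(g[i] * B[i][j] * g[j] for i in range(2) for j in range(2))
    norm_g = math.hypot(*g)
    tau = min(delta, norm_g ** 3 / gBg)
    pC = [-g[0] / norm_g * tau, -g[1] / norm_g * tau]
    if math.hypot(*pC) >= delta:
        return pC  # the boundary is reached on the first leg
    # Second leg: find z in [0, 1] with ||pC + z (pmin - pC)|| = delta.
    dvec = [pmin[0] - pC[0], pmin[1] - pC[1]]
    aa = dvec[0] ** 2 + dvec[1] ** 2
    bb = 2.0 * (pC[0] * dvec[0] + pC[1] * dvec[1])
    cc = pC[0] ** 2 + pC[1] ** 2 - delta ** 2
    z = (-bb + math.sqrt(bb * bb - 4.0 * aa * cc)) / (2.0 * aa)
    return [pC[0] + z * dvec[0], pC[1] + z * dvec[1]]

g = [1.0, 0.0]
B = [[1.0, 0.0], [0.0, 2.0]]
print(dogleg_step(g, B, delta=0.25))  # a step of length 0.25 along -g
```

By Lemma 3.2 below, ‖pDL(t)‖ is increasing along the path, so the boundary crossing in the second leg is unique.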


5. Properties of Dogleg Path

Lemma 3.2. Suppose pDL(t) is the Dogleg path of mx with ∆ = ∞ and B ≻ 0. Then:

(a) mx(pDL(t)) is decreasing

(b) ‖pDL(t)‖ is increasing

Proof. When 0 < t < 1, the case is uninteresting: for these values we know mx(pDL(t)) is decreasing and ‖pDL(t)‖ is increasing. So we consider 1 < t < 2 and reparameterize using z = t − 1. Since g = −Bpmin, we have that:

(d/dz) mx(pDL(1 + z)) = (pmin − pC)ᵀ(g + BpC) + z(pmin − pC)ᵀB(pmin − pC)
                      = −(1 − z)(pmin − pC)ᵀB(pmin − pC)

Letting c = (pmin − pC)ᵀB(pmin − pC) > 0 (since B ≻ 0), the derivative is strictly negative for z < 1. To show that the norm is increasing, first we show that the angle between pC and pmin lies in (−π/2, π/2), and then we show that pmin is the longer of the two.

(a) For the angle, with g ≠ 0:

⟨pmin, pC⟩ = (gᵀB⁻¹g) ‖g‖² / (gᵀBg) > 0

(b) Using Cauchy-Schwarz, ‖g‖‖pmin‖ ≥ ‖g‖‖pC‖ since:

‖B⁻¹g‖‖g‖ ≥ gᵀB⁻¹g = (gᵀB⁻¹g)(gᵀBg)/(gᵀBg) ≥ ‖g‖⁴/(gᵀBg)

3.3.3 Global Convergence of Cauchy Point Methods

1. Reduction obtained by Cauchy Point

Lemma 3.3. The Cauchy Point pCk satisfies

mk(pCk) − mk(0) ≤ −½ ‖gk‖ min(∆k, ‖gk‖/‖Bk‖)

Proof. We have that

mk(pCk) − mk(0) = gkᵀpCk + ½ (pCk)ᵀBk(pCk)


This brings in two cases. When gᵀBg ≤ 0, then (dropping subscripts):

mk(pC) − mk(0) = −‖g‖∆ + (∆²/(2‖g‖²)) gᵀBg
              ≤ −‖g‖∆
              ≤ −½ ‖g‖ min(∆, ‖g‖/‖B‖)

When gᵀBg > 0 and ∆(gᵀBg) ≤ ‖g‖³, we are again in the case τ = ∆:

mk(pC) − mk(0) = −‖g‖∆ + (∆²/(2‖g‖²)) gᵀBg
              ≤ −‖g‖∆ + ½ ‖g‖∆
              = −½ ‖g‖∆
              ≤ −½ ‖g‖ min(∆, ‖g‖/‖B‖)

When ∆(gᵀBg) > ‖g‖³:

mk(pC) − mk(0) = −‖g‖⁴/(gᵀBg) + (‖g‖⁴/(2(gᵀBg)²)) gᵀBg
              = −‖g‖⁴/(2gᵀBg)
              ≤ −½ ‖g‖²/‖B‖
              ≤ −½ ‖g‖ min(∆, ‖g‖/‖B‖)

where the third line uses gᵀBg ≤ ‖B‖‖g‖².

2. Reduction achieved by Dogleg

Corollary 3.1. The dogleg step pDLk satisfies:

mk(pDLk) − mk(0) ≤ −½ ‖gk‖ min(∆k, ‖gk‖/‖Bk‖)

Proof. Since mk(pDLk) ≤ mk(pCk), the result follows.

3. Reduction achieved by any method based on Cauchy Point

Corollary 3.2. Suppose a method uses a step pk where mk(pk) − mk(0) ≤ c(mk(pCk) − mk(0)) for some c ∈ (0, 1). Then pk satisfies:

mk(pk) − mk(0) ≤ −(c/2) ‖gk‖ min(∆k, ‖gk‖/‖Bk‖)


4. Convergence when η = 0

Theorem 3.1. Let x0 ∈ Rn and f be an objective function. Let S = {x : f(x) ≤ f(x0)} and let S(R0) = {x : ‖x − y‖ < R0 for some y ∈ S} be a neighborhood of radius R0 about S. Suppose the following conditions:

(a) f is bounded from below on S

(b) f is Lipschitz continuously differentiable on S(R0) for some R0 > 0and some constant L.

(c) There is a c such that for all k:

mk(pk) − mk(0) ≤ −c ‖gk‖ min(∆k, ‖gk‖/‖Bk‖)

(d) There is a γ ≥ 1 such that ‖pk‖ ≤ γ∆k for all k

(e) η = 0 (fidelity threshold)

(f) ‖Bk‖ ≤ β for all k.

Then:

lim inf ‖gk‖ = 0

5. Convergence when η ∈ (0, 1/4)

Theorem 3.2. If η ∈ (0, 1/4), with all other conditions from Theorem 3.1 holding, then:

lim ‖gk‖ = 0

3.4 Iterative Solutions to Subproblem

3.4.1 Exact Solution to Subproblem

1. Conditions for Minimizing Quadratics

Lemma 3.4. Let m be the quadratic function

m(p) = gᵀp + ½ pᵀBp

where B is a symmetric matrix.

(a) m attains a minimum if and only if B ⪰ 0 and g ∈ Im(B). If B ⪰ 0, then every p satisfying Bp = −g is a global minimizer of m.

(b) m has a unique minimizer if and only if B ≻ 0.

Proof. If p∗ is a minimizer of m, we have from the second order conditions that 0 = ∇m(p∗) = Bp∗ + g (so g ∈ Im(B)) and B = ∇²m(p∗) ⪰ 0. Conversely, suppose B ⪰ 0 and g ∈ Im(B); then ∃p such that Bp = −g. Let w ∈ Rn; then:

m(p + w) = gᵀp + gᵀw + ½ (wᵀBw + 2pᵀBw + pᵀBp)
         = m(p) + gᵀw − gᵀw + ½ wᵀBw
         ≥ m(p)

where the middle step uses pᵀBw = (Bp)ᵀw = −gᵀw, and the inequality follows since B ⪰ 0. This proves the first part.

Now suppose m has a unique minimizer p∗, and suppose ∃w ≠ 0 such that wᵀBw = 0. Since p∗ is a minimizer, we have Bp∗ = −g, and from the computation above m(p∗ + w) = m(p∗); so p∗ + w and p∗ are both minimizers of m, a contradiction. For the other direction, since B ≻ 0, letting p∗ = −B⁻¹g, by the Second Order Sufficient Conditions m has a unique minimizer.

2. Characterization of Exact Solution

Theorem 3.3. The vector p∗ is a solution to the trust region problem:

min_{p∈Rn} { f(x) + gᵀp + ½ pᵀBp : ‖p‖ ≤ ∆ }

if and only if p∗ is feasible and ∃λ ≥ 0 such that:

(a) (B + λI)p∗ = −g

(b) λ(‖p∗‖ − ∆) = 0

(c) B + λI ⪰ 0

Proof. First we prove the (⇐) direction: suppose ∃λ ≥ 0 satisfying the properties. Consider

m∗(p) = f(x) + gᵀp + ½ pᵀ(B + λI)p = m(p) + (λ/2)‖p‖²

By Lemma 3.4, since B + λI ⪰ 0 and (B + λI)p∗ = −g, p∗ minimizes m∗; that is, m∗(p) ≥ m∗(p∗), which implies:

m(p) ≥ m(p∗) + (λ/2)(‖p∗‖² − ‖p‖²)

So if λ = 0, then m(p) ≥ m(p∗) and p∗ is a minimizer of m; and if λ > 0, then ‖p∗‖ = ∆, so any feasible p has ‖p‖ ≤ ∆, which implies m(p) ≥ m(p∗).

Now suppose p∗ is a solution. If ‖p∗‖ < ∆, then p∗ is the unconstrained minimizer of m, so λ = 0 satisfies the conditions: ∇m(p∗) = Bp∗ + g = 0 and ∇²m(p∗) = B ⪰ 0 by the previous lemma. Now suppose ‖p∗‖ = ∆. Then the second condition is automatically satisfied. We use the Lagrangian to determine the solution:

L(λ, p) = gᵀp + ½ pᵀBp + (λ/2)(pᵀp − ∆²)

Differentiating with respect to p and setting this equal to 0, we see that

(B + λI)p∗ + g = 0

To show that B + λI is positive semi-definite, note that for any p such that ‖p‖ = ∆, since p∗ is the minimizer of m over the boundary:

m(p) + (λ/2)‖p‖² ≥ m(p∗) + (λ/2)‖p∗‖²

Rearranging, we have that

(p − p∗)ᵀ(B + λI)(p − p∗) ≥ 0

And since the set of all normalized ±(p − p∗) with ‖p‖ = ∆ is dense on the unit sphere, B + λI is positive semi-definite. The last thing to show is that λ ≥ 0, which follows using the fact that if λ < 0, then p∗ must be a global minimizer of m; by the previous lemma, this is a contradiction.


3.4.2 Newton’s Root Finding Iterative Solution

1. Theorem 3.3 guarantees that a solution exists, and we need only find a λ which satisfies its conditions. Two cases arise from the theorem:

Case 1: λ = 0. Then B ⪰ 0, and we can compute a solution p∗ of Bp∗ = −g using, e.g., a QR decomposition.

Case 2 λ > 0. First, letting QΛQT , be the EVD of B, and defining:

p(λ) = −Q(Λ + λI)−1QT g

Hence:

‖p(λ)‖2 = gTQ(Λ + λI)−2QT g =

n∑j=1

(qTj g)2

(λj + λ)2

From Theorem 3.3, we want to find λ > 0 for which:

∆2 =

n∑j=1

(qTj g)2

(λj + λ)2

2. The second case has two sub-cases which must be considered. Let Q1 bethe columns of Q which correspond to the eigenspace of λ1.

Case 2a If QT1 g 6= 0, then we implement Newton’s Root finding algorithm on:

f(λ) =1

∆− 1

‖p(λ)‖

since as λ→ λ1, ‖p(λ)‖ → ∞ and so a solution exists.

Case 2b If QT1 g = 0, then applying Newton’s root finding algorithm naivelyis not useful since ‖p(λ)‖ 6→ ∞ as λ → λ1. We note that whenλ = −λ1, (B + λI) 0 and so we can find a τ such that:

∆2 =∑

j:λ1 6=λj

(qTj g)2

(λj − λ1)2+ τ2

Computation. ∃z 6= 0 such that (B−λ1I)z = 0 and ‖z‖ = 1. There-fore, qTj z = 0 for all j : λ1 6= λj . Setting

p(λ1, τ) =∑

j:λ1 6=λj

(qTj g)2

(λj − λ1)2+ τ2

we need only find τ .

3.5 Trust Region Subproblem Algorithms


4 Conjugate Gradients

1. Conjugate Gradient Overview

Algorithm 5: Overview of Conjugate Gradient Algorithm
input : x0 a starting point, ε tolerance, f objective
x ← x0
Compute Gradient: r ← ∇f(x)
Compute Conjugated Descent Direction: p ← −r
while ‖r‖ > ε do
    Compute Optimal Step Length: α
    Compute Step: x ← x + αp
    Compute Gradient: r ← ∇f(x)
    Compute Conjugated Step Direction: p
end
return Minimizer: x

2. Conjugate Gradients for minimizing a convex (A ≻ 0), quadratic function

    f(x) = ½ xᵀAx − bᵀx

Algorithm 6: Conjugate Gradients Algorithm for Convex Quadratic
input : x0 a starting point, ε tolerance, f objective
x ← x0
Compute Gradient: r ← ∇f(x) = Ax − b
Compute Conjugated Descent Direction: p ← −r
while ‖r‖ > ε do
    Compute Optimal Step Length: α ← rᵀr / (pᵀAp)
    Compute Step: x ← x + αp
    Store Previous: r_old ← r
    Compute Gradient: r ← r + αAp
    Compute Conjugated Step Direction: p ← −r + (‖r‖² / ‖r_old‖²) p
end
return Minimizer: x
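Algorithm 6 translates almost line for line into code. A minimal sketch (names ours):

```python
import numpy as np

def conjugate_gradient(A, b, x0, eps=1e-10):
    """Minimize f(x) = 0.5 x^T A x - b^T x for symmetric positive definite A."""
    x = x0.astype(float)
    r = A @ x - b                          # gradient of f at x
    p = -r                                 # first (steepest descent) direction
    while np.linalg.norm(r) > eps:
        alpha = (r @ r) / (p @ A @ p)      # exact minimizer along p
        x = x + alpha * p
        r_old = r
        r = r + alpha * (A @ p)            # gradient update, reusing A @ p
        beta = (r @ r) / (r_old @ r_old)   # conjugacy coefficient
        p = -r + beta * p
    return x
```

In exact arithmetic the loop terminates in at most n iterations, since the directions are A-conjugate.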

3. Fletcher-Reeves for Nonlinear Functions

Algorithm 7: Fletcher-Reeves CG Algorithm
input : x0 a starting point, ε tolerance, f objective
x ← x0
Compute Gradient: r ← ∇f(x)
Compute Conjugated Descent Direction: p ← −r
while ‖r‖ > ε do
    Compute Optimal Step Length: α (Strong Wolfe Line Search)
    Compute Step: x ← x + αp
    Store Previous: r_old ← r
    Compute Gradient: r ← ∇f(x)
    Compute Conjugated Step Direction: p ← −r + (‖r‖² / ‖r_old‖²) p
end
return Minimizer: x


Part II

Model Hessian Selection

5 Newton’s Method

1. Requires ∇²f ≻ 0 in some region for the method to work.

2. Line Search

(a) The search direction:

    pNk = −(∇²fk)⁻¹∇fk

(b) Rate of Convergence

Theorem 5.1. Suppose f is an objective with minimum x∗ satisfying the Second Order Conditions. Moreover:

i. Suppose f is twice continuously differentiable in a neighborhood of x∗

ii. Specifically, suppose ∇²f is Lipschitz continuous in a neighborhood of x∗

iii. Suppose pk = pNk and xk+1 = xk + pNk

iv. Suppose xk → x∗

Then for x0 sufficiently close to x∗:

i. xk → x∗ quadratically

ii. ‖∇fk‖ → 0 quadratically.

Proof. Our goal is to express xk+1 − x∗ in terms of ∇²f and use Lipschitz continuity to provide an upper bound.

i. We have that:

    xk+1 − x∗ = xk + pk − x∗
    = xk − x∗ − (∇²fk)⁻¹∇fk
    = (∇²fk)⁻¹ [∇²fk(xk − x∗) − (∇fk − ∇f∗)]
    = (∇²fk)⁻¹ [∇²fk(xk − x∗) − ∫₀¹ ∇²f(xk + t(x∗ − xk))(xk − x∗) dt]
    = (∇²fk)⁻¹ [∫₀¹ (∇²fk − ∇²f(xk + t(x∗ − xk)))(xk − x∗) dt]


ii. Using Lipschitz continuity:

    ‖xk+1 − x∗‖ ≤ ‖(∇²fk)⁻¹‖ ∫₀¹ ‖∇²fk − ∇²f(xk + t(x∗ − xk))‖ ‖xk − x∗‖ dt
    ≤ ‖(∇²fk)⁻¹‖ ∫₀¹ tL ‖xk − x∗‖² dt
    = ‖(∇²fk)⁻¹‖ (L/2) ‖xk − x∗‖²

We now need to show that ‖(∇²fk)⁻¹‖ is bounded from above. Let η(2r) be a neighborhood of radius 2r about x∗ on which ∇²f ≻ 0. In this neighborhood, (∇²f)⁻¹ exists and g(x) = ‖∇²f(x)⁻¹‖ is continuous. Then, on the compact closed ball η(r), g(x) has a finite maximum. Therefore, we have quadratic convergence of xk → x∗.

iii. We now consider the first derivative:

    ‖∇fk+1 − ∇f∗‖ = ‖∇fk+1‖
    = ‖∇fk+1 − ∇fk − ∇²fk pk‖
    = ‖∫₀¹ [∇²f(xk + tpk) − ∇²fk] pk dt‖
    ≤ ∫₀¹ Lt ‖pk‖² dt
    ≤ (L/2) g(xk)² ‖∇fk‖²

where the last step uses pk = −(∇²fk)⁻¹∇fk. So if x0 ∈ η(r), the result follows.
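The quadratic rate in Theorem 5.1 is easy to observe numerically. A toy sketch (the objective f(x) = Σ(xᵢ² + xᵢ⁴) and all names are our own); its Hessian is positive definite everywhere and x∗ = 0:

```python
import numpy as np

def grad(x):
    return 2 * x + 4 * x ** 3           # gradient of f(x) = sum(x_i^2 + x_i^4)

def hess(x):
    return np.diag(2 + 12 * x ** 2)     # Hessian; positive definite for all x

x = np.array([0.5, -0.3])
errors = []
for _ in range(6):
    x = x - np.linalg.solve(hess(x), grad(x))   # pure Newton step
    errors.append(np.linalg.norm(x))            # error, since x* = 0
```

Printing `errors` shows each error is roughly bounded by a constant times the square of the previous one, matching the theorem.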

3. Trust Region

(a) Asymptotically Similar

Definition 5.1. A sequence pk is asymptotically similar to a sequence qk if

    ‖pk − qk‖ = o(‖qk‖)

(b) The Subproblem

    min_{p ∈ Rⁿ} f(x) + ∇f(x)ᵀp + ½ pᵀ∇²f(x) p   subject to ‖p‖ ≤ ∆

(c) Rate of Convergence

Theorem 5.2. Suppose f is an objective with minimum x∗ satisfying the Second Order Conditions. Moreover:

i. Suppose f is twice continuously differentiable in a neighborhood of x∗

ii. Specifically, suppose ∇²f is Lipschitz continuous in a neighborhood of x∗

iii. Suppose for some c ∈ (0, 1), the pk satisfy

    mk(pk) − mk(0) ≤ −c ‖∇fk‖ min(∆k, ‖∇fk‖ / ‖∇²fk‖)

and are asymptotically similar to pNk whenever ‖pNk‖ ≤ ∆k/2.

iv. Suppose xk+1 = xk + pk and xk → x∗

Then:

i. The constraint ∆k becomes inactive for sufficiently large k.

ii. xk → x∗ superlinearly.

Proof. We need to show that the ∆k are bounded from below. The second part follows from this quickly: xk → x∗ implies ‖pNk‖ → 0, so eventually ‖pNk‖ ≤ ∆k/2. Applying the asymptotic similarity:

    ‖xk + pk − x∗‖ ≤ ‖xk + pNk − x∗‖ + ‖pNk − pk‖
    ≤ o(‖xk − x∗‖²) + o(‖pNk‖)
    ≤ o(‖xk − x∗‖²) + o(‖xk − x∗‖)
    ≤ o(‖xk − x∗‖)

where the penultimate inequality follows from:

    ‖pNk‖ = ‖(∇²fk)⁻¹(∇fk − ∇f∗)‖ ≤ ‖(∇²fk)⁻¹‖ ‖xk − x∗‖ (sup_{x ∈ η(r)} ‖∇²f(x)‖)

To get that ∆k is bounded from below, we frame it in terms of ρk. The numerator of ρk − 1 satisfies:

    |f(xk + pk) − f(xk) − (mk(pk) − mk(0))| = |∫₀¹ (1 − t) pkᵀ(∇²f(xk + tpk) − ∇²fk) pk dt| ≤ (L/4) ‖pk‖³

And the denominator of ρk − 1 satisfies:

    |mk(pk) − mk(0)| ≥ c ‖∇fk‖ min(∆k, ‖∇fk‖ / ‖∇²fk‖) ≥ c ‖∇fk‖ min(‖pk‖, ‖∇fk‖ / ‖∇²fk‖)

So we need only lower bound ‖∇fk‖ by a multiple of ‖pk‖ to translate this into a lower bound for ∆k. If ‖pNk‖ > ∆k/2, then (in some neighborhood of x∗)

    ‖pk‖ ≤ ∆k ≤ 2‖pNk‖ ≤ 2‖(∇²fk)⁻¹‖ ‖∇fk‖ ≤ C ‖∇fk‖

Else if ‖pNk‖ ≤ ∆k/2, then

    ‖pk‖ ≤ ‖pNk‖ + o(‖pNk‖) ≤ 2‖pNk‖ ≤ C ‖∇fk‖

Therefore:

    |mk(pk) − mk(0)| ≥ (c/C) ‖pk‖ min(‖pk‖, ‖pk‖ / (‖∇²fk‖ ‖(∇²fk)⁻¹‖))

Since the second term's denominator is the condition number of the matrix (which must be greater than or equal to 1), we have that:

    |ρk − 1| ≤ C′ ‖pk‖ ≤ C′ ∆k

Hence, if ∆k < 1/(4C′), then |ρk − 1| < 1/4, so ρk > 3/4 and ∆k would not be decreased (indeed, it may double). So ∆k ≥ ∆₀/2^M for some fixed M such that ∆₀/2^M ≤ 1/(4C′). Finally, for sufficiently large k, pk → 0, so ρk → 1 and the constraint ∆k becomes inactive.


6 Newton’s Method with Hessian Modification

1. If ∇²fk is not positive definite, we can add a matrix Ek so that Bk := ∇²fk + Ek ≻ 0. As long as this model produces a descent direction, and ‖Bk‖ and ‖Bk⁻¹‖ are bounded uniformly, Zoutendjik guarantees convergence.

2. Newton’s Method with Hessian Modification

Algorithm 8: Line Search with Modified Hessian
input : Starting point x0, Tolerance ε
k ← 0
while ‖gk‖ > ε do
    Find Ek so that Bk := ∇²fk + Ek ≻ 0
    pk ← Solve Bk p = −gk
    Compute αk (Wolfe, Goldstein, Backtracking, Interpolation)
    xk+1 ← xk + αk pk
    k ← k + 1
end

3. Minimum Frobenius Norm

(a) Problem: find E which satisfies min ‖E‖²_F : λmin(A + E) ≥ δ.

(b) Solution: If A = QΛQᵀ is the EVD of A, then E = Q diag(τi) Qᵀ where

    τi = 0 if λi ≥ δ,   τi = δ − λi if λi < δ

4. Minimum 2-Norm

(a) Problem: find E which satisfies min ‖E‖₂ : λmin(A + E) ≥ δ

(b) Solution: If λmin(A) = λn, then E = τI where

    τ = 0 if λn ≥ δ,   τ = δ − λn if λn < δ

5. Modified Cholesky

(a) Compute the LDLᵀ factorization, but add constants to the diagonal during the factorization to ensure positive definiteness

(b) May require pivoting to ensure numerical stability

6. Modified Block Cholesky

(a) Compute the LBLᵀ factorization, where B is block diagonal with 1×1 and 2×2 blocks (the number of 1×1 blocks is the number of positive eigenvalues and the number of 2×2 blocks is the number of negative eigenvalues)

(b) Compute the EVD of B and modify it to ensure appropriate eigenvalues.
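The two eigenvalue-based modifications (items 3 and 4) can be sketched directly from their EVD formulas; a minimal sketch (function names ours):

```python
import numpy as np

def modify_frobenius(A, delta):
    """Minimum Frobenius-norm E with lambda_min(A + E) >= delta."""
    lam, Q = np.linalg.eigh(A)
    tau = np.where(lam >= delta, 0.0, delta - lam)   # per-eigenvalue shift
    return Q @ np.diag(tau) @ Q.T

def modify_two_norm(A, delta):
    """Minimum 2-norm E = tau * I with lambda_min(A + E) >= delta."""
    lam_min = np.linalg.eigvalsh(A)[0]
    tau = 0.0 if lam_min >= delta else delta - lam_min
    return tau * np.eye(A.shape[0])
```

Both make A + E safely positive definite; the Frobenius version shifts only the offending eigenvalues, while the 2-norm version shifts all of them equally.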


7 Quasi-Newton’s Method

1. Fundamental Idea

Computation. Let f be our objective; suppose it is twice continuously differentiable with Lipschitz continuous Hessian. From Taylor's Theorem:

    ∇f(x + p) = ∇f(x) + ∫₀¹ ∇²f(x + tp) p dt
    = ∇f(x) + ∇²f(x)p + ∫₀¹ (∇²f(x + tp) − ∇²f(x)) p dt
    = ∇f(x) + ∇²f(x)p + o(‖p‖)

Therefore, letting x = xk and p = xk+1 − xk:

    ∇²fk(xk+1 − xk) ≈ ∇fk+1 − ∇fk

Quasi-Newton methods choose an approximation Bk+1 to ∇²fk which satisfies:

    Bk+1(xk+1 − xk) = gk+1 − gk

2. Secant Equation

Definition 7.1. Let xi be the sequence of iterates when minimizing an objective function f. Let sk = xk+1 − xk and yk = gk+1 − gk, where gi = ∇fi. Then the secant equation is

    Bk+1 sk = yk

where Bk+1 is some matrix.

3. Curvature Condition

Definition 7.2. The curvature condition is skᵀyk > 0.

Note 7.1. If Bk+1 satisfies the secant equation and sk, yk satisfy the curvature condition, then skᵀBk+1 sk = skᵀyk > 0. So the quadratic model along the step direction is convex.

4. Line Search Rate of Convergence

Theorem 7.1. Suppose f is an objective with minimum x∗ satisfying the Second Order Conditions. Moreover:

(a) Suppose f is twice continuously differentiable in a neighborhood of x∗

(b) Suppose Bk is computed using a quasi-Newton method and

    Bk pk = −gk

(c) Suppose xk+1 = xk + pk

(d) Suppose xk → x∗

When x0 is sufficiently close to x∗, xk → x∗ superlinearly if and only if:

    lim_{k→∞} ‖(Bk − ∇²f(x∗)) pk‖ / ‖pk‖ = 0

Proof. We show that the condition is equivalent to pk − pNk = o(‖pk‖). First, if the condition holds:

    pk − pNk = (∇²fk)⁻¹(∇²fk pk + ∇fk)
    = (∇²fk)⁻¹(∇²fk − Bk) pk
    = (∇²fk)⁻¹[(∇²fk − ∇²f(x∗)) pk + (∇²f(x∗) − Bk) pk]

using Bk pk = −gk. Taking norms and using that x0 is sufficiently close to x∗, along with continuity:

    ‖pk − pNk‖ = o(‖pk‖)

For the other direction, the same identity gives (up to the uniformly bounded factor ‖(∇²fk)⁻¹‖):

    o(‖pk‖) = ‖pk − pNk‖ ≥ ‖(∇²f(x∗) − Bk) pk‖ (up to constants)

Now we just apply the triangle inequality and properties of the Newton step:

    ‖xk + pk − x∗‖ ≤ ‖xk + pNk − x∗‖ + ‖pk − pNk‖ = o(‖xk − x∗‖²) + o(‖pk‖)

Note that o(‖pk‖) = o(‖xk − x∗‖), which gives superlinear convergence.

7.1 Rank-2 Update: DFP & BFGS

1. Davidon-Fletcher-Powell Update

(a) Let

    Gk = ∫₀¹ ∇²f(xk + tαk pk) dt

(the average Hessian along the step). Then Taylor's Theorem implies:

    yk = Gk sk

(b) Define the weighted norm:

    ‖A‖_W = ‖Gk^(−1/2) A Gk^(−1/2)‖_F

(c) Problem. Given Bk symmetric positive definite, sk and yk, find

    argmin ‖B − Bk‖_W : B = Bᵀ, B sk = yk

(d) Solution:

    Bk+1 = (I − (yk skᵀ)/(ykᵀsk)) Bk (I − (sk ykᵀ)/(ykᵀsk)) + (yk ykᵀ)/(ykᵀsk)


(e) Inverse Hessian:

    Hk+1 = Hk − (Hk yk ykᵀ Hk)/(ykᵀ Hk yk) + (sk skᵀ)/(ykᵀ sk)

2. Broyden-Fletcher-Goldfarb-Shanno Update

(a) Problem: Given Hk (the inverse of Bk) symmetric positive definite, sk and yk, find

    argmin ‖H − Hk‖_W : H = Hᵀ, H yk = sk

(b) Solution:

    Hk+1 = (I − (sk ykᵀ)/(ykᵀsk)) Hk (I − (yk skᵀ)/(ykᵀsk)) + (sk skᵀ)/(ykᵀsk)

(c) Hessian:

    Bk+1 = Bk − (Bk sk skᵀ Bk)/(skᵀ Bk sk) + (yk ykᵀ)/(ykᵀ sk)

3. Preservation of Positive Definiteness

Lemma 7.1. If Bk ≻ 0, skᵀyk > 0, and Bk+1 is updated by the BFGS method, then Bk+1 ≻ 0.

Proof. We can equivalently show that Hk+1 ≻ 0. Let ρk = 1/(ykᵀsk) > 0 and let z ∈ Rⁿ \ {0}. Then:

    zᵀHk+1 z = wᵀHk w + ρk (skᵀz)²

where w = z − ρk(skᵀz) yk. If skᵀz = 0, then w = z ≠ 0 and zᵀHk z > 0. If skᵀz ≠ 0, then ρk(skᵀz)² > 0 (regardless of whether w = 0), and so Hk+1 ≻ 0.
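The inverse-Hessian BFGS update and the lemma's conclusion can be checked numerically; a minimal sketch (helper name ours):

```python
import numpy as np

def bfgs_update(H, s, y):
    """BFGS update of the inverse Hessian approximation H (requires y^T s > 0)."""
    rho = 1.0 / (y @ s)                  # positive by the curvature condition
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)
```

After the update, H satisfies the inverse secant equation H yk = sk, and positive definiteness is preserved as the lemma asserts.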

7.2 Rank-1 Update: SR1

1. Problem: find σ ∈ {−1, 1} and a vector v such that:

    Bk+1 = Bk + σ v vᵀ   and   Bk+1 sk = yk

2. Solution:

    Bk+1 = Bk + ((yk − Bk sk)(yk − Bk sk)ᵀ) / (skᵀ(yk − Bk sk))

Proof. Multiplying the secant equation through by sk:

    yk − Bk sk = σ (vᵀsk) v

Hence, v is a multiple of yk − Bk sk; substituting back and solving for the scale gives the update.
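A quick numerical check of the SR1 update (names ours; in practice the update is skipped when the denominator skᵀ(yk − Bk sk) is near zero):

```python
import numpy as np

def sr1_update(B, s, y):
    """SR1 update; assumes s^T (y - B s) is safely nonzero."""
    v = y - B @ s                          # the rank-1 update direction
    return B + np.outer(v, v) / (s @ v)
```

The update is symmetric by construction and reproduces the secant equation exactly.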


Part III

Specialized Applications of Optimization

8 Inexact Newton’s Method

1. Requirements for the search direction:

(a) pk is inexpensive to compute

(b) pk approximates the Newton step closely, as measured by the residual:

    zk = ∇²fk pk + ∇fk

(c) The error is controlled by the current gradient:

    ‖zk‖ ≤ ηk ‖∇fk‖

2. Typically, inexact methods keep Bk = ∇²fk and instead economize on computing pk, by solving the Newton system only approximately.

3. CG-Based Algorithmic Overview

Algorithm 9: Overview of Inexact CG Algorithm
input : x0 a starting point, f objective, restrictions
x ← x0
Define Tolerance: ε
Compute Gradient: r ← ∇f(x)
Compute Conjugated Descent Direction: p ← −r
while ‖r‖ > ε do
    Check Curvature of Constrained Polynomial:
    if pᵀBp ≤ 0 then
        Compute α given restrictions
        Compute Minimizer: x ← x + αp
        Break Loop
    else
        Compute Optimal Step Length: α
        if Restrictions on x then
            Compute τ given restrictions
            Compute Minimizer: x ← x + ταp
            Break Loop
        else
            Compute Step: x ← x + αp
        end
        Compute Gradient: r ← ∇f(x)
        Compute Conjugated Step Direction: p
    end
end
x∗ ← x
return Minimizer: x∗

4. As with the usual CG, we can implement methods for computing thegradient and conjugate step direction efficiently


5. Line Search Strategy: the restriction on α is simply that it satisfies the strong Wolfe conditions.

6. Trust-Region Strategy:

(a) The restriction on α when pᵀBp ≤ 0 is that ‖x + αp‖ = ∆: the model along the direction p is decreasing, so we should go all the way to the boundary.

(b) The restriction on x is triggered when ‖x + αp‖ > ∆: x + αp is no longer in the trust region, so we find τ so that ‖x + ταp‖ = ∆.

7. An alternative is to use Lanczos' Method, which generalizes CG methods.
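For the line search strategy, the core computation is a truncated CG solve of the Newton system. A minimal sketch (names ours): run CG on Bp = −g, stop once the residual satisfies ‖Bp + g‖ ≤ η‖g‖, and fall back to steepest descent if negative curvature is detected.

```python
import numpy as np

def inexact_newton_step(B, g, eta=0.1, max_iter=100):
    p = np.zeros_like(g)
    r = g.copy()                           # residual of B p + g at p = 0
    d = -r
    for _ in range(max_iter):
        if np.linalg.norm(r) <= eta * np.linalg.norm(g):
            break                          # forcing condition satisfied
        dBd = d @ B @ d
        if dBd <= 0:                       # negative curvature detected
            return -g if np.allclose(p, 0) else p
        alpha = (r @ r) / dBd
        p = p + alpha * d
        r_new = r + alpha * (B @ d)
        d = -r_new + ((r_new @ r_new) / (r @ r)) * d
        r = r_new
    return p
```

Larger η gives cheaper but cruder steps; driving η → 0 recovers the exact Newton direction.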

9 Limited Memory BFGS

1. In normal BFGS (with ρk = 1/(ykᵀsk)):

    Hk+1 = (I − ρk sk ykᵀ) Hk (I − ρk yk skᵀ) + ρk sk skᵀ

2. Substituting in for Hk repeatedly until H0 reveals that only the pairs (sj, yj) must be stored to compute Hk+1.

3. Limited-memory BFGS capitalizes on this by storing a fixed number m of recent pairs (sj, yj) and re-building Hk from a (typically diagonal) initial matrix Hk0.

Algorithm 10: L-BFGS Algorithm
input : x0 a starting point, f objective, m storage size, ε tolerance
x ← x0
k ← 0
while ‖∇f(x)‖ > ε do
    Choose Hk0
    Update H from (sk−1, yk−1), . . . , (sk−m, yk−m)
    Compute unconstrained step: pu ← −H∇f(x)
    Implement Line Search or Trust Region (dogleg) to find p
    x ← x + p
    k ← k + 1
end
return Minimizer: x
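The step "Update H from the stored pairs" is usually implemented without ever forming H, via the standard two-loop recursion; a sketch (names ours, with H0 = γI):

```python
import numpy as np

def two_loop(grad, pairs, gamma=1.0):
    """Return H @ grad for the L-BFGS matrix built from `pairs` (oldest first)."""
    q = grad.copy()
    alphas = []
    for s, y in reversed(pairs):           # newest to oldest
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q -= a * y
        alphas.append((a, rho, s, y))
    r = gamma * q                          # apply H0 = gamma * I
    for a, rho, s, y in reversed(alphas):  # oldest to newest
        b = rho * (y @ r)
        r += (a - b) * s
    return r
```

With a single stored pair and γ = 1, this reproduces exactly one BFGS update of H0 = I applied to the gradient, at O(mn) cost instead of O(n²).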


10 Least Squares Regression Methods

1. In least squares problems, the objective function has a specialized form:

    f(x) = ½ Σ_{j=1}^m rj(x)²

(a) x ∈ Rⁿ, usually the parameters of a particular model of interest

(b) rj are usually the residuals of the model φ with parameter x, evaluated at input tj with output yj. That is:

    rj(x) = φ(x, tj) − yj

2. Formulation

(a) Let r(x) = [r1(x) r2(x) · · · rm(x)]ᵀ. Then:

    f(x) = ½ ‖r(x)‖²

(b) The first derivative, where J(x) is the Jacobian of r whose rows are the ∇rj(x)ᵀ:

    ∇f(x) = J(x)ᵀ r(x)

(c) The second derivative:

    ∇²f(x) = J(x)ᵀJ(x) + Σ_{j=1}^m rj(x) ∇²rj(x)

3. The main idea behind least squares methods is to approximate ∇²f(x) by J(x)ᵀJ(x); thus, only first derivative information is required.

10.1 Linear Least Squares Regression

1. Suppose φ(x, tj) = tjᵀx. Then:

    r(x) = Tx − y

where each row of T is tjᵀ.

2. Objective Function:

    f(x) = ½ ‖Tx − y‖²

Note 10.1. The objective function can be minimized directly using matrix factorizations (e.g., QR) if m is not too large.

3. First Derivative:

    ∇f(x) = TᵀTx − Tᵀy

Note 10.2. The normal equations ∇f(x) = 0 may be better to solve if m is large, because multiplying by the transpose leaves only an n × n system of equations.
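Both routes can be compared directly; a small synthetic sketch (data ours). Note that the normal equations square the condition number of T, so the factorization route is preferred when T is ill-conditioned.

```python
import numpy as np

rng = np.random.default_rng(0)
T = rng.standard_normal((100, 3))    # m = 100 observations, n = 3 parameters
y = rng.standard_normal(100)

x_qr = np.linalg.lstsq(T, y, rcond=None)[0]    # factorization route (m x n problem)
x_ne = np.linalg.solve(T.T @ T, T.T @ y)       # normal equations (n x n system)
```

On this well-conditioned example the two solutions agree to machine precision, and both make the gradient TᵀTx − Tᵀy vanish.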


10.2 Line Search: Gauss-Newton

1. The Gauss-Newton algorithm is a line search for non-linear least squares regression with search direction pk solving:

    JkᵀJk pk = −Jkᵀ rk

2. Notice that these are the normal equations of the linear least squares problem:

    min_p ½ ‖Jk p + rk‖²

3. Using this search direction, we proceed with line search as per usual.
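A worked sketch of Gauss-Newton (the exponential model, data, and starting point are our own invention), taking full steps α = 1 on noise-free data:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 20)
x_true = np.array([2.0, -1.0])
y = x_true[0] * np.exp(x_true[1] * t)           # noise-free synthetic data

def residual(x):
    return x[0] * np.exp(x[1] * t) - y          # r_j(x) = phi(x, t_j) - y_j

def jacobian(x):
    e = np.exp(x[1] * t)
    return np.column_stack([e, x[0] * t * e])   # rows are grad r_j(x)^T

x = np.array([1.5, -0.5])                       # starting guess
for _ in range(50):
    J, r = jacobian(x), residual(x)
    p = np.linalg.solve(J.T @ J, -(J.T @ r))    # Gauss-Newton normal equations
    x = x + p
```

Because the residuals vanish at the solution, JᵀJ matches the true Hessian there, and the iteration recovers x_true rapidly.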

10.3 Trust Region: Levenberg-Marquardt

1. The local model is:

    m(p) = f(x) + (J(x)ᵀr(x))ᵀ p + ½ pᵀ J(x)ᵀJ(x) p,   ‖p‖ ≤ ∆

2. Since f(x) = ‖r(x)‖²/2, minimizing m(p) is equivalent to:

    min_p ½ ‖J(x)p + r(x)‖²   subject to ‖p‖ ≤ ∆
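A sketch of one Levenberg-Marquardt step (names ours): if the unconstrained Gauss-Newton step fits in the region, take it; otherwise find λ > 0 with ‖p(λ)‖ = ∆ by bisection, where p(λ) = −(JᵀJ + λI)⁻¹Jᵀr.

```python
import numpy as np

def lm_step(J, r, Delta):
    n = J.shape[1]
    A, b = J.T @ J, -(J.T @ r)
    p = np.linalg.solve(A, b)              # unconstrained Gauss-Newton step
    if np.linalg.norm(p) <= Delta:
        return p
    lo, hi = 0.0, 1.0
    while np.linalg.norm(np.linalg.solve(A + hi * np.eye(n), b)) > Delta:
        hi *= 2.0                          # grow until the step is short enough
    for _ in range(100):                   # bisection: ||p(lam)|| decreases in lam
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(np.linalg.solve(A + mid * np.eye(n), b)) > Delta:
            lo = mid
        else:
            hi = mid
    return np.linalg.solve(A + hi * np.eye(n), b)
```

Bisection is used here for simplicity; the Newton iteration on the secular equation from Section 3.4.2 converges faster in practice.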


Part IV

Constrained Optimization

10.4 Theory of Constrained Optimization

1. In constrained optimization, the problem is:

    min_x f(x) : ci(x) = 0 ∀i ∈ E, cj(x) ≥ 0 ∀j ∈ I

(a) f is the usual objective function

(b) E is the index set for equality constraints

(c) I is the index set for inequality constraints

2. Feasible Set. Active Set.

Definition 10.1. The feasible set, Ω, is the set of all points which satisfy the constraints:

    Ω = {x : ci(x) = 0 ∀i ∈ E, cj(x) ≥ 0 ∀j ∈ I}

The active set at a point x ∈ Ω, A(x), is defined by:

    A(x) = E ∪ {j ∈ I : cj(x) = 0}

3. Feasible Sequence. Tangent. Tangent Cone.

Definition 10.2. Let x ∈ Ω. A sequence zk → x is a feasible sequence if there is a K such that zk ∈ Ω for all k ≥ K. A vector d is a tangent to x if there is a feasible sequence zk → x and a sequence of positive scalars tk → 0 such that:

    lim_{k→∞} (zk − x)/tk = d

The set of all tangents is the Tangent Cone at x, denoted TΩ(x).

Note 10.3. TΩ(x) contains all step directions which, for a sufficiently small step size, remain in the feasible region Ω.

4. Linearized Feasible Directions

Definition 10.3. Let x ∈ Ω and A(x) be its active set. The set of linearized feasible directions, F(x), is defined as:

    F(x) = {d : dᵀ∇ci(x) = 0 ∀i ∈ E, dᵀ∇ci(x) ≥ 0 ∀i ∈ A(x) ∩ I}

Note 10.4. Let x ∈ Ω and suppose we linearize the constraints near x. The set F(x) contains all search directions d for which the linearized approximations to ci(x + d) satisfy the constraints. For example, if ci(x) = 0 for i ∈ E, then we need 0 = ci(x + d) ≈ ci(x) + dᵀ∇ci(x) = dᵀ∇ci(x) for d to be a good search direction.

5. LICQ Constraint Qualification


Definition 10.4. Given a point x and active set A(x), the linear independence constraint qualification (LICQ) holds if the set of active constraint gradients, {∇ci(x) : i ∈ A(x)}, is linearly independent.

Note 10.5. Most algorithms using line search or trust regions require that F(x) = TΩ(x); otherwise, there is not a good way to find the next iterate.

6. First Order Necessary Conditions

Theorem 10.1. Let f, ci be continuously differentiable. Suppose x∗ is a local solution to the optimization problem and LICQ holds at x∗. Then there is a Lagrange multiplier vector λ, with components λi for i ∈ E ∪ I, such that the Karush-Kuhn-Tucker conditions are satisfied at (x∗, λ), where L(x, λ) = f(x) − Σ_{i ∈ E∪I} λi ci(x):

(a) ∇x L(x∗, λ) = 0

(b) For all i ∈ E, ci(x∗) = 0

(c) For all j ∈ I, cj(x∗) ≥ 0

(d) For all j ∈ I, λj ≥ 0

(e) For all j ∈ I, λj cj(x∗) = 0
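A worked check of the KKT conditions on a small example of our own: minimize f(x) = x1 + x2 subject to the single inequality c(x) = 2 − x1² − x2² ≥ 0, with Lagrangian L(x, λ) = f(x) − λc(x). The candidate x∗ = (−1, −1) makes the constraint active, and λ = 1/2 verifies all the conditions:

```python
import numpy as np

x = np.array([-1.0, -1.0])            # candidate solution x*
grad_f = np.array([1.0, 1.0])         # gradient of f(x) = x1 + x2
c = 2.0 - x @ x                       # constraint value; 0 means active
grad_c = -2.0 * x                     # gradient of c at x*
lam = 0.5                             # Lagrange multiplier

stationarity = grad_f - lam * grad_c  # condition (a): grad_x L(x*, lam)
```

LICQ holds here since grad_c = (2, 2) ≠ 0, so the theorem applies and the multiplier is uniquely determined.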
