
Part II

Numerical Solution of Nonlinear Systems and Unconstrained Optimization


We now turn to the problem of finding the roots or zeros of a nonlinear function f(x). In the simplest setting this may be the classical problem of finding the zeros of an n-th degree polynomial. In a more general setting f(x) can represent a mapping from a subset D of R^n to R^m, where x = (x_1, x_2, . . . , x_n)^T ∈ D and f(x) = (f_1(x), f_2(x), . . . , f_m(x))^T. Thus, in this case, the problem of solving f(x) = 0 is equivalent to solving the system of nonlinear equations

f_1(x) = 0, f_2(x) = 0, · · · , f_m(x) = 0 .

Nonlinear systems of algebraic equations arise in many contexts; important among these are the discretization of differential and integral operator equations, nonlinear eigenvalue problems, and nonlinear least squares.

The problem of finding the roots or zeros of f(x) is related to the problem of minimizing a functional g : R^n → R^1 where g(x) = g(x_1, x_2, . . . , x_n). If g is differentiable then the gradient of g is given by

g′(x) = (∂g/∂x_1, ∂g/∂x_2, . . . , ∂g/∂x_n) .

From calculus, we know that if x minimizes g(x) over all x ∈ R^n then x is a root of the equation g′(x) = 0. Conversely, if x is a solution of f(x) = 0 then there is some functional h(x) such that x is a minimum point of h(x).
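For example, one may take h to be the sum of squares of the components of f,

h(x) = ‖f(x)‖² = f_1(x)² + f_2(x)² + · · · + f_m(x)² ;

then h(x) ≥ 0 for all x and h(x) = 0 precisely when f(x) = 0, so every solution of f(x) = 0 is a global minimum point of h.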

Minimization problems are usually divided into two types. The first, which we will consider here, is unconstrained minimization and is simply a problem of the form

minimize g(x) over all x ∈ R^n .

This is the type of problem we described above. The second class of minimization problems is constrained minimization. The general form of this problem is to

minimize g(x) over all x ∈ R^n

subject to the constraints b_i(x), i = 1, . . . , m .

These constraints may take many forms; they may be linear or nonlinear, equality or inequality constraints, algebraic or derivative constraints, etc. Although this class of problems is very important, we will not consider it here.


Chapter 1

Numerical Solution of Systems of Nonlinear Equations

Nonlinear algebraic systems arise in many contexts; important among these are the discretization of nonlinear differential or integral operators, nonlinear eigenvalue problems, and curve fitting with nonlinear parameters.

Throughout this chapter we will consider approximating the solution(s) of n nonlinear algebraic equations in n unknowns; i.e., F : D ⊂ R^n → R^n where

Fx = 0 .   (1.1)

Recall that a system of linear equations has either a unique solution, no solution, or infinitely many solutions. However, the situation with nonlinear algebraic systems is more complicated since the system may have arbitrarily many solutions. It is for this reason that one must discuss some basic existence and uniqueness results for (1.1).

Once we ascertain that a solution of (1.1) exists, we turn to approximating it. In contrast to linear algebraic systems, nonlinear systems can rarely be solved by direct methods so that, in general, the approximating schemes will be iterative methods, i.e., rules for generating a sequence of vectors {x^(k)}, k = 0, 1, 2, . . ., such that x^(k) → x∗ as k → ∞, where x∗ is some solution of (1.1). There are three major considerations associated with such iterative methods.

First, the iterates {x^(k)}, k ≥ 0, should be well defined. This includes the question of the selection of a starting vector x^(0), or in general of a finite set of starting vectors, and the question of guaranteeing that the iterates {x^(k)}, k = 0, 1, 2, . . ., remain in the domain of definition of the relevant mappings which have to be evaluated at x^(k).

Second, we will be concerned with the convergence of the sequences generated by the iterative methods and the question of whether or not their limits are indeed solutions of (1.1). We wish to distinguish several types of convergence results. The first one, called a local convergence theorem, assumes that a solution x∗ exists and then shows that there is a neighborhood of x∗ such that if the initial values belong to this neighborhood, then the rest of the iterates are well defined and converge to x∗. The second type is called a semi-local convergence theorem and does not assume the existence of the solution, but shows that for a particular choice (usually restricted) of initial values, the convergence of {x^(k)} is guaranteed with a limit x∗ which is a solution of (1.1). Results of this type usually furnish error estimates, i.e., bounds on ‖x^(k) − x∗‖ in terms of k, the initial values, etc. Finally, a third type of theorem is a global convergence result, which asserts that any choice of initial values in R^n (or at least some "large" subset of R^n) leads to convergence to a solution x∗. Recall that the convergence results we studied for iterative methods for linear problems were global results; i.e., the convergence of the method was independent of the initial guess. This type of result will be obtained for only very special classes of nonlinear problems.

The third consideration is the question of how fast the sequence {x^(k)} will converge (if it converges) to x∗. An error estimate, e.g., from a semi-local theorem, can furnish such information. However, such bounds tend to be very coarse and pessimistic. A more accurate picture is provided by the asymptotic rate of convergence, i.e., by the behavior of the sequence {x^(k) − x∗} for large k.

Practical problems associated with iterative methods include efficient searches for suitable initial values, the efficient implementation of the algorithm which produces the x^(k)'s, and a successful detection of proximity to the limit, including the utilization of various stopping criteria for the iterative process.

We begin this chapter by briefly reviewing some theoretical results from calculus in n dimensions which are necessary for the analysis of nonlinear systems. We next turn to analyzing a general iteration scheme for the solution of (1.1). Due to the importance of Newton's method, we specialize the results for the general iteration scheme to Newton's method and also state convergence results. Next, we investigate the nonlinear SOR method, which combines Newton's method and SOR and reduces to the standard SOR method for linear problems. Although Newton's method has many advantages and is widely used, it also has some shortcomings. In the next few sections we look at secant methods and quasi-Newton methods, which address some of the disadvantages of Newton's method. In the last section we briefly introduce continuation methods.

1.1 Derivatives of mappings and mean-value theorems

In this section we gather some definitions and results from calculus in n dimensions which are central to the study of nonlinear systems. We first consider derivatives of mappings.

In R^1 a real-valued function f of a real variable x is differentiable at x if there exists a number a = f′(x) ∈ R^1 such that

lim_{t→0} (1/t) [f(x + t) − f(x)] = a

or equivalently

lim_{t→0} (1/t) [f(x + t) − f(x) − at] = 0 .


In more than one dimension, we shall be concerned with a mapping F, defined on a subset D of R^n, which takes on values in R^m. We shall write F : D ⊂ R^n → R^m. For x = (x_1, x_2, . . . , x_n)^T ∈ D we shall denote by F(x) or Fx its image in R^m, given by Fx = (f_1(x), . . . , f_m(x))^T, where f_i : D ⊂ R^n → R^1. Recall from calculus that the mere existence of all partial derivatives of F is not enough to guarantee continuity or a derivative in directions other than the coordinate directions. We then defined the more general directional derivative, which is defined similarly to a partial derivative except that the direction can be any vector in R^n, not just a coordinate direction. However, this is still an unsatisfactory generalization since linearity of the derivative is not necessarily guaranteed and, moreover, the existence of the directional derivative in all directions does not imply continuity.

Let L(R^n, R^m) denote the linear (vector) space of linear operators from R^n to R^m. One way to generalize the derivative in one dimension is given in the following definition.

Definition 1.1  A mapping F : D ⊂ R^n → R^m is Gateaux differentiable (or simply G−differentiable) at an interior point x of D if there exists a linear operator A ∈ L(R^n, R^m) such that for any h ∈ R^n

lim_{t→0} (1/t) ‖F(x + th) − F(x) − tAh‖ = 0 .   (1.2)

We note that by the norm equivalence theorem the limit in (1.2) is independent of the choice of norm. Also, if F is G−differentiable at x then the map A is unique. To see this, suppose that there are two such maps, A_1 and A_2. Then for any h ∈ R^n we have that

‖(A_1 − A_2)h‖ = (1/t) ‖F(x + th) − F(x) − tA_2h − F(x + th) + F(x) + tA_1h‖
              ≤ (1/t) ‖F(x + th) − F(x) − tA_1h‖ + (1/t) ‖F(x + th) − F(x) − tA_2h‖ .

The right-hand side tends to zero as t → 0. Hence ‖(A_1 − A_2)h‖ = 0 for any h ∈ R^n, or A_1 = A_2. The fact that A is unique motivates the following definition.

Definition 1.2  Let F : D ⊂ R^n → R^m be G−differentiable at x ∈ int(D). Then the unique linear mapping A ∈ L(R^n, R^m) for which (1.2) holds is called the Gateaux- (or G-) derivative of F at x and is denoted by A = F′(x).

Hence F′(x) is a linear mapping in L(R^n, R^m), i.e., F′ : D ⊂ R^n → L(R^n, R^m), and thus F′(x) can be represented as an m × n matrix. Let A_{i,j} = (F′(x))_{i,j} be the (i, j) element of A ∈ L(R^n, R^m) and let e^(j) denote the standard unit vectors in R^n and R^m. Then, since convergence in norm implies coordinatewise convergence, (1.2) with h = e^(j) ∈ R^n gives

lim_{t→0} (1/t) | f_i(x + te^(j)) − f_i(x) − tA_{i,j} | = 0 ,   i = 1, . . . , m ,   (1.3)

so that A_{i,j}, if the limit exists, is the partial derivative ∂f_i/∂x_j evaluated at x. Hence (1.3) implies that if the G−derivative F′(x) = A of F : R^n → R^m exists at x ∈ int(D), then necessarily

(F′(x))_{i,j} = A_{i,j} = (∂f_i/∂x_j)(x) ,   i = 1, . . . , m ,  j = 1, . . . , n ,

i.e., F′(x) is represented by the Jacobian matrix of the mapping F evaluated at x, which is defined by

           ⎛ ∂f_1/∂x_1  · · ·  ∂f_1/∂x_n ⎞
F′(x) =    ⎜     ⋮                 ⋮     ⎟ .   (1.4)
           ⎝ ∂f_m/∂x_1  · · ·  ∂f_m/∂x_n ⎠
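To make (1.4) concrete, here is a small numpy sketch; the particular mapping F below is an illustrative assumption (it will be reused in later sketches), not an example from the text.

    import numpy as np

    # Illustrative mapping F : R^2 -> R^2,  F(x) = (x1^2 + x2^2 - 4, x1*x2 - 1).
    F = lambda x: np.array([x[0]**2 + x[1]**2 - 4.0, x[0]*x[1] - 1.0])

    # Its Jacobian matrix (1.4), with entries (F'(x))_{i,j} = df_i/dx_j:
    Fprime = lambda x: np.array([[2.0*x[0], 2.0*x[1]],   # df1/dx1  df1/dx2
                                 [x[1],     x[0]    ]])  # df2/dx1  df2/dx2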

We shall distinguish the special case m = 1 and call mappings g : D ⊂ R^n → R^1 functionals on R^n. The G−derivative of g at x ∈ int(D), if it exists, is a linear map g′(x) ∈ L(R^n, R^1) and consequently is represented as a 1 × n row vector g′(x) = (∂g/∂x_1, · · · , ∂g/∂x_n), called the gradient of g at x.

As we saw above, the existence of the G−derivative F′(x) implies the existence of the directional derivatives of F, and thus of the partial derivatives, at x. The converse is not necessarily true. (See exercise xx.) In fact, the existence of the Gateaux derivative is quite a weak statement; it does not even imply that F is continuous at x. (See exercise xx.) However, it is true that the existence of F′(x), i.e., the G−derivative, implies that F is continuous along rays issuing from x. Other properties, such as linearity, do carry over from one dimension. See exercise xx.

Due to the relative weakness of the property of G−differentiability, we make the following stronger definition.

Definition 1.3  A mapping F : D ⊂ R^n → R^m is Frechet- (or F-) differentiable at an interior point x of D if there exists a linear mapping A ∈ L(R^n, R^m) such that

lim_{h→0} (1/‖h‖) ‖F(x + h) − Fx − Ah‖ = 0 .   (1.5)

Clearly, (1.5) reduces to the usual definition of a derivative in one dimension, and again, when the F−derivative exists, the mapping A is unique.

Definition 1.4  Let F : D ⊂ R^n → R^m be F−differentiable at x ∈ int(D). Then the unique linear map A for which (1.5) holds is called the Frechet- (or F-) derivative of F at x and is again denoted by F′(x).

The definition of the Frechet derivative includes as a special case the definition of the Gateaux derivative. This can be seen by letting h = th̄ for some fixed h̄ ∈ R^n, where t ∈ R^1; then t → 0 implies ‖h‖ → 0. Therefore we can state that if F : D ⊂ R^n → R^m is F−differentiable at x ∈ int(D) then it is G−differentiable at x. Moreover, in this case the F−derivative and the G−derivative coincide, are both denoted by F′(x), and are both given by the Jacobian matrix (1.4). However, the existence of the G−derivative does not imply F−differentiability.


Unlike G−differentiability, F−differentiability does imply continuity, as the following proposition demonstrates.

Proposition 1.1  Let F : D ⊂ R^n → R^m be F−differentiable at x. Then F is continuous at x.

Proof. Since the existence of the F−derivative F′(x) at x implies that x ∈ int(D), there is a δ_1 > 0 such that x + h ∈ D whenever ‖h‖ < δ_1. Now (1.5) implies that given ε > 0 there exists a δ > 0, which we can choose such that δ < δ_1, such that for ‖h‖ < δ we have ‖F(x + h) − F(x) − F′(x)h‖ ≤ ε‖h‖. Then ‖F(x + h) − F(x)‖ ≤ (‖F′(x)‖ + ε)‖h‖. Now we fix ε > 0 and set ‖F′(x)‖ + ε = c, a constant independent of h. We then have that F is continuous at x. □

If F : R^2 → R^1 then (1.5) implies that the surface z = F(x_1, x_2) has a tangent plane, in the usual geometric sense, at the point x; analogous results hold in higher dimensions. Hence F−differentiability generalizes most of the desirable properties of the one-dimensional derivative.

In obtaining convergence results for a single nonlinear equation, the mean-value theorem plays an important role. In one dimension the mean-value theorem states that if f : [a, b] ⊂ R^1 → R^1 is continuous on [a, b] and differentiable on (a, b), then there exists a t ∈ (a, b) such that f(b) − f(a) = f′(t)(b − a). In more than one dimension this result can be generalized in several ways. First recall the following definition of a convex set.

Definition 1.5  A set D_0 ⊂ R^n is called convex if given any x, y ∈ D_0, then x + t(y − x) ∈ D_0 for all 0 ≤ t ≤ 1.

This definition just states that if x, y are in a convex set then the line segment joining x and y is also in the set; for example, in R^2 a disk is convex whereas an L−shaped region is not. It is easy to generalize the mean-value theorem for functionals, i.e., functions f : D ⊂ R^n → R^1. To this end, we have that if f : D ⊂ R^n → R^1 is differentiable (G or F) at each point of a convex set D_0 ⊂ D then, given x, y ∈ D_0, there exists a t ∈ (0, 1) such that

f(y) − f(x) = f′(x + t(y − x))(y − x) .   (1.6)

Equation (1.6) does not hold when f is a vector-valued mapping, say when f : D ⊂ R^n → R^m, m > 1. For this reason other alternatives to the mean-value theorem must be found.

It is easy to apply the mean-value theorem to each of the components f_i, 1 ≤ i ≤ m, of F : D ⊂ R^n → R^m. Specifically, let F be G−differentiable on an open convex set D_0 ⊂ D. By the comparability of vector norms in R^m we conclude that the components f_i : D ⊂ R^n → R^1, 1 ≤ i ≤ m, of F are G−differentiable on D_0. Therefore for each i, 1 ≤ i ≤ m, (1.6) applies, i.e., there exists t_i ∈ (0, 1) such that

f_i(y) − f_i(x) = f_i′(x + t_i(y − x))(y − x) ,  1 ≤ i ≤ m .

Two other alternatives for the generalization of the mean-value theorem to n dimensions are found in the following two propositions. The first is obtained by proving a bound on Fy − Fx in terms of F′, and the second involves the use of the fundamental theorem of calculus and leads to an integral form of the mean-value theorem. The proofs of these can be found in Ortega and Rheinboldt.

Proposition 1.2  Assume that F : D ⊂ R^n → R^m is G−differentiable on a convex set D_0 ⊂ D. Then for every x, y ∈ D_0

‖Fx − Fy‖ ≤ sup_{0≤t≤1} ‖F′(x + t(y − x))‖ ‖x − y‖ .   (1.7)

Proposition 1.3  Let F : D ⊂ R^n → R^m have a continuous G−derivative at each point of a convex set D_0 ⊂ D. Then if x, y ∈ D_0

Fy − Fx = ∫_0^1 F′(x + t(y − x))(y − x) dt .   (1.8)

Finally, we mention that, as an application of the mean-value theorems, one can show that if F : D ⊂ R^n → R^m has a G−derivative at each point of an open neighborhood of x ∈ D and F′ is continuous at x, then F is F−differentiable at x and the Gateaux and Frechet derivatives of F, of course, coincide.

1.2 General convergence results and the contraction mapping theorem

We begin this section by analyzing a general iteration scheme for the solution of (1.1). We rewrite (1.1) in the form

x = G(x) ,

where G : D ⊂ R^n → R^n is a mapping chosen so that the solutions of x = G(x), i.e., the fixed points of G, coincide with the zeros of F(x). Then the following iteration suggests itself:

x^(k+1) = Gx^(k) ,  k = 0, 1, 2, . . . ;  x^(0) given in R^n .   (1.9)

Intuitively, if {x^(k)} has a limit we expect the limit to be a fixed point of G and hence a solution of F(x) = 0. Clearly the function G(x) must be chosen so that the iterates are well defined, easy to calculate, and converge as rapidly as possible. We first give a definition.

Definition 1.6  Let G : D ⊂ R^n → R^n. Then x∗ is a point of attraction of the iteration (1.9) if there is an open neighborhood S of x∗ such that S ⊂ D and, for any x^(0) ∈ S, the iterates {x^(k)} defined by (1.9) all lie in D and converge to x∗.
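As a concrete illustration of the iteration (1.9), here is a minimal Python sketch; the mapping G below is an assumed example (chosen to be contractive near the origin), not one from the text.

    import numpy as np

    def fixed_point(G, x0, tol=1e-12, maxit=200):
        """Iteration (1.9): x^(k+1) = G(x^(k)), stopped when successive
        iterates agree to within tol."""
        x = np.asarray(x0, dtype=float)
        for k in range(maxit):
            x_new = G(x)
            if np.linalg.norm(x_new - x) < tol:
                return x_new, k + 1
            x = x_new
        return x, maxit

    # Illustrative G on R^2 whose fixed point is sought
    G = lambda x: np.array([0.5 * np.cos(x[1]), 0.5 * np.sin(x[0])])
    xstar, its = fixed_point(G, [0.0, 0.0])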

Before presenting a local convergence result, we give some notation. Let S(x, δ) denote the open ball (in some norm) of center x ∈ R^n and radius δ > 0, i.e., the set S(x, δ) = {y ∈ R^n | ‖x − y‖ < δ}; S̄(x, δ) will denote the corresponding closed ball, i.e., the set {y ∈ R^n | ‖x − y‖ ≤ δ}. We then have the following local convergence result.


Theorem 1.4  Let G : D ⊂ R^n → R^n and suppose that there is a ball S = S(x∗, δ) ⊂ D and a constant α < 1 such that

‖Gx − x∗‖ ≤ α‖x − x∗‖  ∀x ∈ S .   (1.10)

Then, for any x^(0) ∈ S, the iterates defined by (1.9) lie in S and converge to x∗. Hence x∗ is a point of attraction of the iteration (1.9).

Proof. Let x^(0) ∈ S. Then using (1.9) and (1.10), ‖x^(1) − x∗‖ = ‖Gx^(0) − x∗‖ ≤ α‖x^(0) − x∗‖. Since α < 1 and x^(0) ∈ S, this shows that x^(1) ∈ S and that ‖x^(1) − x∗‖ ≤ α‖x^(0) − x∗‖. Assume now as the induction hypothesis that x^(k) ∈ S and that ‖x^(k) − x∗‖ ≤ α^k ‖x^(0) − x∗‖. Then

‖x^(k+1) − x∗‖ = ‖Gx^(k) − x∗‖ ≤ α‖x^(k) − x∗‖ ≤ α^{k+1} ‖x^(0) − x∗‖ .

Hence x^(k+1) ∈ S and ‖x^(k+1) − x∗‖ ≤ α^{k+1} ‖x^(0) − x∗‖. We conclude that for any k ≥ 0, x^(k) ∈ S and ‖x^(k) − x∗‖ ≤ α^k ‖x^(0) − x∗‖. This shows that lim_{k→∞} x^(k) = x∗ and therefore that x∗ is a point of attraction of (1.9). □

The following theorem, due to Ostrowski, gives sufficient conditions on G so that (1.10) holds; it essentially says that if the eigenvalues of G′(x∗) are sufficiently small, then the iteration will converge to the given fixed point for a range of initial values.

Theorem 1.5  Assume that G : D ⊂ R^n → R^n has a fixed point x∗ ∈ int(D) and that G has a Frechet derivative G′ at x∗. If the spectral radius of G′(x∗) satisfies ρ(G′(x∗)) = σ < 1, then x∗ is a point of attraction of the iteration (1.9).

Proof. By a result from linear algebra we can conclude that given any ε > 0 there exists a norm ‖·‖ on R^n such that

‖G′(x∗)‖ ≤ σ + ε .   (1.11)

Now since G is F−differentiable at x∗ we have that

lim_{h→0} (1/‖h‖) ‖G(x∗ + h) − Gx∗ − G′(x∗)h‖ = 0 ,

i.e., there exists a δ = δ(x∗, ε) such that S = S(x∗, δ) ⊂ D and

‖Gx − Gx∗ − G′(x∗)(x − x∗)‖ ≤ ε‖x − x∗‖  ∀x ∈ S .   (1.12)

Since x∗ is a fixed point of G, we have that for x ∈ S,

‖Gx − x∗‖ ≤ ‖Gx − Gx∗ − G′(x∗)(x − x∗)‖ + ‖G′(x∗)‖ ‖x − x∗‖ ≤ (σ + 2ε)‖x − x∗‖ ,

where we have used (1.11) and (1.12). We now choose ε > 0 so that σ + 2ε < 1; then ‖Gx − x∗‖ ≤ α‖x − x∗‖ for all x ∈ S with α < 1, i.e., the hypotheses of Theorem 1.4 are satisfied. Therefore, x∗ is a point of attraction of the iteration (1.9). □

The condition ρ(G′(x∗)) < 1 is not necessary for convergence. (See exercise xx.)

We now consider a semi-local result for the iteration (1.9), which is essentially the contraction mapping theorem. We first give a definition.

Definition 1.7  A mapping G : D ⊂ R^n → R^n is a contraction on a set D_0 ⊂ D if there is an α < 1 such that

‖Gx − Gy‖ ≤ α‖x − y‖  ∀x, y ∈ D_0 .   (1.13)

Theorem 1.6  Suppose G : D ⊂ R^n → R^n is a contraction on a closed set D_0 ⊂ D and that GD_0 ⊂ D_0, i.e., if x ∈ D_0 then Gx ∈ D_0 also. Then G has a unique fixed point x∗ ∈ D_0. Moreover, for any x^(0) ∈ D_0 the iterates {x^(k)} defined by (1.9) converge to x∗. We also have the following error estimates, with α defined by (1.13):

‖x^(k) − x∗‖ ≤ (α/(1 − α)) ‖x^(k) − x^(k−1)‖ ,  k = 1, 2, 3, . . .   (1.14)

‖x^(k) − x∗‖ ≤ (α^k/(1 − α)) ‖Gx^(0) − x^(0)‖ ,  k = 1, 2, 3, . . .   (1.15)

Proof. Let x^(0) be an arbitrary point in D_0. Since GD_0 ⊂ D_0, the sequence {x^(k)} defined by (1.9) is well defined and lies in D_0. Furthermore, using (1.13),

‖x^(k+1) − x^(k)‖ = ‖Gx^(k) − Gx^(k−1)‖ ≤ α‖x^(k) − x^(k−1)‖   (1.16)

holds for k ≥ 1. Therefore, for p ≥ 1 an integer, repeated application of (1.13) yields

‖x^(k+p) − x^(k)‖ ≤ Σ_{i=1}^{p} ‖x^(k+i) − x^(k+i−1)‖
               ≤ (α^{p−1} + · · · + α + 1) ‖x^(k+1) − x^(k)‖
               ≤ (1/(1 − α)) ‖x^(k+1) − x^(k)‖ ≤ (α^k/(1 − α)) ‖x^(1) − x^(0)‖ .

We conclude that the sequence {x^(k)} is a Cauchy sequence with respect to the norm ‖·‖ in the closed set D_0. Therefore it has a limit which we denote by x∗, i.e., there exists an x∗ ∈ D_0 such that

lim_{k→∞} x^(k) = x∗ .   (1.17)

Now this limit is a fixed point of G in D_0 since

‖x∗ − Gx∗‖ ≤ ‖x∗ − x^(k+1)‖ + ‖Gx^(k) − Gx∗‖ ≤ ‖x∗ − x^(k+1)‖ + α‖x^(k) − x∗‖ → 0

as k → ∞. Moreover, this limit is unique. (See exercise xx.) Finally, the estimates (1.14), (1.15) follow from the bounds above by letting p → ∞ and observing that x^(1) = Gx^(0). □


We note that the contraction property (1.13) depends on the particular choice of the underlying norm ‖·‖ on R^n. In particular, G may be a contraction for one norm but not for another. We stress that the contraction mapping theorem requires that GD_0 ⊂ D_0 for the closed set D_0 on which G is a contraction. The necessity of this requirement is easily verified. (See exercise xx.) An example of a global convergence theorem is given by Theorem 1.6 when D_0 = R^n.

The error estimates (1.14) and (1.15) are useful whenever α can be evaluated or sharply estimated. Here is a result in this direction, whose proof is left to the exercises.

Proposition 1.7  Assume that F : D ⊂ R^n → R^n has a G−derivative which satisfies ‖F′(x)‖ ≤ α < 1 for all x in a convex set D_0 ⊂ D. Then F is a contraction on D_0.
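Combining the a posteriori estimate (1.14) with a contraction constant α, known or estimated (e.g., via Proposition 1.7), gives a computable stopping test. A minimal sketch, assuming the caller supplies α:

    import numpy as np

    def contraction_iterate(G, x0, alpha, eps=1e-10, maxit=1000):
        """Iterate x^(k+1) = G(x^(k)) for a contraction with constant alpha < 1,
        stopping once the a posteriori estimate (1.14),
            ||x^(k) - x*|| <= alpha/(1 - alpha) * ||x^(k) - x^(k-1)||,
        certifies an error of at most eps."""
        x = np.asarray(x0, dtype=float)
        for k in range(maxit):
            x_new = G(x)
            if alpha / (1.0 - alpha) * np.linalg.norm(x_new - x) <= eps:
                return x_new, k + 1
            x = x_new
        return x, maxit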

Finally, the connection between the spectral radius of F′(x) and the contraction property can be seen in the following result, whose proof is also left to the exercises.

Proposition 1.8  Assume that F : D ⊂ R^n → R^n has a continuous Frechet derivative in an open neighborhood S of x and that ρ(F′(x)) < 1. Then there is another neighborhood S_1 of x and a norm on R^n such that F is a contraction on S_1.

1.3 Newton's method: attraction theorem and rate of convergence

In this section we will specialize the results of Section 1.2 to iterations of the form

x^(k+1) = x^(k) − (A(x^(k)))^{−1} F(x^(k)) ,  k = 0, 1, 2, . . . ,  x^(0) given ,   (1.18)

i.e., the case where Gx = x − (A(x))^{−1}Fx, for the solution of the nonlinear system Fx = 0. The special case of (1.18) for which

A(x) = F′(x) ,   (1.19)

where F′(x) is the Frechet derivative of F at x, is, of course, Newton's method. After stating a preliminary lemma, we prove a local convergence result for iterations of the form (1.18). As a special case of these results we state the Newton attraction theorem and then prove a result concerning the rate of convergence of the Newton iteration. We close this section by mentioning some modifications of Newton's method.

We begin with a perturbation lemma for linear operators.

Lemma 1.9  Let A ∈ L(R^n, R^n) be a nonsingular matrix. Suppose that B ∈ L(R^n, R^n) is such that ‖A^{−1}‖ ‖B‖ < 1. Then A + B is nonsingular and

‖(A + B)^{−1}‖ ≤ ‖A^{−1}‖ / (1 − ‖A^{−1}‖ ‖B‖) .   (1.20)


Proof. See exercise xx.

Theorem 1.10  Suppose that F : D ⊂ R^n → R^n is F−differentiable at a point x∗ ∈ int(D) for which Fx∗ = 0. Let A : S_0 → L(R^n, R^n) be defined on an open neighborhood S_0 ⊂ D of x∗, and let A be continuous at x∗ with A(x∗) nonsingular. Then there exists a closed ball S = S̄(x∗, δ) ⊂ S_0, δ > 0, on which the mapping G : S → R^n, Gx = x − (A(x))^{−1}Fx, is well defined. Moreover, G is F−differentiable at x∗ and

G′(x∗) = I − (A(x∗))^{−1}F′(x∗) .   (1.21)

Proof. Clearly G will be well defined on S if we show that A(x) is invertible there. Set β = ‖(A(x∗))^{−1}‖ and let ε > 0 be such that 0 < ε < 1/(2β). Now, by hypothesis, A(x) is continuous at x∗. Therefore, there exists a δ = δ(x∗, ε) > 0 such that S̄(x∗, δ) ⊂ S_0 and

‖A(x) − A(x∗)‖ ≤ ε  ∀x ∈ S .   (1.22)

We now use Lemma 1.9 with A = A(x∗) and B = A(x) − A(x∗). Since (A(x∗))^{−1} exists and

‖(A(x∗))^{−1}‖ ‖A(x) − A(x∗)‖ ≤ εβ < 1/2

for all x ∈ S, we have that A(x) is invertible for all x ∈ S, and moreover, by Lemma 1.9,

‖(A(x))^{−1}‖ ≤ ‖(A(x∗))^{−1}‖ / (1 − ‖(A(x∗))^{−1}‖ ‖A(x) − A(x∗)‖) ≤ β/(1 − 1/2) = 2β  ∀x ∈ S .   (1.23)

So indeed, the mapping G is well defined for all x ∈ S. We now proceed to show that G is F−differentiable at x∗. First observe that since x∗ is a solution of Fx = 0, x∗ is a fixed point of G, i.e., Gx∗ = x∗. Also, since F is F−differentiable at x∗, by choosing δ, the radius of S̄(x∗, δ), small enough we conclude that

‖Fx − Fx∗ − F′(x∗)(x − x∗)‖ ≤ ε‖x − x∗‖  ∀x ∈ S .   (1.24)

Now, for x ∈ S,

‖Gx − Gx∗ − (I − (A(x∗))^{−1}F′(x∗))(x − x∗)‖
  = ‖(A(x∗))^{−1}F′(x∗)(x − x∗) − (A(x))^{−1}F(x)‖
  ≤ ‖(A(x))^{−1}[Fx − Fx∗ − F′(x∗)(x − x∗)]‖ + ‖(A(x))^{−1}[A(x∗) − A(x)](A(x∗))^{−1}F′(x∗)(x − x∗)‖
  ≤ 2βε‖x − x∗‖ + 2β²ε‖F′(x∗)‖ ‖x − x∗‖
  = ε(2β + 2β²‖F′(x∗)‖)‖x − x∗‖
  = εC‖x − x∗‖ ,


where C = 2β + 2β²‖F′(x∗)‖ is a constant, and where we have used (1.23), (1.24), and the fact that Fx∗ = 0. Therefore G is F−differentiable at x∗; moreover, its F−derivative is given by (1.21). □

We immediately obtain an attraction theorem, which is a local convergence result for the iteration (1.18).

Corollary 1.11  Assume that the hypotheses of Theorem 1.10 hold. In addition, suppose that

ρ(G′(x∗)) = ρ(I − (A(x∗))^{−1}F′(x∗)) = σ < 1 ,   (1.25)

where ρ(·) denotes the spectral radius. Then x∗ is a point of attraction of the iteration (1.18).

Proof. By (1.25), Theorem 1.10, and Ostrowski's theorem, we easily reach the conclusion of the corollary. □

In the special case A(x) = F′(x) we obtain the following local convergence result for Newton's method.

Theorem 1.12 (Newton Attraction Theorem)  Assume that F : D ⊂ R^n → R^n is F−differentiable on an open neighborhood S_0 ⊂ D of a point x∗ ∈ D for which Fx∗ = 0. Also assume that F′ is continuous at x∗ and that F′(x∗) is nonsingular. Then x∗ is a point of attraction of the Newton iteration

x^(k+1) = x^(k) − (F′(x^(k)))^{−1}Fx^(k) ,  k = 0, 1, 2, . . . .   (1.26)

Proof. Using Theorem 1.10 with A(x) = F′(x) for x ∈ S_0, we conclude that the Newton iteration function Gx = x − (F′(x))^{−1}F(x) is well defined on some ball S̄(x∗, δ) ⊂ S_0, δ > 0. Moreover, in this case (1.21) gives that G′(x∗) = 0, so that ρ(G′(x∗)) = σ = 0 and therefore Corollary 1.11 applies. □

We now examine the rate of convergence of the Newton iterates {x^(k)} to the attraction point x∗.

Proposition 1.13  Assume that the hypotheses of Theorem 1.12 hold. Then, for the point of attraction x∗ of the Newton iteration, whose existence is guaranteed by Theorem 1.12, we have that

lim_{k→∞} ‖x^(k+1) − x∗‖ / ‖x^(k) − x∗‖ = 0 .   (1.27)

Moreover, if for some constant C

‖F′(x) − F′(x∗)‖ ≤ C‖x − x∗‖   (1.28)

for all x in some open neighborhood of x∗, then there exists a positive constant Ĉ such that

‖x^(k+1) − x∗‖ ≤ Ĉ‖x^(k) − x∗‖² .   (1.29)


Proof. Recall that Gx = x − (F′(x))^{−1}F(x) for Newton's method. In Theorem 1.12 it was shown that G is well defined in some ball about x∗ and that the F−derivative G′ exists at x∗. Then, for x^(k) in the ball of attraction we have, since x^(k+1) = Gx^(k),

lim_{k→∞} ‖x^(k+1) − x∗‖ / ‖x^(k) − x∗‖ = lim_{k→∞} ‖Gx^(k) − Gx∗ − G′(x∗)(x^(k) − x∗)‖ / ‖x^(k) − x∗‖ = 0 ,

where we used Gx∗ = x∗ and G′(x∗) = 0 to obtain the first equality, and where the second equality follows from the F−differentiability of G at x∗. Hence (1.27) is valid. Now let {x^(k)}, k ≥ k_0, with k_0 sufficiently large, belong to the neighborhood of x∗ in which (1.28) holds. For such k, applying (1.28) and Proposition 1.3 on the convex set consisting of the segment between the points x^(k) and x∗, we have

‖Fx^(k) − Fx∗ − F′(x∗)(x^(k) − x∗)‖ ≤ (C/2)‖x^(k) − x∗‖² .   (1.30)

Now

‖x^(k+1) − x∗‖ = ‖Gx^(k) − x∗‖
             = ‖x^(k) − (F′(x^(k)))^{−1}Fx^(k) − x∗‖
             ≤ ‖(F′(x^(k)))^{−1}(Fx^(k) − Fx∗ − F′(x∗)(x^(k) − x∗))‖ + ‖(F′(x^(k)))^{−1}(F′(x^(k)) − F′(x∗))(x^(k) − x∗)‖
             ≤ (C/2 + C) ‖(F′(x^(k)))^{−1}‖ ‖x^(k) − x∗‖² .

Now, with A = F′ the hypotheses of Theorem 1.10 are satisfied, so that

‖(F′(x^(k)))^{−1}‖ ≤ 2‖(F′(x∗))^{−1}‖ ,

a constant, for k sufficiently large. Therefore we have (1.29), with Ĉ = 3C‖(F′(x∗))^{−1}‖. □

Equation (1.27) for Newton's method should be contrasted with the case of a general iteration x^(k+1) = Gx^(k), which by (1.10) gives

‖x^(k+1) − x∗‖ = ‖Gx^(k) − x∗‖ ≤ α‖x^(k) − x∗‖

for k sufficiently large, i.e.,

‖x^(k+1) − x∗‖ / ‖x^(k) − x∗‖ ≤ α .   (1.31)

The fact that the limit in (1.27) is zero implies that Newton's method is superlinearly convergent. The general iteration, when it converges, is only linearly convergent. In fact, if (1.28) holds, which, for instance, will be guaranteed if the second partial derivatives of F are continuous in a neighborhood of x∗, then we were able to prove (1.29), i.e., that Newton's method is quadratically convergent.


Estimates such as (1.29) are not of practical use since x∗ is a priori unknown. However, they give an idea of how fast the iteration will converge. The above discussion does not adequately describe the problem of the determination of the asymptotic rate of convergence of an iterative method. For example, there may be special cases for which the rate of convergence is higher than two.

The application of Newton's method in a region where F′(x^(k)) is nonsingular is usually carried out as follows. The vector F(x^(k)) and the matrix F′(x^(k)) are computed, and then the linear system F′(x^(k)) δx^(k) = F(x^(k)) is solved for the unknown vector δx^(k). Finally, x^(k+1) is formed as x^(k+1) = x^(k) − δx^(k). Therefore, the method requires the formation of the Jacobian matrix that represents F′(x^(k)) at x^(k) and the solution of a linear system with that matrix at every step k. For large nonlinear systems this may be very expensive.
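As a minimal illustration of this procedure, a Python/numpy sketch; the test mapping and the tolerance are illustrative assumptions, not prescriptions from the text.

    import numpy as np

    def newton(F, Fprime, x0, tol=1e-12, maxit=50):
        """Newton's method (1.26): solve F'(x^(k)) dx = F(x^(k)) and set
        x^(k+1) = x^(k) - dx, stopping when the correction is small."""
        x = np.asarray(x0, dtype=float)
        for k in range(maxit):
            dx = np.linalg.solve(Fprime(x), F(x))   # one linear solve per step
            x = x - dx
            if np.linalg.norm(dx) < tol:
                return x, k + 1
        return x, maxit

    # Illustrative example: F(x) = (x1^2 + x2^2 - 4, x1*x2 - 1)
    F = lambda x: np.array([x[0]**2 + x[1]**2 - 4.0, x[0]*x[1] - 1.0])
    Fprime = lambda x: np.array([[2.0*x[0], 2.0*x[1]], [x[1], x[0]]])
    root, its = newton(F, Fprime, [2.0, 0.5])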

Several simplifications of Newton's method have been devised which reduce the cost of either or both the Jacobian formation and the linear system solution. Unfortunately, these simplifications often result in the loss of the quadratic convergence of Newton's method. In later sections we will consider a few of these methods in some detail. Here we only mention a few simple ones.

The first that we mention is the "simplified Newton iteration"

x^(k+1) = x^(k) − (F′(x^(0)))^{−1}Fx^(k) ,  k = 0, 1, 2, . . . ,  x^(0) given .   (1.32)

This iteration requires the solution of a linear system for each k, but with a fixed matrix F′(x^(0)), which can be formed and factored outside of the iteration loop; hence each iteration requires only backsolves.
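A sketch of (1.32), assuming SciPy is available for the LU factorization; the point is that the O(n³) factorization happens once, while each iteration costs only O(n²) triangular solves.

    import numpy as np
    from scipy.linalg import lu_factor, lu_solve

    def simplified_newton(F, Fprime, x0, tol=1e-10, maxit=200):
        """Simplified Newton iteration (1.32): factor F'(x^(0)) once and
        reuse the factorization at every step."""
        x = np.asarray(x0, dtype=float)
        lu, piv = lu_factor(Fprime(x))       # formed and factored outside the loop
        for k in range(maxit):
            dx = lu_solve((lu, piv), F(x))   # backsolves only, each iteration
            x = x - dx
            if np.linalg.norm(dx) < tol:
                return x, k + 1
        return x, maxit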

A second method is found by decomposing the matrix F′(x) into its diagonal and strictly lower and strictly upper triangular parts, i.e., in the spirit of iterative methods for linear systems we write

F′(x) = D(x) + L(x) + U(x) .   (1.33)

Then, instead of choosing A(x) = F′(x) in the iteration (1.18), one can consider setting

A(x) = D(x) + L(x) ,   (1.34)

obtaining the iterative method

x^(k+1) = x^(k) − (D(x^(k)) + L(x^(k)))^{−1}Fx^(k) ,  k = 0, 1, 2, . . . ,   (1.35)

where x^(0) is given. This requires the solution of a lower triangular system of equations at every step. This method is known as the Newton-Gauss-Seidel method. (See exercise xx.)

1.4 Newton's method: semi-local and global convergence results

In the last section we saw that when Newton's method is applied to a problem which has a nonsingular Jacobian in a neighborhood of the desired root, in addition to being continuous there, one can show convergence to the root for an initial guess sufficiently close. However, determining the convergence of Newton's method from an arbitrary starting point is a much more difficult problem. The results of the last section were local convergence results; here we begin by stating a semi-local convergence theorem for Newton's method due to Kantorovich. The reader is referred to Ortega, Numerical Analysis, A Second Course; Ortega and Rheinboldt; or Kantorovich, Functional Analysis in Normed Spaces for the proof.

Theorem 1.14  Let F : D ⊂ R^n → R^n be F−differentiable on a convex set D_0 ⊂ D and assume that for some constant γ > 0

‖F′(x) − F′(y)‖ ≤ γ‖x − y‖  ∀x, y ∈ D_0 .   (1.36)

Suppose that there exists a point x^(0) ∈ D_0 such that F′(x^(0)) is invertible. Moreover, suppose that for some constants β, η

‖(F′(x^(0)))^{−1}‖ ≤ β ,  ‖(F′(x^(0)))^{−1}Fx^(0)‖ ≤ η ,  α = βγη ≤ 1/2 .

Set

t∗ = (1 − (1 − 2α)^{1/2}) / (βγ) ,  t̄ = (1 + (1 − 2α)^{1/2}) / (βγ) ,   (1.37)

and assume that S̄(x^(0), t∗) ⊂ D_0. Then the Newton iterates

x^(k+1) = x^(k) − (F′(x^(k)))^{−1}Fx^(k) ,  k = 0, 1, 2, . . . ,   (1.38)

are well defined, remain in S̄(x^(0), t∗), and converge to a solution x∗ of Fx = 0 which is unique in S(x^(0), t̄) ∩ D_0. Moreover, we have the error estimates

‖x∗ − x^(1)‖ ≤ 2βγ‖x^(1) − x^(0)‖²   (1.39)

and

‖x^(k) − x∗‖ ≤ (2α)^{2^k} / (2^k βγ) ,  k = 0, 1, 2, . . . .   (1.40)

We note again that this is a semi-local convergence result; i.e., it states not only that the Newton iterates converge, but also that a solution exists. The hypotheses are difficult to verify in practice. Note that since

‖x^(1) − x^(0)‖ = ‖(F′(x^(0)))^{−1}Fx^(0)‖ ≤ η ,

the condition α ≤ 1/2 essentially places a condition on how close x^(0) is to the solution.
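As a rough illustration of how the hypotheses might be checked numerically, a sketch; the Lipschitz constant γ of (1.36) must be supplied by the user (it is problem-dependent), and the use of the 2-norm is an arbitrary choice.

    import numpy as np

    def kantorovich_check(F, Fprime, x0, gamma):
        """Evaluate beta, eta, and alpha = beta*gamma*eta of Theorem 1.14 at x^(0);
        report whether alpha <= 1/2, together with the radius t* of (1.37)."""
        x0 = np.asarray(x0, dtype=float)
        J0 = Fprime(x0)
        beta = np.linalg.norm(np.linalg.inv(J0), 2)            # ||F'(x^(0))^{-1}||
        eta = np.linalg.norm(np.linalg.solve(J0, F(x0)), 2)    # ||F'(x^(0))^{-1} F x^(0)||
        alpha = beta * gamma * eta
        tstar = (1.0 - np.sqrt(max(1.0 - 2.0 * alpha, 0.0))) / (beta * gamma)
        return alpha <= 0.5, alpha, tstar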


We conclude the analysis of Newton's method with a global theorem which applies to convex mappings. Recall that in one dimension, if F : R^1 → R^1 is convex and monotone increasing, then for any x^(0) to the right of the root x∗ the Newton iterates converge monotonically down to x∗, while if x^(0) is to the left of x∗, the first Newton iterate lands to the right of x∗, so that the succeeding iterates converge monotonically downward. We want to extend these results to n dimensions.

Definition 1.8  The mapping F : D ⊂ R^n → R^n is called convex on the convex set D_0 ⊂ D if

F(λx + (1 − λ)y) ≤ λFx + (1 − λ)Fy   (1.41)

for all x, y ∈ D_0 and λ ∈ [0, 1].

We recall that for x, y ∈ R^n, x ≤ y is equivalent to x_i ≤ y_i, i = 1, . . . , n, and that for A, B ∈ L(R^n, R^n), A ≤ B is equivalent to A_{i,j} ≤ B_{i,j}, i, j = 1, . . . , n. We first prove the following lemma, which extends to n dimensions the familiar fact in R^1 that a differentiable convex function lies above its tangent.

Lemma 1.15  Let F : D ⊂ R^n → R^n be F−differentiable on the convex set D_0 ⊂ D. Then F is convex on D_0 if and only if

Fy − Fx ≥ F′(x)(y − x)  ∀x, y ∈ D_0 .   (1.42)

Proof. Suppose first that (1.42) holds. Fix x, y ∈ D_0 and λ ∈ [0, 1] and set z = λx + (1 − λ)y. Since D_0 is convex, z ∈ D_0, and by (1.42), Fx − Fz ≥ F′(z)(x − z) and Fy − Fz ≥ F′(z)(y − z). Then

λFx + (1 − λ)Fy − Fz ≥ λF′(z)(x − z) + (1 − λ)F′(z)(y − z) = F′(z)[λx + (1 − λ)y − z] = 0 .

Hence

λFx + (1 − λ)Fy ≥ Fz = F(λx + (1 − λ)y) ,

which shows that F is convex. Conversely, if F is convex on D_0, let 0 < λ ≤ 1. Then we can write F(λy + (1 − λ)x) ≤ λFy + (1 − λ)Fx as [F(x + λ(y − x)) − Fx]/λ ≤ Fy − Fx. By the F−differentiability of F, as λ → 0 the left-hand side tends to F′(x)(y − x). Since for a^(k), b ∈ R^n, a^(k) ≤ b and lim_{k→∞} a^(k) = a imply a ≤ b (componentwise convergence), (1.42) follows. □

In the previous results for Newton's method, in this section and the last, we obtained convergence results which depended upon the initial guess being sufficiently close to the desired root. In many applications these results are adequate, since a good initial guess is often known, or perhaps a method with a larger radius of convergence is first used to obtain a better initial guess. However, it is advantageous to know for what class of problems Newton's method is globally convergent. We now prove a global convergence result for Newton's method due to Baluev.

Theorem 1.16  Let F : R^n → R^n be continuously F−differentiable and convex on all of R^n. In addition, suppose that (F′(x))^{−1} exists and (F′(x))^{−1} ≥ 0 for all x ∈ R^n. Let Fx = 0 have a solution x∗. Then x∗ is unique and the Newton iterates x^(k+1) = x^(k) − (F′(x^(k)))^{−1}Fx^(k) converge to x∗ for any initial choice x^(0) ∈ R^n. Moreover, for all k > 0

x∗ ≤ x^(k+1) ≤ x^(k) .   (1.43)

Proof. We first prove, by induction, that (1.43) holds. Let x^(0) ∈ R^n be arbitrary. Then by the hypotheses all the Newton iterates are well defined. Now x^(1) = x^(0) − (F′(x^(0)))^{−1}Fx^(0). Then by (1.42)

Fx^(1) − Fx^(0) ≥ F′(x^(0))(x^(1) − x^(0)) = −Fx^(0) ,

so that Fx^(1) ≥ 0. But also, again by (1.42), 0 = Fx∗ ≥ Fx^(1) + F′(x^(1))(x∗ − x^(1)). Then, since (F′(x^(1)))^{−1} ≥ 0, we have that (F′(x^(1)))^{−1}Fx^(1) + (x∗ − x^(1)) ≤ 0. Then, since Fx^(1) ≥ 0, we conclude that x∗ ≤ x^(1). For general k it follows exactly as above that

x∗ ≤ x^(k) and Fx^(k) ≥ 0 ,  k = 0, 1, 2, . . . .   (1.44)

But x^(k+1) = x^(k) − (F′(x^(k)))^{−1}Fx^(k) ≤ x^(k), using (1.44). Therefore (1.43) holds. Now, to conclude that x^(k) → x∗ as k → ∞, first note that for each i, i = 1, . . . , n, (1.43) implies that x^(k)_i is a monotonically nonincreasing sequence in k, bounded below by x∗_i. Hence for each i, i = 1, . . . , n, x^(k)_i → y_i as k → ∞, i.e., x^(k) → y = (y_1, . . . , y_n)^T. Since F is F−differentiable, it is continuous. Also F′(x) is a continuous function of x. We conclude that

Fy = lim_{k→∞} Fx^(k) = lim_{k→∞} F′(x^(k))(x^(k) − x^(k+1)) = F′(y) · 0 = 0 .

Hence the sequence x^(k) converges to a solution of Fx = 0. But Fx = 0 has only one solution, namely x∗. To see this, let x∗, y∗ be two solutions. Then by (1.42), 0 = Fy∗ − Fx∗ ≥ F′(x∗)(y∗ − x∗), and multiplying through by (F′(x∗))^{−1} ≥ 0 we get y∗ ≤ x∗. Reversing the roles of x∗ and y∗ yields x∗ ≤ y∗. Hence x∗ = y∗. □

In the case when F satisfies the conditions of Theorem 1.16, the iterates x^(k) of Newton's method approach the root x∗ from above, and thus form upper bounds for the root. Lower bounds for the root x∗ can also be generated. Note that regardless of whether x^(0) < x∗ or x^(0) > x∗, we have x^(1) ≥ x∗, and all subsequent iterates satisfy (1.43).

1.5 Nonlinear Gauss-Seidel and SOR

In this section we consider in more detail some simplifications of Newton's method which, in the case Fx = Ax − b, i.e., when Fx = 0 is a linear system, reduce to well-known iterative methods such as Jacobi, Gauss-Seidel, and SOR for the solution of linear systems. We briefly considered one such method in Section 1.3; here we will mostly concentrate on SOR.


We recall that SOR applied to the linear system Ax = b, A ∈ L(R^n, R^n), x, b ∈ R^n, starts with the decomposition of the matrix A into

A = D − L − U ,   (1.45)

where D = diag(a_{1,1}, . . . , a_{n,n}), L is strictly lower triangular, and U is strictly upper triangular. Assuming a_{i,i} ≠ 0, the componentwise form of the SOR method for the solution of Ax = b is, for some constant ω, 0 < ω < 2,

x^(0)_i given, i = 1, . . . , n ,

x^(k+1)_i = (1 − ω)x^(k)_i + (ω/a_{i,i}) [ b_i − Σ_{j=1}^{i−1} a_{i,j} x^(k+1)_j − Σ_{j=i+1}^{n} a_{i,j} x^(k)_j ] ,   (1.46)

for k = 0, 1, 2, . . . and i = 1, . . . , n. The matrix form of (1.46) is

x^(0) given ,
x^(k+1) = (D − ωL)^{−1}[(1 − ω)D + ωU]x^(k) + ω(D − ωL)^{−1} b ,   (1.47)

for k = 0, 1, 2, . . ., which can also be written as

x^(0) given ,
x^(k+1) = x^(k) − ω(D − ωL)^{−1}(Ax^(k) − b) ,  k = 0, 1, 2, . . . .   (1.48)

We also note that the iterate x^(k+1) can be expressed in terms of x^(0) as follows. Let B, C, and H be defined by

B = ω^{−1}(D − ωL) ,  C = ω^{−1}[(1 − ω)D + ωU] ,  and  H = B^{−1}C .   (1.49)

Note that A = B − C. Then (1.48) gives that

x^(k+1) = Hx^(k) + B^{−1}b
        = H(Hx^(k−1) + B^{−1}b) + B^{−1}b = H²x^(k−1) + (H + I)B^{−1}b
        = · · ·
        = H^{k+1}x^(0) + (H^k + H^{k−1} + · · · + I)B^{−1}b
        = x^(0) + (H^{k+1} − I)x^(0) + (H^k + H^{k−1} + · · · + I)B^{−1}(Ax^(0) − Ax^(0) + b) .

Now, note that (I + H + · · · + H^k)(I − H) = I − H^{k+1} and B^{−1}A = B^{−1}(B − C) = I − H. Then we have that

x^(k+1) = x^(0) − ω(H^k + H^{k−1} + · · · + I)(D − ωL)^{−1}(Ax^(0) − b) .   (1.50)
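A minimal sketch of the componentwise form (1.46); the stopping rule and parameters are caller-supplied assumptions.

    import numpy as np

    def sor(A, b, x0, omega, maxit=500, tol=1e-10):
        """Linear SOR (1.46): sweep i = 1..n, overwriting x in place so that
        already-updated components are used within the same sweep."""
        x = np.asarray(x0, dtype=float).copy()
        n = len(b)
        for k in range(maxit):
            x_old = x.copy()
            for i in range(n):
                s = b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x_old[i+1:]
                x[i] = (1.0 - omega) * x_old[i] + omega * s / A[i, i]
            if np.linalg.norm(x - x_old) < tol:
                return x, k + 1
        return x, maxit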

One way to use SOR for nonlinear systems is as a means of approximately finding the solution of the linear systems that must be solved at each iterative step of Newton's method. In this case the primary iteration is Newton's method and the secondary iteration is SOR. We call this method the Newton-SOR method, which we now describe. Let F : D ⊂ R^n → R^n be a given nonlinear mapping whose root x∗ (Fx∗ = 0) is sought by Newton's method

x^(k+1) = x^(k) − (F′(x^(k)))^{−1}Fx^(k) ,  k = 0, 1, 2, . . . .

Suppose x^(k) has been computed; then the next iterate x^(k+1) is computed as the solution x of the linear system

F′(x^(k)) x = F′(x^(k)) x^(k) − Fx^(k) .   (1.51)

We now solve (1.51) approximately by SOR. To this end we decompose the Jacobian matrix F′(x^(k)) into

F′(x^(k)) = D_k − L_k − U_k .   (1.52)

We assume that the diagonal elements of D_k are nonzero, and in analogy with (1.49), for some constants ω_k, 0 < ω_k < 2, we define

H_k = (D_k − ω_k L_k)^{−1}[(1 − ω_k)D_k + ω_k U_k] .   (1.53)

To apply SOR to (1.51) we denote the m-th SOR iterate by x^(k,m), m = 0, 1, . . ., and apply (1.48) in the case where A = F′(x^(k)) and b = F′(x^(k))x^(k) − Fx^(k), thus obtaining

x^(k,m) = (D_k − ω_k L_k)^{−1}[(1 − ω_k)D_k + ω_k U_k]x^(k,m−1) + ω_k(D_k − ω_k L_k)^{−1}(F′(x^(k))x^(k) − Fx^(k)) ,   (1.54)

with x^(k,0) given. Using (1.54) and (1.50), we have that for m = 1, 2, . . .

x^(k,m) = x^(k,0) − ω_k(H_k^{m−1} + · · · + I)(D_k − ω_k L_k)^{−1}(F′(x^(k))(x^(k,0) − x^(k)) + Fx^(k)) .   (1.55)

Of course the most natural way of picking x^(k,0) is just to set it equal to x^(k), the previous Newton iterate. We assume that the inner SOR iteration is terminated after m_k iterations, and we set x^(k+1) = x^(k,m_k). Then, given x^(k,0) = x^(k), the next Newton iterate x^(k+1) is given by

x^(k+1) = x^(k) − ω_k(H_k^{m_k−1} + · · · + I)(D_k − ω_k L_k)^{−1}Fx^(k) .   (1.56)

How is m_k chosen? If we choose to terminate the SOR iteration by some convergence criterion, e.g., ‖x^(k,m) − x^(k,m−1)‖ < ε for some preassigned ε > 0, then m_k varies with k and the variation is not known in advance. Alternately, we can specify m_k in advance for k = 0, 1, 2, . . .. The simplest choice is m_k = 1 for k = 0, 1, 2, . . ., i.e., perform just one SOR step as the secondary iteration, which leads to the one-step Newton-SOR iteration, which by (1.56) is a scheme of the form

x^(k+1) = x^(k) − ω_k(D_k − ω_k L_k)^{−1}Fx^(k) ,  k = 0, 1, 2, . . . .   (1.57)


Note that (1.57) is precisely the linear SOR method (1.48) when Fx = Ax − b. Furthermore, for ω_k = 1 the method is exactly the Newton-Gauss-Seidel method of Section 1.3, which, in view of our present discussion, is more accurately called the one-step Newton-Gauss-Seidel method. More generally, we may set m_k = m ≥ 1 for all k to obtain, for example, the m-step Newton-SOR method given by (1.56) with m_k = m. We note that "m-step" refers to the secondary iteration, which in this case is the SOR iteration.

The one-step Newton-SOR method (1.57) clearly requires at each step the evaluation of D_k − ω_k L_k, i.e., the evaluation of n(n + 1)/2 partial derivatives, and the solution of one lower triangular linear system. The m-step Newton-SOR method requires the evaluation at x^(k) of all n² partial derivatives and the solution of m lower triangular systems with the same matrix, as can be seen from (1.54).
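A sketch of the m-step Newton-SOR method (1.56), written as m inner SOR sweeps on the Newton system started from the zero correction; for brevity a general solve is applied to the lower triangular matrix M, where in practice a triangular solve would be used.

    import numpy as np

    def newton_sor(F, Fprime, x0, omega=1.0, m=1, maxit=100, tol=1e-10):
        """m-step Newton-SOR (1.56): approximate each Newton correction by m
        SOR sweeps on F'(x^(k)) d = F(x^(k)), started from d = 0."""
        x = np.asarray(x0, dtype=float)
        for k in range(maxit):
            J, Fx = Fprime(x), F(x)
            # M = D_k - omega*L_k, where F'(x) = D - L - U (L, U strictly triangular)
            M = np.diag(np.diag(J)) + omega * np.tril(J, -1)
            d = np.zeros_like(x)
            for _ in range(m):                    # secondary (inner) SOR iteration
                d = d - omega * np.linalg.solve(M, J @ d - Fx)
            x = x - d
            if np.linalg.norm(d) < tol:
                return x, k + 1
        return x, maxit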

There is a second way of extending an iterative method for linear systems into one for nonlinear equations. Let us first examine the simplest iterative method for linear systems, the Jacobi method. Recall that for Ax = b, the componentwise form of this method is

x^(k+1)_i = (1/a_{i,i}) [ b_i − Σ_{j=1, j≠i}^{n} a_{i,j} x^(k)_j ] ,  i = 1, . . . , n;  k = 0, 1, 2, . . . ,   (1.58)

with x^(0)_i given. We can interpret (1.58) as follows: we solve the i-th equation of the system Ax = b for the unknown x_i with all other unknowns x_j, j ≠ i, held at their values at the k-th level, i.e., x^(k)_j. Consider now the map F : D ⊂ R^n → R^n whose root x∗ we are seeking. Let F = (f_1, . . . , f_n)^T where f_i : D ⊂ R^n → R^1. The analogue of the linear Jacobi method, i.e., a nonlinear Jacobi method, would be to keep the unknowns x_j, j ≠ i, at the level x^(k)_j and solve the i-th equation of Fx = 0, i.e., f_i(x_1, . . . , x_n) = 0, for the unknown x_i. In other words, we solve the nonlinear equation

f_i(x^(k)_1, x^(k)_2, . . . , x^(k)_{i−1}, x_i, x^(k)_{i+1}, . . . , x^(k)_n) = 0   (1.59)

for x_i for i = 1, . . . , n, and call the resulting vector x̄ = (x̄_1, . . . , x̄_n)^T the next iterate x^(k+1) = (x^(k+1)_1, . . . , x^(k+1)_n)^T. For each i, i = 1, . . . , n, (1.59) represents a single (scalar) nonlinear equation with one unknown x_i. We can solve this equation by, for example, applying m_k steps of the one-dimensional analogue of Newton's method, i.e., the j-th Newton step looks like

x^(j)_i = x^(j−1)_i − f_i(x^(k)_1, . . . , x^(k)_{i−1}, x^(j−1)_i, x^(k)_{i+1}, . . . , x^(k)_n) / (∂f_i/∂x_i)(x^(k)_1, . . . , x^(k)_{i−1}, x^(j−1)_i, x^(k)_{i+1}, . . . , x^(k)_n)   (1.60)

where j = 1, . . . , m_k and where x^(0)_i can be taken as x^(k)_i. We are thus led to the Jacobi-Newton methods. For example, a one-step Jacobi-Newton method involves


applying, for each i, i = 1, . . . , n, one step of Newton's method for approximately solving the single nonlinear equation (1.59) for x_i. Thus the Jacobi iteration is now the primary (or outer) iteration while the Newton iteration is a secondary (or inner) one.

In an entirely analogous manner we may derive Gauss-Seidel-Newton methods and SOR-Newton methods. For example, in the Gauss-Seidel-Newton method, in the i-th equation f_i = 0 of Fx = 0 we keep the unknowns x_{i+1}, · · · , x_n at the level k and use the updated information at the level (k + 1) for the already computed x_1, · · · , x_{i−1}. Therefore for each i, i = 1, . . . , n, we solve

f_i(x^(k+1)_1, x^(k+1)_2, . . . , x^(k+1)_{i−1}, x_i, x^(k)_{i+1}, . . . , x^(k)_n) = 0   (1.61)

for x_i by Newton's method and call the result x̄_i = x^(k+1)_i. So to obtain x^(k+1) from x^(k) we solve the successively updated n equations (1.61). More generally, if we set, after finding x̄_i from (1.61),

x^(k+1)_i = x^(k)_i + ω_k(x̄_i − x^(k)_i)   (1.62)

for some parameter ω_k, we obtain the SOR-Newton method. Limiting ourselves to the one-step SOR-Newton method, we see from (1.61) that

x̄_i = x^(k)_i − f_i(x^(k+1)_1, . . . , x^(k+1)_{i−1}, x^(k)_i, x^(k)_{i+1}, . . . , x^(k)_n) / (∂f_i/∂x_i)(x^(k+1)_1, . . . , x^(k+1)_{i−1}, x^(k)_i, x^(k)_{i+1}, . . . , x^(k)_n)   (1.63)

and then, with ω_k = ω for all k, by (1.62)

x^(k+1)_i = x^(k)_i + ω(x̄_i − x^(k)_i)   (1.64)

with x̄_i given by (1.63). Now (1.63), (1.64) may be combined into the single formula

x^(k+1)_i = x^(k)_i − ω f_i(x^(k,i)) / (∂f_i/∂x_i)(x^(k,i)) ,  i = 1, . . . , n;  k = 0, 1, 2, . . . ,   (1.65)

where we have set x^(k,i) = (x^(k+1)_1, · · · , x^(k+1)_{i−1}, x^(k)_i, x^(k)_{i+1}, · · · , x^(k)_n)^T. Hence the one-step SOR-Newton method requires at every step k the evaluation of the n functions f_i(x^(k,i)) as well as the n derivatives (∂f_i/∂x_i)(x^(k,i)). (This should be contrasted with the computations needed for the Newton-SOR method.)
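A sketch of one sweep of (1.65); the callables f(i, x) and dfdx(i, x), which evaluate f_i and ∂f_i/∂x_i, are assumed interfaces for this illustration.

    import numpy as np

    def sor_newton_sweep(f, dfdx, x, omega):
        """One sweep (k -> k+1) of the one-step SOR-Newton method (1.65).
        Components are overwritten in place, so x holds level-(k+1) values
        for j < i and level-k values for j >= i when component i is updated."""
        x = np.asarray(x, dtype=float).copy()
        for i in range(len(x)):
            x[i] = x[i] - omega * f(i, x) / dfdx(i, x)   # x here equals x^(k,i)
        return x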

It can be shown that the composite methods, e.g., SOR-Newton or Newton-SOR, do not possess the superlinear convergence rate of Newton's method. In fact, they all converge linearly. The asymptotic rates of convergence of the one-step Newton-SOR and the one-step SOR-Newton methods are exactly the same, and the asymptotic rate of the m-step Newton-SOR method is m times larger than the asymptotic rate of the one-step Newton-SOR method. Hence, the speed of convergence as well as the number of operations per step must be taken into account when one chooses a method for a given problem.


We now consider convergence proofs. We already have the machinery for the local convergence of the Newton-SOR processes. Consider a general process of the form

x^(k+1) = x^(k) − (I + · · · + H(x^(k))^{m−1}) B(x^(k))^{−1} F(x^(k)) ,  k = 0, 1, 2, . . . ,   (1.66)

where B and H are defined by

F′(x) = B(x) − C(x) ,  H(x) = B(x)^{−1}C(x) .   (1.67)

Note that the m-step Newton-SOR process (1.56) with m_k = m is precisely of this form. We then have the following result.

Theorem 1.17  Let F : D ⊂ R^n → R^n be F−differentiable in an open neighborhood S_0 ⊂ D of a point x∗ ∈ D where F′ is continuous and Fx∗ = 0. Suppose that B : S_0 → L(R^n, R^n) is continuous at x∗, that B(x∗)^{−1} exists, and that ρ(H(x∗)) < 1. Then, for any m ≥ 1, x∗ is a point of attraction of the iteration defined by (1.66).

Proof. Since B(x∗)^{−1} exists and B(x) is continuous at x = x∗, it follows from the perturbation Lemma 1.9 that B(x)^{−1} exists in some ball S = S(x∗, δ) ⊂ S_0 and is continuous at x∗. Hence, by (1.67), H(x) also exists in S and is continuous at x∗. Now, since

I − H(x∗)^m = (I − H(x∗))(I + H(x∗) + · · · + H(x∗)^{m−1})   (1.68)

and since ρ(H(x∗)) < 1, we conclude that the matrix I + · · · + H(x∗)^{m−1} is nonsingular. To see this, note that ρ(H(x∗)) < 1 implies I − H(x∗) is nonsingular. Also, ρ(H(x∗)) < 1 implies ρ(H(x∗)^m) = (ρ(H(x∗)))^m < 1, so that I − H(x∗)^m is nonsingular for all m ≥ 1. We then use (1.68). Now, since H is continuous at x∗,

A(x) = B(x)[I + H(x) + · · · + H(x)^{m−1}]^{−1}

is continuous at x∗ and well defined in some ball S_1 = S(x∗, δ_1) ⊂ S. Clearly, A(x)^{−1} exists. Using Theorem 1.10, we then conclude that the mapping Gx = x − A(x)^{−1}Fx is well defined in a neighborhood of x∗ and is F−differentiable at x∗, with its F−derivative given by G′(x∗) = I − A(x∗)^{−1}F′(x∗). But in our case

G′(x∗) = I − (I + H(x∗) + · · · + H(x∗)^{m−1}) B(x∗)^{−1} B(x∗)(I − H(x∗)) = H(x∗)^m .

Hence ρ(G′(x∗)) = ρ(H(x∗)^m) = (ρ(H(x∗)))^m < 1. By Ostrowski's theorem, since x∗ is a fixed point of G, it follows that x∗ is a point of attraction of the iteration defined by (1.66). □

An immediate consequence is a local convergence result for the m-step Newton-SOR method.

Corollary 1.18  Let F : D ⊂ R^n → R^n be F−differentiable in an open neighborhood S_0 ⊂ D of some point x∗ ∈ D where F′ is continuous and where Fx∗ = 0. Let F′(x) = D(x) − L(x) − U(x) as usual and assume that D(x∗)^{−1} exists. Consider the m-step Newton-SOR process (ω_k = ω)

x^(k+1) = x^(k) − ω(I + · · · + H_ω(x^(k))^{m−1})(D(x^(k)) − ωL(x^(k)))^{−1}Fx^(k)   (1.69)

for k = 0, 1, 2, . . ., where ω > 0, m ≥ 1 and

H_ω(x) = (D(x) − ωL(x))^{−1}((1 − ω)D(x) + ωU(x)) .

Then, if ρ(H_ω(x∗)) < 1, x∗ is a point of attraction of (1.69).

Proof. We need only verify the hypotheses of Theorem 1.17, which we leave to the exercises. □

The situation is more difficult for the SOR-Newton method. We content ourselves with stating the following result without proof. Recall first that an M-matrix is an n × n matrix A which has the properties that A^{−1} ≥ 0 and a_{i,j} ≤ 0 for i ≠ j.

Proposition 1.19  Let F : D ⊂ R^n → R^n be continuously F−differentiable on an open neighborhood S_0 ⊂ D of a point x∗ ∈ D for which Fx∗ = 0. Assume further that F′(x∗) is an M-matrix. Then x∗ is a point of attraction of the m-step Newton-SOR method and also of the one-step SOR-Newton method, for any ω ∈ (0, 1].

1.6 Secant Methods

Newton’s method for nonlinear equations has some advantages over some of theother iterative methods we will discuss. The first is the existence of a domain ofattraction for Newton’s method (see Section 1.3). Thus if the Newton iteratesever land in the attraction ball, they will always remain in that ball and eventuallyconverge to x∗. The second advantage is the superlinear convergence of Newton’smethod, and if F satisfies the Lipschitz conditions (1.28) at x∗, the quadratic conver-gence of Newton’s method. Roughly, the latter implies that each iteration doublesthe number of significant digits in x(k) as an approximation to x∗. Finally, we notethat Newton’s method is “self-corrective”, i.e., x(k+1) depends only on F and x(k),so that bad effects from previous iterations are not carried along.

A well-known disadvantage of Newton's method is that, for a particular problem, the attraction ball for the method may be very small. Therefore, if the iteration is to converge, one may need a very good initial approximation to x∗.

A second disadvantage of Newton's method is the requirement that we solve a linear system of size n at every iteration which, in general, requires O(n^3) work. In the next section we study a method which reduces this work.

The third disadvantage of Newton's method is the requirement of evaluating the Jacobian at each step. The evaluation of F′(x^(k)) involves the evaluation of the n^2 scalar functions ∂f_i/∂x_j. In addition we must evaluate the n scalar functions making up Fx^(k). If the Jacobian is relatively easy to obtain, then Newton's method is very attractive. If it is relatively expensive, then we may want to circumvent this step as much as possible.

In this section we consider methods for which the derivatives making up the Jacobian are not evaluated. An obvious alternative is to replace the partial derivatives ∂f_i/∂x_j by difference quotients. For example, one frequently used difference approximation is

    ∂f_i/∂x_j (x) ≈ (1/h_{ij}) [ f_i(x + h_{ij} e^(j)) − f_i(x) ]        (1.70)

where the h_{ij} are given discretization parameters and e^(j) is the j-th coordinate vector.

More generally, let h ∈ R^n be a parameter vector and let Δ_{ij}(x, h) denote a difference approximation to ∂f_i/∂x_j(x). By this we mean a quantity well defined at x for all sufficiently small ‖h‖ and such that, if ∂f_i/∂x_j(x) exists, then

    lim_{h→0} Δ_{ij}(x, h) = ∂f_i/∂x_j (x) .        (1.71)

Then, if we define the difference matrix J(x, h) by

    [J(x, h)]_{i,j} = Δ_{ij}(x, h) ,        (1.72)

the method

    x^(k+1) = x^(k) − J(x^(k), h_k)^{-1} Fx^(k),   k = 0, 1, 2, . . .        (1.73)

is called a discretized Newton method. The parameter h_k ∈ R^n is allowed to vary with k.
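As an illustration, the following Python sketch implements the discretized Newton method (1.73) with the forward differences (1.70), using one fixed stepsize h for every h_{ij} (the simplest choice, discussed next); the tolerances and iteration limit are placeholders.

    import numpy as np

    def discretized_newton(F, x0, h=1e-6, tol=1e-10, maxit=50):
        x = np.asarray(x0, dtype=float)
        n = x.size
        for _ in range(maxit):
            Fx = F(x)
            if np.linalg.norm(Fx) < tol:
                break
            J = np.empty((n, n))
            for j in range(n):                 # column j of J(x, h), cf. (1.70), (1.72)
                e = np.zeros(n); e[j] = 1.0
                J[:, j] = (F(x + h * e) - Fx) / h
            x = x - np.linalg.solve(J, Fx)     # step (1.73)
        return x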

The simplest choice for h_k is h_k = h, independent of k. In this case it may be proved that the iteration (1.73), under certain hypotheses on F, is only linearly convergent. The method may become more accurate if lim_{k→∞} h_k = 0. This can be achieved, e.g., by letting h_k = γ_k h for fixed h ∈ R^n, where γ_k ∈ R^1 is a sequence of scalars tending to zero. Of great interest are those methods for which the discretization parameters h_k are chosen to be functions of the iterates x^(k). We illustrate in one dimension. A usual discretized method in R^1 is

    x^(k+1) = x^(k) − ( [f(x^(k) + h_k) − f(x^(k))] / h_k )^{-1} f(x^(k)),   k = 0, 1, 2, . . .        (1.74)

where h_k remains to be chosen. The choice h_k = x̄ − x^(k), where x̄ is a fixed number, leads to the regula falsi (false position) method

    x^(k+1) = x^(k) − ( [f(x̄) − f(x^(k))] / (x̄ − x^(k)) )^{-1} f(x^(k)),   k = 0, 1, 2, . . .        (1.75)

The choice h_k = x^(k−1) − x^(k) gives the secant method

    x^(k+1) = x^(k) − ( [f(x^(k−1)) − f(x^(k))] / (x^(k−1) − x^(k)) )^{-1} f(x^(k)),   k = 1, 2, 3, . . .        (1.76)


which requires the two starting values x^(0) and x^(1).

In one dimension, the iterate x^(k+1) defined by (1.74) is the solution of the linearized equation

    ℓ(x) = ( [f(x^(k) + h_k) − f(x^(k))] / h_k ) (x − x^(k)) + f(x^(k)) = 0 .

One usually views ℓ(x) = 0 as an approximation to the tangent-line equation

    ℓ_t(x) = f′(x^(k))(x − x^(k)) + f(x^(k)) = 0 .

But ℓ(x) can also be viewed as the linear interpolant of f between the points (x^(k), f(x^(k))) and (x^(k) + h_k, f(x^(k) + h_k)); indeed, ℓ(x^(k)) = f(x^(k)) and ℓ(x^(k) + h_k) = f(x^(k) + h_k). Then x^(k+1) is the point where ℓ(x) intersects the x-axis. In n dimensions, the discretized Newton method followed the first viewpoint: we replaced the Jacobian F′(x^(k)) by an approximating matrix J(x^(k), h_k). In order to follow the second viewpoint in n dimensions, we must construct hyperplanes that interpolate the f_i at n + 1 given points x^(k,j), j = 0, . . . , n, in a neighborhood of x^(k).

This means that, for each i, we must find a vector a^i and a number α_i such that the mapping

    L_i x = α_i + x^T a^i = α_i + x_1 a^i_1 + x_2 a^i_2 + ··· + x_n a^i_n

satisfies

    L_i x^(k,j) = f_i(x^(k,j)),   j = 0, . . . , n .

The next iterate x^(k+1) is then found as the intersection of these hyperplanes in R^{n+1} with the hyperplane {(x, y) : y = 0}, i.e., x^(k+1) is the solution of the linear system

    L_i x = 0,   i = 1, . . . , n .

This describes the general secant method. We still must indicate how to pick the interpolation points x^(k,j), 0 ≤ j ≤ n. Before doing this, we need to develop some terminology and some results on n-dimensional interpolation.

Definition 1.9 Any n + 1 points x^0, x^1, . . . , x^n in R^n are said to be in general position if the vectors x^0 − x^j, j = 1, . . . , n, are linearly independent.

Proposition 1.20 Let x^0, . . . , x^n ∈ R^n. Then the following are equivalent.

(i) x^0, . . . , x^n are in general position.

(ii) For any j, 0 ≤ j ≤ n, the vectors x^j − x^i, 0 ≤ i ≤ n, i ≠ j, are linearly independent.

(iii) The (n + 1) × (n + 1) matrix (e, X^T), where e^T = (1, . . . , 1) and X = (x^0, x^1, . . . , x^n), is nonsingular.

(iv) For any y ∈ R^n there exist scalars α_0, . . . , α_n such that Σ_{i=0}^n α_i = 1 and y = Σ_{i=0}^n α_i x^i.


Proof. For some j, 0 ≤ j ≤ n, consider the matrix identity

\[
\begin{pmatrix} 1 & 0 & \cdots & 0 & 0 & \cdots & 0\\
x^{(j)} & x^{(0)}-x^{(j)} & \cdots & x^{(j-1)}-x^{(j)} & x^{(j+1)}-x^{(j)} & \cdots & x^{(n)}-x^{(j)} \end{pmatrix}
\]
\[
= \begin{pmatrix} 1 & 1 & \cdots & 1 & 1 & \cdots & 1\\
x^{(j)} & x^{(0)} & \cdots & x^{(j-1)} & x^{(j+1)} & \cdots & x^{(n)} \end{pmatrix}
\begin{pmatrix} 1 & -1 & \cdots & -1\\ & 1 & & \\ & & \ddots & \\ & & & 1 \end{pmatrix} .
\]

Taking determinants,

\[
\det\bigl( x^{(0)}-x^{(j)} \;\cdots\; x^{(j-1)}-x^{(j)} \;\; x^{(j+1)}-x^{(j)} \;\cdots\; x^{(n)}-x^{(j)} \bigr)
= \det\begin{pmatrix} 1 & 1 & \cdots & 1 & 1 & \cdots & 1\\ x^{(j)} & x^{(0)} & \cdots & x^{(j-1)} & x^{(j+1)} & \cdots & x^{(n)} \end{pmatrix}
\]
\[
= (-1)^{j} \det\begin{pmatrix} 1 & 1 & \cdots & 1\\ x^{(0)} & x^{(1)} & \cdots & x^{(n)} \end{pmatrix}
= (-1)^{j} \det\begin{pmatrix} e^{T}\\ X \end{pmatrix} .
\]

Hence statements (i), (ii), and (iii) are equivalent. Now (iv) can be restated as follows: given any y ∈ R^n, the linear system

\[
\begin{pmatrix} e^{T}\\ X \end{pmatrix}\begin{pmatrix} \alpha_{0}\\ \vdots\\ \alpha_{n} \end{pmatrix} = \begin{pmatrix} 1\\ y \end{pmatrix}
\]

has a solution. Clearly, then, (iii) implies (iv). Conversely, setting y successively equal to 0, e^(1), . . . , e^(n), it is easy to see that the matrix in (iii) is invertible, so (iv) implies (iii). □

Proposition 1.21 Let x^0, . . . , x^n, y^0, . . . , y^n be given vectors in R^n. Then there exists a unique affine function Lx = a + Ax, with a ∈ R^n and A ∈ L(R^n, R^n), such that Lx^j = y^j, 0 ≤ j ≤ n, if and only if x^0, . . . , x^n are in general position. Moreover, A is nonsingular if and only if y^0, . . . , y^n are in general position.

Proof. Since Lx^j = y^j, 0 ≤ j ≤ n, we have a + Ax^j = y^j, 0 ≤ j ≤ n, or

\[
(e, X^{T})\begin{pmatrix} a^{T}\\ A^{T} \end{pmatrix} = \begin{pmatrix} y^{0} & \cdots & y^{n} \end{pmatrix}^{T}
\]

where e^T = (1, . . . , 1) and X = (x^0, . . . , x^n). The first part of the proposition is then a direct consequence of Proposition 1.20. For the second part, observe that


Lx^j = y^j implies that A(x^j − x^0) = y^j − y^0 for 1 ≤ j ≤ n. Since, by the definition of general position, the vectors x^j − x^0, 1 ≤ j ≤ n, are linearly independent, it follows that A is nonsingular if and only if the vectors y^j − y^0 are linearly independent, and hence if and only if y^0, . . . , y^n are in general position. □

In order to apply these results to the general secant method we require, at the k-th step of the iteration, the selection of n + 1 interpolation points x^(k,0), . . . , x^(k,n) in the domain of definition of F. We usually take x^(k,0) = x^(k). We assume that x^(k,0), . . . , x^(k,n) and Fx^(k,0), . . . , Fx^(k,n) are in general position. Then we define the next iterate by

    x^(k+1) = −A_k^{-1} a^k        (1.77)

where A_k and a^k satisfy

    a^k + A_k x^(k,j) = Fx^(k,j),   0 ≤ j ≤ n .        (1.78)

By Proposition 1.21, x^(k+1) is well defined. The computation of x^(k+1) by (1.77) requires finding a^k and A_k satisfying (1.78) and then solving the linear system A_k x^(k+1) = −a^k. Finding a^k and A_k requires the solution of a linear system of the form

\[
(e, X_{k}^{T})\begin{pmatrix} (a^{k})^{T}\\ A_{k}^{T} \end{pmatrix} = \begin{pmatrix} Fx^{(k,0)} & \cdots & Fx^{(k,n)} \end{pmatrix}^{T} .
\]

However, there is actually no need to compute A_k and a^k explicitly; we really only need to solve one linear system per iteration. We consider two alternative formulations which show that x^(k+1) can be obtained by solving only one linear system.

Let x^(k,0), . . . , x^(k,n) and Fx^(k,0), . . . , Fx^(k,n) both be in general position. Then, by Proposition 1.20, the coefficient matrix of the (n + 1) × (n + 1) system

\[
\begin{pmatrix} 1 & 1 & \cdots & 1\\ Fx^{(k,0)} & Fx^{(k,1)} & \cdots & Fx^{(k,n)} \end{pmatrix}
\begin{pmatrix} z_{0}\\ \vdots\\ z_{n} \end{pmatrix}
= \begin{pmatrix} 1\\ 0\\ \vdots\\ 0 \end{pmatrix}
\qquad (1.79)
\]

is nonsingular, so that z = (z_0, . . . , z_n)^T is well defined and unique. But (1.79) implies that

    Σ_{i=0}^n z_i = 1   and   0 = Σ_{i=0}^n z_i Fx^(k,i) .

Then, by (1.78),

    0 = Σ_{j=0}^n z_j Fx^(k,j) = Σ_{j=0}^n z_j ( a^k + A_k x^(k,j) ) = a^k + A_k Σ_{j=0}^n z_j x^(k,j) .


But, by Proposition 1.21, a^k and A_k are uniquely defined and A_k is nonsingular. Comparing with (1.77), we therefore have

    x^(k+1) = Σ_{j=0}^n z_j x^(k,j) .        (1.80)

Note that (1.79) and (1.80) completely determine x^(k+1) and that we need solve only the linear system (1.79). This formulation is known as the Wolfe secant formulation. Note that it really only requires that the Fx^(k,j) be in general position; the Wolfe formulation can be carried out even if the x^(k,j) are not in general position. The general secant method can then be described by the following algorithm.

(i) Given x^(k), construct x^(k,0), x^(k,1), . . . , x^(k,n).

(ii) Evaluate Fx^(k,0), . . . , Fx^(k,n). Under suitable conditions on F, x^(k), etc., these will be in general position.

(iii) Solve (1.79) for z_0, . . . , z_n.

(iv) Define x^(k+1) by the linear combination (1.80).

In this method, the evaluation of the n^2 derivatives ∂f_i/∂x_j at x^(k) has been replaced by picking the points x^(k,j) and evaluating the n^2 quantities f_i(x^(k,j)), 1 ≤ i, j ≤ n. Note that we have said nothing yet about how to pick the points x^(k,j); we discuss this below.
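A short Python sketch of one step of the Wolfe formulation, i.e., of steps (i)-(iv) above, with the interpolation points x^(k,0), . . . , x^(k,n) supplied by the caller (how to pick them is the question deferred above):

    import numpy as np

    def wolfe_secant_step(F, points):
        # points: sequence of the n+1 interpolation points x^(k,0), ..., x^(k,n)
        X = np.column_stack(points)                    # n x (n+1)
        FX = np.column_stack([F(p) for p in points])
        n = X.shape[0]
        M = np.vstack([np.ones(n + 1), FX])            # coefficient matrix of (1.79)
        rhs = np.zeros(n + 1); rhs[0] = 1.0
        z = np.linalg.solve(M, rhs)                    # solve (1.79)
        return X @ z                                   # x^(k+1) from (1.80)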

We now consider a second formulation of the secant method, known as the Newton formulation. We set x^(k,0) = x^(k) and define

    H_k = ( x^(k,1) − x^(k), . . . , x^(k,n) − x^(k) ) .        (1.81)

We then may write

    x^(k,j) = x^(k) + H_k e^(j),   j = 1, . . . , n .        (1.82)

Conversely, if H ∈ L(R^n, R^n) is any nonsingular matrix, then (1.82) defines n vectors x^(k,j) ∈ R^n which, together with x^(k), are in general position. Hence, to specify a general secant method one need only specify H_k. Now, for any nonsingular matrix H, set

    J(x, H) = Γ(x, H) H^{-1}        (1.83)

where Γ(x, H) is the matrix defined by

    Γ(x, H) = ( F(x + He^(1)) − Fx, . . . , F(x + He^(n)) − Fx ) .        (1.84)

But since A_k(x^(k,j) − x^(k)) = Fx^(k,j) − Fx^(k) (from the proof of Proposition 1.21), we see that (1.81), (1.82), and (1.84) imply that A_k H_k = Γ_k = Γ(x^(k), H_k). Then


A_k = Γ_k H_k^{-1}, since H_k is nonsingular, and thus A_k = J(x^(k), H_k) by (1.83). Since x^(k+1) = −A_k^{-1} a^k and a^k = Fx^(k,0) − A_k x^(k,0) (see (1.78)), we have

    x^(k+1) = −A_k^{-1} ( −A_k x^(k,0) + Fx^(k,0) ) = x^(k,0) − J(x^(k,0), H_k)^{-1} Fx^(k,0) .        (1.85)

(We note that from (1.83) we may write (1.85) as x^(k+1) = x^(k) − H_k Γ_k^{-1} Fx^(k), so that, as in the Wolfe formulation, the Newton formulation may be carried out provided only that Fx^(k,0), . . . , Fx^(k,n) are in general position.) The Newton formulation of the secant method can therefore be expressed by the following compact algorithm.

(i) Given x^(k), construct x^(k,0) = x^(k), x^(k,1), . . . , x^(k,n).

(ii) Form H_k = ( x^(k,1) − x^(k), . . . , x^(k,n) − x^(k) ).

(iii) Evaluate Γ_k = ( F(x^(k) + H_k e^(1)) − Fx^(k), . . . , F(x^(k) + H_k e^(n)) − Fx^(k) ).

(iv) Solve the linear system Γ_k z = Fx^(k).

(v) Set x^(k+1) = x^(k) − H_k z.

Note that, once again, in going from x^(k) to x^(k+1) we need only solve one linear system.
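The five steps translate directly into code. A minimal Python sketch, with the auxiliary points x^(k,1), . . . , x^(k,n) supplied by the caller:

    import numpy as np

    def newton_secant_step(F, x, aux_points):
        x = np.asarray(x, dtype=float)
        Fx = F(x)
        Hk = np.column_stack([p - x for p in aux_points])      # step (ii), cf. (1.81)
        Gk = np.column_stack([F(x + Hk[:, j]) - Fx             # step (iii), cf. (1.84)
                              for j in range(Hk.shape[1])])
        z = np.linalg.solve(Gk, Fx)                            # step (iv)
        return x - Hk @ z                                      # step (v)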

We now consider some choices for the auxiliary points x^(k,1), . . . , x^(k,n). We first consider

    x^(k,j) = x^(k) + ( x_j^(k−1) − x_j^(k) ) e^(j),   j = 1, . . . , n        (1.86)

for which H_k is the diagonal matrix

    H_k = diag( x_1^(k−1) − x_1^(k), . . . , x_n^(k−1) − x_n^(k) ) .        (1.87)

Then it is easily seen that (1.85) reduces to a discretized Newton method of the form (1.73), where J(x^(k), H_k) is the discrete approximation to the Jacobian J_{ij} = ∂f_i/∂x_j(x^(k)) obtained by replacing ∂f_i/∂x_j by the difference quotient (1.70) with h_{ij} = h_j^k = x_j^(k−1) − x_j^(k). If, instead of (1.86), we choose

    x^(k,j) = x^(k) + Σ_{i=1}^j ( x_i^(k−1) − x_i^(k) ) e^(i),   j = 1, . . . , n        (1.88)

we are led to a discretized Newton method with h_{ij}^k = x_j^(k−1) − x_j^(k) and with the partial derivatives approximated by

    ∂f_i/∂x_j (x) ≈ (1/h_{ij}) [ f_i( x + Σ_{l=1}^j h_{il} e^(l) ) − f_i( x + Σ_{l=1}^{j−1} h_{il} e^(l) ) ] .


Both (1.86) and (1.88) generate sequential two-point secant methods; they are special cases of the choice

    x^(k,j) = x^(k) + P_{j,k} ( x^(k−1) − x^(k) ),   j = 1, . . . , n        (1.89)

where the P_{j,k} ∈ L(R^n, R^n) are given linear operators. In (1.89) the auxiliary points depend only on x^(k) and x^(k−1). More generally, if the auxiliary points x^(k,j) depend on p of the previous iterates x^(k), . . . , x^(k−p+1), we have a sequential p-point secant method. The latter have, e.g., the form

    x^(k,j) = x^(k) + Σ_{i=1}^{p−1} P_{i,j,k} ( x^(k−i) − x^(k) ),   j = 1, . . . , n

where again the P_{i,j,k} are linear operators. A special case is the choice p = n + 1 with the x^(k,j) chosen so that

    H_k = ( x^(k−1) − x^(k), . . . , x^(k−n) − x^(k) ) .        (1.90)

As an example of a nonsequential method, consider choosing x^(k,1), . . . , x^(k,p) to be the p vectors x^(j) among x^(0), . . . , x^(k−1) for which ‖Fx^(j)‖ is smallest.

In general the secant method requires n(n + 1) function evaluations at each step, namely f_i(x^(k,j)) for i = 1, . . . , n and j = 0, . . . , n. In particular this is true for the choice (1.86). This work is comparable to that for Newton's method if f_i(x) and ∂f_i/∂x_j are comparably costly to evaluate. The choice (1.88) effects some savings since x^(k,n) = x^(k−1) and Fx^(k−1) is available from the previous step; thus only n^2 functions need be evaluated, i.e., f_i(x^(k,j)) for i = 1, . . . , n and j = 0, . . . , n − 1. Other secant methods can provide additional savings. The most spectacular savings is provided by the choice leading to (1.90). Here x^(k,j) = x^(k−j), j = 0, . . . , n, and since Fx^(k−1), . . . , Fx^(k−n) are available from previous iterations, we need only evaluate Fx^(k) at each step, i.e., the n functions f_i(x^(k)), i = 1, . . . , n. Furthermore, it can be shown that this choice, i.e., (1.90), leads to substantial savings in the solution of the linear system required by the Newton formulation of the secant method. Thus the (n + 1)-point sequential secant method requires the least amount of computation per step. Unfortunately, it is prone to unstable behavior, and no satisfactory convergence results can be given. (Note also the need to generate, by, e.g., a two-point method, the starting values x^(0), . . . , x^(n).) On the other hand, satisfactory convergence results for the two-point methods (1.86) and (1.88) have been derived.

We close this section by briefly considering the closely related class of iterative methods known as Steffensen methods. In one dimension these are defined by letting h_k = f(x^(k)) in (1.74), i.e.,

    x^(k+1) = x^(k) − ( [f(x^(k) + f(x^(k))) − f(x^(k))] / f(x^(k)) )^{-1} f(x^(k)),   k = 0, 1, 2, . . .        (1.91)


Under suitable conditions this iteration exhibits quadratic convergence, as does Newton's method, but without requiring any derivatives of f. Paralleling our development of the secant method, we can generalize to n dimensions by letting

    x^(k+1) = x^(k) − J(x^(k), H_k)^{-1} Fx^(k)

where the auxiliary points are chosen by formulas such as

    x^(k,j) = x^(k) + P_{j,k} Fx^(k) .

Again, the choices P_{j,k} = (0, . . . , 0, e^(j), 0, . . . , 0) and P_{j,k} = (e^(1), . . . , e^(j), 0, . . . , 0), analogous to (1.86) and (1.88), are of interest, as is the generalization of the (n + 1)-point sequential method (1.90).
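To compare the one-dimensional methods side by side, here is a minimal Python sketch of the secant iteration (1.76) and the Steffensen iteration (1.91); the stopping rules are our own simplifications, and no safeguards against division by small numbers are included.

    def secant_1d(f, x0, x1, tol=1e-12, maxit=100):
        # the secant method (1.76); needs two starting values
        for _ in range(maxit):
            f0, f1 = f(x0), f(x1)
            if abs(f1) < tol:
                break
            x0, x1 = x1, x1 - f1 * (x0 - x1) / (f0 - f1)
        return x1

    def steffensen_1d(f, x, tol=1e-12, maxit=100):
        # the Steffensen method (1.91): h_k = f(x^(k))
        for _ in range(maxit):
            fx = f(x)
            if abs(fx) < tol:
                break
            x = x - fx * fx / (f(x + fx) - fx)
        return x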

1.7 Broyden’s method and other update methods

At the beginning of the previous section we discussed various advantages and disadvantages of Newton's method for the solution of nonlinear equations. In practice, a Newton step proceeds as follows:

    (i) Compute Fx^(k); if x^(k) is acceptable, stop. Otherwise compute F′(x^(k)).
    (ii) Solve the linear system F′(x^(k)) z = −Fx^(k) for z and set x^(k+1) = x^(k) + z.        (1.92)

We have also studied various attempts at reducing the cost per step of the iteration. For example, if one wishes to avoid computing the n^2 scalar functions which make up the Jacobian matrix F′(x^(k)), one could replace F′(x^(k)) in (1.92) by the matrix A(x^(k), h_k) ∈ L(R^n, R^n), where

    [A(x, h)]_{i,j} = [ f_i(x + η_j e^(j)) − f_i(x) ] / η_j        (1.93)

and h = (η_1, . . . , η_n)^T is some suitably chosen vector. Of course, we now solve the system

    A(x^(k), h_k) z = −Fx^(k)        (1.94)

for z. This finite difference Newton's method is of substantial interest. For example, if F satisfies certain hypotheses, including (1.28), and if at each step ‖h_k‖ ≤ γ‖Fx^(k)‖ for some constant γ, then we can confirm convergence results for the above finite difference Newton's method. Of course, we still have the expense of evaluating n^2 scalar functions to determine the matrix A. A popular way to reduce this cost, for either Newton's method or the finite difference Newton's method, is to hold the Jacobian or the approximate Jacobian A fixed for a given number of iterations. This is particularly useful when the Jacobian does not change rapidly from iteration to iteration. It is usually difficult to decide just how long to hold the Jacobian fixed. It can be shown that this technique decreases the rate of convergence but can, in a certain sense, increase the efficiency when compared to Newton's method. We also note that all the variants of Newton's method discussed in the last section and here still require the solution of a linear system of equations at each step, i.e., O(n^3) arithmetic operations. In some cases this is the most expensive part of the iteration. Once again one is attracted to those techniques where the Jacobian is held fixed for a certain number of iterations, since in each such iteration the expense of solving the linear system can be reduced to O(n^2), e.g., the forward and back substitution with a stored factorization.

We now turn to what are known as update or quasi-Newton methods, paying particular attention to Broyden's method. We shall see how, for example, Broyden's method reduces by one order of magnitude both the n(n + 1) scalar function evaluations and the O(n^3) operations involved in solving a linear system at each iteration.

Let us now derive Broyden's method. We begin by assuming that F : R^n → R^n is continuously differentiable in an open set D and that, for given x ∈ D and s ≠ 0, the vector x̄ = x + s belongs to D. Here we will associate x with x^(k) and x̄ with x^(k+1), and we will seek a good approximation to F′(x̄). Since F′ is continuous at x, given ε > 0 there exists a δ > 0 such that ‖Fx̄ − Fx − F′(x)(x̄ − x)‖ ≤ ε‖x̄ − x‖ provided ‖x̄ − x‖ < δ. It follows that Fx̄ ≈ Fx + F′(x)(x̄ − x), with the approximation improving as ‖x̄ − x‖ decreases. Hence, if B̄ is to denote our approximation to F′(x̄), it is natural to require that B̄ satisfy Fx̄ = Fx + B̄(x̄ − x), or

    B̄s = y = Fx̄ − Fx        (1.95)

where s = x̄ − x. In the case n = 1, (1.95) completely determines B̄ and we are led to the secant method. For n > 1 we argue that the only new information about F has been gained in the direction determined by s. Now suppose we have in hand an approximation B to F′(x). Broyden reasoned that there is then no justification for having B̄ differ from B on the orthogonal complement of s, i.e., we require that

    B̄z = Bz   if z^T s = 0 .        (1.96)

Clearly (1.95) and (1.96) uniquely determine B̄ from B and, in fact,

    B̄ = B + (y − Bs)s^T / (s^T s) .        (1.97)

Now (1.95) is central to the development of quasi-Newton methods and is referred to as the quasi-Newton equation. Equation (1.97) provides what is known as the Broyden update to the approximate Jacobian.

The quasi-Newton equation (1.95) also plays a role in a second derivation of the Broyden update (1.97). We argue that, although any matrix which satisfies (1.95) is a good candidate for B̄, we should choose B̄ to be that matrix satisfying (1.95) which is "closest" to B. If we measure "closest" in the Euclidean or Frobenius matrix norm ‖·‖_F, we are led to (1.97) again. This fact is proved in the following proposition. Here we use the notation L(R^n) to denote L(R^n, R^n).

Proposition 1.22 Given B ∈ L(R^n), y ∈ R^n, and some nonzero s ∈ R^n, define B̄ by (1.97). Then B̄ is the unique solution to the problem

    min { ‖B̂ − B‖_F : B̂ ∈ L(R^n), B̂s = y } .

Proof. To show that B̄ is a solution, note that if B̂s = y then

    ‖B̄ − B‖_F = ‖ (y − Bs)s^T / (s^T s) ‖_F = ‖ (B̂ − B) ss^T / (s^T s) ‖_F ≤ ‖B̂ − B‖_F ,

since ‖ss^T/(s^T s)‖_2 = 1. That B̄ is the unique solution follows from the fact that the mapping f : L(R^n) → R^1 defined by f(A) = ‖A − B‖_F is strictly convex and that the set of B̂ ∈ L(R^n) such that B̂s = y is convex. □

It is clear how (1.97) can be used in an iterative method. For example, the most basic form of Broyden's method is defined by

    x^(k+1) = x^(k) − B_k^{-1} Fx^(k),   k = 0, 1, 2, . . .        (1.98)

where

    B_{k+1} = B_k + ( y^k − B_k s^k )(s^k)^T / ( (s^k)^T s^k ),   k = 0, 1, 2, . . .        (1.99)

with

    y^k = Fx^(k+1) − Fx^(k)   and   s^k = x^(k+1) − x^(k) .        (1.100)

It is clear that, given x^(0) and B_0, Broyden's method can be carried out with n scalar function evaluations per iteration, i.e., those for Fx^(k+1). However, it seems that we still need to solve the linear system B_k s^k = −Fx^(k). One way to overcome this difficulty is to use the following result on rank-one updates of a matrix, due to Sherman and Morrison. Recall that a matrix of the form uv^T, where u, v ∈ R^n, is always of rank at most one. We first prove a preliminary lemma concerning the determinant of a matrix which can be written as the identity plus a rank-one matrix.

Lemma 1.23 Let v, w ∈ R^n be given. Then

    det(I + vw^T) = 1 + v^T w .        (1.101)

Proof. Let P = I + vw^T. If v is the zero vector the result is trivial, so we assume v ≠ 0. Let z be an eigenvector of P, i.e., (I + vw^T)z = λz for some λ. Then (1 − λ)z = −(w^T z)v, i.e., z is either orthogonal to w or is a multiple of v. If w^T z = 0 then λ = 1, while if z is parallel to v then λ = 1 + v^T w. Thus the eigenvalues of P are all equal to one except for a single eigenvalue equal to 1 + v^T w, and the result (1.101) follows. □


Lemma 1.24 Let u, v ∈ R^n and assume that A ∈ L(R^n) is nonsingular. Then A + uv^T is nonsingular if and only if σ = 1 + v^T A^{-1} u ≠ 0. Moreover, if σ ≠ 0, then

    (A + uv^T)^{-1} = A^{-1} − (1/σ) A^{-1} u v^T A^{-1} .        (1.102)

Proof. Since det(A + uv^T) = det A · det(I + A^{-1}uv^T) and A is nonsingular, A + uv^T is nonsingular if and only if det(I + A^{-1}uv^T) ≠ 0. But by Lemma 1.23, det(I + A^{-1}uv^T) = 1 + v^T A^{-1} u = σ. To verify (1.102) we need only show that multiplying the right-hand side of (1.102) by A + uv^T gives the identity matrix; this is readily verified. □
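Lemma 1.24 is easy to test numerically. The following Python snippet checks (1.102) on a random, well-conditioned matrix; it is a sanity check, not a proof.

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((5, 5)) + 5.0 * np.eye(5)   # nonsingular, well conditioned
    u, v = rng.standard_normal(5), rng.standard_normal(5)

    Ainv = np.linalg.inv(A)
    sigma = 1.0 + v @ Ainv @ u
    lhs = np.linalg.inv(A + np.outer(u, v))             # (A + u v^T)^{-1}
    rhs = Ainv - np.outer(Ainv @ u, v @ Ainv) / sigma   # right-hand side of (1.102)
    print(np.allclose(lhs, rhs))                        # True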

From Lemma 1.24 it follows that if H_k = B_k^{-1} and H_{k+1} = B_{k+1}^{-1}, then

    H_{k+1} = H_k + ( s^k − H_k y^k )(s^k)^T H_k / ( (s^k)^T H_k y^k )        (1.103)

provided that (s^k)^T H_k y^k ≠ 0. Therefore Broyden's method can also be implemented as

    x^(k+1) = x^(k) − H_k Fx^(k)

where the sequence {H_k} is generated by (1.103). In this form, Broyden's method requires only n scalar function evaluations per iteration (those for Fx^(k)) and O(n^2) arithmetic operations (the matrix-vector multiplications involving H_k).
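Putting (1.98), (1.100), and the inverse update (1.103) together gives the following Python sketch of Broyden's method; the initial matrix B_0 (e.g., a finite difference approximation to F′(x^(0))) must be supplied by the caller.

    import numpy as np

    def broyden(F, x0, B0, tol=1e-10, maxit=100):
        x = np.asarray(x0, dtype=float)
        H = np.linalg.inv(B0)                 # H_0 = B_0^{-1}; the only O(n^3) step
        Fx = F(x)
        for _ in range(maxit):
            if np.linalg.norm(Fx) < tol:
                break
            s = -H @ Fx                       # s^k, so that x^(k+1) = x^(k) + s^k
            x_new = x + s
            F_new = F(x_new)
            y = F_new - Fx                    # y^k from (1.100)
            Hy = H @ y
            H = H + np.outer(s - Hy, s @ H) / (s @ Hy)   # update (1.103)
            x, Fx = x_new, F_new
        return x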

It is also possible to implement (1.99) directly and still use only O(n^2) arithmetic operations per iteration. For example, if B_k = Q_k R_k, where Q_k is orthogonal and R_k is upper triangular, then the corresponding factorization of B_{k+1} can be obtained in O(n^2) operations. Of course, if the factorization B_k = Q_k R_k is given, then the linear system B_k s^k = −Fx^(k) can be solved in O(n^2) operations as well. One reason this approach may be preferable to (1.103) is that there are no matrix-vector multiplications (the term B_k s^k is just −Fx^(k)). Another reason is that (1.99) is more stable than (1.103).

If F is an affine function, then Broyden's method is norm reducing with respect to the induced L^2-norm on matrices. Furthermore, we can prove the following stronger result for the Frobenius norm.

Proposition 1.25 Let A ∈ L(R^n) satisfy y = As for some nonzero s ∈ R^n and y ∈ R^n. Moreover, given B ∈ L(R^n), define B̄ by (1.97). Then

    ‖B̄ − A‖_F ≤ ‖B − A‖_F

with equality if and only if B̄ = B.

Proof. We have

    ‖B − A‖_F^2 = ‖(B − B̄) + (B̄ − A)‖_F^2 = ‖B − B̄‖_F^2 + ‖B̄ − A‖_F^2

since A lies in the affine subspace {B̂ : y = B̂s} and since, by Proposition 1.22, the matrix B̄ is the orthogonal projection of B onto this subspace, i.e., B − B̄ is orthogonal to B̄ − A. The result then follows immediately. □

If {x^(k)} is any sequence and s^k, y^k are defined by (1.100), then y^k = A s^k for

    A = ∫_0^1 F′( x^(k) + θ s^k ) dθ .

Proposition 1.25 then guarantees that, in the Frobenius norm, B_{k+1} is a better approximation than B_k to the average of F′ on the line segment from x^(k) to x^(k+1). Of course, if F : R^n → R^n is affine, then A is the coefficient matrix, and therefore for linear systems Broyden's method is norm reducing in the Frobenius norm.

Broyden's method is sometimes implemented in the form

    B̄ = B + θ (y − Bs)s^T / (s^T s)        (1.104)

where θ is chosen so as to avoid singularity of B̄. The following result can be used to decide how to choose θ.

To avoid B̄ being singular, note that (1.104) and Lemma 1.23 yield

    det B̄ = det B · [ (1 − θ) + θ ( s^T B^{-1} y ) / ( s^T s ) ] .        (1.105)

We now choose θ as a number closest to unity such that |det B̄| ≥ σ |det B| for some σ ∈ (0, 1). The usual choice is σ = 0.1.
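Since, by (1.105), det B̄/det B = 1 + θ(γ − 1) with γ = s^T B^{-1} y/(s^T s), the rule "θ closest to unity with |det B̄| ≥ σ|det B|" admits a simple closed form. A Python sketch (the closed-form expression is our own algebra for this rule):

    def broyden_theta(gamma, sigma=0.1):
        # gamma = s^T B^{-1} y / s^T s; returns theta closest to 1 with
        # |1 + theta*(gamma - 1)| >= sigma
        if abs(gamma) >= sigma:
            return 1.0
        sign = 1.0 if gamma >= 0.0 else -1.0
        return (1.0 - sign * sigma) / (1.0 - gamma)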

The following local convergence theorem, which we present without proof, holds for Broyden's method.

Theorem 1.26 Let F : R^n → R^n be continuously differentiable in an open convex set D. Let there be an x∗ ∈ D such that Fx∗ = 0 and F′(x∗) is nonsingular, and suppose there exists a constant C such that ‖F′(x) − F′(x∗)‖ ≤ C‖x − x∗‖ for x ∈ D. Then Broyden's method, defined by (1.98), (1.99), and (1.100), is locally and superlinearly convergent at x∗.

Theorem 1.26 implies that, under the stated hypotheses, Broyden's method produces iterates which satisfy ‖x^(k+1) − x∗‖ ≤ α_k ‖x^(k) − x∗‖ for some sequence of scalars α_k tending to zero as k → ∞. Note that, under the same hypotheses, Newton's method is locally quadratically convergent at x∗. Note also that although x^(k) → x∗ superlinearly, it is not necessarily true that B_k → F′(x∗). This implies that Broyden's method is not self-correcting, i.e., B_k may retain irrelevant or harmful information contained in B_j, j < k.

If we replace (1.97) by (1.104), i.e., if we replace (1.99) in Broyden's method by

    B_{k+1} = B_k + θ_k ( y^k − B_k s^k )(s^k)^T / ( (s^k)^T s^k ),
    with B_{k+1} nonsingular, |θ_k − 1| ≤ θ̄, and θ̄ ∈ (0, 1),        (1.106)


then Theorem 1.26 still holds. Furthermore, if F is affine, i.e., if we are solving a linear system, then the modified Broyden's method (1.98), (1.106), (1.100) is globally and superlinearly convergent to the solution of the linear system.

A variation of Broyden's method which is of interest when F′(x) is sparse is the following. We use (1.98), (1.99), and (1.100) to define B_{k+1} from B_k but, before it is used, B_{k+1} is forced to have the same sparsity pattern as F′(x). Clearly, forcing B_{k+1} to have the same sparsity pattern reduces ‖B_{k+1} − F′(x∗)‖. This method is also locally and superlinearly convergent.

We need not choose s^k = x^(k+1) − x^(k) in (1.100). It is reasonable to choose s^k to be any vector such that F is defined at x^(k) + s^k and then set y^k = F(x^(k) + s^k) − F(x^(k)). For example, if we set s^k = ηe^(j) for some scalar η, then (1.99) shows that B_{k+1} differs from B_k only in the j-th column, and this column is now given by [F(x^(k) + ηe^(j)) − F(x^(k))]/η. Of course, if s^k ≠ x^(k+1) − x^(k), then each step requires 2n scalar function evaluations, i.e., those for F(x^(k) + s^k) and Fx^(k). Thus, if s^k ≠ x^(k+1) − x^(k), we still use (1.98) and (1.99), but we replace (1.100) by y^k = F(x^(k) + s^k) − F(x^(k)), where s^k is a nonzero vector such that ‖s^k‖ ≤ η max{ ‖x^(k+1) − x∗‖, ‖x^(k) − x∗‖ } for some constant η. For example, the choice s^k = ‖Fx^(k+1)‖ e^(j) is suitable for each j. With s^k satisfying this relation the method will converge, but only linearly, as well as requiring 2n function evaluations per step. Superlinear convergence can be recovered if, for example, the {s^k} are uniformly linearly independent.

Broyden's method and the variants discussed in this section are special cases of a general class of methods in which a low-rank update of B_k is used to define B_{k+1}. Broyden's method uses the rank-one update given by (1.97). Other rank-one and rank-two update methods have been devised, but Broyden's method has proven to be by far the most useful update method for the approximate solution of n nonlinear equations in n unknowns. For unconstrained minimization problems other update methods have proved useful, and we will discuss update methods in that context in the next chapter.


Chapter 2

Unconstrained Minimization and Nonlinear Least Squares

Let f : R^n → R^1 be a functional defined on an open set D and consider the problem of finding z ∈ D such that f(z) ≤ f(x) for each x ∈ D. In this case z is a global minimizer of f in D and, even if it is known to exist, finding it is usually an intractable task. Generally we are content with seeking local minimizers of f; that is, we seek x∗ ∈ D such that for some δ > 0

    f(x∗) ≤ f(x)   for all x ∈ D with ‖x − x∗‖ ≤ δ .        (2.1)

We will consider here the solution of (2.1) only in the case of differentiable f. In this case (2.1) is usually solved by trying to find a zero of ∇f, the gradient of f. This approach is, of course, based on the fact that if x∗ is a local minimizer of f on an open set D and f is differentiable at x∗, then necessarily ∇f(x∗) = 0. Since ∇f(x) = 0 consists of n equations in the n unknown components of x, we see that the minimization problem, when f is differentiable, leads to finding the roots of n nonlinear equations. Of course, a reverse correspondence between systems of equations and functional minimization can also be made; we shall exploit this in Section 3. We may therefore attempt to apply any of the methods of Chapter 1 to finding a solution of ∇f = 0. For instance, since F = ∇f is a mapping from R^n to R^n, we may use, when f is twice differentiable, Newton's method, which here takes the form

    x^(k+1) = x^(k) − [∇²f(x^(k))]^{-1} ∇f(x^(k))        (2.2)

where ∇²f(x) is the Hessian matrix of f at x, i.e., ∇²f(x) is just the Jacobian matrix of ∇f. Under appropriate conditions we can obtain local and quadratic convergence of (2.2) to a zero of ∇f.
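As a concrete illustration of (2.2), here is a minimal Python sketch in which the gradient and Hessian of f are supplied as callables; no globalization is attempted.

    import numpy as np

    def newton_minimize(grad, hess, x0, tol=1e-10, maxit=50):
        # Newton's method (2.2) applied to the system grad f(x) = 0
        x = np.asarray(x0, dtype=float)
        for _ in range(maxit):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            x = x - np.linalg.solve(hess(x), g)
        return x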

2.1 Descent methods for unconstrained minimization

In this section we will consider descent methods for solving (2.1). A descent method generates for each iterate x^(k) a direction p^k of local descent, in the sense that there is a λ∗ such that f(x^(k) + λp^k) < f(x^(k)) for λ ∈ (0, λ∗]. The next iterate is of the form x^(k+1) = x^(k) + λ_k p^k, where λ_k is chosen so that f(x^(k+1)) < f(x^(k)). The directions p^k and the parameters λ_k should be chosen in such a way that the sequence {∇f(x^(k))} converges to zero. If ‖∇f(x^(k))‖ is small, then usually x^(k) is near a zero of ∇f, while the fact that the sequence {f(x^(k))} is decreasing indicates that this zero of ∇f is probably a local minimizer of f.

The simplest example is the method of steepest descent, for which we ask for the vector p̄ of unit L²-norm such that, for some λ̄ > 0,

    f(x + λp̄) < f(x + λp),   λ ∈ (0, λ̄),

for all p ≠ p̄ with ‖p‖ = 1. It is easy to show that if ∇f(x) ≠ 0 then p̄ = −∇f(x)/‖∇f(x)‖, so that the method of steepest descent is given by

    x^(k+1) = x^(k) − λ_k ∇f(x^(k)),   k = 0, 1, 2, . . .        (2.3)

where λ_k is chosen to guarantee that f(x^(k+1)) < f(x^(k)). The following result guarantees the existence of such a parameter.

Lemma 2.1 Let f : R^n → R^1 be defined on an open set D and differentiable at x ∈ D. If [∇f(x)]^T p < 0 for some p ∈ R^n, then there is a λ∗ = λ∗(x, p) > 0 such that f(x + λp) < f(x) for λ ∈ (0, λ∗).

Proof. The result follows easily from the fact that

    lim_{λ→0+} ( f(x + λp) − f(x) ) / λ = [∇f(x)]^T p .        (2.4)

□

This lemma guarantees, in particular, that the parameter λ_k in the steepest descent method can be chosen so that f(x^(k+1)) < f(x^(k)). This is not sufficient to show that {x^(k)} approaches a zero of ∇f, since the λ_k may be arbitrarily small. In fact, λ_k can be chosen so that ‖x^(k+1) − x^(k)‖ ≤ ε/2^k, and therefore {x^(k)} converges to a point x∗ such that ‖x^(0) − x∗‖ ≤ 2ε. Clearly, if ∇f(x^(0)) ≠ 0 and ∇f is continuous at x^(0), then ε can be chosen so small that ∇f(x∗) ≠ 0. Below we will consider a method of choosing λ_k which avoids this problem. When λ_k is chosen properly, the following result, which we do not prove, holds.
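For illustration, here is a Python sketch of (2.3) with a naive step-halving choice of λ_k that merely enforces descent; as just noted, such a rule alone does not guarantee convergence to a zero of ∇f, which is what motivates the safeguarded steplength rules discussed below.

    import numpy as np

    def steepest_descent(f, grad, x0, lam0=1.0, tol=1e-8, maxit=500):
        x = np.asarray(x0, dtype=float)
        for _ in range(maxit):
            g = grad(x)
            if np.linalg.norm(g) < tol:
                break
            lam = lam0
            while f(x - lam * g) >= f(x):     # halve until descent is achieved
                lam *= 0.5
                if lam < 1e-16:
                    return x                  # give up: no progress possible
            x = x - lam * g
        return x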

Proposition 2.2 Let f : R^n → R^1 be continuously differentiable and bounded below on R^n, and assume that x^(0) is such that ∇f is uniformly continuous on the level set L(x^(0)) = { x ∈ R^n : f(x) ≤ f(x^(0)) }. Then there is a sequence {λ_k} such that the steepest descent sequence (2.3) is well defined, {f(x^(k))} is decreasing, and {∇f(x^(k))} converges to zero.

If f is continuously differentiable on R^n and L(x^(0)) is compact, then the remaining assumptions of Proposition 2.2 are automatically satisfied and, in addition, f has a global minimizer and {∇f(x^(k))} converges to zero. However, even in this case the steepest descent sequence does not necessarily converge to a local minimizer of f; it may converge to a saddle point of f. Nevertheless, Proposition 2.2 is a rather strong convergence result. The fact that {∇f(x^(k))} converges to zero implies that any limit point of {x^(k)} is a zero of ∇f and that, for any ε > 0, the stopping criterion ‖∇f(x^(k))‖ < ε will be satisfied in a finite number of steps. Unfortunately, steepest descent usually converges only linearly. This slow rate of convergence can be improved by switching to a faster method, e.g., Newton's method, in a neighborhood of a zero of ∇f.

In view of the global convergence of steepest descent and the fast local convergence of Newton's method, it seems desirable to have a method that behaves like Newton's method near a local minimizer but like steepest descent far from the minimizer. Most descent methods of this type are of the form

    x^(k+1) = x^(k) − λ_k B_k^{-1} ∇f(x^(k)),   k = 0, 1, 2, . . .        (2.5)

where B_k is a symmetric, positive definite matrix which resembles ∇²f(x^(k)), at least in a neighborhood of a local minimizer. One example is the method of Goldfeld, Quandt, and Trotter,

    x^(k+1) = x^(k) − λ_k ( ∇²f(x^(k)) + µ_k I )^{-1} ∇f(x^(k)),   k = 0, 1, 2, . . .        (2.6)

where the scalar µ_k ≥ 0 is chosen so that ∇²f(x^(k)) + µ_k I is positive definite. To justify the claim that (2.6) behaves like Newton's method in a neighborhood of a local minimizer, we merely note that if f is differentiable in an open set D and twice differentiable at a local minimizer x∗ of f in D, then ∇²f(x∗) is positive semidefinite. Therefore, if x^(k) is in a neighborhood of a local minimizer, then very small values of µ_k will suffice to make ∇²f(x^(k)) + µ_k I positive definite. Also note that if s(µ) = −(∇²f(x) + µI)^{-1}∇f(x), then s(0) is the Newton direction, while as µ → ∞ the angle between s(µ) and −∇f(x) decreases monotonically to zero; thus for large µ, (2.6) behaves like steepest descent. In (2.6), in order to preserve the good local properties of Newton's method, one has to choose µ_k and λ_k with some care. As long as {µ_k} and {λ_k} converge to zero and one, respectively, the iteration (2.6) is superlinearly convergent. If µ_k ≤ η‖∇f(x^(k))‖ for some constant η and λ_k = 1 for all sufficiently large k, then (2.6) converges quadratically. Unfortunately, these results do not indicate how to choose {µ_k} globally; in fact, this issue has not been fully resolved.

A method of the form (2.5) which avoids the problem of choosing µ_k in (2.6), and yet resembles (2.6), is the following. We try to obtain a Cholesky decomposition of ∇²f(x^(k)), i.e., we try to find a nonsingular lower triangular matrix L_k such that ∇²f(x^(k)) = L_k L_k^T. Of course, if ∇²f(x^(k)) is not positive definite this decomposition does not even exist, but the idea is that, as the decomposition proceeds, it is possible to add to the diagonal of ∇²f(x^(k)) and thereby ensure that we obtain the Cholesky decomposition of a well-conditioned, positive definite matrix which differs from ∇²f(x^(k)) in some minimal way. In particular, if ∇²f(x^(k)) is itself a well-conditioned, positive definite matrix, then ∇²f(x^(k)) = L_k L_k^T. A more sophisticated version of the algorithm uses factorizations of the form L_k D_k L_k^T, where L_k is a unit lower triangular matrix and D_k is a positive definite diagonal matrix; the Cholesky factor is then L_k D_k^{1/2}.

We now turn to the question of selection rules for the steplength λ_k used in descent methods of the form (2.5), and more generally in any descent method of the form

    x^(k+1) = x^(k) + λ_k p^k,   k = 0, 1, 2, . . .        (2.7)

where [∇f(x^(k))]^T p^k < 0. In a descent method λ_k should be such that f(x^(k+1)) < f(x^(k)), but we have already noted that this requirement can be satisfied by arbitrarily small λ_k, and then {x^(k)} may converge to a point at which ∇f is not zero. A more reasonable requirement is that

    f(x^(k) + λ_k p^k) ≤ f(x^(k)) + α λ_k [∇f(x^(k))]^T p^k,   α ∈ (0, 1/2) .        (2.8)

The reason for choosing α < 1/2 is that, with this choice, Theorem 2.4 below shows that if {x^(k)} converges to a local minimizer x∗ of f at which ∇²f(x∗) is positive definite, and {p^k} converges to the Newton step −[∇²f(x^(k))]^{-1}∇f(x^(k)) in both length and direction, then λ_k = 1 will satisfy (2.8) for all sufficiently large k. If α is close to zero then (2.8) is not a very stringent requirement; indeed, α is usually chosen in the range [10^{-4}, 10^{-1}]. However, it is not a good idea to fix λ_k by just requiring that (2.8) be satisfied since, for instance, λ_k = 0 is then admissible. In general, unreasonably small λ_k are ruled out by a numerical search procedure, but theoretically we need to impose another requirement. One such requirement is that

    [∇f(x^(k) + λ_k p^k)]^T p^k ≥ β [∇f(x^(k))]^T p^k,   β ∈ (α, 1) .        (2.9)

(Figure omitted: the graph of λ ↦ f(x^(k) + λp^k), showing where (2.8) and (2.9) hold.) The λ_k which satisfy (2.8) and (2.9) lie in the intervals J_1 and J_2 of that figure; at the left endpoint of each interval equality holds in (2.9), while at the right endpoint equality holds in (2.8). To show that there are λ_k which satisfy (2.8) and (2.9), assume that f is defined on R^n and that f(x^(k) + λp^k) is bounded below for λ ≥ 0. It is then geometrically obvious that there are λ̄_k > 0 for which equality holds in (2.8), since [∇f(x^(k))]^T p^k < 0. If λ̄_k is the first such value, then the mean value theorem implies that

    λ̄_k [∇f(x^(k) + θ_k λ̄_k p^k)]^T p^k = f(x^(k) + λ̄_k p^k) − f(x^(k)) = α λ̄_k [∇f(x^(k))]^T p^k

for some θ_k ∈ (0, 1); further, since α < β,

    [∇f(x^(k) + θ_k λ̄_k p^k)]^T p^k ≥ β [∇f(x^(k))]^T p^k .

Thus λ_k = θ_k λ̄_k satisfies (2.8) and (2.9). However, we emphasize that a search routine for λ should not necessarily try to satisfy both (2.8) and (2.9). In fact, the intervals which satisfy these two conditions can be quite small and therefore difficult to find. Moreover, to test whether or not (2.9) is satisfied requires the evaluation of ∇f. Instead, the search routine should produce a λ_k which satisfies (2.8) and is not too small; after all, (2.9) just guarantees that λ_k is not too small.
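A typical search routine of this kind backtracks from λ = 1, enforcing (2.8) while avoiding unreasonably small steps. A Python sketch (the parameter values are conventional choices, not prescribed by the text):

    def backtracking(f, grad, x, p, alpha=1e-4, rho=0.5, lam=1.0, lam_min=1e-14):
        fx = f(x)
        slope = grad(x) @ p                   # [grad f(x)]^T p, assumed negative
        while lam > lam_min:
            if f(x + lam * p) <= fx + alpha * lam * slope:   # condition (2.8)
                return lam
            lam *= rho
        return lam_min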

The following result guarantees the existence of a sequence which satisfies (2.8) and (2.9); we do not prove it.

Theorem 2.3 Let f : R^n → R^1 satisfy the assumptions of Proposition 2.2 and consider an iteration of the form (2.7), where the search directions p^k satisfy [∇f(x^(k))]^T p^k < 0. Then there is a sequence {λ_k} which satisfies (2.8) and (2.9), and

    lim_{k→∞} [∇f(x^(k))]^T p^k / ‖p^k‖ = 0 .        (2.10)

For many iterations, (2.10) implies that {‖∇f(x^(k))‖} converges to zero; it is only necessary to show that the angle between p^k and ∇f(x^(k)) stays bounded away from 90°. For example, if p^k = −∇f(x^(k)) or, more generally, if p^k = −B_k^{-1}∇f(x^(k)) where {B_k} is a sequence of symmetric, positive definite matrices with uniformly bounded condition numbers, then

    −[∇f(x^(k))]^T p^k / ‖p^k‖ ≥ µ ‖∇f(x^(k))‖        (2.11)

where 1/µ is an upper bound on the condition numbers of the B_k. Hence (2.10) ensures that {‖∇f(x^(k))‖} converges to zero.

For our final result we assume that the vectors p^k converge in direction and length to the Newton step, and show that λ_k = 1 will eventually satisfy (2.8) and (2.9).

Theorem 2.4 Let f : R^n → R^1 be twice continuously differentiable in an open set D and consider the iteration (2.7), where [∇f(x^(k))]^T p^k < 0 and λ_k is chosen to satisfy (2.8) and (2.9). If {x^(k)} converges to a point x∗ ∈ D at which ∇²f(x∗) is positive definite, and if

    lim_{k→∞} ‖∇f(x^(k)) + ∇²f(x^(k)) p^k‖ / ‖p^k‖ = 0 ,        (2.12)

then there is an index k_0 ≥ 0 such that λ_k = 1 is admissible for k ≥ k_0. Moreover, ∇f(x∗) = 0 and {x^(k)} converges superlinearly to x∗. □

2.2 Quasi-Newton methods for unconstrained minimization

For unconstrained minimization problems we want the quasi-Newton step, given by −B_k^{-1}∇f(x^(k)), to define a descent direction. In fact, the most widespread use of update methods is in conjunction with iterations of the form (2.5). In this context the update formula should generate a sequence of symmetric positive definite matrices {B_k} such that B_k resembles ∇²f(x^(k)), at least when x^(k) is near a local minimizer of f. We will consider such updates below. First we consider the role symmetry plays in the quasi-Newton equation. Throughout this section we assume f : R^n → R^1 to be twice differentiable in an open convex set D, that we have an approximation B to ∇²f(x) for some x ∈ D, and that we are given a direction s such that x̄ = x + s belongs to D. What we seek is a good approximation B̄ to ∇²f(x̄).

Since the Hessian matrix is symmetric, we want the update formula to have the property of inheriting symmetry, i.e., B symmetric should imply B̄ symmetric. Furthermore, because we are trying to approximate the Hessian, arguments similar to those in Section 1.7 lead us to require that B̄ satisfy

    B̄s = y = ∇f(x̄) − ∇f(x) .        (2.13)

Note that (2.13) is just the quasi-Newton equation (1.95) for F = ∇f. The first question to ask is whether it is possible to satisfy the symmetry requirement and (2.13) with a rank-one update formula. To see whether this is possible, we first note that the general single-rank update that satisfies the quasi-Newton equation (2.13) is given by

    B̄ = B + (y − Bs)c^T / (c^T s)        (2.14)

for c ∈ R^n with c^T s ≠ 0. If B̄ is also to satisfy the symmetry requirement, then it is easy to show that

    B̄ = B + (y − Bs)(y − Bs)^T / ( (y − Bs)^T s )        (2.15)

is the only solution, provided that (y − Bs)^T s ≠ 0. If y = Bs, then B̄ = B is the solution, while if y ≠ Bs but (y − Bs)^T s = 0, then there is no solution. The symmetric single-rank formula (2.15) is due to Davidon. If H = B^{-1} and H̄ = B̄^{-1} both exist and B is symmetric, then the inverse relation

    H̄ = H + (s − Hy)(s − Hy)^T / ( (s − Hy)^T y )        (2.16)

holds. We then have the following remarkable result when (2.16) is applied to a quadratic functional.


Proposition 2.5 Let A ∈ L(R^n) be a nonsingular symmetric matrix and set y^k = As^k for 0 ≤ k ≤ m, where {s^0, . . . , s^m} spans R^n. Let H_0 be symmetric and, for k = 0, . . . , m, generate the matrices

    H_{k+1} = H_k + ( s^k − H_k y^k )( s^k − H_k y^k )^T / ( (s^k − H_k y^k)^T y^k )        (2.17)

where it is assumed that

    ( s^k − H_k y^k )^T y^k ≠ 0 .        (2.18)

Then H_{m+1} = A^{-1}.

Proof. (Sketch) One verifies by induction that H_k y^j = s^j, 0 ≤ j < k, for k = 1, . . . , m + 1. Once this is done, H_{m+1} y^j = H_{m+1} A s^j = s^j, 0 ≤ j ≤ m, and the result follows from the assumption that {s^0, . . . , s^m} spans R^n. □

The gist of Proposition 2.5 lies in the fact that if we have an iteration of the form x^(k+1) = x^(k) + s^k and (2.18) holds, then the use of (2.17) allows one to minimize a quadratic functional in a finite number of steps. Unfortunately, there is no guarantee that (2.18) will hold. However, it can be shown that if A^{-1} − H_0 is semidefinite and if {H_k} is generated by (2.17) when (2.18) holds, and by H_{k+1} = H_k otherwise, then H_{m+1} = A^{-1}.

The fact that the vectors s − Hy and y can be orthogonal forces a certain amount of numerical instability on the symmetric single-rank method. This has led to several improvements in the basic algorithm, and in some of its modified forms the algorithm has been quite successful. On the other hand, the numerical difficulties of the symmetric single-rank formula have led to a widespread study of a whole class of updates which satisfy symmetry and (2.13). In particular, we now consider a version of Broyden's method in which we update by two rank-one matrices; for this reason it is sometimes called a rank-two update method.

We begin with a symmetric B ∈ L(R^n) and consider

    C_1 = B + (y − Bs)c^T / (c^T s)

as a possible candidate for B̄. In general C_1 is not symmetric, so we consider C_2 = (C_1 + C_1^T)/2. However, since C_2 does not satisfy the quasi-Newton equation, we repeat the process. In this way a sequence {C_k} is generated by

    C_{2k+1} = C_{2k} + (y − C_{2k} s)c^T / (c^T s),   k = 0, 1, 2, . . .
    C_{2k+2} = ( C_{2k+1} + C_{2k+1}^T ) / 2        (2.19)

where C_0 = B. It turns out that {C_k} has a limit B̄ given by

    B̄ = B + [ (y − Bs)c^T + c(y − Bs)^T ] / (c^T s) − [ (y − Bs)^T s / (c^T s)^2 ] cc^T        (2.20)


and it is clear that this update satisfies the symmetry condition and (2.13). We summarize in the following result.

Lemma 2.6 Let B ∈ L(R^n) be symmetric and let c, s, and y be in R^n with c^T s ≠ 0. If the sequence {C_k} is defined by (2.19) with C_0 = B, then {C_k} converges to B̄ as defined by (2.20).

Proof. (Sketch) If G_k = C_{2k}, then (2.19) implies that

    G_{k+1} = G_k + (1/2) [ w^k c^T + c(w^k)^T ] / (c^T s)        (2.21)

where w^k = y − G_k s. In particular, w^{k+1} = P w^k, where P = (1/2)[ I − cs^T/(c^T s) ]. P has one zero eigenvalue and all its other eigenvalues equal to 1/2. It can then be shown that

    Σ_{k=0}^∞ w^k = Σ_{k=0}^∞ P^k (y − Bs) = (I − P)^{-1} (y − Bs) .        (2.22)

Since lim_{k→∞} G_k = B + Σ_{k=0}^∞ ( G_{k+1} − G_k ), it follows from (2.21) and (2.22) that {G_k} converges. Moreover, since Lemma 1.24 of Chapter 1 shows that (I − P)^{-1} = 2[ I − (1/2) cs^T/(c^T s) ], (2.21) and (2.22) also imply that the limit of {G_k} is B̄ as defined by (2.20). □

Once c is chosen, (2.20) is a rank-two update which satisfies the symmetry condition and (2.13). Before looking at special cases of (2.20), we show that this update solves a problem similar to the one specified in Proposition 1.22 of Chapter 1.

Proposition 2.7 Let B ∈ L(R^n) be symmetric, and let c, s, and y be in R^n with c^T s > 0. Assume that M ∈ L(R^n) is any nonsingular, symmetric matrix such that Mc = M^{-1}s. Then B̄ as defined by (2.20) is the unique solution to the problem

    min { ‖B̂ − B‖_{M,F} : B̂ symmetric, B̂s = y }        (2.23)

where ‖A‖_{M,F} = ‖MAM‖_F .

Proof. Let B̂ be any symmetric matrix such that B̂s = y, and pre- and post-multiply (2.20) by M. With z = Mc = M^{-1}s, it follows that

    E = ( Ē zz^T + zz^T Ē ) / (z^T z) − ( z^T Ē z / (z^T z)^2 ) zz^T

where E = M(B̄ − B)M and Ē = M(B̂ − B)M. Now it is clear that Ez = Ēz and that if v is orthogonal to z then ‖Ev‖ ≤ ‖Ēv‖. Thus ‖E‖_F ≤ ‖Ē‖_F, as desired. To show uniqueness, just note that the mapping f : L(R^n) → R^1 defined by f(A) = ‖A − B‖_{M,F} is strictly convex on the convex set of symmetric matrices B̂ such that B̂s = y. □

A minor modification of the above proof shows that the solution of the problem min{ ‖B̂ − B‖_{M,F} : B̂s = y } (without the symmetry constraint) is given by (2.14).

If in (2.20) we choose c = s, the underlying single-rank method is Broyden's, and the resulting double-rank formula is often called the Powell symmetric Broyden update, or PSB update:

    B_PSB = B + [ (y − Bs)s^T + s(y − Bs)^T ] / (s^T s) − [ (y − Bs)^T s / (s^T s)^2 ] ss^T .        (2.24)

Proposition 2.7 implies that B_PSB is the unique solution to the problem min{ ‖B̂ − B‖_F : B̂ symmetric, B̂s = y }, which is reminiscent of Proposition 1.22 in Chapter 1. Because of this property it follows that if A is any symmetric matrix with y = As, then ‖B − A‖_F^2 = ‖B_PSB − B‖_F^2 + ‖B_PSB − A‖_F^2. These considerations lead us to believe that B_PSB is a good approximation to the Hessian. To further justify this claim, note that (2.24) implies that for any symmetric A and B in L(R^n),

    B_PSB − A = P^T (B − A) P + [ (y − As)s^T + s(y − As)^T P ] / (s^T s),   where P = I − ss^T/(s^T s) .

Then, since ‖AB‖_F ≤ min{ ‖A‖_2 ‖B‖_F , ‖A‖_F ‖B‖_2 }, we have that ‖B_PSB − A‖_F ≤ ‖B − A‖_F + 2‖y − As‖/‖s‖. If A = ∇²f(x) and ∇²f is Lipschitz continuous with constant K in the open convex set D, it can be shown that ‖B_PSB − ∇²f(x)‖_F ≤ ‖B − ∇²f(x)‖_F + 3K‖s‖ whenever x and x̄ lie in D. This relationship shows that the absolute error of B_k as an approximation to ∇²f(x^(k)) grows at most linearly with ‖s^k‖, and this holds independently of the position of x in D.
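The PSB update (2.24) is a one-line computation. A Python sketch:

    import numpy as np

    def psb_update(B, s, y):
        # Powell symmetric Broyden update (2.24); returns the symmetric matrix
        # closest to B in the Frobenius norm with (new B) s = y
        r = y - B @ s
        ss = s @ s
        return (B + (np.outer(r, s) + np.outer(s, r)) / ss
                  - (r @ s) * np.outer(s, s) / ss**2)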

We now turn to updates which, in addition to satisfying the symmetry condition and (2.13), generate positive definite matrices. We want the property of hereditary positive definiteness, i.e., B positive definite should imply B̄ positive definite. Note that if an update satisfies (2.13) and this condition, then y = B̄s and therefore y^T s > 0 whenever B̄ is positive definite. This imposes a restriction on the angle between y and s which, although not severe, must be kept in mind. In fact, if [∇f(x)]^T s < 0, then y^T s > 0 is equivalent to the existence of a β ∈ (0, 1) such that [∇f(x̄)]^T s ≥ β[∇f(x)]^T s. For this reason the requirement (2.9) is a very natural one for quasi-Newton methods.

To investigate the property of hereditary positive definiteness, we will use the following result from the perturbation theory of symmetric matrices.

Lemma 2.8 Let A ∈ L(R^n) be symmetric with eigenvalues λ_1 ≤ λ_2 ≤ ··· ≤ λ_n, and let Ā = A + σuu^T for some u ∈ R^n. If σ > 0, then Ā has eigenvalues λ̄_i such that λ_1 ≤ λ̄_1 ≤ λ_2 ≤ λ̄_2 ≤ ··· ≤ λ_n ≤ λ̄_n, while if σ ≤ 0 the eigenvalues can be arranged so that λ̄_1 ≤ λ_1 ≤ λ̄_2 ≤ λ_2 ≤ ··· ≤ λ̄_n ≤ λ_n. □

Lemma 2.8 and the next two results will lead us to a choice of c in (2.20) which naturally satisfies the positive definiteness requirement.


Proposition 2.9 Let B ∈ L(R^n) be symmetric and positive definite, and let c, s, and y be in R^n with c^T s ≠ 0. Then B̄ as defined by (2.20) is positive definite if and only if det B̄ > 0.

Proof. If B̄ is positive definite, then clearly det B̄ > 0. For the converse, note that we can write B̄ = B + vw^T + wv^T, where w = c and

    v = (y − Bs)/(c^T s) − (1/2) [ (y − Bs)^T s / (c^T s)^2 ] c .

Therefore B̄ = B + (1/2)[ (v + w)(v + w)^T − (v − w)(v − w)^T ], and thus we have written B̄ as B plus a sum of two symmetric rank-one matrices, one added and one subtracted. If B is positive definite, then Lemma 2.8 implies that B̄ can have at most one nonpositive eigenvalue. Therefore, if det B̄ > 0, then all the eigenvalues must be positive, and B̄ is positive definite. □

In view of the above proposition, the symmetry and positive definiteness conditions for the updates defined by (2.20) require that if B is symmetric and positive definite, then det B̄ > 0. To see which choices of c satisfy this requirement we need an expression for det B̄, which we state without proof.

Lemma 2.10 Let u^i ∈ R^n for i = 1, 2, 3, 4. Then

    det( I + u^1 (u^2)^T + u^3 (u^4)^T ) = ( 1 + (u^1)^T u^2 )( 1 + (u^3)^T u^4 ) − ( (u^1)^T u^4 )( (u^2)^T u^3 ) .

We now apply Lemma 2.10 to (2.20). After some algebra it follows that

    det B̄ = det B · [ (c^T Hy)^2 − (c^T Hc)(y^T Hy) + (c^T Hc)(y^T s) ] / (c^T s)^2        (2.25)

where H = B^{-1}. If we assume that B is positive definite and let v = H^{1/2} y and w = H^{1/2} c, then

    det B̄ = det B · [ (v^T w)^2 − ‖v‖^2 ‖w‖^2 + ‖w‖^2 (y^T s) ] / (c^T s)^2 ,        (2.26)

and Proposition 2.9 implies that B̄ is positive definite if and only if

    ‖w‖^2 y^T s > ‖v‖^2 ‖w‖^2 − (v^T w)^2 .        (2.27)

The most natural way of satisfying (2.27) is to choose w to be a multiple of v, so that (2.27) only requires that y^T s be positive. In this case c is a multiple of y, and (2.20) then reduces to an update introduced by Davidon and later clarified and improved by Fletcher and Powell. The DFP update is given by

    B_DFP = B + [ (y − Bs)y^T + y(y − Bs)^T ] / (y^T s) − [ (y − Bs)^T s / (y^T s)^2 ] yy^T
          = ( I − ys^T/(y^T s) ) B ( I − sy^T/(y^T s) ) + yy^T/(y^T s) .        (2.28)


Some of its properties are given in the following result; first note that the underlying single-rank formula is given by (2.14) with c a multiple of y.

Proposition 2.11 Let B ∈ L(R^n) be a nonsingular, symmetric matrix and define B_DFP ∈ L(R^n) by (2.28) for any vectors y and s in R^n with y^T s ≠ 0. Then B_DFP is nonsingular if and only if y^T Hy ≠ 0, where H = B^{-1}. If B_DFP is nonsingular, then H_DFP = B_DFP^{-1} can be expressed as

    H_DFP = H + ss^T / (s^T y) − H y y^T H / (y^T Hy) .        (2.29)

Furthermore, if B is positive definite, then B_DFP is positive definite if and only if y^T s > 0.

Proof. Recall that for the DFP update c is a multiple of y, so that (2.26) reduces to

    det B_DFP = det B · ( y^T Hy ) / ( y^T s ) .        (2.30)

Thus B_DFP is nonsingular if and only if y^T Hy ≠ 0. To verify that H_DFP is given by (2.29) one need only show that H_DFP B_DFP = I. Finally, assume that B is positive definite. If y^T s is positive, then (2.30) implies that B_DFP is positive definite. Conversely, if B_DFP is positive definite, then y^T s = (B_DFP s)^T s > 0, which is the desired result. □

One way to use the DFP update to generate a quasi-Newton direction and only use $O(n^2)$ arithmetic operations per iteration would be to generate $B_k^{-1} = H_k$ via equation (2.29). Another approach is based on the fact that if $A$ is positive definite and $A = LL^T$ for some lower triangular matrix $L$, then the corresponding decomposition of $\bar A = A + \alpha zz^T$ can be obtained in $O(n^2)$ operations provided $\bar A$ is positive definite. That this applies to (2.28) follows from the proof of Proposition 2.9, which shows that (2.28) can be written as $\bar B_{DFP} = B + \frac{1}{2}z_1z_1^T - \frac{1}{2}z_2z_2^T$ where $z_1$ and $z_2$ are linear combinations of $Bs$ and $y$. If the DFP update is used in a method of the form (2.5), then the advantage of the latter approach is that (2.28) requires no matrix-vector products.
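As an illustration of the first approach, here is a minimal NumPy sketch of the inverse update (2.29); the function name dfp_inverse_update is our own, and no safeguards (e.g., for $y^THy \approx 0$) are included.

    import numpy as np

    def dfp_inverse_update(H, s, y):
        """Inverse DFP update (2.29): H + s s^T/(s^T y) - H y y^T H/(y^T H y).

        Uses only outer products and one matrix-vector product, so the
        quasi-Newton direction -H grad f costs O(n^2) work per iteration."""
        Hy = H @ y
        return H + np.outer(s, s) / (s @ y) - np.outer(Hy, Hy) / (y @ Hy)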

We note that the matrices generated by the DFP formula are good approximations to the Hessian. In fact, it can be shown that if $\|s\|$ is small then the relative error $\|\bar B_{DFP} - \nabla^2 f(\bar x)\|/\|\nabla^2 f(\bar x)\|$ of $\bar B_{DFP}$ as an approximation to a positive definite $\nabla^2 f(\bar x)$ cannot be much larger than the corresponding error in $B$. Moreover, the possible increase in this error is governed by a relative measure of how much $f$ differs from a quadratic on $D$.

So far we have been thinking in terms of obtaining an approximation to the Hessian; we now consider trying to approximate the inverse of the Hessian, i.e., developing inverse updates. We assume that we have an approximation $H$ to $[\nabla^2 f(x)]^{-1}$ and we try to obtain a good approximation $\bar H$ to $[\nabla^2 f(\bar x)]^{-1}$ where $\bar x = x + s$. For inverse updates, the analogue of the quasi-Newton equation is
\[
\bar Hy = s \tag{2.31}
\]
and therefore, the general single-rank formula which satisfies (2.31) is
\[
\bar H = H + \frac{(s - Hy)d^T}{d^Ty} \tag{2.32}
\]
for any $d \in \mathbb{R}^n$ with $d^Ty \ne 0$. Let us examine the relation between (2.14) and (2.32). If Lemma xx of Chapter 1 (the Sherman-Morrison formula) is applied to (2.14) we obtain
\[
\bar B^{-1} = B^{-1} + \frac{(s - B^{-1}y)c^TB^{-1}}{c^TB^{-1}y}\,. \tag{2.33}
\]

Therefore, (2.14) and (2.32) represent the same update if $c = B^Td$.
We now study the property of hereditary symmetry of $\bar H$. It can be easily verified that the only single-rank formula which satisfies (2.31) and this symmetry condition is the symmetric single-rank formula (2.16). To obtain other inverse updates which satisfy (2.31) and the symmetry condition we follow the symmetrization argument of Lemma 2.6 to obtain

\[
\bar H = H + \frac{(s - Hy)d^T + d(s - Hy)^T}{d^Ty} - \frac{(s - Hy)^Ty}{(d^Ty)^2}\,dd^T\,. \tag{2.34}
\]
Note that if $\bar B$ and $\bar H$ are defined by (2.20) and (2.34), respectively, then in general $\bar B\bar H \ne I$ even if $B$ is symmetric, $BH = I$ and $c = Bd$. The reason for this is that the symmetrization and inversion operations do not commute. It is also possible to prove an analogue of Proposition 2.7 for the updates (2.34).

The most important property of (2.34) is the one which is obtained by asking for the update to have the property of hereditary positive definiteness; this can be accomplished by choosing $d = s$ in (2.34). The resulting update is the BFGS (Broyden-Fletcher-Goldfarb-Shanno) update

\[
\bar H_{BFGS} = \left(I - \frac{sy^T}{y^Ts}\right)H\left(I - \frac{ys^T}{y^Ts}\right) + \frac{ss^T}{y^Ts}\,. \tag{2.35}
\]

The BFGS update is sometimes called the complementary DFP update; the underlying single-rank method is (2.32) with $d = s$. In many circles, the BFGS update is considered the best available update formula for use in unconstrained minimization. The following analogue of Proposition 2.11 holds.

Proposition 2.12 Let $H \in L(\mathbb{R}^n)$ be a nonsingular symmetric matrix and define $\bar H_{BFGS} \in L(\mathbb{R}^n)$ by (2.35) for any vectors $y$ and $s$ in $\mathbb{R}^n$ with $y^Ts \ne 0$. Then $\bar H_{BFGS}$ is nonsingular if and only if $s^TBs \ne 0$, where $B = H^{-1}$. If $\bar H_{BFGS}$ is nonsingular, then $\bar B_{BFGS} = (\bar H_{BFGS})^{-1}$ can be expressed as
\[
\bar B_{BFGS} = B + \frac{yy^T}{y^Ts} - \frac{Bss^TB}{s^TBs}\,.
\]
Furthermore, if $H$ is positive definite, then $\bar H_{BFGS}$ is positive definite if and only if $y^Ts > 0$. $\Box$
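In code, the BFGS update (2.35) is equally simple; the sketch below (our naming, NumPy assumed) expands the product form of (2.35) into outer products so that only $O(n^2)$ work is done per update.

    import numpy as np

    def bfgs_inverse_update(H, s, y):
        """BFGS update (2.35): (I - s y^T/y^T s) H (I - y s^T/y^T s) + s s^T/y^T s.

        By Proposition 2.12, positive definiteness is inherited when y^T s > 0."""
        rho = 1.0 / (y @ s)
        Hy = H @ y
        return (H - rho * (np.outer(s, Hy) + np.outer(Hy, s))
                + (rho * rho * (y @ Hy) + rho) * np.outer(s, s))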


Our remark above about the behavior of $\bar B_{DFP}$ as a relative approximation to the Hessian holds for $\bar H_{BFGS}$ as a relative approximation to the inverse Hessian. We also note that if $H$ is positive definite, then
\[
\bar H_{BFGS} = \bar H_{DFP} + vv^T \tag{2.36}
\]
where $v$ is the vector
\[
v = \sqrt{y^THy}\left(\frac{s}{s^Ty} - \frac{Hy}{y^THy}\right), \tag{2.37}
\]
while if $B$ is positive definite, then
\[
\bar B_{DFP} = \bar B_{BFGS} + ww^T \tag{2.38}
\]
where $w$ is the vector
\[
w = \sqrt{s^TBs}\left(\frac{y}{s^Ty} - \frac{Bs}{s^TBs}\right). \tag{2.39}
\]

By virtue of Lemma 2.8, (2.36) and (2.38) imply that the eigenvalues of $\bar H_{BFGS}$ (respectively, $\bar B_{BFGS}$) are larger (smaller) than the eigenvalues of $\bar H_{DFP}$ ($\bar B_{DFP}$). However, there is no known relationship between the condition numbers of $\bar H_{BFGS}$ and $\bar H_{DFP}$. It is also useful to note that the DFP and BFGS updates are related by the transformations
\[
s \leftrightarrow y\,, \qquad B \leftrightarrow H\,, \qquad \bar B \leftrightarrow \bar H\,. \tag{2.40}
\]

We close by noting that the DFP, BFGS and PSB updates, used in conjunction with iterations of the form
\[
x^{(k+1)} = x^{(k)} - \lambda_k H_k \nabla f(x^{(k)})\,, \qquad k = 0, 1, 2, \ldots\,,
\]
can be shown to be locally and superlinearly convergent under appropriate conditions on $f$. For the DFP and BFGS updates, these requirements include that $\nabla^2 f(x^*)$ be positive definite and that $\|\nabla^2 f(x) - \nabla^2 f(x^*)\| \le K\|x - x^*\|$ for $x$ in a convex set $D$. Here $x^*$ is a point in $D$ such that $\nabla f(x^*) = 0$. For the PSB update, $\nabla^2 f(x^*)$ need not be positive definite.
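To make the discussion concrete, here is a sketch of one such iteration built on the bfgs_inverse_update function above; the backtracking rule for $\lambda_k$ and all tolerances are illustrative choices of ours, not prescriptions from the text.

    import numpy as np

    def quasi_newton(f, grad, x0, tol=1e-8, max_iter=200):
        """Iterate x_{k+1} = x_k - lam_k H_k grad f(x_k), updating H_k by BFGS."""
        x = np.asarray(x0, dtype=float)
        H = np.eye(x.size)               # initial inverse-Hessian approximation
        g = grad(x)
        for _ in range(max_iter):
            if np.linalg.norm(g) < tol:
                break
            d = -H @ g                   # quasi-Newton direction
            lam = 1.0
            while f(x + lam * d) > f(x) + 1e-4 * lam * (g @ d) and lam > 1e-12:
                lam *= 0.5               # backtracking choice of lam_k
            s = lam * d
            g_new = grad(x + s)
            y = g_new - g
            if y @ s > 0:                # preserve positive definiteness of H
                H = bfgs_inverse_update(H, s, y)
            x, g = x + s, g_new
        return x

For example, quasi_newton(lambda z: (z**2).sum(), lambda z: 2*z, np.ones(3)) recovers the minimizer at the origin in a few iterations.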

2.3 Nonlinear least squares

We now turn to the nonlinear least squares problem
\[
\min \phi(x) \quad\text{where}\quad \phi(x) = \frac{1}{2}\sum_{i=1}^m [f_i(x)]^2\,, \tag{2.41}
\]
where $f_i : \mathbb{R}^n \to \mathbb{R}^1$. When $f_i = b_i - \sum_{j=1}^n a_{i,j}x_j$ we have a linear least squares problem, while if $m = n$ we have an unconstrained minimization problem of the type of Sections 2.1 and 2.2. When $\phi(x)$ is sufficiently differentiable we can produce a Taylor expansion of $\phi$ about $x$. This is given by
\[
\phi(x + y) = \phi(x) + [\nabla\phi(x)]^Ty + \frac{1}{2}y^T[\nabla^2\phi(x)]y + O(\|y\|^3) \tag{2.42}
\]
where
\[
\nabla\phi(x) = [J(x)]^TF(x)\,, \qquad \nabla^2\phi(x) = [J(x)]^TJ(x) + \sum_{i=1}^m f_i(x)\nabla^2 f_i(x) \tag{2.43}
\]
where $[J(x)]_{ij} = \partial f_i/\partial x_j$, $1 \le i \le m$, $1 \le j \le n$, and $F(x) = [f_1, \ldots, f_m]^T$. Here $F$ is an element of $\mathbb{R}^m$ and $J$ is an $m \times n$ matrix. We may rewrite (2.42) as
\[
\phi(x + y) = \frac{1}{2}F^TF + F^TJy + \frac{1}{2}y^TJ^TJy + \frac{1}{2}\sum_{i=1}^m f_i\,y^T\nabla^2 f_i\,y + O(\|y\|^3) \tag{2.44}
\]

where $F = F(x)$ and $J = J(x)$.
We can deduce two basic strategies for minimizing $\phi(x)$ from the expansion (2.44). Looking at $\phi(x + y) \approx \phi(x) + [\nabla\phi(x)]^Ty + \frac{1}{2}y^T\nabla^2\phi(x)y$, we ask what $y$ makes the right-hand side as small as possible. If we minimize with respect to $y$, we find that the minimizing $y$ satisfies
\[
[\nabla^2\phi(x)]y = -\nabla\phi(x)\,, \tag{2.45}
\]
a symmetric system of linear equations. Hopefully, the new point $x + y$ renders a smaller sum of squares than $x$. The process can obviously be repeated and the result is the Newton iteration
\[
x^{(k+1)} = x^{(k)} + y_k \quad\text{where } y_k \text{ solves } [\nabla^2\phi(x^{(k)})]y = -\nabla\phi(x^{(k)})\,. \tag{2.46}
\]

The gradient and Hessian are given by (2.43). The expansion (2.44) can be written as
\[
\phi(x + y) \approx \frac{1}{2}\|Jy + F\|^2 + \frac{1}{2}y^T\left[\sum_{i=1}^m f_i(x)\nabla^2 f_i(x)\right]y\,. \tag{2.47}
\]

For the Gauss-Newton method it is argued that a good correction $y$ is obtained by minimizing $\|Jy + F\|$, a linear least squares problem. This idea leads to the iteration
\[
x^{(k+1)} = x^{(k)} + y_k \quad\text{where } y_k \text{ minimizes } \|J(x^{(k)})y + F(x^{(k)})\|\,. \tag{2.48}
\]

The Gauss-Newton step $y_k = -J(x^{(k)})^\dagger F(x^{(k)})$ is a descent direction as long as neither $F(x^{(k)})$ nor $\nabla\phi(x^{(k)})$ is zero. (Here $A^\dagger$ denotes the pseudo-inverse of $A$.) The Newton step is not necessarily a descent direction unless the Hessian is positive definite. For both the Newton and Gauss-Newton iterations, the zeros of $\nabla\phi(x)$ are fixed points. These zeros are stationary values of $\phi$; a local minimum is found if $\nabla\phi(x^*) = 0$ and $\nabla^2\phi(x^*) = H$ is positive semi-definite. This does not imply that a global minimum occurs at $x^*$. If we suppose that the Newton and Gauss-Newton iterations converge to $x^*$ and that the Lipschitz conditions
\[
\|\nabla^2 f_i(x) - \nabla^2 f_i(y)\| \le \kappa\|x - y\|\,, \qquad \|J(x) - J(y)\| \le \kappa\|x - y\|
\]
hold for all $x$, $y$ in a neighborhood of $x^*$, then it can be shown that the Newton iteration converges quadratically (provided $\nabla^2\phi(x^*)$ is nonsingular), while the Gauss-Newton iteration is quadratically convergent if $\phi(x^*) = 0$ but only linearly convergent if $\phi(x^*) \ne 0$. The reason for the latter comes from neglecting the term $K(x) = \sum_{i=1}^m f_i(x)\nabla^2 f_i(x)$ of the Hessian. If $\phi(x^*) = 0$ then $K(x^*) = 0$ and the Newton and Gauss-Newton iterations will be indistinguishable in the vicinity of $x^*$. However, if $\phi(x^*) \ne 0$ we can hardly expect quadratic convergence when we neglect second order terms such as $K(x)$. The distinction between the zero ($\phi(x^*) = 0$) and nonzero ($\phi(x^*) \ne 0$) residual problems figures in the local convergence behavior of the Gauss-Newton iteration, i.e., if $x^{(k)}$ is sufficiently close to $x^*$, will Gauss-Newton converge? This cannot be guaranteed if $\phi(x^*) \ne 0$. Of course, the Newton iteration is locally convergent independent of $\phi(x^*)$. However, on the practical side, the Newton method involves many more partial derivative evaluations per step than Gauss-Newton. Furthermore, one has to worry about singular Hessians, whereas the least squares problem in the Gauss-Newton method is always well-defined.

Just because the Gauss-Newton step $y_k$ is a descent direction doesn't mean that $\phi(x^{(k)} + y_k) < \phi(x^{(k)})$; we must limit how far we travel along the direction $y_k$. One way of doing this is to introduce scalars $\alpha_k$ and then iterate according to
\[
x^{(k+1)} = x^{(k)} + \alpha_k y_k \tag{2.49}
\]
where $y_k$ minimizes $\|J(x^{(k)})y + F(x^{(k)})\|$ and $\alpha_k > 0$ is chosen so that $\phi(x^{(k)} + \alpha_k y_k) < \phi(x^{(k)})$. We call (2.49) the modified Gauss-Newton method (MGN). We know that the scalar $\alpha_k$ exists because $y_k$ is a descent direction; there are many ways to find an $\alpha$ such that $\phi(x^{(k)} + \alpha y_k) < \phi(x^{(k)})$; note that this is a one-dimensional problem.
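The following sketch implements (2.49) with the simplest such strategy, repeated halving of $\alpha$ (our choice; the text does not prescribe one). It assumes callables F and J returning the residual vector and the $m \times n$ Jacobian.

    import numpy as np

    def modified_gauss_newton(F, J, x0, tol=1e-8, max_iter=100):
        """Modified Gauss-Newton (2.49): y_k minimizes ||J(x_k) y + F(x_k)||,
        and alpha_k is halved until the sum of squares decreases."""
        x = np.asarray(x0, dtype=float)
        phi = lambda z: 0.5 * (F(z) @ F(z))
        for _ in range(max_iter):
            Fx, Jx = F(x), J(x)
            if np.linalg.norm(Jx.T @ Fx) < tol:              # grad phi = J^T F
                break
            y, *_ = np.linalg.lstsq(Jx, -Fx, rcond=None)     # GN correction
            alpha = 1.0
            while phi(x + alpha * y) >= phi(x) and alpha > 1e-12:
                alpha *= 0.5
            x = x + alpha * y
        return x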

We know that the GN step $y_k = -J(x^{(k)})^\dagger F(x^{(k)})$ is a descent direction. However, when $J$ is poorly conditioned it is easily possible that the GN direction is very nearly perpendicular to the negative gradient (the path of steepest descent). Of course, it cannot be exactly perpendicular since it is locally a descent direction, but the near perpendicularity leads to very mild descent and therefore slow convergence. The Levenberg-Marquardt (LM) algorithm attempts to overcome the limitations of the GN step; the LM iteration is defined by
\[
x^{(k+1)} = x^{(k)} + y_k \quad\text{where } y_k \text{ minimizes } \|J(x^{(k)})y + F(x^{(k)})\|^2 + \lambda_k^2\|y\|^2\,, \quad \lambda_k > 0\,. \tag{2.50}
\]


Let us study the effect of the $\lambda$'s. Denote the solution of
\[
\min_y\ \|Jy + F\|^2 + \lambda^2\|y\|^2 = \min_y \left\|\begin{pmatrix} J \\ \lambda I \end{pmatrix}y + \begin{pmatrix} F \\ 0 \end{pmatrix}\right\|^2
\]
by $y(\lambda)$. Of course $y(0)$ is the GN correction. If $\lambda > 0$, then it is not hard to see that $\begin{pmatrix} J \\ \lambda I \end{pmatrix}$ has full rank. In fact, if $\mu_1 \ge \cdots \ge \mu_n$ are the singular values of $J$, then the condition number of $\begin{pmatrix} J \\ \lambda I \end{pmatrix}$ equals $\sqrt{(\mu_1^2 + \lambda^2)/(\mu_n^2 + \lambda^2)}$. It is clear that the condition of the augmented matrix decreases with increasing $\lambda$. This tempts one to determine $y(\lambda)$ via the normal equations
\[
(J^TJ + \lambda^2 I)\,y(\lambda) = -J^TF \qquad\text{or}\qquad y(\lambda) = -[J^TJ + \lambda^2 I]^{-1}J^TF\,.
\]

From this formula for $y(\lambda)$ one can deduce that
\[
\begin{aligned}
&\text{(i)}\ \ y(\lambda) \text{ is a descent direction;}\\
&\text{(ii)}\ \lim_{\lambda\to\infty} \|y(\lambda)\| = 0;\\
&\text{(iii)}\ \lim_{\lambda\to\infty} \frac{y(\lambda)}{\|y(\lambda)\|} = -\frac{J^TF}{\|J^TF\|}\,, \text{ the negative gradient direction.}
\end{aligned} \tag{2.51}
\]
A Lagrange multiplier argument also shows that if $\|y(\lambda)\| = \beta$, then $y(\lambda)$ solves $\min_{\|y\|=\beta} \|Jy + F\|$, i.e., $y(\lambda)$ is the best $y$ of length $\beta$.
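Computationally, the discussion above suggests forming $y(\lambda)$ from the augmented least squares problem rather than the normal equations; the sketch below (our naming, NumPy assumed) does exactly that.

    import numpy as np

    def lm_step(J, F, lam):
        """y(lam): minimize ||J y + F||^2 + lam^2 ||y||^2 by stacking
        (J; lam I) against (F; 0), which is better conditioned than
        forming the normal equations with J^T J + lam^2 I."""
        n = J.shape[1]
        A = np.vstack([J, lam * np.eye(n)])
        b = -np.concatenate([F, np.zeros(n)])
        y, *_ = np.linalg.lstsq(A, b, rcond=None)
        return y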

If the swift minimization of $\phi$ is our goal, we must be sensible in the selection of the sequence $\{\lambda_k\}$. One choice is $\lambda_k = \nu^p\lambda_{k-1}$ where $\nu$ is a fixed constant greater than one and $p$ is chosen to be the smallest integer greater than or equal to $-1$ for which $\phi[x^{(k)} + y(\nu^p\lambda_{k-1})] < \phi(x^{(k)})$. Equation (2.51) guarantees that such a $p$ exists. This process is expensive because each $y(\lambda)$ costs about $mn^2$ operations; however, it has been argued that this choice of $\lambda_k$ allows for nearly optimal interpolation between steepest descent ($\lambda = \infty$) and GN ($\lambda = 0$).

A second way of choosing $\lambda_k$ in the LM method is to use the current residual $\|F(x^{(k)})\|$, e.g., $\lambda_k^2 = c\|F(x^{(k)})\|$ where $c = 10$ if $10 \le \|F(x^{(k)})\|$, $c = 1$ if $1 \le \|F(x^{(k)})\| < 10$, and $c = .01$ if $\|F(x^{(k)})\| < 1$. Behind this criterion is the result that the LM iteration has the same local convergence properties as GN provided $\lambda_k^2 = O(\|F(x^{(k)})\|)$. A third method of choosing $\lambda_k$ is to use the ratio
\[
r = \frac{\phi(x^{(k)}) - \phi(x^{(k)} + y_k)}{\phi(x^{(k)}) - \frac{1}{2}\|J(x^{(k)})y_k + F(x^{(k)})\|^2} = \frac{\text{actual reduction in }\phi}{\text{predicted reduction in }\phi}\,.
\]
When $r$ is greater than, say, .75, the linear model is good and a reduction in $\lambda_k$ is called for; when $r$ is less than .25, the linear prediction is poor and $\lambda_k$ is increased. Otherwise, $\lambda_k$ is judged all right. When we say that the linear model is good we mean that the higher order terms in the Taylor expansion which GN neglects are inconsequential. A fourth way of choosing the LM step is to ask $y_k$ to solve $\min_{\|y\|\le\alpha_k} \|J(x^{(k)})y + F(x^{(k)})\|$. We won't detail how the $\alpha_k$ are selected; we simply note that the connection with the LM algorithm is found in the remark which immediately follows (2.51).
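The third (gain ratio) strategy translates directly into code; the sketch below is ours (NumPy assumed), with the thresholds .75 and .25 taken from the text and the factors of two being illustrative choices.

    import numpy as np

    def gain_ratio(phi, x, y, Jx, Fx):
        """r = [phi(x) - phi(x + y)] / [phi(x) - 0.5 ||J(x) y + F(x)||^2]."""
        predicted = phi(x) - 0.5 * np.linalg.norm(Jx @ y + Fx) ** 2
        return (phi(x) - phi(x + y)) / predicted

    def update_lambda(lam, r):
        """r > .75: linear model good, decrease lam; r < .25: poor, increase."""
        if r > 0.75:
            return 0.5 * lam
        if r < 0.25:
            return 2.0 * lam
        return lam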

The LM method requires the setting up and solution of a linear least squares problem. The setting up requires the evaluation of the $mn$ partial derivatives which make up $J$. As usual, we can circumvent the need for taking derivatives by using difference quotients. For example, the derivative free Levenberg-Marquardt (DFLM) method given by
\[
\begin{aligned}
&x^{(k+1)} = x^{(k)} + y_k \quad\text{where } y_k \text{ minimizes } \|Gy + F\|^2 + \lambda_k^2\|y\|^2\,,\ \lambda_k > 0\,,\\
&g_{ij} = [f_i(x^{(k)} + h_je^{(j)}) - f_i(x^{(k)})]/h_j\,, \qquad h_j = \min\{\|F(x^{(k)})\|, \delta_j\}\,,\\
&\delta_j \text{ a constant based upon } |x_j^{(k)}| \text{ and the machine precision,}
\end{aligned} \tag{2.52}
\]
is found to succeed or fail together with the LM method. After $J$ is evaluated or approximated, we still must solve the linear least squares problem involving the matrix $\begin{pmatrix} J \\ \lambda I \end{pmatrix}$. Because of the special structure of this matrix, this step requires little more work than just the QR factorization of $J$. However, for either LM or DFLM, the volume of computation per step is cubic. Furthermore, the LM iteration often has trouble on large residual problems. We therefore once again turn to quasi-Newton or update methods. Some of our considerations will repeat previous discussions; we present them here again for the sake of completeness, and also to include mappings from $\mathbb{R}^n$ to $\mathbb{R}^m$.

We begin with the following results, which we present without proof.

Proposition 2.13 If $B \in L(\mathbb{R}^m, \mathbb{R}^n)$, $y \in \mathbb{R}^m$ and $s \in \mathbb{R}^n$, $s \ne 0$, then
\[
E = \frac{(y - Bs)s^T}{s^Ts} \quad\text{solves}\quad \min_{(B+E)s=y} \|E\|_F\,.
\]

Proposition 2.14 If $B \in L(\mathbb{R}^n)$, $B^T = B$, $y \in \mathbb{R}^n$ and $s \in \mathbb{R}^n$, $s \ne 0$, then
\[
E = \frac{(y - Bs)s^T + s(y - Bs)^T}{s^Ts} - \frac{(y - Bs)^Ts}{(s^Ts)^2}\,ss^T \quad\text{solves}\quad \min \|E\|_F
\]
over all $E$ such that $E^T = E$ and $(B + E)s = y$.

The last result indicates that $\bar B = B + E$ is the closest symmetric matrix to $B$ which sends $s$ to $y$. Again notice that $\bar Bz = Bz$ if $z^Ts = 0$.

Now let $g : \mathbb{R}^n \to \mathbb{R}^m$ be twice differentiable, so that it has the Taylor expansion about $x$
\[
g(\bar x) = g(x) + T(x)(\bar x - x) + O(\|\bar x - x\|^2)
\]
where $T(x) \in L(\mathbb{R}^m, \mathbb{R}^n)$ is the Jacobian of $g$ at $x$, i.e., $T(x)_{ij} = \partial g_i(x)/\partial x_j$. Suppose we have an approximation $B$ to $T(x)$ and seek an approximation $\bar B$ to $T(\bar x)$ subject to the constraints
\[
g(\bar x) = g(x) + \bar B(\bar x - x)\,, \qquad \|\bar B - B\|_F = \min\,. \tag{2.53}
\]

An application of Proposition 2.13 yields
\[
\bar B = B + \frac{(y - Bs)s^T}{s^Ts} \quad\text{where } s = \bar x - x,\ y = g(\bar x) - g(x)\,, \tag{2.54}
\]
i.e., Broyden's rank one update.
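In code, (2.54) is a one-line update; a minimal sketch (our naming, NumPy assumed):

    import numpy as np

    def broyden_update(B, s, y):
        """Broyden rank one update (2.54): B + (y - B s) s^T / (s^T s)."""
        return B + np.outer(y - B @ s, s) / (s @ s)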

Now suppose $m = n$ and that the Jacobian $T$ is always symmetric. Assume we have a symmetric approximation $B$ to $T(x)$ and seek an approximation $\bar B$ to $T(\bar x)$ subject to the constraints
\[
g(\bar x) = g(x) + \bar B(\bar x - x)\,, \qquad \|\bar B - B\|_F = \min\,, \qquad \bar B^T = \bar B\,. \tag{2.55}
\]

In this case Proposition 2.14 yields the Powell symmetric Broyden (PSB) update
\[
\bar B = B + \frac{(y - Bs)s^T + s(y - Bs)^T}{s^Ts} - \frac{(y - Bs)^Ts}{(s^Ts)^2}\,ss^T \tag{2.56}
\]
where $s = \bar x - x$ and $y = g(\bar x) - g(x)$. This update is important because when $g = \nabla\phi$, $T(x) = \nabla^2\phi(x)$ is symmetric.
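The PSB update (2.56) is just as direct; the sketch below (our naming, NumPy assumed) keeps the result exactly symmetric.

    import numpy as np

    def psb_update(B, s, y):
        """Powell symmetric Broyden update (2.56); preserves B^T = B."""
        r = y - B @ s
        ss = s @ s
        return (B + (np.outer(r, s) + np.outer(s, r)) / ss
                  - ((r @ s) / ss ** 2) * np.outer(s, s))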

We now consider how to update the least squares Jacobian. Suppose $B$ is an approximation to $J(x)$ and that we want an approximation $\bar B$ to $J(\bar x)$. By applying (2.53) with $g(x) = F(x)$ we find
\[
\bar B = B + \frac{(y - Bs)s^T}{s^Ts}
\]

where $s = \bar x - x$ and $y = F(\bar x) - F(x)$. This leads to the quasi-Gauss-Newton (QGN) iteration
\[
\left\{\ 
\begin{aligned}
&x^{(k+1)} = x^{(k)} + y_k \quad\text{where } y_k \text{ minimizes } \|B_ky + F(x^{(k)})\|\,, \text{ and}\\
&B_k = B_{k-1} + \frac{(y - B_{k-1}s)s^T}{s^Ts}\,, \quad s = x^{(k)} - x^{(k-1)}\,, \quad y = F(x^{(k)}) - F(x^{(k-1)})\,.
\end{aligned}
\right. \tag{2.57}
\]

If one has a factorization $QB_{k-1}Z = (R\ \ 0)^T$ with $Q$, $Z$ orthogonal and $R$ upper triangular, this factorization can be cheaply updated, and the linear least squares problem in the QGN method can be solved in $O(mn)$ operations. A better single rank update method for the Jacobian $J$ is to let $y_k$ be a judicious linear combination of $-B_k^TF(x^{(k)})$, an approximate steepest descent vector, and $-[B_k^TB_k + \lambda_k^2I]^{-1}B_k^TF(x^{(k)})$, an LM correction.
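A plain sketch of the QGN iteration (2.57) follows, solving the least squares subproblem by a dense method rather than the cheaply updated factorization described above; broyden_update is the function sketched earlier, and the stopping rule is our own.

    import numpy as np

    def quasi_gauss_newton(F, x0, B0, tol=1e-8, max_iter=100):
        """Quasi-Gauss-Newton (2.57): least squares step with a
        Broyden-updated Jacobian approximation B_k."""
        x = np.asarray(x0, dtype=float)
        B = np.asarray(B0, dtype=float)
        for _ in range(max_iter):
            Fx = F(x)
            step, *_ = np.linalg.lstsq(B, -Fx, rcond=None)
            if np.linalg.norm(step) < tol:
                break
            x_new = x + step
            B = broyden_update(B, step, F(x_new) - Fx)  # s = step, y = F change
            x = x_new
        return x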

The classical Newton method for the nonlinear least squares problem is given by (2.46), which using (2.43) becomes $\bar x = x + s$ where $s$ solves
\[
[J^T(x)J(x) + K(x)]\,s = -J(x)^TF(x)
\]
where $K(x) = \sum_{i=1}^m f_i(x)\nabla^2 f_i(x)$. We now consider updating this Newton iteration.

We now consider determining $y$ from $[J(\bar x)^TJ(\bar x) + \bar B]y = -J(\bar x)^TF(\bar x)$ where $\bar B$ is an approximation to $K(\bar x)$. We let $\bar B = \sum_{i=1}^m f_i(\bar x)\bar B_i$ where each $\bar B_i$ is an approximation to $\nabla^2 f_i(\bar x)$ and is updated by setting $B = B_i$, $s = \bar x - x$ and $y = \nabla f_i(\bar x) - \nabla f_i(x)$ in (2.56). The usual assumptions result in local and superlinear convergence. The required storage is $O(mn^2)$, which can be prohibitive. However, in some least squares problems such as curve fitting, the component Hessians are often sparse, in which case the above storage figure is not applicable.

The approximation $[J(\bar x)^TJ(\bar x) + K(\bar x)](\bar x - x) \approx J(\bar x)^TF(\bar x) - J(x)^TF(x)$ leads us to ask that the update $\bar B$ satisfy $[J(\bar x)^TJ(\bar x) + \bar B](\bar x - x) = J(\bar x)^TF(\bar x) - J(x)^TF(x)$. This suggests defining $\bar B$ by (2.56) with $s = \bar x - x$ and $y = J(\bar x)^TF(\bar x) - J(x)^TF(x) - J(\bar x)^TJ(\bar x)s$. The success of this method has not been documented.