Unconstrained and Constrained Optimization Algorithms

Soman K.P.

    1 Introduction

Optimization is a means of finding the most efficient way of solving complex mathematical problems, and is a vital part of most branches of computational science and engineering.

We encounter optimization in our day-to-day life. Without thinking about it, most people are constantly trying to do things in an optimal way. It can be anything from looking for discounts to minimize the cost of the weekly shopping tour, to finding the shortest path between two cities. When choosing between a long queue and a short queue, most people choose the short one in order to minimize the time spent in the queue. Most of these everyday problems are solved by intuition, and it is often not crucial to find the absolutely best solution. These are all examples of simple optimization problems. Unfortunately, there are many important optimization problems that are not so easy to solve. Optimization is used in many areas and is in many cases a very powerful tool. Common, more advanced examples are minimizing the weight of a structure while maintaining the desired strength, or finding the optimal route for an airplane to minimize fuel consumption. In these cases, it can be impossible to solve the problems by intuition. Instead, a mathematical algorithm executed on a computer, an optimization routine, is often applied to the problem [1]. We are interested in applying optimization algorithms to signal and image processing applications. Recent developments in compressed sensing have spurred a lot of interest in optimization theory, especially the theory of L1-norm optimization. Utilizing sparsity in signal representation and sparsity in the gradient of images requires a strong footing in optimization theory. This chapter is a first-level introduction to unconstrained and constrained optimization theory.

2 Unconstrained Optimization

In this section, numerical schemes to solve unconstrained optimization problems will be introduced. Solving an unconstrained optimization problem is closely related to the root-finding process, so it is worth considering a root-finding algorithm first.

2.1 Root Finding Algorithm

In a root-finding problem, we find $x^*$ that satisfies $f(x^*) = 0$, where $f : \mathbb{R} \to \mathbb{R}$ is a smooth function in one variable. Newton's method is the most representative approach for this type of problem. Newton's method is an iterative procedure which successively generates a sequence $x_k$ that approaches a root $x^*$ as $k$ increases. In figure 1.1, Newton's method is illustrated on the graph of an arbitrary function $f$. At iteration $k$, Newton's method draws a tangential line $y = ax + b$ at the current point $(x_k, f(x_k))$. See the straight line drawn tangential to the curve at $(x_k, f(x_k))$ in figure 1.1. We need to determine the slope $a$ and the y-intercept $b$. The slope is simply the derivative of $f$ at $x_k$:

$a = f'(x_k)$

Note that the tangential line should pass through $(x_k, f(x_k))$, which can be found by evaluating $f$ at $x_k$. Plugging this into $y = ax + b$ gives $f(x_k) = f'(x_k) x_k + b$, which in turn gives $b = f(x_k) - f'(x_k) x_k$. Thus we have obtained the tangential line

$y = f'(x_k)\, x + f(x_k) - f'(x_k)\, x_k = f(x_k) + f'(x_k)(x - x_k)$

Newton's method updates $x_k$ so that $x_{k+1}$ is the root of the tangential line. Thus, we get the following Newton update formula:

$x_{k+1} = x_k - \dfrac{f(x_k)}{f'(x_k)}$

Figure 1.1: Illustration of Newton's method for root finding

Newton's method repeats this procedure until it converges to the root. The main concept of Newton's method is to linearize $f$ (i.e., to find the tangential line at the current point). It then finds 'the' root of the linearized function, and that root is used as the next point. The procedure is repeated until it converges to the root of $f$. This idea of approximating the original function locally and 'solving' the approximated function instead of the original function prevails in numerical optimization algorithms. We will definitely see this paradigm again. Letting $p_k = x_{k+1} - x_k$, we have

$p_k = -f(x_k) / f'(x_k) \qquad (1.1)$

We can consider $p_k$ as a step to the next point $x_{k+1}$. Remember the formula given in (1.1); we will see a very similar expression in numerical optimization, too.

Exercise 1: Find the square root of 3 using Newton's method.

Solution: We take $f(x) = x^2 - 3$. This is the equation of a parabola which cuts the x-axis at $x = \pm\sqrt{3}$; in other words, the solutions of $f(x) = 0$ are $x = \pm\sqrt{3}$.

We take $x_0 = 1$ and proceed with $f'(x) = 2x$:

$x_1 = x_0 - \dfrac{f(x_0)}{f'(x_0)} = 1 - \dfrac{-2}{2} = 2$

$x_2 = x_1 - \dfrac{f(x_1)}{f'(x_1)} = 2 - \dfrac{1}{4} = 7/4$

and so on. Depending on the starting point, the iteration will converge to one of the roots.
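The update rule is short enough to verify directly. Below is a minimal sketch in Python of the iteration $x_{k+1} = x_k - f(x_k)/f'(x_k)$, applied to Exercise 1 (the function names and tolerance are illustrative choices, not from the text):

```python
# Newton's method for root finding: x_{k+1} = x_k - f(x_k) / f'(x_k)
def newton(f, fprime, x0, tol=1e-12, max_iter=50):
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:          # close enough to a root
            return x
        x = x - fx / fprime(x)     # Newton update
    return x

# Exercise 1: square root of 3 via f(x) = x^2 - 3, starting from x0 = 1
print(newton(lambda x: x**2 - 3, lambda x: 2*x, 1.0))   # 1.7320508075688772
```

Starting from $x_0 = -1$ instead would converge to $-\sqrt{3}$, in line with the remark above.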

Newton's method can fail to find a root. Consider $f(x) = x^2 - 1$ and $x_0 = 0$. Then we have $f(x_0) = -1$ and $f'(x_0) = 0$, so $x_1$ is undefined. This example illustrates the importance of the starting point: a bad starting point can cause the algorithm to fail. You may think that Newton's method is very robust if a good starting point is chosen. Unfortunately, that is not the case. In some cases, Newton's method converges with great difficulty, or even fails to converge, from any ordinary starting point. Usually, Newton's method shows great performance when the starting point is close to a root or the function $f$ is very nice (e.g., $f$ is convex). However, it is not guaranteed that Newton's method will converge for a general function $f$.


Figure 1.2: Illustration of global and local minimizers. The circular point is the global minimizer; the rectangular points are local minimizers.

2.2 Local Minimizer

In an unconstrained optimization problem, we minimize an objective function that depends on real variables, with no restrictions at all on the values of these variables. The mathematical formulation is

$\min_{x} f(x)$

where $x \in \mathbb{R}^n$ is a real vector with $n \ge 1$ and $f : \mathbb{R}^n \to \mathbb{R}$ is a smooth function. Usually we lack a global perspective on the function $f$. All we know is how to evaluate $f$ at specific points, and maybe the gradient

$\nabla f = \left( \dfrac{\partial f}{\partial x_1}, \dfrac{\partial f}{\partial x_2}, \ldots, \dfrac{\partial f}{\partial x_n} \right)^T$

We need to come up with an algorithm which can find some minimum of $f$ with only this minimal information. The good news is that there are several such algorithms! The bad news, however, is that most algorithms find a local minimizer; this, in turn, tells us that global minimization is a very difficult task. Then what is the difference between a global and a local minimizer? Let's examine the definitions of each.


Definition 1. $x^*$ is called a global minimizer of $f$ if $f(x^*) \le f(x)$ for all $x \in \mathbb{R}^n$. On the other hand, $x^*$ is called a local minimizer of $f$ if $f(x^*) \le f(x)$ for all $x \in N_\epsilon(x^*)$, where $N_\epsilon(x)$ denotes an $\epsilon$-neighborhood of $x$.

Note from the definition that a global minimizer is a local minimizer, but a local minimizer may not be a global minimizer. The difference between global and local minimizers is best depicted by figure 1.2. Both the square and circular points are local minimizers; however, there is only one global minimizer, which is the circular point. Even from figure 1.2, you can imagine why it is hard to find a global minimizer under the condition that only some function evaluations and gradient information are given. Since we aim for a local minimizer, we need to know more about local minimizers. What are the characteristics of local minimizers? Listing a few:

- The tangential slope is zero at a local minimizer; in other words, $\nabla f(x^*) = 0$.
- $f(x^*) \le f(x)$ for $x \in N_\epsilon(x^*)$.
- $\|\nabla f(x)\|$ is small if $x \in N_\epsilon(x^*)$.
- $\nabla^2 f(x^*)$ (the Hessian matrix) is positive semidefinite.

In designing gradient-based algorithms, Taylor's theorem plays a crucial role. Let's take a look at Taylor's theorem, starting with the one-variable case. Any given $f(x)$ can be expressed as a power series with respect to a chosen point $x_0$, as follows:

$f(x) = a_0 + a_1 (x - x_0) + a_2 (x - x_0)^2 + a_3 (x - x_0)^3 + \ldots \qquad (1.2)$

Now, how do we find the values $a_0, a_1, a_2, \ldots$ of this infinite series so that the equation holds?


2.3 Method

The general idea is to process both sides of this equation and choose values of $x$ so that only one unknown appears each time.

To obtain $a_0$: choose $x = x_0$ in (1.2). This results in

$a_0 = f(x_0)$

To obtain $a_1$: first take the derivative of (1.2),

$\dfrac{d}{dx} f(x) = a_1 + 2 a_2 (x - x_0) + 3 a_3 (x - x_0)^2 + 4 a_4 (x - x_0)^3 + \ldots \qquad (1.3)$

Now choose $x = x_0$. Then

$a_1 = \left. \dfrac{df}{dx} \right|_{x = x_0}$

To obtain $a_2$: first take the derivative of (1.3),

$\dfrac{d^2}{dx^2} f(x) = 2 a_2 + 3(2) a_3 (x - x_0) + 4(3) a_4 (x - x_0)^2 + 5(4) a_5 (x - x_0)^3 + \ldots \qquad (1.4)$

Now choose $x = x_0$:

$a_2 = \dfrac{1}{2} \left. \dfrac{d^2 f}{dx^2} \right|_{x = x_0}$

To obtain $a_3$: first take the derivative of (1.4),

$\dfrac{d^3}{dx^3} f(x) = 3(2) a_3 + 4(3)(2) a_4 (x - x_0) + 5(4)(3) a_5 (x - x_0)^2 + \ldots \qquad (1.5)$

Now choose $x = x_0$:

$a_3 = \dfrac{1}{3(2)} \left. \dfrac{d^3 f}{dx^3} \right|_{x = x_0}$


To obtain $a_k$: take the $k$th derivative of equation (1.2) and then choose $x = x_0$:

$a_k = \dfrac{1}{k!} \left. \dfrac{d^k f}{dx^k} \right|_{x = x_0}$

Summarizing, the Taylor series expansion of $f(x)$ with respect to $x_0$ is given by:

$f(x) = f(x_0) + \left. \dfrac{df}{dx} \right|_{x_0} (x - x_0) + \dfrac{1}{2!} \left. \dfrac{d^2 f}{dx^2} \right|_{x_0} (x - x_0)^2 + \ldots + \dfrac{1}{k!} \left. \dfrac{d^k f}{dx^k} \right|_{x_0} (x - x_0)^k + \ldots$
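The coefficient formula is easy to check with a computer algebra system. Below is a small sketch using sympy (an illustrative tool choice, not from the text) that computes $a_k = \frac{1}{k!} f^{(k)}(x_0)$ for $f(x) = e^x$ about $x_0 = 0$ and compares the result with sympy's built-in series:

```python
import sympy as sp

x = sp.symbols('x')
f = sp.exp(x)
x0 = 0

# a_k = (1/k!) * d^k f / dx^k evaluated at x0, per the formula above
coeffs = [sp.diff(f, x, k).subs(x, x0) / sp.factorial(k) for k in range(5)]
print(coeffs)                    # [1, 1, 1/2, 1/6, 1/24]
print(sp.series(f, x, x0, 5))    # 1 + x + x**2/2 + x**3/6 + x**4/24 + O(x**5)
```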

Generalization to a multivariable function:

Let $x_1, x_2, x_3$ be three independent variables, and let $(\bar{x}_1, \bar{x}_2, \bar{x}_3)$ be the expansion point. Then

$f(x_1, x_2, x_3) = a_0 + a_1 (x_1 - \bar{x}_1) + a_2 (x_2 - \bar{x}_2) + a_3 (x_3 - \bar{x}_3) + a_{11} (x_1 - \bar{x}_1)^2 + a_{22} (x_2 - \bar{x}_2)^2 + a_{33} (x_3 - \bar{x}_3)^2 + a_{12} (x_1 - \bar{x}_1)(x_2 - \bar{x}_2) + a_{13} (x_1 - \bar{x}_1)(x_3 - \bar{x}_3) + a_{23} (x_2 - \bar{x}_2)(x_3 - \bar{x}_3) + \ldots \qquad (1.6)$

Using a similar method as described above, but with partial derivatives this time,

$a_0 = f(\bar{x}_1, \bar{x}_2, \bar{x}_3), \quad a_1 = \left. \dfrac{\partial f}{\partial x_1} \right|_{(\bar{x}_1, \bar{x}_2, \bar{x}_3)}, \quad a_2 = \left. \dfrac{\partial f}{\partial x_2} \right|_{(\bar{x}_1, \bar{x}_2, \bar{x}_3)}, \quad a_3 = \left. \dfrac{\partial f}{\partial x_3} \right|_{(\bar{x}_1, \bar{x}_2, \bar{x}_3)}$

$a_{11} = \dfrac{1}{2!} \left. \dfrac{\partial^2 f}{\partial x_1^2} \right|_{(\bar{x}_1, \bar{x}_2, \bar{x}_3)}, \quad a_{22} = \dfrac{1}{2!} \left. \dfrac{\partial^2 f}{\partial x_2^2} \right|_{(\bar{x}_1, \bar{x}_2, \bar{x}_3)}, \quad a_{33} = \dfrac{1}{2!} \left. \dfrac{\partial^2 f}{\partial x_3^2} \right|_{(\bar{x}_1, \bar{x}_2, \bar{x}_3)}$

$a_{12} = \left. \dfrac{\partial^2 f}{\partial x_1 \partial x_2} \right|_{(\bar{x}_1, \bar{x}_2, \bar{x}_3)}; \quad a_{13} = \left. \dfrac{\partial^2 f}{\partial x_1 \partial x_3} \right|_{(\bar{x}_1, \bar{x}_2, \bar{x}_3)}; \quad a_{23} = \left. \dfrac{\partial^2 f}{\partial x_2 \partial x_3} \right|_{(\bar{x}_1, \bar{x}_2, \bar{x}_3)}$

To get a simple, concise expression, we assume the variables are $x_1, x_2, \ldots, x_n$ and denote the generic point in the domain of the function by $\mathbf{x} = (x_1, x_2, \ldots, x_n)^T$. Then

$f(\mathbf{x}) \approx f(\bar{\mathbf{x}}) + \nabla f(\bar{\mathbf{x}})^T (\mathbf{x} - \bar{\mathbf{x}}) + \dfrac{1}{2} (\mathbf{x} - \bar{\mathbf{x}})^T \nabla^2 f(\bar{\mathbf{x}}) (\mathbf{x} - \bar{\mathbf{x}}) + \ldots \qquad (1.7)$

where

$\nabla f(\mathbf{x}) = \begin{pmatrix} \partial f / \partial x_1 \\ \vdots \\ \partial f / \partial x_n \end{pmatrix}$

and

$\nabla^2 f(\mathbf{x}) = \begin{pmatrix} \partial^2 f / \partial x_1^2 & \partial^2 f / \partial x_1 \partial x_2 & \cdots & \partial^2 f / \partial x_1 \partial x_n \\ \partial^2 f / \partial x_2 \partial x_1 & \partial^2 f / \partial x_2^2 & \cdots & \partial^2 f / \partial x_2 \partial x_n \\ \vdots & \vdots & \ddots & \vdots \\ \partial^2 f / \partial x_n \partial x_1 & \partial^2 f / \partial x_n \partial x_2 & \cdots & \partial^2 f / \partial x_n^2 \end{pmatrix}$
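As a quick numerical illustration of (1.7), the sketch below compares $f$ against its quadratic model at a small displacement; the function $f(x_1, x_2) = x_1^2 + e^{x_2}$ and the expansion point are my own arbitrary choices, not from the text:

```python
import numpy as np

# Check the quadratic expansion (1.7) for f(x) = x1^2 + exp(x2) around xbar = (1, 0)
f = lambda v: v[0]**2 + np.exp(v[1])
xbar = np.array([1.0, 0.0])
g = np.array([2*xbar[0], np.exp(xbar[1])])    # gradient of f at xbar
H = np.array([[2.0, 0.0],
              [0.0, np.exp(xbar[1])]])        # Hessian of f at xbar

d = np.array([0.05, -0.02])                    # small displacement x - xbar
quad = f(xbar) + g @ d + 0.5 * d @ H @ d
print(f(xbar + d), quad)                       # agree up to O(||d||^3)
```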

3 Line Search Methods

As in Newton's method for root finding, in unconstrained optimization we design iterative algorithms. Starting from an initial guess, we search for a direction $p$ to take, and then decide how far to go in the direction $p$. This methodology is called a line search algorithm. Thus we update the current point $x_k$ as

$x_{k+1} = x_k + \alpha p$

where $\alpha$ is called the step length and $p$ the search direction. In the root-finding algorithm, $p$ is determined by equation (1.1) and $\alpha = 1$. Then what would be a good search direction for unconstrained optimization?


3.1 The Steepest Descent Method

From calculus, we have learned that $\nabla f(x)$ gives the steepest ascent direction at $x$. Thus, $-\nabla f(x)$ is the steepest descent direction. Since the only thing we need is a direction to take, we normalize the steepest descent direction to get $p$:

$p = -\dfrac{\nabla f(x)}{\|\nabla f(x)\|} \qquad (1.8)$

The line search method that uses this direction is called the steepest descent method.
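A minimal sketch of the steepest descent loop follows. The quadratic test function, the diminishing step lengths, and the stopping tolerance are illustrative choices of mine, not from the text (in practice a line search would select $\alpha$):

```python
import numpy as np

def grad_f(x):                        # gradient of f(x) = x1^2 + 5*x2^2
    return np.array([2.0*x[0], 10.0*x[1]])

x = np.array([4.0, 1.0])
for k in range(2000):
    g = grad_f(x)
    if np.linalg.norm(g) < 1e-6:      # stop when the gradient (almost) vanishes
        break
    p = -g / np.linalg.norm(g)        # normalized steepest descent direction (1.8)
    x = x + (1.0 / (k + 1)) * p       # diminishing step length alpha_k
print(x)                              # tends toward the minimizer [0, 0]
```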

3.2 The Newton Method

In contrast to the steepest descent method, Newton's method takes a more sophisticated search direction $p$. As in Newton's method for root finding, we approximate the original objective function $f$ locally, and then take a minimizer of the approximated function to define the search direction. Linearizing $f$, however, is not an option, because a linearized $f$ does not have a minimum. The next option is to move one step further: take a quadratic approximation of $f$. Say we are at the current point $x_k$. According to Taylor's theorem we have

$f(x_k + p) = f(x_k) + \nabla f(x_k)^T p + \dfrac{1}{2} p^T \nabla^2 f(x_k) p + \ldots$

The higher order terms in the Taylor expansion are very small if $\|p\|$ is small. Thus, if we define $\tilde{f}(x)$ to be

$\tilde{f}(x) = f(x_k) + \nabla f(x_k)^T (x - x_k) + \dfrac{1}{2} (x - x_k)^T \nabla^2 f(x_k) (x - x_k), \qquad (1.9)$

then $\tilde{f}(x) \approx f(x)$ for $x$ close to $x_k$. Thus, $\tilde{f}$ is a very good approximation to $f$ around $x_k$. In order to find the search direction $p$, we need to find a minimizer $x$ of $\tilde{f}$. We know that for $x$ to be a local minimum, $\nabla \tilde{f}(x) = 0$. Thus, we must have

$\nabla f(x_k) + \nabla^2 f(x_k)(x - x_k) = 0$

This is obtained by taking the derivative on both sides of (1.9) and equating it to the zero vector, which is proved as follows:


$\tilde{f}(x) = f(x_k) + \nabla f(x_k)^T (x - x_k) + \dfrac{1}{2} (x - x_k)^T \nabla^2 f(x_k) (x - x_k)$

$= f(x_k) + \nabla f(x_k)^T x - \nabla f(x_k)^T x_k + \dfrac{1}{2} x^T \nabla^2 f(x_k) x - x_k^T \nabla^2 f(x_k) x + \dfrac{1}{2} x_k^T \nabla^2 f(x_k) x_k$

$= \nabla f(x_k)^T x + \dfrac{1}{2} x^T \nabla^2 f(x_k) x - x_k^T \nabla^2 f(x_k) x + c$

(where $c$ collects the terms that do not depend on $x$)

$\nabla \tilde{f}(x) = \nabla f(x_k) + \nabla^2 f(x_k) x - \nabla^2 f(x_k) x_k = \nabla f(x_k) + \nabla^2 f(x_k)(x - x_k)$

Letting $\nabla \tilde{f}(x) = 0$ we obtain

$\nabla f(x_k) + \nabla^2 f(x_k)(x - x_k) = 0$

Solving for the search direction $p = x - x_k$, we get

$x - x_k = -\left( \nabla^2 f(x_k) \right)^{-1} \nabla f(x_k)$

or

$x_{k+1} = x_k - \left( \nabla^2 f(x_k) \right)^{-1} \nabla f(x_k)$

The Newton search direction is

$p_k = -\left( \nabla^2 f(x_k) \right)^{-1} \nabla f(x_k)$
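The Newton iteration above can be sketched in a few lines. The Rosenbrock test function below is my own choice, not from the text; it simply exercises the update $x_{k+1} = x_k - (\nabla^2 f(x_k))^{-1} \nabla f(x_k)$:

```python
import numpy as np

# f(x, y) = (1 - x)^2 + 100 (y - x^2)^2  (Rosenbrock), minimized at (1, 1)
def grad(v):
    x, y = v
    return np.array([-2*(1 - x) - 400*x*(y - x**2),
                     200*(y - x**2)])

def hess(v):
    x, y = v
    return np.array([[2 + 1200*x**2 - 400*y, -400*x],
                     [-400*x,                 200.0]])

v = np.array([-1.2, 1.0])
for _ in range(20):
    v = v - np.linalg.solve(hess(v), grad(v))   # full Newton step (alpha = 1)
print(v)                                        # close to the minimizer [1, 1]
```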

    4 Constrained Optimization

    The other major type of optimization is constrained optimization. In constrained optimization,

    the search space is dictated by constraints. The constraints can be either equality or inequality

    constraints.


For example, let us consider a case with one equality constraint:

$\min_x f(x) \quad \text{s.t.} \quad g(x) = 0$

This means that on the set of points $x$ that satisfy $g(x) = 0$, we search for an $x^*$ at which $f(x)$ is minimum. Therefore, the condition to be satisfied at the optimal point $x^*$ is not $\nabla f(x^*) = 0$; that is the condition for an unconstrained optimization problem.

To find the condition for optimality in the case of constrained optimization, we make use of level sets of the objective function. A level set is the set of points on which the function value is a given constant. We draw several level sets in sequence, starting from the minimum possible level-set value, and then gradually increase the value and draw the corresponding level sets. At a particular value, the corresponding level-set curve of the objective function $f(x)$ just touches the zero level-set curve of $g(x)$ (in other words, it touches the $g(x) = 0$ curve). At the point of contact $x^*$, we can draw a common tangent to the two level sets. This tangent is orthogonal to both gradients $\nabla f(x^*)$ and $\nabla g(x^*)$ drawn at $x^*$, which means the gradients $\nabla f(x^*)$ and $\nabla g(x^*)$ are collinear. There are two possibilities: the gradients may point in the same direction or in opposite directions. These are shown in figures 1.3 and 1.4. It all depends on the direction in which $f(x)$ and $g(x)$ are increasing at the point $x^*$.


    Figure 1.3. Optimality conditions using level set curves, case 1

    Figure 1.4. Optimality conditions using level set curves, case 2


Well, we obtained conditions for optimality for one equality constraint. What if there are several equality constraints? Note that each equality constraint puts a severe restriction on the search space. If there is more than one equality constraint, the search space is limited to the set of common points that satisfy all the equality constraints. Now, at the optimal point, what are the conditions to be satisfied? For the purpose of illustration, let us consider the case of two equality constraints $g_1(x) = 0, \ g_2(x) = 0$, as in figure 1.5. The objective function (the function for which the location of the optimum is to be found) is $f(x)$. We assume, as earlier, that all functions are defined over $\mathbb{R}^2$. At the point of intersection $x^*$ of $g_1(x) = 0$ and $g_2(x) = 0$, the level set of the objective function $f(x)$ may not touch either curve tangentially. However, $\nabla f(x^*)$ has to be in the plane spanned by $\nabla g_1(x^*)$ and $\nabla g_2(x^*)$. In other words,

$\nabla f(x^*) = \lambda_1 \nabla g_1(x^*) + \lambda_2 \nabla g_2(x^*)$

That is, $\nabla f(x^*)$ is a linear combination of $\nabla g_1(x^*)$ and $\nabla g_2(x^*)$. This is illustrated in figure 1.5. The signs of $\lambda_1$ and $\lambda_2$ depend on the directions in which $g_1(x)$ and $g_2(x)$ are increasing at $x^*$.


    Figure 1.5. Optimality condition for problem with two equality constraints.

5 Lagrangian Function

Lagrange [3] introduced a new function, called the Lagrangian function, for which applying the first order optimality condition for an unconstrained problem yields the required condition for the constrained problem. For example, let us consider

$\min_x f(x) \quad \text{s.t.} \quad g_1(x) = 0; \ g_2(x) = 0 \qquad (1.10)$

The Lagrangian function is given by

$L(x, \lambda_1, \lambda_2) = f(x) - \lambda_1 g_1(x) - \lambda_2 g_2(x) \qquad (1.11)$

On differentiating with respect to $x$, $\lambda_1$, and $\lambda_2$, we obtain respectively


$\nabla f(x) - \lambda_1 \nabla g_1(x) - \lambda_2 \nabla g_2(x) = 0 \qquad (1.12)$

$g_1(x) = 0$

$g_2(x) = 0$

The point $x^*$ satisfying the above three conditions is given by

$\nabla f(x^*) = \lambda_1 \nabla g_1(x^*) + \lambda_2 \nabla g_2(x^*); \quad g_1(x^*) = 0; \quad g_2(x^*) = 0 \qquad (1.13)$

These are the same conditions we obtained using geometrical arguments.

The importance of the Lagrangian function is that it converts a constrained optimization problem into an unconstrained optimization problem. The Lagrangian function is obtained by adding the constraints, multiplied by constants called Lagrange multipliers, to the objective function. Note that the sign of a multiplier depends on the direction in which the constraint function is increasing at $x^*$.
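Conditions of the form (1.12) can be solved symbolically for small problems. Below is a minimal sketch with sympy on a toy problem of my own choosing (not from the text): minimize $f = x^2 + y^2$ subject to $g = x + y - 1 = 0$:

```python
import sympy as sp

x, y, lam = sp.symbols('x y lam')
f = x**2 + y**2
g = x + y - 1
L = f - lam * g                       # Lagrangian in the form of (1.11)

# Stationarity in x, y plus feasibility (the derivative w.r.t. lam)
eqs = [sp.diff(L, v) for v in (x, y, lam)]
print(sp.solve(eqs, [x, y, lam], dict=True))   # [{lam: 1, x: 1/2, y: 1/2}]
```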

5.1 Optimization problem with inequality constraints

Next, we derive the optimality condition when the constraints are of inequality type. Inequality constraints are less stringent than equality constraints because we have more search space; that is, the set of points in $g(x) \ge 0$ is much larger than in $g(x) = 0$. Let us consider a problem of the type

$\min_x f(x) \quad \text{s.t.} \quad g(x) \ge 0 \qquad (1.14)$


    Figure 1.6. Optimality conditions with inequality constraints

Note that $g(x) = 0$ divides the search domain into two regions: one region is $g(x) \ge 0$ and the other is $g(x) < 0$. See figure 1.6. If the global minimum of the function $f(x)$ is in $g(x) \ge 0$, then $g(x) \ge 0$ is no longer a constraint at all, since $f(x)$ can assume the global minimum value at $x = x^*$.

So, for $g(x) \ge 0$ to be a real constraint, the global minimum of the function $f(x)$ must be in $g(x) < 0$. Figure 1.6 shows such a situation. We draw level sets of $f(x)$ until a level set just grazes $g(x) = 0$. Let the point where it just touches $g(x) = 0$ be $x^*$. At $x^*$, we note that the two gradients are collinear and point in the same direction. Therefore, the required first order optimality condition is


$\nabla f(x^*) = \mu \nabla g(x^*), \quad \mu > 0 \qquad (1.15)$

$g(x^*) = 0$

There is one problem with the above two conditions: they do not take care of the situation where $g(x) \ge 0$ is not constraining the objective function. That happens when the global minimum of $f(x)$ is in $g(x) \ge 0$. At that global minimum point $x^*$, $\nabla f(x^*) = 0$, but $g(x^*)$ and $\nabla g(x^*)$ are not necessarily zero. This would violate the optimality conditions given above. To take care of it, we write a new condition, $\mu \, g(x^*) = 0$, which says that if the constraint is not active ($g(x^*) \ne 0$) at the optimal point, then $\mu = 0$.

The condition $\mu \, g(x^*) = 0$ is called the complementarity condition.

Putting it all together, we write the conditions at the optimal point $x^*$ as

$\nabla f(x^*) = \mu \nabla g(x^*), \quad \mu \ge 0 \qquad (1.16)$

$g(x^*) \ge 0$

$\mu \, g(x^*) = 0$

These three conditions take care of both situations. Let us see each situation separately.

Case 1: $g(x) \ge 0$ is a constraint (the global optimum point is in $g(x) < 0$):

$\nabla f(x^*) = \mu \nabla g(x^*), \ \mu > 0, \quad \nabla f(x^*) \ne 0 \qquad (1.17)$

$g(x^*) = 0$

$\mu \, g(x^*) = 0, \quad \mu > 0$

Case 2: $g(x) \ge 0$ is not a constraint (the global optimum point is in $g(x) \ge 0$):

$\nabla f(x^*) = \mu \nabla g(x^*), \ \mu = 0, \quad \text{i.e. } \nabla f(x^*) = 0 \qquad (1.18)$

$g(x^*) \ge 0$

$\mu \, g(x^*) = 0 \ \text{since} \ \mu = 0$

A very important point to note here is that the Lagrange multiplier $\mu$ is not unrestricted in sign. If $g(x) = 0$ is an active constraint, then at the optimal point both $\nabla f(x)$ and $\nabla g(x)$ must point in the same direction.

5.2 Lagrangian Formulation for an inequality constraint

Let us now try to construct a Lagrangian function that produces the same optimality conditions. The problem $\min_x f(x)$ s.t. $g(x) \ge 0$ we rewrite as

$\min_x f(x) \quad \text{s.t.} \quad g(x) - s^2 = 0 \qquad (1.19)$

where $s$ is any real number, so that $s^2$ is nonnegative for any $s$.

We write the Lagrangian as

$L(x, \mu, s) = f(x) - \mu \left( g(x) - s^2 \right), \quad \mu \ge 0 \qquad (1.20)$

so that

$\dfrac{\partial L}{\partial x} = \nabla f(x) - \mu \nabla g(x) = 0 \ \Rightarrow\ \nabla f(x^*) = \mu \nabla g(x^*) \qquad (1.21)$

$\dfrac{\partial L}{\partial \mu} = -\left( g(x) - s^2 \right) = 0 \ \Rightarrow\ g(x^*) \ge 0, \ \text{because} \ s^2 \ge 0 \qquad (1.22)$

$\dfrac{\partial L}{\partial s} = 2 \mu s = 0 \ \Rightarrow\ \mu \, g(x^*) = 0, \ \text{because} \ g(x^*) = 0 \ \text{when} \ s = 0 \qquad (1.23)$


While writing the Lagrangian function, we should make sure that $\frac{\partial L}{\partial x} = 0$ produces $\nabla f(x^*) = \mu \nabla g(x^*)$ with $\mu \ge 0$.

Let us now see the directions of $\nabla f(x)$ and $\nabla g(x)$ at the optimal point for the various combinations of type of optimization (max or min) and inequality constraint (less-than type or greater-than type).

a. Combination type 1:

$\max_x f(x) \quad \text{s.t.} \quad g(x) \ge 0 \qquad (1.24)$

Here, for the constraint to be active, the global maximum must be outside the $g(x) \ge 0$ region. In that situation, at the optimal point $x^*$, $\nabla f(x^*) = -\mu \nabla g(x^*)$, $\mu \ge 0$. See figure 1.7.

    Figure 1.7. Optimality conditions combination type 1


The Lagrangian function for the problem is $L(x, \mu, s) = f(x) + \mu \left( g(x) - s^2 \right)$, $\mu \ge 0$.

b. Combination type 2:

$\max_x f(x) \quad \text{s.t.} \quad g(x) \le 0 \qquad (1.25)$

The Lagrangian for the problem is $L(x, \mu, s) = f(x) - \mu \left( g(x) + s^2 \right)$, $\mu \ge 0$. See figure 1.8.

Figure 1.8. Optimality conditions combination type 2

c. Combination type 3:

$\min_x f(x) \quad \text{s.t.} \quad g(x) \le 0 \qquad (1.26)$

The Lagrangian for the problem is $L(x, \mu, s) = f(x) + \mu \left( g(x) + s^2 \right)$, $\mu \ge 0$. See figure 1.9.

Remembering the Lagrangian form for all the cases is difficult. To make it easy, we may convert any formulation into a standard form and remember only the Lagrangian for that standard form. A maximization problem is converted into a minimization problem by changing the objective function to $-f(x)$; that is, $\max_x f(x) = -\min_x \left( -f(x) \right)$. Similarly, a $\ge 0$ inequality constraint is converted into a $\le 0$ constraint by multiplying by $(-1)$; that is, $g(x) \ge 0$ is changed into $-g(x) \le 0$.

Figure 1.9. Optimality conditions combination type 3


6 Formulation with several equality and inequality constraints

Consider a general case where we have m equality constraints and n inequality constraints:

$\min_x f(x) \quad \text{s.t.} \quad h_i(x) = 0, \ i = 1, 2, \ldots, m; \quad g_j(x) \ge 0, \ j = 1, 2, \ldots, n \qquad (1.27)$

The Lagrangian for the above problem is

$L(x, \lambda, \mu, s) = f(x) - \sum_{i=1}^{m} \lambda_i h_i(x) - \sum_{j=1}^{n} \mu_j \left( g_j(x) - s_j^2 \right); \quad \mu_j \ge 0 \ \forall j \qquad (1.28)$

The first order optimality conditions are obtained as

$\dfrac{\partial L}{\partial x} = 0 \ \Rightarrow\ \nabla f(x^*) = \sum_{i=1}^{m} \lambda_i \nabla h_i(x^*) + \sum_{j=1}^{n} \mu_j \nabla g_j(x^*); \quad \mu_j \ge 0 \ \forall j \qquad (1.29)$

$\dfrac{\partial L}{\partial \lambda_i} = 0 \ \Rightarrow\ h_i(x^*) = 0 \ \forall i \qquad (1.30)$

$\dfrac{\partial L}{\partial \mu_j} = 0 \ \Rightarrow\ g_j(x^*) - s_j^2 = 0 \ \Rightarrow\ g_j(x^*) \ge 0 \ \forall j \qquad (1.31)$

$\dfrac{\partial L}{\partial s_j} = 0 \ \Rightarrow\ \mu_j s_j = 0 \ \Rightarrow\ \mu_j \, g_j(x^*) = 0 \ \forall j \qquad (1.32)$

Putting it all together, the optimal point $x^*$ has to satisfy the following:

$\nabla f(x^*) = \sum_{i=1}^{m} \lambda_i \nabla h_i(x^*) + \sum_{j=1}^{n} \mu_j \nabla g_j(x^*); \quad \mu_j \ge 0 \ \forall j \qquad (1.33)$

(first order gradient condition)

$h_i(x^*) = 0 \ \forall i$ (feasibility condition)

$g_j(x^*) \ge 0 \ \forall j$ (feasibility condition)

$\mu_j \, g_j(x^*) = 0 \ \forall j$ (complementarity condition)

These conditions are called the KKT conditions, or Karush-Kuhn-Tucker conditions.
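For small problems, the KKT system can be solved symbolically. Below is a sketch on a toy problem of my own choosing (minimize $f(x) = (x-2)^2$ subject to $g(x) = x - 3 \ge 0$), following the sign convention of (1.33):

```python
import sympy as sp

x, mu = sp.symbols('x mu', real=True)
f = (x - 2)**2
g = x - 3                                   # constraint g(x) >= 0

stationarity = sp.Eq(sp.diff(f, x), mu * sp.diff(g, x))   # grad f = mu grad g
complementarity = sp.Eq(mu * g, 0)                        # mu * g(x) = 0

for sol in sp.solve([stationarity, complementarity], [x, mu], dict=True):
    if sol[mu] >= 0 and g.subs(sol) >= 0:   # keep mu >= 0 and feasible points
        print(sol)                          # {mu: 2, x: 3}
```

The candidate with $\mu = 0$ (the unconstrained minimizer $x = 2$) is filtered out because it violates feasibility, leaving the active-constraint solution $x^* = 3$, $\mu = 2$.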

7 Convex Optimization Problems

A vast number of problems in engineering, including signal and image processing problems, can be posed as constrained optimization problems of the type

$\min_x f_0(x) \quad \text{subject to} \quad g_i(x) \le 0, \ i = 1, 2, \ldots, m; \quad h_i(x) = 0, \ i = 1, 2, \ldots, n \qquad (1.34)$

However, such problems can be very hard to solve in general, especially when the number of decision variables in $x$ is large. There are several reasons for this difficulty:

1) The problem terrain may be riddled with local optima.

2) It might be very hard to find a feasible point (i.e., an $x$ which satisfies all the equalities and inequalities); in fact the feasible set, which need not even be fully connected, could be empty.

3) Stopping criteria used in general optimization algorithms are often arbitrary.

4) Optimization algorithms might have very poor convergence rates.

5) Numerical problems could cause the minimization algorithm to stop altogether or wander.

It has been known for a long time that if $f$ is a convex function (which we will define soon), and all the constraints together define a convex set, then the first three problems disappear: any local optimum is, in fact, a global optimum; feasibility of convex optimization problems can be determined unambiguously, at least in principle; and very precise stopping criteria are available using duality (which will be defined soon). However, convergence rate and numerical sensitivity issues remained a potential problem.

It was not until the late 80s and 90s that researchers in the former Soviet Union and the United States discovered that if, in addition to convexity, the objective function $f$ satisfies a property known as self-concordance (discussed later), then the issues of convergence and numerical sensitivity can be avoided using interior point methods [2-6]. The self-concordance property is satisfied by a very large set of important functions used in engineering. Hence, it is now possible to solve a large class of convex optimization problems in engineering with great efficiency.

7.1 Convex Sets

In this section we list some important convex sets and operations. We will be concerned only with optimization problems whose decision variables are vectors in $\mathbb{R}^n$ or matrices in $\mathbb{R}^{m \times n}$.

A function $f : \mathbb{R}^n \to \mathbb{R}^m$ is affine if it has the form $f(x) = A x + b$. Affine functions are sometimes loosely referred to as linear.

$S \subseteq \mathbb{R}^n$ is a subspace if it contains the plane through any two of its points and the origin, i.e.,

$x, y \in S, \ \alpha, \beta \in \mathbb{R} \ \Rightarrow\ \alpha x + \beta y \in S$

Two common representations of a subspace are as the range of a matrix,

$\operatorname{range}(A) = \{ A w \mid w \in \mathbb{R}^n \} = \{ w_1 a_1 + \cdots + w_n a_n \mid w_i \in \mathbb{R} \}, \ \text{where} \ A = [a_1 \ \cdots \ a_n]$;

alternatively, as the null space of a matrix: $\operatorname{nullspace}(B) = \{ x \mid B x = 0 \}$.

A set $S \subseteq \mathbb{R}^n$ is affine if it contains the line through any two points in it, i.e.,

$x, y \in S, \ \alpha, \beta \in \mathbb{R}, \ \alpha + \beta = 1 \ \Rightarrow\ \alpha x + \beta y \in S$

Figure 1.10. Example of Affine Set


Geometrically, an affine set is simply a set that is parallel to a subspace, which is centered at the origin (for a set of points to be a subspace, the null vector, the origin, must be a member of the set). Two common representations for an affine set are as the range of an affine function,

$S = \{ A z + b \mid z \in \mathbb{R}^n \}$;

alternatively, as the solution set of a set of linear equalities: $S = \{ x \mid B x = d \}$.

A set $S \subseteq \mathbb{R}^n$ is a convex set if it contains the line segment joining any two of its points, i.e.,

$x, y \in S, \ \alpha, \beta \ge 0, \ \alpha + \beta = 1 \ \Rightarrow\ \alpha x + \beta y \in S$

Figure 1.11. Convex Sets

Geometrically, we can think of convex sets as always bulging outward, with no dents or kinks in them. Clearly, subspaces and affine sets are convex, since their definitions subsume convexity.

A set $S \subseteq \mathbb{R}^n$ is a convex cone if it contains all rays passing through its points which emanate from the origin, as well as all line segments joining any points on those rays, i.e.,

$x, y \in S, \ \alpha, \beta \ge 0 \ \Rightarrow\ \alpha x + \beta y \in S$

Geometrically, $x, y \in S$ means that $S$ contains the entire 'pie slice' between $x$ and $y$.


Figure 1.12. Convex Cone

The nonnegative orthant $\mathbb{R}^n_+$ is a convex cone. The set $S^n_+ = \{ X \in S^n \mid X \succeq 0 \}$ of symmetric positive semidefinite (PSD) matrices is also a convex cone, since any positive combination of semidefinite matrices is semidefinite. Hence we call $S^n_+$ the positive semidefinite cone.

A convex cone $K \subseteq \mathbb{R}^n$ is said to be proper if it is closed, has nonempty interior, and is pointed, i.e., there is no line in $K$. A proper cone $K$ defines a generalized inequality $\succeq_K$ on $\mathbb{R}^n$:

$x \succeq_K y \iff x - y \in K$

(strict version: $x \succ_K y \iff x - y \in \operatorname{interior} K$).

Figure 1.13. Example of Convex Cone

This formalizes our use of the $\succeq$ symbol:

$K = \mathbb{R}^n_+$: $x \succeq_K y$ means $x_i \ge y_i$ (componentwise inequality);

$K = S^n_+$: $X \succeq_K Y$ means $X - Y$ is PSD.

Given points $x_i \in \mathbb{R}^n$ and scalars $\theta_i \in \mathbb{R}$, the point $y = \theta_1 x_1 + \theta_2 x_2 + \cdots + \theta_k x_k$ is said to be a

1. linear combination for any real $\theta_i$
2. affine combination if $\sum_i \theta_i = 1$
3. convex combination if $\sum_i \theta_i = 1, \ \theta_i \ge 0$
4. conic combination if $\theta_i \ge 0$

The linear (resp. affine, convex, conic) hull of a set $S$ is the set of all linear (resp. affine, convex, conic) combinations of points from $S$, and is denoted by span($S$) (resp. Aff($S$), Co($S$), Cone($S$)). It can be shown that this is the smallest such set containing $S$.

As an example, consider the set $S = \{(1, 0, 0), (0, 1, 0), (0, 0, 1)\}$. Then span($S$) is $\mathbb{R}^3$; Aff($S$) is the hyperplane passing through the three points; Co($S$) is the unit simplex, which is the triangle joining the vectors along with all the points inside it; Cone($S$) is the nonnegative orthant $\mathbb{R}^3_+$.

Recall that a hyperplane, represented as $\{ x \mid a^T x = b \}$ ($a \ne 0$), is in general an affine set, and is a subspace if $b = 0$. Another useful representation of a hyperplane is $\{ x \mid a^T (x - x_0) = 0 \}$, where $a$ is the normal vector and $x_0$ lies on the hyperplane. Hyperplanes are convex, since they contain all lines (and hence segments) joining any of their points.

A halfspace, described as $\{ x \mid a^T x \le b \}$ ($a \ne 0$), is generally convex and is a convex cone if $b = 0$. Another useful representation is $\{ x \mid a^T (x - x_0) \le 0 \}$, where $a$ is the (outward) normal vector and $x_0$ lies on the boundary.


Figure 1.14. Example of Half Space

We now come to a very important fact about properties which are preserved under intersection. Let $A$ be an arbitrary index set (possibly uncountably infinite) and $\{ S_\alpha \}_{\alpha \in A}$ a collection of sets; then if every $S_\alpha$ is a subspace (resp. affine set, convex set, convex cone), so is the intersection $\bigcap_{\alpha \in A} S_\alpha$.

In fact, every closed convex set $S$ is the (usually infinite) intersection of the halfspaces which contain it, i.e., $S = \bigcap \{ H \mid H \ \text{halfspace}, \ S \subseteq H \}$. For example, another way to see that $S^n_+$ is a convex cone is to recall that a matrix $X \in S^n$ is positive semidefinite if $z^T X z \ge 0$ for all $z \in \mathbb{R}^n$. Thus we can write

$S^n_+ = \bigcap_{z \in \mathbb{R}^n} \left\{ X \in S^n \,\middle|\, z^T X z = \sum_{i,j=1}^{n} z_i z_j X_{ij} \ge 0 \right\} \qquad (1.35)$

Now observe that the summation above is actually linear in the components of $X$, so $S^n_+$ is the infinite intersection of halfspaces containing the origin (which are convex cones) in $S^n$.

We continue with our listing of useful convex sets. A polyhedron is the intersection of a finite number of halfspaces:


$P = \{ x \mid a_i^T x \le b_i, \ i = 1, 2, \ldots, k \} = \{ x \mid A x \preceq b \}$

where $\preceq$ above means componentwise inequality.

Figure 1.15. Example of Polyhedron

A bounded polyhedron is called a polytope, which also has the alternative representation $P = \operatorname{Co}\{ v_1, v_2, \ldots, v_N \}$, where $\{ v_1, v_2, \ldots, v_N \}$ are its vertices. For example, the nonnegative orthant $\mathbb{R}^n_+ = \{ x \in \mathbb{R}^n \mid x \succeq 0 \}$ is a polyhedron, while the probability simplex $\{ x \mid x \succeq 0, \ \sum_i x_i = 1 \}$ is a polytope.

If $f$ is a norm, then the norm ball $B = \{ x \mid f(x - x_c) \le 1 \}$ is convex, and the norm cone $C = \{ (x, t) \mid f(x) \le t \}$ is a convex cone. Perhaps the most familiar norms are the $\ell_p$ norms on $\mathbb{R}^n$:

$\| x \|_p = \left( \sum_i |x_i|^p \right)^{1/p}, \ p \ge 1; \qquad \| x \|_\infty = \max_i |x_i| \qquad (1.36)$

The corresponding norm balls (in $\mathbb{R}^2$) look like this:


Figure 1.16. Norm Balls

Two further properties are helpful in visualizing the geometry of convex sets. The first is the separating hyperplane theorem, which states that if $S, T \subseteq \mathbb{R}^n$ are convex and disjoint ($S \cap T = \emptyset$), then there exists a hyperplane $\{ x \mid a^T x = b \}$ which separates them.

Figure 1.17. Separating Hyperplane theorem


The second property is the supporting hyperplane theorem, which states that there exists a supporting hyperplane at every point on the boundary of a convex set, where a supporting hyperplane $\{ x \mid a^T x = a^T x_0 \}$ supports $S$ at $x_0 \in \partial S$ if $x \in S \Rightarrow a^T x \le a^T x_0$.

    Figure 1.18. Supporting Hyperplane Theorem

7.2 Convex Functions

In this section, we introduce the reader to some important convex functions and techniques for verifying convexity. The objective is to sharpen the reader's ability to recognize convexity.

A. Convex functions

A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex if its domain dom($f$) is convex and for all $x, y \in \operatorname{dom} f$, $\theta \in [0, 1]$,

$f(\theta x + (1 - \theta) y) \le \theta f(x) + (1 - \theta) f(y);$

$f$ is concave if $-f$ is convex.
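The definition suggests a simple randomized (necessary-condition) test: sample pairs $x, y$ and values $\theta$ and look for violations of the inequality. The helper below is a sketch with names of my own choosing; passing it is evidence of convexity, not a proof:

```python
import numpy as np

def looks_convex(f, dim, trials=20000, seed=0):
    """Randomized check of f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y)."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        x, y = rng.normal(size=dim), rng.normal(size=dim)
        t = rng.uniform()
        if f(t*x + (1 - t)*y) > t*f(x) + (1 - t)*f(y) + 1e-12:
            return False            # a violating pair: definitely not convex
    return True                     # no violation found (evidence, not proof)

print(looks_convex(lambda v: float(np.sum(v**2)), 3))        # True
print(looks_convex(lambda v: float(np.sin(np.sum(v))), 3))   # False
```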


Figure 1.19. Types of Functions

The convexity of a differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ can also be characterized by conditions on its gradient $\nabla f$ and Hessian $\nabla^2 f$. Recall that, in general, the gradient yields a first order Taylor approximation at $x_0$:

$f(x) \approx f(x_0) + \nabla f(x_0)^T (x - x_0) \qquad (1.37)$

We have the following first-order condition: $f$ is convex if and only if for all $x, x_0 \in \operatorname{dom} f$,

$f(x) \ge f(x_0) + \nabla f(x_0)^T (x - x_0), \qquad (1.38)$

i.e., the first order approximation of $f$ is a global underestimator.

Figure 1.20. Illustration of Taylor Series

Recall that the Hessian of $f$, $\nabla^2 f$, yields a second order Taylor series expansion around $x_0$:

$f(x) \approx f(x_0) + \nabla f(x_0)^T (x - x_0) + \dfrac{1}{2} (x - x_0)^T \nabla^2 f(x_0) (x - x_0) \qquad (1.39)$

We have the following necessary and sufficient second order condition: a twice differentiable function $f$ is convex if and only if for all $x \in \operatorname{dom} f$, $\nabla^2 f(x) \succeq 0$, i.e., its Hessian is positive semidefinite on its domain.
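The second order condition is convenient to check with a computer algebra system. The sketch below (the example function is my choice, not from the text) forms the Hessian symbolically and inspects the sign of its eigenvalues:

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2', real=True)
f = x1**4 + x2**2                       # example function

H = sp.hessian(f, (x1, x2))
print(H)                                # Matrix([[12*x1**2, 0], [0, 2]])
# Each eigenvalue is nonnegative for all real x1, x2, so f is convex.
print([sp.simplify(ev).is_nonnegative for ev in H.eigenvals()])   # [True, True]
```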

7.3 Concept of Self-Concordance

Nesterov and Nemirovski [5] introduced the notion of self-concordance and the class of self-concordant functions. This provides a new tool for analyzing Newton's method that exploits the affine invariance of the method.

7.3.1 Definition (for one variable):

A function $f : \mathbb{R} \to \mathbb{R}$ is self-concordant when $f$ is convex and $|f'''(x)| \le 2 \left( f''(x) \right)^{3/2}$ for all $x \in \operatorname{dom} f$.

Significance: if Newton's method is applied to a quadratic function (whose Hessian is a constant matrix), it converges in one iteration. By extension, if the Hessian matrix does not change rapidly, Newton's method ought to converge rapidly. Changes in the second derivative can be measured using the third derivative; intuitively, the third derivative should be small relative to the second derivative. The self-concordance property reflects this requirement.

7.4 Concept of Duality

Earlier we saw the conditions to be satisfied at the optimal point of a general constrained optimization problem. These conditions do not help us find the solution except in very simple cases; they help us check whether the optimal point has been reached or not. We need some iterative procedure to solve the problem. To this end, corresponding to the problem

$\min_x f(x) \quad \text{s.t.} \quad g(x) \ge 0 \qquad (1.40)$

we show that it is equivalent to minimizing the Lagrangian function, without slack variables, in two steps. That is,

$\min_x f(x) \ \text{s.t.} \ g(x) \ge 0 \qquad (1.41)$

$= \min_x \ \max_{\mu \ge 0} \ L(x, \mu), \quad L(x, \mu) = f(x) - \mu \, g(x) \qquad (1.42)$

$= \min_x \ \max_{\mu \ge 0} \ \left( f(x) - \mu \, g(x) \right) \qquad (1.43)$

To understand the logic behind this, let us consider a case where the domain is $\mathbb{R}^2$. The inner maximization $\max_{\mu \ge 0} f(x) - \mu g(x)$ can be visualized as follows. At every point $x$ in the domain, we evaluate $f(x) - \mu g(x)$ for different values of $\mu \ge 0$. In the region where $g(x) > 0$, the maximum value possible at any $x$ is $f(x)$ itself; this is achieved for $\mu = 0$. In the region where $g(x) < 0$, the maximum value possible is infinity, and it is obtained for $\mu = \infty$, because in this region $g(x)$ is negative at any point $x$, and therefore $-\mu \, g(x)$ becomes infinite. Note that $\mu$ is not allowed to take negative values. In addition, on $g(x) = 0$, the highest possible value of the Lagrangian at any $x$ is $f(x)$.

Through this inner optimization process, we associate a value with every point in the domain. The value in the infeasible region is infinity, and the value at any feasible point $x$ is $f(x)$ itself. Now we apply the outer optimization (basically a search). This search finds the location $x^*$ to which the inner-loop computations assigned the minimum value. Though this is not the actual computational procedure we finally follow for finding the optimal point, readers can see that the logic of the formulation is right, and that proceeding this way would definitely end up finding the location of $x^*$. Now we are ready for one of the most important concepts in convex optimization theory: Lagrangian duality.
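The two-step view in (1.41)-(1.43) can be simulated on a one-dimensional toy problem of my own choosing (not from the text): $f(x) = x^2$, $g(x) = x - 1 \ge 0$, whose constrained minimizer is $x^* = 1$. A large finite grid of $\mu$ values stands in for $\mu \to \infty$:

```python
import numpy as np

f = lambda x: x**2
g = lambda x: x - 1                       # constraint g(x) >= 0
xs = np.linspace(-2.0, 3.0, 1001)
mus = np.linspace(0.0, 1e6, 101)          # finite proxy for mu in [0, inf)

# Inner maximization over mu, then outer minimization over x
inner = np.array([max(f(x) - mu * g(x) for mu in mus) for x in xs])
print(xs[np.argmin(inner)])               # ~1.0, the constrained minimizer
```

Infeasible points ($g(x) < 0$) receive a huge inner value, feasible points receive $f(x)$, and the outer search lands on $x^* = 1$, exactly as the argument above describes.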

    7.5 Lagrangian Duality

According to the Lagrangian duality concept, for a wide class of functions,

$\min_x \ \max_{\mu \ge 0} \ L(x, \mu) = \max_{\mu \ge 0} \ \min_x \ L(x, \mu) \qquad (1.44)$

That is, the order of maximization and minimization can be swapped.

The original problem $\min_x \max_{\mu \ge 0} L(x, \mu)$ is called the primal, and the swapped version $\max_{\mu \ge 0} \min_x L(x, \mu)$ is called the dual of the primal.

Why does swapping make sense? Consider a plot of the image of the domain of $x$ under the map $x \mapsto (g(x), f(x))$. The optimal primal solution lies on the ordinate, on the lower boundary of the image of the mapping.

Figure 1.21. Example of the image of the domain of $x$ under the map $x \mapsto (g(x), f(x))$.

In the dual problem, the Lagrangian $f(x) - \mu \, g(x)$ is minimized over $x$. On the graph, this is the y-intercept of the line with slope $\mu$ passing through the point $(g(x), f(x))$. The minimization finds the smallest such intercept, ranging over all $x$; this corresponds to the dual function. The subsequent maximization of the dual function takes the maximum of such y-intercepts. This yields the same point as the primal solution.


Appendix-1

Understanding Lagrangian Duality

Understanding the concept of duality is very important in the theory of support vector machines, because one rarely solves the optimization problem arising in SVMs in the primal, owing to the computational complexity involved. The main stumbling block a newcomer to this field faces is the concept of the Lagrange multiplier and Lagrangian duality. To facilitate an easy entry, we consider the duality concept in linear programming, which many are familiar with.

Duality in Linear Programming

Linear programming was developed as a discipline in the 1940s, motivated initially by the need to solve complex planning problems in wartime operations. Its development accelerated rapidly in the postwar period as many industries found valuable uses for linear programming. The founders of the subject are generally regarded as George B. Dantzig, who devised the simplex method in 1947, and John von Neumann, who established the theory of duality that same year. The Nobel Prize in economics was awarded in 1975 to the mathematician Leonid Kantorovich (USSR) and the economist Tjalling Koopmans (USA) for their contributions to the theory of optimal allocation of resources, in which linear programming played a key role. Many industries use linear programming as a standard tool, e.g., to allocate a finite set of resources in an optimal way. Examples of important application areas include airline crew scheduling, shipping and telecommunication networks, oil refining and blending, and stock and bond portfolio selection.

The most remarkable mathematical property of linear programs is the theory of duality. Duality in linear programming is essentially a unifying theory that develops the relationships between a given linear program and another related linear program stated in terms of dual variables. The intriguing feature is that both the primal and the dual have the same optimal value for their objective functions.

To understand the logic behind duality, let us consider two examples.

    To understand the logic behind duality, let us consider two examples.


Example 1. Given the linear program

$\min x_2 \quad \text{s.t.} \quad x_1 + x_2 \ge 8; \quad -3 x_1 + 2 x_2 \ge 6$

how do we lower bound the value of the optimum solution? That is, instead of solving the problem, can we say something about an upper/lower bound on the objective function using a linear combination of the constraints?

Multiplying the first constraint by 3 and adding it to the second gives $5 x_2 \ge 30$, which implies $x_2 \ge 6$. For any feasible solution, $x_2$ is at least 6.

Example 2. Given the linear program

$\max \ 5 x_1 + 6 x_2 + 9 x_3 + 8 x_4$
$\text{s.t.} \quad x_1 + 2 x_2 + 3 x_3 + x_4 \le 5$
$\qquad\ x_1 + x_2 + 2 x_3 + 3 x_4 \le 3$
$\qquad\ x_i \ge 0, \ i = 1, 2, 3, 4$

how do we upper bound the value of the optimum solution?

We choose $y_1, y_2 \ge 0$, multiply the first constraint by $y_1$ and the second by $y_2$, and add. The choice of $y_1$ and $y_2$ should be such that, in the resulting sum, the coefficient of $x_1$ is at least 5, the coefficient of $x_2$ at least 6, the coefficient of $x_3$ at least 9, and the coefficient of $x_4$ at least 8. Since $5 y_1 + 3 y_2$ is greater than or equal to this sum, the upper bound will be $5 y_1 + 3 y_2$. That is,

$y_1 (x_1 + 2 x_2 + 3 x_3 + x_4) + y_2 (x_1 + x_2 + 2 x_3 + 3 x_4) \le 5 y_1 + 3 y_2$


So if we choose $y_1$ and $y_2$ such that the following conditions are met:

$y_1 + y_2 \ge 5$
$2 y_1 + y_2 \ge 6$
$3 y_1 + 2 y_2 \ge 9$
$y_1 + 3 y_2 \ge 8$
$y_1, y_2 \ge 0$

then the solution gives an upper bound (by substituting $y_1$ and $y_2$ into $5 y_1 + 3 y_2$) on our original problem.

Then, to get a tight upper bound, we should choose $y_1$ and $y_2$ so as to minimize $5 y_1 + 3 y_2$ while satisfying the above constraints. Thus we obtain another optimization problem:

$\min \ 5 y_1 + 3 y_2$
$\text{s.t.} \quad y_1 + y_2 \ge 5$
$\qquad\ 2 y_1 + y_2 \ge 6$
$\qquad\ 3 y_1 + 2 y_2 \ge 9$
$\qquad\ y_1 + 3 y_2 \ge 8$
$\qquad\ y_1, y_2 \ge 0$

We call the above optimization problem the dual of the original primal problem.

Using Excel we get the following result:

$x_1^* = 1, \ x_2^* = 2, \ x_3^* = 0, \ x_4^* = 0$ for the primal, and $y_1^* = 1, \ y_2^* = 4$ for the dual,

so that $5 x_1^* + 6 x_2^* + 9 x_3^* + 8 x_4^* = 17 = 5 y_1^* + 3 y_2^*$.

In essence, we get another linear program with the same optimal objective function value. Another incidental advantage of the above dual problem is that we now have only two variables, though we have more constraints.
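The primal-dual pair can also be checked numerically. The sketch below uses scipy.optimize.linprog (an illustrative tool choice; the text used Excel) and recovers the same optimal values:

```python
from scipy.optimize import linprog

# Primal: max 5x1+6x2+9x3+8x4  (negate the objective: linprog minimizes)
c = [-5, -6, -9, -8]
A_ub = [[1, 2, 3, 1],
        [1, 1, 2, 3]]
b_ub = [5, 3]
primal = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 4)

# Dual: min 5y1+3y2  s.t.  A^T y >= c  (rewritten as -A^T y <= -c), y >= 0
dual = linprog([5, 3],
               A_ub=[[-1, -1], [-2, -1], [-3, -2], [-1, -3]],
               b_ub=[-5, -6, -9, -8],
               bounds=[(0, None)] * 2)

print(primal.x, -primal.fun)   # [1. 2. 0. 0.] 17.0
print(dual.x, dual.fun)        # [1. 4.] 17.0
```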


Getting the dual in the Lagrangian way

Lagrange (centuries before practical methods for solving LPs, the simplex and interior point methods, were invented) put the above procedure in the framework of calculus. The method goes as follows.

Suppose our objective is of maximization type. Lagrange tells us to form a new objective function by adding to the actual objective function the constraints multiplied by positive quantities (assuming that our constraints are inequalities), such that optimization of the new objective function gives an upper bound for the original optimization.

For example, for the following LP

$\max \ c^T x \quad \text{s.t.} \quad A x \le b, \ x \ge 0$

the Lagrangian function is

$L(x, y, \lambda) = c^T x + y^T (b - A x) + \lambda^T x, \quad y \ge 0, \ \lambda \ge 0$

We can easily prove that maximization of $L$ for given $y \ge 0$, $\lambda \ge 0$ gives an optimum value that is no lower than the original problem's optimum value.

We now take the derivative of the new objective function $L(x, y, \lambda)$ with respect to the primal variable $x$ and equate it to zero. Then we substitute the resulting expression back into the Lagrangian, so that the new objective function becomes devoid of the primal variable $x$:

$\dfrac{\partial L(x, y, \lambda)}{\partial x} = c - A^T y + \lambda = 0 \ \Rightarrow\ c = A^T y - \lambda$

Now

$L_D(y, \lambda) = (A^T y - \lambda)^T x + y^T (b - A x) + \lambda^T x = y^T b = b^T y$

This function we minimize with respect to $y$, subject to $c = A^T y - \lambda$. That is,


$\min_y L_D(y, \lambda) = b^T y$

such that $c = A^T y - \lambda$, $y \ge 0$, $\lambda \ge 0$.

This may also be written as (since $\lambda \ge 0$)

$\min_y L_D(y) = b^T y \quad \text{s.t.} \quad A^T y \ge c, \ y \ge 0$

This is the dual of the primal.

To make the concept more transparent, let us apply the method to an LP without packing the variables into vectors and the coefficients into a matrix.

Consider again the LP

$\max \ 5 x_1 + 6 x_2 + 9 x_3 + 8 x_4$
$\text{s.t.} \quad x_1 + 2 x_2 + 3 x_3 + x_4 \le 5$
$\qquad\ x_1 + x_2 + 2 x_3 + 3 x_4 \le 3$
$\qquad\ x_i \ge 0, \ i = 1, 2, 3, 4$

whose solution is given by $x_1^* = 1, \ x_2^* = 2, \ x_3^* = 0, \ x_4^* = 0$, with $5 x_1^* + 6 x_2^* + 9 x_3^* + 8 x_4^* = 17$.

Let us take the Lagrangian

$L(x, y, \lambda) = 5 x_1 + 6 x_2 + 9 x_3 + 8 x_4 + y_1 \left( 5 - (x_1 + 2 x_2 + 3 x_3 + x_4) \right) + y_2 \left( 3 - (x_1 + x_2 + 2 x_3 + 3 x_4) \right) + \lambda_1 x_1 + \lambda_2 x_2 + \lambda_3 x_3 + \lambda_4 x_4$

with the condition $y_1, y_2, \lambda_1, \lambda_2, \lambda_3, \lambda_4 \ge 0$.

Taking derivatives with respect to the primal variables:


$\dfrac{\partial L(x, y, \lambda)}{\partial x_1} = 5 - (y_1 + y_2) + \lambda_1 = 0$

$\dfrac{\partial L(x, y, \lambda)}{\partial x_2} = 6 - (2 y_1 + y_2) + \lambda_2 = 0$

$\dfrac{\partial L(x, y, \lambda)}{\partial x_3} = 9 - (3 y_1 + 2 y_2) + \lambda_3 = 0$

$\dfrac{\partial L(x, y, \lambda)}{\partial x_4} = 8 - (y_1 + 3 y_2) + \lambda_4 = 0$

In matrix form,

$\dfrac{\partial L(x, y, \lambda)}{\partial x} = \begin{pmatrix} 5 \\ 6 \\ 9 \\ 8 \end{pmatrix} - \begin{pmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 2 \\ 1 & 3 \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \end{pmatrix} + \begin{pmatrix} \lambda_1 \\ \lambda_2 \\ \lambda_3 \\ \lambda_4 \end{pmatrix} = 0 \quad \left[\, c - A^T y + \lambda = 0 \,\right]$

Substituting this back into the Lagrangian, we obtain $L_D(y) = 5 y_1 + 3 y_2$. We minimize this subject to the constraints (omitting $\lambda_1, \ldots, \lambda_4 \ge 0$):

$y_1 + y_2 \ge 5$
$2 y_1 + y_2 \ge 6$
$3 y_1 + 2 y_2 \ge 9$
$y_1 + 3 y_2 \ge 8$
$y_1, y_2 \ge 0$

So the dual problem is

$\min \ 5 y_1 + 3 y_2$
$\text{s.t.} \quad y_1 + y_2 \ge 5$
$\qquad\ 2 y_1 + y_2 \ge 6$
$\qquad\ 3 y_1 + 2 y_2 \ge 9$
$\qquad\ y_1 + 3 y_2 \ge 8$
$\qquad\ y_1, y_2 \ge 0$


In matrix form, the dual problem is

$\min_y L_D(y) = b^T y \quad \text{s.t.} \quad A^T y \ge c, \ y \ge 0$

Complementarity Conditions at the Optimal Point $(x^*, y^*, \lambda^*)$

For the primal and dual, the optimal values of the variables are

$x_1^* = 1, \ x_2^* = 2, \ x_3^* = 0, \ x_4^* = 0$
$y_1^* = 1, \ y_2^* = 4$
$\lambda_1^* = 0, \ \lambda_2^* = 0, \ \lambda_3^* = 2, \ \lambda_4^* = 5$

[Note: the values of the $\lambda$'s are obtained by substituting the optimal values of the $y$'s into the equations

$5 - (y_1 + y_2) + \lambda_1 = 0$
$6 - (2 y_1 + y_2) + \lambda_2 = 0$
$9 - (3 y_1 + 2 y_2) + \lambda_3 = 0$
$8 - (y_1 + 3 y_2) + \lambda_4 = 0$]

Let us substitute these optimal values into the Lagrangian:

$L(x^*, y^*, \lambda^*) = \underbrace{5 x_1^* + 6 x_2^* + 9 x_3^* + 8 x_4^*}_{=17} + \underbrace{y_1^*}_{=1} \underbrace{\left( 5 - (x_1^* + 2 x_2^* + 3 x_3^* + x_4^*) \right)}_{=0} + \underbrace{y_2^*}_{=4} \underbrace{\left( 3 - (x_1^* + x_2^* + 2 x_3^* + 3 x_4^*) \right)}_{=0} + \underbrace{\lambda_1^*}_{=0} \underbrace{x_1^*}_{=1} + \underbrace{\lambda_2^*}_{=0} \underbrace{x_2^*}_{=2} + \underbrace{\lambda_3^*}_{=2} \underbrace{x_3^*}_{=0} + \underbrace{\lambda_4^*}_{=5} \underbrace{x_4^*}_{=0}$

Notice that whenever a constraint expression is nonzero, its Lagrange multiplier is 0, and vice versa. In matrix notation, $(y^*)^T (b - A x^*) = 0$ and $(\lambda^*)^T x^* = 0$.


These conditions are called complementarity conditions.

Finally, we capture the entire Lagrangian duality theorem in the following single mathematical statement:

$\min_{y \ge 0, \, \lambda \ge 0} \ \max_{x} \ c^T x + y^T (b - A x) + \lambda^T x$

More on Lagrangian duality

For equality constraints we have:

Case (1)

$\max_{x \ge 0} f(x) \quad \text{s.t.} \quad g(x) = b$

The Lagrangian dual problem is constructed as

$\min_{\lambda} \ \max_{x \ge 0} \ f(x) + \lambda \left( g(x) - b \right)$, where $\lambda$ is unrestricted in sign.

Case (2)

$\min_{x \ge 0} f(x) \quad \text{s.t.} \quad g(x) = b$

The Lagrangian dual problem is constructed as

$\max_{\lambda} \ \min_{x \ge 0} \ f(x) + \lambda \left( g(x) - b \right)$, where $\lambda$ is unrestricted in sign.

Case (3)

$\max_{x \ge 0} f(x) \quad \text{s.t.} \quad g(x) \ge b$

The Lagrangian function is

$L(x, \lambda) = f(x) + \lambda \left( g(x) - b \right), \quad \lambda \ge 0$

We want $g(x) - b$ to be greater than or equal to zero. Multiplying it by a positive quantity $\lambda$ and adding it to the objective function gives a new objective function whose maximization leads to an upper bound on the original problem. The important point to note here is the sign of $\lambda$: only a positive multiplier leads to an upper bound. The solution obtained will be a function of $\lambda$; we then minimize this function w.r.t. $\lambda$ to get a tighter bound. The difference is called the duality gap. If the problem is a convex optimization problem, the duality gap will be zero.

So the Lagrangian dual is

$\min_{\lambda \ge 0} \ \max_{x \ge 0} \ f(x) + \lambda \left( g(x) - b \right)$

Case (4)

$\max_{x \ge 0} f(x) \quad \text{s.t.} \quad g(x) \le b$

At first, we convert the inequality into $\ge$ by rewriting $g(x) \le b$ as $b - g(x) \ge 0$. Hence we proceed as in the previous case. Therefore, the Lagrangian function and dual are given by

$L(x, \lambda) = f(x) + \lambda \left( b - g(x) \right), \quad \lambda \ge 0$

$\min_{\lambda \ge 0} \ \max_{x \ge 0} \ f(x) + \lambda \left( b - g(x) \right)$

Case (5)

$\min_{x \ge 0} f(x) \quad \text{s.t.} \quad g(x) \le b$

The Lagrangian function is

$L(x, \lambda) = f(x) + \lambda \left( g(x) - b \right), \quad \lambda \ge 0$

We want $g(x) - b$ to be less than or equal to zero. Multiplying it by a positive quantity and adding it to the objective function gives a new objective function whose minimization leads to a lower bound on the original problem. The important point to note here is the sign of $\lambda$: only a positive multiplier, multiplying $g(x) - b$, leads to a lower bound. The solution obtained will be a function of $\lambda$; we then maximize this function w.r.t. $\lambda$ to get a tighter bound. The difference is called the duality gap. If the problem is a convex optimization problem, the duality gap will be zero.

So the Lagrangian dual is

$\max_{\lambda \ge 0} \ \min_{x \ge 0} \ f(x) + \lambda \left( g(x) - b \right)$

Case (6)

$\min_{x \ge 0} f(x) \quad \text{s.t.} \quad g(x) \ge b$

At first, we convert the inequality into $\le$ by rewriting $g(x) \ge b$ as $b - g(x) \le 0$. Then we proceed as in the previous case. The Lagrangian function and dual are given by

$L(x, \lambda) = f(x) + \lambda \left( b - g(x) \right), \quad \lambda \ge 0$

$\max_{\lambda \ge 0} \ \min_{x \ge 0} \ f(x) + \lambda \left( b - g(x) \right)$

General Case

$\min_{x \ge 0} f(x) \quad \text{s.t.} \quad g_i(x) \le b_i, \ i = 1, 2, \ldots, m; \quad h_j(x) = a_j, \ j = 1, 2, \ldots, n$

$L(x, \lambda, \mu) = f(x) + \sum_{i=1}^{m} \lambda_i \left( g_i(x) - b_i \right) + \sum_{j=1}^{n} \mu_j \left( h_j(x) - a_j \right), \quad \lambda \ge 0$

The Lagrangian dual is

$\max_{\lambda \ge 0, \, \mu} \ \min_{x \ge 0} \ f(x) + \sum_{i=1}^{m} \lambda_i \left( g_i(x) - b_i \right) + \sum_{j=1}^{n} \mu_j \left( h_j(x) - a_j \right)$

Note that $\mu$ is unrestricted in sign and $\lambda \ge 0$.

General Case with unconstrained inner optimization

$\min_{x} f(x) \quad \text{s.t.} \quad g_i(x) \le b_i, \ i = 1, 2, \ldots, m; \quad h_j(x) = a_j, \ j = 1, 2, \ldots, n; \quad x \ge 0$

Here we form the Lagrangian with multipliers given to $x \ge 0$ as well:

$L(x, \lambda, \mu, \eta) = f(x) + \sum_{i=1}^{m} \lambda_i \left( g_i(x) - b_i \right) + \sum_{j=1}^{n} \mu_j \left( h_j(x) - a_j \right) - \eta^T x, \quad \lambda \ge 0, \ \eta \ge 0$

The Lagrangian dual is

$\max_{\lambda \ge 0, \, \mu, \, \eta \ge 0} \ \min_{x} \ f(x) + \sum_{i=1}^{m} \lambda_i \left( g_i(x) - b_i \right) + \sum_{j=1}^{n} \mu_j \left( h_j(x) - a_j \right) - \eta^T x$

When does a Lagrange multiplier take a zero (or nonzero) value in the case of inequality constraints?

To answer this question, let us consider the problem

$\min_x f(x) \quad \text{s.t.} \quad g_i(x) \le b_i, \ i = 1, 2, \ldots, m$

We add a positive quantity (a variable quantity) $s_i^2$ to each constraint and make each constraint an equality constraint. Thus the optimization problem becomes


$\min_x f(x) \quad \text{s.t.} \quad g_i(x) + s_i^2 = b_i, \ i = 1, 2, \ldots, m$

The Lagrangian is given by

$L(x, s, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i \left( g_i(x) + s_i^2 - b_i \right)$

On differentiating with respect to the primal variables and equating to zero, we obtain the first order necessary conditions for optimality:

$\dfrac{\partial L}{\partial x} = \nabla f(x) + \sum_{i=1}^{m} \lambda_i \nabla g_i(x) = 0 \qquad (1)$

$\dfrac{\partial L}{\partial s_i} = 2 \lambda_i s_i = 0 \qquad (2)$

The second relation implies that when a slack variable is nonzero, the corresponding Lagrange multiplier must necessarily be zero. Note that a slack variable becomes zero when the constraint becomes active, that is, when the constraint becomes an equality and is on the verge of violation. So, at the optimal point, if a constraint is active, then the corresponding Lagrange multiplier is nonzero. This also follows from the fact that the Lagrange multiplier is the rate of change of the objective function when an active constraint is relaxed to accommodate more search space.

This condition also implies that, at the optimal point $x^*$,

$\lambda_i \left( g_i(x^*) - b_i \right) = 0, \quad i = 1, 2, \ldots, m$

This condition is called the complementarity condition.

What are the KKT conditions for the following optimization problem?

$\min_x f(x) \quad \text{s.t.} \quad g_i(x) + s_i^2 = b_i, \ i = 1, 2, \ldots, m$

The Lagrangian is given by


$L(x, s, \lambda) = f(x) + \sum_{i=1}^{m} \lambda_i \left( g_i(x) + s_i^2 - b_i \right)$

$\dfrac{\partial L}{\partial x} = \nabla f(x) + \sum_{i=1}^{m} \lambda_i \nabla g_i(x) = 0$

$\dfrac{\partial L}{\partial s_i} = 2 \lambda_i s_i = 0$

$\dfrac{\partial L}{\partial \lambda_i} = g_i(x) + s_i^2 - b_i = 0$

We can rewrite these conditions without the slack variables as

$\nabla f(x) = -\sum_{i=1}^{m} \lambda_i \nabla g_i(x)$

$\lambda_i \left( g_i(x) - b_i \right) = 0, \ i = 1, 2, \ldots, m$

$g_i(x) \le b_i, \ i = 1, 2, \ldots, m$

These conditions are called the KKT conditions at the optimal point.

    KKT Conditions for a General Case

    0 (

    s.t , i= 1,2,..,m , = 1,2,..,n

    i i

    j j

    Min f )

    g ( ) bh ( ) a j

    =

    xx

    x x

    Subtracting a positive slack variable from each of the inequality constraints we obtain

    0 (

    s.t , i= 1,2,..,m , = 1,2,..,n

    2i i i

    j j

    Min f )

    g ( ) s bh ( ) a j

    =

    =

    xx

    x x

    On taking Lagrangian, we obtain

    ( ) ( )1 1

    ( , ) ( m n

    2i i i i j j j

    i jL f ) b s g ( ) a h ( )

    = =

    = + + + x, x x x

The first order necessary conditions are

$$
\frac{\partial L}{\partial x} = \nabla f(x) + \sum_{i=1}^{m} \lambda_i \nabla g_i(x) + \sum_{j=1}^{n} \mu_j \nabla h_j(x) = 0
$$

$$
\frac{\partial L}{\partial s_i} = 2 \lambda_i s_i = 0
$$

$$
\frac{\partial L}{\partial \lambda_i} = g_i(x) + s_i^2 - b_i = 0
$$

$$
\frac{\partial L}{\partial \mu_j} = h_j(x) - a_j = 0
$$

Without the slack variables, these conditions reduce to

$$
\nabla f(x) + \sum_{i=1}^{m} \lambda_i \nabla g_i(x) + \sum_{j=1}^{n} \mu_j \nabla h_j(x) = 0
$$

$$
\lambda_i \big( g_i(x) - b_i \big) = 0, \quad i = 1, 2, \dots, m
$$

$$
g_i(x) \le b_i, \quad \lambda_i \ge 0, \quad i = 1, 2, \dots, m
$$

$$
h_j(x) = a_j, \quad j = 1, 2, \dots, n
$$

Very Important Note: Usually in the Lagrangian formulation we do not add slack variables, so care must be taken in writing the KKT conditions. Simply differentiating the Lagrangian with respect to the primal and dual variables and equating to zero leads to wrong KKT conditions.
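As a sketch of how these conditions can be checked numerically (a hypothetical example using SciPy's SLSQP solver; the problem data and the multiplier-recovery step are our own illustration, not from the text):

```python
import numpy as np
from scipy.optimize import minimize

# Minimize (x1-2)^2 + (x2-1)^2 subject to g(x) = x1^2 + x2^2 - 1 <= 0
f = lambda x: (x[0] - 2.0)**2 + (x[1] - 1.0)**2
g = lambda x: x[0]**2 + x[1]**2 - 1.0

# SLSQP expects inequality constraints in the form fun(x) >= 0
res = minimize(f, x0=np.zeros(2), method='SLSQP',
               constraints=[{'type': 'ineq', 'fun': lambda x: -g(x)}])
x = res.x

# Recover the multiplier from stationarity: grad f + lambda * grad g = 0
grad_f = 2.0 * (x - np.array([2.0, 1.0]))
grad_g = 2.0 * x
lam = -(grad_f @ grad_g) / (grad_g @ grad_g)

print(x)                 # approximately [2, 1]/sqrt(5): constraint active
print(lam, lam * g(x))   # lambda > 0 and lambda*g(x) ~ 0 (complementarity)
```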

Example 3: Find the dual of the following LP using the Lagrangian dual.

$$
\max_{x} \; c^T x \quad \text{s.t. } A x = b, \; x \ge 0
$$

$$
L(x, y, s) = c^T x + y^T (b - A x) + s^T x, \qquad s \ge 0
$$

$$
\frac{\partial L}{\partial x} = c - A^T y + s = 0 \; \Rightarrow \; A^T y - s = c
$$

Substituting $A^T y - s = c$ in the Lagrangian, we obtain

$$
c^T x + y^T (b - A x) + s^T x = (A^T y - s)^T x + y^T b - y^T A x + s^T x = y^T b = b^T y
$$

Therefore the Lagrangian dual is

$$
\min_{y} \; b^T y \quad \text{s.t. } A^T y - s = c, \; s \ge 0
$$

or

$$
\min_{y} \; b^T y \quad \text{s.t. } A^T y \ge c
$$

Example 4: Find the dual of the following LP using the Lagrangian dual.

$$
\max_{x} \; c^T x \quad \text{s.t. } A x \le b, \; x \ge 0
$$

The Lagrangian is

$$
L(x, y, s) = c^T x + y^T (b - A x) + s^T x, \qquad y \ge 0, \; s \ge 0
$$

$$
\frac{\partial L}{\partial x} = c - A^T y + s = 0 \; \Rightarrow \; c - A^T y = -s
$$

On substituting $c - A^T y = -s$ in the Lagrangian, we obtain

$$
c^T x + y^T (b - A x) + s^T x = (A^T y - s)^T x + y^T b - y^T A x + s^T x = y^T b = b^T y
$$

Therefore the Lagrangian dual is

$$
\min_{y \ge 0} \; b^T y \quad \text{s.t. } A^T y - s = c, \; s \ge 0 \qquad \text{or} \qquad \min_{y \ge 0} \; b^T y \quad \text{s.t. } A^T y \ge c
$$
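A quick numerical sanity check of this duality (a hypothetical example; the data and the use of scipy.optimize.linprog are our own illustration, not from the text):

```python
import numpy as np
from scipy.optimize import linprog

A = np.array([[1.0, 2.0], [3.0, 1.0]])
b = np.array([4.0, 6.0])
c = np.array([3.0, 2.0])

# Primal: max c^T x s.t. Ax <= b, x >= 0 (linprog minimizes, so negate c;
# the default variable bounds already enforce x >= 0)
primal = linprog(-c, A_ub=A, b_ub=b)

# Dual: min b^T y s.t. A^T y >= c, y >= 0 (rewritten as -A^T y <= -c)
dual = linprog(b, A_ub=-A.T, b_ub=-c)

print(-primal.fun, dual.fun)   # optimal values coincide (strong duality)
```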


Newton's Method

Newton's method, or the Newton-Raphson method, is known to perform better than the algorithms discussed previously on quadratic functions. The previous methods use only first derivatives to select a search direction. If higher derivatives are used, the algorithm becomes more effective, and that is what Newton's method does by involving the second derivative in finding the search direction. Newton's method, however, is only locally convergent, and hence the initial point has to be reasonably close to the minimizer for good convergence. If the objective function is quadratic, the algorithm converges in one step to the true minimizer; if the function is non-quadratic, it provides only an estimate of the position of the exact minimizer.

We can obtain a quadratic approximation to the given twice continuously differentiable objective function $f$ using the Taylor series expansion of $f$ about a point $x_i$, neglecting terms of order three and above:

$$
f(x) \approx f(x_i) + (x - x_i)^T f'(x_i) + \frac{1}{2} (x - x_i)^T f''(x_i) (x - x_i)
$$

Applying the first order necessary condition for optimality, we get

$$
f'(x) = f'(x_i) + f''(x_i)(x - x_i) = 0
$$

Let $g_i = f'(x_i)$ and $h_i = f''(x_i)$. If $h_i > 0$, the next iterate can be obtained as

$$
x_{i+1} = x_i - \alpha_i h_i^{-1} g_i
$$

where $\alpha_i$ is the step length ($\alpha_i = 1$ gives the pure Newton step). The convergence criterion can be a zero value for the gradient or any of the others that we have discussed previously.

Let us see how this algorithm works through an example. Consider the problem of minimizing Powell's function

$$
f(x_1, x_2, x_3, x_4) = (x_1 + 10 x_2)^2 + 5 (x_3 - x_4)^2 + (x_2 - 2 x_3)^4 + 10 (x_1 - x_4)^4
$$

The gradient and the Hessian matrix of the function are calculated as

$$
\nabla f(x_1, x_2, x_3, x_4) =
\begin{bmatrix}
2 (x_1 + 10 x_2) + 40 (x_1 - x_4)^3 \\
20 (x_1 + 10 x_2) + 4 (x_2 - 2 x_3)^3 \\
10 (x_3 - x_4) - 8 (x_2 - 2 x_3)^3 \\
-10 (x_3 - x_4) - 40 (x_1 - x_4)^3
\end{bmatrix}
$$

$$
\nabla^2 f(x_1, x_2, x_3, x_4) =
\begin{bmatrix}
2 + 120 (x_1 - x_4)^2 & 20 & 0 & -120 (x_1 - x_4)^2 \\
20 & 200 + 12 (x_2 - 2 x_3)^2 & -24 (x_2 - 2 x_3)^2 & 0 \\
0 & -24 (x_2 - 2 x_3)^2 & 10 + 48 (x_2 - 2 x_3)^2 & -10 \\
-120 (x_1 - x_4)^2 & 0 & -10 & 10 + 120 (x_1 - x_4)^2
\end{bmatrix}
$$

Taking $x_0 = [1, 1, 0, -1]^T$ as our starting point, $f(x_0) = 287$ and we get

$$
g_0 = [342, \; 224, \; 2, \; -330]^T
$$

$$
h_0 =
\begin{bmatrix}
482 & 20 & 0 & -480 \\
20 & 212 & -24 & 0 \\
0 & -24 & 58 & -10 \\
-480 & 0 & -10 & 490
\end{bmatrix},
\qquad
h_0^{-1} =
\begin{bmatrix}
0.1126 & -0.0089 & 0.0154 & 0.1106 \\
-0.0089 & 0.0057 & 0.0008 & -0.0087 \\
0.0154 & 0.0008 & 0.0203 & 0.0155 \\
0.1106 & -0.0087 & 0.0155 & 0.1107
\end{bmatrix}
$$

$$
h_0^{-1} g_0 = [0.0476, \; 1.0952, \; 0.3810, \; -0.6190]^T
$$

$$
x_1 = x_0 - h_0^{-1} g_0 = [0.9524, \; -0.0952, \; -0.3810, \; -0.3810]^T
$$

Proceeding in this manner, we obtain the following results until the algorithm converges:

Iteration 1: x = [0.9524, -0.0952, -0.3810, -0.3810]^T, f(x) = 31.8089
Iteration 2: x = [0.6349, -0.0635, -0.2540, -0.2540]^T, f(x) = 6.2823
Iteration 3: x = [0.4233, -0.0423, -0.1693, -0.1693]^T, f(x) = 1.2409
Iteration 4: x = [0.2822, -0.0282, -0.1129, -0.1129]^T, f(x) = 0.2452
Iteration 5: x = [0.1881, -0.0188, -0.0753, -0.0753]^T, f(x) = 0.0484
Iteration 6: x = [0.1254, -0.0125, -0.0502, -0.0502]^T, f(x) = 0.0096

The iterations continue until the minimum is reached at the point [0, 0, 0, 0]^T.
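A minimal sketch of this Newton iteration on Powell's function, using the gradient and Hessian derived above (Python with NumPy is assumed); it should reproduce the iterates in the table:

```python
import numpy as np

def grad(x):
    x1, x2, x3, x4 = x
    return np.array([2*(x1 + 10*x2) + 40*(x1 - x4)**3,
                     20*(x1 + 10*x2) + 4*(x2 - 2*x3)**3,
                     10*(x3 - x4) - 8*(x2 - 2*x3)**3,
                     -10*(x3 - x4) - 40*(x1 - x4)**3])

def hess(x):
    x1, x2, x3, x4 = x
    a = 120*(x1 - x4)**2          # recurring quartic curvature terms
    b = 12*(x2 - 2*x3)**2
    return np.array([[2 + a,  20,       0,        -a ],
                     [20,     200 + b, -2*b,       0 ],
                     [0,     -2*b,      10 + 4*b, -10],
                     [-a,     0,       -10,   10 + a ]])

x = np.array([1.0, 1.0, 0.0, -1.0])
for k in range(6):
    # Pure Newton step: solve h(x) d = g(x), then x <- x - d
    x = x - np.linalg.solve(hess(x), grad(x))
    print(k + 1, np.round(x, 4))
```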

Analysis of Newton's Method

There is no guarantee that the Newton direction points toward decreasing values of the objective function if $h_i$ is not positive definite. Moreover, even if $h_i > 0$, the full Newton step may fail to decrease the function value; some remedial measures are discussed afterwards. Despite these drawbacks, Newton's method has superior convergence when the starting point is near the minimizer.

The convergence analysis of Newton's method when the objective function is quadratic is straightforward. Let

$$
f(x) = \frac{1}{2} x^T A x - b^T x + c, \qquad g(x) = A x - b, \qquad h(x) = A
$$

Given any initial point $x_0$,

$$
x_1 = x_0 - h_0^{-1} g_0 = x_0 - A^{-1} (A x_0 - b) = A^{-1} b = x^*
$$

where $x^*$ is the true minimizer. Thus the algorithm converges in a single step for a quadratic function, irrespective of the starting point, and the order of convergence for this case is infinity. For general cases the order of convergence is at least 2.

Newton's method has superior convergence properties if the starting point is near the solution; it is not guaranteed to converge if we start far away, and it may not even be well defined because of singularity of the Hessian matrix. The other drawbacks of the method are that evaluating the Hessian matrix in large dimensions can be computationally expensive, and a set of n linear equations has to be solved to obtain the search direction in each iteration. The main difficulty arises from the Hessian matrix not being positive definite. We will see a method to overcome this difficulty in the next section.

Levenberg-Marquardt Modification

If the Hessian matrix $h_i$ is not positive definite, the search direction $d_i = -h_i^{-1} g_i$ may not point in a descent direction. A simple technique to overcome this is the Levenberg-Marquardt modification of Newton's algorithm:

$$
x_{i+1} = x_i - (h_i + \mu_i I)^{-1} g_i
$$

where $\mu_i \ge 0$.

The underlying idea of this modification is as follows. Consider a symmetric matrix $h$, which may not be positive definite. Let $\lambda_1, \lambda_2, \dots, \lambda_n$ be the eigenvalues of $h$ with respective eigenvectors $v_1, v_2, \dots, v_n$. The eigenvalues are real, but not all of them may be positive. Now consider the matrix $G = h + \mu I$ with $\mu \ge 0$. The eigenvalues of $G$ are $\lambda_1 + \mu, \lambda_2 + \mu, \dots, \lambda_n + \mu$. In fact,

$$
G v_i = (h + \mu I) v_i = h v_i + \mu v_i = \lambda_i v_i + \mu v_i = (\lambda_i + \mu) v_i
$$

This means that $v_i$ is also an eigenvector of $G$, with eigenvalue $\lambda_i + \mu$. Therefore, if $\mu$ is sufficiently large, all eigenvalues of $G$ are positive and $G$ is positive definite. The search direction will then always point in a descent direction. With a step length $\alpha_i$, the iteration is

$$
x_{i+1} = x_i - \alpha_i (h_i + \mu_i I)^{-1} g_i
$$

The Levenberg-Marquardt modification of Newton's algorithm approaches the behaviour of the pure Newton's method as $\mu$ tends to zero, and as $\mu$ tends to infinity it approaches a pure gradient method with a small step size. In practice, we may start with a small value of $\mu$ and then slowly increase it until the iteration produces descent.
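A minimal sketch of one Levenberg-Marquardt-modified step (our own illustration; grad and hess are assumed callables as in the Newton example above, and the inflate-and-retry policy for $\mu$ is one simple choice among many):

```python
import numpy as np

def lm_step(f, grad, hess, x, mu=1e-4):
    """One Levenberg-Marquardt-modified Newton step from x."""
    g, h = grad(x), hess(x)
    n = len(x)
    for _ in range(50):               # cap the number of mu inflations
        try:
            # Cholesky succeeds only when h + mu*I is positive definite
            np.linalg.cholesky(h + mu * np.eye(n))
            d = np.linalg.solve(h + mu * np.eye(n), -g)
            if f(x + d) < f(x):       # accept only a descent step
                return x + d, mu
        except np.linalg.LinAlgError:
            pass                      # h + mu*I not positive definite yet
        mu *= 10.0                    # inflate mu and try again
    return x, mu                      # fall back: no progress found
```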

Newton's Method for Nonlinear Least Squares

Suppose we are given m measurements of a process at m points in time. Let $t_1, t_2, \dots, t_m$ be the measurement times and $y_1, y_2, \dots, y_m$ the measurement values. We want to fit a sinusoidal curve to the measurements so as to predict the process at a given time. The equation of the sinusoid is

$$
y = A \sin(\omega t + \phi)
$$

We have to find the values of the parameters $A$, $\omega$, and $\phi$ that minimize the error between the actual and the predicted values of the measurements. We can construct the objective function as

$$
\text{minimize} \sum_{i=1}^{m} \big( y_i - A \sin(\omega t_i + \phi) \big)^2
$$

This type of problem is known as a nonlinear least squares problem. In general, such problems are defined as

$$
\text{minimize} \sum_{i=1}^{m} \big( f_i(x) \big)^2
$$

where the $f_i(x)$ are given functions.

We will see how we can apply Newton's method to the example problem. Let $x = [A, \omega, \phi]^T$ be the vector of decision variables and define

$$
r_i(x) = y_i - A \sin(\omega t_i + \phi)
$$

Defining $r = [r_1, r_2, \dots, r_m]^T$, the objective function can be expressed as

$$
f(x) = r(x)^T r(x)
$$

To apply Newton's method we need to compute the gradient and the Hessian of $f$. The $j$th component of the gradient is

$$
\big( \nabla f(x) \big)_j = \frac{\partial f}{\partial x_j}(x) = 2 \sum_{i=1}^{m} r_i(x) \frac{\partial r_i}{\partial x_j}(x)
$$

Denoting the Jacobian matrix of $r$ by

$$
J(x) =
\begin{bmatrix}
\dfrac{\partial r_1}{\partial x_1}(x) & \cdots & \dfrac{\partial r_1}{\partial x_n}(x) \\
\vdots & & \vdots \\
\dfrac{\partial r_m}{\partial x_1}(x) & \cdots & \dfrac{\partial r_m}{\partial x_n}(x)
\end{bmatrix}
$$

the gradient can be represented as

$$
\nabla f(x) = 2 J(x)^T r(x)
$$

To compute the Hessian matrix, the $(k, j)$th component is

$$
\frac{\partial^2 f}{\partial x_k \partial x_j}(x)
= \frac{\partial}{\partial x_k} \left( \frac{\partial f}{\partial x_j}(x) \right)
= 2 \sum_{i=1}^{m} \left( \frac{\partial r_i}{\partial x_k}(x) \frac{\partial r_i}{\partial x_j}(x) + r_i(x) \frac{\partial^2 r_i}{\partial x_k \partial x_j}(x) \right)
$$

Let $S(x)$ be the matrix whose $(k, j)$th component is

$$
\sum_{i=1}^{m} r_i(x) \frac{\partial^2 r_i}{\partial x_k \partial x_j}(x)
$$

The Hessian can now be written as

$$
H(x) = 2 \big( J(x)^T J(x) + S(x) \big)
$$

Therefore, Newton's method applied to the nonlinear least squares problem is given by

$$
x_{i+1} = x_i - \big( J(x)^T J(x) + S(x) \big)^{-1} J(x)^T r(x)
$$

If the second derivative terms are considerably small, the matrix $S(x)$ can be neglected, in which case Newton's method reduces to the Gauss-Newton method:

$$
x_{i+1} = x_i - \big( J(x)^T J(x) \big)^{-1} J(x)^T r(x)
$$

In the Gauss-Newton method too, the matrix $J(x)^T J(x)$ may sometimes not be positive definite, and as before the Levenberg-Marquardt modification can be applied by adding a positive $\mu I$ term to it. An alternative interpretation of the Levenberg-Marquardt algorithm is to view the term $\mu I$ as an approximation to $S(x)$ in Newton's method.
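A minimal Gauss-Newton sketch for the sinusoid-fitting problem (our own illustration with synthetic data; the residual and its Jacobian follow the derivation above):

```python
import numpy as np

def residual(x, t, y):
    A, w, phi = x
    return y - A * np.sin(w * t + phi)

def jacobian(x, t):
    A, w, phi = x
    # Columns: dr_i/dA, dr_i/dw, dr_i/dphi for r_i = y_i - A sin(w t_i + phi)
    return np.column_stack([-np.sin(w * t + phi),
                            -A * t * np.cos(w * t + phi),
                            -A * np.cos(w * t + phi)])

# Synthetic measurements from y = 2 sin(1.5 t + 0.5) plus small noise
rng = np.random.default_rng(0)
t = np.linspace(0.0, 4.0, 40)
y = 2.0 * np.sin(1.5 * t + 0.5) + 0.05 * rng.standard_normal(t.size)

x = np.array([1.5, 1.2, 0.2])          # initial guess for [A, w, phi]
for _ in range(20):
    r = residual(x, t, y)
    J = jacobian(x, t)
    # Gauss-Newton step: solve (J^T J) d = -J^T r
    x = x + np.linalg.solve(J.T @ J, -J.T @ r)
print(x)                                # should approach [2.0, 1.5, 0.5]
```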

Quasi-Newton Methods

In Newton's method, we have to calculate the Hessian of the objective function at each iteration. When the dimension of the problem is high and the Hessian is difficult to calculate, computing the inverse Hessian may take considerable time. Quasi-Newton methods avoid computing the Hessian while retaining the fast local convergence of Newton's method. The idea is to build an approximation to the inverse Hessian that is computationally economical, and to update this approximation at each iteration so that it retains some of the properties of the true inverse Hessian, the most prominent one being positive definiteness, which guarantees a descent direction.

Quasi-Newton Condition

Suppose that we have already calculated $\nabla f(x_i)$, $\nabla f(x_{i+1})$ and $\nabla^2 f(x_{i+1})$. From the Taylor series expansion we can write

$$
\nabla f(x_i) = \nabla f(x_{i+1}) + \nabla^2 f(x_{i+1}) (x_i - x_{i+1}) + o\big( \| x_i - x_{i+1} \| \big)
$$

Neglecting the higher order term and denoting the inverse Hessian by $H$, we can rewrite the previous equation as

$$
H_{i+1} \big( \nabla f(x_{i+1}) - \nabla f(x_i) \big) = x_{i+1} - x_i
$$

This is called the quasi-Newton condition. In the general form we write it as

$$
H_{i+1} \gamma_i = \delta_i, \qquad \text{where } \gamma_i = \nabla f(x_{i+1}) - \nabla f(x_i) \text{ and } \delta_i = x_{i+1} - x_i
$$

Specifically, quasi-Newton methods have the form

$$
d_i = -H_i g_i, \qquad \alpha_i = \arg\min_{\alpha \ge 0} f(x_i + \alpha d_i), \qquad x_{i+1} = x_i + \alpha_i d_i
$$

Quasi-Newton methods are in a sense conjugate direction methods, since for quadratics the search directions generated are A-conjugate. There are some specific updating formulae for the inverse Hessian:

- Broyden's method
- DFP method
- BFGS method

Broyden's Method

Broyden's method is a rank one correction formula for obtaining the inverse Hessian $H_{i+1}$ from $H_i$. We write it as

$$
H_{i+1} = H_i + \alpha_i z_i z_i^T
$$

This is called a rank one correction since

$$
\operatorname{rank}\big( z_i z_i^T \big) = 1
$$

for any nonzero $z_i$, which can be verified by substituting any value for $z_i$. This is sometimes called the single-rank symmetric (SRS) algorithm. It can also be observed that if $H_i$ is symmetric then $H_{i+1}$ will also be symmetric. Now the task is to find $\alpha_i$ and $z_i$ such that the quasi-Newton condition is satisfied.

$$
H_{i+1} \gamma_i = \big( H_i + \alpha_i z_i z_i^T \big) \gamma_i = \delta_i
$$

Since $z_i^T \gamma_i$ is a scalar, $z_i$ can be expressed as

$$
z_i = \frac{\delta_i - H_i \gamma_i}{\alpha_i \big( z_i^T \gamma_i \big)}
$$

Hence,

$$
H_{i+1} = H_i + \frac{\big( \delta_i - H_i \gamma_i \big) \big( \delta_i - H_i \gamma_i \big)^T}{\alpha_i \big( z_i^T \gamma_i \big)^2}
$$

Now, to eliminate $z_i$, we premultiply $\delta_i - H_i \gamma_i = \alpha_i z_i \big( z_i^T \gamma_i \big)$ by $\gamma_i^T$ to obtain

$$
\gamma_i^T \big( \delta_i - H_i \gamma_i \big) = \alpha_i \big( z_i^T \gamma_i \big)^2
$$

Substituting this relation gives

$$
H_{i+1} = H_i + \frac{\big( \delta_i - H_i \gamma_i \big) \big( \delta_i - H_i \gamma_i \big)^T}{\gamma_i^T \big( \delta_i - H_i \gamma_i \big)}
$$

The Broyden's algorithm

Step 1: Set $i = 0$ and select $x_0$ and a real symmetric positive definite $H_0$.

Step 2: If $g_i = 0$, stop; else $d_i = -H_i g_i$.

Step 3: Compute $\alpha_i = \arg\min_{\alpha \ge 0} f(x_i + \alpha d_i)$ and $x_{i+1} = x_i + \alpha_i d_i$.

Step 4: Compute $\delta_i = x_{i+1} - x_i$, $\gamma_i = \nabla f(x_{i+1}) - \nabla f(x_i)$ and

$$
H_{i+1} = H_i + \frac{\big( \delta_i - H_i \gamma_i \big) \big( \delta_i - H_i \gamma_i \big)^T}{\gamma_i^T \big( \delta_i - H_i \gamma_i \big)}
$$

Step 5: Set $i = i + 1$ and go to Step 2.
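A minimal sketch of this loop (our own illustration; a backtracking line search stands in for the exact line search of Step 3, and the update is skipped when its denominator is nearly zero, a standard safeguard not spelled out in the text):

```python
import numpy as np

def broyden_sr1(f, grad, x, tol=1e-8, max_iter=200):
    H = np.eye(len(x))                 # Step 1: H0 symmetric positive definite
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:    # Step 2: stop at a stationary point
            break
        d = -H @ g
        alpha = 1.0                    # Step 3: backtracking line search
        while f(x + alpha * d) > f(x) + 1e-4 * alpha * (g @ d):
            alpha *= 0.5
        x_new = x + alpha * d
        g_new = grad(x_new)
        delta, gamma = x_new - x, g_new - g    # Step 4: rank one update
        u = delta - H @ gamma
        denom = gamma @ u
        if abs(denom) > 1e-12:         # safeguard: skip ill-defined updates
            H = H + np.outer(u, u) / denom
        x, g = x_new, g_new            # Step 5: next iteration
    return x
```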

This algorithm also has the disadvantage that the updated matrix may fail to remain positive definite. So we proceed to a rank two correction formula, namely the DFP algorithm.

Example

Find the minimizer of the function $f(x, y) = 1.5 x^2 + y^2 + 5$.

The function can be written as $f(x, y) = 0.5 X^T A X + 5$ with

$$
A = \begin{bmatrix} 3 & 0 \\ 0 & 2 \end{bmatrix}, \qquad X = \begin{bmatrix} x \\ y \end{bmatrix}
$$

Take $X_0 = [1, 2]^T$. Then $g(x, y) = [3x, 2y]^T$, so $g_0 = [3, 4]^T$. Let $H_0$ be an identity matrix of order 2. The search direction is

$$
d_0 = -H_0 g_0 = [-3, -4]^T
$$

and the exact step length for a quadratic is

$$
\alpha_0 = -\frac{g_0^T d_0}{d_0^T A d_0} = 0.4237
$$

$$
X_1 = X_0 + \alpha_0 d_0 = [-0.2711, \; 0.3052]^T, \qquad \delta_0 = X_1 - X_0 = \alpha_0 d_0 = [-1.2712, \; -1.6949]^T
$$

$$
g_1 = [-0.8133, \; 0.6104]^T, \qquad \gamma_0 = g_1 - g_0 = [-3.8133, \; -3.3896]^T
$$

$$
H_1 = H_0 + \frac{\big( \delta_0 - H_0 \gamma_0 \big) \big( \delta_0 - H_0 \gamma_0 \big)^T}{\gamma_0^T \big( \delta_0 - H_0 \gamma_0 \big)} =
\begin{bmatrix} 0.5814 & -0.2791 \\ -0.2791 & 0.8140 \end{bmatrix}
$$

$$
d_1 = -H_1 g_1 = [0.6432, \; -0.7238]^T, \qquad \alpha_1 = 0.4216
$$

$$
X_2 = X_1 + \alpha_1 d_1 = [0, \; 0]^T
$$

Since this is a quadratic problem in two variables, the solution is reached at the second iteration itself. The solution can be verified from the contour plot of the function.

[Figure: surface plot and contour plot of $f(x, y)$]

DFP Algorithm

As mentioned previously, the DFP algorithm uses a rank two update:

$$
H_{i+1} = H_i + \alpha_i z_i z_i^T + \beta_i p_i p_i^T
$$

Now this must satisfy the quasi-Newton condition

$$
\big( H_i + \alpha_i z_i z_i^T + \beta_i p_i p_i^T \big) \gamma_i = \delta_i
$$

After computations as done previously, we obtain

$$
z_i = \delta_i, \qquad p_i = H_i \gamma_i, \qquad \alpha_i = \frac{1}{\delta_i^T \gamma_i}, \qquad \beta_i = -\frac{1}{\gamma_i^T H_i \gamma_i}
$$

On substitution,

$$
H_{i+1} = H_i + \frac{\delta_i \delta_i^T}{\delta_i^T \gamma_i} - \frac{\big( H_i \gamma_i \big) \big( H_i \gamma_i \big)^T}{\gamma_i^T H_i \gamma_i}
$$

The DFP algorithm is the same as Broyden's algorithm except for the rank two update of the inverse Hessian. This formula was considered by Davidon and then modified by Fletcher and Powell, hence the name DFP. The method is also called the variable metric algorithm.
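The update itself is a one-liner; a minimal sketch (our own illustration, with delta and gamma the step and gradient-difference vectors of the current iteration):

```python
import numpy as np

def dfp_update(H, delta, gamma):
    """DFP rank two update of the inverse-Hessian approximation H."""
    Hg = H @ gamma
    return (H + np.outer(delta, delta) / (delta @ gamma)
              - np.outer(Hg, Hg) / (gamma @ Hg))
```

Plugging this into the quasi-Newton loop in place of the rank one update gives the DFP method; with $H_0 = I$ and the $\delta_0, \gamma_0$ of the example below, it reproduces the $H_1$ computed there.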


Example

Find the minimizer of the function $f(x, y) = x^2 + xy + y^2 + 3x - 2y$.

The function can be rewritten as $f(x, y) = 0.5 X^T A X - b^T X$ with

$$
A = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}, \qquad b = \begin{bmatrix} -3 \\ 2 \end{bmatrix}, \qquad X = \begin{bmatrix} x \\ y \end{bmatrix}
$$

Take $X_0 = [1, 0]^T$. Then $g(x, y) = [2x + y + 3, \; x + 2y - 2]^T$, so $g_0 = [5, -1]^T$. Again let $H_0$ be the identity matrix of order 2. The search direction is $d_0 = -H_0 g_0 = [-5, 1]^T$ and the step length is

$$
\alpha_0 = -\frac{g_0^T d_0}{d_0^T A d_0} = 0.6190
$$

$$
X_1 = X_0 + \alpha_0 d_0 = [-2.0952, \; 0.6190]^T, \qquad \delta_0 = X_1 - X_0 = \alpha_0 d_0 = [-3.0952, \; 0.6190]^T
$$

$$
g_1 = [-0.5714, \; -2.8571]^T, \qquad \gamma_0 = g_1 - g_0 = [-5.5714, \; -1.8571]^T
$$

$$
H_1 = H_0 + \frac{\delta_0 \delta_0^T}{\delta_0^T \gamma_0} - \frac{\big( H_0 \gamma_0 \big) \big( H_0 \gamma_0 \big)^T}{\gamma_0^T H_0 \gamma_0} =
\begin{bmatrix} 0.6952 & -0.4190 \\ -0.4190 & 0.9238 \end{bmatrix}
$$

$$
d_1 = -H_1 g_1 = [-0.8, \; 2.4]^T, \qquad \alpha_1 = 0.7143
$$

$$
X_2 = X_1 + \alpha_1 d_1 = [-2.6667, \; 2.3333]^T
$$

The result can be verified against the plots given. As expected for a quadratic function of two variables, the method converges to the true solution in the second iteration.

[Figure: surface plot and contour plot of $f(x, y)$]

The major advantage of this method is that the positive definiteness and the symmetry of the H matrix are preserved. When minimizing quadratic functions, the search directions generated are all A-conjugate, and at the nth iteration $H_n$ becomes the true inverse of the Hessian $h$. When used along with an exact line search on quadratic functions, the method converges within n steps. If the function is strictly convex, the method with exact line search shows global convergence. However, on large non-quadratic problems the algorithm has a tendency to get stuck owing to near-singularity of the H matrix. To overcome this, the BFGS algorithm was formulated.

BFGS Algorithm

The BFGS algorithm was suggested independently by Broyden, Fletcher, Goldfarb and Shanno. In all the previous algorithms, update formulas were derived for approximating the inverse of the Hessian matrix. An alternative is to approximate the Hessian matrix itself. To do this, let $B_i$ be the approximation of the Hessian at the ith step. Then $B_{i+1}$ must satisfy the relation

$$
B_{i+1} \delta_i = \gamma_i
$$

We can observe that this condition is the same as the quasi-Newton condition for $H_{i+1}$ except that the roles of $\delta$ and $\gamma$ are interchanged. Thus, given any update formula for the H matrix, the corresponding formula for B can be obtained by interchanging H and B as well as $\delta$ and $\gamma$. Specifically, the DFP update for H corresponds to the BFGS update for B, and formulas related in this fashion are called dual or complementary.

The DFP algorithm for H is given by

$$
H_{i+1} = H_i + \frac{\delta_i \delta_i^T}{\delta_i^T \gamma_i} - \frac{\big( H_i \gamma_i \big) \big( H_i \gamma_i \big)^T}{\gamma_i^T H_i \gamma_i}
$$

and by making use of the duality concept, the update formula for B can be obtained as

$$
B_{i+1} = B_i + \frac{\gamma_i \gamma_i^T}{\gamma_i^T \delta_i} - \frac{\big( B_i \delta_i \big) \big( B_i \delta_i \big)^T}{\delta_i^T B_i \delta_i}
$$

The above equation gives the BFGS update for the approximate Hessian. Now, to find the inverse of the approximate Hessian, take

$$
H_{i+1} = B_{i+1}^{-1} = \left( B_i + \frac{\gamma_i \gamma_i^T}{\gamma_i^T \delta_i} - \frac{\big( B_i \delta_i \big) \big( B_i \delta_i \big)^T}{\delta_i^T B_i \delta_i} \right)^{-1}
$$

To obtain the inverse of the B matrix, we make use of the Sherman-Morrison formula for the matrix inverse, stated as follows. If $M$ is a nonsingular matrix and $u$ and $v$ are column vectors such that $1 + v^T M^{-1} u \ne 0$, then $M + u v^T$ is nonsingular and

$$
\big( M + u v^T \big)^{-1} = M^{-1} - \frac{\big( M^{-1} u \big) \big( v^T M^{-1} \big)}{1 + v^T M^{-1} u}
$$

Applying this relation twice to $B_{i+1}$ yields

$$
H_{i+1} = H_i + \left( 1 + \frac{\gamma_i^T H_i \gamma_i}{\delta_i^T \gamma_i} \right) \frac{\delta_i \delta_i^T}{\delta_i^T \gamma_i} - \frac{H_i \gamma_i \delta_i^T + \delta_i \gamma_i^T H_i}{\delta_i^T \gamma_i}
$$

This is the BFGS formula for updating $H_i$.
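A minimal sketch of this update (our own illustration); with $H_0 = I$ and the $\delta_0, \gamma_0$ of the example below, it reproduces the $H_1$ computed there:

```python
import numpy as np

def bfgs_update(H, delta, gamma):
    """BFGS update of the inverse-Hessian approximation H."""
    dg = delta @ gamma
    Hg = H @ gamma
    return (H + (1.0 + (gamma @ Hg) / dg) * np.outer(delta, delta) / dg
              - (np.outer(Hg, delta) + np.outer(delta, Hg)) / dg)

# Check against the worked example below
H1 = bfgs_update(np.eye(2),
                 np.array([0.4545, -0.2273]),    # delta_0
                 np.array([1.8182, -1.3636]))    # gamma_0
print(H1)   # approximately [[0.5537, 0.4050], [0.4050, 0.7066]]
```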

Example

Find the minimizer of the function $f(x, y) = 1.5 x^2 - 2xy + y^2 - 2x - y + 5$.

The function can be rewritten as $f(x, y) = 0.5 X^T A X - b^T X + 5$ with

$$
A = \begin{bmatrix} 3 & -2 \\ -2 & 2 \end{bmatrix}, \qquad b = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \qquad X = \begin{bmatrix} x \\ y \end{bmatrix}
$$

Take $X_0 = [2, 3]^T$. Then $g(x, y) = [3x - 2y - 2, \; -2x + 2y - 1]^T$, so $g_0 = [-2, 1]^T$. Again let $H_0$ be the identity matrix of order 2. The search direction is $d_0 = -H_0 g_0 = [2, -1]^T$ and the step length is

$$
\alpha_0 = -\frac{g_0^T d_0}{d_0^T A d_0} = 0.2273
$$

$$
X_1 = X_0 + \alpha_0 d_0 = [2.4545, \; 2.7727]^T, \qquad \delta_0 = X_1 - X_0 = \alpha_0 d_0 = [0.4545, \; -0.2273]^T
$$

$$
g_1 = [-0.1818, \; -0.3636]^T, \qquad \gamma_0 = g_1 - g_0 = [1.8182, \; -1.3636]^T
$$

$$
H_1 = H_0 + \left( 1 + \frac{\gamma_0^T H_0 \gamma_0}{\delta_0^T \gamma_0} \right) \frac{\delta_0 \delta_0^T}{\delta_0^T \gamma_0} - \frac{H_0 \gamma_0 \delta_0^T + \delta_0 \gamma_0^T H_0}{\delta_0^T \gamma_0} =
\begin{bmatrix} 0.5537 & 0.4050 \\ 0.4050 & 0.7066 \end{bmatrix}
$$

$$
d_1 = -H_1 g_1 = [0.2479, \; 0.3306]^T, \qquad \alpha_1 = 2.2
$$

$$
X_2 = X_1 + \alpha_1 d_1 = [3.0000, \; 3.5000]^T
$$

[Figure: surface plot and contour plot of the function]

Like the DFP method, this method also ensures A-conjugacy of the search directions and positive definiteness of the Hessian approximation. The BFGS update is reasonably robust when the line searches are not exact, and it is far more efficient than the DFP algorithm.


Convex optimization:

1. Consider the unconstrained problem $\min_x f(x)$, where $f: R^n \to R$ is smooth.

a. One form of the Barzilai-Borwein method takes steps of the form $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$, where

$$
\alpha_k := \frac{s_k^T s_k}{s_k^T y_k}, \qquad s_k := x_k - x_{k-1}, \qquad y_k := \nabla f(x_k) - \nabla f(x_{k-1})
$$

Write down an explicit formula for $\alpha_k$ in terms of $s_k$ and $A$ for the special case in which $f$ is a strictly convex quadratic, that is, $f(x) = \frac{1}{2} x^T A x$, where $A$ is symmetric positive definite.

b. Considering the steepest descent method $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$ applied to the strictly convex quadratic, write down an explicit formula for the exact minimizing $\alpha_k$.

c. Show that the step lengths obtained in parts (a) and (b) are related as follows: the Barzilai-Borwein step length at iteration $k+1$ equals the exact steepest descent step length at iteration $k$.

Solution:

a. Consider $f(x) = \frac{1}{2} x^T A x$. Therefore, $\nabla f(x) = A x$. Since

$$
y_k := \nabla f(x_k) - \nabla f(x_{k-1}) = A x_k - A x_{k-1} = A (x_k - x_{k-1}) = A s_k
$$

we obtain

$$
\alpha_k := \frac{s_k^T s_k}{s_k^T y_k} = \frac{s_k^T s_k}{s_k^T A s_k}
$$

b. Consider the steepest descent method $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$. Since $f(x) = \frac{1}{2} x^T A x$,

$$
f\big( x_k - \alpha \nabla f(x_k) \big) = \frac{1}{2} \big( x_k - \alpha \nabla f(x_k) \big)^T A \big( x_k - \alpha \nabla f(x_k) \big)
$$

To find the exact minimizing step length, differentiate with respect to $\alpha$ and equate to zero:

$$
\frac{d}{d\alpha} \left[ \frac{1}{2} \Big( x_k^T A x_k - 2 \alpha \, x_k^T A \nabla f(x_k) + \alpha^2 \, \nabla f(x_k)^T A \nabla f(x_k) \Big) \right] = 0
$$

$$
- x_k^T A \nabla f(x_k) + \alpha \, \nabla f(x_k)^T A \nabla f(x_k) = 0
$$

Therefore,

$$
\alpha_k = \frac{x_k^T A \nabla f(x_k)}{\nabla f(x_k)^T A \nabla f(x_k)} = \frac{x_k^T A (A x_k)}{(A x_k)^T A (A x_k)} = \frac{x_k^T A^2 x_k}{x_k^T A^3 x_k}
$$

c. We have $s_k := x_k - x_{k-1}$, i.e.,

$$
s_{k+1} := x_{k+1} - x_k = -\alpha_k \nabla f(x_k) = -\alpha_k A x_k
$$

Therefore $A x_k = -\dfrac{1}{\alpha_k} s_{k+1}$. Using the formula from part (a) at iteration $k+1$,

$$
\alpha_{k+1} = \frac{s_{k+1}^T s_{k+1}}{s_{k+1}^T A s_{k+1}} = \frac{\alpha_k^2 \, x_k^T A^T A x_k}{\alpha_k^2 \, x_k^T A^T A A x_k} = \frac{x_k^T A^2 x_k}{x_k^T A^3 x_k}
$$

which is exactly the steepest descent step length from part (b). Hence the Barzilai-Borwein step length at iteration $k+1$ equals the exact steepest descent step length at iteration $k$.
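A quick numerical check of this identity on a random strictly convex quadratic (our own illustration, not part of the original solution):

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5.0 * np.eye(5)      # symmetric positive definite
x = rng.standard_normal(5)

g = A @ x                          # gradient of f(x) = 0.5 x^T A x
alpha_sd = (g @ g) / (g @ A @ g)   # part (b): exact steepest descent step

x_next = x - alpha_sd * g          # take one exact steepest descent step
s = x_next - x                     # s_{k+1}
y = A @ x_next - g                 # y_{k+1}: gradient difference
alpha_bb = (s @ s) / (s @ y)       # part (a): BB step at iteration k+1

print(alpha_sd, alpha_bb)          # the two step lengths coincide
```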

2. Suppose that $f: R^n \to R$ is a twice continuously differentiable function and that $\{x_k\}$ is a sequence of iterates in $R^n$.

a. Suppose that $\liminf \| \nabla f(x_k) \| = 0$. Is it true that all accumulation points of $\{x_k\}$ are stationary (that is, satisfy the first order necessary conditions)?

b. Suppose that $\lim \| \nabla f(x_k) \| = 0$. Is it true that all accumulation points of $\{x_k\}$ are stationary?

c. Suppose that the sequence $\{x_k\}$ converges to a point $x^*$, that the gradients $\nabla f(x_k)$ converge to zero, and that the Hessians $\nabla^2 f(x_k)$ at all these points are positive definite. Show that the second order necessary conditions are satisfied at the limit $x^*$.

d. For the situation described in part (c), can we say that the second order sufficient conditions will be satisfied at $x^*$? Explain.

Solution:

a. No. $\liminf \| \nabla f(x_k) \| = 0$ guarantees only that there is a subsequence $K$ such that $\lim_{k \in K} \| \nabla f(x_k) \| = 0$. An accumulation point may be the limit of another subsequence $K'$ along which $\| \nabla f(x_k) \|$, $k \in K'$, does not converge to 0.

b. Yes. Since $\lim \| \nabla f(x_k) \| = 0$, we have $\lim_{k \in K} \| \nabla f(x_k) \| = 0$ for every subsequence $K \subseteq \{1, 2, \dots\}$. If $\bar{X}$ is any accumulation point, there is a subsequence $K$ such that $\lim_{k \in K} x_k = \bar{X}$. By continuity of $\nabla f$, we have $\nabla f(\bar{X}) = \lim_{k \in K} \nabla f(x_k) = 0$, so $\bar{X}$ is stationary.

c. We have $\nabla f(x^*) = \lim_k \nabla f(x_k) = 0$. Since all $\nabla^2 f(x_k)$ are positive definite and $\nabla^2 f$ is continuous, the limit $\nabla^2 f(x^*)$ is at least positive semidefinite. (The minimum eigenvalue of $\nabla^2 f(x_k)$ is positive for all $k$; it may approach zero as $k \to \infty$, but it cannot become negative.) Hence the second order necessary conditions are satisfied at $x^*$.

d. No. We have $\lambda_{\min}\big( \nabla^2 f(x_k) \big) > 0$ for all $k$, but $\nabla^2 f(x^*)$ may be only positive semidefinite, so the second order sufficient conditions, which require positive definiteness, need not be satisfied at $x^*$.