Computacion Inteligente: Derivative-Based Optimization

TRANSCRIPT

Page 1: Computacion Inteligente
Derivative-Based Optimization

Page 2: Contents

• Optimization problems

• Mathematical background

• Descent Methods

• The Method of Steepest Descent

• Conjugate Gradient

Page 3: OPTIMIZATION PROBLEMS

Page 4: Terms in Mathematical Optimization

1. Objective function – the mathematical function which is optimized by changing the values of the design variables.

2. Design variables – those variables which we, as designers, can change.

3. Constraints – functions of the design variables which establish limits on individual variables or combinations of design variables.

Page 5: Problem Formulation

3 basic ingredients:

– an objective function,

– a set of decision variables,

– a set of equality/inequality constraints.

The problem is to search for the values of the decision variables that minimize the objective function while satisfying the constraints.

Page 6: Mathematical Definition

– Design Variables: decision and objective vector

– Constraints: equality and inequality

– Bounds: feasible ranges for variables

– Objective Function: maximization can be converted to minimization due to the duality principle

$$\max f(x) \iff \min\left(-f(x)\right)$$

$$\min_{x} \; y = f(x) \quad \text{subject to} \quad x^L \le x \le x^U, \quad h(x) = 0, \quad g(x) \le 0$$

(y: objective; x: decision vector; $x^L, x^U$: bounds; h, g: constraints)
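These ingredients can be made concrete with a small Python sketch (my own illustration, not from the slides): a hypothetical two-variable problem with an objective, one equality and one inequality constraint, and simple bounds. A maximization problem would simply negate the objective.

```python
import numpy as np

# Sketch of the ingredients above, with a made-up two-variable problem.

def objective(x):
    # f(x) to be minimized; a maximization would use -f(x) instead
    return (x[0] - 1.0) ** 2 + (x[1] + 0.5) ** 2

def h(x):
    # equality constraint h(x) = 0
    return np.array([x[0] + x[1] - 1.0])

def g(x):
    # inequality constraint g(x) <= 0
    return np.array([x[0] - 2.0])

x_lower = np.array([-5.0, -5.0])   # bounds x^L <= x <= x^U
x_upper = np.array([ 5.0,  5.0])

x = np.array([0.5, 0.5])           # a candidate decision vector
feasible = (np.all(x_lower <= x) and np.all(x <= x_upper)
            and np.allclose(h(x), 0.0) and np.all(g(x) <= 0.0))
print(objective(x), feasible)
```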

Page 7: Steps in the Optimization Process

1. Identify the quantity or function, f, to be optimized.

2. Identify the design variables: x1, x2, x3, …,xn.

3. Identify the constraints, if any exist:

a. Equalities

b. Inequalities

4. Adjust the design variables (x’s) until f is optimized and all of the constraints are satisfied.

Page 8: Local and Global Optimum Designs

1. Objective functions may be unimodal or multimodal.

a. Unimodal – only one optimum
b. Multimodal – more than one optimum

2. Most search schemes are based on the assumption of a unimodal surface. The optimum determined in such cases is called a local optimum design.

3. The global optimum is the best of all local optimum designs.

Page 9: Weierstrass Theorem

• Existence of global minimum

• If f(x) is continuous on the feasible set S which is closed and bounded, then f(x) has a global minimum in S

– A set S is closed if it contains all its boundary points.

– A set S is bounded if it is contained in the interior of some circle: $\{x : x^T x \le c,\ c \text{ a finite number}\}$

compact = closed and bounded

Page 10: Example of an Objective Function

[Figure: plot of an objective function over x1 ∈ [-1, 1], x2 ∈ [-1, 1]]

Page 11: Multimodal Objective Function

[Figure: plot of a multimodal objective function over x1 ∈ [0, 1.5], x2 ∈ [0, 1], showing a local max and a saddle point]

Page 12: Optimization Approaches

• Derivative-based optimization (gradient based)

– Capable of determining “search directions” according to an objective function’s derivative information

• steepest descent method;

• Newton’s method; Newton-Raphson method;

• Conjugate gradient, etc.

• Derivative-free optimization

• random search method;

• genetic algorithm;

• simulated annealing; etc.

Page 13: MATHEMATICAL BACKGROUND

Page 14: Positive Definite Matrices

• A square matrix M is positive definite if $x^T M x > 0$ for all x ≠ 0.

• It is positive semidefinite if $x^T M x \ge 0$ for all x.

The scalar $x^T M x = \langle x, M x \rangle$ is called a quadratic form.

Page 15: Positive Definite Matrices

• A symmetric matrix M = Mᵀ is positive definite if and only if its eigenvalues λi > 0 (semidefinite ↔ λi ≥ 0).

– Proof (→): let vi be the eigenvector for the i-th eigenvalue λi:

$$M v_i = \lambda_i v_i$$

– Then,

$$0 < v_i^T M v_i = \lambda_i v_i^T v_i = \lambda_i \left\| v_i \right\|^2$$

– which implies λi > 0.

– Exercise (←): prove that positive eigenvalues imply positive definiteness.
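As a quick numerical illustration of the eigenvalue test (a sketch with a made-up matrix M, not from the slides), NumPy's eigvalsh gives the eigenvalues of a symmetric matrix:

```python
import numpy as np

# Check positive definiteness of a symmetric matrix via its eigenvalues.
# The matrix M below is just an illustration.
M = np.array([[2.0, 1.0],
              [1.0, 3.0]])

eigvals = np.linalg.eigvalsh(M)          # eigenvalues of the symmetric matrix
print("eigenvalues:", eigvals)
print("positive definite:", np.all(eigvals > 0))

# Equivalently, x^T M x > 0 for a few random nonzero x
rng = np.random.default_rng(0)
for _ in range(3):
    x = rng.standard_normal(2)
    print("x^T M x =", x @ M @ x)
```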

Page 16: Positive Definite Matrices

• Theorem: if a matrix M = UᵀU (with the columns of U linearly independent), then M is positive definite.

• Proof: let f be defined as

$$f = x^T M x = x^T U^T U x$$

• If we can show that f is always positive for x ≠ 0, then M must be positive definite. We can write this as

$$f = (Ux)^T (Ux)$$

• Writing b = Ux, we get

$$f = b^T b = \sum_i b_i^2$$

• Provided that Ux is a non-zero vector for every x ≠ 0 (which holds when the columns of U are linearly independent), f must always be positive, so M is positive definite.

Page 17: Quadratic Functions

• f: Rⁿ → R is a quadratic function if

$$f(x) = \tfrac{1}{2} x^T Q x - b^T x + c$$

– where Q is symmetric.

Page 18: Quadratic Functions

• It is not necessary for Q to be symmetric. Suppose the matrix P is non-symmetric:

$$f(x) = \sum_{i=1}^{n} \sum_{j=1}^{n} p_{ij}\, x_i x_j = x^T P x$$

$$= \tfrac{1}{2}\, x^T (P + P^T)\, x = x^T Q x, \qquad \text{where } q_{ij} = \tfrac{1}{2}(p_{ij} + p_{ji})$$

Q is symmetric.

Page 19: Quadratic Functions

– Suppose the matrix P is non-symmetric. Example:

$$f(x) = \tfrac{1}{2}\left( 2x_1^2 + 2x_1 x_2 + 4x_1 x_3 + 6x_2^2 + 4x_2 x_3 + 5x_3^2 \right)$$

$$f(x) = \tfrac{1}{2}\, x^T P x, \qquad P = \begin{bmatrix} 2 & 2 & 4 \\ 0 & 6 & 4 \\ 0 & 0 & 5 \end{bmatrix}$$

$$= \tfrac{1}{2}\, x^T Q x, \qquad Q = \begin{bmatrix} 2 & 1 & 2 \\ 1 & 6 & 2 \\ 2 & 2 & 5 \end{bmatrix}$$

Q is symmetric.
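A short NumPy check of this example (a sketch using the slide's P and Q; the random test vector is my own choice) confirms that the symmetrized Q gives the same quadratic form as P:

```python
import numpy as np

# Verify numerically: (1/2) x^T P x == (1/2) x^T Q x with Q = (P + P^T)/2.
P = np.array([[2.0, 2.0, 4.0],
              [0.0, 6.0, 4.0],
              [0.0, 0.0, 5.0]])
Q = 0.5 * (P + P.T)
print(Q)                                  # [[2,1,2],[1,6,2],[2,2,5]] as on the slide

rng = np.random.default_rng(1)
x = rng.standard_normal(3)
print(0.5 * x @ P @ x, 0.5 * x @ Q @ x)   # the two quadratic forms agree
```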

Page 20: Quadratic Functions

• Given the quadratic function

$$f(x) = \tfrac{1}{2} x^T Q x - b^T x + c$$

If Q is positive definite, then f is a parabolic “bowl.”

Page 21: Quadratic Functions

• Two other shapes can result from the quadratic form.

– If Q is negative definite, then f is a parabolic “bowl” upside down.

– If Q is indefinite then f describes a saddle.

Page 22: Quadratic Functions

• Quadratics are useful in the study of optimization.

– Often, objective functions are “close to” quadratic near the solution.

– It is easier to analyze the behavior of algorithms when applied to quadratics.

– Analysis of algorithms for quadratics gives insight into their behavior in general.

Page 23: One-Dimensional Derivative

• The derivative of f: R → R is a function f′: R → R given by

$$f'(x) = \frac{df(x)}{dx} = \lim_{h \to 0} \frac{f(x+h) - f(x)}{h}$$

• if the limit exists.
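A tiny numerical illustration of this limit (my own example, f(x) = x², not from the slides): the forward-difference quotient approaches the exact derivative as h shrinks.

```python
# Forward-difference approximation of f'(x) = lim_{h->0} (f(x+h) - f(x))/h,
# illustrated on f(x) = x^2 at x = 1 (exact derivative 2).
f = lambda x: x ** 2
x0 = 1.0
for h in (1e-1, 1e-3, 1e-5):
    print(h, (f(x0 + h) - f(x0)) / h)
```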

Page 24: Directional Derivatives

• Along the Axes…

$$\frac{\partial f(x, y)}{\partial x}, \qquad \frac{\partial f(x, y)}{\partial y}$$

Page 25: Directional Derivatives

• In general direction…

$$\frac{\partial f(x, y)}{\partial v}, \qquad v \in \mathbb{R}^2, \quad \|v\| = 1$$

Page 26: Directional Derivatives

[Figure: the partial derivatives ∂f(x, y)/∂x and ∂f(x, y)/∂y along the coordinate axes]

Page 27: Directional Derivatives

• Definition: a real-valued function f: Rⁿ → R is said to be continuously differentiable if the partial derivatives

$$\frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n}$$

• exist for each x in Rⁿ and are continuous functions of x.

• In this case, we say f ∈ C¹ (f is a smooth function of class C¹).

Page 28: The Gradient Vector

• Definition: the gradient of f: R² → R (in the plane) is a function ∇f: R² → R² given by

$$\nabla f(x, y) := \left( \frac{\partial f}{\partial x}, \frac{\partial f}{\partial y} \right)^T$$

Page 29: The Gradient Vector

• Definition: the gradient of f: Rⁿ → R is a function ∇f: Rⁿ → Rⁿ given by

$$\nabla f(x_1, \ldots, x_n) := \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)^T$$
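Where an analytic gradient is not available, the definition suggests a finite-difference approximation. A minimal sketch (central differences; the test function is my own example, not from the slides):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference estimate of the gradient of f: R^n -> R at x."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        grad[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return grad

# Example: f(x, y) = x^2 + 3y^2, whose gradient is (2x, 6y)
f = lambda x: x[0] ** 2 + 3.0 * x[1] ** 2
print(numerical_gradient(f, [1.0, -2.0]))   # ~ [2, -12]
```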

Page 30: The Gradient Properties

• The gradient defines a (hyper)plane approximating the function infinitesimally:

$$\Delta z = \frac{\partial f}{\partial x}\,\Delta x + \frac{\partial f}{\partial y}\,\Delta y$$

Page 31: The Gradient Properties

• By the chain rule:

$$\frac{\partial f}{\partial v}(p) = \left\langle \nabla f|_p, v \right\rangle, \qquad \|v\| = 1$$

Page 32: The Gradient Properties

• Proposition 1: the directional derivative $\frac{\partial f}{\partial v}(p)$ is maximal choosing

$$v = \frac{\nabla f|_p}{\left\| \nabla f|_p \right\|} \qquad (\|v\| = 1)$$

Intuitive: the gradient points in the direction of greatest change.

Prove it!

Page 33: The Gradient Properties

• Proof:

– Assign: $v = \dfrac{\nabla f|_p}{\left\| \nabla f|_p \right\|}$

– By the chain rule:

$$\frac{\partial f(x, y)}{\partial v}(p) = \left\langle \nabla f|_p, v \right\rangle = \left\langle \nabla f|_p, \frac{\nabla f|_p}{\left\| \nabla f|_p \right\|} \right\rangle = \frac{\left\| \nabla f|_p \right\|^2}{\left\| \nabla f|_p \right\|} = \left\| \nabla f|_p \right\|$$

Page 34: The Gradient Properties

• Proof:

– On the other hand, for a general v with $\|v\| = 1$, by the Cauchy-Schwarz inequality:

$$\frac{\partial f(x, y)}{\partial v}(p) = \left\langle \nabla f|_p, v \right\rangle \le \left\| \nabla f|_p \right\| \|v\| = \left\| \nabla f|_p \right\|$$

Page 35: The Gradient Properties

• Proposition 2: let f: Rⁿ → R be a smooth function (C¹) around p.

• If f has a local minimum (maximum) at p, then

$$\nabla f|_p = 0$$

Intuitive: this is a necessary condition for a local min (max).

Page 36: The Gradient Properties

• Proof: intuitive

Page 37: The Gradient Properties

• We found the best INFINITESIMAL DIRECTION at each point,

• Looking for a minimum: a “blind man” procedure

• How can we derive the way to the minimum using this knowledge?

Page 38: Jacobian

• The derivative of f: Rⁿ → Rᵐ is a function Df: Rⁿ → Rᵐˣⁿ, called the Jacobian, given by the matrix of partial derivatives:

$$Df(x) = \left[ \frac{\partial f_i}{\partial x_j}(x) \right]_{m \times n}$$

Note that for f: Rⁿ → R, we have ∇f(x) = Df(x)ᵀ.

Page 39: Derivatives

• If the derivative of ∇f exists, we say that f is twice differentiable.

– Write the second derivative as D2f (or F), and call it the Hessian of f.

Page 40: Level Sets and Gradients

• The level set of a function f: Rn → R at level c is the set of points S = {x: f(x) = c}.

Page 41: Level Sets and Gradients

• Fact: ∇f(x0) is orthogonal to the level set at x0

Page 42: Level Sets and Gradients

• Proof of fact:

– Imagine a particle traveling along the level set.

– Let g(t) be the position of the particle at time t, with g(0) = x0.

– Note that f(g(t)) = constant for all t.

– Velocity vector g′(t) is tangent to the level set.

– Consider F(t) = f(g(t)). Since F is constant, F′(0) = 0. By the chain rule,

$$F'(0) = \nabla f\big(g(0)\big)^T g'(0) = \nabla f(x_0)^T g'(0) = 0$$

– Hence, ∇f(x0) and g′(0) are orthogonal.

Page 43: Taylor's Formula

• Suppose f: R → R is in C¹. Then

$$f(x) = f(x_0) + f'(x_0)(x - x_0) + o(x - x_0)$$

– o(h) is a term such that o(h)/h → 0 as h → 0.

– At x0, f can be approximated by a linear function, and the approximation gets better the closer we are to x0.

Page 44: Taylor's Formula

• Suppose f: R → R is in C². Then

$$f(x) = f(x_0) + f'(x_0)(x - x_0) + \tfrac{1}{2} f''(x_0)(x - x_0)^2 + o\big((x - x_0)^2\big)$$

– At x0, f can be approximated by a quadratic function.

Page 45: Taylor's Formula

• Suppose f: Rⁿ → R.

– If f is in C¹, then

$$f(x) = f(x_0) + \nabla f(x_0)^T (x - x_0) + o(\|x - x_0\|)$$

– If f is in C², then

$$f(x) = f(x_0) + \nabla f(x_0)^T (x - x_0) + \tfrac{1}{2}(x - x_0)^T F(x_0)(x - x_0) + o(\|x - x_0\|^2)$$
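A quick numerical check of the first- and second-order expansions (a sketch with a function, gradient, and Hessian of my own choosing, not from the slides):

```python
import numpy as np

# Compare f(x) with its first- and second-order Taylor models around x0.
def f(x):
    return np.exp(x[0]) + x[0] * x[1] ** 2

def grad_f(x):
    return np.array([np.exp(x[0]) + x[1] ** 2, 2.0 * x[0] * x[1]])

def hess_f(x):
    return np.array([[np.exp(x[0]), 2.0 * x[1]],
                     [2.0 * x[1],  2.0 * x[0]]])

x0 = np.array([0.0, 1.0])
x  = x0 + np.array([0.1, -0.05])
d  = x - x0

first  = f(x0) + grad_f(x0) @ d
second = first + 0.5 * d @ hess_f(x0) @ d
print(f(x), first, second)    # the quadratic model is the closer one
```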

Page 46: In What Direction Does a Gradient Point?

• We already know that ∇f(x0) is orthogonal to the level set at x0.

– Suppose ∇f(x0) ≠ 0.

• Fact: ∇f points in the direction of increasing f.

Page 47: Proof of Fact

• Consider xα = x0 + α∇f(x0), α > 0.

– By Taylor's formula,

$$f(x_\alpha) = f(x_0) + \alpha\, \nabla f(x_0)^T \nabla f(x_0) + o(\alpha) = f(x_0) + \alpha \left\| \nabla f(x_0) \right\|^2 + o(\alpha)$$

• Therefore, for sufficiently small α,

f(xα) > f(x0)

Page 48: DESCENT METHODS

Page 49: The Wolfe Theorem

• This theorem is the link from the previous gradient properties to the constructive algorithm.

• The problem: $\min_x f(x)$

Page 50: The Wolfe Theorem

• We introduce a model algorithm:

Data: $x_0 \in \mathbb{R}^n$

Step 0: set i = 0

Step 1: if $\nabla f(x_i) = 0$, stop; else, compute a search direction $h_i \in \mathbb{R}^n$

Step 2: compute the step-size $\lambda_i = \arg\min_{\lambda \ge 0} f(x_i + \lambda h_i)$

Step 3: set $x_{i+1} = x_i + \lambda_i h_i$ and go to Step 1
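A minimal Python sketch of this model algorithm (my own illustration, not from the slides): the search-direction rule is a plug-in, and the exact arg-min of Step 2 is approximated by a crude search over a fixed grid of step sizes.

```python
import numpy as np

def descent_model(f, grad, x0, direction_rule, tol=1e-6, max_iter=500):
    """Sketch of the model algorithm: Step 1 stops when the gradient (nearly)
    vanishes or picks a direction h_i; Step 2 picks the step size; Step 3
    updates the iterate and repeats."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:              # Step 1: stop test
            break
        h = direction_rule(x, g)                 # Step 1: search direction h_i
        grid = np.linspace(0.0, 1.0, 201)        # Step 2: crude stand-in for argmin
        lam = min(grid, key=lambda t: f(x + t * h))
        x = x + lam * h                          # Step 3: update, go to Step 1
    return x

# Plugging in h_i = -grad f(x_i) gives steepest descent (see the next slides).
f    = lambda x: (x[0] - 1.0) ** 2 + 2.0 * x[1] ** 2
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 4.0 * x[1]])
print(descent_model(f, grad, [3.0, 2.0], lambda x, g: -g))   # ~ [1, 0]
```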

Page 51: The Wolfe Theorem

• The Theorem:

– Suppose f: Rⁿ → R is C¹ smooth, and there exists a continuous function k: Rⁿ → [0, 1] such that

$$\nabla f(x) \neq 0 \;\Rightarrow\; k(x) > 0$$

– and the search vectors constructed by the model algorithm satisfy:

$$\left\langle h_i, \nabla f(x_i) \right\rangle \le -k(x_i)\, \left\| \nabla f(x_i) \right\| \left\| h_i \right\|$$

Page 52: The Wolfe Theorem

– And

$$\nabla f(y) \neq 0 \;\Rightarrow\; h_i \nrightarrow 0$$

• Then, if $\{x_i\}_{i \ge 0}$ is the sequence constructed by the algorithm model, any accumulation point y of this sequence satisfies:

$$\nabla f(y) = 0$$

Page 53: The Wolfe Theorem

• The theorem has a very intuitive interpretation: always go in a descent direction, i.e. the angle between $h_i$ and $-\nabla f(x_i)$ is less than 90°.

The principal differences between various descent algorithms lie in the procedure for determining successive directions.

Page 54: STEEPEST DESCENT

Page 55: The Method of Steepest Descent

• We now use what we have learned to implement the most basic minimization technique.

• First we introduce the algorithm, which is a version of the model algorithm.

• The problem: $\min_x f(x)$

Page 56: The Method of Steepest Descent

• Steepest descent algorithm:

Data: $x_0 \in \mathbb{R}^n$

Step 0: set i = 0

Step 1: if $\nabla f(x_i) = 0$, stop; else, compute the search direction $h_i = -\nabla f(x_i)$

Step 2: compute the step-size $\lambda_i = \arg\min_{\lambda \ge 0} f(x_i + \lambda h_i)$

Step 3: set $x_{i+1} = x_i + \lambda_i h_i$ and go to Step 1
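A minimal sketch of steepest descent in Python (not from the slides): the exact line search of Step 2 is replaced by a simple backtracking rule, a practical stand-in for the arg-min.

```python
import numpy as np

def steepest_descent(f, grad, x0, tol=1e-6, max_iter=1000):
    """Sketch of the steepest-descent algorithm: h_i = -grad f(x_i), with a
    backtracking step size instead of the exact 1-D minimization."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        h = -g                                # search direction h_i = -grad f(x_i)
        lam = 1.0
        # shrink lambda until a sufficient decrease is obtained
        while f(x + lam * h) > f(x) - 0.5 * lam * (g @ g):
            lam *= 0.5
        x = x + lam * h
    return x

# Example on a convex quadratic: minimum at (1, -2)
f    = lambda x: (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 2.0) ** 2
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 4.0 * (x[1] + 2.0)])
print(steepest_descent(f, grad, [0.0, 0.0]))
```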

Page 57: The Method of Steepest Descent

• Theorem:

– If $\{x_i\}_{i \ge 0}$ is a sequence constructed by the SD algorithm, then every accumulation point y of the sequence satisfies:

$$\nabla f(y) = 0$$

– Proof: from the Wolfe theorem.

Remark: the Wolfe theorem gives us numerical stability even if the derivatives aren't given analytically (i.e., are calculated numerically).

Page 58: The Method of Steepest Descent

• How long a step to take?

$$x_{i+1} = x_i + \lambda h_i$$

Note that the search direction is $h_i = -\nabla f(x_i)$.

– We are limited to a line search:

• Choose λ to minimize f along the line, i.e. where the directional derivative is equal to zero.

Page 59: The Method of Steepest Descent

• How long a step to take?

– From the chain rule:

$$\frac{d}{d\lambda} f(x_i + \lambda h_i) = \left\langle \nabla f(x_i + \lambda h_i), h_i \right\rangle = 0$$

so $\nabla f(x_{i+1})$ and $h_i$ are orthogonal!

• Therefore the method of steepest descent looks like this:

Page 60: The Method of Steepest Descent

Page 61: Gradient Descent Example

Given:

Find the minimum when x1 is allowed to vary from 0.5 to 1.5 and x2 is allowed to vary from 0 to 2.

$$f(x_1, x_2) = 2\sin(x_1) + 1.47\sin(x_2) + 0.34\sin(x_1)\sin(x_2) + 1.9$$

λ arbitrary

Page 62: Optimum Steepest Descent Example

Given:

Find the minimum when x1 is allowed to vary from 0.5 to 1.5 and x2 is allowed to vary from 0 to 2.

$$f(x_1, x_2) = 2\sin(x_1) + 1.47\sin(x_2) + 0.34\sin(x_1)\sin(x_2) + 1.9$$

Page 63: CONJUGATE GRADIENT

Page 64: Conjugate Gradient

• From now on we assume we want to minimize the quadratic function:

$$f(x) = \tfrac{1}{2} x^T A x - b^T x + c$$

• This is equivalent to solving the linear problem:

$$\nabla f(x) = \tfrac{1}{2} A^T x + \tfrac{1}{2} A x - b = 0$$

If A is symmetric: $Ax = b$

Page 65: Example: A 2D Linear System

• The solution is the intersection of the lines.

$$A = \begin{bmatrix} 3 & 2 \\ 2 & 6 \end{bmatrix}, \qquad b = \begin{bmatrix} 2 \\ -8 \end{bmatrix}, \qquad c = 0$$
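Assuming the reconstructed values above (in particular b = (2, -8), following the Shewchuk notes cited in the Sources), the system can be checked directly with NumPy:

```python
import numpy as np

# Solve the 2-D system Ax = b directly and confirm A is symmetric positive
# definite; the value b = (2, -8) follows the reconstruction above.
A = np.array([[3.0, 2.0],
              [2.0, 6.0]])
b = np.array([2.0, -8.0])

x = np.linalg.solve(A, b)
print(x)                               # [ 2. -2.]
print(np.linalg.eigvalsh(A) > 0)       # both eigenvalues positive
```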

Page 66: Example: A 2D Linear System

– Each ellipsoid has constant f(x).

In general, the solution x lies at the intersection pointof n hyperplanes, each having dimension n – 1.

Page 67: Conjugate Gradient

• What is the problem with steepest descent?

– We can repeat the same directions over and over…

• Wouldn’t it be better if, every time we took a step, we got it right the first time?

Page 68: Conjugate Gradient

• What is the problem with steepest descent?

– We can repeat the same directions over and over…

• Conjugate gradient requires n gradient evaluations and n line searches.

Page 69: Conjugate Gradient

• First, let's define the error as

$$e_i = x_i - \tilde{x}, \qquad \text{where } A\tilde{x} = b$$

• $e_i$ is a vector that indicates how far we are from the solution.

[Figure: path from the start point to the solution]

Page 70: Conjugate Gradient

• Let’s pick a set of orthogonal search directions

0 1 1, ,..., ,...,j nd d d d

iiii dxx 1

(should span Rn)

– In each search direction, we’ll take exactly one step,

that step will be just the right length to line up evenly with x

Page 71: Conjugate Gradient

– Unfortunately, this method only works if you already know the answer.

• Using the coordinate axes as search directions…

Page 72: Conjugate Gradient

• We have

$$A\tilde{x} = b, \qquad x_{i+1} = x_i + \alpha_i d_i, \qquad e_i = x_i - \tilde{x}$$

$$\nabla f(x) = Ax - b = Ax - A\tilde{x}$$

$$\nabla f(x_i) = A(x_i - \tilde{x}) = A e_i$$

Page 73: Conjugate Gradient

• Given $x_{i+1} = x_i + \alpha_i d_i$, how do we calculate $\alpha_i$?

• $e_{i+1}$ should be orthogonal to $d_i$:

$$d_i^T e_{i+1} = 0$$

Page 74: Conjugate Gradient

• Given $x_{i+1} = x_i + \alpha_i d_i$, how do we calculate $\alpha_i$?

– That is, we require $d_i^T \nabla f(x_{i+1}) = 0$, i.e. $d_i^T A e_{i+1} = 0$:

$$d_i^T A (e_i + \alpha_i d_i) = 0$$

$$\alpha_i = -\frac{d_i^T A e_i}{d_i^T A d_i} = -\frac{d_i^T \nabla f(x_i)}{d_i^T A d_i}$$

Page 75: Conjugate Gradient

• How do we find $d_j$?

– Since the search vectors form a basis:

$$e_0 = \sum_{i=0}^{n-1} \delta_i d_i$$

$$e_j = e_0 + \alpha_0 d_0 + \alpha_1 d_1 + \ldots + \alpha_{j-1} d_{j-1} = e_0 + \sum_{i=0}^{j-1} \alpha_i d_i$$

On the other hand:

$$e_j = \sum_{i=0}^{n-1} \delta_i d_i + \sum_{i=0}^{j-1} \alpha_i d_i$$

Page 76: Conjugate Gradient

• We want that after n steps the error will be 0: $e_n = 0$.

– Here is an idea: if $\alpha_j = -\delta_j$, then:

$$e_j = \sum_{i=0}^{n-1} \delta_i d_i - \sum_{i=0}^{j-1} \delta_i d_i = \sum_{i=j}^{n-1} \delta_i d_i$$

So if $j = n$: $e_n = 0$.

Page 77: Conjugate Gradient

• So we look for directions $d_j$ such that $\alpha_j = -\delta_j$.

– A simple calculation shows that the correct choice is to take the directions A-conjugate,

$$d_j^T A d_i = 0, \qquad i \neq j$$

built from the gradients $-\nabla f(x_i)$ (next slide).

Page 78: Conjugate Gradient

• Conjugate gradient algorithm for minimizing f:

Data: $x_0 \in \mathbb{R}^n$

Step 0: $d_0 := r_0 = -\nabla f(x_0)$

Step 1: $\alpha_i = \dfrac{r_i^T r_i}{d_i^T A d_i}$

Step 2: $x_{i+1} = x_i + \alpha_i d_i$

Step 3: $r_{i+1} := -\nabla f(x_{i+1})$, $\quad \beta_{i+1} = \dfrac{r_{i+1}^T r_{i+1}}{r_i^T r_i}$, $\quad d_{i+1} = r_{i+1} + \beta_{i+1} d_i$

Step 4: repeat n times
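A direct transcription of this algorithm in NumPy (a sketch, assuming A is symmetric positive definite so that $-\nabla f(x) = b - Ax = r$):

```python
import numpy as np

def conjugate_gradient(A, b, x0, n_steps=None):
    """Sketch of the CG algorithm above for f(x) = 1/2 x^T A x - b^T x + c,
    with A symmetric positive definite."""
    x = np.asarray(x0, dtype=float)
    r = b - A @ x                   # r_0 = -grad f(x_0)
    d = r.copy()                    # Step 0: d_0 = r_0
    n = b.size if n_steps is None else n_steps
    for _ in range(n):              # Step 4: repeat n times
        alpha = (r @ r) / (d @ A @ d)        # Step 1
        x = x + alpha * d                    # Step 2
        r_new = r - alpha * (A @ d)          # r_{i+1} = -grad f(x_{i+1})
        beta = (r_new @ r_new) / (r @ r)     # Step 3
        d = r_new + beta * d
        r = r_new
    return x

# The 2-D example from the earlier slides converges in n = 2 steps
A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(conjugate_gradient(A, b, x0=np.zeros(2)))    # ~ [ 2. -2.]
```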

Page 79: Sources

• Jyh-Shing Roger Jang, Chuen-Tsai Sun and Eiji Mizutani, Slides for Ch. 5 of “Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence”, First Edition, Prentice Hall, 1997.

• Djamel Bouchaffra. Soft Computing. Course materials. Oakland University. Fall 2005

• Lecture slides, Soft Computing. Course materials. Dipartimento di Elettronica e Informazione. Politecnico di Milano. 2004

• Jeen-Shing Wang, Course: Introduction to Neural Networks. Lecture notes. Department of Electrical Engineering. National Cheng Kung University. Fall, 2005

Page 80: Sources

• Carlo Tomasi, Mathematical Methods for Robotics and Vision. Stanford University. Fall 2000

• Petros Ioannou, Jing Sun, Robust Adaptive Control. Prentice-Hall, Inc., Upper Saddle River, NJ, 1996

• Jonathan Richard Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. Edition 11/4. School of Computer Science. Carnegie Mellon University. Pittsburgh. August 4, 1994

• Gordon C. Everstine, Selected Topics in Linear Algebra. The George Washington University. 8 June 2004