Chapter 1: Stochastic Search and Optimization: Motivation and Supporting Results


Page 1

CHAPTER 1
STOCHASTIC SEARCH AND OPTIMIZATION: MOTIVATION AND SUPPORTING RESULTS

• Organization of chapter in ISSO
– Introduction
– Some principles of stochastic search and optimization
• Key points in implementation and analysis
• No free lunch theorems
– Gradients, Hessians, etc.
– Steepest descent and Newton-Raphson search

Slides for Introduction to Stochastic Search and Optimization (ISSO) by J. C. Spall

Page 2

Potpourri of Problems Using Stochastic Search and Optimization

• Minimize the costs of shipping from production facilities to warehouses

• Maximize the probability of detecting an incoming warhead (vs. decoy) in a missile defense system

• Place sensors in manner to maximize useful information

• Determine the times to administer a sequence of drugs for maximum therapeutic effect

• Find the best red-yellow-green signal timings in an urban traffic network

• Determine the best schedule for use of laboratory facilities to serve an organization’s overall interests

Page 3

Search and Optimization Algorithms as Part of Problem Solving

• There exist many deterministic and stochastic algorithms
• Algorithms are part of the broader solution
• Need clear understanding of problem structure, constraints, data characteristics, political and social context, limits of algorithms, etc.
• “Imagine how much money could be saved if truly appropriate techniques were applied that go beyond simple linear programming.” (Michalewicz and Fogel, 2000)
– Deeper understanding required to provide truly appropriate solutions; “COTS” software usually not enough!

• Many (most?) real-world implementations involve stochastic effects

Page 4

Two Fundamental Problems of Interest

• Let Θ be the domain of allowable values for a vector θ
– θ represents a vector of “adjustables”
– Θ may be continuous or discrete (or both)

• Two fundamental problems of interest:

Problem 1. Find the value(s) of a vector θ ∈ Θ that minimize a scalar-valued loss function L(θ)

— or —

Problem 2. Find the value(s) of θ ∈ Θ that solve the equation g(θ) = 0 for some vector-valued function g(θ)

• Frequently (but not necessarily) g(θ) = ∂L(θ)/∂θ

• Let θ* denote the solution to Problem 1 or 2 (may or may not be unique)
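As a concrete illustration of how Problems 1 and 2 connect, here is a minimal sketch (my own toy quadratic, not an example from ISSO) in which the root of g(θ) = ∂L/∂θ is exactly the minimizer of L(θ):

import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite (arbitrary choice)
b = np.array([1.0, -1.0])

def L(theta):
    # scalar-valued loss: L(theta) = 0.5 * theta' A theta - b' theta
    return 0.5 * theta @ A @ theta - b @ theta

def g(theta):
    # gradient g(theta) = dL/dtheta = A theta - b
    return A @ theta - b

theta_star = np.linalg.solve(A, b)        # root of g (Problem 2)
print("theta* =", theta_star)
print("g(theta*) =", g(theta_star))       # approximately [0, 0]
print("L(theta*) =", L(theta_star))       # no other theta gives a smaller loss (Problem 1)

Here θ* is unique because this toy loss is strictly convex; in general the solution set of either problem need not be a single point.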

Page 5

Three Common Types of Loss Functions

[Figure: example loss functions over continuous, mixed discrete/continuous, and discrete domains]

Page 6

Classical Calculus-Based Optimization

• Classical optimization setting of interest:
– Find θ* that minimizes the differentiable loss L(θ) subject to θ satisfying relevant constraints (θ ∈ Θ)
• Standard nonlinear unconstrained optimization setting: Find θ* such that ∂L/∂θ = 0
– Lagrange multipliers useful for some types of constraints
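To make the role of the first-order condition and Lagrange multipliers concrete, here is a small worked instance (my own example, not one from the text), written in LaTeX:

\min_{\theta}\; L(\theta) = \theta_1^2 + \theta_2^2
   \quad\text{subject to}\quad \theta_1 + \theta_2 = 1,
\qquad
\mathcal{L}(\theta,\lambda) = \theta_1^2 + \theta_2^2 + \lambda\,(1 - \theta_1 - \theta_2),

\frac{\partial \mathcal{L}}{\partial \theta_i} = 2\theta_i - \lambda = 0
   \;\Rightarrow\; \theta_1 = \theta_2 = \tfrac{\lambda}{2},
\qquad
\theta_1 + \theta_2 = 1 \;\Rightarrow\; \theta^* = \left(\tfrac{1}{2},\,\tfrac{1}{2}\right),\; \lambda = 1.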

Page 7

Global vs. Local Solutions

• Typically the solution to ∂L/∂θ = 0 may only be a local (vs. global) optimum
• General global optimization problem is very difficult
• Sometimes local optimization is “good enough” given limited resources available
• Global methods include: genetic algorithms, evolutionary strategies, simulated annealing, etc.

Page 8

Global vs. Local Solutions (cont’d)

• Global methods tend to have following characteristics:

– Inefficient, especially for high-dimensional θ
– Relatively difficult to use (e.g., require very careful selection of algorithm coefficients)

– Sometimes questionable theoretical foundation for global convergence

– Multiple runs usually required to have confidence in reaching global optimum

• Some “hype” with many methods; e.g., genetic algorithm (GA) software advertisement:

– “…uses GAs to solve any optimization problem!”
• But there are some sound methods
– E.g., restricted settings for GAs, simulated annealing, and stochastic approximation

Page 9

Examples of Easy and Hard Problems for Global Optimization

Note: “Hard problem” below (plot (b)) is indicative of effective volume of region around θ* in practical high-dimensional problems

Page 10

Stochastic Search and Optimization

• Focus here is stochastic search and optimization:
A. Random noise in input information (e.g., noisy measurements of L(θ): y(θ) = L(θ) + noise)
— and/or —
B. Injected randomness (Monte Carlo) in choice of algorithm iteration magnitude/direction
• Contrasts with deterministic methods
– E.g., steepest descent, Newton-Raphson, etc.
– Assume perfect information about L(θ) (and its gradients)
– Search magnitude/direction deterministic at each iteration
• Injected randomness (B) in search magnitude/direction can offer benefits in efficiency and robustness
– E.g., capabilities for global (vs. local) optimization
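The sketch below (my own toy setup, not from ISSO) shows both kinds of randomness on a simple loss: the algorithm only sees noisy measurements y(θ) = L(θ) + noise (A), and it injects its own randomness into the search step (B):

import numpy as np

rng = np.random.default_rng(0)

def L(theta):                      # true loss (unknown to the algorithm)
    return float(np.sum(theta**2))

def y(theta):                      # noisy measurement of L
    return L(theta) + rng.normal(scale=0.1)

theta = np.array([2.0, -1.5])      # arbitrary starting point
best_y = y(theta)
for k in range(2000):
    candidate = theta + 0.1 * rng.normal(size=2)   # (B) random perturbation of the iterate
    y_cand = y(candidate)                          # (A) only a noisy evaluation is available
    if y_cand < best_y:                            # keep apparent improvements
        theta, best_y = candidate, y_cand

print("estimate of theta*:", theta, "  true theta*: [0, 0]")

Because the measurements are noisy, an “apparent” improvement can be spurious; practical stochastic methods address this with decaying step sizes or averaging rather than simple accept-if-better rules.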

Page 11

Some Popular Stochastic Search and Optimization Techniques

• Random search
• Stochastic approximation
– Robbins-Monro and Kiefer-Wolfowitz
– SPSA
– NN backpropagation
– Infinitesimal perturbation analysis
– Recursive least squares
– Many others
• Simulated annealing
• Genetic algorithms
• Evolutionary programs and strategies
• Reinforcement learning
• Markov chain Monte Carlo (MCMC)
• Etc.

Page 12

Effects of Noise on Simple Optimization Problem (Example 1.4 in ISSO)

Page 13

Example of Noisy Loss Measurements: Tracking Problem (Example 1.5 in ISSO)

• Consider tracking problem where controller and/or system depend on design parameters θ
– E.g.: Missile guidance, robot arm manipulation, attaining macroeconomic target values, etc.
• Aim is to pick θ to minimize the mean-squared error (MSE): L(θ) = E(‖actual output − desired output‖²)
• In general nonlinear and/or non-Gaussian systems, not possible to compute L(θ)
• Get observed squared error y(θ) = ‖actual output − desired output‖² by running system
• Note that y(θ) = L(θ) + noise
– Values of y(θ), not L(θ), used in optimization of θ
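As a sketch of this measurement model (a made-up scalar system, not the one in Example 1.5), averaging many observed squared errors recovers L(θ) even though no single run does:

import numpy as np

rng = np.random.default_rng(1)
desired = 1.0                                  # desired output

def run_system(theta):
    # one run of a hypothetical scalar system: output depends on the design
    # parameter theta plus a random disturbance
    return 0.8 * theta + rng.normal(scale=0.2)

def y(theta):
    # observed squared error from a single run: y(theta) = L(theta) + noise
    return (run_system(theta) - desired) ** 2

theta = 1.25                                   # candidate design parameter
obs = [y(theta) for _ in range(10000)]
print("mean of y(theta):", np.mean(obs))       # approx. L(theta) = (0.8*1.25 - 1)^2 + 0.2^2 = 0.04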

Page 14

Example of Noisy Loss Measurements: Simulation-Based Optimization (Example 1.6 in ISSO)

• Have credible Monte Carlo simulation of real system
• Parameters θ in simulation have physical meaning in system
– E.g.: θ is machine locations in plant layout, timing settings in traffic control, resource allocation in military operations, etc.
• Run simulation to determine best θ for use in real system
• Want to minimize average measure of performance L(θ)
– Let y(θ) represent one simulation output (y(θ) = L(θ) + noise)

[Diagram: stochastic optimizer sends inputs θ to the Monte Carlo simulation, which returns a noisy loss measurement y(θ)]
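A minimal version of the loop in the diagram (the queueing model and cost terms below are my own assumptions, not systems from ISSO): the “optimizer” here simply compares a few candidate values of θ by averaging replications of the noisy simulation output y(θ):

import numpy as np

rng = np.random.default_rng(2)

def simulate(theta, horizon=200):
    # Monte Carlo simulation returning one noisy performance measure y(theta):
    # average sojourn time in a toy single-server queue with service rate theta,
    # plus a cost proportional to the service capacity purchased
    arrivals = rng.exponential(1.0, size=horizon)           # interarrival times (rate 1)
    services = rng.exponential(1.0 / theta, size=horizon)   # service times (rate theta)
    sojourn, total = 0.0, 0.0
    for a, s in zip(arrivals, services):
        sojourn = max(sojourn - a, 0.0) + s                 # Lindley-type recursion
        total += sojourn
    return total / horizon + 0.5 * theta                    # waiting cost + capacity cost

candidates = [0.8, 1.0, 1.2, 1.5, 2.0]                      # allowable settings of theta
estimates = {th: np.mean([simulate(th) for _ in range(50)]) for th in candidates}
best = min(estimates, key=estimates.get)
print("estimated loss per candidate:", estimates)
print("best theta for use in the real system:", best)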

Page 15

Some Key Properties in Implementation and Evaluation of Stochastic Algorithms

• Algorithm comparisons via number of measurements of L(θ) or g(θ) (not iterations)
– Function measurements typically represent major cost
• Curse of dimensionality
– E.g.: If dim(θ) = 10 and each element of θ can take on 10 values, there are 10^10 possible values of θ. Taking 10,000 random samples: Prob(finding at least one of 500 best θ) = 0.0005 (worked check below)
– Above example would be even much harder with only noisy function measurements
• Constraints
• Limits of numerical comparisons
– Avoid broad claims based on numerical studies
– Best to combine theory and numerical analysis
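The 0.0005 figure in the curse-of-dimensionality bullet can be checked directly (a short calculation in LaTeX using the numbers on the slide, assuming independent uniform draws):

P(\text{at least one of the 500 best in } 10{,}000 \text{ draws from } 10^{10} \text{ points})
  = 1 - \left(1 - \frac{500}{10^{10}}\right)^{10\,000}
  \approx 10\,000 \times 5\times 10^{-8}
  = 5\times 10^{-4} = 0.0005.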

Page 16

Constrained vs. Unconstrained

• The ∂L/∂θ = 0 setting usually associated with unconstrained optimization
• Most real problems include constraints
• Many constrained problems can also be converted to the ∂L/∂θ = 0 setting
– Penalty functions, projection methods, ad hoc methods and common sense, etc. (see the sketch below)
• Considerations for constraints:
– Hard vs. soft
– Explicit vs. implicit
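As a sketch of the penalty-function idea (my own small example, not from ISSO): the constraint is folded into the loss, and the resulting unconstrained problem is handled with a plain gradient iteration.

import numpy as np

# minimize L(theta) = theta1^2 + theta2^2 subject to theta1 + theta2 >= 1
def grad_penalized(theta, r):
    # gradient of L(theta) + r * (constraint violation)^2
    v = max(0.0, 1.0 - theta.sum())          # amount by which the constraint is violated
    return 2.0 * theta - 2.0 * r * v          # from the theta^2 terms and the quadratic penalty

theta = np.zeros(2)
for r in [1.0, 10.0, 100.0]:                  # progressively stiffer penalty weight
    for _ in range(5000):                     # plain gradient descent on the penalized loss
        theta -= 0.001 * grad_penalized(theta, r)
print(theta)                                  # close to the constrained solution (0.5, 0.5)

Larger penalty weights push the iterate closer to the constraint boundary; a projection method would instead map each iterate back onto the feasible set.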

Page 17

No Free Lunch Theorems

• Wolpert and Macready (1997) establish several “No Free Lunch” (NFL) Theorems for optimization

• NFL Theorems apply to settings where the parameter set Θ and the set of loss function values are finite, discrete sets

– Relevant for continuous problem when considering digital computer implementation

– Results are valid for deterministic and stochastic settings

• Number of optimization problems—mappings from Θ to set of loss values—is finite

• NFL Theorems state, in essence, that no one search algorithm is “best” for all problems

Page 18

No Free Lunch Theorems—Basic Formulation

• Suppose that
– N_θ = number of possible values of θ
– N_L = number of possible values of the loss function
• Then (N_L)^(N_θ) = number of loss functions

• There is a finite (but possibly huge) number of loss functions

• Basic form of NFL considers average performance over all loss functions

Page 19

Illustration of No Free Lunch Theorems (Example 1.7 in ISSO)

• Three values of θ, two outcomes for noise-free loss L
– Eight possible mappings, hence eight optimization problems
• Mean loss across all problems is same regardless of θ; entries 1 or 2 in table below represent the two possible L outcomes

     Map:   1  2  3  4  5  6  7  8
  θ = θ_1:  1  1  1  2  2  2  1  2
  θ = θ_2:  1  1  2  1  1  2  2  2
  θ = θ_3:  1  2  2  1  2  1  1  2

Page 20

No Free Lunch Theorems—Basic Formulation

• Assume algorithm A is applied to L(θ)
• Let L̂_n represent the value of a loss function after n unique function evaluations
• Consider the probability that L̂_n = ε given algorithm A and loss function L, i.e., P(L̂_n = ε | L, A)
• Proof of NFL Theorems considers this probability for choices of algorithms A and loss functions L
• Conclusion of NFL based on averaging above probability w.r.t. either L or A
• Most important special case of ε is ε = L(θ*)

Page 21

No Free Lunch Theorems (cont’d)

• NFL Theorems state, in essence:
Averaging (uniformly) over all possible problems (loss functions L), all algorithms perform equally well
• In particular, if algorithm 1 performs better than algorithm 2 over some set of problems, then algorithm 2 performs better than algorithm 1 on another set of problems
Overall relative efficiency of two algorithms cannot be inferred from a few sample problems
• NFL theorems say nothing about specific algorithms on specific problems
– NFL provides no direct guidance for implementation, but useful as conceptual device to avoid exaggerated claims of power of specific algorithms

Page 22

Gradients and Hessians

• Often used directly in deterministic methods like steepest descent and Newton-Raphson; indirectly in stochastic methods
– Exact gradients and Hessians generally not available in stochastic optimization
• Gradient of L(θ) is the vector of 1st partial derivatives: g(θ) = ∂L/∂θ
• Hessian of L(θ) is the matrix H(θ) consisting of the 2nd partial derivatives: H(θ) = ∂²L/(∂θ ∂θ^T)
• Hessian useful in characterizing shape of L and in providing search direction for Newton-Raphson algorithm
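For a concrete noise-free loss, the gradient g(θ) and Hessian H(θ) can be approximated numerically; this is only an illustration of the definitions (the loss below is my own example), since exact values are generally unavailable in the stochastic setting.

import numpy as np

def L(theta):
    x, y = theta
    return (x - 1.0)**2 + 2.0 * (y + 0.5)**2 + x * y

def gradient(theta, h=1e-5):
    # central finite-difference approximation of g(theta) = dL/dtheta
    g = np.zeros(2)
    for i in range(2):
        e = np.zeros(2); e[i] = h
        g[i] = (L(theta + e) - L(theta - e)) / (2 * h)
    return g

def hessian(theta, h=1e-4):
    # finite-difference approximation of H(theta), one column per coordinate
    H = np.zeros((2, 2))
    for i in range(2):
        e = np.zeros(2); e[i] = h
        H[:, i] = (gradient(theta + e) - gradient(theta - e)) / (2 * h)
    return H

theta = np.array([0.0, 0.0])
print("g(theta) =", gradient(theta))   # analytically [2(x-1)+y, 4(y+0.5)+x] = [-2, 2]
print("H(theta) =", hessian(theta))    # analytically [[2, 1], [1, 4]]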

Page 23

Rationale Behind Steepest Descent Update Direction for ith Element of θ

Page 24

Steepest Descent with Noisy and Noise-Free Gradient Input (Example 1.8 in ISSO)

Page 25

Noisy and Noise-Free Gradient Input (cont’d): Relative Loss Values (Example 1.8 in ISSO)

Page 26

Deterministic Optimization: 1st-order (Steepest Descent) and 2nd-order (Newton-Raphson) Directions with p = 2
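A minimal sketch (my own quadratic loss with p = 2, not the example plotted in ISSO) of the two deterministic update directions: steepest descent moves along −g(θ) scaled by a step size, while Newton-Raphson moves along −H(θ)⁻¹ g(θ).

import numpy as np

A = np.array([[4.0, 1.0], [1.0, 2.0]])    # Hessian of the quadratic loss
b = np.array([1.0, 1.0])

def g(theta):                              # gradient: A theta - b
    return A @ theta - b

theta_sd = np.array([3.0, -2.0])
theta_nr = np.array([3.0, -2.0])
for k in range(25):
    theta_sd = theta_sd - 0.1 * g(theta_sd)                 # steepest descent: -a_k * g
    theta_nr = theta_nr - np.linalg.solve(A, g(theta_nr))   # Newton-Raphson: -H^{-1} g

theta_star = np.linalg.solve(A, b)
print("theta*           :", theta_star)
print("steepest descent :", theta_sd)      # converges gradually
print("Newton-Raphson   :", theta_nr)      # exact after one step for a quadratic loss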

Page 27

Final Remarks on Contrast of Deterministic and Stochastic Optimization: Relative Theoretical Rates of Convergence

• Theoretical analysis based on convergence rates of iterates θ̂_k, where k is iteration counter
• Recall θ* is the optimal value of θ
• For deterministic optimization, a standard rate result is: ‖θ̂_k − θ*‖ = O(c^k) for some 0 < c < 1
• Corresponding rate with noisy measurements: ‖θ̂_k − θ*‖ = O(1/k^γ) for some 0 < γ ≤ 1/2
• Stochastic rate inherently slower in theory and practice
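A small back-of-the-envelope comparison based on the two rates above (taking γ = 1/2 for the stochastic case): halving the error costs a fixed number of extra iterations in the deterministic case, but roughly four times as many measurements in the noisy case.

c^{\,k+\Delta} = \tfrac{1}{2}\, c^{\,k} \;\Rightarrow\; \Delta = \frac{\ln 2}{\ln(1/c)}
\qquad\text{vs.}\qquad
\frac{1}{\sqrt{4k}} = \frac{1}{2}\cdot\frac{1}{\sqrt{k}}.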

Page 28

Concluding Remarks

• Stochastic search and optimization very widely used
– Handles noise in the function evaluations
– Generally better for global optimization
– Broader applicability to “non-nice” problems (robustness)
• Some challenges in practical problems
– Noise dramatically affects convergence
– Distinguishing global from local minima not generally easy
– Curse of dimensionality
– Choosing algorithm “tuning coefficients”

• Rarely sufficient to use theory for standard deterministic methods to characterize stochastic methods

• “No free lunch” theorems are barrier to exaggerated claims of power and efficiency of any specific algorithm

• Algorithms should be implemented in context: “Better a rough answer to the right question than an exact answer to the wrong one” (Lord Kelvin)

Page 29

Selected References on Stochastic Optimization

• Fogel, D. B. (2000), Evolutionary Computation: Toward a New Philosophy of Machine Intelligence (2nd ed.), IEEE Press, Piscataway, NJ.

• Fu, M. C. (2002), “Optimization for Simulation: Theory vs. Practice” (with discussion by S. Andradóttir, P. Glynn, and J. P. Kelly), INFORMS Journal on Computing, vol. 14, pp. 192–227.

• Goldberg, D. E. (1989), Genetic Algorithms in Search, Optimization, and Machine Learning, Addison-Wesley, Reading, MA.

• Gosavi, A. (2003), Simulation-Based Optimization: Parametric Optimization Techniques and Reinforcement Learning, Kluwer, Boston.

• Holland, J. H. (1975), Adaptation in Natural and Artificial Systems, University of Michigan Press, Ann Arbor.

• Kleijnen, J. P. C. (2008), Design and Analysis of Simulation Experiments, Springer.

• Kushner, H. J. and Yin, G. G. (2003), Stochastic Approximation and Recursive Algorithms and Applications (2nd ed.), Springer-Verlag, New York.

• Michalewicz, Z. and Fogel, D. B. (2000), How to Solve It: Modern Heuristics, Springer-Verlag, New York.

• Spall, J. C. (2003), Introduction to Stochastic Search and Optimization: Estimation, Simulation, and Control, Wiley, Hoboken, NJ.

• Zhigljavsky, A. A. (1991), Theory of Global Random Search, Kluwer, Boston.