Optimization in R: algorithms, sequencing, and automatic differentiation

James Thorson, Aug. 26, 2011


Page 1: Optimization in R: algorithms, sequencing, and automatic differentiation

Optimization in R: algorithms, sequencing, and automatic differentiation

James Thorson, Aug. 26, 2011

Page 2

Themes

Basic:
• Algorithms
• Settings
• Starting location

Intermediate:
• Sequenced optimization
• Phasing
• Parameterization
• Standard errors

Advanced:
• Derivatives

Page 3

Outline

1. One-dimensional
2. Two-dimensional
3. Using derivatives

Page 4

ONE-DIMENSIONAL

Page 5

Basic: Algorithm

• Characteristics
– Very fast
– Somewhat unstable

• Process
– Starts with 2 points
– Moves in the direction of the higher point
– Then searches between the two highest points

optimize(f =, interval =, ...)
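A minimal worked call (the quadratic below is invented for illustration; note that base R's optimize() names its function argument f, and minimizes by default, with maximum = TRUE to climb instead):

```r
# Minimize a smooth 1-D function over a bracketing interval
f <- function(x) (x - 2)^2 + 1
fit <- optimize(f = f, interval = c(-10, 10))
fit$minimum    # location of the minimum, close to 2
fit$objective  # objective value there, close to 1
```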

Page 6

Basic: Algorithm

Page 7

Basic: Algorithm

Page 8

Intermediate: Sequenced

Sequencing:
1. Use a stable but slow method
2. Then use a fast method for fine-tuning

One-dimensional sequencing:
1. Grid search
2. Then use optimize()
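A sketch of that two-step recipe (the bumpy objective and grid spacing are invented for illustration):

```r
# A bumpy objective with several local minima
f <- function(x) sin(5 * x) + 0.1 * (x - 3)^2

# Step 1: stable but slow -- evaluate on a coarse grid
grid <- seq(0, 6, by = 0.05)
start <- grid[which.min(sapply(grid, f))]

# Step 2: fast fine-tuning -- optimize() in a small bracket around the grid winner
fit <- optimize(f, interval = c(start - 0.1, start + 0.1))
fit$minimum
```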

Page 9

Intermediate: Sequenced

Page 10

Basic: Algorithms

Other one-dimensional functions:
• uniroot() – finds where f(·) = 0
• polyroot() – finds all solutions to f(·) = 0 (for polynomial f)
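For example (illustrative function; uniroot() needs an interval whose endpoints bracket a sign change):

```r
# uniroot(): one root of f(x) = 0 inside a sign-changing interval
root <- uniroot(function(x) x^2 - 4, interval = c(0, 10))
root$root  # close to 2

# polyroot(): all roots of a polynomial, coefficients in increasing order;
# here x^2 - 4 = -4 + 0*x + 1*x^2, with roots -2 and 2
zs <- polyroot(c(-4, 0, 1))
Re(zs)
```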

Page 11

TWO-DIMENSIONAL

Page 12

Basic: Settings

• trace = 1
– Means different things for different optimization routines
– In general, prints progress output during optimization
– Useful for diagnostics

optimx(par = , fn = , lower = , upper = , control = list(trace = 1, follow.on = TRUE), method = c("nlminb", "L-BFGS-B"))

Page 13

Basic: Settings

optimx(par = , fn = , lower = , upper = , control = list(trace = 1, follow.on = TRUE), method = c("nlminb", "L-BFGS-B"))

Page 14

Basic: Settings

• follow.on = TRUE
– Starts each subsequent method at the previous method's stopping point

• method = c("nlminb", "L-BFGS-B")
– Lists the set and order of methods to use

optimx(par = , fn = , lower = , upper = , control = list(trace = 1, follow.on = TRUE), method = c("nlminb", "L-BFGS-B"))

See also calcMin() in the "PBSmodelling" package
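The optimx package may not be installed everywhere, but the follow.on idea can be imitated in base R by chaining two optim() calls, feeding the stopping point of the first method to the second (objective and bounds below are invented for illustration):

```r
# Rosenbrock "banana" function, a standard optimization test problem
rosen <- function(p) (1 - p[1])^2 + 100 * (p[2] - p[1]^2)^2

# Method 1: stable Nelder-Mead from a rough start
fit1 <- optim(c(-1.2, 1), rosen, method = "Nelder-Mead")

# Method 2: bounded L-BFGS-B "follows on" from where Nelder-Mead stopped
fit2 <- optim(fit1$par, rosen, method = "L-BFGS-B",
              lower = c(-5, -5), upper = c(5, 5))
fit2$par  # near the true optimum c(1, 1)
```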

Page 15

Basic: Settings

Constraints:
• Unbounded
• Bounded
– I recommend using bounds
– Box constraints are common
• Non-box constraints
– Usually implemented inside the objective function
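A minimal sketch of folding a non-box constraint into the objective via a penalty term (the constraint x + y <= 1 and the penalty weight are invented for illustration):

```r
# Minimize (x-1)^2 + (y-1)^2 subject to x + y <= 1,
# with the constraint enforced by a quadratic penalty in the objective
obj <- function(p) {
  fit <- (p[1] - 1)^2 + (p[2] - 1)^2      # unconstrained optimum at (1, 1)
  pen <- 1e6 * max(0, p[1] + p[2] - 1)^2  # large cost when infeasible
  fit + pen
}
ans <- optim(c(0, 0), obj, control = list(maxit = 2000))
ans$par  # constrained optimum is near (0.5, 0.5)
```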

Page 16

Basic: Algorithms

Differences among algorithms:
• Speed vs. accuracy
• Unbounded vs. bounded
• Whether derivatives can be used

Page 17

Basic: Algorithms

Nelder-Mead (a.k.a. "simplex")
• Characteristics
– Bounded (nlminb)
– Unbounded (optimx)
– Cannot use derivatives
– Slow, but good at following valleys
– Easily stuck at local minima

Page 18

Basic: Algorithms

Nelder-Mead (a.k.a. "simplex")
• Process
– Uses a polygon (simplex) with n+1 vertices
– Reflects the worst point across the center of the rest
– If worse: shrink
– If better: accept, and expand along that axis
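The reflection step above can be written out by hand in 2-D (toy objective and simplex vertices invented for illustration):

```r
# One Nelder-Mead-style reflection step in 2-D
f <- function(p) sum(p^2)                        # toy objective, minimum at (0, 0)
simplex <- rbind(c(2, 2), c(1.5, 0), c(0, 1.5))  # n + 1 = 3 vertices
vals <- apply(simplex, 1, f)
worst <- which.max(vals)                         # vertex with the worst value
centroid <- colMeans(simplex[-worst, , drop = FALSE])  # center of the others
reflected <- centroid + (centroid - simplex[worst, ])  # reflect worst across it
f(reflected) < vals[worst]  # TRUE: the reflected point is better, so accept it
```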

Page 19

Basic: Algorithms

[Figure: successive Nelder-Mead simplex steps in two dimensions; X and Y axes run from -1 to 3]

Page 20

Basic: Algorithms

Rosenbrock "banana" function

Page 21

Basic: Algorithms

Quasi-Newton ("BFGS")
• Characteristics
– Unbounded (optim, method = "BFGS")
– Bounded (optim, method = "L-BFGS-B")
– Can use derivatives
– Fast, but less accurate

Page 22

Basic: Algorithms

Quasi-Newton ("BFGS")
• Process
– Approximates the gradient and Hessian
– Uses Newton's method to update the location
– Uses various other methods to update the gradient and Hessian approximations

Page 23

Basic: Algorithms

Page 24

Basic: Algorithms

Quasi-Newton ("ucminf")
• A different variation on quasi-Newton

Page 25

Basic: Algorithms

Page 26

Basic: Algorithms

Conjugate gradient
• Characteristics
– Unbounded (optim, method = "CG")
– Very fast for near-quadratic problems
– Low memory use
– Highly unstable in general
– I don't recommend it for general usage

Page 27

Basic: Algorithms

Conjugate gradient
• Process
– Calculates derivatives numerically
– Successive search directions are "conjugate" (i.e., they form an optimal linear basis for a quadratic problem)
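On an exactly quadratic objective, where conjugate directions are optimal, CG does well (toy objective invented for illustration):

```r
# Conjugate gradient on an exactly quadratic objective
quad <- function(p) sum((p - c(1, 2, 3))^2)  # minimum at (1, 2, 3)
fit <- optim(rep(0, 3), quad, method = "CG")
fit$par  # close to c(1, 2, 3)
```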

Page 28

Basic: Algorithms

Page 29

Basic: Algorithms

Many others! As one example:

Spectral projected gradient (spg)
• Characteristics
– ???
• Process
– ???

Page 30

Basic: Algorithms

Page 31

Basic: Algorithms

Accuracy trials:

Problem  Npar  bobyqa  newuoa  Rvmmin  nlminb  Rcgmin  ucminf  L-BFGS-B  nlm  spg  Nelder-Mead  BFGS  CG
1        50    0       0       1       0       1       0       1         1    1    0            1     1
2        50    0       0       0       1       1       0       1         1    0    0            0     1
3        50    0       0       0       1       1       0       1         1    0    0            0     1
4        2     0       0       0       1       1       1       1         0    0    1            0     0
5        3     0       NA      1       1       0       NA      1         NA   1    NA           NA    NA
6        50    0       0       1       0       1       0       1         1    1    0            1     1
7        50    0       0       1       0       1       0       1         1    1    0            1     1
8        50    0       0       0       1       1       1       1         1    1    0            1     1
9        303   0       0       1       1       1       0       1         1    1    0            1     1
10       5     0       NA      1       1       1       NA      1         NA   1    NA           NA    NA

Page 32

Basic: Starting location

It's important to provide a good starting location!
– Some methods (like nlminb) find the nearest local minimum
– A good start speeds convergence
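The "nearest local minimum" point can be seen directly (the double-well function below is invented for illustration):

```r
# Two starts, two different local minima: nlminb() walks to the nearby one
f <- function(x) x^4 - 4 * x^2 + 0.5 * x  # two local minima, one on each side of 0
left  <- nlminb(start = -2, objective = f)$par
right <- nlminb(start =  2, objective = f)$par
c(left, right)  # one negative, one positive: the answer depends on the start
```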

Page 33

Intermediate: Parameterization

Suggestions:
1. Put all parameters on a similar scale
– Derivatives are then approximately equal
– One method: transform inputs with exp() and plogis()
2. Minimize covariance
3. Minimize changes in scale or covariance
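An illustrative reparameterization in that spirit (model and data invented for illustration): a standard deviation estimated on the log scale via exp(), and a mean constrained to (0, 1) via the inverse-logit plogis(), so both inputs are unconstrained and similarly scaled.

```r
# Negative log-likelihood with both parameters on an unconstrained scale
nll <- function(theta, x) {
  sigma <- exp(theta[1])     # any real theta[1] gives sigma > 0
  mu    <- plogis(theta[2])  # any real theta[2] gives mu in (0, 1)
  -sum(dnorm(x, mean = mu, sd = sigma, log = TRUE))
}
set.seed(1)
x <- rnorm(100, mean = 0.3, sd = 0.5)  # simulated data
fit <- optim(c(0, 0), nll, x = x)
c(sigma = exp(fit$par[1]), mu = plogis(fit$par[2]))  # back-transformed estimates
```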

Page 34

Intermediate: Phasing

Phasing:
1. Estimate some parameters (with the others fixed) in a first phase
2. Estimate more parameters in each phase
3. Eventually estimate all parameters

Uses:
1. Multi-species models
• Estimate with linkages in later phases
2. Statistical catch-at-age models
• Estimate scale early
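A two-phase sketch in miniature (normal model and data invented for illustration): estimate the mean first with the spread fixed, then free both parameters starting from the phase-1 result.

```r
# Negative log-likelihood; par = (mean, log-sd)
nll <- function(par, x) -sum(dnorm(x, par[1], exp(par[2]), log = TRUE))
set.seed(2)
x <- rnorm(200, mean = 5, sd = 2)

# Phase 1: log-sd fixed at 0, estimate the mean only
p1 <- optimize(function(m) nll(c(m, 0), x), interval = c(-10, 20))$minimum

# Phase 2: free both parameters, starting from the phase-1 estimate
p2 <- optim(c(p1, 0), nll, x = x)
p2$par  # mean near 5, log-sd near log(2)
```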

Page 35

Intermediate: Standard errors

Maximum likelihood allows asymptotic estimates of standard errors:
1. Calculate the Hessian matrix at the maximum likelihood estimate
– Second derivatives of the log-likelihood function
2. Invert the Hessian
3. Diagonal entries are variances
4. Their square roots are standard errors
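The four steps map directly onto base R (normal model and data invented for illustration; optim() can return the numerical Hessian at the optimum):

```r
# Negative log-likelihood; par = (mean, log-sd)
nll <- function(par, x) -sum(dnorm(x, mean = par[1], sd = exp(par[2]), log = TRUE))
set.seed(3)
x <- rnorm(500, mean = 1, sd = 1)

fit <- optim(c(0, 0), nll, x = x, hessian = TRUE)  # 1. Hessian at the MLE
vcov_mat <- solve(fit$hessian)                     # 2. invert it
se <- sqrt(diag(vcov_mat))                         # 3.-4. variances -> SEs
se[1]  # SE of the mean, roughly sd(x) / sqrt(500)
```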

Page 36

Intermediate: Standard errors

Calculation of the Hessian depends on parameter transformations:
• When using exp() or logit transformations, use the delta method to transform standard errors back to the untransformed scale
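A worked one-parameter case of the delta method (the numbers are made up for illustration): if theta = log(sigma) has standard error se_theta, then sigma = exp(theta) has approximate standard error |d exp(theta)/d theta| * se_theta.

```r
theta    <- log(2)                 # estimate on the log scale
se_theta <- 0.1                    # its standard error (made-up)
sigma    <- exp(theta)             # back-transformed estimate: 2
se_sigma <- exp(theta) * se_theta  # delta-method SE on the natural scale: 0.2
```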

Page 37

Intermediate: Standard errors

Page 38

Intermediate: Standard errors

Gill and King (2004), "What to do when your Hessian is not invertible"

• gchol() – generalized Cholesky ("kinship" package)
• ginv() – Moore-Penrose inverse ("MASS" package)
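For instance, with a singular Hessian (MASS is a recommended package that ships with R; the kinship package would be needed separately for gchol()):

```r
library(MASS)

H <- matrix(c(2, 0,
              0, 0), nrow = 2, byrow = TRUE)  # singular "Hessian"
# solve(H) would fail here; the Moore-Penrose pseudo-inverse still works
G <- ginv(H)
G  # diag(0.5, 0)
```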

Page 39

Intermediate: Standard errors

[Switch over to R screen to show mle() and solve(hess())]

Page 40

Advanced: Differentiation

Gradient (gr =) can be supplied to:
• Quasi-Newton
• Conjugate gradient

Hessian (hess =) can be supplied to:
• Quasi-Newton

optimx(par = , fn = , gr = , hess = , lower = , upper = , control = list(trace = 1, follow.on = TRUE), method = c("nlminb", "L-BFGS-B"))
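With base optim(), supplying an analytic gradient through gr = looks like this (the Rosenbrock gradient below is standard calculus, not taken from the slides):

```r
rosen <- function(p) (1 - p[1])^2 + 100 * (p[2] - p[1]^2)^2
rosen_gr <- function(p) c(
  -2 * (1 - p[1]) - 400 * p[1] * (p[2] - p[1]^2),  # d/dp1
  200 * (p[2] - p[1]^2)                            # d/dp2
)
# BFGS uses the exact gradient instead of finite differences
fit <- optim(c(-1.2, 1), rosen, gr = rosen_gr, method = "BFGS")
fit$par  # near c(1, 1)
```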

Page 41

Advanced: Differentiation

Automatic differentiation
• AD Model Builder
• "radx" package (still in development)

Semi-automatic differentiation
• "Rsympy" package

Symbolic differentiation
• deriv()

BUT: none of these handle loops or sum()/prod(), so they're not yet very helpful for statistics

Page 42

Advanced: Differentiation

Mixture distribution model (~15 params)
• 10 seconds in R
• 2 seconds in ADMB

Multispecies catchability model (~150 params)
• 4 hours in R (using trapezoid method)
• 5 minutes in ADMB (using MCMC)

Surplus production meta-analysis (~750 coefs)
• 7 days in R (using trapezoid method)
• 2 hours in ADMB (using trapezoid method)