Optimization in R: algorithms, sequencing, and automatic differentiation
James Thorson
Aug. 26, 2011
Themes
Basic:
• Algorithms
• Settings
• Starting location
Intermediate:
• Sequenced optimization
• Phasing
• Parameterization
• Standard errors
Advanced:
• Derivatives
Outline
1. One-dimensional
2. Two-dimensional
3. Using derivatives
ONE-DIMENSIONAL
Basic: Algorithm
• Characteristics
– Very fast
– Somewhat unstable
• Process
– Starts with 2 points
– Moves in the direction of the higher point
– Then searches between the two highest points
optimize(f = , interval = , ...)
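A minimal sketch of optimize() on a hypothetical one-dimensional objective (the function and interval below are invented for illustration):

```r
# Hypothetical 1-D objective with its minimum at x = 2
f <- function(x) (x - 2)^2 + 1

# optimize() searches the given interval for the minimum
fit <- optimize(f, interval = c(0, 5))
fit$minimum    # location of the minimum, near 2
fit$objective  # objective value there, near 1
```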
Intermediate: Sequenced
Sequencing:
1. Use a stable but slow method
2. Then use a fast method for fine-tuning
One-dimensional sequencing:
1. Grid search
2. Then use optimize()
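The two-step recipe above can be sketched as follows, using an invented multimodal objective:

```r
# Sequenced 1-D search: a coarse grid (stable) then optimize() (fast)
f <- function(x) sin(x) + 0.1 * x^2   # hypothetical multimodal objective

# Step 1: grid search over a wide range
grid <- seq(-10, 10, by = 0.5)
best <- grid[which.min(sapply(grid, f))]

# Step 2: fine-tune with optimize() near the grid winner
fit <- optimize(f, interval = c(best - 1, best + 1))
```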
9
Intermediate: Sequenced
10
Basic: Algorithms
Other one-dimensional functions
• uniroot() – finds one solution of f(∙) = 0 within an interval
• polyroot() – finds all roots of a polynomial
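Brief sketches of both root-finders (the example equations are invented):

```r
# uniroot(): one root of a general function inside an interval
r <- uniroot(function(x) x^2 - 2, interval = c(0, 2))$root  # ~ sqrt(2)

# polyroot(): all (complex) roots of a polynomial; coefficients are
# given in increasing order, so c(-2, 0, 1) means -2 + 0*x + 1*x^2
pr <- polyroot(c(-2, 0, 1))
```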
TWO-DIMENSIONAL
Basic: Settings
• trace = 1
– Means different things for different optimization routines
– In general, prints output during optimization
– Useful for diagnostics
optimx(par = , fn = , lower = , upper = , control = list(trace = 1, follow.on = TRUE), method = c("nlminb", "L-BFGS-B"))
Basic: Settings
• follow.on = TRUE
– Starts each subsequent method at the previous method's stopping point
• method = c("nlminb", "L-BFGS-B")
– Lists the set and order of methods to use
optimx(par = , fn = , lower = , upper = , control = list(trace = 1, follow.on = TRUE), method = c("nlminb", "L-BFGS-B"))
See also calcMin() in the "PBSmodelling" package
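The follow.on idea can also be imitated with base optim(): run a stable method first, then restart a faster bounded method from where it stopped. The objective below is invented for illustration:

```r
# Hypothetical 2-D objective with minimum at (1, -2)
obj <- function(p) (p[1] - 1)^2 + 10 * (p[2] + 2)^2

fit1 <- optim(c(5, 5), obj, method = "Nelder-Mead")    # stable first pass
fit2 <- optim(fit1$par, obj, method = "L-BFGS-B",      # fast fine-tuning
              lower = c(-10, -10), upper = c(10, 10))
```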
Basic: Settings
Constraints
• Unbounded
• Bounded
– I recommend using bounds
– Box constraints are common
• Non-box constraints
– Usually implemented in the objective function
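One common way to implement a non-box constraint inside the objective function is a penalty term; the constraint and objective below are invented for illustration:

```r
# Penalize violations of the (hypothetical) constraint p1 + p2 <= 1
obj <- function(p) {
  penalty <- 1e6 * max(0, p[1] + p[2] - 1)^2
  (p[1] - 2)^2 + (p[2] - 2)^2 + penalty  # unconstrained minimum at (2, 2)
}
fit <- optim(c(0, 0), obj)  # constrained minimum is near (0.5, 0.5)
```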
Basic: Algorithms
Differences among algorithms:
• Speed vs. accuracy
• Unbounded vs. bounded
• Whether derivatives can be used
Basic: Algorithms
Nelder-Mead (a.k.a. "Simplex")
• Characteristics
– Bounded (nlminb)
– Unbounded (optimx)
– Cannot use derivatives
– Slow but good at following valleys
– Easily stuck at local minima
Basic: Algorithms
Nelder-Mead (a.k.a. "Simplex")
• Process
– Uses a simplex (polygon) with n+1 vertices
– Reflects the worst point across the centroid
– If worse: shrink
– If better: accept and expand along that axis
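The process above can be run directly with base optim() on the Rosenbrock "banana" function (the same test function used in this deck's plots):

```r
# Nelder-Mead on the Rosenbrock "banana" function (true minimum at c(1, 1))
banana <- function(p) (1 - p[1])^2 + 100 * (p[2] - p[1]^2)^2
fit <- optim(c(-1.2, 1), banana, method = "Nelder-Mead")
fit$par  # close to c(1, 1)
```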
Basic: Algorithms
[Figure: successive Nelder-Mead simplex iterations traced on a two-dimensional surface (X and Y axes from -1 to 3)]
Basic: Algorithms
Rosenbrock "Banana" Function
Basic: Algorithms
Quasi-Newton ("BFGS")
• Characteristics
– Unbounded (optim, method = "BFGS")
– Bounded (optim, method = "L-BFGS-B")
– Can use derivatives
– Fast but less accurate
Basic: Algorithms
Quasi-Newton ("BFGS")
• Process
– Approximates the gradient and Hessian
– Uses Newton's method to update the location
– Uses various update rules to refresh the gradient and Hessian approximations
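A sketch of BFGS with an analytic gradient supplied through the gr argument, again using the Rosenbrock function:

```r
banana <- function(p) (1 - p[1])^2 + 100 * (p[2] - p[1]^2)^2
banana_gr <- function(p) c(
  -2 * (1 - p[1]) - 400 * p[1] * (p[2] - p[1]^2),  # d/dp1
  200 * (p[2] - p[1]^2)                            # d/dp2
)
fit <- optim(c(-1.2, 1), banana, gr = banana_gr, method = "BFGS")
```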
Basic: Algorithms
Quasi-Newton ("ucminf")
• A different variation on quasi-Newton
Basic: Algorithms
Conjugate gradient
• Characteristics
– Unbounded (optim, method = "CG")
– Very fast for near-quadratic problems
– Low memory
– Generally unstable
– I don't recommend it for general usage
Basic: Algorithms
Conjugate gradient
• Process
– Computes derivatives numerically
– Successive search directions are "conjugate" (i.e., they form an optimal linear basis for a quadratic problem)
Basic: Algorithms
Many others! As one example…
Spectral projected gradient (spg)
• Characteristics: ???
• Process: ???
Basic: Algorithms
Accuracy trials

Trial  Npar  bobyqa  newuoa  Rvmmin  nlminb  Rcgmin  ucminf  L-BFGS-B  nlm  spg  Nelder-Mead  BFGS  CG
1      50    0       0       1       0       1       0       1         1    1    0            1     1
2      50    0       0       0       1       1       0       1         1    0    0            0     1
3      50    0       0       0       1       1       0       1         1    0    0            0     1
4      2     0       0       0       1       1       1       1         0    0    1            0     0
5      3     0       NA      1       1       0       NA      1         NA   1    NA           NA    NA
6      50    0       0       1       0       1       0       1         1    1    0            1     1
7      50    0       0       1       0       1       0       1         1    1    0            1     1
8      50    0       0       0       1       1       1       1         1    1    0            1     1
9      303   0       0       1       1       1       0       1         1    1    0            1     1
10     5     0       NA      1       1       1       NA      1         NA   1    NA           NA    NA
Basic: Starting location
It's important to provide a good starting location!
– Some methods (like nlminb) find the nearest local minimum
– A good start speeds convergence
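A sketch of why the starting location matters, using an invented double-well objective: nlminb() stops at whichever local minimum is nearest the start.

```r
# Hypothetical double-well objective: two local minima, near -2 and +2
f <- function(x) (x^2 - 4)^2 + x
left  <- nlminb(start = -3, objective = f)$par  # near -2 (the better minimum)
right <- nlminb(start =  3, objective = f)$par  # near +2 (a worse local minimum)
```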
Intermediate: Parameterization
Suggestions:
1. Keep all parameters on a similar scale
– Derivatives are then approximately equal
– One method: use exp() and plogis() to transform inputs
2. Minimize covariance among parameters
3. Minimize changes in scale or covariance during optimization
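A sketch of the transformation idea: estimating a normal mean and standard deviation, with sigma passed through exp() so the optimizer sees unconstrained, similarly scaled inputs (plogis() would analogously map an unconstrained input to (0, 1); the data here are simulated):

```r
nll <- function(theta, x) {
  mu    <- theta[1]
  sigma <- exp(theta[2])   # exp() keeps sigma > 0
  -sum(dnorm(x, mu, sigma, log = TRUE))
}
set.seed(1)
x   <- rnorm(100, mean = 5, sd = 2)
fit <- optim(c(0, 0), nll, x = x, method = "BFGS")
est <- c(mu = fit$par[1], sigma = exp(fit$par[2]))  # back-transform
```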
Intermediate: Phasing
Phasing:
1. Estimate some parameters (with others fixed) in a first phase
2. Estimate more parameters in each subsequent phase
3. Eventually estimate all parameters
Uses:
1. Multi-species models
• Estimate with linkages in later phases
2. Statistical catch-at-age
• Estimate scale early
Intermediate: Standard errors
Maximum likelihood allows asymptotic estimates of standard errors:
1. Calculate the Hessian matrix at the maximum likelihood estimate
– Second derivatives of the negative log-likelihood function
2. Invert the Hessian
3. The diagonal entries are variances
4. Their square roots are standard errors
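The four steps above in code, for a simulated normal sample (sigma on the log scale): optim(hessian = TRUE) returns the Hessian of the minimized negative log-likelihood.

```r
set.seed(1)
x   <- rnorm(200, mean = 10, sd = 3)
nll <- function(theta) -sum(dnorm(x, theta[1], exp(theta[2]), log = TRUE))
fit <- optim(c(8, 1), nll, method = "BFGS", hessian = TRUE)

vcov_hat <- solve(fit$hessian)    # invert the Hessian
se       <- sqrt(diag(vcov_hat))  # variances on diagonal; sqrt gives SEs
```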
Intermediate: Standard errors
Calculation of standard errors depends on parameter transformations:
• When using exp() or logit transformations, use the delta method to transform standard errors back to the natural scale
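For an exp() transformation the delta method reduces to multiplying by the derivative of the back-transform; the estimate and standard error below are hypothetical numbers for illustration:

```r
# Delta method: Var(exp(theta)) ≈ (exp(theta))^2 * Var(theta)
log_sigma    <- 0.7   # hypothetical estimate on the log scale
se_log_sigma <- 0.1   # hypothetical standard error on the log scale

sigma    <- exp(log_sigma)
se_sigma <- exp(log_sigma) * se_log_sigma  # SE on the natural scale
```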
Intermediate: Standard errors
Gill and King (2004), "What to do when your Hessian is not invertible":
• gchol() – generalized Cholesky ("kinship" package)
• ginv() – Moore-Penrose inverse ("MASS" package)
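A sketch of the generalized-inverse fallback: when solve() fails on a singular Hessian, MASS::ginv() still returns a pseudo-inverse (the matrix here is a toy example, and the result should be interpreted with caution, per Gill and King):

```r
library(MASS)
H <- matrix(c(4, 2, 2, 1), 2, 2)  # rank-1 (singular) toy "Hessian"
G <- ginv(H)                      # Moore-Penrose pseudo-inverse
```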
Intermediate: Standard errors
[Switch over to R screen to show mle() and solve(hess())]
Advanced: Differentiation
Gradient can be supplied to:
• Quasi-Newton
• Conjugate gradient
Hessian can be supplied to:
• Quasi-Newton
optimx(par = , fn = , gr = , hess = , lower = , upper = , control = list(trace = 1, follow.on = TRUE), method = c("nlminb", "L-BFGS-B"))
Advanced: Differentiation
Automatic differentiation
• AD Model Builder
• "radx" package (still in development)
Semi-automatic differentiation
• "Rsympy" package
Symbolic differentiation
• deriv() in base R
BUT: none of these handle loops or sum()/prod(), so they're not yet very helpful for statistics
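A minimal sketch of symbolic differentiation with base R's deriv(), on a simple expression of the kind it does handle:

```r
# deriv() differentiates the formula symbolically and, with
# function.arg = TRUE, returns a function of x
g   <- deriv(~ x^2 + sin(x), "x", function.arg = TRUE)
out <- g(2)
attr(out, "gradient")  # 2*x + cos(x) evaluated at x = 2
```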
Advanced: Differentiation
Mixture distribution model (~15 params)
• 10 seconds in R
• 2 seconds in ADMB
Multispecies catchability model (~150 params)
• 4 hours in R (using trapezoid method)
• 5 minutes in ADMB (using MCMC)
Surplus production meta-analysis (~750 coefficients)
• 7 days in R (using trapezoid method)
• 2 hours in ADMB (using trapezoid method)