
Nonlinear Optimization Theory and Practice

by

Asim Karim Computer Science Dept.

Lahore University of Management Sciences

2nd International Bhurban Conference on Applied Sciences and Technology

Control and Simulation (June 19 – 21, 2003)


Optimization

What is optimization? Finding solution(s) from a set of admissible or feasible solutions that minimizes (or maximizes) a performance measure or objective

Examples
Engineering design: find the cross-sectional dimensions of a beam that result in the least-weight structure

Resource management: find the optimal distribution of resources to accomplish a task in least time

Machine control: find the policy for injecting fuel that leads to the least fuel consumption

Traveling salesperson problem: find the path through a given set of locations that has the shortest distance

Optimization is a very powerful concept. Many problems in different fields can be posed as optimization problems.


Types of Optimizations

Two basic classes of optimization problems
Static: decision variables do not vary over time
Dynamic: decision variables vary over time, and optimal solutions are time-paths or trajectories rather than single values

Other classifications
Linear and nonlinear: if any nonlinearity exists in the problem, then it is a nonlinear optimization problem; otherwise, it is a linear optimization problem

Unconstrained and constrained: if the variables are unrestricted, then it is an unconstrained optimization problem; otherwise, it is a constrained optimization problem


Nonlinear Optimization

Nonlinear optimization theory includes as a special case the linear optimization problem. The basic concepts of static optimization are similar to those of dynamic optimization.

Solution methods
Mathematical: these methods are based on calculus and geometry. Collectively, these methods are known as nonlinear programming techniques.

Heuristic: these methods are based on search heuristics. Examples include genetic algorithms and simulated annealing.

We will be focusing on nonlinear programming, that is, mathematical methods for solving static nonlinear optimization problems.


Optimization Theory

Two questions
Existence: do local/global minima exist?
Optimality conditions: what are the properties or characteristics of local/global minima?

Does f(x) = x have a local minimum? What about f(x) = exp(x)? We won't be focusing on the existence of optimal solutions. Optimality conditions are used in algorithms for solving optimization problems.


Criteria for Characterizing Solution Methods

Rate of convergence
Stability of convergence
Search for minima (local or global?)
Computational efficiency and scalability
Memory usage and scalability
Other requirements (continuous differentiability, twice continuous differentiability, etc.)


Unconstrained Optimization

Definition
Minimize f(x) subject to x ∈ R^n, where R^n is the n-dimensional space of real numbers (Euclidean space).

Local and Global Minimum
A vector x* is a local minimum of f if there exists ε > 0 such that f(x*) ≤ f(x) for all x with ‖x − x*‖ < ε.
A vector x* is a global minimum if f(x*) ≤ f(x) for all x ∈ R^n.


Local and Global Minima


Optimality Conditions

Assuming f is continuously differentiable (twice continuously differentiable for the second-order conditions):

Necessary conditions
∇f(x*) = 0 (first-order necessary condition)
∇²f(x*) ≥ 0, i.e. positive semi-definite (second-order necessary condition)

Sufficient condition
∇f(x*) = 0 and ∇²f(x*) > 0, i.e. positive definite

Special cases
Convex or quadratic function: the first-order necessary condition is also sufficient. Moreover, the stationary point is a global minimum.


Optimality Conditions - Example


Gradient Methods (1)

Gradient methods
Method of steepest descent (and its variations)
Newton's method (and its variations)
Quasi-Newton methods

Basic strategy and equations
These methods involve iterative descent such that f(x^{k+1}) < f(x^k), k = 1, 2, …
Update rule: x^{k+1} = x^k + α^k d^k, where the direction d^k satisfies ∇f(x^k)^T d^k < 0


Gradient Methods (2)

The different methods vary in their choice of d^k. The stepsize α^k is determined by a line search technique.


Method of Steepest Descent (1)

Direction vector: d^k = −∇f(x^k)

This method is often slow to converge.
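
A minimal sketch of the steepest-descent iteration follows; the example objective, starting point, constant stepsize, and tolerance are illustrative assumptions, not from the slides.

% Minimal steepest-descent sketch; the objective, starting point, constant
% stepsize and tolerance are illustrative assumptions.
f     = @(x) x(1)^2 + 10*x(2)^2;
gradf = @(x) [2*x(1); 20*x(2)];

x     = [5; 1];              % starting point x^0
alpha = 0.05;                % constant stepsize (a line search would be used in practice)
for k = 1:1000
    d = -gradf(x);           % steepest-descent direction d^k = -grad f(x^k)
    if norm(d) < 1e-6, break; end
    x = x + alpha*d;         % update x^{k+1} = x^k + alpha^k d^k
end
% The slow decay of x(1) toward zero illustrates the slow convergence noted above.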


Method of Steepest Descent (2)

Scaled steepest descent: d^k = −D^k ∇f(x^k)

where D^k is a diagonal matrix used to scale the gradient vector. Usually, diagonal element i of D^k is computed as the inverse of the second-order partial derivative of f with respect to x_i (an approximation to Newton's method). This method converges faster than the method of steepest descent.


Newton’s Method (1)

Direction vector: d^k = −[∇²f(x^k)]^{−1} ∇f(x^k)

assuming the Hessian is positive definite.

When α^k = 1, this is known as the pure Newton's method. However, the pure method has some major drawbacks (can you identify some?)

Faster convergence (see figure on next slide), but computationally expensive (Hessian computation)

A variation to reduce computational complexity: to reduce the computational expense of the Hessian, the modified Newton's method computes the Hessian every p > 1 iterations, instead of every iteration.
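
A minimal sketch of the pure Newton iteration follows; the strictly convex example objective and starting point are illustrative assumptions, not from the slides.

% Minimal sketch of the pure Newton iteration (alpha^k = 1); the strictly convex
% example objective and starting point are illustrative assumptions.
f     = @(x) exp(x(1) + x(2)) + x(1)^2 + 2*x(2)^2;
gradf = @(x) [exp(x(1)+x(2)) + 2*x(1); exp(x(1)+x(2)) + 4*x(2)];
hessf = @(x) [exp(x(1)+x(2)) + 2, exp(x(1)+x(2)); exp(x(1)+x(2)), exp(x(1)+x(2)) + 4];

x = [1; 1];
for k = 1:20
    g = gradf(x);
    if norm(g) < 1e-10, break; end
    d = -hessf(x) \ g;       % Newton direction: solve [hess f(x)] d = -grad f(x)
    x = x + d;               % pure Newton step (alpha = 1)
end
% Forming and solving with the Hessian each iteration is the expensive part that
% the modified Newton variant recomputes only every p iterations.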


Newton’s Method (2)

To ensure global convergence (a drawback of the pure method), use the steepest descent direction vector whenever the Hessian is not positive definite or is undefined.


Quasi-Newton Methods

Direction vector: d^k = −D^k ∇f(x^k)

where D^k is a positive definite matrix selected such that it approximates the Newton direction. A popular way to compute D^k is

D^{k+1} = D^k + (p^k (p^k)^T)/((p^k)^T q^k) − (D^k q^k (q^k)^T D^k)/((q^k)^T D^k q^k) + ξ^k τ^k v^k (v^k)^T

where

v^k = p^k/((p^k)^T q^k) − D^k q^k/τ^k;  τ^k = (q^k)^T D^k q^k;  0 ≤ ξ^k ≤ 1;
p^k = x^{k+1} − x^k;  q^k = ∇f(x^{k+1}) − ∇f(x^k)

When ξ^k = 0 (for all k), this is known as the DFP method.
When ξ^k = 1 (for all k), this is known as the BFGS method (popular).
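
A minimal sketch of this update on an assumed quadratic example follows; the matrix A, vector b, starting point, and the use of an exact line search (possible only because the objective is quadratic) are illustrative assumptions.

% Minimal quasi-Newton sketch using the update formula above.
% The quadratic example f = 0.5 x'Ax - b'x, starting point, and exact line
% search (possible because f is quadratic) are illustrative assumptions.
A = [3 1; 1 2];  b = [1; 1];
gradf = @(x) A*x - b;

x  = [0; 0];
D  = eye(2);                               % initial inverse-Hessian estimate D^0
xi = 1;                                    % 1 = BFGS update, 0 = DFP update
for k = 1:20
    g = gradf(x);
    if norm(g) < 1e-10, break; end
    d     = -D*g;                          % quasi-Newton direction d^k = -D^k grad f(x^k)
    alpha = -(g.'*d)/(d.'*A*d);            % exact line minimization along d (quadratic f)
    xnew  = x + alpha*d;
    p = xnew - x;                          % p^k = x^{k+1} - x^k
    q = gradf(xnew) - gradf(x);            % q^k = grad f(x^{k+1}) - grad f(x^k)
    tau = q.'*D*q;
    v   = p/(p.'*q) - D*q/tau;
    D   = D + (p*p.')/(p.'*q) - (D*q)*(q.'*D)/tau + xi*tau*(v*v.');
    x   = xnew;
end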


Conjugate Gradient Method

Iterative improvement: x^{k+1} = x^k + α^k d^k

where the direction vectors d^k (k = 0, 1, …) are Q-conjugate (Q is a positive definite matrix).

The directions d^k are computed by the Gram-Schmidt procedure:

d^0 = −∇f(x^0)
d^k = −∇f(x^k) + [∇f(x^k)^T ∇f(x^k) / (∇f(x^{k−1})^T ∇f(x^{k−1}))] d^{k−1}

The conjugate gradient method and its variations are popular approaches for unconstrained optimization and for solving linear systems of equations.
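
A minimal sketch of the conjugate gradient iteration for the quadratic/linear-system case follows (minimizing f(x) = 0.5 x'Ax − b'x, i.e. solving Ax = b); the SPD matrix, right-hand side, and tolerance are illustrative assumptions.

% Minimal conjugate gradient sketch for solving A x = b with A symmetric
% positive definite; A, b and the tolerance are illustrative assumptions.
A = [4 1; 1 3];  b = [1; 2];

x = [0; 0];
r = b - A*x;                 % residual, equal to -grad f(x) for the quadratic
d = r;                       % first direction: steepest descent
for k = 1:numel(b)           % at most n iterations for an n-by-n SPD system
    alpha = (r.'*r)/(d.'*A*d);
    x     = x + alpha*d;
    rnew  = r - alpha*A*d;
    if norm(rnew) < 1e-12, break; end
    beta = (rnew.'*rnew)/(r.'*r);   % coefficient from the direction formula above
    d    = rnew + beta*d;           % new direction, Q-conjugate to the previous ones
    r    = rnew;
end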


Stepsize Selection Methods

Importance
In practice, the choice of the stepsize α^k significantly affects the rate of convergence, stability and computational efficiency of iterative direction methods.

If α^k is too small, convergence may be very slow.
If α^k is too large, convergence may not be smooth (divergence).
Exact computation can be expensive.

Common methods
Constant stepsize
Line minimization
Armijo rule
Goldstein rule


Line Minimization

The stepsize α^k is such that it minimizes f along d^k, that is,

f(x^k + α^k d^k) = min_{α} f(x^k + α d^k)

Usually α ∈ [0, s], where s > 0, to reduce computation (a method known as limited line minimization). The bisection or Newton-Raphson methods are used for this minimization (these are line or 1-D optimization algorithms).

Disadvantage
It is computationally expensive, requiring the solution of a sub-optimization problem in each iteration.


Armijo Rule

The stepsize α^k = β^{m_k} s is determined by a successive reduction process, where m_k is the first non-negative integer m for which

f(x^k) − f(x^k + β^m s d^k) ≥ −σ β^m s ∇f(x^k)^T d^k

Procedure
Select a value for s, with β ∈ (0, 1) and σ ∈ (0, 1)
Set m = 0
Evaluate the inequality. If it is satisfied, α^k = β^m s; otherwise, increment m and repeat the evaluation

Usually σ is chosen close to zero and β is between 1/2 and 1/10. If d^k is scaled, then s = 1 is an appropriate choice.
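
A minimal sketch of this successive-reduction loop follows; the objective (the banana function used later in the slides), the current point, and the parameter choices are illustrative assumptions.

% Minimal Armijo-rule sketch; the objective, current point and parameter
% choices are illustrative assumptions.
f     = @(x) 100*(x(2) - x(1)^2)^2 + (1 - x(1))^2;
gradf = @(x) [2*(x(1)-1) - 400*x(1)*(x(2)-x(1)^2); 200*(x(2)-x(1)^2)];

x = [-1.2; 1];
g = gradf(x);
d = -g;                                    % steepest-descent direction
s = 1;  beta = 0.5;  sigma = 1e-4;         % sigma near zero; beta in [1/10, 1/2]
m = 0;
while m < 60 && f(x) - f(x + beta^m*s*d) < -sigma*beta^m*s*(g.'*d)
    m = m + 1;                             % reduce the trial stepsize
end
alpha = beta^m * s;                        % accepted stepsize alpha^k = beta^(m_k) s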


Comparison of Methods

Steepest descent: slow convergence; computationally less expensive; needs once-differentiability; suitable for less complex problems; small-scale problems; hard to parallelize

Newton's: fastest convergence; high cost; needs twice-differentiability; suitable for well-defined problems; small to medium scale; hardest to parallelize

Quasi-Newton: fast convergence; high/moderate cost; needs once-differentiability; suitable for complex problems; medium scale; easier to parallelize

CG: fast convergence; moderate cost; needs once-differentiability; suitable for complex problems; medium and large scale; easier to parallelize


Constrained Optimization

Definition
Minimize f(x) subject to x ∈ C ⊂ R^n
C is the constraint set, a subset of the n-dimensional space of real numbers, defined by
h_i(x) = 0, i = 1, …, I (equality constraints)
g_j(x) ≤ 0, j = 1, …, J (inequality constraints)

Local and Global Minimum
A vector x* ∈ C is a local minimum of f over C if there exists ε > 0 such that f(x*) ≤ f(x) for all x ∈ C with ‖x − x*‖ < ε.
A vector x* ∈ C is a global minimum if f(x*) ≤ f(x) for all x ∈ C.


Optimality Conditions (1)

Assuming f is continuously differentiable:

Necessary condition
∇f(x*)^T (x − x*) ≥ 0 for all x ∈ C

If f is convex over C, then the above is also sufficient for optimality.

If f and constraint set C are convex then local minimum x* is also a global minimum.

The solution methods based on these optimality conditions are similar to those for unconstrained problems (feasible directions methods).


Geometric Interpretation


Optimality Conditions (2)

Karush-Kuhn-Tucker Necessary Condition
If x* ∈ C is a local minimum of f over C, then there exist Lagrange multiplier vectors λ* = (λ_1*, …, λ_I*) and μ* = (μ_1*, …, μ_J*) such that

∇_x L(x*, λ*, μ*) = 0

where μ_j* ≥ 0, j = 1, …, J, and μ_j* = 0 for every j whose constraint is inactive at x*.

Lagrangian function
L(x, λ, μ) = f(x) + Σ_{i=1}^{I} λ_i h_i(x) + Σ_{j=1}^{J} μ_j g_j(x)


Example

Minimize f(x) = x_1 + x_2
Subject to: x_1^2 + x_2^2 = 2

Lagrangian
L(x, λ) = f(x) + λ h(x)

At the local optimum x*, the KKT condition must be satisfied:
∇f(x*) + λ ∇h(x*) = 0
What is the value of λ?
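
One way to work out the answer (not shown on the slide): here ∇f(x) = (1, 1) and ∇h(x) = (2x_1, 2x_2), so the condition gives 1 + 2λx_1* = 0 and 1 + 2λx_2* = 0, hence x_1* = x_2* = −1/(2λ). Substituting into the constraint x_1^2 + x_2^2 = 2 gives 1/(2λ^2) = 2, so λ = ±1/2; the minimizer is x* = (−1, −1) with λ = 1/2, while λ = −1/2 corresponds to the maximizer (1, 1).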


Barrier and Interior Point Methods

Constrained problem is converted to a sequence of unconstrained problems which involve an added high cost for approaching the boundary of the feasible region.

Barrier and interior point methods are used for inequality constrained problems.

Minimize F_B(x) = f(x) + B(x)

Barrier function
B(x) = −Σ_{j=1}^{J} ln{−g_j(x)} (logarithmic)
B(x) = −Σ_{j=1}^{J} 1/g_j(x) (inverse)


Barrier Method – Geometrical Interpretation


Penalty Method

Constrained problem is converted to a sequence of unconstrained problems which involve an added high cost for infeasibility.

A penalty parameter or function is used to penalize violation of constraints.

F_P(x) = f(x) + (r_n/2) [ Σ_{i=1}^{I} h_i(x)^2 + Σ_{j=1}^{J} (g_j^+(x))^2 ]

where g_j^+(x) = max{0, g_j(x)} and r_n is a penalty parameter

These approaches are also known as SUMT, sequential unconstrained minimization technique (any unconstrained algorithm may be used)

Often the penalized function is an augmented Lagrangian function to improve convergence.


Optimal Control (1)

Optimal control problems are dynamic optimization problems

Definition (discrete-time optimal control)

Minimize J(u) = g_N(x_N) + Σ_{i=0}^{N−1} g_i(x_i, u_i)

Subject to
x_{i+1} = f_i(x_i, u_i), i = 0, …, N−1 (system equation)
x_i ∈ X_i ⊂ R^n, i = 1, …, N (state vector forming a trajectory)
u_i ∈ U_i ⊂ R^m, i = 0, …, N−1 (control vector forming a trajectory)
x_0: given

The system equation uniquely specifies the state trajectory that corresponds to a given control trajectory.


Optimal Control (2)

Given a control trajectory u = (u_0, u_1, …, u_{N−1}), the state trajectory is uniquely determined by the system equations f_i. Equivalently, we can write

x_i = φ_i(u), i = 1, 2, …, N, where φ_i is determined from the f_i.

Simplified definition

Minimize J(u) = g_N(φ_N(u)) + Σ_{i=0}^{N−1} g_i(φ_i(u), u_i)

Subject to
x_i = φ_i(u) ∈ X_i ⊂ R^n, i = 1, …, N
u_i ∈ U_i ⊂ R^m, i = 0, …, N−1
x_0: given

The optimal solution u* = (u_0*, u_1*, …, u_{N−1}*) can be found by any of the nonlinear optimization methods.
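
A minimal sketch (not from the slides) of how a control trajectory can be evaluated and then optimized with a general-purpose unconstrained method; the scalar system, the quadratic stage and terminal costs, the horizon N, and the initial state are illustrative assumptions.

% Minimal sketch: evaluate J(u) by rolling the controls through the system
% equation, then minimize over u with fminunc. The scalar system x_{i+1} = x_i + u_i,
% the quadratic costs, horizon N and initial state x0 are assumptions.
N  = 10;
x0 = 5;
J  = @(u) rollout_cost(u, x0, N);
u  = fminunc(J, zeros(N,1), optimset('Display', 'off'));   % optimal control trajectory

where ROLLOUT_COST is, e.g.:

function J = rollout_cost(u, x0, N)
J = 0;  x = x0;
for i = 1:N
    J = J + x^2 + u(i)^2;    % stage cost g_i(x_i, u_i)
    x = x + u(i);            % system equation x_{i+1} = f_i(x_i, u_i)
end
J = J + x^2;                 % terminal cost g_N(x_N)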


Nonlinear Optimization Algorithm (1)

1. Select x^0 (i.e. set the decision/control variables); set k = 0
2. For constrained problems, set the initial penalty r^0 to a suitably small value. Choose a penalty update rule (e.g. r^k = 1.75 r^{k−1})
3. For constrained problems, formulate the equivalent unconstrained objective function (using the penalty method)
4. Find the new vector x^{k+1}
   a. Compute the direction vector d^k
   b. Compute the stepsize α^k
   c. Update x^{k+1} = x^k + α^k d^k
   d. For constrained problems, repeat steps a to c until convergence is achieved within a reasonable (large) tolerance
5. If the stopping criterion is satisfied, stop. x^k is the optimum solution and f(x^k) the optimum objective value
6. If the stopping criterion is not satisfied, update k = k + 1, update r^{k+1}, and go to step 3.
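
As an illustration of this loop, here is a minimal sketch (not from the slides) of the penalty-based (SUMT) outer iteration, using MATLAB's fminunc as the inner unconstrained solver; the example constrained problem (minimize x_1 + x_2 subject to x_1^2 + x_2^2 = 2), the starting point, the penalty update factor, and the tolerance are all illustrative assumptions.

% Minimal penalty (SUMT) loop sketch; the example problem, starting point,
% penalty update factor and tolerance are illustrative assumptions.
f = @(x) x(1) + x(2);                      % objective
h = @(x) x(1)^2 + x(2)^2 - 2;              % equality constraint h(x) = 0

x = [1; 0];                                % step 1: starting point
r = 1;                                     % step 2: initial penalty parameter
for k = 1:20
    FP = @(x) f(x) + (r/2)*h(x)^2;         % step 3: penalized unconstrained objective
    x  = fminunc(FP, x, optimset('Display', 'off'));   % step 4: inner minimization
    if abs(h(x)) < 1e-4, break; end        % step 5: stop when (nearly) feasible
    r = 4*r;                               % step 6: increase the penalty and repeat
end
% x should approach the constrained minimizer (-1, -1) as the penalty grows.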


Nonlinear Optimization Algorithm (2)

Stopping criteria

|f(x^k) − f(x^{k−1})| / |f(x^k)| < ε

‖x^k − x^{k−1}‖ / ‖x^k‖ < ε

where ε is a small positive number.

Calculating gradients The gradients are typically computed by the finite difference method in practice. This procedure is general and does not require explicit expressions for the gradient functions (which are often not available in practice – implicit equations)
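
A minimal sketch of such a forward-difference gradient approximation follows; the objective, evaluation point, and perturbation size are illustrative assumptions.

% Minimal forward-difference gradient sketch; the objective, point and
% perturbation size are illustrative assumptions.
f = @(x) 100*(x(2) - x(1)^2)^2 + (1 - x(1))^2;
x = [-1.2; 1];
h = 1e-6;                                  % perturbation size
g  = zeros(size(x));
fx = f(x);
for i = 1:numel(x)
    xp = x;  xp(i) = xp(i) + h;            % perturb the i-th variable only
    g(i) = (f(xp) - fx)/h;                 % forward-difference estimate of df/dx_i
end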


Practical Guidelines (1)

General
Nonlinear programming is computationally challenging. Theoretical results often do not translate into practical behavior. This is because of the discrete nature of digital computations.

Each problem should be considered from modeling to implementation independently from others.

Experimentation can yield insights that can be used to tune the methods for improved efficiency and performance.

Large scale problems require additional care in design and implementation.

Real world problems often do not possess many properties assumed during theoretical analyses (e.g. continuously differentiable functions, etc)


Practical Guidelines (2)

Categories
Mathematical modeling/problem formulation
Scaling
Validation
Method selection
Large scale problems
High performance and parallel implementation


Mathematical Modeling/Problem Formulation (1)

Key questions
What are the objectives of the optimization?
Is an accurate mathematical model of the problem available?
Is reliable data available?

Goal: the simplest mathematical model consistent with the objectives and the accuracy of available data and models

Specific decisions
Objective function?
Number and type of variables?
Number and type of constraints?


Mathematical Modeling/Problem Formulation (2)

Some guidelines
If there is more than one objective function, embody all but one in the constraint set (otherwise, a multi-objective optimization technique has to be used)

If unsure of design and implementation decisions, start with a simple formulation and study the results of the optimization before modifying it

Two rules of thumb: (1) convex objective functions and constraint sets are preferable to non-convex ones; (2) linear and simple nonlinear functions are preferable

Converting nonlinear functions to piecewise linear ones is generally NOT preferable, as this increases the number of variables and distorts the physical understanding of the problem

Converting integer variables to continuous ones may be done
Making objective and constraint functions differentiable may be done


Scaling

Variables should be scaled so that their values are neither too large nor too small relative to one another.

Benefits
Controls round-off errors
Improves the conditioning of the problem
Often improves convergence

Example: suppose x ranges between a and b. It is scaled to [−1, 1] by

[x − (a + b)/2] / [(b − a)/2]
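
A one-line sketch of this scaling; the bounds a and b are illustrative assumptions.

% Sketch of the scaling above; the bounds a and b are illustrative assumptions.
a = 10;  b = 5000;
scale   = @(x) (x - (a + b)/2) / ((b - a)/2);   % maps [a, b] to [-1, 1]
unscale = @(z) z*(b - a)/2 + (a + b)/2;         % maps [-1, 1] back to [a, b]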


Method Selection

A comparison of several methods was presented on a previous slide

Some further considerations
Is a one-time solution sought, or does the problem have to be solved many times (for varying parameters)? In the latter case, efficiency and accuracy are important.

For complex nonlinear problems with non-convex sets, simpler methods like the method of steepest descent are preferable.

If parallel implementation is desired then the conjugate-gradient method is preferable.

The choice of the stepsize rule has a significant impact on efficiency and accuracy.


Validation

It is essential that the solution obtained is validated as correct. There are no fixed procedures for this; each problem has to be considered separately.

Some useful strategies If an optimal solution is known, then a close solution from a method indicates correctness

Run the algorithm with several different starting values to see if it converges to the same optimum solution

Vary parameters of the problem and correlate the behavior of the solution to the physical understanding of the problem.

Solve the problem using different methods


Example – Min. Weight Design of Cold-Formed Steel Beam

Objective of the optimization problem
To develop parameterized minimum-weight design curves for cold-formed steel hat-shape beams


Example (2)

Problem definition
Min. f = µ t L (2b + 2d)
Subject to: the constraints of the building code (AISI)
Variables of the problem are t, b, and d only; others are parameters. The code-specified constraints are complex, nonlinear, and implicit. Some equations are not continuously differentiable.

Scaling
No scaling is needed since the code equations are based on the ratios b/t and d/t

Method Selection
The method of steepest descent (with scaling) is most appropriate because it can be followed and understood, and hence tuned to give good solutions.


Example (3)

[Figure: minimum-weight design curves - thickness t (mm) versus span length (m) for uniform loads q = 2.5 to 20 kN/m on a hat-shape section with top width b, depth d, thickness t, and bottom flanges b/2; Fy = 345 N/mm^2, unbraced]

Validation
The solution is validated by comparing with optimal solutions found by other algorithms and by the parametric behavior of the solution.


Large Scale Problems (1)

What is large-scale? There are no hard and fast rules. One criterion is

Hundreds or thousands of variables
Run-time in the tens of minutes

Considerations in design and implementation
Scalability of the method (both computational efficiency and memory usage)
Memory requirement
Run-time
Utilizing structure in the problem to enhance performance (e.g. large-scale problems are often sparse)
Numerical conditioning
Robustness
Parallel implementability


Large Scale Problems (2)

Recommendations
Non-Newton-like methods such as the method of steepest descent and the conjugate gradient method

The conjugate gradient method, especially when implemented in parallel, is usually the best

Scaling of variables is essential
Scaling of the gradient direction (when using the steepest descent method)


Parallel Implementation

Motivation
Significant speedups can be achieved by implementing the method on a high-performance parallel computer.

Parallel computing is affordable with a cluster of computers running Linux and freely available parallel libraries.

Recommendation
The conjugate gradient method is readily parallelizable on both distributed-memory and shared-memory architectures. Significant speedups and efficiencies are obtained in practice.

Other advantages of parallel implementation
Improved search for a global minimum. This is a consequence of the non-deterministic execution order of parallel programs.

Faster and more stable convergence. This has been observed in practice for the solution of complex and large problems.


MATLAB Optimization Toolbox

The MATLAB toolbox implements several methods. This makes experimentation straightforward and the selection of the best method easier. However, MATLAB code is not as efficient as compiled C or Fortran code. Hence, it is appropriate for small to medium scale problems only.

Two key functions
fminunc - multidimensional unconstrained nonlinear minimization
fmincon - multidimensional constrained nonlinear minimization

These m-files are the primary interface for constrained and unconstrained optimization in MATLAB. Type help optim to list all toolbox functions


Unconstrained Optimization

Syntax X=FMINUNC(FUN,X0,OPTIONS)

where FUN = objective function to be minimized; X0 = starting vector; OPTIONS = structure specifying optimization options

Example FUN can be specified using @:

X = fminunc(@myfun,2) where MYFUN is a MATLAB function such as: function F = myfun(x)

F = sin(x) + 3;


Constrained Optimization

Syntax
X=FMINCON(FUN,X0,A,B,Aeq,Beq,LB,UB,NONLCON,OPTIONS)

This function solves the optimization problem:
Minimize F(X) subject to:
A*X <= B; Aeq*X = Beq (linear constraints)
C(X) <= 0; Ceq(X) = 0 (nonlinear constraints)
LB <= X <= UB

The function NONLCON accepts X and returns the vectors C and Ceq, representing the nonlinear inequalities and equalities, respectively. Like FUN, NONLCON can be specified with a function or with INLINE.
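
A minimal sketch of such a call (not from the slides), reusing the earlier constrained example (minimize x1 + x2 subject to x1^2 + x2^2 = 2); the function name CIRCCON and the starting point are illustrative assumptions.

% Minimal fmincon sketch for the earlier example: minimize x1 + x2 subject to
% x1^2 + x2^2 = 2. There are no linear constraints or bounds, so [] is passed.
x0 = [1; 0];
[X, FVAL] = fmincon(@(x) x(1) + x(2), x0, [], [], [], [], [], [], @circcon);

where CIRCCON is, e.g.:

function [C, Ceq] = circcon(x)
C   = [];                        % no nonlinear inequality constraints
Ceq = x(1)^2 + x(2)^2 - 2;       % nonlinear equality constraint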


Setting Optimization Parameters

The OPTIMSET function is used to modify the OPTIONS structure that specifies optimization parameters such as optimization method and line search method.

Syntax OPTIONS = OPTIMSET('PARAM1',VALUE1,'PARAM2',VALUE2,...)

For medium scale problems, MATLAB provides steepest descent, Newton and Quasi-Newton (BFGS and DFP) methods

For large scale problems, MATLAB provides CG and sequential quadratic programming methods

HessUpdate - [ {bfgs} | dfp | steepdesc ]
Use the help command for more details.
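
A minimal sketch tying these pieces together (not from the slides): selecting the quasi-Newton update via OPTIMSET and minimizing the banana function of the next slide with FMINUNC. The starting point is an illustrative assumption, and option names may differ in newer MATLAB releases.

% Choose the DFP quasi-Newton update and show iteration output, then minimize
% the banana function; the starting point is an illustrative assumption.
OPTIONS = optimset('HessUpdate', 'dfp', 'Display', 'iter');
[X, FVAL] = fminunc(@banana, [-1.9; 2], OPTIONS);

where BANANA is, e.g.:

function F = banana(x)
F = 100*(x(2) - x(1)^2)^2 + (1 - x(1))^2;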


Example (1)

Min. f(x)= 100*(x(2)-x(1)^2)^2+(1-x(1))^2 (banana function)

[Figure: 3-D plot of the banana function]


Example (2)

Optimal solution is x* = [1, 1] and f(x*) = 0

BFGS Quasi-Newton method
Value of the function at the solution: 8.98565e-009
Number of function evaluations: 105

DFP Quasi-Newton method
Value of the function at the solution: 2.26078e-008
Number of function evaluations: 109

Steepest descent method
Value of the function at the solution: 4.84404
Number of function evaluations: 302
Steepest descent did not converge in 302 iterations.


References

Dimitri P. Bertsekas, Nonlinear Programming, Athena Scientific, MA, 1995.

Edwin K. P. Chong et al., An Introduction to Optimization, Wiley, 2001.

Hojjat Adeli and Asim Karim, Construction Scheduling, Cost Optimization and Management, Spon Press, 2001.

Ananth Grama et al., An Introduction to Parallel Computing: Design and Analysis of Algorithms, Addison-Wesley, 2003.

MATLAB Optimization Toolbox, http://www.mathworks.com/access/helpdesk/help/toolbox/optim/optim.shtml