
Short course: Optimality Conditions and Algorithms in Nonlinear Optimization

Part I - Introduction to nonlinear optimization

Gabriel Haeser

Department of Applied Mathematics
Institute of Mathematics and Statistics

University of São Paulo
São Paulo, SP, Brazil

Santiago de Compostela, Spain, October 28-31, 2014

www.ime.usp.br/~ghaeser

Outline

Part I - Introduction to nonlinear optimization
- Examples and historical notes
- First and second order optimality conditions
- Penalty methods
- Interior point methods

Part II - Optimality Conditions
- Algorithmic proof of the Karush-Kuhn-Tucker conditions
- Sequential optimality conditions
- Algorithmic discussion

Part III - Constraint Qualifications
- Geometric interpretation
- First and second order constraint qualifications

Part IV - Algorithms
- Augmented Lagrangian methods
- Inexact Restoration algorithms
- Dual methods


Optimization

Optimization is a mathematical problem with many “real world” applications. The goal is to find minimizers or maximizers of a multivariable real function, under a restricted domain.

- to draw a map of America with areas proportional to the real areas
- hard-spheres problem: to place m points on an n-dimensional sphere in such a way that the smallest distance between two points is maximized


Problem America

To draw a map of America, similar to the usual map, with areas proportional to real areas.

Minimize ½ Σᵢ₌₁ᵐ ‖pᵢ − p̄ᵢ‖²,
Subject to ½ Σᵢ (pᵢˣ pᵢ₊₁ʸ − pᵢ₊₁ˣ pᵢʸ) = βⱼ, j = 1, …, c,

where p̄ᵢ are the given reference points and the j-th sum runs over the nⱼ frontier points of country j.

- c = 17 countries
- βⱼ is the real area of country j
- m = 132 given points p̄ᵢ on the frontiers of the usual map
- Green-Gauss (shoelace) formula to compute areas; a minimal sketch follows
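The area constraints use the Green-Gauss (shoelace) formula. A minimal Python sketch of that computation (the unit square is a hypothetical test polygon, not course data):

```python
import numpy as np

def shoelace_area(p):
    """Signed area of the polygon with vertices p[0], ..., p[-1]
    (Green-Gauss / shoelace formula; the polygon is closed implicitly)."""
    x, y = p[:, 0], p[:, 1]
    return 0.5 * np.sum(x * np.roll(y, -1) - np.roll(x, -1) * y)

square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
print(shoelace_area(square))  # 1.0
```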


Problem America

United States (without Alaska and Hawaii) = 8,080,464 km²

Brazil = 8,514,876 km²

Usual map ratio ≈ 1.32

Real ratio ≈ 0.95

Figures: the usual map; the map with areas proportional to real areas.

Problem America

Figures: areas proportional to GDP; areas proportional to population.


Kissing and hard-spheres problems

The kissing number of dimension n is the largest number of unit spheres that can be placed touching an n-dimensional unit sphere without overlapping.

The hard-spheres problem consists of maximizing the smallest distance d between m points on the n-dimensional sphere of radius 2.

n | Kissing number
2 | 6
3 | 12
4 | 24
5 | 40–44
6 | 72–78
7 | 126–134
8 | 240
9 | 306–364
10 | 500–554

d* ≥ 2 ⇒ kissing number ≥ m

Figures: n = 2, m = 6, d* = 2; n = 3, m = 12, d* ≈ 2.194.
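For concreteness, a minimal sketch of the hard-spheres formulation using SciPy's SLSQP solver. The solver choice, the random starting point, and the case m = 6, n = 2 are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import minimize

m, n = 6, 2                     # six points on the circle of radius 2
rng = np.random.default_rng(0)

def unpack(z):                  # variables: the m points, plus the distance d
    return z[:-1].reshape(m, n), z[-1]

cons = [{'type': 'eq',          # ||p_i||^2 = 4: points stay on the sphere
         'fun': lambda z, i=i: np.sum(unpack(z)[0][i]**2) - 4.0}
        for i in range(m)]
cons += [{'type': 'ineq',       # ||p_i - p_j||^2 >= d^2 for every pair
          'fun': lambda z, i=i, j=j:
              np.sum((unpack(z)[0][i] - unpack(z)[0][j])**2) - unpack(z)[1]**2}
         for i in range(m) for j in range(i + 1, m)]

z0 = np.concatenate([2.0 * rng.standard_normal(m * n), [1.0]])
res = minimize(lambda z: -z[-1], z0, constraints=cons, method='SLSQP')
print(res.x[-1])                # the optimal d; should approach 2 here
```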


Applications: Packing


Applications: Packing
Initial configuration for molecular dynamics


Large scale problems: Finance

Jacek Gondzio and Andreas Grothey (May 2005): a convex quadratic program with 353 million constraints and 1,010 million variables.

Tool: Interior Point Method


Large scale problems: Localization

Find a point in the rectangle but not in the ellipse such that the sum of the distances to the polygons is minimized.

- 1,567,804 polygons
- 3,135,608 variables
- 1,567,804 upper-level constraints
- 12,833,106 lower-level constraints
- convergence in 10 outer iterations, 56 inner iterations, 133 function evaluations, 185 seconds

Tool: Augmented Lagrangian method


TANGO Project - www.ime.usp.br/∼egbirgin/tango

Trustable Algorithms for Nonlinear General Optimization


TANGO Project - www.ime.usp.br/∼egbirgin/tango

40,370 visits registered by Google Analytics since 2007 (more than 3,000 downloads)

USA: 7,969; Brazil: 7,230; Germany: 2,974


TANGO Project - www.ime.usp.br/∼egbirgin/tango

Spain: 733


Historical Notes

- Military programs formulated as a system of linear inequalities gave rise to the term “Programming in a linear structure” (title of the first paper by G. Dantzig, 1948). Koopmans shortened the term to Linear Programming.
- Dorfman (in 1949) thought that Linear Programming was too restrictive and suggested the more general term Mathematical Programming, now called Mathematical Optimization.
- Nonlinear Programming is the title of the 1951 paper by Kuhn and Tucker that deals with optimality conditions. These results extend the Lagrange rule of multipliers (1813) to the case of equality and inequality constraints. They had previously been considered in the 1939 unpublished master’s thesis of Karush (hence, KKT conditions).
- These works are particularly important because they suggest the development of algorithms to deal with practical problems.


Historical Notes

- Linear Programming is part of a revolutionary development that gave humanity the capability to formulate an objective and determine detailed decisions to reach this goal in the best possible way.
- Tools: models, algorithms, computers and software.
- The impossibility of performing large computations is, according to Dantzig, the main reason for the lack of interest in optimization before 1947.

Important topics in computing: (a) dealing with sparsity allows for solving larger problems; (b) global optimization; (c) automatic differentiation of a function represented in a programming language.


Automatic Differentiation

f(x₁, x₂) = sin(x₁) + x₁x₂
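A minimal forward-mode automatic differentiation sketch for this function, using dual numbers that carry a value together with a derivative. This illustrates the idea, not the tool referred to on the slide:

```python
import math

class Dual:
    """A dual number: value plus directional derivative."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        return Dual(self.val * other.val,                  # product rule
                    self.dot * other.val + self.val * other.dot)

def sin(x):
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)  # chain rule

def f(x1, x2):
    return sin(x1) + x1 * x2

# df/dx1 at (2, 3): seed x1 with derivative 1, x2 with derivative 0
print(f(Dual(2.0, 1.0), Dual(3.0)).dot)   # cos(2) + 3
```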


Duality

Game theory and linear programming: in 1948, G. Dantzig visited John von Neumann in Princeton.

J. von Neumann, 1963. Discussion of a maximum problem.
D. Gale, H. W. Kuhn, A. W. Tucker, 1951. Linear programming and the theory of games.

Elements of duality:

- a pair of optimization problems, one a maximum problem with objective function f and the other a minimum problem with objective function h, based on the same data
- for feasible solutions to the pair of problems, always h ≥ f
- a necessary and sufficient condition for optimality is h = f


Duality

(Fermat, XVII century): Given 3 points p₁, p₂ and p₃ on the plane, find the point x that minimizes the sum of the distances from x to p₁, p₂ and p₃.
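A minimal numeric sketch of Fermat's problem with SciPy; the three points and the solver choice are hypothetical:

```python
import numpy as np
from scipy.optimize import minimize

p1, p2, p3 = np.array([0.0, 0.0]), np.array([4.0, 0.0]), np.array([1.0, 3.0])

def total_distance(x):
    return sum(np.linalg.norm(x - p) for p in (p1, p2, p3))

res = minimize(total_distance, (p1 + p2 + p3) / 3.0)   # start at the centroid
print(res.x)   # at the minimizer, the three directions meet at 120 degrees
```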


Duality

(Thomas Moss, The Ladies Diary, 1755): “In the three sides of an equiangular field stand three trees, at the distances of 10, 12 and 16 chains from one another: to find the content of the field, it being the greatest the data will admit.”


Duality

(J.D. Gergonne (ed.), Annales de Mathématiques Pures et Appliquées, 1810-1811): Given any triangle, circumscribe the largest possible equilateral triangle about it.

The solution was given in the 1811-1812 edition by Rochat, Vecten, Fauguier and Pilatte, where duality was acknowledged.


The problem (NLP)

Minimize f(x),
Subject to hᵢ(x) = 0, i = 1, …, m,
gⱼ(x) ≤ 0, j = 1, …, p.

f, hᵢ, gⱼ : Rⁿ → R are (twice) continuously differentiable functions.

Ω = {x ∈ Rⁿ | h(x) = 0, g(x) ≤ 0} (feasible set)


Solution

Global Solution: A feasible point x* ∈ Ω is a global minimizer of NLP when

f(x*) ≤ f(x), ∀x ∈ Ω.

Local Solution: A feasible point x* ∈ Ω is a local minimizer of NLP when there exists a neighbourhood B(x*, ε) of x* such that

f(x*) ≤ f(x), ∀x ∈ Ω ∩ B(x*, ε).

A(x) = {j ∈ {1, …, p} | gⱼ(x) = 0} (set of active inequalities at x ∈ Ω)


Example

Minimize x² + y²,
Subject to x + y − 1 = 0.


First order optimality condition - Lagrange multipliers

Minimize x² + y²,
Subject to x + y − 1 = 0.

x = 1/2, y = 1/2:

(1, 1)ᵀ + (−1)(1, 1)ᵀ = 0,

i.e., ∇f(x*) + λ∇h(x*) = 0 with λ = −1.
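The same computation can be verified symbolically; a minimal SymPy sketch (assuming SymPy is available):

```python
import sympy as sp

x, y, lam = sp.symbols('x y lambda')
L = x**2 + y**2 + lam * (x + y - 1)                    # the Lagrangian
sol = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
print(sol)  # [{x: 1/2, y: 1/2, lambda: -1}]
```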


Example

Maximize x² + y²,
Subject to x + 2y − 2 ≤ 0,
x ≥ 0,
y ≥ 0.


Equivalently:

Minimize −x² − y²,
Subject to x + 2y − 2 ≤ 0,
−x ≤ 0,
−y ≤ 0.

KKT points:

x = 2, y = 0: (−4, 0)ᵀ + 4(1, 2)ᵀ + 8(0, −1)ᵀ = 0

x = 0, y = 1: (0, −2)ᵀ + 1(1, 2)ᵀ + 1(−1, 0)ᵀ = 0

x = 0.4, y = 0.8: (−0.8, −1.6)ᵀ + 0.8(1, 2)ᵀ = 0


First order optimality condition - KKT condition

(Karush-Kuhn-Tucker) Under some condition (a constraint qualification), if x* is a local solution, then there exist Lagrange multipliers λ ∈ Rᵐ and μ ∈ Rᵖ such that:

∇f(x*) + Σᵢ₌₁ᵐ λᵢ∇hᵢ(x*) + Σⱼ₌₁ᵖ μⱼ∇gⱼ(x*) = 0,  (Lagrange condition)

μⱼgⱼ(x*) = 0, j = 1, …, p,  (complementarity)

h(x*) = 0, g(x*) ≤ 0,  (feasibility)

μ ≥ 0.  (dual feasibility)

Interpretation: up to first order, a feasible direction cannot be a descent direction.
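As a quick numeric illustration, the Lagrange condition can be checked at the point x = (2, 0) of the earlier example; the data below are copied from that slide:

```python
import numpy as np

grad_f = np.array([-4.0, 0.0])    # gradient of -x^2 - y^2 at (2, 0)
grad_g1 = np.array([1.0, 2.0])    # gradient of x + 2y - 2 <= 0 (active)
grad_g3 = np.array([0.0, -1.0])   # gradient of -y <= 0 (active)
mu1, mu3 = 4.0, 8.0               # nonnegative multipliers
print(grad_f + mu1 * grad_g1 + mu3 * grad_g3)  # [0. 0.]
```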


Second order optimality condition

x* = (0.4, 0.8)ᵀ, ∇g₁(x*) = (1, 2)ᵀ, ∇²f(x*) = [−2 0; 0 −2].

There exists some d ∈ Rⁿ with ∇g₁(x*)ᵀd ≤ 0 and dᵀ∇²f(x*)d < 0 (for instance, d = (2, −1)ᵀ gives ∇g₁(x*)ᵀd = 0 and dᵀ∇²f(x*)d = −10).

Theorem: Under some conditions, if x* is a local minimizer, then

dᵀ(∇²f(x*) + Σᵢ₌₁ᵐ λᵢ∇²hᵢ(x*) + Σⱼ₌₁ᵖ μⱼ∇²gⱼ(x*))d ≥ 0

for every d ∈ Rⁿ such that

∇f(x*)ᵀd ≤ 0,
∇hᵢ(x*)ᵀd = 0, i = 1, …, m,
∇gⱼ(x*)ᵀd ≤ 0, j ∈ A(x*).

Interpretation: up to second order, every critical direction must be an ascent direction.


History of nonlinear programming

Kuhn, Tucker, 1951. Nonlinear programming.

Albert William Tucker (1905-1995), Princeton University. Topology.

Harold William Kuhn (1925-2014), Princeton University, PhD 1950 in Algebra. Game theory, optimization.

Saddle point problem

φ(x∗, u) ≤ φ(x∗, u∗) ≤ φ(x, u∗), ∀x, u


History of nonlinear programming

William Karush (1917-1997)

1939. Minima of Functions of Several Variables with Inequalities as Side Conditions. M.Sc. thesis, Department of Mathematics, University of Chicago.

Calculus of Variations and Optimization

University of Chicago and California State University (also the Manhattan Project)

“I concluded that you two had exploited and developed the subject so much further than I, that there was no justification for my announcing to the world, ‘Look what I did, first.’” (1975)


History of nonlinear programming

Fritz John (1910 - 1994)

1948. Extremum problems with inequalities as subsidiary conditions.

PhD 1933 in Göttingen under Courant. New York University.

Partial differential equations, convex geometry, nonlinear elasticity


History of nonlinear programming

Fritz John (1910 - 1994)

Let S be a bounded set in Rᵐ. Find the sphere of least positive radius enclosing S.

Minimize F(x) := xₘ₊₁,
Subject to G(x, y) := xₘ₊₁ − Σᵢ₌₁ᵐ (xᵢ − yᵢ)² ≥ 0 for all y ∈ S.

The boundary of a compact convex set S in Rⁿ lies between two homothetic ellipsoids of ratio ≤ n, and the outer ellipsoid can be taken to be the ellipsoid of least volume containing S.


Snell's law of refraction: sin θy / vy = sin θz / vz


Snell's law of refraction: sin θy / vy = sin θz / vz

Minimize T(x) := ‖x − y‖/vy + ‖x − z‖/vz,
Subject to h(x) = 0.

At the solution x*, ∇T(x*) = (x* − y)/(vy‖y − x*‖) + (x* − z)/(vz‖z − x*‖) is parallel to ∇h(x*), the normal vector to the surface.

Define ȳ = x* + (y − x*)/(vy‖y − x*‖) and z̄ = x* + (z − x*)/(vz‖z − x*‖). Hence −∇T(x*) = (ȳ − x*) + (z̄ − x*) is the diagonal of the following parallelogram:


Snell's law of refraction: sin θy / vy = sin θz / vz

By triangle similarity, ȳ and z̄ are equally far from the normal line. Hence

‖ȳ − x*‖ sin θy = ‖z̄ − x*‖ sin θz.

The calculation ‖ȳ − x*‖ = 1/vy and ‖z̄ − x*‖ = 1/vz yields Snell's law.
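A numeric sanity check of this derivation for a flat interface (the line x₂ = 0); the points, the speeds and the solver choice are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import minimize_scalar

y, z = np.array([0.0, 1.0]), np.array([2.0, -1.0])   # points on opposite sides
vy, vz = 1.0, 2.0                                    # speeds in the two media

def T(t):                                            # travel time via (t, 0)
    x = np.array([t, 0.0])
    return np.linalg.norm(x - y) / vy + np.linalg.norm(x - z) / vz

t = minimize_scalar(T, bounds=(0.0, 2.0), method='bounded').x
sin_ty = t / np.hypot(t, y[1])                       # sine of the angle to the normal
sin_tz = (2.0 - t) / np.hypot(2.0 - t, z[1])
print(sin_ty / vy, sin_tz / vz)                      # equal, up to tolerance
```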


External Penalty Method

Choose a sequence ρₖ with ρₖ → +∞ and for each k solve the problem

Minimize f(x) + ρₖP(x),

obtaining the (global) solution xₖ, if it exists.

- P is a smooth function
- P(x) ≥ 0
- P(x) = 0 ⇔ h(x) = 0, g(x) ≤ 0

For example: P(x) = ‖h(x)‖₂² + ‖max{0, g(x)}‖₂²


External Penalty Method

Theorem: If xₖ is well defined, then every limit point of {xₖ} is a global solution of Minimize P(x).

Theorem: If xₖ is well defined and there exists a point where the function P vanishes (the feasible region is not empty), then every limit point of {xₖ} is a global solution of Minimize f(x), Subject to h(x) = 0, g(x) ≤ 0.

The External Penalty Method can be used as a theoretical tool to prove the KKT conditions, but it can also be adjusted into an efficient algorithm (the Augmented Lagrangian method).


External Penalty Method

Minimize x₁² + x₂²,
Subject to x₁ − 1 = 0,
x₂ − 1 ≤ 0.

Minimize Φₖ(x) := x₁² + x₂² + ρₖ((x₁ − 1)² + max{0, x₂ − 1}²).

Solving ∇Φₖ(x) = 0 we get xₖ = (ρₖ/(1 + ρₖ), 0) → (1, 0).

Show simulation
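A minimal sketch of the external penalty iteration on this example, solving the unconstrained subproblems with SciPy; the solver and the schedule for ρₖ are assumptions:

```python
import numpy as np
from scipy.optimize import minimize

def penalized(x, rho):
    f = x[0]**2 + x[1]**2
    P = (x[0] - 1.0)**2 + max(0.0, x[1] - 1.0)**2   # h(x)^2 + max{0, g(x)}^2
    return f + rho * P

x = np.zeros(2)
for rho in [1.0, 1e1, 1e2, 1e3, 1e4]:
    x = minimize(penalized, x, args=(rho,)).x       # warm-started subproblem
    print(rho, x)                                   # x_k -> (1, 0)
```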


Internal Penalty Method

Choose a sequence μₖ with μₖ → 0⁺ and for each k solve the problem

Minimize f(x) + μₖB(x),
Subject to h(x) = 0,
g(x) < 0.

- B is smooth
- B(x) ≥ 0
- B(x) → +∞ if some gᵢ(x) → 0 with g(x) < 0

For example: B(x) = −Σᵢ₌₁ᵖ log(−gᵢ(x))
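A minimal one-dimensional illustration of the barrier idea. The problem, minimize x subject to 1 − x ≤ 0, is a hypothetical example whose barrier subproblems solve in closed form:

```python
for mu in [1.0, 0.1, 0.01, 0.001]:
    # minimize x - mu*log(x - 1): from 1 - mu/(x - 1) = 0 we get x = 1 + mu
    print(mu, 1.0 + mu)   # the barrier minimizers tend to x* = 1 as mu -> 0+
```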


Interior Point Method

Consider the convex quadratic problem

Minimize cᵀx + ½xᵀQx,
Subject to Ax = b,
x ≥ 0,

and the barrier subproblem

Minimize cᵀx + ½xᵀQx − μ Σⱼ₌₁ⁿ log xⱼ,
Subject to Ax = b,
x > 0.

KKT condition:

c − Aᵀλ + Qx − μX⁻¹e = 0,
Ax = b,

where X⁻¹ = diag{x₁⁻¹, …, xₙ⁻¹} and e = (1, …, 1)ᵀ. Denoting s = μX⁻¹e we get

Aᵀλ + s − Qx = c,
Ax = b,
XSe = μe, (x, s) > 0.


Interior Point Method

Active-set methods:

Aᵀλ + s − Qx = c,
Ax = b,
XSe = 0,
(x, s) ≥ 0.

Interior point methods:

Aᵀλ + s − Qx = c,
Ax = b,
XSe = μe,
(x, s) > 0.


Interior Point Method

Complementarity: xᵢsᵢ = 0, ∀i = 1, …, n.

Active-set methods try to guess the optimal active set A ⊆ {1, …, n} and set xᵢ = 0 for i ∈ A (active constraints), sᵢ = 0 for i ∉ A (inactive constraints).

Interior point methods use ε-mathematics: replace xᵢsᵢ = 0, ∀i = 1, …, n by xᵢsᵢ = μ, ∀i = 1, …, n.

Force convergence by letting μ → 0⁺.


Interior Point Method

Solve the nonlinear system of equations

f(x, λ, s) = 0,

where f : R²ⁿ⁺ᵐ → R²ⁿ⁺ᵐ is the mapping

f(x, λ, s) = (Aᵀλ + s − Qx − c, Ax − b, XSe − μe).


Interior Point Method

Newton direction:

[−Q Aᵀ I; A 0 0; S 0 X](Δx, Δλ, Δs) = (c − Aᵀλ − s + Qx, b − Ax, μe − XSe).

Reduce μ at each Newton iteration.


Interior Point Method

Algorithm:

Step 0: Choose (x⁰, λ⁰, s⁰) with (x⁰, s⁰) > 0, μ₀ > 0, and parameters 0 < γ < 1 and ε > 0. Set k = 0.

Step 1: Compute the Newton direction (Δx, Δλ, Δs) at (x, λ, s) := (xᵏ, λᵏ, sᵏ).

Step 2: Choose a stepsize α such that (xᵏ + αΔx, sᵏ + αΔs) > 0 and set (xᵏ⁺¹, λᵏ⁺¹, sᵏ⁺¹) = (xᵏ, λᵏ, sᵏ) + α(Δx, Δλ, Δs).

Step 3: Update μₖ₊₁ = γμₖ.

Step 4: If (xᵏ⁺¹)ᵀsᵏ⁺¹ ≤ ε(x⁰)ᵀs⁰, stop. Else set k := k + 1 and go to Step 1.
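A minimal NumPy sketch of this algorithm on a tiny QP. The data Q, A, b, c, the parameter values and the fraction-to-the-boundary stepsize rule are assumptions for illustration:

```python
import numpy as np

# Hypothetical QP data: minimize c@x + 0.5*x@Q@x  s.t.  A@x = b, x >= 0
Q = np.diag([2.0, 2.0]); c = np.array([-2.0, -5.0])
A = np.array([[1.0, 1.0]]); b = np.array([1.0])
n, m = 2, 1

x, lam, s = np.ones(n), np.zeros(m), np.ones(n)
mu, gamma, eps = 1.0, 0.5, 1e-8
e, tol0 = np.ones(n), np.ones(n) @ np.ones(n)

while x @ s > eps * tol0:
    X, S = np.diag(x), np.diag(s)
    # the Newton system of the previous slide
    K = np.block([[-Q, A.T, np.eye(n)],
                  [A, np.zeros((m, m)), np.zeros((m, n))],
                  [S, np.zeros((n, m)), X]])
    r = np.concatenate([c - A.T @ lam - s + Q @ x,
                        b - A @ x,
                        mu * e - X @ S @ e])
    d = np.linalg.solve(K, r)
    dx, dlam, ds = d[:n], d[n:n + m], d[n + m:]
    # stepsize keeping (x, s) > 0 (an assumed fraction-to-the-boundary rule)
    alpha = 1.0
    for v, dv in ((x, dx), (s, ds)):
        if (dv < 0).any():
            alpha = min(alpha, 0.995 * np.min(-v[dv < 0] / dv[dv < 0]))
    x, lam, s = x + alpha * dx, lam + alpha * dlam, s + alpha * ds
    mu *= gamma

print(x)  # approximately (0, 1), the solution of this QP
```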


Interior Point Method

Consider the merit function

ψ(x, s) = (n + √n) log(xᵀs) − Σᵢ₌₁ⁿ log(xᵢsᵢ).

(Note that ψ(x, s) → −∞ ⇒ xᵀs → 0.)

Choosing the stepsize α that minimizes ψ(xᵏ + αΔx, sᵏ + αΔs) (exact line search) we get:

Theorem: If γ = n/(n + √n), we have (xᵏ)ᵀsᵏ ≤ ε(x⁰)ᵀs⁰ in O(√n log(n/ε)) iterations.


Algorithms

- There is no “direct method” to solve NLP; NLP is solved using iterative methods.
- An iterative method generates a sequence of points xₖ ∈ Rⁿ that converges (or not) to a solution of the problem.
- Iterative methods are programmed and implemented on computers, where exact real arithmetic is replaced by floating-point operations.


Algorithms

- Theory is necessary to avoid performing an infinite number of experiments.
- Useful theory should be able to predict the behavior of many experiments.
- Usually, the theory does not refer to the actual sequences generated by the computer, but to theoretical sequences defined by the algorithms.
- The analogy between actual sequences and theoretical sequences is not perfect.
- There are practical phenomena that the theory is not able to predict, but relevant theory is the one that contributes to explaining practical phenomena.
