
Math Camp Notes

Alexis Akira Toda

Department of Economics, University of California San Diego. Email: [email protected]

Page 2: econweb.ucsd.edujsobel/205f10/alexisnotes.pdfContents I Basics 5 1 Linear algebra 6 1.1 Linearity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .6 1.2 Inner product

Contents

I Basics

1 Linear algebra
  1.1 Linearity
  1.2 Inner product and norm
  1.3 Matrix
  1.4 Identity matrix, inverse, determinant
  1.5 Transpose, symmetric matrices
  1.6 Eigenvector, diagonalization

2 Topology of RN
  2.1 Convergence of sequences
  2.2 Topological properties
  2.3 Continuous functions

3 Multi-Variable Calculus
  3.1 A motivating example
  3.2 Differentiation
  3.3 Vector notation and gradient
  3.4 Mean-value theorem and Taylor’s theorem
  3.A Chain rule

4 Multi-variable unconstrained optimization
  4.1 First and second-order conditions
  4.2 Convex optimization
    4.2.1 General case
    4.2.2 Quadratic case
  4.3 Algorithms
    4.3.1 Gradient method
    4.3.2 Newton method
  4.A Characterization of convex functions
  4.B Newton method for solving nonlinear equations

5 Multi-Variable Constrained Optimization
  5.1 A motivating example
    5.1.1 The problem
    5.1.2 A solution
    5.1.3 Why study the general theory?
  5.2 Optimization with linear constraints
    5.2.1 One linear constraint
    5.2.2 Multiple linear constraints
    5.2.3 Linear inequality and equality constraints
  5.3 Optimization with nonlinear constraints
    5.3.1 Karush-Kuhn-Tucker theorem
    5.3.2 Convex optimization
    5.3.3 Constrained maximization

6 Contraction Mapping Theorem and Applications
  6.1 Contraction Mapping Theorem
  6.2 Markov chain
  6.3 Implicit function theorem

II Advanced topics

7 Convex Sets
  7.1 Convex sets
  7.2 Hyperplanes and half spaces
  7.3 Separation of convex sets
  7.4 Application: asset pricing

8 Convex Functions
  8.1 Convex functions
  8.2 Characterization of convex functions
  8.3 Subgradient of convex functions
  8.4 Application: Pareto efficient allocations

9 Convex Programming
  9.1 Convex programming
  9.2 Constrained maximization
  9.3 Portfolio selection
    9.3.1 The problem
    9.3.2 Mathematical formulation
    9.3.3 Solution
  9.4 Capital asset pricing model (CAPM)
    9.4.1 The model
    9.4.2 Equilibrium
    9.4.3 Asset pricing

10 Nonlinear Programming
  10.1 The problem and the solution concept
  10.2 Tangent cone and normal cone
    10.2.1 Cone and its dual
    10.2.2 Tangent cone and normal cone
  10.3 Karush-Kuhn-Tucker theorem
  10.4 Constraint qualifications
  10.5 Sufficient condition

11 Sensitivity Analysis
  11.1 Motivation
    11.1.1 Example
    11.1.2 Solution
  11.2 Sensitivity analysis
    11.2.1 The problem
    11.2.2 Implicit function theorem
    11.2.3 Main result

12 Duality Theory
  12.1 Motivation
  12.2 Example
    12.2.1 Linear programming
    12.2.2 Entropy maximization
  12.3 Convex conjugate function
  12.4 Duality theory


Notations

Symbol    Meaning
∀x . . .    for all x . . .
∃x . . .    there exists x such that . . .
∅    empty set
x ∈ A or A ∋ x    x is a member of the set A
A ⊂ B or B ⊃ A    A is a subset of B; B contains A
A ∩ B    intersection of the sets A and B
A ∪ B    union of the sets A and B
A \ B    elements of A but not in B
RN    set of vectors x = (x1, . . . , xN) with xn ∈ R
RN+    set of x = (x1, . . . , xN) with xn ≥ 0 for all n
RN++    set of x = (x1, . . . , xN) with xn > 0 for all n
x ≥ y or y ≤ x    xn ≥ yn for all n; same as x − y ∈ RN+
x > y or y < x    xn ≥ yn for all n and xn > yn for some n
x ≫ y or y ≪ x    xn > yn for all n; same as x − y ∈ RN++
〈x, y〉    inner product of x and y, 〈x, y〉 = x1y1 + · · · + xNyN
‖x‖    Euclidean norm of x, ‖x‖ = √(x1² + · · · + xN²)
cl A    closure of A
int A    interior of A
co A    convex hull of A
[a, b]    closed interval {x | a ≤ x ≤ b}
(a, b)    open interval {x | a < x < b}
(a, b]; [a, b)    half-open intervals
f : A → B    f is a function defined on A taking values in B
dom f    effective domain of f, {x ∈ RN | f(x) < ∞}
epi f    epigraph of f, {(x, y) ∈ RN × R | y ≥ f(x)}
f ∈ Cr(Ω)    the function f is r times continuously differentiable on Ω
f ≤ g or g ≥ f    f(x) ≤ g(x) for all x
∇f(x)    gradient (vector of partial derivatives) of f at x
∇2f(x)    Hessian (matrix of second derivatives) of f at x
Df(x)    Jacobian (matrix of partial derivatives) of f at x


Part I

Basics


Chapter 1

Linear algebra

1.1 Linearity

In mathematics, “linear” means that a property is preserved by addition and multiplication by a constant. A linear space (more commonly, vector space) is a set X for which x + y (addition) and αx (multiplication by α) are defined, where x, y ∈ X and α ∈ R. In this course we only encounter the Euclidean space RN, which consists of N-tuples of real numbers (called N-vectors)

x = (x1, . . . , xN ).

Here addition and multiplication by a constant are defined component-wise:

(x1, . . . , xN ) + (y1, . . . , yN ) := (x1 + y1, . . . , xN + yN )

α(x1, . . . , xN ) := (αx1, . . . , αxN ).

(The symbol “:=” means that we define the left-hand side by the right-hand side.)

A linear function is a function that preserves linearity. Thus f : RN → R is linear if

f(x+ y) = f(x) + f(y),

f(αx) = αf(x).

An obvious example of a linear function f : RN → R is

f(x) = a1x1 + · · · + aNxN = ∑_{n=1}^N anxn,

where a1, . . . , aN are numbers. In fact we can show that all linear functions are of this form.

Proposition 1.1. If f : RN → R is linear, then f(x) = a1x1 + · · · + aNxN for some a1, . . . , aN.

Proof. Let e1 = (1, 0, . . . , 0) be the vector whose first element is 1 and whose other elements are 0. Similarly, let en = (0, . . . , 0, 1, 0, . . . , 0) be the vector whose n-th element is 1


and whose other elements are 0. (These vectors are called unit vectors.) By the definition of RN, we have

x = (x1, . . . , xN ) = x1e1 + · · ·+ xNeN .

Hence by the linearity of f , we get

f(x) = x1f(e1) + · · ·+ xNf(eN ),

so f(x) has the desired form by setting an = f(en).

1.2 Inner product and norm

An expression of the form a1x1 + · · · + aNxN appears so often that it deserves a special name and notation. Let x = (x1, . . . , xN) and y = (y1, . . . , yN) be two vectors. Then

〈x, y〉 := x1y1 + · · · + xNyN = ∑_{n=1}^N xnyn

is called the inner product (also called the dot product) of x and y.1 The (Euclidean) norm of x is defined by

‖x‖ := √〈x, x〉 = √(x1² + · · · + xN²).

The Euclidean norm is also called the L2 norm for a reason that will be clear later.

Fixing x, the inner product 〈x, y〉 is linear in y, so we have

〈x, y1 + y2〉 = 〈x, y1〉 + 〈x, y2〉,
〈x, αy〉 = α〈x, y〉.

The same holds for x as well, fixing y. So the inner product is a bilinear function of x and y.

You might remember from high school algebra/geometry that the inner product in a two-dimensional space satisfies

〈x, y〉 = x1y1 + x2y2 = ‖x‖ ‖y‖ cos θ,

where θ is the angle between the vector x = (x1, x2) and y = (y1, y2). Since

cos θ  > 0  (θ is an acute angle),
       = 0  (θ is a right angle),
       < 0  (θ is an obtuse angle),

the vectors x, y are orthogonal if 〈x, y〉 = 0 and form an acute (obtuse) angle if 〈x, y〉 > 0 (< 0). Most of us cannot “see” higher dimensional spaces, but geometric intuition is very useful. For any x, y ∈ RN, we say that x, y are orthogonal if 〈x, y〉 = 0.

The inner product and norms of vectors x, y satisfy the following Cauchy-Schwarz inequality: |〈x, y〉| ≤ ‖x‖ ‖y‖. The proof is an exercise. The norm ‖·‖ satisfies

1 The term “inner” may seem odd, but it is used because there is also a notion of “outer product”. The inner product of x, y is sometimes denoted by (x, y), x · y, 〈x|y〉, etc.


1. ‖x‖ ≥ 0, with equality if and only if x = 0,

2. ‖αx‖ = |α| ‖x‖,

3. ‖x+ y‖ ≤ ‖x‖+ ‖y‖.

The last inequality is called the triangle inequality because it says that the length of any edge of a triangle is less than or equal to the sum of the lengths of the remaining two edges. (Draw a picture of a triangle with vertices at the points 0, x, and x + y.) Proving the triangle inequality is an exercise.
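These inequalities are easy to check numerically. The following Matlab sketch (added here for illustration, not part of the original notes; the dimension and vectors are arbitrary) verifies the Cauchy-Schwarz and triangle inequalities for random vectors.

    % Numerical check of the Cauchy-Schwarz and triangle inequalities
    N = 5;                      % dimension (arbitrary)
    x = randn(N, 1);            % random N-vectors
    y = randn(N, 1);
    ip = x' * y;                % inner product <x, y>
    fprintf('|<x,y>|  = %.4f <= ||x|| ||y||    = %.4f\n', abs(ip), norm(x) * norm(y));
    fprintf('||x+y|| = %.4f <= ||x|| + ||y|| = %.4f\n', norm(x + y), norm(x) + norm(y));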

1.3 Matrix

Instead of a linear function f : RN → R, consider a linear map f : RN → RM. This means that for each x ∈ RN, f associates a vector f(x) ∈ RM, and f is linear (preserves addition and multiplication by a constant): f(x + y) = f(x) + f(y) and f(αx) = αf(x). Let fm(x) be the m-th element of f, so f(x) = (f1(x), . . . , fM(x)). It’s easy to see that each fm(x) is a linear function of x. Hence by Proposition 1.1, we have

fm(x) = am1x1 + · · ·+ amNxN

for some numbers am1, . . . , amN. Since this is true for any m, a linear map corresponds to an array of numbers amn, where 1 ≤ m ≤ M and 1 ≤ n ≤ N. Conversely, any such array of numbers corresponds to a linear map. We write

A = (amn) =
    [ a11  · · ·  a1n  · · ·  a1N ]
    [  ⋮          ⋮          ⋮   ]
    [ am1  · · ·  amn  · · ·  amN ]
    [  ⋮          ⋮          ⋮   ]
    [ aM1  · · ·  aMn  · · ·  aMN ]

and call it a matrix. For an M × N matrix A and an N-vector x, we define the M-vector Ax as the vector whose m-th element is

am1x1 + · · ·+ amNxN .

So f(x) = Ax is a linear map from RN to RM. By defining addition and multiplication by a constant component-wise, the set of all M × N matrices can be identified with RMN, the MN-dimensional Euclidean space.

Now consider linear maps f : RN → RM and g : RM → RL. Since f, g are linear, we can find an M × N matrix A = (amn) and an L × M matrix B = (blm) such that f(x) = Ax and g(y) = By. We can also consider the composition of these two maps, h = g ∘ f, where h(x) := g(f(x)). It is easy to see that h is a linear map from RN to RL, and therefore it can be written as h(x) = Cx with an L × N matrix C = (cln). Using the definition h(x) = g(f(x)) = B(Ax), it is not hard to see (exercise) that

cln = ∑_{m=1}^M blm amn.


So it makes sense to define the matrix product C = BA by this rule. You can use all standard rules of algebra such as B(A1 + A2) = BA1 + BA2, A(BC) = (AB)C, etc. The proof is immediate by carrying out the algebra or by thinking about linear maps. In Matlab, A+B and A*B return the sum and the product of matrices A, B (if they are well-defined). If A, B have the same size, then A.*B returns the component-wise product (Hadamard product).
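The correspondence between composition of linear maps and matrix multiplication is easy to see concretely. The following Matlab sketch (an illustration added here, not part of the original notes; all dimensions are arbitrary) checks that applying B to Ax gives the same result as applying the product BA directly.

    % Composition of linear maps corresponds to the matrix product C = B*A
    L = 2; M = 3; N = 4;        % arbitrary dimensions
    A = randn(M, N);            % f(x) = A*x maps R^N -> R^M
    B = randn(L, M);            % g(y) = B*y maps R^M -> R^L
    x = randn(N, 1);
    C = B * A;                  % matrix of the composition h = g o f
    disp(norm(B * (A * x) - C * x));   % essentially zero (up to rounding)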

1.4 Identity matrix, inverse, determinant

An M × N matrix is square when M = N. The identity map id : RN → RN defined by id(x) = x is clearly linear and has a corresponding matrix I. By a simple calculation, I is square, its diagonal elements are all 1, and its off-diagonal elements are 0. Clearly AI = IA = A when A is a square matrix (think of maps, or alternatively, do the component-wise calculation). In Matlab, eye(N) returns the N-dimensional identity matrix.

A map f : RN → RN is said to be one-to-one (or injective) if f(x) ≠ f(y) whenever x ≠ y. f is onto (or surjective) if for each y ∈ RN, there exists x ∈ RN such that y = f(x). f is bijective if f is both injective and surjective. If f is bijective, for each y ∈ RN there exists a unique x ∈ RN such that y = f(x). Since this x depends only on y, we write x = f−1(y) and say that f−1 is the inverse of f. Now if f : RN → RN is a bijective linear map with a corresponding square matrix A (so f(x) = Ax), its inverse f−1 is also linear and hence has a matrix representation. We write this matrix A−1 and call it the inverse of A. Clearly AA−1 = A−1A = I. The inverse of A, if it exists, is unique. To see this, suppose that B, C are both inverses of A. Then AB = BA = I and AC = CA = I, so

B = BI = B(AC) = (BA)C = IC = C.

A matrix that has an inverse is called regular, nonsingular, invertible, etc. In Matlab, inv(A) returns the inverse of A.

If A = [a b; c d], then the determinant of A is det A = ad − bc. In general, we can define the determinant of a square matrix inductively as follows. For a 1 × 1 matrix A = (a), we have det A = a. Suppose that the determinant has been defined up to (N − 1) × (N − 1) matrices. If A = (amn) is N × N, then

det A = ∑_{n=1}^N (−1)^{m+n} amn Mmn  (for any fixed m)  = ∑_{m=1}^N (−1)^{m+n} amn Mmn  (for any fixed n),

where Mmn is the determinant of the matrix obtained by removing row m and column n of A. It is well known that this definition is consistent (i.e., it does not depend on m, n). The following are useful properties of the determinant (see textbooks for proofs).

1. A is regular if and only if det A ≠ 0. In that case, we have A−1 = (1/det A) Ã, where Ã = (ãmn) satisfies ãmn = (−1)^{m+n} Mnm. For example, if A = [a b; c d], then A−1 = (1/(ad − bc)) [d −b; −c a].

2. If A, B are square matrices of the same order, then det(AB) = (det A)(det B).


3. If A is partitioned as A = [A11 A12; O A22], where A11 and A22 are square matrices, then det A = (det A11)(det A22).

In Matlab, det(A) returns the determinant of A.
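These properties are easy to verify for a 2 × 2 example. The following Matlab sketch (added for illustration, not part of the original notes; the particular matrices are arbitrary) checks the formula det A = ad − bc, the inverse formula from property 1, and the product rule from property 2.

    % Check determinant properties for 2x2 matrices
    A = [1 2; 3 4];
    B = [0 1; 5 2];
    disp(det(A) - (1*4 - 2*3));            % det A = ad - bc, so this is 0
    Ainv = (1 / det(A)) * [4 -2; -3 1];    % A^{-1} = (1/det A) [d -b; -c a]
    disp(norm(Ainv - inv(A)));             % essentially zero
    disp(det(A * B) - det(A) * det(B));    % product rule, essentially zero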

1.5 Transpose, symmetric matrices

When numbers are stacked horizontally like x = (x1, . . . , xN), it is called a row vector. When stacked vertically like

    [ x1 ]
    [ ⋮  ]
    [ xN ]

it is a column vector. An N-column vector is the same as an N × 1 matrix. An N-row vector is the same as a 1 × N matrix. The notation f(x) = Ax is compatible with the definition of the product of an M × N matrix A and an N × 1 matrix x. To see this, writing down the elements, we get

Ax =
    [ a11x1 + · · · + a1NxN ]
    [          ⋮            ]
    [ am1x1 + · · · + amNxN ]
    [          ⋮            ]
    [ aM1x1 + · · · + aMNxN ]
  =
    [ a11  · · ·  a1n  · · ·  a1N ] [ x1 ]
    [  ⋮          ⋮          ⋮   ] [ ⋮  ]
    [ am1  · · ·  amn  · · ·  amN ] [ xn ]
    [  ⋮          ⋮          ⋮   ] [ ⋮  ]
    [ aM1  · · ·  aMn  · · ·  aMN ] [ xN ]
  = Ax.

(The left-most Ax is the linear map; the right-most Ax is the multiplication of the M × N matrix A and the N × 1 matrix x.)

In this course vectors are always column vectors, unless otherwise specified. However, it is awkward to write down column vectors every time because they take up a lot of space, so we use the notation (x1, . . . , xN)′ (with a prime) to denote a column vector. (x1, . . . , xN)′ is called the transpose of the row vector x = (x1, . . . , xN). Oftentimes we are even more sloppy and don’t distinguish between a row and a column vector. (After all, what does it mean mathematically to stack numbers horizontally or vertically?)

The transpose can be defined for matrices, too. For an M × N matrix A = (amn), we define its transpose as the N × M matrix B = (bnm), where bnm = amn, and we write B = A′. Thus A′ is the matrix obtained by flipping A “diagonally”. In Matlab, A' returns the transpose of A.

A square matrix P such that P′P = PP′ = I is called orthogonal, because by definition the column vectors of P are orthogonal and have norm 1 (just write down the elements of P′P). If P is orthogonal, then clearly P−1 = P′.

A square matrix A such that A = A′ is called symmetric, because its elements are symmetric about the diagonal. A is positive semidefinite if 〈x, Ax〉 ≥ 0 for all x, and positive definite if in addition 〈x, Ax〉 = 0 only if x = 0. Symmetric matrices have a natural (partial) order (exercise): we write A ≥ B if and only if A − B is positive semidefinite.

There is a simple test for positive definiteness. Let A be square. The determinant of the matrix obtained by keeping the first k rows and columns of A is


called the k-th principal minor of A. For example, if A = (amn) is N × N, then the first principal minor is a11, the second principal minor is a11a22 − a12a21, and the N-th principal minor is det A, etc.

Proposition 1.2. Let A be real symmetric. Then A is positive definite if and only if its principal minors are all positive.

Proof. We prove the claim by mathematical induction on the dimension N of the matrix A. If N = 1, the claim is trivial.

Suppose the claim is true up to dimension N − 1, and let A be N-dimensional. Partition A as A = [A1 b; b′ c], where A1 is an (N − 1)-dimensional symmetric matrix, b is an (N − 1)-dimensional vector, and c is a scalar. Let

P = [I  −A1⁻¹b; 0  1].

Then by simple algebra we get

P′AP = [A1  0; 0  c − b′A1⁻¹b].

Clearly det P = 1, so P is regular. Since 〈x, Ax〉 = x′Ax = (P−1x)′(P′AP)(P−1x), A is positive definite if and only if P′AP is. But since P′AP is block diagonal, P′AP is positive definite if and only if A1 is positive definite and c − b′A1⁻¹b > 0. By assumption, A1 is positive definite if and only if its principal minors are all positive. Furthermore, since det P = 1, we get

det A = det(P′AP) = (det A1)(c − b′A1⁻¹b).

Therefore

A > O ⟺ all principal minors of A1 are positive and c − b′A1⁻¹b > 0
      ⟺ all principal minors of A1 are positive and det A > 0
      ⟺ all principal minors of A are positive,

so the claim is true for N as well.
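Proposition 1.2 can also be checked numerically. The Matlab sketch below (an illustration added here, not part of the original notes; the sample matrix is arbitrary) compares the principal-minor test with the eigenvalue criterion of Exercise 1.6.

    % Principal minor test for positive definiteness
    A = [2 1 0; 1 2 1; 0 1 2];              % a symmetric matrix
    N = size(A, 1);
    minors = zeros(N, 1);
    for k = 1:N
        minors(k) = det(A(1:k, 1:k));       % k-th principal minor
    end
    disp(minors');                          % all positive
    disp(eig(A)');                          % all eigenvalues positive as well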

1.6 Eigenvector, diagonalization

If A is a square matrix and there exist a number α and a nonzero vector v such that Av = αv, then we say that v is an eigenvector of A associated with the eigenvalue α. Since

Av = αv ⟺ (αI − A)v = 0,

α is an eigenvalue of A if and only if det(αI − A) = 0 (for otherwise αI − A is invertible, which would imply v = 0, a contradiction). ΦA(x) := det(xI − A) is called the characteristic polynomial of A. In Matlab, eig(A) returns the eigenvalues of A.

Even if A is a real matrix, eigenvalues and eigenvectors need not be real. For complex vectors x, y, the inner product is defined by

〈x, y〉 = x∗y = x̄′y = ∑_{n=1}^N x̄n yn,


where x̄ is the complex conjugate of x and x∗ = x̄′ is the transpose of the complex conjugate of x (called the adjoint). By definition, 〈x, y〉 is the complex conjugate of 〈y, x〉. Similarly, for a complex matrix A, its adjoint A∗ is defined as the complex conjugate of the transpose. It is easy to see that 〈x, Ay〉 = 〈A∗x, y〉, because

〈A∗x, y〉 = (A∗x)∗y = x∗(A∗)∗y = x∗Ay = 〈x,Ay〉 .

Matrices satisfying A∗ = A are called Hermitian. If A is real, then a Hermitian matrix is the same as a symmetric matrix. For a Hermitian matrix A, the quadratic form 〈x, Ax〉 is real, for

〈x, Ax〉∗ = 〈Ax, x〉 = 〈A∗x, x〉 = 〈x, Ax〉.

Proposition 1.3. The eigenvalues of a Hermitian matrix are real.

Proof. Suppose that Av = αv with v ≠ 0. Then

〈v, Av〉 = 〈v, αv〉 = α〈v, v〉 = α‖v‖²

is real, so α = 〈v, Av〉/‖v‖² is also real.

Since real symmetric matrices are Hermitian, the eigenvalues of real symmetric matrices are all real (and so are the eigenvectors).

If U is a square matrix such that U∗U = UU∗ = I, then U is called unitary. Real unitary matrices are orthogonal, by definition.

We usually take the standard basis e1, . . . , eN in RN, but that is not necessary. Suppose we take vectors {p1, . . . , pN}, where the matrix P = [p1, . . . , pN] is regular. Let x be any vector and y = P−1x. Then

x = PP−1x = Py = y1p1 + · · ·+ yNpN ,

so the elements of y can be interpreted as the coordinates of x when we use the basis P. What does a matrix A look like when we use the basis P? Consider the linear map x ↦ Ax. Using the basis P, this map looks like P−1x ↦ P−1Ax = (P−1AP)(P−1x), so using the basis P, the linear map x ↦ Ax has the matrix representation B = P−1AP. Oftentimes, it is useful to find a matrix P such that P−1AP is a simple matrix. The simplest matrices of all are diagonal ones. If we can find P such that P−1AP is diagonal, we say that A is diagonalizable. A remarkable property of real symmetric matrices is that they are diagonalizable with some orthogonal matrix.

Theorem 1.4. Let A be real symmetric. Then there exists a real orthogonal matrix P such that P−1AP = P′AP = diag[α1, . . . , αN], where α1, . . . , αN are the eigenvalues of A. (diag is the symbol for a diagonal matrix with the specified diagonal elements.)

Proof. We prove the claim by mathematical induction on the dimension of A. If A is one-dimensional (scalar), then the claim is trivial. (Just take P = 1.)

Suppose that the claim is true up to dimension N − 1. Let α1 be an eigenvalue of A and p1 an associated eigenvector, normalized so that ‖p1‖ = 1; thus Ap1 = α1p1. Let W1 be the set of vectors orthogonal to p1, so W1 = {x | 〈p1, x〉 = 0}. If x ∈ W1, then

〈p1, Ax〉 = 〈A′p1, x〉 = 〈Ap1, x〉 = 〈α1p1, x〉 = α1〈p1, x〉 = 0,


so Ax ∈ W1. Pick an orthonormal basis {q2, . . . , qN} in W1. Letting P1 = [p1, q2, . . . , qN], we then have

AP1 = P1 [α1  0; 0  A1]  ⟺  P1′AP1 = [α1  0; 0  A1],

where A1 is some (N − 1)-dimensional matrix. To see this, note that

AP1 = [Ap1, Aq2, . . . , AqN] = [α1p1, Aq2, . . . , AqN].

Since each qn belongs to W1, we have Aqn ∈ W1, so Aqn is a linear combination of q2, . . . , qN, and the matrix A1 above exists. Since A is symmetric, (P1′AP1)′ = P1′A′P1 = P1′AP1, so P1′AP1 is also symmetric. Therefore A1 is symmetric. Since A1 is (N − 1)-dimensional, by assumption we can take an orthogonal matrix P2 such that P2′A1P2 = D1, where D1 = diag[α2, . . . , αN] is diagonal. Then A1 = P2D1P2′, so

P1′AP1 = [α1  0; 0  P2D1P2′] = [1  0; 0  P2] [α1  0; 0  D1] [1  0; 0  P2]′.

Letting P = P1 [1  0; 0  P2], which is orthogonal, we get P′AP = [α1  0; 0  D1] = diag[α1, . . . , αN], so P′AP is diagonal.

Similarly, Hermitian matrices can be diagonalized by unitary matrices. Diagonalization is often useful for proving theorems (see the exercises).
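In Matlab, an orthogonal matrix P as in Theorem 1.4 can be obtained from eig. The sketch below (an illustration added here, not part of the original notes) computes the eigenvectors of a small symmetric matrix and verifies both orthogonality and diagonalization numerically.

    % Diagonalization of a real symmetric matrix: P'*A*P = diag of eigenvalues
    A = [2 1; 1 2];                 % real symmetric
    [P, D] = eig(A);                % columns of P are eigenvectors, D is diagonal
    disp(norm(P' * P - eye(2)));    % P is orthogonal (essentially zero)
    disp(norm(P' * A * P - D));     % P'*A*P is diagonal (essentially zero)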

This note is too short to cover all the details of linear algebra. A good reference is Horn and Johnson (1985).

Exercises

1.1. Let x = (x1, . . . , xN ) and y = (y1, . . . , yN ) be vectors in RN . Define

f(t) = ‖tx − y‖², where t ∈ R.

1. Expand f and express it as a quadratic function of t.

2. Prove the Cauchy-Schwarz inequality |〈x, y〉| ≤ ‖x‖ ‖y‖. (Hint: how many solutions does the quadratic equation f(t) = 0 have?)

Remark. Make sure to treat the cases x = 0 and x ≠ 0 separately.

1.2. Prove the triangle inequality ‖x + y‖ ≤ ‖x‖ + ‖y‖. (Hint: Cauchy-Schwarz inequality.)

1.3. Let A, B, C be matrices with appropriate dimensions so that the following expressions are well defined. Prove that A(B + C) = AB + AC, A(BC) = (AB)C, (AB)−1 = B−1A−1, and (AB)′ = B′A′.

1.4. Let A,B,C be symmetric matrices of the same size.

1. Prove that A ≥ A (reflexivity).

2. Prove that A ≥ B and B ≥ A imply A = B (antisymmetry).


3. Prove that A ≥ B and B ≥ C imply A ≥ C (transitivity).

Hence ≥ is a partial order for symmetric matrices.

1.5. Let P be a matrix such that P² = P. Show that the eigenvalues of P are either 0 or 1.

1.6. Let A be symmetric. Show that A is positive definite if and only if all eigenvalues of A are positive.

1.7. Let A be symmetric and positive semidefinite. Show that there exists a symmetric and positive semidefinite matrix B such that A = B².

1.8. Let A be symmetric and positive semidefinite. Let α1 ≤ · · · ≤ αN be the eigenvalues of A. Show that for any nonzero vector x, we have α1 ≤ ‖Ax‖/‖x‖ ≤ αN.

1.9. 1. Let B = A + XCY, where A is N × N, X is N × M, C is M × M, and Y is M × N. Show that

B−1 = A−1 −A−1X(C−1 + Y A−1X)−1Y A−1,

assuming that all inverses exist.

2. Let A, B be symmetric and positive definite. Show that A ≥ B if and only if B−1 ≥ A−1. Another proof can be found in Toda (2011). This result is useful for establishing the asymptotic efficiency of the GMM estimator.


Chapter 2

Topology of RN

2.1 Convergence of sequences

By the triangle inequality, the norm ‖·‖ on RN can be used to define a distance. For x, y ∈ RN, we define the distance between these two points by

dist(x, y) = ‖x− y‖ .

Let {xk}∞k=1 be a sequence in RN. (Here each xk = (x1k, . . . , xNk) is a vector in RN.) We say that {xk}∞k=1 converges to x ∈ RN if

(∀ε > 0)(∃K > 0)k > K =⇒ ‖xk − x‖ < ε,

that is, for any small error tolerance ε > 0, we can find a large enough number K such that the distance between xk and x can be made smaller than the error tolerance ε provided that the index satisfies k > K. When {xk}∞k=1 converges to x, we write lim_{k→∞} xk = x or xk → x (k → ∞). Sometimes we are sloppy and write lim xk = x or xk → x. A sequence {xk}∞k=1 is convergent if it converges to some point.

A sequence {xk}∞k=1 is bounded if there exists b > 0 such that ‖xk‖ ≤ b for all k.

Proposition 2.1. A convergent sequence is bounded.

Proof. Suppose that xk → x. Setting ε = 1 in the definition of convergence, we can take K > 0 such that ‖xk − x‖ < 1 for all k > K. By the triangle inequality, we have ‖xk‖ ≤ ‖x‖ + 1 for k > K. Therefore

‖xk‖ ≤ b := max{‖x1‖, . . . , ‖xK‖, ‖x‖ + 1}.

The sequence {xkl}∞l=1 is called a subsequence of {xk}∞k=1 if k1 < k2 < · · · < kl < · · · . The following proposition shows that any subsequence of a convergent sequence converges to the same limit.

Proposition 2.2. If xk → x and {xkl}∞l=1 is a subsequence of {xk}∞k=1, then xkl → x.

Proof. By the definition of convergence, for any ε > 0 we can take K > 0 such that ‖xk − x‖ < ε whenever k > K. Since kl ≥ l, it follows that ‖xkl − x‖ < ε whenever l > K, so xkl → x.


Let {xk} ⊂ R be a real sequence. Define αl = sup_{k≥l} xk and βl = inf_{k≥l} xk, possibly ±∞. Clearly {αl} is decreasing and {βl} is increasing, so they have limits α, β in [−∞, ∞]. We write

lim sup_{k→∞} xk := α = lim_{l→∞} sup_{k≥l} xk,
lim inf_{k→∞} xk := β = lim_{l→∞} inf_{k≥l} xk,

and call them the limit superior and limit inferior of {xk}, respectively.
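As a worked example (added here for concreteness), consider the real sequence xk = (−1)^k (1 + 1/k). Then

αl = sup_{k≥l} xk = 1 + 1/l or 1 + 1/(l + 1) (whichever index is even), so αl → 1,
βl = inf_{k≥l} xk = −(1 + 1/l) or −(1 + 1/(l + 1)) (whichever index is odd), so βl → −1.

Hence lim sup_{k→∞} xk = 1 and lim inf_{k→∞} xk = −1, while lim xk does not exist.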

2.2 Topological properties

Let F ⊂ RN be a set. F is closed if for any convergent sequence {xk}∞k=1 in F (meaning that xk ∈ F for all k and xk → x for some x ∈ RN), the limit point belongs to F (meaning that x ∈ F).1 Intuitively, a closed set is one that includes its own boundary. Thus the set [0, 1] = {x | 0 ≤ x ≤ 1} is closed but (0, 1) = {x | 0 < x < 1} is not.

A set A ⊂ RN is bounded if there exists b > 0 such that ‖x‖ ≤ b for all x ∈ A.

Let U ⊂ RN be a set. The complement of U, denoted by Uᶜ, is defined by Uᶜ = {x ∈ RN | x ∉ U}. That is, the complement of a set consists of those points that do not belong to the original set. U is said to be open if Uᶜ is closed.2 Thus (0, 1) = {x | 0 < x < 1} is open. Intuitively, an open set is one that does not include its own boundary.

A set K ⊂ RN is said to be compact3 if any sequence in K has a convergent subsequence with a limit in K.4 That is, K is compact if for any sequence {xk}∞k=1 ⊂ K, we can find a subsequence {xkl}∞l=1 and a point x ∈ K such that xkl → x as l → ∞.

1 The letter F is often used for a closed set since the French word for “closed” is fermé.
2 The letters U, V are often used for an open set since the French word for “open” is ouvert, but the letter O is confusing due to its resemblance to 0.
3 Strictly speaking, this is the definition of a sequentially compact set, but in RN the two concepts are identical.
4 The letter K is often used for a compact set since the German word for “compact” is kompakt.

Theorem 2.3 (Heine-Borel). K is compact if and only if it is closed and bounded.

Proof. Suppose that K is compact. Take any convergent sequence {xk} ⊂ K with lim xk = x. Since K is compact, we can take a subsequence {xkl} such that xkl → y for some y ∈ K. But by Proposition 2.2 we get x = y ∈ K, so K is closed. To prove that K is bounded, suppose that it is not. Then for any k we can find xk ∈ K such that ‖xk‖ > k. For any subsequence {xkl}, since ‖xkl‖ > kl → ∞ as l → ∞, {xkl} is not bounded. Hence by Proposition 2.1 {xkl} is not convergent. Since {xk} has no convergent subsequence, K is not compact, which is a contradiction. Hence K is bounded.

Now suppose that K is closed and bounded. Let us show by induction on the dimension N that a bounded sequence has a convergent subsequence. For N = 1, let {xk}∞k=1 ⊂ [−a, a] be a bounded sequence, where a > 0. Define αl = sup_{k≥l} xk. Since xk ∈ [−a, a], it follows that αl ∈ [−a, a]. Clearly {αl} is a decreasing sequence, so it has a limit α ∈ [−a, a]. For each l, choose kl ≥ l such that |xkl − αl| < 1/l, which is possible by the definition of αl. Then

|xkl − α| ≤ |xkl − αl| + |αl − α| < 1/l + |αl − α| → 0

as l → ∞, so xkl → α. Therefore {xk} has a convergent subsequence {xkl}.

Suppose that the claim is true up to N − 1. Let {xk}∞k=1 ⊂ [−a, a]^N be a bounded sequence, where xk = (x1k, . . . , xNk). Since {x1k}∞k=1 ⊂ [−a, a], it has a convergent subsequence {x1k′}. By the induction hypothesis, the sequence of (N − 1)-vectors {(x2k′, . . . , xNk′)} ⊂ [−a, a]^(N−1) has a convergent subsequence {(x2k′′, . . . , xNk′′)}. Since {x1k′′} is a subsequence of {x1k′}, by Proposition 2.2 it is also convergent. Therefore {xk′′} = {(x1k′′, . . . , xNk′′)} ⊂ [−a, a]^N also converges, so {xk} has a convergent subsequence.

We have shown that any {xk} ⊂ K has a convergent subsequence {xkl}. Since K is closed, the limit belongs to K. Therefore K is compact.

2.3 Continuous functions

A function f : RN → R is continuous at x if f(xk) → f(x) for any sequence such that xk → x. f is continuous if it is continuous at every point. Intuitively, f is continuous if its graph has no gaps. The following theorem is important because it gives a sufficient condition for the existence of a solution to an optimization problem.

Theorem 2.4 (Bolzano-Weierstrass). If K ⊂ RN is nonempty and compact and f : K → R is continuous, then f(K) is compact. In particular, f attains its maximum and minimum over K.

Proof. Let {yk} ⊂ f(K). Then we can take {xk} ⊂ K such that yk = f(xk) for all k. Since K is compact, we can take a subsequence {xkl}∞l=1 such that xkl → x ∈ K. Then by the continuity of f, we have ykl = f(xkl) → f(x) ∈ f(K), so f(K) is compact.

Since f(K) is compact, it is bounded. Hence M := sup f(K) < ∞. Repeating the above argument with {yk} such that yk → M, it follows that M = lim ykl = lim f(xkl) = f(x), so f attains its maximum. The case for the minimum is similar.

With applications in mind, it is useful to allow some discontinuous functions and functions that take values ±∞. We say that f : RN → [−∞, ∞] is lower semi-continuous at x if for any xk → x we have f(x) ≤ lim inf_{k→∞} f(xk). f is upper semi-continuous if f(x) ≥ lim sup_{k→∞} f(xk). Clearly f is upper semi-continuous if −f is lower semi-continuous.

Theorem 2.5. Let K be compact and f : K → [−∞, ∞] be lower (upper) semi-continuous. Then f attains a minimum (maximum) over K.

Proof. We show only the case where f is lower semi-continuous. If f(x) = −∞ for some x ∈ K or f(x) = ∞ for all x ∈ K, there is nothing to prove. Hence assume that f(x) > −∞ for all x ∈ K and f(x) < ∞ for some x ∈ K. Let m = inf_{x∈K} f(x). Take a sequence {xk} ⊂ K such that f(xk) → m. Since K is compact, there is a subsequence such that xkl → x for some x ∈ K. Since f is lower semi-continuous, we get

m ≤ f(x) ≤ lim inf_{l→∞} f(xkl) = m,

so −∞ < f(x) = m.

Exercises

2.1. 1. Let {Fi}i∈I ⊂ RN be a collection of closed sets. Prove that ⋂i∈I Fi is closed.

2. Let A ⊂ RN be any set. Prove that there exists a smallest closed set that includes A. (We denote this set by cl A and call it the closure of A.)

3. Prove that there exists a largest open set that is included in A. (We denote this set by int A and call it the interior of A.)

2.2. Let B(x, ε) = {y ∈ RN | ‖y − x‖ < ε} be the open ball with center x and radius ε.

1. Prove that U is open if and only if for any x ∈ U, there exists ε > 0 such that B(x, ε) ⊂ U.

2. Prove that if U1, U2 are open, so is U1 ∩ U2. Prove that if F1, F2 are closed, so is F1 ∪ F2.

3. Let A, B be any sets. Prove that int(A ∩ B) = int A ∩ int B and cl(A ∪ B) = cl A ∪ cl B.

2.3. In the proof of the Bolzano-Weierstrass theorem, I used the fact that any nonempty bounded set A ⊂ R has a supremum α = sup A, known as the axiom of continuity. Using this axiom, prove that any bounded monotone sequence is convergent (which I also used in the proof).

2.4. A sequence {xk}∞k=1 ⊂ RN is said to be Cauchy if

∀ε > 0, ∃K > 0, k, l > K ⟹ ‖xk − xl‖ < ε,

that is, the terms with sufficiently large indices are arbitrarily close to each other.

1. Prove that a Cauchy sequence is bounded.

2. Prove that a Cauchy sequence converges. (Hint: Heine-Borel theorem. This property is called the completeness of RN.)

2.5. Let A, B be symmetric positive definite matrices and f(x) = 〈x, y〉 − (1/2)〈x, Ax〉, where y is a (fixed) vector.

1. Compute ∇f(x).

2. Compute the maximum of f(x) over x ∈ RN .

3. Prove that A ≥ B if and only if B−1 ≥ A−1.


Chapter 3

Multi-Variable Calculus

3.1 A motivating example

Suppose you are the owner of a firm that produces two goods. The unit prices of goods 1 and 2 are p1 and p2, respectively. To produce x1 units of good 1 and x2 units of good 2, it costs

c(x1, x2) = (1/2)(x1² + x2²).

What is the optimal production plan?

This problem can be solved using only high school algebra. If you produce (x1, x2), the profit is

f(x1, x2) = p1x1 + p2x2 − c(x1, x2) = p1x1 + p2x2 − (1/2)(x1² + x2²).

Since f is a quadratic function, you can complete the squares:

f(x1, x2) = −(1/2)(x1 − p1)² − (1/2)(x2 − p2)² + (1/2)(p1² + p2²),

so the optimal plan is (x1, x2) = (p1, p2), with maximum profit (1/2)(p1² + p2²).

Many practical problems are optimization problems that involve two or more variables, as in this example. In the previous chapter, we saw that calculus is a powerful tool for solving one-variable optimization problems. The same is true for the multi-variable case. This chapter introduces the basics of multi-variable calculus.

3.2 Differentiation

Consider a function of two variables, f(x1, x2). In the previous chapter, we motivated differentiation by a linear approximation. The same is true for functions of two or more variables. Suppose we want to approximate f(x1, x2) by a linear function around the point (x1, x2) = (a1, a2), so

f(x1, x2) ≈ p1(x1 − a1) + p2(x2 − a2) + q


for some numbers p1, p2, q. The approximation should be exact at (x1, x2) = (a1, a2), so substituting (x1, x2) = (a1, a2) we must have q = f(a1, a2). The values of p1, p2 should be such that as (x1, x2) approaches (a1, a2), the approximation should get better and better. Therefore subtracting f(a1, a2) and letting x2 = a2 and x1 → a1, it must be the case that

p1 = ∂f/∂x1(a1, a2) := lim_{x1→a1} [f(x1, a2) − f(a1, a2)]/(x1 − a1).

This quantity is called the partial derivative of f with respect to x1 (evaluated at (a1, a2)). A partial derivative, as the name suggests, is just a derivative of a function with respect to one variable, fixing all other variables. Intuitively, the partial derivative is the rate of change (slope) of the function in the direction of one particular coordinate. By a similar argument, we obtain

p2 = ∂f/∂x2(a1, a2) := lim_{x2→a2} [f(a1, x2) − f(a1, a2)]/(x2 − a2),

the partial derivative of f with respect to x2.

If you know how to take the derivative of a one-variable function, computing partial derivatives of a multi-variable function is straightforward: you just pretend that all variables except one are constants.

Example 1. Let f(x1, x2) = x1 + 2x2 + 3x1² + 4x1x2 + 5x2². Then

∂f/∂x1(x1, x2) = 1 + 6x1 + 4x2,
∂f/∂x2(x1, x2) = 2 + 4x1 + 10x2.

A function is said to be partially differentiable if the partial derivatives exist. If a function is partially differentiable and the partial derivatives are continuous, we call it a C1 function. In general, a Cn function means that you can partially differentiate n times (with an arbitrary choice of variables) and the resulting function is continuous. A function is said to be differentiable if the linear approximation becomes exact as the point (x1, x2) gets closer to (a1, a2), so formally

f(a1 + h1, a2 + h2)− f(a1, a2) = p1h1 + p2h2 + ε(h1, h2) (3.1)

with ε(h1, h2)/√(h1² + h2²) → 0 as (h1, h2) → (0, 0), where p1, p2 are the partial derivatives. It is known that a C1 function is differentiable (exercise).

In the one-variable case, the derivative of a function at a minimum or a maximum is zero. The same is true for the partial derivatives of a multi-variable function. We omit the proof because it is essentially the same as in the one-variable case.

Proposition 3.1. Consider the optimization problem

maximize f(x1, x2),

where f is partially differentiable. If (x1∗, x2∗) is a solution, then ∂f/∂x1(x1∗, x2∗) = ∂f/∂x2(x1∗, x2∗) = 0.


Example 2. Consider the motivating example. Then

∂f/∂x1 = p1 − x1,    ∂f/∂x2 = p2 − x2,

so ∂f/∂x1 = ∂f/∂x2 = 0 implies (x1, x2) = (p1, p2), which maximizes the profit.

3.3 Vector notation and gradient

Equation (3.1) shows that the difference in f is approximately a linear function of the differences in the coordinates, p1h1 + p2h2. Define the vectors a, p, h by a = (a1, a2)′, p = (p1, p2)′, and h = (h1, h2)′. If you remember the definition of the inner product,1 (3.1) can be compactly written as

f(a+ h)− f(a) = p · h+ ε(h)

with ε(h)/‖h‖ → 0 as h → 0, where ‖h‖ = √(h · h) = √(h1² + h2²) is the norm (length) of the vector h. The vector of partial derivatives,

∇f(a) := (p1, p2)′ = (∂f/∂x1(a1, a2), ∂f/∂x2(a1, a2))′,

is called the gradient of f at (a1, a2). (You read the symbol ∇ as “nabla”.) The above equation then becomes

f(a+ h)− f(a) = ∇f(a) · h+ ε(h).

Example 3. Let f(x1, x2) = x1²x2³. Then

∇f(x1, x2) = (∂f/∂x1, ∂f/∂x2)′ = (2x1x2³, 3x1²x2²)′.

Using the gradient, Proposition 3.1 simplifies as follows: if x∗ is a solution of the optimization problem

maximize f(x),

where f is partially differentiable, then ∇f(x∗) = 0.2 The same is true for minimization.

1 The inner product is also called the dot product, although “inner product” is more common.
2 We use the letter 0 to denote the zero vector (0, 0)′.

My experience tells me that when I introduce the vector notation, many students are overwhelmed by the “abstract” nature. It is true that imagining a vector requires more abstract thinking than imagining a real number. However, the vector notation has two important advantages over the component-wise notation. First, since you don’t need to write down all the components, it saves space and you can focus on the substantial content. Second, since the vector notation applies to any dimension (1, 2, . . .), you can develop a single theory that applies to all cases. Therefore, you should get used to the vector notation. Whenever you think it is too abstract, consider the two-dimensional case for concreteness. The chapter appendix proves the chain rule using the vector notation.

Intuitively, the gradient ∇f(a) is the direction in which the function increases fastest at the point a. To see this, take any vector d and evaluate the value of f along the straight line x = a + td that passes through the point a and points in the direction d, where t is a parameter. The value is then f(a + td). The slope of f along this line is

lim_{t→0} [f(a + td) − f(a)]/t = ∇f(a) · d,

which can be shown using the chain rule. This quantity is known as the directional derivative of f (in the direction d). In particular, if d is a unit vector (say d = (1, 0)′ or (0, 1)′ in the two-dimensional case), then the directional derivative is a partial derivative. Assuming d has length 1 (so ‖d‖ = 1) and applying the Cauchy-Schwarz inequality x · y ≤ ‖x‖ ‖y‖ to x = ∇f(a) and y = d, it follows that

∇f(a) · d ≤ ‖∇f(a)‖ ‖d‖ = ‖∇f(a)‖,

with equality if d is parallel to ∇f(a), so d = ∇f(a)/‖∇f(a)‖. This inequality shows that the directional derivative (the rate of change of the function) is maximized when the direction is that of the gradient.

Example 4. Let f(x1, x2) = √(x1² + x2²). Letting x1 = r cos θ and x2 = r sin θ, we have f(x1, x2) = r, so f is constant along a circle and increases away from the circle. Therefore at any point x = (x1, x2), f increases fastest along the radius joining the origin and x. In fact, the gradient is

∇f(x1, x2) = (x1/√(x1² + x2²), x2/√(x1² + x2²))′ = (cos θ, sin θ)′,

which points in the direction of the radius.
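A quick numerical sanity check of this gradient is possible with finite differences. The Matlab sketch below is an illustration added here (not part of the original notes); the evaluation point and step size are arbitrary.

    % Finite-difference check of the gradient of f(x1,x2) = sqrt(x1^2 + x2^2)
    f  = @(x) sqrt(x(1)^2 + x(2)^2);
    x  = [3; 4];                              % any nonzero point
    h  = 1e-6;                                % small step (arbitrary)
    g_numeric = [(f(x + [h; 0]) - f(x)) / h;  % forward differences
                 (f(x + [0; h]) - f(x)) / h];
    g_exact   = x / norm(x);                  % = (cos(theta); sin(theta))
    disp([g_numeric g_exact]);                % the two columns nearly coincide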

3.4 Mean-value theorem and Taylor’s theorem

In one-variable optimization problems, the mean-value theorem and Taylor’s theorem are useful to characterize the solution. The same is true with multiple variables.

Proposition 3.2 (Mean-value theorem). Let f be differentiable. For any vectors a, b, there exists a number 0 < θ < 1 such that

f(b)− f(a) = 〈∇f((1− θ)a+ θb), b− a〉 .

Here 〈x, y〉 = x · y = x1y1 + · · · + xNyN is another notation for the inner product. The proof is an exercise. The mean-value theorem for the one-variable case says that there exists a number c between a and b such that

(f(b) − f(a))/(b − a) = f′(c).


Multiplying both sides by b − a and choosing 0 < θ < 1 such that c = (1 − θ)a + θb (which is possible because c is between a and b), we get

f(b)− f(a) = f ′((1− θ)a+ θb)(b− a).

Therefore the multi-variable version of the mean-value theorem is a generalization of the one-variable case.

Taylor’s theorem also generalizes to the multi-variable case. Suppose you want to approximate f(x) around x = a. Let h = x − a, and consider the one-variable function g(t) = f(a + th). Then g(0) = f(a) and g(1) = f(x). Now apply Taylor’s theorem to the one-variable function g(t) and set t = 1. The result is Taylor’s theorem for the multi-variable function f(x). The multi-variable version of Taylor’s theorem is most useful in the second-order approximation. The result is

f(x) = f(a) + 〈∇f(a), x − a〉 + (1/2)〈x − a, ∇²f(ξ)(x − a)〉,

where ξ = (1 − θ)a + θx for some 0 < θ < 1. ∇²f is the matrix of second partial derivatives of f, which is known as the Hessian:

∇²f = [∂²f/∂x1²  ∂²f/∂x1∂x2; ∂²f/∂x2∂x1  ∂²f/∂x2²]

in the two-variable case. In general, the (m, n) element of the Hessian ∇²f is ∂²f/∂xm∂xn. Although we do not prove it, for C² functions the order of the partial derivatives can be exchanged: ∂²f/∂x2∂x1 = ∂²f/∂x1∂x2.

Example 5. Consider the motivating example. Then

∂²f/∂x1² = −1,   ∂²f/∂x1∂x2 = 0,   ∂²f/∂x2² = −1,

so the Hessian is

∇²f(x) = [−1 0; 0 −1].
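As a worked illustration (added here), apply the second-order formula to the motivating example around a = (p1, p2). By Example 2 we have ∇f(a) = 0, and by Example 5 the Hessian is constantly [−1 0; 0 −1], so the remainder does not depend on ξ and the expansion is exact:

f(x) = f(p1, p2) + 0 + (1/2)〈x − a, ∇²f (x − a)〉
     = (1/2)(p1² + p2²) − (1/2)(x1 − p1)² − (1/2)(x2 − p2)²,

which is exactly the completed-square expression obtained in Section 3.1.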

3.A Chain rule

Let me convince you that the vector and matrix notation is quite useful by proving the chain rule. Instead of a real-valued function of several variables, consider a vector-valued function, for example

f(x) = (f1(x1, x2), f2(x1, x2))′.

Here the variables are x1, x2. f(x) is a two-dimensional vector whose first component is f1(x1, x2) and whose second component is f2(x1, x2). More generally, you can consider f : RN → RM, where N is the dimension of the domain (variables) and M is the dimension of the range (values). In the example above, we have M = N = 2.


Such a function f is differentiable at the point a = (a1, . . . , aN) if there exists an M × N matrix A and a function ε(h) such that

f(a+ h)− f(a) = Ah+ ε(h)

with ε(h)/‖h‖ → 0 as h → 0, where h = (h1, . . . , hN). Setting hn = 0 for all but one n, dividing by hn ≠ 0, taking the limit as hn → 0, and comparing the m-th component of both sides, you can show that the (m, n) component of the matrix A is the partial derivative ∂fm/∂xn(a). The matrix A is called the Jacobian of f at a, and is often denoted by Df(a). In particular, if the dimension of the range is M = 1 (so f is a real-valued function), then

Df(a) = [∂f/∂x1(a) · · · ∂f/∂xN(a)],

the 1 × N matrix obtained by transposing the gradient ∇f(a).

With these notations, we can prove the chain rule. Let g : RM → RL be differentiable at b = f(a). By definition, there exists an L × M matrix B and a function δ(k) such that

g(b+ k)− g(b) = Bk + δ(k)

with δ(k)/‖k‖ → 0 as k → 0, where k = (k1, . . . , kM). Consider the function obtained by composing f and g. Letting k = f(a + h) − f(a), we obtain

g(f(a+ h))− g(f(a)) = g(b+ k)− g(b) = Bk + δ(k)

= B(f(a+ h)− f(a)) + δ(f(a+ h)− f(a))

= B(Ah+ ε(h)) + δ(Ah+ ε(h))

= BAh+Bε(h) + δ(Ah+ ε(h)).

Since ε and δ are negligible compared with their arguments, it follows that g(f(x)) is differentiable at x = a and

D(g ∘ f)(a) = BA = Dg(b)Df(a).

This is the chain rule, which generalizes (g(f(x)))′ = g′(f(x))f ′(x) to the multi-dimensional case.

What does this equation mean? For example, let g be a real-valued function of two variables, say g(x1, x2). Let f be a vector-valued function of one variable, say f(t) = (f1(t), f2(t))′. Since

Dg = [∂g/∂x1  ∂g/∂x2]  and  Df = (f1′(t), f2′(t))′,

it follows that

(d/dt) g(f1(t), f2(t)) = D(g ∘ f) = Dg Df = [∂g/∂x1  ∂g/∂x2] (f1′(t), f2′(t))′ = (∂g/∂x1) f1′(t) + (∂g/∂x2) f2′(t).

In general, if f is an M-dimensional function of x1, . . . , xN and g is a real-valued function of y1, . . . , yM, we have

∂(g ∘ f)/∂xn = ∑_{m=1}^M (∂g/∂ym)(∂fm/∂xn).
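The chain rule is also easy to verify numerically. The following Matlab sketch (an illustration added here, not part of the original notes; the particular g and f are arbitrary choices) compares a finite-difference derivative of g(f1(t), f2(t)) with the product Dg·Df.

    % Numerical check of the chain rule d/dt g(f1(t), f2(t)) = Dg * Df
    g  = @(y) y(1)^2 * y(2);                  % g : R^2 -> R
    f  = @(t) [sin(t); exp(t)];               % f : R -> R^2
    Dg = @(y) [2*y(1)*y(2), y(1)^2];          % 1x2 Jacobian of g
    Df = @(t) [cos(t); exp(t)];               % 2x1 Jacobian of f
    t  = 0.7; h = 1e-6;
    chain   = Dg(f(t)) * Df(t);               % chain-rule value
    numeric = (g(f(t + h)) - g(f(t))) / h;    % finite-difference approximation
    disp([chain numeric]);                    % nearly equal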


Exercises

Problems marked with (C) are challenging.

3.1. Compute the partial derivatives, the gradient, and the Hessian of the following functions.

1. f(x1, x2) = a1x1 + a2x2, where a1, a2 are constants.

2. f(x1, x2) = ax1² + 2bx1x2 + cx2², where a, b, c are constants.

3. f(x1, x2) = x1x2.

4. f(x1, x2) = x1 log x2, where x2 > 0.

3.2 (C). Compute the gradient and the Hessian of the following functions.

1. f(x) = 〈a, x〉, where a, x are vectors of the same dimension and 〈a, x〉 = a · x is the inner product of a and x.

2. f(x) = 〈x, Ax〉, where A is a square matrix of the same dimension as the vector x.

3.3 (C). This problem asks you to prove the Cauchy-Schwarz inequality. Let x = (x1, . . . , xN) and y = (y1, . . . , yN) be two vectors.

1. Define f(t) = ∑_{n=1}^N (txn − yn)². Show that f is a nonnegative quadratic function of t if x ≠ 0.

2. Let f(t) = at² + bt + c. Express a, b, c using x, y.

3. Using the discriminant, prove the Cauchy-Schwarz inequality.

3.4 (C). This problem asks you to prove the multi-variable mean-value theorem. Let f be differentiable.

1. Let g(t) = f(a+ t(b− a)). Using the chain rule, compute g′(t).

2. Using the one-variable mean-value theorem, prove the multi-variable mean-value theorem.

3.5 (C). This problem asks you to prove that a C1 function is differentiable. Let f(x1, x2) be a C1 function (i.e., partially differentiable and the partial derivatives are continuous). Fix (a1, a2).

1. Using the one-variable mean-value theorem, show that there exist numbers 0 < θ1, θ2 < 1 such that

f(a1 + h1, a2 + h2) − f(a1, a2) = ∂f/∂x1(a1 + θ1h1, a2 + h2) h1 + ∂f/∂x2(a1, a2 + θ2h2) h2.

(Hint: subtract and add f(a1, a2 + h2) on the left-hand side.)

2. Let

ε(h) = f(a + h) − f(a) − ∇f(a) · h,

where a = (a1, a2) and h = (h1, h2). Prove that lim_{h→0} ε(h)/‖h‖ = 0.


Chapter 4

Multi-variable unconstrained optimization

4.1 First and second-order conditions

Consider the unconstrained optimization problem

minimize f(x), (4.1)

where f is a (one- or multi-variable) differentiable function. Recall that if x∗ is a solution, then ∇f(x∗) = 0, where

∇f(x) = (∂f/∂x1(x), ∂f/∂x2(x))′

is the gradient (the vector of partial derivatives) in the two-dimensional case, but the general case is similar. The condition ∇f(x∗) = 0 is called the first-order condition for optimality. It is necessary, but not sufficient.

Can we derive a sufficient condition for optimality? The answer is yes. To this end we need to introduce a few notations. x∗ is called a (global) solution to the unconstrained minimization problem (4.1) if f(x) ≥ f(x∗) for all x. x∗ is called a local solution if f(x) ≥ f(x∗) for all x close enough to x∗. Finally, x∗ is called a stationary point if ∇f(x∗) = 0.

We know that if x∗ is a solution to an optimization problem (with a differentiable objective function), x∗ is a stationary point. The following example shows that the converse is not true in general.

Example 6. Let f(x) = x³ − 3x. Since

f′(x) = 3x² − 3 = 3(x − 1)(x + 1)
    > 0  (x < −1 or x > 1),
    = 0  (x = ±1),
    < 0  (−1 < x < 1),

x = ±1 are stationary points. x = 1 is a local minimum and x = −1 is a local maximum. However, since f(x) → ±∞ as x → ±∞, they are neither a global minimum nor a global maximum.


The above example shows that in general, we can only expect that a stationary point is a local optimum, not a global optimum. We can use Taylor’s theorem to derive conditions under which this is indeed true.

Suppose that f is a C² (twice continuously differentiable) function, and x∗ is a stationary point. Take any x = x∗ + h. By Taylor’s theorem, there exists 0 < α < 1 such that

f(x∗ + h) = f(x∗) + 〈∇f(x∗), h〉 + (1/2)〈h, ∇²f(x∗ + αh)h〉,    (4.2)

where 〈a, b〉 is the inner product between the vectors a, b and ∇²f is the Hessian (matrix of second derivatives) of f:

∇²f = [∂²f/∂x1²  ∂²f/∂x1∂x2; ∂²f/∂x2∂x1  ∂²f/∂x2²]

in the two-variable case. Since x∗ is a stationary point, we have ∇f(x∗) = 0, so (4.2) implies

f(x∗ + h) = f(x∗) + (1/2)〈h, ∇²f(x∗ + αh)h〉.

Therefore whether f(x) = f(x∗ + h) is greater than or less than f(x∗) depends on whether the quantity 〈h, ∇²f(x∗ + αh)h〉 is positive or negative. Letting Q = ∇²f(x∗) be the Hessian of f evaluated at the stationary point x∗, if h = x − x∗ is small, then ∇²f(x∗ + αh) is close to Q. Therefore

f(x) ≷ f(x∗) ⇐⇒ 〈h,Qh〉 ≷ 0. (4.3)

In general, a symmetric matrix A is called positive (negative) definite if 〈h, Ah〉 > 0 (< 0) for all vectors h ≠ 0. A is called positive (negative) semidefinite if 〈h, Ah〉 ≥ 0 (≤ 0) for all h. If A is positive (semi)definite, we write A > (≥) 0, and similarly for the negative case. (4.3) says that x∗ is a local minimum (maximum) if Q = ∇²f(x∗) is positive (negative) definite. Thus we obtain the following sufficient condition for local optimality, called the second-order condition.

Proposition 4.1. Let f be a twice continuously differentiable function, and ∇f(x∗) = 0. If ∇²f(x∗) is positive (negative) definite, then x∗ is a local minimum (maximum).

Example 7. Let f(x) = x3 − 3x. Since f ′(x) = 3x2 − 3 and f ′′(x) = 6x, wehave f ′(±1) = 0 and f ′′(±1) = ±6. Therefore x = 1 is a local minimum andx = −1 is a local maximum.

Example 8. Let f(x1, x2) = x1^2 + x1x2 + x2^2. Then the gradient is ∇f(x) = [2x1 + x2; x1 + 2x2] and the Hessian is

∇2f(x) = [2 1; 1 2].

Since ∇f(0) = 0, (x1, x2) = (0, 0) is a stationary point. Now

〈h, ∇2f(0)h〉 = [h1 h2] [2 1; 1 2] [h1; h2] = 2h1^2 + 2h1h2 + 2h2^2 = 2(h1 + (1/2)h2)^2 + (3/2)h2^2 ≥ 0,


with strict inequality if (h1, h2) ≠ (0, 0), so ∇2f(0) is positive definite. Therefore (x1, x2) = (0, 0) is a local minimum (indeed, a global minimum).

Example 9. Let f(x1, x2) = x1^2 − x2^2. Since ∇f(x) = [2x1; −2x2], (x1, x2) = (0, 0) is a stationary point. However, since f(x1, 0) = x1^2 attains the minimum at x1 = 0 and f(0, x2) = −x2^2 attains the maximum at x2 = 0, (x1, x2) = (0, 0) is neither a local minimum nor a local maximum (it is a saddle point). Indeed, the Hessian

∇2f(x) = [2 0; 0 −2]

is neither positive nor negative definite.

In order to determine whether a stationary point is a local minimum, a local maximum, or a saddle point, we need to determine whether the Hessian is positive definite, negative definite, or neither. Although there are a few ways to do so, usually the easiest way is to complete the squares. If h = (h1, h2) and Q is a symmetric matrix, 〈h, Qh〉 is a quadratic function of h1, h2, so you can complete the squares as in the example above. If the result is the sum of two positive terms (N positive terms if there are N variables), then Q is positive (semi)definite. If the result is the sum of negative terms, then Q is negative (semi)definite. If the result is the sum of positive and negative terms, then Q is neither positive nor negative definite. The order in which you complete the squares doesn't matter, a property known as Sylvester's law of inertia (see http://en.wikipedia.org/wiki/Sylvester's_law_of_inertia).

4.2 Convex optimization

Recall that in the one-variable case, a twice differentiable function f is convex if f''(x) ≥ 0 for all x, and if f is convex, then f'(x∗) = 0 implies that x∗ is the (global) minimum. That is, the first-order condition is necessary and sufficient. The same holds for the multi-variable case.

4.2.1 General case

As in the one-variable case, a function f is said to be convex if for any x1, x2

and 0 ≤ α ≤ 1 we have

f((1− α)x1 + αx2) ≤ (1− α)f(x1) + αf(x2).

A function is called concave if the reverse inequality holds (so −f is convex). The chapter appendix shows that a twice continuously differentiable function is convex (concave) if and only if the Hessian (matrix of second derivatives) is positive (negative) semidefinite.

Let f be a twice continuously differentiable convex function, and x∗ be a stationary point (so ∇f(x∗) = 0). By (4.2) and using the definition of positive semidefiniteness, we obtain

f(x∗ + h) = f(x∗) + 〈∇f(x∗), h〉 + (1/2)〈h, ∇2f(x∗ + αh)h〉 = f(x∗) + (1/2)〈h, ∇2f(x∗ + αh)h〉 ≥ f(x∗)


for all h, so x∗ is a global minimum. This important result is summarized in the following theorem.

Theorem 4.2. Let f be a differentiable convex (concave) function. Then x∗ is a minimum (maximum) of f if and only if ∇f(x∗) = 0.

If you read the chapter appendix, you will notice that the twice differentiability of f is not necessary.

4.2.2 Quadratic case

A special but important class of convex and concave functions are quadratic functions, because we can solve for the optimum in closed form. A general quadratic function with two variables has the following form:

f(x1, x2) = a + b1x1 + b2x2 + c1x1^2 + c2x1x2 + c3x2^2,

where a, b1, b2 and c1, c2, c3 are constants. It turns out that it is useful to change the notation such that c1 = (1/2)q11, c2 = q12, and c3 = (1/2)q22. Then

f(x1, x2) = a + b1x1 + b2x2 + (1/2)q11x1^2 + q12x1x2 + (1/2)q22x2^2
          = a + 〈b, x〉 + (1/2)〈x, Qx〉,

where b = [b1; b2] and Q = [q11 q12; q12 q22]. The gradient is

∇f(x) = [b1 + q11x1 + q12x2; b2 + q12x1 + q22x2] = b + Qx,

and the Hessian is

∇2f(x) = [q11 q12; q12 q22] = Q.

The vector and matrix notation is valid with an arbitrary number of variables. Since the Hessian of a quadratic function is constant, f is convex (concave) if Q is positive (negative) semidefinite. Since

0 = ∇f(x) = b + Qx ⇐⇒ x = −Q^{-1}b

(Q^{-1} is the inverse matrix of Q), if Q is positive (negative) definite, then x∗ = −Q^{-1}b is the minimum (maximum) of f(x) = a + 〈b, x〉 + (1/2)〈x, Qx〉.
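Because the minimizer of a positive definite quadratic is available in closed form, computing it amounts to one linear solve. Here is a minimal NumPy sketch of the formula x∗ = −Q^{-1}b; the particular numbers a, b, Q are hypothetical and chosen only for illustration.

import numpy as np

a = 0.0
b = np.array([-1.0, -2.0])
Q = np.array([[2.0, 1.0],
              [1.0, 2.0]])             # positive definite (Example 8's Hessian)

x_star = -np.linalg.solve(Q, b)        # x* = -Q^{-1} b, computed without forming Q^{-1}
f_star = a + b @ x_star + 0.5 * x_star @ Q @ x_star

print(x_star)                          # the unique minimizer
print(np.allclose(b + Q @ x_star, 0))  # True: the gradient b + Qx* vanishes at x*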

4.3 Algorithms

In the one-variable case, we learned several algorithms for computing the solution of an optimization problem. Similar algorithms exist for the multi-variable case.


4.3.1 Gradient method

Recall that the gradient of a function has the geometric interpretation that it is the direction in which the function increases fastest. Therefore, when searching for the minimum or maximum of a function, it makes sense to search along the straight line through the current point in that direction. This idea is called a line search.

More precisely, consider the minimization problem

minimize f(x),

where f is differentiable. Let a be a current approximate solution. Since the direction in which f increases fastest at a is g = ∇f(a), search for a new approximate solution along the straight line x = a − gt, where t > 0. (There is a minus sign because g = ∇f(a) is the direction in which the function increases, but we want to minimize the function.) Therefore define the one-variable function

φ(t) = f(a − tg)

and minimize over t > 0. This procedure gives a new approximate solution, and the iteration is known as the gradient method (also known as the steepest descent method).

The following is the gradient algorithm for minimization.

1. Pick an initial value x0 and an error tolerance ε > 0.

2. For n = 0, 1, 2, . . . , compute gn = ∇f(xn). Define the one-variable function φ(t) = f(xn − tgn) and minimize it over t > 0 using any one-variable optimization algorithm (e.g., false position, quadratic fit, etc.). Let tn be the solution and define xn+1 = xn − tngn.

3. Stop if ‖xn+1 − xn‖ < ε. The approximate solution is xn+1.

The gradient method works for maximization too, but −g should be replaced by g.

Example 10. Let f(x1, x2) = x1^2 + 2x2^2 − 2x1x2 − 2x2 and consider the optimization problem

minimize f(x1, x2).

The gradient of f is ∇f(x) = [2x1 − 2x2; −2x1 + 4x2 − 2] and the Hessian is

∇2f(x) = [2 −2; −2 4].

It is not hard to show that ∇2f is positive definite, so f is convex. Setting ∇f(x) = 0, the solution is (x1, x2) = (1, 1).

Suppose that we start the gradient algorithm from x0 = (0, 0). At this point, the gradient is g0 = ∇f(x0) = [0; −2]. Therefore the objective function along the steepest descent line is

φ(t) = f(x0 − tg0) = f(0, 2t) = 8t^2 − 4t.

Since φ is a quadratic function, it can be minimized by setting the derivative to zero. The solution is

0 = φ'(t) = 16t − 4 ⇐⇒ t = 1/4.


Therefore the new approximate solution is

x1 = x0 − tg0 = [0; 0] − (1/4)[0; −2] = [0; 1/2].

This completes the first iteration of the gradient method. The second iteration is similar. At x1 = (0, 1/2), the gradient is g1 = ∇f(x1) = [−1; 0]. Therefore the objective function along the steepest descent line is

φ(t) = f(x1 − tg1) = f(t, 1/2) = t^2 − t − 1/2.

Since φ is a quadratic function, it can be minimized by setting the derivative to zero. The solution is

0 = φ'(t) = 2t − 1 ⇐⇒ t = 1/2.

Therefore the new approximate solution is

x2 = x1 − tg1 = [0; 1/2] − (1/2)[−1; 0] = [1/2; 1/2].

The gradient method has order of convergence 1.
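As an illustration, the following Python sketch (not part of the original notes) implements the gradient algorithm for the function of Example 10, using a generic one-variable optimizer (scipy.optimize.minimize_scalar) for the line search. It reproduces the iterates (0, 1/2), (1/2, 1/2), . . . and converges to the solution (1, 1); it is a sketch of the idea, not a robust implementation.

import numpy as np
from scipy.optimize import minimize_scalar

def f(x):
    return x[0]**2 + 2*x[1]**2 - 2*x[0]*x[1] - 2*x[1]

def grad_f(x):
    return np.array([2*x[0] - 2*x[1], -2*x[0] + 4*x[1] - 2])

def gradient_method(x0, eps=1e-8, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_f(x)
        phi = lambda t: f(x - t * g)          # objective along the steepest descent line
        t = minimize_scalar(phi).x            # one-variable minimization over t
        x_new = x - t * g
        if np.linalg.norm(x_new - x) < eps:   # stopping rule from step 3
            return x_new
        x = x_new
    return x

print(gradient_method([0.0, 0.0]))            # converges to the true solution (1, 1)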

4.3.2 Newton method

The one-variable Newton method directly generalizes to the multi-variable case. Suppose that you want to minimize or maximize a twice continuously differentiable function f(x). Let x = a be the current approximate solution. By Taylor's theorem, we have

f(x) ≈ f(a) + 〈∇f(a), x − a〉 + (1/2)〈x − a, ∇2f(a)(x − a)〉.

Since the right-hand side is a quadratic function of x, it can be minimized or maximized by setting the gradient to zero. Therefore

∇f(a) + ∇2f(a)(x − a) = 0 ⇐⇒ x = a − [∇2f(a)]^{-1}∇f(a),

assuming that the Hessian ∇2f(a) is nonsingular. The formal algorithm of the Newton method is as follows.

1. Pick an initial value x0 and an error tolerance ε > 0.

2. For n = 0, 1, 2, . . . , compute

xn+1 = xn − [∇2f(xn)]^{-1}∇f(xn).

3. Stop if ‖xn+1 − xn‖ < ε. The approximate solution is xn+1.

As in the one-variable case, the Newton method is a local algorithm. If x∗ is a local optimum and the Hessian ∇2f(x∗) is positive or negative definite, then the Newton algorithm converges double exponentially if the initial value x0 is sufficiently close to x∗. However, it may not converge at all if you start from a faraway point. In practice, you will use a gradient method to obtain a coarse approximate solution and then switch to the Newton method to increase the accuracy.

Another issue with the Newton method is that you need to compute the inverse of the Hessian [∇2f(xn)]^{-1} at each iteration, which can be costly. There are algorithms that avoid computing the Hessian (known as quasi-Newton algorithms), but I will omit the details.
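For comparison with the gradient method, here is a minimal Python sketch (not from the notes) of the Newton iteration above, applied to the same function as in Example 10. Since that f is quadratic, the very first Newton step already lands on the exact minimizer (1, 1); for a general f the loop keeps iterating until the stopping rule is met.

import numpy as np

def grad_f(x):
    return np.array([2*x[0] - 2*x[1], -2*x[0] + 4*x[1] - 2])

def hess_f(x):
    return np.array([[2.0, -2.0], [-2.0, 4.0]])

def newton_method(x0, eps=1e-10, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        # x_{n+1} = x_n - [Hess f(x_n)]^{-1} grad f(x_n), computed via a linear solve
        x_new = x - np.linalg.solve(hess_f(x), grad_f(x))
        if np.linalg.norm(x_new - x) < eps:
            return x_new
        x = x_new
    return x

print(newton_method([0.0, 0.0]))   # [1. 1.]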

4.A Characterization of convex functions

When f is differentiable, there is a simple way to establish convexity.

Proposition 4.3. Let f be differentiable. Then f is (strictly) convex if and only if

f(y) − f(x) ≥ (>) 〈∇f(x), y − x〉

for all x ≠ y.

Proof. Suppose that f is (strictly) convex. Let x ≠ y and define g : (0, 1] → R by

g(t) = [f((1 − t)x + ty) − f(x)] / t.

Then g is (strictly) increasing, for if 0 < s < t ≤ 1 we have

g(s) ≤ (<) g(t)
⇐⇒ [f((1 − s)x + sy) − f(x)] / s ≤ (<) [f((1 − t)x + ty) − f(x)] / t
⇐⇒ f((1 − s)x + sy) ≤ (<) (1 − s/t)f(x) + (s/t)f((1 − t)x + ty),

but the last inequality holds by letting α = s/t, x1 = x, x2 = (1 − t)x + ty, and using the definition of convexity. Therefore

f(y) − f(x) = g(1) ≥ (>) lim_{t→0} g(t) = 〈∇f(x), y − x〉.

Conversely, suppose that

f(y) − f(x) ≥ (>) 〈∇f(x), y − x〉

for all x ≠ y. Take any x1 ≠ x2 and α ∈ (0, 1). Setting y = x1, x2 and x = (1 − α)x1 + αx2, we get

f(x1) − f((1 − α)x1 + αx2) ≥ (>) 〈∇f(x), x1 − x〉,
f(x2) − f((1 − α)x1 + αx2) ≥ (>) 〈∇f(x), x2 − x〉.

Multiplying these by 1 − α and α respectively and adding the two inequalities, we get

(1 − α)f(x1) + αf(x2) − f((1 − α)x1 + αx2) ≥ (>) 0,

so f is (strictly) convex.


[Figure 4.1. Characterization of a convex function: the graph of f over the segment from x to y, together with the tangent line at x (slope ∇f(x)); the points P, Q, R, S mark off the increments y − x and 〈∇f(x), y − x〉.]

Figure 4.1 shows the geometric intuition of Proposition 4.3. Since QR = f(y) − f(x) and SR = 〈∇f(x), y − x〉, we have f(y) − f(x) ≥ 〈∇f(x), y − x〉.

A twice differentiable function f : R → R is convex if and only if f''(x) ≥ 0 for all x. The following proposition is the generalization to R^N.

Proposition 4.4. Let f be a twice continuously differentiable function. Then f is convex if and only if the Hessian (matrix of second derivatives)

∇2f(x) = [∂^2f(x)/∂xm∂xn]

is positive semidefinite.

Proof. Suppose that f is a C2 function. Take any x ≠ y. Applying Taylor's theorem to g(t) = f((1 − t)x + ty) for t ∈ [0, 1], there exists α ∈ (0, 1) such that

f(y) − f(x) = g(1) − g(0) = g'(0) + (1/2)g''(α)
            = 〈∇f(x), y − x〉 + (1/2)〈y − x, ∇2f(x + α(y − x))(y − x)〉.

If f is convex, by Proposition 4.3,

(1/2)〈y − x, ∇2f(x + α(y − x))(y − x)〉 = f(y) − f(x) − 〈∇f(x), y − x〉 ≥ 0.

Since y is arbitrary, let y = x + εd with ε > 0. Dividing the above inequality by (1/2)ε^2 > 0 and letting ε → 0, we get

0 ≤ 〈d, ∇2f(x + εαd)d〉 → 〈d, ∇2f(x)d〉,

so ∇2f(x) is positive semidefinite. Conversely, if ∇2f(x) is positive semidefinite, then

f(y) − f(x) − 〈∇f(x), y − x〉 = (1/2)〈y − x, ∇2f(x + α(y − x))(y − x)〉 ≥ 0,

so by Proposition 4.3, f is convex.


4.B Newton method for solving nonlinear equations

The Newton method can also be used to solve nonlinear equations. (I have a particular attachment to this method since I discovered it on my own when I was at college.) For example, let F : R^N → R^N be an R^N-valued function of N variables. Suppose you want to solve the equation F(x) = 0. Let x∗ be a solution, and a be a current approximate solution. By Taylor's theorem, we have

F(x) ≈ F(a) + DF(a)(x − a),

where DF(a) is the N × N Jacobian matrix of F. Since the right-hand side is a linear function, we can solve for the new approximate solution analytically:

F(a) + DF(a)(x − a) = 0 ⇐⇒ x = a − [DF(a)]^{-1}F(a).

The Newton method for optimization is a special case of the Newton method for solving nonlinear equations, obtained by setting F(x) = ∇f(x).
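A minimal Python sketch of this root-finding version of the Newton method is below. The two-equation system F(x, y) = (x^2 + y^2 − 1, x − y) is a hypothetical example chosen only because its positive solution x = y = 1/√2 is known.

import numpy as np

def F(z):
    x, y = z
    return np.array([x**2 + y**2 - 1.0, x - y])

def DF(z):
    x, y = z
    return np.array([[2*x, 2*y],
                     [1.0, -1.0]])           # Jacobian matrix of F

def newton_solve(z0, eps=1e-12, max_iter=100):
    z = np.asarray(z0, dtype=float)
    for _ in range(max_iter):
        # z_{n+1} = z_n - [DF(z_n)]^{-1} F(z_n)
        z_new = z - np.linalg.solve(DF(z), F(z))
        if np.linalg.norm(z_new - z) < eps:
            return z_new
        z = z_new
    return z

print(newton_solve([1.0, 0.5]))    # approx [0.70710678, 0.70710678]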

Exercises

Problems marked with (C) are challenging.

4.1. Let f(x) = 10x^3 − 15x^2 − 60x. Find the local maxima and minima of f. Does f have a global maximum and minimum?

4.2. Let f(x) = 180x − 15x^2 − 10x^3. Solve

maximize f(x) subject to x ≥ 0.

4.3. Let f(x1, x2) = x1^2 − x1x2 + 2x2^2 − x1 − 3x2.

1. Compute the gradient and the Hessian of f.

2. Determine whether f is convex, concave, or neither.

3. Find the stationary point(s) of f.

4. Determine whether each stationary point is a maximum, minimum, or neither.

4.4. Let f(x1, x2) = x1^2 − x1x2 − 6x1 + x2^3 − 3x2.

1. Find the stationary point(s) of f.

2. Determine whether each stationary point is a local maximum, local minimum, or a saddle point.

4.5. Let A = [a b; b c] be a 2 × 2 symmetric matrix. Show that A is positive definite if and only if a > 0 and ac − b^2 > 0.



4.6. For each of the following symmetric matrices, show whether it is positive (semi)definite, negative (semi)definite, or neither.

1. A = [1 0; 0 1],

2. A = [0 1; 1 0],

3. A = [2 1; 1 2],

4. A = [2 1; 1 −2],

5. A = [1 2; 2 1],

6. A = [1 1; 1 4],

7. A = [−3 1; 1 −4].

4.7. For each of the following functions, show whether it is convex, concave, or neither.

1. f(x1, x2) = x1x2 − x1^2 − x2^2,

2. f(x1, x2) = 3x1 + 2x1^2 + 4x2 + x2^2 − 2x1x2,

3. f(x1, x2) = x1^2 + 3x1x2 + 2x2^2,

4. f(x1, x2) = 20x1 + 10x2,

5. f(x1, x2) = x1x2,

6. f(x1, x2) = e^{x1} + e^{x2},

7. f(x1, x2) = log x1 + log x2, where x1, x2 > 0,

8. f(x1, x2) = log(e^{x1} + e^{x2}),

9. f(x1, x2) = (x1^p + x2^p)^{1/p}, where x1, x2 > 0 and p ≠ 0 is a constant.

4.8 (C). Let K ≥ 2. Prove that f is convex if and only if

f(∑_{k=1}^K αk xk) ≤ ∑_{k=1}^K αk f(xk)

for all {xk}_{k=1}^K ⊂ R^N and αk ≥ 0 such that ∑_{k=1}^K αk = 1.

4.9 (C). 1. Prove that f(x) = log x (x > 0) is strictly concave.

2. Prove the following inequality of arithmetic and geometric means: for any x1, . . . , xK > 0 and α1, . . . , αK ≥ 0 such that ∑ αk = 1, we have

∑_{k=1}^K αk xk ≥ ∏_{k=1}^K xk^{αk},

with equality if and only if x1 = · · · = xK.

4.10 (C). 1. Show that if {fi(x)}_{i=1}^I are convex, so is f(x) = ∑_{i=1}^I αi fi(x) for any α1, . . . , αI ≥ 0.

2. Show that if {fi(x)}_{i=1}^I are convex, so is f(x) = max_{1≤i≤I} fi(x).

3. Suppose that h : R^M → R is increasing (meaning that h is increasing in each coordinate x1, . . . , xM) and convex, and that gm : R^N → R is convex for m = 1, . . . , M. Prove that f(x) = h(g1(x), . . . , gM(x)) is convex.

4.11 (C). Let f : R^N → (−∞, ∞] be convex.

1. Show that the set of solutions to min_{x∈R^N} f(x) is a convex set.

2. If f is strictly convex, show that the solution (if it exists) is unique.

4.12. Suppose you collect some two-dimensional data {(xn, yn)}_{n=1}^N, where N is the sample size. (A concrete situation might be: n is the index of a student, xn is the score of Econ 172A, and yn is the score of Econ 172B.) You wish to fit a straight line y = a + bx to the data. Suppose you do so by making the observed value yn as close as possible to the theoretical value a + bxn by minimizing the sum of squares

f(a, b) = ∑_{n=1}^N (yn − a − bxn)^2.

1. Is f convex, concave, or neither?

2. Compute the gradient of f.

3. Express the a, b that minimize f using the following quantities:

E[X] = (1/N) ∑_{n=1}^N xn,   Var[X] = (1/N) ∑_{n=1}^N (xn − E[X])^2,

E[Y] = (1/N) ∑_{n=1}^N yn,   Cov[X, Y] = (1/N) ∑_{n=1}^N (xn − E[X])(yn − E[Y]).

4.13 (C). In the previous problem, the variable y is explained by the two variables 1 and x. Generalize the problem to the case where y is explained by K variables, x = (x1, . . . , xK). It will be useful to define the N × K matrix X = (xnk) and the N-vector y = (y1, . . . , yN), where xnk is the n-th observation of the k-th variable. The equation you want to fit is

yn = β1xn1 + · · ·+ βKxnK + error term,

and β = (β1, . . . , βK) is the vector of coefficients.


4.14 (C). Let A be a symmetric positive definite matrix.

1. Let f(x) = 〈y, x〉 − (1/2)〈x, Ax〉, where y is some vector. Compute the gradient and Hessian of f and show whether f is convex, concave, or neither.

2. Find the maximum of f and its value.

3. Let A, B be symmetric positive definite matrices. We write A ≥ B if the matrix C = A − B is positive semidefinite. Show that A ≥ B if and only if B^{-1} ≥ A^{-1}.

4.15 (C). Let F : R^N → R^N be twice continuously differentiable. Assume that F(x∗) = 0 and the Jacobian DF(x∗) is nonsingular. Show that the Newton algorithm converges to x∗ double exponentially fast if the initial value is close enough to x∗.


Chapter 5

Multi-Variable Constrained Optimization

In the real world, optimization problems come with constraints. Most of us have a budget and cannot spend more money than we have, so we have to choose what to buy and what not to. Our stomach has a finite capacity and we cannot eat more than a certain amount, so we must choose what to eat and what not to.

5.1 A motivating example

5.1.1 The problem

Suppose there are two goods (say apples and bananas), and your satisfaction is represented by the function (called a utility function)

u(c1, c2) = log c1 + log c2,

where c1, c2 are the amounts of goods 1 and 2 that you consume. Suppose that the prices of the goods are p1 and p2 per unit, and your budget is w. If you buy c1 and c2 units of each good, your expenditure is

p1c1 + p2c2.

Since your budget is w, your budget constraint is

p1c1 + p2c2 ≤ w.

So the problem of attaining maximum satisfaction within your budget can be mathematically expressed as:

maximize log c1 + log c2

subject to p1c1 + p2c2 ≤ w.

Here u(c1, c2) = log c1 + log c2 is called the objective function, and p1c1 + p2c2 ≤ w is the constraint.1

1 Strictly speaking, there are other constraints, c1 ≥ 0 and c2 ≥ 0, since you cannot consume a negative amount.


5.1.2 A solution

How can we solve this problem? Some of you might find a trick that turns this constrained optimization problem into an unconstrained one, as follows. First, since the objective function log c1 + log c2 is increasing in both c1 and c2, you will always exhaust your budget. That is, you will always want to consume in such a way that the budget constraint holds with equality, i.e.,

p1c1 + p2c2 = w.

Solving this for c2, we get

c2 = (w − p1c1)/p2.

Substituting this into the objective function, the problem is equivalent to finding the maximum of

f(c1) = log c1 + log((w − p1c1)/p2) = log c1 + log(w − p1c1) − log p2.

Setting the derivative equal to zero, we get

f'(c1) = 1/c1 − p1/(w − p1c1) = 0 ⇐⇒ c1 = w/(2p1).

In this case, f tends to −∞ as c1 approaches the boundary points c1 = 0 and c1 = w/p1, so we need not worry about the boundary. The corresponding value of c2 is

c2 = (w − p1 · w/(2p1))/p2 = w/(2p2).

Therefore the solution is

(c1, c2) = (w/(2p1), w/(2p2)).

5.1.3 Why study the general theory?

The above solution is mathematically correct, but too special to be useful. The reason it worked is that

1. the inequality constraint p1c1 + p2c2 ≤ w could be turned into an equality constraint p1c1 + p2c2 = w,

2. the equality constraint p1c1 + p2c2 = w could be solved for one variable, c2, and

3. after substitution the optimization problem became unconstrained, which is why we could apply calculus.

But in general we cannot hope that any of these steps work. As an exercise, try to solve the following problem:

maximize x1 + 2x2 + 3x3

subject to x1^2 + x2^2 + x3^2 ≤ 1.

Some of you might be able to solve this problem using an ingenious trick, but such tricks are generally inapplicable. That is why we need a general theory for solving constrained optimization problems.


5.2 Optimization with linear constraints

5.2.1 One linear constraint

Consider the two-variable optimization problem with a linear constraint,

minimize f(x1, x2)

subject to a1x1 + a2x2 ≤ c,

where f is differentiable and a1, a2, c are constants. Let x∗ = (x1∗, x2∗) be a solution. The goal is to derive a necessary condition for x∗.

If a1x1∗ + a2x2∗ < c, so that the constraint does not bind (is inactive), then since x can move freely around x∗, the point x∗ must be a local minimum of f. Therefore ∇f(x∗) = 0. If a1x1∗ + a2x2∗ = c, so that the constraint binds (is active), then the situation is more complicated. Let

Ω = {x = (x1, x2) | a1x1 + a2x2 ≤ c}

be the constraint set. The boundary of Ω is the straight line a1x1 + a2x2 = c, which has normal vector a = [a1; a2]. Figure 5.1 shows the constraint set Ω (with the boundary and the normal vector), the solution x∗, and the negative of the gradient −∇f(x∗). (Here we draw the negative of the gradient because that is the direction in which the function f decreases fastest, and we are solving a minimization problem.)

[Figure 5.1. Gradient and feasible direction: the constraint set Ω with boundary line a1x1 + a2x2 = c, its normal vector a = [a1; a2], a feasible direction d = [d1; d2] at the boundary point x∗, and the vector −∇f(x∗).]

Consider moving in the direction d from the solution x∗. Since x∗ is on the boundary, we have 〈a, x∗〉 = c. The point x = x∗ + td (where t > 0 is small) is feasible if and only if

〈a, x∗ + td〉 ≤ c = 〈a, x∗〉 ⇐⇒ 〈a, d〉 ≤ 0,

which is when the vectors a, d make an obtuse angle, as in the picture. Since x∗ is a solution, we have f(x∗ + td) ≥ f(x∗) for small enough t > 0. Therefore

0 ≤ lim_{t↓0} [f(x∗ + td) − f(x∗)]/t = 〈∇f(x∗), d〉 ⇐⇒ 〈−∇f(x∗), d〉 ≤ 0,

so −∇f(x∗) and d make an obtuse angle. Therefore we obtain the following necessary condition for optimality:

If a and d make an obtuse angle, then so do −∇f(x∗) and d.


In Figure 5.1, the angle between −∇f(x∗) and d is acute, so f decreases in the direction d. But this is a contradiction because x∗ is by assumption a solution to the constrained minimization problem, so f cannot decrease in any feasible direction. Thus the configuration in Figure 5.1 cannot arise at a solution.

The only case in which −∇f(x∗) and d make an obtuse angle whenever a and d do so is when −∇f(x∗) and a point in the same direction, as in Figure 5.2. Therefore, if x∗ is a solution, there must be a number λ ≥ 0 such that

∇f(x∗) = −λa ⇐⇒ ∇f(x∗) + λa = 0.

[Figure 5.2. Necessary condition for optimality: at the solution x∗ on the boundary a1x1 + a2x2 = c, the vector −∇f(x∗) points in the same direction as the normal vector a = [a1; a2], and d = [d1; d2] is a feasible direction.]

In the discussion above, we considered two cases depending on whether the constraint is binding (active) or not binding (inactive), but the inactive case (∇f(x∗) = 0) is a special case of the active case (∇f(x∗) + λa = 0) obtained by setting λ = 0. Furthermore, although I explained the intuition in two dimensions, the result clearly holds in arbitrary dimensions. Therefore we can summarize the necessary condition for optimality as in the following proposition.

Proposition 5.1. Consider the optimization problem

minimize f(x)

subject to 〈a, x〉 ≤ c,

where f is differentiable, a is a nonzero vector, and c is a constant. If x∗ is a solution, then there exists a number λ ≥ 0 such that

∇f(x∗) + λa = 0.

Example 11. Consider the motivating example

maximize log x1 + log x2

subject to p1x1 + p2x2 ≤ w.

Maximizing log x1 + log x2 is the same as minimizing f(x1, x2) = −log x1 − log x2. The gradient is ∇f(x) = −[1/x1; 1/x2]. Let p = [p1; p2]. By the proposition, there is a number λ ≥ 0 such that

∇f(x∗) + λp = 0 ⇐⇒ −[1/x1; 1/x2] + λ[p1; p2] = [0; 0].

Therefore it must be that x1 = 1/(λp1) and x2 = 1/(λp2).

Since the objective function is increasing, at the solution the constraint p1x1 + p2x2 ≤ w must bind. Therefore

p1x1 + p2x2 = w ⇐⇒ p1 · 1/(λp1) + p2 · 1/(λp2) = w ⇐⇒ λ = 2/w.

Substituting back, we get the solution (x1, x2) = (w/(2p1), w/(2p2)).

5.2.2 Multiple linear constraints

Now consider the optimization problem

minimize f(x)

subject to 〈a1, x〉 ≤ c1, 〈a2, x〉 ≤ c2,

where f is differentiable, a1, a2 are nonzero vectors, and c1, c2 are constants. Let x∗ be a solution and let Ω be the constraint set, i.e., the set

Ω = {x | g1(x) ≤ 0, g2(x) ≤ 0},

where gi(x) = 〈ai, x〉 − ci for i = 1, 2. Assume that both constraints are active at the solution. The situation is depicted in Figure 5.3.

[Figure 5.3. Gradient and feasible direction: the constraint set Ω with both constraints active at x∗, the gradients a1 = ∇g1 and a2 = ∇g2, a feasible direction d, and the vector −∇f(x∗) lying between a1 and a2.]

In general, the vector d is called a feasible direction if you can move a little bit in the direction d from the point x∗, so that x∗ + td ∈ Ω for small enough t > 0. If d is a feasible direction and x∗ is the minimum of f, then f cannot decrease in the direction d, so we must have

0 ≤ lim_{t↓0} [f(x∗ + td) − f(x∗)]/t = 〈∇f(x∗), d〉 ⇐⇒ 〈−∇f(x∗), d〉 ≤ 0.

Therefore the negative of the gradient −∇f(x∗) and any feasible direction d must make an obtuse angle. Recall that d is a feasible direction if d and ai = ∇gi make an obtuse angle for i = 1, 2. Looking at Figure 5.3, in order for x∗ to be the minimum, it is necessary that −∇f(x∗) lie between a1 and a2. This is true if and only if there are numbers λ1, λ2 ≥ 0 such that

−∇f(x∗) = λ1a1 + λ2a2 ⇐⇒ ∇f(x∗) + λ1∇g1(x∗) + λ2∇g2(x∗) = 0.


Although we have assumed that both constraints bind, this equation remains true even if one (or both) of them does not bind, by setting λ1 = 0 and/or λ2 = 0. Also, it is clear that this argument holds for an arbitrary number of linear constraints. Therefore we obtain the following general theorem.

Theorem 5.2 (Karush-Kuhn-Tucker theorem with linear constraints). Consider the optimization problem

minimize f(x)
subject to gi(x) ≤ 0, (i = 1, 2, . . . , I)

where f is differentiable and gi(x) = 〈ai, x〉 − ci is linear with ai ≠ 0. If x∗ is a solution, then there exist numbers (called Lagrange multipliers) λ1, . . . , λI such that

∇f(x∗) + ∑_{i=1}^I λi∇gi(x∗) = 0,  (5.1a)
(∀i) λi ≥ 0, gi(x∗) ≤ 0, λigi(x∗) = 0.  (5.1b)

Condition (5.1a) is called the first-order condition. Its interpretation is that at the minimum x∗, the negative of the gradient −∇f(x∗) must lie between all the ai = ∇gi(x∗) corresponding to the active constraints. Condition (5.1b) is called the complementary slackness condition. The first-order condition and the complementary slackness condition are jointly called the Karush-Kuhn-Tucker conditions.2 λi ≥ 0 says that the Lagrange multiplier is nonnegative, and gi(x∗) ≤ 0 says that the constraint is satisfied, which are not new. The condition λigi(x∗) = 0 takes care of both the active and the inactive case. If constraint i is active, then gi(x∗) = 0, so λigi(x∗) = 0 holds automatically. If constraint i is inactive, we have λi = 0, so again λigi(x∗) = 0 holds automatically.

An easy way to remember (5.1a) is as follows. Given the objective function f(x) and the constraints gi(x) ≤ 0, define the Lagrangian

L(x, λ) = f(x) + ∑_{i=1}^I λigi(x),

where λ = (λ1, . . . , λI). The Lagrangian is the sum of the objective function f(x) and the constraint functions gi(x) weighted by the Lagrange multipliers λi. Pretend that λ is constant and you want to minimize L(x, λ) with respect to x. Then the first-order condition is

0 = ∇xL(x, λ) = ∇f(x) + ∑_{i=1}^I λi∇gi(x),

which is the same as (5.1a).

2 A version of this theorem is due to the 1939 Master's thesis of William Karush (1917-1997), but it did not get much attention at a time when applied mathematics was looked down upon. The theorem became widely known after its rediscovery by Harold Kuhn and Albert Tucker (1905-1995) in a conference paper in 1951. However, the paper (http://projecteuclid.org/download/pdf_1/euclid.bsmsp/1200500249) does not cite Karush.


5.2.3 Linear inequality and equality constraints

So far we have considered the case when all constraints are inequalities, but what if there are also equalities? For example, consider the optimization problem

minimize f(x)

subject to 〈a, x〉 ≤ c, 〈b, x〉 = d,

where f is differentiable, a, b are nonzero vectors, and c, d are constants. We derive a necessary condition for optimality by turning this problem into one with inequality constraints alone. Note that 〈b, x〉 = d is the same as 〈b, x〉 ≤ d and 〈b, x〉 ≥ d. Furthermore, 〈b, x〉 ≥ d is the same as 〈−b, x〉 ≤ −d. Therefore the problem is equivalent to:

minimize f(x)

subject to 〈a, x〉 − c ≤ 0,

〈b, x〉 − d ≤ 0,

〈−b, x〉+ d ≤ 0.

Setting g1(x) = 〈a, x〉 − c, g2(x) = 〈b, x〉 − d, and g3(x) = 〈−b, x〉 + d, this problem is exactly of the form in Theorem 5.2. Therefore there exist Lagrange multipliers λ1, λ2, λ3 ≥ 0 such that

∇f(x∗) + λ1∇g1(x∗) + λ2∇g2(x∗) + λ3∇g3(x∗) = 0.

Substituting ∇gi(x∗)’s, we get

∇f(x∗) + λ1a+ λ2b+ λ3(−b) = 0.

Letting λ = λ1 and µ = λ2 − λ3, we get

∇f(x∗) + λa+ µb = 0.

This equation is similar to the KKT condition, except that µ can be positive or negative. In general, we obtain the following theorem.

Theorem 5.3 (Karush-Kuhn-Tucker theorem with linear constraints). Consider the optimization problem

minimize f(x)
subject to gi(x) ≤ 0, (i = 1, 2, . . . , I)
           hj(x) = 0, (j = 1, 2, . . . , J)

where f is differentiable and gi(x) = 〈ai, x〉 − ci and hj(x) = 〈bj, x〉 − dj are linear with ai, bj ≠ 0. If x∗ is a solution, then there exist Lagrange multipliers λ1, . . . , λI and µ1, . . . , µJ such that

∇f(x∗) + ∑_{i=1}^I λi∇gi(x∗) + ∑_{j=1}^J µj∇hj(x∗) = 0,  (5.2a)
(∀i) λi ≥ 0, gi(x∗) ≤ 0, λigi(x∗) = 0,  (5.2b)
(∀j) hj(x∗) = 0.  (5.2c)


An easy way to remember the KKT conditions (5.2) is as follows. As in the case with only inequality constraints, define the Lagrangian

L(x, λ, µ) = f(x) + ∑_{i=1}^I λigi(x) + ∑_{j=1}^J µjhj(x).

Pretend that λ, µ are constants and you want to minimize L with respect to x. The first-order condition is ∇xL(x, λ, µ) = 0, which is exactly (5.2a). The complementary slackness condition (5.2b) is the same as in the case with only inequality constraints. The new condition (5.2c) merely says that the solution x∗ must satisfy the equality constraints hj(x) = 0.

5.3 Optimization with nonlinear constraints

5.3.1 Karush-Kuhn-Tucker theorem

Now we consider the general case: the optimization of a nonlinear multi-variable function with nonlinear constraints. Consider the optimization problem

minimize f(x)

subject to gi(x) ≤ 0, (i = 1, 2, . . . , I).

Let x∗ be a solution. By Taylor’s theorem, we have

gi(x) ≈ gi(x∗) + 〈∇gi(x∗), x− x∗〉 .

Both sides have the same gradient ∇gi(x∗) at x = x∗. A natural idea is to approximate the nonlinear constraints by linear ones around the solution in this way, and to derive a necessary condition corresponding to these linear constraints. This idea indeed works, subject to some caveats.

Theorem 5.4 (Karush-Kuhn-Tucker theorem). Consider the above optimization problem, where f and the gi's are differentiable. If x∗ is a solution and a regularity condition (called a constraint qualification, CQ) holds, then there exist Lagrange multipliers λ1, . . . , λI such that

∇f(x∗) + ∑_{i=1}^I λi∇gi(x∗) = 0,  (5.3a)
(∀i) λi ≥ 0, gi(x∗) ≤ 0, λigi(x∗) = 0.  (5.3b)

Proving the Karush-Kuhn-Tucker theorem in full generality is beyond the scope of this lecture note. The conclusion of Theorem 5.4 is the same as that of Theorem 5.2. What is different is the assumption. While in the linear case the conclusion holds without any qualification, in the nonlinear case we need to verify certain "constraint qualifications". The following example shows the need for such conditions.

Example 12. Consider the minimization problem

minimize x

subject to −x^3 ≤ 0.

Since −x^3 ≤ 0 ⇐⇒ x ≥ 0, the solution is obviously x∗ = 0. Now let f(x) = x and g(x) = −x^3. If the KKT theorem holds, there must be a Lagrange multiplier λ ≥ 0 such that

f'(x∗) + λg'(x∗) = 0.

But since f'(x) = 1 and g'(x) = −3x^2, we have f'(0) = 1 and g'(0) = 0, so there is no number λ ≥ 0 such that f'(x∗) + λg'(x∗) = 0.

Below are a few examples of constraint qualifications. In general, let I(x∗) be the set of indices i such that gi(x∗) = 0, that is, the i's for which the constraint gi(x) ≤ 0 binds at x = x∗.

Linear independence (LICQ) The gradients of the active constraints,

{∇gi(x∗) | i ∈ I(x∗)},

are linearly independent.

Slater (SCQ) The gi's are convex, and there exists a point x0 such that the constraints are satisfied with strict inequalities: gi(x0) < 0 for all i.

There are other, weaker constraint qualifications, but I omit them.

In practice, many optimization problems have only linear constraints, in which case you don't need to check any constraint qualifications. If there are equality constraints as well as inequality constraints, then the conclusion of Theorem 5.3 holds under certain constraint qualifications.

5.3.2 Convex optimization

We previously learned that for unconstrained optimization problems, the first-order condition is not just necessary but also sufficient for optimality when the objective function is convex or concave. The same holds for constrained optimization problems. Indeed, we can prove the following theorem.

Theorem 5.5. Consider the optimization problem

minimize f(x)

subject to gi(x) ≤ 0, (i = 1, 2, . . . , I)

where f and gi’s are differentiable and convex.

1. If x∗ is a solution and there exists a point x0 such that gi(x0) < 0 for all i, then there exist Lagrange multipliers λ1, . . . , λI such that (5.3) holds.

2. If there exist Lagrange multipliers λ1, . . . , λI such that (5.3) holds, then x∗ is a solution.

The first part of Theorem 5.5 is just the KKT theorem with the Slater constraint qualification. The second part says that the first-order conditions are sufficient for optimality.


Proof. We only prove the second part. Suppose that there exist Lagrange multipliers λ1, . . . , λI such that (5.3) holds. Let

L(x, λ) = f(x) + ∑_{i=1}^I λigi(x)

be the Lagrangian. Since λi ≥ 0 and f and the gi's are convex, L is convex as a function of x. By the first-order condition (5.3a), we have

0 = ∇f(x∗) + ∑_{i=1}^I λi∇gi(x∗) = ∇xL(x∗, λ),

so L attains the minimum at x∗. Therefore

f(x∗) = f(x∗) + ∑_{i=1}^I λigi(x∗)   (∵ λigi(x∗) = 0 for all i by (5.3b))
      = L(x∗, λ) ≤ L(x, λ)           (∵ ∇xL(x∗, λ) = 0)
      = f(x) + ∑_{i=1}^I λigi(x)     (∵ definition of L)
      ≤ f(x),                        (∵ λi ≥ 0 and gi(x) ≤ 0 for all i)

so x∗ is a solution.

Theorem 5.5 makes our life easy when solving nonlinear constrained optimization problems. A general approach is as follows.

Step 1. Verify that f and the gi's are differentiable and convex, and that the Slater constraint qualification holds.

Step 2. Set up the Lagrangian L(x, λ) = f(x) + ∑_{i=1}^I λigi(x) and derive the KKT conditions (5.3).

Step 3. Solve for x and λ. The solution of the original problem is x.

Example 13. Consider the problem

minimize 1/x1 + 1/x2

subject to x1 + x2 ≤ 2,

where x1, x2 > 0. Let us solve this problem step by step.

Step 1. Let f(x1, x2) = 1/x1 + 1/x2 be the objective function. Since

(1/x)'' = (−x^{-2})' = 2x^{-3} > 0,

the Hessian of f,

∇2f(x1, x2) = [2x1^{-3} 0; 0 2x2^{-3}],

is positive definite. Therefore f is convex. Let g(x1, x2) = x1 + x2 − 2. Since g is linear, it is convex. For (x1, x2) = (1/2, 1/2) we have g(x1, x2) = −1 < 0, so the Slater condition holds.


Step 2. Let

L(x1, x2, λ) = 1/x1 + 1/x2 + λ(x1 + x2 − 2)

be the Lagrangian. The first-order conditions are

0 = ∂L/∂x1 = −1/x1^2 + λ ⇐⇒ x1 = 1/√λ,
0 = ∂L/∂x2 = −1/x2^2 + λ ⇐⇒ x2 = 1/√λ.

The complementary slackness condition is

λ(x1 + x2 − 2) = 0.

Step 3. From these equations it must be that λ > 0 and

x1 + x2 − 2 = 0 ⇐⇒ 2/√λ − 2 = 0 ⇐⇒ λ = 1,

and x1 = x2 = 1/√λ = 1. Therefore (x1, x2) = (1, 1) is the (only) solution.

In practice, in convex optimization problems (meaning that both the objective function and the constraints are convex) we often skip verifying the Slater constraint qualification. The reason is that if you set up the Lagrangian and a point satisfies the first-order condition, then it is automatically a solution by the second part of Theorem 5.5. If you cannot find any point satisfying the first-order condition, then either there is no solution or the Slater constraint qualification is violated.

If the problem is not a convex optimization problem, then the procedure is slightly more complicated.

Step 1. Show that a solution exists (e.g., by showing that the objective function is continuous and the constraint set is compact).

Step 2. Verify that f and the gi's are differentiable, and that some constraint qualification holds.

Step 3. Set up the Lagrangian L(x, λ) = f(x) + ∑_{i=1}^I λigi(x) and derive the KKT conditions (5.3).

Step 4. Solve for x and λ. If you get a unique x, it is the solution. If you get multiple x's, compute f(x) for each and pick the minimum.

5.3.3 Constrained maximization

Next, we briefly discuss maximization. Although maximization is equivalent to minimization by flipping the sign of the objective function, doing so every time is awkward. So consider the maximization problem

maximize f(x)

subject to gi(x) ≥ 0 (i = 1, . . . , I), (5.4)


where f and the gi's are differentiable. (5.4) is equivalent to the minimization problem

minimize − f(x)

subject to − gi(x) ≤ 0. (i = 1, . . . , I)

Assuming that Theorem 5.4 applies, the necessary condition for optimality is

−∇f(x∗) − ∑_{i=1}^I λi∇gi(x∗) = 0,  (5.5a)
(∀i) λi(−gi(x∗)) = 0.  (5.5b)

But (5.5) is equivalent to (5.3) by multiplying everything by (−1). For this reason, it is customary to formulate a maximization problem as in (5.4), so that the inequality constraints are always "greater than or equal to zero".

Example 14. Consider a consumer with utility function

u(x) = α log x1 + (1− α) log x2,

where 0 < α < 1 and x1, x2 ≥ 0 are the consumption of goods 1 and 2 (such a function is called Cobb-Douglas). Let p1, p2 > 0 be the prices of the goods and w > 0 be the wealth.

Then the consumer’s utility maximization problem (UMP) is

maximize α log x1 + (1− α) log x2

subject to x1 ≥ 0, x2 ≥ 0, p1x1 + p2x2 ≤ w.

Step 1. Clearly the objective function is concave and the constraints are linear (hence concave). The point (x1, x2) = (ε, ε) strictly satisfies the inequalities for small enough ε > 0, so the Slater condition holds. Therefore we can apply the saddle point theorem (the analogue of Theorem 5.5 for maximization).

Step 2. Let

L(x, λ, µ) = α log x1 + (1 − α) log x2 + λ(w − p1x1 − p2x2) + µ1x1 + µ2x2,

where λ ≥ 0 is the Lagrange multiplier corresponding to the budget constraint

p1x1 + p2x2 ≤ w ⇐⇒ w − p1x1 − p2x2 ≥ 0

and µn is the Lagrange multiplier corresponding to xn ≥ 0 for n = 1, 2. By the first-order conditions, we get

0 = ∂L/∂x1 = α/x1 − λp1 + µ1,
0 = ∂L/∂x2 = (1 − α)/x2 − λp2 + µ2.

By the complementary slackness conditions, we have λ(w − p1x1 − p2x2) = 0, µ1x1 = 0, and µ2x2 = 0.


Step 3. Since log 0 = −∞, x1 = 0 or x2 = 0 cannot be an optimal solution. Hence x1, x2 > 0, so by complementary slackness we get µ1 = µ2 = 0. Then by the first-order conditions we get x1 = α/(λp1) and x2 = (1 − α)/(λp2), so λ > 0. Substituting these into the budget constraint p1x1 + p2x2 = w, we get

α/λ + (1 − α)/λ = w ⇐⇒ λ = 1/w,

so the solution is

(x1, x2) = (αw/p1, (1 − α)w/p2).

Remark. In the above example, notice that the constraints x1 ≥ 0 and x2 ≥ 0 never bind because the objective function is increasing in both arguments. Therefore a quicker way to solve the problem is to ignore these constraints and set up the Lagrangian

L(x1, x2, λ) = α log x1 + (1− α) log x2 + λ(w − p1x1 − p2x2).

Exercises

5.1. This problem asks you to show that in general you cannot omit the constraint qualification. Consider the optimization problem

minimize x

subject to x^2 ≤ 0.

1. Find the solution.

2. Show that both the objective function and the constraint function are convex, but the Slater constraint qualification does not hold.

3. Show that the Karush-Kuhn-Tucker condition does not hold.

5.2. Consider the problem

minimize 1/x1 + 4/x2

subject to x1 + x2 ≤ 3,

where x1, x2 > 0.

1. Prove that the objective function is convex.

2. Write down the Lagrangian.

3. Explain why the Karush-Kuhn-Tucker conditions are both necessary and sufficient for a solution.

4. Compute the solution.

5.3. Consider the problem

maximize (4/3)x1^3 + (1/3)x2^3

subject to x1 + x2 ≤ 1,

x1 ≥ 0, x2 ≥ 0.

1. Are the Karush-Kuhn-Tucker conditions necessary for a solution? Answer yes or no, then explain why.

2. Are the Karush-Kuhn-Tucker conditions sufficient for a solution? Answer yes or no, then explain why.

3. Write down the Lagrangian.

4. Compute the solution.

5.4. Let f(x1, x2) = x1^2 + x1x2 + x2^2.

1. Compute the gradient and the Hessian of f.

2. Show that f is convex.

3. Solve

minimize x1^2 + x1x2 + x2^2

subject to x1 + x2 ≥ 2.

5.5. Consider the problem

maximize x1 + log x2 − (1/2)x3^2

subject to x1 + p2x2 + p3x3 ≤ w,

where x2, x3 > 0 but x1 is unconstrained, and p2, p3, w > 0 are constants.

1. Show that the objective function is concave.

2. Write down the Lagrangian.

3. Find the solution.

5.6 (C). Let A be an N × N symmetric positive definite matrix. Let B be an M × N matrix with M ≤ N such that the M row vectors of B are linearly independent. Let c ∈ R^M be a vector. Solve

minimize (1/2)〈x, Ax〉

subject to Bx = c.

5.7. Solve

maximize x1 + x2

subject to x1^2 + x1x2 + x2^2 ≤ 1.

5.8 (C). Solve

maximize 〈b, x〉

subject to 〈x, Ax〉 ≤ r^2,

where 0 ≠ b ∈ R^N, A is an N × N symmetric positive definite matrix, and r > 0.

5.9 (C). Consider a consumer with utility function

u(x) = ∑_{n=1}^N αn log xn,

where αn > 0, ∑_{n=1}^N αn = 1, and xn ≥ 0 is the consumption of good n. Let p = (p1, . . . , pN) be a price vector with pn > 0 and let w > 0 be the wealth.

1. Formulate the consumer's utility maximization problem.

2. Compute the solution.

5.10 (C). Solve the same problem as above for the case

u(x) = ∑_{n=1}^N αn xn^{1−σ}/(1 − σ),

where σ > 0.


Chapter 6

Contraction Mapping Theorem and Applications

6.1 Contraction Mapping Theorem

In economics, we often apply fixed point theorems. Let X be a set, and T : X → X a mapping that maps X to itself. Then x ∈ X is called a fixed point if T(x) = x, i.e., the point x stays fixed when the mapping T is applied.

One of the most useful (and easiest to prove) fixed point theorems is the Contraction Mapping Theorem, also known as the Banach Fixed Point Theorem (named after the Polish mathematician who proved it first).

Before stating the theorem, we need to introduce some definitions. Let X be a set. Then the function d : X × X → R is called a metric (or distance) if

1. (positivity) d(x, y) ≥ 0 for all x, y ∈ X, and d(x, y) = 0 if and only if x = y,

2. (symmetry) d(x, y) = d(y, x) for all x, y ∈ X, and

3. (triangle inequality) d(x, z) ≤ d(x, y) + d(y, z) for all x, y, z ∈ X.

The pair (X, d) is called a metric space if d is a metric on X. When the metric d is clear from the context, we often call X a metric space. A typical example is X = R^N and

d(x, y) = (∑_{n=1}^N (xn − yn)^2)^{1/2}

(the Euclidean distance), or more generally

d(x, y) = (∑_{n=1}^N |xn − yn|^p)^{1/p}

for p ≥ 1 (the Lp distance). When p = ∞, it becomes d(x, y) = max_n |xn − yn| (the sup norm).

A metric space (X, d) is called complete if any Cauchy sequence is convergent, that is, x = lim_{n→∞} xn exists whenever

(∀ε > 0)(∃N > 0)(∀m, n ≥ N) d(xm, xn) < ε.


Theorem 6.1 (Contraction Mapping Theorem). Let (X, d) be a complete metric space and T : X → X be a contraction, so there exists 0 < β < 1 such that

d(T(x), T(y)) ≤ βd(x, y)

for all x, y ∈ X. Then

1. T has a unique fixed point x ∈ X,

2. for any x0 ∈ X, we have x = lim_{n→∞} T^n(x0), and

3. the approximation error d(T^n(x0), x) has order of magnitude β^n.

Proof. First note that a contraction is continuous (indeed, uniformly continuous) because for any ε > 0, if we take δ = ε/β, then d(x, y) < δ implies that d(T(x), T(y)) ≤ βd(x, y) < βδ = ε.

Take any x0 ∈ X and define xn = T(xn−1) for n ≥ 1. Then xn = T^n(x0). Since T is a contraction, we have

d(xn, xn−1) = d(T(xn−1), T(xn−2)) ≤ βd(xn−1, xn−2) ≤ · · · ≤ β^{n−1}d(x1, x0).

If m > n ≥ N, then by the triangle inequality we have

d(xm, xn) ≤ d(xm, xm−1) + · · · + d(xn+1, xn)
          ≤ (β^{m−1} + · · · + β^n)d(x1, x0)
          = [(β^n − β^m)/(1 − β)]d(x1, x0) ≤ [β^n/(1 − β)]d(x1, x0) ≤ [β^N/(1 − β)]d(x1, x0).

Since 0 < β < 1, β^N gets smaller as N gets larger, so (xn) is a Cauchy sequence. Since X is complete, x = lim_{n→∞} xn exists. Since

d(T(xn−1), xn−1) = d(xn, xn−1) ≤ β^{n−1}d(x1, x0),

letting n → ∞ and using the continuity of T, we get d(T(x), x) = 0. Since d is a metric, we have T(x) = x, so x is a fixed point of T.

To show uniqueness, suppose that x, y are fixed points of T, so T(x) = x and T(y) = y. Since T is a contraction, we have

0 ≤ d(x, y) = d(T(x), T(y)) ≤ βd(x, y),

so it must be that d(x, y) = 0 and hence x = y. Therefore the fixed point is unique.

Finally, let x be the fixed point of T, x0 be any point, and xn = T^n(x0). Then

d(xn, x) = d(T(xn−1), T(x)) ≤ βd(xn−1, x) ≤ · · · ≤ β^n d(x0, x),

so letting n → ∞ we have xn → x, and the error has order of magnitude β^n.

6.2 Markov chain

When a random variable is indexed by time, it is called a stochastic process. Let Xt be a stochastic process, where t = 0, 1, 2, . . . . When the distribution of Xt conditional on the past information Xt−1, Xt−2, . . . depends only on the


most recent past (i.e., Xt−1), Xt is called a Markov process. For example, an AR(1) process

Xt = aXt−1 + εt

(where a is a number and εt is independent and identically distributed over time) is a Markov process. When the Markov process Xt takes on finitely many values, it is called a finite-state Markov chain. Let Xt be a (finite-state) Markov chain and s = 1, 2, . . . , S be an index of the values the process can take. (We say that Xt = xs when the state at t is s.) Since there are finitely many states, the distribution of Xt conditional on Xt−1 is just a multinomial distribution. Therefore the Markov chain is completely characterized by the transition probability matrix P = (pss′), where pss′ is the probability of moving from state s to s′. Clearly, we have pss′ ≥ 0 and ∑_{s′=1}^S pss′ = 1.

Suppose that X0 is distributed according to the distribution µ = [µ1, . . . , µS] (µs is the probability of being in state s). Then what is the distribution of X1? Using the transition probabilities, the probability of being in state s′ at t = 1 is ∑_{s=1}^S µs pss′, because the process must be in some state (say s) at t = 0 (which happens with probability µs), and conditional on being in state s at t = 0, the probability of moving to state s′ at t = 1 is pss′. But note that

∑_{s=1}^S µs pss′ = (µP)s′,

the s′-th element of the row vector µP, by the definition of matrix multiplication. Similarly, the distribution of X2 is (µP)P = µP^2, and in general the distribution of Xt is µP^t.

As we let the system run for a long time, does the distribution settle down to some fixed distribution? That is, does lim_{t→∞} µP^t exist, and if so, is it unique? We can answer this question by using the contraction mapping theorem.
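Here is a quick numerical illustration of this convergence (the 2 × 2 transition matrix below is a hypothetical example, not from the notes): iterating µ ↦ µP quickly settles down to a distribution that is left unchanged by P. Theorem 6.2 below makes this precise.

import numpy as np

P = np.array([[0.9, 0.1],
              [0.5, 0.5]])          # rows sum to one, all entries strictly positive

mu = np.array([1.0, 0.0])           # start in state 1 with probability one
for _ in range(100):                # iterate mu -> mu P (distribution of X_t)
    mu = mu @ P

print(mu)                           # approx [0.8333, 0.1667]
print(mu @ P)                       # unchanged: mu is (numerically) the invariant distribution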

Theorem 6.2. Let P = (pss′) be a transition probability matrix such that pss′ > 0 for all s, s′. Then there exists a unique invariant distribution π such that π = πP, and lim_{t→∞} µP^t = π for every distribution µ.

Proof. Let ∆ = {x ∈ R^S_+ | ∑_{s=1}^S xs = 1} be the set of all multinomial distributions. Since ∆ ⊂ R^S is closed and R^S is a complete metric space with the L1 norm (that is, d(x, y) = ‖x − y‖ for ‖x‖ = ∑_{s=1}^S |xs|), ∆ is also a complete metric space.

Define T : ∆ → ∆ by T(x) = xP. To show that T(x) ∈ ∆, note that if x ∈ ∆, then since pss′ ≥ 0 for all s, s′, we have xP ≥ 0, and since ∑_{s′=1}^S pss′ = 1, we have

∑_{s′=1}^S (xP)s′ = ∑_{s′=1}^S ∑_{s=1}^S xs pss′ = ∑_{s=1}^S xs ∑_{s′=1}^S pss′ = ∑_{s=1}^S xs = 1.

Therefore T(x) = xP ∈ ∆.

Next, let us show that T is a contraction mapping. Since pss′ > 0 and the number of states is finite, there exists ε > 0 such that pss′ > ε for all s, s′. Without loss of generality, we may assume Sε < 1. Let qss′ = (pss′ − ε)/(1 − Sε) > 0 and Q = (qss′). Since ∑_{s′} pss′ = 1, we obtain ∑_{s′} qss′ = 1, so Q is also a transition probability matrix. Letting J be the matrix with all elements equal to 1, we have P = (1 − Sε)Q + εJ.


Now let µ, ν ∈ ∆. Then

µP − νP = (1− Sε)(µQ− νQ) + ε(µJ − νJ).

Since all elements of J are 1 and the vectors µ, ν sum up to 1, µJ and νJ are both vectors of ones. Therefore µJ = νJ, so letting 0 < β = 1 − Sε < 1, we get

‖T(µ) − T(ν)‖ = ‖µP − νP‖ = β‖µQ − νQ‖
= β ∑_{s′=1}^S |(µQ)s′ − (νQ)s′| = β ∑_{s′=1}^S |∑_{s=1}^S (µs − νs)qss′|
≤ β ∑_{s′=1}^S ∑_{s=1}^S |µs − νs| qss′ = β ∑_{s=1}^S |µs − νs| ∑_{s′=1}^S qss′
= β ∑_{s=1}^S |µs − νs| = β‖µ − ν‖.

Therefore T is a contraction. By the contraction mapping theorem, there exists a unique π ∈ ∆ such that πP = π, and lim_{t→∞} µP^t = π for all µ ∈ ∆.

Remark. The same conclusion holds if there exists a number k such that P^k is a positive matrix. (The above theorem is the special case k = 1.) The idea of the proof is as follows.

1. Take any x0 ∈ ∆ and let xn = x0P^n.

2. For each n = 1, 2, . . . , let n = kqn + rn, where 0 ≤ rn < k.

3. Using the fact that P^k is a positive matrix and imitating the proof of the contraction mapping theorem, show that (xn) is Cauchy, and hence x = lim xn exists, which satisfies x = xP.

4. Show that x = xP has a unique solution, using the fact that P^k is positive (and hence defines a contraction).

6.3 Implicit function theorem

When solving economic models, we often encounter equations like f(x, y) = 0, where y is an endogenous variable and x is an exogenous variable. Oftentimes y does not have an explicit expression, but nevertheless we might be interested in dy/dx. The implicit function theorem lets you compute this derivative as follows. Let y = g(x). Then f(x, g(x)) = 0. Differentiating both sides with respect to x and using the chain rule, we get fx + fy g'(x) = 0, so g'(x) = −fx/fy. (fx is shorthand for ∂f/∂x.)
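A quick numerical check of the formula g'(x) = −fx/fy, for a case where g is known explicitly (the circle f(x, y) = x^2 + y^2 − 1 is a hypothetical example, not from the notes): the implicit-function formula and a finite-difference derivative of g agree.

import math

f_x = lambda x, y: 2 * x            # partial derivative of f with respect to x
f_y = lambda x, y: 2 * y            # partial derivative of f with respect to y
g   = lambda x: math.sqrt(1 - x**2) # explicit solution of f(x, g(x)) = 0 near (0.6, 0.8)

x0, y0 = 0.6, 0.8
implicit = -f_x(x0, y0) / f_y(x0, y0)            # -f_x/f_y = -0.75
h = 1e-6
finite_diff = (g(x0 + h) - g(x0 - h)) / (2 * h)  # numerical derivative of g

print(implicit, finite_diff)        # both approximately -0.75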

The same argument holds in multiple dimensions. If x is m-dimensional, y is n-dimensional, and f : R^m × R^n → R^n, then differentiating both sides of f(x, g(x)) = 0 we get Dxf + Dyf Dxg = 0, so Dxg = −(Dyf)^{-1}Dxf. (Dxf is the Jacobian matrix (matrix of partial derivatives) of f with respect to x: the (i, j) element of Dxf is ∂fi/∂xj.) The goal of this section is to prove the following implicit function theorem.


Theorem 6.3 (Implicit Function Theorem). Let f : R^m × R^n → R^n be a continuously differentiable function. If f(x0, y0) = 0 and Dyf(x0, y0) is regular, there exist neighborhoods U of x0 and V of y0 and a function g : U → V such that

1. for all x ∈ U, f(x, y) = 0 ⇐⇒ y = g(x),

2. g is continuously differentiable, and

3. Dxg(x) = −[Dyf(x, y)]^{-1}Dxf(x, y), where y = g(x).

The last claim of the implicit function theorem follows from the first two and the chain rule. The first two claims follow from the inverse function theorem:

Theorem 6.4 (Inverse Function Theorem). Let f : R^n → R^n be a continuously differentiable function. If Df(x0) is regular, then there exists a neighborhood V of y0 = f(x0) such that

1. f : U = f−1(V )→ V is bijective (one-to-one and onto),

2. g = f−1 is continuously differentiable, and

3. Dg(y) = Df(g(y))−1 on V .

Proof of Implicit Function Theorem. Define F : R^{m+n} → R^{m+n} by

F(x, y) = [x; f(x, y)].

Then F is continuously differentiable. Furthermore, since

DF(x, y) = [I_m  O_{m,n}; Dxf(x, y)  Dyf(x, y)],

we have det DF(x0, y0) = det Dyf(x0, y0) ≠ 0, so DF(x0, y0) is regular. Since F(x0, y0) = (x0, 0), by the inverse function theorem there exists a neighborhood V of (x0, 0) such that F : F^{-1}(V) → V is bijective. Let G be the inverse function of F. Then for any (z, w) ∈ V, we have F(x, y) = (z, w) ⇐⇒ (x, y) = G(z, w) = (G1(z, w), G2(z, w)). Since F(x, y) = (x, f(x, y)) by definition, we have x = z, so f(x, y) = w ⇐⇒ y = G2(x, w). Letting w = 0, we have f(x, y) = 0 ⇐⇒ y = g(x) := G2(x, 0). Since G is continuously differentiable, so is g. Therefore the implicit function theorem is proved.

Next, we prove the inverse function theorem. If f is linear, so f(x) =y0 + A(x − x0) for some matrix A, then we can find the inverse function bysolving y = y0 + A(x − x0) ⇐⇒ x = x0 + A−1(y − y0) (provided that A isregular). The idea to prove the nonlinear case is to linearize f around x0.

Since f is differentiable, we have y = f(x) ≈ f(x0) + Df(x0)(x − x0). Solving this equation for x, we obtain

x ≈ x0 + Df(x0)^{-1}(y − f(x0)).

This equation shows that, given an approximate solution x0 of f(x) = y, the solution can be approximated further by x0 + Df(x0)^{-1}(y − f(x0)) (which is essentially the Newton-Raphson algorithm for solving the nonlinear equation f(x) = y). This intuition is helpful for understanding the proof.
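The following minimal sketch (a hypothetical two-dimensional map; numpy assumed) implements exactly this update x ← x + Df(x)^{-1}(y − f(x)) and converges to a solution of f(x) = y.

import numpy as np

# Hypothetical map f : R^2 -> R^2 and target y; we solve f(x) = y by repeatedly
# applying the update x <- x + Df(x)^{-1} (y - f(x)) described above.
def f(x):
    return np.array([x[0]**2 + x[1], x[0] + x[1]**3])

def Df(x):                                   # Jacobian of f
    return np.array([[2 * x[0], 1.0],
                     [1.0, 3 * x[1]**2]])

y = np.array([2.0, 2.0])
x = np.array([1.2, 0.8])                     # initial guess x0
for _ in range(20):
    x = x + np.linalg.solve(Df(x), y - f(x))

print(x, f(x))                               # f(x) should be very close to y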

To prove the inverse function theorem, we need the following lemma.


Lemma 6.5 (Mean value inequality). Let f : R^n → R^m be differentiable. Then

‖f(x2) − f(x1)‖ ≤ sup_{t∈[0,1]} ‖Df(x1 + t(x2 − x1))‖ ‖x2 − x1‖.

Proof. Take any d ∈ R^m with ‖d‖ = 1. Define φ : [0, 1] → R by

φ(t) = d′(f(x1 + t(x2 − x1)) − f(x1)).

Then φ(0) = 0 and φ(1) = d′(f(x2) − f(x1)). By the mean value theorem, there exists t ∈ [0, 1] such that

d′(f(x2) − f(x1)) = φ(1) − φ(0) = φ′(t) = d′ Df(x1 + t(x2 − x1))(x2 − x1),

where the last equality follows from the chain rule. Taking the norm of both sides, taking the supremum over t ∈ [0, 1], and then the supremum over d, we obtain the desired inequality.

Proof of Inverse Function Theorem. Fix y ∈ R^n and define T : R^n → R^n by

T(x) = x + Df(x0)^{-1}(y − f(x)),

which is well-defined because Df(x0) is regular. For notational simplicity, let A = Df(x0). Applying the mean value inequality to T(x), for any x1, x2 we have

‖T(x2) − T(x1)‖ ≤ sup_{t∈[0,1]} ‖I − A^{-1} Df(x(t))‖ ‖x2 − x1‖,

where x(t) = x1 + t(x2 − x1). Since f is continuously differentiable and A = Df(x0), we can take ε > 0 such that ‖I − A^{-1} Df(x)‖ ≤ 1/2 whenever ‖x − x0‖ ≤ ε. Note that since y cancels out in T(x2) − T(x1), ε does not depend on y. Let B = {x | ‖x − x0‖ ≤ ε} be the closed ball with center x0 and radius ε. If x1, x2 ∈ B, then

‖x(t)− x0‖ ≤ (1− t) ‖x1 − x0‖+ t ‖x2 − x0‖ ≤ ε,

so x(t) ∈ B. Then by assumption ‖T(x2) − T(x1)‖ ≤ (1/2)‖x2 − x1‖ on B. Let us show that T(B) ⊂ B if y is sufficiently close to y0 = f(x0). To see this, note that

T(x) − x0 = x − x0 + A^{-1}(f(x0) − f(x)) + A^{-1}(y − y0).

Applying the mean value inequality to h(x) = x − A^{-1} f(x), we obtain

‖T(x) − x0‖ ≤ sup_{t∈[0,1]} ‖I − A^{-1} Df(x0 + t(x − x0))‖ ‖x − x0‖ + ‖A^{-1}(y − y0)‖.

Take a neighborhood V of y0 such that ‖A^{-1}(y − y0)‖ ≤ (1/2)ε for all y ∈ V. If y ∈ V and x ∈ B, then we have

‖T(x) − x0‖ ≤ (1/2)‖x − x0‖ + (1/2)ε ≤ ε,

so T(x) ∈ B. Since T : B → B, B is closed, and T is a contraction on B, there exists a unique fixed point x of T. Since

x = T(x) = x + A^{-1}(y − f(x)),


we have y = f(x). Let x = g(y). Then f(x) = y, so f(g(y)) = y. First let us show that g is continuous on V. Suppose yn → y and xn = g(yn) but xn does not converge to x. Since B is compact, by taking a subsequence if necessary, we may assume that xn → x′, where x′ ∈ B and x′ ≠ x. Since f(xn) = f(g(yn)) = yn, letting n → ∞, since f is continuous we get f(x′) = y, which is a contradiction because f(x) = y has a unique solution on B. Therefore g is continuous.

To show the differentiability of g, for small enough h ∈ R^n, let k(h) := g(y + h) − g(y). Since g is continuous, so is k. Noting that g(y + h) = g(y) + k(h) = x + k and f is differentiable, we obtain

y + h = f(g(y + h)) = f(x + k) = f(x) + Df k + o(k).

Since y = f(x), we get h = Df k + o(k), so h and k have the same order of magnitude. Finally,

g(y + h) = g(y) + k = g(y) + [Df]^{-1}(h − o(k)) = g(y) + [Df]^{-1} h + o(h),

so g is differentiable and Dg(y) = Df(g(y))^{-1}.


Part II

Advanced topics


Chapter 7

Convex Sets

7.1 Convex sets

A set C ⊂ R^N is said to be convex if the line segment generated by any two points in C is entirely contained in C. Formally, C is convex if x, y ∈ C implies (1 − α)x + αy ∈ C for all α ∈ [0, 1] (Figure 7.1). So a circle, triangle, and square are convex but a star shape is not (Figure 7.2). One of my favorite mathematical jokes is that the Chinese character for "convex" is not convex (Figure 7.3).

Figure 7.1. Definition of a convex set.

Figure 7.2. Examples of convex and non-convex sets.


Figure 7.3. The Chinese character for "convex" is not convex.

Let A be any set. The smallest convex set that includes A is called the convex hull of A and is denoted by co A. (Its existence is left as an exercise.) For example, in Figure 7.4, the convex hull of the set A consisting of two circles is the entire region in between.

Figure 7.4. Convex hull.

Let {x_k}_{k=1}^K ⊂ R^N be any points. A point of the form

x = ∑_{k=1}^K α_k x_k,

where α_k ≥ 0 and ∑_{k=1}^K α_k = 1, is called a convex combination of the points {x_k}_{k=1}^K. The following lemma provides a constructive way to obtain the convex hull of a set.

Lemma 7.1. Let A ⊂ R^N be any set. Then co A consists of all convex combinations of points of A.

Proof. Exercise.
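As a numerical illustration of Lemma 7.1 (an assumed setup; scipy is used only to describe co A of a finite point set by facet inequalities), a random convex combination of points of A indeed lies in co A:

import numpy as np
from scipy.spatial import ConvexHull

rng = np.random.default_rng(0)
A = rng.normal(size=(10, 2))            # a finite set A of 10 points in R^2

# Any convex combination of points of A should lie in co A.
alpha = rng.random(10)
alpha /= alpha.sum()                    # alpha_k >= 0, sum of alpha_k = 1
x = alpha @ A                           # x = sum_k alpha_k x_k

# ConvexHull.equations stores facet inequalities [normal, offset] with
# normal . z + offset <= 0 for every z in the hull.
hull = ConvexHull(A)
inside = np.all(hull.equations[:, :-1] @ x + hull.equations[:, -1] <= 1e-9)
print(inside)                           # True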

7.2 Hyperplanes and half spaces

You should know from high school that the equation of a line in R2 is

a1x1 + a2x2 = c

for some real numbers a1, a2, c, and that the equation of a plane in R3 is

a1x1 + a2x2 + a3x3 = c.

Letting a = (a1, . . . , aN) and x = (x1, . . . , xN) be vectors in R^N, the equation 〈a, x〉 = c describes a line if N = 2 and a plane if N = 3, where

〈a, x〉 = a1x1 + · · ·+ aNxN


is the inner product of the vectors a and x.1 In general, we say that the set

{x ∈ R^N | 〈a, x〉 = c}

is a hyperplane if a ≠ 0. The vector a is orthogonal to this hyperplane (it is a normal vector). To see this, let x0 be a point in the hyperplane. Since 〈a, x0〉 = c, by subtraction and linearity of the inner product we get 〈a, x − x0〉 = 0. This means that the vector a is orthogonal to the vector x − x0, which can point in any direction within the hyperplane as x moves. So it makes sense to say that a is orthogonal to the hyperplane 〈a, x〉 = c. The sets

H+ = {x ∈ R^N | 〈a, x〉 ≥ c},
H− = {x ∈ R^N | 〈a, x〉 ≤ c}

are called half spaces, since H+ (H−) is the portion of R^N separated by the hyperplane 〈a, x〉 = c towards the direction of a (−a). Hyperplanes and half spaces are convex sets (exercise).

7.3 Separation of convex sets

Let A, B be two sets. We say that the hyperplane 〈a, x〉 = c separates A, B if A ⊂ H− and B ⊂ H+ (Figure 7.5), that is,

x ∈ A =⇒ 〈a, x〉 ≤ c,
x ∈ B =⇒ 〈a, x〉 ≥ c.

(The inequalities may be reversed.)

Figure 7.5. Separation of convex sets.

Clearly A, B can be separated if and only if

sup_{x∈A} 〈a, x〉 ≤ inf_{x∈B} 〈a, x〉,

since we can take c between these two numbers. We say that A, B can be strictly separated if the inequality is strict, so

sup_{x∈A} 〈a, x〉 < inf_{x∈B} 〈a, x〉.

1The inner product is sometimes called the scalar product or the dot product. Common notations for the inner product are 〈a, x〉, (a, x), a · x, etc.


The remarkable property of convex sets is the following separation property.

Theorem 7.2 (Separating Hyperplane Theorem). Let C, D ⊂ R^N be nonempty and convex. If C ∩ D = ∅, then there exists a hyperplane that separates C, D. If C, D are closed and one of them is compact, then they can be strictly separated.

We need the following lemma to prove Theorem 7.2.

Lemma 7.3. Let C be nonempty and convex. Then any x ∈ R^N has a unique closest point P_C(x) ∈ cl C, called the projection of x on cl C. Furthermore, for any z ∈ C we have

〈x − P_C(x), z − P_C(x)〉 ≤ 0.

Proof. Let δ = inf {‖x − y‖ | y ∈ C} ≥ 0 be the distance from x to C (Figure 7.6).

Figure 7.6. Projection on a convex set.

Take a sequence {yk} ⊂ C such that ‖x − yk‖ → δ. Then by simple algebra we get

‖yk − yl‖^2 = 2‖x − yk‖^2 + 2‖x − yl‖^2 − 4‖x − (1/2)(yk + yl)‖^2. (7.1)

Since C is convex, we have (1/2)(yk + yl) ∈ C, so by the definition of δ we get

‖yk − yl‖^2 ≤ 2‖x − yk‖^2 + 2‖x − yl‖^2 − 4δ^2 → 2δ^2 + 2δ^2 − 4δ^2 = 0

as k, l → ∞. Since {yk} ⊂ C is Cauchy, it converges to some point y ∈ cl C. Then

‖x − y‖ ≤ ‖x − yk‖ + ‖yk − y‖ → δ + 0 = δ,

so y is the closest point to x in cl C. If y1, y2 are two closest points, then by the same argument we get

‖y1 − y2‖^2 ≤ 2‖x − y1‖^2 + 2‖x − y2‖^2 − 4δ^2 ≤ 0,

so y1 = y2. Thus y = P_C(x) is unique.


Finally, let z ∈ C be any point. Take {yk} ⊂ C such that yk → y = P_C(x). Since C is convex, for any 0 < α ≤ 1 we have (1 − α)yk + αz ∈ C. Therefore

δ^2 = ‖x − y‖^2 ≤ ‖x − (1 − α)yk − αz‖^2.

Letting k → ∞ we get ‖x − y‖^2 ≤ ‖x − y − α(z − y)‖^2. Expanding both sides, dividing by α > 0, and letting α → 0, we get 〈x − y, z − y〉 ≤ 0, which is the desired inequality.
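A quick numerical check of Lemma 7.3 (for the convex set C = [0, 1]^2, whose projection is coordinatewise clipping; an assumed example, numpy assumed):

import numpy as np

# Minimal numerical check of Lemma 7.3 for C = [0, 1]^2.
def proj_box(x):
    return np.clip(x, 0.0, 1.0)      # projection onto the unit box

rng = np.random.default_rng(1)
x = np.array([2.0, -0.5])            # a point outside C
y = proj_box(x)                      # y = P_C(x)

# For many random z in C, the inner product <x - y, z - y> should be <= 0.
Z = rng.random((1000, 2))            # uniform points in C
vals = (Z - y) @ (x - y)
print(y, vals.max() <= 1e-12)        # projection and check (True)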

The following proposition shows that a point that is not an interior point of a convex set C can be separated from C.

Proposition 7.4. Let C ⊂ R^N be nonempty and convex and x /∈ int C. Then there exists a hyperplane 〈a, x〉 = c that separates x and C, i.e.,

〈a, x〉 ≥ c ≥ 〈a, z〉

for any z ∈ C. If x /∈ cl C, then the above inequalities can be made strict.

Proof. Suppose that x /∈ cl C. Let y = P_C(x) be the projection of x on cl C. Then x ≠ y. Let a = x − y ≠ 0 and c = 〈a, y〉 + (1/2)‖a‖^2. Then for any z ∈ C we have

〈x − y, z − y〉 ≤ 0 =⇒ 〈a, z〉 ≤ 〈a, y〉 < 〈a, y〉 + (1/2)‖a‖^2 = c,
〈a, x〉 − c = 〈x − y, x − y〉 − (1/2)‖a‖^2 = (1/2)‖a‖^2 > 0 ⇐⇒ 〈a, x〉 > c.

Therefore the hyperplane 〈a, x〉 = c strictly separates x and C.

If x ∈ cl C, since x /∈ int C we can take a sequence {xk} such that xk /∈ cl C and xk → x. Then we can find a vector ak ≠ 0 and a number ck ∈ R such that

〈ak, xk〉 ≥ ck ≥ 〈ak, z〉

for all z ∈ C. By dividing both sides by ‖ak‖ ≠ 0, without loss of generality we may assume ‖ak‖ = 1. Since xk → x, the sequence {ck} is bounded. Therefore we can find a convergent subsequence (a_{kl}, c_{kl}) → (a, c). Letting l → ∞, we get

〈a, x〉 ≥ c ≥ 〈a, z〉

for any z ∈ C.

Proof of Theorem 7.2. Let E = C − D = {x − y | x ∈ C, y ∈ D}. Since C, D are nonempty and convex, so is E. Since C ∩ D = ∅, we have 0 /∈ E. In particular, 0 /∈ int E. By Proposition 7.4, there exists a ≠ 0 such that 〈a, 0〉 = 0 ≥ 〈a, z〉 for all z ∈ E. By the definition of E, we have

〈a, x − y〉 ≤ 0 ⇐⇒ 〈a, x〉 ≤ 〈a, y〉

for any x ∈ C and y ∈ D. Taking c with sup_{x∈C} 〈a, x〉 ≤ c ≤ inf_{y∈D} 〈a, y〉, it follows that the hyperplane 〈a, x〉 = c separates C and D.

Suppose that C is closed and D is compact. Let us show that E = C − D is closed. For this purpose, suppose that {zk} ⊂ E and zk → z. Then we can take {xk} ⊂ C, {yk} ⊂ D such that zk = xk − yk. Since D is compact, there is a subsequence such that y_{kl} → y ∈ D. Then x_{kl} = y_{kl} + z_{kl} → y + z, and since C is closed, x = y + z ∈ C. Therefore z = x − y ∈ E, so E is closed.

Since E = C − D is closed and 0 /∈ E, by Proposition 7.4 there exists a ≠ 0 such that 〈a, 0〉 = 0 > 〈a, z〉 for all z ∈ E. The rest of the proof is similar.


7.4 Application: asset pricing

Consider an economy with two periods, denoted by t = 0, 1. Suppose that at t = 1 the state of the economy can be one of s = 1, . . . , S. There are J assets in the economy, indexed by j = 1, . . . , J. One share of asset j trades for price qj at time 0 and pays Vsj in state s. (It can be that Vsj < 0, in which case the holder of one share of asset j must deliver −Vsj > 0 in state s.) Let q = (q1, . . . , qJ) be the vector of asset prices and V = (Vsj) be the S × J matrix of asset payoffs. Define

W = W(q, V) = [−q′; V]

to be the (1 + S) × J matrix of net payments of one share of each asset in each state. Here, state 0 is defined to be time 0, and the presence of −q = (−q1, . . . , −qJ) means that in order to receive Vsj in state s one must purchase one share of asset j at time 0, thus paying qj (receiving −qj).

Let θ ∈ R^J be a portfolio. (θj is the number of shares of asset j an investor buys; θj < 0 corresponds to shortselling.) The net payment of the portfolio θ is the vector

Wθ = [−q′θ; Vθ] ∈ R^{1+S}.

Here the investor pays q′θ at t = 0 for buying the portfolio θ, and receives (Vθ)s in state s at t = 1.

Let 〈W〉 = {Wθ | θ ∈ R^J} be the set of payoffs generated by all portfolios, called the asset span. We say that the asset span 〈W〉 exhibits no-arbitrage if

〈W〉 ∩ R^{1+S}_+ = {0}.

That is, it is impossible to find a portfolio that pays a non-negative amount in every state and a positive amount in at least one state.

Theorem 7.5 (Fundamental Theorem of Asset Pricing). The asset span 〈W〉 exhibits no-arbitrage if and only if there exists π ∈ R^S_{++} such that [1, π′]W = 0. In this case, the asset prices are given by

qj = ∑_{s=1}^S πs Vsj.

πs > 0 is called the state price in state s.

Proof. Suppose that such a π exists. If 0 ≠ w = (w0, . . . , wS) ∈ R^{1+S}_+, then

[1, π′]w = w0 + ∑_{s=1}^S πs ws > 0,

so w /∈ 〈W〉 (because [1, π′]Wθ = 0 for every θ). This shows 〈W〉 ∩ R^{1+S}_+ = {0}.

Conversely, suppose that there is no arbitrage. Then 〈W〉 ∩ ∆ = ∅, where

∆ = {w ∈ R^{1+S}_+ | ∑_{s=0}^S ws = 1}

is the unit simplex. Clearly 〈W〉, ∆ are convex and nonempty, and ∆ is compact. By the separating hyperplane theorem, we can find 0 ≠ λ ∈ R^{1+S} such that 〈λ, w〉 < 〈λ, d〉 for any w ∈ 〈W〉 and d ∈ ∆.


If there is θ such that 〈λ, Wθ〉 ≠ 0, then letting w = W(αθ) and α → ±∞ we would violate the inequality 〈λ, w〉 < 〈λ, d〉. Hence λ′Wθ = 〈λ, Wθ〉 = 0 for any θ, so λ′W = 0. Letting d = es (the unit vector) for s = 0, 1, . . . , S, we get λs > 0. Letting πs = λs/λ0 for s = 1, . . . , S, the vector π = (π1, . . . , πS) satisfies π ≫ 0 and [1, π′]W = 0. Writing this equation component-wise, we get qj = ∑_{s=1}^S πs Vsj.
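A minimal numerical sketch of Theorem 7.5 (hypothetical payoffs V and state prices π; numpy assumed): pricing the assets by q = V′π makes [1, π′]W = 0 hold by construction.

import numpy as np

# Hypothetical payoff matrix V (S = 2 states, J = 2 assets) and state prices pi >> 0.
V = np.array([[1.0, 2.0],      # payoffs in state 1
              [1.0, 0.5]])     # payoffs in state 2
pi = np.array([0.4, 0.5])      # strictly positive state prices (assumed)

q = V.T @ pi                   # q_j = sum_s pi_s V_sj

# Build W = [-q'; V] and check the no-arbitrage condition [1, pi'] W = 0.
W = np.vstack([-q, V])
print(np.concatenate(([1.0], pi)) @ W)   # approximately [0, 0]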

Notes

Rockafellar (1970) is a classic reference for convex analysis. Much of the theory of separation of convex sets can be generalized to infinite-dimensional spaces. The proof in this chapter using the projection generalizes to Hilbert spaces, but for more general spaces (topological vector spaces) we need the Hahn-Banach theorem. See Berge (1959) and Luenberger (1969).

Exercises

7.1. 1. Let {Ci}_{i∈I} ⊂ R^N be a collection of convex sets. Prove that ∩_{i∈I} Ci is convex.

2. Let A be any set. Prove that there exists a smallest convex set that includes A (the convex hull of A).

7.2. Prove Lemma 7.1.

7.3. 1. Let 0 ≠ a ∈ R^N and c ∈ R. Show that the hyperplane H = {x ∈ R^N | 〈a, x〉 = c} and the half space H+ = {x ∈ R^N | 〈a, x〉 ≥ c} are convex sets.

2. Let A be an M × N matrix and b ∈ R^M. A set of the form

P = {x ∈ R^N | Ax ≤ b}

is called a polytope. Show that a polytope is convex.

7.4. 1. Let a, b ∈ R^N. Prove the following parallelogram law:

‖a + b‖^2 + ‖a − b‖^2 = 2‖a‖^2 + 2‖b‖^2.

2. Using the parallelogram law, prove (7.1).

7.5. Let A = {(x, y) ∈ R^2 | y > x^3} and B = {(x, y) ∈ R^2 | x ≥ 1, y ≤ 1}.

1. Draw a picture of the sets A, B on the xy plane.

2. Can A, B be separated? If so, provide an equation of a straight line that separates them. If not, explain why.

7.6. Let C = {(x, y) ∈ R^2 | y > e^x} and D = {(x, y) ∈ R^2 | y ≤ 0}.

1. Draw a picture of the sets C, D on the xy plane.

2. Provide an equation of a straight line that separates C, D.


3. Can C,D be strictly separated? Answer yes or no, then explain why.

7.7. Prove the following Stiemke's lemma. Let A be an M × N matrix. Then one and only one of the following statements is true:

1. There exists x ∈ R^N_{++} such that Ax = 0.

2. There exists y ∈ R^M such that A′y > 0.

7.8. A typical linear programming problem is

minimize 〈c, x〉
subject to Ax ≥ b,

where x ∈ R^N, 0 ≠ c ∈ R^N, b ∈ R^M, and A is an M × N matrix with M ≥ N. A standard algorithm for solving a linear programming problem is the simplex method, which you should have already learned. The idea is that you keep moving from one vertex of the polytope

P = {x ∈ R^N | Ax ≥ b}

to a neighboring vertex as long as the function value decreases, and if there is no neighboring vertex with a smaller function value, you stop.

1. Prove that the simplex method terminates in finitely many steps.

2. Prove that when the algorithm stops, you are at a solution of the original problem.


Chapter 8

Convex Functions

8.1 Convex functions

Let f : R^N → [−∞, ∞] be a function. The set

epi f := {(x, y) ∈ R^N × R | f(x) ≤ y}

is called the epigraph of f, for the obvious reason that epi f is the set of points that lie on or above the graph of f. f is said to be convex if epi f is convex. f is convex if and only if for any x1, x2 ∈ R^N and α ∈ [0, 1] we have (exercise)

f((1 − α)x1 + αx2) ≤ (1 − α)f(x1) + αf(x2).

This inequality is often used as the definition of a convex function.

Figure 8.1. Convex function and its epigraph.

When the inequality is strict whenever x1 ≠ x2 and α ∈ (0, 1), we say that f is strictly convex. f is concave if −f is convex. A convex function is proper if f(x) > −∞ for all x and f(x) < ∞ for some x. If f is a convex function, then the set

dom f := {x ∈ R^N | f(x) < ∞}

is a convex set, since x1, x2 ∈ dom f implies

f((1 − α)x1 + αx2) ≤ (1 − α)f(x1) + αf(x2) < ∞.


dom f is called the effective domain of f.

Another useful but weaker concept is quasi-convexity. The set

L_f(y) = {x ∈ R^N | f(x) ≤ y}

is called the lower contour set of f at level y. f is said to be quasi-convex if all lower contour sets are convex. f is quasi-convex if and only if for any x1, x2 ∈ R^N and α ∈ [0, 1] we have (exercise)

f((1 − α)x1 + αx2) ≤ max{f(x1), f(x2)}.

Again, if the inequality is strict whenever x1 ≠ x2 and 0 < α < 1, then f is said to be strictly quasi-convex.

f is said to be concave if −f is convex, that is, f is a convex function flipped upside down. The definitions of strict concavity and quasi-concavity are similar.

8.2 Characterization of convex functions

When f is differentiable, there is a simple way to establish convexity.

Proposition 8.1. Let U be an open convex set and f : U → R be differentiable. Then f is (strictly) convex if and only if

f(y) − f(x) ≥ (>) 〈∇f(x), y − x〉

for all x ≠ y.

Proof. Suppose that f is (strictly) convex. Let x ≠ y ∈ U and define g : (0, 1] → R by

g(t) = (f((1 − t)x + ty) − f(x)) / t.

Then g is (strictly) increasing, for if 0 < s < t ≤ 1 we have

g(s) ≤ (<) g(t)
⇐⇒ (f((1 − s)x + sy) − f(x)) / s ≤ (<) (f((1 − t)x + ty) − f(x)) / t
⇐⇒ f((1 − s)x + sy) ≤ (<) (1 − s/t) f(x) + (s/t) f((1 − t)x + ty),

but the last inequality holds by letting α = s/t, x1 = x, x2 = (1 − t)x + ty, and using the definition of convexity. Therefore

f(y) − f(x) = g(1) ≥ (>) lim_{t→0} g(t) = 〈∇f(x), y − x〉.

Conversely, suppose that

f(y) − f(x) ≥ (>) 〈∇f(x), y − x〉

for all x ≠ y. Take any x1 ≠ x2 and α ∈ (0, 1). Setting y = x1, x2 and x = (1 − α)x1 + αx2, we get

f(x1) − f((1 − α)x1 + αx2) ≥ (>) 〈∇f(x), x1 − x〉,
f(x2) − f((1 − α)x1 + αx2) ≥ (>) 〈∇f(x), x2 − x〉.


Multiplying each by 1 − α and α respectively and adding the two inequalities, we get

(1 − α)f(x1) + αf(x2) − f((1 − α)x1 + αx2) ≥ (>) 0,

so f is (strictly) convex.

Figure 8.2 shows the geometric intuition behind Proposition 8.1. Since QR = f(y) − f(x) and SR = 〈∇f(x), y − x〉, we have f(y) − f(x) ≥ 〈∇f(x), y − x〉.

Figure 8.2. Characterization of a convex function.
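The gradient inequality of Proposition 8.1 can also be checked numerically. A minimal sketch for the convex quadratic f(x) = 〈x, Ax〉 with A positive semidefinite (a hypothetical example; numpy assumed):

import numpy as np

# Check f(y) - f(x) >= <grad f(x), y - x> for the convex function f(x) = x' A x.
A = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # symmetric positive definite

f = lambda x: x @ A @ x
grad = lambda x: 2 * A @ x          # gradient of x' A x for symmetric A

rng = np.random.default_rng(2)
ok = True
for _ in range(1000):
    x, y = rng.normal(size=2), rng.normal(size=2)
    ok &= f(y) - f(x) >= grad(x) @ (y - x) - 1e-9
print(ok)                           # True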

As in the case of convex functions, there is a simple way to establish quasi-convexity if f is differentiable.

Proposition 8.2. Let f be differentiable. Then f is quasi-convex if and only if for all x, y we have

f(y) ≤ f(x) =⇒ 〈∇f(x), y − x〉 ≤ 0.

Proof. Suppose that f is quasi-convex and f(y) ≤ f(x). Then for any 0 < t ≤ 1 we have

f((1 − t)x + ty) ≤ f(x) ⇐⇒ (1/t)(f(x + t(y − x)) − f(x)) ≤ 0.

Letting t → 0, we obtain 〈∇f(x), y − x〉 ≤ 0.

Conversely, suppose that 〈∇f(x), y − x〉 ≤ 0 whenever f(y) ≤ f(x). If f is not quasi-convex, there exist x, y, and 0 ≤ t ≤ 1 such that f(x) ≥ f(y) and

f((1 − t)x + ty) > f(x).

Let g(t) = f((1 − t)x + ty) and T = {t ∈ [0, 1] | g(t) > g(0)}. Then the above inequality shows that T ≠ ∅. Take any t ∈ T. Clearly t ≠ 0, so t > 0. Also, since f(y) ≤ f(x), we have g(1) ≤ g(0), so 1 /∈ T. Using the assumption for x = (1 − t)x + ty and y = x, it follows that

〈∇f((1 − t)x + ty), x − ((1 − t)x + ty)〉 ≤ 0 ⇐⇒ 〈∇f((1 − t)x + ty), y − x〉 ≥ 0.


Then the above argument shows that g(t) > g(0) implies g′(t) ≥ 0.

Since g is continuous, T is open. Take any t1 ∈ T and let T1 be the connected component of T that includes t1. Let t̄ = sup T1. Since T1 is connected, for any t1 ≤ t < t̄, we have t ∈ T1 ⊂ T, so g(t) > g(0) and g′(t) ≥ 0. Since g is continuous and monotone on T1, we have g(t̄) ≥ g(t1) > g(0), so t̄ ∈ T. By continuity of g, for small enough ε > 0 we have g(t) > g(0) if t ∈ [t̄, t̄ + ε], so t̄ = sup T1 ≥ t̄ + ε, which is a contradiction.

A twice differentiable function f : R → R is convex if and only if f′′(x) ≥ 0 for all x. The following proposition is the generalization to R^N.

Proposition 8.3. Let U be an open convex set and f : U → R be C^2. Then f is convex if and only if the Hessian (matrix of second derivatives)

∇^2 f(x) = [∂^2 f(x) / ∂x_m ∂x_n]

is positive semidefinite.

Proof. Suppose that f is a C^2 function on U. Take any x ≠ y ∈ U. Applying Taylor's theorem to g(t) = f((1 − t)x + ty) for t ∈ [0, 1], there exists α ∈ (0, 1) such that

f(y) − f(x) = g(1) − g(0) = g′(0) + (1/2) g′′(α)
            = 〈∇f(x), y − x〉 + (1/2) 〈y − x, ∇^2 f(x + α(y − x))(y − x)〉.

If f is convex, by Proposition 8.1

(1/2) 〈y − x, ∇^2 f(x + α(y − x))(y − x)〉 = f(y) − f(x) − 〈∇f(x), y − x〉 ≥ 0.

Since y is arbitrary, let y = x + εd with ε > 0. Dividing the above inequality by (1/2)ε^2 > 0 and letting ε → 0, we get

0 ≤ 〈d, ∇^2 f(x + εαd) d〉 → 〈d, ∇^2 f(x) d〉,

so ∇^2 f(x) is positive semidefinite. Conversely, if ∇^2 f(x) is positive semidefinite, then

f(y) − f(x) − 〈∇f(x), y − x〉 = (1/2) 〈y − x, ∇^2 f(x + α(y − x))(y − x)〉 ≥ 0,

so by Proposition 8.1 f is convex.
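A small numerical companion to Proposition 8.3 (numpy assumed): for f(x1, x2) = 1/x1 + 1/x2 on the positive orthant, which reappears in the worked example of Chapter 9, the Hessian eigenvalues are nonnegative at randomly drawn points.

import numpy as np

# Hessian of f(x1, x2) = 1/x1 + 1/x2 is diag(2/x1^3, 2/x2^3) on the positive orthant.
def hessian(x):
    return np.diag([2.0 / x[0]**3, 2.0 / x[1]**3])

rng = np.random.default_rng(3)
ok = True
for _ in range(100):
    x = rng.uniform(0.1, 5.0, size=2)
    ok &= np.all(np.linalg.eigvalsh(hessian(x)) >= -1e-12)  # all eigenvalues >= 0
print(ok)   # True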

8.3 Subgradient of convex functions

Suppose that f is a proper convex function and x ∈ int dom f. Since (x, f(x)) ∈ epi f but (x, f(x) − ε) /∈ epi f for all ε > 0, we have (x, f(x)) /∈ int epi f. Hence by the separating hyperplane theorem, there exists a vector 0 ≠ (η, β) ∈ R^N × R such that

〈η, x〉 + βf(x) ≤ 〈η, y〉 + βz


for any (y, z) ∈ epi f. Letting z → ∞, we get β ≥ 0. Letting z = f(y) we get

β(f(y) − f(x)) ≥ 〈−η, y − x〉

for all y. If β = 0, since x ∈ int dom f, the vector y − x can point in any direction. Then η = 0, which contradicts (η, β) ≠ 0. Therefore β > 0. Letting ξ = −η/β, we get

f(y) − f(x) ≥ 〈ξ, y − x〉 (8.1)

for any y.

A vector ξ that satisfies (8.1) is called a subgradient of f at x ∈ int dom f. The set of all subgradients at x, called the subdifferential, is denoted by

∂f(x) = {ξ ∈ R^N | (∀y) f(y) − f(x) ≥ 〈ξ, y − x〉}.

∂f(x) is a closed convex set (exercise).

If f is differentiable, letting y = x + td with t > 0 and letting t → 0, we get

〈ξ, d〉 ≤ (f(x + td) − f(x)) / t → f′(x; d) = 〈∇f(x), d〉

for all d, so ξ = ∇f(x) and ∂f(x) = {∇f(x)}.

8.4 Application: Pareto efficient allocations

Suppose that there are I consumers indexed by i = 1, . . . , I with continuous utility functions u_i : R^N_+ → R. Let x_i ∈ R^N_+ be the consumption vector (a vector that specifies how much of each good an agent consumes) of agent i. Let e ∈ R^N_+ be the total amount of goods in the economy (called the aggregate endowment). An allocation {x_i}_{i=1}^I ⊂ R^N_+ is feasible if

∑_{i=1}^I x_i ≤ e.

(For N-vectors x = (x1, . . . , xN) and y = (y1, . . . , yN), we write x ≤ y if xn ≤ yn for all n; x < y if x ≤ y and x ≠ y; and x ≪ y if xn < yn for all n.)

An allocation {y_i} is said to Pareto dominate {x_i} if u_i(y_i) ≥ u_i(x_i) for all i and u_i(y_i) > u_i(x_i) for some i. A feasible allocation is said to be Pareto efficient, or just efficient, if it is not Pareto dominated by any other feasible allocation. Intuitively, an allocation is efficient when there is no other feasible allocation that makes every agent at least as well off and some agent strictly better off.

The following proposition shows that an efficient allocation exists.

Proposition 8.4. Let λ ∈ R^I_{++} and define W : R^{NI}_+ → R by

W(x_1, . . . , x_I) = ∑_{i=1}^I λ_i u_i(x_i).

Then (i) W attains its maximum over all feasible allocations, and (ii) an allocation that achieves the maximum of W is Pareto efficient.


Proof. Let F = {(x_1, . . . , x_I) ∈ R^{NI}_+ | ∑_{i=1}^I x_i ≤ e} be the set of feasible allocations. Clearly F is nonempty and compact. Since each u_i is continuous, so is W. Therefore by the Bolzano-Weierstrass theorem W has a maximum.

Suppose that {x_i} attains the maximum of W. If {x_i} is inefficient, there exists a feasible allocation {y_i} that Pareto dominates {x_i}. Then u_i(y_i) ≥ u_i(x_i) for all i and u_i(y_i) > u_i(x_i) for some i. Since λ_i > 0 for all i, we get

W(x_1, . . . , x_I) = ∑_{i=1}^I λ_i u_i(x_i) < ∑_{i=1}^I λ_i u_i(y_i) = W(y_1, . . . , y_I),

which contradicts the maximality of {x_i}. Hence {x_i} is efficient.

Clearly Proposition 8.4 holds under the weaker assumption that each u_i is upper semi-continuous.

When the utility functions u_i are concave, the following partial converse holds (continuity is not necessary).

Proposition 8.5. Suppose that each u_i is concave and the feasible allocation {x̄_i} is efficient. Then there exists λ ∈ R^I_+ such that {x̄_i} is a maximizer of

W(x_1, . . . , x_I) = ∑_{i=1}^I λ_i u_i(x_i).

Proof. Define

V = {v = (v_1, . . . , v_I) ∈ R^I | (∃ {x_i} ∈ F)(∀i) v_i ≤ u_i(x_i)},

the set of utility vectors that are (weakly) dominated by some feasible allocation. Let ū = (u_1(x̄_1), . . . , u_I(x̄_I)). Since ū ∈ V, V is nonempty. Since each u_i is concave, V is convex (exercise). Let D = ū + R^I_{++}, the set of utility vectors that (strictly) dominate {x̄_i}. Then D is nonempty and convex. Since {x̄_i} is efficient, we have V ∩ D = ∅. Therefore by the separating hyperplane theorem, there exists 0 ≠ λ ∈ R^I such that 〈λ, v〉 ≤ 〈λ, ū + d〉 for any v ∈ V and d ∈ R^I_{++}. Letting d_i → ∞, it must be that λ_i ≥ 0, so λ ∈ R^I_+. Letting d → 0, we get 〈λ, v〉 ≤ 〈λ, ū〉, which implies

∑_{i=1}^I λ_i u_i(x_i) ≤ ∑_{i=1}^I λ_i u_i(x̄_i)

for any feasible {x_i}. Hence {x̄_i} maximizes W over all feasible allocations.

Exercises

8.1. 1. Prove that f is convex if and only if

f((1− α)x1 + αx2) ≤ (1− α)f(x1) + αf(x2)

for all x1, x2 ∈ RN and α ∈ [0, 1].


Figure 8.3. Construction of Pareto weights.

2. Prove that f is convex if and only if

f(∑_{k=1}^K α_k x_k) ≤ ∑_{k=1}^K α_k f(x_k)

for all {x_k}_{k=1}^K ⊂ R^N and α_k ≥ 0 such that ∑_{k=1}^K α_k = 1.

8.2. 1. Prove that f : [0, ∞) → (−∞, ∞] defined by f(x) = − log x is strictly convex.

2. Prove the following inequality of arithmetic and geometric means: for any x_1, . . . , x_K ≥ 0 and α_1, . . . , α_K ≥ 0 such that ∑ α_k = 1, we have

∑_{k=1}^K α_k x_k ≥ ∏_{k=1}^K x_k^{α_k},

with equality if and only if x_1 = · · · = x_K.

8.3. 1. Show that if {f_i(x)}_{i=1}^I are convex, so is f(x) = ∑_{i=1}^I α_i f_i(x) for any α_1, . . . , α_I ≥ 0.

2. Show that if {f_i(x)}_{i∈I} are convex, so is f(x) = sup_{i∈I} f_i(x).

3. Suppose that h : R^M → R is increasing (meaning x ≤ y implies h(x) ≤ h(y)) and convex and g_m : R^N → R is convex for m = 1, . . . , M. Prove that f(x) = h(g_1(x), . . . , g_M(x)) is convex.

8.4. Let f : R → R be defined by f(x) = |x|. Compute the subdifferential ∂f(x).

8.5. Let f be a proper convex function and x ∈ int dom f. Prove that ∂f(x) is a closed convex set.

8.6. Define f : R → (−∞, ∞] by

f(x) = { ∞      (x < 0)
       { −√x    (x ≥ 0).

1. Show that f is a proper convex function.

2. Compute the subdifferential ∂f(x) for x > 0. Does ∂f(0) exist?


8.7. u : R^N → R is said to be strongly increasing if x < y (meaning x ≤ y and x ≠ y) implies u(x) < u(y). Show that the conclusion of Proposition 8.4 holds with the weaker condition 0 ≠ λ ∈ R^I_+ if each u_i is strongly increasing.

8.8. Prove that the set V in the proof of Proposition 8.5 is convex.


Chapter 9

Convex Programming

9.1 Convex programming

Consider the minimization problem

minimize f(x) subject to x ∈ C. (9.1)

The minimization problem (9.1) is called a convex programming problem if f(x) is a convex function and C is a convex set. By redefining f such that f(x) = ∞ for x /∈ C, the constrained optimization problem (9.1) becomes the unconstrained optimization problem

minimize f(x).

You should know from calculus that if f is differentiable and x minimizes f, then ∇f(x) = 0, which is called the first-order necessary condition. In general, the first-order condition is not sufficient. For instance, for f(x) = x^3 we have f′(0) = 0, but x = 0 does not minimize f since f(−1) = −1 < 0 = f(0).

For a convex programming problem, however, the first-order necessary condition is also sufficient, as the following proposition shows.

Proposition 9.1. Let f : R^N → (−∞, ∞] be a proper convex function. Then x̄ ∈ int dom f is a solution to (9.1) if and only if 0 ∈ ∂f(x̄). In particular, if f is differentiable, f(x̄) = min f(x) if and only if ∇f(x̄) = 0.

Proof. If f(x̄) = min f(x), then for any x

f(x) − f(x̄) ≥ 0 = 〈0, x − x̄〉,

so 0 ∈ ∂f(x̄). If 0 ∈ ∂f(x̄), then for any x

f(x) − f(x̄) ≥ 〈0, x − x̄〉 = 0,

so f(x̄) = min f(x).

In applications, the constraint set is often given by inequalities and equations.


Consider the constrained minimization problem

minimize f(x)
subject to g_i(x) ≤ 0 (i = 1, . . . , I),
           h_j(x) = 0 (j = 1, . . . , J),
           x ∈ Ω, (9.2)

where f, the g_i's, and the h_j's are functions and Ω ⊂ R^N is a set. g_i(x) ≤ 0 is called an inequality constraint. h_j(x) = 0 is called an equality constraint. The condition x ∈ Ω is called a side constraint.

When f, gi’s are convex, hj ’s are affine (so hj(x) = 〈aj , x〉 − bj for someaj ∈ RN , bj ∈ R for all j), and Ω is convex, then (9.2) is a convex programmingproblem. Without loss of generality, we may assume that aj is linearly inde-pendent, for otherwise either the constraint set is empty or some constraints areredundant. Letting A = (a1, . . . , aJ)′ be an J ×N matrix and b = (b1, . . . , bJ)′,the equality constraints can be compactly written as Ax− b = 0.

To characterize the solution of (9.2), define the Lagrangian by

L(x, λ, µ) = { f(x) + ∑_{i=1}^I λ_i g_i(x) + ∑_{j=1}^J µ_j h_j(x)   (λ ∈ R^I_+)
             { −∞                                                    (λ /∈ R^I_+)

where λ = (λ_1, . . . , λ_I) ∈ R^I and µ = (µ_1, . . . , µ_J) ∈ R^J are called Lagrange multipliers. (Defining L to be −∞ when λ /∈ R^I_+ is useful later when explaining duality.) A point (x̄, λ̄, µ̄) ∈ Ω × R^I × R^J is called a saddle point if it achieves the minimum of L with respect to x and the maximum with respect to (λ, µ). Formally, (x̄, λ̄, µ̄) is a saddle point if

L(x̄, λ, µ) ≤ L(x̄, λ̄, µ̄) ≤ L(x, λ̄, µ̄) (9.3)

for all (x, λ, µ) ∈ Ω × R^{I+J}.

The following theorem gives necessary and sufficient conditions for optimality.

Theorem 9.2 (Saddle Point Theorem). Let Ω ⊂ R^N be a convex set. Suppose that f and the g_i : Ω → (−∞, ∞] are convex and the h_j's are affine in the minimization problem (9.2). Let

(h_1(x), . . . , h_J(x))′ = Ax − b,

where A is a J × N matrix and b ∈ R^J.

1. If (i) x̄ is a solution to the minimization problem (9.2), (ii) there exists x_0 ∈ R^N such that g_i(x_0) < 0 for all i and Ax_0 − b = 0, and (iii) 0 ∈ int(AΩ − b), then there exist Lagrange multipliers λ̄ ∈ R^I_+ and µ̄ ∈ R^J such that (x̄, λ̄, µ̄) is a saddle point of L.

2. If there exist Lagrange multipliers λ̄ ∈ R^I_+ and µ̄ ∈ R^J such that (x̄, λ̄, µ̄) is a saddle point of L, then x̄ is a solution to the minimization problem (9.2).

Remark. Condition (1ii) is called the Slater constraint qualification and will be discussed in more detail in the next chapter. In applications, we often have Ω = R^N. Then condition (1iii) holds automatically since AR^N − b = R^J when


the row vectors of A are linearly independent, which we may assume without loss of generality. Condition (1iii) also holds when there are no equality constraints (J = 0).

Proof of Theorem 9.2.

Necessity (Claim 1). Assume that x̄ ∈ Ω is a solution to (9.2). Define the sets C, D ⊂ R^{1+I+J} by

C = {(u, v, w) ∈ R^{1+I+J} | (∃x ∈ Ω) u ≥ f(x), (∀i) v_i ≥ g_i(x), w = Ax − b},
D = {(u, v, w) ∈ R^{1+I+J} | u < f(x̄), (∀i) v_i < 0, (∀j) w_j = 0}.

(See Figure 9.1.)

Figure 9.1. Saddle point theorem.

Clearly C, D are convex since f, the g_i's, and Ω are convex. Since

(f(x̄), v, Ax̄ − b) ∈ C

for v_i = g_i(x̄), C is nonempty. Letting v_{0i} = g_i(x_0) < 0, v_0 = (v_{01}, . . . , v_{0I}), and u_0 < f(x̄), we have (u_0, v_0, 0) ∈ D, so D is nonempty. If (u, v, w) ∈ C ∩ D, since (u, v, w) ∈ D we have u < f(x̄), v ≪ 0, and w = 0. Then since (u, v, 0) ∈ C there exists x ∈ Ω such that f(x) ≤ u < f(x̄), g_i(x) < 0 for all i, and Ax − b = 0, contradicting the optimality of x̄. Therefore C ∩ D = ∅. By the separating hyperplane theorem, there exists 0 ≠ (α, β, γ) ∈ R × R^I × R^J such that

sup_{(u,v,w)∈D} [αu + 〈β, v〉 + 〈γ, w〉] ≤ inf_{(u,v,w)∈C} [αu + 〈β, v〉 + 〈γ, w〉].

Taking (u, v, w) ∈ D and letting u → −∞, it must be that α ≥ 0. Similarly, letting v_i → −∞, it must be that β_i ≥ 0 for all i.

Let us show that α > 0. For any ε > 0, we have (f(x̄) − ε, −ε1, 0) ∈ D, so

α(f(x̄) − ε) − ε〈β, 1〉 ≤ αf(x) + ∑_{i=1}^I β_i g_i(x) + ∑_{j=1}^J γ_j h_j(x) (9.4)

for any x ∈ Ω. Letting x = x_0 and ε → 0, we get

αf(x̄) ≤ αf(x_0) + ∑_{i=1}^I β_i g_i(x_0).


If α = 0, since by assumption g_i(x_0) < 0, it must be that β_i = 0 for all i. If J = 0 (no equality constraints), then (α, β) = 0, a contradiction, so α > 0. If J > 0, then

sup_{(u,v,w)∈D} 〈γ, w〉 ≤ inf_{(u,v,w)∈C} 〈γ, w〉 ⇐⇒ (∀x ∈ Ω) γ′(Ax − b) ≥ 0.

Since by assumption 0 is an interior point of AΩ − b, for small enough δ > 0 there exists x ∈ Ω with −δγ = Ax − b. Therefore −δ‖γ‖^2 ≥ 0, which implies γ = 0. Then (α, β, γ) = 0, a contradiction. Hence α > 0.

Since α > 0, define λ̄ = β/α and µ̄ = γ/α. Then letting ε → 0 in (9.4) and dividing by α, we get

f(x̄) ≤ f(x) + ∑_{i=1}^I λ̄_i g_i(x) + ∑_{j=1}^J µ̄_j h_j(x) =: L(x, λ̄, µ̄)

for all x ∈ Ω. Since λ̄_i ≥ 0 and g_i(x̄) ≤ 0 for all i and h_j(x̄) = 0 for all j, it follows that

L(x̄, λ̄, µ̄) = f(x̄) + ∑_{i=1}^I λ̄_i g_i(x̄) + ∑_{j=1}^J µ̄_j h_j(x̄) ≤ f(x̄) ≤ f(x) + ∑_{i=1}^I λ̄_i g_i(x) + ∑_{j=1}^J µ̄_j h_j(x) = L(x, λ̄, µ̄),

which is the right inequality of (9.3), and it must be that λ̄_i g_i(x̄) = 0 for all i. It remains to show the left inequality of (9.3). If λ /∈ R^I_+, by the definition of the Lagrangian we have L(x̄, λ, µ) = −∞, so the inequality is trivial. If λ ∈ R^I_+, then since g_i(x̄) ≤ 0 and h_j(x̄) = 0, we obtain

L(x̄, λ, µ) = f(x̄) + ∑_{i=1}^I λ_i g_i(x̄) + ∑_{j=1}^J µ_j h_j(x̄) ≤ f(x̄) = L(x̄, λ̄, µ̄),

which is the left inequality of (9.3).

Sufficiency (Claim 2). Assume that (x̄, λ̄, µ̄) ∈ Ω × R^I_+ × R^J is a saddle point of L. By the left inequality of (9.3), for any λ ∈ R^I_+ and µ ∈ R^J we obtain

f(x̄) + ∑_{i=1}^I λ_i g_i(x̄) + ∑_{j=1}^J µ_j h_j(x̄) ≤ f(x̄) + ∑_{i=1}^I λ̄_i g_i(x̄) + ∑_{j=1}^J µ̄_j h_j(x̄)

=⇒ ∑_{i=1}^I λ_i g_i(x̄) + ∑_{j=1}^J µ_j h_j(x̄) ≤ ∑_{i=1}^I λ̄_i g_i(x̄) + ∑_{j=1}^J µ̄_j h_j(x̄).

Letting µ_j → ±∞, it must be that h_j(x̄) = 0 for all j. Letting λ_i → ∞, we get g_i(x̄) ≤ 0 for all i. Letting λ = 0, we get 0 ≤ ∑_{i=1}^I λ̄_i g_i(x̄), so it must be that λ̄_i g_i(x̄) = 0 for all i. Then by the right inequality of (9.3), for any x ∈ Ω we


obtain

f(x̄) + ∑_{i=1}^I λ̄_i g_i(x̄) + ∑_{j=1}^J µ̄_j h_j(x̄) ≤ f(x) + ∑_{i=1}^I λ̄_i g_i(x) + ∑_{j=1}^J µ̄_j h_j(x)

=⇒ f(x̄) ≤ f(x) + ∑_{i=1}^I λ̄_i g_i(x) + ∑_{j=1}^J µ̄_j h_j(x).

Since λ̄_i ≥ 0, if g_i(x) ≤ 0 and h_j(x) = 0 it follows that f(x̄) ≤ f(x), so x̄ is a solution to the constrained minimization problem (9.2).

The following corollary is useful for computing the solution of a convex programming problem.

Corollary 9.3. Let Ω ⊂ R^N be a convex set, f, g_i : Ω → (−∞, ∞] be convex and differentiable on Ω, and h_j be affine.

1. If (i) x̄ is a solution to the minimization problem (9.2), (ii) there exists x_0 ∈ R^N such that g_i(x_0) < 0 for all i and Ax_0 − b = 0, and (iii) 0 ∈ int(AΩ − b), then there exist Lagrange multipliers λ̄ ∈ R^I_+ and µ̄ ∈ R^J such that

∇f(x̄) + ∑_{i=1}^I λ̄_i ∇g_i(x̄) + ∑_{j=1}^J µ̄_j ∇h_j(x̄) = 0, (9.5a)
(∀i) λ̄_i ≥ 0, g_i(x̄) ≤ 0, λ̄_i g_i(x̄) = 0, (9.5b)
(∀j) h_j(x̄) = 0. (9.5c)

2. If there exist Lagrange multipliers λ̄ ∈ R^I_+ and µ̄ ∈ R^J such that (9.5) holds, then x̄ is a solution to the minimization problem (9.2).

Proof. If x̄ is a solution, then by the saddle point theorem and its proof there exist Lagrange multipliers such that (9.5b) and (9.5c) hold and x̄ minimizes L(x, λ̄, µ̄) over x. Then by Proposition 9.1, (9.5a) holds.

If (9.5) holds, then by condition (9.5a) and Proposition 9.1, x̄ minimizes L(x, λ̄, µ̄) over x. By conditions (9.5b) and (9.5c), we can show that x̄ solves (9.2) by imitating the sufficiency proof of Theorem 9.2.

Condition (9.5a) is called the first-order condition. Condition (9.5b) is called the complementary slackness condition.

By Corollary 9.3, we can solve a constrained minimization problem (9.2) as follows.

Step 1. Verify that the functions f and the g_i's are convex, the h_j's are affine, and the constraint set Ω is convex.

Step 2. Verify the Slater condition (there exists x_0 ∈ Ω such that g_i(x_0) < 0 for all i and h_j(x_0) = 0 for all j) and 0 ∈ int(AΩ − b).

Step 3. Form the Lagrangian

L(x, λ, µ) = f(x) + ∑_{i=1}^I λ_i g_i(x) + ∑_{j=1}^J µ_j h_j(x).


Derive the first-order condition and the complementary slackness condition (9.5).

Step 4. Solve (9.5). If there is a solution x̄, it is a solution to the minimization problem (9.2). Otherwise, there is no solution.

As an example, consider the problem

minimize 1/x_1 + 1/x_2
subject to x_1 + x_2 ≤ 2,
           x_1, x_2 > 0.

Let us solve this problem step by step.

Step 1. Let f(x_1, x_2) = 1/x_1 + 1/x_2 be the objective function. Since

(1/x)′′ = (−x^{−2})′ = 2x^{−3} > 0,

the Hessian of f,

∇^2 f(x_1, x_2) = [2x_1^{−3}, 0; 0, 2x_2^{−3}],

is positive definite. Therefore f is convex. Let g(x_1, x_2) = x_1 + x_2 − 2. Since g is affine, it is convex. Clearly the set

Ω = R^2_{++} = {(x_1, x_2) | x_1, x_2 > 0}

is convex.

Step 2. For (x_1, x_2) = (1/2, 1/2) we have g(x_1, x_2) = −1 < 0, so the Slater condition holds.

Step 3. Let

L(x_1, x_2, λ) = 1/x_1 + 1/x_2 + λ(x_1 + x_2 − 2)

be the Lagrangian. The first-order condition is

0 = ∂L/∂x_1 = −1/x_1^2 + λ ⇐⇒ x_1 = 1/√λ,
0 = ∂L/∂x_2 = −1/x_2^2 + λ ⇐⇒ x_2 = 1/√λ.

The complementary slackness condition is

λ(x_1 + x_2 − 2) = 0.

Step 4. From these equations it must be that λ > 0 and

x_1 + x_2 − 2 = 0 ⇐⇒ 2/√λ − 2 = 0 ⇐⇒ λ = 1,

and x_1 = x_2 = 1/√λ = 1. Therefore (x_1, x_2) = (1, 1) is the (only) solution.
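As a cross-check of this example (purely illustrative, not part of the analytical argument; scipy assumed, using scipy.optimize.minimize):

import numpy as np
from scipy.optimize import minimize

# Numerical cross-check: minimize 1/x1 + 1/x2 subject to x1 + x2 <= 2, x1, x2 > 0.
# The analytical solution derived above is (1, 1).
f = lambda x: 1.0 / x[0] + 1.0 / x[1]
cons = ({'type': 'ineq', 'fun': lambda x: 2.0 - x[0] - x[1]},)   # 2 - x1 - x2 >= 0
bnds = ((1e-6, None), (1e-6, None))                              # keep x1, x2 > 0

res = minimize(f, x0=[0.5, 1.2], bounds=bnds, constraints=cons)
print(res.x)          # approximately [1, 1]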


9.2 Constrained maximization

Next, we briefly discuss maximization. Although maximization is equivalent to minimization by flipping the sign of the objective function, doing so every time is awkward. So consider the maximization problem

maximize f(x)
subject to g_i(x) ≥ 0 (i = 1, . . . , I),
           h_j(x) = 0 (j = 1, . . . , J),
           x ∈ Ω, (9.6)

where f, gi’s are concave and hj ’s are affine. (9.6) is equivalent to the minimiza-tion problem

minimize − f(x)

subject to − gi(x) ≤ 0 (i = 1, . . . , I),

− hj(x) = 0 (j = 1, . . . , J),

x ∈ Ω,

where −f,−gi’s are convex and −hj ’s are affine. Assuming that the saddle pointtheorem applies, the necessary and sufficient conditions for optimality are

−∇f(x)−I∑i=1

λi∇gi(x)−J∑j=1

µj∇hj(x) = 0, (9.7a)

(∀i) λi(−gi(x)) = 0, (9.7b)

(∀j) − hj(x) = 0. (9.7c)

But (9.7) is equivalent to (9.5). For this reason, it is customary to formulate a maximization problem as in (9.6) so that the inequality constraints are always "greater than or equal to zero".

As an example, consider a consumer with utility function

u(x) = α log x1 + (1− α) log x2,

where 0 < α < 1 and x_1, x_2 ≥ 0 are the consumption of goods 1 and 2 (such a function is called Cobb-Douglas). Let p_1, p_2 > 0 be the prices of the goods and w > 0 be the wealth.

Then the consumer’s utility maximization problem (UMP) is

maximize α log x1 + (1− α) log x2

subject to x1 ≥ 0, x2 ≥ 0, p1x1 + p2x2 ≤ w.

Step 1. Clearly the objective function is concave and the constraints are linear (hence concave). The point (x_1, x_2) = (ε, ε) strictly satisfies the inequalities for small enough ε > 0, so the Slater condition holds. Therefore we can apply the saddle point theorem.


Step 2. Let

L(x, λ, µ) = α log x_1 + (1 − α) log x_2 + λ(w − p_1 x_1 − p_2 x_2) + µ_1 x_1 + µ_2 x_2,

where λ ≥ 0 is the Lagrange multiplier corresponding to the budget constraint

p_1 x_1 + p_2 x_2 ≤ w ⇐⇒ w − p_1 x_1 − p_2 x_2 ≥ 0

and µ_n is the Lagrange multiplier corresponding to x_n ≥ 0 for n = 1, 2. By the first-order condition, we get

0 = ∂L/∂x_1 = α/x_1 − λp_1 + µ_1,
0 = ∂L/∂x_2 = (1 − α)/x_2 − λp_2 + µ_2.

By the complementary slackness condition, we have λ(w − p_1 x_1 − p_2 x_2) = 0, µ_1 x_1 = 0, µ_2 x_2 = 0.

Step 3. Since log 0 = −∞, x_1 = 0 or x_2 = 0 cannot be an optimal solution. Hence x_1, x_2 > 0, so by complementary slackness we get µ_1 = µ_2 = 0. Then by the first-order condition we get x_1 = α/(λp_1) and x_2 = (1 − α)/(λp_2), so λ > 0. Substituting these into the budget constraint p_1 x_1 + p_2 x_2 = w, we get

α/λ + (1 − α)/λ = w ⇐⇒ λ = 1/w,

so the solution is

(x_1, x_2) = (αw/p_1, (1 − α)w/p_2).
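A quick numerical check of the Cobb-Douglas demand formula (illustrative parameter values; numpy assumed): the bundle (αw/p_1, (1 − α)w/p_2) attains at least the utility of many random bundles that exhaust the budget.

import numpy as np

# Check the demand formula x1 = a*w/p1, x2 = (1-a)*w/p2 for illustrative parameters.
a, p1, p2, w = 0.3, 2.0, 5.0, 10.0
u = lambda x1, x2: a * np.log(x1) + (1 - a) * np.log(x2)

x_star = (a * w / p1, (1 - a) * w / p2)

# Compare against many random bundles that exhaust the budget.
rng = np.random.default_rng(4)
x1 = rng.uniform(1e-3, w / p1 - 1e-3, size=10000)
x2 = (w - p1 * x1) / p2
print(u(*x_star) + 1e-9 >= u(x1, x2).max())    # True: the formula attains the maximum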

In the next two sections, we apply constrained minimization and maximization to problems in finance: portfolio selection and asset pricing.

9.3 Portfolio selection

9.3.1 The problem

Suppose you live in a world with two periods, t = 0, 1. There are J assets indexed by j = 1, . . . , J. Asset j trades at price P_j per share and pays off X_j per share at t = 1, where X_j is a random variable. You have some wealth w_0 to invest at t = 0; let w_1 be your wealth at t = 1. Assume that (as most of you do) you like money but dislike risk. So suppose you want to minimize the variance Var[w_1] while keeping the expected wealth E[w_1] at some value w.

This is the classic portfolio selection problem studied by Harry Markowitz1 in his dissertation at the University of Chicago, which was eventually published in the Journal of Finance (Markowitz, 1952) and won him the Nobel Prize in 1990.

1Harry Markowitz is currently Professor of Finance at Rady School of Management, UCSD.


9.3.2 Mathematical formulation

To formulate the problem mathematically, let n_j be the number of shares you buy. (If n_j > 0, you buy asset j; if n_j < 0, you shortsell asset j. I assume that shortselling is allowed.) Then the problem is

minimize Var[∑_{j=1}^J X_j n_j]
subject to E[∑_{j=1}^J X_j n_j] = w,
           ∑_{j=1}^J P_j n_j = w_0.

Let R_j = X_j/P_j be the gross return of asset j and let θ_j = P_j n_j / w_0 be the fraction of wealth invested in asset j. Since by definition ∑_{j=1}^J θ_j = 1 and

X_j n_j = (X_j / P_j)(P_j n_j / w_0) w_0 = w_0 R_j θ_j,

the problem is equivalent to

minimize Var[R(θ)]
subject to E[R(θ)] = µ, ∑_j θ_j = 1,

where θ = (θ_1, . . . , θ_J), R(θ) = ∑_j R_j θ_j, and µ = w/w_0.

9.3.3 Solution

Let µj = E[Rj ] be the expected return of asset j and Σ = (σij) be the variance-covariance matrix of (R1, . . . , RJ), where

σij = Cov[Ri, Rj ] = E[(Ri − µi)(Rj − µj)].

Then the problem can be stated as

minimize (1/2)〈θ, Σθ〉
subject to 〈µ, θ〉 = µ, 〈1, θ〉 = 1,

where µ = (µ_1, . . . , µ_J) is the vector of expected returns and 1 = (1, . . . , 1) is the vector of ones. (The 1/2 is there to make the first derivative nice.) Assume that Σ is positive definite. Then the objective function is strictly convex. The Lagrangian is

L(θ, λ_1, λ_2) = (1/2)〈θ, Σθ〉 + λ_1(µ − 〈µ, θ〉) + λ_2(1 − 〈1, θ〉).

The first-order condition is

Σθ − λ_1 µ − λ_2 1 = 0 ⇐⇒ θ = λ_1 Σ^{−1} µ + λ_2 Σ^{−1} 1.


The first interesting observation is that the optimal portfolio θ is spanned by two vectors, Σ^{−1}µ and Σ^{−1}1 (mutual fund theorem). Substituting θ into the constraints, we obtain

λ_1 〈µ, Σ^{−1}µ〉 + λ_2 〈µ, Σ^{−1}1〉 = µ,
λ_1 〈1, Σ^{−1}µ〉 + λ_2 〈1, Σ^{−1}1〉 = 1.

Noting that 〈µ, Σ^{−1}1〉 = 〈1, Σ^{−1}µ〉 because Σ is symmetric, and letting

[a, b; b, c] = [〈µ, Σ^{−1}µ〉, 〈µ, Σ^{−1}1〉; 〈1, Σ^{−1}µ〉, 〈1, Σ^{−1}1〉]^{−1},

we get

[λ_1; λ_2] = [a, b; b, c] [µ; 1].

Using the first-order condition and the constraints, the minimum variance is

σ^2 = 〈θ, Σθ〉 = 〈θ, λ_1 µ + λ_2 1〉 = λ_1 µ + λ_2
    = [µ, 1] [a, b; b, c] [µ; 1] = aµ^2 + 2bµ + c
    = a(µ + b/a)^2 + c − b^2/a = a(µ − m)^2 + s^2.

Using

(1/(ac − b^2)) [c, −b; −b, a] = [〈µ, Σ^{−1}µ〉, 〈µ, Σ^{−1}1〉; 〈1, Σ^{−1}µ〉, 〈1, Σ^{−1}1〉],

we can compute

m = −b/a = 〈1, Σ^{−1}µ〉 / 〈1, Σ^{−1}1〉,
s^2 = (ac − b^2)/a = 1 / 〈1, Σ^{−1}1〉.

The relationship between the target return µ and the standard deviation σ,

σ^2 = a(µ − m)^2 + s^2 ⇐⇒ µ = m ± (1/√a) √(σ^2 − s^2),

is a hyperbola (Figure 9.2). A portfolio satisfying this relation is called a mean-variance efficient portfolio.
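A minimal numerical sketch of the mean-variance solution (hypothetical µ, Σ, and target return; numpy assumed), verifying the constraints and the variance formula σ^2 = aµ^2 + 2bµ + c:

import numpy as np

# Hypothetical expected returns mu, covariance matrix Sigma, and target return mu_bar.
mu = np.array([1.05, 1.10, 1.20])
Sigma = np.array([[0.05, 0.01, 0.00],
                  [0.01, 0.08, 0.02],
                  [0.00, 0.02, 0.20]])
one = np.ones(3)
mu_bar = 1.12

Si_mu, Si_1 = np.linalg.solve(Sigma, mu), np.linalg.solve(Sigma, one)
M = np.array([[mu @ Si_mu, mu @ Si_1],
              [one @ Si_mu, one @ Si_1]])
Minv = np.linalg.inv(M)                               # Minv = [a, b; b, c]
a, b, c = Minv[0, 0], Minv[0, 1], Minv[1, 1]
lam = Minv @ np.array([mu_bar, 1.0])                  # (lambda_1, lambda_2)
theta = lam[0] * Si_mu + lam[1] * Si_1                # optimal portfolio

print(theta @ mu, theta @ one)                        # constraints: mu_bar and 1
print(theta @ Sigma @ theta, a * mu_bar**2 + 2 * b * mu_bar + c)   # equal variances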

9.4 Capital asset pricing model (CAPM)

9.4.1 The model

Now consider an economy consisting of I agents indexed by i = 1, . . . , I. Let w_i > 0 be the initial wealth of agent i. Instead of minimizing the variance of the portfolio return subject to a target expected portfolio return, suppose that agent i wants to maximize

v_i(θ) = E[R(θ)] − (1/(2τ_i)) Var[R(θ)],


Figure 9.2. Mean-variance efficient frontier.

where R(θ) is the portfolio return and τ_i > 0 is the "risk tolerance". Assume that in addition to the J risky assets, agents can trade a risk-free asset in zero net supply, where the risk-free rate R_f is determined in equilibrium. Letting θ_j be the fraction of wealth invested in asset j, the fraction of wealth invested in the risk-free asset is 1 − ∑_j θ_j. Therefore the portfolio return is

R(θ) = ∑_j R_j θ_j + R_f (1 − ∑_j θ_j) = R_f + ∑_j (R_j − R_f) θ_j.

The expected return and variance of the portfolio are

E[R(θ)] = R_f + 〈µ − R_f 1, θ〉, Var[R(θ)] = 〈θ, Σθ〉,

respectively. Thus the optimal portfolio problem of agent i reduces to

maximize R_f + 〈µ − R_f 1, θ〉 − (1/(2τ_i)) 〈θ, Σθ〉,

where θ ∈ R^J is unconstrained. The first-order condition is

µ − R_f 1 − (1/τ_i) Σθ = 0 ⇐⇒ θ_i = τ_i Σ^{−1}(µ − R_f 1), (9.8)

where the subscript i indicates that it refers to agent i.

9.4.2 Equilibrium

Mathematics ends and economics starts here. Since every asset must be held by someone and the risk-free asset is in zero net supply by definition, the average portfolio weighted by individual wealth, ∑_i w_i θ_i / ∑_i w_i, must be the market portfolio (value-weighted average portfolio), denoted by θ_m. Letting τ = ∑_{i=1}^I w_i τ_i / ∑_{i=1}^I w_i be the "average risk tolerance", it follows from the first-order condition (9.8) that

θ_m = τ Σ^{−1}(µ − R_f 1). (9.9)


Comparing (9.8) and (9.9), we obtain

θ_i = (τ_i/τ) θ_m. (9.10)

This means that you should hold risky assets in the same proportions as in the market portfolio, where the fraction invested in the market portfolio is τ_i/τ and the fraction invested in the risk-free asset is 1 − τ_i/τ. Thus the mutual fund theorem holds with the market portfolio and the risk-free asset. This strong implication had an enormous impact on investment practice, and led to the creation of the first index fund in 1975 by Vanguard.2

9.4.3 Asset pricing

Since by definition the market portfolio invests only in risky assets, we have

1 = 〈1, θ_m〉 = τ 〈1, Σ^{−1}(µ − R_f 1)〉 ⇐⇒ R_f = (〈1, Σ^{−1}µ〉 − 1/τ) / 〈1, Σ^{−1}1〉,

which determines the risk-free rate. Since the market portfolio does not invest in the risk-free asset, it must lie on the hyperbola corresponding to no risk-free asset (Figure 9.4).

Letting R_m = ∑_j R_j θ_{m,j} be the market return, we have

Cov[R_m, R_i] = Cov[R_i, ∑_j R_j θ_{m,j}] = (Σθ_m)_i.

Hence multiplying both sides of (9.9) by Σ and taking the i-th element, we obtain

Cov[R_m, R_i] = τ(E[R_i] − R_f). (9.11)

Multiplying both sides of (9.11) by θ_{m,i} and summing over i, we obtain

Var[R_m] = Cov[R_m, R_m] = τ(E[R_m] − R_f). (9.12)

Dividing (9.11) by (9.12), we obtain the covariance pricing formula

E[R_i] − R_f = (Cov[R_m, R_i] / Var[R_m]) (E[R_m] − R_f). (9.13)

Thus, the excess return of an asset, E[R_i] − R_f, is proportional to the covariance of the asset with the market, Cov[R_m, R_i]. The quantity

β_i = Cov[R_m, R_i] / Var[R_m]

is called the beta of the asset. By definition, the market beta is β_m = 1. Beta measures the market risk of an asset.

Rewriting (9.13), we obtain

E[Ri] = Rf + βi(E[Rm]−Rf ).

2See http://en.wikipedia.org/wiki/Index_fund.


The theoretical linear relationship between β_i and E[R_i] is called the security market line (SML) (Figure 9.3). An asset above (below) the security market line, that is, one with

E[R_i] > (<) R_f + β_i(E[R_m] − R_f),

is undervalued (overvalued) because its expected return is higher (lower) than predicted.
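The covariance pricing formula (9.13) can be verified numerically from the model's primitives. A minimal sketch (hypothetical µ, Σ, τ; numpy assumed) constructs R_f and θ_m from (9.9) and checks (9.13) asset by asset:

import numpy as np

# Hypothetical primitives: expected returns mu, covariance Sigma, average risk tolerance tau.
mu = np.array([1.06, 1.10, 1.15])
Sigma = np.array([[0.04, 0.01, 0.01],
                  [0.01, 0.09, 0.03],
                  [0.01, 0.03, 0.16]])
tau = 2.5
one = np.ones(3)

Si_mu, Si_1 = np.linalg.solve(Sigma, mu), np.linalg.solve(Sigma, one)
Rf = (one @ Si_mu - 1.0 / tau) / (one @ Si_1)            # risk-free rate
theta_m = tau * np.linalg.solve(Sigma, mu - Rf * one)    # market portfolio (9.9)

cov_m = Sigma @ theta_m                                  # Cov[Rm, Ri] for each i
var_m = theta_m @ Sigma @ theta_m                        # Var[Rm]
beta = cov_m / var_m
ERm = mu @ theta_m                                       # E[Rm]

print(np.allclose(mu - Rf, beta * (ERm - Rf)))           # True: formula (9.13) holds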

Figure 9.3. Security market line.

Since both sides of (9.13) are linear in R_i, (9.13) also holds for any linear combination (hence portfolio) of assets. In particular, letting R_i be the optimal portfolio return of agent i, R(θ_i), it follows from (9.10) that

σ_{θ_i} = √Var[R(θ_i)] = (τ_i/τ) √Var[R_m],
Cov[R_m, R(θ_i)] = (τ_i/τ) Var[R_m].

Substituting these equations into (9.13), eliminating τ_i/τ, and letting σ_m = √Var[R_m], we get

E[R(θ_i)] − R_f = ((E[R_m] − R_f) / σ_m) σ_{θ_i}.

This linear relationship between the standard deviation of the optimal portfolio, σ_{θ_i}, and the excess return E[R(θ_i)] − R_f is called the capital market line, denoted by CML in Figure 9.4. Agents that are relatively risk tolerant (τ_i > τ) borrow and choose a portfolio on the capital market line to the right of the market portfolio. Agents that are relatively risk averse (τ_i < τ) lend and choose a portfolio on the capital market line to the left of the market portfolio. The slope,

(E[R_m] − R_f) / σ_m,

is called the Sharpe ratio, named after William Sharpe who invented it (Sharpe, 1964). The capital market line is tangent to the efficient frontier with no risk-


free asset at the market portfolio. Clearly the market portfolio (and any optimal portfolio) attains the maximum Sharpe ratio.

Figure 9.4. Capital market line.

Exercises

9.1. Let f : RN → (−∞,∞] be convex.

1. Show that the set of solutions to minx∈RN f(x) is a convex set.

2. If f is strictly convex, show that the solution (if it exists) is unique.

9.2. Let f(x, y) = x^2 + xy + y^2.

1. Show that f is strictly convex.

2. Solve

minimize x^2 + xy + y^2
subject to x + y ≥ 2.

9.3. Let A be an N × N symmetric positive definite matrix. Let B be an N × M matrix with M ≤ N such that the M column vectors of B are linearly independent. Let c ∈ R^M be a vector. Solve

minimize (1/2)〈x, Ax〉
subject to B′x = c.

9.4. Solve

minimize x + y
subject to x^2 + xy + y^2 ≤ 1.


9.5. Solve

minimize 〈b, x〉
subject to 〈x, Ax〉 ≤ r^2,

where 0 ≠ b ∈ R^N, A is an N × N symmetric positive definite matrix, and r > 0.

9.6. Consider a consumer with utility function

u(x) = ∑_{n=1}^N α_n log x_n,

where α_n > 0, ∑_{n=1}^N α_n = 1, and x_n ≥ 0 is the consumption of good n. Let p = (p_1, . . . , p_N) be a price vector with p_n > 0 and w > 0 be the wealth.

1. Formulate the consumer’s utility maximization problem.

2. Prove that a solution exists. (Hint: u is upper semi-continuous.)

3. Compute the solution.

9.7. Solve the same problem as above for the case

u(x) = ∑_{n=1}^N α_n x_n^{1−σ} / (1 − σ),

where σ > 0.


Chapter 10

Nonlinear Programming

10.1 The problem and the solution concept

We are interested in solving a general constrained optimization problem

minimize f(x) subject to x ∈ C, (10.1)

where f is the objective function and C ⊂ R^N is the constraint set. Such an optimization problem is called a linear programming problem when both the objective function and the constraint are linear. Otherwise, it is called a nonlinear programming problem. If both the objective function and the constraint happen to be convex, it is called a convex programming problem. Thus

Linear Programming ⊂ Convex Programming ⊂ Nonlinear Programming.

We focus on minimization because maximizing f is the same as minimizing −f. If f is continuous and C is compact, by the Bolzano-Weierstrass theorem we know there is a solution, but the theorem does not tell you how to compute it. In this section you will learn how to derive necessary conditions for optimality for general nonlinear programming problems. Oftentimes, the necessary conditions alone will pin down the solution to a few candidates, so you only need to compare these candidates.

x̄ ∈ C is said to be a global solution if f(x) ≥ f(x̄) for all x ∈ C. x̄ is a local solution if there exists ε > 0 such that f(x) ≥ f(x̄) whenever x ∈ C and ‖x − x̄‖ < ε. If x̄ is a global solution, clearly it is also a local solution.

10.2 Tangent cone and normal cone

10.2.1 Cone and its dual

A set C ⊂ RN is said to be a cone if it contains a ray originating from 0 andpassing through any point of C. Formerly, C is a cone if x ∈ C and α ≥ 0implies αx ∈ C. An example of a cone is the positive orthant

RN+ = { x = (x1, . . . , xN) ∈ RN | (∀n) xn ≥ 0 }.


Another example is the set

{ x = ∑_{k=1}^K αk ak | (∀k) αk ≥ 0 },

where a1, . . . , aK are vectors. The latter set is called the cone generated by the vectors a1, . . . , aK, and is denoted by cone[a1, . . . , aK] (Figure 10.1).

Figure 10.1. Cone generated by vectors a1, a2.

Clearly RN+ = cone[e1, . . . , eN], where e1, . . . , eN are the unit vectors of RN. A cone generated by vectors is also called a polyhedral cone because it is a polyhedron. A polyhedral cone is a closed convex cone (easy).

Let C ⊂ RN be any nonempty set. The set

C∗ = { y ∈ RN | (∀x ∈ C) 〈x, y〉 ≤ 0 }

is called the dual cone of C. Thus the dual cone C∗ consists of all vectors that make an obtuse angle with any vector in C (Figure 10.2).

Figure 10.2. Cone and its dual.

Proposition 10.1. Let ∅ ≠ C ⊂ D. Then (i) the dual cone C∗ is a nonempty, closed, convex cone, (ii) C∗ = (co C)∗, and (iii) C∗ ⊃ D∗.


Proof. C∗ is nonempty since 0 ∈ C∗. Since the half space

H_x^− := { y ∈ RN | 〈x, y〉 ≤ 0 }

is a closed convex cone (easy), so is C∗ = ⋂_{x∈C} H_x^−. If y ∈ D∗, then 〈x, y〉 ≤ 0 for all x ∈ D. Since C ⊂ D, we have 〈x, y〉 ≤ 0 for all x ∈ C. Therefore y ∈ C∗, which proves D∗ ⊂ C∗. Finally, since C ⊂ co C, we have C∗ ⊃ (co C)∗. To prove the reverse inclusion, take any x ∈ co C. By Lemma 7.1, there exists a convex combination x = ∑_{k=1}^K αk xk such that xk ∈ C for all k. If y ∈ C∗, it follows that

〈x, y〉 = 〈∑ αk xk, y〉 = ∑ αk 〈xk, y〉 ≤ 0,

so y ∈ (co C)∗. Therefore C∗ ⊂ (co C)∗.

Proposition 10.2. Let C ⊂ RN be a nonempty cone. Then C∗∗ = cl coC.(C∗∗ is the dual cone of C∗, so it is the dual cone of the dual cone of C.)

Proof. Let x ∈ co C. For any y ∈ C∗ = (co C)∗ we have 〈x, y〉 ≤ 0. This implies x ∈ C∗∗. Hence co C ⊂ C∗∗. Since the dual cone is closed, we have cl co C ⊂ cl C∗∗ = C∗∗.

To show the reverse inclusion, suppose that x ∉ cl co C. Then by Proposition 7.4 there exists a separating hyperplane such that

〈a, x〉 > c > 〈a, z〉

for any z ∈ co C. In particular, c > 〈a, z〉 for all z ∈ C. Since C is a cone, it must be 〈a, z〉 ≤ 0 for all z, for otherwise if there is z0 ∈ C with 〈a, z0〉 > 0, for any β > 0 we have βz0 ∈ C, but for large enough β we have 〈a, βz0〉 = β〈a, z0〉 > c, a contradiction. This shows that a ∈ C∗. Again since C is a cone, we have 0 ∈ C, so c > 〈a, 0〉 = 0. Since 〈a, x〉 > c > 0 and a ∈ C∗, it follows that x ∉ C∗∗. Therefore C∗∗ ⊂ cl co C.

The following corollary will play an important role in optimization theory.

Corollary 10.3 (Farkas). Let C = cone[a1, . . . , aK] be a polyhedral cone generated by a1, . . . , aK. Let D = { y ∈ RN | (∀k) 〈ak, y〉 ≤ 0 }. Then D = C∗ and C = D∗ (Figure 10.3).

Proof. Let y ∈ D. For any x ∈ C, we can take αk ≥ 0 such that x = ∑_k αk ak. Then

〈x, y〉 = ∑_k αk 〈ak, y〉 ≤ 0,

so y ∈ C∗. Conversely, let y ∈ C∗. Since ak ∈ C, we get 〈ak, y〉 ≤ 0 for all k, so y ∈ D. Therefore D = C∗.

Since C is a closed convex cone, by Propositions 10.2 and 10.1 (iii) we get

C = cl coC = C∗∗ = (C∗)∗ = D∗.


Figure 10.3. Farkas' lemma: C = cone[a1, a2] = D∗ and D = C∗.

10.2.2 Tangent cone and normal cone

Let x ∈ C be any point. The tangent cone of C at x is defined by

TC(x) = { y ∈ RN | (∃ αk ≥ 0, {xk} ⊂ C) xk → x and y = lim_{k→∞} αk(xk − x) }.

That is, y ∈ TC(x) if y points in the same direction as the limiting direction of xk − x. Intuitively, the tangent cone of C at x consists of all directions that can be approximated by directions from x to other points in C.

Lemma 10.4. TC(x) is a nonempty closed cone.

Proof. Setting αk = 0 for all k we get 0 ∈ TC(x). If y ∈ TC(x), then y = lim αk(xk − x) for some αk ≥ 0 and {xk} ⊂ C. Then for β ≥ 0 we have βy = lim βαk(xk − x) ∈ TC(x), so TC(x) is a cone. To show that TC(x) is closed, let {yl} ⊂ TC(x) and yl → y. For each l we can take a sequence such that αk,l ≥ 0, lim_{k→∞} xk,l = x, and yl = lim_{k→∞} αk,l(xk,l − x). Hence we can take kl such that ‖xkl,l − x‖ < 1/l and ‖yl − αkl,l(xkl,l − x)‖ < 1/l. Then xkl,l → x and

‖y − αkl,l(xkl,l − x)‖ ≤ ‖y − yl‖+ ‖yl − αkl,l(xkl,l − x)‖ → 0,

so y ∈ TC(x).

The dual cone of TC(x) is called the normal cone at x and is denoted byNC(x). By the definition of the dual cone, we have

NC(x) = (TC(x))∗ = { z ∈ RN | (∀y ∈ TC(x)) 〈y, z〉 ≤ 0 }.

The following theorem is fundamental for constrained optimization.

Theorem 10.5. If f is differentiable and x is a local solution of the problem

minimize f(x) subject to x ∈ C,

then −∇f(x) ∈ NC(x).


Proof. By the definition of the normal cone, it suffices to show that

〈−∇f(x), y〉 ≤ 0 ⇐⇒ 〈∇f(x), y〉 ≥ 0

for all y ∈ TC(x). Let y ∈ TC(x) and take a sequence such that αk ≥ 0, xk → x, and αk(xk − x) → y. Since x is a local solution, for sufficiently large k we have f(xk) ≥ f(x). Since f is differentiable, we have

0 ≤ f(xk)− f(x) = 〈∇f(x), xk − x〉+ o(‖xk − x‖).1

Multiplying both sides by αk ≥ 0 and letting k →∞, we get

0 ≤ 〈∇f(x), αk(xk − x)〉 + ‖αk(xk − x)‖ · o(‖xk − x‖)/‖xk − x‖ → 〈∇f(x), y〉 + ‖y‖ · 0 = 〈∇f(x), y〉.

The geometrical interpretation of Theorem 10.5 is the following. By the previous lecture, −∇f(x) is the direction towards which f decreases fastest around the point x. The tangent cone TC(x) consists of directions towards which one can move from x without violating the constraint x ∈ C. Hence in order for x to be a local minimum, −∇f(x) must make an obtuse angle with any vector in the tangent cone, for otherwise f can be decreased further. This is the same as −∇f(x) belonging to the normal cone.
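This condition is easy to check numerically on simple constraint sets. The following minimal sketch (not part of the notes; the objective, the constraint set C = {(x, y) | x ≤ 0, y ≤ 0}, and the solver choice are illustrative assumptions) minimizes a smooth function over the negative orthant and verifies that −∇f at the solution lies in the normal cone, which at the origin is the positive orthant.

# Minimal numerical illustration of Theorem 10.5 (a sketch under stated assumptions):
# minimize f(x, y) = (x - 1)^2 + (y - 1)^2 over C = {(x, y) : x <= 0, y <= 0}.
# The solution is the origin, and -grad f(0, 0) = (2, 2) lies in NC(0) = R^2_+.
import numpy as np
from scipy.optimize import minimize

f = lambda z: (z[0] - 1.0) ** 2 + (z[1] - 1.0) ** 2
grad = lambda z: np.array([2.0 * (z[0] - 1.0), 2.0 * (z[1] - 1.0)])

res = minimize(f, x0=[-1.0, -1.0], jac=grad,
               bounds=[(None, 0.0), (None, 0.0)])  # C = {x <= 0, y <= 0}
print(res.x)          # approximately (0, 0)
print(-grad(res.x))   # approximately (2, 2): nonnegative entries, i.e. it lies
                      # in the normal cone R^2_+ at the origin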

10.3 Karush-Kuhn-Tucker theorem

Theorem 10.5 is very general. Usually, we are interested in the cases where theconstraint set C is given parametrically. Consider the minimization problem

minimize f(x)

subject to gi(x) ≤ 0 (i = 1, . . . , I)

hj(x) = 0 (j = 1, . . . , J). (10.2)

This problem is a special case of problem (10.1) by setting

C = { x ∈ RN | (∀i) gi(x) ≤ 0, (∀j) hj(x) = 0 }.

gi(x) ≤ 0 is called an inequality constraint. hj(x) = 0 is an equality constraint. Let x ∈ C be a local solution. To study the shape of C around x, we define as follows. The set of indices for which the inequality constraints are binding,

I(x) = { i | gi(x) = 0 },

is called the active set. Assume that gi’s and hj ’s are differentiable. The set

LC(x) = { y ∈ RN | (∀i ∈ I(x)) 〈∇gi(x), y〉 ≤ 0, (∀j) 〈∇hj(x), y〉 = 0 }

is called the linearizing cone of the constraints gi’s and hj ’s. The reason whyLC(x) is called the linearizing cone is the following. Since

gi(x+ ty)− gi(x) = t 〈∇gi(x), y〉+ o(t),

1o(h) represents any quantity q(h) such that q(h)/h→ 0 as h→ 0.


the point x + ty almost satisfies the constraint gi ≤ 0 if gi(x) = 0 (i is an active constraint) and 〈∇gi(x), y〉 ≤ 0. The same holds for the hj's. Thus y ∈ LC(x) implies that from x we can move slightly towards the direction of y and still (approximately) satisfy the constraints. Thus we can expect that the linearizing cone is approximately equal to the tangent cone. The following proposition makes this statement precise.

Proposition 10.6. Suppose that x ∈ C. Then coTC(x) ⊂ LC(x).

Proof. Since LC(x) is a closed convex cone, it suffices to prove TC(x) ⊂ LC(x). Let y ∈ TC(x). Take a sequence αk ≥ 0 and C ∋ xk → x such that αk(xk − x) → y. Since gi(x) = 0 for i ∈ I(x) and gi is differentiable, we get

0 ≥ gi(xk) = gi(xk)− gi(x) = 〈∇gi(x), xk − x〉+ o(‖xk − x‖).

Multiplying both sides by αk ≥ 0 and letting k →∞, we get

0 ≥ 〈∇gi(x), αk(xk − x)〉 + ‖αk(xk − x)‖ · o(‖xk − x‖)/‖xk − x‖ → 〈∇gi(x), y〉 + ‖y‖ · 0 = 〈∇gi(x), y〉.

Similar argument applies to hj . Hence y ∈ LC(x).

Note that the tangent cone is directly defined by the constraint set C, whereas the linearizing cone is defined through the functions that define the set C. Therefore different parametrizations may lead to different linearizing cones (exercise).

The main result in static optimization is the following.

Theorem 10.7 (Karush-Kuhn-Tucker²). Suppose that f, gi, hj are differentiable and x is a local solution to the minimization problem (10.2). If LC(x) ⊂ co TC(x), then there exist vectors (called Lagrange multipliers) λ ∈ RI+ and µ ∈ RJ such that

∇f(x) + ∑_{i=1}^I λi ∇gi(x) + ∑_{j=1}^J µj ∇hj(x) = 0, (10.3a)

(∀i) λi gi(x) = 0. (10.3b)

Proof. By Theorem 10.5, −∇f(x) ∈ NC(x) = (TC(x))∗. By Proposition 10.6 and the assumption LC(x) ⊂ co TC(x), we get LC(x) = co TC(x). Hence by the property of dual cones, we get (TC(x))∗ = (co TC(x))∗ = (LC(x))∗. Now let

K be the polyhedral cone generated by {∇gi(x)}_{i∈I(x)} and {±∇hj(x)}_{j=1}^J. By Farkas's lemma, K∗ is equal to

{ y ∈ RN | (∀i ∈ I(x)) 〈∇gi(x), y〉 ≤ 0, (∀j) 〈±∇hj(x), y〉 ≤ 0 },

²A version of this theorem appeared in the 1939 Master's thesis of William Karush (1917-1997) but did not get much attention at a time when applied mathematics was looked down upon. The theorem became widely known after its rediscovery by Harold Kuhn and Albert Tucker (1905-1995) in a conference paper in 1951. However, the paper (http://projecteuclid.org/download/pdf_1/euclid.bsmsp/1200500249) does not cite Karush.


which is precisely the linearizing cone LC(x). Again by Farkas's lemma, we have (LC(x))∗ = K. Therefore −∇f(x) ∈ K, so there exist numbers λi ≥ 0 (i ∈ I(x)) and αj, βj ≥ 0 such that

−∇f(x) = ∑_{i∈I(x)} λi ∇gi(x) + ∑_{j=1}^J (αj − βj) ∇hj(x).

Letting λi = 0 for i ∉ I(x) and µj = αj − βj, we get (10.3a). Finally, (10.3b) holds for i ∈ I(x) since gi(x) = 0. It also holds for i ∉ I(x) since we defined λi = 0 for such i.

Define the Lagrangian of the minimization problem (10.2) by

L(x, λ, µ) = f(x) + ∑_{i=1}^I λi gi(x) + ∑_{j=1}^J µj hj(x).

Then (10.3a) implies that the derivative of L(·, λ, µ) at x is zero. (10.3a) is called the first-order condition. (10.3b) is called the complementary slackness condition. (10.3a) and (10.3b) are jointly called the Karush-Kuhn-Tucker (KKT) conditions.
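The KKT conditions can be verified numerically once a candidate solution is computed. The following is a hedged sketch (the problem data and the use of scipy's SLSQP solver are illustrative assumptions, not the notes' method): it solves a small instance of (10.2) with two inequality constraints, recovers the multipliers of the active constraints by least squares from (10.3a), and checks (10.3a) and (10.3b).

# Sketch: minimize (x1-2)^2 + (x2-1)^2 subject to x1 + x2 <= 1 and x1 >= 0,
# then verify the KKT conditions at the computed solution.
import numpy as np
from scipy.optimize import minimize

f = lambda x: (x[0] - 2.0) ** 2 + (x[1] - 1.0) ** 2
gradf = lambda x: np.array([2.0 * (x[0] - 2.0), 2.0 * (x[1] - 1.0)])
g = [lambda x: x[0] + x[1] - 1.0, lambda x: -x[0]]   # constraints g_i(x) <= 0
gradg = [np.array([1.0, 1.0]), np.array([-1.0, 0.0])]

# SLSQP expects constraints of the form c(x) >= 0, so pass -g_i.
cons = [{"type": "ineq", "fun": (lambda x, gi=gi: -gi(x))} for gi in g]
res = minimize(f, x0=[0.0, 0.0], jac=gradf, constraints=cons, method="SLSQP")
x = res.x                                            # approximately (1, 0)

# Recover multipliers of the active constraints via least squares in (10.3a).
active = [i for i in range(2) if abs(g[i](x)) < 1e-6]
G = np.column_stack([gradg[i] for i in active])      # columns: grad g_i(x)
lam_active, *_ = np.linalg.lstsq(G, -gradf(x), rcond=None)
lam = np.zeros(2); lam[active] = lam_active
print(x, lam)                                              # ~ (1, 0), lam ~ (2, 0)
print(gradf(x) + sum(lam[i] * gradg[i] for i in range(2)))  # ~ 0, condition (10.3a)
print([lam[i] * g[i](x) for i in range(2)])                 # ~ 0, condition (10.3b)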

10.4 Constraint qualifications

Conditions of the form LC(x) ⊂ co TC(x) in Theorem 10.7 are called constraint qualifications (CQ). These are conditions that ensure the KKT conditions hold at a local solution. There are many constraint qualifications in the literature:

Guignard (GCQ) LC(x) ⊂ co TC(x).

Abadie (ACQ) LC(x) ⊂ TC(x).

Mangasarian-Fromovitz (MFCQ) {∇hj(x)}_{j=1}^J are linearly independent, and there exists y ∈ RN such that 〈∇gi(x), y〉 < 0 for all i ∈ I(x) and 〈∇hj(x), y〉 = 0 for all j.

Slater (SCQ) gi's are convex, hj's are affine, and there exists x0 ∈ RN such that gi(x0) < 0 for all i and hj(x0) = 0 for all j.

Linear independence (LICQ) {∇gi(x)}_{i∈I(x)} and {∇hj(x)}_{j=1}^J are linearly independent.

Theorem 10.8. The following is true for constraint qualifications.

LICQ or SCQ =⇒ MFCQ =⇒ ACQ =⇒ GCQ.

Proof. ACQ =⇒ GCQ is trivial. The rest is not so simple, so I omit.

In applications, oftentimes constraints are linear. In that case GCQ is automatically satisfied, so you don't need to check it (exercise). It is known that the GCQ is the weakest possible condition (Gould and Tolle, 1971).


10.5 Sufficient condition

The Karush-Kuhn-Tucker theorem provides necessary conditions for optimality: if the constraint qualification holds, then a local solution must satisfy the Karush-Kuhn-Tucker conditions (first-order conditions and complementary slackness conditions). Note that the KKT conditions are equivalent to

∇xL(x, λ, µ) = 0, (10.4)

where L(x, λ, µ) is the Lagrangian. (10.4) is the first-order necessary conditionof the unconstrained minimization problem

min_{x∈RN} L(x, λ, µ). (10.5)

Below I give a sufficient condition for optimality.

Proposition 10.9. Suppose that x̄ is a solution to the unconstrained minimization problem (10.5) for some λ ∈ RI+ and µ ∈ RJ. If gi(x̄) ≤ 0 and λi gi(x̄) = 0 for all i and hj(x̄) = 0 for all j, then x̄ is a solution to the constrained minimization problem (10.1).

Proof. Take any x such that gi(x) ≤ 0 for all i and hj(x) = 0 for all j. Then

f(x̄) = f(x̄) + ∑_{i=1}^I λi gi(x̄) + ∑_{j=1}^J µj hj(x̄)
= L(x̄, λ, µ) ≤ L(x, λ, µ)
= f(x) + ∑_{i=1}^I λi gi(x) + ∑_{j=1}^J µj hj(x) ≤ f(x).

The first line is due to λi gi(x̄) = 0 for all i and hj(x̄) = 0 for all j. The second line is the assumption that x̄ minimizes L(·, λ, µ). The third line is due to λi ≥ 0 and gi(x) ≤ 0 for all i and hj(x) = 0 for all j.

Corollary 10.10. If f, gi are all convex, then the KKT conditions are sufficientfor optimality.

When f, gi are not convex, but only quasi-convex, we can still give a sufficientcondition for optimality.

Theorem 10.11. Consider the constrained optimization problem

minimize f(x)

subject to gi(x) ≤ 0 (i = 1, . . . , I),

where f and the gi's are differentiable and quasi-convex. Suppose that the Slater constraint qualification holds and that x̄ and λ satisfy the KKT conditions. If ∇f(x̄) ≠ 0 then x̄ is a solution.

Proof. First, let us show that 〈∇f(x̄), x − x̄〉 ≥ 0 for all feasible x. Taking the inner product of the first-order condition

∇f(x̄) + ∑_{i=1}^I λi ∇gi(x̄) = 0

with x − x̄, we obtain

〈∇f(x̄), x − x̄〉 = −∑_{i=1}^I λi 〈∇gi(x̄), x − x̄〉.

Therefore it suffices to show λi 〈∇gi(x̄), x − x̄〉 ≤ 0 for all i. If gi(x̄) < 0, by complementary slackness we have λi = 0, so there is nothing to prove. If gi(x̄) = 0, since x is feasible, we have gi(x) ≤ 0 = gi(x̄). Hence by Proposition 8.2, we have 〈∇gi(x̄), x − x̄〉 ≤ 0. Since λi ≥ 0, we have λi 〈∇gi(x̄), x − x̄〉 ≤ 0.

Next, let us show that there exists a feasible point x1 such that

〈∇f(x̄), x1 − x̄〉 > 0.

Consider the point x0 in the Slater condition. By the previous result we have 〈∇f(x̄), x0 − x̄〉 ≥ 0. Since ∇f(x̄) ≠ 0, letting x1 = x0 + ε∇f(x̄) with ε > 0 sufficiently small, by the Slater condition x1 is feasible and

〈∇f(x̄), x1 − x̄〉 ≥ ε ‖∇f(x̄)‖² > 0.

Finally, take any feasible x and x1 as above. Since 〈∇f(x̄), x − x̄〉 ≥ 0, for any 0 < t < 1 we have

〈∇f(x̄), (1 − t)x + t x1 − x̄〉 = 〈∇f(x̄), (1 − t)(x − x̄) + t(x1 − x̄)〉 = (1 − t)〈∇f(x̄), x − x̄〉 + t〈∇f(x̄), x1 − x̄〉 > 0.

Since f is quasi-convex, by Proposition 8.2 this inequality implies that

f((1 − t)x + t x1) > f(x̄).

Letting t → 0, we get f(x) ≥ f(x̄), so x̄ is a solution.

Next I give a second-order sufficient condition that will be useful later. Consider the minimization problem (10.2). Assume that the KKT conditions (10.3) hold at x̄ with corresponding Lagrange multipliers λ ∈ RI+ and µ ∈ RJ. Remember that the active set of the inequality constraints is

I(x̄) = { i | gi(x̄) = 0 }.

Let

Ī(x̄) = { i | λi > 0 }

be the set of constraints whose Lagrange multiplier is positive. Since λi gi(x̄) = 0 by complementary slackness, λi > 0 implies gi(x̄) = 0, so necessarily Ī(x̄) ⊂ I(x̄). Define the cone

L̄C(x̄) = { y ∈ RN | (∀i ∈ I(x̄)\Ī(x̄)) 〈∇gi(x̄), y〉 ≤ 0, (∀i ∈ Ī(x̄)) 〈∇gi(x̄), y〉 = 0, (∀j) 〈∇hj(x̄), y〉 = 0 }.

Clearly L̄C(x̄) ⊂ LC(x̄). The following theorem gives a second-order sufficient condition for local optimality.


Theorem 10.12. Suppose that f, the gi's, and the hj's are twice differentiable, the KKT conditions (10.3) hold at x̄, and

〈y, ∇²ₓL(x̄, λ, µ) y〉 > 0

for all 0 ≠ y ∈ L̄C(x̄). Then x̄ is a strict local solution of the minimization problem (10.2), i.e., there exists a neighborhood Ω of x̄ such that f(x̄) < f(x) whenever x̄ ≠ x ∈ Ω satisfies the constraints in (10.2).

Proof. Suppose that x̄ is not a strict local solution. Then we can take a sequence C ∋ xk → x̄ with xk ≠ x̄ such that f(xk) ≤ f(x̄). Let αk = 1/‖xk − x̄‖ > 0. Then ‖αk(xk − x̄)‖ = 1, so by taking a subsequence if necessary we may assume αk(xk − x̄) → y with ‖y‖ = 1. Let us show that y ∈ L̄C(x̄).

Multiplying both sides of

f(xk) − f(x̄) ≤ 0, gi(xk) − gi(x̄) ≤ 0 (i ∈ I(x̄)), hj(xk) − hj(x̄) = 0

by αk and letting k → ∞, we get

〈∇f(x̄), y〉 ≤ 0, 〈∇gi(x̄), y〉 ≤ 0 (i ∈ I(x̄)), 〈∇hj(x̄), y〉 = 0. (10.6)

Taking the inner product of the first-order condition (10.3a) with y, noting that λi = 0 if i ∉ I(x̄) by complementary slackness, and using (10.6), we get

〈∇f(x̄), y〉 + ∑_{i∈I(x̄)} λi 〈∇gi(x̄), y〉 = 0.

Again by (10.6) it must be 〈∇f(x̄), y〉 = 0 and λi 〈∇gi(x̄), y〉 = 0 for all i ∈ I(x̄). Therefore if i ∈ Ī(x̄), so λi > 0, it must be 〈∇gi(x̄), y〉 = 0. Hence by definition we have y ∈ L̄C(x̄).

Since f(xk) ≤ f(x̄), λi ≥ 0, gi(xk) ≤ 0, and λi gi(x̄) = 0, it follows that

L(xk, λ, µ) = f(xk) + ∑_{i∈I(x̄)} λi gi(xk) ≤ f(x̄) = L(x̄, λ, µ).

By Taylor’s theorem

0 ≥ L(xk, λ, µ) − L(x̄, λ, µ) = 〈∇ₓL(x̄, λ, µ), xk − x̄〉 + (1/2)〈xk − x̄, ∇²ₓL(x̄, λ, µ)(xk − x̄)〉 + o(‖xk − x̄‖²).

By the KKT conditions, the first term on the right-hand side is zero. Multiplying both sides by αk² and letting k → ∞, we get

0 ≥ (1/2)〈y, ∇²ₓL(x̄, λ, µ) y〉,

which is a contradiction. Therefore x̄ is a strict local solution.

Exercises

10.1. Suppose that x ∈ intC (interior point of C).


1. Compute the tangent cone TC(x) and the normal cone NC(x).

2. Interpret Theorem 10.5.

10.2. Let x ∈ R and consider the constraints (i) x ≤ 0 and (ii) x3 ≤ 0.

1. Show that the constraints (i) and (ii) are equivalent, and compute thetangent cone at x = 0.

2. Compute the linearizing cones corresponding to constraints (i) and (ii) atx = 0, respectively. Are they the same?

10.3. Suppose that gi(x) = 〈ai, x〉 − ci and hj(x) = 〈bj , x〉 − dj in the min-imization problem (10.2). Show that the Guignard constraint qualification issatisfied.

10.4. Let a, b, c > 0 be numbers such that a + b + c = 1. Prove that

a²/(b + c) + b²/(c + a) + c²/(a + b) ≥ 1/2.


Chapter 11

Sensitivity Analysis

11.1 Motivation

11.1.1 Example

Suppose you are managing a firm that produces a final product by using laborand raw materials. The production function is

y = A l^α x^{1−α},

where y is the quantity of output, A > 0 is a productivity parameter, l > 0 islabor input, x > 0 is the input of raw materials, and 0 < α < 1 is a parameter.

Assume that you cannot hire or fire workers in the short run and therefore you see labor input l as constant, but can choose the input of raw materials x freely. The wage rate is w > 0 and the unit price of the raw material is p > 0. The unit price of the final product is normalized to 1. You are interested in two questions:

1. What is the optimal amount of input of raw materials?

2. What would happen to the firm’s profit if parameter values change?

We can answer the first question by solving the optimization problem. We canalso answer the second question once we have solved the optimization problem,but the topic of this lecture (sensitivity analysis) is how to (partly) answer thesecond question without solving the optimization problem.

11.1.2 Solution

Mathematically, the problem is

maximize A l^α x^{1−α} − wl − px
subject to l fixed, x ≥ 0.

It is not hard to see that the objective function is concave in x. Clearly theconstraint is linear. Therefore the KKT conditions are necessary and sufficientfor optimality. The Lagrangian is

L(x, λ) = A l^α x^{1−α} − wl − px + λx.


The KKT conditions are

A l^α (1 − α) x^{−α} − p + λ = 0, (11.1a)

λ ≥ 0, x ≥ 0, λx = 0. (11.1b)

(11.1a) is the first-order condition. (11.1b) is the complementary slackness con-dition.

If x = 0, then the first-order condition (11.1a) will be ∞ − p + λ = 0, a contradiction. Therefore it must be x > 0. By the complementary slackness condition (11.1b), we get λ = 0. Substituting this into (11.1a) and solving for x, we get

A l^α (1 − α) x^{−α} − p = 0 ⇐⇒ x = (A(1 − α)/p)^{1/α} l. (11.2)

Substituting this into the objective function, after some algebra the maximized profit is

π(α, A, p, w, l) := α(1 − α)^{1/α−1} A^{1/α} p^{1−1/α} l − wl.

Regarding the second question, suppose that we are interested in how the maximized profit π changes when the price of the raw materials p changes. Then we compute

∂π/∂p = α(1 − 1/α)(1 − α)^{1/α−1} A^{1/α} p^{−1/α} l = −(A(1 − α)/p)^{1/α} l.

This is simply the negative of the optimal input of raw materials computed in (11.2). If we had partially differentiated the profit function

A l^α x^{1−α} − wl − px

with respect to p, we would get the same answer −x (evaluated at the optimal solution, though)! Is this a coincidence? The answer is no. In the next section I will explain how to carry out sensitivity analysis.
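This can also be checked numerically. The following is a minimal sketch (not from the notes; the parameter values A, α, w, l are illustrative assumptions) that maximizes the profit numerically and compares a finite-difference derivative of the maximized profit with −x from (11.2).

# Sketch: verify numerically that d(pi)/dp equals minus the optimal input x.
import numpy as np
from scipy.optimize import minimize_scalar

A, alpha, w, l = 2.0, 0.5, 1.0, 3.0

def max_profit(p):
    # maximize A l^alpha x^(1-alpha) - w l - p x over x >= 0
    obj = lambda x: -(A * l**alpha * x**(1 - alpha) - w * l - p * x)
    res = minimize_scalar(obj, bounds=(1e-10, 1e6), method="bounded")
    return -res.fun

p = 0.7
x_star = (A * (1 - alpha) / p) ** (1 / alpha) * l           # formula (11.2)
h = 1e-5
dpi_dp = (max_profit(p + h) - max_profit(p - h)) / (2 * h)  # central difference
print(dpi_dp, -x_star)   # the two numbers agree up to numerical error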

11.2 Sensitivity analysis

11.2.1 The problem

Consider the parametric minimization problem

minimize f(x, u)

subject to gi(x, u) ≤ 0 (i = 1, . . . , I). (11.3)

Here x ∈ RN is the control variable and u ∈ Rp is a parameter. Let

φ(u) = inf_x { f(x, u) | (∀i) gi(x, u) ≤ 0 }

be the minimum value function. We are interested in how φ(u) changes when u does.


11.2.2 Implicit function theorem

Before stating the main result I need the following theorem.

Theorem 11.1 (Implicit Function Theorem). Let U ⊂ Rm, V ⊂ Rn be open sets and f : U × V → Rn be a C¹ (i.e., continuously differentiable) function. Let (x̄, ȳ) ∈ U × V be such that

1. f(x̄, ȳ) = 0,

2. Dyf(x̄, ȳ), the n × n matrix with (i, j) element ∂fi(x̄, ȳ)/∂yj, is regular.

Then there exists an open set Ū ⊂ U containing x̄ and a C¹ function g : Ū → V such that ȳ = g(x̄), f(x, g(x)) = 0 for all x ∈ Ū, and

Dxg(x̄) = −[Dyf(x̄, ȳ)]⁻¹ Dxf(x̄, ȳ). (11.4)

The proof can be found in most calculus textbooks, so I omit it.¹ It is more important to understand the meaning of the theorem and to be able to reproduce (11.4) by yourself (I myself can never remember (11.4), so I reproduce it whenever I need to use the theorem). The theorem roughly says that if f(x̄, ȳ) = 0 and the Jacobian matrix of f with respect to y is regular ("the slope is nonzero"), then the equation in y

f(x, y) = 0

can be solved locally as y = g(x), and we can compute the partial derivativesof g using those of f . Once we know that there is a C1 function g such thatf(x, y) = 0 with y = g(x), it is not hard to derive (11.4). Start from

f(x, g(x)) = 0.

Partially differentiating both sides with respect to x and using the chain rule,we get

Dxf(x, g(x)) +Dyf(x, g(x))Dxg(x) = 0.

Substituting x = x̄, using ȳ = g(x̄), and multiplying both sides by [Dyf(x̄, ȳ)]⁻¹, we get (11.4).

For example, suppose that

f(x, y) = Ax+By,

where A is an n×m matrix and B is an n× n regular matrix. Clearly

Ax+By = 0 ⇐⇒ y = −B−1Ax,

so g(x) = −B−1Ax and Dxg(x) = −B−1A. Note that

Dxf(x, y) = A, Dyf(x, y) = B,

so in fact (11.4) holds.

1See, for example, http://arxiv.org/pdf/1212.2066.pdf.
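The same recipe works numerically for nonlinear equations. Below is a hedged sketch (the function f(x, y) = x² + y³ − 2 and the root-finding approach are illustrative choices, not from the notes) comparing a finite-difference derivative of the implicitly defined g with the formula −[Dyf]⁻¹Dxf from (11.4).

# Sketch: f(x, y) = x^2 + y^3 - 2 with f(1, 1) = 0 and D_y f(1, 1) = 3 != 0,
# so y = g(x) locally and g'(1) = -D_x f / D_y f = -2/3.
import numpy as np
from scipy.optimize import brentq

def g(x):
    # solve x^2 + y^3 - 2 = 0 for y near 1 by Brent's method
    return brentq(lambda y: x**2 + y**3 - 2.0, 0.1, 2.0)

h = 1e-6
numerical = (g(1.0 + h) - g(1.0 - h)) / (2 * h)   # finite-difference g'(1)
formula = -(2 * 1.0) / (3 * 1.0**2)               # -D_x f / D_y f at (1, 1)
print(numerical, formula)                         # both approximately -0.6667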


11.2.3 Main result

Remember the second-order sufficient condition for optimality from LectureNote 04. To recapitulate, the active set of the inequality constraints is

I(x) = { i | gi(x) = 0 }.

Let

Ī(x) = { i | λi > 0 }

be the set of constraints whose Lagrange multiplier is positive. Since λi gi(x) = 0 by complementary slackness, λi > 0 implies gi(x) = 0, so necessarily Ī(x) ⊂ I(x). Define the cone

L̄C(x) = { y ∈ RN | (∀i ∈ I(x)\Ī(x)) 〈∇gi(x), y〉 ≤ 0, (∀i ∈ Ī(x)) 〈∇gi(x), y〉 = 0 }.

Then the second-order sufficient condition for local optimality is

〈y, ∇²ₓL(x, λ, µ) y〉 > 0 (∀ 0 ≠ y ∈ L̄C(x)). (11.5)

Theorem 11.2 (Sensitivity Analysis). Suppose that x̄ ∈ RN is a local solution of the parametric optimization problem (11.3) corresponding to ū ∈ Rp. Assume that

1. The vectors {∇ₓgi(x̄, ū)}_{i∈I(x̄)} are linearly independent, so the Karush-Kuhn-Tucker theorem holds with Lagrange multiplier λ̄ ∈ RI+,

2. The strict complementary slackness condition holds, so gi(x̄, ū) = 0 implies λ̄i > 0 and therefore Ī(x̄) = I(x̄),

3. The second-order condition (11.5) holds.

Then there exists a neighborhood U of ū and C¹ functions x(u), λ(u) such that for any u ∈ U, x(u) is the local solution of the parametric optimization problem (11.3) and λ(u) is the corresponding Lagrange multiplier.

In our case, since strict complementary slackness holds and there are no equality constraints (hj's), condition (11.5) reduces to

y ≠ 0, (∀i ∈ I(x̄)) 〈∇ₓgi(x̄, ū), y〉 = 0 =⇒ 〈y, ∇²ₓL(x̄, λ̄, ū) y〉 > 0. (11.6)

We need the following lemma in order to prove the theorem.

Lemma 11.3. Let everything be as in Theorem 11.2. Define the (N + I) × (N + I) matrix A by

A =
[ ∇²ₓL        ∇ₓg1   · · ·   ∇ₓgI ]
[ λ̄1 ∇ₓg1′     g1              0  ]
[    ⋮                  ⋱         ]
[ λ̄I ∇ₓgI′     0              gI  ],

where all functions are evaluated at (x̄, λ̄, ū). Then A is regular.


Proof. Suppose that A(v, w) = 0, where v ∈ RN and w ∈ RI. It suffices to show

that v = 0 and w = 0. By the definition of A, we get

∇²ₓL v + ∑_{i=1}^I wi ∇ₓgi = 0, (11.7a)

(∀i) λ̄i 〈∇ₓgi, v〉 + wi gi = 0. (11.7b)

For i ∈ I(x̄) (hence gi(x̄, ū) = 0), by (11.7b) and strict complementary slackness we have λ̄i > 0 and therefore 〈∇ₓgi, v〉 = 0. For i ∉ I(x̄) (hence gi(x̄, ū) < 0), again by (11.7b) and strict complementary slackness we have λ̄i = 0 and therefore wi = 0. Therefore (11.7a) becomes

∇²ₓL v + ∑_{i∈I(x̄)} wi ∇ₓgi = 0. (11.8)

Taking the inner product of (11.8) with v and using 〈∇ₓgi, v〉 = 0 for i ∈ I(x̄), we obtain

〈v, ∇²ₓL v〉 = 0.

By condition (11.6), it must be v = 0. Then by (11.8) we obtain

∑_{i∈I(x̄)} wi ∇ₓgi = 0,

but since {∇ₓgi}_{i∈I(x̄)} are linearly independent, it must be wi = 0 for all i. Therefore v = 0 and w = 0.

Proof of Theorem 11.2. Define f : RN × RI × Rp → RN × RI by

f(x, λ, u) = (∇ₓL(x, λ, u), λ1 g1(x, u), . . . , λI gI(x, u)).

Then the Jacobian matrix of f with respect to (x, λ) evaluated at (x̄, λ̄, ū) is A, which is regular. Furthermore, since x̄ is a local solution corresponding to u = ū, by the KKT theorem we have f(x̄, λ̄, ū) = 0. Therefore by the Implicit Function Theorem, there exists a neighborhood U of ū and C¹ functions x(u), λ(u) such that

f(x(u), λ(u), u) = 0

for u ∈ U . By the second-order sufficient condition (see Lecture Note 04), x(u)is the local solution of the parametric optimization problem (11.3) and λ(u) isthe corresponding Lagrange multiplier.

The following theorem is extremely important.

Theorem 11.4 (Envelope Theorem). Let everything be as in Theorem 11.2 and

L(x, λ, u) = f(x, u) + ∑_{i=1}^I λi gi(x, u)


be the Lagrangian. Assume that the parametric optimization problem (11.3) hasa solution x(u) with Lagrange multiplier λ(u). Let

φ(u) = min_x { f(x, u) | (∀i) gi(x, u) ≤ 0 }

be the minimum value function. Then φ is differentiable and

∇φ(u) = ∇uL(x(u), λ(u), u),

i.e., the derivative of φ can be computed by differentiating the Lagrangian withrespect to the parameter u alone, treating x and λ as constants.

Proof. By the definition of φ and complementary slackness, we obtain

φ(u) = f(x(u), u) = L(x(u), λ(u), u).

Differentiating both sides with respect to u, we get

Duφ(u) = DxL · Dux(u) + DλL · Duλ(u) + DuL,

where Duφ(u) and DuL are 1 × p, DxL is 1 × N, Dux(u) is N × p, DλL is 1 × I, and Duλ(u) is I × p.

By the KKT theorem, we have DxL = 0. By strict complementary slackness, we have λi(u) = 0 for i ∉ I(x) and

DλiL(x(u), λ(u), u) = gi(x(u), u) = 0

for i ∈ I(x), so DλL · Duλ(u) = 0. Therefore ∇φ(u) = ∇uL(x(u), λ(u), u).

Corollary 11.5. Consider the special case

minimize f(x)

subject to gi(x) ≤ ui (i = 1, . . . , I).

Then ∇φ(u) = −λ(u).

Proof. The Lagrangian is

L(x, λ, u) = f(x) + ∑_{i=1}^I λi [gi(x) − ui].

By the envelope theorem, ∇φ(u) = ∇uL(x(u), λ(u), u) = −λ(u).
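Corollary 11.5 is easy to verify on a one-dimensional toy problem. The sketch below (illustrative, not part of the notes) minimizes (x − 3)² subject to x ≤ u; for u < 3 the solution is x(u) = u with multiplier λ(u) = 2(3 − u), and a finite difference of the value function reproduces ∇φ(u) = −λ(u).

# Sketch: check phi'(u) = -lambda(u) for min (x - 3)^2 s.t. x <= u.
import numpy as np

def phi(u):
    x = min(u, 3.0)              # solution of the constrained problem
    return (x - 3.0) ** 2

u = 1.5
lam = 2.0 * (3.0 - u)            # Lagrange multiplier at the solution (u < 3)
h = 1e-6
dphi = (phi(u + h) - phi(u - h)) / (2 * h)
print(dphi, -lam)                # both approximately -3.0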

Exercises

11.1. Consider the utility maximization problem (UMP)

maximize α log x1 + (1− α) log x2

subject to p1x1 + p2x2 ≤ w,

where 0 < α < 1 is a parameter, p1, p2 > 0 are prices, and w > 0 is wealth.

1. Solve UMP.


2. Let v(p1, p2, w) be the value function. Compute the partial derivatives ofv with respect to each of p1, p2, and w.

3. Verify Roy's identity xn = −(∂v/∂pn)/(∂v/∂w) for n = 1, 2, where xn is the optimal demand of good n.

11.2. Consider the UMP

maximize u(x)

subject to x ∈ RL+, 〈p, x〉 ≤ w,

where w > 0 is the wealth of the consumer, p ∈ RL++ is the price vector, x ∈ RL+ is the demand, and u : RL+ → R is differentiable and strictly quasi-concave. Let x(p, w) be the solution of the UMP (called the Marshallian demand) and v(p, w) the value function. If x(p, w) ≫ 0, prove Roy's identity

x(p, w) = −∇pv(p, w)/∇wv(p, w).

(Hint: envelope theorem.)

11.3. Consider the expenditure minimization problem (EMP)

minimize p1x1 + p2x2

subject to α log x1 + (1− α) log x2 ≥ u

where 0 < α < 1 is a parameter, p1, p2 > 0 are prices, and u ∈ R is the desiredutility level.

1. Solve EMP.

2. Let e(p1, p2, u) be the value function. Compute the partial derivatives ofe with respect to p1 and p2.

3. Verify Shephard's lemma xn = ∂e/∂pn for n = 1, 2, where xn is the optimal demand of good n.

11.4. Consider the EMP

minimize 〈p, x〉
subject to u(x) ≥ ū,

where p ∈ RL++ is the price vector, x ∈ RL is the demand, u(x) is a strictly quasi-concave differentiable utility function, and ū ∈ R is the desired utility level. Let h(p, ū) be the solution of the EMP (called the Hicksian demand) and e(p, ū) be the minimum expenditure (called the expenditure function). Prove Shephard's lemma

h(p, ū) = ∇pe(p, ū).

11.5. Prove the following Slutsky equation:

Dpx(p, w) = D²ₚe(p, u) − [Dwx(p, w)][x(p, w)]′,

(here Dpx(p, w) and D²ₚe(p, u) are L × L, Dwx(p, w) is L × 1, and [x(p, w)]′ is 1 × L)

where x(p, w) is the Marshallian demand, u = u(x(p, w)) is the utility level evaluated at the demand, and e(p, u) is the expenditure function.


Chapter 12

Duality Theory

12.1 Motivation

Consider the following constrained minimization problem

minimize f(x)

subject to gi(x) ≤ 0 (i = 1, . . . , I) (12.1)

with Lagrangian

L(x, λ) =
  f(x) + ∑_{i=1}^I λi gi(x), (λ ∈ RI+)
  −∞. (λ ∉ RI+)

(The following discussion can easily accommodate equality constraints as well.) In an earlier lecture, I proved the saddle point theorem: if f and the gi's are all convex and there is a point x0 ∈ RN such that gi(x0) < 0 for all i (Slater constraint qualification), then x̄ is a solution to (12.1) if and only if there is λ̄ ∈ RI+ such that (x̄, λ̄) is a saddle point of L, that is,

L(x̄, λ) ≤ L(x̄, λ̄) ≤ L(x, λ̄) (12.2)

for all x ∈ RN and λ ∈ RI . By (12.2), we get

sup_λ L(x̄, λ) ≤ L(x̄, λ̄) ≤ inf_x L(x, λ̄).

Therefore

inf_x sup_λ L(x, λ) ≤ L(x̄, λ̄) ≤ sup_λ inf_x L(x, λ). (12.3)

On the other hand, L(x, λ) ≤ sup_λ L(x, λ) always, so taking the infimum with respect to x, we get

inf_x L(x, λ) ≤ inf_x sup_λ L(x, λ).

Noting that the right-hand side is just a constant, taking the supremum of the left-hand side with respect to λ, we get

sup_λ inf_x L(x, λ) ≤ inf_x sup_λ L(x, λ). (12.4)


Combining (12.3) and (12.4), it follows that

L(x̄, λ̄) = inf_x sup_λ L(x, λ) = sup_λ inf_x L(x, λ). (12.5)

Define

θ(x) = supλL(x, λ),

ω(λ) = infxL(x, λ).

Then (12.5) is equivalent to

L(x̄, λ̄) = inf_x θ(x) = sup_λ ω(λ). (12.6)

Note that by the definition of the Lagrangian,

θ(x) = supλL(x, λ) =

f(x), (∀i, gi(x) ≤ 0)

∞. (∃i, gi(x) > 0)

Therefore the constrained minimization problem (12.1) is equivalent to the un-constrained minimization problem

minimize θ(x). (P)

For this reason the problem (P) is called the primal problem. In view of (12.6),define the dual problem by

maximize ω(λ). (D)

Then (12.6) implies that the primal and dual values coincide.

The above discussion suggests that in order to solve the constrained minimization problem (12.1), and hence the primal problem (P), it might be sufficient to solve the dual problem (D). Since L(x, λ) is linear in λ, ω(λ) = inf_x L(x, λ) is always a concave function of λ no matter what f or the gi's are. Therefore we can expect that solving the dual problem is much easier than solving the primal problem.

12.2 Example

12.2.1 Linear programming

A typical linear programming problem is

minimize 〈c, x〉
subject to Ax ≥ b,

where x ∈ RN is the vector of decision variables, c ∈ RN is the vector of thecoefficients, A is an M ×N matrix, and b ∈ RM is a vector. The Lagrangian is

L(x, λ) = 〈c, x〉+ 〈λ, b−Ax〉 ,


where λ ∈ RM+ is the vector of Lagrange multipliers. Since

ω(λ) = inf_x L(x, λ) = inf_x [〈c − A′λ, x〉 + 〈b, λ〉]
=
  〈b, λ〉, (A′λ = c)
  −∞. (A′λ ≠ c)

The dual problem is

maximize 〈b, λ〉
subject to A′λ = c, λ ≥ 0.
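On a concrete instance the coincidence of the primal and dual values can be checked with a standard LP solver. The sketch below (the data c, A, b are illustrative assumptions, not from the notes) solves both problems with scipy.optimize.linprog; note that linprog works with constraints of the form A_ub x ≤ b_ub, so Ax ≥ b is passed as −Ax ≤ −b.

# Sketch: primal min <c, x> s.t. Ax >= b, and dual max <b, lam> s.t. A'lam = c, lam >= 0.
import numpy as np
from scipy.optimize import linprog

c = np.array([1.0, 2.0])
A = np.array([[1.0, 1.0], [1.0, 3.0]])
b = np.array([2.0, 3.0])

primal = linprog(c, A_ub=-A, b_ub=-b, bounds=[(None, None)] * 2)
dual = linprog(-b, A_eq=A.T, b_eq=c, bounds=[(0, None)] * 2)  # maximize <b, lam>
print(primal.fun, -dual.fun)   # the two optimal values agree (here both 2.5)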

12.2.2 Entropy maximization

Let p = (p1, . . . , pN) be a multinomial distribution, so pn ≥ 0 and ∑_{n=1}^N pn = 1.

The quantity

H(p) = −∑_{n=1}^N pn log pn

is called the entropy of p.

In practice we often want to find the distribution p that has the maximum entropy satisfying some moment constraints. Suppose the constraints are given by

∑_{n=1}^N a_{in} pn = bi. (i = 1, . . . , I)

Since maximizing H is equivalent to minimizing −H, the problem is

minimize ∑_{n=1}^N pn log pn
subject to ∑_{n=1}^N a_{in} pn = bi, (i = 0, . . . , I) (12.7)

where the a_{in}'s and bi's are given and I define a_{0n} = 1 and b0 = 1 to accommodate the constraint ∑_n pn = 1 (accounting of probability).

If the number of unknown variables N is large (say N ∼ 10000), then it would be very hard to solve the problem even using a computer since the objective function pn log pn is nonlinear. However, it turns out that the dual problem is very simple.

Although p log p is not well-defined when p ≤ 0, define

p log p =

0, (p = 0)

∞. (p < 0)

Then the constraint pn ≥ 0 is built in the problem. The Lagrangian is

L(p, λ) = ∑_{n=1}^N pn log pn + ∑_{i=0}^I λi (bi − ∑_{n=1}^N a_{in} pn)
= 〈b, λ〉 + ∑_{n=1}^N (pn log pn − 〈an, λ〉 pn),


where b = (b0, . . . , bI) and an = (a0n, . . . , aIn). To derive the dual problem, weneed to compute infp L(p, λ), which reduces to computing

inf_p [p log p − cp]

for c = 〈an, λ〉 above. But this problem is easy! Differentiating with respect top, the first-order condition is

log p + 1 − c = 0 ⇐⇒ p = e^{c−1},

with the minimum value

p log p − cp = e^{c−1}(c − 1) − c e^{c−1} = −e^{c−1}.

Substituting pn = e^{〈an,λ〉−1} in the Lagrangian, after some algebra the objective function of the dual problem becomes

ω(λ) = inf_p L(p, λ) = 〈b, λ〉 − ∑_{n=1}^N e^{〈an,λ〉−1}.

Hence the dual problem of (12.7) is

maximize 〈b, λ〉 − ∑_{n=1}^N e^{〈an,λ〉−1}. (12.8)

Numerically solving (12.8) is much easier than (12.7) because the dual problem (12.8) is unconstrained and the number of unknown variables 1 + I is typically much smaller than N.
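As a hedged illustration (the moment constraint and the data are assumptions, not from the notes), the sketch below solves the unconstrained dual (12.8) for the maximum-entropy distribution on {1, . . . , 6} with mean 4.5 and recovers the primal solution via pn = e^{〈an,λ〉−1}.

# Sketch: dual approach to a small entropy maximization problem.
import numpy as np
from scipy.optimize import minimize

a = np.vstack([np.ones(6), np.arange(1, 7)])   # a_{0n} = 1, a_{1n} = n; shape (1+I, N)
b = np.array([1.0, 4.5])                       # b_0 = 1 (probability), b_1 = 4.5 (mean)

def neg_omega(lam):
    # negative of the dual objective <b, lam> - sum_n exp(<a_n, lam> - 1)
    return -(b @ lam - np.exp(a.T @ lam - 1.0).sum())

res = minimize(neg_omega, x0=np.zeros(2))      # unconstrained, only 2 variables
p = np.exp(a.T @ res.x - 1.0)                  # recover p_n = e^{<a_n, lam> - 1}
print(p, p.sum(), (np.arange(1, 7) * p).sum()) # p sums to 1 and has mean 4.5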

12.3 Convex conjugate function

In the above example we needed to compute

inf_p [p log p − cp] = −sup_p [cp − p log p].

In general, for any function f : RN → [−∞,∞] we define

f∗(ξ) = sup_{x∈RN} [〈ξ, x〉 − f(x)],

which is called the convex conjugate function of f. For each fixed x, the map ξ ↦ 〈ξ, x〉 − f(x) is an affine (hence closed and convex) function of ξ, and a supremum of closed convex functions is closed and convex, so f∗ is a closed convex function. (A function is closed if its epigraph is closed. Closedness of a function is the same as lower semi-continuity.)
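As a rough numerical illustration (a sketch under stated assumptions, not from the notes), the conjugate can be approximated by taking the supremum over a grid. For f(p) = p log p on p ≥ 0 (the function used in the entropy example above), the conjugate is f∗(c) = e^{c−1}, which the grid approximation reproduces.

# Sketch: grid-based approximation of the conjugate of f(p) = p log p (p >= 0).
import numpy as np

p = np.linspace(1e-12, 50.0, 1_000_000)        # grid over the effective domain
f = p * np.log(p)

def conjugate(c):
    return np.max(c * p - f)                   # crude sup over the grid

for c in [-1.0, 0.0, 1.0, 2.0]:
    print(c, conjugate(c), np.exp(c - 1.0))    # grid value vs exact e^{c-1}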

For any function f, the largest closed convex function that is less than or equal to f is called the closed convex hull of f, and is denoted by cl co f. Formally, we define

cl co f(x) = sup { g(x) | g is closed convex and f(x) ≥ g(x) }.


(At least one such g exists, namely g(x) = −∞.) Clearly φ = cl co f satisfies epi φ = cl co epi f, so

epi cl co f = cl co epi f

is an equivalent definition of cl co f.

called the biconjugate function. Formally,

f∗∗(x) = supξ∈RN

[〈x, ξ〉 − f∗(ξ)].

In the case of cones, we had C∗∗ = cl coC for any cone C. The same holds forfunctions, under some conditions.

Theorem 12.1. Let f be a function. If cl co f is proper (so cl co f(x) > −∞for all x), then

f∗∗(x) = cl co f(x).

In particular, f∗∗(x) = f(x) if f is a proper closed convex function.

We need a lemma in order to prove Theorem 12.1.

Lemma 12.2. For any function f, let L(f) be the set of affine functions that are less than or equal to f, so

L(f) = { h | h is affine and f(x) ≥ h(x) for all x }.

If f is a proper closed convex function, then L(f) ≠ ∅.

Proof. Let f be a proper closed convex function. The claim is obvious if f ≡ ∞, so we may assume dom f ≠ ∅. Take any vector x̄ ∈ dom f and real number ȳ such that ȳ < f(x̄). Then (x̄, ȳ) ∉ epi f. Since f is a closed convex function, epi f is a closed convex set. Therefore by the separating hyperplane theorem we can take a vector (0, 0) ≠ (η, β) ∈ RN × R and a constant γ ∈ R such that

〈η, x̄〉 + βȳ < γ < 〈η, x〉 + βy

for any (x, y) ∈ epi f. Letting x = x̄ and y → ∞, it must be β > 0. Dividing both sides by β > 0 and letting a = −η/β and c = γ/β, we get

−〈a, x̄〉 + ȳ < c < −〈a, x〉 + y

for all (x, y) ∈ epi f. Let h(x) = 〈a, x〉 + c. If x ∉ dom f, then clearly ∞ = f(x) > h(x). If x ∈ dom f, letting y = f(x) in the right inequality we get f(x) > 〈a, x〉 + c = h(x). Therefore f(x) ≥ h(x) for all x, so h ∈ L(f) ≠ ∅.

Lemma 12.3. Let f be a function. If cl co f is proper, then

cl co f(x) = sup { h(x) | h ∈ L(f) }.

Proof. Since φ = cl co f is the largest closed convex function that is less than or equal to f, and any h ∈ L(f) is an affine (hence closed convex) function that is less than or equal to f, clearly

φ(x) ≥ sup { h(x) | h ∈ L(f) }.


To prove that equality holds, suppose that

φ(x̄) > sup { h(x̄) | h ∈ L(f) }

for some x̄. Then we can take a real number ȳ such that

φ(x̄) > ȳ > sup { h(x̄) | h ∈ L(f) }. (12.9)

By the left inequality, (x̄, ȳ) ∉ epi φ. Since epi φ = epi cl co f = cl co epi f is a closed convex set, by the separating hyperplane theorem we can take a vector (0, 0) ≠ (η, β) ∈ RN × R and a constant γ ∈ R such that

〈η, x̄〉 + βȳ < γ < 〈η, x〉 + βy (12.10)

for any (x, y) ∈ epi φ. Letting y → ∞, we get β ≥ 0. There are two cases to consider.

Case 1: β > 0. If β > 0, as in the proof of Lemma 12.2, dividing both sides of (12.10) by β > 0 and letting a = −η/β and c = γ/β, we get

−〈a, x̄〉 + ȳ < c < −〈a, x〉 + y

for all (x, y) ∈ epi φ. Since f(x) ≥ cl co f(x) = φ(x), letting y = f(x) we get f(x) > 〈a, x〉 + c and ȳ < 〈a, x̄〉 + c. The first inequality implies that h1(x) := 〈a, x〉 + c satisfies h1 ∈ L(f). Hence by the second inequality we get

ȳ < h1(x̄) ≤ sup { h(x̄) | h ∈ L(f) },

which contradicts (12.9).

Case 2: β = 0. If β = 0, let h1(x) = −〈η, x〉 + γ. Then by (12.10) we get h1(x) < 0 < h1(x̄) for any x ∈ dom φ. Since by assumption φ is a proper closed convex function, by Lemma 12.2 we can take an affine function h2(x) such that φ(x) ≥ h2(x). Hence for any λ ≥ 0 we have

φ(x) ≥ λh1(x) + h2(x)

for any x; since f ≥ φ, the function h(x) = λh1(x) + h2(x) satisfies h ∈ L(f). But since h1(x̄) > 0, for large enough λ we have h(x̄) = λh1(x̄) + h2(x̄) > ȳ, which contradicts (12.9).

Proof of Theorem 12.1. For h(x) = 〈ξ, x〉 − β, by the definition of L(f) and f∗ we have

h ∈ L(f) ⇐⇒ (∀x) f(x) ≥ 〈ξ, x〉 − β
⇐⇒ (∀x) β ≥ 〈ξ, x〉 − f(x)
⇐⇒ β ≥ sup_x [〈ξ, x〉 − f(x)] = f∗(ξ)
⇐⇒ (ξ, β) ∈ epi f∗.

Hence by the definition of the biconjugate function and Lemma 12.3,

f∗∗(x) = sup { 〈x, ξ〉 − f∗(ξ) | ξ ∈ RN }
= sup { 〈x, ξ〉 − β | (ξ, β) ∈ epi f∗ }
= sup { h(x) | h ∈ L(f) } = cl co f(x).


12.4 Duality theory

Looking at the entropy maximization example, the key to simplifying the dual problem is to reduce it to the calculation of the convex conjugate function. However, unless the functions gi in the constraints (12.1) are all affine, the Lagrangian L(x, λ) will not contain an affine function and therefore we cannot reduce it to a convex conjugate function.

To circumvent this issue, instead of the original constrained minimizationproblem (12.1) consider the perturbed parametric minimization problem

minimize f(x)

subject to gi(x) ≤ ui, (i = 1, . . . , I) (12.11)

where u = (u1, . . . , uI) is a parameter. Using the Lagrangian of (12.1), the Lagrangian of (12.11) is

L(x, λ) − 〈λ, u〉.

Let

F(x, u) =
  f(x), (∀i, gi(x) ≤ ui)
  ∞, (∃i, gi(x) > ui)

be the value of f restricted to the feasible set and

φ(u) = infxF (x, u)

be the value function of the minimization problem (12.11). By the definition of θ(x), we obtain θ(x) = F(x, 0).

Lemma 12.4. Let everything be as above. Then

L(x, λ) = inf_u [F(x, u) + 〈λ, u〉], (12.12a)

F(x, u) = sup_λ [L(x, λ) − 〈λ, u〉]. (12.12b)

Proof. Since F(x, u) = ∞ if gi(x) > ui for some i, in taking the infimum in (12.12a) we may assume gi(x) ≤ ui for all i. If λi < 0 for some i, then letting ui → ∞ we get

inf_u [F(x, u) + 〈λ, u〉] = −∞ = L(x, λ).

If λi ≥ 0 for all i, since F(x, u) = f(x) the infimum is attained when ui = gi(x) for all i, so

inf_u [F(x, u) + 〈λ, u〉] = f(x) + ∑_{i=1}^I λi gi(x) = L(x, λ).

Since L(x, λ) = −∞ if λi < 0 for some i, in taking the supremum in (12.12b) we may assume λi ≥ 0 for all i. Then

sup_λ [L(x, λ) − 〈λ, u〉] = sup_{λ≥0} [f(x) + ∑_{i=1}^I λi(gi(x) − ui)]
=
  f(x), (∀i, gi(x) ≤ ui)
  ∞, (∃i, gi(x) > ui)
= F(x, u).


We need one more lemma.

Lemma 12.5. Let everything be as above. Then

ω(λ) = −φ∗(−λ),

supλω(λ) = φ∗∗(0).

Proof. By the definition of ω, φ, and the convex conjugate function, we get

ω(λ) = inf_x L(x, λ)
= inf_{x,u} [F(x, u) + 〈λ, u〉] (∵ (12.12a))
= inf_u [φ(u) + 〈λ, u〉]
= −sup_u [〈−λ, u〉 − φ(u)] = −φ∗(−λ).

Therefore

sup_λ ω(λ) = sup_λ [−φ∗(−λ)] = sup_λ [〈0, −λ〉 − φ∗(−λ)] = φ∗∗(0).

We immediately obtain the main result.

Theorem 12.6 (Duality theorem). The primal value infx θ(x) and the dualvalue supλ ω(λ) coincide if and only if φ(0) = φ∗∗(0). In particular, this is thecase if φ is a proper closed convex function.

Proof. Clearly inf_x θ(x) = inf_x F(x, 0) = φ(0). By the previous lemma we have sup_λ ω(λ) = φ∗∗(0). If φ is a proper closed convex function, then φ(u) = φ∗∗(u) for all u, so the primal and dual values coincide.

Exercises

12.1. Compute the convex conjugate functions of the following functions.

1. f(x) = (1/p)|x|^p, where p > 1. (Express the solution using q > 1 such that 1/p + 1/q = 1.)

2. f(x) =
  ∞, (x < 0)
  0, (x = 0)
  x log(x/a), (x > 0)
where a > 0.

3. f(x) =

∞, (x ≤ 0)

− log x. (x > 0)

4. f(x) = 〈a, x〉, where a ∈ RN .

5. f(x) = δa(x) :=
  0, (x = a)
  ∞, (x ≠ a)
where a ∈ RN.

6. f(x) = (1/2)〈x, Ax〉, where A is an N × N symmetric positive definite matrix.


12.2. Let

f(x) =
  ∞, (x < 0)
  −x². (x ≥ 0)

1. Compute f∗(ξ), f∗∗(x), cl co f(x).

2. Does f∗∗(x) = cl co f(x) hold? If not, is it a contradiction?

12.3. Derive the dual problem of

minimize 〈c, x〉
subject to Ax ≥ b, x ≥ 0.

12.4. Derive the dual problem of

minimize 〈c, x〉 + (1/2)〈x, Qx〉
subject to Ax ≥ b,

where Q is a symmetric positive definite matrix.

12.5. Consider the entropy maximization problem (12.7). Noting that a_{0n} = 1 and b0 = 1, let an = (1, Tn) ∈ R^{1+I}, b = (1, T̄) ∈ R^{1+I}, and λ = (λ0, λ1) ∈ R × RI. Carry out the maximization in (12.8) with respect to λ0 alone and show that (12.8) is equivalent to

maximize 〈T̄, λ1〉 − log(∑_{n=1}^N e^{〈Tn, λ1〉}),

and also to

minimize log(∑_{n=1}^N e^{〈Tn − T̄, λ1〉}) ⇐⇒ minimize ∑_{n=1}^N e^{〈Tn − T̄, λ1〉}.

12.6. Let p = (p1, . . . , pN) and q = (q1, . . . , qN) be multinomial distributions. A concept closely related to entropy is the Kullback-Leibler information of p with respect to q, defined by

H(p; q) = ∑_{n=1}^N pn log(pn/qn).

Derive the dual problem of the minimum information problem

minimize ∑_{n=1}^N pn log(pn/qn)
subject to ∑_{n=1}^N a_{in} pn = bi, (i = 0, . . . , I)

where the minimization is over p given q.

12.7. Derive the dual problem of

minimize f(x)

subject to Ax = b,

where f : RN → (−∞,∞].(Hint: define the Lagrangian by L(x, λ) = f(x) + 〈λ, b−Ax〉.)


Bibliography

Claude Berge. Espaces Topologiques: Fonctions Multivoques. Dunod, Paris, 1959. English translation by E. M. Patterson: Topological Spaces, New York: Macmillan, 1963. Reprinted: Mineola, NY: Dover, 1997.

F. J. Gould and Jon W. Tolle. A necessary and sufficient qualification for constrained optimization. SIAM Journal on Applied Mathematics, 20(2):164–172, March 1971. doi:10.1137/0120021.

Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, New York, 1985.

David G. Luenberger. Optimization by Vector Space Methods. John Wiley & Sons, New York, 1969.

Harry Markowitz. Portfolio selection. Journal of Finance, 7(1):77–91, March 1952. doi:10.1111/j.1540-6261.1952.tb01525.x.

R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, Princeton, NJ, 1970.

William F. Sharpe. Capital asset prices: A theory of market equilibrium under conditions of risk. Journal of Finance, 19(3):425–442, September 1964. doi:10.1111/j.1540-6261.1964.tb02865.x.

Alexis Akira Toda. Operator reverse monotonicity of the inverse. American Mathematical Monthly, 118(1):82–83, January 2011. doi:10.4169/amer.math.monthly.118.01.082.
