Spring 2017
Lecture Notes IE 521 Convex Optimization
Niao He UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
What’s Inside Module I: Classical Convex Optimization - Fundamentals
♣ Lecture 1: Convex Set and Topological Properties
♣ Lecture 2: Convex Set and Representation Theorems
♣ Lecture 3: Convex Set and Separation Theorems
♣ Lecture 4: Convex Function and Calculus of Convexity
♣ Lecture 5: Convex Function and Characterizations
♣ Lecture 6: Subgradient and Subdifferential
♣ Lecture 7: Closed Convex Function and Conjugate Theory
♣ Lecture 8: Convex Programming
♣ Lecture 9: Lagrangian Duality and Optimality Conditions
♣ Lecture 10: Minimax Problems
♣ Lecture 11: Polynomial Solvability of Convex Programs
♣ Lecture 12: Center-of-gravity Method and Ellipsoid Method
Module II: Modern Convex Optimization ♦ Lecture 13: Conic Programming
♦ Lecture 14: Conic Duality
♦ Lecture 15: Applications of Conic Duality
♦ Lecture 16: Applications in Robust Optimization
♦ Lecture 17: Interior Point Method and Path-Following Schemes
♦ Lecture 18: Newton Method for Unconstrained Minimization
♦ Lecture 19: Minimizing Self-concordant functions
♦ Lecture 20: Damped Newton for Self-concordant Minimization
♦ Lecture 21: Interior Point Method and Self-concordant Barriers
♦ Lecture 22: Interior Point Method for Conic Programming
♦ Lecture 23: CVX Tutorial and Examples
Module III: Algorithms for Constrained Convex Optimization ♥ Lecture 24: Subgradient Method
♥ Lecture 25: Bundle Method
♥ Lecture 26: Constrained Level Method
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 1: Convex Sets – January 23
Instructor: Niao He Scribe: Niao He
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email the TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Topology review
• Convex sets (convex/conic/affine hulls)
• Examples of convex sets
• Calculus of convex sets
• Some nice topological properties of convex sets.
1.1 Topology Review
Let X be a nonempty set in Rn. A point x0 is called an interior point of X if X contains a small ball around x0, i.e. ∃r > 0 such that B(x0, r) := {x : ‖x − x0‖2 ≤ r} ⊆ X. A point x0 is called a limit point of X if there exists a convergent sequence in X that converges to x0, i.e. ∃{xn} ⊆ X such that xn → x0 as n → ∞.
• The interior of X, denoted int(X), is the set of all interior points of X.

• The closure of X, denoted cl(X), is the set of all limit points of X.

• The boundary of X, denoted ∂X = cl(X)\int(X), is the set of points that belong to the closure but not to the interior.
X is closed if cl(X) = X; X is open if int(X) = X. Here are some basic facts:
• int(X) ⊆ X ⊆ cl(X);
• A set X is closed if and only if its complement Xc = Rn\X is open;

• The intersection of an arbitrary number of closed sets is closed, i.e., ∩α∈A Xα is closed if Xα is closed for all α ∈ A;

• The union of a finite number of closed sets is closed, i.e., X1 ∪ · · · ∪ Xn is closed if each Xi is closed. (Finiteness matters here: the infinite union ∪n≥1 [1/n, 1] = (0, 1] is not closed.)
1.2 Convex Sets
Definition 1.1 (Convex set) A set X ⊆ Rn is convex if ∀x, y ∈ X, λx + (1 − λ)y ∈ X for any λ ∈ [0, 1].
In other words, the line segment connecting any two elements of the set lies entirely in the set.
(a) convex set (b) non-convex set
Figure 1.1: Examples of convex and non-convex sets
Given any elements x1, . . . , xk, the combination λ1x1 + λ2x2 + . . .+ λkxk is called
• Convex: if λi ≥ 0, i = 1, . . . , k and λ1 + λ2 + . . .+ λk = 1;
• Conic: if λi ≥ 0, i = 1, . . . , k;
• Affine: if λ1 + λ2 + . . .+ λk = 1;
• Linear: if λi ∈ R, i = 1, . . . , k.
Consequently, we have
• A set is convex if all convex combinations of its elements are in the set;
• A set is a convex cone if all conic combinations of its elements are in the set;
• A set is an affine subspace if all affine combinations of its elements are in the set;
• A set is a linear subspace if all linear combinations of its elements are in the set.
Clearly, a linear subspace is always a convex cone; a convex cone is always a convex set.
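The four kinds of combinations differ only in which constraints the coefficients satisfy, which is easy to encode. A small sketch (the helper name is mine, not part of the notes):

```python
def combination_type(coeffs, tol=1e-9):
    """Classify the combination sum_i coeffs[i] * x_i by its coefficients."""
    nonneg = all(c >= -tol for c in coeffs)
    sums_to_one = abs(sum(coeffs) - 1.0) <= tol
    if nonneg and sums_to_one:
        return "convex"
    if nonneg:
        return "conic"
    if sums_to_one:
        return "affine"
    return "linear"

print(combination_type([0.3, 0.7]))    # convex
print(combination_type([2.0, 5.0]))    # conic
print(combination_type([2.0, -1.0]))   # affine
print(combination_type([-1.0, -2.0]))  # linear
```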
Definition 1.2 (Convex hull) The convex hull of a set X ⊆ Rn is the set of all convex combinations of its elements, denoted as

Conv(X) = {λ1x1 + · · · + λkxk : k ∈ N, λi ≥ 0, λ1 + · · · + λk = 1, xi ∈ X, ∀i = 1, . . . , k}.
Figure 1.2: Examples of convex hulls
Similarly, one can define the conic hull and the affine hull of a set:

Cone(X) = {λ1x1 + · · · + λkxk : k ∈ N, xi ∈ X, λi ≥ 0, ∀i = 1, . . . , k},

Aff(X) = {λ1x1 + · · · + λkxk : k ∈ N, xi ∈ X, λ1 + · · · + λk = 1, ∀i = 1, . . . , k}.
Proposition 1.3 We have the following
1. A convex hull is always convex.
2. If X is convex, then conv(X) = X.
3. For any set X, conv(X) is the smallest convex set that contains X.
Proof:
1. By definition, for any x, y ∈ Conv(X), we can write x = ∑i λixi and y = ∑i µixi, where λi, µi ≥ 0 and ∑i λi = ∑i µi = 1 (taking a common list of points x1, . . . , xk and padding with zero coefficients if necessary). Hence, for any α ∈ [0, 1], we have

αx + (1 − α)y = α ∑i λixi + (1 − α) ∑i µixi = ∑i ξixi,

where ξi = αλi + (1 − α)µi, ∀i. Note that ξi ≥ 0 and ∑i ξi = α ∑i λi + (1 − α) ∑i µi = 1. Therefore, αx + (1 − α)y ∈ Conv(X). Hence, Conv(X) is convex.
2. First of all, by the definition of the convex hull, it is straightforward to see that X ⊆ Conv(X). Next, we show that Conv(X) ⊆ X by induction on the number of terms k. The base case k = 1 is trivial. Now assume that any convex combination of k points of X is in X; we want to show that any convex combination of k + 1 points is still in X. Consider the convex combination given by λ1, . . . , λk+1 with λi ≥ 0, i = 1, . . . , k + 1 and λ1 + · · · + λk+1 = 1. If λk+1 = 1 the claim is trivial, so assume λk+1 < 1 and write

λ1x1 + · · · + λk+1xk+1 = (1 − λk+1) z + λk+1xk+1, where z := (λ1/(1 − λk+1)) x1 + · · · + (λk/(1 − λk+1)) xk.

By the induction hypothesis, z ∈ X since z is a convex combination of k points of X. By convexity of X, we further have λ1x1 + · · · + λk+1xk+1 ∈ X.
3. Suppose Y is convex and Y ⊇ X; we want to show that Y ⊇ Conv(X). By the previous argument, if Y contains X, then Y contains all convex combinations of elements of X, i.e. Y ⊇ Conv(X).
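Proposition 1.3 suggests a membership test: x ∈ Conv(X) iff the defining convex-combination system is solvable. For a triangle in R2 this is just a 2 × 2 linear system (barycentric coordinates). A sketch, with hypothetical helper names:

```python
def barycentric(p, a, b, c):
    """Solve p = l1*a + l2*b + l3*c with l1 + l2 + l3 = 1.

    Writing p - c = l1*(a - c) + l2*(b - c) gives a 2x2 system,
    solved here by Cramer's rule.
    """
    d = (a[0]-c[0])*(b[1]-c[1]) - (b[0]-c[0])*(a[1]-c[1])
    l1 = ((p[0]-c[0])*(b[1]-c[1]) - (b[0]-c[0])*(p[1]-c[1])) / d
    l2 = ((a[0]-c[0])*(p[1]-c[1]) - (p[0]-c[0])*(a[1]-c[1])) / d
    return l1, l2, 1.0 - l1 - l2

def in_triangle(p, a, b, c, tol=1e-9):
    """p lies in Conv({a, b, c}) iff all barycentric coordinates are >= 0."""
    return all(l >= -tol for l in barycentric(p, a, b, c))

# The centroid is the convex combination with weights 1/3 each.
print(in_triangle((1/3, 1/3), (0, 0), (1, 0), (0, 1)))  # True
print(in_triangle((1, 1), (0, 0), (1, 0), (0, 1)))      # False
```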
Examples of Convex Sets
Example 1. Some simple convex sets:
• Hyperplane: {x ∈ Rn : aTx = b}

• Halfspace: {x ∈ Rn : aTx ≤ b}

• Affine space: {x ∈ Rn : Ax = b}

• Polyhedron: {x ∈ Rn : Ax ≤ b}

• Simplex: {x ∈ Rn : x ≥ 0, x1 + · · · + xn = 1} = Conv(e1, . . . , en).
Example 2. Euclidean ball: {x ∈ Rn : ‖x‖2 ≤ r}, where ‖ · ‖2 is the Euclidean norm on Rn.

Example 3. Ellipsoid: {x ∈ Rn : (x − a)TQ(x − a) ≤ r2}, where Q ≻ 0 is symmetric positive definite.
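A quick numerical sanity check of Example 3 (the ellipsoid data below are made up): sample points in the set and verify that convex combinations never leave it. This is only evidence, not a proof.

```python
import random

def in_ellipsoid(x, a, q, r):
    """Membership in {x : (x-a)^T Q (x-a) <= r^2} for diagonal Q = diag(q)."""
    return sum(qi * (xi - ai) ** 2 for xi, ai, qi in zip(x, a, q)) <= r * r

random.seed(0)
a, q, r = (1.0, -1.0), (2.0, 0.5), 1.5        # hypothetical ellipsoid data
pts = []
while len(pts) < 200:                          # rejection-sample points in the set
    cand = tuple(ai + random.uniform(-3, 3) for ai in a)
    if in_ellipsoid(cand, a, q, r):
        pts.append(cand)

ok = all(
    in_ellipsoid(tuple(l*u + (1-l)*v for u, v in zip(p1, p2)), a, q, r)
    for p1 in pts[:20] for p2 in pts[:20] for l in (0.25, 0.5, 0.75)
)
print(ok)  # True: no convex combination ever left the ellipsoid
```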
1.3 Calculus of Convex Sets
The following operations preserve convexity of sets; each can be easily verified from the definition.
1. Intersection: If Xα, α ∈ A are convex sets, then
∩α∈AXα
is also a convex set.
2. Direct product: If Xi ⊆ Rni , i = 1, . . . , k are convex sets, then
X1 × · · · × Xk := {(x1, . . . , xk) : xi ∈ Xi, i = 1, . . . , k}
is also a convex set.
3. Weighted summation: If Xi ⊆ Rn, i = 1, . . . , k are convex sets, then
α1X1 + · · · + αkXk := {α1x1 + . . . + αkxk : xi ∈ Xi, i = 1, . . . , k}
is also a convex set.
4. Affine image: If X ⊆ Rn is a convex set and A(x) : x 7→ Ax+ b is an affine mapping fromRn to Rk, then
A(X) := {Ax + b : x ∈ X}
is also a convex set.
5. Inverse affine image: If X ⊆ Rn is a convex set and A(y) : y 7→ Ay+b is an affine mappingfrom Rk to Rn, then
A−1(X) := {y : Ay + b ∈ X}
is also a convex set.
Proof:
1. Let x, y ∈ ∩α∈AXα, then x, y ∈ Xα, ∀α ∈ A. Since Xα is convex, for any λ ∈ [0, 1], λx+ (1−λ)y ∈ Xα, ∀α ∈ A. Hence, λx+ (1− λ)y ∈ ∩α∈AXα.
2. Let x = (x1, . . . , xk), y = (y1, . . . , yk) ∈ X1 × · · · × Xk. Since each Xi is convex, for λ ∈ [0, 1], λxi + (1 − λ)yi ∈ Xi for i = 1, . . . , k. Hence
λx+ (1− λ)y = (λx1 + (1− λ)y1, . . . , λxk + (1− λ)yk) ∈ X1 × · · · ×Xk.
3. Let x, y ∈ α1X1 + · · · + αkXk, by definition, there exists xi, yi ∈ Xi, i = 1, . . . , k, such thatx = α1x1 + . . .+ αkxk, y = α1y1 + . . .+ αkyk. Hence, for all λ ∈ [0, 1]
λx+ (1− λ)y = α1z1 + . . .+ αkzk ∈ α1X1 + · · ·+ αkXk
because zi = λxi + (1− λ)yi ∈ Xi, for all i = 1, . . . , k.
4. Let y1, y2 ∈ A(X); then there exist x1, x2 ∈ X such that y1 = Ax1 + b and y2 = Ax2 + b. Therefore, for any λ ∈ [0, 1], we have λy1 + (1 − λ)y2 = A(λx1 + (1 − λ)x2) + b ∈ A(X) because λx1 + (1 − λ)x2 ∈ X.

5. Let y1, y2 ∈ A−1(X); then there exist x1, x2 ∈ X such that x1 = Ay1 + b and x2 = Ay2 + b. Therefore, for any λ ∈ [0, 1], we have A(λy1 + (1 − λ)y2) + b = λx1 + (1 − λ)x2 ∈ X, which implies that λy1 + (1 − λ)y2 ∈ A−1(X).
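In R the weighted-summation rule is easy to see concretely: with nonnegative weights, a weighted sum of intervals is again an interval, hence convex. A small sketch (the helper name is mine):

```python
def weighted_sum(intervals, weights):
    """alpha1*X1 + ... + alphak*Xk for intervals Xi = (lo, hi), alpha_i >= 0.

    With nonnegative weights the endpoints simply add, so the result
    is again an interval, i.e. a convex subset of R.
    """
    lo = sum(a * l for a, (l, _) in zip(weights, intervals))
    hi = sum(a * h for a, (_, h) in zip(weights, intervals))
    return (lo, hi)

# 2*[0,1] + 3*[-1,2] = [0,2] + [-3,6] = [-3,8]
print(weighted_sum([(0, 1), (-1, 2)], [2, 3]))  # (-3, 8)
```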
1.4 Nice Topological Properties of Convex Sets
Convex sets are special because of their nice geometric properties.
Proposition 1.4 If X is a convex set with nonempty interior, then int(X) is dense in cl(X).
Proof: Let x0 ∈ int(X) and x ∈ cl(X). We can construct a convergent sequence yn = (1/n)x0 + (1 − 1/n)x such that yn → x. We only need to show that yn ∈ int(X). Therefore, it suffices to prove the following claim:

Claim 1.5 If x0 ∈ int(X) and x ∈ cl(X), then [x0, x) ⊆ int(X), namely, for any α ∈ (0, 1], the point z := αx0 + (1 − α)x ∈ int(X).

This can be proved as follows. Since x0 ∈ int(X), there exists r > 0 such that B(x0, r) ⊆ X. Since x ∈ cl(X), there exists a sequence {xn} ⊆ X such that xn → x. Let zn = αx0 + (1 − α)xn; then zn → z. When n is large enough, ‖zn − z‖2 ≤ αr/2. Since B(x0, r) ⊆ X and xn ∈ X, we have B(zn, αr) = αB(x0, r) + (1 − α)xn ⊆ X. Hence, B(z, αr/2) ⊆ B(zn, αr) ⊆ X. This is because for any z′ ∈ B(z, αr/2), ‖z′ − z‖2 ≤ αr/2, so

‖z′ − zn‖2 ≤ ‖z′ − z‖2 + ‖zn − z‖2 ≤ αr/2 + αr/2 = αr.
Remark. Note that in general, for any set X, int(X) ⊆ X ⊆ cl(X), but int(X) and cl(X) can differ dramatically. For instance, let X be the set of all irrational numbers in (0, 1); then int(X) = ∅ and cl(X) = [0, 1]. The proposition implies that a convex set with nonempty interior is perfectly well characterized by its closure or its interior.
Lecture 2: Geometry of Convex Sets – January 25
Instructor: Niao He Scribe: Shuanglong Wang
In this lecture, we cover the following topics
• Nice topological properties (cont’d)
• Representation Theorem (Caratheodory)
• Radon and Helly Theorems
2.1 Recall
• A set X is convex if ∀x, y ∈ X,λx+ (1− λ)y ∈ X for any λ ∈ [0, 1].
• Convex Hull:

Conv(X) = {λ1x1 + · · · + λkxk : k ∈ N, λi ≥ 0, λ1 + · · · + λk = 1, xi ∈ X, ∀i = 1, . . . , k}.
Note that the convex hull of X is the smallest convex set that contains X.
• Affine Hull:

aff(X) = {λ1x1 + · · · + λkxk : k ∈ N, xi ∈ X, λ1 + · · · + λk = 1, ∀i = 1, . . . , k}.
Note that the affine hull of X is the smallest affine subspace that contains X. An affine space M is a shifted linear space, i.e. M = a + L for a linear subspace L; e.g. {x : Ax = b} = x0 + {x : Ax = 0}, where x0 is any point with Ax0 = b. The dimension of X is dim(X) = dim(aff(X)).
• If X ⊆ Rn is convex with non-empty interior, then

∀x0 ∈ int(X), x ∈ cl(X) ⇒ λx0 + (1 − λ)x ∈ int(X), ∀λ ∈ (0, 1].

Moreover, the interior of X, if nonempty, is dense in cl(X), i.e., cl(int(X)) = cl(X).
Question: What if int(X) = ∅? For example, X = Conv(e1, e2, e3) = {(x1, x2, x3) : x1, x2, x3 ≥ 0, x1 + x2 + x3 = 1} has int(X) = ∅, aff(X) = {(x1, x2, x3) : x1 + x2 + x3 = 1}, and dim(X) = 2.
Definition 2.1 (Relative Interior) rint(X) = {x : ∃r > 0 s.t. B(x, r) ∩ Aff(X) ⊆ X}.
Fact: If X is convex and nonempty, then rint(X) is always non-empty. The proof is left as an exercise.
Proposition 2.2 Let X be a nonempty convex set. Then
a) int(X),cl(X), rint(X) are convex
b) If x0 ∈ rint(X), x ∈ cl(X), then λx0 + (1− λ)x ∈ rint(X),∀λ ∈ (0, 1]
c) cl(rint(X)) = cl(X)
d) rint(cl(X)) = rint(X)
Proof: a) is straightforward; b), c) can be derived in analogy to the case when int(X) is non-empty. d) is due to the following fact: ∀x0 ∈ rint(X), x ∈ rint(cl(X)), there exists y ∈ cl(X) s.t. x ∈ (x0, y). From b), this implies x ∈ rint(X).
Remark A convex set is perfectly well approximated by its relative interior or closure.
2.2 Representation Theorem
Theorem 2.3 (Caratheodory) Let X ⊆ Rn be nonempty with dim(X) = d ≤ n. Every point x ∈ Conv(X) is a convex combination of at most d + 1 points of X, i.e.

Conv(X) = {λ1x1 + · · · + λd+1xd+1 : xi ∈ X, λi ≥ 0, λ1 + · · · + λd+1 = 1}.
Proof: Suppose, for contradiction, that the minimal representation of some x ∈ Conv(X) has m > d + 1 terms:

x = α1x1 + · · · + αmxm, where αi ≥ 0, α1 + · · · + αm = 1.

Since m > d + 1, the points x1, . . . , xm are affinely dependent, so the system of linear equations

δ1x1 + · · · + δmxm = 0, δ1 + · · · + δm = 0

has a nontrivial solution δ. We can then write x = ∑i (αi − tδi)xi for any t. Let λi(t) = αi − tδi, i = 1, . . . , m; then ∑i λi(t) = 1 for every t. Let t∗ = min{αi/δi : δi > 0} =: αj/δj (the minimum exists since δ ≠ 0 and ∑i δi = 0 force some δi > 0). Then λi(t∗) ≥ 0 for all i and λj(t∗) = 0. This gives a representation of x with fewer terms, a contradiction!
Example Suppose there are 100 different kinds of herbal tea, each of which is a blend of the same 25 herbs. Donald wants a particular mixture of all the herbal teas with equal proportions. What is the least number of teas he needs to buy to reproduce it? By Caratheodory's theorem: 26.
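The proof of Theorem 2.3 is constructive: each step finds a vector δ in the null space of the affine-dependence system and shifts the weights until one vanishes. A sketch of that reduction step (assuming NumPy; helper names are mine):

```python
import numpy as np

def caratheodory_step(points, alphas, tol=1e-9):
    """One reduction step from the proof: zero out one active weight.

    points: (m, n) array; alphas: convex weights with x = sum_i alphas[i]*points[i].
    """
    idx = np.flatnonzero(alphas > tol)
    n = points.shape[1]
    if len(idx) <= n + 1:                  # already at most dim + 1 active terms
        return alphas
    P, a = points[idx], alphas[idx]
    # Nontrivial solution of sum_i delta_i x_i = 0, sum_i delta_i = 0:
    # an (n+1) x m homogeneous system with m > n + 1 unknowns.
    A = np.vstack([P.T, np.ones(len(idx))])
    delta = np.linalg.svd(A)[2][-1]        # a null-space vector of A
    if delta.max() <= 0:
        delta = -delta                     # ensure some delta_i > 0
    pos = delta > tol
    t = np.min(a[pos] / delta[pos])        # t* = min{alpha_i/delta_i : delta_i > 0}
    new = np.zeros_like(alphas)
    new[idx] = np.maximum(a - t * delta, 0.0)  # one active weight drops to 0
    return new / new.sum()                 # renormalize away rounding error

rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 2))              # 5 points in R^2, so d <= 2
w = rng.dirichlet(np.ones(5))              # random convex weights
x = w @ pts
while np.count_nonzero(w > 1e-9) > 3:      # Caratheodory: 3 = d + 1 terms suffice
    w = caratheodory_step(pts, w)
print(np.count_nonzero(w > 1e-9))          # 3
print(np.allclose(w @ pts, x))             # True: the point is unchanged
```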
2.3 Radon and Helly Theorems
Theorem 2.4 (Radon) Let S be a collection of N points in Rn with N ≥ n + 2. Then we can write S = S1 ∪ S2 such that S1 ∩ S2 = ∅ and Conv(S1) ∩ Conv(S2) ≠ ∅.
Proof: Let S = {x1, . . . , xN}. Consider the linear system

γ1x1 + · · · + γNxN = 0, γ1 + · · · + γN = 0,

which has n + 1 equations but N ≥ n + 2 unknowns, so there exists a nonzero solution γ1, . . . , γN. Let I = {i : γi ≥ 0}, J = {j : γj < 0}, and a = ∑i∈I γi = −∑j∈J γj > 0. Then

∑i∈I γixi = ∑j∈J (−γj)xj ⇒ ∑i∈I (γi/a) xi = ∑j∈J (−γj/a) xj.

Both sides are convex combinations, so the partition S1 = {xi : i ∈ I} and S2 = {xj : j ∈ J} gives the desired result.
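The proof above is also constructive: a null-space vector γ of the (n + 1) × N system yields the partition, and the common point can be computed from either side. A sketch (assuming NumPy; the function name is mine):

```python
import numpy as np

def radon_partition(points, tol=1e-9):
    """Radon partition following the proof: points is an (N, n) array, N >= n + 2."""
    N = points.shape[0]
    A = np.vstack([points.T, np.ones(N)])  # gamma_1 x_1 + ... = 0, sum gamma_i = 0
    gamma = np.linalg.svd(A)[2][-1]        # nonzero null-space vector (N >= n + 2)
    I = np.flatnonzero(gamma > tol)        # points with gamma_i = 0 may go to either side
    J = np.flatnonzero(gamma < -tol)
    a = gamma[I].sum()
    z = (gamma[I] / a) @ points[I]         # convex combination of S1 ...
    assert np.allclose(z, (-gamma[J] / a) @ points[J])  # ... equals one of S2
    return I, J, z

# x4 is the midpoint of x2 and x3, so the hulls must meet at (1, 1).
pts = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
I, J, z = radon_partition(pts)
print(z)  # [1. 1.]
```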
Theorem 2.5 (Helly) Let S1, . . . , SN be a collection of convex sets in Rn with N ≥ n + 1. Assume every n + 1 of them have a point in common; then all the sets have a point in common.
Proof: We prove this by induction on N.
Base case: N = n + 1, obviously true.
Induction step: Assume that any collection of N (≥ n + 1) sets has a common point whenever every n + 1 of them have a common point. We want to show that this holds for a collection of N + 1 sets.

By the assumption, for each i there exists xi ∈ S1 ∩ . . . ∩ Si−1 ∩ Si+1 ∩ . . . ∩ SN+1 ≠ ∅. Hence, we obtain N + 1 ≥ n + 2 points x1, . . . , xN+1.

By Radon's theorem, we can split {x1, x2, . . . , xN+1} into disjoint sets whose convex hulls have nonempty intersection. Without loss of generality, assume the two disjoint sets are {x1, . . . , xk} and {xk+1, . . . , xN+1}, and

Conv({x1, . . . , xk}) ∩ Conv({xk+1, . . . , xN+1}) ≠ ∅.

Let z ∈ Conv({x1, . . . , xk}) ∩ Conv({xk+1, . . . , xN+1}). Since {x1, . . . , xk} ⊆ Sk+1 ∩ . . . ∩ SN+1, we get z ∈ Conv({x1, . . . , xk}) ⊆ Sk+1 ∩ . . . ∩ SN+1. Since {xk+1, . . . , xN+1} ⊆ S1 ∩ . . . ∩ Sk, we get z ∈ Conv({xk+1, . . . , xN+1}) ⊆ S1 ∩ . . . ∩ Sk. Therefore, z ∈ S1 ∩ . . . ∩ SN+1.
Remark
• The theorem is not true for an infinite collection: e.g. Si = [i, ∞), ∩i≥1 Si = ∅.
• The theorem is not true if (n + 1) is weakened to n: e.g. in R2, three lines forming a triangle pairwise intersect, yet have no common point.
Corollary 2.6 (Helly) Let F be any collection of compact convex sets in Rn. If every n + 1 sets of them have a common point, then all the sets have a common point.
Remark Helly's theorem has many applications, especially for uniform approximation.
Example Consider the optimization problem

p∗ = min{g0(x) : x ∈ R10, gi(x) ≤ 0, i = 1, . . . , 521}.

Suppose that for every t ∈ R, X0 = {x ∈ R10 : g0(x) ≤ t} is convex, and each Xi = {x ∈ R10 : gi(x) ≤ 0} is convex. How many constraints can we drop without affecting the optimal value?

Answer: You can drop as many as 521 − 11 = 510 constraints, i.e. you only need to keep 11 constraints.
Proof: Suppose every relaxation keeping only 11 constraints changes the optimal value, i.e., for every {i1, . . . , i11} ⊆ {1, . . . , 521},

min{g0(x) : x ∈ R10, gi1(x) ≤ 0, . . . , gi11(x) ≤ 0} = p(i1, . . . , i11) < p∗.

Since there are only finitely many such combinations, let pmax < p∗ be the largest among the values p(i1, . . . , i11). Consider the collection of sets, for i = 1, . . . , 521,

Si = {x ∈ R10 : g0(x) ≤ pmax, gi(x) ≤ 0}.

Hence, we have

(i) each Si is convex, and

(ii) every 11 of them have nonempty intersection (a minimizer of the corresponding 11-constraint relaxation lies in all 11 sets).

By Helly's theorem (n + 1 = 11 in R10), S1 ∩ . . . ∩ S521 ≠ ∅, i.e. ∃x ∈ R10 s.t. g0(x) ≤ pmax < p∗ and gi(x) ≤ 0 ∀i. Contradiction!
Lecture 3: Separation Theorems – January 30
Instructor: Niao He Scribe: Shuanglong Wang
In this lecture, we cover the following topics
• Separation Theorems
• The Farkas Lemma
• Duality of Linear Programs
Reference: Boyd & Vandenberghe, Chapter 2.5; Ben-Tal & Nemirovski, Chapter 1.2
3.1 Separation of Convex Sets
Definition 3.1 Let S and T be two nonempty convex sets in Rn. A hyperplane H = {x ∈ Rn : aTx = b} with a ≠ 0 is said to separate S and T if

a) S ⊆ H− = {x ∈ Rn : aTx ≤ b} and T ⊆ H+ = {x ∈ Rn : aTx ≥ b};

b) S ∪ T ⊄ H.
Note that a) implies that

sup{aTx : x ∈ S} ≤ inf{aTx : x ∈ T},

and b) implies that

inf{aTx : x ∈ S} < sup{aTx : x ∈ T}.
The separation is strict if S ⊆ {x ∈ Rn : aTx ≤ b′} and T ⊆ {x ∈ Rn : aTx ≥ b′′} with b′ < b′′. Note that strict separation is equivalent to

sup{aTx : x ∈ S} < inf{aTx : x ∈ T}.
Question: When can S and T be separated? strictly separated? Necessary conditions?
Theorem 3.2 Let S and T be two nonempty convex sets. Then S and T can be separated if andonly if rint(S) ∩ rint(T ) = ∅
Corollary 3.3 Let S be a nonempty convex set and x0 ∈ ∂S. Then there exists a supporting hyperplane H = {x : aTx = aTx0} such that S ⊆ {x : aTx ≤ aTx0} and x0 ∈ H.
We will prove a special case of the theorem and corollary.
Theorem 3.4 Let S be closed and convex and x0 ∉ S. Then there exists a hyperplane that strictly separates x0 and S.
Proof: Define the projection of x0, denoted proj(x0), to be the point in S closest to x0:

proj(x0) = argmin{‖x − x0‖22 : x ∈ S}.
Note that proj(x0) exists and is unique.
• Existence: due to closedness of S
• Uniqueness: If x1, x2 are both closest to x0 in S, then ‖x1 − x0‖2 = ‖x2 − x0‖2 = d. Consider z = (x1 + x2)/2 ∈ S; then ‖z − x0‖2 ≥ d. By the parallelogram law, ‖(x0 − x1) + (x0 − x2)‖22 + ‖(x0 − x1) − (x0 − x2)‖22 = 2‖x0 − x1‖22 + 2‖x0 − x2‖22, i.e. 4‖x0 − z‖22 + ‖x1 − x2‖22 = 4d2. Hence ‖x1 − x2‖22 ≤ 0, i.e. x1 = x2.
Next we show that strict separation is given by H := {x : aTx = b} with a = x0 − proj(x0) and b = aTx0 − ‖a‖22/2, i.e. aTx < b for all x ∈ S and aTx0 > b.

By the definition of projection and convexity, ∀λ ∈ [0, 1], x ∈ S,

λx + (1 − λ)proj(x0) ∈ S.

Let φ(λ) = ‖λx + (1 − λ)proj(x0) − x0‖22 = ‖proj(x0) − x0 + λ(x − proj(x0))‖22. Then

φ(λ) ≥ φ(0), ∀λ ∈ [0, 1].

Hence, φ′(0) ≥ 0, i.e. −2aT(x − proj(x0)) ≥ 0. This implies

aTx ≤ aTproj(x0) = aT(x0 − a) = aTx0 − ‖a‖22 < aTx0 − ‖a‖22/2 = b,

while aTx0 = b + ‖a‖22/2 > b since a ≠ 0.
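The construction in the proof of Theorem 3.4 is directly computable when S is a box, where the projection is a coordinate-wise clip. A sketch (the box data are made up; helper names are mine):

```python
def project_box(x0, lo, hi):
    """Projection onto the box S = {x : lo <= x <= hi} (coordinate-wise clip)."""
    return tuple(min(max(v, l), h) for v, l, h in zip(x0, lo, hi))

def separating_hyperplane(x0, lo, hi):
    """a, b with a^T x < b for all x in S and a^T x0 > b, as in Theorem 3.4."""
    p = project_box(x0, lo, hi)
    a = tuple(u - v for u, v in zip(x0, p))           # a = x0 - proj(x0)
    b = sum(u*v for u, v in zip(a, x0)) - sum(v*v for v in a) / 2
    return a, b

x0, lo, hi = (3.0, 0.5), (0.0, 0.0), (1.0, 1.0)       # x0 outside the unit box
a, b = separating_hyperplane(x0, lo, hi)
print(a, b)  # (2.0, 0.0) 4.0
inside = sum(u*v for u, v in zip(a, x0)) > b          # x0 strictly on one side
corner = sum(u*v for u, v in zip(a, (1.0, 1.0))) < b  # the farthest vertex of S on the other
print(inside, corner)  # True True
```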
Corollary 3.5 Let S and T be two nonempty convex sets with S ∩ T = ∅. Assume S − T is closed; then S and T can be strictly separated.

Proof: Let Y = S − T. Since Y is a weighted sum of two convex sets, Y is nonempty and convex. Since S ∩ T = ∅, we have 0 ∉ Y. By the previous theorem, ∃a, b such that aTy < b < 0 for all y ∈ Y. This implies

aTx < b + aTz, ∀x ∈ S, z ∈ T.

Hence, sup{aTx : x ∈ S} ≤ b + inf{aTz : z ∈ T} < inf{aTz : z ∈ T}, i.e. S and T can be strictly separated.
Remark
1. Even if both S and T are closed and convex, S − T might not be closed, and they might not be strictly separated. (E.g. S = {(x, y) : y ≥ ex} and T = {(x, y) : y ≤ 0} in R2: here S − T = {(u, v) : v > 0} is not closed, and only non-strict separation by y = 0 is possible.)

2. When both S and T are closed and convex, S ∩ T = ∅, and at least one of them is bounded, then S − T is closed, and S and T can be strictly separated.
3.2 Theorems of alternatives
Theorem 3.6 (Farkas’ Lemma) Exactly one of the following sets must be empty:
(i) {x ∈ Rn : Ax = b, x ≥ 0}

(ii) {y ∈ Rm : AT y ≤ 0, bT y > 0}
where A ∈ Rm×n, b ∈ Rm.
Remark
• Systems (i) and (ii) are often called strong alternatives, i.e. exactly one of them must be feasible.

• Farkas' Lemma is particularly useful for proving infeasibility of a linear program.
• Geometric interpretation: let A = [a1|a2| . . . |an], so that

Cone{a1, . . . , an} = {x1a1 + · · · + xnan : xi ≥ 0, i = 1, . . . , n}.

Then (i) is feasible ⇐⇒ b ∈ Cone{a1, . . . , an}, while (ii) is feasible ⇐⇒ ∃y with yTai ≤ 0, ∀i = 1, . . . , n and yTb > 0, i.e. a hyperplane through the origin separates b from the cone.
Farkas’ lemma can be regarded as a special case of the separation theorem.
Proof: First, we show that if system (ii) is feasible, then system (i) is infeasible. Otherwise, 0 < bTy = (Ax)Ty = xT(ATy) ≤ 0, a contradiction!

Second, we show that if system (i) is infeasible, then system (ii) is feasible. Let C = Cone{a1, . . . , an}; then C is convex and closed. Since b ∉ C, by the separation theorem, b and C can be strictly separated, i.e.

∃y ∈ Rm, γ ∈ R, y ≠ 0, such that yTz ≤ γ, ∀z ∈ C, and yTb > γ.

Since 0 ∈ C, we have γ ≥ 0. Moreover, if there were z0 ∈ C with yTz0 > 0, then yT(αz0) > γ for α large enough, contradicting the separation; hence yTz ≤ 0 for all z ∈ C, and it suffices to take γ = 0. Since a1, . . . , an ∈ C, we have yTai ≤ 0, ∀i = 1, . . . , n, i.e. ATy ≤ 0, while yTb > 0.
Remark: The fact that Cone{a1, . . . , an} is closed is crucial. Note that in general, when S is not a finite set, Cone(S) is not always closed; e.g. the conic hull of the solid circle S = {(x1, x2) : x21 + (x2 − 1)2 < 1} is the open halfspace {(x1, x2) : x2 > 0} (together with the origin), which is not closed.
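Verifying a Farkas certificate for system (ii) takes only a few dot products. A sketch with made-up data where b lies outside Cone{a1, a2} (the function name is mine):

```python
def is_farkas_certificate(A_cols, b, y, tol=1e-9):
    """Check that y solves system (ii): a_i^T y <= 0 for every column a_i, and b^T y > 0."""
    dot = lambda u, v: sum(ui*vi for ui, vi in zip(u, v))
    return all(dot(a, y) <= tol for a in A_cols) and dot(b, y) > tol

# The columns of A generate the nonnegative quadrant; b = (-1, 1) lies outside
# that cone, and y certifies infeasibility of {x >= 0 : Ax = b}.
A_cols = [(1.0, 0.0), (0.0, 1.0)]
b = (-1.0, 1.0)
y = (-1.0, 0.0)
print(is_farkas_certificate(A_cols, b, y))              # True
print(is_farkas_certificate(A_cols, (1.0, 1.0), y))     # False: this b is in the cone
```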
Variant of Farkas’ Lemma Exactly one of the following two sets must be empty:
1. {x ∈ Rn : Ax ≤ b}

2. {y ≥ 0 : AT y = 0, bT y < 0}
Proof: Exercise in HW1.
3.3 LP strong duality
Consider the primal and dual pair of linear programs:

(P) min cTx s.t. Ax = b, x ≥ 0;

(D) max bT y s.t. AT y ≤ c.
Theorem 3.7 If (P) has a finite optimal value, then so does (D), and the two optimal values are equal.
Proof: Exercise in HW 1.
Remark The theorems of alternatives can be generalized to systems with convex constraints, and the strong duality of linear programming can be extended to general convex programs.
Lecture 4: Convex Functions, Part I – February 1
Instructor: Niao He Scribe: Shuanglong Wang
In this lecture, we cover the following topics
• Convex Functions
• Examples
• Convexity-preserving Operations
Reference: Boyd &Vandenberghe, Chapter 3.1-3.2
4.1 Convex Function
Let f be a function from Rn to R. The domain of f is defined as dom(f) = {x ∈ Rn : |f(x)| < ∞}. For example,

• f(x) = 1/x, dom(f) = R\{0}

• f(x) = x1 ln(x1) + · · · + xn ln(xn), dom(f) = Rn++ = {x : xi > 0, ∀i = 1, . . . , n}
Definition 4.1 (Convex function) A function f(x) : Rn → R is convex if
(i) dom(f) ⊆ Rn is a convex set;
(ii) ∀x, y ∈ dom(f) and λ ∈ [0, 1], f(λx+ (1− λ)y) ≤ λf(x) + (1− λ)f(y).
Geometrically, the line segment between (x, f(x)), (y, f(y)) sits above the graph of f .
Definitions
• A function is called strictly convex if (ii) holds with strict inequality whenever x ≠ y and λ ∈ (0, 1), i.e. f(λx + (1 − λ)y) < λf(x) + (1 − λ)f(y).
• A function is called α-strongly convex if f(x) − (α/2)‖x‖22 is convex.
• A function is called concave if −f(x) is convex.
Note that strongly convex =⇒ strictly convex =⇒ convex
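The basic convex inequality in Definition 4.1(ii) can be spot-checked numerically: a failing random pair disproves convexity, while passing all trials is only evidence, not a proof. A hypothetical helper sketch:

```python
import random

def looks_convex(f, dim=1, trials=2000, seed=0):
    """Test the basic convex inequality on random pairs and random lambda.

    Returns False as soon as a violating pair is found (disproving convexity);
    True only means no counterexample was sampled.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-5, 5) for _ in range(dim)]
        y = [rng.uniform(-5, 5) for _ in range(dim)]
        lam = rng.random()
        z = [lam*xi + (1-lam)*yi for xi, yi in zip(x, y)]
        if f(z) > lam*f(x) + (1-lam)*f(y) + 1e-9:
            return False
    return True

print(looks_convex(lambda v: v[0]**2))    # True : x^2 is convex
print(looks_convex(lambda v: abs(v[0])))  # True : |x| is convex
print(looks_convex(lambda v: v[0]**3))    # False: x^3 is not convex on R
```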
Figure 4.1: Convex function
4.2 Examples
1. Simple univariate functions:
• Even powers: xp with p even and positive

• Exponential: eax, ∀a ∈ R

• Negative logarithm: − log x

• Absolute value: |x|

• Negative entropy: x log x
2. Affine functions: f(x) = aTx+ b
• both convex and concave, but not strictly convex/concave
3. Some quadratic functions: f(x) = (1/2)xTQx + bTx + c (with Q symmetric)

• convex if and only if Q ⪰ 0 is positive semi-definite

• strictly convex if and only if Q ≻ 0 is positive definite

• special case: f(x) = ‖Ax − b‖22 is convex
4. Norms: A function π(·) is called a norm if
(a) π(x) ≥ 0, ∀x and π(x) = 0 iff x = 0
(b) π(αx) = |α| · π(x),∀α ∈ R
(c) π(x+ y) ≤ π(x) + π(y)
Note that norms are convex: ∀λ ∈ [0, 1], π(λx + (1 − λ)y) ≤ π(λx) + π((1 − λ)y) = λπ(x) + (1 − λ)π(y), where the inequality comes from (c) and the equality comes from (b). Examples of norms include:

• lp-norm on Rn: ‖x‖p := (|x1|p + · · · + |xn|p)1/p, where p ≥ 1

• Q-norm on Rn: ‖x‖Q := √(xTQx), where Q ≻ 0 is positive definite
• Frobenius norm on Rm×n: ‖A‖F := (∑i ∑j |Ai,j|2)1/2

• spectral norm on Sn: ‖A‖ := maxi=1,...,n |λi(A)|, where the λi's are the eigenvalues of A.
5. Indicator function: IC(x) = 0 if x ∈ C, and +∞ if x ∉ C. The indicator function IC(x) is convex if the set C is a convex set.
6. Support function: I∗C(x) = sup{xTy : y ∈ C}. The support function I∗C(x) is always convex for any set C.

Proof: Note that sup over y ∈ C of a sum is at most the sum of the suprema. Then ∀x1, x2 and λ ∈ [0, 1],

I∗C(λx1 + (1 − λ)x2) = sup{λx1Ty + (1 − λ)x2Ty : y ∈ C} ≤ sup{λx1Ty : y ∈ C} + sup{(1 − λ)x2Ty : y ∈ C} = λI∗C(x1) + (1 − λ)I∗C(x2).
7. More examples
• Piecewise linear functions: max(a1Tx + b1, . . . , akTx + bk)

• Log of exponential sums: log(ea1Tx+b1 + · · · + eakTx+bk)

• Negative log-determinant: − log(det(X))
How to show convexity of these functions?
4.3 Convexity-Preserving Operators
1. Taking conic combination: If fi(x), i ∈ I are convex functions and αi ≥ 0, ∀i ∈ I, then

g(x) = ∑i∈I αifi(x)

is a convex function.
Proof: The domain of g,

dom(g) = ∩{i : αi>0} dom(fi),

is convex. For any x, y ∈ dom(g), λ ∈ [0, 1],

g(λx + (1 − λ)y) = ∑i αifi(λx + (1 − λ)y) ≤ ∑i αi [λfi(x) + (1 − λ)fi(y)] = λ ∑i αifi(x) + (1 − λ) ∑i αifi(y) = λg(x) + (1 − λ)g(y).
Remark The property extends to infinite sums and integrals. If f(x, ω) is convex in x for every ω ∈ Ω and α(ω) ≥ 0, ∀ω ∈ Ω, then

g(x) = ∫Ω α(ω)f(x, ω)dω

is convex if well defined. For example, if η = η(ω) is a well-defined random variable on Ω and f(x, η(ω)) is convex in x for every ω ∈ Ω, then Eη[f(x, η)] is a convex function.
2. Taking affine composition If f(x) : Rn → R is convex and A(y) : y 7→ Ay + b is an affinemapping from Rm to Rn, then
g(y) := f(Ay + b)
is convex on Rm.
Proof: dom(g) = {y : Ay + b ∈ dom(f)} is convex. For all y1, y2 ∈ dom(g) and λ ∈ [0, 1],

g(λy1 + (1 − λ)y2) = f(λ(Ay1 + b) + (1 − λ)(Ay2 + b)) ≤ λf(Ay1 + b) + (1 − λ)f(Ay2 + b) = λg(y1) + (1 − λ)g(y2).

For example, ‖Ax − b‖22, ∑i eaiTx−bi, and −∑i log(aiTx − bi) are convex.
3. Taking pointwise maximum and supremum: If fi(x), i ∈ I are convex, then

g(x) := maxi∈I fi(x)

is also convex.

Proof: First of all, dom(g) = ∩i∈I dom(fi) is convex. For any x, y ∈ dom(g), λ ∈ [0, 1],

g(λx + (1 − λ)y) = maxi∈I fi(λx + (1 − λ)y) ≤ maxi∈I [λfi(x) + (1 − λ)fi(y)] ≤ maxi∈I λfi(x) + maxi∈I (1 − λ)fi(y) = λg(x) + (1 − λ)g(y).
Remark The property extends to the pointwise supremum over an infinite set. If f(x, ω) is convex in x for every ω ∈ Ω, then

g(x) := supω∈Ω f(x, ω)

is convex. For example, the following functions are convex:
(a) piecewise linear functions: f(x) = max(a1Tx + b1, . . . , akTx + bk)

(b) support function: I∗C(x) = sup{xTy : y ∈ C}

(c) maximum distance to any set C: dmax(x, C) = supy∈C ‖y − x‖2

(d) maximum eigenvalue of a symmetric matrix: λmax(X) = max{yTXy : ‖y‖2 = 1}
Indeed, almost every convex function can be expressed as the pointwise supremum of a familyof affine functions!
4. Taking convex monotone composition:
• scalar case If f is a convex function on Rn and F (·) is a convex and non-decreasingfunction on R, then g(x) = F (f(x)) is convex.
• vector case If fi(x), i = 1, . . . ,m are convex on Rn and F (y1, . . . , ym) is convex andnon-decreasing (component-wise) in each argument, then
g(x) = F (f1(x), . . . , fm(x))
is convex.
Proof: By convexity of fi, we have

fi(λx + (1 − λ)y) ≤ λfi(x) + (1 − λ)fi(y), ∀i, ∀λ ∈ [0, 1].

Hence, for any x, y ∈ dom(g), λ ∈ [0, 1],

g(λx + (1 − λ)y) = F(f1(λx + (1 − λ)y), . . . , fm(λx + (1 − λ)y))
≤ F(λf1(x) + (1 − λ)f1(y), . . . , λfm(x) + (1 − λ)fm(y)) (by monotonicity of F)
≤ λF(f1(x), . . . , fm(x)) + (1 − λ)F(f1(y), . . . , fm(y)) (by convexity of F)
= λg(x) + (1 − λ)g(y) (by definition of g)
Remark Taking the pointwise maximum is a special case of the above rule: setting F(y1, . . . , ym) = max(y1, . . . , ym),

maxi=1,...,m fi(x) = F(f1(x), . . . , fm(x))

is convex.
For example:
(a) ef(x) is convex if f is convex
(b) − log f(x) is convex if f is concave
(c) log(ef1(x) + · · · + efk(x)) is convex if f1, . . . , fk are convex.
5. Taking partial minimization: If f(x, y) is jointly convex in (x, y) and Y is a convex set, then

g(x) = infy∈Y f(x, y)
is convex.
Proof: dom(g) = {x : (x, y) ∈ dom(f) for some y ∈ Y} is a projection of a convex set, hence convex.
Given any x1, x2 ∈ dom(g), by definition, for any ε > 0, ∃y1, y2 ∈ Y s.t.

f(x1, y1) ≤ g(x1) + ε/2, f(x2, y2) ≤ g(x2) + ε/2.

For any λ ∈ [0, 1], multiplying by λ and 1 − λ and adding the two inequalities, we have

λf(x1, y1) + (1 − λ)f(x2, y2) ≤ λg(x1) + (1 − λ)g(x2) + ε.

By joint convexity of f(x, y), and since λy1 + (1 − λ)y2 ∈ Y, this implies

g(λx1 + (1 − λ)x2) ≤ f(λx1 + (1 − λ)x2, λy1 + (1 − λ)y2) ≤ λg(x1) + (1 − λ)g(x2) + ε.

Since this holds for every ε > 0, letting ε → 0 yields the convexity of g.
Examples
(a) Minimum distance to a convex set: d(x, C) = infy∈C ‖x − y‖2, where C is convex;

(b) The function g(x) = infy {h(y) : Ay = x} is convex if h is convex. This is because g(x) = infy f(x, y), where

f(x, y) := h(y) if Ay = x, and +∞ otherwise,

is jointly convex in (x, y).
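Example (a) can be illustrated in one dimension: the distance to an interval is a partial minimization of the jointly convex f(x, y) = |x − y| over the convex set Y = [lo, hi], so it is convex. A sketch (the data are made up):

```python
def dist_to_interval(x, lo=-1.0, hi=1.0):
    """g(x) = min over y in [lo, hi] of |x - y|.

    The minimizing y is the projection of x onto the interval.
    """
    y = min(max(x, lo), hi)
    return abs(x - y)

# Spot-check the basic convex inequality at midpoints of sample pairs.
pts = [-3.0, -1.0, 0.0, 2.0, 5.0]
ok = all(
    dist_to_interval(0.5*u + 0.5*v)
    <= 0.5*dist_to_interval(u) + 0.5*dist_to_interval(v) + 1e-12
    for u in pts for v in pts
)
print(ok)  # True
```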
Lecture 5: Convex Functions, Part II – February 6
Instructor: Niao He Scribe: Shuanglong Wang
In this lecture, we cover the following topics
• Characterization of Convex Function
– Epigraph
– Level set
– One dimensional property
– First order condition
– Second order condition
• Continuity of Convex Functions
References: Boyd &Vandenberghe Chapter 2
5.1 Recall
A function f : Rn → R is convex if it satisfies

(i) convexity of domain: dom(f) = {x : |f(x)| < +∞} is convex;

(ii) basic convex inequality: for x, y ∈ dom(f) and λ ∈ [0, 1], f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).

Remark (Extended-value function) f can be extended to a function from Rn to R ∪ {+∞} by setting f(x) = +∞ if x ∉ dom(f). Then (ii) can be rewritten as: ∀x, y ∈ Rn, λ ∈ [0, 1], f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).
Remark (General convex inequality) For any λi ≥ 0 with λ1 + · · · + λm = 1, we have by induction that

f(λ1x1 + · · · + λmxm) ≤ λ1f(x1) + · · · + λmf(xm).

This is also known as Jensen's inequality. In particular, if ψ is a finite discrete random variable, then we always have

f(E[ψ]) ≤ E[f(ψ)].
5.2 Characterization of Convex Functions
5.2.1 Epigraph
The epigraph of a function f is defined as

epi(f) = {(x, t) ∈ Rn+1 : f(x) ≤ t}.
Proposition 5.1 f is convex on Rn if and only if its epigraph is a convex set in Rn+1.
Proof:
• (⇐ part) Firstly, dom(f) = {x : ∃t s.t. (x, t) ∈ epi(f)} is convex. Let (x1, t1), (x2, t2) ∈ epi(f); by convexity of epi(f), (λx1 + (1 − λ)x2, λt1 + (1 − λ)t2) ∈ epi(f), ∀λ ∈ [0, 1]. By definition of the epigraph, this means f(λx1 + (1 − λ)x2) ≤ λt1 + (1 − λ)t2. In particular, setting t1 = f(x1), t2 = f(x2), we obtain the basic convex inequality.
• (⇒ part) On the other hand, for all λ ∈ [0, 1] and (x1, t1), (x2, t2) ∈ epi(f),

f convex ⇒ f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2) ≤ λt1 + (1 − λ)t2

⇒ λ(x1, t1) + (1 − λ)(x2, t2) ∈ epi(f)

⇒ epi(f) is a convex set.
5.2.2 Level set
For any t ∈ R, the level set of f with level t is defined as
levt(f) = x ∈ dom(f) : f(x) ≤ t .
Proposition 5.2 If f is convex, then, every level set is convex.
Proof: For all x1, x2 ∈ levt(f) and λ ∈ [0, 1],
f(λx1 + (1− λ)x2) ≤ λf(x1) + (1− λ)f(x2) ≤ λt+ (1− λ)t = t
i.e. λx1 + (1− λ)x2 ∈ levt(f)
Remark: The converse is not true. A function is called quasi-convex if its domain and all its level sets are convex; this holds if and only if

f(λx + (1 − λ)y) ≤ max{f(x), f(y)}, ∀λ ∈ [0, 1].
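A concrete instance of the gap between the two notions (this example is mine, not from the notes): f(x) = √|x| has interval level sets {x : f(x) ≤ t} = [−t2, t2], hence is quasi-convex, yet it violates the basic convex inequality:

```python
import math

f = lambda x: math.sqrt(abs(x))

x, y, lam = 0.0, 4.0, 0.5
mid = lam*x + (1 - lam)*y                     # midpoint of the segment [0, 4]
print(f(mid) <= max(f(x), f(y)))              # True : quasi-convexity holds
print(f(mid) <= lam*f(x) + (1 - lam)*f(y))    # False: convexity fails at the midpoint
```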
5.2.3 One-dimensional Property
Recall that a set is convex if and only if its restriction to any line is convex. Similarly, we have

Proposition 5.3 f is convex if and only if its restriction to any line is convex, i.e. ∀x, h ∈ Rn, φ(t) = f(x + th) is convex on R.
Proof: (⇒) Firstly, dom(φ) = {t ∈ R : x + th ∈ dom(f)} is convex. Also, ∀t1, t2 ∈ dom(φ), ∀λ ∈ [0, 1],

φ(λt1 + (1 − λ)t2) = f(x + (λt1 + (1 − λ)t2)h) = f(λ(x + t1h) + (1 − λ)(x + t2h)) ≤ λf(x + t1h) + (1 − λ)f(x + t2h) = λφ(t1) + (1 − λ)φ(t2).
Hence, φ(t) is convex. Proof for the reverse direction is omitted.
Remark: Checking convexity in R boils down to check convexity of one-dimensional function onthe axis. (See example in HW1)Recall from basic calculus that for a univariate function f on (a, b),
1. if f is differentiable, f convex ⇐⇒ f ′ increasing
2. if f is twice-differentiable, f convex ⇐⇒ f ′′ ≥ 0
Similar first and second order conditions hold for general convex functions.
5.2.4 First-order Condition
Proposition 5.4 Assume f is differentiable. Then f is convex if and only if dom(f) is convex and f(y) ≥ f(x) + ∇f(x)T (y − x), ∀x, y ∈ dom(f).
Proof: (⇐ part) Let z = λx+ (1− λ)y,∀λ ∈ [0, 1]
f(x) ≥ f(z) +∇f(z)T (x− z)
f(y) ≥ f(z) +∇f(z)T (y − z)
Hence, we have
λf(x) + (1− λ)f(y) ≥ f(z) +∇f(z)T (λx+ (1− λ)y − z) = f(z) = f(λx+ (1− λ)y).
(⇒ part) By convexity, we have for all λ ∈ [0, 1],
f((1− λ)x+ λy) ≤ (1− λ)f(x) + λf(y)
⇒ f(x+ λ(y − x)) ≤ f(x) + λ(f(y)− f(x))
⇒ f(y) ≥ f(x) + [f(x+ λ(y − x))− f(x)]/λ, ∀λ ∈ (0, 1]
⇒ f(y) ≥ f(x) +∇f(x)T (y − x), letting λ→ 0+
Lecture 5: Convex Functions, Part II – February 6 5-4
Remark: The above proposition implies that we obtain a global underestimate of the entire function from purely local information (f(x), ∇f(x)). This is an important feature of convex functions. Moreover, one can see that a convex function can be viewed as the supremum of its affine minorants.
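As a sanity check of the gradient inequality (the quadratic instance below is chosen here for illustration), one can verify f(y) ≥ f(x) + ∇f(x)T(y − x) at random pairs of points:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
Q = M.T @ M                         # Q = M^T M is positive semidefinite
f = lambda x: float(x @ Q @ x)      # convex quadratic f(x) = x^T Q x
grad = lambda x: 2 * Q @ x

# First-order condition: f(y) >= f(x) + grad f(x)^T (y - x) at random pairs.
ok = all(
    f(y) >= f(x) + grad(x) @ (y - x) - 1e-9
    for x, y in (rng.standard_normal((2, 3)) for _ in range(100))
)
print(ok)   # True
```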
5.2.5 Second-order condition
Proposition 5.5 Assume f is twice-differentiable. Then f is convex if and only if dom(f) is convex and ∇2f(x) ⪰ 0, ∀x ∈ dom(f).
Proof: (⇒ part) For all x, h, the restriction φ(t) = f(x+ th) is convex on the axis, so
φ′′(t) = hT∇2f(x+ th)h ≥ 0.
In particular, φ′′(0) = hT∇2f(x)h ≥ 0, ∀h. Hence ∇2f(x) ⪰ 0.
(⇐ part) Any one-dimensional restriction φ(t) = f(x+ th) is convex since φ′′(t) = hT∇2f(x+ th)h ≥ 0.
Hence f is convex.
Example: f(x) = (1/2)xTQx+ bTx+ c is convex if and only if Q ⪰ 0.
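This criterion is easy to test numerically: the Hessian of the quadratic is (the symmetric part of) Q, so convexity reduces to checking that its eigenvalues are nonnegative (the matrices below are illustrative):

```python
import numpy as np

Q_psd = np.array([[2.0, 1.0], [1.0, 2.0]])     # eigenvalues 1, 3 -> convex
Q_ind = np.array([[1.0, 0.0], [0.0, -1.0]])    # eigenvalues 1, -1 -> not convex

def is_convex_quadratic(Q, tol=1e-12):
    # Hessian of (1/2) x^T Q x + b^T x + c is the symmetric part of Q.
    H = 0.5 * (Q + Q.T)
    return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

print(is_convex_quadratic(Q_psd))  # True
print(is_convex_quadratic(Q_ind))  # False
```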
5.3 Continuity of Convex Functions
Convex functions are continuous on the relative interior of their domains.
Theorem 5.6 Let f : Rn → R be convex, then f is continuous on rint(dom(f)).
Remark: Note that f need not be continuous on Rn or on dom(f), e.g.
f(x) = 1 if x = 0; 0 if x > 0; +∞ otherwise.
Proof: Without loss of generality, assume dim(dom(f)) = n, 0 ∈ int(dom(f)), and {x : ‖x‖2 ≤ 1} ⊆ dom(f). Let us consider continuity at the point 0. Let xk → 0 with ‖xk‖2 ≤ 1.
(a) lim supk→∞ f(xk) ≤ f(0)
This is because xk = (1− ‖xk‖2) · 0 + ‖xk‖2 · yk, where yk = xk/‖xk‖2 ∈ dom(f). By convexity of f, we have
f(xk) ≤ (1− ‖xk‖2) · f(0) + ‖xk‖2 · f(yk)
Therefore, lim supk→∞ f(xk) ≤ f(0).
(b) lim infk→∞ f(xk) ≥ f(0)
This is because 0 = 1/(‖xk‖2 + 1) · xk + ‖xk‖2/(‖xk‖2 + 1) · zk, where zk = −xk/‖xk‖2 ∈ dom(f). By the convexity of f, we have
f(0) ≤ 1/(‖xk‖2 + 1) · f(xk) + ‖xk‖2/(‖xk‖2 + 1) · f(zk)
Therefore, lim infk→∞ f(xk) ≥ f(0).
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 6: Convex Functions, Part III – February 8
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email the TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• subgradient and subdifferential
– definition and examples, existence, subdifferential properties, calculus
References: Bertsekas, Nedich & Ozdaglar, 2003, Chapter 4
6.1 Motivation
Convex functions are “essentially” convex sets:
• f(x) is convex ⇐⇒ epi(f) is convex
• supα∈A fα(x) is convex whenever each fα is, since epi(supα∈A fα) = ∩α∈A epi(fα) is convex
• “tight affine minorant” of f ⇐⇒ supporting hyperplane of epi(f)
Question: Can you find any affine function that underestimates f(x) and is tight at x = 0?
• f(x) = (1/2)x2
• f(x) = |x|
• f(x) = −√x if x ≥ 0; +∞ otherwise
• f(x) = 1 if x = 0; 0 if x > 0; +∞ otherwise
6.2 Subgradient
Definition 6.1 Let f : Rn → R ∪ {+∞} be a convex function. A vector g ∈ Rn is a subgradient of f at a point x ∈ dom(f) if
f(y) ≥ f(x) + gT (y − x), ∀y
The set of all subgradients at x is called the subdifferential of f at x, denoted ∂f(x).
Remark (Subgradient and epigraph) Note that for any fixed x ∈ dom(f)
g ∈ ∂f(x) ⇔ f(y)− gT y ≥ f(x)− gTx, ∀y
⇔ t− gT y ≥ f(x)− gTx, ∀(y, t) ∈ epi(f)
⇔ (−g, 1)T (y, t) ≥ (−g, 1)T (x, f(x)), ∀(y, t) ∈ epi(f)
⇔ the hyperplane H := {(y, t) : (−g, 1)T (y, t) = (−g, 1)T (x, f(x))} is a supporting hyperplane of epi(f)
Remark (Differentiable Case) If f is differentiable at x ∈ dom(f), then ∂f(x) = {∇f(x)} is a singleton.
Proof: By definition, let y = x+ εd with ε > 0 and g ∈ ∂f(x); then f(x+ εd) ≥ f(x) + εgTd, i.e.
[f(x+ εd)− f(x)]/ε ≥ gTd, ∀d, ∀ε > 0
Letting ε→ 0 gives ∇f(x)Td ≥ gTd, ∀d. This only holds when g = ∇f(x).
Examples
1. f(x) = (1/2)x2: ∂f(x) = {x}
2. f(x) = |x|: ∂f(x) = {sgn(x)} if x ≠ 0, and [−1, 1] if x = 0. Note at x = 0, ∀g ∈ [−1, 1], |y| ≥ 0 + g · y, ∀y
3. f(x) = −√x if x ≥ 0, +∞ otherwise: ∂f(0) = ∅. Note at x = 0, ∄g ∈ R s.t. −√y ≥ 0 + g · y, ∀y ≥ 0
4. f(x) = 1 if x = 0, 0 if x > 0, +∞ otherwise: ∂f(0) = ∅. Note at x = 0, ∄g ∈ R s.t. 0 ≥ 1 + g · y, ∀y > 0
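Example 2 can be checked numerically: a candidate g is a subgradient of |x| at 0 exactly when the subgradient inequality holds for all y, which on a grid singles out the interval [−1, 1] (grid and test points chosen here):

```python
import numpy as np

ys = np.linspace(-2.0, 2.0, 401)   # grid of test points y

def is_subgradient(g, x=0.0):
    # g in ∂|.|(x) iff |y| >= |x| + g*(y - x) for all y (checked on the grid)
    return bool(np.all(np.abs(ys) >= np.abs(x) + g * (ys - x)))

print(all(is_subgradient(g) for g in [-1.0, -0.5, 0.0, 0.5, 1.0]))  # True
print(is_subgradient(1.5))  # False: 1.5 lies outside [-1, 1]
```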
Theorem 6.2 Let f be a convex function and x ∈ dom(f). Then
1. ∂f(x) is convex and closed
2. ∂f(x) is nonempty and bounded if x ∈ rint(dom(f))
Proof:
1. Convexity and closedness are evident since
∂f(x) = {g ∈ Rn : f(y) ≥ f(x) + gT (y − x), ∀y} = ∩y {g ∈ Rn : f(y) ≥ f(x) + gT (y − x)}
is the solution set of an infinite system of linear inequalities in g. The intersection of an arbitrary number of closed and convex sets is still closed and convex.
2. (Non-empty) W.l.o.g., assume dom(f) is full-dimensional and x ∈ int(dom(f)). Since epi(f) is convex and (x, f(x)) belongs to its boundary, by the separation theorem, ∃α = (s, β) ≠ 0 s.t.
sT y + βt ≥ sTx+ βf(x), ∀(y, t) ∈ epi(f)
Clearly, we must have β ≥ 0 (let t→ +∞). Since x ∈ int(dom(f)), we cannot have β = 0: otherwise sT y ≥ sTx for all y in some ball B(x, δ) ⊆ dom(f), forcing s = 0, which is impossible. Hence β > 0; setting g = −β−1s,
f(y) ≥ f(x) + gT (y − x), ∀y
(Bounded) Suppose ∂f(x) is unbounded, i.e. ∃gk ∈ ∂f(x) s.t. ‖gk‖2 →∞ as k →∞. Since x ∈ int(dom(f)), ∃δ > 0 s.t. B(x, δ) ⊆ dom(f). Hence yk = x+ δgk/‖gk‖2 ∈ dom(f). By convexity,
f(yk) ≥ f(x) + gTk (yk − x) = f(x) + δ‖gk‖2 →∞.
However, this contradicts the continuity of f over int(dom(f)).
Remark: Conversely, if dom(f) is convex and ∂f(x) is non-empty for every x ∈ dom(f), then f is convex. This is because ∀x, y ∈ dom(f) and λ ∈ [0, 1], z = λx+ (1− λ)y ∈ dom(f), so ∂f(z) ≠ ∅. Let g ∈ ∂f(z). We have
f(x) ≥ f(z) + gT (x− z)
f(y) ≥ f(z) + gT (y − z)
⇒ λf(x) + (1− λ)f(y) ≥ f(z) = f(λx+ (1− λ)y)
Remark: (Continuity of Subdifferential) Let f : Rn → R be convex and continuous. Suppose xk → x and gk ∈ ∂f(xk) with gk → g; then g ∈ ∂f(x).
6.3 Subdifferential and Directional Derivative
Recall that the directional derivative of a function f at x along direction d is
f ′(x; d) = limδ→0+ [f(x+ δd)− f(x)]/δ
If f is differentiable, then f ′(x; d) = ∇f(x)Td
Lemma 6.3 When f is convex, the ratio φ(δ) := [f(x+ δd)− f(x)]/δ is non-decreasing in δ > 0.
The proof is left as a self exercise.
Theorem 6.4 Let f be convex and x ∈ int(dom(f)). Then
f ′(x; d) = maxg∈∂f(x) gTd
Proof: By definition, we have f(x+ δd)− f(x) ≥ δgTd for all δ > 0 and g ∈ ∂f(x). Hence f ′(x; d) ≥ gTd, ∀g ∈ ∂f(x). This implies that
f ′(x; d) ≥ maxg∈∂f(x) gTd.
It suffices to show that ∃g ∈ ∂f(x) s.t. f ′(x; d) ≤ gTd. Consider the two sets
C1 = {(y, t) : f(y) < t}
C2 = {(y, t) : y = x+ αd, t = f(x) + αf ′(x; d), α ≥ 0}
Claim: C1 ∩ C2 = ∅ and C1, C2 are convex and nonempty. This is because f(x+ αd) ≥ f(x) + αf ′(x; d), ∀α ≥ 0 (due to Lemma 6.3). By the separation theorem, ∃(g0, β) ≠ 0 s.t.
gT0 (x+ αd) + β(f(x) + αf ′(x; d)) ≤ gT0 y + βt, ∀α ≥ 0, ∀t > f(y)
One can easily show that β > 0. Let g = β−1g0, so
gT (x+ αd) + f(x) + αf ′(x; d) ≤ gT y + f(y), ∀α ≥ 0, ∀y
Let α = 0; then for all y,
gTx+ f(x) ≤ gT y + f(y) ⇔ −g ∈ ∂f(x)
Let α = 1 and y = x; then
gTd+ f ′(x; d) ≤ 0 ⇔ f ′(x; d) ≤ (−g)Td
Therefore, we have shown that f ′(x; d) = maxg∈∂f(x) gTd.
6.4 Calculus of Subgradient
1. Nonnegative summation: If h(x) = β1f1(x) + β2f2(x) with β1, β2 ≥ 0, then
∂h(x) = β1∂f1(x) + β2∂f2(x)
2. Affine transformation: If h(x) = f(Ax+ b), then
∂h(x) = AT∂f(Ax+ b)
3. Pointwise maximum: If h(x) = maxi∈I fi(x), I = {1, 2, ..., m}, then
∂h(x) = Conv{∂fi(x) : i ∈ I(x)}, where I(x) = {i ∈ I : fi(x) = h(x)}
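The pointwise-maximum rule can be illustrated numerically: for h(x) = max{−x, 2x} (an instance chosen here), both pieces are active at x = 0, so ∂h(0) = Conv{−1, 2} = [−1, 2]; the grid check below confirms exactly these slopes satisfy the subgradient inequality:

```python
import numpy as np

# h(x) = max_i (a_i x + b_i) with slopes -1 and 2, both active at x = 0.
a = np.array([-1.0, 2.0])
b = np.array([0.0, 0.0])
h = lambda x: float(np.max(a * x + b))

ys = np.linspace(-3.0, 3.0, 601)
hy = np.array([h(y) for y in ys])

def is_subgradient(g):
    # g in ∂h(0) iff h(y) >= h(0) + g*y for all y (checked on the grid)
    return bool(np.all(hy >= h(0.0) + g * ys))

print(all(is_subgradient(g) for g in [-1.0, 0.0, 1.0, 2.0]))  # True
print(is_subgradient(2.5))  # False: 2.5 is outside Conv{-1, 2}
```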
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 7: Convex Functions (part IV) – February 13
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email the TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Closed convex function
• Convex conjugate function
• Conjugate theory
• Introduction to convex programs
References: Bertsekas, Nedich & Ozdaglar, Chapter 7 and Boyd & Vandenberghe, Chapter 3.3
7.1 Closed Convex Functions
Recall that for a closed convex set X, ∀x ∉ X, ∃ a hyperplane strictly separating x and X, i.e. ∃(ax, bx) s.t. X ⊂ Hx := {y : aTx y ≤ bx} and x ∉ Hx. Hence
X = ∩x∉X {y : aTx y ≤ bx}
Any closed convex set is the intersection of closed half-spaces. Intuitively, when translated to convex functions, any convex function with closed and nonempty epigraph is the supremum of its affine minorants.
Definition 7.1 A function is closed if its epigraph is a closed set.
Remark: Any continuous function is closed, but a closed function is not necessarily continuous. For example,
f(x) = 1 if x > 0; 0 if x = 0; +∞ otherwise.
Proposition 7.2 Let f : Rn → R ∪ {+∞}. Then the following are equivalent.
(i) f is closed;
(ii) every level set is closed;
(iii) f is lower semi-continuous, i.e. ∀x ∈ Rn, ∀ xk → x, f(x) ≤ lim infk→∞ f(xk).
Proof: Self Exercise
Remark Note that a convex function need not be closed, e.g.
f(x) = 1 if x = 0; 0 if x > 0; +∞ otherwise.
Suppose f is a closed convex function. For a given slope y ∈ Rn, when is an affine function yTx− β an affine minorant of f?
f(x) ≥ yTx− β, ∀x ⇔ β ≥ yTx− f(x), ∀x ⇔ β ≥ supx∈Rn {yTx− f(x)} =: f∗(y)
The function f∗(y) is known as the conjugate function of f.
7.2 Convex Conjugate Function
Definition 7.3 Let f : Rn → R ∪ {+∞}. The function
f∗(y) = supx∈Rn {yTx− f(x)} = supx∈dom(f) {yTx− f(x)}
is called the conjugate of the function f, also known as the Legendre–Fenchel transformation.
Remark:
1. (Fenchel’s inequality) f(x) + f∗(y) ≥ xT y,∀x, y
2. f∗ is always convex and closed
3. The conjugate of the conjugate function f∗, also called the biconjugate,
f∗∗(x) = supy∈Rn {xT y − f∗(y)}
is also closed and convex.
Theorem 7.4 We have
(a) f(x) ≥ f∗∗(x)
(b) If f is closed and convex, then f∗∗ = f .
Proof:
(a) By definition of f∗, f∗(y) ≥ yTx− f(x), ∀y. This implies that f(x) ≥ yTx− f∗(y), ∀y. Hence,
f(x) ≥ supy {yTx− f∗(y)} = f∗∗(x).
(b) Suppose f∗∗ ≠ f. Since f ≥ f∗∗, there exists x0 such that f∗∗(x0) < f(x0), i.e. the point (x0, f∗∗(x0)) ∉ epi(f). Since f is convex and closed, epi(f) is a closed convex set. By the separation theorem, ∃(y, β) ≠ 0 s.t. the hyperplane H = {(x, t) : yTx+ βt = β0} strictly separates epi(f) and (x0, f∗∗(x0)). Note β ≠ 0; w.l.o.g. let β = −1, so
yTx0 − f∗∗(x0) > yTx− t, ∀(x, t) ∈ epi(f)
Taking t = f(x) and the supremum over x gives yTx0 − f∗∗(x0) ≥ f∗(y). But by definition of the biconjugate, f∗∗(x0) ≥ yTx0 − f∗(y), i.e. yTx0 − f∗∗(x0) ≤ f∗(y). Contradiction!
Examples:
1. f(x) = ax− b: f∗(y) = b if y = a; +∞ otherwise.
2. f(x) = |x|: f∗(y) = 0 if |y| ≤ 1; +∞ otherwise.
3. f(x) = (c/2)x2 (c > 0): f∗(y) = y2/(2c).
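Example 3 can be verified by approximating the supremum in the definition on a grid (the grid range is chosen so that the maximizer x = y/c lies inside it):

```python
import numpy as np

# f(x) = (c/2) x^2; its conjugate should be f*(y) = y^2 / (2c).
c = 3.0
f = lambda x: 0.5 * c * x**2

xs = np.linspace(-50.0, 50.0, 200001)   # grid for the sup in the definition

def conjugate(y):
    return float(np.max(y * xs - f(xs)))

errs = [abs(conjugate(y) - y**2 / (2 * c)) for y in (-2.0, 0.0, 1.0, 4.0)]
print(max(errs) < 1e-3)   # True: grid sup matches the closed form
```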
7.3 Convex Optimization
The standard form of an optimization problem is
min f(x)
s.t. gi(x) ≤ 0, i = 1, ...,m (P )
hj(x) = 0, j = 1, ..., k
Definition 7.5 An optimization problem (P ) is convex if
1. the objective function f is convex.
2. the inequality constraint functions g1, ..., gm are convex.
3. there is either no equality constraint or only linear equality constraint.
Note the domain of the problem, D = dom(f) ∩ dom(g1) ∩ · · · ∩ dom(gm), is convex.
Definition 7.6 (Feasibility) The set C = {x ∈ dom(f) : gi(x) ≤ 0, i = 1, ..., m, hj(x) = 0, j = 1, ..., k} is called the feasible set. Any point x ∈ C is called a feasible solution. We say (P ) is feasible if C ≠ ∅.
Definition 7.7 (Optimality) The value p∗ = inf {f(x) : gi(x) ≤ 0, i = 1, ..., m, hj(x) = 0, j = 1, ..., k} is called the optimal value of (P ). If (P ) is infeasible, we set p∗ = +∞.
• We say (P ) is unbounded below if p∗ = −∞.
• We say (P ) is solvable if p∗ is finite and there exists a feasible solution x∗ such that p∗ = f(x∗). We call such an x∗ an optimal solution. The set of all optimal solutions is called the optimal set.
Epigraph form of the problem The standard problem (P ) is equivalent to the problem
minx∈Rn,t∈R
t
s.t. f(x)− t ≤ 0 (P ′)
gi(x) ≤ 0, i = 1, ...,m
hj(x) = 0, j = 1, ..., k
Note that
1. (P ′) is still convex if (P ) is convex
2. (x∗, t∗) is optimal to (P ′) if and only if x∗ is optimal to (P ) and t∗ = f(x∗)
Example
minx maxj=1,...,m (aTj x+ bj)
s.t. Cx = d
is equivalent to
minx,t t
s.t. aTi x+ bi − t ≤ 0, i = 1, ..., m
Cx = d
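The epigraph form is directly solvable as an LP. A small sketch using SciPy's linprog (the data A, b, C, d below are an illustrative instance, not from the notes):

```python
import numpy as np
from scipy.optimize import linprog

# min_x max_i (a_i^T x + b_i)  s.t. Cx = d, rewritten as an LP in (x, t).
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # rows a_i
b = np.array([0.0, 0.0, 2.0])
C = np.array([[1.0, 1.0]])
d = np.array([1.0])
n, m = 2, 3

cost = np.r_[np.zeros(n), 1.0]            # minimize t
A_ub = np.c_[A, -np.ones(m)]              # a_i^T x - t <= -b_i
A_eq = np.c_[C, np.zeros((1, 1))]         # Cx = d (t unconstrained)
res = linprog(cost, A_ub=A_ub, b_ub=-b, A_eq=A_eq, b_eq=d,
              bounds=[(None, None)] * (n + 1))

x_opt, t_opt = res.x[:n], res.x[n]
print(t_opt, np.max(A @ x_opt + b))       # t* equals the minimax value at x*
```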
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 8: Convex Programming – February 15
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email the TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Basics of Convex Programs
• Convex Theorem on Alternatives
• Lagrange Duality
References: Ben-Tal & Nemirovski, Chapter 3.1-3.3
8.1 Basics of Convex Programs
The standard form of an optimization problem
minx
f(x)
s.t. gi(x) ≤ 0, i = 1, ...,m (P )
hj(x) = 0, j = 1, ..., k
The optimal value of (P ) is
p∗ = +∞ if there is no feasible solution; otherwise
p∗ = inf {f(x) : gi(x) ≤ 0, ∀i, hj(x) = 0, ∀j}.
• (P ) is infeasible, if p∗ = +∞
• (P ) is unbounded below, if p∗ = −∞
• (P ) is solvable, if ∃ a feasible solution x∗ s.t. p∗ = f(x∗)
• (P ) is unattainable, if |p∗| < ∞ but ∄ feasible x∗ s.t. p∗ = f(x∗). For example, minx∈(0,1) ex
Given a solution x∗,
• x∗ is a global optimum for (P ) if x∗ is feasible and f(x∗) ≤ f(x), ∀x feasible
• x∗ is a local optimum for (P ) if x∗ is feasible and ∃r > 0, s.t. f(x∗) ≤ f(x), ∀ feasible x ∈B(x∗, r)
Proposition 8.1 For convex programs, a local optimum is a global optimum.
Proof: Let C denote the feasible set, and let x∗ be a local optimum. We want to show that ∀x ∈ C, f(x∗) ≤ f(x). Let z = εx∗ + (1− ε)x. Then z ∈ C ∩B(x∗, r) when ε is close enough to 1. Hence,
f(x∗) ≤ f(z) ≤ εf(x∗) + (1− ε)f(x)
i.e., (1− ε)f(x∗) ≤ (1− ε)f(x). Hence, f(x∗) ≤ f(x),∀x ∈ C.
Question: How to verify whether a solution x∗ is optimal?
In the linear program case, we have shown strong duality between (P ) & (D)
min cTx
(P ) s.t. Ax = b
x ≥ 0
max bT y
(D) s.t. AT y ≤ c
We know that
x∗ is optimal ⇔ Ax∗ = b, x∗ ≥ 0 (primal feasibility)
∃y∗, AT y∗ ≤ c (dual feasibility)
cTx∗ = bT y∗ (zero duality gap)
The strong duality is based on Farkas’ Lemma (or the separation theorem). As an analogy, we can derive duality and optimality conditions for general convex programs.
8.2 Convex Theorems on Alternatives
Consider the general form of convex program
minx∈X
f(x)
s.t. gi(x) ≤ 0, i = 1, ...,m (P )
where X is a convex set, f, g1, . . . , gm are convex functions.
Theorem 8.2 Assume g1, ..., gm satisfy the Slater condition: ∃x̄ ∈ X s.t. gi(x̄) < 0, i = 1, ..., m. Then exactly one of the following two systems is solvable (the other is empty).
(I) {x ∈ X : f(x) < 0, gi(x) ≤ 0, i = 1, ..., m}
(II) {λ ∈ Rm : λ ≥ 0, infx∈X [f(x) + ∑_{i=1}^m λigi(x)] ≥ 0}
Proof:
1. We first show that system (II) feasible ⇒ system (I) infeasible.
Suppose (I) is also feasible: ∃x0 s.t. f(x0) < 0, gi(x0) ≤ 0. Then
∀λ ≥ 0, infx∈X [f(x) + ∑_{i=1}^m λigi(x)] ≤ f(x0) + ∑_{i=1}^m λigi(x0) < 0
Contradiction!
2. We then show that system (I) infeasible ⇒ system (II) feasible.
Consider the following two sets:
S = {u = (u0, u1, ..., um) ∈ Rm+1 : ∃x ∈ X, f(x) ≤ u0, g1(x) ≤ u1, ..., gm(x) ≤ um}
T = {u = (u0, u1, ..., um) ∈ Rm+1 : u0 < 0, u1 ≤ 0, ..., um ≤ 0}
Note that S is convex (why?) and nonempty, T is convex and nonempty, and S ∩ T = ∅. By the separation theorem, ∃a = (a0, a1, ..., am) ∈ Rm+1, a ≠ 0, s.t.
supu∈T aTu ≤ infu∈S aTu
i.e.
sup {∑_{i=0}^m aiui : u0 < 0, ui ≤ 0, i = 1, ..., m} ≤ inf {∑_{i=0}^m aiui : u0 ≥ f(x), ui ≥ gi(x), i = 1, ..., m, x ∈ X}
Observe a ≥ 0 (otherwise the sup on the left would be +∞), so the left-hand side equals 0 and hence:
0 ≤ infx∈X [a0f(x) + a1g1(x) + ...+ amgm(x)]
Note that a0 > 0. Otherwise we would have (a1, ..., am) ≥ 0, (a1, ..., am) ≠ 0, and
infx∈X [a1g1(x) + ...+ amgm(x)] ≥ 0
However, by the Slater condition ∃x̄ ∈ X s.t. gi(x̄) < 0, ∀i = 1, ..., m, which implies
infx∈X [a1g1(x) + ...+ amgm(x)] < 0
Contradiction! Hence, setting λi = ai/a0, i = 1, ..., m, we have
λ ≥ 0 and infx∈X [f(x) + ∑_{i=1}^m λigi(x)] ≥ 0
i.e. system (II) is feasible.
Remark (relaxed Slater condition): The Slater condition can be relaxed to accommodate linear equalities: ∃x ∈ rint(X) s.t. gi(x) < 0 for all i = 1, ..., m such that gi is not affine (and gi(x) ≤ 0 for the affine ones).
Note that in the general case,
x∗ is optimal to (P ) ⇔ {x ∈ X : f(x) ≤ f(x∗), gi(x) ≤ 0, i = 1, ..., m} is feasible and
{x ∈ X : f(x) < f(x∗), gi(x) ≤ 0, i = 1, ..., m} is infeasible
⇔ {x ∈ X : f(x) ≤ f(x∗), gi(x) ≤ 0, i = 1, ..., m} is feasible and
{λ ∈ Rm : λ ≥ 0, infx∈X [f(x) + ∑λigi(x)] ≥ f(x∗)} is feasible
Observe that infx∈X [f(x) + ∑λigi(x)] ≤ f(x∗) for any λ ≥ 0, since x∗ itself is feasible. Therefore, the optimality of x∗ implies that there must exist λ∗ ≥ 0 such that infx∈X [f(x) + ∑λ∗i gi(x)] = f(x∗).
8.3 Lagrange Duality
Definition 8.3 The function L(x, λ) = f(x) + ∑_{i=1}^m λigi(x) is called the Lagrange function. It induces two related functions:
L(λ) = infx∈X L(x, λ) and L(x) = supλ≥0 L(x, λ)
The Lagrange dual of the problem
(P ) min f(x) s.t. gi(x) ≤ 0, i = 1, ..., m, x ∈ X
is defined as
(D) max L(λ) s.t. λ ≥ 0
Theorem 8.4 (Duality Theorem) Denote by Opt(P ) and Opt(D) the optimal values of (P ) and (D). We have
(a) (Weak Duality) ∀λ ≥ 0, L(λ) ≤ Opt(P ). Moreover, Opt(D) ≤ Opt(P ).
(b) (Strong Duality) If (P ) is convex, bounded below, and satisfies the Slater condition, then (D) is solvable, and
Opt(D) = Opt(P )
Proof:
(a) If (P ) is infeasible, Opt(P ) = +∞, so Opt(D) ≤ Opt(P ) always holds. If (P ) is feasible, let x0 be a feasible solution, i.e. gi(x0) ≤ 0, x0 ∈ X. Then
∀λ ≥ 0, L(λ) = infx∈X [f(x) + ∑λigi(x)] ≤ f(x0) + ∑λigi(x0) ≤ f(x0)
Hence,
∀λ ≥ 0, L(λ) ≤ inf {f(x0) : x0 ∈ X, gi(x0) ≤ 0, i = 1, ..., m} = Opt(P )
Furthermore, Opt(D) = supλ≥0 L(λ) ≤ Opt(P ).
(b) By optimality, we know that {x ∈ X : f(x) < Opt(P ), gi(x) ≤ 0, i = 1, ..., m} has no solution. By the convex theorem on alternatives (applied to f − Opt(P )) and the Slater condition, the system {λ ≥ 0 : L(λ) ≥ Opt(P )} has a solution. This implies that Opt(D) = supλ≥0 L(λ) ≥ Opt(P ). Combining with part (a), we have Opt(P ) = Opt(D).
Example: The Slater condition does not hold in the following example:
min(x,y)∈X e−x s.t. x2/y ≤ 0
where X = {(x, y) : y > 0}. Note that there exists no (x, y) ∈ X such that x2/y < 0.
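When the Slater condition does hold, the dual can be computed explicitly. A minimal numerical sketch (the problem instance is chosen here, not from the notes): min x² s.t. 1 − x ≤ 0 over X = R, with Opt(P) = 1 at x* = 1 and Slater point x = 2.

```python
import numpy as np

# Primal: min x^2  s.t.  1 - x <= 0.  Opt(P) = 1 at x* = 1.
# Dual function: L(lam) = inf_x [x^2 + lam*(1 - x)] = lam - lam^2/4
# (the inner minimizer is x = lam/2).
lams = np.linspace(0.0, 10.0, 100001)
dual_vals = lams - lams**2 / 4

opt_d = dual_vals.max()
lam_star = lams[dual_vals.argmax()]
print(opt_d, lam_star)   # strong duality: Opt(D) = Opt(P) = 1, at lam* = 2
```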
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 9: Optimality Conditions – February 20
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email the TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Illustrations of Lagrange Duality
• Saddle Point Formulation
• Optimality Conditions (KKT conditions)
References: Ben-Tal & Nemirovski, Chapter 3.2
9.1 Recall
• Convex program:
min f(x)
s.t. gi(x) ≤ 0, i = 1, ...,m (P )
x ∈ X
where f(x), g1, . . . , gm are convex and X is convex.
• Lagrange function:
L(x, λ) = f(x) + ∑_{i=1}^m λigi(x)
where λ = (λ1, ..., λm) is called the Lagrange multiplier.
• Lagrange dual function:
L(λ) := infx∈X L(x, λ) = infx∈X [f(x) + ∑_{i=1}^m λigi(x)]
– Note that ∀λ ≥ 0, L(λ) ≤ Opt(P )
• Lagrange dual program:
maxλ≥0 L(λ)
We showed that Opt(D) = Opt(P ) when the (relaxed) Slater condition holds.
9.2 Illustrations of Lagrange Duality
• Linear Program Duality
min cTx
(P ) s.t. Ax = b
x ≥ 0
max bT y
(D) s.t. AT y ≤ c
First, rewrite the original problem as:
min cTx
s.t. Ax− b ≤ 0 (λ1)
b−Ax ≤ 0 (λ2)
x ≥ 0
Introducing the multipliers λ = (λ1, λ2) ≥ 0, the Lagrange function is
L(x, λ) = cTx+ λT1 (Ax− b) + λT2 (b−Ax) = (c+ATλ1 −ATλ2)Tx+ bT (λ2 − λ1)
The Lagrange dual function is
L(λ) = infx≥0 [(c+ATλ1 −ATλ2)Tx+ bT (λ2 − λ1)] = bT (λ2 − λ1) if c+ATλ1 −ATλ2 ≥ 0; −∞ otherwise.
The Lagrange dual is maxλ≥0 L(λ), which is equivalent to
maxλ1,λ2≥0 bT (λ2 − λ1) s.t. c+AT (λ1 − λ2) ≥ 0
Substituting y = λ2 − λ1, the formulation above is also equivalent to:
maxy bT y s.t. AT y ≤ c
• Quadratic Program Duality:
minx (1/2)xTQx+ qTx
(P ) s.t. Ax ≥ b
maxy,λ −(1/2)yTQy + bTλ
(D) s.t. ATλ−Qy = q
λ ≥ 0
where Q ≻ 0.
The Lagrange function is L(x, λ) = (1/2)xTQx+ qTx+ λT (b−Ax).
The Lagrange dual function is L(λ) = infx [(1/2)xTQx+ (q −ATλ)Tx+ bTλ].
The infimum is achieved when Qx = ATλ− q, i.e. x = Q−1(ATλ− q), giving
L(λ) = −(1/2)(ATλ− q)TQ−1(ATλ− q) + bTλ
The Lagrange dual is
maxλ≥0 L(λ) ⇐⇒ maxλ≥0 −(1/2)(ATλ− q)TQ−1(ATλ− q) + bTλ
⇐⇒ maxλ≥0,y −(1/2)yTQy + bTλ s.t. ATλ−Qy = q
In both cases discussed above, strong duality holds true.
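A numerical check of the quadratic-program duality (the instance Q, q, A, b below is chosen here; the primal optimum is written in closed form since the single constraint is active at the optimum):

```python
import numpy as np

# Primal: min 0.5 x^T Q x + q^T x  s.t.  A x >= b  (one constraint, Q ≻ 0).
Q = np.array([[2.0, 0.0], [0.0, 2.0]])
q = np.array([-1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([3.0])
Qinv = np.linalg.inv(Q)

def dual(lam):
    # L(lam) = -0.5 (A^T lam - q)^T Q^{-1} (A^T lam - q) + b^T lam
    v = A.T @ lam - q
    return float(-0.5 * v @ Qinv @ v + b @ lam)

lams = np.linspace(0.0, 10.0, 10001)
opt_d = max(dual(np.array([l])) for l in lams)

# The constraint x1 + x2 >= 3 is active; by symmetry x* = (1.5, 1.5).
x_star = np.array([1.5, 1.5])
opt_p = float(0.5 * x_star @ Q @ x_star + q @ x_star)
print(opt_p, opt_d)   # strong duality: both equal 1.5
```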
9.3 Saddle Point Formulation
Recall
(P ) : minx∈X {f(x) : gi(x) ≤ 0, i = 1, ..., m} = minx∈X L(x) = minx∈X maxλ≥0 L(x, λ)
(D) : maxλ≥0 L(λ) = maxλ≥0 minx∈X L(x, λ)
Definition 9.1 (Saddle point) We call (x∗, λ∗), where x∗ ∈ X, λ∗ ≥ 0, a saddle point of L(x, λ) if L(x, λ∗) ≥ L(x∗, λ∗) ≥ L(x∗, λ), ∀x ∈ X, λ ≥ 0.
Theorem 9.2 (x∗, λ∗) is a saddle point of L(x, λ) if and only if x∗ is an optimal solution to (P ), λ∗ is an optimal solution to (D), and Opt(P ) = Opt(D).
Proof:
(⇒) Assume (x∗, λ∗) is a saddle point,
L(x, λ∗) ≥ L(x∗, λ∗) ≥ L(x∗, λ), ∀x ∈ X,λ ≥ 0
Opt(P ) = infx∈X L(x) ≤ L(x∗) = supλ≥0 L(x∗, λ) = L(x∗, λ∗)
Opt(D) = supλ≥0 L(λ) ≥ L(λ∗) = infx∈X L(x, λ∗) = L(x∗, λ∗)
Hence, Opt(P ) ≤ L(x∗, λ∗) ≤ Opt(D). Combined with weak duality, we have Opt(P ) = Opt(D). Hence, Opt(P ) = L(x∗) = L(x∗, λ∗) = L(λ∗) = Opt(D). Thus, x∗ solves (P ), λ∗ solves (D), and Opt(P ) = Opt(D).
(⇐) Assume x∗ and λ∗ are optimal solutions to (P ) and (D), and Opt(P ) = Opt(D). By optimality,
Opt(P ) = L(x∗) = supλ≥0 L(x∗, λ) ≥ L(x∗, λ∗)
Opt(D) = L(λ∗) = infx∈X L(x, λ∗) ≤ L(x∗, λ∗)
Since Opt(D) = Opt(P ),
supλ≥0 L(x∗, λ) = L(x∗, λ∗) = infx∈X L(x, λ∗)
i.e. (x∗, λ∗) is a saddle point of L(x, λ).
Remark:
• The above theorem holds true for any saddle function L(x, λ) and its induced primal anddual problems, not limited to the Lagrange function.
• Saddle point always exists for the Lagrange function of a solvable convex program satisfyingthe Slater condition. More generally, the existence of saddle points is guaranteed for convex-concave saddle functions over convex compact domains (Minimax Theorem).
9.4 Optimality Conditions
Theorem 9.3 : Let x∗ ∈ X
(i) (sufficient condition) If there exists λ∗ ≥ 0, such that (x∗, λ∗) is a saddle point of L(x, λ),then x∗ is an optimal solution to (P ).
(ii) (necessary condition) Assume (P ) is convex and satisfies the Slater condition. If x∗ is an optimal solution to (P ), then ∃λ∗ ≥ 0 s.t. (x∗, λ∗) is a saddle point of L(x, λ).
Proof:
(i) (sufficient part) Follows from previous theorem.
(ii) (necessary part) By strong duality theorem, ∃ optimal dual solution λ∗ ≥ 0, such thatOpt(P ) = Opt(D). Hence, following from the previous theorem, (x∗, λ∗) is a saddle point ofL(x, λ).
Remark Note that the sufficient condition holds for general constrained program, not necessarilyconvex ones. However, they are far from being necessary and hardly satisfied.
Definition 9.4 (Normal Cone) Let X ⊂ Rn and x ∈ X. The normal cone of X at x, denoted NX(x), is the set
NX(x) = {h ∈ Rn : hT (y − x) ≥ 0, ∀y ∈ X}
Note that NX(x) is a closed convex cone.
Examples:
• x ∈ int(X): NX(x) = {0}
• x ∈ rint(X): NX(x) = L⊥, where L is the linear subspace parallel to Aff(X)
• X = {x : aTi x ≥ bi, i = 1, ..., m}, x ∉ int(X): NX(x) = Cone({ai : aTi x = bi})
Theorem 9.5 Let (P ) be a convex program and let x∗ be a feasible solution. Assume f, g1, ..., gmare differentiable at x∗.
(a) (sufficient condition) If there exists λ∗ ≥ 0 satisfying
1) ∇f(x∗) + ∑_{i=1}^m λ∗i∇gi(x∗) ∈ NX(x∗)
2) λ∗i gi(x∗) = 0, ∀i = 1, ..., m
then x∗ is an optimal solution to (P ) (Karush–Kuhn–Tucker, KKT conditions (1951)).
(b) (necessary condition) If (P ) also satisfies the Slater condition, then the above conditions are also necessary for x∗ to be optimal.
Proof:
(a) (sufficient part) Under the KKT conditions:
1) implies that L(x, λ∗) ≥ L(x∗, λ∗), ∀x ∈ X, because
L(x, λ∗) ≥ L(x∗, λ∗) +∇xL(x∗, λ∗)T (x− x∗)
and the last term is non-negative since ∇xL(x∗, λ∗) ∈ NX(x∗).
2) together with the feasibility of x∗ implies L(x∗, λ∗) ≥ L(x∗, λ), ∀λ ≥ 0, because
L(x∗, λ∗) = f(x∗) + ∑_{i=1}^m λ∗i gi(x∗) = f(x∗) ≥ f(x∗) + ∑_{i=1}^m λigi(x∗) = L(x∗, λ)
Hence (x∗, λ∗) is a saddle point of L(x, λ).
(b) (necessary part) From the previous theorem, there exists λ∗ ≥ 0 such that (x∗, λ∗) is a saddle point of L(x, λ). We have
L(x, λ∗) ≥ L(x∗, λ∗), ∀x ∈ X ⇒ (y − x∗)T∇xL(x∗, λ∗) = limε→0+ [L(x∗ + ε(y − x∗), λ∗)− L(x∗, λ∗)]/ε ≥ 0, ∀y ∈ X
⇒ ∇xL(x∗, λ∗) ∈ NX(x∗)
L(x∗, λ∗) ≥ L(x∗, λ), ∀λ ≥ 0 ⇒ ∑_{i=1}^m λ∗i gi(x∗) ≥ ∑_{i=1}^m λigi(x∗), ∀λ ≥ 0
⇒ ∑_{i=1}^m λ∗i gi(x∗) ≥ 0 (taking λ = 0; note we also have λ∗i gi(x∗) ≤ 0)
⇒ λ∗i gi(x∗) = 0, ∀i
This leads to the KKT condition.
Remark To summarize, (x∗, λ∗) is an optimal primal-dual pair if it satisfies:
• Primal feasibility: x∗ ∈ X, gi(x∗) ≤ 0
• Dual feasibility: λ∗ ≥ 0
• Lagrange optimality: ∇f(x∗) + ∑_{i=1}^m λ∗i∇gi(x∗) ∈ NX(x∗)
• Complementary slackness: λ∗i gi(x∗) = 0,∀i = 1, ...,m
Example: Given ai > 0, i = 1, ..., n, solve the problem
minx ∑_{i=1}^n ai/xi
s.t. x > 0, ∑_{i=1}^n xi ≤ 1
The Lagrange function is L(x, λ) = ∑_{i=1}^n ai/xi + λ(∑_{i=1}^n xi − 1). The KKT optimality conditions yield
x∗i > 0, ∑_{i=1}^n x∗i ≤ 1 (primal feasibility)
λ∗ ≥ 0 (dual feasibility)
−ai/(x∗i )2 + λ∗ = 0 (Lagrange optimality)
λ∗(∑_{i=1}^n x∗i − 1) = 0 (complementary slackness)
⇒ x∗i = √(ai/λ∗) and ∑_{i=1}^n x∗i = 1 ⇒
λ∗ = (∑_{i=1}^n √ai)2, x∗i = √ai / ∑_{j=1}^n √aj , i = 1, ..., n
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 10: Minimax Problems – February 22
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email the TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Saddle Point and Minimax Problems
• Minimax Theorem (Sion-Kakutani Theorem)
References: Ben-Tal & Nemirovski, Chapter 3.4
10.1 Saddle Point and Minimax Problems
Let X ⊆ Rn and Y ⊆ Rm be two non-empty sets. Let L(x, y) : X × Y → R be a real-valued function. We say (x∗, y∗) is a saddle point of L(x, y) if
L(x, y∗) ≥ L(x∗, y∗) ≥ L(x∗, y), ∀x ∈ X, y ∈ Y
Game Theory Interpretation: This can be viewed as a two player zero-sum game.
• Player 1 chooses x ∈ X
• Player 2 chooses y ∈ Y
• After both players reveal their decisions, player 1 pays to player 2 the amount L(x, y).
• Player 1 is interested in minimizing his loss, while player 2 is interested in maximizing his gain
• The strategy (x∗, y∗) is called a (Nash) equilibrium if no player has an incentive to change his chosen strategy after considering the opponent’s action.
Question: What should players do to optimize their profit?
Considering the following two situations
• Situation I: Player 1 chooses x first, and player 2 chooses y knowing the choice of player 1:
minx∈X L(x) := maxy∈Y L(x, y)
• Situation II: Player 2 chooses y first, and player 1 chooses x knowing the choice of player 2:
maxy∈Y L(y) := minx∈X L(x, y)
Two Induced Problems:
(P ) : minx∈X maxy∈Y L(x, y) = minx∈X L(x)
(D) : maxy∈Y minx∈X L(x, y) = maxy∈Y L(y)
Intuitively, it is better for players to play second. Indeed, there is an inequality:
Proposition 10.1
supy∈Y infx∈X L(x, y) ≤ infx∈X supy∈Y L(x, y).
Proof: This is because
∀x ∈ X, ∀y ∈ Y : infx′∈X L(x′, y) ≤ L(x, y)
⇒ ∀x ∈ X : supy∈Y infx′∈X L(x′, y) ≤ supy∈Y L(x, y)
⇒ supy∈Y infx∈X L(x, y) ≤ infx∈X supy∈Y L(x, y)
On the other hand, if (x∗, y∗) is a saddle point,
L(x, y∗) ≥ L(x∗, y∗) ≥ L(x∗, y), ∀x ∈ X, y ∈ Y
⇒ infx∈X L(x, y∗) ≥ L(x∗, y∗) ≥ supy∈Y L(x∗, y)
⇒ supy∈Y infx∈X L(x, y) ≥ infx∈X supy∈Y L(x, y)
Hence, if (x∗, y∗) is a saddle point, then
supy∈Y infx∈X L(x, y) = infx∈X supy∈Y L(x, y)
Remark:
• If there exists a saddle point, there is no advantage to the players of knowing the opponent’schoice. Moreover, the equilibrium, minimax and maximin all give the same solution.
• Saddle points do not always exist. For example, L(x, y) = (x− y)2, X = [0, 1], Y = [0, 1]:
L(x) = supy∈[0,1] L(x, y) = max{x2, (x− 1)2}, so infx∈X supy∈Y L(x, y) = 1/4
L(y) = infx∈[0,1] L(x, y) = 0, so supy∈Y infx∈X L(x, y) = 0
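The 1/4-versus-0 gap is easy to reproduce on a grid:

```python
import numpy as np

# L(x, y) = (x - y)^2 on [0,1]^2: inf_x sup_y L = 1/4, but sup_y inf_x L = 0.
xs = np.linspace(0.0, 1.0, 1001)
L = (xs[:, None] - xs[None, :])**2        # L[i, j] = (x_i - y_j)^2

inf_sup = L.max(axis=1).min()             # min over x of max over y
sup_inf = L.min(axis=0).max()             # max over y of min over x
print(inf_sup, sup_inf)                   # approximately 0.25, exactly 0.0
```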
• As already shown in the last lecture: a saddle point exists if and only if the induced problems (P ) and (D) are both solvable and their optimal values are equal. Moreover, the saddle points are exactly the pairs (x∗, y∗) where x∗ is optimal to (P ) and y∗ is optimal to (D).
10.2 Existence of Saddle Points
Lemma 10.2 (Minimax Lemma) Let fi(x), i = 1, ..., m, be convex and continuous on a convex compact set X. Then
minx∈X max1≤i≤m fi(x) = minx∈X ∑_{i=1}^m λ∗i fi(x)
for some λ∗ ∈ Rm such that λ∗i ≥ 0, i = 1, ..., m, and ∑_{i=1}^m λ∗i = 1.
Remark: Let L(x, λ) = ∑_{i=1}^m λifi(x) and let Δ = {λ : λ ≥ 0, ∑_{i=1}^m λi = 1} be the standard simplex. The above lemma implies that L(x, λ) has a saddle point on X ×Δ, since
maxλ∈Δ minx∈X ∑λifi(x) ≥ minx∈X ∑λ∗i fi(x) = minx∈X maxλ∈Δ ∑λifi(x)
and the reverse inequality always holds.
Proof: Consider the epigraph form of the problem minx∈X max1≤i≤m fi(x):
minx,t t
s.t. fi(x)− t ≤ 0, i = 1, ..., m
(x, t) ∈ Xt
where Xt = {(x, t) : x ∈ X, t ∈ R}. The optimal value is t∗ = minx∈X max1≤i≤m fi(x). Note that the above problem satisfies the Slater condition and is solvable (why?). Hence, there exist (x∗, t∗) ∈ Xt and λ∗ ≥ 0 such that (x∗, t∗;λ∗) is a saddle point of the Lagrange function
L(x, t;λ) = t+ ∑_{i=1}^m λi(fi(x)− t) = (1− ∑_{i=1}^m λi)t+ ∑_{i=1}^m λifi(x)
Therefore:
∂L/∂t (x∗, t∗;λ∗) = 1− ∑_{i=1}^m λ∗i = 0 and ∑_{i=1}^m λ∗i (fi(x∗)− t∗) = 0
⇒ ∑_{i=1}^m λ∗i = 1 and ∑_{i=1}^m λ∗i fi(x∗) = t∗
This implies that ∃λ∗i ≥ 0 with ∑_{i=1}^m λ∗i = 1 such that
minx∈X max1≤i≤m fi(x) = t∗ = min(x,t)∈Xt L(x, t;λ∗) = minx∈X ∑_{i=1}^m λ∗i fi(x)
Theorem 10.3 (Minimax Theorem, Sion–Kakutani)
Let X and Y be two convex compact sets. Let L(x, y) : X × Y → R be a continuous function that is convex in x ∈ X for every fixed y ∈ Y and concave in y ∈ Y for every fixed x ∈ X. Then L(x, y) has a saddle point on X × Y , and
minx∈X maxy∈Y L(x, y) = maxy∈Y minx∈X L(x, y)
Proof: We shall prove that
(P ) : minx∈X L(x), where L(x) := maxy∈Y L(x, y)
(D) : maxy∈Y L(y), where L(y) := minx∈X L(x, y)
are both solvable with equal optimal values.
are both solvable with equal optimal values.
1. Since X and Y are compact and L(x, y) is continuous, both L(x) and L(y) are continuous and attain their optima on the compact sets. By weak duality, it is sufficient to show Opt(D) ≥ Opt(P ), i.e.
maxy∈Y minx∈X L(x, y) ≥ minx∈X maxy∈Y L(x, y)
Consider the sets X(y) = {x ∈ X : L(x, y) ≤ Opt(D)}.
2. Note that X(y) is nonempty, compact and convex for any y ∈ Y . We show that every finite collection of these sets has a point in common. Suppose ∃y1, ..., ym s.t. X(y1) ∩ ... ∩X(ym) = ∅. This implies:
minx∈X maxi=1,...,m L(x, yi) > Opt(D)
By the Minimax Lemma, ∃λ∗i ≥ 0 with ∑_{i=1}^m λ∗i = 1 such that
minx∈X maxi=1,...,m L(x, yi) = minx∈X ∑_{i=1}^m λ∗iL(x, yi) ≤ minx∈X L(x, ∑_{i=1}^m λ∗i yi) = L(ȳ) ≤ Opt(D)
where ȳ = ∑_{i=1}^m λ∗i yi and the first inequality is by concavity of L(x, ·). This leads to a contradiction! Hence, every finite collection of the X(y) has a common point.
3. By Helly’s theorem, all of the sets {X(y) : y ∈ Y } have a common point. Therefore, ∃x∗ ∈ X with x∗ ∈ X(y), ∀y ∈ Y , which means L(x∗, y) ≤ Opt(D), ∀y ∈ Y . Hence Opt(P ) ≤ Opt(D).
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 11: Convex Programming, Part III – February 27
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email the TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Some Remarks on Minimax Problem
• Some Remarks on Optimality Conditions
• Polynomial Solvability of Convex Programs
11.1 Minimax Problem
Recall in the last lecture, we have discussed
[Minimax Theorem] If X and Y are convex compact sets and L(x, y) : X × Y → R is convex–concave and continuous, then L(x, y) has a saddle point on X × Y , and
minx∈X maxy∈Y L(x, y) = maxy∈Y minx∈X L(x, y)
Remark 1: Based on the proof, we can immediately see that some of the assumptions can be relaxed. Here is a general result:
Theorem 11.1 Let X and Y be convex, with at least one of them compact. Let L(x, y) be lower semi-continuous (l.s.c.) and quasi-convex in x ∈ X, and upper semi-continuous (u.s.c.) and quasi-concave in y ∈ Y . Then
minx∈X maxy∈Y L(x, y) = maxy∈Y minx∈X L(x, y)
Remark 2: Compactness is sufficient but not necessary, as the following examples show:

1. $\min_x\max_y(x+y)=+\infty\ne-\infty=\max_y\min_x(x+y)$

2. $\min_x\max_{0\le y\le1}(x+y)=-\infty=\max_{0\le y\le1}\min_x(x+y)$

3. $\min_x\max_{y\le1}(x+y)=-\infty=\max_{y\le1}\min_x(x+y)$
Remark 3: Minimax problems can also arise from Fenchel duality:

Let $f(x)$ be closed and convex; then $f=f^{**}$, so we can equivalently write $f$ as
$$f(x)=\max_{y\in\mathbb{R}^n}\ y^Tx-f^*(y)$$
Hence
$$\min_{x\in X}f(x)=\min_{x\in X}\max_{y\in\mathbb{R}^n}\ y^Tx-f^*(y)$$
Remark 4: Alternating optimization does not necessarily converge to a saddle point. Consider the saddle point problem
$$\min_{-1\le x\le1}\ \max_{-1\le y\le1}\ xy$$
The problem has a unique saddle point: $(0,0)$. Start with any $(x_0,y_0)$ with $x_0>0$ and alternate maximization over $y$ with minimization over $x$:
$$(x_0,y_0)\Rightarrow(x_0,1)\Rightarrow(-1,1)\Rightarrow(-1,-1)\Rightarrow(1,-1)\Rightarrow(1,1)\Rightarrow(-1,1)\Rightarrow\dots$$
The iterates cycle forever and never converge to the saddle point.
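This cycling is easy to reproduce numerically. Below is a minimal sketch (added for illustration, not part of the original notes) that runs alternating best responses for $L(x,y)=xy$ on $[-1,1]\times[-1,1]$; the helper names `best_y` and `best_x` are our own.

```python
# Alternating best responses on L(x, y) = x*y over [-1, 1] x [-1, 1].
# The unique saddle point is (0, 0), but the iterates cycle forever.

def best_y(x):
    # argmax over y in [-1, 1] of x*y (ties broken at y = 1)
    return 1.0 if x >= 0 else -1.0

def best_x(y):
    # argmin over x in [-1, 1] of x*y (ties broken at x = -1)
    return -1.0 if y >= 0 else 1.0

x, y = 0.5, 0.2          # any start with x0 > 0
trace = []
for _ in range(4):
    y = best_y(x)        # maximize over y
    trace.append((x, y))
    x = best_x(y)        # minimize over x
    trace.append((x, y))

print(trace)
# [(0.5, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0),
#  (1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]
```

The iterates settle into the four-cycle $(-1,1)\Rightarrow(-1,-1)\Rightarrow(1,-1)\Rightarrow(1,1)$ and never reach $(0,0)$.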
11.2 Optimality Conditions for Convex Programs
• General Constrained Differentiable Case:
$$\min_{x\in X}\ f(x)\quad\text{s.t. }g_i(x)\le0,\ i=1,\dots,m \tag{P}$$
We have already shown that for a feasible solution $x^*$ to be optimal:
$$x^*\in X \text{ is optimal for (P)}\ \overset{\text{Slater condition}}{\Longleftrightarrow}\ \exists\lambda^*\ge0 \text{ s.t. }\begin{cases}\text{(a)}\ \nabla f(x^*)+\sum_{i=1}^m\lambda_i^*\nabla g_i(x^*)\in N_X(x^*)\\[2pt] \text{(b)}\ \lambda_i^*g_i(x^*)=0,\ \forall i=1,\dots,m\end{cases} \tag{11.1}$$
• Simple Constrained Differentiable Case:
$$\min_{x\in X}\ f(x)$$
Proposition 11.2 Assume X is convex and f(x) is convex and differentiable at x∗. Then
x∗ ∈ X is optimal ⇔ ∇f(x∗) ∈ NX(x∗), i.e. ∇f(x∗)T (x− x∗) ≥ 0,∀x ∈ X
Proof:

($\Leftarrow$) $f(x)\ge f(x^*)+\nabla f(x^*)^T(x-x^*)\ge f(x^*)$, $\forall x\in X$

($\Rightarrow$) $\nabla f(x^*)^T(x-x^*)=\lim_{\varepsilon\to0^+}\frac{f(x^*+\varepsilon(x-x^*))-f(x^*)}{\varepsilon}\ge0$
Remark

– If $x^*\in\mathrm{int}(X)$, then $x^*$ is optimal $\Leftrightarrow\nabla f(x^*)=0$

– (Unconstrained case): If $X=\mathbb{R}^n$, then $x^*$ is optimal $\Leftrightarrow\nabla f(x^*)=0$

– For general (non-convex) optimization problems, $\nabla f(x^*)=0$ is only a necessary but not sufficient condition for $x^*$ to be optimal.
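A quick numeric illustration of Proposition 11.2 (an added sketch, not from the notes): take $f(x)=(x-2)^2$ on $X=[0,1]$. The constrained minimizer $x^*=1$ has $f'(1)=-2\ne0$, yet the variational inequality $\nabla f(x^*)^T(x-x^*)\ge0$ holds on all of $X$.

```python
# Check nabla f(x*)^T (x - x*) >= 0 for all x in X = [0, 1],
# with f(x) = (x - 2)^2, whose constrained minimizer is x* = 1.

def fprime(x):
    return 2.0 * (x - 2.0)

grid = [i / 100.0 for i in range(101)]   # discretization of X = [0, 1]

# x* = 1 satisfies the optimality condition even though f'(1) = -2 != 0
ok = all(fprime(1.0) * (x - 1.0) >= 0 for x in grid)
print(ok)    # True

# a non-optimal point such as x = 0.5 violates it
bad = all(fprime(0.5) * (x - 0.5) >= 0 for x in grid)
print(bad)   # False
```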
• Nondifferentiable Case:
Proposition 11.3 Assume f(x) is convex on X, and x∗ ∈ rint(dom(f)), then
x∗ ∈ X is optimal ⇔ ∃g ∈ ∂f(x∗), s.t. gT (x− x∗) ≥ 0, ∀x ∈ X
Proof:

($\Leftarrow$) $f(x)\ge f(x^*)+g^T(x-x^*)\ge f(x^*)$, $\forall x\in X$

($\Rightarrow$) $\max_{g\in\partial f(x^*)}g^T(x-x^*)=f'(x^*;x-x^*)=\lim_{\varepsilon\to0^+}\frac{f(x^*+\varepsilon(x-x^*))-f(x^*)}{\varepsilon}\ge0$
Remark In the unconstrained case when X = Rn, x∗ is optimal ⇔ 0 ∈ ∂f(x∗)
11.3 Solving Convex Programs
$$\min\ f(x)\quad\text{s.t. }g_i(x)\le0,\ i=1,\dots,m,\quad x\in X \tag{P}$$
In practice, to solve a problem means to find an "approximate" solution to (P) with a small inaccuracy $\varepsilon>0$.
Measure of accuracy of an approximate solution $x$: some function $\varepsilon(x)$ such that $\varepsilon(x)\ge0$ and $\varepsilon(x)\to0$ as $x\to x^*$. For instance:

(i) $\varepsilon(x)=\inf_{x^*\in X_{\mathrm{opt}}}\|x-x^*\|$, where $X_{\mathrm{opt}}$ is the optimal set.

(ii) $\varepsilon(x)=f(x)-\mathrm{Opt}(P)$, where $\mathrm{Opt}(P)$ is the optimal value.

(iii) $\varepsilon(x)=\max\big(f(x)-\mathrm{Opt}(P),\ \max_{1\le i\le m}[g_i(x)]_+\big)$, where $u_+=\max(u,0)$.
Black-box oriented numerical method: We assume the objective and constraints can be accessed through oracles.

• zero-order oracle: $\mathcal O=\big(f(x),g_1(x),\dots,g_m(x)\big)$

• first-order oracle: $\mathcal O=\big(\partial f(x),\partial g_1(x),\dots,\partial g_m(x)\big)$

• second-order oracle: $\mathcal O=\big(\nabla^2f(x),\nabla^2g_1(x),\dots,\nabla^2g_m(x)\big)$

• separation oracle: given $x$, either reports $x\in X$ or returns a separator, i.e. a vector $e\ne0$ such that $e^Tx\ge\sup_{y\in X}e^Ty$. Note that when $x\notin\mathrm{int}(X)$, such a separator does exist.
Complexity of a numerical method $M$: given an input $\varepsilon>0$ and a problem instance $p$,

• oracle complexity: number of oracle calls required to solve the problem up to accuracy $\varepsilon>0$

• arithmetic complexity: number of arithmetic (bit-wise) operations required to solve the problem up to accuracy $\varepsilon>0$
A solution method $M$ for a family $\mathcal P$ of problems is called polynomial if $\forall p\in\mathcal P$, the arithmetic complexity satisfies
$$\mathrm{Compl}_M(\varepsilon,p)\ \le\ O(1)\underbrace{[\dim(p)]^{\alpha}}_{\text{polynomial of size}}\cdot\underbrace{\ln(V(p)/\varepsilon)}_{\text{number of accuracy digits}}$$
where $V(p)$ is some data-dependent quantity.

We say the family $\mathcal P$ of problems is polynomially solvable, i.e. computationally tractable, if it admits polynomial solution methods.
11.4 Convex Problems are Polynomially Solvable
Illustration: One-dimensional Case
$$\min_{x\in[a,b]}\ f(x)$$
• Zero-order line search: Assume there exists a unique minimizer
– Initialize a localizer G1 = [a, b] for x∗
– At each iteration, choose $a_t<b_t$ in the current localizer $G_t$ and update
$$G_{t+1}\leftarrow\begin{cases}[a,b_t]\cap G_t, & \text{if }f(a_t)\le f(b_t)\\ [a_t,b]\cap G_t, & \text{if }f(a_t)>f(b_t)\end{cases}$$
If we choose $a_t,b_t$ to split $G_t$ into three pieces of equal length, then $|G_{t+1}|=\frac23|G_t|$, and we get linear convergence.
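The zero-order scheme above can be sketched in a few lines (an illustrative implementation, not from the notes); here $a_t,b_t$ trisect the current localizer, so each step keeps two thirds of it:

```python
# Zero-order line search for a convex (unimodal) f on [a, b]:
# trisect the localizer at a_t, b_t and keep the two thirds
# that must contain the minimizer, so |G_{t+1}| = (2/3)|G_t|.

def zero_order_search(f, a, b, tol=1e-9, max_iter=200):
    for _ in range(max_iter):
        if b - a <= tol:
            break
        at = a + (b - a) / 3.0
        bt = b - (b - a) / 3.0
        if f(at) <= f(bt):
            b = bt           # minimizer lies in [a, b_t]
        else:
            a = at           # minimizer lies in [a_t, b]
    return (a + b) / 2.0

x = zero_order_search(lambda t: (t - 0.3) ** 2, -1.0, 2.0)
print(x)   # approximately 0.3
```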
• First-order line search (bisection)
– Initialize a localizer G1 = [−R,R] ⊃ [a, b]
– At each iteration, compute the midpoint $c_t$ of $G_t=[a_t,b_t]$:

if $c_t\notin[a,b]$,
$$G_{t+1}=\begin{cases}[a_t,c_t], & \text{if }c_t>b\\ [c_t,b_t], & \text{if }c_t<a\end{cases}$$
if $c_t\in[a,b]$ and $f'(c_t)\ne0$,
$$G_{t+1}=\begin{cases}[a_t,c_t], & \text{if }f'(c_t)>0\\ [c_t,b_t], & \text{if }f'(c_t)<0\end{cases}$$
otherwise ($c_t\in[a,b]$ and $f'(c_t)=0$), $c_t$ is optimal.

Note that $x^*\in G_t$ for all $t$ and $|G_{t+1}|=\frac12|G_t|$, so we again get linear convergence.
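The first-order scheme can be sketched similarly (illustrative code, not from the notes); it bisects the localizer $G_t\subseteq[-R,R]\supseteq[a,b]$, cutting either by feasibility or by the sign of $f'$:

```python
# First-order line search (bisection) for convex differentiable f on [a, b],
# with localizer G_1 = [-R, R] containing [a, b].

def bisection(fprime, a, b, R, tol=1e-12, max_iter=200):
    lo, hi = -R, R
    for _ in range(max_iter):
        c = (lo + hi) / 2.0       # midpoint of the localizer
        if c > b:
            hi = c                # c infeasible: cut the right part
        elif c < a:
            lo = c                # c infeasible: cut the left part
        elif fprime(c) > 0:
            hi = c                # minimizer is to the left of c
        elif fprime(c) < 0:
            lo = c                # minimizer is to the right of c
        else:
            return c              # f'(c) = 0: c is optimal
        if hi - lo <= tol:
            break
    # clip the final midpoint back to the feasible interval
    return min(max((lo + hi) / 2.0, a), b)

# minimize (x - 5)^2 over [0, 1]: the constrained minimizer is x = 1
x = bisection(lambda t: 2.0 * (t - 5.0), 0.0, 1.0, R=10.0)
print(x)   # approximately 1.0
```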
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 12: Polynomial Solvability of Convex Programs – March 01
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Center of Gravity Method
• Ellipsoid Method
References: Ben-Tal & Nemirovski, Chapter 7
12.1 Convex Program
We aim to solve the convex optimization problem
$$\min_{x\in X}\ f(x)$$
where $f$ is convex and $X\subseteq\mathbb{R}^n$ is a closed, bounded convex set with non-empty interior (also called a convex body).

Let $r,R>0$ be such that $\{x:\|x-c\|_2\le r\}\subseteq X\subseteq\{x:\|x\|_2\le R\}$ for some $c$.
Setup: We assume that the objective and constraint set can be accessed through
• Separation Oracle for X: a routine that given an input x, either reports x ∈ X or returns a
vector ω 6= 0, s.t. ωTx ≥ supy∈X ωT y
• First Order Oracle for f : a routine that given an input x, returns a subgradient g ∈ ∂f(x),
i.e. f(y) ≥ f(x) + gT (y − x), ∀y.
• Zero Order Oracle for f : a routine that given an input x, returns the function value f(x).
Example:
$$\min_x\ \max_{1\le j\le J}f_j(x)\quad\text{s.t. }g_i(x)\le0,\ i=1,\dots,m$$
where $f_j(x)$, $1\le j\le J$ and $g_i(x)$, $1\le i\le m$ are convex and differentiable. Here, we have
$$f(x)=\max_{1\le j\le J}f_j(x),\qquad X=\{x\in\mathbb{R}^n : g_i(x)\le0,\ i=1,\dots,m\}$$
1. We can build a first-order oracle for $f$ if at a given $x$ we can compute $f_1(x),\dots,f_J(x)$ and $\nabla f_1(x),\dots,\nabla f_J(x)$. This is because:
$$\partial f(x)\supseteq\mathrm{Conv}\{\nabla f_j(x) : j\text{ such that }f(x)=f_j(x)\}$$
2. We can build a separation oracle for $X$ if at a given $x$ we can compute $g_1(x),\dots,g_m(x)$ and $\nabla g_1(x),\dots,\nabla g_m(x)$. This is because
$$x\in X\iff g_i(x)\le0,\ \forall i=1,\dots,m$$
and if $x\notin X$, then $\exists i'$ s.t. $g_{i'}(x)>0$, so by convexity, for all $y\in X$,
$$\nabla g_{i'}(x)^T(y-x)\le g_{i'}(y)-g_{i'}(x)<0\ \Rightarrow\ \omega^Tx\ge\sup_{y\in X}\omega^Ty,\quad\text{for }\omega=\nabla g_{i'}(x)$$
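This construction is mechanical enough to code directly. A minimal sketch (illustrative, not from the notes; the constraint-list format is our own):

```python
# Separation oracle for X = {x : g_i(x) <= 0, i = 1..m}, built from the
# constraint values and gradients: if some g_i(x) > 0, the gradient of
# that violated constraint is a valid separator omega.

def separation_oracle(x, constraints):
    """constraints: list of (g, grad_g) pairs.
    Returns (True, None) if x is in X, else (False, omega)."""
    for g, grad_g in constraints:
        if g(x) > 0:                    # violated constraint found
            return False, grad_g(x)     # omega^T x >= sup_{y in X} omega^T y
    return True, None

# X = unit square [0,1]^2 written with four linear constraints
cons = [
    (lambda x: x[0] - 1.0, lambda x: (1.0, 0.0)),
    (lambda x: -x[0],      lambda x: (-1.0, 0.0)),
    (lambda x: x[1] - 1.0, lambda x: (0.0, 1.0)),
    (lambda x: -x[1],      lambda x: (0.0, -1.0)),
]
print(separation_oracle((0.5, 0.5), cons))   # (True, None)
print(separation_oracle((2.0, 0.5), cons))   # (False, (1.0, 0.0))
```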
12.2 Center of Gravity Method (Levin, 1965; Newman, 1965)
We consider the simple cutting plane scheme
• Initialize G0 = X
• At iteration t = 1, 2, ..., T , do
– Compute the center of gravity: $x_t=\frac{1}{\mathrm{Vol}(G_{t-1})}\int_{G_{t-1}}x\,dx$

– Call the first-order oracle and obtain $g_t\in\partial f(x_t)$

– Set $G_t=G_{t-1}\cap\{y : g_t^T(y-x_t)\le0\}$

• Output $x^T\in\arg\min_{x\in\{x_1,\dots,x_T\}}f(x)$
Lemma 12.1 Let $C$ be a convex body in $\mathbb{R}^n$ with center of gravity at the origin, i.e. $\int_C x\,dx=0$. Then $\forall\omega\ne0$,
$$\mathrm{Vol}\big(C\cap\{x:\omega^Tx\le0\}\big)\ \le\ \Big(1-\big(\tfrac{n}{n+1}\big)^n\Big)\mathrm{Vol}(C)\ \le\ \Big(1-\tfrac1e\Big)\mathrm{Vol}(C)$$
As an immediate result of the center of gravity method, we have

• $x^*\in G_t$, $\forall t\ge1$, because $X_{\mathrm{opt}}\subseteq\{y:f(y)\le f(x_t)\}\subseteq\{y:g_t^T(y-x_t)\le0\}$

• $\mathrm{Vol}(G_t)\le\big(1-\frac1e\big)^t\mathrm{Vol}(X)$
Theorem 12.2 The approximate solution generated by the center of gravity method satisfies:
$$f(x^T)-f_*\ \le\ \Big(1-\frac1e\Big)^{\frac Tn}\mathrm{Var}_X(f)$$
where $\mathrm{Var}_X(f)=\max_{x\in X}f(x)-\min_{x\in X}f(x)$, and $f_*$ is the optimal value.

Proof: Let $\delta\in\big((1-\frac1e)^{\frac Tn},1\big)$, and consider the shrunken neighborhood of $x^*$
$$X_\delta=\{x^*+\delta(x-x^*) : x\in X\}$$
$$\mathrm{Vol}(X_\delta)=\delta^n\mathrm{Vol}(X)>\Big(1-\frac1e\Big)^T\mathrm{Vol}(X)\ge\mathrm{Vol}(G_T)$$
Hence $X_\delta\setminus G_T\ne\emptyset$. Let $y=x^*+\delta(z-x^*)\in X_\delta\setminus G_T$ for some $z\in X$. Then for a certain $t^*\le T$, we have $y\in G_{t^*-1}\setminus G_{t^*}$. Since $y\notin G_{t^*}$, we have $g_{t^*}^T(y-x_{t^*})>0$, hence $f(y)>f(x_{t^*})$. Since $y=x^*+\delta(z-x^*)$, by convexity of $f$,
$$f(y)=f(\delta z+(1-\delta)x^*)\le\delta f(z)+(1-\delta)f(x^*)=f(x^*)+\delta[f(z)-f(x^*)]\le f(x^*)+\delta\,\mathrm{Var}_X(f)$$
Hence $f(x^T)\le f(x_{t^*})<f(y)\le f_*+\delta\,\mathrm{Var}_X(f)$. Letting $\delta\to(1-\frac1e)^{\frac Tn}$, we get the desired result.
Remarks:
1. To obtain a solution with small inaccuracy $\varepsilon>0$, the number of oracle calls needed is polynomially dependent on the dimension:
• separation oracle: $N(\varepsilon)=n\log\big(\frac{\mathrm{Var}_X(f)}{\varepsilon}\big)$

• first-order oracle: $N(\varepsilon)=n\log\big(\frac{\mathrm{Var}_X(f)}{\varepsilon}\big)$

• zero-order oracle: $N(\varepsilon)=n\log\big(\frac{\mathrm{Var}_X(f)}{\varepsilon}\big)$
2. The rate of convergence of the center of gravity method is exponentially fast; this is usually called a linear rate.

3. The center of gravity method is not necessarily polynomial and cannot be used as a computational tool, because finding the center of gravity at each step can be extremely difficult, even for polytopes.

4. We can instead use an ellipsoid as the localizer, so that it is easy to compute the center.
12.3 Ellipsoid Method (Shor, Nemirovsky, Yudin, 1970s)
Ellipsoid. Let $Q$ be a symmetric positive definite matrix and $c$ the center; an ellipsoid is uniquely characterized by $(c,Q)$:
$$E(c,Q)=\big\{x\in\mathbb{R}^n : (x-c)^TQ^{-1}(x-c)\le1\big\}=\big\{x=c+Q^{\frac12}u : u^Tu\le1\big\}$$
The volume of $E(c,Q)$ is
$$\mathrm{Vol}(E(c,Q))=\mathrm{Det}(Q^{\frac12})\,\mathrm{Vol}(B_n)$$
where $B_n$ is the $n$-dimensional unit Euclidean ball with $\mathrm{Vol}(B_n)=\frac{\pi^{n/2}}{\Gamma(\frac n2+1)}$.
Proposition 12.3 Let $E=E(c,Q)=\{x\in\mathbb{R}^n : (x-c)^TQ^{-1}(x-c)\le1\}$ be an ellipsoid and let $H_+=\{x : \omega^Tx\le\omega^Tc\}$ be a half-space with $\omega\ne0$ passing through the center $c$. Then the half-ellipsoid $E\cap H_+$ is contained in the ellipsoid $E_+=E(c_+,Q_+)$ with
$$c_+=c-\frac{1}{n+1}q,\qquad Q_+=\frac{n^2}{n^2-1}\Big(Q-\frac{2}{n+1}qq^T\Big)$$
where $q=\frac{Q\omega}{\sqrt{\omega^TQ\omega}}$ is the step from $c$ to the boundary of $E$. Moreover,
$$\mathrm{Vol}(E_+)\le\exp\Big(-\frac{1}{2n}\Big)\mathrm{Vol}(E)$$
The Ellipsoid Method. The algorithm works as follows:

• Initialize $E(c_0,Q_0)$ with $c_0=0$, $Q_0=R^2I$

• At iteration $t=1,2,\dots,T$, do

– Call the separation oracle with the input $c_{t-1}$

– If $c_{t-1}\notin X$, obtain a separator $\omega\ne0$

– If $c_{t-1}\in X$, call the first-order oracle and obtain a subgradient $\omega\in\partial f(c_{t-1})$

– Set the new ellipsoid $E(c_t,Q_t)$ with
$$c_t=c_{t-1}-\frac{1}{n+1}\frac{Q_{t-1}\omega}{\sqrt{\omega^TQ_{t-1}\omega}},\qquad Q_t=\frac{n^2}{n^2-1}\Big(Q_{t-1}-\frac{2}{n+1}\frac{Q_{t-1}\omega\omega^TQ_{t-1}}{\omega^TQ_{t-1}\omega}\Big)$$

• Output
$$x^T=\arg\min_{c\in\{c_1,\dots,c_T\}\cap X}f(c)$$
As an immediate result,
$$\mathrm{Vol}(E_t)\le\exp\Big(-\frac{t}{2n}\Big)\mathrm{Vol}(E_0)=\exp\Big(-\frac{t}{2n}\Big)R^n\,\mathrm{Vol}(B_n)$$
Theorem 12.4 The approximate solution generated by the ellipsoid method after $T$ steps, for $T>2n^2\log(\frac Rr)$, satisfies:
$$f(x^T)-f_*\le\frac Rr\,\mathrm{Var}_X(f)\exp\Big(-\frac{T}{2n^2}\Big)$$
Proof: Similar to the proof for the center of gravity method. Set $\delta\in\big(\frac Rr\exp(-\frac{T}{2n^2}),1\big)$. Then with
$$X_\delta=\{x^*+\delta(x-x^*) : x\in X\}$$
$$\mathrm{Vol}(X_\delta)=\delta^n\mathrm{Vol}(X)\ge\delta^nr^n\mathrm{Vol}(B_n)>R^n\exp\Big(-\frac{T}{2n}\Big)\mathrm{Vol}(B_n)\ge\mathrm{Vol}(E_T)$$
Hence $X_\delta\setminus E_T\ne\emptyset$. The rest of the proof follows similarly as in Theorem 12.2.
Remark: The oracle complexity of the ellipsoid method is $O(n^2\log(\frac1\varepsilon))$, which is only slightly worse than that of the center of gravity method.
Remark: The ellipsoid method works for any general convex problem as long as it admits separation and first-order oracles. Moreover, the algorithm is polynomial if it takes only polynomial time to call those oracles. For instance, for linear programs with $n$ variables and $m$ constraints, the separation oracle takes $O(nm)$ computation cost per call.
In summary, here are some advantages and disadvantages of Ellipsoid method:
+ : universal
+ : simple to implement and steady for small size problems
+ : low order dependence on the number of functional constraints
− : complexity grows quadratically with the size of the problem; inefficient for large-scale problems.
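Despite these caveats, the method is only a few lines of linear algebra. Below is a minimal numpy sketch (ours, not from the notes), specialized to $X=\{\|x\|_2\le R\}$ so that the separation oracle is trivial; the test problem $f(x)=\|x-x_0\|_2^2$ is illustrative.

```python
import numpy as np

def ellipsoid_method(f, subgrad, R, n, T):
    """Minimize a convex f over the ball X = {x : ||x||_2 <= R}."""
    c = np.zeros(n)                    # center c_0 = 0
    Q = (R ** 2) * np.eye(n)           # Q_0 = R^2 I
    best_val, best_x = np.inf, None
    for _ in range(T):
        if c @ c > R ** 2:
            w = c                      # separator: w^T c >= sup_{y in X} w^T y
        else:
            w = subgrad(c)
            if f(c) < best_val:        # track the best feasible center
                best_val, best_x = f(c), c.copy()
        Qw = Q @ w
        s = float(w @ Qw)
        if s <= 1e-16:                 # w ~ 0: current center is optimal
            break
        c = c - Qw / ((n + 1) * np.sqrt(s))
        Q = (n * n / (n * n - 1.0)) * (Q - (2.0 / (n + 1)) * np.outer(Qw, Qw) / s)
    return best_x, best_val

x0 = np.array([1.0, 2.0])
x, val = ellipsoid_method(f=lambda z: float(np.sum((z - x0) ** 2)),
                          subgrad=lambda z: 2.0 * (z - x0),
                          R=5.0, n=2, T=200)
print(x)   # close to x0 = (1, 2)
```

With $n=2$ and $T=200$, the volume bound $\exp(-T/2n^2)$ makes the remaining gap negligible.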
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 13: Conic Programming – March 06
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Generalized Inequality Constraints and Cones
• Conic Programs: LP, SOCP, SDP
References: Ben-Tal & Nemirovski. Lectures on Modern Convex Optimization, Chapter 1.4
13.1 Generalized Inequality
In order to extend linear constraints to nonlinear convex constraints, we may extend the standard component-wise inequality to a generalized vector inequality.
$$Ax\ge b\ \Rightarrow\ Ax\succeq b$$
where the partial ordering "$\succeq$" also satisfies:

(i) reflexivity: $a\succeq a$

(ii) anti-symmetry: $a\succeq b$ and $b\succeq a$ implies $a=b$

(iii) transitivity: $a\succeq b$ and $b\succeq c$ implies $a\succeq c$

(iv) homogeneity: $a\succeq b$ and $\lambda\in\mathbb{R}_+$ implies $\lambda a\succeq\lambda b$

(v) additivity: $a\succeq b$ and $c\succeq d$ implies $a+c\succeq b+d$
Remark 1: The generalized inequality on $\mathbb{R}^m$ is completely identified by the set $K=\{a\in\mathbb{R}^m : a\succeq0\}$ via the rule
$$a\succeq b\ \Leftrightarrow\ a-b\in K$$
This is because:

• $a\succeq b$ and $-b\succeq-b$ (by (i)) implies $a-b\succeq0$ (by (v))

• $a-b\succeq0$ and $b\succeq b$ (by (i)) implies $a\succeq b$ (by (v))
Remark 2: The set $K$ satisfies

1. $K$ is nonempty: $0\in K$

2. $K$ is closed w.r.t. addition: $a,b\in K\Rightarrow a+b\in K$

3. $K$ is closed w.r.t. nonnegative scaling: $a\in K,\ \lambda\in\mathbb{R}_+\Rightarrow\lambda a\in K$

4. $K$ is pointed: $a,-a\in K\Rightarrow a=0$

Equivalently, $K$ must be a nonempty pointed convex cone. Indeed, this condition is both necessary and sufficient for the set $K$ to define a partial ordering "$\succeq_K$" via the rule:
$$a\succeq_K b\ \Leftrightarrow\ a-b\in K$$
Remark 3: The standard inequality corresponds to $K=\mathbb{R}^m_+$: $a\ge b\Leftrightarrow a-b\in\mathbb{R}^m_+$. In addition, $K=\mathbb{R}^m_+$ is also closed and has non-empty interior.
13.2 Conic Programs
Definition 13.1 Let $K$ be a regular cone (i.e. convex, closed, pointed, and with a nonempty interior) and "$\succeq_K$" the induced inequality. The optimization problem
$$\min_x\ c^Tx\quad\text{s.t. }Ax\succeq_K b \tag{CP}$$
is called a conic program associated with the cone $K$.
Examples of Regular Cones

• Non-negative orthant: $\mathbb{R}^m_+=\{x\in\mathbb{R}^m : x_i\ge0,\ i=1,\dots,m\}$

• Lorentz cone (a.k.a. second-order / ice-cream cone):
$$L^m=\Big\{x\in\mathbb{R}^m : x_m\ge\sqrt{\textstyle\sum_{i=1}^{m-1}x_i^2}\Big\}$$

• Semidefinite cone: $S^m_+=\{A\in S^m : A\succeq0\}$, the set of $m\times m$ symmetric positive semidefinite matrices.
Typical Conic Programs

• Linear Program: $K=\mathbb{R}^m_+=\mathbb{R}_+\times\dots\times\mathbb{R}_+$ is a direct product of nonnegative orthants:
$$\min_x\big\{c^Tx : a_i^Tx-b_i\ge0,\ i=1,\dots,m\big\} \tag{LP}$$
where the linear map $Ax-b:=[a_1^Tx-b_1;\dots;a_m^Tx-b_m]$

• Conic Quadratic Program (a.k.a. Second-Order Cone Program): $K=L^{m_1}\times\dots\times L^{m_k}$
$$\min_x\big\{c^Tx : \|D_ix-d_i\|_2\le e_i^Tx-f_i,\ i=1,\dots,k\big\} \tag{SOCP}$$
where the linear map
$$Ax-b:=\big[\underbrace{D_1x-d_1;\ e_1^Tx-f_1}_{\mathbb{R}^{m_1}};\ \dots;\ \underbrace{D_kx-d_k;\ e_k^Tx-f_k}_{\mathbb{R}^{m_k}}\big]$$

• Semidefinite Program:
$$\min_x\Big\{c^Tx : \sum_{i=1}^n x_iA_i-B\succeq0\Big\} \tag{SDP}$$
where the linear map $Ax-b=\sum_{i=1}^n x_iA_i-B : \mathbb{R}^n\to S^m$, with $A_1,\dots,A_n,B\in S^m$.
Remark: (LP) $\subseteq$ (SOCP). This is because $a_i^Tx-b_i\ge0\Leftrightarrow\begin{bmatrix}0\\ a_i^Tx-b_i\end{bmatrix}\in L^2$.

Remark: (SOCP) $\subseteq$ (SDP). This is because
$$x=\begin{bmatrix}x_1\\\vdots\\x_m\end{bmatrix}\in L^m\ \Leftrightarrow\ \mathcal A(x):=\begin{bmatrix}x_m & x_1 & x_2 & \cdots & x_{m-1}\\ x_1 & x_m & & &\\ x_2 & & x_m & &\\ \vdots & & & \ddots &\\ x_{m-1} & & & & x_m\end{bmatrix}\succeq0$$
Lemma 13.2 (Schur Complement) Let $S=\begin{bmatrix}P & Q^T\\ Q & R\end{bmatrix}$ be symmetric with $R\succ0$. Then
$$S\succeq0\quad\text{if and only if}\quad P-Q^TR^{-1}Q\succeq0$$
Proof:
$$S\succeq0\ \Leftrightarrow\ \forall u,v:\ \begin{bmatrix}u^T,\,v^T\end{bmatrix}\begin{bmatrix}P & Q^T\\ Q & R\end{bmatrix}\begin{bmatrix}u\\ v\end{bmatrix}\ge0\ \Leftrightarrow\ \forall u:\ \inf_v\big\{u^TPu+2v^TQu+v^TRv\big\}\ge0\ \Leftrightarrow\ \forall u:\ u^T(P-Q^TR^{-1}Q)u\ge0\ \Leftrightarrow\ P-Q^TR^{-1}Q\succeq0$$
(The infimum over $v$ is attained at $v^*=-R^{-1}Qu$.)
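A numeric sanity check of the lemma (an added illustration; the matrices below are arbitrary small examples):

```python
import numpy as np

def is_psd(M, tol=1e-9):
    # symmetrize to guard against round-off before taking eigenvalues
    return float(np.min(np.linalg.eigvalsh((M + M.T) / 2.0))) >= -tol

def schur_psd(P, Q, R):
    return is_psd(P - Q.T @ np.linalg.inv(R) @ Q)

def block(P, Q, R):
    return np.block([[P, Q.T], [Q, R]])

I = np.eye(2)
# P = 2I, Q = I, R = I: Schur complement is I >= 0, and S is PSD
psd_case = (is_psd(block(2 * I, I, I)), schur_psd(2 * I, I, I))
# P = I/2, Q = I, R = I: Schur complement is -I/2, and S is indefinite
bad_case = (is_psd(block(0.5 * I, I, I)), schur_psd(0.5 * I, I, I))
print(psd_case, bad_case)   # (True, True) (False, False)
```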
To show (SOCP) $\subseteq$ (SDP), let $\mathcal A(x)$ denote the arrow matrix above:

1. If $x\in L^m$ and $x\ne0$, then $x_m>0$ and
$$x_m\ \ge\ \frac{x_1^2+\dots+x_{m-1}^2}{x_m}$$
so the Schur complement lemma implies $\mathcal A(x)\succeq0$. If $x\in L^m$ and $x=0$, then $\mathcal A(x)=0\in S^m_+$.

2. If $\mathcal A(x)\succeq0$ and $\mathcal A(x)\ne0$, then $x_m>0$, and by the Schur complement lemma,
$$x_m-\frac{x_1^2+\dots+x_{m-1}^2}{x_m}\ \ge\ 0\ \Rightarrow\ x\in L^m$$
If $\mathcal A(x)=0$, then $x=0\in L^m$.
13.3 Examples

1. $L_2$-norm minimization

(a) $\min_{x\in\mathbb{R}^n}\|x\|_2$. Note that
$$\min_{x\in\mathbb{R}^n}\|x\|_2\ \Longleftrightarrow\ \min_{x,t}\big\{t : t\ge\|x\|_2\big\}\ \Longleftrightarrow\ \min_{x,t}\Big\{t : \begin{bmatrix}x\\ t\end{bmatrix}\succeq_{L^{n+1}}0\Big\}$$

(b) $\min_{x\in\mathbb{R}^n}x^Tx$. Note that
$$\min_{x\in\mathbb{R}^n}x^Tx\ \Longleftrightarrow\ \min_{x,t}\big\{t : t\ge x^Tx\big\}\ \Longleftrightarrow\ \min_{x,t}\Big\{t : \begin{bmatrix}2x\\ t-1\\ t+1\end{bmatrix}\succeq_{L^{n+2}}0\Big\}$$

2. Quadratic problems: $\min_{x\in\mathbb{R}^n}x^TQx+q^Tx$ where $Q=LL^T\succeq0$. Note that
$$\min_{x\in\mathbb{R}^n}x^TQx+q^Tx\ \Longleftrightarrow\ \min_{x,t}\big\{t : t\ge x^TQx+q^Tx\big\}\ \Longleftrightarrow\ \min_{x,t}\Big\{t : \begin{bmatrix}2L^Tx\\ t-q^Tx-1\\ t-q^Tx+1\end{bmatrix}\succeq_{L^{n+2}}0\Big\}$$
3. Spectral norm minimization. Spectral norm: $\|A\|_2=\lambda_{\max}(A)$ for symmetric $A\succeq0$. Note that
$$\min_{x\in\mathbb{R}^n}\Big\|\sum_{i=1}^n x_iA_i\Big\|_2\ \Longleftrightarrow\ \min_{x,t}\Big\{t : t\ge\lambda_{\max}\Big(\sum_{i=1}^n x_iA_i\Big)\Big\}\ \Longleftrightarrow\ \min_{x,t}\Big\{t : tI_m-\sum_{i=1}^n x_iA_i\succeq0\Big\}$$
where $A_1,\dots,A_n\in S^m$.
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 14: Conic Duality – March 08
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Dual Cone
• Conic Duality Theorem
References: Ben-Tal & Nemirovski. Lectures on Modern Convex Optimization, Chapter 1.4
14.1 Motivation
Recall the LP duality:
$$\text{(P)}\quad\min\big\{c^Tx : Ax=b,\ x\ge0\big\}\qquad\qquad\text{(D)}\quad\max\big\{b^Ty : A^Ty\le c\big\}$$
$$\text{(LP)}\quad\min\big\{c^Tx : Ax\ge b\big\}\qquad\qquad\text{(LD)}\quad\max\big\{b^Ty : A^Ty=c,\ y\ge0\big\}$$
Now consider the conic program
$$\text{(CP)}\quad\min\big\{c^Tx : Ax\succeq_K b\big\}\qquad\qquad\text{(CD)}\quad ?$$
Observe that
$$Ax\ge b\ \Rightarrow\ \forall y\ge0:\ y^T(Ax)\ge y^Tb\ \Rightarrow\ \forall y\ge0\text{ with }A^Ty=c:\ c^Tx\ge b^Ty\ \Rightarrow\ c^Tx\ge\max\big\{b^Ty : y\ge0,\ A^Ty=c\big\}$$
Now, for which $y$ does $Ax\succeq_K b$ imply $y^T(Ax)\ge y^Tb$?
14.2 Dual Cone
Definition 14.1 Let $K$ be a nonempty cone. The set
$$K_*=\big\{y : y^Tx\ge0,\ \forall x\in K\big\}$$
is called the dual cone to $K$. Note that $K_*$ is always a closed cone.
Proposition 14.2 Let $K$ be a closed convex cone and $K_*$ be its dual cone. Then:
1. (K∗)∗ = K
2. K is pointed iff K∗ has non-empty interior
3. K is a regular cone iff K∗ is a regular cone
Proof: Self-exercise.
Examples (we call these self-dual cones):

• $(\mathbb{R}^m_+)_*=\mathbb{R}^m_+$

• $(L^m)_*=L^m$

• $(S^m_+)_*=S^m_+$
14.3 Conic Duality
The dual of the conic program
$$\text{(CP)}\quad\min_x\big\{c^Tx : Ax\succeq_K b\big\}$$
is given by
$$\text{(CD)}\quad\max_y\big\{b^Ty : A^Ty=c,\ y\succeq_{K_*}0\big\}$$
Theorem 14.3 (Weak Conic Duality) The optimal value of (CD) is a lower bound of the optimalvalue of (CP ).
Theorem 14.4 (Strong Conic Duality) If the primal (CP) is bounded below and strictly feasible, i.e. $\exists x_0$ s.t. $Ax_0\succ_K b$, then the dual (CD) is solvable and the optimal values equal each other.
Proof: Let $p^*$ be the optimal value of the primal (CP). It is sufficient to show that there exists $y^*$ feasible for (CD) such that $b^Ty^*\ge p^*$.

If $c=0$, then $p^*=0$ and $y^*=0$ works: $y^*\succeq_{K_*}0$, $A^Ty^*=c$ and $b^Ty^*=p^*$.

Now consider $c\ne0$. Let the set $M=\big\{Ax-b : c^Tx\le p^*\big\}$. Note that $M$ is a nonempty set.

Claim 1: $M\cap\mathrm{int}(K)=\emptyset$. This is because: suppose $\exists\hat x$ s.t. $A\hat x-b\in\mathrm{int}(K)$ and $c^T\hat x\le p^*$. Then every point in a small enough neighborhood of $\hat x$ is still feasible, and since $c\ne0$, this neighborhood contains a point $\tilde x$ such that $c^T\tilde x<c^T\hat x\le p^*$. This contradicts the fact that $p^*$ is the optimal value.

By the Separation Theorem, $\exists y\ne0$ s.t.
$$\sup_{z\in M}y^Tz\ \le\ \inf_{z\in\mathrm{int}(K)}y^Tz$$
Note that $\inf_{z\in\mathrm{int}(K)}y^Tz=0$, since $\mathrm{int}(K)$ contains the ray $\{tz : t>0\}$ for every $z\in\mathrm{int}(K)$ (so the infimum is either $0$ or $-\infty$, and $-\infty$ is excluded because the infimum is bounded below by the supremum over the nonempty set $M$). Hence, we have $y\in K_*$ and
$$\sup_{x:\,c^Tx\le p^*}y^T(Ax-b)\ \le\ 0 \tag{14.1}$$
Therefore, it must hold that $A^Ty=\lambda c$ for some $\lambda\ge0$; otherwise, the supremum goes to infinity.

Claim 2: $\exists\lambda>0$ s.t. $A^Ty=\lambda c$. This is because: suppose $\lambda=0$, i.e. $A^Ty=0$; then (14.1) implies $-b^Ty\le0$. Since (CP) is strictly feasible, $\exists x_0$ s.t. $Ax_0-b\in\mathrm{int}(K)$. We already showed $y\in K_*$ and $y\ne0$, so $y^T(Ax_0-b)>0$, which together with $A^Ty=0$ gives $-y^Tb>0$, contradicting $-b^Ty\le0$!

Now let $y^*=\frac y\lambda$. Then $y^*\in K_*$, $A^Ty^*=c$, and (14.1) gives $c^Tx-(y^*)^Tb\le0$ for all $x$ with $c^Tx\le p^*$; taking the supremum $c^Tx\to p^*$ yields $p^*-b^Ty^*\le0$. Therefore, $y^*$ is dual feasible and $b^Ty^*\ge p^*$.
Corollary 14.5 If the dual (CD) is bounded above and strictly feasible, i.e. $\exists\bar y$ s.t. $\bar y\succ_{K_*}0$ and $A^T\bar y=c$, then the primal (CP) is solvable and the optimal values equal each other.

Corollary 14.6 If both (CP) and (CD) are strictly feasible, then both are solvable with equal optimal values.
Theorem 14.7 (Optimality Conditions) Suppose at least one of (CP) and (CD) is bounded and strictly feasible. Then a feasible primal-dual pair $(x^*,y^*)$ is a pair of optimal primal-dual solutions

1. if and only if $c^Tx^*-b^Ty^*=0$ (zero duality gap); equivalently,

2. if and only if $(Ax^*-b)^Ty^*=0$ (complementary slackness).

Proof: Let $p^*$ and $d^*$ be the optimal values of (CP) and (CD). Then
$$c^Tx^*-b^Ty^*=(c^Tx^*-p^*)+(d^*-b^Ty^*)+(p^*-d^*)$$
All three terms on the RHS are $\ge0$, and $c^Tx^*-b^Ty^*=0$ iff $c^Tx^*=p^*$, $b^Ty^*=d^*$ and $p^*=d^*$. Moreover, using $A^Ty^*=c$,
$$(Ax^*-b)^Ty^*=(x^*)^TA^Ty^*-b^Ty^*=c^Tx^*-b^Ty^*$$
so conditions 1 and 2 are equivalent.
Remark: In the case of LP, strict feasibility is not required for strong duality, nor for a program to be solvable. However, in the generic case of CP, this is no longer true.
Example 1: A conic problem can be strictly feasible and bounded, but NOT solvable.
$$\min_{x_1,x_2}\Big\{x_1 : \begin{bmatrix}x_1-x_2\\ 1\\ x_1+x_2\end{bmatrix}\succeq_{L^3}0\Big\}\ \Longleftrightarrow\ \min_{x_1,x_2}\big\{x_1 : 4x_1x_2\ge1,\ x_1+x_2>0\big\}$$
Example 2: A conic problem can be not strictly feasible yet solvable, while its dual is infeasible.
$$\min_{x_1,x_2}\Big\{x_2 : \begin{bmatrix}x_1\\ x_2\\ x_1\end{bmatrix}\succeq_{L^3}0\Big\}\ \Longleftrightarrow\ \min_{x_1,x_2}\big\{x_2 : x_2=0,\ x_1\ge0\big\}$$
with dual
$$\max_\lambda\Big\{0 : \begin{bmatrix}\lambda_1+\lambda_3\\ \lambda_2\end{bmatrix}=\begin{bmatrix}0\\ 1\end{bmatrix},\ \lambda\succeq_{L^3}0\Big\}$$
which is infeasible.
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 15: Applications of Conic Duality – March 13
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• SOCP Duality
• SDP Duality
• SDP Relaxations for Non-Convex Quadratic Programs
References: Ben-Tal & Nemirovski. Lectures on Modern Convex Optimization, Chapter 2.1, 3.1
15.1 Recall: Conic Duality
Let K be a regular cone, and K∗ =y : yTx ≥ 0,∀x ∈ K
be its dual cone
p∗ = minx
cTx
s.t. Ax ≥K b (CP)
d∗ = maxy
bT y
s.t. AT y = c (CD)
y ≥K∗ 0
• Weak Duality: p∗ ≥ d∗
• Strong Duality: If one of (CP ), (CD) is bounded and strictly feasible, then p∗ = d∗.
• Optimality Condition: (x∗, y∗) is optimal iff
(i) primal feasibility: Ax∗ ≥K b
(ii) dual feasibility: AT y∗ = c, y∗ ≥K∗ 0
(iii) zero-duality gap: cTx∗ − bT y∗ = 0
15.2 SOCP Duality
Second-order Cone Program (SOCP): when K is the Cartesian product of Lorentz cones,namely, K = Ln1 × ...× Lnm .The general form of a second-order cone program is
minx
cTx
s.t. ‖ Aix− bi ‖2≤ dTi x− ei, i = 1, ...,m (SOCP–P)
where c ∈ Rn, Ai ∈ R(ni−1)×n,Ri ∈ R(ni−1)×1, ei ∈ R.
The conic form is
minx
cTx
s.t. Aix ≥Lni bi, i = 1, ...,m
where A1 =
[AidTi
], bi =
[biei
]
Proposition 15.1 $L^n$ is self-dual, i.e. $(L^n)_*=L^n$.

Proof:

(i) $L^n\subseteq(L^n)_*$: Suppose $y\in L^n$. We show that $\forall x\in L^n$,
$$y^Tx=y_1x_1+\dots+y_{n-1}x_{n-1}+y_nx_n\ \ge\ -\sqrt{\sum_{i=1}^{n-1}y_i^2}\sqrt{\sum_{i=1}^{n-1}x_i^2}+y_nx_n\ \ge\ 0$$
where the first inequality is due to the Cauchy–Schwarz inequality and the second inequality is because $y_n\ge\sqrt{\sum_{i=1}^{n-1}y_i^2}$ and $x_n\ge\sqrt{\sum_{i=1}^{n-1}x_i^2}$.

(ii) $(L^n)_*\subseteq L^n$: Suppose $y\in(L^n)_*$, i.e. $y^Tx\ge0$, $\forall x\in L^n$.

If $(y_1,\dots,y_{n-1})=0$, setting $x=[0,\dots,0,1]\in L^n$ gives $y^Tx=y_n\ge0$, so $y\in L^n$.

If $(y_1,\dots,y_{n-1})\ne0$, setting $x=[-y_1,\dots,-y_{n-1},\sqrt{\sum_{i=1}^{n-1}y_i^2}]\in L^n$ gives
$$y^Tx=-\sum_{i=1}^{n-1}y_i^2+y_n\sqrt{\sum_{i=1}^{n-1}y_i^2}\ \ge\ 0\ \Rightarrow\ y_n\ge\sqrt{\sum_{i=1}^{n-1}y_i^2},\ \text{i.e. }y\in L^n$$
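A quick Monte-Carlo sanity check of both inclusions for $n=3$ (an added illustration, not part of the notes):

```python
import math, random

random.seed(1)

def in_lorentz(v):
    return v[-1] >= math.sqrt(sum(t * t for t in v[:-1]))

def sample_l3():
    # random point of L^3: pick (u1, u2), then a large enough last coordinate
    u1, u2 = random.uniform(-1, 1), random.uniform(-1, 1)
    return (u1, u2, math.hypot(u1, u2) + random.uniform(0, 1))

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

# (i) L^3 subset of (L^3)*: any two points of L^3 have nonnegative inner product
ok = all(dot(sample_l3(), sample_l3()) >= 0 for _ in range(1000))
print(ok)   # True

# (ii) a point outside L^3, e.g. y = (1, 0, 0.5), is exposed by the witness
# x = (-y1, -y2, ||(y1, y2)||) from the proof above
y = (1.0, 0.0, 0.5)
x = (-y[0], -y[1], math.hypot(y[0], y[1]))
print(in_lorentz(x), dot(x, y))   # True -0.5
```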
Hence, the dual of the SOCP is
$$\max_{\lambda\in\mathbb{R}^m,\ u_i\in\mathbb{R}^{n_i-1},\,i=1,\dots,m}\ \sum_{i=1}^m b_i^Tu_i+e^T\lambda\quad\text{s.t. }\sum_{i=1}^m(A_i^Tu_i+d_i\lambda_i)=c,\qquad\|u_i\|_2\le\lambda_i,\ i=1,\dots,m \tag{SOCP–D}$$
Remark: The same dual can be derived from Lagrange duality. The Lagrange function is
$$L(x,\lambda)=c^Tx+\sum_{i=1}^m\lambda_i\big(\|A_ix-b_i\|_2-(d_i^Tx-e_i)\big)$$
The Lagrange dual is given by
$$\max_{\lambda\ge0}\min_x L(x,\lambda)=\max_{\lambda\ge0}\min_x\max_{\|u_i\|_2\le\lambda_i,\,i=1,\dots,m}\Big\{c^Tx-\sum_{i=1}^m u_i^T(A_ix-b_i)-\sum_{i=1}^m\lambda_i(d_i^Tx-e_i)\Big\}$$
$$=\max_{\lambda\ge0,\ \|u_i\|_2\le\lambda_i}\min_x\Big\{\Big(c-\sum_{i=1}^m(A_i^Tu_i+d_i\lambda_i)\Big)^Tx+\sum_{i=1}^m b_i^Tu_i+e^T\lambda\Big\}$$
$$=\max_{\lambda,u_1,\dots,u_m}\Big\{\sum_{i=1}^m b_i^Tu_i+e^T\lambda\ :\ \sum_{i=1}^m(A_i^Tu_i+d_i\lambda_i)=c,\quad\|u_i\|_2\le\lambda_i,\ i=1,\dots,m\Big\}$$
which is the same as the dual derived from conic duality.
15.3 SDP Duality
Semidefinite Program (SDP): the case when $K$ is the positive semidefinite cone. The general form of a semidefinite program is
$$\min\ c^Tx\quad\text{s.t. }Ax-B=\sum_{i=1}^n x_iA_i-B\succeq0 \tag{SDP–P}$$
A constraint of the type $x_1A_1+\dots+x_nA_n-B\succeq0$ is called a Linear Matrix Inequality.
Proposition 15.2 $S^n_+$ is self-dual, i.e. $(S^n_+)_*=S^n_+$
Proof:

(i) $S^n_+\subseteq(S^n_+)_*$: Suppose $Y\succeq0$, so $Y=U\Lambda U^T=\sum_{i=1}^n\lambda_iu_iu_i^T$ with $\lambda_i\ge0$, $i=1,\dots,n$. Then $\forall X\succeq0$,
$$\langle X,Y\rangle=\mathrm{tr}(X^TY)=\mathrm{tr}(XY)=\mathrm{tr}\Big(\sum_{i=1}^n\lambda_iu_iu_i^TX\Big)=\sum_{i=1}^n\lambda_i\,\mathrm{tr}(u_i^TXu_i)\ \ge\ 0$$
Hence, $Y\in(S^n_+)_*$.

(ii) $(S^n_+)_*\subseteq S^n_+$: Suppose $Y\in(S^n_+)_*$, i.e. $\mathrm{tr}(XY)\ge0$, $\forall X\in S^n_+$. For any $x\in\mathbb{R}^n$, let $X=xx^T$; we have $\mathrm{tr}(XY)=\mathrm{tr}(xx^TY)=x^TYx\ge0$. Hence, $Y\in S^n_+$.
The dual of the SDP is:
$$\max_Y\ \mathrm{tr}(BY)\quad\text{s.t. }\mathrm{tr}(A_iY)=c_i,\ i=1,\dots,n,\qquad Y\succeq0 \tag{SDP–D}$$
Remark: Based on the conic duality theorem, $(x^*,Y^*)$ is an optimal primal-dual pair iff

1. $\sum_{i=1}^n x_i^*A_i\succeq B$ (primal feasibility)

2. $Y^*\succeq0$, $\mathrm{tr}(A_iY^*)=c_i$, $i=1,\dots,n$ (dual feasibility)

3. $\mathrm{tr}\big(Y^*(\sum_{i=1}^n x_i^*A_i-B)\big)=0$, i.e. $Y^*\big(\sum_{i=1}^n x_i^*A_i-B\big)=0$ (complementary slackness)
Example: Use SDP duality to show that for any symmetric $B\in S^n$:
$$\lambda_{\max}(B)=\max_{x\in\mathbb{R}^n}\big\{x^TBx : \|x\|_2=1\big\}$$
Note that the right hand side can be rewritten as
$$\max_x\ \mathrm{tr}(Bxx^T)\quad\text{s.t. }\mathrm{tr}(xx^T)=1$$
We show this is equivalent to solving the SDP relaxation
$$\max_X\ \mathrm{tr}(BX)\quad\text{s.t. }\mathrm{tr}(X)=1,\ X\succeq0 \tag{P}$$
Let $\underline p=\max_x\{\mathrm{tr}(Bxx^T) : \mathrm{tr}(xx^T)=1\}$ and $\bar p=\max_X\{\mathrm{tr}(BX) : \mathrm{tr}(X)=1,\ X\succeq0\}$. Since the latter is a relaxation, $\bar p\ge\underline p$. On the other hand, any $X\succeq0$ with $\mathrm{tr}(X)=1$ can be written as
$$X=\sum_{i=1}^n\lambda_ix_ix_i^T,\quad\text{with }\sum_{i=1}^n\lambda_i=1,\ \lambda_i\ge0,\ i=1,\dots,n,\ \text{and }\|x_i\|_2=1,\ \text{i.e. }\mathrm{tr}(x_ix_i^T)=1$$
so
$$\mathrm{tr}(BX)=\mathrm{tr}\Big(\sum_{i=1}^n\lambda_ix_ix_i^TB\Big)=\sum_{i=1}^n\lambda_i\,\mathrm{tr}(Bx_ix_i^T)\le\max_{i=1,\dots,n}\mathrm{tr}(Bx_ix_i^T)\le\underline p$$
Hence $\bar p\le\underline p$. Therefore, $\bar p=\underline p$.

By SDP duality, the dual of (P) is
$$\min_\lambda\ \lambda\quad\text{s.t. }\lambda I-B\succeq0 \tag{D}$$
Note that $\lambda I-B\succeq0$ is equivalent to $\lambda\ge\lambda_{\max}(B)$. Since (P) is strictly feasible ($X=n^{-1}I$ satisfies $\mathrm{tr}(X)=1$ and $X\succ0$) and bounded, strong duality holds. Therefore,
$$\max_x\big\{x^TBx : \|x\|_2=1\big\}=\mathrm{Opt}(P)=\mathrm{Opt}(D)=\lambda_{\max}(B)$$
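A numeric check of the identity (an added illustration with a random symmetric matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = (A + A.T) / 2.0                     # random symmetric test matrix

lam, V = np.linalg.eigh(B)              # eigenvalues in ascending order
lmax, vmax = float(lam[-1]), V[:, -1]

# the top (unit-norm) eigenvector attains lambda_max
attained = abs(float(vmax @ B @ vmax) - lmax) < 1e-10

# no random unit vector exceeds lambda_max
never_beaten = True
for _ in range(1000):
    z = rng.standard_normal(4)
    z /= np.linalg.norm(z)
    never_beaten &= float(z @ B @ z) <= lmax + 1e-10

print(attained, never_beaten)   # True True
```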
15.4 SDP Relaxations of Non-convex Quadratic Programming
Consider the quadratically constrained quadratic program:
$$\mathrm{Opt}=\min\ x^TQ_0x+2q_0^Tx+c_0\quad\text{s.t. }x^TQ_ix+2q_i^Tx+c_i\le0,\ 1\le i\le m \tag{QCQP}$$
Let
$$X=\begin{bmatrix}xx^T & x\\ x^T & 1\end{bmatrix}=\begin{bmatrix}x\\ 1\end{bmatrix}\begin{bmatrix}x\\ 1\end{bmatrix}^T\quad\text{and}\quad A_i=\begin{bmatrix}Q_i & q_i\\ q_i^T & c_i\end{bmatrix},\ i=0,1,\dots,m$$
We can write (QCQP) as
$$\min_{x,X}\ \mathrm{tr}(A_0X)\quad\text{s.t. }\mathrm{tr}(A_iX)\le0,\ i=1,\dots,m,\qquad X=\begin{bmatrix}x\\ 1\end{bmatrix}\begin{bmatrix}x\\ 1\end{bmatrix}^T$$
SDP Relaxation. Dropping the rank-one structure of $X$, the following SDP is a relaxation of (QCQP):
$$\mathrm{Opt}_{\mathrm{SDP}}=\min_X\ \mathrm{tr}(A_0X)\quad\text{s.t. }\mathrm{tr}(A_iX)\le0,\ i=1,\dots,m,\quad X\succeq0,\quad X_{n+1,n+1}=1 \tag{SDP-relaxation}$$
Therefore, $\mathrm{Opt}_{\mathrm{SDP}}\le\mathrm{Opt}$.
Lagrange Relaxation. Another relaxation of (QCQP) is based on Lagrange duality. The Lagrange dual is given by
$$\max_{\lambda\ge0}\inf_x L(x,\lambda)=\max_{\lambda\ge0}\inf_x\Big\{x^T\Big(Q_0+\sum_{i=1}^m\lambda_iQ_i\Big)x+2\Big(q_0+\sum_{i=1}^m\lambda_iq_i\Big)^Tx+\Big(c_0+\sum_{i=1}^m\lambda_ic_i\Big)\Big\}$$
$$=\max_{\lambda\ge0,\,t}\Big\{t : x^T\Big(Q_0+\sum_{i=1}^m\lambda_iQ_i\Big)x+2\Big(q_0+\sum_{i=1}^m\lambda_iq_i\Big)^Tx+\Big(c_0+\sum_{i=1}^m\lambda_ic_i\Big)\ge t,\ \forall x\Big\}$$
Note that an (inhomogeneous) quadratic function is nonnegative for all $x$ iff the corresponding homogenized matrix is positive semidefinite. Hence, the Lagrange relaxation is
$$\mathrm{Opt}_{\mathrm{Lagrange}}=\max_{\lambda\ge0,\,t}\ t\quad\text{s.t. }\begin{bmatrix}Q_0+\sum_{i=1}^m\lambda_iQ_i & q_0+\sum_{i=1}^m\lambda_iq_i\\ \big(q_0+\sum_{i=1}^m\lambda_iq_i\big)^T & \sum_{i=1}^m\lambda_ic_i+c_0-t\end{bmatrix}\succeq0 \tag{Lagrange relaxation}$$
By weak duality, we know that $\mathrm{Opt}_{\mathrm{Lagrange}}\le\mathrm{Opt}$.

Indeed, one can show that the Lagrange relaxation is the SDP dual of the SDP relaxation:

• If either of them is strictly feasible, then $\mathrm{Opt}_{\mathrm{Lagrange}}=\mathrm{Opt}_{\mathrm{SDP}}$.

• In general, we always have $\mathrm{Opt}_{\mathrm{Lagrange}}\le\mathrm{Opt}_{\mathrm{SDP}}\le\mathrm{Opt}$.

Therefore, the SDP relaxation is always at least as tight as the Lagrange relaxation.
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 16: Applications in Robust Optimization – March 27
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Robust Linear Program
• Example: Robust Portfolio Selection
• Example: Robust Classification
16.1 Robust Linear Program
Consider the linear program
$$\min\big\{c^Tx : Ax\le b\big\} \tag{LP}$$
Assume that the data $(c,A,b)$ of the program are not known exactly and vary in a given uncertainty set $U$. The goal is to find a robust solution that is

• feasible for all instances, i.e. $Ax\le b$, $\forall(c,A,b)\in U$

• optimal for the worst-case objective, i.e. $\sup\{c^Tx : (c,A,b)\in U\}$

We call the following program the robust counterpart of (LP):
$$\min_x\Big\{\sup_{(c,A,b)\in U}c^Tx\ :\ Ax\le b,\ \forall(c,A,b)\in U\Big\} \tag{RC}$$
or equivalently,
$$\min_{x,t}\big\{t : c^Tx\le t\ \text{and}\ Ax\le b,\ \forall(c,A,b)\in U\big\}$$
Note that the robust counterpart is a semi-infinite convex optimization program with infinitely many linear inequality constraints. The structure of (RC) depends on the geometry of the uncertainty set $U$.
Without loss of generality, let’s consider the robust counterpart in the simple form:
$$\min\ c^Tx\quad\text{s.t. }a_i^Tx\le b_i,\ \forall a_i\in U_i,\ i=1,\dots,m \tag{RC}$$
where Ui ⊆ Rn is the uncertainty set of ai.
Polyhedral Uncertainty
Consider the situation where the uncertainty set is a polyhedron:
$$U_i=\{a_i : D_ia_i\le d_i\},\quad i=1,\dots,m$$
The (RC) becomes
$$\min\ c^Tx\quad\text{s.t. }\max_{a_i\in U_i}a_i^Tx\le b_i,\ i=1,\dots,m$$
By LP duality, we know that
$$\max_{a_i}\big\{a_i^Tx : D_ia_i\le d_i\big\}\ =\ \min_{p_i}\big\{d_i^Tp_i : D_i^Tp_i=x,\ p_i\ge0\big\}$$
Hence, (RC) is equivalent to
$$\min_{x,p_1,\dots,p_m}\ c^Tx\quad\text{s.t. }d_i^Tp_i\le b_i,\quad D_i^Tp_i=x,\quad p_i\ge0,\quad i=1,\dots,m$$
which is a linear program.
Ellipsoidal Uncertainty
Consider the situation when the uncertainty set is an ellipsoid
$$U_i=\big\{\bar a_i+P_iu : \|u\|_2\le1\big\},\quad i=1,\dots,m$$
where $\bar a_i\in\mathbb{R}^n$, $P_i\in\mathbb{R}^{n\times n}$.

The (RC) becomes
$$\min_x\ c^Tx\quad\text{s.t. }\max_{a_i\in U_i}a_i^Tx\le b_i,\ i=1,\dots,m$$
Note that
$$\max_{a_i\in U_i}a_i^Tx=\max_{\|u\|_2\le1}\big\{\bar a_i^Tx+u^TP_i^Tx\big\}=\bar a_i^Tx+\|P_i^Tx\|_2$$
Hence, (RC) is equivalent to
$$\min_x\ c^Tx\quad\text{s.t. }\|P_i^Tx\|_2\le b_i-\bar a_i^Tx,\ i=1,\dots,m$$
which is an SOCP.
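The worst-case computation above is easy to verify numerically (an added illustration with random data):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(3)              # nominal constraint vector
P = rng.standard_normal((3, 3))         # ellipsoid shape matrix
x = rng.standard_normal(3)

closed_form = float(a @ x + np.linalg.norm(P.T @ x))

# sampled perturbations u in the unit ball never exceed the closed form ...
worst_sampled = -np.inf
for _ in range(2000):
    u = rng.standard_normal(3)
    u *= rng.uniform(0, 1) ** (1.0 / 3.0) / np.linalg.norm(u)   # uniform in ball
    worst_sampled = max(worst_sampled, float((a + P @ u) @ x))

# ... and the analytic maximizer u* = P^T x / ||P^T x|| attains it
u_star = P.T @ x / np.linalg.norm(P.T @ x)
attained = abs(float((a + P @ u_star) @ x) - closed_form) < 1e-9

print(worst_sampled <= closed_form + 1e-9, attained)   # True True
```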
16.2 Examples
16.2.1 Example I: Robust Portfolio Selection
Consider a classical portfolio optimization problem with $n$ assets. Let $x=(x_1,\dots,x_n)$ be the portfolio vector.

1. Assume that the returns $r_i$, $i=1,\dots,n$ are exactly known. Maximizing the return leads to the optimization problem
$$\max_x\Big\{r^Tx : \sum_{i=1}^n x_i=1,\ x\ge0\Big\}\ \Longleftrightarrow\ \max_{x,\gamma}\Big\{\gamma : r^Tx\ge\gamma,\ \sum_{i=1}^n x_i=1,\ x\ge0\Big\}$$
which leads to a highly non-robust solution.

2. Assume the returns are known to lie within an ellipsoid:
$$U=\big\{\bar r+\rho\Sigma^{1/2}u : \|u\|_2\le1\big\}$$
where $\bar r$ and $\Sigma$ are the empirical mean and covariance matrix.

The robust portfolio problem is
$$\max_x\ \min_{r\in U}r^Tx,\quad\text{s.t. }x\ge0,\ \sum_{i=1}^n x_i=1$$
which is equivalent to the SOCP:
$$\max_x\ \bar r^Tx-\rho\|\Sigma^{1/2}x\|_2,\quad\text{s.t. }x\ge0,\ \mathbf{1}^Tx=1$$
This can be interpreted as a return-risk tradeoff.
3. Assume that the returns are jointly Gaussian random variables with mean $\bar r$ and covariance $\Sigma$. Consider the chance-constrained program:
$$\max_{x,\gamma}\ \gamma\quad\text{s.t. }P(r^Tx\ge\gamma)\ge1-\varepsilon,\quad\sum_{i=1}^n x_i=1,\quad x\ge0$$
Note that
$$P(r^Tx\ge\gamma)\ge1-\varepsilon\ \Leftrightarrow\ P\Big(Z\ge\frac{\gamma-\bar r^Tx}{\sqrt{x^T\Sigma x}}\Big)\ge1-\varepsilon,\ \text{where }Z\sim N(0,1)$$
$$\Leftrightarrow\ P\Big(Z\le\frac{\bar r^Tx-\gamma}{\sqrt{x^T\Sigma x}}\Big)\ge1-\varepsilon\ \Leftrightarrow\ \frac{\bar r^Tx-\gamma}{\sqrt{x^T\Sigma x}}\ge\Phi^{-1}(1-\varepsilon)\ \Leftrightarrow\ \bar r^Tx-\Phi^{-1}(1-\varepsilon)\|\Sigma^{1/2}x\|_2\ge\gamma$$
Hence, the chance-constrained program is equivalent to
$$\max_x\ \bar r^Tx-\Phi^{-1}(1-\varepsilon)\|\Sigma^{1/2}x\|_2,\quad\text{s.t. }x\ge0,\ \mathbf{1}^Tx=1$$
Remark: Robust LPs with ellipsoidal uncertainty sets are closely related to chance constraints of a stochastic model.
16.2.2 Example 2: Robust Classification
Consider the support vector machine model for binary classification:
$$\min_{\omega,b,\epsilon}\ \sum_{i=1}^m\epsilon_i\quad\text{s.t. }y_i(\omega^Tx_i+b)\ge1-\epsilon_i,\quad\epsilon\ge0 \tag{SVM}$$
where $(x_i,y_i)$, $i=1,\dots,m$ are data points with $x_i\in\mathbb{R}^n$, $y_i\in\{\pm1\}$.

1. Assume that the feature vectors $x_i$ are subject to spherical uncertainty
$$x_i\in X_i=\{\bar x_i+\rho u : \|u\|_2\le1\}$$
The robust counterpart of (SVM) simplifies to an SOCP:
$$\min_{\omega,b,\epsilon}\ \sum_{i=1}^m\epsilon_i\quad\text{s.t. }y_i(\omega^T\bar x_i+b)-\rho\|\omega\|_2\ge1-\epsilon_i,\quad\epsilon_i\ge0$$
$$\Longleftrightarrow\ \min_{\omega,b}\ \sum_{i=1}^m\max\big(1-y_i(\omega^T\bar x_i+b)+\rho\|\omega\|_2,\ 0\big)$$
which is very similar to the classical SVM with $\ell_2$-norm regularization:
$$\min_{\omega,b}\ \sum_{i=1}^m\max\big(1-y_i(\omega^Tx_i+b),\ 0\big)+\lambda\|\omega\|_2^2$$

2. Assume that the feature vectors $x_i$ are subject to box uncertainty
$$x_i\in X_i=\{\bar x_i+\rho u : \|u\|_\infty\le1\}$$
The robust counterpart of (SVM) becomes
$$\min_{\omega,b}\ \sum_{i=1}^m\max\big(1-y_i(\omega^T\bar x_i+b)+\rho\|\omega\|_1,\ 0\big)$$
which is very similar to the classical SVM with $\ell_1$-norm regularization.
Remark: Robust optimization has a close connection to the regularization technique in machine learning.
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 17: Interior Point Method, Part I – March 29
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Path-following Scheme
• Self-concordant Functions
Reference:

Nemirovski, Interior Point Polynomial Time Methods in Convex Programming, 2004, Chapter 1
Nesterov, Introductory Lectures on Convex Optimization, 2004, Chapter 4.1
17.1 Historical Notes
• 1947: Dantzig, Simplex method for LP
• 1973: Klee and Minty proved that Simplex Method is not a polynomial-time algorithm
• mid-1970s: Shor, Nemirovski and Yudin, Ellipsoid method for linear and convex programs
• 1979: Khachiyan proved the polynomial-time solvability of LP
• 1984: Karmarkar, polynomial algorithm (potential reduction interior point method) for LP
• 1967: Dikin, affine scaling algorithm (a simplification of Karmarkar's algorithm) for LP
• late-1980s: Renegar, Gonzaga, path-following interior point method for LP
• 1988: Nesterov and Nemirovski, extend interior point method for convex programs
17.2 Path following Scheme
We intend to solve a general convex program
min_x  f(x)
s.t.  gi(x) ≤ 0,  i = 1, ..., m   (P)
where f , gi are twice continuously differentiable convex functions.
Denote X = {x : gi(x) ≤ 0, ∀i = 1, ..., m} as the feasible domain. Assume the Slater condition holds and X is bounded, so that X is a compact convex set with non-empty interior.
Barrier Method: solve a series of unconstrained problems
min_x  tf(x) + F(x)   (Pt)
where t > 0 is a penalty parameter and F (x) is a barrier function that satisfies:
• F : int(X)→ R and F (x)→ +∞ as x→ ∂(X)
• F is smooth (twice continuously differentiable) and convex
• F is non-degenerate, i.e. ∇²F(x) ≻ 0, ∀x ∈ int(X)
Note that for any t > 0, (Pt) has a unique solution in the interior of X.
Denote

x*(t) = arg min_x  tf(x) + F(x)

The path {x*(t) : t > 0} is called the central path. We have

x*(t) → x*  as t → ∞
To implement the above path-following scheme, we need to specify:

1. the barrier function F(x):
   – use self-concordant barriers, e.g. F(x) = −∑_{i=1}^m log(−gi(x))

2. the method to solve the unconstrained minimization problems (Pt):
   – use the Newton method

3. the policy to update the penalty parameter t.
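The three ingredients above can be combined into a minimal sketch: a log-barrier path-following loop for an LP min c^T x s.t. Ax ≤ b. The constants (t0, the update factor mu, iteration counts) and function name are illustrative choices, not the lecture's exact scheme; the inner loop is a damped Newton method, which is developed later in these notes.

```python
import numpy as np

def barrier_newton_lp(c, A, b, x0, t0=1.0, mu=10.0, n_outer=8, n_inner=200, tol=1e-9):
    """Toy log-barrier path-following for min c^T x s.t. A x <= b.

    x0 must be strictly feasible (A x0 < b); the damped Newton stepsize
    1/(1+lambda) keeps all iterates strictly feasible.
    """
    x, t = np.asarray(x0, dtype=float), t0
    for _ in range(n_outer):
        for _ in range(n_inner):              # Newton on F_t(x) = t c^T x - sum log(b - Ax)
            s = b - A @ x                     # slacks, strictly positive
            grad = t * c + A.T @ (1.0 / s)
            hess = A.T @ (A / s[:, None] ** 2)
            d = np.linalg.solve(hess, grad)
            lam = float(np.sqrt(grad @ d))    # Newton decrement
            if lam < tol:
                break
            x = x - d / (1.0 + lam)           # damped Newton step
        t *= mu                               # increase the penalty parameter
    return x
```

For min x over [0, 1] (so c = [1], constraints x ≤ 1 and −x ≤ 0), the iterates approach the optimizer 0 at rate roughly 1/t.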
17.3 Self-concordant Function
Definition 17.1 A function f : R → R is self-concordant if f is convex and

|f'''(x)| ≤ κ f''(x)^{3/2},  ∀x ∈ dom(f)

for some constant κ ≥ 0.

When κ = 2, f is called a standard self-concordant function.
Examples:
• Logarithmic function: f(x) = − ln(x), x > 0, is standard self-concordant:

  f'(x) = −1/x,  f''(x) = 1/x²,  f'''(x) = −2/x³,  |f'''(x)|/f''(x)^{3/2} = 2
• Linear function: f(x) = cx is self-concordant with constant κ = 0:

  f'(x) = c,  f''(x) = 0,  f'''(x) = 0

• Convex quadratic function: f(x) = (a/2)x² + bx + c (a > 0) is self-concordant with κ = 0:

  f'(x) = ax + b,  f''(x) = a,  f'''(x) = 0
• Exponential function: f(x) = e^x is not self-concordant:

  f'(x) = f''(x) = f'''(x) = e^x,  |f'''(x)|/f''(x)^{3/2} = e^{−x/2} → +∞ as x → −∞
• Power functions:

  f(x) = 1/x^p (p > 0), x > 0
  f(x) = |x|^p (p > 2)
  f(x) = x^{2p} (p > 2)

  are not self-concordant.
Remark: Self-concordance is affine-invariant: if f(x) is self-concordant, then f̃(y) = f(ay + b) is also self-concordant with the same constant.

Proof: It is easy to see that f̃ is convex and

|f̃'''(y)| / f̃''(y)^{3/2} = |a³ f'''(ay + b)| / [a² f''(ay + b)]^{3/2} = |f'''(ay + b)| / f''(ay + b)^{3/2} ≤ κ
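Both the constant κ = 2 for the logarithm and the affine invariance just proved can be checked numerically; the sketch below uses the closed-form derivatives of − ln, with illustrative values of a, b.

```python
# For f(x) = -ln(x), the ratio |f'''(x)| / f''(x)^{3/2} equals 2 at every x > 0,
# and the affine substitution g(y) = -ln(a*y + b) leaves the ratio unchanged.
for x in [0.1, 1.0, 7.5]:
    ratio = abs(-2.0 / x**3) / (1.0 / x**2) ** 1.5
    assert abs(ratio - 2.0) < 1e-9

a, b = 3.0, 0.5                            # illustrative affine map y -> a*y + b
for y in [0.2, 1.0, 4.0]:
    u = a * y + b
    g2 = a**2 / u**2                       # g''(y) = a^2 f''(a y + b)
    g3 = -2.0 * a**3 / u**3                # g'''(y) = a^3 f'''(a y + b)
    assert abs(abs(g3) / g2**1.5 - 2.0) < 1e-9
```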
When extending to a function f defined on Rn with f ∈ C³(Rn), we say f is self-concordant if it is self-concordant along every line, namely, ∀x ∈ dom(f), h ∈ Rn, φ(t) = f(x + th) is self-concordant with some constant κ ≥ 0.
Denote

D^k f(x)[h1, ..., hk] = ∂^k/∂t1···∂tk |_{t1=...=tk=0} f(x + t1h1 + ... + tkhk)

as the k-th differential of f taken at x along the directions h1, ..., hk, e.g.

Df(x)[h] = φ'(0) = ⟨∇f(x), h⟩
D²f(x)[h, h] = φ''(0) = ⟨∇²f(x)h, h⟩
Definition 17.2 We say a function f : Rn → R is self-concordant if

|D³f(x)[h, h, h]| ≤ κ (D²f(x)[h, h])^{3/2},  ∀x ∈ dom(f), h ∈ Rn

for some constant κ ≥ 0.
Operations Preserving Self-Concordance
Proposition 17.3
1. (Affine invariance) If f(y) is self-concordant with constant κ, then the function f̃(x) = f(Ax + b) is also self-concordant with constant κ.

2. (Summation) If f1(x) and f2(x) are self-concordant with constants κ1, κ2, then the function f(x) = f1(x) + f2(x) is self-concordant with constant κ = max{κ1, κ2}.

3. (Scaling) If f(x) is self-concordant with constant κ and α > 0, then the function αf(x) is also self-concordant with constant κ/√α.
Proof: Exercise!
Hence, it is straightforward to see that
Corollary 17.4 f(x) = −∑_{i=1}^m ln(bi − a_i^T x) is standard self-concordant on int(X), where X = {x : a_i^T x ≤ bi, i = 1, ..., m}.
Remark: Indeed, under regular conditions, self-concordant functions are barrier functions:
• If dom(f) contains no straight line, then ∇2f(x) is non-degenerate (strictly convex)
• If f is closed convex, then f(xk) → +∞ whenever {xk} ⊆ dom(f) and xk → x ∈ ∂(dom(f))
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 18: Interior Point Method - Part II – April 3
Instructor: Niao He Scribe: Shuanglong Wang
In this lecture, we cover the following topics
• Classical Newton Method for Unconstrained Minimization
• (Damped) Newton Method for Self-concordant Functions
Reference:
Nesterov, Introductory Lectures on Convex Optimization, 2004, Chapter 1.2.4
18.1 Classical Newton Method
Consider the unconstrained minimization
min_{x∈Rn}  f(x)
where f(x) is twice continuously differentiable (not necessarily convex) on Rn.
Newton Method:

x_{k+1} = xk − [∇²f(xk)]^{−1} ∇f(xk),  k = 0, 1, 2, ...
The direction dk = −∇2f(xk)−1∇f(xk) is called Newton’s direction.
Interpretation: The Newton method can be treated as
• Minimizing the quadratic approximation of f: From Taylor's expansion, we have

  f(x + h) = f(x) + ∇f(x)^T h + (1/2) h^T ∇²f(x) h + o(‖h‖²)

  The iteration can be viewed as minimizing the quadratic approximation of f:

  x_{k+1} = arg min_x  f(xk) + ∇f(xk)^T (x − xk) + (1/2)(x − xk)^T ∇²f(xk)(x − xk)
• Solving the linearized optimality condition: The first-order optimality condition is ∇f(x) = 0. Note that

  ∇f(x + h) ≈ ∇f(x) + ∇²f(x)h

  The iteration can also be viewed as solving the linearized optimality condition

  ∇f(xk) + ∇²f(xk)(x − xk) = 0
Remark (Newton method vs Gradient Descent)

x_{k+1} = xk − ∇²f(xk)^{−1} ∇f(xk)   (Newton)
x_{k+1} = xk − γk ∇f(xk)   (GD)

Firstly, unlike gradient descent, Newton's direction is not necessarily a descent direction: for ∇f(x) ≠ 0,

∇f(x)^T d = −∇f(x)^T [∇²f(x)]^{−1} ∇f(x)

need not be negative when ∇²f(x) is not positive definite.

Secondly, the Newton method is a second-order method and requires a relatively high computational cost compared to GD.
Remark (Convergence)
(i) Newton method can break down if ∇2f(x) is degenerate.
(ii) When f is quadratic and non-degenerate, Newton method converges in one step.
(iii) The method may diverge even for a nice strongly convex function.
(iv) If started close enough to a strict local minimum, the method can converge very fast.
Example: consider the convex function f(x) = √(1 + x²).

One can easily compute that x* = 0, and

f'(x) = x/√(1 + x²),  f''(x) = 1/(1 + x²)^{3/2}

The Newton method works as

x_{k+1} = xk − (1 + xk²)^{3/2} · xk/√(1 + xk²) = −xk³
Note that
• if |x0| < 1, the method converges extremely fast
• if |x0| = 1, the method oscillates between 1 and -1
• if |x0| > 1, the method diverges.
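Because the Newton update for this example collapses to the cubing map, the three regimes above can be checked directly:

```python
# Newton on f(x) = sqrt(1 + x^2) reduces to x_{k+1} = -x_k^3.
def newton_sqrt1px2(x0, k=6):
    x = x0
    for _ in range(k):
        x = -x ** 3
    return x

assert abs(newton_sqrt1px2(0.5)) < 1e-10   # |x0| < 1: converges to x* = 0 extremely fast
assert abs(newton_sqrt1px2(1.0)) == 1.0    # |x0| = 1: oscillates between +1 and -1
assert abs(newton_sqrt1px2(1.1)) > 1e3     # |x0| > 1: diverges
```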
18.2 Classical Analysis
Theorem 18.1 (Local quadratic convergence of Newton method) Assume that

• f has a Lipschitz Hessian: ‖∇²f(x) − ∇²f(y)‖2 ≤ M ‖x − y‖2 for some M > 0.

• f has a strict local minimum x*: ∇²f(x*) ⪰ µI, with µ > 0.

• The initial point x0 is close enough to x*: ‖x0 − x*‖2 ≤ µ/(2M).

Then the Newton method is well-defined and converges to x* at a quadratic rate:

‖x_{k+1} − x*‖2 ≤ (M/µ) ‖xk − x*‖2²

Note that for a symmetric matrix A, ‖A‖2 := sup_{x≠0} ‖Ax‖2/‖x‖2 = max_k |λk(A)|. We first prove the following simple useful lemma.
Lemma 18.2 Suppose f has a Lipschitz Hessian with constant M. Then for any x, y,

‖∇f(y) − ∇f(x) − ∇²f(x)(y − x)‖2 ≤ (M/2) ‖y − x‖2².
Proof: By basic calculus,

∇f(y) − ∇f(x) = ∫_0^1 ∇²f(x + t(y − x))(y − x) dt

Hence,

‖∇f(y) − ∇f(x) − ∇²f(x)(y − x)‖2 = ‖∫_0^1 [∇²f(x + t(y − x)) − ∇²f(x)](y − x) dt‖2
                                  ≤ ∫_0^1 Mt ‖y − x‖2² dt
                                  = (M/2) ‖y − x‖2²
Now we are ready to prove the main theorem.
Proof: We have

x_{k+1} − x* = xk − x* − [∇²f(xk)]^{−1} ∇f(xk)
             = [∇²f(xk)]^{−1} [∇²f(xk)(xk − x*) − ∇f(xk)]
             = [∇²f(xk)]^{−1} [∇f(x*) − ∇f(xk) − ∇²f(xk)(x* − xk)]
The last equality is due to the fact that ∇f(x*) = 0 by the optimality condition. Hence,

‖x_{k+1} − x*‖2 ≤ ‖[∇²f(xk)]^{−1}‖2 · ‖∇f(x*) − ∇f(xk) − ∇²f(xk)(x* − xk)‖2
               ≤ ‖[∇²f(xk)]^{−1}‖2 · (M/2) ‖xk − x*‖2²

The last inequality is due to the previous lemma. We now show by induction that ‖xk − x*‖2 ≤ µ/(2M) for all k. Assume ‖xk − x*‖2 ≤ µ/(2M); then

‖∇²f(xk) − ∇²f(x*)‖2 ≤ M ‖xk − x*‖2 ≤ µ/2

which implies that −(µ/2)I ⪯ ∇²f(xk) − ∇²f(x*) ⪯ (µ/2)I. Hence,

∇²f(xk) ⪰ ∇²f(x*) − (µ/2)I ⪰ (µ/2)I

which implies that ‖[∇²f(xk)]^{−1}‖2 ≤ 2/µ. This leads to

‖x_{k+1} − x*‖2 ≤ (M/µ) ‖xk − x*‖2²

and ‖x_{k+1} − x*‖2 ≤ µ/(2M), which concludes the proof.
Note that the local convergence holds for any unconstrained minimization, regardless of whether f is convex or not. When f is strongly convex, the analysis is even simpler.
Theorem 18.3 Assume that

• f has a Lipschitz Hessian: ‖∇²f(x) − ∇²f(y)‖2 ≤ M ‖x − y‖2

• f is µ-strongly convex, i.e. ∇²f(x) ⪰ µI, ∀x

• The initial point x0 satisfies ‖∇f(x0)‖2 < 2µ²/M

Then the gradient converges to zero quadratically:

‖∇f(x_{k+1})‖2 ≤ (M/2µ²) ‖∇f(xk)‖2²

Proof: Set y = x_{k+1}, x = xk in the previous lemma. By the Newton update, ∇f(xk) + ∇²f(xk)(x_{k+1} − xk) = 0, so the lemma bounds ‖∇f(x_{k+1})‖2 directly:

‖∇f(x_{k+1})‖2 ≤ (M/2) ‖x_{k+1} − xk‖2² = (M/2) ‖[∇²f(xk)]^{−1} ∇f(xk)‖2²
              ≤ (M/2) ‖[∇²f(xk)]^{−1}‖2² · ‖∇f(xk)‖2²
              ≤ (M/2µ²) ‖∇f(xk)‖2²
Remark (Complexity) The quadratic convergence implies that

(M/µ)‖xk − x*‖2 ≤ [(M/µ)‖x_{k−1} − x*‖2]² ≤ [(M/µ)‖x_{k−2} − x*‖2]⁴ ≤ ... ≤ [(M/µ)‖x0 − x*‖2]^{2^k} ≤ (1/2)^{2^k}

Hence, ‖xk − x*‖2 ≤ (µ/M) 2^{−2^k}.

To achieve an accuracy ε, i.e. ‖xk − x*‖2 ≤ ε, the number of iterations needed is

k ≥ log2 log2 (µ/(Mε))
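The doubly-exponential error decay can be seen on a small example: Newton on the strongly convex f(x) = e^x + e^{−x}, for which the update is x_{k+1} = xk − tanh(xk) and x* = 0. (The specific test function is an illustrative choice, not from the notes.)

```python
import math

# Newton on f(x) = e^x + e^{-x}: f'(x) = 2 sinh(x), f''(x) = 2 cosh(x),
# so the update is x - f'(x)/f''(x) = x - tanh(x), with minimizer x* = 0.
x, errs = 1.0, []
for _ in range(5):
    x = x - math.tanh(x)
    errs.append(abs(x))               # error |x_k - x*|

assert errs[2] < errs[1] ** 2         # error decays at least quadratically
assert errs[3] < 1e-20                # essentially converged after 4 steps
```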
Remark (Affine Invariance) The Newton method is invariant w.r.t. affine transformations of variables.

Let A be a non-singular matrix and consider the function

f̃(y) = f(Ay)

The Newton steps for f and f̃ are

x_{k+1} = xk − [∇²f(xk)]^{−1} ∇f(xk)
y_{k+1} = yk − [∇²f̃(yk)]^{−1} ∇f̃(yk)

Let y0 = A^{−1}x0; then yk = A^{−1}xk. This can be shown by induction:

y_{k+1} = yk − [A^T ∇²f(Ayk) A]^{−1} [A^T ∇f(Ayk)]
        = A^{−1}xk − [A^T ∇²f(xk) A]^{−1} A^T ∇f(xk)
        = A^{−1}xk − A^{−1} [∇²f(xk)]^{−1} ∇f(xk)
        = A^{−1}x_{k+1}

In other words, Newton's method follows the same trajectory in the 'x-space' and the 'y-space'. Hence, the region of quadratic convergence should not depend on the Euclidean metric.
However, in the classical analysis, the assumptions and the measure of error, e.g. the Lipschitz continuity of the Hessian, depend heavily on the Euclidean metric and are not affine invariant. A natural remedy is to assume self-concordance: self-concordant functions are especially well suited for the Newton method.
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 19: Interior Point Method - Part III – April 05
Instructor: Niao He Scribe: Shuanglong Wang
In this lecture, we cover the following topics
• Properties of Self-Concordant Functions
• Minimization of Self-Concordant Functions
• (Damped) Newton Method of Self-Concordant Functions
Reference:
Nesterov, Introductory Lectures on Convex Optimization, 2004, Chapter 4.1.4 - 4.1.5
19.1 Properties of Self-Concordant Functions
Let f(x) be a standard self-concordant function, i.e. ∀x ∈ dom(f), h ∈ Rn,

|D³f(x)[h, h, h]| ≤ 2 (D²f(x)[h, h])^{3/2}

i.e.

|d³/dt³ |_{t=0} f(x + th)| ≤ 2 (d²/dt² |_{t=0} f(x + th))^{3/2}

Definition 19.1 We define the local norm of h at x ∈ dom(f) as

‖h‖x = √(h^T ∇²f(x) h)

We state below a basic inequality without proof: for a standard self-concordant function f, it holds that

|D³f(x)[h1, h2, h3]| ≤ 2 ‖h1‖x · ‖h2‖x · ‖h3‖x
Remark ("Lipschitz continuity") At a high level,

|d/dt |_{t=0} D²f(x + tδ)[h, h]| ≤ 2 ‖δ‖x D²f(x)[h, h]

The second derivative is relatively Lipschitz continuous w.r.t. the local norm defined by f.
For instance, when f is self-concordant on R and strictly convex,

|f'''(x)| / f''(x)^{3/2} ≤ 2  =⇒  |d/dx [f''(x)^{−1/2}]| ≤ 1
                           =⇒  −y ≤ ∫_0^y d/dx [f''(x)^{−1/2}] dx ≤ y
                           =⇒  −y ≤ 1/√(f''(y)) − 1/√(f''(0)) ≤ y

Simplifying the above terms, we arrive at

f''(0) / (1 + y√(f''(0)))² ≤ f''(y) ≤ f''(0) / (1 − y√(f''(0)))²,  ∀0 ≤ y < 1/√(f''(0))   (⋆)
Note that the second derivative f''(y) stays within a constant factor of f''(0) in a neighborhood of y = 0.
Moreover, if we integrate (⋆), and then integrate again, on both sides (e.g. for the lower bound):

|f'''(x)| / f''(x)^{3/2} ≤ 2  =⇒  √(f''(0)) − √(f''(0))/(1 + y√(f''(0))) ≤ f'(y) − f'(0)
                           =⇒  y√(f''(0)) − ln(1 + y√(f''(0))) ≤ f(y) − f(0) − f'(0)y

Rewriting the above and treating both sides, we arrive at: ∀0 ≤ y < 1/√(f''(0)),

(y√(f''(0)))² / (1 + y√(f''(0))) ≤ y(f'(y) − f'(0)) ≤ (y√(f''(0)))² / (1 − y√(f''(0)))   (⋆⋆)

and

y√(f''(0)) − ln(1 + y√(f''(0))) ≤ f(y) − f(0) − f'(0)y ≤ −y√(f''(0)) − ln(1 − y√(f''(0)))   (⋆⋆⋆)
More generally, for self-concordant functions on Rn, similar results hold (see Nesterov, 2004, Theorems 4.1.6-4.1.8). We provide the results below without proofs.
Definition 19.2 (Dikin Ellipsoid)

Wr(x) = {y : ‖y − x‖x ≤ r}
W⁰r(x) = {y : ‖y − x‖x < r}

Theorem 19.3 For x ∈ dom(f), we have W⁰₁(x) ⊆ dom(f).

The Dikin ellipsoid of any point in the domain is contained in the domain. More critically, a self-concordant function behaves nicely within this Dikin ellipsoid.
Theorem 19.4 (Hessian of self-concordant function) For x ∈ dom(f), we have

(1 − ‖y − x‖x)² ∇²f(x) ⪯ ∇²f(y) ⪯ (1 − ‖y − x‖x)^{−2} ∇²f(x),  ∀y ∈ W⁰₁(x)

Theorem 19.5 (Gradient of self-concordant function) For x ∈ dom(f), we have

‖y − x‖x² / (1 + ‖y − x‖x) ≤ ⟨∇f(y) − ∇f(x), y − x⟩ ≤ ‖y − x‖x² / (1 − ‖y − x‖x),  ∀y ∈ W⁰₁(x)

Theorem 19.6 (Linear approximation of self-concordant function)

1. ∀x, y ∈ dom(f), we have

   f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + ω(‖y − x‖x)

2. ∀x ∈ dom(f), y ∈ W⁰₁(x), we have

   f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + ω*(‖y − x‖x)

where ω(t) = t − ln(1 + t) and ω*(t) = −t − ln(1 − t) is its conjugate.
Proof: See Theorem 4.1.6, 4.1.7, 4.1.8 in (Nesterov, 2004).
19.2 Minimizing Self-Concordant Functions
Consider the unconstrained minimization

min_{x∈Rn}  f(x)

where f is a standard self-concordant and non-degenerate (i.e. ∇²f(x) ≻ 0) function. Note that the problem is not necessarily solvable, e.g. f(x) = − ln(x).
Definition 19.7 (Newton's decrement) The quantity

λf(x) = √(∇f(x)^T [∇²f(x)]^{−1} ∇f(x))

is called the Newton decrement.
Remark: The Newton decrement can be interpreted as

1. The decrease of the second-order Taylor expansion after a Newton step:

   f(x) − min_h { f(x) + h^T ∇f(x) + (1/2) h^T ∇²f(x) h } = (1/2) λf(x)²

2. The conjugate local norm of ∇f(x):

   ‖∇f(x)‖_{x,*} = max{∇f(x)^T y : ‖y‖x ≤ 1} = ‖[∇²f(x)]^{−1/2} ∇f(x)‖2 = λf(x)

3. The local norm of Newton's direction d(x) = −[∇²f(x)]^{−1} ∇f(x):

   ‖d(x)‖x = λf(x)
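The equivalence of these expressions is easy to verify numerically; the sketch below does so for the log-barrier F(x) = −∑ log(bi − a_i^T x) at a feasible point, with random illustrative data.

```python
import numpy as np

# Newton decrement of F(x) = -sum log(b_i - a_i^T x) at x = 0, computed two
# equivalent ways: sqrt(g^T H^{-1} g) and the local norm ||d||_x of the
# Newton direction d = -H^{-1} g.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))
b = rng.uniform(1.0, 2.0, size=6)          # x = 0 strictly feasible since b > 0
x = np.zeros(3)
s = b - A @ x                              # slacks
g = A.T @ (1.0 / s)                        # gradient of F at x
H = A.T @ (A / s[:, None] ** 2)            # Hessian of F at x
lam1 = float(np.sqrt(g @ np.linalg.solve(H, g)))
d = -np.linalg.solve(H, g)                 # Newton direction
lam2 = float(np.sqrt(d @ (H @ d)))         # local norm ||d||_x
assert abs(lam1 - lam2) < 1e-10
```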
Theorem 19.8 Assume λf(x0) < 1 for some x0 ∈ dom(f). Then there exists a unique minimizer of f.

Proof: It suffices to show that the level set {y : f(y) ≤ f(x0)} is bounded. Since

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + ω(‖y − x‖x)
     ≥ f(x) − ‖∇f(x)‖_{x,*} · ‖y − x‖x + ω(‖y − x‖x)
     = f(x) − λf(x) · ‖y − x‖x + ω(‖y − x‖x)

we have

f(y) ≤ f(x0)  =⇒  ω(‖y − x0‖_{x0}) / ‖y − x0‖_{x0} ≤ λf(x0) < 1

Notice that the function φ(t) = ω(t)/t = 1 − ln(1 + t)/t is strictly increasing in t ≥ 0 and tends to 1 as t → ∞. Hence, ‖y − x0‖_{x0} ≤ t* for some t*. This implies that the level set must be bounded.
Remark: Note that for self-concordant functions, a local condition such as λf(x0) < 1 provides global information about f.
Example: consider the self-concordant function f(x) = εx − ln(x) with dom(f) = {x : x > 0}.

λf(x) = √((ε − 1/x) (1/x²)^{−1} (ε − 1/x)) = |1 − εx|

When ε ≤ 0, λf(x) ≥ 1 for all x, the function is unbounded below, and there is no minimizer.

When ε > 0, λf(x) < 1 for x ∈ (0, 2/ε), and there exists a unique minimizer x* = 1/ε.
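The closed-form decrement in this example is easy to sanity-check:

```python
# For f(x) = eps*x - ln(x): f'(x) = eps - 1/x and f''(x) = 1/x^2, so
# lambda_f(x) = |f'(x)| / sqrt(f''(x)) = |1 - eps*x|, vanishing at x* = 1/eps.
eps = 0.5
for x in [0.5, 1.0, 2.0, 3.0]:
    lam = abs(eps - 1.0 / x) / (1.0 / x ** 2) ** 0.5
    assert abs(lam - abs(1.0 - eps * x)) < 1e-12   # x = 2 = 1/eps gives lam = 0
```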
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 20: Interior Point Method - Part IV – April 10
Instructor: Niao He Scribe: Shuanglong Wang
In this lecture, we cover the following topics
• Newton method for self-concordant functions
• Damped Newton method
Reference: Nesterov, Introductory Lectures on Convex Optimization, 2004, Chapter 4.1.5
20.1 Recall
In the last lecture, we discussed the unconstrained minimization of a self-concordant function

min_x  f(x)

where f(x) is standard self-concordant and non-degenerate.
Dikin ellipsoid: W⁰r(x) = {y : ‖y − x‖x < r}

For a standard self-concordant function, W⁰₁(x) ⊆ dom(f), ∀x ∈ dom(f). Moreover, f behaves nicely inside the Dikin ellipsoid: ∀y with ‖y − x‖x = γ < 1, it holds that

1. (1 − γ)² ∇²f(x) ⪯ ∇²f(y) ⪯ (1 − γ)^{−2} ∇²f(x)

2. γ²/(1 + γ) ≤ ⟨∇f(y) − ∇f(x), y − x⟩ ≤ γ²/(1 − γ)

3. ω(γ) ≤ f(y) − f(x) − ⟨∇f(x), y − x⟩ ≤ ω*(γ)

where ω(t) = t − ln(1 + t), ω*(t) = −t − ln(1 − t).
Newton's decrement: λf(x) = √(∇f(x)^T [∇²f(x)]^{−1} ∇f(x))

• Note that λf(x) = ‖∇f(x)‖_{x,*} = ‖d(x)‖x, where d(x) = [∇²f(x)]^{−1} ∇f(x)

• If λf(x) < 1, then the point x+ = x − d(x) ∈ dom(f)

• If x* is a minimizer of f, then λf(x*) = 0

• If λf(x0) < 1 for some x0 ∈ dom(f), then f has a unique minimizer.
20.2 Newton Method for Self-concordant Function
Basic Newton method: initialize x0 ∈ dom(f) and update via

x_{k+1} = xk − [∇²f(xk)]^{−1} ∇f(xk),  k = 0, 1, 2, ...
Unlike the classical analysis, we should adopt measures of error that are independent of the Euclidean metric. For instance,
• Function gap: f(xk)− f(x∗)
• Newton’s decrement: λf (xk) =‖ ∇f(xk) ‖xk,∗
• Local distance to the minimizer: ‖ xk − x∗ ‖xk
• Distance to the minimizer under fixed metric ‖ xk − x∗ ‖x∗
Indeed, all of these measures are equivalent locally.
Proposition 20.1 When λf(x) < 1, we have

1. f(x) − f(x*) ≤ ω*(λf(x)) ≤ λf(x)² / (2(1 − λf(x))²)

2. ‖x − x*‖x ≤ λf(x)/(1 − λf(x))

3. ‖x − x*‖_{x*} ≤ λf(x)/(1 − λf(x))
Proof: See Theorem 4.1.13 in (Nesterov, 2004).
For this reason, we will focus mainly on the convergence in terms of λf(x). For simplicity, in the following, let us denote λk := λf(xk), k = 0, 1, 2, ...
Theorem 20.2 (Local convergence) If xk ∈ dom(f) and λk < 1, then x_{k+1} ∈ dom(f) and

λ_{k+1} ≤ (λk / (1 − λk))²

Proof: Note ‖x_{k+1} − xk‖_{xk} = λf(xk) = λk < 1, so x_{k+1} ∈ dom(f). Also, it holds that

∇²f(x_{k+1}) ⪰ (1 − λk)² ∇²f(xk)

Hence,

λ_{k+1} = √(∇f(x_{k+1})^T [∇²f(x_{k+1})]^{−1} ∇f(x_{k+1})) ≤ (1/(1 − λk)) √(∇f(x_{k+1})^T [∇²f(xk)]^{−1} ∇f(x_{k+1}))
Note

∇f(x_{k+1}) = ∇f(x_{k+1}) − ∇f(xk) − [∇²f(xk)](x_{k+1} − xk)
            = [∫_0^1 (∇²f(xk + t(x_{k+1} − xk)) − ∇²f(xk)) dt] (x_{k+1} − xk)
            := G (x_{k+1} − xk)

Hence,

λ_{k+1} ≤ (1/(1 − λk)) √((x_{k+1} − xk)^T G^T [∇²f(xk)]^{−1} G (x_{k+1} − xk))
        ≤ (1/(1 − λk)) ‖x_{k+1} − xk‖_{xk} · ‖[∇²f(xk)]^{−1/2} G [∇²f(xk)]^{−1/2}‖2
        ≤ (λk/(1 − λk)) ‖H‖2,  where H := [∇²f(xk)]^{−1/2} G [∇²f(xk)]^{−1/2}

We have

G ⪰ ∇²f(xk) ∫_0^1 [(1 − tλk)² − 1] dt = (λk²/3 − λk) ∇²f(xk)

G ⪯ ∇²f(xk) ∫_0^1 [1/(1 − tλk)² − 1] dt = (λk/(1 − λk)) ∇²f(xk)

Hence, ‖H‖2 ≤ max{λk − λk²/3, λk/(1 − λk)} = λk/(1 − λk).

This leads to the conclusion λ_{k+1} ≤ (λk/(1 − λk))².
Remark: Let λ* be such that λ*/(1 − λ*)² = 1. Then if λk < λ*, we have λ_{k+1} < λk. The region of quadratic convergence is λf(x) ≤ λ* = (3 − √5)/2 ≈ 0.38.
Question: The Newton method for self-concordant functions may still diverge if not started at a point with λf(x) small enough. How can we modify the Newton method to ensure global convergence?
The remedy is to use Newton method with line-search or damping factors
xk+1 = xk − γk[∇2f(xk)]−1∇f(xk)
where γk > 0 is some stepsize.
20.3 Damped Newton Method
Damped Newton method: initialize x0 ∈ dom(f) and update via

x_{k+1} = xk − (1/(1 + λf(xk))) [∇²f(xk)]^{−1} ∇f(xk)

Note this is essentially the Newton method with the particular stepsize γk = 1/(1 + λf(xk)).
Remark.

1. Since ‖x_{k+1} − xk‖_{xk} = λf(xk)/(1 + λf(xk)) < 1, we have x_{k+1} ∈ W⁰₁(xk) ⊆ dom(f). The damped Newton procedure is always well-defined.

2. The Newton direction dk = −[∇²f(xk)]^{−1} ∇f(xk) is a descent direction when f is non-degenerate. If f is bounded below, then xk converges to the unique minimizer of f.
Theorem 20.3 (Global convergence of damped Newton) The damped Newton method satisfies:

1. (Descent phase) ∀k ≥ 0, f(x_{k+1}) ≤ f(xk) − ω(λf(xk))

2. (Quadratic convergence phase) If λf(xk) < 1/4, then λf(x_{k+1}) ≤ 2[λf(xk)]²

Proof:

1. In view of the previous theorem,

   f(x_{k+1}) ≤ f(xk) + ⟨∇f(xk), x_{k+1} − xk⟩ + ω*(‖x_{k+1} − xk‖_{xk})
             = f(xk) − λf(xk)²/(1 + λf(xk)) + ω*(λf(xk)/(1 + λf(xk)))

   where ω*(t) = −t − ln(1 − t). This can be further simplified to f(x_{k+1}) ≤ f(xk) − ω(λf(xk)).

2. The proof follows similarly to the earlier theorem for the basic Newton method.
Remark: A general strategy for self-concordant minimization:

• Damped Newton stage: while λf(xk) ≥ β, where β ∈ (0, 1/4),

  f(x_{k+1}) ≤ f(xk) − ω(β)

  The number of steps in this stage is bounded by N1 ≤ (f(x0) − f(x*))/ω(β).

• (Damped/Basic) Newton stage: once λf(xk) < β,

  λf(x_{k+1}) ≤ 2[λf(xk)]²

  The number of steps to find a solution with λf(x) ≤ ε is bounded by N2 ≤ O(1) log2 log2(1/ε).

The total complexity does not exceed

O(1)[f(x0) − f* + log log(1/ε)]
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 21: Interior Point Method - Part V – April 12
Instructor: Niao He Scribe: Shuanglong Wang
In this lecture, we cover the following topics
• Revisit barrier method
• Self-concordant barrier
• Restate path-following scheme
Reference: Nesterov, Introductory Lectures on Convex Optimization, 2004, Chapter 4.2
21.1 Revisit Barrier Method
Recall that the barrier method solves the general convex problem

min_{x∈X}  f(x),  where X = {x : gi(x) ≤ 0, i = 1, ..., m}

by solving a sequence of unconstrained minimizations min_x tf(x) + F(x), where F(x) is a barrier function defined on int(X) and t > 0 is a penalty parameter.
Without loss of generality, let us consider a problem of the form

min_{x∈X}  c^T x   (P)

where X is a closed bounded convex set with non-empty interior. Let F be standard self-concordant with cl(dom(F)) = X. We want to solve (P) by tracing the central path

x*(t) = arg min_x  Ft(x) := t c^T x + F(x)

By the optimality condition: ∀t > 0,

tc + ∇F(x*(t)) = 0

Note that Ft(x) is standard self-concordant, and the (damped) Newton method achieves fast local convergence when λ_{Ft}(x) ≤ 1/4.
When increasing t → t', we would like to preserve λ_{Ft'}(x*(t)) ≤ 1/4 while making t' as large as possible. We have

λ_{Ft'}(x*(t)) = ‖∇Ft'(x*(t))‖_{x*(t),*}
              = ‖(t' − t)c + tc + ∇F(x*(t))‖_{x*(t),*} = ‖(t' − t)c‖_{x*(t),*}
              = ‖(t'/t − 1) ∇F(x*(t))‖_{x*(t),*}

Hence, we can take

t'/t = 1 + 1/(4 λF(x*(t)))

To ensure t → +∞ at a geometric rate, we need λF(x) to be uniformly bounded from above, namely,

λF(x)² = ∇F(x)^T [∇²F(x)]^{−1} ∇F(x) ≤ ν

for some ν. This leads to the definition of self-concordant barriers.
21.2 Self-concordant Barriers
21.2.1 Definition and Examples
Definition 21.1 Let ν ≥ 0. We call F a ν-self-concordant barrier (ν-s.c.b.) for the set X = cl(dom(F)) if F is standard self-concordant and satisfies

|DF(x)[h]| ≤ ν^{1/2} √(D²F(x)[h, h]),  ∀x ∈ dom(F), h ∈ Rn   (⋆)

Remark

1. The inequality (⋆) says that |∇F(x)^T h| ≤ √ν ‖h‖x, i.e. F is Lipschitz continuous w.r.t. the local norm defined by F.

2. When F is non-degenerate, (⋆) is equivalent to

   λF(x)² = ∇F(x)^T [∇²F(x)]^{−1} ∇F(x) ≤ ν

3. The following are equivalent:

   (⋆)  ⇐⇒  ∇²F(x) ⪰ (1/ν) ∇F(x) [∇F(x)]^T
Examples

• Constant function: f(x) = c is a 0-s.c.b.

• Linear function: f(x) = a^T x + c (a ≠ 0) is not an s.c.b.

• Quadratic function: f(x) = (1/2) x^T Qx + q^T x + c with Q ≻ 0 is not an s.c.b.

• Logarithmic function: f(x) = − ln x (x > 0) is a 1-s.c.b.:

  (f'(x))² / f''(x) = (−1/x)² x² = 1
21.2.2 Self-concordance Preserving Operators
Proposition 21.2 The following are true:
1. If F (x) is a ν-self-concordant barrier, then F (y) = F (Ay + b) is a ν-self-concordant barrier.
2. If Fi(x) is νi-self-concordant barrier, i = 1, 2, then F1(x) + F2(x) is (ν1 + ν2)-self-concordantbarrier.
3. If F (x) is a ν-self-concordant barrier, then βF (x) with β ≥ 1 is a (βν)-self-concordant barrier.
Example: The function

F(x) = −∑_{i=1}^m ln(bi − a_i^T x)

is an m-self-concordant barrier for the set {x : Ax ≤ b}.
Remark: Indeed, for any closed convex set X ⊆ Rn with non-empty interior, there exists a (βn)-self-concordant barrier for X, where β is an absolute constant.
21.2.3 Properties of Self-concordant Barriers
Lemma 21.3 (Boundedness) Let F be a ν-self-concordant barrier for X. Then for any x ∈ int(X), y ∈ X, we have ⟨∇F(x), y − x⟩ ≤ ν.
Proof is omitted and can be found in Theorem 4.2.4. in (Nesterov, 2004).
Theorem 21.4 For any t > 0, we have

c^T x*(t) − min_{x∈X} c^T x ≤ ν/t

Proof: For any y ∈ X, c^T x*(t) − c^T y = −t^{−1} ∇F(x*(t))^T (x*(t) − y) ≤ ν/t.
21.3 Restate Path-following Scheme
We would like to address the convex problem

min_{x∈X}  c^T x

by tracing the central path

x*(t) = arg min_x  Ft(x) := t c^T x + F(x)

where F(x) is a ν-self-concordant barrier for X = cl(dom(F)).
Theorem 21.5 For any t > 0, we have

c^T x*(t) − min_{x∈X} c^T x ≤ ν/t

Proof: By the optimality condition, tc + ∇F(x*(t)) = 0, i.e. c = −t^{−1} ∇F(x*(t)). Hence,

∀y ∈ X:  c^T x*(t) − c^T y = −t^{−1} ∇F(x*(t))^T (x*(t) − y)
                           = t^{−1} ∇F(x*(t))^T (y − x*(t))
                           ≤ ν/t
Now consider an approximate solution x that is close to x*(t):

λ_{Ft}(x) ≤ β

where β is small enough.

Theorem 21.6 If λ_{Ft}(x) ≤ β, then

c^T x − min_{x∈X} c^T x ≤ (1/t)(ν + √ν · β/(1 − β))
Proof: First of all,

c^T x − c^T x*(t) ≤ ‖c‖_{x*(t),*} · ‖x − x*(t)‖_{x*(t)} = t^{−1} ‖∇F(x*(t))‖_{x*(t),*} · ‖x − x*(t)‖_{x*(t)}

Recall from last lecture that for x* = arg min_x f(x) with f standard self-concordant:

‖x − x*‖_{x*} ≤ λf(x)/(1 − λf(x))

Hence,

‖x − x*(t)‖_{x*(t)} ≤ λ_{Ft}(x)/(1 − λ_{Ft}(x)) ≤ β/(1 − β)

Thus, c^T x − c^T x*(t) ≤ (√ν/t) · β/(1 − β), and

c^T x − min_{x∈X} c^T x = c^T x − c^T x*(t) + c^T x*(t) − min_{x∈X} c^T x
                        ≤ (√ν/t) · β/(1 − β) + ν/t
Now we can formally describe the path-following scheme.

Path-following Scheme

• Initialize (x0, t0) with t0 > 0 and λ_{Ft0}(x0) ≤ β, where β ∈ (0, 1/4)

• For k ≥ 0, do

  t_{k+1} = tk (1 + γ/√ν)
  x_{k+1} = xk − [∇²F(xk)]^{−1} [t_{k+1} c + ∇F(xk)]
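The scheme can be sketched on a tiny LP: minimize c^T x over the box [−1, 1]², using the 4-s.c.b. F(x) = −∑ log(b − Ax). The constants t0, γ and the iteration count are illustrative choices (small enough that λ stays in the tracking region), not the lecture's exact constants.

```python
import numpy as np

# Short-step path-following: multiply t by (1 + gamma/sqrt(nu)), then take a
# single Newton step on F_t(x) = t c^T x - sum log(b - Ax).
c = np.array([1.0, 0.0])
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.ones(4)                    # feasible set: the box [-1, 1]^2
nu, gamma = 4.0, 0.05
x, t = np.zeros(2), 0.01          # x0 = analytic center, so lambda_{F_{t0}}(x0) is tiny
for _ in range(800):
    t *= 1.0 + gamma / np.sqrt(nu)
    s = b - A @ x
    g = t * c + A.T @ (1.0 / s)
    H = A.T @ (A / s[:, None] ** 2)
    x = x - np.linalg.solve(H, g)  # one Newton step per t-update
# the true optimum of c^T x over the box is -1; the remaining gap is O(nu/t)
assert c @ x + 1.0 < 1e-2
```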
Theorem 21.7 (Rate of convergence) In the above scheme, one has

c^T xk − min_{x∈X} c^T x ≤ O(1) (ν/t0) exp(−O(1) k/√ν)

where the constant factors O(1) depend solely on β and γ.
Remark [Complexity]

• The total number of Newton steps does not exceed

  N(ε) ≤ O(√ν log(ν/(t0 ε)))

• The total arithmetic cost of finding an ε-solution by the above scheme does not exceed

  O(M √ν log(ν/(t0 ε)))

  where M is the arithmetic cost of computing ∇F(x), ∇²F(x) and solving a Newton system.
Remark [Initialization] In order to obtain a valid pair (x0, t0) s.t.

λ_{Ft0}(x0) ≤ β

one can apply the following trick. Suppose x̄ ∈ dom(F) is given. Consider the auxiliary path

y*(t) = arg min_x [−t ∇F(x̄)^T x + F(x)]

When t = 1, y*(1) = x̄. When t → 0, y*(t) → xF := arg min_x F(x).

We can trace y*(t) as t decreases from 1 to 0, until we approach a pair (y0, t0) such that λ_{Ft0}(y0) ≤ β.
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 22: Interior Point Method for Conic Programs – April 17
Instructor: Niao He Scribe: Shuanglong Wang
In this lecture, we cover the following topics
• IPM for Conic Programs
• Primal Dual path following IPM
22.1 Summary of IPM
Let us first summarize the key concepts discussed in the last few lectures.
• Self-concordant barrier: A function F(x) is a ν-s.c.b. for a set X = cl(dom(F)) if it satisfies, for any x ∈ dom(F), h ∈ Rn:

  |D³F(x)[h, h, h]| ≤ 2 (D²F(x)[h, h])^{3/2}
  |DF(x)[h]| ≤ √ν (D²F(x)[h, h])^{1/2}
• Path-following interior point method: replace min_{x∈X} c^T x by min_x t c^T x + F(x):

  1. choose (x0, t0), where t0 > 0 and x0 is close to the analytic center xF := arg min_x F(x)

  2. for k = 0, 1, ..., do

     t_{k+1} = tk (1 + γ/√ν)
     x_{k+1} = xk − [∇²F(xk)]^{−1} [t_{k+1} c + ∇F(xk)]

• Iteration complexity: O(√ν log(ν/ε)) iterations.
22.2 IPM for Conic Programs
Recall the standard form of a conic program and its dual:

(Primal)  min_x  c^T x   s.t. Ax − b ∈ K
(Dual)    max_y  b^T y   s.t. A^T y = c,  y ∈ K*

This becomes an LP, SOCP, SDP when K = R^n_+, L^n, S^n_+, respectively.
22.2.1 Self-concordant Barriers for LP, SOCP, SDP
We now introduce the canonical self-concordant barriers for these cones.
Example 1: F(x) = −∑_{i=1}^n ln(xi) is an n-s.c.b. for the nonnegative orthant R^n_+.

Example 2: F(x) = − ln(xn² − x1² − ... − x_{n−1}²) is a 2-s.c.b. for the Lorentz cone L^n.

Example 3: F(X) = − ln(det(X)) = −∑_{i=1}^n ln(λi(X)) is an n-s.c.b. for the positive semidefinite cone S^n_+.
The first two examples can be easily verified from the definition. Let us prove the third case.

Proof: It suffices to show that for any X ∈ int(S^n_+) and H ∈ S^n, the one-dimensional restriction

φ(t) = F(X + tH) = − ln(det(X + tH))

is an n-self-concordant barrier. We have

φ(t) = − ln(det(X^{1/2} (I + tX^{−1/2}HX^{−1/2}) X^{1/2}))
     = − ln(det(I + tX^{−1/2}HX^{−1/2})) − ln(det(X))
     = −∑_{i=1}^n ln(1 + t λi(X^{−1/2}HX^{−1/2})) + φ(0)

Denote λi = λi(X^{−1/2}HX^{−1/2}), i = 1, ..., n, so that φ(t) = −∑_{i=1}^n ln(1 + tλi) + φ(0). Hence,

φ'(0) = −∑_{i=1}^n λi,   φ''(0) = ∑_{i=1}^n λi²,   φ'''(0) = −2 ∑_{i=1}^n λi³

Note that

(∑_{i=1}^n λi)² ≤ n ∑_{i=1}^n λi²  =⇒  φ'(0)² ≤ n φ''(0)

|∑_{i=1}^n λi³| ≤ (∑_{i=1}^n λi²)^{3/2}  =⇒  |φ'''(0)| ≤ 2 [φ''(0)]^{3/2}

Therefore, φ(t) is an n-self-concordant barrier for any X ∈ int(S^n_+) and H ∈ S^n.
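The eigenvalue identity underlying the proof, det(X + tH) = det(X) ∏(1 + tλi), can be checked numerically on random data (the matrices below are illustrative):

```python
import numpy as np

# Check phi(t) = -ln det(X + tH) against -sum ln(1 + t*lambda_i) + phi(0),
# where lambda_i are the eigenvalues of X^{-1/2} H X^{-1/2}.
rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
X = B @ B.T + 4.0 * np.eye(4)              # X in int(S^n_+)
C = rng.normal(size=(4, 4))
H = (C + C.T) / 2.0                        # H in S^n
w, V = np.linalg.eigh(X)
Xmh = V @ np.diag(w ** -0.5) @ V.T         # X^{-1/2}
lam = np.linalg.eigvalsh(Xmh @ H @ Xmh)
t = 0.05                                   # small enough that 1 + t*lam_i > 0
lhs = -np.log(np.linalg.det(X + t * H))
rhs = -np.sum(np.log(1.0 + t * lam)) - np.log(np.linalg.det(X))
assert abs(lhs - rhs) < 1e-8
```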
Remark: Indeed, any ν-self-concordant barrier for the cone S^n_+ has ν ≥ n.
As a result, the function built from the above barriers,

F̃(x) := F(Ax − b)

is a self-concordant barrier for X := {x : Ax − b ∈ K}.
22.2.2 Linear Program: IPM vs Ellipsoid Method
For instance, consider the linear program (m > n)

min  c^T x
s.t. a_j^T x ≥ bj,  j = 1, ..., m

The barrier function

F(x) = −∑_{j=1}^m ln(a_j^T x − bj)

is an m-self-concordant barrier for the constraint set.
We can compute the gradient and Hessian:

∇F(x) = −∑_{j=1}^m aj / (a_j^T x − bj),   ∇²F(x) = ∑_{j=1}^m aj aj^T / (a_j^T x − bj)²
The arithmetic costs of computing F(x), ∇F(x), ∇²F(x) are O(mn), O(mn), O(mn²), respectively. The arithmetic cost of a Newton step, i.e. solving [∇²F(x)]y = ∇F(x), is O(n³). Hence, the overall arithmetic cost of finding an ε-solution to the linear program is

O(mn²) · O(√m log(m/ε)) = O(m^{3/2} n² log(m/ε))
In contrast, when applying the ellipsoid method, the overall arithmetic cost is

O(mn + n²) · O(n² log(1/ε)) = O(mn³ log(1/ε))

The interior-point method is more efficient when m is not too large, i.e. m ≤ O(n²).
22.3 Primal-Dual Path-Following IPM
When solving conic programs, ideally we would like to develop interior point methods that
• produce primal-dual pairs at each iteration
• handle equality constraints
• require no prior knowledge of a strictly feasible solution
• adjust penalty based on current solution
Key idea: To approximate the KKT conditions.
We focus on the SDP case. Consider the standard form of the primal problem

min  Tr(CX)
s.t. Tr(Ai X) = bi,  i = 1, ..., m   (P)
     X ⪰ 0

and its dual problem

max_{y,Z}  b^T y
s.t.  ∑_{i=1}^m yi Ai + Z = C   (D)
      Z ⪰ 0

Assume (P) and (D) are strictly primal-dual feasible, so there is no duality gap.

KKT conditions for (P) and (D):

X* ⪰ 0,  Z* ⪰ 0
Tr(Ai X*) = bi,  ∀i = 1, ..., m          (primal-dual feasibility)
∑_{i=1}^m yi* Ai + Z* = C
X* Z* = 0                                 (complementary slackness)
We now consider the barrier problem of the primal problem

min  Tr(CX) − µ ln(det(X))
s.t. Tr(Ai X) = bi,  i = 1, ..., m   (BP)

and that of the dual problem

max_{y,Z}  b^T y + µ ln(det(Z))
s.t.  ∑_{i=1}^m yi Ai + Z = C   (BD)

In fact, these are Lagrange duals of each other, up to a constant.

KKT conditions for (BP) and (BD):

Tr(Ai X*(µ)) = bi,  ∀i = 1, ..., m       (primal-dual feasibility)
∑_{i=1}^m yi*(µ) Ai + Z*(µ) = C
X*(µ) Z*(µ) = µI                          (complementary slackness)

The duality gap at (X*(µ), y*(µ)) is

Tr(C X*(µ)) − b^T y*(µ) = Tr(Z*(µ) X*(µ)) = µn

As µ → 0, the duality gap vanishes, and (X*(µ), y*(µ), Z*(µ)) → (X*, y*, Z*).

The set {(X*(µ), y*(µ), Z*(µ)) : µ > 0} is called the primal-dual central path.
Newton step: Find a direction (∆X, ∆y, ∆Z) by solving the equations

Tr(Ai(X + ∆X)) = bi,  i = 1, ..., m
∑_{i=1}^m (yi + ∆yi) Ai + (Z + ∆Z) = C
(X + ∆X)(Z + ∆Z) = µI

=⇒

Tr(Ai ∆X) = 0,  i = 1, ..., m
∑_{i=1}^m ∆yi Ai + ∆Z = 0               (⋆)
(X + ∆X)(Z + ∆Z) = µI
Basic primal-dual scheme
Initialize (X, y, Z) = (X⁰, y⁰, Z⁰) with X⁰ ≻ 0, Z⁰ ≻ 0.
At each iteration:
• compute µ = Tr(XZ)/n and set µ ← µ/2
• compute (∆X, ∆y, ∆Z) by solving the KKT equations (⋆)
• update (X, y, Z) ← (X + α∆X, y + β∆y, Z + β∆Z) with proper α, β that preserve positive definiteness of (X, Z).
Remark (Approximation of the KKT equations). Note that only the last equation in the system of KKT conditions is nonlinear. One can apply a first-order approximation (dropping the second-order term ∆X∆Z):

    µI = (X + ∆X)(Z + ∆Z) ≈ XZ + ∆X Z + X ∆Z

and solve the linearized KKT equations.
and solve the linearized KKT equations.
Lecture 23: CVX Tutorial
(Slides Origin: Boyd & Vandenberghe)
IE 521: Convex Optimization Spring 2017
April 19, 2017
Cone program solvers
• LP solvers
– many, open source and commercial
• cone solvers
– each handles combinations of a subset of LP, SOCP, SDP, EXP cones
– open source: SDPT3, SeDuMi, CVXOPT, CSDP, ECOS, SCS, . . .
– commercial: Mosek, Gurobi, Cplex, . . .
Transforming problems to cone form
• lots of tricks for transforming a problem into an equivalent cone program
– introducing slack variables
– introducing new variables that upper bound expressions
• these tricks greatly extend the applicability of cone solvers
• writing code to carry out this transformation is painful
• modeling systems automate this step
Modeling systems
a typical modeling system
• automates transformation to cone form; supports
– declaring optimization variables
– describing the objective function
– describing the constraints
– choosing (and configuring) the solver
• when given a problem instance, calls the solver
• interprets and returns the solver’s status (optimal, infeasible, . . . )
• (when solved) transforms the solution back to original form
Some current modeling systems
• AMPL & GAMS (proprietary)
– developed in the 1980s, still widely used in traditional OR
– no support for convex optimization
• YALMIP (‘Yet Another LMI Parser’, matlab)
– first object-oriented convex optimization modeling system
• CVX (matlab)
• CVXPY (python, GPL)
• Convex.jl (Julia, GPL, merging into JUMP)
• CVX, CVXPY, and Convex.jl collectively referred to as CVX*
Disciplined convex programming
• describe objective and constraints using expressions formed from
– a set of basic atoms (affine, convex, concave functions)
– a restricted set of operations or rules (that preserve convexity)
• modeling system keeps track of affine, convex, concave expressions
• rules ensure that
– expressions recognized as convex are convex
– but, some convex expressions are not recognized as convex
• problems described using DCP are convex by construction
• all convex optimization modeling systems use DCP
CVX
• uses DCP
• runs in Matlab, between the cvx_begin and cvx_end commands
• relies on SDPT3 or SeDuMi (LP/SOCP/SDP) solvers
• refer to user guide, online help for more info
• the CVX example library has more than a hundred examples
Example: Constrained norm minimization
A = randn(5, 3);
b = randn(5, 1);
cvx_begin
variable x(3);
minimize(norm(A*x - b, 1))
subject to
-0.5 <= x;
x <= 0.3;
cvx_end
• between cvx_begin and cvx_end, x is a CVX variable
• statement subject to does nothing, but can be added for readability
• inequalities are interpreted elementwise
What CVX does
after cvx_end, CVX
• transforms problem into an LP
• calls solver SDPT3
• overwrites (object) x with (numeric) optimal value
• assigns problem optimal value to cvx_optval
• assigns problem status (which here is Solved) to cvx_status
(had the problem been infeasible, cvx_status would be Infeasible and x would be NaN)
Declare variables
• declare variables with variable name[(dims)] [attributes]
– variable x(3);
– variable C(4,3);
– variable S(3,3) symmetric;
– variable D(3,3) diagonal;
– variables y z;
Affine expressions

• form affine expressions

A = randn(4, 3);
variables x(3) y(4);

– 3*x + 4
– A*x - y
– x(2:3)
– sum(x)
Some functions

function              meaning                                attributes
norm(x, p)            ‖x‖_p                                  cvx
square(x)             x²                                     cvx
square_pos(x)         (x₊)²                                  cvx, nondecr
pos(x)                x₊                                     cvx, nondecr
sum_largest(x,k)      x_[1] + · · · + x_[k]                  cvx, nondecr
sqrt(x)               √x (x ≥ 0)                             ccv, nondecr
inv_pos(x)            1/x (x > 0)                            cvx, nonincr
max(x)                max{x_1, . . . , x_n}                  cvx, nondecr
quad_over_lin(x,y)    x²/y (y > 0)                           cvx, nonincr in y
lambda_max(X)         λ_max(X) (X = X^T)                     cvx
huber(x)              x² if |x| ≤ 1; 2|x| − 1 if |x| > 1     cvx
Composition rules
• for convex h, h(g1, . . . , gk) is recognized as convex if, for each i,

– gi is affine, or
– gi is convex and h is nondecreasing in its ith arg, or
– gi is concave and h is nonincreasing in its ith arg

• for concave h, h(g1, . . . , gk) is recognized as concave if, for each i,

– gi is affine, or
– gi is convex and h is nonincreasing in its ith arg, or
– gi is concave and h is nondecreasing in its ith arg

• can combine atoms using valid composition rules, e.g.:

– a convex function of an affine function is convex
– the negative of a convex function is concave
– a convex, nondecreasing function of a convex function is convex
– a concave, nondecreasing function of a concave function is concave
Valid (recognized) examples
u, v, x, y are scalar variables; X is a symmetric 3× 3 variable
• convex:
– norm(A*x - y) + 0.1*norm(x, 1)
– quad_over_lin(u - v, 1 - square(v))
– lambda_max(2*X - 4*eye(3))
– norm(2*X - 3, ’fro’)
• concave:
– min(1 + 2*u, 1 - max(2, v))
– sqrt(v) - 4.55*inv_pos(u - v)
Rejected examples
u, v, x, y are scalar variables
• neither convex nor concave:
– square(x) - square(y)
– norm(A*x - y) - 0.1*norm(x, 1)
• rejected due to limited DCP ruleset:
– sqrt(sum(square(x))) (is convex; could use norm(x))
– square(1 + x^2) (is convex; could use square_pos(1 + x^2), or 1 + 2*pow_pos(x, 2) + pow_pos(x, 4))
Sets
• some constraints are more naturally expressed with convex sets
• sets in CVX work by creating unnamed variables constrained to the set
• examples:
– semidefinite(n)
– nonnegative(n)
– simplex(n)
– lorentz(n)
• semidefinite(n), say, returns an unnamed (symmetric matrix) variable that is constrained to be positive semidefinite
Using the semidefinite cone
variables: X (symmetric matrix), z (vector), t (scalar)
constants: A and B (matrices)
• X == semidefinite(n)
– means X ∈ S^n_+ (or X ⪰ 0)
• A*X*A’ - X == B*semidefinite(n)*B’
– means ∃ Z ⪰ 0 so that AXA^T − X = BZB^T
• [X z; z’ t] == semidefinite(n+1)
– means

    [ X    z ]
    [ z^T  t ]  ⪰ 0
Objectives and constraints
• objective can be
– minimize(convex expression)
– maximize(concave expression)
– omitted (feasibility problem)
• constraints can be
– convex expression <= concave expression
– concave expression >= convex expression
– affine expression == affine expression
– omitted (unconstrained problem)
More involved example
A = randn(5);
A = A’*A;
cvx_begin
variable X(5, 5) symmetric;
variable y;
minimize(norm(X) - 10*sqrt(y))
subject to
X - A == semidefinite(5);
X(2,5) == 2*y;
X(3,1) >= 0.8;
y <= 4;
cvx_end
Defining new functions
• can make a new function using existing atoms
• example: the convex deadzone function
f(x) = max{|x| − 1, 0} = { 0 if |x| ≤ 1;  x − 1 if x > 1;  1 − x if x < −1 }
• create a file deadzone.m with the code
function y = deadzone(x)
y = max(abs(x) - 1, 0)
• deadzone makes sense both within and outside of CVX
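For reference, a plain-Python numeric counterpart of the same function (a sketch; the MATLAB file above is what CVX actually uses inside models):

```python
def deadzone(x):
    # f(x) = max(|x| - 1, 0): zero on [-1, 1], |x| - 1 outside
    return max(abs(x) - 1, 0)

assert deadzone(0.5) == 0
assert deadzone(2.0) == 1.0
assert deadzone(-3.0) == 2.0
```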
Defining functions via incompletely specified problems
• suppose f0, . . . , fm are convex in (x, z)
• let φ(x) be the optimal value of the convex problem, with variable z and parameter x
minimize f0(x, z)
subject to fi(x, z) ≤ 0, i = 1, . . . ,m
A1x+A2z = b
• φ is a convex function
• problem above sometimes called incompletely specified since x isn't (yet) given
• an incompletely specified concave maximization problem defines a concave function
CVX functions via incompletely specified problems
implement in CVX with

function cvx_optval = phi(x)
cvx_begin
variable z;
minimize(f0(x, z))
subject to
f1(x, z) <= 0; ...
A1*x + A2*z == b;
cvx_end
• function phi will work for numeric x (by solving the problem)
• function phi can also be used inside a CVX specification, wherever a convex function can be used
Simple example: Two element max
• create file max2.m containing
function cvx_optval = max2(x, y)
cvx_begin
variable t;
minimize(t)
subject to
x <= t;
y <= t;
cvx_end
• the constraints define the epigraph of the max function
• could add logic to return max(x,y) when x, y are numeric (otherwise, an LP is solved to evaluate the max of two numbers!)
A more complex example
• f(x) = x + x^1.5 + x^2.5, with dom f = R_+, is a convex, monotone increasing function

• its inverse g = f^{−1} is concave, monotone increasing, with dom g = R_+
• there is no closed form expression for g
• g(y) is optimal value of problem
maximize   t
subject to t₊ + t₊^1.5 + t₊^2.5 ≤ y

(for y < 0, this problem is infeasible, so the optimal value is −∞)
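Since f is continuous and strictly increasing on R_+ with f(0) = 0, g can also be evaluated numerically without any optimization, e.g. by bisection (a Python sketch, separate from the CVX implementation below; the tolerance is an arbitrary choice):

```python
def f(x):
    return x + x**1.5 + x**2.5

def g(y, tol=1e-10):
    # inverse of f on R_+: solve f(x) = y by bisection
    if y < 0:
        return float("-inf")        # the defining problem is infeasible
    lo, hi = 0.0, 1.0
    while f(hi) < y:                # bracket the solution
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) < y else (lo, mid)
    return (lo + hi) / 2

assert abs(f(g(14.3)) - 14.3) < 1e-6
```

Bisection only evaluates g at numeric points, whereas the CVX definition below lets g appear symbolically inside other convex programs.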
• implement as

function cvx_optval = g(y)
cvx_begin
variable t;
maximize(t)
subject to
pos(t) + pow_pos(t, 1.5) + pow_pos(t, 2.5) <= y;
cvx_end
• use it as an ordinary function, as in g(14.3), or within CVX as a concave function:

cvx_begin
variables x y;
minimize(quad_over_lin(x, y) + 4*x + 5*y)
subject to
g(x) + 2*g(y) >= 2;
cvx_end
Example
• optimal value of LP
f(c) = inf{c^T x | Ax ⪯ b}

is a concave function of c
• by duality (assuming feasibility of Ax ⪯ b) we have

f(c) = sup{−λ^T b | A^T λ + c = 0, λ ⪰ 0}
• define f in CVX as
function cvx_optval = lp_opt_val(A,b,c)
cvx_begin
variable lambda(length(b));
maximize(-lambda’*b);
subject to
A’*lambda + c == 0; lambda >= 0;
cvx_end
• in lp_opt_val(A,b,c), A and b must be constant; c can be affine
CVX hints/warnings
• watch out for = (assignment) versus == (equality constraint)
• X >= 0, with matrix X, is an elementwise inequality
• X >= semidefinite(n) means: X is elementwise larger than some positive semidefinite matrix (which is likely not what you want)
• writing subject to is unnecessary (but can look nicer)
• many problems traditionally stated using convex quadratic forms can be posed as norm problems (which can have better numerical properties):
x’*P*x <= 1 can be replaced with norm(chol(P)*x) <= 1
Useful Resources
• http://cvxr.com/cvx/
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 24: Subgradient Method – April 24
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email the TA ([email protected]) if you find any typos or mistakes.
24.1 Optimization Algorithms and Convergence Rates
Till now, we have discussed several important algorithms for solving convex optimization:
• Ellipsoid method: poly-time algorithm, black-box method, requires first-order and separation oracles
• Interior point method: poly-time algorithm, barrier method, requires structural assumptions on the domain and self-concordant barriers
• Newton method: (locally) quadratically convergent algorithm, black-box method, requires smoothness assumptions on the objective and first-order and second-order oracles
While the above algorithms can solve convex programs to high accuracy within a small number of iterations, they suffer from very expensive iteration cost (often cubic in the problem size), which eventually becomes impractical for large-scale convex problems.
First-order Methods:
For large-scale convex optimization, simpler algorithms such as first-order methods essentially become the only methods of choice. There exists a dedicated library of efficient first-order optimization algorithms:
• Gradient descent
• Nesterov’s accelerated gradient descent and variants (FISTA, geometric descent, etc)
• Coordinate descent and many variants
• Conditional gradient (a.k.a. Frank-Wolfe method)
• Subgradient methods
• Primal-dual methods (Arrow-Hurwicz method, etc)
• Proximal and operator splitting methods (proximal gradient method, ADMM, etc)
• Stochastic and incremental gradient methods (stochastic gradient descent, SVRG, etc)
In the rest of the semester we will present several primary methods, mainly for generic non-differentiable constrained problems.
Rate of Convergence:
Suppose the sequence {x_k} converges to x*.
Definition 24.1 (Q-convergence) The convergence rate is said to be

• linear: if ∃ q ∈ (0, 1) such that lim sup_{k→+∞} ‖x_{k+1} − x*‖/‖x_k − x*‖ = q

• superlinear: if lim sup_{k→+∞} ‖x_{k+1} − x*‖/‖x_k − x*‖ = 0

• sublinear: if lim sup_{k→+∞} ‖x_{k+1} − x*‖/‖x_k − x*‖ = 1

• quadratic: if lim sup_{k→+∞} ‖x_{k+1} − x*‖/‖x_k − x*‖² < +∞

These are also called Q-convergence rates (Q for quotient).
Example: x_k = 1/2^k, 1/k, (1/2)^{2^k} are Q-linear, Q-sublinear, and Q-superlinear convergent, respectively.

Note that Q-convergence can be troublesome; for example, consider the sequence

x_k = { 1 + 1/2^k,  k is even
        1,          k is odd. }
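The three example rates can also be seen numerically (a small Python sketch; the values of k are arbitrary):

```python
def ratio(seq, k):
    # successive-error quotient ||x_{k+1} - x*|| / ||x_k - x*||, with x* = 0
    return seq(k + 1) / seq(k)

lin = lambda k: 1 / 2**k          # Q-linear: quotient -> 1/2
sub = lambda k: 1 / k             # Q-sublinear: quotient -> 1
sup = lambda k: 0.5**(2**k)       # Q-superlinear: quotient -> 0

assert ratio(lin, 20) == 0.5
assert abs(ratio(sub, 20) - 1.0) < 0.1
assert ratio(sup, 8) < 1e-9
```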
Definition 24.2 (R-convergence) We say {x_k} converges to x* R-linearly if ∃ {δ_k} such that ‖x_k − x*‖ ≤ δ_k and δ_k converges Q-linearly to 0.
For simplicity, we will drop the ‘Q’ and ‘R’.
24.2 Subgradient Method
Consider the generic convex minimization
min f(x)
s.t. x ∈ X
where
• f is convex and possibly non-differentiable
• X is non-empty, closed and convex.
Assume the problem is solvable, with optimal solution and value denoted by x* and f*. Recall that a vector g ∈ ∂f(x) is called a subgradient of f at x if

f(y) ≥ f(x) + g^T(y − x), ∀y ∈ dom(f)

The subgradient inequality can be interpreted as a supporting hyperplane for the level set L_{f(x)}(f) = {y : f(y) ≤ f(x)}:

g^T(y − x) ≤ 0, ∀y ∈ L_{f(x)}(f) = {y : f(y) ≤ f(x)}
24.2.1 Subgradient Method (N.Shor, 1967)
Subgradient method works as follows: start with x1 ∈ X and update
xt+1 = ΠX(xt − γtgt)
where
• gt ∈ ∂f(xt) is a subgradient of f at xt.
• γt > 0 is a proper stepsize
• ΠX(x) = arg miny∈X ‖ y − x ‖2 is the Euclidean projection.
Remark:
• When X = R^n and f is continuously differentiable, this reduces to

x_{t+1} = x_t − γ_t ∇f(x_t)

which is the well-known Gradient Descent method
• Unlike Gradient Descent, the negative subgradient direction is not always a descent direction.
Choices of Stepsize:
• Constant stepsize: γ_t = γ, ∀t

• Diminishing stepsize: γ_t → 0 and ∑_{t=1}^∞ γ_t = +∞, e.g. γ_t = 1/√t

• Square summable stepsize: ∑_{t=1}^∞ γ_t² < +∞ and ∑_{t=1}^∞ γ_t = +∞, e.g. γ_t = 1/t

• Scaled stepsize: γ_t = γ/‖g_t‖₂

• Dynamic stepsize (Polyak's optimal stepsize): γ_t = (f(x_t) − f*)/‖g_t‖₂²
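The method fits in a few lines of NumPy (a sketch; the objective ‖x − c‖₁ over a box is an illustrative instance with known optimum, not an example from the lecture). It uses the diminishing stepsize γ_t = 1/√t and box projection by clipping:

```python
import numpy as np

def subgradient_method(c, lo, hi, T=500):
    # minimize f(x) = ||x - c||_1 over the box X = [lo, hi]^n
    x = np.clip(np.zeros_like(c), lo, hi)        # x_1 in X
    best = np.abs(x - c).sum()
    for t in range(1, T + 1):
        g = np.sign(x - c)                       # g_t in subdifferential of f at x_t
        x = np.clip(x - g / np.sqrt(t), lo, hi)  # x_{t+1} = Pi_X(x_t - gamma_t g_t)
        best = min(best, np.abs(x - c).sum())
    return best

best = subgradient_method(np.array([2.0, -1.0]), 0.0, 1.0)
assert abs(best - 2.0) < 1e-6                    # optimum x* = (1, 0), f* = 2
```

Tracking the best value seen is necessary because, as noted above, individual subgradient steps need not decrease f.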
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 25: Subgradient Method and Bundle Methods – April 24
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email the TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Subgradient Method
• Bundle Method
– Kelley cutting plane method
– Level method
Reference: Nesterov, 2004, Sections 3.2.3 and 3.3.
25.1 Subgradient Method
Recall that subgradient method works as follows
xt+1 = ΠX(xt − γtgt), t = 1, 2, ...
where gt ∈ ∂f(xt), γt > 0 and ΠX(x) = arg miny∈X ‖ y − x ‖2 is the projection operator.
Note that the projection on X is easy to compute when X is simple, e.g. X is a ball, box, simplex, polyhedron, etc.
Lemma 25.1 (Projection) ∀x ∈ R^n, z ∈ X,

‖x − z‖₂² ≥ ‖x − Π_X(x)‖₂² + ‖z − Π_X(x)‖₂²

Proof: When x ∈ X, the inequality immediately holds true. Let x ∉ X. By definition,

Π_X(x) = arg min_{z∈X} ‖z − x‖₂²

By the optimality condition, this implies 2(Π_X(x) − x)^T(z − Π_X(x)) ≥ 0, ∀z ∈ X. Hence,

‖x − z‖₂² = ‖x − Π_X(x) + Π_X(x) − z‖₂² ≥ ‖x − Π_X(x)‖₂² + ‖Π_X(x) − z‖₂²
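The lemma is easy to verify numerically when X is a box, where Π_X is coordinatewise clipping (a NumPy sketch with arbitrary random points; the box [−1, 1]³ is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
lo, hi = -1.0, 1.0                      # X = [-1, 1]^3

for _ in range(100):
    x = rng.uniform(-5, 5, 3)           # arbitrary point
    z = rng.uniform(lo, hi, 3)          # any z in X
    p = np.clip(x, lo, hi)              # Pi_X(x) for a box is clipping
    assert np.sum((x - z)**2) >= np.sum((x - p)**2) + np.sum((z - p)**2) - 1e-9
print("projection inequality verified on random samples")
```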
Lemma 25.2 (Key relation) For the subgradient method, we have

‖x_{t+1} − x*‖₂² ≤ ‖x_t − x*‖₂² − 2γ_t(f(x_t) − f*) + γ_t²‖g_t‖₂²   (⋆)

Proof:

‖x_{t+1} − x*‖₂² = ‖Π_X(x_t − γ_t g_t) − x*‖₂²
                 ≤ ‖x_t − γ_t g_t − x*‖₂²
                 = ‖x_t − x*‖₂² − 2γ_t g_t^T(x_t − x*) + γ_t²‖g_t‖₂²

Due to convexity of f, we have f* ≥ f(x_t) + g_t^T(x* − x_t), i.e.

g_t^T(x_t − x*) ≥ f(x_t) − f*

Combining these two inequalities leads to the desired result.
Remark: Note that when f* is known, we can choose the 'optimal' γ_t by minimizing the right hand side of (⋆): γ_t* = (f(x_t) − f*)/‖g_t‖₂², which is Polyak's stepsize.
In fact, knowing f* is not a problem sometimes. For instance, when the goal is to solve the convex feasibility problem — find x ∈ X such that g_i(x) ≤ 0, i = 1, ..., m — we can formulate this as

min_{x∈X} max_{1≤i≤m} g_i(x)   or   min_{x∈X} ∑_{i=1}^m max(g_i(x), 0)

The optimal value f* is known to be 0 in this case.
If f∗ is not known, one can replace f∗ by its online estimate.
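This use case can be sketched in NumPy (the halfspaces A, b below are an arbitrary illustrative instance): for f(x) = max_i (a_i^T x − b_i) with X = R², we know f* ≤ 0 at any feasible point, so taking the Polyak step with target 0 projects x onto the most violated halfspace.

```python
import numpy as np

A = np.array([[1.0, 1.0], [-1.0, 2.0], [3.0, -1.0]])   # find x with A x <= b
b = np.array([1.0, 0.5, 2.0])

x = np.array([10.0, 10.0])               # infeasible start
for _ in range(200):
    viol = A @ x - b
    i = int(np.argmax(viol))
    if viol[i] <= 1e-9:                  # feasible: objective has hit the target 0
        break
    g = A[i]                             # subgradient of the active piece
    x = x - (viol[i] / (g @ g)) * g      # Polyak step: (f(x) - f*) / ||g||^2

assert np.all(A @ x - b <= 1e-6)
```

Each such step is exactly the orthogonal projection onto the boundary of the most violated halfspace, a classical relaxation scheme for linear feasibility.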
Theorem 25.3 (Convergence) Suppose f(x) is convex and Lipschitz continuous on X:

|f(x) − f(y)| ≤ M_f ‖x − y‖₂, ∀x, y ∈ X

where M_f < +∞. Then the subgradient method satisfies:

f(x̄_T) − f* ≤ (‖x_1 − x*‖₂² + ∑_{t=1}^T γ_t² M_f²) / (2 ∑_{t=1}^T γ_t)

where x̄_T = (∑_{t=1}^T γ_t)^{−1} (∑_{t=1}^T γ_t x_t).
Proof: The Lipschitz continuity implies that ‖g_t‖₂ ≤ M_f, ∀t. Summing up the key relation (⋆) from t = 1 to t = T, we obtain

2 ∑_{t=1}^T γ_t (f(x_t) − f*) ≤ ‖x_1 − x*‖₂² − ‖x_{T+1} − x*‖₂² + ∑_{t=1}^T γ_t² M_f²

By convexity of f: (∑_{t=1}^T γ_t) f(x̄_T) ≤ ∑_{t=1}^T γ_t f(x_t). This further leads to

(∑_{t=1}^T γ_t)[f(x̄_T) − f*] ≤ (1/2)‖x_1 − x*‖₂² + (M_f²/2) ∑_{t=1}^T γ_t²

and concludes the proof.
Convergence under various stepsizes. Assume D_X = max_{x,y∈X} ‖x − y‖₂ is the diameter of the set X. It is interesting to see how the bound in the above theorem implies convergence, and even the convergence rate, for different choices of stepsizes. By abuse of notation, we denote both min_{1≤t≤T} f(x_t) − f* and f(x̄_T) − f* as ε_T.
1. Constant stepsize: with γ_t ≡ γ,

ε_T ≤ (D_X² + Tγ²M_f²)/(2Tγ) = (D_X²/(2T))·(1/γ) + (M_f²/2)·γ  →  (M_f²/2)·γ  as T → ∞.

It is worth noticing that the error upper bound does not diminish to zero as T grows to infinity, which shows one of the drawbacks of using arbitrary constant stepsizes. In addition, to optimize the upper bound, we can select the optimal stepsize γ*:

γ* = D_X/(M_f √T)  ⇒  ε_T ≤ D_X M_f/√T.

Under this optimal choice ε_T ∼ O(D_X M_f/√T). However, this exhibits another drawback of constant stepsizes: in practice T is not known in advance for evaluating the optimal γ*.
2. Scaled stepsize: with γ_t = γ/‖g(x_t)‖₂, using ‖g(x_t)‖₂ ≤ M_f,

ε_T ≤ (D_X² + γ²T)/(2γ ∑_{t=1}^T 1/‖g(x_t)‖₂) ≤ M_f ((D_X²/(2T))·(1/γ) + γ/2)  →  (M_f/2)·γ  as T → ∞.

Similarly, we can select the optimal γ by minimizing the right hand side, i.e. γ* = D_X/√T:

γ_t = D_X/(√T ‖g(x_t)‖₂)  ⇒  ε_T ≤ D_X M_f/√T.

The same convergence rate is achieved, while the same drawback of not knowing T in advance still exists in choosing γ_t.
3. Non-summable but diminishing stepsize:

ε_T ≤ (D_X² + ∑_{t=1}^T γ_t² M_f²) / (2 ∑_{t=1}^T γ_t)
    ≤ (D_X² + ∑_{t=1}^{T_1} γ_t² M_f²) / (2 ∑_{t=1}^T γ_t) + (M_f² ∑_{t=T_1+1}^T γ_t²) / (2 ∑_{t=T_1+1}^T γ_t)

where 1 ≤ T_1 ≤ T. When T → ∞, select a large T_1: the first term on the right hand side → 0 since {γ_t} is non-summable, and the second term also → 0 because γ_t² always approaches zero faster than γ_t. Consequently, we know that

ε_T → 0 as T → ∞.
An example choice of the stepsize is γ_t = O(1/t^q) with q ∈ (0, 1]. As in the above cases, if we choose

γ_t = D_X/(M_f √t)  ⇒  ε_T ≤ O(D_X M_f ln(T)/√T).

In fact, if we average from T/2 instead of 1, we have

min_{T/2≤t≤T} f(x_t) − f* ≤ O(M_f D_X/√T).
4. Non-summable but square-summable stepsize: It is obvious that

ε_T ≤ (D_X² + M_f² ∑_{t=1}^T γ_t²) / (2 ∑_{t=1}^T γ_t)  →  0  as T → ∞.

A typical choice is γ_t = 1/t^{(1+q)/2} with q ∈ (0, 1), which results in a rate of O(1/T^{(1−q)/2}), approaching O(1/√T) for small q.
5. Polyak stepsize: The stepsize yields

‖x_{t+1} − x*‖₂² ≤ ‖x_t − x*‖₂² − (f(x_t) − f*)²/‖g(x_t)‖₂²,   (25.1)

which guarantees that ‖x_t − x*‖₂² decreases at each step. Applying (25.1) recursively, we obtain

∑_{t=1}^T (f(x_t) − f*)² ≤ D_X² · M_f² < ∞.

Therefore we have ε_T → 0 as T → ∞, and ε_T ≤ O(1/√T).
Corollary 25.4 When T is known, setting γ_t ≡ D_X/(M_f √T), in particular, we have

f(x̄_T) − f* ≤ D_X M_f/√T
Remark: The subgradient method converges sublinearly. For an accuracy ε > 0, it needs O(D_X² M_f²/ε²) iterations.
25.2 Bundle Method
When running the subgradient method, we obtain a bundle of affine underestimates of f(x):

f(x_t) + g_t^T(x − x_t), t = 1, 2, ...
Definition 25.5 The piecewise linear function

f_t(x) = max_{1≤i≤t} {f(x_i) + g_i^T(x − x_i)}

where g_i ∈ ∂f(x_i), is called the t-th model of the convex function f.
Note
1. ft(x) ≤ f(x),∀x ∈ X
2. ft(xi) = f(xi), ∀1 ≤ i ≤ t
3. f1(x) ≤ f2(x) ≤ ... ≤ ft(x) ≤ ... ≤ f(x)
25.2.1 Kelley method (Kelley, 1960)
The Kelley method works as follows:
x_{t+1} = arg min_{x∈X} f_t(x)

Obviously, the above algorithm converges as long as X is compact. The auxiliary problem is not too demanding (it reduces to an LP) when X is a polyhedron. However, the issue is that x_{t+1} is not unique, and the Kelley method can be very unstable. Indeed, the worst-case complexity of the Kelley method is at least O(1/ε^{(n−1)/2}).
Remedy: To prevent the instability issue, a possible remedy is to update x_{t+1} by

x_{t+1} = arg min_{x∈X} { f_t(x) + (α_t/2)‖x − x_t‖₂² }

with properly selected α_t > 0.
25.2.2 Level Method (Lemaréchal, Nemirovski, Nesterov, 1995)
Denote

f̲_t = min_{x∈X} f_t(x)       (minimal value of the model)
f̄_t = min_{1≤i≤t} f(x_i)     (record value of the model)

we have f̲_1 ≤ f̲_2 ≤ ... ≤ f* ≤ ... ≤ f̄_2 ≤ f̄_1.

Denote the level set

L_t = {x ∈ X : f_t(x) ≤ l_t := (1 − α)f̲_t + αf̄_t}

Note that L_t is nonempty, convex and closed, and doesn't contain the search points x_1, ..., x_t.
The Level method works as follows:

x_{t+1} = Π_{L_t}(x_t) = arg min_{x∈X} { ‖x − x_t‖₂² : f_t(x) ≤ l_t }

Note that when α = 0, this reduces to the Kelley method; when α = 1, there will be no progress.
The auxiliary problem reduces to a quadratic program when X is a polyhedron.
Theorem 25.6 When t > (1/((1 − α)² α (2 − α))) (M_f D_X/ε)², we have

f̄_t − f̲_t ≤ ε

where M_f is the Lipschitz constant and D_X is the diameter of the set X.
Remark: The Level method achieves the same complexity as the subgradient method (which is indeed unimprovable), but can perform much better than the subgradient method in practice.
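In one dimension the level-set projection is easy, since the sublevel set of the piecewise-linear model is an interval, so the whole scheme can be sketched on a grid instead of with a QP solver (a NumPy sketch; the instance f(x) = |x − 0.3| on [−1, 1] and the choice α = 0.5 are illustrative):

```python
import numpy as np

f  = lambda x: abs(x - 0.3)                  # f* = 0, minimized at x = 0.3
df = lambda x: 1.0 if x >= 0.3 else -1.0     # a subgradient of f

grid = np.linspace(-1.0, 1.0, 20001)         # X = [-1, 1]
alpha, x_t, pts = 0.5, -1.0, []
for t in range(40):
    pts.append(x_t)
    model = np.max([f(p) + df(p) * (grid - p) for p in pts], axis=0)
    f_low = float(model.min())               # minimal value of the model
    f_rec = min(f(p) for p in pts)           # record value
    if f_rec - f_low <= 1e-4:                # certified gap below tolerance
        break
    level = (1 - alpha) * f_low + alpha * f_rec
    candidates = grid[model <= level]        # level set L_t (an interval here)
    x_t = float(candidates[np.argmin(np.abs(candidates - x_t))])  # Pi_{L_t}(x_t)

assert f_rec - f_low <= 1e-4
```

Note the built-in stopping certificate: f̄_t − f̲_t brackets f(x) − f*, which the plain subgradient method does not provide.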