Spring 2017
Lecture Notes IE 521 Convex Optimization
Niao He UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
What’s Inside Module I: Classical Convex Optimization - Fundamentals
♣ Lecture 1: Convex Set and Topological Properties
♣ Lecture 2: Convex Set and Representation Theorems
♣ Lecture 3: Convex Set and Separation Theorems
♣ Lecture 4: Convex Function and Calculus of Convexity
♣ Lecture 5: Convex Function and Characterizations
♣ Lecture 6: Subgradient and Subdifferential
♣ Lecture 7: Closed Convex Function and Conjugate Theory
♣ Lecture 8: Convex Programming
♣ Lecture 9: Lagrangian Duality and Optimality Conditions
♣ Lecture 10: Minimax Problems
♣ Lecture 11: Polynomial Solvability of Convex Programs
♣ Lecture 12: Center-of-gravity Method and Ellipsoid Method
Module II: Modern Convex Optimization ♦ Lecture 13: Conic Programming
♦ Lecture 14: Conic Duality
♦ Lecture 15: Applications of Conic Duality
♦ Lecture 16: Applications in Robust Optimization
♦ Lecture 17: Interior Point Method and Path-Following Schemes
♦ Lecture 18: Newton Method for Unconstrained Minimization
♦ Lecture 19: Minimizing Self-concordant functions
♦ Lecture 20: Damped Newton for Self-concordant Minimization
♦ Lecture 21: Interior Point Method and Self-concordant Barriers
♦ Lecture 22: Interior Point Method for Conic Programming
♦ Lecture 23: CVX Tutorial and Examples
Module III: Algorithms for Constrained Convex Optimization ♥ Lecture 24: Subgradient Method
♥ Lecture 25: Bundle Method
♥ Lecture 26: Constrained Level Method
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 1: Convex Sets – January 23
Instructor: Niao He Scribe: Niao He
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email the TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Topology review
• Convex sets (convex/conic/affine hulls)
• Examples of convex sets
• Calculus of convex sets
• Some nice topological properties of convex sets.
1.1 Topology Review
Let X be a nonempty set in Rn. A point x0 is called an interior point of X if X contains a small ball around x0, i.e. ∃r > 0 such that B(x0, r) := {x : ‖x − x0‖2 ≤ r} ⊆ X. A point x0 is called a limit point of X if there exists a convergent sequence in X that converges to x0, i.e. ∃{xn} ⊆ X such that xn → x0 as n → ∞.
• The interior of X, denoted int(X), is the set of all interior points of X.

• The closure of X, denoted cl(X), is the set of all limit points of X.

• The boundary of X, denoted ∂X = cl(X)\int(X), is the set of points that belong to the closure but not to the interior.
X is closed if cl(X) = X; X is open if int(X) = X. Here are some basic facts:
• int(X) ⊆ X ⊆ cl(X);
• A set X is closed if and only if its complement Xc = Rn\X is open;

• The intersection of an arbitrary number of closed sets is closed, i.e., ∩α∈A Xα is closed if Xα is closed for all α ∈ A;

• The union of a finite number of closed sets is closed, i.e., X1 ∪ · · · ∪ Xn is closed if each Xi is closed. (Finiteness matters here: the infinite union ∪n≥1 [1/n, 1] = (0, 1] is not closed.)
1.2 Convex Sets
Definition 1.1 (Convex set) A set X ⊆ Rn is convex if ∀x, y ∈ X, λx + (1 − λ)y ∈ X for any λ ∈ [0, 1].
In other words, the line segment connecting any two elements of the set lies entirely in the set.
(a) convex set (b) non-convex set
Figure 1.1: Examples of convex and non-convex sets
Given any elements x1, . . . , xk, the combination λ1x1 + λ2x2 + . . .+ λkxk is called
• Convex: if λi ≥ 0, i = 1, . . . , k and λ1 + λ2 + . . .+ λk = 1;
• Conic: if λi ≥ 0, i = 1, . . . , k;
• Affine: if λ1 + λ2 + . . .+ λk = 1;
• Linear: if λi ∈ R, i = 1, . . . , k.
Consequently, we have
• A set is convex if all convex combinations of its elements are in the set;
• A set is a convex cone if all conic combinations of its elements are in the set;
• A set is an affine subspace if all affine combinations of its elements are in the set;
• A set is a linear subspace if all linear combinations of its elements are in the set.
Clearly, a linear subspace is always a convex cone; a convex cone is always a convex set.
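The four kinds of combinations differ only in which constraints the coefficients satisfy, which is easy to encode. A small sketch (the helper name is mine, not part of the notes):

```python
def combination_type(coeffs, tol=1e-9):
    """Classify the combination sum_i coeffs[i] * x_i by its coefficients."""
    nonneg = all(c >= -tol for c in coeffs)
    sums_to_one = abs(sum(coeffs) - 1.0) <= tol
    if nonneg and sums_to_one:
        return "convex"
    if nonneg:
        return "conic"
    if sums_to_one:
        return "affine"
    return "linear"

print(combination_type([0.3, 0.7]))    # convex
print(combination_type([2.0, 5.0]))    # conic
print(combination_type([2.0, -1.0]))   # affine
print(combination_type([-1.0, -2.0]))  # linear
```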
Definition 1.2 (Convex hull) The convex hull of a set X ⊆ Rn is the set of all convex combinations of its elements, denoted as

Conv(X) = {λ1x1 + · · · + λkxk : k ∈ N, λi ≥ 0, λ1 + · · · + λk = 1, xi ∈ X, ∀i = 1, . . . , k}.
Figure 1.2: Examples of convex hulls
Similarly, one can define the conic hull and the affine hull of a set:

Cone(X) = {λ1x1 + · · · + λkxk : k ∈ N, xi ∈ X, λi ≥ 0, ∀i = 1, . . . , k},

Aff(X) = {λ1x1 + · · · + λkxk : k ∈ N, xi ∈ X, λ1 + · · · + λk = 1, ∀i = 1, . . . , k}.
Proposition 1.3 We have the following
1. A convex hull is always convex.
2. If X is convex, then conv(X) = X.
3. For any set X, conv(X) is the smallest convex set that contains X.
Proof:
1. By definition, for any x, y ∈ Conv(X), we can write x = ∑i λixi and y = ∑i µixi, where λi, µi ≥ 0 and ∑i λi = ∑i µi = 1 (taking a common list of points x1, . . . , xk and padding with zero coefficients if necessary). Hence, for any α ∈ [0, 1], we have

αx + (1 − α)y = α ∑i λixi + (1 − α) ∑i µixi = ∑i ξixi,

where ξi = αλi + (1 − α)µi, ∀i. Note that ξi ≥ 0 and ∑i ξi = α ∑i λi + (1 − α) ∑i µi = 1. Therefore, αx + (1 − α)y ∈ Conv(X). Hence, Conv(X) is convex.
2. First of all, by the definition of the convex hull, it is straightforward to see that X ⊆ Conv(X). Next, we show that Conv(X) ⊆ X by induction on the number of terms k. The base case k = 1 is trivial. Now assume that any convex combination of k points of X is in X; we want to show that any convex combination of k + 1 points is still in X. Consider the convex combination given by λ1, . . . , λk+1 with λi ≥ 0, i = 1, . . . , k + 1 and λ1 + · · · + λk+1 = 1. If λk+1 = 1 the claim is trivial, so assume λk+1 < 1 and write

λ1x1 + · · · + λk+1xk+1 = (1 − λk+1) z + λk+1xk+1, where z := (λ1/(1 − λk+1)) x1 + · · · + (λk/(1 − λk+1)) xk.

By the induction hypothesis, z ∈ X since z is a convex combination of k points of X. By convexity of X, we further have λ1x1 + · · · + λk+1xk+1 ∈ X.
3. Suppose Y is convex and Y ⊇ X; we want to show that Y ⊇ Conv(X). By the previous argument, if Y contains X, then Y contains all convex combinations of elements of X, i.e. Y ⊇ Conv(X).
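Proposition 1.3 suggests a membership test: x ∈ Conv(X) iff the defining convex-combination system is solvable. For a triangle in R2 this is just a 2 × 2 linear system (barycentric coordinates). A sketch, with hypothetical helper names:

```python
def barycentric(p, a, b, c):
    """Solve p = l1*a + l2*b + l3*c with l1 + l2 + l3 = 1.

    Writing p - c = l1*(a - c) + l2*(b - c) gives a 2x2 system,
    solved here by Cramer's rule.
    """
    d = (a[0]-c[0])*(b[1]-c[1]) - (b[0]-c[0])*(a[1]-c[1])
    l1 = ((p[0]-c[0])*(b[1]-c[1]) - (b[0]-c[0])*(p[1]-c[1])) / d
    l2 = ((a[0]-c[0])*(p[1]-c[1]) - (p[0]-c[0])*(a[1]-c[1])) / d
    return l1, l2, 1.0 - l1 - l2

def in_triangle(p, a, b, c, tol=1e-9):
    """p lies in Conv({a, b, c}) iff all barycentric coordinates are >= 0."""
    return all(l >= -tol for l in barycentric(p, a, b, c))

# The centroid is the convex combination with weights 1/3 each.
print(in_triangle((1/3, 1/3), (0, 0), (1, 0), (0, 1)))  # True
print(in_triangle((1, 1), (0, 0), (1, 0), (0, 1)))      # False
```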
Examples of Convex Sets
Example 1. Some simple convex sets:
• Hyperplane: {x ∈ Rn : aTx = b}

• Halfspace: {x ∈ Rn : aTx ≤ b}

• Affine space: {x ∈ Rn : Ax = b}

• Polyhedron: {x ∈ Rn : Ax ≤ b}

• Simplex: {x ∈ Rn : x ≥ 0, x1 + · · · + xn = 1} = Conv(e1, . . . , en).
Example 2. Euclidean ball: {x ∈ Rn : ‖x‖2 ≤ r}, where ‖ · ‖2 is the Euclidean norm on Rn.

Example 3. Ellipsoid: {x ∈ Rn : (x − a)TQ(x − a) ≤ r2}, where Q ≻ 0 is symmetric positive definite.
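A quick numerical sanity check of Example 3 (the ellipsoid data below are made up): sample points in the set and verify that convex combinations never leave it. This is only evidence, not a proof.

```python
import random

def in_ellipsoid(x, a, q, r):
    """Membership in {x : (x-a)^T Q (x-a) <= r^2} for diagonal Q = diag(q)."""
    return sum(qi * (xi - ai) ** 2 for xi, ai, qi in zip(x, a, q)) <= r * r

random.seed(0)
a, q, r = (1.0, -1.0), (2.0, 0.5), 1.5        # hypothetical ellipsoid data
pts = []
while len(pts) < 200:                          # rejection-sample points in the set
    cand = tuple(ai + random.uniform(-3, 3) for ai in a)
    if in_ellipsoid(cand, a, q, r):
        pts.append(cand)

ok = all(
    in_ellipsoid(tuple(l*u + (1-l)*v for u, v in zip(p1, p2)), a, q, r)
    for p1 in pts[:20] for p2 in pts[:20] for l in (0.25, 0.5, 0.75)
)
print(ok)  # True: no convex combination ever left the ellipsoid
```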
1.3 Calculus of Convex Sets
The following operations preserve convexity of sets; each can be easily verified from the definition.
1. Intersection: If Xα, α ∈ A are convex sets, then
∩α∈AXα
is also a convex set.
2. Direct product: If Xi ⊆ Rni , i = 1, . . . , k are convex sets, then
X1 × · · · × Xk := {(x1, . . . , xk) : xi ∈ Xi, i = 1, . . . , k}
is also a convex set.
3. Weighted summation: If Xi ⊆ Rn, i = 1, . . . , k are convex sets, then
α1X1 + · · · + αkXk := {α1x1 + . . . + αkxk : xi ∈ Xi, i = 1, . . . , k}
is also a convex set.
4. Affine image: If X ⊆ Rn is a convex set and A(x) : x 7→ Ax+ b is an affine mapping fromRn to Rk, then
A(X) := {Ax + b : x ∈ X}
is also a convex set.
5. Inverse affine image: If X ⊆ Rn is a convex set and A(y) : y 7→ Ay+b is an affine mappingfrom Rk to Rn, then
A−1(X) := {y : Ay + b ∈ X}
is also a convex set.
Proof:
1. Let x, y ∈ ∩α∈AXα, then x, y ∈ Xα, ∀α ∈ A. Since Xα is convex, for any λ ∈ [0, 1], λx+ (1−λ)y ∈ Xα, ∀α ∈ A. Hence, λx+ (1− λ)y ∈ ∩α∈AXα.
2. Let x = (x1, . . . , xk), y = (y1, . . . , yk) ∈ X1 × · · · × Xk. Since each Xi is convex, for λ ∈ [0, 1], λxi + (1 − λ)yi ∈ Xi for i = 1, . . . , k. Hence
λx+ (1− λ)y = (λx1 + (1− λ)y1, . . . , λxk + (1− λ)yk) ∈ X1 × · · · ×Xk.
3. Let x, y ∈ α1X1 + · · · + αkXk, by definition, there exists xi, yi ∈ Xi, i = 1, . . . , k, such thatx = α1x1 + . . .+ αkxk, y = α1y1 + . . .+ αkyk. Hence, for all λ ∈ [0, 1]
λx+ (1− λ)y = α1z1 + . . .+ αkzk ∈ α1X1 + · · ·+ αkXk
because zi = λxi + (1− λ)yi ∈ Xi, for all i = 1, . . . , k.
4. Let y1, y2 ∈ A(X); then there exist x1, x2 ∈ X such that y1 = Ax1 + b and y2 = Ax2 + b. Therefore, for any λ ∈ [0, 1], we have λy1 + (1 − λ)y2 = A(λx1 + (1 − λ)x2) + b ∈ A(X) because λx1 + (1 − λ)x2 ∈ X.

5. Let y1, y2 ∈ A−1(X); then there exist x1, x2 ∈ X such that x1 = Ay1 + b and x2 = Ay2 + b. Therefore, for any λ ∈ [0, 1], we have A(λy1 + (1 − λ)y2) + b = λx1 + (1 − λ)x2 ∈ X, which implies that λy1 + (1 − λ)y2 ∈ A−1(X).
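In R the weighted-summation rule is easy to see concretely: with nonnegative weights, a weighted sum of intervals is again an interval, hence convex. A small sketch (the helper name is mine):

```python
def weighted_sum(intervals, weights):
    """alpha1*X1 + ... + alphak*Xk for intervals Xi = (lo, hi), alpha_i >= 0.

    With nonnegative weights the endpoints simply add, so the result
    is again an interval, i.e. a convex subset of R.
    """
    lo = sum(a * l for a, (l, _) in zip(weights, intervals))
    hi = sum(a * h for a, (_, h) in zip(weights, intervals))
    return (lo, hi)

# 2*[0,1] + 3*[-1,2] = [0,2] + [-3,6] = [-3,8]
print(weighted_sum([(0, 1), (-1, 2)], [2, 3]))  # (-3, 8)
```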
1.4 Nice Topological Properties of Convex Sets
Convex sets are special because of their nice geometric properties.
Proposition 1.4 If X is a convex set with nonempty interior, then int(X) is dense in cl(X).
Proof: Let x0 ∈ int(X) and x ∈ cl(X). We can construct a convergent sequence yn = (1/n)x0 + (1 − 1/n)x such that yn → x. We only need to show that yn ∈ int(X). Therefore, it suffices to prove the following claim:

Claim 1.5 If x0 ∈ int(X) and x ∈ cl(X), then [x0, x) ⊆ int(X), namely, for any α ∈ (0, 1], the point z := αx0 + (1 − α)x ∈ int(X).

This can be proved as follows. Since x0 ∈ int(X), there exists r > 0 such that B(x0, r) ⊆ X. Since x ∈ cl(X), there exists a sequence {xn} ⊆ X such that xn → x. Let zn = αx0 + (1 − α)xn; then zn → z. When n is large enough, ‖zn − z‖2 ≤ αr/2. Since B(x0, r) ⊆ X and xn ∈ X, we have B(zn, αr) = αB(x0, r) + (1 − α)xn ⊆ X. Hence, B(z, αr/2) ⊆ B(zn, αr) ⊆ X. This is because for any z′ ∈ B(z, αr/2), ‖z′ − z‖2 ≤ αr/2, so

‖z′ − zn‖2 ≤ ‖z′ − z‖2 + ‖zn − z‖2 ≤ αr/2 + αr/2 = αr.
Remark. Note that in general, for any set X, int(X) ⊆ X ⊆ cl(X), but int(X) and cl(X) can differ dramatically. For instance, let X be the set of all irrational numbers in (0, 1); then int(X) = ∅ and cl(X) = [0, 1]. The proposition implies that a convex set with nonempty interior is perfectly well characterized by its closure or its interior.
Lecture 2: Geometry of Convex Sets – January 25
Instructor: Niao He Scribe: Shuanglong Wang
In this lecture, we cover the following topics
• Nice topological properties (cont’d)
• Representation Theorem (Caratheodory)
• Radon and Helly Theorems
2.1 Recall
• A set X is convex if ∀x, y ∈ X,λx+ (1− λ)y ∈ X for any λ ∈ [0, 1].
• Convex Hull:

Conv(X) = {λ1x1 + · · · + λkxk : k ∈ N, λi ≥ 0, λ1 + · · · + λk = 1, xi ∈ X, ∀i = 1, . . . , k}.
Note that the convex hull of X is the smallest convex set that contains X.
• Affine Hull:

aff(X) = {λ1x1 + · · · + λkxk : k ∈ N, xi ∈ X, λ1 + · · · + λk = 1, ∀i = 1, . . . , k}.
Note that the affine hull of X is the smallest affine subspace that contains X. An affine space M is a shifted linear space, i.e. M = a + L for a linear subspace L; e.g. {x : Ax = b} = x0 + {x : Ax = 0}, where x0 is any point with Ax0 = b. The dimension of X is dim(X) = dim(aff(X)).
• If X ⊆ Rn is convex with non-empty interior, then

∀x0 ∈ int(X), x ∈ cl(X) ⇒ λx0 + (1 − λ)x ∈ int(X), ∀λ ∈ (0, 1].

Moreover, the interior of X, if nonempty, is dense in cl(X), i.e., cl(int(X)) = cl(X).
Question: What if int(X) = ∅? For example, X = Conv(e1, e2, e3) = {(x1, x2, x3) : x1, x2, x3 ≥ 0, x1 + x2 + x3 = 1} has int(X) = ∅, aff(X) = {(x1, x2, x3) : x1 + x2 + x3 = 1}, and dim(X) = 2.
Definition 2.1 (Relative Interior) rint(X) = {x : ∃r > 0 s.t. B(x, r) ∩ Aff(X) ⊆ X}.
Fact: If X is convex and nonempty, then rint(X) is always non-empty. The proof is left as an exercise.
Proposition 2.2 Let X be a nonempty convex set. Then
a) int(X),cl(X), rint(X) are convex
b) If x0 ∈ rint(X), x ∈ cl(X), then λx0 + (1− λ)x ∈ rint(X),∀λ ∈ (0, 1]
c) cl(rint(X)) = cl(X)
d) rint(cl(X)) = rint(X)
Proof: a) is straightforward; b), c) can be derived in analogy to the case when int(X) is non-empty. d) is due to the following fact: ∀x0 ∈ rint(X), x ∈ rint(cl(X)), there exists y ∈ cl(X) s.t. x ∈ (x0, y). From b), this implies x ∈ rint(X).
Remark A convex set is perfectly well approximated by its relative interior or closure.
2.2 Representation Theorem
Theorem 2.3 (Caratheodory) Let X ⊆ Rn be nonempty with dim(X) = d ≤ n. Every point x ∈ Conv(X) is a convex combination of at most d + 1 points of X, i.e.

Conv(X) = {λ1x1 + · · · + λd+1xd+1 : xi ∈ X, λi ≥ 0, λ1 + · · · + λd+1 = 1}.
Proof: Suppose, for contradiction, that the minimal representation of some x ∈ Conv(X) has m > d + 1 terms:

x = α1x1 + · · · + αmxm, where αi ≥ 0, α1 + · · · + αm = 1.

Since m > d + 1, the points x1, . . . , xm are affinely dependent, so the system of linear equations

δ1x1 + · · · + δmxm = 0, δ1 + · · · + δm = 0

has a nontrivial solution δ. We can then write x = ∑i (αi − tδi)xi for any t. Let λi(t) = αi − tδi, i = 1, . . . , m; then ∑i λi(t) = 1 for every t. Let t∗ = min{αi/δi : δi > 0} =: αj/δj (the minimum exists since δ ≠ 0 and ∑i δi = 0 force some δi > 0). Then λi(t∗) ≥ 0 for all i and λj(t∗) = 0. This gives a representation of x with fewer terms, a contradiction!
Example Suppose there are 100 different kinds of herbal tea, each of which is a blend of the same 25 herbs. Donald wants a particular mixture of all the herbal teas with equal proportions. What is the least number of teas he needs to buy to reproduce it? By Caratheodory's theorem: 26.
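The proof of Theorem 2.3 is constructive: each step finds a vector δ in the null space of the affine-dependence system and shifts the weights until one vanishes. A sketch of that reduction step (assuming NumPy; helper names are mine):

```python
import numpy as np

def caratheodory_step(points, alphas, tol=1e-9):
    """One reduction step from the proof: zero out one active weight.

    points: (m, n) array; alphas: convex weights with x = sum_i alphas[i]*points[i].
    """
    idx = np.flatnonzero(alphas > tol)
    n = points.shape[1]
    if len(idx) <= n + 1:                  # already at most dim + 1 active terms
        return alphas
    P, a = points[idx], alphas[idx]
    # Nontrivial solution of sum_i delta_i x_i = 0, sum_i delta_i = 0:
    # an (n+1) x m homogeneous system with m > n + 1 unknowns.
    A = np.vstack([P.T, np.ones(len(idx))])
    delta = np.linalg.svd(A)[2][-1]        # a null-space vector of A
    if delta.max() <= 0:
        delta = -delta                     # ensure some delta_i > 0
    pos = delta > tol
    t = np.min(a[pos] / delta[pos])        # t* = min{alpha_i/delta_i : delta_i > 0}
    new = np.zeros_like(alphas)
    new[idx] = np.maximum(a - t * delta, 0.0)  # one active weight drops to 0
    return new / new.sum()                 # renormalize away rounding error

rng = np.random.default_rng(0)
pts = rng.normal(size=(5, 2))              # 5 points in R^2, so d <= 2
w = rng.dirichlet(np.ones(5))              # random convex weights
x = w @ pts
while np.count_nonzero(w > 1e-9) > 3:      # Caratheodory: 3 = d + 1 terms suffice
    w = caratheodory_step(pts, w)
print(np.count_nonzero(w > 1e-9))          # 3
print(np.allclose(w @ pts, x))             # True: the point is unchanged
```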
2.3 Radon and Helly Theorems
Theorem 2.4 (Radon) Let S be a collection of N points in Rn with N ≥ n + 2. Then we can write S = S1 ∪ S2 such that S1 ∩ S2 = ∅ and Conv(S1) ∩ Conv(S2) ≠ ∅.
Proof: Let S = {x1, . . . , xN}. Consider the linear system

γ1x1 + · · · + γNxN = 0, γ1 + · · · + γN = 0,

which has n + 1 equations but N ≥ n + 2 unknowns, so there exists a nonzero solution γ1, . . . , γN. Let I = {i : γi ≥ 0}, J = {j : γj < 0}, and a = ∑i∈I γi = −∑j∈J γj > 0. Then

∑i∈I γixi = ∑j∈J (−γj)xj ⇒ ∑i∈I (γi/a) xi = ∑j∈J (−γj/a) xj.

Both sides are convex combinations, so the partition S1 = {xi : i ∈ I} and S2 = {xj : j ∈ J} gives the desired result.
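The proof above is also constructive: a null-space vector γ of the (n + 1) × N system yields the partition, and the common point can be computed from either side. A sketch (assuming NumPy; the function name is mine):

```python
import numpy as np

def radon_partition(points, tol=1e-9):
    """Radon partition following the proof: points is an (N, n) array, N >= n + 2."""
    N = points.shape[0]
    A = np.vstack([points.T, np.ones(N)])  # gamma_1 x_1 + ... = 0, sum gamma_i = 0
    gamma = np.linalg.svd(A)[2][-1]        # nonzero null-space vector (N >= n + 2)
    I = np.flatnonzero(gamma > tol)        # points with gamma_i = 0 may go to either side
    J = np.flatnonzero(gamma < -tol)
    a = gamma[I].sum()
    z = (gamma[I] / a) @ points[I]         # convex combination of S1 ...
    assert np.allclose(z, (-gamma[J] / a) @ points[J])  # ... equals one of S2
    return I, J, z

# x4 is the midpoint of x2 and x3, so the hulls must meet at (1, 1).
pts = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [1.0, 1.0]])
I, J, z = radon_partition(pts)
print(z)  # [1. 1.]
```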
Theorem 2.5 (Helly) Let S1, . . . , SN be a collection of convex sets in Rn with N ≥ n + 1. Assume every n + 1 of them have a point in common; then all the sets have a point in common.
Proof: We prove this by induction on N.
Base case: N = n + 1, obviously true.
Induction step: Assume that any collection of N (≥ n + 1) sets has a common point whenever every n + 1 of them have a common point. We want to show that this holds for a collection of N + 1 sets.

By the assumption, for each i there exists xi ∈ S1 ∩ . . . ∩ Si−1 ∩ Si+1 ∩ . . . ∩ SN+1 ≠ ∅. Hence, we obtain N + 1 ≥ n + 2 points x1, . . . , xN+1.

By Radon's theorem, we can split {x1, x2, . . . , xN+1} into disjoint sets whose convex hulls have nonempty intersection. Without loss of generality, assume the two disjoint sets are {x1, . . . , xk} and {xk+1, . . . , xN+1}, and

Conv({x1, . . . , xk}) ∩ Conv({xk+1, . . . , xN+1}) ≠ ∅.

Let z ∈ Conv({x1, . . . , xk}) ∩ Conv({xk+1, . . . , xN+1}). Since {x1, . . . , xk} ⊆ Sk+1 ∩ . . . ∩ SN+1, we get z ∈ Conv({x1, . . . , xk}) ⊆ Sk+1 ∩ . . . ∩ SN+1. Since {xk+1, . . . , xN+1} ⊆ S1 ∩ . . . ∩ Sk, we get z ∈ Conv({xk+1, . . . , xN+1}) ⊆ S1 ∩ . . . ∩ Sk. Therefore, z ∈ S1 ∩ . . . ∩ SN+1.
Remark
• The theorem is not true for an infinite collection: e.g. Si = [i, ∞), ∩i≥1 Si = ∅.
• The theorem is not true if (n + 1) is weakened to n: e.g. in R2, three lines forming a triangle pairwise intersect, yet have no common point.
Corollary 2.6 (Helly) Let F be any collection of compact convex sets in Rn. If every n + 1 sets of them have a common point, then all the sets have a common point.
Remark Helly's theorem has many applications, especially for uniform approximation.
Example Consider the optimization problem

p∗ = min{g0(x) : x ∈ R10, gi(x) ≤ 0, i = 1, . . . , 521}.

Suppose that for every t ∈ R, X0 = {x ∈ R10 : g0(x) ≤ t} is convex, and each Xi = {x ∈ R10 : gi(x) ≤ 0} is convex. How many constraints can we drop without affecting the optimal value?

Answer: You can drop as many as 521 − 11 = 510 constraints, i.e. you only need to keep 11 constraints.
Proof: Suppose every relaxation keeping only 11 constraints changes the optimal value, i.e., for every {i1, . . . , i11} ⊆ {1, . . . , 521},

min{g0(x) : x ∈ R10, gi1(x) ≤ 0, . . . , gi11(x) ≤ 0} = p(i1, . . . , i11) < p∗.

Since there are only finitely many such combinations, let pmax < p∗ be the largest among the values p(i1, . . . , i11). Consider the collection of sets, for i = 1, . . . , 521,

Si = {x ∈ R10 : g0(x) ≤ pmax, gi(x) ≤ 0}.

Hence, we have

(i) each Si is convex, and

(ii) every 11 of them have nonempty intersection (a minimizer of the corresponding 11-constraint relaxation lies in all 11 sets).

By Helly's theorem (n + 1 = 11 in R10), S1 ∩ . . . ∩ S521 ≠ ∅, i.e. ∃x ∈ R10 s.t. g0(x) ≤ pmax < p∗ and gi(x) ≤ 0 ∀i. Contradiction!
Lecture 3: Separation Theorems – January 30
Instructor: Niao He Scribe: Shuanglong Wang
In this lecture, we cover the following topics
• Separation Theorems
• The Farkas Lemma
• Duality of Linear Programs
Reference: Boyd & Vandenberghe, Chapter 2.5; Ben-Tal & Nemirovski, Chapter 1.2
3.1 Separation of Convex Sets
Definition 3.1 Let S and T be two nonempty convex sets in Rn. A hyperplane H = {x ∈ Rn : aTx = b} with a ≠ 0 is said to separate S and T if

a) S ⊆ H− = {x ∈ Rn : aTx ≤ b} and T ⊆ H+ = {x ∈ Rn : aTx ≥ b};

b) S ∪ T ⊄ H.
Note that a) implies that

sup{aTx : x ∈ S} ≤ inf{aTx : x ∈ T},

and b) implies that

inf{aTx : x ∈ S} < sup{aTx : x ∈ T}.
The separation is strict if S ⊆ {x ∈ Rn : aTx ≤ b′} and T ⊆ {x ∈ Rn : aTx ≥ b′′} with b′ < b′′. Note that strict separation is equivalent to

sup{aTx : x ∈ S} < inf{aTx : x ∈ T}.
Question: When can S and T be separated? strictly separated? Necessary conditions?
Theorem 3.2 Let S and T be two nonempty convex sets. Then S and T can be separated if andonly if rint(S) ∩ rint(T ) = ∅
Corollary 3.3 Let S be a nonempty convex set and x0 ∈ ∂S. Then there exists a supporting hyperplane H = {x : aTx = aTx0} such that S ⊆ {x : aTx ≤ aTx0} and x0 ∈ H.
We will prove a special case of the theorem and corollary.
Theorem 3.4 Let S be closed and convex and x0 ∉ S. Then there exists a hyperplane that strictly separates x0 and S.
Proof: Define the projection of x0, denoted proj(x0), to be the point in S closest to x0:

proj(x0) = argmin{‖x − x0‖22 : x ∈ S}.
Note that proj(x0) exists and is unique.
• Existence: due to closedness of S
• Uniqueness: If x1, x2 are both closest to x0 in S, then ‖x1 − x0‖2 = ‖x2 − x0‖2 = d. Consider z = (x1 + x2)/2 ∈ S; then ‖z − x0‖2 ≥ d. By the parallelogram law, ‖(x0 − x1) + (x0 − x2)‖22 + ‖(x0 − x1) − (x0 − x2)‖22 = 2‖x0 − x1‖22 + 2‖x0 − x2‖22, i.e. 4‖x0 − z‖22 + ‖x1 − x2‖22 = 4d2. Hence ‖x1 − x2‖22 ≤ 0, i.e. x1 = x2.
Next we show that strict separation is given by H := {x : aTx = b} with a = x0 − proj(x0) and b = aTx0 − ‖a‖22/2, i.e. aTx < b for all x ∈ S and aTx0 > b.

By the definition of projection and convexity, ∀λ ∈ [0, 1], x ∈ S,

λx + (1 − λ)proj(x0) ∈ S.

Let φ(λ) = ‖λx + (1 − λ)proj(x0) − x0‖22 = ‖proj(x0) − x0 + λ(x − proj(x0))‖22. Then

φ(λ) ≥ φ(0), ∀λ ∈ [0, 1].

Hence, φ′(0) ≥ 0, i.e. −2aT(x − proj(x0)) ≥ 0. This implies

aTx ≤ aTproj(x0) = aT(x0 − a) = aTx0 − ‖a‖22 < aTx0 − ‖a‖22/2 = b,

while aTx0 = b + ‖a‖22/2 > b since a ≠ 0.
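The construction in the proof of Theorem 3.4 is directly computable when S is a box, where the projection is a coordinate-wise clip. A sketch (the box data are made up; helper names are mine):

```python
def project_box(x0, lo, hi):
    """Projection onto the box S = {x : lo <= x <= hi} (coordinate-wise clip)."""
    return tuple(min(max(v, l), h) for v, l, h in zip(x0, lo, hi))

def separating_hyperplane(x0, lo, hi):
    """a, b with a^T x < b for all x in S and a^T x0 > b, as in Theorem 3.4."""
    p = project_box(x0, lo, hi)
    a = tuple(u - v for u, v in zip(x0, p))           # a = x0 - proj(x0)
    b = sum(u*v for u, v in zip(a, x0)) - sum(v*v for v in a) / 2
    return a, b

x0, lo, hi = (3.0, 0.5), (0.0, 0.0), (1.0, 1.0)       # x0 outside the unit box
a, b = separating_hyperplane(x0, lo, hi)
print(a, b)  # (2.0, 0.0) 4.0
inside = sum(u*v for u, v in zip(a, x0)) > b          # x0 strictly on one side
corner = sum(u*v for u, v in zip(a, (1.0, 1.0))) < b  # the farthest vertex of S on the other
print(inside, corner)  # True True
```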
Corollary 3.5 Let S and T be two nonempty convex sets with S ∩ T = ∅. Assume S − T is closed; then S and T can be strictly separated.

Proof: Let Y = S − T. Since Y is a weighted sum of two convex sets, Y is nonempty and convex. Since S ∩ T = ∅, we have 0 ∉ Y. By the previous theorem, ∃a, b such that aTy < b < 0 for all y ∈ Y. This implies

aTx < b + aTz, ∀x ∈ S, z ∈ T.

Hence, sup{aTx : x ∈ S} ≤ b + inf{aTz : z ∈ T} < inf{aTz : z ∈ T}, i.e. S and T can be strictly separated.
Remark
1. Even if both S and T are closed and convex, S − T might not be closed, and they might not be strictly separated. (E.g. S = {(x, y) : y ≥ ex} and T = {(x, y) : y ≤ 0} in R2: here S − T = {(u, v) : v > 0} is not closed, and only non-strict separation by y = 0 is possible.)

2. When both S and T are closed and convex, S ∩ T = ∅, and at least one of them is bounded, then S − T is closed, and S and T can be strictly separated.
3.2 Theorems of alternatives
Theorem 3.6 (Farkas’ Lemma) Exactly one of the following sets must be empty:
(i) {x ∈ Rn : Ax = b, x ≥ 0}

(ii) {y ∈ Rm : AT y ≤ 0, bT y > 0}
where A ∈ Rm×n, b ∈ Rm.
Remark
• Systems (i) and (ii) are often called strong alternatives, i.e. exactly one of them must be feasible.

• Farkas' Lemma is particularly useful for proving infeasibility of a linear program.
• Geometric interpretation: let A = [a1|a2| . . . |an], so that

Cone{a1, . . . , an} = {x1a1 + · · · + xnan : xi ≥ 0, i = 1, . . . , n}.

Then (i) is feasible ⇐⇒ b ∈ Cone{a1, . . . , an}, while (ii) is feasible ⇐⇒ ∃y with yTai ≤ 0, ∀i = 1, . . . , n and yTb > 0, i.e. a hyperplane through the origin separates b from the cone.
Farkas’ lemma can be regarded as a special case of the separation theorem.
Proof: First, we show that if system (ii) is feasible, then system (i) is infeasible. Otherwise, 0 < bTy = (Ax)Ty = xT(ATy) ≤ 0, a contradiction!

Second, we show that if system (i) is infeasible, then system (ii) is feasible. Let C = Cone{a1, . . . , an}; then C is convex and closed. Since b ∉ C, by the separation theorem, b and C can be strictly separated, i.e.

∃y ∈ Rm, γ ∈ R, y ≠ 0, such that yTz ≤ γ, ∀z ∈ C, and yTb > γ.

Since 0 ∈ C, we have γ ≥ 0. Moreover, if there were z0 ∈ C with yTz0 > 0, then yT(αz0) > γ for α large enough, contradicting the separation; hence yTz ≤ 0 for all z ∈ C, and it suffices to take γ = 0. Since a1, . . . , an ∈ C, we have yTai ≤ 0, ∀i = 1, . . . , n, i.e. ATy ≤ 0, while yTb > 0.
Remark: The fact that Cone{a1, . . . , an} is closed is crucial. Note that in general, when S is not a finite set, Cone(S) is not always closed; e.g. the conic hull of the solid circle S = {(x1, x2) : x21 + (x2 − 1)2 < 1} is the open halfspace {(x1, x2) : x2 > 0} (together with the origin), which is not closed.
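Verifying a Farkas certificate for system (ii) takes only a few dot products. A sketch with made-up data where b lies outside Cone{a1, a2} (the function name is mine):

```python
def is_farkas_certificate(A_cols, b, y, tol=1e-9):
    """Check that y solves system (ii): a_i^T y <= 0 for every column a_i, and b^T y > 0."""
    dot = lambda u, v: sum(ui*vi for ui, vi in zip(u, v))
    return all(dot(a, y) <= tol for a in A_cols) and dot(b, y) > tol

# The columns of A generate the nonnegative quadrant; b = (-1, 1) lies outside
# that cone, and y certifies infeasibility of {x >= 0 : Ax = b}.
A_cols = [(1.0, 0.0), (0.0, 1.0)]
b = (-1.0, 1.0)
y = (-1.0, 0.0)
print(is_farkas_certificate(A_cols, b, y))              # True
print(is_farkas_certificate(A_cols, (1.0, 1.0), y))     # False: this b is in the cone
```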
Variant of Farkas’ Lemma Exactly one of the following two sets must be empty:
1. {x ∈ Rn : Ax ≤ b}

2. {y ≥ 0 : AT y = 0, bT y < 0}
Proof: Exercise in HW1.
3.3 LP strong duality
Consider the primal and dual pair of linear programs:

(P) min cTx s.t. Ax = b, x ≥ 0;

(D) max bT y s.t. AT y ≤ c.
Theorem 3.7 If (P) has a finite optimal value, then so does (D), and the two optimal values are equal.
Proof: Exercise in HW 1.
Remark The theorems of alternatives can be generalized to systems with convex constraints, and the strong duality of linear programming can be extended to general convex programs.
Lecture 4: Convex Functions, Part I – February 1
Instructor: Niao He Scribe: Shuanglong Wang
In this lecture, we cover the following topics
• Convex Functions
• Examples
• Convexity-preserving Operations
Reference: Boyd &Vandenberghe, Chapter 3.1-3.2
4.1 Convex Function
Let f be a function from Rn to R. The domain of f is defined as dom(f) = {x ∈ Rn : |f(x)| < ∞}. For example,

• f(x) = 1/x, dom(f) = R\{0}

• f(x) = x1 ln(x1) + · · · + xn ln(xn), dom(f) = Rn++ = {x : xi > 0, ∀i = 1, . . . , n}
Definition 4.1 (Convex function) A function f(x) : Rn → R is convex if
(i) dom(f) ⊆ Rn is a convex set;
(ii) ∀x, y ∈ dom(f) and λ ∈ [0, 1], f(λx+ (1− λ)y) ≤ λf(x) + (1− λ)f(y).
Geometrically, the line segment between (x, f(x)), (y, f(y)) sits above the graph of f .
Definitions
• A function is called strictly convex if (ii) holds with strict inequality whenever x ≠ y and λ ∈ (0, 1), i.e. f(λx + (1 − λ)y) < λf(x) + (1 − λ)f(y).
• A function is called α-strongly convex if f(x) − (α/2)‖x‖22 is convex.
• A function is called concave if −f(x) is convex.
Note that strongly convex =⇒ strictly convex =⇒ convex
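The basic convex inequality in Definition 4.1(ii) can be spot-checked numerically: a failing random pair disproves convexity, while passing all trials is only evidence, not a proof. A hypothetical helper sketch:

```python
import random

def looks_convex(f, dim=1, trials=2000, seed=0):
    """Test the basic convex inequality on random pairs and random lambda.

    Returns False as soon as a violating pair is found (disproving convexity);
    True only means no counterexample was sampled.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-5, 5) for _ in range(dim)]
        y = [rng.uniform(-5, 5) for _ in range(dim)]
        lam = rng.random()
        z = [lam*xi + (1-lam)*yi for xi, yi in zip(x, y)]
        if f(z) > lam*f(x) + (1-lam)*f(y) + 1e-9:
            return False
    return True

print(looks_convex(lambda v: v[0]**2))    # True : x^2 is convex
print(looks_convex(lambda v: abs(v[0])))  # True : |x| is convex
print(looks_convex(lambda v: v[0]**3))    # False: x^3 is not convex on R
```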
Figure 4.1: Convex function
4.2 Examples
1. Simple univariate functions:
• Even powers: xp with p even and positive

• Exponential: eax, ∀a ∈ R

• Negative logarithm: − log x

• Absolute value: |x|

• Negative entropy: x log x
2. Affine functions: f(x) = aTx+ b
• both convex and concave, but not strictly convex/concave
3. Some quadratic functions: f(x) = (1/2)xTQx + bTx + c (with Q symmetric)

• convex if and only if Q ⪰ 0 is positive semi-definite

• strictly convex if and only if Q ≻ 0 is positive definite

• special case: f(x) = ‖Ax − b‖22 is convex
4. Norms: A function π(·) is called a norm if
(a) π(x) ≥ 0, ∀x and π(x) = 0 iff x = 0
(b) π(αx) = |α| · π(x),∀α ∈ R
(c) π(x+ y) ≤ π(x) + π(y)
Note that norms are convex: ∀λ ∈ [0, 1], π(λx + (1 − λ)y) ≤ π(λx) + π((1 − λ)y) = λπ(x) + (1 − λ)π(y), where the inequality comes from (c) and the equality comes from (b). Examples of norms include:

• lp-norm on Rn: ‖x‖p := (|x1|p + · · · + |xn|p)1/p, where p ≥ 1

• Q-norm on Rn: ‖x‖Q := √(xTQx), where Q ≻ 0 is positive definite
• Frobenius norm on Rm×n: ‖A‖F := (∑i ∑j |Ai,j|2)1/2

• spectral norm on Sn: ‖A‖ := maxi=1,...,n |λi(A)|, where the λi's are the eigenvalues of A.
5. Indicator function: IC(x) = 0 if x ∈ C, and +∞ if x ∉ C. The indicator function IC(x) is convex if the set C is a convex set.
6. Support function: I∗C(x) = sup{xTy : y ∈ C}. The support function I∗C(x) is always convex for any set C.

Proof: Note that sup over y ∈ C of a sum is at most the sum of the suprema. Then ∀x1, x2 and λ ∈ [0, 1],

I∗C(λx1 + (1 − λ)x2) = sup{λx1Ty + (1 − λ)x2Ty : y ∈ C} ≤ sup{λx1Ty : y ∈ C} + sup{(1 − λ)x2Ty : y ∈ C} = λI∗C(x1) + (1 − λ)I∗C(x2).
7. More examples
• Piecewise linear functions: max(a1Tx + b1, . . . , akTx + bk)

• Log of exponential sums: log(ea1Tx+b1 + · · · + eakTx+bk)

• Negative log-determinant: − log(det(X))
How to show convexity of these functions?
4.3 Convexity-Preserving Operators
1. Taking conic combination: If fi(x), i ∈ I are convex functions and αi ≥ 0, ∀i ∈ I, then

g(x) = ∑i∈I αifi(x)

is a convex function.
Proof: The domain of g,

dom(g) = ∩{i : αi>0} dom(fi),

is convex. For any x, y ∈ dom(g), λ ∈ [0, 1],

g(λx + (1 − λ)y) = ∑i αifi(λx + (1 − λ)y) ≤ ∑i αi [λfi(x) + (1 − λ)fi(y)] = λ ∑i αifi(x) + (1 − λ) ∑i αifi(y) = λg(x) + (1 − λ)g(y).
Remark The property extends to infinite sums and integrals. If f(x, ω) is convex in x for every ω ∈ Ω and α(ω) ≥ 0, ∀ω ∈ Ω, then

g(x) = ∫Ω α(ω)f(x, ω)dω

is convex if well defined. For example, if η = η(ω) is a well-defined random variable on Ω and f(x, η(ω)) is convex in x for every ω ∈ Ω, then Eη[f(x, η)] is a convex function.
2. Taking affine composition If f(x) : Rn → R is convex and A(y) : y 7→ Ay + b is an affinemapping from Rm to Rn, then
g(y) := f(Ay + b)
is convex on Rm.
Proof: dom(g) = {y : Ay + b ∈ dom(f)} is convex. For all y1, y2 ∈ dom(g) and λ ∈ [0, 1],

g(λy1 + (1 − λ)y2) = f(λ(Ay1 + b) + (1 − λ)(Ay2 + b)) ≤ λf(Ay1 + b) + (1 − λ)f(Ay2 + b) = λg(y1) + (1 − λ)g(y2).

For example, ‖Ax − b‖22, ∑i eaiTx−bi, and −∑i log(aiTx − bi) are convex.
3. Taking pointwise maximum and supremum: If fi(x), i ∈ I are convex, then

g(x) := maxi∈I fi(x)

is also convex.

Proof: First of all, dom(g) = ∩i∈I dom(fi) is convex. For any x, y ∈ dom(g), λ ∈ [0, 1],

g(λx + (1 − λ)y) = maxi∈I fi(λx + (1 − λ)y) ≤ maxi∈I [λfi(x) + (1 − λ)fi(y)] ≤ maxi∈I λfi(x) + maxi∈I (1 − λ)fi(y) = λg(x) + (1 − λ)g(y).
Remark The property extends to the pointwise supremum over an infinite set. If f(x, ω) is convex in x for every ω ∈ Ω, then

g(x) := supω∈Ω f(x, ω)

is convex. For example, the following functions are convex:
(a) piecewise linear functions: f(x) = max(a1Tx + b1, . . . , akTx + bk)

(b) support function: I∗C(x) = sup{xTy : y ∈ C}

(c) maximum distance to any set C: dmax(x, C) = supy∈C ‖y − x‖2

(d) maximum eigenvalue of a symmetric matrix: λmax(X) = max{yTXy : ‖y‖2 = 1}
Indeed, almost every convex function can be expressed as the pointwise supremum of a familyof affine functions!
4. Taking convex monotone composition:
• scalar case If f is a convex function on Rn and F (·) is a convex and non-decreasingfunction on R, then g(x) = F (f(x)) is convex.
• vector case If fi(x), i = 1, . . . ,m are convex on Rn and F (y1, . . . , ym) is convex andnon-decreasing (component-wise) in each argument, then
g(x) = F (f1(x), . . . , fm(x))
is convex.
Proof: By convexity of fi, we have

fi(λx + (1 − λ)y) ≤ λfi(x) + (1 − λ)fi(y), ∀i, ∀λ ∈ [0, 1].

Hence, for any x, y ∈ dom(g), λ ∈ [0, 1],

g(λx + (1 − λ)y) = F(f1(λx + (1 − λ)y), . . . , fm(λx + (1 − λ)y))
≤ F(λf1(x) + (1 − λ)f1(y), . . . , λfm(x) + (1 − λ)fm(y)) (by monotonicity of F)
≤ λF(f1(x), . . . , fm(x)) + (1 − λ)F(f1(y), . . . , fm(y)) (by convexity of F)
= λg(x) + (1 − λ)g(y) (by definition of g)
Remark Taking the pointwise maximum is a special case of the above rule: setting F(y1, . . . , ym) = max(y1, . . . , ym),

maxi=1,...,m fi(x) = F(f1(x), . . . , fm(x))

is convex.
For example:
(a) ef(x) is convex if f is convex
(b) − log f(x) is convex if f is concave
(c) log(ef1(x) + · · · + efk(x)) is convex if f1, . . . , fk are convex.
5. Taking partial minimization: If f(x, y) is jointly convex in (x, y) and Y is a convex set, then

g(x) = infy∈Y f(x, y)
is convex.
Proof: dom(g) = {x : (x, y) ∈ dom(f) for some y ∈ Y} is a projection of a convex set, hence convex.
Given any x1, x2 ∈ dom(g), by definition, for any ε > 0, ∃y1, y2 ∈ Y s.t.

f(x1, y1) ≤ g(x1) + ε/2, f(x2, y2) ≤ g(x2) + ε/2.

For any λ ∈ [0, 1], multiplying by λ and 1 − λ and adding the two inequalities, we have

λf(x1, y1) + (1 − λ)f(x2, y2) ≤ λg(x1) + (1 − λ)g(x2) + ε.

By joint convexity of f(x, y), and since λy1 + (1 − λ)y2 ∈ Y, this implies

g(λx1 + (1 − λ)x2) ≤ f(λx1 + (1 − λ)x2, λy1 + (1 − λ)y2) ≤ λg(x1) + (1 − λ)g(x2) + ε.

Since this holds for every ε > 0, letting ε → 0 yields the convexity of g.
Examples
(a) Minimum distance to a convex set: d(x, C) = infy∈C ‖x − y‖2, where C is convex;

(b) The function g(x) = infy {h(y) : Ay = x} is convex if h is convex. This is because g(x) = infy f(x, y), where

f(x, y) := h(y) if Ay = x, and +∞ otherwise,

is jointly convex in (x, y).
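Example (a) can be illustrated in one dimension: the distance to an interval is a partial minimization of the jointly convex f(x, y) = |x − y| over the convex set Y = [lo, hi], so it is convex. A sketch (the data are made up):

```python
def dist_to_interval(x, lo=-1.0, hi=1.0):
    """g(x) = min over y in [lo, hi] of |x - y|.

    The minimizing y is the projection of x onto the interval.
    """
    y = min(max(x, lo), hi)
    return abs(x - y)

# Spot-check the basic convex inequality at midpoints of sample pairs.
pts = [-3.0, -1.0, 0.0, 2.0, 5.0]
ok = all(
    dist_to_interval(0.5*u + 0.5*v)
    <= 0.5*dist_to_interval(u) + 0.5*dist_to_interval(v) + 1e-12
    for u in pts for v in pts
)
print(ok)  # True
```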
Lecture 5: Convex Functions, Part II – February 6
Instructor: Niao He Scribe: Shuanglong Wang
In this lecture, we cover the following topics
• Characterization of Convex Function
– Epigraph
– Level set
– One dimensional property
– First order condition
– Second order condition
• Continuity of Convex Functions
References: Boyd &Vandenberghe Chapter 2
5.1 Recall
A function f : Rn → R is convex if it satisfies

(i) convexity of domain: dom(f) = {x : |f(x)| < +∞} is convex;

(ii) basic convex inequality: for x, y ∈ dom(f) and λ ∈ [0, 1], f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).

Remark (Extended-value function) f can be extended to a function from Rn to R ∪ {+∞} by setting f(x) = +∞ if x ∉ dom(f). Then (ii) can be rewritten as: ∀x, y ∈ Rn, λ ∈ [0, 1], f(λx + (1 − λ)y) ≤ λf(x) + (1 − λ)f(y).
Remark (General convex inequality) For any λi ≥ 0 with λ1 + · · · + λm = 1, we have by induction that

f(λ1x1 + · · · + λmxm) ≤ λ1f(x1) + · · · + λmf(xm).

This is also known as Jensen's inequality. In particular, if ψ is a finite discrete random variable, then we always have

f(E[ψ]) ≤ E[f(ψ)].
5.2 Characterization of Convex Functions
5.2.1 Epigraph
The epigraph of a function f is defined as

epi(f) = {(x, t) ∈ Rn+1 : f(x) ≤ t}.
Proposition 5.1 f is convex on Rn if and only if its epigraph is a convex set in Rn+1.
Proof:
• (⇐ part) Firstly, dom(f) = {x : ∃t s.t. (x, t) ∈ epi(f)} is convex. Let (x1, t1), (x2, t2) ∈ epi(f); by convexity of epi(f), (λx1 + (1 − λ)x2, λt1 + (1 − λ)t2) ∈ epi(f), ∀λ ∈ [0, 1]. By definition of the epigraph, this means f(λx1 + (1 − λ)x2) ≤ λt1 + (1 − λ)t2. In particular, setting t1 = f(x1), t2 = f(x2), we obtain the basic convex inequality.
• (⇒ part) On the other hand, for all λ ∈ [0, 1] and (x1, t1), (x2, t2) ∈ epi(f),

f convex ⇒ f(λx1 + (1 − λ)x2) ≤ λf(x1) + (1 − λ)f(x2) ≤ λt1 + (1 − λ)t2

⇒ λ(x1, t1) + (1 − λ)(x2, t2) ∈ epi(f)

⇒ epi(f) is a convex set.
5.2.2 Level set
For any t ∈ R, the level set of f with level t is defined as
levt(f) = x ∈ dom(f) : f(x) ≤ t .
Proposition 5.2 If f is convex, then, every level set is convex.
Proof: For all x1, x2 ∈ levt(f) and λ ∈ [0, 1],
f(λx1 + (1− λ)x2) ≤ λf(x1) + (1− λ)f(x2) ≤ λt+ (1− λ)t = t
i.e. λx1 + (1− λ)x2 ∈ levt(f)
Remark: The converse is not true. A function is called quasi-convex if its domain and all its level sets are convex; this holds if and only if

f(λx + (1 − λ)y) ≤ max{f(x), f(y)}, ∀λ ∈ [0, 1].
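A concrete instance of the gap between the two notions (this example is mine, not from the notes): f(x) = √|x| has interval level sets {x : f(x) ≤ t} = [−t2, t2], hence is quasi-convex, yet it violates the basic convex inequality:

```python
import math

f = lambda x: math.sqrt(abs(x))

x, y, lam = 0.0, 4.0, 0.5
mid = lam*x + (1 - lam)*y                     # midpoint of the segment [0, 4]
print(f(mid) <= max(f(x), f(y)))              # True : quasi-convexity holds
print(f(mid) <= lam*f(x) + (1 - lam)*f(y))    # False: convexity fails at the midpoint
```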
5.2.3 One-dimensional Property
Recall that a set is convex if and only if its restriction to any line is convex. Similarly, we have

Proposition 5.3 f is convex if and only if its restriction to any line is convex, i.e. ∀x, h ∈ Rn, φ(t) = f(x + th) is convex on R.
Proof: (⇒) Firstly, dom(φ) = {t ∈ R : x + th ∈ dom(f)} is convex. Also, ∀t1, t2 ∈ dom(φ), ∀λ ∈ [0, 1],

φ(λt1 + (1 − λ)t2) = f(x + (λt1 + (1 − λ)t2)h) = f(λ(x + t1h) + (1 − λ)(x + t2h)) ≤ λf(x + t1h) + (1 − λ)f(x + t2h) = λφ(t1) + (1 − λ)φ(t2).
Hence, φ(t) is convex. Proof for the reverse direction is omitted.
Remark: Checking convexity in R boils down to check convexity of one-dimensional function onthe axis. (See example in HW1)Recall from basic calculus that for a univariate function f on (a, b),
1. if f is differentiable, f convex ⇐⇒ f ′ increasing
2. if f is twice-differentiable, f convex ⇐⇒ f ′′ ≥ 0
Similar first and second order conditions hold for general convex functions.
5.2.4 First-order Condition
Proposition 5.4 Assume f is differentiable. Then f is convex if and only if dom(f) is convex and f(y) ≥ f(x) + ∇f(x)T (y − x), ∀x, y ∈ dom(f).
Proof: (⇐ part) Let z = λx+ (1− λ)y,∀λ ∈ [0, 1]
f(x) ≥ f(z) +∇f(z)T (x− z)
f(y) ≥ f(z) +∇f(z)T (y − z)
Hence, we have
λf(x) + (1− λ)f(y) ≥ f(z) +∇f(z)T (λx+ (1− λ)y − z) = f(z) = f(λx+ (1− λ)y).
(⇒ part) By convexity, we have for all λ ∈ [0, 1],
f((1− λ)x+ λy) ≤ (1− λ)f(x) + λf(y)
⇒ f(x+ λ(y − x)) ≤ f(x) + λ(f(y)− f(x))
⇒ f(y) ≥ f(x) + [f(x+ λ(y − x))− f(x)]/λ, ∀λ ∈ (0, 1]
⇒ f(y) ≥ f(x) +∇f(x)T (y − x), letting λ→ 0+
Lecture 5: Convex Functions, Part II – February 6 5-4
Remark: The above proposition implies that we obtain a global underestimate of the entire function from purely local information (f(x), ∇f(x)). This is an important feature of convex functions. Moreover, one can see that a convex function can be viewed as the supremum of its affine minorants.
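As a sanity check of the gradient inequality (the quadratic instance below is chosen here for illustration), one can verify f(y) ≥ f(x) + ∇f(x)T(y − x) at random pairs of points:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
Q = M.T @ M                         # Q = M^T M is positive semidefinite
f = lambda x: float(x @ Q @ x)      # convex quadratic f(x) = x^T Q x
grad = lambda x: 2 * Q @ x

# First-order condition: f(y) >= f(x) + grad f(x)^T (y - x) at random pairs.
ok = all(
    f(y) >= f(x) + grad(x) @ (y - x) - 1e-9
    for x, y in (rng.standard_normal((2, 3)) for _ in range(100))
)
print(ok)   # True
```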
5.2.5 Second-order condition
Proposition 5.5 Assume f is twice-differentiable. Then f is convex if and only if dom(f) is convex and ∇2f(x) ⪰ 0, ∀x ∈ dom(f).
Proof: (⇒ part) For all x, h, the restriction φ(t) = f(x+ th) is convex on the axis, so
φ′′(t) = hT∇2f(x+ th)h ≥ 0.
In particular, φ′′(0) = hT∇2f(x)h ≥ 0, ∀h. Hence ∇2f(x) ⪰ 0.
(⇐ part) Any one-dimensional restriction φ(t) = f(x+ th) is convex since φ′′(t) = hT∇2f(x+ th)h ≥ 0.
Hence f is convex.
Example: f(x) = (1/2)xTQx+ bTx+ c is convex if and only if Q ⪰ 0.
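This criterion is easy to test numerically: the Hessian of the quadratic is (the symmetric part of) Q, so convexity reduces to checking that its eigenvalues are nonnegative (the matrices below are illustrative):

```python
import numpy as np

Q_psd = np.array([[2.0, 1.0], [1.0, 2.0]])     # eigenvalues 1, 3 -> convex
Q_ind = np.array([[1.0, 0.0], [0.0, -1.0]])    # eigenvalues 1, -1 -> not convex

def is_convex_quadratic(Q, tol=1e-12):
    # Hessian of (1/2) x^T Q x + b^T x + c is the symmetric part of Q.
    H = 0.5 * (Q + Q.T)
    return bool(np.all(np.linalg.eigvalsh(H) >= -tol))

print(is_convex_quadratic(Q_psd))  # True
print(is_convex_quadratic(Q_ind))  # False
```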
5.3 Continuity of Convex Functions
Convex functions are continuous on the relative interior of their domains.
Theorem 5.6 Let f : Rn → R be convex, then f is continuous on rint(dom(f)).
Remark: Note that f need not be continuous on Rn or on dom(f), e.g.
f(x) = 1 if x = 0; 0 if x > 0; +∞ otherwise.
Proof: Without loss of generality, assume dim(dom(f)) = n, 0 ∈ int(dom(f)), and {x : ‖x‖2 ≤ 1} ⊆ dom(f). Let us consider continuity at the point 0. Let xk → 0 with ‖xk‖2 ≤ 1.
(a) lim supk→∞ f(xk) ≤ f(0)
This is because xk = (1− ‖xk‖2) · 0 + ‖xk‖2 · yk, where yk = xk/‖xk‖2 ∈ dom(f). By convexity of f, we have
f(xk) ≤ (1− ‖xk‖2) · f(0) + ‖xk‖2 · f(yk)
Therefore, lim supk→∞ f(xk) ≤ f(0).
(b) lim infk→∞ f(xk) ≥ f(0)
This is because 0 = 1/(‖xk‖2 + 1) · xk + ‖xk‖2/(‖xk‖2 + 1) · zk, where zk = −xk/‖xk‖2 ∈ dom(f). By the convexity of f, we have
f(0) ≤ 1/(‖xk‖2 + 1) · f(xk) + ‖xk‖2/(‖xk‖2 + 1) · f(zk)
Therefore, lim infk→∞ f(xk) ≥ f(0).
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 6: Convex Functions, Part III – February 8
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email the TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• subgradient and subdifferential
– definition and examples, existence, subdifferential properties, calculus
References: Bertsekas, Nedich & Ozdaglar, 2003, Chapter 4
6.1 Motivation
Convex functions are “essentially” convex sets:
• f(x) is convex ⇐⇒ epi(f) is convex
• supα∈A fα(x) is convex whenever each fα is, since epi(supα∈A fα) = ∩α∈A epi(fα) is convex
• “tight affine minorant” of f ⇐⇒ supporting hyperplane of epi(f)
Question: Can you find any affine function that underestimates f(x) and is tight at x = 0?
• f(x) = (1/2)x2
• f(x) = |x|
• f(x) = −√x if x ≥ 0; +∞ otherwise
• f(x) = 1 if x = 0; 0 if x > 0; +∞ otherwise
6.2 Subgradient
Definition 6.1 Let f : Rn → R ∪ {+∞} be a convex function. A vector g ∈ Rn is a subgradient of f at a point x ∈ dom(f) if
f(y) ≥ f(x) + gT (y − x), ∀y
The set of all subgradients at x is called the subdifferential of f at x, denoted ∂f(x).
Remark (Subgradient and epigraph) Note that for any fixed x ∈ dom(f)
g ∈ ∂f(x) ⇔ f(y)− gT y ≥ f(x)− gTx, ∀y
⇔ t− gT y ≥ f(x)− gTx, ∀(y, t) ∈ epi(f)
⇔ (−g, 1)T (y, t) ≥ (−g, 1)T (x, f(x)), ∀(y, t) ∈ epi(f)
⇔ the hyperplane H := {(y, t) : (−g, 1)T (y, t) = (−g, 1)T (x, f(x))} is a supporting hyperplane of epi(f)
Remark (Differentiable Case) If f is differentiable at x ∈ dom(f), then ∂f(x) = {∇f(x)} is a singleton.
Proof: By definition, let y = x+ εd with ε > 0 and g ∈ ∂f(x); then f(x+ εd) ≥ f(x) + εgTd, i.e.
[f(x+ εd)− f(x)]/ε ≥ gTd, ∀d, ∀ε > 0
Letting ε→ 0 gives ∇f(x)Td ≥ gTd, ∀d. This only holds when g = ∇f(x).
Examples
1. f(x) = (1/2)x2: ∂f(x) = {x}
2. f(x) = |x|: ∂f(x) = {sgn(x)} if x ≠ 0, and [−1, 1] if x = 0. Note at x = 0, ∀g ∈ [−1, 1], |y| ≥ 0 + g · y, ∀y
3. f(x) = −√x if x ≥ 0, +∞ otherwise: ∂f(0) = ∅. Note at x = 0, ∄g ∈ R s.t. −√y ≥ 0 + g · y, ∀y ≥ 0
4. f(x) = 1 if x = 0, 0 if x > 0, +∞ otherwise: ∂f(0) = ∅. Note at x = 0, ∄g ∈ R s.t. 0 ≥ 1 + g · y, ∀y > 0
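Example 2 can be checked numerically: a candidate g is a subgradient of |x| at 0 exactly when the subgradient inequality holds for all y, which on a grid singles out the interval [−1, 1] (grid and test points chosen here):

```python
import numpy as np

ys = np.linspace(-2.0, 2.0, 401)   # grid of test points y

def is_subgradient(g, x=0.0):
    # g in ∂|.|(x) iff |y| >= |x| + g*(y - x) for all y (checked on the grid)
    return bool(np.all(np.abs(ys) >= np.abs(x) + g * (ys - x)))

print(all(is_subgradient(g) for g in [-1.0, -0.5, 0.0, 0.5, 1.0]))  # True
print(is_subgradient(1.5))  # False: 1.5 lies outside [-1, 1]
```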
Theorem 6.2 Let f be a convex function and x ∈ dom(f). Then
1. ∂f(x) is convex and closed
2. ∂f(x) is nonempty and bounded if x ∈ rint(dom(f))
Proof:
1. Convexity and closedness are evident since
∂f(x) = {g ∈ Rn : f(y) ≥ f(x) + gT (y − x), ∀y} = ∩y {g ∈ Rn : f(y) ≥ f(x) + gT (y − x)}
is the solution set of an infinite system of linear inequalities in g. The intersection of an arbitrary number of closed and convex sets is still closed and convex.
2. (Non-empty) W.l.o.g., assume dom(f) is full-dimensional and x ∈ int(dom(f)). Since epi(f) is convex and (x, f(x)) belongs to its boundary, by the separation theorem, ∃α = (s, β) ≠ 0 s.t.
sT y + βt ≥ sTx+ βf(x), ∀(y, t) ∈ epi(f)
Clearly, we must have β ≥ 0 (let t→ +∞). Since x ∈ int(dom(f)), we cannot have β = 0: otherwise sT y ≥ sTx for all y in some ball B(x, δ) ⊆ dom(f), forcing s = 0, which is impossible. Hence β > 0; setting g = −β−1s,
f(y) ≥ f(x) + gT (y − x), ∀y
(Bounded) Suppose ∂f(x) is unbounded, i.e. ∃gk ∈ ∂f(x) s.t. ‖gk‖2 →∞ as k →∞. Since x ∈ int(dom(f)), ∃δ > 0 s.t. B(x, δ) ⊆ dom(f). Hence yk = x+ δgk/‖gk‖2 ∈ dom(f). By convexity,
f(yk) ≥ f(x) + gTk (yk − x) = f(x) + δ‖gk‖2 →∞.
However, this contradicts the continuity of f over int(dom(f)).
Remark: Conversely, if dom(f) is convex and ∂f(x) is non-empty for every x ∈ dom(f), then f is convex. This is because ∀x, y ∈ dom(f) and λ ∈ [0, 1], z = λx+ (1− λ)y ∈ dom(f), so ∂f(z) ≠ ∅. Let g ∈ ∂f(z). We have
f(x) ≥ f(z) + gT (x− z)
f(y) ≥ f(z) + gT (y − z)
⇒ λf(x) + (1− λ)f(y) ≥ f(z) = f(λx+ (1− λ)y)
Remark: (Continuity of Subdifferential) Let f : Rn → R be convex and continuous. Suppose xk → x and gk ∈ ∂f(xk) with gk → g; then g ∈ ∂f(x).
6.3 Subdifferential and Directional Derivative
Recall that the directional derivative of a function f at x along direction d is
f ′(x; d) = limδ→0+ [f(x+ δd)− f(x)]/δ
If f is differentiable, then f ′(x; d) = ∇f(x)Td
Lemma 6.3 When f is convex, the ratio φ(δ) := [f(x+ δd)− f(x)]/δ is non-decreasing in δ > 0.
The proof is left as a self exercise.
Theorem 6.4 Let f be convex and x ∈ int(dom(f)). Then
f ′(x; d) = maxg∈∂f(x) gTd
Proof: By definition, we have f(x+ δd)− f(x) ≥ δgTd for all δ > 0 and g ∈ ∂f(x). Hence f ′(x; d) ≥ gTd, ∀g ∈ ∂f(x). This implies that
f ′(x; d) ≥ maxg∈∂f(x) gTd.
It suffices to show that ∃g ∈ ∂f(x) s.t. f ′(x; d) ≤ gTd. Consider the two sets
C1 = {(y, t) : f(y) < t}
C2 = {(y, t) : y = x+ αd, t = f(x) + αf ′(x; d), α ≥ 0}
Claim: C1 ∩ C2 = ∅ and C1, C2 are convex and nonempty. This is because f(x+ αd) ≥ f(x) + αf ′(x; d), ∀α ≥ 0 (due to Lemma 6.3). By the separation theorem, ∃(g0, β) ≠ 0 s.t.
gT0 (x+ αd) + β(f(x) + αf ′(x; d)) ≤ gT0 y + βt, ∀α ≥ 0, ∀t > f(y)
One can easily show that β > 0. Let g = β−1g0, so
gT (x+ αd) + f(x) + αf ′(x; d) ≤ gT y + f(y), ∀α ≥ 0, ∀y
Let α = 0; then for all y,
gTx+ f(x) ≤ gT y + f(y) ⇔ −g ∈ ∂f(x)
Let α = 1 and y = x; then
gTd+ f ′(x; d) ≤ 0 ⇔ f ′(x; d) ≤ (−g)Td
Therefore, we have shown that f ′(x; d) = maxg∈∂f(x) gTd.
6.4 Calculus of Subgradient
1. Nonnegative summation: If h(x) = β1f1(x) + β2f2(x) with β1, β2 ≥ 0, then
∂h(x) = β1∂f1(x) + β2∂f2(x)
2. Affine transformation: If h(x) = f(Ax+ b), then
∂h(x) = AT∂f(Ax+ b)
3. Pointwise maximum: If h(x) = maxi∈I fi(x), I = {1, 2, ..., m}, then
∂h(x) = Conv{∂fi(x) : i ∈ I(x)}, where I(x) = {i ∈ I : fi(x) = h(x)}
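The pointwise-maximum rule can be illustrated numerically: for h(x) = max{−x, 2x} (an instance chosen here), both pieces are active at x = 0, so ∂h(0) = Conv{−1, 2} = [−1, 2]; the grid check below confirms exactly these slopes satisfy the subgradient inequality:

```python
import numpy as np

# h(x) = max_i (a_i x + b_i) with slopes -1 and 2, both active at x = 0.
a = np.array([-1.0, 2.0])
b = np.array([0.0, 0.0])
h = lambda x: float(np.max(a * x + b))

ys = np.linspace(-3.0, 3.0, 601)
hy = np.array([h(y) for y in ys])

def is_subgradient(g):
    # g in ∂h(0) iff h(y) >= h(0) + g*y for all y (checked on the grid)
    return bool(np.all(hy >= h(0.0) + g * ys))

print(all(is_subgradient(g) for g in [-1.0, 0.0, 1.0, 2.0]))  # True
print(is_subgradient(2.5))  # False: 2.5 is outside Conv{-1, 2}
```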
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 7: Convex Functions (part IV) – February 13
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email the TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Closed convex function
• Convex conjugate function
• Conjugate theory
• Introduction to convex programs
References: Bertsekas, Nedich & Ozdaglar, Chapter 7 and Boyd & Vandenberghe, Chapter 3.3
7.1 Closed Convex Functions
Recall that for a closed convex set X, ∀x ∉ X, ∃ a hyperplane strictly separating x and X, i.e. ∃(ax, bx) s.t. X ⊂ Hx := {y : aTx y ≤ bx} and x ∉ Hx. Hence
X = ∩x∉X {y : aTx y ≤ bx}
Any closed convex set is the intersection of closed half-spaces. Intuitively, when translated to convex functions, any convex function with closed and nonempty epigraph is the supremum of its affine minorants.
Definition 7.1 A function is closed if its epigraph is a closed set.
Remark: Any continuous function is closed, but a closed function is not necessarily continuous. For example,
f(x) = 1 if x > 0; 0 if x = 0; +∞ otherwise.
Proposition 7.2 Let f : Rn → R ∪ {+∞}. Then the following are equivalent.
(i) f is closed;
(ii) every level set is closed;
(iii) f is lower semi-continuous, i.e. ∀x ∈ Rn, ∀ xk → x, f(x) ≤ lim infk→∞ f(xk).
Proof: Self Exercise
Remark Note that a convex function need not be closed, e.g.
f(x) = 1 if x = 0; 0 if x > 0; +∞ otherwise.
Suppose f is a closed convex function. For a given slope y ∈ Rn, when is an affine function yTx− β an affine minorant of f?
f(x) ≥ yTx− β, ∀x ⇔ β ≥ yTx− f(x), ∀x ⇔ β ≥ supx∈Rn {yTx− f(x)} =: f∗(y)
The function f∗(y) is known as the conjugate function of f.
7.2 Convex Conjugate Function
Definition 7.3 Let f : Rn → R ∪ {+∞}. The function
f∗(y) = supx∈Rn {yTx− f(x)} = supx∈dom(f) {yTx− f(x)}
is called the conjugate of the function f, also known as the Legendre–Fenchel transformation.
Remark:
1. (Fenchel’s inequality) f(x) + f∗(y) ≥ xT y,∀x, y
2. f∗ is always convex and closed
3. The conjugate of the conjugate function f∗, also called the biconjugate,
f∗∗(x) = supy∈Rn {xT y − f∗(y)}
is also closed and convex.
Theorem 7.4 We have
(a) f(x) ≥ f∗∗(x)
(b) If f is closed and convex, then f∗∗ = f .
Proof:
(a) By definition of f∗, f∗(y) ≥ yTx− f(x), ∀y. This implies that f(x) ≥ yTx− f∗(y), ∀y. Hence,
f(x) ≥ supy {yTx− f∗(y)} = f∗∗(x).
(b) Suppose f∗∗ ≠ f. Since f ≥ f∗∗, there exists x0 such that f∗∗(x0) < f(x0), i.e. the point (x0, f∗∗(x0)) ∉ epi(f). Since f is convex and closed, epi(f) is a closed convex set. By the separation theorem, ∃(y, β) ≠ 0 s.t. the hyperplane H = {(x, t) : yTx+ βt = β0} strictly separates epi(f) and (x0, f∗∗(x0)). Note β ≠ 0; w.l.o.g. let β = −1, so
yTx0 − f∗∗(x0) > yTx− t, ∀(x, t) ∈ epi(f)
Taking t = f(x) and the supremum over x gives yTx0 − f∗∗(x0) ≥ f∗(y). But by definition of the biconjugate, f∗∗(x0) ≥ yTx0 − f∗(y), i.e. yTx0 − f∗∗(x0) ≤ f∗(y). Contradiction!
Examples:
1. f(x) = ax− b: f∗(y) = b if y = a; +∞ otherwise.
2. f(x) = |x|: f∗(y) = 0 if |y| ≤ 1; +∞ otherwise.
3. f(x) = (c/2)x2 (c > 0): f∗(y) = y2/(2c).
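Example 3 can be verified by approximating the supremum in the definition on a grid (the grid range is chosen so that the maximizer x = y/c lies inside it):

```python
import numpy as np

# f(x) = (c/2) x^2; its conjugate should be f*(y) = y^2 / (2c).
c = 3.0
f = lambda x: 0.5 * c * x**2

xs = np.linspace(-50.0, 50.0, 200001)   # grid for the sup in the definition

def conjugate(y):
    return float(np.max(y * xs - f(xs)))

errs = [abs(conjugate(y) - y**2 / (2 * c)) for y in (-2.0, 0.0, 1.0, 4.0)]
print(max(errs) < 1e-3)   # True: grid sup matches the closed form
```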
7.3 Convex Optimization
The standard form of an optimization problem is
min f(x)
s.t. gi(x) ≤ 0, i = 1, ...,m (P )
hj(x) = 0, j = 1, ..., k
Definition 7.5 An optimization problem (P ) is convex if
1. the objective function f is convex.
2. the inequality constraint functions g1, ..., gm are convex.
3. there is either no equality constraint or only linear equality constraint.
Note the domain of the problem, D = dom(f) ∩ dom(g1) ∩ · · · ∩ dom(gm), is convex.
Definition 7.6 (Feasibility) The set C = {x ∈ dom(f) : gi(x) ≤ 0, i = 1, ..., m, hj(x) = 0, j = 1, ..., k} is called the feasible set. Any point x ∈ C is called a feasible solution. We say (P ) is feasible if C ≠ ∅.
Definition 7.7 (Optimality) The value p∗ = inf {f(x) : gi(x) ≤ 0, i = 1, ..., m, hj(x) = 0, j = 1, ..., k} is called the optimal value of (P ). If (P ) is infeasible, we set p∗ = +∞.
• We say (P ) is unbounded below if p∗ = −∞.
• We say (P ) is solvable if p∗ is finite and there exists a feasible solution x∗ such that p∗ = f(x∗). We call such an x∗ an optimal solution. The set of all optimal solutions is called the optimal set.
Epigraph form of the problem The standard problem (P ) is equivalent to the problem
minx∈Rn,t∈R
t
s.t. f(x)− t ≤ 0 (P ′)
gi(x) ≤ 0, i = 1, ...,m
hj(x) = 0, j = 1, ..., k
Note that
1. (P ′) is still convex if (P ) is convex
2. (x∗, t∗) is optimal to (P ′) if and only if x∗ is optimal to (P ) and t∗ = f(x∗)
Example
minx maxj=1,...,m (aTj x+ bj)
s.t. Cx = d
is equivalent to
minx,t t
s.t. aTi x+ bi − t ≤ 0, i = 1, ..., m
Cx = d
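The epigraph form is directly solvable as an LP. A small sketch using SciPy's linprog (the data A, b, C, d below are an illustrative instance, not from the notes):

```python
import numpy as np
from scipy.optimize import linprog

# min_x max_i (a_i^T x + b_i)  s.t. Cx = d, rewritten as an LP in (x, t).
A = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])   # rows a_i
b = np.array([0.0, 0.0, 2.0])
C = np.array([[1.0, 1.0]])
d = np.array([1.0])
n, m = 2, 3

cost = np.r_[np.zeros(n), 1.0]            # minimize t
A_ub = np.c_[A, -np.ones(m)]              # a_i^T x - t <= -b_i
A_eq = np.c_[C, np.zeros((1, 1))]         # Cx = d (t unconstrained)
res = linprog(cost, A_ub=A_ub, b_ub=-b, A_eq=A_eq, b_eq=d,
              bounds=[(None, None)] * (n + 1))

x_opt, t_opt = res.x[:n], res.x[n]
print(t_opt, np.max(A @ x_opt + b))       # t* equals the minimax value at x*
```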
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 8: Convex Programming – February 15
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email the TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Basics of Convex Programs
• Convex Theorem on Alternatives
• Lagrange Duality
References: Ben-Tal & Nemirovski, Chapter 3.1-3.3
8.1 Basics of Convex Programs
The standard form of an optimization problem
minx
f(x)
s.t. gi(x) ≤ 0, i = 1, ...,m (P )
hj(x) = 0, j = 1, ..., k
The optimal value of (P ) is
p∗ = +∞ if there is no feasible solution; otherwise
p∗ = inf {f(x) : gi(x) ≤ 0, ∀i, hj(x) = 0, ∀j}.
• (P ) is infeasible, if p∗ = +∞
• (P ) is unbounded below, if p∗ = −∞
• (P ) is solvable, if ∃ a feasible solution x∗ s.t. p∗ = f(x∗)
• (P ) is unattainable, if |p∗| < ∞ but ∄ feasible x∗ s.t. p∗ = f(x∗). For example, minx∈(0,1) ex
Given a solution x∗,
• x∗ is a global optimum for (P ) if x∗ is feasible and f(x∗) ≤ f(x), ∀x feasible
• x∗ is a local optimum for (P ) if x∗ is feasible and ∃r > 0, s.t. f(x∗) ≤ f(x), ∀ feasible x ∈B(x∗, r)
Proposition 8.1 For convex programs, a local optimum is a global optimum.
Proof: Let C denote the feasible set, and let x∗ be a local optimum. We want to show that ∀x ∈ C, f(x∗) ≤ f(x). Let z = εx∗ + (1− ε)x. Then z ∈ C ∩B(x∗, r) when ε is close enough to 1. Hence,
f(x∗) ≤ f(z) ≤ εf(x∗) + (1− ε)f(x)
i.e., (1− ε)f(x∗) ≤ (1− ε)f(x). Hence, f(x∗) ≤ f(x),∀x ∈ C.
Question: How to verify whether a solution x∗ is optimal?
In the linear program case, we have shown strong duality between (P ) & (D)
min cTx
(P ) s.t. Ax = b
x ≥ 0
max bT y
(D) s.t. AT y ≤ c
We know that
x∗ is optimal ⇔ Ax∗ = b, x∗ ≥ 0 (primal feasibility)
∃y∗, AT y∗ ≤ c (dual feasibility)
cTx∗ = bT y∗ (zero duality gap)
The strong duality is based on Farkas’ Lemma (or the separation theorem). As an analogy, we can derive duality and optimality conditions for general convex programs.
8.2 Convex Theorems on Alternatives
Consider the general form of convex program
minx∈X
f(x)
s.t. gi(x) ≤ 0, i = 1, ...,m (P )
where X is a convex set, f, g1, . . . , gm are convex functions.
Theorem 8.2 Assume g1, ..., gm satisfy the Slater condition: ∃x̄ ∈ X s.t. gi(x̄) < 0, i = 1, ..., m. Then exactly one of the following two systems is solvable (the other is empty).
(I) {x ∈ X : f(x) < 0, gi(x) ≤ 0, i = 1, ..., m}
(II) {λ ∈ Rm : λ ≥ 0, infx∈X [f(x) + ∑_{i=1}^m λigi(x)] ≥ 0}
Proof:
1. We first show that system (II) feasible ⇒ system (I) infeasible.
Suppose (I) is also feasible: ∃x0 s.t. f(x0) < 0, gi(x0) ≤ 0. Then
∀λ ≥ 0, infx∈X [f(x) + ∑_{i=1}^m λigi(x)] ≤ f(x0) + ∑_{i=1}^m λigi(x0) < 0
Contradiction!
2. We then show that system (I) infeasible ⇒ system (II) feasible.
Consider the following two sets:
S = {u = (u0, u1, ..., um) ∈ Rm+1 : ∃x ∈ X, f(x) ≤ u0, g1(x) ≤ u1, ..., gm(x) ≤ um}
T = {u = (u0, u1, ..., um) ∈ Rm+1 : u0 < 0, u1 ≤ 0, ..., um ≤ 0}
Note that S is convex (why?) and nonempty, T is convex and nonempty, and S ∩ T = ∅. By the separation theorem, ∃a = (a0, a1, ..., am) ∈ Rm+1, a ≠ 0, s.t.
supu∈T aTu ≤ infu∈S aTu
i.e.
sup {∑_{i=0}^m aiui : u0 < 0, ui ≤ 0, i = 1, ..., m} ≤ inf {∑_{i=0}^m aiui : u0 ≥ f(x), ui ≥ gi(x), i = 1, ..., m, x ∈ X}
Observe a ≥ 0 (otherwise the sup on the left would be +∞), so the left-hand side equals 0 and hence:
0 ≤ infx∈X [a0f(x) + a1g1(x) + ...+ amgm(x)]
Note that a0 > 0. Otherwise we would have (a1, ..., am) ≥ 0, (a1, ..., am) ≠ 0, and
infx∈X [a1g1(x) + ...+ amgm(x)] ≥ 0
However, by the Slater condition ∃x̄ ∈ X s.t. gi(x̄) < 0, ∀i = 1, ..., m, which implies
infx∈X [a1g1(x) + ...+ amgm(x)] < 0
Contradiction! Hence, setting λi = ai/a0, i = 1, ..., m, we have
λ ≥ 0 and infx∈X [f(x) + ∑_{i=1}^m λigi(x)] ≥ 0
i.e. system (II) is feasible.
Remark (relaxed Slater condition): The Slater condition can be relaxed to accommodate linear equalities: ∃x ∈ rint(X) s.t. gi(x) < 0 for all i = 1, ..., m such that gi is not affine (and gi(x) ≤ 0 for the affine ones).
Note that in the general case,
x∗ is optimal to (P ) ⇔ {x ∈ X : f(x) ≤ f(x∗), gi(x) ≤ 0, i = 1, ..., m} is feasible and
{x ∈ X : f(x) < f(x∗), gi(x) ≤ 0, i = 1, ..., m} is infeasible
⇔ {x ∈ X : f(x) ≤ f(x∗), gi(x) ≤ 0, i = 1, ..., m} is feasible and
{λ ∈ Rm : λ ≥ 0, infx∈X [f(x) + ∑λigi(x)] ≥ f(x∗)} is feasible
Observe that infx∈X [f(x) + ∑λigi(x)] ≤ f(x∗) for any λ ≥ 0, since x∗ itself is feasible. Therefore, the optimality of x∗ implies that there must exist λ∗ ≥ 0 such that infx∈X [f(x) + ∑λ∗i gi(x)] = f(x∗).
8.3 Lagrange Duality
Definition 8.3 The function L(x, λ) = f(x) + ∑_{i=1}^m λigi(x) is called the Lagrange function. It induces two related functions:
L(λ) = infx∈X L(x, λ) and L(x) = supλ≥0 L(x, λ)
The Lagrange dual of the problem
(P ) min f(x) s.t. gi(x) ≤ 0, i = 1, ..., m, x ∈ X
is defined as
(D) max L(λ) s.t. λ ≥ 0
Theorem 8.4 (Duality Theorem) Denote by Opt(P ) and Opt(D) the optimal values of (P ) and (D). We have
(a) (Weak Duality) ∀λ ≥ 0, L(λ) ≤ Opt(P ). Moreover, Opt(D) ≤ Opt(P ).
(b) (Strong Duality) If (P ) is convex, bounded below, and satisfies the Slater condition, then (D) is solvable, and
Opt(D) = Opt(P )
Proof:
(a) If (P ) is infeasible, Opt(P ) = +∞, so Opt(D) ≤ Opt(P ) always holds. If (P ) is feasible, let x0 be a feasible solution, i.e. gi(x0) ≤ 0, x0 ∈ X. Then
∀λ ≥ 0, L(λ) = infx∈X [f(x) + ∑λigi(x)] ≤ f(x0) + ∑λigi(x0) ≤ f(x0)
Hence,
∀λ ≥ 0, L(λ) ≤ inf {f(x0) : x0 ∈ X, gi(x0) ≤ 0, i = 1, ..., m} = Opt(P )
Furthermore, Opt(D) = supλ≥0 L(λ) ≤ Opt(P ).
(b) By optimality, we know that {x ∈ X : f(x) < Opt(P ), gi(x) ≤ 0, i = 1, ..., m} has no solution. By the convex theorem on alternatives (applied to f − Opt(P )) and the Slater condition, the system {λ ≥ 0 : L(λ) ≥ Opt(P )} has a solution. This implies that Opt(D) = supλ≥0 L(λ) ≥ Opt(P ). Combining with part (a), we have Opt(P ) = Opt(D).
Example: The Slater condition does not hold in the following example:
min(x,y)∈X e−x s.t. x2/y ≤ 0
where X = {(x, y) : y > 0}. Note that there exists no (x, y) ∈ X such that x2/y < 0.
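When the Slater condition does hold, the dual can be computed explicitly. A minimal numerical sketch (the problem instance is chosen here, not from the notes): min x² s.t. 1 − x ≤ 0 over X = R, with Opt(P) = 1 at x* = 1 and Slater point x = 2.

```python
import numpy as np

# Primal: min x^2  s.t.  1 - x <= 0.  Opt(P) = 1 at x* = 1.
# Dual function: L(lam) = inf_x [x^2 + lam*(1 - x)] = lam - lam^2/4
# (the inner minimizer is x = lam/2).
lams = np.linspace(0.0, 10.0, 100001)
dual_vals = lams - lams**2 / 4

opt_d = dual_vals.max()
lam_star = lams[dual_vals.argmax()]
print(opt_d, lam_star)   # strong duality: Opt(D) = Opt(P) = 1, at lam* = 2
```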
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 9: Optimality Conditions – February 20
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email the TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Illustrations of Lagrange Duality
• Saddle Point Formulation
• Optimality Conditions (KKT conditions)
References: Ben-Tal & Nemirovski, Chapter 3.2
9.1 Recall
• Convex program:
min f(x)
s.t. gi(x) ≤ 0, i = 1, ...,m (P )
x ∈ X
where f(x), g1, . . . , gm are convex and X is convex.
• Lagrange function:
L(x, λ) = f(x) + ∑_{i=1}^m λigi(x)
where λ = (λ1, ..., λm) is called the Lagrange multiplier.
• Lagrange dual function:
L(λ) := infx∈X L(x, λ) = infx∈X [f(x) + ∑_{i=1}^m λigi(x)]
– Note that ∀λ ≥ 0, L(λ) ≤ Opt(P )
• Lagrange dual program:
maxλ≥0 L(λ)
We showed that Opt(D) = Opt(P ) when the (relaxed) Slater condition holds.
9.2 Illustrations of Lagrange Duality
• Linear Program Duality
min cTx
(P ) s.t. Ax = b
x ≥ 0
max bT y
(D) s.t. AT y ≤ c
First, rewrite the original problem as:
min cTx
s.t. Ax− b ≤ 0 (λ1)
b−Ax ≤ 0 (λ2)
x ≥ 0
Introducing the multipliers λ = (λ1, λ2) ≥ 0, the Lagrange function is
L(x, λ) = cTx+ λT1 (Ax− b) + λT2 (b−Ax) = (c+ATλ1 −ATλ2)Tx+ bT (λ2 − λ1)
The Lagrange dual function is
L(λ) = infx≥0 [(c+ATλ1 −ATλ2)Tx+ bT (λ2 − λ1)] = bT (λ2 − λ1) if c+ATλ1 −ATλ2 ≥ 0; −∞ otherwise.
The Lagrange dual is maxλ≥0 L(λ), which is equivalent to
maxλ1,λ2≥0 bT (λ2 − λ1) s.t. c+AT (λ1 − λ2) ≥ 0
Substituting y = λ2 − λ1, the formulation above is also equivalent to:
maxy bT y s.t. AT y ≤ c
• Quadratic Program Duality:
minx (1/2)xTQx+ qTx
(P ) s.t. Ax ≥ b
maxy,λ −(1/2)yTQy + bTλ
(D) s.t. ATλ−Qy = q
λ ≥ 0
where Q ≻ 0.
The Lagrange function is L(x, λ) = (1/2)xTQx+ qTx+ λT (b−Ax).
The Lagrange dual function is L(λ) = infx [(1/2)xTQx+ (q −ATλ)Tx+ bTλ].
The infimum is achieved when Qx = ATλ− q, i.e. x = Q−1(ATλ− q), giving
L(λ) = −(1/2)(ATλ− q)TQ−1(ATλ− q) + bTλ
The Lagrange dual is
maxλ≥0 L(λ) ⇐⇒ maxλ≥0 −(1/2)(ATλ− q)TQ−1(ATλ− q) + bTλ
⇐⇒ maxλ≥0,y −(1/2)yTQy + bTλ s.t. ATλ−Qy = q
In both cases discussed above, strong duality holds true.
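A numerical check of the quadratic-program duality (the instance Q, q, A, b below is chosen here; the primal optimum is written in closed form since the single constraint is active at the optimum):

```python
import numpy as np

# Primal: min 0.5 x^T Q x + q^T x  s.t.  A x >= b  (one constraint, Q ≻ 0).
Q = np.array([[2.0, 0.0], [0.0, 2.0]])
q = np.array([-1.0, -1.0])
A = np.array([[1.0, 1.0]])
b = np.array([3.0])
Qinv = np.linalg.inv(Q)

def dual(lam):
    # L(lam) = -0.5 (A^T lam - q)^T Q^{-1} (A^T lam - q) + b^T lam
    v = A.T @ lam - q
    return float(-0.5 * v @ Qinv @ v + b @ lam)

lams = np.linspace(0.0, 10.0, 10001)
opt_d = max(dual(np.array([l])) for l in lams)

# The constraint x1 + x2 >= 3 is active; by symmetry x* = (1.5, 1.5).
x_star = np.array([1.5, 1.5])
opt_p = float(0.5 * x_star @ Q @ x_star + q @ x_star)
print(opt_p, opt_d)   # strong duality: both equal 1.5
```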
9.3 Saddle Point Formulation
Recall
(P ) : minx∈X {f(x) : gi(x) ≤ 0, i = 1, ..., m} = minx∈X L(x) = minx∈X maxλ≥0 L(x, λ)
(D) : maxλ≥0 L(λ) = maxλ≥0 minx∈X L(x, λ)
Definition 9.1 (Saddle point) We call (x∗, λ∗), where x∗ ∈ X, λ∗ ≥ 0, a saddle point of L(x, λ) if L(x, λ∗) ≥ L(x∗, λ∗) ≥ L(x∗, λ), ∀x ∈ X, λ ≥ 0.
Theorem 9.2 (x∗, λ∗) is a saddle point of L(x, λ) if and only if x∗ is an optimal solution to (P ), λ∗ is an optimal solution to (D), and Opt(P ) = Opt(D).
Proof:
(⇒) Assume (x∗, λ∗) is a saddle point,
L(x, λ∗) ≥ L(x∗, λ∗) ≥ L(x∗, λ), ∀x ∈ X,λ ≥ 0
Opt(P ) = infx∈X L(x) ≤ L(x∗) = supλ≥0 L(x∗, λ) = L(x∗, λ∗)
Opt(D) = supλ≥0 L(λ) ≥ L(λ∗) = infx∈X L(x, λ∗) = L(x∗, λ∗)
Hence, Opt(P ) ≤ L(x∗, λ∗) ≤ Opt(D). Combined with weak duality, we have Opt(P ) = Opt(D). Hence, Opt(P ) = L(x∗) = L(x∗, λ∗) = L(λ∗) = Opt(D). Thus, x∗ solves (P ), λ∗ solves (D), and Opt(P ) = Opt(D).
(⇐) Assume x∗ and λ∗ are optimal solutions to (P ) and (D), and Opt(P ) = Opt(D). By optimality,
Opt(P ) = L(x∗) = supλ≥0 L(x∗, λ) ≥ L(x∗, λ∗)
Opt(D) = L(λ∗) = infx∈X L(x, λ∗) ≤ L(x∗, λ∗)
Since Opt(D) = Opt(P ),
supλ≥0 L(x∗, λ) = L(x∗, λ∗) = infx∈X L(x, λ∗)
i.e. (x∗, λ∗) is a saddle point of L(x, λ).
Remark:
• The above theorem holds true for any saddle function L(x, λ) and its induced primal anddual problems, not limited to the Lagrange function.
• Saddle point always exists for the Lagrange function of a solvable convex program satisfyingthe Slater condition. More generally, the existence of saddle points is guaranteed for convex-concave saddle functions over convex compact domains (Minimax Theorem).
9.4 Optimality Conditions
Theorem 9.3 : Let x∗ ∈ X
(i) (sufficient condition) If there exists λ∗ ≥ 0, such that (x∗, λ∗) is a saddle point of L(x, λ),then x∗ is an optimal solution to (P ).
(ii) (necessary condition) Assume (P ) is convex and satisfies the Slater condition. If x∗ is an optimal solution to (P ), then ∃λ∗ ≥ 0 s.t. (x∗, λ∗) is a saddle point of L(x, λ).
Proof:
(i) (sufficient part) Follows from previous theorem.
(ii) (necessary part) By strong duality theorem, ∃ optimal dual solution λ∗ ≥ 0, such thatOpt(P ) = Opt(D). Hence, following from the previous theorem, (x∗, λ∗) is a saddle point ofL(x, λ).
Remark Note that the sufficient condition holds for general constrained program, not necessarilyconvex ones. However, they are far from being necessary and hardly satisfied.
Definition 9.4 (Normal Cone) Let X ⊂ Rn and x ∈ X. The normal cone of X at x, denoted NX(x), is the set
NX(x) = {h ∈ Rn : hT (y − x) ≥ 0, ∀y ∈ X}
Note that NX(x) is a closed convex cone.
Examples:
• x ∈ int(X): NX(x) = {0}
• x ∈ rint(X): NX(x) = L⊥, where L is the linear subspace parallel to Aff(X)
• X = {x : aTi x ≥ bi, i = 1, ..., m}, x ∉ int(X): NX(x) = Cone({ai : aTi x = bi})
Theorem 9.5 Let (P ) be a convex program and let x∗ be a feasible solution. Assume f, g1, ..., gmare differentiable at x∗.
(a) (sufficient condition) If there exists λ∗ ≥ 0 satisfying
1) ∇f(x∗) + ∑_{i=1}^m λ∗i∇gi(x∗) ∈ NX(x∗)
2) λ∗i gi(x∗) = 0, ∀i = 1, ..., m
then x∗ is an optimal solution to (P ) (Karush–Kuhn–Tucker, KKT conditions (1951)).
(b) (necessary condition) If (P ) also satisfies the Slater condition, then the above conditions are also necessary for x∗ to be optimal.
Proof:
(a) (sufficient part) Under the KKT conditions:
1) implies that L(x, λ∗) ≥ L(x∗, λ∗), ∀x ∈ X, because
L(x, λ∗) ≥ L(x∗, λ∗) +∇xL(x∗, λ∗)T (x− x∗)
and the last term is non-negative since ∇xL(x∗, λ∗) ∈ NX(x∗).
2) together with the feasibility of x∗ implies L(x∗, λ∗) ≥ L(x∗, λ), ∀λ ≥ 0, because
L(x∗, λ∗) = f(x∗) + ∑_{i=1}^m λ∗i gi(x∗) = f(x∗) ≥ f(x∗) + ∑_{i=1}^m λigi(x∗) = L(x∗, λ)
Hence (x∗, λ∗) is a saddle point of L(x, λ).
(b) (necessary part) From the previous theorem, there exists λ∗ ≥ 0 such that (x∗, λ∗) is a saddle point of L(x, λ). We have
L(x, λ∗) ≥ L(x∗, λ∗), ∀x ∈ X ⇒ (y − x∗)T∇xL(x∗, λ∗) = limε→0+ [L(x∗ + ε(y − x∗), λ∗)− L(x∗, λ∗)]/ε ≥ 0, ∀y ∈ X
⇒ ∇xL(x∗, λ∗) ∈ NX(x∗)
L(x∗, λ∗) ≥ L(x∗, λ), ∀λ ≥ 0 ⇒ ∑_{i=1}^m λ∗i gi(x∗) ≥ ∑_{i=1}^m λigi(x∗), ∀λ ≥ 0
⇒ ∑_{i=1}^m λ∗i gi(x∗) ≥ 0 (taking λ = 0; note we also have λ∗i gi(x∗) ≤ 0)
⇒ λ∗i gi(x∗) = 0, ∀i
This leads to the KKT condition.
Remark To summarize, (x∗, λ∗) is an optimal primal-dual pair if it satisfies:
• Primal feasibility: x∗ ∈ X, gi(x∗) ≤ 0
• Dual feasibility: λ∗ ≥ 0
• Lagrange optimality: ∇f(x∗) + ∑_{i=1}^m λ∗i∇gi(x∗) ∈ NX(x∗)
• Complementary slackness: λ∗i gi(x∗) = 0,∀i = 1, ...,m
Example: Given ai > 0, i = 1, ..., n, solve the problem
minx ∑_{i=1}^n ai/xi
s.t. x > 0, ∑_{i=1}^n xi ≤ 1
The Lagrange function is L(x, λ) = ∑_{i=1}^n ai/xi + λ(∑_{i=1}^n xi − 1). The KKT optimality conditions yield
x∗i > 0, ∑_{i=1}^n x∗i ≤ 1 (primal feasibility)
λ∗ ≥ 0 (dual feasibility)
−ai/(x∗i )2 + λ∗ = 0 (Lagrange optimality)
λ∗(∑_{i=1}^n x∗i − 1) = 0 (complementary slackness)
⇒ x∗i = √(ai/λ∗) and ∑_{i=1}^n x∗i = 1 ⇒
λ∗ = (∑_{i=1}^n √ai)2, x∗i = √ai / ∑_{j=1}^n √aj , i = 1, ..., n
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 10: Minimax Problems – February 22
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email the TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Saddle Point and Minimax Problems
• Minimax Theorem (Sion-Kakutani Theorem)
References: Ben-Tal & Nemirovski, Chapter 3.4
10.1 Saddle Point and Minimax Problems
Let X ⊆ Rn and Y ⊆ Rm be two non-empty sets. Let L(x, y) : X × Y → R be a real-valued function. We say (x∗, y∗) is a saddle point of L(x, y) if
L(x, y∗) ≥ L(x∗, y∗) ≥ L(x∗, y), ∀x ∈ X, y ∈ Y
Game Theory Interpretation: This can be viewed as a two player zero-sum game.
• Player 1 chooses x ∈ X
• Player 2 chooses y ∈ Y
• After both players reveal their decisions, player 1 pays to player 2 the amount L(x, y).
• Player 1 is interested in minimizing his loss, while player 2 is interested in maximizing his gain
• The strategy (x∗, y∗) is called a (Nash) equilibrium if no player has an incentive to change his chosen strategy after considering the opponent’s action.
Question: What should players do to optimize their profit?
Considering the following two situations
• Situation I: Player 1 chooses x first, and player 2 chooses y knowing the choice of player 1:
minx∈X L(x) := maxy∈Y L(x, y)
• Situation II: Player 2 chooses y first, and player 1 chooses x knowing the choice of player 2:
maxy∈Y L(y) := minx∈X L(x, y)
Two Induced Problems:
(P ) : minx∈X maxy∈Y L(x, y) = minx∈X L(x)
(D) : maxy∈Y minx∈X L(x, y) = maxy∈Y L(y)
Intuitively, it is better for players to play second. Indeed, there is an inequality:
Proposition 10.1
supy∈Y infx∈X L(x, y) ≤ infx∈X supy∈Y L(x, y).
Proof: This is because
∀x ∈ X, ∀y ∈ Y : infx′∈X L(x′, y) ≤ L(x, y)
⇒ ∀x ∈ X : supy∈Y infx′∈X L(x′, y) ≤ supy∈Y L(x, y)
⇒ supy∈Y infx∈X L(x, y) ≤ infx∈X supy∈Y L(x, y)
On the other hand, if (x∗, y∗) is a saddle point,
L(x, y∗) ≥ L(x∗, y∗) ≥ L(x∗, y), ∀x ∈ X, y ∈ Y
⇒ infx∈X L(x, y∗) ≥ L(x∗, y∗) ≥ supy∈Y L(x∗, y)
⇒ supy∈Y infx∈X L(x, y) ≥ infx∈X supy∈Y L(x, y)
Hence, if (x∗, y∗) is a saddle point, then
supy∈Y infx∈X L(x, y) = infx∈X supy∈Y L(x, y)
Remark:
• If there exists a saddle point, there is no advantage to the players of knowing the opponent’schoice. Moreover, the equilibrium, minimax and maximin all give the same solution.
• Saddle points do not always exist. For example, L(x, y) = (x− y)2, X = [0, 1], Y = [0, 1]:
L(x) = supy∈[0,1] L(x, y) = max{x2, (x− 1)2}, so infx∈X supy∈Y L(x, y) = 1/4
L(y) = infx∈[0,1] L(x, y) = 0, so supy∈Y infx∈X L(x, y) = 0
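The 1/4-versus-0 gap is easy to reproduce on a grid:

```python
import numpy as np

# L(x, y) = (x - y)^2 on [0,1]^2: inf_x sup_y L = 1/4, but sup_y inf_x L = 0.
xs = np.linspace(0.0, 1.0, 1001)
L = (xs[:, None] - xs[None, :])**2        # L[i, j] = (x_i - y_j)^2

inf_sup = L.max(axis=1).min()             # min over x of max over y
sup_inf = L.min(axis=0).max()             # max over y of min over x
print(inf_sup, sup_inf)                   # approximately 0.25, exactly 0.0
```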
• As already shown in the last lecture: a saddle point exists if and only if the induced problems (P ) and (D) are both solvable and their optimal values are equal. Moreover, the saddle points are exactly the pairs (x∗, y∗) where x∗ is optimal to (P ) and y∗ is optimal to (D).
10.2 Existence of Saddle Points
Lemma 10.2 (Minimax Lemma) Let fi(x), i = 1, ..., m, be convex and continuous on a convex compact set X. Then
minx∈X max1≤i≤m fi(x) = minx∈X ∑_{i=1}^m λ∗i fi(x)
for some λ∗ ∈ Rm such that λ∗i ≥ 0, i = 1, ..., m, and ∑_{i=1}^m λ∗i = 1.
Remark: Let L(x, λ) = ∑_{i=1}^m λifi(x) and let Δ = {λ : λ ≥ 0, ∑_{i=1}^m λi = 1} be the standard simplex. The above lemma implies that L(x, λ) has a saddle point on X ×Δ, since
maxλ∈Δ minx∈X ∑λifi(x) ≥ minx∈X ∑λ∗i fi(x) = minx∈X maxλ∈Δ ∑λifi(x)
and the reverse inequality always holds.
Proof: Consider the epigraph form of the problem minx∈X max1≤i≤m fi(x):
minx,t t
s.t. fi(x)− t ≤ 0, i = 1, ..., m
(x, t) ∈ Xt
where Xt = {(x, t) : x ∈ X, t ∈ R}. The optimal value is t∗ = minx∈X max1≤i≤m fi(x). Note that the above problem satisfies the Slater condition and is solvable (why?). Hence, there exist (x∗, t∗) ∈ Xt and λ∗ ≥ 0 such that (x∗, t∗;λ∗) is a saddle point of the Lagrange function
L(x, t;λ) = t+ ∑_{i=1}^m λi(fi(x)− t) = (1− ∑_{i=1}^m λi)t+ ∑_{i=1}^m λifi(x)
Therefore:
∂L/∂t (x∗, t∗;λ∗) = 1− ∑_{i=1}^m λ∗i = 0 and ∑_{i=1}^m λ∗i (fi(x∗)− t∗) = 0
⇒ ∑_{i=1}^m λ∗i = 1 and ∑_{i=1}^m λ∗i fi(x∗) = t∗
This implies that ∃λ∗i ≥ 0 with ∑_{i=1}^m λ∗i = 1 such that
minx∈X max1≤i≤m fi(x) = t∗ = min(x,t)∈Xt L(x, t;λ∗) = minx∈X ∑_{i=1}^m λ∗i fi(x)
Theorem 10.3 (Minimax Theorem, Sion–Kakutani)
Let X and Y be two convex compact sets. Let L(x, y) : X × Y → R be a continuous function that is convex in x ∈ X for every fixed y ∈ Y and concave in y ∈ Y for every fixed x ∈ X. Then L(x, y) has a saddle point on X × Y , and
minx∈X maxy∈Y L(x, y) = maxy∈Y minx∈X L(x, y)
Proof: We shall prove that
(P ) : minx∈X L(x), where L(x) := maxy∈Y L(x, y)
(D) : maxy∈Y L(y), where L(y) := minx∈X L(x, y)
are both solvable with equal optimal values.
are both solvable with equal optimal values.
1. Since X and Y are compact and L(x, y) is continuous, both L(x) and L(y) are continuous and attain their optima on the compact sets. By weak duality, it is sufficient to show Opt(D) ≥ Opt(P ), i.e.
maxy∈Y minx∈X L(x, y) ≥ minx∈X maxy∈Y L(x, y)
Consider the sets X(y) = {x ∈ X : L(x, y) ≤ Opt(D)}.
2. Note that X(y) is nonempty, compact and convex for any y ∈ Y . We show that every finite collection of these sets has a point in common. Suppose ∃y1, ..., ym s.t. X(y1) ∩ ... ∩X(ym) = ∅. This implies:
minx∈X maxi=1,...,m L(x, yi) > Opt(D)
By the Minimax Lemma, ∃λ∗i ≥ 0 with ∑_{i=1}^m λ∗i = 1 such that
minx∈X maxi=1,...,m L(x, yi) = minx∈X ∑_{i=1}^m λ∗iL(x, yi) ≤ minx∈X L(x, ∑_{i=1}^m λ∗i yi) = L(ȳ) ≤ Opt(D)
where ȳ = ∑_{i=1}^m λ∗i yi and the first inequality is by concavity of L(x, ·). This leads to a contradiction! Hence, every finite collection of the X(y) has a common point.
3. By Helly’s theorem, all of the sets {X(y) : y ∈ Y } have a common point. Therefore, ∃x∗ ∈ X with x∗ ∈ X(y), ∀y ∈ Y , which means L(x∗, y) ≤ Opt(D), ∀y ∈ Y . Hence Opt(P ) ≤ Opt(D).
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 11: Convex Programming, Part III – February 27
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email the TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Some Remarks on Minimax Problem
• Some Remarks on Optimality Conditions
• Polynomial Solvability of Convex Programs
11.1 Minimax Problem
Recall in the last lecture, we have discussed
[Minimax Theorem] If X and Y are convex compact sets and L(x, y) : X × Y → R is convex–concave and continuous, then L(x, y) has a saddle point on X × Y , and
minx∈X maxy∈Y L(x, y) = maxy∈Y minx∈X L(x, y)
Remark 1: Based on the proof, we can immediately see that some of the assumptions can be relaxed. Here is a general result:
Theorem 11.1 Let X and Y be convex, with at least one of them compact. Let L(x, y) be lower semi-continuous (l.s.c.) and quasi-convex in x ∈ X, and upper semi-continuous (u.s.c.) and quasi-concave in y ∈ Y . Then
minx∈X maxy∈Y L(x, y) = maxy∈Y minx∈X L(x, y)
Remark 2: Compactness is sufficient but not necessary, as the following examples show:

1. $\min_x\max_y(x+y)=+\infty\ne-\infty=\max_y\min_x(x+y)$

2. $\min_x\max_{0\le y\le1}(x+y)=-\infty=\max_{0\le y\le1}\min_x(x+y)$

3. $\min_x\max_{y\le1}(x+y)=-\infty=\max_{y\le1}\min_x(x+y)$
Remark 3: Minimax problems can also arise from Fenchel duality:

Let $f(x)$ be closed and convex; then $f=f^{**}$, so we can equivalently write $f$ as
$$f(x)=\max_{y\in\mathbb{R}^n}\ y^Tx-f^*(y)$$
Hence
$$\min_{x\in X}f(x)=\min_{x\in X}\max_{y\in\mathbb{R}^n}\ y^Tx-f^*(y)$$
Remark 4: Alternating optimization does not necessarily converge to a saddle point. Consider the saddle point problem
$$\min_{-1\le x\le1}\ \max_{-1\le y\le1}\ xy$$
The problem has a unique saddle point: $(0,0)$. Start with any $(x_0,y_0)$ with $x_0>0$ and alternate maximization over $y$ with minimization over $x$:
$$(x_0,y_0)\Rightarrow(x_0,1)\Rightarrow(-1,1)\Rightarrow(-1,-1)\Rightarrow(1,-1)\Rightarrow(1,1)\Rightarrow(-1,1)\Rightarrow\dots$$
The iterates cycle forever and never converge to the saddle point.
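This cycling is easy to reproduce numerically. Below is a minimal sketch (added for illustration, not part of the original notes) that runs alternating best responses for $L(x,y)=xy$ on $[-1,1]\times[-1,1]$; the helper names `best_y` and `best_x` are our own.

```python
# Alternating best responses on L(x, y) = x*y over [-1, 1] x [-1, 1].
# The unique saddle point is (0, 0), but the iterates cycle forever.

def best_y(x):
    # argmax over y in [-1, 1] of x*y (ties broken at y = 1)
    return 1.0 if x >= 0 else -1.0

def best_x(y):
    # argmin over x in [-1, 1] of x*y (ties broken at x = -1)
    return -1.0 if y >= 0 else 1.0

x, y = 0.5, 0.2          # any start with x0 > 0
trace = []
for _ in range(4):
    y = best_y(x)        # maximize over y
    trace.append((x, y))
    x = best_x(y)        # minimize over x
    trace.append((x, y))

print(trace)
# [(0.5, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0),
#  (1.0, 1.0), (-1.0, 1.0), (-1.0, -1.0), (1.0, -1.0)]
```

The iterates settle into the four-cycle $(-1,1)\Rightarrow(-1,-1)\Rightarrow(1,-1)\Rightarrow(1,1)$ and never reach $(0,0)$.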
11.2 Optimality Conditions for Convex Programs
• General Constrained Differentiable Case:
$$\min_{x\in X}\ f(x)\quad\text{s.t. }g_i(x)\le0,\ i=1,\dots,m \tag{P}$$
We have already shown that for a feasible solution $x^*$ to be optimal:
$$x^*\in X \text{ is optimal for (P)}\ \overset{\text{Slater condition}}{\Longleftrightarrow}\ \exists\lambda^*\ge0 \text{ s.t. }\begin{cases}\text{(a)}\ \nabla f(x^*)+\sum_{i=1}^m\lambda_i^*\nabla g_i(x^*)\in N_X(x^*)\\[2pt] \text{(b)}\ \lambda_i^*g_i(x^*)=0,\ \forall i=1,\dots,m\end{cases} \tag{11.1}$$
• Simple Constrained Differentiable Case:
$$\min_{x\in X}\ f(x)$$
Proposition 11.2 Assume X is convex and f(x) is convex and differentiable at x∗. Then
x∗ ∈ X is optimal ⇔ ∇f(x∗) ∈ NX(x∗), i.e. ∇f(x∗)T (x− x∗) ≥ 0,∀x ∈ X
Proof:

($\Leftarrow$) $f(x)\ge f(x^*)+\nabla f(x^*)^T(x-x^*)\ge f(x^*)$, $\forall x\in X$

($\Rightarrow$) $\nabla f(x^*)^T(x-x^*)=\lim_{\varepsilon\to0^+}\frac{f(x^*+\varepsilon(x-x^*))-f(x^*)}{\varepsilon}\ge0$
Remark

– If $x^*\in\mathrm{int}(X)$, then $x^*$ is optimal $\Leftrightarrow\nabla f(x^*)=0$

– (Unconstrained case): If $X=\mathbb{R}^n$, then $x^*$ is optimal $\Leftrightarrow\nabla f(x^*)=0$

– For general (non-convex) optimization problems, $\nabla f(x^*)=0$ is only a necessary but not sufficient condition for $x^*$ to be optimal.
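A quick numeric illustration of Proposition 11.2 (an added sketch, not from the notes): take $f(x)=(x-2)^2$ on $X=[0,1]$. The constrained minimizer $x^*=1$ has $f'(1)=-2\ne0$, yet the variational inequality $\nabla f(x^*)^T(x-x^*)\ge0$ holds on all of $X$.

```python
# Check nabla f(x*)^T (x - x*) >= 0 for all x in X = [0, 1],
# with f(x) = (x - 2)^2, whose constrained minimizer is x* = 1.

def fprime(x):
    return 2.0 * (x - 2.0)

grid = [i / 100.0 for i in range(101)]   # discretization of X = [0, 1]

# x* = 1 satisfies the optimality condition even though f'(1) = -2 != 0
ok = all(fprime(1.0) * (x - 1.0) >= 0 for x in grid)
print(ok)    # True

# a non-optimal point such as x = 0.5 violates it
bad = all(fprime(0.5) * (x - 0.5) >= 0 for x in grid)
print(bad)   # False
```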
• Nondifferentiable Case:
Proposition 11.3 Assume f(x) is convex on X, and x∗ ∈ rint(dom(f)), then
x∗ ∈ X is optimal ⇔ ∃g ∈ ∂f(x∗), s.t. gT (x− x∗) ≥ 0, ∀x ∈ X
Proof:

($\Leftarrow$) $f(x)\ge f(x^*)+g^T(x-x^*)\ge f(x^*)$, $\forall x\in X$

($\Rightarrow$) $\max_{g\in\partial f(x^*)}g^T(x-x^*)=f'(x^*;x-x^*)=\lim_{\varepsilon\to0^+}\frac{f(x^*+\varepsilon(x-x^*))-f(x^*)}{\varepsilon}\ge0$
Remark In the unconstrained case when X = Rn, x∗ is optimal ⇔ 0 ∈ ∂f(x∗)
11.3 Solving Convex Programs
$$\min\ f(x)\quad\text{s.t. }g_i(x)\le0,\ i=1,\dots,m,\quad x\in X \tag{P}$$
In practice, to solve a problem means to find an "approximate" solution to (P) with a small inaccuracy $\varepsilon>0$.
Measure of accuracy of an approximate solution $x$: some function $\varepsilon(x)$ such that $\varepsilon(x)\ge0$ and $\varepsilon(x)\to0$ as $x\to x^*$. For instance:

(i) $\varepsilon(x)=\inf_{x^*\in X_{\mathrm{opt}}}\|x-x^*\|$, where $X_{\mathrm{opt}}$ is the optimal set.

(ii) $\varepsilon(x)=f(x)-\mathrm{Opt}(P)$, where $\mathrm{Opt}(P)$ is the optimal value.

(iii) $\varepsilon(x)=\max\big(f(x)-\mathrm{Opt}(P),\ \max_{1\le i\le m}[g_i(x)]_+\big)$, where $u_+=\max(u,0)$.
Black-box oriented numerical method: We assume the objective and constraints can be accessed through oracles.

• zero-order oracle: $\mathcal O=\big(f(x),g_1(x),\dots,g_m(x)\big)$

• first-order oracle: $\mathcal O=\big(\partial f(x),\partial g_1(x),\dots,\partial g_m(x)\big)$

• second-order oracle: $\mathcal O=\big(\nabla^2f(x),\nabla^2g_1(x),\dots,\nabla^2g_m(x)\big)$

• separation oracle: given $x$, either reports $x\in X$ or returns a separator, i.e. a vector $e\ne0$ such that $e^Tx\ge\sup_{y\in X}e^Ty$. Note that when $x\notin\mathrm{int}(X)$, such a separator does exist.
Complexity of a numerical method $M$: given an input $\varepsilon>0$ and a problem instance $p$,

• oracle complexity: number of oracle calls required to solve the problem up to accuracy $\varepsilon>0$

• arithmetic complexity: number of arithmetic (bit-wise) operations required to solve the problem up to accuracy $\varepsilon>0$
A solution method $M$ for a family $\mathcal P$ of problems is called polynomial if $\forall p\in\mathcal P$, the arithmetic complexity satisfies
$$\mathrm{Compl}_M(\varepsilon,p)\ \le\ O(1)\underbrace{[\dim(p)]^{\alpha}}_{\text{polynomial of size}}\cdot\underbrace{\ln(V(p)/\varepsilon)}_{\text{number of accuracy digits}}$$
where $V(p)$ is some data-dependent quantity.

We say the family $\mathcal P$ of problems is polynomially solvable, i.e. computationally tractable, if it admits polynomial solution methods.
11.4 Convex Problems are Polynomially Solvable
Illustration: One-dimensional Case
$$\min_{x\in[a,b]}\ f(x)$$
• Zero-order line search: Assume there exists a unique minimizer
– Initialize a localizer G1 = [a, b] for x∗
– At each iteration, choose $a_t<b_t$ in the current localizer $G_t$ and update
$$G_{t+1}\leftarrow\begin{cases}[a,b_t]\cap G_t, & \text{if }f(a_t)\le f(b_t)\\ [a_t,b]\cap G_t, & \text{if }f(a_t)>f(b_t)\end{cases}$$
If we choose $a_t,b_t$ to split $G_t$ into three pieces of equal length, then $|G_{t+1}|=\frac23|G_t|$, and we get linear convergence.
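The zero-order scheme above can be sketched in a few lines (an illustrative implementation, not from the notes); here $a_t,b_t$ trisect the current localizer, so each step keeps two thirds of it:

```python
# Zero-order line search for a convex (unimodal) f on [a, b]:
# trisect the localizer at a_t, b_t and keep the two thirds
# that must contain the minimizer, so |G_{t+1}| = (2/3)|G_t|.

def zero_order_search(f, a, b, tol=1e-9, max_iter=200):
    for _ in range(max_iter):
        if b - a <= tol:
            break
        at = a + (b - a) / 3.0
        bt = b - (b - a) / 3.0
        if f(at) <= f(bt):
            b = bt           # minimizer lies in [a, b_t]
        else:
            a = at           # minimizer lies in [a_t, b]
    return (a + b) / 2.0

x = zero_order_search(lambda t: (t - 0.3) ** 2, -1.0, 2.0)
print(x)   # approximately 0.3
```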
• First-order line search (bisection)
– Initialize a localizer G1 = [−R,R] ⊃ [a, b]
– At each iteration, compute the midpoint $c_t$ of $G_t=[a_t,b_t]$:

if $c_t\notin[a,b]$,
$$G_{t+1}=\begin{cases}[a_t,c_t], & \text{if }c_t>b\\ [c_t,b_t], & \text{if }c_t<a\end{cases}$$
if $c_t\in[a,b]$ and $f'(c_t)\ne0$,
$$G_{t+1}=\begin{cases}[a_t,c_t], & \text{if }f'(c_t)>0\\ [c_t,b_t], & \text{if }f'(c_t)<0\end{cases}$$
otherwise ($c_t\in[a,b]$ and $f'(c_t)=0$), $c_t$ is optimal.

Note that $x^*\in G_t$ for all $t$ and $|G_{t+1}|=\frac12|G_t|$, so we again get linear convergence.
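The first-order scheme can be sketched similarly (illustrative code, not from the notes); it bisects the localizer $G_t\subseteq[-R,R]\supseteq[a,b]$, cutting either by feasibility or by the sign of $f'$:

```python
# First-order line search (bisection) for convex differentiable f on [a, b],
# with localizer G_1 = [-R, R] containing [a, b].

def bisection(fprime, a, b, R, tol=1e-12, max_iter=200):
    lo, hi = -R, R
    for _ in range(max_iter):
        c = (lo + hi) / 2.0       # midpoint of the localizer
        if c > b:
            hi = c                # c infeasible: cut the right part
        elif c < a:
            lo = c                # c infeasible: cut the left part
        elif fprime(c) > 0:
            hi = c                # minimizer is to the left of c
        elif fprime(c) < 0:
            lo = c                # minimizer is to the right of c
        else:
            return c              # f'(c) = 0: c is optimal
        if hi - lo <= tol:
            break
    # clip the final midpoint back to the feasible interval
    return min(max((lo + hi) / 2.0, a), b)

# minimize (x - 5)^2 over [0, 1]: the constrained minimizer is x = 1
x = bisection(lambda t: 2.0 * (t - 5.0), 0.0, 1.0, R=10.0)
print(x)   # approximately 1.0
```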
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 12: Polynomial Solvability of Convex Programs – March 01
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Center of Gravity Method
• Ellipsoid Method
References: Ben-Tal & Nemirovski, Chapter 7
12.1 Convex Program
We aim to solve the convex optimization problem
$$\min_{x\in X}\ f(x)$$
where $f$ is convex and $X\subseteq\mathbb{R}^n$ is a closed, bounded convex set with non-empty interior (also called a convex body).

Let $r,R>0$ be such that $\{x:\|x-c\|_2\le r\}\subseteq X\subseteq\{x:\|x\|_2\le R\}$ for some $c$.
Setup: We assume that the objective and constraint set can be accessed through
• Separation Oracle for X: a routine that given an input x, either reports x ∈ X or returns a
vector ω 6= 0, s.t. ωTx ≥ supy∈X ωT y
• First Order Oracle for f : a routine that given an input x, returns a subgradient g ∈ ∂f(x),
i.e. f(y) ≥ f(x) + gT (y − x), ∀y.
• Zero Order Oracle for f : a routine that given an input x, returns the function value f(x).
Example:
$$\min_x\ \max_{1\le j\le J}f_j(x)\quad\text{s.t. }g_i(x)\le0,\ i=1,\dots,m$$
where $f_j(x)$, $1\le j\le J$ and $g_i(x)$, $1\le i\le m$ are convex and differentiable. Here, we have
$$f(x)=\max_{1\le j\le J}f_j(x),\qquad X=\{x\in\mathbb{R}^n : g_i(x)\le0,\ i=1,\dots,m\}$$
1. We can build a first-order oracle for $f$ if at a given $x$ we can compute $f_1(x),\dots,f_J(x)$ and $\nabla f_1(x),\dots,\nabla f_J(x)$. This is because:
$$\partial f(x)\supseteq\mathrm{Conv}\{\nabla f_j(x) : j\text{ such that }f(x)=f_j(x)\}$$
2. We can build a separation oracle for $X$ if at a given $x$ we can compute $g_1(x),\dots,g_m(x)$ and $\nabla g_1(x),\dots,\nabla g_m(x)$. This is because
$$x\in X\iff g_i(x)\le0,\ \forall i=1,\dots,m$$
and if $x\notin X$, then $\exists i'$ s.t. $g_{i'}(x)>0$, so by convexity, for all $y\in X$,
$$\nabla g_{i'}(x)^T(y-x)\le g_{i'}(y)-g_{i'}(x)<0\ \Rightarrow\ \omega^Tx\ge\sup_{y\in X}\omega^Ty,\quad\text{for }\omega=\nabla g_{i'}(x)$$
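This construction is mechanical enough to code directly. A minimal sketch (illustrative, not from the notes; the constraint-list format is our own):

```python
# Separation oracle for X = {x : g_i(x) <= 0, i = 1..m}, built from the
# constraint values and gradients: if some g_i(x) > 0, the gradient of
# that violated constraint is a valid separator omega.

def separation_oracle(x, constraints):
    """constraints: list of (g, grad_g) pairs.
    Returns (True, None) if x is in X, else (False, omega)."""
    for g, grad_g in constraints:
        if g(x) > 0:                    # violated constraint found
            return False, grad_g(x)     # omega^T x >= sup_{y in X} omega^T y
    return True, None

# X = unit square [0,1]^2 written with four linear constraints
cons = [
    (lambda x: x[0] - 1.0, lambda x: (1.0, 0.0)),
    (lambda x: -x[0],      lambda x: (-1.0, 0.0)),
    (lambda x: x[1] - 1.0, lambda x: (0.0, 1.0)),
    (lambda x: -x[1],      lambda x: (0.0, -1.0)),
]
print(separation_oracle((0.5, 0.5), cons))   # (True, None)
print(separation_oracle((2.0, 0.5), cons))   # (False, (1.0, 0.0))
```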
12.2 Center of Gravity Method (Levin, 1965; Newman, 1965)
We consider the simple cutting plane scheme
• Initialize G0 = X
• At iteration t = 1, 2, ..., T , do
– Compute the center of gravity: $x_t=\frac{1}{\mathrm{Vol}(G_{t-1})}\int_{G_{t-1}}x\,dx$

– Call the first-order oracle and obtain $g_t\in\partial f(x_t)$

– Set $G_t=G_{t-1}\cap\{y : g_t^T(y-x_t)\le0\}$

• Output $x^T\in\arg\min_{x\in\{x_1,\dots,x_T\}}f(x)$
Lemma 12.1 Let $C$ be a convex body in $\mathbb{R}^n$ with center of gravity at the origin, i.e. $\int_C x\,dx=0$. Then $\forall\omega\ne0$,
$$\mathrm{Vol}\big(C\cap\{x:\omega^Tx\le0\}\big)\ \le\ \Big(1-\big(\tfrac{n}{n+1}\big)^n\Big)\mathrm{Vol}(C)\ \le\ \Big(1-\tfrac1e\Big)\mathrm{Vol}(C)$$
As an immediate result of the center of gravity method, we have

• $x^*\in G_t$, $\forall t\ge1$, because $X_{\mathrm{opt}}\subseteq\{y:f(y)\le f(x_t)\}\subseteq\{y:g_t^T(y-x_t)\le0\}$

• $\mathrm{Vol}(G_t)\le\big(1-\frac1e\big)^t\mathrm{Vol}(X)$
Theorem 12.2 The approximate solution generated by the center of gravity method satisfies:
$$f(x^T)-f_*\ \le\ \Big(1-\frac1e\Big)^{\frac Tn}\mathrm{Var}_X(f)$$
where $\mathrm{Var}_X(f)=\max_{x\in X}f(x)-\min_{x\in X}f(x)$, and $f_*$ is the optimal value.

Proof: Let $\delta\in\big((1-\frac1e)^{\frac Tn},1\big)$, and consider the shrunken neighborhood of $x^*$
$$X_\delta=\{x^*+\delta(x-x^*) : x\in X\}$$
$$\mathrm{Vol}(X_\delta)=\delta^n\mathrm{Vol}(X)>\Big(1-\frac1e\Big)^T\mathrm{Vol}(X)\ge\mathrm{Vol}(G_T)$$
Hence $X_\delta\setminus G_T\ne\emptyset$. Let $y=x^*+\delta(z-x^*)\in X_\delta\setminus G_T$ for some $z\in X$. Then for a certain $t^*\le T$, we have $y\in G_{t^*-1}\setminus G_{t^*}$. Since $y\notin G_{t^*}$, we have $g_{t^*}^T(y-x_{t^*})>0$, hence $f(y)>f(x_{t^*})$. Since $y=x^*+\delta(z-x^*)$, by convexity of $f$,
$$f(y)=f(\delta z+(1-\delta)x^*)\le\delta f(z)+(1-\delta)f(x^*)=f(x^*)+\delta[f(z)-f(x^*)]\le f(x^*)+\delta\,\mathrm{Var}_X(f)$$
Hence $f(x^T)\le f(x_{t^*})<f(y)\le f_*+\delta\,\mathrm{Var}_X(f)$. Letting $\delta\to(1-\frac1e)^{\frac Tn}$, we get the desired result.
Remarks:
1. To obtain a solution with small inaccuracy $\varepsilon>0$, the number of oracle calls needed is polynomially dependent on the dimension:
• separation oracle: $N(\varepsilon)=n\log\big(\frac{\mathrm{Var}_X(f)}{\varepsilon}\big)$

• first-order oracle: $N(\varepsilon)=n\log\big(\frac{\mathrm{Var}_X(f)}{\varepsilon}\big)$

• zero-order oracle: $N(\varepsilon)=n\log\big(\frac{\mathrm{Var}_X(f)}{\varepsilon}\big)$
2. The rate of convergence of the center of gravity method is exponentially fast; this is usually called a linear rate.

3. The center of gravity method is not necessarily polynomial and cannot be used as a computational tool, because finding the center of gravity at each step can be extremely difficult, even for polytopes.

4. We can instead use an ellipsoid as the localizer, so that it is easy to compute the center.
12.3 Ellipsoid Method (Shor, Nemirovsky, Yudin, 1970s)
Ellipsoid. Let $Q$ be a symmetric positive definite matrix and $c$ the center; an ellipsoid is uniquely characterized by $(c,Q)$:
$$E(c,Q)=\big\{x\in\mathbb{R}^n : (x-c)^TQ^{-1}(x-c)\le1\big\}=\big\{x=c+Q^{\frac12}u : u^Tu\le1\big\}$$
The volume of $E(c,Q)$ is
$$\mathrm{Vol}(E(c,Q))=\mathrm{Det}(Q^{\frac12})\,\mathrm{Vol}(B_n)$$
where $B_n$ is the $n$-dimensional unit Euclidean ball with $\mathrm{Vol}(B_n)=\frac{\pi^{n/2}}{\Gamma(\frac n2+1)}$.
Proposition 12.3 Let $E=E(c,Q)=\{x\in\mathbb{R}^n : (x-c)^TQ^{-1}(x-c)\le1\}$ be an ellipsoid and let $H_+=\{x : \omega^Tx\le\omega^Tc\}$ be a half-space with $\omega\ne0$ passing through the center $c$. Then the half-ellipsoid $E\cap H_+$ is contained in the ellipsoid $E_+=E(c_+,Q_+)$ with
$$c_+=c-\frac{1}{n+1}q,\qquad Q_+=\frac{n^2}{n^2-1}\Big(Q-\frac{2}{n+1}qq^T\Big)$$
where $q=\frac{Q\omega}{\sqrt{\omega^TQ\omega}}$ is the step from $c$ to the boundary of $E$. Moreover,
$$\mathrm{Vol}(E_+)\le\exp\Big(-\frac{1}{2n}\Big)\mathrm{Vol}(E)$$
The Ellipsoid Method. The algorithm works as follows:

• Initialize $E(c_0,Q_0)$ with $c_0=0$, $Q_0=R^2I$

• At iteration $t=1,2,\dots,T$, do

– Call the separation oracle with the input $c_{t-1}$

– If $c_{t-1}\notin X$, obtain a separator $\omega\ne0$

– If $c_{t-1}\in X$, call the first-order oracle and obtain a subgradient $\omega\in\partial f(c_{t-1})$

– Set the new ellipsoid $E(c_t,Q_t)$ with
$$c_t=c_{t-1}-\frac{1}{n+1}\frac{Q_{t-1}\omega}{\sqrt{\omega^TQ_{t-1}\omega}},\qquad Q_t=\frac{n^2}{n^2-1}\Big(Q_{t-1}-\frac{2}{n+1}\frac{Q_{t-1}\omega\omega^TQ_{t-1}}{\omega^TQ_{t-1}\omega}\Big)$$

• Output
$$x^T=\arg\min_{c\in\{c_1,\dots,c_T\}\cap X}f(c)$$
As an immediate result,
$$\mathrm{Vol}(E_t)\le\exp\Big(-\frac{t}{2n}\Big)\mathrm{Vol}(E_0)=\exp\Big(-\frac{t}{2n}\Big)R^n\,\mathrm{Vol}(B_n)$$
Theorem 12.4 The approximate solution generated by the ellipsoid method after $T$ steps, for $T>2n^2\log(\frac Rr)$, satisfies:
$$f(x^T)-f_*\le\frac Rr\,\mathrm{Var}_X(f)\exp\Big(-\frac{T}{2n^2}\Big)$$
Proof: Similar to the proof for the center of gravity method. Set $\delta\in\big(\frac Rr\exp(-\frac{T}{2n^2}),1\big)$. Then with
$$X_\delta=\{x^*+\delta(x-x^*) : x\in X\}$$
$$\mathrm{Vol}(X_\delta)=\delta^n\mathrm{Vol}(X)\ge\delta^nr^n\mathrm{Vol}(B_n)>R^n\exp\Big(-\frac{T}{2n}\Big)\mathrm{Vol}(B_n)\ge\mathrm{Vol}(E_T)$$
Hence $X_\delta\setminus E_T\ne\emptyset$. The rest of the proof follows similarly as in Theorem 12.2.
Remark: The oracle complexity of the ellipsoid method is $O(n^2\log(\frac1\varepsilon))$, which is only slightly worse than that of the center of gravity method.
Remark: The ellipsoid method works for any general convex problem as long as it admits separation and first-order oracles. Moreover, the algorithm is polynomial if it takes only polynomial time to call those oracles. For instance, for linear programs with $n$ variables and $m$ constraints, the separation oracle takes $O(nm)$ computation cost per call.
In summary, here are some advantages and disadvantages of Ellipsoid method:
+ : universal
+ : simple to implement and steady for small size problems
+ : low order dependence on the number of functional constraints
− : complexity grows quadratically with the size of the problem; inefficient for large-scale problems.
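Despite these caveats, the method is only a few lines of linear algebra. Below is a minimal numpy sketch (ours, not from the notes), specialized to $X=\{\|x\|_2\le R\}$ so that the separation oracle is trivial; the test problem $f(x)=\|x-x_0\|_2^2$ is illustrative.

```python
import numpy as np

def ellipsoid_method(f, subgrad, R, n, T):
    """Minimize a convex f over the ball X = {x : ||x||_2 <= R}."""
    c = np.zeros(n)                    # center c_0 = 0
    Q = (R ** 2) * np.eye(n)           # Q_0 = R^2 I
    best_val, best_x = np.inf, None
    for _ in range(T):
        if c @ c > R ** 2:
            w = c                      # separator: w^T c >= sup_{y in X} w^T y
        else:
            w = subgrad(c)
            if f(c) < best_val:        # track the best feasible center
                best_val, best_x = f(c), c.copy()
        Qw = Q @ w
        s = float(w @ Qw)
        if s <= 1e-16:                 # w ~ 0: current center is optimal
            break
        c = c - Qw / ((n + 1) * np.sqrt(s))
        Q = (n * n / (n * n - 1.0)) * (Q - (2.0 / (n + 1)) * np.outer(Qw, Qw) / s)
    return best_x, best_val

x0 = np.array([1.0, 2.0])
x, val = ellipsoid_method(f=lambda z: float(np.sum((z - x0) ** 2)),
                          subgrad=lambda z: 2.0 * (z - x0),
                          R=5.0, n=2, T=200)
print(x)   # close to x0 = (1, 2)
```

With $n=2$ and $T=200$, the volume bound $\exp(-T/2n^2)$ makes the remaining gap negligible.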
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 13: Conic Programming – March 06
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Generalized Inequality Constraints and Cones
• Conic Programs: LP, SOCP, SDP
References: Ben-Tal & Nemirovski. Lectures on Modern Convex Optimization, Chapter 1.4
13.1 Generalized Inequality
In order to extend linear constraints to nonlinear convex constraints, we may extend the standard component-wise inequality to a generalized vector inequality.
$$Ax\ge b\ \Rightarrow\ Ax\succeq b$$
where the partial ordering "$\succeq$" also satisfies:

(i) reflexivity: $a\succeq a$

(ii) anti-symmetry: $a\succeq b$ and $b\succeq a$ implies $a=b$

(iii) transitivity: $a\succeq b$ and $b\succeq c$ implies $a\succeq c$

(iv) homogeneity: $a\succeq b$ and $\lambda\in\mathbb{R}_+$ implies $\lambda a\succeq\lambda b$

(v) additivity: $a\succeq b$ and $c\succeq d$ implies $a+c\succeq b+d$
Remark 1: The generalized inequality on $\mathbb{R}^m$ is completely identified by the set $K=\{a\in\mathbb{R}^m : a\succeq0\}$ via the rule
$$a\succeq b\ \Leftrightarrow\ a-b\in K$$
This is because:

• $a\succeq b$ and $-b\succeq-b$ (by (i)) implies $a-b\succeq0$ (by (v))

• $a-b\succeq0$ and $b\succeq b$ (by (i)) implies $a\succeq b$ (by (v))
Remark 2: The set $K$ satisfies

1. $K$ is nonempty: $0\in K$

2. $K$ is closed w.r.t. addition: $a,b\in K\Rightarrow a+b\in K$

3. $K$ is closed w.r.t. nonnegative scaling: $a\in K,\ \lambda\in\mathbb{R}_+\Rightarrow\lambda a\in K$

4. $K$ is pointed: $a,-a\in K\Rightarrow a=0$

Equivalently, $K$ must be a nonempty pointed convex cone. Indeed, this condition is both necessary and sufficient for the set $K$ to define a partial ordering "$\succeq_K$" via the rule:
$$a\succeq_K b\ \Leftrightarrow\ a-b\in K$$
Remark 3: The standard inequality corresponds to $K=\mathbb{R}^m_+$: $a\ge b\Leftrightarrow a-b\in\mathbb{R}^m_+$. In addition, $K=\mathbb{R}^m_+$ is also closed and has non-empty interior.
13.2 Conic Programs
Definition 13.1 Let $K$ be a regular cone (i.e. convex, closed, pointed, and with a nonempty interior) and "$\succeq_K$" the induced inequality. The optimization problem
$$\min_x\ c^Tx\quad\text{s.t. }Ax\succeq_K b \tag{CP}$$
is called a conic program associated with the cone $K$.
Examples of Regular Cones

• Non-negative orthant: $\mathbb{R}^m_+=\{x\in\mathbb{R}^m : x_i\ge0,\ i=1,\dots,m\}$

• Lorentz cone (a.k.a. second-order / ice-cream cone):
$$L^m=\Big\{x\in\mathbb{R}^m : x_m\ge\sqrt{\textstyle\sum_{i=1}^{m-1}x_i^2}\Big\}$$

• Semidefinite cone: $S^m_+=\{A\in S^m : A\succeq0\}$, the set of $m\times m$ symmetric positive semidefinite matrices.
Typical Conic Programs

• Linear Program: $K=\mathbb{R}^m_+=\mathbb{R}_+\times\dots\times\mathbb{R}_+$ is a direct product of nonnegative orthants:
$$\min_x\big\{c^Tx : a_i^Tx-b_i\ge0,\ i=1,\dots,m\big\} \tag{LP}$$
where the linear map $Ax-b:=[a_1^Tx-b_1;\dots;a_m^Tx-b_m]$

• Conic Quadratic Program (a.k.a. Second-Order Cone Program): $K=L^{m_1}\times\dots\times L^{m_k}$
$$\min_x\big\{c^Tx : \|D_ix-d_i\|_2\le e_i^Tx-f_i,\ i=1,\dots,k\big\} \tag{SOCP}$$
where the linear map
$$Ax-b:=\big[\underbrace{D_1x-d_1;\ e_1^Tx-f_1}_{\mathbb{R}^{m_1}};\ \dots;\ \underbrace{D_kx-d_k;\ e_k^Tx-f_k}_{\mathbb{R}^{m_k}}\big]$$

• Semidefinite Program:
$$\min_x\Big\{c^Tx : \sum_{i=1}^n x_iA_i-B\succeq0\Big\} \tag{SDP}$$
where the linear map $Ax-b=\sum_{i=1}^n x_iA_i-B : \mathbb{R}^n\to S^m$, with $A_1,\dots,A_n,B\in S^m$.
Remark: (LP) $\subseteq$ (SOCP). This is because $a_i^Tx-b_i\ge0\Leftrightarrow\begin{bmatrix}0\\ a_i^Tx-b_i\end{bmatrix}\in L^2$.

Remark: (SOCP) $\subseteq$ (SDP). This is because
$$x=\begin{bmatrix}x_1\\\vdots\\x_m\end{bmatrix}\in L^m\ \Leftrightarrow\ \mathcal A(x):=\begin{bmatrix}x_m & x_1 & x_2 & \cdots & x_{m-1}\\ x_1 & x_m & & &\\ x_2 & & x_m & &\\ \vdots & & & \ddots &\\ x_{m-1} & & & & x_m\end{bmatrix}\succeq0$$
Lemma 13.2 (Schur Complement) Let $S=\begin{bmatrix}P & Q^T\\ Q & R\end{bmatrix}$ be symmetric with $R\succ0$. Then
$$S\succeq0\quad\text{if and only if}\quad P-Q^TR^{-1}Q\succeq0$$
Proof:
$$S\succeq0\ \Leftrightarrow\ \forall u,v:\ \begin{bmatrix}u^T,\,v^T\end{bmatrix}\begin{bmatrix}P & Q^T\\ Q & R\end{bmatrix}\begin{bmatrix}u\\ v\end{bmatrix}\ge0\ \Leftrightarrow\ \forall u:\ \inf_v\big\{u^TPu+2v^TQu+v^TRv\big\}\ge0\ \Leftrightarrow\ \forall u:\ u^T(P-Q^TR^{-1}Q)u\ge0\ \Leftrightarrow\ P-Q^TR^{-1}Q\succeq0$$
(The infimum over $v$ is attained at $v^*=-R^{-1}Qu$.)
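A numeric sanity check of the lemma (an added illustration; the matrices below are arbitrary small examples):

```python
import numpy as np

def is_psd(M, tol=1e-9):
    # symmetrize to guard against round-off before taking eigenvalues
    return float(np.min(np.linalg.eigvalsh((M + M.T) / 2.0))) >= -tol

def schur_psd(P, Q, R):
    return is_psd(P - Q.T @ np.linalg.inv(R) @ Q)

def block(P, Q, R):
    return np.block([[P, Q.T], [Q, R]])

I = np.eye(2)
# P = 2I, Q = I, R = I: Schur complement is I >= 0, and S is PSD
psd_case = (is_psd(block(2 * I, I, I)), schur_psd(2 * I, I, I))
# P = I/2, Q = I, R = I: Schur complement is -I/2, and S is indefinite
bad_case = (is_psd(block(0.5 * I, I, I)), schur_psd(0.5 * I, I, I))
print(psd_case, bad_case)   # (True, True) (False, False)
```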
To show (SOCP) $\subseteq$ (SDP), let $\mathcal A(x)$ denote the arrow matrix above:

1. If $x\in L^m$ and $x\ne0$, then $x_m>0$ and
$$x_m\ \ge\ \frac{x_1^2+\dots+x_{m-1}^2}{x_m}$$
so the Schur complement lemma implies $\mathcal A(x)\succeq0$. If $x\in L^m$ and $x=0$, then $\mathcal A(x)=0\in S^m_+$.

2. If $\mathcal A(x)\succeq0$ and $\mathcal A(x)\ne0$, then $x_m>0$, and by the Schur complement lemma,
$$x_m-\frac{x_1^2+\dots+x_{m-1}^2}{x_m}\ \ge\ 0\ \Rightarrow\ x\in L^m$$
If $\mathcal A(x)=0$, then $x=0\in L^m$.
13.3 Examples

1. $L_2$-norm minimization

(a) $\min_{x\in\mathbb{R}^n}\|x\|_2$. Note that
$$\min_{x\in\mathbb{R}^n}\|x\|_2\ \Longleftrightarrow\ \min_{x,t}\big\{t : t\ge\|x\|_2\big\}\ \Longleftrightarrow\ \min_{x,t}\Big\{t : \begin{bmatrix}x\\ t\end{bmatrix}\succeq_{L^{n+1}}0\Big\}$$

(b) $\min_{x\in\mathbb{R}^n}x^Tx$. Note that
$$\min_{x\in\mathbb{R}^n}x^Tx\ \Longleftrightarrow\ \min_{x,t}\big\{t : t\ge x^Tx\big\}\ \Longleftrightarrow\ \min_{x,t}\Big\{t : \begin{bmatrix}2x\\ t-1\\ t+1\end{bmatrix}\succeq_{L^{n+2}}0\Big\}$$

2. Quadratic problems: $\min_{x\in\mathbb{R}^n}x^TQx+q^Tx$ where $Q=LL^T\succeq0$. Note that
$$\min_{x\in\mathbb{R}^n}x^TQx+q^Tx\ \Longleftrightarrow\ \min_{x,t}\big\{t : t\ge x^TQx+q^Tx\big\}\ \Longleftrightarrow\ \min_{x,t}\Big\{t : \begin{bmatrix}2L^Tx\\ t-q^Tx-1\\ t-q^Tx+1\end{bmatrix}\succeq_{L^{n+2}}0\Big\}$$
3. Spectral norm minimization. Spectral norm: $\|A\|_2=\lambda_{\max}(A)$ for symmetric $A\succeq0$. Note that
$$\min_{x\in\mathbb{R}^n}\Big\|\sum_{i=1}^n x_iA_i\Big\|_2\ \Longleftrightarrow\ \min_{x,t}\Big\{t : t\ge\lambda_{\max}\Big(\sum_{i=1}^n x_iA_i\Big)\Big\}\ \Longleftrightarrow\ \min_{x,t}\Big\{t : tI_m-\sum_{i=1}^n x_iA_i\succeq0\Big\}$$
where $A_1,\dots,A_n\in S^m$.
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 14: Conic Duality – March 08
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Dual Cone
• Conic Duality Theorem
References: Ben-Tal & Nemirovski. Lectures on Modern Convex Optimization, Chapter 1.4
14.1 Motivation
Recall the LP duality:
$$\text{(P)}\quad\min\big\{c^Tx : Ax=b,\ x\ge0\big\}\qquad\qquad\text{(D)}\quad\max\big\{b^Ty : A^Ty\le c\big\}$$
$$\text{(LP)}\quad\min\big\{c^Tx : Ax\ge b\big\}\qquad\qquad\text{(LD)}\quad\max\big\{b^Ty : A^Ty=c,\ y\ge0\big\}$$
Now consider the conic program
$$\text{(CP)}\quad\min\big\{c^Tx : Ax\succeq_K b\big\}\qquad\qquad\text{(CD)}\quad ?$$
Observe that
$$Ax\ge b\ \Rightarrow\ \forall y\ge0:\ y^T(Ax)\ge y^Tb\ \Rightarrow\ \forall y\ge0\text{ with }A^Ty=c:\ c^Tx\ge b^Ty\ \Rightarrow\ c^Tx\ge\max\big\{b^Ty : y\ge0,\ A^Ty=c\big\}$$
Now, for which $y$ does $Ax\succeq_K b$ imply $y^T(Ax)\ge y^Tb$?
14.2 Dual Cone
Definition 14.1 Let $K$ be a nonempty cone. The set
$$K_*=\big\{y : y^Tx\ge0,\ \forall x\in K\big\}$$
is called the dual cone to $K$. Note that $K_*$ is always a closed cone.
Proposition 14.2 Let $K$ be a closed convex cone and $K_*$ be its dual cone. Then:
1. (K∗)∗ = K
2. K is pointed iff K∗ has non-empty interior
3. K is a regular cone iff K∗ is a regular cone
Proof: Self-exercise.
Examples (we call these self-dual cones):

• $(\mathbb{R}^m_+)_*=\mathbb{R}^m_+$

• $(L^m)_*=L^m$

• $(S^m_+)_*=S^m_+$
14.3 Conic Duality
The dual of the conic program
$$\text{(CP)}\quad\min_x\big\{c^Tx : Ax\succeq_K b\big\}$$
is given by
$$\text{(CD)}\quad\max_y\big\{b^Ty : A^Ty=c,\ y\succeq_{K_*}0\big\}$$
Theorem 14.3 (Weak Conic Duality) The optimal value of (CD) is a lower bound of the optimalvalue of (CP ).
Theorem 14.4 (Strong Conic Duality) If the primal (CP) is bounded below and strictly feasible, i.e. $\exists x_0$ s.t. $Ax_0\succ_K b$, then the dual (CD) is solvable and the optimal values equal each other.
Proof: Let $p^*$ be the optimal value of the primal (CP). It is sufficient to show that there exists $y^*$ feasible for (CD) such that $b^Ty^*\ge p^*$.

If $c=0$, then $p^*=0$ and $y^*=0$ works: $y^*\succeq_{K_*}0$, $A^Ty^*=c$ and $b^Ty^*=p^*$.

Now consider $c\ne0$. Let the set $M=\big\{Ax-b : c^Tx\le p^*\big\}$. Note that $M$ is a nonempty set.

Claim 1: $M\cap\mathrm{int}(K)=\emptyset$. This is because: suppose $\exists\hat x$ s.t. $A\hat x-b\in\mathrm{int}(K)$ and $c^T\hat x\le p^*$. Then every point in a small enough neighborhood of $\hat x$ is still feasible, and since $c\ne0$, this neighborhood contains a point $\tilde x$ such that $c^T\tilde x<c^T\hat x\le p^*$. This contradicts the fact that $p^*$ is the optimal value.

By the Separation Theorem, $\exists y\ne0$ s.t.
$$\sup_{z\in M}y^Tz\ \le\ \inf_{z\in\mathrm{int}(K)}y^Tz$$
Note that $\inf_{z\in\mathrm{int}(K)}y^Tz=0$, since $\mathrm{int}(K)$ contains the ray $\{tz : t>0\}$ for every $z\in\mathrm{int}(K)$ (so the infimum is either $0$ or $-\infty$, and $-\infty$ is excluded because the infimum is bounded below by the supremum over the nonempty set $M$). Hence, we have $y\in K_*$ and
$$\sup_{x:\,c^Tx\le p^*}y^T(Ax-b)\ \le\ 0 \tag{14.1}$$
Therefore, it must hold that $A^Ty=\lambda c$ for some $\lambda\ge0$; otherwise, the supremum goes to infinity.

Claim 2: $\exists\lambda>0$ s.t. $A^Ty=\lambda c$. This is because: suppose $\lambda=0$, i.e. $A^Ty=0$; then (14.1) implies $-b^Ty\le0$. Since (CP) is strictly feasible, $\exists x_0$ s.t. $Ax_0-b\in\mathrm{int}(K)$. We already showed $y\in K_*$ and $y\ne0$, so $y^T(Ax_0-b)>0$, which together with $A^Ty=0$ gives $-y^Tb>0$, contradicting $-b^Ty\le0$!

Now let $y^*=\frac y\lambda$. Then $y^*\in K_*$, $A^Ty^*=c$, and (14.1) gives $c^Tx-(y^*)^Tb\le0$ for all $x$ with $c^Tx\le p^*$; taking the supremum $c^Tx\to p^*$ yields $p^*-b^Ty^*\le0$. Therefore, $y^*$ is dual feasible and $b^Ty^*\ge p^*$.
Corollary 14.5 If the dual (CD) is bounded above and strictly feasible, i.e. $\exists\bar y$ s.t. $\bar y\succ_{K_*}0$ and $A^T\bar y=c$, then the primal (CP) is solvable and the optimal values equal each other.

Corollary 14.6 If both (CP) and (CD) are strictly feasible, then both are solvable with equal optimal values.
Theorem 14.7 (Optimality Conditions) Suppose at least one of (CP) and (CD) is bounded and strictly feasible. Then a feasible primal-dual pair $(x^*,y^*)$ is a pair of optimal primal-dual solutions

1. if and only if $c^Tx^*-b^Ty^*=0$ (zero duality gap); equivalently,

2. if and only if $(Ax^*-b)^Ty^*=0$ (complementary slackness).

Proof: Let $p^*$ and $d^*$ be the optimal values of (CP) and (CD). Then
$$c^Tx^*-b^Ty^*=(c^Tx^*-p^*)+(d^*-b^Ty^*)+(p^*-d^*)$$
All three terms on the RHS are $\ge0$, and $c^Tx^*-b^Ty^*=0$ iff $c^Tx^*=p^*$, $b^Ty^*=d^*$ and $p^*=d^*$. Moreover, using $A^Ty^*=c$,
$$(Ax^*-b)^Ty^*=(x^*)^TA^Ty^*-b^Ty^*=c^Tx^*-b^Ty^*$$
so conditions 1 and 2 are equivalent.
Remark: In the case of LP, strict feasibility is not required for strong duality, nor for a program to be solvable. However, in the generic case of CP, this is no longer true.
Example 1: A conic problem can be strictly feasible and bounded, but NOT solvable.
$$\min_{x_1,x_2}\Big\{x_1 : \begin{bmatrix}x_1-x_2\\ 1\\ x_1+x_2\end{bmatrix}\succeq_{L^3}0\Big\}\ \Longleftrightarrow\ \min_{x_1,x_2}\big\{x_1 : 4x_1x_2\ge1,\ x_1+x_2>0\big\}$$
Example 2: A conic problem can be not strictly feasible yet solvable, while its dual is infeasible.
$$\min_{x_1,x_2}\Big\{x_2 : \begin{bmatrix}x_1\\ x_2\\ x_1\end{bmatrix}\succeq_{L^3}0\Big\}\ \Longleftrightarrow\ \min_{x_1,x_2}\big\{x_2 : x_2=0,\ x_1\ge0\big\}$$
with dual
$$\max_\lambda\Big\{0 : \begin{bmatrix}\lambda_1+\lambda_3\\ \lambda_2\end{bmatrix}=\begin{bmatrix}0\\ 1\end{bmatrix},\ \lambda\succeq_{L^3}0\Big\}$$
which is infeasible.
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 15: Applications of Conic Duality – March 13
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• SOCP Duality
• SDP Duality
• SDP Relaxations for Non-Convex Quadratic Programs
References: Ben-Tal & Nemirovski. Lectures on Modern Convex Optimization, Chapter 2.1, 3.1
15.1 Recall: Conic Duality
Let K be a regular cone, and K∗ =y : yTx ≥ 0,∀x ∈ K
be its dual cone
p∗ = minx
cTx
s.t. Ax ≥K b (CP)
d∗ = maxy
bT y
s.t. AT y = c (CD)
y ≥K∗ 0
• Weak Duality: p∗ ≥ d∗
• Strong Duality: If one of (CP ), (CD) is bounded and strictly feasible, then p∗ = d∗.
• Optimality Condition: (x∗, y∗) is optimal iff
(i) primal feasibility: Ax∗ ≥K b
(ii) dual feasibility: AT y∗ = c, y∗ ≥K∗ 0
(iii) zero-duality gap: cTx∗ − bT y∗ = 0
15.2 SOCP Duality
Second-order Cone Program (SOCP): when K is the Cartesian product of Lorentz cones,namely, K = Ln1 × ...× Lnm .The general form of a second-order cone program is
minx
cTx
s.t. ‖ Aix− bi ‖2≤ dTi x− ei, i = 1, ...,m (SOCP–P)
where c ∈ Rn, Ai ∈ R(ni−1)×n,Ri ∈ R(ni−1)×1, ei ∈ R.
The conic form is
minx
cTx
s.t. Aix ≥Lni bi, i = 1, ...,m
where A1 =
[AidTi
], bi =
[biei
]
Proposition 15.1 $L^n$ is self-dual, i.e. $(L^n)_*=L^n$.

Proof:

(i) $L^n\subseteq(L^n)_*$: Suppose $y\in L^n$. We show that $\forall x\in L^n$,
$$y^Tx=y_1x_1+\dots+y_{n-1}x_{n-1}+y_nx_n\ \ge\ -\sqrt{\sum_{i=1}^{n-1}y_i^2}\sqrt{\sum_{i=1}^{n-1}x_i^2}+y_nx_n\ \ge\ 0$$
where the first inequality is due to the Cauchy–Schwarz inequality and the second inequality is because $y_n\ge\sqrt{\sum_{i=1}^{n-1}y_i^2}$ and $x_n\ge\sqrt{\sum_{i=1}^{n-1}x_i^2}$.

(ii) $(L^n)_*\subseteq L^n$: Suppose $y\in(L^n)_*$, i.e. $y^Tx\ge0$, $\forall x\in L^n$.

If $(y_1,\dots,y_{n-1})=0$, setting $x=[0,\dots,0,1]\in L^n$ gives $y^Tx=y_n\ge0$, so $y\in L^n$.

If $(y_1,\dots,y_{n-1})\ne0$, setting $x=[-y_1,\dots,-y_{n-1},\sqrt{\sum_{i=1}^{n-1}y_i^2}]\in L^n$ gives
$$y^Tx=-\sum_{i=1}^{n-1}y_i^2+y_n\sqrt{\sum_{i=1}^{n-1}y_i^2}\ \ge\ 0\ \Rightarrow\ y_n\ge\sqrt{\sum_{i=1}^{n-1}y_i^2},\ \text{i.e. }y\in L^n$$
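A quick Monte-Carlo sanity check of both inclusions for $n=3$ (an added illustration, not part of the notes):

```python
import math, random

random.seed(1)

def in_lorentz(v):
    return v[-1] >= math.sqrt(sum(t * t for t in v[:-1]))

def sample_l3():
    # random point of L^3: pick (u1, u2), then a large enough last coordinate
    u1, u2 = random.uniform(-1, 1), random.uniform(-1, 1)
    return (u1, u2, math.hypot(u1, u2) + random.uniform(0, 1))

def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

# (i) L^3 subset of (L^3)*: any two points of L^3 have nonnegative inner product
ok = all(dot(sample_l3(), sample_l3()) >= 0 for _ in range(1000))
print(ok)   # True

# (ii) a point outside L^3, e.g. y = (1, 0, 0.5), is exposed by the witness
# x = (-y1, -y2, ||(y1, y2)||) from the proof above
y = (1.0, 0.0, 0.5)
x = (-y[0], -y[1], math.hypot(y[0], y[1]))
print(in_lorentz(x), dot(x, y))   # True -0.5
```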
Hence, the dual of the SOCP is
$$\max_{\lambda\in\mathbb{R}^m,\ u_i\in\mathbb{R}^{n_i-1},\,i=1,\dots,m}\ \sum_{i=1}^m b_i^Tu_i+e^T\lambda\quad\text{s.t. }\sum_{i=1}^m(A_i^Tu_i+d_i\lambda_i)=c,\qquad\|u_i\|_2\le\lambda_i,\ i=1,\dots,m \tag{SOCP–D}$$
Remark: The same dual can be derived from Lagrange duality. The Lagrange function is
$$L(x,\lambda)=c^Tx+\sum_{i=1}^m\lambda_i\big(\|A_ix-b_i\|_2-(d_i^Tx-e_i)\big)$$
The Lagrange dual is given by
$$\max_{\lambda\ge0}\min_x L(x,\lambda)=\max_{\lambda\ge0}\min_x\max_{\|u_i\|_2\le\lambda_i,\,i=1,\dots,m}\Big\{c^Tx-\sum_{i=1}^m u_i^T(A_ix-b_i)-\sum_{i=1}^m\lambda_i(d_i^Tx-e_i)\Big\}$$
$$=\max_{\lambda\ge0,\ \|u_i\|_2\le\lambda_i}\min_x\Big\{\Big(c-\sum_{i=1}^m(A_i^Tu_i+d_i\lambda_i)\Big)^Tx+\sum_{i=1}^m b_i^Tu_i+e^T\lambda\Big\}$$
$$=\max_{\lambda,u_1,\dots,u_m}\Big\{\sum_{i=1}^m b_i^Tu_i+e^T\lambda\ :\ \sum_{i=1}^m(A_i^Tu_i+d_i\lambda_i)=c,\quad\|u_i\|_2\le\lambda_i,\ i=1,\dots,m\Big\}$$
which is the same as the dual derived from conic duality.
15.3 SDP Duality
Semidefinite Program (SDP): the case when $K$ is the positive semidefinite cone. The general form of a semidefinite program is
$$\min\ c^Tx\quad\text{s.t. }Ax-B=\sum_{i=1}^n x_iA_i-B\succeq0 \tag{SDP–P}$$
A constraint of the type $x_1A_1+\dots+x_nA_n-B\succeq0$ is called a Linear Matrix Inequality.
Proposition 15.2 $S^n_+$ is self-dual, i.e. $(S^n_+)_*=S^n_+$
Proof:

(i) $S^n_+\subseteq(S^n_+)_*$: Suppose $Y\succeq0$, so $Y=U\Lambda U^T=\sum_{i=1}^n\lambda_iu_iu_i^T$ with $\lambda_i\ge0$, $i=1,\dots,n$. Then $\forall X\succeq0$,
$$\langle X,Y\rangle=\mathrm{tr}(X^TY)=\mathrm{tr}(XY)=\mathrm{tr}\Big(\sum_{i=1}^n\lambda_iu_iu_i^TX\Big)=\sum_{i=1}^n\lambda_i\,\mathrm{tr}(u_i^TXu_i)\ \ge\ 0$$
Hence, $Y\in(S^n_+)_*$.

(ii) $(S^n_+)_*\subseteq S^n_+$: Suppose $Y\in(S^n_+)_*$, i.e. $\mathrm{tr}(XY)\ge0$, $\forall X\in S^n_+$. For any $x\in\mathbb{R}^n$, let $X=xx^T$; we have $\mathrm{tr}(XY)=\mathrm{tr}(xx^TY)=x^TYx\ge0$. Hence, $Y\in S^n_+$.
The dual of the SDP is:
$$\max_Y\ \mathrm{tr}(BY)\quad\text{s.t. }\mathrm{tr}(A_iY)=c_i,\ i=1,\dots,n,\qquad Y\succeq0 \tag{SDP–D}$$
Remark: Based on the conic duality theorem, $(x^*,Y^*)$ is an optimal primal-dual pair iff

1. $\sum_{i=1}^n x_i^*A_i\succeq B$ (primal feasibility)

2. $Y^*\succeq0$, $\mathrm{tr}(A_iY^*)=c_i$, $i=1,\dots,n$ (dual feasibility)

3. $\mathrm{tr}\big(Y^*(\sum_{i=1}^n x_i^*A_i-B)\big)=0$, i.e. $Y^*\big(\sum_{i=1}^n x_i^*A_i-B\big)=0$ (complementary slackness)
Example: Use SDP duality to show that for any symmetric $B\in S^n$:
$$\lambda_{\max}(B)=\max_{x\in\mathbb{R}^n}\big\{x^TBx : \|x\|_2=1\big\}$$
Note that the right hand side can be rewritten as
$$\max_x\ \mathrm{tr}(Bxx^T)\quad\text{s.t. }\mathrm{tr}(xx^T)=1$$
We show this is equivalent to solving the SDP relaxation
$$\max_X\ \mathrm{tr}(BX)\quad\text{s.t. }\mathrm{tr}(X)=1,\ X\succeq0 \tag{P}$$
Let $\underline p=\max_x\{\mathrm{tr}(Bxx^T) : \mathrm{tr}(xx^T)=1\}$ and $\bar p=\max_X\{\mathrm{tr}(BX) : \mathrm{tr}(X)=1,\ X\succeq0\}$. Since the latter is a relaxation, $\bar p\ge\underline p$. On the other hand, any $X\succeq0$ with $\mathrm{tr}(X)=1$ can be written as
$$X=\sum_{i=1}^n\lambda_ix_ix_i^T,\quad\text{with }\sum_{i=1}^n\lambda_i=1,\ \lambda_i\ge0,\ i=1,\dots,n,\ \text{and }\|x_i\|_2=1,\ \text{i.e. }\mathrm{tr}(x_ix_i^T)=1$$
so
$$\mathrm{tr}(BX)=\mathrm{tr}\Big(\sum_{i=1}^n\lambda_ix_ix_i^TB\Big)=\sum_{i=1}^n\lambda_i\,\mathrm{tr}(Bx_ix_i^T)\le\max_{i=1,\dots,n}\mathrm{tr}(Bx_ix_i^T)\le\underline p$$
Hence $\bar p\le\underline p$. Therefore, $\bar p=\underline p$.

By SDP duality, the dual of (P) is
$$\min_\lambda\ \lambda\quad\text{s.t. }\lambda I-B\succeq0 \tag{D}$$
Note that $\lambda I-B\succeq0$ is equivalent to $\lambda\ge\lambda_{\max}(B)$. Since (P) is strictly feasible ($X=n^{-1}I$ satisfies $\mathrm{tr}(X)=1$ and $X\succ0$) and bounded, strong duality holds. Therefore,
$$\max_x\big\{x^TBx : \|x\|_2=1\big\}=\mathrm{Opt}(P)=\mathrm{Opt}(D)=\lambda_{\max}(B)$$
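A numeric check of the identity (an added illustration with a random symmetric matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
B = (A + A.T) / 2.0                     # random symmetric test matrix

lam, V = np.linalg.eigh(B)              # eigenvalues in ascending order
lmax, vmax = float(lam[-1]), V[:, -1]

# the top (unit-norm) eigenvector attains lambda_max
attained = abs(float(vmax @ B @ vmax) - lmax) < 1e-10

# no random unit vector exceeds lambda_max
never_beaten = True
for _ in range(1000):
    z = rng.standard_normal(4)
    z /= np.linalg.norm(z)
    never_beaten &= float(z @ B @ z) <= lmax + 1e-10

print(attained, never_beaten)   # True True
```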
15.4 SDP Relaxations of Non-convex Quadratic Programming
Consider the quadratically constrained quadratic program:
$$\mathrm{Opt}=\min\ x^TQ_0x+2q_0^Tx+c_0\quad\text{s.t. }x^TQ_ix+2q_i^Tx+c_i\le0,\ 1\le i\le m \tag{QCQP}$$
Let
$$X=\begin{bmatrix}xx^T & x\\ x^T & 1\end{bmatrix}=\begin{bmatrix}x\\ 1\end{bmatrix}\begin{bmatrix}x\\ 1\end{bmatrix}^T\quad\text{and}\quad A_i=\begin{bmatrix}Q_i & q_i\\ q_i^T & c_i\end{bmatrix},\ i=0,1,\dots,m$$
We can write (QCQP) as
$$\min_{x,X}\ \mathrm{tr}(A_0X)\quad\text{s.t. }\mathrm{tr}(A_iX)\le0,\ i=1,\dots,m,\qquad X=\begin{bmatrix}x\\ 1\end{bmatrix}\begin{bmatrix}x\\ 1\end{bmatrix}^T$$
SDP Relaxation. Dropping the rank-one structure of $X$, the following SDP is a relaxation of (QCQP):
$$\mathrm{Opt}_{\mathrm{SDP}}=\min_X\ \mathrm{tr}(A_0X)\quad\text{s.t. }\mathrm{tr}(A_iX)\le0,\ i=1,\dots,m,\quad X\succeq0,\quad X_{n+1,n+1}=1 \tag{SDP-relaxation}$$
Therefore, $\mathrm{Opt}_{\mathrm{SDP}}\le\mathrm{Opt}$.
Lagrange Relaxation. Another relaxation of (QCQP) is based on Lagrange duality. The Lagrange dual is given by
$$\max_{\lambda\ge0}\inf_x L(x,\lambda)=\max_{\lambda\ge0}\inf_x\Big\{x^T\Big(Q_0+\sum_{i=1}^m\lambda_iQ_i\Big)x+2\Big(q_0+\sum_{i=1}^m\lambda_iq_i\Big)^Tx+\Big(c_0+\sum_{i=1}^m\lambda_ic_i\Big)\Big\}$$
$$=\max_{\lambda\ge0,\,t}\Big\{t : x^T\Big(Q_0+\sum_{i=1}^m\lambda_iQ_i\Big)x+2\Big(q_0+\sum_{i=1}^m\lambda_iq_i\Big)^Tx+\Big(c_0+\sum_{i=1}^m\lambda_ic_i\Big)\ge t,\ \forall x\Big\}$$
Note that an (inhomogeneous) quadratic function is nonnegative for all $x$ iff the corresponding homogenized matrix is positive semidefinite. Hence, the Lagrange relaxation is
$$\mathrm{Opt}_{\mathrm{Lagrange}}=\max_{\lambda\ge0,\,t}\ t\quad\text{s.t. }\begin{bmatrix}Q_0+\sum_{i=1}^m\lambda_iQ_i & q_0+\sum_{i=1}^m\lambda_iq_i\\ \big(q_0+\sum_{i=1}^m\lambda_iq_i\big)^T & \sum_{i=1}^m\lambda_ic_i+c_0-t\end{bmatrix}\succeq0 \tag{Lagrange relaxation}$$
By weak duality, we know that $\mathrm{Opt}_{\mathrm{Lagrange}}\le\mathrm{Opt}$.

Indeed, one can show that the Lagrange relaxation is the SDP dual of the SDP relaxation:

• If either of them is strictly feasible, then $\mathrm{Opt}_{\mathrm{Lagrange}}=\mathrm{Opt}_{\mathrm{SDP}}$.

• In general, we always have $\mathrm{Opt}_{\mathrm{Lagrange}}\le\mathrm{Opt}_{\mathrm{SDP}}\le\mathrm{Opt}$.

Therefore, the SDP relaxation is always at least as tight as the Lagrange relaxation.
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 16: Applications in Robust Optimization – March 27
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Robust Linear Program
• Example: Robust Portfolio Selection
• Example: Robust Classification
16.1 Robust Linear Program
Consider the linear program
$$\min\big\{c^Tx : Ax\le b\big\} \tag{LP}$$
Assume that the data $(c,A,b)$ of the program are not known exactly and vary in a given uncertainty set $U$. The goal is to find a robust solution that is

• feasible for all instances, i.e. $Ax\le b$, $\forall(c,A,b)\in U$

• optimal for the worst-case objective, i.e. $\sup\{c^Tx : (c,A,b)\in U\}$

We call the following program the robust counterpart of (LP):
$$\min_x\Big\{\sup_{(c,A,b)\in U}c^Tx\ :\ Ax\le b,\ \forall(c,A,b)\in U\Big\} \tag{RC}$$
or equivalently,
$$\min_{x,t}\big\{t : c^Tx\le t\ \text{and}\ Ax\le b,\ \forall(c,A,b)\in U\big\}$$
Note that the robust counterpart is a semi-infinite convex optimization program with infinitely many linear inequality constraints. The structure of (RC) depends on the geometry of the uncertainty set $U$.
Without loss of generality, let’s consider the robust counterpart in the simple form:
$$\min\ c^Tx\quad\text{s.t. }a_i^Tx\le b_i,\ \forall a_i\in U_i,\ i=1,\dots,m \tag{RC}$$
where Ui ⊆ Rn is the uncertainty set of ai.
Polyhedral Uncertainty
Consider the situation where the uncertainty set is a polyhedron:
$$U_i=\{a_i : D_ia_i\le d_i\},\quad i=1,\dots,m$$
The (RC) becomes
$$\min\ c^Tx\quad\text{s.t. }\max_{a_i\in U_i}a_i^Tx\le b_i,\ i=1,\dots,m$$
By LP duality, we know that
$$\max_{a_i}\big\{a_i^Tx : D_ia_i\le d_i\big\}\ =\ \min_{p_i}\big\{d_i^Tp_i : D_i^Tp_i=x,\ p_i\ge0\big\}$$
Hence, (RC) is equivalent to
$$\min_{x,p_1,\dots,p_m}\ c^Tx\quad\text{s.t. }d_i^Tp_i\le b_i,\quad D_i^Tp_i=x,\quad p_i\ge0,\quad i=1,\dots,m$$
which is a linear program.
Ellipsoidal Uncertainty
Consider the situation when the uncertainty set is an ellipsoid
$$U_i=\big\{\bar a_i+P_iu : \|u\|_2\le1\big\},\quad i=1,\dots,m$$
where $\bar a_i\in\mathbb{R}^n$, $P_i\in\mathbb{R}^{n\times n}$.

The (RC) becomes
$$\min_x\ c^Tx\quad\text{s.t. }\max_{a_i\in U_i}a_i^Tx\le b_i,\ i=1,\dots,m$$
Note that
$$\max_{a_i\in U_i}a_i^Tx=\max_{\|u\|_2\le1}\big\{\bar a_i^Tx+u^TP_i^Tx\big\}=\bar a_i^Tx+\|P_i^Tx\|_2$$
Hence, (RC) is equivalent to
$$\min_x\ c^Tx\quad\text{s.t. }\|P_i^Tx\|_2\le b_i-\bar a_i^Tx,\ i=1,\dots,m$$
which is an SOCP.
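The worst-case computation above is easy to verify numerically (an added illustration with random data):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.standard_normal(3)              # nominal constraint vector
P = rng.standard_normal((3, 3))         # ellipsoid shape matrix
x = rng.standard_normal(3)

closed_form = float(a @ x + np.linalg.norm(P.T @ x))

# sampled perturbations u in the unit ball never exceed the closed form ...
worst_sampled = -np.inf
for _ in range(2000):
    u = rng.standard_normal(3)
    u *= rng.uniform(0, 1) ** (1.0 / 3.0) / np.linalg.norm(u)   # uniform in ball
    worst_sampled = max(worst_sampled, float((a + P @ u) @ x))

# ... and the analytic maximizer u* = P^T x / ||P^T x|| attains it
u_star = P.T @ x / np.linalg.norm(P.T @ x)
attained = abs(float((a + P @ u_star) @ x) - closed_form) < 1e-9

print(worst_sampled <= closed_form + 1e-9, attained)   # True True
```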
16.2 Examples
16.2.1 Example I: Robust Portfolio Selection
Consider a classical portfolio optimization problem with $n$ assets. Let $x=(x_1,\dots,x_n)$ be the portfolio vector.

1. Assume that the returns $r_i$, $i=1,\dots,n$ are exactly known. Maximizing the return leads to the optimization problem
$$\max_x\Big\{r^Tx : \sum_{i=1}^n x_i=1,\ x\ge0\Big\}\ \Longleftrightarrow\ \max_{x,\gamma}\Big\{\gamma : r^Tx\ge\gamma,\ \sum_{i=1}^n x_i=1,\ x\ge0\Big\}$$
which leads to a highly non-robust solution.

2. Assume the returns are known to lie within an ellipsoid:
$$U=\big\{\bar r+\rho\Sigma^{1/2}u : \|u\|_2\le1\big\}$$
where $\bar r$ and $\Sigma$ are the empirical mean and covariance matrix.

The robust portfolio problem is
$$\max_x\ \min_{r\in U}r^Tx,\quad\text{s.t. }x\ge0,\ \sum_{i=1}^n x_i=1$$
which is equivalent to the SOCP:
$$\max_x\ \bar r^Tx-\rho\|\Sigma^{1/2}x\|_2,\quad\text{s.t. }x\ge0,\ \mathbf{1}^Tx=1$$
This can be interpreted as a return-risk tradeoff.
3. Assume that the returns are jointly Gaussian random variables with mean $\bar r$ and covariance $\Sigma$. Consider the chance-constrained program:
$$\max_{x,\gamma}\ \gamma\quad\text{s.t. }P(r^Tx\ge\gamma)\ge1-\varepsilon,\quad\sum_{i=1}^n x_i=1,\quad x\ge0$$
Note that
$$P(r^Tx\ge\gamma)\ge1-\varepsilon\ \Leftrightarrow\ P\Big(Z\ge\frac{\gamma-\bar r^Tx}{\sqrt{x^T\Sigma x}}\Big)\ge1-\varepsilon,\ \text{where }Z\sim N(0,1)$$
$$\Leftrightarrow\ P\Big(Z\le\frac{\bar r^Tx-\gamma}{\sqrt{x^T\Sigma x}}\Big)\ge1-\varepsilon\ \Leftrightarrow\ \frac{\bar r^Tx-\gamma}{\sqrt{x^T\Sigma x}}\ge\Phi^{-1}(1-\varepsilon)\ \Leftrightarrow\ \bar r^Tx-\Phi^{-1}(1-\varepsilon)\|\Sigma^{1/2}x\|_2\ge\gamma$$
Hence, the chance-constrained program is equivalent to
$$\max_x\ \bar r^Tx-\Phi^{-1}(1-\varepsilon)\|\Sigma^{1/2}x\|_2,\quad\text{s.t. }x\ge0,\ \mathbf{1}^Tx=1$$
Remark: Robust LPs with ellipsoidal uncertainty sets are closely related to chance constraints of a stochastic model.
16.2.2 Example 2: Robust Classification
Consider the support vector machine model for binary classification:
$$\min_{\omega,b,\epsilon}\ \sum_{i=1}^m\epsilon_i\quad\text{s.t. }y_i(\omega^Tx_i+b)\ge1-\epsilon_i,\quad\epsilon\ge0 \tag{SVM}$$
where $(x_i,y_i)$, $i=1,\dots,m$ are data points with $x_i\in\mathbb{R}^n$, $y_i\in\{\pm1\}$.

1. Assume that the feature vectors $x_i$ are subject to spherical uncertainty
$$x_i\in X_i=\{\bar x_i+\rho u : \|u\|_2\le1\}$$
The robust counterpart of (SVM) simplifies to an SOCP:
$$\min_{\omega,b,\epsilon}\ \sum_{i=1}^m\epsilon_i\quad\text{s.t. }y_i(\omega^T\bar x_i+b)-\rho\|\omega\|_2\ge1-\epsilon_i,\quad\epsilon_i\ge0$$
$$\Longleftrightarrow\ \min_{\omega,b}\ \sum_{i=1}^m\max\big(1-y_i(\omega^T\bar x_i+b)+\rho\|\omega\|_2,\ 0\big)$$
which is very similar to the classical SVM with $\ell_2$-norm regularization:
$$\min_{\omega,b}\ \sum_{i=1}^m\max\big(1-y_i(\omega^Tx_i+b),\ 0\big)+\lambda\|\omega\|_2^2$$

2. Assume that the feature vectors $x_i$ are subject to box uncertainty
$$x_i\in X_i=\{\bar x_i+\rho u : \|u\|_\infty\le1\}$$
The robust counterpart of (SVM) becomes
$$\min_{\omega,b}\ \sum_{i=1}^m\max\big(1-y_i(\omega^T\bar x_i+b)+\rho\|\omega\|_1,\ 0\big)$$
which is very similar to the classical SVM with $\ell_1$-norm regularization.
Remark: Robust optimization has a close connection to the regularization technique in machine learning.
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 17: Interior Point Method, Part I – March 29
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Path-following Scheme
• Self-concordant Functions
Reference:

Nemirovski, Interior Point Polynomial Time Methods in Convex Programming, 2004, Chapter 1
Nesterov, Introductory Lectures on Convex Optimization, 2004, Chapter 4.1
17.1 Historical Notes
• 1947: Dantzig, Simplex method for LP
• 1973: Klee and Minty proved that Simplex Method is not a polynomial-time algorithm
• mid-1970s: Shor, Nemirovski and Yudin, Ellipsoid method for linear and convex programs
• 1979: Khachiyan proved the polynomial-time solvability of LP
• 1984: Karmarkar, polynomial algorithm (potential reduction interior point method) for LP
• 1967: Dikin, affine scaling algorithm (a simplification of Karmarkar's algorithm) for LP
• late-1980s: Renegar, Gonzaga, path-following interior point method for LP
• 1988: Nesterov and Nemirovski, extend interior point method for convex programs
17.2 Path following Scheme
We intend to solve a general convex program
min_x  f(x)
s.t.  gi(x) ≤ 0,  i = 1, ..., m   (P)
where f , gi are twice continuously differentiable convex functions.
Denote X = {x : gi(x) ≤ 0, ∀i = 1, ..., m} as the feasible domain. Assume the Slater condition holds and X is bounded, so that X is a compact convex set with non-empty interior.
Barrier Method: solve a series of unconstrained problems
min_x  tf(x) + F(x)   (Pt)
where t > 0 is a penalty parameter and F (x) is a barrier function that satisfies:
• F : int(X)→ R and F (x)→ +∞ as x→ ∂(X)
• F is smooth (twice continuously differentiable) and convex
• F is non-degenerate, i.e. ∇²F(x) ≻ 0, ∀x ∈ int(X)
Note that for any t > 0, (Pt) has a unique solution in the interior of X.
Denote

x*(t) = arg min_x  tf(x) + F(x)

The path {x*(t) : t > 0} is called the central path. We have

x*(t) → x*  as t → ∞
To implement the above path-following scheme, we need to specify:

1. the barrier function F(x):
   – use self-concordant barriers, e.g. F(x) = −∑_{i=1}^m log(−gi(x))

2. the method to solve the unconstrained minimization problems (Pt):
   – use the Newton method

3. the policy to update the penalty parameter t.
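The three ingredients above can be combined into a minimal sketch: a log-barrier path-following loop for an LP min c^T x s.t. Ax ≤ b. The constants (t0, the update factor mu, iteration counts) and function name are illustrative choices, not the lecture's exact scheme; the inner loop is a damped Newton method, which is developed later in these notes.

```python
import numpy as np

def barrier_newton_lp(c, A, b, x0, t0=1.0, mu=10.0, n_outer=8, n_inner=200, tol=1e-9):
    """Toy log-barrier path-following for min c^T x s.t. A x <= b.

    x0 must be strictly feasible (A x0 < b); the damped Newton stepsize
    1/(1+lambda) keeps all iterates strictly feasible.
    """
    x, t = np.asarray(x0, dtype=float), t0
    for _ in range(n_outer):
        for _ in range(n_inner):              # Newton on F_t(x) = t c^T x - sum log(b - Ax)
            s = b - A @ x                     # slacks, strictly positive
            grad = t * c + A.T @ (1.0 / s)
            hess = A.T @ (A / s[:, None] ** 2)
            d = np.linalg.solve(hess, grad)
            lam = float(np.sqrt(grad @ d))    # Newton decrement
            if lam < tol:
                break
            x = x - d / (1.0 + lam)           # damped Newton step
        t *= mu                               # increase the penalty parameter
    return x
```

For min x over [0, 1] (so c = [1], constraints x ≤ 1 and −x ≤ 0), the iterates approach the optimizer 0 at rate roughly 1/t.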
17.3 Self-concordant Function
Definition 17.1 A function f : R → R is self-concordant if f is convex and

|f'''(x)| ≤ κ f''(x)^{3/2},  ∀x ∈ dom(f)

for some constant κ ≥ 0.

When κ = 2, f is called a standard self-concordant function.
Examples:
• Logarithmic function: f(x) = − ln(x), x > 0, is standard self-concordant:

  f'(x) = −1/x,  f''(x) = 1/x²,  f'''(x) = −2/x³,  |f'''(x)|/f''(x)^{3/2} = 2
• Linear function: f(x) = cx is self-concordant with constant κ = 0:

  f'(x) = c,  f''(x) = 0,  f'''(x) = 0

• Convex quadratic function: f(x) = (a/2)x² + bx + c (a > 0) is self-concordant with κ = 0:

  f'(x) = ax + b,  f''(x) = a,  f'''(x) = 0
• Exponential function: f(x) = e^x is not self-concordant:

  f'(x) = f''(x) = f'''(x) = e^x,  |f'''(x)|/f''(x)^{3/2} = e^{−x/2} → +∞ as x → −∞
• Power functions:

  f(x) = 1/x^p (p > 0), x > 0
  f(x) = |x|^p (p > 2)
  f(x) = x^{2p} (p > 2)

  are not self-concordant.
Remark: Self-concordance is affine-invariant: if f(x) is self-concordant, then f̃(y) = f(ay + b) is also self-concordant with the same constant.

Proof: It is easy to see that f̃ is convex and

|f̃'''(y)| / f̃''(y)^{3/2} = |a³ f'''(ay + b)| / [a² f''(ay + b)]^{3/2} = |f'''(ay + b)| / f''(ay + b)^{3/2} ≤ κ
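Both the constant κ = 2 for the logarithm and the affine invariance just proved can be checked numerically; the sketch below uses the closed-form derivatives of − ln, with illustrative values of a, b.

```python
# For f(x) = -ln(x), the ratio |f'''(x)| / f''(x)^{3/2} equals 2 at every x > 0,
# and the affine substitution g(y) = -ln(a*y + b) leaves the ratio unchanged.
for x in [0.1, 1.0, 7.5]:
    ratio = abs(-2.0 / x**3) / (1.0 / x**2) ** 1.5
    assert abs(ratio - 2.0) < 1e-9

a, b = 3.0, 0.5                            # illustrative affine map y -> a*y + b
for y in [0.2, 1.0, 4.0]:
    u = a * y + b
    g2 = a**2 / u**2                       # g''(y) = a^2 f''(a y + b)
    g3 = -2.0 * a**3 / u**3                # g'''(y) = a^3 f'''(a y + b)
    assert abs(abs(g3) / g2**1.5 - 2.0) < 1e-9
```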
When extending to a function f defined on Rn with f ∈ C³(Rn), we say f is self-concordant if it is self-concordant along every line, namely, ∀x ∈ dom(f), h ∈ Rn, φ(t) = f(x + th) is self-concordant with some constant κ ≥ 0.
Denote

D^k f(x)[h1, ..., hk] = ∂^k/∂t1···∂tk |_{t1=...=tk=0} f(x + t1h1 + ... + tkhk)

as the k-th differential of f taken at x along the directions h1, ..., hk, e.g.

Df(x)[h] = φ'(0) = ⟨∇f(x), h⟩
D²f(x)[h, h] = φ''(0) = ⟨∇²f(x)h, h⟩
Definition 17.2 We say a function f : Rn → R is self-concordant if

|D³f(x)[h, h, h]| ≤ κ (D²f(x)[h, h])^{3/2},  ∀x ∈ dom(f), h ∈ Rn

for some constant κ ≥ 0.
Operations Preserving Self-Concordance
Proposition 17.3
1. (Affine invariance) If f(y) is self-concordant with constant κ, then the function f̃(x) = f(Ax + b) is also self-concordant with constant κ.

2. (Summation) If f1(x) and f2(x) are self-concordant with constants κ1, κ2, then the function f(x) = f1(x) + f2(x) is self-concordant with constant κ = max{κ1, κ2}.

3. (Scaling) If f(x) is self-concordant with constant κ and α > 0, then the function αf(x) is also self-concordant with constant κ/√α.
Proof: Exercise!
Hence, it is straightforward to see that
Corollary 17.4 f(x) = −∑_{i=1}^m ln(bi − a_i^T x) is standard self-concordant on int(X), where X = {x : a_i^T x ≤ bi, i = 1, ..., m}.
Remark: Indeed, under regular conditions, self-concordant functions are barrier functions:
• If dom(f) contains no straight line, then ∇2f(x) is non-degenerate (strictly convex)
• If f is closed convex, then f(xk) → +∞ whenever {xk} ⊆ dom(f) and xk → x ∈ ∂(dom(f))
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 18: Interior Point Method - Part II – April 3
Instructor: Niao He Scribe: Shuanglong Wang
In this lecture, we cover the following topics
• Classical Newton Method for Unconstrained Minimization
• (Damped) Newton Method for Self-concordant Functions
Reference:
Nesterov, Introductory Lectures on Convex Optimization, 2004, Chapter 1.2.4
18.1 Classical Newton Method
Consider the unconstrained minimization
min_{x∈Rn}  f(x)
where f(x) is twice continuously differentiable (not necessarily convex) on Rn.
Newton Method:

x_{k+1} = xk − [∇²f(xk)]^{−1} ∇f(xk),  k = 0, 1, 2, ...
The direction dk = −∇2f(xk)−1∇f(xk) is called Newton’s direction.
Interpretation: The Newton method can be treated as
• Minimizing the quadratic approximation of f: From Taylor's expansion, we have

  f(x + h) = f(x) + ∇f(x)^T h + (1/2) h^T ∇²f(x) h + o(‖h‖²)

  The iteration can be viewed as minimizing the quadratic approximation of f:

  x_{k+1} = arg min_x  f(xk) + ∇f(xk)^T (x − xk) + (1/2)(x − xk)^T ∇²f(xk)(x − xk)
• Solving the linearized optimality condition: The first-order optimality condition is ∇f(x) = 0. Note that

  ∇f(x + h) ≈ ∇f(x) + ∇²f(x)h

  The iteration can also be viewed as solving the linearized optimality condition

  ∇f(xk) + ∇²f(xk)(x − xk) = 0
Remark (Newton method vs Gradient Descent)

x_{k+1} = xk − ∇²f(xk)^{−1} ∇f(xk)   (Newton)
x_{k+1} = xk − γk ∇f(xk)   (GD)

Firstly, unlike gradient descent, Newton's direction is not necessarily a descent direction: for ∇f(x) ≠ 0,

∇f(x)^T d = −∇f(x)^T [∇²f(x)]^{−1} ∇f(x)

need not be negative when ∇²f(x) is not positive definite.

Secondly, the Newton method is a second-order method and requires a relatively high computational cost compared to GD.
Remark (Convergence)
(i) Newton method can break down if ∇2f(x) is degenerate.
(ii) When f is quadratic and non-degenerate, Newton method converges in one step.
(iii) The method may diverge even for a nice strongly convex function.
(iv) If started close enough to a strict local minimum, the method can converge very fast.
Example: consider the convex function f(x) = √(1 + x²).

One can easily compute that x* = 0, and

f'(x) = x/√(1 + x²),  f''(x) = 1/(1 + x²)^{3/2}

The Newton method works as

x_{k+1} = xk − (1 + xk²)^{3/2} · xk/√(1 + xk²) = −xk³
Note that
• if |x0| < 1, the method converges extremely fast
• if |x0| = 1, the method oscillates between 1 and -1
• if |x0| > 1, the method diverges.
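Because the Newton update for this example collapses to the cubing map, the three regimes above can be checked directly:

```python
# Newton on f(x) = sqrt(1 + x^2) reduces to x_{k+1} = -x_k^3.
def newton_sqrt1px2(x0, k=6):
    x = x0
    for _ in range(k):
        x = -x ** 3
    return x

assert abs(newton_sqrt1px2(0.5)) < 1e-10   # |x0| < 1: converges to x* = 0 extremely fast
assert abs(newton_sqrt1px2(1.0)) == 1.0    # |x0| = 1: oscillates between +1 and -1
assert abs(newton_sqrt1px2(1.1)) > 1e3     # |x0| > 1: diverges
```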
18.2 Classical Analysis
Theorem 18.1 (Local quadratic convergence of Newton method) Assume that

• f has a Lipschitz Hessian: ‖∇²f(x) − ∇²f(y)‖2 ≤ M ‖x − y‖2 for some M > 0.

• f has a strict local minimum x*: ∇²f(x*) ⪰ µI, with µ > 0.

• The initial point x0 is close enough to x*: ‖x0 − x*‖2 ≤ µ/(2M).

Then the Newton method is well-defined and converges to x* at a quadratic rate:

‖x_{k+1} − x*‖2 ≤ (M/µ) ‖xk − x*‖2²

Note that for a symmetric matrix A, ‖A‖2 := sup_{x≠0} ‖Ax‖2/‖x‖2 = max_k |λk(A)|. We first prove the following simple useful lemma.
Lemma 18.2 Suppose f has a Lipschitz Hessian with constant M. Then for any x, y,

‖∇f(y) − ∇f(x) − ∇²f(x)(y − x)‖2 ≤ (M/2) ‖y − x‖2².
Proof: By basic calculus,

∇f(y) − ∇f(x) = ∫_0^1 ∇²f(x + t(y − x))(y − x) dt

Hence,

‖∇f(y) − ∇f(x) − ∇²f(x)(y − x)‖2 = ‖∫_0^1 [∇²f(x + t(y − x)) − ∇²f(x)](y − x) dt‖2
                                  ≤ ∫_0^1 Mt ‖y − x‖2² dt
                                  = (M/2) ‖y − x‖2²
Now we are ready to prove the main theorem.
Proof: We have

x_{k+1} − x* = xk − x* − [∇²f(xk)]^{−1} ∇f(xk)
             = [∇²f(xk)]^{−1} [∇²f(xk)(xk − x*) − ∇f(xk)]
             = [∇²f(xk)]^{−1} [∇f(x*) − ∇f(xk) − ∇²f(xk)(x* − xk)]
The last equality is due to the fact that ∇f(x*) = 0 by the optimality condition. Hence,

‖x_{k+1} − x*‖2 ≤ ‖[∇²f(xk)]^{−1}‖2 · ‖∇f(x*) − ∇f(xk) − ∇²f(xk)(x* − xk)‖2
               ≤ ‖[∇²f(xk)]^{−1}‖2 · (M/2) ‖xk − x*‖2²

The last inequality is due to the previous lemma. We now show by induction that ‖xk − x*‖2 ≤ µ/(2M) for all k. Assume ‖xk − x*‖2 ≤ µ/(2M); then

‖∇²f(xk) − ∇²f(x*)‖2 ≤ M ‖xk − x*‖2 ≤ µ/2

which implies that −(µ/2)I ⪯ ∇²f(xk) − ∇²f(x*) ⪯ (µ/2)I. Hence,

∇²f(xk) ⪰ ∇²f(x*) − (µ/2)I ⪰ (µ/2)I

which implies that ‖[∇²f(xk)]^{−1}‖2 ≤ 2/µ. This leads to

‖x_{k+1} − x*‖2 ≤ (M/µ) ‖xk − x*‖2²

and ‖x_{k+1} − x*‖2 ≤ µ/(2M), which concludes the proof.
Note that the local convergence holds for any unconstrained minimization, regardless of whether f is convex or not. When f is strongly convex, the analysis is even simpler.
Theorem 18.3 Assume that

• f has a Lipschitz Hessian: ‖∇²f(x) − ∇²f(y)‖2 ≤ M ‖x − y‖2

• f is µ-strongly convex, i.e. ∇²f(x) ⪰ µI, ∀x

• The initial point x0 satisfies ‖∇f(x0)‖2 < 2µ²/M

Then the gradient converges to zero quadratically:

‖∇f(x_{k+1})‖2 ≤ (M/2µ²) ‖∇f(xk)‖2²

Proof: Set y = x_{k+1}, x = xk in the previous lemma. By the Newton update, ∇f(xk) + ∇²f(xk)(x_{k+1} − xk) = 0, so the lemma bounds ‖∇f(x_{k+1})‖2 directly:

‖∇f(x_{k+1})‖2 ≤ (M/2) ‖x_{k+1} − xk‖2² = (M/2) ‖[∇²f(xk)]^{−1} ∇f(xk)‖2²
              ≤ (M/2) ‖[∇²f(xk)]^{−1}‖2² · ‖∇f(xk)‖2²
              ≤ (M/2µ²) ‖∇f(xk)‖2²
Remark (Complexity) The quadratic convergence implies that

(M/µ)‖xk − x*‖2 ≤ [(M/µ)‖x_{k−1} − x*‖2]² ≤ [(M/µ)‖x_{k−2} − x*‖2]⁴ ≤ ... ≤ [(M/µ)‖x0 − x*‖2]^{2^k} ≤ (1/2)^{2^k}

Hence, ‖xk − x*‖2 ≤ (µ/M) 2^{−2^k}.

To achieve an accuracy ε, i.e. ‖xk − x*‖2 ≤ ε, the number of iterations needed is

k ≥ log2 log2 (µ/(Mε))
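The doubly-exponential error decay can be seen on a small example: Newton on the strongly convex f(x) = e^x + e^{−x}, for which the update is x_{k+1} = xk − tanh(xk) and x* = 0. (The specific test function is an illustrative choice, not from the notes.)

```python
import math

# Newton on f(x) = e^x + e^{-x}: f'(x) = 2 sinh(x), f''(x) = 2 cosh(x),
# so the update is x - f'(x)/f''(x) = x - tanh(x), with minimizer x* = 0.
x, errs = 1.0, []
for _ in range(5):
    x = x - math.tanh(x)
    errs.append(abs(x))               # error |x_k - x*|

assert errs[2] < errs[1] ** 2         # error decays at least quadratically
assert errs[3] < 1e-20                # essentially converged after 4 steps
```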
Remark (Affine Invariance) The Newton method is invariant w.r.t. affine transformations of variables.

Let A be a non-singular matrix and consider the function

f̃(y) = f(Ay)

The Newton steps for f and f̃ are

x_{k+1} = xk − [∇²f(xk)]^{−1} ∇f(xk)
y_{k+1} = yk − [∇²f̃(yk)]^{−1} ∇f̃(yk)

Let y0 = A^{−1}x0; then yk = A^{−1}xk. This can be shown by induction:

y_{k+1} = yk − [A^T ∇²f(Ayk) A]^{−1} [A^T ∇f(Ayk)]
        = A^{−1}xk − [A^T ∇²f(xk) A]^{−1} A^T ∇f(xk)
        = A^{−1}xk − A^{−1} [∇²f(xk)]^{−1} ∇f(xk)
        = A^{−1}x_{k+1}

In other words, Newton's method follows the same trajectory in the 'x-space' and the 'y-space'. Hence, the region of quadratic convergence should not depend on the Euclidean metric.
However, in the classical analysis, the assumptions and the measure of error, e.g. the Lipschitz continuity of the Hessian, depend heavily on the Euclidean metric and are not affine invariant. A natural remedy is to assume self-concordance: self-concordant functions are especially well suited for the Newton method.
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 19: Interior Point Method - Part III – April 05
Instructor: Niao He Scribe: Shuanglong Wang
In this lecture, we cover the following topics
• Properties of Self-Concordant Functions
• Minimization of Self-Concordant Functions
• (Damped) Newton Method of Self-Concordant Functions
Reference:
Nesterov, Introductory Lectures on Convex Optimization, 2004, Chapter 4.1.4 - 4.1.5
19.1 Properties of Self-Concordant Functions
Let f(x) be a standard self-concordant function, i.e. ∀x ∈ dom(f), h ∈ Rn,

|D³f(x)[h, h, h]| ≤ 2 (D²f(x)[h, h])^{3/2}

i.e.

|d³/dt³ |_{t=0} f(x + th)| ≤ 2 (d²/dt² |_{t=0} f(x + th))^{3/2}

Definition 19.1 We define the local norm of h at x ∈ dom(f) as

‖h‖x = √(h^T ∇²f(x) h)

We state below a basic inequality without proof: for a standard self-concordant function f, it holds that

|D³f(x)[h1, h2, h3]| ≤ 2 ‖h1‖x · ‖h2‖x · ‖h3‖x
Remark ("Lipschitz continuity") At a high level,

|d/dt |_{t=0} D²f(x + tδ)[h, h]| ≤ 2 ‖δ‖x D²f(x)[h, h]

The second derivative is relatively Lipschitz continuous w.r.t. the local norm defined by f.
For instance, when f is self-concordant on R and strictly convex,

|f'''(x)| / f''(x)^{3/2} ≤ 2  =⇒  |d/dx [f''(x)^{−1/2}]| ≤ 1
                           =⇒  −y ≤ ∫_0^y d/dx [f''(x)^{−1/2}] dx ≤ y
                           =⇒  −y ≤ 1/√(f''(y)) − 1/√(f''(0)) ≤ y

Simplifying the above terms, we arrive at

f''(0) / (1 + y√(f''(0)))² ≤ f''(y) ≤ f''(0) / (1 − y√(f''(0)))²,  ∀0 ≤ y < 1/√(f''(0))   (⋆)
Note that the second derivative f''(y) stays within a constant factor of f''(0) in a neighborhood of y = 0.
Moreover, if we integrate (⋆), and then integrate again, on both sides (e.g. for the lower bound):

|f'''(x)| / f''(x)^{3/2} ≤ 2  =⇒  √(f''(0)) − √(f''(0))/(1 + y√(f''(0))) ≤ f'(y) − f'(0)
                           =⇒  y√(f''(0)) − ln(1 + y√(f''(0))) ≤ f(y) − f(0) − f'(0)y

Rewriting the above and treating both sides, we arrive at: ∀0 ≤ y < 1/√(f''(0)),

(y√(f''(0)))² / (1 + y√(f''(0))) ≤ y(f'(y) − f'(0)) ≤ (y√(f''(0)))² / (1 − y√(f''(0)))   (⋆⋆)

and

y√(f''(0)) − ln(1 + y√(f''(0))) ≤ f(y) − f(0) − f'(0)y ≤ −y√(f''(0)) − ln(1 − y√(f''(0)))   (⋆⋆⋆)
More generally, for self-concordant functions on Rn, similar results hold (see Nesterov, 2004, Theorems 4.1.6-4.1.8). We provide the results below without proofs.
Definition 19.2 (Dikin Ellipsoid)

Wr(x) = {y : ‖y − x‖x ≤ r}
W⁰r(x) = {y : ‖y − x‖x < r}

Theorem 19.3 For x ∈ dom(f), we have W⁰₁(x) ⊆ dom(f).

The Dikin ellipsoid of any point in the domain is contained in the domain. More critically, a self-concordant function behaves nicely within this Dikin ellipsoid.
Theorem 19.4 (Hessian of self-concordant function) For x ∈ dom(f), we have

(1 − ‖y − x‖x)² ∇²f(x) ⪯ ∇²f(y) ⪯ (1 − ‖y − x‖x)^{−2} ∇²f(x),  ∀y ∈ W⁰₁(x)

Theorem 19.5 (Gradient of self-concordant function) For x ∈ dom(f), we have

‖y − x‖x² / (1 + ‖y − x‖x) ≤ ⟨∇f(y) − ∇f(x), y − x⟩ ≤ ‖y − x‖x² / (1 − ‖y − x‖x),  ∀y ∈ W⁰₁(x)

Theorem 19.6 (Linear approximation of self-concordant function)

1. ∀x, y ∈ dom(f), we have

   f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + ω(‖y − x‖x)

2. ∀x ∈ dom(f), y ∈ W⁰₁(x), we have

   f(y) ≤ f(x) + ⟨∇f(x), y − x⟩ + ω*(‖y − x‖x)

where ω(t) = t − ln(1 + t) and ω*(t) = −t − ln(1 − t) is its conjugate.
Proof: See Theorem 4.1.6, 4.1.7, 4.1.8 in (Nesterov, 2004).
19.2 Minimizing Self-Concordant Functions
Consider the unconstrained minimization

min_{x∈Rn}  f(x)

where f is a standard self-concordant and non-degenerate (i.e. ∇²f(x) ≻ 0) function. Note that the problem is not necessarily solvable, e.g. f(x) = − ln(x).
Definition 19.7 (Newton's decrement) The quantity

λf(x) = √(∇f(x)^T [∇²f(x)]^{−1} ∇f(x))

is called the Newton decrement.
Remark: The Newton decrement can be interpreted as

1. The decrease of the second-order Taylor expansion after a Newton step:

   f(x) − min_h { f(x) + h^T ∇f(x) + (1/2) h^T ∇²f(x) h } = (1/2) λf(x)²

2. The conjugate local norm of ∇f(x):

   ‖∇f(x)‖_{x,*} = max{∇f(x)^T y : ‖y‖x ≤ 1} = ‖[∇²f(x)]^{−1/2} ∇f(x)‖2 = λf(x)

3. The local norm of Newton's direction d(x) = −[∇²f(x)]^{−1} ∇f(x):

   ‖d(x)‖x = λf(x)
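The equivalence of these expressions is easy to verify numerically; the sketch below does so for the log-barrier F(x) = −∑ log(bi − a_i^T x) at a feasible point, with random illustrative data.

```python
import numpy as np

# Newton decrement of F(x) = -sum log(b_i - a_i^T x) at x = 0, computed two
# equivalent ways: sqrt(g^T H^{-1} g) and the local norm ||d||_x of the
# Newton direction d = -H^{-1} g.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))
b = rng.uniform(1.0, 2.0, size=6)          # x = 0 strictly feasible since b > 0
x = np.zeros(3)
s = b - A @ x                              # slacks
g = A.T @ (1.0 / s)                        # gradient of F at x
H = A.T @ (A / s[:, None] ** 2)            # Hessian of F at x
lam1 = float(np.sqrt(g @ np.linalg.solve(H, g)))
d = -np.linalg.solve(H, g)                 # Newton direction
lam2 = float(np.sqrt(d @ (H @ d)))         # local norm ||d||_x
assert abs(lam1 - lam2) < 1e-10
```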
Theorem 19.8 Assume λf(x0) < 1 for some x0 ∈ dom(f). Then there exists a unique minimizer of f.

Proof: It suffices to show that the level set {y : f(y) ≤ f(x0)} is bounded. Since

f(y) ≥ f(x) + ⟨∇f(x), y − x⟩ + ω(‖y − x‖x)
     ≥ f(x) − ‖∇f(x)‖_{x,*} · ‖y − x‖x + ω(‖y − x‖x)
     = f(x) − λf(x) · ‖y − x‖x + ω(‖y − x‖x)

we have

f(y) ≤ f(x0)  =⇒  ω(‖y − x0‖_{x0}) / ‖y − x0‖_{x0} ≤ λf(x0) < 1

Notice that the function φ(t) = ω(t)/t = 1 − ln(1 + t)/t is strictly increasing in t ≥ 0 and tends to 1 as t → ∞. Hence, ‖y − x0‖_{x0} ≤ t* for some t*. This implies that the level set must be bounded.
Remark: Note that for self-concordant functions, a local condition such as λf(x0) < 1 provides global information about f.
Example: consider the self-concordant function f(x) = εx − ln(x) with dom(f) = {x : x > 0}.

λf(x) = √((ε − 1/x) (1/x²)^{−1} (ε − 1/x)) = |1 − εx|

When ε ≤ 0, λf(x) ≥ 1 for all x, the function is unbounded below, and there is no minimizer.

When ε > 0, λf(x) < 1 for x ∈ (0, 2/ε), and there exists a unique minimizer x* = 1/ε.
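The closed-form decrement in this example is easy to sanity-check:

```python
# For f(x) = eps*x - ln(x): f'(x) = eps - 1/x and f''(x) = 1/x^2, so
# lambda_f(x) = |f'(x)| / sqrt(f''(x)) = |1 - eps*x|, vanishing at x* = 1/eps.
eps = 0.5
for x in [0.5, 1.0, 2.0, 3.0]:
    lam = abs(eps - 1.0 / x) / (1.0 / x ** 2) ** 0.5
    assert abs(lam - abs(1.0 - eps * x)) < 1e-12   # x = 2 = 1/eps gives lam = 0
```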
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 20: Interior Point Method - Part IV – April 10
Instructor: Niao He Scribe: Shuanglong Wang
In this lecture, we cover the following topics
• Newton method for self-concordant functions
• Damped Newton method
Reference: Nesterov, Introductory Lectures on Convex Optimization, 2004, Chapter 4.1.5
20.1 Recall
In the last lecture, we discussed the unconstrained minimization of a self-concordant function

min_x  f(x)

where f(x) is standard self-concordant and non-degenerate.
Dikin ellipsoid: W⁰r(x) = {y : ‖y − x‖x < r}

For a standard self-concordant function, W⁰₁(x) ⊆ dom(f), ∀x ∈ dom(f). Moreover, f behaves nicely inside the Dikin ellipsoid: ∀y with ‖y − x‖x = γ < 1, it holds that

1. (1 − γ)² ∇²f(x) ⪯ ∇²f(y) ⪯ (1 − γ)^{−2} ∇²f(x)

2. γ²/(1 + γ) ≤ ⟨∇f(y) − ∇f(x), y − x⟩ ≤ γ²/(1 − γ)

3. ω(γ) ≤ f(y) − f(x) − ⟨∇f(x), y − x⟩ ≤ ω*(γ)

where ω(t) = t − ln(1 + t), ω*(t) = −t − ln(1 − t).
Newton's decrement: λf(x) = √(∇f(x)^T [∇²f(x)]^{−1} ∇f(x))

• Note that λf(x) = ‖∇f(x)‖_{x,*} = ‖d(x)‖x, where d(x) = [∇²f(x)]^{−1} ∇f(x)

• If λf(x) < 1, then the point x+ = x − d(x) ∈ dom(f)

• If x* is a minimizer of f, then λf(x*) = 0

• If λf(x0) < 1 for some x0 ∈ dom(f), then f has a unique minimizer.
20.2 Newton Method for Self-concordant Function
Basic Newton method: initialize x0 ∈ dom(f) and update via

x_{k+1} = xk − [∇²f(xk)]^{−1} ∇f(xk),  k = 0, 1, 2, ...
Unlike the classical analysis, we should adopt measures of error that are independent of the Euclidean metric. For instance,
• Function gap: f(xk)− f(x∗)
• Newton’s decrement: λf (xk) =‖ ∇f(xk) ‖xk,∗
• Local distance to the minimizer: ‖ xk − x∗ ‖xk
• Distance to the minimizer under fixed metric ‖ xk − x∗ ‖x∗
Indeed, all of these measures are equivalent locally.
Proposition 20.1 When λf(x) < 1, we have

1. f(x) − f(x*) ≤ ω*(λf(x)) ≤ λf(x)² / (2(1 − λf(x))²)

2. ‖x − x*‖x ≤ λf(x)/(1 − λf(x))

3. ‖x − x*‖_{x*} ≤ λf(x)/(1 − λf(x))
Proof: See Theorem 4.1.13 in (Nesterov, 2004).
For this reason, we will focus mainly on the convergence in terms of λf(x). For simplicity, in the following, let us denote λk := λf(xk), k = 0, 1, 2, ...
Theorem 20.2 (Local convergence) If xk ∈ dom(f) and λk < 1, then x_{k+1} ∈ dom(f) and

λ_{k+1} ≤ (λk / (1 − λk))²

Proof: Note ‖x_{k+1} − xk‖_{xk} = λf(xk) = λk < 1, so x_{k+1} ∈ dom(f). Also, it holds that

∇²f(x_{k+1}) ⪰ (1 − λk)² ∇²f(xk)

Hence,

λ_{k+1} = √(∇f(x_{k+1})^T [∇²f(x_{k+1})]^{−1} ∇f(x_{k+1})) ≤ (1/(1 − λk)) √(∇f(x_{k+1})^T [∇²f(xk)]^{−1} ∇f(x_{k+1}))
Note

∇f(x_{k+1}) = ∇f(x_{k+1}) − ∇f(xk) − [∇²f(xk)](x_{k+1} − xk)
            = [∫_0^1 (∇²f(xk + t(x_{k+1} − xk)) − ∇²f(xk)) dt] (x_{k+1} − xk)
            := G (x_{k+1} − xk)

Hence,

λ_{k+1} ≤ (1/(1 − λk)) √((x_{k+1} − xk)^T G^T [∇²f(xk)]^{−1} G (x_{k+1} − xk))
        ≤ (1/(1 − λk)) ‖x_{k+1} − xk‖_{xk} · ‖[∇²f(xk)]^{−1/2} G [∇²f(xk)]^{−1/2}‖2
        ≤ (λk/(1 − λk)) ‖H‖2,  where H := [∇²f(xk)]^{−1/2} G [∇²f(xk)]^{−1/2}

We have

G ⪰ ∇²f(xk) ∫_0^1 [(1 − tλk)² − 1] dt = (λk²/3 − λk) ∇²f(xk)

G ⪯ ∇²f(xk) ∫_0^1 [1/(1 − tλk)² − 1] dt = (λk/(1 − λk)) ∇²f(xk)

Hence, ‖H‖2 ≤ max{λk − λk²/3, λk/(1 − λk)} = λk/(1 − λk).

This leads to the conclusion λ_{k+1} ≤ (λk/(1 − λk))².
Remark: Let λ* be such that λ*/(1 − λ*)² = 1. Then if λk < λ*, we have λ_{k+1} < λk. The region of quadratic convergence is λf(x) ≤ λ* = (3 − √5)/2 ≈ 0.38.
Question: The Newton method for self-concordant functions may still diverge if not started at a point with λf(x) small enough. How can we modify the Newton method to ensure global convergence?
The remedy is to use Newton method with line-search or damping factors
xk+1 = xk − γk[∇2f(xk)]−1∇f(xk)
where γk > 0 is some stepsize.
20.3 Damped Newton Method
Damped Newton method: initialize x0 ∈ dom(f) and update via

x_{k+1} = xk − (1/(1 + λf(xk))) [∇²f(xk)]^{−1} ∇f(xk)

Note this is essentially the Newton method with the particular stepsize γk = 1/(1 + λf(xk)).
Remark.

1. Since ‖x_{k+1} − xk‖_{xk} = λf(xk)/(1 + λf(xk)) < 1, we have x_{k+1} ∈ W⁰₁(xk) ⊆ dom(f). The damped Newton procedure is always well-defined.

2. The Newton direction dk = −[∇²f(xk)]^{−1} ∇f(xk) is a descent direction when f is non-degenerate. If f is bounded below, then xk converges to the unique minimizer of f.
Theorem 20.3 (Global convergence of damped Newton) The damped Newton method satisfies:

1. (Descent phase) ∀k ≥ 0, f(x_{k+1}) ≤ f(xk) − ω(λf(xk))

2. (Quadratic convergence phase) If λf(xk) < 1/4, then λf(x_{k+1}) ≤ 2[λf(xk)]²

Proof:

1. In view of the previous theorem,

   f(x_{k+1}) ≤ f(xk) + ⟨∇f(xk), x_{k+1} − xk⟩ + ω*(‖x_{k+1} − xk‖_{xk})
             = f(xk) − λf(xk)²/(1 + λf(xk)) + ω*(λf(xk)/(1 + λf(xk)))

   where ω*(t) = −t − ln(1 − t). This can be further simplified to f(x_{k+1}) ≤ f(xk) − ω(λf(xk)).

2. The proof follows similarly to the earlier theorem for the basic Newton method.
Remark: A general strategy for self-concordant minimization:

• Damped Newton stage: while λf(xk) ≥ β, where β ∈ (0, 1/4),

  f(x_{k+1}) ≤ f(xk) − ω(β)

  The number of steps in this stage is bounded by N1 ≤ (f(x0) − f(x*))/ω(β).

• (Damped/Basic) Newton stage: once λf(xk) < β,

  λf(x_{k+1}) ≤ 2[λf(xk)]²

  The number of steps to find a solution with λf(x) ≤ ε is bounded by N2 ≤ O(1) log2 log2(1/ε).

The total complexity does not exceed

O(1)[f(x0) − f* + log log(1/ε)]
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 21: Interior Point Method - Part V – April 12
Instructor: Niao He Scribe: Shuanglong Wang
In this lecture, we cover the following topics
• Revisit barrier method
• Self-concordant barrier
• Restate path-following scheme
Reference: Nesterov, Introductory Lectures on Convex Optimization, 2004, Chapter 4.2
21.1 Revisit Barrier Method
Recall that the barrier method solves the general convex problem

min_{x∈X}  f(x),  where X = {x : gi(x) ≤ 0, i = 1, ..., m}

by solving a sequence of unconstrained minimizations min_x tf(x) + F(x), where F(x) is a barrier function defined on int(X) and t > 0 is a penalty parameter.
Without loss of generality, let us consider a problem of the form

min_{x∈X}  c^T x   (P)

where X is a closed bounded convex set with non-empty interior. Let F be standard self-concordant with cl(dom(F)) = X. We want to solve (P) by tracing the central path

x*(t) = arg min_x  Ft(x) := t c^T x + F(x)

By the optimality condition: ∀t > 0,

tc + ∇F(x*(t)) = 0

Note that Ft(x) is standard self-concordant, and the (damped) Newton method achieves fast local convergence when λ_{Ft}(x) ≤ 1/4.
When increasing t → t', we would like to preserve λ_{Ft'}(x*(t)) ≤ 1/4 while making t' as large as possible. We have

λ_{Ft'}(x*(t)) = ‖∇Ft'(x*(t))‖_{x*(t),*}
              = ‖(t' − t)c + tc + ∇F(x*(t))‖_{x*(t),*} = ‖(t' − t)c‖_{x*(t),*}
              = ‖(t'/t − 1) ∇F(x*(t))‖_{x*(t),*}

Hence, we can take

t'/t = 1 + 1/(4 λF(x*(t)))

To ensure t → +∞ at a geometric rate, we need λF(x) to be uniformly bounded from above, namely,

λF(x)² = ∇F(x)^T [∇²F(x)]^{−1} ∇F(x) ≤ ν

for some ν. This leads to the definition of self-concordant barriers.
21.2 Self-concordant Barriers
21.2.1 Definition and Examples
Definition 21.1 Let ν ≥ 0. We call F a ν-self-concordant barrier (ν-s.c.b.) for the set X = cl(dom(F)) if F is standard self-concordant and satisfies

|DF(x)[h]| ≤ ν^{1/2} √(D²F(x)[h, h]),  ∀x ∈ dom(F), h ∈ Rn   (⋆)

Remark

1. The inequality (⋆) says that |∇F(x)^T h| ≤ √ν ‖h‖x, i.e. F is Lipschitz continuous w.r.t. the local norm defined by F.

2. When F is non-degenerate, (⋆) is equivalent to

   λF(x)² = ∇F(x)^T [∇²F(x)]^{−1} ∇F(x) ≤ ν

3. The following are equivalent:

   (⋆)  ⇐⇒  ∇²F(x) ⪰ (1/ν) ∇F(x) [∇F(x)]^T
Examples

• Constant function: f(x) = c is a 0-s.c.b.

• Linear function: f(x) = a^T x + c (a ≠ 0) is not an s.c.b.

• Quadratic function: f(x) = (1/2) x^T Qx + q^T x + c with Q ≻ 0 is not an s.c.b.

• Logarithmic function: f(x) = − ln x (x > 0) is a 1-s.c.b.:

  (f'(x))² / f''(x) = (−1/x)² x² = 1
21.2.2 Self-concordance Preserving Operators
Proposition 21.2 The following are true:
1. If F (x) is a ν-self-concordant barrier, then F (y) = F (Ay + b) is a ν-self-concordant barrier.
2. If Fi(x) is νi-self-concordant barrier, i = 1, 2, then F1(x) + F2(x) is (ν1 + ν2)-self-concordantbarrier.
3. If F (x) is a ν-self-concordant barrier, then βF (x) with β ≥ 1 is a (βν)-self-concordant barrier.
Example: The function

F(x) = −∑_{i=1}^m ln(bi − a_i^T x)

is an m-self-concordant barrier for the set {x : Ax ≤ b}.
Remark: Indeed, for any closed convex set X ⊆ Rn with non-empty interior, there exists a (βn)-self-concordant barrier for X, where β is an absolute constant.
21.2.3 Properties of Self-concordant Barriers
Lemma 21.3 (Boundedness) Let F be a ν-self-concordant barrier for X. Then for any x ∈ int(X), y ∈ X, we have ⟨∇F(x), y − x⟩ ≤ ν.
Proof is omitted and can be found in Theorem 4.2.4. in (Nesterov, 2004).
Theorem 21.4 For any t > 0, we have

c^T x*(t) − min_{x∈X} c^T x ≤ ν/t

Proof: For any y ∈ X, c^T x*(t) − c^T y = −t^{−1} ∇F(x*(t))^T (x*(t) − y) ≤ ν/t.
21.3 Restate Path-following Scheme
We would like to address the convex problem

min_{x∈X}  c^T x

by tracing the central path

x*(t) = arg min_x  Ft(x) := t c^T x + F(x)

where F(x) is a ν-self-concordant barrier for X = cl(dom(F)).
Theorem 21.5 For any t > 0, we have

c^T x*(t) − min_{x∈X} c^T x ≤ ν/t

Proof: By the optimality condition, tc + ∇F(x*(t)) = 0, i.e. c = −t^{−1} ∇F(x*(t)). Hence,

∀y ∈ X:  c^T x*(t) − c^T y = −t^{−1} ∇F(x*(t))^T (x*(t) − y)
                           = t^{−1} ∇F(x*(t))^T (y − x*(t))
                           ≤ ν/t
Now consider an approximate solution x that is close to x*(t):

λ_{Ft}(x) ≤ β

where β is small enough.

Theorem 21.6 If λ_{Ft}(x) ≤ β, then

c^T x − min_{x∈X} c^T x ≤ (1/t)(ν + √ν · β/(1 − β))
Proof: First of all,

c^T x − c^T x*(t) ≤ ‖c‖_{x*(t),*} · ‖x − x*(t)‖_{x*(t)} = t^{−1} ‖∇F(x*(t))‖_{x*(t),*} · ‖x − x*(t)‖_{x*(t)}

Recall from last lecture that for x* = arg min_x f(x) with f standard self-concordant:

‖x − x*‖_{x*} ≤ λf(x)/(1 − λf(x))

Hence,

‖x − x*(t)‖_{x*(t)} ≤ λ_{Ft}(x)/(1 − λ_{Ft}(x)) ≤ β/(1 − β)

Thus, c^T x − c^T x*(t) ≤ (√ν/t) · β/(1 − β), and

c^T x − min_{x∈X} c^T x = c^T x − c^T x*(t) + c^T x*(t) − min_{x∈X} c^T x
                        ≤ (√ν/t) · β/(1 − β) + ν/t
Now we can formally describe the path-following scheme.

Path-following Scheme

• Initialize (x0, t0) with t0 > 0 and λ_{Ft0}(x0) ≤ β, where β ∈ (0, 1/4)

• For k ≥ 0, do

  t_{k+1} = tk (1 + γ/√ν)
  x_{k+1} = xk − [∇²F(xk)]^{−1} [t_{k+1} c + ∇F(xk)]
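The scheme can be sketched on a tiny LP: minimize c^T x over the box [−1, 1]², using the 4-s.c.b. F(x) = −∑ log(b − Ax). The constants t0, γ and the iteration count are illustrative choices (small enough that λ stays in the tracking region), not the lecture's exact constants.

```python
import numpy as np

# Short-step path-following: multiply t by (1 + gamma/sqrt(nu)), then take a
# single Newton step on F_t(x) = t c^T x - sum log(b - Ax).
c = np.array([1.0, 0.0])
A = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
b = np.ones(4)                    # feasible set: the box [-1, 1]^2
nu, gamma = 4.0, 0.05
x, t = np.zeros(2), 0.01          # x0 = analytic center, so lambda_{F_{t0}}(x0) is tiny
for _ in range(800):
    t *= 1.0 + gamma / np.sqrt(nu)
    s = b - A @ x
    g = t * c + A.T @ (1.0 / s)
    H = A.T @ (A / s[:, None] ** 2)
    x = x - np.linalg.solve(H, g)  # one Newton step per t-update
# the true optimum of c^T x over the box is -1; the remaining gap is O(nu/t)
assert c @ x + 1.0 < 1e-2
```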
Theorem 21.7 (Rate of convergence) In the above scheme, one has

c^T xk − min_{x∈X} c^T x ≤ O(1) (ν/t0) exp(−O(1) k/√ν)

where the constant factors O(1) depend solely on β and γ.
Remark [Complexity]

• The total number of Newton steps does not exceed

  N(ε) ≤ O(√ν log(ν/(t0 ε)))

• The total arithmetic cost of finding an ε-solution by the above scheme does not exceed

  O(M √ν log(ν/(t0 ε)))

  where M is the arithmetic cost of computing ∇F(x), ∇²F(x) and solving a Newton system.
Remark [Initialization] In order to obtain a valid pair (x0, t0) s.t.

λ_{Ft0}(x0) ≤ β

one can apply the following trick. Suppose x̄ ∈ dom(F) is given. Consider the auxiliary path

y*(t) = arg min_x [−t ∇F(x̄)^T x + F(x)]

When t = 1, y*(1) = x̄. When t → 0, y*(t) → xF := arg min_x F(x).

We can trace y*(t) as t decreases from 1 to 0, until we approach a pair (y0, t0) such that λ_{Ft0}(y0) ≤ β.
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 22: Interior Point Method for Conic Programs – April 17
Instructor: Niao He Scribe: Shuanglong Wang
In this lecture, we cover the following topics
• IPM for Conic Programs
• Primal Dual path following IPM
22.1 Summary of IPM
Let us first summarize the key concepts discussed in the last few lectures.
• Self-concordant barrier: A function F(x) is a ν-s.c.b. for a set X = cl(dom(F)) if it satisfies, for any x ∈ dom(F), h ∈ Rn:

  |D³F(x)[h, h, h]| ≤ 2 (D²F(x)[h, h])^{3/2}
  |DF(x)[h]| ≤ √ν (D²F(x)[h, h])^{1/2}
• Path-following interior point method: replace min_{x∈X} c^T x by min_x t c^T x + F(x):

  1. choose (x0, t0), where t0 > 0 and x0 is close to the analytic center xF := arg min_x F(x)

  2. for k = 0, 1, ..., do

     t_{k+1} = tk (1 + γ/√ν)
     x_{k+1} = xk − [∇²F(xk)]^{−1} [t_{k+1} c + ∇F(xk)]

• Iteration complexity: O(√ν log(ν/ε)) iterations.
22.2 IPM for Conic Programs
Recall the standard form of a conic program and its dual:

(Primal)  min_x  c^T x   s.t. Ax − b ∈ K
(Dual)    max_y  b^T y   s.t. A^T y = c,  y ∈ K*

This becomes an LP, SOCP, SDP when K = R^n_+, L^n, S^n_+, respectively.
22.2.1 Self-concordant Barriers for LP, SOCP, SDP
We now introduce the canonical self-concordant barriers for these cones.
Example 1: F(x) = −∑_{i=1}^n ln(xi) is an n-s.c.b. for the nonnegative orthant R^n_+.

Example 2: F(x) = − ln(xn² − x1² − ... − x_{n−1}²) is a 2-s.c.b. for the Lorentz cone L^n.

Example 3: F(X) = − ln(det(X)) = −∑_{i=1}^n ln(λi(X)) is an n-s.c.b. for the positive semidefinite cone S^n_+.
The first two examples can be easily verified from the definition. Let us prove the third case.

Proof: It suffices to show that for any X ∈ int(S^n_+) and H ∈ S^n, the one-dimensional restriction

φ(t) = F(X + tH) = − ln(det(X + tH))

is an n-self-concordant barrier. We have

φ(t) = − ln(det(X^{1/2} (I + tX^{−1/2}HX^{−1/2}) X^{1/2}))
     = − ln(det(I + tX^{−1/2}HX^{−1/2})) − ln(det(X))
     = −∑_{i=1}^n ln(1 + t λi(X^{−1/2}HX^{−1/2})) + φ(0)

Denote λi = λi(X^{−1/2}HX^{−1/2}), i = 1, ..., n, so that φ(t) = −∑_{i=1}^n ln(1 + tλi) + φ(0). Hence,

φ'(0) = −∑_{i=1}^n λi,   φ''(0) = ∑_{i=1}^n λi²,   φ'''(0) = −2 ∑_{i=1}^n λi³

Note that

(∑_{i=1}^n λi)² ≤ n ∑_{i=1}^n λi²  =⇒  φ'(0)² ≤ n φ''(0)

|∑_{i=1}^n λi³| ≤ (∑_{i=1}^n λi²)^{3/2}  =⇒  |φ'''(0)| ≤ 2 [φ''(0)]^{3/2}

Therefore, φ(t) is an n-self-concordant barrier for any X ∈ int(S^n_+) and H ∈ S^n.
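The eigenvalue identity underlying the proof, det(X + tH) = det(X) ∏(1 + tλi), can be checked numerically on random data (the matrices below are illustrative):

```python
import numpy as np

# Check phi(t) = -ln det(X + tH) against -sum ln(1 + t*lambda_i) + phi(0),
# where lambda_i are the eigenvalues of X^{-1/2} H X^{-1/2}.
rng = np.random.default_rng(2)
B = rng.normal(size=(4, 4))
X = B @ B.T + 4.0 * np.eye(4)              # X in int(S^n_+)
C = rng.normal(size=(4, 4))
H = (C + C.T) / 2.0                        # H in S^n
w, V = np.linalg.eigh(X)
Xmh = V @ np.diag(w ** -0.5) @ V.T         # X^{-1/2}
lam = np.linalg.eigvalsh(Xmh @ H @ Xmh)
t = 0.05                                   # small enough that 1 + t*lam_i > 0
lhs = -np.log(np.linalg.det(X + t * H))
rhs = -np.sum(np.log(1.0 + t * lam)) - np.log(np.linalg.det(X))
assert abs(lhs - rhs) < 1e-8
```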
Remark: Indeed, any ν-self-concordant barrier for the cone S^n_+ has ν ≥ n.
As a result, the function built from the above barriers,

F̃(x) := F(Ax − b)

is a self-concordant barrier for X := {x : Ax − b ∈ K}.
22.2.2 Linear Program: IPM vs Ellipsoid Method
For instance, consider the linear program (m > n)

min  c^T x
s.t. a_j^T x ≥ bj,  j = 1, ..., m

The barrier function

F(x) = −∑_{j=1}^m ln(a_j^T x − bj)

is an m-self-concordant barrier for the constraint set.
We can compute the gradient and Hessian:

∇F(x) = −∑_{j=1}^m aj / (a_j^T x − bj),   ∇²F(x) = ∑_{j=1}^m aj aj^T / (a_j^T x − bj)²
The arithmetic costs of computing F(x), ∇F(x), ∇²F(x) are O(mn), O(mn), O(mn²), respectively. The arithmetic cost of a Newton step, i.e. solving [∇²F(x)]y = ∇F(x), is O(n³). Hence, the overall arithmetic cost of finding an ε-solution to the linear program is

O(mn²) · O(√m log(m/ε)) = O(m^{3/2} n² log(m/ε))
In contrast, when applying the ellipsoid method, the overall arithmetic cost is

O(mn + n²) · O(n² log(1/ε)) = O(mn³ log(1/ε))

The interior-point method is more efficient when m is not too large, i.e. m ≤ O(n²).
22.3 Primal-Dual Path-Following IPM
When solving conic programs, ideally we would like to develop interior point methods that
• produce primal-dual pairs at each iteration
• handle equality constraints
• require no prior knowledge of a strictly feasible solution
• adjust penalty based on current solution
Key idea: To approximate the KKT conditions.
We focus on the SDP case. Consider the standard form of the primal problem

min  Tr(CX)
s.t. Tr(Ai X) = bi,  i = 1, ..., m   (P)
     X ⪰ 0

and its dual problem

max_{y,Z}  b^T y
s.t.  ∑_{i=1}^m yi Ai + Z = C   (D)
      Z ⪰ 0

Assume (P) and (D) are strictly primal-dual feasible, so there is no duality gap.

KKT conditions for (P) and (D):

X* ⪰ 0,  Z* ⪰ 0
Tr(Ai X*) = bi,  ∀i = 1, ..., m          (primal-dual feasibility)
∑_{i=1}^m yi* Ai + Z* = C
X* Z* = 0                                 (complementary slackness)
We now consider the barrier problem of the primal problem

min  Tr(CX) − µ ln(det(X))
s.t. Tr(Ai X) = bi,  i = 1, ..., m   (BP)

and that of the dual problem

max_{y,Z}  b^T y + µ ln(det(Z))
s.t.  ∑_{i=1}^m yi Ai + Z = C   (BD)

In fact, these are Lagrange duals of each other, up to a constant.

KKT conditions for (BP) and (BD):

Tr(Ai X*(µ)) = bi,  ∀i = 1, ..., m       (primal-dual feasibility)
∑_{i=1}^m yi*(µ) Ai + Z*(µ) = C
X*(µ) Z*(µ) = µI                          (complementary slackness)

The duality gap at (X*(µ), y*(µ)) is

Tr(C X*(µ)) − b^T y*(µ) = Tr(Z*(µ) X*(µ)) = µn

As µ → 0, the duality gap vanishes, and (X*(µ), y*(µ), Z*(µ)) → (X*, y*, Z*).

The set {(X*(µ), y*(µ), Z*(µ)) : µ > 0} is called the primal-dual central path.
Newton step: Find a direction (∆X, ∆y, ∆Z) by solving the equations

Tr(Ai(X + ∆X)) = bi,  i = 1, ..., m
∑_{i=1}^m (yi + ∆yi) Ai + (Z + ∆Z) = C
(X + ∆X)(Z + ∆Z) = µI

=⇒

Tr(Ai ∆X) = 0,  i = 1, ..., m
∑_{i=1}^m ∆yi Ai + ∆Z = 0               (⋆)
(X + ∆X)(Z + ∆Z) = µI
Basic primal-dual scheme
Initialize (X, y, Z) = (X⁰, y⁰, Z⁰) with X⁰ ≻ 0, Z⁰ ≻ 0.
At each iteration:
• compute µ = Tr(XZ)/n and set µ ← µ/2
• compute (∆X, ∆y, ∆Z) by solving the KKT equations (⋆)
• update (X, y, Z) ← (X + α∆X, y + β∆y, Z + β∆Z) with proper α, β that preserve positive definiteness of (X, Z).
Remark (Approximation of the KKT equations). Note that only the last equation in the system of KKT conditions is nonlinear. One can apply a first-order approximation (dropping the second-order term ∆X∆Z):

    µI = (X + ∆X)(Z + ∆Z) ≈ XZ + ∆X Z + X ∆Z

and solve the linearized KKT equations.
and solve the linearized KKT equations.
Lecture 23: CVX Tutorial
(Slides Origin: Boyd & Vandenberghe)
IE 521: Convex Optimization Spring 2017
April 19, 2017
Cone program solvers
• LP solvers
– many, open source and commercial
• cone solvers
– each handles combinations of a subset of LP, SOCP, SDP, EXP cones
– open source: SDPT3, SeDuMi, CVXOPT, CSDP, ECOS, SCS, . . .
– commercial: Mosek, Gurobi, Cplex, . . .
Transforming problems to cone form
• lots of tricks for transforming a problem into an equivalent cone program
– introducing slack variables
– introducing new variables that upper bound expressions
• these tricks greatly extend the applicability of cone solvers
• writing code to carry out this transformation is painful
• modeling systems automate this step
Modeling systems
a typical modeling system
• automates transformation to cone form; supports
– declaring optimization variables
– describing the objective function
– describing the constraints
– choosing (and configuring) the solver
• when given a problem instance, calls the solver
• interprets and returns the solver’s status (optimal, infeasible, . . . )
• (when solved) transforms the solution back to original form
Some current modeling systems
• AMPL & GAMS (proprietary)
– developed in the 1980s, still widely used in traditional OR
– no support for convex optimization
• YALMIP (‘Yet Another LMI Parser’, matlab)
– first object-oriented convex optimization modeling system
• CVX (matlab)
• CVXPY (python, GPL)
• Convex.jl (Julia, GPL, merging into JUMP)
• CVX, CVXPY, and Convex.jl collectively referred to as CVX*
Disciplined convex programming
• describe objective and constraints using expressions formed from
– a set of basic atoms (affine, convex, concave functions)
– a restricted set of operations or rules (that preserve convexity)
• modeling system keeps track of affine, convex, concave expressions
• rules ensure that
– expressions recognized as convex are convex
– but, some convex expressions are not recognized as convex
• problems described using DCP are convex by construction
• all convex optimization modeling systems use DCP
CVX
• uses DCP
• runs in Matlab, between the cvx_begin and cvx_end commands
• relies on SDPT3 or SeDuMi (LP/SOCP/SDP) solvers
• refer to user guide, online help for more info
• the CVX example library has more than a hundred examples
Example: Constrained norm minimization
A = randn(5, 3);
b = randn(5, 1);
cvx_begin
variable x(3);
minimize(norm(A*x - b, 1))
subject to
-0.5 <= x;
x <= 0.3;
cvx_end
• between cvx_begin and cvx_end, x is a CVX variable
• statement subject to does nothing, but can be added for readability
• inequalities are interpreted elementwise
What CVX does
after cvx_end, CVX
• transforms problem into an LP
• calls solver SDPT3
• overwrites (object) x with (numeric) optimal value
• assigns problem optimal value to cvx_optval
• assigns problem status (which here is Solved) to cvx_status
(had the problem been infeasible, cvx_status would be Infeasible and x would be NaN)
Declare variables
• declare variables with variable name[(dims)] [attributes]
– variable x(3);
– variable C(4,3);
– variable S(3,3) symmetric;
– variable D(3,3) diagonal;
– variables y z;
Affine expressions

• form affine expressions

A = randn(4, 3);
variables x(3) y(4);

– 3*x + 4
– A*x - y
– x(2:3)
– sum(x)
Some functions

function              meaning                                attributes
norm(x, p)            ‖x‖_p                                  cvx
square(x)             x²                                     cvx
square_pos(x)         (x₊)²                                  cvx, nondecr
pos(x)                x₊                                     cvx, nondecr
sum_largest(x,k)      x_[1] + · · · + x_[k]                  cvx, nondecr
sqrt(x)               √x (x ≥ 0)                             ccv, nondecr
inv_pos(x)            1/x (x > 0)                            cvx, nonincr
max(x)                max{x_1, . . . , x_n}                  cvx, nondecr
quad_over_lin(x,y)    x²/y (y > 0)                           cvx, nonincr in y
lambda_max(X)         λ_max(X) (X = X^T)                     cvx
huber(x)              x² if |x| ≤ 1; 2|x| − 1 if |x| > 1     cvx
Composition rules
• for convex h, h(g1, . . . , gk) is recognized as convex if, for each i,

– gi is affine, or
– gi is convex and h is nondecreasing in its ith arg, or
– gi is concave and h is nonincreasing in its ith arg

• for concave h, h(g1, . . . , gk) is recognized as concave if, for each i,

– gi is affine, or
– gi is convex and h is nonincreasing in its ith arg, or
– gi is concave and h is nondecreasing in its ith arg

• can combine atoms using valid composition rules, e.g.:

– a convex function of an affine function is convex
– the negative of a convex function is concave
– a convex, nondecreasing function of a convex function is convex
– a concave, nondecreasing function of a concave function is concave
Valid (recognized) examples
u, v, x, y are scalar variables; X is a symmetric 3× 3 variable
• convex:
– norm(A*x - y) + 0.1*norm(x, 1)
– quad_over_lin(u - v, 1 - square(v))
– lambda_max(2*X - 4*eye(3))
– norm(2*X - 3, ’fro’)
• concave:
– min(1 + 2*u, 1 - max(2, v))
– sqrt(v) - 4.55*inv_pos(u - v)
Rejected examples
u, v, x, y are scalar variables
• neither convex nor concave:
– square(x) - square(y)
– norm(A*x - y) - 0.1*norm(x, 1)
• rejected due to limited DCP ruleset:
– sqrt(sum(square(x))) (is convex; could use norm(x))
– square(1 + x^2) (is convex; could use square_pos(1 + x^2), or 1 + 2*pow_pos(x, 2) + pow_pos(x, 4))
Sets
• some constraints are more naturally expressed with convex sets
• sets in CVX work by creating unnamed variables constrained to the set
• examples:
– semidefinite(n)
– nonnegative(n)
– simplex(n)
– lorentz(n)
• semidefinite(n), say, returns an unnamed (symmetric matrix) variable that is constrained to be positive semidefinite
Using the semidefinite cone
variables: X (symmetric matrix), z (vector), t (scalar)
constants: A and B (matrices)
• X == semidefinite(n)
– means X ∈ S^n_+ (or X ⪰ 0)
• A*X*A’ - X == B*semidefinite(n)*B’
– means ∃ Z ⪰ 0 so that AXA^T − X = BZB^T
• [X z; z’ t] == semidefinite(n+1)
– means

    [ X    z ]
    [ z^T  t ]  ⪰ 0
Objectives and constraints
• objective can be
– minimize(convex expression)
– maximize(concave expression)
– omitted (feasibility problem)
• constraints can be
– convex expression <= concave expression
– concave expression >= convex expression
– affine expression == affine expression
– omitted (unconstrained problem)
More involved example
A = randn(5);
A = A’*A;
cvx_begin
variable X(5, 5) symmetric;
variable y;
minimize(norm(X) - 10*sqrt(y))
subject to
X - A == semidefinite(5);
X(2,5) == 2*y;
X(3,1) >= 0.8;
y <= 4;
cvx_end
Defining new functions
• can make a new function using existing atoms
• example: the convex deadzone function
f(x) = max{|x| − 1, 0} = { 0 if |x| ≤ 1;  x − 1 if x > 1;  1 − x if x < −1 }
• create a file deadzone.m with the code
function y = deadzone(x)
y = max(abs(x) - 1, 0)
• deadzone makes sense both within and outside of CVX
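For reference, a plain-Python numeric counterpart of the same function (a sketch; the MATLAB file above is what CVX actually uses inside models):

```python
def deadzone(x):
    # f(x) = max(|x| - 1, 0): zero on [-1, 1], |x| - 1 outside
    return max(abs(x) - 1, 0)

assert deadzone(0.5) == 0
assert deadzone(2.0) == 1.0
assert deadzone(-3.0) == 2.0
```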
Defining functions via incompletely specified problems
• suppose f0, . . . , fm are convex in (x, z)
• let φ(x) be the optimal value of the convex problem, with variable z and parameter x
minimize f0(x, z)
subject to fi(x, z) ≤ 0, i = 1, . . . ,m
A1x+A2z = b
• φ is a convex function
• problem above sometimes called incompletely specified since x isn't (yet) given
• an incompletely specified concave maximization problem defines a concave function
CVX functions via incompletely specified problems
implement in CVX with

function cvx_optval = phi(x)
cvx_begin
variable z;
minimize(f0(x, z))
subject to
f1(x, z) <= 0; ...
A1*x + A2*z == b;
cvx_end
• function phi will work for numeric x (by solving the problem)
• function phi can also be used inside a CVX specification, wherever a convex function can be used
Simple example: Two element max
• create file max2.m containing
function cvx_optval = max2(x, y)
cvx_begin
variable t;
minimize(t)
subject to
x <= t;
y <= t;
cvx_end
• the constraints define the epigraph of the max function
• could add logic to return max(x,y) when x, y are numeric (otherwise, an LP is solved to evaluate the max of two numbers!)
A more complex example
• f(x) = x + x^1.5 + x^2.5, with dom f = R_+, is a convex, monotone increasing function

• its inverse g = f^{−1} is concave, monotone increasing, with dom g = R_+
• there is no closed form expression for g
• g(y) is optimal value of problem
maximize   t
subject to t₊ + t₊^1.5 + t₊^2.5 ≤ y

(for y < 0, this problem is infeasible, so the optimal value is −∞)
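Since f is continuous and strictly increasing on R_+ with f(0) = 0, g can also be evaluated numerically without any optimization, e.g. by bisection (a Python sketch, separate from the CVX implementation below; the tolerance is an arbitrary choice):

```python
def f(x):
    return x + x**1.5 + x**2.5

def g(y, tol=1e-10):
    # inverse of f on R_+: solve f(x) = y by bisection
    if y < 0:
        return float("-inf")        # the defining problem is infeasible
    lo, hi = 0.0, 1.0
    while f(hi) < y:                # bracket the solution
        hi *= 2
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if f(mid) < y else (lo, mid)
    return (lo + hi) / 2

assert abs(f(g(14.3)) - 14.3) < 1e-6
```

Bisection only evaluates g at numeric points, whereas the CVX definition below lets g appear symbolically inside other convex programs.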
• implement as

function cvx_optval = g(y)
cvx_begin
variable t;
maximize(t)
subject to
pos(t) + pow_pos(t, 1.5) + pow_pos(t, 2.5) <= y;
cvx_end
• use it as an ordinary function, as in g(14.3), or within CVX as a concave function:

cvx_begin
variables x y;
minimize(quad_over_lin(x, y) + 4*x + 5*y)
subject to
g(x) + 2*g(y) >= 2;
cvx_end
Example
• optimal value of LP
f(c) = inf{c^T x | Ax ⪯ b}

is a concave function of c
• by duality (assuming feasibility of Ax ⪯ b) we have

f(c) = sup{−λ^T b | A^T λ + c = 0, λ ⪰ 0}
• define f in CVX as
function cvx_optval = lp_opt_val(A,b,c)
cvx_begin
variable lambda(length(b));
maximize(-lambda’*b);
subject to
A’*lambda + c == 0; lambda >= 0;
cvx_end
• in lp_opt_val(A,b,c), A and b must be constant; c can be affine
CVX hints/warnings
• watch out for = (assignment) versus == (equality constraint)
• X >= 0, with matrix X, is an elementwise inequality
• X >= semidefinite(n) means: X is elementwise larger than some positive semidefinite matrix (which is likely not what you want)
• writing subject to is unnecessary (but can look nicer)
• many problems traditionally stated using convex quadratic forms can be posed as norm problems (which can have better numerical properties):
x’*P*x <= 1 can be replaced with norm(chol(P)*x) <= 1
Useful Resources
• http://cvxr.com/cvx/
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 24: Subgradient Method – April 24
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email the TA ([email protected]) if you find any typos or mistakes.
24.1 Optimization Algorithms and Convergence Rates
Till now, we have discussed several important algorithms for solving convex optimization:
• Ellipsoid method: poly-time algorithm, black-box method, requires first-order and separation oracles
• Interior point method: poly-time algorithm, barrier method, requires structural assumptions on the domain and self-concordant barriers
• Newton method: (locally) quadratically convergent algorithm, black-box method, requires smoothness assumptions on the objective and first-order and second-order oracles
While the above algorithms can solve convex programs to high accuracy within a small number of iterations, they suffer from very expensive iteration cost (often cubic in the problem size), which eventually becomes impractical for large-scale convex problems.
First-order Methods:
For large-scale convex optimization, simpler algorithms such as first-order methods essentially become the only methods of choice. There exists a dedicated library of efficient first-order optimization algorithms:
• Gradient descent
• Nesterov’s accelerated gradient descent and variants (FISTA, geometric descent, etc)
• Coordinate descent and many variants
• Conditional gradient (a.k.a. Frank-Wolfe method)
• Subgradient methods
• Primal-dual methods (Arrow-Hurwicz method, etc)
• Proximal and operator splitting methods (proximal gradient method, ADMM, etc)
• Stochastic and incremental gradient methods (stochastic gradient descent, SVRG, etc)
In the rest of the semester we will present several primary methods, mainly for generic non-differentiable constrained problems.
Rate of Convergence:
Suppose the sequence {x_k} converges to x*.
Definition 24.1 (Q-convergence) The convergence rate is said to be

• linear: if ∃ q ∈ (0, 1) such that lim sup_{k→+∞} ‖x_{k+1} − x*‖/‖x_k − x*‖ = q

• superlinear: if lim sup_{k→+∞} ‖x_{k+1} − x*‖/‖x_k − x*‖ = 0

• sublinear: if lim sup_{k→+∞} ‖x_{k+1} − x*‖/‖x_k − x*‖ = 1

• quadratic: if lim sup_{k→+∞} ‖x_{k+1} − x*‖/‖x_k − x*‖² < +∞

These are also called Q-convergence rates (Q for quotient).
Example: x_k = 1/2^k, 1/k, (1/2)^{2^k} are Q-linear, Q-sublinear, and Q-superlinear convergent, respectively.

Note that Q-convergence can be troublesome; for example, consider the sequence

x_k = { 1 + 1/2^k,  k is even
        1,          k is odd. }
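The three example rates can also be seen numerically (a small Python sketch; the values of k are arbitrary):

```python
def ratio(seq, k):
    # successive-error quotient ||x_{k+1} - x*|| / ||x_k - x*||, with x* = 0
    return seq(k + 1) / seq(k)

lin = lambda k: 1 / 2**k          # Q-linear: quotient -> 1/2
sub = lambda k: 1 / k             # Q-sublinear: quotient -> 1
sup = lambda k: 0.5**(2**k)       # Q-superlinear: quotient -> 0

assert ratio(lin, 20) == 0.5
assert abs(ratio(sub, 20) - 1.0) < 0.1
assert ratio(sup, 8) < 1e-9
```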
Definition 24.2 (R-convergence) We say {x_k} converges to x* R-linearly if ∃ {δ_k} such that ‖x_k − x*‖ ≤ δ_k and δ_k converges Q-linearly to 0.
For simplicity, we will drop the ‘Q’ and ‘R’.
24.2 Subgradient Method
Consider the generic convex minimization
min f(x)
s.t. x ∈ X
where
• f is convex and possibly non-differentiable
• X is non-empty, closed and convex.
Assume the problem is solvable, with optimal solution and value denoted by x* and f*. Recall that a vector g ∈ ∂f(x) is called a subgradient of f at x if

f(y) ≥ f(x) + g^T(y − x), ∀y ∈ dom(f)

The subgradient inequality can be interpreted as a supporting hyperplane for the level set L_{f(x)}(f) = {y : f(y) ≤ f(x)}:

g^T(y − x) ≤ 0, ∀y ∈ L_{f(x)}(f) = {y : f(y) ≤ f(x)}
24.2.1 Subgradient Method (N.Shor, 1967)
Subgradient method works as follows: start with x1 ∈ X and update
xt+1 = ΠX(xt − γtgt)
where
• gt ∈ ∂f(xt) is a subgradient of f at xt.
• γt > 0 is a proper stepsize
• ΠX(x) = arg miny∈X ‖ y − x ‖2 is the Euclidean projection.
Remark:
• When X = R^n and f is continuously differentiable, this reduces to

x_{t+1} = x_t − γ_t ∇f(x_t)

which is the well-known Gradient Descent method
• Unlike Gradient Descent, the negative subgradient direction is not always a descent direction.
Choices of Stepsize:
• Constant stepsize: γ_t = γ, ∀t

• Diminishing stepsize: γ_t → 0 and ∑_{t=1}^∞ γ_t = +∞, e.g. γ_t = 1/√t

• Square summable stepsize: ∑_{t=1}^∞ γ_t² < +∞ and ∑_{t=1}^∞ γ_t = +∞, e.g. γ_t = 1/t

• Scaled stepsize: γ_t = γ/‖g_t‖₂

• Dynamic stepsize (Polyak's optimal stepsize): γ_t = (f(x_t) − f*)/‖g_t‖₂²
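The method fits in a few lines of NumPy (a sketch; the objective ‖x − c‖₁ over a box is an illustrative instance with known optimum, not an example from the lecture). It uses the diminishing stepsize γ_t = 1/√t and box projection by clipping:

```python
import numpy as np

def subgradient_method(c, lo, hi, T=500):
    # minimize f(x) = ||x - c||_1 over the box X = [lo, hi]^n
    x = np.clip(np.zeros_like(c), lo, hi)        # x_1 in X
    best = np.abs(x - c).sum()
    for t in range(1, T + 1):
        g = np.sign(x - c)                       # g_t in subdifferential of f at x_t
        x = np.clip(x - g / np.sqrt(t), lo, hi)  # x_{t+1} = Pi_X(x_t - gamma_t g_t)
        best = min(best, np.abs(x - c).sum())
    return best

best = subgradient_method(np.array([2.0, -1.0]), 0.0, 1.0)
assert abs(best - 2.0) < 1e-6                    # optimum x* = (1, 0), f* = 2
```

Tracking the best value seen is necessary because, as noted above, individual subgradient steps need not decrease f.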
IE 521: Convex Optimization Spring 2017, UIUC
Lecture 25: Subgradient Method and Bundle Methods – April 24
Instructor: Niao He Scribe: Shuanglong Wang
Courtesy warning: These notes do not necessarily cover everything discussed in the class. Please email the TA ([email protected]) if you find any typos or mistakes.
In this lecture, we cover the following topics
• Subgradient Method
• Bundle Method
– Kelley cutting plane method
– Level method
Reference: Nesterov, 2004, Sections 3.2.3 and 3.3.
25.1 Subgradient Method
Recall that subgradient method works as follows
xt+1 = ΠX(xt − γtgt), t = 1, 2, ...
where gt ∈ ∂f(xt), γt > 0 and ΠX(x) = arg miny∈X ‖ y − x ‖2 is the projection operator.
Note that the projection on X is easy to compute when X is simple, e.g. X is a ball, box, simplex, polyhedron, etc.
Lemma 25.1 (Projection) ∀x ∈ R^n, z ∈ X,

‖x − z‖₂² ≥ ‖x − Π_X(x)‖₂² + ‖z − Π_X(x)‖₂²

Proof: When x ∈ X, the inequality immediately holds true. Let x ∉ X. By definition,

Π_X(x) = arg min_{z∈X} ‖z − x‖₂²

By the optimality condition, this implies 2(Π_X(x) − x)^T(z − Π_X(x)) ≥ 0, ∀z ∈ X. Hence,

‖x − z‖₂² = ‖x − Π_X(x) + Π_X(x) − z‖₂² ≥ ‖x − Π_X(x)‖₂² + ‖Π_X(x) − z‖₂²
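The lemma is easy to verify numerically when X is a box, where Π_X is coordinatewise clipping (a NumPy sketch with arbitrary random points; the box [−1, 1]³ is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)
lo, hi = -1.0, 1.0                      # X = [-1, 1]^3

for _ in range(100):
    x = rng.uniform(-5, 5, 3)           # arbitrary point
    z = rng.uniform(lo, hi, 3)          # any z in X
    p = np.clip(x, lo, hi)              # Pi_X(x) for a box is clipping
    assert np.sum((x - z)**2) >= np.sum((x - p)**2) + np.sum((z - p)**2) - 1e-9
print("projection inequality verified on random samples")
```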
Lemma 25.2 (Key relation) For the subgradient method, we have

‖x_{t+1} − x*‖₂² ≤ ‖x_t − x*‖₂² − 2γ_t(f(x_t) − f*) + γ_t²‖g_t‖₂²   (⋆)

Proof:

‖x_{t+1} − x*‖₂² = ‖Π_X(x_t − γ_t g_t) − x*‖₂²
                 ≤ ‖x_t − γ_t g_t − x*‖₂²
                 = ‖x_t − x*‖₂² − 2γ_t g_t^T(x_t − x*) + γ_t²‖g_t‖₂²

Due to convexity of f, we have f* ≥ f(x_t) + g_t^T(x* − x_t), i.e.

g_t^T(x_t − x*) ≥ f(x_t) − f*

Combining these two inequalities leads to the desired result.
Remark: Note that when f* is known, we can choose the 'optimal' γ_t by minimizing the right hand side of (⋆): γ_t* = (f(x_t) − f*)/‖g_t‖₂², which is Polyak's stepsize.
In fact, knowing f* is not a problem sometimes. For instance, when the goal is to solve the convex feasibility problem — find x ∈ X such that g_i(x) ≤ 0, i = 1, ..., m — we can formulate this as

min_{x∈X} max_{1≤i≤m} g_i(x)   or   min_{x∈X} ∑_{i=1}^m max(g_i(x), 0)

The optimal value f* is known to be 0 in this case.
If f∗ is not known, one can replace f∗ by its online estimate.
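This use case can be sketched in NumPy (the halfspaces A, b below are an arbitrary illustrative instance): for f(x) = max_i (a_i^T x − b_i) with X = R², we know f* ≤ 0 at any feasible point, so taking the Polyak step with target 0 projects x onto the most violated halfspace.

```python
import numpy as np

A = np.array([[1.0, 1.0], [-1.0, 2.0], [3.0, -1.0]])   # find x with A x <= b
b = np.array([1.0, 0.5, 2.0])

x = np.array([10.0, 10.0])               # infeasible start
for _ in range(200):
    viol = A @ x - b
    i = int(np.argmax(viol))
    if viol[i] <= 1e-9:                  # feasible: objective has hit the target 0
        break
    g = A[i]                             # subgradient of the active piece
    x = x - (viol[i] / (g @ g)) * g      # Polyak step: (f(x) - f*) / ||g||^2

assert np.all(A @ x - b <= 1e-6)
```

Each such step is exactly the orthogonal projection onto the boundary of the most violated halfspace, a classical relaxation scheme for linear feasibility.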
Theorem 25.3 (Convergence) Suppose f(x) is convex and Lipschitz continuous on X:

|f(x) − f(y)| ≤ M_f ‖x − y‖₂, ∀x, y ∈ X

where M_f < +∞. Then the subgradient method satisfies:

f(x̄_T) − f* ≤ (‖x_1 − x*‖₂² + ∑_{t=1}^T γ_t² M_f²) / (2 ∑_{t=1}^T γ_t)

where x̄_T = (∑_{t=1}^T γ_t)^{−1} (∑_{t=1}^T γ_t x_t).
Proof: The Lipschitz continuity implies that ‖g_t‖₂ ≤ M_f, ∀t. Summing up the key relation (⋆) from t = 1 to t = T, we obtain

2 ∑_{t=1}^T γ_t (f(x_t) − f*) ≤ ‖x_1 − x*‖₂² − ‖x_{T+1} − x*‖₂² + ∑_{t=1}^T γ_t² M_f²

By convexity of f: (∑_{t=1}^T γ_t) f(x̄_T) ≤ ∑_{t=1}^T γ_t f(x_t). This further leads to

(∑_{t=1}^T γ_t)[f(x̄_T) − f*] ≤ (1/2)‖x_1 − x*‖₂² + (M_f²/2) ∑_{t=1}^T γ_t²

and concludes the proof.
Convergence under various stepsizes. Assume D_X = max_{x,y∈X} ‖x − y‖₂ is the diameter of the set X. It is interesting to see how the bound in the above theorem implies convergence, and even the convergence rate, for different choices of stepsizes. By abuse of notation, we denote both min_{1≤t≤T} f(x_t) − f* and f(x̄_T) − f* as ε_T.
1. Constant stepsize: with γ_t ≡ γ,

ε_T ≤ (D_X² + Tγ²M_f²)/(2Tγ) = (D_X²/(2T))·(1/γ) + (M_f²/2)·γ  →  (M_f²/2)·γ  as T → ∞.

It is worth noticing that the error upper bound does not diminish to zero as T grows to infinity, which shows one of the drawbacks of using arbitrary constant stepsizes. In addition, to optimize the upper bound, we can select the optimal stepsize γ*:

γ* = D_X/(M_f √T)  ⇒  ε_T ≤ D_X M_f/√T.

Under this optimal choice ε_T ∼ O(D_X M_f/√T). However, this exhibits another drawback of constant stepsizes: in practice T is not known in advance for evaluating the optimal γ*.
2. Scaled stepsize: with γ_t = γ/‖g(x_t)‖₂, using ‖g(x_t)‖₂ ≤ M_f,

ε_T ≤ (D_X² + γ²T)/(2γ ∑_{t=1}^T 1/‖g(x_t)‖₂) ≤ M_f ((D_X²/(2T))·(1/γ) + γ/2)  →  (M_f/2)·γ  as T → ∞.

Similarly, we can select the optimal γ by minimizing the right hand side, i.e. γ* = D_X/√T:

γ_t = D_X/(√T ‖g(x_t)‖₂)  ⇒  ε_T ≤ D_X M_f/√T.

The same convergence rate is achieved, while the same drawback of not knowing T in advance still exists in choosing γ_t.
3. Non-summable but diminishing stepsize:

ε_T ≤ (D_X² + ∑_{t=1}^T γ_t² M_f²) / (2 ∑_{t=1}^T γ_t)
    ≤ (D_X² + ∑_{t=1}^{T_1} γ_t² M_f²) / (2 ∑_{t=1}^T γ_t) + (M_f² ∑_{t=T_1+1}^T γ_t²) / (2 ∑_{t=T_1+1}^T γ_t)

where 1 ≤ T_1 ≤ T. When T → ∞, select a large T_1: the first term on the right hand side → 0 since {γ_t} is non-summable, and the second term also → 0 because γ_t² always approaches zero faster than γ_t. Consequently, we know that

ε_T → 0 as T → ∞.
An example choice of the stepsize is γ_t = O(1/t^q) with q ∈ (0, 1]. As in the above cases, if we choose

γ_t = D_X/(M_f √t)  ⇒  ε_T ≤ O(D_X M_f ln(T)/√T).

In fact, if we average from T/2 instead of 1, we have

min_{T/2≤t≤T} f(x_t) − f* ≤ O(M_f D_X/√T).
4. Non-summable but square-summable stepsize: It is obvious that

ε_T ≤ (D_X² + M_f² ∑_{t=1}^T γ_t²) / (2 ∑_{t=1}^T γ_t)  →  0  as T → ∞.

A typical choice is γ_t = 1/t^{(1+q)/2} with q ∈ (0, 1), which results in a rate of O(1/T^{(1−q)/2}), approaching O(1/√T) for small q.
5. Polyak stepsize: The stepsize yields

‖x_{t+1} − x*‖₂² ≤ ‖x_t − x*‖₂² − (f(x_t) − f*)²/‖g(x_t)‖₂²,   (25.1)

which guarantees that ‖x_t − x*‖₂² decreases at each step. Applying (25.1) recursively, we obtain

∑_{t=1}^T (f(x_t) − f*)² ≤ D_X² · M_f² < ∞.

Therefore we have ε_T → 0 as T → ∞, and ε_T ≤ O(1/√T).
Corollary 25.4 When T is known, setting γ_t ≡ D_X/(M_f √T), in particular, we have

f(x̄_T) − f* ≤ D_X M_f/√T
Remark: The subgradient method converges sublinearly. For an accuracy ε > 0, it needs O(D_X² M_f²/ε²) iterations.
25.2 Bundle Method
When running the subgradient method, we obtain a bundle of affine underestimates of f(x):

f(x_t) + g_t^T(x − x_t), t = 1, 2, ...
Definition 25.5 The piecewise linear function

f_t(x) = max_{1≤i≤t} {f(x_i) + g_i^T(x − x_i)}

where g_i ∈ ∂f(x_i), is called the t-th model of the convex function f.
Note
1. ft(x) ≤ f(x),∀x ∈ X
2. ft(xi) = f(xi), ∀1 ≤ i ≤ t
3. f1(x) ≤ f2(x) ≤ ... ≤ ft(x) ≤ ... ≤ f(x)
25.2.1 Kelley method (Kelley, 1960)
The Kelley method works as follows:
x_{t+1} = arg min_{x∈X} f_t(x)

Obviously, the above algorithm converges as long as X is compact. The auxiliary problem is not too demanding (it reduces to an LP) when X is a polyhedron. However, the issue is that x_{t+1} is not unique, and the Kelley method can be very unstable. Indeed, the worst-case complexity of the Kelley method is at least O(1/ε^{(n−1)/2}).
Remedy: To prevent the instability issue, a possible remedy is to update x_{t+1} by

x_{t+1} = arg min_{x∈X} { f_t(x) + (α_t/2)‖x − x_t‖₂² }

with properly selected α_t > 0.
25.2.2 Level Method (Lemaréchal, Nemirovski, Nesterov, 1995)
Denote

f̲_t = min_{x∈X} f_t(x)       (minimal value of the model)
f̄_t = min_{1≤i≤t} f(x_i)     (record value of the model)

we have f̲_1 ≤ f̲_2 ≤ ... ≤ f* ≤ ... ≤ f̄_2 ≤ f̄_1.

Denote the level set

L_t = {x ∈ X : f_t(x) ≤ l_t := (1 − α)f̲_t + αf̄_t}

Note that L_t is nonempty, convex and closed, and doesn't contain the search points x_1, ..., x_t.
The Level method works as follows:

x_{t+1} = Π_{L_t}(x_t) = arg min_{x∈X} { ‖x − x_t‖₂² : f_t(x) ≤ l_t }

Note that when α = 0, this reduces to the Kelley method; when α = 1, there will be no progress.
The auxiliary problem reduces to a quadratic program when X is a polyhedron.
Theorem 25.6 When t > (1/((1 − α)² α (2 − α))) (M_f D_X/ε)², we have

f̄_t − f̲_t ≤ ε

where M_f is the Lipschitz constant and D_X is the diameter of the set X.
Remark: The Level method achieves the same complexity as the subgradient method (which is indeed unimprovable), but can perform much better than the subgradient method in practice.
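In one dimension the level-set projection is easy, since the sublevel set of the piecewise-linear model is an interval, so the whole scheme can be sketched on a grid instead of with a QP solver (a NumPy sketch; the instance f(x) = |x − 0.3| on [−1, 1] and the choice α = 0.5 are illustrative):

```python
import numpy as np

f  = lambda x: abs(x - 0.3)                  # f* = 0, minimized at x = 0.3
df = lambda x: 1.0 if x >= 0.3 else -1.0     # a subgradient of f

grid = np.linspace(-1.0, 1.0, 20001)         # X = [-1, 1]
alpha, x_t, pts = 0.5, -1.0, []
for t in range(40):
    pts.append(x_t)
    model = np.max([f(p) + df(p) * (grid - p) for p in pts], axis=0)
    f_low = float(model.min())               # minimal value of the model
    f_rec = min(f(p) for p in pts)           # record value
    if f_rec - f_low <= 1e-4:                # certified gap below tolerance
        break
    level = (1 - alpha) * f_low + alpha * f_rec
    candidates = grid[model <= level]        # level set L_t (an interval here)
    x_t = float(candidates[np.argmin(np.abs(candidates - x_t))])  # Pi_{L_t}(x_t)

assert f_rec - f_low <= 1e-4
```

Note the built-in stopping certificate: f̄_t − f̲_t brackets f(x) − f*, which the plain subgradient method does not provide.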