
Part 1. Convex Sets, Functions and Optimization

Math 126 Winter 18

Date of current version: January 17, 2018

Donghwan Kim, Dartmouth College. E-mail: [email protected]

Abstract This note studies convex sets, functions and optimization. Many parts of this note are based on the chapters [1, Chapters 1,3,6-8] [2, Chapter 1] [3, Chapters 1-4, Appendix A] and their corresponding lecture notes available online by the authors.

Please email me if you find any typos or errors.

1 Introduction to Convex Optimization

We are interested in solving the following convex optimization problem:

min_{x∈X} f(x),   (1.1)

where we assume that

– f : R^d → R is a convex function, and
– X is a convex set.

An optimal (global) solution of (1.1) is

x∗ ∈ argmin_{x∈X} f(x),   (1.2)

and this has the smallest value of the function f among all vectors in the set X.

This note studies the definition and properties of convex sets, convex functions, and convex optimization problems. For the rest of the term, we will learn the fundamental optimization methods for solving (1.1); these include gradient/subgradient methods (Part 2), projected/proximal gradient methods (Part 3), dual-based methods (Parts 4,5), and Newton's methods (Part 6).

This section reviews four examples of convex optimization problems and methods that you are probably familiar with: a least-squares problem, the conjugate gradient method, Lagrange multipliers, and Newton's method.

1.1 Example 1: Least-Squares Problem (see [1, Chapter 3] [3, Chapter 1.2.1])

Consider the following linear system problem:

Find x ∈ Rd s.t. Ax ≈ b, (1.3)

where A ∈ R^{p×d} and b ∈ R^p. Suppose that the system is overdetermined, i.e., p > d, and A has full column rank, in which case the system usually has no exact solution. Then, a common approach for finding an approximate solution of (1.3) is to solve the following least-squares problem (an instance of a convex optimization problem):

min_{x∈R^d} (1/2)||Ax − b||_2^2 = min_{x∈R^d} (1/2) ∑_{i=1}^p (a_i^⊤x − b_i)^2,   (1.4)

where A = (a_1, . . . , a_p)^⊤ and b = (b_1, . . . , b_p)^⊤. If A has full column rank (so A^⊤A is invertible), we have the following (unique) analytical solution:

x∗ = (A^⊤A)^{−1} A^⊤b,   (1.5)

which is also the solution of the normal equations:

Find x ∈ R^d s.t. A^⊤Ax ≈ A^⊤b.   (1.6)

This can be solved in a time approximately proportional to p²d. If A has a special structure, one can exploit it to solve the problem faster. For example, if A is sparse, which means that it has far fewer than pd nonzero entries, one can solve the problem much faster than order p²d.
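As a concrete illustration, here is a minimal MATLAB sketch (with made-up problem data) that computes the solution (1.5) both from the normal equations and with the backslash operator; the latter is what we would usually use in practice:

  % Least-squares sketch: synthetic overdetermined system (illustrative data)
  p = 200; d = 20;
  A = randn(p, d);            % has full column rank with probability 1
  b = randn(p, 1);
  x_ne = (A'*A) \ (A'*b);     % solution (1.5) via the normal equations
  x_qr = A \ b;               % backslash (QR-based), numerically preferred
  norm(x_ne - x_qr)           % the two agree up to roundoff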

If the system is underdetermined (i.e., p < d), the least-squares solution is usually not a good estimate of the true vector x. One widely used technique for this case is to introduce a regularization term R(x) (i.e., incorporate prior knowledge on x) as

min_{x∈R^d} { (1/2)||Ax − b||_2^2 + λR(x) },   (1.7)

leading to a regularized least-squares problem. For a quadratic regularization (also known as Tikhonov regularization) R(x) = (1/2)||Dx||_2^2, where the (Tikhonov) matrix D is chosen so that A^⊤A + λD^⊤D is invertible, the analytical solution is

x∗ = (A^⊤A + λD^⊤D)^{−1} A^⊤b.   (1.8)

This is a solution of the following linear system:

Find x ∈ R^d s.t. (A^⊤A + λD^⊤D)x ≈ A^⊤b.   (1.9)

Usual choices of D are the identity matrix (when the true x is known to have a small value of ||x||_2^2) or a differencing matrix such as

D = [ 1 −1  0 · · ·  0  0
      0  1 −1 · · ·  0  0
      ⋮   ⋮  ⋮  ⋱   ⋮  ⋮
      0  0  0 · · ·  1 −1 ],   (1.10)

when the true x is known to be spatially smooth.

MATLAB's linear system solver (e.g., linsolve(A,b) or A\b), for example, uses LU factorization with partial pivoting when A is square and QR factorization with column pivoting otherwise. There are many other algorithms for solving a linear system problem or, equivalently, a least-squares problem. You will be asked in the homework to compare the MATLAB built-in linear system solver with one representative iterative solver, the conjugate gradient method.
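For the regularized problem, a minimal MATLAB sketch of the solution (1.8) with the differencing matrix (1.10) might look as follows; the sizes and the value of λ are arbitrary illustrative choices:

  % Regularized least squares with a first-order differencing D, cf. (1.8) and (1.10)
  p = 10; d = 50; lambda = 0.1;            % underdetermined: p < d
  A = randn(p, d); b = randn(p, 1);
  e = ones(d, 1);
  D = spdiags([e -e], [0 1], d-1, d);      % (d-1) x d differencing matrix (1.10)
  x_reg = (A'*A + lambda*(D'*D)) \ (A'*b); % regularized solution (1.8)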

Remark 1 The book [3, Chapter 1.2.1], written in 2004, states that solving least-squares problems that are not extremely large is a mature technology. (The book considers problems with millions of variables to be extremely large.) In the era of big data, we often encounter extremely large least-squares problems, so solving least-squares problems still has room for improvement.

Below are two follow-up questions that you will be able to (partially) answer after taking this course.


Q1. What can we do when we encounter a set of nonlinear equations h_i(x) ≈ b_i, i = 1, . . . , p (instead of a_i^⊤x ≈ b_i), and want to solve the following nonlinear least-squares problem?

min_{x∈R^d} { (1/2) ∑_{i=1}^p (h_i(x) − b_i)^2 + λR(x) }   (1.11)

Q2. What if one prefers a discrepancy measure different from ||Ax − b||_2^2? For example, to be robust to outliers, one may prefer solving

min_{x∈R^d} { ||Ax − b||_1 + λR(x) }.   (1.12)

1.2 Example 2: Conjugate Gradient Method

One representative iterative algorithm for solving a linear system (e.g., A^⊤Ax ≈ A^⊤b) is the conjugate gradient (CG) method. CG is also known to be an efficient solver for the following quadratic problem (also known as a quadratic programming problem):

min_{x∈R^d} { f(x) ≡ (1/2) x^⊤Qx + p^⊤x },   (1.13)

where Q is positive semidefinite (i.e., Q ⪰ 0). This is because the problem (1.13) and a least-squares problem are equivalent when we let Q = A^⊤A = ∇²f(x) and p = −A^⊤b. Note that they are also equivalent to the problem:

Find x ∈ R^d s.t. ∇f(x) = Qx + p = A^⊤Ax − A^⊤b = 0.   (1.14)

You will be asked to implement a conjugate gradient method in the homework. (The method and its MATLAB code are available at https://en.wikipedia.org/wiki/Conjugate_gradient_method.)
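For reference, a minimal MATLAB sketch of CG applied to the quadratic problem (1.13) (equivalently, to the linear system Qx = −p) might look as follows; the problem data, iteration count, and tolerance are illustrative choices only, not the homework solution:

  % Conjugate gradient sketch for min (1/2) x'Qx + p'x, i.e., solve Qx = -p
  A = randn(100, 20); b = randn(100, 1);
  Q = A'*A; p = -A'*b;              % so that (1.13) matches the least-squares problem
  x = zeros(20, 1);
  r = -p - Q*x;                     % residual = negative gradient
  dk = r;
  for k = 1:20
      alpha = (r'*r) / (dk'*Q*dk);  % exact line search along dk
      x = x + alpha*dk;
      r_new = r - alpha*(Q*dk);
      beta = (r_new'*r_new) / (r'*r);
      dk = r_new + beta*dk;         % next Q-conjugate direction
      r = r_new;
      if norm(r) < 1e-10, break; end
  end
  norm(Q*x + p)                     % gradient norm; should be ~0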

CG is an accelerated version of the following standard gradient (descent) method:

Algorithm 1 Gradient (descent) method
1: Input: x_0 ∈ R^d.
2: for k ≥ 0 do
3:   Choose an appropriate step size α_k (to be learned later).
4:   x_{k+1} = x_k − α_k ∇f(x_k) = x_k − α_k (Qx_k + p).
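A minimal MATLAB sketch of Algorithm 1 for the quadratic (1.13), with a constant step size picked by hand (an illustrative choice; step-size rules are studied in Part 2):

  % Gradient descent sketch for the quadratic (1.13)
  A = randn(100, 20); b = randn(100, 1);
  Q = A'*A; p = -A'*b;
  alpha = 1 / norm(Q);          % constant step, 1/L with L = ||Q||_2
  x = zeros(20, 1);
  for k = 1:500
      x = x - alpha*(Q*x + p);  % x_{k+1} = x_k - alpha * grad f(x_k)
  end
  norm(Q*x + p)                 % gradient norm decreases toward 0 as k grows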

We will answer the two follow-up questions below in later classes.

Q1. What are the convergence rates of these algorithms? What is meant by acceleration?
Q2. CG is a fast and efficient method for quadratic problems, but not necessarily for non-quadratic problems. What are the accelerated gradient methods for non-quadratic (or nonlinear) problems?

1.3 Example 3: Lagrange Multiplier

Consider the following equality-constrained optimization problem:

min_{x∈R^d} f(x)   (1.15)
subject to h_i(x) = 0, i = 1, . . . , n.

The corresponding Lagrangian function of (1.15) is

L(x, λ) = f(x) + ∑_{i=1}^n λ_i h_i(x),   (1.16)


where λ := (λ1, . . . , λn)⊤ is a Lagrange multiplier. Then, finding an optimal solution x∗ is known to be

equivalent to solving the following:

Find (x, λ) ∈ R^{d+n} s.t. ∇_{x,λ} L(x, λ) = 0,   (1.17)

where the constraint is equivalent to

∇f(x) + ∑_{i=1}^n λ_i ∇h_i(x) = 0,   h_i(x) = 0, i = 1, . . . , n.   (1.18)

Using the primal-dual relationship (to be learned later), solving (1.17) is equivalent to solving

min_x max_λ L(x, λ),   (1.19)

which is also known as a primal-dual problem or a saddle-point problem, where λ is a Lagrange dual variable. Finding a (saddle-point) solution (x∗, λ∗) of (1.19) (and (1.17)) is difficult in general, so iterative methods (that we will learn in the second half of this class) are necessary.
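In the special case of an equality-constrained quadratic problem, the stationarity condition (1.17) is just a linear system and can be solved directly. A minimal MATLAB sketch (with made-up problem data) is:

  % min (1/2)x'Qx + p'x  s.t.  Ax = b;  here (1.17) is the linear system
  %   [Q A'; A 0] [x; lam] = [-p; b]
  d = 10; n = 3;
  M = randn(d); Q = M'*M + eye(d);         % positive definite Q
  p = randn(d, 1); A = randn(n, d); b = randn(n, 1);
  sol = [Q A'; A zeros(n)] \ [-p; b];      % solve the stationarity condition (1.17)
  x = sol(1:d); lam = sol(d+1:end);
  [norm(Q*x + p + A'*lam), norm(A*x - b)]  % both residuals should be ~0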

Q1. What if we also have inequality constraints g_i(x) ≤ 0, i = 1, . . . , m? (Short answer: the Karush-Kuhn-Tucker (KKT) conditions.)
Q2. What is the (Lagrangian) dual problem of the primal problem (1.15)? (Short answer: max_λ {q(λ) ≡ min_x L(x, λ)}.)
Q3. What are dual-based methods? (We will study dual gradient methods, the augmented Lagrangian method, and the alternating direction method of multipliers (ADMM), among many others.)

1.4 Example 4: Newton’s Method

Consider the following problem:

min_{x∈R^d} f(x)   (1.20)

where f is assumed to be twice differentiable.

Given x_k, the next iterate x_{k+1} is chosen to minimize the quadratic approximation of the function around x_k:

x_{k+1} = argmin_{x∈R^d} { f(x_k) + 〈∇f(x_k), x − x_k〉 + (1/2)(x − x_k)^⊤ ∇²f(x_k)(x − x_k) }.   (1.21)

If ∇2f(xk) ≻ 0, this leads to

Algorithm 2 Newton's method
1: Input: x_0 ∈ R^d.
2: for k ≥ 0 do
3:   x_{k+1} = x_k − [∇²f(x_k)]^{−1} ∇f(x_k).

Each iteration k requires finding the Newton (descent) direction −[∇²f(x_k)]^{−1} ∇f(x_k) by solving the following linear system problem:

Find d_k ∈ R^d s.t. ∇²f(x_k) d_k ≈ ∇f(x_k).   (1.22)
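A minimal MATLAB sketch of Algorithm 2 on a made-up smooth, strongly convex test function (a practical implementation would add safeguards such as a line search, to be discussed in Part 6):

  % Newton's method sketch for f(x) = sum(exp(A*x)) + (1/2)||x||_2^2 (illustrative)
  A = randn(30, 5) / 10; x = zeros(5, 1);
  for k = 1:20
      w = exp(A*x);
      g = A'*w + x;                  % gradient of f at x
      H = A'*diag(w)*A + eye(5);     % Hessian of f at x (positive definite)
      dk = H \ g;                    % Newton direction, cf. (1.22)
      x = x - dk;                    % pure Newton step
      if norm(g) < 1e-10, break; end
  end
  norm(A'*exp(A*x) + x)              % gradient norm at the final iterate (~0)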

Newton's method (under certain assumptions) has a local quadratic rate of convergence (to be learned later), which is much faster than the (sub)linear rate of gradient methods. On the other hand, the per-iteration computational cost of Newton's method is much higher than that of gradient methods. We will further study the properties of Newton's method and answer the following questions in Part 6.

Q1. Which additional conditions are required to guarantee convergence of Newton's method (in addition to having ∇²f(x) ≻ 0 for every x)?
Q2. What are the variants of Newton's method? (The damped Newton method, quasi-Newton methods, and so on.)

Q3. How can we use Newton’s method for solving constrained problems?


1.5 Role of Convex Optimization in Nonconvex Optimization (see [3, Chapter 1.4.3])

It has long been recognized that convex optimization theory is far more straightforward and complete than nonlinear optimization theory [3, p. 16]. In this context, Rockafellar stated in [4]:

“In fact the great watershed in optimization isn't between linearity and nonlinearity, but convexity and nonconvexity.”

While this class will primarily focus on convex optimization problems and applications, we should note that convex optimization sometimes plays an important role in nonconvex problems in the following ways:

– Initialization for local optimization (seeking a local solution): a solution of an approximate convex problem can be used as the starting point for a local optimization method.
– Convex heuristics for nonconvex optimization: e.g., replacing ℓ0 by ℓ1 in compressed sensing.
– Bounds for global optimization: convex optimization provides a lower bound on the optimal value of the nonconvex problem (e.g., by solving a (convex) Lagrangian dual problem).

In this note, we will first study the definition of convex sets and functions, and their properties. This is important because identifying the convexity of an optimization problem is the first key to convex optimization.


2 Mathematical Preliminaries (see [1, Chapter 1] [2, Chapter 1] [3, Appendix A])

This section reviews some mathematical preliminaries that will frequently appear throughout the course.

2.1 The Space Rd and Its Subsets

– Set of d-dimensional column vectors with real components: R^d
– Nonnegative orthant: R^d_+ = {(x_1, . . . , x_d)^⊤ : x_1, . . . , x_d ≥ 0}
– Positive orthant: R^d_{++} = {(x_1, . . . , x_d)^⊤ : x_1, . . . , x_d > 0}
– Closed line segment: [x, y] = {x + α(y − x) : α ∈ [0, 1]}
– Open line segment: (x, y) = {x + α(y − x) : α ∈ (0, 1)}

2.2 Inner Products and Norms in Rd

Definition 2.1 An inner product on Rd is a map 〈·, ·〉 : Rd ×Rd → R with the following properties:

– Symmetry: 〈x, y〉 = 〈y, x〉 for any x,y ∈ Rd

– Additivity: 〈x, y + z〉 = 〈x, y〉+ 〈x, z〉 for any x,y, z ∈ Rd

– Homogeneity: 〈λx, y〉 = λ 〈x, y〉 for any λ ∈ R and x,y ∈ Rd

– Positive definiteness: 〈x, x〉 ≥ 0 for any x ∈ Rd and 〈x, x〉 = 0 if and only if x = 0

Example 2.1 Examples of inner products on Rd

– Dot product: 〈x, y〉 = x^⊤y = ∑_{i=1}^d x_i y_i for any x, y ∈ R^d
– Weighted dot product with w ∈ R^d_{++}: 〈x, y〉_w = ∑_{i=1}^d w_i x_i y_i

Definition 2.2 A norm on Rd is a map || · || : Rd → R with the following properties:

– Nonnegativity: ||x|| ≥ 0 for any x ∈ R^d, and ||x|| = 0 if and only if x = 0
– Positive homogeneity: ||λx|| = |λ| ||x|| for any x ∈ R^d and λ ∈ R
– Triangle inequality: ||x + y|| ≤ ||x|| + ||y|| for any x, y ∈ R^d

Example 2.2 Examples of norms on Rd

– Norm associated with any inner product 〈·, ·〉: ||x|| = √〈x, x〉 for all x ∈ R^d
– Euclidean norm (or ℓ2-norm): ||x||_2 = (∑_{i=1}^d x_i^2)^{1/2} for all x ∈ R^d
– ℓp-norm for p ≥ 1: ||x||_p = (∑_{i=1}^d |x_i|^p)^{1/p} for all x ∈ R^d
– ℓ∞-norm: ||x||_∞ = max_{i=1,...,d} |x_i| = lim_{p→∞} ||x||_p

2.3 The Space Rp×d and Its Subsets

– Set of all real-valued p × d matrices: R^{p×d}
– Set of all d × d symmetric matrices: S^d = {A ∈ R^{d×d} : A = A^⊤}
– Set of all d × d positive semidefinite matrices: S^d_+ = {A ∈ R^{d×d} : A ⪰ 0}
– Set of all d × d positive definite matrices: S^d_{++} = {A ∈ R^{d×d} : A ≻ 0}

2.4 Inner Products and Norms in Rp×d

The dot product in Rp×d is defined by

〈A, B〉 = tr{A^⊤B} = ∑_{i=1}^p ∑_{j=1}^d A_{ij} B_{ij} for any A, B ∈ R^{p×d}.   (2.1)

Definition 2.3 A norm on Rp×d is a map || · || : Rp×d → R with the following properties:


– Nonnegativity: ||A|| ≥ 0 for any A ∈ R^{p×d}, and ||A|| = 0 if and only if A = 0
– Positive homogeneity: ||λA|| = |λ| ||A|| for any A ∈ R^{p×d} and λ ∈ R
– Triangle inequality: ||A + B|| ≤ ||A|| + ||B|| for any A, B ∈ R^{p×d}

Example 2.3 Examples of norms on Rp×d

– Induced matrix norm: ||A|| = max_{x≠0} ||Ax|| / ||x|| = max_{||x||=1} ||Ax||
– Spectral norm (the maximum singular value of A): ||A||_2 = √λ_max(A^⊤A) = σ_max(A)
– ℓ1-norm (or maximum absolute column sum norm): ||A||_1 = max_{j=1,...,d} ∑_{i=1}^p |A_{i,j}|
– ℓ∞-norm (or maximum absolute row sum norm): ||A||_∞ = max_{i=1,...,p} ∑_{j=1}^d |A_{i,j}|
– Frobenius norm (associated with the dot product): ||A||_F = √tr{A^⊤A} = (∑_{i=1}^p ∑_{j=1}^d A_{ij}^2)^{1/2}
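All of these norms have MATLAB built-ins; a quick illustrative check on a random vector and matrix:

  % Vector and matrix norms from Sections 2.2 and 2.4
  x = randn(5, 1); A = randn(4, 5);
  [norm(x, 1), norm(x, 2), norm(x, inf)]   % l1, Euclidean, l-infinity vector norms
  [norm(A, 1), norm(A, 2), norm(A, inf)]   % max column sum, spectral, max row sum
  norm(A, 'fro')                           % Frobenius norm
  sqrt(trace(A'*A))                        % equals the Frobenius norm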

2.5 Basic Topological Concepts

Definition 2.4 The open ball with center xc ∈ Rd and radius r is defined as B(xc, r) = {x ∈ Rd :||x − xc|| < r}. The closed ball with center xc ∈ Rd and radius r is defined as B[xc, r] = {x ∈ Rd :||x− xc|| ≤ r}.

Definition 2.5 Given a set U ⊆ R^d, a point x ∈ U is an interior point of U if there exists r > 0 for which B(x, r) ⊆ U. The set of all interior points of a given set U is called the interior of the set, defined as int(U) = {x ∈ U : B(x, r) ⊆ U for some r > 0}.

Example 2.4 Examples of interiors of sets.

– int(R^d_+) = R^d_{++}
– int(B[x, r]) = B(x, r) for any x ∈ R^d, r > 0

Definition 2.6 A set U is open if int(U) = U . A set U is closed if and only if its complement Uc is open.

Definition 2.7 The closure of a set U ⊆ Rd is the smallest closed set containing U , i.e., cl(U) = ∩{T :U ⊆ T , T is closed}.

Definition 2.8 Given a set U ⊆ R^d, a point x ∈ R^d is a boundary point of U if any neighborhood of x contains at least one point in U and at least one point in its complement U^c. The set of all boundary points of a given set U is called the boundary of the set.

Definition 2.9 A set U ⊆ R^d is called bounded if there exists M > 0 for which U ⊆ B(0, M). A set U ⊆ R^d is called compact if it is closed and bounded.

Example 2.5 Examples of compact sets.

– Closed balls
– Line segments

2.6 Differentiability

Definition 2.10 Let f be a function defined on a set U ⊆ R^d. Let x ∈ int U and let 0 ≠ d ∈ R^d. If the limit

lim_{t→0+} [f(x + td) − f(x)] / t   (2.2)

exists, then it is called the directional derivative of f at x along the direction d. For any i = 1, . . . , d, if the limit

∂f/∂x_i (x) = lim_{t→0} [f(x + t e_i) − f(x)] / t   (2.3)


exists, where e_i is the d-dimensional column vector whose ith component is one while all the others are zeros, then it is the ith partial derivative of f at x. If all the partial derivatives of a function f exist at a point x ∈ R^d, then the gradient of f at x is defined to be the column vector consisting of all the partial derivatives:

∇f(x) = (∂f/∂x_1 (x), . . . , ∂f/∂x_d (x))^⊤.   (2.4)

Definition 2.11 A function f defined on an open set U ⊆ R^d is called continuously differentiable over U if all the partial derivatives exist and are continuous on U.

Proposition 2.1 Let f : U → R be defined on an open set U ⊆ R^d. Suppose that f is continuously differentiable over U. Then

lim_{d→0} [f(x + d) − f(x) − ∇f(x)^⊤d] / ||d|| = 0 for all x ∈ U.   (2.5)
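Proposition 2.1 says that ∇f(x) gives the best linear approximation of f near x. A quick MATLAB sanity check (on the least-squares objective of Section 1.1, with made-up data and a hand-picked step t) compares finite-difference estimates of the partial derivatives (2.3) with the analytic gradient:

  % Finite-difference check of grad f for f(x) = (1/2)||Ax - b||_2^2, grad f = A'(Ax - b)
  A = randn(30, 5); b = randn(30, 1);
  f = @(x) 0.5*norm(A*x - b)^2;
  x = randn(5, 1); t = 1e-6;
  g_fd = zeros(5, 1);
  for i = 1:5
      ei = zeros(5, 1); ei(i) = 1;
      g_fd(i) = (f(x + t*ei) - f(x)) / t;   % one-sided difference, cf. (2.3)
  end
  norm(g_fd - A'*(A*x - b))                 % should be small relative to the gradient norm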

Definition 2.12 A function f defined on an open set U ⊆ R^d is called twice continuously differentiable over U if all the second order partial derivatives exist and are continuous over U. Under the assumption of twice continuous differentiability, the second order partial derivatives are symmetric, meaning that for any i ≠ j and any x ∈ U:

∂²f/∂x_i∂x_j (x) = ∂²f/∂x_j∂x_i (x).   (2.6)

The Hessian of f at a point x ∈ U is the d× d matrix

∇²f(x) = [ ∂²f/∂x_1² (x)      ∂²f/∂x_1∂x_2 (x)   · · ·   ∂²f/∂x_1∂x_d (x)
            ∂²f/∂x_2∂x_1 (x)   ∂²f/∂x_2² (x)      · · ·   ∂²f/∂x_2∂x_d (x)
            ⋮                   ⋮                  ⋱       ⋮
            ∂²f/∂x_d∂x_1 (x)   ∂²f/∂x_d∂x_2 (x)   · · ·   ∂²f/∂x_d² (x) ]   (2.7)


3 Convex Sets (see [1, Chapter 6] [3, Chapter 2])

3.1 Definition

3.1.1 Affine Sets

Definition 3.1 A set C is affine if the line through any two distinct points in C lies in C, i.e., if for any x, y ∈ C and any θ ∈ R, we have θx + (1 − θ)y ∈ C.

– A point of the form θ_1 x_1 + · · · + θ_i x_i, where θ_1 + · · · + θ_i = 1, is called an affine combination of the points x_1, . . . , x_i.
– An affine set contains every affine combination of its points.
– The affine hull of a set C is the set of all affine combinations of points in C:

{θ_1 x_1 + · · · + θ_i x_i : x_1, . . . , x_i ∈ C, θ_1 + · · · + θ_i = 1}.   (3.1)

– The affine hull is the smallest affine set that contains C.

Example 3.1 Solution set of linear equations. The solution set of a system of linear equations, C = {x : Ax = b}, is an affine set. Suppose x, y ∈ C, i.e., Ax = b and Ay = b. Then for any θ, we have

A(θx+ (1− θ)y) = θAx+ (1− θ)Ay = θb+ (1− θ)b = b,

which shows θx+ (1− θ)y ∈ C. (The subspace associated with the affine set C is the nullspace of A.)

3.1.2 Convex Sets

Definition 3.2 A set C is convex if the line segment between any two points in C lies in C, i.e., if for any x, y ∈ C and any θ with 0 ≤ θ ≤ 1, we have θx + (1 − θ)y ∈ C.

– Every affine set is convex.
– A point of the form θ_1 x_1 + · · · + θ_i x_i, where θ_1 + · · · + θ_i = 1 and θ_1, . . . , θ_i ≥ 0, is called a convex combination of the points x_1, . . . , x_i.
– A convex set contains every convex combination of its points.
– The convex hull of a set C is the set of all convex combinations of points in C:

{θ_1 x_1 + · · · + θ_i x_i : x_1, . . . , x_i ∈ C, θ_1 + · · · + θ_i = 1, θ_1, . . . , θ_i ≥ 0}.   (3.2)

– The convex hull is the smallest convex set that contains C.

3.1.3 Cones

Definition 3.3 A set C is called a cone, if for every x ∈ C and θ ≥ 0, we have θx ∈ C.

Definition 3.4 A set C is a convex cone if it is convex and a cone, which means that for any x, y ∈ C and θ_1, θ_2 ≥ 0, we have θ_1 x + θ_2 y ∈ C.

– A point of the form θ_1 x_1 + · · · + θ_i x_i, where θ_1, . . . , θ_i ≥ 0, is called a conic combination (or nonnegative linear combination) of the points x_1, . . . , x_i.
– A convex cone contains every conic combination of its points.
– The conic hull of a set C is the set of all conic combinations of points in C:

{θ_1 x_1 + · · · + θ_i x_i : x_1, . . . , x_i ∈ C, θ_1, . . . , θ_i ≥ 0}.   (3.3)

– The conic hull is the smallest convex cone that contains C.


3.2 Examples of Convex Sets

– Hyperplane with normal vector a ≠ 0: {x : a^⊤x = b}
– Halfspace with normal vector a ≠ 0: {x : a^⊤x ≤ b}
– (Euclidean) ball with center x_c and radius r: {x : ||x − x_c||_2 ≤ r} = {x : (x − x_c)^⊤(x − x_c) ≤ r²}
– Ellipsoid with center x_c and Q ∈ S^d_{++}: {x : (x − x_c)^⊤Q^{−1}(x − x_c) ≤ 1}
– Norm ball with center x_c and radius r: {x : ||x − x_c|| ≤ r}
– Norm cone: {(x, t) : ||x|| ≤ t}
– Polyhedra: {x : a_i^⊤x ≤ b_i, i = 1, . . . , m, c_i^⊤x = d_i, i = 1, . . . , n} = {x : Ax ⪯ b, Cx = d}, where A = [a_1, . . . , a_m]^⊤ and C = [c_1, . . . , c_n]^⊤
– Nonnegative orthant (cone): R^d_+ = {x ∈ R^d : x ⪰ 0}
– Simplex: {x : x ⪰ 0, 1^⊤x ≤ 1}, {x : x ⪰ 0, 1^⊤x = 1}
– Positive semidefinite cone: S^d_+ = {X ∈ S^d : X ⪰ 0}

Example 3.2 Positive semidefinite cone. The set S^d_+ is a convex cone. Suppose A, B ∈ S^d_+, i.e., x^⊤Ax, x^⊤Bx ≥ 0 for any x. Then for any θ_1, θ_2 ≥ 0, we have

x^⊤(θ_1 A + θ_2 B)x = θ_1 x^⊤Ax + θ_2 x^⊤Bx ≥ 0.

3.3 Operations that Preserve Convexity

One can verify the convexity of a set using the definition of convexity. This is sometimes difficult, so one could instead show that the set C is obtained from simple convex sets by the following operations that preserve convexity:

– Intersection: If a set C_α is convex for every α ∈ A, then ∩_{α∈A} C_α is a convex set.
– Product: If sets C_i ⊆ R^{d_i}, i = 1, . . . , n, are convex, then C_1 × · · · × C_n = {(x_1, . . . , x_n) : x_i ∈ C_i, i = 1, . . . , n} is a convex set.
– Weighted summation: If sets C_i ⊆ R^d, i = 1, . . . , n, are convex, then α_1 C_1 + · · · + α_n C_n = {α_1 x_1 + · · · + α_n x_n : x_i ∈ C_i, i = 1, . . . , n} is a convex set. [Note: In response to a question in class, I said that the α_i need additional properties, but this was incorrect; the α_i do not need any properties here.]
– Affine image: If a set C ⊆ R^d is convex and f(x) = Ax + b : R^d → R^p is an affine mapping, then f(C) = {Ax + b : x ∈ C} is a convex set.
– Inverse affine image: If a set C ⊆ R^d is convex and f(x) = Ax + b : R^d → R^p is an affine mapping, then f^{−1}(C) = {x : Ax + b ∈ C} is a convex set.

Example 3.3 Examples of convex sets that can be constructed from other convex sets under operations that preserve convexity.

– Positive semidefinite cone S^d_+: the intersection of an infinite number of halfspaces,

S^d_+ = ∩_{z≠0} {X ∈ S^d : z^⊤Xz ≥ 0}.

For each z ≠ 0, z^⊤Xz is a (not identically zero) linear function of X, so {X ∈ S^d : z^⊤Xz ≥ 0} is a halfspace.
– Polyhedron {x : Ax ⪯ b, Cx = d}: the inverse image of the Cartesian product of the nonnegative orthant R^m_+ and the origin {0}, under the affine function f(x) = (b − Ax, d − Cx):

{x : Ax ⪯ b, Cx = d} = {x : f(x) ∈ R^m_+ × {0}}


3.4 Generalized Inequalities

Definition 3.5 A convex cone K ⊆ Rd is called a proper cone if it satisfies the following:

– K is closed
– K is solid, which means it has nonempty interior
– K is pointed, which means that it contains no line (or equivalently, x ∈ K, −x ∈ K =⇒ x = 0)

Example 3.4 Examples of proper cones

– Nonnegative orthant: K = Rd+

– Positive semidefinite cone: K = Sd+

A proper cone K can be used to define generalized inequalities as

x ⪯_K y ⇐⇒ y − x ∈ K   (3.4)

and

x ≺_K y ⇐⇒ y − x ∈ int(K).   (3.5)

Many properties of the partial ordering ⪯_K are similar to those of the standard ordering ≤ on R, e.g.,

x ⪯_K y, u ⪯_K v =⇒ x + u ⪯_K y + v.   (3.6)

Example 3.5 Examples of generalized inequalities.

– Component-wise inequality (K = R^d_+): x ⪯_{R^d_+} y ⇐⇒ x_i ≤ y_i, i = 1, . . . , d.
– Matrix inequality (K = S^d_+): X ⪯_{S^d_+} Y ⇐⇒ Y − X is positive semidefinite.

These two types are so common that we drop the subscript in ⪯_K.

3.5 Separating and Supporting Hyperplanes

If C and D are nonempty disjoint convex sets, there exist a ≠ 0 and b such that

a^⊤x ≤ b for x ∈ C,   a^⊤x ≥ b for x ∈ D.   (3.7)

The hyperplane {x : a^⊤x = b} separates C and D.

A supporting hyperplane to a set C at a boundary point x_0 is

{x : a^⊤x = a^⊤x_0},   (3.8)

where a ≠ 0 and a^⊤x ≤ a^⊤x_0 for all x ∈ C.

Theorem 3.1 (Supporting hyperplane theorem) If C is convex, then there exists a supporting hyperplane at every boundary point of C.


4 Convex Functions (see [1, Chapter 7] [3, Chapter 3])

4.1 Definition

Definition 4.1 A function f : R^d → R is convex if dom f is a convex set and if for all x, y ∈ dom f and θ with 0 ≤ θ ≤ 1, we have

f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y).   (4.1)

A function is convex if and only if for all x ∈ dom f and all v, the function g(t) = f(x + tv) is convex (on its domain dom g = {t : x + tv ∈ dom f}).

Definition 4.2 A function f : R^d → R is strictly convex if dom f is a convex set and if for all x, y ∈ dom f such that x ≠ y, and θ with 0 < θ < 1, we have

f(θx + (1 − θ)y) < θf(x) + (1 − θ)f(y).   (4.2)

Definition 4.3 A function f is called concave if −f is convex. Similarly, f is called strictly concave if −f is strictly convex.

4.2 Extended-value Extensions

It is often convenient to extend a convex function to all of R^d by defining its value to be ∞ outside its domain.

Definition 4.4 If f is convex, its extended-value extension f̃ : R^d → R ∪ {∞} is

f̃(x) = { f(x),  x ∈ dom f,
         ∞,     x ∉ dom f.     (4.3)

The extension f̃ is defined on all of R^d, and takes values in R ∪ {∞}.

Introducing such an extension simplifies notation, since the domain of f is implied from f̃ as dom f = {x : f̃(x) < ∞}. For example, Definition 4.1 can be simplified as below.

Definition 4.5 A function f : Rd → R is convex if for all x,y ∈ Rd, and θ with 0 ≤ θ ≤ 1, we have

f(θx+ (1− θ)y) ≤ θf(x) + (1− θ)f(y). (4.4)

4.3 First-order Conditions

Theorem 4.1 Suppose f is differentiable (i.e., its gradient ∇f exists at each point in dom f, which is open). Then f is convex if and only if dom f is convex and

f(y) ≥ f(x) + 〈∇f(x), y − x〉 (4.5)

holds for all x,y ∈ dom f .

Proof “=⇒”: 1) Consider the case d = 1. Assume that f is convex and x, y ∈ dom f. For 0 < t ≤ 1,

f(x + t(y − x)) ≤ (1 − t)f(x) + t f(y).

Dividing both sides by t and rearranging, we have

f(y) ≥ f(x) + [f(x + t(y − x)) − f(x)] / t,

and taking the limit as t → 0 yields (4.5).

2) Consider the general case, with f : R^d → R. Let x, y ∈ R^d and consider f restricted to the line passing through them, i.e., the function defined by g(t) = f(x + t(y − x)), where ∇g(t) = 〈∇f(x + t(y − x)), y − x〉. Assume that f is convex, which implies that g is convex. Then, by the argument above, we have g(1) ≥ g(0) + ∇g(0), which means (4.5).

“⇐=”: Assume that f satisfies (4.5) for all x, y ∈ dom f. Choose any x ≠ y and 0 ≤ θ ≤ 1, and let z = θx + (1 − θ)y. Applying (4.5) yields

f(x) ≥ f(z) + 〈∇f(z), x − z〉,   f(y) ≥ f(z) + 〈∇f(z), y − z〉.

Multiplying the first and second inequalities by θ and 1 − θ respectively, and adding them yields

θf(x) + (1 − θ)f(y) ≥ f(z) = f(θx + (1 − θ)y),

which proves that f is convex.

For a convex function, the first-order Taylor approximation is in fact a global underestimator of the function.

4.4 Second-order Conditions

Theorem 4.2 Suppose f is twice differentiable, that is, its Hessian or second derivative ∇²f exists at each point in dom f, which is open. Then f is convex if and only if dom f is convex and its Hessian is positive semidefinite, i.e., for all x ∈ dom f,

∇²f(x) ⪰ 0.   (4.6)

4.5 Examples of Convex Functions

Example 4.1 Examples of convex functions on R.

– Affine: ax + b is convex (and concave) on R for any a, b ∈ R.
– Exponential: e^{ax} is convex on R for any a ∈ R.
– Powers: x^a is convex on R_{++} when a ≥ 1 or a ≤ 0, and concave for 0 ≤ a ≤ 1.
– Powers of absolute value: |x|^p is convex on R for p ≥ 1.
– Logarithm: log x is concave on R_{++}.
– Negative entropy: x log x is convex on R_{++}.

Example 4.2 Examples of convex functions on Rd.

– Affine: a^⊤x + b is convex (and concave) on R^d.
– Quadratic: (1/2)x^⊤Qx + p^⊤x is convex on R^d for any Q ⪰ 0.
– Least-squares: (1/2)||Ax − b||_2^2 is convex on R^d for any A.
– Norms: Every norm on R^d is convex.
– Max function: max{x_1, . . . , x_d} is convex on R^d.
– Quadratic-over-linear: f(x, y) = x²/y with dom f = R × R_{++} is convex.
– Log-sum-exp: log(e^{x_1} + · · · + e^{x_d}) is convex on R^d.
– Geometric mean: (∏_{i=1}^d x_i)^{1/d} on R^d_{++} is concave.

Example 4.3 Examples of convex functions on Rp×d.

– Affine: tr{A^⊤X} + b is convex (and concave) on R^{p×d}.
– Spectral norm: ||X||_2 is convex on R^{p×d}.
– Log-determinant: log det{X} is concave on dom f = S^d_{++}.

Example 4.4 Log-sum-exp. f(x) = log(∑_{i=1}^d e^{x_i}) is convex. Note that this function can be interpreted as a differentiable approximation of the max function, since max{x_1, . . . , x_d} ≤ f(x) ≤ max{x_1, . . . , x_d} + log d.


Proof The Hessian of f is

∇²f(x) = (1/(1^⊤z)²) ((1^⊤z) diag{z} − zz^⊤),

where z = (e^{x_1}, . . . , e^{x_d}). To verify that ∇²f(x) ⪰ 0, we must show that v^⊤∇²f(x)v ≥ 0 for all v, i.e.,

v^⊤∇²f(x)v = (1/(1^⊤z)²) [ (∑_{i=1}^d z_i)(∑_{i=1}^d v_i² z_i) − (∑_{i=1}^d v_i z_i)² ] ≥ 0.

This can be shown using the Cauchy-Schwarz inequality (a^⊤a)(b^⊤b) ≥ (a^⊤b)² applied to the vectors with components a_i = v_i √z_i, b_i = √z_i.
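As a quick numerical sanity check (an illustrative MATLAB sketch, not a substitute for the proof), one can form this Hessian at a random point and confirm that its smallest eigenvalue is nonnegative:

  % Numerical check that the log-sum-exp Hessian is positive semidefinite
  x = randn(6, 1);
  z = exp(x); s = sum(z);
  H = (s*diag(z) - z*z') / s^2;   % Hessian formula from the proof above
  min(eig(H))                     % should be >= 0 (up to roundoff)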

4.6 Sublevel Sets and Epigraph

Definition 4.6 The α-sublevel set of a function f : Rd → R is defined as

Cα = {x ∈ dom f : f(x) ≤ α}. (4.7)

Sublevel sets of a convex function are convex, for any value of α. The converse is not true.

Definition 4.7 The epigraph of a function f : Rd → R is defined as

epi f = {(x, t) : x ∈ dom f, f(x) ≤ t}. (4.8)

A function is convex if and only if its epigraph is a convex set.

4.7 Jensen’s Inequality

Theorem 4.3 If f is convex, x_1, . . . , x_n ∈ dom f, and θ_1, . . . , θ_n ≥ 0 with θ_1 + · · · + θ_n = 1, then

f(∑_{i=1}^n θ_i x_i) ≤ ∑_{i=1}^n θ_i f(x_i).   (4.9)

4.8 Operations that Preserve Convexity

– Nonnegative weighted sums: If f_1, . . . , f_n are convex and α_1, . . . , α_n ≥ 0, then the function ∑_{i=1}^n α_i f_i is convex.
– Composition with an affine mapping: If f is convex, so is f(Ax + b).
– Log barrier for linear inequalities: f(x) = −∑_{i=1}^m log(b_i − a_i^⊤x) with dom f = {x : a_i^⊤x < b_i, i = 1, . . . , m}.
– Pointwise maximum: If f_1, . . . , f_n are convex, then the function max{f_1(x), . . . , f_n(x)} is convex.
– Piecewise-linear function: f(x) = max_{i=1,...,p} (a_i^⊤x + b_i).
– Pointwise supremum: If for each y ∈ A, f(x, y) is convex in x, then g(x) = sup_{y∈A} f(x, y) is convex.
– Distance to the farthest point of a set C: f(x) = sup_{y∈C} ||x − y||.
– Composition: f(x) = h(g_1(x), . . . , g_n(x)), with g_i : R^d → R, h : R^n → R, is convex if either the g_i are convex, h is convex, and h is nondecreasing in each argument, or the g_i are concave, h is convex, and h is nonincreasing in each argument.
– Composition with the log-sum-exp function: log(∑_{i=1}^n e^{g_i(x)}) is convex if the g_i are convex.
– Minimization: If f(x, y) is convex in (x, y) and C is a convex nonempty set, then g(x) = inf_{y∈C} f(x, y) is convex.
– Distance of a point x to a set C: inf_{y∈C} ||x − y|| is convex if C is convex.


Example 4.5 An example of a convex function that can be constructed from other convex functions under operations that preserve convexity.

– Sum of the r largest components of x ∈ R^d:

f(x) = x_[1] + · · · + x_[r]

is convex, where x_[i] is the ith largest component of x. This can be seen by writing it as

f(x) = max_{i_1,...,i_r} {x_{i_1} + · · · + x_{i_r} : 1 ≤ i_1 < · · · < i_r ≤ d},

the maximum of all possible sums of r different components of x. Since it is the pointwise maximum of d!/(r!(d−r)!) linear functions, it is convex.

4.9 Convexity with respect to Generalized Inequalities

Definition 4.8 Suppose K ⊆ R^p is a proper cone with associated generalized inequality ⪯_K. We say f : R^d → R^p is K-convex if for all x, y and 0 ≤ θ ≤ 1,

f(θx + (1 − θ)y) ⪯_K θf(x) + (1 − θ)f(y).   (4.10)

Example 4.6 A function f(X) = X² : S^d → S^d is S^d_+-convex.

Proof For fixed z ∈ R^d, z^⊤X²z = ||Xz||_2^2 is convex in X, i.e., for any X, Y ∈ S^d and 0 ≤ θ ≤ 1, we have

z^⊤(θX + (1 − θ)Y)²z ≤ θ z^⊤X²z + (1 − θ) z^⊤Y²z,

which implies (θX + (1 − θ)Y)² ⪯ θX² + (1 − θ)Y².


5 Convex Optimization Problems (see [1, Chapter 8] [3, Chapter 4])

5.0.1 Optimization Problems in Standard Form

We consider

min_x f(x)   (5.1)
subject to g_i(x) ≤ 0, i = 1, . . . , m,
           h_i(x) = 0, i = 1, . . . , n,

where

– x ∈ R^d is the optimization variable,
– f : R^d → R is the objective or cost function,
– g_i : R^d → R are the inequality constraint functions,
– h_i : R^d → R are the equality constraint functions.

Definition 5.1 A point x is feasible if it satisfies the constraints. The problem is said to be feasible if there exists at least one feasible point. The set of all feasible points is called the feasible set or the constraint set.

Definition 5.2 The optimal value of the problem is defined as

p∗ = inf{f(x) : gi(x) ≤ 0, i = 1, . . . ,m, hi(x) = 0, i = 1, . . . , n}. (5.2)

We take p∗ = ∞ if the problem is infeasible, and p∗ = −∞ if the problem is unbounded below.

Definition 5.3 A feasible point x∗ is optimal if f(x∗) = p∗. The set of all optimal points is the optimal set. A feasible point x is locally optimal if there is an R > 0 such that

f(x) = infz{f(z) : gi(z) ≤ 0, i = 1, . . . ,m, hi(z) = 0, i = 1, . . . , n, ||z − x||2 ≤ R}. (5.3)

5.1 Convex Optimization Problems in Standard Form

A convex optimization problem further requires that

– f(x) is convex,
– g_i(x) are convex,
– h_i(x) = a_i^⊤x − b_i are affine.

The feasible set of a convex optimization problem is convex.

Example 5.1 The following problem:

min_{x∈R²} {f(x) ≡ x_1² + x_2²}   (5.4)
subject to g(x) = x_1 / (1 + x_2²) ≤ 0,
           h(x) = (x_1 + x_2)² = 0,

is not a convex optimization problem in standard form, since g is not convex and h is not affine. (Some authors use the term abstract convex optimization problem to describe the abstract problem of minimizing a convex function over a convex set.) Nevertheless, the feasible set {x : x_1 ≤ 0, x_1 + x_2 = 0} is convex. So although the problem aims to minimize a convex function over a convex set, it is not a convex optimization problem by the definition here and in [3, Chapter 4.2.1]. One can easily find the following equivalent (but not identical) convex optimization problem in standard form:

min_x {f(x) ≡ x_1² + x_2²}   (5.5)
subject to g(x) = x_1 ≤ 0,
           h(x) = x_1 + x_2 = 0.


5.2 Local and Global Optima

Theorem 5.1 Any locally optimal point of a convex optimization problem is (globally) optimal.

Proof Suppose that x∗ is locally optimal for a convex optimization problem, i.e., x∗ is feasible and

f(x∗) = inf_z {f(z) : z is feasible, ||z − x∗||_2 ≤ R},

for some R > 0. Now suppose that x∗ is not globally optimal, i.e., there is a feasible point y such that f(y) < f(x∗). (Then ||y − x∗||_2 > R, since otherwise f(x∗) ≤ f(y).) Consider the point z given by

z = (1 − θ)x∗ + θy,   θ = R / (2||y − x∗||_2),

where we have 0 < θ < 1/2. Then ||z − x∗||_2 = θ||x∗ − y||_2 = R/2 < R, and by convexity of the feasible set, z is feasible. By convexity of f, we have

f(z) ≤ (1 − θ)f(x∗) + θf(y) < f(x∗),

which contradicts the assumption that x∗ is locally optimal.

5.3 An Optimality Criterion for Differentiable Cost Function

Let X denote the feasible set, i.e.,

X = {x : g_i(x) ≤ 0, i = 1, . . . , m, h_i(x) = 0, i = 1, . . . , n}.   (5.6)

Theorem 5.2 Suppose that f in a convex optimization problem is differentiable. Then x∗ is optimal if and only if x∗ ∈ X and

〈∇f(x∗), y − x∗〉 ≥ 0 for all y ∈ X . (5.7)

Proof “⇐=”: Suppose x∗ ∈ X and satisfies (5.7). Then if y ∈ X , by convexity, we have

f(y) ≥ f(x∗) + 〈∇f(x∗), y − x∗〉 ≥ f(x∗).

“=⇒”: Suppose x∗ is optimal, but the condition (5.7) does not hold, i.e., for some y ∈ X we have

〈∇f(x∗), y − x∗〉 < 0.

Consider the point z(t) = x∗ + t(y − x∗) with 0 ≤ t ≤ 1, which is feasible. We have

(d/dt) f(z(t)) |_{t=0} = 〈∇f(x∗), y − x∗〉 < 0.

So for small positive t, we have f(z(t)) < f(x∗).

If ∇f(x∗) ≠ 0 (implying that x∗ is a boundary point of X), the negative gradient −∇f(x∗) at the point x∗ defines a supporting hyperplane:

{y : 〈−∇f(x∗), y〉 = 〈−∇f(x∗), x∗〉} (5.8)

to the feasible set X at x∗, where 〈−∇f(x∗), y〉 ≤ 〈−∇f(x∗), x∗〉 for all y ∈ X .

Theorem 5.3 For an unconstrained problem (i.e., m = n = 0), the condition (5.7) reduces to the well-known necessary and sufficient condition

∇f(x∗) = 0 (5.9)

for x∗ to be optimal.

Proof Since f is differentiable, its domain is (by definition) open, so all y sufficiently close to x∗ are feasible. Let y = x∗ − t∇f(x∗), where t ∈ R is a parameter. For t small and positive, y is feasible, and so

〈∇f(x∗), y − x∗〉 = −t||∇f(x∗)||_2^2 ≥ 0,

from which we conclude ∇f(x∗) = 0.


5.4 Equivalent Convex Optimization Problems

We call two problems equivalent if from a solution of one, a solution of the other is readily found, and vice versa. (It is possible, but complicated, to give a formal definition of equivalence.)

Considering that there are equivalent problems, one can transform a problem into another equivalent problem that is easier to solve (in some sense). In convex optimization, it is preferred to preserve convexity while transforming a problem into another equivalent problem; the following are some examples of such transformations.

– Introducing equality constraints:

  min_x f(A_0 x + b_0)
  subject to g_i(A_i x + b_i) ≤ 0, i = 1, . . . , m,
             h_i(x) = 0, i = 1, . . . , n,

  ⇐⇒

  min_{x, y_0, . . . , y_m} f(y_0)
  subject to g_i(y_i) ≤ 0, i = 1, . . . , m,
             y_i = A_i x + b_i, i = 0, . . . , m,
             h_i(x) = 0, i = 1, . . . , n.

– Slack variables: For affine g_i(x), one can introduce slack variables s as below while preserving the convexity of the problem:

  min_x f(x)
  subject to g_i(x) ≤ 0, i = 1, . . . , m,
             h_i(x) = 0, i = 1, . . . , n,

  ⇐⇒

  min_{x, s} f(x)
  subject to s_i ≥ 0, i = 1, . . . , m,
             g_i(x) + s_i = 0, i = 1, . . . , m,
             h_i(x) = 0, i = 1, . . . , n.

– Epigraph problem form:

  min_x f(x)
  subject to g_i(x) ≤ 0, i = 1, . . . , m,
             h_i(x) = 0, i = 1, . . . , n,

  ⇐⇒

  min_{x, t} t
  subject to f(x) − t ≤ 0,
             g_i(x) ≤ 0, i = 1, . . . , m,
             h_i(x) = 0, i = 1, . . . , n.

– Minimizing over some variables: inf_{x,y} f(x, y) = inf_x f̃(x), where f̃(x) = inf_y f(x, y).

5.5 Examples of Convex Optimization Problems

– Linear optimization problem (a.k.a. linear program (LP)):

  min_x c^⊤x + d
  subject to Gx ⪯ h,
             Ax = b.

– Piecewise-linear minimization:

  min_x max_{i=1,...,m} (a_i^⊤x + b_i)
  ⇐⇒  min_{x,t} t  subject to  a_i^⊤x + b_i ≤ t, i = 1, . . . , m.

– Chebyshev approximation (see the linprog sketch after this list):

  min_x ||Ax − b||_∞
  ⇐⇒  min_{x,t} t  subject to  −t ≤ a_i^⊤x − b_i ≤ t, i = 1, . . . , m.


– Quadratic optimization problem (a.k.a. quadratic program (QP)): For Q ∈ S^d_+,

  min_x (1/2) x^⊤Qx + p^⊤x
  subject to Gx ⪯ h,
             Ax = b.

– Constrained least-squares:

  min_x ||Ax − b||_2^2
  subject to l_i ≤ x_i ≤ u_i, i = 1, . . . , d.

– Distance between the polyhedra {x : A_1 x ⪯ b_1} and {x : A_2 x ⪯ b_2}:

  min_{x_1, x_2} ||x_1 − x_2||_2^2
  subject to A_1 x_1 ⪯ b_1, A_2 x_2 ⪯ b_2.

– Quadratically constrained quadratic program (QCQP): For Q_i ∈ S^d_+,

  min_x (1/2) x^⊤Q_0 x + p^⊤x
  subject to (1/2) x^⊤Q_i x + p^⊤x ≤ 0, i = 1, . . . , m,
             Ax = b.

– Second-order cone programming (SOCP):

  min_x c^⊤x
  subject to ||A_i x + b_i||_2 ≤ c_i^⊤x + d_i, i = 1, . . . , m,
             Fx = g.

  – Second-order cone constraint ||A_i x + b_i||_2 ≤ c_i^⊤x + d_i (a second-order cone is a norm cone {(x, t) : ||x|| ≤ t} for the Euclidean norm): (A_i x + b_i, c_i^⊤x + d_i) lies in the second-order cone.
  – Reduces to an LP when A_i = 0.
  – Reduces to a QCQP when c_i = 0.
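As promised above, the Chebyshev approximation problem can be handed to a generic LP solver once it is written in its epigraph (LP) form. A minimal MATLAB sketch using linprog (from the Optimization Toolbox; the problem data are made up):

  % Chebyshev approximation min ||Ax - b||_inf as an LP in the variable z = [x; t]
  m = 40; d = 5;
  A = randn(m, d); b = randn(m, 1);
  f = [zeros(d, 1); 1];               % minimize t
  Aineq = [ A, -ones(m, 1);           %   A*x - b <= t
           -A, -ones(m, 1)];          % -(A*x - b) <= t
  bineq = [b; -b];
  z = linprog(f, Aineq, bineq);
  x = z(1:d); t = z(end);
  [norm(A*x - b, inf), t]             % the two values should agree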

Example 5.2 Robust linear programming. Consider uncertainty or variation in the parameters. For example, let's say that the parameters a_i are known to lie in given ellipsoids

a_i ∈ E_i = {ā_i + P_i u : ||u||_2 ≤ 1},

where P_i ∈ R^{d×d}. The resulting robust linear program is

min_{x∈R^d} c^⊤x
subject to a_i^⊤x ≤ b_i for all a_i ∈ E_i, i = 1, . . . , m.

The robust linear constraint can be expressed as

sup_{a_i} {a_i^⊤x : a_i ∈ E_i} = ā_i^⊤x + sup_u {u^⊤P_i^⊤x : ||u||_2 ≤ 1} = ā_i^⊤x + ||P_i^⊤x||_2 ≤ b_i.

Hence the robust LP can be expressed as the SOCP:

min_{x∈R^d} c^⊤x
subject to ā_i^⊤x + ||P_i^⊤x||_2 ≤ b_i, i = 1, . . . , m.


5.6 Generalized Inequality Constraints

The standard form convex optimization problem with the generalized inequalities in the constraints is:

min_x f(x)   (5.10)
subject to g_i(x) ⪯_{K_i} 0, i = 1, . . . , m,
           Ax = b,

where f : R^d → R, K_i ⊆ R^{k_i} are proper cones, and g_i : R^d → R^{k_i} are K_i-convex.

5.6.1 Conic Form Problems

One simple class of convex optimization problems with generalized inequalities is the following conic form problem (a.k.a. cone program):

min_x c^⊤x
subject to Fx + g ⪯_K h,
           Ax = b.

When K = R^d_+, this reduces to an LP.

Example 5.3 Second-order cone programming. For the second-order cone Ki, we have

min_x c^⊤x
subject to ||A_i x + b_i|| ≤ c_i^⊤x + d_i,
           Fx = g.

⇐⇒

min_x c^⊤x
subject to −(A_i x + b_i, c_i^⊤x + d_i) ⪯_{K_i} 0,
           Fx = g.

5.6.2 Semidefinite Programming (SDP)

When K = Sd+, the associated conic form problem is called a semidefinite program (SDP):

min_{x∈R^d} c^⊤x
subject to F(x) = x_1 F_1 + · · · + x_d F_d + G ⪯ 0,
           Ax = b,

where F_1, . . . , F_d, G ∈ S^k. The inequality constraint is called a linear matrix inequality (LMI).

Example 5.4 Matrix norm minimization. Let A(x) = A_0 + x_1 A_1 + · · · + x_d A_d, where A_i ∈ R^{p×q}. Consider the problem:

min_{x∈R^d} ||A(x)||_2.

Using the fact that ||A||_2 ≤ t if and only if A^⊤A ⪯ t²I, we have the equivalent problem:

min_{x∈R^d, t∈R_+} t
subject to A(x)^⊤A(x) ⪯ t²I.

Further using the fact that

A^⊤A ⪯ t²I (and t ≥ 0)   ⇐⇒   [ tI    A
                                 A^⊤   tI ] ⪰ 0,

which comes from the Schur complement condition, we have the equivalent SDP:

min_{x∈R^d, t∈R_+} t
subject to [ tI       A(x)
             A(x)^⊤   tI   ] ⪰ 0.


References

1. Beck, A.: Introduction to Nonlinear Optimization: Theory, Algorithms, and Applications with MATLAB. Soc. Indust. Appl. Math., Philadelphia (2014). URL http://epubs.siam.org/doi/abs/10.1137/1.9781611973655

2. Beck, A.: First-Order Methods in Optimization. Soc. Indust. Appl. Math., Philadelphia (2017). URL http://epubs.siam.org/doi/book/10.1137/1.9781611974997

3. Boyd, S., Vandenberghe, L.: Convex Optimization. Cambridge University Press, Cambridge, UK (2004). URL http://www.stanford.edu/~boyd/cvxbook.html

4. Rockafellar, R.T.: Lagrange multipliers and optimality. SIAM Review 35(2), 183–238 (1993). DOI 10.1137/1035044
