

Algorithms for Minimizing Differences of Convex Functions and Applications to Facility Location and Clustering

Mau Nam Nguyen

(joint work with D. Giles and R. B. Rector)

Fariborz Maseeh Department of Mathematics and Statistics, Portland State University

Mathematics and Statistics Seminar, WSU Vancouver, April 12, 2017


Convex Functions

Definition. Let I ⊂ R be an interval and let f : I → R be a function. The function f is said to be CONVEX on I if

$$f(\lambda x + (1-\lambda)u) \le \lambda f(x) + (1-\lambda) f(u)$$

for all x, u ∈ I and λ ∈ (0, 1). The function f is called CONCAVE if −f is convex.

Figure: Convex Function and Nonconvex Function.


Convex Functions

Example. Let I = R = (−∞, ∞) and let f(x) = |x| for x ∈ R. Then f is a convex function. Indeed, for any x, u ∈ R and λ ∈ (0, 1),

$$f(\lambda x + (1-\lambda)u) = |\lambda x + (1-\lambda)u| \le |\lambda x| + |(1-\lambda)u| = \lambda|x| + (1-\lambda)|u| = \lambda f(x) + (1-\lambda) f(u).$$

Figure: Convex Function and Nonconvex Function.


Characterizations for Convexity

Theorem. Let f : I → R be a differentiable function, where I is an open interval in R. Then f is convex on I if and only if f′ is monotone increasing on I.

Theorem. Let f : I → R be a twice differentiable function, where I is an open interval in R. Then f is convex on I if and only if f″(x) ≥ 0 for all x ∈ I.

Example. The following functions are convex:
(i) f(x) = ax² + bx + c, x ∈ R, a > 0;
(ii) f(x) = eˣ, x ∈ R;
(iii) f(x) = −ln(x), x ∈ (0, ∞).


Convex Sets

Definition. A subset Ω of Rⁿ is CONVEX if [a, b] ⊂ Ω whenever a, b ∈ Ω. Equivalently, Ω is convex if λa + (1−λ)b ∈ Ω for all a, b ∈ Ω and λ ∈ [0, 1].

Figure: Convex set and nonconvex set.


Convex Functions

Definition. Let f : Ω → R be defined on a convex set Ω ⊂ Rⁿ. The function f is said to be CONVEX on Ω if

$$f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda) f(y)$$

for all x, y ∈ Ω and λ ∈ (0, 1). The function f is called CONCAVE if −f is convex.

Figure: Convex Function and Nonconvex Function.


Convex Functions

Theorem

Let fᵢ : Rⁿ → R be convex functions for i = 1, …, m. Then the following functions are convex as well:
(i) the scalar multiple λf for any λ > 0;
(ii) the sum function Σᵢ₌₁ᵐ fᵢ;
(iii) the maximum function max₁≤ᵢ≤ₘ fᵢ.
Let f : Rⁿ → R be convex and let φ : R → R be convex and nondecreasing. Then the composition φ ∘ f is convex.
Let B : Rⁿ → Rᵖ be an affine mapping and let f : Rᵖ → R be a convex function. Then the composition f ∘ B is convex.


Relationship between Convex Functions and Convex Sets

Theorem. A function f : Rⁿ → R is convex if and only if its epigraph epi f is a convex subset of the product space Rⁿ × R.

Figure: Epigraphs of convex function and nonconvex function.


Pierre de Fermat

Pierre de Fermat (1601-1665) was a French lawyer at the Parlement of Toulouse, France, and an amateur mathematician who is given credit for early developments that led to calculus.


Fermat-Torricelli Problem

In the early 17th century, at the end of his book Treatise on Minima and Maxima, the French mathematician Fermat (1601-1665) proposed the following problem: given three points in the plane, find a point such that the sum of its distances to the three given points is minimal. In general: given a₁, a₂, …, a_m ∈ Rⁿ, find x ∈ Rⁿ which minimizes the function

$$f(x) = \sum_{i=1}^m \|x - a_i\|.$$


Weiszfeld’s Algorithm

The gradient of the objective function f is

$$\nabla f(x) = \sum_{i=1}^m \frac{x - a_i}{\|x - a_i\|}, \qquad x \notin \{a_1, a_2, \dots, a_m\}.$$

Solving the equation ∇f(x) = 0 gives

$$x = \frac{\sum_{i=1}^m a_i / \|x - a_i\|}{\sum_{i=1}^m 1 / \|x - a_i\|} =: F(x).$$

For continuity, define F(x) := x for x ∈ {a₁, a₂, …, a_m}. Weiszfeld's algorithm: choose a starting point x₀ ∈ Rⁿ and define

$$x_{k+1} = F(x_k) \quad \text{for } k \in \mathbb{N}.$$
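A minimal NumPy sketch of the Weiszfeld iteration under these formulas; the function name, tolerances, and the triangle data are illustrative, not part of the talk:

```python
import numpy as np

def weiszfeld(points, x0, tol=1e-9, max_iter=1000):
    """Weiszfeld iteration x_{k+1} = F(x_k) for the Fermat-Torricelli problem.

    points: (m, n) array whose rows are the anchors a_1, ..., a_m
    x0:     (n,) starting point
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = np.linalg.norm(points - x, axis=1)
        if np.any(d < 1e-12):              # F(x) := x at an anchor point
            return x
        w = 1.0 / d                        # weights 1 / ||x - a_i||
        x_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Example: three vertices of a triangle in the plane.
pts = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])
print(weiszfeld(pts, pts.mean(axis=0)))
```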


Kuhn’s Convergence Theorem

Weiszfeld also claimed that if x₀ ∉ {a₁, a₂, …, a_m}, where the points aᵢ, i = 1, …, m, are not collinear, then (x_k) converges to the unique optimal solution of the problem. A counterexample, together with a corrected statement and a proof of convergence, was given by Kuhn in 1972.

Theorem. Let (x_k) be the sequence generated by Weiszfeld's algorithm. Suppose that x_k ∉ {a₁, a₂, …, a_m} for all k ≥ 0. Then (x_k) converges to the optimal solution of the problem.


C¹ Functions

A function f : Rⁿ → R is called a C¹ function if it has all partial derivatives ∂f/∂xᵢ for i = 1, …, n, and all these partial derivatives are continuous functions. The gradient of f at a point x is denoted by

$$\nabla f(x) = \Big(\frac{\partial f}{\partial x_1}(x), \dots, \frac{\partial f}{\partial x_n}(x)\Big).$$

Example. Let f(x) = ‖x‖² = Σᵢ₌₁ⁿ xᵢ². Then f is a C¹ function with ∂f/∂xᵢ(x) = 2xᵢ for x = (x₁, …, x_n), and ∇f(x) = (2x₁, …, 2x_n) = 2x.


C¹ Convex Functions

Theorem. Let f : Rⁿ → R be a C¹ function. Then f is convex if and only if

$$f(\bar{x}) + \langle \nabla f(\bar{x}), x - \bar{x} \rangle \le f(x) \quad \text{for all } x, \bar{x} \in \mathbb{R}^n.$$

Figure: C¹ Convex Functions.


Lipschitz Continuous Functions and C1,1 Functions

Definition. A function g : Rⁿ → Rᵐ is said to be Lipschitz continuous if there exists a constant ℓ ≥ 0 such that

$$\|g(x) - g(u)\| \le \ell \|x - u\| \quad \text{for all } x, u \in \mathbb{R}^n.$$

A C¹ function f : Rⁿ → R is called a C¹,¹ function if its gradient is Lipschitz continuous, i.e., there exists a constant ℓ > 0 such that

$$\|\nabla f(x) - \nabla f(u)\| \le \ell \|x - u\| \quad \text{for all } x, u \in \mathbb{R}^n.$$


C1,1 Functions

Theorem. Let f : Rⁿ → R be a C¹,¹ function whose gradient is Lipschitz continuous with Lipschitz constant ℓ. Then

$$f(x) \le f(\bar{x}) + \langle \nabla f(\bar{x}), x - \bar{x} \rangle + \frac{\ell}{2}\|x - \bar{x}\|^2 \quad \text{for all } x, \bar{x} \in \mathbb{R}^n.$$

Figure: C¹ Convex Functions.


The Gradient Method

Let f : Rⁿ → R be a C¹ function and let (α_k) be a sequence of positive real numbers called the sequence of step sizes. The gradient method generates a sequence (x_k) as follows:

Choose x₀ ∈ Rⁿ.
Define x_{k+1} = x_k − α_k ∇f(x_k) for k ≥ 0.

Given a C¹ function f : Rⁿ → R and α > 0, define g : Rⁿ → Rⁿ by g(x) = x − α∇f(x). Then the gradient method with constant step size α is given by:

Choose x₀ ∈ Rⁿ.
Define x_{k+1} = x_k − α∇f(x_k) = g(x_k) for k ≥ 0.
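A short NumPy sketch of the constant-step variant; the quadratic test function and iteration count are illustrative:

```python
import numpy as np

def gradient_method(grad_f, x0, alpha, num_iters=100):
    """Gradient method with constant step size alpha."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - alpha * grad_f(x)
    return x

# Example: f(x) = ||x||^2 has grad f(x) = 2x and Lipschitz constant ell = 2,
# so the step size from the theorem below is alpha = 1/ell = 0.5.
x_min = gradient_method(lambda x: 2 * x, np.array([3.0, -1.0]), alpha=0.5)
```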


The Gradient Method

Theorem. Let f : Rⁿ → R be a C¹ convex function whose gradient is Lipschitz continuous with constant ℓ > 0, and consider the gradient method with constant step size α = 1/ℓ. Suppose that f has an absolute minimum at x*. Then (f(x_k)) is monotone decreasing and

$$0 \le f(x_k) - f(x^*) \le \frac{\|x_0 - x^*\|^2}{2k\alpha} \quad \text{for all } k \ge 1.$$


Nesterov’s Accelerated Gradient Method

Let f : Rⁿ → R be a C¹ convex function whose gradient is Lipschitz continuous with constant ℓ > 0, and let α = 1/ℓ.

Choose y₁ = x₀ ∈ Rⁿ and t₀ = 0.
Define x_k = y_k − α∇f(y_k) = g(y_k) for k ≥ 1.
Update

$$t_k = \frac{1 + \sqrt{1 + 4t_{k-1}^2}}{2} \quad \text{for } k \ge 1.$$

Define

$$y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}\,(x_k - x_{k-1}) \quad \text{for } k \ge 1.$$
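A NumPy sketch of the scheme in the standard Beck-Teboulle bookkeeping (compute t_{k+1} first, then the momentum step); the quadratic example is illustrative:

```python
import numpy as np

def nesterov_agm(grad_f, x0, alpha, num_iters=100):
    """Accelerated gradient method with constant step size alpha = 1/ell."""
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    t = 1.0                      # t_1 = (1 + sqrt(1 + 4*t_0^2))/2 with t_0 = 0
    for _ in range(num_iters):
        x = y - alpha * grad_f(y)                        # gradient step at y_k
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)      # momentum step
        x_prev, t = x, t_next
    return x_prev

# Example: f(x) = ||x||^2, ell = 2, alpha = 0.5.
x_min = nesterov_agm(lambda x: 2 * x, np.array([5.0, -2.0]), alpha=0.5)
```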


Nesterov's Accelerated Gradient Method

Theorem. Let f : Rⁿ → R be a C¹ convex function whose gradient is Lipschitz continuous with constant ℓ > 0, and consider Nesterov's accelerated gradient method with constant step size α = 1/ℓ. Suppose that f has an absolute minimum at x*. Then

$$0 \le f(x_k) - f(x^*) \le \frac{2\|x_0 - x^*\|^2}{\alpha (k+1)^2} \quad \text{for all } k \ge 1.$$


Distance Functions

Given a set Ω ⊂ Rⁿ, the distance function associated with Ω is defined by

$$d(x; \Omega) := \inf\{\|x - \omega\| \mid \omega \in \Omega\}.$$

For each x ∈ Rⁿ, the Euclidean projection from x to Ω is defined by

$$P(x; \Omega) := \{\omega \in \Omega \mid \|x - \omega\| = d(x; \Omega)\}.$$

Figure: Distance function.
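For a Euclidean ball the projection has a simple closed form, which is exactly the case used in the smoothing results below. A small sketch; the helper names are illustrative:

```python
import numpy as np

def proj_ball(x, center=None, radius=1.0):
    """Euclidean projection of x onto a closed ball (the unit ball by default)."""
    x = np.asarray(x, dtype=float)
    c = np.zeros_like(x) if center is None else np.asarray(center, dtype=float)
    v = x - c
    nv = np.linalg.norm(v)
    return x if nv <= radius else c + radius * v / nv

def dist(x, center=None, radius=1.0):
    """d(x; Omega) = ||x - P(x; Omega)|| for the ball Omega."""
    return np.linalg.norm(np.asarray(x, dtype=float) - proj_ball(x, center, radius))
```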


Squared Distance Functions

Theorem. Let Ω be a nonempty, closed, convex subset of Rⁿ. Then:
- For each x ∈ Rⁿ, the Euclidean projection P(x; Ω) is a singleton.
- The projection mapping is nonexpansive:
$$\|P(x_1; \Omega) - P(x_2; \Omega)\| \le \|x_1 - x_2\| \quad \text{for all } x_1, x_2 \in \mathbb{R}^n.$$
- The squared distance function φ(x) = [d(x; Ω)]², x ∈ Rⁿ, is a C¹,¹ function with
$$\nabla \varphi(x) = 2[x - P(x; \Omega)].$$


Nesterov’s Smoothing Technique

Theorem. Given any a ∈ Rⁿ and µ > 0, a Nesterov smoothing approximation of φ(x) := ‖x − a‖ has the representation

$$f_\mu(x) = \frac{\|x - a\|^2}{2\mu} - \frac{\mu}{2}\Big[d\Big(\frac{x - a}{\mu}; \mathbb{B}\Big)\Big]^2,$$

where 𝔹 is the closed unit ball of Rⁿ. In addition,

$$\nabla f_\mu(x) = u_\mu(x) = P\Big(\frac{x - a}{\mu}; \mathbb{B}\Big),$$

and the gradient ∇f_µ is Lipschitz continuous with constant ℓ_µ = 1/µ.

Y. Nesterov: Smooth minimization of non-smooth functions. Math. Program., Ser. A 103 (2005), 127-152.
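A self-contained sketch of f_µ and its gradient under these formulas; the function name is illustrative:

```python
import numpy as np

def smoothed_norm(x, a, mu):
    """Nesterov smoothing f_mu of phi(x) = ||x - a||; returns (value, gradient)."""
    x, a = np.asarray(x, dtype=float), np.asarray(a, dtype=float)
    z = (x - a) / mu
    p = z / max(np.linalg.norm(z), 1.0)    # projection of z onto the unit ball
    value = np.dot(x - a, x - a) / (2 * mu) - (mu / 2) * np.dot(z - p, z - p)
    return value, p                        # grad f_mu(x) = P((x - a)/mu; B)
```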


Generalized Fermat-Torricelli Problems

Consider the following version of the Fermat-Torricelli problem:

$$\text{minimize } F(x) := \sum_{i=1}^m \|x - a_i\| \quad \text{subject to } x \in \mathbb{R}^n.$$

A smooth approximation of F is given by

$$F_\mu(x) = \sum_{i=1}^m \left( \frac{\|x - a_i\|^2}{2\mu} - \frac{\mu}{2}\Big[d\Big(\frac{x - a_i}{\mu}; \mathbb{B}\Big)\Big]^2 \right).$$


Generalized Fermat-Torricelli Problems

The function F_µ is continuously differentiable on Rⁿ with its gradient given by

$$\nabla F_\mu(x) = \sum_{i=1}^m P\Big(\frac{x - a_i}{\mu}; \mathbb{B}\Big).$$

Its gradient is Lipschitz continuous with constant

$$L_\mu = \frac{m}{\mu}.$$

Moreover, one has the following estimate:

$$F_\mu(x) \le F(x) \le F_\mu(x) + \frac{m\mu}{2} \quad \text{for all } x \in \mathbb{R}^n.$$
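Putting the pieces together: a gradient-method sketch on F_µ with step size α = 1/L_µ = µ/m. The data, µ, and iteration count are illustrative; smaller µ gives a tighter approximation at the cost of slower progress, and an accelerated variant would converge faster:

```python
import numpy as np

def fermat_torricelli_smoothed(points, x0, mu=0.1, num_iters=5000):
    """Gradient method on F_mu; grad F_mu(x) = sum_i P((x - a_i)/mu; B)."""
    pts = np.asarray(points, dtype=float)
    alpha = mu / len(pts)                          # 1/L_mu with L_mu = m/mu
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        Z = (x - pts) / mu                         # rows (x - a_i)/mu
        norms = np.maximum(np.linalg.norm(Z, axis=1), 1.0)
        grad = (Z / norms[:, None]).sum(axis=0)    # sum of unit-ball projections
        x = x - alpha * grad
    return x

pts = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])
print(fermat_torricelli_smoothed(pts, pts.mean(axis=0)))
```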


Generalized Fermat-Torricelli Problems

To demonstrate the method, let us consider a numerical example.

Example. The longitude/latitude coordinates (in decimal format) of US cities are recorded, for example, at http://www.realestate3d.com/gps/uslatlongdegmin.htm. Our goal is to find a point that minimizes the sum of distances to the given points representing the cities. If we use the Euclidean norm, an approximate optimal value is V* = 23409.33 with suboptimal solution x* = (38.63, 97.35). If we use the ℓ₁-norm, the algorithm presented in this section finds an approximate optimal value V* = 28724.68 and a suboptimal solution x* = (39.48, 97.22).


Generalized Fermat-Torricelli Problems

Example. The graphs below show the convergence of the algorithm under different norms (sum norm, max norm, Euclidean norm); the function value is plotted against the iteration count k.

Figure: A generalized Fermat-Torricelli problem with different norms.


Differences of Convex Functions

Consider the problem

$$\text{minimize } f(x) := g(x) - h(x), \quad x \in \mathbb{R}^n,$$

where g : Rⁿ → R and h : Rⁿ → R are convex. We call g − h a DC decomposition of f.

Example: g(x) = x⁴, h(x) = 3x² + x, f = g − h.


Examples of DC Programming

Fermat-Torricelli:
$$f(x) = \sum_{i=1}^m c_i \|x - a_i\|$$

Clustering:
$$f(x^1, \dots, x^k) = \sum_{i=1}^m \min_{\ell=1,\dots,k} \|x^\ell - a_i\|^2$$

Multifacility location:
$$f(x^1, \dots, x^k) = \sum_{i=1}^m \min_{\ell=1,\dots,k} \|x^\ell - a_i\|$$


The DCA for Clustering

Problem formulation¹: Let aᵢ, i = 1, …, m, be m data points in Rⁿ. Minimize

$$f(x^1, \dots, x^k) := \sum_{i=1}^m \min\{\|x^l - a_i\|^2 : l = 1, \dots, k\}$$

over x^l ∈ Rⁿ, l = 1, …, k.

¹L.T.H. An, M.T. Belghiti, P.D. Tao: A new efficient algorithm based on DC programming and DCA for clustering. J. Glob. Optim. 37 (2007), 593-608.


K-Mean Clustering

Let x₁, x₂, …, x_m be the data points and let c₁, …, c_k denote the centers. The k-means algorithm (a minimal sketch follows the list):
1. Randomly select k cluster centers.
2. Assign each data point to the nearest center.
3. Compute the average of the data points assigned to each center.
4. Repeat steps 2 and 3 with the new centers until the centroids no longer move.

Although k-means clustering is effective in many situations, it also has some disadvantages:
- The k-means algorithm does not necessarily find the optimal solution.
- The algorithm is sensitive to the initially selected cluster centers.
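A minimal NumPy sketch of the k-means loop just described; the seed and convergence test are illustrative choices:

```python
import numpy as np

def kmeans(data, k, num_iters=100, seed=0):
    """Plain k-means: assign points to nearest centers, then re-average."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(num_iters):
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)          # nearest-center assignment
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)])
        if np.allclose(new_centers, centers):      # centroids no longer move
            break
        centers = new_centers
    return centers, labels
```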


DCA for Clustering and K-Means²

- Both DCA1 and DCA2 are better than k-means: the objective values given by DCA1 and DCA2 are much smaller than those computed by k-means.
- DCA2 is the best among the three algorithms: it provides the best solution in the shortest time. DCA2 is very fast and can therefore handle large-scale problems.²

$$f(x^1, \dots, x^k) = \sum_{i=1}^m \min_{\ell=1,\dots,k} \|x^\ell - a_i\|_1$$

This is a nonsmooth, nonconvex program for which efficient solution algorithms are rare, especially in the large-scale setting.

²L.T.H. An, M.T. Belghiti, P.D. Tao: A new efficient algorithm based on DC programming and DCA for clustering. J. Glob. Optim. 37 (2007), 593-608.


Subgradients and Fenchel Conjugates of Convex Functions

Definition. Let f : Rⁿ → R be a convex function. A subgradient of f at x̄ is any v ∈ Rⁿ such that

$$\langle v, x - \bar{x} \rangle \le f(x) - f(\bar{x}) \quad \text{for all } x \in \mathbb{R}^n.$$

The subdifferential ∂f(x̄) of f at x̄ is the set of all subgradients of f at x̄.

Definition. Let f : Rⁿ → R be a function. The Fenchel conjugate of f is defined by

$$f^*(x) = \sup_{u \in \mathbb{R}^n} \{\langle x, u \rangle - f(u)\}, \quad x \in \mathbb{R}^n.$$


DC Algorithm-The DCA

Consider the problem

$$\text{minimize } f(x) := g(x) - h(x), \quad x \in \mathbb{R}^n,$$

where g : Rⁿ → R and h : Rⁿ → R are convex.

The DCA³:
INPUT: x₁ ∈ Rⁿ, N ∈ N
for k = 1, …, N do
  Find y_k ∈ ∂h(x_k)
  Find x_{k+1} ∈ ∂g*(y_k)
end for
OUTPUT: x_{N+1}

³P.D. Tao, L.T.H. An: A d.c. optimization algorithm for solving the trust-region subproblem. SIAM J. Optim. 8 (1998), 476-505.
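Since x ∈ ∂g*(y) exactly when x minimizes g(·) − ⟨y, ·⟩, the DCA can be sketched generically as below; the two callbacks are problem-specific and the names are illustrative:

```python
import numpy as np

def dca(subgrad_h, argmin_g_minus_linear, x0, num_iters=50):
    """Generic DCA for f = g - h with g, h convex.

    subgrad_h(x):              some y in the subdifferential of h at x
    argmin_g_minus_linear(y):  a minimizer of g(x) - <y, x>, i.e. an element
                               of the subdifferential of g* at y
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        y = subgrad_h(x)
        x = np.asarray(argmin_g_minus_linear(y), dtype=float)
    return x
```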


An Example of the DCA

minimize f(x) = x⁴ − 3x² − x, x ∈ R.

Then f(x) = g(x) − h(x), where g(x) = x⁴ and h(x) = 3x² + x. We have

$$\partial h(x) = \{6x + 1\}, \qquad \partial g^*(y) = \operatorname*{argmin}_{x \in \mathbb{R}} \,(x^4 - yx) = \sqrt[3]{\frac{y}{4}},$$

so the DCA iteration reads

$$x_{k+1} = \sqrt[3]{\frac{6x_k + 1}{4}}.$$
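The iteration in a few lines of Python; the starting point is an illustrative choice. Note that the fixed-point condition 4x³ = 6x + 1 is exactly f′(x) = 0:

```python
import numpy as np

x = 1.0                              # starting point x_1 (illustrative)
for _ in range(50):
    y = 6 * x + 1                    # y_k in the subdifferential of h at x_k
    x = np.cbrt(y / 4)               # x_{k+1} = argmin_x (x^4 - y*x)
print(x, 4 * x**3 - 6 * x - 1)       # stationary point, f'(x) ~ 0
```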


DC Algorithm-The DCA

Definition. A function h : Rⁿ → R is called γ-convex (γ ≥ 0) if the function defined by k(x) := h(x) − (γ/2)‖x‖², x ∈ Rⁿ, is convex. If there exists γ > 0 such that h is γ-convex, then h is called strongly convex with parameter γ.

Theorem. Consider the sequence (x_k) generated by the DCA. Suppose that g is γ₁-convex and h is γ₂-convex. Then

$$f(x_k) - f(x_{k+1}) \ge \frac{\gamma_1 + \gamma_2}{2}\|x_{k+1} - x_k\|^2 \quad \text{for all } k \in \mathbb{N}.$$


DC Algorithm-The DCA

Definition. We say that an element x̄ ∈ Rⁿ is a stationary point of the function f = g − h if ∂g(x̄) ∩ ∂h(x̄) ≠ ∅. In the case where g and h are differentiable, x̄ is a stationary point of f if and only if ∇f(x̄) = ∇g(x̄) − ∇h(x̄) = 0.

Theorem. Consider the sequence (x_k) generated by the DCA. Then (f(x_k)) is a decreasing sequence. Suppose further that f is bounded from below, and that g is γ₁-convex and h is γ₂-convex with γ₁ + γ₂ > 0. If (x_k) is bounded, then every subsequential limit of (x_k) is a stationary point of f.


DC Decomposition

$$f(x^1, \dots, x^k) = \sum_{i=1}^m \min_{\ell=1,\dots,k} \|x^\ell - a_i\|$$

We will utilize the fact that the minimum of k numbers equals their sum minus the largest sum that omits one of them, which gives

$$f(x^1, \dots, x^k) = \sum_{i=1}^m \Big[ \sum_{\ell=1}^k \|x^\ell - a_i\| - \max_{r=1,\dots,k} \sum_{\ell \ne r} \|x^\ell - a_i\| \Big] = \sum_{i=1}^m \Big( \sum_{\ell=1}^k \|x^\ell - a_i\| \Big) - \sum_{i=1}^m \max_{r=1,\dots,k} \sum_{\ell \ne r} \|x^\ell - a_i\|.$$
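A quick numerical sanity check of the min = sum − max-of-partial-sums identity for one fixed i, with random stand-ins for the distances ‖xℓ − aᵢ‖:

```python
import numpy as np

c = np.random.default_rng(1).random(5)     # stand-ins for ||x^l - a_i||, l = 1..k
lhs = c.min()
rhs = c.sum() - max(c.sum() - c[r] for r in range(len(c)))
assert np.isclose(lhs, rhs)                # min_l c_l == sum - max_r sum_{l != r}
```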


DC Decomposition

We obtain the µ-smoothing approximation f_µ = g_µ − h_µ, where

$$g_\mu(x^1, \dots, x^k) = \frac{1}{2\mu} \sum_{i=1}^m \sum_{\ell=1}^k \|x^\ell - a_i\|^2,$$

$$h_\mu(x^1, \dots, x^k) = \frac{\mu}{2} \sum_{i=1}^m \sum_{\ell=1}^k \Big[d\Big(\frac{x^\ell - a_i}{\mu}; \mathbb{B}\Big)\Big]^2 + \sum_{i=1}^m \max_{r=1,\dots,k} \sum_{\ell \ne r} \left( \frac{1}{2\mu}\|x^\ell - a_i\|^2 - \frac{\mu}{2}\Big[d\Big(\frac{x^\ell - a_i}{\mu}; \mathbb{B}\Big)\Big]^2 \right).$$

To implement the DCA, we need ∂g*_µ and ∂h_µ…


∂gµ

Using the Frobenius norm in a space of matrices, we express g_µ as

$$G_\mu(X) = \frac{m}{2\mu}\|X\|^2 - \frac{1}{\mu}\langle X, B \rangle + \frac{k}{2\mu}\|A\|^2,$$

with the inner product ⟨A, B⟩ = Σℓ Σⱼ aℓⱼ bℓⱼ, where
- X is the k × n matrix with rows x¹, …, xᵏ,
- A is the m × n matrix with rows a₁, …, a_m, and
- B is the k × n matrix whose every row is the sum a₁ + ⋯ + a_m.

Then

$$\nabla G_\mu(X) = \frac{m}{\mu}X - \frac{1}{\mu}B, \qquad X \in \partial G^*(Y) \iff Y \in \partial G(X),$$

so

$$\nabla G_\mu^*(Y) = \frac{1}{m}(B + \mu Y).$$


∂hµ

Write h_µ = H₁ + H₂, where

$$H_1(X) = \frac{\mu}{2} \sum_{i=1}^m \sum_{\ell=1}^k \Big[d\Big(\frac{x^\ell - a_i}{\mu}; \mathbb{B}\Big)\Big]^2, \qquad H_2(X) = \sum_{i=1}^m \max_{r=1,\dots,k} \sum_{\ell \ne r} \left( \frac{1}{2\mu}\|x^\ell - a_i\|^2 - \frac{\mu}{2}\Big[d\Big(\frac{x^\ell - a_i}{\mu}; \mathbb{B}\Big)\Big]^2 \right).$$


∂hµ

$$\nabla H_1(X) = \begin{pmatrix} \sum_{i=1}^m \big[\frac{x^1 - a_i}{\mu} - P\big(\frac{x^1 - a_i}{\mu}; \mathbb{B}\big)\big] \\ \vdots \\ \sum_{i=1}^m \big[\frac{x^k - a_i}{\mu} - P\big(\frac{x^k - a_i}{\mu}; \mathbb{B}\big)\big] \end{pmatrix}$$

For each i = 1, …, m, there is some index Rᵢ such that the Rᵢ-excluded sum is maximal. If we call this sum F_{Rᵢ}, then ∇F_{Rᵢ} is a k × n matrix whose ℓth row, for ℓ ≠ Rᵢ, is P((xℓ − aᵢ)/µ; 𝔹), and whose Rᵢth row is 0. Then

$$V \in \partial H_2(X) \quad \text{if} \quad V = \sum_{i=1}^m \nabla F_{R_i}.$$


Multifacility Location Algorithm

INPUT: X₁ ∈ dom g, N ∈ N
for k = 1, …, N do
  Compute Y_k = ∇H₁(X_k) + V_k with V_k ∈ ∂H₂(X_k)
  Compute X_{k+1} = (1/m)(B + µY_k)
end for
OUTPUT: X_{N+1}
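A compact NumPy sketch of this DCA under the formulas above, with the ∇G*_µ step X = (B + µY)/m. The random initialization, µ, and iteration count are illustrative choices; smaller µ gives a tighter approximation at the cost of slower progress:

```python
import numpy as np

def dca_multifacility(A, k, mu=0.1, num_iters=200, seed=0):
    """DCA for the smoothed multifacility location model f_mu = g_mu - h_mu.

    A: (m, n) data matrix with rows a_1, ..., a_m; returns the (k, n) centers X.
    """
    m, n = A.shape
    rng = np.random.default_rng(seed)
    X = A[rng.choice(m, size=k, replace=False)].astype(float)
    B = np.tile(A.sum(axis=0), (k, 1))        # every row of B is a_1 + ... + a_m
    for _ in range(num_iters):
        Y = np.zeros_like(X)
        for i in range(m):
            Z = (X - A[i]) / mu               # rows (x^l - a_i)/mu
            P = Z / np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), 1.0)
            Y += Z - P                        # contribution of grad H_1
            # Smoothed-norm values f_mu,i,l; the maximal excluded index R_i
            # is the l with the smallest value (max over r of the sum omitting r).
            f = (np.sum((X - A[i])**2, axis=1) / (2 * mu)
                 - (mu / 2) * np.sum((Z - P)**2, axis=1))
            V = P.copy()
            V[np.argmin(f)] = 0.0             # R_i-th row of grad F_{R_i} is 0
            Y += V                            # contribution of a subgradient of H_2
        X = (B + mu * Y) / m                  # X_{k+1} = grad G*_mu(Y_k)
    return X
```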


Clustering

Figure: DCA clustering results (figures not included in the transcript).

Results

Convergence plots (objective value versus iteration, figures omitted) for five experiments: Fermat-Torricelli in R¹⁰ (Algorithm 4); Eight Rings in R² (Algorithm 5); Multifacility in R² (Algorithm 5); Multifacility in R¹⁰ (Algorithm 5); Multifacility Sets in R² (Algorithm 7).

References

[1] L.T.H. An, M.T. Belghiti, P.D. Tao: A new efficient algorithm based on DC programming and DCA for clustering. J. Glob. Optim. 37 (2007), 593-608.

[2] M.N. Nam, R.B. Rector, D. Giles: Minimizing Differences of Convex Functions and Applications to Facility Location and Clustering. arXiv:1511.07595 (2015).

[3] Y. Nesterov: Smooth minimization of non-smooth functions. Math. Program., Ser. A 103 (2005), 127-152.

[4] R.T. Rockafellar: Convex Analysis. Princeton University Press, Princeton, NJ, 1970.