

Algorithms for Minimizing Differences of Convex Functions and Applications to Facility Location and Clustering

Mau Nam Nguyen

(joint work with D. Giles and R. B. Rector)

Fariborz Maseeh Department of Mathematics and Statistics, Portland State University

Mathematics and Statistics Seminar, WSU Vancouver, April 12, 2017


Convex Functions

Definition. Let I ⊂ R be an interval and let f : I → R be a function. The function f is said to be CONVEX on I if

$$f(\lambda x + (1-\lambda)u) \le \lambda f(x) + (1-\lambda) f(u)$$

for all x, u ∈ I and λ ∈ (0, 1). The function f is called CONCAVE if −f is convex.

Figure: Convex Function and Nonconvex Function.


Convex Functions

Example. Let I = R = (−∞, ∞) and let f(x) = |x| for x ∈ R. Then f is a convex function. Indeed, for any x, u ∈ R and λ ∈ (0, 1),

$$f(\lambda x + (1-\lambda)u) = |\lambda x + (1-\lambda)u| \le |\lambda x| + |(1-\lambda)u| = \lambda|x| + (1-\lambda)|u| = \lambda f(x) + (1-\lambda) f(u).$$

Figure: Convex Function and Nonconvex Function.


Characterizations for Convexity

Theorem. Let f : I → R be a differentiable function, where I is an open interval in R. Then f is convex on I if and only if f′ is monotone increasing on I.

Theorem. Let f : I → R be a twice differentiable function, where I is an open interval in R. Then f is convex on I if and only if f″(x) ≥ 0 for all x ∈ I.

Example. The following functions are convex:
(i) f(x) = ax² + bx + c, x ∈ R, a > 0;
(ii) f(x) = eˣ, x ∈ R;
(iii) f(x) = −ln(x), x ∈ (0, ∞).


Convex Sets

Definition. A subset Ω of Rⁿ is CONVEX if [a, b] ⊂ Ω whenever a, b ∈ Ω. Equivalently, Ω is convex if λa + (1−λ)b ∈ Ω for all a, b ∈ Ω and λ ∈ [0, 1].

Figure: Convex set and nonconvex set.


Convex Functions

Definition. Let f : Ω → R be defined on a convex set Ω ⊂ Rⁿ. The function f is said to be CONVEX on Ω if

$$f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda) f(y)$$

for all x, y ∈ Ω and λ ∈ (0, 1). The function f is called CONCAVE if −f is convex.

Figure: Convex Function and Nonconvex Function.


Convex Functions

Theorem

Let fᵢ : Rⁿ → R be convex functions for i = 1, …, m. Then the following functions are convex as well:
(i) the scalar multiple λf for any λ > 0;
(ii) the sum function Σᵢ₌₁ᵐ fᵢ;
(iii) the maximum function max₁≤ᵢ≤ₘ fᵢ.
Let f : Rⁿ → R be convex and let φ : R → R be convex and nondecreasing. Then the composition φ ∘ f is convex.
Let B : Rⁿ → Rᵖ be an affine mapping and let f : Rᵖ → R be a convex function. Then the composition f ∘ B is convex.


Relationship between Convex Functions and Convex Sets

Theorem. A function f : Rⁿ → R is convex if and only if its epigraph epi f is a convex subset of the product space Rⁿ × R.

Figure: Epigraphs of convex function and nonconvex function.


Pierre de Fermat

Pierre de Fermat (1601-1665) was a French lawyer at the Parlement of Toulouse, France, and an amateur mathematician who is given credit for early developments that led to calculus.


Fermat-Torricelli Problem

In the early 17th century, at the end of his book Treatise on Minima and Maxima, the French mathematician Fermat (1601-1665) proposed the following problem: given three points in the plane, find a point such that the sum of its distances to the three given points is minimal. In general: given a₁, a₂, …, a_m ∈ Rⁿ, find x ∈ Rⁿ which minimizes the function

$$f(x) = \sum_{i=1}^m \|x - a_i\|.$$


Weiszfeld’s Algorithm

The gradient of the objective function f is

$$\nabla f(x) = \sum_{i=1}^m \frac{x - a_i}{\|x - a_i\|}, \qquad x \notin \{a_1, a_2, \dots, a_m\}.$$

Solving the equation ∇f(x) = 0 gives

$$x = \frac{\sum_{i=1}^m a_i / \|x - a_i\|}{\sum_{i=1}^m 1 / \|x - a_i\|} =: F(x).$$

For continuity, define F(x) := x for x ∈ {a₁, a₂, …, a_m}. Weiszfeld's algorithm: choose a starting point x₀ ∈ Rⁿ and define

$$x_{k+1} = F(x_k) \quad \text{for } k \in \mathbb{N}.$$
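A minimal NumPy sketch of the Weiszfeld iteration under these formulas; the function name, tolerances, and the triangle data are illustrative, not part of the talk:

```python
import numpy as np

def weiszfeld(points, x0, tol=1e-9, max_iter=1000):
    """Weiszfeld iteration x_{k+1} = F(x_k) for the Fermat-Torricelli problem.

    points: (m, n) array whose rows are the anchors a_1, ..., a_m
    x0:     (n,) starting point
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        d = np.linalg.norm(points - x, axis=1)
        if np.any(d < 1e-12):              # F(x) := x at an anchor point
            return x
        w = 1.0 / d                        # weights 1 / ||x - a_i||
        x_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Example: three vertices of a triangle in the plane.
pts = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])
print(weiszfeld(pts, pts.mean(axis=0)))
```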


Kuhn’s Convergence Theorem

Weiszfeld also claimed that if x₀ ∉ {a₁, a₂, …, a_m}, where the points aᵢ, i = 1, …, m, are not collinear, then (x_k) converges to the unique optimal solution of the problem. A counterexample, together with a corrected statement and a proof of convergence, was given by Kuhn in 1972.

Theorem. Let (x_k) be the sequence generated by Weiszfeld's algorithm. Suppose that x_k ∉ {a₁, a₂, …, a_m} for all k ≥ 0. Then (x_k) converges to the optimal solution of the problem.


C¹ Functions

A function f : Rⁿ → R is called a C¹ function if it has all partial derivatives ∂f/∂xᵢ for i = 1, …, n, and all these partial derivatives are continuous functions. The gradient of f at a point x is denoted by

$$\nabla f(x) = \Big(\frac{\partial f}{\partial x_1}(x), \dots, \frac{\partial f}{\partial x_n}(x)\Big).$$

Example. Let f(x) = ‖x‖² = Σᵢ₌₁ⁿ xᵢ². Then f is a C¹ function with ∂f/∂xᵢ(x) = 2xᵢ for x = (x₁, …, x_n), and ∇f(x) = (2x₁, …, 2x_n) = 2x.


C¹ Convex Functions

Theorem. Let f : Rⁿ → R be a C¹ function. Then f is convex if and only if

$$f(\bar{x}) + \langle \nabla f(\bar{x}), x - \bar{x} \rangle \le f(x) \quad \text{for all } x, \bar{x} \in \mathbb{R}^n.$$

Figure: C¹ Convex Functions.


Lipschitz Continuous Functions and C1,1 Functions

Definition. A function g : Rⁿ → Rᵐ is said to be Lipschitz continuous if there exists a constant ℓ ≥ 0 such that

$$\|g(x) - g(u)\| \le \ell \|x - u\| \quad \text{for all } x, u \in \mathbb{R}^n.$$

A C¹ function f : Rⁿ → R is called a C¹,¹ function if its gradient is Lipschitz continuous, i.e., there exists a constant ℓ > 0 such that

$$\|\nabla f(x) - \nabla f(u)\| \le \ell \|x - u\| \quad \text{for all } x, u \in \mathbb{R}^n.$$


C1,1 Functions

Theorem. Let f : Rⁿ → R be a C¹,¹ function whose gradient is Lipschitz continuous with Lipschitz constant ℓ. Then

$$f(x) \le f(\bar{x}) + \langle \nabla f(\bar{x}), x - \bar{x} \rangle + \frac{\ell}{2}\|x - \bar{x}\|^2 \quad \text{for all } x, \bar{x} \in \mathbb{R}^n.$$

Figure: C¹ Convex Functions.


The Gradient Method

Let f : Rⁿ → R be a C¹ function and let (α_k) be a sequence of positive real numbers called the sequence of step sizes. The gradient method generates a sequence (x_k) as follows:

Choose x₀ ∈ Rⁿ.
Define x_{k+1} = x_k − α_k ∇f(x_k) for k ≥ 0.

Given a C¹ function f : Rⁿ → R and α > 0, define g : Rⁿ → Rⁿ by g(x) = x − α∇f(x). Then the gradient method with constant step size α is given by:

Choose x₀ ∈ Rⁿ.
Define x_{k+1} = x_k − α∇f(x_k) = g(x_k) for k ≥ 0.
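A short NumPy sketch of the constant-step variant; the quadratic test function and iteration count are illustrative:

```python
import numpy as np

def gradient_method(grad_f, x0, alpha, num_iters=100):
    """Gradient method with constant step size alpha."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - alpha * grad_f(x)
    return x

# Example: f(x) = ||x||^2 has grad f(x) = 2x and Lipschitz constant ell = 2,
# so the step size from the theorem below is alpha = 1/ell = 0.5.
x_min = gradient_method(lambda x: 2 * x, np.array([3.0, -1.0]), alpha=0.5)
```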


The Gradient Method

Theorem. Let f : Rⁿ → R be a C¹ convex function whose gradient is Lipschitz continuous with constant ℓ > 0, and consider the gradient method with constant step size α = 1/ℓ. Suppose that f has an absolute minimum at x*. Then (f(x_k)) is monotone decreasing and

$$0 \le f(x_k) - f(x^*) \le \frac{\|x_0 - x^*\|^2}{2k\alpha} \quad \text{for all } k \ge 1.$$


Nesterov’s Accelerated Gradient Method

Let f : Rⁿ → R be a C¹ convex function whose gradient is Lipschitz continuous with constant ℓ > 0, and let α = 1/ℓ.

Choose y₁ = x₀ ∈ Rⁿ and t₀ = 0.
Define x_k = y_k − α∇f(y_k) = g(y_k) for k ≥ 1.
Update

$$t_k = \frac{1 + \sqrt{1 + 4t_{k-1}^2}}{2} \quad \text{for } k \ge 1.$$

Define

$$y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}\,(x_k - x_{k-1}) \quad \text{for } k \ge 1.$$
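A NumPy sketch of the scheme in the standard Beck-Teboulle bookkeeping (compute t_{k+1} first, then the momentum step); the quadratic example is illustrative:

```python
import numpy as np

def nesterov_agm(grad_f, x0, alpha, num_iters=100):
    """Accelerated gradient method with constant step size alpha = 1/ell."""
    x_prev = np.asarray(x0, dtype=float)
    y = x_prev.copy()
    t = 1.0                      # t_1 = (1 + sqrt(1 + 4*t_0^2))/2 with t_0 = 0
    for _ in range(num_iters):
        x = y - alpha * grad_f(y)                        # gradient step at y_k
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x + ((t - 1.0) / t_next) * (x - x_prev)      # momentum step
        x_prev, t = x, t_next
    return x_prev

# Example: f(x) = ||x||^2, ell = 2, alpha = 0.5.
x_min = nesterov_agm(lambda x: 2 * x, np.array([5.0, -2.0]), alpha=0.5)
```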


Nesterov's Accelerated Gradient Method

Theorem. Let f : Rⁿ → R be a C¹ convex function whose gradient is Lipschitz continuous with constant ℓ > 0, and consider Nesterov's accelerated gradient method with constant step size α = 1/ℓ. Suppose that f has an absolute minimum at x*. Then

$$0 \le f(x_k) - f(x^*) \le \frac{2\|x_0 - x^*\|^2}{\alpha (k+1)^2} \quad \text{for all } k \ge 1.$$


Distance Functions

Given a set Ω ⊂ Rⁿ, the distance function associated with Ω is defined by

$$d(x; \Omega) := \inf\{\|x - \omega\| \mid \omega \in \Omega\}.$$

For each x ∈ Rⁿ, the Euclidean projection from x to Ω is defined by

$$P(x; \Omega) := \{\omega \in \Omega \mid \|x - \omega\| = d(x; \Omega)\}.$$

Figure: Distance function.
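For a Euclidean ball the projection has a simple closed form, which is exactly the case used in the smoothing results below. A small sketch; the helper names are illustrative:

```python
import numpy as np

def proj_ball(x, center=None, radius=1.0):
    """Euclidean projection of x onto a closed ball (the unit ball by default)."""
    x = np.asarray(x, dtype=float)
    c = np.zeros_like(x) if center is None else np.asarray(center, dtype=float)
    v = x - c
    nv = np.linalg.norm(v)
    return x if nv <= radius else c + radius * v / nv

def dist(x, center=None, radius=1.0):
    """d(x; Omega) = ||x - P(x; Omega)|| for the ball Omega."""
    return np.linalg.norm(np.asarray(x, dtype=float) - proj_ball(x, center, radius))
```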


Squared Distance Functions

Theorem. Let Ω be a nonempty, closed, convex subset of Rⁿ. Then:
- For each x ∈ Rⁿ, the Euclidean projection P(x; Ω) is a singleton.
- The projection mapping is nonexpansive:
$$\|P(x_1; \Omega) - P(x_2; \Omega)\| \le \|x_1 - x_2\| \quad \text{for all } x_1, x_2 \in \mathbb{R}^n.$$
- The squared distance function φ(x) = [d(x; Ω)]², x ∈ Rⁿ, is a C¹,¹ function with
$$\nabla \varphi(x) = 2[x - P(x; \Omega)].$$


Nesterov’s Smoothing Technique

Theorem. Given any a ∈ Rⁿ and µ > 0, a Nesterov smoothing approximation of φ(x) := ‖x − a‖ has the representation

$$f_\mu(x) = \frac{\|x - a\|^2}{2\mu} - \frac{\mu}{2}\Big[d\Big(\frac{x - a}{\mu}; \mathbb{B}\Big)\Big]^2,$$

where 𝔹 is the closed unit ball of Rⁿ. In addition,

$$\nabla f_\mu(x) = u_\mu(x) = P\Big(\frac{x - a}{\mu}; \mathbb{B}\Big),$$

and the gradient ∇f_µ is Lipschitz continuous with constant ℓ_µ = 1/µ.

Y. Nesterov: Smooth minimization of non-smooth functions. Math. Program., Ser. A 103 (2005), 127-152.
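A self-contained sketch of f_µ and its gradient under these formulas; the function name is illustrative:

```python
import numpy as np

def smoothed_norm(x, a, mu):
    """Nesterov smoothing f_mu of phi(x) = ||x - a||; returns (value, gradient)."""
    x, a = np.asarray(x, dtype=float), np.asarray(a, dtype=float)
    z = (x - a) / mu
    p = z / max(np.linalg.norm(z), 1.0)    # projection of z onto the unit ball
    value = np.dot(x - a, x - a) / (2 * mu) - (mu / 2) * np.dot(z - p, z - p)
    return value, p                        # grad f_mu(x) = P((x - a)/mu; B)
```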


Generalized Fermat-Torricelli Problems

Consider the following version of the Fermat-Torricelli problem:

$$\text{minimize } F(x) := \sum_{i=1}^m \|x - a_i\| \quad \text{subject to } x \in \mathbb{R}^n.$$

A smooth approximation of F is given by

$$F_\mu(x) = \sum_{i=1}^m \left( \frac{\|x - a_i\|^2}{2\mu} - \frac{\mu}{2}\Big[d\Big(\frac{x - a_i}{\mu}; \mathbb{B}\Big)\Big]^2 \right).$$


Generalized Fermat-Torricelli Problems

The function F_µ is continuously differentiable on Rⁿ with its gradient given by

$$\nabla F_\mu(x) = \sum_{i=1}^m P\Big(\frac{x - a_i}{\mu}; \mathbb{B}\Big).$$

Its gradient is Lipschitz continuous with constant

$$L_\mu = \frac{m}{\mu}.$$

Moreover, one has the following estimate:

$$F_\mu(x) \le F(x) \le F_\mu(x) + \frac{m\mu}{2} \quad \text{for all } x \in \mathbb{R}^n.$$
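Putting the pieces together: a gradient-method sketch on F_µ with step size α = 1/L_µ = µ/m. The data, µ, and iteration count are illustrative; smaller µ gives a tighter approximation at the cost of slower progress, and an accelerated variant would converge faster:

```python
import numpy as np

def fermat_torricelli_smoothed(points, x0, mu=0.1, num_iters=5000):
    """Gradient method on F_mu; grad F_mu(x) = sum_i P((x - a_i)/mu; B)."""
    pts = np.asarray(points, dtype=float)
    alpha = mu / len(pts)                          # 1/L_mu with L_mu = m/mu
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        Z = (x - pts) / mu                         # rows (x - a_i)/mu
        norms = np.maximum(np.linalg.norm(Z, axis=1), 1.0)
        grad = (Z / norms[:, None]).sum(axis=0)    # sum of unit-ball projections
        x = x - alpha * grad
    return x

pts = np.array([[0.0, 0.0], [4.0, 0.0], [2.0, 3.0]])
print(fermat_torricelli_smoothed(pts, pts.mean(axis=0)))
```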


Generalized Fermat-Torricelli Problems

To demonstrate the method, let us consider a numerical example.

Example. The longitude/latitude coordinates (in decimal format) of US cities are recorded, for example, at http://www.realestate3d.com/gps/uslatlongdegmin.htm. Our goal is to find a point that minimizes the sum of distances to the given points representing the cities. If we use the Euclidean norm, an approximate optimal value is V* = 23409.33 with suboptimal solution x* = (38.63, 97.35). If we use the ℓ₁-norm, the algorithm presented in this section finds an approximate optimal value V* = 28724.68 and a suboptimal solution x* = (39.48, 97.22).


Generalized Fermat-Torricelli Problems

Example. The graphs below show the convergence of the algorithm under different norms (sum norm, max norm, Euclidean norm); the function value is plotted against the iteration count k.

Figure: A generalized Fermat-Torricelli problem with different norms.


Differences of Convex Functions

Consider the problem

$$\text{minimize } f(x) := g(x) - h(x), \quad x \in \mathbb{R}^n,$$

where g : Rⁿ → R and h : Rⁿ → R are convex. We call g − h a DC decomposition of f.

Example: g(x) = x⁴, h(x) = 3x² + x, f = g − h.


Examples of DC Programming

Fermat-Torricelli:
$$f(x) = \sum_{i=1}^m c_i \|x - a_i\|$$

Clustering:
$$f(x^1, \dots, x^k) = \sum_{i=1}^m \min_{\ell=1,\dots,k} \|x^\ell - a_i\|^2$$

Multifacility location:
$$f(x^1, \dots, x^k) = \sum_{i=1}^m \min_{\ell=1,\dots,k} \|x^\ell - a_i\|$$


The DCA for Clustering

Problem formulation¹: Let aᵢ, i = 1, …, m, be m data points in Rⁿ. Minimize

$$f(x^1, \dots, x^k) := \sum_{i=1}^m \min\{\|x^l - a_i\|^2 : l = 1, \dots, k\}$$

over x^l ∈ Rⁿ, l = 1, …, k.

¹L.T.H. An, M.T. Belghiti, P.D. Tao: A new efficient algorithm based on DC programming and DCA for clustering. J. Glob. Optim. 37 (2007), 593-608.


K-Mean Clustering

Let x₁, x₂, …, x_m be the data points and let c₁, …, c_k denote the centers. The k-means algorithm (a minimal sketch follows the list):
1. Randomly select k cluster centers.
2. Assign each data point to the nearest center.
3. Compute the average of the data points assigned to each center.
4. Repeat steps 2 and 3 with the new centers until the centroids no longer move.

Although k-means clustering is effective in many situations, it also has some disadvantages:
- The k-means algorithm does not necessarily find the optimal solution.
- The algorithm is sensitive to the initially selected cluster centers.
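A minimal NumPy sketch of the k-means loop just described; the seed and convergence test are illustrative choices:

```python
import numpy as np

def kmeans(data, k, num_iters=100, seed=0):
    """Plain k-means: assign points to nearest centers, then re-average."""
    rng = np.random.default_rng(seed)
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(num_iters):
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)          # nearest-center assignment
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)])
        if np.allclose(new_centers, centers):      # centroids no longer move
            break
        centers = new_centers
    return centers, labels
```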


DCA for Clustering and K-Means²

- Both DCA1 and DCA2 are better than k-means: the objective values given by DCA1 and DCA2 are much smaller than those computed by k-means.
- DCA2 is the best among the three algorithms: it provides the best solution in the shortest time. DCA2 is very fast and can therefore handle large-scale problems.²

$$f(x^1, \dots, x^k) = \sum_{i=1}^m \min_{\ell=1,\dots,k} \|x^\ell - a_i\|_1$$

This is a nonsmooth, nonconvex program for which efficient solution algorithms are rare, especially in the large-scale setting.

²L.T.H. An, M.T. Belghiti, P.D. Tao: A new efficient algorithm based on DC programming and DCA for clustering. J. Glob. Optim. 37 (2007), 593-608.


Subgradients and Fenchel Conjugates of Convex Functions

Definition. Let f : Rⁿ → R be a convex function. A subgradient of f at x̄ is any v ∈ Rⁿ such that

$$\langle v, x - \bar{x} \rangle \le f(x) - f(\bar{x}) \quad \text{for all } x \in \mathbb{R}^n.$$

The subdifferential ∂f(x̄) of f at x̄ is the set of all subgradients of f at x̄.

Definition. Let f : Rⁿ → R be a function. The Fenchel conjugate of f is defined by

$$f^*(x) = \sup_{u \in \mathbb{R}^n} \{\langle x, u \rangle - f(u)\}, \quad x \in \mathbb{R}^n.$$


DC Algorithm-The DCA

Consider the problem

$$\text{minimize } f(x) := g(x) - h(x), \quad x \in \mathbb{R}^n,$$

where g : Rⁿ → R and h : Rⁿ → R are convex.

The DCA³:
INPUT: x₁ ∈ Rⁿ, N ∈ N
for k = 1, …, N do
  Find y_k ∈ ∂h(x_k)
  Find x_{k+1} ∈ ∂g*(y_k)
end for
OUTPUT: x_{N+1}

³P.D. Tao, L.T.H. An: A d.c. optimization algorithm for solving the trust-region subproblem. SIAM J. Optim. 8 (1998), 476-505.
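Since x ∈ ∂g*(y) exactly when x minimizes g(·) − ⟨y, ·⟩, the DCA can be sketched generically as below; the two callbacks are problem-specific and the names are illustrative:

```python
import numpy as np

def dca(subgrad_h, argmin_g_minus_linear, x0, num_iters=50):
    """Generic DCA for f = g - h with g, h convex.

    subgrad_h(x):              some y in the subdifferential of h at x
    argmin_g_minus_linear(y):  a minimizer of g(x) - <y, x>, i.e. an element
                               of the subdifferential of g* at y
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        y = subgrad_h(x)
        x = np.asarray(argmin_g_minus_linear(y), dtype=float)
    return x
```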


An Example of the DCA

minimize f(x) = x⁴ − 3x² − x, x ∈ R.

Then f(x) = g(x) − h(x), where g(x) = x⁴ and h(x) = 3x² + x. We have

$$\partial h(x) = \{6x + 1\}, \qquad \partial g^*(y) = \operatorname*{argmin}_{x \in \mathbb{R}} \,(x^4 - yx) = \sqrt[3]{\frac{y}{4}},$$

so the DCA iteration reads

$$x_{k+1} = \sqrt[3]{\frac{6x_k + 1}{4}}.$$
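The iteration in a few lines of Python; the starting point is an illustrative choice. Note that the fixed-point condition 4x³ = 6x + 1 is exactly f′(x) = 0:

```python
import numpy as np

x = 1.0                              # starting point x_1 (illustrative)
for _ in range(50):
    y = 6 * x + 1                    # y_k in the subdifferential of h at x_k
    x = np.cbrt(y / 4)               # x_{k+1} = argmin_x (x^4 - y*x)
print(x, 4 * x**3 - 6 * x - 1)       # stationary point, f'(x) ~ 0
```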


DC Algorithm-The DCA

Definition. A function h : Rⁿ → R is called γ-convex (γ ≥ 0) if the function defined by k(x) := h(x) − (γ/2)‖x‖², x ∈ Rⁿ, is convex. If there exists γ > 0 such that h is γ-convex, then h is called strongly convex with parameter γ.

Theorem. Consider the sequence (x_k) generated by the DCA. Suppose that g is γ₁-convex and h is γ₂-convex. Then

$$f(x_k) - f(x_{k+1}) \ge \frac{\gamma_1 + \gamma_2}{2}\|x_{k+1} - x_k\|^2 \quad \text{for all } k \in \mathbb{N}.$$


DC Algorithm-The DCA

Definition. We say that an element x̄ ∈ Rⁿ is a stationary point of the function f = g − h if ∂g(x̄) ∩ ∂h(x̄) ≠ ∅. In the case where g and h are differentiable, x̄ is a stationary point of f if and only if ∇f(x̄) = ∇g(x̄) − ∇h(x̄) = 0.

Theorem. Consider the sequence (x_k) generated by the DCA. Then (f(x_k)) is a decreasing sequence. Suppose further that f is bounded from below, and that g is γ₁-convex and h is γ₂-convex with γ₁ + γ₂ > 0. If (x_k) is bounded, then every subsequential limit of (x_k) is a stationary point of f.


DC Decomposition

$$f(x^1, \dots, x^k) = \sum_{i=1}^m \min_{\ell=1,\dots,k} \|x^\ell - a_i\|$$

We will utilize the fact that the minimum of k numbers equals their sum minus the largest sum that omits one of them, which gives

$$f(x^1, \dots, x^k) = \sum_{i=1}^m \Big[ \sum_{\ell=1}^k \|x^\ell - a_i\| - \max_{r=1,\dots,k} \sum_{\ell \ne r} \|x^\ell - a_i\| \Big] = \sum_{i=1}^m \Big( \sum_{\ell=1}^k \|x^\ell - a_i\| \Big) - \sum_{i=1}^m \max_{r=1,\dots,k} \sum_{\ell \ne r} \|x^\ell - a_i\|.$$
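A quick numerical sanity check of the min = sum − max-of-partial-sums identity for one fixed i, with random stand-ins for the distances ‖xℓ − aᵢ‖:

```python
import numpy as np

c = np.random.default_rng(1).random(5)     # stand-ins for ||x^l - a_i||, l = 1..k
lhs = c.min()
rhs = c.sum() - max(c.sum() - c[r] for r in range(len(c)))
assert np.isclose(lhs, rhs)                # min_l c_l == sum - max_r sum_{l != r}
```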


DC Decomposition

We obtain the µ-smoothing approximation f_µ = g_µ − h_µ, where

$$g_\mu(x^1, \dots, x^k) = \frac{1}{2\mu} \sum_{i=1}^m \sum_{\ell=1}^k \|x^\ell - a_i\|^2,$$

$$h_\mu(x^1, \dots, x^k) = \frac{\mu}{2} \sum_{i=1}^m \sum_{\ell=1}^k \Big[d\Big(\frac{x^\ell - a_i}{\mu}; \mathbb{B}\Big)\Big]^2 + \sum_{i=1}^m \max_{r=1,\dots,k} \sum_{\ell \ne r} \left( \frac{1}{2\mu}\|x^\ell - a_i\|^2 - \frac{\mu}{2}\Big[d\Big(\frac{x^\ell - a_i}{\mu}; \mathbb{B}\Big)\Big]^2 \right).$$

To implement the DCA, we need ∂g*_µ and ∂h_µ…


∂gµ

Using the Frobenius norm in a space of matrices, we express g_µ as

$$G_\mu(X) = \frac{m}{2\mu}\|X\|^2 - \frac{1}{\mu}\langle X, B \rangle + \frac{k}{2\mu}\|A\|^2,$$

with the inner product ⟨A, B⟩ = Σℓ Σⱼ aℓⱼ bℓⱼ, where
- X is the k × n matrix with rows x¹, …, xᵏ,
- A is the m × n matrix with rows a₁, …, a_m, and
- B is the k × n matrix whose every row is the sum a₁ + ⋯ + a_m.

Then

$$\nabla G_\mu(X) = \frac{m}{\mu}X - \frac{1}{\mu}B, \qquad X \in \partial G^*(Y) \iff Y \in \partial G(X),$$

so

$$\nabla G_\mu^*(Y) = \frac{1}{m}(B + \mu Y).$$


∂hµ

Write h_µ = H₁ + H₂, where

$$H_1(X) = \frac{\mu}{2} \sum_{i=1}^m \sum_{\ell=1}^k \Big[d\Big(\frac{x^\ell - a_i}{\mu}; \mathbb{B}\Big)\Big]^2, \qquad H_2(X) = \sum_{i=1}^m \max_{r=1,\dots,k} \sum_{\ell \ne r} \left( \frac{1}{2\mu}\|x^\ell - a_i\|^2 - \frac{\mu}{2}\Big[d\Big(\frac{x^\ell - a_i}{\mu}; \mathbb{B}\Big)\Big]^2 \right).$$


∂hµ

$$\nabla H_1(X) = \begin{pmatrix} \sum_{i=1}^m \big[\frac{x^1 - a_i}{\mu} - P\big(\frac{x^1 - a_i}{\mu}; \mathbb{B}\big)\big] \\ \vdots \\ \sum_{i=1}^m \big[\frac{x^k - a_i}{\mu} - P\big(\frac{x^k - a_i}{\mu}; \mathbb{B}\big)\big] \end{pmatrix}$$

For each i = 1, …, m, there is some index Rᵢ such that the Rᵢ-excluded sum is maximal. If we call this sum F_{Rᵢ}, then ∇F_{Rᵢ} is a k × n matrix whose ℓth row, for ℓ ≠ Rᵢ, is P((xℓ − aᵢ)/µ; 𝔹), and whose Rᵢth row is 0. Then

$$V \in \partial H_2(X) \quad \text{if} \quad V = \sum_{i=1}^m \nabla F_{R_i}.$$


Multifacility Location Algorithm

INPUT: X₁ ∈ dom g, N ∈ N
for k = 1, …, N do
  Compute Y_k = ∇H₁(X_k) + V_k with V_k ∈ ∂H₂(X_k)
  Compute X_{k+1} = (1/m)(B + µY_k)
end for
OUTPUT: X_{N+1}
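A compact NumPy sketch of this DCA under the formulas above, with the ∇G*_µ step X = (B + µY)/m. The random initialization, µ, and iteration count are illustrative choices; smaller µ gives a tighter approximation at the cost of slower progress:

```python
import numpy as np

def dca_multifacility(A, k, mu=0.1, num_iters=200, seed=0):
    """DCA for the smoothed multifacility location model f_mu = g_mu - h_mu.

    A: (m, n) data matrix with rows a_1, ..., a_m; returns the (k, n) centers X.
    """
    m, n = A.shape
    rng = np.random.default_rng(seed)
    X = A[rng.choice(m, size=k, replace=False)].astype(float)
    B = np.tile(A.sum(axis=0), (k, 1))        # every row of B is a_1 + ... + a_m
    for _ in range(num_iters):
        Y = np.zeros_like(X)
        for i in range(m):
            Z = (X - A[i]) / mu               # rows (x^l - a_i)/mu
            P = Z / np.maximum(np.linalg.norm(Z, axis=1, keepdims=True), 1.0)
            Y += Z - P                        # contribution of grad H_1
            # Smoothed-norm values f_mu,i,l; the maximal excluded index R_i
            # is the l with the smallest value (max over r of the sum omitting r).
            f = (np.sum((X - A[i])**2, axis=1) / (2 * mu)
                 - (mu / 2) * np.sum((Z - P)**2, axis=1))
            V = P.copy()
            V[np.argmin(f)] = 0.0             # R_i-th row of grad F_{R_i} is 0
            Y += V                            # contribution of a subgradient of H_2
        X = (B + mu * Y) / m                  # X_{k+1} = grad G*_mu(Y_k)
    return X
```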


Clustering

Figure: DCA clustering results (figures not included in the transcript).

Results

Convergence plots (objective value versus iteration, figures omitted) for five experiments: Fermat-Torricelli in R¹⁰ (Algorithm 4); Eight Rings in R² (Algorithm 5); Multifacility in R² (Algorithm 5); Multifacility in R¹⁰ (Algorithm 5); Multifacility Sets in R² (Algorithm 7).

References

[1] L.T.H. An, M.T. Belghiti, P.D. Tao: A new efficient algorithm based on DC programming and DCA for clustering. J. Glob. Optim. 37 (2007), 593-608.

[2] M.N. Nam, R.B. Rector, D. Giles: Minimizing Differences of Convex Functions and Applications to Facility Location and Clustering. arXiv:1511.07595 (2015).

[3] Y. Nesterov: Smooth minimization of non-smooth functions. Math. Program., Ser. A 103 (2005), 127-152.

[4] R.T. Rockafellar: Convex Analysis. Princeton University Press, Princeton, NJ, 1970.