
A Theoretical Analysis of Optimization by Gaussian Continuation

Hossein Mobahi and John W. Fisher III
Computer Science and Artificial Intelligence Lab. (CSAIL)

Massachusetts Institute of Technology (MIT)
{hmobahi,fisher}@csail.mit.edu

Abstract

Optimization via the continuation method is a widely used approach for solving nonconvex minimization problems. While this method generally does not provide a global minimum, empirically it often achieves a superior local minimum compared to alternative approaches such as gradient descent. However, theoretical analysis of this method is largely unavailable. Here, we provide a theoretical analysis that bounds the endpoint solution of the continuation method. The derived bound depends on a problem specific characteristic that we refer to as optimization complexity. We show that this characteristic can be analytically computed when the objective function is expressed in some suitable basis functions. Our analysis combines elements of scale-space theory, regularization and differential equations.

1 Introduction

Nonconvex energy minimization problems arise frequently in learning and inference tasks. For example, consider some fundamental tasks in computer vision. Inference in image segmentation (Mumford and Shah 1989), image completion (Mobahi, Rao, and Ma 2009), and optical flow (Sun, Roth, and Black 2010), as well as learning of part-based models (Felzenszwalb et al. 2010) and dictionary learning (Mairal et al. 2009), all involve nonconvex objectives. In nonconvex optimization, computing the global minimum is generally intractable and, as such, heuristic methods are sought. These methods may not always find the global minimum, but often provide good suboptimal solutions. A popular heuristic is the so-called continuation method. It starts by solving an easy problem and progressively changes it to the actual complex task. Each step in this progression is guided by the solution obtained in the previous step.

This idea is very popular owing to its ease of implementation and often superior empirical performance¹ against alternatives such as gradient descent.

Copyright © 2015, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

¹That is, finding much deeper local minima, if not the global minima.

Instances of this concept have been utilized by the artificial intelligence community for more than three decades. Examples include graduated nonconvexity (Blake and Zisserman 1987), mean field theory (Yuille 1987), deterministic annealing (Rose, Gurewitz, and Fox 1990), and optimization via scale-space (Witkin, Terzopoulos, and Kass 1987). It is widely used in various state-of-the-art solutions (see Section 2). Despite that, there exists no theoretical understanding of the method itself². For example, it is not clear which properties of the problem make its associated optimization easy or difficult for this approach.

This paper provides a bound on the objective value attained by the continuation method. The derived bound monotonically depends on a particular characteristic of the objective function. That is, a lower value of the characteristic guarantees attaining a lower objective value by the continuation. This characteristic reflects the complexity of the optimization task. Hence, we refer to it as the optimization complexity. Importantly, we show that this complexity parameter is computable when the objective function is expressed in some suitable basis functions such as Gaussian Radial Basis Functions (RBF).

We provide a brief description of our main result here, while the complete statement is postponed to Theorem 7. Let f(x) be a nonconvex function to be minimized and let $\hat{x}$ be the solution discovered by the continuation method. Let $f^\dagger$ be the minimum of the simplified objective function. Then,

$$f(\hat{x}) \le w_1 f^\dagger + w_2 \sqrt{\alpha} \,, \qquad (1)$$

where $w_1 > 0$ and $w_2 > 0$ are independent of f and α is the optimization complexity of f. When f can be expressed by Gaussian RBFs $f(x) = \sum_{k=1}^{K} a_k e^{-\frac{(x-x_k)^2}{2\delta^2}}$, then in Proposition 9 we show that its optimization complexity α is proportional to $\sum_{j=1}^{K}\sum_{k=1}^{K} a_j a_k \, e^{-\frac{(x_j-x_k)^2}{2(2\delta^2-\varepsilon^2)}}$.

²We note that prior "application tailored" analysis is available, e.g. (Kosowsky and Yuille 1994). However, there is no general and application-independent result in the literature.

Our analysis here combines elements of scale-space theory (Loog, Duistermaat, and Florack 2001), differential equations (Widder 1975), and regularization theory (Girosi, Jones, and Poggio 1995).

We clarify that optimization by continuation, which traces one particular solution, should not be confused with homotopy continuation in the context of finding all roots of a system of equations³. Homotopy continuation has a rich theory for the latter problem (Morgan 2009; Sommese and Wampler 2005), but that is a very different problem from the optimization setup.

Throughout this article, we use ≜ for equality by definition, x for scalars, boldface x for vectors, and X for sets. Denote a function by f(·), its Fourier transform by $\hat{f}(\cdot)$, and its complex conjugate by $\bar{f}(\cdot)$. We often denote the domain of the function by X = R^d and the domain of its Fourier transform by Ω ≜ R^d. Let $k_\sigma(x)$, for σ > 0, denote the isotropic Gaussian kernel,

$$k_\sigma(x) \triangleq \frac{1}{(\sqrt{2\pi}\,\sigma)^d}\, e^{-\frac{\|x\|^2}{2\sigma^2}} \,.$$

Let ‖·‖ indicate ‖·‖₂, and R₊₊ ≜ {x ∈ R | x > 0}. Finally, given a function of the form g : R^d × R₊₊ → R, we write ∇g(x; t) ≜ ∇ₓg(x; t), ∇²g(x; t) ≜ ∇²ₓg(x; t), and $\dot{g}(x;t) \triangleq \frac{d}{dt}g(x;t)$. Finally, $\Delta g(x;t) \triangleq \sum_{k=1}^{d} \frac{\partial^2}{\partial x_k^2} g(x;t)$.

2 Optimization by Continuation

Consider the problem of minimizing a nonconvex objective function. In optimization by continuation, a transformation of the nonconvex function to an easy-to-minimize function is considered. The method then progressively converts the easy problem back to the original function, while following the path of the minimizer. In this paper, we always choose the easier function to be convex. The minimizer of the easy problem can be found efficiently.

This simple idea has been used with great success for various nonconvex problems. Classic examples include data clustering (Gold, Rangarajan, and Mjolsness 1994), graph matching (Gold and Rangarajan 1996; Zaslavskiy, Bach, and Vert 2009; Liu, Qiao, and Xu 2012), semi-supervised kernel machines (Sindhwani, Keerthi, and Chapelle 2006), multiple instance learning (Gehler and Chapelle 2007; Kim and Torre 2010), semi-supervised structured output (Dhillon et al. 2012), language modeling (Bengio et al. 2009), robot navigation (Pretto, Soatto, and Menegatti 2010), shape matching (Tirthapura et al. 1998), ℓ0 norm minimization (Trzasko and Manduca 2009), image deblurring (Boccuto et al. 2002),

³In principle, one may formulate the optimization problem as finding all roots of the gradient and then evaluating the objective at those points to choose the lowest. However, this is not practical as the number of stationary points can be abundant, e.g. exponential in the number of variables for polynomials.

image denoising (Rangarajan and Chellappa 1990; Nikolova, Ng, and Tam 2010), template matching (Dufour, Miller, and Galatsanos 2002), pixel correspondence (Leordeanu and Hebert 2008), active contours (Cohen and Gorre 1995), Hough transform (Leich, Junghans, and Jentschel 2004), image matting (Price, Morse, and Cohen 2010), finding optimal parameters in computer programs (Chaudhuri and Solar-Lezama 2011), and seeking optimal proofs (Chaudhuri, Clochard, and Solar-Lezama 2014).

In fact, the growing interest in this method has made it one of the most favorable solutions for contemporary nonconvex minimization problems. Just within the past few years, the method has been utilized for low-rank matrix recovery (Malek-Mohammadi et al. 2014), error correction by ℓ0 recovery (Mohimani et al. 2010), super resolution (Coupe et al. 2013), photometric stereo (Wu and Tan 2013), image segmentation (Hong, Lu, and Sundaramoorthi 2013), face alignment (Saragih 2013), shape and illumination recovery (Barron 2013), 3D surface estimation (Balzer and Morwald 2012), and dense correspondence of images (Kim et al. 2013). The last two are in fact state-of-the-art solutions for their associated problems. In addition, it has recently been argued that some recent breakthroughs in the training of deep architectures (Hinton, Osindero, and Teh 2006; Erhan et al. 2009) have been made by algorithms that use some form of continuation for learning (Bengio 2009).

We now present a formal statement of optimization by the continuation method. Given an objective function f : X → R, where X = R^d, consider an embedding of f into a family of functions g : X × T → R, where T ≜ [0, ∞), with the following properties. First, g(x, 0) = f(x). Second, g(x, t) is bounded below and is strictly convex in x when t tends to infinity⁴. Third, g(x, t) is continuously differentiable in x and t.

Such an embedding g is sometimes called a homotopy, as it continuously transforms one function to another. The conditions of strict convexity and boundedness from below for g(·, t) as t → ∞ imply that there exists a unique minimizer of g(·, t) when t → ∞. We call this minimizer x_∞.

Define the curve x(t) for t ≥ 0 as one with the following properties. First, lim_{t→∞} x(t) = x_∞. Second, ∀t ≥ 0 ; ∇g(x(t), t) = 0. Third, x(t) is continuous in t. This curve simply sweeps a specific stationary path of g originating at x_∞, as the parameter t progresses backward (see Figure 1). In general, such a curve neither needs to exist nor to be unique. However, these conditions can be guaranteed by imposing the extra condition ∀t ≥ 0 ; det(∇²g(x(t); t)) ≠ 0 (see e.g. Theorem 3 of (Wu 1996)). Throughout this paper, it is assumed that x(t) exists.

In practice, the continuation method is used as follows. First, x_∞ is either derived analytically or approximated numerically by arg min_x g(x; t) for large enough t; the latter can use standard convex optimization tools, as g(x; t) approaches a convex function in x for large t. Then, the stationary path x(t) is numerically tracked until t = 0 (see Algorithm 1). As mentioned in the introduction, for a wide range of applications, the continuation solution x(0) often provides a good local minimizer of f(x), if not the global minimizer.

⁴A rigorous definition of such asymptotic convexity is provided in the supplementary appendix.


Figure 1: Plots show g versus x for each fixed time t.

Algorithm 1 Algorithm for Optimization by Continuation Method
1: Input: f : X → R, sequence t_0 > t_1 > · · · > t_n = 0.
2: x_0 = global minimizer of g(x; t_0).
3: for k = 1 to n do
4:   x_k = local minimizer of g(x; t_k), initialized at x_{k−1}.
5: end for
6: Output: x_n

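As a concrete illustration of Algorithm 1, the following is a minimal numerical sketch for a 1-D objective represented by Gaussian RBFs, where the smoothed objective g(x; σ) has the closed form used later before Corollary 10. The RBF coefficients, the σ schedule, and the use of SciPy optimizers are illustrative choices, not part of the paper.

```python
# Minimal sketch of Algorithm 1 with a Gaussian homotopy for a 1-D RBF objective.
import numpy as np
from scipy.optimize import minimize_scalar, minimize

a  = np.array([-1.0, -0.6, -0.9])   # RBF amplitudes a_k (illustrative)
xc = np.array([-2.0,  0.5,  3.0])   # RBF centers x_k
delta = 0.7                         # shared RBF bandwidth

def g(x, sigma):
    """Gaussian-smoothed objective g(x; sigma) = [f * k_sigma](x), d = 1."""
    s2 = delta**2 + sigma**2
    return np.sum(a * (delta / np.sqrt(s2)) * np.exp(-(x - xc)**2 / (2.0 * s2)))

sigmas = np.linspace(5.0, 0.0, 50)   # plays the role of t_0 > t_1 > ... > t_n = 0
# Step 2: (approximate) global minimizer of the heavily smoothed objective.
x = minimize_scalar(lambda x: g(x, sigmas[0]), bounds=(-10, 10), method="bounded").x
# Steps 3-5: track a local minimizer while the smoothing decreases.
for s in sigmas[1:]:
    x = minimize(lambda z: g(z[0], s), x0=[x], method="Nelder-Mead").x[0]
print("continuation solution x(0) =", x, " f(x(0)) =", g(x, 0.0))
```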

Although this work only focuses on the use of homotopy continuation for nonconvex optimization, there is also interest in this method for convex optimization, e.g. to improve or guarantee the convergence rate (Xiao and Zhang 2012).

3 Analysis

Due to space limitations, only the statements of the results are provided here. Full proofs are available in the supplementary appendix.

3.1 Path Independent Analysis

The first challenge we confront in developing a guarantee for the value of g(x(0); 0) is that g(·; 0) must be evaluated at the point x(0). However, we do not know x(0) unless we actually run the continuation algorithm and see where it lands upon termination. This is obviously not an option for a theoretical analysis of the problem. Hence, the question is whether it is possible to say something about the value of g(x(0); 0) without knowing the point x(0).

Here we prove that this is possible, and we derive an upper bound for g(x(0); 0) without knowing the curve x(t) itself. We, however, require the value of g at the initial point to be known. In addition, we require a global (curve independent) inequality relating ġ(x; t) and g(x; t). Our result is stated in the following lemma.

Lemma 1 (Worst Case Value of g(x(t); t)) Given a function f : X → R and its associated homotopy map g. Given a point x_1 that is a stationary point of g(x; t_1) (w.r.t. x). Denote the curve of stationary points originating from x_1 at t_1 by x(t), i.e. ∀t ∈ [0, t_1] ; ∇g(x(t), t) = 0. Suppose this curve exists. Given continuous functions a and b such that ∀t ∈ [0, t_1] ∀x ∈ X ; a(t)g(x; t) + b(t) ≤ ġ(x; t). Then, the following inequality holds for any t ∈ [0, t_1],

$$g\big(x(t); t\big) \le \Big(g\big(x(t_1); t_1\big) - \int_t^{t_1} e^{\int_s^{t_1} a(r)\,dr}\, b(s)\, ds\Big)\, e^{-\int_t^{t_1} a(r)\,dr} \,. \qquad (2)$$

The proof of this lemma essentially consists of applying a modified version of the differential form of Gronwall's inequality. This lemma determines our next challenge, which is finding the a(t) and b(t) for a given f. In order to do that, we need to be more explicit about the choice of the homotopy. Our following development relies on Gaussian homotopy.

3.2 Gaussian Homotopy

The Gaussian homotopy g : X × T → R for a function f : X → R is defined as the convolution of f with k_σ,

$$g(x;\sigma) \triangleq [f \star k_\sigma](x) \triangleq \int_{\mathcal{X}} f(y)\, k_\sigma(x-y)\, dy \,.$$

In order to emphasize that the homotopy parameter t coincides with the standard deviation of the Gaussian, from here on we switch to the notation g(x; σ) for the homotopy instead of the previously used g(x; t). A well-known property of the Gaussian convolution is that it obeys the heat equation (Widder 1975),

$$\dot{g}(x;\sigma) = \sigma \Delta g(x;\sigma) \,. \qquad (3)$$

This means that in Lemma 1, the condition a(σ)g(x; σ) + b(σ) ≤ ġ(x; σ) can be replaced by a(σ)g(x; σ) + b(σ) ≤ σ∆g(x; σ). In order to find such a(σ) and b(σ), we first obtain a lower bound on ∆g(x; σ) in terms of g(x; σ). Then, we set a(σ)g(x; σ) + b(σ) to be smaller than this lower bound.

Gaussian homotopy has useful properties in the context of the continuation method. First, it enjoys an optimality criterion in terms of the best convexification of f(x) (Mobahi and Fisher III). Second, for some complete basis functions, such as polynomials or Gaussian RBFs, Gaussian convolution has a closed form expression. Finally, under mild conditions, a large enough bandwidth can make g(x; σ) unimodal (Loog, Duistermaat, and Florack 2001) and hence easy to minimize. In fact, the example in Figure 1 is constructed by Gaussian convolution. Observe how the original function (bottom) gradually looks more like a convex function in the figure.
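As a quick illustration (not part of the paper's analysis), the sketch below numerically checks the heat equation (3) for a 1-D Gaussian-RBF objective, using the closed-form smoothing that appears before Corollary 10; the constants are arbitrary choices.

```python
# Numerical check that the Gaussian homotopy of an RBF objective obeys
# g_dot(x; sigma) = sigma * Laplacian g(x; sigma), via central differences.
import numpy as np

a, xc, delta = np.array([1.0, -2.0]), np.array([-1.0, 1.5]), 0.6

def g(x, sigma):
    s2 = delta**2 + sigma**2
    return np.sum(a * (delta / np.sqrt(s2)) * np.exp(-(x - xc)**2 / (2.0 * s2)))

x, sigma, h = 0.3, 0.8, 1e-4
g_dot = (g(x, sigma + h) - g(x, sigma - h)) / (2 * h)                 # d/d(sigma)
g_lap = (g(x + h, sigma) - 2 * g(x, sigma) + g(x - h, sigma)) / h**2  # d^2/dx^2
print(g_dot, "approximately equals", sigma * g_lap)
```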

3.3 Lower Bounding ∆g as a Function of g

Here we want to relate ∆g(x; σ) to g(x; σ). Since the differential operator is only w.r.t. the variable x, we can simplify the notation by disregarding the dependency on σ. Hence, we work with h(x) ≜ g(x; σ) for some fixed σ, and the goal becomes lower bounding ∆h(x) as a function of h(x).

The lower bound must hold at any arbitrary point, say x_0. Remember, we want to bound ∆h(x_0) only as a function of the value of h(x_0) and not x_0 itself. In other words, we do not know where x_0 is, but we are told what h(x_0) is. We can pose this problem as the following functional optimization task, where h_0 ≜ h(x_0) is a known quantity,

$$y = \inf_{f,\, x_1} \Delta f(x_1) \,, \quad \text{s.t.} \quad f(x_1) = h_0 \,, \; f(x) = h(x) \,. \qquad (4)$$

Then it follows⁵ that y ≤ ∆h(x_0). However, solving (4) is too idealistic due to the constraint f(x) = h(x) and the fact that h(x) can be any complicated function. A more practical scenario is to constrain f(x) to match h(x) in terms of some signatures. These signatures must be easy to compute for h(x) and allow solving the associated functional optimization in f.

A potentially useful signature for constraining the problem is the function's smoothness. We quantify the latter for a function f(x) by $\int_\Omega \frac{|\hat{f}(\omega)|^2}{G(\|\omega\|)}\, d\omega$, where G is a decreasing function called the stabilizer. This form essentially penalizes higher frequencies in f. Functional optimization involving this type of constraint has been studied in the realm of regularization theory in machine learning (Girosi, Jones, and Poggio 1995). Deeper mathematical details can be found in (Dyn et al. 1989; Dyn 1989; Madych and Nelson 1990). The smoothness constraint plays a crucial role in our analysis. We denote it by α for brevity, where $\alpha \triangleq (2\pi)^{-\frac{d}{2}} \int_\Omega \frac{|\hat{h}(\omega)|^2}{G(\|\omega\|)}\, d\omega$, and refer to this quantity as the optimization complexity. Hence, the ideal task (4) can be relaxed to the following,

$$\underline{y} = \inf_{f,\, x_1} \Delta f(x_1) \quad \text{s.t.} \quad f(x_1) = h_0 \,, \quad \int_\Omega \frac{|\hat{f}(\omega)|^2}{G(\|\omega\|)}\, d\omega = (\sqrt{2\pi})^d \alpha \,. \qquad (5)$$

⁵If h is a one-to-one map, f(x_1) = h_0 and f(x) = h(x) imply that x_1 = x_0 and hence y = ∆h(x_0).

Since (5) is a relaxation of (4) (because the constraint f(x) = h(x) is replaced by the weaker constraint $\int_\Omega \frac{|\hat{f}(\omega)|^2}{G(\|\omega\|)}\, d\omega = \int_\Omega \frac{|\hat{h}(\omega)|^2}{G(\|\omega\|)}\, d\omega$), it follows that $\underline{y} \le y$. Since y ≤ ∆h(x_0), we get $\underline{y} \le \Delta h(x_0)$, hence the desired lower bound.

In the setting (5), we can indeed solve the associated functional optimization. The result is stated in the following lemma.

Lemma 2 Consider f : X → R with a well-defined Fourier transform. Let G : Ω → R₊₊ be any decreasing function. Suppose f(x_1) = h_0 and $(\frac{1}{\sqrt{2\pi}})^d \int_\Omega \frac{|\hat{f}(\omega)|^2}{G(\omega)}\, d\omega = \alpha$ for given constants h_0 and α. Then $\inf_{f,\,x_1} \Delta f(x_1) = c_1 \Delta G(0) - c_2 \Delta\Delta G(0)$, where (c_1, c_2) is the solution to the following system,

$$\begin{cases} c_1 G(0) - c_2 \Delta G(0) = h_0 \\ c_1^2 \int_\Omega G(\omega)\, d\omega + 2 c_1 c_2 \int_\Omega \|\omega\|^2 G(\omega)\, d\omega + c_2^2 \int_\Omega \|\omega\|^4 G(\omega)\, d\omega = (\sqrt{2\pi})^d \alpha \end{cases} \,. \qquad (6)$$

Here ∆∆ means the application of the Laplace operator twice. The lemma is very general, working for any decreasing function G : Ω → R₊₊. An interesting choice for the stabilizer G is the Gaussian function (this is a familiar case in regularization theory due to Yuille (Yuille and Grzywacz 1989)). This leads to the following corollary.

Corollary 3 Consider f : X → R with a well-defined Fourier transform. Let $G(\omega) \triangleq \varepsilon^d e^{-\frac{\varepsilon^2\|\omega\|^2}{2}}$. Suppose f(x_1) = h_0 and $\int_\Omega \frac{|\hat{f}(\omega)|^2}{G(\omega)}\, d\omega = (\sqrt{2\pi})^d \alpha$ for given constants h_0 and α. Then

$$\inf_{f,\,x_1} \Delta f(x_1) = -\frac{h_0 + 2\sqrt{2}\sqrt{\alpha - h_0^2}}{\varepsilon^2} \,.$$

Example Consider $h(x) = -e^{-\frac{x^2}{2}}$. Let $G(\omega) \triangleq e^{-\frac{\omega^2}{2}}$ (i.e. set ε = 1). It is easy to check that $\int_{\mathbb{R}} \frac{|\hat{h}(\omega)|^2}{G(\omega)}\, d\omega = \sqrt{2\pi}$. Hence, α = 1. Let x_0 = 0. Obviously, h(x_0) = −1. Using Corollary 3 we have $\inf_{f,\,x_1} f''(x_1) = -(-1 + 2\sqrt{2}\sqrt{1 - (-1)^2}) = 1$. We now show that the worst case bound suggested by Corollary 3 is sharp for this example. It is so because $h''(x) = (1 - x^2)\, e^{-\frac{x^2}{2}}$, which at x_0 = 0 becomes h''(x_0) = 1.
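A short numeric sanity check of this worked example (illustrative only; the analytic Fourier transform of h is a standard fact) is given below.

```python
# Numeric check of the example after Corollary 3: alpha = 1 and the
# worst-case lower bound equals h''(0) = 1.
import numpy as np
from scipy.integrate import quad

eps = 1.0
G = lambda w: eps * np.exp(-eps**2 * w**2 / 2.0)
h_hat = lambda w: -np.exp(-w**2 / 2.0)          # unitary FT of -exp(-x^2/2)

alpha = (1.0 / np.sqrt(2 * np.pi)) * quad(lambda w: h_hat(w)**2 / G(w), -np.inf, np.inf)[0]
h0 = -1.0                                        # h(x0) at x0 = 0
bound = -(h0 + 2 * np.sqrt(2) * np.sqrt(max(alpha - h0**2, 0.0))) / eps**2
print("alpha =", alpha, " lower bound =", bound, " h''(0) =", 1.0)
```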

3.4 Extension to the Smoothed Objective

Corollary 3 applies to any function f(x) that has a well-defined Fourier transform and any stabilizer of the form G(ω). This includes any parameterized family of functions and stabilizers, as long as the parameter(s) and x are independent of each other. In particular, one can choose the parameter to be σ and replace f(x) by g(x; σ) and G(ω) by $G(\omega;\sigma) \triangleq \varepsilon^d(\sigma)\, e^{-\frac{\varepsilon^2(\sigma)\|\omega\|^2}{2}}$. Note that σ and x are independent.

This simple argument allows us to express Corollary 3 in the following parametric way.

Corollary 4 Consider f : X → R with a well-defined Fourier transform. Define g(x; σ) ≜ [f ⋆ k_σ](x). Let $G(\omega;\sigma) \triangleq \varepsilon^d(\sigma)\, e^{-\frac{\varepsilon^2(\sigma)\|\omega\|^2}{2}}$. Suppose g(x_1; σ) = g_0(σ) and $\int_\Omega \frac{|\hat{g}(\omega;\sigma)|^2}{G(\omega;\sigma)}\, d\omega = (\sqrt{2\pi})^d \alpha(\sigma)$ for given values g_0(σ) and α(σ). Then

$$\inf_{g(\cdot\,;\sigma),\, x_1} \Delta g(x_1;\sigma) = -\frac{g_0(\sigma) + 2\sqrt{2}\sqrt{\alpha(\sigma) - g_0^2(\sigma)}}{\varepsilon^2(\sigma)} \,.$$

3.5 Choice of ε(σ)

For the purpose of analysis, we restrict the choice of ε(σ) > 0 as stated by the following proposition. This results in a monotonic α(σ), which greatly simplifies the analysis.

Proposition 5 Suppose the function ε(σ) > 0 satisfies 0 ≤ ε(σ) ε̇(σ) ≤ σ. Then α̇(σ) ≤ 0.

This choice can be further refined by the following proposition.

Proposition 6 The only form for ε(σ) > 0 that satisfies 0 ≤ ε(σ) ε̇(σ) ≤ σ is

$$\varepsilon(\sigma) = \beta \sqrt{\sigma^2 + \zeta} \,, \qquad (7)$$

for any 0 < β ≤ 1 and ζ > −σ².
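A quick symbolic check (illustrative; ζ is taken positive here for simplicity) confirms that this form satisfies the condition of Proposition 5, since ε(σ) ε̇(σ) = β²σ lies in [0, σ] whenever 0 < β ≤ 1.

```python
# Symbolic check that eps(sigma) = beta*sqrt(sigma^2 + zeta) gives eps*eps' = beta^2*sigma.
import sympy as sp

sigma, beta, zeta = sp.symbols("sigma beta zeta", positive=True)
eps = beta * sp.sqrt(sigma**2 + zeta)
print(sp.simplify(eps * sp.diff(eps, sigma)))   # -> beta**2*sigma
```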

3.6 Lower Bounding σ∆g(x;σ) by a(σ)g(x;σ) + b(σ)

The goal of this section is finding continuous functions a and b such that a(σ)g(x; σ) + b(σ) ≤ σ∆g(x; σ). By manipulating Corollary 4, one can derive

$$\Delta g(x_0;\sigma) \ge -\frac{g(x_0;\sigma) + 2\sqrt{2}\sqrt{\alpha(\sigma) - g(x_0;\sigma)^2}}{\varepsilon^2(\sigma)} \,, \quad \text{where} \quad (\sqrt{2\pi})^d \alpha(\sigma) \triangleq \frac{1}{\varepsilon^d(\sigma)} \int_\Omega |\hat{g}(\omega;\sigma)|^2\, e^{\frac{\varepsilon^2(\sigma)\|\omega\|^2}{2}}\, d\omega \,.$$

By multiplying both sides by σ (remember σ > 0) and factorizing α(σ), the above inequality can be equivalently written as

$$\sigma \Delta g(x_0;\sigma) \ge -\sigma\,\frac{g(x_0;\sigma)}{\varepsilon^2(\sigma)} - \frac{2\sigma\sqrt{2\alpha(\sigma)}}{\varepsilon^2(\sigma)} \sqrt{1 - \frac{g(x_0;\sigma)^2}{\alpha(\sigma)}} \,.$$

This inequality implies

$$\sigma \Delta g(x_0;\sigma) \ge -\sigma\,\frac{g(x_0;\sigma)}{\varepsilon^2(\sigma)} - \frac{2\sigma\sqrt{2\alpha(\sigma)}}{\varepsilon^2(\sigma)} \Big( \frac{1 + \gamma\,\frac{g(x_0;\sigma)}{\sqrt{\alpha(\sigma)}}}{\sqrt{1-\gamma^2}} \Big) \,,$$

where 0 ≤ γ < 1 is any constant and we use the fact that ∀(u, γ) ∈ [−1, 1] × [0, 1) ; $\sqrt{1-u^2} \le \frac{1+\gamma u}{\sqrt{1-\gamma^2}}$ (with $\frac{g(x_0;\sigma)}{\sqrt{\alpha(\sigma)}}$ being u). The inequality now has the affine form σ∆g(x_0; σ) ≥ a(σ)g(x_0; σ) + b(σ), where

$$a(\sigma) = -\frac{\sigma}{\varepsilon^2(\sigma)} - \frac{2\sqrt{2}\,\sigma\,\gamma}{\varepsilon^2(\sigma)\sqrt{1-\gamma^2}} \,, \qquad b(\sigma) = -\frac{2\sigma\sqrt{2\alpha(\sigma)}}{\varepsilon^2(\sigma)\sqrt{1-\gamma^2}} \,. \qquad (8)$$

Note that the continuity of ε as stated in (7) implies continuity of a and b.
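For concreteness, a small helper (illustrative only; the constants γ, β, ζ are user choices) evaluating ε(σ) from (7) and the coefficients a(σ), b(σ) from (8) is sketched below.

```python
# Evaluate eps(sigma) from (7) and the affine coefficients a(sigma), b(sigma) from (8).
import numpy as np

def eps(sigma, beta=0.9, zeta=0.1):
    return beta * np.sqrt(sigma**2 + zeta)

def a_b(sigma, alpha_sigma, gamma=0.5, beta=0.9, zeta=0.1):
    e2 = eps(sigma, beta, zeta)**2
    a = -sigma / e2 - 2.0 * np.sqrt(2.0) * sigma * gamma / (e2 * np.sqrt(1.0 - gamma**2))
    b = -2.0 * sigma * np.sqrt(2.0 * alpha_sigma) / (e2 * np.sqrt(1.0 - gamma**2))
    return a, b

print(a_b(sigma=1.0, alpha_sigma=2.0))
```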

3.7 Integrations and Final Bound

Theorem 7 Let f : X → R be the objective function. Given the initial value g(x(σ_1); σ_1). Then for any 0 ≤ σ ≤ σ_1, and any constants 0 < γ < 1, 0 < β < 1, ζ > −σ², the following holds,

$$g\big(x(\sigma);\sigma\big) \le \Big(\frac{\sigma^2+\zeta}{\sigma_1^2+\zeta}\Big)^p g\big(x(\sigma_1);\sigma_1\big) + c\,\sqrt{\alpha(\sigma)}\,\Big(1 - \Big(\frac{\sigma^2+\zeta}{\sigma_1^2+\zeta}\Big)^p\Big) \,, \qquad (9)$$

where $p \triangleq \frac{1}{2\beta^2}\Big(\frac{2\sqrt{2}\,\gamma}{\sqrt{1-\gamma^2}} - 1\Big)$ and $c \triangleq \frac{\sqrt{2}}{2\sqrt{2}\,\gamma - \sqrt{1-\gamma^2}}$.

The proof essentially combines (8) with the fact ġ(x; σ) = σ∆g(x; σ) (i.e. the heat equation) to obtain ġ(x; σ) ≥ a(σ)g(x; σ) + b(σ), where $a(\sigma) = \big(\frac{2\sqrt{2}\,\gamma}{\sqrt{1-\gamma^2}} - 1\big)\frac{\sigma}{\varepsilon^2(\sigma)}$ and $b(\sigma) = -\frac{2\sigma\sqrt{2\alpha(\sigma)}}{\varepsilon^2(\sigma)\sqrt{1-\gamma^2}}$. This form is now amenable to Lemma 1. Using the form of ε(σ) in (7), ∫a(r) dr can be computed analytically as $\frac{1}{2\beta^2}\big(\frac{2\sqrt{2}\,\gamma}{\sqrt{1-\gamma^2}} - 1\big)\log(\sigma^2+\zeta)$. Finally, using Hölder's inequality ‖f g‖₁ ≤ ‖f‖₁ ‖g‖_∞, we can separate √α(σ) from the rest of the integrand in the form of sup √α(σ). The latter further simplifies to √α(σ) due to the non-increasing property of α stated in Proposition 5.

We now discuss the role of the optimization complexity α(σ) in (9). For brevity, let $w_1(\sigma, \sigma_1) \triangleq \big(\frac{\sigma^2+\zeta}{\sigma_1^2+\zeta}\big)^p$ and $w_2(\sigma, \sigma_1) \triangleq c\big(1 - \big(\frac{\sigma^2+\zeta}{\sigma_1^2+\zeta}\big)^p\big)$. Observe that w_1 and w_2 are independent of f, while g and α depend on f. It can be proved that w_2 is nonnegative (Proposition 8), and obviously so is √α(σ). Hence, a lower optimization complexity α(σ) results in a smaller objective value g(x(σ); σ). Since the optimization complexity α depends on the objective function, it provides a way to quantify the hardness of the optimization task at hand.

A practical consequence of our theorem is that one may determine the worst case performance without running the algorithm. Importantly, the optimization complexity can be easily computed when f is represented in some suitable basis form, in particular by Gaussian RBFs. This is the subject of the next section. Note that while our result holds for any choice of constants within the prescribed range, ideally they would be chosen to make the bound tight. That is, the negative and positive terms respectively receive large and small weights.
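As an illustration of this use, the snippet below evaluates the right-hand side of (9) for user-chosen constants γ, β, ζ and given values of g(x(σ_1); σ_1) and α(σ); it is only a sketch of how the bound would be applied, not a prescription from the paper.

```python
# Evaluate the worst-case bound (9) of Theorem 7 for chosen constants.
import numpy as np

def theorem7_bound(sigma, sigma1, g_at_sigma1, alpha_at_sigma,
                   gamma=0.6, beta=0.9, zeta=0.1):
    p = (1.0 / (2.0 * beta**2)) * (2.0 * np.sqrt(2.0) * gamma / np.sqrt(1.0 - gamma**2) - 1.0)
    c = np.sqrt(2.0) / (2.0 * np.sqrt(2.0) * gamma - np.sqrt(1.0 - gamma**2))
    w1 = ((sigma**2 + zeta) / (sigma1**2 + zeta)) ** p
    w2 = c * (1.0 - w1)
    return w1 * g_at_sigma1 + w2 * np.sqrt(alpha_at_sigma)

# e.g. worst-case value of g(x(0); 0) given the start of the path at sigma_1 = 5:
print(theorem7_bound(sigma=0.0, sigma1=5.0, g_at_sigma1=-0.2, alpha_at_sigma=1.5))
```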

Before ending this section, we present the following proposition, which formally proves that w_2 is positive.

Proposition 8 Let $c \triangleq \frac{\sqrt{2}}{2\sqrt{2}\,\gamma - \sqrt{1-\gamma^2}}$ and $p \triangleq \frac{1}{2\beta^2}\big(\frac{2\sqrt{2}\,\gamma}{\sqrt{1-\gamma^2}} - 1\big)$ for any choice of 0 ≤ γ < 1 and 0 < β ≤ 1. Suppose 0 < σ < σ_1 and ζ > −σ². Then $c\big(1 - \big(\frac{\sigma^2+\zeta}{\sigma_1^2+\zeta}\big)^p\big) > 0$.

3.8 Analytical Expression for α(σ)

In order to utilize the presented theorem in practice for some given objective function f, we need to know its associated optimization complexity α(σ). That is, we must be able to compute $\int_\Omega \frac{|\hat{h}(\omega)|^2}{G(\omega)}\, d\omega$ analytically.

Is this possible, at least for a class of interesting functions? Here we show that this is possible if the function f is represented in some suitable form. Specifically, here we prove that the integrals in α(σ) can be computed analytically when f is represented by Gaussian RBFs.

Before proving this, we provide a brief description of the Gaussian RBF representation. It is known that, under mild conditions, RBF functions are capable of universal approximation (Park and Sandberg 1991). The literature on RBFs is extensive (Buhmann and Buhmann 2003; Schaback and Wendland 2001). This representation has been used for interpolation and approximation in various practical applications. Examples include but are not limited to neural networks (Park and Sandberg 1991), object recognition (Pauli, Benkwitz, and Sommer 1995), computer graphics (Carr et al. 2001), and medical imaging (Carr, Fright, and Beatson 1997).
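A minimal sketch (not prescribed by the paper) of obtaining such a representation in practice is to fit the amplitudes a_k by linear least squares on fixed centers x_k with a shared bandwidth δ; this is exactly the kind of representation for which α(σ) below becomes available in closed form.

```python
# Fit a 1-D Gaussian RBF representation f(x) ~= sum_k a_k exp(-(x-x_k)^2/(2 delta^2)).
import numpy as np

def fit_rbf(f, centers, delta, grid):
    """Least-squares amplitudes a_k so that sum_k a_k phi_k approximates f on the grid."""
    Phi = np.exp(-(grid[:, None] - centers[None, :])**2 / (2.0 * delta**2))
    a, *_ = np.linalg.lstsq(Phi, f(grid), rcond=None)
    return a

f = lambda x: np.sin(x) * np.exp(-0.1 * x**2)      # some target objective (illustrative)
centers = np.linspace(-6, 6, 25)
a = fit_rbf(f, centers, delta=0.8, grid=np.linspace(-6, 6, 400))
```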

Proposition 9 Suppose $h(x) \triangleq \sum_{k=1}^{K} a_k e^{-\frac{(x-x_k)^2}{2\delta^2}}$ and let $G(\omega) \triangleq \varepsilon^d e^{-\frac{\varepsilon^2\|\omega\|^2}{2}}$, and suppose ε < δ. Then, the following holds,

$$\int_\Omega \frac{|\hat{h}(\omega)|^2}{G(\omega)}\, d\omega = \Big(\frac{\sqrt{2\pi}\,\delta^2}{\varepsilon\sqrt{2\delta^2 - \varepsilon^2}}\Big)^d \sum_{j=1}^{K}\sum_{k=1}^{K} a_j a_k\, e^{-\frac{(x_j-x_k)^2}{2(2\delta^2-\varepsilon^2)}} \,. \qquad (10)$$

Observing that when $f(x) \triangleq \sum_{k=1}^{K} a_k e^{-\frac{(x-x_k)^2}{2\delta^2}}$, then $g(x;\sigma) = \sum_{k=1}^{K} \big(\frac{\delta}{\sqrt{\delta^2+\sigma^2}}\big)^d a_k\, e^{-\frac{(x-x_k)^2}{2(\delta^2+\sigma^2)}}$, the following is a straightforward corollary of Proposition 9, which allows us to compute α(σ) for an RBF-represented f.

Corollary 10 Suppose $f(x) \triangleq \sum_{k=1}^{K} a_k e^{-\frac{(x-x_k)^2}{2\delta^2}}$, so that $g(x;\sigma) = \sum_{k=1}^{K} \big(\frac{\delta}{\sqrt{\delta^2+\sigma^2}}\big)^d a_k\, e^{-\frac{(x-x_k)^2}{2(\delta^2+\sigma^2)}}$. Let $G(\omega;\sigma) \triangleq \varepsilon^d(\sigma)\, e^{-\frac{\varepsilon^2(\sigma)\|\omega\|^2}{2}}$ and suppose $\varepsilon(\sigma) < \sqrt{\delta^2+\sigma^2}$. Then, the following holds,

$$\int_\Omega \frac{|\hat{g}(\omega;\sigma)|^2}{G(\omega;\sigma)}\, d\omega = \Big(\frac{\sqrt{2\pi}\,\delta^2}{\varepsilon(\sigma)\sqrt{2\delta^2 + 2\sigma^2 - \varepsilon^2(\sigma)}}\Big)^d \sum_{j=1}^{K}\sum_{k=1}^{K} a_j a_k\, e^{-\frac{(x_j-x_k)^2}{2(2\delta^2 + 2\sigma^2 - \varepsilon^2(\sigma))}} \,.$$
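The sketch below transcribes Corollary 10 into code for d = 1, together with the normalization α(σ) = (2π)^(−d/2) times the integral, and ε(σ) as in (7); the specific constants are illustrative choices, and the assertion only checks that the closed-form denominator stays positive.

```python
# Closed-form alpha(sigma) for a 1-D Gaussian-RBF objective via Corollary 10.
import numpy as np

def alpha_sigma(a, xc, delta, sigma, beta=0.9, zeta=0.1):
    eps = beta * np.sqrt(sigma**2 + zeta)              # eps(sigma) as in (7)
    denom = 2.0 * delta**2 + 2.0 * sigma**2 - eps**2
    assert denom > 0, "requires eps(sigma) < sqrt(2*(delta^2 + sigma^2))"
    pref = np.sqrt(2.0 * np.pi) * delta**2 / (eps * np.sqrt(denom))
    pair = np.exp(-(xc[:, None] - xc[None, :])**2 / (2.0 * denom))
    integral = pref * np.sum(np.outer(a, a) * pair)
    return integral / np.sqrt(2.0 * np.pi)             # multiply by (2*pi)^(-1/2), d = 1

a, xc = np.array([-1.0, -0.5, -0.8]), np.array([-2.0, 0.5, 3.0])
print(alpha_sigma(a, xc, delta=0.7, sigma=1.0))
```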

4 Conclusion & Future Works

In this work, for the first time, we provided a theoretical analysis of optimization by the continuation method. Specifically, we developed an upper bound on the value of the objective function that the continuation method attains. This bound monotonically depends on a characteristic of the objective function that we called the optimization complexity. We showed how the optimization complexity can be computed analytically when the objective is represented in some suitable basis functions such as Gaussian RBFs.

Our analysis visited different areas such as scale space, differential equations, and regularization theory. The optimization complexity depends on the choice of the stabilizer G. In this paper, we only used a Gaussian G. However, extending G to other choices can be investigated in the future.

Acknowledgment

This work is partially funded by Shell Research. The first author is thankful to Alan L. Yuille (UCLA) for discussions, and grateful to William T. Freeman (MIT) for supporting this work.

References

[Balzer and Morwald 2012] Balzer, J., and Morwald, T. 2012. Isogeometric finite-elements methods and variational reconstruction tasks in vision - a perfect match. In CVPR.
[Barron 2013] Barron, J. 2013. Shapes, Paint, and Light. Ph.D. Dissertation, EECS Department, University of California, Berkeley.
[Bengio et al. 2009] Bengio, Y.; Louradour, J.; Collobert, R.; and Weston, J. 2009. Curriculum learning. In ICML.
[Bengio 2009] Bengio, Y. 2009. Learning Deep Architectures for AI.
[Blake and Zisserman 1987] Blake, A., and Zisserman, A. 1987. Visual Reconstruction. MIT Press.
[Boccuto et al. 2002] Boccuto, A.; Discepoli, M.; Gerace, I.; and Pucci, P. 2002. A GNC algorithm for deblurring images with interacting discontinuities.
[Buhmann and Buhmann 2003] Buhmann, M. D., and Buhmann, M. D. 2003. Radial Basis Functions. Cambridge University Press.
[Carr et al. 2001] Carr, J. C.; Beatson, R. K.; Cherrie, J. B.; Mitchell, T. J.; Fright, W. R.; McCallum, B. C.; and Evans, T. R. 2001. Reconstruction and representation of 3D objects with radial basis functions. SIGGRAPH, 67-76. ACM.
[Carr, Fright, and Beatson 1997] Carr, J. C.; Fright, W. R.; and Beatson, R. K. 1997. Surface interpolation with radial basis functions for medical imaging. IEEE Trans. Med. Imaging 16(1):96-107.
[Chaudhuri and Solar-Lezama 2011] Chaudhuri, S., and Solar-Lezama, A. 2011. Smoothing a program soundly and robustly. In CAV.
[Chaudhuri, Clochard, and Solar-Lezama 2014] Chaudhuri, S.; Clochard, M.; and Solar-Lezama, A. 2014. Bridging boolean and quantitative synthesis using smoothed proof search. SIGPLAN Not. 49(1):207-220.
[Cohen and Gorre 1995] Cohen, L. D., and Gorre, A. 1995. Snakes: Sur la convexite de la fonctionnelle d'energie.
[Coupe et al. 2013] Coupe, P.; Manjon, J. V.; Chamberland, M.; Descoteaux, M.; and Hiba, B. 2013. Collaborative patch-based super-resolution for diffusion-weighted images. NeuroImage.
[Dhillon et al. 2012] Dhillon, P. S.; Keerthi, S. S.; Bellare, K.; Chapelle, O.; and Sundararajan, S. 2012. Deterministic annealing for semi-supervised structured output learning. In AISTATS 2012.
[Dufour, Miller, and Galatsanos 2002] Dufour, R. M.; Miller, E. L.; and Galatsanos, N. P. 2002. Template matching based object recognition with unknown geometric parameters. IEEE Trans. Image Processing 11(12).
[Dyn et al. 1989] Dyn, N.; Ron, A.; Levin, D.; and Jackson, I. 1989. On multivariate approximation by integer translates of a basis function. (TR-0886).
[Dyn 1989] Dyn, N. 1989. Interpolation and approximation by radial and related functions. Approximation Theory 1:211-234.
[Erhan et al. 2009] Erhan, D.; Manzagol, P.-A.; Bengio, Y.; Bengio, S.; and Vincent, P. 2009. The difficulty of training deep architectures and the effect of unsupervised pre-training. In AISTATS, 153-160.
[Felzenszwalb et al. 2010] Felzenszwalb, P. F.; Girshick, R. B.; McAllester, D. A.; and Ramanan, D. 2010. Object detection with discriminatively trained part-based models. IEEE Trans. Pattern Anal. Mach. Intell. 32(9):1627-1645.
[Gehler and Chapelle 2007] Gehler, P., and Chapelle, O. 2007. Deterministic annealing for multiple-instance learning. In AISTATS 2007, 123-130.
[Girosi, Jones, and Poggio 1995] Girosi, F.; Jones, M.; and Poggio, T. 1995. Regularization theory and neural networks architectures. Neural Computation 7(2):219-269.
[Gold and Rangarajan 1996] Gold, S., and Rangarajan, A. 1996. A graduated assignment algorithm for graph matching. IEEE PAMI 18:377-388.
[Gold, Rangarajan, and Mjolsness 1994] Gold, S.; Rangarajan, A.; and Mjolsness, E. 1994. Learning with preknowledge: Clustering with point and graph matching distance measures. In NIPS.
[Hinton, Osindero, and Teh 2006] Hinton, G. E.; Osindero, S.; and Teh, Y. W. 2006. A fast learning algorithm for deep belief networks. Neural Computation 18(7):1527-1554.
[Hong, Lu, and Sundaramoorthi 2013] Hong, B.-W.; Lu, Z.; and Sundaramoorthi, G. 2013. A new model and simple algorithms for multi-label Mumford-Shah problems. In CVPR.
[Kim and Torre 2010] Kim, M., and Torre, F. D. 2010. Gaussian processes multiple instance learning. In ICML.
[Kim et al. 2013] Kim, J.; Liu, C.; Sha, F.; and Grauman, K. 2013. Deformable spatial pyramid matching for fast dense correspondences. In CVPR, 2307-2314. IEEE.
[Kosowsky and Yuille 1994] Kosowsky, J. J., and Yuille, A. L. 1994. The invisible hand algorithm: Solving the assignment problem with statistical physics. Neural Networks 7(3):477-490.
[Leich, Junghans, and Jentschel 2004] Leich, A.; Junghans, M.; and Jentschel, H.-J. 2004. Hough transform with GNC. EUSIPCO.
[Leordeanu and Hebert 2008] Leordeanu, M., and Hebert, M. 2008. Smoothing-based optimization. In CVPR. IEEE Computer Society.
[Liu, Qiao, and Xu 2012] Liu, Z.; Qiao, H.; and Xu, L. 2012. An extended path following algorithm for graph-matching problem. IEEE Trans. Pattern Anal. Mach. Intell. 34(7):1451-1456.
[Loog, Duistermaat, and Florack 2001] Loog, M.; Duistermaat, J. J.; and Florack, L. 2001. On the behavior of spatial critical points under Gaussian blurring: a folklore theorem and scale-space constraints. In Scale-Space, volume 2106 of Lecture Notes in Computer Science.
[Madych and Nelson 1990] Madych, W. R., and Nelson, S. A. 1990. Multivariate interpolation and conditionally positive definite functions. II. Mathematics of Computation 54(189):211-230.
[Mairal et al. 2009] Mairal, J.; Bach, F.; Ponce, J.; and Sapiro, G. 2009. In ICML, volume 382, 87. ACM.
[Malek-Mohammadi et al. 2014] Malek-Mohammadi, M.; Babaie-Zadeh, M.; Amini, A.; and Jutten, C. 2014. Recovery of low-rank matrices under affine constraints via a smoothed rank function. IEEE Transactions on Signal Processing 62(4):981-992.
[Mobahi and Fisher III] Mobahi, H., and Fisher III, J. W. On the link between Gaussian homotopy continuation and convex envelopes. In Lecture Notes in Computer Science (EMMCVPR 2015).
[Mobahi, Rao, and Ma 2009] Mobahi, H.; Rao, S.; and Ma, Y. 2009. Data-driven image completion by image patch subspaces. In Picture Coding Symposium.
[Mohimani et al. 2010] Mohimani, G. H.; Babaie-Zadeh, M.; Gorodnitsky, I.; and Jutten, C. 2010. Sparse recovery using smoothed l0 (SL0): Convergence analysis. CoRR abs/1001.5073.
[Morgan 2009] Morgan, A. 2009. Solving Polynomial Systems Using Continuation for Engineering and Scientific Problems. Society for Industrial and Applied Mathematics.
[Mumford and Shah 1989] Mumford, D., and Shah, J. 1989. Optimal approximations by piecewise smooth functions and associated variational problems. Comm. Pure Appl. Math. 42(5):577-685.
[Nikolova, Ng, and Tam 2010] Nikolova, M.; Ng, M. K.; and Tam, C.-P. 2010. Fast nonconvex nonsmooth minimization methods for image restoration and reconstruction. Trans. Img. Proc. 19(12):3073-3088.
[Park and Sandberg 1991] Park, J., and Sandberg, I. W. 1991. Universal approximation using radial-basis-function networks. Neural Comput. 3:246-257.
[Pauli, Benkwitz, and Sommer 1995] Pauli, J.; Benkwitz, M.; and Sommer, G. 1995. RBF networks for object recognition. In Center for Cognitive Sciences, Bremen University.
[Pretto, Soatto, and Menegatti 2010] Pretto, A.; Soatto, S.; and Menegatti, E. 2010. Scalable dense large-scale mapping and navigation. In ICRA.
[Price, Morse, and Cohen 2010] Price, B. L.; Morse, B. S.; and Cohen, S. 2010. Simultaneous foreground, background, and alpha estimation for image matting. In CVPR, 2157-2164. IEEE.
[Rangarajan and Chellappa 1990] Rangarajan, A., and Chellappa, R. 1990. Generalized graduated nonconvexity algorithm for maximum a posteriori image estimation. In ICPR90, II: 127-133.
[Rose, Gurewitz, and Fox 1990] Rose, K.; Gurewitz, E.; and Fox, G. 1990. A deterministic annealing approach to clustering. Pattern Recognition Letters 11(9):589-594.
[Saragih 2013] Saragih, J. 2013. Deformable face alignment via local measurements and global constraints. In Gonzalez Hidalgo, M.; Mir Torres, A.; and Varona Gomez, J., eds., Deformation Models, volume 7 of Lecture Notes in Computational Vision and Biomechanics. Springer Netherlands. 187-207.
[Schaback and Wendland 2001] Schaback, R., and Wendland, H. 2001. Characterization and construction of radial basis functions. Multivariate Approximation and Applications 1-24.
[Sindhwani, Keerthi, and Chapelle 2006] Sindhwani, V.; Keerthi, S. S.; and Chapelle, O. 2006. Deterministic annealing for semi-supervised kernel machines. In ICML '06, 841-848. New York, NY, USA: ACM.
[Sommese and Wampler 2005] Sommese, A., and Wampler, C. 2005. The Numerical Solution of Systems of Polynomials Arising in Engineering and Science. World Scientific.
[Sun, Roth, and Black 2010] Sun, D.; Roth, S.; and Black, M. J. 2010. Secrets of optical flow estimation and their principles. In CVPR, 2432-2439. IEEE.
[Tirthapura et al. 1998] Tirthapura, S.; Sharvit, D.; Klein, P.; and Kimia, B. 1998. Indexing based on edit-distance matching of shape graphs. In SPIE Int. Symp. Voice, Video, and Data Comm., 25-36.
[Trzasko and Manduca 2009] Trzasko, J., and Manduca, A. 2009. Highly undersampled magnetic resonance image reconstruction via homotopic l0-minimization. IEEE Trans. Med. Imaging 28(1):106-121.
[Widder 1975] Widder, D. V. 1975. The Heat Equation. Academic Press.
[Witkin, Terzopoulos, and Kass 1987] Witkin, A.; Terzopoulos, D.; and Kass, M. 1987. Signal matching through scale space. IJCV.
[Wu and Tan 2013] Wu, Z., and Tan, P. 2013. Calibrating photometric stereo by holistic reflectance symmetry analysis. In CVPR.
[Wu 1996] Wu, Z. 1996. The effective energy transformation scheme as a special continuation approach to global optimization with application to molecular conformation. SIAM J. on Optimization 6:748-768.
[Xiao and Zhang 2012] Xiao, L., and Zhang, T. 2012. A proximal-gradient homotopy method for the l1-regularized least-squares problem. In ICML.
[Yuille and Grzywacz 1989] Yuille, A. L., and Grzywacz, N. M. 1989. A mathematical analysis of the motion coherence theory. IJCV.
[Yuille 1987] Yuille, A. 1987. Energy Functions for Early Vision and Analog Networks. A.I. Memo.
[Zaslavskiy, Bach, and Vert 2009] Zaslavskiy, M.; Bach, F.; and Vert, J.-P. 2009. A path following algorithm for the graph matching problem. IEEE PAMI.


SUPPLEMENTARY APPENDIX

A Asymptotic Convexity

Informally, g(x, t) is called asymptotically strictly convex if it becomes strictly convex when t → ∞. The formal definition is provided below.

Asymptotic Strict Convexity The function g(x; t) is called asymptotically strictly convex if ∀M > 0, ∃ t*_M, ∀x_1, ∀x_2, ∀λ ∈ [0, 1], ∀t ≥ t*_M : ‖x_1‖ ≤ M ∧ ‖x_2‖ ≤ M ⇒ g(λx_1 + (1−λ)x_2; t) < λ g(x_1, t) + (1−λ) g(x_2, t).

Example Consider an objective function of the form $f(x) = -e^{-\frac{x^2}{2\varepsilon^2}}$ for some small ε > 0. This function resembles the ℓ0 norm, the latter being a central object in the literature of sparse representation. Note that f provides a much better approximation of the ℓ0 norm, compared to the widely used ℓ1 norm. However, f(x) is concave everywhere, except at its tip, hence a difficult component in a minimization task.

Let the homotopy be defined as $g(x; t) \triangleq -e^{-\frac{x^2}{2(\varepsilon^2+t^2)}}$. Observe that $g(x; t)\big|_{t=0} = f(x)$. The function g(x; t) is asymptotically convex according to the formal definition. Qualitatively, the interval on which g(x; t) is convex can grow without bound by increasing the value of t (see Figure 2).

Figure 2: Left to right: the function $-e^{-\frac{x^2}{2(\varepsilon^2+t^2)}}$ with increasing values of t ∈ {0, 1/3, 2/3, 1}. Convex regions are colored blue.

B Proofs

Lemma 1 (Worst Case Value of g(x(t); t)) Given a function f : X → R and its associated homotopy map g. Given a point x_1 that is a stationary point of g(x; t_1) (w.r.t. x). Denote the curve of stationary points originating from x_1 at t_1 by x(t), i.e. ∀t ∈ [0, t_1] ; ∇g(x(t), t) = 0. Suppose this curve exists. Given continuous functions a and b such that ∀t ∈ [0, t_1] ∀x ∈ X ; a(t)g(x; t) + b(t) ≤ ġ(x; t). Then, the following inequality holds for any t ∈ [0, t_1],

$$g\big(x(t); t\big) \le \Big(g\big(x(t_1); t_1\big) - \int_t^{t_1} e^{\int_s^{t_1} a(r)\,dr}\, b(s)\, ds\Big)\, e^{-\int_t^{t_1} a(r)\,dr} \,.$$

Proof

$$\frac{d}{dt}\, g(x(t); t) \qquad (11)$$
$$= \nabla g(x(t); t)\, \dot{x}(t) + \dot{g}(x(t); t) \qquad (12)$$
$$= 0 + \dot{g}(x(t); t) \,, \qquad (13)$$

where (12) uses the derivative chain rule, and (13) applies the fact that ∇g(x(t); t) = 0 (because, by definition, for any t, x(t) is a stationary point of g(·, t)).

Recall the assumption a(t)g(x; t) + b(t) ≤ ġ(x; t). This in particular (by setting x to x(t)) implies that a(t)g(x(t); t) + b(t) ≤ ġ(x(t); t). Plugging it into (13) implies that

$$\frac{d}{dt}\, g(x(t); t) \ge a(t)\, g(x(t); t) + b(t) \qquad (14)$$
$$\Rightarrow \quad \frac{d}{dt}\, g^\dagger(t) \ge a(t)\, g^\dagger(t) + b(t) \,, \qquad (15)$$

where $g^\dagger(t) \triangleq g(x(t); t)$ is introduced to reduce the mathematical clutter. Since a(t) and b(t) are assumed to be continuous and g†(t) to be continuously differentiable, we can use the differential form of Gronwall's inequality⁶, which implies that

$$g^\dagger(t) \le \Big(g^\dagger(t_1) - \int_t^{t_1} e^{\int_s^{t_1} a(r)\,dr}\, b(s)\, ds\Big)\, e^{-\int_t^{t_1} a(r)\,dr} \,. \qquad (25)$$

Lemma 2 Consider f : X → R with a well-defined Fourier transform. Let G : Ω → R₊₊ be any decreasing function. Suppose f(x_1) = h_0 and $(\frac{1}{\sqrt{2\pi}})^d \int_\Omega \frac{|\hat{f}(\omega)|^2}{G(\omega)}\, d\omega = \alpha$ for given constants h_0 and α. Then $\inf_{f,\,x_1} \Delta f(x_1) = c_1 \Delta G(0) - c_2 \Delta\Delta G(0)$, where (c_1, c_2) is the solution to the following system,

$$\begin{cases} c_1 G(0) - c_2 \Delta G(0) = h_0 \\ c_1^2 \int_\Omega G(\omega)\, d\omega + 2 c_1 c_2 \int_\Omega \|\omega\|^2 G(\omega)\, d\omega + c_2^2 \int_\Omega \|\omega\|^4 G(\omega)\, d\omega = (\sqrt{2\pi})^d \alpha \end{cases} \,. \qquad (26)$$

Proof Our goal is to solve the following optimization problem,

$$\inf_{f,\, x_1} \Delta f(x_1) \qquad (27)$$
$$\text{s.t.} \quad f(x_1) = h_0 \qquad (28)$$
$$\Big(\frac{1}{\sqrt{2\pi}}\Big)^d \int_{\mathbb{R}^d} \frac{|\hat{f}(\omega)|^2}{G(\omega)}\, d\omega = \alpha \,. \qquad (29)$$

We can write this as a nested optimization, where we first compute the inf w.r.t. f, parametrically in x_1, and then we take another inf of the result w.r.t. x_1.

⁶Gronwall's inequality states that, for any t ≤ t_1, when $\frac{d}{dt}g(t) \ge a(t)g(t) + b(t)$, for a and b being continuous functions, it follows that $g(t) \le \big(g(t_1) - \int_t^{t_1} e^{\int_s^{t_1} a(r)dr}\, b(s)\, ds\big)\, e^{-\int_t^{t_1} a(r)dr}$.
We first prove that $\frac{d}{ds}\big(g(s)\, e^{\int_s^{t_1} a(r)dr}\big) \ge e^{\int_s^{t_1} a(r)dr}\, b(s)$ as below,

$$\begin{aligned} \frac{d}{ds}\big(g(s)\, e^{\int_s^{t_1} a(r)dr}\big) &= e^{\int_s^{t_1} a(r)dr}\, \frac{d}{ds}g(s) + g(s)\, \frac{d}{ds}\, e^{\int_s^{t_1} a(r)dr} \qquad (16) \\ &= e^{\int_s^{t_1} a(r)dr}\, \frac{d}{ds}g(s) + g(s)\, e^{\int_s^{t_1} a(r)dr}\Big(\frac{d}{ds}\int_s^{t_1} a(r)dr\Big) \qquad (17) \\ &= e^{\int_s^{t_1} a(r)dr}\, \frac{d}{ds}g(s) - g(s)\, e^{\int_s^{t_1} a(r)dr}\, a(s) \qquad (18) \\ &\ge e^{\int_s^{t_1} a(r)dr}\big(a(s)g(s) + b(s) - g(s)a(s)\big) \qquad (19) \\ &= e^{\int_s^{t_1} a(r)dr}\, b(s) \,, \qquad (20) \end{aligned}$$

where (16) uses the product rule, (17) applies the chain rule, (18) utilizes Leibniz's rule, and (19) uses the assumption $\frac{d}{dt}g^\dagger(t) \ge a(t)g^\dagger(t) + b(t)$.
Applying the fundamental theorem of calculus to the derived inequality leads to,

$$\begin{aligned} \frac{d}{ds}\big(g(s)\, e^{\int_s^{t_1} a(r)dr}\big) &\ge e^{\int_s^{t_1} a(r)dr}\, b(s) \qquad (21) \\ \Rightarrow \;\; \int_t^{t_1} \frac{d}{ds}\big(g(s)\, e^{\int_s^{t_1} a(r)dr}\big)\, ds &\ge \int_t^{t_1} e^{\int_s^{t_1} a(r)dr}\, b(s)\, ds \qquad (22) \\ \equiv \;\; g(t_1)\, e^{\int_{t_1}^{t_1} a(r)dr} - g(t)\, e^{\int_t^{t_1} a(r)dr} &\ge \int_t^{t_1} e^{\int_s^{t_1} a(r)dr}\, b(s)\, ds \qquad (23) \\ \equiv \;\; g(t) &\le \Big(g(t_1) - \int_t^{t_1} e^{\int_s^{t_1} a(r)dr}\, b(s)\, ds\Big)\, e^{-\int_t^{t_1} a(r)dr} \,, \qquad (24) \end{aligned}$$

where (22) is because integration preserves the inequality, and (23) uses the fundamental theorem of calculus.

$$\inf_{x_1} \inf_{f} \Delta f(x_1) \qquad (30)$$
$$\text{s.t.} \quad f(x_1) = h_0 \qquad (31)$$
$$\Big(\frac{1}{\sqrt{2\pi}}\Big)^d \int_{\mathbb{R}^d} \frac{|\hat{f}(\omega)|^2}{G(\omega)}\, d\omega = \alpha \,. \qquad (32)$$

We first focus on the inner inf, i.e. w.r.t. f. We can express this in the equivalent Fourier form,

$$\inf_{f} \; \Big(\frac{1}{\sqrt{2\pi}}\Big)^d \int_{\mathbb{R}^d} i^2\|\omega\|^2\, \hat{f}(\omega)\, e^{i\omega^T x_1}\, d\omega \qquad (33)$$
$$\text{s.t.} \quad \Big(\frac{1}{\sqrt{2\pi}}\Big)^d \int_{\mathbb{R}^d} \hat{f}(\omega)\, e^{i\omega^T x_1}\, d\omega = h_0 \qquad (34)$$
$$\Big(\frac{1}{\sqrt{2\pi}}\Big)^d \int_{\mathbb{R}^d} \frac{|\hat{f}(\omega)|^2}{G(\omega)}\, d\omega = \alpha \,. \qquad (35)$$

The Lagrangian becomes,

$$\begin{aligned} L &= \Big(\frac{1}{\sqrt{2\pi}}\Big)^d \int_{\mathbb{R}^d} i^2\|\omega\|^2\, \hat{f}(\omega)\, e^{i\omega^T x_1}\, d\omega + \mu_0\Big(\Big(\frac{1}{\sqrt{2\pi}}\Big)^d \int_{\mathbb{R}^d} \hat{f}(\omega)\, e^{i\omega^T x_1}\, d\omega - h_0\Big) + \mu_1\Big(\Big(\frac{1}{\sqrt{2\pi}}\Big)^d \int_{\mathbb{R}^d} \frac{|\hat{f}(\omega)|^2}{G(\omega)}\, d\omega - \alpha\Big) \qquad (36) \\ &= \Big(\frac{1}{\sqrt{2\pi}}\Big)^d \int_{\mathbb{R}^d} \Big((\mu_0 - \|\omega\|^2)\,\hat{f}(\omega)\, e^{i\omega^T x_1} + \mu_1 \frac{|\hat{f}(\omega)|^2}{G(\omega)}\Big)\, d\omega - \mu_0 h_0 - \mu_1 \alpha \qquad (37) \\ &= \Big(\frac{1}{\sqrt{2\pi}}\Big)^d \int_{\mathbb{R}^d} \Big((\mu_0 - \|\omega\|^2)\,\hat{f}(\omega)\, e^{i\omega^T x_1} + \mu_1 \frac{\hat{f}(\omega)\hat{f}(-\omega)}{G(\omega)}\Big)\, d\omega - \mu_0 h_0 - \mu_1 \alpha \,. \qquad (38),(39) \end{aligned}$$

Taking the functional derivative w.r.t. $\hat{f}(\omega)$,

$$\frac{\delta L}{\delta \hat{f}(\omega)} = \Big(\frac{1}{\sqrt{2\pi}}\Big)^d \Big((\mu_0 - \|\omega\|^2)\, e^{i\omega^T x_1} + \mu_1 \frac{\hat{f}(\omega)}{G(\omega)}\Big) \,. \qquad (40)$$

The solution $\hat{f}^*(\omega)$ is necessarily a stationary point of the Lagrangian. Setting the derivative to zero gives,

$$\hat{f}^*(\omega) = \frac{1}{\mu_1}\, G(\omega)\big(\|\omega\|^2 - \mu_0\big)\, e^{i\omega^T x_1} = \big(c_1 + c_2\|\omega\|^2\big)\, e^{i\omega^T x_1}\, G(\omega) \,. \qquad (41)$$

Consequently,

$$f^*(x) = c_1\, G(x - x_1) - c_2\, \Delta G(x - x_1) \,. \qquad (42)$$

Coefficients c_1 and c_2 can be found by plugging this solution into the two equality constraints. This is done by evaluating f*(x) = c_1G(x − x_1) − c_2∆G(x − x_1) at x = x_1 and putting the result into f*(x_1) = h_0 (hence eliminating f*(x)), and plugging $\hat{f}^*(\omega) = c_1 G(\omega)\, e^{i\omega^T x_1} + c_2\|\omega\|^2 G(\omega)\, e^{i\omega^T x_1}$ into $(\frac{1}{\sqrt{2\pi}})^d \int_{\mathbb{R}^d} \frac{|\hat{f}^*(\omega)|^2}{G(\omega)}\, d\omega = \alpha$ (hence eliminating $\hat{f}^*(\omega)$). That is, solving the following system of equations in c_1 and c_2,

$$\begin{cases} c_1 G(0) - c_2 \Delta G(0) = h_0 \\ \big(\frac{1}{\sqrt{2\pi}}\big)^d \int_\Omega \frac{|c_1 G(\omega)\, e^{i\omega^T x_1} + c_2\|\omega\|^2 G(\omega)\, e^{i\omega^T x_1}|^2}{G(\omega)}\, d\omega = \alpha \end{cases} \,. \qquad (43)$$

When ∀ω ; G(ω) ≠ 0, the integrand of the second equation further simplifies,

$$\begin{aligned} \frac{|c_1 G(\omega)\, e^{i\omega^T x_1} + c_2\|\omega\|^2 G(\omega)\, e^{i\omega^T x_1}|^2}{G(\omega)} &= \frac{|c_1 G(\omega) + c_2\|\omega\|^2 G(\omega)|^2}{G(\omega)} \qquad (44),(45) \\ &= \frac{\big(c_1 G(\omega) + c_2\|\omega\|^2 G(\omega)\big)\big(c_1 G(\omega) + c_2\|\omega\|^2 G(\omega)\big)}{G(\omega)} \qquad (46) \\ &= \big(c_1 + c_2\|\omega\|^2\big)\big(c_1 G(\omega) + c_2\|\omega\|^2 G(\omega)\big) \qquad (47) \\ &= c_1^2 G(\omega) + 2 c_1 c_2 \|\omega\|^2 G(\omega) + c_2^2 \|\omega\|^4 G(\omega) \,. \qquad (48) \end{aligned}$$

Hence, we actually need to solve the following system,

$$\begin{cases} c_1 G(0) - c_2 \Delta G(0) = h_0 \\ c_1^2 \int_{\mathbb{R}^d} G(\omega)\, d\omega + 2 c_1 c_2 \int_{\mathbb{R}^d} \|\omega\|^2 G(\omega)\, d\omega + c_2^2 \int_{\mathbb{R}^d} \|\omega\|^4 G(\omega)\, d\omega = (\sqrt{2\pi})^d \alpha \end{cases} \,. \qquad (49)$$

We now solve the system of equations in the variables (c_1, c_2) and plug its solution into the objective value ∆f*(x_1). Note that we already derived f*(x) = c_1G(x − x_1) − c_2∆G(x − x_1), hence $\inf_f \Delta f(x_1) \triangleq \Delta f^*(x_1) = c_1 \Delta G(0) - c_2[\Delta(\Delta G)](0)$ (where the inf over f is obviously subject to the provided constraints).

Remember from (30) that we still have an inf over x_1 to work out on top of the solution for the inf over f. However, since in $\inf_f \Delta f(x_1) = c_1 \Delta G(0) - c_2[\Delta(\Delta G)](0)$ the RHS does not depend on x_1, it follows that $\inf_{x_1}\inf_f \Delta f(x_1) = c_1 \Delta G(0) - c_2[\Delta(\Delta G)](0)$ (obviously, subject to the provided constraints).

Corollary 3 Consider f : X → R with a well-defined Fourier transform. Let $G(\omega) \triangleq \varepsilon^d e^{-\frac{\varepsilon^2\|\omega\|^2}{2}}$. Suppose f(x_1) = h_0 and $\int_\Omega \frac{|\hat{f}(\omega)|^2}{G(\omega)}\, d\omega = (\sqrt{2\pi})^d \alpha$ for given constants h_0 and α. Then $\inf_{f,\,x_1} \Delta f(x_1) = -\frac{h_0 + 2\sqrt{2}\sqrt{\alpha - h_0^2}}{\varepsilon^2}$.

Proof Let $G(x) \triangleq e^{-\frac{\|x\|^2}{2\varepsilon^2}}$ (and hence $G(\omega) \triangleq \varepsilon^d e^{-\frac{\varepsilon^2\|\omega\|^2}{2}}$). Then the following identities hold,

$$G(0) = 1 \quad (50), \qquad \Delta G(0) = -\frac{d}{\varepsilon^2} \quad (51), \qquad \Delta\Delta G(0) = \frac{d(d+2)}{\varepsilon^4} \quad (52),$$
$$\int_{\mathbb{R}^d} G(\omega)\, d\omega = (\sqrt{2\pi})^d \quad (53), \qquad \int_{\mathbb{R}^d} \|\omega\|^2 G(\omega)\, d\omega = \frac{d\,(\sqrt{2\pi})^d}{\varepsilon^2} \quad (54), \qquad \int_{\mathbb{R}^d} \|\omega\|^4 G(\omega)\, d\omega = \frac{d(d+2)\,(\sqrt{2\pi})^d}{\varepsilon^4} \,. \quad (55)$$

Since G(ω) > 0 and is decreasing, Lemma 2 can be applied, which states that $\inf_{f,\,x_1} \Delta f(x_1) = c_1 \Delta G(0) - c_2 \Delta\Delta G(0) = -\frac{d\, c_1}{\varepsilon^2} - \frac{d(d+2)\, c_2}{\varepsilon^4}$, and (c_1, c_2) is the solution of,

$$\begin{cases} c_1 G(0) - c_2 \Delta G(0) = h_0 \\ c_1^2 \int_{\mathbb{R}^d} G(\omega)\, d\omega + 2 c_1 c_2 \int_{\mathbb{R}^d} \|\omega\|^2 G(\omega)\, d\omega + c_2^2 \int_{\mathbb{R}^d} \|\omega\|^4 G(\omega)\, d\omega = (\sqrt{2\pi})^d \alpha \end{cases} \,. \qquad (56)$$

Using the identities provided earlier, this system can be easily solved. Due to its quadratic nature, it leads to a pair of potential solutions for (c_1, c_2), which, once plugged into $\inf_{f,\,x_1}\Delta f(x_1) = -\frac{d\, c_1}{\varepsilon^2} - \frac{d(d+2)\, c_2}{\varepsilon^4}$, becomes,

$$\inf_{f,\,x_1} \Delta f(x_1) \in \Big\{ \frac{-h_0 + 2\sqrt{2}\sqrt{\alpha - h_0^2}}{\varepsilon^2}, \; \frac{-h_0 - 2\sqrt{2}\sqrt{\alpha - h_0^2}}{\varepsilon^2} \Big\} \,. \qquad (57)$$

Out of this pair, the one which leads to the smaller $\inf_{f,\,x_1}\Delta f(x_1)$ obviously is,

$$\inf_{f,\,x_1} \Delta f(x_1) = \frac{-h_0 - 2\sqrt{2}\sqrt{\alpha - h_0^2}}{\varepsilon^2} \,. \qquad (58)$$

Corollary 4 Consider f : X → R with a well-defined Fourier transform. Define g(x; σ) ≜ [f ⋆ k_σ](x). Let $G(\omega;\sigma) \triangleq \varepsilon^d(\sigma)\, e^{-\frac{\varepsilon^2(\sigma)\|\omega\|^2}{2}}$. Suppose g(x_1; σ) = g_0(σ) and $\int_\Omega \frac{|\hat{g}(\omega;\sigma)|^2}{G(\omega;\sigma)}\, d\omega = (\sqrt{2\pi})^d \alpha(\sigma)$ for given values g_0(σ) and α(σ). Then $\inf_{g(\cdot\,;\sigma),\,x_1} \Delta g(x_1;\sigma) = -\frac{g_0(\sigma) + 2\sqrt{2}\sqrt{\alpha(\sigma) - g_0^2(\sigma)}}{\varepsilon^2(\sigma)}$.

Proof The proof is an elementary use of the previous corollary. The previous corollary applies to any function f(x) that has a well-defined Fourier transform and any smoothness kernel G(ω). This includes any parameterized family of functions and smoothness kernels, as long as the parameter (particularly σ) and the spatial variable x are independent of each other. In particular, one can choose the function as g(x; σ). That is because as long as f(x) has a well-defined Fourier transform, so does g(x; σ). In addition, σ and x are independent. Furthermore, with some abuse of notation, one can define G as $G(\omega;\sigma) \triangleq \varepsilon^d(\sigma)\, e^{-\frac{\varepsilon^2(\sigma)\|\omega\|^2}{2}}$, again because σ is independent of ω.

Proposition 5 Suppose the function ε(σ) > 0 satisfies 0 ≤ ε(σ) ε̇(σ) ≤ σ. Then α̇(σ) ≤ 0.

Proof Recall the definition of α(σ),

$$\alpha(\sigma) = \Big(\frac{1}{\sqrt{2\pi}}\Big)^d \int_{\mathbb{R}^d} \frac{|\hat{g}(\omega;\sigma)|^2}{G(\omega;\sigma)}\, d\omega \,. \qquad (59)$$

In order to prove α̇(σ) ≤ 0, it is sufficient to show that $\frac{d}{d\sigma}\frac{|\hat{g}(\omega;\sigma)|^2}{G(\omega;\sigma)} \le 0$. The latter is proved in the following,

$$\begin{aligned} \frac{d}{d\sigma}\, \frac{|\hat{g}(\omega;\sigma)|^2}{G(\omega;\sigma)} &= \frac{d}{d\sigma}\, \frac{\big|\hat{f}(\omega)\, \frac{1}{\sqrt{2\pi}}\, e^{-\frac{\|\omega\|^2\sigma^2}{2}}\big|^2}{G(\omega;\sigma)} \qquad (60) \\ &= \Big(\frac{1}{2\pi}\Big)^d |\hat{f}(\omega)|^2\, \frac{d}{d\sigma}\Big( e^{-\|\omega\|^2\sigma^2}\, \frac{1}{G(\omega;\sigma)} \Big) \qquad (61) \\ &= \Big(\frac{1}{2\pi}\Big)^d |\hat{f}(\omega)|^2\, \frac{d}{d\sigma}\Big( e^{\|\omega\|^2\big(-\sigma^2 + \frac{\varepsilon^2(\sigma)}{2}\big)}\, \frac{1}{\varepsilon(\sigma)} \Big) \qquad (62) \\ &= \Big(\frac{1}{2\pi}\Big)^d |\hat{f}(\omega)|^2\, \frac{1}{\varepsilon(\sigma)}\, e^{\|\omega\|^2\big(-\sigma^2 + \frac{\varepsilon^2(\sigma)}{2}\big)} \Big( 2\|\omega\|^2\big(-\sigma + \varepsilon(\sigma)\dot{\varepsilon}(\sigma)\big) - \frac{\dot{\varepsilon}(\sigma)}{\varepsilon(\sigma)} \Big) \qquad (63) \\ &\le 0 \,. \qquad (64) \end{aligned}$$

Proposition 6 The only form for ε(σ) > 0 that satisfies 0 ≤ ε(σ) ε̇(σ) ≤ σ is ε(σ) = β√(σ² + ζ) for any 0 < β ≤ 1 and ζ > −σ².

Proof The condition 0 ≤ ε(σ) ε̇(σ) ≤ σ can be equivalently expressed as ε(σ) ε̇(σ) = β²σ for some β ∈ [0, 1]. This now becomes a separable differential equation,

$$\begin{aligned} \varepsilon\, \frac{d}{d\sigma}\varepsilon &= \beta^2\sigma \qquad (65) \\ \equiv \;\; \varepsilon\, d\varepsilon &= \beta^2\sigma\, d\sigma \qquad (66) \\ \equiv \;\; \int \varepsilon\, d\varepsilon &= c + \int \beta^2\sigma\, d\sigma \qquad (67) \\ \equiv \;\; \frac{1}{2}\varepsilon^2 &= c + \frac{1}{2}\beta^2\sigma^2 \qquad (68) \\ \equiv \;\; \varepsilon &= \pm\beta\sqrt{\frac{2c}{\beta^2} + \sigma^2} \,. \qquad (69) \end{aligned}$$

Obviously, the solution exists only when 2c/β² ≥ −σ². Since by definition we have ε(σ) > 0, only the positive solution is acceptable, so that $\varepsilon = \beta\sqrt{\frac{2c}{\beta^2} + \sigma^2}$. Also, to maintain ε(σ) > 0, β = 0 and 2c/β² = −σ² must be excluded. That changes the conditions to β ∈ (0, 1] and 2c/β² ≥ −σ². Finally, since c in 2c/β² can be any real number, we can define ζ ≜ 2c/β², which is any real number that satisfies ζ > −σ².

Theorem 7 Let f : X → R be the objective function. Given the initial value g(x(σ_1); σ_1). Then for any 0 ≤ σ ≤ σ_1, and any constants 0 < γ < 1, 0 < β < 1, ζ > −σ², the following holds,

$$g\big(x(\sigma);\sigma\big) \le \Big(\frac{\sigma^2+\zeta}{\sigma_1^2+\zeta}\Big)^p g\big(x(\sigma_1);\sigma_1\big) + c\,\sqrt{\alpha(\sigma)}\,\Big(1 - \Big(\frac{\sigma^2+\zeta}{\sigma_1^2+\zeta}\Big)^p\Big) \,, \qquad (70)$$

where $p \triangleq \frac{1}{2\beta^2}\Big(\frac{2\sqrt{2}\,\gamma}{\sqrt{1-\gamma^2}} - 1\Big)$ and $c \triangleq \frac{\sqrt{2}}{2\sqrt{2}\,\gamma - \sqrt{1-\gamma^2}}$.

Proof We start by combining (8) with the fact ġ(x; σ) = σ∆g(x; σ) (i.e. the heat equation); we obtain that for any x ∈ X and σ ∈ R₊₊, ġ(x; σ) ≥ a(σ)g(x; σ) + b(σ), where $a(\sigma) = \big(\frac{2\sqrt{2}\,\gamma}{\sqrt{1-\gamma^2}} - 1\big)\frac{\sigma}{\varepsilon^2(\sigma)}$ and $b(\sigma) = -\frac{2\sigma\sqrt{2\alpha(\sigma)}}{\varepsilon^2(\sigma)\sqrt{1-\gamma^2}}$. As long as a(σ) and b(σ) are continuous functions, we can apply Lemma 1 to obtain

$$g\big(x(\sigma);\sigma\big) \le \Big(g\big(x(\sigma_1);\sigma_1\big) - \int_\sigma^{\sigma_1} e^{\int_s^{\sigma_1} a(r)\,dr}\, b(s)\, ds\Big)\, e^{-\int_\sigma^{\sigma_1} a(r)\,dr} \,.$$

Recalling the form of ε(σ) in (7), ∫a(r) dr can be analytically computed,

$$\begin{aligned} \int a(\sigma)\, d\sigma &= \Big(\frac{2\sqrt{2}\,\gamma}{\sqrt{1-\gamma^2}} - 1\Big) \int \frac{\sigma}{\varepsilon^2(\sigma)}\, d\sigma \qquad (71) \\ &= \Big(\frac{2\sqrt{2}\,\gamma}{\sqrt{1-\gamma^2}} - 1\Big) \int \frac{\sigma}{\beta^2(\sigma^2+\zeta)}\, d\sigma \qquad (72) \\ &= \frac{1}{\beta^2}\Big(\frac{2\sqrt{2}\,\gamma}{\sqrt{1-\gamma^2}} - 1\Big) \int \frac{\sigma}{\sigma^2+\zeta}\, d\sigma \qquad (73) \\ &= \frac{1}{2\beta^2}\Big(\frac{2\sqrt{2}\,\gamma}{\sqrt{1-\gamma^2}} - 1\Big) \log(\sigma^2+\zeta) \,. \qquad (74) \end{aligned}$$

g(x(σ);σ

)≤

(g(x(σ1);σ1

)−∫ σ1

σ

e∫ σ1s

a(r) drb(s) ds)e−

∫ σ1σ

a(r) dr (75)

=(g(x(σ1);σ1

)−∫ σ1

σ

(σ2

1 + ζ

s2 + ζ)

12β2

( 2√

2 γ√1−γ2

−1)b(s) ds

)(σ2 + ζ

σ21 + ζ

)1

2β2( 2√

2 γ√1−γ2

−1)(76)

=(g(x(σ1);σ1

)+

∫ σ1

σ

(σ2

1 + ζ

s2 + ζ)p

2s√

2α(s)

β2(s2 + ζ)√

1− γ2ds)

(σ2 + ζ

σ21 + ζ

)p (77)

≤(g(x(σ1);σ1

)+ ( sup

s∈[σ,σ1]

√α(s))

∫ σ1

σ

2√

2s(σ21+ζs2+ζ )p ds

β2(s2 + ζ)√

1− γ2

)(σ2 + ζ

σ21 + ζ

)p (78)

=(g(x(σ1);σ1

)+

2√

2α(σ)

β2√

1− γ2

∫ σ1

σ

(σ2

1 + ζ

s2 + ζ)p

s

s2 + ζds)

(σ2 + ζ

σ21 + ζ

)p (79)

=(g(x(σ1);σ1

)+

√2α(σ)

2√

2 γ −√

1− γ2

((σ2

1 + ζ

σ2 + ζ)

12β2

( 2√

2 γ√1−γ2

−1)− 1))

(σ2 + ζ

σ21 + ζ

)1

2β2( 2√

2 γ√1−γ2

−1)(80)

= (σ2 + ζ

σ21 + ζ

)1

2β2( 2√

2 γ√1−γ2

−1)g(x(σ1);σ1

)+

√2α(σ)

2√

2 γ −√

1− γ2

(1− (

σ2 + ζ

σ21 + ζ

)1

2β2( 2√

2 γ√1−γ2

−1))(81)

= (σ2 + ζ

σ21 + ζ

)p g(x(σ1);σ1

)+ c√α(σ)

(1− (

σ2 + ζ

σ21 + ζ

)p), (82)

Page 14: A Theoretical Analysis of Optimization by Gaussian ...people.csail.mit.edu/hmobahi/pubs/aaai_2015.pdf · A Theoretical Analysis of Optimization by Gaussian Continuation Hossein Mobahi

where (78) applies Holder’s inequality ‖f g‖1 ≤ ‖f‖1 ‖g‖∞, and ‖ .‖∞ denotes the sup norm. Furthermore, (79)

uses the fact that α(σ) is non-increasing, hence sups∈[σ,σ1]

√α(s) = α(s).

Proposition 8 Let $c \triangleq \frac{\sqrt{2}}{2\sqrt{2}\,\gamma - \sqrt{1-\gamma^2}}$ and $p \triangleq \frac{1}{2\beta^2}\big(\frac{2\sqrt{2}\,\gamma}{\sqrt{1-\gamma^2}} - 1\big)$ for any choice of 0 ≤ γ < 1 and 0 < β ≤ 1. Suppose 0 < σ < σ_1 and ζ > −σ². Then $c\big(1 - \big(\frac{\sigma^2+\zeta}{\sigma_1^2+\zeta}\big)^p\big) > 0$.

Proof First we show that p and c always have the same sign, i.e. p c > 0. To see that, just write the definitions,

$$p\, c = \frac{1}{2\beta^2}\Big(\frac{2\sqrt{2}\,\gamma}{\sqrt{1-\gamma^2}} - 1\Big)\, \frac{\sqrt{2}}{2\sqrt{2}\,\gamma - \sqrt{1-\gamma^2}} = \frac{\sqrt{2}}{2\beta^2\sqrt{1-\gamma^2}} \,. \qquad (83)$$

Now, for brevity, define $\eta \triangleq \frac{\sigma^2+\zeta}{\sigma_1^2+\zeta}$ and observe that η < 1. The goal is to show c(1 − η^p) > 0. Since p and c have the same sign, we investigate the two possible scenarios: when p and c are both negative and when they are both positive. First suppose p > 0 and c > 0. Since η < 1, we have η^p < 1 and so 1 − η^p > 0. Hence c(1 − η^p) > 0. Now suppose p < 0 and c < 0. Since η < 1, we have η^p > 1 and so 1 − η^p < 0. Hence c(1 − η^p) > 0.

Proposition 9 Suppose $h(x) \triangleq \sum_{k=1}^{K} a_k e^{-\frac{(x-x_k)^2}{2\delta^2}}$ and let $G(\omega) \triangleq \varepsilon^d e^{-\frac{\varepsilon^2\|\omega\|^2}{2}}$, and suppose ε < δ. Then, the following holds,

$$\int_\Omega \frac{|\hat{h}(\omega)|^2}{G(\omega)}\, d\omega = \Big(\frac{\sqrt{2\pi}\,\delta^2}{\varepsilon\sqrt{2\delta^2 - \varepsilon^2}}\Big)^d \sum_{j=1}^{K}\sum_{k=1}^{K} a_j a_k\, e^{-\frac{(x_j-x_k)^2}{2(2\delta^2-\varepsilon^2)}} \,. \qquad (84)$$

Proof We provide the proof for one-dimensional functions. The extension to the multi-dimensional case is straightforward. Derive the Fourier transform of h(x),

$$\begin{aligned} \hat{h}(\omega) &\triangleq \frac{1}{\sqrt{2\pi}} \int_{\mathbb{R}} h(x)\, e^{-i\omega x}\, dx \qquad (85) \\ &= \frac{1}{\sqrt{2\pi}} \sum_{k=1}^{K} a_k \int_{\mathbb{R}} e^{-\frac{(x-x_k)^2}{2\delta_k^2}}\, e^{-i\omega x}\, dx \qquad (86) \\ &= \sum_{k=1}^{K} a_k\, \delta_k\, e^{-\frac{\delta_k^2\omega^2 + 2i\omega x_k}{2}} \,. \qquad (87) \end{aligned}$$

Hence,

$$|\hat{h}(\omega)|^2 = \hat{h}(\omega)\,\overline{\hat{h}(\omega)} = \sum_{j=1}^{K}\sum_{k=1}^{K} a_j a_k\, \delta_j \delta_k\, e^{-\frac{\delta_j^2\omega^2 + 2i\omega x_j + \delta_k^2\omega^2 - 2i\omega x_k}{2}} \,. \qquad (88)$$

Therefore,

$$\begin{aligned} \int_{\mathbb{R}} \frac{|\hat{h}(\omega)|^2}{G(\omega)}\, d\omega &= \int_{\mathbb{R}} \sum_{j=1}^{K}\sum_{k=1}^{K} a_j a_k\, \delta_j \delta_k\, e^{-\frac{\delta_j^2\omega^2 + 2i\omega x_j + \delta_k^2\omega^2 - 2i\omega x_k}{2}}\; \frac{1}{\varepsilon}\, e^{\frac{\varepsilon^2\omega^2}{2}}\, d\omega \qquad (89) \\ &= \frac{1}{\varepsilon} \sum_{j=1}^{K}\sum_{k=1}^{K} a_j a_k\, \delta_j \delta_k \int_{\mathbb{R}} e^{-\frac{\omega^2(\delta_j^2+\delta_k^2-\varepsilon^2) + 2i\omega(x_j-x_k)}{2}}\, d\omega \qquad (90) \\ &= \frac{\sqrt{2\pi}}{\varepsilon} \sum_{j=1}^{K}\sum_{k=1}^{K} \frac{a_j a_k\, \delta_j \delta_k}{\sqrt{\delta_j^2+\delta_k^2-\varepsilon^2}}\, e^{-\frac{(x_j-x_k)^2}{2(\delta_j^2+\delta_k^2-\varepsilon^2)}} \,, \qquad (91) \end{aligned}$$

where (90) uses the identity $\int_{\mathbb{R}} e^{-\frac{a^2\omega^2 + 2ib\omega}{2}}\, d\omega = \frac{\sqrt{2\pi}}{a}\, e^{-\frac{b^2}{2a^2}}$. The latter is true due to the following,

$$\begin{aligned} \int e^{-\frac{a^2\omega^2 + 2ib\omega}{2}}\, d\omega &= \int e^{-\frac{b^2}{2a^2} - \frac{(ib+a^2\omega)^2}{2a^2}}\, d\omega \qquad (92) \\ &= e^{-\frac{b^2}{2a^2}} \int e^{-\frac{(ib+a^2\omega)^2}{2a^2}}\, d\omega \qquad (93),(94) \\ &= \frac{\sqrt{\pi}}{\sqrt{2}\,a}\, e^{-\frac{b^2}{2a^2}}\, \mathrm{erf}\Big(\frac{ib+a^2\omega}{\sqrt{2}\,a}\Big) \,. \qquad (95) \end{aligned}$$

Hence, the definite integral becomes,

$$\begin{aligned} \int e^{-\frac{a^2\omega^2+2ib\omega}{2}}\, d\omega &= \frac{\sqrt{\pi}}{\sqrt{2}\,a}\, e^{-\frac{b^2}{2a^2}} \Big(\lim_{\omega\to\infty} \mathrm{erf}\Big(\frac{ib+a^2\omega}{\sqrt{2}\,a}\Big) - \lim_{\omega\to-\infty} \mathrm{erf}\Big(\frac{ib+a^2\omega}{\sqrt{2}\,a}\Big)\Big) \qquad (96) \\ &= \frac{\sqrt{\pi}}{\sqrt{2}\,a}\, e^{-\frac{b^2}{2a^2}} \big(1 - (-1)\big) \qquad (97) \\ &= \frac{\sqrt{2\pi}}{a}\, e^{-\frac{b^2}{2a^2}} \,. \qquad (98) \end{aligned}$$

When computing the integral limits above, we used the fact that a > 0 and that erf(∞) = 1 and erf(−∞) = −1.