
Transcript of arXiv:2010.15482v1 [math.NA] 29 Oct 2020

Convergence of Constrained Anderson Acceleration

Mathieu Barré (INRIA - D.I. ENS, ENS, CNRS, PSL University, Paris, France)

Adrien Taylor (INRIA - D.I. ENS, ENS, CNRS, PSL University, Paris, France)

Alexandre d'Aspremont (CNRS - D.I. ENS, ENS, CNRS, PSL University, Paris, France)

Abstract

We prove non-asymptotic linear convergence rates for the constrained Anderson acceleration extrapolation scheme. These guarantees come from new upper bounds on the constrained Chebyshev problem, which consists in minimizing the maximum absolute value of a polynomial on a bounded real interval under an ℓ1 constraint on its vector of coefficients. Constrained Anderson acceleration has a numerical cost comparable to that of the original scheme.

1 Introduction

Obtaining faster convergence rates is a central concern in numerical analysis. Given an algorithm whose iterates converge to a point x∗, extrapolation methods combine iterates of the converging algorithm to obtain a new point, hopefully closer to the solution. This idea was first applied to accelerate the convergence of sequences in R, by fitting a linear model on the iterates and using the fixed point of this model as the extrapolated point (Aitken, 1927; Shanks, 1955; Brezinski, 2006). Extrapolation techniques were then extended to linearly converging sequences of vectors (Anderson, 1965; Sidi et al., 1986), with convergence guarantees on auto-regressive models. More recently, these methods, and in particular Anderson acceleration, have seen renewed interest in the optimization community. This renewed interest started with the work of Scieur et al. (2016), which applied these extrapolation schemes to optimization algorithms and proposed a regularization technique for obtaining convergence guarantees beyond auto-regressive settings.

Corresponding authors: mathieu.barre[at]inria.fr, adrien.taylor[at]inria.fr, aspremon[at]ens.fr

These extrapolation methods have since been widely extended: to the stochastic setting (Scieur et al., 2017), to composite optimization problems (Massias et al., 2018; Mai and Johansson, 2019), to splitting methods (Poon and Liang, 2019; Fu et al., 2019), and to the acceleration of momentum-based methods (Bollapragada et al., 2018).

Obtaining convergence guarantees for Anderson acceleration-type extrapolation outside of the simple auto-regressive case (which corresponds to quadratic programs) is still an open issue. In (Scieur et al., 2016), explicit convergence guarantees are given in a regime asymptotically close to the solution. Zhang et al. (2018) provide a globally converging Anderson-acceleration-type algorithm; however, due to the general setting of that paper, no convergence rates are provided. Non-asymptotic convergence bounds are provided in (Toth and Kelley, 2015; Li and Jian, 2020), but they involve the inverse of the smallest eigenvalue of a Krylov matrix, which is usually very poorly conditioned.

Contributions: Our contribution is twofold.

• We provide an explicit upper bound on the optimal value of the constrained Chebyshev problem on polynomials. We show it is exact on some range of parameters, and show numerically that it is close to the ground truth elsewhere.

• We use this to give an explicit linear convergence rate for the constrained Anderson acceleration (CAA) scheme outside of the auto-regressive setting.

1.1 Notations

Depending on the context, ‖·‖ either denotes the classical Euclidean norm, when applied to a vector in R^n, or the operator norm, when applied to a matrix in R^{n×n}.

‖·‖1 is the sum of the absolute values of the components of a vector. When applied to a polynomial, it is the sum of the absolute values of its coefficients.



1.2 Setting

In this paper we study the linear convergence of the constrained Anderson extrapolation scheme applied to an operator F : R^n → R^n. For us, F is typically a gradient step with constant stepsize. We make the following assumptions on F throughout the paper.

Assumptions:

1. F is ρ-Lipschitz with ρ < 1. This implies that F has a unique fixed point x∗ and that the fixed-point iterations x_{k+1} = F(x_k) converge with a linear rate ρ.

2. There exist a symmetric positive semidefinite matrix G ∈ S_+^{n×n} with G ≼ ρI and an α-Lipschitz map ξ : R^n → R^n with α ≥ 0 such that F = G + ξ.

In the case of F encoding a gradient step, given a µ-strongly convex function f : R^n → R with L-Lipschitz gradient, it is well known that the operator F = I − (1/L)∇f is contractive with ρ = 1 − µ/L. In addition, assuming f is C² with η-Lipschitz Hessian, we can show that I − (1/L)∇f satisfies Assumption 2 around a point x0 ∈ R^n, with α proportional to η‖∇f(x0)‖. This fact is made more precise in §4.2.

We focus on the online version of Anderson acceleration (Scieur et al., 2016, 2018). This means that the number of iterates used to perform the extrapolation is fixed to k + 1, for some integer k > 0.

Algorithm 1 Constrained Anderson Acceleration

Input: x0 ∈ R^n, F satisfying the assumptions, C a constraint bound, and k controlling the number of extrapolation steps.
for i = 0 . . . k do
    x_{i+1} = F(x_i)
end for
R = [x0 − x1, · · · , x_k − x_{k+1}]
Compute c = argmin_{1^T c = 1, ‖c‖1 ≤ C} ‖Rc‖    (1)
x_e = Σ_{i=0}^k c_i x_i
Output: x_e
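For illustration, here is a minimal sketch (our own, under the stated assumptions, not the authors' implementation) of one pass of Algorithm 1: it runs k + 1 fixed-point iterations, forms the residual matrix R, and solves the constrained least-squares problem (1) with CVXPY. The toy operator F (a linear contraction) and all parameter values are arbitrary.

import numpy as np
import cvxpy as cp

def constrained_anderson_step(F, x0, k, C):
    """One pass of constrained Anderson acceleration (sketch of Algorithm 1)."""
    xs = [x0]
    for _ in range(k + 1):                         # k+1 fixed-point iterations
        xs.append(F(xs[-1]))
    X = np.column_stack(xs)                        # columns x_0, ..., x_{k+1}
    R = X[:, :-1] - X[:, 1:]                       # residuals x_i - x_{i+1}, i = 0..k

    # Solve (1): minimize ||R c|| subject to 1^T c = 1 and ||c||_1 <= C.
    c = cp.Variable(k + 1)
    prob = cp.Problem(cp.Minimize(cp.norm(R @ c, 2)),
                      [cp.sum(c) == 1, cp.norm(c, 1) <= C])
    prob.solve()
    return X[:, :k + 1] @ c.value                  # extrapolated point x_e

# Toy usage on a linear contraction F(x) = G x (spectrum of G in [0, rho)).
rng = np.random.default_rng(1)
n, k, C, rho = 30, 5, 10.0, 0.95
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
G = Q @ np.diag(rng.uniform(0, rho, n)) @ Q.T
F = lambda x: G @ x
x0 = rng.standard_normal(n)
xe = constrained_anderson_step(F, x0, k, C)
print("||F(x_e) - x_e|| / ||F(x_0) - x_0|| =",
      np.linalg.norm(F(xe) - xe) / np.linalg.norm(F(x0) - x0))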

Algorithm 1 corresponds to one step of the online method. As we are interested in linear convergence rates, we look for results of the form ‖F(x_e) − x_e‖ ≤ ρ̄ ‖F(x0) − x0‖, where x_e is the output of Algorithm 1 started at x0. The quantity ‖F(x) − x‖ is a criterion used to measure how far x is from the fixed point of F; indeed ‖F(x) − x‖ = 0 ⟺ x = x∗. In addition, since F is ρ-Lipschitz by assumption, we also have ‖F(x_k) − x_k‖ ≤ ρ^k ‖F(x0) − x0‖; we therefore say that extrapolation provides acceleration as soon as ρ̄ < ρ^k.

This contrasts with accelerated rates proportional to 1 − √(1−ρ) instead of ρ (e.g., the optimal method of Nesterov (2018)); this type of acceleration is obtained for Anderson acceleration in the offline setting where k grows to infinity, meaning that more and more iterates are used to perform the extrapolation, which is not recommended in practice.

2 Constrained Anderson Acceleration

We first recall some standard results on Anderson acceleration applied to a linear operator, i.e., when α = 0. Then we introduce constraints on the extrapolation coefficients in order to deal with a perturbation parameter α > 0.

2.1 Anderson Acceleration on Linear Problems

Let us first consider the simple setting where α = 0 and Algorithm 1 is used with C = ∞. We recall the well-known Anderson acceleration result in the linear case.

Proposition 2.1. Let F satisfy the assumptions of §1.2 with α = 0, and let x_e ∈ R^n be the output of Algorithm 1 started at some x0 ∈ R^n that is not the fixed point of F, with C = ∞ and k + 1 > 1. We have

‖F(x_e) − x_e‖ / ‖F(x0) − x0‖ ≤ min_{p ∈ R_k[X], p(1)=1} max_{x ∈ [0,ρ]} |p(x)| = ρ∗ := 2β^k / (1 + β^{2k}),

with β = (1 − √(1−ρ)) / (1 + √(1−ρ)). In addition, ρ∗ < ρ^k.

Proof. Reformulation of (Golub and Varga, 1961; Scieur et al., 2016).
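As a quick sanity check of this closed form (a sketch we add, not from the paper), the snippet below evaluates ρ∗ = 2β^k/(1+β^{2k}) and compares it with the plain rate ρ^k for a few illustrative values of ρ and k.

import numpy as np

def anderson_linear_rate(rho, k):
    """Closed-form rate rho_* = 2 beta^k / (1 + beta^(2k)) of Proposition 2.1."""
    beta = (1 - np.sqrt(1 - rho)) / (1 + np.sqrt(1 - rho))
    return 2 * beta**k / (1 + beta**(2 * k))

for rho in (0.9, 0.99, 0.999):
    for k in (3, 5, 8):
        print(f"rho={rho}, k={k}: rho_* = {anderson_linear_rate(rho, k):.3e}"
              f"  vs  rho^k = {rho**k:.3e}")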

In the following, α may be nonzero and the previousProposition does not apply.

2.2 The Non Linear Case

When applying the extrapolation step (1) to a nonlinear operator F, the regularity of the matrix R^T R ∈ S^{k+1} becomes an important issue. This matrix can be arbitrarily close to singular, or even singular: for instance, when F is a gradient step operator, it is known that consecutive gradients tend to become aligned. Thus the solution vector c can have coefficients with very large magnitudes. When α > 0, those coefficients are multiplied with the nonlinear part of F and can make the iterates of the algorithm diverge (see (Scieur et al., 2016) for an example of such divergence). To fix this, one needs to control the magnitude of these coefficients, e.g., by regularizing problem (1), as in Scieur et al. (2016) with C = ∞, or by imposing hard constraints on c, as we do. Regularizing involves easier computations in practice, but imposing constraints makes the analysis simpler.

Proposition 2.2. Let F satisfy the assumptions of §1.2 with α ≥ 0, and let x_e ∈ R^n be the output of Algorithm 1 started at x0 ∈ R^n with C ≥ 1 and k ≥ 1. We have

‖F(x_e) − x_e‖ ≤ ( max_{x ∈ [0,ρ]} |p∗(x)| + 3Cαk ) ‖F(x0) − x0‖,

where

p∗ ∈ argmin_{p ∈ R_k[X], p(1)=1, ‖p‖1 ≤ C} max_{x ∈ [0,ρ]} |p(x)|

and ‖p‖1 is the ℓ1 norm of the vector of coefficients of p.

Proof. We provide here a sketch of the proof; a complete version is given in Appendix A.

Using the definition x_{i+1} = F(x_i) = G(x_i) + ξ(x_i), one can show the bound

‖F(x_e) − x_e‖ ≤ ‖Σ_{i=0}^k c_i (x_{i+1} − x_i)‖ + ‖ξ(x_e) − Σ_{i=0}^k c_i ξ(x_i)‖.

The first term of the right-hand side is the quantity minimized in (1), and the second term is due to the nonlinear part of F.

To control the first term, we use the fact that c is a solution of (1) and that the vector c∗ of coefficients of p∗ is admissible, so that

‖Σ_{i=0}^k c_i (x_{i+1} − x_i)‖ ≤ ‖Σ_{i=0}^k c∗_i (x_{i+1} − x_i)‖.

After some transformations using the α-Lipschitzness of ξ, we obtain the desired bound.

Proposition 2.2 exhibits a trade-off between (i) allowing coefficients with larger magnitudes, via a large C, leading to a smaller max_{x ∈ [0,ρ]} |p∗(x)| that gets closer to ρ∗, and (ii) diminishing C to better control the nonlinear part of F, but getting a slower rate max_{x ∈ [0,ρ]} |p∗(x)|, closer to ρ^k.

The following corollary simply states that one can allow a small relative error in the computation of (1) and keep linear convergence.

Corollary 2.3. Under the conditions of Proposition 2.2, if (1) is solved with relative precision ε‖F(x0) − x0‖ on the optimal value, for ε > 0, then

‖(F − I)x_e‖ ≤ ( max_{x ∈ [0,ρ]} |p∗(x)| + 3Cαk + ε ) ‖(F − I)x0‖.

In the next section, we describe the variation of max_{x ∈ [0,ρ]} |p∗(x)| as a function of C, rendering the above trade-off explicit.

3 Constrained Chebyshev Problem

The Chebyshev problem, defined in the following theorem, is fundamental in many fields of numerical analysis. It is used to provide convergence rates for many optimization methods, such as the conjugate gradient algorithm, Anderson acceleration, or Chebyshev iterations (Golub and Varga, 1961; Nemirovskiy and Polyak, 1984; Shewchuk, 1994; Nemirovsky, 1992).

Theorem 3.1 (Golub and Varga (1961)). Let ρ > 0 and k > 0. We call Chebyshev problem of degree k on [0, ρ] the following optimization problem:

min_{p ∈ R_k[X], p(1)=1} max_{x ∈ [0,ρ]} |p(x)|    (Cheb)

The solution of this optimization problem is p∗(X) = T_k((2X−ρ)/ρ) / |T_k((2−ρ)/ρ)|, where T_k is the Chebyshev polynomial of the first kind of order k. The optimal value is equal to ρ∗ defined in Proposition 2.1.

Proof. For completeness, a proof of this result is pro-vided in Appendix B.2.
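The rescaled Chebyshev polynomial of Theorem 3.1 is easy to build numerically. The sketch below (our illustration, with arbitrary ρ and k) composes T_k with the affine map x ↦ 2x/ρ − 1, normalizes it at 1, checks that its maximum over [0, ρ] matches ρ∗ from Proposition 2.1, and evaluates the ℓ1 norm of its coefficients, which is the constant C∗ defined later in (4).

import numpy as np
from numpy.polynomial import chebyshev as Cheb
from numpy.polynomial import Polynomial

rho, k = 0.9, 5

# Power-basis coefficients of T_k, then composition with y(x) = 2x/rho - 1.
t = Cheb.cheb2poly([0] * k + [1])            # T_k(y) in the power basis
lin = Polynomial([-1.0, 2.0 / rho])          # y(x) = 2x/rho - 1
p = sum(tj * lin**j for j, tj in enumerate(t))
p_star = p / p(1.0)                          # normalization p*(1) = 1

beta = (1 - np.sqrt(1 - rho)) / (1 + np.sqrt(1 - rho))
rho_star = 2 * beta**k / (1 + beta**(2 * k))

xs = np.linspace(0.0, rho, 10_001)
print("max |p*| on [0,rho]:", np.abs(p_star(xs)).max())   # should match rho_*
print("closed-form rho_*  :", rho_star)
print("C_* = ||p*||_1     :", np.abs(p_star.coef).sum())  # l1 norm of the coefficients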

We have seen in Proposition 2.2 that we need to control the optimal value of a slightly modified problem, where we add a constraint on the ℓ1 norm of the vector of coefficients of the polynomial. Adding this constraint breaks the explicit result of Theorem 3.1: no closed-form solution for the constrained Chebyshev problem is known in the general case. In this section we aim at providing good upper bounds on the optimal value of this constrained problem.

Given k ≥ 1 and ρ ∈ ]0, 1[, we are now interested in the following constrained Chebyshev problem:

ρ(C) := min_{p ∈ R_k[X], p(1)=1, ‖p‖1 ≤ C} max_{x ∈ [0,ρ]} |p(x)|    (Ctr-Cheb)

The next subsection describes how to compute ρ(C) numerically for C ≥ 1. Note that the admissible set is empty when C < 1.

3.1 Numerical Solutions

Figure 1: Blue curves correspond to numerical solutions of (3), red ones correspond to the bound from Proposition 3.13 with M = k, purple ones are the bound from Lemma 3.4, and green ones are the maximum of the absolute value of the solution of (5) over [0, ρ]; the reference levels ρ∗ and ρ^k are also shown. On the x-axis, C goes from 1 to C∗ defined in (4). Top: ρ = 0.9. Bottom: ρ = 0.999. Left: k = 3. Middle: k = 5. Right: k = 8.

When C ≥ 1, the problem (Ctr-Cheb) has a non-empty admissible set. In addition, the feasible set is an intersection of an affine space and an ℓ1 ball, and hence is convex. The objective function, being a norm on R_k[X] (and equivalently on R^{k+1}), is convex. The problem (Ctr-Cheb) is equivalent to

min_{p ∈ R_k[X], t ∈ R} t
s.t. p(1) = 1, ‖p‖1 ≤ C,
     −t ≤ p(x) ≤ t  ∀x ∈ [0, ρ].    (2)

This problem involves positivity constraints on polynomials over a bounded interval. A classical argument to transform this local positivity into positivity on R uses the following change of variable: p(x) ≥ 0 ∀x ∈ [0, ρ] ⟺ (1 + x²)^k p(ρx²/(1+x²)) ≥ 0 ∀x ∈ R. Then, a positivity constraint for a polynomial on R can be relaxed using a sum-of-squares (SOS) formulation (Parrilo, 2000; Lasserre, 2001), which is exact for univariate polynomials (see e.g. (Magron et al., 2019, Theorem 1) for a short proof). Standard packages can be used to solve efficiently the following optimization problem with SOS constraints:

min_{p ∈ R_k[X], t ∈ R} t
s.t. p(1) = 1, ‖p‖1 ≤ C,
     (1 + x²)^k p(ρx²/(1+x²)) + (1 + x²)^k t ≥ 0  ∀x ∈ R,
     (1 + x²)^k t − (1 + x²)^k p(ρx²/(1+x²)) ≥ 0  ∀x ∈ R.    (3)

We used YALMIP (Lofberg, 2004) and MOSEK (ApS,2019). Numerical solutions to (3) are provided in Fig-ure 1 (blue) for a few values of ρ and k.
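The paper solves the exact SOS formulation (3) with YALMIP and MOSEK. As a lighter alternative (our own sketch, not the authors' code), one can approximate ρ(C) by enforcing |p(x)| ≤ t only on a fine grid of [0, ρ] and solving the resulting convex program with CVXPY; the grid size and solver defaults below are arbitrary, and the result slightly underestimates the true optimal value.

import numpy as np
import cvxpy as cp

def ctr_cheb_discretized(rho, k, Cbound, n_grid=2000):
    """Approximate rho(C) of (Ctr-Cheb) by enforcing |p(x)| <= t on a grid of [0, rho].

    Discretized surrogate of the exact SOS formulation (3); with a fine grid it
    gives a close (slightly optimistic) estimate of the optimal value."""
    xs = np.linspace(0.0, rho, n_grid)
    V = np.vander(xs, k + 1, increasing=True)      # V[i, j] = xs[i]**j
    coeffs = cp.Variable(k + 1)                    # coefficients of p in the monomial basis
    t = cp.Variable(nonneg=True)
    constraints = [V @ coeffs <= t,
                   V @ coeffs >= -t,
                   cp.sum(coeffs) == 1,            # p(1) = 1
                   cp.norm(coeffs, 1) <= Cbound]   # l1 constraint on the coefficients
    cp.Problem(cp.Minimize(t), constraints).solve()
    return t.value

rho, k = 0.9, 5
for Cb in (1.0, 2.0, 5.0, 20.0, 100.0):
    print(f"C = {Cb:6.1f}   rho(C) ~ {ctr_cheb_discretized(rho, k, Cb):.4f}"
          f"   (rho^k = {rho**k:.4f})")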

3.2 Exact Bounds and Upper Bounds

The main goal of this part is to provide an explicit upper bound on the function ρ(C) defined in (Ctr-Cheb), to be used in the result of Proposition 2.2.

We denote by C∗ the l1-norm of the rescaled Cheby-shev polynomial p∗ of Theorem 3.1.

C∗ = ‖p∗‖1, where p∗ solves (Cheb) (4)

Remark 3.2. By Theorem 3.1, ρ(C) is constant, equal to ρ∗, as soon as C ≥ C∗.

Remark 3.3. When C = 1, the admissible set of (Ctr-Cheb) reduces to convex combinations of the monomials of degree at most k. Among these, X^k has the smallest maximum absolute value on [0, ρ], and ρ(1) = ρ^k.

The following lemma gives a first upper bound on ρ.

Lemma 3.4. The function ρ defined in (Ctr-Cheb) is convex on [1, +∞[. Thus, for C ∈ [1, C∗],

ρ(C) ≤ (C∗ − C)/(C∗ − 1) ρ^k + (C − 1)/(C∗ − 1) ρ∗.

This is actually a very coarse bound, due to the fact that C∗ ≫ 1. Indeed, we can observe in Figure 1 that there is an important gap between ρ and the coarse upper bound from Lemma 3.4, displayed in purple. However, we will use the convexity of ρ to link together finer upper bounds taken at different values of C.

Another upper bound is represented in Figure 1 (green). This is the maximum over [0, ρ] of the absolute value of q defined as

q ∈ argmin_{p(1)=1, ‖p‖1 ≤ C} max_{x ∈ [0,ρ]} |p(x) − p∗(x)|,    (5)

where p∗ is the solution of (Cheb). This is a naive upper bound, and it is as hard to compute as solving (Ctr-Cheb).

In the next lemma, we provide an explicit expressionof ρ(C) for C in an explicit neighbourhood of 1.

Lemma 3.5. For C ∈ [1, (2+ρ^k)/(2−ρ^k)] we have the following expression for ρ:

ρ(C) = (C+1)/2 · ρ^k − (C−1)/2.

Proof. Let us show that p(X) = (C+1)/2 · X^k − (C−1)/2 is a solution of (Ctr-Cheb). First notice that p is feasible, as ‖p‖1 = C and p(1) = 1. In addition, since C ∈ [1, (2+ρ^k)/(2−ρ^k)], we have max_{x ∈ [0,ρ]} |p(x)| = p(ρ) = (C+1)/2 · ρ^k − (C−1)/2.

Let q be another feasible polynomial, q = Σ_{i=0}^k q_i X^i with q ≠ p. We show that |q(ρ)| ≥ |p(ρ)|. Using q(1) = Σ_i q_i = 1,

q(ρ) = Σ_{q_i ≥ 0} q_i ρ^i + Σ_{q_i < 0} q_i ρ^i
     ≥ Σ_{q_i ≥ 0} q_i ρ^k + Σ_{q_i < 0} q_i
     = Σ_{q_i ≥ 0} q_i ρ^k + ( 1 − Σ_{q_i ≥ 0} q_i ).

In addition, one notices that Σ_{q_i ≥ 0} q_i − Σ_{q_i < 0} q_i ≤ C, thus Σ_{q_i ≥ 0} q_i ≤ (C+1)/2, and therefore

q(ρ) ≥ (C+1)/2 · (ρ^k − 1) + 1 = p(ρ) > 0.

Thus |q(ρ)| = q(ρ) ≥ p(ρ) = max_{x ∈ [0,ρ]} |p(x)|. Then max_{x ∈ [0,ρ]} |q(x)| ≥ max_{x ∈ [0,ρ]} |p(x)|, and necessarily p is a solution of (Ctr-Cheb).
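A short numerical check of Lemma 3.5 (our own sketch, with arbitrary ρ and k): for C in the stated range, the maximum of the candidate polynomial p(X) = (C+1)/2 X^k − (C−1)/2 over a fine grid of [0, ρ] should coincide with the closed form (C+1)/2 ρ^k − (C−1)/2; one could also compare with the discretized solver sketched in §3.1.

import numpy as np

rho, k = 0.9, 5
C_max = (2 + rho**k) / (2 - rho**k)          # upper end of the range in Lemma 3.5
xs = np.linspace(0.0, rho, 10_001)

for C in np.linspace(1.0, C_max, 5):
    # Candidate solution of (Ctr-Cheb): p(X) = (C+1)/2 * X^k - (C-1)/2
    p_vals = (C + 1) / 2 * xs**k - (C - 1) / 2
    lhs = np.abs(p_vals).max()                # max over [0, rho] of |p|
    rhs = (C + 1) / 2 * rho**k - (C - 1) / 2  # closed form of Lemma 3.5
    print(f"C = {C:.3f}: max|p| = {lhs:.6f}, closed form = {rhs:.6f}")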

Remark 3.6. When k = 1, C∗ = (2+ρ)/(2−ρ), and the function ρ is entirely determined by Remark 3.2 and Lemma 3.5.

Remark 3.7. We have ρ((2+ρ^k)/(2−ρ^k)) = ρ^k/(2−ρ^k). When ρ → 1, (2+ρ^k)/(2−ρ^k) → 3 and (1 − ρ^k/(2−ρ^k)) / (1−ρ^k) → 2. This means ρ((2+ρ^k)/(2−ρ^k)) = 1 − 2(1−ρ^k) + o(1−ρ^k).

The following proposition gives the form of the solution of (Ctr-Cheb) in a neighborhood of C∗. It states that the solutions of the Chebyshev problem with light constraints are also rescaled Chebyshev polynomials, on a segment [−ε, ρ] instead of [0, ρ], with ε ≥ 0.

Proposition 3.8. Let ρ ∈ ]0, 1[ and k ≥ 1. For any ε ∈ [0, ε̄] with ε̄ = ρ (1 + cos((2k−1)π/(2k))) / (1 − cos((2k−1)π/(2k))), we have

ρ(‖pε‖1) = max_{x ∈ [−ε,ρ]} |pε(x)|,    where pε = argmin_{p ∈ R_k[X], p(1)=1} max_{x ∈ [−ε,ρ]} |p(x)|.

Proof. The proof is provided in Appendix C.

Since the coefficients of the polynomials pε for ε ∈ [0, ε̄] have alternating signs, we can obtain a simple expression for ‖pε‖1.

Lemma 3.9. Let ρ ∈ ]0, 1[ and k ≥ 1. For any ε ∈ [0, ε̄] with ε̄ = ρ (1 + cos((2k−1)π/(2k))) / (1 − cos((2k−1)π/(2k))), we have

‖pε‖1 = ρε/2 · (1/(ρ+ε))^k [ (2 + ρ − ε − 2√((1+ρ)(1−ε)))^k + (2 + ρ − ε + 2√((1+ρ)(1−ε)))^k ],

where ρε = 2βε^k / (1 + βε^{2k}) and βε = (1 − √(1 − (ρ+ε)/(1+ε))) / (1 + √(1 − (ρ+ε)/(1+ε))).

Proof. From the proof in Appendix C, we can see that the coefficients of pε alternate in sign, which means that ‖pε‖1 = |pε(−1)|. We then use the classical expression for T_k(x) with |x| ≥ 1 (see e.g. (Mason and Handscomb, 2002, Eq. 1.49)):

T_k(x) = 1/2 ( (x − √(x²−1))^k + (x + √(x²−1))^k ).

Remark 3.10. In particular, one can recover the value of C∗ (which we did not provide before) from Lemma 3.9 applied to the unconstrained Chebyshev problem, i.e., ε = 0:

C∗ = ρ∗/(2ρ^k) [ (2 + ρ − 2√(1+ρ))^k + (2 + ρ + 2√(1+ρ))^k ].

Proposition 3.8 and Lemma 3.9 do not provide direct access to ρ(C) for C ∈ [‖pε̄‖1, C∗]: indeed, we cannot explicitly invert the relation ε → ‖pε‖1. However, one can get arbitrarily good upper bounds by sampling (εi)_{i∈[1,M]} ∈ [0, ε̄]. Then, one can compute Ci = ‖pεi‖1 explicitly using Lemma 3.9, and use convexity to interpolate linearly between the Ci and ρ(Ci), reaching a piecewise-linear upper bound on [‖pε̄‖1, C∗]. Arbitrarily high accuracy can be obtained by increasing M in this procedure.

To construct the following upper bound on ρ(C) for the remaining values of C, we rely on some numerical observations. Indeed, we observed that the maximum over [0, ρ] of the rescaled Chebyshev polynomials pε provides a good upper bound on ρ(‖pε‖1) for a large range of ε. In particular, we study pρ in the following, as we can get a relatively simple expression for ‖pρ‖1.

Proposition 3.11. Let ρ ∈ ]0, 1[ and k ≥ 1. Then

ρ(C1) ≤ ρ1 := 2βρ^k / (1 + βρ^{2k}),

with

C1 = ρ1/(2ρ^k) [ (1 − √(1+ρ²))^k + (1 + √(1+ρ²))^k ]    (6)

and βρ = (√(1+ρ) − √(1−ρ)) / (√(1+ρ) + √(1−ρ)).

Proof. By Lemma B.3, pρ(X) = ρ1 T_k(X/ρ).

Let us show that ‖pρ‖1 = C1. Since the coefficients of T_k(X/ρ) have the same parity and alternating signs, one can notice that ‖pρ‖1 = ρ1 |T_k(i/ρ)| (with i the imaginary unit) = ρ1/(2ρ^k) |(1 − √(1+ρ²))^k + (1 + √(1+ρ²))^k| = C1, using (Mason and Handscomb, 2002, Eq. 1.49).

Thus pρ is admissible for (Ctr-Cheb) with C = C1, and its maximum absolute value on [−ρ, ρ], which equals ρ1, provides an upper bound on ρ(C1).

We study the regime ρ ∼ 1 in order to get more insighton what the previous results mean.

Remark 3.12. We observe that

ρ1 = 2ρ^k / ( (1 + √(1−ρ²))^k + (1 − √(1−ρ²))^k ) ≤ ρ^k,

and when ρ → 1,

ρ1 − ρ∗ ∼ k/(2k−1) (ρ^k − ρ∗),
C1 ∼ [ ((1+√2)^k + (1−√2)^k) / ((1+√2)^{2k} + (1−√2)^{2k}) ] C∗ ≤ (1/2^k) C∗.

In the regime ρ ∼ 1, decreasing the constraint C by a factor 2^k thus only loses a factor k/(2k−1) of the possible acceleration.

We propose the following upper bound for the functionρ(C) over [1,∞[.

Proposition 3.13. Let C ≥ 1, k > 2, 0 < ρ < 1 and M ∈ N∗. Set (εi)_{i∈[1,M]} = (ε̄/2^{i−1})_{i∈[1,M]}, (C_i)_{i∈[1,M]} = (‖pεi‖1)_{i∈[1,M]}, C_{−1} = 1, C_0 = (2+ρ^k)/(2−ρ^k) and C_{M+1} = C∗. Denote (ρ_i)_{i∈[1,M]} = ( 2βi^k / (1 + βi^{2k}) ) with βi = (1 − √(1 − (ρ+εi)/(1+εi))) / (1 + √(1 − (ρ+εi)/(1+εi))), ρ_{−1} = ρ^k, ρ_0 = ρ^k/(2−ρ^k) and ρ_{M+1} = ρ∗. Then

ρ(C) ≤ max_{i ∈ [−1,M]} ( (C − C_i)/(C_{i+1} − C_i) ρ_{i+1} + (C_{i+1} − C)/(C_{i+1} − C_i) ρ_i , ρ∗ ),

which is an explicit upper bound on ρ.

Proof. The C_i = ‖pεi‖1 can be made explicit using, for instance, the formula in (Mason and Handscomb, 2002, Equation 2.18) for the coefficients of T_k. If C > C∗, we saw that ρ(C) = ρ∗. Otherwise, C lies between some C_i and C_{i+1}, and the result follows from the convexity of ρ.
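The bound of Proposition 3.13 is straightforward to evaluate from the closed forms of Lemma 3.9 and Remark 3.10; the sketch below (ours, not the authors' code) implements it directly. The geometric sampling εi = ε̄/2^{i−1} follows the statement above, but any sampling of [0, ε̄] would do, and the helper names are ours.

import numpy as np

def rate_on_interval(rho, k, eps):
    """Optimal Chebyshev rate on [-eps, rho]: 2 beta_eps^k / (1 + beta_eps^(2k))."""
    s = np.sqrt(1 - (rho + eps) / (1 + eps))
    beta = (1 - s) / (1 + s)
    return 2 * beta**k / (1 + beta**(2 * k))

def l1_norm_p_eps(rho, k, eps):
    """||p_eps||_1 from the closed form of Lemma 3.9."""
    r = rate_on_interval(rho, k, eps)
    s = 2 * np.sqrt((1 + rho) * (1 - eps))
    return r / 2 * (1 / (rho + eps))**k * ((2 + rho - eps - s)**k + (2 + rho - eps + s)**k)

def chebyshev_bound(C, rho, k, M=5):
    """Upper bound on rho(C) from Proposition 3.13 (piecewise-linear interpolation).

    Assumes the resulting C_i are increasing, which holds for the values used here."""
    eps_bar = rho * (1 + np.cos((2 * k - 1) * np.pi / (2 * k))) / (
        1 - np.cos((2 * k - 1) * np.pi / (2 * k)))
    eps = [eps_bar / 2**(i - 1) for i in range(1, M + 1)]          # sampled epsilons
    Cs = ([1.0, (2 + rho**k) / (2 - rho**k)]
          + [l1_norm_p_eps(rho, k, e) for e in eps]
          + [l1_norm_p_eps(rho, k, 0.0)])                          # C_{-1}, ..., C_{M+1} = C_*
    rates = ([rho**k, rho**k / (2 - rho**k)]
             + [rate_on_interval(rho, k, e) for e in eps]
             + [rate_on_interval(rho, k, 0.0)])                    # rho_{-1}, ..., rho_{M+1} = rho_*
    rho_star = rates[-1]
    chords = [(C - Cs[i]) / (Cs[i + 1] - Cs[i]) * rates[i + 1]
              + (Cs[i + 1] - C) / (Cs[i + 1] - Cs[i]) * rates[i]
              for i in range(len(Cs) - 1)]
    return max(max(chords), rho_star)

rho, k = 0.9, 5
for C in (1.0, 3.0, 10.0, 30.0, 100.0):
    print(f"C = {C:6.1f}: bound on rho(C) = {chebyshev_bound(C, rho, k):.4f}"
          f"  (rho^k = {rho**k:.4f})")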

As shown in Figure 1, Proposition 3.13 provides partially numerical upper bounds, represented in red on the figure, that are close to ρ(C). In the next section we use this result to provide explicit bounds on constrained Anderson acceleration for nonlinear operators.

4 Main Results

As discussed above, combining Proposition 3.13 andProposition 2.2 gives an explicit linear rate of conver-gence for one pass of Algorithm 1.

Proposition 4.1. Let F satisfy the assumptions of §1.2 with α ≥ 0, and let x_e ∈ R^n be the output of Algorithm 1 with x0 ∈ R^n, C ≥ 1, and k > 2. It holds that

‖F(x_e) − x_e‖ ≤ ρ̄(C) ‖F(x0) − x0‖,

where ρ̄(C) = ρ(C) + 3αkC, with ρ defined in (Ctr-Cheb). In addition, given M ∈ N∗,

ρ̄(C) ≤ max_{i ∈ [−1,M]} ( (C − C_i)/(C_{i+1} − C_i) ρ_{i+1} + (C_{i+1} − C)/(C_{i+1} − C_i) ρ_i , ρ∗ ) + 3αkC,

where the ρ_i, C_i are defined in Proposition 3.13.

In what follows, we explicitly control this linear rateto provide guarantees on acceleration.

4.1 Explicit Upper Bound on the CAA Convergence Rate

For readability purposes, let us consider a very simple form of the upper bound on ρ(C), using Proposition 3.13 with M = 1. This corresponds to a piecewise-linear upper bound on ρ(C) with four parts. The following proposition provides a range of values of C for which Algorithm 1 accelerates the convergence guarantees of the iterative process F, depending on the perturbation parameter α.

Proposition 4.2. Under the assumptions of Proposition 4.1:

(i) As soon as α < ρ^k(1−ρ^k) / (3k(2+ρ^k)) := α0, there exists a non-empty interval I containing (2+ρ^k)/(2−ρ^k) such that ρ̄(C) < ρ^k for C ∈ I.

(ii) α < min( ρ^k(1−ρ^k) / (3k(2+ρ^k)) , (ρ^k − ρ1) / (3kC1) ) := α1  ⟹  ρ̄(C) < ρ^k for C ∈ [(2+ρ^k)/(2−ρ^k), C1].

(iii) α < min( ρ^k(1−ρ^k) / (3k(2+ρ^k)) , (ρ^k − ρ∗) / (3kC∗) ) := α2  ⟹  ρ̄(C) < ρ^k for C ∈ [(2+ρ^k)/(2−ρ^k), C∗].

Proof. See Appendix D

Remark 4.3. When ρ → 1:

α0 ∼ (1−ρ)/9,    (2+ρ^k)/(2−ρ^k) → 3,    C1 → 1/2 ( (1−√2)^k + (√2+1)^k ),
α1 ∼ min( 1/9 , 2(k−1) / (3((1−√2)^k + (√2+1)^k)) ) (1−ρ),
C∗ → 1/2 ( (3−2√2)^k + (2√2+3)^k ),
α2 ∼ min( 1/9 , 2(2k−1) / (3((3−2√2)^k + (2√2+3)^k)) ) (1−ρ).

Remark 4.4. Using only the coarse upper bound from Lemma 3.4 on ρ(C), we can only guarantee the existence of a C that provides acceleration for Algorithm 1 when α < (ρ^k − ρ∗)/(3kC∗). As a comparison, Proposition 4.2 guarantees acceleration when α < ρ^k(1−ρ^k)/(3k(2+ρ^k)). When ρ is close to 1,

(ρ^k − ρ∗)/(3kC∗) · 3k(2+ρ^k)/(ρ^k(1−ρ^k)) ∼ 6(2k−1) / ( (2√2+3)^k + (3−2√2)^k ).

For instance, when k = 10 and ρ is close to 1, our bound guarantees acceleration in Algorithm 1 for α's about 10^6 times larger than with the coarse bound.

4.2 CAA on Gradient Descent

Recently Anderson acceleration has been successfullyapplied in the optimization field where F is an opera-tor representing an optimization algorithm.

Here we look at the particular case where F is the gradient step operator of a µ-strongly convex function f : R^n → R with L-Lipschitz gradient and optimum x∗ ∈ R^n. In addition, we suppose that f is C² with η-Lipschitz Hessian ∇²f. It is well known (see for instance Ryu and Boyd (2016)) that F = I − (1/L)∇f is a ρ = (1 − µ/L)-Lipschitz operator. We consider Algorithm 1 applied to F at a point x0 ∈ R^n, and propose to decompose F as

F = ( I − (1/L)∇²f(x0) ) + (1/L)( ∇²f(x0)(·) − ∇f(·) ).

We set Q = I − (1/L)∇²f(x0), which has its spectrum in [0, 1 − µ/L], and ξ(x) = (1/L)( ∇²f(x0)x − ∇f(x) ).

The following lemma states that on the set where ouriterates evolve, ξ is indeed Lipschitz.

Lemma 4.5. Let k ≥ 1, C ≥ 1, x0 ∈ R^n and (x_i)_{i∈[1,k]} such that x_{i+1} = x_i − (1/L)∇f(x_i) for i ∈ [0, k−1]. Then ξ is η/L² kC‖∇f(x0)‖-Lipschitz on B_C = { x = Σ_{i=0}^k c_i x_i ∈ R^n | 1^T c = 1, ‖c‖1 ≤ C }.

Proof. See Appendix E

Lemma 4.5 allows applying Proposition 4.1 to F with α = η/L² kC‖∇f(x0)‖. There are therefore two ways to have a small α in this context, and hence to guarantee acceleration using Proposition 4.1: either (i) by having a Hessian with a small Lipschitz constant, which means being globally close to a quadratic, or (ii) by being sufficiently close to the optimum (i.e., ‖∇f(x0)‖ small).

In this setting, ρ̄(C) from Proposition 4.1 becomes

ρ̄(C) = ρ(C) + 3 η/L² ‖∇f(x0)‖ k²C²,    (7)

and the bound is no longer piecewise linear, but piecewise quadratic in C:

ρ̄(C) ≤ max_{i ∈ [−1,M]} ( (C − C_i)/(C_{i+1} − C_i) ρ_{i+1} + (C_{i+1} − C)/(C_{i+1} − C_i) ρ_i , ρ∗ ) + 3 η/L² ‖∇f(x0)‖ k²C².    (8)

As before, we study this bound in the case M = 1 for simplicity. The next proposition provides a range of values of C for which acceleration is guaranteed with Algorithm 1, depending on η/L² ‖∇f(x0)‖.

Proposition 4.6. Under the assumptions of §4.2:

(i) As soon as η/L² ‖∇f(x0)‖ < ρ^k(1−ρ^k)(2−ρ^k) / (3k²(2+ρ^k)²) := α3, there exists a non-empty interval I containing (2+ρ^k)/(2−ρ^k) such that ρ̄(C) < ρ^k for C ∈ I.

(ii) η/L² ‖∇f(x0)‖ < min( ρ^k(1−ρ^k)(2−ρ^k) / (3k²(2+ρ^k)²) , (ρ^k − ρ1) / (3k²C1²) ) := α4  ⟹  ρ̄(C) < ρ^k for C ∈ [(2+ρ^k)/(2−ρ^k), C1].

(iii) η/L² ‖∇f(x0)‖ < min( ρ^k(1−ρ^k)(2−ρ^k) / (3k²(2+ρ^k)²) , (ρ^k − ρ∗) / (3k²C∗²) ) := α5  ⟹  ρ̄(C) < ρ^k for C ∈ [(2+ρ^k)/(2−ρ^k), C∗].

Proof. Similar to Proposition 4.2.

Remark 4.7. When µ/L → 0:

α3 ∼ 1/(27k) · µ/L,
α4 ∼ min( 1/(27k) , 4(k−1) / (3k((1−√2)^k + (√2+1)^k)²) ) µ/L,
α5 ∼ min( 1/(27k) , 4(2k−1) / (3k((2√2+3)^k + (3−2√2)^k)²) ) µ/L.

Figure 2 displays the values of our bounds with fixed k, µ and L, for various values of the perturbation parameter, which is η‖∇f(x0)‖ in this section. We observe that we do not lose much by using the simple upper bound with M = 1 compared with the exact value of ρ(C) obtained by solving (3).

Due to the particular form of the perturbation parameter α, proportional to η‖∇f(x0)‖ in the case of the gradient step operator, we see that as soon as η‖∇f(x0)‖ is small enough to get ρ̄ < 1, one can apply Algorithm 1 successively while enjoying a perturbation parameter that gets smaller and smaller, leading to faster convergence. Algorithm 2 is obtained by adding a guarded step to the procedure. This step consists in using the extrapolated point x_e only if its gradient is smaller (in norm) than the gradient at one of the iterates used in the extrapolation, and allows obtaining nicer convergence properties.

Figure 2: Bounds on the convergence rate of Algorithm 1 with k = 5, µ = 10⁻³ and L = 1, for η‖∇f(x0)‖ ∈ {0, 10⁻⁶ µ/L, 10⁻⁵ µ/L, 10⁻⁴ µ/L, 10⁻³ µ/L, 10⁻² µ/L}; the reference levels ρ∗ and ρ^k are also shown. Left: ρ̄(C) defined in (7). Middle: right-hand side of the bound (8) with M = k. Right: right-hand side of the bound (8) with M = 1.

Algorithm 2 Guarded Constrained Anderson Acceleration

Input: x0 ∈ R^n, a strongly convex function f with L-Lipschitz gradient, C a constraint bound, k controlling the number of extrapolation steps, and N the number of iterations.
for i = 0 . . . N − 1 do
    x_i^0 = x_i
    for j = 0 . . . k do
        x_i^{j+1} = x_i^j − (1/L)∇f(x_i^j)
    end for
    R = [x_i^0 − x_i^1, · · · , x_i^k − x_i^{k+1}]
    Compute c = argmin_{1^T c = 1, ‖c‖1 ≤ C} ‖Rc‖
    x_i^e = Σ_{j=0}^k c_j x_i^j
    x_{i+1} = argmin_{x ∈ {x_i^e, x_i^k}} ‖∇f(x)‖
end for
Output: x_N
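Below is a minimal sketch of Algorithm 2 (our own illustration, not the authors' code) for a quadratic objective, with the constrained step solved by CVXPY; the test matrix, its conditioning, and all parameter values are arbitrary choices.

import numpy as np
import cvxpy as cp

def guarded_caa(grad, L, x0, k, C, N):
    """Sketch of Algorithm 2: guarded constrained Anderson acceleration."""
    x = x0.copy()
    for _ in range(N):
        xs = [x]
        for _ in range(k + 1):                      # k+1 gradient steps with stepsize 1/L
            xs.append(xs[-1] - grad(xs[-1]) / L)
        X = np.column_stack(xs)
        R = X[:, :-1] - X[:, 1:]                    # residual matrix
        c = cp.Variable(k + 1)
        cp.Problem(cp.Minimize(cp.norm(R @ c, 2)),
                   [cp.sum(c) == 1, cp.norm(c, 1) <= C]).solve()
        x_e = X[:, :k + 1] @ c.value                # extrapolated point
        # Guard: keep the point with the smaller gradient norm.
        x = x_e if np.linalg.norm(grad(x_e)) <= np.linalg.norm(grad(xs[k])) else xs[k]
    return x

# Toy run on an ill-conditioned quadratic (illustrative choice, mu = 1e-3, L = 1).
rng = np.random.default_rng(2)
n = 50
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.geomspace(1e-3, 1.0, n)) @ Q.T
grad = lambda x: A @ x                              # f(x) = 0.5 x^T A x, optimum at 0
x0 = rng.standard_normal(n)
xN = guarded_caa(grad, L=1.0, x0=x0, k=5, C=100.0, N=20)
print("||grad f(x_N)|| =", np.linalg.norm(grad(xN)))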

Proposition 4.8. Let f be a µ-strongly convex function with L-Lipschitz gradient and η-Lipschitz Hessian. Let ρ = 1 − µ/L, k > 2, C > 1, N ≥ 1 and x0 ∈ R^n, and denote by (x_i)_{i∈[0,N]} the sequence of iterates of Algorithm 2 on f started at x0 for N iterations with parameters C and k. We have

‖∇f(x_N)‖ ≤ ∏_{i=1}^N ρ̄_i(C) ‖∇f(x0)‖,

where

ρ̄_i(C) = min( ρ^k , max_{j ∈ [−1,1]} ( (C − C_j)/(C_{j+1} − C_j) ρ_{j+1} + (C_{j+1} − C)/(C_{j+1} − C_j) ρ_j ) + 3 η/L² ‖∇f(x_{i−1})‖ k²C² ).

In addition,

N ≥ log( η/L² · 3k²(2+ρ^k)² ‖∇f(x0)‖ / (ρ^k(1−ρ^k)(2−ρ^k)) ) / (k log(1/ρ))  ⟹  ∏_{i=1}^N ρ̄_i < ρ^{kN},

and ρ̄_N(C) → max_{j ∈ [−1,1]} ( (C − C_j)/(C_{j+1} − C_j) ρ_{j+1} + (C_{j+1} − C)/(C_{j+1} − C_j) ρ_j ) as N → ∞.

Proof. This follows by combining results of Proposi-tion 4.1 and Proposition 4.6.

The statement of Proposition 4.8 is illustrated in Figure 3. In short, as C gets larger, more iterations are required to escape the guarded regime (as the guarantee from Proposition 4.1 does not provide any improvement over ρ^k). But as soon as ‖∇f(x_N)‖ gets small enough, the guarantee of CAA improves on that of F, escaping the guarded regime, and as ‖∇f(x_N)‖ decreases, the convergence rate gets closer to ρ(C) (which is smaller for large C).

Figure 3: Illustration of Proposition 4.8, showing ‖∇f(x_N)‖ as a function of N, with k = 5, µ = 10⁻³, L = 1, η = 10⁻² and ‖∇f(x0)‖ = 10⁻¹, for different values of C (C = 10, 10², 10³, 10⁴); the reference curve is ρ^{kN}‖∇f(x0)‖.

Conclusion: We derived upper bounds on the optimal value of the constrained Chebyshev problem, and used them to produce explicit non-asymptotic convergence bounds for constrained Anderson acceleration with nonlinear operators. Our convergence bounds are somewhat conservative, as they rely on treating the nonlinear part as a perturbation of the linear setting. A remaining open question is whether one can prove better convergence bounds on Anderson acceleration without decoupling the linear and nonlinear parts of the operator. However, this would require very different proof techniques.


Acknowledgements

MB acknowledges support from an AMX fellowship. AT acknowledges support from the European Research Council (grant SEQUOIA 724063). AA is at the département d'informatique de l'ENS, École normale supérieure, UMR CNRS 8548, PSL Research University, 75005 Paris, France, and INRIA. AA would like to acknowledge support from the ML and Optimisation joint research initiative with the fonds AXA pour la recherche and Kamet Ventures, a Google focused award, as well as funding by the French government under management of Agence Nationale de la Recherche as part of the "Investissements d'avenir" program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute).

References

Alexander Craig Aitken. On Bernoulli's numerical solution of algebraic equations. Proceedings of the Royal Society of Edinburgh, 46:289–305, 1927.
Donald G. Anderson. Iterative procedures for nonlinear integral equations. Journal of the ACM (JACM), 12(4):547–560, 1965.
MOSEK ApS. The MOSEK optimization toolbox for MATLAB manual. Version 9.0, 2019. URL http://docs.mosek.com/9.0/toolbox/index.html.
Raghu Bollapragada, Damien Scieur, and Alexandre d'Aspremont. Nonlinear acceleration of momentum and primal-dual algorithms. arXiv preprint arXiv:1810.04539, 2018.
Claude Brezinski. Accélération de la convergence en analyse numérique, volume 584. Springer, 2006.
Eid H. Doha, Ali H. Bhrawy, and S. S. Ezz-Eldien. Efficient Chebyshev spectral methods for solving multi-term fractional orders differential equations. Applied Mathematical Modelling, 35(12):5662–5672, 2011.
Donald A. Flanders and George Shortley. Numerical determination of fundamental modes. Journal of Applied Physics, 21(12):1326–1332, 1950.
Anqi Fu, Junzi Zhang, and Stephen Boyd. Anderson accelerated Douglas-Rachford splitting. arXiv preprint arXiv:1908.11482, 2019.
Gene H. Golub and Richard S. Varga. Chebyshev semi-iterative methods, successive overrelaxation iterative methods, and second order Richardson iterative methods. Numerische Mathematik, 3(1):157–168, 1961.
Jean-Bernard Lasserre. Global optimization with polynomials and the problem of moments. SIAM Journal on Optimization, 11(3):796–817, 2001.
Zhize Li and L. I. Jian. A fast Anderson-Chebyshev acceleration for nonlinear optimization. In International Conference on Artificial Intelligence and Statistics, pages 1047–1057. PMLR, 2020.
Johan Löfberg. YALMIP: A toolbox for modeling and optimization in MATLAB. In Proceedings of the CACSD Conference, Taipei, Taiwan, 2004.
Victor Magron, Mohab Safey El Din, and Markus Schweighofer. Algorithms for weighted sum of squares decomposition of non-negative univariate polynomials. Journal of Symbolic Computation, 93:200–220, 2019.
Vien V. Mai and Mikael Johansson. Anderson acceleration of proximal gradient methods. arXiv preprint arXiv:1910.08590, 2019.
John C. Mason and David C. Handscomb. Chebyshev polynomials. CRC Press, 2002.
Mathurin Massias, Alexandre Gramfort, and Joseph Salmon. Celer: a fast solver for the lasso with dual extrapolation. In International Conference on Machine Learning, pages 3315–3324, 2018.
Arkadi S. Nemirovskiy and Boris T. Polyak. Iterative methods for solving linear ill-posed problems under precise information. ENG. CYBER., (4):50–56, 1984.
Arkadi S. Nemirovsky. Information-based complexity of linear operator equations. Journal of Complexity, 8(2):153–175, 1992.
Yurii Nesterov. Lectures on convex optimization, volume 137. Springer, 2018.
Pablo A. Parrilo. Structured semidefinite programs and semialgebraic geometry methods in robustness and optimization. PhD thesis, California Institute of Technology, 2000.
Clarice Poon and Jingwei Liang. Trajectory of alternating direction method of multipliers and adaptive acceleration. In Advances in Neural Information Processing Systems, pages 7355–7363, 2019.
Ernest K. Ryu and Stephen Boyd. Primer on monotone operator methods. Appl. Comput. Math, 15(1):3–43, 2016.
Damien Scieur, Alexandre d'Aspremont, and Francis Bach. Regularized nonlinear acceleration. In Advances in Neural Information Processing Systems, pages 712–720, 2016.
Damien Scieur, Francis Bach, and Alexandre d'Aspremont. Nonlinear acceleration of stochastic algorithms. In Advances in Neural Information Processing Systems, pages 3982–3991, 2017.
Damien Scieur, Edouard Oyallon, Alexandre d'Aspremont, and Francis Bach. Online regularized nonlinear acceleration. arXiv preprint arXiv:1805.09639, 2018.
Daniel Shanks. Non-linear transformations of divergent and slowly convergent sequences. Journal of Mathematics and Physics, 34(1):1–42, 1955.
Jonathan Richard Shewchuk. An introduction to the conjugate gradient method without the agonizing pain, 1994.
Avram Sidi, William F. Ford, and David A. Smith. Acceleration of convergence of vector sequences. SIAM Journal on Numerical Analysis, 23(1):178–196, 1986.
Alex Toth and C. T. Kelley. Convergence analysis for Anderson acceleration. SIAM Journal on Numerical Analysis, 53(2):805–819, 2015.
Junzi Zhang, Brendan O'Donoghue, and Stephen Boyd. Globally convergent type-I Anderson acceleration for non-smooth fixed-point iterations. arXiv preprint arXiv:1808.03971, 2018.

A Proof of Proposition 2.2

Let us study the fixed-point iterations of F, that is,

x_{i+1} = G(x_i) + ξ(x_i),

which is equivalent to

x_{i+1} − x∗ = G(x_i − x∗) + ξ(x_i) − ξ(x∗),

as x∗ = F(x∗). By further developing the previous expression, we arrive at

x_{i+1} − x∗ = G^{i+1}(x0 − x∗) + Σ_{j=0}^{i} G^{i−j}( ξ(x_j) − ξ(x∗) ).

Another useful quantity is

x_{i+1} − x_i = (G − I)G^i(x0 − x∗) + (G − I) Σ_{j=0}^{i−1} G^{i−j−1}( ξ(x_j) − ξ(x∗) ) + ξ(x_i) − ξ(x∗).

Let us use these expressions, along with a triangle inequality, to work out the fixed-point residual:

‖(F − I)(x_e)‖ = ‖(G − I)(x_e − x∗) + ξ(x_e) − ξ(x∗)‖
= ‖Σ_{i=0}^k c_i (G − I)(x_i − x∗) + ξ(x_e) − ξ(x∗)‖
= ‖Σ_{i=0}^k c_i G^i (G − I)(x0 − x∗) + (G − I) Σ_{i=0}^k c_i Σ_{j=0}^{i−1} G^{i−1−j}( ξ(x_j) − ξ(x∗) ) + ξ(x_e) − ξ(x∗)‖
= ‖Σ_{i=0}^k c_i (x_{i+1} − x_i) + ξ(x_e) − Σ_{i=0}^k c_i ξ(x_i)‖
≤ ‖Σ_{i=0}^k c_i (x_{i+1} − x_i)‖ + ‖ξ(x_e) − Σ_{i=0}^k c_i ξ(x_i)‖,

where the first term on the right-hand side is exactly the quantity minimized in Algorithm 1. We then bound the two terms separately. Let c∗ denote the vector of coefficients of the polynomial p∗ = argmin_{p ∈ R_k[X], p(1)=1, ‖p‖1 ≤ C} max_{x ∈ [0,ρ]} |p(x)|. We proceed as follows:

‖Σ_{i=0}^k c_i (x_{i+1} − x_i)‖
≤ ‖Σ_{i=0}^k c∗_i (x_{i+1} − x_i)‖    (since c∗ is admissible for the optimization problem (1))
= ‖Σ_{i=0}^k c∗_i G^i (G − I)(x0 − x∗) + (G − I) Σ_{i=0}^k c∗_i Σ_{j=0}^{i−1} G^{i−1−j}( ξ(x_j) − ξ(x∗) ) + Σ_{i=0}^k c∗_i ( ξ(x_i) − ξ(x∗) )‖
= ‖Σ_{i=0}^k c∗_i G^i [ (G − I)(x0 − x∗) + ξ(x0) − ξ(x∗) ] + (G − I) Σ_{i=0}^k c∗_i Σ_{j=0}^{i−1} G^{i−1−j}( ξ(x_j) − ξ(x∗) ) + Σ_{i=0}^k c∗_i [ ξ(x_i) − ξ(x∗) − G^i( ξ(x0) − ξ(x∗) ) ]‖
≤ ‖Σ_{i=0}^k c∗_i G^i [ (F − I)(x0) ]‖ + ‖(G − I) Σ_{i=1}^k c∗_i Σ_{j=0}^{i−1} G^{i−1−j}( ξ(x_j) − ξ(x∗) ) + Σ_{i=1}^k c∗_i [ ξ(x_i) − ξ(x∗) − G^i( ξ(x0) − ξ(x∗) ) ]‖
≤ ‖p∗(G)‖ ‖(F − I)(x0)‖ + ‖Σ_{i=1}^k c∗_i [ Σ_{j=1}^{i} G^{i−j}( ξ(x_j) − ξ(x∗) ) − Σ_{j=0}^{i−1} G^{i−1−j}( ξ(x_j) − ξ(x∗) ) ]‖
= ‖p∗(G)‖ ‖(F − I)(x0)‖ + ‖Σ_{i=1}^k c∗_i Σ_{j=0}^{i−1} G^{i−j−1} [ ξ(x_{j+1}) − ξ(x_j) ]‖
≤ ‖p∗(G)‖ ‖(F − I)(x0)‖ + α Σ_{i=1}^k |c∗_i| Σ_{j=0}^{i−1} ρ^{i−j−1} ρ^j ‖(F − I)(x0)‖
= ( ‖p∗(G)‖ + α Σ_{i=1}^k |c∗_i| i ρ^{i−1} ) ‖(F − I)(x0)‖
≤ ( ‖p∗(G)‖ + αk‖c∗‖1 ) ‖(F − I)(x0)‖
≤ ( ‖p∗(G)‖ + αkC ) ‖(F − I)(x0)‖.

One can bound ‖p∗(G)‖ with standard arguments. Since 0 ≼ G ≼ ρI, there exist an orthogonal matrix O and a diagonal matrix D such that G = O^T D O. We get ‖p∗(G)‖ = ‖O^T p∗(D) O‖ ≤ ‖p∗(D)‖. One can then notice that ‖p∗(D)‖ = max_{λ ∈ Sp(G)} |p∗(λ)| ≤ max_{x ∈ [0,ρ]} |p∗(x)|, where Sp(G) is the set of eigenvalues of G.

Let us now bound the second term of the right-hand side of the triangle inequality above:

‖ξ(x_e) − Σ_{i=0}^k c_i ξ(x_i)‖ ≤ ‖ξ(x_e) − ξ(x_k)‖ + ‖ξ(x_k) − Σ_{i=0}^k c_i ξ(x_i)‖
≤ α ( ‖x_e − x_k‖ + Σ_{i=0}^k |c_i| ‖x_k − x_i‖ )
≤ 2α Σ_{i=0}^{k−1} |c_i| ‖x_k − x_i‖
≤ 2α Σ_{i=0}^{k−1} |c_i| ρ^i ‖x_{k−i} − x_0‖
≤ 2α Σ_{i=0}^{k−1} |c_i| ρ^i Σ_{j=0}^{k−1−i} ‖x_{j+1} − x_j‖
≤ 2α Σ_{i=0}^{k−1} |c_i| ρ^i Σ_{j=0}^{k−1−i} ρ^j ‖(F − I)(x0)‖
≤ 2α Σ_{i=0}^{k−1} |c_i| ρ^i (k − i) ‖(F − I)(x0)‖
≤ 2αk ‖c‖1 ‖(F − I)(x0)‖
≤ 2αkC ‖(F − I)(x0)‖.

Combining the two bounds concludes the proof.

B Useful Lemmas

Unspecified facts on Chebyshev polynomials of the first kind are borrowed from (Mason and Handscomb, 2002).

Proposition B.1. Let k ∈ N and a > 1. We have

T_k / T_k(a) = argmin_{p ∈ R_k[X], p(a)=1} max_{x ∈ [−1,1]} |p(x)|,    (9)

where T_k is the Chebyshev polynomial of the first kind of order k.

Proof. A proof of this result can be found in (Flanders and Shortley, 1950, Equation 10). It relies on the fact that T_k reaches its maximum absolute value at k + 1 points with oscillating signs.

Let us show a result that is classically used for analyzing Anderson acceleration (Golub and Varga, 1961; Scieuret al., 2016).

Proposition B.2. Let k ∈ N and ρ < 1. Then

T_k((2X−ρ)/ρ) / T_k((2−ρ)/ρ) = argmin_{p ∈ R_k[X], p(1)=1} max_{x ∈ [0,ρ]} |p(x)|,

where T_k is the Chebyshev polynomial of the first kind of order k, and

max_{x ∈ [0,ρ]} | T_k((2x−ρ)/ρ) / T_k((2−ρ)/ρ) | = 2β^k / (1 + β^{2k}),

with β = (1 − √(1−ρ)) / (1 + √(1−ρ)).

Proof. The problem min_{p ∈ R_k[X], p(1)=1} max_{x ∈ [0,ρ]} |p(x)| is equivalent to min_{p ∈ R_k[X], p(1)=1} max_{y ∈ [−1,1]} |p(ρ(y+1)/2)|, which is equivalent to min_{q ∈ R_k[X], q((2−ρ)/ρ)=1} max_{y ∈ [−1,1]} |q(y)|, by denoting q(y) = p(ρ(y+1)/2), or equivalently p(x) = q((2x−ρ)/ρ). The last problem is solved using Proposition B.1 with a = (2−ρ)/ρ > 1. This gives a solution q∗(y) = T_k(y)/T_k((2−ρ)/ρ), and thus a solution to the original problem p∗(x) = T_k((2x−ρ)/ρ)/T_k((2−ρ)/ρ).

For the value of the max, we know that max_{y ∈ [−1,1]} |T_k(y)| = 1, and then max_{x ∈ [0,ρ]} | T_k((2x−ρ)/ρ)/T_k((2−ρ)/ρ) | = 1/T_k((2−ρ)/ρ). Since (2−ρ)/ρ > 1, one can use the formula for T_k(x) with |x| ≥ 1 (see e.g. (Mason and Handscomb, 2002, Eq. 1.49)):

T_k(x) = 1/2 ( (x − √(x²−1))^k + (x + √(x²−1))^k )    when |x| ≥ 1.

It mechanically follows that

T_k((2−ρ)/ρ) = 1/2 [ ( (2−ρ)/ρ − √(((2−ρ)/ρ)² − 1) )^k + ( (2−ρ)/ρ + √(((2−ρ)/ρ)² − 1) )^k ]
= 1/2 [ ( (2−ρ−2√(1−ρ))/ρ )^k + ( (2−ρ+2√(1−ρ))/ρ )^k ]
= 1/2 [ (1−√(1−ρ))^{2k} + (1+√(1−ρ))^{2k} ] / ρ^k
= 1/2 [ (1−√(1−ρ))^{2k} + (1+√(1−ρ))^{2k} ] / [ (1−√(1−ρ))^k (1+√(1−ρ))^k ]
= 1/2 [ (1−√(1−ρ))^k/(1+√(1−ρ))^k + (1+√(1−ρ))^k/(1−√(1−ρ))^k ]
= 1/2 ( β^k + β^{−k} )
= (1 + β^{2k}) / (2β^k),

using 2 − ρ ∓ 2√(1−ρ) = (1 ∓ √(1−ρ))² and ρ = (1−√(1−ρ))(1+√(1−ρ)).

The following lemma extends the previous one by looking at polynomials with minimal maximum absolute valueon [−ε, ρ].

Lemma B.3. Let k ∈ N, ρ < 1 and ε ≥ 0. It holds that

pε = argmin_{p ∈ R_k[X], p(1)=1} max_{x ∈ [−ε,ρ]} |p(x)|,

where pε(x) = T_k( 2(x+ε)/(ρ+ε) − 1 ) / | T_k( 2(1+ε)/(ρ+ε) − 1 ) |, and

max_{x ∈ [−ε,ρ]} |pε(x)| = 2βε^k / (1 + βε^{2k}),    with βε = (1 − √(1 − (ρ+ε)/(1+ε))) / (1 + √(1 − (ρ+ε)/(1+ε))).

Proof. This proof is similar to that of Proposition B.2. Indeed, min_{p ∈ R_k[X], p(1)=1} max_{x ∈ [−ε,ρ]} |p(x)| can be reformulated as min_{q ∈ R_k[X], q(2(1+ε)/(ρ+ε)−1)=1} max_{y ∈ [−1,1]} |q(y)| with q(y) = p( (ρ+ε)(y+1)/2 − ε ), or equivalently p(x) = q( 2(x+ε)/(ρ+ε) − 1 ). As 2(1+ε)/(ρ+ε) − 1 ≥ 1, one can apply Proposition B.1 and get that

min_{p ∈ R_k[X], p(1)=1} max_{x ∈ [−ε,ρ]} |p(x)| = 1 / | T_k( 2(1+ε)/(ρ+ε) − 1 ) | = 2βε^k / (1 + βε^{2k}),

with βε = (1 − √(1 − (ρ+ε)/(1+ε))) / (1 + √(1 − (ρ+ε)/(1+ε))), and pε(x) = T_k( 2(x+ε)/(ρ+ε) − 1 ) / | T_k( 2(1+ε)/(ρ+ε) − 1 ) |.

In the following, we focus on problems where ε is close to 0.

Lemma B.4. Let k ∈ N∗ and ρ < 1. For ε ∈ [0, ε̄] with ε̄ = ρ (1 + cos((2k−1)π/(2k))) / (1 − cos((2k−1)π/(2k))), we have the following properties of pε(X) = T_k( 2X/(ρ+ε) − (ρ−ε)/(ρ+ε) ):

(i) Let c ∈ R^{k+1} be such that pε(X) = Σ_{i=0}^k c_i X^i. Then sign(c_i) = (−1)^{k−i} for i ∈ [1, k], and (−1)^k c_0 ≥ 0.

(ii) |pε(X)| is maximal at the points m_i = ( (ρ+ε)cos(iπ/k) + ρ − ε ) / 2 ∈ [−ε, ρ], and pε(m_i) = (−1)^i.

Proof. The Chebyshev polynomial of the first kind T_k(X) is defined such that T_k(cos θ) = cos(kθ) for all θ ∈ R. Thus the roots of T_k are (z_i)_{i∈[0,k−1]} = ( cos((2i+1)π/(2k)) )_{i∈[0,k−1]} ∈ [−1, 1]. The roots of pε(X) are the z^ε_i defined by 2z^ε_i/(ρ+ε) − (ρ−ε)/(ρ+ε) = z_i, which corresponds to

z^ε_i = ( (ρ+ε)z_i + ρ − ε ) / 2 ∈ [−ε, ρ].

The smallest root is z^ε_{k−1} = ( (ρ+ε)cos((2k−1)π/(2k)) + ρ − ε ) / 2 ≥ 0 for ε ∈ [0, ρ (1 + cos((2k−1)π/(2k))) / (1 − cos((2k−1)π/(2k)))] = [0, ε̄].

This implies that pε(X) keeps the same sign on [−ε, 0]. In addition, c_0 = 0 when ε = ε̄ and is nonzero otherwise. We also have sign(c_0) = sign(pε(0)) = sign(pε(−ε)) = sign(T_k(−1)) = (−1)^k.

For the other coefficients, we first need to show that the sign of d^l pε/dx^l(x) is also constant on [−ε, 0]. This relies on Rolle's theorem. Indeed, pε, a polynomial of degree k, has its k roots in [0, ρ]. Thus, by Rolle's theorem, there is a root of p′ε between consecutive roots of pε, so the k − 1 possible roots of p′ε are in ]0, ρ[, and we repeat the argument by applying Rolle's theorem to p′ε, etc.

Thus, for i ∈ [1, k], sign(c_i) = sign( d^i pε/dx^i (0) ) = sign( d^i pε/dx^i (−ε) ) = sign( d^i T_k/dx^i (−1) ) = sign( d^i p_0/dx^i (0) ). Finally, we use the formula (Doha et al., 2011, Equation 2.12) for the coefficients of p_0, which leads to

d^i p_0/dx^i (0) = (−1)^{k−i} k (k+i−1)! √π / ( Γ(i+1/2) (k−i)! ρ^i ),

and thus sign(c_i) = (−1)^{k−i}; we conclude that (i) holds.

(ii) is a property of the Chebyshev polynomial T_k. Indeed, since T_k(cos θ) = cos(kθ), max_{x ∈ [−1,1]} |T_k(x)| = 1, and it is attained at x_i = cos(iπ/k) for i ∈ [0, k], with T_k(x_i) = (−1)^i. Thus |pε| has its maxima at m_i = ( (ρ+ε)cos(iπ/k) + ρ − ε ) / 2 and pε(m_i) = (−1)^i.

C Proof of Proposition 3.8

Let us show that for ε ∈ [0, ε̄], ρ(‖pε‖1) = max_{x ∈ [−ε,ρ]} |pε(x)| with pε ∈ argmin_{p ∈ R_k[X], p(1)=1} max_{x ∈ [−ε,ρ]} |p(x)|.

By Lemma B.3, pε is a rescaled Chebyshev polynomial: pε(x) = T_k( 2(x+ε)/(ρ+ε) − 1 ) / | T_k( 2(1+ε)/(ρ+ε) − 1 ) |.

Goal: let us show that pε is a local minimum (which will be a global one thanks to convexity) of p → max_{x ∈ [0,ρ]} |p(x)| on the set E = { p ∈ R_k[X], p(1) = 1, ‖p‖1 ≤ ‖pε‖1 }.

Let h = Σ_{i=0}^k h_i X^i ∈ R_k[X], h ≠ 0, be such that pε + h ∈ E. This directly implies that h(1) = 0.

Suppose that

max_{x ∈ [0,ρ]} |pε(x) + h(x)| < max_{x ∈ [0,ρ]} |pε(x)|.    (10)

By definition of pε we have max_{x ∈ [−ε,ρ]} |pε(x) + h(x)| ≥ max_{x ∈ [−ε,ρ]} |pε(x)|, and thus, combining this with (10), we have

max_{x ∈ [0,ρ]} |pε(x) + h(x)| < max_{x ∈ [0,ρ]} |pε(x)| = max_{x ∈ [−ε,ρ]} |pε(x)| ≤ max_{x ∈ [−ε,ρ]} |pε(x) + h(x)|.

This implies that

max_{x ∈ [−ε,ρ]} |pε(x) + h(x)| = max_{x ∈ [−ε,0]} |pε(x) + h(x)|.

Finally, this leads to

max_{x ∈ [−ε,0]} |pε(x) + h(x)| ≥ max_{x ∈ [−ε,0]} |pε(x)|.    (11)

We write pε(X) = Σ_{i=0}^k c_i X^i. By Lemma B.4, we know that sign(c_i) = (−1)^{k−i} and that |pε| is maximal on [−ε, ρ] at the points m_i = ( (ρ+ε)cos(iπ/k) + ρ − ε ) / 2 for i ∈ [0, k], with sign(pε(m_i)) = (−1)^i.

In addition, m_i ∈ ]0, ρ] for i ∈ [0, k−1]. Indeed, the m_i are in decreasing order and m_{k−1} is strictly larger than the smallest root of pε, which means m_{k−1} > ( (ρ+ε)cos((2k−1)π/(2k)) + ρ − ε ) / 2 ≥ 0.

Since the |pε(m_i)| are nonzero, we can take h with norm small enough such that |pε(m_i) + h(m_i)| = |pε(m_i)| + sign(pε(m_i)) h(m_i) = |pε(m_i)| + (−1)^i h(m_i) for i ∈ [0, k−1]. Equation (10) imposes (−1)^i h(m_i) < 0. Since the m_i are in decreasing order, the intermediate value theorem tells us that h has a root in each ]m_{i+1}, m_i[ for i ∈ [0, k−2]. In addition, since pε + h ∈ E, h(1) = 0, meaning that h has k roots in ]0, 1]. Since h is nonzero of degree at most k, it cannot have any additional root outside ]0, 1].

In particular, this means that h is nonzero and does not change sign on ]−∞, 0]. The same conclusion holds for pε on ]−∞, 0[.

Taking max_{x ∈ [−ε/2,0]} |h(x)| < max_{x ∈ [−ε,0]} |pε(x)| − max_{x ∈ [−ε/2,0]} |pε(x)|, which is strictly positive since −ε is the only maximizer of |pε| on [−ε, 0], we have

max_{x ∈ [−ε/2,0]} |pε(x) + h(x)| < max_{x ∈ [−ε,0]} |pε(x)| ≤ max_{x ∈ [−ε,0]} |pε(x) + h(x)|    (12)

by (11). Thus |pε + h| restricted to [−ε, 0] has its maximum on [−ε, −ε/2]. Moreover, |pε| > 0 and |h| > 0 on [−ε, −ε/2], and both keep a constant sign on this interval. We denote by m∗ ∈ [−ε, −ε/2] a point such that |pε(m∗) + h(m∗)| = max_{x ∈ [−ε,0]} |pε(x) + h(x)|. Since |pε(m∗)| > 0, one can take h with norm small enough such that |pε(m∗) + h(m∗)| = |pε(m∗)| + sign(pε(m∗)) h(m∗) = |pε(m∗)| + (−1)^k h(m∗). For (11) to hold, sign(h(m∗)) has to be (−1)^k.

Since h has constant sign on [−ε, 0], its sign there is (−1)^k, and then sign(h_0) = sign(h(0)) = (−1)^k.

For i ∈ [1, k], c_i ≠ 0 by Lemma B.4, thus we can take h with norm small enough such that |c_i + h_i| = |c_i| + sign(c_i) h_i for i ∈ [1, k]. Lemma B.4 also tells us that sign(c_i) = (−1)^{k−i} for i ∈ [1, k] and (−1)^k c_0 ≥ 0. Therefore, one can write

‖pε + h‖1 = Σ_{i=0}^k |c_i + h_i|
= Σ_{i=0}^k (−1)^{k−i} (c_i + h_i)
= ‖pε‖1 + Σ_{i=0}^k (−1)^{k−i} h_i
= ‖pε‖1 + (−1)^k h(−1)
≤ ‖pε‖1    (because pε + h ∈ E).

Finally, we reach (−1)^k h(−1) ≤ 0, meaning that sign(h(−1)) = (−1)^{k+1} (or h(−1) = 0); however, we saw that h has sign (−1)^k on ]−∞, 0]. Since h already has all its roots in ]0, 1], this is a contradiction.

Thus we cannot find a nonzero direction h with arbitrarily small norm such that pε + h ∈ E and (10) holds. Hence pε is a local minimum, and thus a global one by convexity of the objective and of E.

D Proof of Proposition 4.2

Before presenting the proof of the proposition, we show a technical lemma.

Lemma D.1. Let k ∈ N, ρ < 1, C∗ defined in (4) (with explicit value given in Remark 3.10) and C1 defined in (6). It holds that

(2+ρ^k)/(2−ρ^k) ≤ C1 for k > 1,    and    C1 ≤ C∗ for k ≥ 1.

Proof. We start from the expression of C1:

C1 = ρ1/(2ρ^k) ( (1 − √(1+ρ²))^k + (1 + √(1+ρ²))^k ) = βρ^k / ( ρ^k (1 + βρ^{2k}) ) ( (1 − √(1+ρ²))^k + (1 + √(1+ρ²))^k ),

where βρ = (√(1+ρ) − √(1−ρ)) / (√(1+ρ) + √(1−ρ)) = (1 − √(1−ρ²)) / ρ. Thus

C1 = (1 − √(1−ρ²))^k / ( ρ^{2k} + (1 − √(1−ρ²))^{2k} ) · ( (1 − √(1+ρ²))^k + (1 + √(1+ρ²))^k )
   = (1 − √(1−ρ²))^k / ( (1 − √(1−ρ²))^k (1 + √(1−ρ²))^k + (1 − √(1−ρ²))^{2k} ) · ( (1 − √(1+ρ²))^k + (1 + √(1+ρ²))^k )
   = ( (1 − √(1+ρ²))^k + (1 + √(1+ρ²))^k ) / ( (1 − √(1−ρ²))^k + (1 + √(1−ρ²))^k )
   = [ Σ_{i=0}^{⌊k/2⌋} (k choose 2i) (1+ρ²)^i ] / [ Σ_{i=0}^{⌊k/2⌋} (k choose 2i) (1−ρ²)^i ].

To show that C1 ≥ (2+ρ^k)/(2−ρ^k), we need to show that

(2 − ρ^k) Σ_{i=0}^{⌊k/2⌋} (k choose 2i) (1+ρ²)^i − (2 + ρ^k) Σ_{i=0}^{⌊k/2⌋} (k choose 2i) (1−ρ²)^i ≥ 0,

and in particular we study

(2 − ρ^k)(1+ρ²)^i − (2 + ρ^k)(1−ρ²)^i.

When i = 0 this equals −2ρ^k; when i = 1 it equals 4ρ² − 2ρ^k. In addition, we can easily see that it is increasing in i, and thus it is positive when i ≥ 1. We can therefore write, for k ≥ 2,

(2 − ρ^k) Σ_{i=0}^{⌊k/2⌋} (k choose 2i) (1+ρ²)^i − (2 + ρ^k) Σ_{i=0}^{⌊k/2⌋} (k choose 2i) (1−ρ²)^i ≥ −2ρ^k + (k choose 2)(−2ρ^k + 4ρ²) ≥ 4ρ²(1 − ρ^{k−2}) ≥ 0,

with strict inequality when k > 2, and then

C1 ≥ (2+ρ^k)/(2−ρ^k), with strict inequality when k > 2.

Then we show the second inequality, between C∗ and C1. We have

C∗ = ρ∗/(2ρ^k) ( (2 + ρ − 2√(1+ρ))^k + (2 + ρ + 2√(1+ρ))^k ) = β∗^k / ( ρ^k (1 + β∗^{2k}) ) ( (2 + ρ − 2√(1+ρ))^k + (2 + ρ + 2√(1+ρ))^k ),

where β∗ = (1 − √(1−ρ)) / (1 + √(1−ρ)) = (2 − ρ − 2√(1−ρ)) / ρ. Thus C∗ can be written as

C∗ = (2 − ρ − 2√(1−ρ))^k / ( ρ^{2k} + (2 − ρ − 2√(1−ρ))^{2k} ) · ( (2 + ρ − 2√(1+ρ))^k + (2 + ρ + 2√(1+ρ))^k )
   = ( (2 + ρ − 2√(1+ρ))^k + (2 + ρ + 2√(1+ρ))^k ) / ( (2 − ρ − 2√(1−ρ))^k + (2 − ρ + 2√(1−ρ))^k )
   = [ Σ_{i=0}^{⌊k/2⌋} (k choose 2i) (1 + ρ/2)^{k−2i} (1+ρ)^i ] / [ Σ_{i=0}^{⌊k/2⌋} (k choose 2i) (1 − ρ/2)^{k−2i} (1−ρ)^i ].

When k ≥ 1, (1 + ρ/2)^{k−2i} (1+ρ)^i > (1+ρ²)^i and (1 − ρ/2)^{k−2i} (1−ρ)^i < (1−ρ²)^i for i ∈ [0, ⌊k/2⌋], and thus

C∗ > C1 when k ≥ 1.

Proof of Proposition 4.2. By Lemma D.1, for k > 2, 1 < C0 < C1 < C∗.

(i) If α < (ρ^k − ρ0)/(3kC0) = ( ρ^k − ρ^k/(2−ρ^k) ) / ( 3k(2+ρ^k)/(2−ρ^k) ) = ρ^k(1−ρ^k)/(3k(2+ρ^k)), then ρ̄(C0) = ρ0 + 3αkC0 < ρ^k by Lemma 3.5, and by continuity of ρ̄ there is a non-empty interval I containing C0 on which ρ̄ < ρ^k.

(ii) If α < min( ρ^k(1−ρ^k)/(3k(2+ρ^k)) , (ρ^k − ρ1)/(3kC1) ), then ρ̄(C0) < ρ^k and ρ̄(C1) ≤ ρ1 + 3αkC1 < ρ^k. Then, by convexity of ρ̄, ρ̄(C) < ρ^k for C ∈ [C0, C1].

(iii) If α < min( ρ^k(1−ρ^k)/(3k(2+ρ^k)) , (ρ^k − ρ∗)/(3kC∗) ), then ρ̄(C0) < ρ^k and ρ̄(C∗) = ρ∗ + 3αkC∗ < ρ^k. By convexity of ρ̄, we have ρ̄(C) < ρ^k for C ∈ [C0, C∗].


E Proof of Lemma 4.5

We have Dξ(x) = (1/L)( ∇²f(x0) − ∇²f(x) ). Under the assumption that ∇²f is η-Lipschitz, it holds that ‖Dξ(x)‖ ≤ (η/L)‖x − x0‖. Thus, for x = Σ_{i=0}^k c_i x_i ∈ B_C,

‖Dξ(x)‖ ≤ η/L ‖x − x0‖
≤ η/L ‖Σ_{i=0}^k c_i (x_i − x_0)‖
≤ η/L Σ_{i=0}^k |c_i| ‖x_i − x_0‖
≤ η/L Σ_{i=1}^k |c_i| Σ_{j=0}^{i−1} ‖x_{j+1} − x_j‖
≤ η/L² Σ_{i=1}^k |c_i| Σ_{j=0}^{i−1} (1 − µ/L)^j ‖∇f(x0)‖
≤ η/L² k ‖c‖1 ‖∇f(x0)‖
≤ η/L² kC ‖∇f(x0)‖.

Since Dξ is bounded on the convex set B_C, by the mean value theorem ξ is η/L² kC‖∇f(x0)‖-Lipschitz on B_C.