Stochastic Proximal Gradient Consensus Over Time-Varying Multi-Agent Networks

(Slides: people.ece.umn.edu/~mhong/DYSPGC.pdf)

Mingyi Hong, joint work with Tsung-Hui Chang

IMSE and ECE Department, Iowa State University

Presented at INFORMS 2015


Main Content

Setup: Optimization over a time-varying multi-agent network


Main Results

An algorithm for a large class of convex problems with rate guarantees

Connections among a number of popular algorithms


Outline

1 Review of Distributed Optimization

2 The Proposed Algorithm
   The Proposed Algorithms
   Distributed Implementation
   Convergence Analysis

3 Connection to Existing Methods

4 Numerical Results

5 Concluding Remarks


Review of Distributed Optimization

Basic Setup

Consider the following convex optimization problem

$$\min_{y \in \mathbb{R}^M} \; f(y) := \sum_{i=1}^{N} f_i(y) \qquad (P)$$

Each fi(y) is a convex and possibly nonsmooth function

A collection of N agents connected by a network:

1 Network defined by an undirected graph $G = \{V, E\}$

2 $|V| = N$ vertices and $|E| = E$ edges.

3 Each agent can only communicate with its immediate neighbors


Review of Distributed Optimization

Basic Setup

Numerous applications in optimizing networked systems

1 Cloud computing [Foster et al 08]

2 Smart grid optimization [Gan et al 13] [Liu-Zhu 14] [Kekatos 13]

3 Distributed learning [Mateos et al 10] [Boyd et al 11] [Bekkerman et al 12]

4 Communication and signal processing [Rabbat-Nowak 04] [Schizas et al 08] [Giannakis et al 15]

5 Seismic Tomography [Zhao et al 15]

6 ...


Review of Distributed Optimization

The Algorithms

A lot of algorithms are available for problem (P)

1 The distributed subgradient (DSG) based methods

2 The Alternating Direction Method of Multipliers (ADMM) based methods

3 The Distributed Dual Averaging based methods

4 ...

Algorithm families differ in the class of problems they can handle and in their convergence conditions


Review of Distributed Optimization

The DSG Algorithm

Each agent i keeps a local copy of y, denoted as $x_i$

Each agent i iteratively computes

$$x_i^{r+1} = \sum_{j=1}^{N} w_{ij}^r x_j^r - \gamma^r d_i^r, \quad \forall i \in \mathcal{V}.$$

We use the following notation

1 $d_i^r \in \partial f_i(x_i^r)$: a subgradient of the local function $f_i$

2 $w_{ij}^r \ge 0$: the weight for the link $e_{ij}$ at iteration r

3 $\gamma^r > 0$: some stepsize parameter

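A minimal numpy sketch of one DSG iteration (the helper name and array shapes are illustrative, not from the slides); stacking the per-agent sums gives the matrix form used on the next slide:

```python
import numpy as np

def dsg_step(x, W, subgrads, gamma_r):
    """One DSG iteration: x_i^{r+1} = sum_j W[i, j] * x_j^r - gamma_r * d_i^r.

    x:        (N, M) array, row i holds agent i's local copy x_i^r
    W:        (N, N) row-stochastic weight matrix at iteration r
    subgrads: (N, M) array, row i holds a subgradient d_i^r of f_i at x_i^r
    gamma_r:  stepsize at iteration r (typically diminishing)
    """
    return W @ x - gamma_r * subgrads
```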

Review of Distributed Optimization

The DSG Algorithm (Cont.)

Compactly, the algorithm can be written in vector form

$$x^{r+1} = W x^r - \gamma^r d^r$$

1 $x^r \in \mathbb{R}^{NM}$: the stacked vector of the agents' local variables

2 $d^r \in \mathbb{R}^{NM}$: the stacked vector of subgradients

3 W: a row-stochastic weight matrix


Review of Distributed Optimization

The DSG Algorithm (Cont.)

Convergence has been analyzed in many works [Nedic-Ozdaglar 09a] [Nedic-Ozdaglar 09b]

The algorithm converges at a rate of $O(\ln(r)/\sqrt{r})$ [Chen 12]

A diminishing stepsize is usually required

The algorithm has been generalized to problems with

1 constraints [Nedic-Ozdaglar-Parrilo 10]

2 quantized messages [Nedic et al 08]

3 directed graphs [Nedic-Olshevsky 15]

4 stochastic gradients [Ram et al 10]

5 ...

Accelerated versions with rates O(ln(r)/r) [Chen 12] [Jakovetic et al 14]


Review of Distributed Optimization

The EXTRA Algorithm

Recently, [Shi et al 14] proposed an EXTRA algorithm

$$x^{r+1} = W x^r - \frac{1}{\beta} d^r + \frac{1}{\beta} d^{r-1} + x^r - \tilde{W} x^{r-1}$$

where $\tilde{W} = \frac{1}{2}(I + W)$; f is assumed to be smooth; W is symmetric

EXTRA is an error-corrected version of DSG

$$x^{r+1} = W x^r - \frac{1}{\beta} d^r + \sum_{t=1}^{r} (W - \tilde{W}) x^{t-1}$$

It is shown that

1 A constant stepsize β can be used (with computable lower bound)

2 The algorithm converges with an (improved) rate of O(1/r)

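A minimal numpy sketch of this recursion (names are illustrative; grad_f is assumed to return the stacked gradients ∇f_i(x_i), and the first step x^1 = W x^0 - (1/β)∇f(x^0) follows the standard EXTRA initialization):

```python
import numpy as np

def extra(x0, grad_f, W, beta, num_iters):
    """EXTRA: x^{r+1} = W x^r - (1/beta)*d^r + (1/beta)*d^{r-1}
                        + x^r - 0.5*(I + W) @ x^{r-1}."""
    W_tilde = 0.5 * (np.eye(W.shape[0]) + W)
    x_prev, g_prev = x0, grad_f(x0)
    x = W @ x0 - g_prev / beta                 # first step
    for _ in range(num_iters - 1):
        g = grad_f(x)
        x_next = W @ x - g / beta + g_prev / beta + x - W_tilde @ x_prev
        x_prev, x, g_prev = x, x_next, g
    return x
```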

Review of Distributed Optimization

The ADMM Algorithm

The general ADMM solves the following two-block optimization problem

$$\min_{x, y} \; f(x) + g(y) \quad \text{s.t.} \; Ax + By = c, \; x \in X, \; y \in Y$$

The augmented Lagrangian

$$L(x, y; \lambda) = f(x) + g(y) + \langle \lambda, c - Ax - By \rangle + \frac{\rho}{2}\|c - Ax - By\|^2$$

The algorithm

1 Minimize L(x, y; λ) w.r.t. x

2 Minimize L(x, y; λ) w.r.t. y

3 $\lambda \leftarrow \lambda + \rho(c - Ax - By)$

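A minimal sketch of these three steps as a generic two-block ADMM loop (the subproblem solvers x_update and y_update are supplied by the caller; all names here are illustrative, not the authors' implementation):

```python
import numpy as np

def admm(x_update, y_update, A, B, c, rho, y0, num_iters):
    """Two-block ADMM for min f(x) + g(y) s.t. Ax + By = c.

    x_update(y, lam) returns argmin_x L(x, y; lam);
    y_update(x, lam) returns argmin_y L(x, y; lam),
    where L is the augmented Lagrangian defined above.
    """
    y, lam = y0, np.zeros_like(c)
    for _ in range(num_iters):
        x = x_update(y, lam)                    # 1) minimize L w.r.t. x
        y = y_update(x, lam)                    # 2) minimize L w.r.t. y
        lam = lam + rho * (c - A @ x - B @ y)   # 3) dual update
    return x, y, lam
```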

Review of Distributed Optimization

The ADMM for Network Consensus

For each link $e_{ij}$ introduce two link variables $z_{ij}$, $z_{ji}$. Reformulate problem (P) as [Schizas et al 08]

$$\min \; f(x) := \sum_{i=1}^{N} f_i(x_i), \quad \text{s.t.} \; x_i = z_{ij}, \; x_j = z_{ij}, \; x_i = z_{ji}, \; x_j = z_{ji}, \quad \forall e_{ij} \in \mathcal{E}.$$


Review of Distributed Optimization

The ADMM for Network Consensus (cont.)

The above problem is equivalent to

$$\min \; f(x) := \sum_{i=1}^{N} f_i(x_i), \quad \text{s.t.} \; Ax + Bz = 0 \qquad (1)$$

where A, B are matrices related to network topology

Converges with O(1/r) rate [Wei-Ozdaglar 13]

When the objective is smooth and strongly convex, linear convergence has been shown in [Shi et al 14]

For a star network, convergence to a stationary solution for nonconvex problems (with rate $O(1/\sqrt{r})$) [H.-Luo-Razaviyayn 14]


Review of Distributed Optimization

Comparison of ADMM and DSG

Table: Comparison of ADMM and DSG.

                      DSG                  ADMM
Problem Type          general convex       smooth / smooth + simple nonsmooth
Stepsize              diminishing (a)      constant
Convergence Rate      O(ln(r)/√r)          O(1/r)
Network Topology      dynamic              static (b)
Subproblem            simple               difficult (c)

(a) Except [Shi et al 14], which uses a constant stepsize

(b) Except [Wei-Ozdaglar 13], which handles random graphs

(c) Except [Chang-H.-Wang 14] [Ling et al 15], which use gradient-type subproblems


Review of Distributed Optimization

Comparison of ADMM and DSG

Connections?



The Proposed Algorithm

Setup

The proposed method is ADMM based

We consider

$$\min_y \; f(y) := \sum_{i=1}^{N} f_i(y) = \sum_{i=1}^{N} \big( g_i(y) + h_i(y) \big) \qquad (Q)$$

1 Each $h_i$ is lower semicontinuous with an easy "prox" operator

$$\text{prox}_{h_i}^{\beta}(u) := \arg\min_y \; h_i(y) + \frac{\beta}{2}\|y - u\|^2.$$

2 Each $g_i$ has a Lipschitz continuous gradient, i.e., for some $P_i > 0$

$$\|\nabla g_i(y) - \nabla g_i(v)\| \le P_i \|y - v\|, \quad \forall y, v \in \text{dom}(h), \; \forall i.$$

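For instance, when $h_i(y) = \nu\|y\|_1$ (the regularizer used in the LASSO experiment later), the prox operator reduces to soft-thresholding; a minimal sketch (function name is illustrative):

```python
import numpy as np

def prox_l1(u, nu, beta):
    """prox_h^beta(u) = argmin_y nu*||y||_1 + (beta/2)*||y - u||^2,
    i.e., elementwise soft-thresholding at level nu/beta."""
    t = nu / beta
    return np.sign(u) * np.maximum(np.abs(u) - t, 0.0)
```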

The Proposed Algorithm

Graph Structure

Both static and randomly time-varying graphs are considered

For the random network, assume that

1 At a given iteration, $G^r$ is a subgraph of a connected graph $G$

2 Each link e has a probability $p_e \in (0, 1]$ of being active

3 A node i is active if an active link connects to it

4 The graph realization at each iteration is independent


The Proposed Algorithm

Gradient Information

Each agent has access to an estimate $G_i(x_i, \xi_i)$ of the gradient such that

$$\mathbb{E}[G_i(x_i, \xi_i)] = \nabla g_i(x_i), \qquad \mathbb{E}\big[\|G_i(x_i, \xi_i) - \nabla g_i(x_i)\|^2\big] \le \sigma^2, \quad \forall i$$

Can be extended to the case where only a subgradient of the objective is available

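A minimal sketch of such an oracle for a least-squares term $g_i(x) = \frac{1}{2}\|A_i x - b_i\|^2$ with additive Gaussian noise (the noise model and the names are assumptions for illustration only):

```python
import numpy as np

def stochastic_grad(A_i, b_i, x_i, sigma, rng):
    """Unbiased estimate of grad g_i(x) = A_i^T (A_i x - b_i):
    exact gradient plus zero-mean noise with E||noise||^2 = sigma^2."""
    exact = A_i.T @ (A_i @ x_i - b_i)
    noise = rng.normal(scale=sigma / np.sqrt(x_i.size), size=x_i.shape)
    return exact + noise
```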

The Proposed Algorithm

The Augmented Lagrangian

The problem we solve is still given by

$$\min \; f(x) := \sum_{i=1}^{N} \big( g_i(x_i) + h_i(x_i) \big), \quad \text{s.t.} \; Ax + Bz = 0$$

The augmented Lagrangian

$$L_{\Gamma}(x, z, \lambda) = \sum_{i=1}^{N} \big( g_i(x_i) + h_i(x_i) \big) + \langle \lambda, Ax + Bz \rangle + \frac{1}{2}\|Ax + Bz\|_{\Gamma}^2.$$

A diagonal matrix Γ is used as the penalty parameter (one $\rho_{ij}$ per edge):

$$\Gamma := \text{diag}\big(\{\rho_{ij}\}_{ij \in \mathcal{E}}\big)$$


The Proposed Algorithm The Proposed Algorithms

The DySPGC Algorithm

The proposed algorithm is named DySPGC (Dynamic Stochastic Proximal Gradient Consensus)

It optimizes $L_{\Gamma}(x, z, \lambda)$ using steps similar to ADMM

The x-step will be replaced by a proximal gradient step


The Proposed Algorithm The Proposed Algorithms

The DySPGC: Static Graph + Exact Gradient

Algorithm 1. PGC Algorithm

At iteration 0, let $B^T\lambda^0 = 0$, $z^0 = \frac{1}{2}M_+^T x^0$.
At each iteration r + 1, update the variable blocks by:

$$x^{r+1} = \arg\min_x \; \langle \nabla g(x^r), x - x^r \rangle + h(x) + \frac{1}{2}\big\|Ax + Bz^r + \Gamma^{-1}\lambda^r\big\|_{\Gamma}^2 + \frac{1}{2}\|x - x^r\|_{\Omega}^2$$

$$z^{r+1} = \arg\min_z \; \frac{1}{2}\big\|Ax^{r+1} + Bz + \Gamma^{-1}\lambda^r\big\|_{\Gamma}^2$$

$$\lambda^{r+1} = \lambda^r + \Gamma\big(Ax^{r+1} + Bz^{r+1}\big)$$


The Proposed Algorithm The Proposed Algorithms

The DySPGC: Static Graph + Stochastic Gradient

Algorithm 2. SPGC Algorithm

At iteration 0, let $B^T\lambda^0 = 0$, $z^0 = \frac{1}{2}M_+^T x^0$.
At each iteration r + 1, update the variable blocks by:

$$x^{r+1} = \arg\min_x \; \langle G(x^r, \xi^{r+1}), x - x^r \rangle + h(x) + \frac{1}{2}\big\|Ax + Bz^r + \Gamma^{-1}\lambda^r\big\|_{\Gamma}^2 + \frac{1}{2}\|x - x^r\|_{\Omega + \eta^{r+1} I_{MN}}^2$$

$$z^{r+1} = \arg\min_z \; \frac{1}{2}\big\|Ax^{r+1} + Bz + \Gamma^{-1}\lambda^r\big\|_{\Gamma}^2$$

$$\lambda^{r+1} = \lambda^r + \Gamma\big(Ax^{r+1} + Bz^{r+1}\big)$$


The Proposed Algorithm The Proposed Algorithms

The DySPGC: Dynamic Graph + Stochastic Gradient

Algorithm 3. DySPGC Algorithm

At iteration 0, let $B^T\lambda^0 = 0$, $z^0 = \frac{1}{2}M_+^T x^0$.
At each iteration r + 1, update the variable blocks by:

$$x^{r+1} = \arg\min_x \; \langle G^{r+1}(x^r, \xi^{r+1}), x - x^r \rangle + h^{r+1}(x) + \frac{1}{2}\big\|A^{r+1}x + B^{r+1}z^r + \Gamma^{-1}\lambda^r\big\|_{\Gamma}^2 + \frac{1}{2}\|x - x^r\|_{\Omega^{r+1} + \eta^{r+1} I_{MN}}^2$$

$$x_i^{r+1} = x_i^r, \quad \text{if } i \notin \mathcal{V}^{r+1}$$

$$z^{r+1} = \arg\min_z \; \frac{1}{2}\big\|A^{r+1}x^{r+1} + B^{r+1}z + \Gamma^{-1}\lambda^r\big\|_{\Gamma}^2$$

$$z_{ij}^{r+1} = z_{ij}^r, \quad \text{if } e_{ij} \notin \mathcal{A}^{r+1}$$

$$\lambda^{r+1} = \lambda^r + \Gamma\big(A^{r+1}x^{r+1} + B^{r+1}z^{r+1}\big)$$


The Proposed Algorithm Distributed Implementation

Distributed Implementation

The algorithms admit distributed implementation

In particular, the PGC admits a single-variable characterization


The Proposed Algorithm Distributed Implementation

Implementation of PGC

Define a stepsize parameter as

$$\beta_i := \sum_{j \in N_i} (\rho_{ij} + \rho_{ji}) + \omega_i, \quad \forall i.$$

($\omega_i$: proximal parameters; $\rho_{ij}$: penalty parameters for the constraints)

Define a stepsize matrix $\Upsilon := \text{diag}([\beta_1, \cdots, \beta_N]) \succ 0$.

Define a weight matrix $W \in \mathbb{R}^{N \times N}$ (a row-stochastic matrix) as

$$W[i, j] = \begin{cases} \dfrac{\rho_{ji} + \rho_{ij}}{\sum_{\ell \in N_i}(\rho_{\ell i} + \rho_{i\ell}) + \omega_i} = \dfrac{\rho_{ji} + \rho_{ij}}{\beta_i}, & \text{if } e_{ij} \in \mathcal{E}, \\[2mm] \dfrac{\omega_i}{\sum_{\ell \in N_i}(\rho_{\ell i} + \rho_{i\ell}) + \omega_i} = \dfrac{\omega_i}{\beta_i}, & \text{if } i = j, \; i \in \mathcal{V}, \\[2mm] 0, & \text{otherwise.} \end{cases}$$

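A minimal numpy sketch of this construction (rho is an N×N array holding ρ_ij on the edges and zeros elsewhere; all names are illustrative):

```python
import numpy as np

def build_weights(rho, omega):
    """Compute beta_i = sum_{j in N_i}(rho_ij + rho_ji) + omega_i and the
    row-stochastic weight matrix W defined above."""
    N = len(omega)
    beta = rho.sum(axis=1) + rho.sum(axis=0) + np.asarray(omega)
    W = np.zeros((N, N))
    for i in range(N):
        W[i, i] = omega[i] / beta[i]
        for j in range(N):
            if i != j and (rho[i, j] > 0 or rho[j, i] > 0):  # e_ij in the edge set
                W[i, j] = (rho[j, i] + rho[i, j]) / beta[i]
    return beta, W
```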

The Proposed Algorithm Distributed Implementation

Implementation of PGC (cont.)

Implementation of PGC

Let $\zeta^r \in \partial h(x^r)$ be some subgradient vector of the nonsmooth function; then the PGC algorithm admits the following single-variable characterization:

$$x^{r+1} - x^r + \Upsilon^{-1}(\zeta^{r+1} - \zeta^r) = \Upsilon^{-1}\big(-\nabla g(x^r) + \nabla g(x^{r-1})\big) + W x^r - \frac{1}{2}(I_N + W)x^{r-1}.$$

In particular, for smooth problems

$$x^{r+1} = W x^r - \Upsilon^{-1}\nabla g(x^r) + \Upsilon^{-1}\nabla g(x^{r-1}) + x^r - \frac{1}{2}(I_N + W)x^{r-1}.$$

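A minimal numpy sketch of this smooth-case recursion (names illustrative; beta collects the β_i so that Υ = diag(β); the first step mimics EXTRA's initialization and is an assumption, not taken from the slides):

```python
import numpy as np

def pgc_smooth(x0, grad_g, W, beta, num_iters):
    """x^{r+1} = W x^r - Ups^{-1} grad g(x^r) + Ups^{-1} grad g(x^{r-1})
                 + x^r - 0.5*(I_N + W) x^{r-1},  with Ups = diag(beta)."""
    Ups_inv = np.diag(1.0 / np.asarray(beta, dtype=float))
    half_I_plus_W = 0.5 * (np.eye(W.shape[0]) + W)
    x_prev, g_prev = x0, grad_g(x0)
    x = W @ x0 - Ups_inv @ g_prev          # assumed first step (EXTRA-style)
    for _ in range(num_iters - 1):
        g = grad_g(x)
        x_next = (W @ x - Ups_inv @ g + Ups_inv @ g_prev
                  + x - half_I_plus_W @ x_prev)
        x_prev, x, g_prev = x, x_next, g
    return x
```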

The Proposed Algorithm Convergence Analysis

Convergence Analysis

We analyze the (rate of) convergence of the proposed methods.

Let us define a matrix of Lipschitz constants $P = \text{diag}([P_1, \cdots, P_N])$.

Measure the convergence rate by [Gao et al 14, Ouyang et al 14]

$$\underbrace{|f(x^r) - f(x^*)|}_{\text{objective gap}} \quad \text{and} \quad \underbrace{\|Ax^r + Bz^r\|}_{\text{consensus gap}}$$


The Proposed Algorithm Convergence Analysis

Convergence Analysis

Table: Main Convergence Results.

Network Type   Gradient Type   Convergence Condition          Convergence Rate
Static         Exact           $\Upsilon W + \Upsilon \succ 2P$   O(1/r)
Static         Stochastic      $\Upsilon W + \Upsilon \succ 2P$   O(1/√r)
Random         Exact           $\Omega \succ P$                   O(1/r)
Random         Stochastic      $\Omega \succ P$                   O(1/√r)

Note: For the exact gradient case, the stepsize β can be halved if only convergence is needed



Connection to Existing Methods

Comparison with Different Algorithms

Algorithm            Connection with DySPGC    Special Setting
EXTRA [Shi 14]       Special case              Static, $h \equiv 0$, $W = W^T$, $G = \nabla g$
DSG [Nedic 09]       Different x-step          Static, g smooth, $G = \nabla g$
IC-ADMM [Chang 14]   Special case              Static, $G = \nabla g$, g composite
DLM [Ling 15]        Special case              Static, $G = \nabla g$, $h \equiv 0$, $\beta_i = \beta$, $\rho_{ij} = \rho$
PG-EXTRA [Shi 15]    Special case              Static, $W = W^T$, $G = \nabla g$


Connection to Existing Methods

Comparison with Different Algorithms

Figure: Relationship among different algorithms


Connection to Existing Methods

The EXTRA Related Algorithms

The EXTRA related algorithms (for either smooth or nonsmooth cases) [Shi et al 14, 15] are special cases of DySPGC

1 Symmetric weight matrix W = WT

2 Exact gradient

3 Scalar stepsize

4 Static graph


Connection to Existing Methods

The DSG Method

Replacing our x-update by (setting the dual variable $\lambda^r = 0$)

$$x^{r+1} = \arg\min_x \; \langle \nabla g(x^r), x - x^r \rangle + \langle 0, Ax + Bz^r \rangle + \frac{1}{2}\|Ax + Bz^r\|_{\Gamma}^2 + \frac{1}{2}\|x - x^r\|_{\Omega}^2$$

Let $\beta_i = \beta_j = \beta$; then the PGC algorithm becomes

$$x^{r+1} = -\frac{1}{\beta}\nabla g(x^r) + \tilde{W} x^r, \quad \text{with } \tilde{W} = \frac{1}{2}(I + W)$$

This is precisely the DSG iteration

Convergence of this case is not covered by our results



Numerical Results

Numerical Results

Some preliminary numerical results by solving a LASSO problem

$$\min_x \; \frac{1}{2}\sum_{i=1}^{N} \|A_i x - b_i\|^2 + \nu\|x\|_1$$

where $A_i \in \mathbb{R}^{K \times M}$, $b_i \in \mathbb{R}^K$

The parameters: N = 16, M = 100, ν = 0.1, K = 200

Data matrix randomly generated

Static graphs, generated according to the method proposed in[Yildiz-Scaglione 08], with a radius parameter set to 0.4.

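A minimal sketch of this experimental setup (the slide does not specify the data distribution or noise model; standard Gaussian data and a small noise level are assumptions here):

```python
import numpy as np

# Sizes from the slide: N agents, local data A_i in R^{K x M}, b_i in R^K
N, M, K, nu = 16, 100, 200, 0.1
rng = np.random.default_rng(0)

A = [rng.standard_normal((K, M)) for _ in range(N)]              # assumed Gaussian
x_true = rng.standard_normal(M)                                  # assumed ground truth
b = [A_i @ x_true + 0.01 * rng.standard_normal(K) for A_i in A]  # assumed noise level

def lasso_objective(x):
    """0.5 * sum_i ||A_i x - b_i||^2 + nu * ||x||_1"""
    return 0.5 * sum(np.linalg.norm(Ai @ x - bi) ** 2 for Ai, bi in zip(A, b)) \
           + nu * np.abs(x).sum()
```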

Numerical Results

Comparison between PG-EXTRA and PGC

Stepsize of PG-EXTRA chosen according to the conditions given in [Shi 14]

W is the Metropolis constant edge weight matrix

PGC: $\omega_i = P_i/2$, $\rho_{ij} = 10^{-3}$

Figure: Comparison between PG-EXTRA and PGC


Numerical Results

Comparison between DSG and Stochastic PGC

Stepsize of DSG chosen as a small constant

$\sigma^2 = 0.1$

W is the Metropolis constant edge weight matrix

SPGC: $\omega_i = P_i$, $\rho_{ij} = 10^{-3}$

Figure: Comparison between DSG and SPGC


Concluding Remarks

Summary

Developed the DySPGC algorithm for multi-agent optimization

It can deal with

1 Stochastic gradient

2 Time-varying networks

3 Nonsmooth composite objective

Convergence rate guarantee for various scenarios


Concluding Remarks

Future Work/Generalization

Identified the relation between DSG-type and ADMM-type methods

Allows for significant generalization

1 Acceleration [Ouyang et al 15]

2 Variance reduction for the local problem when $f_i$ is a finite sum: $f_i(x_i) = \sum_{j=1}^{M} \ell_j(x_i)$

3 Inexact x-subproblems (using, e.g., Conditional-Gradient)

4 Nonconvex problems [H.-Luo-Razaviyayn 14]

5 ...


Concluding Remarks

Thank You!


Concluding Remarks

Parameter Selection

It is easy to pick the various parameters in the different scenarios

Case A: The weight matrix W is given and symmetric

1 We must have $\beta_i = \beta_j = \beta$;

2 For any fixed β, can compute (Ω, ρij)

3 Increase β to satisfy convergence condition

Case B: The user has the freedom to pick $(\rho_{ij}, \Omega)$

1 For any set of $(\rho_{ij}, \Omega)$, one can compute W and $\beta_i$

2 Increase Ω to satisfy the convergence condition

In either case, the convergence condition can be verified by local agents


Concluding Remarks

Case 1: Exact Gradient with Static Graph

Convergence for PGC Algorithm

Suppose that problem (Q) has a nonempty set of optimal solutions $X^* \neq \emptyset$. Suppose $G^r = G$ for all r and G is connected. Then the PGC converges to a primal-dual optimal solution if

$$2\Omega + M_+ \Xi M_+^T = \Upsilon W + \Upsilon \succ P.$$

$M_+ \Xi M_+^T$ is some matrix related to the network topology

A sufficient condition is $\Omega \succ P$, or $\omega_i > P_i$ for all $i \in \mathcal{V}$; this can be determined locally.

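A minimal sketch of how this locally checkable choice could look under the "Case B" parameter selection from the previous slide (pick (ρ_ij, ω_i), compute β_i, and let each agent verify ω_i > P_i; all names are illustrative):

```python
import numpy as np

def case_b_check(rho, omega, P_diag):
    """Given user-chosen edge penalties rho[i, j] (zero off the edge set) and
    proximal parameters omega[i], compute beta_i and check the local
    sufficient condition omega_i > P_i for every agent."""
    beta = rho.sum(axis=1) + rho.sum(axis=0) + np.asarray(omega)
    locally_ok = np.all(np.asarray(omega) > np.asarray(P_diag))
    return beta, bool(locally_ok)
```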

Concluding Remarks

Case 2: Stochastic Gradient with Static Graph

Convergence for SPGC Algorithm

Assume that dom(h) is a bounded set. Suppose that the following conditions hold:

$$\eta^{r+1} = \sqrt{r+1}, \quad \forall r,$$

and the stepsize matrix satisfies

$$2\Omega + M_+ \Xi M_+^T = \Upsilon W + \Upsilon \succ 2P. \qquad (8)$$

Then at a given iteration r, we have

$$\mathbb{E}[f(x^r) - f(x^*)] + \rho\|Ax^r + Bz^r\| \le \frac{\sigma^2}{\sqrt{r}} + \frac{d_x^2}{2\sqrt{r}} + \frac{1}{2r}\Big(d_z^2 + d_\lambda^2(\rho) + \max_i \omega_i d_x^2\Big)$$

where $d_\lambda(\rho) > 0$, $d_x > 0$, $d_z > 0$ are some problem-dependent constants.


Concluding Remarks

Case 2: Stochastic Gradient with Static Graph (cont.)

Both the objective value and the constraint violation converge at a rate of $O(1/\sqrt{r})$

Easy to extend to the exact gradient case, with rate O(1/r)

Requires larger proximal parameter Ω than Case 1


Concluding Remarks

Case 3: Exact Gradient with Time-Varying Graph

Convergence for DySPGC Algorithm

Suppose that problem (Q) has a nonempty set of optimal solutions $X^* \neq \emptyset$, and $G(x^r, \xi^{r+1}) = \nabla g(x^r)$ for all r. Suppose the graph is randomly generated. If we choose the stepsize such that

$$\Omega \succ \frac{1}{2} P,$$

then $(x^r, z^r, \lambda^r)$ converges w.p.1 to a primal-dual solution.

1 The stepsize is more restrictive than Case 1 (not dependent on graph)

2 Convergence is in the sense of with probability 1


Concluding Remarks

Case 4: Stochastic Gradient with Time-Varying Graph

Convergence for DySPGC Algorithm

Suppose $w^t = (x^t, z^t, \lambda^t)$ is a sequence generated by DySPGC, and that

$$\eta^{r+1} = \sqrt{r+1}, \quad \forall r, \quad \text{and} \quad \Omega \succ P.$$

Then we have

$$\mathbb{E}\big[f(x^r) - f(x^*) + \rho\|Ax^r + Bz^r\|\big] \le \frac{\sigma^2}{\sqrt{r}} + \frac{d_x^2}{2\sqrt{r}} + \frac{1}{2r}\Big(2 d_J + d_z^2 + d_\lambda^2(\rho) + \max_i \omega_i d_x^2\Big)$$

where $d_\lambda(\rho)$, $d_J$, $d_x$, $d_z$ are some positive constants.
