A Flexible ADMM for Big Data Applications
Daniel Robinson and Rachael Tappenden, Johns Hopkins University

Introduction

We study the optimization problem:

\min_{x_1,\dots,x_n} \; f(x) := \sum_{i=1}^{n} f_i(x_i) \quad \text{subject to} \quad \sum_{i=1}^{n} A_i x_i = b.   (1)

• Functions f_i : R^{N_i} → R ∪ {∞} are strongly convex with constant μ_i > 0
• Vector b ∈ R^m and matrices A_i ∈ R^{m×N_i} are problem data.
• We can write x = [x_1, ..., x_n] ∈ R^N and A = [A_1, ..., A_n], with x_i ∈ R^{N_i} and N = \sum_{i=1}^{n} N_i
• We assume a solution to (1) exists
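For concreteness, a small synthetic instance of problem (1) can be set up as follows. This is a minimal NumPy sketch under illustrative assumptions (the sizes, the choice f_i(x_i) = (μ_i/2)‖x_i‖², and all names are ours, not the poster's):

```python
import numpy as np

# Illustrative sizes (assumptions, not the poster's experiment).
n, m = 5, 20                      # number of blocks, number of linear constraints
Ni = [8] * n                      # block dimensions N_1, ..., N_n
mu = [1.0] * n                    # strong-convexity constants mu_i

rng = np.random.default_rng(0)
A_blocks = [rng.standard_normal((m, Ni[i])) for i in range(n)]   # A_i in R^{m x N_i}
x_feas = [rng.standard_normal(Ni[i]) for i in range(n)]
b = sum(A_blocks[i] @ x_feas[i] for i in range(n))               # b = sum_i A_i x_i, so (1) is feasible

def f_block(i, xi):
    """Example strongly convex block function f_i(x_i) = (mu_i/2) ||x_i||^2."""
    return 0.5 * mu[i] * xi @ xi
```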

What’s new?!

1. Present a new Flexible Alternating Direction Method of Multipliers (F-ADMM) to solve (1).
   • For strongly convex f_i; for general n ≥ 2; uses a Gauss-Seidel updating scheme.
2. Incorporate (very general) regularization matrices {P_i}_{i=1}^{n}
   • Stabilize the iterates; make subproblems easier to solve
   • Assume: {P_i}_{i=1}^{n} are symmetric and sufficiently positive definite.
3. Prove that F-ADMM is globally convergent.
4. Introduce a Hybrid ADMM variant (H-ADMM) that is partially parallelizable.
5. Special case of H-ADMM can be applied to convex functions

A Flexible ADMM (F-ADMM)

Algorithm 1  F-ADMM for solving problem (1).
1: Initialize: x^{(0)}, y^{(0)}, parameters ρ > 0, γ ∈ (0, 2), and matrices {P_i}_{i=1}^{n}
2: for k = 0, 1, 2, ... do
3:   for i = 1, ..., n do (update the primal variables in a Gauss-Seidel fashion)

       x_i^{(k+1)} ← \arg\min_{x_i} \Big\{ f_i(x_i) + \frac{\rho}{2} \Big\| A_i x_i + \sum_{j=1}^{i-1} A_j x_j^{(k+1)} + \sum_{l=i+1}^{n} A_l x_l^{(k)} - b - \frac{y^{(k)}}{\rho} \Big\|_2^2 + \frac{1}{2} \big\| x_i - x_i^{(k)} \big\|_{P_i}^2 \Big\}

4:   end for
5:   Update dual variables: y^{(k+1)} ← y^{(k)} − γρ(Ax^{(k+1)} − b)
6: end for
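Below is a minimal NumPy sketch of one way to realize Algorithm 1. As an illustration only, it specializes to quadratic blocks f_i(x_i) = (μ_i/2)‖x_i‖² and to regularizers of the form P_i = τ_i I − ρ A_i^T A_i (the kind used later in the experiments), so that each block subproblem has a closed-form solution; for general f_i the argmin on line 3 would be handed to a block-wise solver. The function name `f_admm` and all parameters are hypothetical.

```python
import numpy as np

def f_admm(A_blocks, b, mu, tau, rho=0.1, gamma=1.0, iters=500):
    """Sketch of Algorithm 1 for f_i(x_i) = (mu_i/2)||x_i||^2 and
    P_i = tau_i*I - rho*A_i^T A_i, so each block subproblem is solved in closed form."""
    n = len(A_blocks)
    x = [np.zeros(A.shape[1]) for A in A_blocks]     # x^(0)
    y = np.zeros(b.shape[0])                          # y^(0)
    for _ in range(iters):
        for i in range(n):                            # Gauss-Seidel sweep over the blocks
            # r = sum_{j<i} A_j x_j^(k+1) + sum_{l>i} A_l x_l^(k) - b - y^(k)/rho
            r = sum(A_blocks[j] @ x[j] for j in range(n) if j != i) - b - y / rho
            # Closed-form minimizer of
            #   f_i(x_i) + (rho/2)||A_i x_i + r||^2 + (1/2)||x_i - x_i^(k)||_{P_i}^2.
            x[i] = (tau[i] * x[i] - rho * A_blocks[i].T @ (r + A_blocks[i] @ x[i])) / (mu[i] + tau[i])
        residual = sum(A_blocks[i] @ x[i] for i in range(n)) - b
        y = y - gamma * rho * residual                # y^(k+1) = y^(k) - gamma*rho*(A x^(k+1) - b)
    return x, y
```

With the synthetic data from the earlier sketch, `x, y = f_admm(A_blocks, b, mu, tau=[50.0]*n)` runs the method; each τ_i should at least exceed ρ‖A_i‖²₂ so that P_i is positive definite (the convergence theory asks for "sufficiently" positive definite P_i), and the value 50.0 is an illustrative choice only.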

Theorem 1. Under our assumptions, the sequence {(x^{(k)}, y^{(k)})}_{k≥0} generated by Algorithm 1 converges to some vector (x^*, y^*) that is a solution to problem (1).

A Hybrid ADMM (H-ADMM)

F-ADMM is inherently serial (it has a Gauss-Seidel-type updating scheme).
Jacobi-type (parallel) methods are imperative for big data problems.
Goal: find a balance between algorithm speed via parallelization (Jacobi) and allowing up-to-date information to be fed back into the algorithm (Gauss-Seidel).
• Apply F-ADMM to "grouped data" and choose the regularization matrix carefully
• New Hybrid Gauss-Seidel/Jacobi ADMM method (H-ADMM) that is partially parallelizable

Group the data: form ℓ groups of p blocks, where n = ℓp:

x = [\, \underbrace{x_1, \dots, x_p}_{\mathbf{x}_1} \mid \underbrace{x_{p+1}, \dots, x_{2p}}_{\mathbf{x}_2} \mid \dots \mid \underbrace{x_{(\ell-1)p+1}, \dots, x_n}_{\mathbf{x}_\ell} \,]

and

A = [\, \underbrace{A_1, \dots, A_p}_{\mathbf{A}_1} \mid \underbrace{A_{p+1}, \dots, A_{2p}}_{\mathbf{A}_2} \mid \dots \mid \underbrace{A_{(\ell-1)p+1}, \dots, A_n}_{\mathbf{A}_\ell} \,]

Problem (1) becomes:

\min_{x \in R^N} \; f(x) \equiv \sum_{j=1}^{\ell} \mathbf{f}_j(\mathbf{x}_j) \quad \text{subject to} \quad \sum_{j=1}^{\ell} \mathbf{A}_j \mathbf{x}_j = b.   (2)

The choice of 'group' regularization matrices {\mathbf{P}_i}_{i=1}^{\ell} is crucial. Choosing

\mathbf{P}_i := \mathrm{blkdiag}(P_{S_{i,1}}, \dots, P_{S_{i,p}}) − ρ \mathbf{A}_i^T \mathbf{A}_i    (3)

(with index set S_i = {(i−1)p + 1, ..., ip} for 1 ≤ i ≤ ℓ) makes the group iterations separable/parallelizable!
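The index sets S_i and the grouped matrices are easy to set up in code; a small sketch (0-based indices, hypothetical helper names) under the assumption n = ℓp:

```python
import numpy as np

def make_groups(n, p):
    """Split the block indices {0, ..., n-1} into ell = n/p consecutive groups of
    p blocks each (a 0-based version of S_i = {(i-1)p+1, ..., ip})."""
    assert n % p == 0, "the grouping assumes n = ell * p"
    return [list(range(i * p, (i + 1) * p)) for i in range(n // p)]

def group_matrix(A_blocks, S_i):
    """Concatenate the member matrices of one group: bold A_i = [A_j : j in S_i]."""
    return np.hstack([A_blocks[j] for j in S_i])

# Example: 6 blocks in groups of 2 -> [[0, 1], [2, 3], [4, 5]].
groups = make_groups(6, 2)
```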

Algorithm 2  H-ADMM for solving problem (1).
1: Initialize: x^{(0)}, y^{(0)}, ρ > 0 and γ ∈ (0, 2), index sets {S_i}_{i=1}^{\ell}, matrices {P_i}_{i=1}^{n}.
2: for k = 0, 1, 2, ... do
3:   for i = 1, ..., ℓ (in a serial Gauss-Seidel fashion solve) do
4:     Set b_i ← b − \sum_{q=1}^{i-1} \mathbf{A}_q \mathbf{x}_q^{(k+1)} − \sum_{s=i}^{\ell} \mathbf{A}_s \mathbf{x}_s^{(k)} + \frac{y^{(k)}}{\rho}
5:     for j ∈ S_i (in parallel Jacobi fashion solve) do

         x_j^{(k+1)} ← \arg\min_{x_j} \Big\{ f_j(x_j) + \frac{\rho}{2} \big\| A_j (x_j − x_j^{(k)}) − b_i \big\|_2^2 + \frac{1}{2} \big\| x_j − x_j^{(k)} \big\|_{P_j}^2 \Big\}

6:     end for
7:   end for
8:   Update the dual variables: y^{(k+1)} ← y^{(k)} − γρ(Ax^{(k+1)} − b).
9: end for
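A matching NumPy sketch of Algorithm 2 is given below, again specialized (as an assumption, not the poster's general setting) to quadratic blocks f_j(x_j) = (μ_j/2)‖x_j‖² and P_j = τ_j I − ρ A_j^T A_j, so the inner argmin is a closed-form update. Each pass through the inner loop over j ∈ S_i could be farmed out to p workers, because the updates read only the shared b_i and their own previous iterate; the name `h_admm` and its signature are hypothetical.

```python
import numpy as np

def h_admm(A_blocks, b, mu, tau, groups, rho=0.1, gamma=1.0, iters=500):
    """Sketch of Algorithm 2 for f_j(x_j) = (mu_j/2)||x_j||^2 and
    P_j = tau_j*I - rho*A_j^T A_j; `groups` lists the index sets S_1, ..., S_ell."""
    n = len(A_blocks)
    x = [np.zeros(A.shape[1]) for A in A_blocks]
    y = np.zeros(b.shape[0])
    for _ in range(iters):
        for S_i in groups:                            # serial Gauss-Seidel over the groups
            # b_i = b - sum_{q<i} A_q x_q^(k+1) - sum_{s>=i} A_s x_s^(k) + y^(k)/rho;
            # at this point x holds exactly those iterates, so:
            b_i = b - sum(A_blocks[j] @ x[j] for j in range(n)) + y / rho
            for j in S_i:                             # Jacobi within the group (parallelizable)
                # Closed-form minimizer of
                #   f_j(x_j) + (rho/2)||A_j(x_j - x_j^(k)) - b_i||^2 + (1/2)||x_j - x_j^(k)||_{P_j}^2.
                x[j] = (tau[j] * x[j] + rho * A_blocks[j].T @ b_i) / (mu[j] + tau[j])
        residual = sum(A_blocks[j] @ x[j] for j in range(n)) - b
        y = y - gamma * rho * residual
    return x, y
```

With `groups = make_groups(n, p)` from the sketch above, `h_admm(A_blocks, b, mu, tau, groups)` runs the grouped method on the same synthetic data.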

Key features of H-ADMM

• H-ADMM is a special case of F-ADMM (convergence is automatic from the F-ADMM theory)
• Groups are updated in Gauss-Seidel fashion; blocks within a group are updated in Jacobi fashion
  – Decision variables x_j for j ∈ S_i are solved for in parallel!
• Regularization matrices \mathbf{P}_i:
  – Never explicitly form \mathbf{P}_i for i = 1, ..., ℓ; only need P_1, ..., P_n
  – Regularization matrix \mathbf{P}_i makes the subproblem for group \mathbf{x}_i^{(k+1)} separable into p blocks
  – Assume: matrices P_i are symmetric and sufficiently positive definite
• Computational considerations
  – "Optimize" H-ADMM to the number of processors: for a machine with p processors, set the group size to p too! (See the sketch after this list.)
  – Still uses "new" information, just not as much as in the 'individual block' setting
  – Special case: Hybrid ADMM with ℓ = 2 ⇒ convergence holds for convex f_i
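Because each block j ∈ S_i reads only the shared vector b_i and its own previous iterate, the inner loop of Algorithm 2 maps directly onto a pool of p workers. A sketch of that mapping, using the closed-form update from the experiments; the executor usage and all names are illustrative assumptions:

```python
from concurrent.futures import ThreadPoolExecutor

def update_group_parallel(S_i, A_blocks, x, b_i, mu, tau, rho, pool):
    """Jacobi update of one group: the p block updates are independent given b_i."""
    def one_block(j):
        # Same closed-form block update as in the h_admm sketch.
        return (tau[j] * x[j] + rho * A_blocks[j].T @ b_i) / (mu[j] + tau[j])
    for j, x_new in zip(S_i, pool.map(one_block, S_i)):
        x[j] = x_new

# Usage inside the outer loop of h_admm, with p workers:
#   with ThreadPoolExecutor(max_workers=p) as pool:
#       update_group_parallel(S_i, A_blocks, x, b_i, mu, tau, rho, pool)
```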

Numerical Experiments

Determine the solution to an underdetermined system of equations with the smallest 2-norm:

\min_{x \in R^N} \; \frac{1}{2} \| x \|_2^2 \quad \text{subject to} \quad Ax = b.   (4)

• n = 100 blocks; p = 10 processors; ℓ = 10 groups; block size N_i = 100 ∀i
• Convexity parameter μ_i = 1 ∀i; A is 3·10^3 × 10^4 and sparse; noiseless data
• Algorithm parameters: ρ = 0.1, γ = 1; stopping condition: \frac{1}{2}\|Ax − b\|_2^2 ≤ 10^{−10}.

Choosing P_j = τ_j I − ρ A_j^T A_j ∀ j ∈ S_i, for some τ_j > 0, gives

x_j^{(k+1)} = \frac{τ_j}{τ_j + 1} x_j^{(k)} + \frac{ρ}{τ_j + 1} A_j^T b_i   ∀ j ∈ S_i.
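Putting the pieces together for problem (4): the block update above is a damped averaging step, and the short sketch below runs it end-to-end with the stopping test from the experiments. The sizes and the τ_j choice are illustrative assumptions (the τ_j below only make P_j = τ_j I − ρ A_j^T A_j comfortably positive definite, not necessarily the theoretical bound), so this is not the 3·10³ × 10⁴ run reported on the poster.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, Ni, m = 20, 4, 10, 60                 # illustrative sizes (not the poster's experiment)
rho, gamma = 0.1, 1.0
A_blocks = [rng.standard_normal((m, Ni)) for _ in range(n)]
b = np.hstack(A_blocks) @ rng.standard_normal(n * Ni)               # consistent right-hand side
tau = [p * rho * np.linalg.norm(A, 2) ** 2 for A in A_blocks]       # illustrative: roughly p*rho*||A_j||^2
groups = [list(range(i, i + p)) for i in range(0, n, p)]            # ell = n/p groups of p blocks

x = [np.zeros(Ni) for _ in range(n)]
y = np.zeros(m)
for epoch in range(100000):
    for S_i in groups:                                              # Gauss-Seidel over groups
        b_i = b - sum(A_blocks[j] @ x[j] for j in range(n)) + y / rho
        for j in S_i:                                               # Jacobi within the group
            x[j] = (tau[j] / (tau[j] + 1)) * x[j] + (rho / (tau[j] + 1)) * A_blocks[j].T @ b_i
    r = sum(A_blocks[j] @ x[j] for j in range(n)) - b
    y = y - gamma * rho * r
    if 0.5 * (r @ r) <= 1e-10:                                      # (1/2)||Ax - b||_2^2 <= 10^-10
        break
print(f"stopped after {epoch + 1} epochs, ||x|| = {np.linalg.norm(np.hstack(x)):.4f}")
```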

We compare F-ADMM and H-ADMM with Jacobi ADMM (J-ADMM) [1]. (J-ADMM is similar to F-ADMM, but the blocks are updated using a Jacobi-type scheme.)

Results using theoretical τj values

[Figure: magnitude of τ_j for each block j = 1, ..., 100 on problem (4); legend: J-ADMM, F-ADMM, H-ADMM; vertical axis 0 to 300.]

Method    Epochs
J-ADMM    4358.2
H-ADMM     214.1
F-ADMM     211.3

Left plot: the magnitude of τ_j for each block 1 ≤ j ≤ n for problem (4). The τ_j values are much smaller for H-ADMM and F-ADMM than for J-ADMM; a larger τ_j corresponds to a 'more positive definite' regularization matrix P_j. Right table: number of epochs required by J-ADMM, F-ADMM, and H-ADMM on problem (4) using theoretical values of τ_j for j = 1, ..., n (averaged over 100 runs). H-ADMM and F-ADMM require significantly fewer epochs than J-ADMM.

Results using parameter tuning

Parameter tuning can help practical performance! Assign the same value τ_j to all blocks j = 1, ..., n and to all algorithms (i.e., τ_1 = τ_2 = ··· = τ_n).

τ_j                      J-ADMM   H-ADMM   F-ADMM
(ρ²/2)·‖A‖⁴               530.0    526.3    526.2
0.6 · (ρ²/2)·‖A‖⁴         324.0    320.1    319.9
0.4 · (ρ²/2)·‖A‖⁴         217.7    214.5    214.1
0.22 · (ρ²/2)·‖A‖⁴        123.1    119.3    119.0
0.2 · (ρ²/2)·‖A‖⁴           —       95.8     95.5
0.1 · (ρ²/2)·‖A‖⁴           —       75.3     73.0

The table presents the number of epochs required by J-ADMM, F-ADMM, and H-ADMM on problem (4) as τ_j varies. For each τ_j we run each algorithm (J-ADMM, F-ADMM, and H-ADMM) on 100 random instances of the problem formulation described above. Here, τ_j takes the same value for all blocks j = 1, ..., n. All algorithms require fewer epochs as τ_j decreases, and F-ADMM requires the smallest number of epochs.

• As τ_j decreases, the number of epochs decreases
• For fixed τ_j, F-ADMM and H-ADMM outperform J-ADMM
• F-ADMM converges for very small τ_j values

References

1. Wei Deng, Ming-Jun Lai, Zhimin Peng, and Wotao Yin, "Parallel Multi-Block ADMM with o(1/k) Convergence", Technical Report, UCLA (2014).

2. Daniel Robinson and Rachael Tappenden, "A Flexible ADMM Algorithm for Big Data Applications", Technical Report, JHU, arXiv:1502.04391 (2015).

{daniel.p.robinson,rtappen1}@jhu.edu
