MOCHA: Federated Multi-Task Learning (learningsys.org/nips17/assets/slides/mocha-NIPS.pdf)

MOCHA: Federated Multi-Task Learning. Virginia Smith (Stanford / CMU), Chao-Kai Chiang (USC), Maziar Sanjabi (USC), Ameet Talwalkar (CMU). NIPS '17

Transcript of MOCHA: Federated Multi-Task Learning (learningsys.org/nips17/assets/slides/mocha-NIPS.pdf)

Page 1:

MOCHA: Federated Multi-Task Learning

Virginia Smith · Stanford / CMU

Chao-Kai Chiang · USC

Maziar Sanjabi · USC

Ameet Talwalkar · CMU

NIPS ‘17

Page 2:

MACHINE LEARNING WORKFLOW

machine learning model

data & problem

optimization algorithm: $\min_w \sum_{i=1}^{n} \ell(w, x_i) + g(w)$
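The objective above is the standard regularized empirical risk. A minimal runnable sketch, assuming squared loss and an L2 regularizer as stand-ins for $\ell$ and $g$:

```python
import numpy as np

# The slide's objective min_w sum_i l(w, x_i) + g(w), instantiated with
# squared loss l(w, (x_i, y_i)) = 1/2 (w^T x_i - y_i)^2 and
# g(w) = lam/2 ||w||^2 (assumed stand-ins), solved by gradient descent.
def gradient_descent(X, y, lam=0.1, lr=0.01, steps=500):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) + lam * w   # gradient of the full objective
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])
w_hat = gradient_descent(X, y)
# closed-form minimizer of the same objective, for comparison
w_closed = np.linalg.solve(X.T @ X + 0.1 * np.eye(3), X.T @ y)
```

The rest of the deck is about what happens to this innocuous-looking `min` when the data rows live on different devices.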

Page 3:

MACHINE LEARNING WORKFLOW

optimization algorithm: $\min_w \sum_{i=1}^{n} \ell(w, x_i) + g(w)$

systems setting

data & problem

machine learning model

^ IN PRACTICE

Page 4:

how can we perform fast distributed optimization?

Page 5:

BEYOND THE DATACENTER

Massively Distributed · Node Heterogeneity · Unbalanced · Non-IID · Underlying Structure

Page 6:

BEYOND THE DATACENTER

Systems Challenges: Massively Distributed, Node Heterogeneity

Statistical Challenges: Unbalanced, Non-IID, Underlying Structure

Page 7:

MACHINE LEARNING WORKFLOW

optimization algorithm: $\min_w \sum_{i=1}^{n} \ell(w, x_i) + g(w)$

data & problem

machine learning model

^ IN PRACTICE

systems setting

Page 9:

OUTLINE

Statistical Challenges: Unbalanced, Non-IID, Underlying Structure

Systems Challenges: Massively Distributed, Node Heterogeneity

Page 11:

A GLOBAL APPROACH

W

[MMRHA, AISTATS 16]

Page 12:

A LOCAL APPROACH

W1, W2, W3, W4, W5, W6, W7, W8, W9, W10, W11, W12

Page 13:

OUR APPROACH: PERSONALIZED MODELS

W1, W2, W3, W4, W5, W6, W7, W8, W9, W10, W11, W12

Page 15:

MULTI-TASK LEARNING

$\min_{W,\Omega} \ \sum_{t=1}^{m} \sum_{i=1}^{n_t} \ell_t(w_t, x_t^i) + R(W, \Omega)$

(models $W$ · losses $\ell_t$ · regularizer / task relationship $R(W, \Omega)$)

All tasks related · Outlier tasks · Clusters / groups · Asymmetric relationships [ZCY, SDM 2012]
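A sketch of evaluating the multi-task objective above, with logistic loss and the trace regularizer $R(W,\Omega) = \lambda\,\mathrm{tr}(W \Omega W^T)$ as assumed stand-ins ($\Omega$ is an $m \times m$ task-relationship matrix; this is one common choice, not the only one the framework supports):

```python
import numpy as np

# Multi-task objective with assumed stand-ins: logistic loss per task,
# R(W, Omega) = lam * tr(W Omega W^T) as the task-relationship regularizer.
def mtl_objective(W, Omega, tasks, lam=0.1):
    # W: d x m, one column w_t per task; tasks: list of (X_t, y_t), y in {-1,+1}
    loss = 0.0
    for t, (X_t, y_t) in enumerate(tasks):
        margins = y_t * (X_t @ W[:, t])
        loss += np.sum(np.log1p(np.exp(-margins)))   # logistic loss per example
    return loss + lam * np.trace(W @ Omega @ W.T)

rng = np.random.default_rng(1)
d, m = 4, 3
W = rng.normal(size=(d, m))
Omega = np.eye(m)    # identity: no coupling, reduces to per-task L2 penalties
tasks = [(rng.normal(size=(10, d)), rng.choice([-1.0, 1.0], size=10))
         for _ in range(m)]
val = mtl_objective(W, Omega, tasks)
```

Choosing $\Omega$ off-diagonal entries is exactly how the structures on this slide (clusters, outliers, asymmetric relationships) get encoded.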

Page 16:

FEDERATED DATASETS

Human Activity

Google Glass

Land Mine

Vehicle Sensor

Page 17:

PREDICTION ERROR

Dataset          Global         Local          MTL
Human Activity   2.23 (0.30)    1.34 (0.21)    0.46 (0.11)
Google Glass     5.34 (0.26)    4.92 (0.26)    2.02 (0.15)
Land Mine        27.72 (1.08)   23.43 (0.77)   20.09 (1.04)
Vehicle Sensor   13.4 (0.26)    7.81 (0.13)    6.59 (0.21)

Page 18:

OUTLINE

Statistical Challenges: Unbalanced, Non-IID, Underlying Structure

Systems Challenges: Massively Distributed, Node Heterogeneity

Page 21:

GOAL: FEDERATED OPTIMIZATION FOR MULTI-TASK LEARNING

$\min_{W,\Omega} \ \sum_{t=1}^{m} \sum_{i=1}^{n_t} \ell_t(w_t^T x_t^i) + R(W, \Omega)$

Solve for W, Ω in an alternating fashion: Ω can be updated centrally; W needs to be solved in the federated setting.

Challenges: communication is expensive · statistical & systems heterogeneity · stragglers · fault tolerance

Idea:

Modify a communication-efficient method for the data center setting to handle:

✔ Multi-task learning ✔ Stragglers

✔ Fault tolerance
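The alternating scheme on this slide can be simulated serially as below; the quadratic losses, the normalized task-covariance update for Ω, and all constants are illustrative assumptions, not the paper's exact derivation:

```python
import numpy as np

# Illustrative alternating minimization: W updated per task ("device"),
# Omega updated centrally. Stand-ins: quadratic losses, and a normalized
# task covariance as the Omega update.
def update_W(W, Omega, tasks, lam=0.1, lr=0.02, steps=25):
    for t, (X_t, y_t) in enumerate(tasks):
        for _ in range(steps):
            # gradient of 1/2 ||X_t w_t - y_t||^2 + lam * tr(W Omega W^T) in w_t
            grad = X_t.T @ (X_t @ W[:, t] - y_t) + 2 * lam * (W @ Omega)[:, t]
            W[:, t] -= lr * grad
    return W

def update_Omega(W, eps=1e-3):
    # central step: relate tasks through their current models
    A = W.T @ W + eps * np.eye(W.shape[1])
    return A / np.trace(A)

rng = np.random.default_rng(2)
d, m = 3, 4
tasks = [(rng.normal(size=(15, d)), rng.normal(size=15)) for _ in range(m)]
W = np.zeros((d, m))
Omega = np.eye(m) / m      # start with tasks treated symmetrically
for _ in range(5):         # alternate: W step (federated), Omega step (central)
    W = update_W(W, Omega, tasks)
    Omega = update_Omega(W)
```

Only the W step touches raw data, which is why it is the step that must run in the federated setting; the Ω step needs only the models.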

Page 22:

COCOA: COMMUNICATION-EFFICIENT DISTRIBUTED OPTIMIZATION

mini-batch methods

one-shot communication

key idea: control communication

Page 24:

COCOA: PRIMAL-DUAL FRAMEWORK

PRIMAL ≥ DUAL

$\min_{w \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^{n} \ell(w^T x_i) + \lambda g(w) \;\ \ge\ \; \max_{\alpha \in \mathbb{R}^n} \ -\frac{1}{n} \sum_{i=1}^{n} \ell^*(-\alpha_i) - \lambda g^*(X\alpha)$

$\max_{\alpha \in \mathbb{R}^n} \ \frac{1}{n} \sum_{i=1}^{n} -\ell^*(-\alpha_i) - \lambda \sum_{k=1}^{K} \tilde{g}^*(X_{[k]}, \alpha_{[k]})$

[diagram: per-machine block updates $\alpha_{[k]}^{(t)} \to \alpha_{[k]}^{(t+1)}$]

challenge #1: extend to MTL setup
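One way to sanity-check the primal-dual correspondence is to instantiate it with squared loss and $g(w) = \frac{1}{2}\|w\|^2$ (my stand-ins, not the slide's generic $\ell$ and $g$): both problems then reduce to linear systems, and the primal-dual mapping $w = X\alpha/(\lambda n)$ can be verified numerically:

```python
import numpy as np

# Numerical check of the primal/dual pair with squared loss
# l(a) = 1/2 (a - y_i)^2 and g(w) = 1/2 ||w||^2 (assumed stand-ins).
rng = np.random.default_rng(3)
d, n, lam = 4, 20, 0.5
X = rng.normal(size=(d, n))     # columns are examples x_i
y = rng.normal(size=n)

# primal: min_w 1/n sum_i 1/2 (w^T x_i - y_i)^2 + lam/2 ||w||^2
# stationarity gives (X X^T + lam n I) w = X y
w_primal = np.linalg.solve(X @ X.T + lam * n * np.eye(d), X @ y)

# dual for squared loss: alpha solves (I + X^T X / (lam n)) alpha = y
alpha = np.linalg.solve(np.eye(n) + X.T @ X / (lam * n), y)
w_from_dual = X @ alpha / (lam * n)   # primal-dual mapping w(alpha)
```

Because each $\alpha_i$ belongs to one example, the dual splits cleanly into per-machine blocks $\alpha_{[k]}$, which is what the second (local subproblem) formulation exploits.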

Page 25: MOCHA: Federated Multi-Task Learninglearningsys.org/nips17/assets/slides/mocha-NIPS.pdfGOAL: FEDERATED OPTIMIZATION FOR MULTI-TASK LEARNING min W, Xm t=1 Xnt i=1 ` t (wT t x i t)+R(W,

COCOA: COMMUNICATION PARAMETER

𝚹 ≈amount of local

computationvs.

communication∈ [0, 1)

Main assumption: each subproblem is solved to accuracy 𝚹

exactly solve

inexactly solve

Page 26:

COCOA: COMMUNICATION PARAMETER

$\Theta \in [0, 1)$ ≈ amount of local computation vs. communication

Main assumption: each subproblem is solved to accuracy $\Theta$
($\Theta = 0$: exactly solve · $\Theta \to 1$: inexactly solve)

challenge #2: make communication more flexible
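The Θ assumption can be made concrete on a toy quadratic subproblem (a stand-in for CoCoA's actual local subproblem): the achieved accuracy is the fraction of starting suboptimality left after the local solver runs, so more local computation drives Θ toward 0 (exact) and less leaves it near 1:

```python
import numpy as np

# theta = (G(x) - G(x*)) / (G(0) - G(x*)) for a toy quadratic subproblem
# G(x) = 1/2 x^T A x - b^T x, solved inexactly by a few gradient steps.
def local_solve(A, b, steps, lr=0.1):
    x = np.zeros_like(b)
    for _ in range(steps):
        x -= lr * (A @ x - b)       # gradient descent on G
    return x

def achieved_theta(A, b, x):
    G = lambda v: 0.5 * v @ A @ v - b @ v
    x_star = np.linalg.solve(A, b)  # exact subproblem solution
    return (G(x) - G(x_star)) / (G(np.zeros_like(b)) - G(x_star))

rng = np.random.default_rng(4)
M = rng.normal(size=(5, 5))
A = M @ M.T / 10 + np.eye(5)        # well-conditioned SPD matrix
b = rng.normal(size=5)
theta_few = achieved_theta(A, b, local_solve(A, b, steps=3))
theta_many = achieved_theta(A, b, local_solve(A, b, steps=200))
# theta_many << theta_few: more local computation, closer to an exact solve
```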

Page 27:

MOCHA: COMMUNICATION-EFFICIENT FEDERATED OPTIMIZATION

$\min_{W,\Omega} \ \sum_{t=1}^{m} \sum_{i=1}^{n_t} \ell_t(w_t^T x_t^i) + R(W, \Omega)$

Solve for W, Ω in an alternating fashion

Modify CoCoA to solve W in the federated setting, via the dual:

$\min_{\alpha} \ \sum_{t=1}^{m} \sum_{i=1}^{n_t} \ell_t^*(-\alpha_t^i) + R^*(X\alpha)$

with per-device subproblems:

$\min_{\Delta\alpha_t} \ \sum_{i=1}^{n_t} \ell_t^*(-\alpha_t^i - \Delta\alpha_t^i) + \langle w_t(\alpha), X_t \Delta\alpha_t \rangle + \frac{\sigma'}{2} \| X_t \Delta\alpha_t \|^2_{M_t}$

Page 28:

MOCHA: PER-DEVICE, PER-ITERATION APPROXIMATIONS

Stragglers (statistical heterogeneity): difficulty of solving the subproblem; size of the local dataset

Stragglers (systems heterogeneity): hardware (CPU, memory); network connection (3G, LTE, …); power (battery level)

Fault tolerance: devices going offline

New assumption: each subproblem is solved to accuracy $\theta_t^h \in [0, 1]$, per device $t$ and per iteration $h$ (vs. a single fixed $\theta \in [0, 1)$)

Page 29:

CONVERGENCE

New assumption: each subproblem is solved to accuracy $\theta_t^h \in [0, 1]$, and assume $\mathbb{P}[\theta_t^h := 1] < 1$

Theorem 1 ($1/\epsilon$ rate). Let $\ell_t$ be $L$-Lipschitz; then

$T \;\ge\; \frac{1}{1 - \bar{\Theta}} \left( \frac{8 L^2 n^2}{\epsilon} + \tilde{c} \right)$

Theorem 2 (linear rate). Let $\ell_t$ be $(1/\mu)$-smooth; then

$T \;\ge\; \frac{1}{1 - \bar{\Theta}} \, \frac{\mu + n}{\mu} \log \frac{n}{\epsilon}$

Page 30:

MOCHA: COMMUNICATION-EFFICIENT FEDERATED OPTIMIZATION

Algorithm 1 MOCHA: Federated Multi-Task Learning Framework

1: Input: data $X_t$ stored on devices $t = 1, \dots, m$
2: Initialize $\alpha^{(0)} := 0$, $v^{(0)} := 0$
3: for iterations $i = 0, 1, \dots$ do
4:   for iterations $h = 0, 1, \dots, H_i$ do
5:     for devices $t \in \{1, 2, \dots, m\}$ in parallel do
6:       call local solver, returning a $\theta_t^h$-approximate solution $\Delta\alpha_t$
7:       update local variables $\alpha_t \leftarrow \alpha_t + \Delta\alpha_t$
8:     reduce: $v \leftarrow v + \sum_t X_t \Delta\alpha_t$
9:   Update $\Omega$ centrally using $w(v) := \nabla R^*(v)$
10: Compute $w(v) := \nabla R^*(v)$
11: return $W := [w_1, \dots, w_m]$
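A serial sketch of this communication pattern, collapsed to a single shared ridge model for brevity (the per-task mapping $w_t(v)$, the local solver, and the device-dropout pattern are simplified stand-ins, not the paper's exact method); with enough rounds it recovers the centralized solution even when devices intermittently drop:

```python
import numpy as np

# Each round, every available device t improves its dual block alpha_t with
# an inexact local solve (theta_t^h < 1); an offline device does no work
# (theta_t^h = 1). The reduce step accumulates v = sum_t X_t^T alpha_t.
def mocha_like_rounds(tasks, lam, rounds=200, local_steps=20, lr=0.1,
                      drop_p=0.2, seed=0):
    rng = np.random.default_rng(seed)
    n = sum(len(y_t) for _, y_t in tasks)
    alphas = [np.zeros(len(y_t)) for _, y_t in tasks]
    v = np.zeros(tasks[0][0].shape[1])
    for _ in range(rounds):
        for t, (X_t, y_t) in enumerate(tasks):
            if rng.random() < drop_p:
                continue                     # device offline this round
            w = v / (lam * n)                # w(v) := grad R*(v) for ridge R
            delta = np.zeros_like(alphas[t])
            for _ in range(local_steps):     # inexact local subproblem solve
                grad = (alphas[t] + delta - y_t) + X_t @ w \
                       + X_t @ (X_t.T @ delta) / (lam * n)
                delta -= lr * grad
            alphas[t] += delta
            v += X_t.T @ delta               # reduce: v <- v + X_t^T delta
    return v / (lam * n)

rng = np.random.default_rng(6)
tasks = [(rng.normal(size=(8, 3)), rng.normal(size=8)) for _ in range(4)]
lam = 0.5
w_fed = mocha_like_rounds(tasks, lam)
# reference: centralized ridge solution on the pooled data
X = np.vstack([X_t for X_t, _ in tasks])
y = np.concatenate([y_t for _, y_t in tasks])
w_star = np.linalg.solve(X.T @ X + lam * len(y) * np.eye(3), X.T @ y)
```

The key structural point survives the simplification: only dual updates $X_t \Delta\alpha_t$ cross the network, never raw data, and uneven or skipped local work changes the convergence speed, not the fixed point.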

Page 31:

STATISTICAL HETEROGENEITY

[Figure: primal sub-optimality vs. estimated time on Human Activity under statistical heterogeneity, one panel each for WiFi, LTE, and 3G; methods compared: MOCHA, CoCoA, Mb-SDCA, Mb-SGD]

MOCHA IS ROBUST TO STATISTICAL HETEROGENEITY

MOCHA & COCOA PERFORM PARTICULARLY WELL IN HIGH-COMMUNICATION SETTINGS

Page 32:

SYSTEMS HETEROGENEITY

[Figure: primal sub-optimality vs. estimated time on Vehicle Sensor under low and high systems heterogeneity; methods compared: MOCHA, CoCoA, Mb-SDCA, Mb-SGD]

MOCHA SIGNIFICANTLY OUTPERFORMS ALL COMPETITORS [BY 2 ORDERS OF MAGNITUDE]

Page 33:

FAULT TOLERANCE

[Figure: primal sub-optimality vs. estimated time on Google Glass with dropped nodes, for the W step alone and for the full method]

MOCHA IS ROBUST TO DROPPED NODES

Page 34:

OUTLINE

Statistical Challenges: Unbalanced, Non-IID, Underlying Structure

Systems Challenges: Massively Distributed, Node Heterogeneity

Page 35:

Virginia Smith · Stanford / CMU

CODE & PAPERS: cs.berkeley.edu/~vsmith

WWW.SYSML.CC