Transcript of Allreduce for Parallel Learning (mltf/parallel_learning.pdf)

Page 1: Allreduce for Parallel Learning

Allreduce for Parallel Learning

John Langford, Microsoft Research, NYC

May 8, 2017

Page 2: Allreduce for Parallel Learning

Applying for a fellowship in 1997

Interviewer: So, what do you want to do?
John: I'd like to solve AI.
I: How?
J: I want to use parallel learning algorithms to create fantastic learning machines!
I: You fool! The only thing parallel machines are good for is computational windtunnels!

The worst part: he had a point.

Page 5: Allreduce for Parallel Learning

Why is it hard?

Everyone’s first instinct: Try using parameter servers.

[Figure: parameter-server architecture — model shards serving weights to many data shards]

1 Hsu, Karampatziakis, Langford, and Smola, “Parallel Online Learning”, https://arxiv.org/abs/1103.4204

2 Dean et al, “Large Scale Distributed Deep Networks”, NIPS 2012.

Big problems in practice:

1 Overwhelmingly inefficient. Best case: marginally faster with 100x the electricity.

2 Nondeterministic. Minor bugs are incredibly difficult to track down.
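For reference, here is a toy single-process sketch of the pattern in the figure: model shards own disjoint slices of the weights, and data shards compute gradients on their own examples and push them to the owning shards. Everything here (the least-squares objective, shard counts, learning rate) is made up for illustration; real parameter servers do this asynchronously over the network, which is exactly where the nondeterminism above comes from.

```python
# Toy synchronous parameter-server loop (illustration only): model shards own
# disjoint slices of w; data shards compute least-squares gradients and each
# model shard applies the update for its own slice.
import numpy as np

rng = np.random.default_rng(1)
d = 8
slices = np.array_split(np.arange(d), 2)        # 2 model shards own w[0:4], w[4:8]
true_w = rng.normal(size=d)

data_shards = []
for _ in range(3):                              # 3 data shards, 50 examples each
    X = rng.normal(size=(50, d))
    data_shards.append((X, X @ true_w))

w = np.zeros(d)
for step in range(300):
    grads = [X.T @ (X @ w - y) / len(y) for X, y in data_shards]
    g = sum(grads) / len(data_shards)           # gradients pushed to the servers
    for sl in slices:                           # each model shard updates its slice
        w[sl] -= 0.05 * g[sl]
print(np.round(np.abs(w - true_w).max(), 4))    # error should shrink toward 0
```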

Page 7: Allreduce for Parallel Learning

Terascale Linear Learning ACDL11

Given 2.1 terafeatures of data, how can you learn a good linear predictor f_w(x) = ∑_i w_i x_i?

17B examples
16M parameters
1K nodes
How long does it take?

70 minutes = 500M features/second: faster than the I/O bandwidth of a single machine ⇒ faster than all possible single-machine linear learning algorithms.
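The throughput figure is just arithmetic on the numbers above; a quick check:

```python
features = 2.1e12           # 2.1 terafeatures
seconds = 70 * 60           # 70 minutes
print(features / seconds)   # 500000000.0, i.e. 500M features/second
```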

Page 10: Allreduce for Parallel Learning

MPI-style AllReduce

[Figure: allreduce initial state — a binary tree of 7 nodes holding the values 1 through 7]

AllReduce = Reduce + Broadcast (a minimal simulation follows the properties list).

Properties:

1 Easily pipelined, so no latency concerns.

2 Bandwidth ≤ 6n.

3 No need to rewrite code!
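The simulation below mirrors the figures: 7 nodes on a binary tree hold the values 1 through 7, partial sums flow up to the root, and the total (28) is broadcast back down. The tree layout is the one pictured; the code itself is only an illustration, not the actual network implementation.

```python
# Single-process simulation of tree allreduce (sum) on the 7-node binary tree
# from the figures: reduce partial sums up to the root, then broadcast the
# total back down. Node 0 is the root; node k's parent is (k - 1) // 2.
def allreduce_sum(values):
    n = len(values)
    totals = list(values)

    # Reduce phase: process nodes bottom-up so each subtotal is complete
    # before it is added into its parent.
    for node in range(n - 1, 0, -1):
        totals[(node - 1) // 2] += totals[node]

    # Broadcast phase: the root's total flows back down, parent before child.
    for node in range(1, n):
        totals[node] = totals[(node - 1) // 2]

    return totals

print(allreduce_sum([1, 2, 3, 4, 5, 6, 7]))    # -> [28, 28, 28, 28, 28, 28, 28]
```

Each node sends its partial sum up once and receives the total once, which is why the pattern pipelines well over long vectors.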

Page 12: Allreduce for Parallel Learning

MPI-style AllReduce

[Figure: create a binary tree over the 7 nodes]


Page 13: Allreduce for Parallel Learning

MPI-style AllReduce

[Figure: reducing, step 1 — leaf values are summed into their parents]


Page 14: Allreduce for Parallel Learning

MPI-style AllReduce

[Figure: reducing, step 2 — the root now holds the total, 28]


Page 15: Allreduce for Parallel Learning

MPI-style AllReduce

[Figure: broadcast, step 1 — the total 28 is passed back down the tree]


Page 16: Allreduce for Parallel Learning

MPI-style AllReduce

[Figure: allreduce final state — every node holds the sum, 28]


Page 18: Allreduce for Parallel Learning

An Example Algorithm: Weight averaging

n = AllReduce(1)
While (pass number < max)

1 While (examples left)
  1 Do online update.

2 AllReduce(weights)

3 For each weight w ← w/n

A runnable sketch of this loop follows.
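The sketch simulates the nodes in one process: AllReduce becomes an in-memory sum, and the online update is an illustrative squared-loss SGD step on synthetic data (true_w, shard sizes, and the learning rate are all made up). Only the structure — local online pass, then average the weights — follows the pseudocode above.

```python
# Simulated weight averaging: each "node" runs an online pass over its own
# shard, then the weights are averaged (AllReduce(weights) followed by w/n).
import numpy as np

def online_pass(w, shard, lr=0.1):
    w = w.copy()
    for x, y in shard:
        w -= lr * (w @ x - y) * x          # online update (squared loss)
    return w

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])
shards = []
for _ in range(4):                         # 4 simulated nodes
    X = rng.normal(size=(100, 3))
    shards.append(list(zip(X, X @ true_w)))

n = len(shards)                            # n = AllReduce(1)
w = np.zeros(3)
for _ in range(5):                         # while pass number < max
    w = sum(online_pass(w, s) for s in shards) / n   # AllReduce(weights), w <- w/n
print(np.round(w, 3))                      # should end up close to [1, -2, 0.5]
```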

Other algorithms implemented:

1 Nonuniform averaging for online learning

2 Conjugate Gradient

3 L-BFGS

Page 20: Allreduce for Parallel Learning

Hadoop AllReduce

1 “Map” job moves program to data.

2 Delayed initialization: Most failures are disk failures. First read (and cache) all data before initializing allreduce. Failures autorestart on a different node with identical data.

3 Speculative execution: In a busy cluster, one node is often slow. Hadoop can speculatively start additional mappers. We use whichever mapper finishes reading all the data first (a toy simulation of this appears below).

The net effect: Reliable execution out to perhaps 10K node-hours.
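A toy simulation of the speculative-execution trick: several mapper attempts race to read the same shard, and only the first finisher joins the allreduce job. The shard names and timings are invented stand-ins for "one node is often slow"; Hadoop's actual scheduling is far more involved.

```python
# Toy speculative execution: Hadoop may launch several mappers per data shard;
# the first one to finish reading is used, the duplicates are simply ignored.
import random

random.seed(0)
shards = ["shard-0", "shard-1", "shard-2"]
attempts = [(shard, copy, random.uniform(1, 10))   # (shard, attempt id, read time)
            for shard in shards for copy in range(2)]

joined = {}                                        # shard -> winning attempt
for shard, copy, t in sorted(attempts, key=lambda a: a[2]):
    joined.setdefault(shard, (copy, t))            # first finisher wins

for shard, (copy, t) in joined.items():
    print(f"{shard}: attempt {copy} joins allreduce after {t:.1f}s")
```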

Page 24: Allreduce for Parallel Learning

Robustness & Speedup

[Figure: “Speed per method” — speedup (0–10) vs. number of nodes (10–100); curves for Average_10, Min_10, Max_10, and a linear reference]

Page 25: Allreduce for Parallel Learning

Splice Site Recognition

[Figure: auPRC vs. iteration for Online, L-BFGS w/ 5 online passes, L-BFGS w/ 1 online pass, and L-BFGS; a second panel zooms in on auPRC 0.466–0.484]

Page 26: Allreduce for Parallel Learning

Splice Site Recognition

[Figure: auPRC vs. effective number of passes over the data for L-BFGS w/ one online pass, Zinkevich et al., and Dekel et al.]

Page 27: Allreduce for Parallel Learning

What about parallel deep learning?

Needs to work with a GPU.

Page 28: Allreduce for Parallel Learning

GPU Allreduce optimization 1: Minibatch gradient descent

GPUs have much more computation than communication.

Give every GPU n examples and compute the average gradient on them, then synchronize the gradient.
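A single-process sketch of this synchronous pattern, with the per-GPU minibatch gradients and the allreduce both simulated in memory; the least-squares model, sizes, and learning rate are made up for illustration.

```python
# Simulated synchronous data-parallel SGD: every "GPU" computes the average
# gradient on its own n examples; the gradients are then averaged (the
# allreduce, done here as an in-memory sum) and every replica applies the
# same update.
import numpy as np

rng = np.random.default_rng(0)
true_w = rng.normal(size=5)
w = np.zeros(5)
n_gpus, n = 4, 32                          # n examples per GPU per step

for step in range(300):
    grads = []
    for _ in range(n_gpus):                # each GPU: gradient on its minibatch
        X = rng.normal(size=(n, 5))
        y = X @ true_w
        grads.append(X.T @ (X @ w - y) / n)
    g = sum(grads) / n_gpus                # allreduce(sum) followed by /n_gpus
    w -= 0.05 * g                          # identical update on every replica
print(np.round(np.abs(w - true_w).max(), 4))   # should be close to 0
```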

Page 30: Allreduce for Parallel Learning

GPU Allreduce optimization 1: Half-Batch

Do communication asynchronously with the computation.

One minibatch communicates while the other computes on older parameters.
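One way to see the effect is to model only the staleness: the update applied at step t uses the gradient computed against the parameters from step t−1, standing in for "one minibatch communicates while the other computes". A toy sketch, not an asynchronous implementation:

```python
# Toy model of the half-batch overlap: the gradient applied at step t was
# computed at step t-1 against parameters that are one step stale (it was
# "in flight" while the next minibatch was being processed).
import numpy as np

rng = np.random.default_rng(0)
true_w = rng.normal(size=5)
w = np.zeros(5)
in_flight = None                           # gradient currently being synchronized

for step in range(400):
    X = rng.normal(size=(32, 5))
    y = X @ true_w
    g = X.T @ (X @ w - y) / 32             # computed on the current (stale-by-one) w
    if in_flight is not None:
        w -= 0.05 * in_flight              # the previous gradient just "arrived"
    in_flight = g                          # this one goes out for synchronization
print(np.round(np.abs(w - true_w).max(), 4))   # stale gradients still converge here
```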

Page 32: Allreduce for Parallel Learning

GPU Allreduce optimization 1: 1-bit SGD

Discretize the gradient to 1 bit before communicating it off the GPU.

Keep and accumulate the discretization error on the GPU.
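A sketch of the quantize-with-error-feedback step, assuming a simple sign-plus-scale encoding (the cited 1-bit SGD paper's exact encoding differs in details): send only sign(gradient + carried error) times a per-vector scale, and keep whatever was lost for the next step.

```python
# 1-bit quantization with error feedback (sign + per-vector scale): send only
# the sign of (gradient + carried error); keep the quantization residual on
# the GPU and fold it into the next gradient.
import numpy as np

def quantize_1bit(grad, residual):
    g = grad + residual                    # add back what was not sent last time
    scale = np.mean(np.abs(g))             # one float shipped alongside the bits
    q = scale * np.sign(g)                 # what actually leaves the GPU
    return q, g - q                        # quantized gradient, new residual

rng = np.random.default_rng(0)
residual = np.zeros(8)
for step in range(3):
    grad = rng.normal(size=8)
    q, residual = quantize_1bit(grad, residual)
    print(step, np.round(q, 2), "residual norm", round(float(np.linalg.norm(residual)), 2))
```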

Page 34: Allreduce for Parallel Learning

GPU Allreduce Optimization 2: Ring Allreduce

Every node masters a subset of the parameters, and messages travel in a ring (a toy simulation follows the list below).

1 Downside: latency increase?

2 Upside: perfectly efficient synchronization
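A single-process simulation of the ring pattern: each of p nodes masters one of p chunks of the vector; a reduce-scatter phase circulates partial sums for p−1 steps, then an allgather phase circulates the finished chunks for another p−1 steps. This is an illustration of the algorithm in the Patarasuk–Yuan reference, not any particular library's code.

```python
# Ring allreduce (sum) simulated in one process: p nodes, vector split into p
# chunks. Phase 1 (reduce-scatter): after p-1 steps node i holds the full sum
# of chunk (i+1) mod p. Phase 2 (allgather): the completed chunks circulate
# for another p-1 steps so every node ends with the whole summed vector.
import numpy as np

def ring_allreduce(node_vectors):
    p = len(node_vectors)
    chunks = [np.array_split(v.astype(float), p) for v in node_vectors]

    # Reduce-scatter: at step s, node i sends chunk (i - s) mod p to node i+1.
    for s in range(p - 1):
        sent = [chunks[i][(i - s) % p].copy() for i in range(p)]
        for i in range(p):
            j = (i - 1) % p                      # receive from the left neighbour
            chunks[i][(j - s) % p] += sent[j]

    # Allgather: circulate the completed chunks, overwriting stale copies.
    for s in range(p - 1):
        sent = [chunks[i][(i + 1 - s) % p].copy() for i in range(p)]
        for i in range(p):
            j = (i - 1) % p
            chunks[i][(j + 1 - s) % p] = sent[j]

    return [np.concatenate(c) for c in chunks]

vs = [np.arange(8) + 10 * k for k in range(4)]   # 4 nodes, 8-dim vectors
out = ring_allreduce(vs)
print(out[0], all(np.array_equal(out[0], o) for o in out))  # every node: the sum
```

Each node sends and receives roughly 2(p−1)/p of the vector in total, which is why the synchronization stays bandwidth-efficient as p grows; the cost is the 2(p−1) ring steps of latency noted as the downside above.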

Page 35: Allreduce for Parallel Learning

Bibliography

L-BFGS J. Nocedal, Updating Quasi-Newton Matrices with Limited Storage, Mathematics of Computation 35:773–782, 1980.

grad sum C. Teo, Q. Le, A. Smola, S.V.N. Vishwanathan, A Scalable Modular Convex Solver for Regularized Risk Minimization, KDD 2007.

Rings 1 Pitch Patarasuk and Xin Yuan, Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations, JPDC 2009.

avg. G. Mann et al., Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models, NIPS 2009.

ov. avg M. Zinkevich, M. Weimer, A. Smola, and L. Li, Parallelized Stochastic Gradient Descent, NIPS 2010.

P. online D. Hsu, N. Karampatziakis, J. Langford, and A. Smola, Parallel Online Learning, in SUML 2010.

D. Mini O. Dekel, R. Gilad-Bachrach, O. Shamir, and L. Xiao, Optimal Distributed Online Prediction Using Mini-Batches, JMLR 2012.

Page 36: Allreduce for Parallel Learning

Bibliography II

Tera Alekh Agarwal, Olivier Chapelle, Miroslav Dudik, John Langford, A Reliable Effective Terascale Linear Learning System, arXiv 2011 / JMLR 2014.

DistBelief Dean et al., Large Scale Distributed Deep Networks, NIPS 2012.

One-bit Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu, 1-Bit Stochastic Gradient Descent and its Application to Data-Parallel Distributed Training of Speech DNNs, Interspeech 2014.

Rings 2 Amodei et al., Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, ICML 2016.