Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis
Tal Ben-Nun and Torsten Hoefler, ETH Zurich
Presenter: Derrick Blakely, https://qdata.github.io/deep2Read
March 6, 2019
Outline
1 Background
2 Parallel Computing and Communication
3 Neural Network Concurrency
4 Parameter Sharing and Consistency
5 Frameworks
6 Conclusion
Background
Outline
1 Background
2 Parallel Computing and Communication
3 Neural Network Concurrency
4 Parameter Sharing and Consistency
5 Frameworks
6 Conclusion
Background
Deep Learning is Using More Nodes
Background
More GPUs, More MPI
Parallel Computing and Communication
Outline
1 Background
2 Parallel Computing and Communication
3 Neural Network Concurrency
4 Parameter Sharing and Consistency
5 Frameworks
6 Conclusion
Parallel Computing and Communication
Parallel Computing Basics: Computation Graphs
Critical path: the longest chain of dependent operations; it determines the earliest possible completion time
Depth D: time needed to execute the critical path
T1 = W: sequential execution time equals the total work
T∞ = D: execution time with unlimited processors
Average parallelism: T1/T∞ = W/D
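As a quick illustration (presenter's sketch, not from the paper), these metrics can be computed directly on a toy computation DAG; networkx is assumed available:

```python
# Presenter's sketch: work W, depth D, and average parallelism W/D for a toy
# computation DAG, assuming unit cost per node.
import networkx as nx

# Diamond-shaped dependency graph: b and c can run in parallel.
g = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])

W = g.number_of_nodes()                  # total work T1 (unit-cost nodes)
D = nx.dag_longest_path_length(g) + 1    # depth T_inf: nodes on the critical path
print(f"W = {W}, D = {D}, average parallelism = {W / D:.2f}")  # 4, 3, 1.33
```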
Parallel Computing and Communication
Parallel Computing Basics: Granularity
Granularity: G ≈ Tcomp/Tcomm
Alternatively:
G = min_{n∈V} w(n) / max_{e∈E} c(e), where w(n) is the compute cost of node n and c(e) the communication cost of edge e
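A minimal sketch of this graph-based definition, with hypothetical node weights w(n) and edge costs c(e):

```python
# Hypothetical per-node compute weights w(n) and per-edge communication
# costs c(e) for the diamond graph above.
w = {"a": 8.0, "b": 4.0, "c": 4.0, "d": 8.0}
c = {("a", "b"): 1.0, ("a", "c"): 2.0, ("b", "d"): 1.0, ("c", "d"): 2.0}

G = min(w.values()) / max(c.values())
print(f"granularity G = {G}")  # G >> 1: compute-bound; G << 1: communication-bound
```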
Parallel Computing and Communication
Goals
Maximize parallelism
Minimize communication and load imbalance
Tradeoff: higher parallelism generally requires more communication
Tradeoff: balancing load evenly can also increase communication
High parallelism and high load balance are often compatible with each other
Parallel Computing and Communication
Communication
Want to perform the following AllReduce:
y = x1 ⊕ x2 ⊕ ... ⊕ xP−1 ⊕ xP
P: number of processing elements
xi: length-m vector of data items stored on each processing element
γ: size of a data item (bytes)
G: computation cost per byte
L: network latency
Parallel Computing and Communication
Naive: Linear-Depth Reduction
T = γmG (P − 1) + L(P − 1)
Average parallelism: T1/T∞ = W/D = (P − 1)/(P − 1) = 1
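A small simulation (numpy stands in for real message passing) shows why this schedule has no parallelism: each of the P − 1 combine steps depends on the previous one:

```python
# Simulated linear-depth reduction: each step folds one more processor's
# vector into the running sum, so work and depth are both P - 1 (W/D = 1).
import numpy as np

P, m = 4, 8
x = [np.random.rand(m) for _ in range(P)]  # one length-m vector per processor

y = x[0].copy()
for i in range(1, P):   # P - 1 strictly sequential combine steps
    y += x[i]           # each costs γmG compute plus L latency

assert np.allclose(y, np.sum(x, axis=0))
```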
Parallel Computing and Communication
Tree and Butterfly AllReduce
Tree: T = 2γmG log P + 2L log P
Butterfly: T = γmG log P + L log P
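A simulation of the butterfly (recursive-doubling) variant, assuming P is a power of two; real implementations exchange messages between ranks, here the exchanges of one round are applied in lockstep:

```python
# Simulated butterfly AllReduce: in round k, rank r exchanges its partial sum
# with rank r XOR 2^k, so every rank holds the full sum after log2(P) rounds.
import numpy as np

P, m = 8, 4
data = [np.random.rand(m) for _ in range(P)]
partial = [d.copy() for d in data]

k = 1
while k < P:                      # log2(P) rounds
    # all pairwise exchanges of one round happen concurrently
    partial = [partial[r] + partial[r ^ k] for r in range(P)]
    k *= 2

for r in range(P):                # every rank now holds the full sum
    assert np.allclose(partial[r], np.sum(data, axis=0))
```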
Parallel Computing and Communication
Ring (or Pipeline) AllReduce
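A simulation of the ring algorithm (presenter's sketch; assumes P divides the vector length m): a reduce-scatter phase of P − 1 steps leaves each rank with one fully reduced chunk, then an all-gather phase of P − 1 steps circulates the results, each step moving only m/P items per rank:

```python
# Simulated ring AllReduce: reduce-scatter then all-gather, each in P-1 steps.
import numpy as np

P, m = 4, 8                                   # assumes P divides m
data = [np.random.rand(m) for _ in range(P)]
chunks = [np.split(d.copy(), P) for d in data]

# Reduce-scatter: in step s, rank r sends chunk (r - s) to rank r + 1 and
# accumulates the chunk it receives from rank r - 1.
for s in range(P - 1):
    sent = [chunks[r][(r - s) % P].copy() for r in range(P)]
    for r in range(P):
        chunks[r][(r - 1 - s) % P] += sent[(r - 1) % P]

# All-gather: circulate each rank's fully reduced chunk around the ring.
for s in range(P - 1):
    sent = [chunks[r][(r + 1 - s) % P].copy() for r in range(P)]
    for r in range(P):
        chunks[r][(r - s) % P] = sent[(r - 1) % P]

total = np.sum(data, axis=0)
for r in range(P):
    assert np.allclose(np.concatenate(chunks[r]), total)
```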
Parallel Computing and Communication
AllReduce
Ring: T = 2γmG(P − 1)/P + 2L(P − 1)
Reduce-Scatter-Gather: T = 2γmG(P − 1)/P + 2L log P
Lower bound: T ≥ 2γmG(P − 1)/P + L log P
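Plugging illustrative (hypothetical) numbers into these models shows the tradeoff: the bandwidth terms match, but the ring pays latency linear in P:

```python
from math import log2

# Hypothetical values: 100 MB of fp32 gradients, G = 1 ns/byte compute cost,
# L = 1 µs latency, P = 64 processors.
gamma, m, G, L, P = 4, 25_000_000, 1e-9, 1e-6, 64

band = 2 * gamma * m * G * (P - 1) / P          # shared bandwidth term
ring = band + 2 * L * (P - 1)
rsg  = band + 2 * L * log2(P)
lb   = band + L * log2(P)
print(f"ring {ring:.4f}s  reduce-scatter-gather {rsg:.4f}s  lower bound {lb:.4f}s")
```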
Neural Network Concurrency
Outline
1 Background
2 Parallel Computing and Communication
3 Neural Network Concurrency
4 Parameter Sharing and Consistency
5 Frameworks
6 Conclusion
Neural Network Concurrency
Within-Layer Concurrency
Neural Network Concurrency
Data Parallelism
Simple and efficient
Must replicate the full model on every worker, which can exhaust GPU memory
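A minimal data-parallel sketch using PyTorch's DistributedDataParallel (model, data, and hyperparameters are placeholders; assumes one process per GPU, launched with MASTER_ADDR/MASTER_PORT set, e.g. via torchrun or mp.spawn):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(1000, 10).cuda(rank)   # full replica on every GPU
    ddp = DDP(model, device_ids=[rank])            # gradients are AllReduced
    opt = torch.optim.SGD(ddp.parameters(), lr=0.1)

    for _ in range(100):
        x = torch.randn(32, 1000, device=rank)     # each rank gets its own shard
        loss = ddp(x).sum()
        opt.zero_grad()
        loss.backward()                            # AllReduce overlaps with backprop
        opt.step()
```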
Neural Network Concurrency
Model Parallelism
Conserves GPU memory
Can work well with multiple GPUs on the same system
Every worker must receive the full minibatch
Back-prop requires all-to-all communication
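A minimal model-parallel sketch (assumes two GPUs): the layers are split across devices, and only activations cross the boundary:

```python
import torch

class TwoGPUNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.part1 = torch.nn.Linear(1000, 500).to("cuda:0")
        self.part2 = torch.nn.Linear(500, 10).to("cuda:1")

    def forward(self, x):
        h = torch.relu(self.part1(x.to("cuda:0")))
        return self.part2(h.to("cuda:1"))   # activation tensor moves between GPUs

model = TwoGPUNet()
out = model(torch.randn(32, 1000))          # every worker sees the whole minibatch
```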
Neural Network Concurrency
Layer-by-layer Concurrency: Pipelining
Avoids out-of-memory errors
Sparse communication: each GPU communicates only with the next GPU in the pipeline
Computation across stages must overlap to keep all GPUs busy (see the sketch below)
Latency grows linearly with the number of processors
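A simplified GPipe-style forward pass (presenter's sketch; a real pipeline also interleaves backward passes): the minibatch is cut into micro-batches so stage 0 can start on micro-batch i+1 while stage 1 processes micro-batch i:

```python
import torch

# Two pipeline stages on separate GPUs (assumes two GPUs are available).
stage0 = torch.nn.Linear(1000, 500).to("cuda:0")
stage1 = torch.nn.Linear(500, 10).to("cuda:1")

def pipelined_forward(x, num_micro=4):
    outputs = []
    for micro in x.chunk(num_micro):            # split the minibatch
        h = torch.relu(stage0(micro.to("cuda:0")))
        outputs.append(stage1(h.to("cuda:1")))  # CUDA kernels are launched
    return torch.cat(outputs)                   # asynchronously, so the two
                                                # devices can overlap work
out = pipelined_forward(torch.randn(32, 1000))
```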
Neural Network Concurrency
Graph Parallelism
Parameter Sharing and Consistency
Outline
1 Background
2 Parallel Computing and Communication
3 Neural Network Concurrency
4 Parameter Sharing and Consistency
5 Frameworks
6 Conclusion
Parameter Sharing and Consistency
Centralized Parameter Sharing: Parameter Server
T = 2PγmG/s + 2L (s: number of parameter server shards)
Ensures consistency
“Stragglers” cause poor utilization
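A framework-agnostic sketch of one synchronous round (numpy stands in for real push/pull messaging); the implicit barrier is where stragglers hurt:

```python
import numpy as np

def server_round(weights, worker_grads, lr=0.1):
    # Barrier semantics: we only get here once *all* P workers have pushed,
    # so one straggler stalls the entire round.
    avg_grad = np.mean(worker_grads, axis=0)
    return weights - lr * avg_grad     # new weights broadcast to all workers

weights = np.zeros(10)
grads = [np.random.rand(10) for _ in range(4)]   # stand-in pushes from 4 workers
weights = server_round(weights, grads)
```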
Parameter Sharing and Consistency
Asynchronous Parameter Server
Better utilization, faster training
Slow agents push gradients computed on stale parameters, which can cause divergence
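A Hogwild-style sketch with threads (illustrative only; CPython's GIL serializes the numpy calls, but the unsynchronized read-modify-write race it models is real):

```python
import threading
import numpy as np

weights = np.zeros(10)        # shared parameters, updated without any lock

def worker(w, steps=100, lr=0.01):
    for _ in range(steps):
        grad = np.random.rand(10)   # stand-in for a real gradient computation
        w -= lr * grad              # racy in-place update on the shared array

threads = [threading.Thread(target=worker, args=(weights,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
```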
Parameter Sharing and Consistency
Stale Synchronous Parameter Server
Statistical performance vs hardware performance tradeoff
Having a parameter server at all can bottleneck training
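The stale-synchronous condition itself is small enough to state directly (presenter's sketch): a worker may advance only while it is at most s clocks ahead of the slowest worker:

```python
def may_proceed(worker_clock, all_clocks, staleness):
    # staleness = 0 degenerates to fully synchronous (BSP) training;
    # a large bound approaches fully asynchronous training.
    return worker_clock - min(all_clocks) <= staleness

assert may_proceed(5, [3, 4, 5], staleness=2)
assert not may_proceed(6, [3, 4, 6], staleness=2)
```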
Parameter Sharing and Consistency
Decentralized Parameter Sharing
T = 2γmG (P − 1)/P + 2L logP
MPI or NCCL provide well-tuned AllReduce implementations out of the box
Avoids parameter server bottleneck
Still suffers from the “straggler” problem
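A minimal decentralized training step in PyTorch: no server; every worker averages gradients with an explicit all_reduce (assumes dist.init_process_group was already called):

```python
import torch.distributed as dist

def allreduce_sgd_step(model, opt, loss):
    opt.zero_grad()
    loss.backward()
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # backend picks ring/tree
            p.grad /= world                                # average across workers
    opt.step()
```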
Parameter Sharing and Consistency
Stale Synchronous Decentralized Parameter Sharing
Parameter Sharing and Consistency
Asynchronous Decentralized Parameter Sharing
Frameworks
Outline
1 Background
2 Parallel Computing and Communication
3 Neural Network Concurrency
4 Parameter Sharing and Consistency
5 Frameworks
6 Conclusion
Frameworks
Neural Net Frameworks
TensorFlow: allows Parameter Server and AllReduce (MPI, NCCL, TCP/IP)
PyTorch: AllReduce with MPI, NCCL, or gloo
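In PyTorch, backend selection is a one-liner (rank and world size are read from environment variables set by a launcher such as torchrun):

```python
import torch.distributed as dist

# "nccl" for GPU clusters, "gloo" for CPU, "mpi" if PyTorch was built with MPI.
dist.init_process_group(backend="nccl")
print(dist.get_rank(), dist.get_world_size())
```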
Frameworks
Why not use Hadoop or Spark?
Don’t natively support GPU runtimes
Require the JVM, which adds overhead
Designed for fault-tolerance, not speed
Only support data-parallelism
NNs have cyclic computation graphs (must revisit working sets)
Must be synchronous
TensorFlow and PyTorch are better optimized for deep learning
Conclusion
Outline
1 Background
2 Parallel Computing and Communication
3 Neural Network Concurrency
4 Parameter Sharing and Consistency
5 Frameworks
6 Conclusion
Conclusion
Ongoing Distributed ML Challenges
1 Optimization and theory
2 Make algorithms more scalable
3 Better software for distributed ML
4 Develop distributed ML systems:
Consistency
Fault tolerance
Communication latency
Resource management
Conclusion
Up-and-Coming ML Topics
Impact of compression and quantization?
Challenges unique to networks with dynamic control flow?
Best ways to implement and exploit graph parallelism?
Hierarchical tasks?
Automated architecture searches?
Conclusion
References
Ben-Nun, Tal, and Torsten Hoefler. "Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis." arXiv preprint arXiv:1802.09941 (2018).