Frameworks for Distributed Machine Learning
Presented by John Alsop, Yasser Shalabi, and Oliver Melvin
img source: https://www.docuvo.com/whatwedo.html
We present two ML frameworks targeting:
● Deep Learning: TensorFlow
● Graph Processing: Maiter / Faiter
Common themes...
Distributed ML Frameworks: Common Themes
Trend towards BIG DATA: big models, big graphs, and lots of compute needed
=> Any framework must be Scalable
=> Any framework must be Tolerant to Failure
Distributed ML Frameworks: Common Themes
Trend towards Heterogeneity: GPUs, FPGAs, and other accelerators are increasingly used to accelerate operations
=> Any framework must be Portable
=> Any framework must be Tolerant to Load Imbalance
TensorFlow: A System for Large-Scale Machine Learning
Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng, Google Brain
Introduction
● Availability of Data
● Software Platforms
● ML Techniques
Large Scale Training
● Large data sets are available
● Compute resources are available
○ Warehouse Scale Computers
○ GPGPUs
○ ASIC Accelerators
● Larger models and more complex techniques perform better
○ But they need more data and more time to converge
● So how can we scale training?
Background: NN Training
Process:
1. Take input image
2. Compute loss function (forward pass)
3. Compute error gradients (backward pass)
4. Update weights
5. Repeat
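To make the loop concrete, here is a minimal TF 1.x-style sketch of this process; the softmax model, shapes, and random batches are hypothetical and not from the paper:

```python
import numpy as np
import tensorflow as tf  # assumes the TensorFlow 1.x graph API

x = tf.placeholder(tf.float32, [None, 784])   # input image batch
y = tf.placeholder(tf.float32, [None, 10])    # one-hot labels
W = tf.Variable(tf.zeros([784, 10]))          # model weights (mutable state)
b = tf.Variable(tf.zeros([10]))

logits = tf.matmul(x, W) + b                  # forward pass
loss = tf.reduce_mean(                        # loss function
    tf.nn.softmax_cross_entropy_with_logits_v2(labels=y, logits=logits))
# minimize() adds the gradient ops (backward pass) and the weight update
# to the same dataflow graph.
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

batches = [(np.random.rand(32, 784).astype("float32"),
            np.eye(10, dtype="float32")[np.random.randint(10, size=32)])
           for _ in range(100)]               # hypothetical random batches

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for batch_x, batch_y in batches:          # repeat over input batches
        sess.run(train_step, {x: batch_x, y: batch_y})
```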
Background: DistBelief, parameter-server architecture
1. Asynchronous SGD
2. Distributed kernels
3. 30x larger DNNs
4. State-of-the-art performance
Background: Shortcomings of DistBelief
1. Difficulty of implementing new layers
a. C++ classes implement layers
b. A configuration file defines the DNN architecture
c. Not flexible enough for researchers
2. Refining algorithms
a. SGD is the heart of training -- it is finalized in the parameter server
b. Some techniques need atomicity -- the get/put interface cannot accommodate them
3. Supporting new algorithms
a. If it doesn't conform to feed-forward, it doesn't map well onto DistBelief
b. e.g. EM, RF, RL, adversarial ML
4. Scaling down to other environments
a. Designed to run on distributed clusters of multi-cores
b. Augmented with GPGPU support for convolutional NNs
TensorFlow: Solution Strategy
1. Execution flexibility via a dataflow abstraction
a. Makes it easy to extract the parallelism
2. Provides DFGs of primitive operators
a. Softmax, convolution, matrix multiplication, ...
b. Makes it easy to experiment with novel layers
c. Automatic gradient calculation
3. Deferred execution
a. Offload the larger chunks where possible...
4. Common abstraction for accelerators
a. Easy to integrate new accelerators into the fold
b. Operators are specialized for different devices
5. Common data primitive: the Tensor
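A tiny sketch of points 1-3: graph construction is deferred, and gradient ops are derived automatically from the primitive-operator graph (TF 1.x API; the toy expression is ours, not the paper's):

```python
import tensorflow as tf

# Build a dataflow graph; nothing executes yet (deferred execution).
x = tf.constant(3.0)
w = tf.Variable(2.0)
y = w * x + 1.0                    # primitive ops become graph vertices

# TensorFlow traverses the graph to add gradient ops automatically.
[dy_dw] = tf.gradients(y, [w])

with tf.Session() as sess:         # execution happens only inside run()
    sess.run(tf.global_variables_initializer())
    print(sess.run(dy_dw))         # prints 3.0
```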
Execution Model
Computation is a DFG.
● A single DFG represents all computation and state for an ML algorithm
○ Input preprocessing, mathematical operators, parameters, parameter update rules
○ Communication is explicit, simplifying scheduling and partitioning
● Differences from existing dataflow systems:
○ Concurrent execution on overlapping subgraphs is supported
○ Individual vertices contain sharable, mutable state
Mutable state is critical when training large models:
Compute + Mutable State = PS++
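A minimal illustration of sharable, mutable state living in graph vertices, the property summarized above as "Compute + Mutable State = PS++" (TF 1.x; the counter is a toy of ours):

```python
import tensorflow as tf

w = tf.Variable(0.0)             # a vertex holding mutable state
inc = tf.assign_add(w, 1.0)      # an op that mutates that state in place

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(3):
        sess.run(inc)            # concurrent workers could run this subgraph
    print(sess.run(w))           # prints 3.0
```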
Distributed Execution
● Communication between subgraphs is explicit...
● TensorFlow handles the glue
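A graph-construction-only sketch of explicit placement; TensorFlow inserts the cross-device communication for you. The /job:ps and /job:worker device names assume a cluster spec that is not shown here:

```python
import tensorflow as tf

with tf.device("/job:ps/task:0"):          # parameter state lives on a PS task
    w = tf.Variable(tf.zeros([10]))

with tf.device("/job:worker/task:0"):      # compute runs on a worker task
    grad = tf.ones([10]) * 0.1             # hypothetical gradient
    update = tf.assign_sub(w, grad)        # TensorFlow adds the communication
                                           # "glue" between the two devices
```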
Fault Tolerance
● Models take days or many hours to train -- fault tolerance is key
● RDD-style recovery is overkill
○ Strong consistency is not needed for ML
● User-level checkpointing operations (SAVE/RESTORE) and a client library for configuration
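A minimal sketch of the user-level SAVE/RESTORE pattern via the TF 1.x tf.train.Saver; the variable and checkpoint path are hypothetical:

```python
import tensorflow as tf

w = tf.Variable(tf.zeros([784, 10]), name="w")
saver = tf.train.Saver()                     # adds save/restore ops

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training steps ...
    saver.save(sess, "/tmp/model.ckpt")      # user-level SAVE

with tf.Session() as sess:
    saver.restore(sess, "/tmp/model.ckpt")   # user-level RESTORE after failure
```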
Synchronous!
Time-to-Quality vs. Time-per-Update
● Recall the earlier algorithm
○ The aggregate update is delayed by stragglers
● SGD was found to be robust to asynchrony
○ Asynchronous = better utilization...
○ But with GPGPUs...
● TensorFlow can handle asynchronous or synchronous updates
○ Also, synchronous + backup workers
○ Idea: with N workers plus K backup workers, simply take the updates from the first N workers that complete
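A toy, single-process sketch of the backup-worker idea; threads stand in for distributed worker tasks, and the simulated gradients and timings are made up:

```python
import concurrent.futures
import random
import time

N, K = 4, 2   # need N updates per step; K backups absorb stragglers

def run_worker(step):
    """Hypothetical worker: computes one gradient for the given step."""
    time.sleep(random.uniform(0.01, 0.1))  # simulated variable compute time
    return random.gauss(0.0, 1.0)          # stand-in for a gradient value

with concurrent.futures.ThreadPoolExecutor(max_workers=N + K) as pool:
    futures = [pool.submit(run_worker, 0) for _ in range(N + K)]
    done = []
    for f in concurrent.futures.as_completed(futures):
        done.append(f.result())
        if len(done) == N:                 # the first N to finish win
            break
    update = sum(done) / N                 # synchronous aggregated update
```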
Highlights from Results
Conclusion
● We didn't cover everything
○ Dynamic control flow
○ Extensibility studies
● Key lesson: a comprehensive look at the problem is necessary to design a good solution
○ Asynchrony was okay, but throughput-oriented GPGPUs made synchronous better...
○ RDD-level fault tolerance was not necessary
○ Heterogeneity is built into the design
● Focusing on the user
○ Design/experiment, train, and deploy
A Fault-Tolerant Framework for Asynchronous Iterative Computations in Cloud Environments
Zhigang Wang, Lixin Gao, Yu Gu, Yubin Bao, and Ge Yu
Iterative Computations = Graph Algorithms
Graph as input → iteratively extract meaning → output updated graph
Examples:
● PageRank
● SSSP
● Adsorption
● Sparse Jacobi linear equation solver
An iterative update function is applied repeatedly over the nodes and converges to the desired result.
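For concreteness, a toy synchronous PageRank in this mold; every node recomputes from the previous iteration's values, so all nodes advance together (the graph, damping factor, and iteration count are made up):

```python
# Synchronous iterative update: each iteration reads only the previous
# iteration's values (an implicit global barrier between iterations).
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
d = 0.85
incoming = {u: [w for w in graph if u in graph[w]] for u in graph}

v = {u: 1.0 for u in graph}                 # initial values
for _ in range(50):                         # fixed count for the sketch
    v = {u: (1 - d) + d * sum(v[w] / len(graph[w]) for w in incoming[u])
         for u in graph}
print(v)                                    # converged PageRank-style values
```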
![Page 49: Frameworks for Distributed Machine Learning · Scalable Tolerant to Failure Portable Tolerant to Load Imbal. 5. TensorFlow: A System for Large-Scale Machine Learning Martín Abadi,](https://reader030.fdocuments.us/reader030/viewer/2022040608/5ec46ed7ef4f3c5729273263/html5/thumbnails/49.jpg)
Nodes are partitioned across processors, communicate via MPI
In Cloud Environments
P3
P1
P2
A
B
D
CvA
vB
vCvA
vA
vC
vD
49
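A trivial sketch of one way nodes might be assigned to processors (hash partitioning; the paper's actual partitioner is not specified here):

```python
def partition(node_id: str, num_workers: int) -> int:
    """Assign a graph node to a worker by hashing its id."""
    return hash(node_id) % num_workers

owners = {u: partition(u, 3) for u in ["A", "B", "C", "D"]}
```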
![Page 50: Frameworks for Distributed Machine Learning · Scalable Tolerant to Failure Portable Tolerant to Load Imbal. 5. TensorFlow: A System for Large-Scale Machine Learning Martín Abadi,](https://reader030.fdocuments.us/reader030/viewer/2022040608/5ec46ed7ef4f3c5729273263/html5/thumbnails/50.jpg)
Nodes are partitioned across processors, communicate via MPI
In Cloud Environments
All processors synchronize after each iteration => scales poorly
Amount of work depends on connectivity => load imbalance
P3
P1
P2
A
B
D
C
Glo
bal B
arrie
r
50
Asynchronous: Maiter
Can we avoid the global barrier for some algorithms?
Observe that the synchronous iterative update can be rewritten in an accumulative, delta-based form,

  v_i ← v_i ⊕ Δv_i,  with deltas propagated to neighbors through g(),

and the per-node updates can be applied asynchronously if:
● ⊕ is commutative and associative
● g() is distributive over ⊕, i.e. g(x ⊕ y) = g(x) ⊕ g(y)
(Just need to define the initial v_i for all i)
Status so far: Scalable ✔, Tolerant to Load Imbal. ✔
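A single-process toy of Maiter-style delta-based iteration, instantiated for PageRank, where ⊕ is + and g(x) = d·x/out_degree (the graph and tolerance are made up):

```python
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
d, eps = 0.85, 1e-9

v = {u: 0.0 for u in graph}            # accumulated result
delta = {u: 1.0 - d for u in graph}    # initial deltas for all i

# Nodes can be processed in any order and without barriers because
# + is commutative/associative and g() distributes over +.
active = True
while active:
    active = False
    for u in graph:
        change, delta[u] = delta[u], 0.0
        if change <= eps:
            continue
        active = True
        v[u] += change                              # v_u = v_u ⊕ Δv_u
        for w in graph[u]:
            delta[w] += d * change / len(graph[u])  # accumulate g(Δv_u)

print(v)   # matches the synchronous result from the earlier sketch
```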
Maiter Implementation
May sort by priority to speed convergence
Maiter Evaluation
Hadoop: synchronous streaming framework
Maiter-Sync: Synchronous delta-based framework
Maiter-RR: Asynchronous Maiter, process state table Round-Robin
Maiter-Pri: Asynchronous Maiter, process state table based on priority
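The difference between the last two, sketched against the delta table from the earlier toy: Maiter-RR sweeps the state table in fixed order, while a Maiter-Pri-style scheduler picks the largest pending delta first (a hypothetical selector, not the paper's code):

```python
def pick_next_pri(delta, eps=1e-9):
    """Pick the node with the largest pending delta, or None if converged."""
    u, dv = max(delta.items(), key=lambda kv: kv[1])
    return u if dv > eps else None
```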
PageRank on a billion-node synthetic graph: asynchrony and priority sorting help Maiter converge faster
Near-optimal scaling
Fault-Tolerant: Faiter
What happens if some nodes fail?
Trivial solution: roll back all nodes to the last checkpoint, or to the initial state.
But checkpoints are expensive in an asynchronous system. Can we avoid rolling back all nodes?
Yes! We can roll back only the failed nodes if:
● g() is distributive over a new operator ⊖
● x ⊖ x = 0
● (x ⊕ y) ⊖ z = x ⊕ (y ⊖ z)
Also: checkpoints can be taken asynchronously.
Status: Scalable ✔, Tolerant to Failure ✔, Tolerant to Load Imbal. ✔
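For PageRank-style accumulation, ⊕ is + and ⊖ is -, and the conditions hold; a quick numeric sanity check (the values are arbitrary):

```python
x, y, z = 3, 12, 7
g = lambda v: 2 * v                   # a toy linear g(), distributive over + and -

assert x - x == 0                     # x ⊖ x = 0
assert (x + y) - z == x + (y - z)     # (x ⊕ y) ⊖ z = x ⊕ (y ⊖ z)
assert g(x - y) == g(x) - g(y)        # g() distributes over ⊖
```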
Faiter Implementation
1) The master detects node failure and broadcasts a recovery signal
2) Run a synchronous recovery iteration in which failed nodes restart from their initial values while surviving nodes keep their current values
3) Resume asynchronous operation using the new values as initial values
Faiter Evaluation
FR-Scratch: All nodes roll back to 0 on failure
FR-WORB: Failed nodes roll back to 0
FR-WAC: Failed nodes roll back to last asynchronous checkpoint
Failures injected at T1 = 0.1 of total runtime, T2 = 0.5, T3 = 0.9
Failure recovery with or without asynchronous checkpointing is helpful at higher T (with checkpointing generally better)
At low T, barrier overheads can do more harm than good
Portability
● Independent per-node computation (each update depends only on a node's own v_i and deltas) could map onto accelerators such as FPGAs and GPUs
Status: Scalable ✔, Tolerant to Failure ✔, Tolerant to Load Imbal. ✔, Portable: OK
![Page 65: Frameworks for Distributed Machine Learning · Scalable Tolerant to Failure Portable Tolerant to Load Imbal. 5. TensorFlow: A System for Large-Scale Machine Learning Martín Abadi,](https://reader030.fdocuments.us/reader030/viewer/2022040608/5ec46ed7ef4f3c5729273263/html5/thumbnails/65.jpg)
Conclusion
Maiter:
● Define when asynchrony is allowed in delta-based graph processing● Demonstrate performance and scalability benefits of asynchrony
Faiter:
● Identify weakness in asynchronous fault recovery● Define when complete synchronous rollback is unnecessary● Demonstrate performance benefits of efficient fault tolerance
65
![Page 66: Frameworks for Distributed Machine Learning · Scalable Tolerant to Failure Portable Tolerant to Load Imbal. 5. TensorFlow: A System for Large-Scale Machine Learning Martín Abadi,](https://reader030.fdocuments.us/reader030/viewer/2022040608/5ec46ed7ef4f3c5729273263/html5/thumbnails/66.jpg)
TensorFlow Discussion
● Can the data-parallel execution model be extended to other systems discussed in the course?
● How to handle reallocation of pre-empted nodes?
○ Load balancing is not discussed
● Are stronger consistency guarantees worth the overhead?
○ Mutable state speeds up large-scale model training and removes the need for a parameter server
● How to balance company-scale vs. individual-scale trade-offs?
○ Lacks some optimizations (e.g. hand-optimized kernels similar to Neon)
○ Fault tolerance is user-dependent and checkpoint-based
○ Few defaults -- requires domain knowledge to create performant systems
Device Clarification
● GPUs, CPUs, and other specialized hardware are used for training
● TPUs provide high performance/Watt for server-side inference
● Cellphone GPUs aren't incorporated into training systems; rather, they enable offline inference, e.g. offline translation
Faiter Discussion
● Single point of failure
● Fault tolerance
○ FR-WORB vs. FR-WAC
○ Is it really guaranteed to perform better than checkpointing? Should comparisons be included, perhaps with carefully placed failures in Faiter?
○ No comparison to other asynchronous checkpointing methods, e.g. the Chandy-Lamport method mentioned in GraphLab, nor lineage recovery
● Scalable to real-world-sized corporate clusters?
○ Tests were run on t2.micro instances, which provide burst performance
● Unclear why these graph algorithms & datasets were chosen for evaluation
● Can the framework be applied to non-iterative graph algorithms? Or even stream processing?
● Are the distributivity, commutativity, and associativity assumptions realistic for most desired computations?
Under what conditions are these systems not suitable for use?
Backup
Dynamic Control Flow
Image sources
Brain img source: https://www.docuvo.com/whatwedo.html
Social net source: https://www.quora.com/How-do-social-networking-sites-and-search-engines-use-the-concept-of-graph-Please-use-layman-terms
GoogLeNet source: Christian Szegedy et al., "Going Deeper with Convolutions", CVPR 2015
Notepad source: https://clipartfest.com/categories/view/bf3ad8f22b3d818eee77c01504f976009ecefebf/clipart-notepad.html
FPGA source: Adrian Caulfield et al., "A Cloud-Scale Acceleration Architecture", MICRO 2016
Blue waters source: http://www.cray.com/enabling-scientific-breakthroughs-petascale
Graphs source: http://www.cise.ufl.edu/research/sparse/matrices/
CNN source: http://parse.ele.tue.nl/education/cluster2
2.3 Related Work
1. Single-machine frameworks
a. Theano, Caffe, Torch
2. Batch dataflow frameworks
a. MapReduce, Spark
3. Parameter server architectures
a. DistBelief, Project Adam, MXNet