Transcript of "Wafer-Scale ML" (Cerebras Systems)

Wafer-Scale ML
Cerebras Systems
Largest Chip Ever Built
• 46,225 mm² of silicon
• 1.2 trillion transistors
• 400,000 AI-optimized cores
• 18 gigabytes of on-chip memory
• 9 PB/s memory bandwidth
• 100 Pb/s fabric bandwidth
Cerebras Wafer Scale Engine
The Cerebras CS-1
A full solution in a single system:
• Powered by the WSE
• Programmable with TensorFlow and other frameworks
• Installs in a standard datacenter rack
The world’s most powerful AI computer
Deep Learning Training is Hard
Size:
• Billions to trillions of ops per sample
• Millions to billions of samples per training run
• Peta- to exa-scale compute

Shape:
• Fine-grained: a lot of parallelism; presents an opportunity to accelerate
• Coarse-grained: inherently serial
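The "Size" bullets multiply into staggering totals; a back-of-the-envelope sketch with illustrative magnitudes (the talk gives ranges, not exact figures):

```python
# Illustrative magnitudes for the "Size" bullets above
# (example numbers chosen for illustration, not figures from the talk):
ops_per_sample = 1e9        # "billions of ops per sample"
samples_per_run = 1e9       # "millions to billions of samples per training"
total_ops = ops_per_sample * samples_per_run
print(f"total training compute: {total_ops:.0e} ops")  # exa-scale (1e18)
```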
The Cerebras Architecture is Optimized for DL Compute
• Core optimized for neural network primitives
• Flexible, programmable core: NN architectures are evolving
• Designed for sparse compute: workloads contain fine-grained sparsity
• Local memory: weights & activations are local with low data reuse
• Fast interconnect: layer-to-layer with high bandwidth and low latency
Flexible Cores Optimized for Tensor Operations
Key to supporting rapidly evolving NN architectures
• Fully programmable compute core
• Full array of general instructions with ML extensions
• Flexible general ops for control processing
  • e.g. arithmetic, logical, load/store, branch
• Optimized tensor ops for data processing
  • Tensors as first-class operands
  • e.g. fmac [z] = [z], [w], a (operands: [z] 3D, [z] 3D, [w] 2D, a scalar)
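The slide does not spell out the full `fmac` semantics, but a minimal per-element sketch of a fused multiply-accumulate with a scalar operand might look like this (plain nested lists stand in for the hardware's 3D/2D tensor operands):

```python
def fmac(z, w, a):
    """Per-element sketch of fmac [z] = [z], [w], a, read as z += w * a.

    The hardware instruction takes tensor operands directly; here nested
    lists stand in for tensors and the scalar a is broadcast elementwise.
    """
    for i in range(len(z)):
        for j in range(len(z[i])):
            z[i][j] += w[i][j] * a
    return z

z = [[0.0, 0.0], [0.0, 0.0]]
w = [[1.0, 2.0], [3.0, 4.0]]
fmac(z, w, 0.5)
print(z)  # [[0.5, 1.0], [1.5, 2.0]]
```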
Sparse Compute Engine for Neural Networks
NN operations like nonlinearities naturally create fine-grained sparsity
• Native, sparse processing enables higher efficiency and performance
• Dataflow scheduling in hardware
  • Triggered by data
  • Filters out sparse zero data
  • Skips unnecessary processing
• Fine-grained execution datapaths
  • Efficiently processes dynamic, non-uniform work

[Figure: a dense network becomes a sparse network after ReLU]
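A quick illustration of why ReLU creates so much fine-grained sparsity: with zero-centered activations, roughly half become exact zeros that a data-triggered core can skip. The Gaussian activations here are synthetic, purely for illustration:

```python
import random

random.seed(0)
# Zero-centered activations: ReLU zeroes out roughly half of them,
# and a core whose execution is triggered by data can skip those entirely.
acts = [random.gauss(0.0, 1.0) for _ in range(10_000)]
post_relu = [max(0.0, x) for x in acts]
zero_fraction = sum(1 for x in post_relu if x == 0.0) / len(post_relu)
print(f"{zero_fraction:.0%} of post-ReLU activations are zero")
```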
Traditional Memory Architectures not Optimized for DL
In neural networks, weights and activations are local, with low reuse
Traditional memory designs are penalized:
• Central shared memory is slow and far away
• High performance requires high data reuse (caching)
• But the fundamental operation (matrix × vector) has low data reuse
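The low-reuse claim can be made concrete with a standard roofline-style arithmetic-intensity estimate (the numbers below are illustrative, not from the talk):

```python
def matvec_intensity(n, bytes_per_elem=4):
    """FLOPs per byte moved for an n x n matrix-vector product (fp32).

    Every weight is loaded once and used in exactly one multiply-add, so
    arithmetic intensity stays near 0.5 FLOPs/byte no matter how large n
    gets: caching cannot help, and the operation stays memory-bound.
    """
    flops = 2 * n * n                               # one mul + one add per weight
    bytes_moved = bytes_per_elem * (n * n + 2 * n)  # matrix + in/out vectors
    return flops / bytes_moved

print(round(matvec_intensity(4096), 3))  # ~0.5 FLOPs/byte
```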
A Memory Architecture that is Optimized for DL
In neural networks, weights and activations are local, with low data reuse
The right answer is distributed, high-performance, on-chip memory
• All memory is fully distributed along with compute datapaths
• Datapath has full performance from memory
High-Bandwidth Low-Latency Interconnect
Low latency intra/inter-layer local connectivity with high bandwidth
• Fast and fully configurable fabric
• Small single-word messages
• All communication handled in hardware, avoiding software overhead
• 2D mesh topology effective for local communication
  • High bandwidth and low latency for local communication
  • Higher utilization and efficiency than global topologies
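Why a 2D mesh suits local traffic can be sketched with hop counts: neighbor-to-neighbor messages (the layer-to-layer pattern above) cost one hop, while uniformly random global traffic pays the mesh's mean Manhattan distance. A small illustrative calculation:

```python
def avg_random_hops(rows, cols):
    """Mean Manhattan distance between two uniformly random mesh nodes.

    Layer-to-layer NN traffic runs between adjacent cores (1 hop), while
    global all-to-all traffic pays the full mesh-average distance, which
    is why a 2D mesh rewards the local communication pattern.
    """
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    total = sum(abs(r1 - r2) + abs(c1 - c2)
                for (r1, c1) in nodes for (r2, c2) in nodes)
    return total / len(nodes) ** 2

print(f"random pairs on a 20x20 mesh: {avg_random_hops(20, 20):.1f} hops on average")
print("adjacent cores (layer-to-layer traffic): 1 hop")
```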
Achieving Radical Performance Gains
Training neural networks requires more compute than can fit on a single traditional chip
• More AI-optimized cores
• More high-speed on-chip memory
• More fabric bandwidth at low latency, connecting cores together
Build Big Chips
Big chips process data more quickly, producing answers in less time:
• Cluster-scale performance on a single chip
• Gigabytes of fast memory one clock cycle from the cores
• On-chip interconnect spanning the entire wafer
Training Neural Networks at Wafer Scale
How to use 400k cores to train a single neural network?
Training Neural Networks at Wafer Scale
• Wafer Scale Engine enables a spectrum of parallel execution strategies
• Execution strategies optimized for different training characteristics

[Figure: spectrum of strategies across two devices. One axis is data parallel (batch size): running multiple samples (Sample 1 … Sample N) at the same time. The other is model parallel: running multiple parts of the network (Model Part 1, Model Part 2) at the same time. The WSE supports both layer-pipelined and layer-sequential execution; GPUs sit at the layer-sequential end.]
Traditional: Layer-Sequential
Single GPU: Modest Data Parallel
• Run one layer at a time
• Medium batch size
  • GPU memory performance requires batching
• Result: the baseline performance of a single GPU

[Figure: single-GPU timeline per batch: Layers 1 → 2 → 3 forward, Layers 2 → 1 backward, then a weight update]
Traditional: Layer-Sequential
GPU Cluster: Extreme Data Parallel
• Run layer replicas on parallel devices
  • Low communication between devices
  • Low-performance GPU/server interconnect
• Extremely large batch size
  • Medium batch per device, across many devices
  • Convergence challenges
• Weight sync overhead
  • Synchronize before the final weight update
  • Only partially overlapped with compute
  • Low-performance GPU/server interconnect
• Result: non-linear performance scaling

[Figure: per-GPU timelines running layers forward and backward, with a weight-sync phase before each weight update]
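The non-linear scaling can be illustrated with a toy cost model in which compute divides across devices but the partially-overlapped weight sync does not (all constants below are hypothetical):

```python
def step_time(devices, compute_ms=100.0, sync_ms=20.0, overlap=0.5):
    """Toy per-step cost model for data-parallel training.

    Compute divides across devices, but gradient synchronization is a
    fixed cost only partially overlapped with compute, so speedup bends
    away from linear as devices are added. All constants are made up.
    """
    return compute_ms / devices + sync_ms * (1.0 - overlap)

base = step_time(1)
for d in (1, 8, 64):
    print(f"{d:>3} devices: {base / step_time(d):.1f}x speedup (ideal: {d}x)")
```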
WSE: Layer-Sequential
Replicas: Modest Data Parallel
• Run layer replicas on sections of the fabric
• Medium batch size
  • Small batch per replica
  • Large number of parallel replicas
  • WSE memory delivers full performance at low batch sizes
• Low weight sync overhead
  • Synchronize before the final weight update
  • WSE low-latency interconnect
• Result: linear performance scaling at medium batch size

[Figure: replica timelines on the Wafer Scale Engine with a short weight sync before each weight update]
WSE: Layer-Sequential
No Replicas: Low Data Parallel
• Replicas not constrained to device boundaries
  • Number of replicas defined by layer size
  • Large layers can spread across the entire wafer
• Run one layer at a time on the entire wafer
  • WSE high-bandwidth interconnect
• Small batch size
  • Single replica on the entire wafer
  • WSE memory delivers full performance at low batch sizes
• No weight sync overhead
  • Weights updated in place
• Result: linear performance scaling at small batch sizes for large networks

[Figure: whole-wafer timeline running one layer at a time, with weight updates applied in place]
WSE: Layer-Pipelined
Model Parallel, Modest Data Parallel
• Run all layers on sections of the fabric
  • Execute the network as a pipeline
  • WSE high-bandwidth interconnect
• Medium batch size
  • Batch size is O(network depth)
  • Amortizes pipeline fill/drain overhead
• No weight sync overhead
  • Weights updated in place
• Result: linear performance scaling at medium batch size

[Figure: pipeline fill phase: successive samples occupy Layer 1, then Layers 1-2, then Layers 1-2-3]
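The batch-size-is-O(depth) point follows from textbook pipeline accounting: pushing B samples through a D-stage pipeline takes B + D - 1 steps, so utilization is B / (B + D - 1). A sketch (the 50-layer depth is hypothetical):

```python
def pipeline_utilization(batch, depth):
    """Fraction of pipeline step-slots doing useful work per batch.

    A batch of `batch` samples through a `depth`-stage pipeline takes
    batch + depth - 1 steps, so fill/drain waste depth - 1 steps; making
    the batch a multiple of the depth amortizes that overhead.
    """
    return batch / (batch + depth - 1)

depth = 50  # hypothetical number of pipelined layers
for batch in (1, depth, 10 * depth):
    print(f"batch {batch:>4}: {pipeline_utilization(batch, depth):.0%} utilization")
```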
WSE: Layer-Pipelined
Model Parallel, Modest Data Parallel (continued)
• Same strategy as the previous slide, now showing the end of the batch

[Figure: pipeline drain phase: the last samples complete Layers 1-2-3, then 1-2, then 1, followed by the weight update]
WSE: Layer-Pipelined
Model Parallel, Low Data Parallel
• Pipelined backpropagation
  • Weight update without draining the pipeline
• Arbitrarily small batch size
  • Weight update at any frequency
  • WSE memory delivers full performance at low batch sizes
• No weight sync overhead
  • Weights updated in place
• Staleness compensation in deep networks
  • Simple momentum-based technique
  • Compensates for the forward vs. backward delay
• Result: linear performance scaling at arbitrarily small batch sizes

[Figure: continuously full pipeline with weight updates interleaved, no drain phase]
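A toy illustration of the staleness problem that pipelined backpropagation introduces: each update applies a gradient computed several steps earlier, because a layer's forward pass runs before the matching backward pass returns. This sketch only shows that small steps tolerate the delay; it is not the talk's momentum-based compensation scheme:

```python
from collections import deque

def train(delay, lr, steps=500, target=3.0):
    """Gradient descent on f(w) = (w - target)**2 with stale gradients.

    Gradients spend `delay` steps "in the pipeline" before being applied,
    mimicking the forward-vs-backward lag of pipelined backpropagation.
    """
    w = 0.0
    in_flight = deque([0.0] * delay)          # gradients still in the pipeline
    for _ in range(steps):
        in_flight.append(2.0 * (w - target))  # gradient at current weights
        w -= lr * in_flight.popleft()         # apply a delay-old gradient
    return w

print(round(train(delay=0, lr=0.02), 4))  # fresh gradients
print(round(train(delay=8, lr=0.02), 4))  # stale gradients still reach ~3.0
```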
Wafer-Scale ML
• Spectrum of parallel execution strategies for different training characteristics
• Wafer-scale architecture designed from the ground up to run deep learning at scale
• CS-1 is the world's most powerful AI computer
• Linear, cluster-scale performance on a single chip
• CS-1 with the WSE is up and running real customer workloads at scale today