
Wafer-Scale ML
Cerebras Systems

Largest Chip Ever Built

• 46,225 mm² of silicon

• 1.2 trillion transistors

• 400,000 AI-optimized cores

• 18 Gigabytes of On-chip Memory

• 9 PByte/s memory bandwidth

• 100 Pbit/s fabric bandwidth

Cerebras Wafer Scale Engine

The Cerebras CS-1

A full solution in a single system:

• Powered by the WSE

• Program with TensorFlow and other frameworks

• Installs in standard datacenter rack

The world’s most powerful AI computer

Deep Learning Training is Hard

Size:

• Billions to trillions of ops per sample

• Millions to billions of samples per training run

• Peta- to exa-scale compute in total (a rough sketch follows this list)

Shape:

• Fine-grained: a lot of parallelism; presents an opportunity to accelerate

• Coarse-grained: inherently serial
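As a rough back-of-the-envelope check on that scale (the specific figures below are illustrative assumptions drawn from the ranges above, not Cerebras measurements):

```python
# Rough scale of a training run: ops per sample times samples seen.
# Both figures are illustrative assumptions, not measured values.
ops_per_sample = 1e9   # "billions of ops per sample" (low end)
num_samples = 1e9      # "billions of samples per training run"
total_ops = ops_per_sample * num_samples
print(f"total training compute ~ {total_ops:.0e} ops")  # 1e+18 ops: exa-scale
```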

The Cerebras Architecture is Optimized for DL Compute

• Core optimized for neural network primitives

• Flexible, programmable core: NN architectures are evolving

• Designed for sparse compute: workloads contain fine-grained sparsity

• Local memory: weights & activations are local with low data reuse

• Fast interconnect: layer-to-layer with high bandwidth and low latency

Flexible Cores Optimized for Tensor Operations

Key to supporting rapidly evolving NN architectures

• Fully programmable compute core

• Full array of general instructions with ML extensions

• Flexible general ops for control processing
  • e.g. arithmetic, logical, load/store, branch

• Optimized tensor ops for data processing
  • Tensors as first-class operands
  • e.g. fmac [z] = [z], [w], a, with operands of shape 3D, 3D, 2D, and scalar (a sketch of one possible semantics follows)
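As a rough illustration, here is one plausible reading of that tensor fmac in NumPy terms; the actual WSE instruction semantics are not specified in this deck, so treat this as a hypothetical sketch matching only the operand shapes:

```python
import numpy as np

# Hypothetical sketch of `fmac [z] = [z], [w], a`: a 3D accumulator z
# receives a 2D weight tensor w scaled by a scalar a, broadcast across
# z's leading dimension. This mirrors the operand shapes on the slide
# (3D, 3D, 2D, scalar), not the actual ISA definition.
def fmac(z: np.ndarray, w: np.ndarray, a: float) -> np.ndarray:
    z += w * a   # w * a broadcasts over z's leading dimension
    return z

z = np.zeros((4, 8, 8))   # 3D destination/accumulator
w = np.ones((8, 8))       # 2D weight operand
fmac(z, w, 0.5)           # every slice z[i] is now 0.5 * w
```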

Sparse Compute Engine for Neural Networks

NN operations like nonlinearities naturally create fine-grained sparsity

• Native, sparse processing enables higher efficiency and performance

• Dataflow scheduling in hardware
  • Triggered by data
  • Filters out sparse zero data
  • Skips unnecessary processing (sketched after the figure below)

• Fine-grained execution datapaths
  • Efficiently processes dynamic, non-uniform work

[Figure: ReLU turns a dense network into a sparse one by zeroing activations.]
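A minimal NumPy sketch of the effect, assuming roughly half the activations are zeroed by ReLU; the scheduling here is illustrative, not the hardware's actual dataflow logic:

```python
import numpy as np

# ReLU zeroes roughly half the activations of a random input; a sparse
# engine can then skip every multiply-accumulate against those zeros.
rng = np.random.default_rng(0)
x = rng.standard_normal(1024)
act = np.maximum(x, 0.0)          # ReLU: ~50% of entries become zero

nonzero = np.flatnonzero(act)     # the only work worth scheduling
print(f"work kept: {len(nonzero) / len(act):.0%}")

W = rng.standard_normal((256, 1024))
y_sparse = W[:, nonzero] @ act[nonzero]   # skips multiplies by zero
assert np.allclose(y_sparse, W @ act)     # same result as dense compute
```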

Traditional Memory Architectures not Optimized for DL

In neural networks, weights and activations are local, with low reuse

Traditional memory designs are penalized:

• Central shared memory is slow and far away

• It requires high data reuse (caching)

• The fundamental operation (matrix × vector) has low data reuse (see the sketch below)
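To see why, consider the arithmetic intensity of a matrix × vector product; with illustrative dimensions, each weight is touched exactly once, so no cache can manufacture reuse:

```python
# Arithmetic intensity of y = W @ x for an m x n weight matrix:
# 2 flops (multiply + add) per weight, and each weight is loaded once.
m, n = 4096, 4096                 # illustrative layer dimensions
flops = 2 * m * n                 # one multiply + one add per weight
words = m * n + n + m             # weights + input vector + output vector
print(f"flops per word moved: {flops / words:.2f}")  # ~2.0, regardless of cache size
```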

A Memory Architecture that is Optimized for DL

In neural networks, weights and activations are local, with low data reuse

The right answer is distributed, high-performance, on-chip memory

• All memory is fully distributed along with compute datapaths

• Datapath has full performance from memory

High-Bandwidth Low-Latency Interconnect

Low latency intra/inter-layer local connectivity with high bandwidth

• Fast and fully configurable fabric

• Small single-word messages

• All communication is HW-based, avoiding SW overhead

• 2D mesh topology effective for local communication
  • High bandwidth and low latency for local communication (a locality sketch follows this list)
  • Higher utilization and efficiency than global topologies
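A small sketch of why locality wins on a 2D mesh: cost grows with Manhattan distance, so layer-to-layer traffic between adjacent fabric regions stays at a few hops. The coordinates below are made up for illustration:

```python
# On a 2D mesh, message cost scales with Manhattan (x + y) distance.
# Adjacent layers mapped to neighboring fabric regions communicate in
# a few hops; only a global topology pays worst-case distances.
def hops(src: tuple[int, int], dst: tuple[int, int]) -> int:
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

layer1 = (10, 20)
layer2 = (11, 20)                   # neighboring fabric region
print(hops(layer1, layer2))         # 1 hop: local and cheap
print(hops(layer1, (400, 300)))     # 670 hops: cross-wafer, avoided
```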

Achieving Radical Performance Gains

Training neural networks requires more compute than can fit on a single traditional chip

• More AI-optimized cores

• More high-speed on-chip memory

• More fabric bandwidth at low latency connecting cores together

Build Big Chips

Big Chips Process Data More Quickly -> Producing Answers in Less Time

• Cluster-scale performance on a single chip

• Gigabytes of fast memory one clock cycle from the cores

• On-chip interconnect across entire wafer

Training Neural Networks at Wafer Scale

How to use 400k cores to train a single neural network?

[Figure: a neural network split across Device 1 and Device 2, by model parts and by data samples.]

Training Neural Networks at Wafer Scale

• Wafer Scale Engine enables spectrum of parallel execution strategies

• Execution strategies optimized for different training characteristics

[Figure: execution strategies on two axes, data parallelism (batch size) vs. model parallelism: GPU and WSE layer-sequential execution sit on the data-parallel axis; WSE layer-pipelined execution adds model parallelism.]

[Figure: model parallel runs multiple parts of the network (Model Part 1, Model Part 2) at the same time; data parallel runs multiple samples (Sample 1 … Sample N) at the same time.]

Traditional: Layer-Sequential
Single GPU: Modest Data Parallel

• Run one layer at a time

• Medium batch size
  • GPU memory performance requires batching

• Result: the performance of a single GPU

[Figure: single-GPU timeline: forward and backward passes run one layer at a time through Layers 1-3, followed by a Weight Update.]

Traditional: Layer-Sequential
GPU Cluster: Extreme Data Parallel

• Run layer replicas on parallel devices
  • Low communication between devices
  • GPU/server low-performance interconnect

• Extremely large batch size
  • Medium batch per device, many devices
  • Convergence challenges

• Weight sync overhead (sketched after this list)
  • Synchronize before the final weight update
  • Partially overlapped with compute
  • GPU/server low-performance interconnect

• Result: non-linear performance scaling
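A minimal NumPy sketch of that weight-sync step, assuming plain synchronous SGD; the shapes and learning rate are illustrative:

```python
import numpy as np

# Extreme data parallelism: each replica computes gradients on its own
# batch shard, then all replicas average gradients (an all-reduce)
# before one shared weight update. That all-reduce over a slow
# GPU/server interconnect is the weight-sync overhead.
def sync_and_update(weights, replica_grads, lr=0.1):
    g = np.mean(replica_grads, axis=0)   # synchronize: average gradients
    return weights - lr * g              # single SGD step on all replicas

w = np.zeros(4)
grads = [np.full(4, float(k)) for k in range(8)]  # 8 replicas' gradients
w = sync_and_update(w, grads)                     # w is now -0.1 * 3.5
```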

[Figure: GPU-cluster timeline: each GPU runs the layer-sequential forward and backward passes on its own replica, then a Weight Sync precedes the Weight Update.]

WSE: Layer-Sequential
Replicas: Modest Data Parallel

• Run layer replicas on fabric sections

• Medium batch size
  • Small batch size per replica
  • Large number of parallel replicas
  • WSE memory performance at low batch

• Low weight sync overhead
  • Synchronize before the final weight update
  • WSE low-latency interconnect

• Result: linear performance scaling with medium batch size

[Figure: Wafer Scale Engine timeline: many replicas on fabric sections run the layer-sequential forward and backward passes in parallel, followed by Weight Sync and Weight Update.]

WSE: Layer-Sequential
No Replicas: Low Data Parallel

• Replicas not constrained to a device
  • Number of replicas defined by layer size
  • Large layers can spread across the entire wafer

• Run one layer at a time on the entire wafer
  • WSE high-bandwidth interconnect

• Small batch size
  • Single replica on the entire wafer
  • WSE memory performance at low batch

• No weight sync overhead
  • Weights updated in place

• Result: linear performance scaling with small batch for large networks

[Figure: Wafer Scale Engine timeline: each layer in turn runs across the entire wafer, forward then backward, ending in a Weight Update applied in place.]

WSE: Layer-Pipelined
Model Parallel, Modest Data Parallel

• Run all layers on fabric sections
  • Execute the network as a pipeline
  • WSE high-bandwidth interconnect

• Medium batch size
  • Batch size is O(network depth)
  • Amortizes pipeline fill/drain overhead (a utilization sketch follows this list)

• No weight sync overhead
  • Weights updated in place

• Result: linear performance scaling with medium batch size
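Why batch size must grow with depth: with D pipeline stages, fill and drain cost roughly D - 1 idle steps per weight update, so utilization is B / (B + D - 1) for a batch of B samples. A sketch with an assumed 50-stage pipeline:

```python
# Pipeline fill/drain amortization: D layer-stages need ~D - 1 steps to
# fill and drain around each weight update, so steady-state utilization
# is B / (B + D - 1) for B samples per update. B = O(D) keeps it high.
def utilization(batch: int, depth: int) -> float:
    return batch / (batch + depth - 1)

for b in (1, 10, 50, 200):
    print(f"batch {b:>3}: {utilization(b, depth=50):.0%}")
# batch   1: 2%,  batch  10: 17%,  batch  50: 51%,  batch 200: 80%
```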

[Figure: Wafer Scale Engine timeline, pipeline fill: Layer 1, then Layers 1-2, then Layers 1-2-3 are active as samples stream in, until all layers are busy.]


[Figure: Wafer Scale Engine timeline, pipeline drain: activity winds down from Layers 1-2-3 to Layers 1-2 to Layer 1, followed by the Weight Update.]

WSE: Layer-Pipelined
Model Parallel, Low Data Parallel

• Pipelined backpropagation
  • Weight update without draining the pipeline

• Arbitrarily small batch size
  • Weight update at any frequency
  • WSE memory performance at low batch

• No weight sync overhead
  • Weights updated in place

• Staleness compensation in deep networks
  • Simple momentum-based technique
  • Compensates for the forward vs. backward delay (a sketch follows this list)

• Result: linear performance scaling with arbitrarily small batch size
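A minimal sketch of the idea behind pipelined backpropagation with momentum-style staleness compensation; the deck does not spell out Cerebras's exact technique, so the update rule and delay model below are illustrative stand-ins:

```python
import numpy as np

# Pipelined backprop: gradients arrive delayed relative to the weights
# that produced them (forward vs. backward delay), and the pipeline is
# never drained. A momentum term smooths successive stale gradients so
# the in-place updates stay stable. Illustrative rule, not Cerebras's
# exact algorithm.
def pipelined_step(w, stale_grad, velocity, lr=0.01, beta=0.9):
    velocity = beta * velocity + stale_grad   # momentum over stale grads
    w = w - lr * velocity                     # update in place, no drain
    return w, velocity

w, v = np.zeros(4), np.zeros(4)
delay_buffer = [np.ones(4)] * 3          # gradients delayed by 3 stages
for step in range(100):
    grad = np.ones(4)                    # new gradient enters the pipe
    delay_buffer.append(grad)
    stale = delay_buffer.pop(0)          # gradient from 3 steps ago
    w, v = pipelined_step(w, stale, v)
```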

[Figure: Wafer Scale Engine timeline: all Layers 1-3 stay continuously active while Weight Updates are applied in place, with no pipeline drain.]

Wafer-Scale ML

• Spectrum of parallel execution strategies for different training characteristics

• Wafer scale architecture designed from the ground up to run deep learning at scale

• CS-1 is the world’s most powerful AI computer

• Linear, cluster-scale performance on a single chip

• CS-1 with WSE is up and running real customer workloads at scale today