Ljubisa Bajićand Jasmina Vasiljević · 2020. 8. 18. · Dynamic Execution 1 10 100 1000 10000...
Transcript of Ljubisa Bajićand Jasmina Vasiljević · 2020. 8. 18. · Dynamic Execution 1 10 100 1000 10000...
Compute substrate for Software 2.0Ljubisa Bajić and Jasmina Vasiljević
ML vs.
Moore’s Law compute
ML
1
10
100
1000
10000
100000
1000000
10000000
2019 2020 2021 2022 2023 2024 2025 2026
ML vs. Moore's Law (Optimistic)
ML compute demand Moore's law - 50%/yr
Optical, analog/nanowires
ML compute demand
Moore’s law
10K*Moores Law (cluster)
100K*Moores Law (mega cluster)
Dynamic Execution
1
10
100
1000
10000
100000
1000000
10000000
2019 2020 2021 2022 2023 2024 2025 2026
ML vs. Moore's Law
ML compute demand Moore's law - 50%/yr
Scale helps, but the only
long-term solution
is to change the slope of
the curve
20 Watts
Scale out
Scale Out
Large Clusters are Already the Norm
§ Shared memory architectures could not provide the required scale
§ Many modern neural nets are trained and inferenced on clusters
§ Many nodes with 4-16 GPUs
§ Private memory space, explicit data movement
§ Data parallel at first§ Sidesteps many communication/synchronization issues
§ but model parallel has become necessary§ The full complexity of cluster programming is now exposed
Largest. Clusters. Ever.
§ Networking + compute on each chip
§ Computation directly on packets
§ Packet routing controlled by graph compiler
§ Hundreds of thousands of nodes in cluster
§ One device in Pytorch WH WH WH WH
WH WH WH WH
WH WH WH WH
WH WH WH WH
WH
2U system
42U Rack
Module
Cluster
Shared Memory Machines
§ Each processing unit can see the full memory space
§ A processor needs an array element: just issue a LOAD
§ Tensor manipulations and views mostly reduce to
strided access to same buffer in memory
§ Primary compiler challenge – loop nest optimization4 1
3 2
7 6
5 0
9 2
6 4
5 1
0 3
4
1
7
6
3
2
5
0
9
2
5
1
6
4
0
3
0x0
0x4
0x8
4 1
3 2
7 6
5 0
9 2
6 4
5 1
0 3
4
1
7
6
3
2
5
0
9
2
5
1
6
4
0
3
0x0
0x4
0x8Transpose
Stride 4
Stride 1
Clusters and ML Chips Have Private Memory
§ Data is split up between nodes and no local view exists§ Data transfers explicitly managed
§ Tensor manipulations -> inter-node co-ordination§ Example transpose implementation:
§ Data transfer between 1,0 <-> 0,1
§ Transpose of local tiles
§ Hard challenges:§ Data tiling and parallelization
§ Data transfers, synchronization
§ Complexities with tensor manipulations
§ Memory management
§ We solve them holistically
Tenstorrent confidential
Transpose
4 1
3 2
7 6
5 0
9 2
6 4
5 1
0 3
4
1
3
2
7
6
5
0
9
2
6
4
5
1
0
3
Node 0,0 Node 0,1
Node 1,0 Node 1,1
0x0
0x4
0x8
4
1
3
2
7
6
5 0
9
2
6
4
5
10 3
4
3
1
2
9
6
2
4
7
5
6
0
5
0
1
3
Node 0,0 Node 0,1
Node 1,0 Node 1,1
0x0
0x4
0x8
Need XFER
between cores
Compiler
Runtime Hardware
Dynamic ExecutionWhat is it?
Dynamic Execution
Sparse Compute Runtime CompressionControl Flow
Models that dynamically
choose subsets of blocks to
compute during each pass
fp32 fp32
fp16
8-bit 8-bit
mat
mult
Dynamic Precision
Weights and activations
O(n) Matrix MultiplicationWeights/activations (dense)
Activations (dense) Result (dense) Activations (sparse)
Weights (dense)
Result (sparse)
Weights (dense)
Result (sparse)Activations (dense)
Dense: O(n^3) Sparse: linear speed-up Chained sparse MM:
quadratic speed up
Sparsity Max boost
50% 2X
90% 10X
Sparsity Max boost
50% 4X
90% 100X
O(n) continued
§ Generally applicable
§ works for training and inference (unlike pruning)
§ models with general applicability (like GPT3)
§ Requires models that dynamically choose subsets of blocks to
compute during each pass
§ Mixture of experts
§ LSH
§ Pre-pass based
§ Requires hardware that can realize full speed-up from block sparsity
Result (sparse)
16x16 block
The Full Stack SolutionArchitecture & Software
GrayskullCluster On a Chip
§ 2D grid of cores
§ 120 self contained cores
§ Each core executing independent program
§ Network on chip
§ 2D bi-directional torus
§ Optimized for ML-workload
§ Connectivity
§ PCIe
§ DRAM
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T TP
C
I
E
A
R
C
LPDDR4
T
T
T
T
T
T
T
T
T T T T T T T T T
T T
T T
T T
T T
T T
T T
T T
T T
T
T
T
T
T
T
T
T
T T T
T T T T T T T T T T T T
LPDDR4 LPDDR4 LPDDR4
LPDDR4 LPDDR4 LPDDR4 LPDDR4
NoC BW 330 GB/sec
PCIe Gen3 x16
Off-chip memory LPDDR4
Single Core
§ Packet Compute
§ Vector, SIMD
§ Programmable & flexible compute
§ Sparse compute
§ Packet Manager§ Data transfers & storage
§ Tensor manipulation
§ Dynamic compression
§ Storage§ Local SRAM
§ Access to DRAM
§ 5 RISC cores
§ Powerful single-issue processor
§ Runtime software
Packet
Compute
Packet
Manager
Packet
ManagerSRAM
NoC
Local SRAM1MB
660 GB/sec R/W bw
Compute3 TOPs (8-bit)
0.75 TFLOPs (16-bit)
Data formats
Bfloat, half-float, tf32
8-bit
Several custom formats, <=8-bit fp
Challenges of Connecting Compute Layers
§ Parallelization
§ Splitting tensors amongst the cores
§ Moving tensors between the cores
§ Tensor Manipulation (TM) instructions
§ Reshuffle data in various ways
§ Performance
§ Overlapping compute & transfers
§ Efficient utilization of NoC
§ Efficient utilization of memory bandwidth Compute
A
Compute
B
Tensor Manipulation
instructions
Compute
B
Compute
A
Compute
C
Compute
D
Compound
Complexity
Reshape Transpose Squeeze
The Full Stack Approach
AOT
Compiler
Runtime Hardware
Parallelizes compute &
packetizes tensors
Dynamic memory
management
Packet
manager
Graph CompilerCompilation Into Packets
§ Packetization
§ Packet headers: packet IDs & routing information
§ No pointers, everything is expressed in terms of packet IDs
§ Compute layer parallelization optimized by graph compiler
§ Data movement & synchronization expressed explicitly by compiler
§ NoC is visible to compiler
§ Produces an Instruction Queue for the Packet Manager§ Packets re-ordered by NoC
§ In-line TMs
§ Memory access patterns
Graph Compiler
he
ad
er
payload
Single
packet
Mini-tensors
Tensor
Compute
A
Compute
B
Compute
A
Compute
BReshape Transpose Squeeze
Packet Manager
• Packet Compute Engine
• Programmable device, flexibility
• Computes what Packet Manager feeds it
• Packet header triggers a program for the
packet
Packet
Manager
Packet
Compute
Single Core
NoC
SRAM
DRAM
Packet Manager
• Tensor Manipulation Engine
• Reshape, transpose, concat, slice
• In-line, between compute & memory
• Dynamic packet compression
• Data Transfer Engine
• Multi-core synchronization
• Data dependencies, data hazards, data ready,
memory space ready
• Works with runtime software
• Router
• Moves data across the NoC
• Back-pressure, guaranteed ordering, deadlock free
• Optimized multi-cast and gather operation for ML
workloads
Packet Manager
NoC Router
Tensor
Manipulation
Data Transfer
Engine
Packet
Compute Engine
SRAM
DRAM
Runtime Software
§ Five RISC processors per core
§ Dynamic memory allocation § Runtime buffer (de)allocation
§ Runtime controlled memory target
§ Data locality through SRAM
§ Spills to DRAM and host
§ Control flow§ if-statements, for-loops, while-loops
§ Decisions reflected by jumping around the Instruction Queue executed by Packet Compute and Packet Manager
Packet
Compute
Engine
Packet
Manager
Packet
ManagerSRAM
Flexible Scheduling & Parallelization
Clusters of nodes mapped
spatially onto the cores
pipeline
stage 1
pipeline
stage 2
pipeline
stage 3
pipeline
stage 4
pipeline
stage 1
pipeline
stage 2
pipeline
stage 3
pipeline
stage 4
Pipeline Parallelism Model Parallelism
A
B
C
D
Layers without data
dependencies run
concurrently
CB DA
6 cores
4 cores9 cores
4 cores
Combining multiple parallelization methodsleads to higher utilization of large number of cores,
resulting in higher performance
Pe
rfo
rma
nce
No. of CoresPe
rfo
rma
nce
No. of CoresPe
rfo
rma
nce
No. of CoresPe
rfo
rma
nce
No. of Cores
parallelization #1
parallelization #2
parallelization #3
parallelization #4
The Full Stack Approach
§ High-performance through concurrency
§ Asynchronous cores: flexible parallelization & scheduling
§ Packet manager & Packet Compute overlap data transfers & compute
§ High memory utilization
§ AOT graph compiler can not accurately predict buffer lifetimes
§ Dynamic memory management compensates
§ Dynamic execution
§ Runtime packet compression & data locality benefits
§ Sparse compute
§ Control flow graphs
AOT
Compiler
Runtime Hardware
Parallelizes compute &
packetizes tensors
Dynamic memory
management
Packet
manager
Company Overview, Status&
Plans
Company Overview
§ Tenstorrent
§ Founded in 2016
§ ~70 employees in Toronto and Austin
§ Equal mix of CPU, GPU, FPGA backgrounds
§ Goals and targets
§ ML inference and training
§ Edge to data center
§ General purpose high throughput parallel computation
T
E
Compute core
Ethernet (100G)
DRAM interface
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T TP
C
I
E
A
R
C
LPDDR4
T
T
T
T
T
T
T
T
T T T T T T T T T
T T
T T
T T
T T
T T
T T
T T
T T
T
T
T
T
T
T
T
T
T T T
T T T T T T T T T T T T
LPDDR4 LPDDR4 LPDDR4
LPDDR4 LPDDR4 LPDDR4 LPDDR4
Grayskull (2020)
ML processor
• 8 channels of LPDDR4, PCIE g4 x16
• 4 core OoO ARC CPU, runs linux
• 368 TOPS / 92 TFLOPS, 120MB SRAM
• 65W
Evaluation with multiple large customers
Shipping this fall
Wormhole (2021)
Network switch & ML processor
• Integrated network switch
• 16 ports of 100G ethernet
• 6 channels of GDDR6, PCIE g4 x16
• 4 core OoO ARC CPU, runs linux
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
T T T T
G6
P
C
I
E
E
E
E
E
E
E
E
E
E
E
EA
R
C
G6
T
T
T
T
T
T
T
T
T T T T T T T T T
G6E E
G6G6 G6E E
E
Jawbridge (2019)
ML processor
• 1 channels of LPDDR4, PCIE g4 x4
• 4 core OoO ARC CPU, runs linux
• 4 TOPS / 1 TFLOPS, 6MB SRAM
• 1.5W
T T T
T T T
LPDDR4
P
C
I
E
OOO CPU
65W Grayskull BERT Inference Performance
Workload Score
BERT BASE, SQuAD 1.1, fp16 – no conditionals 2,830
BERT BASE, SQuAD 1.1, fp16 + light conditional execution 10,150
BERT BASE, SQuAD 1.1, mixed precision, moderate conditional execution 23,345
Work in progress, BERT model modified with conditional execution
*
Software: Compiler generality
NLP
§ BERT
§ ALBERT
§ GPT2
§ T5 LM
§ GNMT
§ Transformer
§ Electra
NLP
Key verticals:
Healthcare, Financials,
Ecommerce, Retail
Models ready:
§ BERT
§ ALBERT
§ GPT2
§ T5 LM
§ GNMT
§ Transformer
§ Electra
Vision/Imaging
Key verticals:
Retail, Security,
Automotive
Models ready:
§ Resnet50
§ DeepCoNN
§ Googlenet
§ Densenet
§ Inception
§ Alexnet
§ ResNext
§ SqueezeNet
§ Mobilenet
§ VGG
§ YOLO
§ SSD Resnet34
§ SSD Mobilenet
Others
Key verticals:
Gaming, Entertainment,
social media, ecommerce
Models ready:
§ NCF
§ DLRM
§ Autoencoder
§ Stacked denoising autoencoder
Eval with customers
Public beta on our dev cloud
November 1st
Framework Integration + Deployment
Back End
Graph CompilerOptimizer
Front End
§ Full Pytorch integration
§ Native device
§ Torchscript with full support for conditionals
§ ONNX
§ A single device from PyTorch no matter the size of
computer
§ Automated deployment flow
§ Pre-trained models can benefit from Tenstorrent features
WH WH WH WH
WH WH WH WH
WH WH WH WH
WH WH WH WH
2U system
Single card
Summary
compute
ML§ Scale and conditional computation to let ML models grow
§ Flexibility: run anything
§ Usability: easy to use software, hiding all complexity of
programming clusters