MechaFlow: A Software/Hardware/Technology Co-design Space Exploration Framework
Transcript of slides · 2020-05-30
1
2
Cerebras
V100
A100
SambaNova
Groq
GraphCore
Habana
Cambricon
3
Deep Learning Accelerator Craze: The Tale of Two Trends
Number of transistors on chip doubles every 24 months
Compute requirements double every 3.5 months!
Source: https://blog.openai.com/ai-and-compute/
Slow-down of Moore's law
Fast-growing computation demand
4
If we don’t do anything
Wait for 40 years to train 100 times larger models!
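A back-of-envelope check on the two trends above (hedged: the 6-year doubling period below is an illustrative assumption about a slowed-down Moore's law, not a figure from the slides). A 100x larger model needs log2(100) ≈ 6.6 compute doublings; at the historical 24-month pace that is ~13 years, but if each doubling stretches to ~6 years the slide's ~40-year figure follows:

```python
import math

# Doublings of compute needed for a 100x larger model.
doublings = math.log2(100)  # ~6.64

# Historical Moore's law: one doubling every 24 months.
years_historical = doublings * 24 / 12
print(f"24-month doubling: ~{years_historical:.0f} years")  # ~13 years

# Assumed slowed-down pace of one doubling per ~6 years (illustrative):
years_slowed = doublings * 6
print(f"6-year doubling:   ~{years_slowed:.0f} years")  # ~40 years
```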
5
Alternatively…
• Explore parallelism
  • Model
  • Data
  • Pipeline
  • Hybrid
• Design specialized hardware accelerators
  • 99+ hardware startup companies
6
Embarrassment of Riches
Parallelism Strategy Exploration (assumes fixed hardware)
• HyPar
• FlexFlow
• Mesh-TensorFlow
• …
Specialized Hardware Accelerators (99+, parallelism-agnostic)
• Nvidia V100
• Nvidia A100
• Cerebras
• SambaNova
• Groq
• GraphCore
• Habana
• …
7
Low Utilization at Scale
Data source: Microsoft blog (https://syncedreview.com/2020/02/12/17-billion-parameters-microsoft-deepspeed-breeds-worlds-largest-nlp-model/) and Kunle Olukotun's presentation at ScaledML
• 8 B parameters, 512 GPUs → 20% efficiency
• 17 B parameters, 1000 GPUs → 6% efficiency
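To make those utilization numbers concrete, here is a small sketch (my arithmetic, not from the slide) converting each run into "fully utilized GPU" equivalents; nearly doubling the GPU count actually delivered less effective compute:

```python
# Effective throughput in "fully utilized GPU" equivalents,
# using the utilization figures quoted on the slide.
runs = {
    "8B params":  {"gpus": 512,  "efficiency": 0.20},
    "17B params": {"gpus": 1000, "efficiency": 0.06},
}
for name, r in runs.items():
    effective = r["gpus"] * r["efficiency"]
    print(f"{name}: {r['gpus']} GPUs at {r['efficiency']:.0%} -> "
          f"~{effective:.0f} effective GPUs")
# ~102 effective GPUs for the 512-GPU run vs. ~60 for the 1000-GPU run.
```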
8
Analysis Paralysis
9
What do we need?
(Application, Parallelism Strategy, Hardware Config.) → Magic Box → Execution Time
10
What do we need?
(Application, Parallelism Strategy, Hardware Config.) → Magic Box → Execution Time
Outputs: Best* Time, Best* Parallelism Strategy, Best* Hardware
11
What do we need?
(Application, Parallelism Strategy, Hardware Config., Design Constraints) → Magic Box → Execution Time
Design constraints: Technology Parameters, Power Budget, Area Budget
Outputs: Best* Time, Best* Parallelism Strategy, Best* Hardware
12
What do we need?
(Applications, Parallelism Strategy, Hardware Config., Design Constraints) → Magic Box → Execution Time
Design constraints: Technology Parameters, Power Budget, Area Budget
Outputs: Best* Time, Best* Parallelism Strategy, Best* Hardware
• Today: Which accelerator meets my need?
• Tomorrow: Which technology is the most promising?
13
MechaFlow: A Software/Hardware/Technology Co-design Space Exploration Framework
14
MechaFlow: Telescopic View
MechaFlow fills the Magic Box:
(Applications, Parallelism Strategy, Hardware Config., Technology Parameters, Power Budget, Area Budget) → MechaFlow → Execution Time
Outputs: Best* Time, Best* Parallelism Strategy, Best* Hardware
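As a rough sketch of what such a "Magic Box" could do internally (all names, fields, and the cost model below are hypothetical illustrations, not MechaFlow's actual API): enumerate (parallelism strategy, hardware configuration) pairs, drop configurations that violate the power/area budgets, and keep the pair with the lowest modeled execution time.

```python
from itertools import product

def codesign_search(strategies, hw_configs, exec_time_model,
                    power_budget, area_budget):
    """Exhaustive co-design search over strategy/hardware pairs (sketch)."""
    best = None
    for strategy, hw in product(strategies, hw_configs):
        if hw["power"] > power_budget or hw["area"] > area_budget:
            continue  # violates the design constraints
        t = exec_time_model(strategy, hw)
        if best is None or t < best[0]:
            best = (t, strategy, hw)
    return best  # (best time, best parallelism strategy, best hardware)

# Toy usage with a made-up analytical cost model:
strategies = ["RC-k8-k2-d4-l1", "CR-k4-k4-d4-l1"]
hw_configs = [{"power": 250, "area": 800, "flops": 100},
              {"power": 400, "area": 900, "flops": 160}]  # 2nd exceeds 300 W
model = lambda s, hw: 1000 / hw["flops"] * (1.2 if s.startswith("CR") else 1.0)
print(codesign_search(strategies, hw_configs, model,
                      power_budget=300, area_budget=815))
```

Real frameworks replace the brute-force loop with analytical models and pruning, but the interface (inputs, constraints, argmin) is the same shape as the diagram above.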
15
Case Studies
1. How much performance gain from co-designing hardware and parallelism strategy?
2. How much performance gain from upcoming packaging technologies?
16
Methodology
• Language modeling
  • Word-language model (RNN-based LSTM)
  • SOTA: 18 billion parameters
  • Desired: 256 billion parameters (hidden: 19968, layers: 2, vocab: 800K, seq.: 20)
• Parallelism
  • 64-way parallelism
  • Data parallelism
  • Model/kernel parallelism: Row-Column (RC) and Column-Row (CR)
  • Pipeline/layer parallelism
  • Notation: {RC or CR}-k{i}-k{j}-d{k}-l{m}, e.g. RC-k8-k2-d4-l1
• Hardware
  • Baseline: V100
  • Design constraints: 300 W, 1230 mm²/node, 815 mm²/core
  • Technology: 14 nm
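The strategy labels can be unpacked mechanically. Below is a sketch parser (the field meanings are inferred from the slide's notation, so treat them as assumptions) that also checks the degrees multiply out to the stated 64-way parallelism:

```python
import re

def parse_strategy(label):
    """Parse a label like 'RC-k8-k2-d4-l1'.

    Inferred meanings (assumption): RC/CR = row-column vs. column-row
    kernel parallelism; k* = the two kernel-parallel degrees;
    d* = data-parallel degree; l* = pipeline/layer-parallel degree.
    """
    m = re.fullmatch(r"(RC|CR)-k(\d+)-k(\d+)-d(\d+)-l(\d+)", label)
    if m is None:
        raise ValueError(f"bad strategy label: {label!r}")
    k1, k2, d, l = (int(g) for g in m.groups()[1:])
    return {"order": m.group(1), "k1": k1, "k2": k2,
            "data": d, "pipeline": l, "total_ways": k1 * k2 * d * l}

print(parse_strategy("RC-k8-k2-d4-l1")["total_ways"])  # 8*2*4*1 = 64
```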
17
Q1. Co-design Hardware and Parallelism Strategy: How much performance gain?
18
Co-design Parallelism Strategy and HW Design?
[Chart: Execution Time/Step (Sec.); Best Hardware per Parallelism Strategy vs. V100]
19
Not so much gain from specialization to parallelism strategy
[Chart: Execution Time/Step (Sec.); Best Hardware per Parallelism Strategy vs. V100 vs. Best HW for Best Parallelism]
20
No Single "Best"
[Heatmap: Parallelism Strategy vs. Hardware Parameters, showing Specialized Hardware Configurations (PPS); color scales: Rel. Speedup wrt. Specialized Hardware Per Parallelism Strategy (PPS), Relative Parameter wrt. V100]
21
Q1 Summary
• Observation 1: Little gain from specializing hardware to each parallelism strategy.
• Observation 2: There is no single best hardware; there are many distinct, universally good hardware design configurations.
22
Q2. Technology Trends: Which packaging technology is most promising?
23
To “SiIF” or Not to “SiIF”?
• SiIF: 64 nodes/wafer, 1 wafer
• MCM: 4 nodes/wafer, 16 wafers
• Single: 1 node/wafer, 64 wafers
[Chart: Time/Step (Sec.) vs. Parallelism Strategy (sorted for SiIF); series: SiIF, Single, MCM]
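All three packaging options above supply the same 64 compute nodes and differ only in how densely those nodes are integrated. A quick consistency check (my arithmetic from the slide's numbers):

```python
# (nodes per wafer, wafer count) from the slide; each option totals 64 nodes.
packaging = {"SiIF": (64, 1), "MCM": (4, 16), "Single": (1, 64)}
for tech, (nodes_per_wafer, wafers) in packaging.items():
    total = nodes_per_wafer * wafers
    print(f"{tech}: {nodes_per_wafer} nodes/wafer x {wafers} wafer(s) "
          f"= {total} nodes")
    assert total == 64  # same compute; only the integration differs
```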
24
Conclusion
Slow-down of Moore's law
Fast growing computation demand
Cerebras
V100
A100
SambaNova
Groq
GraphCore
Habana
Cambricon
25
Saptadeep Pal, Puneet Gupta
Joel Hestness, Greg Diamos, Kenneth Church