MechaFlow: A Software/Hardware/Technology Co-design Space Exploration Framework
Transcript of slides · 2020-05-30
1
2
Cerebras
V100
A100
SambaNova
Groq
GraphCore
Habana
Cambricon
3
Deep Learning Accelerator Craze: The Tale of Two Trends
Number of transistors on chip doubles every 24 months
Compute requirements double every 3.5 months!
Source: https://blog.openai.com/ai-and-compute/
Slow-down of Moore's law
Fast-growing computation demand
4
If we don’t do anything
Wait for 40 years to train 100 times larger models!
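A back-of-envelope check on the two trends above (hedged: the 6-year doubling period below is an illustrative assumption about a slowed-down Moore's law, not a figure from the slides). A 100x larger model needs log2(100) ≈ 6.6 compute doublings; at the historical 24-month pace that is ~13 years, but if each doubling stretches to ~6 years the slide's ~40-year figure follows:

```python
import math

# Doublings of compute needed for a 100x larger model.
doublings = math.log2(100)  # ~6.64

# Historical Moore's law: one doubling every 24 months.
years_historical = doublings * 24 / 12
print(f"24-month doubling: ~{years_historical:.0f} years")  # ~13 years

# Assumed slowed-down pace of one doubling per ~6 years (illustrative):
years_slowed = doublings * 6
print(f"6-year doubling:   ~{years_slowed:.0f} years")  # ~40 years
```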
5
Alternatively…
• Explore parallelism
  • Model
  • Data
  • Pipeline
  • Hybrid
• Design specialized hardware accelerators
  • 99+ hardware startup companies
6
Embarrassment of Riches
Parallelism Strategy Exploration (assumes fixed hardware)
• HyPar
• FlexFlow
• Mesh-TensorFlow
• …
Specialized Hardware Accelerators (99+, parallelism-agnostic)
• Nvidia V100
• Nvidia A100
• Cerebras
• SambaNova
• Groq
• GraphCore
• Habana
• …
7
Low Utilization at Scale
Data source: Microsoft blog (https://syncedreview.com/2020/02/12/17-billion-parameters-microsoft-deepspeed-breeds-worlds-largest-nlp-model/) and Kunle Olukotun's presentation at ScaledML
• 8 B parameters, 512 GPUs → 20% efficiency
• 17 B parameters, 1000 GPUs → 6% efficiency
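To make those utilization numbers concrete, here is a small sketch (my arithmetic, not from the slide) converting each run into "fully utilized GPU" equivalents; nearly doubling the GPU count actually delivered less effective compute:

```python
# Effective throughput in "fully utilized GPU" equivalents,
# using the utilization figures quoted on the slide.
runs = {
    "8B params":  {"gpus": 512,  "efficiency": 0.20},
    "17B params": {"gpus": 1000, "efficiency": 0.06},
}
for name, r in runs.items():
    effective = r["gpus"] * r["efficiency"]
    print(f"{name}: {r['gpus']} GPUs at {r['efficiency']:.0%} -> "
          f"~{effective:.0f} effective GPUs")
# ~102 effective GPUs for the 512-GPU run vs. ~60 for the 1000-GPU run.
```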
8
Analysis Paralysis
9
What do we need?
(Application, Parallelism Strategy, Hardware Config.) → Magic Box → Execution Time
10
What do we need?
(Application, Parallelism Strategy, Hardware Config.) → Magic Box → Execution Time
Outputs: Best* Time, Best* Parallelism Strategy, Best* Hardware
11
What do we need?
(Application, Parallelism Strategy, Hardware Config., Design Constraints) → Magic Box → Execution Time
Design constraints: Technology Parameters, Power Budget, Area Budget
Outputs: Best* Time, Best* Parallelism Strategy, Best* Hardware
12
What do we need?
(Applications, Parallelism Strategy, Hardware Config., Design Constraints) → Magic Box → Execution Time
Design constraints: Technology Parameters, Power Budget, Area Budget
Outputs: Best* Time, Best* Parallelism Strategy, Best* Hardware
• Today: Which accelerator meets my need?
• Tomorrow: Which technology is the most promising?
13
MechaFlow: A Software/Hardware/Technology Co-design Space Exploration Framework
14
MechaFlow: Telescopic View
MechaFlow fills the Magic Box:
(Applications, Parallelism Strategy, Hardware Config., Technology Parameters, Power Budget, Area Budget) → MechaFlow → Execution Time
Outputs: Best* Time, Best* Parallelism Strategy, Best* Hardware
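As a rough sketch of what such a "Magic Box" could do internally (all names, fields, and the cost model below are hypothetical illustrations, not MechaFlow's actual API): enumerate (parallelism strategy, hardware configuration) pairs, drop configurations that violate the power/area budgets, and keep the pair with the lowest modeled execution time.

```python
from itertools import product

def codesign_search(strategies, hw_configs, exec_time_model,
                    power_budget, area_budget):
    """Exhaustive co-design search over strategy/hardware pairs (sketch)."""
    best = None
    for strategy, hw in product(strategies, hw_configs):
        if hw["power"] > power_budget or hw["area"] > area_budget:
            continue  # violates the design constraints
        t = exec_time_model(strategy, hw)
        if best is None or t < best[0]:
            best = (t, strategy, hw)
    return best  # (best time, best parallelism strategy, best hardware)

# Toy usage with a made-up analytical cost model:
strategies = ["RC-k8-k2-d4-l1", "CR-k4-k4-d4-l1"]
hw_configs = [{"power": 250, "area": 800, "flops": 100},
              {"power": 400, "area": 900, "flops": 160}]  # 2nd exceeds 300 W
model = lambda s, hw: 1000 / hw["flops"] * (1.2 if s.startswith("CR") else 1.0)
print(codesign_search(strategies, hw_configs, model,
                      power_budget=300, area_budget=815))
```

Real frameworks replace the brute-force loop with analytical models and pruning, but the interface (inputs, constraints, argmin) is the same shape as the diagram above.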
15
Case Studies
1. How much performance gain from co-designing hardware and parallelism strategy?
2. How much performance gain from upcoming packaging technologies?
16
Methodology
• Language modeling
  • Word-language model (RNN-based LSTM)
  • SOTA: 18 billion parameters
  • Desired: 256 billion parameters (hidden: 19968, layers: 2, vocab: 800K, seq.: 20)
• Parallelism
  • 64-way parallelism
  • Data parallelism
  • Model/kernel parallelism: Row-Column (RC) and Column-Row (CR)
  • Pipeline/layer parallelism
  • Notation: {RC or CR}-k{i}-k{j}-d{k}-l{m}, e.g. RC-k8-k2-d4-l1
• Hardware
  • Baseline: V100
  • Design constraints: 300 W, 1230 mm²/node, 815 mm²/core
  • Technology: 14 nm
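The strategy labels can be unpacked mechanically. Below is a sketch parser (the field meanings are inferred from the slide's notation, so treat them as assumptions) that also checks the degrees multiply out to the stated 64-way parallelism:

```python
import re

def parse_strategy(label):
    """Parse a label like 'RC-k8-k2-d4-l1'.

    Inferred meanings (assumption): RC/CR = row-column vs. column-row
    kernel parallelism; k* = the two kernel-parallel degrees;
    d* = data-parallel degree; l* = pipeline/layer-parallel degree.
    """
    m = re.fullmatch(r"(RC|CR)-k(\d+)-k(\d+)-d(\d+)-l(\d+)", label)
    if m is None:
        raise ValueError(f"bad strategy label: {label!r}")
    k1, k2, d, l = (int(g) for g in m.groups()[1:])
    return {"order": m.group(1), "k1": k1, "k2": k2,
            "data": d, "pipeline": l, "total_ways": k1 * k2 * d * l}

print(parse_strategy("RC-k8-k2-d4-l1")["total_ways"])  # 8*2*4*1 = 64
```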
17
Q1. Co-design Hardware and Parallelism Strategy: How much performance gain?
18
Co-design Parallelism Strategy and HW Design?
[Chart: Execution Time/Step (Sec.); Best Hardware per Parallelism Strategy vs. V100]
19
Not so much gain from specialization to parallelism strategy
[Chart: Execution Time/Step (Sec.); Best Hardware per Parallelism Strategy vs. V100 vs. Best HW for Best Parallelism]
20
No Single "Best"
[Heatmap: Parallelism Strategy vs. Hardware Parameters, showing Specialized Hardware Configurations (PPS); color scales: Rel. Speedup wrt. Specialized Hardware Per Parallelism Strategy (PPS), Relative Parameter wrt. V100]
21
Q1 Summary
• Observation 1: Little gain from specializing hardware to each parallelism strategy.
• Observation 2: There is no single best hardware; there are many distinct, universally good hardware design configurations.
22
Q2. Technology Trends: Which packaging technology is most promising?
23
To “SiIF” or Not to “SiIF”?
• SiIF: 64 nodes/wafer, 1 wafer
• MCM: 4 nodes/wafer, 16 wafers
• Single: 1 node/wafer, 64 wafers
[Chart: Time/Step (Sec.) vs. Parallelism Strategy (sorted for SiIF); series: SiIF, Single, MCM]
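All three packaging options above supply the same 64 compute nodes and differ only in how densely those nodes are integrated. A quick consistency check (my arithmetic from the slide's numbers):

```python
# (nodes per wafer, wafer count) from the slide; each option totals 64 nodes.
packaging = {"SiIF": (64, 1), "MCM": (4, 16), "Single": (1, 64)}
for tech, (nodes_per_wafer, wafers) in packaging.items():
    total = nodes_per_wafer * wafers
    print(f"{tech}: {nodes_per_wafer} nodes/wafer x {wafers} wafer(s) "
          f"= {total} nodes")
    assert total == 64  # same compute; only the integration differs
```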
24
Conclusion
Slow-down of Moore's law
Fast growing computation demand
Cerebras
V100
A100
SambaNova
Groq
GraphCore
Habana
Cambricon
25
Saptadeep Pal, Puneet Gupta
Joel Hestness, Greg Diamos, Kenneth Church