Cerebras GraphCore · 2020. 5. 30. · GraphCore Habana Cambrion. 25 Saptadeep Pal Puneet Gupta...


1

2

Cerebras

V100

A100

SambaNova

Groq

GraphCore

Habana

Cambricon

3

Deep Learning Accelerator Craze: The Tale of Two Trends

Number of transistors on chip doubles every 24 months

Compute requirements double every 3.5 months!

Source: https://blog.openai.com/ai-and-compute/

Slow-down of Moore’s law

Fast-growing computation demand
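The gap between the two trends is easier to see as yearly growth factors. The sketch below converts the two doubling periods quoted on this slide into growth per year; the function name is illustrative and not taken from the talk.

```python
def annual_growth(doubling_months: float) -> float:
    """Growth factor over 12 months for a quantity that doubles every `doubling_months`."""
    return 2.0 ** (12.0 / doubling_months)


# Doubling periods quoted on the slide.
transistor_growth = annual_growth(24.0)  # roughly 1.4x per year
compute_growth = annual_growth(3.5)      # roughly 10.8x per year

# How much faster demand grows than transistor count, per year.
yearly_gap = compute_growth / transistor_growth
```

Under these two figures alone, demand outgrows transistor count by several times every year, which is the mismatch the slide title refers to.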

4

If we don’t do anything

Wait for 40 years to train 100 times larger models!

5

Alternatively…

• Explore Parallelism
  • Model
  • Data
  • Pipeline
  • Hybrid
• Design Specialized Hardware Accelerator
  • 99+ Hardware Startup companies

6

Embarrassment of Riches

Parallelism Strategy Exploration

• HyPar

• FlexFlow

• MeshTensorFlow

• …

Specialized Hardware Accelerator (99+)

• Nvidia V100

• Nvidia A100

• Cerebras

• SambaNova

• Groq

• GraphCore

• Habana

• …

Parallelism-Agnostic

Fixed Hardware

7

Low Utilization at Scale

Data Source: Microsoft blog (https://syncedreview.com/2020/02/12/17-billion-parameters-microsoft-deepspeed-breeds-worlds-largest-nlp-model/) and Kunle Olukotun’s presentation at ScaledML

8B parameters, 512 GPUs: 20% efficiency

17B parameters, 1000 GPUs: 6% efficiency
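One way to read these two data points is as "effective" GPUs. The sketch below assumes that efficiency means the fraction of aggregate peak throughput actually delivered; the slide does not define the term, so this is an interpretation, and the function name is illustrative.

```python
def effective_gpus(num_gpus: int, efficiency: float) -> float:
    """GPUs' worth of useful work, assuming efficiency is a fraction of aggregate peak."""
    return num_gpus * efficiency


# (label, GPU count, efficiency) as quoted on the slide.
runs = [
    ("8B parameters", 512, 0.20),
    ("17B parameters", 1000, 0.06),
]

effective = {label: effective_gpus(n, eff) for label, n, eff in runs}
```

Under that reading, the larger run uses about twice as many GPUs yet delivers less useful work in total (about 60 versus about 102 GPUs' worth).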

8

Analysis Paralysis

9

What do we need?

Diagram labels: Application; Parallelism Strategy; Hardware Config.; Magic Box; Execution Time

10

What do we need?

Diagram labels: Application; Parallelism Strategy; Hardware Config.; Magic Box; Execution Time; Best* Time; Best* Hardware; Best* Parallelism Strategy

11

What do we need?

Diagram labels: Application; Parallelism Strategy; Hardware Config.; Magic Box; Execution Time; Design Constraints; Technology Parameters; Power Budget; Area Budget; Best* Time; Best* Hardware; Best* Parallelism Strategy

12

What do we need?

Diagram labels: Applications; Parallelism Strategy; Hardware Config.; Magic Box; Execution Time; Design Constraints; Technology Parameters; Power Budget; Area Budget; Best* Time; Best* Hardware; Best* Parallelism Strategy

• Today: Which accelerator meets my need?

• Tomorrow: Which technology is the most promising?

13

MechaFlow: A Software/Hardware/Technology Co-design Space Exploration Framework

14

MechaFlow: Telescopic View

Diagram labels: Applications; Parallelism Strategy; Hardware Config.; Technology Parameters; Power Budget; Area Budget; MechaFlow; Magic Box; Execution Time; Best* Time; Best* Hardware; Best* Parallelism Strategy
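The labels on this slide suggest a search: given an application and power and area budgets, find the parallelism strategy and hardware configuration with the best predicted execution time. The sketch below illustrates that idea only. It is not MechaFlow's implementation: the performance model is a toy stand-in for the "magic box", and every class, function, field and number in it is hypothetical.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Strategy:
    name: str          # label only
    ways: int          # total degree of parallelism
    comm_bytes: float  # hypothetical bytes each node exchanges per step


@dataclass(frozen=True)
class Hardware:
    name: str
    flops_per_s: float
    mem_bytes_per_s: float
    link_bytes_per_s: float
    power_w: float
    area_mm2: float


def predict_step_time(app_flops, app_bytes, strategy, hw):
    """Toy stand-in for the 'magic box': a roofline-style bound plus communication."""
    compute_s = app_flops / strategy.ways / hw.flops_per_s
    memory_s = app_bytes / strategy.ways / hw.mem_bytes_per_s
    comm_s = strategy.comm_bytes / hw.link_bytes_per_s
    return max(compute_s, memory_s) + comm_s


def explore(app_flops, app_bytes, strategies, hardware,
            power_budget_w, area_budget_mm2):
    """Return (time, strategy, hardware) with the lowest predicted time within budget."""
    feasible = [hw for hw in hardware
                if hw.power_w <= power_budget_w and hw.area_mm2 <= area_budget_mm2]
    if not feasible:
        raise ValueError("no hardware configuration fits the power and area budgets")
    candidates = [(predict_step_time(app_flops, app_bytes, s, hw), s, hw)
                  for s, hw in product(strategies, feasible)]
    return min(candidates, key=lambda c: c[0])


# Made-up inputs, purely for illustration.
strategies = [Strategy("strategy-A", 64, 4e9), Strategy("strategy-B", 64, 1e9)]
hardware = [
    Hardware("config-small", 1e14, 1e12, 1e11, power_w=250, area_mm2=800),
    Hardware("config-big", 2e14, 1e12, 1e11, power_w=400, area_mm2=900),
]
best_time, best_strategy, best_hw = explore(
    6.4e15, 6.4e13, strategies, hardware,
    power_budget_w=300, area_budget_mm2=1230,
)
```

With these made-up numbers the faster configuration is rejected for exceeding the power budget, so the search returns the smaller one paired with the strategy that communicates less. An exhaustive loop like this is only practical for small spaces; the slides do not say how MechaFlow searches.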

15

Case Studies

1. How much performance gain from co-designing hardware and parallelism strategy?

2. How much performance gain from upcoming packaging technologies?

16

Methodology

• Language Modeling
  • Word-language model (RNN-based LSTM)
  • SOTA: 18 billion parameters
  • Desired: 256 billion parameters (hidden: 19968, layers: 2, vocab: 800K, seq.: 20)

• Parallelism:
  • 64-way parallelism
  • Data parallelism
  • Model/Kernel parallelism: Row-Column (RC) and Column-Row (CR)
  • Pipeline/Layer parallelism
  • {RC or CR}-k{i}-k{j}-d{k}-l{m}: e.g. RC-k8-k2-d4-l1

• Hardware:
  • Baseline: V100
  • Design constraints: 300 W, 1230 mm²/node, 815 mm²/core
  • Technology: 14 nm
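The strategy labels can be decoded mechanically. The sketch below parses a label such as RC-k8-k2-d4-l1 and multiplies its factors. It assumes that the two k values are kernel-parallel degrees, d is the data-parallel degree, l is the layer-parallel degree, and that the total is their product. The slide does not state this outright, though the example (8 × 2 × 4 × 1 = 64) is consistent with the 64-way parallelism it lists. All names in the code are illustrative.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelismStrategy:
    kernel_scheme: str  # "RC" (Row-Column) or "CR" (Column-Row)
    kernel_dim1: int    # first k{...} factor
    kernel_dim2: int    # second k{...} factor
    data: int           # d{...} factor, assumed data-parallel degree
    layer: int          # l{...} factor, assumed layer/pipeline degree

    @property
    def ways(self) -> int:
        """Total degree of parallelism, assuming the factors multiply."""
        return self.kernel_dim1 * self.kernel_dim2 * self.data * self.layer


_LABEL = re.compile(r"^(RC|CR)-k(\d+)-k(\d+)-d(\d+)-l(\d+)$")


def parse_strategy(label: str) -> ParallelismStrategy:
    """Parse a label of the form {RC or CR}-k{i}-k{j}-d{k}-l{m}."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"not a valid strategy label: {label!r}")
    scheme, k1, k2, d, l = match.groups()
    return ParallelismStrategy(scheme, int(k1), int(k2), int(d), int(l))


example = parse_strategy("RC-k8-k2-d4-l1")
```

Here `example.ways` evaluates to 64 for the slide's example label.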

17

Q1. Co-design Hardware and Parallelism Strategy

How much performance gain?

18

Co-design Parallelism Strategy and HW Design?

Chart: y-axis Execution Time/Step (Sec.), 0 to 12; legend: Best Hardware per Parallelism Strategy, V100

19

Not so much gain from specialization to parallelism strategy

Chart: y-axis Execution Time/Step (Sec.), 0 to 15; legend: Best Hardware per Parallelism Strategy, V100, Best HW for Best Parallelism

20

No Single “Best”

Chart labels: Parallelism Strategy; Hardware Parameters; Specialized Hardware Configurations PPS; Rel. Speedup wrt. Specialized Hardware Per Parallelism Strategy (PPS); Relative Parameter wrt. V100

21

Q1 Summary

• Observation 1: Specializing hardware to each parallelism strategy yields little gain.

• Observation 2: There is no single best hardware; many distinct hardware design configurations are universally good.

22

Q2. Technology Trends

Which packaging technology is most promising?

23

To “SiIF” or Not to “SiIF”?

• SiIF (Silicon Interconnect Fabric): 64 nodes/wafer, 1 wafer

• MCM (multi-chip module): 4 nodes/wafer, 16 wafers

• Single: 1 node/wafer, 64 wafers

Chart: y-axis Time/Step (Sec.), 1.00 to 8.00; x-axis Parallelism Strategy (Sorted for SiIF), 0 to 50; legend: SiIF, Single, MCM
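A quick check shows that the three packaging options on this slide hold the total node count constant, so they differ in how the nodes are packaged rather than in how many there are. The sketch below multiplies the figures quoted in the bullets; the variable and function names are illustrative.

```python
# (nodes per wafer, number of wafers) as quoted on the slide.
packaging = {
    "SiIF": (64, 1),
    "MCM": (4, 16),
    "Single": (1, 64),
}


def total_nodes(nodes_per_wafer: int, wafers: int) -> int:
    """Total node count for one packaging option."""
    return nodes_per_wafer * wafers


totals = {name: total_nodes(*config) for name, config in packaging.items()}
```

Each option comes to 64 nodes, which matches the 64-way parallelism listed in the methodology.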

24

Conclusion

Slow-down of Moore’s law

Fast-growing computation demand

Cerebras

V100

A100

SambaNova

Groq

GraphCore

Habana

Cambricon

25

Saptadeep Pal Puneet Gupta

Joel Hestness Greg Diamos Kenneth Church