Accelerator-level Parallelism
Transcript of Accelerator-level Parallelism
[Slide 1]
Accelerator-level Parallelism
Mark D. Hill, Wisconsin & Vijay Janapa Reddi, Harvard
@ Technion (Virtually), June 2020
Aspects of this work on Mobile SoCs and Gables were developed while the authors were “interns” with Google’s Mobile Silicon Group. Thanks!
[Slide 2]
Accelerator-level Parallelism: Call to Action
Future apps demand much more computing
Standard tech scaling & architecture NOT sufficient
Mobile SoCs show a promising approach:
ALP = Parallelism among workload components concurrently executing on multiple accelerators (IPs)
Call to action to develop a "science" for ubiquitous ALP
[Slide 3]
Outline
I. Computer History & X-level Parallelism
II. Mobile SoCs as ALP Harbinger
III. Gables ALP SoC Model
IV. Call to Action for Accelerator-level Parallelism
[Slide 4]
20th Century Information & Communication Technology Has Changed Our World
• <long list omitted>
Required innovations in algorithms, applications, programming languages, …, & system software
Key (invisible) enablers of (cost-)performance gains:
• Semiconductor technology ("Moore's Law")
• Computer architecture (~80x per Danowitz et al.)
[Slide 5]
Enablers: Technology + Architecture
(Figure: performance gains split into Technology vs. Architecture contributions; Danowitz et al., CACM 04/2012)
[Slide 6]
How did Architecture Exploit Moore's Law?
MORE (& faster) transistors ⇒ even faster computers
Memory – transistors in parallel
• Vast semiconductor memory (DRAM)
• Cache hierarchy for fast-memory illusion
Processing – transistors in parallel
• Bit-, Instruction-, Thread-, & Data-level Parallelism
Now: Accelerator-level Parallelism
[Slide 7]
X-level Parallelism in Computer Architecture
(Diagram: 1 CPU: processor P, cache $, memory M, bus, interface i/f, device dev. BLP+ILP: Bit-/Instruction-Level Parallelism)
[Slide 8]
Bit-level Parallelism (BLP)
Early computers: few switches (transistors)
• ⇒ compute a result in many steps
• E.g., 1 multiplication partial product per cycle
Bit-level parallelism
• More transistors ⇒ compute more in parallel
• E.g., Wallace tree multiplier
Larger words help: 8b → 16b → 32b → 64b
Important: easy for software
NEW: smaller word sizes, e.g., machine learning inference accelerators
[Slide 9]
Instruction-level Parallelism (ILP)
Processors logically do instructions sequentially (time →): add, load, branch, and, store, load, …
Actually do instructions in parallel ⇒ ILP
Predict branch direction (taken target or fall-through) and Speculate! Speculate more!
E.g., Intel Skylake has a 224-entry reorder buffer w/ a 14-19-stage pipeline
Important: easy for software
IBM Stretch [1961]
[Slide 10]
X-level Parallelism in Computer Architecture
(Diagram: 1 CPU → Multiprocessor. BLP+ILP (Bit-/Instruction-Level Parallelism) + TLP (Thread-Level Parallelism))
[Slide 11]
Thread-level Parallelism (TLP)
• HW: multiple sequential processor cores
• SW: each runs an asynchronous thread
SW must partition work, synchronize, & manage communication
• E.g., pThreads, OpenMP, MPI
On-chip TLP called "multicore" – a forced choice
Less easy for software, but
• More TLP in cloud than desktop → cloud!!
• Bifurcation: experts program TLP; others use it
Intel Pentium Extreme Edition, early 2000s
CDC 6600, 1964 (TLP via multithreaded processor)
[Slide 12]
X-level Parallelism in Computer Architecture
(Diagram: 1 CPU → Multicore. BLP+ILP + TLP)
[Slide 13]
Data-level Parallelism (DLP)
Need the same operation on many data items; do it with parallelism ⇒ DLP
• Array of single instruction multiple data (SIMD)
• Deep pipelines like Cray vector machines
• Intel-like Streaming SIMD Extensions (SSE)
Broad DLP success awaited General-Purpose GPUs:
1. Single Instruction Multiple Thread (SIMT)
2. SW (CUDA) & libraries (math & ML)
3. Experimentation at $1-10K, not $1-10M
Bifurcation again: experts program SIMT (TLP+DLP); others use it
Illinois ILLIAC IV, 1966
NVIDIA Tesla
[Slide 14]
X-level Parallelism in Computer Architecture
(Diagram: 1 CPU → Multicore + Discrete GPU w/ device memory dev-M. BLP+ILP + TLP + DLP (Data-Level Parallelism))
[Slide 15]
X-level Parallelism in Computer Architecture
(Diagram: Multicore + Integrated GPU. BLP+ILP + TLP + DLP)
[Slide 16]
X-level Parallelism in Computer Architecture
[Slide 17]
Outline
I. Computer History & X-level Parallelism
II. Mobile SoCs as ALP Harbinger
III. Gables ALP SoC Model
IV. Call to Action for Accelerator-level Parallelism
[Slide 18]
X-level Parallelism in Computer Architecture
(Diagram: Multicore + Integrated GPU ⇒ System on a Chip (SoC). BLP+ILP + TLP + DLP + ALP (Accelerator-Level Parallelism))
[Slide 19]
Potential for Specialized Accelerators (IPs)
[Brodersen & Meng, 2002]
(Chart of example accelerators, incl.: 16 Encryption, 17 Hearing Aid, 18 FIR for disk read, 19 MPEG Encoder, 20 802.11 Baseband)
An accelerator is a hardware component that executes a targeted computation class faster & usually with (much) less energy.
[Slide 20]
CPU, GPU, xPU (i.e., Accelerators or IPs)
2019 Apple A12 w/ 42 accelerators
42? Really? The Hitchhiker's Guide to the Galaxy?
[Slide 21]
Example Usecase (recording 4K video)
Janapa Reddi et al., IEEE Micro, Jan/Feb 2019
ALP = Parallelism among workload components concurrently executing on multiple accelerators (IPs)
[Slide 22]
Mobile SoCs Run Usecases
Must run each usecase sufficiently fast -- no need faster
A usecase uses IPs concurrently: more ALP than serial
For each usecase, how much acceleration for each IP?
(Table: usecases (rows) vs. accelerators/IPs (columns): CPUs (AP), Display, Media Scaler, GPU, Image Signal Proc., JPEG, Pixel Visual Core, Video Decoder, Video Encoder, & dozens more; each X marks an IP the usecase uses)
• Photo Enhancing: X X X X X X
• Video Capture: X X X X X
• Video Capture HDR: X X X X X
• Video Playback: X X X X X
• Image Recognition: X X X X
[Slide 23]
ALP(t) = #IPs concurrently active at time t
(Plot: active IPs (0-10) vs. time to perform usecase (sec). Disclaimer: made-up data)
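Since the slide's plot is explicitly made-up data, the definition ALP(t) = #IPs active at time t can be sketched directly in code; the activity intervals below are equally hypothetical.

```python
# ALP(t) = number of IPs concurrently active at time t, computed from
# per-IP (start, end) activity windows. All interval data here is made up,
# mirroring the slide's "made-up data" disclaimer.

def alp(t, intervals):
    """Count IPs whose [start, end) activity window covers time t."""
    return sum(1 for start, end in intervals if start <= t < end)

# Hypothetical 4-IP usecase: e.g., CPU, ISP, video encoder, DSP (seconds)
intervals = [(0.0, 10.0), (0.5, 8.0), (1.0, 9.0), (2.0, 4.0)]

print(alp(0.2, intervals))  # 1: only the first IP is active
print(alp(3.0, intervals))  # 4: all four IPs overlap
```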
[Slide 24]
Outline
I. Computer History & X-level Parallelism
II. Mobile SoCs as ALP Harbinger
III. Gables ALP SoC Model [HPCA'19]
IV. Call to Action for Accelerator-level Parallelism
[Slide 25]
Mobile SoCs Hard To Program For and Select
Envision usecases (years ahead); port to many SoCs??
Diversity hinders use [Facebook, HPCA'19]
How to reason about SoC performance?
[Slide 26]
Mobile SoCs Hard To Design
Envision usecases (2-3 years ahead); select IPs; size IPs; design the uncore
Which accelerators? How big? How to even start?
[Slide 27]
Computer Architecture & Performance Models
Multiprocessor & Amdahl's Law; Multicore & Roofline
(Plot axes: insight vs. accuracy/effort)
Models vs. Simulation:
● More insight
● Less effort
● But less accuracy
Models give a first answer, not a final answer
Gables extends Roofline ⇒ first answer for SoC ALP
[Slide 28]
Roofline for Multicore Chips, 2009
Multicore HW:
• Ppeak = peak perf of all cores
• Bpeak = peak off-chip bandwidth
Multicore SW:
• I = operational intensity = #operations / #off-chip-bytes
• E.g., 2 ops / 16 bytes → I = 1/8
Output: Patt = upper bound on attainable performance
[Slide 29]
Roofline for Multicore Chips, 2009
Source: https://commons.wikimedia.org/wiki/File:Example_of_a_naive_Roofline_model.svg
(Plot: Patt (y) vs. I (x); flat roof at Ppeak, slanted roof at Bpeak * I)
Compute v. Communication: Op. Intensity (I) = #operations / #off-chip bytes
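The Roofline bound is one line of code. A minimal sketch (the peak numbers are illustrative, chosen to match the units-free values used in the later Gables example):

```python
# Roofline: attainable performance Patt is capped by peak compute or by
# peak off-chip bandwidth times operational intensity, whichever is lower.

def roofline(p_peak, b_peak, intensity):
    """Patt = min(Ppeak, Bpeak * I)."""
    return min(p_peak, b_peak * intensity)

# The slide's intensity example: 2 ops / 16 bytes -> I = 1/8.
# Illustrative peaks: Ppeak = 40, Bpeak = 10.
print(roofline(40, 10, 1 / 8))  # 1.25: bandwidth-bound
print(roofline(40, 10, 8))      # 40: compute-bound
```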
[Slide 30]
NEW: Gables, an ALP System on Chip (SoC) Model
Gables uses Roofline per IP to provide a first answer!
• SW: performance model of a "gabled roof"?
• HW: select & size accelerators
[Slide 31]
Gables for an N-IP SoC (A0 = 1)
(Diagram: IP[0] = CPUs w/ peak perf A0*Ppeak & bandwidth B0; IP[1] w/ A1*Ppeak & B1; …; IP[N-1] w/ AN-1*Ppeak & BN-1; all ← share off-chip Bpeak →)
Usecase at each IP[i]:
• Operational intensity Ii operations/byte
• Non-negative work fi (fi's sum to 1) w/ IPs in parallel
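The picture above turns into a small calculator. This is a sketch of the Gables bound as given in the HPCA'19 paper (each used IP is limited by its own roofline scaled by 1/fi, and all IPs share off-chip Bpeak); the function and argument names are mine:

```python
# Gables bound for an N-IP SoC (sketch of the HPCA'19 model):
#   per-IP roofline: min(A_i * Ppeak, B_i * I_i) / f_i  for each f_i > 0
#   shared memory:   Bpeak / sum_i(f_i / I_i)
# Attainable performance is the minimum of all these terms.

def gables(p_peak, b_peak, accel, bw, frac, intensity):
    bounds = [min(a * p_peak, b * i) / f
              for a, b, f, i in zip(accel, bw, frac, intensity) if f > 0]
    bounds.append(b_peak / sum(f / i for f, i in zip(frac, intensity) if f > 0))
    return min(bounds)

# Two-IP example from the following slides: CPUs (A0=1, B0=6), GPU (A1=5, B1=15)
print(gables(40, 10, [1, 5], [6, 15], [1.0, 0.0], [8, 0.1]))  # 40.0
print(gables(40, 20, [1, 5], [6, 15], [0.25, 0.75], [8, 8]))  # 160.0
```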
[Slide 32]
Example: Balanced Design Start w/ Gables
(TWO-IP SoC diagram: CPUs = IP[0] w/ Ppeak = 40 & B0 = 6; GPU = IP[1] w/ A1*Ppeak = 5*40 = 200 & B1 = 15; shared DRAM w/ Bpeak = 10)
Workload (usecase): f0 = 1 & f1 = 0; I0 = 8 (good caching); I1 = 0.1 (latency tolerant)
Performance?
[Slide 33]
Perf limited by IP[0] at I0 = 8
IP[1] not used → no roofline
Let's assign IP[1] work: f1 = 0 → 0.75
(Parameters: Ppeak = 40, Bpeak = 10, A1 = 5, B0 = 6, B1 = 15, f1 = 0, I0 = 8, I1 = 0.1)
[Slide 34]
IP[1] present, but Perf drops to 1! Why?
I1 = 0.1 → memory bottleneck
Enhance Bpeak = 10 → 30 (at a cost)
(Parameters: Ppeak = 40, Bpeak = 10, A1 = 5, B0 = 6, B1 = 15, f1 = 0.75, I0 = 8, I1 = 0.1)
[Slide 35]
Perf only 2, with IP[1] the bottleneck
IP[1] SRAM/reuse: I1 = 0.1 → 8
Reduce overkill Bpeak = 30 → 20
(Parameters: Ppeak = 40, Bpeak = 30, A1 = 5, B0 = 6, B1 = 15, f1 = 0.75, I0 = 8, I1 = 0.1)
[Slide 36]
Perf = 160 < A1*Ppeak = 200. Can you do better? It's possible!
(Parameters: Ppeak = 40, Bpeak = 20, A1 = 5, B0 = 6, B1 = 15, f1 = 0.75, I0 = 8, I1 = 8)
Usecases using K accelerators → Gables has K+1 rooflines
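The four design steps above can be replayed numerically with the min-of-rooflines rule (per used IP: min(Ai*Ppeak, Bi*Ii)/fi; shared: Bpeak / Σ fi/Ii). A sketch, not the authors' tool; the step parameters are taken from the slides:

```python
# Replay the two-IP design walkthrough: A1=5, B0=6, B1=15, I0=8 fixed;
# each step changes (Bpeak, f1, I1) as on the slides.

P_PEAK, A1, B0, B1, I0 = 40, 5, 6, 15, 8

def perf(b_peak, f1, i1):
    f0 = 1 - f1
    bounds = [min(P_PEAK, B0 * I0) / f0,                   # IP[0] roofline
              b_peak / (f0 / I0 + (f1 / i1 if f1 else 0.0))]  # shared BW
    if f1:
        bounds.append(min(A1 * P_PEAK, B1 * i1) / f1)      # IP[1] roofline
    return min(bounds)

print(perf(10, 0.00, 0.1))  # 40.0: IP[0]-limited
print(perf(10, 0.75, 0.1))  # ~1.33: shared-bandwidth bottleneck ("drops to 1")
print(perf(30, 0.75, 0.1))  # 2.0: IP[1] roofline bottleneck
print(perf(20, 0.75, 8.0))  # 160.0: balanced
```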
[Slide 37]
Into Synopsys design flow << 6 months after publication!
[Slide 38]
Two cases where Gables >> Actual:
1. Communication between two IP blocks
• Root cause: too few buffers to cover communication latency
• Little's Law: # outstanding msgs = avg latency * avg BW
• https://www.sigarch.org/three-other-models-of-computer-system-performance-part-1/
• Solution: add buffers; actual performance → Gables
2. More complex interaction among IP blocks
• Root cause: usecase work (task graph) not completely parallel
• Solution: no change, but a useful double-check
Case Study: IT Company + Synopsys
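Little's Law, quoted in case 1 above, gives a quick way to size those buffers; the latency and message-rate numbers below are hypothetical:

```python
# Little's Law: # outstanding messages = avg latency * avg message rate.
# An IP-to-IP link needs at least that many buffers to sustain its target
# bandwidth; with fewer, achieved BW falls short of the Gables bound.

def outstanding_msgs(avg_latency_s, msg_rate_per_s):
    """Messages that must be in flight to keep the link busy."""
    return avg_latency_s * msg_rate_per_s

# Hypothetical link: 200 ns round-trip latency at 100M messages/s
print(round(outstanding_msgs(200e-9, 100e6)))  # 20 buffers
```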
[Slide 39]
Case Study: Allocating SRAM
Where to put SRAM?
● Private w/i each IP
● Shared resource
(Diagram: IP0, IP1, IP2 each w/ private SRAM vs. one SHARED pool)
[Slide 40]
Does more IP[i] SRAM help Op. Intensity (Ii)?
It is a non-linear function that jumps when a new footprint/working set fits
Consider these plots when sizing IP[i] SRAM
Later evaluation can use simulated performance on the y-axis
(Plot: Ii (y) vs. IP[i] SRAM (x): a staircase from "not much fits" → "small W/S fits" → "medium W/S fits" → "large W/S fits"; W/S = working set)
Compute v. Communication: Op. Intensity (I) = #operations / #off-chip bytes
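The staircase above is easy to model; the working-set sizes and intensity values below are invented for illustration:

```python
# Operational intensity I_i as a step function of IP[i] SRAM: it jumps
# each time another working set (W/S) fits. All numbers are made up.

WORKING_SETS = [  # (SRAM needed in KiB, resulting I_i in ops/byte)
    (0,    0.1),  # not much fits
    (64,   1.0),  # small W/S fits
    (256,  4.0),  # medium W/S fits
    (1024, 8.0),  # large W/S fits
]

def intensity(sram_kib):
    """I_i for the largest working set that fits in the given SRAM."""
    return max(i for need, i in WORKING_SETS if sram_kib >= need)

print(intensity(32))    # 0.1
print(intensity(512))   # 4.0
print(intensity(2048))  # 8.0
```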
[Slide 41]
Gables Home Page: http://research.cs.wisc.edu/multifacet/gables/
• [HPCA'19] paper
• Model extensions
• Interactive tool
• Gables Android source at GitHub
[Slide 42]
Mobile System on Chip (SoC) & Gables
SW: map a usecase to IPs w/ many BWs & accelerations
HW: is IP[i] under/over-provisioned for BW or acceleration?
Gables, like Amdahl's Law, gives intuition & a first answer
But still missing: an SoC "architecture" & programming model
[Slide 43]
Outline
I. Computer History & X-level Parallelism
II. Mobile SoCs as ALP Harbinger
III. Gables ALP SoC Model
IV. Call to Action for Accelerator-level Parallelism
[Slide 44]
Future Apps Demand Much More Computing
[Slide 45]
Accelerator-level Parallelism: Call to Action
Future apps demand much more computing
Standard tech scaling & architecture NOT sufficient
Mobile SoCs show a promising approach:
ALP = Parallelism among workload components concurrently executing on multiple accelerators (IPs)
Call to action to develop a "science" for ubiquitous ALP:
• An SoC architecture that exposes & hides?
• A whole-SoC programming model/runtime?
[Slide 46]
ALP/SoC Software Descent to Hellfire!
(Diagram: a staircase descending toward "Hellfire!"; key: P = processor core, A-E = accelerators)
• Uniprocessor (P): no visible parallelism
• Homogeneous Multicore (P P P P …): any thread-level parallelism, e.g., homogeneous
• Heterogeneous Multicore: thought bridge: must divide work heterogeneously
• Heterogeneous Accelerators (P's + A-E): accelerate each differently with unique HLLs (DSLs) & SDKs
• Today: Device Accelerators: all of the above & hidden in many kernel drivers
A local SW stack abstracts each accelerator.
But no good, general SW abstraction for SoC ALP!
[Slide 47]
SW+HW Lessons from GP-GPUs?
(Nvidia GK110: BLP+TLP+DLP)

| Feature | Then | Now |
|---|---|---|
| 1. Programming | Graphics OpenGL | SIMT (CUDA/OpenCL/HIP) |
| 2. Concurrency | Either CPU or GPU only | Finer-grain interaction; intra-GPU mechanisms |
| 3. Communication | Copy data between host & device memories | Maybe shared memory, sometimes coherence |
| 4. Design | Driven by graphics only; GP: $0B market | GP a major player, e.g., deep neural networks |

Programming for data-level parallelism took four decades: SIMD → Vectors → SSE → SIMT!
[Slide 48]
SW+HW Directions for ALP?

| Feature | Now | Future? |
|---|---|---|
| 1. Programming | Local: per-IP DSL & SDK; global: ad hoc | Abstract ALP/SoC like SIMT does for GP-GPUs |
| 2. Concurrency | Ad hoc | GP-GPU-like scheduling? Virtualize/partition IPs? |
| 3. Communication | SW: up/down the OS stack; HW: via off-chip memory | SW+HW for queue pairs? Want control/data planes |
| 4. Design, e.g., select, combine, & size IPs | Ad hoc | Make a "science"; speed with tools/frameworks |

Apple A12: BLP+ILP+TLP+DLP+ALP
Need programmability for broad success!!!! In less than four decades?
[Slide 49]
Challenges & Opportunities
1. Programmability: whither a global model/runtime? A DAG of streams for SoCs?
2. Concurrency: HW assist for scheduling? Virtualize & partition?
3. Communication: how should the SW stack reason about local/global memory, caches, queues, & scratchpads?
4. Design Space: when to combine "similar" accelerators? Power vs. area?
Hennessy & Patterson: A New Golden Age for Computer Architecture Science
[Slide 50]
New Feb 2020!