Page 1:

Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures

Ardavan Pedram, Robert van de Geijn, Andreas Gerstlauer

Page 2:

Outline

• Motivation and Vision
• Related Works and Background
• Linear Algebra Processor
• Power/Performance Analysis
• Conclusion and Future Work

04/11/23 © Ardavan Pedram 2012

Page 4:

Trend of processors

• Technology scaling has reached physical limits
  – Power is the limit on performance
• Chips may contain dark silicon
  – Only a percentage of the chip might be active

Page 5:

Heterogeneous Solution

• Increase power efficiency (GFLOPS/W)
• More cores with lower frequency and power
• Specialized cores
  – Orders of magnitude better power efficiency (GFLOPS/W)
  – Expensive
  – Long time to market

[Figure: Nvidia Tegra System on Chip]

Page 6:

Linear Algebra Processor Design Goals

• Efficiency of full custom hardware
• Orders of magnitude improvement
• Achieving upper limits of power/performance ratio
• Flexibility to execute a whole class of coarse-grain operations
• Co-optimized and co-designed across all layers
• Targeting linear algebra applications

[Figure source: Andreas Olofsson]

Page 7:

Linear Algebra Routines

• Linear Algebra Package (LAPACK) level
  – Cholesky and QR factorization
• Basic Linear Algebra Subprograms (BLAS)
  – General matrix-matrix multiplication (GEMM)
• Inner kernels
  – Hand-optimized
• GEMM is often what delivers high performance to many crucial applications

Page 8:

Outline

• Motivation and Vision
• Related Works and Background
• Linear Algebra Processor
• Power/Performance Analysis
• Conclusion and Future Work

Page 9:

GEMM Implementations

• CPUs: 95% peak
  – [Goto et al. 2008] [Intel MKL]
• GPUs: 70% peak
  – [Nath et al. 2010] Nvidia Fermi
  – [Volkov et al. 2008] Nvidia Tesla
• FPGAs: 99% peak
  – [Zikari et al. 2007]
  – [Zhuo et al. 2008]
• Specialized architectures
  – Clearspeed CSX: 78% peak
  – Systolic arrays: [Lippert et al. 2001]

• Intel Quad core
  – 40 GFLOPS @ 2.6 GHz
• Nvidia Fermi
  – 350 GFLOPS @ 1.15 GHz
• Altera Stratix IV
  – 100 GFLOPS @ 0.4 GHz
• CSX 700
  – 75 GFLOPS @ 0.25 GHz

Page 10:

Common Sources of Inefficiencies in Conventional Architectures

• CPUs & GPUs
  – Instruction handling
  – Multi-ported register file
  – Cache overheads: tags and coherency
  – Thread scheduling
• FPGAs
  – Low area efficiency
• Specialized architectures
  – Data communication overheads

Page 11:

Outline

• Motivation and Vision
• Related Works and Background
• Linear Algebra Processor
• Power/Performance Modeling
• Generalization
• Conclusion and Future Work

Page 12:

Matrix Multiplication Hierarchy

• Fastest general-purpose implementation of GEMM [GotoBLAS]

[Figure: matrices C, A, and B]

Page 13:

Rank-1 Update

• Rank-1 update: updates a matrix by adding the outer product of two vectors to it

Matrix multiplication using a series of rank-1 updates: let C, A, and B be 4×4, 4×kc, and kc×4 matrices. C += AB can be computed as:

for i = 0 to kc-1
    C += A(:, i) B(i, :)
end for

where A(:, i) is column i of A and B(i, :) is row i of B.

[Figure: matrices C, A, and B]
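The rank-1 formulation on this slide can be sketched in a few lines of plain Python. This is only an illustration: the function names `rank1_update` and `gemm_by_rank1_updates` and the sample matrices are made up for the example, not taken from the talk.

```python
def rank1_update(C, a_col, b_row):
    """In place: add the outer product of a_col and b_row to C."""
    for r, a in enumerate(a_col):
        for c, b in enumerate(b_row):
            C[r][c] += a * b


def gemm_by_rank1_updates(C, A, B):
    """Compute C += A * B as kc rank-1 updates (A is m x kc, B is kc x n)."""
    kc = len(B)
    for i in range(kc):
        a_col = [row[i] for row in A]  # column i of A
        b_row = B[i]                   # row i of B
        rank1_update(C, a_col, b_row)
    return C


# Hypothetical sample data: A is 4 x kc and B is kc x 4 with kc = 2.
A = [[1, 2], [3, 4], [5, 6], [7, 8]]
B = [[1, 0, 2, 0], [0, 1, 0, 2]]
C = [[0] * 4 for _ in range(4)]
gemm_by_rank1_updates(C, A, B)
```

After the call, `C` holds the ordinary matrix product of `A` and `B`, accumulated one outer product at a time.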

Page 14:

Linear Algebra Core (LAC) Design

• Customized for rank-1 update
  – 2D arrangement of processing elements (PEs)
  – Broadcast buses
• Integrates into memory hierarchy

Page 15:

Memory Hierarchy

[Figure: Main Memory, On-Chip Memory, and Core Local stores, with matrices A, B, and C]

$C \mathrel{+}= A_0 B_0 + \cdots + A_{K-1} B_{K-1}$

Page 16:

Memory Hierarchy

[Figure: On-Chip Memory and Core Local stores, with matrices C, A, and B]

$C_i \mathrel{+}= A_{i,p} B_p$

Page 17:

Memory Hierarchy

[Figure: Main Memory, On-Chip Memory, and Core Local stores, with matrices C, A, and B]

$C_{i,j} \mathrel{+}= A_{i,p} B_{p,j}$

Page 18:

Memory Hierarchy

[Figure: Main Memory, On-Chip Memory, and Core Local stores, with matrices C, A, and B]
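The three update formulas on the Memory Hierarchy slides can be read as three nested levels of blocking. The following is a rough sketch under that reading, not the authors' implementation: the function name `blocked_gemm` and the block-size parameters `mc` and `nr` are hypothetical (only `kc` appears in the slides), and which loop corresponds to which memory level is an assumption.

```python
def blocked_gemm(C, A, B, kc, mc, nr):
    """Accumulate C += A * B using three levels of blocking."""
    m, k, n = len(A), len(B), len(B[0])
    for p in range(0, k, kc):          # C += A_p B_p (panels along k)
        for i in range(0, m, mc):      # C_i += A_{i,p} B_p (row blocks)
            for j in range(0, n, nr):  # C_{i,j} += A_{i,p} B_{p,j}
                # Innermost block: small enough to live in local stores.
                for r in range(i, min(i + mc, m)):
                    for c in range(j, min(j + nr, n)):
                        acc = 0
                        for t in range(p, min(p + kc, k)):
                            acc += A[r][t] * B[t][c]
                        C[r][c] += acc
    return C
```

Block sizes that do not divide the matrix dimensions are handled by the `min` bounds, so the result matches an unblocked product for any positive `kc`, `mc`, and `nr`.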

Page 19:

Design of Linear Algebra Core (LAC)

• Distributed memory architecture
• Broadcast buses

Page 20:

Data Mapping on LAC

[Figure: Mapping of A16x16 on 4x4 2D arrangement of PEs]

4x4 2D arrangement of PEs:

PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)
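The figure that showed the actual mapping did not survive in this transcript, so the distribution itself is not recoverable from the text. As one plausible illustration only, the sketch below assumes a 2D round-robin (cyclic) distribution, in which element (i, j) of the 16x16 matrix lives on PE(i mod 4, j mod 4). The function names `owner_pe` and `local_elements` are hypothetical.

```python
def owner_pe(i, j, nr=4):
    """PE coordinates owning element (i, j) under an ASSUMED cyclic mapping."""
    return (i % nr, j % nr)


def local_elements(pe_row, pe_col, n=16, nr=4):
    """All (i, j) indices stored on PE(pe_row, pe_col) under that assumption."""
    return [(i, j)
            for i in range(pe_row, n, nr)
            for j in range(pe_col, n, nr)]
```

Under this assumption each of the 16 PEs stores a 4x4 subset (16 elements) of the 16x16 matrix, and every element has exactly one owner.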

Page 22:

Rank-1 Update

$$
\begin{array}{cccc}
c_{11} \mathrel{+}= a_{1i} \times b_{i1} & c_{12} \mathrel{+}= a_{1i} \times b_{i2} & c_{13} \mathrel{+}= a_{1i} \times b_{i3} & c_{14} \mathrel{+}= a_{1i} \times b_{i4} \\
c_{21} \mathrel{+}= a_{2i} \times b_{i1} & c_{22} \mathrel{+}= a_{2i} \times b_{i2} & c_{23} \mathrel{+}= a_{2i} \times b_{i3} & c_{24} \mathrel{+}= a_{2i} \times b_{i4} \\
c_{31} \mathrel{+}= a_{3i} \times b_{i1} & c_{32} \mathrel{+}= a_{3i} \times b_{i2} & c_{33} \mathrel{+}= a_{3i} \times b_{i3} & c_{34} \mathrel{+}= a_{3i} \times b_{i4} \\
c_{41} \mathrel{+}= a_{4i} \times b_{i1} & c_{42} \mathrel{+}= a_{4i} \times b_{i2} & c_{43} \mathrel{+}= a_{4i} \times b_{i3} & c_{44} \mathrel{+}= a_{4i} \times b_{i4}
\end{array}
$$

Legend: orange = elements of A, green = elements of B, blue = elements of C
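The per-PE updates on this slide suggest a simple simulation: in step i, every PE in row r sees the same a value and every PE in column c sees the same b value, and each PE performs one multiply-accumulate. The sketch below is a toy model only. The `PE` class, the `macs` counter, and the choice of which bus carries which operand are assumptions read off the index pattern, not details confirmed by the slides.

```python
class PE:
    """Toy processing element with one local accumulator for an element of C."""

    def __init__(self):
        self.acc = 0   # local element of C
        self.macs = 0  # multiply-accumulates performed so far

    def mac(self, a, b):
        self.acc += a * b
        self.macs += 1


def run_rank1_updates(A, B, nr=4):
    """Simulate kc rank-1 updates on an nr x nr grid (A: nr x kc, B: kc x nr)."""
    grid = [[PE() for _ in range(nr)] for _ in range(nr)]
    for i in range(len(B)):
        row_bus = [A[r][i] for r in range(nr)]  # a_{r,i}, shared along row r
        col_bus = B[i]                          # b_{i,c}, shared along column c
        for r in range(nr):
            for c in range(nr):
                grid[r][c].mac(row_bus[r], col_bus[c])
    return grid
```

In this model every PE does exactly one multiply-accumulate per step, so all PEs stay busy for the whole computation.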

Page 23:

GEMM on LAP

Page 24:

Multi LAC on Chip

• Same panel of B for all cores
• On-chip memory stores a complete n×n block of C
• Each core computes a different panel of C

[Figure labels: LAC 0, LAC 1, LAC 2, Memory]
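The work split described on this slide can be illustrated with a small sketch in which every core reads the same B while updating its own panel of C. Several details are assumptions: the slide does not say that panels are contiguous row ranges, the names `split_rows` and `multi_lac_gemm` are hypothetical, and the sequential loop over cores merely stands in for cores that would run in parallel.

```python
def split_rows(m, num_cores):
    """Contiguous row ranges, one per core (an assumed partitioning)."""
    base, extra = divmod(m, num_cores)
    ranges, start = [], 0
    for core in range(num_cores):
        stop = start + base + (1 if core < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def multi_lac_gemm(C, A, B, num_cores=3):
    """C += A * B, with each core updating only its own row panel of C."""
    k, n = len(B), len(B[0])
    for start, stop in split_rows(len(A), num_cores):
        # Work of one core: its rows of A and C, plus the shared panel of B.
        for r in range(start, stop):
            for c in range(n):
                C[r][c] += sum(A[r][t] * B[t][c] for t in range(k))
    return C
```

Because the panels of C are disjoint, the cores never write to the same element, which is what makes sharing B sufficient in this sketch.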

Page 25:

Outline

• Motivation and Vision
• Related Works and Background
• Linear Algebra Core
• Power/Performance Analysis
• Conclusion and Future Work

Page 26:

Performance and Power Analysis

• Analytical formulae
  – Utilization
  – Bandwidth
  – Size of local stores
• Cycle-accurate simulator
  – Matrix multiplication
  – Cholesky factorization
• Component selections
  – MAC units (45 nm) [Galal et al. 2010]
  – Storage model with [CACTI 6.0]
    • Pure SRAM model
  – Interconnect
    • AMBA AHB [Lahiri 2004]
    • [Wolkotte 2009]
  – Activity of components based on GEMM
  – Leakage as 25%~30% of dynamic power

Page 27:

Core Utilization Trade-off

• Bandwidth vs. local memory size trade-off
• 100% utilization
• Core dimension trade-off

Page 28:

Multi-LAC Solution Trade-off

• On-chip memory limits performance
• On-chip bandwidth requirement grows exponentially to maintain peak performance

Page 29: Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures Ardavan Pedram Robert van de Geijn Andreas Gerstlauer.

• 33 GB/s off-chip BW

• Over 600 DP-GFLOPS

• Over 90% utilization

Performance vs. External Bandwidth

04/11/23 29

256x256 /512x512 / 768x768 /1024x1024

Page 30:

PE Efficiency for Different Frequencies

• Area
  – Mostly occupied by SRAM
• Power
  – Mostly consumed by MAC units
• 120 GFLOPS/W
  – Upper limit for SP-PE
• 60 GFLOPS/W
  – Upper limit for DP-PE
• 1 GHz sweet spot of performance vs. efficiency
• Low voltages
  – SRAM power consumption limits efficiency

Page 31:

LAP vs. Intel® Core2 Duo Penryn

• Power breakdown
  – [V George et al. 2007]
• Out of Order and Frontend
  – 40% of the core power (over 5 W)
• Execution logic
  – Register file

Page 32:

LAP vs. GTX280 Nvidia Tesla

• Single Precision GEMM

Page 33:

LAP vs. GTX480 Nvidia Fermi

Page 34:

Summary of LAP

• 600/1200 DP/SP-GFLOPS
• One/two orders of magnitude improvements vs. GPUs/CPUs

Page 35:

GEMM Performance and Efficiency on Different Platforms

Platform                  GFLOPS  W/mm²  GFLOPS/mm²  GFLOPS/W  Utilization
Cell BE (SP)              200     0.3    1.5         5         88%
NVidia GTX480 SM (SP)     780     0.2    0.9         5.2       70%
NVidia GTX480 SM (DP)     390     0.2    0.5         2.6       70%
Intel Core-i7 960 (SP)    96      0.4    0.5         1.2       95%
Intel Core-i7 960 (DP)    48      0.4    0.25        0.6       95%
Altera Stratix IV (DP)    100     0.02   0.05        3.5       90+%
ClearSpeed CSX700 (DP)    75      0.02   0.2         12.5      78%
LAP (SP)                  1200    0.2    6-11        55        90+%
LAP (DP)                  600     0.2    3-5         25        90+%
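As a small worked example of how the columns relate, the sketch below takes the GFLOPS and GFLOPS/W values of three double-precision rows from the table and derives the power they imply and the efficiency ratio between platforms. The helper names `implied_power_watts` and `efficiency_gain` are hypothetical, and the derived numbers are simple arithmetic on the table entries, not additional measurements.

```python
# (GFLOPS, GFLOPS/W) copied from the double-precision rows of the table.
platforms = {
    "NVidia GTX480 SM (DP)": (390, 2.6),
    "Intel Core-i7 960 (DP)": (48, 0.6),
    "LAP (DP)": (600, 25),
}


def implied_power_watts(gflops, gflops_per_watt):
    """Power implied by a throughput and an efficiency figure."""
    return gflops / gflops_per_watt


def efficiency_gain(name, baseline):
    """Ratio of GFLOPS/W between two platforms in the table."""
    return platforms[name][1] / platforms[baseline][1]


for name, (gflops, eff) in platforms.items():
    print(f"{name}: about {implied_power_watts(gflops, eff):.0f} W")
print(f"LAP vs GPU: {efficiency_gain('LAP (DP)', 'NVidia GTX480 SM (DP)'):.1f}x")
print(f"LAP vs CPU: {efficiency_gain('LAP (DP)', 'Intel Core-i7 960 (DP)'):.1f}x")
```

The ratios come out at a little under 10x against the GPU row and a little over 40x against the CPU row.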

Page 36:

Outline

• Motivation and Vision
• Related Works and Background
• Linear Algebra Core
• Power/Performance Analysis
• Conclusion and Future Work

Page 37:

Conclusion

• Linear Algebra Processor
  – Algorithm/architecture co-design
  – Power and efficiency estimation
  – Generalized to more complex algorithms (Cholesky)
  – Results @ 1 GHz
    • DP: 32 GFLOPS, 47 GFLOPS/W
    • 0.6 Watts
    • 2.8 mm² in 45 nm
    • 4 GB/s external BW
    • Orders of magnitude improvement

Page 38:

Conclusion

• Studied architectures and the sources of their power consumption

Page 39:

Future Work

• Implementation
  – Hardware synthesis
• Generalization
  – Level-3 BLAS
  – LU and QR factorization
• Integration within a general purpose framework
• Design space exploration
  – Picking the right algorithm variant