Codesign Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures
Ardavan Pedram, Robert van de Geijn, Andreas Gerstlauer
Outline
• Motivation and Vision
• Related Works and Background
• Linear Algebra Processor
• Power/Performance Analysis
• Conclusion and Future Work
Trends in Processors
• Technology scaling has reached physical limits
– Power is now the limit on performance
• Dark silicon on the chip
– Only a fraction of the chip may be active at any time
Heterogeneous Solution
• Increase power efficiency: GFLOPS/W
• More cores running at lower frequency and power
• Specialized cores: orders of magnitude better power efficiency (GFLOPS/W), but expensive, with long time to market
[Figure: Nvidia Tegra System on Chip]
Linear Algebra Processor Design Goals
• Efficiency of full-custom hardware
– Orders of magnitude improvement
– Approaching the upper limits of the power/performance ratio
• Flexibility to execute a whole class of coarse-grain operations
• Co-optimized and co-designed across all layers
• Targeting linear algebra applications
Source: Andreas Olofsson
Linear Algebra Routines
• Linear Algebra Package (LAPACK) level
– Cholesky and QR factorization
• Basic Linear Algebra Subprograms (BLAS)
– General matrix-matrix multiplication (GEMM)
• Inner kernels
– Hand-optimized
• GEMM is often what delivers high performance to many crucial applications (a minimal reference implementation follows)
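As a point of reference for what GEMM computes, here is a minimal sketch in C; the function name, column-major layout, and leading-dimension arguments are illustrative conventions, not from the slides:

```c
#include <stddef.h>

/* Reference GEMM: C += A * B, all matrices column-major.
 * A is m x k, B is k x n, C is m x n; lda/ldb/ldc are leading dimensions. */
void gemm_ref(size_t m, size_t n, size_t k,
              const double *A, size_t lda,
              const double *B, size_t ldb,
              double *C, size_t ldc)
{
    for (size_t j = 0; j < n; j++)
        for (size_t p = 0; p < k; p++)
            for (size_t i = 0; i < m; i++)
                C[i + j * ldc] += A[i + p * lda] * B[p + j * ldb];
}
```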
Outline
• Motivation and Vision
• Related Works and Background
• Linear Algebra Processor
• Power/Performance Analysis
• Conclusion and Future Work
GEMM Implementations
• CPUs: 95% of peak
– [Goto et al. 2008], [Intel MKL]
– Intel quad-core: 40 GFLOPS @ 2.6 GHz
• GPUs: 70% of peak
– [Nath et al. 2010] Nvidia Fermi: 350 GFLOPS @ 1.15 GHz
– [Volkov et al. 2008] Nvidia Tesla
• FPGAs: 99% of peak
– [Zikari et al. 2007], [Zhuo et al. 2008]
– Altera Stratix IV: 100 GFLOPS @ 0.4 GHz
• Specialized architectures
– ClearSpeed CSX700: 78% of peak, 75 GFLOPS @ 0.25 GHz
– Systolic arrays: [Lippert et al. 2001]
Common Sources of Inefficiency in Conventional Architectures
• CPUs & GPUs
– Instruction handling
– Multi-ported register files
– Cache overheads: tags and coherency
– Thread scheduling
• FPGAs
– Low area efficiency
• Specialized architectures
– Data communication overheads
Outline
• Motivation and Vision
• Related Works and Background
• Linear Algebra Processor
• Power/Performance Modeling
• Generalization
• Conclusion and Future Work
Matrix Multiplication Hierarchy
• Fastest general-purpose implementation of GEMM [GotoBLAS]
[Figure: GotoBLAS blocking hierarchy for C += A B]
Rank-1 Update
• Rank-1 update: updates a matrix by adding the outer product of two vectors to it
• Matrix multiplication as a series of rank-1 updates: let C, A, and B be 4×4, 4×kc, and kc×4 matrices. C += AB can be computed as:

for i = 0 to kc-1
    C += A(:,i) × B(i,:)    (one rank-1 update)
end for

(a runnable C version follows)
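A runnable sketch of the loop above; the row-major layout and the name `gemm_by_rank1` are my conventions, not the slides':

```c
#include <stddef.h>

enum { N = 4 };  /* C is N x N, matching the slide's 4x4 example */

/* C += A*B computed as kc successive rank-1 updates.
 * C is 4x4, A is 4xkc, B is kcx4, all row-major. */
void gemm_by_rank1(size_t kc, const double *A, const double *B,
                   double C[N][N])
{
    for (size_t i = 0; i < kc; i++)        /* one rank-1 update per step  */
        for (size_t r = 0; r < N; r++)     /* column i of A               */
            for (size_t c = 0; c < N; c++) /* row i of B                  */
                C[r][c] += A[r * kc + i] * B[i * N + c];
}
```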
Linear Algebra Core (LAC) Design
• Customized for rank-1 updates
– 2D arrangement of PEs
– Broadcast buses
• Integrates into the memory hierarchy
Memory Hierarchy

[Figure sequence: GEMM blocked across the memory hierarchy, with blocks of C, A, and B staged from main memory through on-chip memory into the core-local stores]

• C += A0B0 + … + AK-1BK-1
• Ci += Ai,pBp
• Ci,j += Ai,pBp,j

A blocked-loop sketch of these three refinement levels follows.
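A compact sketch of how the three refinements map onto loop blocking; the block sizes KC and NR and the row-major layout are illustrative assumptions, not the paper's parameters:

```c
#include <stddef.h>

enum { KC = 64, NR = 4 };  /* illustrative block sizes */

/* C += A*B, blocked as in the slides: the k dimension is split into
 * KC-wide panels (staged from main memory to on-chip memory), and C is
 * split into NRxNR blocks (staged into the core-local stores), where a
 * rank-1-update kernel does the work. All matrices row-major;
 * m, n assumed multiples of NR and k a multiple of KC. */
void gemm_blocked(size_t m, size_t n, size_t k,
                  const double *A, const double *B, double *C)
{
    for (size_t p = 0; p < k; p += KC)          /* C += Ap * Bp          */
        for (size_t i = 0; i < m; i += NR)      /* Ci += Ai,p * Bp       */
            for (size_t j = 0; j < n; j += NR)  /* Ci,j += Ai,p * Bp,j   */
                for (size_t kk = p; kk < p + KC; kk++)  /* rank-1 updates */
                    for (size_t r = i; r < i + NR; r++)
                        for (size_t c = j; c < j + NR; c++)
                            C[r * n + c] += A[r * k + kk] * B[kk * n + c];
}
```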
Design of Linear Algebra Core (LAC)
• Distributed memory architecture
• Broadcast buses
Data Mapping on LAC

PE(0,0) PE(0,1) PE(0,2) PE(0,3)
PE(1,0) PE(1,1) PE(1,2) PE(1,3)
PE(2,0) PE(2,1) PE(2,2) PE(2,3)
PE(3,0) PE(3,1) PE(3,2) PE(3,3)

Mapping of a 16×16 matrix A onto a 4×4 2D arrangement of PEs (a sketch of the implied owner computation follows)
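The grid suggests a 2D round-robin (cyclic) distribution. Here is a sketch of such an owner computation, under the assumption that element (i, j) lives in PE(i mod 4, j mod 4); the slide itself shows only the PE grid:

```c
#include <stdio.h>

enum { NR = 4 };  /* 4x4 PE array */

/* Assumed 2D round-robin (cyclic) mapping: element (i, j) of the
 * 16x16 matrix lives in PE(i % NR, j % NR). */
static void owner(int i, int j, int *pe_row, int *pe_col)
{
    *pe_row = i % NR;
    *pe_col = j % NR;
}

int main(void)
{
    int r, c;
    owner(5, 10, &r, &c);
    printf("a(5,10) -> PE(%d,%d)\n", r, c);  /* prints PE(1,2) */
    return 0;
}
```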
Rank-1 Update

c11 += a1i×bi1   c12 += a1i×bi2   c13 += a1i×bi3   c14 += a1i×bi4
c21 += a2i×bi1   c22 += a2i×bi2   c23 += a2i×bi3   c24 += a2i×bi4
c31 += a3i×bi1   c32 += a3i×bi2   c33 += a3i×bi3   c34 += a3i×bi4
c41 += a4i×bi1   c42 += a4i×bi2   c43 += a4i×bi3   c44 += a4i×bi4

Orange: elements of A; Green: elements of B; Blue: elements of C
(a software model of this step follows)
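A software model of one such step, assuming the broadcast-bus dataflow described earlier: the current column of A is driven along PE rows, the current row of B along PE columns, and every PE performs a single multiply-accumulate. The function name and array shapes are illustrative:

```c
enum { PE_NR = 4 };  /* 4x4 PE array */

/* One rank-1 update step on the 4x4 PE array, modeled in software
 * (the real LAC does this in hardware via broadcast buses).
 * a_col[r] is broadcast along PE row r, b_row[c] along PE column c. */
static void lac_rank1_step(const double a_col[PE_NR],
                           const double b_row[PE_NR],
                           double C[PE_NR][PE_NR])
{
    for (int r = 0; r < PE_NR; r++)
        for (int c = 0; c < PE_NR; c++)
            C[r][c] += a_col[r] * b_row[c];  /* MAC at PE(r,c) */
}
```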
GEMM on LAP
[Figure-only slide]
Multi-LAC on Chip
• Same panel of B for all cores
• On-chip memory stores a complete n×n block of C
• Each core computes a different panel of C (a sketch of one possible split follows)
[Figure: multiple LACs (LAC 0, LAC 1, LAC 2), each with its own local memory, sharing the on-chip memory]
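The slide states only that the cores share the panel of B and each computes a different panel of C; one plausible split is a round-robin assignment of row panels, sketched here with hypothetical core and panel counts:

```c
#include <stdio.h>

enum { NUM_LACS = 3, NUM_PANELS = 12 };  /* illustrative counts */

/* Hypothetical round-robin assignment of C's row panels to LACs;
 * the same panel of B is shared by all cores. */
static int panel_owner(int panel) { return panel % NUM_LACS; }

int main(void)
{
    for (int lac = 0; lac < NUM_LACS; lac++) {
        printf("LAC %d:", lac);
        for (int p = 0; p < NUM_PANELS; p++)
            if (panel_owner(p) == lac)
                printf(" panel %d", p);
        printf("\n");
    }
    return 0;
}
```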
Outline
• Motivation and Vision
• Related Works and Background
• Linear Algebra Core
• Power/Performance Analysis
• Conclusion and Future Work
Performance and Power Analysis
• Analytical formulae (an illustrative model follows this list)
– Utilization
– Bandwidth
– Size of local stores
• Cycle-accurate simulator
– Matrix multiplication
– Cholesky factorization
• Component selections
– MAC units (45nm) [Galal et al. 2010]
– Storage model with [CACTI 6.0], pure SRAM model
– Interconnect: AMBA AHB [Lahiri 2004], [Wolkotte 2009]
– Activity of components based on GEMM
– Leakage modeled as 25%~30% of dynamic power
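The slides do not reproduce the analytical formulae. As a rough illustration of the kind of model involved, here is a generic roofline-style utilization estimate (my construction, not the authors' equations); larger local stores raise GEMM's flops-per-byte ratio, which lowers the bandwidth needed to stay at peak:

```c
/* Generic roofline-style utilization estimate (illustrative only;
 * not the authors' analytical model). GEMM on an n x n block moves
 * O(n^2) data for O(n^3) flops, so a bigger local store means a
 * higher flops_per_byte and less required bandwidth. */
static double utilization(double peak_gflops, double bw_gbytes_per_s,
                          double flops_per_byte)
{
    double attainable = bw_gbytes_per_s * flops_per_byte;
    return attainable >= peak_gflops ? 1.0 : attainable / peak_gflops;
}
```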
Core Utilization Trade-off
• Bandwidth vs. local-memory size trade-off
• 100% utilization
• Core dimension trade-off
[Figure: core utilization trade-off plots]
Multi-LAC Solution Trade-off
• On-chip memory limits performance
• On-chip bandwidth requirement grows exponentially to maintain peak performance
Performance vs. External Bandwidth
• 33 GB/s off-chip BW
• Over 600 DP-GFLOPS
• Over 90% utilization
[Figure: performance vs. external bandwidth for block sizes 256×256 / 512×512 / 768×768 / 1024×1024]
PE Efficiency for Different Frequencies
• Area: mostly occupied by SRAM
• Power: mostly consumed by MAC units
• 120 GFLOPS/W: upper limit for an SP PE
• 60 GFLOPS/W: upper limit for a DP PE
• 1 GHz: sweet spot of performance vs. efficiency
• At low voltages, SRAM power consumption limits efficiency
LAP vs. Intel® Core2 Duo Penryn
• Power breakdown [V. George et al. 2007]
• Out-of-order and front-end logic: 40% of the core power (over 5 W)
• Execution logic
– Register file
LAP vs. GTX280 Nvidia Tesla
• Single-precision GEMM

LAP vs. GTX480 Nvidia Fermi
[Figure-only slide]
Summary of LAP
• 600/1200 DP/SP-GFLOPS
• One/two orders of magnitude improvement vs. GPUs/CPUs
GEMM Performance and Efficiency on Different Platforms

Platform                  GFLOPS   W/mm²   GFLOPS/mm²   GFLOPS/W   Utilization
Cell BE (SP)                 200    0.3        1.5         5          88%
NVidia GTX480 SM (SP)        780    0.2        0.9         5.2        70%
NVidia GTX480 SM (DP)        390    0.2        0.5         2.6        70%
Intel Core-i7 960 (SP)        96    0.4        0.5         1.2        95%
Intel Core-i7 960 (DP)        48    0.4        0.25        0.6        95%
Altera Stratix IV (DP)       100    0.02       0.05        3.5        90+%
ClearSpeed CSX700 (DP)        75    0.02       0.2        12.5        78%
LAP (SP)                    1200    0.2        6-11       55          90+%
LAP (DP)                     600    0.2        3-5        25          90+%
Outline
• Motivation and Vision
• Related Works and Background
• Linear Algebra Core
• Power/Performance Analysis
• Conclusion and Future Work
Conclusion
• Linear Algebra Processor
– Algorithm/architecture co-design
– Power and efficiency estimation
– Generalized to more complex algorithms (Cholesky)
• Results @ 1 GHz
– DP: 32 GFLOPS, 47 GFLOPS/W
– 0.6 Watts
– 2.8 mm² in 45nm
– 4 GB/s external BW
– Orders of magnitude improvement
Conclusion
• Studied architectures and their power-consumption sources
Future Work
• Implementation
– Hardware synthesis
• Generalization
– Level-3 BLAS
– LU and QR factorization
• Integration within a general-purpose framework
• Design space exploration
– Picking the right algorithm variant