Efficient Performance Scaling of Future CGRAs for Mobile Applications
University of Michigan, Electrical Engineering and Computer Science
Yongjun Park, Jason Jong Kyu Park, and Scott Mahlke
December 11, 2012
University of Michigan, Ann Arbor
Convergence of Functionalities
The convergence of functionalities demands a flexible solution because of design cost and programmability.
[Figure: Anatomy of an iPhone 4. 4G wireless, navigation, audio, video, and 3D functions point to a flexible accelerator.]
CGRA: Attractive Alternative to ASICs
• Array of PEs connected in a mesh-like interconnect
• High throughput with a large number of resources
• Distributed hardware offers low cost and low power consumption
• High flexibility with dynamic reconfiguration
Bridging the Gap Between Market Demand and Computation Power
[Figure: computational requirement (0 to 2000) by year, 2009 to 2015, for CPU, Audio, and Video]
How can performance scale while retaining energy efficiency?
[Canali, Internet Computing Magazine, IEEE, 2009]
Agenda: Scaling the Energy Efficiency of CGRAs
• Investigate the key factors and their feasibility in terms of performance and power efficiency
  – Hardware scalability vs. hardware flexibility
    • Interconnection topology
    • Complex PE vs. simple PE
    • Vector memory operation support
    • Homogeneity vs. heterogeneity
Experimental Setup
• Target applications
  – Media benchmarks: AAC decoder, H.264 decoder, and 3D rendering
  – Game physics benchmarks: line of sight, convolution, and conjugate
• Target architecture: various types of CGRAs
  – 16 to 64 heterogeneous/homogeneous resources
• IMPACT frontend compiler + edge-centric modulo scheduler
• Power measurement
  – IBM 65nm technology @ 200MHz/1V
Q1: Interconnection Topology
• Overview
  – Routing overhead limits performance as the size of the CGRA increases
  – Common solution: clustering
  – What is the optimal interconnection topology?
• Methodology
  – Compare the performance of three different clustering schemes
    • Baseline
    • Fixed partition: the CGRA is physically split into multiple partitions
    • Flexible partition: the number of partitions can be changed dynamically from 1 to 8
  – Total number of PEs: 4 to 128
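The comparison described in the methodology can be enumerated with a short sketch. The array sizes (4 to 128 PEs) and the upper bound of 8 partitions come from the slide; the power-of-two steps, the rule that each fixed partition keeps at least 4 PEs, and all names are assumptions made for illustration, not the authors' tooling.

```python
# Sketch of the Q1 design space. The names and the minimum-partition-size
# rule are illustrative assumptions, not taken from the original experiments.
ARRAY_SIZES = [4, 8, 16, 32, 64, 128]  # total number of PEs (from the slide)
PARTITION_COUNTS = [2, 4, 8]           # fixed partition counts, at most 8
MIN_PES_PER_PARTITION = 4              # assumed lower bound per partition


def clustering_schemes(total_pes):
    """Return the labels of the schemes to compare for one array size."""
    schemes = ["base"]  # unpartitioned baseline
    for parts in PARTITION_COUNTS:
        if total_pes // parts >= MIN_PES_PER_PARTITION:
            schemes.append(f"fixed-{parts}")  # physically split array
    schemes.append("flex")  # partition count can change dynamically
    return schemes


def design_space():
    """Map every array size to the list of schemes evaluated on it."""
    return {size: clustering_schemes(size) for size in ARRAY_SIZES}


if __name__ == "__main__":
    for size, schemes in design_space().items():
        print(size, schemes)
```

Under these assumptions a 4-PE array is only compared as baseline versus flexible, while arrays of 32 PEs and larger cover every fixed partition count.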
Q1: Interconnection Topology
[Figure: an application with DLP loops and no-DLP loops mapped onto the baseline, fixed partition, and flexible mapping]
Performance Comparison (Base, Fixed, Flex)
[Figure: relative performance of the baseline (base), fixed partitions (2, 4, 8), and flexible (flex) architectures for 4, 8, 16, 32, 64, and 128 PEs; two panels, Media and Game]
• Fixed partitioning doesn’t always show better performance.
• Flexible architectures show the best performance and retain scalability.
Q2: Complex PEs vs. Simple PEs
• Overview
  – CGRAs with complex PEs have been introduced
    • Two-level interconnect
    • The number of RFs can decrease
    • Multiple instructions can be chained
  – Challenge: resource utilization
  – Goal: determine the feasibility of complex PEs in terms of energy consumption
• Methodology
  – Compare the energy consumption of different PE styles
    • Number of FUs inside a PE: 1 to 6
    • Uniform vs. optimized
PE Designs
[Figure: PE designs built from a register file, simple integer ALUs, and simple integer + complex ALUs]
Energy Consumption
• Energy consumption does not increase dramatically with the number of FUs per PE
• Within a 1.5x energy budget, complex PEs with 2 to 3 FUs can also be suitable solutions
[Figure: relative energy consumption (0.5 to 4) vs. number of FUs per PE (1 to 6) for Media uniform, Game uniform, Media optimized, and Game optimized, with the 1.5x energy level marked]
Q3: SIMD Memory Support
• Overview
  – SIMD memory support lowers power and reduces the number of instructions
  – Challenge: degree of DLP
  – Goal: determine the feasibility of SIMD memory access in terms of energy consumption
• Methodology
  – Compare the energy consumption of different SIMD widths: 1 to 16
Relative Energy Consumption
[Figure: relative power per access, relative number of accesses, and relative total energy (0 to 14) vs. vector width (1, 2, 4, 8, 16)]
• Total energy consumption at wider vector widths can be at a similar level to that of a scalar memory unit
  – A high degree of spatial locality can compensate for power overheads
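The trade-off behind this observation can be sketched with simple arithmetic: relative total energy is taken as relative power per access multiplied by the relative number of accesses, which assumes every access takes the same time. The `locality` parameter, the function names, and the idea that a width-w access replaces w scalar accesses under perfect spatial locality are assumptions made for illustration; no measured values from the slides are used.

```python
def relative_accesses(vector_width, locality=1.0):
    """Relative number of memory accesses versus a scalar memory unit.

    locality is the assumed fraction of scalar accesses that can be packed
    into full vector accesses (1.0 means perfect spatial locality).
    """
    if vector_width < 1 or not 0.0 <= locality <= 1.0:
        raise ValueError("vector_width must be >= 1 and locality in [0, 1]")
    packed = locality / vector_width  # accesses merged into vector accesses
    unpacked = 1.0 - locality         # accesses that remain one per element
    return packed + unpacked


def relative_total_energy(power_per_access, vector_width, locality=1.0):
    """Relative total energy = relative power per access x relative accesses."""
    return power_per_access * relative_accesses(vector_width, locality)


def break_even_power(vector_width, locality=1.0):
    """Largest relative power per access that still matches scalar energy."""
    return 1.0 / relative_accesses(vector_width, locality)
```

In this model a 4-wide unit that draws four times the power per access breaks even with the scalar unit only when locality is perfect; with lower locality the break-even power drops, which is why the degree of DLP is the stated challenge.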
Conclusion
• Flexible partitioning should be supported to further improve performance.
• Complex PEs can be more energy efficient even at low resource utilization.
• Wide SIMD memory support can be realistic because of the characteristics of mobile applications.
Questions?
For more information: http://cccp.eecs.umich.edu
Q1: Homogeneity vs. Heterogeneity
• Overview
  – Heterogeneous CGRAs are common
  – No experiments on the effect of heterogeneity compared with homogeneity
• Methodology
  – Start from a 16-PE homogeneous CGRA (integer ALU, complex ALU, memory unit)
  – Decrease the number of PEs supporting the complex ALU and the memory unit
  – Performance goal: 80% of the performance of the homogeneous CGRA
How about performance?
Performance Degradation
[Figure: relative performance (0 to 1) of base and of the mul_N, mem_N, and exp_N configurations for N = 8, 4, 2, 1; two panels]
• The performance degradation is not substantial
  – Performance is normally not constrained by the complex instructions
• Performance degradation depends much more on memory operations
• While keeping 80% of the baseline performance, the number of both complex and memory units can be decreased by up to 75%.
[Figure panels: Media, Game]
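The selection rule implied by the 80% goal can be sketched as follows. The 16-PE baseline and the 80% threshold come from the slides; the function names are hypothetical, the baseline is assumed to have all 16 PEs supporting the unit in question, and the relative performance values must be supplied by the caller from measurements.

```python
BASELINE_PES = 16        # homogeneous starting point (from the slides)
PERFORMANCE_GOAL = 0.80  # fraction of homogeneous performance to retain


def reduction(units, baseline=BASELINE_PES):
    """Fraction of units removed relative to the homogeneous baseline."""
    return (baseline - units) / baseline


def smallest_config(relative_perf, goal=PERFORMANCE_GOAL):
    """Return the smallest unit count whose performance meets the goal.

    relative_perf maps a unit count (for example 8, 4, 2, 1) to measured
    performance relative to the homogeneous CGRA. Returns None when no
    configuration reaches the goal.
    """
    qualifying = [units for units, perf in relative_perf.items() if perf >= goal]
    return min(qualifying) if qualifying else None
```

For instance, `reduction(4)` evaluates to 0.75, so under the stated assumption a 75% reduction corresponds to keeping the unit in 4 of the 16 PEs.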
Conclusion
• Heterogeneous FU organization is highly effective.
• Flexible partitioning should be supported to further improve performance.
• Complex PEs can be more energy efficient even at low resource utilization.
• Wide SIMD memory support can be realistic because of the characteristics of mobile applications.
CGRA: Attractive Alternative to ASICs
• Suitable for running multimedia applications for future embedded systems
• High throughput, low power consumption, high flexibility
• Viterbi at 80Mbps, H.264 at 30fps, 50-60 MOps/mW
• Morphosys: 8x8 array with RISC processor
• SiliconHive: hierarchical systolic array
• ADRES: 4x4 array with tightly coupled VLIW