Efficient Performance Scaling of Future CGRAs for Mobile Applications

University of Michigan, Electrical Engineering and Computer Science


Yongjun Park, Jason Jong Kyu Park, and Scott Mahlke

December 11, 2012

University of Michigan, Ann Arbor


Convergence of Functionalities


Converging functionalities demand a flexible solution because of design cost and the need for programmability

[Figure: Anatomy of an iPhone 4. Labeled functions: 4G wireless, navigation, audio, video, and 3D.]

Flexible Accelerator!


CGRA: Attractive Alternative to ASICs

• Array of PEs connected in a mesh-like interconnect
• High throughput with a large number of resources
• Distributed hardware offers low cost/power consumption
• High flexibility with dynamic reconfiguration
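To make the "array of PEs in a mesh-like interconnect" concrete, here is a minimal sketch that models the array as a plain 2D mesh with nearest-neighbor links only. That topology is an assumption for illustration (real CGRAs often add diagonal or longer links), and the function names are hypothetical. It also shows why routing cost grows with array size: the average hop count between PEs rises as the mesh gets larger.

```python
from itertools import combinations


def mesh_neighbors(row, col, rows, cols):
    """Return the PEs directly wired to PE (row, col) in a rows x cols mesh."""
    candidates = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    # Drop positions that fall off the edge of the array.
    return [(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]


def average_hops(rows, cols):
    """Average Manhattan distance over all distinct PE pairs (needs >= 2 PEs)."""
    pes = [(r, c) for r in range(rows) for c in range(cols)]
    pairs = list(combinations(pes, 2))
    total = sum(abs(r1 - r2) + abs(c1 - c2) for (r1, c1), (r2, c2) in pairs)
    return total / len(pairs)


if __name__ == "__main__":
    print(mesh_neighbors(0, 0, 4, 4))  # a corner PE has only two neighbors
    for n in (4, 8):
        print(n * n, "PEs:", round(average_hops(n, n), 2), "hops on average")
```

The hop count is only a rough proxy for routing overhead; a scheduler also has to contend with limited register files and link bandwidth.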


Bridging the Gap Between Market Demand and Computation Power

[Figure: Computational requirement by year, 2009 to 2015 (y-axis 0 to 2000). Series: CPU, Audio, Video.]

How can performance be scaled while retaining energy efficiency?

[Canali, Internet Computing Magazine, IEEE, 2009]


Agenda: Scaling the Energy Efficiency of CGRAs

• Investigate the key factors and their feasibility in terms of performance and power efficiency
  – Hardware scalability vs. hardware flexibility
    • Interconnection topology
    • Complex PE vs. simple PE
    • Vector memory operation support
    • Homogeneity vs. heterogeneity


Experimental Setup

• Target applications
  – Media benchmarks: AAC decoder, H.264 decoder, and 3D rendering
  – Game physics benchmarks: line of sight, convolution, and conjugate
• Target architecture: various types of CGRAs
  – 16 to 64 heterogeneous/homogeneous resources
• IMPACT frontend compiler + edge-centric modulo scheduler
• Power measurement
  – IBM 65nm technology @ 200MHz/1V
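Since loops are mapped with a modulo scheduler, array size and FU mix affect throughput through the initiation interval (II). The sketch below computes the standard resource-constrained lower bound on II. It is a textbook bound, not the edge-centric scheduler itself, and it assumes every operation occupies one issue slot on one PE per iteration. The function name and the operation counts are hypothetical.

```python
import math


def res_mii(op_counts, total_pes, special_units):
    """Resource-constrained lower bound on the initiation interval.

    op_counts:     operations per loop iteration, keyed by class name
    total_pes:     number of PEs (every PE is assumed to issue one op per cycle)
    special_units: number of PEs able to execute each restricted class
    """
    # Every operation needs one slot on some PE.
    bound = math.ceil(sum(op_counts.values()) / total_pes)
    # Restricted classes can only use the PEs that support them.
    for cls, units in special_units.items():
        ops = op_counts.get(cls, 0)
        if ops == 0:
            continue
        if units == 0:
            raise ValueError(f"no PE can execute class {cls!r}")
        bound = max(bound, math.ceil(ops / units))
    return max(bound, 1)


if __name__ == "__main__":
    # Hypothetical loop body: 20 integer ops, 6 multiplies, 8 memory ops.
    loop = {"int": 20, "mul": 6, "mem": 8}
    print(res_mii(loop, 16, {"mul": 16, "mem": 16}))  # homogeneous 16-PE array
    print(res_mii(loop, 16, {"mul": 4, "mem": 2}))    # fewer mul/mem units
```

An actual schedule may need a larger II than this bound because of routing and recurrence constraints, which the bound ignores.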


Q1: Interconnection Topology

• Overview
  – Routing overhead limits performance as the size of the CGRA increases
  – Common solution: clustering
  – What is the optimal interconnection topology?
• Methodology
  – Compare the performance of three different clustering schemes
    • Baseline
    • Fixed partition: CGRAs are physically split into multiple partitions
    • Flexible partition: number of partitions can be dynamically changed from 1 to 8
  – Total number of PEs: 4 to 128
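One way to picture flexible partitioning is as a per-loop choice of partition count. The sketch below is a toy model, not the method evaluated here: it assumes a loop with data-level parallelism (DLP) can run independent iterations on each partition, while a loop without DLP uses the whole array as a single partition. The `ii_for` callback and the II values are hypothetical stand-ins for what a scheduler would report.

```python
def choose_partitions(total_pes, has_dlp, ii_for, max_partitions=8):
    """Pick a partition count for one loop.

    ii_for(pes) returns the initiation interval achieved on a partition of
    `pes` PEs. Returns (partitions, effective cycles per iteration).
    """
    if not has_dlp:
        # No independent iterations to spread out: use the whole array.
        return 1, float(ii_for(total_pes))
    best = None
    partitions = 1
    while partitions <= max_partitions and total_pes // partitions >= 1:
        # Each partition runs its own iterations, so throughput scales
        # with the number of partitions.
        cost = ii_for(total_pes // partitions) / partitions
        if best is None or cost < best[1]:
            best = (partitions, cost)
        partitions *= 2
    return best


if __name__ == "__main__":
    # Hypothetical IIs: small partitions route easily but have fewer PEs.
    toy_ii = {64: 4, 32: 5, 16: 8, 8: 16}
    print(choose_partitions(64, True, toy_ii.__getitem__))
    print(choose_partitions(64, False, toy_ii.__getitem__))
```

A fixed partition corresponds to forcing one partition count for every loop, which is why it can lose to the baseline on loops without DLP.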


[Figure: Q1: Interconnection Topology. Labels: application with DLP loops and no-DLP loops; baseline, fixed partition, flexible mapping.]


Performance Comparison (Base, Fixed, Flex)

[Figure: Relative performance by architecture (4, 8, 16, 32, 64, and 128 PEs) for base, fixed partitions (2, 4, 8), and flex. Panels: Media (y-axis 0 to 12) and Game (y-axis 0 to 10).]

• Fixed partitioning doesn't always show better performance.
• Flexible architectures show the best performance and retain scalability.


Q2: Complex PEs vs. Simple PEs

• Overview
  – CGRAs with complex PEs have been introduced
    • Two-level interconnect
    • Number of RFs can decrease
    • Multiple instructions can be chained
  – Challenge: resource utilization
  – Goal: determine the feasibility of complex PEs in terms of energy consumption
• Methodology
  – Compare the energy consumption of different PE styles
    • Number of FUs inside a PE: 1 to 6
    • Uniform vs. optimized


PE Designs

[Figure: PE designs. Labels: register file; simple integer ALU; simple integer + complex ALU.]


Energy Consumption

• Energy consumption does not increase dramatically as the number of PEs increases
• Within a 1.5x energy budget, complex PEs with 2 to 3 FUs can also be suitable solutions

[Figure: Relative energy consumption (y-axis 0.5 to 4) vs. number of FUs per PE (1 to 6). Series: Media uniform, Game uniform, Media optimized, Game optimized. A reference line marks 1.5x energy.]


Q3: SIMD Memory Support

• Overview
  – SIMD memory support reduces power and the number of instructions
  – Challenge: degree of DLP
  – Goal: determine the feasibility of SIMD memory access in terms of energy consumption
• Methodology
  – Compare the energy consumption at different SIMD widths: 1 to 16


Relative Energy Consumption

[Figure: Relative power per access, relative number of accesses, and relative total energy (y-axis 0 to 14) vs. vector width (1, 2, 4, 8, 16).]

• Total energy consumption at wider vector widths can be at a similar level to that of a scalar memory unit
  – A high degree of spatial locality can compensate for the power overhead
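The trade-off between power per access and number of accesses can be sketched with a simple product model. It assumes every access takes the same time, so energy per access is proportional to power per access, and it assumes a vector access of width w replaces w scalar accesses wherever the data is contiguous. The parameter names and all numbers are hypothetical, not measurements from this study.

```python
def relative_energy(width, rel_power_per_access, vector_fraction=1.0):
    """Total memory energy relative to a scalar memory unit.

    width:                SIMD width of the memory unit
    rel_power_per_access: power of one wide access relative to a scalar access
    vector_fraction:      share of scalar accesses that pack into full vectors
                          (a proxy for spatial locality, between 0 and 1)
    """
    # Packed accesses shrink by the vector width; the rest stay one-to-one.
    rel_accesses = (1.0 - vector_fraction) + vector_fraction / width
    return rel_power_per_access * rel_accesses


if __name__ == "__main__":
    # Hypothetical 8-wide unit costing 6x the power of a scalar access.
    print(relative_energy(8, 6.0))        # perfect locality
    print(relative_energy(8, 6.0, 0.5))   # only half the accesses vectorize
```

Under this model a wide unit breaks even with a scalar unit when the power ratio equals the achieved reduction in access count, so the result hinges on how much spatial locality the application exposes.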


Conclusion

• Flexible partitioning should be supported to further improve performance.
• Complex PEs can be more energy efficient even at low resource utilization.
• Wide SIMD memory support can be realistic given mobile application characteristics.


Questions?

For more information: http://cccp.eecs.umich.edu


Q1: Homogeneity vs. Heterogeneity

• Overview
  – Heterogeneous CGRAs are common
  – No experiments exist on the effect of heterogeneity versus homogeneity
• Methodology
  – Start from a 16-PE homogeneous CGRA (integer ALU, complex ALU, memory unit)
  – Decrease the number of PEs supporting the complex ALU and memory unit
  – Performance goal: 80% of the performance of the homogeneous CGRA

How about performance?
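The methodology above can be read as a simple search: keep halving the number of PEs that support a unit type until performance would fall below the 80% goal. The sketch below assumes a caller-supplied `perf_for` function that returns measured performance for a given unit count. That function and the sample numbers are hypothetical placeholders, not results from this study.

```python
def min_units_meeting_goal(perf_for, start_units=16, goal=0.8):
    """Smallest unit count (halving each step) that still meets the goal.

    perf_for(units) returns performance with `units` PEs supporting the
    unit type under study; performance is compared against the
    homogeneous starting point.
    """
    baseline = perf_for(start_units)
    units = start_units
    # Stop as soon as the next halving would miss the performance goal.
    while units > 1 and perf_for(units // 2) / baseline >= goal:
        units //= 2
    return units


if __name__ == "__main__":
    # Hypothetical measurements for memory units on a 16-PE array.
    toy_perf = {16: 1.00, 8: 0.97, 4: 0.85, 2: 0.60, 1: 0.35}
    kept = min_units_meeting_goal(toy_perf.__getitem__)
    print(kept, "units kept,", 100 * (16 - kept) // 16, "% removed")
```

The search treats one unit type at a time; reducing complex and memory units together would need a joint search, since their effects on the schedule interact.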


Performance Degradation

[Figure: Relative performance (y-axis 0 to 1) for the configurations base, mul_8, mul_4, mul_2, mul_1, mem_8, mem_4, mem_2, mem_1, exp_8, exp_4, exp_2, and exp_1. Panels: Media and Game.]

• The performance degradation is not substantial
  – Performance is normally not constrained by the complex instructions
• Performance degradation depends much more on memory operations
• For 80% of the baseline performance, the number of both complex and memory units can be decreased by up to 75%.


Conclusion

• Heterogeneous FU organization is highly effective.
• Flexible partitioning should be supported to further improve performance.
• Complex PEs can be more energy efficient even at low resource utilization.
• Wide SIMD memory support can be realistic given mobile application characteristics.


CGRA: Attractive Alternative to ASICs

• Suitable for running multimedia applications for future embedded systems
• High throughput, low power consumption, high flexibility
• Viterbi at 80Mbps, H.264 at 30fps, 50-60 MOps/mW
• Morphosys: 8x8 array with RISC processor
• SiliconHive: hierarchical systolic array
• ADRES: 4x4 array with tightly coupled VLIW
