University of Michigan, Electrical Engineering and Computer Science

Polymorphic Pipeline Array: A Flexible Multicore Accelerator with Virtualized Execution for Mobile Multimedia Applications

Hyunchul Park (1), Yongjun Park (2), Scott Mahlke (2)

(1) Texas Instruments Inc.
(2) University of Michigan, Ann Arbor

December 12, 2009

Introduction

• Multimedia applications have high performance, cost, and energy demands
  – High-quality video
  – Flash animation
• Clear need for application- and domain-specific hardware

[Figure: MPEG-4 decoder performance (frames/sec, 24 fps minimum) and cell-phone battery life (hours) for ARM9, ARM11, TI C6x, and Core2Duo]


Convergence of Functionalities

[Figure: Anatomy of iPhone, annotated with HD TV decoder, video recording, video editing, 3D rendering, 4G wireless, and advanced image processing]

Convergence of functionalities demands a flexible solution.
Applications have different characteristics.


ASIC Alternatives

[Figure: flexibility versus efficiency/performance for general-purpose processors, DSPs, and ASICs. Additional labels: domain-specific efficiency; somewhat programmable.]

What’s the right way to support multimedia applications?


Coarse-Grained Reconfigurable Architecture (CGRA)

• Array of PEs connected in a mesh-like interconnect
• High throughput, low cost/power with distributed hardware
• High flexibility with dynamic reconfiguration
• MorphoSys, SiliconHive, ADRES


Execution Model of CGRAs

[Figure: execution timeline (time axis) for the Host and the CGRA, with a loop `for ( …… ) { }` highlighted]

• Modulo scheduling exploits loop-level parallelism
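As a rough illustration of the idea (not the scheduler used in this work), modulo scheduling starts a new loop iteration every II cycles (the initiation interval) and tracks resource use in a table indexed by `time % II`. The sketch below assumes unit-latency operations, identical functional units, and no loop-carried dependences; the function names and the small dependence graph are hypothetical.

```python
from math import ceil

def res_mii(num_ops, num_fus):
    """Resource-constrained lower bound on the initiation interval (II)."""
    return ceil(num_ops / num_fus)

def modulo_schedule(deps, num_fus, ii):
    """Greedy sketch: place each op at the earliest time after its
    predecessors where modulo slot (time % ii) still has a free FU.
    Assumes unit latency, identical FUs, ops given in topological order,
    and no loop-carried dependences. Returns {op: (time, fu)} or None."""
    used = [0] * ii                      # FUs already claimed in each modulo slot
    placed = {}
    for op, preds in deps.items():
        earliest = max((placed[p][0] + 1 for p in preds), default=0)
        for t in range(earliest, earliest + ii):
            if used[t % ii] < num_fus:
                placed[op] = (t, used[t % ii])
                used[t % ii] += 1
                break
        else:
            return None                  # no free slot at this II; retry with ii + 1
    return placed

# Hypothetical loop body: op -> list of predecessor ops.
deps = {0: [], 1: [0], 2: [0], 3: [1, 2], 4: [3], 5: [3]}
ii = res_mii(len(deps), num_fus=2)
schedule = modulo_schedule(deps, num_fus=2, ii=ii)
```

A real modulo scheduler would also account for recurrences, operation latencies, and routing on the array; this sketch only shows how the modulo reservation table limits what can share a slot.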


Large Scale CGRA

• Need for higher performance
  – Higher-resolution, more detailed video
  – Support for multiple concurrent applications
• Technology scaling makes more resources available

[Figure: loops (Loop 0 to Loop 3) and tasks (Task 0 to Task 4) mapped onto a large-scale CGRA]


Streaming Execution Model

• Streaming property
  – Packet of data goes through independent tasks
• Partition tasks into stages
  – Map each stage onto different hardware
• Pipeline parallelism
  – Pipeline the outermost loop
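To make the benefit of pipelining the outermost loop concrete, the following sketch compares running every packet through all stages back to back with running the stages on separate hardware as a pipeline. The stage latencies and packet count are made-up illustrative values, and the model assumes fixed per-stage latency with unbounded buffering between stages.

```python
def sequential_finish_time(stage_latency, num_packets):
    """All stages share one piece of hardware: packets are processed one by one."""
    return num_packets * sum(stage_latency)

def pipelined_finish_time(stage_latency, num_packets):
    """Each stage has its own hardware; a stage starts a packet as soon as
    both the packet and the stage are available."""
    done = [0] * len(stage_latency)      # time each stage finished its latest packet
    for _ in range(num_packets):
        ready = 0                        # time the packet leaves the previous stage
        for s, lat in enumerate(stage_latency):
            start = max(ready, done[s])
            done[s] = start + lat
            ready = done[s]
    return done[-1]

latencies = [2, 5, 3, 2]                 # hypothetical cycles per stage
seq = sequential_finish_time(latencies, 10)
pipe = pipelined_finish_time(latencies, 10)
```

In steady state the pipelined version is limited by its slowest stage, which is why balancing the stages matters.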


Insights

• Multimedia applications are rich in both ILP and pipeline parallelism
  – Not mutually exclusive; they cooperatively enhance performance
• Resource requirements vary
  – Statically and dynamically
• Need a flexible execution model
  – Exploit both types of parallelism
  – Allocate resources based on computation requirements
  – Dynamically adapt to computation variance


Polymorphic Pipeline Array

• Multi-core accelerator: each 2x2 array becomes a processor
• Cores can be combined to form a larger logical core
• Exploit both coarse-grain and fine-grain pipeline parallelism
• No dynamic routing logic: all communication is statically generated

[Figure: eight cores arranged in two rows of four, grouped into logical cores]
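One way to picture logical cores is as groups of neighboring physical cores on the grid. The sketch below is only a bookkeeping model: the 2 x 4 layout follows the drawing, the 2x2 PE count per core follows the slide, and the requirement that a group be a connected region, along with all function names and the example grouping, are assumptions made for illustration.

```python
ROWS, COLS = 2, 4        # physical cores, as drawn on the slide
PES_PER_CORE = 4         # each core is a 2x2 array of PEs

def neighbors(core):
    """Yield the mesh neighbors of a core numbered in row-major order."""
    r, c = divmod(core, COLS)
    for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < ROWS and 0 <= nc < COLS:
            yield nr * COLS + nc

def is_connected(cores):
    """True if the given physical cores form one connected region of the grid."""
    cores = set(cores)
    if not cores:
        return False
    seen, stack = set(), [next(iter(cores))]
    while stack:
        cur = stack.pop()
        if cur in seen:
            continue
        seen.add(cur)
        stack.extend(n for n in neighbors(cur) if n in cores)
    return seen == cores

def pe_counts(partition):
    """Validate a partition into logical cores and return PEs per logical core."""
    flat = [c for group in partition for c in group]
    if len(flat) != len(set(flat)):
        raise ValueError("a physical core is assigned to two logical cores")
    if not all(is_connected(group) for group in partition):
        raise ValueError("a logical core is not a connected region")
    return [len(group) * PES_PER_CORE for group in partition]

partition = [[0, 1, 4, 5], [2, 3], [6, 7]]   # hypothetical three logical cores
counts = pe_counts(partition)
```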


Execution Model

• Pipeline outermost loop
• Compute-intensive stage
  – Assign more resources
  – Modulo scheduling

[Figure, three animation steps: stages ST 0 to ST 3 of the outermost loop mapped onto the array, with the mapping changing between steps]
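Giving more resources to the compute-intensive stage can be sketched as a simple greedy allocation: every stage starts with one core, and each spare core goes to whichever stage is currently the bottleneck. This is an illustrative heuristic, not the partitioning method of this work; it assumes a stage's time scales inversely with its core count, and the work figures are invented.

```python
def allocate_cores(stage_work, total_cores):
    """Greedy sketch: hand each spare core to the currently slowest stage.
    Assumes stage time = work / cores (idealized linear speedup)."""
    if total_cores < len(stage_work):
        raise ValueError("need at least one core per stage")
    cores = [1] * len(stage_work)
    for _ in range(total_cores - len(stage_work)):
        bottleneck = max(range(len(stage_work)),
                         key=lambda s: stage_work[s] / cores[s])
        cores[bottleneck] += 1
    return cores

stage_work = [10, 40, 20, 10]            # hypothetical work per stage (ST 0 to ST 3)
allocation = allocate_cores(stage_work, total_cores=8)
```

With these numbers the heaviest stage ends up with four cores and all stages take the same time, so the pipeline is balanced.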


Partitioning of PPA

• Static partitioning
  – Schedules can be optimized
  – Computation variance leads to low utilization
• Dynamic partitioning
  – Adjust core assignment at run-time
  – Adapts to computation variance, at the cost of some overhead
• How to support dynamic partitioning
  – Multiple schedules: code bloat
  – Unified schedule targeting multiple sub-arrays (virtualization)
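The trade-off between the two partitioning styles can be illustrated with a toy model of a two-stage pipeline whose per-frame work varies. Everything here is an assumption for illustration: the linear-speedup cost model, the fixed switch cost, the workload numbers, and the policy of re-picking the best split every frame.

```python
def frame_time(work, cores):
    """One frame is limited by its slowest stage (idealized linear scaling)."""
    return max(w / c for w, c in zip(work, cores))

def best_split(work, total_cores):
    """Exhaustively choose the two-stage core split with the lowest frame time."""
    return min(((k, total_cores - k) for k in range(1, total_cores)),
               key=lambda cores: frame_time(work, cores))

def static_total(frames, cores):
    """Total time when the core split is fixed for the whole run."""
    return sum(frame_time(w, cores) for w in frames)

def dynamic_total(frames, total_cores, switch_cost):
    """Total time when the split is re-chosen per frame, paying a
    hypothetical overhead each time the split changes."""
    total, current = 0.0, None
    for w in frames:
        cores = best_split(w, total_cores)
        if current is not None and cores != current:
            total += switch_cost
        current = cores
        total += frame_time(w, cores)
    return total

frames = [(30, 10), (10, 30), (30, 10), (10, 30)]   # hypothetical per-frame work
static = static_total(frames, (2, 2))
dynamic = dynamic_total(frames, total_cores=4, switch_cost=2)
```

In this toy case the dynamic policy wins despite paying the switch cost three times; with a larger switch cost or a steadier workload the static split would come out ahead.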


Virtualized Modulo Scheduling

• One binary that can run on multiple targets
  – Part of the code migrates to a neighboring core
  – No rescheduling
• Challenges
  – Avoid resource conflicts
  – Enforce multiple modulo constraints
  – Inter-core communication

[Figure: operations A and B placed on core 0 alone and spread across cores 0 and 1, with the initiation interval (II) marked]


Multi-level Modulo Constraints

[Figure, four animation steps: modulo schedule of operations 0 to 13 on functional units F0 to F3 over time. On Core 0 alone the schedule has II = 4; expanded across Core 0 and Core 1 it has II = 2.]
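A minimal sketch of what checking modulo constraints at two levels could look like is given below. It is not the algorithm from this work: the tuple layout `(core, resource, start, duration)`, the helper names, and in particular the expansion rule (reservations that start in the second half of the single-core II window move to core 1) are assumptions chosen to match the II = 4 to II = 2 example.

```python
def conflicts(reservations, ii):
    """Fold every reservation into a modulo reservation table of length ii and
    return the (core, resource, slot) entries that are claimed more than once."""
    seen, clashes = set(), set()
    for core, resource, start, duration in reservations:
        for t in range(start, start + duration):
            key = (core, resource, t % ii)
            if key in seen:
                clashes.add(key)
            seen.add(key)
    return clashes

def expand(reservations, ii):
    """Hypothetical expansion rule for an even ii: reservations starting in the
    second half of the II window migrate to core 1, the rest stay on core 0."""
    return [((start % ii) // (ii // 2), resource, start, duration)
            for _core, resource, start, duration in reservations]

def satisfies_multilevel(reservations, ii):
    """Conflict-free on one core at II = ii and on two cores at II = ii // 2."""
    return (not conflicts(reservations, ii)
            and not conflicts(expand(reservations, ii), ii // 2))

# Single-cycle FU uses: fine at both levels.
fu_uses = [(0, "F0", 0, 1), (0, "F0", 1, 1), (0, "F0", 2, 1), (0, "F1", 3, 1)]
# A register held for 3 cycles: fits in II = 4 but wraps onto itself at II = 2.
long_hold = [(0, "R0", 1, 3)]

ok_both = satisfies_multilevel(fu_uses, 4)
ok_single_only = not conflicts(long_hold, 4) and not satisfies_multilevel(long_hold, 4)
```

The second case shows why a schedule that is valid for the single core is not automatically valid after expansion: a resource held for longer than the smaller II collides with its own next iteration.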


Inter-core Communication

Direct RF connection

[Figure: the II = 2 schedule of operations 0 to 13 on Core 0 and Core 1 (functional units F0 to F3), with values passed between the cores]


VMS Summary

• Edge-centric Modulo Scheduling [PACT’08] with virtualization support
• Generate a unified schedule
  – Schedule for the smallest array, then expand
• Multi-level modulo constraints enforced
  – Avoid resource conflicts when expanded
  – Apply to computation, routing, and registers
• Register transfer operations for inter-core communication
  – Enabled only when expanded
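The point that register transfer operations are enabled only when the schedule is expanded can be pictured with a small helper that lists the data edges crossing a core boundary. The edge list, the core assignments, and the function name are hypothetical; the sketch says nothing about how the transfers are actually encoded in the binary.

```python
def transfers_needed(edges, core_of):
    """Return the (producer, consumer) edges whose endpoints sit on different
    cores; only these would need an inter-core register transfer."""
    return [(p, c) for p, c in edges if core_of[p] != core_of[c]]

edges = [(0, 2), (1, 3), (2, 3)]              # hypothetical data dependences
single_core = {0: 0, 1: 0, 2: 0, 3: 0}        # everything on core 0
expanded = {0: 0, 1: 0, 2: 1, 3: 1}           # ops 2 and 3 migrated to core 1

none_needed = transfers_needed(edges, single_core)
some_needed = transfers_needed(edges, expanded)
```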


Evaluation of PPA

• Exploiting both types of parallelism in AAC
• Dynamic partitioning overhead
  – 13% overhead for single-core schedule, runtime overhead

[Figure: four panels (axis 0 to 80) comparing a 4-core CGRA with static and dynamic (dyn) partitioning on 4, 5, 6, 7, and 8 cores]


Where PPA stands

[Figure: MPEG-4 decoder performance (frames/sec, 24 fps minimum) and cell-phone battery life (hours) for ARM9, ARM11, TI C6x, PPA, and Core2Duo]


Questions?
