University of Michigan Electrical Engineering and Computer Science
Polymorphic Pipeline Array: A Flexible Multicore Accelerator with Virtualized Execution for Mobile Multimedia Applications
Hyunchul Park (1), Yongjun Park (2), Scott Mahlke (2)
December 12, 2009
(1) Texas Instruments Inc.
(2) University of Michigan, Ann Arbor

Introduction
• Multimedia applications have high performance, cost, and energy demands
– High-quality video
– Flash animation
• Clear need for application- and domain-specific hardware
[Figure: MPEG-4 decoder performance (frames/sec, 24 fps minimum) and cell-phone battery life (hours) for ARM9, ARM11, TI C6x, and Core2Duo; legend: energy, performance]

Convergence of Functionalities
[Figure: Anatomy of iPhone, annotated with HD TV decoder, Video Recording, Video Editing, 3D Rendering, 4G Wireless, and Advanced Image Processing]
Convergence of functionalities demands a flexible solution
Applications have different characteristics

ASIC Alternatives
[Figure: flexibility versus efficiency/performance, placing General Purpose Processors, DSPs, and ASICs; callout labels: Domain specific, Efficiency, Somewhat programmable]
What’s the right way to support multimedia applications?

Coarse-Grained Reconfigurable Architecture (CGRA)
• Array of PEs connected in a mesh-like interconnect
• High throughput, low cost/power with distributed hardware
• High flexibility with dynamic reconfiguration
• MorphoSys, SiliconHive, ADRES

Execution Model of CGRAs
[Figure: timeline of a for loop, with execution split between the Host and the CGRA]
• Modulo scheduling exploits loop level parallelism
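
As a rough illustration of how modulo scheduling overlaps loop iterations, the sketch below computes the resource-constrained lower bound on the initiation interval (II) and the cycle count when a new iteration starts every II cycles. The function names, the 16-PE array, and the operation counts are assumptions made for illustration, not figures from the talk; a real scheduler must also respect recurrence and routing constraints.

```python
import math

def res_mii(num_ops: int, num_pes: int) -> int:
    """Resource-constrained lower bound on the initiation interval (II)."""
    return math.ceil(num_ops / num_pes)

def loop_cycles(iterations: int, ii: int, schedule_length: int) -> int:
    """Cycles for a modulo-scheduled loop: one new iteration every II cycles."""
    if iterations == 0:
        return 0
    # The last iteration starts at (iterations - 1) * ii and runs to completion.
    return (iterations - 1) * ii + schedule_length

# Hypothetical loop body: 40 operations on a 16-PE array, 12-cycle schedule.
ii = res_mii(40, 16)
overlapped = loop_cycles(100, ii, 12)   # iterations overlap on the array
sequential = 100 * 12                   # iterations run back to back
print(ii, overlapped, sequential)
```

The gap between the overlapped and sequential counts is the loop-level parallelism that the bullet above refers to.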

Large Scale CGRA
• Need for higher performance
– Higher resolution/more detail video
– Multiple concurrent applications support
• Advancing technology makes more resources available
[Figure: loops (Loop 0 to Loop 3) and tasks (Task 0 to Task 4) on a large-scale CGRA]

Streaming Execution Model
• Streaming property
– Packet of data goes through independent tasks
• Partition tasks into stages
– Map each stage onto different hardware
• Pipeline parallelism
– Pipeline the outermost loop
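
The streaming model above can be sketched with a small simulation: packets pass in order through stages that each run on their own hardware, so different packets occupy different stages at the same time. The function name and the stage latencies are hypothetical, and the model assumes constant per-stage latency and unlimited buffering between stages.

```python
def pipeline_finish_time(stage_latencies, num_packets):
    """Time at which the last packet leaves the last stage."""
    if not stage_latencies:
        return 0
    # done[s] = time at which stage s finished its most recent packet
    done = [0] * len(stage_latencies)
    for _ in range(num_packets):
        ready = 0  # time the current packet leaves the previous stage
        for s, latency in enumerate(stage_latencies):
            start = max(ready, done[s])  # wait for the data and for the stage
            done[s] = start + latency
            ready = done[s]
    return done[-1]

# Hypothetical three-stage task chain, four packets.
latencies = [2, 5, 3]
pipelined = pipeline_finish_time(latencies, 4)
unpipelined = 4 * sum(latencies)
print(pipelined, unpipelined)
```

Under these assumptions the slowest stage sets the steady-state rate, which is why balancing stages matters once the outermost loop is pipelined.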

Insights
• Multimedia applications are rich in both ILP and pipeline parallelism
– Not mutually exclusive; they cooperatively enhance performance
• Resource requirement varies
– Statically / dynamically
• Need a flexible execution model
– Exploiting both types of parallelism
– Resource allocation based on computation requirement
– Dynamically adapt to computation variance

Polymorphic Pipeline Array
• Multi-core accelerator: each 2x2 array becomes a processor
• Cores can be combined to form a larger logical core
• Exploit both coarse-grain and fine-grain pipeline parallelism
• No dynamic routing logic: all communications statically generated
[Figure: two rows of four cores, grouped into three logical cores]

Execution Model
• Pipeline outermost loop
• Compute intensive stage
– Assign more resources
– Modulo scheduling
[Figure, built up over three slides: stages ST 0 to ST 3 of the pipelined outermost loop mapped onto cores]
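
One way to picture "assign more resources" to a compute-intensive stage is a greedy split of cores across stages. The sketch below is an illustration only, not the allocation policy from the talk: `allocate_cores`, the work estimates, and the ideal-speedup assumption (stage time = work / cores) are all hypothetical.

```python
def allocate_cores(stage_work, total_cores):
    """Greedy split: every stage gets one core, extras go to the bottleneck."""
    if total_cores < len(stage_work):
        raise ValueError("need at least one core per stage")
    cores = [1] * len(stage_work)
    for _ in range(total_cores - len(stage_work)):
        # Stage time under the assumed ideal speedup: work / cores.
        slowest = max(range(len(stage_work)),
                      key=lambda s: stage_work[s] / cores[s])
        cores[slowest] += 1
    return cores

def bottleneck_time(stage_work, cores):
    """Time of the slowest stage, which limits pipeline throughput."""
    return max(w / c for w, c in zip(stage_work, cores))

# Hypothetical work estimates for ST 0 to ST 3 on an 8-core array.
work = [10, 40, 10, 20]
cores = allocate_cores(work, 8)
print(cores, bottleneck_time(work, cores))
```

In practice a larger logical core rarely gives linear speedup, so the real benefit depends on how well modulo scheduling uses the combined array.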

Partitioning of PPA
• Static partitioning
– Schedules can be optimized
– Computation variance leads to low utilization
• Dynamic partitioning
– Adjust core assignment at run-time
– Adapt to computation variance, but some overhead
• How to support dynamic partitioning
– Multiple schedules: code bloat
– Unified schedule targeting multiple sub-arrays (virtualization)

Virtualized Modulo Scheduling
• One binary that can run on multiple targets
– Part of the code migrates to a neighboring core
– No rescheduling
• Challenges
– Avoid resource conflict
– Enforce multiple modulo constraints
– Inter-core communication
[Figure: operations A and B scheduled on one core and on two cores (0 and 1), with the II marked]

Multi-level Modulo Constraints
[Figure, built up over four slides: operations 0 to 13 scheduled over time on function units F0 to F3 of Core 0 with II = 4, then spread across Core 0 and Core 1 with II = 2]

Inter-core Communication
[Figure: the schedule split across Core 0 and Core 1 (function units F0 to F3, II = 2), with values passed between the cores over a direct RF connection]

VMS Summary
• Edge-centric Modulo Scheduling [PACT’08] with virtualization support
• Generate a unified schedule
– Schedule for the smallest array, then expanded
• Multi-level modulo constraints enforced
– Avoid resource conflict when expanded
– Apply to computation/routing/registers
• Register transfer operations for inter-core communications
– Enabled only when expanded
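
The multi-level modulo constraint can be pictured as checking one schedule against two modulo reservation tables. The sketch below is one plausible reading of the slides, not the scheduler from the talk: it assumes each operation carries a time slot, a function unit, and the core it runs on after expansion, and that the schedule must be conflict-free both at the base II on one core and at the smaller II on each core after expansion. All names and the toy schedules are hypothetical.

```python
from collections import defaultdict

def conflicts(ops, ii, per_core):
    """Groups of ops that share a slot in the modulo reservation table."""
    table = defaultdict(list)
    for name, time, fu, core in ops:
        # On a single core every op competes; after expansion only
        # ops placed on the same core compete.
        slot = (core if per_core else 0, fu, time % ii)
        table[slot].append(name)
    return [names for names in table.values() if len(names) > 1]

def is_virtualizable(ops, base_ii, expanded_ii):
    """True if the schedule is legal on one core and after expansion."""
    return (not conflicts(ops, base_ii, per_core=False)
            and not conflicts(ops, expanded_ii, per_core=True))

# (name, time, function unit, core after expansion)
good = [("A", 0, "F0", 0), ("B", 1, "F0", 0),
        ("C", 2, "F1", 1), ("D", 3, "F1", 1)]
# B moved to time 2: legal at II = 4, but collides with A at II = 2.
bad = [("A", 0, "F0", 0), ("B", 2, "F0", 0),
       ("C", 2, "F1", 1), ("D", 3, "F1", 1)]

print(is_virtualizable(good, 4, 2), is_virtualizable(bad, 4, 2))
print(conflicts(bad, 2, per_core=True))
```

The second schedule shows why a single modulo constraint is not enough: it passes at the base II alone and only fails once the expanded II is also checked.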

Evaluation of PPA
• Exploiting both types of parallelism in AAC
• Dynamic partitioning overhead
– 13% overhead for single-core schedule, runtime overhead
[Figure: four bar charts (y-axis 0 to 80) comparing a 4-core CGRA with static and dynamic (dyn) PPA partitioning on 4, 5, 6, 7, and 8 cores]

Where PPA stands
[Figure: MPEG-4 decoder performance (frames/sec, 24 fps minimum) and cell-phone battery life (hours) for ARM9, ARM11, TI C6x, PPA, and Core2Duo; legend: energy, performance]
Questions?