Carmelo Acosta Sriram Vajapeyam Alex Ramirez Mateo Valero UPC-Barcelona
CDE: A Compiler-driven, Dependence-centric, Eager-executing architecture for the billion transistor era

Carmelo Acosta
Sriram Vajapeyam
Alex Ramirez
Mateo Valero
UPC-Barcelona
Motivation

• Entering the billion transistor era
  - How to use the available hardware to increase performance
  - Keep cost and complexity under control
  - Obtain a true general-purpose architecture: do not limit high performance to a single application class
• Clustered architectures seem the way to go
  - Avoid excessive dependence on the compiler
  - Avoid impossible communication delays
  - Avoid complex interconnection networks
• Hierarchical program partitioning
  - Both in the compiler and the hardware
Outline

• Motivation
• The CDE architecture
• Hierarchical program partitioning
  - Epochs
    • Selective eager execution
  - Dependence clusters
• Hierarchical architecture
  - Epoch Processing Core (EPC)
  - Processing Elements (PE)
• Program execution
• Related work
• Summary and conclusions
The CDE architecture

• The way CDE obtains performance
  - Rely on the compiler for code partitioning: hierarchical program view, matching hierarchical hardware
  - Use both run-time and compile-time speculation to keep the transistors occupied
• How to achieve it
  - The Dependence Cluster (DC) is the basic execution unit, larger than one instruction
    • Larger virtual instruction window
    • Reduces communication
    • Amortizes speculation costs: commit, squash, and redo an entire DC
Hierarchical program partitioning

• Horizontal: control epochs
  - Large code segments: loops, functions, hyperblock-like regions
  - Limit the scope of compiler optimizations: trace scheduling, selective eager execution
• Vertical: dependence clusters
  - Chains of dependent instructions
  - Localize communications
Epochs

b) SuperScalar code:

[ 0] 0x12001e09c: ldq  t0, -21056(gp)
[ 1] 0x12001e0a0: beq  t0, 0x12001e0e4
[ 2] 0x12001e0a4: ldq  t2, 8(t0)
[ 3] 0x12001e0a8: beq  t2, 0x12001e0dc
[ 4] 0x12001e0ac: ldq  t4, 8(t2)
[ 5] 0x12001e0b0: ldq  t4, 8(t4)
[ 6] 0x12001e0b4: xor  a0, t4, t4
[ 7] 0x12001e0b8: beq  t4, 0x12001e0ec
[ 8] 0x12001e0bc: ldq  t2, 16(t2)
[ 9] 0x12001e0c0: beq  t2, 0x12001e0dc
[10] 0x12001e0c4: ldq  t6, 8(t2)
[11] 0x12001e0c8: ldq  t6, 8(t6)
[12] 0x12001e0cc: xor  a0, t6, t6
[13] 0x12001e0d0: beq  t6, 0x12001e0ec
[14] 0x12001e0d4: ldq  t2, 16(t2)
[15] 0x12001e0d8: bne  t2, 0x12001e0ac
[16] 0x12001e0dc: ldq  t0, 16(t0)
[17] 0x12001e0e0: bne  t0, 0x12001e0a4
[18] 0x12001e0e4: ldq  v0, 16(a0)
[19] 0x12001e0e8: ret  zero, (ra), 1
[20] 0x12001e0ec: ldq  t2, 8(t2)
[21] 0x12001e0f0: ldq  v0, 16(t2)
[22] 0x12001e0f4: ret  zero, (ra), 1
c) Control Epoch: the code above partitioned into dependence clusters DC #0 through DC #10 (figure).
a) Source code:

NODE *xlygetvalue(NODE *sym)
{
    register NODE *fp, *ep;

    /* check the environment list */
    for (fp = xlenv; fp; fp = cdr(fp))
        for (ep = car(fp); ep; ep = cdr(ep))
            if (sym == car(car(ep)))
                return (cdr(car(ep)));

    /* return the global value */
    return (getvalue(sym));
}
Eager execution

• Traditional trace scheduling
  - Bet on one direction
  - Optimize the frequent case
  - Generate fix-up code for the infrequent case
  - A hard-to-predict branch becomes an optimized trace plus fix-up code
• Eager execution
  - Remove the branch
  - Optimize each separate case
  - Squash the incorrect trace
  - The branch is removed and both paths execute
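The idea can be sketched in a few lines. This is a minimal illustration of the eager-execution concept, not the hardware mechanism: both sides of a hard-to-predict branch run on private copies of the state, and the copy belonging to the wrong direction is squashed once the branch resolves. All function and variable names here are invented for the example.

```python
import copy

def eager_execute(state, branch_taken, taken_path, fallthrough_path):
    """Run BOTH sides of a hard-to-predict branch on private copies of
    the state, then squash (discard) the wrong one once the branch
    condition is known."""
    taken_state = taken_path(copy.deepcopy(state))       # speculative trace 1
    fall_state = fallthrough_path(copy.deepcopy(state))  # speculative trace 2
    # The branch resolves late; only now do we know which trace to commit.
    return taken_state if branch_taken(state) else fall_state

# Toy use, echoing the "beq t4, ..." branch from the epoch example:
regs = {"t4": 0, "v0": None}
result = eager_execute(
    regs,
    branch_taken=lambda s: s["t4"] == 0,
    taken_path=lambda s: {**s, "v0": "taken"},
    fallthrough_path=lambda s: {**s, "v0": "fallthrough"},
)
print(result["v0"])  # prints "taken", since t4 == 0
```

Trace scheduling would instead have committed only the predicted side up front, paying a fix-up penalty on a misprediction; here both sides pay their execution cost but no recovery code is needed.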
Dependence clusters

• Essentially a set of dependent instructions
  - May have dependencies with other DCs in the same epoch
• The compiler balances
  - Inter-DC dependencies: localize communication within a DC
  - ILP: place independent instructions in a different DC
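As a rough illustration of that balance, the sketch below greedily grows clusters along dependence chains and starts a new cluster for independent instructions. This is an assumed toy heuristic, not the compiler algorithm described in the talk.

```python
def build_clusters(num_instrs, deps):
    """Greedily group instructions into dependence clusters (DCs).

    deps maps an instruction index to the indices it depends on.
    An instruction joins a cluster whose tail feeds it (keeping the
    communication local to the DC); independent instructions start
    new clusters, exposing ILP across DCs.
    """
    cluster_of = {}
    clusters = []
    for i in range(num_instrs):
        placed = False
        for p in deps.get(i, []):
            c = cluster_of.get(p)
            if c is not None and clusters[c][-1] == p:
                clusters[c].append(i)   # extend the chain that feeds i
                cluster_of[i] = c
                placed = True
                break
        if not placed:
            cluster_of[i] = len(clusters)
            clusters.append([i])        # independent: new DC
    return clusters

# Two independent chains, 0->1->2 and 3->4, land in separate DCs:
print(build_clusters(5, {1: [0], 2: [1], 4: [3]}))  # [[0, 1, 2], [3, 4]]
```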
Hierarchical architecture partitioning

• Epoch Processing Core (EPC)
  - Quickly sequences through control epochs
  - Epoch-level speculation
• Mesh of MIPS-2000-like Processing Elements (PEs)
  - Execute individual Dependence Clusters
Epoch Processing Core (EPC)

• Fetches and processes epochs one at a time
  - Speculatively branches to the next epoch: epoch-level sequencing and speculation
• Renames the live-ins and live-outs of each epoch
  - Enables out-of-order epoch execution
• Dispatches the DCs to the PE grid
  - Coupled with the required data about the epoch: the renaming of its live-ins and live-outs
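The renaming step can be sketched as follows; the interface and tag names are hypothetical, chosen only to show why fresh tags for live-outs let epoch instances overlap out of order.

```python
class EpochRenamer:
    """Sketch of epoch-level renaming: each epoch's live-outs get fresh
    physical tags, and later epochs read those tags as live-ins, so
    instances that reuse the same architectural register (e.g. t0 across
    loop iterations) carry no false dependencies."""

    def __init__(self):
        self.next_tag = 0
        self.latest = {}  # architectural register -> latest physical tag

    def rename_epoch(self, live_ins, live_outs):
        # Live-ins bind to whatever tag currently holds the register.
        ins = {r: self.latest.get(r, r) for r in live_ins}
        # Live-outs allocate fresh tags for later epochs to consume.
        outs = {}
        for r in live_outs:
            tag = f"p{self.next_tag}"
            self.next_tag += 1
            self.latest[r] = tag
            outs[r] = tag
        return ins, outs

epc = EpochRenamer()
_, outs1 = epc.rename_epoch(live_ins=["gp"], live_outs=["t0"])
ins2, _ = epc.rename_epoch(live_ins=["t0"], live_outs=["t0"])
print(outs1["t0"], ins2["t0"])  # second epoch reads the first's tag: p0 p0
```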
Processing Elements (PE)

• MIPS-2000-like: in-order, single-issue, short pipeline (F D E M W)
• Local register file: intra-DC dependencies
• Communications manager: inter-DC dependencies
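The cycle-by-cycle diagrams that follow assume this simple pipeline. As a sketch (ignoring stalls and inter-DC communication delays, which the diagrams do model), an ideal in-order single-issue pipeline puts instruction i in stage s at cycle i + s:

```python
def pipeline_trace(num_instrs, stages=("IF", "ID", "EX", "M", "W")):
    """Cycle-by-cycle occupancy of an ideal in-order, single-issue
    pipeline: with no stalls, instruction i is in stage s at cycle i+s."""
    by_cycle = {}
    for i in range(num_instrs):
        for s, name in enumerate(stages):
            by_cycle.setdefault(i + s, []).append((i, name))
    return by_cycle

trace = pipeline_trace(3)
for cycle in sorted(trace):
    print(cycle, trace[cycle])
# e.g. cycle 1 holds instruction 0 in ID while instruction 1 is in IF
```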
Program execution (Cycle 0)

The EPC fetches, processes, renames and starts the Epoch's execution.

(Epoch diagram: DCs #0 through #10, as in the Control Epoch figure.)
Program execution (Cycle 1)

Initial EPC-to-PE communication delay.
Program execution (Cycle 2)

DCs #0, #7 and #8 start execution on their respective PEs (instructions 0, 18 and 19 fetch).
Program execution (Cycle 3)

Each PE continues its execution as statically scheduled by the compiler (instructions 0, 18 and 19 decode).
Program execution (Cycle 4)

DCs #1 and #2 start execution on their respective PEs (instructions 1, 2 and 16 fetch while 0, 18 and 19 execute).
Program execution (Cycle 5)

DC#0 (0-M) generates reg. t0, which is bypassed to the next instruction (1-EX) and sent to DCs #1 and #2.
Program execution (Cycle 6)

DCs #1' and #2' (the next loop iteration's instances) start execution. Reg. t0 arrives at DCs #1 and #2.
Related Work

• RAW: not hierarchical hardware; exploits basic-block parallelism
• GPA: grid of ALUs; high instruction-fetch requirements; exploits hyperblock parallelism
• Multiscalar: horizontal but not vertical code partitioning; superscalar branch treatment
• ILDP: hardware-only approach; dynamic steering of dependent instructions to PEs; depends on an accumulator-based ISA
• Trace Processors: hardware-only approach; dynamic paths are captured in traces
Implementation considerations

• Low-complexity architecture based on regularity: Epoch Processing Core, grid of PEs, communication network
• High performance due to far-ahead speculation: large virtual instruction window
• Strong dependence on the compiler: code partitioning, DC communication
  - Epochs limit the scope of optimizations
Solving multiple problems at once

CDE can also behave in a polymorphic way:

• Exploiting ILP: far-ahead speculation through epoch speculation
• Exploiting TLP: a multi-threaded Epoch Processing Core distributes the PEs among all running threads
• Exploiting DLP: no need to re-dispatch a DC to the PEs; simply re-start the DC with new data
Summary and conclusions

• Hierarchical partitioning
  - Epoch speculation keeps the transistors occupied
  - Eager execution works around difficult branches
• DCs help keep complexity at bay
  - Amortize the cost of speculation (squash, commit)
• Scalable performance with more PEs
  - Increasing wire delays may limit scalability
  - Rely on the compiler to minimize communication
• Design in its initial stages
  - Lots of unanswered questions, especially regarding the memory hierarchy
  - Feedback is welcome!