Carmelo Acosta Sriram Vajapeyam Alex Ramirez Mateo Valero UPC-Barcelona
CDE: A Compiler-driven, Dependence-centric, Eager-executing architecture for the billion transistor era

Carmelo Acosta
Sriram Vajapeyam
Alex Ramirez
Mateo Valero
UPC-Barcelona
Motivation

• Entering the billion transistor era
  - How to use the available hardware to increase performance
  - Keep cost and complexity under control
  - Obtain a true general-purpose architecture: do not limit high performance to a single application class
• Clustered architectures seem the way to go
  - Avoid excessive dependence on the compiler
  - Avoid impossible communication delays
  - Avoid complex interconnection networks
• Hierarchical program partitioning
  - Both in the compiler and the hardware
Outline

• Motivation
• The CDE architecture
• Hierarchical program partitioning
  - Epochs
    • Selective eager execution
  - Dependence clusters
• Hierarchical architecture
  - Epoch Processing Core (EPC)
  - Processing Elements (PE)
• Program execution
• Related work
• Summary and conclusions
The CDE architecture

• The way CDE obtains performance
  - Rely on the compiler for code partitioning: hierarchical program view, matching hierarchical hardware
  - Use both run-time and compile-time speculation to keep the transistors occupied
• How to achieve it
  - The Dependence Cluster (DC) is the basic execution unit, larger than one instruction
    • Larger virtual instruction window
    • Reduces communication
    • Amortizes speculation costs: commit, squash, and redo an entire DC
Hierarchical program partitioning

• Horizontal: control epochs
  - Large code segments: loops, functions, hyperblock-like regions
  - Limit the scope of compiler optimizations: trace scheduling, selective eager execution
• Vertical: dependence clusters
  - Chains of dependent instructions
  - Localize communications
Epochs

b) SuperScalar code:

[ 0] 0x12001e09c: ldq  t0, -21056(gp)
[ 1] 0x12001e0a0: beq  t0, 0x12001e0e4
[ 2] 0x12001e0a4: ldq  t2, 8(t0)
[ 3] 0x12001e0a8: beq  t2, 0x12001e0dc
[ 4] 0x12001e0ac: ldq  t4, 8(t2)
[ 5] 0x12001e0b0: ldq  t4, 8(t4)
[ 6] 0x12001e0b4: xor  a0, t4, t4
[ 7] 0x12001e0b8: beq  t4, 0x12001e0ec
[ 8] 0x12001e0bc: ldq  t2, 16(t2)
[ 9] 0x12001e0c0: beq  t2, 0x12001e0dc
[10] 0x12001e0c4: ldq  t6, 8(t2)
[11] 0x12001e0c8: ldq  t6, 8(t6)
[12] 0x12001e0cc: xor  a0, t6, t6
[13] 0x12001e0d0: beq  t6, 0x12001e0ec
[14] 0x12001e0d4: ldq  t2, 16(t2)
[15] 0x12001e0d8: bne  t2, 0x12001e0ac
[16] 0x12001e0dc: ldq  t0, 16(t0)
[17] 0x12001e0e0: bne  t0, 0x12001e0a4
[18] 0x12001e0e4: ldq  v0, 16(a0)
[19] 0x12001e0e8: ret  zero, (ra), 1
[20] 0x12001e0ec: ldq  t2, 8(t2)
[21] 0x12001e0f0: ldq  v0, 16(t2)
[22] 0x12001e0f4: ret  zero, (ra), 1
c) Control Epoch: the code above partitioned into dependence clusters DC #0 through DC #10 (figure).
a) Source code:

NODE *xlygetvalue(NODE *sym)
{
    register NODE *fp, *ep;

    /* check the environment list */
    for (fp = xlenv; fp; fp = cdr(fp))
        for (ep = car(fp); ep; ep = cdr(ep))
            if (sym == car(car(ep)))
                return (cdr(car(ep)));

    /* return the global value */
    return (getvalue(sym));
}
Eager execution

• Traditional trace scheduling
  - Bet on one direction
  - Optimize the frequent case
  - Generate fix-up code for the infrequent case
  - A hard-to-predict branch becomes an optimized trace plus fix-up code
• Eager execution
  - Remove the branch
  - Optimize each separate case
  - Squash the incorrect trace
  - The branch is removed and both paths execute
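The idea can be sketched in a few lines. This is a minimal illustration of the eager-execution concept, not the hardware mechanism: both sides of a hard-to-predict branch run on private copies of the state, and the copy belonging to the wrong direction is squashed once the branch resolves. All function and variable names here are invented for the example.

```python
import copy

def eager_execute(state, branch_taken, taken_path, fallthrough_path):
    """Run BOTH sides of a hard-to-predict branch on private copies of
    the state, then squash (discard) the wrong one once the branch
    condition is known."""
    taken_state = taken_path(copy.deepcopy(state))       # speculative trace 1
    fall_state = fallthrough_path(copy.deepcopy(state))  # speculative trace 2
    # The branch resolves late; only now do we know which trace to commit.
    return taken_state if branch_taken(state) else fall_state

# Toy use, echoing the "beq t4, ..." branch from the epoch example:
regs = {"t4": 0, "v0": None}
result = eager_execute(
    regs,
    branch_taken=lambda s: s["t4"] == 0,
    taken_path=lambda s: {**s, "v0": "taken"},
    fallthrough_path=lambda s: {**s, "v0": "fallthrough"},
)
print(result["v0"])  # prints "taken", since t4 == 0
```

Trace scheduling would instead have committed only the predicted side up front, paying a fix-up penalty on a misprediction; here both sides pay their execution cost but no recovery code is needed.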
Dependence clusters

• Essentially a set of dependent instructions
  - May have dependencies with other DCs in the same epoch
• The compiler balances
  - Inter-DC dependencies: localize communication within a DC
  - ILP: place independent instructions in a different DC
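As a rough illustration of that balance, the sketch below greedily grows clusters along dependence chains and starts a new cluster for independent instructions. This is an assumed toy heuristic, not the compiler algorithm described in the talk.

```python
def build_clusters(num_instrs, deps):
    """Greedily group instructions into dependence clusters (DCs).

    deps maps an instruction index to the indices it depends on.
    An instruction joins a cluster whose tail feeds it (keeping the
    communication local to the DC); independent instructions start
    new clusters, exposing ILP across DCs.
    """
    cluster_of = {}
    clusters = []
    for i in range(num_instrs):
        placed = False
        for p in deps.get(i, []):
            c = cluster_of.get(p)
            if c is not None and clusters[c][-1] == p:
                clusters[c].append(i)   # extend the chain that feeds i
                cluster_of[i] = c
                placed = True
                break
        if not placed:
            cluster_of[i] = len(clusters)
            clusters.append([i])        # independent: new DC
    return clusters

# Two independent chains, 0->1->2 and 3->4, land in separate DCs:
print(build_clusters(5, {1: [0], 2: [1], 4: [3]}))  # [[0, 1, 2], [3, 4]]
```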
Hierarchical architecture partitioning

• Epoch Processing Core (EPC)
  - Quickly sequences through control epochs
  - Epoch-level speculation
• Mesh of MIPS-2000-like Processing Elements (PEs)
  - Execute individual Dependence Clusters
Epoch Processing Core (EPC)

• Fetches and processes epochs one at a time
  - Speculatively branches to the next epoch: epoch-level sequencing and speculation
• Renames the live-ins and live-outs of each epoch
  - Enables out-of-order epoch execution
• Dispatches the DCs to the PE grid
  - Coupled with the required data about the epoch: the renaming of its live-ins and live-outs
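The renaming step can be sketched as follows; the interface and tag names are hypothetical, chosen only to show why fresh tags for live-outs let epoch instances overlap out of order.

```python
class EpochRenamer:
    """Sketch of epoch-level renaming: each epoch's live-outs get fresh
    physical tags, and later epochs read those tags as live-ins, so
    instances that reuse the same architectural register (e.g. t0 across
    loop iterations) carry no false dependencies."""

    def __init__(self):
        self.next_tag = 0
        self.latest = {}  # architectural register -> latest physical tag

    def rename_epoch(self, live_ins, live_outs):
        # Live-ins bind to whatever tag currently holds the register.
        ins = {r: self.latest.get(r, r) for r in live_ins}
        # Live-outs allocate fresh tags for later epochs to consume.
        outs = {}
        for r in live_outs:
            tag = f"p{self.next_tag}"
            self.next_tag += 1
            self.latest[r] = tag
            outs[r] = tag
        return ins, outs

epc = EpochRenamer()
_, outs1 = epc.rename_epoch(live_ins=["gp"], live_outs=["t0"])
ins2, _ = epc.rename_epoch(live_ins=["t0"], live_outs=["t0"])
print(outs1["t0"], ins2["t0"])  # second epoch reads the first's tag: p0 p0
```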
Processing Elements (PE)

• MIPS-2000-like: in-order, single-issue, short pipeline (F D E M W)
• Local register file: intra-DC dependencies
• Communications manager: inter-DC dependencies
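The cycle-by-cycle diagrams that follow assume this simple pipeline. As a sketch (ignoring stalls and inter-DC communication delays, which the diagrams do model), an ideal in-order single-issue pipeline puts instruction i in stage s at cycle i + s:

```python
def pipeline_trace(num_instrs, stages=("IF", "ID", "EX", "M", "W")):
    """Cycle-by-cycle occupancy of an ideal in-order, single-issue
    pipeline: with no stalls, instruction i is in stage s at cycle i+s."""
    by_cycle = {}
    for i in range(num_instrs):
        for s, name in enumerate(stages):
            by_cycle.setdefault(i + s, []).append((i, name))
    return by_cycle

trace = pipeline_trace(3)
for cycle in sorted(trace):
    print(cycle, trace[cycle])
# e.g. cycle 1 holds instruction 0 in ID while instruction 1 is in IF
```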
Program execution (Cycle 0)

The EPC fetches, processes, renames and starts the Epoch's execution.

(Epoch diagram: DCs #0 through #10, as in the Control Epoch figure.)
Program execution (Cycle 1)

Initial EPC-to-PE communication delay.
Program execution (Cycle 2)

DCs #0, #7 and #8 start execution on their respective PEs (instructions 0, 18 and 19 fetch).
Program execution (Cycle 3)

Each PE continues its execution as statically scheduled by the compiler (instructions 0, 18 and 19 decode).
Program execution (Cycle 4)

DCs #1 and #2 start execution on their respective PEs (instructions 1, 2 and 16 fetch while 0, 18 and 19 execute).
Program execution (Cycle 5)

DC#0 (0-M) generates reg. t0, which is bypassed to the next instruction (1-EX) and sent to DCs #1 and #2.
Program execution (Cycle 6)

DCs #1' and #2' (the next loop iteration's instances) start execution. Reg. t0 arrives at DCs #1 and #2.
Related Work

• RAW: not hierarchical hardware; exploits basic-block parallelism
• GPA: grid of ALUs; high instruction-fetch requirements; exploits hyperblock parallelism
• Multiscalar: horizontal but not vertical code partitioning; superscalar branch treatment
• ILDP: hardware-only approach; dynamic steering of dependent instructions to PEs; depends on an accumulator-based ISA
• Trace Processors: hardware-only approach; dynamic paths are captured in traces
Implementation considerations

• Low-complexity architecture based on regularity: Epoch Processing Core, grid of PEs, communication network
• High performance due to far-ahead speculation: large virtual instruction window
• Strong dependence on the compiler: code partitioning, DC communication
  - Epochs limit the scope of optimizations
Solving multiple problems at once

CDE can also behave in a polymorphic way:

• Exploiting ILP: far-ahead speculation through epoch speculation
• Exploiting TLP: a multi-threaded Epoch Processing Core distributes the PEs among all running threads
• Exploiting DLP: no need to re-dispatch a DC to the PEs; simply re-start the DC with new data
Summary and conclusions

• Hierarchical partitioning
  - Epoch speculation keeps the transistors occupied
  - Eager execution works around difficult branches
• DCs help keep complexity at bay
  - Amortize the cost of speculation (squash, commit)
• Scalable performance with more PEs
  - Increasing wire delays may limit scalability
  - Rely on the compiler to minimize communication
• Design in its initial stages
  - Lots of unanswered questions, especially regarding the memory hierarchy
  - Feedback is welcome!