WaveScalar and the WaveCache

Spring 2003 CSE P548 1 WaveScalar and the WaveCache Steven Swanson Ken Michelson Mark Oskin Tom Anderson Susan Eggers University of Washington

Page 1: WaveScalar and the WaveCache

Spring 2003 CSE P548 1

WaveScalar and the WaveCache

Steven Swanson, Ken Michelson, Mark Oskin, Tom Anderson, Susan Eggers

University of Washington

Page 2: WaveScalar and the WaveCache

Worries to Keep You up at Night

- In 2016, 200,000 RISC-1 processors will fit on a die.
- It will take 36 cycles to cross the die.
- Still a lack of ILP.
- Memory latency is still a problem.
- For reasonable yields, only 1 transistor in 24 billion may be broken (if one flaw breaks a chip).
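The yield worry can be made concrete with a back-of-the-envelope calculation. The sketch below assumes (the slide does not say this) that defects are independent with a uniform per-transistor probability p, so a die of N transistors works with probability (1 - p)^N. The function name `chip_yield` and the choice of N = 24 billion as the die size are illustrative assumptions.

```python
import math

def chip_yield(num_transistors: float, defect_prob: float) -> float:
    """Probability that every transistor works, assuming independent defects."""
    # log1p keeps precision when defect_prob is tiny.
    return math.exp(num_transistors * math.log1p(-defect_prob))

N = 24e9  # assumed transistors per die, borrowing the slide's 24 billion figure

# Compare one expected defect per die with a tenth of a defect per die.
for p in (1 / 24e9, 1 / 240e9):
    print(f"p = {p:.2e}: yield = {chip_yield(N, p):.3f}")
```

Under these assumptions, a defect rate of 1 in 24 billion still gives only about e^-1 (roughly 37%) yield if a single flaw kills the chip, which is why tolerating flaws matters.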

Page 3: WaveScalar and the WaveCache

WaveScalar’s Solution: Utilize Die Capability

- A sea of simple, RISC-like processors
  - in-order, single-issue
  - takes advantage of billions of transistors without exacerbating the other problems
    - short design & implementation time
    - operates at a short cycle
    - does not need lots of ILP
    - fewer defects

Page 4: WaveScalar and the WaveCache

WaveScalar Processing Element

[Figure: WaveScalar processing element. Labels: FLOW CONTROL, FU, FLOW CONTROL, DECODE, CONFIG. LOGIC, INPUTS, OUTPUTS; L2 Cache]

Page 5: WaveScalar and the WaveCache

WaveScalar’s Solution: Short Wires

- Dataflow execution model
  - each processor executes when its operands have arrived
  - same principle as out-of-order execution, but applies to the processor & includes fetching
  - no single program counter
- Short wires
  - no long control lines
  - no centralized hardware data structures
  - no need for sequential & individual instruction fetches
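The dataflow firing rule can be illustrated with a toy simulator. This is a minimal sketch, not the WaveScalar hardware: the `Node` class, the `run` function, and the token format `(destination, input slot, value)` are all assumptions made for illustration. Note that there is no program counter; an instruction fires only because its operands arrived.

```python
import operator
from collections import deque

class Node:
    """One instruction: fires once all of its input operands have arrived."""
    def __init__(self, op, num_inputs, consumers):
        self.op = op
        self.num_inputs = num_inputs
        self.consumers = consumers  # list of (consumer name, input slot)
        self.operands = {}

    def receive(self, slot, value):
        self.operands[slot] = value
        return len(self.operands) == self.num_inputs  # True when ready to fire

    def fire(self):
        args = [self.operands[i] for i in range(self.num_inputs)]
        self.operands = {}
        return self.op(*args)

def run(nodes, initial_tokens):
    """Deliver tokens in arrival order; record which nodes fire and when."""
    queue = deque(initial_tokens)
    results, fired = {}, []
    while queue:
        dest, slot, value = queue.popleft()
        node = nodes[dest]
        if node.receive(slot, value):
            out = node.fire()
            fired.append(dest)
            results[dest] = out
            # Send the result directly to each consumer's input slot.
            for consumer, cslot in node.consumers:
                queue.append((consumer, cslot, out))
    return results, fired

# Graph for (a + b) * (c - d) with a=3, b=4, c=7, d=2.
nodes = {
    "add": Node(operator.add, 2, [("mul", 0)]),
    "sub": Node(operator.sub, 2, [("mul", 1)]),
    "mul": Node(operator.mul, 2, []),
}
# Operands arrive in an arbitrary order.
tokens = [("sub", 1, 2), ("add", 0, 3), ("sub", 0, 7), ("add", 1, 4)]
results, fired = run(nodes, tokens)
print(results["mul"], fired)  # 35 ['sub', 'add', 'mul']
```

Here `sub` fires before `add` simply because its operands happened to arrive first.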

Page 6: WaveScalar and the WaveCache

WaveScalar’s Solution: Short Wires

- Dataflow execution model, cont’d.
  - differs from original dataflow computers
    - distributed tag management (matching between renamed producer-consumer registers)
    - special WaveScalar instructions assign a number to all operands in a wave (think iteration or trace) & coordinate wave execution
    - all instructions in a “wave” execute on data with the same wave number
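Matching operands by wave number can be sketched as follows. This is an illustration under assumptions, not the WaveScalar tag-matching hardware: `WaveMatchingStore` is a hypothetical name, and the instruction is hard-wired to a two-input add for brevity.

```python
from collections import defaultdict

class WaveMatchingStore:
    """Pairs up the two operands of one add instruction by wave number."""
    def __init__(self):
        self.waiting = defaultdict(dict)  # wave number -> {slot: value}

    def arrive(self, wave, slot, value):
        slots = self.waiting[wave]
        slots[slot] = value
        if len(slots) == 2:
            # Both operands of this wave are present: fire and free the entry.
            del self.waiting[wave]
            return wave, slots[0] + slots[1]
        return None  # still waiting for the partner operand of this wave

store = WaveMatchingStore()
# Tokens are (wave number, input slot, value); waves 0 and 1 are interleaved.
arrivals = [(0, 0, 10), (1, 0, 20), (1, 1, 2), (0, 1, 1)]
fired = [r for r in (store.arrive(*t) for t in arrivals) if r is not None]
print(fired)  # [(1, 22), (0, 11)]
```

Operands from different waves (for example, different loop iterations) never combine, even when they arrive interleaved.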

Page 7: WaveScalar and the WaveCache

WaveScalar’s Solution: Short Wires

- Dataflow execution model
  - differs from original dataflow computers
    - explicit wave-ordered memory
      - compiler assigns a sequence number to each memory operation in a breadth-first manner
      - sequence numbers for an operation, its predecessor & its successor are all sent with the produced data
      - wave & sequence numbers provide a total order on memory operations through any traversal of a wave
      + normal memory semantics
      + no need for special dataflow languages; C & C++ programs execute just fine
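A minimal sketch of how a store buffer might use the predecessor/sequence/successor annotations to recover program order. Everything here is an assumption for illustration: the names `MemOp` and `WaveOrderedBuffer` are hypothetical, and `None` is used to mark a neighbour the compiler cannot know statically because a branch intervenes. It handles a single wave only.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class MemOp:
    pred: Optional[int]  # sequence number of the previous memory op, None if unknown
    seq: int             # this operation's sequence number
    succ: Optional[int]  # sequence number of the next memory op, None if unknown
    name: str

class WaveOrderedBuffer:
    """Issues memory operations of one wave in order, whatever order they arrive in."""
    def __init__(self, first_seq):
        self.first_seq = first_seq
        self.pending = {}
        self.last = None
        self.issued = []

    def _next_ready(self):
        if self.last is None:
            return self.pending.get(self.first_seq)
        if self.last.succ is not None:
            # The last issued op names its successor explicitly.
            return self.pending.get(self.last.succ)
        # Successor unknown: look for an op that names the last op as predecessor.
        for op in self.pending.values():
            if op.pred == self.last.seq:
                return op
        return None

    def arrive(self, op):
        self.pending[op.seq] = op
        nxt = self._next_ready()
        while nxt is not None:
            del self.pending[nxt.seq]
            self.issued.append(nxt.name)
            self.last = nxt
            nxt = self._next_ready()

# Wave with a branch after op 1: ops 2 and 3 sit on alternative paths, op 4 follows both.
# On this run path 2 is taken, so op 3 never arrives.
buf = WaveOrderedBuffer(first_seq=1)
buf.arrive(MemOp(None, 4, None, "load d"))
buf.arrive(MemOp(1, 2, 4, "store b"))
buf.arrive(MemOp(None, 1, None, "load a"))
print(buf.issued)  # ['load a', 'store b', 'load d']
```

The chain 1 -> 2 -> 4 is rebuilt from the links alone: op 2 names op 1 as its predecessor, and names op 4 as its successor.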

Page 8: WaveScalar and the WaveCache

WaveScalar’s Solution: Short Wires

- Nearest-neighbor communication
  - code placement to locate consumers near their producers
  - short, fast node-to-node links rather than slow broadcast networks
  - exploits dataflow locality: probability of producing a value for a particular consumer instruction & therefore register (register renaming can destroy this)
  - instructions can dynamically migrate toward their neighbors during execution
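To see why placement matters, the sketch below scores a placement by the total number of grid hops between producers and consumers. The cost model (Manhattan distance on a 2D grid), the function names, and the exhaustive search are assumptions for illustration only; the slides do not describe the actual placement algorithm.

```python
from itertools import permutations

def hops(a, b):
    """Manhattan distance between two grid positions (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def comm_cost(placement, edges):
    """Total hops travelled by values over all producer -> consumer edges."""
    return sum(hops(placement[p], placement[c]) for p, c in edges)

def best_placement(instructions, edges, slots):
    """Exhaustive search over placements; only viable for tiny examples."""
    best = None
    for perm in permutations(slots, len(instructions)):
        placement = dict(zip(instructions, perm))
        cost = comm_cost(placement, edges)
        if best is None or cost < best[0]:
            best = (cost, placement)
    return best

instructions = ["ld", "add", "mul", "st"]
edges = [("ld", "add"), ("add", "mul"), ("mul", "st")]  # a simple dependence chain
slots = [(r, c) for r in range(2) for c in range(2)]     # 2x2 grid of PEs

# A placement that ignores the dataflow graph.
naive = dict(zip(instructions, [(0, 0), (1, 1), (0, 1), (1, 0)]))
print(comm_cost(naive, edges))                       # 5
print(best_placement(instructions, edges, slots)[0])  # 3
```

With consumers next to their producers, every value travels exactly one hop.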

Page 9: WaveScalar and the WaveCache

Dynamic Optimization

The common case has higher costs, and the branch can detect this…

[Figure: Branch and Join nodes connected by a Common Case path and a Rare Case path]

Page 10: WaveScalar and the WaveCache

Dynamic Optimization

…and fix it, by moving. The join can do the same.

[Figure: Branch and Join nodes connected by a Common Case path and a Rare Case path]
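The two Dynamic Optimization slides can be sketched as a greedy migration step: the branch counts how often it sends to each consumer and moves to a free neighbouring slot when that lowers its traffic-weighted distance. The policy, the counters, and all names here are assumptions for illustration; the slides do not specify the mechanism.

```python
def hops(a, b):
    """Manhattan distance between two grid positions (row, col)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def weighted_cost(pos, consumers, counts):
    """Distance to each consumer, weighted by how often it is the target."""
    return sum(counts[name] * hops(pos, consumers[name]) for name in consumers)

def migrate_step(pos, consumers, counts, free):
    """Return the free neighbouring slot with the lowest cost, or pos if none is better."""
    best_pos, best_cost = pos, weighted_cost(pos, consumers, counts)
    r, c = pos
    for cand in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
        if cand in free:
            cost = weighted_cost(cand, consumers, counts)
            if cost < best_cost:
                best_pos, best_cost = cand, cost
    return best_pos

# The branch starts next to the rare case and two hops from the common case.
consumers = {"common": (0, 3), "rare": (0, 0)}
counts = {"common": 90, "rare": 10}  # observed sends per consumer (made-up numbers)
pos, free = (0, 1), {(0, 2)}

# Keep moving while each step strictly reduces the weighted cost.
while True:
    new_pos = migrate_step(pos, consumers, counts, free)
    if new_pos == pos:
        break
    free.remove(new_pos)
    free.add(pos)
    pos = new_pos

print(pos, weighted_cost(pos, consumers, counts))  # (0, 2) 110
```

The branch ends up adjacent to the common case, cutting the weighted cost from 190 to 110 in this example.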

Page 11: WaveScalar and the WaveCache

WaveScalar’s Solution: Short Wires

[Figure. Labels: PE Domain, L2 Cache]

Page 12: WaveScalar and the WaveCache

WaveScalar’s Solution: Short Wires

[Figure. Labels: D$ + Store Buffer, Cluster, L2 Cache]

Page 13: WaveScalar and the WaveCache

WaveScalar’s Solution: Creative Use of Untapped Parallelism

- Expand the window for exploiting ILP
  - no in-order fetch using only one PC (sucking through a straw)
  - place instructions with the processing elements
  - out-of-order execution on a grand scale
- Allow multiple threads to execute concurrently
  - OS & applications
  - multiple applications, parallel threads

Page 14: WaveScalar and the WaveCache

WaveScalar’s Solution: The I-Cache is the Processor

- Model is processor-in-memory (PIM)
  - processing element associated with each instruction
- WaveScalar version
  - processing elements placed in the I-cache to reduce latency

Page 15: WaveScalar and the WaveCache

WaveScalar’s Solution: Design to Compensate for Circuit Unreliability

[Figure; surviving label: L2 Cache]

- Fewer design & implementation errors thanks to the grid of simple, uniform processing elements
- Route around processors with flaws
  - decentralized control
  - dynamic instruction migration
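Routing around flawed processors can be illustrated with a breadth-first search over the grid that simply skips defective positions. This is a sketch under assumptions: the slides do not say how routes are chosen, and the `route` function and its parameters are hypothetical.

```python
from collections import deque

def route(grid_size, src, dst, defective):
    """Shortest path from src to dst over working PEs, or None if unreachable."""
    rows, cols = grid_size
    prev = {src: None}  # visited positions, each mapped to the position before it
    queue = deque([src])
    while queue:
        cur = queue.popleft()
        if cur == dst:
            # Walk the predecessor links back to the source.
            path = []
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        r, c = cur
        for nxt in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            in_grid = 0 <= nxt[0] < rows and 0 <= nxt[1] < cols
            if in_grid and nxt not in defective and nxt not in prev:
                prev[nxt] = cur
                queue.append(nxt)
    return None

# 3x3 grid; the PE directly between source and destination is broken.
path = route((3, 3), (0, 0), (0, 2), defective={(0, 1)})
print(path)  # [(0, 0), (1, 0), (1, 1), (1, 2), (0, 2)]
```

The detour costs four hops instead of two, but the chip stays usable despite the flaw.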

Page 16: WaveScalar and the WaveCache

Research Agenda: Architecture

- WaveScalar ISA
- Microarchitecture design
  - node design
  - domain size
  - cache-coherence across clusters
  - cluster arrangement
- Control & memory speculation
- WaveScalar instruction management
  - hardware for instruction placement & replacement
  - hardware for dynamic, self-optimizing placement

Page 17: WaveScalar and the WaveCache

Research Agenda: Architecture

- Multithreaded WaveScalar
- Design of the network & routing issues
- Power management
- Static & dynamic fault detection & recovery (rerouting instructions)
- System-level design
- Application to non-silicon designs

Page 18: WaveScalar and the WaveCache

Research Agenda: Compilers

- Instruction placement
- Revisit classic optimizations
  - code savings vs. communication costs
  - cache pollution vs. loop parallelism
- New opportunities for optimization
  - a match between compiler & execution models
  - WaveScalar-specific instructions

Page 19: WaveScalar and the WaveCache

Research Agenda: OS & Networking

- Tension between facilitating short routines & poor instruction locality
- The software side of thread management
- A bunch of stuff I don’t know about
  - optimizing the OS interface
  - new thread protection policies
  - memory management issues
  - security
  - lazy context switching
  - utilizing virtual machines

Page 20: WaveScalar and the WaveCache

Putting It All Together

- Grid of hundreds (maybe thousands) of simple, dataflow processing nodes
  - no centralized control; scalable
  - few design errors; increase in yield
- Processing nodes embedded in the I-cache
- Instructions execute in place
- Send results directly to the consumers
  - short, point-to-point links
- Instructions can dynamically migrate
  - reduce latency to hot consumers
  - map around defects
- 3X performance without any prediction mechanisms
  - more with them
