WaveScalar and the WaveCache
Spring 2003 CSE P548 1
WaveScalar and the WaveCache
Steven Swanson, Ken Michelson,
Mark Oskin, Tom Anderson, Susan Eggers
University of Washington
Worries to Keep You up at Night
  In 2016, 200,000 RISC-1 processors will fit on a die.
  It will take 36 cycles to cross the die.
  Still a lack of ILP.
  Memory latency is still a problem.
  For reasonable yields, only 1 transistor in 24 billion may be broken (if one flaw breaks a chip).
WaveScalar’s Solution: Utilize Die Capability
  A sea of simple, RISC-like processors
    in-order, single-issue
    takes advantage of billions of transistors without exacerbating the other problems
      short design & implementation time
      operates at a short cycle
      does not need lots of ILP
      fewer defects
WaveScalar Processing Element
  [Figure: processing element block diagram; labels: INPUTS, FLOW CONTROL, DECODE, CONFIG. LOGIC, FU, FLOW CONTROL, OUTPUTS, L2 Cache]
WaveScalar’s Solution: Short Wires
  Dataflow execution model
    each processor executes when its operands have arrived
    same principle as out-of-order execution, but applies to the processor & includes fetching
    no single program counter
    short wires:
      no long control lines
      no centralized hardware data structures
      no need for sequential & individual instruction fetches
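The firing rule on this slide (an instruction executes once its operands have arrived, with no program counter) can be sketched in a few lines of Python. This is only an illustration, not the WaveScalar hardware: the `Instruction` class, the `run` function, and the three-instruction program are all hypothetical names invented for this sketch.

```python
import operator
from collections import deque

class Instruction:
    """One dataflow node: fires once every input slot holds a value."""
    def __init__(self, name, op, n_inputs, consumers):
        self.name = name
        self.op = op
        self.slots = [None] * n_inputs
        self.filled = [False] * n_inputs
        self.consumers = consumers  # list of (consumer name, input slot)

    def receive(self, slot, value):
        """Store one operand; return True when all operands are present."""
        self.slots[slot] = value
        self.filled[slot] = True
        return all(self.filled)

def run(program, initial_tokens):
    """Deliver tokens until none remain; return firing order and results."""
    pending = deque(initial_tokens)  # (instruction name, slot, value)
    fired, results = [], {}
    while pending:
        name, slot, value = pending.popleft()
        inst = program[name]
        if inst.receive(slot, value):
            out = inst.op(*inst.slots)
            fired.append(name)
            results[name] = out
            inst.filled = [False] * len(inst.slots)
            # Results go straight to the consumers; nothing is fetched in order.
            for consumer, cslot in inst.consumers:
                pending.append((consumer, cslot, out))
    return fired, results

# Hypothetical program computing (a + b) * (c - d) with a=2, b=3, c=5, d=1.
program = {
    "add": Instruction("add", operator.add, 2, [("mul", 0)]),
    "sub": Instruction("sub", operator.sub, 2, [("mul", 1)]),
    "mul": Instruction("mul", operator.mul, 2, []),
}
# Operands arrive in an arbitrary order.
tokens = [("sub", 1, 1), ("add", 0, 2), ("sub", 0, 5), ("add", 1, 3)]
fired, results = run(program, tokens)
```

Here `sub` fires before `add` simply because its operands happened to arrive first; execution order follows data arrival, not program order.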
WaveScalar’s Solution: Short Wires
  Dataflow execution model, cont’d.
    differs from original dataflow computers
      distributed tag management (matching between renamed producer-consumer registers)
      special WaveScalar instructions assign a number to all operands in a wave (think iteration or trace) & coordinate wave execution
      all instructions in a “wave” execute on data with the same wave number
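A rough sketch of what matching on wave numbers could look like: operands are only paired when they target the same instruction and carry the same wave number. The `MatchingTable` class and its `arrive` method are hypothetical names for this illustration, and the sketch says nothing about how the real hardware distributes its tag storage.

```python
from collections import defaultdict

class MatchingTable:
    """Pairs up operands that share (instruction, wave number)."""
    def __init__(self, n_inputs):
        self.n_inputs = n_inputs
        self.waiting = defaultdict(dict)  # (inst, wave) -> {slot: value}

    def arrive(self, inst, wave, slot, value):
        """Record one operand; return the operand list once all slots match."""
        key = (inst, wave)
        self.waiting[key][slot] = value
        if len(self.waiting[key]) == self.n_inputs:
            operands = self.waiting.pop(key)
            return [operands[i] for i in range(self.n_inputs)]
        return None

table = MatchingTable(n_inputs=2)
first = table.arrive("add", wave=0, slot=0, value=10)
second = table.arrive("add", wave=1, slot=1, value=7)   # different wave: no match
third = table.arrive("add", wave=0, slot=1, value=32)   # completes wave 0
fourth = table.arrive("add", wave=1, slot=0, value=5)   # completes wave 1
```

Operands from wave 1 never combine with operands from wave 0, even though they are destined for the same instruction.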
WaveScalar’s Solution: Short Wires
  Dataflow execution model differs from original dataflow computers
    explicit wave-ordered memory
      compiler assigns a sequence number to each memory operation in a breadth-first manner
      sequence number for an operation, its predecessor & successor all sent with produced data
      wave & sequence numbers provide a total order on memory operations through any traversal of a wave
      + normal memory semantics
      + no need for special dataflow languages; C & C++ programs execute just fine
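To make the (predecessor, sequence, successor) idea concrete, here is a minimal sketch of a buffer that accepts memory requests in any arrival order and issues them in wave order. It rests on assumptions that are not in the slides: `None` stands for a predecessor or successor the compiler cannot know statically (for example across a branch), a single wave is modeled, and the class and method names are hypothetical.

```python
class WaveOrderedMemory:
    """Issues the memory operations of one wave in sequence order."""
    def __init__(self, first_seq):
        self.first_seq = first_seq  # sequence number of the wave's first op
        self.last = None            # (seq, succ) of the most recently issued op
        self.waiting = []           # buffered (pred, seq, succ, label)
        self.issued = []

    def _ready(self, pred, seq):
        if self.last is None:
            return seq == self.first_seq
        last_seq, last_succ = self.last
        # Either the previous op named us as its successor,
        # or we name the previous op as our predecessor.
        return last_succ == seq or pred == last_seq

    def request(self, pred, seq, succ, label):
        """Buffer a request, then issue everything that has become ready."""
        self.waiting.append((pred, seq, succ, label))
        progress = True
        while progress:
            progress = False
            for op in self.waiting:
                if self._ready(op[0], op[1]):
                    self.waiting.remove(op)
                    self.issued.append(op[3])
                    self.last = (op[1], op[2])
                    progress = True
                    break

# Hypothetical wave: op 1, then a branch to op 2 or op 3, then op 4.
# This run takes the path 1 -> 3 -> 4, and requests arrive out of order.
mem = WaveOrderedMemory(first_seq=1)
mem.request(None, 4, None, "load B")   # predecessor unknown (2 or 3)
mem.request(1, 3, 4, "store A")
mem.request(None, 1, None, "load A")   # successor unknown (2 or 3)
```

Because each request names its neighbors, the buffer can tell that no operation is missing between two requests, which is what lets ordinary sequential memory semantics survive out-of-order arrival.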
WaveScalar’s Solution: Short Wires
  Nearest-neighbor communication
    code placement to locate consumers near their producers
    short, fast node-to-node links rather than slow broadcast networks
    exploits dataflow locality: probability of producing a value for a particular consumer instruction & therefore register (register renaming can destroy this)
    instructions can dynamically migrate toward their neighbors during execution
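The placement goal above (consumers near their producers) can be illustrated with a toy cost model. The sketch assumes a 2-D grid with Manhattan-distance hop counts and uses a simple greedy heuristic; neither the cost model nor the heuristic comes from the slides, and all names are hypothetical.

```python
def manhattan(a, b):
    """Hop count between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def communication_cost(edges, placement):
    """Sum of hop counts over producer -> consumer edges, weighted by traffic."""
    return sum(w * manhattan(placement[p], placement[c]) for p, c, w in edges)

def greedy_place(order, edges, grid_w, grid_h):
    """Place each instruction on the free cell closest to its placed partners."""
    free = [(x, y) for y in range(grid_h) for x in range(grid_w)]
    placement = {}
    for inst in order:
        def cost(cell):
            total = 0
            for p, c, w in edges:
                if p == inst and c in placement:
                    total += w * manhattan(cell, placement[c])
                elif c == inst and p in placement:
                    total += w * manhattan(cell, placement[p])
            return total
        best = min(free, key=cost)
        free.remove(best)
        placement[inst] = best
    return placement

# Hypothetical four-instruction chain on a 2x2 grid.
edges = [("ld", "add", 1), ("add", "mul", 1), ("mul", "st", 1)]
order = ["ld", "add", "mul", "st"]
placed = greedy_place(order, edges, grid_w=2, grid_h=2)
row_major = {"ld": (0, 0), "add": (1, 0), "mul": (0, 1), "st": (1, 1)}
greedy_cost = communication_cost(edges, placed)
naive_cost = communication_cost(edges, row_major)
```

In this small case the greedy placement keeps every producer one hop from its consumer, while a naive row-major layout forces one value across a two-hop diagonal.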
Dynamic Optimization
  The common case has higher costs, and the branch can detect this…
  [Figure: dataflow graph; labels: Branch, Common Case, Rare Case, Join]
Dynamic Optimization
  …and fix it, by moving. The join can do the same.
  [Figure: dataflow graph; labels: Branch, Common Case, Rare Case, Join]
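One way to picture the branch "fixing it, by moving" is a step function that shifts an instruction one grid hop toward whichever consumer it sends the most values to. This is a sketch under assumed conditions (a 2-D grid, per-consumer traffic counters, one instruction per cell); the slides do not specify the actual migration mechanism, and `migrate_step` is a hypothetical name.

```python
def migrate_step(position, traffic, consumer_positions, occupied):
    """Move one grid hop toward the consumer that received the most values."""
    hot = max(traffic, key=traffic.get)
    tx, ty = consumer_positions[hot]
    x, y = position
    candidates = []
    if tx != x:
        candidates.append((x + (1 if tx > x else -1), y))
    if ty != y:
        candidates.append((x, y + (1 if ty > y else -1)))
    for cell in candidates:
        if cell not in occupied:
            return cell
    return position  # already adjacent, or blocked

# Hypothetical layout: the branch starts next to the rare case.
consumers = {"common": (3, 0), "rare": (0, 1)}
traffic = {"common": 90, "rare": 10}
occupied = set(consumers.values())

pos = (0, 0)
history = [pos]
while True:
    nxt = migrate_step(pos, traffic, consumers, occupied)
    if nxt == pos:
        break
    pos = nxt
    history.append(pos)
```

The loop terminates because every accepted move reduces the distance to the hot consumer; the branch ends up adjacent to the common case and farther from the rare one.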
WaveScalar’s Solution: Short Wires
  [Figure: WaveCache layout; labels: L2 Cache, PE Domain]
WaveScalar’s Solution: Short Wires
  [Figure: WaveCache layout; labels: L2 Cache, D$ + Store Buffer, Cluster]
WaveScalar’s Solution: Creative Use of Untapped Parallelism
  Expand the window for exploiting ILP
    no in-order fetch using only one PC (sucking through a straw)
    place instructions with the processing elements
    out-of-order execution on a grand scale
  Allow multiple threads to execute concurrently
    OS & applications
    multiple applications, parallel threads
WaveScalar’s Solution: The I-Cache is the Processor
  Model is processor-in-memory (PIM)
    processing element associated with each instruction
  WaveScalar version
    processing elements placed in the I-cache to reduce latency
WaveScalar’s Solution: Design to Compensate for Circuit Unreliability
  Fewer design & implementation errors from a grid with a simple, uniform design
  Route around processors with flaws
    decentralized control
    dynamic instruction migration
  [Figure label: L2 Cache]
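As an illustration of routing around flawed processors, the sketch below finds a shortest hop path across a grid while skipping nodes marked faulty. It uses a plain breadth-first search, which is a centralized textbook algorithm chosen here for clarity; the slides call for decentralized control, so this should not be read as the WaveScalar mechanism. The function name and grid model are assumptions.

```python
from collections import deque

def route(grid_w, grid_h, src, dst, faulty):
    """Shortest hop path from src to dst that avoids faulty nodes, or None."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = prev[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            inside = 0 <= nxt[0] < grid_w and 0 <= nxt[1] < grid_h
            if inside and nxt not in faulty and nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return None

# Hypothetical 3x3 grid with one flawed processor between source and target.
clean = route(3, 3, (0, 0), (2, 0), faulty=set())
detour = route(3, 3, (0, 0), (2, 0), faulty={(1, 0)})
cut_off = route(3, 3, (0, 0), (2, 0), faulty={(1, 0), (0, 1)})
```

A single flaw costs two extra hops in this example; only when every neighbor of the source is flawed does the route fail outright.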
Research Agenda: Architecture
  WaveScalar ISA
  Microarchitecture design
    node design
    domain size
    cache-coherence across clusters
    cluster arrangement
  Control & memory speculation
  WaveScalar instruction management
    hardware for instruction placement & replacement
    hardware for dynamic, self-optimizing placement
Research Agenda: Architecture
  Multithreaded WaveScalar
  Design of the network & routing issues
  Power management
  Static & dynamic fault detection & recovery (rerouting instructions)
  System-level design
  Application to non-silicon designs
Research Agenda: Compilers
  Instruction placement
  Revisit classic optimizations
    code savings vs. communication costs
    cache pollution vs. loop parallelism
  New opportunities for optimization
    a match between compiler & execution models
    WaveScalar-specific instructions
Research Agenda: OS & Networking
  Tension between facilitating short routines & poor instruction locality
  The software side of thread management
  A bunch of stuff I don’t know about
    optimizing the OS interface
    new thread protection policies
    memory management issues
    security
    lazy context switching
    utilizing virtual machines
Putting It All Together
  Grid of hundreds (maybe thousands) of simple, data-flow processing nodes
    no centralized control; scalable
    few design errors; increase in yield
  Processing nodes embedded in the I-cache
    Instructions execute in place
    Send results directly to the consumers
      short, point-to-point links
    Instructions can dynamically migrate
      reduce latency to hot consumers
      map around defects
  3X performance without any prediction mechanisms
    more with them