Lecture 18: Introduction to Multiprocessors (bwrcs.eecs.berkeley.edu/Classes/CS252/Notes...)
1
Lecture 18: Introduction to Multiprocessors
Prepared and presented by: Kurt Keutzer
with thanks for materials from Kunle Olukotun, Stanford;
David Patterson, UC Berkeley
2
Why Multiprocessors?
Needs
• Relentless demand for higher performance
  » Servers
  » Networks
• Commercial desire for product differentiation
Opportunities
• Silicon capability
• Ubiquitous computers
3
Exploiting (Program) Parallelism
[Figure: levels of parallelism (Instruction, Loop, Thread, Process) plotted against grain size in instructions, from 1 to 1M]
4
Exploiting (Program) Parallelism -2
[Figure: levels of parallelism (Bit, Instruction, Loop, Thread, Process) plotted against grain size in instructions, from 1 to 1M]
5
Need for Parallel Computing
• Diminishing returns from ILP
  » Limited ILP in programs
  » ILP increasingly expensive to exploit
• Peak performance increases linearly with more processors
  » Amdahl’s law applies
• Adding processors is inexpensive
  » But most people add memory also
[Figures: performance versus die area; design points P+M, 2P+M, 2P+2M]
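The caveat that Amdahl's law applies can be made concrete with a small calculation. This is a minimal sketch: the function name and the 90% parallel fraction are illustrative assumptions, not figures from the lecture.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Speedup when `parallel_fraction` of the work scales perfectly
    across `processors` and the remainder stays serial (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Assumed workload: 90% of the execution time is parallelizable.
    for n in (1, 2, 4, 16, 64):
        print(f"{n:3d} processors -> speedup {amdahl_speedup(0.9, n):.2f}")
```

Peak performance grows linearly with the processor count, but under this assumed workload the achieved speedup can never exceed 1 / 0.1 = 10.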
6
What to do with a billion transistors?
• Technology changes the cost and performance of computer elements in a non-uniform manner
  » logic and arithmetic are becoming plentiful and cheap
  » wires are becoming slow and scarce
• This changes the tradeoffs between alternative architectures
  » superscalar doesn’t scale well
    – global control and data
• So what will the architectures of the future be?
[Figure: chip generations 1998, 2001, 2004, 2007: 64x the area, 4x the speed, slower wires; 1 clk versus 3 (10, 16, 20?) clks]
7
Elements of a multiprocessing system
• General purpose/special purpose
• Granularity - capability of a basic module
• Topology - interconnection/communication geometry
• Nature of coupling - loose to tight
• Control-data mechanisms
• Task allocation and routing methodology
• Reconfigurable
  » Computation
  » Interconnect
• Programmer’s model/Language support/models of computation
• Implementation - IC, Board, Multiboard, Networked
• Performance measures and objectives
[After E. V. Krishnamurthy - Chapter 5]
8
Use, Granularity
General purpose
• attempting to improve general-purpose computation (e.g. SPEC benchmarks) by means of multiprocessing
Special purpose
• attempting to improve a specific application or class of applications by means of multiprocessing
Granularity - scope and capability of a processing element (PE)
• NAND gate
• ALU with registers
• Execution unit with local memory
• RISC R1000 processor
9
Topology
Topology - method of interconnection of processors
• Bus
• Full-crossbar switch
• Mesh
• N-cube
• Torus
• Perfect shuffle, m-shuffle
• Cube-connected components
• Fat-trees
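To compare a few of these topologies, the sketch below builds the adjacency of a mesh, a torus and an N-cube and measures link count and diameter by breadth-first search. The function names and the 16-node sizes are illustrative assumptions, not part of the lecture.

```python
from collections import deque
from itertools import product


def mesh_neighbors(k, wrap=False):
    """Adjacency of a k x k mesh; wrap=True adds the wraparound links of a torus."""
    adj = {}
    for x, y in product(range(k), repeat=2):
        nbrs = set()
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if wrap:
                nx, ny = nx % k, ny % k
            if 0 <= nx < k and 0 <= ny < k and (nx, ny) != (x, y):
                nbrs.add((nx, ny))
        adj[(x, y)] = nbrs
    return adj


def hypercube_neighbors(d):
    """Adjacency of a d-dimensional N-cube: neighbors differ in exactly one address bit."""
    return {n: {n ^ (1 << b) for b in range(d)} for n in range(2 ** d)}


def link_count(adj):
    """Number of bidirectional links (each link appears in two neighbor sets)."""
    return sum(len(nbrs) for nbrs in adj.values()) // 2


def diameter(adj):
    """Longest shortest path between any two nodes, found by BFS from every node."""
    worst = 0
    for src in adj:
        dist = {src: 0}
        frontier = deque([src])
        while frontier:
            u = frontier.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    frontier.append(v)
        worst = max(worst, max(dist.values()))
    return worst


if __name__ == "__main__":
    for name, adj in (("4x4 mesh", mesh_neighbors(4)),
                      ("4x4 torus", mesh_neighbors(4, wrap=True)),
                      ("4-cube", hypercube_neighbors(4))):
        print(name, "links:", link_count(adj), "diameter:", diameter(adj))
```

For the same 16 nodes, the torus and the 4-cube spend more links than the mesh to obtain a smaller diameter, which is the basic cost/latency trade-off among these interconnects.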
10
Coupling
Relationship of communication among processors
• Shared clock (Pipelined)
• Shared registers (VLIW)
• Shared memory (SMM)
• Shared network
11
Control/Data
Way in which data and control are organized
Control - how the instruction stream is managed (e.g. sequential instruction fetch)
Data - how the data is accessed (e.g. numbered memory addresses)
• Multithreaded control flow - explicit constructs (fork, join, wait) control program flow; central controller
• Dataflow model - instructions execute as soon as operands are ready; the program structures the flow of data; decentralized control
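The dataflow rule, that an instruction executes as soon as its operands are ready, can be sketched with a tiny interpreter. The instruction encoding (destination mapped to an operation and two source names) and the example expression are assumptions made for illustration only.

```python
import operator


def run_dataflow(instructions, inputs):
    """Execute two-operand instructions in dataflow order.

    instructions: {dest: (function, source_a, source_b)}
    inputs:       {name: value} for operands available at the start
    Returns the final values and the list of instructions fired in each step.
    """
    values = dict(inputs)
    pending = dict(instructions)
    firing_order = []
    while pending:
        # No program counter: anything whose operands exist may fire.
        ready = [dest for dest, (_, a, b) in pending.items()
                 if a in values and b in values]
        if not ready:
            raise ValueError("remaining instructions wait on operands that never arrive")
        for dest in ready:
            fn, a, b = pending.pop(dest)
            values[dest] = fn(values[a], values[b])
        firing_order.append(sorted(ready))
    return values, firing_order


if __name__ == "__main__":
    # r = (a + b) * (c - d); the add and the subtract are independent.
    program = {
        "r": (operator.mul, "t1", "t2"),
        "t1": (operator.add, "a", "b"),
        "t2": (operator.sub, "c", "d"),
    }
    values, order = run_dataflow(program, {"a": 1, "b": 2, "c": 7, "d": 3})
    print(values["r"], order)
```

The order in which the program lists its instructions is irrelevant; only the availability of data determines when each one fires, which is what makes the control decentralized.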
12
Task allocation and routing
Way in which tasks are scheduled and managed
Static - allocation of tasks onto processing elements predetermined before runtime
Dynamic - hardware/software support for allocating tasks to processors at runtime
13
Reconfiguration
Computational
• restructuring of computational elements
  » reconfigurable - reconfiguration at compile time
  » dynamically reconfigurable - restructuring of computational elements at runtime
Interconnection scheme
• switching network - software controlled
• reconfigurable fabric
14
Programmer’s model
How is parallelism expressed by the user?
Expressive power
• Process-level parallelism
  » Shared-memory
  » Message-passing
• Operator-level parallelism
• Bit-level parallelism
Formal guarantees
• Deadlock-free
• Livelock-free
Support for other real-time notions
• Exception handling
15
Parallel Programming Models
• Message Passing
  » Fork thread
    – Typically one per node
  » Explicit communication
    – Send messages
    – send(tid, tag, message)
    – receive(tid, tag, message)
  » Synchronization
    – Block on messages (implicit sync)
    – Barriers
• Shared Memory (address space)
  » Fork thread
    – Typically one per node
  » Implicit communication
    – Using shared address space
    – Loads and stores
  » Synchronization
    – Atomic memory operators
    – Barriers
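The two models can be contrasted on the same task, summing chunks of numbers in parallel. This is a minimal sketch using Python threads as stand-ins for nodes: the queue-based mailboxes imitate send/receive, and a lock stands in for an atomic memory operator. All names here are illustrative assumptions rather than an actual message-passing library.

```python
import queue
import threading


def message_passing_sum(chunks):
    """Workers send partial sums to a root mailbox; the root blocks on receive."""
    root = len(chunks)                       # mailbox index of the collecting node
    mailboxes = [queue.Queue() for _ in range(len(chunks) + 1)]

    def send(tid, tag, message):
        mailboxes[tid].put((tag, message))   # explicit communication

    def receive(tid):
        return mailboxes[tid].get()          # blocks until a message arrives

    def worker(chunk):
        send(root, "partial", sum(chunk))

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    total = 0
    for _ in chunks:                         # implicit sync: one receive per worker
        _tag, value = receive(root)
        total += value
    for t in threads:
        t.join()
    return total


def shared_memory_sum(chunks):
    """Workers update one shared location; a lock makes the update atomic."""
    total = [0]                              # shared memory location
    lock = threading.Lock()

    def worker(chunk):
        partial = sum(chunk)
        with lock:                           # stands in for an atomic memory operator
            total[0] += partial              # implicit communication via load/store

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                             # acts as the final barrier
    return total[0]


if __name__ == "__main__":
    data = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(message_passing_sum(data), shared_memory_sum(data))
```

In the first version nothing is shared and synchronization falls out of blocking on messages; in the second, communication is just a store to a shared location, so synchronization has to be added explicitly.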
16
Message Passing Multicomputers
• Computers (nodes) connected by a network
  » Fast network interface
    – Send, receive, barrier
  » Nodes no different from a regular PC or workstation
• Cluster conventional workstations or PCs with a fast network
  » cluster computing
  » Berkeley NOW
  » IBM SP2
[Figure: nodes, each a processor (P) with its own memory (M), connected by a network]
17
Shared-Memory Multiprocessors
• Several processors share one address space
  » conceptually a shared memory
  » often implemented just like a multicomputer
    – address space distributed over private memories
• Communication is implicit
  » read and write accesses to shared memory locations
• Synchronization
  » via shared memory locations
    – spin waiting for non-zero
  » barriers
[Figure, Conceptual Model: processors (P) reach a single memory (M) through a network]
[Figure, Actual Implementation: each processor (P) has its own memory (M); the nodes are connected by a network]
18
Cache Coherence - A Quick Overview
• With caches, action is required to prevent access to stale data
  » Processor 1 may read old data from its cache instead of new data in memory, or
  » Processor 3 may read old data from memory rather than new data in Processor 2’s cache
• Solutions
  » no caching of shared data
    – Cray T3D, T3E, IBM RP3, BBN Butterfly
  » cache coherence protocol
    – keep track of copies
    – notify (update or invalidate) on writes
[Figure: processors P1, P2, ..., PN, each with a cache ($), connected by a network to memory M, which holds A:3]
P1: Rd(A) ... Rd(A)
P2: Wr(A,5)
P3: Rd(A)
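The access trace on this slide can be replayed in a toy simulator. This is a sketch under stated assumptions: caches are write-back, the coherent mode uses a simple write-invalidate scheme with write-back on a remote miss, and the class and method names are invented for illustration. It is not a model of any particular machine's protocol.

```python
class CachedSystem:
    """Toy multiprocessor with private write-back caches over one memory."""

    def __init__(self, num_procs, memory, coherent):
        self.memory = dict(memory)
        self.caches = [dict() for _ in range(num_procs)]
        self.dirty = [set() for _ in range(num_procs)]
        self.coherent = coherent

    def read(self, proc, addr):
        cache = self.caches[proc]
        if addr not in cache:                      # miss: fetch from memory
            if self.coherent:
                # A dirty copy elsewhere is written back first.
                for other, lines in enumerate(self.caches):
                    if addr in self.dirty[other]:
                        self.memory[addr] = lines[addr]
                        self.dirty[other].discard(addr)
            cache[addr] = self.memory[addr]
        return cache[addr]

    def write(self, proc, addr, value):
        if self.coherent:
            # Notify on writes: invalidate every other cached copy.
            for other, lines in enumerate(self.caches):
                if other != proc:
                    lines.pop(addr, None)
                    self.dirty[other].discard(addr)
        self.caches[proc][addr] = value
        self.dirty[proc].add(addr)


def run_trace(coherent):
    """P1: Rd(A); P2: Wr(A,5); P1: Rd(A); P3: Rd(A), with A initially 3."""
    system = CachedSystem(num_procs=3, memory={"A": 3}, coherent=coherent)
    p1, p2, p3 = 0, 1, 2
    first = system.read(p1, "A")
    system.write(p2, "A", 5)
    second = system.read(p1, "A")
    third = system.read(p3, "A")
    return [first, second, third]


if __name__ == "__main__":
    print("without coherence:", run_trace(coherent=False))
    print("write-invalidate: ", run_trace(coherent=True))
```

Without coherence, P1 keeps reading its stale cached 3 and P3 reads a stale 3 from memory; with invalidation, both later reads observe the 5 written by P2.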
19
Implementation issues
Underlying hardware implementation
• Bit-slice
• Board assembly
• Integration in an integrated circuit
Exploitation of new technologies
• DRAM integration on IC
• Low-swing chip-level interconnect
20
Performance objectives
Objectives
• Speed
• Power
• Cost
• Ease of programming/time to market/time to money
• In-field flexibility
Methods of measurement
• Modeling
• Emulation
• Simulation
  » Transaction
  » Instruction-set
  » Hardware
21
Flynn’s Taxonomy of Multiprocessing
Single-instruction single-datastream (SISD) machines
Single-instruction multiple-datastream (SIMD) machines
Multiple-instruction single-datastream (MISD) machines
Multiple-instruction multiple-datastream (MIMD) machines
Examples?
22
Examples
Single-instruction single-datastream (SISD) machines
  » Non-pipelined uniprocessors
Single-instruction multiple-datastream (SIMD) machines
  » Vector processors (VIRAM)
Multiple-instruction single-datastream (MISD) machines
  » Network processors (Intel IXP1200)
Multiple-instruction multiple-datastream (MIMD) machines
  » Network of workstations (NOW)
23
Predominant Approaches
Pipelining ubiquitous
Much academic research focused on performance improvements of “dusty decks”
• Illiac 4 - Speed-up of Fortran
• SUIF, Flash - Speed-up of C
Niche market in high-performance computing
• Cray
Commercial support for high-end servers
• Shared-memory multiprocessors for server market
Commercial exploitation of silicon capability
• General purpose: Super-scalar, VLIW
• Special purpose: VLIW for DSP, Media processors, Network processors
Reconfigurable computing
24
C62x Pipeline Operation
Pipeline Phases
• Single-Cycle Throughput
• Operate in Lock Step
• Fetch
  » PG Program Address Generate
  » PS Program Address Send
  » PW Program Access Ready Wait
  » PR Program Fetch Packet Receive
• Decode
  » DP Instruction Dispatch
  » DC Instruction Decode
• Execute
  » E1 - E5 Execute 1 through Execute 5
[Figure: Execute Packets 1 through 7 each pass through PG PS PW PR (Fetch), DP DC (Decode) and E1 E2 E3 E4 E5 (Execute), each packet starting one cycle after the previous one]
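The lock-step, single-cycle-throughput behaviour in the pipeline diagram can be reproduced with a short calculation. This is a sketch that assumes no stalls and that every execute packet passes through all eleven phases, as the diagram draws them; the function names are illustrative.

```python
STAGES = ["PG", "PS", "PW", "PR", "DP", "DC", "E1", "E2", "E3", "E4", "E5"]


def stage_at(packet, cycle):
    """Stage occupied by execute packet `packet` in `cycle` (both 1-based),
    or None if the packet has not entered or has already left the pipeline."""
    index = cycle - packet          # packet k enters PG in cycle k
    return STAGES[index] if 0 <= index < len(STAGES) else None


def total_cycles(num_packets):
    """Cycles to drain `num_packets` packets: fill latency plus one per extra packet."""
    return num_packets + len(STAGES) - 1 if num_packets > 0 else 0


def diagram(num_packets):
    """Text rendering of the staggered pipeline diagram, one row per packet."""
    rows = []
    for packet in range(1, num_packets + 1):
        cells = [stage_at(packet, cycle) or "  "
                 for cycle in range(1, total_cycles(num_packets) + 1)]
        rows.append(f"Execute Packet {packet}: " + " ".join(cells))
    return "\n".join(rows)


if __name__ == "__main__":
    print(diagram(7))
    print("cycles for 7 packets:", total_cycles(7))
```

Each packet still takes eleven cycles from PG to E5, but once the pipeline is full one packet completes per cycle, so seven packets finish in 17 cycles rather than 77.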
25
Superscalar: PowerPC 604 and Pentium Pro
• Both: in-order issue, out-of-order execution, in-order commit
26
IA-64 aka EPIC aka VLIW
• Compiler schedules instructions
• Encodes dependencies explicitly
  » saves having the hardware repeatedly rediscover them
• Support speculation
  » speculative load
  » branch prediction
• Really need to make communication explicit too
  » still has global registers and global instruction issue
[Figure: Instruction Cache, Instruction Issue, Register File]
27
Philips TriMedia Processor
28
TMS320C6201 Revision 2
[Block diagram: C6201 CPU Megamodule with Program Fetch, Instruction Dispatch and Instruction Decode; Data Path 1 (A Register File; units L1, S1, M1, D1) and Data Path 2 (B Register File; units D2, M2, S2, L2); Control Registers, Control Logic, Test, Emulation, Interrupts. Program Cache / Program Memory: 32-bit address, 256-bit data, 512K bits RAM. Data Memory: 32-bit address; 8-, 16-, 32-bit data; 512K bits RAM. Peripherals: Ext. Memory Interface, 4-DMA, Host Port Interface, 2 Timers, 2 multi-channel buffered serial ports (T1/E1), Pwr Dwn]
29
TMS320C6701 DSP Block Diagram
[Block diagram: ’C67x Floating-Point CPU Core with Program Fetch, Instruction Dispatch and Instruction Decode; Data Path 1 (A Register File; units L1, S1, M1, D1) and Data Path 2 (B Register File; units D2, M2, S2, L2); Control Registers, Control Logic, Test, Emulation, Interrupts. Program Cache / Program Memory: 32-bit address, 256-bit data, 512K bits RAM. Data Memory: 32-bit address; 8-, 16-, 32-bit data; 512K bits RAM. Peripherals: External Memory Interface, 4 Channel DMA, Host Port Interface, 2 Timers, 2 multi-channel buffered serial ports (T1/E1), Power Down]
30
TMS320C67x CPU Core
[Block diagram: ’C67x Floating-Point CPU Core with Program Fetch, Instruction Dispatch and Instruction Decode; Data Path 1 (A Register File; units L1, S1, M1, D1) and Data Path 2 (B Register File; units D2, M2, S2, L2); Control Registers, Control Logic, Test, Emulation, Interrupts. Callouts: Arithmetic Logic Unit, Auxiliary Logic Unit, Multiplier Unit, Floating-Point Capabilities]
31
Single-Chip Multiprocessors (CMP)
• Build a multiprocessor on a single chip
  » linear increase in peak performance
  » advantage of fast interaction between processors
• Fine grain threads
  » make communication and synchronization very fast (1 cycle)
  » break the problem into smaller pieces
• Memory bandwidth
  » Makes more effective use of limited memory bandwidth
• Programming model
  » Need parallel programs
[Figure: four processors (P), each with its own cache ($), sharing a further cache ($) in front of memory (M)]
32
Intel IXP1200 Network Processor
• 6 micro-engines
  » RISC engines
  » 4 contexts/eng
  » 24 threads total
• IX Bus Interface
  » packet I/O
  » connect IXPs
    – scalable
• StrongARM
  » less critical tasks
• Hash engine
  » level 2 lookups
• PCI interface
[Block diagram: SDRAM Ctrl, SRAM Ctrl, PCI Interface, SA Core with ICache, DCache and Mini DCache, six MicroEngs, Hash Engine, ScratchPad SRAM, IX Bus Interface]
33
IXP1200 MicroEngine
• 32-bit RISC instruction set
• Multithreading support for 4 threads
  » Maximum switching overhead of 1 cycle
• 128 32-bit GPRs in two banks of 64
• Programmable 1KB instruction store (not shown in diagram)
• 128 32-bit transfer registers
• Command bus arbiter and FIFO (not shown in diagram)
[Diagram: 32 SRAM Read XFER Registers (from SRAM), 32 SDRAM Read XFER Registers (from SDRAM), 64 GPRs (A-Bank), 64 GPRs (B-Bank), ALU, 32 SRAM Write XFER Registers (to SRAM), 32 SDRAM Write XFER Registers (to SDRAM)]
34
IXP1200 Instruction Set
• Powerful ALU instructions:
  » can manipulate word and part of word quite effectively
• Swap-thread on memory reference
  » Hides memory latency
  » sram[read, r0, base1, offset, 1], ctx_swap
• Can use an “intelligent” DMA-like controller to copy packets to/from memory
  » sdram[t_fifo_wr, --, pkt_bffr, offset, 8]
• Exposed branch behavior
  » can fill variable branch slots
  » can select a static prediction on a per-branch basis
ARM
mov r1, r0, lsl #16
mov r1, r1, r0, asr #16
add r0, r1, r0, asr #16
IXP1200
ld_field_w_clr[temp, 1100, accum]
alu_shf[accum, temp, +, accum, <<16]
35
UCB: Processor with DRAM (PIM): IRAM, VIRAM
• Put the processor and the main memory on a single chip
  » much lower memory latency
  » much higher memory bandwidth
• But
  » need to build systems with more than one chip
64Mb SDRAM Chip
Internal - 128 512K subarrays, 4 bits per subarray each 10ns: 51.2 Gb/s
External - 8 bits at 10ns: 800Mb/s
1 Integer processor ~ 100KBytes DRAM
1 FP processor ~ 500KBytes DRAM
1 Vector Unit ~ 1 MByte DRAM
[Figure: chip combining memory (M), processor (P) and vector unit (V)]
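The internal and external bandwidth figures quoted for the 64Mb SDRAM chip follow directly from the numbers on the slide. The short check below uses integer arithmetic; the function and constant names are illustrative.

```python
def bandwidth_bits_per_second(bits_per_access: int, access_time_ns: int) -> int:
    """Bits delivered per second when `bits_per_access` arrive every `access_time_ns`."""
    return bits_per_access * 1_000_000_000 // access_time_ns


SUBARRAYS = 128            # 512K-bit subarrays inside the chip
BITS_PER_SUBARRAY = 4      # bits each subarray delivers per access
ACCESS_TIME_NS = 10
EXTERNAL_PIN_BITS = 8      # width of the external data interface

internal = bandwidth_bits_per_second(SUBARRAYS * BITS_PER_SUBARRAY, ACCESS_TIME_NS)
external = bandwidth_bits_per_second(EXTERNAL_PIN_BITS, ACCESS_TIME_NS)

if __name__ == "__main__":
    print(f"internal: {internal / 1e9:.1f} Gb/s")   # 512 bits every 10 ns
    print(f"external: {external / 1e6:.0f} Mb/s")   # 8 bits every 10 ns
    print(f"ratio:    {internal // external}x")
```

The 64x gap between what the subarrays can deliver and what the pins can carry is the bandwidth that putting the processor on the DRAM die aims to reach.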
36
IRAM Vision Statement
Microprocessor & DRAM on a single chip:
  » on-chip memory latency 5-10X, bandwidth 50-100X
  » improve energy efficiency 2X-4X (no off-chip bus)
  » serial I/O 5-10X v. buses
  » smaller board area/volume
  » adjustable memory size/width
[Figure: a processor with caches ($, L2$) from a logic fab, connected by buses to separate DRAM and I/O, contrasted with a single chip from a DRAM fab holding Proc, DRAM, I/O and bus]
37
Potential Multimedia Architecture
• “New” model: VSIW = Very Short Instruction Word!
  » Compact: Describe N operations with 1 short instruct.
  » Predictable (real-time) performance vs. statistical performance (cache)
  » Multimedia ready: choose N*64b, 2N*32b, 4N*16b
  » Easy to get high performance
  » Compiler technology already developed, for sale!
    – Don’t have to write all programs in assembly language
38
Revive Vector (= VSIW) Architecture!
Objection -> Response
• Cost: ≈ $1M each? -> Single-chip CMOS MPU/IRAM
• Low latency, high BW memory system? -> IRAM
• Code density? -> Much smaller than VLIW
• Compilers? -> For sale, mature (>20 years)
• Performance? -> Easy scale speed with technology
• Power/Energy? -> Parallel to save energy, keep perf
• Limited to scientific applications? -> Multimedia apps vectorizable too: N*64b, 2N*32b, 4N*16b
39
V-IRAM1: 0.18 µm, Fast Logic, 200 MHz
1.6 GFLOPS (64b) / 6.4 GOPS (16b) / 16MB
[Block diagram: 2-way superscalar processor with 16K I cache and 16K D cache, instruction queue, and vector processor (vector registers; +, x, ÷ and load/store units; each datapath configurable as 4 x 64, 8 x 32 or 16 x 16) connected through a memory crossbar switch to the DRAM banks (M); serial I/O]
40
Tentative VIRAM-1 Floorplan
• 0.18 µm DRAM: 16-32 MB in 16 banks x 256b
• 0.18 µm, 5 Metal Logic
• ≈ 200 MHz MIPS IV, 16K I$, 16K D$
• ≈ 4 200 MHz FP/int. vector units
• die: ≈ 20x20 mm
• xtors: ≈ 130-250M
• power: ≈ 2 Watts
[Floorplan: two memory blocks (128 Mbits / 16 MBytes each), 4 Vector Pipes/Lanes, CPU+$, ring-based switch, I/O]
41
Tentative VIRAM-“0.25” Floorplan
• Demonstrate scalability via 2nd layout (automatic from 1st)
• 8 MB in 2 banks x 256b, 32 subbanks
• ≈ 200 MHz CPU, 8K I$, 8K D$
• 1 ≈ 200 MHz FP/int. vector unit
• die: ≈ 5 x 20 mm
• xtors: ≈ 70M
• power: ≈ 0.5 Watts
[Floorplan: CPU+$ and 1 VU with two memory blocks (32 Mb / 4 MB each)]

Kernel GOPS   V-1    V-0.25
Comp.         6.40   1.6
iDCT          3.10   0.8
Clr.Conv.     2.95   0.8
Convol.       3.16   0.8
FP Matrix     3.19   0.8
42
Stanford: Hydra Design
➤ Single-chip multiprocessor
➤ Four processors
➤ Separate primary caches
➤ Write-through data caches to maintain coherence
➤ Shared 2nd-level cache
➤ Separate read and write busses
➤ Data Speculation Support
[Block diagram: CPU 0 through CPU 3, each with a CP2, an L1 Inst. Cache, an L1 Data Cache & Speculation Bits, and a memory controller; Write-through Bus (64b) and Read/Replace Bus (256b) under Centralized Bus Arbitration Mechanisms; Speculation Write Buffers (#0, #1, #2, #3, retire); On-chip L2 Cache; Rambus Memory Interface to DRAM Main Memory; I/O Bus Interface to I/O Devices]
43
Mescal Architecture
Scott Weber
University of California at Berkeley
44
Outline
• Architecture rationale and motivation
• Architecture goals
• Architecture template
• Processing elements
• Multiprocessor architecture
• Communication architecture
45
Architectural Rationale and Motivation
• Configurable processors have shown orders of magnitude performance improvements
• Tensilica has shown ~2x to ~50x performance improvements
  » Specialized functional units
  » Memory configurations
• Tensilica matches the architecture with software development tools
[Figure: a base processor (FU, RegFile, Memory, ICache) is configured into one with additional FUs and DCT and HUF blocks. Configuration: set memory parameters; add DCT and Huffman blocks for a JPEG app]
46
Architectural Rationale and Motivation
• In order to continue this performance improvement trend
  » Architectural features which exploit more concurrency are required
  » Heterogeneous configurations need to be made possible
  » Software development tools support new configuration options
[Figure sequence: a configured processor (FUs, DCT, HUF, RegFile, Memory, ICache) gains FUs and “...begins to look like a VLIW...”; an array of PEs, “...concurrent processes are required in order to continue performance improvement trend...”; “...generic mesh may not suit the application’s topology...”; “...configurable VLIW PEs and network topology...”]
47
Architecture Goals
• Provide template for the exploration of a range of architectures
• Retarget compiler and simulator to the architecture
• Enable compiler to exploit the architecture
• Concurrency
  » Multiple instructions per processing element
  » Multiple threads per and across processing elements
  » Multiple processes per and across processing elements
• Support for efficient computation
  » Special-purpose functional units, intelligent memory, processing elements
• Support for efficient communication
  » Configurable network topology
  » Combined shared memory and message passing
48
Architecture Template
• Prototyping template for array of processing elements
  » Configure processing element for efficient computation
  » Configure memory elements for efficient retiming
  » Configure the network topology for efficient communication
[Figure sequence: processing elements built from FUs, DCT and HUF blocks, RegFile, Memory and ICache, annotated “...configure PE...”, “...configure memory elements...”, “...configure PEs and network to match the application...”]
49
Range of Architectures
• Scalar Configuration
• EPIC Configuration
• EPIC with special FUs
• Mesh of HPL-PD PEs
• Customized PEs, network
• Supports a family of architectures
  » Plan to extend the family with the micro-architectural features presented
[Figure: one FU with Register File, Memory System and Instruction Cache]
50
Range of Architectures (cont.)
[Figure: five FUs with Register File, Memory System and Instruction Cache]
51
Range of Architectures (cont.)
[Figure: FUs plus FFT, DCT and DES units with Register File, Memory System and Instruction Cache]
52
Range of Architectures (cont.)
[Figure: the processing element with FFT, DCT and DES units shown alongside an array of nine PEs]
53
Range of Architectures (cont.)
[Figure: eight PEs in a customized network]
54
Range of Architectures (Future)
• Template support for such an architecture
• Prototype architecture
• Software development tools generated
  » Generate compiler
  » Generate simulator
[Block diagram: IXP1200 Network Processor (Intel): SDRAM Ctrl, SRAM Ctrl, PCI Interface, SA Core with ICache, DCache and Mini DCache, six MicroEngs, Hash Engine, ScratchPad SRAM, IX Bus Interface]
55
The RAW Architecture
Slides prepared by Manish Vachhrajani
56
Outline
• RAW architecture
  » Overview
  » Features
  » Benefits and Disadvantages
• Compiling for RAW
  » Overview
  » Structure of the compiler
  » Basic block compilation
  » Other techniques
57
RAW Machine Overview
• Scalable architecture without global interconnect
• Constructed from Replicated Tiles
  » Each tile has a µP and a switch
  » Interconnect via a static and dynamic network
58
RAW Tiles
• Simple 5-stage pipelined µP w/ local PC (MIMD)
  » Can contain configurable logic
• Per-tile IMEM and DMEM, unlike other modern architectures
• µP contains ins. to send and recv. data
[Figure: a tile containing IMEM, DMEM, PC, REGS, configurable logic (CL), and a switch with its own SMEM and PC]
59
RAW Tiles(cont.)
• Tiles have local switches
  » Implemented with a stripped down µP
  » Static Network
    – Fast, easy to implement
    – Need to know data transfers, source and destination at compile time
  » Dynamic Network
    – Much slower and more complex
    – Allows for messages whose route is not known at compile time
60
Configurable Hardware in RAW
• Each tile contains its own configurable hardware
• Each tile has several ALUs and logic gates that can operate at bit/byte/word levels
• Configurable interconnect to wire components together
• Coarser than FPGA-based implementations
61
Benefits of RAW
• Scalable
  » Each tile is simple and replicated
  » No global wiring, so it will scale even if wire delay doesn’t
  » Short wires and simple tiles allow higher clock rates
• Can target many forms of Parallelism
• Ease of design
  » Replication reduces design overhead
  » Tiles are relatively simple designs
  » Simplicity makes verification easier
62
Disadvantages of RAW
• Complex Compilation
  » Full space-time compilation
  » Distributed memory system
  » Need sophisticated memory analysis to resolve “static references”
• Software Complexity
  » Low-level code is complex and difficult to examine and write by hand
• Code Size?
63
Traditional Operations on RAW
• How does one exploit the Raw architecture across function calls, especially in libraries?
  » Can we easily maintain portability with different tile counts?
• Memory Protection and OS Services
  » Context switch overhead
  » Load on dynamic network for memory protection and virtual memory?
64
Compiling for RAW machines
• Determine available parallelism
• Determine placement of memory items
• Discover memory constraints
  » Dependencies between parallel threads
  » Disambiguate memory references to allow for static access to data elements
  » Trade-off memory dependence and Parallelism
65
Compiling for RAW(cont.)
• Generate route instructions for switches
  » static network only
• Generate message handlers for dynamic events
  » Speculative execution
  » Unpredictable memory references
• Optimal partitioning algorithm is NP-complete
66
Structure of RAWCC
• Partition data to increase static accesses
• Partition instructions to allow parallel execution
• Allocate data to tiles to minimize communication overhead
[Figure: compiler flow from Source Language to RAW executable, with stages Build CFG, Traditional Dataflow Optimizations, MAPS System, Space-time scheduler]
67
The MAPS System
• Manages memory to generate static promotions of data structures
• For loop accesses to arrays, uses modulo unrolling
• For data structures, uses SPAN analysis package to identify potential references and partition memory
  » structures can be split across processing elements.
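The idea behind modulo unrolling can be illustrated with a small experiment. The sketch assumes an array interleaved across tiles so that element i lives on tile i mod N, which is an assumption made here for illustration; the slide only names the technique. Unrolling the loop by N then makes every access in the unrolled body refer to one fixed tile, which is what lets the reference use the static network.

```python
def home_tile(index, num_tiles):
    """Tile that owns array element `index` under low-order interleaving (assumed)."""
    return index % num_tiles


def rolled_access_tiles(length, num_tiles):
    """Tiles touched by the single access `A[i]` in the original loop."""
    return {home_tile(i, num_tiles) for i in range(length)}


def unrolled_access_tiles(length, num_tiles):
    """Tiles touched by each of the `num_tiles` copies of the access
    after the loop body has been unrolled by a factor of `num_tiles`."""
    tiles_per_copy = [set() for _ in range(num_tiles)]
    for base in range(0, length, num_tiles):      # one iteration of the unrolled loop
        for copy in range(num_tiles):             # the copy-th access in the body
            i = base + copy
            if i < length:
                tiles_per_copy[copy].add(home_tile(i, num_tiles))
    return tiles_per_copy


if __name__ == "__main__":
    print("rolled loop, one access:", rolled_access_tiles(16, 4))
    print("unrolled by 4, per copy:", unrolled_access_tiles(16, 4))
```

Before unrolling, the single reference touches every tile, so its destination is only known at run time; after unrolling, each copy of the reference is bound to exactly one tile at compile time.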
68
Space-Time Scheduler
• For Basic Blocks
  » Maps instructions to processors
  » Maps scalar data to processors
  » Generates communication instructions
  » Schedules computation and communication
• For overall CFG, performs control localization
69
Basic Block Orchestrator
• All values are copied from the home tile to the tiles that work on the data
• Within a block, all accesses are local
• At the end of a block, values are copied to home tiles
[Figure: orchestrator stages: Initial Code Transformation, Instruction Partitioner, Global Data Partitioner, Data & Ins. Placer, Event Scheduler, Comm Code Generator]
70
Initial Code Transformation
• Convert block to static single assignment form
  » removes false dependencies
  » Analogous to register renaming
• Live-on-entry and live-on-exit variables marked with dummy instructions
  » Allows for overlap of “stitch” code with useful work
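Converting a basic block to static single assignment form amounts to giving every definition a fresh name, much like register renaming. The sketch below handles straight-line code only; the tuple encoding of instructions and the `_n` suffix convention are assumptions for illustration and not the compiler's actual representation.

```python
def to_ssa(block):
    """Rename a straight-line block so each variable is assigned exactly once.

    block: list of (dest, op, src_a, src_b) tuples using variable names.
    Variables read before any definition are live on entry and get version 0.
    """
    version = {}

    def current(name):
        return f"{name}_{version.get(name, 0)}"

    renamed = []
    for dest, op, src_a, src_b in block:
        # Sources are renamed first so they refer to the previous definitions.
        a, b = current(src_a), current(src_b)
        version[dest] = version.get(dest, 0) + 1
        renamed.append((current(dest), op, a, b))
    return renamed


if __name__ == "__main__":
    block = [
        ("t", "add", "a", "b"),
        ("c", "mul", "t", "t"),
        ("t", "sub", "a", "c"),   # reuses the name t: a false dependence on line 2
    ]
    for instruction in to_ssa(block):
        print(instruction)
```

After renaming, the third instruction writes `t_2` instead of overwriting `t_1`, so it no longer has to wait for the second instruction to finish reading `t`; only true data dependences remain to constrain the partitioner and scheduler.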
71
Instruction Partitioner
• Partitions stream into multiple streams, one for each tile
• Clustering
  » Partition instructions to minimize runtime considering only communication
• Merging
  » Reduces cluster count to match tile count
  » Uses a heuristic-based algorithm to achieve good balance and low communication overhead
72
Global Data Partitioner
• Partitions global data for assignment to home locations
  » Local data is copied at the start of a basic block
• Summarize instruction stream’s data access pattern with affinity
• Maps instructions and data to virtual processors
  » Map instructions, optimally place data based on affinity
  » Remap instructions with data placement knowledge
  » Repeat until a local minimum is reached
• Only real data are mapped, not dummies formed in ICT
73
Data and Instruction Placer
• Places data items onto physical tiles
  » driven by static data items
• Places instructions onto tiles
  » Uses data information to determine cost
• Takes into account actual model of communications network
• Uses a swap-based greedy allocation
74
Event Scheduler
• Schedules routing instructions as well as computation instructions in a basic block
• Schedules instructions using a greedy list-based scheduler
• Switch schedule is ensured to be deadlock free
  » Allows tolerance of dynamic events
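A greedy list scheduler can be sketched in a few lines. This is a generic single-tile version under stated assumptions: the dependence graph is acyclic, priority is the critical-path length to the end of the block, and routing instructions on the switches are not modelled. The names and the example latencies are illustrative and do not come from RAWCC.

```python
def list_schedule(latency, deps, issue_width=1):
    """Greedy list scheduling of a dependence DAG.

    latency: {instruction: cycles until its result is available}
    deps:    {instruction: set of instructions it depends on}
    Returns {instruction: issue cycle}.
    """
    successors = {ins: [] for ins in latency}
    for ins, preds in deps.items():
        for pred in preds:
            successors[pred].append(ins)

    priority = {}

    def height(ins):
        # Critical-path length from this instruction to the end of the block.
        if ins not in priority:
            priority[ins] = latency[ins] + max(
                (height(s) for s in successors[ins]), default=0)
        return priority[ins]

    for ins in latency:
        height(ins)

    start, finish = {}, {}
    cycle = 0
    while len(start) < len(latency):
        ready = [ins for ins in latency
                 if ins not in start
                 and all(p in finish and finish[p] <= cycle
                         for p in deps.get(ins, ()))]
        # Greedy choice: issue the ready instructions on the longest path first.
        ready.sort(key=lambda ins: (-priority[ins], ins))
        for ins in ready[:issue_width]:
            start[ins] = cycle
            finish[ins] = cycle + latency[ins]
        cycle += 1
    return start


if __name__ == "__main__":
    latency = {"load": 2, "add": 1, "mul": 2, "store": 1}
    deps = {"add": {"load"}, "mul": {"load"}, "store": {"add", "mul"}}
    print(list_schedule(latency, deps))
```

In the example, `mul` is issued before `add` once the load completes because it lies on the longer path to the store, so the store can issue as early as the dependences allow.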
75
Control Flow
• Control Localization
  » Certain branches are enveloped in macro instructions, and the surrounding blocks merged
  » Allows branch to occur only on one tile
• Global Branching
  » Done through target broadcast and local branching
76
Performance
• RAW achieves anywhere from 1.5 to 9 times speedup depending on application and tile count
• Applications tested were particularly well suited to RAW
• Heavily dependent integer programs may do poorly (encryption, etc.)
• Depends on its ability to statically schedule and localize memory accesses
77
Future Work
• Use multisequential execution to run multiple applications simultaneously
  » Allow static communication between threads known at compile time
  » Minimize dynamic overhead otherwise
• Target ILP across branches more aggressively
• Explore configurability vs. parallelism in RAW
78
Reconfigurable processors
• Adapt the processor to the application
  » special function units
  » special wiring between function units
• Builds on FPGA technology
  » FPGAs are inefficient
    – a multiplier built from an FPGA is about 100x larger and 10x slower than a custom multiplier.
  » Need to raise the granularity
    – configure ALUs, or whole processors
  » Memory and communication are usually the bottleneck
    – not addressed by configuring a lot of ALUs
• Programming model
  » Difficult to program
  » Verilog
79
SCORE
Stream Computation Organized for Reconfigurable Execution
Eylon Caspi, Michael Chu, André DeHon, Randy Huang, Joseph Yeh, John Wawrzynek, Nicholas Weaver
80
Opportunity
High-throughput, regular operations
• can be mapped spatially onto FPGA-like (programmable, spatial compute substrate)
• achieving higher performance
  » (throughput per unit area)
• than conventional, programmable devices
  » (e.g. processors)
81
Problem
• Only have raw devices
• Solutions non-portable
• Solutions do not scale to new hardware
• Device resources exposed to developer
• Little or no abstraction of implementations
• Composition of subcomponents hard/ad hoc
• No unifying computational model or run-time environment
82
Introduce: SCORE
• Compute Model
  » virtualizes RC hardware resources
  » supports automatic scaling
  » supports dynamic program requirements efficiently
  » provides compositional semantics
  » defines runtime environment for programs
83
Viewpoint
• SCORE (or something like it) is a necessary condition to enable automatic exploitation of new RC hardware as it becomes available.
• Automatic exploitation is essential to making RC a long-term viable computing solution.
84
Outline
• Opportunity
• Problem
• Review
  » related work
  » enabling hardware
• Model
  » execution
  » programmer
• Preliminary Results
• Challenges and Questions ahead
85
…borrows heavily from...
• RC, RTR
• P+FPGA
• Dataflow
• Streaming Dataflow
• Multiprocessors
• Operating System
• (see working paper)
• Tried to steal all the good ideas :-)
• build a coherent model
• exploit strengths of RC
86
Enabling Hardware
• High-speed, computational arrays
  » [250MHz, HSRA, FPGA’99]
• Large, on-chip memories
  » [2Mbit, VLSI Symp. ‘99]
  » [allow microsecond reconfiguration]
• Processor and FPGA hybrids
  » [GARP, NAPA, Triscend, etc.]
87
BRASS Architecture
88
Array Model
89
Platform Vision
• Hardware capacity scales up with each generation
  » Faster devices
  » More computation
  » More memory
• With SCORE, old programs should run on new hardware
  » and exploit the additional capacity automatically
90
Example: SCORE Execution
91
Spatial Implementation
92
Serial Implementation
93
Summary: Elements of a multiprocessing system
• General purpose/special purpose
• Granularity - capability of a basic module
• Topology - interconnection/communication geometry
• Nature of coupling - loose to tight
• Control-data mechanisms
• Task allocation and routing methodology
• Reconfigurable
  » Computation
  » Interconnect
• Programmer’s model/Language support/models of computation
• Implementation - IC, Board, Multiboard, Networked
• Performance measures and objectives
[After E. V. Krishnamurthy - Chapter 5]
94
Conclusions
Portions of multi/parallel processing have become successful
  » Pipelining ubiquitous
  » Superscalar ubiquitous
  » VLIW successful in DSP, Multimedia - GPP?
Silicon capability re-invigorating multiprocessor research
  » GPP - Flash, Hydra, RAW
  » SPP - Intel IXP1200, IRAM/VIRAM, Mescal
Reconfigurable computing has found a niche in wireless communications
Problem of programming models, languages, computational models etc. for multiprocessors still largely unsolved