Scalable Distributed Memory Machines: Massively Parallel Processors (MPPs) & Clusters
EECC756 - Shaaban, lec # 12, Spring 2003, 5-6-2003
Scalable Distributed Memory Machines: Massively Parallel Processors (MPPs) & Clusters

Goal: Parallel machines that can be scaled to hundreds or thousands of processors.
• Scalable Parallel Systems Design Choices:
  – Custom-designed or commodity nodes?
  – Network scalability.
  – Capability of node-to-network interface (critical).
    • Level of interpretation of the message by the Communication Assist (CA) without involving the CPU.
  – Supporting programming models?
    • Software vs. hardware protocol implementation.
• What does hardware scalability mean?
  – Avoid inherent design limits on resources as machine size is increased.
  – Bandwidth scaling: bandwidth increases with machine size P.
  – Latency scaling: latency increases slowly with machine size P.
  – Cost scaling: cost should increase slowly with P.
  – Physical scaling: scalable system packaging and node density; modular construction: chip-level integration, board-level integration, system-level integration.
(Parallel Computer Architecture, Chapter 7)
Commodity Supercomputing: Cluster Computing

• The research in heterogeneous supercomputing led to the development of high-speed system area networks and portable message-passing environments.
• These developments, in conjunction with the impressive performance improvements and low cost of commercial general-purpose microprocessors, led to the current trend in high-performance parallel computing of moving away from expensive, specialized, traditional supercomputing platforms to cluster computing that utilizes cheaper, general-purpose systems consisting of loosely coupled commodity off-the-shelf (COTS) components.
• Such clusters are commonly known as Beowulf clusters and are comprised of three components:
  § Computing Nodes: Each low-cost computing node is usually a small Symmetric Multi-Processor (SMP) system that utilizes COTS components, including commercial General-Purpose Processors (GPPs), with no custom components.
  § System Interconnect: Utilizes COTS Ethernet-based or system area interconnects, including the Myrinet and Dolphin SCI interconnects originally developed for HSC.
  § System Software and Programming Environments: Such clusters usually run an open-source, royalty-free version of UNIX (Linux being the de facto standard). The message-passing environments (PVM, MPI), developed originally for heterogeneous supercomputing systems, provide portable message-passing communication primitives.
From Lecture 6
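As a concrete illustration of the portable message-passing primitives mentioned above (a minimal sketch, not part of the original slides), a complete MPI program that runs unchanged over Ethernet, Myrinet, or SCI, since MPI hides the interconnect:

    #include <mpi.h>
    #include <stdio.h>

    /* Minimal MPI program: every node learns its rank; rank 1 sends an
       integer to rank 0. */
    int main(int argc, char **argv)
    {
        int rank, size, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 1) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        } else if (rank == 0 && size > 1) {
            MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 of %d received %d from rank 1\n", size, value);
        }

        MPI_Finalize();
        return 0;
    }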
Message-Passing Parallel Systems:
Commercial Massively Parallel Processor Systems (MPPs) vs. Clusters

• MPPs: “Big Iron” machines
  – COTS components usually limited to using commercial processors.
  – High system cost.
• Clusters: Commodity supercomputing
  – COTS components used for all system components.
  – Lower cost than MPP solutions.
[Figure: generic scalable machine: O(10)-processor SMP nodes (P, $, M, CA) attached through switches to a scalable network]

• Scalable network (low latency, high bandwidth). MPPs: custom; Clusters: COTS (Gigabit Ethernet, System Area Networks (SANs): ATM, Myrinet, SCI).
• Node, an O(10)-processor SMP. MPPs: custom node; Clusters: COTS node (workstations or PCs).
• Custom-designed CPU? MPPs: custom or commodity; Clusters: commodity.
• Operating system? MPPs: proprietary; Clusters: royalty-free (Linux).
• Communication assist. MPPs: custom; Clusters: COTS.
• Distributed-memory parallel programming:
  – Between nodes: message passing using PVM, MPI.
  – In SMP nodes: shared address space (SAS) multithreading using Pthreads, OpenMP.
From Lecture 6
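A minimal hybrid sketch of the programming split listed above (not from the original slides): MPI for message passing between nodes, OpenMP for shared-address-space threading within each SMP node. The summation loop is a placeholder workload.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int provided, rank;
        double local = 0.0, global;

        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Shared-address-space parallelism within the SMP node */
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < 1000000; i++)
            local += 1.0 / (1.0 + i + rank);

        /* Message passing across nodes */
        MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        if (rank == 0)
            printf("global sum = %f\n", global);

        MPI_Finalize();
        return 0;
    }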
MPPs Scalability Issues

• Problems:
  – Total available bandwidth.
  – Memory-access latency (local vs. remote access).
  – Interprocess communication complexity or synchronization overhead.
  – Multi-cache inconsistency.
  – Message-passing and message-processing overheads.
• Possible solutions:
  – Fast, dedicated, proprietary and scalable networks and protocols.
  – Low-latency, fast synchronization techniques, possibly hardware-assisted.
  – Hardware-assisted message processing in communication assists (node-to-network interfaces).
  – Weaker memory-consistency models.
  – Scalable directory-based cache coherence protocols (next lecture, PCA ch. 8).
  – Shared virtual memory implemented with hardware support.
  – Improved software portability: standard parallel programming environments and operating system support.
  – Software latency-hiding techniques.
One Extreme: Limited Scaling of SMPs Using a Bus

• Bus-based SMPs: Each level of the system design is grounded in the scaling limits at the layers below and assumptions of close coupling between components.

  Characteristic              Bus
  Physical length             ~ 1 ft
  Number of connections       fixed
  Maximum bandwidth           fixed
  Interface to comm. medium   memory interface + coherence protocols
  Global order                arbitration
  Protection                  virtual -> physical
  Trust                       total
  OS                          single
  Comm. abstraction           HW

Poor scalability:
  – Limited concurrent transactions.
  – Globally ordered transactions via arbitration.
  – Limited, fixed bandwidth.
Another Extreme: Scaling of Clusters of Workstations Using a LAN?

• No clear limit to physical scaling, no global order.
  – Usually high communication latency.
• System Area Networks (SANs): alternative networks for clusters.
  – Scalable, with high bandwidth, low latency and low protocol overheads.
  – Myrinet, SCI, ServerNet …

  Characteristic              Bus                                     LAN
  Physical length             ~ 1 ft                                  km
  Number of connections       fixed                                   many
  Maximum bandwidth           fixed                                   ???
  Interface to comm. medium   memory interface                        peripheral
  Global order                arbitration                             ???
  Protection                  virtual -> physical                     OS
  Trust                       total                                   none
  OS                          single                                  independent
  Comm. abstraction           HW                                      SW

LANs: ATM 622 Mb/s, FDDI 100 Mb/s, switch-based Fast/Gigabit Ethernet. SANs: Myrinet 2 Gb/s.
Bandwidth Scalability

• Bandwidth scalability depends largely on network characteristics:
  – Channel bandwidth.
  – Static topologies: node degree, bisection width, etc.
  – Multistage networks: switch size and connection pattern properties.
  – Node-to-network interface (Communication Assist, CA) capabilities.

[Figure: typical switch organizations, from a shared bus, through multiplexers and a crossbar, to a scalable parallel machine in which P and M modules attach through switches S.]
Dancehall MP Organization

• Network bandwidth?
• Bandwidth demand?
  – Independent processes?
  – Communicating processes?
  – Network bandwidth requirements scale linearly with P.
• Latency?

[Figure: dancehall organization: all processors (P + $) sit on one side of a scalable network (a MIN of switches), with all memory modules M on the other side.]

This organization places extremely high demands on the network in terms of bandwidth and latency, even for independent processes.
Generic Distributed Memory Organization

• Network bandwidth?
• Bandwidth demand?
  – Independent processes?
  – Communicating processes?
• Latency? O(log2 P) increase?
• Cost scalability of the system?

[Figure: nodes, each an O(10)-processor bus-based SMP with P, $, M, and CA, attached through switches to a scalable network.]
• Network: multistage interconnection network (MIN)? Point-to-point? Custom-designed? SAN?
• Custom-designed CPU? Node/system integration level? How far: Cray-on-a-chip? SMP-on-a-chip?
• OS supported? Network protocols?
• Communication assist: extent of functionality? DMA (message transactions), user-level access, user-level handlers.
• Global physical or virtual shared address space support?
Key System Scaling Property

• Bandwidth should increase with P, while latency remains low.
• Large number of independent communication paths between nodes.
  => Allows a large number of concurrent transactions using different channels.
• Transactions are initiated independently.
• No global arbitration.
• Effect of a transaction is only visible to the nodes involved.
  – Effects are propagated through additional transactions.
Network Latency Scaling

• T(n) = Overhead + Channel Time + Routing Delay
• Scaling of overhead? f(n)?
• Channel Time(n) = n/B, where B is the bandwidth at the bottleneck.
• Routing Delay(h, n) = h * ∆, for h hops at ∆ per hop, assuming cut-through routing is used.
Network Latency Scaling Example

O(log2 n)-stage MIN using switches:
• Max distance: log2 n hops.
• Number of switches: proportional to n log n.
• Overhead = 1 µs, B = 64 MB/s, ∆ = 200 ns per hop.
• Using pipelined cut-through routing:
  – T64(128)   = 1.0 µs + 2.0 µs + 6 hops x 0.2 µs/hop  = 4.2 µs
  – T1024(128) = 1.0 µs + 2.0 µs + 10 hops x 0.2 µs/hop = 5.0 µs
  => Only a 20% increase in latency for a 16x increase in machine size.
• Using store-and-forward routing:
  – T64sf(128)   = 1.0 µs + 6 hops x (2.0 + 0.2) µs/hop  = 14.2 µs
  – T1024sf(128) = 1.0 µs + 10 hops x (2.0 + 0.2) µs/hop = 23 µs
  => ~60% increase in latency for a 16x increase in machine size.
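The slide's figures can be checked directly from the previous slide's model, T(n) = overhead + n/B + h*∆ for cut-through and overhead + h*(n/B + ∆) for store-and-forward; a small C check (a sketch, added here, using the slide's parameters):

    #include <stdio.h>

    /* 128-byte message, overhead 1 us, B = 64 MB/s (64 bytes/us),
       delta = 0.2 us/hop, at 6 hops (64 nodes) and 10 hops (1024 nodes).
       Prints 4.2/5.0 us cut-through and 14.2/23.0 us store-and-forward,
       matching the slide. */
    int main(void)
    {
        const double overhead = 1.0, B = 64.0, delta = 0.2, n = 128.0;
        const int hops[] = { 6, 10 };

        for (int i = 0; i < 2; i++) {
            int h = hops[i];
            double ct = overhead + n / B + h * delta;
            double sf = overhead + h * (n / B + delta);
            printf("h=%2d: cut-through %.1f us, store-and-forward %.1f us\n",
                   h, ct, sf);
        }
        return 0;
    }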
Cost Scaling

• Cost(p, m) = fixed cost + incremental cost(p, m) + network cost(p)
• Bus-based SMP? Fixed cost is high.
• Ratio of processors : memory : network : I/O?
• Parallel efficiency(p) = Speedup(p) / p
• Similar to speedup, one can define:
  Costup(p) = Cost(p) / Cost(1)
• Cost-effective: Speedup(p) > Costup(p)
Cost Effective?

[Plot: speedup and costup vs. number of processors (0 to 2048), with Speedup = P/(1 + log P) and Costup = 1 + 0.1 P.]

2048 processors: 475-fold speedup at 206x cost. Using MINs?
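A quick check of the plotted point (a sketch added here; note the slide's 475-fold figure matches only if its "log" is read as base-10):

    #include <stdio.h>
    #include <math.h>

    /* Speedup(P) = P / (1 + log10 P), Costup(P) = 1 + 0.1*P.
       At P = 2048: speedup ~475, costup ~206, so the large machine is
       still cost-effective (speedup > costup). Link with -lm. */
    int main(void)
    {
        double P = 2048.0;
        double speedup = P / (1.0 + log10(P));
        double costup  = 1.0 + 0.1 * P;

        printf("P=%.0f: speedup %.0f, costup %.0f, cost-effective: %s\n",
               P, speedup, costup, speedup > costup ? "yes" : "no");
        return 0;
    }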
Parallel Machine Network Examples

[Table: network characteristics of example parallel machines; e.g., SAN: 8x8 switches, 2 Gb/s.]
Physical Scaling
Physical node construction, density/integration

• Chip-level integration:
  – Integrate network interface, message router, I/O links: nCUBE/2, Alpha 21364, IBM Power 4, AMD Opteron.
  – IRAM-style Cray-on-a-chip: V-IRAM.
  – Memory/bus controller/chip set: Alpha 21364, AMD Opteron.
  – SMP on a chip, Chip Multiprocessor (CMP): IBM Power 4.
• Board-level integration: replicating using standard microprocessor cores.
  – CM-5 replicated the core of a Sun SPARCstation 1 workstation.
  – Cray T3D and T3E replicated the core of a DEC Alpha workstation + custom shell logic + CA.
• System-level integration:
  – IBM SP-2 uses 8-16 almost complete RS6000 workstations placed in racks + custom CA, network, and system software.
Chip-level Integration Example: nCUBE/2 Machine Organization

• Entire machine synchronous at 40 MHz (25 ns).
• Routing delay = 2.2 µs.

[Figure: single-chip node: MMU, I-fetch & decode, 64-bit integer and IEEE floating-point execution units, operand $, memory interface with DRAM interface, DMA channels, and router all on one chip, with a reduced-VAX-style instruction set; 500,000 transistors, considered large at the time (1991). Basic module: 64 nodes socketed on a board. Hypercube network configuration: 13 bit-serial links, so up to 8192 nodes possible (2048 built).]
Chip-level Integration Example: Vector Intelligent RAM 2 (V-IRAM-2)

[Figure: V-IRAM-2 block diagram: a 2-way superscalar vector processor (8K I-cache, 8K D-cache, instruction queue, vector registers) with add (+), multiply (x), divide (÷), and load/store vector units, each configurable as 8 x 64, 16 x 32, or 32 x 16 lanes, connected through a memory crossbar switch to on-chip DRAM memory banks, plus I/O.]

Projected 2004: < 0.1 µm, > 2 GHz, 16 GFLOPS (64-bit) / 64 GOPS (16-bit) / 128 MB.
Chip-level Integration Example: Alpha 21364

• Alpha 21264 core with enhancements.
• Integrated Direct RAMbus memory controller:
  – 800 MHz operation, 30 ns CAS latency pin to pin, 6 GB/s read or write bandwidth.
  – Directory-based cache coherence.
• Integrated network interface:
  – Direct processor-to-processor interconnect, 10 GB/s per processor.
  – 15 ns processor-to-processor latency; out-of-order network with adaptive routing.
  – Asynchronous clocking between processors; 3 GB/s I/O interface per processor.

[Figure: 21364 block diagram: 21264 core (64K Icache, 64K Dcache), L2 cache, 16 L1 miss buffers, 16 L1 victim buffers, 16 L2 victim buffers, memory controller (RAMBUS, address in/out), and a network interface with North/South/East/West and I/O ports.]
Chip-level Integration Example: A Possible Alpha 21364 System

[Figure: a 2D grid of 21364 nodes (shown: 3 x 4), each with its own memory (M) and I/O, connected directly through the on-chip network interfaces.]
Chip-level Integration Example: IBM Power 4 CMP

• Two tightly integrated > 1 GHz CPU cores per 170-million-transistor chip.
• 128 KB L1 cache per processor.
• 1.5 MB on-chip shared L2 cache.
• External 32 MB L3 cache; tags kept on chip.
• 35 GB/s chip-to-chip interconnects.
Chip-level Integration Example: IBM Power 4

[Figure: Power 4 chip diagram.]
Chip-level Integration Example: IBM Power 4 MCM

[Figure: Power 4 multi-chip module (MCM) diagram.]
Board-level Integration Example: CM-5 Machine Organization

[Figure: CM-5 (1993): processing partitions, control processors, and an I/O partition connected by three networks: a diagnostics network, a control network, and a data network (fat tree). Each node: 33 MHz SPARC CPU, FPU, $ ctrl + $ SRAM on an MBUS, DRAM controllers with DRAM banks and optional vector units, and a network interface (NI) onto the control and data networks.]

The design replicated the core of a Sun SPARCstation 1 workstation.
Other board-level integration examples: Cray T3D, T3E (Alpha processors), Intel ASCI Red (Pentium Pro), Intel Paragon (i860), ...
System-level Integration Example: IBM SP-1/SP-2

8-16 almost complete RS6000 workstations placed in racks. Also IBM ASCI White (Power3).

[Figure: SP-2 node: Power 2 CPU and L2 $ on the memory bus, a memory controller with 4-way interleaved DRAM, and a MicroChannel bus carrying I/O devices and the NIC (i860, NI, DMA, DRAM). Nodes attach at 40 MB/s to a general interconnection network (a MIN) formed from 8-port switches.]
Realizing Programming Models: Realized by Protocols

[Layer diagram, top to bottom:]
  Parallel applications: CAD, database, scientific modeling
  Programming models: multiprogramming, shared address, message passing, data parallel
  Compilation or library
  Communication abstraction   (user/system boundary)
  Operating systems support
  Communication hardware      (hardware/software boundary)
  Physical communication medium

The communication abstraction is realized by protocols built on network transactions.
Challenges in Realizing Prog. Models in Large-Scale Machines

• No global knowledge, nor global control.
  – Barriers, scans, reduce, global-OR give fuzzy global state.
• Very large number of concurrent transactions.
• Management of input buffer resources:
  – Many sources can issue a request and over-commit the destination before any see the effect.
• Latency is large enough that one is tempted to “take risks”:
  – Optimistic protocols.
  – Large transfers.
  – Dynamic allocation.
• Many more degrees of freedom in the design and engineering of these systems.
Network Transaction Processing

Key design issues:
• How much interpretation of the message by the CA without involving the CPU?
• How much dedicated processing in the CA?

[Figure: node architecture: each node (P, M) attaches to the scalable network through a Communication Assist (CA). A message undergoes output processing at the source (checks, translation, formatting, scheduling) and input processing at the destination (checks, translation, buffering, action).]
Extent of CA Interpretation of Network Transactions: Spectrum of Designs

1. None: physical bit stream
   – Blind, physical DMA: nCUBE, iPSC, ...
2. User/system
   – User-level port: CM-5, *T
   – User-level handler: J-Machine, Monsoon, ...
3. Remote virtual address
   – Processing, translation: Paragon, Meiko CS-2
4. Global physical/virtual address
   – Proc + memory controller: RP3, BBN, T3D, T3E
5. Cache-to-cache
   – Cache controller: Dash, KSR, Flash

Down the list: increasing HW support, specialization, intrusiveness, performance (???)
1 No CA Net Transaction Interpretation: Physical DMA

• DMA controlled by registers; generates interrupts.
• Physical addresses => the OS initiates transfers.
• The destination processor initiates a DMA transfer from the network.
• The next incoming network message is blindly stored in a system buffer.
• Send side:
  – Construct a system “envelope” around the user data in the kernel area: copy from the user area, then DMA to the network.
• Receive side:
  – Must receive into a system buffer, since there is no message interpretation in the CA: DMA from the network into the kernel area, then copy to the user area.

[Figure: system-level network ports: the sender's DMA channel registers (Cmd, Dest, Data Addr, Length, Rdy) and the receiver's DMA channels (Status/interrupt, Addr, Length, Rdy); the envelope carries sender authentication and the destination address.]
1 Blind DMA Example: nCUBE/2 Network Interface

• Independent DMA channel per link direction.
  – Leave input buffers always open.
  – Segmented messages.
• Routing determines whether a message is intended for the local node or a remote node.
  – Dimension-order routing on the hypercube.
  – Bit-serial with 36-bit cut-through.

[Figure: processor and memory on the memory bus; input ports and output ports, each with its own Addr/Length DMA channel, connect to the switch.]

  OS send:    16 instructions, 260 cycles, 13 µs
  OS receive: 18 instructions, 200 cycles, 15 µs (includes interrupt)
  Message-passing library overhead: 150 µs
DMA In Conventional LAN Network Interfaces

[Figure: host processor and memory on the memory bus; the NIC sits on the I/O bus with a controller (DMA addr, len, trncv registers), TX/RX data queues, and send and receive DMA descriptor chains in host memory, each descriptor holding Addr, Len, Status, and Next. Examples: Fast Ethernet, ATM, ...]
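A sketch of the descriptor chains in the figure. The field layout, names, and status bits below are illustrative, not those of any particular NIC:

    #include <stdint.h>

    #define DESC_DONE 0x1            /* set by the NIC when DMA completes */

    struct dma_desc {
        uint64_t addr;               /* physical address of host buffer */
        uint32_t len;                /* buffer length in bytes */
        uint32_t status;             /* ownership/completion bits */
        struct dma_desc *next;       /* next descriptor in the chain */
    };

    /* Host receive path: reclaim buffers the NIC has filled by DMA. */
    struct dma_desc *poll_rx(struct dma_desc *d)
    {
        while (d->status & DESC_DONE) {
            /* pass buffer [d->addr, d->addr + d->len) up the stack,
               then hand the descriptor back to the NIC */
            d->status = 0;
            d = d->next;
        }
        return d;                    /* where the NIC will write next */
    }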
2 User-Level Network Ports

• Initiate transaction at user level.
• The CA interprets and delivers the message to the user without OS intervention.
• Network port in user virtual address space.
• User/system flag in the envelope.
  – Protection check, translation, routing, and media access in the source CA.
  – User/system check in the destination CA; interrupt on system.

[Figure: the net output port and net input port appear in the user virtual address space alongside the program counter, registers, and status; to the user they appear as logical message queues plus status.]
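A hedged sketch of what such a port looks like to a program: the FIFOs are just memory-mapped locations in user virtual address space, so a send is a handful of stores with no system call. All addresses, register locations, and status bits here are invented for illustration:

    #include <stdint.h>

    #define STATUS_OUT_FULL 0x1      /* illustrative status bit */

    static volatile uint32_t *net_out    = (uint32_t *)0x50000000;
    static volatile uint32_t *net_in     = (uint32_t *)0x50001000;
    static volatile uint32_t *net_status = (uint32_t *)0x50002000;

    /* Send at user level: no OS call; the CA performs the protection
       check, translation, routing, and media access. */
    void user_send(uint32_t dest, const uint32_t *msg, int words)
    {
        while (*net_status & STATUS_OUT_FULL)
            ;                        /* wait for space in the output FIFO */
        *net_out = dest;             /* envelope: destination (+ user/system flag) */
        for (int i = 0; i < words; i++)
            *net_out = msg[i];       /* message body, word by word */
    }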
2 User-Level Network Example: CM-5

• Two data networks and one control network.
• Input and output FIFO for each network.
• Tag per message:
  – Indexes a Network Interface (NI) mapping table.
• *T: integrated NI on chip.
• Also used in iWARP.
• Five-word message.

[Figure: CM-5 machine organization, as shown earlier.]

  Os (send overhead):    50 cycles, 1.5 µs
  Or (receive overhead): 53 cycles, 1.6 µs
  Interrupt: 10 µs
  Latency to user level: 3-5 µs
2 User-Level Handlers

• Tighter integration of the user-level network port with the processor, at the register level.
• The CA is essentially a functional unit in the processor.
• Hardware support to vector to the address specified in the message (see the sketch below).
  – Network message ports in processor registers.
• Greatly reduced latency: data is moved in and out of the network using register-to-register instructions.

[Figure: as in the user-level port figure, but the message (dest, data, address, user/system flag) moves directly through processor registers.]
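An active-message-style sketch of vectoring to the handler address carried in the message, as the bullets above describe; the dispatch function and message layout are illustrative, not a real machine's interface:

    #include <stdint.h>

    typedef void (*handler_t)(uint32_t *payload, int words);

    /* The first word of an arriving message names a user-level handler,
       and the node vectors straight to it, with no OS on the path. */
    void dispatch(uint32_t *msg, int words)
    {
        handler_t h = (handler_t)(uintptr_t)msg[0]; /* address carried in message */
        h(msg + 1, words - 1);                      /* run the user-level handler */
    }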
2 User-Level Handlers Example: iWARP

• Nodes integrate communication with computation on a systolic basis.
• Message data moves directly to registers (word by word), or streams into memory (DMA).

[Figure: host and interface unit; two processor registers are bound to the network input/output ports.]
3 Remote/Global Virtual Address Space:
Dedicated Message Processing Without Specialized Hardware Design

• The message processor performs arbitrary output processing (at system level).
• The message processor interprets incoming network transactions (at system level).
• User processor <-> message processor: via shared memory.
• Message processor <-> message processor: via system network transactions.

[Figure: each node is a bus-based SMP with memory, a user processor P, a dedicated message processor (MP), and an NI onto the network.]

One of the SMP node processors serves as a dedicated communication processor (CP) or message processor (MP) that handles network transaction processing, message interpretation, and possibly global address operations, in software.
Levels of Network Transaction

• The user processor stores cmd / msg / data into the shared output queue.
• The communication assists make the transaction happen.
  – Checking, translation, scheduling, transport, interpretation.
• Effect observed on the destination address space and/or events.
• Reduces overhead on the user processor but adds latency (a sketch of the shared queue follows the figure).

[Figure: the same bus-based SMP nodes with user processor, message processor, and NI; the MP can also be integrated in the NI.]
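An illustrative sketch of the shared output queue between the user processor and the message processor; the queue layout and fields are invented for exposition (production code would also need memory barriers):

    #include <stdint.h>

    #define QSIZE 256

    struct out_entry { uint32_t dest, cmd, len; void *data; };
    struct out_queue {
        volatile uint32_t head;      /* advanced by the MP */
        volatile uint32_t tail;      /* advanced by the user processor */
        struct out_entry e[QSIZE];
    };

    /* User processor: deposit a command and return to computing; the MP
       later does the checking, translation, scheduling, and transport. */
    int user_post(struct out_queue *q, struct out_entry cmd)
    {
        uint32_t t = q->tail;
        if ((t + 1) % QSIZE == q->head)
            return -1;               /* queue full; caller retries */
        q->e[t] = cmd;
        q->tail = (t + 1) % QSIZE;   /* publish the entry to the MP */
        return 0;
    }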
Example: Intel Paragon (1992)

[Figure: each node is a bus-based SMP: two i860 XP processors (50 MHz, 16 KB 4-way $, 32 B blocks, MESI), one acting as message processor with its MP handler, memory, and an NI with send/receive DMA (sDMA, rDMA), all on a 64-bit, 400 MB/s memory bus. Nodes connect by 16-bit, 175 MB/s duplex links to the network, which also hosts I/O nodes and service nodes with attached devices. A network transaction carries a route, variable data, and an EOP marker; maximum transfer 2048 B.]
4 Global Virtual/Physical Address Space Using Specialized Hardware Design

• A shared physical/virtual address space realized with specialized hardware to provide global loads, stores, and atomic memory operations.
• The specialized communication assist is viewed as a pseudo-memory module and pseudo-processor that translates bus transactions into network transactions.
  – The memory management unit (MMU) translates a virtual address into a global physical address presented to the memory system.
  – If the global physical address is local, the local memory responds; if not, the CA acts as a memory module while accessing the remote location, by extracting the remote node number from the physical address and issuing a network transaction (see the sketch below).
  – The remote CA acts as a pseudo-processor to its node by reading the desired memory location and then issuing a network transaction in response.
• Early designs used the dancehall organization, while later designs used distributed memory.
• Examples include: CM*, NYU Ultracomputer, BBN Butterfly, IBM RP3, Denelcor HEP-1, BBN TC2000, and Cray T3D, T3E.
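As referenced above, a minimal sketch of the CA's pseudo-memory decision: the remote node number lives in the upper bits of the global physical address. The bit split here is illustrative; actual field widths are machine-specific:

    #include <stdint.h>

    #define NODE_SHIFT 32
    #define OFFSET_MASK ((1ULL << NODE_SHIFT) - 1)

    static inline uint32_t gpa_node(uint64_t gpa)   { return gpa >> NODE_SHIFT; }
    static inline uint64_t gpa_offset(uint64_t gpa) { return gpa & OFFSET_MASK; }

    /* Local memory answers local addresses; otherwise the CA issues a
       network read transaction to node gpa_node(gpa) for offset
       gpa_offset(gpa) and responds on the bus when the reply returns. */
    int ca_is_local(uint64_t gpa, uint32_t my_node)
    {
        return gpa_node(gpa) == my_node;
    }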
4 Global Virtual/Physical Address Space Example: Cray T3E

Overview of the Cray T3E:
• Self-hosted, running Unicos/mk.
• Logically shared address space over physically distributed memory (up to 2 GB per processor).
• Each PE contains:
  – A DEC Alpha 21164 CPU (initially at 300 MHz).
  – Shell logic: control chip, router chip, local memory (75 MHz).
  – External 64-bit registers (E-registers): 512 user, 128 system, used for all remote communication and synchronization.
• Up to 2048 PEs are connected by a 3D torus with fully adaptive minimal routing.
• No board-level cache; only internal CPU caches (8K L1, 96K L2).
• Only local memory is cached; cache coherency is maintained through the use of an external backmap to probe and update the on-chip caches.
• Network links: time multiplexed; one 64-bit word every 13.3 ns (five times faster than the system clock).
• GigaRing channel I/O: 267 MB/s sustained for every four processors.
T3E Processing Element (PE) Block Diagram

[Figure: Alpha 21164 CPU plus shell logic, with control and external E-registers.]
T3E Global Communication

• E-registers:
  – Extend the physical address space of the CPU.
  – Increase the degree of pipelining for global memory requests.
  – Direct loads and stores are performed between E-registers and the CPU.
  – Global E-register operations are used for global data transfer (local or remote), messaging, and atomic-operation synchronization.
• Global E-register operations use a global virtual address formed by the shell logic:
  – The virtual PE address is translated to a physical PE at the source.
  – The virtual address transmitted across the net is translated to a physical address using a global translation buffer at the target PE.
• Data distribution features of many implicit programming languages (e.g., broadcast/gather/scatter) are supported using an integrated hardware centrifuge.
T3E Global Virtual Addressing

[Figure: global virtual address formation.]
Address Translation for Global (E-Register) References

[Figure: address translation path for global E-register references.]
Remote Data Access Using E-Registers

• Loading remote data into an E-register involves three steps:
  1. The processor portion of the global virtual address is constructed in an address E-register.
  2. A "get" command is issued via a store to a special region in memory that specifies the destination E-register.
     • The get command sets the destination E-register to empty.
     • It causes the remote read to be performed; the destination E-register is loaded with the data and set to full when the get completes.
  3. Finally, the data is read into the processor via a load from the data E-register.
• If the processor attempts to load from an empty E-register, the memory operation is stalled until the get completes.
• The process for a remote put is similar, except the store data is placed in the data E-register specified in the put command while another address E-register holds the global virtual address.
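An illustrative restatement of the three-step get sequence in C. The memory-mapped addresses, register indices, and command encoding are invented for exposition and are not the actual T3E interface:

    #include <stdint.h>

    static volatile uint64_t *e_reg   = (uint64_t *)0x60000000; /* E-register file    */
    static volatile uint64_t *get_cmd = (uint64_t *)0x60100000; /* "get" command area */

    uint64_t remote_load(uint64_t global_vaddr)
    {
        /* 1. Build the global virtual address in an address E-register. */
        e_reg[0] = global_vaddr;

        /* 2. A store to the command region issues the get, naming data
              E-register 1: the register is set empty, the remote read is
              performed, and the register is filled and marked full. */
        get_cmd[1] = 0;

        /* 3. A load from the data E-register returns the value; if the
              register is still empty, the load stalls until the get
              completes. */
        return e_reg[1];
    }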
T3E Atomic Memory Operations

[Figure: atomic memory operations performed through E-registers.]
T3E Barrier/Eureka Synchronization

• Barriers allow a set of participating processors to determine when all processors have signaled some event.
• Eurekas allow a set of processors to determine when any one of the processors has signaled some event.
• The T3E provides a set of 32 barrier/eureka synchronization units (BSUs) at each processor.
• The BSUs are accessible as memory-mapped registers and are allocated and protected via the address translation mechanism.
• A set of processors can be given access to a particular BSU through which they can perform barrier and/or eureka synchronization.
• Multiple disjoint sets of processors may reuse the same logical BSU.
• A BSU at a processor can be in one of several states; processors can read this state and perform operations on the BSU via load and store operations (see the sketch below).
T3E Barrier and Eureka Transitions

[Figure: BSU state transitions for barrier and eureka operations.]