Synchronization and Communication in the T3E Multiprocessor.
Background
• T3E is the second of Cray’s massively scalable multiprocessors (after T3D)
• Both are scalable up to 2048 processing elements
• Physically distributed shared-memory systems, programmable with message passing (PVM or MPI, “more portable”) or a shared-memory model (HPF)
Challenges
• T3E (and T3D) attempted to overcome the inherent limitations of employing commodity microprocessors in very large multiprocessors
• Memory interface: a cache-line-based memory system makes single-word references inefficient
• Typical address spaces too small for use in big systems
• Non-cached references are often desirable (e.g. message to other processor)
T3D Strengths (used in T3E)
• External structure in each PE to expand address space
• Shared address space
• 3D torus interconnect
• Pipelined remote memory access with prefetch queue and non-cached stores
T3D: Room for improvement
• Overblown barrier network
• One outstanding cache line fill at a time (low load bandwidth)
• Too many ways to access remote memory
• Low single-node performance
• Unoptimized special hardware features (block transfer engine, DTB Annex, dedicated message queues and registers)
T3E Overview
• Each PE contains Alpha 21164, local memory, and control and routing chips
• Network links time-multiplexed at 5X system frequency
• Self-hosted running Unicos/mk
• No remote caching or board-level caches
E-Registers
• Extend physical address space
• Increase attainable memory pipelining
• Enable high single-word bandwidth
• Provide mechanisms for data distribution, messaging, and atomic memory operations
• In general, they improve on the inefficient individual structures of the T3D
Operations with E-Registers
• The processor first stores the operands into the appropriate E-registers
• The processor then issues a second store to initiate the operation
– The store address specifies the command and the source or destination E-register
– The store data specifies a pointer to the already-stored operands and a remote address index
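The store-triggered command sequence above can be sketched as a simple bit-packing exercise. The field positions and widths below are illustrative assumptions, not the T3E's actual encoding; the point is only that one ordinary store carries both the command (in its address) and the operand pointer plus address index (in its data).

```c
#include <stdint.h>

/* Hypothetical encoding of an E-register command. The processor issues
 * a store whose ADDRESS selects the operation and the target E-register,
 * and whose DATA carries a pointer to the E-register block holding the
 * operands plus an index into the remote data structure. Field widths
 * here are made up for illustration. */
static void encode_ereg_op(unsigned cmd, unsigned ereg,
                           unsigned operand_blk, uint64_t index,
                           uint64_t *store_addr, uint64_t *store_data) {
    *store_addr = ((uint64_t)cmd << 12) | ereg;          /* command address */
    *store_data = ((uint64_t)operand_blk << 50)          /* operand pointer */
                | (index & ((1ULL << 50) - 1));          /* address index   */
}
```

Because the command rides on a normal store, no new processor instructions are needed: the off-chip control logic decodes the address and data and carries out the global operation.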
Address Translation
• Global virtual addresses and virtual PE numbers formed outside processors
• Centrifuge used for efficient data distribution
• Carrying the memory location on the data bus (rather than the address bus) enables an address space larger than the processor's own
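The centrifuge's job can be modeled as a bit-separation: bits of a global index selected by a distribution mask are gathered to form the virtual PE number, and the remaining bits are compacted to form the local offset. This is a minimal sketch of that idea, not the T3E's exact hardware behavior:

```c
#include <stdint.h>

/* Sketch of a centrifuge: mask bits that are 1 route the corresponding
 * index bits into the PE number; mask bits that are 0 route them into
 * the local offset. Both results are bit-compacted. */
static void centrifuge(uint64_t index, uint64_t mask,
                       uint64_t *pe, uint64_t *offset) {
    uint64_t p = 0, o = 0;
    int pbit = 0, obit = 0;
    for (int i = 0; i < 64; i++) {
        uint64_t bit = (index >> i) & 1;
        if ((mask >> i) & 1)
            p |= bit << pbit++;   /* goes to the virtual PE number */
        else
            o |= bit << obit++;   /* goes to the local offset      */
    }
    *pe = p;
    *offset = o;
}
```

Varying the mask lets software express different data distributions (e.g. blocked vs. cyclic) without any per-reference address arithmetic on the processor.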
Remote Reads/Writes
• All operations done by reading into E-registers (Gets) or writing from E-registers to memory (Puts)
• Vector forms transfer 8 words with arbitrary stride (e.g. every 3rd word)
• Large number of E-registers allows significant pipelining of Gets/Puts
– Limited by the bus interface (256 B / 26.7 ns)
• Single-word load bandwidth is high – words can be gathered into contiguous E-registers and then moved into cache as a block (instead of fetching a full cache line per word)
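A software analogue of the vector Get described above: eight words are fetched at an arbitrary stride and land in eight contiguous E-registers, modeled here as a plain array. Names are illustrative.

```c
/* Model of a vector Get: fetch 8 words from (remote) memory at an
 * arbitrary word stride into 8 contiguous E-registers. A stride of 3
 * fetches every 3rd word; a stride of 1 is an ordinary block fetch. */
static void vector_get(const long *src, long stride, long eregs[8]) {
    for (int i = 0; i < 8; i++)
        eregs[i] = src[i * stride];
}
```

Once the eight words sit in contiguous E-registers, the processor can move them into cache as one dense block, which is how sparse single-word traffic is turned into full-bandwidth transfers.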
Atomic Memory Operations
• Fetch_&_Inc, Fetch_&_Add, Compare_&_Swap, Masked_Swap
• Can be performed on any memory location
• Performed like any E-register operation
– Operands placed in E-registers
– Triggered via store, sent over network
– Result sent back and stored in specified E-register
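The semantics of the four operations can be pinned down with sequential models. On the T3E each executes atomically at the target PE's memory system; here they are shown as plain functions purely to make the return-value and update rules explicit.

```c
/* Sequential models of the four T3E atomic memory operations.
 * Each returns the OLD value of the location, like a fetch-and-op. */
long fetch_and_inc(long *loc) {
    long old = *loc; *loc = old + 1; return old;
}
long fetch_and_add(long *loc, long v) {
    long old = *loc; *loc = old + v; return old;
}
long compare_and_swap(long *loc, long cmp, long v) {
    long old = *loc;
    if (old == cmp) *loc = v;      /* swap only on match */
    return old;
}
long masked_swap(long *loc, long mask, long v) {
    long old = *loc;
    *loc = (old & ~mask) | (v & mask);  /* replace only masked bits */
    return old;
}
```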
Messaging
• T3D: specific queue location of fixed size
• T3E: arbitrary number of queues, mapped to normal memory, of any size up to 128 MB
• T3D: all incoming messages generated interrupts, adding significant penalties
• T3E: three options – interrupt, don’t interrupt (detected via polling), and interrupt after a threshold number of messages
Messaging Specifics
• Message queues consist of Message Queue Control Words (MQCW)
• Messages assembled into 8 E-registers, SEND issued with address of MQCW
• Message queue is managed in software – avoids OS if polling is used
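The polling path above can be modeled roughly as follows. The field names, queue size, and full-queue handling are illustrative assumptions, not the T3E's actual MQCW layout; the sketch only shows why polling needs no OS involvement.

```c
#include <string.h>

/* Toy model of a software-managed message queue. A message is 8 words
 * (one E-register block); the control word tracks the tail and limit.
 * The receiver keeps a private head pointer and simply polls, so no
 * interrupt or OS call is needed to detect delivery. */
#define QSLOTS 16
typedef struct { long tail, limit; } mqcw_t;               /* control word */
typedef struct { mqcw_t mqcw; long slot[QSLOTS][8]; } msgq_t;

int msg_send(msgq_t *q, const long msg[8]) {
    if (q->mqcw.tail >= q->mqcw.limit) return 0;   /* queue full */
    memcpy(q->slot[q->mqcw.tail++], msg, 8 * sizeof(long));
    return 1;
}
int msg_poll(const msgq_t *q, long *head, long out[8]) {
    if (*head >= q->mqcw.tail) return 0;           /* nothing pending */
    memcpy(out, q->slot[(*head)++], 8 * sizeof(long));
    return 1;
}
```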
Synchronization
• Support for barriers and eurekas (a eureka is a notification from one processor to the rest of a group)
• 32 barrier synchronization units (BSUs) at each processor, accessed as memory-mapped registers
• Synchronization packets use a dedicated high-priority virtual channel
– Propagated through a logical tree embedded in the 3D torus interconnect
Synchronization
• Simple barrier operation involves 2 states
– First, all processors in the group arm (S_ARM)
– Once all are armed, the network notifies all of completion and processors return to S_BAR
• Eureka requires 3 states to ensure one eureka is received before the next is issued
– Eureka notification immediately followed by a barrier
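The two-state barrier above can be sketched as a toy sequential model. The array stands in for the distributed BSU state, and the completion step plays the role of the network's notification; this is an illustration of the protocol, not of the hardware.

```c
/* Toy model of the two-state barrier: each PE arms (S_BAR -> S_ARM);
 * when the last PE in the group arms, the "network" reports completion
 * and every PE drops back to S_BAR, ready for the next barrier. */
enum bsu { S_BAR, S_ARM };

/* Arm PE `pe`; returns 1 iff this arming completed the barrier. */
int bsu_arm(enum bsu state[], int npe, int pe) {
    state[pe] = S_ARM;
    for (int i = 0; i < npe; i++)
        if (state[i] != S_ARM)
            return 0;                  /* someone has not arrived yet */
    for (int i = 0; i < npe; i++)
        state[i] = S_BAR;              /* completion: all reset       */
    return 1;
}
```

A eureka cannot reuse this two-state cycle directly: without the intermediate third state, a fast processor could fire a second eureka before every processor had observed the first, which is why the eureka sequence ends with a barrier.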
Performance
Increasing number of E-registers allows greater pipelining and bandwidth (limited by control logic)
Effective bandwidth increases greatly with larger transfer sizes, as fixed overhead and startup latency are amortized
Performance
Transfer bandwidth is independent of stride, except for strides that concentrate accesses in the same memory bank(s) (multiples of 4 and 8)
Several million AMOs per second are required to saturate the memory system and drive up latency
Performance
Very high message bandwidth is supported without latency increase
Hardware barrier is many times faster than an efficient software barrier (about 15× for 1024 PEs)
Conclusions
• E-registers allow a highly pipelined memory system and provide a common interface for all global memory operations
• Both messages and standard shared memory ops supported
• Fast hardware barrier supported with almost no extra cost
• No remote caching eliminates need for bulky coherence mechanisms and helps allow 2048 PE systems
• Paper provides no means of quantitative comparison to alternative systems