Direct Communication and Synchronization Mechanisms in Chip Multiprocessors
Stamatis Kavadias, Computer Science Department, University of Crete (UOC-CSD), and Institute of Computer Science, Foundation for Research and Technology – Hellas (FORTH-ICS)
PhD Thesis Presentation

The physical constraints of transistor integration have made chip multiprocessors (CMPs) a necessity, and increasing the number of cores (CPUs) remains, for now, the best way to exploit additional transistors. Already, the feasible number of cores per chip is growing beyond our ability to utilize them for general purposes. Although many important application domains can easily benefit from more cores, scaling single-application performance with multiprocessing in general remains a tough milestone for computer science.

The use of per-core on-chip memories, managed in software with RDMA and adopted in the IBM Cell processor, has challenged the mainstream approach of using coherent caches for the on-chip memory hierarchy of CMPs. The two architectures have largely different implications for software and divide researchers over the most suitable approach to multicore exploitation. We demonstrate a combination of the two approaches, with cache integration of a network interface (NI) for explicit interprocessor communication, and flexible dynamic allocation of on-chip memory to hardware-managed (cache) and software-managed parts. The network interface architecture combines messages and RDMA-based transfers with remote load-store access to the software-managed memories, and allows multipath routing in the processor interconnection network. We propose the technique of event responses, which efficiently exploits the normal cache access flow for network interface functions, and prototype our combined approach in an FPGA-based multicore system, which shows reasonable logic overhead (less than 20%) in cache datapaths and controllers for the basic NI functionality. We also design and implement synchronization mechanisms in the network interface (counters and queues) that take advantage of event responses and exploit the cache tag and data arrays for synchronization state.
We propose novel queues that efficiently support multiple readers, providing hardware lock and job-dispatching services, and counters that enable selective fences for explicit transfers and can be synthesized to implement barriers in the memory system. Evaluation of the cache-integrated NI on the hardware prototype demonstrates the flexibility of exploiting both cacheable and explicitly-managed data, and the potential advantages of alternative NI transfer mechanisms. Simulations of up to 128-core CMPs show that our synchronization primitives provide significant benefits for contended locks and barriers, and can improve task-scheduling efficiency in the Cilk run-time system, for executions within the scalability limits of our benchmarks.
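As a concrete illustration of how a synchronization counter could be "synthesized to implement barriers", the following C sketch emulates the counter semantics in software: each arrival decrements the counter, and the arrival that brings it to zero observes the barrier-complete event, after which the counter self-resets for reuse. The names (`sync_counter`, `counter_arrive`) and the software form are hypothetical; in the thesis the counter state lives in the cache tag and data arrays and is updated by NI events.

```c
#include <stdbool.h>

/* Software emulation of a hardware synchronization counter.
 * In the proposed architecture the counter state lives in the
 * cache arrays and is decremented by NI events; a plain struct
 * stands in for it here. */
typedef struct {
    int remaining;   /* arrivals still outstanding in this episode */
    int reset_value; /* number of participants per barrier episode */
} sync_counter;

void counter_init(sync_counter *c, int participants) {
    c->remaining = participants;
    c->reset_value = participants;
}

/* One core arrives at the barrier. Returns true for the arrival
 * that completes the episode; the counter then self-resets so the
 * barrier can be reused, as a reusable hardware counter would. */
bool counter_arrive(sync_counter *c) {
    if (--c->remaining == 0) {
        c->remaining = c->reset_value;
        return true; /* the barrier-complete notification would fire here */
    }
    return false;
}
```

In hardware, the `true` branch would correspond to the counter triggering an event response (e.g. a notification message to waiting cores) rather than a return value.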
Transcript of Direct Communication and Synchronization Mechanisms in Chip Multiprocessors
Direct Communication and Synchronization Mechanisms in Chip Multiprocessors
Stamatis Kavadias
Computer Science Department, University of Crete (UOC-CSD), and Institute of Computer Science, Foundation for Research and Technology – Hellas (FORTH-ICS)

Motivation and Approach
- CMP architectures becoming more distributed (manycore)
  - Utilize scalable NoC (>> few tens of cores)
  - Scalable communication mechanisms required to exploit the chip
- Locality will be very important
  - Low-latency communication: exploit locality effectively
  - Fast synchronization: improve efficiency of fine-grain computation
- This study advocates:
  - Use on-chip scratchpad memory for communication and computation
  - Exploit direct communication and synchronization mechanisms
- Aim: scalable mechanisms & implementation
  - Exploit increased (replicated) resources
  - Reduce overheads with on-chip bulk transfers
  - Enable efficient communication supporting NoC optimizations

Proposed Architectural Enhancements & Contributions
- Modify contemporary CMP architecture to support:
  - Shared address space extension for direct scratchpad access
  - Cache integration of a network interface (NI)
  - Direct communication mechanisms for RDMA & messages
  - Direct synchronization mechanisms (counters & queues)
- The contributions of this thesis are:
  - Design a CMP network interface integrated at the top memory hierarchy levels
  - Introduce the event responses technique for cache integration of NI communication & synchronization mechanisms
  - Design direct synchronization mechanisms with existing cache resources
  - Refine the HW design to reduce gates by 19.3% (
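The direct synchronization queues listed among the contributions can be illustrated with a software sketch. The point of a multiple-reader queue is that items (e.g. unlock tokens or task descriptors) and read requests may arrive in either order, and the queue matches them in FIFO order, which yields lock and job-dispatch service in hardware. The C emulation below is a single-threaded sketch under assumed names (`mr_queue`, `mrq_read`, `mrq_enqueue`); it is not the thesis's hardware design, where this state would reside in cache arrays and matching would be driven by event responses.

```c
#include <stdbool.h>

#define QCAP 16 /* illustrative capacity, not a hardware parameter */

/* Emulated multiple-reader queue: buffers whichever side arrived
 * first, either items or read requests. */
typedef struct {
    int items[QCAP];   int i_head, i_count; /* buffered items          */
    int readers[QCAP]; int r_head, r_count; /* readers awaiting an item */
} mr_queue;

void mrq_init(mr_queue *q) {
    q->i_head = q->i_count = 0;
    q->r_head = q->r_count = 0;
}

/* A reader requests an item. If one is buffered it is delivered
 * immediately; otherwise the reader id is queued, and in hardware
 * the item would later be sent to that reader as an NI message. */
bool mrq_read(mr_queue *q, int reader_id, int *item_out) {
    if (q->i_count > 0) {
        *item_out = q->items[q->i_head];
        q->i_head = (q->i_head + 1) % QCAP;
        q->i_count--;
        return true;
    }
    q->readers[(q->r_head + q->r_count) % QCAP] = reader_id;
    q->r_count++;
    return false;
}

/* An item arrives. If a reader is waiting, the item is dispatched
 * to it (its id is reported); otherwise the item is buffered. */
bool mrq_enqueue(mr_queue *q, int item, int *reader_out) {
    if (q->r_count > 0) {
        *reader_out = q->readers[q->r_head];
        q->r_head = (q->r_head + 1) % QCAP;
        q->r_count--;
        return true;
    }
    q->items[(q->i_head + q->i_count) % QCAP] = item;
    q->i_count++;
    return false;
}
```

Used as a lock, the "item" is a single circulating token: enqueuing the token releases the lock, and whichever queued reader is matched next acquires it; used for job dispatch, each item is a task descriptor handed to the next idle reader.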