
Extended Task Queuing: Active Messages for Heterogeneous Systems

Michael LeBeane∗†, Brandon Potter†, Abhisek Pan†, Alexandru Dutu†, Vinay Agarwala†, Wonchan Lee‡, Deepak Majeti§, Bibek Ghimire‖, Eric Van Tassell†, Samuel Wasmundt¶, Brad Benton†, Mauricio Breternitz†, Michael L. Chu†, Mithuna Thottethodi∗∗, Lizy K. John∗, Steven K. Reinhardt†

∗University of Texas at Austin   ‡Stanford University   ¶University of California, San Diego
{mlebeane, ljohn}@utexas.com   [email protected]   [email protected]

‖Louisiana State University   §HPE Vertica   ∗∗Purdue University
[email protected]   [email protected]   [email protected]

†Advanced Micro Devices, Inc.
{Michael.Lebeane, Brandon.Potter, Abhisek.Pan, Alexandru.Dutu, Vinay.Agarwala, Eric.Vantassell, Brad.Benton, Mauricio.Breternitz, Mike.Chu, Steve.Reinhardt}@amd.com

Abstract—Accelerators have emerged as an important component of modern cloud, datacenter, and HPC computing environments. However, launching tasks on remote accelerators across a network remains unwieldy, forcing programmers to send data in large chunks to amortize the transfer and launch overhead. By combining advances in intra-node accelerator unification with one-sided Remote Direct Memory Access (RDMA) communication primitives, it is possible to efficiently implement lightweight tasking across distributed-memory systems.

This paper introduces Extended Task Queuing (XTQ), an RDMA-based active messaging mechanism for accelerators in distributed-memory systems. XTQ's direct NIC-to-accelerator communication decreases inter-node GPU task launch latency by 10-15% for small-to-medium sized messages and ameliorates CPU message servicing overheads. These benefits are shown in the context of MPI accumulate, reduce, and allreduce operations with up to 64 nodes. Finally, we illustrate how XTQ can improve the performance of popular deep learning workloads implemented in the Computational Network Toolkit (CNTK).

Index Terms—Accelerator architectures, Computer architecture, Computer networks, Distributed computing, Network interfaces.

I. INTRODUCTION

Accelerators have emerged as ubiquitous components of many datacenter, cloud computing, and HPC ecosystems [1–3]. At the time of this writing, over 100 of the top 500 supercomputers use accelerators [4], typically GPUs or Intel Xeon Phi co-processors [5]. Similarly, Amazon offers GPU-enabled nodes as part of their virtualized Elastic Compute Cloud service [6]. However, despite widespread adoption, accelerators are traditionally deployed as peripheral components that must marshal data to and from main memory at the directive of a kernel driver.

Many of the major heterogeneous platform providers are striving to incorporate accelerators more tightly into a node's compute ecosystem [7–10]. Most of the major hardware vendors offer a number of standardized features that enable accelerators to participate in computation as peers to the host CPU. These features typically include user-mode task invocation, shared virtual memory with a well-defined memory consistency model, shared-memory-based synchronization, and accelerator context switching. We refer to frameworks that implement the above features at a node level as tightly coupled compute ecosystems.

Eliminating data copies and privileged device-driver operations from the critical path has helped tightly coupled architectures drive down kernel launch latencies from approximately 30µs for a high-performance discrete GPU to less than 7µs for a device such as AMD's A10-7850K Accelerated Processing Unit (APU) [11]. Low-overhead kernel launches and data-copy elimination have the potential to change how applications use accelerators. Tasks that previously were too small to amortize dispatch and data movement costs will begin to benefit from accelerator offload. We expect these trends to continue, as kernel launch latency is of paramount importance to chip designers [8, 12] and researchers [13, 14].

While node-level, tightly coupled frameworks remove data copies and heavyweight driver invocations for intra-node accelerators, Network Interface Controllers (NICs) with Remote Direct Memory Access (RDMA) offer these same benefits for inter-node data transfers. RDMA-capable NICs and fabrics [15–19] move data from one node to another without involving the target-node CPU by enabling the target-node NIC to perform DMA directly to and from application memory.

This paper introduces NIC primitives that combine RDMA with tightly coupled, user-level task queuing. These primitives enable applications to efficiently enqueue tasks on any compute device in a distributed-memory system, without involving the target-node CPU or the operating system on either the initiator or the target node.


Fig. 1. Remote task enqueue control path on different heterogeneous, distributed-memory systems: (a) conventional heterogeneous systems; (b) tightly coupled heterogeneous systems; (c) XTQ-enabled, tightly coupled heterogeneous systems.

We call this mechanism Extended Task Queuing (XTQ), since it extends the lightweight, user-mode task queuing in modern shared-memory platforms across distributed-memory systems. XTQ offers a novel, highly efficient active messaging [20] platform for accelerators that improves upon the state of the art.

Figure 1 shows the control path of a point-to-point remote task invocation implemented on three types of systems. In these examples, a CPU on the initiator node schedules work on a remote accelerator. In a conventional heterogeneous node, the communication flow is similar to Figure 1a. The initiator CPU uses a high-performance NIC sitting on an I/O bus to transfer the task and associated data to the target via RDMA. While emerging technologies such as NVIDIA's GPUDirect RDMA [21] allow NICs to transfer data directly to a GPU's onboard memory, launching a kernel still requires the intervention of the target CPU's runtime and kernel driver.

Figure 1b shows the same operation implemented on a contemporary, tightly coupled SoC. The CPU and the accelerator share the same memory, obviating the need to transfer data from the target's main memory to the accelerator's local device memory. However, the target-side CPU must still service the request from the NIC and explicitly schedule work on its local accelerator.

XTQ provides a mechanism enabling the direct CPU-to-accelerator communication presented in Figure 1c. Our scheme enables tightly integrated accelerators to efficiently and directly communicate with each other through a customized hardware NIC. The target-side NIC participates in the tightly coupled queuing model and can directly schedule work on the accelerator, completely eliminating the CPU communication path in Figure 1b.

Directly interfacing an intra-node tasking framework with inter-node RDMA through XTQ offers a number of benefits, including:

• A unified active messaging framework for all compute devices in the system. By leveraging the user-mode task invocation of tightly coupled systems, it is possible to design an active messaging framework that uses the same interface to spawn remote tasks on any device (CPU, GPU, FPGA, Processor-in-Memory (PIM), etc.) in the system. Unified active messaging offers exciting new acceleration possibilities for future applications and runtime libraries.

• A reduction in remote accelerator task launch latency. RDMA provides the means for low-latency, CPU-less data transfer without redundant data copies. Tightly coupled architectures provide the means for lightweight, direct accelerator-to-accelerator communication within a shared-memory node. By marrying the two, a target-side NIC can schedule work directly on a local accelerator without critical-path software on the CPU. Bypassing the CPU decreases accelerator launch latency and opens up the possibility of fine-grained remote tasking models for accelerators.

• Removal of message processing and task launch overheads on the target CPU. Message progress threads can impose a significant overhead in distributed systems [22]. This problem is exacerbated when the progress thread not only has to handle messages, but also construct command packets and schedule work on accelerators. Direct NIC-to-accelerator task invocation frees the CPU to either perform more useful computation or enter a low-power state.

This paper provides an overview of an XTQ-enabled, tightly coupled system architecture. Our exploration of XTQ is organized into the following three topics:

• The NIC hardware design. We illustrate that cross-node heterogeneous integration can be achieved with the addition of a small amount of hardware to an RDMA-capable NIC.

• Programming via lightweight extensions to an RDMA-capable host API. We implement the XTQ remote tasking primitive as a simple extension to a generic RDMA network programming interface. This API extension leverages one-sided communication semantics to allow programmers to schedule active messages on any computing device in the cluster. We illustrate that it is easy to express XTQ task invocations using only a few API calls.

• Performance on a number of important primitive operations and machine learning applications. We illustrate that XTQ can improve GPU-bound active message performance by 10-15% over non-XTQ-enhanced messages, while eliminating message and task enqueue overheads on the target-side CPU. We present latency decompositions for important steps in the XTQ task flow and show performance improvements for microbenchmarks. We show that XTQ can enhance important MPI primitives such as accumulate, reduce, and allreduce operations, and that XTQ's benefits scale as the number of nodes increases for a fixed problem size. Finally, we illustrate the real-world impact of XTQ on distributed deep learning workloads implemented using the Computational Network Toolkit (CNTK) [23].


II. BACKGROUND

In this section, we discuss the concepts, trends, and recent advances on which XTQ is built.

A. Intra-node Accelerator Integration

Modern system designs are increasingly providing tighter coupling between CPUs, GPUs, and other accelerators. There are a number of recent frameworks offering varying features and levels of support [7–10]. XTQ-like active messaging can be implemented on many tightly coupled, node-level architectures. However, it is helpful to explain this work in the context of an existing industry standard. For the remainder of the paper, we will use the Heterogeneous System Architecture (HSA) [8] as our prototypical example of a tightly coupled architecture. HSA is an open industry standard from the HSA Foundation, a consortium formed by AMD, ARM®, Imagination Technologies, MediaTek, Texas Instruments, Samsung Electronics, Qualcomm, and others with the objective of helping system designers integrate different kinds of computing elements (e.g., CPUs and GPUs) to enable efficient data sharing and work dispatch. Some important features of the HSA specification are illustrated in Figure 2 and are described in the following paragraphs.

User-Level Command Queuing: In HSA, applications allocate accelerator task queues in user memory. Devices fetch and execute tasks directly out of these queues, thereby eliminating OS kernel transitions and device-driver overheads on common paths. Figure 2a illustrates HSA user-level command queuing. User-mode queues are arranged as circular buffers, with the read and write pointers implemented as monotonically increasing indices. The queue entry format is defined by HSA's Architected Queuing Language (AQL). AQL packets contain all the information needed to launch and synchronize a GPU kernel or CPU function.
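To make the producer side of this flow concrete, the sketch below shows a simplified, AQL-like dispatch packet and a user-mode enqueue. The struct layout and helper names are illustrative approximations rather than the architected HSA format, and the atomics and memory fences a real implementation requires are omitted.

    #include <stdint.h>

    /* Simplified, AQL-like dispatch packet (illustrative; not the spec layout). */
    typedef struct {
        uint16_t header;               /* packet type and fence/barrier bits     */
        uint16_t workgroup_size[3];    /* WG X/Y/Z                               */
        uint32_t grid_size[3];         /* Grid X/Y/Z                             */
        uint32_t private_segment_size;
        uint32_t group_segment_size;
        uint64_t kernel_object;        /* pointer to the kernel code object      */
        uint64_t kernarg_address;      /* pointer to the kernel arguments        */
        uint64_t completion_signal;    /* shared-memory completion signal handle */
    } aql_packet_t;

    /* Producer-side enqueue into a user-mode circular buffer: copy the packet
     * into the next slot, then ring the doorbell with the new write index. */
    void enqueue_task(aql_packet_t *queue_base, uint64_t queue_size,
                      uint64_t *write_index, volatile uint64_t *doorbell,
                      const aql_packet_t *pkt) {
        uint64_t idx = (*write_index)++;          /* monotonically increasing index */
        queue_base[idx % queue_size] = *pkt;      /* write the packet into the slot */
        *doorbell = idx;                          /* notify the consumer device     */
    }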

Shared Virtual Memory: HSA requires that devices access memory using the same virtual addresses seen by the application program. This feature is necessary to allow users to pass pointers directly to devices through the HSA task queues without device-driver intervention or validation. Devices must also be capable of initiating page faults to avoid the overhead of pinning pages in memory. Shared translations can be provided to devices by an IOMMU that references the same page tables as the host CPUs [24, 25], as shown in Figure 2b.

Shared-Memory Synchronization: HSA uses memory-based signal objects for synchronization. Processes indicate to a device that work has been placed in its command queue using a doorbell signal associated with the queue. The doorbell signal can map to a memory-mapped device range (e.g., for a firmware- or hardware-dispatched device such as a GPU) or to a shared-memory location (on which a software-dispatched device such as a CPU can poll). Devices or threads can also wait on tasks to finish using completion signals.

While HSA improves programmability and task invocation in a shared-memory environment, distributed-memory setups cannot naturally leverage these features. Through XTQ, distributed-memory applications are able to directly interface with the intra-node HSA tasking model.

Fig. 2. Intra-node accelerator integration in HSA: (a) HSA tasking model; (b) shared virtual memory.

B. Inter-node RDMA Frameworks

Modern high-performance clusters employ RDMA adaptors for lightweight and efficient inter-node data movement. RDMA protocols expose connections directly to user-level applications, enabling data movement without involving the operating system or target-side processor. High-speed RDMA fabrics are increasingly available on commodity systems. Technologies such as InfiniBand™ [15], iWARP [19], and RDMA over Converged Ethernet (RoCE) [18] enable these systems to scale to thousands of nodes. Current technologies can reliably provide read latencies as low as 1.19µs [26], with emerging network technologies promising less than 100ns network latency per hop [16, 27]. RDMA technology provides the communication infrastructure for XTQ to implement heterogeneous tasking across nodes. XTQ is built on a generic remote Put operation and can be easily integrated into any RDMA transport layer exposing one-sided communication semantics.

While XTQ can be added to any RDMA transport, we use Portals 4 [17] as the framework of reference when explaining XTQ in the context of an existing network transport layer. Portals 4 is a connectionless, low-level network API designed to support the Message Passing Interface (MPI) [28] and various partitioned global address space (PGAS) languages [29, 30]. It is agnostic to the underlying fabric and exposes both flow control and RDMA data transfer to higher-level applications and libraries.

III. XTQ ARCHITECTURE

This section illustrates the main components of the XTQ hardware architecture. XTQ introduces one basic primitive to an RDMA-capable NIC: direct, user-mode task invocation on a remote accelerator. The lightweight hardware that implements this operation is described in the following paragraphs. For the purposes of our exploration of XTQ, we will refer specifically to a system with a tightly coupled CPU and GPU. However, the same scheme is generalizable to other accelerators in a tightly coupled system architecture.

A. XTQ Message Format

Figure 3 illustrates the main components of a typical XTQ message, along with the AQL-like [8] command packet formats for CPU and GPU tasks.


Fig. 3. XTQ message format: a 64-byte, AQL-like command packet (GPU dispatch packet or CPU function packet) followed by optional, variable-length kernel arguments and a data payload.

Fig. 4. Target-side steps in the XtqPut operation.

For GPU tasks, the command packet contains all the information needed to launch a kernel, such as a pointer to the kernel code object, a pointer to the kernel arguments, and workitem/workgroup sizes. Variable-sized kernel arguments are appended to the message at the end of the command packet. For CPU tasks, the command packet contains fields such as a function pointer and embedded scalar arguments.

After the command packet and kernel arguments comes a variable-size payload. This payload is generally a task input buffer provided by the initiator, but there are no specific requirements for its usage. Our API and NIC consume two separate pointers: one for the command packet/kernel argument combination and one for the payload. The NIC performs a gather operation on the two buffers before transmitting to the target. Gathering the payload and command packet separately avoids an unnecessary memory copy in the application code.

B. XTQ Put Overview

This section illustrates the steps involved for a CPU to schedule a unit of work on a remote GPU. The first step in a remote task invocation is for the initiator's CPU to enqueue one or more remote XtqPut operations on the NIC's command queue. The NIC's software interface is provided pointers to two distinct memory buffers: one for the command packet and kernel/function arguments and one for the optional data payload. The host library notifies the NIC of the local memory buffers through a shared command queue and doorbell mechanism. The NIC then performs a local gather operation and transfers the data over the network.

Figure 4 illustrates the steps involved in queuing a task on the target GPU from the NIC. First, the target NIC receives the XTQ message from the network. The payload portion of the message is streamed directly into the receive buffer in main memory (1). The NIC also extracts the command packet from the message and performs the rewriting services described in Section III-C. Before enqueuing the packet, the NIC first accesses the command queue descriptor. If the queue is full, then the NIC triggers the flow-control mechanism discussed later. Otherwise, the rewritten command packet is enqueued to the target GPU's command queue (2). After the payload write and command enqueue are completed, the NIC writes the command queue index to the GPU's memory-mapped doorbell register (3). In our configuration, the GPU contains a Command Processor (CP), which is responsible for reading packets from the command queue when the doorbell is updated with the newest write index. The CP proceeds to dequeue the packet from its command queue (4), decodes the launch parameters, and schedules the work on the GPU's compute threads. The GPU threads then perform global load and store operations to access the kernel arguments and payload data (5). Optionally, the GPU can notify the local CPU of kernel completion using a shared-memory signal (not shown).

It is important to note that all of the node-local operations take place in a unified virtual memory environment. The pointer addresses used for the command queue and data buffer are virtual addresses. In our scheme it is assumed that both the NIC and GPU have access to an IOMMU, as described in Section II-A.
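The target-side flow in Figure 4 can be summarized in a short pseudocode sketch; the types and helpers (xtq_msg_t, rewrite_packet, trigger_flow_control) are invented here for illustration and stand in for NIC-internal state and logic.

    /* Illustrative target-side NIC handling of an incoming XTQ message.
     * All pointers are virtual addresses translated through the IOMMU. */
    void xtq_handle_message(const xtq_msg_t *msg, process_state_t *ps) {
        /* (1) Stream the data payload into the posted receive buffer. */
        dma_write(ps->recv_buffer, msg->payload, msg->payload_len);

        /* Rewrite initiator-supplied indices into target-local pointers. */
        aql_packet_t pkt = rewrite_packet(ps, &msg->command);

        /* (2) Enqueue the rewritten packet, or trigger flow control if full. */
        queue_state_t *q = &ps->queue_table[msg->queue_index];
        if (queue_full(q)) {
            trigger_flow_control(msg);            /* transport-level mechanism */
            return;
        }
        uint64_t idx = q->write_index++;
        q->base[idx % q->size] = pkt;

        /* (3) Ring the GPU's memory-mapped doorbell; the command processor
         * then dequeues the packet (4) and the kernel reads its arguments
         * and payload from shared virtual memory (5). */
        *q->doorbell = idx;
    }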

Security and process isolation are critical concerns for distributed, multi-process workloads. For the XTQ extensions, the security concerns are handled by the underlying transport layer. As an example, the Portals 4 network programming API provides clear semantics to guarantee isolation of its own per-process data structures. An XTQ extension to the Portals 4 framework would inherit these same security mechanisms to protect its task queues and other auxiliary data structures.

A similar argument exists for flow control. XTQ utilizes shared-memory queues as the interface between the NIC and the GPU, which can become full. XTQ can utilize any existing hardware transport-level flow-control mechanism. In our continuing example with Portals 4, the philosophy is to provide building blocks for a higher software layer to implement arbitrarily sophisticated policies. Portals 4 can identify when target-side resources are full and generate events to notify the initiator or target that a message was not successfully delivered.


Fig. 5. Target-side XTQ rewrite semantics.

The same monitoring and messaging facilities are used for the XTQ extension to Portals 4.

C. XTQ Rewrite Semantics

One issue with remote task invocation is that the initiator is unaware of the addresses of important resources that are dynamically allocated by the target. Some examples are target-resident kernel input/output buffers, completion signals, and the GPU's command queue. One simple way to solve this issue is to broadcast the virtual addresses of all dynamically allocated data needed for task execution. However, it is costly for initiators to keep track of resource descriptors for thousands of possible target nodes; this problem is particularly germane when resources are frequently allocated and deallocated during program execution.

XTQ solves this problem by leveraging coordinated indices to refer to all target-resident data. The initiator populates the command packet with these indices, and the target performs a translation from an index to the correct target-local virtual address. Therefore, the initiator does not need to store any target address information, and the target only needs to keep index translations for its own resident data. Since one of the design goals is to avoid invoking the CPU for tasks that are not specifically destined for it, the XTQ framework incorporates the logic to substitute virtual addresses into the target-side NIC. We will describe the semantics of this "XTQ rewrite" operation in detail for both CPU- and GPU-bound tasks. We use the term "rewrite" as opposed to the arguably more accurate term "translation" to avoid confusion with the virtual-to-physical address translations that occur during device IOMMU accesses.

1) Lookup Tables: The NIC manages a number of per-process lookup tables to hold the index-to-virtual-address rewrites needed for the NIC to enqueue tasks on a compute device. There are three different types of lookup tables: the Kernel Lookup Table, the Function Lookup Table, and the Queue Lookup Table. Every XTQ packet triggers one lookup in either the Kernel or Function Lookup Table, depending on the type of the packet. Additionally, all messages trigger a lookup in the Queue Lookup Table to extract the base pointer of the target command queue. The entries in the lookup tables are populated by the host CPU using the XTQ API described in Section IV.

One lookup table entry is needed for each function, kernel, and queue that wishes to participate in XTQ's direct NIC-to-compute-device tasking. For the applications and microbenchmarks that we studied, 64 kernel and function registrations were sufficient, producing Kernel and Function Lookup Tables that are around 4KB per process. For a small number of nodes, these data structures can reside in dedicated tables on the NIC. For larger numbers of nodes, these tables would need to reside in DRAM. In this case, the NIC can implement an on-chip cache to ensure low-latency access to frequently used table entries.
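As a rough sizing illustration, a Kernel Lookup Table entry only needs a few 64-bit fields; the layout below is an assumption made for this back-of-the-envelope estimate, not the actual NIC format.

    /* Hypothetical Kernel Lookup Table entry (one per registered kernel). */
    typedef struct {
        uint64_t kernel_object;      /* target-local pointer to the kernel code  */
        uint64_t target_buffer;      /* optional pre-registered target buffer    */
        uint64_t completion_signal;  /* optional shared-memory completion signal */
        uint64_t flags;              /* valid bits, buffer vs. pointer table     */
    } kernel_lut_entry_t;            /* 32 bytes per entry                       */

    /* 64 registrations x 32 B = 2 KB; with a similarly sized Function Lookup
     * Table, the per-process state is on the order of the 4 KB noted above. */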

2) Rewrite Procedure: Figure 5 shows how the NIC selectively replaces certain fields in a GPU AQL packet. The initiator places a lookup table index in the field reserved for the kernel object pointer. The target NIC uses this index to offset into the Kernel Lookup Table and replace the kernel pointer with the correct value in the target's virtual address space. The actual kernel arguments are aligned directly after the command packet in the receive buffer. XTQ replaces the kernel argument pointer in the command packet with the address at which the kernel arguments will be written to memory. Finally, the first two kernel arguments are replaced with a pointer to a pre-registered target-side buffer and the initiator-provided data payload, respectively. A pointer table can be registered instead of a target-side buffer if more registrations are needed. Kernel Lookup Table entries also contain room for the registration of a target-resident, shared-memory completion signal. When the GPU finishes execution of a task, this completion signal is decremented by the CP to let other agents efficiently wait for the task to complete. XTQ replaces the command packet's completion signal entry with the completion signal registered in the Kernel Lookup Table if it is valid. The rest of the fields are passed through as they are received and are assumed to be properly populated by the initiator.

A similar rewrite procedure is performed for CPU tasks. For CPU tasks, the NIC references the Function Lookup Table instead of the Kernel Lookup Table. The primary difference is that the first four function arguments are embedded directly in the AQL packet.
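A sketch of the GPU-packet rewrite follows, with the same caveat that the helper names and process-state fields (kernel_lut, kernarg_dest, recv_buffer) are illustrative.

    /* Illustrative rewrite of a GPU-bound AQL packet: indices supplied by the
     * initiator are replaced with target-local virtual addresses. */
    aql_packet_t rewrite_packet(process_state_t *ps, const aql_packet_t *in) {
        aql_packet_t out = *in;
        /* The kernel_object field arrives holding an index, not a pointer. */
        kernel_lut_entry_t *e = &ps->kernel_lut[in->kernel_object];

        out.kernel_object   = e->kernel_object;   /* real code-object pointer     */
        out.kernarg_address = ps->kernarg_dest;   /* where the args are deposited */

        /* The first two kernel arguments become the pre-registered target
         * buffer and the address at which the payload was written. */
        uint64_t *args = (uint64_t *)ps->kernarg_dest;
        args[0] = e->target_buffer;
        args[1] = (uint64_t)ps->recv_buffer;

        if (e->completion_signal)                 /* use registered signal if valid */
            out.completion_signal = e->completion_signal;
        return out;                               /* remaining fields pass through  */
    }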

Finally, the NIC must identify in which of many possible user-mode queues to place the AQL packet. The queue address is extracted with one final table lookup, using an index embedded in the AQL reserved bits to access the Queue Lookup Table.

IV. XTQ API

The XTQ tasking framework defines an API that the host CPU can use to schedule remote tasks on a target compute device. This API is implemented as an extension to a generic RDMA network programming interface that supports a remote Put operation. The XTQ-specific extensions can be broken down into the remote task launch function (XtqPut) and a number of registration functions used to populate the XTQ lookup tables.

A. XtqPut Function

XTQ contains one remote-tasking operation: XtqPut. The XtqPut operation performs the same one-sided RDMA data movement as a basic Put operation, with additional semantics for launching tasks on the target. The exact mechanism for launching tasks is described in detail in Section III.
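Concretely, the host-side call can be as small as a Put that takes two source buffers, mirroring the usage in Figure 6; the signature below is our assumption of what such an extension could look like, not a defined API.

    /* Hypothetical XtqPut: a one-sided Put plus remote task launch. The NIC
     * gathers the command packet (with kernel/function arguments) and the
     * payload from two separate buffers, avoiding a copy in application code. */
    int XtqPut(int     target_rank,    /* destination node/process            */
               void   *cmd,            /* command packet + kernel arguments   */
               size_t  cmd_size,
               void   *payload,        /* optional task input data            */
               size_t  payload_size);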

B. XTQ Lookup Table Registration

The XTQ lookup tables are populated by the host CPU using a registration API. The API provides three lookup table registration functions:

• XtqRegisterFunction: Associates a lookup table index with a function pointer and an optional target-resident buffer.

• XtqRegisterKernel: Associates a lookup table index with a kernel pointer, an optional target-resident buffer, and an optional completion signal.

• XtqRegisterQueue: Associates a lookup table index with a command queue descriptor.

These functions are meant to be invoked using globally known, coordinated indices. Such indices are common in SPMD programming techniques and are already available in distributed, multiprocess programming frameworks such as MPI.

C. Example Program

To ground our discussion of the API, Figure 6 illustrates a small program written in the SPMD style utilizing the primary components of XTQ. In this example, the initiator CPU enqueues a task on the target node's GPU. The target-side CPU simply waits on a shared-memory signal until the target-side GPU has completed the task.

Both CPUs begin by initializing the RDMA communication layer and NIC (1). On the initiator side, the CPU allocates a payload and creates a command packet encapsulating the task to execute on the target (2). In this example, the coordinated index 42 is used to associate this command with the queue, kernel, and signal registrations at the target. This task is then supplied to the NIC using the XtqPut operation (3). An XtqPut triggers the NIC to send the input data and command packet to the target.

Meanwhile, the target CPU posts the receive buffer using the RDMA communication layer (4). Next, it initializes the local accelerator runtime and creates a kernel, completion signal, and user-mode command queue (5).

    int main(int argc, char *argv[]) {
        // (1) Initialize RDMA comm layer
        int rank = RdmaInit();
        int index = 42;                 // globally coordinated index
        if (rank == INITIATOR) {
            // (2) Construct XTQ payload and command
            void *payload = malloc(BUFFER_SIZE);
            void *cmd = ConstructCmd(CMD_SIZE, index);
            // Post-initialization sync with target
            ExecutionBarrier();
            // (3) Launch on remote GPU using XTQ
            XtqPut(TARGET, cmd, CMD_SIZE, payload, BUFFER_SIZE);
        } else {
            // (4) Post receive buffer
            void *recv_buf = malloc(BUFFER_SIZE);
            RdmaPostBuffer(recv_buf);
            // (5) Initialize HSA CPU runtime
            signal_t signal;
            kernel_t kernel;
            queue_t  queue;
            TaskingInit(&signal, &kernel, &queue);
            // (6) Register kernel/queues
            XtqRegisterKernel(signal, kernel, index);
            XtqRegisterQueue(queue, index);
            // Post-initialization sync with initiator
            ExecutionBarrier();
            // (7) Wait for GPU to complete task
            SignalWait(signal);
        }
    }

Fig. 6. Pseudocode for XtqPut operation.

The target then registers the kernel, signal, and queue with the XTQ NIC using the XtqRegisterKernel and XtqRegisterQueue functions at index 42 (6). These functions populate the lookup tables used by the target-side NIC. When the initiator's RDMA operation arrives at the target, the target NIC recognizes it as an XTQ-enabled RDMA and uses the Kernel Lookup Table and Queue Lookup Table to replace components of the packet and select a command queue. After the RDMA operation is complete, the NIC enqueues the packet to the GPU, which begins execution of the kernel. Meanwhile, the target CPU waits for the operation to complete using a shared-memory signal (7).

After initialization, an execution barrier is entered between steps (2) and (3) on the initiator and steps (6) and (7) on the target. This barrier ensures that the initiator does not send a command to the target before the target's lookup tables have been populated.

In addition to illustrating the XTQ API, this example program drives home two related contributions of the XTQ framework. First, the NIC delivers the task descriptor directly to the GPU without host CPU involvement, minimizing the task launch latency for the GPU. Second, the host CPU does not have to use any cycles servicing the network request and scheduling the task on the GPU. In this simple example, the target CPU waits for the GPU to complete its operation, but in more complex workloads the CPU could be free to perform meaningful computations.

While this example may seem primitive, it is actually possible to support many features with a small amount of software support from higher-level runtime libraries.


TABLE I
XTQ SIMULATION CONFIGURATION

CPU and Memory Configuration
  Type:           4-wide out-of-order
  Clock:          3 GHz, 8 cores
  I-, D-Cache:    64 KB, 2-way, 2 cycles
  L2-Cache:       2 MB, 8-way, 4 cycles
  L3-Cache:       16 MB, 16-way, 20 cycles
  DRAM:           DDR3, 8 channels, 800 MHz

GPU Configuration
  Clock:          1 GHz, 24 Compute Units
  Wavefronts:     40 (each 64 lanes), oldest job first
  D-Cache:        16 kB, 64 B line, 16-way, 4 cycles
  I-Cache:        32 kB, 64 B line, 8-way, 4 cycles
  L2-Cache:       768 kB, 64 B line, 16-way, 24 cycles

Network Configuration
  Switch Latency: 100 ns
  Link Bandwidth: 100 Gbps
  Topology:       Star

For example, the target can send the data to another node for further computation by issuing another XtqPut operation after the shared-memory signal resolves. Complex, cross-node dependencies and task graphs can be implemented using shared-memory signals and XTQ.

V. EVALUATION

XTQ provides new opportunities to re-evaluate design decisions in existing applications and to redesign communication in emerging applications. In this section, we evaluate the XTQ tasking model on microbenchmarks designed to expose latencies, primitives that are intrinsic to many MPI programs, and the CNTK deep learning framework. We illustrate that XTQ offers significant improvements over standard remote GPU dispatch.

A. Experimental Setup

To evaluate XTQ, we employ a simulation framework based on the open-source gem5 simulator [31], including the AMD public GPU compute model [32]. We have configured our hardware to resemble a potential future server-class APU system, with each APU containing a NIC, GPU, and CPU cores. None of the devices share a cache, and all memory accesses are coherent and occur in a unified virtual address space through IOMMU technology. This system does not suffer from the off-chip bandwidth issues [33] that plague current integrated CPU/GPU solutions, since we expect future high-performance chips to contain a larger number of memory channels [34] or integrated die-stacked memory. Our infrastructure simulates multi-node configurations with a simple switch and wire delay model incorporating bandwidth and latencies configured similarly to high-performance fabrics such as Intel Omni-Path [16]. The NIC model implements the Portals 4 [17] network programming specification with custom XTQ extensions implemented on top of Put operations. The GPU tasking interface is implemented using HSA task descriptors and queues. Table I shows the specific configuration for the major components of our infrastructure.

Our simulation environment models forward-looking GPU launch latencies. Once the CP has been notified of an available command packet using its doorbell, it performs a read of the command packet and then immediately schedules the kernel when a compute unit becomes available. We believe that this limit represents the conceptual overhead of launching a kernel, and that real-world launch overheads will trend towards this limit as tightly coupled frameworks continue to integrate GPUs more closely with CPUs.

In our experiments, we compare three different remote tasking interfaces that we will refer to as CPU, HSA, and XTQ. These configurations are defined as follows:

• CPU: Remote tasking is accomplished by using two-sided send/receive pairs through user-space RDMA. These results use the application thread for message progress and active message execution unless otherwise indicated. The CPU configuration represents a non-GPU-accelerated system, and is included to separate the baseline benefits of GPU acceleration from those of XTQ.

• HSA: HSA also uses two-sided send/receive pairs over RDMA, but launches the active message on the GPU after the CPU thread has received the data. The CPU communicates with the GPU through user-mode command queues. The HSA configuration represents user-space tasking on a tightly coupled architecture without XTQ.

• XTQ: XTQ uses one-sided XtqPuts through user-space RDMA to remotely enqueue active messages on the target's GPU. The NIC places AQL packets directly into the GPU's command queue for execution.

B. Latency Analysis

To precisely quantify the benefits of XTQ in our model, we have coded a microbenchmark very similar to the API code example presented in Figure 6. Figure 7 shows a time breakdown of XTQ and HSA remote task spawn for a small 64B payload and a 4KB payload. The small 64B payload over XTQ takes approximately 1.2µs to complete a remote task, while the same payload over HSA takes approximately 1.5µs. In XTQ, the data transfer phases from initiator to target take slightly longer than the HSA transfers. XTQ's usage of a 64B, HSA-like packet format slightly increases the payload size over an optimized Portals 4 implementation. However, the performance penalty incurred from transferring the command structure is dwarfed by the benefits in task launch latency. XTQ saves approximately 300ns over HSA during the task launch phase, since the NIC can directly schedule work on the GPU's command queues, while HSA requires the CPU to process an RDMA event, create the GPU task descriptor, and place it in the GPU's command queue. For the 4KB payload, we see similar savings during task enqueue, although the speedup is proportionally less due to the increase in data transfer and kernel execution time.

C. MPI Integration

The Message Passing Interface (MPI) [28] is the de facto communication library for distributed-memory HPC applications.


Fig. 7. Time breakdown of remote GPU kernel launch for 64B and 4KB payloads over XTQ and HSA, decomposed into CPU PtlPut, NIC initiator Put, network, NIC target Put, GPU launch, GPU kernel execution, and CPU completion phases.

Fig. 8. NWChem tce ozone accumulate statistics: (a) time distribution; (b) accumulate data size distribution.

Several MPI functions, such as one-sided accumulate operations and reduction collectives, have a strong computational component that can be parallelized. Using XTQ to offload this computation to the GPU can lead to improved performance and reduced CPU overheads. By freeing the CPU to do independent work, XTQ enables non-blocking variants to achieve substantial computation overlap between the primary application thread and the accelerator.

We extended the one-sided and collective frameworks of the Open MPI [35] implementation of the MPI-3.0 specification to incorporate XTQ-based accumulates and reductions. Our implementation enables library users to reap the benefits of XTQ acceleration without altering any code.

1) MPI One-Sided Accumulates: The MPI Accumulate function is a one-sided operation used to combine initiator-resident data with target-resident data through a specified operation. The target data is replaced with the result without any explicit participation of the target process.

Accumulate operations are preceded by a collective window creation operation, during which the target process makes a portion of its memory space available to other members in the window for direct updates. Subsequent accumulate calls to the target are bracketed by synchronization operations which define access epochs. An initiator process can make multiple accumulate calls to a target within an epoch. However, the accumulate operations are considered complete only after the synchronization operation that closes the epoch.
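For reference, a minimal initiator-side epoch using standard MPI one-sided calls looks like the following; the target's only involvement is the collective window creation and the fences that bracket the epoch.

    #include <mpi.h>
    #define N 1024

    /* Rank 0 accumulates its local vector into rank 1's exposed window. */
    void accumulate_example(int rank) {
        double local[N], window_buf[N];
        for (int i = 0; i < N; i++) { local[i] = rank; window_buf[i] = 0.0; }

        MPI_Win win;
        MPI_Win_create(window_buf, N * sizeof(double), sizeof(double),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                     /* open access epoch        */
        if (rank == 0)
            MPI_Accumulate(local, N, MPI_DOUBLE,   /* initiator-resident data  */
                           1, 0, N, MPI_DOUBLE,    /* target rank and offset   */
                           MPI_SUM, win);          /* combine with target data */
        MPI_Win_fence(0, win);                     /* close epoch: op complete */

        MPI_Win_free(&win);
    }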

The baseline CPU accumulate implementation for Portals 4 in Open MPI is not a true one-sided communication model: it is layered over two-sided send-receive calls. Our HSA-based implementation is similar, except that the MPI library at the target enqueues the operation on a local GPU instead of performing it directly.

XTQ-based accumulates, however, are completely one-sided, relieving the target-side MPI process from receiving data and executing the operation. On receipt of data from the initiator, the target-side NIC enqueues the appropriate command directly on the GPU's command queue. When execution completes, the GPU updates the target-side accumulate buffer with the result and an internal progress thread sends an acknowledgement to the initiator.
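The active message dispatched for such an accumulate can be as simple as an element-wise combine. The sketch below is a CPU-side rendering of that body (on the GPU the loop is executed by parallel work-items); the argument struct is invented for illustration and corresponds to the two NIC-rewritten kernel arguments described in Section III-C.

    #include <stddef.h>

    typedef struct {
        double *target_buf;   /* rewritten by the NIC: pre-registered buffer   */
        double *payload;      /* rewritten by the NIC: received initiator data */
        size_t  count;
    } xtq_accum_args_t;

    /* MPI_SUM-style accumulate: combine initiator data into the target buffer.
     * The registered completion signal is decremented when the task finishes,
     * after which a progress thread acknowledges the initiator. */
    void xtq_accumulate_sum(xtq_accum_args_t *args) {
        for (size_t i = 0; i < args->count; i++)
            args->target_buf[i] += args->payload[i];
    }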

MPI accumulate operations are used extensively in the NWChem [36] computational chemistry package through the ARMCI/MPI3 library [37]. As an example, the tce ozone workload from the NWChem regression tests spends over 58% of its time in the MPI library on a 4-node cluster. Figure 8a shows the percentage of time spent performing various MPI functions on a single rank. The chart shows that 43% of its total execution time is spent performing accumulate-related operations. Figure 8b shows a histogram of the payload size and number of accumulates that occur during tce ozone. We see a large concentration of small-to-medium sized accumulates, which are ideal for XTQ's lightweight messaging acceleration. Unfortunately, we are unable to simulate this application directly because it requires a version of MPI not supported by our simulation infrastructure. However, our expectation from the profiling data is that NWChem would realize measurable benefits from XTQ-based GPU acceleration.

2) MPI Reduce and Allreduce Collectives: MPI Reduce and MPI Allreduce are collective operations that use a binary operation (e.g., SUM, PROD, or MAX) to combine the corresponding elements in the input buffer of each participating process [28]. For reduce, the combined results are stored in a result buffer in the root process's address space. For allreduce, the combined results are stored in the result buffers of all participating processes.

Our implementation of both reductions using XTQ is built on the LibNBC library [38, 39]. LibNBC was designed to support non-blocking collectives on generic architectures. In doing so, it creates a schedule: a directive to execute a set of operations. We modify the schedule to layer reductions on XtqPuts instead of two-sided send-receive operations. An inherent problem with implementing reductions with active messages is that the target-side resources must be allocated before an active message can be received at the target.


Fig. 9. Acceleration of MPI accumulate, reduce, and allreduce operations over XTQ: (a) XTQ accumulate acceleration on 2 nodes; (b) XTQ reduce acceleration on 2 nodes; (c) XTQ 4MB allreduce acceleration.

To address this problem, XTQ encodes an explicit synchronization step into the schedule by issuing zero-length send/receive pairs between initiators and targets. This step guarantees that the target resources are available before any XtqPuts are issued.

3) MPI Benchmarks: Figure 9 shows performance results for accumulate, reduce, and allreduce implemented on CPU, HSA, and XTQ as defined in Section V-A. For sufficiently large benchmarks, GPUs perform much better than CPUs for these data-parallel operations. For a two-node accumulate operation (Figure 9a), XTQ performs significantly better than HSA. Interestingly, XTQ also performs better than CPU even for very small accumulate sizes. We believe this result occurs because CPU implements one-sided accumulates using two-sided send/recvs, while XTQ leverages one-sided puts directly through Portals 4. XTQ offers a reasonable performance improvement of around 10-15% for a two-node reduction (Figure 9b) over a standard HSA-enabled GPU. For both operations, the benefits of XTQ over HSA decrease as the payload increases beyond approximately 64KB. All of these algorithms, however, are amenable to software pipelining, which will push the payload size back into a range where XTQ shows significant benefits, even for very large transfers.

Figure 9c illustrates how XTQ performs on allreduce when strong scaling up to 64 nodes for a fixed-size global data set of 4MB. Unlike accumulates and reductions, allreduce requires the result to be transmitted to all nodes participating in the collective operation. We implement allreduce as a reduce-scatter followed by an allgather, which is an efficient implementation for vector allreduce operations [40]. With a fixed-size data set, increasing the number of nodes decreases the computation per node while increasing the total number of messages required to complete the reduction. The inflection point at approximately 12 nodes indicates the point where the overhead of sending more messages outweighs the benefits of less computation.

The figure illustrates that for small node counts, the size of each round's messages is large enough to benefit from GPU acceleration with or without XTQ. However, for larger node counts, non-XTQ-enabled GPUs are unable to amortize the high launch latency over the execution of smaller messages. Only XTQ-enabled GPUs are able to maintain performance improvements over a CPU reduction up to 64 nodes.

Fig. 10. XTQ performance on CNTK workloads (AlexNet, AN4 LST, CIFAR, Large Synth, MNIST Conv, MNIST Hidden) across 8 GPU-enabled nodes.

D. Machine Learning Results

In this section, we evaluate XTQ on a distributed deep-learning framework. Our vehicle of exploration for this case study is the Computational Network Toolkit (CNTK) [23]. CNTK provides a general programming framework that allows researchers to express arbitrarily complex deep-learning computations and deploy them across multiple nodes.

Training deep-learning networks is a very compute- and network-intensive calculation that is difficult to parallelize efficiently. For our experiments, we configure CNTK to perform data-parallel training, where the model is replicated across all nodes. We select six sample applications from the CNTK distribution that represent a broad range of data sizes and communication patterns. Each node computes gradients using Stochastic Gradient Descent (SGD) for a subset of the training data and aggregates the gradients across the cluster using an allreduce communication pattern. This cycle of computation using SGD and global gradient allreduce is repeated until the model has been trained to a user-defined metric. The frequency with which gradients are aggregated between nodes is a tradeoff between convergence time and the time to complete a round of training.

For our analysis, we collected profiling data on the Stampede [3] supercomputer, where GPUs were used for local SGD, InfiniBand™ NICs were used for communication, and the CPUs were used for the computation in the allreduce phase. By combining these real-world runs with detailed simulation of allreduce operations obtained from gem5, we are able to project the performance improvement of accelerating the allreduce function with XTQ.

To perform this projection, we profile our Stampede runs to extract the data size and duration of allreduce calls from CNTK applications. We simulate allreduce calls of the same sizes in gem5 to estimate the performance improvement attainable by HSA and XTQ. We then project the communication performance improvement onto the time taken to perform the allreduce in real hardware without XTQ. Since CNTK uses blocking allreduce calls, we do not need to worry about any communication and computation overlap.
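A sketch of the projection arithmetic, assuming per-call allreduce durations measured on Stampede and per-size speedups obtained from gem5 (the variable names are ours):

    /* Replace each measured allreduce with its simulated (HSA- or XTQ-
     * accelerated) duration. Valid because CNTK's allreduce calls are
     * blocking, so there is no communication/computation overlap to model. */
    double project_runtime(double measured_total,
                           const double *allreduce_time,    /* measured, per call  */
                           const double *simulated_speedup, /* gem5, per call size */
                           int ncalls) {
        double projected = measured_total;
        for (int i = 0; i < ncalls; i++)
            projected -= allreduce_time[i] * (1.0 - 1.0 / simulated_speedup[i]);
        return projected;
    }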

Figure 10 shows the results for CNTK across 8 high-performance compute nodes. Utilizing GPUs for allreduce compute provides a 23% average improvement in runtime over a baseline CPU version. XTQ-enabled acceleration gives on average an additional 8% improvement over the HSA baseline, with AN4 LST improving by 15%. CIFAR does not significantly benefit from either form of GPU allreduce acceleration, since it is more bound by the local SGD compute phase than by the gradient reduction.

VI. RELATED WORK

The seminal Active Messages work [20] embeds computation in network messages by directly invoking a user message handler on the target. Willcock et al. [41] developed AM++, which extends AM with type-safe object-oriented and generic programming, enabling optimizations such as message coalescing. More recently, Besta and Hoefler [42] modify the IOMMU to invoke active-message-like handlers as a side effect of RDMA put and get operations. Our work uses HSA's architected task queues to extend active messages to non-CPU accelerators. The use of explicit user-allocated task queues rather than direct handler invocation also provides a more flexible buffering model than the original AM.

Several machines proposed in the '90s coupled lightweight tasks with explicit message passing, including the J-Machine [43], M-Machine [44], and Star-T Voyager [45], or some combination of message passing and shared memory, such as MIT Alewife [46], Typhoon [47], and FLASH [48]. Our work applies some of these concepts to emerging commodity hardware with tightly coupled accelerators and RDMA NICs.

XTQ's rewrite mechanism is similar to the Cray T3E's global address translation scheme [49], in that global references are translated to local addresses at the target to decouple local resource management from global resource identifiers. The T3E also supported remote enqueuing to memory-based user-level message queues, similar to how XTQ enqueues to HSA task queues. The key differences are, first, that HSA queue entries are architecturally defined task descriptors, so XTQ can provide additional semantics (such as argument rewriting) that enable direct enqueuing of accelerator tasks without target-side CPU intervention; and second, that the HSA doorbell mechanism provides a notification mechanism other than the T3E's options of interrupts and polling. Note that XTQ is independent of the global addressing mechanism used for conventional RDMA, which in our example system's case is provided by Portals 4.

XTQ shares some similarities with the scale-out NUMA-inspired designs which optimize RDMA for NUMA memory fabrics [26] and NIC layouts for tiled, scale-out architectures [50]. These works are primarily concerned with efficient data movement, while ours is concerned with efficient task invocation for heterogeneous compute devices.

Oden et al. propose Global GPU Address Spaces (GGAS) [51] and evaluate its performance on GPU-accelerated reductions [52]. While XTQ also exploits GPUs for reductions, there are a number of significant differences between GGAS and XTQ. GGAS focuses on data transfer, creating a hardware-supported global address space that spans all GPU device memory and supports direct remote memory accesses from the GPUs. XTQ builds on a more traditional RDMA model with explicit put and get operations, but adds remote task invocation to the data-transfer model.

A number of commercial and open-source offerings provide the ability to utilize GPUs in a distributed environment. NVIDIA's GPUDirect [21] enables direct high-performance RDMA between InfiniBand™ NICs and the local memory of discrete GPUs. However, task launch on the discrete GPU still must be initiated by the CPU using the CUDA runtime and GPU kernel driver. Our work focuses on HSA-like systems with unified physical memory and tightly coupled accelerators. As such, we are able to completely bypass the CPU for both data transfer and kernel launch. Another platform for remote GPU task invocation is rCUDA [53], which allows CPUs without locally attached GPUs to take advantage of GPU resources on another node. However, rCUDA is designed primarily to virtualize a cluster's GPU resources for coarse-grain task offload, while XTQ is intended to enable fine-grain interaction of low-latency communication and accelerator-based computation. Finally, GPUnet [54] allows GPUs to source and sink network messages directly through a sockets interface with coordination from a CPU runtime library.

Some NICs provide built-in offload capabilities for collective operations such as reductions, including Quadrics' programmable engines [55], Cray's Aries collective engine [56], and Mellanox's CORE-Direct technologies [57]. XTQ provides a more general solution by making a broader range of existing accelerated compute engines easily accessible to the NIC. However, for small computations, NIC-based offload will further reduce overheads by avoiding all inter-device interactions. A NIC with built-in offload functionality could complement other accelerators in an XTQ/HSA environment.

VII. CONCLUSION

Emerging node-level architectures tightly couple accelerators into host platforms. These frameworks significantly reduce task launch latency via user-level task queues and eliminate data copy overhead via shared virtual memory, paving the path for fine-grained, heterogeneous tasking models in shared-memory environments. Concurrently, RDMA enables highly efficient user-level network data transfers. This paper proposes Extended Task Queuing (XTQ), a mechanism that combines tightly coupled, user-level task queuing with RDMA to provide heterogeneous, lightweight tasking across distributed-memory systems. XTQ is implemented as an extension to a tightly integrated, RDMA-capable NIC and enables applications to schedule tasks on accelerators and CPUs across nodes. Using XTQ, applications can send messages to remote accelerators, bypassing the operating system on both nodes without involving the target CPU. Bypassing the target-side CPU reduces task launch latency by 10-15% for small-to-medium sized messages and frees a CPU thread from message processing to perform more useful computation.

Because XTQ lies at the intersection of several emerging paradigms, such as accelerator-based programming and lightweight distributed tasking, few existing applications directly leverage its benefits. We believe that XTQ opens up new possibilities for applications to leverage accelerators for tightly coupled communication and computation in distributed systems. Towards this end, we demonstrate that XTQ enables the use of GPUs to accelerate MPI reduction, allreduce, and accumulate operations across a variety of payload sizes and on clusters of up to 64 nodes. Finally, we illustrate how XTQ can provide up to 15% performance improvement for emerging deep learning workloads in the Computational Network Toolkit. We are excited to see how existing applications will adapt to the unique benefits of XTQ, and how new applications can be designed to leverage these features.

ACKNOWLEDGMENTS

AMD, the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. ARM is the registered trademark of ARM Limited in the EU and other countries. HyperTransport is a licensed trademark of the HyperTransport Technology Consortium. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies. Results were obtained in part using resources from the Texas Advanced Computing Center. This research was supported by the US Department of Energy under the DesignForward and DesignForward 2 programs, and by National Science Foundation grant CCF-1337393. Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do not necessarily reflect the views of the DoE or NSF.

REFERENCES

[1] A. Putnam, A. Caulfield, E. Chung, D. Chiou, K. Constantinides, J. Demme, H. Esmaeilzadeh, J. Fowers, G. Gopal, J. Gray, M. Haselman, S. Hauck, S. Heil, A. Hormati, J.-Y. Kim, S. Lanka, J. Larus, E. Peterson, S. Pope, A. Smith, J. Thong, P. Xiao, and D. Burger, “A reconfigurable fabric for accelerating large-scale datacenter services,” in Int. Symp. on Computer Architecture (ISCA), 2014, pp. 13–24.

[2] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi, “A scalable processing-in-memory accelerator for parallel graph processing,” in Int. Symp. on Computer Architecture (ISCA), 2015, pp. 105–117.

[3] TACC, “Stampede supercomputer user guide,” https://portal.tacc.utexas.edu/user-guides/stampede, 2015.

[4] TOP500.org, “Highlights - November 2015,” http://www.top500.org/lists/2015/11/highlights, 2015.

[5] Intel, “Intel Xeon Phi product family,” www.intel.com/XeonPhi, 2015.

[6] Amazon, “Amazon EC2 cloud computing,” https://aws.amazon.com/ec2/, 2015.

[7] Nvidia, “CUDA toolkit 6.0,” https://developer.nvidia.com/cuda-toolkit-60, 2014.

[8] HSA Foundation, “HSA platform system architecture specification 1.0,” http://www.hsafoundation.com/standards/, 2015.

[9] Intel, “The compute architecture of Intel processor graphics gen9,” https://software.intel.com/sites/default/files/managed/c5/9a/The-Compute-Architecture-of-Intel-Processor-Graphics-Gen9-v1d0.pdf, 2015.

[10] B. Wile, “Coherent accelerator processor interface (CAPI) for POWER8 systems,” http://www-304.ibm.com/webapp/set2/sas/f/capi/CAPI POWER8.pdf, IBM, Tech. Rep., 2014.

[11] AMD, “AMD A-Series desktop APUs,” http://www.amd.com/en-us/products/processors/desktop/a-series-apu, 2015.

[12] Nvidia. (2012) HyperQ sample. http://docs.nvidia.com/cuda/samples/6Advanced/simpleHyperQ/doc/HyperQ.pdf.

[13] D. Lustig and M. Martonosi, “Reducing GPU offload latency via fine-grained CPU-GPU synchronization,” in Int. Symp. on High Performance Computer Architecture (HPCA), 2013, pp. 354–365.

[14] J. Wang, N. Rubin, A. Sidelnik, and S. Yalamanchili, “Dynamic thread block launch: A lightweight execution mechanism to support irregular applications on GPUs,” in Int. Symp. on Computer Architecture (ISCA), 2015, pp. 528–540.

[15] InfiniBand Trade Association. (2000) InfiniBand architecture specification: Release 1.0.2. http://www.infinibandta.org/content/pages.php?pg=technology download.

[16] Intel. (2015) Omni-Path fabric 100 series. https://fabricbuilders.intel.com/.

[17] Sandia National Laboratories, “The Portals 4.0.2 network programming interface,” http://www.cs.sandia.gov/Portals/portals402.pdf, 2014.

[18] InfiniBand Trade Association, “RDMA over converged Ethernet v2,” https://cw.infinibandta.org/document/dl/7781, 2014.

[19] Intel, “Internet wide area RDMA protocol (iWARP),” http://www.intel.com/content/dam/doc/technology-brief/iwarp-brief.pdf, 2010.

[20] T. von Eicken, D. Culler, S. Goldstein, and K. Schauser, “Active messages: A mechanism for integrated communication and computation,” in Int. Symp. on Computer Architecture (ISCA), 1992, pp. 256–266.

[21] Mellanox, “Mellanox GPUDirect RDMA user manual,” http://www.mellanox.com/related-docs/prod software/MellanoxGPUDirect User Manual v1.2.pdf, 2015.

[22] T. Hoefler and A. Lumsdaine, “Message progression in parallel computing - to thread or not to thread?” in Int. Conf. on Cluster Computing (Cluster), 2008, pp. 213–222.

[23] A. Agarwal, E. Akchurin, C. Basoglu, G. Chen, S. Cyphers, J. Droppo, A. Eversole, B. Guenter, M. Hillebrand, T. R. Hoens, X. Huang, Z. Huang, V. Ivanov, A. Kamenev, P. Kranen, O. Kuchaiev, W. Manousek, A. May, B. Mitra, O. Nano, G. Navarro, A. Orlov, H. Parthasarathi, B. Peng, M. Radmilac, A. Reznichenko, F. Seide, M. L. Seltzer, M. Slaney, A. Stolcke, H. Wang, Y. Wang, K. Yao, D. Yu, Y. Zhang, and G. Zweig, “An introduction to computational networks and the Computational Network Toolkit,” Microsoft, Technical Report, 2014.

[24] AMD, “AMD I/O virtualization technology (IOMMU) specification,” http://support.amd.com/TechDocs/48882 IOMMU.pdf, 2015.

[25] M. Ben-Yehuda, J. Mason, L. Van Doorn, and E. Wahlig, “Utilizing IOMMUs for virtualization in Linux and Xen,” in Proc. of the Linux Symp., 2006.

[26] S. Novakovic, A. Daglis, E. Bugnion, B. Falsafi, and B. Grot, “Scale-out NUMA,” in Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2014, pp. 3–18.

[27] B. Towles, J. Grossman, B. Greskamp, and D. Shaw, “Unifying on-chip and inter-node switching within the Anton 2 network,” in Int. Symp. on Computer Architecture (ISCA), 2014, pp. 1–12.

[28] MPI Forum, “MPI: A message-passing interface standard. ver. 3,” www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf, 2012.

[29] D. Bonachea, “GASNet specification, v1.1,” http://gasnet.lbl.gov/CSD-02-1207.pdf, 2002.

[30] D. Bonachea, P. H. Hargrove, M. Welcome, and K. Yelick, “Porting GASNet to Portals: Partitioned global address space (PGAS) language support for the Cray XT,” in Cray User Group (CUG), 2009.

[31] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, R. Sen, K. Sewell, M. Shoaib, N. Vaish, M. D. Hill, and D. A. Wood, “The gem5 simulator,” SIGARCH Comput. Archit. News, pp. 1–7, 2011.

[32] AMD. (2015) The AMD gem5 APU simulator: Modeling heterogeneous systems in gem5. http://gem5.org/GPU Models.

[33] M. Daga, A. Aji, and W.-C. Feng, “On the efficacy of a fused CPU+GPU processor (or APU) for parallel computing,” in Symp. on Application Accelerators in High-Performance Computing (SAAHPC), 2011, pp. 141–149.


[34] M. Schulte, M. Ignatowski, G. Loh, B. Beckmann, W. Brantley, S. Gurumurthi, N. Jayasena, I. Paul, S. Reinhardt, and G. Rodgers, “Achieving exascale capabilities through heterogeneous computing,” Micro, IEEE, vol. 35, no. 4, pp. 26–36, 2015.

[35] E. Gabriel, G. E. Fagg, G. Bosilca, T. Angskun, J. J. Dongarra, J. M. Squyres, V. Sahay, P. Kambadur, B. Barrett, A. Lumsdaine et al., “Open MPI: Goals, concept, and design of a next generation MPI implementation,” in Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2004, pp. 97–104.

[36] M. Valiev, E. J. Bylaska, N. Govind, K. Kowalski, T. P. Straatsma, H. J. Van Dam, D. Wang, J. Nieplocha, E. Apra, T. L. Windus et al., “NWChem: a comprehensive and scalable open-source solution for large scale molecular simulations,” Computer Physics Communications, vol. 181, no. 9, pp. 1477–1489, 2010.

[37] J. Hammond, “ARMCI-MPI - MPICH wiki,” https://wiki.mpich.org/armci-mpi/index.php/Main Page, 2015.

[38] T. Hoefler and A. Lumsdaine, “Design, implementation, and usage of LibNBC,” Open Systems Laboratory, Technical Report, 2006.

[39] T. Hoefler, A. Lumsdaine, and W. Rehm, “Implementation and performance analysis of non-blocking collective operations for MPI,” in Int. Conf. on High Performance Computing, Networking, Storage and Analysis (SC), 2007, pp. 52:1–52:10.

[40] R. Thakur, R. Rabenseifner, and W. Gropp, “Optimization of collective communication operations in MPICH,” International Journal of High Performance Computing Applications, vol. 19, no. 1, pp. 49–66, 2005.

[41] J. J. Willcock, T. Hoefler, N. G. Edmonds, and A. Lumsdaine, “AM++: A generalized active message framework,” in Int. Conf. on Parallel Architectures and Compilation Techniques (PACT), 2010, pp. 401–410.

[42] M. Besta and T. Hoefler, “Active Access: A mechanism for high-performance distributed data-centric computations,” in Int. Conf. on Supercomputing (ICS), 2015, pp. 155–164.

[43] M. Fillo, S. W. Keckler, W. J. Dally, N. P. Carter, A. Chang, Y. Gurevich, and W. S. Lee, “The M-Machine multicomputer,” in Int. Symp. on Microarchitecture (MICRO), 1995, pp. 146–156.

[44] M. Noakes, D. Wallach, and W. Dally, “The J-Machine multicomputer: An architectural evaluation,” in Int. Symp. on Computer Architecture (ISCA), May 1993, pp. 224–235.

[45] B. S. Ang, D. Chiou, D. L. Rosenband, M. Ehrlich, L. Rudolph, and Arvind, “StarT-Voyager: A flexible platform for exploring scalable SMP issues,” in Int. Conf. for High Performance Computing, Networking, Storage and Analysis (SC), 1998, pp. 1–13.

[46] A. Agarwal, R. Bianchini, D. Chaiken, K. Johnson, D. Kranz, J. Kubiatowicz, B. Lim, K. Mackenzie, and D. Yeung, “The MIT Alewife machine: architecture and performance,” in Int. Symp. on Computer Architecture (ISCA), 1995, pp. 2–13.

[47] S. Reinhardt, J. Larus, and D. Wood, “Tempest and Typhoon: user-level shared memory,” in Int. Symp. on Computer Architecture (ISCA), 1994, pp. 325–336.

[48] J. Heinlein, K. Gharachorloo, S. Dresser, and A. Gupta, “Integration of message passing and shared memory in the Stanford FLASH multiprocessor,” in Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1994, pp. 38–50.

[49] S. L. Scott, “Synchronization and communication in the T3E multiprocessor,” in Int. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 1996, pp. 26–36.

[50] A. Daglis, S. Novakovic, E. Bugnion, B. Falsafi, and B. Grot, “Manycore network interfaces for in-memory rack-scale computing,” in Int. Symp. on Computer Architecture (ISCA), 2015, pp. 567–579.

[51] L. Oden and H. Froning, “GGAS: Global GPU address spaces for efficient communication in heterogeneous clusters,” in Int. Conf. on Cluster Computing (CLUSTER), 2013, pp. 1–8.

[52] L. Oden, B. Klenk, and H. Froning, “Energy-efficient collective reduce and allreduce operations on distributed GPUs,” in Int. Symp. on Cluster, Cloud and Grid Computing (CCGrid), 2014, pp. 483–492.

[53] J. Duato, A. Pena, F. Silla, R. Mayo, and E. Quintana-Orti, “rCUDA: Reducing the number of GPU-based accelerators in high performance clusters,” in Int. Conf. on High Performance Computing and Simulation (HPCS), 2010, pp. 224–231.

[54] S. Kim, S. Huh, X. Zhang, Y. Hu, A. Wated, E. Witchel, and M. Silberstein, “GPUnet: Networking abstractions for GPU programs,” in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2014, pp. 201–216.

[55] D. Roweth and A. Pittman, “Optimised global reduction on QsNetII,” in Symp. on High Performance Interconnects, 2005, pp. 23–28.

[56] B. Alverson, E. Froese, L. Kaplan, and D. Roweth, “Cray XC series network,” http://www.cray.com/sites/default/files/resources/CrayXC30Networking.pdf, 2015.

[57] R. Graham, S. Poole, P. Shamis, G. Bloch, G. Bloch, H. Chapman, M. Kagan, A. Shahar, I. Rabinovitz, and G. Shainer, “ConnectX-2 InfiniBand management queues: First investigation of the new support for network offloaded collective operations,” in Int. Conf. on Cluster, Cloud and Grid Computing (CCGrid), 2010, pp. 53–62.