RESEARCH EXAMINATION - UNIVERSITY OF CALIFORNIA, SAN DIEGO. FALL 2016

Analysis of Parallel Programming Models for Exascale Computing

Sergio Martin

Abstract—High-performance computing has become an essential part of progress on scientifically and technologically important problems. However, as we approach the age of exascale computing, developers of large-scale scientific applications will have to face new challenges. In this study, we focus our attention on reducing the costs of communication in supercomputers and analyze the parallel programming models and tools aimed at helping programmers address this challenge.

I. INTRODUCTION

Supercomputers are used in a wide range of fields such as weather forecasting [54], medicine [59], computer-aided design [25], military applications [57], and simulation of natural disasters [78]. As applications become larger and more sophisticated, so does their need for increasingly powerful supercomputers. Fortunately, continuous improvement in both computer architecture and interconnects has led to exponential growth in performance.

Fig. 1: Evolution of the top 500 supercomputer performances since 1995. Source: [41]

Fig. 1 shows how the performance of the top 500 supercomputers has increased since 1995, growing by an order of magnitude in performance roughly every decade, closely following Moore's law [45]. Currently, the most powerful computer, Sunway TaihuLight, located at the National Supercomputing Center in Wuxi, China, delivers a staggering peak performance of 125 petaflops (10^15 floating point operations per second). The coming milestone – the exaflop (10^18 floating point operations per second) supercomputer – is on the horizon.

Exascale computing poses new challenges for scientific application programmers [6]. We focus our attention on the challenge of reducing the costs of communication. Addressing this challenge will be essential for exascale performance and involves two aspects: the cost of internal communication and the cost of network communication.

We refer to internal communication as all data transferred through the main memory or cache structures of an individual computer. An inefficient use of these structures can reduce memory bandwidth and have a significant impact on performance. The cost of internal communication plays an important role in the performance of individual computers, and thus in the performance of supercomputers as a whole.

We refer to network communication as all data transferred among nodes through a supercomputer's interconnect. As interconnects become larger and more complex, the average routing latency between any two computers in exascale systems is expected to increase well beyond that of current petascale systems. It will therefore be necessary for programmers to implement mechanisms to reduce or hide this cost as much as possible in order to realize exascale performance.

Although optimizations to ameliorate internal and network communication costs can be introduced manually by a programmer, this could entail an excessive amount of effort. In this study, we survey the parallel programming models and tools (language extensions, translators, and compilers) that have been proposed for providing these optimizations with minimum effort. We analyze the mechanisms implemented by each model and how they contribute to realizing efficient applications for exascale computing.

The rest of this paper is organized as follows: in §II, we provide an overview of the current state of supercomputing. In §III, we explain the costs of communication in supercomputers. In §IV, we analyze the Message Passing model and its limitations. In §V, we present alternative models based on the Message Passing model. In §VI, we introduce the Partitioned Global Address Space model. In §VII, we introduce the Asynchronous PGAS model. In §VIII, we introduce dataflow models. In §IX, we discuss how these models will help realize exascale computing. Finally, in §X, we present our conclusions.

II. SUPERCOMPUTING OVERVIEW

Until about a decade ago, increasing computational performance depended almost exclusively on single-core processors delivering higher clock speeds. However, after the Intel Pentium 4 Tejas processor was canceled in 2004 [20], it became evident that increasing the power of single processors to meet the growing demands was no longer possible; we had hit the infamous power wall [65], the point at which heat output and energy consumption made it impossible to keep increasing the clock frequency of CPUs. This limitation has driven a paradigm shift towards parallel processor architectures, including multicore and many-core processors, which has realized a huge improvement in computational performance.

Today's largest petaflop supercomputer operates with O(10^7) cores, and it is estimated that exaflop supercomputers will require O(10^8) cores [4]. In this section we describe (a) the processor topology and memory hierarchy of computing nodes, the building blocks of supercomputers; (b) many-core devices; (c) the interconnect topologies used for communicating data between nodes; and (d) the scientific motifs commonly used in large-scale computing.

Fig. 2: Processor topology and memory hierarchy of an Edison compute node.


A. Node Architecture

Modern supercomputers are built using a large set of nodes connected through a high-speed interconnect. A node is the minimal set of physical components capable of functioning and performing computation. A node also represents the minimal addressable unit in the network, with its own IP address and hostname. Nodes in a supercomputer are typically server-grade computers that contain: (a) one or more multicore and/or manycore processors, (b) a main memory (RAM) system, and (c) a network interconnect.

Fig. 2 represents the processor topology and memory hierarchy of a compute node in the Edison [48] supercomputer located at the National Energy Research Scientific Computing Center. We use Edison as a representative example for our explanation in this section.

Edison's nodes contain two 12-core Intel "Ivy Bridge" processors each, identified as sockets P#0 and P#1. A socket is the physical placeholder for a processor chip on the node's motherboard that provides connectivity to other components of the node (e.g. RAM, PCI-e channels, network device).

Each core in Edison is able to run two threads simultaneously. A thread represents a stream of instructions to be executed, plus its execution status (i.e. registers, stack pointers, program counter). The ability of processor cores to execute two or more threads simultaneously is called simultaneous multi-threading [69]. Edison defines the logical placeholders for the execution of threads inside each core as Processing Units (PU), numbered from 0 to 47, two per core. The operating system uses the PU number to assign the mapping of threads to PUs.

Each core contains its own L1 (instruction+data) and L2 caches, while a common L3 cache is shared between cores of the same socket. All cores have access to the entire main system memory space. However, given physical constraints, processor cores have different access times to/from different segments of main memory, called Non-Uniform Memory Access (NUMA) domains.

Cores allocated in the same NUMA domain are guaranteed to have the same access time (assuming no contention). However, cores accessing data residing in a different NUMA domain will suffer from performance degradation. In the case of Edison nodes, main memory (64 GB) is divided into 2 NUMA domains, each with 32 GB of memory, and all the cores from a socket belong to the same NUMA domain.

B. Many-Core Devices

One of the reasons why the peak performance of top supercomputers has consistently increased in recent years is the addition of massively parallel (many-core) processors to their computing nodes. These devices base their potential on implementing processor chips with a large number of cores, many more (thousands in GPUs) than conventional multi-core processors (dozens). In order to fit such numbers of cores on the processor die, many-core devices have simpler core designs that span less area than conventional cores.

Many-core devices are especially targeted at compute-intensive, highly parallel algorithms; that is, algorithms that can divide the problem domain into many fine-grained partitions that can be computed in parallel with minimal additional synchronization or communication overheads. Algorithms requiring that all cores execute the same instruction over different sets of data represent ideal cases.

Although many-core processors are key to increasing the peak performance of supercomputers, harnessing their power still represents a challenge for the following reasons: (i) They represent an additional layer of complexity for programmers. (ii) Communication between the host and the device requires an extra copy of the data since they cannot access each other's memory space. (iii) The increase in computational power puts additional pressure on main and device memories. While we do not address this challenge in our study, it could provide the basis for an extensive analysis.

C. Interconnect Design

Supercomputers comprise a huge number of interconnected computing nodes. For this reason, network and systems architects need to implement efficient network topologies; that is, find ways to organize nodes, routers, and cabling in order to minimize the cost of communication while keeping power and monetary budgets constrained.

Folded Clos (fat-tree) [40] network topologies, proven to be efficient at smaller scales, would incur a prohibitive cost for exascale supercomputers because of the number of routers and the cabling complexity required. Such a network would dominate the costs of a supercomputer, in both budget and energy [37].

Fig. 3: Comparison of cost per node vs network size of different topologies. Source: [39]

Several highly scalable topologies for petascale and exascale supercomputers have been proposed to reduce the complexity of the interconnect. The Flattened Butterfly [38] and Dragonfly [39] topologies have proven to be less costly per node as network size grows (see Fig. 3). The use of optical cables in global channels allows the Flattened Butterfly and Dragonfly topologies to achieve similar bandwidth and latency to folded Clos networks, with greatly reduced router and cabling costs.

Edison uses a dragonfly topology, organized in a four-rank hierarchy [49]: Rank 0 consists of 4 nodes allocated inside one blade; communication is routed through a high-speed custom-designed integrated circuit (ASIC) that serves as the main gateway for all 4 nodes. Rank 1 consists of 16 blades allocated inside one chassis; communication is routed through high-bandwidth wires across a circuit board that connects all blades' ASICs. Rank 2 consists of 3 chassis allocated inside one cabinet; communication is managed by a blade router connected by copper cables. Rank 3 represents the entire supercomputer (5576 nodes in total) and uses high-bandwidth optical cables and routers.

The dragonfly topology in Edison guarantees that data packets sent from node to node perform a maximum of 4 hops. A hop represents each transmission of a data packet between intermediate routers and end-nodes. The average number of hops in the execution of a distributed application is an important factor in our analysis of the cost of communication in the next section.

D. Classification of Scientific Algorithms

The main use for modern supercomputers is in simulating physical phenomena. Although there exists a broad range of scientific applications, researchers from the University of California, Berkeley and the Lawrence Berkeley National Laboratory identified 7 common computation motifs, known as Phil Colella's 7 dwarfs, that can be used to classify them [15, 5]¹. Each one of these motifs has different patterns for communication, computation, and storage, as described below:

(i) Structured Grids. These algorithms partition the problem domain (a multidimensional space) into a grid of discrete elements. The elements of the grid are distributed as a mesh of rectangles. The solution is approached by iteratively applying a mesh sweep over every element, where computation and communication patterns are regular. That is, all elements have the same number of neighbors that they communicate with, and demand the same amount of computation. A special kind of structured grid, called Adaptive Mesh Refinement, shown in Fig. 4 (left), allows a finer division of the grid in areas of particular interest. The complexity of these algorithms is typically O(I * N^d), where I is the number of solver iterations, N is the number of elements in a dimension (assuming a square grid), and d is the number of grid dimensions.

¹More motifs have been added since that first definition. However, the 7 original motifs still represent the vast majority of high-performance applications.

Fig. 4: Illustration of (left) structured, and (right) unstructured grids. Sources: [43, 17]


(ii) Unstructured Grids. This is a variation of (i) where the grid is partitioned based on the physical elements being modeled. The density and shape of grid elements can be tailored to fit the shapes of physical objects (e.g. the 3D mesh of the surface of an airplane wing), as shown in Fig. 4 (right).

(iii) Dense Linear Algebra. These represent operations between matrices and vectors, where all the elements of such data sets are involved in the calculation. The complexity of these algorithms varies from O(N) for vector-scalar operations to O(N^3) for matrix-matrix operations, where N is the number of elements in a dimension (assuming square matrices).

(iv) Sparse Linear Algebra. This is a variation of (iii) in which algorithms are optimized for matrices/vectors filled with mostly zero values. Since the presence of zeroes reduces the amount of computation required, these algorithms implement compression techniques to reduce the memory footprint and keep track of the distribution and location of zeroes.

(v) Particle / N-Body Methods. These algorithms simulate the interaction between discrete points. In general, the properties (position/speed) of every point are calculated as a function of the properties (position/mass/charge) of every other point. For this reason, these algorithms have O(N^2) complexity, where N is the number of points. Variations of this motif, however, exploit the spatial locality of points to reduce the complexity of these algorithms to O(N log N) [24].

(vi) Spectral Methods. These represent numerical algorithms that transform data between the frequency domain and time/spatial domains. The typical example of these methods is the Fast Fourier Transform and its inverse, with O(N log N) complexity, where N is the number of elements to transform.

(vii) Monte Carlo Methods. These algorithms compute statistical results from random trials. Monte Carlo algorithms scale well with the number of processors because random trials are independent from each other and can be executed in parallel with negligible communication.

Fig. 5: Evolution of the performance gap between single-core CPUs and main memory, in terms of relative speedup. Source: [26]

III. COST OF COMMUNICATION IN EXASCALE

Due to the unprecedented number of processing elements involved and the complexity of memory hierarchies, achieving exaflop performance will demand extraordinary efforts from system designers, interconnect architects, and programmers alike. In this section we analyze the sources of communication overhead that programmers will need to address to realize exascale performance.

A. Internal Communication

We refer to internal communication as data transfers between the cores of an individual compute node through its main memory or cache structures. In this section we analyze the sources of internal communication that affect the performance of large-scale applications.

While nodes keep incorporating more processors and cores, the performance of main memory technology fails to scale accordingly. This is known as the memory wall, and represents a limiting factor for many scientific applications. The memory wall is caused by the growing performance gap between processors and main memory, as illustrated in Fig. 5. The performance of an application can be severely reduced by limitations in memory bandwidth and latency. This problem is exacerbated by multicore processors, since each additional core puts additional pressure on the main memory. Current petaflop systems are particularly affected by this problem, and it is expected that this will be a performance bottleneck in exascale systems as well.

The roofline performance model [72] is an intuitive way to visualize the maximum attainable performance of an application on a specific memory/CPU architecture. This model shows how memory bandwidth can cause an application to perform below the peak performance of the CPU.

A roofline diagram, shown in Fig. 6, uses the arithmetic intensity of an application (measured in flops/byte) as input. This value indicates how many floating point operations are executed per byte transmitted from/to memory. The point at which arithmetic intensity meets the roofline function – delimited by the maximum memory bandwidth and CPU peak performance – indicates the application's attainable performance (measured in flop/s).

The example estimates the performance of two applications. Application A has a higher arithmetic intensity (right side of the diagram) and thus is only limited by the peak CPU Gflop/s. On the other hand, application B has a low arithmetic intensity (left side of the diagram) and requires a higher transfer rate from/to memory. In this case, performance is limited by memory bandwidth.

Fig. 6: Example roofline diagram comparing two applications: Application B, limited by memory bandwidth, and Application A, limited by the CPU. Adapted from: [18]


The significance of the roofline model is that it gives an upper bound on how much an application can be optimized by reducing the number of accesses to main memory per arithmetic operation. To achieve high performance, programmers need to implement mechanisms to avoid data motion through main memory as much as possible.
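The roofline bound itself is simple to compute. Below is a minimal sketch in C; the peak-performance and bandwidth figures are illustrative placeholders, not Edison's actual specifications:

  #include <stdio.h>

  /* Roofline bound: attainable performance is limited either by the CPU's
     peak floating point rate or by how fast memory can feed it. */
  double roofline_gflops(double arithmetic_intensity,  /* flop/byte */
                         double peak_gflops, double mem_bw_gbs) {
      double memory_bound = arithmetic_intensity * mem_bw_gbs;  /* Gflop/s */
      return (memory_bound < peak_gflops) ? memory_bound : peak_gflops;
  }

  int main(void) {
      double peak = 460.0;  /* hypothetical node peak, Gflop/s */
      double bw   = 90.0;   /* hypothetical memory bandwidth, GB/s */
      /* Application B: low arithmetic intensity -> memory bound. */
      printf("B (0.5 flop/byte): %.1f Gflop/s\n", roofline_gflops(0.5, peak, bw));
      /* Application A: high arithmetic intensity -> CPU bound. */
      printf("A (16 flop/byte):  %.1f Gflop/s\n", roofline_gflops(16.0, peak, bw));
      return 0;
  }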

Different computational motifs have characteristic arithmetic intensity patterns. Dense linear algebra algorithms involving matrix operations, for instance, perform O(N^3) floating point operations while loading O(N^2) elements from memory. This means that, for large enough matrices, the number of floating point operations is larger than the number of bytes loaded from memory. For this reason, these algorithms are mostly bound by processor performance. Structured and unstructured grid algorithms, on the other hand, involve close to one floating point operation per data element. For this reason, these algorithms are typically bound by memory bandwidth.

One way to increase the performance of memory-bound algorithms is to make efficient use of intermediate cache structures. Cache structures are much faster than main memory, and get faster the closer they are located to the cores (e.g. in Edison, the L1 cache is 5x faster than the L3 cache). A good use of cache can decrease the number of accesses to main memory, increasing the attainable performance for a given arithmetic intensity. One well-known technique, cache blocking, involves changing an algorithm's memory access patterns to reuse data in cache lines as much as possible before it is replaced. The effect of implementing this optimization is that the attainable performance of an application would move closer to the Cache Bandwidth line in Fig. 6, allowing, for instance, application B to attain a higher performance.
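As an illustration of cache blocking, the sketch below tiles a matrix-matrix multiplication so that each block of the operands is reused from cache before being evicted. The matrix and block sizes are illustrative placeholders (C is assumed to be zero-initialized):

  #define N  1024
  #define BS 64   /* tile size, tuned so three BSxBS tiles fit in cache */

  /* A naive triple loop streams entire rows/columns through memory repeatedly.
     The blocked version reuses each BSxBS tile of A, B, and C many times
     while it is still resident in cache. */
  void matmul_blocked(const double A[N][N], const double B[N][N], double C[N][N]) {
      for (int ii = 0; ii < N; ii += BS)
          for (int jj = 0; jj < N; jj += BS)
              for (int kk = 0; kk < N; kk += BS)
                  /* multiply the (ii,kk) tile of A with the (kk,jj) tile of B */
                  for (int i = ii; i < ii + BS; i++)
                      for (int k = kk; k < kk + BS; k++)
                          for (int j = jj; j < jj + BS; j++)
                              C[i][j] += A[i][k] * B[k][j];
  }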

While optimizations like cache blocking are specific to each algorithm, there are factors that can affect memory performance in general. The choice of parallel programming model, for instance, can have a significant impact on the arithmetic intensity of an algorithm. In our analysis of programming models, we identify two main causes for this:

(i) Data duplication. This occurs in models that require the use of (send/receive) buffers for the communication of messages between threads. If no data hazards (read-after-write, write-after-read, or write-after-write) exist in the program semantics, the data could be copied/accessed by multiple threads without the need for buffering. Unnecessary buffering reduces not only the memory bandwidth available for actual computation, but also the efficiency of cache structures.

(ii) Misuse of shared memory. When two threads execute in the same node or NUMA domain, they can be optimized to work on the same address space without the need for explicit communication operations. Using a programming model that enables the use of shared memory can reduce the pressure on main memory. However, some programming models work under the assumption that no threads share the address space, making it difficult to use shared memory.

These problems are major contributors to the cost of internal communication; addressing them will be crucial for exascale performance and is therefore a central point of discussion throughout our study.

B. Network Communication

Despite the success in creating scalable interconnects, there is still much room for improvement. As the size of a network grows, so does the average number of routing hops (H) that messages require to reach their destination, increasing overall latency. The cost of latency will become an important challenge for scientific applications in exascale computing.

The time (T) taken for a message to be transmitted from one node a to another node b can be estimated with the formula in Eq. 1.

T_{a,b} = L_{a,b} + S_m / B_{max} (1)

where L_{a,b} is the network latency between the two nodes, S_m is the size of the message, and B_{max} is the maximum bandwidth of the network. We can see that, for small S_m, the cost of latency dominates the overall cost of communication.

The latency between nodes a and b can be calculated as a function of the number of routing hops between them (H_{a,b}) times a cost per hop (h), plus a fixed overhead per message (s), as shown in Eq. 2.

L_{a,b} = H_{a,b} * h + s (2)

The fixed overhead per message can be caused by both software (e.g. a communication library may require filling a software buffer before sending) and hardware (e.g. per-data-packet startup time). A large value of s can significantly impact performance for algorithms that require sending many small-sized messages. On the other hand, the average value of H_{a,b} can become significant in large-scale executions, thereby making latency a predominant component in the cost of communicating each message.
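To make Eq. 1 and Eq. 2 concrete, the following sketch estimates per-message time under assumed (purely illustrative, not measured) values for the hop cost, per-message overhead, and bandwidth:

  #include <stdio.h>

  /* Estimate of Eq. 1 and Eq. 2 with illustrative, assumed parameters. */
  double message_time_us(int hops, double msg_bytes) {
      const double h     = 0.1;   /* assumed cost per hop, microseconds */
      const double s     = 1.0;   /* assumed fixed per-message overhead, us */
      const double B_max = 8e3;   /* assumed bandwidth: 8 GB/s = 8000 bytes/us */
      double latency = hops * h + s;        /* Eq. 2 */
      return latency + msg_bytes / B_max;   /* Eq. 1 */
  }

  int main(void) {
      /* For a small 64-byte message latency dominates; for 1 MB, bandwidth does. */
      printf("64 B, 4 hops: %.2f us\n", message_time_us(4, 64.0));
      printf("1 MB, 4 hops: %.2f us\n", message_time_us(4, 1 << 20));
      return 0;
  }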

Some computational motifs are particularly susceptible to the cost of latency. Spectral methods (FFT), for instance, require all-to-all communication patterns where all nodes need to communicate with each other regularly, making these algorithms difficult to scale efficiently. Other motifs, such as Monte Carlo methods, have a negligible communication cost and can scale almost perfectly with the number of nodes.

Although new technologies will continue to improve interconnect performance in the coming years (e.g. Intel OmniPath [60] promises a 40% reduction in latency for new petaflop systems), it will still be necessary for programmers to explicitly reduce the cost of communication.

To achieve this, programmers can implement mechanisms that enable Computation/Communication Overlap (overlap, for short) [1], [61]. This optimization can increase CPU usage and reduce the impact of communication and I/O delays by keeping cores performing useful computation while data is being transmitted.

Some algorithms, such as communication-optimal dense matrix multiplication and LU factorization [63], have been manually optimized to realize overlap. The problem with implementing optimizations for overlap by hand is that this requires an excessive amount of effort by programmers. The alternative is to use tools that can implement mechanisms to support overlap automatically or with little effort from programmers. We found that mechanisms such as oversubscription (i.e. executing more processes than processing cores, discussed in §V-A) and data dependency-driven execution (i.e. out-of-order execution of code based on the availability of data, discussed in §VIII-C) can be used to automatically enable overlap, without a painstaking refactoring of a program's code.

In the following sections we discuss the parallel programming models that have been proposed for large-scale computing. We analyze their features and limitations, and how they address the challenges presented in this section. We start with the most common approach, which is based on the Message Passing model. In subsequent sections, we analyze alternative models.

IV. THE MESSAGE PASSING MODEL

Because large-scale supercomputers do not provide a physically coherent global memory address space, programmers need to handle communication between processes residing in different address spaces across the network. Several communication models and libraries have been proposed to build distributed applications, with Message Passing being the most widely used.

Message Passing Interface (MPI) [21] is the de facto standard for writing high performance applications on distributed memory computers. Its first specification, MPI-1, presented in 1995, contained a set of 128 C/Fortran functions that provided basic support for message passing communication between processes. Subsequent releases (now at MPI-3.1 [22]) were aimed at expanding the model in response to new ideas and research from more recent parallel programming models. In this section we discuss the principles of operation and limitations of MPI.

A. Spatial Decomposition of MPI Programs

An MPI program instantiates as a set of processes that execute autonomously, allowing them to realize parallelism both within and across computer nodes indistinctly. Thanks to this autonomy, an MPI application that executes correctly on a multicore processor can – assuming no bugs or semantic errors – also execute correctly on the millions of cores of a supercomputer [7].

Another reason why MPI has become so widely used is that it makes it easy to develop scientific applications. In its simplest configuration – where all processes execute the same program – MPI provides a natural way to describe applications under the Single Program, Multiple Data (SPMD)² execution model.


²Not to be confused with the Single Instruction, Multiple Thread (SIMT) model used in GPUs.

Fig. 7: SPMD decomposition of a 2D grid into 9 subgrids, each one processed by a different MPI process.

 1  double *U, *Uprev;
 2  allocate(U, (N*N / ProcessCount) + GhostCellsSize);
 3  allocate(Uprev, (N*N / ProcessCount) + GhostCellsSize);
 4
 5  for (int step = 0; step < iterations; step++)
 6  {
 7    MPI_Isend(&Uprev[BoundaryCells] -> [up, down, left, right]);
 8    MPI_Irecv(&Uprev[GhostCells] <- [up, down, left, right]);
 9    MPI_Waitall();
10
11    for (int i = 0; i < N; i++)
12      for (int j = 0; j < N; j++)
13        U[i][j] = Uprev[i-1][j] + Uprev[i+1][j] +
14                  Uprev[i][j-1] + Uprev[i][j+1] - 4*Uprev[i][j];
15
16    swap(&U, &Uprev);
17  }

Fig. 8: Pseudocode of an MPI iterative Jacobi solver, simplified for clarity.

SPMD is a commonly used model to describe the spatial decomposition of scientific applications. The idea behind SPMD is that the workload is divided into smaller parts, which can be computed in parallel using the same program.

Structured grid algorithms such as partial differential equation solvers represent typical examples of SPMD applications. These solvers use finite difference methods, such as Jacobi or Gauss-Seidel [44], to iteratively refine an initial n-dimensional grid to a solution that satisfies a set of equations within a certain error margin. These methods use multi-point stencils, operators that calculate the new value of a particular point in the grid as a function of the values of the point itself and its neighboring points.

While a sequential algorithm would sweep over the whole grid applying the stencil one element at a time, an SPMD parallel version of the program divides the grid into smaller parts that can be calculated in parallel by every process using the same solver.

Fig. 7 illustrates a parallel execution of the solver using, for example, 9 MPI processes to solve a 2D grid. In this case, the grid is decomposed equally into sub-grids of size N/3 * N/3 that are distributed among all 9 processes. At every iteration, each process applies the sequential solver to all the elements of its sub-grid and communicates boundary cells to/from neighboring processes, as required to satisfy the data dependencies expressed by the stencil.

Fig. 8 shows the MPI code of a Jacobi solver based on a 5-point stencil (up, down, left, right, and center points). In lines 2-3, each process allocates and initializes its U and Uprev buffers. U stores the sub-grid for the current iteration, and Uprev stores the sub-grid for the previous iteration. Each sub-grid contains an extra set of rows/columns, called ghost cells, used to store the boundary cells of neighboring sub-grids. Boundary cells are exchanged at the start of each iteration (lines 7-8) and execution waits until all communication has finished (line 9). Finally, every process applies the stencil over its own sub-grid (lines 11-14) and swaps its sub-grid pointers (line 16).

(a) Two-sided communication (send/recv).

(b) Explicit One-sided communication (put/get).

Fig. 9: Communication protocols used in MPI.


B. Communication in MPI

To exchange boundary cells, our example uses the two-sided communication protocol. In a two-sided operation, both the sender and receiver processes need to explicitly participate in the exchange of a message. The protocol requires that every send operation requested by the sender process is matched by a recv operation in the receiver process, as illustrated in Fig. 9a.

MPI also supports explicit one-sided communication. In explicit one-sided communication, an MPI process can perform read/write operations on the address space of another process without the need for a matching call. MPI provides an explicit interface comprising put/get functions to update/read values on a process's window, a uniquely identified partition of its private address space.

Fig. 9b illustrates how windows are used to communicate data between MPI processes. In this example, two operations are performed: process 0 executes a put operation to copy data from its private space to the window in process 1 which, in turn, executes a get operation to retrieve data from the window in process 2 to its own space.
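A minimal sketch of MPI's explicit one-sided interface, using active-target (fence) synchronization; the ranks and payload are illustrative and unrelated to the Edison example:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      int rank, value = 0;
      MPI_Win win;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Each process exposes one integer as its window (its accessible partition). */
      MPI_Win_create(&value, sizeof(int), sizeof(int),
                     MPI_INFO_NULL, MPI_COMM_WORLD, &win);

      MPI_Win_fence(0, win);               /* open the access epoch */
      if (rank == 0) {
          int payload = 42;
          /* Write into rank 1's window; rank 1 posts no matching receive. */
          MPI_Put(&payload, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
      }
      MPI_Win_fence(0, win);               /* complete all pending puts/gets */

      if (rank == 1)
          printf("rank 1 window now holds %d\n", value);

      MPI_Win_free(&win);
      MPI_Finalize();
      return 0;
  }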

One disadvantage of the two-sided and explicit one-sided communication mechanisms is that they do not support having two processes access the same memory space directly. In both cases, a data buffer is copied from the sender process into a buffer/window in the receiver's space. Furthermore, a third copy of the message in an intermediate buffer might be needed when the send operation is posted before its recv counterpart. This becomes a source of unnecessary data duplication when the communicating processes belong to the same physical address space, where buffers could instead be shared.

The cost of buffering has an impact on the per-message cost we have seen in Eq. 2. Filling data buffers increases the fixed overhead (s) associated with the message latency cost (L_{a,b}). The consequence of a larger s is that the performance of applications with fine-grained communication (i.e. requiring the communication of a large number of small-sized messages) is particularly affected.

MPI provides an extension for shared memory that allows processes to access memory allocations in the same physical address space directly. This extension is, of course, limited to processes co-located in the same node, and cannot be used as the only means of communication among all MPI processes across the network. Although this solves the problem of data duplication inside a node, it forces programmers to use two different interfaces for communicating among MPI processes depending on whether they are co-located or not. This is called the MPI+MPI model, which we analyze further in §V.

Fig. 10: Core usage diagram of a two-phase execution of a BSP application.


C. Temporal Decomposition of MPI Programs

Another determining factor in the performance of large-scale MPI applications is how they deal with network latency. Most scientific applications authored using MPI follow the Bulk Synchronous Parallelism (BSP) execution model [70]. BSP provides a temporal model of how processes compute and communicate. Under this model, the behavior of an application is defined in supersteps. At each superstep, each process performs one or both of two substeps: (1) compute, and (2) send/receive partial results. The data required for the next superstep will not be available until all processes finish their current superstep. This means that there is an implicit barrier at the end of each superstep where all processes synchronize.

The BSP model can be used to model many scientific application motifs. MPI provides the communication and synchronization capabilities to easily realize BSP applications. The problem is that, in the BSP model, communication and computation are performed in distinct phases, inhibiting any overlap between them.

Fig. 10 illustrates a plausible execution of an MPI process that performs computation and communication in two separate phases. The effective core usage (i.e. the time spent performing actual computation) is shown as solid blocks. The diagonally striped blocks represent communication operations. An upwards arrow indicates the start of a communication request, and a downwards arrow represents its completion. We can see that a processor core remains unused while waiting for communication operations to complete.

Applications that execute computation and communication in separate stages suffer from the full cost of communication overhead. This represents an important liability for petascale and exascale systems, where the latency of communicating a message can become a performance bottleneck.

A programmer can refactor a BSP program to overlap computation with communication by employing a split-phase technique [16]. This requires that the program be divided into smaller, independent sections that can execute/communicate concurrently. However, such a transformation may require a significant amount of effort and make MPI applications difficult to implement.
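As an illustration of the split-phase idea applied to the solver of Fig. 8, the sketch below (pseudocode in the same style, with hypothetical helper names) posts the boundary exchange, computes the interior cells that do not depend on ghost data while messages are in flight, and only then waits and finishes the ghost-dependent cells:

  for (int step = 0; step < iterations; step++)
  {
    MPI_Isend(&Uprev[BoundaryCells] -> [up, down, left, right]);
    MPI_Irecv(&Uprev[GhostCells] <- [up, down, left, right]);

    /* Interior cells need no ghost data: compute them while the
       boundary exchange is still in progress. */
    compute_stencil_interior(U, Uprev);

    MPI_Waitall();                    /* ghost cells are now valid */
    compute_stencil_boundary(U, Uprev);

    swap(&U, &Uprev);
  }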

Fig. 11: Execution of 8 MPI processes using a 2 node x 2 NUMA domain x 2 core configuration.

D. Oversubscription Limitation in MPI

The granularity (G) of an SPMD application under MPI (i.e. how small the parts of the workload distributed among processes are) depends on both the problem size (N) and how many MPI processes (P) are used, as per Eq. 3.

G = N/P (3)

Although the MPI specification places very few restrictions on how an MPI library should be implemented, the most widely used implementations in supercomputers (e.g. MPICH [46], MVAPICH [47], Open MPI [55], Intel MPI [31]) instantiate each MPI process in the context of a separate OS process, as illustrated in Fig. 11.

Each OS process operates in its own local memory address space and executes autonomously from other processes. This simplifies the implementation of MPI libraries and works well for most applications. This approach, however, constrains the granularity of an MPI application to the number of available cores (c). An MPI program reaches its optimal performance when the relation P = c is satisfied, since each process is matched to a single core and all cores are used.

The problem of having the granularity of an MPI application fixed by the number of cores is that it does not allow performing efficient oversubscription – a necessary mechanism to hide the cost of communication. Experimental observation shows that using P > c (oversubscription) in MPI libraries degrades application performance [30]. There are many causes for this:

(i) MPI processes will destructively compete for a core, constantly preempting each other from execution. This results in an increase in cache/TLB thrashing caused by repeated context switching.

(ii) Switching between MPI processes mapped to the same core carries the overhead of a kernel-level context switch.

(iii) MPI libraries achieve optimal performance when they employ busy waiting for the detection of new messages. This is the optimal strategy when only one MPI process per core is executed, because it can instantly detect incoming messages. However, busy waiting produces destructive interference when cores are oversubscribed.

(iv) Barrier synchronization overhead increases with the number of MPI processes.

Threaded implementations of MPI exist that solve the oversubscription limitation. These are presented in the next section.

V. ALTERNATIVE MPI-BASED MODELS

Several alternatives have been proposed to extend MPI's functionality. In this section we present the two main alternative models proposed in the research literature.

Fig. 12: Execution of 16 MPI processes on 8 cores using user-level threads.

A. Threaded MPI

The Threaded MPI model is motivated by the need for MPI libraries that do not enforce the 1-to-1 MPI-process-to-core mapping. These libraries enable efficient oversubscription by allowing the execution of multiple MPI processes that coexist within a single OS process.

Fine-Grain MPI (FG-MPI) [36] instantiates MPI processes as user-level threads [35], functions that can be interrupted at any point of their execution while preserving their stack and processor state. User-level threads are managed and scheduled by a user-level runtime system. Other user-level thread-based MPI libraries, such as Threaded MPI (TMPI) [66] and AzequiaMPI [19], follow a similar approach to FG-MPI.

Preemption of user-level threads is enabled by calling a library-provided yield function that does not require the intervention of the kernel scheduler. When FG-MPI's runtime system schedules a user-level thread for execution, it resumes from the point at which it had previously yielded.
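The yield/resume mechanism can be illustrated with the POSIX ucontext primitives. This is a generic sketch of cooperative user-level threading, not FG-MPI's actual implementation:

  #include <stdio.h>
  #include <ucontext.h>

  static ucontext_t scheduler_ctx, task_ctx;
  static char task_stack[64 * 1024];

  static void task_body(void) {
      printf("task: before yield\n");
      /* yield: save this task's state and resume the scheduler */
      swapcontext(&task_ctx, &scheduler_ctx);
      printf("task: resumed after yield\n");
  }

  int main(void) {
      getcontext(&task_ctx);
      task_ctx.uc_stack.ss_sp   = task_stack;
      task_ctx.uc_stack.ss_size = sizeof(task_stack);
      task_ctx.uc_link          = &scheduler_ctx;  /* return here when the task ends */
      makecontext(&task_ctx, task_body, 0);

      swapcontext(&scheduler_ctx, &task_ctx);  /* run the task until it yields */
      printf("scheduler: task yielded, resuming it\n");
      swapcontext(&scheduler_ctx, &task_ctx);  /* resume where it left off */
      printf("scheduler: task finished\n");
      return 0;
  }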

By using user-level threads, FG-MPI enables the efficient execution of more than one MPI process per OS process. Fig. 12 shows an example of how FG-MPI executes 2 MPI processes per core in the context of the same OS process.

The oversubscription factor (V) is a term used to define the number of MPI processes P as an integer multiple of the number of cores c. For example, if c = 16, using P = 16 represents a conventional execution, while using P = {32, 48, 64} represents V = {2, 3, 4}, respectively. The granularity of an SPMD application can then be reduced by a factor of V, as per Eq. 4.

G = N/(c * V) (4)

Realizing an oversubscribed execution is one ingredient to achieve communication/computation overlap. The idea is that processes waiting for a communication operation are inexpensively preempted from using the processor core and replaced with another that is ready to be executed [27], thereby keeping the core busy with useful computation.

Fig. 13a shows the execution timeline of an oversubscribed execution using V=2. We can see that the two MPI processes are allocated on the same core, each performing smaller bursts of effective computation than the ones shown in the example in Fig. 10. The timeline also illustrates the overhead cost of context switching between processes 0 and 1 as diagonally striped blocks between task executions.

One risk associated with oversubscription is that the cost of context switching can overcome the benefits obtained by overlap. Fig. 13b shows the timeline of using V=4, where the increased oversubscription enables the processor core to be busy at all times. However, it also shows that the amount of work performed by each burst of computation is reduced, while the context switching overhead remains constant. This means that more of the core time is spent on switching overhead than actual computation. It is therefore important to perform a careful tuning of the oversubscription factor to find a sweet spot where the marginal gain in core usage equals the penalty of its associated overhead.

(a) Oversubscription Factor = 2

(b) Oversubscription Factor = 4

Fig. 13: Timeline of core usage of an application with (a) V=2, and (b) V=4


Another common problem with oversubscription in user-level thread-based MPI libraries is that global and static variables become shared among different tasks executing in the same process. In conventional MPI libraries, each process executes in a separate process-private memory space and has exclusive access to its global and static variables. However, when AMPI or FG-MPI populates a process with more than one task, global variables share the same virtual memory space, thus causing incorrect behavior due to unintended data sharing.

A workaround for this problem is to perform a manual thread-wise privatization of global and static variables. The idea is to move all global variables into a structure or object that is instantiated at the beginning of the program and is passed as an argument between subroutines. This might require extensive effort by a programmer. Some authors of AMPI have explored automated solutions [75]. However, these solutions are architecture-dependent and may not work in all cases.
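A minimal sketch of manual privatization, with illustrative names; the former file-scope globals are gathered into a per-task state structure that is created once per MPI process and threaded through every call:

  /* Before: file-scope globals are unintentionally shared by all user-level
     threads (MPI processes) living in the same OS process.

       double residual;
       int    iteration;
  */

  /* After: per-task state passed explicitly between subroutines. */
  typedef struct {
      double residual;
      int    iteration;
  } SolverState;

  static void solver_step(SolverState *st) {
      /* every access goes through the per-task state instead of a global */
      st->iteration++;
      st->residual *= 0.5;   /* placeholder computation */
  }

  void task_main(void) {     /* entry point of one user-level MPI process */
      SolverState st = { 1.0, 0 };
      while (st.residual > 1e-6)
          solver_step(&st);
  }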

B. Hybrid Model

The Hybrid (also called MPI+X) model has been proposed as a way to address the data duplication problem in MPI. This model involves a two-layer approach, where MPI is used to manage the process distribution across the network, and another interface is used to enable a shared memory execution inside each node or NUMA domain. By using shared memory, the impact of data motion is reduced, since redundant message copying and data duplication among MPI processes within the same physical address space are eliminated.

The X term in MPI+X can refer to any threading library or language extension. Threading libraries, such as OpenMP [56] or POSIX threads, are based on the execution of kernel-level threads. In contrast to user-level threads (which require library-enabled yield/resume mechanisms), kernel-level threads are managed directly by the OS scheduler.

In the MPI+OpenMP [64] approach, MPI is used to create one process per node or NUMA domain, while OpenMP ensures that all its threads are created within the same physical and virtual address spaces, thus providing a shared memory environment for programmers. The advantage of using kernel-level threads is that they can execute in parallel – scheduled across different cores – while sharing the same address space, as illustrated in Fig. 14. This is not possible with user-level threads alone, since threads mapped to the same address space cannot execute simultaneously.

Fig. 14: Execution of the hybrid approach using one process per NUMA domain, and 2 kernel-level threads per process.


The locality of data in the MPI+OpenMP approach is given implicitly – that is, without an explicit allocation of shared variables in the code. Instead, all OpenMP threads instantiated within the same MPI process automatically share all of their global address space and pointers.
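A minimal MPI+OpenMP hybrid sketch: one MPI process per node (or NUMA domain, depending on how the job is launched), with OpenMP threads sharing that process's address space. The array size is illustrative:

  #include <mpi.h>
  #include <omp.h>
  #include <stdio.h>

  #define N 1000000
  static double data[N];

  int main(int argc, char **argv) {
      int provided, rank;
      double local_sum = 0.0, global_sum = 0.0;

      /* FUNNELED: only the main thread makes MPI calls. */
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      /* Threads share 'data' directly: no intra-node message passing needed. */
      #pragma omp parallel for reduction(+:local_sum)
      for (int i = 0; i < N; i++) {
          data[i] = rank + i * 1e-6;
          local_sum += data[i];
      }

      /* MPI handles only the inter-process communication. */
      MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
      if (rank == 0) printf("global sum = %f\n", global_sum);

      MPI_Finalize();
      return 0;
  }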

Another option for the MPI+X approach, called MPI+MPI [28], works by handling communication across nodes through message passing or one-sided communication, while using MPI shared memory functions to manage shared memory access between processes on the same node.

MPI+MPI does not require a copy of a message into a receive buffer, since data can be directly accessed by all processes in the same node/NUMA domain. This model enables shared memory without the need for threading (MPI processes execute as single non-threaded OS processes), by using inter-process shared memory mechanisms provided by the OS. However, since the address space is not shared (no threading), MPI requires programmers to explicitly define the locality of shared memory allocations. Memory allocations can only be shared between different MPI processes co-located in the same node, not across the network.
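A minimal MPI+MPI sketch, where processes on the same node read each other's data through a shared window instead of exchanging messages; sizes and values are illustrative:

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv) {
      MPI_Comm nodecomm;
      MPI_Win win;
      int noderank, nodesize, disp;
      double *mine, *neighbor;
      MPI_Aint sz;

      MPI_Init(&argc, &argv);

      /* Group the processes that can physically share memory (same node). */
      MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                          MPI_INFO_NULL, &nodecomm);
      MPI_Comm_rank(nodecomm, &noderank);
      MPI_Comm_size(nodecomm, &nodesize);

      /* Each process contributes one double to a node-wide shared segment. */
      MPI_Win_allocate_shared(sizeof(double), sizeof(double), MPI_INFO_NULL,
                              nodecomm, &mine, &win);
      *mine = 100.0 + noderank;

      MPI_Win_fence(0, win);   /* make all local stores visible node-wide */

      /* Read the next process's value through a plain pointer: no copy, no message. */
      MPI_Win_shared_query(win, (noderank + 1) % nodesize, &sz, &disp, &neighbor);
      printf("rank %d sees neighbor value %f\n", noderank, *neighbor);

      MPI_Win_free(&win);
      MPI_Comm_free(&nodecomm);
      MPI_Finalize();
      return 0;
  }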

Despite the advantages of hybrid models, they have some drawbacks: (i) Programmers need to carefully define the interaction between shared and remote communication interfaces. (ii) As in any shared-memory program, synchronization mechanisms need to be implemented to prevent data races. (iii) Oversubscription is not efficient. In the case of MPI+OpenMP, switching between kernel-level threads involves higher overheads than switching between user-level threads. This penalizes performance in oversubscribed executions. (iv) MPI libraries are not necessarily thread-safe. This means that kernel-level threads may need to serialize their calls to MPI by using mechanisms for mutual exclusion.

C. Overview of MPI-Based Models

Despite the success of MPI in the scientific computing community, it has shown some difficulties in addressing some of the challenges of exascale. We have identified the following main limitations:

(i) MPI is subject to the data duplication problem. Data is duplicated because a message needs to be stored in the sender's buffer, which is later deep copied into the receiver's buffer. Furthermore, when a send operation is posted before a receive operation, even a third copy (an intermediate buffer) is required, exacerbating the problem. This problem affects the performance of applications with fine-grained communication in particular, since the fixed cost per message is increased.

We have seen that Hybrid MPI models (MPI+X) are able to take advantage of shared memory. However, the main drawback of MPI+X models is that, by definition, they require two different interfaces for communication. Programming an MPI+X application requires a careful insertion of synchronization mechanisms and a correct interaction between MPI and shared memory code. Developing such programs can require extensive efforts by the programmer and could be prone to bugs.

(ii) The MPI specification does not prescribe any means for oversubscription. For flat (i.e. non-threaded) MPI libraries, this represents an important limitation at exascale, since oversubscription can hide some of the cost of communication. Threaded MPI libraries offer an efficient solution to overcome the overhead of oversubscription in MPI. However, the limitation of Threaded MPI libraries is that they do not provide any means for solving the data duplication problem. In fact, in oversubscribed executions, data is further dispersed among MPI processes that cannot share memory, hindering data locality.

From our observations of the MPI model and its variants we can conclude that alternative parallel programming models targeted towards exascale computing should provide the following features: (a) Support for efficient oversubscription and shared memory. (b) A single interface for communication between processes inside and across nodes indistinctly.

In the next sections we introduce alternative models that have been proposed to overcome the limitations of MPI models, providing additional traits required by exascale computing. We start with the PGAS model, which provides a solution to the data duplication problem for applications with fine-grained communication.

VI. PARTITIONED GLOBAL ADDRESS SPACE MODEL

The Partitioned Global Address Space (PGAS) model [16] provides a framework for global memory that is meant to execute in multiple disjoint physical memory spaces. In a PGAS program, a distinction is made between variables private to a task, and those accessible by all tasks. We use the term task as a generic way to refer to the executing units into which a parallel application is divided. The task concept enforces no assumptions on the isolation of memory address spaces or the autonomy of execution. Tasks can be implemented as a combination of user- and kernel-level threads and managed by a runtime system to support oversubscription, shared memory, and/or data or execution dependencies.

A PGAS language hides the complexity of accessing memory globally across tasks. Global variables can be directly modified or read by any task via normal assignment operators and pointer accesses, just like any other variable.

A global variable may be either physically located in the space of one task, or partitioned across the space of multiple tasks. The physical location of a partition (i.e. what node/NUMA domain contains it) is defined as the partition's affinity. Affinity does not affect the correctness of a program, since partitions are equally accessible by all tasks. However, it plays an important role in managing data locality. The optimal case is that in which shared partitions are located physically closest to the tasks accessing them.

Fig. 15: An example use of implicit one-sided communication in the PGAS model.


Unified Parallel C (UPC) [12] is a PGAS extension to the C language. Similar tools have been developed for other languages as well, such as Co-Array Fortran [53] and Titanium [74] for Java. In UPC, the allocation of shared spaces may be done statically, through non-initialized vector declarations (e.g. shared int array[SIZE]); or dynamically, through memory allocation functions (e.g. upc_global_alloc()).

Shared allocations in UPC can span tasks in linear or multi-dimensional arrays. Unless otherwise indicated by the user, all allocations in UPC are uniformly partitioned among tasks. For example, in an execution of 4 UPC tasks, a linear allocation of 100 bytes would result in 4 partitions of 25 bytes with affinities to tasks 0, 1, 2, and 3.
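A minimal UPC sketch of the two allocation styles described above; the block size and array length are illustrative, and the [100] layout qualifier gives each task a contiguous block of 100 elements with affinity to it:

  #include <upc.h>
  #include <stdio.h>

  shared [100] double grid[100 * THREADS];   /* static shared allocation */

  int main(void) {
      int i;

      /* Dynamic shared allocation: a collective call returning one block per task. */
      shared double *buf = (shared double *)upc_all_alloc(THREADS, 100 * sizeof(double));
      (void)buf;   /* shown only to illustrate the dynamic allocation call */

      /* upc_forall runs each iteration on the task with affinity to grid[i],
         so every element is written by its local owner. */
      upc_forall (i = 0; i < 100 * THREADS; i++; &grid[i])
          grid[i] = MYTHREAD;

      upc_barrier;
      if (MYTHREAD == 0)   /* an implicit one-sided remote read of the last element */
          printf("last element owned by task %d, value %f\n",
                 (int)upc_threadof(&grid[100 * THREADS - 1]),
                 grid[100 * THREADS - 1]);
      return 0;
  }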

UPC implements implicit one-sided communication, where accesses to shared partitions are embedded into the language instead of requiring explicit calls to get/put-like operations as in MPI. Accesses to remotely shared partitions are indistinguishable from private pointer accesses, except for the fact that the destination address refers to a remote location. This can be observed in Fig. 15.

Upon arriving at a read or assignment operation (=) on an element of a shared pointer, UPC performs the following operations: (1) it dereferences the accessed element and obtains its offset within the shared allocation, (2) it uses the offset to determine which partition the element belongs to and its affinity, and (3) it exchanges data with the task indicated by the affinity.
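The fragment below is a conceptual illustration of that resolution for the default cyclic layout; it is not actual UPC runtime code, and the function and variable names are ours.

#include <stddef.h>
#include <upc.h>

/* Illustration only: what the assignment p[i] = v resolves to under
   UPC's default cyclic layout (block size of one element). */
void assign_shared(shared int *p, size_t i, int v)
{
    int affinity = (int)(i % THREADS);   /* steps (1)-(2): offset -> owning task */

    if (affinity == MYTHREAD) {
        /* Local partition: a plain store, no communication. */
        p[i] = v;
    } else {
        /* Remote partition: the same assignment makes the runtime issue a
           one-sided put to task `affinity`, at local index i / THREADS (step 3). */
        p[i] = v;
    }
}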

Implicit one-sided operations in UPC solve the data duplication problem in MPI since UPC allows accessing remote partitions directly, without requiring a copy to/from a local buffer. Furthermore, empirical results have shown that the UPC approach requires less time per operation than the explicit one-sided communication primitives used in MPI – at the cost of adding synchronization mechanisms.

shared int p[LARGE_NUMBER];
for (int i = 0; i < LARGE_NUMBER; i++)
    p[i] = i;

Fig. 16: Pathological case of implicit one-sided communication in UPC.

The problem with UPC's approach is that it can slow down the communication of large messages. UPC requires one independent operation per element when accessing a shared array. This makes the cost of communication linear in the number of elements transferred, since each element access has to be resolved individually. The pathological case is shown in the code of Fig. 16, where every access to the shared pointer p represents a different operation. This problem is unavoidable, since it is not possible to predict pointer-based accesses.

Fig. 17: Time taken by UPC and MPI explicit one-sided operations as a function of the number of bytes transmitted.

Source: [32]

for (int step = 0; step < iterations; step++)
{
    upc_memget(handles,
               &localGhostPointer,
               &remoteBoundaryPointer <- [up,down,left,right]);
    upc_sync(handles);

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            U[i][j] = U[i-1][j] + U[i+1][j] +
                      U[i][j-1] + U[i][j+1] - 4*U[i][j];

    swap(&U, &Uprev);
}

Fig. 18: Pseudocode of the Jacobi solver kernel using UPC.

As a consequence, implicit communication with UPC is only faster than the explicit put/get functions of MPI for small messages, as shown in Fig. 17.

While implicit one-sided operations may be ideal for algorithms that rely on fine-grained communication (e.g., UPC is used in large-scale genome assembly algorithms [23]), they may become a significant overhead when communicating relatively large sets of data, as in dense linear (matrix) algebra algorithms. For this reason, UPC also provides functions for explicit one-sided communication (upc_memget and upc_memput), similar to the ones provided by MPI.

As in MPI, the use of explicit operations in UPC requires a copy of the data from/to local buffers, therefore also incurring the data duplication problem. Another issue with one-sided communication is that it requires synchronization mechanisms to prevent data races, just like in the Hybrid MPI model. UPC provides the upc_sync function for pair-wise synchronization between the sender and receiver tasks to verify that a message has been transferred. Semantically, upc_sync operates in a similar way to MPI_Wait.
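As a sketch of how this explicit path looks, the fragment below uses the non-blocking transfer library (upc_memget_nb/upc_sync); the buffer names, sizes, and blocking factor are illustrative assumptions rather than details from the text.

#include <upc.h>
#include <upc_nb.h>     /* non-blocking transfer library: upc_memget_nb, upc_sync */

#define N 1024

shared [N] double boundary[N * THREADS];  /* one contiguous block per task */
double ghost[N];                          /* private landing buffer        */

void fetch_ghosts(int neighbor)
{
    /* Explicit one-sided get into a local buffer; this copy is the data
       duplication discussed above. */
    upc_handle_t h = upc_memget_nb(ghost,
                                   &boundary[(size_t)neighbor * N],
                                   N * sizeof(double));

    /* ...independent work could overlap with the transfer here... */

    upc_sync(h);   /* complete the transfer, similar in spirit to MPI_Wait */
}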

The consequence of using explicit one-sided operations and pair-wise synchronization is that UPC programs can be very similar to MPI programs for algorithms that cannot benefit from fine-grained communication, as in the case of the Jacobi solver from Fig. 8. The pseudocode of the same solver programmed using UPC, shown in Fig. 18, has a very similar structure to the MPI version.

In the next section we analyze the APGAS model, an extension to the PGAS model that introduces the notion of task locality to enable the use of shared memory.


Fig. 19: Creation of a task dependency graph in UPC++. Adapted from an example in [77].

VII. ASYNCHRONOUS PGAS

Asynchronous PGAS (APGAS) is an extension of the PGAS model that supports the explicit creation of tasks at runtime. Once created, the new task starts immediately and executes independently from its parent task. A parent task can then be set to wait for the completion of its children tasks (and their descendants)3. Since the scheduling of tasks is based on the completion of other tasks, we say that these models have a task-dependency-driven execution.

The main contribution of the APGAS model is that it allows programmers to define places: logical abstractions that provide the notion of task locality. Whereas affinity in UPC refers to where data is allocated, a place defines the physical resource where a task is executed. By specifying a place, the creation of tasks can be done locally (in the same place as the parent task) or remotely (in a different place).

By defining their locality, tasks allocated to the same place are able to access variables from a shared address space without the need for an additional interface (as in Hybrid models). This locality-based sharing mechanism is enabled directly by the hardware, and does not suffer from the performance penalty for large messages we have seen when using software-based dereferencing as in UPC.

Our analysis of the PGAS model in the previous section showed that implicit one-sided communication can be inefficient when tasks are located in separate address spaces. However, by allowing the programmer to specify places, the APGAS model guarantees that data and tasks located in the same place will execute in the same physical address space, without the need for software-based pointer dereferencing as in PGAS. To access variables among tasks in different places, programmers still need to specify the shared type modifier, just like in UPC. This shows that the PGAS model can be thought of as a particular case of the APGAS model in which each task executes in its own separate place.

UPC++ [77] is an APGAS communication library for C++ applications. UPC++ provides an interface that integrates the allocation and communication primitives of UPC with an interface for the creation of new tasks at runtime. UPC++ is based on ideas applied by APGAS-specific languages, such as X10 [13].

In X10, tasks can be created by calling the async function. Async represents a Remote Method Invocation (RMI): a request to execute a function in a remote location. By using async, the programmer specifies the place of the new task.

3 MPI also provides a set of functions for the creation of new processes at runtime. However, the new set of processes will belong to a new communicator group. Although MPI enables communication between communicators, it does not provide any synchronization mechanisms.

Fig. 20: LULESH weak scaling performance of UPC++ vs MPI. Source: [77]

The parent task can be set to wait for each of its child tasks (and their descendants) by using the finish function.

UPC++ adopts the async/finish semantics of X10, but also defines events. An event is a logical switch that is (partially) triggered upon the completion of a task. Events in UPC++ serve to create custom task dependency graphs by grouping the completion of one or more tasks as trigger conditions. By using the async_after() function, a new task is created but not executed until a certain event is satisfied.
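The sketch below illustrates this style for a small three-task graph. It follows our reading of the UPC++ interface described in [77], so the exact signatures (async, async_after, event, async_wait) should be treated as assumptions, and the task bodies and place numbers are invented.

#include <upcxx.h>   // UPC++ APGAS library

// Plain functions used as tasks; bodies are placeholders.
void t1() { /* ... */ }
void t2() { /* ... */ }
void t3() { /* ... */ }

int main(int argc, char **argv)
{
    upcxx::init(&argc, &argv);

    upcxx::event e1;                 // satisfied once t1 and t2 have completed

    upcxx::async(0, &e1)(t1);        // run t1 at place (rank) 0, signaling e1
    upcxx::async(0, &e1)(t2);        // run t2 at the same place, signaling e1
    upcxx::async_after(1, &e1)(t3);  // t3 runs at place 1 only after e1 fires

    upcxx::async_wait();             // wait for all outstanding tasks
    upcxx::finalize();
    return 0;
}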

Fig. 19 shows an example of the creation of a task dependency graph using events in UPC++. Tasks t1, t2, and t4 can start executing immediately. Event e1 is required for t3 to start executing, and will only be satisfied after tasks t1 and t2 have completed. The same applies to tasks t5 and t6, which must wait until event e2 is satisfied. Regarding locality, tasks (1,2), (3,4), and (5,6) are set to share the same place. We can see that two main aspects of this APGAS program have been explicitly defined: (i) the execution order of tasks is given by the control-flow dependencies exposed by the defined events, and (ii) the locality of data is given by the defined places.

UPC++ has been shown to meet and even exceed the performance of MPI at large scales, primarily for fine-grained communication algorithms [77]. Fig. 20 shows the performance comparison between both models executing the LULESH solver on up to 32768 cores. LULESH [62] is an example of an unstructured grid motif application that solves the Sedov blast problem [29] in three spatial dimensions. LULESH can be used to benchmark performance on new hardware and programming tools. The experiment shows that UPC++ achieves a 10% speedup compared to MPI thanks to the use of shared memory among co-located tasks.

Charm++ [33] is a programming framework based on APGAS principles. Charm++ extends the standard C++ syntax with structures based on the original Charm [34] programming language. Charm++ provides an object-based extended API where tasks are defined as chares: C++ classes that can be defined by the programmer while inheriting a set of base chare methods. Chare classes have their own fields, constructors, and entry methods. Tasks are created by instantiating objects derived from a chare class. The lifetime of a task is the same as that of a normal object, and communication is realized through calls to its entry methods.

Chares can be associated in arrays or groups that are allocated in the same processing element (PE).


1  T* data = initialize_data();
2  ChareClass chareObj = ChareClass::ckNew(args, destPE);
3  ChareClass* c = chareObj.ckLocal();
4
5  if (c != NULL)
6      // Local, shared access to the data pointer. No hard copy.
7      c->data = data;
8  else
9      // Remote, one-sided communication. Data is copied.
10     chareObj.acceptData(data);

Fig. 21: Simplified Charm++ code. Shared memory communication is only possible if tasks are co-located.

A processing element serves the same purpose as a place in UPC++. That is, all chares created in the same PE are guaranteed to execute in the same address space. When a chare is created, it executes its constructor method and remains inactive until one of its entry methods is called by another task. When tasks do not coexist in the same PE, Charm++ uses remote method invocations. Communication between tasks is then realized by two different mechanisms:

(i) When chares are co-located in the same PE, communication can be handled via shared memory. This is achieved by accessing the public fields (including arrays) of co-located tasks.

(ii) When chares do not share the same location, they perform explicit one-sided communication by sending pointers as arguments in RMIs. This prompts the runtime system to transfer data asynchronously from the pointer location. The called method is not executed until all data has been deep-copied into the receiving task's memory space. The called task then receives a local pointer to its copy of the data, which can be accessed directly.

The disadvantage of shared memory communication in Charm++, when compared to UPC++, is that the programmer needs to handle communication between tasks using two different mechanisms, depending on whether or not they are co-located.

Fig. 21 shows an example of a program where a data set needs to be transferred to a child task. A new task is created in line 2 as a new object with initial arguments (args) and a destination processing element (destPE). The value of destPE determines whether the child task will be allocated in the same space as its parent. The ckLocal method, inherited from the base chare class, returns a pointer to the object if the specified chare is co-located. If it is co-located (line 7), the data array is shared between the two tasks and they can communicate through shared memory. If the child task is remote, a NULL pointer is returned. This requires the transmission of the data pointer as an entry method argument of the ChareClass class (line 10), which incurs a deep copy of the data.

One of the distinctive features of Charm++ is that it supports the migration of chares among different PEs (and therefore, across the network) by deploying a packing and unpacking (PUP) framework [2]. For a chare to be migratable, Charm++ only requires the programmer to define a pup() virtual method that serializes/deserializes the contents of a task to/from a stream of bytes. Migratable chares are the main mechanism by which Charm++ enables load balancing and checkpoint/restart-based fault tolerance [76].
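A minimal sketch of such a method is shown below; the chare's fields are invented for illustration, and the usual .ci interface declaration that a real chare also requires is omitted.

#include <pup.h>        // Charm++ PUP framework
#include <pup_stl.h>    // PUP support for STL containers
#include <vector>

// Illustrative chare state; in a real program this class would derive from
// the chare base class generated from its .ci interface file.
class Solver /* : public CBase_Solver */ {
    int iteration;
    std::vector<double> grid;

public:
    // Called by the runtime to serialize (pack) or deserialize (unpack)
    // the chare when it migrates to another PE or is checkpointed.
    void pup(PUP::er &p) {
        p | iteration;   // scalars are streamed directly
        p | grid;        // pup_stl.h handles std::vector
    }
};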

In the next section we analyze a set of parallel programming models, called dataflow models, that dynamically alter the execution order of an application based on the availability of data. We will see how this approach can be used to help hide the cost of network communication.

VIII. DATAFLOW MODELS

Dataflow processors were proposed to execute streams of instructions as directed acyclic graphs (DAGs), where nodes represent instructions and edges represent data dependencies between them. Instead of executing in program order, instructions execute as soon as their required operands are available, as long as no data hazards are detected. The earliest instances of such processors were the IBM System/360 Model 91, implementing Tomasulo's algorithm [68], and the CDC 6600, implementing the scoreboard algorithm [67]. These ideas paved the way for the out-of-order execution logic that has dramatically increased the performance of processors.

The same principle has also been proposed for high-level parallel programming models [3]. A dataflow model defines the semantics of a program by declaring the elements that need to be calculated, and the dependencies (operands) required by each one of them.

In dataflow programming models, data dependencies are used to define the execution order of the application. The complexity of managing a data dependence-driven execution is, however, hidden from the programmer. For this reason, these models rely heavily on both translator/compiler and runtime system support. Compilers need to perform a static analysis of dependencies and embed out-of-order logic inside the code, while runtime systems are required to track the status of data dependencies (whether they are satisfied) and decide which operation(s) can be executed next.

We identify four dataflow programming models that implement a range of different approaches: Concurrent Collections, Statement-level Dataflow, Region-level Dataflow, and Task-level Dataflow.

A. Concurrent Collections

The Concurrent Collections (CnC) [11] model expresses the control flow of a parallel program in terms of producer/consumer relations between data elements. A CnC program defines data collections where each element in a collection is assigned a tag or identifier. A programmer can define dependencies, where the value of a tag may require pre-calculating the values of other tags.

What makes CnC a particularly interesting programming model is the ability to compose a program through memoization: storing intermediate results for later reuse. By requesting the value of a tag, a CnC program calculates and stores the values of all the tags upon which it depends, and then performs a simple operation to calculate its own value. This logic is applied recursively until the last dependencies are only external inputs (e.g., constants, command line arguments, files, user input). In this way, a CnC program obtains the value of a tag as a series of simple data transformations starting from external inputs.

Fig. 22 shows typical CnC code for calculating the ith number in the Fibonacci sequence for any positive integer i. Left arrows indicate input dependencies. In this example, there are two inputs, x and y, representing the previous two Fibonacci numbers (in the case of i > 2). For any i required by the programmer, the CnC model will automatically resolve and memoize the range of previous Fibonacci numbers.


[ int *fib: i ];

$compute fib: i
    <- [ x @ fib: i-2 ] $when(i>1),
    <- [ y @ fib: i-1 ] $when(i>1),
    -> [ 0 @ fib: i ] $when(i==0),
    -> [ 1 @ fib: i ] $when(i==1),
    -> [ x + y @ fib: i ] $when(i>1);

Fig. 22: A typical CnC Fibonacci code. Adapted from [71].

Fig. 23: Decomposition of the Cholesky factorization into simpler operations in DAGuE. Source: [10].

Right arrows indicate the result value and where it is stored. The result for each i is stored under the tag fib: i.

CnC has proven to be an ideal model for developing linear algebra libraries. DAGuE [10] is a programming language and runtime system based on the CnC model that represents complex algebraic transformations, such as the Cholesky factorization (dense linear algebra), as relations between matrix/vector operations.

The dependency graph of the Cholesky factorization generated by DAGuE is shown in Fig. 23. DPOTRF, DTRSM, DSYRK, and DGEMM represent the simpler linear algebra operations from which Cholesky is composed. The programmer only needs to define the operations between tags required to perform the factorization, while the DAGuE compiler/runtime system deals with the intricacies of fetching the dependencies of each element. As shown in the illustration, DAGuE automatically defines how data elements and operations are mapped across compute nodes, and automatically distinguishes between local and remote communication.

B. Statement-Level Dataflow

The dataflow model can also be applied to procedural languages by having the statements of a program execute out of order, based on their dependencies.

Swift/T [73] is a C-like language and compiler based on the statement-level dataflow model. Just like DAGuE, Swift/T requires no explicit definition of parallelism or data locality. However, Swift/T's approach differs from DAGuE's in that the former constructs the underlying DAG through a static analysis of program instructions, rather than of the relationships between data elements. During execution, its runtime system creates one task for each statement in the program and manages their data dependencies, while optimizing data and task locality.

Fig. 24 shows a simple example of a Swift/T program and the corresponding DAG generated by Swift/T at compilation time.

Fig. 24: Example code of a Swift/T program and its corresponding DAG. Simplified for clarity.

The value of y will be calculated first (line 4), since it is the only one with no data dependencies. Lines 3 and 5 are executed concurrently afterwards, since they only depend on the value of y. The value of z2 (line 6) is calculated once z1 and x have been obtained. The last steps represent the calculation of the contents of the A[] array (line 7), based on the values of z1 and z2, and the parallel processing of all its elements (line 9).

One of the main advantages of both Swift/T and DAGuE is that they manage communication automatically and use shared memory when tasks are co-located in the same address space. However, the fact that they require the creation of large numbers of extremely fine-grained tasks and dependencies can potentially entail large overheads. Although their compilers and runtime systems are optimized to handle large numbers of elements, it is unclear how they deal with the following problems:

(i) The overhead of managing fine-grained tasks (creation, allocation, dependency evaluation) may be of the same order of complexity as the very operations executed by each task.

(ii) As tasks execute smaller operations, they also produce smaller results. As a consequence, communication operations become extremely fine-grained and susceptible to the cost of latency. Since latency will be the dominant cost of communication in exascale computers, fine-grained communication may be punishing for performance.

(iii) The number of communication operations is correlated with the number of tasks, and therefore with the complexity of the algorithm.

for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        for (int k = 0; k < n; k++)
            C[i][j] += A[i][k] * B[k][j];

Fig. 25: Pseudocode of the naive O(n³) matrix multiplication algorithm kernel.

Nested loops represent pathological cases where performance drops quickly as the problem size increases. Polynomial algorithms, such as the naive matrix multiplication shown in Fig. 25, require the creation of O(n³) tasks (where n is the number of elements on one side of a square matrix). Published tests show that such algorithms fail to scale beyond O(10⁵) cores, which is well below the order of magnitude involved in exascale computers [73]. This is illustrated in Fig. 26, where the number of task completions per second drops beyond 1000 cores due to a nested loop in a Swift/T program.


Fig. 26: Scaling of a nested loop application in Swift/T. Source: [73].

#pragma bamboo overlap
for (int step = 0; step < iterations; step++)
{
    #pragma bamboo send
    MPI_Isend(&Uprev[BoundaryCells] -> [up,down,left,right]);

    #pragma bamboo receive
    MPI_Irecv(&U[GhostCells] <- [up,down,left,right]);

    #pragma bamboo compute
    {
        MPI_Waitall();
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                U[i][j] = U[i-1][j] + U[i+1][j] +
                          U[i][j-1] + U[i][j+1] - 4*U[i][j];
        swap(&U, &Uprev);
    }
}

Fig. 27: Code from Fig. 8 annotated using Bamboo.

C. Region-Level Dataflow

The dataflow model can also be applied to a procedural language where entire regions of code are scheduled, instead of individual statements. Regions are defined as contiguous sections of the code that execute in order and non-preemptively. Programs can have their code regions rearranged statically (at compilation time) or dynamically (at runtime) to execute out of order based on their data dependencies.

Bamboo [52] is a source-to-source translator that reinterprets C/C++ MPI applications to execute as dataflow programs in order to realize communication/computation overlap. Bamboo performs a static analysis of Bamboo-specific annotations and MPI calls in the code and generates a data dependence graph. This graph is used to perform transformations on the source code that enable a data dependence-driven execution, by generating code that is compatible with the Tarragon runtime system [14]. Tarragon provides support for thread-based oversubscription of Bamboo-translated code. For this reason, Bamboo also suffers from the global variable problem observed in the Threaded MPI model.

Bamboo extends the C++ syntax with #pragma directives [8] used to define three types of regions in the code: (i) Overlap regions indicate what parts of the code will be transformed by Bamboo; any code outside an overlap region remains unmodified and executes in order. (ii) Send/receive regions enclose MPI send/receive operations, respectively. (iii) Compute regions enclose computation that depends on the receive region and produces data for the send region.

Fig. 27 shows an example of Bamboo's annotation syntax for the iterative solver presented in Fig. 8.

(a) Oversubscription Factor = 1 + Region-level Dependencies

(b) Oversubscription Factor = 2 + Region-level Dependencies

Fig. 28: Timeline of core usage with data dependence-driven execution of task regions with (a) V=1 and (b) V=2. c = number of cores.

MPI blocking operations (e.g., MPI_Waitall) are postponed by Bamboo to prevent the entire process from being preempted while there are still regions ready to execute (i.e., regions whose data dependencies are satisfied).

MPI/SMPSs [42] is also a #pragma directive-based interface for MPI C/C++ applications. The SMPSs model [58] was originally proposed as a framework to describe dependencies between tasks in OpenMP. MPI/SMPSs is an adaptation of the SMPSs interface to describe data dependencies across MPI processes.

Unlike Bamboo, where dependencies between regions are obtained from a static analysis of the code, SMPSs requires the programmer to explicitly name and define dependencies between regions. Although MPI/SMPSs gives more flexibility, allowing more complex dependency graphs than the (receive→compute→send) pattern used by Bamboo, specifying dependencies manually may require additional effort from programmers.
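The sketch below illustrates the flavor of this interface. The #pragma css syntax and clause names are our assumption based on the SMPSs model, and the helper functions and sizes are invented for illustration; they are not taken from [42].

#include <mpi.h>

#define N 1024

/* Receiving a halo becomes a task that produces the halo buffer. */
#pragma css task output(halo[N])
void recv_halo(double *halo, int neighbor)
{
    MPI_Recv(halo, N, MPI_DOUBLE, neighbor, 0,
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

/* The update becomes a task that consumes the halo; the runtime will not
   schedule it until the matching recv_halo() task has completed, and other
   ready tasks can run in the meantime, overlapping communication. */
#pragma css task input(halo[N]) inout(row[N])
void update_row(double *halo, double *row)
{
    for (int j = 0; j < N; j++)
        row[j] = 0.5 * (row[j] + halo[j]);
}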

Both Bamboo and MPI/SMPSs automatically convert MPI applications into semantically equivalent programs that execute based on data dependencies. Data dependence-driven execution has a significant impact in hiding the cost of network latency. We have seen in Fig. 13a that, although oversubscription enables overlapping computation and communication by switching tasks when one of them is blocked waiting for the arrival of messages, each individual task still performs communication and computation in separate stages.

As a result of dataflow execution, a task can be scheduled as soon as at least one of its regions has satisfied its dependencies. The effect is a finer-grained execution of each task, realizing overlap without the additional context-switching overhead we observed in oversubscription-only executions (Fig. 10).

Fig. 28a shows how a single task can overlap communication and computation in non-oversubscribed executions. Regions are represented as fine-dotted division lines within a task. Compared with Fig. 10, we see that this mechanism can potentially enable better core usage, as compute regions still execute while communication operations run in the background.

It is important to note that overlap by oversubscription and data dependence-driven execution are not mutually exclusive. Using both mechanisms can improve processor utilization more than either of them separately. Fig. 28b shows how their combination can potentially improve over the oversubscription-only approach shown in Fig. 13a.

It has been shown for a variety of applications [51] [50] that the best performance is obtained by a combination of oversubscription and data dependence-driven execution.


Fig. 29: Speedup of a 3D Jacobi solver using Bamboo subject to different oversubscription factors. Source: [50].

Fig. 29 shows the speedups of a 7-point-stencil 3D Jacobi solver using Bamboo.

In the experiment, Bamboo is configured with V = {1, 2, 4, 8} to solve a 3072x3072x3072 grid on the NERSC Hopper supercomputer over {12288, 24576, 49152, 98304} cores. The peak improvement is given by a better overlap of communication and computation. As the number of processor cores (c) increases, the cost of communication becomes more significant. As a consequence, the improvement obtained from using Bamboo also increases with c, as shown in the plot.

The results obtained by Bamboo show how the performance of an application can improve by scheduling regions of code based on their data dependencies, instead of suspending an entire task on message operations.

D. Task-Level Dataflow

Another way the dataflow model can be applied to procedural languages is to schedule entire functions. We call this the Task-Level Dataflow (TLD) model. TLD shares many similarities with the APGAS model: (i) tasks are dynamically instantiated to execute a single function, (ii) the locality of data and tasks can be explicitly defined by the programmer, and (iii) locality defines how data is shared among different tasks.

The main difference from the APGAS model lies in how tasks are scheduled. Whereas APGAS models allow creating execution dependencies between tasks (i.e., the start of a task can be triggered by the completion of another task or event), the execution of tasks in the TLD model depends exclusively on the availability of data.

The Legion [9] programming language and runtime system is one instance of the TLD model. In a Legion program, a clear distinction is made between data and task semantics.

Data objects in a Legion program need to be mapped into regions (not to be confused with regions in the region-level dataflow model, which refer to sections of code). Regions are structures for grouping data objects that share the same access privilege level. Legion defines several types of privilege levels, but for our analysis they can essentially be reduced to two: exclusive and simultaneous. Exclusive regions can only be modified by a single task, while simultaneous regions enclose objects that may be modified by several tasks and therefore require synchronization. Simultaneous regions define data dependencies: a task will not be issued until a previously issued task finishes modifying the region.

Fig. 30 shows an example of how data objects are divided in a circuit simulation algorithm in Legion.

Fig. 30: Nodes and wires in a circuit simulation program, split into three partitions. Source: [9]

Each circle represents a circuit node (with voltage/current/charge values), and the connections represent the wires connecting the nodes. To allow parallel execution, the circuit is divided into three partitions with a similar number of nodes. We can then define 6 regions: 3 exclusive regions (cyan) that contain nodes modified by a single task, and 3 simultaneous regions (red) representing the boundaries between partitions, which are modified by two or more tasks.

Once the data regions are defined, the programmer needs to instantiate the tasks that will process them. This is done via the spawn function, which has similar semantics to the async function in UPC++ and X10, requiring the function to execute to be passed as an argument. However, instead of requiring the task locality as an argument (as with places in UPC++), Legion infers task locality dynamically from the data region(s) associated with the new task. The scheduling of tasks is given by how the dependencies between data regions and their privilege levels are structured.

The semantics of a Legion program is given only by defining data regions and instantiating tasks using the spawn function. However, to realize full performance, a programmer also needs to define the mapping of regions to physical resources. An optimal mapping will assign neighboring data regions to the same resources to maximize the use of shared memory.

Legion gives the programmer an interface to define almost all aspects of the execution of a program: data dependencies, task locality, and task-to-physical-resource mapping. However, Legion requires that many of these aspects be explicitly defined by the programmer, whereas tools like Bamboo or MPI/SMPSs implement simpler ways to support a dataflow execution.

IX. DISCUSSION

A. Overview of Alternative Models

Table I shows how the programming tools we surveyed compare regarding their design paradigms. All of them provide one or more mechanisms to address the challenges brought by exascale computing.

Regarding the cost of internal communication, we have seen that defining the locality of data/tasks can help deal with the data duplication problem. Locality gives runtime systems the necessary information for mapping groups of data/tasks onto the same physical resources, enabling the use of shared memory, removing the need for buffering, and preventing unnecessary data motion within a compute node.

The tools surveyed also simplify the way communication is expressed, reducing the effort required of programmers and eliminating the need for multiple interfaces.


Tool        Model             Oversubscription / Communication (Two-Sided, Explicit 1S, Implicit 1S) / Locality Definition (Task, Data) / Dependency-driven (Task, Data)

MPI         Message Passing   ✓ ✓
FG-MPI      Threaded MP       ✓ ✓
MPI+X       Hybrid            ✓ ✓
MPI+MPI     Hybrid            ✓ ✓ ✓
UPC         PGAS              ✓ ✓ ✓ ✓
UPC++       APGAS             ✓ ✓ ✓ ✓ ✓
X10         APGAS             ✓ ✓ ✓ ✓
Charm++     APGAS             ✓ ✓ ✓ ✓
DAGuE       Dataflow          ✓ ✓ ✓ ✓
Swift/T     Dataflow          ✓ ✓ ✓ ✓
Bamboo      Dataflow          ✓ ✓ ✓
MPI/SMPSs   Dataflow          ✓ ✓
Legion      Dataflow          ✓ ✓ ✓ ✓
Ideal Model                   ✓ ✓ ✓ ✓

TABLE I: Classification of the tools surveyed regarding their design paradigms.

The PGAS model paved the way by integrating local and remote communication under an implicit one-sided communication interface. However, this type of communication proved to be less than efficient for messages larger than a few bytes. In APGAS models this is partially solved by providing the concept of task locality, which guarantees that data and tasks running in the same place share the same address space. As a result, APGAS languages are able to integrate shared memory and distributed communication, just as in the Hybrid (MPI+X) model, but using a single programming interface.

Some dataflow programming tools, such as DAGuE and Swift/T, do not provide any interface for specifying locality, as this information does not play a role in defining the semantics of the program. Instead, these tools infer the best logical-to-physical resource mapping automatically during compilation or at runtime. On the other hand, Legion requires programmers to use an interface for mapping data regions to physical resources manually. Although a manual mapping may be more efficient, mapping logical and physical resources by hand can be extremely complex and may not be portable to other system configurations.

The contrast between manual and automatic mapping serves to illustrate that there is a trade-off between simplicity and efficiency regarding data locality. A possible alternative to get the best of both worlds could be to use compiler directives, as in Bamboo and the SMPSs model. Compiler directives may be a good alternative because they fit the same criteria required of locality specifications: (i) they are optional, since their inclusion or omission does not affect the semantics of a program; (ii) they are portable, since they can be interpreted in different ways depending on the system; and (iii) they can provide a simpler interface than that required by a manual mapping (as in Legion).

Regarding the cost of network communication, most tools support oversubscription by combining user-level and kernel-level threads. We have shown how this mechanism is the basis for enabling communication and computation overlap. SPMD-based libraries, such as FG-MPI, Bamboo, and UPC, offer a simple way to achieve oversubscription by defining an oversubscription factor, that is, by creating a fixed number V of tasks per core. It is less intuitive, however, how to achieve oversubscription in asynchronous models (e.g., APGAS, X10, DAGuE, Swift/T, and Legion), where tasks are created dynamically. For these models, either the programmer or the runtime system needs to make sure there are enough tasks at all times to maximize core usage and perform communication operations simultaneously.

Dataflow models also provide data dependency-driven execution. This mechanism increases the overlap potential of an oversubscribed execution. In some cases the dependency graph is automatically inferred from the code, and in other cases it needs to be specified manually.

While the task dependency-driven execution approach used in PGAS/APGAS tools could potentially enable communication/computation overlap, it is not a straightforward way of representing data dependencies. With task dependencies, communication operations must be described by communication-only tasks upon which other compute-only tasks depend. This description of data dependencies is not as direct as in dataflow models, where communication operations are clearly identified as such.

B. Towards an Ideal Model

We consider that each model provides a unique set of strengths and weaknesses. It is certainly not possible to build an ideal programming model, since memory access, communication, and computation patterns vary widely among different applications. However, in this section we seek to define the design aspects we believe would come closest to such an ideal model:

(i) Follow the Single Program Multiple Data (SPMD) model. We consider that SPMD is still a relevant way to decompose the problem domain of most scientific computing motifs. This involves creating a fixed number of tasks for the distribution of the workload, as opposed to asynchronous models where tasks are created dynamically at runtime. The simplicity of this approach would also make it easier for MPI programmers to transition to such an ideal model.

(ii) Oversubscription = Locality. An ideal model should directly relate the oversubscription factor and the locality of tasks. This approach would let users define how many tasks are created per node and would automatically provide support for shared memory among tasks on the same node.

(iii) Provide a single (implicit) communication protocol, as in the PGAS/APGAS models. There should be no semantic difference between communication among tasks in the same node and across nodes. The only difference between internal and network communication should be their performance.

To avoid the problem with large messages, an ideal model should have heuristics to coalesce consecutive groups of pointer accesses into a single communication operation. Although this may not be possible in general, it would reduce the per-element access overhead we observed in PGAS models.

Additionally, an ideal model should provide a simple annotation-based syntax for programmers to distinguish between private and shared variables. For shared variables, the model should provide a way to indicate how the allocated data will be distributed among tasks, as with affinity in UPC.

(iv) Provide an easy means for defining data dependencies. The experience of the Bamboo translator indicates that it is possible to have a simple annotation-based interface for defining data dependencies between regions of code to maximize communication/computation overlap.

X. CONCLUSIONS

In this study we have presented some of the challenges posed by exascale computing and the models that have been proposed to address them.

For many programmers, MPI is still a viable option, since it has been able to catch up with some of the latest ideas and developments, especially in the use of shared memory. However, it is unclear whether the ad hoc addition of features to the MPI specification will remain an option in the long run. As it stands now, it can be cumbersome even for expert programmers to fully understand and utilize the whole range of tools provided by MPI.

Alternative models have shown that it is possible to provide additional mechanisms such as oversubscription, task/data locality, shared memory, and data dependence-driven execution to alleviate the costs of communication while using simple programming interfaces.

Future research needs to focus on integrating mechanisms that address the challenges we have described, as well as others that were outside the scope of this study. Many-core devices continue to be a challenge due to their own interfaces and communication protocols, making them hard to reconcile with existing programming models. Other aspects we have not covered, such as fault tolerance, load balancing, and power consumption, will also become important challenges to address as we approach exascale computing.

ACKNOWLEDGEMENTS

I would like to thank my Ph.D. advisor, Prof. Scott Baden, for his insightful comments and constructive criticism throughout the process of writing this study, and Dr. Tan Nguyen (Lawrence Berkeley National Laboratory) for providing additional feedback.

REFERENCES

[1] T. S. Abdelrahman and G. Liu. "Overlap of Computation and Communication on Shared-memory Networks-of-workstations". In: Cluster Computing. Commack, NY, USA: Nova Science, 2001.
[2] B. Acun et al. "Parallel Programming with Migratable Objects: Charm++ in Practice". In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC '14. New Orleans, Louisiana, 2014.
[3] D. A. Adams. A Computation Model with Data Flow Sequencing. Stanford University, 1968.
[4] S. Amarasinghe et al. Exascale software study: Software challenges in extreme scale systems. Tech. rep. 2009.
[5] K. Asanovic et al. The Landscape of Parallel Computing Research: A View from Berkeley. Tech. rep. EECS Department, University of California, Berkeley, 2006.
[6] S. Ashby et al. "The opportunities and challenges of exascale computing". In: Summary Report of the US Department of Energy Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee (2010).
[7] P. Balaji et al. "MPI on a Million Processors". In: Proceedings of the 16th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface. Espoo, Finland, 2009.

[8] Bamboo User Manual - bamboo-translate command. 2015. URL: http://bamboo.ucsd.edu/commands/bamboo-translate.html.
[9] M. Bauer et al. "Legion: Expressing Locality and Independence with Logical Regions". In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. SC '12. Salt Lake City, Utah, 2012.
[10] G. Bosilca et al. "DAGuE: A Generic Distributed DAG Engine for High Performance Computing". In: Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), 2011 IEEE International Symposium on. May 2011.
[11] Z. Budimlic et al. "Concurrent Collections". In: Scientific Programming 18.3-4 (2010).
[12] W. Carlson et al. "Introduction to UPC and Language Specification". In: Technical Report CCS-TR-99-157, Center for Computing Sciences, Bowie, MD (Mar. 1999).
[13] P. Charles et al. "X10: An Object-oriented Approach to Non-uniform Cluster Computing". In: SIGPLAN Not. 40.10 (Oct. 2005).
[14] P. Cicotti. "Tarragon: a Programming Model for Latency-Hiding Scientific Computations". PhD thesis. University of California, San Diego, 2011.
[15] P. Colella. Defining Software Requirements for Scientific Computing. 2014. URL: http://view.eecs.berkeley.edu/w/images/temp/6/6e/20061003235551!DARPAHPCS.ppt.
[16] D. E. Culler et al. "Parallel programming in Split-C". In: Supercomputing '93. Proceedings. 1993, pp. 262–273.

[17] NASA Advanced Supercomputing Division. URL: https://www.nas.nasa.gov/Software/FAST/RND-93-010.walatka-clucas/htmldocs/chp_16.surferu.html.
[18] D. Doerfler et al. "Applying the Roofline Performance Model to the Intel Xeon Phi Knights Landing Processor". In: Intel Xeon Phi User Group Workshop (IXPUG). Chicago, Illinois, 2016.
[19] J. C. Díaz Martín et al. "An MPI-1 Compliant Thread-Based Implementation". In: Recent Advances in Parallel Virtual Machine and Message Passing Interface. Vol. 5759. Springer, 2009.
[20] L. Flynn. Intel Halts Development of 2 New Microprocessors. News Report. The New York Times, 2004.
[21] Message Passing Interface Forum. URL: http://www.mpi-forum.org/.
[22] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard - Version 3.0. 2012. URL: https://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf.
[23] E. Georganas et al. "HipMer: An Extreme-scale De Novo Genome Assembler". In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC '15. Austin, Texas, 2015.
[24] L. Greengard and V. Rokhlin. "A Fast Algorithm for Particle Simulations". In: J. Comput. Phys. 73.2 (Dec. 1987).
[25] N. Hemsoth. The Supercomputing Strategy That Makes Airbus Soar. July 2015. URL: http://www.nextplatform.com/2015/07/22/the-supercomputing-strategy-that-makes-airbus-soar/.

[26] J. L. Hennessy and D. A. Patterson. Computer Architecture, Fifth Edition: A Quantitative Approach. 5th ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.

[27] T. Hoefler and A. Lumsdaine. "Overlapping Communication and Computation with High Level Communication Routines". In: Cluster Computing and the Grid, 2008. CCGRID '08. 8th IEEE International Symposium on. 2008.
[28] T. Hoefler et al. "MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory". In: Computing 95.12 (2013).
[29] Hydrodynamics Challenge Problem, Lawrence Livermore National Laboratory. Tech. rep. LLNL-TR-490254. Livermore, CA, pp. 1–17.
[30] C. Iancu et al. "Oversubscription on multicore processors". In: Parallel & Distributed Processing (IPDPS), 2010 IEEE International Symposium on. IEEE. 2010, pp. 1–11.
[31] Intel MPI Library. URL: https://software.intel.com/en-us/intel-mpi-library.
[32] A. Jackson and I. Kirker. Unified Parallel C: UPC on HPCx. UoE HPCx Ltd., 2008.
[33] L. V. Kale and S. Krishnan. "CHARM++: A Portable Concurrent Object Oriented System Based on C++". In: Proceedings of the Eighth Annual Conference on Object-oriented Programming Systems, Languages, and Applications. OOPSLA '93. Washington, D.C., USA: ACM, 1993.
[34] L. V. Kale et al. "The CHARM Parallel Programming Language and System: Part I – Description of Language Features". In: Parallel Programming Laboratory Technical Report #95-02 (1994).
[35] H. Kamal and A. Wagner. "An integrated fine-grain runtime system for MPI". In: Computing 96.4 (2014).
[36] H. Kamal and A. Wagner. "FG-MPI: Fine-grain MPI for multicore and clusters". In: Parallel Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010 IEEE International Symposium on. Apr. 2010.
[37] S. Kamil et al. "Communication Requirements and Interconnect Optimization for High-End Scientific Applications". In: IEEE Transactions on Parallel and Distributed Systems 21.2 (2010).
[38] J. Kim, J. Balfour, and W. J. Dally. "Flattened Butterfly Topology for On-Chip Networks". In: IEEE Computer Architecture Letters 6.2 (2007).
[39] J. Kim et al. "Technology-Driven, Highly-Scalable Dragonfly Topology". In: Computer Architecture, 2008. ISCA '08. 35th International Symposium on. 2008, pp. 77–88.
[40] C. E. Leiserson. "Fat-trees: Universal Networks for Hardware-efficient Supercomputing". In: IEEE Trans. Comput. 34.10 (Oct. 1985).

[41] Top 500 Supercomputer Sites list. URL: top500.org.
[42] V. Marjanovic et al. "Overlapping Communication and Computation by Using a Hybrid MPI/SMPSs Approach". In: Proceedings of the 24th ACM International Conference on Supercomputing. ICS '10. 2010.
[43] C. Min and F. Gibou. "A second order accurate level set method on non-graded adaptive Cartesian grids". In: Journal of Computational Physics 225.1 (2007).
[44] P. Moin. Fundamentals of engineering numerical analysis. Cambridge University Press, 2010.
[45] G. E. Moore. "Cramming More Components onto Integrated Circuits". In: IEEE Electronics 38 (1965).

[46] MPICH. URL: http://www.mpich.org/.
[47] MVAPICH. URL: http://mvapich.cse.ohio-state.edu/.
[48] NERSC. Edison compute node configuration. 2016. URL: http://www.nersc.gov/users/computational-systems/edison/configuration/.
[49] NERSC. Edison Interconnect. 2016. URL: http://www.nersc.gov/users/computational-systems/edison/configuration/interconnect/.
[50] T. Nguyen. "Bamboo: Automatic Translation of MPI Source into a Latency-Tolerant Form". PhD thesis. University of California, San Diego, 2014.
[51] T. Nguyen et al. "Bamboo - Preliminary scaling results on multiple hybrid nodes of Knights Corner and Sandy Bridge processors". In: Third International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing. May 2013.
[52] T. Nguyen et al. "Bamboo: Translating MPI Applications to a Latency-tolerant, Data-driven Form". In: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. SC '12. Salt Lake City, Utah: IEEE Computer Society Press, 2012.
[53] R. Numrich and J. Reid. "Co-array Fortran for Parallel Programming". In: SIGPLAN Fortran Forum 17.2 (Aug. 1998).
[54] P. Nyberg. The Critical Role of Supercomputers in Weather Forecasting. July 2013. URL: http://www.cray.com/blog/the-critical-role-of-supercomputers-in-weather-forecasting/.

[55] Open MPI. URL: https://www.open-mpi.org/.
[56] OpenMP Application Program Interface - Version 4.0. July 2013. URL: http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf.
[57] A. Patrizio. U.S. Army plans for a 100 petaflop supercomputer. Feb. 2015. URL: http://www.itworld.com/article/2889072/u-s-army-plans-for-a-100-petaflop-supercomputer.html.
[58] J. M. Perez, R. M. Badia, and J. Labarta. "A dependency-aware task-based programming environment for multi-core architectures". In: Cluster Computing, 2008 IEEE International Conference on. Sept. 2008.
[59] A. E. Randles. "Modeling Cardiovascular Hemodynamics Using the Lattice Boltzmann Method on Massively Parallel Supercomputers". PhD thesis. Harvard University, 2013.
[60] T. Rimmer. Intel® Omni-Path Architecture Technology Overview. 2015. URL: http://www.hoti.org/hoti23/slides/rimmer.pdf.
[61] Andrew S. and Rupak B. Communication Studies of DMP and SMP Machines. Tech. rep. NAS-97-004. NAS, 1997.

[62] L. I. Sedov. Similarity and Dimensional Methods in Mechanics. 1959.
[63] E. Solomonik and J. Demmel. "Communication-optimal Parallel 2.5D Matrix Multiplication and LU Factorization Algorithms". In: Proceedings of the 17th International Conference on Parallel Processing - Volume Part II. Euro-Par'11. Bordeaux, France, 2011.
[64] D. T. Stark et al. "Early Experiences Co-scheduling Work and Communication Tasks for Hybrid MPI+X Applications". In: Proceedings of the 2014 Workshop on Exascale MPI. ExaMPI '14. New Orleans, Louisiana, 2014, pp. 9–19.
[65] H. Sutter. "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software". In: Dr. Dobb's Journal 30.3 (Mar. 2005).
[66] H. Tang and T. Yang. "Optimizing Threaded MPI Execution on SMP Clusters". In: Proc. of the 15th ACM International Conference on Supercomputing. ACM Press, 2001.
[67] J. E. Thornton. "Parallel Operation in the Control Data 6600". In: Proceedings of the 1964 Fall Joint Computer Conference, Part II: Very High Speed Computer Systems. San Francisco, California, 1965.
[68] R. M. Tomasulo. "An Efficient Algorithm for Exploiting Multiple Arithmetic Units". In: IBM Journal of Research and Development 11.1 (1967).
[69] D. M. Tullsen, S. J. Eggers, and H. M. Levy. "Simultaneous Multithreading: Maximizing On-chip Parallelism". In: 25 Years of the International Symposia on Computer Architecture. ISCA '98. Barcelona, Spain, 1998.


[70] L. G. Valiant. "A Bridging Model for Parallel Computation". In: Communications of the ACM 33.8 (Aug. 1990).
[71] N. Vrvilo. The Habanero CnC Framework: A Demonstration of CnC Unification. URL: https://engineering.purdue.edu/plcl/cnc2015/slides/cnc-framework.pdf.
[72] S. Williams, A. Waterman, and D. Patterson. "Roofline: An Insightful Visual Performance Model for Multicore Architectures". In: Commun. ACM 52.4 (Apr. 2009).
[73] J. M. Wozniak et al. "Swift/T: Large-Scale Application Composition via Distributed-Memory Dataflow Processing". In: Cluster, Cloud and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium on. May 2013.
[74] K. Yelick et al. "Titanium: A High-Performance Java Dialect". In: ACM. 1998, pp. 10–11.
[75] G. Zheng et al. "Automatic Handling of Global Variables for Multi-threaded MPI Programs". In: Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on. Dec. 2011.
[76] G. Zheng et al. "Hierarchical Load Balancing for Charm++ Applications on Large Supercomputers". In: Parallel Processing Workshops (ICPPW), 2010 39th International Conference on. 2010.
[77] Y. Zheng et al. "UPC++: a PGAS extension for C++". In: 28th IEEE International Parallel and Distributed Processing Symposium (IPDPS). 2014.
[78] J. Zverina and W. Froelich. UC San Diego Team Achieves Petaflop-Level Earthquake Simulations on GPU-Powered Supercomputers. Apr. 2013. URL: http://www.sdsc.edu/News%20Items/PR040213_earthquake.html.