
GFlink: An In-Memory Computing Architecture on Heterogeneous CPU-GPU Clusters for Big Data

Cen Chen, Kenli Li, Senior Member, IEEE, Aijia Ouyang, Zeng Zeng, and Keqin Li, Fellow, IEEE

Abstract—The increasing main memory capacity and the explosion of big data have fueled the development of in-memory big data management and processing. By offering an efficient in-memory parallel execution model that can eliminate the disk I/O bottleneck, existing in-memory cluster computing platforms (e.g., Flink and Spark) have already proven to be outstanding platforms for big data processing. However, these platforms are merely CPU-based systems. This paper proposes GFlink, an in-memory computing architecture on heterogeneous CPU-GPU clusters for big data. Our proposed architecture extends the original Flink from CPU clusters to heterogeneous CPU-GPU clusters, greatly improving the computational power of Flink. Furthermore, we propose a programming framework based on Flink's abstract model, i.e., DataSet (DST), hiding the programming complexity of GPUs behind simple and familiar high-level interfaces. To achieve high performance and good load balance, an efficient JVM-GPU communication strategy, a GPU cache scheme, and an adaptive locality-aware scheduling scheme for three-stage pipelined execution are proposed. Extensive experimental results indicate that the high computational power of GPUs can be efficiently utilized, and that implementations on GFlink outperform those on the original CPU-based Flink.

Index Terms—Big data, GPGPU, heterogeneous cluster, in-memory computing, OpenCL


1 INTRODUCTION

1.1 Motivation

WITH the rapid development of the Internet and Internet of Things technologies, recent years have witnessed a rapid surge of data. Hadoop, an open-source MapReduce framework [1], has been highly successful in implementing large-scale data-intensive applications on commodity clusters. However, as it is a disk-based system, each MapReduce stage can only interact with other stages through the Hadoop Distributed File System (HDFS). Flink [2] and Spark [3] are designed to process data-intensive applications with a distributed in-memory architecture. Because their in-memory parallel execution model can substantially shorten the time spent on disk I/O operations, they are better suited to data mining and machine learning workloads, which require many iterative operations.

To some extent, Spark and Flink are quite similar. For example, both are JVM-based in-memory cluster computing platforms whose core execution model is MapReduce. Moreover, both adopt a master-slave model. Marcu et al. [4] found that neither of the two frameworks outperforms the other for all data types, sizes, and job patterns. One key difference between them is the treatment of stream processing. Apache Spark treats streaming as fast batch processing, while Apache Flink treats batch processing as a special case of stream processing. Apache Flink provides event-level processing, also known as real-time streaming, whereas Spark uses mini-batches, which do not provide event-level granularity. Hence, an important reason why we chose Flink as the basis of the whole framework is the need for future extension toward a better stream processing implementation.

Over the past few years, graphics processing units (GPUs) have emerged as parallel processors because of their high computational power and low price, especially in the high-performance computing (HPC) area. It is now a mainstream trend to use heterogeneous CPU-GPU clusters. In many supercomputers, such as Tianhe and Titan, CPUs and GPUs cooperate to deliver powerful computing. Even in personal computers, the combination of a CPU and a GPU provides high performance at a low price. We have made great efforts to promote the application of GPUs. Li et al. [5] proposed a model for scheduling stochastic parallel applications on heterogeneous cluster systems. Yang et al. [6] put forward performance optimization strategies for SpMV on GPUs and multicore CPUs.

- C. Chen and K. Li are with the College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China, and also with the National Supercomputing Center in Changsha, Changsha, Hunan 410082, China. E-mail: {chencen, lkl}@hnu.edu.cn.

- A. Ouyang is with the Department of Information Engineering, Zunyi Normal College, Zunyi, Guizhou 563006, China. E-mail: [email protected].

- Z. Zeng is with the Institute for Infocomm Research, A*STAR, Singapore 138632. E-mail: [email protected].

- K. Li is with the College of Information Science and Engineering, Hunan University, Changsha, Hunan 410082, China, and the National Supercomputing Center in Changsha, Changsha, Hunan 410082, China, and also with the Department of Computer Science, State University of New York, New Paltz, NY 12561. E-mail: [email protected].

Manuscript received 1 Apr. 2017; revised 14 Dec. 2017; accepted 23 Dec. 2017. Date of publication 16 Jan. 2018; date of current version 11 May 2018. (Corresponding author: Kenli Li.) Recommended for acceptance by Z. Du. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TPDS.2018.2794343


Programming on GPU clusters for big data is currently quite complex and difficult, as programmers are forced to perform explicit transfers between processes in many existing distributed programming models (such as MPI [7] plus OpenMP [8] and CUDA). Using these programming models, programmers have to be conscious of system-level issues during the two-level parallelization, such as cache misses, out-of-memory errors, and load balancing among heterogeneous cores in the cluster. Reliability is another challenge. As systems scale up, the time between hardware failures tends to decrease. Reliability thus acts as the main driver for constructing our system, GFlink, on top of Flink. Flink has a robust job management system, as it uses replication and error detection to schedule around failures [9].

However, the open-source versions of in-memory cluster computing platforms such as Flink and Spark can currently run only on CPUs. That is to say, these platforms cannot leverage the computing resources of GPUs that may be present in the nodes of the cluster. Recently, many studies have focused on accelerating in-memory cluster computing with GPUs [10], [11], [12], [13], [14]. However, first, all these studies need to convert JVM objects to GPU-friendly buffers manually, which places a heavy burden on programmers. Moreover, the overhead of this transformation is significant compared with the actual useful computation. Second, many studies need to copy buffers from the JVM heap to native memory, and then transfer the buffers to GPUs over PCIe, which incurs extra memory copy overhead. Some have proposed GPU sharing schemes; however, they fail to take data locality, a critical aspect of big data processing, into consideration when managing tasks for GPUs. A detailed discussion of these studies is presented in Section 2.3.

1.2 Our Contributions

Considering these aspects, we propose GFlink: an in-memory computing architecture on heterogeneous CPU-GPU clusters for processing big data. An earlier and simpler version of this paper was included in a conference proceeding [15]. Compared with the conference version, we have improved the communication strategy to match the contents of user-defined objects in main memory with the layout of CUDA structs. This scheme avoids serialization and deserialization between JVM objects and CUDA structs. Furthermore, a GPU cache scheme has been proposed to avoid redundant allocations/deallocations and data transfers for GPUs. Our main contributions are as follows.

- Analysis: We have analyzed and identified the challenges of effectively using GPUs in current distributed in-memory data processing systems. GFlink is carefully designed to address these challenges.

- Practicability: GFlink is compatible with both the compile-time and run-time of Flink, inheriting the existing outstanding features of Flink, such as high reliability and compatibility with distributed file systems. Moreover, we have extended the programming model of Flink so that programming on GFlink becomes easier.

- Efficient Communication Strategy: A user-defined data layout scheme and an efficient JVM-GPU communication strategy are proposed in GFlink. The contents of user-defined objects are stored in the off-heap memory space as raw bytes, matching the data layout of CUDA structs. Therefore, the data can be transferred to GPUs without modification. Furthermore, an asynchronous transfer scheme and a GPU caching scheme are provided in GFlink.

- Adaptive Locality-Aware Scheduling: Considering data locality, an adaptive locality-aware scheduling scheme is proposed to simultaneously achieve load balance among GPUs, hide the cost of PCIe transfers, and avoid unnecessary PCIe transfers.

The remainder of this paper is organized as follows. Section 2 provides more background information and reviews related literature. Section 3 describes the design and architecture of GFlink. Section 4 presents our proposed efficient JVM-GPU communication strategy. Section 5 details the stream execution model. Section 6 shows the performance results of GFlink. Section 7 concludes this paper.

2 RELATED WORK AND BACKGROUND

2.1 GPGPU

The Graphics Processing Unit was extended to the general-purpose high-performance computing area after the emergence of General-Purpose GPU (GPGPU) computing under frameworks such as CUDA. Using CUDA, programmers need to write an application with two portions of code: functions to be executed on the CPU host, and functions to be executed on the GPU device, known as CUDA kernels.

In most cases, data need to be transferred to the GPU for processing and transferred back after kernel execution. Gregg et al. [16] benchmarked a broad set of GPU kernels on a number of platforms with different GPUs and showed that, when memory transfer times are included, running a kernel can easily take 2x to 50x longer than the GPU processing time alone. Reducing the time of data transfer between main memory and the device memory of GPUs is therefore an important optimization for improving the performance of applications on GPUs [16], [17].

Another important performance optimization is to coalesce the global memory accesses generated by streaming multiprocessors (SMs). Coalesced memory access has long been advocated as one of the most important off-chip memory access optimizations for modern GPUs. Many studies have focused on optimizing data layout to improve performance [18], [19], [20]. Generally speaking, there are three types of data layout: Array-of-Structures (AoS), Structure-of-Arrays (SoA), and Array-of-Primitives (AoP). An illustrative example is shown in the listing below. The performance of the same GPU application may differ drastically depending on the data layout used [21], [18].

2.2 MapReduce on GPUs

Because of the high computational power and memory bandwidth of GPUs, many studies have focused on accelerating MapReduce with GPUs. Fang et al. [22] designed Mars, a MapReduce runtime system accelerated by graphics processing units, and then integrated it into Hadoop for cluster computing. Hong et al. [23] proposed MapCG, a MapReduce framework that provides source-code-level portability between CPU and GPU, so that programmers only need to write one version of code that can be compiled and executed efficiently on either CPUs or GPUs without modification. Chen et al. [24] accelerated the MapReduce model on a coupled CPU-GPU architecture to make full use of CPU and GPU computing resources. All these studies focused on implementing MapReduce on GPUs and integrating GPUs into Hadoop. In Hadoop, each MapReduce stage can only interact with other stages through the HDFS, which requires much time for I/O operations. Therefore, they cannot benefit from the in-memory computing model provided by in-memory cluster computing platforms.

//Example of AoS
struct Pt {
    float x;
    float y;
};
struct Pt myPts[N];

//Example of SoA
struct Pt {
    float x[N];
    float y[N];
};
struct Pt myPts;

//Example of AoP
float x[N];
float y[N];
__kernel void DA(__global float *x,
                 __global float *y) {
}

2.3 In-Memory Cluster Computing on GPUs

In-memory cluster computing platforms such as Flink and Spark provide an abstraction model for distributed data (e.g., the Resilient Distributed Dataset (RDD) [3] in Spark and the DataSet (DST) in Flink). This abstract data model offers a series of high-level transformation and action interfaces, including Map, Reduce, Join, Group, and Count, enabling programmers to work easily. DSTs represent a collection of distributed items, which can be concurrently manipulated across many computing nodes. The kernel computation in Flink and Spark is a MapReduce model, which contains two high-level computational stages: Map and Reduce.

Li et al. [10] proposed a heterogeneous CPU-GPU Spark platform for machine learning algorithms. However, the communication between Spark and the GPU is based on RMI, which utilizes socket-like communication APIs layered on top of the TCP/IP protocol stack. This scheme introduces large extra overheads (e.g., serialization and deserialization, and the overhead caused by passing through the TCP/IP protocol stack). To improve the communication performance, Yuan et al. [11] proposed a Spark-based system named Spark-GPU to accelerate in-memory data processing on clusters. In Spark-GPU, the Java Native Interface (JNI) is utilized to conduct the communication between the JVM and GPUs. JNI is provided by the Java platform to allow Java programs to interact with local libraries written in other languages (e.g., C and C++). Through JNI, the interfaces exported by local libraries can be directly called by Java code. However, the data to be processed on GPUs need to be transferred from JVM memory to native memory, and then from native memory to the device memory of GPUs, resulting in extra memory copy overhead. Furthermore, the relationship between the data layout of cached data in the JVM and that of the buffers in GPUs is unclear. It appears that, with Spark-GPU, transformation between JVM objects and native buffers is needed, which may incur large overhead.

Grossman et al. [12] proposed SWAT, in which GPUs are integrated into Spark through Aparapi and OpenCL. In SWAT, the communication between JVMs and GPUs is also based on JNI. However, before transferring data to GPUs, some threads are responsible for converting and accumulating JVM objects to form buffers, greatly reducing the efficiency of communication. Chen et al. [25] proposed a parallel hierarchical ELM algorithm on Flink and GPUs.

3 DESIGN AND ARCHITECTURE

3.1 Challenges for Integration

Generally speaking, applications running on Flink usually have rich data parallelism, which matches the GPU's parallel execution model. However, due to the different properties of Flink and GPUs, it is a non-trivial task to efficiently integrate GPUs into Flink. Some challenges are listed below.

Communication: As we know, the official CUDA driver from NVIDIA only supports C/C++. However, the tasks of Flink are executed in JVMs. Therefore, to integrate GPUs into the existing architecture of Flink, the first problem we come across is how to provide an efficient strategy for communication between JVMs and GPUs. A number of issues complicate an efficient communication strategy.

First, CPUs and GPUs have separate memory spaces, requiring explicit data transfers between CPU and GPU memory. Because of the JVM's memory management mechanisms, one of which is the JVM's garbage collection (GC) function, the virtual addresses and actual physical addresses of values or objects in the JVM are not fixed and are invisible to programmers. Therefore, it is not possible to transfer data or objects in JVMs to GPUs directly over the PCIe bus. A naive method, as introduced in [12], [13], is to convert and accumulate JVM objects into GPU-friendly buffers in the JVM heap before copying the buffers from the JVM heap to native memory. After that, the buffers in native memory are transferred to the GPU's device memory. The whole process is reversed after execution on GPUs. This scheme incurs substantial extra overhead.

Second, the PCIe link connecting main memory and the device memory of GPUs has limited bandwidth. This can often be a bottleneck for the computations we are considering, starving GPU cores of their data [26]. For example, PCIe Gen 3 has a theoretical maximum throughput of 15.75 GB/s, far lower than the memory bandwidth on the GPU side.

Data Layout: Data layout is the form in which data should be organized and accessed in memory when operating on multi-valued data. Different GPU kernels are appropriate for different types of data layout [21], [27]. It is well known that global memory accesses are always coalesced when using the columnar format (SoA). However, [21], [19] have found that AoS is a better choice than SoA for some applications. The selection of an appropriate data layout is a crucial issue in the development of GPU-accelerated applications.


Programming Model: Another challenge is that the programming model of CUDA is very different from the model of Flink. Using Flink, programmers just need to implement some high-level interfaces (e.g., Map and Reduce) without considering communication, fault tolerance, and data synchronization.

Execution Model: Flink uses the iterator execution model. Each DST in Flink implements an iterator interface, which computes one element of the DST when it is called. The one-element-at-a-time iterator model has advantages such as simplicity and flexibility. However, it does not match the GPU's architecture and significantly underutilizes GPU resources. To fully utilize the GPU's performance, appropriate block processing needs to be supported.

3.2 Our Main Methods

To solve the problems listed in Sections 1.1 and 3.1, we have carefully designed GFlink. Our main methods are listed as follows.

User-Defined Data Layout: A GPU-based abstract data model (GDST), built on Flink's existing abstract data model DST, is proposed to combine the computing models of GPUs and Flink. Programmers can create GDST objects based on a user-defined C-style struct named GStruct. Utilizing GStruct, programmers can organize the data layout and define the type of alignment. The raw bytes of the user-defined GStruct are stored sequentially in accordance with the definition of the GStruct. By default, the data layout is AoS. Programmers can define arrays in user-defined GStructs, in which case the data layout becomes SoA, i.e., the columnar format. One SoA is a sub-region, and all sub-regions in the cluster constitute a whole dataset. If the arrays in a Structure-of-Arrays are separated, the data layout becomes AoP.

Efficient Communication Strategy: In GFlink, the raw bytes of user-defined GStructs are stored in the off-heap memory of the cluster. Therefore, there is no need to transfer the data between JVM heap memory and native memory. Furthermore, the cached raw bytes match the layout of the struct defined in CUDA. This scheme avoids serialization and deserialization between JVM objects and CUDA structs, thus greatly improving performance and reducing the burden on programmers. To further improve performance, an asynchronous transfer scheme and a software caching scheme are provided in GFlink.

Task Execution on GPUs: We propose a producer-consumer scheme and carefully design a GStreamManager to handle task execution on GPUs. The TaskManagers in all the worker nodes produce tasks, while the GStreamManagers consume the tasks. In order to achieve good load balance among multiple GPUs, and further avoid unnecessary transfers, especially for iterative computing, we propose an adaptive locality-aware scheduling scheme, which contains a locality-aware scheduling algorithm and a locality-aware work stealing algorithm.

3.3 Overall Architecture

Like the existing architecture of Flink, GFlink also contains a client, a master, and workers, all of which are based on the run-time of Flink (HDFS, GraphScheduler, JobManager, and TaskManager) as shown in Fig. 1a, thereby preserving compatibility with the existing platform. The blue blocks in the figure denote the components provided by the original Flink framework, while the green blocks denote the components provided by our proposed GFlink. When the GFlink system starts, it brings up one JobManager on the master, and one TaskManager and one GPUManager on every worker. The JobManager is the coordinator of the GFlink system, while the TaskManagers are the workers that execute parts of the parallel programs [2]. The GPUManager is in charge of managing computing on GPUs.

The core functionalities of GFlink are in the GPUManagers, each of which resides in a worker. In order to use GFlink, developers need to write driver programs using the Java interfaces provided by GFlink. They also need to provide CUDA kernel programs (or C/C++ interfaces) and register them as GWork using interfaces provided by GFlink. When the program is submitted, the work defined by programmers is scheduled to the workers in the cluster by the master, and then executed on CPUs or GPUs by TaskManagers or GPUManagers. GPUManagers call the corresponding registered GWork and then invoke the CUDA kernels.

Fig. 1. GFlink architecture. (a) Overall architecture. (b) GPUManager architecture.


3.4 System Components of GPUManager

The GPUManager, which resides in each worker in the cluster, manages GPU computing resources (e.g., GPU memory and GPU contexts) and cooperates with the TaskManager to accomplish the tasks assigned to GPUs by the driver program. The architecture of the GPUManager is shown in Fig. 1b. The system components of the GPUManager are listed as follows.

CUDAWrapper and CUDAStub are responsible for the communication between JVMs and GPUs via the Java Native Interface. CUDAWrapper (programmed in Java) collaborates with CUDAStub (programmed in C++) to intercept and redirect API calls from applications in GFlink to the GPUs. In addition, many objects in CUDA (e.g., Streams and cudaEvent) are also virtualized in CUDAWrapper in the form of Java objects.

GMemoryManager is in charge of GFlink's memory management, which is implemented to manage automatic allocations/deallocations on the GPUs. GMemoryManager also caches data in GPUs to avoid redundant allocations/deallocations and data transfers for GPUs.

GStreamManager manages the CUDA Streams of GPUs and schedules the GWork assigned to GPUs with an adaptive scheme. A three-stage pipelined execution is adopted to overlap data transfers and kernel execution, decreasing the overhead caused by transfers over the PCIe bus. Furthermore, a locality-aware scheduling scheme is proposed to avoid unnecessary transfers and achieve good load balance among heterogeneous GPUs.

3.5 Programming Framework

The main goal of our proposed GFlink is to improve the computational performance of applications on Flink by leveraging the GPU's high computing power, while maintaining the easy programming model. In order to leverage GPU computing resources to process big data with GFlink, there are several steps:

- Design and define a GPU-based DST based on our proposed C-style struct. (in Java)

- Provide CUDA kernels. (in CUDA C)

- Implement user-defined GPU-based Mapper and Reducer, in which the GPU-based work is constructed and submitted. (in Java)

3.5.1 GPU-Based DST

As we know, Flink runs in the Java Virtual Machine (JVM). In the original Flink, keys and values in a DST are in the form of JVM object references. However, in classical implementations of the CUDA programming model, data transfers from the host to GPUs are in the form of buffers. The naive solution is to transform the JVM objects into buffers manually, which places heavy burdens on programming and simultaneously decreases performance.

To fill the gap between JVMs and GPUs mentioned above, we have designed the GPU-based DST (GDST), which is based on a carefully designed C-style struct named GStruct. We have defined a series of primitive data types (e.g., Unsigned32, Float32) corresponding to data types in CUDA. To use GStruct, we cannot use Java object types (e.g., Int, Long), but rather these primitive data types. In addition, the positions of member variables must be specified. Through Java annotation and reflection technologies, the details and layout of a GStruct can be obtained at runtime. Therefore, the layout of the GStruct can be automatically mapped to a Direct Buffer (the reasons for using Direct Buffers are discussed in Section 4). Unlike with the original JVM object references, programmers can get the Direct Buffer mapped to the GStruct directly by calling a member function of GStruct, reducing the burden on programmers. As for writing CUDA kernels, programmers can define the struct according to the user-defined GStruct. The contents of buffers generated by GStruct are directly mapped to the layout of the struct in CUDA programs.

By means of GDST, there is no need to manually encode and decode between JVM objects and buffers, which brings convenience to programmers and greatly improves performance. An illustrative example of using GDST is as follows. In this example, a Java class Point based on GStruct_8 is defined and a DST pointDST is created based on Point. GStruct_8 means that the size of alignment is 8 bytes. In the definition of Point, @StructField(order = 0) means that the position of member variable x is 0.

public class Point extends GStruct_8 {
    @StructField(order = 0)
    public Unsigned32 x;

    @StructField(order = 1)
    public Double64 y;

    @StructField(order = 2)
    public Float32 z;
}

DataSet<Point> pointDST = new DataSet<Point>();
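For illustration, a matching CUDA-side struct for the Point class above might be written as follows. This is only a sketch under stated assumptions: the concrete type mapping (Unsigned32 to unsigned int, Double64 to double, Float32 to float), the __align__(8) directive mirroring GStruct_8, and the kernel body are our assumptions rather than code taken from GFlink, and the kernel name cudaAddPoint merely echoes the executeName used later in Algorithm 3.1.

// Hypothetical CUDA counterpart of the Java Point GStruct above.
// Field order follows the @StructField annotations; host and device must
// agree on any padding the compilers insert so that the raw off-heap bytes
// can be reinterpreted on the GPU without conversion.
struct __align__(8) Point {
    unsigned int x;   // Unsigned32, order = 0
    double       y;   // Double64,  order = 1
    float        z;   // Float32,   order = 2
};

// A kernel can then treat the transferred buffer as an array of Point
// (AoS layout), one element per thread.
__global__ void cudaAddPoint(const Point *in, Point *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i].x = in[i].x;
        out[i].y = in[i].y + in[i].z;  // placeholder per-element computation
        out[i].z = in[i].z;
    }
}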

3.5.2 GPU-Based Mapper and Reducer

As described in Section 2.3, Flink provides user-implemented transformation functions (e.g., Map, Reduce, FlatMap, Join) and action functions (e.g., Count, Save) for the abstraction model DST. In practical applications, most computing tasks are encapsulated in user-implemented Mappers and Reducers. Therefore, we have added GPU-based user-implemented Map and Reduce interfaces (e.g., gpuMap, gpuReduce, gpuFlatMap). Other functions such as Join and Count can also be implemented on GPUs, but they are beyond the scope of this paper. Programmers define the actual logic and how to invoke the CUDA kernel in the GPU-based Mapper and Reducer. As for computing on GPUs, processing many items (chunks) in parallel achieves high performance. Therefore, we also add user-implemented GPU-based Map and Reduce interfaces for processing a block (e.g., gpuMapBlock). In the GPU-based Mapper and Reducer, programmers need to define the logic and assemble the parameters for the GWork.

3.5.3 GPU-Based Work

GFlink provides an abstraction model named GWork, in the form of Java, for GPU computing. Programmers can set the input buffer, the output buffer, the path of the ptx file, and other parameters to form a GWork, and submit the GWork by calling interfaces provided by GFlink in the GPU-based Mapper and Reducer. After submission, the input buffer and output buffer are transferred to GPUs automatically. Then, GFlink schedules the work via the GStreamManager (to be described in Section 5), and the CUDA function is located by the name provided by the programmer before further execution. After execution on GPUs, the results are pulled from GPUs to the output buffer automatically. By taking advantage of this capability, programmers do not need to manage GPU contexts, GPU memory, etc. Algorithm 3.1 shows the pseudocode for utilizing GWork.

Algorithm 3.1. An Example on GFlink

Input: A file A which contains the contents to be processed and is stored in the HDFS.
Output: A file B which contains the contents in the HDFS.

Driver(A)
1: Define Point type based on GStruct;
2: Create GDST Tuple2<Point, Point> named as M from file A;
3: for i ← 0 to iTimes do
4:   V ← M.gpuMapPartition(new addPoint());
5: end for
6: return V;

7: addPoint(HBuffer in, HBuffer out, int cacheID, int size) extends gpuMapBlock
8: Define GWork object sWork;
9: sWork.ptxPath ← "/addPoint.ptx";
10: sWork.size ← size;
11: sWork.blockSize ← 256;
12: sWork.gridSize ← size/256;
13: sWork.inBuffer ← in;
14: sWork.cache ← true;
15: sWork.cacheKey ← cacheID;
16: sWork.outBuffer ← out;
17: sWork.executeName ← "cudaAddPoint";
18: Submit sWork to GStreamManager;
19: return

3.6 Discussion of Migration from Flink to Spark

An important consideration in designing GFlink is to make migration from Flink to Spark easier for further extension. There are some common design ideas in Flink and Spark. First, both are JVM-based systems. CUDAWrapper and CUDAStub are responsible for the communication between JVMs and GPUs over JNI. CUDAWrapper wraps common interfaces of CUDA (including CUDA driver APIs and CUDA runtime interfaces) in Java. Applications on Spark can also call these interfaces to control GPUs without any modification. Second, like Flink, Spark also adopts the master-slave and MapReduce execution models. Our proposed programming framework is also suitable for Spark, though some coding work would be required. Third, the producer-consumer scheme decouples execution on Flink and GPUs; in principle, the Flink side can be replaced by Spark. To migrate our proposed architecture from Flink to Spark, an important problem that must be solved is to make Spark support GStruct, store the data in off-heap memory, and ensure that the data layout matches that of the CUDA struct.

4 EFFICIENT JVM-GPU COMMUNICATION STRATEGY

4.1 Communication Channel

In the CUDA programming model, host applications can only be programmed in C/C++ or Python. Therefore, the naive strategy for communication between JVMs and GPUs consists of two steps: communication between JVMs and local processes/libraries, and communication between local processes/libraries and GPUs. For the first step, some commonly used methods include the remote procedure call (RPC) protocol, Hadoop Streaming, and the Java Native Interface. For example, Mars [28] adopted the Hadoop Streaming scheme to integrate GPUs into Hadoop. With the RPC method, the data being transferred must pass through the TCP/IP protocol stack in the worker, adding extra overhead to the communication path. In addition, serialization/deserialization in RPC systems is usually expensive. Hadoop Streaming also requires cross-process communication, resulting in overhead similar to the RPC method. For the second step, data transmission between RAM and GPU device memory usually relies on DMA engines over the PCIe bus.

To mitigate the shortcomings of RPC and Hadoop Streaming, we have carefully designed an efficient communication scheme. We first divide communication between JVMs and GPUs into a transfer channel and a control channel, as shown in Fig. 1b. Small data (e.g., memory addresses) and control commands are transferred over the control channel through JNI. The transfer channel is responsible for large-capacity and high-speed data transfer between main memory and the device memory of GPUs. In order to improve the communication performance, a series of optimizations is adopted in the transfer channel.

4.1.1 Control Channel

The control channel is responsible for controlling the GPU's execution. Small data and control commands are transferred over the control channel. The CUDAStub exports all the runtime APIs of CUDA as native interfaces, and the CUDAWrapper wraps all the APIs in Java. An execution phase proceeds as follows. First, applications in GFlink that want to submit work to GPUs call the interfaces provided by the CUDAWrapper. Second, the CUDAWrapper redirects the API calls to the CUDAStub via JNI. Third, the CUDAStub calls CUDA APIs or invokes CUDA kernels to make the GPUs perform the specific operations (e.g., cudaMalloc, cudaMemcpyD2H, cudaMemcpyH2DAsync). Lastly, the results are transferred from the GPUs back to the applications.
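To make this control-channel path concrete, the following is a minimal sketch of how a CUDAStub-style native function could forward a JVM call to the CUDA runtime over JNI. The Java package, class, and method names (gflink.cuda.CUDAStub.cudaMalloc) are hypothetical and not taken from GFlink; only the JNI conventions and the CUDA runtime call are real APIs, and error handling is simplified.

// cudastub.cpp -- sketch of a control-channel stub (names are assumed).
// Built with nvcc into a shared library and loaded via System.loadLibrary.
#include <jni.h>
#include <cuda_runtime.h>

// Corresponds to a hypothetical Java declaration:
//   public static native long cudaMalloc(long sizeInBytes);
// The device pointer is returned to Java as a long so that later
// control-channel calls (memcpy, kernel launch, free) can refer to it.
extern "C" JNIEXPORT jlong JNICALL
Java_gflink_cuda_CUDAStub_cudaMalloc(JNIEnv *env, jclass cls, jlong size) {
    void *devPtr = nullptr;
    cudaError_t err = cudaMalloc(&devPtr, static_cast<size_t>(size));
    if (err != cudaSuccess) {
        return 0;  // the Java-side wrapper would map 0 to an exception
    }
    return reinterpret_cast<jlong>(devPtr);
}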

4.1.2 Transfer Channel

The transfer channel is responsible for large-capacity data communication between main memory and the GPU's device memory. Our proposed control channel provides two interfaces (cudaMemcpyH2D, cudaMemcpyH2DAsync) to transfer native buffers to the device memory of GPUs, and two interfaces (cudaMemcpyD2H, cudaMemcpyD2HAsync) to transfer data in the device memory back to main memory.

Off-Heap Memory: There are two ways to cache data in Flink: in the JVM heap or in off-heap memory. The naive way is to store the data in the JVM heap, which has some problems. Adopting this way requires the following steps: (1) copying data from the Java heap to native memory; (2) transferring data to GPU device memory; (3) computing on GPUs; (4) transferring results from GPU device memory to native memory; and (5) copying results from native memory to the Java heap. Data copying is expensive and can cause GPU performance degradation.

To address this problem, GFlink caches data in off-heap memory (direct buffers in Java). The contents of direct buffers reside outside of the normal garbage-collected heap, so local libraries can obtain the user-space virtual address and then read or write the buffer using that address. As shown in Fig. 2, shared memory spaces are created by allocating Direct Buffers in the off-heap memory space. These memory regions are beyond the JVM's GC and are directly mapped to actual physical memory by the operating system (OS). In this way, steps (1) and (5) are avoided.

To transfer data from JVMs to GPUs, applications in GFlink call the interfaces provided by the CUDAWrapper (e.g., cudaMemcpyH2D, cudaMemcpyH2DAsync) with the virtual addresses and data sizes as parameters. Then, the CUDAWrapper calls the corresponding interfaces exported by the CUDAStub. After these calls, the data is directly transferred from main memory to the device memory of GPUs by the DMA engine over the PCIe bus. Applications call the cudaMemcpyD2H or cudaMemcpyD2HAsync functions to transfer the results from GPUs back to JVMs.
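The key property of direct buffers is that their backing memory has a stable native address that JNI can expose. A minimal native-side sketch of such a transfer-channel entry point is shown below; the Java-side declaration and the function name are assumptions, while GetDirectBufferAddress and cudaMemcpy are the real JNI and CUDA runtime calls.

// Sketch of a transfer-channel stub: copy the contents of a Java direct
// ByteBuffer to a previously allocated device buffer in one DMA transfer.
// Hypothetical Java declaration:
//   public static native int cudaMemcpyH2D(long devPtr, ByteBuffer buf, long bytes);
#include <jni.h>
#include <cuda_runtime.h>

extern "C" JNIEXPORT jint JNICALL
Java_gflink_cuda_CUDAStub_cudaMemcpyH2D(JNIEnv *env, jclass cls,
                                        jlong devPtr, jobject directBuf,
                                        jlong bytes) {
    // GetDirectBufferAddress works only for off-heap (direct) buffers and
    // returns the user-space virtual address that the copy operates on.
    void *host = env->GetDirectBufferAddress(directBuf);
    if (host == nullptr) {
        return static_cast<jint>(cudaErrorInvalidValue);
    }
    return static_cast<jint>(cudaMemcpy(reinterpret_cast<void *>(devPtr),
                                        host, static_cast<size_t>(bytes),
                                        cudaMemcpyHostToDevice));
}

Because the returned address points directly at the off-heap region holding the raw GStruct bytes, no intermediate copy into native memory is needed.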

Bulk Transfer: The GPU is a massively parallel co-processor, which executes GPU kernels in a Single-Instruction Multiple-Threads (SIMT) fashion. To maximize a GPU's performance, two requirements must be met. First, each GPU kernel should be launched with a large number of GPU threads, which can fully utilize GPU computing resources and hide GPU memory access latency to achieve high throughput. Second, to fully utilize the GPU's memory bandwidth, data should be accessed in a coalesced manner, where consecutive GPU threads access consecutive GPU memory locations.

To meet the first requirement, a system needs to support a block processing model that processes a block of data elements at a time. To meet the second requirement, a system needs to organize data into an appropriate format so that they can be accessed in a coalesced way. However, Flink does not meet these two requirements. It adopts the iterator model and computes one element at a time using a row format, which may significantly underutilize GPU resources. In this case, to efficiently harness GPU resources, transferring the data to GPUs in bulk and invoking block processing should go hand in hand.

Asynchronous Communication: Asynchronous communication is utilized in our scheme. CUDA Streams may be used to separate the computation into distinct streams that may execute in parallel. A Stream is a sequence of commands that executes on the GPU in order. Different Streams may execute their commands out of order with respect to each other or concurrently. Hence, communication in one Stream can be overlapped with computation, and with communication in other Streams, thus improving performance.

GPUs with one copy engine can only use the PCIe bus in half duplex when data is moved using explicit memory copies. As such, computation can be overlapped with communication in at most one direction. The only way for these GPUs to use the PCIe bus in full duplex is to use device-mapped host memory instead. GPUs with two copy engines, such as NVIDIA's Tesla K20, can use the PCIe bus in full duplex using explicit memory copies in different streams. In this way, computation and communication in both directions can be fully overlapped using different streams. Streams may also be used to allow different kernels to execute concurrently, no matter how data is transferred between the host and the device. This task-parallel use of streams for exploiting application-level task parallelism is beyond the scope of this paper. In order to enable asynchronous memory transfers, and ultimately to overlap data transfers and computation across multiple streams, the buffers must be page-locked. The cudaHostRegister function is used to turn the Direct Buffer into page-locked memory, and then cudaMemcpyD2HAsync and cudaMemcpyH2DAsync are used to perform the asynchronous communication.
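The sketch below illustrates this combination under stated assumptions (the buffer region, sizes, and function names are ours, and error handling is omitted): the off-heap region backing a direct buffer is registered as page-locked with cudaHostRegister, after which an asynchronous host-to-device copy can be issued on a stream and overlapped with other work.

// Sketch: pin the off-heap region backing a direct buffer and issue an
// asynchronous host-to-device copy on a dedicated stream.
#include <cuda_runtime.h>

void registerOffHeapRegion(void *hostRegion, size_t bytes) {
    // One-time step: pin the pages backing the direct buffer so the DMA
    // engine can access them; release later with cudaHostUnregister.
    cudaHostRegister(hostRegion, bytes, cudaHostRegisterDefault);
}

void asyncUpload(const void *hostRegion, void *devPtr, size_t bytes,
                 cudaStream_t stream) {
    // Returns immediately; the copy runs in the background and is ordered
    // only with respect to other commands issued on `stream`.
    cudaMemcpyAsync(devPtr, hostRegion, bytes,
                    cudaMemcpyHostToDevice, stream);
}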

4.2 Device Memory Management

Device memory is the primary onboard DRAM storage for computation performed on the GPU. Unlike system memory, where the OS controls space allocation and reclamation, GPU device memory is still directly controlled by individual applications in current systems, which complicates GPGPU application design. Generally speaking, using CUDA, programmers usually need to allocate input and output spaces on GPUs with cudaMalloc, transfer buffers to and from GPUs, and release the allocated memory spaces explicitly. This explicit memory management is complicated, error-prone, and a heavy burden for programmers. Moreover, as described in Section 3.1, data transfers over the PCIe link have a significant effect on the achievable acceleration. To address these problems, an automatic memory management scheme and a GPU cache scheme are proposed.

4.2.1 Automatic Memory Management

GMemoryManager is responsible for GFlink's automatic memory management. In GFlink, HBuffer is utilized to store data in off-heap memory, while GMemory is a JVM object that consumes minimal JVM memory but is associated with a set of GPU buffers in the CUDA address space. Each GMemory object acts as a handle that GFlink runtime components must acquire to access the associated CUDA buffers.

Fig. 2. Transfer channel.

CUDAWrapper wraps CUDA's device memory management interfaces (such as cudaMalloc, cudaHostRegister, cudaMemcpy, and cudaMemcpyAsync) in Java. Programmers do not need to manage device memory manually; they only need to create a GWork and submit it. GMemoryManager utilizes CUDAWrapper's interfaces to allocate memory spaces on GPUs in the form of GMemory according to the input and output buffers in the form of HBuffer. Then input buffers are transferred from off-heap memory to GPUs. After execution on GPUs, the results are transferred back to main memory. Lastly, if the data is not required to be cached, the allocated spaces on GPUs are released automatically.

4.2.2 GPU Cache Scheme

In order to cache data, especially for iterative computing, a software cache scheme, as presented in Fig. 3, is proposed. Each job has its own cache region, which is allocated when the job starts and deallocated when the job ends or GFlink stops, through the device memory reservation/release API calls provided by CUDAWrapper. The capacity of the region is a user-defined parameter. In order to reduce the overhead of searching for the memory address of specific data, we use a hash table to maintain the data in GPU cache regions. An element in the hash table, named an object, contains a key and a value. By default, the key of a block is the partition ID and the block ID, while the value represents the offset from the start of the cache region and the size of the object. GMemoryManager is responsible for GPU cache management and for allocating/releasing memory in the cache region. The data to be cached in the GPU is allocated and deallocated in a sequential manner. To avoid unnecessary searches of the hash table, only data accesses that are marked Cache need to search the hash table to find GMemory objects.

Because of the limited capacity of the GPU's device memory, a garbage collection mechanism is needed. The cache region of a specific job is allocated when the job starts; accordingly, it is released when the job finishes. Two garbage collection schemes are proposed. One is the first-in first-out (FIFO) manner. As presented in Fig. 3, a corresponding FIFO list is utilized to store the elements in the hash table. When a new partition needs to be cached in the device memory and the spare memory is smaller than the capacity of the new partition, the first objects in the FIFO list are selected one by one, and their sizes are accumulated until the total exceeds the size of the new partition. After that, all selected objects are deleted from the FIFO list and the hash table. The other scheme is that, when the cache region is fully utilized, no further data can be cached in the region. This scheme is useful when the data that needs to be cached in the GPUs in one iteration is larger than the capacity of the region.

5 EXECUTION MODEL ON GPUS

We separate the task execution flow on GPUs at the slave nodes into a data producing process and a data processing process, as shown in Fig. 4. Tasks invoked by the TaskManager of the original Flink are regarded as producers producing independent work, while our designed GStreamManager consumes the work produced by the Tasks.

The core of GStreamManager is an independent streaming dataflow engine which manages the Streams provided by CUDA. In GStreamManager, Streams on GPUs are treated as high-level virtual computing resources, similar to threads on CPUs. As different Streams may execute their commands out of order with respect to each other, the asynchronous streaming model of GPUs is appropriate for processing independent work. The JobManager and DAGScheduler in Flink are responsible for the work dependencies, so the TaskManager does not need to consider the dependencies.

With the purpose of improving performance, GFlink adopts a three-stage pipelined execution model using multiple asynchronous streams, overlapping computation and communication. There are three stages in the pipelined execution: host-to-device input transfers (H2D), GPU execution (K), and device-to-host output transfers (D2H). A class named CUDA Stream is provided in CUDAWrapper. CUDA Streams can be created by calling the interface cudaStreamCreate. To overlap data transfers and computation for multiple Streams, the asynchronous copy functions (cudaMemcpyH2DAsync and cudaMemcpyD2HAsync) on page-locked memory need to be used.
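A minimal sketch of this three-stage pipeline on a single GPU is given below. The kernel, block sizes, and buffer layout are assumptions (device buffers are taken to be preallocated and large enough for the partition, and the host regions page-locked as described in Section 4.1.2); the real GStreamManager additionally schedules GWork across streams and GPUs, and error handling is omitted.

// Sketch: three-stage pipelining (H2D, kernel, D2H) over the blocks of a
// partition, assigned round-robin to several CUDA streams so that the
// transfers of one block overlap with the computation of another.
#include <cstddef>
#include <cuda_runtime.h>

__global__ void processBlock(const char *in, char *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];   // placeholder per-element computation
}

void runPartition(const char *hostIn, char *hostOut,   // page-locked host regions
                  char *devIn, char *devOut,           // preallocated device regions
                  size_t blockBytes, int numBlocks, int elemsPerBlock) {
    const int kNumStreams = 4;
    cudaStream_t streams[kNumStreams];
    for (int s = 0; s < kNumStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int b = 0; b < numBlocks; ++b) {
        cudaStream_t s = streams[b % kNumStreams];
        size_t off = static_cast<size_t>(b) * blockBytes;
        // Stage 1: host-to-device input transfer (H2D).
        cudaMemcpyAsync(devIn + off, hostIn + off, blockBytes,
                        cudaMemcpyHostToDevice, s);
        // Stage 2: kernel execution (K) on the same stream.
        processBlock<<<(elemsPerBlock + 255) / 256, 256, 0, s>>>(
            devIn + off, devOut + off, elemsPerBlock);
        // Stage 3: device-to-host output transfer (D2H).
        cudaMemcpyAsync(hostOut + off, devOut + off, blockBytes,
                        cudaMemcpyDeviceToHost, s);
    }
    for (int s = 0; s < kNumStreams; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }
}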

Fig. 3. Memory management.

Fig. 4. GStream manager.


This producer-consumer scheme and three-stage pipelin-ing execution model decouple the execution of the originalFlink and GPUs. More importantly, by adopting them, aGPU can be shared among multiple task slots and eachGPU can concurrently execute multiple kernels with eachusing a large number of threads.

5.1 Data Partitioning Scheme

Flink currently uses CPU core information to determine the number of data partitions used to distribute the work. By default, the number of task slots allocated by Flink is equal to the number of CPUs. Flink adopts an on-demand data parallelism scheme. Specifically, users can configure the number of task slots for each operator (or function) independently. Each assigned task slot processes a partition of the data. Within each task slot, all the pairs in a partition are traversed and processed in turn. However, this execution model is not appropriate for GPUs. First, the size of a whole partition may be larger than that of the device memory of GPUs. Under such circumstances, the partition cannot be transferred to GPUs as a whole. Second, the one-element-at-a-time iterator model has advantages such as simplicity and flexibility. However, it does not match the architecture of GPUs, which can significantly underutilize GPU resources.

Therefore, we advocate a block processing model and integrate it into Flink. Using GFlink, programmers can define the data parallelism. For example, if the original Flink assigns two task slots in a slave node, there will be two partitions. Each partition contains blocks whose size is usually smaller than that of the GPU's device memory, and these blocks are processed by GPUs in turn. Through this scheme, the three-stage pipelined execution can be applied within a partition, thus overlapping computation and data transfer. By default, the size of a block is set to be the same as that of a memory page in Flink's memory management scheme. In order to transfer a memory page directly, the content of a GStruct cannot be stored across pages.

5.2 Execution Flow of GStreamManager

As shown in Fig. 4, there are three components in GStreamManager: the GWork Scheduler, the GWork Pool, and the GStream Pool. The GWork Scheduler is responsible for scheduling the submitted GWork with the locality-aware scheduling algorithm described in Algorithm 5.1. The GStream Pool contains Streams, each of which is controlled by a thread. The Streams that belong to the same GPU are grouped as a bulk. Each Stream executes the three-stage pipelined execution described above. When Streams finish their work, they get work from the GWork Pool via the locality-aware work stealing algorithm described in Algorithm 5.2. The GWork Pool caches the work that has not yet been executed by GPUs and maintains a FIFO work queue for each GPU.

Suppose that there are two GPUs in a worker. Accordingly, there will be two FIFO work queues in the GWork Pool and two Stream bulks. Fig. 4 shows the execution flow of GStreamManager, which is compatible with the run-time of Flink. GStreamManager cooperates with GMemoryManager to allocate and free the device memory of GPUs automatically and to determine the data locality. It communicates with GPUs through the communication layer formed by CUDAWrapper and CUDAStub.

5.3 Adaptive Locality-Aware Scheduling Scheme

In a heterogeneous environment, the computational power of different GPUs varies. The complexity of different applications may also vary in cloud computing. In order to achieve good load balance among heterogeneous GPUs, and further avoid unnecessary transfers, especially for iterative computing, we have proposed an adaptive locality-aware scheduling scheme. It contains a locality-aware scheduling algorithm and a locality-aware work stealing algorithm.

The locality-aware scheduling algorithm, which is implemented in the GWork Scheduler, is shown in Algorithm 5.1. First, the Scheduling function calls GMemoryManager's interfaces to determine which GPU is appropriate. GMemoryManager goes through inBuffer, looks for the buffers that are labeled cache, and identifies their sizes. It then selects the GPU with the largest sum of input bytes in its device memory and returns its index, named GID. Second, our algorithm selects an idle Stream in the GID bulk of the GStream Pool. If there is no idle Stream in the GID bulk, to balance the work among GPUs, the Stream bulk which has the most idle Streams is selected. Lastly, if there are no idle Streams in the whole GStream Pool, the work is pushed into the GWork Pool for later execution: if GID is not null, the work is pushed into the GID queue; if GID is null, the work is pushed into the queue which contains the least work. From this perspective, this scheme balances the workloads among the queues in the GWork Pool.

Algorithm 5.1. Scheduling(inBuffer, outBuffer)

Input: inBuffer denotes the input HBuffer array;
Output: the scheduled results
1: Call GMemoryManager's interface with parameter inBuffer to get GID;
2: if (GID is not null) then
3:   if there is no idle stream in GID bulk then
4:     Select an idle stream from the bulk which contains the most idle streams;
5:   else
6:     Select an idle stream from GID bulk;
7:   end if
8: else
9:   Select an idle stream from the bulk which contains the most idle streams;
10: end if
11: if (an idle stream is not selected) then
12:   if (GID is null) then
13:     Put this work into the queue which has the least work in GWork Pool;
14:   else
15:     Put this work into the GID queue in GWork Pool;
16:   end if
17:   streamID ← -1;
18: end if
19: return streamID.

When a thread finishes the work assigned to it, it will getwork from GWork Pool periodically through the locality-aware work stealing algorithm as shown in Algorithm 5.2.If the elapsed time for getting the work is beyond a thresh-old, the thread will be freed. The Stealing function accepts aparameter named GID denoting the GPU to which thestream belongs. Our algorithm first searches the GID queue,



Our algorithm first searches the GID queue and dequeues a GWork from it if it is not empty. If the GID queue is empty, a GWork is dequeued from the queue that contains the most GWorks. If all the queues in the GWork Pool are empty, our algorithm returns null.

Algorithm 5.2. Stealing(GID)

Input: GID denotes the index of the GPU
Output: the selected work
1: if (the GID queue in GWork Pool is not empty) then
2:   Dequeue a GWork from this queue;
3:   return GWork;
4: end if
5: if (all GID queues in GWork Pool are empty) then
6:   return null;
7: else
8:   Dequeue a GWork from the queue which has the most GWork;
9: end if
10: return work.
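The work-stealing side can be sketched in the same style; again, the types and names are illustrative stand-ins assuming one FIFO queue per GPU in the GWork Pool.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Simplified, self-contained rendering of Algorithm 5.2 (hypothetical names).
public class LocalityAwareStealing {

    // GWork Pool: one FIFO queue of pending work items per GPU.
    final List<Deque<String>> workQueues = new ArrayList<>();

    LocalityAwareStealing(int numGpus) {
        for (int g = 0; g < numGpus; g++) workQueues.add(new ArrayDeque<>());
    }

    // Algorithm 5.2: a stream of GPU 'gid' first drains its own queue, then steals
    // from the fullest queue; returns null when every queue is empty.
    String steal(int gid) {
        Deque<String> own = workQueues.get(gid);
        if (!own.isEmpty()) {
            return own.pollFirst();              // local work first (preserves data locality)
        }
        Deque<String> fullest = null;
        for (Deque<String> q : workQueues) {
            if (!q.isEmpty() && (fullest == null || q.size() > fullest.size())) {
                fullest = q;
            }
        }
        return (fullest == null) ? null : fullest.pollFirst();   // steal, or null if all empty
    }

    public static void main(String[] args) {
        LocalityAwareStealing pool = new LocalityAwareStealing(2);
        pool.workQueues.get(1).addLast("spmv-kernel");
        System.out.println(pool.steal(0));       // prints "spmv-kernel": stolen from GPU 1's queue
    }
}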

6 EXAMPLE AND EXPERIMENTAL EVALUATION

6.1 Experimental Setup

In this section, we evaluate the performance of GFlink. The experiments have been conducted on a single machine and on a cluster, in comparison with the original implementations based on Flink. Each test computer is equipped with one Intel(R) Core(TM) i5-4590 CPU, which contains four cores running at 3.30 GHz, and 16 GB of memory. Several kinds of GPUs are utilized, including the NVIDIA GeForce GTX 750, NVIDIA Tesla C2050, NVIDIA Tesla K20, and NVIDIA Tesla P100. As for the software, the test machines run Ubuntu 14.04 and the NVIDIA CUDA Toolkit 7.5. Our architecture is based on Flink 1.3.0.

6.2 Benchmarks

We select benchmarks from the in-memory version of HiBench [29], which is widely used to evaluate the Spark and Hadoop frameworks. HiBench provides different kinds of workloads, including machine learning and graph mining algorithms. We select four benchmarks (KMeans, PageRank, ComponentConnect, WordCount) from HiBench and two extra benchmarks from the examples of Flink (LinearRegression, SpMV) to evaluate the performance of our proposed GFlink. For each benchmark, we employ five different sizes of input datasets (see Table 1). The benchmarks represent a sufficiently broad set of typical MapReduce workload behaviors.

Among these benchmarks, WordCount is a batch workload that involves only one-pass processing, while KMeans, PageRank, ComponentConnect, and SpMV are iterative workloads.

6.3 Performance Analysis

In this section, we build a brief time-cost model to evaluate the speedup achieved by GPU acceleration and the performance after adopting the proposed optimizations in GFlink.

6.4 Overall Analysis

Flink is a distributed in-memory MapReduce framework. That is to say, the core execution in Flink is MapReduce, though some other convenient functions are provided (e.g., Count, Join). To simplify the analysis, only the MapReduce phases are taken into consideration. The total time of executing an application on Flink can be derived as

T_{total} = \sum_{i=1}^{n} \left( T_{map\_i} + T_{reduce\_i} + T_{shuffle\_i} \right) + T_{submit} + T_{IO} + T_{schedule},   (1)

where T_{total} represents the total execution time of an application on Flink or GFlink, n denotes the number of MapReduce phases, T_{map\_i} denotes the execution time of the ith Map phase, T_{reduce\_i} denotes the execution time of the ith Reduce phase, T_{shuffle\_i} denotes the execution time of the ith Shuffle phase, T_{submit} denotes the time cost of submitting an application to the GFlink system, T_{IO} denotes the time cost of reading or writing data to HDFS or other systems (e.g., HBase), and T_{schedule} denotes the time spent on scheduling the job to be executed in the GFlink system.

In GFlink, the Map phase and the Reduce phase can be executed on CPUs or GPUs. As for executions on GPUs, the speedup of an application on GFlink is denoted as

Speedup_{total} = \frac{T_{Flink\_total}}{T_{GFlink\_total}}.   (2)

The Reduce phase is similar to the Map phase. The speedup of the ith Map on GFlink is denoted as

Speedup_{map\_i} = \frac{T_{map\_cpu\_i}}{T_{map\_gpu\_i}},   (3)

where T_{map\_cpu\_i} is the execution time of the ith Map phase on a CPU, and T_{map\_gpu\_i} is the total execution time of the ith Map phase on a GPU, which can be derived as

T_{map\_gpu} = T_{map\_gt\_data} + T_{map\_ge} + T_{map\_gt\_result},   (4)

where T_{map\_gt\_data} denotes the time spent on transferring data to GPUs, T_{map\_ge} denotes the execution time of the GPU kernel, and T_{map\_gt\_result} denotes the time spent on transferring the results from GPUs to the main memory.
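As a purely illustrative example (the numbers below are assumed for exposition, not measured), plugging hypothetical times into Equations (3) and (4) shows how the transfer terms bound the per-phase speedup and why the GPU cache scheme helps in later iterations:

T_{map\_gpu} = T_{map\_gt\_data} + T_{map\_ge} + T_{map\_gt\_result} = 4\,\mathrm{s} + 2\,\mathrm{s} + 1\,\mathrm{s} = 7\,\mathrm{s},
Speedup_{map\_i} = T_{map\_cpu\_i} / T_{map\_gpu\_i} = 28\,\mathrm{s} / 7\,\mathrm{s} = 4.

If the input is already cached in the device memory, T_{map\_gt\_data} disappears in later iterations, T_{map\_gpu} drops to 3 s, and the per-phase speedup rises to 28/3 ≈ 9.3.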

From these two equations, some observations can be obtained.

Observation 1: The speedup of an application on GFlink depends on the characteristics of the application. With other parameters fixed, the larger the share of the total execution that the Shuffle phases occupy, the smaller the speedup that can be obtained.

Observation 2: The higher the speedup of all Map and Reduce phases, the higher the total speedup.

TABLE 1
Benchmarks from HiBench

Benchmark          Data sizes
KMeans             150, 180, 210, 240, 270 (million points)
PageRank           5, 10, 15, 20, 25 (million pages)
WordCount          24, 32, 40, 48, 56 (GB)
ComponentConnect   5, 10, 15, 20, 25 (million pages)
LinearRegression   150, 180, 210, 240, 270 (million points)
SpMV               2, 4, 8, 16, 32 (GB)



From Equation (4), we can see that if the time spent transferring data between GPUs and the main memory decreases, the total speedup increases. Our proposed cache scheme, efficient communication strategy, and pipelining execution model all reduce this data transfer time.

Observation 3: If the data to be processed is small, T_{submit}, T_{IO}, and T_{schedule} occupy a large part of the total execution time, so GPU acceleration has little influence. That is to say, generally speaking, the speedup of processing small datasets is smaller than that of processing large-scale datasets.

6.5 Performance Results Overview

In this section, we show the performance results of the benchmarks in Table 1 on GFlink with 10 slave nodes, each of which contains four CPUs and two Tesla C2050 GPUs. The running time and speedup of these benchmarks are presented in Figs. 5 and 6. From these figures, we can see that the speedup of applications on GFlink varies with the characteristics of the applications.

The speedup of WordCount is not high (only about 1.1x), because WordCount is a batch application without iterative execution, so the GPU cache scheme does not take effect. Moreover, the I/O overhead of WordCount is the bottleneck, and the acceleration of computation therefore has little influence. SpMV obtains a high speedup of about 6.3x, because SpMV is an iterative application, so we can cache the matrix in GPUs in the first iteration to reduce the running time of the following iterations. Moreover, cuBLAS is utilized to conduct the multiplication of the submatrices and the vector. The performance of KMeans is improved significantly, by about 5x, because of KMeans's compute-intensive nature.

KMeans only shuffles the centers in each iteration; the dominant operation is searching for the closest centers, which benefits significantly from the GPU's high parallelism. LinearRegression is a popular classification method in statistical analysis and is an iterative workload. GFlink significantly improves its performance, by almost 9.2x, because linear regression is bounded by calculations on each data point, which benefit from the GPU's high computational power. As for PageRank and ComponentConnect, GFlink improves the performance by almost 3.5x and 4.8x, respectively.

From all these figures, we can also observe that the speedup increases with the size of the input data, which accords with the analysis presented in Section 6.3. This is because, when processing small datasets, the overhead caused by submitting the job and scheduling the tasks occupies a large part of the total execution time. Furthermore, the GPU is good at bulk computations.

In summary, our results demonstrate that GFlink can improve the performance of various workloads significantly, especially for iterative and compute-intensive applications.

6.6 Detailed Performance Evaluation

6.6.1 The Average Running Time for Different Iterations

In this experiment, we first evaluate the performance of SpMV on a single machine for different numbers of iterations; the matrix is 1.0 GB and the vector is 123 MB. It can be seen from Fig. 7a that during the first iteration, GFlink's implementation on one GPU achieves almost a 2.5x speedup over the implementation on one CPU. During the following iterations, the implementation on one GPU achieves almost a 10x speedup over the native implementation on one CPU.

Fig. 5. An overview of performance results. (a) Average running time and speedup of KMeans on a cluster. (b) Average running time and speedup of PageRank on a cluster. (c) Average running time and speedup of WordCount on a cluster.

Fig. 6. An overview of performance results. (a) Average running time and speedup of SpMV on a cluster. (b) Average running time and speedup of LinearRegression on a cluster. (c) Average running time and speedup of ComponentConnect on a cluster.



After the first iteration, the running time of the experiments on GPUs decreases rapidly: to almost 30 seconds on one GPU and 17 seconds on two GPUs. Nonetheless, the running time of the last iteration increases. That is because, during the first iteration, the matrix must be read from HDFS and transferred to the GPUs, which consumes a great amount of time. In contrast, during the following iterations, the matrix is already cached in the GPUs, so the overhead caused by I/O and PCIe transfers is removed. During the last iteration, the vector must be written to HDFS.

Fig. 8a shows the effect of our proposed cache scheme on SpMV. Without the GPU cache scheme, the running time increases, because the matrix and the vector need to be transferred to the GPUs in each iteration.

Then, we evaluate the performance of KMeans for different numbers of iterations on a cluster that contains three slave nodes. The number of points is 210 million. As shown in Fig. 7a, the trend of KMeans under different iterations is similar to that of SpMV. The speedups of the first and last iterations are not high, while the speedups of the other iterations are high. The reason is that the executions of the first and last iterations involve substantial I/O overhead and job submission costs (as described in Section 6.3).

6.6.2 Speedup of GMapper and GReducer

In this section, we evaluate the speedup of some GMappers and GReducers accelerated by GFlink. We omit other phases such as reading data from HDFS, job submission, and job scheduling. As for the executions on CPUs, the mapPartition function of the original Flink is utilized, in which all the elements are traversed by the iterator in Java and processed by the map function one by one (a minimal sketch of this CPU baseline is given after this paragraph). We evaluate the results of our proposed GFlink on different GPUs (including the C2050, GTX 750, K20, and P100) on a single node. The detailed speedups of GMapper and GReducer on GFlink are presented in Fig. 8b. It can be seen from the figure that the executions on the P100 have the highest speedup, the performance on the K20 is better than that of the GTX 750, and the performance on the C2050 and the GTX 750 is almost the same. Therefore, our GPU-accelerated algorithms would achieve higher speedups if the C2050 were replaced with a K20 or P100.
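For reference, the following minimal example shows this CPU baseline pattern written against the standard Flink DataSet API; the input data and the per-element squaring logic are placeholders rather than one of the benchmark kernels.

import org.apache.flink.api.common.functions.MapPartitionFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.util.Collector;

// CPU baseline pattern: mapPartition walks a partition with a Java iterator and
// applies the per-element logic one element at a time on the CPU.
public class CpuMapPartitionBaseline {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
        DataSet<Double> points = env.fromElements(1.0, 2.0, 3.0, 4.0);

        DataSet<Double> squared = points.mapPartition(
            new MapPartitionFunction<Double, Double>() {
                @Override
                public void mapPartition(Iterable<Double> values, Collector<Double> out) {
                    for (Double v : values) {      // element-by-element CPU loop
                        out.collect(v * v);        // placeholder per-element computation
                    }
                }
            });

        squared.print();
    }
}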

From Fig. 8b, some observations can be obtained. First, the speedups of the GMappers of KMeans and SpMV are much higher than the overall speedups of these applications. The speedup of the GMapper of SpMV is lower than that of KMeans; however, the overall speedup of SpMV is higher than that of KMeans, because these applications incur other overheads. According to Amdahl's law, the speedup of an individual phase is higher than the overall speedup. We can also see that the speedup of the GMapper of PointAdd is smaller than that of KMeans and SpMV. The GReducer in GFlink cannot obtain a good speedup, as it is not compute-intensive.

6.6.3 The Average Running Time for Different Numbers of Slave Nodes on Cluster

In this experiment, we evaluate the scalability of SpMV and KMeans on GFlink. For the same matrix data size (10 GB), we vary the number of slave nodes to evaluate the performance. The average running time of SpMV for one iteration is shown in Fig. 7c, while the average running time of KMeans is presented in Fig. 7d. The running time on CPUs decreases rapidly as the number of slave nodes increases, while the running time on GPUs decreases slowly. That is because, for the implementation on GPUs, the computation is accelerated significantly; therefore, the overhead caused by I/O, communication over the network, task scheduling, and system invocation, rather than the computation itself, becomes the bottleneck.

6.6.4 Evaluation of Concurrent Multi-Applications

In this section, we evaluate GFlink's ability to support concurrent executions. Three applications (KMeans, SpMV, and PointAdd) are submitted simultaneously.

Fig. 7. (a) The average running time of KMeans with different iterations. (b) The average running time of SpMV with different iterations. (c) The average running time of KMeans with different numbers of slave nodes. (d) The average running time of SpMV with different numbers of slave nodes.

Fig. 8. (a) Effects of the cache scheme. (b) Detailed speedup of GMapper and GReducer on GFlink for different kinds of kernels. (c) Effects of concurrent multi-applications on a single node. (d) Effects of concurrent multi-applications on a cluster.



Fig. 8c shows the detailed running time of the GMappers of these applications on a single node. In this case, the parallelism of each application is set to 1. That is to say, for the executions on GPUs, one task is responsible for producing tasks, while two GPUs are responsible for consuming them. We can observe that the running time of concurrent execution is slightly more than three times that of exclusive execution. Fig. 8d shows the results of these three applications on a cluster with 10 slave nodes. In this case, the parallelism of each application is set to 10. We can observe that the speedup of executing these three applications alone is almost four times the speedup of executing them concurrently. That is because reading from and writing to HDFS, as well as transferring data over the network, affect the performance.

6.7 Evaluation of Communication Channel

In this case, we evaluate the performance of our proposed transfer channel by comparing its bandwidth with that of the native implementation (from a C library to the GPU). The results are shown in Table 2. As the table shows, the bandwidth of GFlink is similar to that of the native implementation. For both methods, the bandwidth increases with the number of transferred bytes, and when the transferred size exceeds 262144 bytes, both methods become stable. When transferring small amounts of data, the native implementation is better than GFlink, because the overhead caused by redirecting the API calls from CUDA-Wrapper to CUDA-Stub affects the performance of transferring small datasets. However, this overhead can be ignored when transferring large datasets.

7 CONCLUSION

This paper has proposed GFlink, which harnesses the computational power and high memory bandwidth of GPUs to accelerate in-memory cluster computing frameworks with an easy programming model. GFlink is compatible with the compile-time and run-time of the original Flink. Several measures have been taken to achieve high performance, such as an efficient strategy for communication between JVMs and GPUs, a three-stage pipelining execution strategy, and an adaptive locality-aware scheduling scheme, with the respective purposes of achieving load-balance among heterogeneous GPUs, hiding PCIe transfers, and avoiding unnecessary PCIe transfers. Extensive experimental results indicate that not only can the high computational power of GPUs be efficiently utilized, but the implementations on GFlink also outperform those on the original CPU-based Flink.

ACKNOWLEDGMENTS

The authors would like to express their gratitude to the four anonymous reviewers for their constructive comments, which have helped to improve the quality of the manuscript. The research was partially funded by the Key Program of the National Natural Science Foundation of China (Grant No. 61432005), the National Outstanding Youth Science Program of the National Natural Science Foundation of China (Grant No. 61625202), the International (Regional) Cooperation and Exchange Program of the National Natural Science Foundation of China (Grant No. 61661146006), the Singapore-China NRF-NSFC Grant (Grant No. NRF2016NRF-NSFC001-111), the National Natural Science Foundation of China (Grant Nos. 61370095, 61472124, 61662090, 61602350), the Key Technology Research and Development Programs of Guangdong Province (Grant No. 2015B010108006), the National Key R&D Program of China (Grant No. 2016YFB0201303), and the Outstanding Graduate Student Innovation Fund Program of the Collaborative Innovation Center of High Performance Computing.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Commun. ACM, vol. 51, no. 1, pp. 107–113, 2008.

[2] Flink Programming Guide, 2016. [Online]. Available: http://flink.apache.org/, Accessed on: Nov. 1, 2015.

[3] M. Zaharia, et al., "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proc. 9th USENIX Conf. Networked Syst. Des. Implementation, 2012, pp. 2–2.

[4] O. C. Marcu, A. Costan, G. Antoniu, and M. S. Pérez-Hernández, "Spark versus Flink: Understanding performance in big data analytics frameworks," in Proc. IEEE Int. Conf. Cluster Comput., 2016, pp. 433–442.

[5] K. Li, X. Tang, B. Veeravalli, and K. Li, "Scheduling precedence constrained stochastic tasks on heterogeneous cluster systems," IEEE Trans. Comput., vol. 64, no. 1, pp. 191–204, Jan. 2015.

[6] W. Yang, K. Li, Z. Mo, and K. Li, "Performance optimization using partitioned SpMV on GPUs and multicore CPUs," IEEE Trans. Comput., vol. 64, no. 9, pp. 2623–2636, Sep. 2015.

[7] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A high-performance, portable implementation of the MPI message passing interface standard," Parallel Comput., vol. 22, no. 6, pp. 789–828, 1996.

[8] L. Dagum and R. Menon, "OpenMP: An industry-standard API for shared-memory programming," IEEE Comput. Sci. Eng., vol. 5, no. 1, pp. 46–55, Jan.-Mar. 1998.

[9] P. Carbone, G. Fóra, S. Ewen, S. Haridi, and K. Tzoumas, "Lightweight asynchronous snapshots for distributed dataflows," arXiv:1506.08603, 2015.

[10] P. Li, Y. Luo, N. Zhang, and Y. Cao, "HeteroSpark: A heterogeneous CPU/GPU Spark platform for machine learning algorithms," in Proc. IEEE Int. Conf. Netw., Archit. Storage, 2015, pp. 347–348.

[11] Y. Yuan, M. F. Salmi, Y. Huai, K. Wang, R. Lee, and X. Zhang, "Spark-GPU: An accelerated in-memory data processing engine on clusters," in Proc. IEEE Int. Conf. Big Data, 2016, pp. 273–283.

[12] M. Grossman and V. Sarkar, "SWAT: A programmable, in-memory, distributed, high-performance computing platform," in Proc. 25th ACM Int. Symp. High-Performance Parallel Distrib. Comput., 2016, pp. 81–92.

[13] Y. Ohno, S. Morishima, and H. Matsutani, "Accelerating Spark RDD operations with local and remote GPU devices," in Proc. IEEE Int. Conf. Parallel Distrib. Syst., 2017, pp. 791–799.

[14] Z. Chen, J. Xu, J. Tang, K. Kwiat, and C. Kamhoua, "G-Storm: GPU-enabled high-throughput online data processing in Storm," in Proc. IEEE Int. Conf. Big Data, 2015, pp. 307–312.

[15] C. Chen, K. Li, A. Ouyang, Z. Tang, and K. Li, "GFlink: An in-memory computing architecture on heterogeneous CPU-GPU clusters for big data," in Proc. Int. Conf. Parallel Process., 2016, pp. 542–551.

TABLE 2
Bandwidth of Transfer Channel for Host to Device

Bytes      Bandwidth-GFlink    Bandwidth-Native
2048       776.398 MB/s        814.425 MB/s
4096       1241.311 MB/s       1348.418 MB/s
16384      2195.872 MB/s       2245.351 MB/s
32768      2556.237 MB/s       2646.721 MB/s
131072     2858.368 MB/s       2878.373 MB/s
262144     2968.151 MB/s       2945.243 MB/s
524288     2960.003 MB/s       2931.513 MB/s
1048576    2973.701 MB/s       2963.532 MB/s



[16] C. Gregg and K. Hazelwood, "Where is the data? Why you cannot debate CPU versus GPU performance without the answer," in Proc. IEEE Int. Symp. Perform. Anal. Syst. Softw., 2011, pp. 134–144.

[17] B. van Werkhoven, J. Maassen, F. Seinstra, and H. E. Bal, "Performance models for CPU-GPU data transfers," in Proc. 14th IEEE/ACM Int. Symp. Cluster, Cloud Grid Comput., 2014, pp. 11–20.

[18] C. Li, Y. Yang, Z. Lin, and H. Zhou, "Automatic data placement into GPU on-chip memory resources," in Proc. IEEE/ACM Int. Symp. Code Generation Optimization, 2015, pp. 23–33.

[19] N. Fauzia and P. Sadayappan, "Characterizing and enhancing global memory data coalescing on GPUs," in Proc. IEEE/ACM Int. Symp. Code Generation Optimization, 2015, pp. 12–22.

[20] T. Ben-Nun, E. Levy, A. Barak, and E. Rubin, "Memory access patterns: The missing piece of the multi-GPU puzzle," in Proc. SC - Int. Conf. High Perform. Comput., Netw., Storage Anal., 2017, Art. no. 19.

[21] I. J. Sung, G. D. Liu, and W. M. W. Hwu, "DL: A data layout transformation system for heterogeneous computing," in Proc. Innovative Parallel Comput., 2012, pp. 1–11.

[22] W. Fang, B. He, Q. Luo, and N. K. Govindaraju, "Mars: Accelerating MapReduce with graphics processors," IEEE Trans. Parallel Distrib. Syst., vol. 22, no. 4, pp. 608–620, Apr. 2011.

[23] C. Hong, D. Chen, W. Chen, W. Zheng, and H. Lin, "MapCG: Writing parallel program portable between CPU and GPU," in Proc. 19th Int. Conf. Parallel Architectures Compilation Tech., 2010, pp. 217–226.

[24] L. Chen, X. Huo, and G. Agrawal, "Accelerating MapReduce on a coupled CPU-GPU architecture," in Proc. Int. Conf. High Perform. Comput., Netw., Storage Anal., 2012, Art. no. 25.

[25] C. Chen, K. Li, A. Ouyang, Z. Tang, and K. Li, "GPU-accelerated parallel hierarchical extreme learning machine on Flink for big data," IEEE Trans. Syst. Man Cybern. Syst., vol. 47, no. 10, pp. 2740–2753, Oct. 2017.

[26] R. Mokhtari and M. Stumm, "BigKernel: High performance CPU-GPU communication pipelining for big data-style applications," in Proc. IEEE 28th Int. Parallel Distrib. Process. Symp., 2014, pp. 819–828.

[27] J. Jenkins, J. Dinan, P. Balaji, T. Peterka, N. F. Samatova, and R. Thakur, "Processing MPI derived datatypes on noncontiguous GPU-resident data," IEEE Trans. Parallel Distrib. Syst., vol. 25, no. 10, pp. 2627–2637, Oct. 2014.

[28] B. He, W. Fang, Q. Luo, N. K. Govindaraju, and T. Wang, "Mars: A MapReduce framework on graphics processors," in Proc. 17th Int. Conf. Parallel Architectures Compilation Tech., 2008, pp. 260–269.

[29] S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang, "The HiBench benchmark suite: Characterization of the MapReduce-based data analysis," in Proc. IEEE Int. Conf. Data Eng. Workshops, 2010, pp. 41–51.

Cen Chen is currently working toward the PhD degree in computer science at Hunan University, China. His research interests include parallel and distributed computing systems and machine learning on big data. He has published several research articles in international conferences and journals on machine learning algorithms and parallel computing.

Kenli Li received the PhD degree in computer science from the Huazhong University of Science and Technology, China, in 2003. He is currently a full professor of computer science and technology with Hunan University and the director of the National Supercomputing Center in Changsha. His major research interests include parallel computing, high-performance computing, and grid and cloud computing. He has published more than 130 research papers in international conferences and journals such as the IEEE Transactions on Computers, the IEEE Transactions on Parallel and Distributed Systems, and ICPP. He is a member of the IEEE and serves on the editorial board of the IEEE Transactions on Computers.

Aijia Ouyang received the PhD degree in computer science from Hunan University, China, in 2015. His research interests include parallel computing, cloud computing, and big data. He has published more than 20 research papers in international conferences and journals on intelligence algorithms and parallel computing.

Zeng Zeng received the PhD degree in electrical and computer engineering from the National University of Singapore, Singapore, in 2005, and the BS and MS degrees from the Huazhong University of Science and Technology, Wuhan, China, in 1997 and 2000, respectively. He currently works as a Scientist III in the Data Analytics Department, I2R, A*STAR, Singapore. From 2011 to 2014, he worked as a senior research fellow with the National University of Singapore. From 2005 to 2011, he worked as an associate professor in the Computer and Communication School, Hunan University, China. His research interests include distributed/parallel computing systems, data stream analysis, deep learning, multimedia storage systems, wireless sensor networks, and onboard fault diagnosis.

Keqin Li is a SUNY Distinguished Professor of computer science with the State University of New York. He is also a Distinguished Professor of the Chinese National Recruitment Program of Global Experts (1000 Plan) with Hunan University, China. He was an Intellectual Ventures endowed visiting chair professor at the National Laboratory for Information Science and Technology, Tsinghua University, Beijing, China, during 2011-2014. His current research interests include parallel computing and high-performance computing, distributed computing, energy-efficient computing and communication, heterogeneous computing systems, cloud computing, big data computing, CPU-GPU hybrid and cooperative computing, multicore computing, storage and file systems, wireless communication networks, sensor networks, peer-to-peer file sharing systems, mobile computing, service computing, Internet of things, and cyber-physical systems. He has published more than 535 journal articles, book chapters, and refereed conference papers, and has received several best paper awards. He is currently serving or has served on the editorial boards of the IEEE Transactions on Parallel and Distributed Systems, the IEEE Transactions on Computers, the IEEE Transactions on Cloud Computing, the IEEE Transactions on Services Computing, and the IEEE Transactions on Sustainable Computing. He is a fellow of the IEEE.

" For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.
