
Research Article
An OpenMP Programming Environment on Mobile Devices

Tyng-Yeu Liang, Hung-Fu Li, and Yu-Chih Chen

Department of Electrical Engineering, National Kaohsiung University of Applied Sciences, No. 415, Jiangong Road, Sanmin District, Kaohsiung City 807, Taiwan

Correspondence should be addressed to Tyng-Yeu Liang; [email protected]

Received 16 January 2016; Revised 22 April 2016; Accepted 22 May 2016

Academic Editor: Alessandro Cilardo

Copyright © 2016 Tyng-Yeu Liang et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Recently, the computational speed and battery capacity of mobile devices have improved greatly. With an enormous number of APPs, users can do many things on mobile devices as well as on computers. Consequently, more and more scientific researchers are encouraged to move their working environment from computers to mobile devices to increase their work efficiency, because they can analyze data and make decisions on their mobile devices anytime and anywhere. Accordingly, we propose a mobile OpenMP programming environment called MOMP in this paper. Using this APP, users can directly write, compile, and execute OpenMP programs on their Android-based mobile devices to exploit the embedded CPU and GPU for resolving their problems without a network connection. Because of source compatibility, MOMP lets users easily port their OpenMP programs from computers to mobile devices without any modification. Moreover, MOMP provides users with an easy interface to choose the CPU or GPU for executing different parallel regions in the same program based on the properties of those regions. Therefore, MOMP can effectively reduce the programming complexity of heterogeneous computing on mobile devices and exploit the computational power of mobile devices for the performance of user applications.

1. Introduction

In recent years, the hardware of mobile devices such as smartphones and tablets has been markedly upgraded. Most modern mobile devices have a multicore ARM CPU, a many-core GPU such as Adreno, PowerVR, or NVIDIA Tegra, 1∼2 GB of RAM, and enough battery capacity to stand by for dozens of hours. This remarkable hardware advancement has made mobile devices comparable to common PCs in application execution performance. On the other hand, the number and variety of applications appearing in Google Play and the App Store are enormous. Many software programs such as MS Office, Skype, Adobe Professional, and Photoshop have released mobile versions. With these APPs, users can do many things on mobile devices as well as on PCs. Consequently, mobile devices have become the most important equipment for users to connect with networks and handle their daily affairs, including communication, working, shopping, learning, and playing, because they are easier than laptops to own and carry.

Reacting to this development trend, some scientific software such as Octave [1] and Addi [2] recently appeared in Google Play to provide users with a MATLAB-like environment. Using these two APPs, researchers and engineers can make use of the CPUs embedded in mobile devices to perform mathematical computation and simulation by means of MATLAB [3] instructions or scripts. As a result, they can effectively increase their work efficiency because they usually carry their smartphones with them anytime and anywhere and can continue their research on the smartphones even after they leave their offices or laboratories. Moreover, they can save the energy consumed for resolving their problems, since the processors embedded in mobile devices usually have higher energy efficiency than the processors of computers.

However, not all problems are resolvable using only scientific simulation software. When users intend to exploit the computational power of multicore CPUs or many-core GPUs to speed up data computation, they have to write multiprocess or multithreaded programs using MPI [4], Pthread, OpenMP [5], CUDA [6], or OpenCL [7]. Unfortunately, most of the programs developed with these programming APIs are not portable to mobile devices because mobile devices differ from computers in processor architecture and operating system and do not support the necessary toolkits and runtime libraries.


Although Java is effective for resolving the problem of resource heterogeneity, it is not as efficient as C/C++. Therefore, user programs usually consume more time and energy if they are developed in Java instead of C/C++. To resolve this problem, several APPs such as C4droid [8], CppDroid [9], and CCTools [10] were recently proposed for users to develop and execute C/C++ programs on mobile devices. Basically, these APPs provide an embedded Linux terminal for users to develop programs. Consequently, they allow users to write multithreaded programs with Pthread to exploit the multicore CPU of mobile devices. However, they have the following problems.

First, the user interface of a Linux terminal is not friendly for those who are not used to operating a Linux OS. Second, multithreaded programming is not easy enough for users. When users make use of Pthread to develop applications, they have to manually deal with problem partitioning, data and thread synchronization, and load balancing by themselves. Consequently, the programming effort of users increases along with the code length of user programs. This is a big burden for users writing programs on mobile devices, especially when the touch screen is small. Third, these APPs do not support GPU programming on mobile devices. Although some smartphones support a CUDA/OpenCL runtime library and driver, no APP currently provides an IDE for users to develop CUDA or OpenCL applications on mobile devices. Users must write CUDA/OpenCL programs on computers and download the programs from computers to mobile devices for execution. As a result, it is not convenient for users to exploit the computational power of GPUs in mobile devices for resolving their problems, especially during application development, because they need to repeat remote program modification and recompilation many times over unstable wireless networks.

As previously described, we propose a mobile OpenMP programming environment called MOMP in this paper. Using this APP, users can directly edit, compile, and execute OpenMP programs on mobile devices and exploit the computational power of the embedded CPU and GPU for resolving their problems. OpenMP is a directive-oriented API. When users develop parallel applications with OpenMP, all they need to do is add OpenMP directives to their C/C++ programs. The OpenMP compiler of MOMP can generate Pthread and OpenCL codes based on the semantics of the OpenMP directives and can automatically add the instructions necessary for data partitioning and synchronization into the generated programs. As a result, MOMP can effectively reduce the programming complexity of heterogeneous computing on mobile devices because it allows users to develop parallel applications that exploit the computational power of the embedded CPU and GPU with a uniform, directive-oriented programming interface, that is, OpenMP, which is much easier than Pthread, CUDA, and OpenCL. Moreover, MOMP provides users with an extended directive to choose the CPU or GPU for executing different parallel regions in the same program according to computation demand and parallelism in order to obtain the best program performance.

The rest of this paper is organized as follows. Section 2 gives the background of OpenMP and OpenCL. Sections 3 and 4 describe the framework and implementation of MOMP, respectively. Section 5 discusses the experimental results of the performance evaluation. Section 6 discusses related work. Finally, Section 7 gives a number of conclusions for this paper and our future work.

2. Background

OpenMP is a shared-memory parallel programming interface managed by the OpenMP Architecture Review Board. It consists of compiler directives, a runtime library, and environment variables. Users can easily control the parallelism of their C/C++/Fortran programs by using OpenMP directives. The parallelism of OpenMP basically is classified into two kinds, that is, task parallelism and data parallelism. Task parallelism creates threads for executing different sections, while data parallelism generates threads that share the work of the same for-loop. On the other hand, users can make use of scheduling clauses such as static, dynamic, and guided or the OpenMP runtime functions to determine the way of work sharing in a given parallel region. For thread synchronization, they can use atomic to specify which variable should be updated atomically. They also can use data clauses to control the attributes of shared variables. For example, the private clause makes the threads of the same parallel region have their own copy of the same variable. The firstprivate clause makes these copies be initialized with the original value of the variable. The lastprivate clause makes the variable be updated with the value of the copy in the last thread that leaves the parallel region. The reduction clause makes the values of the copies be reduced by a specific operation such as addition, subtraction, multiplication, division, max, and min. Because the OpenMP compiler can automatically generate the corresponding multithreaded code according to the OpenMP directives, OpenMP is effective for reducing the complexity of multithreaded programming, and thereby it has become a popular parallel programming standard for shared-memory multiprocessors. Many studies [11, 12] were aimed at enabling the OpenMP programming interface on top of software distributed shared memory systems [13, 14] to reduce the programming complexity of computational clusters. In recent years, GPGPU has successfully become an alternative for high-performance computing because it can provide high computation performance with low energy consumption. Because the complexity of GPU programming with CUDA and OpenCL is too high for most users, many OpenMP-to-CUDA or OpenMP-to-OpenCL compilers such as Cetus [15], OpenMPC [16], OMPi [17], and HMCSOT [18] were proposed to address this issue.
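To make these clauses concrete, the following is a minimal, self-contained example in standard OpenMP (not specific to MOMP) that combines a data-parallel for-loop with the private, reduction, and schedule clauses described above:

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    int i;
    double sum = 0.0, x = 0.0;

    /* Data parallelism: loop iterations are shared among the forked
       threads. Each thread gets its own copy of x (private), partial
       sums are combined with + (reduction), and iterations are divided
       statically among threads (schedule). */
    #pragma omp parallel for private(x) reduction(+:sum) schedule(static)
    for (i = 0; i < 1000; i++) {
        x = i * 0.5;
        sum += x;
    }

    printf("sum = %f\n", sum);
    return 0;
}
```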

OpenCL is an open programming interface proposed by the Khronos Group to support heterogeneous computing. The runtime library of OpenCL consists of Platform, Device, Context, Program, Kernel, MemoryObject, CommandQueue, Image, and Sampler. The execution flow of OpenCL programs basically is composed of the following steps. The first is to get a platform available in the execution node and allocate an available device in the platform by clGetPlatformIDs() and clGetDeviceIDs(). The second is to build a context executable on the allocated device and construct a CommandQueue for the context to accept commands from the user program by clCreateContext() and clCreateCommandQueue(). The third is to generate a source program and compile the source program into an executable program by clCreateProgramWithSource() and clBuildProgram(), respectively. The fourth is to create a kernel for obtaining the entry point of the kernel function in the executable program by clCreateKernel(). The fifth is to allocate input and output MemoryObjects (i.e., device memory) accessible to the kernel and then copy data from host memory to the input MemoryObjects by clCreateBuffer() and clEnqueueWriteBuffer(), respectively. The sixth step is to launch the kernel function to the allocated device for execution by clEnqueueNDRangeKernel(). The seventh is to copy the execution result from the output MemoryObjects to host memory by clEnqueueReadBuffer() after the execution of the kernel function finishes. Recently, several studies such as JCL [19], VCL [20], and SunCL [21] were dedicated to the implementation of OpenCL clusters for resolving problems with a cluster of distributed heterogeneous processors.
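The seven steps above correspond to the following host-side skeleton. This is a minimal sketch, with error checking omitted; the kernel name "kern" and the one-dimensional problem size n are illustrative assumptions, not MOMP code:

```c
#include <CL/cl.h>

void run_kernel(const char *src, size_t n, float *in, float *out)
{
    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);                                  /* step 1 */
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL); /* step 2 */
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);                   /* step 3 */
    cl_kernel k = clCreateKernel(prog, "kern", NULL);                  /* step 4 */
    cl_mem dIn  = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, NULL);
    cl_mem dOut = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, NULL);
    clEnqueueWriteBuffer(q, dIn, CL_TRUE, 0, n * sizeof(float),
                         in, 0, NULL, NULL);                           /* step 5 */
    clSetKernelArg(k, 0, sizeof(cl_mem), &dIn);
    clSetKernelArg(k, 1, sizeof(cl_mem), &dOut);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);    /* step 6 */
    clEnqueueReadBuffer(q, dOut, CL_TRUE, 0, n * sizeof(float),
                        out, 0, NULL, NULL);                           /* step 7 */
}
```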

Different from the previous work, this work is aimed at proposing a mobile programming environment for reducing the programming complexity of heterogeneous computing on mobile devices. In the proposed programming environment, OpenMP programs are translated into OpenCL ones, which can be executed by the CPU or GPU. For applications implemented with recursive algorithms, dynamic task creation and task dependency maintenance are necessary. However, OpenCL did not support dynamic parallelism until its second version, and almost all mobile devices produced by different vendors currently support only the OpenCL 1.1 or 1.2 embedded profile, which does not provide dynamic parallelism. Therefore, this work currently focuses on the implementation of OpenMP 2.0 on mobile devices for data-parallel applications.

3. Framework

MOMP is developed based on a mobile integrated development environment called UbiC, which was proposed in our previous work [22]. With the support of UbiC, users can edit, compile, execute, and debug their C programs on Android-based mobile devices without the help of remote servers. UbiC consists of a GUI, program editor, compiler, executer, and debugger, as shown in Figure 1. UbiC adopts Clang [23] to compile C programs into LLVM [24] IRs. When users press the execution button, the executor of UbiC loads the LLVM IRs of the working program and translates them into optimized native code for execution by means of MCJIT. Consequently, UbiC allows user programs to be portable to any platform that supports LLVM while simultaneously achieving performance close to that of native code. On the other hand, UbiC adopts a modified LLVM interpreter instead of gdb as the program debugger because of cost considerations. It inserts DWARF-3 tags into user programs when it compiles them.

Figure 1: Framework of UbiC. An Android GUI drives the editor, compiler (Clang), executer (LLVM MCJIT), and debugger (LLVM interpreter).

When users debug their programs, the modified LLVM interpreter retrieves the necessary information from the program context according to the inserted tags. Currently, the program debugger supports breakpoints, single stepping, and variable viewing for users to trace the execution status of their programs. Although UbiC provides a complete and friendly C programming environment, it does not support OpenMP. Basically, the GUI of MOMP is the same as that of UbiC, while the compiler and executor of MOMP are different from those of UbiC because MOMP is dedicated to supporting OpenMP. Therefore, the following description focuses on the compiler and executor of MOMP.

The program compiler of MOMP mainly consists of a Java-based omniDriver, a frontend compiler, and a backend compiler. The frontend compiler is constructed with yacc and C, while the backend compiler is implemented in Java. Here we give an example program to explain the flow of program compilation in MOMP, as shown in Figure 2. This example program is composed of two parallel regions: one is matrix multiplication and the other is vector addition. When users press the compilation button of the GUI to compile this program, the program compiler of MOMP compiles the program as follows.

First, the compiler driver sets up the arguments and path of program compilation and expands the source program with the preprocessor of Clang. Second, the frontend compiler translates the source program into an abstract syntax tree composed of Xobjects. Third, the backend compiler generates another source program based on the abstract syntax tree. In the new source program, the two parallel regions are extracted from the main function and then translated into Pthread modules (e.g., ompc_func_1 and ompc_func_2) and OpenCL ones (e.g., cuker_1 and cuker_2). Meanwhile, the main function is modified by replacing the two parallel regions with the stub1 and stub2 functions, which are used to invoke the Pthread-version or OpenCL-version modules according to the architecture of the target processor. Finally, Clang compiles the new source program, which consists of the main module, Pthread modules, and OpenCL modules, to generate an executable file (i.e., output.ll) of LLVM IRs. The OpenMP translator of the backend compiler is implemented based on OMPICUDA, which was proposed in our previous work [25]. The implementation of the OpenMP translator is detailed in Section 4.
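The following sketch illustrates the shape of the generated code just described. The stub and module names (stub1, ompc_func_1, cuker_1) follow the paper's example, but the dispatch helper momp_use_opencl() and all signatures are assumptions made for illustration:

```c
#include <stdio.h>

/* Hypothetical runtime query: nonzero if an OpenCL device is usable. */
static int momp_use_opencl(void) { return 0; }

/* Pthread-version and OpenCL-version modules generated from the first
   parallel region (bodies elided to simple markers). */
static void ompc_func_1(void *args) { (void)args; puts("Pthread module"); }
static void cuker_1(void *args)     { (void)args; puts("OpenCL module"); }

/* The stub that replaces the first parallel region in main(). */
static void stub1(void *args)
{
    if (momp_use_opencl())
        cuker_1(args);      /* launch the OpenCL kernel version */
    else
        ompc_func_1(args);  /* fork Pthreads for the region     */
}

int main(void) { stub1(NULL); return 0; }
```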

When users press the execution button of the GUI to run the example program, the executor of MOMP loads the executable file, that is, output.ll, of the program and then executes it through LLVM by means of MCJIT, as shown in Figure 3.


Figure 2: Flow of program compilation in MOMP. The compiler driver (omniDriver) sets parameters and preprocesses the user code with Clang; the frontend compiler translates it into Xobjects; the backend translator splits sequential code and parallel regions into the main, Pthread (OpenMP-to-Pthread), and OpenCL (OpenMP-to-OpenCL) modules; and Clang/llvm-link link them into output.ll (LLVM IR).

Because the executable file is runnable on any platform that supports LLVM, MOMP effectively maintains the portability of user programs. It is worth noting that MOMP provides an extended directive called target() for users to select the Pthread or OpenCL version of a given parallel region to be executed at runtime, based on parallelism degree and computation demand. If the target directive is set to "pthread," the executor loads the Pthread-version module of the parallel region and executes the module on the CPU. By contrast, if the target directive is set to "opencl," it loads the OpenCL-version module and executes the module on the GPU by default. However, not all mobile devices support both a GPU and the OpenCL driver. For example, MiPAD supports only the CUDA driver for its NVIDIA K1 GPU. Consequently, the executor loads the OpenCL-version module and executes the module on the GPU if both the OpenCL driver and a GPU are available. Otherwise, it loads the OpenCL-version module but executes the module on another type of processor such as the ARM CPU if the OpenCL driver is available but a GPU is not. If neither OpenCL nor a GPU is available, it automatically falls back to loading the Pthread-version module of the parallel region and executes the module on the ARM CPU. By means of the target directive, users can easily adapt the type of processor used to execute different parallel regions to optimize the total execution performance of their programs through the same programming interface.
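As a usage illustration, a program might pin its compute-heavy region to the GPU and its communication-heavy region to the CPU as sketched below. The paper names the target() directive and its "pthread"/"opencl" values, but the exact syntax shown here is our assumption:

```c
void compute(int n, float *a, float *b, float *c)
{
    int i;

    /* Compute-intensive region with high parallelism: prefer the GPU
       (the executor falls back to the CPU if no OpenCL driver or GPU
       is available). */
    #pragma omp target(opencl)
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        c[i] = a[i] * b[i];

    /* Region with little computation per datum: prefer the CPU to
       avoid host-device memory copies. */
    #pragma omp target(pthread)
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        c[i] = c[i] + a[i];
}
```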

4. Implementation

Basically speaking, there are two important jobs in the implementation of MOMP. The first is to port the OpenMP compiler of OMPICUDA onto mobile devices. The second is to implement a reduced OpenCL runtime library based on the CUDA driver API for NVIDIA mobile GPUs such as the K1. We detail how to accomplish these two jobs as follows.

4.1. OpenMP Compiler. The entry point of the OpenMP compiler consists of a launcher and omniDriver, originally implemented in C and shell script, respectively. The launcher is responsible for passing the arguments of program compilation to omniDriver for creating a template. This template stores the necessary information, such as the native compiler, programming language, and the paths of runtime and dynamic linking libraries (e.g., "-L/libraryPATH/... -lpthread"), for the backend of the OpenMP compiler to analyze and translate user programs into executable files. To accomplish the functions of the launcher and omniDriver, we faced and resolved the following problems on Android for MOMP.

Although Android is constructed based on Linux, it does not support bash or all of the shell instructions. Unfortunately, this means the omniDriver implemented in shell script does not work on Android. To resolve this problem, we implemented a new launcher and omniDriver in the Java programming language in order to connect with the user interface and the backend of the OpenMP compiler in MOMP.


Figure 3: Program execution in MOMP. The executer loads output.ll (LLVM IR), selects the OpenCL or Pthread module for each parallel region, links the runtime library, and runs the code through LLVM MCJIT on the ARM CPU or an OpenCL device.

On the other hand, most of the shell instructions used by the backend of the OpenMP compiler can be processed by invoking "/bin/sh -c" via Runtime.exec(). However, Runtime.exec() does not support the I/O redirection of instructions. Therefore, MOMP intercepts redirection instructions first. Then, it uses an InputStreamReader to retrieve the output of Runtime.exec() and writes or appends the retrieved output to the original destination file of the redirection instruction. For example, if the instruction is "ls >> file," MOMP will use Runtime.exec() to run "ls" first. Then, it will retrieve the output of Runtime.exec() and will append the output to the file if the file already exists, or write it to a new file otherwise.

OpenMP allows users to set the number of threads forked for the execution of parallel regions in their programs by using the OMP_NUM_THREADS environment variable. When the programs are executed, the runtime system of OpenMP retrieves the environment variable by calling the getenv() function to decide the number of forked threads. However, the environment variables on Android platforms are read-only for user applications. To attack this problem, the OpenMP compiler of MOMP uses Java's ProcessBuilder to create a child process for program execution and sets a group of environment variables for the child process, because ProcessBuilder can set up the environment variables of forked processes. When the child process executes user programs, the runtime system of MOMP is able to retrieve the environment variables set by the OpenMP compiler by calling the getenv() function.

During program compilation, the intermediate files generated by the OpenMP compiler are stored in the /tmp directory, but this directory is not always available or accessible to applications on all Android platforms. On Android, each application is allowed to access only its own directory, such as /data/data/<packagename>/files. Consequently, the OpenMP compiler cannot access the root directory or the /tmp directory. To resolve this problem, MOMP creates a /tmp directory under /data/data/<MOMP>/files for the OpenMP compiler to store the intermediate files of program compilation.

On the other hand, the OpenMP compiler of OMPICUDA supports OpenMP-to-CUDA but not OpenMP-to-OpenCL translation, while most modern mobile devices support OpenCL but not CUDA. Therefore, we implemented an OpenMP-to-OpenCL (denoted OMPCL) translator for MOMP. Because most of the CUDA API can be mapped onto the OpenCL API, the OMPCL translator of MOMP mainly is implemented based on the framework of the OpenMP-to-CUDA translator of OMPICUDA. Considering paper length, we detail only how the OMPCL translator maps the parallel regions of OpenMP to OpenCL kernels and maps the variables used in parallel regions from host memory to device memory for the execution of kernels, as follows.

4.1.1. Mapping Parallel Regions to Kernels. OpenMP programming is based on a fork-join model, while both CUDA and OpenCL are based on a client-server model. It is therefore necessary for the OMPCL translator to map the fork-join model onto the client-server one. The OMPCL translator implements the mapping between the two programming models as shown in Figure 4.

An OpenMP program basically consists of sequential regions and parallel regions. When an OpenMP program is translated into an OpenCL one, the sequential regions and parallel regions of the OpenMP program are mapped onto the host (i.e., client) program and the kernels (i.e., server) of the OpenCL one, respectively.


Figure 4: Programming model mapping between OpenMP and CUDA/OpenCL. (a) OpenMP: sequential codes, fork, parallel codes, join; (b) CUDA: cudaMemcpy(H2D), cudaLaunch, cudaThreadSynchronize, cudaMemcpy(D2H); (c) OpenCL: clEnqueueWriteBuffer, clEnqueueNDRangeKernel, clFinish, clEnqueueReadBuffer.

Whenever the execution of the host program arrives at a parallel region, it copies the shared data necessary for the parallel region from host to device by calling clEnqueueWriteBuffer(), because device memory is independent of host memory. Then, it launches a kernel to the device by calling clEnqueueNDRangeKernel() for concurrently processing the shared data in the device memory with multiple GPU threads. Finally, it copies the execution result of the kernel from device to host by invoking clEnqueueReadBuffer(). As previously described, the fork and join operations of OpenMP are mapped onto the clEnqueueNDRangeKernel() and clFinish() calls of OpenCL, respectively.

4.1.2. Mapping Variables from Host Memory to Device Memory. The working threads in a parallel region may access global, shared, and local variables. Global and shared variables are shareable among all the threads, while local variables are not shareable and are duplicated for each thread. Accordingly, the OMPCL translator allocates shared and global variables in the global memory of the GPU for data sharing among different thread blocks, while it allocates local variables to the registers of the GPU for each thread. To achieve this memory-allocation policy, the OMPCL translator traverses all the global and shared variables that are accessed in the parallel region and marks the variables as GlobalShared or LocalShared. Because OpenCL does not support global variables and not all global variables are accessed in a parallel region, the OMPCL translator creates two data structures, that is, GlobalShared and LocalShared, to pass globally and locally shared variables into OpenCL kernels so that GPU threads can access them. We use the example shown in Figure 5 to explain how the OMPCL translator uses these two data structures to map globally and locally shared variables from host memory onto device memory.

In this example, there are two one-dimensional matrices, that is, a and b, each with 128 integers. Because the two matrices are globally shared, the OMPCL translator defines a GlobalShared data structure called globals to represent matrices a and b. In contrast, the matrix c declared in the main function is locally shared. Consequently, the OMPCL translator defines a LocalShared data structure called locals to represent the matrix c. In addition, the OMPCL translator extracts the parallel region from the main function in Figure 5(a) to generate a kernel function called openclKernel with two parameters named globals and locals, which are pointers to the globals and locals structures, respectively. It also changes the expression c[i]=a[i]+b[i] in the parallel region to locals->c[i]=globals->a[i]+globals->b[i], as shown in Figure 5(b). On the other hand, the OMPCL translator generates a function called parallel0 for the host program to offload the parallel region onto the device for execution, as shown in Figure 5(c). In this function, the first step is to allocate one globals and one locals data structure in device memory and then copy matrices a, b, and c from host memory to the globals and locals data structures in device memory. The next step is to create a kernel bound to the openclKernel function and then launch the kernel to the device for execution. The final step is to copy the locals data structure from device memory to the matrix c in host memory.
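Since Figure 5 is not reproduced here, the following sketch reconstructs the essence of the translation just described. The structure and kernel names (globals, locals, openclKernel) follow the text, while the exact field layout is an assumption:

```c
/* Host-side structure definitions mirroring the figure's example. */
typedef struct { int a[128]; int b[128]; } GlobalShared;  /* globals */
typedef struct { int c[128]; } LocalShared;               /* locals  */

/* Generated OpenCL kernel: the parallel region's body is rewritten to
   access the shared variables through the two structure pointers. */
static const char *kernel_source =
    "typedef struct { int a[128]; int b[128]; } GlobalShared;\n"
    "typedef struct { int c[128]; } LocalShared;\n"
    "__kernel void openclKernel(__global GlobalShared *globals,\n"
    "                           __global LocalShared  *locals)\n"
    "{\n"
    "    int i = get_global_id(0);\n"
    "    locals->c[i] = globals->a[i] + globals->b[i];\n"
    "}\n";
```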

On the other hand, the values of pointer variables used in parallel regions are invalid in device memory because of the different address spaces. This problem cannot be addressed at program compilation time because the values of pointer variables usually are determined and changed at runtime. To resolve this problem, the OMPCL translator generates a mapping table for tracking the information of the host memory segments pointed to by pointer variables, as shown in Figure 6. This mapping table is divided into global and dynamic sections because pointer variables may point to global variables or to dynamically allocated memory spaces. When an OpenMP program is compiled, the OMPCL translator searches for global variables in the program and creates a registry_global_variable function for each global variable to store the start address and length of the global variable into the global section of the mapping table at run time.


Figure 5: Mapping variables from host memory to device memory. (a) The original user program; (b) the kernel program translated from (a); (c) the host program corresponding to (b).

However, this method is not workable for dynamically allocated memory spaces. To resolve this problem, we added a hooker to the dynamic allocation functions, including malloc, valloc, calloc, and realloc, in this paper. Whenever a dynamic allocation function is invoked, the hooker fills the start address and length of the allocated memory space into the dynamic section of the mapping table.
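A minimal sketch of such a hooker is shown below. The table layout (start, length, devPtr) follows Figure 6, while the function and array names are assumptions:

```c
#include <stdlib.h>

/* One entry of the dynamic section of the mapping table (Figure 6). */
typedef struct {
    void  *start;   /* address returned by the allocator */
    size_t length;  /* size of the allocated block       */
    void  *devPtr;  /* device address, NULL until mapped */
} Segment;

static Segment dynamic_table[1024];
static int     dynamic_count = 0;

/* Hooked malloc: records every allocation in the mapping table. */
static void *momp_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL && dynamic_count < 1024) {
        dynamic_table[dynamic_count].start  = p;
        dynamic_table[dynamic_count].length = size;
        dynamic_table[dynamic_count].devPtr = NULL;
        dynamic_count++;
    }
    return p;
}
```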

In addition to the mapping table, the OMPCL translator creates an exchange_pointer function for each pointer variable used in a parallel region before launching the kernel corresponding to the parallel region to the device. The execution flow of the exchange_pointer function is shown in Figure 7. The first step is to search the mapping table to find out which host memory segment contains the value of the pointer variable (assume that it is addr). The second is to check whether the devPtr field of the host memory segment is null, that is, whether the memory segment has been mapped onto a device memory segment or not. If it has not, it is necessary to allocate a device memory space as big as the host memory segment and update the mapping table by filling the start address (assume that it is devaddr3) of the allocated device memory space into the devPtr field first. Then, the next step is to copy data from the host memory segment to the allocated device memory space. The final step is to change the value of the pointer variable to devaddr3+(addr-a3). It is worth noting that the value of the pointer variable is changed back to addr by referencing the mapping table after the execution of the kernel finishes. Through the above process, user programs can transparently access the same pointer variables in both parallel regions and kernels.
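The flow of Figure 7 can be sketched as follows, reusing the Segment table from the previous sketch; dev_alloc() and dev_write() are hypothetical stand-ins for device allocation and host-to-device copy (e.g., clCreateBuffer() and clEnqueueWriteBuffer()):

```c
#include <stdlib.h>
#include <string.h>

/* Stand-ins for device allocation and host-to-device copy. */
static void *dev_alloc(size_t n) { return malloc(n); }
static void  dev_write(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }

/* Translate a host pointer into the corresponding device pointer. */
static void *exchange_pointer(void *addr)
{
    int i;
    for (i = 0; i < dynamic_count; i++) {
        Segment *s = &dynamic_table[i];
        char *base = (char *)s->start;
        /* (1) Find the host memory segment containing addr. */
        if ((char *)addr >= base && (char *)addr < base + s->length) {
            /* (2) Map the segment onto device memory if needed. */
            if (s->devPtr == NULL) {
                s->devPtr = dev_alloc(s->length);
                /* (3) Copy the segment's data from host to device. */
                dev_write(s->devPtr, s->start, s->length);
            }
            /* (4)-(5) Return the device address plus the offset. */
            return (char *)s->devPtr + ((char *)addr - base);
        }
    }
    return addr;  /* not found in the table: leave unchanged */
}
```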

4.2. A Reduced OpenCL Runtime Library Based on CUDA. MOMP translates OpenMP directives into OpenCL codes to exploit the GPUs in mobile devices for data computation. However, some mobile devices, such as the MiPAD embedded with an NVIDIA GPU, support only the CUDA driver and no OpenCL runtime library. As a consequence, the OpenCL programs generated by the MOMP compiler cannot be executed by these mobile devices.


Figure 6: Registration of pointer variables. Global and static variables are recorded in the global section of the mapping table (start address, length, devPtr) by registry_global_variables(), while blocks obtained from malloc are recorded in the dynamic section by the malloc hooker (registry_dynamic_allocation).

Figure 7: Operations of the exchange pointer function. (1) Search the mapping table for the memory segment containing addr (assume a3 < addr < a3 + L3); (2) check whether the segment has been mapped onto device memory; (3) if not, allocate device memory, update the mapping table, and copy the data from host to device; (4) modify the value of the pointer; (5) return devaddr3 plus the offset (addr - a3).

To resolve this problem, we developed a reduced OpenCL runtime library called MCOCL (mobile CUDA-based OpenCL) based on the CUDA driver for MOMP in this paper. Because MOMP provides users with OpenMP rather than OpenCL to develop parallel-computing applications on mobile devices, this reduced runtime library currently consists of only the OpenCL functions necessary for the OMPCL compiler to generate OpenCL source programs, as shown in Table 1. We briefly describe the implementation of some of these functions as follows.

clGetDeviceIDs() is used to get the handles and count of the devices available in a platform. In fact, the meaning of most of the parameters used in the CUDA driver API is the same as in the OpenCL API. For example, both CUdevice and cl_device_id are used to identify devices. Consequently, MCOCL uses cuDeviceGetCount() and cuDeviceGet() to query the number and identifiers of devices when clGetDeviceIDs() is called by user programs.
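For instance, the device query could be mapped onto the CUDA driver API as in the following sketch; the simplified signature and the representation of cl_device_id as a bare CUdevice are assumptions, not the actual MCOCL code:

```c
#include <cuda.h>

typedef int          cl_int;
typedef unsigned int cl_uint;
typedef CUdevice     cl_device_id;  /* assumed internal representation */
#define CL_SUCCESS          0
#define CL_DEVICE_NOT_FOUND (-1)

/* Simplified clGetDeviceIDs() built on the CUDA driver API. */
static cl_int mcocl_clGetDeviceIDs(cl_uint num_entries,
                                   cl_device_id *devices,
                                   cl_uint *num_devices)
{
    int count = 0, i, n;
    if (cuDeviceGetCount(&count) != CUDA_SUCCESS || count == 0)
        return CL_DEVICE_NOT_FOUND;
    if (devices != NULL) {
        n = (count < (int)num_entries) ? count : (int)num_entries;
        for (i = 0; i < n; i++)
            cuDeviceGet(&devices[i], i);  /* map ordinal to CUdevice */
    }
    if (num_devices != NULL)
        *num_devices = (cl_uint)count;
    return CL_SUCCESS;
}
```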


Figure 8: Execution flow of clBuildProgram() in MOMP. Step 1: compile kernel.cl from OpenCL code to LLVM IR (kernel.ll) with Clang; Step 2: link the LLVM IR with the OpenCL implementation (libclc.bc) via llvm-link to produce kernel.bc; Step 3: compile kernel.bc from bitcode to PTX (kernel.ptx) for the CUDA driver; Step 4: load the PTX into the CUDA driver via cuModuleLoad() to obtain a CUmodule.

Table 1: The OpenCL functions necessary for MOMP.

clGetPlatformIDs, clGetPlatformInfo, clGetDeviceIDs, clCreateContext, clGetContextInfo, clCreateProgramWithSource, clBuildProgram, clGetProgramBuildInfo, clCreateKernel, clEnqueueNDRangeKernel, clSetKernelArg, clCreateCommandQueue, clCreateBuffer, clEnqueueReadBuffer, clFinish, clEnqueueWriteBuffer, clReleaseKernel, clReleaseMemObject.


clCreateContext() is used to get a context for a given platform. An OpenCL context is created for one or more devices and is used by the OpenCL runtime system for managing command queues, memory, programs, and kernels. By contrast, a CUDA context is created by cuCtxCreate() and is useful for only one device. Consequently, MCOCL calls cuCtxCreate() to create a CUDA context for each device and stores the multiple CUDA contexts in a context array to represent one OpenCL context. When user programs intend to use different devices, MCOCL calls cuCtxPopCurrent() and cuCtxPushCurrent() to achieve context switching.

clCreateCommandQueue() is aimed at creating a command queue for launching kernels. Since OpenCL allows user programs to asynchronously execute the kernels appended to a command queue, MCOCL uses asynchronous CUDA streams to implement OpenCL command queues. When user programs call clEnqueueReadBuffer() or clEnqueueWriteBuffer() in blocking mode, MCOCL invokes cuStreamSynchronize() to wait for the termination of a given kernel and then calls cuMemcpyDtoH() or cuMemcpyHtoD() to perform the memory copy from device to host or from host to device. On the contrary, it invokes cuMemcpyDtoHAsync() or cuMemcpyHtoDAsync() when clEnqueueReadBuffer() or clEnqueueWriteBuffer() is called in nonblocking mode. When user programs call clFinish() to wait for the termination of the kernels in the command queue, MCOCL calls cuStreamSynchronize() to wait for the termination of the corresponding streams.
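The blocking and nonblocking paths described above can be sketched as follows; the internal queue and buffer structures are assumptions:

```c
#include <cuda.h>

typedef struct { CUstream    stream; } mcocl_queue;  /* one stream per queue */
typedef struct { CUdeviceptr dptr;   } mcocl_mem;    /* device-side buffer   */

/* Simplified clEnqueueWriteBuffer(): host-to-device copy on a stream. */
static int mcocl_write_buffer(mcocl_queue *q, mcocl_mem *buf, int blocking,
                              size_t offset, size_t size, const void *host)
{
    if (blocking) {
        /* Wait for previously queued kernels, then copy synchronously. */
        if (cuStreamSynchronize(q->stream) != CUDA_SUCCESS)
            return -1;
        return (cuMemcpyHtoD(buf->dptr + offset, host, size)
                == CUDA_SUCCESS) ? 0 : -1;
    }
    /* Nonblocking: asynchronous copy on the queue's stream. */
    return (cuMemcpyHtoDAsync(buf->dptr + offset, host, size, q->stream)
            == CUDA_SUCCESS) ? 0 : -1;
}
```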

clBuildProgram() loads a program source and then builds a program executable from it. Although the cuModuleLoad() of CUDA is useful for loading programs, the loaded programs are composed of PTX (Parallel Thread Execution) codes instead of OpenCL sources. Therefore, MOMP must be able to translate OpenCL sources into PTX codes. Fortunately, LLVM provides libclc.bc, which makes Clang able to translate OpenCL into PTX for NVIDIA GPUs. The execution flow of clBuildProgram() in MCOCL is shown in Figure 8. The first step is to write the OpenCL kernel code into a file named kernel.cl and compile this file with Clang to create another file, kernel.ll, composed of LLVM IR. Because this kernel.ll file includes OpenCL symbols that must be supported by libclc, MCOCL uses llvm-link to link kernel.ll with libclc to generate another file, kernel.bc, and then uses Clang to translate the kernel.bc file into a PTX file called kernel.ptx. Finally, it calls cuModuleLoad() to load the PTX file into memory as a CUmodule for clBuildProgram(). This CUmodule is regarded as the program handle and is stored in cl_program.


The messages and parameters of program compilation are also stored in cl_program for clGetProgramBuildInfo() to retrieve later.

clSetKernelArg() is used to set the argument value for a specific argument of a kernel. This function is the same as the cuParamSetv() of CUDA. However, clSetKernelArg() uses arg_index to reference the arguments of an OpenCL kernel, while cuParamSetv() accesses the parameters of a CUDA kernel function by means of an offset. In order to correctly convert arg_index into an offset, MCOCL loads kernel.ll as an IR module for LLVM and then invokes getFunction() and getFunctionType() to retrieve the kernel from the IR module and the prototype of the kernel, respectively. Then, it calls getParamType() and getTypeID() to extract the arguments from the kernel prototype and get their types. Finally, it computes the offset of the 𝑖th argument by summing the argument lengths from the first argument to the (𝑖 − 1)th one and stores the offsets of all the arguments in the cl_kernel created by clCreateKernel() for translating arg_index to offset later.
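In its simplest form, this offset computation is a prefix sum of the argument sizes, as sketched below (a real implementation would also respect alignment; the argument sizes are assumed to come from the kernel prototype extracted via LLVM):

```c
#include <stddef.h>

/* Offset of argument arg_index = sum of the sizes of arguments
   0 .. arg_index-1, following the paper's description. */
static size_t arg_offset(const size_t *arg_sizes, unsigned arg_index)
{
    size_t offset = 0;
    unsigned i;
    for (i = 0; i < arg_index; i++)
        offset += arg_sizes[i];
    return offset;
}
```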

clEnqueueNDRangeKernel() is used to submit a command for executing a given kernel. The purpose of this function is the same as that of cuLaunchKernel(), but the thread configuration used for OpenCL kernels is different from that used for CUDA kernel functions. OpenCL uses global_work_size and local_work_size to represent the number of global work-items in work_dim dimensions and the number of work-items making up a work-group for executing the kernel function, respectively. By contrast, CUDA uses gridDim and blockDim to represent the dimensions of a thread-block grid and of a thread block, respectively. The local_work_size of OpenCL can be regarded as the blockDim of CUDA. Consequently, MCOCL uses the local_work_size to specify the dimensions of a thread block. It then sets the dimensions of the thread-block grid by

$$\mathit{grid\_width} = \left\lceil \frac{\mathit{global\_work\_size}[0]}{\mathit{local\_work\_size}[0]} \right\rceil, \qquad \mathit{grid\_height} = \left\lceil \frac{\mathit{global\_work\_size}[1]}{\mathit{local\_work\_size}[1]} \right\rceil. \tag{1}$$
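In C, (1) is the usual ceiling division, for example:

```c
#include <stddef.h>

/* Derive the CUDA grid dimensions from the OpenCL work sizes per (1). */
static void grid_dims(const size_t global_work_size[2],
                      const size_t local_work_size[2],
                      size_t *grid_width, size_t *grid_height)
{
    *grid_width  = (global_work_size[0] + local_work_size[0] - 1)
                   / local_work_size[0];
    *grid_height = (global_work_size[1] + local_work_size[1] - 1)
                   / local_work_size[1];
}
```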

clCreateBuffer() is used to create a buffer object for data communication between host and device. To implement this function, MCOCL basically calls cuMemAlloc() and cuMemcpyHtoD() to allocate device memory for the buffer object and then copy data from host memory to the buffer object by default. However, user programs can set the flags argument of this function to specify whether the created buffer object is located in host or device memory. Consequently, MCOCL invokes different CUDA functions for clCreateBuffer() according to the flags argument. For example, it calls cuMemAllocHost() and memcpy() to allocate host memory for the buffer object and then copy the data to the allocated host memory when the flags argument is set to CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR.

Table 2: Experimental environment.

XiaoMi MiPAD
  OS: Android 4.4.4
  CPU: Quad-Core ARM Cortex-A15 @ 2.2 GHz
  Memory: 2 GB LPDDR3
  Cache: L1 64 KB (32 KB I-cache, 32 KB D-cache) per core; L2 up to 4 MB per cluster
  GPU: 192 NVIDIA CUDA cores (NVIDIA Kepler architecture)

SONY Z3
  OS: Android 4.4.4
  CPU: Qualcomm Snapdragon 801 ARMv7 @ 2.5 GHz
  Memory: 3 GB
  Cache: L0 4 KB + 4 KB; L1 16 KB + 16 KB; L2 2 MB
  GPU: Adreno 330 (578 MHz)

Samsung Note3
  OS: Android 4.4.2
  CPU: Quad-Core Snapdragon 800 Krait @ 2.3 GHz
  Memory: 3 GB LPDDR3
  Cache: L0 4 KB + 4 KB; L1 16 KB + 16 KB; L2 2 MB
  GPU: Adreno 330 (578 MHz)

InFocus M810
  OS: Android 4.4.4
  CPU: Qualcomm Snapdragon 801 ARMv7 @ 2.5 GHz
  Memory: 2 GB
  Cache: L0 4 KB + 4 KB; L1 16 KB + 16 KB; L2 2 MB
  GPU: Adreno 330 (578 MHz)

Test applications
  Matrix multiplication: matrix size N × N, N = {512, 1024, 1536, 2048}
  Nbody: body number = {1024, 2048, 3072, 4096}, loops = 100
  SOR: matrix size N × N, N = {1024, 2048, 3072, 4096}, loops = 100
  Mandelbrot set: image size N × N, N = {256, 512, 768, 1024}

5. Performance Evaluation

We have evaluated the performance of MOMP in this paper. Our experimental environment includes a Xiaomi MiPAD, Sony Z3, Samsung Note3, and InFocus M810. The resource configurations of these four mobile devices are listed in Table 2. We developed four applications: matrix multiplication (MM), Successive Over-Relaxation (SOR), Nbody, and Mandelbrot set. Both MM and Nbody involve a great deal of data computation, while MM requires more memory space than Nbody. By contrast, the computation demand of SOR is much less than that of MM and Nbody, while the memory demand of SOR is more than that of Nbody but less than that of MM. Mandelbrot set is an application that generates and draws fractal images based on recursive formulas. It requires less memory space than MM and SOR but more than Nbody.


Since Nbody and SOR are iterative applications, the runtime system of MOMP forks multiple threads and assigns them to the multicore CPU, or launches kernels to the GPU for concurrent execution, at the beginning of each iteration. When the execution processor is the GPU, it copies input data from host to device before a kernel is launched and copies output data from device to host after the kernel terminates, for each iteration. Basically, this performance evaluation individually ran the test applications on the four mobile devices and measured the execution time of the test applications using the CPU or GPU of each device.

On the other hand, we measured the performance of memory accesses and of memory copies between host and device (denoted as HtoD and DtoH) on the mobile devices before we did our experiments. The experimental result is shown in Figure 9. The performance of memory accesses on MiPAD is worse than that of the others. For memory copies between host and device, MiPAD is the worst of the four mobile devices. By contrast, Z3 is the best, and M810 is almost as good as Z3. As for Note3, it is better than MiPAD but worse than Z3 and M810. We will use this result to explain the execution performance of the four mobile devices later.

5.1. Performance of MOMP Using CPU in Mobile Devices. Our first experiment was aimed at evaluating the performance of MOMP when it used the CPU of mobile devices to execute the test applications. We ran the test applications by forking 1, 2, and 4 threads to measure the execution time of the test applications executed by 1, 2, and 4 CPU cores (denoted as OMP-1core, OMP-2core, and OMP-4core), respectively. To measure the overhead of the runtime system of MOMP, we created a sequential version of the test applications by deleting the OpenMP directives from the original source programs and executed the sequential test applications on the mobile devices. The experimental results are shown in Figures 10, 11, 12, and 13. Basically, the execution performance of the OpenMP programs is better than that of the sequential-C ones when the core number is larger than one. The performance of all the test applications improves significantly with the increase of core number, no matter which mobile device is used for program execution. However, the execution performance of Mandelbrot does not improve well when the core number is increased from 2 to 4 because the computation of this application is not evenly distributed over the four working threads.

On the other hand, the runtime system overhead of MOMP is negligible for the performance of the MM, Nbody, and Mandelbrot applications. However, it is significant for the performance of the SOR application, especially when it is executed on MiPAD or Note3. Our performance profile shows that when the SOR application is executed with multiple threads, the numbers of cache misses, instructions, and branches increase significantly. It is worth noting that the performance of the MOMP-1core case is better than that of the sequential-C case when the MM and Mandelbrot applications are executed on MiPAD. There are two reasons for this result. The first is that the speed of memory access of MiPAD is slower than that of the others, as shown in Figure 9; in other words, the penalty cost of cache misses on MiPAD is especially pronounced. The second is that the OpenMP compiler replaces the parallel region in the main function with a function call. This is helpful for minimizing the size of the main function and the miss rate of the instruction cache. Consequently, the OpenMP compiler can save the overhead of cache misses for the MM and Mandelbrot applications on MiPAD. However, this situation does not appear on the other mobile devices because the penalty cost of cache misses there is much smaller. It also disappears when the other applications are executed on MiPAD because the granularity of the main function in the Nbody and SOR applications is smaller.

In this experiment, we also compared MOMP with CCTools. Since CCTools uses gcc to compile user programs, the executable files generated by CCTools are native codes. By contrast, MOMP uses Clang to compile user programs into LLVM IRs and converts the IRs with MCJIT into native codes for the processors to execute at runtime. In this performance comparison, we ran the MM application with a problem size of 2048 × 2048 floating-point numbers, the Nbody application with 4096 particles, the SOR application with a problem size of 4096 × 4096 floating-point numbers, and the Mandelbrot application with 1024 × 1024 pixels, respectively. The result of the performance comparison between MOMP and CCTools is depicted in Figure 14. The execution performance of MOMP is better than that of CCTools for all the test applications on MiPAD because MOMP can optimize the executable codes of user programs according to the architecture of the ARM Cortex-A15 in MiPAD while CCTools cannot. Moreover, MOMP is more efficient than CCTools for the Nbody and Mandelbrot applications on most of the mobile devices. Conversely, CCTools performs more efficiently than MOMP for the SOR application on Z3, M810, and Note3 because the computational demand of the SOR application is small and MOMP has to spend extra time converting LLVM IRs into native codes at runtime.

5.2. Performance of MOMP Using GPU in Mobile Devices. Our second experiment was aimed at evaluating the performance of MOMP when it used the GPU of mobile devices to execute the test applications. We ran the same test applications with the GPU and measured the execution time of the applications. The breakdown of the execution time of the test programs is shown in Figures 15, 16, 17, and 18. In these figures, the BuildPro label represents the cost of clBuildProgram(), which was called to generate a kernel function for each parallel region in the test applications. In addition, the HtoD and DtoH labels denote the cost of host-to-device and device-to-host memory copies, respectively. The Execution label denotes the execution time of the kernels launched by the test programs.

For the MM application, the performance of MOMP on MiPAD is better than on the others. Although MOMP spent more time on program building on MiPAD because the cost of memory accesses on MiPAD is high, it spent much less time on data computation on MiPAD because the computation speed of MiPAD is fast. The same situation also happens in the Nbody application. Because Nbody is a computation-intensive but not data-intensive problem, almost all the execution time of this application on the mobile devices except MiPAD is spent on the execution of the kernel.


Figure 9: Performance of memory accesses in mobile devices. The four panels plot Write, Read, HtoD, and DtoH times against data size (MB) for MiPAD, Z3, M810, and Note3.

As for Mandelbrot, its execution performance mainly depends on the cost of memory access and copy when it is executed on MiPAD. By contrast, its execution time is determined by the cost of data computation when it is executed on the other mobile devices. Different from the previous applications, SOR requires a large amount of data transfer between host and device, while its amount of data computation is relatively much smaller. Consequently, MiPAD spends more time on memory copies between host and device but less time on the execution of the SOR kernel than the other devices. Because the increased cost of memory copies exceeds the saved cost of kernel execution, the execution performance of MiPAD is worse than that of the other mobile devices for the SOR application.


Figure 10: Performance of MM executed by CPU in mobile devices. Each panel (MiPAD, Z3, Note3, M810) plots execution time (s) against problem size (n ∗ n) for Sequential-C, OMP-1core, OMP-2core, and OMP-4core.


On the other hand, we divided the CPU execution time by the GPU execution time of the test applications to estimate the performance improvement obtained by replacing the CPU with the GPU, as shown in Figure 19. For the MM application, the GPUs of Z3, Note3, and M810, respectively, provide 5, 5, and 4 times speedup in the best case. For the Nbody application, the maximal performance difference between GPU and CPU is 10, 9, and 7 times when MOMP uses Z3, Note3, and M810, respectively. The SOR application obtains speedups of 2.5, 2, and 1.8, respectively, from the GPUs of M810, Z3, and Note3. However, the performance comparison also shows that when the problem size is small, the GPU of MiPAD does not contribute performance improvement for the test applications compared with the CPU of MiPAD. The main reason is that MiPAD has to spend a long period of time on program building and data transfer between host and device for the GPU, while it does not need to do these things for the CPU. The benefit from saving computation time is seriously reduced by the cost of program building and data transfer.


[Figure 11: Performance of Nbody executed by CPU in mobile devices. One panel per device (MiPAD, Z3, Note3, and M810) plots execution time (s) against problem size (n) for the Sequential-C, OMP-1core, OMP-2core, and OMP-4core versions.]

The benefit from saving computation time is seriously reduced by the cost of program building and data transfer. Consequently, MiPAD does not provide as much performance improvement as the other mobile devices for the MM and Nbody applications, and it even degrades the execution performance of the SOR application when the CPU is replaced with the GPU. Conversely, Mandelbrot obtains a great performance improvement by replacing the CPU with the GPU no matter which mobile device is used, because it has high parallelism and massive data computation but little data communication between host and device.
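For clarity, the speedup plotted in Figure 19 is simply the ratio of the two measured times,

\[ \text{speedup} = \frac{T_{\text{CPU}}}{T_{\text{GPU}}}, \]

so the best-case value of 5 reported above for MM on Z3 means that the CPU execution time was about five times the GPU execution time.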

5.3. Impact of Resource Selection. Our third experiment was dedicated to evaluating the impact of resource selection on the performance of user applications. We created an application with two parallel regions aimed at resolving the MM and SOR problems, respectively.


[Figure 12: Performance of SOR executed by CPU in mobile devices. One panel per device (MiPAD, Z3, Note3, and M810) plots execution time (s) against problem size (n ∗ n) for the Sequential-C, OMP-1core, OMP-2core, and OMP-4core versions.]

We ran the two parallel regions of this application in MiPAD under three ways of resource selection. The first way selects the CPU to execute both parallel regions, while the second selects the GPU for both. The third way selects the GPU for the parallel region of the MM problem but the CPU for the parallel region of the SOR problem. The experimental result is shown in Figure 20.

We can find that the third way of resource selection is the best for the performance of this application, which has different properties in its two parallel regions. By contrast, the first way is the worst because the MM region is computation-intensive but is executed by the CPU. The second way is better than the first but worse than the third because it selects the GPU for the parallel region of SOR, while the SOR problem is more suitable for the CPU than the GPU.


[Figure 13: Performance of Mandelbrot executed by CPU in mobile devices. One panel per device (MiPAD, Z3, M810, and Note3) plots execution time (s) against problem size (n ∗ n) for the Sequential-C, OMP-1core, OMP-2core, and OMP-4core versions.]

The previous discussion implies that selecting proper processors for executing different parallel regions in the same application is important and necessary for the execution performance of the application. MOMP provides an easy interface, namely target(pthread or opencl), for users to address this issue.
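As an illustration, the two-region program used in this experiment might be written roughly as follows. The target(pthread) and target(opencl) options come from MOMP's interface described above, while the exact directive placement and loop bodies are our own sketch.

/* Sketch of MOMP's resource-selection interface; illustrative only. */
#define N 2048

float a[N][N], b[N][N], c[N][N];     /* MM operands */
float grid[N][N], next[N][N];        /* SOR grids */

void compute(void)
{
    int i, j, k;

    /* Computation-intensive MM region: select the GPU. */
    #pragma omp parallel for target(opencl) private(j, k)
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];

    /* Transfer-heavy SOR region: select the CPU. */
    #pragma omp parallel for target(pthread) private(j)
    for (i = 1; i < N - 1; i++)
        for (j = 1; j < N - 1; j++)
            next[i][j] = 0.25f * (grid[i - 1][j] + grid[i + 1][j] +
                                  grid[i][j - 1] + grid[i][j + 1]);
}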

6. Related Work

In recent years, several studies have addressed reducing the complexity of parallel programming on mobile devices. We discuss them as follows.

Android-Aparapi [26] allows users to exploit mobile CPUs or GPUs for parallel data processing in Java. It can automatically translate parallel Java bytecodes into OpenCL host programs and OpenCL kernels, or into code that uses the Java Thread Pool (JTP). It executes user programs with the GPU by default; however, if the OpenCL driver or GPU is not available on a mobile device, it uses the multicore CPU to execute user programs. It is developed based on Aparapi [27] but optimizes the execution performance of Aparapi on Android.


[Figure 14: Performance comparison between MOMP and CCTools. Four panels (MM, Nbody, SOR, and Mandelbrot) plot the OMP-4core time (s) of MOMP and CCTools on MIPAD, Z3, M810, and Note3.]

For example, it exploits ByteBuffer for data communication with OpenCL kernels while minimizing the overhead of garbage collection. It also tries to gather many computational instructions into a single JNI call to minimize the overhead of the Dalvik VM (DVM) and the OpenCL runtime. Although Android-Aparapi successfully enables Java to be a uniform API for mobile CPU and GPU, it requires a modified DVM. This implies that user applications are not portable to any Android platform unless the modified DVM is installed on the target mobile devices. In addition, the front end of Android-Aparapi must be executed on a PC or workstation to pass the classes of user programs to the backend on mobile devices, which generates the OpenCL kernels and executes them with the mobile GPU. As a result, Android-Aparapi cannot allow users to directly develop applications on mobile devices anytime and anywhere without network connection.

Pyjama [27] is a Java compiler and runtime system supporting OpenMP-like directives on Android. It makes use of JavaCC [28] to implement a source-to-source translator that converts OpenMP directives into Java. Since the fork-join model of OpenMP can delay the GUI thread in picking up user events, Pyjama proposes a GUI-aware directive called freeguithread to automatically handle the synchronization between the GUI thread and the threads of parallel regions.


[Figure 15: Performance of MM executed by GPU in mobile devices. One panel per device (MIPAD, Z3, M810, and Note3) plots execution time (s) against problem size (n ∗ n), broken down into BuildPro, HtoD, DtoH, and Execution.]

Although Pyjama provides an easy API, that is, OpenMP-like directives and GUI-aware directives, for users to increase the parallelism of their mobile applications and maintain their responsiveness, it cannot exploit the computational power of GPUs in mobile devices to increase the execution performance of mobile applications.

CCTools is an integrated development environment on mobile devices. The main advantage of this APP is that it allows users to directly edit, compile, and execute OpenMP programs on mobile devices through an embedded Linux terminal. It makes use of gcc to compile users' OpenMP programs. However, the executable files generated by gcc are runnable on the CPU but not the GPU of mobile devices. Since it requires users to manually type commands to edit, compile, and execute their programs in a text-mode terminal, it is not friendly or convenient for those who are not used to operating Linux.
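For instance, a CCTools user would edit a small OpenMP program like the following and then build and run it by typing commands in the terminal; the file name is ours.

/* hello_omp.c: a minimal OpenMP program of the kind CCTools compiles. */
#include <omp.h>
#include <stdio.h>

int main(void)
{
    #pragma omp parallel
    printf("thread %d of %d\n", omp_get_thread_num(), omp_get_num_threads());
    return 0;
}

In the terminal, this would typically be compiled and executed with gcc -fopenmp hello_omp.c -o hello_omp followed by ./hello_omp.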

HIPA (Heterogeneous Image Processing Acceleration) [29] is a domain-specific language for speeding up image processing on embedded systems and mobile devices. HIPA enables users to design the kernels of image processing in a C-like language, while it automatically converts those kernels into CUDA, OpenCL, Renderscript, or Filterscript ones that process images with the GPU.


[Figure 16: Performance of Nbody executed by GPU in mobile devices. One panel per device (Z3, M810, Note3, and MIPAD) plots execution time (s) against problem size (n), broken down into BuildPro, HtoD, DtoH, and Execution.]

In order to exploit the GPU efficiently, HIPA makes use of the memory hierarchy of the GPU to minimize the cost of data accesses. For instance, it stores the data shared by all the threads of the same kernel in global memory and texture memory, while allocating the data shared by the threads of the same thread block in shared memory or local memory. When kernels are translated by means of Renderscript/Filterscript, the expressions of accessing images are translated into calls of the rsGetElementAt function. Moreover, HIPA optimizes its OpenCL implementation for HSA (Heterogeneous System Architecture). Since host memory and device memory share the same region on HSA, HIPA uses mmap() and munmap() to minimize the cost of data transfer between host and device. With the support of HIPA, users need not deal with memory allocation or data transfer, and they can select the execution processor through environment variables. However, HIPA is dedicated to image processing rather than general-purpose computing.
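HIPA's mmap()-based zero copy is internal to that framework; as a rough analogy in standard OpenCL, the idea of avoiding an explicit host-to-device copy on shared-memory hardware can be sketched with a mapped, host-visible buffer. Here ctx, q, and size are an assumed context, command queue, and buffer size.

/* Illustrative OpenCL analogy to HIPA's zero-copy idea, not HIPA code. */
cl_int err;
cl_mem buf = clCreateBuffer(ctx, CL_MEM_ALLOC_HOST_PTR, size, NULL, &err);
float *p = (float *)clEnqueueMapBuffer(q, buf, CL_TRUE, CL_MAP_WRITE,
                                       0, size, 0, NULL, NULL, &err);
/* The host fills p[] in place; no clEnqueueWriteBuffer copy is issued. */
clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);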



[Figure 17: Performance of SOR executed by GPU in mobile devices. One panel per device (MIPAD, Z3, M810, and Note3) plots execution time (s) against problem size (n ∗ n), broken down into BuildPro, HtoD, DtoH, and Execution.]

Although the above toolkits can effectively reduce the complexity of parallel programming for multicore CPUs or many-core GPUs on mobile devices, all of them except CCTools require users to compile programs on computers and then move the executable files to mobile devices for execution. Moreover, most of them do not provide a uniform programming interface for users to exploit both the CPUs and GPUs of mobile devices. By contrast, MOMP provides a complete and friendly programming environment for users to develop parallel programs on mobile devices anytime and anywhere without network connection. Furthermore, it supports a uniform programming interface, that is, OpenMP, to reduce the parallel programming complexity of multicore CPU and many-core GPU on mobile devices at the same time.

7. Conclusions and Future Work

In this paper, we have successfully developed a mobile OpenMP programming environment called MOMP. Using this APP, users can develop OpenMP applications to exploit the computational power of the CPU and GPU in mobile devices for resolving their problems anytime and anywhere without the assistance of remote servers. In addition, they can easily move their OpenMP programs from computers to mobile devices without modifying them.


[Figure 18: Performance of Mandelbrot executed by GPU in mobile devices. One panel per device (MIPAD, Z3, M810, and Note3) plots execution time (s) against problem size (n ∗ n), broken down into BuildPro, HtoD, DtoH, and Execution.]

It is convenient for users to switch their computing environment between computers and mobile devices. On the other hand, our experimental results have shown that MOMP can effectively exploit the computational power of mobile devices for the execution performance of user programs, and it performs more efficiently than CCTools for the test applications on most of the experimental mobile devices. It is worth noting that carefully selecting between CPU and GPU is essential for the execution performance of user programs.

Although mobile devices have recently become more powerful in data computation, the high complexity of parallel programming has long been a major obstacle for users who want to move their computation platform from computers to mobile devices. In this work, MOMP has successfully reduced the programming complexity of heterogeneous computing on mobile devices because OpenMP is much easier than CUDA, OpenCL, and Pthread. This contribution is useful for encouraging users to move their working environment to mobile devices because they can save much time and effort on learning a new programming interface and on application development. Although MOMP currently supports only OpenMP 2.0, this is enough for users to develop many data-parallel applications with parallel loops.


[Figure 19: Speedup gained by replacing CPU with GPU in mobile devices. Four panels (MM, Nbody, SOR, and Mandelbrot) plot speedup against problem size for MIPAD, Z3, M810, and Note3.]

In past decades, OpenMP programming has been widely applied in many research areas such as high energy physics, bioinformatics, data mining, earthquake prediction, and multimedia processing, and it is also an important course in departments of computer science and engineering in universities. The appearance of MOMP is helpful for researchers and students to make use of mobile devices for increasing their work efficiency and learning effect, because mobile devices are more affordable and easier to carry than notebooks.

Although the computational power of mobile devices has been greatly improved, mobile devices still cannot work as long as PCs because of finite battery capacity. If user programs cannot finish their work before the battery power of a mobile device is used up, the context of these programs should be logged to storage in order to resume their execution later; otherwise, they must be executed again from the beginning. Accordingly, we will develop a backup/recovery mechanism and a live migration scheme to increase the reliability of MOMP. In addition, we will extend MOMP to support OpenMP 3.0 and 4.0 when OpenCL 2.0 is available on mobile devices.

Competing Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.


[Figure 20: Impact of resource selection in mobile devices. Three panels plot execution time (s) against problem size (n ∗ n) for MM(C)+SOR(C), MM(G)+SOR(G), and MM(G)+SOR(C), with the GPU cases broken down into BuildPro, HtoD, DtoH, and Execution.]

Acknowledgments

This work is supported by the Ministry of Science and Technology of the Republic of China under Research Project no. MOST 103-2221-E-151-044.

References

[1] J. W. Eaton, GNU Octave, 2015, http://www.gnu.org/software/octave/index.html.

[2] C. Champion, Addi: MATLAB/Octave clone for Android, 2015, https://code.google.com/p/addi.

[3] MathWorks, MATLAB: the language of technical computing, 2015, http://www.mathworks.com/products/matlab.

[4] W. Gropp, E. Lusk, N. Doss, and A. Skjellum, "A high-performance, portable implementation of the MPI message passing interface standard," Parallel Computing, vol. 22, no. 6, pp. 789–828, 1996.

[5] L. Dagum and R. Menon, "OpenMP: an industry standard API for shared-memory programming," Computational Science & Engineering, vol. 5, no. 1, pp. 46–55, 1998.


[6] NVIDIA, CUDA C Programming Guide, 2015, http://docs.nvidia.com/cuda/cuda-c-programming-guide.

[7] J. E. Stone, D. Gohara, and G. Shi, "OpenCL: a parallel programming standard for heterogeneous computing systems," Computing in Science & Engineering, vol. 12, no. 3, Article ID 5457293, pp. 66–73, 2010.

[8] n0n3m4, C4droid: C/C++ compiler & IDE, 2015, https://play.google.com/store/apps/details?id=com.n0n3m4.droidc.

[9] Arduino team, CppDroid: C/C++ IDE, 2015, https://play.google.com/store/apps/details?id=name.antonsmirnov.android.cppdroid.

[10] A. Chukov, CCTools, 2015, https://play.google.com/store/apps/details?id=com.pdaxrom.cctools.

[11] J. Hoeflinger (Intel), Cluster OpenMP for Intel Compilers, 2010, https://software.intel.com/en-us/articles/cluster-openmp-for-intel-compilers.

[12] T.-Y. Liang, S.-H. Wang, C. E.-K. Shieh, C.-M. Huang, and L.-I. Chang, "Design and implementation of the OpenMP programming interface on Linux-based SMP clusters," Journal of Information Science and Engineering, vol. 22, no. 4, pp. 785–798, 2006.

[13] C. Amza, A. L. Cox, S. Dwarkadas et al., "TreadMarks: shared memory computing on networks of workstations," Computer, vol. 29, no. 2, pp. 18–28, 1996.

[14] J.-B. Chang, C.-K. Shieh, and T.-Y. Liang, "A transparent distributed shared memory for clustered symmetric multiprocessors," The Journal of Supercomputing, vol. 37, no. 2, pp. 145–160, 2006.

[15] H. Bae, D. Mustafa, J.-W. Lee et al., "The Cetus source-to-source compiler infrastructure: overview and evaluation," International Journal of Parallel Programming, vol. 41, no. 6, pp. 753–767, 2013.

[16] S. Lee and R. Eigenmann, "OpenMPC: extended OpenMP for efficient programming and tuning on GPUs," International Journal of Computational Science and Engineering, vol. 8, no. 1, pp. 4–20, 2013.

[17] G. Noaje, C. Jaillet, and M. Krajecki, "Source-to-source code translator: OpenMP C to CUDA," in Proceedings of the IEEE 13th International Conference on High Performance Computing and Communications (HPCC '11), pp. 512–519, Banff, Canada, September 2011.

[18] HMCSOT: OpenMP to OpenCL Source-to-Source Translator, 2015, http://www.openfoundry.org/of/projects/1117.

[19] T. Y. Liang and Y. J. Lin, "JCL: an OpenCL programming toolkit for heterogeneous computing," in Grid and Pervasive Computing: 8th International Conference, GPC 2013 and Colocated Workshops, Seoul, Korea, May 9-11, 2013, Proceedings, vol. 7861 of Lecture Notes in Computer Science, pp. 59–72, Springer, Berlin, Germany, 2013.

[20] A. Barak and A. Shiloh, "The VirtualCL (VCL) Cluster Platform," 2015, http://www.mosix.org/vcl/VCL_wp.pdf.

[21] J. Kim, S. Seo, J. Lee, J. Nah, G. Jo, and J. Lee, "SnuCL: an OpenCL framework for heterogeneous CPU/GPU clusters," in Proceedings of the 26th ACM International Conference on Supercomputing (ICS '12), pp. 341–351, ACM, Istanbul, Turkey, June 2012.

[22] T. Y. Liang, H. F. Li, and Y. C. Chen, "A ubiquitous integrated development environment for C programming on mobile devices," in Proceedings of the 12th IEEE International Conference on Dependable, Autonomic and Secure Computing (DASC '14), pp. 184–189, Dalian, China, August 2014.

[23] Clang Team, "Clang: a C language family frontend for LLVM," 2015, http://clang.llvm.org/.

[24] C. Lattner and V. Adve, "LLVM: a compilation framework for lifelong program analysis & transformation," in Proceedings of the International Symposium on Code Generation and Optimization (CGO '04), pp. 75–86, March 2004.

[25] H.-F. Li, T.-Y. Liang, and J.-Y. Chiu, "A compound OpenMP/MPI program development toolkit for hybrid CPU/GPU clusters," The Journal of Supercomputing, vol. 66, no. 1, pp. 381–405, 2013.

[26] H.-S. Chen, J.-Y. Chiou, C.-Y. Yang et al., "Design and implementation of high-level compute on Android systems," in Proceedings of the 11th IEEE Symposium on Embedded Systems for Real-Time Multimedia (ESTIMedia '13), pp. 96–104, IEEE, Montreal, Canada, October 2013.

[27] K. G. Gupta, N. Agrawal, and S. K. Maity, "Performance analysis between aparapi (a parallel API) and JAVA by implementing Sobel edge detection algorithm," in Proceedings of the National Conference on Parallel Computing Technologies (PARCOMPTECH '13), pp. 1–5, Bangalore, India, February 2013.

[28] JavaCC: a parser/scanner generator, 2015, https://java.net/projects/javacc.

[29] R. Membarth and O. Reiche, "HIPAcc," 2015, http://hipacc-lang.org/.
