
Computers & Geosciences 43 (2012) 7–16


Parallelizing flow-accumulation calculations on graphics processing units—From iterative DEM preprocessing algorithm to recursive multiple-flow-direction algorithm

Cheng-Zhi Qin a,*, Lijun Zhan a,b

a State Key Laboratory of Resources and Environmental Information System, Institute of Geographical Sciences and Natural Resources Research, CAS, Beijing 100101, China
b Graduate School of the Chinese Academy of Sciences, Beijing 100049, China

Article info

Article history:

Received 24 November 2011

Received in revised form 20 February 2012
Accepted 24 February 2012
Available online 4 March 2012

Keywords:

Parallel computing

Graphics processing unit (GPU)

Digital terrain analysis

Flow accumulation

Multiple-flow-direction algorithm (MFD)

DEM preprocessing

0098-3004/$ - see front matter © 2012 Elsevier Ltd. All rights reserved.
doi:10.1016/j.cageo.2012.02.022
* Corresponding author. Tel.: +86 10 6488 9777; fax: +86 10 6488 9630.
E-mail address: [email protected] (C.-Z. Qin).

Abstract

As one of the important tasks in digital terrain analysis, the calculation of flow accumulations from gridded digital elevation models (DEMs) usually involves two steps in a real application: (1) using an iterative DEM preprocessing algorithm to remove the depressions and flat areas commonly contained in real DEMs, and (2) using a recursive flow-direction algorithm to calculate the flow accumulation for every cell in the DEM. Because both algorithms are computationally intensive, quick calculation of the flow accumulations from a DEM (especially for a large area) presents a practical challenge to personal computer (PC) users. In recent years, rapid increases in hardware capacity of the graphics processing units (GPUs) provided in modern PCs have made it possible to meet this challenge in a PC environment. Parallel computing on GPUs using a compute-unified-device-architecture (CUDA) programming model has been explored to speed up the execution of the single-flow-direction algorithm (SFD). However, the parallel implementation on a GPU of the multiple-flow-direction (MFD) algorithm, which generally performs better than the SFD algorithm, has not been reported. Moreover, GPU-based parallelization of the DEM preprocessing step in the flow-accumulation calculations has not been addressed. This paper proposes a parallel approach to calculate flow accumulations (including both iterative DEM preprocessing and a recursive MFD algorithm) on a CUDA-compatible GPU. For the parallelization of an MFD algorithm (MFD-md), two different parallelization strategies using a GPU are explored. The first parallelization strategy, which has been used in the existing parallel SFD algorithm on a GPU, has the problem of computing redundancy. Therefore, we designed a parallelization strategy based on graph theory. The application results show that the proposed parallel approach to calculate flow accumulations on a GPU performs much faster than either sequential algorithms or other parallel GPU-based algorithms based on existing parallelization strategies.

© 2012 Elsevier Ltd. All rights reserved.

1. Introduction

Calculation of flow accumulations or specific contributing areas (SCAs) from gridded digital elevation models (DEMs) is one of the key issues in digital terrain analysis (DTA) and has a marked influence on a wide range of applications such as hydrologic analysis, soil erosion, and geomorphology (Wilson and Gallant, 2000; Hengl and Reuter, 2008). The calculation of flow accumulations from a gridded DEM is normally performed using a flow-direction algorithm, which determines how to drain the flow from each given cell in the DEM into the neighboring cell(s) and then recursively calculates the flow accumulation for every cell (Freeman, 1991). Over the last twenty years, many flow-direction algorithms have been developed (Wilson et al., 2008). Based on whether it is assumed that the flow from a cell can drain into only one neighboring cell or into one or more downslope neighboring cells, existing flow-direction algorithms can be classified into two main types: single-flow-direction (SFD) algorithms (e.g., the D8 algorithm proposed by O'Callaghan and Mark (1984)) and multiple-flow-direction (MFD) algorithms (e.g., the FD8 algorithm proposed by Quinn et al. (1991), the D-inf algorithm proposed by Tarboton (1997), and the MFD-md algorithm proposed by Qin et al. (2007)). MFD has generally been recognized to perform better than SFD from the perspective of algorithm error, especially when the spatial pattern of SCA or SCA-based topographic attributes (e.g., topographic wetness index) at a fine scale is needed (Wolock and McCabe, 1995; Wilson et al., 2008; Qin et al., 2011).

During flow-accumulation calculations for real DEMs, a DEM preprocessing algorithm is generally used to fill in the depressions and remove the flat areas in the DEMs before the flow-direction algorithm is used (Hengl and Reuter, 2008). These depressions and flat areas commonly exist in real DEMs, both because of actual terrain conditions and because of errors introduced during the DEM production process. These real or spurious features in a DEM will cause the flow-direction algorithms to fail to determine the flow direction properly and to obtain hydrologically correct results for flow accumulation. Many DEM preprocessing algorithms, often with an iterative process, have been proposed to assist in the application of flow-direction algorithms (e.g., Jenson and Domingue, 1988; Martz and de Jong, 1988; Planchon and Darboux, 2001).

Fig. 1. Overall parallelization workflow on a GPU using a CUDA programming model.

In real applications, flow-accumulation calculations have high computational complexity and are often very time-consuming. The high computational complexity arises not only from the recursive MFD algorithm, but also from the iterative DEM preprocessing algorithm. The traditional algorithm for calculating flow accumulation is coded as a sequential program executed on a single computer processor. Therefore, the execution time is often very long, especially for DEMs of large area and fine scale.

To speed up the execution time for flow-accumulation calculations, some researchers have proposed to parallelize these calculations using parallel programs designed for specific hardware architectures. Currently, three main types of hardware are used to parallelize flow-accumulation calculations: clusters, multi-core CPUs in a single personal computer (PC), and graphics processing units (GPUs). The cluster, which in theory can use any number of processors, has a high theoretical scalability and therefore can process a huge DEM dataset. Based on computer clusters, a parallelization of the D8 algorithm has been developed in a message-passing-interface (MPI) programming model and has achieved a significant improvement in processing times (e.g., Wallis et al., 2009; Do et al., 2011). However, the ownership and operational costs of a computer cluster are high, and cluster programming is difficult because of the difficulty of code debugging and the lack of performance-tuning tools. The use of computer clusters is still limited for most users. Multi-core CPUs in a single PC, as a cheaper and easy-to-use hardware solution, have also been used to parallelize the D8 algorithm using an open-multi-processing (OpenMP) programming model (Xu et al., 2010). However, the multiple threads used on multi-core CPUs in a PC cannot provide the large amount of speedup required because of the limited number of cores available in the CPU (Lee et al., 2010).

GPU devices are attracting attention because they can accelerate digital terrain analysis (Xia et al., 2010) in a more efficient and economical way than multi-core CPUs in single PCs or than clusters. Designed originally for view processing for computer displays, GPUs have become very powerful, with up to hundreds of cores, and are widely provided in modern PCs. With the emergence of general-purpose computing on GPU (GPGPU) technology, such as the compute-unified-device-architecture (CUDA) programming model, GPUs have been used to parallelize many high-computational-complexity tasks ranging from physical process simulation to geographical computations (Tukora and Szalay, 2008).

However, little research has yet been done on parallelizing flow-accumulation calculations on a GPU. Ortega and Rueda (2010) presented a method of parallelizing the SFD algorithm, D8, on a GPU having a theoretical peak speed of 360 GFLOPS (giga floating-point operations per second) with 16 multiprocessors (including 128 scalar processors in total). Compared with the sequential implementation of D8 executed on a CPU having a theoretical peak speed of 32 GFLOPS, their CUDA-based parallel D8 algorithm achieved a high running efficiency, with a speedup ratio of approximately eight times. To the best of the authors' knowledge, there are no reports in the literature on parallelizing an MFD algorithm on a GPU, although MFD is generally thought to be better than SFD. Furthermore, current research has not yet addressed GPU-based parallelization of the DEM preprocessing step in the flow-accumulation calculations.

This paper presents a design and implementation of parallelized flow-accumulation calculations (including both iterative DEM preprocessing and a recursive MFD algorithm) on an NVIDIA™ GPU using the CUDA programming model. A parallelization strategy based on graph theory is proposed to improve the efficiency of parallelizing the MFD algorithm on a GPU.

2. CUDA-compatible GPU

At the hardware level, the GPU is composed of a scalable array of multiprocessors (MPs). Each MP contains 8 scalar processors (SPs). During a given processing cycle, each SP in the MPs executes the same instruction on different data, which is similar to the SIMD (single instruction stream, multiple data stream) model of a computer.

To use the parallel computing capabilities of the GPU for general-purpose computations, a C-based programming model called CUDA has been developed by the popular graphics-card manufacturer NVIDIA™ and has become the most popular GPGPU technology (Danalis et al., 2010). Using CUDA, a sequential algorithm to be parallelized should be redesigned to be processed on two different hardware platforms concurrently, the CPU (the host) and the CUDA-compatible GPU (the device) (Halfhill, 2008). The general workflow of CUDA computation consists of three phases: Initialization, GPU Execution, and Finalization (Fig. 1). During the GPU Execution phase, a large number of threads are automatically created and then scheduled among MPs to keep processors computing and to fully use the computing capability of a GPU.
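To make the three-phase workflow concrete, a minimal CUDA sketch is given below. It is not code from the paper: the trivial kernel, array names, and grid size are illustrative assumptions. The Initialization phase allocates device memory and copies data from the host, the GPU Execution phase launches one lightweight thread per cell, and the Finalization phase copies the result back and releases resources.

#include <cuda_runtime.h>
#include <cstdio>

// Trivial illustrative kernel: one thread per DEM cell adds a constant.
__global__ void addOffset(float *dem, int nCells, float offset) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // global thread ID
    if (idx < nCells)                                   // guard against excess threads
        dem[idx] += offset;
}

int main() {
    const int rows = 1024, cols = 1024, nCells = rows * cols;

    /* Initialization phase: prepare data on the host and copy it to the device. */
    float *hostDem = new float[nCells];
    for (int i = 0; i < nCells; ++i) hostDem[i] = 100.0f;
    float *devDem = nullptr;
    cudaMalloc(&devDem, nCells * sizeof(float));
    cudaMemcpy(devDem, hostDem, nCells * sizeof(float), cudaMemcpyHostToDevice);

    /* GPU Execution phase: map one lightweight thread to each cell. */
    int threadsPerBlock = 256;
    int blocks = (nCells + threadsPerBlock - 1) / threadsPerBlock;
    addOffset<<<blocks, threadsPerBlock>>>(devDem, nCells, 1.0f);
    cudaDeviceSynchronize();

    /* Finalization phase: copy the result back and release resources. */
    cudaMemcpy(hostDem, devDem, nCells * sizeof(float), cudaMemcpyDeviceToHost);
    printf("first cell after kernel: %f\n", hostDem[0]);
    cudaFree(devDem);
    delete[] hostDem;
    return 0;
}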

3. Design for parallelization of a DEM preprocessing algorithm on a GPU

3.1. Original sequential DEM preprocessing algorithm proposed by Planchon and Darboux (2001)

The DEM preprocessing algorithm proposed by Planchon and Darboux (2001) (or P&D algorithm), which can iteratively revise the elevation of cells in depressions and flat areas in a DEM with a very small slope gradient, is thought to be the most suitable for MFD algorithms (Qin et al., 2007). In this study we select the P&D algorithm as a representative of DEM preprocessing algorithms for parallelization.

The P&D algorithm includes two steps: (1) the water-covering step, which involves adding a thick layer of water over the entire DEM except for the boundary, and (2) the water-removal step, which drains the excess water to ensure that for each cell there is a path that leads to the boundary (Planchon and Darboux, 2001). According to the pseudocode of the P&D algorithm (Algorithm 1), the computationally intensive part of the algorithm is the iterative process (lines 4–24 in Algorithm 1). The operation performed in each round of iterative processing is a neighborhood operation, which means that the computation for a cell involves not only this cell but also its neighboring cells. This iterative neighborhood operation has high potential for parallelization.

Algorithm 1. Pseudocode of sequential P&D DEM preprocessing algorithm (adapted from Planchon and Darboux (2001)). (zDEM and wDEM are the input DEM and the output DEM, respectively.)
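The pseudocode listing itself is printed as a figure in the article and is not reproduced here. The C-style sketch below is our reconstruction of the direct (non-optimized) water-removal iteration based on the published description; it assumes a row-major float DEM, a large FLOOD value applied in the water-covering step, a small EPSILON gradient, and variable names of our own.

// Sequential sketch of one round of the P&D water-removal step (direct version).
// z: original elevations; w: working water surface, initialized beforehand to the
// elevation on the boundary and to a huge FLOOD value in the interior.
static const float EPSILON = 0.001f;   // minimal slope imposed on flats (assumption)

bool waterRemovalRound(const float *z, float *w, int rows, int cols) {
    bool changed = false;
    const int dr[8] = {-1, -1, -1, 0, 0, 1, 1, 1};
    const int dc[8] = {-1, 0, 1, -1, 1, -1, 0, 1};
    for (int r = 1; r < rows - 1; ++r) {
        for (int c = 1; c < cols - 1; ++c) {
            int i = r * cols + c;
            if (w[i] <= z[i]) continue;           // already drained down to the terrain
            for (int k = 0; k < 8; ++k) {
                int j = (r + dr[k]) * cols + (c + dc[k]);
                if (z[i] >= w[j] + EPSILON) {     // operation 1: water can drain completely
                    w[i] = z[i];
                    changed = true;
                    break;
                }
                if (w[i] > w[j] + EPSILON) {      // operation 2: lower the water surface
                    w[i] = w[j] + EPSILON;
                    changed = true;
                }
            }
        }
    }
    return changed;   // the caller repeats rounds until no cell changes
}

A driver loop would simply call waterRemovalRound until it returns false, which corresponds to the iterative process referred to above.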

3.2. Strategy for parallelizing the DEM preprocessing algorithm on a GPU

Here the concern is how to parallelize the iterative part of the P&D algorithm. Unlike generic algorithms with iterative neighborhood operations, the sequential P&D algorithm permits the computation of the value of a given cell to use the values of neighboring cells that have already been updated during the current round of iteration (Planchon and Darboux, 2001); the final result of a round does not depend on a strict cell-visiting order. This means that the computation of each cell can be concurrent within a given round of the iterative process.

3.3. Parallel P&D algorithm

Based on the above strategy for parallelizing the P&D algorithm on a GPU, a CUDA-based parallel P&D algorithm was designed which consists of two parts: the host part and the device part. The host part, executed on the CPU, implements the water-covering step of the P&D algorithm (i.e., the Initialization phase in Fig. 1) and the output of the final result (i.e., the Finalization phase in Fig. 1). The iterative water-removal step of the P&D algorithm is parallelized on the device part and executed as a "kernel" on the GPU (i.e., the GPU Execution phase in Fig. 1).

The host part of the parallel P&D algorithm is shown in Algorithm 2. In line 2, the function Water_covering() serves to flood the whole surface except for the boundary with a thick layer of "water." In lines 4–6, all required data are copied from the host to global memory on the device, where all GPU threads can access it simultaneously for reading. Lines 8–9 assign the number of threads per block and the number of blocks based on the DEM size to ensure that each cell in the DEM, except the boundary, will be processed by a single thread. This thread-data mapping requires a large number of threads and is feasible on a GPU, whose powerful arithmetic engine can run thousands of lightweight threads. The loop code that follows in the host part (lines 10–17) iteratively calls for the execution of each round of the water-removal step. This loop terminates when the state variable gpuStop equals true, which means that no cell in the DEM has changed in altitude during the current round of the water-removal process on the GPU. Once the execution of the loop has terminated, the results of DEM preprocessing are transferred from the GPU to the CPU (line 18).

The device part of the parallel P&D algorithm is shown in Algorithm 3. In lines 2–3, each thread obtains a thread ID and uses it as an index to obtain its corresponding data. The rest of Algorithm 3 implements the water-removal step of the P&D algorithm, which is similar to the corresponding code in the sequential Algorithm 1.

Algorithm 2. Pseudocode of the host part of the CUDA-based P&D algorithm. (cpuzDEM and cpuwDEM are the input DEM and the output DEM, respectively.)
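Algorithm 2 appears only as a figure in the article. The sketch below is our reconstruction of the described host loop; the kernel name waterRemovalKernel, the changed flag, and the flood value are our assumptions, and error checking is omitted.

#include <cuda_runtime.h>

__global__ void waterRemovalKernel(const float *z, float *w,
                                   int rows, int cols, int *changed);

// Host part of the parallel P&D algorithm (reconstruction).
void preprocessOnGpu(const float *cpuzDEM, float *cpuwDEM, int rows, int cols) {
    int nCells = rows * cols;

    // Water-covering step on the CPU: boundary cells keep their elevation,
    // interior cells are flooded with a huge water level.
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            int i = r * cols + c;
            bool boundary = (r == 0 || c == 0 || r == rows - 1 || c == cols - 1);
            cpuwDEM[i] = boundary ? cpuzDEM[i] : 1.0e10f;
        }

    // Copy the elevation and water surfaces to GPU global memory.
    float *gpuZ, *gpuW; int *gpuChanged;
    cudaMalloc(&gpuZ, nCells * sizeof(float));
    cudaMalloc(&gpuW, nCells * sizeof(float));
    cudaMalloc(&gpuChanged, sizeof(int));
    cudaMemcpy(gpuZ, cpuzDEM, nCells * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(gpuW, cpuwDEM, nCells * sizeof(float), cudaMemcpyHostToDevice);

    // One thread per cell.
    int threadsPerBlock = 256;
    int blocks = (nCells + threadsPerBlock - 1) / threadsPerBlock;

    // Repeat water-removal rounds until no cell changes.
    int cpuStop = 0;
    while (!cpuStop) {
        int zero = 0;
        cudaMemcpy(gpuChanged, &zero, sizeof(int), cudaMemcpyHostToDevice);
        waterRemovalKernel<<<blocks, threadsPerBlock>>>(gpuZ, gpuW, rows, cols, gpuChanged);
        cudaDeviceSynchronize();
        int changed = 0;
        cudaMemcpy(&changed, gpuChanged, sizeof(int), cudaMemcpyDeviceToHost);
        cpuStop = (changed == 0);
    }

    // Finalization: transfer the preprocessed DEM back to the CPU.
    cudaMemcpy(cpuwDEM, gpuW, nCells * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(gpuZ); cudaFree(gpuW); cudaFree(gpuChanged);
}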


Algorithm 3. Pseudocode of the device part of the CUDA-based P&D algorithm.
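As with Algorithm 2, the device pseudocode is published as a figure. The kernel below is a hedged reconstruction that pairs with the host sketch above: each thread applies the two water-removal operations of the sequential P&D algorithm to the single cell it owns. The changed flag and EPSILON constant are assumptions.

// Device part of the parallel P&D algorithm (reconstruction).
// Each thread processes one non-boundary cell of the DEM.
__global__ void waterRemovalKernel(const float *z, float *w,
                                   int rows, int cols, int *changed) {
    const float EPSILON = 0.001f;                 // minimal imposed gradient (assumption)
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int r = idx / cols, c = idx % cols;
    if (r <= 0 || c <= 0 || r >= rows - 1 || c >= cols - 1) return;   // skip boundary cells

    float zi = z[idx], wi = w[idx];
    if (wi <= zi) return;                         // water already removed for this cell

    const int dr[8] = {-1, -1, -1, 0, 0, 1, 1, 1};
    const int dc[8] = {-1, 0, 1, -1, 1, -1, 0, 1};
    for (int k = 0; k < 8; ++k) {
        int j = (r + dr[k]) * cols + (c + dc[k]);
        float wj = w[j];                          // neighbor value, possibly already updated this round
        if (zi >= wj + EPSILON) {                 // operation 1: drain down to the terrain
            wi = zi;
            break;
        }
        if (wi > wj + EPSILON) {                  // operation 2: lower the water surface
            wi = wj + EPSILON;
        }
    }
    if (wi != w[idx]) {
        w[idx] = wi;
        atomicExch(changed, 1);                   // signal the host that another round is needed
    }
}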

4. Design for parallelization of a multiple-flow-direction algorithm on a GPU

4.1. Analysis of the parallelizability of the recursive MFD algorithm

Following the earliest MFD algorithm with a recursive design (Freeman, 1991; Quinn et al., 1991), later MFD algorithms have generally focused on how to model the allocation of flow among the multiple neighboring cells of a given cell. Therefore, these algorithms can be formalized in a similar form as follows:

\[ d_i = \frac{(\tan\beta_i)^{p} \cdot L_i}{\sum_{j=1}^{8} (\tan\beta_j)^{p} \cdot L_j} \tag{1} \]

where d_i is the fraction of flow into the i-th neighboring cell from a given cell, tan β_i is the slope gradient of the neighboring cell i, and L_i is the "effective contour length" of the neighboring cell i of the central cell. L_i equals 0.5 for downslope cells in cardinal directions, 0.354 for downslope cells in diagonal directions, and 0 for non-downslope neighboring cells (Quinn et al., 1991). The flow-partition exponent p is set to a constant value in the classic MFD, e.g., p = 1 in the FD8 algorithm (Quinn et al., 1991). Other MFD algorithms generally suggest varying the flow-partition exponent p by a function related to the terrain conditions (e.g., Quinn et al., 1995; Kim and Lee, 2004; Qin et al., 2007).

This study uses MFD-md, an MFD algorithm proposed by Qin et al. (2007), as an example for the parallelization of MFDs that are based on a form similar to Eq. (1). The MFD-md algorithm adapts to local terrain conditions by determining the flow-partition exponent based on the local maximum downslope gradient (Qin et al., 2007):

\[ f(e) = 8.9 \cdot \min(e, 1) + 1.1 \tag{2} \]

where e is the tangent value of the maximum downslope gradient, min(e, 1) is the minimum of e and 1, and f(e) is the function for determining the flow-partition exponent in Eq. (1). Experimental results have shown that MFD-md produces lower error on artificial surfaces and achieves a more reasonable result on real-world surfaces compared with classic MFD and SFD (Qin et al., 2007).
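As a concrete reading of Eqs. (1) and (2), the helper below (our own illustration, not code from the paper; the 3x3 window layout, neighbor ordering, and cell-size parameter are assumptions) computes the eight MFD-md flow fractions for the center cell of an elevation window.

#include <cmath>

// Compute MFD-md flow fractions for the center of a 3x3 elevation window.
// zWin: 9 elevations in row-major order (index 4 is the center cell);
// cellSize: grid resolution; frac: output array of 8 fractions, one per
// neighbor in the order NW, N, NE, W, E, SW, S, SE.
void mfdmdFractions(const float zWin[9], float cellSize, float frac[8]) {
    // Effective contour length per neighbor (Quinn et al., 1991):
    // 0.5 for cardinal and 0.354 for diagonal directions.
    const float contour[8] = {0.354f, 0.5f, 0.354f, 0.5f, 0.5f, 0.354f, 0.5f, 0.354f};
    const float dist[8]    = {1.414f, 1.0f, 1.414f, 1.0f, 1.0f, 1.414f, 1.0f, 1.414f};
    const int nbr[8] = {0, 1, 2, 3, 5, 6, 7, 8};   // window indices of the 8 neighbors

    float tanBeta[8], maxDown = 0.0f;
    for (int k = 0; k < 8; ++k) {
        float drop = zWin[4] - zWin[nbr[k]];
        tanBeta[k] = (drop > 0.0f) ? drop / (dist[k] * cellSize) : 0.0f;  // downslope only
        if (tanBeta[k] > maxDown) maxDown = tanBeta[k];
    }

    // Eq. (2): flow-partition exponent from the maximum downslope gradient.
    float p = 8.9f * fminf(maxDown, 1.0f) + 1.1f;

    // Eq. (1): normalized, contour-length-weighted fractions over downslope neighbors.
    float sum = 0.0f;
    for (int k = 0; k < 8; ++k) {
        frac[k] = (tanBeta[k] > 0.0f) ? powf(tanBeta[k], p) * contour[k] : 0.0f;
        sum += frac[k];
    }
    if (sum > 0.0f)
        for (int k = 0; k < 8; ++k) frac[k] /= sum;   // fractions sum to 1 for non-pit cells
}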

The sequential MFD-md algorithm consists of two steps: (1) data preparation and (2) flow-accumulation calculations. The tasks in the first step include the calculation and recording of both the multiple flow directions and the tangent value of the maximum downslope gradient for each cell in the DEM. Then the flow fractions among all neighboring cells of each cell are also determined based on Eqs. (1) and (2). The time consumed in this data-preparation step is relatively minor, and therefore it is not necessary to parallelize it on the GPU.

The flow-accumulation calculation step in the MFD-md algorithm is computationally intensive and time-consuming. Therefore, the parallelization of this step on a GPU has the potential to accelerate the execution of the MFD-md algorithm considerably. However, traditional flow-accumulation calculations often use a depth-first-search (DFS) process which recursively calculates from outlet to peak. The DFS process is thought to be an inherently sequential process and therefore has no parallel solution (Reif, 1985).

In this research, the recursive flow-accumulation calculation was first converted into an iterative algorithm which calculates the flow accumulation in a sequence from the peak to the outlet of each area. Concurrency is then possible within each round of the iterative process. Two parallelization strategies are explored in the remaining parts of this section.

4.2. Design 1: a flow-transfer-matrix-based parallel MFD-md algorithm on a GPU

4.2.1. Parallelization strategy based on the flow-transfer matrix proposed by Ortega and Rueda (2010)

During the process of designing and parallelizing the MFD algorithm on a GPU, it was natural to consider whether or not the parallelization strategy used in the existing CUDA-based parallel SFD algorithm is still applicable in this case. This parallelization strategy, proposed by Ortega and Rueda (2010), used a data structure called the "flow-transfer matrix" to parallelize the D8 algorithm on a GPU. Based on the flow-transfer matrix, the flow-accumulation process, which is recursively calculated in the traditional algorithm, can be simulated on a GPU by an iterative flow-transfer process among the neighboring cells of every cell in the DEM. In each round of the iterative process, a flow-transfer matrix is used to record the flow accumulation transferred into every cell from its neighboring cells during the current round of the process. The flow transfer to every cell in an individual round of the iterative process can be parallelized on a GPU. The iterative process terminates when no cell has flow transferred to it from its neighboring cells in the current round of the process. The result of the flow-accumulation algorithm is the sum of all the flow-transfer matrices from each round of the process.

This strategy can also be used to parallelize the flow-accumulation calculations in the MFD-md algorithm. This parallelization process can be illustrated using a 3×3 DEM example (Fig. 2a). During the data-preparation step of the MFD-md algorithm (Eqs. (1) and (2)), the flow-transfer action for every cell of the DEM is determined both by its multiple flow directions and by their corresponding flow fractions (Fig. 2b). During the flow-accumulation calculation step of the MFD-md algorithm, the flow-transfer matrix, flowTransfer, is used to record the flow accumulation that is transferred in each round of the parallel process. The initial flow-transfer matrix, flowTransfer0, is set to a value of 1 for every cell (Fig. 2c) to simulate the amount of water that each cell(i,j) directly obtains from rainfall. In the first round of the process, a flow-transfer matrix, flowTransfer1, records the flow accumulation which every cell drains from its neighboring cells according to the flow-transfer action, together with the flow recorded in flowTransfer0 (Fig. 2d). For example, the cell of interest (2,2) receives the flow drained from its upslope neighboring cells (i.e., cell(1,1), cell(1,2), cell(1,3), and cell(2,3)), that is, the amount recorded for these neighboring cells in flowTransfer0 (Fig. 2c) multiplied by the corresponding flow fraction (Fig. 2b) from each of these neighboring cells to the cell of interest (2,2) (i.e., 20.7% × 1, 79.3% × 1, 100% × 1, and 3.8% × 1 respectively, which sum to 2.038). Therefore, flowTransfer1(2,2) is set to 2.038, which means that cell(2,2) receives 2.038 units of flow in this round (Fig. 2d). This operation in one round of an iterative process is a neighborhood operation which is easy to parallelize on a GPU. The following rounds of the iterative process are similar (Fig. 2e). When no cell receives a nonzero flow transfer (Fig. 2f), the flow-accumulation result of the MFD-md algorithm can be summed up (Fig. 2g) using the equation:

\[ \text{flowAccumulation} = \sum_{i} \text{flowTransfer}_i \]

Fig. 2. Illustration of the parallel MFD-md algorithm using a flow-transfer matrix: (a) a 3×3 DEM; (b) multiple flow directions (marked as arrows) and corresponding flow fractions for the DEM in (a); (c) initial flow-transfer matrix; (d–f) the flow-transfer matrices which record the flow accumulation transferred in the first to the last rounds of the parallel process using the MFD-md algorithm; (g) flow-accumulation result, which is the sum of the flow-transfer matrices (c–f).

Fig. 3. Data matrices for the flow-transfer-matrix-based parallel MFD-md algorithm and thread-data mapping in the GPU.

4.2.2. Flow-transfer-matrix-based parallel MFD-md algorithm on a GPU

Using the parallelization strategy based on flow-transfer matrices, a CUDA-based parallel implementation of the MFD-md algorithm was designed. This flow-transfer-matrix-based parallel MFD-md algorithm consists of two parts, the host part and the device part.

The host part (Algorithm 4) first initializes three two-dimensional matrices allocated in PC internal memory and duplicates these matrices into the corresponding matrices allocated in GPU global memory (lines 2–6). As shown in Fig. 3, the first matrix in PC internal memory is the reversal multiple-flow-direction matrix (CPURMFD), for which the corresponding matrix in GPU memory is GPURMFD. Each cell in CPURMFD and GPURMFD stores an eight-bit variable in which each bit records whether the neighboring cell in the corresponding direction will be drained into the current cell. The second matrix in PC internal memory is the flow-fraction matrix (CPUFF), which records the flow fractions used to partition flow among the downslope neighboring cells of each cell. The corresponding matrix in GPU global memory is GPUFF. Each cell in CPUFF and GPUFF stores an array with eight elements; each element records the fraction of flow going to a specific neighboring cell of the current cell. The third matrix in PC internal memory, the flow-accumulation matrix (CPUFA), records the flow accumulation. The corresponding matrix in GPU global memory is the GPUFA matrix. In GPU global memory, two more matrices, GPUOldFlow and GPUNewFlow (Fig. 3), are used to simulate the flow transfer in each round of the parallel process. The number of threads per block and the number of blocks are determined in lines 8–9 depending on the matrix size (i.e., the size of the DEM) to ensure that each cell will be mapped to an individual thread (Fig. 3). Using the loop code (lines 10–17), the host part iteratively invokes the execution of a round of the flow-transfer process until no cell receives flow transferred from its neighboring cells in the current round of the process (or, in other words, when the state variable cpuStop equals true).

The device part of the flow-transfer-matrix-based parallel MFD-md algorithm, FlowAccu_Thread_FlowTranferMatrix() (shown as Algorithm 5), was implemented to simulate the flow accumulation of each cell in one round of the flow-transfer process. It is invoked by the host part and is executed by each GPU thread, which maps to a cell in the DEM.

Algorithm 4. Pseudocode of the host part of the parallel MFD-md using the flow-transfer matrix. (cpuDEM and cpuFA are the input DEM and the output result, respectively.)

Fig. 4. 100×100 DEM example used for discussion of the computing redundancy problem in a parallel flow-direction algorithm using a flow-transfer matrix.

Algorithm 5. Pseudocode of the device part of the parallel MFD-md using the flow-transfer matrix.
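Algorithm 5 is likewise published only as a figure. The kernel below is our reconstruction of one round of the flow-transfer process, assuming the GPURMFD bit encoding, the per-cell eight-element GPUFF fraction arrays, and the GPUOldFlow/GPUNewFlow pair described above; the parameter names and neighbor ordering are assumptions.

// One round of the flow-transfer process (reconstruction of Algorithm 5).
// Each thread handles one cell: it gathers the flow drained from upslope
// neighbors in the previous round, accumulates it into gpuFA, and writes it
// into gpuNewFlow for the next round.
__global__ void flowTransferRound(const unsigned char *gpuRMFD,   // bit k set: neighbor k drains into this cell
                                  const float *gpuFF,             // 8 flow fractions per cell
                                  const float *gpuOldFlow,        // flow received in the previous round
                                  float *gpuNewFlow,              // flow received in this round
                                  float *gpuFA,                   // running flow-accumulation total
                                  int rows, int cols, int *changed) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= rows * cols) return;
    int r = idx / cols, c = idx % cols;

    const int dr[8] = {-1, -1, -1, 0, 0, 1, 1, 1};
    const int dc[8] = {-1, 0, 1, -1, 1, -1, 0, 1};

    float received = 0.0f;
    for (int k = 0; k < 8; ++k) {
        if (!(gpuRMFD[idx] & (1 << k))) continue;      // neighbor k does not drain into this cell
        int nr = r + dr[k], nc = c + dc[k];
        if (nr < 0 || nc < 0 || nr >= rows || nc >= cols) continue;
        int nIdx = nr * cols + nc;
        // Fraction of the neighbor's flow routed toward this cell: the
        // opposite direction of k is 7 - k with this neighbor ordering.
        received += gpuOldFlow[nIdx] * gpuFF[nIdx * 8 + (7 - k)];
    }

    gpuNewFlow[idx] = received;
    gpuFA[idx] += received;                            // flowAccumulation = sum of all rounds
    if (received > 0.0f) atomicExch(changed, 1);       // another round is needed
}

On the host side, gpuOldFlow and gpuFA would be initialized to 1 for every cell (flowTransfer0), and gpuOldFlow and gpuNewFlow would be swapped between rounds until the changed flag stays zero.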

4.3. Computing redundancy problem in the flow-transfer-matrix-based parallelization strategy

All parallel flow-direction algorithms which use a flow-transfer matrix have the problem of computing redundancy, including the parallel D8 algorithm and the parallel MFD-md algorithm. Taking a small 100×100 DEM as an example (Fig. 4), the elevation of this area drops continuously from the northeast to the southwest until the outlet, cell(100,1). In the first round of parallel processing, all cells are involved in the computations according to the iterative process in the parallel flow-direction algorithm using the flow-transfer matrix. However, the flow-accumulation computation is completed only for the cells in the first row and the 100th column. The flow-transfer computation among the other cells has no influence on the flow-accumulation computation for the cells in the first row and the 100th column. In other words, computing redundancy exists for 99×99 cells. In the second round of parallel processing, only the cells in the second row and the 99th column can have their flow accumulation calculated completely. The flow-transfer computation among the other 98×98 cells is redundant. The situation in subsequent rounds of parallel processing is similar. This computing redundancy can become enormous and limit the computational efficiency of the algorithm.

4.4. Design 2: a graph-theory-based parallel MFD-md algorithm on a GPU

This section presents an algorithm design based on graph theory instead of on the flow-transfer matrix to avoid the computing redundancy problem occurring in parallel flow-direction algorithms using the flow-transfer matrix.

4.4.1. Parallelization strategy based on graph theory

The basic idea of the algorithm is to view the calculation of flow accumulations from a graph-theory perspective (Arge et al., 2003; Wallis et al., 2009). A flow-direction graph can be naturally defined. Taking a 3×3 DEM as an example (Fig. 5a), an individual vertex in the graph can be associated with each cell in the DEM. There is an edge from vertex i to vertex j in the graph if the flow direction from cell i to cell j exists (Fig. 5c). Each edge in the graph is associated with an attribute value of the flow fraction determined by Eqs. (1) and (2) (Fig. 5b). A flow-direction graph defined in this way is a directed acyclic graph, as Arge et al. (2003) demonstrated.

The flow-direction graph can be used to specify the order of computation of the flow accumulations of all cells as a topological order in graph theory, which means that vertex i appears before vertex j in the ordering if there is a path from vertex i to vertex j (Weiss, 1997). Under this ordering, the flow accumulation of a given cell cannot be calculated until the flow accumulation of every cell draining into it has been calculated previously. Thus, in one round of processing, the cells corresponding to vertices without incoming edges in the graph should have their flow accumulation calculated. Then, for those cells for which the flow-accumulation calculations are complete, the corresponding vertices and the edges leaving these vertices are removed from the graph. During this round of processing, cells corresponding to vertices with at least one incoming edge require no processing. The remaining rounds of processing take place in a similar way. Therefore, the flow-direction algorithm designed on a graph-theory basis can avoid a large amount of computing redundancy compared with a flow-transfer-matrix-based parallel algorithm.

Fig. 5. Illustration of processing in the graph-theory-based parallel MFD-md algorithm: (a) 3×3 DEM example (the value in each cell is the elevation); (b) flow direction marked as an arrow with the flow-fraction value determined by MFD-md; (c) initial flow-direction graph (the value of each vertex in the flow-direction graph is the elevation), as in (e) and (g); (d) indegree matrix corresponding to (c); (e) flow-direction graph after the first round of parallel processing; (f) indegree matrix corresponding to (e) (a gray cell means that the flow-accumulation calculation for this cell is finished); (g) flow-direction graph after the second round of parallel processing; (h) indegree matrix corresponding to (g); (i) flow accumulation for all cells as calculated after the last round of processing.

To determine the topological ordering of the flow-direction graph, all cells without incoming edges should first be identified. This process can be parallelized using an indegree matrix in which each cell records the count of neighboring cells that drain into this cell, i.e., the number of immediate incoming edges of the corresponding vertex of this cell in the flow-direction graph (Fig. 5d).
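One possible way to build the indegree matrix in parallel, given the eight-bit multiple-flow-direction matrix described in Section 4.4.2, is sketched below (one thread per cell; the bit layout and names follow the assumptions of the earlier sketches and are not taken from the paper).

// Build the indegree matrix from the multiple-flow-direction matrix.
// gpuMFD: bit k set means the current cell drains toward neighbor k
// (neighbor order NW, N, NE, W, E, SW, S, SE, as in the earlier sketches).
__global__ void buildIndegree(const unsigned char *gpuMFD, int *gpuIndegree,
                              int rows, int cols) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= rows * cols) return;
    int r = idx / cols, c = idx % cols;

    const int dr[8] = {-1, -1, -1, 0, 0, 1, 1, 1};
    const int dc[8] = {-1, 0, 1, -1, 1, -1, 0, 1};

    // Count how many neighbors drain into this cell: neighbor k drains here
    // if its flow-direction byte has the bit of the opposite direction set.
    int indegree = 0;
    for (int k = 0; k < 8; ++k) {
        int nr = r + dr[k], nc = c + dc[k];
        if (nr < 0 || nc < 0 || nr >= rows || nc >= cols) continue;
        if (gpuMFD[nr * cols + nc] & (1 << (7 - k))) ++indegree;
    }
    gpuIndegree[idx] = indegree;
}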

The parallel processing of MFD-md based on graph theory is illustrated in Fig. 5c through i. In the first round of processing, the flow accumulations of cells with a zero indegree value are calculated. Then, the indegree values of these cells in the indegree matrix are marked as "Done." The indegree value of each cell which has immediate incoming edge(s) from these cells is updated by subtracting the count of immediate incoming edges from these cells (Fig. 5f). All these operations are conducted in parallel. Among the cells which have had their indegree values updated, those cells that currently have a zero indegree value are ready for flow-accumulation calculation in the next round of parallel processing. The process terminates when all cells are marked with an indegree value of "Done" (Fig. 5i).

4.4.2. The graph-theory-based parallel MFD-md algorithm on GPU

The host part and the device part of the graph-theory-based parallel MFD-md algorithm are shown as Algorithm 6 and Algorithm 7 respectively. The host part is similar to that of the flow-transfer-matrix-based parallel MFD-md algorithm (Algorithm 4). The difference is that Algorithm 6 uses new matrices, i.e., the indegree matrix and the multiple-flow-direction matrix according to the proposed parallelization strategy, instead of the GPUOldFlow and GPUNewFlow matrices used in Algorithm 4. The indegree matrix and the multiple-flow-direction matrix allocated in PC internal memory are CPUIndegree and CPUMFD respectively, whereas the corresponding matrices allocated in GPU global memory are GPUIndegree and GPUMFD respectively. The multiple-flow-direction matrix is used to determine which of the neighboring cells of a given cell should have its indegree value updated. Each cell in the multiple-flow-direction matrix stores an eight-bit variable in which each bit indicates whether the neighboring cell in the corresponding direction will be drained from the current cell. The thread-data mapping strategy is the same as that used in the flow-transfer-matrix-based parallel MFD-md algorithm (see Fig. 3).

On the device part, atomic functions are used to enforce atomic access to shared variables in gpuIndegree, as shown in lines 19–26 of Algorithm 7, FlowAccu_Thread_Graph(). This means that no other thread can access these variables until the current operation on these variables is complete.

Algorithm 6. Pseudocode of the host part of the graph-theory-based parallel MFD-md algorithm. (cpuDEM and cpuFA are the input DEM and the output result, respectively.)

Algorithm 7. Pseudocode of the device part of the graph-theory-based parallel MFD-md algorithm.
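Algorithm 7 itself is printed as a figure. The kernel below is our hedged reconstruction of one round of the graph-theory-based process, pairing with the indegree sketch above: a cell whose indegree has reached zero pushes its finished flow accumulation to its downslope neighbors and atomically decrements their indegree counts. The DONE sentinel and parameter names are assumptions, and the floating-point atomicAdd used here requires compute capability 2.0 or later; the paper states only that atomic functions guard the indegree variables, so the exact flow-routing mechanism may differ.

#define DONE -1   // sentinel marking cells whose flow accumulation is finished (assumption)

// One round of the graph-theory-based parallel MFD-md process (reconstruction).
// gpuMFD: per-cell byte, bit k set if the cell drains toward neighbor k;
// gpuFF: 8 flow fractions per cell; gpuFA: flow accumulation (initialized to 1);
// gpuIndegree: remaining count of unfinished upslope neighbors per cell.
__global__ void FlowAccu_Thread_Graph_sketch(const unsigned char *gpuMFD,
                                             const float *gpuFF,
                                             float *gpuFA, int *gpuIndegree,
                                             int rows, int cols, int *changed) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= rows * cols) return;
    if (gpuIndegree[idx] != 0) return;        // not ready yet, or already marked DONE
    int r = idx / cols, c = idx % cols;

    const int dr[8] = {-1, -1, -1, 0, 0, 1, 1, 1};
    const int dc[8] = {-1, 0, 1, -1, 1, -1, 0, 1};

    // The flow accumulation of this cell is complete: route it downslope.
    float myFA = gpuFA[idx];
    for (int k = 0; k < 8; ++k) {
        if (!(gpuMFD[idx] & (1 << k))) continue;            // no flow toward neighbor k
        int nr = r + dr[k], nc = c + dc[k];
        if (nr < 0 || nc < 0 || nr >= rows || nc >= cols) continue;
        int nIdx = nr * cols + nc;
        atomicAdd(&gpuFA[nIdx], myFA * gpuFF[idx * 8 + k]); // add this cell's contribution
        atomicSub(&gpuIndegree[nIdx], 1);                   // one fewer unfinished upslope cell
    }

    gpuIndegree[idx] = DONE;                                // exclude this cell from later rounds
    atomicExch(changed, 1);                                 // the host schedules another round
}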


5. Experiments and results

5.1. Experimental design

To assess the performance of the proposed parallel DEM preprocessing algorithm and the parallel MFD-md algorithms on a GPU, the run times of the parallel algorithms designed in this paper were measured, including the parallel P&D DEM preprocessing algorithm (called Preprocessing_gpu), the flow-transfer-matrix-based parallel MFD-md algorithm (MFDmd_FTM_gpu), and the graph-theory-based parallel MFD-md algorithm (MFDmd_graph_gpu), and compared with the run times of the corresponding sequential algorithms (Preprocessing_cpu, MFDmd_FTM_cpu, and MFDmd_graph_cpu). The performance of the parallelized flow-accumulation calculations as a whole (including both P&D DEM preprocessing and the MFD-md algorithm) on the GPU studied in this paper (called Workflow_gpu) was assessed by comparing the run time of Workflow_gpu (i.e., connecting Preprocessing_gpu to MFDmd_graph_gpu) with the run time of Workflow_cpu (i.e., connecting Preprocessing_cpu to MFDmd_graph_cpu). Here the run time of an algorithm is the execution time of each tested algorithm, including the time needed for matrix preparation (such as determining multiple flow directions and building the directed acyclic graph) and the time needed for transferring data between GPU and CPU. The time needed for loading the DEM data into PC internal memory and moving the results from PC internal memory to external memory is not counted in the run time.

Fig. 6. Map of study area.

All algorithms were executed on a PC with an Intel Core 2 Duo CPU (theoretical peak speed of 32 GFLOPS; only a single core was used for the experiments), 3 GB of RAM, and a GeForce GT 330 graphics card (theoretical peak speed of 182 GFLOPS) with 1 GB of global memory and 12 MPs (i.e., 96 SPs), running CUDA version 3.0. The operating system was 32-bit Windows XP Professional.

The test data are from a gridded DEM at 10-m resolution of a low-relief area (approximately 60 km², Fig. 6) in northeastern China and from another five gridded DEMs created by resampling the original DEM to different grid resolutions. The six DEMs have dimensions of 1225×855, 1633×1140, 2450×1710, 3063×2138, 3769×2631, and 4899×3420 cells respectively. They are used to assess the performance of the parallel algorithms designed in this paper on datasets of different dimensions.

5.2. Experimental results

The experimental results (Table 1, Fig. 7) show that in all cases the proposed parallel algorithms are more efficient than their corresponding sequential algorithms. This is because the parallel algorithms take advantage of the GPU architecture, which has a large number of low-clock-frequency processors working in parallel.


Table 1. Run times (ms) of the algorithms for different DEM dimensions (in cells).

Algorithm           1225×855   1633×1140   2450×1710   3063×2138   3769×2631   4899×3420
Preprocessing_cpu     14,485      39,938     142,234     190,172   1,504,890   3,305,500
Preprocessing_gpu        906       2,187       7,078      11,784      67,562     148,296
MFDmd_FTM_cpu         22,943      41,639      94,084     315,443     308,260     651,250
MFDmd_FTM_gpu          6,101      10,190      20,369      59,444      60,034     126,987
MFDmd_graph_cpu       14,829      32,724      67,868     162,423     163,425     510,716
MFDmd_graph_gpu        2,547       5,017       8,491      16,607      21,216      46,803
Workflow_cpu          29,314      72,662     210,102     352,595   1,668,315   3,816,216
Workflow_gpu           3,453       7,204      15,569      28,391      88,778     195,099

Fig. 7. Experimental results. Curve 1 shows the speedup of the parallel P&D DEM preprocessing algorithm; curves 2 and 3 represent the speedups of the flow-transfer-matrix-based and graph-based parallel MFD-md algorithms respectively; curve 4 compares the efficiency of the two parallel MFD-md algorithms; curve 5 compares the run times when the DEM preprocessing and MFD-md algorithms are executed as a whole workflow on a GPU or a CPU.



The parallel P&D DEM preprocessing algorithm (Preprocessing_gpu) shows the highest speedup, ranging from 15.9 to 22.3 times (curve 1 in Fig. 7). This can be attributed to the high degree of independence of the operations within an iteration of the P&D algorithm, as analyzed in Section 3.2.

The speedup of MFDmd_graph_gpu ranges from 5.8 to 10.9 times for the different datasets in this test (curve 3 in Fig. 7). It is higher than that of MFDmd_FTM_gpu, whose speedup ranges from 3.8 to 5.3 times (curve 2 in Fig. 7). This shows that the proposed parallelization strategy based on graph theory has higher parallelizability on the GPU than the existing parallelization strategy based on the flow-transfer matrix. Furthermore, by avoiding the computing redundancy problem through the graph-theory-based strategy instead of the flow-transfer-matrix strategy, MFDmd_graph_gpu is more than twice as fast as MFDmd_FTM_gpu (curve 4 in Fig. 7), even though the atomic functions used in the MFDmd_graph_gpu algorithm may introduce some latency.

Curve 5 in Fig. 7 shows that Workflow_gpu is also distinctly faster than the reference Workflow_cpu. The speedup of Workflow_gpu ranges from 8.5 to 19.5 times. This means that the proposed parallelization of the flow-accumulation calculations (including both iterative DEM preprocessing and the recursive MFD algorithm) on a CUDA-compatible GPU achieves the purpose of this study.

6. Conclusions and discussion

This paper presents a parallelization of flow-accumulation calculations (including both an iterative DEM preprocessing step and a recursive MFD algorithm) on a CUDA-compatible GPU. The parallel algorithms designed in this paper include a parallel P&D DEM preprocessing algorithm and two parallel MFD-md algorithms based on different parallelization strategies. The existing parallelization strategy, based on the flow-transfer matrix used in a parallel D8 algorithm (Ortega and Rueda, 2010), has the problem of computing redundancy. Therefore, a parallelization strategy based on graph theory has been proposed, and a graph-theory-based parallel MFD-md algorithm has been designed. The experimental results show that the proposed parallelization of flow-accumulation calculations on a GPU performs much faster than either the sequential algorithms or the parallel GPU algorithm based on the existing parallelization strategy.

In this study, the MFD-md algorithm was used as an example of MFD for parallelization on a CUDA-compatible GPU. In fact, the proposed parallelization strategy is also applicable to other MFD algorithms (e.g., the D-inf algorithm). Furthermore, the methodology used in the design of the proposed parallel MFD-md algorithm, which is to change the recursive algorithm into an iterative process for better parallelizability, is also potentially useful for the parallelization of other recursive algorithms in DTA (e.g., Tarboton et al., 2009; Tesfa et al., 2011).

This study uses a GPU as the hardware for parallelization, instead of a PC cluster or multi-core CPUs in a single PC. It should be noted that the size of GPU global memory limits the dimensions of the DEM which can be processed by the parallel GPU-based algorithms. For example, the maximum DEM size which can be processed by the MFDmd_graph_gpu algorithm with a GPU having 1 GB of global memory (the case in this test) is about 5000×5000 cells, because the MFDmd_graph_gpu algorithm needs 3 integer arrays and 9 float arrays to be allocated in GPU global memory. The maximum DEM size which can be processed by the MFDmd_FTM_gpu algorithm is smaller than that. To handle larger DEMs with the parallel algorithms designed in this paper, a more advanced GPU should be used. Despite this limitation of parallelizing DTA algorithms on a GPU, high speedup for computing-intensive and time-consuming DTA algorithms is achieved in an efficient and economical way. The parallel algorithms designed in this paper can work on any PC with a graphics card running CUDA version 1.1 or higher. This study shows the potential of the GPU for parallelization of both algorithms and applications in the DTA domain.
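As a rough check of the stated limit (our own arithmetic, assuming 4-byte integers and 4-byte floats):

\[ 12 \text{ arrays} \times 4\,\mathrm{B} = 48\,\mathrm{B\ per\ cell}, \qquad \frac{1\,\mathrm{GB}}{48\,\mathrm{B/cell}} \approx 2.2 \times 10^{7}\ \text{cells} \approx 4700 \times 4700, \]

which is consistent with the reported ceiling of about 5000×5000 cells for MFDmd_graph_gpu.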

Acknowledgments

This study was funded by the National High-Tech Research and Development Program of China (2011AA120302) and the Knowledge Innovation Program of the Chinese Academy of Sciences (KZCX2-YW-Q10-1-5). This study was also partly funded by the National Natural Science Foundation of China (40971235) and the Institute of Geographical Sciences and Natural Resources Research (2011RC203). We thank Prof. David Tarboton and an anonymous reviewer for their constructive comments on the earlier version of this paper.

Appendix A. Supporting information

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.cageo.2012.02.022.

References

Arge, L., Chase, J.S., Halpin, P., Toma, L., Vitter, J., Urban, S.D., Wickremesinghe, R., 2003. Efficient flow computation on massive grid terrain datasets. Geoinformatica 7 (4), 283–313.

Danalis, A., Marin, G., McCurdy, C., Meredith, J.S., Roth, P.C., Spafford, K., Tipparaju, V., Vetter, J.S., 2010. The Scalable Heterogeneous Computing (SHOC) benchmark suite. In: Proceedings of the Third Workshop on General-Purpose Computation on Graphics Processing Units (GPGPU-3), New York, pp. 63–74.

Do, H.T., Limet, S., Melin, E., 2011. Parallel computing flow accumulation in large digital elevation models. Procedia Computer Science 4, 2277–2286.

Freeman, T.G., 1991. Calculating catchment area with divergent flow based on a regular grid. Computers & Geosciences 17 (3), 413–422.

Halfhill, T.R., 2008. Parallel processing with CUDA: Nvidia's high-performance computing platform uses massive multithreading. Microprocessor Report 22, 1–8.

Hengl, T., Reuter, H.I. (Eds.), 2008. Geomorphometry: Concepts, Software, Applications. Developments in Soil Science, vol. 33. Elsevier, Amsterdam, Netherlands, 707 pp.

Jenson, S.K., Domingue, J.O., 1988. Extracting topographic structure from digital elevation data for geographic information system analysis. Photogrammetric Engineering and Remote Sensing 54 (11), 1593–1600.

Kim, S., Lee, H., 2004. A digital elevation analysis: a spatially distributed flow apportioning algorithm. Hydrological Processes 18 (10), 1777–1794.

Lee, V.W., Kim, C., Chhugani, J., Deisher, M., Kim, D., Nguyen, A.D., Satish, N., Smelyanskiy, M., Chennupaty, S., Hammarlund, P., Singhal, R., Dubey, P., 2010. Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU. In: Proceedings of the 37th Annual International Symposium on Computer Architecture, New York, pp. 451–460.

Martz, L.W., de Jong, E., 1988. CATCH: a FORTRAN program for measuring catchment area from digital elevation models. Computers & Geosciences 14 (5), 627–640.

O'Callaghan, J.F., Mark, D.M., 1984. The extraction of drainage networks from digital elevation data. Computer Vision, Graphics, and Image Processing 28 (3), 323–344.

Ortega, L., Rueda, A., 2010. Parallel drainage network computation on CUDA. Computers & Geosciences 36 (2), 171–178.

Planchon, O., Darboux, F., 2001. A fast, simple, and versatile algorithm to fill the depressions of digital elevation models. Catena 46 (2–3), 159–176.

Quinn, P.F., Beven, K.J., Chevallier, P., Planchon, O., 1991. The prediction of hillslope flow paths for distributed hydrological modeling using digital terrain models. Hydrological Processes 5 (1), 59–79.

Quinn, P.F., Beven, K.J., Lamb, R., 1995. The ln(a/tan β) index: how to calculate it and how to use it within the Topmodel framework. Hydrological Processes 9 (2), 161–182.

Qin, C.-Z., Zhu, A.-X., Pei, T., Li, B.-L., Zhou, C.-H., Yang, L., 2007. An adaptive approach to selecting a flow-partition exponent for a multiple-flow-direction algorithm. International Journal of Geographical Information Science 21 (4), 443–458.

Qin, C.-Z., Zhu, A.-X., Pei, T., Li, B.-L., Scholten, T., Behrens, T., Zhou, C.-H., 2011. An approach to computing topographic wetness index based on maximum downslope gradient. Precision Agriculture 12 (1), 32–43.

Reif, J.H., 1985. Depth-first search is inherently sequential. Information Processing Letters 20 (5), 229–234.

Tarboton, D.G., 1997. A new method for the determination of flow directions and upslope areas in grid digital elevation models. Water Resources Research 33 (2), 309–319.

Tarboton, D.G., Schreuders, K.A.T., Watson, D.W., Baker, M.E., 2009. Generalized terrain-based flow analysis of digital elevation models. In: Proceedings of the 18th World IMACS Congress and MODSIM09 International Congress on Modelling and Simulation, Cairns, Australia, pp. 2000–2006.

Tesfa, T.K., Tarboton, D.G., Watson, D.W., Schreuders, K.A.T., Baker, M.E., Wallace, R.M., 2011. Extraction of hydrological proximity measures from DEMs using parallel processing. Environmental Modelling & Software 26 (12), 1696–1709.

Tukora, B., Szalay, T., 2008. High-performance computing on graphics processing units. International Journal for Engineering and Information Sciences 3 (2), 27–34.

Wallis, C., Watson, D., Tarboton, D., Wallace, R., 2009. Parallel flow-direction and contributing area calculation for hydrology analysis in digital elevation models. In: Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, Las Vegas, Nevada, pp. 467–472.

Weiss, M.A., 1997. Data Structures and Algorithm Analysis in C, 2nd ed. Addison-Wesley, Menlo Park, California, 600 pp.

Wilson, J.P., Aggett, G., Deng, Y., Lam, C.S., 2008. Water in the landscape: a review of contemporary flow routing algorithms. In: Zhou, Q., Lees, B., Tang, G. (Eds.), Advances in Terrain Analysis. Springer, New York, pp. 213–236.

Wilson, J.P., Gallant, J.C. (Eds.), 2000. Terrain Analysis: Principles and Applications. Wiley, New York, NY, 479 pp.

Wolock, D.M., McCabe, G.J., 1995. Comparison of single and multiple flow direction algorithms for computing topographic parameters in Topmodel. Water Resources Research 31 (5), 1315–1324.

Xia, Y., Li, Y., Shi, X., 2010. Parallel viewshed analysis on GPU using CUDA. In: Proceedings of the Third International Joint Conference on Computational Sciences and Optimization, Huangshan, China. Piscataway, NJ, pp. 373–374.

Xu, R., Huang, X.X., Luo, L., Li, S.C., 2010. A new grid-associated algorithm in distributed hydrological model simulations. Science in China (Technological Sciences) 53 (1), 235–241.