Implementation of Smith-Waterman algorithm in OpenCL for GPUs



Dzmitry Razmyslovich∗, Guillermo Marcus∗, Markus Gipp∗, Marc Zapatka† and Andreas Szillus†

∗Institute for Computer Engineering (ZITI), University of Heidelberg, Mannheim, Germany
Email: see http://www.ziti.uni-heidelberg.de

†German Cancer Research Center, Heidelberg, Germany
Email: m.zapatka, [email protected]

Abstract: In this paper we present an implementation of the Smith-Waterman algorithm. The implementation is written in OpenCL and targets high-end GPUs. It computes similarity indexes between reference and query sequences and is designed for calculating sequence alignment paths. In addition, it can handle very long reference sequences (on the order of millions of nucleotides), a requirement for the target application in cancer research. Performance compares favorably against the CPU, being on the order of 9 to 130 times faster, and 3 times faster than the CUDA-enabled CUDASW++v2.0 for medium-sized or larger sequences. Additionally, it is on par with Farrar’s performance, but with fewer constraints on sequence length.

Keywords: OpenCL, GPU, CUDA, Smith-Waterman, Bioinformatics

INTRODUCTION

Many biological questions are currently being investigated using second-generation sequencing technology. This technology is characterized by short read sequences (35-100 nucleotides). One possible application of second-generation sequencing technology is cancer genomics [1].

All cancers result from changes that have occurred in the DNA sequence of the genomes of cancer cells [2]. These changes (aberrations) can be described as nucleotide substitutions, short insertions and deletions, rearrangements and copy-number changes [3]. The Smith-Waterman algorithm is one of the best solutions for identifying these aberrations, because it is sensitive enough to identify most complex aberrations that faster alternative algorithms fail to recognize [4].

Our approach aims to align the short reads from second-generation sequencing technology along a long genome sequence within an acceptable time. The main problem with using the Smith-Waterman algorithm for this task is its O(n × m) time complexity, where n is the length of a short read (a query sequence) and m is the length of a long genome sequence (a reference sequence). Moreover, the algorithm requires a lot of memory (on the order of 16 GB), which further decreases performance and places excessively high requirements on the target computation system.

In this paper we present an accelerated implementation of the Smith-Waterman algorithm that uses recent technologies for heterogeneous high-performance parallel systems such as GPUs and FPGAs. The implementation is written against the OpenCL standard, which makes the interface independent of the type of the target system. The code is optimized for running on high-end CUDA-enabled NVIDIA GPUs.

There are currently a number of GPU-accelerated implementations of the Smith-Waterman algorithm. The following can be seen as related work: Cheng Ling’s implementation [5], CUDASW++v2.0 [6], Farrar’s implementation [7] and Manavski’s implementation [8]. However, none of these implementations focuses on computing sequence alignment paths, and none of them, except for Cheng Ling’s, processes long reference sequences. The biological task described above, in contrast, requires a Smith-Waterman implementation that puts no limit on the reference sequence length and can calculate sequence alignment paths. These two challenges form the key characteristics of our implementation. To assess the efficiency of our implementation, we compare it against CUDASW++v2.0 and Farrar’s implementation, which are the most popular and widely used.

The rest of the paper is divided into 5 sections. The first section gives a brief description of the Smith-Waterman algorithm. The second section highlights the main pros and cons of the NVIDIA implementation of the OpenCL standard for GPUs. The third section consists of 6 subsections, each presenting a technique or method we used to improve the performance of the implementation. The fourth section gives the results of benchmarking and comparison. Finally, the fifth section concludes the paper with an outlook on the most important advantages of the OpenCL implementation presented.

2010 Second International Workshop on High Performance Computational Systems Biology and Parallel and Distributed Methods

of verifiCation

978-0-7695-4265-2/10 $26.00 © 2010 IEEE

DOI 10.1109/HiBi.2010.20


2010 Ninth International Workshop on Parallel and Distributed Methods in Verification/2010 Second International Workshop

on High Performance Computational Systems Biology

978-0-7695-4265-2/10 $26.00 © 2010 IEEE

DOI 10.1109/PDMC-HiBi.2010.16


I. THE SMITH-WATERMAN ALGORITHM

The Smith-Waterman algorithm is a well-known algorithm for performing local sequence alignment; that is, for determining similar regions between two sequences of elements (nucleotides or proteins) [9]. The idea of the alignment lies in filling the n × m matrix H, the similarity matrix, where n is the number of elements in a query sequence and m is the number of elements in a reference sequence. The values of the matrix are computed using dynamic programming according to formula 1. Each value H[i, j] is the measure of similarity of two subsequences: the query sequence up to the i-th element and the reference sequence up to the j-th element.

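$$
\begin{aligned}
H[i, 0] &= 0, \quad 0 \le i \le n, \\
H[0, j] &= 0, \quad 0 \le j \le m, \\
H[i, j] &= \max \left\{ \begin{array}{l}
0 \\
H[i-1, j] + IF \\
H[i, j-1] + RF \\
H[i-1, j-1] + S(i, j)
\end{array} \right\}, \quad 1 \le i \le n, \; 1 \le j \le m.
\end{aligned}
\tag{1}
$$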

The H[i, j] value is a similarity score. The insertion fee (IF value) is a penalty for extending a reference sequence with an element from a query sequence, while the removing fee (RF value) is a penalty for withdrawing an element from a reference sequence. The value S(i, j) is calculated using formula 2, where MF is the fee for mismatching elements and MS is the score for matching elements.

$$
S(i, j) =
\begin{cases}
MF, & Query[i] \ne Reference[j], \\
MS, & Query[i] = Reference[j]
\end{cases}
\tag{2}
$$

The definitions of the fees and scores specified above differ between implementations. In this implementation:

• IF and RF are specified separately;
• MF and MS can either be the same for all elements or vary per element pair using a substitution matrix (such as the BLOSUM [10] or PAM [11] matrices).

To obtain the actual alignment of two subsequences, the sequence alignment path in the matrix has to be found. This is done using a traceback procedure [9], which starts from the similarity value of the sequences and proceeds until it reaches the upper-left corner of the matrix. At each step of the traceback procedure, the next point in the matrix is chosen as the maximum of the 3 neighbors of the current point (see figure 1).

Figure 2 shows an example of an alignment for the following parameters: Reference = ’ACACACTA’, Query = ’AGCACACA’, MS = 2, MF = IF = RF = −1.
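Formulas 1 and 2 and the traceback step can be sketched in a few lines of Python. This is a minimal CPU illustration, not the paper's OpenCL kernel: the function names are hypothetical, the scores are the fixed values of the figure 2 example rather than a substitution matrix, and stopping the traceback at a zero score is the usual local-alignment convention, assumed here.

```python
def similarity_matrix(query, reference, ms=2, mf=-1, insertion_fee=-1, removal_fee=-1):
    """Fill the (n+1) x (m+1) matrix H of formula 1 with fixed match/mismatch scores."""
    n, m = len(query), len(reference)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # row 0 and column 0 stay zero
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = ms if query[i - 1] == reference[j - 1] else mf  # formula 2
            H[i][j] = max(
                0,
                H[i - 1][j] + insertion_fee,
                H[i][j - 1] + removal_fee,
                H[i - 1][j - 1] + s,
            )
    return H


def traceback(H):
    """Walk from the best cell towards the upper-left corner, always moving to the
    largest of the three neighbours; stop at a zero score or at the matrix border
    (the stopping rule is an assumption of this sketch)."""
    i, j = max(
        ((r, c) for r in range(len(H)) for c in range(len(H[0]))),
        key=lambda rc: H[rc[0]][rc[1]],
    )
    path = []
    while i > 0 and j > 0 and H[i][j] > 0:
        path.append((i, j))
        # The diagonal is listed first so that ties prefer a match/mismatch step.
        i, j = max(
            [(i - 1, j - 1), (i - 1, j), (i, j - 1)],
            key=lambda rc: H[rc[0]][rc[1]],
        )
    return path


H = similarity_matrix("AGCACACA", "ACACACTA")  # parameters of the figure 2 example
path = traceback(H)
print(max(max(row) for row in H), path)
```

The path is returned as (i, j) matrix coordinates from the end of the alignment back to its start; reversing it gives the alignment in reading order.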

II. OPENCL MODEL DESCRIPTION

OpenCL is an open standard for general-purpose parallel programming across heterogeneous processing platforms: CPUs, GPUs and others [12].

Figure 1. A step of the path constructing traceback procedure.

Figure 2. An example of an alignment by Smith-Waterman algorithm.

An OpenCL program consists of two parts:

• the host code, which prepares data, loads it into GPU memory, schedules kernel execution and postprocesses the kernel execution results;

• the kernel code, which is executed on a GPU.

The kernel code is written in a variant of the C language and consists of at least one kernel function. A kernel function is executed concurrently by each thread (an OpenCL workitem) of a block (an OpenCL workgroup). A set of blocks forms a grid of blocks, which represents the whole execution model (see figure 3).

Figure 3. A grid of thread blocks.

Figure 4. The blocks scheduling model.

On the GPU platform used, each block is executed by only one streaming multiprocessor (SM), while the grid of blocks is distributed by a scheduler over an array of SMs, occupying vacant multiprocessors one after another (see figure 4) [13]. The kernel code should therefore be designed so that it does not depend on the order in which blocks are executed.

Meanwhile, the individual threads in a block are executed in groups of 32 threads called warps. A warp processes one common instruction for all 32 threads at a time. If the execution sequence has a divergent branch, the warp executes both paths serially, so the threads that do not take a given path are blocked while it runs. As a result, every divergent branch increases the execution time.

This execution model provides synchronization and communication mechanisms. Threads of one block can synchronize using barriers and communicate using shared memory, while threads from different blocks cannot. There are also other layers of memory available: registers, non-cached local memory, caches for constant and texture memory, and non-cached global memory (see figure 5). Each layer has a certain size and performance. A more detailed description of the memory layers and their usage can be found in [14].

III. IMPLEMENTATION IN THE GPU

The previously described biological problem can be solved using the Smith-Waterman algorithm implemented in OpenCL as a series of successive steps. Each step contributes to improving the performance.

A. Parallelization granularity

Figure 5. The memory model.

The basic task lies in processing many short query sequences against one long reference sequence. Since the original task is the computation of the paths, the similarity matrix has to be stored in order to be processed. However, the size of the data limits the possibility of storing the matrix. For instance, for a reference sequence of 28 million nucleotides and a query sequence of 150 nucleotides, 16 GB of memory are needed. Because modern graphics cards have up to 5 GB of memory, it is necessary to compute the paths online. This means calculating the paths for the already calculated part of the matrix and truncating the matrix concurrently with the computation of a new piece of the matrix (see subsection B). Choosing a nucleotide from a query sequence as the parallelization grain makes online computation possible.
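The 16 GB figure can be checked with a short calculation. The text does not state the size of one matrix cell; the sketch below assumes one 32-bit (4-byte) score per cell, which reproduces the order of magnitude quoted.

```python
reference_length = 28_000_000  # nucleotides, the reference length from the text
query_length = 150             # nucleotides, the query length from the text
bytes_per_cell = 4             # assumption: one 32-bit score per matrix cell

cells = reference_length * query_length
matrix_bytes = cells * bytes_per_cell
gpu_memory_bytes = 5 * 2**30   # "up to 5 GB" of graphics card memory

print(f"{cells:,} cells, {matrix_bytes / 1e9:.1f} GB")
print("fits on the GPU:", matrix_bytes <= gpu_memory_bytes)
```

Under this assumption the full matrix needs 16.8 × 10^9 bytes, more than three times the largest card memory mentioned, which is why the matrix has to be processed in pieces.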

B. Long reference sequences processing

The main heuristic used to reduce the memory usage of the calculation is shown in figure 6. In this figure, the already calculated part of the matrix is the similarity matrix for a reference sequence R1 and a query sequence. The optimal path for this matrix is marked with line P1. The previous matrix with the attached dashed piece is the similarity matrix for a new reference sequence R2. A new optimal path P2 for the renewed matrix crosses P1. This means that another optimal path P3 can be defined for the renewed matrix: P3 is assembled by merging the part of P2 from the end of the renewed matrix to the junction J with the part of P1 from J to the top row of the old matrix. It is now easy to see that the dashed part of the original matrix can be truncated just before the calculation of the new optimal path.

Figure 6. The heuristic model.

Processing a long reference sequence can then be divided into 2 parts:

• calculation of a new piece of the matrix,
• calculation of a new optimal path together with truncating the current matrix.

The former part takes more time than the latter. So, if these 2 parts are processed by different devices, the whole calculation time equals the calculation time of a new piece of the matrix. Since it is generally accepted practice to pass the more time-consuming part of an algorithm to a GPU, the calculation of a new piece of the matrix is performed on the GPU, while the calculation of a new optimal path and the truncation of the current matrix are given to a CPU. This solution makes the two parts of long reference sequence processing run concurrently.

C. Multi-query processing

Since a query sequence is short and normally fits into one workgroup, processing one query at a time is inefficient. As mentioned in section II, one workgroup is processed by only one GPU multiprocessor. This means that several queries can be processed in one cycle in the same amount of time. In order to avoid exceeding the resources available per kernel, the following techniques have been introduced.

First of all, several short query sequences are concatenated into one big query sequence, with a zero-value delimiter placed between the original queries. The delimiter makes it possible to process the different queries separately. This is done by multiplying the H[i, j] value from formula 1 by the value of the sign function of the current query character.

H[i, j] = H[i, j]× sign(q[i]) (3)

Placing the zero-value delimiter results in zero function values, which serve as the top row of the similarity matrix of the next query sequence. Formula 3 thus keeps the values in the matrices of each query sequence correct.

Concatenating several short query sequences into a big one makes multi-query processing possible, which provides better occupancy of the GPU.
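The delimiter technique of formula 3 can be illustrated on the CPU. In the sketch below the integer encoding of nucleotides, the helper names and the example queries are hypothetical; the point is only that the rows belonging to each query in the concatenated matrix match the matrix computed for that query alone.

```python
CODE = {"A": 1, "C": 2, "G": 3, "T": 4}  # hypothetical encoding; 0 is reserved for the delimiter


def sign(x):
    return (x > 0) - (x < 0)


def encode(seq):
    return [CODE[ch] for ch in seq]


def fill(query, reference, ms=2, mf=-1, insertion_fee=-1, removal_fee=-1):
    """Formula 1 followed by formula 3; query and reference are lists of integer codes."""
    n, m = len(query), len(reference)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = ms if query[i - 1] == reference[j - 1] else mf
            h = max(0, H[i - 1][j] + insertion_fee, H[i][j - 1] + removal_fee, H[i - 1][j - 1] + s)
            H[i][j] = h * sign(query[i - 1])  # a delimiter row collapses to zeros
    return H


queries = ["AGCACACA", "ACTA", "GGTT"]
reference = encode("ACACACTA")

# Concatenate the queries with a zero-value delimiter between them.
combined = []
for q in queries:
    if combined:
        combined.append(0)
    combined.extend(encode(q))

H_all = fill(combined, reference)

# Slice the combined matrix back into per-query blocks and collect the best scores.
row = 0
batched_scores = []
for q in queries:
    block = H_all[row + 1 : row + 1 + len(q)]
    batched_scores.append(max(max(r) for r in block))
    row += len(q) + 1  # skip the delimiter row

separate_scores = [max(max(r) for r in fill(encode(q), reference)) for q in queries]
print(batched_scores == separate_scores)
```

Because the delimiter row is all zeros, it plays the role of H[0, j] for the query that follows it, so no score can leak from one query into the next.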

D. The calculation shape

Every iteration is a full execution of the kernel function. At any given iteration, a block of h × h values is calculated, where h is the number of workitems in a workgroup.

Formula 1 shows that the value H[i, j] depends on 3 other values in the matrix. Each of these values in turn depends on 3 others, creating a dependency chain for all the elements in the matrix. The same dependency chain can be constructed for the blocks of the matrix as for the elements. If the blocks computed at any given step in the chain are numbered, a wavefront appears, as shown in figure 7.

Figure 7. The wavefront calculation model.
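The wavefront order can be shown at the level of single elements (the paper applies it to blocks). In this sketch, which uses hypothetical function names and the fixed scores of the figure 2 example, all cells with the same i + j form one anti-diagonal. They read only the two previous anti-diagonals, so on a GPU they could be computed in parallel; the inner loop here is sequential.

```python
def cell(H, query, reference, i, j, ms=2, mf=-1, insertion_fee=-1, removal_fee=-1):
    """One application of formula 1 for the cell (i, j)."""
    s = ms if query[i - 1] == reference[j - 1] else mf
    return max(0, H[i - 1][j] + insertion_fee, H[i][j - 1] + removal_fee, H[i - 1][j - 1] + s)


def fill_row_major(query, reference):
    n, m = len(query), len(reference)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = cell(H, query, reference, i, j)
    return H


def fill_wavefront(query, reference):
    n, m = len(query), len(reference)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for d in range(2, n + m + 1):  # d = i + j identifies one anti-diagonal
        # Cells of anti-diagonal d depend only on anti-diagonals d - 1 and d - 2,
        # so the iterations of this inner loop are independent of each other.
        for i in range(max(1, d - m), min(n, d - 1) + 1):
            j = d - i
            H[i][j] = cell(H, query, reference, i, j)
    return H


a = fill_row_major("AGCACACA", "ACACACTA")
b = fill_wavefront("AGCACACA", "ACACACTA")
print(a == b)
```

Both orders produce the same matrix; only the wavefront order exposes the parallelism that the block shapes discussed next try to exploit.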

With the wavefront calculation model it becomes important to choose the right shape of the block. With the usually chosen rectangular shape, the calculation process limits the number of working threads at each step because of the data dependencies between these threads. Extending the basic rectangular block to a parallelogram, as shown in figure 8, allows calculation by diagonals and reduces the number of branches in the kernel code. This eliminates the data dependencies and lets all threads start working simultaneously.

Figure 8. The block calculating model.

However, the calculation process is still inefficient, because twice as many values are calculated in the extended block as in the original rectangular block, while the significant values lie only in the original rectangular area (the filled rectangle in figure 8). Moreover, the presence of useless values in the shaded area brings a branch into the kernel function, causing a further performance decrease.

To omit the branch, it is necessary to avoid the useless values while preserving the diagonal calculation. If as many diagonals are put together as there are workitems in a workgroup, a new diamond shape appears that meets these requirements. In this case, no additional computation is needed and the kernel function becomes simpler and faster, at the expense of additional complexity in the calculation process (see figure 9): the kernel function is divided into 2 functions, the preprocessing kernel function and the main kernel function. The preprocessing kernel function calculates the initial $(k+1)^2$ blocks (dashed in figure 9), where k is the number of OpenCL workgroups used on the GPU. The main kernel function has no additional branches and is used for uniform computation of the rest of the matrix. Using the diamond shape provides an additional speed-up.

Figure 9. The modified calculating model.

E. The concurrent transfer and execution

The whole calculation process is divided into 8 parts (subprocesses):

• the host initialization;
• the transfer of the input data to device memory;
• kernel execution scheduling;
• the “precalculation” kernel execution;
• the main kernel execution;
• the transfer of a matrix block to host memory;
• path calculation;
• results print-out.

The most time-consuming subprocesses are the main kernel execution, the transfer of a matrix block to host memory and path calculation, because these processes are heavy and are repeated several times (see table I). The main kernel execution is repeated once per iteration, where the number of iterations is the number of blocks that fit into the similarity matrix for a query sequence of workgroup size. The transfer and path calculation are repeated once per window, where the number of windows is the number of iterations divided by the window size.

Table I. The GPU usage statistics according to the OpenCL profiler.

Method               #Calls   GPU ms   %GPU time
Initial transfer          6    101.7        0.8
Precalculate kernel      23      6.8        0.05
Main kernel           19279   9240         73.34
Blocks transfer         201   3248.9       25.78

Because of the data dependencies between these 3 subprocesses, it is impossible to execute them simultaneously for the same window. However, calculating one window of the matrix while transferring and processing the previous window is possible, because the GPU DMA controller and the GPU multiprocessors work independently. Since path calculation and matrix block calculation are performed on different devices, these tasks can also be handled concurrently. To make it possible to overlap data transfer and kernel execution, a ring buffer is allocated in device memory. The ring buffer consists of a minimum of 3 windows, 2 of which are used for calculating matrix values, while 1 contains the ready-to-transfer piece of the matrix (see figure 10).

Figure 10. The ring buffer schema.

The efficiency of the ring buffer is illustrated in figures 11 and 12, which show the whole computation process without and with the ring buffer. The only difference between the 2 processes is the order in which the subprocesses are executed; the execution time of each subprocess is the same. Without the ring buffer, the computation needs an additional time gap for transferring data from GPU memory to host memory, as shown in figure 11. With the ring buffer, this time gap overlaps with the main kernel execution (see figure 12), which improves GPU utilization and reduces the calculation time by approximately 25%, according to the profiling data shown in table I.
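The size of this saving can be estimated from table I with a simple timing model. The sketch below is an idealization under stated assumptions: kernel and transfer time are spread evenly over the 201 windows, the CPU path calculation (not covered by the profiler data) is ignored, and the function name is hypothetical.

```python
def total_time(kernel_ms, transfer_ms, overlap):
    """kernel_ms[i] and transfer_ms[i] are the kernel and device-to-host times of window i."""
    if not overlap:
        return sum(kernel_ms) + sum(transfer_ms)
    # With a ring buffer, window i is transferred while window i + 1 is computed.
    total = kernel_ms[0]
    for k_next, t_prev in zip(kernel_ms[1:], transfer_ms[:-1]):
        total += max(k_next, t_prev)
    return total + transfer_ms[-1]  # the last transfer cannot be hidden


windows = 201                             # number of block transfers in table I
kernel = [9240.0 / windows] * windows     # assumption: main kernel time spread evenly
transfer = [3248.9 / windows] * windows   # assumption: transfer time spread evenly

serial = total_time(kernel, transfer, overlap=False)
overlapped = total_time(kernel, transfer, overlap=True)
saving = (serial - overlapped) / serial
print(f"saving: {100 * saving:.1f}%")
```

Since each transfer is shorter than the kernel work of the next window, almost the whole transfer share of table I is hidden, which is consistent with the reduction of about a quarter reported in the text.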

F. Smith-Waterman without the path calculation

In most cases, a vast amount of sequencer data must be filtered first, a task that requires only similarity values. So, to allow faster computation of similarity values alone, as well as to compare the OpenCL implementation with other ones, a version that omits the path calculation has been created.

Figure 11. The calculation time diagram without transfer and execution overlapping.

Figure 12. The calculation time diagram with transfer and execution overlapping.

The main difference of the version without path calculation is that the matrix storage is unnecessary, since only the last column of the matrix is used for retrieving the results. Moreover, if queries fit the workgroup size, it is efficient to pass only an integer number of queries to a workgroup in order to avoid data dependencies between different workgroups. Taking all these features into account, the computation can be completed in 3 steps instead of the 8 subprocesses described in subsection E (see figure 13):

• initialization: performs some calculations to make the wavefront technique usable;

• calculating: calculates the whole matrix excluding the heading and the ending;

• finishing: calculates the ending and saves the results.

Figure 13. The no path calculating model.

This calculation model eliminates the cost of data transfer, kernel launches and synchronization, providing an additional speed-up.
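Why the matrix need not be stored when only similarity values are required can be seen in a small CPU sketch. It keeps just two columns of the matrix and tracks the running maximum; this is a simplification with a hypothetical function name and the fixed scores of the figure 2 example, not the three-step GPU scheme described above.

```python
def best_score(query, reference, ms=2, mf=-1, insertion_fee=-1, removal_fee=-1):
    """Best local-alignment score using two columns of length n + 1 instead of the full matrix."""
    n = len(query)
    prev = [0] * (n + 1)  # column j - 1 of H
    best = 0
    for r in reference:
        cur = [0] * (n + 1)  # column j of H; cur[0] belongs to the zero top row
        for i in range(1, n + 1):
            s = ms if query[i - 1] == r else mf
            cur[i] = max(
                0,
                cur[i - 1] + insertion_fee,  # H[i-1, j] + IF
                prev[i] + removal_fee,       # H[i, j-1] + RF
                prev[i - 1] + s,             # H[i-1, j-1] + S(i, j)
            )
            best = max(best, cur[i])
        prev = cur
    return best


print(best_score("AGCACACA", "ACACACTA"))
```

Memory use grows with the query length only, not with the reference length, so a reference of millions of nucleotides poses no storage problem in this mode.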

Figure 14. The effect of the implementation steps on overall performance.

The effect of each step on the implementation performance is summarized in figure 14. The step letters correspond to the letters of the subsections above. Note that each step brings an additional functional characteristic, a speed-up, or both.

IV. RESULTS AND DISCUSSIONS

The sequence of chromosome 21 (circa 28 million nucleotides in length) from the NCBI [15] Build 36 of the human reference assembly is used as a test space for benchmarking, analyzing and comparing the OpenCL implementation. A set of 36-nucleotide-long reads from an Illumina genome analyzer was used as query sequences.

The computation time is used to measure performance. It includes:

• the kernel execution scheduling time,
• the kernel execution time,
• the device-to-host data transfer time (for the version with path calculation),
• the path calculation time (for the version with path calculation),
• the device-to-host results transfer time (for the version without path calculation).

The time for program initialization and reading the input files, as well as for OpenCL library initialization, loading the input data into GPU memory and printing the results, is not included in the computation time, because these factors do not influence the comparison characteristics of the implementation.

All benchmarking and comparison tests were carried out on the following test platform: an NVIDIA GeForce GTX 260 GPU with 1.75 GB of RAM, 30 multiprocessors and 216 cores, installed in a PC with an Intel i7-920 CPU and 6 GB of RAM running Linux with the NVIDIA GPU Computing SDK 3.0.

A. Implementation benchmarking

The benchmarking tests show the time spent as a function of the reference sequence length. The number of query sequences processed at a time is fixed: 40 queries for the version with path calculation and 600 queries for the version without it. These numbers were found experimentally on the given test platform and can vary from platform to platform. The parameters used in the implementation were chosen according to the following requirements:

• the GPU should be loaded as much as possible, so that all multiprocessors are busy;

• the window size should be neither so small that the CPU cannot finish processing the previous window in time, nor so large that the data transfer takes too long and the ring buffer becomes too big;

• too many queries can either raise the risk of running out of memory (in the case of path calculation) or reduce the performance (in the case of no path calculation).

Figure 15. The benchmark graph.

Figure 16. The comparison graph for the OpenCL implementations and the CPU implementation.

The time graph for both versions of the implementation is shown in figure 15.

B. Comparison

The OpenCL implementations with and without path calculation have been compared with CUDASW++v2.0 [6], Farrar’s implementation [7] and our CPU implementation. The comparison with the CPU implementation is shown in figure 16. According to the test results, the OpenCL implementation with path calculation is about 9x faster than the CPU implementation, and the one without path calculation about 130x faster.

The main advantage of the OpenCL implementations over the other non-CPU implementations is the ability to process long reference sequences. Farrar’s implementation can process reference sequences of up to 65536 nucleotides and CUDASW++v2.0 of up to 262000 nucleotides, while on the same test platform the OpenCL implementation handles reference sequences of up to 28 million nucleotides.

To compare the computation time the test space has beenmodified:

• the reference sequences lengths have been limitedaccording to the maximum mutual capability of theimplementations (65536 nucleotides);

• to show the computation time with regard to bothparameters used in the implementations (the referencesequence length and the number of queries) the compar-ison has been divided into the groups of tests accordingto the number of queries in a query database file. Thedatabases with 40, 200, 600 queries were used.

Figure 17 shows the comparison graph for the 40-query database file. This graph is the basis for the more complex comparison cases that follow: for larger database files, the computation time for the path calculating version is defined as the sum of the times to compute each 40-query part of the file (see subsection A). The path calculating version was also tested on the 200-query database file (figure 18).
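A minimal sketch of this time definition is given below, assuming a hypothetical callable `time_batch` that returns the measured computation time for one batch of queries. The batch size of 40 is the value from subsection A; everything else is illustrative.

```python
def split_into_batches(queries, batch_size=40):
    # Consecutive parts of at most batch_size queries; the last part may
    # be shorter if the file size is not a multiple of batch_size.
    return [queries[i:i + batch_size] for i in range(0, len(queries), batch_size)]


def total_time(queries, time_batch, batch_size=40):
    # Total computation time is the sum of the per-batch times.
    # time_batch is a hypothetical callable: batch -> seconds.
    return sum(time_batch(batch) for batch in split_into_batches(queries, batch_size))


# Usage with a 200-query file and a made-up constant time per batch.
queries = [f"query_{i}" for i in range(200)]
print(len(split_into_batches(queries)))        # number of 40-query parts
print(total_time(queries, lambda batch: 1.5))  # sum over all parts
```

With this definition a 200-query file is processed as five 40-query parts and a 600-query file as fifteen.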

Figure 17. The comparison graph for the 40-query database file.

Figure 18. The comparison graph for the 200-query database file.

In the case of the 600-query database file, the computation time for the path calculating OpenCL version is much higher than for any of the no path calculating implementations. The path calculating version is therefore omitted from figure 19, since including it would have obscured the performance differences between the no path calculating implementations.

The no path OpenCL implementation is competitive with Farrar’s implementation and is 3x as fast as CUDASW++v2.0 for the 600-query database file (figure 19).

Figure 19. The comparison graph for the 600-query database file.

V. CONCLUSION

In this paper an implementation of the Smith-Waterman algorithm using the modern OpenCL standard and targeting high-end CUDA-enabled GPUs was presented. This implementation is intended for the alignment of short reads, obtained using second-generation sequencing technology, along a genome sequence.

Testing showed that the key advantages of this OpenCL implementation in comparison with the other top-rated implementations are:

• the implementation is able to process long reference sequences efficiently (up to 28 million nucleotides in the tests);

• the alignment paths can be calculated effectively, which is the key feature of this implementation;

• the computation performance of the implementation is competitive with Farrar’s implementation and 3x as fast as the CUDASW++v2.0 implementation for the 600-query database file;

• the acceleration in comparison with our CPU implementation is 9x for the path calculating version and 130x for the no path calculating version;

• the implementation is written in the modern OpenCL standard, which makes it possible to use it on different parallel systems, provided that the implementation is properly tuned.

This new implementation provides the high efficiency needed for current biological tasks as well as for future challenges posed by ever longer sequences. The additional flexibility of the new OpenCL language and the option of calculating the paths give this implementation a unique advantage that can be exploited by a wide range of biological applications.

ACKNOWLEDGMENT

We would like to thank Prof. Dr. Reinhard Manner and DAAD (German Academic Exchange Service) [16] for providing the scholarship for D. Razmyslovich.

REFERENCES

[1] K. Robison, “Application of second-generation sequencing to cancer genomics,” Briefings in Bioinformatics, pp. bbq013+, April 2010. [Online]. Available: http://dx.doi.org/10.1093/bib/bbq013

[2] M. R. Stratton, P. J. Campbell, and P. A. Futreal, “The cancer genome,” Nature, vol. 458, no. 7239, pp. 719–724, April 2009. [Online]. Available: http://dx.doi.org/10.1038/nature07943

[3] “International network of cancer genome projects,” Nature, vol. 464, no. 7291, pp. 993–998, April 2010. [Online]. Available: http://dx.doi.org/10.1038/nature08987

[4] H. Li and N. Homer, “A survey of sequence alignment algorithms for next-generation sequencing,” Brief Bioinform, pp. bbq015+, May 2010. [Online]. Available: http://dx.doi.org/10.1093/bib/bbq015

[5] C. Ling, K. Benkrid, and T. Hamada, “A parameterisable and scalable smith-waterman algorithm implementation on cuda-compatible gpus,” Application Specific Processors, 2009. SASP ’09. IEEE 7th Symposium on, pp. 94–100, Jul. 2009.

[6] Y. Liu, B. Schmidt, and D. L. Maskell, “Cudasw++2.0: enhanced smith-waterman protein database search on cuda-enabled gpus based on simt and virtualized simd abstractions,” BMC Res Notes, vol. 3, p. 93, 2010. [Online]. Available: http://www.biomedsearch.com/nih/CUDASW20-enhanced-Smith-Waterman-protein/20370891.html

[7] M. Farrar, “Striped smith–waterman speeds database searches six times over other simd implementations,” Bioinformatics, vol. 23, no. 2, pp. 156–161, 2007.

[8] S. Manavski and G. Valle, “Cuda compatible gpu cards as efficient hardware accelerators for smith-waterman sequence alignment,” BMC Bioinformatics, vol. 9, no. Suppl 2, p. S10, 2008. [Online]. Available: http://www.biomedcentral.com/1471-2105/9/S2/S10

[9] T. F. Smith and M. S. Waterman, “Identification of common molecular subsequences,” Journal of Molecular Biology, vol. 147, pp. 195–197, 1981.

[10] S. Henikoff and J. G. Henikoff, “Amino acid substitution matrices from protein blocks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 89, no. 22, pp. 10915–10919, November 1992. [Online]. Available: http://dx.doi.org/10.1073/pnas.89.22.10915

[11] M. O. Dayhoff and R. M. Schwartz, “Chapter 22: A model of evolutionary change in proteins,” in Atlas of Protein Sequence and Structure, 1978.

[12] “Khronos Group,” 2008. [Online]. Available: http://www.khronos.org

[13] NVIDIA, NVIDIA OpenCL Programming Guide for the CUDA Architecture, Version 2.3, 2010.

[14] NVIDIA, NVIDIA OpenCL Best Practices Guide, Version 2.3, 2009.

[15] “National Center for Biotechnology Information (NCBI).” [Online]. Available: http://www.ncbi.nlm.nih.gov

[16] “DAAD - German Academic Exchange Service.” [Online]. Available: http://www.daad.de
