GPU-REMuSiC on Graphics Processing Units
Chun-Yuan Lin, Assistant Professor
Department of Computer Science and Information Engineering
Chang Gung University
2011/5/20 1 PPCB
Outline
Sequence alignment problem
Pair-wise sequence alignment
Multiple sequence alignment
Previous studies
Parallel computing for dynamic programming
Related works on CUDA
Constrained multiple sequence alignment
Developments of a series of tools
RE-MuSiC
GPU-REMuSiC implementation
Some bioinformatics applications have been successfully
ported to GPGPU in the past.
Liu et al. (IPDPS 2006) implemented the Smith-Waterman algorithm to
run on the nVidia GeForce 6800 GTO and GeForce 7800 GTX, and
reported an approximate 16× speedup by computing the alignment
score of multiple cells simultaneously.
Charalambous et al. (LNCS 2005) ported an expensive loop from
RAxML, an application for phylogenetic tree construction, and achieved a
1.2× speedup on the nVidia GeForce 5700 LE.
Liu et al. (IEEE TPDS 2007) presented a GPGPU approach to high-performance
biological sequence alignment based on commodity PC graphics hardware,
using C++ and the OpenGL Shading Language (GLSL).
Pairwise sequence alignment (Smith-Waterman algorithm, database scan)
Multiple sequence alignment (MSA)
Some bioinformatics applications have now been successfully
ported to CUDA.
Smith-Waterman algorithm (goal: scan database)
Manavski and Valle (BMC Bioinformatics 2008),
Striemer and Akoglu (IPDPS 2009),
Liu et al. (BMC Research Notes 2009)
Khajeh-Saeed et al. (Journal of Computational Physics, 2010)
Liu et al. (BMC Research Notes 2010)
Multiple sequence alignment (ClustalW)
Liu et al. (IPDPS 2009) for Neighbor-Joining Trees construction
Liu et al. (ASAP 2009)
Manavski and Valle presented the first solution based on commodity graphics hardware that efficiently computes the exact Smith-Waterman alignment; it runs 2 to 30 times faster than previous implementations on general-purpose hardware.
A query profile is pre-computed parallel to the query sequence, one entry per possible residue.
In the CUDA implementation, each GPU thread computes the whole alignment of the query sequence against one database sequence. (The database sequences are pre-sorted by length.)
The sorted database is stored in global memory, while the query profile is kept in texture memory.
For each alignment, the matrix is computed column by column, parallel to the query sequence, with intermediate columns stored in the thread's local memory.
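The query-profile idea can be sketched as follows. This is a minimal Python illustration, not the authors' code: for every possible residue r, the substitution scores of r against each query position are pre-computed, so the inner DP loop needs a single indexed lookup per cell instead of a two-dimensional substitution-matrix access. The tiny score table here is illustrative, not a real BLOSUM matrix.

```python
# Illustrative scores only (not a real BLOSUM matrix).
BLOSUM_LIKE = {
    ("A", "A"): 4, ("A", "C"): 0, ("C", "A"): 0, ("C", "C"): 9,
}

def build_query_profile(query, alphabet=("A", "C")):
    """profile[r][j] = substitution score of residue r vs query position j."""
    return {r: [BLOSUM_LIKE[(r, q)] for q in query] for r in alphabet}

profile = build_query_profile("ACCA")
# During a database scan, a database residue d selects profile[d] directly,
# yielding the scores of d against every query position at once.
column_scores = profile["C"]
```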
Striemer and Akoglu further studied the effect of memory organization and
the instruction set architecture on GPU performance.
They pointed out that the query profile in Manavski's method has a major drawback
in its use of the GPU's texture memory, leading to unnecessary cache misses
when the profile is larger than 8 KB.
Long sequences are also a problem.
They placed the substitution matrix in constant memory to exploit the
constant cache, and created an efficient cost function to access it. (The modulo
operator is extremely inefficient on CUDA, so a hash function is not used.)
They mapped the query sequence as well as the substitution matrix to constant
memory.
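A hypothetical sketch of such a cost function, in the spirit described above (the exact formula is an assumption, not taken from the paper): derive a flat offset from ASCII codes with only a subtract, a multiply, and an add, so a one-dimensional score table (as it would sit in constant memory) is addressed without modulo or hashing.

```python
ALPHABET = 26  # 'A'..'Z'; amino-acid one-letter codes fall in this range

def flat_index(a, b):
    """Flat offset of residue pair (a, b) into a 26x26 score table."""
    return (ord(a) - ord("A")) * ALPHABET + (ord(b) - ord("A"))

# A flat table standing in for the constant-memory substitution matrix.
scores = [0] * (ALPHABET * ALPHABET)
scores[flat_index("A", "C")] = -1  # set one illustrative entry
```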
Liu et al. proposed CUDASW++, implemented in two versions: a
single-GPU version and a multi-GPU version.
The alignment can be computed in minor-diagonal order, from the top-left
corner to the bottom-right corner of the alignment matrix.
The optimal local alignment of a query sequence with one subject
sequence is considered a task.
Inter-task parallelization: each task is assigned to exactly one thread, and dimBlock
tasks are performed in parallel by different threads in a thread block.
Intra-task parallelization: each task is assigned to one thread block, and all dimBlock
threads in the thread block cooperate to perform the task in parallel.
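The minor-diagonal order can be sketched as below. This is a sequential Python sketch, not CUDASW++ itself: all cells on one anti-diagonal of the Smith-Waterman matrix depend only on the two previous diagonals, so on a GPU the inner loop over i could run in parallel.

```python
def sw_antidiagonal(s1, s2, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman score, filling the matrix anti-diagonal by anti-diagonal."""
    m, n = len(s1), len(s2)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for d in range(2, m + n + 1):              # anti-diagonal index d = i + j
        for i in range(max(1, d - n), min(m, d - 1) + 1):  # parallel on a GPU
            j = d - i
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```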
Khajeh-Saeed et al. showed that effective use of the GPU requires a novel
reformulation of the Smith–Waterman algorithm.
Liu et al. described the latest release of the CUDASW++ software,
CUDASW++ 2.0, which makes new contributions to Smith-Waterman
protein database searches using compute unified device architecture
(CUDA).
Liu et al. presented MSA-CUDA, a parallel MSA program which
parallelizes all three stages of the ClustalW processing pipeline using
CUDA.
Pairwise distance computation: as the work in Liu et al. (BMC Research Notes 2009)
Neighbor-Joining tree construction: as the work in Liu et al. (IPDPS 2009)
Progressive alignment: conducted iteratively in a multi-pass way.
G. Myers, S. Selznick, Z. Zhang and W. Miller, "Progressive multiple alignment with constraints," Proceedings of the 1st ACM Conference on Computational Molecular Biology, 1997. (Position constraint)
C.Y. Tang, C.L. Lu, M.D.T. Chang, Y.T. Tsai, et al., "Constrained Multiple Sequence Alignment Tool Development and Its Application to RNase Family Alignment," Journal of Bioinformatics and Computational Biology, vol. 1, 2003. (CSB 2002) (CMSA) (Character constraint, P: a set of single residues/nucleotides)
Y.T. Tsai, Y.P. Huang, C.T. Yu and C.L. Lu, "MuSiC: a tool for multiple sequence alignment with constraints," Bioinformatics, vol. 20, 2004. (MuSiC) (Segment constraint, P: a set of short segments)
C.L. Lu and Y.P. Huang, "A memory-efficient algorithm for multiple sequence alignment with constraints," Bioinformatics, vol. 21, 2005. (MuSiC-ME)
Y.-T. Tsai, "The constrained common sequence problem," Information Processing Letters, vol. 88, 2003.
F.Y.L. Chin et al.:
"A simple algorithm for the constrained sequence problems," IPL, vol. 90, 2004.
"Efficient Constrained Multiple Sequence Alignment with Performance Guarantee," JBCB, vol. 3, 2005. (CSB 2003)
A.N. Arslan et al.:
"Algorithms for the constrained common sequence problem," Proc. Prague Stringology Conference, 2004.
"A Fast Algorithm for the Constrained Multiple Sequence Alignment Problem," Acta Cybernetica (accepted).
"A parallel algorithm for the constrained multiple sequence alignment problem," BIBE 2005.
"A space-efficient algorithm for the constrained pairwise sequence alignment problem," GIW 2005.
"FastPCMSA: An improved parallel algorithm for the constrained multiple sequence alignment problem," FCS 2006.
"A* Algorithms for the Constrained Multiple Sequence Alignment Problem," ICAI 2006.
"Space-efficient Parallel Algorithms for the Constrained Multiple Sequence Alignment Problem," BIOCOMP 2006.
RE-MuSiC steps
Step 1. Compute the score of the global alignment, without any constraint, between all pairs of the K sequences using the Needleman-Wunsch algorithm, and build the distance matrix.
Step 2. From the results of Step 1, create a complete graph G over the K sequences.
Step 3. Use the complete graph G to construct a Kruskal merging order tree.
Step 4. Use the tree created in Step 3 as a guide tree and progressively align the sequences according to its branching order: the two closest pre-aligned groups of sequences are joined by applying the CPSA algorithm to the representative sequences of the two groups. (as in ClustalW)
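Steps 1-3 can be sketched as follows. This is a hedged Python illustration, not RE-MuSiC itself: pairwise Needleman-Wunsch scores feed a distance matrix, and greedily ordering the closest pairs first yields a Kruskal-style merging order (constraints, CPSA, and the actual tree construction are omitted).

```python
from itertools import combinations

def nw_score(s1, s2, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score, keeping one row at a time."""
    prev = [j * gap for j in range(len(s2) + 1)]
    for i, a in enumerate(s1, 1):
        cur = [i * gap]
        for j, b in enumerate(s2, 1):
            sub = match if a == b else mismatch
            cur.append(max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def merge_order(seqs):
    """All index pairs sorted by decreasing alignment score (Kruskal-style)."""
    pairs = combinations(range(len(seqs)), 2)
    return sorted(pairs, key=lambda p: -nw_score(seqs[p[0]], seqs[p[1]]))
```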
In the past, we designed an efficient parallel algorithm (PCMSA, implemented
with MPI) for constrained multiple sequence alignment, based on the
memory-efficient algorithm MuSiC-ME.
[Figure: Speedups of PCMSA versus total processors (up to 32) for 32, 64, 128, 256, and 512 sequences, at sequence lengths 3k, 4k, and 5k.]
Define eight possible methods for computing dynamic programming with CUDA:
(a)SRST (b) SRMT
Figure 3: SRST and SRMT.
(a) ARST (b) ARMT
Figure 4: ARST and ARMT.
(a) SDST (b) SDMT
Figure 5: SDST and SDMT.
(a) ADST (b) ADMT
Figure 6: ADST and ADMT.
Figure 7. Distance matrix and the thread blocks allocation.
Figure 8. The load-balancing cutting technique with two GPUs.
[Figure: CUDA memory model. The host communicates with the device grid, which provides global, constant, and texture memory; each thread block has its own shared memory, and each thread has its own registers and local memory.]
Intra-task parallelization
•Sequences (global memory)
•Scoring matrix (constant memory): alphabetical order, ASCII indexing
•Si and Sj stored in shared memory
•Result stored in shared memory
•Memory efficient: only the score is kept
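The "memory efficient, only the score" point can be sketched as below. This is a Python illustration, not the GPU kernel: when only the best local-alignment score is needed, just two rows of the DP matrix have to be live at a time (as they would sit in shared memory), so space is linear in the shorter sequence instead of quadratic.

```python
def sw_score_only(si, sj, match=2, mismatch=-1, gap=-1):
    """Smith-Waterman best score with a two-row rolling buffer."""
    prev = [0] * (len(sj) + 1)
    best = 0
    for a in si:
        cur = [0]
        for j, b in enumerate(sj, 1):
            sub = match if a == b else mismatch
            cur.append(max(0, prev[j - 1] + sub,
                           prev[j] + gap, cur[j - 1] + gap))
        best = max(best, max(cur))
        prev = cur  # reuse: only the previous row survives
    return best
```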
[Figure: Speedups of the DP computation time for test sets 400(856), 2000(266), 1000(858), and 2000(858) with 1 to 4 GPUs; values range from about 86-92 with one GPU up to about 272-383 with four GPUs.]
Figure 10. The speedup ratios of GPU-REMuSiC over RE-MuSiC for test sets 400(856), 2000(266), 1000(858), and 2000(858) with 1 to 4 GPUs; ratios range from about 12.5 to 29.2.
[Figure: Percentage of total runtime spent on distance matrix computation versus other work, for the CPU and for 1 to 4 GPUs.]