Parallel Longest Common Subsequence using Graphics Hardware
John Kloetzli, Brian Strege
Jonathan Decker, Dr. Marc Olano
Presented by: Brian Strege

Overview
• Introduction
  – Problem Statement
• Background and Related Work
  – The NVIDIA G80 Architecture
• Algorithm Description
• Results and Analysis
• Conclusion

Introduction
• Worked on GPU acceleration of Dynamic Programming
  – Specifically, problems in the Gaussian Elimination Paradigm (GEP)
  – More specifically, Longest Common Subsequence as a representative problem belonging to the GEP

Problem Statement
• Design and implement an algorithm for finding the LCS of two arbitrary-length strings on a CPU + GPU machine
  – Must make efficient use of both CPU and GPU architectures
  – Must have theoretical justification of design

Related Work
• General-Purpose Computation on Graphics Hardware
  – NVIDIA CUDA
  – Owens et al. (2005)
• Linear-Space Dynamic Programming
  – Hirschberg (1975)
  – Chowdhury et al. (2006)
• GPU Sequence Alignment
  – Liu et al. (2007)
  – Schatz et al. (2007)

The NVIDIA G80 Architecture
• 16 multiprocessors, 8 cores each: 128 logical processors
• 1.35 GHz
• 768 MB of RAM
• 86.4 GB/sec transfer rate (8.5 GB/sec for Core 2 Duo)
• 520 GFLOPS (22 GFLOPS for Core 2 Duo)
Source: NVIDIA CUDA Programming Guide, 1.0

The NVIDIA G80 Architecture
Program workflow:
• CPU (host) creates kernel program
• GPU maps kernel “blocks” to processors
• Processors map kernel “threads” to processor cores
• Cores execute in parallel
Source: NVIDIA CUDA Programming Guide, 1.0

Algorithm Description
• The SIMPLE-LCS recurrence
  – Requires quadratic space, which limits scalability
  – Faster than Chowdhury et al. linear space method
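
A minimal Python sketch of the classic quadratic-space LCS recurrence that SIMPLE-LCS refers to. The function name `simple_lcs_table` is hypothetical, and the choice of "AABB" as the row string and "ABAB" as the column string is an assumption taken from the example slides; this is an illustration, not the authors' implementation.

```python
def simple_lcs_table(x: str, y: str) -> list[list[int]]:
    """Fill the full (len(x)+1) x (len(y)+1) LCS length table."""
    rows, cols = len(x) + 1, len(y) + 1
    c = [[0] * cols for _ in range(rows)]  # row 0 and column 0 stay zero
    for i in range(1, rows):
        for j in range(1, cols):
            if x[i - 1] == y[j - 1]:
                # Matching characters extend the diagonal neighbour by one.
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                # Otherwise carry over the better of the upper and left cells.
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c


table = simple_lcs_table("AABB", "ABAB")
lcs_length = table[-1][-1]  # 3
```

The table holds (n+1)(m+1) entries, which is the quadratic space cost that limits scalability.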

SIMPLE-LCS Example
Columns: A B A B; rows: A A B B. The table is filled row by row, one cell at a time. Completed table:
       A  B  A  B
    0  0  0  0  0
A   0  1  1  1  1
A   0  1  1  2  2
B   0  1  2  2  3
B   0  1  2  2  3

Algorithm Description
• Chowdhury et al. perform CPU quadratic space algorithm on small subproblems
  – CH-LCS is their linear space algorithm
  – CUTOFF ranges from 2^8 to 2^10

Algorithm Description
• Our approach is to add another base case solved quickly on the GPU
  – GPU-LCS is our new algorithm (not recursive)
  – GPU-CUTOFF is 2^16
  – CUTOFF is 2^11
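
One way to read the two cutoffs is as a size-based dispatch between the three solvers. The sketch below assumes the cutoffs are 2^11 and 2^16 and that the comparisons are inclusive; `choose_solver` is a hypothetical helper that only returns the name of the solver that would handle a subproblem of side length n.

```python
CUTOFF = 2 ** 11      # assumed: at or below this size, the CPU base case is used
GPU_CUTOFF = 2 ** 16  # assumed: largest subproblem handed to the GPU


def choose_solver(n: int) -> str:
    """Pick a solver for an n x n subproblem (inclusive bounds are an assumption)."""
    if n <= CUTOFF:
        return "SIMPLE-LCS"  # CPU quadratic-space base case
    if n <= GPU_CUTOFF:
        return "GPU-LCS"     # non-recursive GPU base case
    return "CH-LCS"          # recursive linear-space algorithm splits further


solver_for_small = choose_solver(1024)
solver_for_large = choose_solver(2 ** 20)
```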

Algorithm Description
• CH: CPU Linear Space DP
• GPU: GPU DP
  – GPU level 1: GPU Quadratic Space DP (block level)
  – GPU level 2: GPU Linear Space DP (thread level)
• Simple: CPU Quadratic Space DP

CH: CPU Linear Space DP
Two recursive functions used:
• Output boundary
• LCS reconstruction

CH: CPU Linear Space DP
Output boundary:
• Given input boundary, computes output boundary
• Expects subproblem size to be square, with power-of-two lengths
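
The input-to-output boundary mapping can be illustrated with a simple iterative sweep that keeps only one row in memory. This is not the recursive function of Chowdhury et al.; it is a sketch that produces the same output boundary for a subproblem. The name `output_boundary` and the argument layout (top row including the corner, left column without it) are assumptions. The values are those of the Pushing Example slides.

```python
def output_boundary(x, y, top, left):
    """Compute the bottom row and right column of an LCS subproblem.

    x: row string, y: column string
    top: len(y)+1 input values (corner first), left: len(x) input values
    Uses O(len(y)) working space instead of the full matrix.
    """
    row = list(top)
    right = [row[-1]]              # right column, starting at the top-right input
    for i, xc in enumerate(x):
        diag = row[0]              # value above-left of the first cell in this row
        row[0] = left[i]
        for j, yc in enumerate(y, start=1):
            up = row[j]            # value from the previous row, about to be overwritten
            row[j] = diag + 1 if xc == yc else max(up, row[j - 1])
            diag = up
        right.append(row[-1])
    return row, right


bottom, right = output_boundary("AABB", "ABAB", [19, 20, 21, 22, 22], [20, 20, 20, 20])
# bottom == [20, 21, 22, 22, 23], right == [22, 22, 22, 23, 23]
```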

Pushing Example
Columns: A B A B; rows: A A B B. The input boundary is the top row (19 20 21 22 22) and the left column (20 20 20 20). The slides step through the computation one cell at a time; each newly computed value overwrites one entry of the boundary array.

Boundary array before: 20 20 20 20 19 20 21 22 22
Boundary array after:  20 21 22 22 23 23 22 22 22

Completed logical matrix:
        A   B   A   B
   19  20  21  22  22
A  20  20  21  22  22
A  20  21  21  22  22
B  20  21  22  22  23
B  20  21  22  22  23

GPU Processing Overview
• Two levels of parallelism
  – Blocks are executed on a processor
  – Threads are executed on a processor core
  – Each thread is computed by exactly one processor core

GPU Level 1: Quadratic Space
• Computes the length of the LCS for inputs of length up to 2^16
• Divide DP matrix into “blocks”; each block is solved by one of the GPU processors
• We must enforce the correct order of block execution
  – Each diagonal can be computed in parallel
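
The diagonal ordering can be sketched as a list of "wavefronts": every block on the same anti-diagonal depends only on blocks from earlier diagonals, so the blocks within one wavefront are independent. `block_wavefronts` is a hypothetical helper, and the 3 x 3 grid is chosen only to keep the output small.

```python
def block_wavefronts(blocks_per_side: int) -> list[list[tuple[int, int]]]:
    """Group block coordinates (row, col) by anti-diagonal, in execution order."""
    waves = []
    for d in range(2 * blocks_per_side - 1):
        lo = max(0, d - blocks_per_side + 1)
        hi = min(d, blocks_per_side - 1)
        # All blocks with row + col == d can run in parallel.
        waves.append([(i, d - i) for i in range(lo, hi + 1)])
    return waves


waves = block_wavefronts(3)
# [[(0, 0)], [(0, 1), (1, 0)], [(0, 2), (1, 1), (2, 0)], [(1, 2), (2, 1)], [(2, 2)]]
```

A block at (i, j) needs the output boundaries of (i-1, j) and (i, j-1), both of which lie on diagonal i + j - 1, so running the wavefronts in order satisfies every dependency.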

GPU Level 1: Quadratic Space
• The basic quadratic space DP algorithm would require 16 GB of memory
  – We “fold” the memory to store only the input/output boundary for each block
  – Reduces the storage required to 64 MB
  – From n^2 to 2(n^2/m) where m = 512
  – Duplicate some values to avoid memory contention
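
The 16 GB and 64 MB figures can be checked with a short calculation. The sketch assumes n = 2^16 (the GPU-CUTOFF) and 4 bytes per table entry; the entry size is not stated on the slides and is chosen here because it reproduces both numbers.

```python
n = 2 ** 16          # assumed side length of the DP matrix (GPU-CUTOFF)
m = 512              # block width from the slide
BYTES_PER_ENTRY = 4  # assumption: 32-bit entries

full_bytes = n * n * BYTES_PER_ENTRY               # full n^2 table
folded_bytes = 2 * (n * n // m) * BYTES_PER_ENTRY  # 2(n^2/m) boundary entries

full_gib = full_bytes / 2 ** 30      # 16.0
folded_mib = folded_bytes / 2 ** 20  # 64.0
```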

GPU Level 2: Linear Space
• Within each block we also have more parallelism
  – Divide each block into “threads”
  – Each processor core computes one thread at a time
  – Hardware-level synchronization ensures the correct diagonal ordering
  – Each core reuses the same space (white) and computes the entire logical matrix (grey)

GPU Level 2: Linear Space
• Each thread is a 4x4 subproblem
  – The size was determined by experimentation
  – This memory is on chip, so we do not have to worry about memory conflicts
  – The linear space algorithm allows us to make each block as large as possible, which allows for very fast execution
Simple: CPU Quadratic Space DP
• Only gets called when a subproblem is too small for the GPU
• Implements SIMPLE-LCS, the “classic” matrix-based LCS algorithm

Results and Analysis
• GPU thread width of 4 proves optimal
• GPU block width of 512 is slightly faster
• CPU/GPU cutoff sizes determined experimentally
• Test DNA sequence data obtained from Mike Brudno
• Over five-fold performance improvement over the results in Chowdhury et al. on all sequence comparisons

Species   Length (millions)
Human     1.80
Chimp     1.32
Baboon    1.51
Chicken   0.42
Fugu      0.27
Cow       1.46
Mouse     1.49
Rat       1.50
Cat       1.16
Dog       1.05

Conclusion
• We present a GPU-based Dynamic Programming algorithm to compute the LCS of very large sequences
• The GPU implementation gives an over five-fold performance boost over a single-CPU implementation

Future Work
• We believe our algorithm can be accelerated further with careful optimization
  – Memory management on the GPU
  – Memory transfer between CPU and GPU
• Investigation of other computation models
  – Implementations using 8xCPU + 2xGPU?
Questions?
Special thanks to Rezaul Chowdhury for his support and Mike Brudno for the DNA sequence data

NVIDIA CUDA
• Compute Unified Device Architecture
• Available on G80 Series
• Architecture for utilizing the GPU as a data-parallel computing device
• Eliminates the need to map computation through graphics API
• User writes a C style function which is then run in parallel on the GPU

CH: CPU Linear Space DP
LCS reconstruction:
• Computes output boundaries in specific order
• Traces back through boundaries to generate LCS
• Linear space
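
To illustrate that an LCS can be reconstructed in linear space, here is a sketch of Hirschberg's classic divide-and-conquer algorithm (cited under Related Work). It is a different technique from the boundary-based reconstruction of Chowdhury et al. described on this slide, and the function names are hypothetical.

```python
def lcs_last_row(x: str, y: str) -> list[int]:
    """Last row of the LCS length table for x vs y, using O(len(y)) space."""
    row = [0] * (len(y) + 1)
    for xc in x:
        diag = 0  # value above-left of column 1 is always 0
        for j, yc in enumerate(y, start=1):
            up = row[j]
            row[j] = diag + 1 if xc == yc else max(up, row[j - 1])
            diag = up
    return row


def hirschberg_lcs(x: str, y: str) -> str:
    """Return one longest common subsequence of x and y."""
    if not x or not y:
        return ""
    if len(x) == 1:
        return x if x in y else ""
    mid = len(x) // 2
    # LCS lengths of the top half against prefixes of y ...
    top = lcs_last_row(x[:mid], y)
    # ... and of the bottom half against suffixes of y (computed on reversals).
    bottom = lcs_last_row(x[mid:][::-1], y[::-1])
    # Split y where the two halves together give the longest result.
    k = max(range(len(y) + 1), key=lambda j: top[j] + bottom[len(y) - j])
    return hirschberg_lcs(x[:mid], y[:k]) + hirschberg_lcs(x[mid:], y[k:])


result = hirschberg_lcs("AABB", "ABAB")  # a common subsequence of length 3
```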
CH: CPU Linear Space DP
LCS reconstruction omissions:
• Non-power-of-two sequence lengths
• Non-equal sequence lengths

Integration with Parallel CPUs
• Chowdhury et al. implemented a parallel version of their algorithm
  – No data available for LCS, but results from other algorithms show we should expect ~6 times speedup for LCS using 8 server processors
  – Disadvantages:
    • Number of processors which can be effectively used scales poorly with input size
    • Server CPUs cost between $500 and $1600 each, while the GPU we used cost $550