Parallel Algorithms
![Page 1: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/1.jpg)
Parallel Algorithms
Patrick Cozzi
University of Pennsylvania
CIS 565 - Fall 2012
![Page 2: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/2.jpg)
Announcements
Project 1 due Sunday 09/30
Email grade to Karl
Include blog link in README.md
Be ready to present on Wednesday, 10/01
2
![Page 3: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/3.jpg)
Agenda
Parallel Algorithms
Parallel Reduction
Scan
Stream Compaction
Summed Area Tables
Radix Sort
3
![Page 4: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/4.jpg)
Parallel Reduction
Given an array of numbers, design a parallel algorithm to find the sum.
Consider: Arithmetic intensity: compute to memory access ratio
4
![Page 5: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/5.jpg)
Parallel Reduction
Given an array of numbers, design a parallel algorithm to find:
The sum
The maximum value
The product of values
The average value
How different are these algorithms?
5
![Page 6: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/6.jpg)
Parallel Reduction
Reduction: An operation that computes a single result from a set of data
Examples:
Minimum/maximum value
Average, sum, product, etc.
Parallel Reduction: Do it in parallel. Obviously.
6
![Page 7: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/7.jpg)
Parallel Reduction
Example. Find the sum:
0 1 2 3 4 5 6 7
7
![Page 8: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/8.jpg)
Parallel Reduction
0 1 2 3 4 5 6 7
1 5 9 13
8
![Page 9: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/9.jpg)
Parallel Reduction
0 1 2 3 4 5 6 7
1 5 9 13
6 22
9
![Page 10: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/10.jpg)
Parallel Reduction
0 1 2 3 4 5 6 7
1 5 9 13
6 22
28
10
![Page 11: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/11.jpg)
Parallel Reduction
Similar to brackets for a basketball tournament
log2(n) passes for n elements
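The tournament-style passes can be sketched on the CPU as follows. This is a minimal illustration, not a GPU implementation: the name `tree_reduce` and the identity padding for odd-length levels are assumptions made for the sketch. Swapping the operator shows how little the algorithm changes between sum, maximum, and product.

```python
from operator import add, mul


def tree_reduce(values, op, identity):
    """Pairwise reduction; returns (result, number_of_passes)."""
    level = list(values)
    passes = 0
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Pad odd-length levels with the identity so every element has a partner.
            level.append(identity)
        # Each pair is independent, so a GPU could assign one thread per pair.
        level = [op(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        passes += 1
    result = level[0] if level else identity
    return result, passes


total, passes = tree_reduce([0, 1, 2, 3, 4, 5, 6, 7], add, 0)
largest, _ = tree_reduce([0, 1, 2, 3, 4, 5, 6, 7], max, float("-inf"))
product, _ = tree_reduce([1, 2, 3, 4], mul, 1)
```

For the eight-element example this takes three passes, matching the rows 1 5 9 13, then 6 22, then 28. The average can be obtained by dividing the sum by n afterwards.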
11
![Page 12: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/12.jpg)
All-Prefix-Sums
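Input:
Array of n elements: $[a_0, a_1, \ldots, a_{n-1}]$
Binary associative operator: $\oplus$
Identity: $I$
Outputs the array: $[I, a_0, (a_0 \oplus a_1), \ldots, (a_0 \oplus a_1 \oplus \cdots \oplus a_{n-2})]$
Images from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
12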
![Page 13: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/13.jpg)
All-Prefix-Sums
Example
If $\oplus$ is addition, the array
[3 1 7 0 4 1 6 3]
is transformed to
[0 3 4 11 11 15 16 22]
Seems sequential, but there is an efficient parallel solution
13
![Page 14: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/14.jpg)
Scan
Scan: all-prefix-sums operation on an array of data
Exclusive Scan (Prescan): Element j of the result does not include element j of the input:
In: [3 1 7 0 4 1 6 3]
Out: [0 3 4 11 11 15 16 22]
Inclusive Scan: All elements up to and including j are summed:
In: [3 1 7 0 4 1 6 3]
Out: [3 4 11 11 15 16 22 25]
14
![Page 15: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/15.jpg)
Scan
How do you generate an exclusive scan from an inclusive scan?
Input: [3 1 7 0 4 1 6 3]
Inclusive: [3 4 11 11 15 16 22 25]
Exclusive: [0 3 4 11 11 15 16 22]
// Shift right, insert identity
How do you go in the opposite direction?
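Both directions can be sketched in a few lines, assuming the operator is addition. The function names are illustrative only. The opposite direction shown here combines each exclusive result with its input element; shifting left and appending the grand total is another option.

```python
def inclusive_to_exclusive(inclusive, identity=0):
    # Shift right by one position and insert the identity at the front.
    return [identity] + inclusive[:-1]


def exclusive_to_inclusive(exclusive, values):
    # Add each input element to the exclusive result at the same index.
    return [e + v for e, v in zip(exclusive, values)]


values = [3, 1, 7, 0, 4, 1, 6, 3]
inclusive = [3, 4, 11, 11, 15, 16, 22, 25]
exclusive = inclusive_to_exclusive(inclusive)
round_trip = exclusive_to_inclusive(exclusive, values)
```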
15
![Page 16: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/16.jpg)
Scan
Use cases
Stream compaction
Summed-area tables for variable width image processing
Radix sort
…
16
![Page 17: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/17.jpg)
Scan
Used to convert certain sequential computation into equivalent parallel computation
Image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
17
![Page 18: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/18.jpg)
Scan
Design a parallel algorithm for exclusive scan
In: [3 1 7 0 4 1 6 3]
Out: [0 3 4 11 11 15 16 22]
Consider: Total number of additions
18
![Page 19: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/19.jpg)
Scan
Sequential Scan: single thread, trivial
n adds for an array of length n
Work complexity: O(n)
How many adds will our parallel version have?
Image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
19
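A minimal single-threaded sketch of the sequential exclusive scan, written for any binary associative operator with an identity. The function name and signature are assumptions made for illustration.

```python
from operator import add


def exclusive_scan(values, op=add, identity=0):
    """Sequential exclusive scan: out[j] combines values[0..j-1]."""
    out = []
    running = identity
    for v in values:
        out.append(running)       # result for this index excludes v itself
        running = op(running, v)  # one application of op per element: O(n) work
    return out


result = exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3])
```

The loop carries `running` from one iteration to the next, which is why the computation looks inherently sequential.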
![Page 20: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/20.jpg)
Scan
Naive Parallel Scan
Is this exclusive or inclusive?
Each thread
Writes one sum
Reads two values
for d = 1 to log2(n)
  for all k in parallel
    if (k >= 2^(d-1))
      x[k] = x[k - 2^(d-1)] + x[k];
Image from http://developer.download.nvidia.com/compute/cuda/1_1/Website/projects/scan/doc/scan.pdf
20
![Page 21: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/21.jpg)
Scan
Naive Parallel Scan: Input
0 1 2 3 4 5 6 7
21
![Page 31: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/31.jpg)
Scan
Naive Parallel Scan: d = 1, 2^(d-1) = 1
0 1 2 3 4 5 6 7
0 1 3 5 7 9 11 13
Recall, it runs in parallel!
for d = 1 to log2(n)
  for all k in parallel
    if (k >= 2^(d-1))
      x[k] = x[k - 2^(d-1)] + x[k];
31
![Page 34: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/34.jpg)
Scan
Naive Parallel Scan: d = 2, 2^(d-1) = 2
0 1 2 3 4 5 6 7
0 1 3 5 7 9 11 13 (after d = 1)
0 1 3 6 10 14 18 22 (after d = 2)
Consider only k = 7: x[7] = x[5] + x[7] = 9 + 13 = 22
for d = 1 to log2(n)
  for all k in parallel
    if (k >= 2^(d-1))
      x[k] = x[k - 2^(d-1)] + x[k];
34
![Page 36: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/36.jpg)
Scan
Naive Parallel Scan: d = 3, 2^(d-1) = 4
0 1 2 3 4 5 6 7
0 1 3 5 7 9 11 13 (after d = 1)
0 1 3 6 10 14 18 22 (after d = 2)
Consider only k = 7: x[7] = x[3] + x[7] = 6 + 22 = 28
for d = 1 to log2(n)
  for all k in parallel
    if (k >= 2^(d-1))
      x[k] = x[k - 2^(d-1)] + x[k];
36
![Page 37: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/37.jpg)
Scan
Naive Parallel Scan: Final
0 1 2 3 4 5 6 7
0 1 3 5 7 9 11 13
0 1 3 6 10 14 18 22
0 1 3 6 10 15 21 28
37
![Page 38: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/38.jpg)
Scan
Naive Parallel Scan
What is naive about this algorithm?
What was the work complexity for sequential scan?
What is the work complexity for this?
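To compare work, the naive scan can be simulated on the CPU while counting additions. This is a sketch under two assumptions: the input length is a power of two, and each pass reads from a copy of the previous pass (double buffering) so that the simulated threads never observe partially updated values. The function name is illustrative.

```python
def naive_parallel_scan(values):
    """Simulate the naive scan; returns (scanned_array, number_of_adds)."""
    x = list(values)
    n = len(x)
    adds = 0
    # For a power-of-two n, n.bit_length() - 1 equals log2(n).
    for d in range(1, n.bit_length()):
        offset = 2 ** (d - 1)
        prev = x[:]  # every "thread" reads the state left by the previous pass
        for k in range(n):  # stands in for "for all k in parallel"
            if k >= offset:
                x[k] = prev[k - offset] + prev[k]
                adds += 1
    return x, adds


scanned, adds = naive_parallel_scan([0, 1, 2, 3, 4, 5, 6, 7])
```

On the eight-element example this performs 7 + 6 + 4 = 17 additions, compared with 8 for the sequential scan. In general the naive version does O(n log n) work against O(n) for the sequential one.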
38
![Page 40: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/40.jpg)
Stream Compaction
Given an array of elements
Create a new array with elements that meet a certain criterion, e.g. non null
Preserve order
a b c d e f g h
a c d g
40
![Page 41: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/41.jpg)
Stream Compaction
Used in path tracing, collision detection, sparse matrix compression, etc.
Can reduce bandwidth from GPU to CPU
a b c d e f g h
a c d g
41
![Page 42: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/42.jpg)
Stream Compaction
Step 1: Compute temporary array containing
1 if corresponding element meets criteria
0 if element does not meet criteria
a b c d e f g h
42
![Page 48: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/48.jpg)
Stream Compaction
Step 1: Compute temporary array
a b c d e f g h
1 0 1 1 0 0 1 0
It runs in parallel!
48
![Page 50: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/50.jpg)
Stream Compaction
Step 2: Run exclusive scan on temporary array
Scan runs in parallel
What can we do with the results?
a b c d e f g h
1 0 1 1 0 0 1 0
Scan result: 0 1 1 2 3 3 3 4
50
![Page 51: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/51.jpg)
Stream Compaction
Step 3: Scatter
Result of scan is index into final array
Only write an element if temporary array has a 1
51
![Page 58: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/58.jpg)
Stream Compaction
Step 3: Scatter
a b c d e f g h
1 0 1 1 0 0 1 0
Scan result: 0 1 1 2 3 3 3 4
Final array: a c d g
Final array indices: 0 1 2 3
Scatter runs in parallel!
58
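The three steps (flag, exclusive scan, scatter) can be sketched sequentially as below. The function name `compact` and the `keep` predicate are illustrative assumptions. On a GPU each of the three loops would be a parallel kernel, with the scan replaced by a parallel scan.

```python
def compact(elements, keep):
    """Return (flags, scan, output) for order-preserving stream compaction."""
    # Step 1: temporary array of 1/0 flags.
    flags = [1 if keep(e) else 0 for e in elements]

    # Step 2: exclusive scan of the flags gives each kept element its output index.
    scan = []
    running = 0
    for f in flags:
        scan.append(running)
        running += f

    # Step 3: scatter. Only elements whose flag is 1 are written.
    output = [None] * running  # running now holds the number of kept elements
    for i, e in enumerate(elements):
        if flags[i]:
            output[scan[i]] = e
    return flags, scan, output


flags, scan, output = compact(list("abcdefgh"), lambda e: e in "acdg")
```

Because the scan assigns increasing indices in input order, the original ordering is preserved and no two elements are written to the same slot.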
![Page 59: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/59.jpg)
Summed Area Table
Summed Area Table (SAT): 2D table where each element stores the sum of all elements in an input image between the lower left corner and the entry location.
59
![Page 60: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/60.jpg)
Summed Area Table
Input image (rows listed bottom to top):
1 1 0 2
1 2 1 0
0 1 2 0
2 1 0 0
SAT:
1 2 2 4
2 5 6 8
2 6 9 11
4 9 12 14
Example: (1 + 1 + 0) + (1 + 2 + 1) + (0 + 1 + 2) = 9
60
![Page 61: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/61.jpg)
Summed Area Table
Benefit
Used to perform different width filters at every pixel in the image in constant time per pixel
Just sample four pixels in SAT:
Image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
61
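One common way to use four SAT samples is the inclusion-exclusion identity: the sum over a rectangle is the SAT value at its far corner, minus the two strips outside the rectangle, plus the corner region that was subtracted twice. The sketch below assumes the SAT is stored as a list of rows with the origin at index [0][0] and treats samples outside the table as 0. The function name and argument order are illustrative.

```python
def box_sum(sat, r0, c0, r1, c1):
    """Sum of the input over rows r0..r1 and columns c0..c1 (inclusive)."""
    def sample(r, c):
        # Samples that fall before the origin contribute nothing.
        return sat[r][c] if r >= 0 and c >= 0 else 0

    return (sample(r1, c1)
            - sample(r0 - 1, c1)
            - sample(r1, c0 - 1)
            + sample(r0 - 1, c0 - 1))


SAT = [
    [1, 2, 2, 4],
    [2, 5, 6, 8],
    [2, 6, 9, 11],
    [4, 9, 12, 14],
]
inner = box_sum(SAT, 1, 1, 2, 2)    # 2 + 1 + 1 + 2 from the input image
average = inner / 4                 # box filter: divide by the area
```

The cost is four lookups regardless of the rectangle size, which is what allows a different filter width at every pixel.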
![Page 62: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/62.jpg)
Summed Area Table
Uses
Approximate depth of field
Glossy environment reflections and refractions
Image from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
62
![Page 73: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/73.jpg)
Summed Area Table
Input image:
1 1 0 2
1 2 1 0
0 1 2 0
2 1 0 0
SAT, filled in sequentially one entry at a time, row by row:
1 2 2 4
2 5 6 8
2 6 9 11
4 9 12 14
73
![Page 74: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/74.jpg)
Summed Area Table
How would you implement this on the GPU?
74
![Page 75: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/75.jpg)
Summed Area Table
How would you compute a SAT on the GPU using inclusive scan?
75
![Page 76: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/76.jpg)
Summed Area Table
Step 1 of 2: One inclusive scan for each row
Input image:
1 1 0 2
1 2 1 0
0 1 2 0
2 1 0 0
Partial SAT:
1 2 2 4
1 3 4 4
0 1 3 3
2 3 3 3
76
![Page 77: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/77.jpg)
Summed Area Table
Step 2 of 2: One inclusive scan for each column, bottom to top
Partial SAT:
1 2 2 4
1 3 4 4
0 1 3 3
2 3 3 3
Final SAT:
1 2 2 4
2 5 6 8
2 6 9 11
4 9 12 14
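The two steps can be sketched with sequential inclusive scans standing in for the parallel ones. The assumption here is that rows are stored in the order the tables above list them, so the column scans run in increasing row index. The function name is illustrative.

```python
from itertools import accumulate


def summed_area_table(image):
    """Return (partial_sat, final_sat) using row scans followed by column scans."""
    # Step 1 of 2: one inclusive scan per row (rows are independent of each other).
    partial = [list(accumulate(row)) for row in image]
    # Step 2 of 2: one inclusive scan per column (columns are independent of each other).
    scanned_columns = [list(accumulate(col)) for col in zip(*partial)]
    # Transpose back so the result is again a list of rows.
    final = [list(row) for row in zip(*scanned_columns)]
    return partial, final


image = [
    [1, 1, 0, 2],
    [1, 2, 1, 0],
    [0, 1, 2, 0],
    [2, 1, 0, 0],
]
partial, sat = summed_area_table(image)
```

Every row scan in step 1 and every column scan in step 2 is independent, so on a GPU all scans within a step can run concurrently.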
77
![Page 78: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/78.jpg)
Radix Sort
Efficient for small sort keys
k-bit keys require k passes
78
![Page 79: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/79.jpg)
Radix Sort
Each radix sort pass partitions its input based on one bit
First pass starts with the least significant bit (LSB). Subsequent passes move towards the most significant bit (MSB)
010 (the leftmost bit is the MSB, the rightmost bit is the LSB)
79
![Page 80: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/80.jpg)
Radix Sort
Example input:
100 111 010 110 011 101 001 000
Example from http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html
80
![Page 81: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/81.jpg)
Radix Sort
First pass: partition based on LSB
100 111 010 110 011 101 001 000 (input)
100 010 110 000 111 011 101 001 (after first pass)
LSB == 0: 100 010 110 000
LSB == 1: 111 011 101 001
81
![Page 82: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/82.jpg)
Radix Sort
Second pass: partition based on middle bit
100 111 010 110 011 101 001 000 (input)
100 010 110 000 111 011 101 001 (after first pass)
100 000 101 001 010 110 111 011 (after second pass)
bit == 0: 100 000 101 001
bit == 1: 010 110 111 011
82
![Page 83: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/83.jpg)
Radix Sort
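Final pass: partition based on MSB
100 111 010 110 011 101 001 000 (input)
100 010 110 000 111 011 101 001 (after first pass)
100 000 101 001 010 110 111 011 (after second pass)
000 001 010 011 100 101 110 111 (after final pass)
MSB == 0: 000 001 010 011
MSB == 1: 100 101 110 111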
83
![Page 84: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/84.jpg)
Radix Sort
Completed:
100 111 010 110 011 101 001 000
100 010 110 000 111 011 101 001
100 000 101 001 010 110 111 011
000 001 010 011 100 101 110 111
84
![Page 85: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/85.jpg)
Radix Sort
Completed:
4 7 2 6 3 5 1 0
4 2 6 0 7 3 5 1
4 0 5 1 2 6 7 3
0 1 2 3 4 5 6 7
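The passes above can be reproduced with a short sequential sketch. It uses a stable partition per bit, starting from the LSB. The function name and the choice to return the intermediate passes are assumptions made for illustration.

```python
def radix_sort(keys, bits):
    """LSB-first radix sort; returns (sorted_keys, list_of_arrays_after_each_pass)."""
    passes = []
    for bit in range(bits):  # bit 0 is the LSB
        zeros = [k for k in keys if not (k >> bit) & 1]
        ones = [k for k in keys if (k >> bit) & 1]
        # Stable partition: keys with a 0 bit first, original order kept in each group.
        keys = zeros + ones
        passes.append(keys)
    return keys, passes


sorted_keys, passes = radix_sort([4, 7, 2, 6, 3, 5, 1, 0], bits=3)
```

Stability is what makes this work: each pass must preserve the relative order established by the earlier, less significant bits.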
85
![Page 86: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/86.jpg)
Parallel Radix Sort
Where is the parallelism?
86
![Page 87: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/87.jpg)
Parallel Radix Sort
1. Break input arrays into tiles
Each tile fits into shared memory for an SM
2. Sort tiles in parallel with radix sort
3. Merge pairs of tiles using a parallel bitonic merge until all tiles are merged.
Our focus is on Step 2
87
![Page 88: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/88.jpg)
Parallel Radix Sort
Where is the parallelism?
Each tile is sorted in parallel
Where is the parallelism within a tile?
88
![Page 89: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/89.jpg)
Parallel Radix Sort
Where is the parallelism?
Each tile is sorted in parallel
Where is the parallelism within a tile?
Each pass is done in sequence after the previous pass. No parallelism
Can we parallelize an individual pass? How?
Merge also has parallelism
89
![Page 90: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/90.jpg)
Parallel Radix Sort
Implement split. Given:
Array, i, at pass n:
100 111 010 110 011 101 001 000
Array, b, which is true/false for bit n:
0 1 0 0 1 1 1 0
Output array with false keys before true keys:
100 010 110 000 111 011 101 001
90
![Page 91: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/91.jpg)
Parallel Radix Sort
Step 1: Compute e array (e[i] = 1 where b[i] is false)
100 111 010 110 011 101 001 000 i array
0 1 0 0 1 1 1 0 b array
1 0 1 1 0 0 0 1 e array
91
![Page 92: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/92.jpg)
Parallel Radix Sort
Step 2: Exclusive Scan e
100 111 010 110 011 101 001 000 i array
0 1 0 0 1 1 1 0 b array
1 0 1 1 0 0 0 1 e array
0 1 1 2 3 3 3 3 f array
92
![Page 93: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/93.jpg)
Parallel Radix Sort
Step 3: Compute totalFalses
100 111 010 110 011 101 001 000 i array
0 1 0 0 1 1 1 0 b array
1 0 1 1 0 0 0 1 e array
0 1 1 2 3 3 3 3 f array
totalFalses = e[n - 1] + f[n - 1]
totalFalses = 1 + 3
totalFalses = 4
93
![Page 98: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/98.jpg)
Parallel Radix Sort
Step 4: Compute t array
100 111 010 110 011 101 001 000 i array
0 1 0 0 1 1 1 0 b array
1 0 1 1 0 0 0 1 e array
0 1 1 2 3 3 3 3 f array
4 4 5 5 5 6 7 8 t array
totalFalses = 4
t[i] = i - f[i] + totalFalses
t[0] = 0 - f[0] + totalFalses = 0 - 0 + 4 = 4
t[1] = 1 - f[1] + totalFalses = 1 - 1 + 4 = 4
t[2] = 2 - f[2] + totalFalses = 2 - 1 + 4 = 5
98
![Page 102: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/102.jpg)
Parallel Radix Sort
Step 5: Scatter based on address d
100 111 010 110 011 101 001 000 i array
0 1 0 0 1 1 1 0 b array
1 0 1 1 0 0 0 1 e array
0 1 1 2 3 3 3 3 f array
4 4 5 5 5 6 7 8 t array
0 4 1 2 5 6 7 3 d array
d[i] = b[i] ? t[i] : f[i]
102
![Page 104: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/104.jpg)
Parallel Radix Sort
Step 5: Scatter based on address d
100 111 010 110 011 101 001 000 i array
0 4 1 2 5 6 7 3 d
100 010 110 000 111 011 101 001 output
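Steps 1 through 5 can be collected into one function. This is a sequential sketch that assumes a non-empty list of non-negative integer keys. The array names follow the slides, while the function name `split` and its return shape are chosen for illustration. Every list comprehension corresponds to a per-element operation that could run in parallel, and the scan would be a parallel exclusive scan.

```python
def split(keys, bit):
    """Stable split on one bit; returns (d, output) as in steps 1-5."""
    n = len(keys)
    b = [(k >> bit) & 1 for k in keys]           # true/false for this bit
    e = [1 - x for x in b]                       # Step 1: 1 where b is false
    f = []                                       # Step 2: exclusive scan of e
    running = 0
    for x in e:
        f.append(running)
        running += x
    total_falses = e[n - 1] + f[n - 1]           # Step 3
    t = [i - f[i] + total_falses for i in range(n)]   # Step 4
    d = [t[i] if b[i] else f[i] for i in range(n)]    # Step 5: destination address
    output = [None] * n
    for i in range(n):
        output[d[i]] = keys[i]                   # scatter
    return d, output


keys = [0b100, 0b111, 0b010, 0b110, 0b011, 0b101, 0b001, 0b000]
d, after_lsb = split(keys, 0)

# Applying split once per bit, LSB first, sorts k-bit keys (k = 3 here).
result = keys
for bit in range(3):
    _, result = split(result, bit)
```

False keys go to f[i], their rank among the falses. True keys go to t[i], which places them after all the falses while keeping their relative order.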
104
![Page 105: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/105.jpg)
Parallel Radix Sort
Given k-bit keys, how do we sort using our new split function?
Once each tile is sorted, how do we merge tiles to provide the final sorted array?
105
![Page 106: Parallel Algorithms](https://reader035.fdocuments.us/reader035/viewer/2022081420/5681624c550346895dd294a8/html5/thumbnails/106.jpg)
Summary
Parallel reduction, scan, and sort are building blocks for many algorithms
An understanding of parallel programming and GPU architecture yields efficient GPU implementations
106