Parallel Algorithms and Data Structures
CS 448, Stanford University
20 April 2010
John Owens
Associate Professor, Electrical and Computer Engineering, UC Davis
Data-Parallel Algorithms
• Efficient algorithms require efficient building blocks
• Five data-parallel building blocks
• Map
• Gather & Scatter
• Reduce
• Scan
• Sort
• Advanced data structures:
• Sparse matrices
• Hash tables
• Task queues
Sample Motivating Application
• How bumpy is a surface that we represent as a grid of samples?
• Algorithm:
• Loop over all elements
• At each element, compare the value of that element to the average of its neighbors (“difference”). Square that difference.
• Now sum up all those differences.
• But we don’t want to sum all the diffs that are 0.
• So only sum up the non-zero differences.
• This is a fake application—don’t take it too seriously.
Picture courtesy http://www.artifice.com
Sample Motivating Application
for all samples:
    neighbors[x,y] = 0.25 * ( value[x-1,y] + value[x+1,y] + value[x,y+1] + value[x,y-1] )
diff = (value[x,y] - neighbors[x,y])^2
result = 0
for all samples where diff != 0:
result += diff
return result
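A serial reference in Python (a sketch; the function name and grid layout are illustrative) makes the three stages of the pseudocode explicit: a per-element stencil, a predicate, and a sum:

```python
def bumpiness(value):
    """Serial reference for the slide's pseudocode: sum of squared
    differences between each interior sample and its 4-neighbor average."""
    h, w = len(value), len(value[0])
    result = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            neighbors = 0.25 * (value[y][x - 1] + value[y][x + 1] +
                                value[y + 1][x] + value[y - 1][x])
            diff = (value[y][x] - neighbors) ** 2
            if diff != 0:          # only sum the non-zero differences
                result += diff
    return result
```

Each of these stages maps onto one of the data-parallel building blocks introduced below.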
Sample Motivating Application
for all samples:
    neighbors[x,y] = 0.25 * ( value[x-1,y] + value[x+1,y] + value[x,y+1] + value[x,y-1] )
diff = (value[x,y] - neighbors[x,y])^2
result = 0
for all samples where diff != 0:
result += diff
return result
The Map Operation
• Given:
• Array or stream of data elements A
• Function f(x)
• map(A, f) applies f(x) to all aᵢ ∈ A
• How does this map to a data-parallel processor?
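A minimal sketch of map in Python (names are illustrative), with a thread pool standing in for the data-parallel processor; since each application of f is independent, all of them can run at once:

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_map(f, a, workers=4):
    # Every element is independent, so f can be applied to all of them
    # concurrently; order of the results is preserved.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(f, a))
```

For example, `parallel_map(lambda x: x * x, [1, 2, 3])` gives `[1, 4, 9]`.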
Sample Motivating Application
for all samples:
    neighbors[x,y] = 0.25 * ( value[x-1,y] + value[x+1,y] + value[x,y+1] + value[x,y-1] )
diff = (value[x,y] - neighbors[x,y])^2
result = 0
for all samples where diff != 0:
result += diff
return result
Scatter vs. Gather
• Gather: p = a[i]
• Scatter: a[i] = p
• How does this map to a data-parallel processor?
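In Python terms (an illustrative sketch, not GPU code), the two operations differ only in which side of the assignment has the computed index:

```python
def gather(a, idx):
    # p[i] = a[idx[i]]: each output element reads from a computed address
    return [a[i] for i in idx]

def scatter(p, idx, n):
    # a[idx[i]] = p[i]: each input element writes to a computed address;
    # if two idx entries collide, one write wins (order is hardware-dependent)
    a = [None] * n
    for v, i in zip(p, idx):
        a[i] = v
    return a
```

Gathers are reads and always well-defined; scatters are writes, so colliding addresses make the result order-dependent.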
Sample Motivating Application
for all samples:
    neighbors[x,y] = 0.25 * ( value[x-1,y] + value[x+1,y] + value[x,y+1] + value[x,y-1] )
diff = (value[x,y] - neighbors[x,y])^2
result = 0
for all samples where diff != 0:
result += diff
return result
Parallel Reductions
• Given:
• Binary associative operator ⊕ with identity I
• Ordered set s = [a₀, a₁, …, aₙ₋₁] of n elements
• reduce(⊕, s) returns a₀ ⊕ a₁ ⊕ … ⊕ aₙ₋₁
• Example: reduce(+, [3 1 7 0 4 1 6 3]) = 25
• Reductions common in parallel algorithms
• Common reduction operators are +, ×, min and max
• Note floating point is only pseudo-associative
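A sketch of the reduce contract in Python (functools.reduce is the sequential version; associativity of ⊕ is what lets a parallel implementation regroup the terms into a tree):

```python
from functools import reduce
import operator

def reduce_op(op, identity, s):
    # reduce(⊕, s) = a0 ⊕ a1 ⊕ … ⊕ a(n-1); associativity lets a parallel
    # implementation regroup these terms into a balanced tree.
    return reduce(op, s, identity)
```

The slide's example: `reduce_op(operator.add, 0, [3, 1, 7, 0, 4, 1, 6, 3])` gives 25. Pseudo-associativity of floating point is easy to see: `(0.1 + 0.2) + 0.3` and `0.1 + (0.2 + 0.3)` round differently, so parallel regrouping can change the result in the last bits.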
Efficiency
• Work efficiency:
• Total amount of work done over all processors
• Step efficiency:
• Number of steps it takes to do that work
• With parallel processors, sometimes you’re willing to do more work to reduce the number of steps
• Even better if you can reduce the number of steps and still do the same amount of work
Parallel Reductions
• 1D parallel reduction:
• add two halves of domain together repeatedly...
• … until we’re left with a single row
[Figure: N → N/2 → N/4 → … → 1]
O(log₂ N) steps, O(N) work
Multiple 1D Parallel Reductions
• Can run many reductions in parallel
• Use 2D grid and reduce one dimension
[Figure: M×N → M×N/2 → M×N/4 → … → M×1]
O(log₂ N) steps, O(MN) work
2D Reductions
• Like 1D reduction, only reduce in both directions simultaneously
• Note: can add more than 2x2 elements per step
• Trade per-pixel work for step complexity
• Best perf depends on specific hardware (cache, etc.)
Parallel Reduction Complexity
• log(n) parallel steps; each step s performs n/2ˢ independent ops
• Step Complexity is O(log n)
• Performs n/2 + n/4 + … + 1 = n-1 operations
• Work Complexity is O(n): it is work-efficient
• i.e., does not perform more operations than a sequential algorithm
• With p threads physically in parallel (p processors), time complexity is O(n/p + log n)
• Compare to O(n) for sequential reduction
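A sketch of the tree reduction in Python (serial code, but each inner loop's iterations are independent and correspond to one parallel step):

```python
def tree_reduce(op, a):
    """Pairwise tree reduction: log2(n) steps, n-1 applications of op.
    Each step's inner-loop iterations touch disjoint elements, so a
    data-parallel processor can run them all simultaneously."""
    a = list(a)
    n = len(a)
    stride = 1
    while stride < n:
        for i in range(0, n - stride, 2 * stride):
            a[i] = op(a[i], a[i + stride])   # independent across i
        stride *= 2
    return a[0]
```

With p processors, each step's n/2 operations take n/(2p) rounds, giving the O(n/p + log n) time above.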
Sample Motivating Application
for all samples:
    neighbors[x,y] = 0.25 * ( value[x-1,y] + value[x+1,y] + value[x,y+1] + value[x,y-1] )
diff = (value[x,y] - neighbors[x,y])^2
result = 0
for all samples where diff != 0:
result += diff
return result
Stream Compaction
• Input: stream of 1s and 0s: [1 0 1 1 0 0 1 0]
• Operation: “sum up all elements before you”
• Output: scatter addresses for “1” elements: [0 1 1 2 3 3 3 4]
• Note the scatter addresses for the kept (“1”) elements are packed!
Input:  A B C D E F G H
Output: A C D G
Common Situations in Parallel Computation
• Many parallel threads that need to partition data
• Split
• Many parallel threads and variable output per thread
• Compact / Expand / Allocate
• More complicated patterns than one-to-one or all-to-one
• Instead all-to-all
Split Operation
• Given an array of true and false elements (and payloads)
• Return an array with all false elements packed before all true elements
• Examples: sorting, building trees
Flag:    F T F F T F F T  →  F F F F F T T T
Payload: 3 6 1 4 0 7 1 3  →  3 1 4 7 1 6 0 3
Variable Output Per Thread: Compact
• Remove null elements
• Example: collision detection
Input:  3 0 7 0 4 1 0 3
Output: 3 7 4 1 3
Variable Output Per Thread
• Allocate variable storage per thread
• Examples: marching cubes, geometry generation
[Figure: five threads allocate 2, 1, 0, 3, 2 output slots; their outputs A-H are packed contiguously]
“Where do I write my output?”
• In all of these situations, each thread needs to answer that simple question
• The answer is:
• “That depends on how much the other threads need to write!”
• In a serial processor, this is simple
• “Scan” is an efficient way to answer this question in parallel
Parallel Prefix Sum (Scan)
• Given an array A = [a₀, a₁, …, aₙ₋₁] and a binary associative operator ⊕ with identity I,
• scan(A) = [I, a₀, (a₀ ⊕ a₁), …, (a₀ ⊕ a₁ ⊕ … ⊕ aₙ₋₂)]
• Example: if ⊕ is addition, then scan on the set
• [3 1 7 0 4 1 6 3]
• returns the set
• [0 3 4 11 11 15 16 22]
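The definition translates directly into a serial sketch in Python (parallel formulations follow on the next slides):

```python
def exclusive_scan(op, identity, a):
    # scan(A) = [I, a0, a0 (+) a1, ..., a0 (+) ... (+) a(n-2)]
    out, running = [], identity
    for x in a:
        out.append(running)       # emit the sum of everything before x
        running = op(running, x)
    return out
```

This exclusive form is exactly the "sum up all elements before you" operation used by stream compaction.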
Segmented Scan
• Example: if ⊕ is addition, then scan on the set
• [3 1 7 | 0 4 1 | 6 3]
• returns the set
• [0 3 4 | 0 0 4 | 0 6]
• Same computational complexity as scan, but additionally have to keep track of segments (we use head flags to mark which elements are segment heads)
• Useful for nested data parallelism (quicksort)
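A serial sketch in Python using head flags, as the slide describes (flags[i] == 1 marks element i as a segment head):

```python
def segmented_exclusive_scan(a, flags):
    """Exclusive sum scan that restarts at every segment head."""
    out, running = [], 0
    for x, head in zip(a, flags):
        if head:
            running = 0           # new segment: reset the running sum
        out.append(running)
        running += x
    return out
```

On the slide's example, `segmented_exclusive_scan([3, 1, 7, 0, 4, 1, 6, 3], [1, 0, 0, 1, 0, 0, 1, 0])` gives `[0, 3, 4, 0, 0, 4, 0, 6]`.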
Quicksort
[5 3 7 4 6]    # initial input
[5 5 5 5 5]    # distribute pivot across segment
[f f t f t]    # input > pivot?
[5 3 4][7 6]   # split-and-segment
[5 5 5][7 7]   # distribute pivot across segment
[t f f][t f]   # input >= pivot?
[3 4 5][6 7]   # split-and-segment, done!
O(n log n) Scan
[Figure: naive scan; at step d, each element adds in the element 2^(d-1) positions to its left, so after log n steps position i holds a₀ ⊕ … ⊕ aᵢ]
• Step efficient (log n steps)
• Not work efficient (n log n work)
O(n) Scan
[Figure: work-efficient scan; an up-sweep builds a tree of partial sums in place (log n steps), then a down-sweep distributes exclusive prefixes back down (log n more steps)]
• Not step efficient (2 log n steps)
• Work efficient (O(n) work)
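A Python sketch of this two-phase (up-sweep then down-sweep) scan, assuming n is a power of two; each inner loop is one parallel step whose iterations are independent:

```python
def blelloch_scan(a):
    """Work-efficient exclusive sum scan: 2*log2(n) steps, O(n) additions.
    Assumes len(a) is a power of two."""
    a = list(a)
    n = len(a)
    # Up-sweep: build a tree of partial sums in place.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            a[i] += a[i - d]
        d *= 2
    # Down-sweep: clear the root, then push exclusive prefixes back down.
    a[n - 1] = 0
    d = n // 2
    while d >= 1:
        for i in range(2 * d - 1, n, 2 * d):
            a[i - d], a[i] = a[i], a[i] + a[i - d]
        d //= 2
    return a
```

On the earlier example, `blelloch_scan([3, 1, 7, 0, 4, 1, 6, 3])` gives `[0, 3, 4, 11, 11, 15, 16, 22]`.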
Application: Stream Compaction
• 1M elements: ~0.6-1.3 ms
• 16M elements: ~8-20 ms
• Perf depends on # elements retained
Informally, stream compaction is a filtering operation: from an input vector, it selects a subset of this vector and packs that subset into a dense output vector. Figure 39-9 shows an example. More formally, stream compaction takes an input vector vi and a predicate p, and outputs only those elements in vi for which p(vi) is true, preserving the ordering of the input elements. Horn (2005) describes this operation in detail.
Stream compaction requires two steps, a scan and a scatter.
1. The first step generates a temporary vector where the elements that pass the predicate are set to 1 and the other elements are set to 0. We then scan this temporary vector. For each element that passes the predicate, the result of the scan now contains the destination address for that element in the output vector.
2. The second step scatters the input elements to the output vector using the addresses generated by the scan.
Figure 39-10 shows this process in detail.
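The two steps sketched in Python (illustrative serial code; on the GPU both the scan and the scatter run in parallel):

```python
def compact(values, keep):
    """Scan-then-scatter stream compaction. keep[i] is 1 to retain
    values[i], 0 to drop it."""
    addr, running = [], 0
    for k in keep:                 # step 1: exclusive scan of the flags
        addr.append(running)
        running += k
    out = [None] * running         # running == number of kept elements
    for i, k in enumerate(keep):   # step 2: scatter kept elements
        if k:
            out[addr[i]] = values[i]
    return out
```

On the figure's example, `compact(list("ABCDEFGH"), [1, 0, 1, 1, 0, 0, 1, 0])` gives `['A', 'C', 'D', 'G']`.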
Input:  A B C D E F G H
Output: A C D G
Figure 39-9. Stream Compaction Example. Given an input vector, stream compaction outputs a packed subset of that vector, choosing only the input elements of interest (marked in gray).

Set a “1” in each gray input: 1 0 1 1 0 0 1 0
Scan:                         0 1 1 2 3 3 3 4
Scatter input to output, using the scan result as scatter address.
Figure 39-10. Scan and Scatter. Stream compaction requires one scan to determine destination addresses and one vector scatter operation to place the input elements into the output vector.
Application: Radix Sort
• Sort 16M 32-bit key-value pairs: ~120 ms
• Perform split operation on each bit using scan
• Can also sort each block and merge
• Efficient merge on GPU is an active area of research
Thus for k-bit keys, radix sort requires k steps. Our implementation requires one scan per step.
The fundamental primitive we use to implement each step of radix sort is the split primitive. The input to split is a list of sort keys and their bit value b of interest on this step, either a true or false. The output is a new list of sort keys, with all false sort keys packed before all true sort keys.
We implement split on the GPU in the following way, as shown in Figure 39-14.
1. In a temporary buffer in shared memory, we set a 1 for all false sort keys (b = 0) and a 0 for all true sort keys.
2. We then scan this buffer. This is the enumerate operation; each false sort key now contains its destination address in the scan output, which we will call f. These first two steps are equivalent to a stream compaction operation on all false sort keys.
(excerpted from GPU Gems 3, Chapter 39, “Parallel Prefix Sum (Scan) with CUDA”)
Split based on least significant bit b; scatter input using d as scatter address:

Input:                      100 111 010 110 011 101 001 000
b (least significant bit):    0   1   0   0   1   1   1   0
e = 1 for each “0” input:     1   0   1   1   0   0   0   1
f = scan of the 1s:           0   1   1   2   3   3   3   3
totalFalses = e[max] + f[max] = 1 + 3 = 4
t = i - f + totalFalses:      4   4   5   5   5   6   7   8
d = b ? t : f:                0   4   1   2   5   6   7   3
Output:                     100 010 110 000 111 011 101 001

Figure 39-14. The split operation requires a single scan and runs in linear time with the number of input elements.
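A Python sketch of split and the resulting radix sort (a serial stand-in; on the GPU the flag construction, the scan, and the scatter each run in parallel):

```python
def split(keys, bit):
    """One radix-sort step: stable-partition keys by the given bit,
    false (bit == 0) keys packed before true, using one exclusive scan."""
    n = len(keys)
    b = [(k >> bit) & 1 for k in keys]
    e = [1 - x for x in b]            # 1 for each false key
    f, running = [], 0                # f = exclusive scan of e
    for x in e:
        f.append(running)
        running += x
    total_falses = running            # == e[n-1] + f[n-1]
    out = [0] * n
    for i in range(n):
        d = (i - f[i] + total_falses) if b[i] else f[i]
        out[d] = keys[i]              # scatter to computed address
    return out

def radix_sort(keys, bits):
    for bit in range(bits):           # k passes for k-bit keys
        keys = split(keys, bit)       # stability makes the passes compose
    return keys
```

Stability of each split pass is what makes the bit-by-bit passes compose into a full sort.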
GPU Design Principles
• Data layouts that:
• Minimize memory traffic
• Maximize coalesced memory access
• Algorithms that:
• Exhibit data parallelism
• Keep the hardware busy
• Minimize divergence
Dense Matrix Multiplication
• for all elements in destination matrix P
• P(r,c) = (row r of M) · (column c of N)
[Figure: M, N, and P, each WIDTH × WIDTH]
Dense Matrix Multiplication
• P = M * N of size WIDTH x WIDTH
• With blocking:
• One thread block handles one BLOCK_SIZE x BLOCK_SIZE sub-matrix Psub of P
• M and N are only loaded WIDTH / BLOCK_SIZE times from global memory
• Great saving of memory bandwidth!
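The blocking idea in Python (an illustrative sketch; on the GPU the outer (bi, bj) loops become thread blocks, and each B×B tile of M and N is staged once in shared memory):

```python
def blocked_matmul(M, N, B):
    """Tiled P = M * N for square WIDTH x WIDTH matrices. Each (bi, bj)
    tile of P accumulates contributions from WIDTH/B tile pairs of M and N,
    so each input element is reused B times per load."""
    W = len(M)
    P = [[0.0] * W for _ in range(W)]
    for bi in range(0, W, B):
        for bj in range(0, W, B):            # one "thread block" per P tile
            for bk in range(0, W, B):        # march tiles across M and N
                for i in range(bi, min(bi + B, W)):
                    for j in range(bj, min(bj + B, W)):
                        s = 0.0
                        for k in range(bk, min(bk + B, W)):
                            s += M[i][k] * N[k][j]
                        P[i][j] += s
    return P
```

Any tile size gives the same numbers; the payoff on real hardware is the reuse of each loaded tile.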
[Figure: a BLOCK_SIZE × BLOCK_SIZE tile Psub of P computed from BLOCK_SIZE-wide strips of M and N]
Dense Matrix Multiplication
• Data layouts that:
• Minimize memory traffic
• Maximize coalesced memory access
• Algorithms that:
• Exhibit data parallelism
• Keep the hardware busy
• Minimize divergence
Sparse Matrix-Vector Multiply: What’s Hard?
• Dense approach is wasteful
• Unclear how to map work to parallel processors
• Irregular data access
Sparse Matrix Formats
Structured ⟷ Unstructured
(DIA) Diagonal · (ELL) ELLPACK · (HYB) Hybrid · (CSR) Compressed Row · (COO) Coordinate
Diagonal Matrices
• Diagonals should be mostly populated
• Map one thread per row
• Good parallel efficiency
• Good memory behavior [column-major storage]
[Figure: matrix with diagonals at offsets -2, 0, +1]
Irregular Matrices: ELL
• Assign one thread per row again
• But now:
• Load imbalance hurts parallel efficiency
[Figure: per-row column indices stored in a dense array; rows with 2, 6, 3, 3, 5, and 1 nonzeros are all padded to the length of the longest row]
Irregular Matrices: COO
• General format; insensitive to sparsity pattern, but ~3x slower than ELL
• Assign one thread per element, combine results from all elements in a row to get output element
• Requires segmented reduction and communication between threads
[Figure: explicit (row, column) index stored for every nonzero; no padding!]
Thread-per-{element,row}
Irregular Matrices: HYB
• Combine regularity of ELL + flexibility of COO
[Figure: an ELL part holds up to three entries per row (the “typical” case); overflow entries from long rows (the “exceptional” case) go into a COO part]
SpMV: Summary
• Ample parallelism for large matrices
• Structured matrices (dense, diagonal): straightforward
• Take-home message: Use data structure appropriate to your matrix
• Sparse matrices: Issue: Parallel efficiency
• ELL format / one thread per row is efficient
• Sparse matrices: Issue: Load imbalance
• COO format / one thread per element is insensitive to matrix structure
• Conclusion: Hybrid structure gives best of both worlds
• Insight: Irregularity is manageable if you regularize the common case
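Sketches of the two access patterns in Python (illustrative layouts: here ELL pads rows with column index -1, and COO stores an explicit row/column pair per nonzero):

```python
def spmv_ell(data, indices, x):
    """ELL SpMV: one thread per row; rows padded to a common length.
    data[r][j] pairs with column indices[r][j]; padding uses index -1."""
    y = []
    for r in range(len(data)):           # independent per row
        s = 0.0
        for j in range(len(data[r])):
            c = indices[r][j]
            if c >= 0:                   # skip padding
                s += data[r][j] * x[c]
        y.append(s)
    return y

def spmv_coo(rows, cols, vals, x, n_rows):
    """COO SpMV: one thread per nonzero; combining per-row products is
    the part that needs a segmented reduction (or atomics) on a GPU."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y
```

ELL's per-row loop bound is the padded length, which is why one very long row wastes work; COO does exactly one multiply-add per nonzero regardless of row structure.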
Composition
[Figure: rendering pipeline; Fragment Generation produces fragments (position, depth, color); Composite merges them into subpixels (position, color) at sample locations S1-S4 within each pixel; Filter combines subpixels into final pixels]
Pixel-Parallel Composition
[Figure: one thread per pixel composites subpixels 0 and 1 of pixels i and i+1]
Sample-Parallel Composition
[Figure: samples processed in parallel; a segmented scan followed by a segmented reduction yields subpixels 0 and 1 of pixels i and i+1]
Hash Tables & Sparsity
• Lefebvre and Hoppe, Siggraph 2006
Scalar Hashing
[Figure: three classic schemes: linear probing (hash, then probe successive slots), double probing (two hash functions #1 and #2), and chaining (a linked list per bucket)]
Scalar Hashing: Parallel Problems
• Construction and Lookup
• Variable time/work per entry
• Construction
• Synchronization / shared access to data structure
Parallel Hashing: The Problem
• Hash tables are good for sparse data.
• Input: Set of key-value pairs to place in the hash table
• Output: Data structure that allows:
• Determining if key has been placed in hash table
• Given the key, fetching its value
• Could also:
• Sort key-value pairs by key (construction)
• Binary-search sorted list (lookup)
• Recalculate at every change
Parallel Hashing: What We Want
• Fast construction time
• Fast access time
• O(1) for any element, O(n) for n elements in parallel
• Reasonable memory usage
• Algorithms and data structures may sit at different places in this space
• Perfect spatial hashing has good lookup times and reasonable memory usage but is very slow to construct
Level 1: Distribute into buckets
[Figure: keys are hashed (h) to bucket ids; an atomic add per bucket produces each key's local offset and the bucket sizes (8, 5, 6, 8); scanning the sizes gives global offsets (0, 8, 13, 19); keys are then scattered into their buckets]
Parallel Hashing: Level 1
• Good for a coarse categorization
• Possible performance issue: atomics
• Bad for a fine categorization
• Space requirements for n elements to (probabilistically) guarantee no collisions are O(n²)
Hashing in Parallel
[Figure: two hash functions hₐ and h_b map the same keys into different slots of a four-entry table]
Cuckoo Hashing Construction

[Figure: two tables, T1 and T2, with hash functions h1 and h2; each key is inserted via h1 into T1, evicting any occupant, which reinserts itself via h2 into T2, and so on until every key has a slot.]
• Lookup procedure: in parallel, for each element:
• Calculate h1 & look in T1;
• Calculate h2 & look in T2; still O(1) lookup
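The two-probe lookup is tiny; a minimal sketch (function and table names are hypothetical, and on a GPU each thread would run this independently for its own key):

```python
def cuckoo_lookup(key, t1, t2, h1, h2):
    """At most two probes per key, so lookup stays O(1)."""
    if t1[h1(key)] == key:   # probe 1: h1 into T1
        return True
    return t2[h2(key)] == key  # probe 2: h2 into T2
```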
Cuckoo Construction Mechanics
• Level 1 created buckets of no more than 512 items
• Average: 409; probability of overflow: < 10⁻⁶
• Level 2: Assign each bucket to a thread block, construct cuckoo hash per bucket entirely within shared memory
• Semantic: Multiple writes to same location must have one and only one winner
• Our implementation uses 3 tables of 192 elements each (load factor: 71%)
• What if it fails? New hash functions & start over.
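Construction with restart-on-failure can be sketched serially. This is not the paper's implementation: the toy universal hash family, the eviction bound, and the table sizes below are illustrative stand-ins, and the real GPU version runs one such build per bucket entirely in shared memory.

```python
import random

def build_cuckoo(keys, table_size, num_tables=3, max_evict=100, seed=0):
    """Multi-table cuckoo construction; on failure (eviction chain too
    long), pick new hash functions and start over, as on the slide."""
    rng = random.Random(seed)
    p = 4294967291  # a large prime for the toy universal hash family
    while True:
        coeffs = [(rng.randrange(1, p), rng.randrange(p)) for _ in range(num_tables)]
        hs = [lambda k, a=a, b=b: ((a * k + b) % p) % table_size
              for (a, b) in coeffs]
        tables = [[None] * table_size for _ in range(num_tables)]
        ok = True
        for key in keys:
            k, t = key, 0
            for _ in range(max_evict):
                slot = hs[t](k)
                # Place k; evict whatever was there and carry it onward.
                k, tables[t][slot] = tables[t][slot], k
                if k is None:
                    break
                t = (t + 1) % num_tables
            else:
                ok = False  # chain too long: restart with new functions
                break
        if ok:
            return tables, hs

def cuckoo_find(key, tables, hs):
    # One probe per table: O(1) lookup.
    return any(tables[t][hs[t](key)] == key for t in range(len(tables)))
```

With 3 tables and a load factor near the slide's 71%, a random restart almost always succeeds within a few tries.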
Timings on random voxel data

[Figure: timing chart omitted.]
Hashing: Big Ideas
• Classic serial hashing techniques are a poor fit for a GPU.
• Serialization, load balance
• Solving this problem required a different algorithm
• Both hashing algorithms were new to the parallel literature
• Hybrid algorithm was entirely new
Trees: Motivation
• Query: Does object X intersect with anything in the scene?
• Difficulty: X and the scene are dynamic
• Goal: Data structure that makes this query efficient (in parallel)

Images from HPCCD: Hybrid Parallel Continuous Collision Detection, Kim et al., Pacific Graphics 2009
k-d trees
Images from Wikipedia, “Kd-tree”
Generating Trees
• Increased parallelism with depth
• Irregular work generation
Tree Construction on a GPU

• At each stage, any node can generate 0, 1, or 2 new nodes

• Increased parallelism, but some threads wasted

• Compact after each step?

[Figure: node A produces children B and C; B and C produce D, E, and F; each node writes its left/right children into worst-case-sized output slots, leaving gaps.]
Tree Construction on a GPU

• Compact reduces overwork, but requires a global compact operation per step

• Also requires worst-case storage allocation

[Figure: the same A/B/C tree, now with the children D, E, and F compacted into contiguous slots after each level.]
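The level-by-level pattern with a compact after each step can be sketched serially (the `split` callback is hypothetical; on a GPU each level is one kernel launch, and the list comprehension plays the role of the global stream compact):

```python
def build_levels(root, split):
    """Breadth-first tree construction. `split(node)` returns a list of
    0, 1, or 2 children."""
    levels = [[root]]
    frontier = [root]
    while frontier:
        # Each node writes its children into its own (worst-case-sized) slot...
        slots = [split(node) for node in frontier]
        # ...then a compact packs the live children contiguously.
        frontier = [child for children in slots for child in children]
        if frontier:
            levels.append(frontier)
    return levels
```

For example, splitting an interval in half until it has unit length produces the familiar doubling of parallelism with depth.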
Assumptions of Approach
• Fairly high computation cost per step
• Smaller cost -> runtime dominated by overhead
• Small branching factor
• Makes pre-allocation tractable
• Fairly uniform computation per step
• Otherwise, load imbalance
• No communication between threads at all
Work Queue Approach
• Allocate private work queue of tasks per core
• Each core can add to or remove work from its local queue
• A core marks itself idle if its queue is empty or has exhausted its storage
• Cores periodically check global idle counter
• If global idle counter reaches threshold, rebalance work
Fast Hierarchy Operations on GPU Architectures, Lauterbach et al.
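The private-queue scheme above can be simulated serially. The names, round-robin redistribution, and the half-the-cores-idle threshold below are my illustrative choices, not Lauterbach et al.'s implementation:

```python
from collections import deque

def run_with_rebalance(tasks, num_cores, expand):
    """Each core owns a local queue; when enough cores go idle, the
    remaining work is redistributed. `expand(task)` returns new tasks."""
    queues = [deque() for _ in range(num_cores)]
    for i, t in enumerate(tasks):
        queues[i % num_cores].append(t)
    done = 0
    while any(queues):
        # Global idle counter check (periodic on a real GPU).
        idle = sum(1 for q in queues if not q)
        if idle * 2 >= num_cores:  # threshold: half the cores idle
            pending = [t for q in queues for t in q]
            queues = [deque() for _ in range(num_cores)]
            for i, t in enumerate(pending):  # rebalance round-robin
                queues[i % num_cores].append(t)
        # One "step": every non-idle core processes one task and may
        # push new work onto its own local queue.
        for q in queues:
            if q:
                task = q.popleft()
                done += 1
                q.extend(expand(task))
    return done
```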
Static Task List

[Figure: each SM reads its share of a static input array and appends results to an output array through an atomic pointer; consuming the newly generated output requires restarting the kernel.]

Next 4 slides: Daniel Cederman and Philippas Tsigas, On Dynamic Load Balancing on Graphics Processors. Graphics Hardware 2008, June 2008.
Blocking Dynamic Task Queue

[Figure: all SMs share a single queue, with a lock at each end serializing access.]

• Poor performance

• Scales poorly with # of blocks
Non-Blocking Dynamic Task Queue

[Figure: all SMs share a single queue with an atomic head pointer and an atomic tail pointer; pointer updates are lazy.]

• Better performance

• Scales well with small # of blocks, but poorer with large
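The atomic head/tail idea can be sketched as follows. The class name is mine; Python stands in for the GPU atomics, and this sketch ignores the memory-ordering, wrap-around, and lazy-update details a real implementation must handle:

```python
class NonBlockingQueueSketch:
    """Producers claim a slot by bumping the tail, consumers by bumping
    the head; no lock is held across the data copy."""
    def __init__(self, capacity):
        self.buf = [None] * capacity
        self.head = 0  # next slot to consume (atomic on a GPU)
        self.tail = 0  # next slot to fill (atomic on a GPU)

    def push(self, item):
        slot = self.tail          # would be atomicAdd(&tail, 1)
        self.tail += 1
        self.buf[slot % len(self.buf)] = item

    def pop(self):
        if self.head == self.tail:
            return None           # queue empty
        slot = self.head          # would be atomicAdd(&head, 1)
        self.head += 1
        return self.buf[slot % len(self.buf)]
```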
Work Stealing

[Figure: each SM owns a lock-protected I/O deque; an idle SM steals work from another SM's deque.]

• Best performance and scalability

• Recent work by our group explored task donating

• Win for memory consumption overall
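Work stealing can be simulated serially (the `expand` callback, victim choice, and worker count are illustrative). The key access pattern: each worker pops from the back of its own deque, while a thief steals from the front of a victim's, keeping owner and thief mostly out of each other's way.

```python
from collections import deque

def work_steal(initial, expand, num_workers=4):
    """Round-robin simulation of deque-based work stealing.
    `expand(task)` returns the new tasks a task generates."""
    deques = [deque() for _ in range(num_workers)]
    for i, t in enumerate(initial):
        deques[i % num_workers].append(t)
    done = 0
    while any(deques):
        for w in range(num_workers):
            if deques[w]:
                task = deques[w].pop()  # owner takes from the back
            else:
                victims = [v for v in range(num_workers) if deques[v]]
                if not victims:
                    continue
                task = deques[victims[0]].popleft()  # thief steals the front
            done += 1
            deques[w].extend(expand(task))  # new work stays local
    return done
```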
Big-Picture Questions
• Relative cost of computation vs. overhead
• Frequency of global communication
• Cost of global communication
• Need for communication between GPU cores?
• Would permit efficient in-kernel work stealing
DS Research Challenges

• String-based algorithms
• Building suffix trees (DNA sequence alignment)
• Graphs (vs. sparse matrix) and trees
• Dynamic programming
• Neighbor queries (kNN)
• Tuning
• True “parallel” data structures (not parallel versions of serial ones)?
• Incremental data structures
Thanks to ...
• Nathan Bell, Michael Garland, David Luebke, Shubho Sengupta, and Dan Alcantara for helpful comments and slide material.
• Funding agencies: Department of Energy (SciDAC Institute for Ultrascale Visualization, Early Career Principal Investigator Award), NSF, BMW, NVIDIA, HP, Intel, UC MICRO, Rambus
Bibliography

• Dan A. Alcantara, Andrei Sharf, Fatemeh Abbasinejad, Shubhabrata Sengupta, Michael Mitzenmacher, John D. Owens, and Nina Amenta. Real-Time Parallel Hashing on the GPU. ACM Transactions on Graphics, 28(5), December 2009.
• Nathan Bell and Michael Garland. Implementing Sparse Matrix-Vector Multiplication on Throughput-Oriented Processors. Proceedings of IEEE/ACM Supercomputing, November 2009.
• Daniel Cederman and Philippas Tsigas, On Dynamic Load Balancing on Graphics Processors. Graphics Hardware 2008, June 2008.
• C. Lauterbach, M. Garland, S. Sengupta, D. Luebke, and D. Manocha. Fast BVH Construction on GPUs. Computer Graphics Forum (Proceedings of Eurographics 2009), 28(2), April 2009.
• Christian Lauterbach, Qi Mo, and Dinesh Manocha. Fast Hierarchy Operations on GPU Architectures. Tech report, April 2009, UNC Chapel Hill.
• Shubhabrata Sengupta, Mark Harris, Yao Zhang, and John D. Owens. Scan Primitives for GPU Computing. In Graphics Hardware 2007, August 2007.
• Kun Zhou, Qiming Hou, Rui Wang, and Baining Guo. Real-Time KD-Tree Construction on Graphics Hardware. ACM Transactions on Graphics, 27(5), December 2008.