Optimized LU-decomposition with Full Pivot for Small Batched Matrices
S3069
Ian Wainwright
High Performance Consulting Sweden
Background
Based on our work for GTC 2012: a 10x speed-up over multi-threaded Intel MKL on a 4-core (8 HT) machine.
http://www.hpcsweden.se/files/CUDA-based_LU_Factorization.pdf
Aim
To investigate various techniques for batched matrix computations.
The motivation behind this work is to use our previous work on LU-decomposition as a use case to investigate optimization techniques for Kepler.
Outline
1. LU-decomposition
2. Implementations
3. Performance gains
4. Conclusions
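LU-decomposition
The idea is to transform A, where Ax=b, into an equivalent triangular system such that A=LU:

$$
\underbrace{\begin{pmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44}
\end{pmatrix}}_{A}
=
\underbrace{\begin{pmatrix}
1 & 0 & 0 & 0 \\
l_{21} & 1 & 0 & 0 \\
l_{31} & l_{32} & 1 & 0 \\
l_{41} & l_{42} & l_{43} & 1
\end{pmatrix}}_{L}
\,
\underbrace{\begin{pmatrix}
u_{11} & u_{12} & u_{13} & u_{14} \\
0 & u_{22} & u_{23} & u_{24} \\
0 & 0 & u_{33} & u_{34} \\
0 & 0 & 0 & u_{44}
\end{pmatrix}}_{U}
$$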
LU-decomposition: Without Pivot
Eliminate values by subtracting the pivot line. The factors used are collected in L.

The pivot element is 1 (row 1, column 1). The multipliers for rows 2 and 3 are -2/1 and -3/1.

$$
A = \begin{pmatrix} 1 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 10 \end{pmatrix},
\qquad
L = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix}
$$

Adding -2 times row 1 to row 2 and -3 times row 1 to row 3 clears column 1. The factors 2 and 3 go into column 1 of L:

$$
A = \begin{pmatrix} 1 & 4 & 7 \\ 0 & -3 & -6 \\ 0 & -6 & -11 \end{pmatrix},
\qquad
L = \begin{pmatrix} 0 & 0 & 0 \\ 2 & 0 & 0 \\ 3 & 0 & 0 \end{pmatrix}
$$
LU-decomposition: Without Pivot
Second elimination step: the pivot element is now -3 (row 2, column 2). Row 3 holds -6 in that column, so the factor is -6/-3 = 2, and row 3 is eliminated by subtracting 2 times the pivot line:

$$
A = \begin{pmatrix} 1 & 4 & 7 \\ 0 & -3 & -6 \\ 0 & 0 & 1 \end{pmatrix},
\qquad
L = \begin{pmatrix} 0 & 0 & 0 \\ 2 & 0 & 0 \\ 3 & 2 & 0 \end{pmatrix}
$$

The multiplier must be shared with all elements of a row, which requires synchronization and data sharing in the inner loop.
LU-decomposition: Without Pivot
Result: with ones on the diagonal of L, the factorization A=LU is complete.

$$
U = \begin{pmatrix} 1 & 4 & 7 \\ 0 & -3 & -6 \\ 0 & 0 & 1 \end{pmatrix},
\qquad
L = \begin{pmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & 2 & 1 \end{pmatrix}
$$
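The elimination without pivoting can be sketched in a few lines of plain Python. This is only an illustrative serial version, not the GPU implementation from the talk; the function name `lu_no_pivot` and the list-of-rows layout are assumptions made for the example.

```python
def lu_no_pivot(a):
    """LU factorization without pivoting; returns (L, U) as lists of rows."""
    n = len(a)
    u = [[float(x) for x in row] for row in a]  # working copy that becomes U
    l = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n):                  # k indexes the pivot row and column
        pivot = u[k][k]
        if pivot == 0:
            raise ZeroDivisionError("zero pivot: pivoting is required")
        for i in range(k + 1, n):       # rows below the pivot
            m = u[i][k] / pivot         # one multiplier, shared by the whole row
            l[i][k] = m
            for j in range(k, n):
                u[i][j] -= m * u[k][j]
    return l, u


L, U = lu_no_pivot([[1, 4, 7], [2, 5, 8], [3, 6, 10]])
print(L)  # [[1.0, 0.0, 0.0], [2.0, 1.0, 0.0], [3.0, 2.0, 1.0]]
print(U)  # [[1.0, 4.0, 7.0], [0.0, -3.0, -6.0], [0.0, 0.0, 1.0]]
```

The inner loop over `j` is the part that a thread-per-column mapping would run in parallel, which is why the multiplier `m` has to be visible to every thread working on that row.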
LU-decomposition: Full Pivot
Problem: with a pivot element of 0, the multipliers -2/0 and -3/0 diverge to -∞ and the elimination cannot proceed.

$$
A = \begin{pmatrix} 0 & 4 & 7 \\ 2 & 5 & 8 \\ 3 & 6 & 10 \end{pmatrix}
$$
LU-decomposition: Full Pivot
Solution: Perform Pivot
1. Find largest value in bottom submatrix.
2. Swap rows and columns to make largest value the pivot element.
3. Keep track of row and column pivot in each step.
4. Then perform operations over rows as usual.

After the swap, the pivot element is 10 and the multipliers are -8/10 and -7/10:

$$
A = \begin{pmatrix} 10 & 6 & 3 \\ 8 & 5 & 2 \\ 7 & 4 & 0 \end{pmatrix}
$$

LU-decomposition with full pivot is stable.
Finding the largest value necessitates a max-reduction in each column iteration, which requires synchronization and data sharing.
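The four pivoting steps can be sketched as a serial Python routine. It is a minimal illustration under stated assumptions, not the talk's kernel: it handles square matrices only, it interprets "largest value" as largest absolute value (the usual choice), and the name `lu_full_pivot` and the pivot-vector convention (`p[i]` and `q[j]` give the original row and column now sitting at position `i`, `j`) are chosen for this example.

```python
def lu_full_pivot(a):
    """Full-pivot LU of a square matrix.

    Returns (L, U, p, q) such that a[p[i]][q[j]] equals (L*U)[i][j].
    """
    n = len(a)
    u = [[float(x) for x in row] for row in a]  # working copy that becomes U
    l = [[0.0] * n for _ in range(n)]
    p = list(range(n))  # row pivot vector
    q = list(range(n))  # column pivot vector
    for k in range(n):
        # 1. Find the largest magnitude in the bottom-right submatrix.
        r, c = max(((i, j) for i in range(k, n) for j in range(k, n)),
                   key=lambda ij: abs(u[ij[0]][ij[1]]))
        if u[r][c] == 0.0:
            raise ValueError("matrix is singular")
        # 2. Swap rows and columns, and 3. keep track of both swaps.
        u[k], u[r] = u[r], u[k]
        l[k], l[r] = l[r], l[k]          # earlier multipliers move with the row
        p[k], p[r] = p[r], p[k]
        for row in u:
            row[k], row[c] = row[c], row[k]
        q[k], q[c] = q[c], q[k]
        # 4. Eliminate below the pivot as usual.
        l[k][k] = 1.0
        for i in range(k + 1, n):
            m = u[i][k] / u[k][k]
            l[i][k] = m
            for j in range(k, n):
                u[i][j] -= m * u[k][j]
    return l, u, p, q


A = [[0, 4, 7], [2, 5, 8], [3, 6, 10]]
L, U, p, q = lu_full_pivot(A)
print(p, q)  # row and column pivot vectors
```

For the example matrix, the first pivot search picks the value 10, matching the swapped matrix shown above.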
Implementations
Problem size:
• Matrices of at most 32 rows or columns of any shape, i.e. both rectangular and square.
• Batches of 10 000 matrices.
• Find L and U for matrix A such that PAQ=LU with full pivoting, where P and Q are the pivot vectors.
Implementations
Mapping of problem:
• Map one matrix to a warp.
• Map a column to a thread.
• One or more matrices per block.

Example with a 4x4 matrix, whose columns map to threads t.x 0, t.x 1, t.x 2 and t.x 3 of a warp:

$$
\begin{pmatrix} 1 & 3 & 4 & 2 \\ 3 & 7 & 4 & 9 \\ 7 & 5 & 16 & 4 \\ 2 & 3 & 4 & 1 \end{pmatrix}
$$

One matrix per block: the first matrix uses Block 0, Warp 0 (t.x 0 to 3), and the second matrix uses Block 1, Warp 0 (t.x 0 to 3).
Two matrices per block: both matrices use Block 0. The first uses Warp 0 (t.x 0 to 3) and the second uses Warp 1 (t.x 32 to 35).
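The mapping can be written down as simple index arithmetic. The sketch below is plain Python rather than kernel code, and the function name `locate` and its parameters are hypothetical; it only assumes the warp size of 32 threads and the "one warp per matrix, one thread per column" rule stated above.

```python
WARP_SIZE = 32  # threads per warp


def locate(block_idx, thread_idx, matrices_per_block):
    """Return (matrix index in the batch, column index) for one thread."""
    warp = thread_idx // WARP_SIZE   # warp within the block: one matrix each
    lane = thread_idx % WARP_SIZE    # lane within the warp: one column each
    matrix = block_idx * matrices_per_block + warp
    return matrix, lane


print(locate(1, 0, 1))   # (1, 0): one matrix per block, Block 1 holds matrix 1
print(locate(0, 33, 2))  # (1, 1): two per block, t.x 33 is column 1 of matrix 1
```

Lanes whose index is at or beyond the number of columns of the matrix would simply stay idle, which is why a 4x4 matrix only occupies t.x 0 to 3 of its warp.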
The number of matrices per block affects the occupancy.
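A rough upper bound shows why. With one warp per matrix, the number of matrices per block is also the number of warps per block, and a multiprocessor can only hold a limited number of resident blocks. The sketch below assumes the documented limits for compute capability 3.0 (the generation of the GTX 680): 16 resident blocks and 64 resident warps per multiprocessor. It ignores register and shared-memory limits, which can lower the real figure, so treat it as an estimate only; the function name is made up for the example.

```python
MAX_BLOCKS_PER_SM = 16  # assumed limit: compute capability 3.0
MAX_WARPS_PER_SM = 64   # assumed limit: compute capability 3.0


def occupancy_upper_bound(warps_per_block):
    """Fraction of warp slots that can be filled, from block and warp limits only."""
    resident_blocks = min(MAX_BLOCKS_PER_SM, MAX_WARPS_PER_SM // warps_per_block)
    return resident_blocks * warps_per_block / MAX_WARPS_PER_SM


for matrices_per_block in (1, 2, 4):
    print(matrices_per_block, occupancy_upper_bound(matrices_per_block))
# 1 0.25
# 2 0.5
# 4 1.0
```

Under these assumptions, one matrix per block can fill at most a quarter of the warp slots, because the block limit is reached first.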
Implementations
Aim: to investigate various techniques for batched matrix computations, namely synchronization, occupancy, cache configuration, and their interaction.

Synchronization and data sharing:
1. Shared-mem with __syncthreads().
2. Shared-mem using warp-synchronous programming.
3. Warp shuffle.

Multiple matrices per block:
1. 1 matrix per block.
2. 2 matrices per block.
3. 4 matrices per block.

Cache-config:
1. Prefer shared.
2. Prefer L1.
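To make the warp-shuffle variant more concrete, the following plain-Python sketch simulates the butterfly exchange pattern that a shuffle-based max-reduction typically uses: in each round, every lane reads the value held by the lane whose index differs in one bit, and keeps the larger of the two. This is a simulation written for illustration, not the code from the talk; `shfl_xor` here is a hypothetical stand-in for the hardware shuffle, and the list length is assumed to be a power of two.

```python
WARP_SIZE = 32


def shfl_xor(values, lane_mask):
    """Simulated XOR shuffle: lane i receives the value held by lane i ^ lane_mask."""
    return [values[i ^ lane_mask] for i in range(len(values))]


def warp_max(values):
    """Butterfly max-reduction; afterwards every lane holds the maximum."""
    vals = list(values)
    offset = len(vals) // 2
    while offset > 0:
        other = shfl_xor(vals, offset)                    # exchange between lanes
        vals = [max(v, o) for v, o in zip(vals, other)]   # each lane keeps the larger
        offset //= 2
    return vals


lanes = [(7 * i) % WARP_SIZE for i in range(WARP_SIZE)]  # one value per lane
print(warp_max(lanes)[0])  # 31
```

After log2(32) = 5 rounds every lane holds the same result, so no shared memory and no block-wide barrier are needed for the exchange. A pivot search would also carry the position of the maximum along with its value.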
Performance gains
• All benchmarks were performed on a standard GTX 680.
• All numbers are based on execution time for a batch of 10 000 matrices for various sizes.
• Execution time is in ms.
• To save time, we will only be looking at performance numbers for square matrices.
Performance gains: __syncthreads()
[Figure: execution time (ms) against matrix size for the __syncthreads() implementation, in two panels, Prefer L1 and Prefer shared, each with curves for 1, 2 and 4 matrices per block.]
Performance gains: warp-synchronous
[Figure: execution time (ms) against matrix size for the warp-synchronous implementation, in two panels, Prefer L1 and Prefer shared, each with curves for 1, 2 and 4 matrices per block.]
Performance gains
No surprise: we use shared memory for all data sharing and reduction. More shared memory gives us higher occupancy and therefore better performance.
Performance gains: warp-synchronous vs __syncthreads()
[Figure: execution time (ms) against matrix size, one panel for warp-synchronous and one for __syncthreads(), and a third panel showing the warp-synchronous speed-up over __syncthreads(); each panel has curves for 1, 2 and 4 matrices per block.]
Roughly a speed-up of 1.7x when using warp-synchronous.
Performance gains: warp shuffle
[Figure: execution time (ms) against matrix size for the warp shuffle implementation, in two panels, Prefer L1 and Prefer shared, each with curves for 1, 2 and 4 matrices per block.]
Performance gains: warp shuffle vs warp-synchronous
[Figure: execution time (ms) against matrix size, one panel for warp shuffle and one for warp-synchronous, and a third panel showing the warp shuffle speed-up over warp-synchronous; each panel has curves for 1, 2 and 4 matrices per block.]
Roughly a speed-up of 1.2x when using warp shuffle over warp-synchronous.
Performance gains: matrices per block
[Figure: execution time (ms) against matrix size for warp shuffle with Prefer L1, and a second panel showing the speed-up over 1 matrix per block; each panel has curves for 1, 2 and 4 matrices per block.]
2 matrices per block is most consistent.
We can gain 2x by mapping 2 or 4 matrices to each block. On average we gain 1.5x.
Conclusions
1. Warp-synchronous is 1.7x faster than __syncthreads().
2. Warp shuffle is 1.2x faster than warp-synchronous.
3. 2 matrices per block is 1.5x faster than 1 matrix per block.
4. Optimal cache config varies with implementation:
• Prefer shared if you use shared mem.
• Prefer L1 if you use warp shuffle.