Jeremy Johnson Dept. of Computer Science Drexel University
Automatic Performance Tuning
Jeremy Johnson
Dept. of Computer Science
Drexel University

Outline
• Scientific Computation Kernels
  – Matrix Multiplication
  – Fast Fourier Transform (FFT)
• Automated Performance Tuning (Proc. IEEE, Vol. 93, No. 2, Feb. 2005)
  – ATLAS
  – FFTW
  – SPIRAL

Matrix Multiplication and the FFT

$$C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}$$

$$y_k = \sum_{l=0}^{N-1} \omega_N^{kl}\, x_l$$

With $N = RS$, $k = k_1 + R k_2$, and $l = S l_1 + l_2$ ($0 \le k_1, l_1 < R$, $0 \le k_2, l_2 < S$):

$$y_{k_1 + R k_2} = \sum_{l_2=0}^{S-1} \omega_S^{k_2 l_2}\, \omega_N^{k_1 l_2} \sum_{l_1=0}^{R-1} \omega_R^{k_1 l_1}\, x_{S l_1 + l_2}$$
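
The split of a size-N DFT into R-point and S-point DFTs can be checked numerically. The following is a minimal Python sketch, not code from the talk. It assumes ω_N = exp(−2πi/N) (the slide does not fix the sign) and the index maps k = k1 + R·k2, l = S·l1 + l2; the function names `dft` and `dft_split` are illustrative.

```python
import cmath


def dft(x):
    """Naive O(N^2) DFT: y[k] = sum_l w_N^(k*l) * x[l], w_N = exp(-2*pi*i/N) (assumed sign)."""
    n = len(x)
    w = cmath.exp(-2j * cmath.pi / n)
    return [sum(w ** (k * l) * x[l] for l in range(n)) for k in range(n)]


def dft_split(x, r, s):
    """DFT of size N = r*s using k = k1 + r*k2 and l = s*l1 + l2."""
    n = r * s
    assert len(x) == n
    w_n = cmath.exp(-2j * cmath.pi / n)
    w_r = cmath.exp(-2j * cmath.pi / r)
    w_s = cmath.exp(-2j * cmath.pi / s)
    # Stage 1: for every l2, an r-point DFT over the strided inputs x[s*l1 + l2].
    inner = [[sum(w_r ** (k1 * l1) * x[s * l1 + l2] for l1 in range(r))
              for l2 in range(s)]
             for k1 in range(r)]
    # Stage 2: multiply by the twiddle w_N^(k1*l2), then an s-point DFT over l2.
    y = [0j] * n
    for k1 in range(r):
        for k2 in range(s):
            y[k1 + r * k2] = sum(w_s ** (k2 * l2) * w_n ** (k1 * l2) * inner[k1][l2]
                                 for l2 in range(s))
    return y
```

Applying the same split recursively to the R-point and S-point transforms is what brings the operation count down from O(N^2) toward O(N log N).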

Basic Linear Algebra Subprograms (BLAS)
• Level 1 – vector-vector, O(n) data, O(n) operations
• Level 2 – matrix-vector, O(n^2) data, O(n^2) operations
• Level 3 – matrix-matrix, O(n^2) data, O(n^3) operations = data reuse = locality!
• LAPACK built on top of BLAS (level 3)
  – Blocking (for the memory hierarchy) is the single most important optimization for linear algebra algorithms
• GEMM – General Matrix Multiplication
  – SUBROUTINE DGEMM (TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
  – C := alpha*op( A )*op( B ) + beta*C,
  – where op(X) = X or X'
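
Blocking can be illustrated with a tiled matrix multiplication. This is a minimal Python sketch for square list-of-lists matrices, not ATLAS or LAPACK code; the tile size `nb` and both function names are illustrative assumptions.

```python
def matmul_naive(a, b):
    """Reference triple loop: C = A*B for square list-of-lists matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def matmul_blocked(a, b, nb):
    """Same product, computed tile by tile so each nb x nb tile is reused while it is hot."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, nb):
        for j0 in range(0, n, nb):
            for k0 in range(0, n, nb):
                # Multiply one tile of A by one tile of B into one tile of C.
                for i in range(i0, min(i0 + nb, n)):
                    for k in range(k0, min(k0 + nb, n)):
                        aik = a[i][k]
                        for j in range(j0, min(j0 + nb, n)):
                            c[i][j] += aik * b[k][j]
    return c
```

The arithmetic is unchanged; only the order of the O(n^3) operations changes, so that the O(n^2) data is reused from fast memory. In Python the effect on runtime is negligible, so the sketch shows the loop structure only.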

DGEMM
…
*
*           Form  C := alpha*A*B + beta*C.
*
            DO 90, J = 1, N
               IF( BETA.EQ.ZERO )THEN
                  DO 50, I = 1, M
                     C( I, J ) = ZERO
   50             CONTINUE
               ELSE IF( BETA.NE.ONE )THEN
                  DO 60, I = 1, M
                     C( I, J ) = BETA*C( I, J )
   60             CONTINUE
               END IF
               DO 80, L = 1, K
                  IF( B( L, J ).NE.ZERO )THEN
                     TEMP = ALPHA*B( L, J )
                     DO 70, I = 1, M
                        C( I, J ) = C( I, J ) + TEMP*A( I, L )
   70                CONTINUE
                  END IF
   80          CONTINUE
   90       CONTINUE
…
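
For readers less familiar with fixed-form Fortran, the same loop nest can be transliterated into Python. This is a sketch of the op(X) = X case only, using 0-based list-of-lists matrices; the name `gemm_reference` is illustrative and not part of BLAS.

```python
def gemm_reference(alpha, a, b, beta, c):
    """In-place C := alpha*A*B + beta*C with the j-l-i loop order of the Fortran listing."""
    m, k, n = len(a), len(b), len(b[0])
    for j in range(n):
        # Scale column j of C by beta (with shortcuts for beta == 0 and beta == 1).
        if beta == 0.0:
            for i in range(m):
                c[i][j] = 0.0
        elif beta != 1.0:
            for i in range(m):
                c[i][j] = beta * c[i][j]
        # Add alpha * B(l, j) times column l of A to column j of C.
        for l in range(k):
            if b[l][j] != 0.0:
                temp = alpha * b[l][j]
                for i in range(m):
                    c[i][j] += temp * a[i][l]
    return c
```

The innermost loop walks down a column, which matches Fortran's column-major storage; this reference version does no blocking.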

Matrix Multiplication Performance
[Figure: matrix multiplication performance plots, two slides]

Numerical Recipes
• Numerical Recipes in C – The Art of Scientific Computing, 2nd Ed.
  – William H. Press, Saul A. Teukolsky, William T. Vetterling, Brian P. Flannery, Cambridge University Press, 1992.
• “This book is unique, we think, in offering, for each topic considered, a certain amount of general discussion, a certain amount of analytical mathematics, a certain amount of discussion of algorithmics, and (most important) actual implementations of these ideas in the form of working computer routines.”
• 1. Preliminaries
• 2. Solution of Linear Algebraic Equations
• …
• 12. Fast Fourier Transform
• 19. Partial Differential Equations
• 20. Less-Numerical Algorithms

four1
[Figure: slide image, four1 listing]

four1 (cont)
[Figure: slide image, four1 listing continued]
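
The four1 listing itself is not reproduced here. As a stand-in, the following Python sketch shows the general shape of a textbook iterative radix-2 FFT (bit-reversal reordering followed by log2(N) butterfly passes). It is not the Numerical Recipes code: the name `fft_radix2`, the complex-list interface, and the sign convention exp(−2πi/N) are all assumptions made for illustration.

```python
import cmath


def fft_radix2(x):
    """Iterative radix-2 FFT of a list whose length is a power of two (returns a new list)."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    a = list(x)
    # Bit-reversal permutation: element i swaps with the index whose bits are reversed.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            a[i], a[j] = a[j], a[i]
    # Butterfly passes over sub-transforms of size 2, 4, ..., n.
    size = 2
    while size <= n:
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                u = a[start + k]
                v = a[start + k + half] * w
                a[start + k] = u + v
                a[start + k + half] = u - v
                w *= w_step
        size *= 2
    return a
```

A fixed routine of this kind uses one algorithm and one loop structure for every machine, which is the baseline that the tuned libraries discussed in the rest of the talk are compared against.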

FFT Performance
[Figure: FFT performance plot]

ATLAS Architecture and Search Parameters
• NB – L1 data cache tile size
• NCNB – L1 data cache tile size for non-copying version
• MU, NU – Register tile size
• KU – Unroll factor for k' loop
• LS – Latency for computation scheduling
• FMA – 1 if fused multiply-add available, 0 otherwise
• FF, IF, NF – Scheduling of loads
Yotov et al., Is Search Really Necessary to Generate High-Performance BLAS?, Proc. IEEE, Vol. 93, No. 2, Feb. 2005

ATLAS Code Generation
• Optimization for locality
  – Cache tiling, Register tiling

ATLAS Code Generation
• Register Tiling
  – MU + NU + MU×NU ≤ NR
• Loop unrolling
• Scalar replacement
• Add/mul interleaving
• Loop skewing
• C_{i''j''} = C_{i''j''} + A_{i''k''} * B_{k''j''}
[Figure: register tiling within an NB×NB block: an MU×K panel of A and a K×NU panel of B update an MU×NU tile of C]
[Figure: skewed schedule: mul_1, mul_2, …, mul_Ls, add_1, mul_{Ls+1}, add_2, …, mul_{MU×NU}, add_{MU×NU−Ls+2}, …, add_{MU×NU}]
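
The register-tiling constraint MU + NU + MU×NU ≤ NR bounds the tile shapes worth trying. A minimal Python sketch that enumerates them follows. The value NR = 16 in the usage note and the ranking by multiply-adds per load are illustrative assumptions; ATLAS itself picks the tile by timing generated code.

```python
def feasible_register_tiles(nr):
    """All (MU, NU) with MU + NU + MU*NU <= nr: MU values of A, NU of B, MU*NU of C in registers."""
    tiles = []
    for mu in range(1, nr + 1):
        for nu in range(1, nr + 1):
            if mu + nu + mu * nu <= nr:
                tiles.append((mu, nu))
    return tiles


def madds_per_load(mu, nu):
    """Per k-step, an MU x NU tile performs MU*NU multiply-adds for MU + NU loads."""
    return (mu * nu) / (mu + nu)


def most_reuse(nr):
    """Feasible tile with the highest multiply-add-to-load ratio (a heuristic, not ATLAS's rule)."""
    return max(feasible_register_tiles(nr), key=lambda t: madds_per_load(*t))
```

For example, with a hypothetical NR = 16 the tile (3, 3) is feasible (3 + 3 + 9 = 15) while (4, 3) is not (4 + 3 + 12 = 19), and near-square tiles give the most reuse per load.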

ATLAS Search
• Estimate Machine Parameters (C1, NR, FMA, LS)
  – Used to bound search
• Orthogonal Line Search (fix all parameters except one and search for the optimal value of this parameter)
  – Search order
    • NB
    • MU, NU
    • KU
    • LS
    • FF, IF, NF
    • NCNB
    • Cleanup codes
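
Orthogonal line search can be sketched in a few lines of Python. This is an illustration of the idea, not ATLAS's implementation: `toy_cost` is a made-up stand-in for a measured runtime, the candidate ranges are hypothetical, and MU and NU are tuned one after the other here for simplicity.

```python
def orthogonal_line_search(cost, start, candidates, order):
    """Tune one parameter at a time, holding the others at their current best values."""
    best = dict(start)
    for name in order:
        best_value, best_cost = best[name], cost(best)
        for value in candidates[name]:
            trial = dict(best)
            trial[name] = value
            trial_cost = cost(trial)
            if trial_cost < best_cost:
                best_value, best_cost = value, trial_cost
        best[name] = best_value
    return best


def toy_cost(p):
    """Hypothetical cost surface; a real search would time generated code instead."""
    return ((p["NB"] - 48) ** 2 + 10 * (p["MU"] - 4) ** 2
            + 10 * (p["NU"] - 2) ** 2 + (p["KU"] - 8) ** 2)


START = {"NB": 16, "MU": 1, "NU": 1, "KU": 1}
CANDIDATES = {
    "NB": list(range(16, 81, 16)),
    "MU": list(range(1, 7)),
    "NU": list(range(1, 7)),
    "KU": [1, 2, 4, 8, 16],
}
ORDER = ["NB", "MU", "NU", "KU"]
```

Calling `orthogonal_line_search(toy_cost, START, CANDIDATES, ORDER)` evaluates the cost once per candidate value rather than once per combination, which is why the search order matters when parameters interact.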

Using FFTW
[Figure: slide image]

FFTW Infrastructure
• Use dynamic programming to find an efficient way to combine code sequences.
• Combine code sequences using the divide-and-conquer structure of the FFT.
• Codelets (optimized code sequences for small FFTs)
• Plan encodes the divide-and-conquer strategy and stores “twiddle factors”.
• Executor computes the FFT of given data using the algorithm described by the plan.
[Figure: right-recursive plan tree: 15 → (3, 12), 12 → (4, 8), 8 → (3, 5)]
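
The dynamic-programming idea can be sketched over right-recursive trees, with sizes written as log2 exponents so that a split of e into a + b means 2^e = 2^a · 2^b. This is a toy model in Python, not FFTW's planner: `CODELET_MAX`, `codelet_cost`, and the cost formula are hypothetical stand-ins for measured runtimes.

```python
from functools import lru_cache

CODELET_MAX = 5  # hypothetical: exponents 1..5 are handled directly by a codelet


def codelet_cost(e):
    """Made-up cost of one codelet call of size 2**e."""
    return float(e * 2 ** e)


@lru_cache(maxsize=None)
def best_plan(e):
    """Return (cost, tree) for size 2**e; tree is an int leaf or a pair (leaf, subtree)."""
    best = None
    if e <= CODELET_MAX:
        best = (codelet_cost(e), e)
    # Right-recursive split: the left factor is a codelet, the right factor is planned recursively.
    for a in range(1, min(CODELET_MAX, e - 1) + 1):
        b = e - a
        sub_cost, sub_tree = best_plan(b)
        # 2**b codelet calls, 2**a recursive sub-transforms, plus one twiddle pass.
        cost = 2 ** b * codelet_cost(a) + 2 ** a * sub_cost + 2 ** e
        if best is None or cost < best[0]:
            best = (cost, (a, sub_tree))
    return best


def leaves(tree):
    """Flatten a plan tree into its list of codelet exponents."""
    if isinstance(tree, int):
        return [tree]
    return [tree[0]] + leaves(tree[1])
```

Memoization is what makes this dynamic programming: the best plan for each smaller size is computed once and reused by every larger size that splits down to it. In this encoding the tree from the slide would be written `(3, (4, (3, 5)))`.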

SPIRAL system
[Figure: SPIRAL system block diagram]
• User specifies a DSP transform, goes for a coffee (or an espresso for small transforms), and comes back to a platform-adapted implementation
• Formula Generator (the mathematician): produces a fast algorithm as an SPL formula
• SPL Compiler (the expert programmer): produces C/Fortran/SIMD code
• Search Engine: uses the runtime on the given platform; controls algorithm generation and implementation options

DSP Algorithms: Example 4-point DFT
Cooley/Tukey FFT (size 4):

$$\begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & i & -1 & -i \\ 1 & -1 & 1 & -1 \\ 1 & -i & -1 & i \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & i \end{bmatrix} \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & -1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}$$

$$\mathrm{DFT}_4 = (\mathrm{DFT}_2 \otimes I_2) \cdot T^4_2 \cdot (I_2 \otimes \mathrm{DFT}_2) \cdot L^4_2$$

Legend: DFT_2 = Fourier transform; I_2 = identity; L^4_2 = permutation; T^4_2 = diagonal matrix (twiddles); ⊗ = Kronecker product
• algorithms reduce arithmetic cost O(n^2) → O(n log(n))
• product of structured sparse matrices
• mathematical notation exhibits structure
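
The size-4 factorization DFT_4 = (DFT_2 ⊗ I_2) · T^4_2 · (I_2 ⊗ DFT_2) · L^4_2 can be verified by multiplying the four sparse factors. A minimal pure-Python sketch follows, assuming the twiddle entry i (that is, ω_4 = i); the helper names are illustrative.

```python
def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def kron(a, b):
    """Kronecker product: block (i, j) of the result is a[i][j] * b."""
    p, q = len(b), len(b[0])
    return [[a[i // p][j // q] * b[i % p][j % q] for j in range(len(a[0]) * q)]
            for i in range(len(a) * p)]


I2 = [[1, 0], [0, 1]]
DFT2 = [[1, 1], [1, -1]]
T42 = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1j]]  # twiddle diagonal
L42 = [[1, 0, 0, 0], [0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]]   # stride permutation


def dft4_from_factors():
    """Multiply out (DFT2 (x) I2) * T42 * (I2 (x) DFT2) * L42."""
    return matmul(matmul(matmul(kron(DFT2, I2), T42), kron(I2, DFT2)), L42)


def dft4_direct():
    """Definition: entry (k, l) is w^(k*l) with w = i."""
    return [[1j ** (k * l) for l in range(4)] for k in range(4)]
```

Each factor has at most two nonzeros per row, so applying the factors one after another costs far fewer operations than a dense 4×4 matrix-vector product would at larger sizes.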

Algorithms = Ruletrees = Formulas

R1: $\mathrm{DCT}^{(II)}_n \to P \cdot (\mathrm{DCT}^{(II)}_{n/2} \oplus \mathrm{DCT}^{(IV)}_{n/2}) \cdot (F_2 \otimes I_{n/2})$
R3: $\mathrm{DCT}^{(II)}_2 \to \frac{1}{\sqrt{2}} \cdot F_2$
R6: $\mathrm{DCT}^{(IV)}_n \to P \cdot \mathrm{DCT}^{(II)}_n \cdot S$

[Figure: ruletree for DCT^(II)_8, expanded by rules R1, R3, R4, and R6 through DCT^(II)_4, DCT^(IV)_4, DCT^(II)_2, DCT^(IV)_2, and DST^(II)_2 down to F_2]

Generated DFT Vector Code: Pentium 4, SSE
[Figure: DFT 2^n single precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0. x-axis: n = 4 to 13; y-axis: (pseudo) gflop/s, 0 to 7. Series: Spiral SSE, Intel MKL interl., FFTW 2.1.3, Spiral C, Spiral C vect, SIMD-FFT. Annotation on the plot: hand-tuned vendor assembly code]
• speedups (to C code) up to factor of 3.1
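
The slide does not define "(pseudo) gflop/s". A common convention in FFT benchmarking, assumed here, is to charge every size-N transform a nominal 5·N·log2(N) operations regardless of the algorithm actually used. A minimal Python sketch under that assumption, with an illustrative function name:

```python
def pseudo_gflops(n_log2, runtime_seconds):
    """Nominal rate for a DFT of size 2**n_log2, assuming 5*N*log2(N) operations per transform."""
    size = 2 ** n_log2
    nominal_ops = 5.0 * size * n_log2
    return nominal_ops / runtime_seconds / 1e9
```

Because the operation count is fixed by the size alone, a higher pseudo rate always means a shorter runtime, which makes implementations with different real operation counts directly comparable.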

Best DFT Trees, size 2^10 = 1024
[Figure: twelve best-found ruletrees with root 10, one per code type (scalar, C vect, SIMD) and platform/datatype (Pentium 4 float, Pentium 4 double, Pentium III float, AthlonXP float)]
• trees platform/datatype dependent

Crosstiming of best trees on Pentium 4
[Figure: DFT 2^n single precision, runtime of the best trees found for other platforms. x-axis: n = 4 to 13; y-axis: slowdown factor w.r.t. best, 1.00 to 5.00. Series: Pentium 4 SSE, Pentium 4 SSE2, AthlonXP SSE, PentiumIII SSE, Pentium 4 float]
• software adaptation is necessary