Performance Optimization Getting your programs to run faster CS 691.
Uploaded by: charleen-williamson
Performance Optimization
Getting your programs to run faster
CS 691
Why optimize
• Better turn-around on jobs
• Run more programs/scenarios
• Release resources to other applications
• You want the job to finish before you retire
Ways to get more performance
• Run on bigger, faster hardware (clock speed, more memory, …)
• Tweak your algorithm
• Optimize your code
Loop Unrolling
• Converting passes of a loop into in-line streams of code
• Useful when loops do calculations on data in arrays
• Unrolling can take advantage of pipelined processing units in processors
• Compiler may preload operands into CPU registers
Loop Unrolling – disadvantages
• May be limited by the number of floating-point registers
  Pentium III: 8
  Pentium 4: 8
  Itanium: 128
Loop Unrolling – simple example
Loop
do i=1,n
   a(i) = b(i) + x*c(i)
enddo
Unrolled Loop (assumes n is a multiple of 4)
do i=1,n,4
   a(i)   = b(i)   + x*c(i)
   a(i+1) = b(i+1) + x*c(i+1)
   a(i+2) = b(i+2) + x*c(i+2)
   a(i+3) = b(i+3) + x*c(i+3)
enddo
Loop Unrolling – simple example
Performance, rolled
• P3 550 MHz: 13 Mflops
• Itanium: 30 Mflops
Performance, unrolled
• P3 550 MHz: 30 Mflops
• Itanium: 107 Mflops
*from: LCI and NCSA
Loop Unrolling
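Loop
int a[100];
int i;
for (i = 0; i < 100; i++) {
    a[i] = a[i] * 2;
}
Unrolled Loop (unrolled by 5; 100 is a multiple of 5)
int a[100];
int i;
for (i = 0; i < 100; i += 5) {
    a[i]   = a[i]   * 2;
    a[i+1] = a[i+1] * 2;
    a[i+2] = a[i+2] * 2;
    a[i+3] = a[i+3] * 2;
    a[i+4] = a[i+4] * 2;
}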
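The unrolled loops on these slides only work because the trip count is a multiple of the unroll factor. A minimal sketch of the general case follows, assuming the computation from the Fortran example (a = b + x*c) and an unroll factor of 4. The function name `axpy_unrolled4` is made up for illustration; it is not from the slides.

```c
#include <stddef.h>

/* Computes a[i] = b[i] + x * c[i] for i in [0, n), unrolled by 4.
   Hypothetical helper for illustration only. */
void axpy_unrolled4(float *a, const float *b, const float *c,
                    float x, size_t n)
{
    size_t i = 0;
    size_t limit = n - (n % 4);   /* largest multiple of 4 that is <= n */

    /* Main unrolled body: four independent updates per pass. */
    for (; i < limit; i += 4) {
        a[i]     = b[i]     + x * c[i];
        a[i + 1] = b[i + 1] + x * c[i + 1];
        a[i + 2] = b[i + 2] + x * c[i + 2];
        a[i + 3] = b[i + 3] + x * c[i + 3];
    }

    /* Cleanup loop: handles the 0 to 3 elements left over. */
    for (; i < n; i++) {
        a[i] = b[i] + x * c[i];
    }
}
```

With n = 7, the unrolled body runs once (elements 0 to 3) and the cleanup loop handles elements 4 to 6.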
Loop unrolling
Loop
int a[10][10];
int i, j;
for (i = 0; i < 10; i++) {
    for (j = 0; j < 10; j++) {
        a[i][j] = a[i][j] * 2;
    }
}
Unrolled Loop (inner loop fully unrolled)
int a[10][10];
int i;
for (i = 0; i < 10; i++) {
    a[i][0] = a[i][0] * 2; a[i][1] = a[i][1] * 2;
    a[i][2] = a[i][2] * 2; a[i][3] = a[i][3] * 2;
    a[i][4] = a[i][4] * 2; a[i][5] = a[i][5] * 2;
    a[i][6] = a[i][6] * 2; a[i][7] = a[i][7] * 2;
    a[i][8] = a[i][8] * 2; a[i][9] = a[i][9] * 2;
}
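Loop unrolling – Vector Dot Product
Loop
float a[100];
float b[100];
float z = 0.0f;   /* the accumulator must start at zero */
int i;
for (i = 0; i < 100; i++) {
    z = z + a[i] * b[i];
}
Unrolled Loop (unrolled by 2)
float a[100];
float b[100];
float z = 0.0f;
int i;
for (i = 0; i < 100; i += 2) {
    z = z + a[i] * b[i];
    z = z + a[i+1] * b[i+1];
}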
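In the unrolled dot product every statement still updates the same variable z, so each add has to wait for the previous one. A common variant, sketched here as an assumption and not taken from the slides, keeps one partial sum per unrolled statement and combines them at the end. The function name `dot_unrolled2` is made up for illustration. Splitting the sum changes the order of the floating-point additions, so the result can differ from the rolled loop in the last bits.

```c
#include <stddef.h>

/* Dot product unrolled by 2 with two independent partial sums.
   Hypothetical helper for illustration only. */
float dot_unrolled2(const float *a, const float *b, size_t n)
{
    float z0 = 0.0f;   /* partial sum over even indices */
    float z1 = 0.0f;   /* partial sum over odd indices  */
    size_t i = 0;

    /* Two independent accumulations per pass. */
    for (; i + 1 < n; i += 2) {
        z0 += a[i]     * b[i];
        z1 += a[i + 1] * b[i + 1];
    }

    /* Leftover element when n is odd. */
    if (i < n) {
        z0 += a[i] * b[i];
    }

    return z0 + z1;
}
```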
Unrolling Loops
You can have the compiler do it automatically
Unrolling Loops – compiler options
• GNU Compilers
  -funroll-loops
  -funroll-all-loops (not recommended)
• PGI Compilers
  -Munroll
  -Munroll=c:N
  -Munroll=n:M
Unrolling Loops – Compiler Options
• Intel Compilers
  -unrollM (up to M times)
  -unroll
Taking Memory in Order
• Optimizing the use of cache
• Row major order vs column major order, for an array indexed a(row, column)
  row major: a(1,1), a(1,2), a(1,3), a(2,1), a(2,2),…
  column major: a(1,1), a(2,1), a(3,1), a(1,2), a(2,2),…
Taking Memory in Order
Remember: C and Fortran store arrays in opposite order
• C: row major
• Fortran: column major
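The two storage orders can be written as offset formulas. This is a small sketch using zero-based indices; the function names are made up for illustration. It shows where element (i, j) of a rows × cols array lands in memory under each convention.

```c
#include <assert.h>
#include <stddef.h>

/* Row major (C): a whole row is contiguous, so stepping j moves
   one element and stepping i moves cols elements. */
size_t row_major_offset(size_t i, size_t j, size_t rows, size_t cols)
{
    assert(i < rows && j < cols);
    return i * cols + j;
}

/* Column major (Fortran): a whole column is contiguous, so stepping i
   moves one element and stepping j moves rows elements. */
size_t col_major_offset(size_t i, size_t j, size_t rows, size_t cols)
{
    assert(i < rows && j < cols);
    return j * rows + i;
}
```

For a 3 × 4 array, element (1, 0) sits at offset 4 in row major order and at offset 1 in column major order.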
Taking Memory in Order
[Figure: memory order of array elements, one panel for C and one for Fortran]
Taking Memory in Order
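Inner loop over j (second index, strided access in Fortran)
do i=1,m
   do j=1,n
      a(i,j)=b(i,j)+c(i)
   end do
end do
• loop time: 23.42
• loop runs at 4.48 Mflops
Inner loop over i (first index, stride 1 in Fortran)
do j=1,n
   do i=1,m
      a(i,j)=b(i,j)+c(i)
   end do
end do
• loop time: 2.80
• loop runs at 37.48 Mflops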
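In C the preferred order is the reverse of the Fortran case: the last index should vary fastest. Here is a rough sketch of the same computation in C. The size `DIM` and both function names are assumptions made for illustration. Both functions produce identical results; only the memory access pattern differs.

```c
#include <stddef.h>

#define DIM 64   /* hypothetical array size for this sketch */

/* Inner loop over j (the last index): consecutive iterations touch
   adjacent memory, because C stores each row contiguously. */
void add_inner_j(float a[DIM][DIM], float b[DIM][DIM], const float c[DIM])
{
    for (size_t i = 0; i < DIM; i++) {
        for (size_t j = 0; j < DIM; j++) {
            a[i][j] = b[i][j] + c[i];
        }
    }
}

/* Inner loop over i (the first index): consecutive iterations are
   DIM floats apart, which makes poorer use of each cache line. */
void add_inner_i(float a[DIM][DIM], float b[DIM][DIM], const float c[DIM])
{
    for (size_t j = 0; j < DIM; j++) {
        for (size_t i = 0; i < DIM; i++) {
            a[i][j] = b[i][j] + c[i];
        }
    }
}
```

How large the gap is depends on the array size, the cache, and the compiler, so it is worth timing both versions on the target machine.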
Floating Point Division
• FP division is very expensive in terms of processor time
• 20-60 clock cycles to compute
• Usually not pipelined
• FP division required by IEEE “rules”
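Floating point division – use reciprocal
Division inside the loop
float a[100];
int i;
for (i = 0; i < 100; i++) {
    a[i] = a[i] / 2;
}
Multiplication by the reciprocal
float a[100];
float denom;
int i;
denom = 1.0f / 2.0f;   /* 1/2 would be integer division and give 0 */
for (i = 0; i < 100; i++) {
    a[i] = a[i] * denom;
}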
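A sketch of the reciprocal trick with a divisor known only at run time follows; the function names are made up for illustration. Note that in C the expression 1/2 is integer division and evaluates to 0, so the reciprocal has to be computed in floating point. For divisors that are powers of two the two versions agree exactly. For other divisors the reciprocal is rounded first, so results may differ in the last bit, which is why compilers only make this substitution when IEEE conformance is relaxed.

```c
#include <stddef.h>

/* Baseline: one floating-point division per element. */
void divide_all(float *a, size_t n, float divisor)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] / divisor;
    }
}

/* One division up front, then one multiplication per element. */
void scale_by_reciprocal(float *a, size_t n, float divisor)
{
    const float recip = 1.0f / divisor;   /* floating-point, not 1/2 */
    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] * recip;
    }
}
```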
Floating Point Division
Compiler options for IEEE compatibility
• PGI Compilers
  -Knoieee
• Intel Compilers
  -mp
• GNU Compilers
  can’t do
Floating Point Division
• Compilers can’t optimize if divisor is not scalar
• Breaks IEEE “rules”
• May impact portability
Function Inlining
• Build functions/subroutines in as inline parts of the program's code, rather than as separate functions/subroutines
• Minimizes function calls (and management of…)
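At the source level, C99 lets a small helper be offered to the compiler for inlining with `static inline`. The sketch below uses made-up names (`axpy_term`, `axpy`). The keyword is only a hint, and whether the call is actually expanded in place depends on the compiler and the optimization level.

```c
#include <stddef.h>

/* Small helper that is a good inlining candidate: the body is cheaper
   than the call overhead it would otherwise incur. */
static inline float axpy_term(float b, float x, float c)
{
    return b + x * c;
}

/* The call inside the loop can be replaced by the helper's body,
   which removes one function call per element. */
void axpy(float *a, const float *b, const float *c, float x, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        a[i] = axpy_term(b[i], x, c[i]);
    }
}
```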
Function Inlining
Compile with (PGI compilers):
• -Minline
  compiler tries to inline what it can (functions that meet the compiler's criteria)
• -Minline=except:func
  excludes func from inlining
• -Minline=func
  inlines only func
Function Inlining
…Compile with:
• -Minline=myfile.lib
  inlines functions from the inline library file
• -Minline=levels:n
  inlines functions up to n levels of calls (default is usually 1)
MPI Tuning
• Minimize messages
• Pointers/counts
• MPI derived datatypes
• MPI_Pack/MPI_Unpack
• Using shared memory for message passing
  #PBS -l nodes=6:ppn=1 … but…
  #PBS -l nodes=3:ppn=2 … is better.
Compiler optimizations
• -O0: no optimization
• -O1: local optimization, register allocation
• -O2: local/limited global optimization
• -O3: aggressive global optimization
• -Munroll: loop unrolling
• -Mvect: vectorization
• -Minline: function inlining
gcc Compiler Optimizations
See: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html