Performance Optimization Getting your programs to run faster CS 691.

Performance Optimization

Getting your programs to run faster

CS 691

Why optimize

• Better turn-around on jobs
• Run more programs/scenarios
• Release resources to other applications
• You want the job to finish before you retire

Ways to get more performance

• Run on bigger, faster hardware: clock speed, more memory, …
• Tweak your algorithm
• Optimize your code

Loop Unrolling

• Converting passes of a loop into in-line streams of code
• Useful when loops do calculations on data in arrays
• Unrolling can take advantage of pipelined processing units in processors
• Compiler may preload operands into CPU registers

Loop Unrolling – disadvantages

May be limited by the number of floating-point registers:
• Pentium III: 8
• Pentium 4: 8
• Itanium: 128

Loop Unrolling – simple example

Loop

do i=1,n
   a(i) = b(i) + x*c(i)
enddo

Unrolled Loop (by 4)

! assumes n is a multiple of 4; otherwise a cleanup loop is needed
do i=1,n,4
   a(i)   = b(i)   + x*c(i)
   a(i+1) = b(i+1) + x*c(i+1)
   a(i+2) = b(i+2) + x*c(i+2)
   a(i+3) = b(i+3) + x*c(i+3)
enddo

Loop Unrolling – simple example

Performance (Mflops)*

              Rolled   Unrolled
P3 550 MHz      13        30
Itanium         30       107

*from: LCI and NCSA

Loop Unrolling

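Rolled:

int a[100];
int i;
for (i = 0; i < 100; i++) {
    a[i] = a[i] * 2;
}

Unrolled by 5 (100 is a multiple of 5, so no cleanup loop is needed):

int a[100];
int i;
for (i = 0; i < 100; i += 5) {
    a[i]     = a[i]     * 2;
    a[i + 1] = a[i + 1] * 2;
    a[i + 2] = a[i + 2] * 2;
    a[i + 3] = a[i + 3] * 2;
    a[i + 4] = a[i + 4] * 2;
}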
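The unrolled loops above work because the trip count is a multiple of the unroll factor. When that is not known in advance, a common pattern is an unrolled main loop followed by a short cleanup loop. The following is a minimal sketch of that pattern, not part of the original slides; the function name double_all and the unroll factor of 4 are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Doubles every element of a[0..n-1], unrolled by 4.
   double_all is a hypothetical name used only for this sketch. */
void double_all(int *a, size_t n)
{
    assert(a != NULL || n == 0);

    size_t i = 0;

    /* Main unrolled body: handles elements in groups of 4. */
    for (; i + 3 < n; i += 4) {
        a[i]     = a[i]     * 2;
        a[i + 1] = a[i + 1] * 2;
        a[i + 2] = a[i + 2] * 2;
        a[i + 3] = a[i + 3] * 2;
    }

    /* Cleanup loop: handles the 0 to 3 leftover elements. */
    for (; i < n; i++) {
        a[i] = a[i] * 2;
    }
}
```

Calling double_all(v, 7) on a 7-element array runs the unrolled body once (elements 0 to 3) and the cleanup loop three times (elements 4 to 6).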

Loop unrolling

Nested loops:

int a[10][10];
int i, j;
for (i = 0; i < 10; i++) {
    for (j = 0; j < 10; j++) {
        a[i][j] = a[i][j] * 2;
    }
}

Inner loop fully unrolled:

int a[10][10];
int i;
for (i = 0; i < 10; i++) {
    a[i][0] = a[i][0] * 2;  a[i][1] = a[i][1] * 2;
    a[i][2] = a[i][2] * 2;  a[i][3] = a[i][3] * 2;
    a[i][4] = a[i][4] * 2;  a[i][5] = a[i][5] * 2;
    a[i][6] = a[i][6] * 2;  a[i][7] = a[i][7] * 2;
    a[i][8] = a[i][8] * 2;  a[i][9] = a[i][9] * 2;
}

Loop unrolling – Matrix Dot Product

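Rolled:

float a[100];
float b[100];
float z = 0.0f;   /* the accumulator must start at zero */
int i;
for (i = 0; i < 100; i++) {
    z = z + a[i] * b[i];
}

Unrolled by 2:

float a[100];
float b[100];
float z = 0.0f;   /* the accumulator must start at zero */
int i;
for (i = 0; i < 100; i += 2) {
    z = z + a[i] * b[i];
    z = z + a[i + 1] * b[i + 1];
}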
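In the unrolled dot product above, every statement still updates the same variable z, so each addition has to wait for the previous one. A variation that is often used with unrolling, shown here only as a sketch and not taken from the slides, keeps one partial sum per unrolled statement and combines them at the end. The name dot_unrolled2 is hypothetical. Note that this changes the order of the floating-point additions, so the result can differ from the rolled loop in the last bits.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of a[0..n-1] and b[0..n-1], unrolled by 2 with two
   independent partial sums. dot_unrolled2 is a hypothetical name. */
float dot_unrolled2(const float *a, const float *b, size_t n)
{
    assert((a != NULL && b != NULL) || n == 0);

    float z0 = 0.0f;   /* partial sum over even indices */
    float z1 = 0.0f;   /* partial sum over odd indices  */
    size_t i = 0;

    /* The two updates do not depend on each other. */
    for (; i + 1 < n; i += 2) {
        z0 += a[i]     * b[i];
        z1 += a[i + 1] * b[i + 1];
    }

    /* Cleanup for an odd n: one element is left over. */
    if (i < n) {
        z0 += a[i] * b[i];
    }

    return z0 + z1;
}
```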

Unrolling Loops

You can do it automatically

Unrolling Loops – compiler options

GNU Compilers
    -funroll-loops
    -funroll-all-loops (not recommended)

PGI Compilers
    -Munroll
    -Munroll=c:N
    -Munroll=n:M

Unrolling Loops – Compiler Options

Intel Compilers
    -unrollM (unroll up to M times)
    -unroll

Taking Memory in Order

Optimizing the use of cache: row major order vs column major order

row major (last index varies fastest):
a(1,1), a(1,2), a(1,3), a(2,1), a(2,2), …

column major (first index varies fastest):
a(1,1), a(2,1), a(3,1), a(1,2), a(2,2), …

Taking Memory in Order

Remember, C and Fortran store arrays in the opposite manner:
• C – row major
• Fortran – column major
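The difference between the two storage orders comes down to how a pair of indices maps to a position in memory. The sketch below, which is an illustration and not part of the original slides, writes both mappings as functions on zero-based indices; the names row_major_offset and col_major_offset are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Offset (in elements) of entry (i, j) in an nrows x ncols array,
   using zero-based indices. Both function names are hypothetical. */

/* Row major (C): a whole row is contiguous, so j varies fastest. */
size_t row_major_offset(size_t i, size_t j, size_t nrows, size_t ncols)
{
    assert(i < nrows && j < ncols);
    return i * ncols + j;
}

/* Column major (Fortran): a whole column is contiguous, so i varies fastest. */
size_t col_major_offset(size_t i, size_t j, size_t nrows, size_t ncols)
{
    assert(i < nrows && j < ncols);
    return j * nrows + i;
}
```

For a 3 x 4 array, entry (2, 1) sits at offset 2*4 + 1 = 9 in row major order and at offset 1*3 + 2 = 5 in column major order.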

Taking Memory in Order

[Figure: memory layout diagrams labeled C and Fortran]

Taking Memory in Order

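Inner loop over j (second index): strided access in Fortran

do i=1,m
   do j=1,n
      a(i,j)=b(i,j)+c(i)
   end do
end do

•loop time: 23.42
•loop runs at 4.48 Mflops

Inner loop over i (first index): contiguous access in Fortran

do j=1,n
   do i=1,m
      a(i,j)=b(i,j)+c(i)
   end do
end do

•loop time: 2.80
•loop runs at 37.48 Mflops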
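In C the preferred order is the reverse of the Fortran case, because the last index is the contiguous one. The following sketch of the same computation in C is an illustration only, not from the slides; it stores the m x n arrays in flat row major form, and the function name add_row_term is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Computes a(i,j) = b(i,j) + c(i) for m x n arrays stored flat in
   row major order. add_row_term is a hypothetical name. */
void add_row_term(size_t m, size_t n, float *a, const float *b, const float *c)
{
    assert(a != NULL && b != NULL && c != NULL);

    for (size_t i = 0; i < m; i++) {          /* rows: outer loop */
        for (size_t j = 0; j < n; j++) {      /* columns: inner loop, stride 1 */
            a[i * n + j] = b[i * n + j] + c[i];
        }
    }
}
```

Swapping the two for statements would give the same results but would step through memory n elements at a time in the inner loop, which is the C counterpart of the slow Fortran ordering.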

Floating Point Division

• FP division is very expensive in terms of processor time
• 20-60 clock cycles to compute
• Usually not pipelined
• FP division required by IEEE “rules”

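Floating point division – use reciprocal

Division inside the loop:

float a[100];
int i;
for (i = 0; i < 100; i++) {
    a[i] = a[i] / 2;
}

Multiplying by the reciprocal:

float a[100];
float denom;
int i;
denom = 1.0f / 2.0f;   /* not 1/2: integer division would give 0 */
for (i = 0; i < 100; i++) {
    a[i] = a[i] * denom;
}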
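The same idea works for a divisor that is only known at run time, as long as it stays constant across the loop. The sketch below is an illustration, not from the slides; the function name divide_all is hypothetical. For a divisor such as 2 the reciprocal is exact, but for divisors whose reciprocal cannot be represented exactly (3, for example) the results may differ from true division in the last bit.

```c
#include <assert.h>
#include <stddef.h>

/* Divides every element of a[0..n-1] by divisor using one division
   and n multiplications. divide_all is a hypothetical name. */
void divide_all(float *a, size_t n, float divisor)
{
    assert(a != NULL || n == 0);
    assert(divisor != 0.0f);

    /* The only division: hoisted out of the loop. */
    const float recip = 1.0f / divisor;

    for (size_t i = 0; i < n; i++) {
        a[i] = a[i] * recip;
    }
}
```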

Floating Point Division

Compiler options for IEEE compatibility:
• PGI Compilers: -Knoieee
• Intel Compilers: -mp
• GNU Compilers: can’t do

Floating Point Division

• Compilers can’t optimize if divisor is not scalar
• Breaks IEEE “rules”
• May impact portability

Function Inlining

• Build functions/subroutines in as inline parts of the program’s code…
• … rather than as separate functions/subroutines
• Minimizes function calls (and management of…)
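In C, a small helper can be marked as a candidate for inlining in the source itself. The sketch below is an illustration, not from the slides; the names axpy_term and axpy_all are hypothetical. The inline keyword is only a hint: the compiler still decides whether the call is actually replaced by the function body.

```c
#include <assert.h>
#include <stddef.h>

/* Small helper that is a natural candidate for inlining.
   axpy_term is a hypothetical name. */
static inline float axpy_term(float b, float x, float c)
{
    return b + x * c;
}

/* Computes a[i] = b[i] + x * c[i]. If the helper is inlined, the loop
   body contains no function call. axpy_all is a hypothetical name. */
void axpy_all(size_t n, float *a, const float *b, float x, const float *c)
{
    assert((a != NULL && b != NULL && c != NULL) || n == 0);

    for (size_t i = 0; i < n; i++) {
        a[i] = axpy_term(b[i], x, c[i]);
    }
}
```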

Function Inlining

Compile with:

-Minline
    compiler tries to inline what it can (functions that meet the compiler's criteria)
-Minline=except:func
    excludes func from inlining
-Minline=func
    inlines only func

Function Inlining

… Compile with:

-Minline=myfile.lib
    inlines functions from inline library file
-Minline=levels:n
    inlines functions up to n levels of calls; usually default = 1

MPI Tuning

• Minimize messages
• Pointers/counts
• MPI derived datatypes
• MPI_Pack/MPI_Unpack
• Using shared memory for message passing:
    #PBS -l nodes=6:ppn=1 …
  but…
    #PBS -l nodes=3:ppn=2 …
  is better.

Compiler optimizations

-O0 – no optimization
-O1 – local optimization, register allocation
-O2 – local/limited global optimization
-O3 – aggressive global optimization
-Munroll – loop unrolling
-Mvect – vectorization
-Minline – function inlining

gcc Compiler Optimizations

See: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
