Cache Optimizations & the Loop Nest Optimizer
Date posted: 21-Dec-2015

Improvement Opportunities

A program runs slowly because not all resources are used:
• processor:
  – not using opportunities to go superscalar (ILP)
  – scheduling of instructions is not optimal (too many wait states)
• memory access:
  – not all data in a cache line is used (spatial locality)
  – data in the cache is not reused (temporal locality)

Performance analysis is used to diagnose the problem.

The compiler will attempt to optimize the program for the given architecture:
• data structures can inhibit compiler optimizations
• algorithm presentation can inhibit compiler optimizations

It is often necessary to rewrite the critical parts of the code (loops) so that the compiler can optimize them better. To do that, understand the compiler's optimization techniques.

Compiler Optimization Techniques

The following optimizations are built into the compiler:
• general:
  – procedure inlining
  – data and array padding
• loop based (loop nests, which implies use of multi-dimensional arrays; enabled at -O3 or with -LNO:opt=[1|0]):
  – loop interchange
  – outer and inner loop unrolling
  – cache blocking
  – loop fusion (merge) and fission (split)
• code generation:
  – software pipelining
  – instruction reordering

Presenting the algorithm in the program so that the compiler can apply these optimization techniques leads to optimal program performance on the machine.

Scalar Architecture: Cache System

• The hierarchy of memory devices:

[Figure: speed of access (1/clock; axis ticks 1, 0.1, 0.01) versus device capacity (size) for registers (64), the cache subsystem with L1 (32 KB) and L2 (8 MB), memory (~1 to 100s of GB), and disk. Latency labels in the figure: ~2-3 cy, ~10 cy, ~100 cy, ~4000 cy.]

• The goal of the memory hierarchy:
  – access speed ~ fastest memory
  – effective capacity ~ size of largest memory

-> Programs should follow the principle of locality (use items in the cache):
  – spatial locality of reference (use all words in a cache line)
  – temporal locality of reference (use the same cache line)

Scalar Architecture: Cache Organization

The goal of scalar optimization:
  – spatial locality of reference (use all words in a cache line)
  – temporal locality of reference (use the same cache line)

[Figure: words in memory grouped into cache lines (32 words each); a load instruction (ld) for 1 word triggers a cache line transfer. Example: the L2 cache on the O2K, e.g. 8 MB or 2097152 words.]

• A cache hit will load the word from the cache
• A cache miss will load the cache line from memory

Problems of Scalar Optimization

    DO i=1,n
      DO j=1,n
        DO k=1,n
          C(i,j)=C(i,j) + A(i,k)*B(k,j)
        ENDDO
      ENDDO
    ENDDO

  – each C(i,j) value is accumulated in a register from the products A(i,k)*B(k,j)
  – B is traversed along consecutive cache lines (spatial locality)
  – A uses only 1 word from each cache line (no locality)
  – for A and B there is no reuse of cache lines (if n is large)

This is a problem only if A, B, C do not fit into the cache.

[Figure: matrix product C(i,j) = A(i,k) X B(k,j), with cache lines marked.]

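The same locality argument carries over to C, where storage is row-wise. As a minimal sketch (the function name `matmul_ikj` and the flat row-major layout are assumptions made for illustration, not something the slides prescribe), ordering the loops i, k, j lets the innermost loop walk both `c` and `b` with stride 1:

```c
#include <stddef.h>

/* Hypothetical helper: C = C + A*B for n x n matrices stored row-major
   in flat arrays. With loop order i,k,j the innermost loop touches
   c[i][j] and b[k][j] at stride 1, which matches C storage order. */
void matmul_ikj(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            double aik = a[i * n + k];   /* invariant in the j loop */
            for (size_t j = 0; j < n; j++)
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

Calling `matmul_ikj(n, a, b, c)` with `c` zero-initialized gives the plain product; whether this order is actually faster depends on n and on the cache sizes of the machine.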
Loop Nest Optimizer

LNO performs loop restructuring to optimize data access:
• loop interchange
• loop unrolling
• loop blocking for cache
• loop fusion
• loop fission
• pre-fetching

LNO is controlled with compiler options and/or compiler directives or pragmas; the same options are available for both.
• LNO is on by default at -O3, but can be turned on or off individually with -LNO:opt=[1|0]
• directive/pragma syntax:
  – Fortran: C*$* keyword [=value(s)]
  – C/C++: #pragma keyword [=value(s)]
• directives/pragmas can be disabled with the compiler switch -LNO:ignore_pragmas

Array Indexing

There are several ways to index arrays:

Direct addressing (++):
    DO j=1,M
      DO i=1,N
        … A(i,j) …
      ENDDO
    ENDDO

Explicit addressing (+):
    DO j=1,M
      DO i=1,N
        … A(i+(j-1)*N) …
      ENDDO
    ENDDO

Indirect addressing (--):
    DO j=1,M
      DO i=1,N
        … A(index(i,j)) …
      ENDDO
    ENDDO

Loop carried addressing (-):
    DO j=1,M
      DO i=1,N
        k = k + 1
        … A(k) …
      ENDDO
    ENDDO

• The addressing scheme has an impact on performance.
• Arrays should be accessed in the most natural, direct way so that the compiler can apply loop optimization techniques.

Data Storage in Memory

Data storage order is language dependent:
• Fortran stores multi-dimensional arrays "column-wise": in A(I,J) the leftmost index changes fastest, so consecutive words in memory are (i,j), (i+1,j), (i+2,j), ...
• C stores multi-dimensional arrays "row-wise": in a[i][j] the rightmost index changes fastest, so consecutive words in memory are [i][j], [i][j+1], [i][j+2], ...
• Accessing array elements in storage order greatly improves performance for arrays that do not fit in the cache(s).

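The row-wise rule for C can be checked directly from element addresses. The sketch below is only an illustration: the array extents `ROWS` and `COLS` and the helper `element_offset` are hypothetical names, not taken from the slides.

```c
#include <stddef.h>

enum { ROWS = 3, COLS = 4 };   /* illustrative extents for double a[ROWS][COLS] */

/* Distance, in elements, of a[i][j] from the start of the array. */
size_t element_offset(double (*a)[COLS], size_t i, size_t j)
{
    const char *base = (const char *)a;
    const char *elem = (const char *)&a[i][j];
    return (size_t)(elem - base) / sizeof(double);
}
```

For `double a[ROWS][COLS]`, stepping j by one moves one element, while stepping i by one moves `COLS` elements, so the inner loop of a C loop nest should run over the rightmost index.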
Loop Interchange: FORTRAN

Original loop (access order differs from storage order):
    c*$* no interchange
          DO I=1,N
            DO J=1,M
              C(I,J)=A(I,J)+B(I,J)
            ENDDO
          ENDDO

Interchanged loops (access order matches storage order):
    c*$* interchange(J,I)
          DO J=1,M
            DO I=1,N
              C(I,J)=A(I,J)+B(I,J)
            ENDDO
          ENDDO

[Figure: the N x M arrays A(I,J), B(I,J), C(I,J), showing access order versus storage order for each loop ordering.]

• The distribution of data in memory is not changed. Only the access pattern is changed.
• The compiler can do this optimization automatically: -LNO:interchange=[on|off] (default on)

Index Reversal

Original loop:
    DO I=1,N
      DO J=1,M
        C(I,J)=A(I,J)+B(J,I)
      ENDDO
    ENDDO

The access is poor for A and C, while it is optimal for B. Loop interchange alone would be good for A and C, but bad for B.

Interchanged loops + index reversal:
    DO J=1,M
      DO I=1,N
        C(I,J)=A(I,J)+B(I,J)
      ENDDO
    ENDDO

• Index reversal on B, i.e. replacing B(J,I) by B(I,J), must be done everywhere in the program.
• This has to be done manually; there is no compiler optimization that does index reversal.

The Significance of Loop Interchange

    DO I=1,700
      DO J=1,700
        DO K=1,700
          A(I,J,K)=A(I,J,K)+B(I,J,K)*C(I,J,K)
        ENDDO
      ENDDO
    ENDDO

Run time in seconds obtained on an Origin 3000, R12K@400MHz (8 MB cache):

    loop order    run time (s)
    i,j,k         535.0
    j,i,k          32.0
    k,j,i          11.0

Loop Interchange in C

In C, the situation is exactly the opposite to Fortran:

Original loop:
    #pragma no interchange
    for(j=0; j<m; j++)
      for(i=0; i<n; i++)
        c[i][j]=a[i][j]+b[j][i];

Addressing of c[i][j] and a[i][j] is poor; addressing of b[j][i] is optimal.

Interchanged loop:
    #pragma interchange(i,j)
    for(i=0; i<n; i++)
      for(j=0; j<m; j++)
        c[i][j]=a[i][j]+b[j][i];

Index reversal loop:
    for(j=0; j<m; j++)
      for(i=0; i<n; i++)
        c[j][i]=a[j][i]+b[j][i];

• The performance benefits in C are the same as in Fortran.
• In most practical situations, loop interchange (supported by the compiler) is much easier to achieve than index reversal.

Array Placement Effects

"Poor" data placement in memory can lead to cache thrashing.

Two techniques are built into the compiler to avoid cache thrashing:
• array padding
• leading dimension extension

NOTE: the leading dimension of an array should be an odd number. If a multi-dimensional array has small extents (e.g. a(64,64,64,..)), several of the leading dimensions should be odd numbers.

Direct-Mapped Caches: Thrashing

    COMMON // A(8192), B(8192)
    DO I=1,N
      PROD = PROD + A(I)*B(I)
    ENDDO

[Figure: a direct-mapped cache (32 KB, cache line: 4 words, lines numbered 1 to 2048) holding A(1) to A(8192), next to (virtual) memory in which A(1) to A(8192) is followed by B(1) to B(8192); registers in the CPU.]

Location in the cache: (memory-address) mod (cache-size). In this case loc(A(1)) mod 32KB = loc(B(1)) mod 32KB, because loc(B(1)) = loc(A(1)) + 8192*4B and 8192*4B mod 32KB = 0.

Thrashing: every memory reference results in a cache miss.

Set-Associative Caches

    COMMON // A(8192), B(8192)
    DO I=1,N
      PROD = PROD + A(I)*B(I)
    ENDDO

[Figure: a 2-way set-associative cache (32 KB, cache line: 4 words, lines numbered 1 to 1024 in each set); one set holds A(1) to A(4096), the other B(1) to B(4096); set select (1 bit) uses LRU. Next to it, (virtual) memory with A(1) to A(8192) followed by B(1) to B(8192); registers in the CPU.]

Location in the cache: (memory-address) mod (cache-size / associativity), i.e. mod 16 KB here. In this case loc(A(1)) mod 16KB = loc(B(1)) mod 16KB, BUT IN A DIFFERENT SET!

No thrashing: conflicting cache lines are stored in different sets.

Array Padding: Example

Assume a 32 KB cache.

Without padding:
    COMMON // A(1024,1024),B(1024,1024),C(1024,1024)
    DO J=1,1024
      DO I=1,1024
        A(I,J) = A(I,J)+B(I,J)*C(I,J)
      ENDDO
    ENDDO

Addr[C(1,1)] = Addr[B(1,1)] + 1024*1024*4
Position in the cache: C(1,1) = B(1,1), since (1024*1024*4) mod 32KB = 0

With padding:
    COMMON // A(1024,1024),pad1(129),B(1024,1024),pad2(129),C(1024,1024)
    DO J=1,1024
      DO I=1,1024
        A(I,J) = A(I,J)+B(I,J)*C(I,J)
      ENDDO
    ENDDO

Addr[C(1,1)] = Addr[B(1,1)] + 1024*1024*4 + 129*4
Position in the cache: C(1,1) = B(130,1) mod 32KB, i.e. 129 words away from B(1,1)

• Padding causes cache lines to be placed in different cache locations.
• The compiler will try to do padding automatically.

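The address arithmetic of this example can be replayed in a few lines of C. This is a sketch under stated assumptions: a 32 KB direct-mapped cache, 4-byte reals, a common block that starts at a cache-aligned address, and a hypothetical helper name `cache_offset_of_array`.

```c
#include <stddef.h>

enum { CACHE_BYTES = 32 * 1024 };   /* assumed 32 KB direct-mapped cache */

/* Position in the cache (byte address mod cache size) of the first
   element of array number `which` (0 = A, 1 = B, 2 = C) when the three
   1024x1024 arrays of 4-byte reals are laid out back to back, each
   followed by pad_words words of padding (0 means no padding). */
size_t cache_offset_of_array(int which, size_t pad_words)
{
    const size_t array_bytes = 1024u * 1024u * 4u;
    size_t start = (size_t)which * (array_bytes + pad_words * 4u);
    return start % CACHE_BYTES;
}
```

With `pad_words = 0` all three arrays start at the same cache position; with `pad_words = 129` the start of each array is shifted by 516 bytes relative to the previous one.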
Maxwell Code Example

          REAL EX(NX,NY,NZ),EY(NX,NY,NZ),EZ(NX,NY,NZ) !Electric field
          REAL HX(NX,NY,NZ),HY(NX,NY,NZ),HZ(NX,NY,NZ) !Magnetic field
          …
          DO K=2,NZ-1
            DO J=2,NY-1
              DO I=2,NX-1
                HX(I,J,K)=HX(I,J,K)-(EZ(I,J,K)-EZ(I,J-1,K))*CHDY
         &                         +(EY(I,J,K)-EY(I,J,K-1))*CHDZ
                HY(I,J,K)=HY(I,J,K)-(EX(I,J,K)-EX(I,J,K-1))*CHDZ
         &                         +(EZ(I,J,K)-EZ(I-1,J,K))*CHDX
                HZ(I,J,K)=HZ(I,J,K)-(EY(I,J,K)-EY(I-1,J,K))*CHDX
         &                         +(EX(I,J,K)-EX(I,J-1,K))*CHDY
              ENDDO
            ENDDO
          ENDDO

Here NX=NY=NZ = 32, 64, 128, 256 (i.e. with real*4 elements: 0.8MB, 6.3MB, 50MB, 403MB).

Reusing the load from the previous iteration (I-1) gives in total:
• 13 memory operations (6H+7E) -> minimum 13 cycles/iteration
• 18 floating point operations in this code
• 18/(13*2) = 69% of peak, where peak is 800 Mflop/s on the R10000@400MHz processor

Compiling with -mips4 -O3 -LNO:opt=0 -OPT:reorg_common=off (to show the effect of the compiler not performing the necessary optimizations) gives a performance of 4.6 Mflop/s on this code.

Maxwell Example - continued

Problem:
• the array dimensions are small even numbers (powers of 2), so the arrays map to the same location in both the 1st level and the 2nd level caches

In general:
    primary cache   32 KB = 2 (way-set-ass) * 4 (size-real) * 4096
    secondary cache  8 MB = 2 (way-set-ass) * 4 (size-real) * 1048576

• for the Maxwell example with NX=NY=NZ=64, printing the position of the arrays in memory shows:

    C print position of arrays in memory with the code:
          INTEGER*8 aEX
          aEX = %LOC(EX(1,1,1))
          print *,'Addr EX=',mod(aEX,4096), mod(aEX,1048576),'words'

    Addr EX= 3720 470664
    Addr EY= 3720 470664
    …….. etc.
    Addr HZ= 3720 470664

All arrays map to the same locations in both caches.

• The compiler is able to pad the arrays automatically. Compiling with the default optimizations, -mips4 -O3, gives a performance of 162 Mflop/s.

Dangers of Array Padding

• The compiler will automatically pad local data.
• -O3 optimization will automatically pad common blocks.
• Padding of common blocks is safe as long as the Fortran standard is not violated. The following code violates it by running past the end of A into B:

          SUBROUTINE SUB
          COMMON // A(512,512), B(512,512)
          DO I=1, 2*512*512
            A(I) = 0.0
          ENDDO
          END

• Fix the violation, or do not use this optimization, either by compiling with lower optimization or by using the explicit compiler flag -OPT:reorg_common=off

Variable Length Arrays (VLA)

The SGI compiler supports Variable Length Arrays in C and Fortran.
• It is standard in F90 and an SGI extension in F77:

          SUBROUTINE NAME1(N,M)
          DIMENSION R(N,M)
          ……… etc. …
          END

• In C it is an SGI extension:

    void name1(int m, int n)
    {
      double r[m][n][n+m];
      …… etc. …..
    }

These arrays are created on the stack, as opposed to a location in a static area.

• VLAs are very handy as scratch arrays, since they are created each time execution enters the subroutine and destroyed at exit.
• Unlike static arrays, VLAs allow for proper aliasing and alignment considerations by the compiler.

Loop Unrolling

Loop unrolling: perform multiple loop iterations at the same time.

Original loop:
    DO I=1,N,1
      …(I)…
    ENDDO

Unrolled loop & cleanup:
    DO I=1,N-mod(N,UNROLL),UNROLL
      …(I)…
      …(I+1)…
      …(I+2)…
      …(I+UNROLL-1)…
    ENDDO
    DO I=N-mod(N,UNROLL)+1,N
      …(I)…
    ENDDO

Advantages of loop unrolling:
• more opportunities for super-scalar code
• more data re-use & pseudo-prefetch
• exploit presence of cache lines
• reduction in loop overhead (minor)

NOTE: Inner loops should "never" be unrolled by hand:
• the compiler will typically unroll the inner loop by the amount necessary for SWP

Directive: C*$* unroll(p)
    p = 0       default unrolling
    p = 1       no unrolling
    p = UNROLL  unroll by that factor

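To make the "unrolled loop plus cleanup loop" structure concrete, here is a minimal C sketch of the kind of code the compiler generates for a sum reduction. It is shown for illustration only (the slide advises against unrolling inner loops by hand), and the function name `sum_unrolled4` and the unroll factor 4 are assumptions.

```c
#include <stddef.h>

/* Sum of b[0..n-1], unrolled by 4, with a cleanup loop for the
   n % 4 leftover iterations. Four partial sums keep the unrolled
   statements independent of each other. */
double sum_unrolled4(const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t main_end = n - n % 4;   /* last index handled by the unrolled loop */
    size_t i;

    for (i = 0; i < main_end; i += 4) {
        s0 += b[i];
        s1 += b[i + 1];
        s2 += b[i + 2];
        s3 += b[i + 3];
    }
    for (; i < n; i++)             /* cleanup loop */
        s0 += b[i];

    return (s0 + s1) + (s2 + s3);
}
```

Note that splitting the sum into partial sums changes the order of floating point additions, so results can differ in the last bits from the straightforward loop.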
Prefetch Data from Memory

Reordering instructions in an unrolled loop leads to an effective (pseudo-) prefetch of the data:
• no instruction overhead; the compiler does this optimization automatically

Unrolled loop:
    for(i=0; i<n; i+=4){
      a += b[i+0];
      a += b[i+1];
      a += b[i+2];
      a += b[i+3];
    }

Unrolled loop with pseudo-prefetch (b[i+3] is loaded first):
    for(i=0; i<n; i+=4){
      t = b[i+3];
      a += b[i+0];
      a += b[i+1];
      a += b[i+2];
      a += t;
    }

Explicit (manual) prefetch for memory:
• prefetch to the 1st level cache should be done in the form of pseudo-prefetch
• the compiler will insert prefetch to the 2nd level cache automatically (LNO)
• manual prefetch to the 2nd level cache can be done with a compiler directive:

    c*$* prefetch_ref=a(1)
    c*$* prefetch_ref=a(1+16)
          do I=1,n
    c*$* prefetch_ref=a(I+32),stride=16,kind=rd
            sum = sum + a(I)
          enddo

• the same in C with the corresponding #pragma directive

Outer Loop Unrolling

Original loop:
    DO I=1,N
      DO J=1,N
        A(I)=A(I)+B(I,J)*C(J)
      ENDDO
    ENDDO

Problem:
  – A(I) is constant for the inner loop J
  – C(J) is traversed each I iteration
  – B(I,J) is traversed poorly

Outer loop unrolled:
    DO I=1,N,4   ! Unrolling by 4
      DO J=1,N
        A(I+0)=A(I+0)+B(I+0,J)*C(J)
        A(I+1)=A(I+1)+B(I+1,J)*C(J)
        A(I+2)=A(I+2)+B(I+2,J)*C(J)
        A(I+3)=A(I+3)+B(I+3,J)*C(J)
      ENDDO
    ENDDO

Unrolling the outer loop will load the complete cache line of B (one 1st level cache line) into the registers -> data re-use.

• the unroll factor should match the cache line size
• mostly a 1st level cache optimization
• if the data fits into the 2nd level cache, this is a good optimization to use

Compiler option: -LNO:outer_unroll=n

Blocking for Cache (tiling)

Blocking for cache is:
• an optimization that applies to data sets that do not fit into the (2nd level) data cache
• a way to increase spatial locality of reference (i.e. exploit full cache lines)
• a way to increase temporal locality of reference (i.e. to improve data re-use)
• beneficial mostly with multi-dimensional arrays

Original loop:
    DO I=1,N
      …. (I) ….
    ENDDO

Blocked loop:
    DO i1=1,N,nb
      DO I=i1,min(i1+nb-1,N)
        …. (I) ….
      ENDDO
    ENDDO

The inner loop is traversed only in the range of nb at a time.

Compiler options:
    -LNO:blocking=[on|off]     (default on)
    -LNO:blocking_size=n1,n2   (for L1 and L2)

By default L1=32KB and L2=1MB; use -LNO:cs2=8M to specify the 8MB L2 cache.

Blocking: Example

The following loop nest:

    for(i=0; i<n; i++)
      for(j=0; j<m; j++)
        x[i][j] = y[i] + z[j];

  – x[i][j] is traversed in order
  – y[i] is loop invariant
  – z[j] is traversed sequentially
  – changing the loop order is not beneficial in this case

• z[j] is reused for each i iteration
• For large m the array z will not be reused from the cache

Blocking the loops for cache:

    for(it=0; it<n; it += nb)
      for(jt=0; jt<m; jt += nb)
        for(i=it; i<min(it+nb,n); i++)
          for(j=jt; j<min(jt+nb,m); j++)
            x[i][j] = y[i] + z[j];

• nb elements of the z array will be brought into the cache and reused nb times before moving on to the next tile

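A self-contained C sketch of the tiled loop nest follows. It assumes a flat row-major layout for x and a caller-chosen tile size; the names `add_outer_blocked` and `min_size` are hypothetical, and a suitable `nb` has to be found by measurement on the target machine.

```c
#include <stddef.h>

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* x[i][j] = y[i] + z[j] for an n x m row-major array x, processed in
   tiles of nb x nb elements so that a tile of z stays in the cache
   while it is reused for nb values of i. */
void add_outer_blocked(size_t n, size_t m, size_t nb,
                       double *x, const double *y, const double *z)
{
    if (nb == 0)
        nb = 1;                      /* guard against a zero tile size */

    for (size_t it = 0; it < n; it += nb)
        for (size_t jt = 0; jt < m; jt += nb)
            for (size_t i = it; i < min_size(it + nb, n); i++)
                for (size_t j = jt; j < min_size(jt + nb, m); j++)
                    x[i * m + j] = y[i] + z[j];
}
```

The result is identical to the unblocked loop nest for any tile size; only the order in which the elements are visited changes.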
Loop Fusion

Loop fusion (merging two or more loops together):
• fusing loops that refer to the same data enhances temporal locality
• a larger loop body allows more effective scalar optimizations

Example:

Original loops:
    for(i=0; i<n; i++) a[i] = b[i] + 1;
    for(i=0; i<n; i++) c[i] = a[i]/2;
    for(i=0; i<n; i++) d[i] = 1/c[i+1];

Fused loops:
    for(i=0; i<n; i++){
      a[i] = b[i] + 1;
      c[i] = a[i]/2;
    }
    for(i=0; i<n; i++) d[i] = 1/c[i+1];

Fusing more loops with loop peeling:
    a[0] = b[0] + 1;
    c[0] = a[0]/2;
    for(i=1; i<n; i++){
      a[i] = b[i] + 1;
      c[i] = a[i]/2;
      d[i-1] = 1/c[i];
    }
    d[n-1] = 1/c[n];

• loop peeling can be used to break data dependencies when fusing loops
• sometimes temporary arrays can be replaced by scalars (this optimization has to be done manually)
• the compiler will attempt to fuse loops if they are adjacent, i.e. there is no code between the loops to be fused

Compiler option: -LNO:fusion=[0,1,2] (default 1)

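The peeling step is easy to get wrong at the loop boundaries, so it helps to compare a peeled version against the three separate loops. The following is a sketch with hypothetical function names; it assumes n >= 1 for the peeled version to do any work and that c has n+1 elements, because the last loop reads c[n].

```c
#include <stddef.h>

/* Reference: the three separate loops. c must hold n+1 elements. */
void three_loops(size_t n, const double *b, double *a, double *c, double *d)
{
    for (size_t i = 0; i < n; i++) a[i] = b[i] + 1;
    for (size_t i = 0; i < n; i++) c[i] = a[i] / 2;
    for (size_t i = 0; i < n; i++) d[i] = 1 / c[i + 1];
}

/* All three loops fused. The first a/c iteration is peeled off in
   front, and the last d iteration is peeled off behind the loop. */
void fused_peeled(size_t n, const double *b, double *a, double *c, double *d)
{
    if (n == 0)
        return;
    a[0] = b[0] + 1;
    c[0] = a[0] / 2;
    for (size_t i = 1; i < n; i++) {
        a[i] = b[i] + 1;
        c[i] = a[i] / 2;
        d[i - 1] = 1 / c[i];       /* c[i] was just updated */
    }
    d[n - 1] = 1 / c[n];           /* c[n] is never written by the loops */
}
```

Both functions perform the same arithmetic on the same values, so their outputs match exactly.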
Loop Fusion in Array Assignments

Loop fusion is instrumental in generating good F90 code.

F90 code sequence:
    A(1:N) = B(1:N)+1
    C(1:N) = A(1:N)/2
    D(1:N) = 1/C(2:N+1)

The compiler will typically generate the following instruction sequence:
    Allocate T(1:N)
    DO I=1,N
      T(I)=B(I)+1
    ENDDO
    DO I=1,N
      A(I) = T(I)
    ENDDO
    DO I=1,N
      T(I)= A(I)/2
    ENDDO
    DO I=1,N
      C(I) = T(I)
    ENDDO
    DO I=1,N
      T(I)=1/C(I+1)
    ENDDO
    DO I=1,N
      D(I) = T(I)
    ENDDO

The compiler can optimize the loop sequence by fusion:
• for that, all assignments (loops) should be adjacent
• preserving data dependencies, this can be fused into:

Fused loops:
    DO I=1,N
      A(I) = B(I)+1
      C(I) = A(I)/2
    ENDDO
    DO I=1,N
      D(I) = 1/C(I+1)
    ENDDO

Further peeling to break data dependencies will merge the two remaining loops.

• for this optimization to work automatically, no code should be placed between the array assignments, so that the assignments are adjacent

Loop Fission

Loop fission (splitting) or loop distribution:
• improves memory locality by splitting out loops that refer to different independent arrays

Original loop:
    for(i=1; i<n; i++){
      a[i] = a[i] + b[i-1];
      b[i] = c[i-1]*x + y;
      c[i] = 1/b[i];
      d[i] = sqrt(c[i]);
    }

After fission:
    for(i=0; i<n-1; i++){
      b[i+1] = c[i]*x + y;
      c[i+1] = 1/b[i+1];
    }
    for(i=0; i<n-1; i++)
      a[i+1] = a[i+1] + b[i];
    for(i=0; i<n-1; i++)
      d[i+1] = sqrt(c[i+1]);
    i = n;   /* final value of the loop index, as after the original loop */

Compiler option: -LNO:fission=[0,1,2] (default 1)
    0  no fission
    1  normal fission
    2  fission tried before fusion

LNO attempts to distribute inner loops.

LNO: Gather-Scatter

Special form of loop fission:
• If the loop to be optimized contains conditional execution, it is often faster to evaluate all the conditions first.
• The computationally intensive loop then runs only over the indices for which the condition was true and can be better optimized (SWP).
• LNO will not evaluate nested IF conditions unless -LNO:gather_scatter=2 is used.

Original code:
          Subroutine fred(a,b,c,n)
          real*8 a(n), b(n), c(n)
          do I=1,n
            if(c(I) .gt. 0) then
              a(I) = c(I)/b(I)
              c(I) = c(I)*b(I)
              b(I) = 2*b(I)
            endif
          enddo
          end

After gather-scatter (conditional execution removed from the compute loop):
          inc_0 = 0
          do I=1,n
            deref_gs(inc_0+1) = I
            if(c(I) .gt. 0) then
              inc_0 = inc_0 + 1
            endif
          enddo
          do ind_0=0,inc_0-1
            I=deref_gs(ind_0+1)
            a(I) = c(I)/b(I)
            c(I) = c(I)*b(I)
            b(I) = 2*b(I)
          enddo
          end

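The gather-scatter idea can be written out by hand in C as follows. This is a sketch, not the compiler's output: the function name `update_gather_scatter` and the caller-provided scratch array `idx` (room for n indices) are assumptions.

```c
#include <stddef.h>

/* Pass 1 gathers the indices where c[i] > 0; pass 2 does the arithmetic
   over those indices only, without a branch in the loop body.
   idx is scratch space for n indices. Returns the number gathered. */
size_t update_gather_scatter(size_t n, double *a, double *b, double *c,
                             size_t *idx)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        idx[count] = i;              /* always store, advance only on a hit */
        count += (c[i] > 0.0);
    }
    for (size_t k = 0; k < count; k++) {
        size_t i = idx[k];
        a[i] = c[i] / b[i];
        c[i] = c[i] * b[i];
        b[i] = 2 * b[i];
    }
    return count;
}
```

Elements whose condition is false are left untouched, exactly as in the loop with the IF statement.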
LNO: Vector Intrinsics

Most intrinsics have their "vector" equivalents. The compiler will automatically substitute vector intrinsics where legal, when the functions are invoked in a loop:

Original code:
          SUBROUTINE VFRED(A,N)
          REAL*8 A(N)
          DO I=1,N
            A(I) = A(I) + COS(A(I))
          ENDDO
          END

With vector intrinsics:
          CALL VCOS$(A(1),DEREF_SE1_F8(1),%VAL(N-1),%VAL(1),%VAL(1))
          DO I=1,N
            A(I) = A(I) + DEREF_SE1_F8(I)
          ENDDO

• Vector intrinsics are faster if N>10 for most intrinsics.
• Vector intrinsics have different precision rules (1 or 2 ulp less).
• Illegal arguments cannot be trapped with the vector intrinsics.
• Use -LNO:vintr=off to disable the generation of vector intrinsics.

Vector Intrinsics: Performance

[Figure: performance chart; its contents are not recoverable from the transcript.]

Data Dependence in Loops

In loops, each statement can be executed many times.
• loop carried data dependence:
  – dependence between statements in different iterations
• loop independent data dependence:
  – dependence between statements in the same iteration
• lexically forward dependence:
  – the source precedes the target lexically
• lexically backward dependence:
  – the opposite of the above
  – the right-hand side of an assignment precedes the left-hand side

Example:
    for( i=2; i<9; i++){
      x[i] = y[i] + z[i];     /* S2 */
      a[i] = x[i-1] + 1;      /* S3 */
    }

Unroll to analyze: S2 in iteration i writes x[i], and S3 in iteration i+1 reads it. This is a loop carried, lexically forward dependence from S2 to S3.

Specifying the Dependency Rules

In the following example:

          SUBROUTINE DAXPYI(N,X,K,A)
          INTEGER N,K
          REAL*8 X(N),A
          DO I=1,N
            X(K+I) = X(K+I) + A*X(I)
          ENDDO
          END

If K>N there is no dependency; if K<N there is a dependency. The value of K is unknown to the compiler, so the compiler will assume dependencies.

Compiler schedules:
    K<N (dependence)       14% peak
    K>N (no dependence)    33% peak

The ivdep directive can be used to communicate the data dependency rules to the compiler. IVDEP = Ignore Vector DEPendency.

          SUBROUTINE DAXPYI(N,X,K,A)
          INTEGER N,K
          REAL*8 X(N),A
    cdir$ ivdep
          DO I=1,N
            X(K+I) = X(K+I) + A*X(I)
          ENDDO
          END

The IVDEP Directive

With indexed addressing, IVDEP is the only way to tell the compiler that there are no data dependencies:

    void update(int n, float *a, float *b, int *indx, float s)
    {
      int i;
    #pragma ivdep
      for(i=0; i<n; i++)
        a[indx[i]] += s*b[i];
    }

• here ivdep means that the integer values stored in the indx array are all different, i.e. indx is a permutation array
• assuming no data dependencies will produce faster processor code, because the compiler has fewer constraints on ordering the load-store instructions

The IVDEP directive is not part of the language and its interpretation is not standardized.

Three Types of IVDEP Directive

The IVDEP directive is not part of any language and its interpretation is not standardized.

    CDIR$ IVDEP
          DO I=1,N
            A(INDEX(1,I)) = B(I)
            A(INDEX(2,I)) = C(I)
          ENDDO

• Default interpretation (SGI default behaviour): A and B and C are independent, i.e. index(*,i) != index(*,j). This breaks both lexically forward (i+k) and backward (i-k) dependencies:
  – index(1,i) != index(1,j)
  – index(2,i) != index(2,j)
  – but for some i: index(1,*) == index(2,*)
• The default interpretation can be changed with the -OPT: compiler option. Possible other interpretations:
  – break only lexically backward dependencies (Cray IVDEP), i.e. assume only index(*,i) != index(*,i-k) (enable with -OPT:cray_ivdep=on)
  – there are no dependencies whatsoever (liberal IVDEP, enable with -OPT:liberal_ivdep=on)

The Argument Alias Problem

          SUBROUTINE COPY(A,B,N)
          REAL*8 A(N),B(N)
          DO I=1,N
            B(I) = A(I)
          ENDDO
          END

In Fortran, the compiler assumes A and B do not overlap.

    void copy(double *a, double *b, int n)
    {
      int i;
      for(i=0; i<n; i++)
        b[i] = a[i];
    }

In C, the compiler assumes pointers a and b can point to the same address.

• In Fortran, it is a mistake to invoke copy with overlapping arguments. The compiler will perform optimizations assuming A and B are not aliases over the computational range.
• In C, argument aliases are allowed. Therefore optimizations (SWP) that change the original order of loads and stores are not possible. There are several ways to remove this restriction:
  – the ivdep pragma
  – the compiler optimization flag: -OPT:alias=memory-access-model
  – the restrict keyword

Aliases: the Optimizer Options

These options work over the whole compilation unit.

-OPT:alias=[any,typed,unnamed,restrict,disjoint]
• any is the default: any pair of memory references may be aliased.

Of the other memory access models, the most important are:
• restrict
  – assume that any pair of memory references that are named differently do not point to the same regions in memory
  – example: with float *p, *q, *p does not alias with *q, q, p or any global variable
• disjoint
  – assume the same restrictions as "restrict"; in addition, any pointer de-referencing will point to a non-overlapping region in memory
  – example: with float *p, *q, *p does not alias with *q, q, p or any global variable, and *p does not alias with **q, **p, ***q, etc.

The restrict Keyword

The Numerical C Extensions Group X3J11.1 proposed (1993) a restrict keyword as the way to specify pointer access models. The restrict semantics:
• assume that de-referencing the qualified pointer is the only way the program can access the memory pointed to by that pointer
• loads and stores through such a pointer do not alias with any other loads and stores, except those through the same pointer

    void copy(double * restrict a, double * restrict b, int n)
    {
      int i;
      for(i=0; i<n; i++)
        b[i] = a[i];
    }

• in this example, it is sufficient to indicate restrict b, since it is necessary to qualify only the pointers being stored through
• to enable the restrict keyword it is necessary to use the compiler flag (7.2 and 7.3 compilers): -LANG:restrict

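Since C99, `restrict` is a standard keyword, so on a current compiler no special flag is needed. A minimal sketch (the function name `scaled_copy` and the scale factor are hypothetical additions for illustration):

```c
#include <stddef.h>

/* dst[i] = s * src[i]. The restrict qualifiers are a promise by the
   caller that dst and src do not overlap, which leaves the compiler
   free to reorder the loads and stores. Calling this function with
   overlapping arrays is undefined behaviour. */
void scaled_copy(size_t n, double *restrict dst,
                 const double *restrict src, double s)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = s * src[i];
}
```

The qualifier does not change the result for non-overlapping arguments; it only removes an assumption the compiler would otherwise have to make.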
Alias in Storage Allocation

Program data can be stored in memory in 2 ways:
• Storage in the global area
  – memory pages are allocated statically, i.e. all data is put at a fixed (virtual) address at load time
  – loading such data often takes 2 instructions, since the load immediate instruction in MIPS is limited to a 64 KB offset:
        ldadr R1,addr         #load base pointer
        ldw   R2,R1+offset    #load base+offset
  – COMMON block data, global data, SAVE data, malloc, mmap
  – compilation with -static: all variables are allocated in the global area
• Storage on the stack
  – memory pages are allocated dynamically during program execution
  – each subroutine gets a new stack area for local data
  – loading data from the stack requires a single instruction:
        ldw   R2,TOS+offset   #load TopOfStack+offset
  – local (automatic) variables, temporary storage, alloca data
• Routines called from a parallel region:
  – allocate a private stack area
  – variables allocated on the private stack are private
  – variables in the global area are shared (aliases)

Procedure Inlining

Inlining: replace a function call by that function's source code.

    DO I=1,N
      call DO_WORK(A(I),C(I))
    ENDDO

    Subroutine DO_WORK(X,Y)
      Y=1+X*(1+X*0.5)
    END

Advantages:
• increased opportunities for processor optimizations
• more opportunities for Loop Nest optimizations

Candidates for inlining are modules that:
• are "small", i.e. not much source code
• are called very often (typically in a loop)
• do not take much time per call

Inhibitors to inlining:
• mismatched subroutine arguments (type or shape)
• calls across languages (e.g. Fortran calls a C subroutine)
• static (SAVE) local variables
• varargs routines, recursive routines
• functions with alternate entry points
• nested subroutines (as in F90)

Compiler options:
    -INLINE:list=[on|off]   (default off)
    -INLINE:must=sub1:never=sub2
    -IPA:inline=[on|off]    (default on)

Software Pipelining (SWP)

Software pipelining is a way to mix iterations of a loop so that all processor execution slots are filled:
• SWP is performed by the Code Generator (CG), which also unrolls the inner loop to achieve the best SWP schedule (-O3 optimization level). This can be computationally intensive.
• Vector loops are well suited for SWP; short loops may run slower with SWP.

Inhibitors to SWP:
• loops with subroutine (or intrinsic) calls cannot be software pipelined
• loops with complicated conditionals or branching
• loops that are too long cannot be software pipelined because the compiler runs out of available registers (consider loop fission)
• loops with data dependences between iterations are harder to software pipeline

Summary

• Scalar optimization:
  – improving ILP by code transformation and grouping independent instructions
  – improving memory access by restructuring loop nests to take better advantage of the memory hierarchy
• Compilers are good at instruction level optimizations and loop transformations. It depends on the language, however:
  – F77 is the easiest for the compiler to work with
  – C is more difficult
  – F90/C++ are the most complex for compiler optimizations
• The user is responsible for presenting the code in a way that allows for compiler optimizations:
  – don't violate the language standard
  – write clean and clear code
  – consider the data structures for (false) sharing and alignment
  – consider the data structures for data dependencies
  – use the most natural presentation of algorithms using multi-dimensional arrays

Case Study: Vector Update

Scalar Optimization Techniques

Vector Update Code

    ll=0
    do jj=1,nj
      do ii=1,ni
        ll=ll+1
        res=0
        do n=1,nib
          na=ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
          nb=n+(jj-1)*nib
          ndb1=nmb1/2
          naa1=nma1+na
          nbb1=ndb1+nb
          res=res+p(naa1)*dp(nbb1)
        end do
        nde1=nme1/2
        lle1=nde1+ll
        dp(lle1)=dp(lle1)+res
      end do
    end do

• Profiling tells us that we spend most time in the inner loop over n.
• The update dp(lle1)=dp(lle1)+res is the net result of all the computations.

    L1 Cache   L2 Cache   TLB     Execution
    (sec)      (sec)      (sec)   (sec)
    50         37         215     286

Vector Update: Stride Analysis

    do jj=1,nj
      do ii=1,ni
        ….
        do n=1,nib
          na=ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
          nb=n+(jj-1)*nib
          ndb1=nmb1/2
          naa1=nma1+na
          nbb1=ndb1+nb
          res=res+P(naa1)*DP(nbb1)
        end do
        ….

• for the inner loop, the stride on array P is controlled by naa1:
      naa1 = nma1+ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
• the loop index is n, therefore the stride is nra
• the stride on array DP is controlled by nbb1:
      nbb1 = ndb1+n+(jj-1)*nib
• therefore the stride is 1
• loop exchange consideration (note: nra, nib ~5000):

    Inner loop over:   n     ii    jj
    stride on P:       nra   1     0
    stride on DP:      1     0     nib

• thus ii should be the inner loop

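The stride table can be reproduced by evaluating the two index expressions for neighbouring loop values. In this sketch the terms that do not depend on the loop variables (nma1, (i-1)*nru, (l-1)*nra*nrub, ndb1) are folded into a single `offset` parameter, which is an assumption made to keep the example short; the function names are hypothetical.

```c
/* Index into P for 1-based loop variables ii and n. */
long p_index(long ii, long n, long nra, long offset)
{
    return offset + ii + (n - 1) * nra;
}

/* Index into DP for 1-based loop variables n and jj. */
long dp_index(long n, long jj, long nib, long offset)
{
    return offset + n + (jj - 1) * nib;
}
```

The stride with respect to a loop variable is the difference between two calls that differ by one in that variable, for example `p_index(1, 2, nra, 0) - p_index(1, 1, nra, 0)` gives nra, while stepping ii gives 1.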
Vector Update: Loop Interchange

To interchange the loops they have to be properly nested:
• substitute expressions and eliminate temporary variables (NA and NB substituted, LL eliminated):

    do jj=1,nj
      do ii=1,ni
        res=0
        do n=1,nib
          ndb1=nmb1/2
          naa1=nma1+ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
          nbb1=ndb1+n+(jj-1)*nib
          res=res+p(naa1)*dp(nbb1)
        end do
        nde1=nme1/2
        lle1=nde1+ii+(jj-1)*ni
        dp(lle1)=dp(lle1)+res
      end do
    end do

• res can be eliminated by placing the update in the inner loop:

    do jj=1,nj
      do ii=1,ni
        do n=1,nib
          ndb1=nmb1/2
          naa1=nma1+ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
          nbb1=ndb1+n+(jj-1)*nib
          nde1=nme1/2
          lle1=nde1+ii+(jj-1)*ni
          dp(lle1)=dp(lle1)+p(naa1)*dp(nbb1)
        end do
      end do
    end do

• now the loops can be interchanged

Vector Update: DAXPY Form

Loops interchanged:

    ndb1=nmb1/2
    nde1=nme1/2
    do jj=1,nj
      do n=1,nib
        do ii=1,ni
          naa1=nma1+ii+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
          nbb1=ndb1+n+(jj-1)*nib
          lle1=nde1+ii+(jj-1)*ni
          dp(lle1)=dp(lle1)+p(naa1)*dp(nbb1)
        end do
      end do
    end do

Simplifying indexing….

    ndb1=nmb1/2
    nde1=nme1/2
    do jj=1,nj
      do n=1,nib
        naa1=nma1+(n-1)*nra+(i-1)*nru+(l-1)*nra*nrub
        dp_temp=dp(ndb1+n+(jj-1)*nib)
        lle1=nde1+(jj-1)*ni
        do ii=1,ni
          dp(lle1+ii)=dp(lle1+ii)+p(naa1+ii)*dp_temp
        end do
      end do
    end do

This is a DAXPY operation:

    ndb1=nmb1/2
    nde1=nme1/2
    id1 =nma1+(i-1)*nru+(l-1)*nra*nrub
    do jj=1,nj
      id2 = ndb1+(jj-1)*nib
      lle1= nde1+(jj-1)*ni
      id3 = id1
      do n=1,nib
        dp_temp=dp(id2+n)
        do ii=1,ni
          dp(lle1+ii)=dp(lle1+ii)+p(id3+ii)*dp_temp
        end do
        id3 = id3 + nra
      end do
    end do

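For reference, the DAXPY kernel that the inner loop has been reduced to looks like this in C. This is a plain sketch of the operation y = y + alpha*x, not the BLAS routine itself; the function name and argument order are chosen for the example.

```c
#include <stddef.h>

/* y[i] = y[i] + alpha * x[i] for i = 0..n-1, both arrays at stride 1. */
void daxpy(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}
```

In the vector update code, `alpha` corresponds to dp_temp, `x` to the section of p starting after index id3, and `y` to the section of dp starting after index lle1.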
Vector Update: 2D Form

With a DAXPY operation in the inner loop, we should consider further optimization with outer loop unrolling and blocking.
• hand tuning was necessary
• the compiler would not implement the loop interchange, because in the original code the loops are not properly nested
• with the DAXPY formulation, we can consider a 2-dimensional implementation of the code:

    real*8 dp(ni,nj), p(ni,nib)

    ndb1=nmb1/2
    nde1=nme1/2
    id1 =nma1+(i-1)*nru+(l-1)*nra*nrub

    do jj=nde1,nj
      do n=ndb1,nib
        dp_temp=dp(n,jj-nde1)
        do ii=1,ni
          dp(ii,jj)=dp(ii,jj)+p(ii,n)*dp_temp
        end do
      end do
    end do

Vector Update: Compiler Opt

Compiling the new 2D version with -O3:
• the compiler can automatically perform the necessary loop transformations

    DO tile2jj = 1, nj, 126
     DO tile1ii = 1, ni, 544
      DO n = 1, (nib + -3), 4
       DO jj = tile2jj, MIN((nj + -1), (tile2jj + 124)), 2
        mi0 = dp2(n, jj)
        mi1 = dp2(n + 3, jj + 1)
        mi2 = dp2(n + 2, jj + 1)
        mi3 = dp2(n + 1, jj + 1)
        mi4 = dp2(n, jj + 1)
        mi5 = dp2(n + 1, jj)
        mi6 = dp2(n + 2, jj)
        mi7 = dp2(n + 3, jj)
        DO ii = tile1ii, MIN((tile1ii + 543), ni), 1
         dp1(ii, jj) = (dp1(ii, jj) +(p(ii, n) * mi0))
         dp1(ii, jj) = (dp1(ii, jj) +(p(ii, n + 1) * mi5))
         dp1(ii, jj) = (dp1(ii, jj) +(p(ii, n + 2) * mi6))
         dp1(ii, jj) = (dp1(ii, jj) +(p(ii, n + 3) * mi7))
         dp1(ii, jj + 1) = (dp1(ii, jj + 1) +(p(ii, n) * mi4))
         dp1(ii, jj + 1) = (dp1(ii, jj + 1) +(p(ii, n + 1) * mi3))
         dp1(ii, jj + 1) = (dp1(ii, jj + 1) +(p(ii, n + 2) * mi2))
         dp1(ii, jj + 1) = (dp1(ii, jj + 1) +(p(ii, n + 3) * mi1))
        END DO
       END DO
       DO wd_jj0 = jj, MIN((tile2jj + 125), nj), 1
        mi8 = dp2(n, wd_jj0)
        mi9 = dp2(n + 1, wd_jj0)
        mi10 = dp2(n + 3, wd_jj0)
        mi11 = dp2(n + 2, wd_jj0)
        DO ii0 = tile1ii, MIN((tile1ii + 543), ni), 1
         dp1(ii0, wd_jj0) = (dp1(ii0, wd_jj0) +(p(ii0, n) * mi8))
         dp1(ii0, wd_jj0) = (dp1(ii0, wd_jj0) +(p(ii0, n + 1) * mi9))
         dp1(ii0, wd_jj0) = (dp1(ii0, wd_jj0) +(p(ii0, n + 2) * mi11))
         dp1(ii0, wd_jj0) = (dp1(ii0, wd_jj0) +(p(ii0, n + 3) * mi10))
        END DO
       END DO
      END DO
      DO wd_n = n, nib, 1
       DO jj0 = tile2jj, MIN((nj + -1), (tile2jj + 124)), 2
        mi12 = dp2(wd_n, jj0)
        mi13 = dp2(wd_n, jj0 + 1)
        DO ii1 = tile1ii, MIN((tile1ii + 543), ni), 1
         dp1(ii1, jj0) = (dp1(ii1, jj0) +(p(ii1, wd_n) * mi12))
         dp1(ii1, jj0 + 1) = (dp1(ii1, jj0 + 1) +(p(ii1, wd_n) * mi13))
        END DO
       END DO
       DO wd_jj = jj0, MIN((tile2jj + 125), nj), 1
        mi14 = dp2(wd_n, wd_jj)
        DO ii2 = tile1ii, MIN((tile1ii + 543), ni), 1
         dp1(ii2, wd_jj) = (dp1(ii2, wd_jj) +(p(ii2, wd_n) * mi14))
        END DO
       END DO
      END DO
     END DO
    END DO

Vector Update Summary

[Figures: three summary slides, the first labeled ORIGINAL CODE; their contents are not recoverable from the transcript.]