Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8,...

43
Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4

Transcript of Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8,...

Page 1: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Compiler Improvement of Register Usage

Chapter 8, through Section 8.4

Page 2: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Overview

• Improving memory hierarchy performance by compiler transformations—Scalar Replacement—Unroll-and-Jam

• Saving memory loads & stores

• Make good use of the processor registers

Page 3: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Motivating Example

DO I = 1, N

DO J = 1, M

A(I) = A(I) + B(J)

ENDDO

ENDDO

• A(I) can be left in a register throughout the inner loop

• Coloring based register allocation fails to recognize this

DO I = 1, N

T = A(I)

DO J = 1, M

T = T + B(J)

ENDDO

A(I) = T

ENDDO

• All loads and stores to A in the inner loop have been saved

• High chance of T being allocated a register by the coloring algorithm

Page 4: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Scalar Replacement

• Convert array reference to scalar reference to improve performance of the coloring based allocator

• Our approach is to use dependences to achieve these memory hierarchy transformations

Page 5: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Dependence and Memory Hierarchy

• True or Flow - save loads and cache miss

• Anti - save cache miss

• Output - save stores

• Input - save loads

A(I) = ... + B(I)

... = A(I) + k

A(I) = ...

... = B(I)

Page 6: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Dependence and Memory Hierarchy

• Loop Carried dependences - Consistent dependences most useful for memory management purposes

• Consistent dependences - dependences with constant threshold (dependence distance)

Page 7: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Dependence and Memory Hierarchy

• Problem of overcounting optimization opportunities. For example

S1: A(I) = ...

S2: ... = A(I)

S3: ... = A(I)

• But we can save only two memory references not three

• Solution - Prune edges from dependence graph which don’t correspond to savings in memory accesses

Page 8: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

• In the reduction example

DO I = 1, N

DO J = 1, M

A(I) = A(I) + B(J)

ENDDO

ENDDO

Using Dependences

DO I = 1, N

T = A(I)

DO J = 1, M

T = T + B(J)

ENDDO

A(I) = T

ENDDO

• True dependence - replace the references to A in the inner loop by scalar T

• Output dependence - store can be moved outside the inner loop

• Antidependence - load can be moved before the inner loop

Page 9: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Scalar Replacement

• Example: Scalar Replacement in case of loop independent dependence

DO I = 1, N

A(I) = B(I) + C

X(I) = A(I)*Q

ENDDO

DO I = 1, N

t = B(I) + C

A(I) = t

X(I) = t*Q

ENDDO

• One less load for each iteration for reference to A

Page 10: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Scalar Replacement

• Example: Scalar Replacement in case of loop carried dependence spanning single iteration

DO I = 1, N

A(I) = B(I-1)

B(I) = A(I) + C(I)

ENDDO

tB = B(0)

DO I = 1, N

tA = tB

A(I) = tA

tB = tA + C(I)

B(I) = tB

ENDDO

• One less load for each iteration for reference to B which had a loop carried true dependence spanning 1 iteration

• Also one less load per iteration for reference to A

Page 11: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Scalar Replacement

• Example: Scalar Replacement in case of loop carried dependence spanning multiple iterations

DO I = 1, N

A(I) = B(I-1) + B(I+1)

ENDDO

t1 = B(0)

t2 = B(1)

DO I = 1, N

t3 = B(I+1)

A(I) = t1 + t3

t1 = t2

t2 = t3

ENDDO

• One less load for each iteration for reference to B which had a loop carried input dependence spanning 2 iterations

• Invariants maintained were t1=B(I-1);t2=B(I);t3=B(I+1)

Page 12: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Preloop

Main Loop

Eliminate Scalar Copies

t1 = B(0)

t2 = B(1)

DO I = 1, N

t3 = B(I+1)

A(I) = t1 + t3

t1 = t2

t2 = t3

ENDDO

• Unnecessary register-register copies

• Unroll loop 3 times

t1 = B(0)

t2 = B(1)

mN3 = MOD(N,3)

DO I = 1, mN3

t3 = B(I+1)

A(I) = t1 + t3

t1 = t2

t2 = t3

ENDDO

DO I = mN3 + 1, N, 3

t3 = B(I+1)

A(I) = t1 + t3

t1 = B(I+2)

A(I+1) = t2 + t1

t2 = B(I+3)

A(I+2) = t3 + t2

ENDDO

Page 13: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

DO I = 1, N

A(I+1) = A(I-1) + B(I-1)

A(I) = A(I) + B(I) + B(I+1)

ENDDO

• Dependence pattern before pruning

• Not all edges suggest memory access savings

Pruning the dependence graph

Page 14: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Pruning the dependence graphDO I = 1, N

A(I+1) = A(I-1) + B(I-1)

A(I) = A(I) + B(I) + B(I+1)

ENDDO

• Dependence pattern before pruning

• Not all edges suggest memory access savings

DO I = 1, N

A(I+1) = A(I-1) + B(I-1)

A(I) = A(I) + B(I) + B(I+1)

ENDDO

• Dashed edges are pruned

• Red-colored array references are generators

• Each reference has at most one predecessor in the pruned graph

Page 15: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Pruning the dependence graphDO I = 1, N

A(I+1) = A(I-1) + B(I-1)

A(I) = A(I) + B(I) + B(I+1)

ENDDO

• Apply scalar replacement after pruning the dependence graph

t0A = A(0); t1A0 = A(1); tB1 = B(0)

DO I = 1, N

t1A1 = t0A + tB1

tB3 = B(I+1)

t0A = t1A0 + tB3 + tB2

A(I) = t0A

t1A0 = t1A1

tB1 = tB2

tB2 = tB3

ENDDO

A(N+1) = t1A1

• Only one load and one store per iteration

Page 16: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Pruning the dependence graph

• Prune flow and input dependence edges that do not represent a potential reuse

• Prune redundant input dependence edges

• Prune output dependence edges after rest of the pruning is done

Page 17: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Pruning the dependence graph

• Phase 1: Eliminate killed dependences— When killed dependence is a flow dependence

S1: A(I+1) = ...

S2: A(I) = ...

S3: ... = A(I)– Store in S2 is a killing store. Flow dependence from S1

to S3 is pruned— When killed dependence is an input dependence

S1: ... = A(I+1)

S2: A(I) = ...

S3: ... = A(I-1)– Store in S2 is a killing store. Input dependence from S1

to S3 is pruned

Page 18: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Pruning the dependence graph

• Phase 2: Identify generators

DO I = 1, N

A(I+1) = A(I-1) + B(I-1)

A(I) = A(I) + B(I) + B(I+1)

ENDDO

• Any assignment reference with at least one flow dependence emanating from it to another statement in the loop

• Any use reference with at least one input dependence emanating from it and no input or flow dependence into it

Page 19: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Pruning the dependence graph

• Phase 3: Find name partitions and eliminate input dependences—Use Typed Fusion

– References as vertices– An edge joins two references– Output and anti- dependences are bad edges– Name of array as type

• Eliminate input dependences between two elements of same name partition unless source is a generator

Page 20: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Pruning the dependence graph

• Special cases—Reference is in a dependence cycle in the loop

DO I = 1, N

A(J) = B(I) + C(I,J)

C(I,J) = A(J) + D(I)

ENDDO

• Assign single scalar to the reference in the cycle

• Replace A(J) by a scalar tA and insert A(J)=tA before or after the loop depending on upward/downward exposed occurrence

Page 21: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Pruning the dependence graph

• Special cases: Inconsistent dependences

DO I = 1, N

A(I) = A(I-1) + B(I)

A(J) = A(J) + A(I)

ENDDO

• Store to A(J) kills A(I)

• Only one scalar replacement possible

DO I = 1, N

tAI = A(I-1) + B(I)

A(I) = tAI

A(J) = A(J) + tAI

ENDDO

• This code can be improved substantially by index set splitting

Page 22: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Pruning the dependence graph

DO I = 1, N

tAI = A(I-1) + B(I)

A(I) = tAI

A(J) = A(J) + tAI

ENDDO

• Split this loop into three separate parts—A loop up to J— Iteration J—A loop after iteration J to N

tAI = A(0); tAJ = A(J)

JU = MAX(J-1,0)

DO I = 1, JU

tAI = tAI + B(I); A(I) = tAI

tAJ = tAJ + tAI

ENDDO

IF(J.GT.0.AND.J.LE.N) THEN

tAI = tAI + B(I); A(I) = tAI

tAJ = tAJ + tAI

tAI = tAJ

ENDIF

DO I = JU+2, N

tAI = tAI + B(I); A(I) = tAI

tAJ = tAJ + tAI

ENDDO

A(J) = tAJ

Page 23: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Moderation of Register Pressure

• Scalar replacement of all name partitions will produce a lot of scalar quantities which compete for floating point registers

• Have to choose name partitions for scalar replacement to maximize register usage

Page 24: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Moderation of Register Pressure

• Attach two parameters to each name partition R—v(R): the value of the name partition R

– Number of loads or stores saved by replacing each reference in R by register resident scalars

—c(R): the cost of the name partition R– Number of registers needed to hold all scalar

temporary values

Page 25: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

c(Ri)≤ni=1

m

∑ v(Ri)i=1

m

Moderation of Register Pressure

• Choose subset {R1,…,Rm} such that

and maximize

• 0-1 bin packing problem—Can use dynamic programming: O(nm)—Can use heuristic

– Order name partitions according to the ratio v(R)/c(R) and select elements at the beginning of the list till all registers are exhausted

Page 26: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Scalar Replacement: Putting it together

1. Prune dependence graph; Apply typed fusion

2. Select a set of name partitions using register pressure moderation

3. For each selected partitionA) If non-cyclic, replace using set of temporariesB) If cyclic replace reference with single temporaryC) For each inconsistent dependence

Use index set splitting or insert loads and stores

4. Unroll loop to eliminate scalar copies

Page 27: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Scalar Replacement: Case ADO I = 1, N

A(I+1) = A(I-1) + B(I-1)

A(I) = A(I) + B(I) + B(I+1)

ENDDO

t0A = A(0); t1A0 = A(1); tB1 = B(0)

DO I = 1, N

t1A1 = t0A + tB1

tB3 = B(I+1)

t0A = t1A0 + tB3 + tB2

A(I) = t0A

t1A0 = t1A1

tB1 = tB2

tB2 = tB3

ENDDO

A(N+1) = t1A1

Page 28: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Scalar Replacement: Case B

DO I = 1, N

A(J) = B(I) + C(I,J)

C(I,J) = A(J) + D(I)

ENDDO

• replace with single temporary...

DO I = 1, N

tA = B(I) + C(I,J)

C(I,J) = tA + D(I)

ENDDO

A(J) = tA

Page 29: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Scalar Replacement: Case CDO I = 1, N

tAI = A(I-1) + B(I)

A(I) = tAI

A(J) = A(J) + tAI

ENDDO

• Split this loop into three separate parts—A loop up to J— Iteration J—A loop after iteration J to N

tAI = A(0); tAJ = A(J)

JU = MAX(J-1,0)

DO I = 1, JU

tAI = tAI + B(I); A(I) = tAI

tAJ = tAJ + tAI

ENDDO

IF(J.GT.0.AND.J.LE.N) THEN

tAI = tAI + B(I); A(I) = tAI

tAJ = tAJ + tAI

tAI = tAJ

ENDIF

DO I = JU+2, N

tAI = tAI + B(I); A(I) = tAI

tAJ = tAJ + tAI

ENDDO

A(J) = tAJ

Page 30: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Experiments on Scalar Replacement

Page 31: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Experiments on Scalar Replacement

Page 32: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Unroll-and-Jam

DO I = 1, N*2

DO J = 1, M

A(I) = A(I) + B(J)

ENDDO

ENDDO

• Can we achieve reuse of references to B ?

• Use transformation called Unroll-and-Jam

DO I = 1, N*2, 2

DO J = 1, M

A(I) = A(I) + B(J)

A(I+1) = A(I+1) + B(J)

ENDDO

ENDDO

• Unroll outer loop twice and then fuse the copies of the inner loop

• Brought two uses of B(J) together

Page 33: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Unroll-and-JamDO I = 1, N*2, 2

DO J = 1, M

A(I) = A(I) + B(J)

A(I+1) = A(I+1) + B(J)

ENDDO

ENDDO

• Apply scalar replacement on this code

DO I = 1, N*2, 2

s0 = A(I)

s1 = A(I+1)

DO J = 1, M

t = B(J)

s0 = s0 + t

s1 = s1 + t

ENDDO

A(I) = s0

A(I+1) = s1

ENDDO

• Half the number of loads as the original program

Page 34: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Legality of Unroll-and-Jam

• Is unroll-and-jam always legal?

DO I = 1, N*2

DO J = 1, M

A(I+1,J-1) = A(I,J) + B(I,J)

ENDDO

ENDDO

• Apply unroll-and-jam

DO I = 1, N*2, 2

DO J = 1, M

A(I+1,J-1) = A(I,J) + B(I,J)

A(I+2,J-1) = A(I+1,J) + B(I+1,J)

ENDDO

ENDDO

• This is wrong!!!

Page 35: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Legality of Unroll-and-Jam

Page 36: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Legality of Unroll-and-Jam

• Direction vector in this example was (<,>)—This makes loop interchange illegal—Unroll-and-Jam is loop interchange followed by unrolling

inner loop followed by another loop interchange

• But does loop interchange illegal imply unroll-and-jam illegal ? NO

Page 37: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Legality of Unroll-and-Jam

• Consider this example

DO I = 1, N*2

DO J = 1, M

A(I+2,J-1) = A(I,J) + B(I,J)

ENDDO

ENDDO

• Direction vector is (<,>); still unroll-and-jam possible

Page 38: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Conditions for legality of unroll-and-jam

• Definition: Unroll-and-jam to factor n consists of unrolling the outer loop n-1 times and fusing those copies together.

• Theorem: An unroll-and-jam to a factor of n is legal iff there exists no dependence with direction vector (<,>) such that the distance for the outer loop is less than n.

Page 39: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Unroll-and-jam Algorithm

1. Create preloop

2. Unroll main loop m(the unroll-and-jam factor) times

3. Apply typed fusion to loops within the body of the unrolled loop

4. Apply unroll-and-jam recursively to the inner nested loop

Page 40: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Unroll-and-jam exampleDO I = 1, N

DO K = 1, N

A(I) = A(I) + X(I,K)

ENDDO

DO J = 1, M

DO K = 1, N

B(J,K) = B(J,K) + A(I)

ENDDO

ENDDO

DO J = 1, M

C(J,I) = B(J,N)/A(I)

ENDDO

ENDDO

DO I = mN2+1, N, 2

DO K = 1, N

A(I) = A(I) + X(I,K)

A(I+1) = A(I+1) + X(I+1,K)

ENDDO

DO J = 1, M

DO K = 1, N

B(J,K) = B(J,K) + A(I)

B(J,K) = B(J,K) + A(I+1)

ENDDO

C(J,I) = B(J,N)/A(I)

C(J,I+1) = B(J,N)/A(I+1)

ENDDO

ENDDO

Page 41: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Unroll-and-jam: Experiments

Page 42: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Unroll-and-jam: Experiments

Page 43: Optimizing Compilers for Modern Architectures Compiler Improvement of Register Usage Chapter 8, through Section 8.4.

Optimizing Compilers for Modern Architectures

Conclusion

• We have learned two memory hierarchy transformations:—scalar replacement —unroll-and-jam

• They reduce the number of memory accesses by maximum use of processor registers