Improving Register Usage

Chapter 8, Section 8.5 End.

Omer Yehezkely

Agenda

• Last Lecture at a Glance
• Loop Interchange for Register Reuse
• Loop Fusion for Register Reuse
• Putting It All Together
• Complex Loop Nests
• Summary

Last lecture at a glance (1)

Assumption 1: Most compilers can handle register allocation for scalars (using a graph coloring algorithm). However, they don’t know how to handle vectors (array elements).

Assumption 2: We are dealing with RISC processors. All of the CPU operations need their data in registers (except for load and store operations).

Assumption 3: Memory Hierarchy: Accessing a register is much faster than a cache hit, which is much faster than a cache miss that goes to main memory, which in turn is much faster than accessing virtual memory (the swap file).


Therefore our strategy will be: Do some transformation that will “expose” vector entries as scalars, and then let the good old compiler do the register allocation.

We will benefit from avoiding unnecessary Load / Store operations.


Example: (Scalar Replacement)

Original Code

DO I = 1, N
   DO J = 1, M
      A(I) = A(I) + B(J)
   ENDDO
ENDDO

Scalar Replacement

DO I = 1, N
   T = A(I)
   DO J = 1, M
      T = T + B(J)
   ENDDO
   A(I) = T
ENDDO
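To make the saving concrete, here is a minimal Python sketch (not the Fortran of the slides) that counts element loads and stores for both versions of the loop. The `CountingArray` helper and the function names are hypothetical, introduced only for this illustration, and indices are 0-based.

```python
class CountingArray:
    """List wrapper that counts element loads and stores (illustrative helper)."""

    def __init__(self, values):
        self.values = list(values)
        self.loads = 0
        self.stores = 0

    def __getitem__(self, i):
        self.loads += 1
        return self.values[i]

    def __setitem__(self, i, v):
        self.stores += 1
        self.values[i] = v


def original(a, b, n, m):
    # A(I) is loaded and stored in every inner iteration.
    for i in range(n):
        for j in range(m):
            a[i] = a[i] + b[j]


def scalar_replaced(a, b, n, m):
    # A(I) is loaded once before and stored once after the inner loop.
    for i in range(n):
        t = a[i]
        for j in range(m):
            t = t + b[j]
        a[i] = t


N, M = 4, 5
a1, b1 = CountingArray(range(N)), CountingArray(range(M))
a2, b2 = CountingArray(range(N)), CountingArray(range(M))
original(a1, b1, N, M)
scalar_replaced(a2, b2, N, M)
print(a1.loads, a1.stores)  # N*M loads and N*M stores of A
print(a2.loads, a2.stores)  # N loads and N stores of A
```

Both versions compute the same values; only the number of accesses to A changes, from N*M to N in each direction.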


Dependences to consider:

True dependence:     A(I) = …   followed by   … = A(I)

Output dependence:   A(I) = …   followed by   A(I) = …

Antidependence:      … = A(I)   followed by   A(I) = …

Input dependence:    … = A(I)   followed by   … = A(I)


•We should also consider Loop Carried and Loop Independent dependences.

•In general, the more dependences the merrier, because each dependence is a potential opportunity for register reuse.

•We will use the dependences to decide if and how to “expose” the vectors as scalars.


We saw:

•Scalar Replacement (see first example) – this is the actual “exposure”.

•Unroll and Jam – Unrolling of loops in order to bring dependences that are carried by an outer loop into the inner loop. This can benefit register reuse if we apply Scalar Replacement afterwards.


Example: (Unroll and Jam)

Original Code

DO I = 1, N*2

DO J = 1, M

A(I) = A(I) + B(J)

ENDDO

ENDDO

Unroll and Jam

DO I = 1, N*2, 2

DO J = 1, M

A(I) = A(I) + B(J)

A(I+1) = A(I+1) +B(J)

ENDDO

ENDDO

Scalar Replacement

DO I = 1, N*2, 2

s0 = A(I)

s1 = A(I+1)

DO J = 1, M

t = B(J)

s0 = s0 + t

s1 = s1 + t

ENDDO

A(I) = s0

A(I+1) = s1

ENDDO
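As a sanity check, the three versions above can be mirrored in a short Python sketch and compared. This is only an illustration under stated assumptions: indices are 0-based, the length of `a` is assumed to be even (it plays the role of N*2), and the function names are made up for the example.

```python
def original(a, b):
    a = list(a)
    for i in range(len(a)):
        for j in range(len(b)):
            a[i] = a[i] + b[j]
    return a


def unroll_and_jam(a, b):
    # Outer loop unrolled by 2, the two inner loop bodies jammed together.
    a = list(a)
    for i in range(0, len(a), 2):
        for j in range(len(b)):
            a[i] = a[i] + b[j]
            a[i + 1] = a[i + 1] + b[j]
    return a


def unroll_jam_scalar(a, b):
    # Scalar replacement: s0, s1 and t stand for registers.
    a = list(a)
    for i in range(0, len(a), 2):
        s0, s1 = a[i], a[i + 1]
        for j in range(len(b)):
            t = b[j]          # B(J) is loaded once and used twice
            s0 = s0 + t
            s1 = s1 + t
        a[i], a[i + 1] = s0, s1
    return a


a = [10, 20, 30, 40]
b = [1, 2, 3]
print(original(a, b), unroll_and_jam(a, b), unroll_jam_scalar(a, b))
```

The reuse that unroll and jam exposes is the single load of B(J) feeding two additions.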

Loop Interchange (1)

Loop nesting is not always optimal with regard to register reuse. For example, on CPUs with no vector engines, the following code (matrix initialization):

DO I = 2, N
   A(1:M, I) = A(1:M, I-1)
ENDDO

Will be converted into:

DO I = 2, N
   DO J = 1, M
      A(J, I) = A(J, I-1)
   ENDDO
ENDDO

Loop Interchange (2)

Which will be implemented in the following way:

DO I = 2, N

DO J = 1, M

R1 = A(J, I-1)

A(J, I) = R1

ENDDO

ENDDO

Which is not too clever, since it performs (N-1)*M Load operations and (N-1)*M Store operations.

If we change the order of the loops we can get a better implementation.


Original Code

DO I = 2, N
   DO J = 1, M
      A(J, I) = A(J, I-1)
   ENDDO
ENDDO

Loop Interchange

DO J = 1, M
   DO I = 2, N
      A(J, I) = A(J, I-1)
   ENDDO
ENDDO

Scalar Replacement

DO J = 1, M
   R1 = A(J, 1)
   DO I = 2, N
      A(J, I) = R1
   ENDDO
ENDDO

This implementation still requires (N-1)*M Store operations (we can’t escape that), but it only requires M Load operations which can make the running time considerably shorter.
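The load count can be checked with a small Python sketch of the original nest and of the interchanged, scalar-replaced nest. It is an illustration only: the matrix is a list of rows indexed `a[j][i]` with 0-based indices, and the explicit `loads` counter is an assumption added for the example.

```python
def original(a):
    a = [row[:] for row in a]
    m, n = len(a), len(a[0])
    loads = 0
    for i in range(1, n):
        for j in range(m):
            r1 = a[j][i - 1]   # one load per inner iteration
            loads += 1
            a[j][i] = r1
    return a, loads


def interchanged_scalar(a):
    a = [row[:] for row in a]
    m, n = len(a), len(a[0])
    loads = 0
    for j in range(m):
        r1 = a[j][0]           # one load per row
        loads += 1
        for i in range(1, n):
            a[j][i] = r1
    return a, loads


a = [[1, 2, 3], [4, 5, 6]]     # M = 2 rows, N = 3 columns
print(original(a))
print(interchanged_scalar(a))
```

With M = 2 and N = 3 the original performs (N-1)*M = 4 loads, the transformed version M = 2.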


Considerations for Loop Interchange

The basic idea is to move the loop that carries the most dependences to the innermost position.

Register reuse across an outer loop usually cannot be achieved due to limited register resources.

We use the conventional direction matrix for the loop nest.


Example:

DO J = 1, N

DO K = 1, N

DO I = 1, 256

A(I, J, K) = A(I, J-1, K) + A(I, J-1, K-1) + A(I, J, K-1)

ENDDO

ENDDO

ENDDO

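There are 3 true dependences, which result in the following direction matrix (columns in loop order J, K, I):

   J  K  I
   <  =  =     from A(I, J-1, K)
   <  <  =     from A(I, J-1, K-1)
   =  <  =     from A(I, J, K-1)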


Example (cont.):

If we select the J loop to be the innermost we get:

DO K = 1, N

DO I = 1, 256

DO J = 1, N

A(I, J, K) = A(I, J-1, K) + &

A(I, J-1, K-1) + A(I, J, K-1)

ENDDO

ENDDO

ENDDO

After Scalar Replacement:

DO K = 1, N

DO I = 1, 256

R1 = A(I, 0, K)

DO J = 1, N

R1 = R1 + A(I, J-1, K-1) + &

A(I, J, K-1)

A(I, J, K) = R1

ENDDO

ENDDO

ENDDO

We saved a Load operation in each iteration. It is possible to interchange the 2 outer loops and get further optimization.


Loop Interchange Algorithm:

1. Form the direction matrix for the loop nest and use it to identify the loops other than the scalarization loop that can legally be moved to the innermost position

2. For each such loop L, let count(L) be the number of rows of the direction matrix that have "<" in the position corresponding to L and "=" in every other position.

3. Pick the loop L that maximizes the product of count(L) and the iteration count of loop L.

• Some assumptions need to be taken when the bounds of the loop are unknown at compile time.

• Loop interchange should be weighed against cache efficiency (next chapter)


Example

Loop (outermost first)    # of loop iterations    count(L)    Product
1                         100                     2           100 * 2 = 200
2                         65                      3           65 * 3 = 195
3                         150                     1           150 * 1 = 150
4                         1,000                   0           1,000 * 0 = 0

The outermost loop (100 * 2 = 200) should be moved to the innermost position.
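Steps 2 and 3 of the algorithm can be sketched in a few lines of Python. This is a minimal illustration, not the textbook's code: the legality test of step 1 is assumed to have been done already (its result is passed in as `legal_loops`), and the four-loop direction matrix below is a hypothetical one, constructed only so that its counts match the example (2, 3, 1, 0).

```python
def count(direction_matrix, loop):
    """Rows with '<' in position `loop` and '=' in every other position."""
    return sum(
        1
        for row in direction_matrix
        if row[loop] == "<"
        and all(d == "=" for k, d in enumerate(row) if k != loop)
    )


def pick_innermost(direction_matrix, trip_counts, legal_loops):
    """Return the legal loop maximizing count(L) * iteration count of L."""
    return max(
        legal_loops,
        key=lambda l: count(direction_matrix, l) * trip_counts[l],
    )


# Direction matrix of the J, K, I nest from the earlier example.
jki = [("<", "=", "="), ("<", "<", "="), ("=", "<", "=")]
jki_counts = [count(jki, l) for l in range(3)]

# Hypothetical matrix whose counts match the four-loop example.
rows = (
    [("<", "=", "=", "=")] * 2
    + [("=", "<", "=", "=")] * 3
    + [("=", "=", "<", "=")]
)
trips = [100, 65, 150, 1000]
scores = [count(rows, l) * trips[l] for l in range(4)]
best = pick_innermost(rows, trips, range(4))
print(jki_counts, scores, best)
```

When loop bounds are unknown at compile time, `trip_counts` would have to hold estimates, as the algorithm's first remark notes.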


Loop Fusion (1)

Example:

On CPUs with no vector engines the following code:

A(1:N) = C(1:N) + D(1:N)
B(1:N) = C(1:N) - D(1:N)

Will be transformed into:

DO I = 1, N
   A(I) = C(I) + D(I)
ENDDO
DO I = 1, N
   B(I) = C(I) - D(I)
ENDDO

Loop Fusion (2)

Using Loop Fusion (chapter 6) we get:

DO I = 1, N
   A(I) = C(I) + D(I)
   B(I) = C(I) - D(I)
ENDDO

Using Scalar Replacement we can save on the fetching time of C(I) and D(I):

DO I = 1, N
   R1 = C(I)
   R2 = D(I)
   A(I) = R1 + R2
   B(I) = R1 - R2
ENDDO

Loop Fusion (3)

Profitable Loop Fusion for Register Reuse

Just because a loop fusion is safe does not mean it is profitable.

There are 2 cases where the fusion may be profitable:

•The fusion results in a loop independent dependence (as we just saw).

•The fusion results in a forward loop carried dependence.

Loop Fusion (4)

Example: (forward loop carried dependence)

DO J = 1, N

DO I = 1, M

A(I,J) = C(I,J)+D(I,J)

ENDDO

DO I = 1, M

B(I,J) = A(I,J-1)-E(I,J)

ENDDO

ENDDO

Fusion:

DO J = 1, N
   DO I = 1, M
      A(I,J) = C(I,J)+D(I,J)
      B(I,J) = A(I,J-1)-E(I,J)
   ENDDO
ENDDO

Loop Fusion (5)

Loop Interchange:

DO I = 1, M
   DO J = 1, N
      A(I,J) = C(I,J)+D(I,J)
      B(I,J) = A(I,J-1)-E(I,J)
   ENDDO
ENDDO

Statement Order Reversing:

DO I = 1, M
   DO J = 1, N
      B(I,J) = A(I,J-1)-E(I,J)
      A(I,J) = C(I,J)+D(I,J)
   ENDDO
ENDDO

Scalar Replacement:

DO I = 1, M
   R1 = A(I, 0)
   DO J = 1, N
      B(I,J) = R1 - E(I,J)
      R1 = C(I,J)+D(I,J)
      A(I,J) = R1
   ENDDO
ENDDO
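The whole chain (fusion, interchange, statement reversal, scalar replacement) can be checked against the original two-loop code with a Python sketch. The layout is an assumption of the example: arrays are lists indexed `x[i][j]` with I running over 0..M-1 and J over 0..N, so that column 0 plays the role of A(I,0).

```python
def original(a, c, d, e, m, n):
    a = [row[:] for row in a]
    b = [[0] * (n + 1) for _ in range(m)]
    for j in range(1, n + 1):
        for i in range(m):
            a[i][j] = c[i][j] + d[i][j]
        for i in range(m):
            b[i][j] = a[i][j - 1] - e[i][j]
    return a, b


def transformed(a, c, d, e, m, n):
    a = [row[:] for row in a]
    b = [[0] * (n + 1) for _ in range(m)]
    for i in range(m):
        r1 = a[i][0]                  # value of A(I, J-1) for J = 1
        for j in range(1, n + 1):
            b[i][j] = r1 - e[i][j]    # uses A(I, J-1) from the register
            r1 = c[i][j] + d[i][j]    # becomes A(I, J-1) of the next iteration
            a[i][j] = r1
    return a, b


M, N = 3, 4
grid = lambda f: [[f(i, j) for j in range(N + 1)] for i in range(M)]
a0 = grid(lambda i, j: 100 * i + j)
c0 = grid(lambda i, j: i + 2 * j)
d0 = grid(lambda i, j: 3 * i - j)
e0 = grid(lambda i, j: i * j)
print(original(a0, c0, d0, e0, M, N) == transformed(a0, c0, d0, e0, M, N))
```

Reversing the two statements is what lets a single register R1 carry the value from one J iteration to the next.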

Loop Fusion (6)

Loop Alignment for Fusion

Reminder: Blocking dependences cause problems for loop fusion.

DO I = 1, M

DO J = 1, N

A(J,I) = B(J,I) + 1.0

ENDDO

DO J = 1, N

C(J,I) = A(J+1,I) + 2.0

ENDDO

ENDDO

We cannot simply fuse the two loops, because that would introduce a backward-carried antidependence: iteration J would read A(J+1,I) before iteration J+1 writes it.

Loop Fusion (7)

We can overcome this problem by aligning the loops:

DO I = 1, M

DO J = 0, N-1

A(J+1,I) = B(J+1,I) + 1.0

ENDDO

DO J = 1, N

C(J,I) = A(J+1,I) + 2.0

ENDDO

ENDDO

We can now fuse the two loops on their common iteration range while peeling a single iteration from the beginning of the first loop and one iteration from the end of the second loop.

Loop Fusion (8)

Hence we get:

DO I = 1, M

A(1,I) = B(1,I) + 1.0

DO J = 1, N-1

A(J+1,I) = B(J+1,I) + 1.0

C(J,I) = A(J+1,I) + 2.0

ENDDO

C(N,I) = A(N+1,I) + 2.0

ENDDO

Scalar Replacement

DO I = 1, M

A(1,I) = B(1,I) + 1.0

DO J = 1, N-1

R1 = B(J+1,I) + 1.0

A(J+1,I) = R1

C(J,I) = R1 + 2.0

ENDDO

C(N,I) = A(N+1,I) + 2.0

ENDDO
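A short Python sketch can confirm both claims: that naive fusion changes the result, and that the aligned, peeled and scalar-replaced version does not. It models a single column (one value of I) and keeps the 1-based indices of the slides by leaving index 0 unused; these choices and the function names are assumptions of the illustration.

```python
def original(a, b, n):
    a, c = a[:], [0] * (n + 1)
    for j in range(1, n + 1):
        a[j] = b[j] + 1
    for j in range(1, n + 1):
        c[j] = a[j + 1] + 2
    return a, c


def naive_fusion(a, b, n):
    # Illegal: c[j] reads a[j + 1] before the next iteration writes it.
    a, c = a[:], [0] * (n + 1)
    for j in range(1, n + 1):
        a[j] = b[j] + 1
        c[j] = a[j + 1] + 2
    return a, c


def aligned_fused(a, b, n):
    a, c = a[:], [0] * (n + 1)
    a[1] = b[1] + 1                # peeled first iteration of the first loop
    for j in range(1, n):
        r1 = b[j + 1] + 1
        a[j + 1] = r1
        c[j] = r1 + 2              # reuses the value just computed
    c[n] = a[n + 1] + 2            # peeled last iteration of the second loop
    return a, c


n = 5
a0 = [0, 10, 20, 30, 40, 50, 60]   # indices 1..n+1 are used
b0 = [0, 1, 2, 3, 4, 5]            # indices 1..n are used
print(original(a0, b0, n))
print(naive_fusion(a0, b0, n))
print(aligned_fused(a0, b0, n))
```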

Loop Fusion (9)

Definition:

Let δ be a dependence between loops.

The Alignment Threshold of δ is defined as follows:

•If δ is loop independent after merging, threshold(δ) = 0

•If δ is forward carried after merging, threshold(δ) is the negative of the resulting dependence threshold.

•If δ is fusion preventing, threshold(δ) is the threshold of the merged dependence.

Aligning by the largest threshold allows fusion.

Loop Fusion (10)

Example:

DO I = 1, N

A(I) = B(I) + 1.0

ENDDO

DO I = 1, N

C(I) = A(I+1) + A(I-1)

ENDDO

We have 2 dependences:

1. Forward carried with a threshold of 1, because of the reference A(I-1). This gives an alignment threshold of -1.

2. Backward carried with a threshold of 1, because of the reference A(I+1). This gives an alignment threshold of +1.

Loop Fusion (11)

Since (+1) > (-1) we should align by the alignment threshold: (+1)

And so we get:

DO I = 0, N-1
   A(I+1) = B(I+1) + 1.0
ENDDO
DO I = 1, N
   C(I) = A(I+1) + A(I-1)
ENDDO

From here we can proceed to fuse the loops and then “Scalar Replace” A(I+1).
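The remaining steps can be sketched in Python as follows. The sketch makes two assumptions that are not stated on the slides: for a producer writing A(I) and a consumer reading A(I+k) it takes the alignment threshold to be k (which matches the two cases above), and it peels one iteration from each loop before fusing on the common range 1..N-1. Index 0 and index N+1 of `a` hold values that the first loop never writes.

```python
def original(a, b, n):
    a, c = a[:], [0] * (n + 1)
    for i in range(1, n + 1):
        a[i] = b[i] + 1
    for i in range(1, n + 1):
        c[i] = a[i + 1] + a[i - 1]
    return a, c


def aligned_fused(a, b, n):
    a, c = a[:], [0] * (n + 1)
    a[1] = b[1] + 1                # peeled I = 0 of the shifted first loop
    for i in range(1, n):          # common iteration range 1..N-1
        t = b[i + 1] + 1
        a[i + 1] = t
        c[i] = t + a[i - 1]        # A(I+1) comes from the register t
    c[n] = a[n + 1] + a[n - 1]     # peeled I = N of the second loop
    return a, c


read_offsets = [+1, -1]            # from A(I+1) and A(I-1)
shift = max(read_offsets)          # align by the largest threshold

n = 5
a0 = [7, 10, 20, 30, 40, 50, 60]   # indices 0..n+1
b0 = [0, 1, 2, 3, 4, 5]            # indices 1..n are used
print(shift, original(a0, b0, n) == aligned_fused(a0, b0, n))
```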

Loop Fusion (12)

Fusion Mechanics

Assuming we have a collection of aligned loops how do we fuse them?

1. Sort the lower bounds of the loops into a nondecreasing sequence {L1, L2, …, Ln} and sort the upper bounds of the loops into a nondecreasing sequence {H1, H2, …, Hn}.

2. Produce a sequence of fusion loops with lower bounds of L1, L2, …, L(n-1) and respective upper bounds of L2 - 1, L3 - 1, …, Ln - 1.

3. Produce the central fused loop with a lower bound of Ln and an upper bound of H1.

4. Produce a sequence of fusion loops with lower bounds of H1 + 1, H2 + 1, …, H(n-1) + 1 and respective upper bounds of H2, H3, …, Hn.
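The four steps translate almost directly into code. The following Python sketch is one possible reading of them: the function name `fusion_segments`, the inclusive `(low, high)` representation, and the decision to skip empty ranges are assumptions of this example. For every generated loop it also reports which of the original loops contribute statements to it.

```python
def fusion_segments(bounds):
    """bounds: list of inclusive (low, high) pairs, one per aligned loop.

    Returns a list of (start, end, active) triples, where `active` lists the
    indices of the original loops whose bodies run in that range.
    """
    lows = sorted(lo for lo, _ in bounds)       # L1 <= L2 <= ... <= Ln
    highs = sorted(hi for _, hi in bounds)      # H1 <= H2 <= ... <= Hn
    if lows[-1] > highs[0]:
        raise ValueError("the loops have no common iteration range")

    # Steps 2-4: starts L1..Ln, H1+1..H(n-1)+1; ends L2-1..Ln-1, H1..Hn.
    starts = lows + [h + 1 for h in highs[:-1]]
    ends = [l - 1 for l in lows[1:]] + highs

    segments = []
    for start, end in zip(starts, ends):
        if start > end:
            continue                            # two bounds coincide
        active = [
            k for k, (lo, hi) in enumerate(bounds)
            if lo <= start and end <= hi
        ]
        segments.append((start, end, active))
    return segments


two = fusion_segments([(0, 4), (1, 5)])
three = fusion_segments([(1, 10), (3, 8), (2, 12)])
print(two)
print(three)
```

The first call reproduces the peeling of the alignment example with N = 5: one iteration of the first loop alone, the fused range 1..4, and one iteration of the second loop alone.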

Loop Fusion (13)

Example

[Figure: iteration ranges of Loop 1, Loop 2, and Loop 3 after alignment. Each color represents a fusion loop.]

Loop Fusion (14)

The Weighted Fusion Problem

The last thing to do is to form the collections of the loops to be fused. We need to do it in a profitable manner.

Example

L1   DO I = 1, 1000
        A(I) = B(I) + X(I)
     ENDDO
L2   DO I = 1, 1000
        C(I) = A(I) + Y(I)
     ENDDO
S    Z = FOO(A(1:1000))
L3   DO I = 1, 500
        A(I) = C(I) + Z
     ENDDO

[Figure: fusion graph with vertices L1, L2, S, and L3; the edge weights shown are 1,000, 500, 500, 1,000, and 1,000.]

Loop Fusion (15)

Definition

A mixed-directed graph is a graph G = (V, E = Ed U Eu) where (V,Ed) forms a directed graph, (V, Eu) forms an undirected graph, and Ed and Eu are disjoint.

•G is acyclic if (V,Ed) is acyclic.

•w is a successor or predecessor of v if it is such in (V,Ed).

•w is a neighbor of v if it is such in (V,Eu).

Loop Fusion (16)

Problem Definition

Let G be an acyclic mixed-directed graph, W a weight function on E, B a set of bad vertices, and Eb a set of bad edges. The weighted loop fusion problem is the problem of finding vertex sets {V1,V2,…,Vn} such that:

•{V1,V2,…,Vn} partitions V.

•Each vertex set Vi either contains no bad vertices, or consists of a single bad vertex.

•Given two vertices v and w in Vi, there is no path from v to w (in Ed) that leaves Vi.

•Given v and w in Vi, there is no bad edge between v and w.

•The induced graph on the vertex sets is acyclic.

The Target: To maximize the total weight of edges between vertices in the same vertex sets.

Loop Fusion (17)

Unfortunately, the Weighted Fusion Problem is NP-hard. Therefore we have to resort to heuristic-based algorithms.

A fast and simple one is the Fast Greedy algorithm for Weighted Fusion, which was developed by Kennedy.

The Algorithm

1. Initialize all the quantities and compute initial successor, predecessor, and neighbor sets.

2. Topologically sort the vertices of the directed acyclic graph.

Loop Fusion (18)

The Algorithm (continued)

3. Process the vertices in V to compute for each vertex the set pathFrom[v], which contains all vertices that can be reached by a path from vertex v, and the set badPathFrom[v], a subset of pathFrom[v] that includes the set of vertices that can be reached from v by a path that contains a bad vertex or a bad edge.

4. Invert the sets pathFrom and badPathFrom, respectively, to produce the sets pathTo[v] and badPathTo[v] for each vertex v in the graph. The set pathTo[v] contains the vertices from which there is a path to v; the set badPathTo[v] contains the vertices from which v can be reached via a bad path.

Loop Fusion (19)

5. Insert each of the edges into a priority queue edgeHeap by weight.

6. While edgeHeap is nonempty, select and remove the heaviest edge (v,w) from it. If w is in badPathFrom[v] then do not fuse – repeat step 6. Otherwise do the following:

• Collapse v, w, and every edge on the directed path between them.

• After each collapse, adjust the sets pathFrom, badPathFrom, pathTo, and badPathTo to reflect the new graph. That is, the composite node will now be reached from every vertex that reached a vertex in the composite, and it will reach any vertex that is reached by a vertex in the composite.

• After each vertex collapse, recompute successor, predecessor, and neighbor sets for the composite vertex, and recompute weights between the composite vertex and other vertices as appropriate.

The running time of the algorithm is: O(EV + V^2)
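The greedy idea can be illustrated with a much simplified Python sketch. It is not Kennedy's fast algorithm: instead of maintaining pathFrom, badPathFrom, pathTo, and badPathTo incrementally, it recomputes reachability on the collapsed graph before every candidate merge, so it does not achieve the running time quoted above. The function names are hypothetical, and the assignment of the five weights to the edges of the L1, L2, S, L3 example is a guess, since the figure is not reproduced here.

```python
def reachable(start, succ):
    """All nodes reachable from `start` through at least one edge of `succ`."""
    seen, stack = set(), [start]
    while stack:
        for nxt in succ.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def greedy_fusion(vertices, directed, undirected, bad_vertices=(), bad_edges=()):
    """directed / undirected: dicts mapping (v, w) to an edge weight."""
    group = {v: frozenset([v]) for v in vertices}
    weights = {**directed, **undirected}
    bad_vertices = set(bad_vertices)
    bad_pairs = [frozenset(e) for e in bad_edges]

    def condensed(reverse=False):
        # Directed edges between the current groups (collapsed graph).
        succ = {}
        for v, w in directed:
            gv, gw = group[v], group[w]
            if gv != gw:
                if reverse:
                    gv, gw = gw, gv
                succ.setdefault(gv, set()).add(gw)
        return succ

    # Heaviest edge first.
    for (v, w), _ in sorted(weights.items(), key=lambda kv: -kv[1]):
        gv, gw = group[v], group[w]
        if gv == gw:
            continue
        succ, pred = condensed(), condensed(reverse=True)
        merged = set(gv | gw)
        # Every group on a directed path between the two must be collapsed too.
        for src, dst in ((gv, gw), (gw, gv)):
            for g in reachable(src, succ) & reachable(dst, pred):
                merged |= g
        if merged & bad_vertices or any(p <= merged for p in bad_pairs):
            continue                      # a bad path: do not fuse
        new_group = frozenset(merged)
        for x in new_group:
            group[x] = new_group

    parts = set(group.values())
    total = sum(wt for (v, w), wt in weights.items() if group[v] == group[w])
    return parts, total


# Guessed edge/weight assignment for the L1, L2, S, L3 example.
edges = {
    ("L1", "L2"): 1000, ("L1", "S"): 1000, ("S", "L3"): 1000,
    ("L2", "L3"): 500, ("L1", "L3"): 500,
}
parts, total = greedy_fusion(["L1", "L2", "S", "L3"], edges, {}, bad_vertices=["S"])
print(parts, total)
```

Under these assumed weights the sketch fuses only L1 and L2: every path from them to L3 that would have to be collapsed passes through the bad vertex S.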

Loop Fusion (20)

[Figure: the fusion graph of the previous example (vertices L1, L2, S, and L3), repeated.]

In the previous example the greedy algorithm will fuse L1 and L2, which is the optimal solution.

Loop Fusion (21)

However, the algorithm is not optimal. Consider the following example:

[Figure: fusion graph on vertices a, b, c, d, e, and f with weighted edges; one vertex is marked as a bad vertex.]

Loop Fusion (22)

Since the edge (a,f) is the heaviest, the greedy algorithm will fuse the vertices a,b,c,d,f together:

[Figure: the example graph with the vertices a, b, c, d, and f fused.]

This solution's weight is 16.

Loop Fusion (23)

However, fusing c, d, e, f and a, b produces a better result:

[Figure: the example graph with c, d, e, f fused into one vertex set and a, b into another.]

This solution's weight is 23.

Loop Fusion (24)

Multilevel Loop Fusion

When dealing with multiply nested loops, the strategy is simple: first align and fuse the outermost loops, then recursively repeat the process for the bodies of the resulting loops.

Starting with the inner loops is inefficient at best: we won’t be able to fuse all of them, and if we insist on fusing them we might generate wrong code, because the outer loops might still need alignment, which changes the references in the inner loops.


Putting It All Together (1)

In which order should the transformations be applied?

The recommended order is as follows:

1. Loop Interchange.

2. Loop Alignment and Fusion.

3. Unroll and Jam.

4. Scalar Replacement.

But Why?

Putting It All Together (2)

1. Loop Interchange: Fusion might interfere with loop interchange, therefore interchange should be done first.

2. Loop Alignment and Fusion: This can achieve extra reuse across loops.

3. Unroll and Jam: This can achieve outer loop reuse when, after interchange is finished, there are still dependences carried by loops other than the innermost one.

4. Scalar Replacement: As we already noted, this is the actual “exposure”, so it must be the last transformation.


Complex Loop Nests (1)

Loops with If Statements

Consider the following example:

DO I = 1, N

IF(M(I).LT.0) THEN

A(I)=B(I)+C

ENDIF

D(I) = A(I) + E

ENDDO

Scalar Replacement

DO I = 1, N

IF(M(I).LT.0) THEN

a0 = B(I) + C

A(I) = a0

ENDIF

D(I) = a0 + E

ENDDO

Error: a0 may not be initialized


We can overcome this problem in the following way:

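DO I = 1, N
   IF (M(I).LT.0) THEN
      a0 = B(I) + C
      A(I) = a0
   ELSE
      a0 = A(I)
   ENDIF
   D(I) = a0 + E
ENDDO

Note: We didn’t increase the running time, since the load of A(I) in the ELSE branch replaces a load that the original statement D(I) = A(I) + E needed anyway.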


Given a control flow graph of the loop, and assuming that each If statement has (possibly empty) Else branch:

•We insert an initialization at the beginning of block b if the variable is used in b but not initialized on any path to b.

•We insert an initialization at the end of block b if the variable has not been initialized on any path to the block, it is live on exit from the block, and it is used at some successor of the block (as done in the example).
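The effect of the inserted initialization can be checked with a small Python sketch of the example loop. It is an illustration only, with 0-based indices and made-up function names; the ELSE branch is the initialization placed at the end of the (previously empty) block.

```python
def original(m, a, b, c, e):
    a, d = a[:], [0] * len(a)
    for i in range(len(a)):
        if m[i] < 0:
            a[i] = b[i] + c
        d[i] = a[i] + e
    return a, d


def scalar_replaced(m, a, b, c, e):
    a, d = a[:], [0] * len(a)
    for i in range(len(a)):
        if m[i] < 0:
            a0 = b[i] + c
            a[i] = a0
        else:
            a0 = a[i]      # inserted initialization on the path that skips the store
        d[i] = a0 + e
    return a, d


mask = [-1, 2, -3, 4]
a_in = [10, 20, 30, 40]
b_in = [1, 2, 3, 4]
print(original(mask, a_in, b_in, 5, 100))
print(scalar_replaced(mask, a_in, b_in, 5, 100))
```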


Triangular Unroll and Jam

Consider the following example:

DO I = 2, 99

DO J = 1, I-1

A(I,J) = A(I,I) + A(J,J)

ENDDO

ENDDO

Naïve Unroll and Jam

DO I = 2, 99, 2

DO J = 1, I-1

A(I,J) = A(I,I) + A(J,J)

A(I+1,J)=A(I+1,I+1)+A(J,J)

ENDDO

ENDDO

Error: The assignment to A(I+1,I) is missed, because the inner loop of iteration I+1 runs up to J = I, not I-1.

We can solve the problem by applying Unroll and Jam step by step and using the loop fusion mechanics.

Complex Loop Nests (5)

Original Code

DO I = 2, 99

DO J = 1, I-1

A(I,J) = A(I,I) + A(J,J)

ENDDO

ENDDO

Unroll

DO I = 2, 99, 2

DO J = 1, I-1

A(I,J) = A(I,I) + A(J,J)

ENDDO

DO J = 1, I

A(I+1,J) = A(I+1,I+1)+A(J,J)

ENDDO

ENDDO

Jam (Fusion)

DO I = 2, 99, 2

DO J = 1 , I-1

A(I,J) = A(I,I) + A(J,J)

A(I+1,J) = A(I+1,I+1)+A(J,J)

ENDDO

A(I+1,I) = A(I+1,I+1)+A(I,I)

ENDDO

Scalar Replacement

DO I = 2, 99, 2

tI = A(I,I)

tI1 = A(I+1,I+1)

DO J = 1 , I-1

tJ = A(J,J)

A(I,J) = tI + tJ

A(I+1,J) = tI1 + tJ

ENDDO

A(I+1,I) = tI1 + tI

ENDDO
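The difference between the naive and the step-by-step version shows up immediately in a Python sketch. The sketch assumes a square matrix stored as a list of rows with 1-based use of the indices (row and column 0 are unused) and an upper bound `hi` chosen so that the I loop has an even number of iterations, as 99 does on the slide; the function names are invented for the example.

```python
def original(a, hi):
    a = [row[:] for row in a]
    for i in range(2, hi + 1):
        for j in range(1, i):
            a[i][j] = a[i][i] + a[j][j]
    return a


def naive_unroll_and_jam(a, hi):
    # Misses the assignment to a[i + 1][i].
    a = [row[:] for row in a]
    for i in range(2, hi + 1, 2):
        for j in range(1, i):
            a[i][j] = a[i][i] + a[j][j]
            a[i + 1][j] = a[i + 1][i + 1] + a[j][j]
    return a


def triangular_unroll_and_jam(a, hi):
    a = [row[:] for row in a]
    for i in range(2, hi + 1, 2):
        t_i, t_i1 = a[i][i], a[i + 1][i + 1]
        for j in range(1, i):
            t_j = a[j][j]
            a[i][j] = t_i + t_j
            a[i + 1][j] = t_i1 + t_j
        a[i + 1][i] = t_i1 + t_i       # the peeled iteration J = I
    return a


hi = 9
a0 = [[100 * i + j for j in range(hi + 1)] for i in range(hi + 1)]
print(triangular_unroll_and_jam(a0, hi) == original(a0, hi))
print(naive_unroll_and_jam(a0, hi) == original(a0, hi))
```

Only off-diagonal elements are written and only diagonal elements are read, which is why keeping the diagonal values in the scalars is safe here.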


Note: It is also possible to Unroll using a factor bigger than 2, using the same techniques.


Trapezoidal Unroll and Jam

The same technique can be used for general trapezoidal loops, for example: (A part of a convolution code)

DO I = 0, N

DO J = I, I+N2

F3(I) = F3(I)+F1(J)*W(I-J)

ENDDO

F3(I) = F3(I)*DT

ENDDO

Unroll

DO I = 0, N, 2

DO J = I, I+N2

F3(I) = F3(I)+F1(J)*W(I-J)

ENDDO

F3(I) = F3(I)*DT

DO J = I+1, I+N2+1

F3(I+1)=F3(I+1)+F1(J)*W(I-J+1)

ENDDO

F3(I+1) = F3(I+1)*DT

ENDDO


Jam (Fusion)

DO I = 0, N, 2
   F3(I) = F3(I)+F1(I)*W(0)
   DO J = I+1, I+N2
      F3(I) = F3(I)+F1(J)*W(I-J)
      F3(I+1) = F3(I+1)+F1(J)*W(I-J+1)
   ENDDO
   F3(I+1) = F3(I+1)+F1(I+N2+1)*W(-N2)
   F3(I) = F3(I)*DT
   F3(I+1) = F3(I+1)*DT
ENDDO

Applying Scalar Replacement gave a speedup of 2.22 on a MIPS M120…
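A Python sketch of the trapezoidal case, including the scalar replacement step, can be compared against the original convolution loop. Several choices here are assumptions of the illustration: N is taken to be odd so that the I loop (0..N) has an even number of iterations, W is modeled as a dictionary so that the negative subscripts I-J can be used directly, and F3 starts at zero.

```python
def original(f1, w, n, n2, dt):
    f3 = [0] * (n + 1)
    for i in range(n + 1):
        for j in range(i, i + n2 + 1):
            f3[i] += f1[j] * w[i - j]
        f3[i] *= dt
    return f3


def jammed_scalar(f1, w, n, n2, dt):
    f3 = [0] * (n + 1)
    for i in range(0, n + 1, 2):
        s0 = f3[i] + f1[i] * w[0]           # peeled J = I of the first loop
        s1 = f3[i + 1]
        for j in range(i + 1, i + n2 + 1):  # common range I+1 .. I+N2
            t = f1[j]                       # F1(J) is loaded once, used twice
            s0 += t * w[i - j]
            s1 += t * w[i - j + 1]
        s1 += f1[i + n2 + 1] * w[-n2]       # peeled J = I+N2+1 of the second loop
        f3[i] = s0 * dt
        f3[i + 1] = s1 * dt
    return f3


n, n2, dt = 5, 3, 2
f1 = list(range(1, n + n2 + 2))             # indices 0 .. n+n2
w = {k: k * k + 1 for k in range(-n2, 1)}   # subscripts -n2 .. 0
print(original(f1, w, n, n2, dt))
print(jammed_scalar(f1, w, n, n2, dt))
```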


Summary (1)

In this lecture we covered:

1. Loop Interchange – This gives us more dependences in the innermost loop which we can utilize for more register reuse.

2. Loop Fusion and Alignment – Bring uses together so they can share registers.

3. Complex Loops – How to overcome some of the problems in real-world programs.