Improving Register Usage
Chapter 8, Section 8.5 End.
Omer Yehezkely
Agenda
•Last Lecture at a glance
•Loop Interchange for Register Reuse
•Loop Fusion for Register Reuse
•Putting it All Together
•Complex Loop Nests
•Summary
Last lecture at a glance (1)
Assumption 1: Most compilers can handle register allocation for scalars (using a graph coloring algorithm). However, they do not know how to handle vectors (array elements).
Assumption 2: We are dealing with RISC processors. All CPU operations need their data in registers (except for the load and store operations themselves).
Assumption 3: Memory Hierarchy: Accessing the registers is much faster than a cache hit, which is much faster than a cache miss and accessing the main memory, which is much faster than accessing the virtual memory (swap file)…
Last lecture at a glance (2)
Therefore our strategy will be: Do some transformation that will “expose” vector entries as scalars, and then let the good old compiler do the register allocation.
We will benefit from avoiding unnecessary Load / Store operations.
Last lecture at a glance (3)
Example: (Scalar Replacement)
Original Code
DO I = 1, N
DO J = 1, M
A(I) = A(I) + B(J)
ENDDO
ENDDO
Scalar Replacement
DO I = 1, N
T = A(I)
DO J = 1, M
T = T + B(J)
ENDDO
A(I) = T
ENDDO
Last lecture at a glance (4)
Dependences to consider:
•True dependence: A(I) = … then … = A(I)
•Output dependence: A(I) = … then A(I) = …
•Antidependence: … = A(I) then A(I) = …
•Input dependence: … = A(I) then … = A(I)
Last lecture at a glance (5)
•We should also consider Loop Carried and Loop Independent dependences.
•In general, the more dependences the merrier, because more dependences usually mean more opportunities for register reuse.
•We will use the dependences to decide if and how to “expose” the vectors as scalars.
Last lecture at a glance (6)
We saw:
•Scalar Replacement (see first example) – this is the actual “exposure”.
•Unroll and Jam – Unrolling of loops in order to bring dependences that are carried by an outer loop into the inner loop. This can benefit register reuse if we apply Scalar Replacement afterwards.
Last lecture at a glance (7)
Example: (Unroll and Jam)
Original Code
DO I = 1, N*2
DO J = 1, M
A(I) = A(I) + B(J)
ENDDO
ENDDO
Unroll and Jam
DO I = 1, N*2, 2
DO J = 1, M
A(I) = A(I) + B(J)
A(I+1) = A(I+1) +B(J)
ENDDO
ENDDO
Scalar Replacement
DO I = 1, N*2, 2
s0 = A(I)
s1 = A(I+1)
DO J = 1, M
t = B(J)
s0 = s0 + t
s1 = s1 + t
ENDDO
A(I) = s0
A(I+1) = s1
ENDDO
Loop Interchange (1)
Loop nesting is not always optimal in regard to register reuse. For example, on CPUs with no vector engines, the following code (matrix initialization):
DO I=2, N
A(1:M, I) = A(1:M, I-1)
ENDDO
will be converted into:
DO I = 2, N
DO J = 1, M
A(J, I) = A(J, I-1)
ENDDO
ENDDO
Loop Interchange (2)
Which will be implemented in the following way:
DO I = 2, N
DO J = 1, M
R1 = A(J, I-1)
A(J, I) = R1
ENDDO
ENDDO
This is not too clever, since it performs (N-1)*M Load operations and (N-1)*M Store operations.
If we change the order of the loops we can get a better implementation.
Loop Interchange (3)
Original Code
DO I = 2, N
DO J = 1, M
A(J, I) = A(J, I-1)
ENDDO
ENDDO
Loop Interchange
DO J = 1, M
DO I = 2, N
A(J, I) = A(J, I-1)
ENDDO
ENDDO
Scalar Replacement
DO J = 1, M
R1 = A(J, 1)
DO I = 2, N
A(J, I) = R1
ENDDO
ENDDO
This implementation still requires (N-1)*M Store operations (we can’t escape that), but it only requires M Load operations which can make the running time considerably shorter.
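The load and store counts quoted above can be checked with a small simulation. This is only an illustrative sketch in Python rather than Fortran; the function names and the sample sizes N = 100, M = 50 are assumptions made for the example, not part of the lecture.

```python
def count_original(n, m):
    """I outer, J inner: every inner iteration loads A(J, I-1) and stores A(J, I)."""
    loads = stores = 0
    for i in range(2, n + 1):
        for j in range(1, m + 1):
            loads += 1    # R1 = A(J, I-1)
            stores += 1   # A(J, I) = R1
    return loads, stores


def count_interchanged(n, m):
    """J outer, I inner, after scalar replacement: one load per J iteration."""
    loads = stores = 0
    for j in range(1, m + 1):
        loads += 1        # R1 = A(J, 1), hoisted out of the I loop
        for i in range(2, n + 1):
            stores += 1   # A(J, I) = R1
    return loads, stores


n, m = 100, 50            # hypothetical sizes
print(count_original(n, m))       # (N-1)*M loads and (N-1)*M stores
print(count_interchanged(n, m))   # M loads and (N-1)*M stores
```

With these sizes the store count is unchanged, while the load count drops from (N-1)*M to M.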
Loop Interchange (4)
Considerations for Loop Interchange
The basic idea is to get the loop that carries the most dependences to the innermost position.
Register reuse for the outer loops usually cannot be achieved because register resources are limited.
We use the conventional direction matrix for the loop nest.
Loop Interchange (5)
Example:
DO J = 1, N
DO K = 1, N
DO I = 1, 256
A(I, J, K) = A(I, J-1, K) + A(I, J-1, K-1) + A(I, J, K-1)
ENDDO
ENDDO
ENDDO
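There are 3 true dependences, which result in the following direction matrix (columns in loop order J, K, I; one row per reference A(I, J-1, K), A(I, J-1, K-1), A(I, J, K-1)):
< = =
< < =
= < =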
Loop Interchange (6)
Example (cont.):
If we select the J loop to be the innermost we get:
DO K = 1, N
DO I = 1, 256
DO J = 1, N
A(I, J, K) = A(I, J-1, K) + &
A(I, J-1, K-1) + A(I, J, K-1)
ENDDO
ENDDO
ENDDO
Scalar Replacement
DO K = 1, N
DO I = 1, 256
R1 = A(I, 0, K)
DO J = 1, N
R1 = R1 + A(I, J-1, K-1) + &
A(I, J, K-1)
A(I, J, K) = R1
ENDDO
ENDDO
ENDDO
We saved a Load operation in each iteration. It is possible to interchange the 2 outer loops and get further optimization.
Loop Interchange (7)
Loop Interchange Algorithm:
1. Form the direction matrix for the loop nest and use it to identify the loops, other than the scalarization loop, that can legally be moved to the innermost position.
2. For each such loop L, let count(L) be the number of rows of the direction matrix that have "<" in the position corresponding to L and "=" in every other position.
3. Pick the loop L that maximizes the product of count(L) and the iteration count of loop L.
• Some assumptions need to be taken when the bounds of the loop are unknown at compile time.
• Loop interchange should be weighed against cache efficiency (next chapter)
Loop Interchange (8)
Example
Four loops, the first being the outermost, scored as (# of loop iterations) * count(L):
100 * 2 = 200
65 * 3 = 195
150 * 1 = 150
1,000 * 0 = 0
The outermost loop (100 * 2 = 200) has the highest score, so it should be moved to the innermost position.
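Steps 2 and 3 of the selection heuristic can be sketched as follows. This is a minimal Python sketch under two assumptions: every loop is taken to be legal to move innermost (step 1 is skipped), and the direction matrix below is hypothetical, chosen only so that count(L) comes out as 2, 3, 1, 0 as in the example. The function name `pick_innermost` is likewise an illustration.

```python
def pick_innermost(direction_matrix, iteration_counts):
    """Return (loop_index, score) for the loop maximizing count(L) * iterations.

    count(L) is the number of rows with "<" in column L and "=" elsewhere.
    Assumption: every loop may legally be moved to the innermost position.
    """
    best = None
    for loop in range(len(iteration_counts)):
        count = sum(
            1
            for row in direction_matrix
            if row[loop] == "<"
            and all(d == "=" for k, d in enumerate(row) if k != loop)
        )
        score = count * iteration_counts[loop]
        if best is None or score > best[1]:
            best = (loop, score)
    return best


# Hypothetical direction matrix: 2 rows carried only by loop 0,
# 3 rows carried only by loop 1, 1 row carried only by loop 2.
rows = (
    [("<", "=", "=", "=")] * 2
    + [("=", "<", "=", "=")] * 3
    + [("=", "=", "<", "=")]
)
print(pick_innermost(rows, [100, 65, 150, 1000]))  # (0, 200)
```

Loop 0 (the outermost) wins with a score of 200, ahead of 195, 150, and 0.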
Loop Fusion (1)
Example:
On CPUs with no vector engines the following code:
A(1:N) = C(1:N) + D(1:N)
B(1:N) = C(1:N) - D(1:N)
will be transformed into:
DO I = 1, N
A(I) = C(I) + D(I)
ENDDO
DO I = 1, N
B(I) = C(I) - D(I)
ENDDO
Loop Fusion (2)
Using Loop Fusion (chapter 6) we get:
DO I = 1, N
A(I) = C(I) + D(I)
B(I) = C(I) - D(I)
ENDDO
Using Scalar Replacement we can save on the fetching time of C(I) and D(I):
DO I = 1, N
R1 = C(I)
R2 = D(I)
A(I) = R1 + R2
B(I) = R1 - R2
ENDDO
Loop Fusion (3)
Profitable Loop Fusion for Register Reuse
Just because a loop fusion is safe does not mean it is profitable.
There are 2 cases where the fusion may be profitable:
•The fusion results in a loop independent dependence (as we just saw).
•The fusion results in a forward loop carried dependence.
Loop Fusion (4)
Example: (forward loop carried dependence)
DO J = 1, N
DO I = 1, M
A(I,J) = C(I,J)+D(I,J)
ENDDO
DO I = 1, M
B(I,J) = A(I,J-1)-E(I,J)
ENDDO
ENDDO
Fusion:
DO J = 1, N
DO I = 1, M
A(I,J) = C(I,J)+D(I,J)
B(I,J) = A(I,J-1)-E(I,J)
ENDDO
ENDDO
Loop Fusion (5)
Fusion:
DO J = 1, N
DO I = 1, M
A(I,J) = C(I,J)+D(I,J)
B(I,J) = A(I,J-1)-E(I,J)
ENDDO
ENDDO
Loop Interchange:
DO I = 1, M
DO J = 1, N
A(I,J) = C(I,J)+D(I,J)
B(I,J) = A(I,J-1)-E(I,J)
ENDDO
ENDDO
Statement Order Reversing:
DO I = 1, M
DO J = 1, N
B(I,J) = A(I,J-1)-E(I,J)
A(I,J) = C(I,J)+D(I,J)
ENDDO
ENDDO
Scalar Replacement:
DO I = 1, M
R1 = A(I, 0)
DO J = 1, N
B(I,J) = R1 - E(I,J)
R1 = C(I,J)+D(I,J)
A(I,J) = R1
ENDDO
ENDDO
Loop Fusion (6)
Loop Alignment for Fusion
Reminder: Blocking dependences cause problems for loop fusion.
DO I = 1, M
DO J = 1, N
A(J,I) = B(J,I) + 1.0
ENDDO
DO J = 1, N
C(J,I) = A(J+1,I) + 2.0
ENDDO
ENDDO
We cannot simply fuse the two loops, because doing so would introduce a backward-carried antidependence: iteration J would read A(J+1,I) before iteration J+1 writes it.
Loop Fusion (7)
We can overcome this problem by aligning the loops:
DO I = 1, M
DO J = 0, N-1
A(J+1,I) = B(J+1,I) + 1.0
ENDDO
DO J = 1, N
C(J,I) = A(J+1,I) + 2.0
ENDDO
ENDDO
We can now fuse the two loops on their common iteration range while peeling a single iteration from the beginning of the first loop and one iteration from the end of the second loop.
Loop Fusion (8)
Hence we get:
DO I = 1, M
A(1,I) = B(1,I) + 1.0
DO J = 1, N-1
A(J+1,I) = B(J+1,I) + 1.0
C(J,I) = A(J+1,I) + 2.0
ENDDO
C(N,I) = A(N+1,I) + 2.0
ENDDO
Scalar Replacement
DO I = 1, M
A(1,I) = B(1,I) + 1.0
DO J = 1, N-1
R1 = B(J+1,I) + 1.0
A(J+1,I) = R1
C(J,I) = R1 + 2.0
ENDDO
C(N,I) = A(N+1,I) + 2.0
ENDDO
Loop Fusion (9)
Definition:
Let δ be a dependence between loops.
The Alignment Threshold of δ is defined as follows:
•If δ is loop independent after merging, threshold(δ) = 0.
•If δ is forward carried after merging, threshold(δ) is the negative of the resulting dependence threshold.
•If δ is fusion preventing, threshold(δ) is the threshold of the merged dependence.
Aligning by the largest threshold allows fusion.
Loop Fusion (10)
Example:
DO I = 1, N
A(I) = B(I) + 1.0
ENDDO
DO I = 1, N
C(I) = A(I+1) + A(I-1)
ENDDO
We have 2 dependences:
1. Forward carried with a threshold of 1 because of the reference A(I-1): alignment threshold of -1.
2. Backward carried with a threshold of 1 because of the reference A(I+1): alignment threshold of +1.
Loop Fusion (11)
Since (+1) > (-1) we should align by the alignment threshold: (+1)
And so we get:
DO I = 0, N-1
A(I+1) = B(I+1) + 1.0
ENDDO
DO I = 1, N
C(I) = A(I+1) + A(I-1)
ENDDO
From here we can proceed to fuse the loops and then “Scalar Replace” A(I+1).
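As a sanity check, the aligned, fused, and scalar-replaced version can be compared against the two original loops. The following is only a Python sketch of that comparison; it assumes A(0) and A(N+1) exist and hold their initial values (0.0 here), and the function names and test data are hypothetical.

```python
def original(b, n):
    """The two separate loops from the example; arrays are indexed 0..N+1."""
    a = [0.0] * (n + 2)
    c = [0.0] * (n + 2)
    for i in range(1, n + 1):
        a[i] = b[i] + 1.0
    for i in range(1, n + 1):
        c[i] = a[i + 1] + a[i - 1]
    return a, c


def aligned_and_fused(b, n):
    """Align the first loop by +1, fuse on the common range 1..N-1, peel the rest."""
    a = [0.0] * (n + 2)
    c = [0.0] * (n + 2)
    a[1] = b[1] + 1.0              # peeled iteration I = 0 of the aligned first loop
    for i in range(1, n):          # common iteration range 1..N-1
        t = b[i + 1] + 1.0         # scalar replacement of A(I+1)
        a[i + 1] = t
        c[i] = t + a[i - 1]
    c[n] = a[n + 1] + a[n - 1]     # peeled iteration I = N of the second loop
    return a, c


b = [float(k) for k in range(12)]  # hypothetical input, N = 10
print(original(b, 10) == aligned_and_fused(b, 10))
```

Both versions compute the same A and C, while the fused loop reads A(I+1) from a scalar instead of from memory.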
Loop Fusion (12)
Fusion Mechanics
Assuming we have a collection of aligned loops how do we fuse them?
1. Sort the lower bounds of the loops into a nondecreasing sequence {L1, L2, …, Ln} and sort the upper bounds of the loops into a nondecreasing sequence {H1, H2, …, Hn}.
2. Produce a sequence of fusion loops with lower bounds L1, L2, …, L(n-1) and respective upper bounds L2 - 1, L3 - 1, …, Ln - 1.
3. Produce the central fused loop with a lower bound of Ln and an upper bound of H1.
4. Produce a sequence of fusion loops with lower bounds H1 + 1, H2 + 1, …, H(n-1) + 1 and respective upper bounds H2, H3, …, Hn.
Each fusion loop contains the bodies of exactly those original loops whose iteration ranges cover it.
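The four steps can be sketched as a small routine that returns, for each generated fusion loop, its bounds and the original loops whose bodies it contains. This is a minimal Python sketch, assuming the central range is nonempty (Ln <= H1) and unit strides; the function name `fusion_segments` and the sample bounds are hypothetical.

```python
def fusion_segments(bounds):
    """bounds: list of (low, high) per aligned loop.

    Returns a list of (low, high, active), where `active` lists the indices
    of the original loops whose bodies run in that fusion loop.
    Assumption: max(low) <= min(high), so a central fused loop exists.
    """
    lows = sorted(lo for lo, _ in bounds)
    highs = sorted(hi for _, hi in bounds)
    if lows[-1] > highs[0]:
        raise ValueError("loops have no common iteration range")
    n = len(bounds)
    ranges = [(lows[k], lows[k + 1] - 1) for k in range(n - 1)]    # step 2
    ranges.append((lows[-1], highs[0]))                            # step 3
    ranges += [(highs[k] + 1, highs[k + 1]) for k in range(n - 1)]  # step 4
    segments = []
    for lo, hi in ranges:
        if lo > hi:
            continue  # empty range, e.g. two loops share a bound
        active = [i for i, (l, h) in enumerate(bounds) if l <= lo and hi <= h]
        segments.append((lo, hi, active))
    return segments


# The two aligned loops of the previous example with N = 10:
# loop 0 runs over 0..9, loop 1 runs over 1..10.
for segment in fusion_segments([(0, 9), (1, 10)]):
    print(segment)
```

For the two aligned loops this yields one peeled iteration of the first loop, the central fused loop over 1..9, and one peeled iteration of the second loop.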
Loop Fusion (13)
Example
(Figure: Loops 1, 2, and 3 after alignment. Each color represents a fusion loop.)
Loop Fusion (14)
The Weighted Fusion Problem
The last thing to do is to form the collections of the loops to be fused. We need to do it in a profitable manner.
Example
L1 DO I = 1, 1000
A(I) = B(I) + X(I)
ENDDO
L2 DO I = 1, 1000
C(I) = A(I) + Y(I)
ENDDO
S Z = FOO(A(1:1000))
L3 DO I = 1, 500
A(I) = C(I) + Z
ENDDO
(Figure: fusion graph over L1, L2, S, and L3, with edge weights 1,000, 500, 500, 1,000, and 1,000.)
Loop Fusion (15)
Definition
A mixed-directed graph is a graph G = (V, E = Ed ∪ Eu) where (V,Ed) forms a directed graph, (V,Eu) forms an undirected graph, and Ed and Eu are disjoint.
•G is acyclic if (V,Ed) is acyclic.
•w is a successor or predecessor of v if it is such in (V,Ed).
•w is a neighbor of v if it is such in (V,Eu).
Loop Fusion (16)
Problem Definition
Let G be an acyclic mixed-directed graph, W a weight function on E, B a set of bad vertices, and Eb a set of bad edges. The weighted loop fusion problem is the problem of finding vertex sets {V1,V2,…,Vn} such that:
•{V1,V2,…,Vn} partitions V.
•Each vertex set Vi either contains no bad vertices, or consists of a single bad vertex.
•Given two vertices v and w in Vi, there is no path from v to w (in Ed) that leaves Vi.
•Given v and w in Vi, there is no bad edge between v and w.
•The induced graph on the vertex sets is acyclic.
The Target: To maximize the total weight of edges between vertices in the same vertex sets.
Loop Fusion (17)
Unfortunately, the Weighted Fusion Problem is NP-hard, so we have to resort to heuristic algorithms.
A fast and simple one is the Fast Greedy algorithm for Weighted Fusion, which was developed by Kennedy.
The Algorithm
1. Initialize all the quantities and compute initial successor, predecessor, and neighbor sets.
2. Topologically sort the vertices of the directed acyclic graph.
Loop Fusion (18)
The Algorithm (continued)
3. Process the vertices in V to compute for each vertex the set pathFrom[v], which contains all vertices that can be reached by a path from vertex v, and the set badPathFrom[v], a subset of pathFrom[v] that includes the set of vertices that can be reached from v by a path that contains a bad vertex or a bad edge.
4. Invert the sets pathFrom and badPathFrom, respectively, to produce the sets pathTo[v] and badPathTo[v] for each vertex v in the graph. The set pathTo[v] contains the vertices from which there is a path to v; the set badPathTo[v] contains the vertices from which v can be reached via a bad path.
Loop Fusion (19)
5. Insert each of the edges into a priority queue edgeHeap by weight.
6. While edgeHeap is nonempty, select and remove the heaviest edge (v,w) from it. If w is in badPathFrom[v], do not fuse; repeat step 6. Otherwise do the following:
• Collapse v, w, and every vertex on a directed path between them into a single composite vertex.
• After each collapse, adjust the sets pathFrom, badPathFrom, pathTo, and badPathTo to reflect the new graph. That is, the composite vertex will now be reached from every vertex that reached a vertex in the composite, and it will reach any vertex that is reached by a vertex in the composite.
• After each collapse, recompute the successor, predecessor, and neighbor sets for the composite vertex, and recompute the weights between the composite vertex and other vertices as appropriate.
The running time of the algorithm is O(EV + V^2).
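The greedy idea can be sketched in a much simplified form. The Python sketch below is not Kennedy's fast algorithm: it recomputes reachability from scratch after every collapse instead of maintaining pathFrom and badPathFrom incrementally, so it does not achieve the running time stated above. The function name, the data layout, and the edge weights in the usage example (derived by hand from the L1, L2, S, L3 code) are all assumptions made for illustration.

```python
from itertools import combinations


def greedy_fusion(vertices, directed, weights, bad_vertices=(), bad_edges=()):
    """Simplified greedy weighted fusion; returns the fused groups.

    directed: list of (v, w) dependence edges of an acyclic graph.
    weights:  dict mapping (v, w) pairs (directed or undirected) to a weight.
    """
    group = {v: frozenset([v]) for v in vertices}   # vertex -> current group
    bad_vertices = set(bad_vertices)
    bad_pairs = {frozenset(e) for e in bad_edges}
    rejected = set()                                # group pairs found illegal

    def reachable(start):
        """Groups reachable from `start` in the collapsed directed graph."""
        seen, stack = set(), [start]
        while stack:
            g = stack.pop()
            for a, b in directed:
                if group[a] == g and group[b] != g and group[b] not in seen:
                    seen.add(group[b])
                    stack.append(group[b])
        return seen

    while True:
        # Aggregate the weights between every pair of distinct groups.
        candidates = {}
        for (a, b), w in weights.items():
            if group[a] != group[b]:
                key = frozenset([group[a], group[b]])
                if key not in rejected:
                    candidates[key] = candidates.get(key, 0) + w
        if not candidates:
            break
        pair = max(candidates, key=candidates.get)  # heaviest remaining edge
        ga, gb = tuple(pair)
        if gb not in reachable(ga):
            ga, gb = gb, ga                         # try the other direction
        # Every group on a directed path between the two must be collapsed too.
        between = {g for g in reachable(ga) if gb in reachable(g)}
        members = frozenset().union(ga, gb, *between)
        legal = not (members & bad_vertices) and not any(
            frozenset(p) in bad_pairs for p in combinations(members, 2)
        )
        if legal:
            for v in members:
                group[v] = members
        else:
            rejected.add(pair)
    return sorted(sorted(g) for g in set(group.values()))


# Hypothetical encoding of the L1, L2, S, L3 example; S is a bad vertex.
deps = [("L1", "L2"), ("L1", "S"), ("S", "L3"), ("L2", "L3"), ("L1", "L3")]
w = {("L1", "L2"): 1000, ("L1", "S"): 1000, ("S", "L3"): 1000,
     ("L2", "L3"): 500, ("L1", "L3"): 500}
print(greedy_fusion(["L1", "L2", "S", "L3"], deps, w, bad_vertices={"S"}))
```

Under these assumed weights the sketch fuses L1 with L2 and leaves S and L3 alone, because every path that would bring L3 into the group passes through the bad vertex S.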
Loop Fusion (20)
(Figure: the fusion graph of the previous example, over L1, L2, S, and L3.)
In the previous example the greedy algorithm will fuse L1 and L2 which is the optimal solution.
Loop Fusion (21)
However, the algorithm is not optimal. Consider the following example:
(Figure: fusion graph over the vertices a, b, c, d, e, f, one of them marked as a bad vertex, with edge weights of 1 and 10.)
Loop Fusion (22)
Since the edge (a,f) is the heaviest, the greedy algorithm will fuse the vertices a,b,c,d,f together:
(Figure: the same graph with a, b, c, d, f fused.)
The weight of this solution is 16.
Loop Fusion (23)
However, fusing c,d,e,f and a,b produces a better result:
(Figure: the same graph with c, d, e, f fused and a, b fused.)
The weight of this solution is 23.
Loop Fusion (24)
Multilevel Loop Fusion
When dealing with multilevel loop nests, the strategy is simple: first align and fuse the outermost loops, then recursively repeat the process for the bodies of the resulting loops.
Starting with the inner loops is inefficient at best, since we will not be able to fuse all of them. At worst it produces wrong code: the outer loops might need alignment, which changes the references in the inner loops.
Putting It All Together (1)
In which order should the transformations be applied?
The recommended order is as follows:
1. Loop Interchange.
2. Loop Alignment and Fusion.
3. Unroll and Jam.
4. Scalar Replacement.
But Why?
Putting It All Together (2)
1. Loop Interchange: Fusion might interfere with loop interchange, so interchange should be done first.
2. Loop Alignment and Fusion: This can achieve extra reuse across loops.
3. Unroll and Jam: This can achieve outer loop reuse when, after interchange is finished, there are still dependences carried by loops other than the innermost one.
4. Scalar Replacement: As we already noted, this is the actual “exposure”, so it must be the last transformation.
Complex Loop Nests (1)
Loops with If Statements
Consider the following example:
DO I = 1, N
IF(M(I).LT.0) THEN
A(I)=B(I)+C
ENDIF
D(I) = A(I) + E
ENDDO
Scalar Replacement
DO I = 1, N
IF(M(I).LT.0) THEN
a0 = B(I) + C
A(I) = a0
ENDIF
D(I) = a0 + E
ENDDO
Error: a0 may not be initialized
Complex Loop Nests (2)
We can overcome this problem in the following way:
DO I = 1, N
IF(M(I).LT.0) THEN
a0 = B(I) + C
A(I) = a0
ELSE
a0 = A(I)
ENDIF
D(I) = a0 + E
ENDDO
Note: We didn’t increase the running time.
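The effect of the added ELSE branch can be illustrated by comparing the original loop with the scalar-replaced one. This is only a Python sketch with 0-based indexing; the function names and sample data are hypothetical.

```python
def original(m, a, b, c, e):
    """The loop as written: A(I) is conditionally updated, then read."""
    a = list(a)
    d = [0.0] * len(a)
    for i in range(len(a)):
        if m[i] < 0:
            a[i] = b[i] + c
        d[i] = a[i] + e
    return a, d


def scalar_replaced(m, a, b, c, e):
    """Scalar replacement with a0 initialized on both branches."""
    a = list(a)
    d = [0.0] * len(a)
    for i in range(len(a)):
        if m[i] < 0:
            a0 = b[i] + c
            a[i] = a0
        else:
            a0 = a[i]      # the load the original code performed anyway
        d[i] = a0 + e
    return a, d


m = [-1, 2, -3, 4]         # hypothetical mask and data
a = [10.0, 20.0, 30.0, 40.0]
b = [1.0, 2.0, 3.0, 4.0]
print(original(m, a, b, 0.5, 100.0) == scalar_replaced(m, a, b, 0.5, 100.0))
```

Every path through the loop body performs at most one load of A(I), which is why the running time does not increase.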
Complex Loop Nests (3)
Given a control flow graph of the loop, and assuming that each If statement has (possibly empty) Else branch:
•We insert initialization at the beginning of block b if the variable is used in b but not initialized on any path to b.
•We insert an initialization at the end of block b if the variable has not been initialized on any path to the block, it is live on exit from the block, and at some successor to the block it is used. (as done in the example).
Complex Loop Nests (4)
Triangular Unroll and Jam
Consider the following example:
DO I = 2, 99
DO J = 1, I-1
A(I,J) = A(I,I) + A(J,J)
ENDDO
ENDDO
Naïve Unroll and Jam
DO I = 2, 99, 2
DO J = 1, I-1
A(I,J) = A(I,I) + A(J,J)
A(I+1,J)=A(I+1,I+1)+A(J,J)
ENDDO
ENDDO
Error: We miss an assignment. The second statement must also run for J = I, so A(I+1,I) is never assigned.
We can solve the problem by applying Unroll and Jam step by step and using the loop fusion mechanics.
Complex Loop Nests (5)
Original Code
DO I = 2, 99
DO J = 1, I-1
A(I,J) = A(I,I) + A(J,J)
ENDDO
ENDDO
Unroll
DO I = 2, 99, 2
DO J = 1, I-1
A(I,J) = A(I,I) + A(J,J)
ENDDO
DO J = 1, I
A(I+1,J) = A(I+1,I+1)+A(J,J)
ENDDO
ENDDO
Jam (Fusion)
DO I = 2, 99, 2
DO J = 1 , I-1
A(I,J) = A(I,I) + A(J,J)
A(I+1,J) = A(I+1,I+1)+A(J,J)
ENDDO
A(I+1,I) = A(I+1,I+1)+A(I,I)
ENDDO
Scalar Replacement
DO I = 2, 99, 2
tI = A(I,I)
tI1 = A(I+1,I+1)
DO J = 1 , I-1
tJ = A(J,J)
A(I,J) = tI + tJ
A(I+1,J) = tI1 + tJ
ENDDO
A(I+1,I) = tI1 + tI
ENDDO
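The triangular case can be checked numerically, including the assignment that the naive version misses. The following Python sketch mirrors the loops above with a 100 x 100 list of lists; the initial matrix contents and the function names are assumptions made for the check.

```python
def make_matrix(size=100):
    """Hypothetical initial data: distinct values so differences are visible."""
    return [[float(i * size + j) for j in range(size)] for i in range(size)]


def original(a):
    for i in range(2, 100):
        for j in range(1, i):
            a[i][j] = a[i][i] + a[j][j]
    return a


def naive_unroll_and_jam(a):
    """Jams both statements over J = 1..I-1 and drops the iteration J = I."""
    for i in range(2, 100, 2):
        for j in range(1, i):
            a[i][j] = a[i][i] + a[j][j]
            a[i + 1][j] = a[i + 1][i + 1] + a[j][j]
    return a


def unroll_and_jam(a):
    """Jam on the common range, peel J = I, then apply scalar replacement."""
    for i in range(2, 100, 2):
        t_i = a[i][i]
        t_i1 = a[i + 1][i + 1]
        for j in range(1, i):
            t_j = a[j][j]
            a[i][j] = t_i + t_j
            a[i + 1][j] = t_i1 + t_j
        a[i + 1][i] = t_i1 + t_i   # the peeled iteration J = I
    return a


reference = original(make_matrix())
print(unroll_and_jam(make_matrix()) == reference)         # True
print(naive_unroll_and_jam(make_matrix()) == reference)   # False
```

The only entries on which the naive version disagrees are A(I+1,I), which it never assigns.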
Complex Loop Nests (6)
Note: It is also possible to Unroll using a factor bigger than 2, using the same techniques.
Complex Loop Nests (7)
Trapezoidal Unroll and Jam
The same technique can be used for general trapezoidal loops, for example: (A part of a convolution code)
DO I = 0, N
DO J = I, I+N2
F3(I) = F3(I)+F1(J)*W(I-J)
ENDDO
F3(I) = F3(I)*DT
ENDDO
Unroll
DO I = 0, N, 2
DO J = I, I+N2
F3(I) = F3(I)+F1(J)*W(I-J)
ENDDO
F3(I) = F3(I)*DT
DO J = I+1, I+N2+1
F3(I+1)=F3(I+1)+F1(J)*W(I-J+1)
ENDDO
F3(I+1) = F3(I+1)*DT
ENDDO
Complex Loop Nests (8)
Unroll
DO I = 0, N, 2
DO J = I, I+N2
F3(I) = F3(I)+F1(J)*W(I-J)
ENDDO
F3(I) = F3(I)*DT
DO J = I+1, I+N2+1
F3(I+1)=F3(I+1)+F1(J)*W(I-J+1)
ENDDO
F3(I+1) = F3(I+1)*DT
ENDDO
Jam (Fusion)
DO I = 0, N, 2
F3(I) = F3(I)+F1(I)*W(0)
DO J = I+1, I+N2
F3(I) = F3(I)+F1(J)*W(I-J)
F3(I+1)=F3(I+1)+F1(J)*W(I-J+1)
ENDDO
F3(I+1)=F3(I+1)+F1(I+N2+1)*W(-N2)
F3(I) = F3(I)*DT
F3(I+1) = F3(I+1)*DT
ENDDO
Applying Scalar Replacement gave a speedup of 2.22 on a MIPS M120…
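The trapezoidal jam can be checked the same way. This Python sketch assumes that the fused inner loop runs over the common range J = I+1 .. I+N2, with the iteration J = I of the first loop and the iteration J = I+N2+1 of the second loop peeled. It also assumes N is odd, so the I loop has an even trip count. W is stored so that w[k] stands for W(-k); the sizes and data are hypothetical.

```python
def original(f1, w, n, n2, dt):
    f3 = [0.0] * (n + 1)
    for i in range(n + 1):
        for j in range(i, i + n2 + 1):
            f3[i] += f1[j] * w[j - i]             # W(I-J) stored as w[J-I]
        f3[i] *= dt
    return f3


def unroll_and_jam(f1, w, n, n2, dt):
    assert n % 2 == 1, "sketch assumes an even trip count for the I loop"
    f3 = [0.0] * (n + 1)
    for i in range(0, n + 1, 2):
        f3[i] += f1[i] * w[0]                     # peeled J = I (first loop)
        for j in range(i + 1, i + n2 + 1):        # common range I+1 .. I+N2
            f3[i] += f1[j] * w[j - i]             # W(I-J)
            f3[i + 1] += f1[j] * w[j - i - 1]     # W(I-J+1)
        f3[i + 1] += f1[i + n2 + 1] * w[n2]       # peeled J = I+N2+1, W(-N2)
        f3[i] *= dt
        f3[i + 1] *= dt
    return f3


n, n2, dt = 7, 3, 0.5                             # hypothetical sizes
f1 = [float(k + 1) for k in range(n + n2 + 1)]
w = [1.0, 2.0, 3.0, 4.0]                          # W(0), W(-1), W(-2), W(-3)
print(original(f1, w, n, n2, dt) == unroll_and_jam(f1, w, n, n2, dt))
```

After the jam, F1(J) is used by both statements in the inner loop, which is the reuse that scalar replacement then exploits.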
Summary (1)
In this lecture we covered:
1. Loop Interchange – This gives us more dependences in the innermost loop which we can utilize for more register reuse.
2. Loop Fusion and Alignment – Bring uses together so they can share registers.
3. Complex Loops – How to overcome some of the problems in real-world programs.