CMPUT680 - Winter 2006

March 14, 2002 1


Topic C: Loop Fusion

Kit Barton

www.cs.ualberta.ca/~cbarton


Outline

• Definition of loop fusion

• Basic concepts

• Prerequisites of loop fusion

• A loop fusion algorithm

• Example


Loop Fusion

• Combine 2 or more loops into a single loop

• This cannot violate any dependencies between the loop bodies

• Several conditions which must be met for fusion to occur

• Often these conditions are not initially satisfied
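As a minimal illustration of the definition above, the following Python sketch runs two conforming, adjacent loops separately and then as one fused loop. The function names and loop bodies are made up for this example and do not come from the slides.

```python
def separate_loops(x):
    """Two loops over the same range: the second reads what the first wrote."""
    n = len(x)
    a = [0] * n
    b = [0] * n
    for i in range(n):
        a[i] = x[i] * 2
    for i in range(n):
        b[i] = a[i] + 1
    return a, b


def fused_loop(x):
    """The same computation with both bodies in a single loop."""
    n = len(x)
    a = [0] * n
    b = [0] * n
    for i in range(n):
        a[i] = x[i] * 2
        b[i] = a[i] + 1  # a[i] was written earlier in this same iteration
    return a, b
```

Fusion is legal here because iteration i of the second loop only depends on iteration i of the first, so no dependence between the loop bodies is violated.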


Advantages of Loop Fusion

• Saves increment and branch instructions

• Creates opportunities for data reuse

• Provides more instructions to the instruction scheduler to balance the use of functional units


Disadvantages of Loop Fusion

• Increases code size, affecting instruction cache performance

• Increases register pressure within a loop

• Could cause the formation of loops with more complex control flow


Background

• There has been extensive work done on loop fusion

• Most has focused on weighted loop fusion (Gao et al., Kennedy and McKinley, Megiddo and Sarkar)

• Extensive work has also been done on performing loop fusion to increase parallelism


Weighted Loop Fusion

• Associates non-negative weights with each pair of loop nests

• Weights are a measurement of the expected gain if the two loops are fused

• Gains include potential for array contraction, data reuse and improved local register allocation


Optimal Loop Fusion

• Fuse loops to optimize data reuse, taking into consideration resource constraints and register usage

• This problem is NP-Hard


Maximal Loop Fusion

• Our approach is to perform maximal loop fusion

• Fuse as many loops as possible, without considering resource constraints

• Fuse loops as soon as possible, not considering the consequences

Dominators and Post Dominators

(Allen & Kennedy, p. 150, 353)

• A node x in a directed graph G with a single entry node dominates node y in G if every path from the entry node of G to y must pass through x

• A node x in a directed graph G with a single exit node post-dominates node y in G if every path from y to the exit node of G must pass through x
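The two definitions can be tried out with a small Python sketch. It uses the standard iterative set-intersection method for dominators and obtains post-dominators by running the same routine on the reversed graph. The CFG `example_cfg` and its node names are hypothetical, and every node is assumed to appear as a key of the successor map.

```python
def dominators(succ, entry):
    """Return {node: set of nodes that dominate it} for a CFG given as
    a successor map. Every node must be a key of succ."""
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for n, targets in succ.items():
        for t in targets:
            pred[t].add(n)
    dom = {n: set(nodes) for n in nodes}  # start from "everything dominates"
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            pred_doms = [dom[p] for p in pred[n]]
            common = set.intersection(*pred_doms) if pred_doms else set()
            new = {n} | common
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def post_dominators(succ, exit_node):
    """Post-dominators are the dominators of the reversed graph."""
    reverse = {n: [] for n in succ}
    for n, targets in succ.items():
        for t in targets:
            reverse[t].append(n)
    return dominators(reverse, exit_node)


# Hypothetical diamond-shaped CFG: A branches to B or C, which rejoin at D.
example_cfg = {
    "entry": ["A"], "A": ["B", "C"], "B": ["D"], "C": ["D"],
    "D": ["exit"], "exit": [],
}
```

In this example A dominates D, but neither B nor C does, because D can be reached through either branch.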


Requirements for Loop Fusion

i. Loops must have identical iteration counts (be conforming)

ii. Loops must be control-flow equivalent

iii. Loops must be adjacent

iv. There cannot be any negative distance dependencies between the loops


Non-conforming Loops

• If iteration counts are different, one loop must be manipulated to make the iteration counts the same

1. Loop peeling

2. Introduce a guard into one of the loops


Loop Peeling

• Find the difference between the iteration count of the two loops (n)

• Duplicate the body of the loop with the higher iteration count n times

• Update the iteration count of the peeled loop


Loop Peeling Example

Before:

    while (i < 10) {
        a[i] = a[i - 1] * 2;
        i++;
    }
    while (j < 12) {
        b[j] = b[j - 1] - 2;
        j++;
    }

After peeling two iterations from the second loop:

    while (i < 10) {
        a[i] = a[i - 1] * 2;
        i++;
    }
    while (j < 10) {
        b[j] = b[j - 1] - 2;
        j++;
    }
    b[j] = b[j - 1] - 2;
    j++;
    b[j] = b[j - 1] - 2;
    j++;
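The peeled second loop can be checked against the original with a short Python sketch. The slide does not give the initial value of j, so a start of 1 is assumed here, and the function names are made up for the example.

```python
def second_loop_original(b, start=1):
    """The second loop as written: runs while j < 12."""
    b = list(b)
    j = start
    while j < 12:
        b[j] = b[j - 1] - 2
        j += 1
    return b


def second_loop_peeled(b, start=1):
    """Same loop with the bound lowered to 10 and two iterations peeled off.
    Assumes start <= 10, so the peeled copies really are the last two
    iterations of the original loop."""
    b = list(b)
    j = start
    while j < 10:
        b[j] = b[j - 1] - 2
        j += 1
    b[j] = b[j - 1] - 2  # peeled iteration (j == 10)
    j += 1
    b[j] = b[j - 1] - 2  # peeled iteration (j == 11)
    j += 1
    return b
```

After peeling, the remaining loop has the same bound as the first loop, so the two are conforming.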


Guarding Iterations

• Increase the iteration count of the loop with fewer iterations

• Insert a guard branch around statements that would not normally be executed


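Guarding Iterations Example

Before:

    while (i < 10) {
        a[i] = a[i - 1] * 2;
        i++;
    }
    while (j < 12) {
        b[j] = b[j - 1] - 2;
        j++;
    }

After raising the bound of the first loop and guarding its body (the increment stays outside the guard so the loop still terminates):

    while (i < 12) {
        if (i < 10) {
            a[i] = a[i - 1] * 2;
        }
        i++;
    }
    while (j < 12) {
        b[j] = b[j - 1] - 2;
        j++;
    }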
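A Python sketch of the guarded form, fused with the second loop, can be compared against the two original loops. The starting index of 1 for both loops and the function names are assumptions made for the example. The induction variable is incremented outside the guard, otherwise the loop would stop making progress once the guard becomes false.

```python
def original_loops(a, b):
    """The two loops before any transformation (10 and 12 as bounds)."""
    a, b = list(a), list(b)
    i = 1
    while i < 10:
        a[i] = a[i - 1] * 2
        i += 1
    j = 1
    while j < 12:
        b[j] = b[j - 1] - 2
        j += 1
    return a, b


def guarded_and_fused(a, b):
    """First loop's bound raised to 12 with a guard, then fused with the second."""
    a, b = list(a), list(b)
    i = 1
    while i < 12:
        if i < 10:               # guard: extra iterations skip this statement
            a[i] = a[i - 1] * 2
        b[i] = b[i - 1] - 2
        i += 1                   # increment is not guarded
    return a, b
```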


Loop Peeling

• Advantage:
    • Does not generate control flow within a loop body

• Disadvantage:
    • Generates additional code outside of loops, which could possibly intervene between other loops


Guarding Iterations

• Advantages:
    • Does not introduce intervening code
    • Can be “undone” later

• Disadvantage:
    • Generates control flow within a loop


Control Flow Equivalence

• Two loops are control-flow equivalent if, whenever one executes, the other also executes

[Figure: two example CFGs, one containing Loop 1, BB, and Loop 2, the other containing Loop 1, Loop 3, BB, and Loop 2]


Determining Control Flow Equivalence

• Use the concepts of dominators and post dominators. Two loops L1 and L2 are control-flow equivalent if the following two conditions are true:
    • L1 dominates L2; and
    • L2 post dominates L1.
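The two conditions can be sketched in Python directly from the path-based definitions: x dominates y if y becomes unreachable from the entry once x is removed, and x post-dominates y if the exit becomes unreachable from y once x is removed. This is a simple check for small graphs, not an efficient dominance algorithm. Each loop is treated as a single CFG node, and both example graphs are hypothetical.

```python
def reachable(succ, start, removed):
    """Nodes reachable from start without passing through `removed`."""
    if start == removed:
        return set()
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in succ.get(node, []):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def dominates(succ, entry, x, y):
    return x == y or y not in reachable(succ, entry, removed=x)


def post_dominates(succ, exit_node, x, y):
    return x == y or exit_node not in reachable(succ, y, removed=x)


def control_flow_equivalent(succ, entry, exit_node, l1, l2):
    return (dominates(succ, entry, l1, l2)
            and post_dominates(succ, exit_node, l2, l1))


# Hypothetical CFGs: in the second one, L2 runs only on one branch after L1.
straight = {"entry": ["L1"], "L1": ["BB"], "BB": ["L2"],
            "L2": ["exit"], "exit": []}
branching = {"entry": ["L1"], "L1": ["BB", "L3"], "BB": ["L2"],
             "L3": ["exit"], "L2": ["exit"], "exit": []}
```

In `branching`, L1 still dominates L2, but L2 does not post-dominate L1 because the path through L3 reaches the exit without executing L2.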


Intervening Code

• Two loops are adjacent if there are no statements between the two loops

• Can be determined using the CFG:
    • If the immediate successor of the first loop is the second loop, the two loops are adjacent
    • If two loops are not adjacent, there is intervening code between them.


Dealing with Non-Adjacent Loops

• If two loops are not adjacent, we attempt to make them adjacent by moving the intervening code

• Intervening code can be moved, as long as no data dependencies are violated:
    • Above the first loop
    • Below the second loop
    • Both


Intervening Code Example

• Assume CFG has 20 nodes

• 0-5 are above Loop 1

• 17-19 are below Loop 2

• What algorithm should be used to determine which nodes are between Loop 1 and Loop 2?

[Figure: CFG in which nodes 6 through 16 lie between Loop 1 and Loop 2]


Gathering Intervening Code

• Given two loops L1 and L2, a basic block B is intervening code between L1 and L2 if and only if:
    o B is strictly dominated by L1
    o B is not dominated by L2

• Once the dominance relations are known, the set subtraction can be efficiently computed using bit vectors


Intervening Code Example

[Figure: CFG in which nodes 6 through 16 lie between Loop 1 and Loop 2]

Loop 1:      0000 0011 1111 1111 1111 1

Loop 2:      0000 0000 0000 0000 1111 1

Difference:  0000 0011 1111 1111 0000 0
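The set subtraction on bit vectors can be reproduced in Python, where an integer serves as the bit vector and `x & ~y` removes the members of y from x. The sketch assumes that the leftmost digit of each 21-digit vector above is position 0, so the Loop 1 vector has positions 6 to 20 set and the Loop 2 vector has positions 16 to 20 set. The helper names are made up for the example.

```python
def to_bits(positions):
    """Build a bit vector (as an int) with the given positions set."""
    vector = 0
    for p in positions:
        vector |= 1 << p
    return vector


def to_positions(vector):
    """List the positions that are set in a bit vector."""
    return [p for p in range(vector.bit_length()) if (vector >> p) & 1]


dominated_by_loop1 = to_bits(range(6, 21))   # 0000 0011 1111 1111 1111 1
dominated_by_loop2 = to_bits(range(16, 21))  # 0000 0000 0000 0000 1111 1

# Set subtraction: dominated by Loop 1 but not by Loop 2.
intervening = dominated_by_loop1 & ~dominated_by_loop2
```

The result has positions 6 to 15 set, which matches the difference vector shown above.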


Analyze Intervening Code

• Build a DDG of the intervening code

• Put all nodes with no predecessors into queue

• For each node in the queue:
    • If there are no dependencies between the node and the loop
        • Mark node as moveable
        • Add all of the node's immediate successors to the queue

• All nodes marked can be moved around the loop
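The queue-based analysis can be sketched in Python. Statements and loops are summarised by the variables they read and write, which is an assumption of this sketch and not something the slides prescribe. The sketch also enqueues a successor only once all of its DDG predecessors are moveable, a slightly stricter rule than the one stated above. The statement names and read/write sets are taken from the non-adjacent loops example in these slides.

```python
from collections import deque


def conflicts(x, y):
    """True if x and y share a flow, anti, or output dependence."""
    return bool(x["writes"] & (y["reads"] | y["writes"])
                or x["reads"] & y["writes"])


def moveable_nodes(stmts, loop):
    """stmts: ordered {name: {"reads": set, "writes": set}} of intervening code.
    Returns the names that can be moved past `loop`."""
    names = list(stmts)
    preds = {n: set() for n in names}
    succs = {n: set() for n in names}
    for idx, u in enumerate(names):          # build the DDG
        for v in names[idx + 1:]:
            if conflicts(stmts[u], stmts[v]):
                succs[u].add(v)
                preds[v].add(u)
    queue = deque(n for n in names if not preds[n])
    moveable = set()
    while queue:
        n = queue.popleft()
        if n in moveable or conflicts(stmts[n], loop):
            continue
        moveable.add(n)
        for s in succs[n]:
            if preds[s] <= moveable:         # all predecessors can move too
                queue.append(s)
    return moveable


intervening = {
    "b": {"reads": {"a"}, "writes": {"b"}},        # b := a * 2
    "c": {"reads": {"b"}, "writes": {"c"}},        # c := b + 6
    "g": {"reads": set(), "writes": {"g"}},        # g := 0
    "h": {"reads": {"g"}, "writes": {"h"}},        # h := g + 10
    "if": {"reads": {"c"}, "writes": {"d", "e"}},  # if (c < 100) ...
}
loop1 = {"reads": {"a", "i", "N"}, "writes": {"a", "i"}}
loop2 = {"reads": {"g", "j", "N"}, "writes": {"f", "j"}}
```

With these summaries, `g` and `h` are independent of Loop 1 and `b`, `c` and the `if` statement are independent of Loop 2.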


Non-Adjacent Loops Example

    while (i < N) {
        a += i;
        i++;
    }
    b := a * 2;
    c := b + 6;
    g := 0;
    h := g + 10;
    if (c < 100)
        d := c/2;
    else
        e := c * 2;
    while (j < N) {
        f := g + 6;
        j++;
    }

[Figure: DDG of the intervening code, with nodes b := a * 2; c := b + 6; g := 0; if (c < 100) d := c/2 else e := c * 2; h := g + 10]


Non-Adjacent Loops Example

After moving the intervening code above the first loop and below the second loop:

    g := 0;
    h := g + 10;
    while (i < N) {
        a += i;
        i++;
    }
    while (j < N) {
        f := g + 6;
        j++;
    }
    b := a * 2;
    c := b + 6;
    if (c < 100)
        d := c/2;
    else
        e := c * 2;


Non-Adjacent Loops Example

[Figure: DDG of the intervening code (b := a * 2; c := b + 6; g := 0; if (c < 100) d := c/2 else e := c * 2; h := g + 10), checked against Loop 2]

Node Queue: b := a * 2; g := 0;

Moveable Nodes:

    b := a * 2;
    c := b + 6;
    if (c < 100)
        d := c/2;
    else
        e := c * 2;

Loop 2:

    while (j < N) {
        f := g + 6;
        j++;
    }


Non-Adjacent Loops Example

[Figure: DDG of the intervening code (b := a * 2; c := b + 6; g := 0; if (c < 100) d := c/2 else e := c * 2; h := g + 10), checked against Loop 1]

Node Queue: b := a * 2; g := 0;

Moveable Nodes:

    g := 0;
    h := g + 10;

Loop 1:

    while (i < N) {
        a += i;
        i++;
    }


Dependencies Preventing Fusion

Can the following loops be fused?

    i = j = 1;
    while (i < 10) {
        a[i] = c[i] + 10;
        i++;
    }
    while (j < 10) {
        b[j] = a[j+1] * 2;
        j++;
    }



• If we look at the array access patterns of a[], we see the following

a[i] = c[i] + 10;

b[j] = a[j+1] * 2;



• By aligning the array access patterns, we get the following:

a[i] = c[i] + 10;

b[j] = a[j+1] * 2;


Loop Alignment

Before:

    i = j = 1;
    while (i < 10) {
        a[i] = c[i] + 10;
        i++;
    }
    while (j < 10) {
        b[j] = a[j+1] * 2;
        j++;
    }

After alignment (the first iteration of the first loop is peeled):

    j = 1; i = 2;
    a[1] = c[1] + 10;
    while (i < 10) {
        a[i] = c[i] + 10;
        i++;
    }
    while (j < 10) {
        b[j] = a[j+1] * 2;
        j++;
    }
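The effect of alignment can be checked with a Python sketch that compares the original two loops, a naive fusion, and a fusion after alignment. The naive version reads a[i+1] before the first statement has updated it, which is the negative distance dependence. In the aligned version the first loop runs one iteration ahead. Peeling the last iteration of the second loop so that the iteration counts match is an extra step assumed here and is not shown on the slide. Array sizes and function names are also assumptions.

```python
def separate(c, a0):
    """The two original loops, run one after the other."""
    a, b = list(a0), [0] * len(a0)
    for i in range(1, 10):
        a[i] = c[i] + 10
    for j in range(1, 10):
        b[j] = a[j + 1] * 2
    return a, b


def naive_fused(c, a0):
    """Fusion without alignment: b[i] sees the OLD value of a[i+1]."""
    a, b = list(a0), [0] * len(a0)
    for i in range(1, 10):
        a[i] = c[i] + 10
        b[i] = a[i + 1] * 2
    return a, b


def aligned_fused(c, a0):
    """Fusion after alignment: the first statement is shifted one iteration ahead."""
    a, b = list(a0), [0] * len(a0)
    a[1] = c[1] + 10              # peeled first iteration of the first loop
    for j in range(1, 9):
        a[j + 1] = c[j + 1] + 10  # first loop, running one iteration ahead
        b[j] = a[j + 1] * 2
    b[9] = a[10] * 2              # peeled last iteration of the second loop
    return a, b


c_values = list(range(11))
a_initial = list(range(100, 111))
```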


Loop Alignment

• Loop alignment can be used to remove dependencies between loop bodies

• Easy to do when all dependencies have the same distance

• Gets tricky when there are multiple dependencies with different distances


Putting it all together

• We’ve seen ways to deal with each of the preconditions of loop fusion

• If the conditions are not met, we apply transformations to try and modify the code

• If the transformations are successful, loop fusion can occur

• But in what order should these transformations be applied?


Loop Fusion Algorithm

    For each Ni from outermost to innermost:
        Gather control equivalent loops in Ni into LoopSets
        For each set Si in LoopSets:
            remove non-eligible loops from Si
            FusedLoops = true
            Direction = forward
            while FusedLoops == true:
                if |Si| < 2 break
                Compute Dominance Relation
                FusedLoops = LoopFusionPass(Si, Direction)
                Reverse Direction


Loop Fusion Algorithm

    LoopFusionPass(S, Direction)
        FusedLoops = false
        For each pair of loops Lj and Lk in S such that Lj dominates Lk in Direction:
            if (DependenceDistance(Lj, Lk) < 0) continue
            if (InterveningCode(Lj, Lk) == true and
                IsInterveningCodeMoveable(Lj, Lk) == false) continue
            d = | IterationCount(Lj) - IterationCount(Lk) |
            if (Lj and Lk are non-conforming and
                (d cannot be determined at compile time or d > MAXPEEL)) continue
            if (Lj and Lk are non-conforming) Peel iterations
            MoveInterveningCode(Lj, Lk)
            if InterveningCode(Lj, Lk) == false:
                FuseLoops(Lj, Lk)
                FusedLoops = true
        Return FusedLoops
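The conformance test on d can be illustrated with a small Python sketch. It assumes that an iteration count is represented as a (symbol, offset) pair, so n - 1 becomes ("n", -1), and that d is known at compile time only when both counts use the same symbol. This representation, the function names, and the value chosen for MAX_PEEL are all assumptions of the sketch. The slides name the MAXPEEL threshold but give no value for it.

```python
MAX_PEEL = 2  # hypothetical value; the slides only name the threshold MAXPEEL


def peel_distance(count_j, count_k):
    """Return |count_j - count_k| if it is a compile-time constant, else None.
    Counts are (symbol, offset) pairs, e.g. ("n", -1) for n - 1."""
    symbol_j, offset_j = count_j
    symbol_k, offset_k = count_k
    if symbol_j != symbol_k:
        return None              # e.g. n - 1 versus m: difference unknown
    return abs(offset_j - offset_k)


def can_make_conforming(count_j, count_k, max_peel=MAX_PEEL):
    """True if the loops already conform or can be made to by peeling."""
    d = peel_distance(count_j, count_k)
    return d is not None and d <= max_peel
```

For instance, counts of n and n - 1 differ by one iteration and can be made conforming by peeling, while counts of n - 1 and m cannot be compared at compile time.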


Example

    L1: do i1 = 1, n
          a(i1) = a(i1) * k1
        end do
    L2: do i2 = 1, n-1
          d(i2) = a(i2) - b(i2+1) * k2
        end do
    S1: ds = 0.0
    L3: do i3 = 1, m
          ds = ds + d(i3)
        end do
    S2: if (n<m)
    S3:   c(n-2) = n
    S4: else
    S5:   c(n-2) = m
    L4: do i4 = 1, n-2
          b(i4) = a(i4) + b(i4) / c(i4)
        end do

Loop Set: L1, L2, L3, L4


Peeling Loop 1

After peeling the first iteration of L1 (S7):

    S7: a(1) = a(1) * k1
    L1: do i1 = 1, n-1
          a(i1+1) = a(i1+1) * k1
        end do
    L2: do i2 = 1, n-1
          d(i2) = a(i2) - b(i2+1) * k2
        end do
    S1: ds = 0.0
    L3: do i3 = 1, m
          ds = ds + d(i3)
        end do
    S2: if (n<m)
    S3:   c(n-2) = n
    S4: else
    S5:   c(n-2) = m
    L4: do i4 = 1, n-2
          b(i4) = a(i4) + b(i4) / c(i4)
        end do


Fuse L1 and L2

After fusing L1 and L2 into L5:

    S7: a(1) = a(1) * k1
    L5: do i5 = 1, n-1
          a(i5+1) = a(i5+1) * k1
          d(i5) = a(i5) - b(i5+1) * k2
        end do
    S1: ds = 0.0
    L3: do i3 = 1, m
          ds = ds + d(i3)
        end do
    S2: if (n<m)
    S3:   c(n-2) = n
    S4: else
    S5:   c(n-2) = m
    L4: do i4 = 1, n-2
          b(i4) = a(i4) + b(i4) / c(i4)
        end do


Compare L5 and L3

• We now compare loops L5 and L3

• They are not adjacent, but the intervening code can move

• The difference in iteration count is not known, so fusion fails

    S7: a(1) = a(1) * k1
    L5: do i5 = 1, n-1
          a(i5+1) = a(i5+1) * k1



Compare L5 and L4

    S7: a(1) = a(1) * k1
    L5: do i5 = 1, n-1
          a(i5+1) = a(i5+1) * k1

Intervening Code:

    S1: ds = 0.0
    L3: do i3 = 1, m
          ds = ds + d(i3)
        end do
    S2: if (n<m)
    S3:   c(n-2) = n
    S4: else
    S5:   c(n-2) = m


Peel L5

Before:

    S7: a(1) = a(1) * k1
    L5: do i5 = 1, n-1
          a(i5+1) = a(i5+1) * k1

After peeling the first iteration of L5 (S8, S9):

    S7: a(1) = a(1) * k1
    S8: a(2) = a(2) * k1
    S9: d(1) = a(1) - b(2) * k2
    L5: do i5 = 1, n-2
          a(i5+2) = a(i5+2) * k1
          d(i5+1) = a(i5+1) - b(i5+2) * k2
        end do
    S1: ds = 0.0
    L3: do i3 = 1, m
          ds = ds + d(i3)
        end do
    S2: if (n<m)
    S3:   c(n-2) = n
    S4: else
    S5:   c(n-2) = m
    L4: do i4 = 1, n-2
          b(i4) = a(i4) + b(i4) / c(i4)
        end do


Move Intervening Code

After moving S1 and S2-S5 above L5:

    S7: a(1) = a(1) * k1
    S8: a(2) = a(2) * k1
    S9: d(1) = a(1) - b(2) * k2
    S1: ds = 0.0
    S2: if (n<m)
    S3:   c(n-2) = n
    S4: else
    S5:   c(n-2) = m
    L5: do i5 = 1, n-2
          a(i5+2) = a(i5+2) * k1
          d(i5+1) = a(i5+1) - b(i5+2) * k2
        end do
    L3: do i3 = 1, m
          ds = ds + d(i3)
        end do
    L4: do i4 = 1, n-2
          b(i4) = a(i4) + b(i4) / c(i4)
        end do


Reverse Pass

    S7: a(1) = a(1) * k1
    S8: a(2) = a(2) * k1
    S9: d(1) = a(1) - b(2) * k2
    S1: ds = 0.0
    S2: if (n<m)
    S3:   c(n-2) = n
    S4: else
    S5:   c(n-2) = m
    L5: do i5 = 1, n-2
          a(i5+2) = a(i5+2) * k1
          d(i5+1) = a(i5+1) - b(i5+2) * k2
        end do
    L3: do i3 = 1, m
          ds = ds + d(i3)
        end do
    L4: do i4 = 1, n-2
          b(i4) = a(i4) + b(i4) / c(i4)
        end do

Loop Set: L1, L3, L4

Sorted in Reverse Dominance Direction: L1, L3, L4


Compare L4 and L3

• Compare L4 and L3

• No dependencies to prevent fusion

• Iteration count cannot be determined at compile time

• Fusion fails


Compare L4 and L5

Intervening Code:

    L3: do i3 = 1, m
          ds = ds + d(i3)
        end do


Move Intervening Code

After moving L3 below L4:

    S7: a(1) = a(1) * k1
    S8: a(2) = a(2) * k1
    S9: d(1) = a(1) - b(2) * k2
    S1: ds = 0.0
    S2: if (n<m)
    S3:   c(n-2) = n
    S4: else
    S5:   c(n-2) = m
    L5: do i5 = 1, n-2
          a(i5+2) = a(i5+2) * k1
          d(i5+1) = a(i5+1) - b(i5+2) * k2
        end do
    L4: do i4 = 1, n-2
          b(i4) = a(i4) + b(i4) / c(i4)
        end do
    L3: do i3 = 1, m
          ds = ds + d(i3)
        end do


Fuse L4 and L5

After fusing L5 and L4 into L6:

    S7: a(1) = a(1) * k1
    S8: a(2) = a(2) * k1
    S9: d(1) = a(1) - b(2) * k2
    S1: ds = 0.0
    S2: if (n<m)
    S3:   c(n-2) = n
    S4: else
    S5:   c(n-2) = m
    L6: do i6 = 1, n-2
          a(i6+2) = a(i6+2) * k1
          d(i6+1) = a(i6+1) - b(i6+2) * k2
          b(i6) = a(i6) + b(i6) / c(i6)
        end do
    L3: do i3 = 1, m
          ds = ds + d(i3)
        end do
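As a sanity check on the whole example, the original program and the final fused program can be transcribed into Python and compared on sample data. The sketch keeps the 1-based indexing of the example by leaving index 0 of each list unused. It assumes n >= 3 and that the input lists have at least n + 1 elements. The function names and the sample values are made up for the check.

```python
def original_program(n, m, a, b, c, k1, k2):
    a, b, c = list(a), list(b), list(c)
    d = [0.0] * (max(n, m) + 1)
    for i in range(1, n + 1):            # L1
        a[i] = a[i] * k1
    for i in range(1, n):                # L2: 1 .. n-1
        d[i] = a[i] - b[i + 1] * k2
    ds = 0.0                             # S1
    for i in range(1, m + 1):            # L3
        ds = ds + d[i]
    c[n - 2] = n if n < m else m         # S2-S5
    for i in range(1, n - 1):            # L4: 1 .. n-2
        b[i] = a[i] + b[i] / c[i]
    return a, b, c, d, ds


def fused_program(n, m, a, b, c, k1, k2):
    a, b, c = list(a), list(b), list(c)
    d = [0.0] * (max(n, m) + 1)
    a[1] = a[1] * k1                     # S7
    a[2] = a[2] * k1                     # S8
    d[1] = a[1] - b[2] * k2              # S9
    ds = 0.0                             # S1
    c[n - 2] = n if n < m else m         # S2-S5
    for i in range(1, n - 1):            # L6: 1 .. n-2
        a[i + 2] = a[i + 2] * k1
        d[i + 1] = a[i + 1] - b[i + 2] * k2
        b[i] = a[i] + b[i] / c[i]
    for i in range(1, m + 1):            # L3
        ds = ds + d[i]
    return a, b, c, d, ds


sample_a = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sample_b = [0.0, 3.0, 5.0, 7.0, 9.0, 11.0, 13.0]
sample_c = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
```

Both versions perform the same floating-point operations on each element in the same order, so their results can be compared for exact equality.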
