Enhancing Fine-Grained Parallelism
Chapter 5 of Allen and Kennedy
Optimizing Compilers for Modern Architectures
Fine-Grained Parallelism
Techniques to enhance fine-grained parallelism:
• Loop Interchange
• Scalar Expansion
• Scalar Renaming
• Array Renaming
• Node Splitting
Recall the vectorization procedure, codegen:
procedure codegen(R, k, D);
// R is the region for which we must generate code.
// k is the minimum nesting level of possible parallel loops.
// D is the dependence graph among statements in R.
  find the set {S1, S2, ..., Sm} of maximal strongly-connected regions
    in the dependence graph D restricted to R;
  construct Rp from R by reducing each Si to a single node and compute Dp,
    the dependence graph naturally induced on Rp by D;
  let {p1, p2, ..., pm} be the m nodes of Rp numbered in an order
    consistent with Dp (use topological sort to do the numbering);
  for i = 1 to m do begin
    if pi is cyclic then begin
      generate a level-k DO statement;
      let Di be the dependence graph consisting of all dependence edges
        in D that are at level k+1 or greater and are internal to pi;
      codegen(pi, k+1, Di);
      generate the level-k ENDDO statement;
    end
    else
      generate a vector statement for pi in r(pi)-k+1 dimensions,
        where r(pi) is the number of loops containing pi;
  end
end
We can fail here: if a dependence cycle persists down to the innermost loop level, codegen must generate a sequential loop rather than a vector statement.
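As a rough sketch (not the book's implementation), the recursion can be expressed in Python; the graph encoding, the `sccs` helper, and the emitted pseudo-statement strings are all illustrative assumptions:

```python
# Sketch of codegen: dependences are (src, dst, level) triples, with
# level None for loop-independent edges. Emitted strings stand in for
# generated code.
from collections import defaultdict

def sccs(nodes, edges):
    """Tarjan's algorithm; returns SCCs in reverse topological order."""
    index, low = {}, {}
    stack, on_stack, out, counter = [], set(), [], [0]

    def connect(v):
        index[v] = low[v] = counter[0]; counter[0] += 1
        stack.append(v); on_stack.add(v)
        for w in edges[v]:
            if w not in index:
                connect(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:
            comp = []
            while True:
                w = stack.pop(); on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            out.append(comp)

    for v in nodes:
        if v not in index:
            connect(v)
    return out

def codegen(region, k, deps):
    adj = defaultdict(list)
    for s, d, lvl in deps:
        if s in region and d in region:
            adj[s].append(d)
    code = []
    for comp in reversed(sccs(list(region), adj)):   # topological order
        cyclic = len(comp) > 1 or any(s == d == comp[0] for s, d, _ in deps)
        if cyclic:
            code.append(f"DO (level {k})")
            # keep only dependences at level k+1 or deeper, internal to comp
            inner = [(s, d, l) for s, d, l in deps
                     if s in comp and d in comp and l is not None and l > k]
            code += codegen(set(comp), k + 1, inner)
            code.append("ENDDO")
        else:
            code.append(f"vectorize {comp[0]} at level {k}")
    return code
```

Here a node is "cyclic" if its SCC has more than one statement or a self-edge, and the recursive call keeps only dependences at level k+1 or deeper, mirroring the pseudocode above.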
Can we do better?
• codegen tries to find parallelism using only two transformations: loop distribution and statement reordering
• If we deal with loops containing cyclic dependences early on in the loop nest, we can potentially vectorize more loops
• Goal in Chapter 5: To explore other transformations to exploit parallelism
Motivational Example

DO J = 1, M
  DO I = 1, N
    T = 0.0
    DO K = 1,L
      T = T + A(I,K) * B(K,J)
    ENDDO
    C(I,J) = T
  ENDDO
ENDDO
codegen will not uncover any vector operations. However, by scalar expansion, we can get:
DO J = 1, M
  DO I = 1, N
    T$(I) = 0.0
    DO K = 1,L
      T$(I) = T$(I) + A(I,K) * B(K,J)
    ENDDO
    C(I,J) = T$(I)
  ENDDO
ENDDO
Motivational Example II
• Loop distribution gives us:

DO J = 1, M
  DO I = 1, N
    T$(I) = 0.0
  ENDDO
  DO I = 1, N
    DO K = 1,L
      T$(I) = T$(I) + A(I,K) * B(K,J)
    ENDDO
  ENDDO
  DO I = 1, N
    C(I,J) = T$(I)
  ENDDO
ENDDO
Motivational Example III
Finally, interchanging the I and K loops, we get:

DO J = 1, M
  T$(1:N) = 0.0
  DO K = 1,L
    T$(1:N) = T$(1:N) + A(1:N,K) * B(K,J)
  ENDDO
  C(1:N,J) = T$(1:N)
ENDDO

• A couple of new transformations used:
  - Loop interchange
  - Scalar expansion
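As a quick numerical sanity check (array sizes and contents below are made up), the original loop nest and the fully transformed version can be compared; plain inner I loops stand in for the vector statements:

```python
# Original scalar-T loop vs. expanded + distributed + interchanged form.
N, M, L = 3, 4, 5
A = [[(i + 2 * k) % 7 for k in range(L)] for i in range(N)]
B = [[(3 * k + j) % 5 for j in range(M)] for k in range(L)]

# Original: scalar T serializes everything.
C1 = [[0.0] * M for _ in range(N)]
for j in range(M):
    for i in range(N):
        T = 0.0
        for k in range(L):
            T = T + A[i][k] * B[k][j]
        C1[i][j] = T

# Transformed: T expanded to T$(1:N); the I loops are vectorizable.
C2 = [[0.0] * M for _ in range(N)]
for j in range(M):
    Tx = [0.0] * N                      # T$(1:N) = 0.0
    for k in range(L):
        for i in range(N):              # vector statement over I
            Tx[i] = Tx[i] + A[i][k] * B[k][j]
    for i in range(N):
        C2[i][j] = Tx[i]                # C(1:N,J) = T$(1:N)

assert C1 == C2
```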
Loop Interchange

DO I = 1, N
  DO J = 1, M
S   A(I,J+1) = A(I,J) + B      ! DV: (=, <)
  ENDDO
ENDDO

• Applying loop interchange:

DO J = 1, M
  DO I = 1, N
S   A(I,J+1) = A(I,J) + B      ! DV: (<, =)
  ENDDO
ENDDO

• leads to:

DO J = 1, M
S A(1:N,J+1) = A(1:N,J) + B
ENDDO
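Since the direction vector is (=, <), the interchange is safe; a small Python check with made-up sizes confirms that both loop orders produce the same array:

```python
# The dependence A(I,J+1) <- A(I,J) has direction vector (=, <), so
# interchanging the I and J loops preserves the result.
N, M, B = 4, 5, 2.0

def original():
    A = [[float(i + j) for j in range(M + 1)] for i in range(N)]
    for i in range(N):
        for j in range(M):
            A[i][j + 1] = A[i][j] + B
    return A

def interchanged():
    A = [[float(i + j) for j in range(M + 1)] for i in range(N)]
    for j in range(M):
        for i in range(N):          # inner loop now vectorizable over I
            A[i][j + 1] = A[i][j] + B
    return A

assert original() == interchanged()
```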
• Loop interchange is a reordering transformation
• Why?
  - Think of each statement instance as parameterized by its iteration vector
  - Loop interchange merely changes the execution order of these statement instances
  - It does not create new instances, or delete existing instances
DO J = 1, M
  DO I = 1, N
S   <some statement>
  ENDDO
ENDDO
• If interchanged, S(2, 1) will execute before S(1, 2)
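Listing the iteration vectors for a tiny M = N = 2 nest makes this concrete:

```python
# Iteration vectors (J, I): interchange permutes their execution order
# but creates or deletes no instances.
M, N = 2, 2
original = [(j, i) for j in range(1, M + 1) for i in range(1, N + 1)]
interchanged = [(j, i) for i in range(1, N + 1) for j in range(1, M + 1)]

assert sorted(original) == sorted(interchanged)          # same instances
assert original.index((1, 2)) < original.index((2, 1))   # before interchange
assert interchanged.index((2, 1)) < interchanged.index((1, 2))  # after
```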
Loop Interchange: Safety
• Safety: not all loop interchanges are safe
DO J = 1, M
  DO I = 1, N
    A(I,J+1) = A(I+1,J) + B
  ENDDO
ENDDO
• Direction vector (<, >)
• If we interchange loops, we violate the dependence
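A small Python experiment (made-up array contents) shows the two loop orders really do compute different results here:

```python
# A(I,J+1) = A(I+1,J) + B has direction vector (<, >) with J outermost;
# interchanging the loops changes the answer.
N, M, B = 4, 4, 1.0

def run(outer_j):
    A = [[float(i * 10 + j) for j in range(M + 1)] for i in range(N + 1)]
    if outer_j:                      # original order: J outer, I inner
        for j in range(M):
            for i in range(N):
                A[i][j + 1] = A[i + 1][j] + B
    else:                            # interchanged: I outer, J inner
        for i in range(N):
            for j in range(M):
                A[i][j + 1] = A[i + 1][j] + B
    return A

# The interchanged loop reads stale values of A(I+1,J), so the arrays differ.
assert run(True) != run(False)
```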
• A dependence is interchange-preventing with respect to a given pair of loops if interchanging those loops would reorder the endpoints of the dependence.
• A dependence is interchange-sensitive if it is carried by the same loop after interchange. That is, an interchange-sensitive dependence moves with its original carrier loop to the new level.
• Example: Interchange-Sensitive?
• Example: Interchange-Insensitive?
• Theorem 5.1 Let D(i,j) be a direction vector for a dependence in a perfect nest of n loops. Then the direction vector for the same dependence after a permutation of the loops in the nest is determined by applying the same permutation to the elements of D(i,j).
• The direction matrix for a nest of loops is a matrix in which each row is a direction vector for some dependence between statements contained in the nest and every such direction vector is represented by a row.
DO I = 1, N
  DO J = 1, M
    DO K = 1, L
      A(I+1,J+1,K) = A(I,J,K) + A(I,J+1,K+1)
    ENDDO
  ENDDO
ENDDO
• The direction matrix for the loop nest is:
< < =
< = >
• Theorem 5.2 A permutation of the loops in a perfect nest is legal if and only if the direction matrix, after the same permutation is applied to its columns, has no ">" direction as the leftmost non-"=" direction in any row.
• Follows from Theorem 5.1 and Theorem 2.3
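The Theorem 5.2 test is easy to state as code; this sketch checks, for a candidate permutation, that no row of the permuted direction matrix has ">" as its leftmost non-"=" entry:

```python
# Legality test for a loop permutation, per Theorem 5.2.
def legal(direction_matrix, perm):
    for row in direction_matrix:
        for d in (row[p] for p in perm):
            if d == '>':
                return False      # '>' is the leftmost non-'=' entry
            if d == '<':
                break             # dependence carried; rest of row is free
    return True

# Direction matrix of the example loop nest (columns are I, J, K).
D = [['<', '<', '='],
     ['<', '=', '>']]

assert legal(D, [0, 1, 2])        # original order I, J, K
assert legal(D, [0, 2, 1])        # I, K, J: rows become (<,=,<), (<,>,=)
assert not legal(D, [2, 0, 1])    # K outermost: second row becomes (>,<,=)
```

For this matrix, interchanging J and K is legal, while moving K to the outermost position is not.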
Loop Interchange: Profitability
• Profitability depends on architecture
DO I = 1, N
  DO J = 1, M
    DO K = 1, L
S     A(I+1,J+1,K) = A(I,J,K) + B
    ENDDO
  ENDDO
ENDDO
• For SIMD machines with a large number of functional units:

DO I = 1, N
S A(I+1,2:M+1,1:L) = A(I,1:M,1:L) + B
ENDDO
• Not suitable for vector register machines
• For vector machines, we want to vectorize loops with stride-one memory access
• Since Fortran stores arrays in column-major order:
  - it is most useful to vectorize the I-loop
• Thus, transform to:
DO J = 1, M
  DO K = 1, L
S   A(2:N+1,J+1,K) = A(1:N,J,K) + B
  ENDDO
ENDDO
• For MIMD machines with vector execution units, we want to cut down synchronization costs
• Hence, shift K-loop to outermost level:
PARALLEL DO K = 1, L
  DO J = 1, M
    A(2:N+1,J+1,K) = A(1:N,J,K) + B
  ENDDO
END PARALLEL DO
Scalar Expansion

DO I = 1, N
S1  T = A(I)
S2  A(I) = B(I)
S3  B(I) = T
ENDDO

• Scalar expansion:

DO I = 1, N
S1  T$(I) = A(I)
S2  A(I) = B(I)
S3  B(I) = T$(I)
ENDDO
T = T$(N)

• leads to:

S1 T$(1:N) = A(1:N)
S2 A(1:N) = B(1:N)
S3 B(1:N) = T$(1:N)
   T = T$(N)
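A quick Python check (made-up data) that the expanded, vectorized form is equivalent to the scalar loop, including the final value of T:

```python
# Scalar expansion of the array-swap loop: T becomes T$(I).
N = 6
A0 = list(range(N))
B0 = [10 * x for x in range(N)]

# Original scalar loop.
A, B = A0[:], B0[:]
for i in range(N):
    T = A[i]
    A[i] = B[i]
    B[i] = T

# Expanded form, written as whole-array (vector) statements.
Tx = A0[:]          # T$(1:N) = A(1:N)
A2 = B0[:]          # A(1:N) = B(1:N)
B2 = Tx[:]          # B(1:N) = T$(1:N)
T2 = Tx[-1]         # T = T$(N)

assert (A, B, T) == (A2, B2, T2)
```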
• However, not always profitable. Consider:
DO I = 1, N
  T = T + A(I) + A(I+1)
  A(I) = T
ENDDO

• Scalar expansion gives us:

T$(0) = T
DO I = 1, N
S1  T$(I) = T$(I-1) + A(I) + A(I+1)
S2  A(I) = T$(I)
ENDDO
T = T$(N)
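Expansion here is still correct, but S1 becomes a first-order recurrence over T$, so that loop stays essentially serial; a small check (made-up data) of the equivalence:

```python
# Expanding T is safe but unprofitable: S1 is still a recurrence.
N = 6
Av = [float(x + 1) for x in range(N + 1)]
T0 = 5.0

# Original loop.
A1, T = Av[:], T0
for i in range(N):
    T = T + A1[i] + A1[i + 1]
    A1[i] = T

# Expanded and distributed form.
A2 = Av[:]
Tx = [0.0] * (N + 1)
Tx[0] = T0                                   # T$(0) = T
for i in range(N):                           # S1: serial recurrence
    Tx[i + 1] = Tx[i] + A2[i] + A2[i + 1]
for i in range(N):                           # S2: vectorizable
    A2[i] = Tx[i + 1]

assert A1 == A2 and T == Tx[-1]
```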
Scalar Expansion: Safety
• Scalar expansion is always safe
• When is it profitable?
  - Naïve approach: expand all scalars, vectorize, then shrink all unnecessary expansions
  - However, we want to predict when expansion is profitable
• Dependences due to reuse of a memory location vs. reuse of values:
  - Dependences due to reuse of values must be preserved
  - Dependences due to reuse of a memory location can be deleted by expansion
Scalar Expansion: Drawbacks
• Expansion increases memory requirements
• Solutions:
  - Expand in a single loop
  - Strip mine the loop before expansion
  - Forward substitution:

DO I = 1, N
  T = A(I) + A(I+1)
  A(I) = T + B(I)
ENDDO

becomes

DO I = 1, N
  A(I) = A(I) + A(I+1) + B(I)
ENDDO
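A small equivalence check (made-up data) for the forward-substitution rewrite:

```python
# Forward substitution eliminates T; both loops fill A identically.
N = 5
A0 = list(range(N + 1))
B0 = [2 * x for x in range(N)]

# Original loop with scalar T.
A = A0[:]
for i in range(N):
    T = A[i] + A[i + 1]
    A[i] = T + B0[i]

# Forward-substituted loop.
A2 = A0[:]
for i in range(N):
    A2[i] = A2[i] + A2[i + 1] + B0[i]

assert A == A2
```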
Scalar Renaming

DO I = 1, 100
S1  T = A(I) + B(I)
S2  C(I) = T + T
S3  T = D(I) - B(I)
S4  A(I+1) = T * T
ENDDO

• Renaming scalar T:

DO I = 1, 100
S1  T1 = A(I) + B(I)
S2  C(I) = T1 + T1
S3  T2 = D(I) - B(I)
S4  A(I+1) = T2 * T2
ENDDO
• will lead to:

S3 T2$(1:100) = D(1:100) - B(1:100)
S4 A(2:101) = T2$(1:100) * T2$(1:100)
S1 T1$(1:100) = A(1:100) + B(1:100)
S2 C(1:100) = T1$(1:100) + T1$(1:100)
   T = T2$(100)
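A Python check (made-up data) that the renamed, reordered vector form matches the original loop; list comprehensions stand in for the vector statements, and S4 is written as a loop only for clarity:

```python
# Scalar renaming: T split into T1/T2, then S3,S4 run before S1,S2.
n = 100
A0 = [float(x) for x in range(n + 1)]
Bv = [float(3 * x + 1) for x in range(n)]
Dv = [float(x * x % 11) for x in range(n)]

# Original loop (0-indexed: A[i+1] plays the role of A(I+1)).
A = A0[:]
C = [0.0] * n
for i in range(n):
    T = A[i] + Bv[i]
    C[i] = T + T
    T = Dv[i] - Bv[i]
    A[i + 1] = T * T

# Renamed and vectorized form.
A2 = A0[:]
T2x = [Dv[i] - Bv[i] for i in range(n)]          # S3
for i in range(n):                               # S4: A(2:101)
    A2[i + 1] = T2x[i] * T2x[i]
T1x = [A2[i] + Bv[i] for i in range(n)]          # S1 reads the updated A
C2 = [T1x[i] + T1x[i] for i in range(n)]         # S2
T_final = T2x[-1]                                # T = T2$(100)

assert (A, C, T) == (A2, C2, T_final)
```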
Node Splitting
• Sometimes renaming fails
DO I = 1, N
S1  A(I) = X(I+1) + X(I)
S2  X(I+1) = B(I) + 32
ENDDO
• The recurrence is kept intact by the renaming algorithm
• Break the critical antidependence
• Make a copy of the node from which the antidependence emanates
DO I = 1, N
S1' X$(I) = X(I+1)
S1  A(I) = X$(I) + X(I)
S2  X(I+1) = B(I) + 32
ENDDO
• Recurrence broken
• Vectorized to:

X$(1:N) = X(2:N+1)
X(2:N+1) = B(1:N) + 32
A(1:N) = X$(1:N) + X(1:N)
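A final check (made-up data) that the split, vectorized form reproduces the original loop's A and X:

```python
# Node splitting: X$(I) captures X(I+1) before S2 overwrites it.
N = 8
X0 = [float(x) for x in range(N + 1)]
Bv = [float(2 * x) for x in range(N)]

# Original loop: the antidependence S1 -> S2 on X(I+1) forms a recurrence.
X = X0[:]
A = [0.0] * N
for i in range(N):
    A[i] = X[i + 1] + X[i]
    X[i + 1] = Bv[i] + 32

# Split and vectorized form (loops stand in for vector statements).
X2 = X0[:]
Xcopy = [X2[i + 1] for i in range(N)]        # X$(1:N) = X(2:N+1)
for i in range(N):                           # X(2:N+1) = B(1:N) + 32
    X2[i + 1] = Bv[i] + 32
A2 = [Xcopy[i] + X2[i] for i in range(N)]    # A(1:N) = X$(1:N) + X(1:N)

assert (A, X) == (A2, X2)
```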