Symbolic Program Transformation for Numerical Codes Vijay Menon Cornell University.

Post on 20-Dec-2015

218 views 1 download

Transcript of Symbolic Program Transformation for Numerical Codes Vijay Menon Cornell University.

Symbolic Program Transformation for Numerical Codes

Vijay MenonCornell University

Background Compiler/Runtime Support for Numerical

Programs (under Keshav Pingali)– Sparse matrix code generation

– Memory hierarchy optimizations

– Compiler/runtime techniques for MATLAB• MAJIC (with University of Illinois)• MultiMATLAB

Motivation

Using matrix properties to generate efficient code – e.g., Associativity, Commutativity of matrix

operations

Problem: loop notation– Matrix operations hidden within loop nests

Example: LU Factorization with Partial Pivoting

Also called: Gaussian Elimination

Key algorithm for solving systems of linear equations:– To solve A x = b for x:

• => Factor A into L U• => Solve L y = b for y• => Solve U x = y for x

LU Factorization

Problem: Poor cache behavior

do j = 1,N

p(j) = j; do i = j+1,N if (A(i,j)>A(p(j),j)) p(j) = i;

do k = 1,N tmp = A(j,k); A(j,k) = A(p(j),k); A(p(j),k) = tmp;

do i = j+1,N A(i,j) = A(i,j)/A(j,j);

do k = j+1,N do i = j+1,N A(i,k) = A(i,k) - A(i,j)*A(j,k);

Select pivot row:

Swap pivot row with current:

Scale column (to store L):

Update(to compute partial U):

x x x x x x0 x x x x x0 0 5 x x x0 0 3 x x x0 0 7 x x x 0 0 2 x x x

Automatic Blocking Compiler

transformations:– Stripmine– Index-Set Split– Loop Distribute– Tile/Map to BLAS

Problems:– Establishing legality– Mapping to BLAS

do jB = 1,N,B

do j = jB,jB+B-1 p(j) = j; do i = j+1,N if (A(i,j)>A(p(j),j)) p(j) = i; do k = 1,N tmp = A(j,k); A(j,k) = A(p(j),k); A(p(j),k) = tmp; do i = j+1,N A(i,j) = A(i,j)/A(j,j); do k = j+1,jB+B-1 do i = j+1,N A(i,k) = A(i,k) - A(i,j)*A(j,k);

do j = jB,jB+B-1 do k = jB+B,N do i = j+1,N A(i,k) = A(i,k) - A(i,j)*A(j,k);

LU Performance

0

50

100

150

200

250

300

350

400

450

500

size

MF

lop

s

LAPACK

Distributed w ith BLAS

Distributed

Original LU

Key Points

Conventional techniques (dependence analysis) are insufficient to establish legality

Detection of matrix operations buys extra performance

Overview of this Talk

Fractal Symbolic Analysis– Framework for Symbolically determining

Legality of Program Transformations

Matrixization– Generalization of Vectorization to Matrix

Operations

Fractal Symbolic Analysis

Symbolic test to establish legality of program transformations

for i = 1 : n S1(i); S2(i);

for i = 1 : n S1(i);for i = 1 : n S2(i);

Dependence Analysis Independent operations may be reordered

Legality: No data dependence from S2(m) to S1(l) where l > m

for i = 1 : n S1(i); S2(i);

for i = 1 : n S1(i);for i = 1 : n S2(i);

Symbolic Comparison Dependence Analysis is conservative:

Symbolic execution shows equality:– aout = 2*(ain + bin)

– bout = 2*bin

But, intractable for recurrent loops

s1: a = 2*as2: b = 2*bs3: a = a+b

s3: a = a+bs1: a = 2*as2: b = 2*b

?

Distillation of LU Given p(j) j, prove:

Dependence analysis: too conservative Symbolic comparison: intractable

for j = 1:n tmp = a(j) a(j) = a(p(j)) a(p(j)) = tmp

for i = j+1:n a(i) = a(i)/a(j)

B1(j):

B2(j):

for j = 1:n tmp = a(j) a(j) = a(p(j)) a(p(j)) = tmp

for j = 1:n for i = j+1:n a(i) = a(i)/a(j)

B1(j):

B2(j):

Idea

Simplify– Prove equality of complex programs via

equality of simpler programs– Repeat if necessary

Sufficient, but not necessary– equality of simpler programs -> equality of

original programs

CommutativityS1, S2, S3, …,Sn = Sf(1), Sf(2), Sf(3), …,Sf(n)

i,j 1..n | i < j f(i) > f(j).

Si; Sj = Sj; Si

Permutation Sequence of Adjacent Transpositions

Commutative Legality

Loop Distribution

Legality:1 m < l N.

S2(m); S1(l) = S1(l); S2(m)

for i = 1 : n S1(i); S2(i);

for i = 1 : n S1(i);for i = 1 : n S2(i);

Commutative Legality

Similar conditions:– Statement reordering– Loop interchange – Loop reversal – Loop tiling

Compiler establishes legality of transformation by testing commute condition

Back to Example Given p(j) j, prove:

Simplified comparison =>

for j = 1:n tmp = a(j) a(j) = a(p(j)) a(p(j)) = tmp

for i = j+1:n a(i) = a(i)/a(j)

B1(j):

B2(j):

for j = 1:n tmp = a(j) a(j) = a(p(j)) a(p(j)) = tmp

for j = 1:n for i = j+1:n a(i) = a(i)/a(j)

B1(j):

B2(j):

Simplified Comparison Given p(j) j l > m prove:

Further simplification?

for i = m+1:n a(i) = a(i)/a(m)

tmp = a(l) a(l) = a(p(l)) a(p(l)) = tmp

B2(m):

B1(l):

tmp = a(l) a(l) = a(p(l)) a(p(l)) = tmp

for i = m+1:n a(i) = a(i)/a(m)

B1(l):

B2(m):

Too Simple! Given p(j) j l > m i > m prove:

Too conservative - no longer legal! Hypothesis:

– Repeated application -> dependence analysis

a(i) = a(i)/a(m)

tmp = a(l) a(l) = a(p(l)) a(p(l)) = tmp

tmp = a(l) a(l) = a(p(l)) a(p(l)) = tmp

a(i) = a(i)/a(m)

Symbolic Comparison Can we symbolically prove?:

Effect can be summarized:– non-recurrent loops– affine indices/loop bounds

for i = m+1:n a(i) = a(i)/a(m)

tmp = a(l) a(l) = a(p(l)) a(p(l)) = tmp

B2(m):

B1(l):

tmp = a(l) a(l) = a(p(l)) a(p(l)) = tmp

for i = m+1:n a(i) = a(i)/a(m)

B1(l):

B2(m):

Guarded Expressions

Compare symbolic values of live output variable in terms of input variables:

aout (k)=

Each– guard : affine expression defining part of aout

– expr : symbolic expression on input data

guard1 -> expr1

guard2 -> expr2

……guardn -> exprn

Guarded Expressions

For both program blocks:

aout(k) =

Omega Library (integer programming tool) to manipulate/compare affine guards

k m => ain(k)

k = l => ain(p(l))/ain(m)

k = p(l) => ain(l)/ain(m)

else => ain(k)/ain(m)

Summary of Fractal Symbolic Analysis Powerful legality test Explores tradeoffs between

– tractability (dependence analysis)– accuracy (symbolic comparison)

Similar application for LU w/ pivoting– 2 recursive simplification steps– 6 guarded regions/expressions

Prototype implemented in OCAML

Overview of this Talk

Fractal Symbolic Analysis– Framework for Symbolically determining

Legality of Program Transformations

Matrixization– Generalization of Vectorization to Matrix

Operations

Matrixization

Detect Matrix Operations– Map to

• hand-optimized libraries (BLAS)• special-purpose hardware

– Eliminate loop overhead• MATLAB: type/bounds checks

Exploit Matrix Semantics– Ring Properties of Matrix (+,*)

BLAS Performance

050

100150200250300350400450500

100 200 300 400 500 600 700 800 900 1000size

MF

lop

s

Compiler Matrix-Matrix ProductBLAS Matrix-Matrix Product (DGEMM)Compiler Matrix-Vector ProductBLAS Matrix-Vector Product (DGEMV)

MATLAB Performance

0

10

20

30

40

50

60

70

80

Crank-Nicholson

Dirichlet FiniteDifference

Galerkin Inc.Cholesky

Ex

ecu

tio

n T

ime

(s)

Original

Vectorized

Compiled

Vect. & Comp.

Galerkin Example

In Galerkin, 98% of time spent in:

for i = 1:N for j = 1:N phi(k) += A(i,j)*x(i)*y(i);

end end

A - Matrix x, y - Vector

Vectorized Code

In Optimized Galerkin:

phi(k) += x*A*y’;

Fragment Speedup: 260 Program Speedup: 110

Conventional Approaches

Syntactic Pattern Replacement (KAPF,VAST,FALCON) Can encode high-level propertiesLimited use on loops

Vector Code Generation Can detect array operations in loopsCannot detect/exploit matrix products

Our Approach Map code to: Abstract Matrix Form (AMF)

Convert to Symmetric AMF formulation

Optimize AMF expressions– factorization– invariant removal

Detect Matrix Products

Map AMF to MATLAB/BLAS

Expansion

Array expressions:

Forall loops:

a(1:m,1:n) -> i j ai,j

for i = 1:n x(i) = x(i) + y(i); -> ixi = i(xi+yi)

end

Reduction

Sum Reduction Loops

for i = 1:n k = k + x(i); -> k = k + î i xi

end

Product

Matrix ProductC = A*B -> i jCi,j= P(i jAi,h,ijBh,j)

Product-Reduction Equivalence:

• e1 and e2 are scalar

• e1 is constant w.r.t. ik+1…in

• e2 is constant w.r.t. i1…ik-1

îki1…ine1 * i1…ine2) = Pîk(i1…ike1, ik…ine2)

Other AMF Properties

Distributive Properties– e1· î e2 = î((ie1) ·e2)

• when e1 is constant w.r.t. i

Interchange Properties if(e1, e2,…, en) = f(ie1, ie2,…, ien)

i e = i e • where î=

i j e = j i e

î e = î e

Back to Galerkin Original:

• k = k + î i j (ai,j * xi * yj)

Convert to Symmetric Form:• k = k + î (i j ai,j * i j xi * i j yj)

Optimize:• k = k + î i xi * (i j ai,j * i j yj)

Map to Matrix Operations:• k = k + Pî (i xi , P(i j ai,j, j yj))

=> k = k + x’*A*y;

Other Vectorization Examples

Taken from USENET:

for i = 1:n for j = 1:n C(i,j) = A(i,j)*x(i);

C(1:n,1:n) = A(1:n,1:n) .* repmat(x(1,n),1,n)

for i = 1:n x(i) = y(:,i)*A*y(:,i)’;

x(1:n) = sum(y.*(y*A’),2);

Summary/Status of Matrixization General technique to detect matrix/vector

operations

Implemented in MAJIC– MATLAB interpreter/compiler

Future work:– Extend to more ops: recurrences– JIT Matrixization

Related Work Commutativity Analysis

– Rinard & Diniz– Parallelization of OO programs

High-Level Pattern Replacement– DeRose, Marsolf, …– Exploit high-level matrix properties

• Structure• Symmetry• Orthogonality

Conclusion

High-level symbolic techniques:– Fractal Symbolic Analysis– Matrixization

New techniques to analyze & exploit underlying computation and achieve substantial performance gains