
A Language for the Compact Representation of Multiple Program Versions

Sébastien Donadio 1,2, James Brodman 3, Thomas Roeder 4, Kamen Yotov 4, Denis Barthou 2, Albert Cohen 5, María Jesús Garzarán 3, David Padua 3, and Keshav Pingali 4

1 BULL S.A.
2 University of Versailles
3 University of Illinois at Urbana-Champaign
4 Cornell University
5 INRIA Futurs

International Workshop LCPC 2005


Outline

Context in optimization for high performance
Goals of this language
Features of this language
Examples (Daxpy & Dgemm)
Conclusion


Context

Complex architecture and fragile optimizations
Unpredictable performance
Architecture, domain-specific optimizations
Resort to empirical search
Complement general-purpose optimizations with user-driven ones


Example FFT performance

[Figure: FFT performance, comparing a reasonable implementation (Numerical Recipes, GNU Scientific Library) with the best available implementations (FFTW, Intel IPP, Spiral)]


Goals of X-Language

Tool to help programmers generate and evaluate multiple versions of their programs:
  Applying control and data structure transformations
  Trying multiple transformation sequences and parameters
  Evaluating performance of each version and taking decisions about which transformation variants to try


Goals of X-Language (cont.)

The code must be portable across ISO C compilers:
  Use #pragma annotations for the above tasks
  Observable program semantics not altered by the interpretation of these pragmas (assuming transformation legality)


Comparison with related work

[Figure: chart of related work. Axis and category labels: Transformation, Generation, Black box, Manual, Domain specific, General purpose. Systems shown: Spiral, Atlas, Tick C, Reflection, Compiler, XLG, X-Language.]


Features of the language

Elementary transformations (fission, strip-mining, interchanging, unrolling, …)
Composition of transformations
Conditional transformations (versioning)
Procedural abstraction of transformations
A mechanism to define new transformations
No validity check is performed for the transformation


General schema of X-Language

[Figure: workflow diagram with the stages: code with pragmas, transformation descriptions, different versions, compile, execute and measure performance, search]


X-Language

Naming loops or scopes

    #pragma xlang name loop1
    for(i=0;i<10;i++) {a[i]=4;}

Format of transformation

    #pragma xlang stripmine loop1 4 ii

    stripmine: transformation name
    loop1: loop name
    4: parameters
    ii: name of additional loops generated by transformations


Elementary transformations implemented in X-Language

Full unrolling
Partial unrolling
Scalar promotion
Interchange
Loop fission
Loop fusion
Strip mining
Lifting
Software pipelining


Applying transformation

    #pragma xlang name loop1
    for(i=min;i<4*max;i++)
      a[i]=b[i];

    #pragma xlang stripmine loop1 4 ii

Generated code:

    #pragma xlang name loop1
    for(i=min;i<4*max;i+=4) {
      int nl1;
      #pragma xlang name ii
      for(nl1=0;nl1<4;nl1++)
        a[i+nl1]=b[i+nl1];
    }
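The effect of the strip-mining step above can be checked in plain C. The following is only a sketch: the function names, the array sizes and the test values are assumptions made for illustration, and the strip-mined version assumes a trip count that is a multiple of 4, as the slide's loop bounds suggest.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: copies b[lo..hi) into a. */
void copy_plain(double *a, const double *b, size_t lo, size_t hi)
{
    for (size_t i = lo; i < hi; i++)
        a[i] = b[i];
}

/* Strip-mined by 4: the outer loop steps by 4 and the inner loop
   (named "ii" on the slide) walks one strip of 4 elements.
   Assumes the trip count is a multiple of 4. */
void copy_stripmined4(double *a, const double *b, size_t lo, size_t hi)
{
    assert(lo <= hi && (hi - lo) % 4 == 0);
    for (size_t i = lo; i < hi; i += 4)
        for (size_t nl1 = 0; nl1 < 4; nl1++)
            a[i + nl1] = b[i + nl1];
}

/* Returns 1 when both versions write the same values. */
int stripmine_matches(void)
{
    double b[16], a1[16] = {0}, a2[16] = {0};
    for (size_t i = 0; i < 16; i++)
        b[i] = (double)i * 1.5;
    copy_plain(a1, b, 4, 16);
    copy_stripmined4(a2, b, 4, 16);
    for (size_t i = 0; i < 16; i++)
        if (a1[i] != a2[i])
            return 0;
    return 1;
}
```

Calling stripmine_matches() compares the two loops on a 12-element range; it says nothing about performance, only that the transformation preserves the result on this input.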


How to search the value of parameters?

Using multistage evaluation
External script

    for(k=1;k<16;k=2*k)
    ‘{
    #pragma xlang name loop1
    for(i=min;i<max;i++)
      a[i]=b[i];
    #pragma xlang stripmine loop1 ‘d(k) ii
    ‘}
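The search over k can be imitated in ordinary C by timing one kernel for each candidate factor. This is a hedged sketch, not the X-Language mechanism: X-Language generates a separate program version per value of k, whereas here a runtime parameter stands in for the generated versions. The names copy_strips and best_stripmine_factor, the array size and the use of clock() are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define SEARCH_N 4096

static double search_src[SEARCH_N], search_dst[SEARCH_N];

/* Copy kernel strip-mined by a runtime factor; n must be a multiple of it. */
static void copy_strips(double *a, const double *b, size_t n, size_t factor)
{
    assert(factor > 0 && n % factor == 0);
    for (size_t i = 0; i < n; i += factor)
        for (size_t s = 0; s < factor; s++)
            a[i + s] = b[i + s];
}

/* Times the kernel for k = 1, 2, 4, 8 (same sequence as the slide's
   for(k=1;k<16;k=2*k) loop) and returns the fastest factor observed. */
size_t best_stripmine_factor(int repeats)
{
    size_t best = 1;
    clock_t best_time = 0;

    for (size_t i = 0; i < SEARCH_N; i++)
        search_src[i] = (double)i;

    for (size_t k = 1; k < 16; k *= 2) {
        clock_t start = clock();
        for (int r = 0; r < repeats; r++)
            copy_strips(search_dst, search_src, SEARCH_N, k);
        clock_t elapsed = clock() - start;
        if (k == 1 || elapsed < best_time) {
            best_time = elapsed;
            best = k;
        }
    }
    return best;
}
```

The returned factor depends on the machine, the compiler and timer resolution, so it should be read as one sample of an empirical search rather than a stable answer.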


Composing transformations

    #pragma xlang name loop1
    for(i=0;i<4;i++)
      #pragma xlang name loop2
      for(j=min2;j<max2;j++)
        a[i]=b[j];

    #pragma xlang interchange loop1 loop2
    #pragma xlang fullunroll loop1

Generated code:

    #pragma xlang name loop2
    for(j=min2;j<max2;j++)
    {
      a[0]=b[j];
      a[1]=b[j];
      a[2]=b[j];
      a[3]=b[j];
    }
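A small C check of this composition (interchange followed by full unrolling of the 4-iteration loop) might look as follows. The function names and the input values are assumptions for illustration; the loop bodies follow the slide.

```c
#include <assert.h>

/* Original nest: loop1 over i is outermost, loop2 over j is innermost. */
void nest_original(double a[4], const double *b, int min2, int max2)
{
    for (int i = 0; i < 4; i++)
        for (int j = min2; j < max2; j++)
            a[i] = b[j];
}

/* After interchange (j outermost) and full unrolling of the i loop. */
void nest_transformed(double a[4], const double *b, int min2, int max2)
{
    for (int j = min2; j < max2; j++) {
        a[0] = b[j];
        a[1] = b[j];
        a[2] = b[j];
        a[3] = b[j];
    }
}

/* Returns 1 when both nests leave the same values in a. */
int composition_matches(void)
{
    const double b[6] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0};
    double a1[4] = {0}, a2[4] = {0};

    nest_original(a1, b, 1, 5);
    nest_transformed(a2, b, 1, 5);

    /* The last j visited is 4, so every a[i] ends up holding b[4]. */
    assert(a1[0] == 5.0);
    for (int i = 0; i < 4; i++)
        if (a1[i] != a2[i])
            return 0;
    return 1;
}
```

Since X-Language performs no validity check, a comparison of this kind is one way a user could gain confidence that a chosen sequence is legal for a given loop.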


Analyses and Transformations

Static analyses should also enable the design of smarter (higher level) transformation primitives

External tool to find information


Example with analysis

Original loop:

    for(i=2;i<2*N;i+=2)
    {
      u[i]=u[i-1]+u[i-2];
      u[i+1]=u[i]+u[i-1];
    }

Without interference graph:

    for(i=2;i<2*N;i+=2)
    {
      u_1=u[i-1];
      u_2=u[i-2];
      u_0 = u_1 + u_2;
      u_1 = u_0 + u_1;
      u[i]=u_0;
      u[i+1]=u_1;
    }

With interference graph:

    u_0=u[0];
    u_1=u[1];
    for(i=2;i<2*N;i+=2)
    {
      u_0 = u_1 + u_0;
      u_1 = u_0 + u_1;
      u[i]=u_0;
      u[i+1]=u_1;
    }
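The idea of carrying the two live values in scalars across iterations can be sketched in C as below. This is an illustration under assumptions: the names, the fixed array length and the use of integer data are chosen here, and the scalars are initialised from u[0] and u[1] so that no loads remain inside the loop.

```c
#include <assert.h>

#define LEN 16  /* array length; plays the role of 2*N and must be even */

/* Original loop: every operand is read from memory. */
void recur_original(long u[LEN])
{
    for (int i = 2; i < LEN; i += 2) {
        u[i]     = u[i - 1] + u[i - 2];
        u[i + 1] = u[i] + u[i - 1];
    }
}

/* Scalars carried across iterations: at loop entry u_0 holds u[i-2]
   and u_1 holds u[i-1], so the loop body needs no loads. */
void recur_promoted(long u[LEN])
{
    long u_0 = u[0];
    long u_1 = u[1];
    for (int i = 2; i < LEN; i += 2) {
        u_0 = u_1 + u_0;   /* u[i]   = u[i-1] + u[i-2] */
        u_1 = u_0 + u_1;   /* u[i+1] = u[i]   + u[i-1] */
        u[i]     = u_0;
        u[i + 1] = u_1;
    }
}

/* Returns 1 when both versions fill the array identically. */
int promotion_matches(void)
{
    long v1[LEN] = {1, 1}, v2[LEN] = {1, 1};
    recur_original(v1);
    recur_promoted(v2);
    assert(v1[2] == 2 && v1[3] == 3);  /* first computed pair */
    for (int i = 0; i < LEN; i++)
        if (v1[i] != v2[i])
            return 0;
    return 1;
}
```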


Extending the X-Language

Rewriting rule:

Pattern before:

    #pragma xlang name iloop
    for (i = 0; i < N; i++)
    { <body> }
    %

Pattern after transformation:

    #pragma xlang name iiloop1
    for (ii = 0; ii < (N/4)*4; ii += 4)
      #pragma xlang name iloop1
      for (i = ii; i < ii+4; i++)
      { <body> }
    #pragma xlang name iloop2
    for (i = (N/4)*4; i < N; i++)
    { <body> }
    %%
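The "pattern after" of this rule (a strip-mined main loop plus a remainder loop) can be written out in C for a concrete body. This is a sketch only: the body x[i] += 1, the function names and the 64-element limit in the checker are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Pattern before: a single loop over [0, n). */
void body_plain(int *x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] += 1;
}

/* Pattern after: strips of 4 up to (n/4)*4, then a remainder loop
   for the last n % 4 iterations. */
void body_rewritten(int *x, size_t n)
{
    size_t main_end = (n / 4) * 4;
    for (size_t ii = 0; ii < main_end; ii += 4)      /* iiloop1 */
        for (size_t i = ii; i < ii + 4; i++)         /* iloop1  */
            x[i] += 1;
    for (size_t i = main_end; i < n; i++)            /* iloop2  */
        x[i] += 1;
}

/* Returns 1 when both patterns agree for a trip count n <= 64. */
int rewrite_matches(size_t n)
{
    int x1[64], x2[64];
    assert(n <= 64);
    for (size_t i = 0; i < 64; i++)
        x1[i] = x2[i] = (int)i;
    body_plain(x1, n);
    body_rewritten(x2, n);
    for (size_t i = 0; i < 64; i++)
        if (x1[i] != x2[i])
            return 0;
    return 1;
}
```

Trip counts that are not multiples of 4, such as 7 or 10, exercise the remainder loop; a trip count of 0 exercises neither loop.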


Daxpy Example

    #pragma xlang name loop1
    for(k=0;k<2000;k++)
      Y[k]=alpha*X[k]+Y[k];

We can modify values of N

    /** A few values tested for unrolling factor: different generated versions **/
    #pragma xlang transform stripmine loop1 k N;
    #pragma xlang transform scalarize-in X in loop1
    #pragma xlang transform lift l1.loads before loop1
    #pragma xlang transform scalarize-out Y in loop1
    #pragma xlang transform lift loop1.loads before loop1
    #pragma xlang transform lift loop1.stores after loop1
    #pragma xlang transform fullunroll loop1.loads
    #pragma xlang transform fullunroll loop1.stores
    #pragma xlang transform fullunroll loop1


Daxpy Example: Different generated versions

Unrolling factor: 2

    for(k=0;k<2000;k=k+2){
      double x_0 = X[k+0];
      double x_1 = X[k+1];
      double y_0 = Y[k+0];
      double y_1 = Y[k+1];
      y_0=alpha*x_0+y_0;
      y_1=alpha*x_1+y_1;
      Y[k+0] = y_0;
      Y[k+1] = y_1;
    }

Unrolling factor: 4

    for(k=0;k<2000;k=k+4){
      double x_0 = X[k+0];
      double x_1 = X[k+1];
      double x_2 = X[k+2];
      double x_3 = X[k+3];
      double y_0 = Y[k+0];
      double y_1 = Y[k+1];
      double y_2 = Y[k+2];
      double y_3 = Y[k+3];
      y_0=alpha*x_0+y_0;
      y_1=alpha*x_1+y_1;
      y_2=alpha*x_2+y_2;
      y_3=alpha*x_3+y_3;
      Y[k+0] = y_0;
      Y[k+1] = y_1;
      Y[k+2] = y_2;
      Y[k+3] = y_3;
    }

Unrolling factor: 8

    for(k=0;k<2000;k=k+8){
      double x_0 = X[k+0];
      …
      y_0=alpha*x_0+y_0;
      y_1=alpha*x_1+y_1;
      y_2=alpha*x_2+y_2;
      y_3=alpha*x_3+y_3;
      …
      Y[k+0] = y_0;
      Y[k+1] = y_1;
      Y[k+2] = y_2;
      Y[k+3] = y_3;
      …
    }
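The unrolled versions above can be compared against a reference daxpy in C. The sketch below uses hypothetical function names and small integer-valued inputs (so the comparison is exact), and the unrolled version assumes n is a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* Reference daxpy: y = alpha * x + y. */
void daxpy_ref(size_t n, double alpha, const double *x, double *y)
{
    for (size_t k = 0; k < n; k++)
        y[k] = alpha * x[k] + y[k];
}

/* Unrolled by 4 with loads lifted before and stores after the
   arithmetic. Assumes n is a multiple of 4. */
void daxpy_unroll4(size_t n, double alpha, const double *x, double *y)
{
    assert(n % 4 == 0);
    for (size_t k = 0; k < n; k += 4) {
        double x_0 = x[k + 0], x_1 = x[k + 1];
        double x_2 = x[k + 2], x_3 = x[k + 3];
        double y_0 = y[k + 0], y_1 = y[k + 1];
        double y_2 = y[k + 2], y_3 = y[k + 3];
        y_0 = alpha * x_0 + y_0;
        y_1 = alpha * x_1 + y_1;
        y_2 = alpha * x_2 + y_2;
        y_3 = alpha * x_3 + y_3;
        y[k + 0] = y_0;
        y[k + 1] = y_1;
        y[k + 2] = y_2;
        y[k + 3] = y_3;
    }
}

/* Returns 1 when both versions agree on a small integer-valued input. */
int daxpy_versions_match(void)
{
    double x[8], y1[8], y2[8];
    for (size_t k = 0; k < 8; k++) {
        x[k] = (double)k;
        y1[k] = y2[k] = (double)(10 * k);
    }
    daxpy_ref(8, 2.0, x, y1);
    daxpy_unroll4(8, 2.0, x, y2);
    for (size_t k = 0; k < 8; k++)
        if (y1[k] != y2[k] || y1[k] != 12.0 * (double)k)
            return 0;
    return 1;
}
```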


Matrix Multiply (Loop Declaration)

The DGEMM example: Matrix Multiplication

    #pragma xlang name iloop
    for (i = 0; i < NB; i++)
      #pragma xlang name jloop
      for (j = 0; j < NB; j++)
        #pragma xlang name kloop
        for (k = 0; k < NB; k++)
        {
          c[i][j]=c[i][j]+a[i][k]*b[k][j];
        }

Problems:
Data locality
Scheduling


Matrix Multiply (Transformation Declaration)

Sequence of transformations for Itanium:

    #pragma xlang transform stripmine iloop NU NUloop
    #pragma xlang transform stripmine jloop MU MUloop
    #pragma xlang transform interchange kloop MUloop
    #pragma xlang transform interchange jloop NUloop
    #pragma xlang transform interchange kloop NUloop
    #pragma xlang transform fullunroll NUloop
    #pragma xlang transform fullunroll MUloop
    #pragma xlang transform scalarize_in b in kloop
    #pragma xlang transform scalarize_in a in kloop
    #pragma xlang transform scalarize_in&out c in kloop
    #pragma xlang transform lift kloop.loads before kloop
    #pragma xlang transform lift kloop.stores after kloop


Matrix Multiply (Transformation Sequence)

    #pragma xlang name iloop
    for(i = 0; i < NB; i++){
      #pragma xlang name jloop
      for(j = 0; j < NB; j += 4){
        #pragma xlang name kloop.loads
        {
          c_0_0 = c[i+0][j+0];
          c_0_1 = c[i+0][j+1];
          c_0_2 = c[i+0][j+2];
          c_0_3 = c[i+0][j+3];
        }
        #pragma xlang name kloop
        for(k = 0; k < NB; k++){
          {
            a_0 = a[i+0][k];
            a_1 = a[i+0][k];
            a_2 = a[i+0][k];
            a_3 = a[i+0][k];
          }
          {
            b_0 = b[k][j+0];
            b_1 = b[k][j+1];
            b_2 = b[k][j+2];
            b_3 = b[k][j+3];
          }
          {
            c_0_0=c_0_0+a_0*b_0;
            c_0_1=c_0_1+a_1*b_1;
            c_0_2=c_0_2+a_2*b_2;
            c_0_3=c_0_3+a_3*b_3;
          }
          ...
        }
        #pragma xlang name kloop.stores
        {
          c[i+0][j+0] = c_0_0;
          c[i+0][j+1] = c_0_1;
          c[i+0][j+2] = c_0_2;
          c[i+0][j+3] = c_0_3;
        }
      }
    }
    ... // Remainder code
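The register-blocked shape of the generated code can be compared with the naive triple loop in C. The following is a sketch under assumptions: a fixed block size of 8, a 1x4 register tile as in the listing above, no remainder code (8 is a multiple of 4), and hypothetical names and test data.

```c
#include <assert.h>

#define MM_NB 8  /* block size; a multiple of 4 so no remainder code is needed */

/* Naive version: c += a * b. */
void mm_naive(double c[MM_NB][MM_NB], double a[MM_NB][MM_NB],
              double b[MM_NB][MM_NB])
{
    for (int i = 0; i < MM_NB; i++)
        for (int j = 0; j < MM_NB; j++)
            for (int k = 0; k < MM_NB; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* 1x4 register tile: loads of c lifted before the k loop,
   stores lifted after it. */
void mm_tiled(double c[MM_NB][MM_NB], double a[MM_NB][MM_NB],
              double b[MM_NB][MM_NB])
{
    for (int i = 0; i < MM_NB; i++) {
        for (int j = 0; j < MM_NB; j += 4) {
            double c_0 = c[i][j + 0], c_1 = c[i][j + 1];
            double c_2 = c[i][j + 2], c_3 = c[i][j + 3];
            for (int k = 0; k < MM_NB; k++) {
                double a_0 = a[i][k];
                c_0 = c_0 + a_0 * b[k][j + 0];
                c_1 = c_1 + a_0 * b[k][j + 1];
                c_2 = c_2 + a_0 * b[k][j + 2];
                c_3 = c_3 + a_0 * b[k][j + 3];
            }
            c[i][j + 0] = c_0;
            c[i][j + 1] = c_1;
            c[i][j + 2] = c_2;
            c[i][j + 3] = c_3;
        }
    }
}

/* Returns 1 when both versions agree on integer-valued inputs. */
int mm_versions_match(void)
{
    static double a[MM_NB][MM_NB], b[MM_NB][MM_NB];
    static double c1[MM_NB][MM_NB], c2[MM_NB][MM_NB];
    for (int i = 0; i < MM_NB; i++)
        for (int j = 0; j < MM_NB; j++) {
            a[i][j] = (double)(i + j);
            b[i][j] = (double)(i - j);
            c1[i][j] = c2[i][j] = 1.0;
        }
    mm_naive(c1, a, b);
    mm_tiled(c2, a, b);
    for (int i = 0; i < MM_NB; i++)
        for (int j = 0; j < MM_NB; j++)
            if (c1[i][j] != c2[i][j])
                return 0;
    return 1;
}
```

Unlike the listing, this sketch loads a[i][k] once per k iteration instead of into four scalars; the arithmetic is the same.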


Block copies

Block Matrix Multiplication: better performance if matrices are contiguous in memory (TLB)
Poor performance of C copy
Resort to a tool generating specific asm code
Tool generating good code with search (XLG is an asm search)
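For reference, a block copy in plain C could be sketched as below: it gathers an nb x nb block of a row-major matrix into a contiguous buffer. This illustrates only what is being copied; it is the kind of C copy the slide reports as slow, not the generated asm, and the function names and layout assumptions (row-major storage with leading dimension ld) are chosen here for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies the nb x nb block starting at (row0, col0) of a row-major
   matrix with leading dimension ld into a contiguous nb*nb buffer. */
void copy_block(double *dst, const double *src, size_t ld,
                size_t row0, size_t col0, size_t nb)
{
    assert(col0 + nb <= ld);
    for (size_t r = 0; r < nb; r++)
        memcpy(dst + r * nb, src + (row0 + r) * ld + col0,
               nb * sizeof *dst);
}

/* Returns 1 when a 3x3 block of a 6x6 matrix is copied correctly. */
int block_copy_ok(void)
{
    double m[36], blk[9];
    for (size_t r = 0; r < 6; r++)
        for (size_t c = 0; c < 6; c++)
            m[r * 6 + c] = (double)(r * 10 + c);

    copy_block(blk, m, 6, 2, 1, 3);

    for (size_t r = 0; r < 3; r++)
        for (size_t c = 0; c < 3; c++)
            if (blk[r * 3 + c] != (double)((2 + r) * 10 + (1 + c)))
                return 0;
    return 1;
}
```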


Matrix Multiply (Results)

[Figure: Dgemm on Itanium 2. x-axis: Matrix Size, 128 to 2048 in steps of 128. y-axis: Performance (MFlops), 3300 to 5300. Series: Atlas, XLanguage+XLG, XLanguage+Memcopy, XLanguage+MKL, Peak.]


Conclusion

Describe transformations with reuse, procedures, conditionals
X-Language: language designed to generate multiversion programs
Multistage language with a flexible pattern-matching and rewriting language
Experts can describe application-specific transformations and optimizations


Future work

Dependence analysis
Going further in searching asm code transformations
More transformations: vectorization, alignment, …
