A Language for the Compact Representation of Multiple Program Versions Sébastien Donadio 1,2, James...

A Language for the Compact Representation of Multiple Program Versions
Sébastien Donadio 1,2, James Brodman 3, Thomas Roeder 4, Kamen Yotov 4, Denis Barthou 2, Albert Cohen 5, María Jesús Garzarán 3, David Padua 3, and Keshav Pingali 4
1 BULL S.A. 2 University of Versailles 3 University of Illinois at Urbana-Champaign 4 Cornell University 5 INRIA
International Workshop LCPC 2005


A Language for the Compact Representation of Multiple Program Versions

Sébastien Donadio 1,2, James Brodman 3, Thomas Roeder 4, Kamen Yotov 4, Denis Barthou 2, Albert Cohen 5, María Jesús Garzarán 3, David Padua 3, and Keshav Pingali 4

1 BULL S.A.
2 University of Versailles
3 University of Illinois at Urbana-Champaign
4 Cornell University
5 INRIA Futurs

International Workshop LCPC 2005


Outline

- Context in optimization for high performance
- Goals of this language
- Features of this language
- Examples (Daxpy & Dgemm)
- Conclusion


Context

- Complex architecture and fragile optimizations
  - Unpredictable performance
- Architecture, domain-specific optimizations
  - Resort to empirical search
  - Complement general-purpose optimizations with user-driven ones


Example: FFT performance

[Figure: FFT performance of a reasonable implementation (Numerical Recipes, GNU Scientific Library) compared with the best available implementations (FFTW, Intel IPP, Spiral).]


Goals of X-Language

Tool to help programmers generate and evaluate multiple versions of their programs:
- Applying control and data structure transformations
- Trying multiple transformation sequences and parameters
- Evaluating performance of each version and taking decisions about which transformation variants to try


Goals of X-Language (cont.)

The code must be portable across ISO C compilers:
- Use #pragma annotations for the above tasks
- Observable program semantics not altered by the interpretation of these pragmas (assuming transformation legality)


Comparison with related work

[Figure: positioning of related tools. Axis labels: Transformation / Generation, Black box / Manual, Domain specific / General purpose. Tools shown: Spiral, Atlas, Tick C, Reflection, Compiler, XLG, X-Language.]


Features of the language

- Elementary transformations (fission, stripmining, interchanging, unrolling, …)
- Composition of transformations
- Conditional transformations (versioning)
- Procedural abstraction of transformations
- A mechanism to define new transformations
- No validity check is performed for the transformation


General schema of X-Language

[Figure: workflow diagram. Box labels: Code with Pragmas; Transformation Descriptions; Different versions; Compile; Execute and measure performance; search.]
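The "execute and measure performance, then search" part of the schema can be illustrated with a small timing harness. This is only a sketch under assumptions: the names version_fn, version_plain, version_strip4 and pick_fastest are hypothetical, the two candidate versions are written by hand rather than generated, and clock() is a coarse timer that a real search would replace with something more precise.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical signature shared by all candidate versions. */
typedef void (*version_fn)(double *a, const double *b, size_t n);

/* Candidate 1: plain copy loop. */
void version_plain(double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i];
}

/* Candidate 2: the same loop strip-mined by 4 (n must be a multiple of 4). */
void version_strip4(double *a, const double *b, size_t n) {
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4)
        for (size_t s = 0; s < 4; s++)
            a[i + s] = b[i + s];
}

/* Runs every candidate reps times and returns the index of the one
   with the smallest measured CPU time. */
size_t pick_fastest(const version_fn *versions, size_t count,
                    double *a, const double *b, size_t n, int reps) {
    assert(count > 0 && reps > 0);
    size_t best = 0;
    clock_t best_ticks = 0;
    for (size_t v = 0; v < count; v++) {
        clock_t start = clock();
        for (int r = 0; r < reps; r++)
            versions[v](a, b, n);
        clock_t ticks = clock() - start;
        if (v == 0 || ticks < best_ticks) {
            best = v;
            best_ticks = ticks;
        }
    }
    return best;
}
```

Which index wins depends on the machine and compiler, which is the point of searching empirically instead of predicting.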


X-Language

Naming loops or scopes:

    #pragma xlang name loop1
    for(i=0;i<10;i++) {a[i]=4;}

Format of transformation:

    #pragma xlang stripmine loop1 4 ii

Fields, in order:
- #pragma xlang
- Transformation name (stripmine)
- Loop name (loop1)
- Parameters (4)
- Name of additional loops generated by transformations (ii)


Elementary transformations implemented in X-Language

- Full unrolling
- Partial unrolling
- Scalar promote
- Interchange
- Loop fission
- Loop fusion
- Strip mining
- Lifting
- Software pipelining


Applying transformation

Before:

    #pragma xlang loop1
    for(i=min;i<4*max;i++)
      a[i]=b[i];

    #pragma xlang stripmine loop1 4 ii

After:

    #pragma xlang loop1
    for(i=min;i<4*max;i+=4) {
      int nl1;
    #pragma xlang ii
      for(nl1=0;nl1<4;nl1++)
        a[i+nl1]=b[i+nl1];
    }
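The effect of strip mining can be checked in plain C by comparing the two loop shapes. This is a minimal sketch, not X-Language output: the function names copy_original and copy_stripmined are hypothetical, and it assumes the trip count is a multiple of the strip size 4 so that no remainder loop is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop: one element per iteration. */
void copy_original(double *a, const double *b, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i];
}

/* Strip-mined by 4: the outer loop steps over strips, the inner loop
   (named ii on the slide) walks inside one strip.
   Assumption: n is a multiple of 4. */
void copy_stripmined(double *a, const double *b, size_t n) {
    assert(n % 4 == 0);
    for (size_t i = 0; i < n; i += 4) {
        for (size_t nl1 = 0; nl1 < 4; nl1++)
            a[i + nl1] = b[i + nl1];
    }
}
```

Both functions touch the same elements in the same order; only the loop structure differs, which is what makes the inner loop a candidate for later full unrolling.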


How to search the value of parameters?

- Using multistage evaluation
- External script

    for(k=1;k<16;k=2*k)
    ‘{
    #pragma xlang loop1
    for(i=min;i<max;i++)
      a[i]=b[i];
    #pragma xlang stripmine loop1 ‘d(k) ii
    ‘}
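The generator loop above produces one annotated variant per value of k. The same enumeration can be imitated with ordinary string generation, which is roughly what an external script would do. This sketch is an assumption-laden stand-in, not the multistage mechanism itself: emit_variant and emit_all are hypothetical names, and the emitted text simply mirrors the pragma lines on the slide.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Writes one annotated variant for strip size k into buf.
   Returns the value reported by snprintf. */
int emit_variant(char *buf, size_t cap, int k) {
    assert(buf != NULL && k > 0);
    return snprintf(buf, cap,
        "#pragma xlang loop1\n"
        "for(i=min;i<max;i++)\n"
        "  a[i]=b[i];\n"
        "#pragma xlang stripmine loop1 %d ii\n", k);
}

/* Mirrors the generator loop on the slide: k = 1, 2, 4, 8.
   Calls sink (if not NULL) once per variant and returns the variant count. */
int emit_all(void (*sink)(int k, const char *text)) {
    char buf[256];
    int count = 0;
    for (int k = 1; k < 16; k = 2 * k) {
        emit_variant(buf, sizeof buf, k);
        if (sink != NULL)
            sink(k, buf);
        count++;
    }
    return count;
}
```

A sink that writes each variant to its own file would give the set of versions to compile and time.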


Composing transformations

Before:

    #pragma xlang loop1
    for(i=0;i<4;i++)
    #pragma xlang loop2
      for(j=min2;j<max2;j++)
        a[i]=b[j];

    #pragma xlang interchange loop1 loop2
    #pragma xlang fullunroll loop1

After:

    #pragma xlang loop2
    for(j=min2;j<max2;j++)
    {
      a[0]=b[j];
      a[1]=b[j];
      a[2]=b[j];
      a[3]=b[j];
    }
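The composition of interchange and full unrolling can be checked by running both loop nests on the same input. This is a hand-written sketch with hypothetical function names (nest_original, nest_transformed); it does not model how X-Language applies the pragmas, and, like X-Language, it performs no legality check.

```c
#include <assert.h>
#include <stddef.h>

/* Original nest: loop1 over i outside, loop2 over j inside. */
void nest_original(double a[4], const double *b, size_t n) {
    assert(a != NULL && b != NULL);
    for (int i = 0; i < 4; i++)
        for (size_t j = 0; j < n; j++)
            a[i] = b[j];
}

/* After interchange (j outside) and full unrolling of the i loop. */
void nest_transformed(double a[4], const double *b, size_t n) {
    assert(a != NULL && b != NULL);
    for (size_t j = 0; j < n; j++) {
        a[0] = b[j];
        a[1] = b[j];
        a[2] = b[j];
        a[3] = b[j];
    }
}
```

In both versions every a[i] ends up holding the last element of b, so the final state is the same even though the order of the writes changed.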


Analyses and Transformations

Static analyses should also enable the design of smarter (higher level) transformation primitives

External tool to find information


Example with analysis

Original loop:

    for(i=2;i<2*N;i+=2)
    {
      u[i]=u[i-1]+u[i-2];
      u[i+1]=u[i]+u[i-1];
    }

Without interference graph:

    for(i=2;i<2*N;i+=2)
    {
      u_1=u[i-1];
      u_2=u[i-2];
      u_0 = u_1 + u_2;
      u_1 = u_0 + u_1;
      u[i]=u_0;
      u[i+1]=u_1;
    }

With interference graph:

    u_0=u[0];
    u_1=u[1];
    for(i=2;i<2*N;i+=2)
    {
      u_0 = u_1 + u_0;   /* u_0 holds u[i-2] on entry (u_2 in the version above) */
      u_1 = u_0 + u_1;
      u[i]=u_0;
      u[i+1]=u_1;
    }
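The three loop shapes in this example can be compared directly. This is a sketch under stated assumptions: the function names are hypothetical, long integers replace the unspecified element type of u, and the third version reuses u_0 for the value that the second version keeps in u_2, which is the reading under which the loads can be hoisted out of the loop.

```c
#include <assert.h>
#include <stddef.h>

/* Original loop. u must hold 2*n elements. */
void recur_original(long *u, size_t n) {
    for (size_t i = 2; i < 2 * n; i += 2) {
        u[i]     = u[i - 1] + u[i - 2];
        u[i + 1] = u[i] + u[i - 1];
    }
}

/* Scalars introduced, but reloaded from memory on every iteration. */
void recur_scalar_reload(long *u, size_t n) {
    for (size_t i = 2; i < 2 * n; i += 2) {
        long u_1 = u[i - 1];
        long u_2 = u[i - 2];
        long u_0 = u_1 + u_2;
        u_1 = u_0 + u_1;
        u[i] = u_0;
        u[i + 1] = u_1;
    }
}

/* Loads hoisted before the loop: u_0 and u_1 carry u[i-2] and u[i-1]
   from one iteration to the next. */
void recur_scalar_hoisted(long *u, size_t n) {
    assert(n >= 1);
    long u_0 = u[0];
    long u_1 = u[1];
    for (size_t i = 2; i < 2 * n; i += 2) {
        u_0 = u_1 + u_0;
        u_1 = u_0 + u_1;
        u[i] = u_0;
        u[i + 1] = u_1;
    }
}
```

The hoisted version performs no loads inside the loop, only the two stores per iteration.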


Extending the X-Language

Rewriting rule:

Pattern before:

    #pragma xlang name iloop
    for (i = 0; i < N; i++)
    { <body> }
    %

Pattern after transformation:

    #pragma xlang name iiloop1
    for (ii = 0; ii < (N/4)*4; ii += 4)
    #pragma xlang name iloop1
      for (i = ii; i < ii+4; i++)
      { <body> }
    #pragma xlang name iloop2
    for (i = (N/4)*4; i < N; i++)
    { <body> }
    %%
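The "pattern after" shape handles trip counts that are not a multiple of 4 by adding a remainder loop. The sketch below instantiates that shape in C with a counter standing in for <body>; the function name stripmine_with_remainder and the visits array are hypothetical and only serve to show that every index is executed exactly once.

```c
#include <assert.h>
#include <stddef.h>

/* Main loop over full strips of 4, then a remainder loop for the
   last n % 4 iterations. visits[i]++ stands in for <body>. */
void stripmine_with_remainder(int *visits, int n) {
    assert(visits != NULL && n >= 0);
    int i, ii;
    for (ii = 0; ii < (n / 4) * 4; ii += 4)      /* iiloop1 */
        for (i = ii; i < ii + 4; i++)            /* iloop1  */
            visits[i]++;
    for (i = (n / 4) * 4; i < n; i++)            /* iloop2  */
        visits[i]++;
}
```

With n = 10 the main loop covers indices 0 to 7 and the remainder loop covers 8 and 9.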


Daxpy Example

    #pragma xlang name loop1
    for(k=0;k<2000;k++)
      Y[k]=alpha*X[k]+Y[k];

We can modify values of N.

    /** A few values tested for unrolling factor – Different generated version **/
    #pragma xlang transform stripmine loop1 k N;
    #pragma xlang transform scalarize-in X in loop1
    #pragma xlang transform lift l1.loads before loop1
    #pragma xlang transform scalarize-out Y in loop1
    #pragma xlang transform lift loop1.loads before loop1
    #pragma xlang transform lift loop1.stores after loop1
    #pragma xlang transform fullunroll loop1.loads
    #pragma xlang transform fullunroll loop1.stores
    #pragma xlang transform fullunroll loop1


Daxpy Example – Different generated versions

Unrolling factor: 2

    for(k=0;k<2000;k=k+2){
      double x_0 = X[k+0];
      double x_1 = X[k+1];
      double y_0 = Y[k+0];
      double y_1 = Y[k+1];
      y_0=alpha*x_0+y_0;
      y_1=alpha*x_1+y_1;
      Y[k+0] = y_0;
      Y[k+1] = y_1;
    }

Unrolling factor: 4

    for(k=0;k<2000;k=k+4){
      double x_0 = X[k+0];
      double x_1 = X[k+1];
      double x_2 = X[k+2];
      double x_3 = X[k+3];
      double y_0 = Y[k+0];
      double y_1 = Y[k+1];
      double y_2 = Y[k+2];
      double y_3 = Y[k+3];
      y_0=alpha*x_0+y_0;
      y_1=alpha*x_1+y_1;
      y_2=alpha*x_2+y_2;
      y_3=alpha*x_3+y_3;
      Y[k+0] = y_0;
      Y[k+1] = y_1;
      Y[k+2] = y_2;
      Y[k+3] = y_3;
    }

Unrolling factor: 8 (excerpt; only part of the loop body is visible on the slide)

    for(k=0;k<2000;k=k+8){
      double x_0 = X[k+0];
      double x_1 = X[k+1];
      double x_2 = X[k+2];
      ...
      y_0=alpha*x_0+y_0;
      y_1=alpha*x_1+y_1;
      y_2=alpha*x_2+y_2;
      y_3=alpha*x_3+y_3;
      ...
      Y[k+0] = y_0;
      Y[k+1] = y_1;
      Y[k+2] = y_2;
      Y[k+3] = y_3;
      ...
    }
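A generated version like the ones above can be validated against the reference loop before it is timed. This sketch does that for unrolling factor 4; daxpy_ref and daxpy_unroll4 are hypothetical names, the array length is a parameter rather than the fixed 2000 of the slide, and the length is assumed to be a multiple of 4.

```c
#include <assert.h>
#include <stddef.h>

/* Reference loop: Y = alpha * X + Y. */
void daxpy_ref(size_t n, double alpha, const double *x, double *y) {
    for (size_t k = 0; k < n; k++)
        y[k] = alpha * x[k] + y[k];
}

/* Unrolled by 4 with loads lifted before and stores lifted after
   the arithmetic, following the shape of the generated code. */
void daxpy_unroll4(size_t n, double alpha, const double *x, double *y) {
    assert(n % 4 == 0);
    for (size_t k = 0; k < n; k += 4) {
        double x_0 = x[k + 0], x_1 = x[k + 1];
        double x_2 = x[k + 2], x_3 = x[k + 3];
        double y_0 = y[k + 0], y_1 = y[k + 1];
        double y_2 = y[k + 2], y_3 = y[k + 3];
        y_0 = alpha * x_0 + y_0;
        y_1 = alpha * x_1 + y_1;
        y_2 = alpha * x_2 + y_2;
        y_3 = alpha * x_3 + y_3;
        y[k + 0] = y_0;
        y[k + 1] = y_1;
        y[k + 2] = y_2;
        y[k + 3] = y_3;
    }
}
```

Each element is computed with the same single multiply and add in both versions, so the results should agree element by element.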


Matrix Multiply (Loop Declaration)

The DGEMM example: Matrix Multiplication

    #pragma xlang name iloop
    for (i = 0; i < NB; i++)
    #pragma xlang name jloop
      for (j = 0; j < NB; j++)
    #pragma xlang name kloop
        for (k = 0; k < NB; k++)
        {
          c[i][j]=c[i][j]+a[i][k]*b[k][j];
        }

Problems:
- Data locality
- Scheduling


Matrix Multiply (Transformation Declaration)

Sequence of transformations for Itanium:

    #pragma xlang transform stripmine iloop NU NUloop
    #pragma xlang transform stripmine jloop MU MUloop
    #pragma xlang transform interchange kloop MUloop
    #pragma xlang transform interchange jloop NUloop
    #pragma xlang transform interchange kloop NUloop
    #pragma xlang transform fullunroll NUloop
    #pragma xlang transform fullunroll MUloop
    #pragma xlang transform scalarize_in b in kloop
    #pragma xlang transform scalarize_in a in kloop
    #pragma xlang transform scalarize_in&out c in kloop
    #pragma xlang transform lift kloop.loads before kloop
    #pragma xlang transform lift kloop.stores after kloop


Matrix Multiply (Transformation Sequence)

    #pragma xlang name iloop
    for(i = 0; i < NB; i++){
    #pragma xlang name jloop
      for(j = 0; j < NB; j += 4){
    #pragma xlang name kloop.loads
        {
          c_0_0 = c[i+0][j+0];
          c_0_1 = c[i+0][j+1];
          c_0_2 = c[i+0][j+2];
          c_0_3 = c[i+0][j+3];
        }
    #pragma xlang name kloop
        for(k = 0; k < NB; k++){
          {
            a_0 = a[i+0][k];
            a_1 = a[i+0][k];
            a_2 = a[i+0][k];
            a_3 = a[i+0][k];
          }
          {
            b_0 = b[k][j+0];
            b_1 = b[k][j+1];
            b_2 = b[k][j+2];
            b_3 = b[k][j+3];
          }
          {
            c_0_0=c_0_0+a_0*b_0;
            c_0_1=c_0_1+a_1*b_1;
            c_0_2=c_0_2+a_2*b_2;
            c_0_3=c_0_3+a_3*b_3;
          }
          ...
        }
    #pragma xlang name kloop.stores
        {
          c[i+0][j+0] = c_0_0;
          c[i+0][j+1] = c_0_1;
          c[i+0][j+2] = c_0_2;
          c[i+0][j+3] = c_0_3;
        }
      }
    }
    ... // Remainder code
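The transformed kernel can be compared with the untransformed triple loop on a small block. The sketch below is hand-written and makes several assumptions: NB is fixed at a hypothetical value of 8, the j loop is unrolled by 4 as in the code above, the redundant a_1 to a_3 scalars are folded into a single a_0, and no remainder code is needed because NB is a multiple of 4.

```c
#include <assert.h>

#define NB 8  /* hypothetical block size; must be a multiple of 4 here */

/* Untransformed triple loop. */
void mm_naive(double c[NB][NB], double a[NB][NB], double b[NB][NB]) {
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* j unrolled by 4, c kept in scalars across the k loop,
   loads of c lifted before kloop and stores lifted after it. */
void mm_unrolled(double c[NB][NB], double a[NB][NB], double b[NB][NB]) {
    assert(NB % 4 == 0);
    for (int i = 0; i < NB; i++) {
        for (int j = 0; j < NB; j += 4) {
            double c_0 = c[i][j + 0], c_1 = c[i][j + 1];
            double c_2 = c[i][j + 2], c_3 = c[i][j + 3];
            for (int k = 0; k < NB; k++) {
                double a_0 = a[i][k];
                c_0 = c_0 + a_0 * b[k][j + 0];
                c_1 = c_1 + a_0 * b[k][j + 1];
                c_2 = c_2 + a_0 * b[k][j + 2];
                c_3 = c_3 + a_0 * b[k][j + 3];
            }
            c[i][j + 0] = c_0;
            c[i][j + 1] = c_1;
            c[i][j + 2] = c_2;
            c[i][j + 3] = c_3;
        }
    }
}
```

Each element of c accumulates its products in the same k order in both versions, so small integer-valued inputs give identical results.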


Block copies

- Block Matrix Multiplication: better performance if matrices are contiguous in memory (TLB)
- Poor performance of C copy
  - Resort to a tool generating specific asm code
  - Tool generating a good code with search (XLG is an asm search)
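What a block copy does to the memory layout can be shown in a few lines of C. This sketch only illustrates the layout change; it is the kind of plain C copy that the slide reports as slow, not the tuned assembly produced by XLG. The function name copy_block and its parameters (lda for the row length of the source matrix, nb for the block size) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies the nb x nb tile whose top-left corner is (row0, col0) of a
   row-major matrix with row length lda into a contiguous nb*nb buffer. */
void copy_block(double *dst, const double *src, size_t lda,
                size_t row0, size_t col0, size_t nb) {
    assert(dst != NULL && src != NULL);
    assert(col0 + nb <= lda);
    for (size_t r = 0; r < nb; r++)
        memcpy(dst + r * nb, src + (row0 + r) * lda + col0,
               nb * sizeof *dst);
}
```

After the copy, consecutive rows of the tile are adjacent in memory instead of lda elements apart, so the tile spans fewer pages.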


Matrix Multiply (Results)

[Figure: Dgemm on Itanium 2. x-axis: matrix size, 128 to 2048 in steps of 128. y-axis: performance (MFlops), 3300 to 5300. Series: Atlas, XLanguage+XLG, XLanguage+Memcopy, XLanguage+MKL, Peak.]


Conclusion

- Describe transformations with reuse, procedures, conditionals
- X-Language: language designed to generate multiversion programs
- Multistage language with a flexible pattern-matching and rewriting language
- Experts can describe specific application transformation optimizations


Future work

- Dependence analysis
- Going further searching asm code transformation
- More transformations: vectorization, alignment, …