A Language for the Compact Representation of Multiple Program Versions
Sébastien Donadio (1,2), James Brodman (3), Thomas Roeder (4), Kamen Yotov (4), Denis Barthou (2), Albert Cohen (5), María Jesús Garzarán (3), David Padua (3), and Keshav Pingali (4)
(1) BULL S.A.
(2) University of Versailles
(3) University of Illinois at Urbana-Champaign
(4) Cornell University
(5) INRIA Futurs
International Workshop LCPC 2005

Outline
- Context in optimization for high performance
- Goals of this language
- Features of this language
- Examples (Daxpy & Dgemm)
- Conclusion

Context
- Complex architecture and fragile optimizations
- Unpredictable performance
- Architecture, domain-specific optimizations
- Resort to empirical search
- Complement general-purpose optimizations with user-driven ones

Example: FFT performance
[Figure: FFT performance of a reasonable implementation (Numerical Recipes, GNU Scientific Library) compared with the best available implementations (FFTW, Intel IPP, Spiral)]

Goals of X-Language
Tool to help programmers generate and evaluate multiple versions of their programs:
- Applying control and data structure transformations
- Trying multiple transformation sequences and parameters
- Evaluating the performance of each version and deciding which transformation variants to try

Goals of X-Language (cont.)
The code must be portable across ISO C compilers:
- Use #pragma annotations for the above tasks
- Observable program semantics are not altered by the interpretation of these pragmas (assuming transformation legality)
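
To illustrate the portability point, here is a minimal sketch of an annotated loop that still builds with a compiler that knows nothing about X-Language. The function name fill_with_four is hypothetical; the pragma is the loop-naming annotation used in these slides. ISO C specifies that an unrecognized pragma is ignored, although some compilers may print a warning.

```c
#include <stddef.h>

/* Hypothetical example function: fill_with_four is not part of X-Language.
   The pragma names the loop for the X-Language tool; an ISO C compiler
   that does not recognize it ignores it (possibly with a warning),
   so the observable behavior of the function is unchanged. */
void fill_with_four(int *a, size_t n)
{
    size_t i;
#pragma xlang name loop1
    for (i = 0; i < n; i++) {
        a[i] = 4;
    }
}
```

Calling fill_with_four(a, 10) on a 10-element array sets every element to 4, whether or not the pragma is interpreted.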

Comparison with related work
[Figure: related systems (Spiral, Atlas, Tick C, Reflection, Compiler, XLG, X-Language) placed on a chart whose labels are Transformation, Generation, Black box, Manual, Domain specific, and General purpose]

Features of the language
- Elementary transformations (fission, strip-mining, interchange, unrolling, …)
- Composition of transformations
- Conditional transformations (versioning)
- Procedural abstraction of transformations
- A mechanism to define new transformations
- No validity check is performed for the transformations

General schema of X-Language
[Diagram: code with pragmas and transformation descriptions produce different versions, which are compiled, then executed and measured for performance; a search step feeds back into the generation of versions]
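
The "execute and measure performance" and "search" steps can be pictured with a small timing harness. This is only a sketch under stated assumptions: the kernel_version type, the two copy kernels, and pick_fastest are hypothetical names, timing uses the standard clock() function, and the real search strategy of the X-Language tool is not described in these slides.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical signature shared by all generated versions of one kernel. */
typedef void (*kernel_version)(double *a, const double *b, size_t n);

/* Version 0: the original loop. */
void copy_plain(double *a, const double *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        a[i] = b[i];
}

/* Version 1: the same loop unrolled by 2, followed by a remainder loop. */
void copy_unrolled2(double *a, const double *b, size_t n)
{
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a[i] = b[i];
        a[i + 1] = b[i + 1];
    }
    for (; i < n; i++)
        a[i] = b[i];
}

/* Runs every version `repeats` times, measures the CPU time with clock(),
   and returns the index of the version with the smallest elapsed time. */
size_t pick_fastest(const kernel_version *versions, size_t count,
                    double *a, const double *b, size_t n, int repeats)
{
    size_t best = 0;
    clock_t best_time = 0;
    for (size_t v = 0; v < count; v++) {
        clock_t start = clock();
        for (int r = 0; r < repeats; r++)
            versions[v](a, b, n);
        clock_t elapsed = clock() - start;
        if (v == 0 || elapsed < best_time) {
            best = v;
            best_time = elapsed;
        }
    }
    return best;
}
```

Which index is returned depends on the machine and the compiler, which is precisely why the measurement is done empirically.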

X-Language
Naming loops or scopes:
    #pragma xlang name loop1
    for(i=0;i<10;i++) { a[i]=4; }

Format of a transformation:
    #pragma xlang stripmine loop1 4 ii

    #pragma xlang <transformation name> <loop name> <parameters> <names of additional loops generated by the transformation>

Elementary transformations implemented in X-Language
- Full unrolling
- Partial unrolling
- Scalar promotion
- Interchange
- Loop fission
- Loop fusion
- Strip mining
- Lifting
- Software pipelining

Applying a transformation

Before:
    #pragma xlang loop1
    for(i=min;i<4*max;i++)
        a[i]=b[i];

Transformation:
    #pragma xlang stripmine loop1 4 ii

After:
    #pragma xlang loop1
    for(i=min;i<4*max;i+=4) {
        int nl1;
    #pragma xlang ii
        for(nl1=0;nl1<4;nl1++)
            a[i+nl1]=b[i+nl1];
    }
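
As a plain C sketch of what strip mining by 4 does to this loop, the two functions below compute the same result. The function names and the lo/hi parameters are hypothetical, and the sketch assumes that the trip count hi - lo is a multiple of 4, since the slide shows no remainder loop.

```c
#include <stddef.h>

/* Original loop: copies b[i] into a[i] for i in [lo, hi). */
void copy_original(int *a, const int *b, size_t lo, size_t hi)
{
    for (size_t i = lo; i < hi; i++)
        a[i] = b[i];
}

/* Strip-mined by 4: the outer loop steps over strips, the inner loop
   (named ii in the slide) walks the 4 elements of one strip.
   Assumption: (hi - lo) is a multiple of 4, otherwise the inner loop
   would run past hi. */
void copy_stripmined4(int *a, const int *b, size_t lo, size_t hi)
{
    for (size_t i = lo; i < hi; i += 4) {
        for (size_t nl1 = 0; nl1 < 4; nl1++)
            a[i + nl1] = b[i + nl1];
    }
}
```

The iteration order is unchanged, so the transformation is always legal; its value lies in exposing an inner loop that later transformations (full unrolling, interchange) can target.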

How to search for the values of parameters?
- Using multistage evaluation
- External script

    for(k=1;k<16;k=2*k)
    ‘{
    #pragma xlang loop1
    for(i=min;i<max;i++)
        a[i]=b[i]
    #pragma xlang stripmine loop1 ‘d(k) ii
    ‘}
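
One way to picture the first stage is as a generator that emits one annotated copy of the loop per value of k. The sketch below mimics this with ordinary C string formatting; it is not the X-Language implementation, and emit_variant and emit_all_variants are hypothetical names.

```c
#include <stdio.h>

/* Writes into `out` the annotated loop for one strip size k.
   Returns the value of snprintf (the length of the generated text). */
int emit_variant(char *out, size_t cap, int k)
{
    return snprintf(out, cap,
                    "#pragma xlang loop1\n"
                    "for(i=min;i<max;i++)\n"
                    "    a[i]=b[i];\n"
                    "#pragma xlang stripmine loop1 %d ii\n",
                    k);
}

/* Mirrors the outer loop of the slide, for(k=1;k<16;k=2*k):
   emits one variant for each of k = 1, 2, 4, 8 and returns how many
   variants were written to `stream`. */
int emit_all_variants(FILE *stream)
{
    char buf[256];
    int count = 0;
    for (int k = 1; k < 16; k = 2 * k) {
        emit_variant(buf, sizeof buf, k);
        fputs(buf, stream);
        count++;
    }
    return count;
}
```

Each emitted variant would then be transformed, compiled, and timed separately.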

Composing transformations

Before:
    #pragma xlang loop1
    for(i=0;i<4;i++)
    #pragma xlang loop2
        for(j=min2;j<max2;j++)
            a[i]=b[j]

Transformations:
    #pragma xlang interchange loop1 loop2
    #pragma xlang fullunroll loop1

After:
    #pragma xlang loop2
    for(j=min2;j<max2;j++)
    {
        a[0]=b[j];
        a[1]=b[j];
        a[2]=b[j];
        a[3]=b[j];
    }
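
The composition of interchange and full unrolling can be checked with a small C sketch. The function names and the lo/hi parameters (standing in for min2 and max2) are hypothetical.

```c
#include <stddef.h>

/* Original nest: loop1 over i (4 iterations) outside, loop2 over j inside. */
void nest_original(int a[4], const int *b, size_t lo, size_t hi)
{
    for (size_t i = 0; i < 4; i++)
        for (size_t j = lo; j < hi; j++)
            a[i] = b[j];
}

/* After interchange (loop2 becomes the outer loop) and full unrolling
   of loop1 (its 4 iterations are written out explicitly). */
void nest_transformed(int a[4], const int *b, size_t lo, size_t hi)
{
    for (size_t j = lo; j < hi; j++) {
        a[0] = b[j];
        a[1] = b[j];
        a[2] = b[j];
        a[3] = b[j];
    }
}
```

In both versions every a[i] ends up holding the last b[j] visited, so the interchange is legal for this particular nest; X-Language itself does not check this.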

Analyses and Transformations
- Static analyses should also enable the design of smarter (higher-level) transformation primitives
- External tool to find information

Example with analysis

Original loop:
    for(i=2;i<2*N;i+=2) {
        u[i]=u[i-1]+u[i-2];
        u[i+1]=u[i]+u[i-1];
    }

Without interference graph:
    for(i=2;i<2*N;i+=2) {
        u_1=u[i-1];
        u_2=u[i-2];
        u_0 = u_1 + u_2;
        u_1 = u_0 + u_1;
        u[i]=u_0;
        u[i+1]=u_1;
    }

With interference graph:
    u_0=u[0];
    u_1=u[1];
    for(i=2;i<2*N;i+=2) {
        u_0 = u_0 + u_1;
        u_1 = u_0 + u_1;
        u[i]=u_0;
        u[i+1]=u_1;
    }
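
The version that keeps values in scalars across iterations can be sketched in C and compared with the original loop. The function names are hypothetical, and the sketch assumes that u holds 2*n elements with n >= 1, so that u[0] and u[1] exist.

```c
/* Original loop: every value is read back from the array. */
void recurrence_original(long *u, int n)
{
    for (int i = 2; i < 2 * n; i += 2) {
        u[i] = u[i - 1] + u[i - 2];
        u[i + 1] = u[i] + u[i - 1];
    }
}

/* Scalar-promoted loop: at the top of each iteration u_0 holds u[i-2]
   and u_1 holds u[i-1], so no array loads are needed inside the loop.
   Assumption: u has 2*n elements and n >= 1. */
void recurrence_promoted(long *u, int n)
{
    long u_0 = u[0];
    long u_1 = u[1];
    for (int i = 2; i < 2 * n; i += 2) {
        u_0 = u_0 + u_1;   /* new u[i]   = u[i-2] + u[i-1] */
        u_1 = u_0 + u_1;   /* new u[i+1] = u[i]   + u[i-1] */
        u[i] = u_0;
        u[i + 1] = u_1;
    }
}
```

With u[0] = u[1] = 1 both functions fill the array with the Fibonacci sequence, which makes the equivalence easy to check by hand.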

Extending the X-Language
Rewriting rule:

Pattern before transformation:
    #pragma xlang name iloop
    for (i = 0; i < N; i++)
    { <body> }
    %

Pattern after transformation:
    #pragma xlang name iiloop1
    for (ii = 0; ii < (N/4)*4; ii += 4)
    #pragma xlang name iloop1
        for (i = ii; i < ii+4; i++)
        { <body> }
    #pragma xlang name iloop2
    for (i = (N/4)*4; i < N; i++)
    { <body> }
    %%
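
Instantiated with a concrete body, the rule's two patterns look as follows in C. The body out[i] = 2 * in[i] and the function names are hypothetical choices made for illustration; the loop bounds follow the rule above, with the remainder loop covering the case where N is not a multiple of 4.

```c
/* Pattern before: a single loop over [0, n). */
void scale_original(int *out, const int *in, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = 2 * in[i];
}

/* Pattern after: a strip-mined nest over the first (n/4)*4 iterations
   (iiloop1 and iloop1), then a remainder loop (iloop2) for the rest. */
void scale_rewritten(int *out, const int *in, int n)
{
    int i, ii;
    for (ii = 0; ii < (n / 4) * 4; ii += 4)
        for (i = ii; i < ii + 4; i++)
            out[i] = 2 * in[i];
    for (i = (n / 4) * 4; i < n; i++)
        out[i] = 2 * in[i];
}
```

For n = 10 the nest handles indices 0 to 7 and the remainder loop handles indices 8 and 9.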

Daxpy Example
    #pragma xlang name loop1
    for(k=0;k<2000;k++)
        Y[k]=alpha*X[k]+Y[k];

We can modify the value of N:
    /** A few values tested for unrolling factor: different generated versions **/
    #pragma xlang transform stripmine loop1 k N;
    #pragma xlang transform scalarize-in X in loop1
    #pragma xlang transform lift l1.loads before loop1
    #pragma xlang transform scalarize-out Y in loop1
    #pragma xlang transform lift loop1.loads before loop1
    #pragma xlang transform lift loop1.stores after loop1
    #pragma xlang transform fullunroll loop1.loads
    #pragma xlang transform fullunroll loop1.stores
    #pragma xlang transform fullunroll loop1

Daxpy Example: different generated versions

Unrolling factor: 2
    for(k=0;k<2000;k=k+2){
        double x_0 = X[k+0];
        double x_1 = X[k+1];
        double y_0 = Y[k+0];
        double y_1 = Y[k+1];
        y_0=alpha*x_0+y_0;
        y_1=alpha*x_1+y_1;
        Y[k+0] = y_0;
        Y[k+1] = y_1;
    }

Unrolling factor: 4
    for(k=0;k<2000;k=k+4){
        double x_0 = X[k+0];
        double x_1 = X[k+1];
        double x_2 = X[k+2];
        double x_3 = X[k+3];
        double y_0 = Y[k+0];
        double y_1 = Y[k+1];
        double y_2 = Y[k+2];
        double y_3 = Y[k+3];
        y_0=alpha*x_0+y_0;
        y_1=alpha*x_1+y_1;
        y_2=alpha*x_2+y_2;
        y_3=alpha*x_3+y_3;
        Y[k+0] = y_0;
        Y[k+1] = y_1;
        Y[k+2] = y_2;
        Y[k+3] = y_3;
    }

Unrolling factor: 8
    for(k=0;k<2000;k=k+8){
        double x_0 = X[k+0];
        double x_1 = X[k+1];
        double x_2 = X[k+2];
        …
        y_0=alpha*x_0+y_0;
        y_1=alpha*x_1+y_1;
        y_2=alpha*x_2+y_2;
        y_3=alpha*x_3+y_3;
        …
        Y[k+0] = y_0;
        Y[k+1] = y_1;
        Y[k+2] = y_2;
        Y[k+3] = y_3;
        …
    }
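
The unrolled versions can be compared against the plain loop with a short C sketch. The function names are hypothetical, the array length n is a parameter instead of the constant 2000, and the unrolled version assumes that n is a multiple of 4, as 2000 is in the slides.

```c
/* Reference daxpy: y = alpha * x + y. */
void daxpy_reference(double *y, const double *x, double alpha, int n)
{
    for (int k = 0; k < n; k++)
        y[k] = alpha * x[k] + y[k];
}

/* Unrolled by 4 with the loads lifted before the computation and the
   stores lifted after it, following the generated code in the slides.
   Assumption: n is a multiple of 4 (no remainder loop). */
void daxpy_unrolled4(double *y, const double *x, double alpha, int n)
{
    for (int k = 0; k < n; k = k + 4) {
        double x_0 = x[k + 0];
        double x_1 = x[k + 1];
        double x_2 = x[k + 2];
        double x_3 = x[k + 3];
        double y_0 = y[k + 0];
        double y_1 = y[k + 1];
        double y_2 = y[k + 2];
        double y_3 = y[k + 3];
        y_0 = alpha * x_0 + y_0;
        y_1 = alpha * x_1 + y_1;
        y_2 = alpha * x_2 + y_2;
        y_3 = alpha * x_3 + y_3;
        y[k + 0] = y_0;
        y[k + 1] = y_1;
        y[k + 2] = y_2;
        y[k + 3] = y_3;
    }
}
```

Grouping loads, arithmetic, and stores gives the compiler's scheduler independent operations to overlap, which is the motivation for trying several unrolling factors.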

Matrix Multiply (Loop Declaration)
The DGEMM example: matrix multiplication
Problems:
- Data locality
- Scheduling

    #pragma xlang name iloop
    for (i = 0; i < NB; i++)
    #pragma xlang name jloop
        for (j = 0; j < NB; j++)
    #pragma xlang name kloop
            for (k = 0; k < NB; k++)
            {
                c[i][j]=c[i][j]+a[i][k]*b[k][j];
            }

Matrix Multiply (Transformation Declaration)
Sequence of transformations for Itanium:
    #pragma xlang transform stripmine iloop NU NUloop
    #pragma xlang transform stripmine jloop MU MUloop
    #pragma xlang transform interchange kloop MUloop
    #pragma xlang transform interchange jloop NUloop
    #pragma xlang transform interchange kloop NUloop
    #pragma xlang transform fullunroll NUloop
    #pragma xlang transform fullunroll MUloop
    #pragma xlang transform scalarize_in b in kloop
    #pragma xlang transform scalarize_in a in kloop
    #pragma xlang transform scalarize_in&out c in kloop
    #pragma xlang transform lift kloop.loads before kloop
    #pragma xlang transform lift kloop.stores after kloop

Matrix Multiply (Transformation Sequence)
    #pragma xlang name iloop
    for(i = 0; i < NB; i++){
    #pragma xlang name jloop
      for(j = 0; j < NB; j += 4){
    #pragma xlang name kloop.loads
        {
          c_0_0 = c[i+0][j+0];
          c_0_1 = c[i+0][j+1];
          c_0_2 = c[i+0][j+2];
          c_0_3 = c[i+0][j+3];
        }
    #pragma xlang name kloop
        for(k = 0; k < NB; k++){
          {
            a_0 = a[i+0][k];
            a_1 = a[i+0][k];
            a_2 = a[i+0][k];
            a_3 = a[i+0][k];
          }
          {
            b_0 = b[k][j+0];
            b_1 = b[k][j+1];
            b_2 = b[k][j+2];
            b_3 = b[k][j+3];
          }
          {
            c_0_0=c_0_0+a_0*b_0;
            c_0_1=c_0_1+a_1*b_1;
            c_0_2=c_0_2+a_2*b_2;
            c_0_3=c_0_3+a_3*b_3;
          }
          ...
        }
    #pragma xlang name kloop.stores
        {
          c[i+0][j+0] = c_0_0;
          c[i+0][j+1] = c_0_1;
          c[i+0][j+2] = c_0_2;
          c[i+0][j+3] = c_0_3;
        }
      }
    }
    ... // Remainder code
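
A compilable C sketch of this register-tiled kernel, next to the original triple loop, is given below. The block size NB = 8 and the function names are hypothetical; NB is assumed to be a multiple of 4, so the remainder code is not needed. Since all four a_x scalars in the slide hold the same value a[i][k], the sketch uses a single scalar a_0.

```c
#define NB 8 /* hypothetical block size, assumed to be a multiple of 4 */

/* Original triple loop: c = c + a * b on NB x NB blocks. */
void mm_original(double c[NB][NB], double a[NB][NB], double b[NB][NB])
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                c[i][j] = c[i][j] + a[i][k] * b[k][j];
}

/* jloop strip-mined by 4 and fully unrolled; the four elements of c are
   kept in scalars, loaded before kloop and stored after it. */
void mm_tiled(double c[NB][NB], double a[NB][NB], double b[NB][NB])
{
    for (int i = 0; i < NB; i++) {
        for (int j = 0; j < NB; j += 4) {
            /* kloop.loads */
            double c_0_0 = c[i][j + 0];
            double c_0_1 = c[i][j + 1];
            double c_0_2 = c[i][j + 2];
            double c_0_3 = c[i][j + 3];
            for (int k = 0; k < NB; k++) {
                double a_0 = a[i][k];
                double b_0 = b[k][j + 0];
                double b_1 = b[k][j + 1];
                double b_2 = b[k][j + 2];
                double b_3 = b[k][j + 3];
                c_0_0 = c_0_0 + a_0 * b_0;
                c_0_1 = c_0_1 + a_0 * b_1;
                c_0_2 = c_0_2 + a_0 * b_2;
                c_0_3 = c_0_3 + a_0 * b_3;
            }
            /* kloop.stores */
            c[i][j + 0] = c_0_0;
            c[i][j + 1] = c_0_1;
            c[i][j + 2] = c_0_2;
            c[i][j + 3] = c_0_3;
        }
    }
}
```

Each element of c is accumulated in the same k order in both functions, so the results match exactly; the tiled form simply avoids reloading and restoring c inside the innermost loop.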

Block copies
- Block matrix multiplication: better performance if matrices are contiguous in memory (TLB)
- Poor performance of C copy
- Resort to a tool generating specific asm code
- Tool generating good code with search (XLG is an asm search)

Matrix Multiply (Results)
[Figure: Dgemm on Itanium 2. Performance (MFlops, axis from 3300 to 5300) versus matrix size (128 to 2048 in steps of 128) for Atlas, XLanguage+XLG, XLanguage+Memcopy, XLanguage+MKL, and Peak]

Conclusion
- Describe transformations with reuse, procedures, conditionals
- X-Language: language designed to generate multiversion programs
- Multistage language with a flexible pattern-matching and rewriting language
- Experts can describe application-specific transformations and optimizations

Future work
- Dependence analysis
- Going further in searching asm code transformations
- More transformations: vectorization, alignment, …