Programming with CellSs
BSC
Motivation
* Source: Platform 2015: Intel® Processor and Platform Evolution for the Next Decade, Intel White Paper
Outline
•CellSs
• StarSs Programming Model
• CellSs syntax
• CellSs compiler
• CellSs runtime
• Installing CellSs
• Programming examples
• Compiling and running a CellSs application
• Performance analysis using Paraver
•SMPSs
•Conclusions
Cell/B.E. Architecture
So, what is the Cell/B.E.?
• Architecture point of view
[Figure: one PPE and eight SPEs]
• Users' point of view
• Programmers' point of view
[Figure labels: separate address spaces, tiny local memory, bandwidth, thin processor, SMT, hard to optimize]
StarSs programming model
Basic idea
Sequential application:
...
for (i=0; i<N; i++) {
    T1 (data1, data2);
    T2 (data4, data5);
    T3 (data2, data5, data6);
    T4 (data7, data8);
    T5 (data6, data8, data9);
}
...
• Task selection + parameters direction (input, output, inout)
• Task graph creation based on data precedence
• Scheduling, data transfer, task execution
• Synchronization, results transfer
[Figure: the task instances of successive iterations (T1_0 … T5_0, T1_1 … T5_1, T1_2, …) form a task graph that is scheduled onto parallel resources 1 … N (multicore, SMP, cluster, grid)]
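The step "task graph creation based on data precedence" can be illustrated with a small host-side sketch. It is not CellSs code: the slide does not state the direction of each parameter, so the sketch assumes that the last argument of every task is its output and all others are inputs. The names task_t, depends_on and count_predecessors are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define MAX_ARGS 3

/* One task call: its argument names. ASSUMPTION: the last argument is the
   output, all others are inputs (the slide does not give the directions). */
typedef struct {
    const char *name;
    const char *args[MAX_ARGS];
    int nargs;
} task_t;

/* The five calls of one loop iteration, in program order. */
static const task_t tasks[] = {
    {"T1", {"data1", "data2"}, 2},
    {"T2", {"data4", "data5"}, 2},
    {"T3", {"data2", "data5", "data6"}, 3},
    {"T4", {"data7", "data8"}, 2},
    {"T5", {"data6", "data8", "data9"}, 3},
};
enum { NTASKS = sizeof tasks / sizeof tasks[0] };

/* Returns 1 when task `consumer` reads the datum written by the earlier task
   `producer` (a true, read-after-write dependence), 0 otherwise. */
int depends_on(int consumer, int producer)
{
    assert(0 <= producer && producer < consumer && consumer < NTASKS);
    const char *written = tasks[producer].args[tasks[producer].nargs - 1];
    for (int a = 0; a < tasks[consumer].nargs - 1; a++)   /* inputs only */
        if (strcmp(tasks[consumer].args[a], written) == 0)
            return 1;
    return 0;
}

/* Number of earlier tasks of the iteration that task `t` must wait for. */
int count_predecessors(int t)
{
    int n = 0;
    for (int p = 0; p < t; p++)
        n += depends_on(t, p);
    return n;
}
```

Under that assumption T3 waits for T1 and T2, T5 waits for T3 and T4, and T1, T2 and T4 can start immediately. The sketch covers a single iteration in which every datum is written once; a real runtime also has to track the latest writer across iterations.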
StarSs programming model
•GRIDSs, COMPSs
• Tailored for Grids or clusters
• Data dependency analysis based on files
• C/C++, Java
•SMPSs
• Tailored for SMPs or homogeneous multicores
• C or Fortran
•CellSs
• Tailored for Cell/B.E. processor
• C or Fortran
CellSs: Syntax example - matrix multiply
int main (int argc, char **argv) {
    int i, j, k;
    …
    initialize(A, B, C);
    for (i=0; i < NB; i++)
        for (j=0; j < NB; j++)
            for (k=0; k < NB; k++)
                block_addmultiply( C[i][j], A[i][k], B[k][j]);
}

static void block_addmultiply( float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
    int i, j, k;
    for (i=0; i < BS; i++)
        for (j=0; j < BS; j++)
            for (k=0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}
[Figure: each matrix is divided into NB × NB blocks of BS × BS elements (labelled B in the figure)]
CellSs: Syntax example - matrix multiply
int main (int argc, char **argv) {
    int i, j, k;
    …
    initialize(A, B, C);
    for (i=0; i < NB; i++)
        for (j=0; j < NB; j++)
            for (k=0; k < NB; k++)
                block_addmultiply( C[i][j], A[i][k], B[k][j]);
}

#pragma css task input(A, B) inout(C)
static void block_addmultiply( float C[BS][BS], float A[BS][BS], float B[BS][BS]) {
    int i, j, k;
    for (i=0; i < BS; i++)
        for (j=0; j < BS; j++)
            for (k=0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}
[Figure: each matrix is divided into NB × NB blocks of BS × BS elements (labelled B in the figure)]
CellSs: Syntax
• pragmas' syntax:
#pragma css task [input (<input parameters>)] \
[output (<output parameters>)] \
[inout (<input/output parameters>)] \
[highpriority]
void task(<parameters>) { ...
#pragma css wait on(<data address>)
#pragma css barrier
#pragma css start
#pragma css finish
CellSs: Syntax
• Examples: task selection
#pragma css task input(A, B) inout(C)
void block_addmultiply( float C[N][N], float A[N][N], float B[N][N] ) { ...

#pragma css task input(A[BS][BS], B[BS][BS]) inout(C[BS][BS])
void block_addmultiply( float *C, float *A, float *B ) { ...

#pragma css task input(A[BS][BS], B[BS][BS], BS) inout(C[BS][BS])
void block_addmultiply( float *C, float *A, float *B, int BS ) { ...
• Examples: waiting for data
#pragma css task input (ref_block, to_comp) output (mse)
void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse) { ...
...
are_blocks_equal (X[ii][jj], Y[ii][jj], &sq_error);
#pragma css wait on (&sq_error)
if (sq_error > 0.0000001)
CellSs: Syntax
• Examples: synchronization
for (i=0; i < NB; i++)
    for (j=0; j < NB; j++)
        for (k=0; k < NB; k++)
            block_addmultiply( C[i][j], A[i][k], B[k][j]);
#pragma css barrier
• Examples: prioritization
#pragma css task input(lefthalo[32], tophalo[32], righthalo[32], \
                       bottomhalo[32]) inout(A[32][32]) highpriority
void jacobi (float *lefthalo, float *tophalo, float *righthalo,
             float *bottomhalo, float *A) { ... }
CellSs: Syntax
• Examples: CellSs program boundary
#pragma css start
for (i=0; i < NB; i++)
    for (j=0; j < NB; j++)
        for (k=0; k < NB; k++)
            block_addmultiply( C[i][j], A[i][k], B[k][j]);
#pragma css finish
CellSs: Syntax in Fortran
subroutine example()
    ...
    interface
        !$CSS TASK
        subroutine block_add_multiply(C, A, B, BS)
            implicit none
            integer, intent (in) :: BS
            real, intent (in) :: A(BS,BS), B(BS,BS)
            real, intent (inout) :: C(BS,BS)
        end subroutine
    end interface
    ...
    !$CSS START
    ...
    call block_add_multiply(C, A, B, BLOCK_SIZE)
    ...
    !$CSS FINISH
    ...
end subroutine

!$CSS TASK
subroutine block_add_multiply(C, A, B, BS)
    ...
end subroutine
CellSs compiler: Compiler phase
[Figure: CELLSS-CC compiler phase. Elements shown: app.c; code translation (mcc); cellss-spu-cc_app.c, cellss-ppu-cc_app.c and app.tasks (tasks list); SPE compiler and PPE compiler; the resulting cellss-spu-cc_app.o object files; pack; app.o]
CellSs compiler: Compiler phase
•Files
• app.c: User code, with CellSs annotations
• cellss-spu-cc_app.c: specific code generated for the spu (tasks code)
• cellss-ppu-cc_app.c: specific code generated for the ppu (main program)
• app.tasks: list of annotated tasks
•Compilation steps
• mcc: source to source compiler, based on the Mercurium compiler (BSC).
• SPE compiler: Generic SPE compiler (IBM SDK)
• PPE compiler: Generic PPE compiler (IBM SDK)
• pack: Specific CellSs module that combines objects (BSC)
CellSs compiler: Linker phase
[Figure: CELLSS-CC linker phase. Steps shown: unpack, glue code generator, SPE compiler, PPE compiler, SPE linker, SPE embedder, PPE linker. Files shown: app.c, app.o, app.tasks, app-adapters.c, app-adapters.cc, exec-adapters.c, exec-adapters.o, exec-registration.c, exec-registration.o, cellss-spu-cc_app.o, cellss-ppu-cc_app.o, libCellSS-spu.a, exec-spu, exec-spu.o, libCellSS.so, exec]
CellSs compiler: Linker phase
•Files
• exec-adapters.c: code generated for each of the annotated tasks to uniformly
call them (“stubs”).
• exec-registration.c: code generated to register the annotated tasks
• Linker steps
• unpack: unpacks objects
• glue code generator: from all the *.tasks files of an application generates a
single “adapters” file and a single “registration” file per executable
• SPE, PPE compilers and linkers and SPE embedder (IBM SDK)
CellSs: Runtime
[Figure: structure of the CellSs runtime]
• PPE: user main program and CellSs PPU lib
  • Main thread: data dependence, data renaming (renaming table)
  • Helper thread: scheduling, work assignment, synchronization, finalization signal
• SPE0, SPE1, SPE2, ...: CellSs SPU lib and original task code; each SPE performs DMA in, task execution, DMA out and synchronization
• Memory: user data, task control buffer, tasks, synchronization; stage in/out data
CellSs: Runtime - argument renaming
•False dependences (WaW and WaR) are removed with dynamic renaming of arguments
for (i=0; i<N; i++) {
    T1 (…,…, block1);
    T2 (block1, …, block2[i]);
    T3 (block2[i],…,…);
}
• block1 is output from task T1
• block1 is input to task T2
[Figure: task chains T1_1, T2_1, T3_1; T1_2, T2_2, T3_2; …; T1_N, T2_N, T3_N. All iterations share block1, which adds WaR and WaW edges between consecutive iterations]
CellSs: Runtime - argument renaming
[Figure: the same task chains after renaming. Each iteration works on its own instance of the block (block1_1, block1_2, …, block1_N), so the WaR and WaW edges between iterations are removed]
CellSs: Runtime - Dependence detection
[Figure: the renaming table maps an address (@L) to its last renaming. Each renaming entry holds type, size, …, *obj, *producer and *prev; *obj points to an object instance with a "# users" counter, and the entries link into the task dependence graph]
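The idea behind renaming can be sketched on the host as a chain of versions per object. The field names obj, producer, prev and users come from the figure; everything else (the type renaming_t, the functions rename_write, rename_read and rename_free_all, and the use of malloc) is a hypothetical illustration, not the CellSs data structure.

```c
#include <assert.h>
#include <stdlib.h>

/* One version ("renaming") of a user object. Field names follow the figure;
   the real CellSs layout is not given in the slides. */
typedef struct renaming {
    void *obj;              /* storage that holds this version */
    int producer;           /* id of the task that writes this version */
    struct renaming *prev;  /* previous version of the same object */
    int users;              /* tasks that read this version */
} renaming_t;

/* A task writes the object: it gets fresh storage, so it does not have to
   wait for readers (WaR) or the writer (WaW) of the previous version. */
renaming_t *rename_write(renaming_t *last, int task_id, size_t size)
{
    renaming_t *r = malloc(sizeof *r);
    assert(r != NULL);
    r->obj = malloc(size);
    assert(r->obj != NULL);
    r->producer = task_id;
    r->prev = last;
    r->users = 0;
    return r;
}

/* A task reads the object: the only dependence left is the true one, on the
   producer of the latest version. Returns that producer's id. */
int rename_read(renaming_t *last)
{
    assert(last != NULL);
    last->users++;
    return last->producer;
}

/* Releases a whole version chain. */
void rename_free_all(renaming_t *last)
{
    while (last != NULL) {
        renaming_t *prev = last->prev;
        free(last->obj);
        free(last);
        last = prev;
    }
}
```

In the loop of the renaming slides, T1 of iteration 2 would call rename_write and receive a new instance of block1 while T2 of iteration 1 may still be reading the old one.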
CellSs: Runtime – scheduling
•Scheduling strategy
• Critical path
• Locality
[Figure: three kinds of bundle: a bundle of dependent tasks (data locality in the SPE), a bundle of independent tasks, and a mixed bundle]
CellSs: Runtime – scheduling
•Scheduling for locality
•Ready lists (Ri). A higher subindex indicates higher priority according to memory locality
•Scheduling selects tasks from high priority ready lists (higher “i”)
[Figure: tasks u and v placed in the ready lists Ri and Ri+1, with
ReadyLocs(u) = {@A, @B}
ReadyLocs(v) = {@C, @D}
LocHints(SPEj) = {@X, @Y, @B, @Z}
LocHints(SPEj+1) = {@U, @V, @W}]
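The slides do not say how ReadyLocs and LocHints are combined. One plausible reading, shown here purely as an assumption, is to count how many addresses of a ready task are already hinted to be in an SPE and to prefer the SPE with the largest overlap. The functions locality_score and pick_spe are hypothetical, and each address is represented by one character.

```c
#include <assert.h>
#include <string.h>

/* Number of addresses in `ready_locs` that also appear in `loc_hints`.
   Each address is one character, e.g. "AB" stands for {@A, @B}. */
int locality_score(const char *ready_locs, const char *loc_hints)
{
    int score = 0;
    for (const char *p = ready_locs; *p != '\0'; p++)
        if (strchr(loc_hints, *p) != NULL)
            score++;
    return score;
}

/* Index of the SPE whose hints overlap most with the task's data.
   ASSUMPTION: ties go to the lowest SPE index. */
int pick_spe(const char *ready_locs, const char *const loc_hints[], int nspes)
{
    assert(nspes > 0);
    int best = 0;
    for (int s = 1; s < nspes; s++)
        if (locality_score(ready_locs, loc_hints[s]) >
            locality_score(ready_locs, loc_hints[best]))
            best = s;
    return best;
}
```

With the sets of the figure, task u shares @B with SPEj and nothing with SPEj+1, while task v shares nothing with either, which is consistent with u sitting in a higher-priority ready list than v.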
CellSs: Runtime – scheduling
“Co-parent” edges
•Co-parent edges are added between tasks that share a direct descendant
•Odep(u): number of outstanding dependences of task u outside the current bundle
•Maximum of two co-parent edges (due to implementation costs)
CellSs: Runtime – scheduling
•Scheduling algorithm
• Ri: ready lists
• Btemp: candidates for being integrated in a bundle
• B: bundle to be scheduled
while not ScheduleStop {
    t = head (R_M) | M = max{i | 0 < i < N} and R_i not empty
    add_to_head (t, Btemp);
    while DepthSearch {
        u = head (Btemp);
        if Odep(u)==0 {
            add_to_tail (u, B);
            if ((b = CoParent (u)) != 0)
                add_to_head (b, Btemp);
            else if ((s = successor (u)) != 0)
                add_to_tail (s, Btemp);
        }
    }
}
CellSs: Runtime – scheduling
[Figure: task dependence graph with tasks numbered 1 to 17]
• Imagine
• R1 = {1} and
• R0 = {2, 3, 4, 7, 9, 10, 11}
•External loop
• t = 1;
• Btemp = {1}
• internal loop: iteration 1
• u = 1; Btemp = { };
• B = {u}
• b = 2; Btemp = {2};
• internal loop: iteration 2
• u = 2; Btemp = { };
• B = {1,2}
• b = 3; Btemp = {3};
• internal loop: iteration 3
• u = 3; Btemp = { };
• B = {1,2,3}
• b = 4; Btemp = {4};
• internal loop: iteration 4
• u = 4; Btemp = { };
• B = {1,2,3,4}
• s = 5; Btemp = {5};
• internal loop: iteration 5
• u = 5; Btemp = { };
• B = {1,2,3,4,5}
• s = 6; Btemp = {6};
• internal loop: iteration 6
• u = 6; Btemp = { };
• B = {1,2,3,4,5,6}
• b = 7; Btemp = {7};
• internal loop: iteration 7
• u = 7; Btemp = { };
• B = {1,2,3,4,5,6,7}
• s = 8; Btemp = {8};
• internal loop: iteration 8
• u = 8; Btemp = { };
• B = {1,2,3,4,5,6,7,8}
• b = 9; Btemp = {9};
• internal loop: ends since the maximum size of the bundle is reached
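The walk-through above can be replayed with a small host-side sketch of the inner DepthSearch loop. The co-parent and successor relations are the ones that appear in the walk-through; the graph itself is not recoverable from the transcript, so all other entries are left at 0. The bundle limit of 8 and Odep(u) == 0 for every task are assumptions that match the walk-through, and list_t, build_bundle and the list helpers are hypothetical names.

```c
#include <assert.h>

#define MAX_TASKS  18   /* task ids 1..17 as in the figure; 0 means "none" */
#define MAX_BUNDLE 8    /* ASSUMPTION: bundle limit used in the walk-through */

/* Relations read off the walk-through; unknown entries stay 0. */
static const int co_parent[MAX_TASKS] = { [1] = 2, [2] = 3, [3] = 4, [6] = 7, [8] = 9 };
static const int successor[MAX_TASKS] = { [4] = 5, [5] = 6, [7] = 8 };
static const int odep[MAX_TASKS] = { 0 };  /* ASSUMPTION: no outside dependences */

typedef struct { int item[MAX_TASKS]; int len; } list_t;

static void add_to_head(list_t *l, int t)
{
    assert(l->len < MAX_TASKS);
    for (int i = l->len; i > 0; i--)
        l->item[i] = l->item[i - 1];
    l->item[0] = t;
    l->len++;
}

static void add_to_tail(list_t *l, int t)
{
    assert(l->len < MAX_TASKS);
    l->item[l->len++] = t;
}

/* Removes and returns the first element, as head(Btemp) does in the trace. */
static int take_head(list_t *l)
{
    assert(l->len > 0);
    int t = l->item[0];
    for (int i = 1; i < l->len; i++)
        l->item[i - 1] = l->item[i];
    l->len--;
    return t;
}

/* Builds one bundle starting from ready task `t`; returns its length. */
int build_bundle(int t, int bundle[MAX_BUNDLE])
{
    list_t btemp = { {0}, 0 };
    int n = 0;
    add_to_head(&btemp, t);
    while (btemp.len > 0 && n < MAX_BUNDLE) {       /* DepthSearch */
        int u = take_head(&btemp);
        if (odep[u] == 0) {
            bundle[n++] = u;                        /* add_to_tail (u, B) */
            if (co_parent[u] != 0)
                add_to_head(&btemp, co_parent[u]);
            else if (successor[u] != 0)
                add_to_tail(&btemp, successor[u]);
        }
    }
    return n;
}
```

Starting from task 1, the sketch produces the bundle {1, 2, 3, 4, 5, 6, 7, 8} and stops with task 9 still waiting in Btemp, as in the last step of the walk-through.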
CellSs: Runtime
•Paraver view of the runtime behavior
[Figure: Paraver trace of the runtime, with one bundle marked]
• Main thread: runs user code and adds tasks to and removes tasks from the task graph
• SPEs: execute tasks' code
• Helper thread: schedules tasks and synchronizes with the SPEs
CellSs: Runtime – specific SPE library features
•Data dependence analysis, data renaming, task scheduling
performed in the CellSs PPE runtime library
•CellSs SPE runtime library implements specific features, to assist
the CellSs PPE runtime library, but independently
• Early callback
• Minimal stage-out
• Software cache in the SPE Local Store
• Double buffering
CellSs: Runtime – specific SPE library features
•Early callback
• Initially, the completion of tasks is communicated on a per-bundle basis
• There are cases where this limits the application
  • Task A in the example
• An early callback after the limiting task enables the scheduling of new bundles
• Condition: the task has more than one outgoing dependency
CellSs: Runtime – specific SPE library features
•Minimal stage-out
• For each task in a bundle, its outputs are written back to main memory
• If a later task inside the bundle rewrites the same output, there is no need to write the earlier value back to main memory
• The case in the figure cannot happen!
  • Thanks to renaming
• Example: matmul
  • C[i][j] += A[i][k]*B[k][j]
[Figure: bundles of tasks labelled X, Y and Z annotated with "writes A", "reads A" and, after renaming, "writes A'"]
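The rule "only the last write of an output inside a bundle is staged out" can be sketched as follows. This is a host-side illustration under the simplifying assumption that every task has exactly one output address; minimal_stage_out is a hypothetical name.

```c
#include <assert.h>

/* out_addr[i] is the single output address of task i of a bundle, in
   execution order. needs_stage_out[i] is set to 1 only for the last task
   that writes each address, and to 0 for tasks whose output is rewritten
   later in the same bundle. */
void minimal_stage_out(const unsigned long out_addr[], int needs_stage_out[],
                       int ntasks)
{
    assert(ntasks >= 0);
    for (int i = 0; i < ntasks; i++) {
        needs_stage_out[i] = 1;
        for (int j = i + 1; j < ntasks; j++)
            if (out_addr[j] == out_addr[i]) {   /* rewritten later */
                needs_stage_out[i] = 0;
                break;
            }
    }
}
```

For the matmul example, a bundle that contains several consecutive k iterations for the same C[i][j] would write that block back once, after the last of them.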
CellSs: Runtime – specific SPE library features
...
#pragma css task input(A, B) inout(C)
block_addmultiply( C[i][j], A[i][k], B[k][j])
[Figure: block C[i][j] updated from blocks A[i][k] and B[k][j]]
• For each operation, two blocks of data are fetched from main memory into the SPE local store
• Clusters of dependent tasks are scheduled to the same SPE
• The inout block is kept in the local store and written back to main memory only once (reuse)
CellSs: Runtime – specific SPE library features
•Software cache in the SPE Local Store
• Maintained by the SPE runtime
• LRU replacement strategy
• PPE scheduling is not aware of this behavior
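A software cache with LRU replacement can be sketched in a few lines. The slides give no detail about the actual CellSs cache, so the number of slots, the type sw_cache_t and the functions cache_init and cache_access are hypothetical, and a logical clock stands in for whatever bookkeeping the SPE runtime really uses.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 4   /* hypothetical number of block slots in the local store */

typedef struct {
    unsigned long addr[CACHE_SLOTS];      /* main-memory address held by each slot */
    unsigned long last_use[CACHE_SLOTS];  /* logical time of last access, 0 = empty */
    unsigned long clock;
} sw_cache_t;

void cache_init(sw_cache_t *c)
{
    assert(c != NULL);
    c->clock = 0;
    for (int s = 0; s < CACHE_SLOTS; s++) {
        c->addr[s] = 0;
        c->last_use[s] = 0;
    }
}

/* Returns 1 on a hit. On a miss, the least recently used slot (or an empty
   one) is replaced and 0 is returned; a real implementation would start a
   DMA get for the missing block at that point. */
int cache_access(sw_cache_t *c, unsigned long addr)
{
    int victim = 0;
    assert(c != NULL);
    c->clock++;
    for (int s = 0; s < CACHE_SLOTS; s++) {
        if (c->last_use[s] != 0 && c->addr[s] == addr) {
            c->last_use[s] = c->clock;    /* hit: refresh recency */
            return 1;
        }
        if (c->last_use[s] < c->last_use[victim])
            victim = s;
    }
    c->addr[victim] = addr;
    c->last_use[victim] = c->clock;
    return 0;
}
```

Because the PPE scheduler is not aware of this cache, a hit simply saves a transfer that the scheduler had already accounted for.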
CellSs: Runtime - specific SPE library features
•Double buffering
• CellSs overlaps DMA transfers with computations
[Figure: SPE timeline for one bundle (Task 1, Task 2, ..., Task N in bundle): DMA programming to read the task control buffer; waiting for the DMA transfer; DMA programming to read data; task execution overlapped with data transfers; DMA programming to write data; synchronization with the helper thread]
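The buffer rotation behind double buffering can be sketched with two local buffers that swap roles on every block. In this host-side sketch a synchronous memcpy stands in for the asynchronous DMA get, so nothing actually overlaps; only the control flow is illustrated. BLOCK, fetch_block and sum_double_buffered are hypothetical names.

```c
#include <assert.h>
#include <string.h>

#define BLOCK 4   /* elements per block; hypothetical, small for illustration */

/* Stand-in for an asynchronous DMA get: here a plain, synchronous copy. */
static void fetch_block(float *local, const float *main_mem, int block)
{
    memcpy(local, main_mem + (size_t)block * BLOCK, BLOCK * sizeof(float));
}

/* Sums `nblocks` blocks of main_mem using two local buffers. While block b
   is processed from one buffer, block b+1 is requested into the other. */
float sum_double_buffered(const float *main_mem, int nblocks)
{
    float buf[2][BLOCK];
    float total = 0.0f;
    assert(nblocks >= 0);
    if (nblocks == 0)
        return total;
    fetch_block(buf[0], main_mem, 0);                 /* prefetch first block */
    for (int b = 0; b < nblocks; b++) {
        int cur = b % 2;
        int next = 1 - cur;
        if (b + 1 < nblocks)
            fetch_block(buf[next], main_mem, b + 1);  /* would overlap with compute */
        for (int i = 0; i < BLOCK; i++)               /* compute on current buffer */
            total += buf[cur][i];
    }
    return total;
}
```

On a real SPE the fetch would return immediately and the loop would wait on the transfer tag just before switching buffers.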
CellSs: Runtime - specific SPE library features
•Double buffering: Paraver view
[Figures: four Paraver traces, showing in turn: the SPE reading data and executing a task; DMA programming and the SPE waiting for DMA in; DMA out programming, DMA in programming and the SPE waiting for DMA in; DMA out programming and the SPE waiting for DMA out (all)]
CellSs: Installing CellSs
•Download code:
• www.bsc.es/cellsuperscalar -> download
• gunzip, tar
• Installation instructions are in the CellSs manual
• www.bsc.es/cellsuperscalar -> documents
• Run the configure script with the installation directory as prefix
./configure --prefix=/opt/CellSS
• Other options can be specified
./configure --help
make
make install
CellSs: Programming examples
•Cholesky factorization
•Common matrix operation used to solve normal equations in linear least squares problems.
•Calculates a triangular matrix (L) from a symmetric, positive definite matrix A.
Cholesky(A) = L
L · L^T = A
•Different implementations are possible, depending on how the matrix is traversed (by rows, by columns, left-looking, right-looking)
• It can be decomposed into block operations
CellSs: Programming examples
• In each iteration red and blue blocks are updated
• SPOTF: Compute the Cholesky factorization of the diagonal block
• STRSM: Compute the column panel
• SSYRK: Update the rest of the matrix
CellSs: Programming examples
main ()
{
    ...
    for (i = 0; i < DIM; i++) {
        for (j = 0; j < i; j++) {
            for (k = 0; k < j; k++) {
                sgemm_tile( A[i][k], A[j][k], A[i][j] );
            }
            strsm_tile( A[j][j], A[i][j] );
        }
        for (j = 0; j < i; j++) {
            ssyrk_tile( A[i][j], A[i][i] );
        }
        spotrf_tile( A[i][i] );
    }
    ...
    for (int i = 0; i < DIM; i++) {
        for (int j = 0; j < DIM; j++) {
#pragma css wait on (A[i][j])
            print_block(A[i][j]);
        }
    }
    ...
}

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
void sgemm_tile(float *A, float *B, float *C)

#pragma css task input (T[64][64]) inout(B[64][64])
void strsm_tile(float *T, float *B)

#pragma css task input(A[64][64]) inout(C[64][64])
void ssyrk_tile(float *A, float *C)

#pragma css task inout(A[64][64])
void spotrf_tile(float *A)
[Figure: matrix of DIM × DIM blocks of 64 × 64 elements]
Cholesky factorization
CellSs: Programming examples
•Sparse LU
• More general factorization than Cholesky
• Deals with non-symmetric matrices
• Calculates one lower triangular matrix (L) and one upper triangular matrix (U) whose product equals a row permutation of the original matrix
Perm(A) = L*U
• Difficult to program for Cell, since some operations work on columns (not blocks)
• The example shown here is a simplified version (without pivoting) based on an initial sparse matrix
CellSs: Programming examples
int main(int argc, char **argv)
{
    int ii, jj, kk;
    …
    for (kk=0; kk<NB; kk++) {
        lu0(A[kk][kk]);
        for (jj=kk+1; jj<NB; jj++)
            if (A[kk][jj] != NULL)
                fwd(A[kk][kk], A[kk][jj]);
        for (ii=kk+1; ii<NB; ii++)
            if (A[ii][kk] != NULL) {
                bdiv (A[kk][kk], A[ii][kk]);
                for (jj=kk+1; jj<NB; jj++)
                    if (A[kk][jj] != NULL) {
                        if (A[ii][jj]==NULL)
                            A[ii][jj]=allocate_clean_block();
                        bmod(A[ii][kk], A[kk][jj], A[ii][jj]);
                    }
            }
    }
}

void lu0(float *diag);
void bdiv(float *diag, float *row);
void bmod(float *row, float *col, float *inner);
void fwd(float *diag, float *col);
[Figure: sparse matrix of NB × NB blocks of B × B elements]
Sparse LU
CellSs: Programming examples
Sparse LU: the main program above combines
• Dynamic main memory allocation
• Data-dependent parallelism
#pragma css task inout(diag[B][B]) highpriority
void lu0(float *diag);
#pragma css task input(diag[B][B]) inout(row[B][B])
void bdiv(float *diag, float *row);
#pragma css task input(row[B][B],col[B][B]) inout(inner[B][B])
void bmod(float *row, float *col, float *inner);
#pragma css task input(diag[B][B]) inout(col[B][B])
void fwd(float *diag, float *col);
CellSs: Programming examples
•SPU memory functionality: tailored CellSs API to deal with memory issues in the SPU
•Dynamic memory allocation
• Local Store (LS) space in each SPU is limited, so CellSs tries to control as much of it as possible
#include <css_malloc.h>
void *css_malloc (unsigned int size);
void css_free (void *chunk);
CellSs: Programming examples
•Example: Dynamic memory allocation
#pragma css task input(bs, log2_N, is_forward, twiddle) inout(data, sync)
static void FFT1D_1 (int bs, int log2_N, float twiddle[CUBE_SIZE*2], int is_forward,
                     float data[bs][2*CUBE_SIZE], int sync[1])
{
    FFT1D_core (bs, data, log2_N, twiddle, is_forward);
}

static void FFT1D_core (int bs, float data[bs][2*CUBE_SIZE], int log2_N,
                        float twiddle[CUBE_SIZE*2], int is_forward)
{
    int i;
    int n_floats_elems = (1 << log2_N)*2;
    float *work_re = css_malloc(sizeof(float)*n_floats_elems);
    float *work_im = css_malloc(sizeof(float)*n_floats_elems);
    for (i=0; i<bs; i++)
        spe_FFT_1D_core (log2_N, &data[i][0], twiddle, is_forward, work_re, work_im);
    css_free((void *)work_re);
    css_free((void *)work_im);
}
CellSs: Programming examples
•DMA accesses
• CellSs handles all data transfers from main memory to SPU Local Store
• Some applications may need to do explicit data transfer from main memory
• For transfers of 1, 2, 4, 8 bytes or multiples of 16 bytes up to 16 KB
#include <css_dma.h>
void css_get_a (void *ls, uint32_t ea, unsigned int dma_size, tagid_t tag);
void css_put_a (void *ls, uint32_t ea, unsigned int dma_size, tagid_t tag);
• ls: pointer to a 16-byte aligned allocated buffer in LS
• ea: pointer to main memory
• dma_size: size of the block
• tag: identifier of the DMA transfer
CellSs: Programming examples
•DMA accesses
• Obtaining a tag: returns a valid tag for a DMA transfer
tagid_t css_tag (void);
• Synchronization
void css_sync (tagid_t tag);
• For DMA transfers not meeting the previous requirements
void css_get (void *ls, unsigned int address, unsigned int dma_size, tagid_t tag);
void css_put (void *ls, unsigned int address, unsigned int dma_size, tagid_t tag);
• Example:
float *blocks = (float *)css_malloc(N*sizeof(Complex));
tag = css_tag ();
css_get_a (blocks, addr, (unsigned int)(N*sizeof(Complex)), tag);
css_sync(tag);
CellSs: Programming examples
•Strided Memory access
• Interface to scatter/gather data from 1D, 2D and 3D arrays
#include <css_stride.h>
dmal_h_t *css_gather_1d (void *ls, unsigned int start, int chunk, int stride, size_t size, size_t e_size, dmal_h_t *c_list);
dmal_h_t *css_scatter_1d (void *ls, unsigned int start, int chunk, int stride, size_t size, size_t e_size, dmal_h_t *c_list);
• ls: pointer to a 16-byte aligned allocated buffer in LS
• c_list: allows the same memory access pattern to be used again by reusing DMA lists
• size: number of objects to be copied
• e_size: size of one element
[Figure: 1D array with start, chunk and stride marked]
CellSs: Programming examples
•Strided Memory access
#include <css_stride.h>
dmal_h_t *css_gather_2d (void *ls, unsigned int start, int local_x, int local_y, int global_x, size_t e_size, dmal_h_t *c_list);
dmal_h_t *css_scatter_2d (void *ls, unsigned int start, int local_x, int local_y, int global_x, size_t e_size, dmal_h_t *c_list);
[Figure: 2D array with start, local_x, local_y and global_x marked]
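The access pattern that the 2D gather parameters describe can be illustrated on the host with plain copies. The sketch assumes that global_x is the row length of the full array and that local_x and local_y are the width and height of the sub-block, all counted in elements; the slides do not state this explicitly. gather_2d_sketch is a hypothetical function, not part of the CellSs API, and it uses memcpy instead of DMA lists.

```c
#include <assert.h>
#include <string.h>

/* Copies a sub-block of local_x by local_y elements that begins at `start`
   in a row-major array whose rows are global_x elements long into the
   contiguous buffer `ls`. Host-side illustration only. */
void gather_2d_sketch(void *ls, const void *start, int local_x, int local_y,
                      int global_x, size_t e_size)
{
    assert(local_x > 0 && local_y > 0 && global_x >= local_x);
    char *dst = ls;
    const char *src = start;
    for (int row = 0; row < local_y; row++)
        memcpy(dst + (size_t)row * local_x * e_size,    /* rows packed in ls */
               src + (size_t)row * global_x * e_size,   /* rows strided in memory */
               (size_t)local_x * e_size);
}
```

A scatter is the same loop with source and destination exchanged.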
CellSs: Programming examples
•Strided Memory access
#include <css_stride.h>
dmal_h_t *css_gather_3d (void *ls, unsigned int start, int local_x, int local_y, int local_z, int global_x, int global_z, size_t e_size, dmal_h_t *c_list);
dmal_h_t *css_scatter_3d (void *ls, unsigned int start, int local_x, int local_y, int local_z, int global_x, int global_z, size_t e_size, dmal_h_t *c_list);
•Example:
#pragma css task input ( A_p) output (A[16*16])
void example (float *A, unsigned int A_p)
{
dmal_h_t *entry = css_gather_1d (A, A_p, 4, 16, 64, sizeof(float), NULL);
css_sync(entry->tag);
}
CellSs: Programming examples
Another Cholesky
void sequential_cholesky(void)
{
    int STEP;
    int bm;
    for (STEP = 0; STEP <= STEPS-1; STEP++) {
        // Update and factorize the current diagonal block
        my_cholesky_ssyrk (STEP, NB, N, A);
        my_cholesky_spotf2(STEP, NB, N, A);
        if (STEP < STEPS-1) {
            // Compute the current block column
            for (bm = 0; bm < STEPS-STEP-1; bm++) {
                my_cholesky_sgemm(STEP, bm, NB, N, A);
                my_cholesky_strsm(STEP, bm, NB, N, A);
            }
        }
    }
}

void my_cholesky_ssyrk(int STEP, int nb, int N, float *A)
{
    for (int i = 0; i < STEP; i++)  // rank update for A[d][d]
    {
        ssyrk(A[STEP*B][i*B], A[STEP*B][STEP*B]);
    }
}
[Figure: original matrix A (N × N elements, N = NB x B, blocks of B × B elements) stored in consecutive positions in memory by rows]
CellSs: Programming examples
void my_cholesky_ssyrk(int STEP, int nb, int N, float *A)
{
    for (int i = 0; i < STEP; i++)  // rank update for A[d][d]
    {
        check_data_av(A, ShA, STEP, i,    N, nb, B);
        check_data_av(A, ShA, STEP, STEP, N, nb, B);
        ssyrk (ShA[STEP*nb+i], ShA[STEP*nb+STEP]);
    }
}
[Figure: block (STEP, i) of matrix A and its copy in the shadow matrix ShA, which holds NB × NB blocks of B × B elements]
CellSs: Programming examples
void check_data_av(float* M, float** shadow, int i, int j, int N, int nb, int B)
{
    int pp;
    if (shadow[i*nb+j] == NULL) {
        shadow[i*nb+j] = (float *)malloc(B*B*sizeof(float));
        pp = (int)&M[i*N*B+j*B];
        copy_to_shadow_block (&M[i*N*B+j*B], pp, B, N, shadow[i*nb+j]);
    }
}

void copy_back_to_matrix(float* M, float** shadow, int N, int nb, int B)
{
    int i, j, pp;
    for (i = 0; i < nb; i++) {
        for (j = 0; j < nb; j++) {
            if (shadow[i*nb+j] != NULL) {
                pp = (int)&M[i*N*B+j*B];
                copy_from_shadow_block (&M[i*N*B+j*B], pp, B, N, shadow[i*nb+j]);
            }
        }
    }
}
Note: the shadow blocks could be managed as a cache!

#pragma css task input (address[1], main_address, b, n) output (WA[64][64])
void copy_to_shadow_block (float *address, int main_address, int b, int n, float *WA)

#pragma css task input (WA[64][64], main_address, b, n) inout (address[1])
void copy_from_shadow_block (float *address, int main_address, int b, int n, float *WA)

#pragma css task inout(A[64][64]) highpriority
void spotrf_tile(float *A)

#pragma css task input (A[64][64]) inout(B[64][64])
void ssyrk_tile(float *A, float *B)

#pragma css task input(A[64][64], B[64][64]) inout(C[64][64])
void sgemm_tile(float *A, float *B, float *C)

#pragma css task input (T[64][64]) inout(B[64][64])
void strsm_tile(float *T, float *B)
CellSs: Programming examples
#pragma css task input (address[1], main_address, b, n) output (WA[64][64])
void copy_to_shadow_block (float *address, int main_address, int b, int n, float *WA)
{
    // address is a trick to ensure dependencies:
    // it points to the first element of the block as a representation
    // of the whole block
    dmal_h_t *entry;
    entry = css_gather_2d (WA, main_address, b, b, n, sizeof(float), NULL);
    ls_sync(entry->tag);
}

#pragma css task input (WA[64][64], main_address, b, n) inout (address[1])
void copy_from_shadow_block (float *address, int main_address, int b, int n, float *WA)
{
    dmal_h_t *entry;
    address[0] = WA[0];
    // as address is inout, when the task finishes it copies back its local value
    // to the original position in main memory, so we need to assign the correct
    // value to that local variable.
    entry = css_scatter_2d (WA, main_address, b, b, n, sizeof(float), NULL);
    ls_sync(entry->tag);
}
CellSs: Programming examples
Checking LU
int main(int argc, char* argv[])
{
    ...
    copy_mat (A, origA);
    LU (A);
    split_mat (A, L, U);
    clean_mat (A);
    sparse_matmult (L, U, A);
    compare_mat (origA, A);
}

#pragma css task input(Src) output(Dst)
void copy_block (float Src[BS][BS], float Dst[BS][BS]);

void copy_mat (float *Src, float *Dst)
{
    ...
    for (ii=0; ii<NB; ii++)
        for (jj=0; jj<NB; jj++)
            ...
            copy_block(Src[ii][jj], block);
    ...
}

#pragma css task input(A) output(L,U)
void split_block (float A[BS][BS], float L[BS][BS], float U[BS][BS]);

void split_mat (float *LU[NB][NB], float *L[NB][NB], float *U[NB][NB])
{
    ...
    for (ii=0; ii<NB; ii++)
        for (jj=0; jj<NB; jj++) {
            ...
            split_block (LU[ii][ii], L[ii][ii], U[ii][ii]);
            ...
        }
}
CellSs: Programming examples
Checking LU
Version that frees the blocks:
void clean_mat (p_block_t Src[NB][NB])
{
    int ii, jj;
    for (ii=0; ii<NB; ii++)
        for (jj=0; jj<NB; jj++)
            if (Src[ii][jj] != NULL) {
                free (Src[ii][jj]);
                Src[ii][jj] = NULL;
            }
}

Version that cleans the blocks with a task:
#pragma css task output(Dst)
void clean_block (float Dst[BS][BS]);

void clean_mat (p_block_t Src[NB][NB])
{
    int ii, jj;
    for (ii=0; ii<NB; ii++)
        for (jj=0; jj<NB; jj++)
            if (Src[ii][jj] != NULL) {
                clean_block(Src[ii][jj]);
            }
}
CellSs: Programming examples
Checking LU
void sparse_matmult (float *A[NB][NB], float *B[NB][NB], float *C[NB][NB])
{
    int ii, jj, kk;
    for (ii=0; ii<NB; ii++)
        for (jj=0; jj<NB; jj++)
            for (kk=0; kk<NB; kk++)
                if ((A[ii][kk] != NULL) && (B[kk][jj] != NULL)) {
                    if (C[ii][jj] == NULL)
                        C[ii][jj] = allocate_clean_block();
                    block_matmul (A[ii][kk], B[kk][jj], C[ii][jj]);
                }
}

#pragma css task input(a,b) inout(c)
void block_matmul(float a[BS][BS], float b[BS][BS], float c[BS][BS])
{
    int i, j, k;
    for (i=0; i<BS; i++)
        for (j=0; j<BS; j++)
            for (k=0; k<BS; k++)
                c[i][j] += a[i][k]*b[k][j];
}
CellSs: Programming examples
Checking LU
#pragma css task input (ref_block, to_comp) output (mse)
void are_blocks_equal (float ref_block[BS][BS], float to_comp[BS][BS], float *mse);

void compare_mat (p_block_t X[NB][NB], p_block_t Y[NB][NB], struct timeval *stop)
{
    ...
    Zero_block = allocate_clean_block();
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++) {
            if (X[ii][jj] == NULL)
                if (Y[ii][jj] == NULL)
                    sq_error[ii][jj] = 0.0f;
                else
                    are_blocks_equal(Zero_block, Y[ii][jj], &sq_error[ii][jj]);
            else
                are_blocks_equal(X[ii][jj], Y[ii][jj], &sq_error[ii][jj]);
        }
#pragma css finish
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++)
            if (sq_error[ii][jj] > 0.0000001L) {
                printf ("block [%d, %d]: detected mse = % 20lf\n", ii, jj, sq_error[ii][jj]);
                some_difference = TRUE;
            }
    if (some_difference == FALSE)
        printf ("matrices are identical\n");
}
CellSs: Programming examples
Checking LU
Same compare_mat as in the previous version, except that #pragma css finish is replaced by a wait on each result inside the checking loop:
    for (ii = 0; ii < NB; ii++)
        for (jj = 0; jj < NB; jj++) {
#pragma css wait on (&sq_error[ii][jj])
            if (sq_error[ii][jj] > 0.0000001L) {
                printf ("block [%d, %d]: detected mse = % 20lf\n", ii, jj, sq_error[ii][jj]);
                some_difference = TRUE;
            }
        }
    if (some_difference == FALSE)
        printf ("matrices are identical\n");
}
CellSs: Programming examples
Behavior: Checking LU
[Figure: execution traces without CellSs and with CellSs (for an NB=4 matrix), showing the phases copy_mat (A, origA); LU (A); split_mat (A, L, U); clean_mat (A); sparse_matmult (L, U, A); compare_mat (origA, A)]
[Figure: trace coloured by task type. Legend: 0: are_blocks_equal, 1: bdiv_adapte, 2: block_mpy_add, 3: bmod, 4: clean_block, 5: copy_block, 6: fwd, 7: lu0, 8: split_block]
CellSs: Programming examples
•Molecular dynamics: Argon simulation
• Simulates the mobility of argon atoms in the gas state, in a constant volume at T=300K
• All electrostatic forces exerted on each atom by the others are considered (F_i)
• Newton's second law is then applied to each atom
F_i = m * a_i
• The initial velocities are random but reasonable for argon atoms at 300K
• To maintain a constant temperature throughout the process, the Berendsen algorithm is applied
CellSs: Programming examples
Molecular dynamics: Argon simulation
program argon
    ...
    !$CSS START
    do step=1,niter
        do ii=1, N, BSIZE
            do jj=1, N, BSIZE
                call velocity(BSIZE, ii, jj, x(ii), y(ii), z(ii), &
                              x(jj), y(jj), z(jj), vx(ii), vy(ii), vz(ii))
            enddo
        enddo
        do jj=1, N, BSIZE
            call v_mod(BSIZE, v(jj), vx(jj), vy(jj), vz(jj))
        enddo
        !$CSS BARRIER
        tins=0.e0
        do i=1,N
            tins=mkg*v(i)**2/3.e0/kb+tins
        enddo
        tins=tins/N
        lam1=sqrt(t/tins)
        do ii=1, N, BSIZE
            call update_position(BSIZE, lam1, vx(ii), vy(ii), vz(ii), &
                                 x(ii), y(ii), z(ii))
        enddo
    enddo
    !$CSS FINISH
end

program argon
    ...
    interface
        !$CSS TASK
        subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj, vx, vy, vz)
            implicit none
            integer, intent(in) :: BSIZE, ii, jj
            real, intent(in), dimension(BSIZE) :: xi, yi, zi, xj, yj, zj
            real, intent(inout), dimension(BSIZE) :: vx, vy, vz
        end subroutine
        !$CSS TASK
        subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)
            implicit none
            integer, intent(in) :: BSIZE
            real, intent(in) :: lam1
            real, intent(inout), dimension(BSIZE) :: vx, vy, vz
            real, intent(inout), dimension(BSIZE) :: x, y, z
        end subroutine
        !$CSS TASK
        subroutine v_mod(BSIZE, v, vx, vy, vz)
            implicit none
            integer, intent(in) :: BSIZE
            real, intent(out) :: v(BSIZE)
            real, intent(in), dimension(BSIZE) :: vx, vy, vz
        end subroutine
    end interface
Programming with CellSs
CellSs: Programming examples
!$CSS TASK
subroutine velocity(BSIZE, ii, jj, xi, yi, zi, xj, yj, zj, vx, vy, vz)
  ! subroutine code
end subroutine

!$CSS TASK
subroutine update_position(BSIZE, lam1, vx, vy, vz, x, y, z)
  ! subroutine code
end subroutine

!$CSS TASK
subroutine v_mod(BSIZE, v, vx, vy, vz)
  ! subroutine code
end subroutine
Molecular dynamics: Argon simulation
CellSs: Programming examples
•Vector reduction
[Figure: array A reduced block by block; labels: BS, NB]
CellSs: Programming examples
Vector Reduction

int main(int argc, char* argv[])
{
  LEVELS = log2 ((double)NB/BS);
#pragma css start
  for (level = 0; level < LEVELS; level++) {
    range = exp2 ((double)level);
    for (i = 0; i < NB; i += 2*BS*range)
      block_reduce(&A[i], &A[i+BS*range]);
  }
  block_reduce2(&A[0], &reduction);
#pragma css finish
}

#pragma css task input(B[64*64]) inout(A[64*64])
void block_reduce(float *A, float *B)
{
  int i;
  for (i = 0; i < BS; i++) A[i] += B[i];
}

#pragma css task input(A) output(x)
void block_reduce2(float *A, float *x)
{
  int i;
  *x = 0.0;
  for (i = 0; i < BS; i++) *x += A[i];
}
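The access pattern of this tree reduction can be checked with a plain sequential C sketch. It leaves out the CellSs pragmas, assumes the number of blocks is a power of two, and uses hypothetical names (block_add, tree_reduce) that do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Adds block b into block a element by element (the role of block_reduce). */
static void block_add(float *a, const float *b, size_t bs)
{
    for (size_t i = 0; i < bs; i++)
        a[i] += b[i];
}

/* Pairwise tree reduction of n elements stored as n/bs blocks of bs elements.
   Assumes n/bs is a power of two. Overwrites a and returns the total. */
float tree_reduce(float *a, size_t n, size_t bs)
{
    assert(bs > 0 && n % bs == 0);
    size_t nblocks = n / bs;
    assert(nblocks > 0 && (nblocks & (nblocks - 1)) == 0);

    /* At each level, blocks that are `stride` elements apart are combined. */
    for (size_t stride = bs; stride < n; stride *= 2)
        for (size_t i = 0; i < n; i += 2 * stride)
            block_add(&a[i], &a[i + stride], bs);

    /* Final step (the role of block_reduce2): sum the surviving block. */
    float total = 0.0f;
    for (size_t i = 0; i < bs; i++)
        total += a[i];
    return total;
}
```

Within one level the block_add calls touch disjoint blocks, which is why a dependence-driven runtime can run them concurrently.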
CellSs: Programming examples
•Vector reduction
[Figure: array A accumulated block by block into a block initialized with the neutral element; labels: BS, NB]
• Less concurrency for one vector
• Fine when considering several
CellSs: Programming examples
Vector Reduction

int main(int argc, char* argv[])
{
  LEVELS = log2 ((double)NB/BS);
#pragma css start
  for (i = 0; i < NB; i += BS)
    block_reduce(&RB[0], &A[i]);

  block_reduce2(&RB[0], &reduction);
#pragma css finish
}

#pragma css task input(B[64*64]) inout(A[64*64])
void block_reduce(float *A, float *B)
{
  int i;
  for (i = 0; i < BS; i++) A[i] += B[i];
}

#pragma css task input(A) output(x)
void block_reduce2(float *A, float *x)
{
  int i;
  *x = 0.0;
  for (i = 0; i < BS; i++) *x += A[i];
}
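A sequential C sketch of this second variant is given below. It assumes that the accumulator block (RB on the slide) starts as the neutral element of the addition, that is, all zeros; the function name accumulate_reduce is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulates every block of a into rb, which is first set to the neutral
   element (all zeros for addition), then sums rb. */
float accumulate_reduce(const float *a, size_t n, size_t bs, float *rb)
{
    assert(bs > 0 && n % bs == 0);
    for (size_t j = 0; j < bs; j++)
        rb[j] = 0.0f;                      /* neutral element of + */
    for (size_t i = 0; i < n; i += bs)     /* one chained step per block */
        for (size_t j = 0; j < bs; j++)
            rb[j] += a[i + j];
    float total = 0.0f;
    for (size_t j = 0; j < bs; j++)
        total += rb[j];
    return total;
}
```

Every step updates the same accumulator, so the steps for one vector form a dependence chain; the concurrency comes from reducing several vectors at once.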
CellSs: Compiling and running a CellSs application
• Usage: cellss-cc <options and sources>
• cellss-cc -help : lists usage
• Options:
• Regular compilation flags: -O<opt. level>, -g, -o <filename>, -D<macro>...
• Specific compilation flags:
• -t: tracing enabled. Generates Paraver tracefiles
• -WPPUp,<options>: passes a comma-separated list of options to the PPU preprocessor
• -WPPUc,<options>: passes a comma-separated list of options to the PPU compiler
• -WPPUl,<options>: passes a comma-separated list of options to the PPU linker
• -WPPUf,<options>: passes a comma-separated list of options to the PPU Fortran compiler
• -WSPUp,<options>: passes a comma-separated list of options to the SPU preprocessor
• -WSPUc,<options>: passes a comma-separated list of options to the SPU compiler
• -WSPUf,<options>: passes a comma-separated list of options to the SPU Fortran compiler
CellSs: Compiling and running a CellSs application
•Examples
> cellss-cc -O3 *.c -o my_binary
> cellss-cc -O3 matmul.f90 -o matmul
> cellss-cc -O2 -WSPUc,-funroll-loops,-ftree-vectorize -WSPUc,-ftree-vectorizer-verbose=3 matmul.c -o matmul
> cellss-cc -O3 -k test.c -o test
> cellss-cc -O5 -o argon2 argon2_css.f90 -t
CellSs: Compiling and running a CellSs application
•Multiple source files
> cellss-cc -O3 -c code1.c
> cellss-cc -O3 -c code2.c
> cellss-cc -O3 -c code3.f90
> cellss-cc -O3 code1.o code2.o code3.o -o my_binary
•Use in a Makefile
CC = cellss-cc
LD = cellss-cc
CFLAGS = -O2 -g
SOURCES = code1.c code2.c code3.c
BINARY = my_binary
$(BINARY): $(SOURCES)
	$(CC) $(CFLAGS) $(SOURCES) -o $(BINARY)
CellSs: Compiling and running a CellSs application
• Running
• Setting LD_LIBRARY_PATH (not always needed):
export LD_LIBRARY_PATH=$HOME_CELLSS/lib:$LD_LIBRARY_PATH
• Setting the number of SPUs (default 8; valid values are 1 to 16 on a blade and 1 to 6 on a PS3):
export CSS_NUM_SPUS=6
• Normal execution from command line:
./my_binary arg1 arg2 ... argn
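As an illustration of the valid range for CSS_NUM_SPUS, the following C sketch validates such a value before it is used. It is not the CellSs runtime code: the function name is hypothetical, and falling back to the default on malformed or out-of-range input is an assumption of this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Parses a CSS_NUM_SPUS-style value. Returns the number if it lies in
   [1, max_spus]; returns default_spus when the text is NULL, empty,
   malformed or out of range (assumed policy, not taken from CellSs). */
int parse_num_spus(const char *text, int default_spus, int max_spus)
{
    assert(default_spus >= 1 && default_spus <= max_spus);
    if (text == NULL || *text == '\0')
        return default_spus;
    char *end = NULL;
    errno = 0;
    long value = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || value < 1 || value > max_spus)
        return default_spus;
    return (int)value;
}
```

A caller on a blade could use parse_num_spus(getenv("CSS_NUM_SPUS"), 8, 16); on a PS3 the last two arguments would be 6 and 6.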
CellSs: Compiling and running a CellSs application
•Generating a tracefile
• Compile with -t flag
> cellss-cc my_app.c -t -O3 -o my_binary_instr
• Run normally
> ./my_binary_instr arg1 arg2 ...
• Tracefile is automatically generated. Default name gss-trace-xxx.ext
gss-trace-0001.prv
gss-trace-0001.row
gss-trace-0001.pcf
• All three files used by Paraver performance analyser and visualizer
• Changing the tracefile name:
> export CSS_TRACE_FILENAME=tracefilename
Will generate tracefiles: tracefilename-0001.prv, ...
CellSs: Compiling and running a CellSs application
•CellSs configuration file
• Optional, default settings applied if not provided
• Plain text file
scheduler.min_tasks = 32
scheduler.initial_tasks = 128
scheduler.max_strand_size = 8
task_graph.task_count_high_mark = 2000
task_graph.task_count_low_mark = 1500
renaming.memory_high_mark = 134217728
renaming.memory_low_mark = 104857600
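The "key = value" layout of the file above can be read with standard C. The sketch below handles only that layout; the function name is hypothetical, and how the real CellSs parser treats comments, blank lines or non-numeric values is not covered by the slides, so none of that is modelled here.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Splits one "key = value" line with an integer value.
   Returns 1 on success and 0 otherwise. key needs room for 64 bytes. */
int parse_config_line(const char *line, char *key, long *value)
{
    /* %63[^ \t=] reads the key up to the first blank or '='. */
    return sscanf(line, " %63[^ \t=] = %ld", key, value) == 2;
}
```

Calling it on each line returned by fgets gives the setting name and its numeric value, for example "scheduler.min_tasks" and 32.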
CellSs: Compiling and running a CellSs application
•CellSs configuration file
• scheduler.initial_tasks (128): number of ready tasks that are generated at the beginning of the execution of an application, before their scheduling and execution on the SPEs starts
• scheduler.min_tasks (16): minimum number of ready tasks needed to call the scheduler
• scheduler.max_strand_size (8): maximum number of tasks that are simultaneously scheduled to an SPE
• task_graph.task_count_high_mark (1000): maximum number of non-executed tasks that the graph will hold
• task_graph.task_count_low_mark (900): whenever the task graph reaches the number of tasks defined by task_graph.task_count_high_mark, task graph generation is suspended until the number of non-executed tasks goes below this amount
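The interplay between a high mark and a low mark is a hysteresis, which the following C sketch illustrates. The type and function names are hypothetical, and the exact comparisons (>= for the high mark, < for the low mark) are an assumption based on the wording above, not taken from the runtime.

```c
#include <assert.h>
#include <stdbool.h>

/* Hysteresis state for task graph generation. */
typedef struct {
    long high_mark;   /* suspend generation when pending >= high_mark */
    long low_mark;    /* resume once pending < low_mark */
    bool suspended;
} throttle_t;

/* Updates the state for the current number of non-executed tasks and
   returns true if new tasks may be added to the graph. */
bool throttle_may_generate(throttle_t *t, long pending)
{
    assert(t->low_mark <= t->high_mark);
    if (!t->suspended && pending >= t->high_mark)
        t->suspended = true;
    else if (t->suspended && pending < t->low_mark)
        t->suspended = false;
    return !t->suspended;
}
```

With the default marks of 1000 and 900, generation stops at 1000 pending tasks and stays stopped until fewer than 900 remain, which avoids toggling on every single task completion.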
CellSs: Compiling and running a CellSs application
•CellSs configuration file
• renaming.memory_high_mark (∞): maximum amount of memory used for renaming, in bytes.
• renaming.memory_low_mark (1): whenever the renaming memory usage reaches the size specified by renaming.memory_high_mark, task graph generation is suspended until the renaming memory usage goes below the number of bytes specified in this variable.
> export CSS_CONFIG_FILE=file.cfg
CellSs: Performance Analysis with Paraver
•Paraver
• Flexible performance visualization and analysis tool that can be used to analyze:
• MPI, OpenMP, MPI+OpenMP
• Java
• Hardware counters profile
• Operating system activity
• ... and many other things you may think of
• Generally it uses external trace file generators. Example for MPI:
> mpitrace mpirun -n 10 my_mpi-binary
• For CellSs, the libraries have been instrumented.
• When installing the distribution, two libraries are generated: normal and instrumented
• Flag -t links with instrumented version
• Available for free from the BSC website: www.bsc.es/paraver
CellSs: Performance Analysis with Paraver
•Running paraver
paraver tracefile-0001.prv
CellSs: Performance Analysis with Paraver
•Configuration files
Configuration file        Feature shown
2dh inbw.cfg              Histogram of the bandwidth achieved by individual DMA IN transfers.
2dh inbytes.cfg           Histogram of bytes read by the stage in DMA transfers.
2dh outbw.cfg             Histogram of the bandwidth achieved by individual DMA OUT transfers.
2dh outbytes.cfg          Histogram of bytes written by the stage out DMA transfers.
3dh duration phase.cfg    Histogram of duration for each of the runtime phases.
3dh duration tasks.cfg    Histogram of duration of SPU tasks.
DMA bw.cfg                DMA (in + out) bandwidth per SPU.
DMA bytes.cfg             Bytes being DMAed (in + out) by each SPU.
execution phases.cfg      Profile of percentage of time spent by each thread at each of the major phases.
CellSs: Performance Analysis with Paraver
•Configuration files
Configuration file            Feature shown
flushing.cfg                  Intervals (dark blue) where each SPU is flushing its local trace buffer to main memory.
general.cfg                   Mix of timelines.
stage in out phase.cfg        Identification of DMA in (grey) and out phases (green).
task.cfg                      Outlined function being executed by each SPU.
task distance histogram.cfg   Histogram of task distance between dependent tasks.
task number.cfg               Number of task being executed by each SPU.
Task profile.cfg              Time (microseconds) each SPU spent executing the different tasks.
task repetitions.cfg          Shows which SPU executed each task and the number of times that the task was executed.
Total DMA bw.cfg              Total DMA (in+out) bandwidth to Memory.
CellSs: Performance Analysis with Paraver
[Figure: Paraver trace with the following annotations]
Clustering
Group of 8 tasks (23 us)
Block size: 64x64 floats
DMA in/out
Data re-use
Main thread
Helper thread
CellSs: Performance Analysis with Paraver
Another Cholesky
CellSs: Performance evolution
Performance: matrix multiply
• Versions with different task implementation
• Task duration:
• from 2000 µsecs (simple C scalar code)
• to 22 µsecs (highly hand-vectorized/optimized code)
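To relate task duration to arithmetic throughput, the following C sketch computes the rate of a single block multiplication. It assumes that each task performs C += A*B on 64x64 float blocks (the block size quoted for the trace example in this deck) and counts one multiplication and one addition per inner iteration; the function names are hypothetical.

```c
#include <assert.h>

/* Floating-point operations in C += A*B for square n x n blocks:
   n*n*n multiplications plus n*n*n additions. */
double block_matmul_flops(int n)
{
    assert(n > 0);
    return 2.0 * n * n * n;
}

/* Rate in GFlops for `flops` operations completed in `usecs` microseconds. */
double gflops(double flops, double usecs)
{
    assert(usecs > 0.0);
    return flops / (usecs * 1e-6) / 1e9;
}
```

Under these assumptions a 64x64 block task executes 524288 operations, which corresponds to roughly 0.26 GFlops per SPU at 2000 µs and roughly 23.8 GFlops per SPU at 22 µs.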
[Figure: Matmul scalability, July 2007. Speed-up vs. #SPUs; series labelled 2023, 281.79, 117.96, 58.91, 28.46, 22.43]
[Figure: Matmul scalability, November 2007. Speed-up vs. #SPUs; series labelled 2023, 281.79, 117.96, 58.91, 28.46, 22.43]
[Figure: Scalability analysis of matrix multiply, April 2007. Speed-up vs. #SPUs; series labelled 2022.77, 281.32, 117.47, 58.46, 27.87, 21.86 usecs]
[Figure: Matmul performance. GFlops vs. #SPUs; series: March 2007, July 2007, Nov 2007]
CellSs: Performance evolution
Performance: Cholesky factorization
[Figure: Cholesky performance, April 2007. GFlops vs. matrix size]
[Figure: Cholesky performance, July 2007. GFlops vs. matrix size]
[Figure: Cholesky performance, November 2007. GFlops vs. matrix size]
CellSs: Performance evolution
Task dependence graph for
a 320 x 320 floats matrix
(blocks of 64 x 64)
CellSs: Performance evolution
[Figure: Cell/B.E. block diagram. Labels: SPEs (each with SXU, LS, DMA), on-chip coherent bus, SL1, PPE, memory controller]
CellSs: Performance evolution
• Increase of locality for Matmul
CellSs: Performance evolution
• Increase of locality for Cholesky
CellSs: Performance evolution
• Increase of locality for Sparse LU
CellSs: Performance evolution
• Increase of locality in the software cache
CellSs: Performance evolution
• Increase of locality in the software cache
CellSs: issues and ongoing efforts
• CellSs programming model
• Memory association
• Array regions
• Subobject accesses
• Blocks larger than Local Store.
• Access to global memory by tasks?
• Inline directives
• CellSs runtime system
• Further optimization of overheads (insert task and remove task)
• Scheduling algorithms: overhead, locality
• Overlays
• Short circuiting (SPE-to-SPE transfers)
• SMP superscalar (SMPSs)
SMPSs
• “Same” source code
• Higher flexibility (block size, ...)
• Same compiler
• Different back-end
• Execution environment
• Specific implementation
• Distributed scheduling
• No need for data copy
[Figure: LU performance on a 2-way POWER5. GFlops vs. number of threads; series: Machine peak, 1024, 2048, 4096]
[Figure: Cholesky scalability on an SGI Altix. Speed-up vs. #threads; series: 3072, 6144, 7680]
SMPSs: Programming example (version array regions)
•Merge-sort
• Splits in 4 subarrays each time
• Sorts the subarrays later on, calling a recursive sort to avoid sorting big arrays
• Using array regions
#pragma css task input(V[N]{i..j}) output (M[N][N]{i}{0..N-1})
SMPSs: Programming example (version array regions)
#pragma css task input(low[N]{i1..j1}, low[N]{i2..j2}, i1, j1, i2, j2) output(dest[N]{i1..j2})
void seqmerge (ELM *low, long i1, long j1, long i2, long j2, ELM *dest);

#pragma css task inout(low[N]{i..j}) input(i, j)
void seqquick (ELM *low, long i, long j);

void sort (ELM *low, long i, long j)
{
  ...
  if (size < QUICKSIZE) {
    seqquick (low, i, j);
  } else {
    quarter = size / 4;
    i1 = i;             j1 = i+quarter-1;
    i2 = i+quarter;     j2 = i+2*quarter-1;
    i3 = i+2*quarter;   j3 = i+3*quarter-1;
    i4 = i+3*quarter;   j4 = j;

    sort(low, i1, j1);
    sort(low, i2, j2);
    sort(low, i3, j3);
    sort(low, i4, j4);

    merge(low, i1, j1, i2, j2, tmp);
    merge(low, i3, j3, i4, j4, tmp);
    merge(tmp, i1, j2, i3, j4, low);
  }
}
SMPSs: Programming example (version array regions)
void merge (ELM *low, long i1, long j1, long i2, long j2, ELM *dest)
{
  ...
  if (size < MERGESIZE) {
    seqmerge(low, i1, j1, i2, j2, dest);
    return;
  }
  size /= 2;
  ...
  split(low, i1, j1, i2, j2, &split1, &split2);

  merge(low, i1, split1-1, i2, split2-1, dest);
  merge(low, split1, j1, split2, j2, dest);
}

main ()
{
#pragma css start
  sort(&array, 0, size-1);
#pragma css barrier
}
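The recursion structure of this sort can be tried out with a sequential C sketch. It drops the SMPSs pragmas, sorts long values instead of ELM, uses the standard qsort for small ranges, and replaces the recursive merge with a plain sequential merge; sort4, cmp_long and the QUICKSIZE value are hypothetical choices made for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define QUICKSIZE 8   /* hypothetical cut-off below which qsort is used */

static int cmp_long(const void *a, const void *b)
{
    long x = *(const long *)a, y = *(const long *)b;
    return (x > y) - (x < y);
}

/* Merges sorted src[i1..j1] and src[i2..j2] (inclusive) into dest,
   starting at dest[i1]. */
static void seqmerge(const long *src, long i1, long j1, long i2, long j2,
                     long *dest)
{
    long k = i1;
    while (i1 <= j1 && i2 <= j2)
        dest[k++] = (src[i2] < src[i1]) ? src[i2++] : src[i1++];
    while (i1 <= j1) dest[k++] = src[i1++];
    while (i2 <= j2) dest[k++] = src[i2++];
}

/* Sorts low[i..j] in place, using tmp[i..j] as scratch space. */
void sort4(long *low, long *tmp, long i, long j)
{
    long size = j - i + 1;
    if (size < QUICKSIZE) {
        if (size > 1)
            qsort(&low[i], (size_t)size, sizeof low[0], cmp_long);
        return;
    }
    long quarter = size / 4;
    long i1 = i,               j1 = i + quarter - 1;
    long i2 = i + quarter,     j2 = i + 2 * quarter - 1;
    long i3 = i + 2 * quarter, j3 = i + 3 * quarter - 1;
    long i4 = i + 3 * quarter, j4 = j;

    sort4(low, tmp, i1, j1);
    sort4(low, tmp, i2, j2);
    sort4(low, tmp, i3, j3);
    sort4(low, tmp, i4, j4);

    seqmerge(low, i1, j1, i2, j2, tmp);   /* quarters 1 and 2 into tmp */
    seqmerge(low, i3, j3, i4, j4, tmp);   /* quarters 3 and 4 into tmp */
    seqmerge(tmp, i1, j2, i3, j4, low);   /* both halves back into low */
}
```

The four recursive calls and the first two merges work on disjoint index ranges, which is the information the array-region annotations expose to the runtime.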
SMPSs: Programming example
•Queens
• Find a solution to the problem of placing N queens on an N x N board, with none of them attacking each other
SMPSs : Programming example
#pragma css task input (j, i, n) inout (a[n]) highpriority
void add_queen_task(char *a, int j, int i, int n);

#pragma css task input (results) inout (acc) highpriority
void acumulate(int results, int *acc);

#pragma css task input (n, j, a[n]) output (results)
void nqueens_ser_task(int n, int j, char *a, int *results);

void nqueens(int n, int j, char *a, char *b, int depth)
{
  for (i = 0; i < n; i++) {
    a[j] = i;
    if (ok(j + 1, a)) {
      add_queen_task(b, j, i, n);
      if (depth < task_depth) {
        nqueens(n, j + 1, a, b, depth + 1);
      } else {
        nqueens_ser_task(n, j + 1, b, &results);
        acumulate(results, &total_res);
      }
    }
  }
}
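For reference, a purely sequential counter for the same problem is sketched below in C. It is not the SMPSs version: there are no tasks, and the bodies of ok and of the serial search, which the slides do not show, are written here as an assumption of what such helpers typically do.

```c
#include <assert.h>

/* Returns 1 if the queen in row `rows - 1` does not attack any queen in an
   earlier row. a[r] holds the column of the queen placed in row r. */
static int ok(int rows, const char *a)
{
    int last = rows - 1;
    for (int r = 0; r < last; r++) {
        int dc = a[last] - a[r];
        if (dc == 0 || dc == last - r || dc == r - last)
            return 0;   /* same column or same diagonal */
    }
    return 1;
}

/* Counts the complete placements that extend the first j rows of a. */
static int nqueens_ser(int n, int j, char *a)
{
    if (j == n)
        return 1;
    int count = 0;
    for (int i = 0; i < n; i++) {
        a[j] = (char)i;
        if (ok(j + 1, a))
            count += nqueens_ser(n, j + 1, a);
    }
    return count;
}

/* Number of solutions for an n x n board (n <= 32 in this sketch). */
int count_queens(int n)
{
    char a[32];
    assert(n >= 1 && n <= 32);
    return nqueens_ser(n, 0, a);
}
```

In the task version, the upper levels of this recursion stay in the main program and each subtree below task_depth becomes one serial task whose result is accumulated separately.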
SMPSs: Compiler phase
[Figure: SMPSS-CC compiler phase. Elements shown: app.c; code translation (mcc); smpss-cc_app.c; app.tasks (tasks list); C compiler (gcc, icc, ...); smpss-cc_app.o; pack; app.o]
SMPSs: Linker phase
[Figure: SMPSS-CC linker phase. Elements shown: app.o; unpack; app.c; smpss-cc-app.c; smpss-cc_app.o; app.tasks; app-adapters.c; glue code generator; exec-adapters.c; exec-registration.c; C compiler (gcc, icc, ...); exec-adapters.o; exec-registration.o; libSMPSS.so; linker; exec]
SMPSs: runtime
[Figure: SMPSs runtime structure. Main thread on CPU0: user main program and SMPSs runtime library (data dependence, data renaming with a renaming table, scheduling, task execution, original task code). Worker threads 1 and 2 on CPU1 and CPU2: SMPSs runtime library (scheduling, task execution) and original task code. Memory holds the global ready task queues (high priority, low priority) and one ready task queue per thread (thread 0, thread 1, thread 2), with work stealing between threads.]
SMPSs: results
[Figures: Multi sort, N queens]
•Benchmarks used for OpenMP 3.0 development
• Similar performance in some ranges
• Overlap potential in SMPSs
• Programmability issues
• Reductions, memory allocations, synchronization representatives, nesting, …
SMPss: results
Conclusions
•The road for new chips with multi and many cores is open
•New programming models that can deal with the complexity of the
hardware are now more needed than ever
•StarSs
• Simple
• Portable
• Enough performance
• Ported to different architectures: CellSs, SMPSs
CellSs and SMPSs websites
•CellSs
• www.bsc.es/cellsuperscalar
•SMPSs
• www.bsc.es/smpsuperscalar
•Both available for download (open source, GPL and LGPL)