April 28, 2006


Algorithmic Skeletons for Stream Programming in Embedded Heterogeneous Parallel Image Processing Applications

IPDPS 2006

Wouter Caarls, Pieter Jonker, Henk Corporaal

Quantitative Imaging Group, department of Imaging Science and Technology

April 26, 2006 2

Overview

• Stream programming
• Writing stream kernels
• Algorithmic skeletons
• Writing algorithmic skeletons
• Skeleton merging
• Results
• Conclusion & Future work

April 26, 2006 3

Stream Programming

• FIFO-connected kernels processing series of data elements
  • Well suited to signal processing applications
• Explicit communication and task decomposition
  • Ideal for distributed-memory systems
• Each data element processed (mostly) independently
  • Ideal for data-parallel systems such as SIMDs
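The FIFO-connected kernel model above can be sketched in plain C. This is only an illustration, not the authors' runtime: the `Fifo` type, its capacity, and the two kernels (`kernel_scale`, `kernel_threshold`) are hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_CAP 8  /* hypothetical buffer size */

/* A bounded FIFO connecting two kernels (ring buffer). */
typedef struct {
    int data[FIFO_CAP];
    size_t head, count;
} Fifo;

/* Returns 1 on success, 0 if the FIFO is full. */
int fifo_push(Fifo *f, int v)
{
    if (f->count == FIFO_CAP) return 0;
    f->data[(f->head + f->count) % FIFO_CAP] = v;
    f->count++;
    return 1;
}

/* Returns 1 on success, 0 if the FIFO is empty. */
int fifo_pop(Fifo *f, int *v)
{
    if (f->count == 0) return 0;
    *v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_CAP;
    f->count--;
    return 1;
}

/* Hypothetical kernel 1: doubles each element. Fires once per call;
 * returns 0 when it cannot fire (no input or no output space). */
int kernel_scale(Fifo *in, Fifo *out)
{
    int v;
    if (out->count == FIFO_CAP || !fifo_pop(in, &v)) return 0;
    int ok = fifo_push(out, 2 * v);
    assert(ok);
    (void)ok;
    return 1;
}

/* Hypothetical kernel 2: thresholds each element at 10. */
int kernel_threshold(Fifo *in, Fifo *out)
{
    int v;
    if (out->count == FIFO_CAP || !fifo_pop(in, &v)) return 0;
    int ok = fifo_push(out, v > 10);
    assert(ok);
    (void)ok;
    return 1;
}
```

Each kernel only touches its own input and output FIFOs, so communication is explicit, and each element is processed independently of the others.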

April 26, 2006 4

Kernel Examples from Image Processing

• Pixel processing (color space conversion)
  • Perfect match
• Local neighborhood processing (convolution)
  • Requires 2D access
• Recursive neighborhood processing (distance transform)
  • Regular data dependencies
• Stack processing (region growing)
  • Irregular data dependencies

From top to bottom: increasing generality and architectural requirements

April 26, 2006 5

Writing Kernels

• The language for writing kernels should be restricted
  • To allow efficient compilation to constrained architectures
• But also general
  • So many different algorithms can be specified

Solution: a different language for each type of kernel

• User selects the most restricted language that supports his kernel
  • Retargetability
  • Efficiency
  • Ease-of-use

April 26, 2006 6

Algorithmic skeletons* as kernel languages

• An algorithmic skeleton captures a pattern of computation

• Is conceptually a higher-order function, repetitively calling a kernel function with certain parameters
  • Iteration strategy may be parallel
  • Kernel parameters restrict dependencies

• Provides the environment in which the kernel runs, and can be seen as a very restricted DSL

*M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation, 1989
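To make the "higher-order function" view concrete, here is a minimal C sketch in which a skeleton takes the kernel as a function pointer. It is an assumption-laden illustration: the slides' skeletons are expanded at compile time rather than called through pointers, and the names `NeighborhoodKernel`, `average3x3` and `neighborhood_to_pixel` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel signature: the skeleton hands the kernel a pointer
 * to the centre pixel plus the row stride, so the kernel can only reach
 * its neighbourhood through relative offsets. */
typedef float (*NeighborhoodKernel)(const float *center, size_t stride);

/* Example kernel: 3x3 average, written with relative indices only. */
float average3x3(const float *center, size_t stride)
{
    float acc = 0;
    for (int ky = -1; ky <= 1; ky++)
        for (int kx = -1; kx <= 1; kx++)
            acc += center[ky * (ptrdiff_t)stride + kx];
    return acc / 9;
}

/* Skeleton: owns the iteration strategy (here a sequential scan that
 * skips the one-pixel border) and calls the kernel once per pixel. */
void neighborhood_to_pixel(NeighborhoodKernel kernel,
                           const float *in, float *out,
                           size_t width, size_t height)
{
    assert(width >= 3 && height >= 3);
    for (size_t y = 1; y < height - 1; y++)
        for (size_t x = 1; x < width - 1; x++)
            out[y * width + x] = kernel(&in[y * width + x], width);
}
```

A call such as `neighborhood_to_pixel(average3x3, in, out, width, height)` applies the kernel to every interior pixel. Because the kernel never sees absolute coordinates, the skeleton remains free to change the iteration order.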

April 26, 2006 7

Sequential neighborhood skeleton

Kernel definition

    NeighborhoodToPixelOp()
    Average(in stream float i[-1..1][-1..1], out stream float *o)
    {
      int ky, kx;
      float acc=0;

      for (ky=-1; ky <=1; ky++)
        for (kx=-1; kx <=1; kx++)
          acc += i[ky][kx];

      *o = acc/9;
    }

Resulting operation (after applying the skeleton)

    void Average(float **i, float **o)
    {
      for (int y=1; y < HEIGHT-1; y++)
        for (int x=1; x < WIDTH-1; x++)
        {
          float acc=0;

          acc += i[y-1][x-1];
          acc += i[y-1][x  ];
          acc += i[y-1][x+1];
          acc += i[y  ][x-1];
          acc += i[y  ][x  ];
          acc += i[y  ][x+1];
          acc += i[y+1][x-1];
          acc += i[y+1][x  ];
          acc += i[y+1][x+1];

          o[y][x] = acc/9;
        }
    }

April 26, 2006 8

Skeleton tasks

• Implement structure
  • Outer loop, border handling, buffering, parallel implementation
  Just write C code
• Transform kernel
  • Stream access, translation to target language
  Term rewriting

How to combine in a single language?
Partial evaluation

April 26, 2006 9

Term rewriting (1)

Input

    *o = acc/9;

Rewrite rule (applied topdown to all nodes)

    replace(`o`, `&o[y][x]`);

Output (after simplifying *&o[y][x] to o[y][x])

    o[y][x] = acc/9;

April 26, 2006 10

Term rewriting (2) Using Stratego*

Input

    acc += i[ky][kx];

Rewrite rule (applied topdown to all nodes)

    RelativeToAbsolute:
      |[ i[~e1][~e2] ]| -> |[ i[y + ~e1][x + ~e2] ]|

Output

    acc += i[y+ky][x+kx];

*E. Visser. Stratego: A language for program transformation based on rewriting strategies, 2001
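For readers unfamiliar with Stratego, the effect of a rule like RelativeToAbsolute can be imitated in C on a tiny expression tree. This is a rough sketch only: the `Node` type, the fixed-size pool and all function names are hypothetical, and real Stratego matches on concrete syntax rather than on hand-built structs.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

typedef enum { ID, ARRAY_INDEX, ADD } Kind;

typedef struct Node {
    Kind kind;
    const char *name;        /* used by ID only */
    struct Node *lhs, *rhs;  /* ARRAY_INDEX: array, index; ADD: operands */
} Node;

/* Fixed node pool: enough for a small example, avoids malloc bookkeeping. */
Node pool[64];
int pool_used = 0;

Node *mk(Kind kind, const char *name, Node *lhs, Node *rhs)
{
    assert(pool_used < 64);
    Node *n = &pool[pool_used++];
    n->kind = kind; n->name = name; n->lhs = lhs; n->rhs = rhs;
    return n;
}

/* Rule: s[e1][e2] -> s[y + e1][x + e2] for the stream named `stream`.
 * Returns the node unchanged when the pattern does not match. */
Node *relative_to_absolute(Node *n, const char *stream)
{
    if (n->kind == ARRAY_INDEX && n->lhs->kind == ARRAY_INDEX &&
        n->lhs->lhs->kind == ID && strcmp(n->lhs->lhs->name, stream) == 0) {
        Node *row = mk(ADD, NULL, mk(ID, "y", NULL, NULL), n->lhs->rhs);
        Node *col = mk(ADD, NULL, mk(ID, "x", NULL, NULL), n->rhs);
        return mk(ARRAY_INDEX, NULL,
                  mk(ARRAY_INDEX, NULL, n->lhs->lhs, row), col);
    }
    return n;
}

/* Top-down traversal: try the rule at this node, then visit the children. */
Node *topdown(Node *n, const char *stream)
{
    n = relative_to_absolute(n, stream);
    if (n->lhs) n->lhs = topdown(n->lhs, stream);
    if (n->rhs) n->rhs = topdown(n->rhs, stream);
    return n;
}

/* Appends the C text of the expression to buf (buf must start as ""). */
void emit(const Node *n, char *buf, size_t cap)
{
    size_t len = strlen(buf);
    switch (n->kind) {
    case ID:
        snprintf(buf + len, cap - len, "%s", n->name);
        break;
    case ARRAY_INDEX:
        emit(n->lhs, buf, cap);
        strncat(buf, "[", cap - strlen(buf) - 1);
        emit(n->rhs, buf, cap);
        strncat(buf, "]", cap - strlen(buf) - 1);
        break;
    case ADD:
        emit(n->lhs, buf, cap);
        strncat(buf, " + ", cap - strlen(buf) - 1);
        emit(n->rhs, buf, cap);
        break;
    }
}
```

The rewritten inner node `i[y + ky]` no longer matches the two-level pattern, so the traversal does not rewrite it a second time.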

April 26, 2006 11

PEPCI (1): Rule composition and code generation in C

Rule definition

    stratego RelativeToAbsolute(code i, code body)
    {
      main = <topdown(RelativeToAbsolute')>(body)
      RelativeToAbsolute':
        |[ ~i[~e1][~e2] ]| -> |[ ~i[y + ~e1][x + ~e2] ]|
    }

Rule composition

    for (a=0; a < arguments; a++)
      if (args[a].type == ARG_STREAM_IN)
        body = RelativeToAbsolute(args[a].id, body);
      else if (args[a].type == ARG_STREAM_OUT)
        body = DerefToArrayIndex(args[a].id, body);

Code generation

    for (y=1; y < HEIGHT-1; y++)
      for (x=1; x < WIDTH-1; x++)
        @body;

April 26, 2006 12

PEPCI (2): Combining rule composition and code generation

• How to distinguish rule composition from code generation?

    for (a=0; a < arguments; a++)
      body = DerefToArrayIndex(args[a].id, body);
    for (x=0; x < stride; x++)
      @body;

Partial evaluation: evaluate only the parts of the program that are known. Output the rest.

• arguments is known, DerefToArrayIndex is known, args[a].id is known, body is known -> evaluate
• stride is unknown -> output

April 26, 2006 13

PEPCI (3): Partial evaluation by interpretation

Input

    double n, x=1;
    int ii, iterations=3;

    scanf("%lf", &n);

    for (ii=0; ii < iterations; ii++)
      x = (x + n/x)/2;

    printf("sqrt(%f) = %f\n", n, x);

Output

    double n;
    double x;
    int ii;
    int iterations;
    x = 1;
    iterations = 3;
    scanf("%lf", &n);
    ii = 0;
    x = (1 + n/1)/2;
    ii = 1;
    x = (x + n/x)/2;
    ii = 2;
    x = (x + n/x)/2;
    ii = 3;
    printf("sqrt(%f) = %f\n", n, x);

Symbol table during interpretation (? = value unknown)

    After                  n    x    ii   iterations
    initialization         ?    1    ?    3
    ii = 0                 ?    1    0    3
    x = (1 + n/1)/2        ?    ?    0    3
    ii = 1                 ?    ?    1    3
    ii = 2                 ?    ?    2    3
    ii = 3                 ?    ?    3    3
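The square-root example can be checked by writing both the original loop and the residual program as C functions. This is a small sketch under the assumption that `n` arrives as a parameter instead of through `scanf`; the function names `sqrt_original` and `sqrt_residual` are hypothetical.

```c
#include <assert.h>

/* Original program: the loop bound is known at specialization time,
 * the input n is not. */
double sqrt_original(double n)
{
    assert(n > 0);
    double x = 1;
    int ii, iterations = 3;
    for (ii = 0; ii < iterations; ii++)
        x = (x + n / x) / 2;
    return x;
}

/* Residual program: the loop has been evaluated away, and the known
 * initial value x = 1 has been substituted into the first step. Later
 * steps keep x symbolic because it now depends on the unknown n. */
double sqrt_residual(double n)
{
    assert(n > 0);
    double x;
    x = (1 + n / 1) / 2;
    x = (x + n / x) / 2;
    x = (x + n / x) / 2;
    return x;
}
```

Both functions perform the same three Newton steps, so they agree for any positive `n`; only the loop control has disappeared from the residual version.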

April 26, 2006 14

Kernelization overheads

• Kernelizing an application impacts performance
  • Mapping
  • Scheduling
  • Buffer management
  • Lost ILP

Merge kernels
  • Extract static kernel sequences
  • Statically schedule at compile-time
  • Replace sequence with merged kernel
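The intended effect of merging can be illustrated with two pixel kernels in C. This is a hand-written sketch, not the output of the authors' tool: `scale` and `threshold` are hypothetical kernels, and the merge is done manually here.

```c
#include <assert.h>
#include <stddef.h>

/* Two hypothetical pixel kernels. */
int scale(int p)     { return p * 2; }
int threshold(int p) { return p > 128; }

/* Kernelized form: each kernel is its own pass, with an intermediate
 * buffer (tmp) between them. */
void run_separate(const int *in, int *tmp, int *out, size_t n)
{
    assert(in && tmp && out);
    for (size_t i = 0; i < n; i++)
        tmp[i] = scale(in[i]);
    for (size_t i = 0; i < n; i++)
        out[i] = threshold(tmp[i]);
}

/* Merged form: the static sequence scale -> threshold becomes a single
 * kernel, so the intermediate buffer and the second pass disappear. */
void run_merged(const int *in, int *out, size_t n)
{
    assert(in && out);
    for (size_t i = 0; i < n; i++)
        out[i] = threshold(scale(in[i]));
}
```

With both kernels in one loop body, the compiler sees a larger block of straight-line code, which gives it more room to recover instruction-level parallelism.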

April 26, 2006 15

Skeleton merging

• Skeletons are completely general functions
  • Cannot be properly analyzed or reasoned about

Restrict skeleton generality by using metaskeletons
  • Skeletons using the same metaskeleton can be merged
  • Merged operation still uses the original metaskeleton, and can be recursively merged

April 26, 2006 16

Example

• Philips Inca+ smart camera
  • 640x480 sensor
  • XeTaL 16MHz, 320-way SIMD
  • TriMedia 180MHz, 5-issue VLIW
• Ball detection
  • Filtering, Segmentation, Hough transform

April 26, 2006 17

Results

    Setup                      Time to process a frame (ms)   Note
    TriMedia baseline          133
    TriMedia optimized         100
    TriMedia kernelized        160                            Buffers, scheduling, ILP
    TriMedia merged            134                            ILP not fully recovered
    TriMedia + XeTaL merged    54

April 26, 2006 18

Conclusion

• Stream programming is a natural fit for running image processing applications on distributed-memory systems
• Algorithmic Skeletons efficiently exploit data parallelism, by allowing the user to select the most restricted skeleton that supports his kernel
  • Extensible (new skeletons)
  • Retargetable (new skeleton implementations)
• PEPCI effectively combines the necessities of efficiently implementing algorithmic skeletons
  • Term rewriting (by embedding Stratego)
  • Partial evaluation (to automatically separate rule composition and code generation)

April 26, 2006 19

Future Work

• Better merging of kernels
  • Merge more efficiently
  • Merge different metaskeletons
• Implement on a more general architecture
• Implement more demanding applications
  • And more involved skeletons

April 26, 2006 20

End

April 26, 2006 21

Partial evaluation (2): Free optimizations

• Loop unrolling
  • If the conditions are known, and the body isn’t
• Function inlining
• Aggressive constant folding
  • Including external “pure” functions

April 26, 2006 22

Kernel translation

• SIMD processors are not programmed in C, but in parallel derivatives
• Skeleton should translate kernel to target language

Extend PEPCI with C derivative syntax
  • Though only minimally interpreted

[Diagram: PEPCI takes a kernel written in C and a skeleton written in the C derivative, and produces the operation]

April 26, 2006 23

Example: local neighborhood operation in XTC

Kernel definition

    NeighbourhoodToPixelOp()
    sobelx(in stream unsigned char i[-1..1][-1..1], out stream int *o)
    {
      int x, y, temp;
      temp = 0;
      for (y=-1; y < 2; y++)
        for (x=-1; x < 2; x=x+2)
          temp = temp + x*i[y][x];
      *o = temp;
    }

Generated XTC code

    static lmem _in2;
    static lmem _in1;

    {
      lmem temp;

      temp = (0)+((-1)*(_in2[-1 .. 0]));
      temp = (temp)+((1)*(_in2[1 .. 2]));
      temp = (temp)+((-1)*(_in1[-1 .. 0]));
      temp = (temp)+((1)*(_in1[1 .. 2]));
      temp = (temp)+((-1)*(larg0[-1 .. 0]));
      temp = (temp)+((1)*(larg0[1 .. 2]));
      larg1 = temp;
    }

    _in2 = _in1;
    _in1 = larg0;

April 26, 2006 24

Stream program

    int main(int argc, char **argv)
    {
      STREAM a, b, c;
      int maxval, dummy, maxc;

      scInit(argc, argv);

      while (1)
      {
        capture(&a);
        interpolate(&a, &a);
        sobelx(&a, &b);
        sobely(&a, &c);
        magnitude(&b, &c, &a);
        direction(&b, &c, &b);
        mask(&b, &a, &a, scint(128));
        hough(&a, &a);
        display(&a);
        imgMax(&a, scint(0), &maxval, scint(0), &dummy, scint(0), &maxc);
        _block(&maxc, &maxval);
        printf("Ball found at %d with strength %d\n", maxc, maxval);
      }

      return scExit();
    }

April 26, 2006 25

Programming with algorithmic skeletons (1)

    PixelToPixelOp()
    binarize(in stream int *i, out stream int *o, in int *threshold)
    {
      *o = (*i > *threshold);
    }

    NeighbourhoodToPixelOp()
    average(in stream int i[-1..1][-1..1], out stream int *o)
    {
      int x, y;
      *o = 0;
      for (y=-1; y < 2; y++)
        for (x=-1; x < 2; x++)
          *o += i[y][x];
      *o /= 9;
    }

April 26, 2006 26

Programming with algorithmic skeletons (2)

    StackOp(in stream int *init)
    propagate(in stream int *i[-1..1][-1..1], out stream int *o)
    {
      int x, y;
      for (y=-1; y < 2; y++)
        for (x=-1; x < 2; x++)
          if (i[y][x] && !*o)
          {
            *o = 1;
            push(y, x);
          }
    }

    AssocPixelReductionOp()
    max(in stream int *i, out int *res)
    {
      if (*i > *res)
        *res = *i;
    }
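One reason to mark a reduction as associative is that the skeleton may then split the stream into chunks and combine partial results. The C sketch below illustrates this for the max kernel. It is an assumption-based example: the function names, the chunking scheme and the use of `INT_MIN` as the initial value are not taken from the slides.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* The kernel body from the slide, as a plain C function. */
void max_kernel(const int *i, int *res)
{
    if (*i > *res)
        *res = *i;
}

/* Sequential iteration strategy: one accumulator over the whole stream. */
int reduce_sequential(const int *in, size_t n)
{
    int res = INT_MIN;  /* assumed identity element for max */
    for (size_t k = 0; k < n; k++)
        max_kernel(&in[k], &res);
    return res;
}

/* Chunked strategy: reduce each chunk independently (these inner loops
 * could run in parallel), then combine the partial results with the
 * same kernel. Valid only because the kernel is associative. */
int reduce_chunked(const int *in, size_t n, size_t chunks)
{
    assert(chunks > 0);
    int res = INT_MIN;
    size_t per = (n + chunks - 1) / chunks;  /* elements per chunk, rounded up */
    for (size_t c = 0; c < chunks; c++) {
        size_t begin = c * per;
        size_t end = begin + per < n ? begin + per : n;
        int partial = INT_MIN;
        for (size_t k = begin; k < end; k++)
            max_kernel(&in[k], &partial);
        max_kernel(&partial, &res);
    }
    return res;
}
```

Both strategies call the unmodified kernel; only the surrounding iteration code differs, which is the part a skeleton is meant to own.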

April 26, 2006 27

Algorithmic Skeletons

[Figure: three panels, each combining a threshold kernel (<=t, >t) with a skeleton ("+") to form an operation ("=")]

April 26, 2006 28

Term rewriting (1) From code to abstract syntax tree

Code

    acc += i[ky][kx];

Abstract syntax tree

    Stat
      AssignPlus
        Id "acc"
        ArrayIndex
          ArrayIndex
            Id "i"
            Id "ky"
          Id "kx"

Term

    Stat(AssignPlus(Id("acc"),ArrayIndex(ArrayIndex(Id("i"),Id("ky")),Id("kx"))))