April 28, 2006
Algorithmic Skeletons for Stream Programming in Embedded Heterogeneous Parallel Image Processing Applications
IPDPS 2006
Wouter Caarls, Pieter Jonker, Henk Corporaal
Quantitative Imaging Group, department of Imaging Science and Technology
April 26, 2006 2
Overview
• Stream programming
• Writing stream kernels
• Algorithmic skeletons
• Writing algorithmic skeletons
• Skeleton merging
• Results
• Conclusion & Future work
Stream Programming
• FIFO-connected kernels processing series of data elements
  • Well suited to signal processing applications
• Explicit communication and task decomposition
  • Ideal for distributed-memory systems
• Each data element processed (mostly) independently
  • Ideal for data-parallel systems such as SIMDs
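The FIFO-connected kernel model can be sketched in plain C. This is only an illustration under stated assumptions: the `Fifo` type and the functions `fifo_push`, `fifo_pop`, `scale_kernel`, `offset_kernel`, and `run_pipeline` are hypothetical names, not part of the framework presented here.

```c
#include <assert.h>
#include <stddef.h>

#define FIFO_CAPACITY 16

/* Hypothetical bounded FIFO that connects two kernels. */
typedef struct {
    int data[FIFO_CAPACITY];
    size_t head, count;
} Fifo;

/* Appends v; returns 0 if the FIFO is full. */
int fifo_push(Fifo *f, int v)
{
    if (f->count == FIFO_CAPACITY) return 0;
    f->data[(f->head + f->count) % FIFO_CAPACITY] = v;
    f->count++;
    return 1;
}

/* Removes the oldest element into *v; returns 0 if the FIFO is empty. */
int fifo_pop(Fifo *f, int *v)
{
    if (f->count == 0) return 0;
    *v = f->data[f->head];
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count--;
    return 1;
}

/* Kernel 1: doubles every element; elements are processed independently. */
void scale_kernel(Fifo *in, Fifo *out)
{
    int v;
    while (fifo_pop(in, &v)) fifo_push(out, 2 * v);
}

/* Kernel 2: adds one to every element. */
void offset_kernel(Fifo *in, Fifo *out)
{
    int v;
    while (fifo_pop(in, &v)) fifo_push(out, v + 1);
}

/* Runs source -> scale -> offset -> sink; all communication is explicit. */
void run_pipeline(const int *src, int *dst, size_t n)
{
    Fifo a = {{0}, 0, 0}, b = {{0}, 0, 0}, c = {{0}, 0, 0};
    size_t i;
    assert(n <= FIFO_CAPACITY);
    for (i = 0; i < n; i++) fifo_push(&a, src[i]);
    scale_kernel(&a, &b);
    offset_kernel(&b, &c);
    for (i = 0; i < n; i++) fifo_pop(&c, &dst[i]);
}
```

Because the kernels touch only their FIFOs, each one could in principle be placed on a different processor with its own memory.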
Kernel Examples from Image Processing
• Pixel processing (color space conversion)
  • Perfect match
• Local neighborhood processing (convolution)
  • Requires 2D access
• Recursive neighborhood processing (distance transform)
  • Regular data dependencies
• Stack processing (region growing)
  • Irregular data dependencies
Increasing generality & architectural requirements, from top to bottom
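To illustrate why stack processing has irregular data dependencies, here is a minimal region-growing sketch in C. It is an assumption-laden example, not code from the presentation: `grow_region`, the 5x5 image size, and 4-connectivity are all chosen for illustration. Which pixels are visited, and in what order, depends entirely on the image contents.

```c
#include <assert.h>

#define W 5
#define H 5

/* Grows a 4-connected region over nonzero pixels of img, starting at (sy, sx).
 * Marks visited pixels in label and returns the region size. */
int grow_region(int img[H][W], int label[H][W], int sy, int sx)
{
    int stack[H * W][2];   /* each pixel is pushed at most once */
    int top = 0, size = 0;

    assert(sy >= 0 && sy < H && sx >= 0 && sx < W);
    if (!img[sy][sx] || label[sy][sx]) return 0;

    label[sy][sx] = 1;
    stack[top][0] = sy; stack[top][1] = sx; top++;

    while (top > 0) {
        static const int dy[4] = {-1, 1, 0, 0};
        static const int dx[4] = {0, 0, -1, 1};
        int y, x, k;

        top--;
        y = stack[top][0]; x = stack[top][1];
        size++;

        for (k = 0; k < 4; k++) {
            int ny = y + dy[k], nx = x + dx[k];
            if (ny < 0 || ny >= H || nx < 0 || nx >= W) continue;
            if (img[ny][nx] && !label[ny][nx]) {
                label[ny][nx] = 1;   /* mark before pushing to avoid duplicates */
                stack[top][0] = ny; stack[top][1] = nx; top++;
            }
        }
    }
    return size;
}
```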
Writing Kernels
• The language for writing kernels should be restricted
  • To allow efficient compilation to constrained architectures
• But also general
  • So many different algorithms can be specified
Solution: a different language for each type of kernel
  • User selects the most restricted language that supports his kernel
• Retargetability
• Efficiency
• Ease-of-use
Algorithmic skeletons* as kernel languages
• An algorithmic skeleton captures a pattern of computation
• Is conceptually a higher-order function, repeatedly calling a kernel function with certain parameters
  • Iteration strategy may be parallel
  • Kernel parameters restrict dependencies
• Provides the environment in which the kernel runs, and can be seen as a very restricted DSL
*M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computation, 1989
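The "higher-order function" view can be sketched in C with a function pointer. This is a minimal sketch under assumptions: `NeighborhoodKernel`, `neighborhood_to_pixel`, and `average_kernel` are hypothetical names, and the fixed 4x4 image size is chosen only to keep the example small. The actual skeletons transform the kernel source rather than calling it through a pointer.

```c
#include <assert.h>
#include <stddef.h>

#define WIDTH  4
#define HEIGHT 4

/* A kernel sees only a 3x3 window and writes one output pixel,
 * so it cannot introduce dependencies outside that window. */
typedef void (*NeighborhoodKernel)(float window[3][3], float *out);

/* Hypothetical skeleton: owns the iteration strategy and border handling,
 * and calls the kernel once per interior pixel. */
void neighborhood_to_pixel(NeighborhoodKernel kernel,
                           float in[HEIGHT][WIDTH],
                           float out[HEIGHT][WIDTH])
{
    int y, x, ky, kx;
    assert(kernel != NULL);
    for (y = 1; y < HEIGHT - 1; y++)
        for (x = 1; x < WIDTH - 1; x++) {
            float window[3][3];
            for (ky = -1; ky <= 1; ky++)
                for (kx = -1; kx <= 1; kx++)
                    window[ky + 1][kx + 1] = in[y + ky][x + kx];
            kernel(window, &out[y][x]);
        }
}

/* Example kernel: 3x3 average. */
void average_kernel(float window[3][3], float *out)
{
    float acc = 0;
    int ky, kx;
    for (ky = 0; ky < 3; ky++)
        for (kx = 0; kx < 3; kx++)
            acc += window[ky][kx];
    *out = acc / 9;
}
```

Since the per-pixel calls are independent, the two outer loops could be replaced by a parallel iteration strategy without changing the kernel.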
Sequential neighborhood skeleton
Kernel definition:

  NeighborhoodToPixelOp()
  Average(in stream float i[-1..1][-1..1], out stream float *o)
  {
    int ky, kx;
    float acc=0;

    for (ky=-1; ky <=1; ky++)
      for (kx=-1; kx <=1; kx++)
        acc += i[ky][kx];

    *o = acc/9;
  }

Resulting operation (after applying the skeleton):

  void Average(float **i, float **o)
  {
    for (int y=1; y < HEIGHT-1; y++)
      for (int x=1; x < WIDTH-1; x++)
      {
        float acc=0;

        acc += i[y-1][x-1];
        acc += i[y-1][x  ];
        acc += i[y-1][x+1];
        acc += i[y  ][x-1];
        acc += i[y  ][x  ];
        acc += i[y  ][x+1];
        acc += i[y+1][x-1];
        acc += i[y+1][x  ];
        acc += i[y+1][x+1];

        o[y][x] = acc/9;
      }
  }
Skeleton tasks
• Implement structure
  • Outer loop, border handling, buffering, parallel implementation
  → Just write C code
• Transform kernel
  • Stream access, translation to target language
  → Term rewriting
How to combine in a single language?
  → Partial evaluation
Term rewriting (1)
Input:
  *o = acc/9;
Rewrite rule (applied topdown to all nodes):
  replace(`o`, `&o[y][x]`);
Output:
  o[y][x] = acc/9;
Term rewriting (2)
Using Stratego*
Input:
  acc += i[ky][kx];
Rewrite rule (applied topdown to all nodes):
  RelativeToAbsolute:
    |[ i[~e1][~e2] ]| -> |[ i[y + ~e1][x + ~e2] ]|
Output:
  acc += i[y+ky][x+kx];
*E. Visser. Stratego: A language for program transformation based on rewriting strategies, 2001
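The effect of a top-down rewrite such as RelativeToAbsolute can be mimicked on a tiny hand-built expression tree. The sketch below is plain C, not Stratego, and everything in it (`Node`, `node`, `relative_to_absolute`, `topdown`, `render`) is a hypothetical stand-in chosen for illustration. It only handles variables, additions, and array subscripts.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef enum { VAR, ADD, INDEX } Kind;

typedef struct Node {
    Kind kind;
    const char *name;        /* VAR only */
    struct Node *lhs, *rhs;  /* ADD: operands; INDEX: array, subscript */
} Node;

/* Allocates a node; nodes are never freed, for brevity. */
Node *node(Kind kind, const char *name, Node *lhs, Node *rhs)
{
    Node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->kind = kind; n->name = name; n->lhs = lhs; n->rhs = rhs;
    return n;
}

/* Rule: stream[e1][e2] -> stream[y + e1][x + e2].
 * Rewrites n in place and returns 1 on a match, 0 otherwise. */
int relative_to_absolute(Node *n, const char *stream)
{
    Node *inner;
    if (n->kind != INDEX || n->lhs->kind != INDEX) return 0;
    inner = n->lhs;
    if (inner->lhs->kind != VAR || strcmp(inner->lhs->name, stream) != 0)
        return 0;
    inner->rhs = node(ADD, NULL, node(VAR, "y", NULL, NULL), inner->rhs);
    n->rhs = node(ADD, NULL, node(VAR, "x", NULL, NULL), n->rhs);
    return 1;
}

/* Top-down traversal: try the rule at this node, then visit the children. */
void topdown(Node *n, const char *stream)
{
    if (n == NULL) return;
    relative_to_absolute(n, stream);
    topdown(n->lhs, stream);
    topdown(n->rhs, stream);
}

/* Appends the expression text to buf; the caller provides enough space. */
void render(const Node *n, char *buf)
{
    switch (n->kind) {
    case VAR:
        strcat(buf, n->name);
        break;
    case ADD:
        render(n->lhs, buf); strcat(buf, " + "); render(n->rhs, buf);
        break;
    case INDEX:
        render(n->lhs, buf); strcat(buf, "["); render(n->rhs, buf);
        strcat(buf, "]");
        break;
    }
}
```

Applied to the tree for `i[ky][kx]` with stream name `i`, the traversal yields a tree that renders as `i[y + ky][x + kx]`. The rewritten node does not match again on the way down, because its child `i[y + ky]` has only a single subscript.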
PEPCI (1)
Rule composition and code generation in C

Rule definition:

  stratego RelativeToAbsolute(code i, code body)
  {
    main = <topdown(RelativeToAbsolute')>(body)
    RelativeToAbsolute':
      |[ ~i[~e1][~e2] ]| -> |[ ~i[y + ~e1][x + ~e2] ]|
  }

Rule composition:

  for (a=0; a < arguments; a++)
    if (args[a].type == ARG_STREAM_IN)
      body = RelativeToAbsolute(args[a].id, body);
    else if (args[a].type == ARG_STREAM_OUT)
      body = DerefToArrayIndex(args[a].id, body);

Code generation:

  for (y=1; y < HEIGHT-1; y++)
    for (x=1; x < WIDTH-1; x++)
      @body;
PEPCI (2)
Combining rule composition and code generation
• How to distinguish rule composition from code generation?

  for (a=0; a < arguments; a++)
    body = DerefToArrayIndex(args[a].id, body);
  for (x=0; x < stride; x++)
    @body;

Partial evaluation: evaluate only the parts of the program that are known. Output the rest.
• arguments is known, DerefToArrayIndex is known, args[a].id is known, body is known -> evaluate
• stride is unknown -> output
PEPCI (3)
Partial evaluation by interpretation

Input:

  double n, x=1;
  int ii, iterations=3;

  scanf("%lf", &n);

  for (ii=0; ii < iterations; ii++)
    x = (x + n/x)/2;

  printf("sqrt(%f) = %f\n", n, x);

Output:

  double n;
  double x;
  int ii;
  int iterations;
  x = 1;
  iterations = 3;
  scanf("%lf", &n);
  ii = 0;
  x = (1 + n/1)/2;
  ii = 1;
  x = (x + n/x)/2;
  ii = 2;
  x = (x + n/x)/2;
  ii = 3;
  printf("sqrt(%f) = %f\n", n, x);

Symbol table (successive states during interpretation; ? = unknown, blank = not yet declared):

  double n   double x   int ii   int iterations
  ?          1
  ?          1          ?        3
  ?          1          0        3
  ?          ?          0        3
  ?          ?          1        3
  ?          ?          2        3
  ?          ?          3        3
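The input and output programs of this slide can be checked against each other by wrapping both in functions. This is a small sketch: the function names `newton_sqrt` and `newton_sqrt_residual` are made up here, and `scanf`/`printf` are replaced by a parameter and a return value so the two versions can be compared directly.

```c
#include <assert.h>

/* Original program: iterations is known at specialization time, n is not. */
double newton_sqrt(double n)
{
    double x = 1;
    int ii, iterations = 3;
    assert(n > 0);
    for (ii = 0; ii < iterations; ii++)
        x = (x + n / x) / 2;
    return x;
}

/* Residual program: the loop control was evaluated away, and the first
 * iteration uses the known value x = 1. Only code depending on n remains. */
double newton_sqrt_residual(double n)
{
    double x;
    assert(n > 0);
    x = (1 + n / 1) / 2;
    x = (x + n / x) / 2;
    x = (x + n / x) / 2;
    return x;
}
```

After the first assignment `x` depends on the unknown `n`, so it becomes unknown itself. From then on the loop body is emitted rather than evaluated, while the loop counter continues to be evaluated.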
Kernelization overheads
• Kernelizing an application impacts performance
  • Mapping
  • Scheduling
  • Buffer management
  • Lost ILP
Merge kernels
  • Extract static kernel sequences
  • Statically schedule at compile-time
  • Replace sequence with merged kernel
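The idea of replacing a static kernel sequence with one merged kernel can be illustrated with two pixel kernels. This is a hand-written sketch, not output of the merging tool: `brighten`, `binarize_pixel`, `run_kernelized`, and `run_merged` are hypothetical names, and the threshold and offset values are arbitrary.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PIXELS 8

/* Two independent pixel kernels. */
int brighten(int p)       { return p + 10; }
int binarize_pixel(int p) { return p > 128; }

/* Kernelized: one pass per kernel, with an intermediate buffer between them. */
void run_kernelized(const int *in, int *out, size_t n)
{
    int tmp[MAX_PIXELS];
    size_t i;
    assert(n <= MAX_PIXELS);
    for (i = 0; i < n; i++) tmp[i] = brighten(in[i]);
    for (i = 0; i < n; i++) out[i] = binarize_pixel(tmp[i]);
}

/* Merged: a single pass, no intermediate buffer, and both operations sit in
 * one loop body where a compiler can schedule them together. */
void run_merged(const int *in, int *out, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) out[i] = binarize_pixel(brighten(in[i]));
}
```

Both versions compute the same result; the merged one avoids the buffer traffic and the run-time scheduling of two separate kernels.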
Skeleton merging
• Skeletons are completely general functions
  • Cannot be properly analyzed or reasoned about
Restrict skeleton generality by using metaskeletons
  • Skeletons using the same metaskeleton can be merged
  • Merged operation still uses the original metaskeleton, and can be recursively merged
Example
• Philips Inca+ smart camera
  • 640x480 sensor
  • XeTaL 16 MHz, 320-way SIMD
  • TriMedia 180 MHz, 5-issue VLIW
• Ball detection
  • Filtering, Segmentation, Hough transform
Results
Setup                      Time to process a frame (ms)
TriMedia baseline          133
TriMedia optimized         100
TriMedia kernelized        160
TriMedia merged            134
TriMedia + XeTaL merged     54

Slide callouts: "Buffers, scheduling, ILP"; "ILP not fully recovered"
Conclusion
• Stream programming is a natural fit for running image processing applications on distributed-memory systems
• Algorithmic skeletons efficiently exploit data parallelism, by allowing the user to select the most restricted skeleton that supports his kernel
  • Extensible (new skeletons)
  • Retargetable (new skeleton implementations)
• PEPCI effectively combines what is needed to implement algorithmic skeletons efficiently
  • Term rewriting (by embedding Stratego)
  • Partial evaluation (to automatically separate rule composition and code generation)
Future Work
• Better merging of kernels
  • Merge more efficiently
  • Merge different metaskeletons
• Implement on a more general architecture
• Implement more demanding applications
  • And more involved skeletons
Partial evaluation (2)
Free optimizations
• Loop unrolling
  • If the conditions are known, and the body isn’t
• Function inlining
• Aggressive constant folding
  • Including external “pure” functions
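As an illustration of these "free" optimizations, the sketch below shows a loop whose bounds and weights are known while the pixel values are not, next to the kind of residual code a partial evaluator could emit. The residual is written by hand for this example; `weighted_sum` and `weighted_sum_residual` are hypothetical names and the weights are arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Before: loop condition and weights are known, in[] is not. */
int weighted_sum(const int in[3])
{
    const int weights[3] = { 1, 2, 1 };
    int k, acc = 0;
    assert(in != NULL);
    for (k = 0; k < 3; k++)
        acc += weights[k] * in[k];
    return acc;
}

/* After (hand-written residual): the loop is unrolled because its condition
 * is known, and the weight lookups are folded into constants. */
int weighted_sum_residual(const int in[3])
{
    int acc;
    assert(in != NULL);
    acc = 0 + 1 * in[0];
    acc = acc + 2 * in[1];
    acc = acc + 1 * in[2];
    return acc;
}
```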
Kernel translation
• SIMD processors are not programmed in C, but in parallel derivatives
• Skeleton should translate kernel to target language
Extend PEPCI with C derivative syntax
  • Though only minimally interpreted
[Figure: diagram with the labels kernel, skeleton, PEPCI, C operation, derivative C]
Example: local neighborhood operation in XTC
Kernel definition:

  NeighbourhoodToPixelOp()
  sobelx(in stream unsigned char i[-1..1][-1..1], out stream int *o)
  {
    int x, y, temp;
    temp = 0;
    for (y=-1; y < 2; y++)
      for (x=-1; x < 2; x=x+2)
        temp = temp + x*i[y][x];
    *o = temp;
  }

Resulting XTC code:

  static lmem _in2;
  static lmem _in1;

  {
    lmem temp;

    temp = (0)+((-1)*(_in2[-1 .. 0]));
    temp = (temp)+((1)*(_in2[1 .. 2]));
    temp = (temp)+((-1)*(_in1[-1 .. 0]));
    temp = (temp)+((1)*(_in1[1 .. 2]));
    temp = (temp)+((-1)*(larg0[-1 .. 0]));
    temp = (temp)+((1)*(larg0[1 .. 2]));
    larg1 = temp;
  }

  _in2 = _in1;
  _in1 = larg0;
Stream program

  int main(int argc, char **argv)
  {
    STREAM a, b, c;
    int maxval, dummy, maxc;

    scInit(argc, argv);

    while (1)
    {
      capture(&a);
      interpolate(&a, &a);
      sobelx(&a, &b);
      sobely(&a, &c);
      magnitude(&b, &c, &a);
      direction(&b, &c, &b);
      mask(&b, &a, &a, scint(128));
      hough(&a, &a);
      display(&a);
      imgMax(&a, scint(0), &maxval, scint(0), &dummy, scint(0), &maxc);
      _block(&maxc, &maxval);
      printf("Ball found at %d with strength %d\n", maxc, maxval);
    }

    return scExit();
  }
Programming with algorithmic skeletons (1)
  PixelToPixelOp()
  binarize(in stream int *i, out stream int *o, in int *threshold)
  {
    *o = (*i > *threshold);
  }

  NeighbourhoodToPixelOp()
  average(in stream int i[-1..1][-1..1], out stream int *o)
  {
    int x, y;
    *o = 0;
    for (y=-1; y < 2; y++)
      for (x=-1; x < 2; x++)
        *o += i[y][x];
    *o /= 9;
  }
Programming with algorithmic skeletons (2)
  StackOp(in stream int *init)
  propagate(in stream int *i[-1..1][-1..1], out stream int *o)
  {
    int x, y;
    for (y=-1; y < 2; y++)
      for (x=-1; x < 2; x++)
        if (i[y][x] && !*o)
        {
          *o = 1;
          push(y, x);
        }
  }

  AssocPixelReductionOp()
  max(in stream int *i, out int *res)
  {
    if (*i > *res)
      *res = *i;
  }
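The reason an associative reduction can be parallelized is sketched below in plain C. This is an assumption-based illustration: `max_kernel` mirrors the kernel body above, while `reduce_sequential`, `reduce_chunked`, and the `INT_MIN` initial value are hypothetical choices, since the slide does not show how `res` is initialized or how the skeleton iterates.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Kernel body: fold one element into the running result. */
void max_kernel(const int *i, int *res)
{
    if (*i > *res) *res = *i;
}

/* Sequential reduction over the whole stream. */
int reduce_sequential(const int *data, size_t n)
{
    int res = INT_MIN;   /* assumed identity element for max */
    size_t k;
    for (k = 0; k < n; k++) max_kernel(&data[k], &res);
    return res;
}

/* Chunked reduction: each chunk could run on its own processing element.
 * Associativity allows the partial results to be combined with the same kernel. */
int reduce_chunked(const int *data, size_t n, size_t chunks)
{
    int res = INT_MIN;
    size_t c;
    assert(chunks > 0);
    for (c = 0; c < chunks; c++) {
        size_t begin = c * n / chunks;
        size_t end = (c + 1) * n / chunks;
        int partial = reduce_sequential(data + begin, end - begin);
        max_kernel(&partial, &res);
    }
    return res;
}
```

The chunks are processed one after another here; the point is only that splitting the stream does not change the result.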