Exploiting SIMD parallelism with the CGiS compiler framework
Nicolas Fritz, Philipp Lucas, Reinhard Wilhelm
Saarland University
2
Outline
- CGiS
  - Language, compiler and GPU back-end
- SIMD back-end
  - Hardware
  - Challenges
  - Transformations and optimizations
- Experimental results
- Future work
- Conclusion
3
CGiS
- C-like data-parallel programming language
- Goals:
  - Exploitation of parallel processing units in common PCs (GPU, SIMD units)
  - Easy access for inexperienced programmers
  - High abstraction level
- 32-bit scalar and small vector data types
- Two forms of explicit parallelism
  - SPMD (iteration), SIMD (vector types)
4
CGiS Example: YUV to RGB
PROGRAM yuv_to_rgb;
INTERFACE
  extern in float3 YUV<_>;
  extern out float3 RGB<_>;
CODE
  procedure yuv2rgb (in float3 yuv, out float3 rgb)
  {
    rgb = yuv.x + [0, 0.344, 1.77] * yuv.y + [1.403, 0.714, 0] * yuv.z;
  }
CONTROL
  forall (yuv in YUV, rgb in RGB) { yuv2rgb (yuv, rgb); }
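For readers unfamiliar with CGiS, the following is a rough C reading of what the program above expresses, not output of the CGiS compiler. The struct name float3 and the function yuv_to_rgb are assumptions made for this sketch; the constants are copied from the slide. The forall in the CONTROL section corresponds to a loop whose iterations are independent, which is what allows them to be distributed over parallel hardware.

```c
#include <stddef.h>

/* Hypothetical C stand-in for the CGiS type float3. */
typedef struct { float x, y, z; } float3;

/* Kernel body: same arithmetic as the CGiS procedure yuv2rgb,
 * written out per component. */
static void yuv2rgb(const float3 *yuv, float3 *rgb)
{
    rgb->x = yuv->x + 0.0f   * yuv->y + 1.403f * yuv->z;
    rgb->y = yuv->x + 0.344f * yuv->y + 0.714f * yuv->z;
    rgb->z = yuv->x + 1.77f  * yuv->y + 0.0f   * yuv->z;
}

/* forall (yuv in YUV, rgb in RGB): one kernel call per stream element.
 * No iteration reads what another one writes. */
void yuv_to_rgb(const float3 *YUV, float3 *RGB, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        yuv2rgb(&YUV[i], &RGB[i]);
}
```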
5
CGiS Compiler Overview
[Diagram; labels: CGiS Source, CGiS Compiler, CGiS Runtime, Application, PPU Code, Interface]
6
CGiS for GPUs
- nVidia G80:
  - 128 floating-point units
  - Scalar and vector data can be processed
- 2-on-2 mapping of CGiS' parallelism
- Code generation for various GPU generations
  - NV30, NV40, G80, CUDA
- Limited access to hardware features through the driver
7
SIMD Hardware
- Every common PC features SIMD units
  - Intel's SSE and Freescale's AltiVec
- SIMD parallelism not easily accessible for standard compilers
  - Well-known vectorization problems
- Data access
  - Hardware requires 16-byte aligned loads
  - Slow but cached
- Only 4-way SIMD vector parallelism usable
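The two numbers on this slide are related: a 16-byte register holds four 32-bit floats, and an aligned load needs an address that is a multiple of 16. A minimal sketch of both facts in C, with the helper names is_aligned16 and lanes_per_vector invented for illustration and a 4-byte float assumed:

```c
#include <stdint.h>

#define VECTOR_BYTES 16  /* width of one SSE or AltiVec register */

/* Returns 1 if p may be used as the address of an aligned vector load. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % VECTOR_BYTES) == 0;
}

/* Number of 32-bit floats per vector register (4 when float is 4 bytes). */
int lanes_per_vector(void)
{
    return VECTOR_BYTES / (int)sizeof(float);
}
```

In a float array that itself starts on a 16-byte boundary, only every fourth element is a legal start address for an aligned load.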
8
The SIMD Back-end
- Goal is mapping of CGiS parallelisms to SIMD hardware
  - “2-on-1” mapping
- SIMD vectorization problems
  - Avoided by design: data dependency analyses
  - Control flow
    - Divergence in consecutive elements
  - Misalignment and data layout
    - Reordering might be needed
  - Gathering operations are bottlenecks in load-heavy algorithms on multidimensional streams
9
Transformations and Optimizations
- Control flow conversion
  - If/loop conversion
- Loop sectioning for 2D streams
  - Increase cache performance for gather accesses
- Kernel flattening
  - IR transformation that replaces compound variables and operations by scalar ones
  - “2-on-1”
10
Control Flow Conversion
- Full inlining
- If/loop conversion with slightly modified Allen-Kennedy algorithm
  - No guarded assignments
- Masks for select operations are the results of vector compares
- Variables that are written in a branch and live after the control flow join are copied at the branching
- Select operations are inserted at the join
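A minimal sketch of if-conversion in portable C. It emulates a 4-lane vector unit with arrays, so cmp_gt and select4 are hypothetical stand-ins for a hardware vector compare and select, and a 4-byte float is assumed. It is not the modified Allen-Kennedy algorithm itself, only the shape of the code it produces: both branches are computed for all lanes and a mask from a compare picks the result at the join.

```c
#include <stdint.h>
#include <string.h>

#define LANES 4

/* Vector compare: all-ones mask in each lane where a > b, zero otherwise. */
static void cmp_gt(const float a[LANES], const float b[LANES],
                   uint32_t mask[LANES])
{
    for (int i = 0; i < LANES; ++i)
        mask[i] = (a[i] > b[i]) ? 0xFFFFFFFFu : 0u;
}

/* Select: per lane, take t where the mask is set and f elsewhere,
 * using only bitwise operations as a SIMD unit would. */
static void select4(const uint32_t mask[LANES], const float t[LANES],
                    const float f[LANES], float out[LANES])
{
    for (int i = 0; i < LANES; ++i) {
        uint32_t tb, fb, rb;
        memcpy(&tb, &t[i], sizeof tb);
        memcpy(&fb, &f[i], sizeof fb);
        rb = (tb & mask[i]) | (fb & ~mask[i]);
        memcpy(&out[i], &rb, sizeof rb);
    }
}

/* Scalar original:  if (x > limit) y = x - limit; else y = x * 2;
 * If-converted form: no branch, both sides evaluated, select at the join. */
void kernel_if_converted(const float x[LANES], float limit, float y[LANES])
{
    float lim[LANES], y_then[LANES], y_else[LANES];
    uint32_t mask[LANES];

    for (int i = 0; i < LANES; ++i) lim[i] = limit;
    cmp_gt(x, lim, mask);                                     /* condition */
    for (int i = 0; i < LANES; ++i) y_then[i] = x[i] - limit; /* then side */
    for (int i = 0; i < LANES; ++i) y_else[i] = x[i] * 2.0f;  /* else side */
    select4(mask, y_then, y_else, y);                         /* join      */
}
```

The cost of this scheme is that every lane pays for both branches, which is why divergence between consecutive elements matters for performance.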
11
Loop Sectioning
- Adaptation of iteration sequence to better exploit cached data
- Only interesting for 2D streams
- Iterations subdivided in stripes
  - Width depends on access pattern, cache size and local variables
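The idea can be sketched in C on a 3x3 neighbourhood sum over a row-major 2D stream. The function names and the box filter are illustrative choices, not the kernel from the talk, and the stripe width is passed in by hand, whereas the slide says the compiler derives it from access pattern, cache size and local variables.

```c
#include <stddef.h>

/* 3x3 box sum at (x, y); neighbours outside the image are skipped. */
static float box3(const float *in, int w, int h, int x, int y)
{
    float s = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int xx = x + dx, yy = y + dy;
            if (xx >= 0 && xx < w && yy >= 0 && yy < h)
                s += in[(size_t)yy * w + xx];
        }
    return s;
}

/* Plain iteration: complete rows, one after the other. */
void box_plain(const float *in, float *out, int w, int h)
{
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x)
            out[(size_t)y * w + x] = box3(in, w, h, x, y);
}

/* Sectioned iteration: the stream is cut into vertical stripes and each
 * stripe is walked top to bottom, so the rows gathered for one output row
 * are more likely still cached when the next output row needs them. */
void box_sectioned(const float *in, float *out, int w, int h, int stripe)
{
    for (int x0 = 0; x0 < w; x0 += stripe) {
        int x1 = (x0 + stripe < w) ? x0 + stripe : w;
        for (int y = 0; y < h; ++y)
            for (int x = x0; x < x1; ++x)
                out[(size_t)y * w + x] = box3(in, w, h, x, y);
    }
}
```

Both functions compute the same result; only the order in which elements are visited differs.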
12
Kernel Flattening
- SIMD vectorization for yuv2rgb not applicable
- Thus “flatten” the procedure or kernel:
  - Code transformation on the IR
  - All variables and all statements are split into scalar ones
  - Those can be subjected to SIMD vectorization
procedure yuv2rgb (in float3 yuv, out float3 rgb) {
  rgb = yuv.x + [0, 0.344, 1.77] * yuv.y + [1.403, 0.714, 0] * yuv.z;
}
13
Kernel Flattening Example
procedure yuv2rgb_f (in float yuv_x, in float yuv_y, in float yuv_z,
                     out float rgb_x, out float rgb_y, out float rgb_z)
{
  float cy = 0.344, cz = 1.77, dx = 1.403, dy = 0.714;
  rgb_x = yuv_x + dx * yuv_z;
  rgb_y = yuv_x + cy * yuv_y + dy * yuv_z;
  rgb_z = yuv_x + cz * yuv_y;
}
- Procedure yuv2rgb_f now features data types suitable to be SIMD-parallelized
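Once every variable is a scalar float, four independent kernel instances can share one vector register per variable. A sketch in portable C, where each array of LANES floats stands in for one SIMD register and each inner loop for one vector instruction; the name yuv2rgb_f4 is an assumption made for this example, and the constants are those of the slide.

```c
#define LANES 4

/* Four instances of the flattened kernel yuv2rgb_f executed in lockstep.
 * Lane i of every array belongs to stream element i of the current group. */
void yuv2rgb_f4(const float yuv_x[LANES], const float yuv_y[LANES],
                const float yuv_z[LANES],
                float rgb_x[LANES], float rgb_y[LANES], float rgb_z[LANES])
{
    const float cy = 0.344f, cz = 1.77f, dx = 1.403f, dy = 0.714f;

    for (int i = 0; i < LANES; ++i)
        rgb_x[i] = yuv_x[i] + dx * yuv_z[i];
    for (int i = 0; i < LANES; ++i)
        rgb_y[i] = yuv_x[i] + cy * yuv_y[i] + dy * yuv_z[i];
    for (int i = 0; i < LANES; ++i)
        rgb_z[i] = yuv_x[i] + cz * yuv_y[i];
}
```

Note that the inputs arrive as separate x, y and z vectors, which is not how a stream of float3 elements is laid out in memory.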
14
Kernel Flattening
- But: data layout doesn’t fit
  - No stride-one access for single components
- Reordering of data required
  - Locally via permutes or shuffles
  - Globally via memory copy
15
Kernel Flattening Data Reordering
16
Global vs. Local Reordering
- Global reordering
  - Reusable for further iterations
  - Simple, but expensive in-memory copy
  - Destroys locality for gather accesses
- Local reordering
  - Original stream data untouched
  - Insertion of possibly many relatively cheap in-register permutation operations
  - Locality for gathering preserved
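The two strategies can be contrasted in a small C sketch over an interleaved stream (x0 y0 z0 x1 y1 z1 ...). Function names are invented for illustration. In reorder_local, plain scalar copies stand in for the permute or shuffle instructions a real back-end would emit, so the sketch shows the data movement, not its cost.

```c
#include <stddef.h>

#define LANES 4

/* Global reordering: copy the whole interleaved stream into three
 * component arrays once. Later passes can load xs, ys, zs with stride one. */
void reorder_global(const float *aos, float *xs, float *ys, float *zs,
                    size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        xs[i] = aos[3 * i + 0];
        ys[i] = aos[3 * i + 1];
        zs[i] = aos[3 * i + 2];
    }
}

/* Local reordering: the stream stays as it is. Only the four elements
 * about to be processed are split into x, y, z "registers". */
void reorder_local(const float *aos, size_t first,
                   float x[LANES], float y[LANES], float z[LANES])
{
    for (int l = 0; l < LANES; ++l) {
        const float *e = aos + 3 * (first + (size_t)l);
        x[l] = e[0];
        y[l] = e[1];
        z[l] = e[2];
    }
}
```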
17
Experimental Results
- Tested on Intel Core 2 Duo 1.83 GHz and PowerPC G5 1.8 GHz
  - Compiled with intrinsics on gcc 4.0.1
- Examples
  - Image processing: Gaussian blur
    - Loop sectioning
  - Computation of Mandelbrot set
    - Control flow conversion
  - Block cipher encryption: RC5 encryption
    - Kernel flattening
18
Experimental Results
19
Future Work
- Replace intrinsics by inline assembly
  - Improvement of conditionals
  - Better control over register allocation
- Improvement of register re-utilization for AltiVec
  - Raises with inline-assembly
- Cell back-end
  - SIMD instruction set close to AltiVec
  - Work list algorithm to distribute stream parts to single PEs
- More applications
20
Conclusion
- CGiS abstracts GPUs as well as SIMD units
- SIMD back-end of the CGiS compiler produces efficient code
- Other transformations and optimizations needed than for the GPU back-end
  - Full control flow conversion needed
  - Gather accesses gain speed with loop sectioning
  - Kernel flattening enables better exploitation