
Page 1: An Implementation of a FIR Filter on a GPU

Alexey Smirnov and Tzi-cker Chiueh

ECSL Research Seminar, 9/13/05

Page 2: Outline

- Introduction
- GPU Computing Overview
- Related Work
- FIR Filter Definition
- FIR Filter Implementation on GPU
- Performance Evaluation
- Conclusion

Page 3: Introduction

- Numerical algorithms often perform repeated computations on vectors of elements.
- Parallel computation improves performance.
- x86 offers SIMD instruction sets: MMX, SSE, SSE2, SSE3 (see the SSE sketch below).
- Video cards are now programmable.
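As a concrete illustration of the SIMD model these extensions provide, here is a minimal SSE sketch in C; the function name and the alignment/size assumptions are ours, not from the slides:

    #include <xmmintrin.h> /* SSE intrinsics */

    /* Add two float vectors four lanes at a time.
       Assumes n is a multiple of 4 and the arrays are 16-byte aligned. */
    void vec_add_sse(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(&a[i]); /* load 4 floats */
            __m128 vb = _mm_load_ps(&b[i]);
            _mm_store_ps(&out[i], _mm_add_ps(va, vb)); /* 4 additions at once */
        }
    }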

Page 4: Computation and Bandwidth Rates

- Video cards have higher GFLOPS rates and memory bandwidth than CPUs.
- However, data copying between main memory and video memory can reduce performance.

Page 5: GPU Computing Background

Rendering pipeline:

- The user program defines vertex and texture coordinates (sketch below).
- The vertex processor converts vertex attributes from the world coordinate system into the screen coordinate system.
- The fragment processor computes the color of each output pixel using textures and color.
- Interpolation defines coordinates and color for each pixel.
- Vertex and fragment processors are programmable, for example in the C-like language Cg.
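A minimal C/OpenGL sketch of the first step, submitting a screen-aligned quad with texture coordinates. The coordinates are illustrative; with rectangle textures (used below) texture coordinates are unnormalized:

    #include <GL/gl.h>

    /* Submit one screen-aligned quad; the fragment program then runs
       once per covered pixel. */
    void draw_quad(float w, float h)
    {
        glBegin(GL_QUADS);
        glTexCoord2f(0.0f, 0.0f); glVertex2f(0.0f, 0.0f);
        glTexCoord2f(w, 0.0f);    glVertex2f(w, 0.0f);
        glTexCoord2f(w, h);       glVertex2f(w, h);
        glTexCoord2f(0.0f, h);    glVertex2f(0.0f, h);
        glEnd();
    }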

Page 6: Rendering APIs

- OpenGL (Linux, Windows, MacOS) and DirectX (Windows).
- OpenGL extensions make advanced features of a video card available:
  - NV_float_buffer supports floating-point textures (sketch below).
  - ARB_render_texture allows rendering to a texture instead of the screen.
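A sketch of creating a floating-point texture via NV_float_buffer. The helper name is ours; NV_float_buffer formats require the NV_texture_rectangle target, and error handling is omitted:

    #include <GL/gl.h>
    #include <GL/glext.h> /* GL_TEXTURE_RECTANGLE_NV, GL_FLOAT_RGBA32_NV */

    /* Upload a w x h block of RGBA float data as a floating-point texture. */
    GLuint make_float_texture(int w, int h, const float *data)
    {
        GLuint tex;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_RECTANGLE_NV, tex);
        glTexParameteri(GL_TEXTURE_RECTANGLE_NV, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_RECTANGLE_NV, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
        glTexImage2D(GL_TEXTURE_RECTANGLE_NV, 0, GL_FLOAT_RGBA32_NV,
                     w, h, 0, GL_RGBA, GL_FLOAT, data);
        return tex;
    }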

Page 7: GPU Program Architecture

1. Create floating-point textures that contain the input data and load them into video memory.
2. Load the fragment program and enable multi-texturing.
3. Define vertex and texture coordinates.
4. Draw the figure to an off-screen buffer.
5. If the results were rendered to an off-screen buffer, copy the image to a texture using glCopyTexSubImage2D().
6. Go to step 3 if more iterations are needed.
7. Use glGetTexImage() to copy the data from video memory to main memory.

A host-side skeleton of steps 3-7 follows.
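A C/OpenGL skeleton of the loop; it reuses the draw_quad() helper and rectangle-texture conventions from the earlier sketches, and the names are ours:

    #include <GL/gl.h>
    #include <GL/glext.h>

    /* Steps 3-7: render each pass, copy the off-screen result back into
       a texture for the next pass, then read the result to main memory. */
    void run_passes(GLuint result_tex, int w, int h, int iterations, float *out)
    {
        for (int i = 0; i < iterations; i++) {
            draw_quad((float)w, (float)h);               /* steps 3-4 */
            glBindTexture(GL_TEXTURE_RECTANGLE_NV, result_tex);
            glCopyTexSubImage2D(GL_TEXTURE_RECTANGLE_NV, /* step 5 */
                                0, 0, 0, 0, 0, w, h);
        }
        glBindTexture(GL_TEXTURE_RECTANGLE_NV, result_tex);
        glGetTexImage(GL_TEXTURE_RECTANGLE_NV, 0,        /* step 7 */
                      GL_RGBA, GL_FLOAT, out);
    }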

Page 8: Input Data Representation

- Matrices map to textures naturally.
- Four elements per pixel (R, G, B, A).
- Vectors are wrapped into matrices (see the packing sketch below).
- Textures have maximum dimensions.
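One way the wrapping might look in C; the zero-padding and row-major layout are our assumptions:

    #include <string.h>

    /* Pack a length-n vector into a w x h RGBA texture image
       (4 floats per texel), zero-padding the tail. Requires n <= 4*w*h. */
    void pack_vector(const float *v, int n, float *texels, int w, int h)
    {
        memset(texels, 0, (size_t)w * h * 4 * sizeof(float));
        memcpy(texels, v, (size_t)n * sizeof(float));
    }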

Page 9: Related Work

- Four papers describe matrix multiplication; others cover linear algebra operations, array sorting, and the FFT.
- Earlier papers concluded that the CPU is more efficient than the GPU.
- Recent video cards, e.g. the GeForce 7800 and ATI X800 XT, do better than the CPU.

Page 10: FIR Filter Definition

- A Finite Impulse Response (FIR) filter is used in audio processing (definition below).
- We modified GNU Radio, open-source software implementing Software Defined Radio.
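The defining formula on this slide was an image and did not survive the transcript; the standard FIR definition, for input x and N filter taps h, is:

    y[n] = \sum_{k=0}^{N-1} h[k] \, x[n-k]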

Page 11: Other Relevant Transformations

Hilbert transformation:

Frequency translation FIR filter:
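The two formulas on this slide were images and did not survive the transcript. For reference, the textbook forms are: an ideal Hilbert-transformer FIR has taps

    h[k] = \frac{1 - \cos(\pi k)}{\pi k} \quad (k \neq 0), \qquad h[0] = 0

and a frequency-translating FIR filter is equivalent to mixing the input with a complex exponential and then filtering:

    y[n] = \sum_{k=0}^{N-1} h[k] \, x[n-k] \, e^{-j \omega_0 (n-k)}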

Page 12: FIR Filter on a GPU

Page 13: FIR Filter's Loop

Initialization:

Loop iteration:

Page 14: FIR Filter's Loop

    O^{(j+1)} = O^{(j)} + M \cdot I

The final output value is computed as:
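The final formula was an image and did not survive the transcript. A plain-C reference for the accumulation semantics, under the assumption that each pass adds one taps-times-input product to a running sum:

    /* Reference semantics of O(j+1) = O(j) + M*I: each pass j adds one
       tap-input product to the running sum; this scalar version shows
       the arithmetic behind the multi-pass GPU loop. */
    void fir_reference(const float *x, int n, const float *h, int taps, float *y)
    {
        for (int i = 0; i + taps <= n; i++) {
            float acc = 0.0f;               /* O(0) = 0 */
            for (int j = 0; j < taps; j++)
                acc += h[j] * x[i + j];     /* O(j+1) = O(j) + M*I */
            y[i] = acc;
        }
    }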

Page 15: Fragment Program
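The Cg listing on this slide did not survive the transcript. As a rough model, expressed in C like the other sketches here, each fragment invocation multiplies a taps texel with an input texel componentwise and adds the previous pass's partial sum; all names are assumptions:

    typedef struct { float r, g, b, a; } texel;

    /* Model of one fragment invocation in pass j: a componentwise
       multiply-accumulate over the four packed elements. */
    texel fragment(texel taps, texel input, texel prev_sum)
    {
        texel o;
        o.r = prev_sum.r + taps.r * input.r;
        o.g = prev_sum.g + taps.g * input.g;
        o.b = prev_sum.b + taps.b * input.b;
        o.a = prev_sum.a + taps.a * input.a;
        return o;
    }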

Page 16: Optimizations

Optimizations tried (a C illustration of the first one follows this list):

- Break the loop into two to get rid of the conditional expression;
- Unroll the loop body with and without the conditional expression;
- Process two rows of the input and taps textures;
- Use different texture units in the unrolled loops.

None of the above improved performance.
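Illustrative C (not the authors' Cg) of the first idea, splitting a loop so the boundary test is hoisted out of the body:

    /* "Break the loop into two": compute the split point once instead
       of testing a boundary condition on every iteration. */
    float dot_clamped(const float *x, int n, int i, const float *h, int taps)
    {
        float acc = 0.0f;
        int m = (n - i < taps) ? (n - i) : taps; /* split point */
        for (int j = 0; j < m; j++)              /* branch-free body */
            acc += h[j] * x[i + j];
        /* the second loop, over j >= m, would add only zeros, so it is dropped */
        return acc;
    }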

Page 17: Performance Evaluation: FIR Filter

Page 18: Performance of FreqXlating FIR Filter

Page 19: Performance of Hilbert Transformation

Page 20: Conclusion

- Not everything benefits from GPU optimization.
- CPU optimization tricks do not work on the GPU.
- Texture upload/download takes up to 60% of the total time.
- A GPU computation can take several seconds, compared to the milliseconds it takes to render a frame in a game.

Page 21: Future Work

- QoS for the GPU: can an application specify a maximum latency or a share of GPU resources?
- Offloading work from the CPU to the GPU: is it possible to build a compiler that automatically decides what is worth GPU optimization?
- Debugging support: many tools exist for Windows, none for Linux.