
Page 1: An Implementation of a FIR Filter on a GPU

Alexey Smirnov and Tzi-cker Chiueh

ECSL Research Seminar, 9/13/05

Page 2: Outline

- Introduction
- GPU Computing Overview
- Related Work
- FIR Filter Definition
- FIR Filter Implementation on GPU
- Performance Evaluation
- Conclusion

Page 3: Introduction

- Numerical algorithms often perform repeated computations on vectors of elements.
- Parallel computation improves performance.
- x86 offers SIMD instruction sets: MMX, SSE, SSE2, SSE3 (see the SSE sketch below).
- Video cards are now programmable.
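As a concrete illustration of the SIMD model these extensions provide, here is a minimal SSE sketch in C; the function name and the alignment/size assumptions are ours, not from the slides:

    #include <xmmintrin.h> /* SSE intrinsics */

    /* Add two float vectors four lanes at a time.
       Assumes n is a multiple of 4 and the arrays are 16-byte aligned. */
    void vec_add_sse(const float *a, const float *b, float *out, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(&a[i]); /* load 4 floats */
            __m128 vb = _mm_load_ps(&b[i]);
            _mm_store_ps(&out[i], _mm_add_ps(va, vb)); /* 4 additions at once */
        }
    }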

Page 4: Computation and Bandwidth Rates

- Video cards have higher GFLOPS rates and memory bandwidth than CPUs.
- However, data copying between main memory and video memory can reduce performance.

Page 5: GPU Computing Background

Rendering pipeline:

- The user program defines vertex and texture coordinates (sketch below).
- The vertex processor converts vertex attributes from the world coordinate system into the screen coordinate system.
- The fragment processor computes the color of each output pixel using textures and color.
- Interpolation defines coordinates and color for each pixel.
- Vertex and fragment processors are programmable, for example in the C-like language Cg.
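A minimal C/OpenGL sketch of the first step, submitting a screen-aligned quad with texture coordinates. The coordinates are illustrative; with rectangle textures (used below) texture coordinates are unnormalized:

    #include <GL/gl.h>

    /* Submit one screen-aligned quad; the fragment program then runs
       once per covered pixel. */
    void draw_quad(float w, float h)
    {
        glBegin(GL_QUADS);
        glTexCoord2f(0.0f, 0.0f); glVertex2f(0.0f, 0.0f);
        glTexCoord2f(w, 0.0f);    glVertex2f(w, 0.0f);
        glTexCoord2f(w, h);       glVertex2f(w, h);
        glTexCoord2f(0.0f, h);    glVertex2f(0.0f, h);
        glEnd();
    }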

Page 6: Rendering APIs

- OpenGL (Linux, Windows, MacOS) and DirectX (Windows).
- OpenGL extensions make advanced features of a video card available:
  - NV_float_buffer supports floating-point textures (sketch below).
  - ARB_render_texture allows rendering to a texture instead of the screen.
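A sketch of creating a floating-point texture via NV_float_buffer. The helper name is ours; NV_float_buffer formats require the NV_texture_rectangle target, and error handling is omitted:

    #include <GL/gl.h>
    #include <GL/glext.h> /* GL_TEXTURE_RECTANGLE_NV, GL_FLOAT_RGBA32_NV */

    /* Upload a w x h block of RGBA float data as a floating-point texture. */
    GLuint make_float_texture(int w, int h, const float *data)
    {
        GLuint tex;
        glGenTextures(1, &tex);
        glBindTexture(GL_TEXTURE_RECTANGLE_NV, tex);
        glTexParameteri(GL_TEXTURE_RECTANGLE_NV, GL_TEXTURE_MIN_FILTER, GL_NEAREST);
        glTexParameteri(GL_TEXTURE_RECTANGLE_NV, GL_TEXTURE_MAG_FILTER, GL_NEAREST);
        glTexImage2D(GL_TEXTURE_RECTANGLE_NV, 0, GL_FLOAT_RGBA32_NV,
                     w, h, 0, GL_RGBA, GL_FLOAT, data);
        return tex;
    }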

Page 7: GPU Program Architecture

1. Create floating-point textures that contain the input data and load them into video memory.
2. Load the fragment program and enable multi-texturing.
3. Define vertex and texture coordinates.
4. Draw the figure to an off-screen buffer.
5. If the results were rendered to an off-screen buffer, copy the image to a texture using glCopyTexSubImage2D().
6. Go to step 3 if more iterations are needed.
7. Use glGetTexImage() to copy the data from video memory to main memory.

A host-side skeleton of steps 3-7 follows.
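A C/OpenGL skeleton of the loop; it reuses the draw_quad() helper and rectangle-texture conventions from the earlier sketches, and the names are ours:

    #include <GL/gl.h>
    #include <GL/glext.h>

    /* Steps 3-7: render each pass, copy the off-screen result back into
       a texture for the next pass, then read the result to main memory. */
    void run_passes(GLuint result_tex, int w, int h, int iterations, float *out)
    {
        for (int i = 0; i < iterations; i++) {
            draw_quad((float)w, (float)h);               /* steps 3-4 */
            glBindTexture(GL_TEXTURE_RECTANGLE_NV, result_tex);
            glCopyTexSubImage2D(GL_TEXTURE_RECTANGLE_NV, /* step 5 */
                                0, 0, 0, 0, 0, w, h);
        }
        glBindTexture(GL_TEXTURE_RECTANGLE_NV, result_tex);
        glGetTexImage(GL_TEXTURE_RECTANGLE_NV, 0,        /* step 7 */
                      GL_RGBA, GL_FLOAT, out);
    }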

Page 8: Input Data Representation

- Matrices map to textures naturally.
- Four elements per pixel (R, G, B, A).
- Vectors are wrapped into matrices (see the packing sketch below).
- Textures have maximum dimensions.
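One way the wrapping might look in C; the zero-padding and row-major layout are our assumptions:

    #include <string.h>

    /* Pack a length-n vector into a w x h RGBA texture image
       (4 floats per texel), zero-padding the tail. Requires n <= 4*w*h. */
    void pack_vector(const float *v, int n, float *texels, int w, int h)
    {
        memset(texels, 0, (size_t)w * h * 4 * sizeof(float));
        memcpy(texels, v, (size_t)n * sizeof(float));
    }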

Page 9: Related Work

- Four papers describe matrix multiplication; others cover linear algebra operations, array sorting, and the FFT.
- Earlier papers concluded that the CPU is more efficient than the GPU.
- Recent video cards, e.g. the GeForce 7800 and ATI X800 XT, do better than the CPU.

Page 10: FIR Filter Definition

- A Finite Impulse Response (FIR) filter is used in audio processing (definition below).
- We modified GNU Radio, open-source software implementing Software Defined Radio.
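The defining formula on this slide was an image and did not survive the transcript; the standard FIR definition, for input x and N filter taps h, is:

    y[n] = \sum_{k=0}^{N-1} h[k] \, x[n-k]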

Page 11: Other Relevant Transformations

Hilbert transformation:

Frequency translation FIR filter:
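The two formulas on this slide were images and did not survive the transcript. For reference, the textbook forms are: an ideal Hilbert-transformer FIR has taps

    h[k] = \frac{1 - \cos(\pi k)}{\pi k} \quad (k \neq 0), \qquad h[0] = 0

and a frequency-translating FIR filter is equivalent to mixing the input with a complex exponential and then filtering:

    y[n] = \sum_{k=0}^{N-1} h[k] \, x[n-k] \, e^{-j \omega_0 (n-k)}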

Page 12: FIR Filter on a GPU

Page 13: FIR Filter's Loop

Initialization:

Loop iteration:

Page 14: FIR Filter's Loop

    O^{(j+1)} = O^{(j)} + M \cdot I

The final output value is computed as:
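The final formula was an image and did not survive the transcript. A plain-C reference for the accumulation semantics, under the assumption that each pass adds one taps-times-input product to a running sum:

    /* Reference semantics of O(j+1) = O(j) + M*I: each pass j adds one
       tap-input product to the running sum; this scalar version shows
       the arithmetic behind the multi-pass GPU loop. */
    void fir_reference(const float *x, int n, const float *h, int taps, float *y)
    {
        for (int i = 0; i + taps <= n; i++) {
            float acc = 0.0f;               /* O(0) = 0 */
            for (int j = 0; j < taps; j++)
                acc += h[j] * x[i + j];     /* O(j+1) = O(j) + M*I */
            y[i] = acc;
        }
    }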

Page 15: Fragment Program
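The Cg listing on this slide did not survive the transcript. As a rough model, expressed in C like the other sketches here, each fragment invocation multiplies a taps texel with an input texel componentwise and adds the previous pass's partial sum; all names are assumptions:

    typedef struct { float r, g, b, a; } texel;

    /* Model of one fragment invocation in pass j: a componentwise
       multiply-accumulate over the four packed elements. */
    texel fragment(texel taps, texel input, texel prev_sum)
    {
        texel o;
        o.r = prev_sum.r + taps.r * input.r;
        o.g = prev_sum.g + taps.g * input.g;
        o.b = prev_sum.b + taps.b * input.b;
        o.a = prev_sum.a + taps.a * input.a;
        return o;
    }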

Page 16: Optimizations

Optimizations tried (a C illustration of the first one follows this list):

- Break the loop into two to get rid of the conditional expression;
- Unroll the loop body with and without the conditional expression;
- Process two rows of the input and taps textures;
- Use different texture units in the unrolled loops.

None of the above improved performance.
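Illustrative C (not the authors' Cg) of the first idea, splitting a loop so the boundary test is hoisted out of the body:

    /* "Break the loop into two": compute the split point once instead
       of testing a boundary condition on every iteration. */
    float dot_clamped(const float *x, int n, int i, const float *h, int taps)
    {
        float acc = 0.0f;
        int m = (n - i < taps) ? (n - i) : taps; /* split point */
        for (int j = 0; j < m; j++)              /* branch-free body */
            acc += h[j] * x[i + j];
        /* the second loop, over j >= m, would add only zeros, so it is dropped */
        return acc;
    }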

Page 17: Performance Evaluation: FIR Filter

Page 18: Performance of FreqXlating FIR Filter

Page 19: Performance of Hilbert Transformation

Page 20: Conclusion

- Not everything benefits from GPU optimization.
- CPU optimization tricks do not work on the GPU.
- Texture upload/download takes up to 60% of the total time.
- A GPU computation can take several seconds, compared to the milliseconds it takes to render a frame in a game.

Page 21: Future Work

- QoS for the GPU: can an application specify a maximum latency or a share of GPU resources?
- Offloading work from the CPU to the GPU: is it possible to build a compiler that automatically decides what is worth GPU optimization?
- Debugging support: many tools exist for Windows, none for Linux.