
GRIDDING FOR RADIO ASTRONOMY

ON COMMODITY GRAPHICS HARDWARE

USING OPENCL

Alexander Ottenhoff

School of Electrical and Electronic Engineering

University of Western Australia

Supervisor

Dr Christopher Harris

Research Associate

International Centre for Radio Astronomy Research

Co-Supervisor

Associate Professor Karen Haines

Western Australian Supercomputer Program

October 2010


16 Arenga Crt
Mount Claremont WA 6010

October 29, 2010

The Dean
Faculty of Engineering, Computing and Mathematics
The University of Western Australia
35 Stirling Highway
CRAWLEY WA 6009

Dear Sir,

I submit to you this dissertation entitled “Gridding for Radio Astronomy on Commodity Graphics Hardware Using OpenCL” in partial fulfillment of the requirements of the award of Bachelor of Engineering.

Yours faithfully,

Alexander Ottenhoff


Abstract

With the emergence of large radio telescope arrays, such as the MWA, ASKAP and SKA, the rate at which data is generated is nearing the limits of what can currently be processed or stored in real time. Since processor clock rates have plateaued, computer hardware manufacturers are trying different strategies, such as developing massively parallel architectures, in order to create more powerful processors. A major challenge in high performance computing is the development of parallel programs which can take advantage of these new processors. Due to their extremely high instruction throughput and low power consumption, fully programmable Graphics Processing Units (GPUs) are an ideal target for radio astronomy applications. This research investigates gridding, a very time-consuming stage of the radio astronomy image synthesis process, and the challenges involved in devising and implementing a parallel gridding kernel optimised for programmable GPUs using OpenCL. A parallel gridding implementation was developed, which outperformed a single-threaded reference gridding program in all but the smallest test cases.


Acknowledgements

I thank my supervisors Christopher Harris and Professor Karen Haines for providing guidance throughout the course of this project. They provided feedback and advice on my work and helped me refine my research and academic writing skills.

Thanks to Xenon Technologies for providing the computer used throughout this project.

I'd like to acknowledge the technical support staff at WASP, Jason Tan and Khanh Ly, for providing me with access to WASP facilities and setting up the computer used during this project.

Thanks to Paul Bourke for providing me with a small CUDA project that got me started in GPU programming.

Thanks also to Derek Gerstmann for organising the OpenCL Summer School, where I was able to become familiar with the OpenCL API before starting this project.

I'd also like to thank Ankur Sharda and Stefan Westerlund, with whom I shared the Hobbit Room, for offering suggestions and feedback on various ideas.

Finally, thanks to my family for supporting me over the course of this project, in particular my mother, for staying up all night proofreading the final version of this document.


Contents

Abstract

Acknowledgements

List of Figures

1 Introduction

2 Background
2.1 Radio Astronomy
2.2 Aperture Synthesis
2.3 Gridding
2.4 Parallel Processors
2.5 OpenCL

3 Literature Review

4 Model
4.1 Scatter
4.2 Gather
4.3 Pre-sorted Gather

5 Testing

6 Discussion
6.1 Work-Group Optimisation
6.2 Performance Profiling
6.3 Performance Comparison

7 Conclusion
7.1 Project Summary
7.2 Future Consideration

A Original Proposal


List of Figures

2.1 Aperture synthesis data processing pipeline
2.2 Overview of the gridding operation
2.3 Comparison of CPU and GPU architectures
    (a) CPU Layout
    (b) GPU Layout
2.4 OpenCL memory hierarchy
4.1 Gridding with a scatter kernel
4.2 Gridding with a gather kernel
4.3 Gridding with a pre-sorted gather kernel
5.1 Thread topology optimisation
5.2 Performance profile of GPU gridding implementation compared with CPU gridding
5.3 Performance profile of GPU gridding implementation with sorting running on the CPU
5.4 CPU and GPU gridding performance for a varying number of visibilities
5.5 Thread optimisation for a range of convolution filter widths
5.6 CPU and GPU gridding performance for a varying convolution filter width


Chapter 1

Introduction

Astronomers can gain a better understanding of the creation and early evolution of the universe, test theories and attempt to answer many questions in physics by producing images from radio waves emitted by distant celestial entities. With the construction of vast radio-telescope arrays, such as the Murchison Wide-field Array (MWA), Australian SKA Path-finder (ASKAP) and Square Kilometre Array (SKA), many engineering challenges have to be overcome. ASKAP alone will generate data at a rate of 40 Gb/s, producing over 12 PB in a single month [6], and the SKA will produce several orders of magnitude more, so data processing and storage are major issues.

As we reach the limit of how fast we can make single-core CPUs run, we need to look to parallel processors such as multi-core CPUs, GPUs and digital signal processors to process this vast amount of data. One of the biggest problems limiting the popularity of parallel processors has been the lack of a standard language that runs on a wide variety of hardware. To address this, the Khronos Group produced the OpenCL standard.


OpenCL is an open standard for heterogeneous parallel programming [17]. One of the major advantages of code written in OpenCL is that it allows programmers to write software capable of running on any device with an OpenCL driver, eliminating the need to rewrite large amounts of code for each vendor's hardware. This partially solves the issue of vendor lock-in, a major problem in general purpose GPU (GPGPU) programming up until now, where, due to the lack of standardisation, software is often restricted to running on a series of architectures produced by a single company.

In this project I aim to develop an efficient parallel algorithm in OpenCL for the gridding stage of radio-interferometric imaging, which has traditionally been the most time-consuming stage of the imaging process [30]. Due to the large amount of data that will be generated by the next generation of radio telescopes, the amount of data which can be processed in real time may be a serious performance bottleneck. Since a cluster of GPUs with computational performance equal to a traditional supercomputer consumes a fraction of the energy, an efficient OpenCL implementation would be a significantly less expensive option. I will primarily target GPU architectures, in particular the NVIDIA Tesla C1060, although I will also attempt to benchmark and compare performance on several different devices.

Chapter 2, the background, will explain the theory behind radio astronomy with a focus on the aperture synthesis process. It will also provide an overview of GPU architectures, NVIDIA's Tesla series of graphics cards and OpenCL. In Chapter 3, the literature review, previous implementations of gridding on other heterogeneous parallel architectures will be discussed. The model in Chapter 4 will provide a detailed explanation of gridding and detail several ways of adapting it to GPU hardware. Chapter 5 will outline and present the results of various tests performed in order to determine the parameters which result in the best performance of the GPU-based gridding algorithm. The results of these tests will be discussed in Chapter 6, as well as other discoveries made over the course of this project. Finally, Chapter 7 will summarise the important results of this work and outline possible areas for future research.


Chapter 2

Background

This chapter will explain background information on various topics that are useful for understanding this project. It will discuss the theory behind radio astronomy, as well as what scientists in this field can discover. An overview of the aperture synthesis process, used to generate two-dimensional images from multiple radio telescopes, will be given. General Graphics Processing Unit (GPU) design will then be outlined, with a particular focus on the NVIDIA GT200 architecture used in the Tesla C1060. OpenCL will also be discussed, explaining all the features used in this project and why it was chosen over other programming languages.

2.1 Radio Astronomy

Radio astronomy is the branch of astronomy which focuses on observing electromagnetic waves emitted by celestial entities lying in the radio band of the electromagnetic spectrum. While the visible light spectrum observed by optical telescopes can pass through the atmosphere with only a small amount of atmospheric distortion, radio waves with wavelengths ranging from 3 cm to around 30 m are not distorted at all by the Earth's atmosphere. Also, unlike visible waves, which are mostly produced by hot thermal sources such as stars, radio waves can originate from a wide variety of sources including gas clouds, pulsars and even background radiation left over from the big bang [31]. It is also possible to observe radio waves through clouds as well as during the day, when the amount of light emitted by the Sun vastly exceeds that which reaches Earth from distant sources, which allows radio telescopes to operate when optical astronomy is impossible.

Due to the long wavelengths of the signals being measured, radio telescopes are generally far larger than their optical counterparts. For a single-dish radio telescope, the angular resolution R of the image generated from a signal of wavelength λ is related to the diameter of the dish D by

R = λ / D    (2.1)

Since R is a measure of the finest object a telescope can detect, a dish designed to create detailed images of signals below 1 GHz (wavelengths of 0.3 m and longer) would need to be several hundred metres in diameter; for example, a resolution of around 3 arcminutes at λ = 0.3 m requires D ≈ 340 m. Constructing a dish of this size is both difficult and extremely expensive and, for wavelengths longer than around 3 metres, the diameter required for a good resolution can surpass what can realistically be constructed. A technique called radio interferometry makes it possible to combine multiple telescopes to make observations with a finer angular resolution than each telescope could achieve individually. When using this technique, no single telescope measures the brightness of the sky directly. Instead, each pair of telescopes in the array measures a component of the brightness, and this data is combined in a process known as aperture synthesis.

2.2 Aperture Synthesis

Aperture synthesis works by combining the signals from multiple telescopes to produce an image with a resolution approximately equal to that of a single dish with a diameter equal to the maximum distance between antennae. The aperture synthesis process is made up of several stages, shown in Figure 2.1, which transform the signals measured by each pair of telescopes in an array into a two-dimensional image.

The first stage of this process involves taking the signals from each pair of antennas and cross-correlating them to form a baseline. The relationship between the number of antennas in an array a and the total number of baselines b, including the correlation of each antenna with itself, is shown in Equation 2.2.

b = a(a − 1)/2 + a    (2.2)

These signals are combined to produce a set of complex visibilities, one for each baseline, frequency channel and period of time. The complex visibilities for each baseline are created by cross-correlating sampled voltages from a pair of telescopes.
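As a quick worked example of Equation 2.2 (the figure of 36 antennas is used purely for illustration and is not taken from the text above):

```latex
b = \frac{a(a-1)}{2} + a = \frac{36 \times 35}{2} + 36 = 630 + 36 = 666
```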


The next stage is to calibrate the visibilities to remove noise introduced by atmospheric interference and small irregularities in the shape and positioning of the radio dishes. The calibrated visibilities can then be used to generate a two-dimensional image by converting them to the spatial domain. These visibilities are first mapped to a regular two-dimensional grid in a process referred to as gridding. This is followed by applying the two-dimensional inverse Fast Fourier Transform (FFT) to the gridded visibilities, converting them to the spatial domain. The output of this operation is known as the dirty image, because it still contains some artifacts introduced during the aperture synthesis process.

In order to remove these synthesis artifacts, the dirty image is finally processed with a deconvolution technique. Two common algorithms used to perform this operation are the CLEAN algorithm [2] and the Maximum Entropy Method (MEM) [25]. The CLEAN algorithm works by finding and removing point sources in the dirty image, and then adding them back to the image after removing associated side-lobe noise. The MEM process involves defining a function to describe the entropy of an image and then searching for the maximum of this function. The result of the deconvolution process is known as a clean image. Several radio astronomy software packages exist which are able to perform the aperture synthesis process, including Miriad [21], which is used in this project. Of the stages used in aperture synthesis, gridding is the focus of this research and will be discussed in more depth.


Figure 2.1: Aperture synthesis data processing pipeline [30]. Shown is an overview of the major software stages involved in taking sampled radio-wave data from a pair of radio telescopes and generating an image. The signals from a pair of telescopes are correlated with each other to provide a stream of visibilities. These visibilities are then calibrated to correct for irregularities in the telescope's dish, small errors in the telescope's alignment, and to account for some atmospheric interference. After being calibrated, these visibilities are converted into a two-dimensional image through a three-stage process consisting of interpolation to a regular grid, transformation to the spatial domain with a Fast Fourier Transform (FFT), and deconvolution using a technique such as the CLEAN algorithm [2] or the Maximum Entropy Method [25].


2.3 Gridding

Gridding is the stage of the aperture synthesis process which converts a list of calibrated visibilities into a form that can be transformed to the spatial domain with an inverse Fast Fourier Transform. This operation involves sampling the measured visibilities onto a two-dimensional grid aligned with the u and v axes, which are defined for each baseline by the Earth's rotation. An example of visibilities measured by a telescope array containing eight baselines is shown in Figure 2.2. In order to minimise aliasing effects in the image plane, i.e. distortion introduced due to sampling, each visibility is mapped across a small region of the grid defined by a convolution window. The convolution function used in this project is the spheroidal function, which emphasises aliasing suppression near the centre of the image, typically near the object of interest [23].

Typically, instead of computing the spheroidal function every time it is used, coefficients are generated ahead of time and stored in an array. Because the same function is used for gridding a visibility in both the u and v directions, the coefficients of the convolution function can be stored in a one-dimensional array. The ratio between the length of the convolution array and the width of the convolution function is known as the oversampling ratio. Because a high oversampling ratio results in better suppression of aliasing in the final image, the convolution array is significantly larger than the width of the function.
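To make the oversampling idea concrete, the sketch below shows one way a pre-computed, oversampled one-dimensional coefficient array might be indexed when convolving a single visibility. The function name, oversampling factor and centring convention are illustrative assumptions rather than the exact scheme used by Miriad or by the OpenCL kernels described later.

```c
#include <math.h>

/* Look up the convolution coefficient for a grid cell a distance `delta`
 * (in grid cells, possibly fractional) from a visibility's u-v position,
 * using a 1-D spheroidal coefficient table tabulated with `oversample`
 * entries per grid cell and centred at index (table_len / 2). */
static float conv_coeff(const float *table, int table_len,
                        int oversample, float delta)
{
    int idx = table_len / 2 + (int)lroundf(delta * (float)oversample);
    if (idx < 0 || idx >= table_len)
        return 0.0f;              /* outside the convolution window */
    return table[idx];
}

/* The 2-D weight applied to a grid point is separable:
 * w = conv_coeff(..., du) * conv_coeff(..., dv). */
```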


Figure 2.2: Overview of the gridding operation. The gridding operation takes a set of visibilities sampled from multiple baselines of a radio-telescope array and convolves them onto a regular grid. This operation is necessary to prepare the visibility data for the two-dimensional inverse Fast Fourier Transform (FFT) operation, which is used to transform these visibilities from the frequency domain into a two-dimensional image in the spatial domain. Each red line represents measurements made by a separate baseline taken over a period of time.


2.4 Parallel Processors

Computer manufacturers have shifted their focus in recent years from designing fast single-core processors to creating processors which can execute multiple threads simultaneously and minimise memory access latency with on-chip cache. Since these multi-core processors are still relatively new, a diverse range of architectures is available, including multi-core x86 processors such as the AMD Phenom and Intel Core i7, GPUs like NVIDIA's Tesla and AMD's FireStream series, as well as other types of processors including IBM's Cell/B.E. One of the factors limiting the usage of parallel processors by developers is the vast amount of code that has been developed for single-processor computers. Often, due to inter-dependencies between operations, rewriting these legacy programs to take advantage of multiple concurrent threads is not a trivial task.

While originally developed as co-processors optimised for graphics calculations, GPUs are being designed with increasingly flexible instruction sets and are emerging as affordable massively parallel processors. NVIDIA's recent Tesla C1060 GPU is capable of 933 single-precision GigaFLOPS (floating point operations per second) [8], compared to one of the fastest CPUs available at the time, Intel's Core i7 975, with a reported theoretical peak of 55.36 GigaFLOPS [14]. Part of the reason that GPUs can claim such high performance figures is their architecture. As shown in Figure 2.3, GPUs devote more die space to data processing. GPUs are thus highly optimised for performing simple vector operations on large amounts of data faster than a processor using that die space for other purposes. However, this performance comes at the expense of control circuitry, meaning that GPUs cannot make use of advanced run-time optimisations commonly found on modern desktop CPUs such as branch prediction and out-of-order execution. GPUs also sacrifice the amount of circuitry used for local cache, which has a major impact on the average amount of time a process needs to wait after requesting data from memory.


Figure 2.3: Comparison of CPU and GPU architectures [7]. This figure shows the difference in layout between a CPU and a GPU. CPUs are designed to be general purpose processors capable of performing a wide variety of tasks quickly. Because of this, a large amount of space on the chip's die is dedicated to control logic and local cache, both of which can be used to optimise programs at run time. GPUs are highly tuned to perform graphics operations, which are mostly simple vector operations on large amounts of data. This performance is achieved by dedicating most of the chip to the Arithmetic and Logic Units (ALUs) which perform instructions, at the expense of cache and control circuitry. Because of this, GPUs often lack many of the advanced run-time optimisations commonly found on modern desktop CPUs, such as branch prediction and out-of-order execution, and accessing system memory has a higher average latency.


2.5 OpenCL

OpenCL is a programming language created by the Khronos Group with the design goal of enabling the creation of code that can run across a wide range of parallel processor architectures without needing to be modified. To deal with the many different types of processors that can be used for processing data, the OpenCL runtime separates them into two different classes: device and host. The host, which represents a general purpose computer, is in charge of transferring both device programs compiled at run time (kernels) and data to a device. It also instructs devices to run kernels and sends requests for data to be transferred back from a device to the host. A host can make use of a command queue object in order to schedule data transfers and the execution of kernels on various devices asynchronously, so that it remains free to perform other operations while the devices are busy.

The job of a device is simply to execute a kernel in parallel across a range of data, storing the results locally, and to then alert the host when the kernel has finished execution so the results can be transferred back. Each device can be divided into a collection of compute units and, in turn, each of these compute units is composed of one or more Processing Elements (PEs). Memory on a device is organised into four distinct regions: global, constant, local and private. Global and constant memory are shared among all compute units on a device and are the only regions of memory accessible to the host. The only major difference between these two regions is that constant memory can only be written to by the host, while global memory can be written to by both host and device. Local memory is memory shared by all processing elements within a work-group, which can be allocated by the host but only manipulated by the device. Finally, private memory is memory available to only a single processing element. Figure 2.4 shows how the hierarchy of processors and the various memory types are linked together.
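As an illustration of how these four memory regions appear to a kernel programmer, the fragment below marks each region with its OpenCL C address space qualifier. The kernel itself is a trivial example written for this report, not one of the gridding kernels.

```c
// Illustrative OpenCL C kernel showing the four memory regions.
__kernel void memory_regions(__global float *output,         // global: read/write, visible to host
                             __constant float *coefficients, // constant: read-only on the device
                             __local float *scratch)         // local: shared within one work-group
{
    float partial = 0.0f;                 // private: held in a single work-item's registers

    size_t lid = get_local_id(0);
    scratch[lid] = coefficients[lid];     // stage constant data into local memory
    barrier(CLK_LOCAL_MEM_FENCE);         // make the staged data visible to the whole work-group

    partial = scratch[lid] * 2.0f;
    output[get_global_id(0)] = partial;   // write the result back to global memory
}
```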

In order to run a kernel, the host initialises an NDRange, which represents a one-, two- or three-dimensional array with a specific length in each dimension. The size of this NDRange, also known as an index space, determines the number of kernel instances launched. Each instance of a kernel running on a device is known as a work-item and is provided with an independent global ID representing a position in the index space. Work-items are organised into work-groups, each of which has its own group ID and provides the work-items within the group with independent local IDs. When a kernel is executed, each work-group is executed on a compute unit and each work-item maps to a processing element. Limitations on various parameters, such as the maximum number of work-items a work-group can allocate as well as the amount of memory available in each region, are dependent on the architecture of a device.
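A minimal host-side sketch of this launch sequence is given below. It assumes a context, command queue and compiled kernel already exist (names such as queue and kernel are placeholders), and error handling is reduced to a single check for brevity.

```c
#include <CL/cl.h>
#include <stdio.h>

/* Launch `kernel` over a 2-D index space on `queue`:
 * grid_w x grid_h work-items in total, grouped into 8x8 work-groups.
 * In OpenCL 1.x each global size must be a multiple of the local size. */
static int launch_2d(cl_command_queue queue, cl_kernel kernel,
                     size_t grid_w, size_t grid_h)
{
    const size_t global[2] = { grid_w, grid_h };  /* NDRange (index space) size      */
    const size_t local[2]  = { 8, 8 };            /* work-group size (64 work-items) */

    cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                        2,        /* work dimensions */
                                        NULL,     /* global offset   */
                                        global, local,
                                        0, NULL, NULL);
    if (err != CL_SUCCESS) {
        fprintf(stderr, "clEnqueueNDRangeKernel failed: %d\n", err);
        return -1;
    }
    /* Block until the kernel and any earlier queued commands complete. */
    return clFinish(queue) == CL_SUCCESS ? 0 : -1;
}
```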

For GPU devices based on the CUDA architecture, such as the NVIDIA Tesla C1060 used in this project, OpenCL compute units correspond to hardware objects called multiprocessors. While each multiprocessor can process 32 threads in parallel (known as a warp), it is capable of storing the execution context (program counters, registers, etc.) of multiple warps simultaneously and switching between them very quickly [7]. This technique can be used to efficiently run work-groups larger than 32 threads on a single multiprocessor. Since this context switch can occur between two consecutive instructions, the multiprocessor can instantly switch to a warp with threads ready to execute if the current context becomes idle, such as when reading or writing global memory. Each multiprocessor possesses a single set of registers and a fixed amount of local memory which are shared between all active warps. Because of this tradeoff between work-group size and memory available to each work-item, finding a balance between these parameters is essential to obtaining optimal performance.

Figure 2.4: OpenCL memory hierarchy [7]. A system running OpenCL consists of a host, which can be any computer capable of running the OpenCL API, and one or more devices. Each device can be divided into a collection of compute units and, in turn, each of these compute units is composed of one or more Processing Elements. Memory on a device is arranged in a similar hierarchy. Global and constant memory are shared among all compute units on a device, local memory is only available to a single compute unit and private memory is specific to a single processing element. OpenCL devices typically represent GPUs, multi-core CPUs, Digital Signal Processors (DSPs) and other parallel processors. Since a host represents a general purpose computer, it has its own CPU and memory, which are used to issue commands and transfer data to the various devices as well as to perform other operations outside the OpenCL environment.


Chapter 3

Literature Review

The gridding algorithm used in aperture synthesis is widely documented in scientific literature [4, 5, 10, 18, 23, 32]. A large part of the research effort has been focused on improving the quality of the images generated, by devising methods to programmatically determine the ideal convolution window for a given set of data, as well as minimising artifacts introduced by oversampling.

There have been various efforts to implement this algorithm on parallel hardware [11, 19, 28–30]. Before the OpenCL standard was published, IBM's Cell processor was a major target for research efforts, although recently GPUs have become cheaper, more powerful and easier to program, leading to more research on parallelisation with GPUs, particularly with NVIDIA's CUDA-based cards.

Gridding is also used in Magnetic Resonance Imaging (MRI) applications, and several papers have been written on the topic of improving the gridding algorithm as well as creating various implementations targeting heterogeneous parallel processors [1, 12, 15, 16, 20, 22, 24, 26]. While the process used to convert MRI data into images is completely different to the aperture synthesis process used in radio astronomy, both processes involve transforming irregularly sampled data in the Fourier domain into a spatial image.

An early attempt to parallelise gridding on IBM's Cell Broadband Engine is described in an article entitled Radio-Astronomy Image Synthesis on the Cell/B.E. [29], published in 2008. This paper describes an implementation of gridding and its inverse function, degridding, and compares the performance between an Intel Pentium D x86 CPU and two different platforms containing the Cell processor: Sony's Playstation 3 and IBM's QS20 Server Blade. On average, the results for the Cell platforms showed a twentyfold increase in performance compared to the Pentium D, although the speed increase was negligible for small kernels of less than 17x17. One of the main conclusions reached in this paper is that I/O delay and memory latency are the largest bottlenecks in scaling this algorithm to a cluster of processors.

The parallel gridding implementation detailed in this paper took advantage of the Cell's high bandwidth between processors by using the Power Processing Element (PPE) to distribute the visibility data, along with the relevant convolution and grid indices, to the Synergistic Processing Elements (SPEs) on the fly. The PPE stored separate queues as well as separate copies of the grid for each of the SPEs. Therefore, if multiple adjacent visibilities were located close to each other, they would be allocated to a single SPE to reduce the number of memory accesses. To prevent too much work from piling up in a single queue, a maximum queue size was established so that the PPE would not continuously fill a single queue while the other SPEs idled. Each of the SPEs performed a simple loop of polling its queue until work was available, fetching the appropriate data from system memory with Direct Memory Access (DMA), performing the gridding operation and writing the results to its copy of the grid in system memory. Once all visibilities were processed, the PPE added each of the grids together to produce the output.

A follow-up paper was written by the same research team in 2009, entitled Building high-resolution sky images using the Cell/B.E. [30], detailing further optimisations to their Cell-based gridding implementation. The largest optimisation detailed in this paper was to check consecutive visibilities to see if they had identical u-v coordinates and, if so, add them together and then enqueue the combined visibility. The result of these further optimisations was a scalable version of the previous gridding algorithm, designed to run on a cluster of Cell processors, with each Cell core able to process all the data generated by 500 baselines and 500 frequency channels at a rate of one sample per second.

More recently, an effort was made to implement several stages of the aperture synthesis process using CUDA, which is outlined in Enabling a High Throughput Real Time Data Pipeline for a Large Radio Telescope Array with GPUs [9]. The purpose of this research was to design a data pipeline capable of processing data generated by the Murchison Widefield Array in real time. While the data pipeline required over 500 seconds of processing time running on a single core of an Intel Core i7 920, the same pipeline implemented in CUDA could be processed in under 7.5 seconds on a single NVIDIA Tesla C1060. Excluding data transfer times, the GPU implementation of gridding developed as part of this research demonstrated an average speedup of twenty-twofold when compared to the CPU version.


This research demonstrates that gridding has been successfully implemented on several different parallel processor architectures, with significant performance improvements compared to existing serial implementations. Most of the research conducted to date has been focused on implementing gridding on a single processor architecture or on comparing the performance of multiple independent implementations written for different devices. Due to the portability of software written in OpenCL, a parallel version of gridding implemented as an OpenCL kernel could be combined with kernels implementing other stages of aperture synthesis and run on a system composed of multiple different compute devices.


Chapter 4

Model

The gridding algorithm is used to interpolate a set of visibilities to a regular grid, as illustrated in Figure 2.2. Each visibility sample is projected onto a region of the grid by convolving its brightness value with a two-dimensional set of coefficients. In this chapter I will outline the model I developed, which implements the gridding algorithm on the parallel architecture of NVIDIA's Tesla C1060 GPU. I describe three approaches. Firstly, the scatter approach, where each visibility is mapped to an OpenCL work-item and the kernel performs a convolution operation similar to the original serial implementation. Secondly, the gather approach, where the two-dimensional location of each pixel on the grid corresponds to a thread on the GPU and the kernel reads in the entire list of visibilities, only writing to the grid address corresponding to its global ID. Finally, the pre-sorted gather approach, which is similar to the normal gather approach except that the visibilities are sorted and placed into bins prior to gridding and each kernel only reads through a subset of the list of visibilities.


4.1 Scatter

Scatter communication occurs when the ID value given to a kernel processing a stream of data corresponds to a single input element and the kernel writes to multiple locations, scattering information to other parts of memory [13]. In the context of parallel gridding, a scatter kernel is implemented so that the global ID of each work-item corresponds to a single visibility and the kernel convolves this visibility over a region of the grid. Of the different parallelisation approaches discussed, scatter is the closest to a traditional serial implementation, because the kernel effectively performs the same operations, although instead of looping through the list of visibilities, these operations are performed simultaneously. An example of this type of kernel is shown in Figure 4.1.

In the case of a scatter kernel operating over a set of v visibilities with a convolution function of width c, v threads are launched, where each thread performs c² multiplications by looping across the convolution function in two dimensions. This results in a computational complexity of O(v · c²). Although the complexity is the same as that of the serial implementation, a scatter kernel can scale across a large number of processors with a proportional speed increase.
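A simplified sketch of such a scatter kernel is shown below. It is written for illustration rather than taken from the project code: the visibility layout (a float4 holding u, v and the real and imaginary parts of the sample), the interleaved real/imaginary grid and the plain, non-atomic additions are all assumptions, and the unguarded writes exhibit exactly the write-conflict problem discussed next.

```c
// Illustrative scatter gridding kernel: one work-item per visibility.
// vis.x, vis.y = u,v grid coordinates; vis.z, vis.w = complex sample.
__kernel void grid_scatter(__global const float4 *vis,
                           __global float2 *grid,        // interleaved re/im grid
                           __constant float *conv,       // oversampled 1-D coefficients
                           const int conv_len,
                           const int half_width,         // convolution half-width in cells
                           const int oversample,
                           const int grid_w, const int grid_h)
{
    float4 v = vis[get_global_id(0)];
    int u0 = (int)floor(v.x);
    int v0 = (int)floor(v.y);

    for (int dv = -half_width; dv <= half_width; dv++) {
        for (int du = -half_width; du <= half_width; du++) {
            int gu = u0 + du;
            int gv = v0 + dv;
            if (gu < 0 || gu >= grid_w || gv < 0 || gv >= grid_h)
                continue;

            // Separable convolution weight looked up from the oversampled table.
            int iu = conv_len / 2 + (int)round(((float)gu - v.x) * oversample);
            int iv = conv_len / 2 + (int)round(((float)gv - v.y) * oversample);
            float w = conv[clamp(iu, 0, conv_len - 1)] *
                      conv[clamp(iv, 0, conv_len - 1)];

            // NOTE: plain read-modify-write; two work-items hitting the same
            // cell can lose updates (the write conflict described in the text).
            grid[gv * grid_w + gu] += (float2)(w * v.z, w * v.w);
        }
    }
}
```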

Even though the scatter approach is very fast, it does nothing to prevent multiple threads attempting to write to the same memory location simultaneously, which can lead to a write conflict in which the results of all but one thread are lost. A possible solution is to provide each processor with a unique copy of the grid, which it writes to, and to add an extra step at the end of the process to add all the grids together. While this solution would be ideal on a multi-core CPU, it would be impractical on a GPU-like device with hundreds of processing elements, since the amount of memory needed would likely exceed that which is available for any practical grid size.


Figure 4.1: Gridding with a scatter kernel. The scatter strategy involves assigning each visibility to a different thread, where each thread applies the convolution function. While this approach is very fast, it runs into problems when threads attempt to write to the same grid location, as shown in the magenta region where kernels A and B overlap as well as the cyan region where kernels B and C overlap. When this occurs, only one of the values being written is saved while all the other values are lost.


4.2 Gather

A gather kernel works by mapping each address in the output of a function to a thread and processing the set of input data separately at each location. The gather approach to gridding works by assigning a thread to each pixel on the output grid and having each thread process the list of visibilities separately. Since each thread only writes to a single pixel of the output grid, this approach avoids the problem of write conflicts found in scatter kernels, as shown in Figure 4.2.

Given a set of v visibilities which are to be convolved to a w by h grid, a gather kernel needs to iterate through the list of visibilities once for each thread. Because a thread is launched for each position on the grid, this results in a complexity of O(v · w · h). Since the grid is always significantly larger than the convolution function, the gather approach is far more algorithmically complex than the scatter approach and therefore takes longer to run. When it comes to writing, however, gather kernels have one major advantage: because each thread writes to a single location, the number of writes is only w · h. Since all writes can be performed independently, given a GPU with p processing elements, the complexity of writing is only O(w · h / p).
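A sketch of a naive gather kernel follows; as with the scatter sketch above, the data layout and helper parameters are assumptions made for illustration rather than the project's actual kernel.

```c
// Illustrative gather gridding kernel: one work-item per output grid cell.
__kernel void grid_gather(__global const float4 *vis,
                          const int n_vis,
                          __global float2 *grid,
                          __constant float *conv,
                          const int conv_len,
                          const int half_width,
                          const int oversample,
                          const int grid_w)
{
    int gu = get_global_id(0);
    int gv = get_global_id(1);
    float2 sum = (float2)(0.0f, 0.0f);

    // Every work-item scans the full visibility list...
    for (int i = 0; i < n_vis; i++) {
        float4 v = vis[i];

        // ...but only accumulates visibilities within the convolution window.
        if (fabs((float)gu - v.x) > half_width || fabs((float)gv - v.y) > half_width)
            continue;

        int iu = conv_len / 2 + (int)round(((float)gu - v.x) * oversample);
        int iv = conv_len / 2 + (int)round(((float)gv - v.y) * oversample);
        float w = conv[clamp(iu, 0, conv_len - 1)] *
                  conv[clamp(iv, 0, conv_len - 1)];

        sum += (float2)(w * v.z, w * v.w);
    }

    // Exactly one work-item owns this cell, so the write cannot conflict.
    grid[gv * grid_w + gu] = sum;
}
```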

A major disadvantage of this approach is that the total number of operations performed is significantly larger, since each visibility is processed once for each thread, whereas the scatter approach only processes each visibility once. Because the convolution function used to map visibilities to the grid is significantly smaller than the grid itself, most visibilities processed by each thread fall outside the convolution width and can safely be ignored. With this in mind, an optimised version of the gather approach was developed and is discussed in the following section.


Figure 4.2: Gridding with a gather kernel. The gather approach to gridding works by assigning a thread to each pixel on the output grid and having each thread process the list of visibilities separately. Since each thread only writes to a single pixel of the output grid, this approach avoids the problem of write conflicts found in scatter kernels. A disadvantage of this approach is that the total number of operations performed is significantly larger, since the list of visibilities is read once for each thread, whereas the scatter approach only reads them in once.


4.3 Pre-sorted Gather

The pre-sorted gather approach attempts to significantly improve the performance of the regular gather approach by performing an additional series of steps before the gridding operation. These steps attempt to reduce the number of visibilities processed by each thread while still producing correct output. This sequence of steps, collectively called binning, works by splitting the list of visibilities into a collection of shorter lists, whereby each short list contains the visibilities located in a particular region of the grid, or bin. The binning process begins by determining a bin size, which must be equal to or larger than the convolution function. This is followed by identifying which bin each visibility is located in, and using these values to create a list of keys.

Once the list of keys has been generated, the visibilities are sorted based on the value stored in each visibility's corresponding key, which results in a list where visibilities in each bin are grouped together. The list of visibilities is then processed a second time in order to generate an array containing the index of the first and last visibility in each bin. Following this step, a modified gather kernel is launched to perform the gridding process, with the array of bin indices passed as an additional argument. The size and position of each work-group corresponds to the size and location of each bin. Instead of looping through each visibility in the list, each work-item only iterates through the visibilities located in its own bin and the eight bins directly adjacent to it. An illustration of the pre-sorted gather approach is shown in Figure 4.3.
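The fragment below sketches what such a pre-sorted gather kernel can look like, with each work-group covering one bin and each work-item visiting only its own and neighbouring bins via a bin start/end index array. Names such as bin_start and bin_end, and the exact bin layout, are illustrative assumptions rather than the project's kernel.

```c
// Illustrative pre-sorted gather kernel: the work-group covers one bin and
// each work-item scans only the visibilities in the 3x3 block of bins around
// it (the visibility list has already been sorted by bin key).
__kernel void grid_presorted(__global const float4 *vis,
                             __global const int *bin_start,  // first visibility index per bin
                             __global const int *bin_end,    // one past the last index per bin
                             __global float2 *grid,
                             __constant float *conv,
                             const int conv_len,
                             const int half_width,
                             const int oversample,
                             const int bins_x, const int bins_y,
                             const int grid_w)
{
    int gu = get_global_id(0);
    int gv = get_global_id(1);
    int bx = get_group_id(0);               // this work-group's bin coordinates
    int by = get_group_id(1);
    float2 sum = (float2)(0.0f, 0.0f);

    for (int ny = by - 1; ny <= by + 1; ny++) {
        for (int nx = bx - 1; nx <= bx + 1; nx++) {
            if (nx < 0 || nx >= bins_x || ny < 0 || ny >= bins_y)
                continue;

            int b = ny * bins_x + nx;
            for (int i = bin_start[b]; i < bin_end[b]; i++) {
                float4 v = vis[i];
                if (fabs((float)gu - v.x) > half_width || fabs((float)gv - v.y) > half_width)
                    continue;

                int iu = conv_len / 2 + (int)round(((float)gu - v.x) * oversample);
                int iv = conv_len / 2 + (int)round(((float)gv - v.y) * oversample);
                float w = conv[clamp(iu, 0, conv_len - 1)] *
                          conv[clamp(iv, 0, conv_len - 1)];
                sum += (float2)(w * v.z, w * v.w);
            }
        }
    }
    grid[gv * grid_w + gu] = sum;
}
```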

Figure 4.3: Gridding with a pre-sorted gather kernel. Since the grid is significantly larger than the convolution window, each thread only needs to consider visibilities located nearby. To take advantage of this, the grid is divided into sub-regions called bins and the list of visibilities is sorted into an order where visibilities in each bin are grouped together. Each work-item processes visibilities located in its own bin and adjacent bins. The red, green and blue boxes on the left correspond to the list of visibilities processed by an individual work-item. The tinted bins represent adjacent bins for their corresponding coloured work-items.

While the easiest approach to sorting the visibilities into bins would be to sort them on the CPU with a traditional algorithm such as Quicksort, it is possible to do this on the GPU using a parallel sorting algorithm such as a bitonic sort. Bitonic sorting is based on a network of threads taking a divide-and-conquer approach to sorting, implemented with two kernels: Bitonic Sort, which orders the data into alternating increasing and decreasing subsequences, and Bitonic Merge, which takes a pair of these ordered subsequences and combines them together. This was implemented with a modified version of the Bitonic Sorting network example found in NVIDIA's GPU Computing SDK, with the datatype of the values converted from uint to float4 in order to handle visibilities.
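For reference, a single compare-and-exchange pass of a generic bitonic sorting network is sketched below. It is not the SDK-derived kernel used in the project; in particular, it assumes for illustration that each visibility's bin key has been packed into the x component of its float4.

```c
// One global pass of a bitonic sorting network over n = 2^k elements.
// The host enqueues this kernel repeatedly:
//   for (k = 2; k <= n; k <<= 1)
//       for (j = k >> 1; j > 0; j >>= 1)
//           launch with arguments (j, k) and n work-items.
__kernel void bitonic_pass(__global float4 *vis,  // .x holds the sort key (bin index)
                           const uint j,          // distance to the partner element
                           const uint k)          // length of the current bitonic sequences
{
    uint i   = get_global_id(0);
    uint ixj = i ^ j;                     // index of this element's partner
    if (ixj <= i)
        return;                           // only the lower index performs the exchange

    float4 a = vis[i];
    float4 b = vis[ixj];
    int ascending = ((i & k) == 0);       // sort direction alternates between subsequences

    if ((a.x > b.x) == ascending) {       // out of order for this direction: swap
        vis[i]   = b;
        vis[ixj] = a;
    }
}
```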


Chapter 5

Testing

A GPU gridding program was successfully implemented and tested, using the pre-sorted gather model presented in Chapter 4. The testing compared the GPU gridding implementation with a single-core CPU implementation in order to determine the suitability of parallel architectures to the gridding stage of aperture synthesis.

All testing was performed on a Xenon Nitro A6 Tesla workstation. This system contained a Foxconn Destroyer motherboard featuring a single AM2+ CPU socket, four dual-channel DDR2 memory slots, four PCIe v2.0 slots, an NVIDIA nForce 780a SLI chipset and a 5.2 GT/s HyperTransport bus connecting the CPU with the northbridge. The CPU used was an AMD Phenom II X4 955 clocked at 3.2 GHz. 8 GB of RAM was installed, consisting of four 2 GB DIMMs running at DDR2-800. Two different graphics cards were available, an NVIDIA Tesla C1060 and an NVIDIA Quadro 5800, which both include a 240-core GPU clocked at 1.3 GHz with 8 GB and 4 GB respectively of on-board GDDR3 RAM clocked at 800 MHz and connected over a 512-bit memory bus with a bandwidth of 102 GB/s. Both graphics cards were connected to the motherboard through a PCIe x16 bus.

The operating system used in the tests was the AMD64 release of Ubuntu Linux 9.10 (Karmic Koala), running Linux kernel 2.6.31-22. Version 4.4.1 of the GNU Compiler Collection was used to compile all C and FORTRAN code. The reference implementation of gridding was a version of the invert function from the 2010-04-22 release of the Miriad data reduction package, modified to measure and output its run time. The NVIDIA drivers installed were version 195.36.15, along with version 3.0 of the NVIDIA Toolkit, which includes the OpenCL libraries for NVIDIA GPUs, and version 3.0 of the NVIDIA GPU Computing SDK. All performance timing data was measured using the gettimeofday function found in the Unix library sys/time.h.
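A minimal sketch of the kind of wall-clock measurement this implies is shown below; the function name elapsed_ms is a placeholder invented here, not a name taken from the project code.

```c
#include <sys/time.h>

/* Return the wall-clock time between two gettimeofday() samples in
 * milliseconds; each stage of the gridding pipeline can be bracketed
 * by a pair of samples like these. */
static double elapsed_ms(const struct timeval *start, const struct timeval *end)
{
    return (end->tv_sec - start->tv_sec) * 1000.0 +
           (end->tv_usec - start->tv_usec) / 1000.0;
}

/* Usage:
 *   struct timeval t0, t1;
 *   gettimeofday(&t0, NULL);
 *   ... stage being timed ...
 *   gettimeofday(&t1, NULL);
 *   printf("%.3f ms\n", elapsed_ms(&t0, &t1));
 */
```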

Performance tests were conducted using a sample dataset of 1337545 visibilities of supernova SN1987A, taken by the Australia Telescope Compact Array (ATCA). Unless specified otherwise, the grid size used is 1186 by 2101 and the convolution function width is 6, with the convolution function data comprising a spheroidal function stored in a 2048-element array. The data used to generate each performance plot was created by running the relevant program five times in a row and averaging the execution time of the last three runs, in order to minimise the impact of hard disk seek times and power-saving features on the results.

The objective of the first test, shown in Figure 5.1, was to determine the optimal local work-group size for the gridding kernel operating on the sample dataset. This value also determines the size of the bins used in the binning stage of the gridding process. This test was conducted by iterating through each combination of work-group width and height and recording the time taken by the entire gridding process, including the time taken transferring data between the device and host. A value of 6 was used for the minimum number of elements in both dimensions, since the gridding kernel is only designed to work with both work-group dimensions equal to or greater than the convolution function width. Values greater than 16 in either dimension are not displayed on the plot, since increasing either work-group dimension past this value significantly decreased performance. While a work-group size of 6x10 resulted in the fastest execution time, 8x8 was used in further tests for reasons that are explained in Chapter 6.

Figure 5.1: Thread topology optimisation. This figure shows how the performance of the GPU gridding kernel varies with a number of local work-group sizes on the sample data. This test was performed in order to find an optimal work-group size for later tests. The x-axis shows the number of work-items in each group, with each datapoint labelled with the width and height of the work-group it represents. The y-axis is the execution time of the gridding process measured in milliseconds. Because it is not perfectly clear in the diagram, the fastest work-group sizes are 6x10, 6x9, 7x8, 6x8, 10x6 and 8x8.
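The benchmarking loop behind this test can be pictured as follows; run_once is a hypothetical callback standing in for the full pipeline (binning, transfers, sort and kernel launch) and is not part of the actual test harness.

```c
#include <stdio.h>

/* Scan all candidate work-group shapes from conv_width x conv_width up to
 * 16x16 and report the fastest one. `run_once` is any callback that executes
 * the full gridding pipeline with the given work-group shape and returns the
 * measured execution time in milliseconds. */
static void scan_work_groups(int conv_width,
                             double (*run_once)(int wg_width, int wg_height))
{
    double best_ms = 1e30;
    int best_w = 0, best_h = 0;

    for (int w = conv_width; w <= 16; w++) {
        for (int h = conv_width; h <= 16; h++) {
            double ms = run_once(w, h);
            printf("%dx%d: %.1f ms\n", w, h, ms);
            if (ms < best_ms) {
                best_ms = ms;
                best_w  = w;
                best_h  = h;
            }
        }
    }
    printf("fastest work-group: %dx%d (%.1f ms)\n", best_w, best_h, best_ms);
}
```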

Figure 5.2 illustrates the execution time of each stage of the GPU gridding implementation compared to the total execution time taken by Miriad's gridding implementation. A performance profile of the GPU gridding process with sorting handled on the CPU is shown in Figure 5.3. The purpose of these diagrams is to visualise the amount of time spent at each stage of the gridding process, in order to determine whether any stage in particular is acting as a performance bottleneck. Each item listed in the key of either diagram represents a distinct stage of the GPU gridding process. Binning represents the time spent determining which bin each visibility is located in. Device Transfer represents the total amount of time spent transferring the binned visibilities and convolution function from host to device. Sorting is a measure of the total time taken by the sorting stage. Bin Processing represents the time taken to transfer the sorted visibilities from device to host, build an array containing indices for the first and last visibility in each bin, and transfer this new array onto the device. Kernel Execution represents the time spent performing the actual gridding operation. Finally, Host Transfer represents the time taken transferring the grid from the device back to the host. Because Miriad's gridding process is performed entirely on the host without any pre-processing of visibility data, its performance profile only consists of the kernel execution stage.

Figure 5.2: Performance profile of GPU gridding implementation compared with CPU gridding. This diagram illustrates the time spent in each stage of the GPU gridding process compared with the total execution time of Miriad's gridding implementation, using the sample dataset. Each item listed in the key represents a distinct stage of the GPU gridding process: Binning, Device Transfer, Sorting, Bin Processing, Kernel Execution and Host Transfer, as described in the text.

Figure 5.3: Performance profile of GPU gridding implementation with sorting running on the CPU. This diagram illustrates the time spent in each stage of the GPU gridding process with sorting of the visibilities handled by the CPU, using the sample dataset. This plot indicates the large amount of processing time needed to sort the visibilities into bins on the CPU. Because of this major performance bottleneck, the sorting stage was adapted to run on the GPU, which led to a significantly faster gridding implementation, as shown in the second column of Figure 5.2.

Figure 5.4 compares the execution time of GPU gridding with Miriad for visibility lists of various sizes. This test was done to compare how the performance of each program scales when provided with larger datasets to process. The large datasets used in this test were generated by repeating the visibilities in the SN1987A dataset as many times as necessary for each test.

In order to compare the performance for convolution windows of various sizes, the optimal work-group size for each convolution width needed to be measured. Because the graphics card used for testing only allows work-groups of up to 512 elements in size, only convolution windows up to 22x22 elements in size could be tested, since the convolution width acts as a lower bound for work-group sizes. The results of this test are shown in Figure 5.5 and the measured optimal work-group sizes are listed in Table 5.1.

Figure 5.6 shows how the GPU gridding program performs compared with Miriad over a number of different convolution filter widths. The work-group sizes used for the GPU kernel in this test are the optimal values displayed in Table 5.1. This test was conducted by changing the convolution width parameter provided to both gridding programs. Since the convolution function used in the sample data has a large oversampling ratio, the array of convolution coefficients did not require modification.


Figure 5.4: CPU and GPU gridding performance for a varying number of visibilities. This graph compares the performance of the optimised GPU gridding implementation with Miriad as the number of elements in the visibility list, N, increased. Results were plotted starting at N = 250000 and repeated for every multiple of 250000 up to a maximum of N = 10000000. Each datapoint was generated by averaging the runtime for each value of N over four runs.


[Plot for Figure 5.5: run time (ms) against CGF width for work-group sizes from 1x1 to 22x22.]

Figure 5.5: Thread optimisation for a range of convolution filter widths. The purpose of this test was to determine the optimal work-group sizes for the GPU gridding kernel for a range of convolution filter widths. Due to the way the gridding kernel handled bins, the work-group size needs to be equal to or greater than the convolution width in both dimensions in order to generate correct output, so only datapoints satisfying this criterion were plotted. Another limitation is that the GPU only allows work-groups with 512 elements or fewer, leading to upper bounds of 22 for the convolution filter width and 22x22 for the work-group size.


[Plot for Figure 5.6: run time (ms) against CGF width for Miriad and OpenCL.]

Figure 5.6: CPU and GPU gridding performance for a varying convolution filter width. This test was designed to demonstrate the differences in performance between the Miriad and OpenCL gridding implementations over a range of convolution widths. These convolution widths, on the x-axis, are plotted against run time, on the y-axis. CGF widths were tested over a range of 1 to 22. The maximum value was imposed due to the current GPU implementation's requirement of a work-group size equal to or greater than the convolution width, with 22x22 being the maximum work-group size possible on the NVIDIA Tesla C1060.


Convolution width   Optimal work-group
1                   8x8
2                   8x8
3                   7x7
4                   8x8
5                   7x7
6                   8x8
7                   7x7
8                   8x8
9                   11x11
10                  11x11
11                  11x11
12                  12x12
13                  13x13
14                  14x14
15                  16x16
16                  16x16
17                  20x20
18                  21x21
19                  21x21
20                  21x21
21                  21x21
22                  22x22

Table 5.1: Optimal work-group sizes for various convolution filter widths. This table shows the best performing local work-group sizes for a range of convolution widths, as determined by the results of the test shown in Figure 5.5.


Chapter 6

Discussion

The results presented in the previous chapter are now discussed. I will begin by ex-

amining the selection of an optimal work-group size and explaining the effect of this

parameter on performance. Subsequently, the performance profile of the OpenCL

gridding implementation will be discussed. Finally, the performance of both the Miriad

and OpenCL gridding implementations will be compared and the parameters affect-

ing each program will be analysed.

6.1 Work-Group Optimisation

The first major goal of testing was to determine the optimal local work-group size

which makes gridding on the GPU run in the shortest amount of time. This param-

eter has a major impact on GPU performance in a number of ways, as it deter-

mines how many work-items can be run simultaneously, the number of registers and


amount of shared memory available to each processor, as well as determining the

size of the bins that visibilities are sorted into. For a given number of work-items in

a work-group T and warp size Wsize (which is equal to 32 for GPUs based on NVIDIA's

CUDA architecture), the total number of warps required by a work-group, Wwg, is

given by Equation 6.1 [7]:

$W_{wg} = \mathrm{ceil}\!\left(\frac{T}{W_{size}},\ 1\right) \qquad (6.1)$

Given the warp allocation granularity GW (equal to 2 on the Tesla), the number of

registers used by a kernel Rk and the thread allocation granularity GT (512 on the

Tesla), the number of registers allocated to each work-group, Rwg, can be expressed

by Equation 6.2:

$R_{wg} = \mathrm{ceil}\!\left(\mathrm{ceil}(W_{wg},\ G_W) \times W_{size} \times R_k,\ G_T\right) \qquad (6.2)$
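
As an illustration of these formulas, the short C sketch below evaluates Equations 6.1 and 6.2 directly. The helper ceil_to and the register count Rk are illustrative assumptions rather than code from this project; Wsize, GW and GT are the Tesla C1060 figures quoted above.

#include <stdio.h>

/* ceil(x, n): round x up to the nearest multiple of n, the meaning of the
 * two-argument ceil used in Equations 6.1 and 6.2 [7]. */
static int ceil_to(int x, int n)
{
    return ((x + n - 1) / n) * n;
}

int main(void)
{
    const int Wsize = 32;   /* warp size on NVIDIA's CUDA architecture      */
    const int GW    = 2;    /* warp allocation granularity (Tesla C1060)    */
    const int GT    = 512;  /* thread allocation granularity (Tesla C1060)  */
    const int Rk    = 16;   /* registers per work-item; illustrative value  */
    const int T     = 8 * 8;               /* work-items in an 8x8 work-group */

    int Wwg = (T + Wsize - 1) / Wsize;     /* Equation 6.1: warps per work-group */
    int Rwg = ceil_to(ceil_to(Wwg, GW) * Wsize * Rk, GT);   /* Equation 6.2 */

    printf("Wwg = %d warps, Rwg = %d registers\n", Wwg, Rwg);
    return 0;
}

For an 8x8 work-group this gives two warps per work-group, the quantity referred to in the discussion below.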

From Figure 5.1, the best performing work-group size was determined to be 60 work-

items arranged as 6x10. The reason a work-group size of 8x8 was chosen was that it

contains 64 work-items, which happens to be the maximum number that can fit into

2 warps. This maximises the number of work-items capable of running simultane-

ously on the GPU without decreasing the number of registers available to each warp.

The optimal work-group sizes measured for a range of convolution widths, shown

in Figure 5.5 and Table 5.1, show several interesting patterns. For convolution

widths up to 8, work-group sizes of 7x7 and 8x8 appear to produce very

similar results, outperforming all the other sizes. While the 8x8 and 7x7 work-groups


contain 64 and 49 work-items respectively, they both take up 2 warps. A possible

explanation for the fast performance of the 7x7 work-group is that, despite running

fewer work-items in parallel, the smaller bin size reduces the number of visibilities pro-

cessed in each work-group. A similar pattern can be seen in the performance of

15x15 and 16x16 work-groups, which both require 8 warps, but contain 225 and 256

work-items respectively.

Another observation is that while small work-groups generally outperform large

work-groups, work-groups below 6x6 in size show an opposite trend. Since these

work-groups are all smaller than a single warp in size, and each multiprocessor pro-

cesses a single work-group at a time, these small work-groups don’t contain enough

work-items to make use of the full set of processing elements. With a 1x1 work-group,

each multiprocessor only performs operations on one processing element, while the

other 31 idle.

6.2 Performance Profiling

The performance profiles shown in Figures 5.2 and 5.3 give a breakdown of the

time spent in each stage for the separate gridding implementations. Within a single

plot, the ratio between the run-times of different stages can be used to identify

performance bottlenecks, while comparing the plots against each other provides a

measure of the speed-up between implementations.

The gridding implementation developed in this project, labelled as OpenCL with

GPU sort, demonstrates a speedup of 1.46x over the original Miriad implementation. Ex-


cluding the time taken by the device and host transfer stages, as well as the transfers

listed as part of bin processing, this speedup is 2.28x. While this value is lower than

the performance obtained in other parallel gridding processes, detailed in Chapter 3,

the two values can’t be directly compared due to the different datasets used.

A comparison of both OpenCL implementations in these plots reveals the impact

sorting has on the runtime of pre-sorted gather based gridding. Compared to the

CPU based sort, sorting the visibilities on the GPU is 103x faster. Combined with

the other stages of the gridding process, this resulted in a total speedup of 6.36x.

6.3 Performance Comparison

Figure 5.4 shows several sharp increases in runtime of the GPU gridding algorithm

as the number of visibilities grows. Since the bitonic sorting kernel requires the list

of visibilities to be padded with empty values so that its length is a power of two,

these sudden runtime increases represent a combination of two factors. Firstly, the

sorting kernel needs to process twice as many visibilities which doubles the time

required for the operation. Secondly, the current version of GPU gridding pads the

list of visibilities with zeros on the host, which results in the amount of data that

needs to be transferred doubling at each jump in runtime on the graph. These steep in-

creases in runtime could be partially reduced by padding the visibility data with

empty values on the GPU.
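
As a concrete illustration of this padding step, the host-side C sketch below rounds the visibility count up to the next power of two and zero-fills the tail before the sort. The names next_pow2 and pad_visibilities and the flat float layout are hypothetical, not the code used in this project.

#include <stdlib.h>
#include <string.h>

/* Smallest power of two greater than or equal to n. */
static size_t next_pow2(size_t n)
{
    size_t p = 1;
    while (p < n)
        p <<= 1;
    return p;
}

/* Copy n visibility records (stride floats each) into a buffer whose length
 * is padded to a power of two with zero-valued entries, as required by the
 * bitonic sorting kernel.  The caller frees the returned buffer. */
static float *pad_visibilities(const float *vis, size_t n, size_t stride,
                               size_t *padded_n)
{
    *padded_n = next_pow2(n);
    float *out = calloc(*padded_n * stride, sizeof(float));
    if (out != NULL)
        memcpy(out, vis, n * stride * sizeof(float));
    return out;
}

Performing the same padding on the device would avoid transferring the zero-valued tail from host to device, which is the improvement suggested above.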

As shown in Figure 5.6, the relative performance of gridding on a GPU compared to


on a CPU greatly increases with large convolution widths. This is a major benefit

of the gather approach compared to the scatter approach, since, while larger

convolution windows increase the number of calculations performed per visibility in

both algorithms, in a gather based kernel this extra work is spread across a large

number of threads. In both the CPU implementation and scatter kernels, the num-

ber of operations performed on each visibility is proportional to the convolution

width squared.


Chapter 7

Conclusion

This project has implemented the gridding stage of aperture synthesis on a GPU

using OpenCL. Its performance has also been compared with the single threaded

gridding process used in Miriad. This chapter summarises the process of developing

the GPU gridding algorithm and concludes with future considerations for extending

this work.

7.1 Project Summary

The initial target of my research was to write a CPU based gridding implementa-

tion in C. The purpose of this implementation was to gain an understanding of the

gridding process and to develop wrapper code to handle input and output tasks not

supported by the GPU. In order to avoid rewriting a large amount of code unrelated

to the main task of gridding, this program was implemented by replacing the MapIt


function call in Miriad’s Mapper subroutine with a function call to my own gridding

function and returning the gridded output to Miriad. Once I verified that all grid-

ding calculations were being performed within the function, I began research into

different approaches to perform this operation on parallel hardware.

My first attempt at an OpenCL implementation running on the GPU made use of a

kernel based on the scatter approach described in Section 4.1. This implementation

was similar to the original CPU implementation, since the operations performed

by the kernel on each visibility were exactly the same as the original. The major

difference was that these operations were performed in parallel by work-items on

the GPU instead of inside a loop on the CPU. Although this program was able to

run extremely fast, I was unable to overcome the problems caused by simultaneous

writes to the same memory address.

After further examination of the gridding process, I developed a new gridding ker-

nel using the gather approach outlined in Section 4.2. This kernel was primarily

designed to eliminate the issue of simultaneous writing, which was achieved by

launching a separate thread for each pixel in the output grid. Initially this kernel

was incredibly slow, taking over half an hour to grid the sample dataset described in

Chapter 5. Since the gather kernel managed to produce correct results, improving its

performance replaced correcting the scatter kernel as the main focus of development.

It soon became apparent that the gather kernel’s performance could be drastically

improved by sorting the visibilities based on their location in the u − v plane and

modifying the gridding kernel to only process visibilities close to the grid location


designated by its global ID. This additional sorting step is explained in Section 4.3.

The first version of the pre-sorted gather approach performed the visibility sorting

operation on the CPU before transferring the sorted visibility data to the graphics

card and running the gridding kernel. Because I was planning to eventually im-

plement sorting on the GPU, I wrote my own version of the bitonic sort algorithm

which ran on a single CPU core. Performance tests revealed that although this new

approach to gridding was several hundred times faster than the original unsorted

gather approach, it was still five times slower than Miriad.

Profiling this new gridding implementation showed that the visibility sorting stage

was responsible for 90% of the total runtime. In order to improve overall perfor-

mance the sorting algorithm was replaced by a sorting kernel running on the GPU.

This sorting kernel was taken from the OpenCL Sorting Networks example in NVIDIA's

GPU Computing SDK, which perfectly matched my requirements with only slight

modification. This modification finally managed to improve the performance of my

gridding implementation enough to run faster than Miriad over a wide range of pa-

rameters.

7.2 Future Considerations

This research thoroughly investigated many aspects of a GPU based gridding imple-

mentation. However, there are still many related areas yet to be explored as well as a

number of areas within the scope of this project which warrant further investigation.

These include utilising different memory regions on the device and performing the


remaining CPU based binning stages on the GPU. Combining gridding with other

stages of the aperture synthesis pipeline in OpenCL and adapting this work to other

hardware architectures are also discussed.

The pre-sorted gather approach to gridding consists of four stages: Determining

each visibility’s bin, sorting the visibilities, constructing an array of indices of the

first and last visibility in each bin and gridding. Currently only the sorting and

gridding stages are implemented on the GPU, while the other stages are processed

on the host. Determining each visibility’s bin on the GPU could be performed faster

than on the CPU. Building the array of indices on the CPU requires the sorted visi-

bilities to be transferred to the host and the array of bin locations to be transferred

to the device. Performing this calculation on the GPU would not only be faster, but

also eliminate both of these transfers.
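
A minimal host-side C sketch of this bin-index construction is shown below for reference; the binOf array (each sorted visibility's bin ID) and the first/last output arrays are assumed data layouts rather than the code used in this project.

/* Given the bin ID of each visibility after sorting (binOf, length nVis,
 * non-decreasing because the visibilities are sorted by bin), record the
 * index of the first and last visibility in each bin.  Empty bins keep
 * first[b] == last[b] == -1. */
static void build_bin_index(const int *binOf, int nVis, int nBins,
                            int *first, int *last)
{
    for (int b = 0; b < nBins; ++b) {
        first[b] = -1;
        last[b]  = -1;
    }
    for (int i = 0; i < nVis; ++i) {
        int b = binOf[i];
        if (first[b] < 0)
            first[b] = i;   /* first visibility belonging to bin b   */
        last[b] = i;        /* last visibility seen so far for bin b */
    }
}

Moving this pass onto the device, for example by having each work-item compare binOf[i] with binOf[i-1], would eliminate the two transfers described above.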

Since each work-group is comprised of work-items located in the same bin, each

work-item processes the same set of visibilities. Currently the gridding kernel re-

quests each visibility from global memory individually, waiting after each request.

By allocating a small amount of local memory as a visibility cache, the kernel could

alternate between filling the cache by requesting a series of consecutive visibilities

in parallel and looping though each visibility in the cache. This optimisation does

not guarantee a performance increase on all devices since the OpenCL specification

allows for devices without work-group specific local memory to map it to a region of

global memory. On such a device, any attempts at caching data from global memory

in local memory would actually slow down a kernel.
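
The OpenCL C fragment below sketches this proposed local-memory cache. The float4 visibility layout (u, v, real, imaginary), the CACHE_SIZE value and all identifiers are illustrative assumptions, and the convolution arithmetic itself is omitted; it is not the kernel developed in this project.

/* Each work-group processes one bin; binStart and binEnd delimit that bin's
 * visibilities in the sorted list. */
#define CACHE_SIZE 64

__kernel void grid_gather_cached(__global const float4 *vis,
                                 const int binStart,
                                 const int binEnd,
                                 __global float2 *grid)
{
    __local float4 cache[CACHE_SIZE];
    int lid   = get_local_id(1) * get_local_size(0) + get_local_id(0);
    int lsize = get_local_size(0) * get_local_size(1);

    for (int base = binStart; base < binEnd; base += CACHE_SIZE) {
        int count = min(CACHE_SIZE, binEnd - base);

        /* The work-items cooperatively fill the cache from global memory. */
        for (int i = lid; i < count; i += lsize)
            cache[i] = vis[base + i];
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Each work-item then loops over the cached visibilities and
         * accumulates its own grid cell from them (omitted here). */
        for (int i = 0; i < count; ++i) {
            /* cache[i] holds (u, v, re, im); convolve it onto this
             * work-item's grid point. */
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
}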


The aperture synthesis pipeline, outlined in Section 2.2, consists of several sequential

stages that convert radio signals collected by radio telescopes into a two-dimensional

image of the radio source. As discussed in Chapter 3, parallel versions of most stages

of aperture synthesis have been developed in CUDA in order to process data gen-

erated by the Murchison Widefield Array in real time [9]. Future work could focus

on an OpenCL version of this pipeline which could make use of a combination of a

wider variety of GPUs as well as other devices supporting OpenCL.

Since kernels written in OpenCL are capable of running on any OpenCL device

with sufficient memory, the gridding implementation developed in this project is

able to run on a wide variety of hardware without any modification. A subsequent

project could focus on optimising the gridding kernel developed for the NVIDIA

Tesla C1060 for various other devices and compare performance across a wide range

of parameters. Another potential area for further research could be in implementing

a version of gridding capable of running on multiple OpenCL devices simultaneously.


Appendix A

Original Proposal

Radio interferometric image reconstruction on commodity

graphics hardware using OpenCL

Alexander Ottenhoff

01 April 2010

The Problem

Astronomers can gain a better understanding of the creation and early evolu-

tion of the universe, test theories and attempt to solve many mysteries in physics

by producing images from radio waves emitted by distant celestial entities. With

the construction of vast radio-telescope arrays such as the Square Kilometre Ar-

ray (SKA), Australian SKA Path-finder (ASKAP) and Murchison Wide-field Array

(MWA), many engineering challenges need to be overcome. ASKAP alone will gen-


erate data at a rate of 40Gb/s, producing over 12PB in a single month [6] and SKA

will produce many times this, so data processing and storage are major issues. As

we reach the limit of how fast we can make single core CPUs run, we need to look

to parallel processors such as multi-core CPUs, GPUs and digital signal processors

to process this vast amount of data. One of the biggest problems limiting the popular-

ity of parallel processors has been the lack of a standard language that runs on a

wide variety of hardware, although a new language named OpenCL may change that.

First published by the Khronos group in late 2008, OpenCL is an open standard

for heterogeneous parallel programming [17]. One of the major advantages of code

written in OpenCL is that it allows programmers to write software capable of run-

ning on any device with an OpenCL driver, eliminating the need to rewrite large

amounts of code for each vendor’s hardware. This partially solves the issue of vendor

lock-in, a major problem in general purpose GPU (GPGPU) programming up until

now where, due to the lack of standardisation, software is often restricted to

running only on a single architecture produced by one company.

In this project I aim to develop an efficient way to adapt radio-interferometric imag-

ing to parallel processors using OpenCL, in particular the gridding algorithm as this

has traditionally been the most time-consuming part of the imaging process [30].

Due to the large amount of data that will be generated by the ASKAP, the amount

of data which can be processed in real-time may be a serious performance bottle-

neck. Since a cluster of GPUs with equal computational performance to a traditional

supercomputer consumes a fraction of the energy, an efficient OpenCL implemen-

tation would be a significantly less expensive option. I will primarily target GPU

architectures, in particular the NVIDIA Tesla C1060, although I will also attempt


to benchmark and compare performance on several different devices.

Background

Radio interferometry background

The goal of radio astronomy is to gain a better understanding of the physical universe

via the observation of radio waves emitted by celestial bodies. Part of this is achieved

by forming images from the signals received by radio telescopes. For a single dish

style radio telescope, the angular resolution R of the image generated from a signal

of wavelength λ is related to the diameter of the dish D.

$R = \frac{\lambda}{D} \qquad (A.1)$

Since R is a measure of the finest object a telescope can detect, a dish designed

to create detailed images of low frequency signals can be several hundred metres in

diameter. Constructing a dish of this size is however both difficult and extremely ex-

pensive, so most modern radio astronomy projects utilise an array of telescopes.
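
As a worked example of Equation A.1 (the numbers are purely illustrative), a signal of wavelength $\lambda = 0.21\,\mathrm{m}$ observed with a $D = 100\,\mathrm{m}$ dish gives

$R = \frac{0.21\,\mathrm{m}}{100\,\mathrm{m}} = 2.1 \times 10^{-3}\,\mathrm{rad} \approx 7.2\,\mathrm{arcmin},$

so resolving arcsecond-scale structure at this wavelength with a single dish would require an aperture tens of kilometres across.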

Aperture synthesis is a method of combining signals from multiple telescopes to

produce an image (as shown in figure ??) with a resolution approximately equal to

that of a single dish with a diameter of the maximum distance between antennae.

The first stage of this process involves taking the signals from each pair of antennas


and cross-correlating them to form a baseline. The relationship between the number

of antennas in an array a, and the total number of baselines b, including those

autocorrelated with themselves is shown by (A.2).

$b = \frac{a(a-1)}{2} + a \qquad (A.2)$
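
For example, an array of $a = 36$ antennas gives

$b = \frac{36 \times 35}{2} + 36 = 630 + 36 = 666$

baselines including the autocorrelations, and this count grows quadratically with the number of antennas.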

These signals are combined to produce a set of complex visibilities, one for each

baseline, frequency channel and period of time. The next stage is to generate a

dirty image from these complex visibilities by translating and interpolating them to

a regular grid so that the Fast Fourier Transform (FFT) can be applied. Finally, the

dirty image may be deconvolved to eliminate artifacts introduced during imaging.

A common name for the stage of aperture synthesis where complex visibilities are

mapped to a regular grid is gridding. The relationship between the 2-dimensional

sky brightness I, 3-dimensional visibility V and primary antenna beam pattern A

is shown in (A.3) [27].

$A(l,m)\,I(l,m) = \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} V(u,v)\, e^{2\pi i (ul + vm)}\, du\, dv \qquad (A.3)$

The primary beam pattern A is removed during the deconvolution stage to obtain

the sky brightness. For radio-telescope arrays with sufficiently large baselines or a

wide field of view, images are distorted due to the curvature of the earth introducing

a height element w to the location of each antenna. One technique used to counter

this distortion is faceting, where the sky is divided into patches small enough that

the baselines can be treated as coplanar, and then combining the patches into one image.


Another common approach known as W-projection involves gridding the entire grid

treating w as 0 and then convolving each point in the dirty image by a function G̃

provided in (A.4) [3].

$\tilde{G}(u,v,w) = \frac{i}{w}\, e^{-\pi i \left[ \frac{u^2 + v^2}{w} \right]} \qquad (A.4)$

Parallel hardware background

Computer manufacturers have shifted their focus in recent years from designing fast,

single core processors to creating processors which can execute multiple threads si-

multaneously and minimise memory access latency. Since these multi-core processors

are still relatively new, a diverse range of architectures are available, including multi-

core x86 processors such as the AMD Phenom and Intel Core i7, IBM’s Cell/B.E.

and GPUs like NVIDIA's Tesla and AMD's Radeon 5800 series. One of the fac-

tors limiting the usage of parallel processors by developers is the vast amount of

code that has been developed for single-processor computers. Often, due to inter-

dependencies between operations, rewriting these legacy programs to take advantage

of multiple concurrent threads is not a trivial task.

While originally developed as co-processors optimised for graphics calculations,

GPUs are being designed with increasingly flexible instruction sets and are emerging

as economical massively parallel processors. NVIDIA’s recent Tesla C1060 GPU is

capable of 933 single precision GigaFLOPS [8] (floating point operations per sec-

ond) compared to the fastest CPU available at the time, Intel’s Core i7 Extreme

965, which has been benchmarked at 69 single precision GigaFLOPS [14]. Part of


the reason that GPUs can claim such high performance figures is their architecture

as shown in figure A.2. By devoting more transistors to data processing, GPUs are

highly optimised for performing simple vector operations on large amounts of data

significantly faster than a processor using those transistors for other purposes. This

performance, however, comes at the expense of control circuitry meaning that GPUs

can’t make use of advanced run time optimisations commonly found on modern desk-

top CPUs such as branch prediction and out-of-order execution. GPUs also sacrifice

the amount of circuitry used for local cache, which has a major impact on the average

amount of time a process needs to wait when requesting data from memory.

OpenCL is a programming language created by the Khronos group with the design

goal of enabling the creation of code that can run across a wide range of parallel

processor architectures without needing to be modified. To deal with the many dif-

ferent types of processors that can be used for processing data, the OpenCL runtime

separates them into two classes: hosts and devices. The host, generally a single CPU

core, is in charge of managing memory and transferring programs compiled for the

device at run-time (kernels) and data to and from devices. A device’s job is to

simply execute a kernel in parallel across a range of data, storing the results locally,

and alert the host when finished so the results can be transferred back. Command

queues are used so the host can queue up several instructions waiting for device

execution while still being free to perform whatever other operations are necessary

while waiting for results. An important feature for code executed on the device is

the availability of vector data types, allowing each ALU on a device with SIMD

instructions to perform an operation on multiple variables simultaneously. Because

of the diverse range of devices supported, memory management on the device is left

up to the programmer so they can efficiently make use of the limited local cache


available on GPU threads as well as take advantage of optional device features such

as texture memory.
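
To make this host/device split concrete, the minimal host-side C sketch below selects a GPU device, builds a toy kernel from source at run time, and queues a write, a kernel launch and a blocking read on a command queue. Error checking is omitted and the kernel, buffer size and identifiers are illustrative only; it is not code from this project.

#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    /* A trivial kernel, compiled for the device at run time. */
    const char *src =
        "__kernel void scale(__global float *x) {"
        "    x[get_global_id(0)] *= 2.0f;"
        "}";
    float data[1024];
    for (int i = 0; i < 1024; ++i)
        data[i] = (float)i;

    /* The host selects a platform and GPU device, then creates a context
     * and a command queue for that device. */
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Build the kernel from source for this device. */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(prog, "scale", NULL);

    /* Transfer data to the device, launch the kernel over 1024 work-items,
     * then read the results back with a blocking call. */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof(data), NULL, NULL);
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    size_t global = 1024;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, sizeof(data), data, 0, NULL, NULL);

    printf("data[2] = %f\n", data[2]);   /* expect 4.0 */
    return 0;
}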

Plan

This project aims to evaluate whether GPUs programmed using OpenCL are a suit-

able platform for running the gridding stage of imaging radio astronomy data in

real-time. So far various research papers and journal articles have been read in an

effort to understand the variety of techniques currently being used to improve grid-

ding performance in existing projects [1, 3, 15, 23, 32], as well as previous efforts to

parallelise gridding on similar processors [19, 26, 28–30]. The next step will be to

construct a theoretical model through analysis of the algorithms used in the most

relevant papers and research into the specifications of the target language and plat-

form [7, 17]. This model will be used to determine where any data dependencies

exist in the algorithm and to plan out a GPU optimised solution.

Before implementing this model on a GPU target using OpenCL, a serial version

will be written in ANSI C. The serial implementation will be developed first as a reference

to determine the correctness of the OpenCL version. This will then be followed

by an OpenCL implementation, optimised for NVIDIA’s Tesla C1060 processor on

a x86 workstation running Ubuntu Linux. Various optimisations will be tested to

improve the execution time, and the final version will be benchmarked on several

different platforms.


Figure A.1: The software pipeline. [29] The first stage of this process involves taking the signals from each pair of antennas and cross-correlating them to form a baseline. These signals are combined to produce a set of complex visibilities, one for each baseline, frequency channel and period of time. The next stage is to generate a dirty image by translating and interpolating them to a regular grid and applying the Fast Fourier Transform (FFT). Finally, the dirty image may be deconvolved to eliminate artifacts introduced during imaging.

Figure A.2: A comparison of CPU and GPU architectures. [7] By devoting more transistors to data processing, GPUs are highly optimised for performing simple vector operations on large amounts of data significantly faster than a processor using those transistors for other purposes. This performance, however, comes at the expense of control circuitry, meaning that GPUs can't make use of advanced run time optimisations commonly found on modern desktop CPUs such as branch prediction and out-of-order execution. GPUs also sacrifice the amount of circuitry used for local cache, which has a major impact on the average amount of time a process needs to wait when requesting data from memory.


References

[1] PJ Beatty, DG Nishimura, and JM Pauly. Rapid gridding reconstruction with a minimal oversampling ratio. IEEE Transactions on Medical Imaging, 24(6):799–808, 2005.

[2] BG Clark. An efficient implementation of the algorithm 'CLEAN'. Astronomy and Astrophysics, 89:377, 1980.

[3] T. J. Cornwell, K. Golap, and S. Bhatnagar. Wide field imaging problems in radio astronomy. In IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP '05), Vol. 5, pages v-861–v-864, March 2005.

[4] TJ Cornwell. Radio-interferometric imaging of very large objects. Astronomy and Astrophysics, 202:316–321, 1988.

[5] TJ Cornwell, MA Holdaway, and JM Uson. Radio-interferometric imaging of very large objects: implications for array design. Astronomy and Astrophysics, 271:697, 1993.

[6] TJ Cornwell and G. van Diepen. Scaling Mount Exaflop: from the pathfinders to the Square Kilometre Array.

[7] NVIDIA Corporation. OpenCL Programming Guide for the CUDA Architecture. Available from: http://www.nvidia.com/content/cudazone/download/OpenCL/NVIDIA_OpenCL_ProgrammingGuide.pdf

[8] NVIDIA Corporation. Tesla C1060 computing processor board specification. Available from: http://www.nvidia.com/docs/IO/56483/Tesla_C1060_boardSpec_v03.pdf

[9] RG Edgar, MA Clark, K. Dale, DA Mitchell, SM Ord, RB Wayth, H. Pfister, and LJ Greenhill. Enabling a high throughput real time data pipeline for a large radio telescope array with GPUs. Computer Physics Communications, 2010.


[10] S. Freya and L. Mosonic. A short introduction to radio interferometric image reconstruction.

[11] K. Golap, A. Kemball, T. Cornwell, and W. Young. Parallelization of Widefield Imaging in AIPS++. In Astronomical Data Analysis Software and Systems X, volume 238, page 408, 2001.

[12] A. Gregerson. Implementing Fast MRI Gridding on GPUs via CUDA.

[13] Mark Harris. Mapping computational concepts to GPUs. In SIGGRAPH '05: ACM SIGGRAPH 2005 Courses, page 50, New York, NY, USA, 2005. ACM.

[14] Intel. Intel microprocessor export compliance metrics. Available from: http://www.intel.com/support/processors/sb/CS-023143.htm

[15] JI Jackson, CH Meyer, DG Nishimura, and A. Macovski. Selection of a convolution function for Fourier inversion using gridding [computerised tomography application]. IEEE Transactions on Medical Imaging, 10(3):473–478, 1991.

[16] W.Q. Malik, H.A. Khan, D.J. Edwards, and C.J. Stevens. A gridding algorithm for efficient density compensation of arbitrarily sampled Fourier-domain data.

[17] A. Munshi. OpenCL: Parallel Computing on the GPU and CPU. SIGGRAPH, Tutorial, 2008.

[18] ST Myers. Image Reconstruction in Radio Interferometry.

[19] S. Ord, L. Greenhill, R. Wayth, D. Mitchell, K. Dale, H. Pfister, and RG Edgar. GPUs for data processing in the MWA. Arxiv preprint arXiv:0902.0915, 2009.

[20] D. Rosenfeld. An optimal and efficient new gridding algorithm using singular value decomposition. Magnetic Resonance in Medicine, 40(1):14–23, 1998.

[21] RJ Sault, PJ Teuben, and MCH Wright. A retrospective view of Miriad. Arxiv preprint astro-ph/0612759, 2006.

[22] T. Schiwietz, T. Chang, P. Speier, and R. Westermann. MR image reconstruction using the GPU. In Proc. SPIE, volume 6142, pages 1279–90. Citeseer, 2006.

[23] FR Schwab. Optimal gridding of visibility data in radio interferometry. In Indirect Imaging. Measurement and Processing for Indirect Imaging, page 333, 1984.

[24] H. Sedarat and D.G. Nishimura. On the optimality of the gridding reconstruction algorithm. IEEE Transactions on Medical Imaging, 19(4):306–317, 2000.

[25] DJ Smith. Maximum Entropy Method. Marconi Review, 44(222):137–158, 1981.


[26] TS Sorensen, T. Schaeffter, KO Noe, and M.S. Hansen. Accelerating the nonequispaced fast Fourier transform on commodity graphics hardware. IEEE Transactions on Medical Imaging, 27(4):538–547, 2008.

[27] GB Taylor, CL Carilli, and RA Perley. Synthesis imaging in radio astronomy II. In Synthesis Imaging in Radio Astronomy II, volume 180, 1999.

[28] A.S. van Amesfoort, A.L. Varbanescu, H.J. Sips, and R.V. van Nieuwpoort. Evaluating multi-core platforms for HPC data-intensive kernels. In Proceedings of the 6th ACM Conference on Computing Frontiers, pages 207–216. ACM, 2009.

[29] Ana Lucia Varbanescu, Alexander S. Amesfoort, Tim Cornwell, Andrew Mattingly, Bruce G. Elmegreen, Rob Nieuwpoort, Ger Diepen, and Henk Sips. Radioastronomy image synthesis on the Cell/B.E. In Euro-Par '08: Proceedings of the 14th International Euro-Par Conference on Parallel Processing, pages 749–762, Berlin, Heidelberg, 2008. Springer-Verlag.

[30] Ana Lucia Varbanescu, Alexander S. van Amesfoort, Tim Cornwell, Ger van Diepen, Rob van Nieuwpoort, Bruce G. Elmegreen, and Henk Sips. Building high-resolution sky images using the Cell/B.E. Sci. Program., 17(1-2):113–134, 2009.

[31] T.L. Wilson, K. Rohlfs, and S. Huttemeister. Tools of Radio Astronomy. Springer Verlag, 2009.

[32] M. Yashar and A. Kemball. TDP Calibration & Processing Group CPG Memo: Computational costs of radio imaging algorithms dealing with the non-coplanar baselines effect: I. 2009.