
Signal Processing on GPUs for Radio Telescopes

John W. Romein
Netherlands Institute for Radio Astronomy (ASTRON), Dwingeloo, the Netherlands

GTC'13, March 18-21, 2013

Overview

radio telescopes

motivation

processing pipelines

signal-processing algorithms: filter, correlator, beam forming, etc.

performance


The LOFAR Radio Telescope

largest low-frequency telescope

no dishes

distributed sensor network

~85,000 receivers


LOFAR: A Software Telescope

all-sky view antennas ➜ new science opportunities

different observation modes require flexibility

digitally steered, concurrent observations

supercomputer, real time

standard imaging

pulsar survey

known pulsar

epoch of re-ionization

transients

ultra-high energy particles

...


LOFAR SuperTerp


What's Next? The Square Kilometre Array

world-wide effort

unprecedented size: 3,000 dishes + aperture array, South Africa + Australia

2016‒2019: 10% SKA

2020‒2023: 100% SKA

exascale computing; petascale I/O

O(10^4)‒O(10^5) > LOFAR

many challenges!


Our Efforts

build/maintain/operate current telescopes (LOFAR, WSRT)

research for SKA


Motivation

1) bring GPU technology to LOFAR

2) do accelerator research for the SKA


ASTRON/IBM Dome

P1 algorithms & machines

P2 nano-photonics

P3 access patterns

P4 microservers

P5 accelerators

P6 compressed sampling

P7 real-time communication

research technologies to develop the SKA

green computing

nano-photonics

data & streaming


Accelerator Research

accelerators for (radio-astronomical) signal processing: GPUs, Xeon Phi, BG/Q, FPGAs, ...

fundamental understanding of accelerators:

which properties make an architecture (in)efficient?

I/O-compute balance? energy efficiency? programmability?

architecture-(in)dependent optimizations? devise new algorithms?

generic approach to program accelerators, or ad-hoc?

applications:

1) LOFAR

2) SKA


LOFAR Data Processing

developing new GPU-based system


Correlator Processing Overview


More Detail

complex piece of software!

several pipelines


Implement GPU Kernels

killing two birds

1) develop new LOFAR correlator

2) code base for accelerator research


Why We Use OpenCL

OpenCL advantages:
  vendor independent
  GPU: runtime compilation
  GPU: float8, float16, swizzling
  CPU: nice C++ interface (exceptions)

OpenCL disadvantages:
  poor library support (e.g., FFT)
  cannot use all GPU features (e.g., GPUdirect)

float2 input_samples[NR_STATIONS][NR_CHANNELS][NR_TIMES][NR_POLARIZATIONS];

float16 sums, weights;
...
sums += weights.s3456789ABCDEF012 * sample;   // swizzle: cyclically rotate the 16 weight components


Why We Are Disappointed

NVIDIA dropped OpenCL support in development tools:

visual profiler


GPUs

                 NVIDIA Tesla K10   AMD FirePro S10000   Intel Xeon Phi
FPUs             3072               3584                 976
SP TFLOPS        4.58               5.91                 2.14
GB/s             320 (ECC off)      480                  352
W                225                375                  300

FirePro S10000:

driver/VBIOS/hardware (?) issues

cannot always overlap compute‒PCIe I/O

Xeon Phi:

immature results


From Feasibility Study to Production Software

concentrated mostly on GPU kernels

CPU parts: now/later


Implemented Kernels

most important: filter, correlator, beam former


Processing Pipelines

imaging mode: sky images

pulsar modes: in development

UHEP mode: detect Ultra-High Energy Particles (experimental)


Imaging (Correlator) Pipeline


four GPU kernels


Poly-Phase Filter (PPF) Bank

splits frequency band into channels, like a prism

trades time for frequency resolution


FIR filter + FFT


1) Finite Impulse Response (FIR) Filter

history & weights (in registers)

no physical shift

reorder output for next kernel

uncoalesced writes

many FMAs

operational intensity = 6.4 FLOPs / byte
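A minimal, hypothetical OpenCL C sketch of such a FIR step follows (names like NR_TAPS and the data layout are illustrative assumptions, not the production LOFAR kernel): each work-item filters one (station, polarization, channel) series, keeps its tap history and weights in registers, and writes transposed output for the FFT kernel.

#define NR_TAPS 16                                  // assumed filter length

__kernel void FIR_filter(__global float2 *filtered,        // [channel][statPol][time]
                         __global const float2 *samples,   // [statPol][time][channel]
                         __constant const float *weights,  // [channel][NR_TAPS]
                         unsigned nrChannels,
                         unsigned nrTimes)
{
  unsigned channel    = get_global_id(0);
  unsigned statPol    = get_global_id(1);           // station * NR_POLARIZATIONS + polarization
  unsigned nrStatPols = get_global_size(1);

  float2 history[NR_TAPS];                          // delay line kept in registers
  for (unsigned tap = 0; tap < NR_TAPS; tap++)
    history[tap] = (float2)(0.0f, 0.0f);

  for (unsigned t = 0; t < nrTimes; t++) {
    // no physical shift: rotate the start index instead of moving the history
    history[t % NR_TAPS] = samples[(statPol * nrTimes + t) * nrChannels + channel];

    float2 sum = (float2)(0.0f, 0.0f);
    for (unsigned tap = 0; tap < NR_TAPS; tap++)    // many FMAs per loaded sample
      sum += weights[channel * NR_TAPS + tap] * history[(t + 1 + tap) % NR_TAPS];

    // reorder (transpose) the output for the next kernel; these writes are uncoalesced
    filtered[(channel * nrStatPols + statPol) * nrTimes + t] = sum;
  }
}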


FIR Filter Performance

[figure: FIR filter TFLOPS and GB/s vs. #stations, FirePro S10000 and Tesla K10]

K10: drop @41 stations ➜ small #threads


2) FFT

1D, complex ➜ complex

typically: 64‒256

tweaked "Apple" FFT library


2) FFT Performance

[figure: FFT TFLOPS and GB/s vs. #stations, FirePro S10000 and Tesla K10]

memory I/O bound


3) Phase & BandPass Correction

combined kernel: clock correction, delay compensation, bandpass correction, transpose

reduces #device memory accesses


3a) Clock Correction

stations: shared clock

corrects cable length errors

merge with next step (phase delay)


3b) Delay Compensation (a.k.a. Tracking)

track observed source

delay telescope data

delay varies due to earth rotation

shift samples

remainder: rotate phase (= cmul)

0.75 FLOPs / byte
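As a worked illustration of that last step (hedged: illustrative code and names such as channelFrequency and residualDelay, not the LOFAR source), the sub-sample delay remainder is applied as a phase rotation, i.e. one complex multiplication per sample:

inline float2 cmul(float2 a, float2 b)                 // complex multiply: 6 FLOPs
{
  return (float2) (a.x * b.x - a.y * b.y,
                   a.x * b.y + a.y * b.x);
}

float  phase    = -2.0f * M_PI_F * channelFrequency * residualDelay;
float2 rotation = (float2) (cos(phase), sin(phase));

sample = cmul(sample, rotation);                       // ~6 FLOPs per 8-byte sample,
                                                       // consistent with the 0.75 FLOPs/byte above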


3c) BandPass Correction

[figure: power vs. frequency channel (0‒256)]

powers in channels unequal

artifact from station processing

multiply by channel-dependent weight

0.125 FLOPs / byte


Phase & BandPass Correction

[figure: phase & bandpass correction TFLOPS and GB/s vs. #stations, FirePro S10000 and Tesla K10]

☹ 0.75 FLOPs / byte ➜ memory I/O bound


4) Correlator Kernel

multiply samples from each station pair

integrate ~1s


4) Correlation Triangle



4) Correlator Implementation

global memory ➜ local memory

1 thread per station pair (dual pol)

1 FLOP / byte (local mem)


4) Optimized Correlator Implementation

increase register reuse

1 thread: 2x2 stations (dual pol)

2 FLOPs / byte (local mem)

also: 3x3, 4x4
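A hedged sketch of the 2x2 register-tiling idea (assumed names and data layout; not the actual LOFAR kernel): samples for a block of time have been staged in local memory, and each work-item correlates two "X" stations against two "Y" stations for both polarizations, so the 32 accumulator values stay in registers.

#define NR_STATIONS      64                          // illustrative compile-time constants
#define NR_POLARIZATIONS 2

inline float2 cmul_conj(float2 a, float2 b)          // a * conj(b): 8 FLOPs
{
  return (float2) (a.x * b.x + a.y * b.y,
                   a.y * b.x - a.x * b.y);
}

void correlate_2x2(__local const float2 (*samples)[NR_STATIONS][NR_POLARIZATIONS], // [time][station][pol]
                   float2 acc[2][NR_POLARIZATIONS][2][NR_POLARIZATIONS],
                   unsigned nrTimes, unsigned statX, unsigned statY)
{
  for (unsigned t = 0; t < nrTimes; t++) {
    float2 x[2][NR_POLARIZATIONS], y[2][NR_POLARIZATIONS];  // 8 local-memory loads (64 bytes) ...

    for (unsigned s = 0; s < 2; s++)
      for (unsigned p = 0; p < NR_POLARIZATIONS; p++) {
        x[s][p] = samples[t][statX + s][p];
        y[s][p] = samples[t][statY + s][p];
      }

    for (unsigned sx = 0; sx < 2; sx++)                     // ... reused in 16 complex MACs
      for (unsigned px = 0; px < NR_POLARIZATIONS; px++)    // (128 FLOPs), i.e. 2 FLOPs/byte,
        for (unsigned sy = 0; sy < 2; sy++)                 // twice the 1x1 version
          for (unsigned py = 0; py < NR_POLARIZATIONS; py++)
            acc[sx][px][sy][py] += cmul_conj(x[sx][px], y[sy][py]);
  }
}

The 3x3 and 4x4 tilings extend the same loop nest but need 72 and 128 accumulators per thread, which is why the K10 (63 registers/thread, next slide) spills for 3x3.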


Correlator Register Usage

3x3 on K10: excessive spilling

max. registers/thread

Tesla K10 63

FirePro S10000 256

#accumulator registers

1x1 8

2x2 32

3x3 72


Correlator #Threads

cannot use all threads in last warp

too few threads ➜ low occupancy

too many threads ➜ multiple passes

multiple thread blocks

[figure: #threads vs. #stations for 1x1, 2x2, and 3x3 tiling]

max #threads

Tesla K10 1,024

FirePro S10000 256


1x1 vs. 2x2 and 3x3 (FirePro S10000)

[figure: correlator TFLOPS and GB/s vs. #stations for 1x1, 2x2, and 3x3 tiling]

use whichever performs best


Correlator Performance

[figure: correlator TFLOPS and GB/s vs. #stations, FirePro S10000 and Tesla K10]

compute bound

S10000: multiple passes


Combined Pipeline

#pragma omp parallel num_threads(2 * nrGPUs)
{
  cl::CommandQueue queue(...);
  cl::Buffer input(...), output(...), ...;

#pragma omp for schedule(dynamic)
  for (int subband = 0; subband < nrSubbands; subband ++) {
    receive(input, subband);
    queue.enqueueWriteBuffer(input, ...);
    queue.enqueueNDRangeKernel(FIR_filter, ...);
    queue.enqueueNDRangeKernel(FFT, ...);
    queue.enqueueNDRangeKernel(delay_bandpass, ...);
    queue.enqueueNDRangeKernel(correlator, ...);
    queue.enqueueReadBuffer(output, ...);
    send(output, subband);
  }
}

2 queues/GPU: overlap I/O & computations

1 host thread/queue: easy model


Correlator Pipeline: Tesla K10 Performance Breakdown

[figure: #GPUs needed vs. #stations, broken down into FIR filter, FFT, phase & bandpass correction, and correlation]

most challenging observation mode: 8-bit samples, 488 subbands (95.3 MHz)

correlations most expensive

need ~11 Tesla K10s


Large #Stations

➜ different approach

        #stations   #correlations   #threads      #thread blocks (K10)
LOFAR   ~80         ~12,960         ~820          1
SKA     ≤3,000      ≤18,000,000     ≤1,125,000    ≤1,000
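As a hedged sanity check of these numbers (assuming dual polarization, one thread per 2x2 station tile as on the earlier slides, and 1,024 threads per K10 thread block), the counts follow directly from the correlation triangle:

#include <stdio.h>

/* rough reproduction of the table above; the counting is an assumption */
static void correlator_counts(long stations)
{
  long baselines    = stations * (stations + 1) / 2;            /* correlation triangle       */
  long correlations = 4 * baselines;                            /* 2x2 polarization products  */
  long threads      = (stations / 2) * (stations / 2 + 1) / 2;  /* one thread per 2x2 tile    */
  long blocks       = (threads + 1023) / 1024;                  /* K10: <= 1024 threads/block */

  printf("%5ld stations: %9ld correlations, %8ld threads, %5ld blocks\n",
         stations, correlations, threads, blocks);
}

int main(void)
{
  correlator_counts(80);    /* LOFAR: 12,960 correlations,       820 threads,       1 block  */
  correlator_counts(3000);  /* SKA: 18,006,000 correlations, 1,126,125 threads, ~1,100 blocks */
  return 0;
}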


Large #Stations: Another Strategy

32x32 stations (16x16 threads)

1 thread: 2x2 stations (dual pol)

192x192 stations (squares + rectangles)

read samples of 32+32 stations

☺ 32 FLOPs / byte
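A quick check of that number (assuming, as in the earlier kernels, 8-byte dual-polarization samples and 8 FLOPs per complex multiply-add):

enum {
  TILE           = 32,                                   /* 32x32 stations per thread block      */
  BYTES_PER_TIME = (TILE + TILE) * 2 /*pol*/ * 8,        /* read samples of 32+32 stations: 1024 */
  FLOPS_PER_TIME = TILE * TILE * 4 /*pol products*/ * 8, /* all station/pol pairs: 32768         */
  FLOPS_PER_BYTE = FLOPS_PER_TIME / BYTES_PER_TIME       /* = 32                                 */
};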


Large #Stations: Correlator Performance

[figure: large-#stations correlator TFLOPS and GB/s vs. #stations (up to 896), FirePro S10000 and Tesla K10]

to do: Tesla K20X possibly faster


Pulsar Pipelines


1) search for unknown pulsars

2) observe a known pulsar: many narrow beams

credit: Jason Hessels


in development


Coherent Beam Forming

CV_beam = Σ_stat W_beam,stat ∗ S_stat

add stations (weighted) ➜ greater sensitivity

compensate Δt: change phase

different Δt ➜ different beam

many beams from same input


Coherent Beam Forming Implementation

global memory ➜ local memory

each thread: 1 beam, 1 polarization

station-dependent weights in registers

2 passes of 24 stations

☺ 48 stations, 128 beams ➜ 14.2 FLOPs / byte
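A hedged sketch of one such pass (assumed names and layout; not the LOFAR source): one work-item forms one beam for one polarization, holding its 24 station weights in registers and accumulating the weighted sum over the staged samples.

#define NR_STATIONS_PER_PASS 24

inline float2 cmul(float2 a, float2 b)               // complex multiply
{
  return (float2) (a.x * b.x - a.y * b.y,
                   a.x * b.y + a.y * b.x);
}

void beam_form_pass(__local const float2 (*samples)[NR_STATIONS_PER_PASS], // [time][station]
                    __global const float2 *weights,                        // [beam][station]
                    __global float2 *beams,                                // [beam][time]
                    unsigned beam, unsigned nrTimes)
{
  float2 w[NR_STATIONS_PER_PASS];                    // weights stay in registers
  for (unsigned s = 0; s < NR_STATIONS_PER_PASS; s++)
    w[s] = weights[beam * NR_STATIONS_PER_PASS + s];

  for (unsigned t = 0; t < nrTimes; t++) {
    float2 sum = (float2)(0.0f, 0.0f);
    for (unsigned s = 0; s < NR_STATIONS_PER_PASS; s++)
      sum += cmul(w[s], samples[t][s]);              // complex multiply-accumulate
    beams[beam * nrTimes + t] += sum;                // the second pass of 24 stations adds
  }                                                  // to the first pass's partial sums
}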


Coherent Beam Forming Performance

[figure: beam-forming TFLOPS and GB/s vs. #beams (up to 128), FirePro S10000 and Tesla K10]

48 stations

sawtooth caused by unused threads


Ultra-High Energy Particle (UHEP) Pipeline


store raw antenna voltages in 1.3s circular buffer

create ~50 beams on the moon

detects 10^15‒10^20.5 eV particle collisions

trigger in one/few adjacent beams: freeze/dump captured antenna voltages within 1.3s ???

highly experimental pipeline

Image Courtesy L. Viatour


create ~50 beams on the moon

inverse filter for high time resolution

trigger: peak detection, anti-coincidence check

freeze raw antenna data buffer: max 1.3s latency!
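To make the buffering idea concrete, here is a purely illustrative C sketch of the concept (buffer size, types, and the single data stream are assumptions, not the real system): raw voltages go into a ring buffer holding ~1.3 s of data, and a trigger freezes it before the samples of interest are overwritten.

#include <stdbool.h>
#include <stdatomic.h>

#define NR_SAMPLES 262144               /* illustrative: enough slots for ~1.3 s of one data stream */

struct transient_buffer {
  float       samples[NR_SAMPLES];      /* one receiver path only, for brevity */
  unsigned    head;                     /* next write position                 */
  atomic_bool frozen;                   /* set by the trigger                  */
};

static void write_sample(struct transient_buffer *buf, float sample)
{
  if (atomic_load(&buf->frozen))        /* frozen: keep contents for the dump  */
    return;

  buf->samples[buf->head] = sample;
  buf->head = (buf->head + 1) % NR_SAMPLES;   /* wrap: oldest samples are overwritten */
}

static void trigger_freeze(struct transient_buffer *buf)
{
  atomic_store(&buf->frozen, true);     /* must happen within the 1.3 s window */
}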


UHEP Performance Breakdown

48 stations, 64 beams

[figure: per-kernel TFLOPS and GB/s (beam forming, transpose, inverse FIR, inverse FFT, trigger), Tesla K10 and FirePro S10000]


UHEP Performance Breakdown

3.93 ms data, 48 stations, 488 subbands

most time spent in beam former

I/O overlap

latency ≪ 100 ms

[figure: execution time (ms) per step ‒ input of beam-former weights, input of samples, beam forming, transpose, inverse FIR, inverse FFT, trigger ‒ for Tesla K10 and FirePro S10000]


A Wild Idea


OpenCL on Top of CUDA Driver API

fool CUDA RTS

CPU: implement our own “platform” (ICD)

OpenCL library calls ➜ CUDA Driver API calls

limited subset (proof of concept)

GPU: use OpenCL ➜ PTX compiler (clc/clang/llvm)

☺ efficient

☹ does not support full language

can use:

☺ visual profiler

☺ cuFFT, GPUdirect
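To illustrate the idea (a hedged proof-of-concept sketch, not ASTRON's actual ICD: the our_queue/our_kernel wrappers are hypothetical, and argument/event handling is omitted), an OpenCL enqueue call can be forwarded to the CUDA Driver API roughly like this:

#include <CL/cl.h>
#include <cuda.h>

struct our_queue  { CUstream   cuda_stream; };                    /* hypothetical wrapper types */
struct our_kernel { CUfunction cuda_function; void **arg_ptrs; };

cl_int clEnqueueNDRangeKernel(cl_command_queue queue, cl_kernel kernel,
                              cl_uint work_dim, const size_t *global_work_offset,
                              const size_t *global_work_size, const size_t *local_work_size,
                              cl_uint num_events, const cl_event *wait_list, cl_event *event)
{
  (void) global_work_offset; (void) num_events; (void) wait_list; (void) event;

  CUstream   stream = ((struct our_queue  *) queue)->cuda_stream;
  CUfunction func   = ((struct our_kernel *) kernel)->cuda_function;
  void     **args   = ((struct our_kernel *) kernel)->arg_ptrs;

  unsigned grid[3] = { 1, 1, 1 }, block[3] = { 1, 1, 1 };

  for (cl_uint d = 0; d < work_dim; d++) {
    block[d] = (unsigned) local_work_size[d];
    grid[d]  = (unsigned) (global_work_size[d] / local_work_size[d]);  /* OpenCL global = grid * block */
  }

  return cuLaunchKernel(func, grid[0], grid[1], grid[2],
                        block[0], block[1], block[2],
                        0 /* dynamic shared memory */, stream,
                        args, NULL) == CUDA_SUCCESS ? CL_SUCCESS : CL_OUT_OF_RESOURCES;
}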


Future Work


To Do (LOFAR Correlator)

GPU kernels: dedispersion, flagging

pulsar pipelines

CPU code: work distribution, network reordering (FDR IB, >240 Gb/s), optimizations, monitoring & control, ...


To Do (SKA Research)

Xeon Phi

OpenCL on FPGA

energy efficiency

fully understand all results


Conclusions

many signal-processing GPU kernels: new LOFAR correlator, research for SKA

OpenCL: vendor independent, elegant, but poor NVIDIA support

FirePro S10000 faster than Tesla K10, but immature driver

high efficiency on most important kernels


Thanks

support: Intel, NVIDIA

grants from Dutch national & provincial governments

LOFAR GPU Correlator team: Alexander van Amesfoort, Wouter Klijn, Marcel Loose, Jan David Mol