Exascale Radio Astronomy
DESCRIPTION
Exascale radio astronomy, by M Clark. Contents: GPU computing; GPUs for radio astronomy; the problem is power; astronomy at the exascale. Covers the march of GPUs and what a GPU is (Kepler K20X, 2012: 2688 processing cores, 3995 SP Gflops peak, effective SIMD width of 32 threads, a warp).
TRANSCRIPT
1
EXASCALE RADIO ASTRONOMY
M Clark
2
Contents
GPU Computing
GPUs for Radio Astronomy
The problem is power
Astronomy at the Exascale
3
The March of GPUs
4
What is a GPU?
Kepler K20X (2012): 2688 processing cores, 3995 SP Gflops peak
Effective SIMD width of 32 threads (a warp)
Deep memory hierarchy: as we move away from registers, bandwidth decreases and latency increases
Limited on-chip memory: 65,536 32-bit registers per SM, 48 KiB shared memory per SM, 1.5 MiB L2
Programmed using a thread model (a minimal sketch follows)
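A minimal sketch of that thread model (my illustration, not from the deck): one thread per array element, threads grouped into blocks, and the hardware issuing them in 32-thread warps.

```cuda
#include <cstdio>

// One thread per array element; blocks of 256 threads = 8 warps.
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) x[i] *= a;                           // each thread does one element
}

int main()
{
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));       // unified memory, for brevity
    for (int i = 0; i < n; i++) x[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);    // launch ~4096 blocks
    cudaDeviceSynchronize();

    printf("x[0] = %.1f\n", x[0]);                  // prints 2.0
    cudaFree(x);
    return 0;
}
```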
5
Minimum Port, Big Speed-up
[Diagram: application code is split between processors; only the critical functions are ported to the GPU, while the rest of the sequential code remains on the CPU.]
6
7
Strong CUDA GPU Roadmap
[Chart: normalized SGEMM/W, 2008-2016, y-axis 0-20, rising with each generation: Tesla (CUDA), Fermi (FP64), Kepler (Dynamic Parallelism), Maxwell (DX12), Pascal (Unified Memory, 3D Memory, NVLink)]
8
Introducing NVLink and Stacked Memory
NVLink: GPU high-speed interconnect, 80-200 GB/s, with planned support for POWER CPUs
Stacked memory: 4x higher bandwidth (~1 TB/s), 3x larger capacity, 4x more energy efficient per bit
9
NVLink Enables Data Transfer at the Speed of CPU Memory
[Diagram: Tesla GPU with stacked memory (HBM, ~1 TB/s) linked to the CPU by NVLink (80 GB/s); the CPU's DDR4 memory delivers 50-75 GB/s]
10
Radio Telescope Data Flow
[Diagram: N antennas → digital RF samplers → correlator → calibration & imaging; sampling is O(N), correlation is O(N²), imaging steps range from O(N log N) to O(N²); correlation runs in real time, calibration and imaging run in real time and post-real-time]
11
Where can GPUs be Applied?
Cross correlation: GPUs are ideal
  Performance similar to CGEMM
  High-performance open-source library: https://github.com/GPU-correlators/xGPU
Calibration and imaging
  Gridding: coordinate mapping of input data onto a regular grid
  Arithmetic intensity scales with the convolution kernel size
  A compute-bound problem that maps well to GPUs
  The dominant time sink in the compute pipeline
Other image-processing steps
  CUFFT: highly optimized Fast Fourier Transform library
  PFB: computational intensity increases with the number of taps
  Coordinate transformations and resampling
12
GPUs in Radio Astronomy
Already an essential tool in radio astronomy:
ASKAP – Western Australia
LEDA – United States of America
LOFAR – Netherlands (+ Europe)
MWA – Western Australia
NCRA – India
PAPER – South Africa
13
Cross Correlation on GPUs
Cross correlation is essentially GEMM, with hierarchical locality (a sketch follows)
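Since the deck notes performance similar to CGEMM, here is a minimal sketch of the equivalence (my illustration, not xGPU's code; xGPU uses low-precision inputs and keeps output tiles in registers): with station samples arranged as a complex nstat x ntime matrix S, the visibility matrix is V = S S^H, a single cuBLAS call.

```cuda
#include <cublas_v2.h>

// Visibility matrix V = S * S^H via cuBLAS CGEMM (column-major storage).
// S: nstat x ntime matrix of complex station samples (device memory).
// V: nstat x nstat visibility matrix (device memory).
void correlate(cublasHandle_t handle, const cuComplex *S, cuComplex *V,
               int nstat, int ntime)
{
    const cuComplex one  = make_cuComplex(1.0f, 0.0f);
    const cuComplex zero = make_cuComplex(0.0f, 0.0f);
    // V = 1 * S * S^H + 0 * V
    cublasCgemm(handle, CUBLAS_OP_N, CUBLAS_OP_C,
                nstat, nstat, ntime,
                &one,  S, nstat,   // A = S
                       S, nstat,   // B = S, op(B) = S^H
                &zero, V, nstat);
}
```

Only the lower (or upper) triangle of V is unique, which is why a dedicated correlator kernel can roughly halve the work relative to a full CGEMM.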
14
Correlator Efficiency
[Chart: X-engine GFLOPS per watt vs. year (2008-2016), log-2 scale from 1 to 64, for Tesla, Fermi, Kepler, Maxwell, and Pascal; annotated sustained throughputs of 0.35 TFLOPS, >1 TFLOPS, and >2.5 TFLOPS]
15
Software Correlation Flexibility
Why do software correlation? Software correlators inherently have a great degree of flexibility.
Software correlation allows on-the-fly reconfiguration:
  Subset correlation at increased bandwidth
  Subset correlation at decreased integration time
  Pulsar binning
  Easy classification of data (RFI thresholding)
Software is portable: the correlator has been unchanged since 2010 and is already running on the 2016 architecture.
16
HPC's Biggest Challenge: Power
The power of a 300-petaflop CPU-only supercomputer equals the power for the city of San Francisco.
17
GPUs Power the World's 10 Greenest Supercomputers (Green500)
Rank  MFLOPS/W  Site
1     4,503.17  GSIC Center, Tokyo Tech
2     3,631.86  Cambridge University
3     3,517.84  University of Tsukuba
4     3,185.91  Swiss National Supercomputing Centre (CSCS)
5     3,130.95  ROMEO HPC Center
6     3,068.71  GSIC Center, Tokyo Tech
7     2,702.16  University of Arizona
8     2,629.10  Max-Planck
9     2,629.10  (Financial Institution)
10    2,358.69  CSIRO
37    1,959.90  Intel Endeavor (top Xeon Phi cluster)
49    1,247.57  Météo France (top CPU cluster)
18
The End of Historic Scaling
C Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011
19
The End of Voltage Scaling
In the good old days, leakage was not important and voltage scaled with feature size (Dennard scaling). With capacitance C scaling with L:
L' = L/2
V' = V/2
E' = C'V'² = E/8
f' = 2f
D' = 1/L'² = 4D
P' = P
Halve L and get 4x the transistors and 8x the capability for the same power.
The new reality: leakage has put a floor under the threshold voltage, largely ending voltage scaling:
L' = L/2
V' ≈ V
E' = C'V'² = E/2
f' = 2f
D' = 1/L'² = 4D
P' = 4P
Halve L and get 4x the transistors and 8x the capability, but for 4x the power, or 2x the capability for the same power in 1/4 the area.
20
Major Software Implications
Need to expose massive concurrency: an exaflop at O(GHz) clocks means O(billion-way) parallelism!
Need to expose and exploit locality: data motion is more expensive than computation, with a >100:1 ratio between global and local energy.
Need to start addressing resiliency in the applications.
21
How Parallel is Astronomy?
SKA1-LOW specifications: 1024 dual-pol stations => 2,096,128 visibilities; 262,144 frequency channels; 300 MHz bandwidth
Correlator: 5 Pflops of computation; data-parallel across visibilities; task-parallel across frequency channels; O(trillion-way) parallelism (a quick sanity check follows)
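As a sanity check on those figures (my arithmetic, assuming the usual 8 real operations per complex multiply-accumulate, CMAC, and one complex sample per visibility per Hz of bandwidth):

8 ops/CMAC × 2,096,128 visibilities × 300×10⁶ samples/s ≈ 5.0×10¹⁵ ops/s = 5 Pflops

2,096,128 visibilities × 262,144 channels ≈ 5.5×10¹¹ concurrent CMACs, i.e. O(trillion-way) parallelism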
22
How Parallel is Astronomy?
SKA1-LOW specifications: 1024 dual-pol stations => 2,096,128 visibilities; 262,144 frequency channels; 300 MHz bandwidth
Gridding (W-projection): kernel size 100x100; parallel across kernel size and visibilities (J. Romein); O(10 billion-way) parallelism (illustrated below)
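To make that count concrete: 100×100 = 10⁴ kernel pixels per visibility, times on the order of 10⁶ visibilities in flight, gives ~10¹⁰ independent grid updates. A minimal scatter-style gridding sketch (my illustration; Romein's algorithm instead assigns threads to regions of the grid so that most of these atomic collisions disappear):

```cuda
#include <cuda_runtime.h>

// One thread per (visibility, kernel pixel): nvis * K * K threads in total.
// grid is the complex uv-plane; uv[v] holds the corner grid coordinate of
// visibility v's K x K convolution footprint.
__global__ void gridder(const float2 * __restrict__ vis,
                        const int2   * __restrict__ uv,
                        const float  * __restrict__ kern,  // K*K kernel weights
                        float2 *grid, int gridW, int K, long nvis)
{
    long idx = blockIdx.x * (long)blockDim.x + threadIdx.x;
    if (idx >= nvis * K * K) return;

    long v  = idx / (K * K);             // which visibility
    int  p  = (int)(idx % (K * K));      // which pixel of its footprint
    int  du = p % K, dv = p / K;

    float  w   = kern[p];
    int    gx  = uv[v].x + du;
    int    gy  = uv[v].y + dv;
    float2 val = vis[v];

    // Different visibilities may hit the same cell, hence the atomics.
    atomicAdd(&grid[(long)gy * gridW + gx].x, w * val.x);
    atomicAdd(&grid[(long)gy * gridW + gx].y, w * val.y);
}
```

With W-projection the kernel weights also depend on the visibility's w coordinate, so kern would be indexed per w-plane.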
23
Energy Efficiency Drives Locality
[Diagram: approximate energy per operation on a 28 nm IC (die ~20 mm across): 64-bit DP operation ~20 pJ; 256-bit access to an 8 kB SRAM ~50 pJ; moving 256 bits on-chip costs ~26 pJ for a short hop up to ~256-1000 pJ across the die; efficient off-chip link ~500 pJ; DRAM read/write ~16,000 pJ]
24
Energy Efficiency Drives Locality
[Chart: energy per operation in picojoules, log scale from 10 to 100,000, for FMA, registers, L1, L2, DRAM, stacked DRAM, and point-to-point links]
25
Energy Efficiency Drives Locality
This is observable today. We have lots of tunable parameters:
Register tile size: how much work should each thread do?
Thread block size: how many threads should work together?
Input precision: the size of the input words
Quick-and-dirty cross-correlation example: moving from 4x4 to 8x8 register tiling is 8.5% faster at 5.5% lower power, a 14% improvement in GFLOPS/watt (see the sketch below).
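A minimal sketch of what register tiling means here (my illustration, assuming nstat is a multiple of TILE; xGPU's production kernel is considerably more elaborate): each thread computes a TILE x TILE block of the visibility matrix entirely in registers, so doubling TILE quadruples the arithmetic done per byte loaded.

```cuda
#define TILE 4   // try 8 to trade registers for arithmetic intensity

// Each thread accumulates a TILE x TILE tile of V = S * S^H in registers.
// sig: [ntime][nstat] complex samples; vis: [nstat][nstat] output.
__global__ void xcorr_tiled(const float2 * __restrict__ sig,
                            float2 *vis, int nstat, int ntime)
{
    int row = (blockIdx.y * blockDim.y + threadIdx.y) * TILE;
    int col = (blockIdx.x * blockDim.x + threadIdx.x) * TILE;
    if (row >= nstat || col >= nstat) return;

    float2 acc[TILE][TILE] = {};            // register-resident accumulators

    for (int t = 0; t < ntime; t++) {
        // 2*TILE loads feed TILE*TILE complex multiply-accumulates
        float2 a[TILE], b[TILE];
        for (int i = 0; i < TILE; i++) a[i] = sig[t * nstat + row + i];
        for (int j = 0; j < TILE; j++) b[j] = sig[t * nstat + col + j];

        for (int i = 0; i < TILE; i++)
            for (int j = 0; j < TILE; j++) {
                // acc += a * conj(b): 8 real ops per sample per visibility
                acc[i][j].x += a[i].x * b[j].x + a[i].y * b[j].y;
                acc[i][j].y += a[i].y * b[j].x - a[i].x * b[j].y;
            }
    }

    for (int i = 0; i < TILE; i++)
        for (int j = 0; j < TILE; j++)
            vis[(row + i) * nstat + col + j] = acc[i][j];
}
```

The trade-off the slide quantifies: an 8x8 tile does 4x the flops per load of a 4x4 tile, but needs 128 registers just for the accumulators, so occupancy drops; the sweet spot is found empirically.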
26
SKA1-LOW Sketch
[Diagram: N = 1024 stations → RF samplers (8-bit digitization) → 10 Tb/s → correlator, O(10) PFLOPS → 50 Tb/s → calibration & imaging, O(100) PFLOPS]
27
Energy Efficiency Drives Locality
[Chart repeated from slide 24: energy per operation in picojoules, log scale from 10 to 100,000, for FMA, registers, L1, L2, DRAM, stacked DRAM, and point-to-point links]
28
Do we need Moore's Law?
Moore's law comes from shrinking the process, and it is slowing down:
Dennard scaling is dead
Increasing wafer costs mean it takes longer to move to the next process
29
Improving Energy Efficiency @ Iso-Process
We don't know how to build the perfect processor, so there is a huge focus on improving architectural efficiency and on better understanding a given process.
Compare the Fermi, Kepler, and Maxwell architectures @ 28 nm:
GF117: 96 cores, 192 Gflops peak
GK107: 384 cores, 770 Gflops peak
GM107: 640 cores, 1330 Gflops peak
Use the cross-correlation benchmark and measure GPU power only.
30
Improving Energy Efficiency @ 28 nm
[Chart: cross-correlation GFLOPS per watt for Fermi, Kepler, and Maxwell at 28 nm, y-axis 0-25, with bars annotated 80%, 55%, and 80%]
31
How Hot is Your Supercomputer?
1. TSUBAME-KFC: Tokyo Tech, oil cooled, 4,503 MFLOPS/watt
2. Wilkes Cluster: U. Cambridge, air cooled, 3,631 MFLOPS/watt
Number 1 is 24% more efficient than number 2.
32
Temperature is Power
Power is dynamic plus static: dynamic power is work; static power is leakage.
The dominant static term is sub-threshold leakage, of the standard form
I_leak ∝ A (W/L) e^((V_gs − V_th) / (n v_T))
Voltage terms: V_gs, gate-to-source voltage; V_th, switching threshold voltage; n, transistor sub-threshold swing coefficient
Device specifics: A, technology-specific constant; L and W, device channel length and width
Thermal voltage: v_T = kT/q, with k = 8.62×10⁻⁵ eV/K, giving v_T ≈ 26 mV at room temperature
Because v_T rises with temperature, leakage, and hence static power, grows as the chip heats up.
33
Temperature is Power
[Chart: measured power versus temperature for a Tesla K20m (GK110, 28 nm) and a GeForce GTX 580 (GF110, 40 nm)]
34
Tuning for Power Efficiency
A given processor does not have a fixed power efficiency; it depends on:
Clock frequency
Voltage
Temperature
Algorithm
Tune in this multi-dimensional space for power efficiency. E.g., cross-correlation on a Kepler K20: 12.96 -> 18.34 GFLOPS/watt (a sketch of such a sweep follows).
Bad news: no two processors are exactly alike.
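A minimal sketch of one axis of that search (my illustration using NVML; run_xcorr_benchmark() is a hypothetical hook standing in for a timed cross-correlation run, and setting application clocks requires a Tesla-class board and sufficient privileges):

```cuda
#include <nvml.h>
#include <cstdio>

// Hypothetical hook: launch and time the cross-correlation kernel,
// returning sustained GFLOPS. Stubbed out here.
static double run_xcorr_benchmark() { return 0.0; }

int main()
{
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    unsigned int memClocks[32], nMem = 32;
    nvmlDeviceGetSupportedMemoryClocks(dev, &nMem, memClocks);

    for (unsigned int m = 0; m < nMem; m++) {
        unsigned int grClocks[128], nGr = 128;
        nvmlDeviceGetSupportedGraphicsClocks(dev, memClocks[m], &nGr, grClocks);

        for (unsigned int g = 0; g < nGr; g++) {
            // Pin the application clocks for this sample point.
            nvmlDeviceSetApplicationsClocks(dev, memClocks[m], grClocks[g]);

            double gflops = run_xcorr_benchmark();

            unsigned int mW = 0;             // board power draw in milliwatts
            nvmlDeviceGetPowerUsage(dev, &mW);

            printf("mem %4u MHz core %4u MHz: %7.2f GFLOPS %6.1f W %6.2f GFLOPS/W\n",
                   memClocks[m], grClocks[g], gflops, mW / 1000.0,
                   gflops / (mW / 1000.0));
        }
    }

    nvmlDeviceResetApplicationsClocks(dev);
    nvmlShutdown();
    return 0;
}
```

Temperature makes the search stateful: the same clocks on a hot chip leak more power, so measurements should be taken at a steady operating temperature.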
35
Precision is Power
Power scales with the square of the multiplier width (approximately). Most computation is done in FP32 / FP64; we should use the minimum precision the science requires.
Maxwell GPUs have 16-bit integer multiply-add at the FP32 rate.
Algorithms should increasingly use hierarchical precision: only invoke high precision when necessary (see the sketch below).
Signal-processing folks have known this for a long time; the lesson is feeding back into the HPC community...
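A minimal sketch of hierarchical precision in a correlator inner loop (my illustration, assuming 8-bit complex samples, as in the SKA1-LOW sketch above): multiply narrow integers, accumulate in 32-bit integers, and touch floating point only once per integration.

```cuda
#include <cstdint>

// Complex multiply-accumulate on 8-bit samples with 32-bit accumulators.
// Products of two 8-bit values fit easily in 32 bits, so many thousands of
// samples can be integrated before any floating-point arithmetic is needed.
struct Accum { int32_t re, im; };

__device__ inline void cmac8(Accum &acc, char2 a, char2 b)
{
    // acc += a * conj(b), entirely in integer arithmetic
    acc.re += (int32_t)a.x * b.x + (int32_t)a.y * b.y;
    acc.im += (int32_t)a.y * b.x - (int32_t)a.x * b.y;
}

// Convert to floating point once per integration, not once per sample.
__device__ inline float2 finalize(const Accum &acc, float scale)
{
    return make_float2(acc.re * scale, acc.im * scale);
}
```

This is the pattern fixed-point correlators use: the expensive wide arithmetic is reserved for the final scaling, while the hot loop runs at the narrow-integer rate.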
36
Conclusions
Astronomy has an insatiable appetite for compute, and many-core processors are a perfect match for its processing pipeline. Power is a problem, but:
Astronomy has oodles of parallelism
Key algorithms possess locality
Precision requirements are well understood
Scientists and engineers are wedded to the problem
Astronomy is perhaps the ideal application for the exascale.