Exascale Radio Astronomy
DESCRIPTION
Exascale radio astronomy, by M Clark. Contents: GPU computing; GPUs for radio astronomy; the problem is power; astronomy at the exascale. Covers the march of GPUs and what a GPU is (Kepler K20X, 2012: 2688 processing cores, 3995 SP Gflops peak, effective SIMD width of 32 threads, a warp).
TRANSCRIPT
1
EXASCALE RADIO ASTRONOMY
M Clark
2
Contents
GPU Computing
GPUs for Radio Astronomy
The problem is power
Astronomy at the Exascale
3
The March of GPUs
4
What is a GPU?
Kepler K20X (2012): 2688 processing cores, 3995 SP Gflops peak
Effective SIMD width of 32 threads (a warp)
Deep memory hierarchy: as we move away from registers, bandwidth decreases and latency increases
Limited on-chip memory: 65,536 32-bit registers per SM, 48 KiB shared memory per SM, 1.5 MiB L2
Programmed using a thread model (a minimal sketch follows)
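A minimal sketch of that thread model (my illustration, not from the deck): one thread per array element, threads grouped into blocks, and the hardware issuing them in 32-thread warps.

```cuda
#include <cstdio>

// One thread per array element; blocks of 256 threads = 8 warps.
__global__ void scale(float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n) x[i] *= a;                           // each thread does one element
}

int main()
{
    const int n = 1 << 20;
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));       // unified memory, for brevity
    for (int i = 0; i < n; i++) x[i] = 1.0f;

    scale<<<(n + 255) / 256, 256>>>(x, 2.0f, n);    // launch ~4096 blocks
    cudaDeviceSynchronize();

    printf("x[0] = %.1f\n", x[0]);                  // prints 2.0
    cudaFree(x);
    return 0;
}
```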
5
Minimum Port, Big Speed-up
[Diagram: application code is split between processors; only the critical functions are ported to the GPU, while the rest of the sequential code remains on the CPU.]
6
7
Strong CUDA GPU Roadmap
[Chart: normalized SGEMM/W, 2008-2016, y-axis 0-20, rising with each generation: Tesla (CUDA), Fermi (FP64), Kepler (Dynamic Parallelism), Maxwell (DX12), Pascal (Unified Memory, 3D Memory, NVLink)]
8
Introducing NVLink and Stacked Memory
NVLink: GPU high-speed interconnect, 80-200 GB/s, with planned support for POWER CPUs
Stacked memory: 4x higher bandwidth (~1 TB/s), 3x larger capacity, 4x more energy efficient per bit
9
NVLink Enables Data Transfer at the Speed of CPU Memory
[Diagram: Tesla GPU with stacked memory (HBM, ~1 TB/s) linked to the CPU by NVLink (80 GB/s); the CPU's DDR4 memory delivers 50-75 GB/s]
10
Radio Telescope Data Flow
[Diagram: N antennas → digital RF samplers → correlator → calibration & imaging; sampling is O(N), correlation is O(N²), imaging steps range from O(N log N) to O(N²); correlation runs in real time, calibration and imaging run in real time and post-real-time]
11
Where can GPUs be Applied?
Cross correlation: GPUs are ideal
  Performance similar to CGEMM
  High-performance open-source library: https://github.com/GPU-correlators/xGPU
Calibration and imaging
  Gridding: coordinate mapping of input data onto a regular grid
  Arithmetic intensity scales with the convolution kernel size
  A compute-bound problem that maps well to GPUs
  The dominant time sink in the compute pipeline
Other image-processing steps
  CUFFT: highly optimized Fast Fourier Transform library
  PFB: computational intensity increases with the number of taps
  Coordinate transformations and resampling
12
GPUs in Radio Astronomy
Already an essential tool in radio astronomy:
ASKAP – Western Australia
LEDA – United States of America
LOFAR – Netherlands (+ Europe)
MWA – Western Australia
NCRA – India
PAPER – South Africa
13
Cross Correlation on GPUs
Cross correlation is essentially GEMM, with hierarchical locality (a sketch follows)
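Since the deck notes performance similar to CGEMM, here is a minimal sketch of the equivalence (my illustration, not xGPU's code; xGPU uses low-precision inputs and keeps output tiles in registers): with station samples arranged as a complex nstat x ntime matrix S, the visibility matrix is V = S S^H, a single cuBLAS call.

```cuda
#include <cublas_v2.h>

// Visibility matrix V = S * S^H via cuBLAS CGEMM (column-major storage).
// S: nstat x ntime matrix of complex station samples (device memory).
// V: nstat x nstat visibility matrix (device memory).
void correlate(cublasHandle_t handle, const cuComplex *S, cuComplex *V,
               int nstat, int ntime)
{
    const cuComplex one  = make_cuComplex(1.0f, 0.0f);
    const cuComplex zero = make_cuComplex(0.0f, 0.0f);
    // V = 1 * S * S^H + 0 * V
    cublasCgemm(handle, CUBLAS_OP_N, CUBLAS_OP_C,
                nstat, nstat, ntime,
                &one,  S, nstat,   // A = S
                       S, nstat,   // B = S, op(B) = S^H
                &zero, V, nstat);
}
```

Only the lower (or upper) triangle of V is unique, which is why a dedicated correlator kernel can roughly halve the work relative to a full CGEMM.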
14
Correlator Efficiency
[Chart: X-engine GFLOPS per watt vs. year (2008-2016), log-2 scale from 1 to 64, for Tesla, Fermi, Kepler, Maxwell, and Pascal; annotated sustained throughputs of 0.35 TFLOPS, >1 TFLOPS, and >2.5 TFLOPS]
15
Software Correlation Flexibility
Why do software correlation? Software correlators inherently have a great degree of flexibility.
Software correlation allows on-the-fly reconfiguration:
  Subset correlation at increased bandwidth
  Subset correlation at decreased integration time
  Pulsar binning
  Easy classification of data (RFI thresholding)
Software is portable: the correlator has been unchanged since 2010 and is already running on the 2016 architecture.
16
HPC's Biggest Challenge: Power
The power of a 300-petaflop CPU-only supercomputer equals the power for the city of San Francisco.
17
GPUs Power the World's 10 Greenest Supercomputers (Green500)
Rank  MFLOPS/W  Site
1     4,503.17  GSIC Center, Tokyo Tech
2     3,631.86  Cambridge University
3     3,517.84  University of Tsukuba
4     3,185.91  Swiss National Supercomputing Centre (CSCS)
5     3,130.95  ROMEO HPC Center
6     3,068.71  GSIC Center, Tokyo Tech
7     2,702.16  University of Arizona
8     2,629.10  Max-Planck
9     2,629.10  (Financial Institution)
10    2,358.69  CSIRO
37    1,959.90  Intel Endeavor (top Xeon Phi cluster)
49    1,247.57  Météo France (top CPU cluster)
18
The End of Historic Scaling
C Moore, Data Processing in ExaScale-Class Computer Systems, Salishan, April 2011
19
The End of Voltage Scaling
In the good old days, leakage was not important and voltage scaled with feature size (Dennard scaling). With capacitance C scaling with L:
L' = L/2
V' = V/2
E' = C'V'² = E/8
f' = 2f
D' = 1/L'² = 4D
P' = P
Halve L and get 4x the transistors and 8x the capability for the same power.
The new reality: leakage has put a floor under the threshold voltage, largely ending voltage scaling:
L' = L/2
V' ≈ V
E' = C'V'² = E/2
f' = 2f
D' = 1/L'² = 4D
P' = 4P
Halve L and get 4x the transistors and 8x the capability, but for 4x the power, or 2x the capability for the same power in 1/4 the area.
20
Major Software Implications
Need to expose massive concurrency: an exaflop at O(GHz) clocks means O(billion-way) parallelism!
Need to expose and exploit locality: data motion is more expensive than computation, with a >100:1 ratio between global and local energy.
Need to start addressing resiliency in the applications.
21
How Parallel is Astronomy?
SKA1-LOW specifications: 1024 dual-pol stations => 2,096,128 visibilities; 262,144 frequency channels; 300 MHz bandwidth
Correlator: 5 Pflops of computation; data-parallel across visibilities; task-parallel across frequency channels; O(trillion-way) parallelism (a quick sanity check follows)
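As a sanity check on those figures (my arithmetic, assuming the usual 8 real operations per complex multiply-accumulate, CMAC, and one complex sample per visibility per Hz of bandwidth):

8 ops/CMAC × 2,096,128 visibilities × 300×10⁶ samples/s ≈ 5.0×10¹⁵ ops/s = 5 Pflops

2,096,128 visibilities × 262,144 channels ≈ 5.5×10¹¹ concurrent CMACs, i.e. O(trillion-way) parallelism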
22
How Parallel is Astronomy?
SKA1-LOW specifications: 1024 dual-pol stations => 2,096,128 visibilities; 262,144 frequency channels; 300 MHz bandwidth
Gridding (W-projection): kernel size 100x100; parallel across kernel size and visibilities (J. Romein); O(10 billion-way) parallelism (illustrated below)
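To make that count concrete: 100×100 = 10⁴ kernel pixels per visibility, times on the order of 10⁶ visibilities in flight, gives ~10¹⁰ independent grid updates. A minimal scatter-style gridding sketch (my illustration; Romein's algorithm instead assigns threads to regions of the grid so that most of these atomic collisions disappear):

```cuda
#include <cuda_runtime.h>

// One thread per (visibility, kernel pixel): nvis * K * K threads in total.
// grid is the complex uv-plane; uv[v] holds the corner grid coordinate of
// visibility v's K x K convolution footprint.
__global__ void gridder(const float2 * __restrict__ vis,
                        const int2   * __restrict__ uv,
                        const float  * __restrict__ kern,  // K*K kernel weights
                        float2 *grid, int gridW, int K, long nvis)
{
    long idx = blockIdx.x * (long)blockDim.x + threadIdx.x;
    if (idx >= nvis * K * K) return;

    long v  = idx / (K * K);             // which visibility
    int  p  = (int)(idx % (K * K));      // which pixel of its footprint
    int  du = p % K, dv = p / K;

    float  w   = kern[p];
    int    gx  = uv[v].x + du;
    int    gy  = uv[v].y + dv;
    float2 val = vis[v];

    // Different visibilities may hit the same cell, hence the atomics.
    atomicAdd(&grid[(long)gy * gridW + gx].x, w * val.x);
    atomicAdd(&grid[(long)gy * gridW + gx].y, w * val.y);
}
```

With W-projection the kernel weights also depend on the visibility's w coordinate, so kern would be indexed per w-plane.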
23
Energy Efficiency Drives Locality
[Diagram: approximate energy per operation on a 28 nm IC (die ~20 mm across): 64-bit DP operation ~20 pJ; 256-bit access to an 8 kB SRAM ~50 pJ; moving 256 bits on-chip costs ~26 pJ for a short hop up to ~256-1000 pJ across the die; efficient off-chip link ~500 pJ; DRAM read/write ~16,000 pJ]
24
Energy Efficiency Drives Locality
[Chart: energy per operation in picojoules, log scale from 10 to 100,000, for FMA, registers, L1, L2, DRAM, stacked DRAM, and point-to-point links]
25
Energy Efficiency Drives Locality
This is observable today. We have lots of tunable parameters:
Register tile size: how much work should each thread do?
Thread block size: how many threads should work together?
Input precision: the size of the input words
Quick-and-dirty cross-correlation example: moving from 4x4 to 8x8 register tiling is 8.5% faster at 5.5% lower power, a 14% improvement in GFLOPS/watt (see the sketch below).
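A minimal sketch of what register tiling means here (my illustration, assuming nstat is a multiple of TILE; xGPU's production kernel is considerably more elaborate): each thread computes a TILE x TILE block of the visibility matrix entirely in registers, so doubling TILE quadruples the arithmetic done per byte loaded.

```cuda
#define TILE 4   // try 8 to trade registers for arithmetic intensity

// Each thread accumulates a TILE x TILE tile of V = S * S^H in registers.
// sig: [ntime][nstat] complex samples; vis: [nstat][nstat] output.
__global__ void xcorr_tiled(const float2 * __restrict__ sig,
                            float2 *vis, int nstat, int ntime)
{
    int row = (blockIdx.y * blockDim.y + threadIdx.y) * TILE;
    int col = (blockIdx.x * blockDim.x + threadIdx.x) * TILE;
    if (row >= nstat || col >= nstat) return;

    float2 acc[TILE][TILE] = {};            // register-resident accumulators

    for (int t = 0; t < ntime; t++) {
        // 2*TILE loads feed TILE*TILE complex multiply-accumulates
        float2 a[TILE], b[TILE];
        for (int i = 0; i < TILE; i++) a[i] = sig[t * nstat + row + i];
        for (int j = 0; j < TILE; j++) b[j] = sig[t * nstat + col + j];

        for (int i = 0; i < TILE; i++)
            for (int j = 0; j < TILE; j++) {
                // acc += a * conj(b): 8 real ops per sample per visibility
                acc[i][j].x += a[i].x * b[j].x + a[i].y * b[j].y;
                acc[i][j].y += a[i].y * b[j].x - a[i].x * b[j].y;
            }
    }

    for (int i = 0; i < TILE; i++)
        for (int j = 0; j < TILE; j++)
            vis[(row + i) * nstat + col + j] = acc[i][j];
}
```

The trade-off the slide quantifies: an 8x8 tile does 4x the flops per load of a 4x4 tile, but needs 128 registers just for the accumulators, so occupancy drops; the sweet spot is found empirically.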
26
SKA1-LOW Sketch
[Diagram: N = 1024 stations → RF samplers (8-bit digitization) → 10 Tb/s → correlator, O(10) PFLOPS → 50 Tb/s → calibration & imaging, O(100) PFLOPS]
27
Energy Efficiency Drives Locality
[Chart repeated from slide 24: energy per operation in picojoules, log scale from 10 to 100,000, for FMA, registers, L1, L2, DRAM, stacked DRAM, and point-to-point links]
28
Do we need Moore's Law?
Moore's law comes from shrinking the process, and it is slowing down:
Dennard scaling is dead
Increasing wafer costs mean it takes longer to move to the next process
29
Improving Energy Efficiency @ Iso-Process
We don't know how to build the perfect processor, so there is a huge focus on improving architectural efficiency and on better understanding a given process.
Compare the Fermi, Kepler, and Maxwell architectures @ 28 nm:
GF117: 96 cores, 192 Gflops peak
GK107: 384 cores, 770 Gflops peak
GM107: 640 cores, 1330 Gflops peak
Use the cross-correlation benchmark and measure GPU power only.
30
Improving Energy Efficiency @ 28 nm
[Chart: cross-correlation GFLOPS per watt for Fermi, Kepler, and Maxwell at 28 nm, y-axis 0-25, with bars annotated 80%, 55%, and 80%]
31
How Hot is Your Supercomputer?
1. TSUBAME-KFC: Tokyo Tech, oil cooled, 4,503 MFLOPS/watt
2. Wilkes Cluster: U. Cambridge, air cooled, 3,631 MFLOPS/watt
Number 1 is 24% more efficient than number 2.
32
Temperature is Power
Power is dynamic plus static: dynamic power is work; static power is leakage.
The dominant static term is sub-threshold leakage, of the standard form
I_leak ∝ A (W/L) e^((V_gs − V_th) / (n v_T))
Voltage terms: V_gs, gate-to-source voltage; V_th, switching threshold voltage; n, transistor sub-threshold swing coefficient
Device specifics: A, technology-specific constant; L and W, device channel length and width
Thermal voltage: v_T = kT/q, with k = 8.62×10⁻⁵ eV/K, giving v_T ≈ 26 mV at room temperature
Because v_T rises with temperature, leakage, and hence static power, grows as the chip heats up.
33
Temperature is Power
[Chart: measured power versus temperature for a Tesla K20m (GK110, 28 nm) and a GeForce GTX 580 (GF110, 40 nm)]
34
Tuning for Power Efficiency
A given processor does not have a fixed power efficiency; it depends on:
Clock frequency
Voltage
Temperature
Algorithm
Tune in this multi-dimensional space for power efficiency. E.g., cross-correlation on a Kepler K20: 12.96 -> 18.34 GFLOPS/watt (a sketch of such a sweep follows).
Bad news: no two processors are exactly alike.
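A minimal sketch of one axis of that search (my illustration using NVML; run_xcorr_benchmark() is a hypothetical hook standing in for a timed cross-correlation run, and setting application clocks requires a Tesla-class board and sufficient privileges):

```cuda
#include <nvml.h>
#include <cstdio>

// Hypothetical hook: launch and time the cross-correlation kernel,
// returning sustained GFLOPS. Stubbed out here.
static double run_xcorr_benchmark() { return 0.0; }

int main()
{
    nvmlInit();
    nvmlDevice_t dev;
    nvmlDeviceGetHandleByIndex(0, &dev);

    unsigned int memClocks[32], nMem = 32;
    nvmlDeviceGetSupportedMemoryClocks(dev, &nMem, memClocks);

    for (unsigned int m = 0; m < nMem; m++) {
        unsigned int grClocks[128], nGr = 128;
        nvmlDeviceGetSupportedGraphicsClocks(dev, memClocks[m], &nGr, grClocks);

        for (unsigned int g = 0; g < nGr; g++) {
            // Pin the application clocks for this sample point.
            nvmlDeviceSetApplicationsClocks(dev, memClocks[m], grClocks[g]);

            double gflops = run_xcorr_benchmark();

            unsigned int mW = 0;             // board power draw in milliwatts
            nvmlDeviceGetPowerUsage(dev, &mW);

            printf("mem %4u MHz core %4u MHz: %7.2f GFLOPS %6.1f W %6.2f GFLOPS/W\n",
                   memClocks[m], grClocks[g], gflops, mW / 1000.0,
                   gflops / (mW / 1000.0));
        }
    }

    nvmlDeviceResetApplicationsClocks(dev);
    nvmlShutdown();
    return 0;
}
```

Temperature makes the search stateful: the same clocks on a hot chip leak more power, so measurements should be taken at a steady operating temperature.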
35
Precision is Power
Power scales with the square of the multiplier width (approximately). Most computation is done in FP32 / FP64; we should use the minimum precision the science requires.
Maxwell GPUs have 16-bit integer multiply-add at the FP32 rate.
Algorithms should increasingly use hierarchical precision: only invoke high precision when necessary (see the sketch below).
Signal-processing folks have known this for a long time; the lesson is feeding back into the HPC community...
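A minimal sketch of hierarchical precision in a correlator inner loop (my illustration, assuming 8-bit complex samples, as in the SKA1-LOW sketch above): multiply narrow integers, accumulate in 32-bit integers, and touch floating point only once per integration.

```cuda
#include <cstdint>

// Complex multiply-accumulate on 8-bit samples with 32-bit accumulators.
// Products of two 8-bit values fit easily in 32 bits, so many thousands of
// samples can be integrated before any floating-point arithmetic is needed.
struct Accum { int32_t re, im; };

__device__ inline void cmac8(Accum &acc, char2 a, char2 b)
{
    // acc += a * conj(b), entirely in integer arithmetic
    acc.re += (int32_t)a.x * b.x + (int32_t)a.y * b.y;
    acc.im += (int32_t)a.y * b.x - (int32_t)a.x * b.y;
}

// Convert to floating point once per integration, not once per sample.
__device__ inline float2 finalize(const Accum &acc, float scale)
{
    return make_float2(acc.re * scale, acc.im * scale);
}
```

This is the pattern fixed-point correlators use: the expensive wide arithmetic is reserved for the final scaling, while the hot loop runs at the narrow-integer rate.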
36
Conclusions
Astronomy has an insatiable appetite for compute, and many-core processors are a perfect match for its processing pipeline. Power is a problem, but:
Astronomy has oodles of parallelism
Key algorithms possess locality
Precision requirements are well understood
Scientists and engineers are wedded to the problem
Astronomy is perhaps the ideal application for the exascale.