Designing High Performance Computing Architectures for Reliable Space Applications



DESCRIPTION

PhD Defense Talk

TRANSCRIPT

Page 1: Designing High Performance Computing Architectures for Reliable Space Applications

Designing High Performance Computing Architectures for Reliable Space Applications

Fisnik Kraja
PhD Defense, December 6, 2012

Advisors:
1st: Prof. Dr. Arndt Bode
2nd: Prof. Dr. Xavier Martorell

Page 2: Designing High Performance Computing Architectures for Reliable Space Applications

Outline

1. Motivation

2. The Proposed Computing Architecture

3. The 2DSSAR Benchmarking Application

4. Optimizations and Benchmarking Results
– Shared memory multiprocessor systems
– Distributed memory multiprocessor systems
– Heterogeneous CPU/GPU systems

5. Conclusions


Page 3: Designing High Performance Computing Architectures for Reliable Space Applications

Motivation

• Future space applications will demand:
– Increased on-board computing capabilities
– Preserved system reliability

• Future missions:
– Optical - IR Sounder: 4.3 GMult/s + 5.7 GAdd/s, 2.2 Gbit/s
– Radar/Microwave - HRWS SAR: 1 Tera 16-bit fixed-point operations/s, 603.1 Gbit/s

• Challenges:
– Costs (ASICs are very expensive)
– Modularity (component change and reuse)
– Portability (across various spacecraft platforms)
– Scalability (hardware and software)
– Programmability (compatible with various environments)
– Efficiency (power consumption and size)


Page 4: Designing High Performance Computing Architectures for Reliable Space Applications

The Proposed Architecture

(Architecture block diagram)
Legend:
RHMU – Radiation-Hardened Management Unit
PPN – Parallel Processing Node
Control Bus, Data Bus


Page 5: Designing High Performance Computing Architectures for Reliable Space Applications

The 2DSSAR Application
2-Dimensional Spotlight Synthetic Aperture Radar

Synthetic Data Generation (SDG):
Synthetic SAR returns are generated from a uniform grid of point reflectors.

SAR Sensor Processing (SSP):
Read Generated Data -> Image Reconstruction (IR) -> Write Reconstructed Image.
The reconstructed SAR image is obtained by applying a 2D Fourier Matched Filtering and Interpolation Algorithm.

(Figure: Illuminated swath in side-looking Spotlight SAR, showing the spacecraft flight path, altitude, azimuth, swath range and swath cross-range.)
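As a reading aid, the IR data flow described above can be outlined in code. The following is a minimal, hypothetical C skeleton: the function names, data layout and toy sizes are assumptions, and the stage bodies are stubs, not the benchmark implementation.

/* Hypothetical outline (not the thesis code) of the IR stage. In the real
 * benchmark the stubs below wrap per-pulse 1D FFTs, the matched-filter
 * multiply, the polar-to-rectangular interpolation loop (the dominant cost),
 * and the final 2D inverse FFT with FFT-shift and transposition. */
#include <complex.h>
#include <stdlib.h>

typedef float complex cfloat;

static void range_fft(cfloat *d, int np, int ns)            { (void)d; (void)np; (void)ns; }
static void matched_filter(cfloat *d, int np, int ns)       { (void)d; (void)np; (void)ns; }
static void interp_polar_to_rect(cfloat *d, cfloat *img,
                                 int np, int ns)             { (void)d; (void)img; (void)np; (void)ns; }
static void ifft_2d(cfloat *img, int np, int ns)            { (void)img; (void)np; (void)ns; }

/* 2D Fourier Matched Filtering and Interpolation as a sequence of stages. */
static void image_reconstruction(cfloat *returns, cfloat *image, int np, int ns)
{
    range_fft(returns, np, ns);                 /* per-pulse 1D FFTs            */
    matched_filter(returns, np, ns);            /* multiply by reference signal */
    interp_polar_to_rect(returns, image, np, ns); /* dominant cost of IR        */
    ifft_2d(image, np, ns);                     /* back to the image domain     */
}

int main(void)
{
    int np = 4, ns = 8;                         /* toy sizes only */
    cfloat *in  = calloc((size_t)np * ns, sizeof(cfloat));
    cfloat *out = calloc((size_t)np * ns, sizeof(cfloat));
    image_reconstruction(in, out, np, ns);
    free(in);
    free(out);
    return 0;
}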

Page 6: Designing High Performance Computing Architectures for Reliable Space Applications

Profiling SAR Image Reconstruction

          Coverage (km)   Memory (GB)   FLOP (Giga)   Time (s)   Goal: Speedup
Scale=10  3.8 x 2.5       0.25          29.54         23         30x
Scale=30  11.4 x 7.5      2             115.03        230        30x
Scale=60  22.8 x 15       8             1302          926        30x

IR Profiling (share of IR execution time):
Interpolation Loop: 69%
FFTs: 22%
Compression and Decompression Loops: 7%
Transposition and FFT-shifting: 2%

Page 7: Designing High Performance Computing Architectures for Reliable Space Applications

IR Optimizations for Shared Memory Multiprocessing

• OpenMP
– General optimizations:
• Thread pinning and first-touch policy
• Static/dynamic scheduling
– FFT:
• Manual multithreading of the loops of 1D-FFTs (not the FFT itself)
– Interpolation loop (polar to rectangular coordinates), see the sketch below:
• Atomic operations
• Replication and Reduction (R&R)

• Other Programming Models
– OmpSs, MPI, MPI+OpenMP
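To make the two interpolation-loop strategies concrete, here is a minimal, self-contained OpenMP sketch in C. The scatter mapping, array sizes and chunk size are illustrative assumptions, not the benchmark's actual polar-to-rectangular interpolation.

/* Minimal sketch (assumed, not the thesis code): atomic updates vs.
 * replication & reduction (R&R) for a scatter-style accumulation loop. */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N_IN  (1L << 20)   /* polar-grid input samples (toy size)     */
#define N_OUT (1L << 16)   /* rectangular-grid output bins (toy size) */

int main(void)
{
    float *out_atomic = calloc(N_OUT, sizeof(float));
    float *out_rr     = calloc(N_OUT, sizeof(float));

    /* Variant 1: atomic updates -- correct, but conflicting writes serialize. */
    #pragma omp parallel for schedule(dynamic, 1024)
    for (long i = 0; i < N_IN; i++) {
        long bin = (i * 7919L) % N_OUT;   /* stand-in for the polar->rect mapping */
        #pragma omp atomic
        out_atomic[bin] += 1.0f;
    }

    /* Variant 2: R&R -- each thread accumulates into a private copy of the
     * output grid; the copies are merged afterwards (memory traded for speed). */
    #pragma omp parallel
    {
        float *priv = calloc(N_OUT, sizeof(float));
        #pragma omp for schedule(dynamic, 1024)
        for (long i = 0; i < N_IN; i++) {
            long bin = (i * 7919L) % N_OUT;
            priv[bin] += 1.0f;            /* no synchronization needed here */
        }
        #pragma omp critical              /* one merge per thread */
        for (long b = 0; b < N_OUT; b++)
            out_rr[b] += priv[b];
        free(priv);
    }

    printf("atomic[0]=%.0f  r&r[0]=%.0f\n", out_atomic[0], out_rr[0]);
    free(out_atomic);
    free(out_rr);
    return 0;
}

In the results on the next page, the R&R variants consistently outperform the atomic ones, at the cost of one private copy of the output grid per thread.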


Page 8: Designing High Performance Computing Architectures for Reliable Space Applications

IR on a Shared Memory Node

The ccNUMA Node:
2 x Nehalem CPUs, 6 cores / 12 threads each, 2.93-3.33 GHz
QPI: 25.6 GB/s, IMC: 32 GB/s
Lithography: 32 nm, TDP: 95 W
Memory: 2 x 3 x 6 GB DDR3 SDRAM (1066 MHz), total 36 GB

Speedup (Scale=60) over cores (threads):

Cores (Threads)   1     2      4      6      8      10     12     12 (24)
OpenMP Atomic     1     1.55   3.05   4.45   5.81   6.94   7.98   10.54
OpenMP R&R        1     1.78   3.51   5.02   6.36   7.74   9.03   11.06
OmpSs Atomic      1     1.61   3.12   4.62   5.92   7.02   8.13   10.72
OmpSs R&R         1     1.93   3.73   5.52   7.13   8.65   10.37  12.37
MPI R&R           1     1.92   3.65   5.30   6.57   7.94   9.81   11.20
MPI+OpenMP        1     1.89   3.54   4.88   6.40   8.02   9.94   11.69

Page 9: Designing High Performance Computing Architectures for Reliable Space Applications

IR Optimizations for Distributed Memory Multiprocessing

• Programming Paradigms
– MPI
• Data replication
• Process creation overhead
– MPI+OpenMP
• 1 process/node
• 1 thread/core

• Communication Optimizations
– Transposition (new: All-to-All), see the sketch below
– FFT-shift (new: Peer-to-Peer)
– Interpolation loop: Replication and Reduction

• Pipelined IR
– Each node reconstructs a separate SAR image

(Figures: block-wise distribution of data tiles D00-D33 and A/B/C/D blocks across processes PID 0-3, illustrating the transposition and FFT-shift exchanges.)
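As an illustration of the All-to-All transposition pattern mentioned above, the following minimal MPI sketch transposes a small distributed matrix with a single collective call. The matrix size, packing layout and data type are illustrative assumptions, not the 2DSSAR implementation (the global dimension must be divisible by the number of ranks).

/* Minimal sketch (assumed, not the thesis code) of a distributed matrix
 * transposition using one MPI_Alltoall collective instead of many
 * point-to-point messages. Each of P ranks owns N/P rows of an N x N
 * matrix; after the exchange it owns N/P columns, stored as rows. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 8   /* global matrix dimension (toy size) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int rows = N / size;                                /* rows owned per rank */

    double *local  = malloc(rows * N * sizeof(double));
    double *packed = malloc(rows * N * sizeof(double));
    double *rbuf   = malloc(rows * N * sizeof(double));
    double *trans  = malloc(rows * N * sizeof(double));

    for (int r = 0; r < rows; r++)                      /* element = global row*100 + col */
        for (int c = 0; c < N; c++)
            local[r * N + c] = (rank * rows + r) * 100 + c;

    /* Pack: group the columns destined for each rank into contiguous blocks. */
    for (int p = 0; p < size; p++)
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < rows; c++)
                packed[(p * rows + r) * rows + c] = local[r * N + p * rows + c];

    /* One collective exchange of rows*rows blocks between all rank pairs. */
    MPI_Alltoall(packed, rows * rows, MPI_DOUBLE,
                 rbuf,   rows * rows, MPI_DOUBLE, MPI_COMM_WORLD);

    /* Unpack: locally transpose each received block into the result. */
    for (int p = 0; p < size; p++)
        for (int r = 0; r < rows; r++)
            for (int c = 0; c < rows; c++)
                trans[c * N + p * rows + r] = rbuf[(p * rows + r) * rows + c];

    if (rank == 0)                                      /* trans(0,1) should be A(1,0) = 100 */
        printf("rank 0: trans(0,1) = %.0f\n", trans[0 * N + 1]);

    free(local); free(packed); free(rbuf); free(trans);
    MPI_Finalize();
    return 0;
}

Expressing the exchange as a single collective lets the MPI library schedule the communication, rather than issuing many individual point-to-point messages.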


Page 10: Designing High Performance Computing Architectures for Reliable Space Applications

IR on the Distributed Memory System

The Nehalem Cluster (per node):
2 x 4 cores, 16 threads, 2.8-3.2 GHz, 12/24/48 GB RAM
QPI: 25.6 GB/s, IMC: 32 GB/s
Lithography: 45 nm, TDP: 95 W/CPU
InfiniBand network, fat-tree topology, 6 backbone switches, 24 leaf switches

Speedup (Scale=60) over number of nodes (cores):

No. of Nodes (Cores)             1 (8)   2 (16)   4 (32)   8 (64)   12 (96)   16 (128)
MPI (4 Proc/Node)                3.54    5.46     7.92     8.52     7.69      7.37
Hybrid (1 Proc : 16 Thr/Node)    6.68    10.19    14.41    17.11    17.19     17.73
MPI_new (8 Proc/Node, 24 GB)     6.45    10.87    15.93    23.69    28.90     31.06
Hyb_new (1 Proc : 16 Thr/Node)   5.66    9.68     17.13    26.92    30.72     32.08
Pipelined (1 Proc : 16 Thr/Node) 6.35    11.50    21.30    38.05    50.48     59.80

Page 11: Designing High Performance Computing Architectures for Reliable Space Applications

IR Optimizations for Heterogeneous CPU/GPU Computing

• ccNUMA Multi-Processor
– Sequential optimizations
– Minor load-balancing improvements

• Computing on CPU+GPU

• Accelerator (GP-GPU)
– CUDA
• Tiling technique
• cuFFT library
• Transcendental functions, such as sine and cosine
– CUDA 3.2 lacks:
• Some complex operations (multiplication and CEXP)
• Atomic operations for complex/float data (a common workaround is sketched below)
– Memory limitation:
• Atomic operations are used in the SFI loop (R&R is not an option)
• The large-scale IR dataset does not fit into GPU memory

(Figure: CUDA tiling scheme, with thread blocks indexed by blockIndex.x/y, threads by threadIndex.x/y, and tiles of size tsize x tsize; Block (2,1) is highlighted.)
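Since atomic updates of complex data are not provided, one common workaround (assumed here, not necessarily the exact thesis code) is to accumulate the real and imaginary parts with two separate float atomicAdd calls, which requires a compute-capability 2.0 (Fermi) GPU such as the Tesla cards used on the next page. A minimal CUDA sketch of a scatter-style update with this workaround:

/* Minimal CUDA sketch (assumed, not the thesis code) of a scatter-style
 * interpolation update with atomics on complex values. Error checks omitted. */
#include <cuda_runtime.h>
#include <cuComplex.h>
#include <stdio.h>
#include <stdlib.h>

__device__ void atomicAddComplex(cuFloatComplex *addr, cuFloatComplex v)
{
    float *f = reinterpret_cast<float *>(addr);   /* {x, y} = {re, im} */
    atomicAdd(f,     cuCrealf(v));                /* real part      */
    atomicAdd(f + 1, cuCimagf(v));                /* imaginary part */
}

__global__ void scatterInterpolate(const cuFloatComplex *in, int n_in,
                                   cuFloatComplex *grid, int n_out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_in) return;
    int bin = (i * 7919) % n_out;                 /* stand-in for the polar->rect mapping */
    atomicAddComplex(&grid[bin], in[i]);          /* conflicting bins handled atomically  */
}

int main(void)
{
    const int n_in = 1 << 18, n_out = 1 << 12;
    const size_t in_bytes  = n_in  * sizeof(cuFloatComplex);
    const size_t out_bytes = n_out * sizeof(cuFloatComplex);

    cuFloatComplex *h_in   = (cuFloatComplex *)malloc(in_bytes);
    cuFloatComplex *h_grid = (cuFloatComplex *)malloc(out_bytes);
    for (int i = 0; i < n_in; i++)
        h_in[i] = make_cuFloatComplex(1.0f, 0.5f);

    cuFloatComplex *d_in, *d_grid;
    cudaMalloc(&d_in, in_bytes);
    cudaMalloc(&d_grid, out_bytes);
    cudaMemcpy(d_in, h_in, in_bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_grid, 0, out_bytes);

    scatterInterpolate<<<(n_in + 255) / 256, 256>>>(d_in, n_in, d_grid, n_out);
    cudaMemcpy(h_grid, d_grid, out_bytes, cudaMemcpyDeviceToHost);

    printf("grid[0] = (%.1f, %.1f)\n", cuCrealf(h_grid[0]), cuCimagf(h_grid[0]));
    cudaFree(d_in); cudaFree(d_grid);
    free(h_in); free(h_grid);
    return 0;
}

Because the R&R alternative would need a private copy of the whole output grid per replica, which as noted above does not fit into GPU memory, atomic updates are the practical choice in this loop.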


Page 12: Designing High Performance Computing Architectures for Reliable Space Applications

IR on a Heterogeneous Node

The Machine:
ccNUMA Module: 2 x 4 cores, 16 threads, 2.8-3.2 GHz, 12 GB RAM, TDP: 95 W/CPU, PCIe 2.0 (8 GB/s)
Accelerator Module: 2 GPU cards, NVIDIA Tesla (Fermi), 1.15 GHz, 6 GB GDDR5 at 144 GB/s, TDP: 238 W

Speedup over the sequential CPU baseline:

Configuration           Scale=10   Scale=30   Scale=60
CPU                     1          1          1
CPU Best Sequential     1.82       1.89       1.97
CPU 8 Threads           14.46      11.41      10.27
CPU 16 Threads (SMT)    16.06      13.26      12.55
GPU                     20.11      19.44      20.17
CPU + GPU               18.88      22.10      24.68
2 GPUs                  4.27       16.71      22.26
2 GPUs Pipelined        15.86      25.40      34.46

Page 13: Designing High Performance Computing Architectures for Reliable Space Applications

Conclusions

• Shared memory nodes
– Performance is limited by hardware resources
– 1 node (12 cores / 24 threads): speedup = 12.4

• Distributed memory systems
– Low efficiency in terms of performance per power consumption and size
– 8 nodes (64 cores): speedup = 38.05

• Heterogeneous CPU/GPU systems
– Perfect compromise:
• Better performance than current shared memory nodes
• Better efficiency than distributed memory systems
• 1 CPU + 2 GPUs: speedup = 34.46

• Final Design Recommendations
– A powerful shared memory PPN
– PPN with ccNUMA CPUs and GPU accelerators
– Distributed memory only if multiple PPNs are needed


Page 14: Designing High Performance Computing Architectures for Reliable Space Applications

Thank You

[email protected]