manycore application migration, methodology and...

39
Manycore Application Migration: Methodology and Tools 1 Московский государственный университет , Лоран Морен, 29 август 2011

Upload: others

Post on 13-Oct-2019

14 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Manycore Application Migration:

Methodology and Tools

1

Московский государственный университет, Лоран Морен, 29 август 2011

Page 2: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Москва - Лоран Морен 2

Methodology to Port Applications

29 август 2011

•Exploit CPU and GPU

•Optimize kernel execution

•Reduce CPU-GPU data transfers

•Optimize CPU code

•Exhibit Parallelism

•Move kernels to GPU

•Performance goal

•Validation process

•Hotspots selection

•Continuous integration Define your parallel project

Port your application

on GPU

Optimize your

GPGPU

application

Phase 1

Phase 2

Hotspots Parallelization

Tuning

GPGPU operational

application with

known potential

Hours to Days Days to Weeks

Weeks to Months

Page 3: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Take a decision

• Go

• Dense hotspot

• Fast GPU kernels

• Low CPU-GPU data transfers

• Midterm perspective:Prepare to manycore parallelism

Москва - Лоран Морен 329 август 2011

• No Goo Flat profile

o GPU results not accurate enough• cannot validate the computation

o Slow GPU kernels• i.e. no speedup to be expected

o Complex data structures

o Big memory space requirement

Page 4: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

• Parallel execution on Many-core architectures will (probably) change Floating-Point computation results…o Parallel programming change in an extensive way the computation

order

o Hardware floating-point units are not always IEEE 754 compliant

o Convergence speed of Iterative algorithms can be impacted

o …

• … but it does not imply that the computation is wrong !o A floating-point computation is always an approximation

• The validation must be provided by application providerso It is the mathematicians or physicists responsibility

o They may or may not require binary equal results

o If not, a validation protocol must be provided

Validate your Computation

Москва - Лоран Морен 429 август 2011

Page 5: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Tools to improve the methodology

productivity

Page 6: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Москва - Лоран Морен 6

Tools in the Methodology

29 август 2011

•Parallel Profiling ToolsParaver, TAU, Vampir

•HMPP Performance Analyzer

•Target Specific toolsNvidia Insight

•HMPP Workbench

•HMPP Wizard

• Profiling toolsGprof, Kachegrind,Vampir, Acumem, ThreadSpotter

•Debugging toolsgdb, alinea DDT, TotalView

Define your parallel project

Port your application

on GPU

Optimize your

GPGPU

application

Phase 1

Phase 2

Hotspots Parallelization

Tuning

GPGPU operational

application with

known potential

Hours to Days Days to Weeks

Weeks to Months

Page 7: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Profiling

Page 8: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

• Select Hotspots

• Estimate speedup using Amdahl’s law:

Focus on Hotspots (Gprof example)

Москва - Лоран Морен 829 август 2011

Profile your

CPU application

Build a coherent

kernel set

S

PP )1(

1

Page 9: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Move Computation to GPU:

HMPP Workbench

Page 10: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

What is HMPP? (Hybrid Manycore Parallel Programming)

• A directive based programming modelo Provide an incremental tool to exploit GPU in legacy applications

o Complement existing parallel APIs (OpenMP or MPI)

o Avoid exit cost, future-proof solution

o Based on an OpenStandard

• An integrated compilation chaino Code generators from C and Fortran to GPU (CUDA or OpenCL)

o Source to source compiler: uses hardware vendor SDK

o One single compiler driver interfacing GPU compilers

o Complements existing parallel APIs (OpenMP or MPI)

• A runtime support of heterogeneous architectureso Manages hardware resource and its availability

o Interface transfers and synchronizations of the hardware

o Fast and easier application deployment on new hardware

o Free of use

Port your application with HMPP Workbench

Москва - Лоран Морен 1029 август 2011

Page 11: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

HMPP usage in three steps

Москва - Лоран Морен 11

1. A set of directives to program

hardware accelerators

• Drive your HWAs, manage

transfers

• Optimize kernels at source level

2. A complete tool-chain to build

manycore applications

• Build your hybrid application

3. A runtime to adapt to platform

configuration

29 август 2011

Page 12: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

HMPP Directives: drive Hybrid Applications

Москва - Лоран Морен 12

HMPP Runtime

HWAData Codelet

HW-specific code generation

Directives

29 август 2011

int main(int argc, char **argv) {// ...

#pragma hmpp compute advancedload, args[M;N]for( j = 0 ; j < 2 ; j++ ) {#pragma hmpp compute callsitecompute( size, size, size, alpha, vin1, vin2, beta, vout );

}// ...

}

#pragma hmpp compute codelet, target=CUDA, args[a].io=inoutvoid compute ( int M, int N, float alpha, float a[n][n], float b[n][n]){#pragma hmppcg hmppcg grid blocksize 16 X 16#pragma hmppcg unroll 2, jamfor (i = 0; i < M; i += 1) {#pragma hmpp unroll 4, guardedfor (j = 0; j < N; j += 1) {

a[i][j] = cos(a[i][j]) + cos(b[i][j]) - alpha;}

} }

Page 13: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

#pragma hmpp sgemm codelet, target=CUDA, args[vout].io=inoutextern void sgemm ( int m, int n, int k, float alpha,

const float vin1[n][n], const float vin2[n][n],float beta, float vout[n][n] );

int main(int argc, char **argv) {// ...for( j = 0 ; j < 2 ; j++ ) {#pragma hmpp sgemm callsitesgemm( size, size, size, alpha, vin1, vin2, beta, vout );

}// ...

}

HMPP Basic Programming

Москва - Лоран Морен 13

One directive to setup a GPU codelet…

… and another inserted in the application

29 август 2011

Page 14: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

HMPP BLAS Performance on NVIDIA Fermi

Москва - Лоран Морен 1429 август 2011

0

100

200

300

400

500

600

700

64

32

0

57

6

83

2

10

88

13

44

16

00

18

56

21

12

23

68

26

24

28

80

31

36

33

92

36

48

39

04

41

60

44

16

46

72

49

28

51

84

54

40

56

96

59

52

62

08

64

64

67

20

69

76

72

32

74

88

77

44

80

00

Gfl

op

/s

Taille des matrices

SGEMM Performance

MKL

HMPP

CUBLAS

MAGMA

2 x Intel(R) Xeon(R) X5560 @ 2.80GHz (8 cores) - MKLNVidia Tesla C2050, ECC activated – HMPP, CUBLAS, MAGMA

Page 15: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Build a Parallel Project: HMPP Wizard

Page 16: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

A diagnosis tool to make the code “GPU Friendly”

• Features:

o Generate GPU porting advices of C/Fortran source code

• Unsupported programming (pointers, goto-labels, …)

• Loop structure and parallelism validation

• Diagnosis the loop nest translation process into a GPU grid of threads

o Estimate kernels behavior on a GPU and provide specific advices

• Reduction inside a kernel;

• Bad memory coalescing;

• Inefficient memory coalescing;

• Too low computation density;

• Detection of well-known patterns such as convolutions

o Propose basic GPU Tuning directives

• Core technology: based on static analysis

o Analyze memory access patterns on the GPU

o Takes into account the full HMPP code generation process

• Evaluates HMPPCG optimization directives

Make your kernels more GPU friendly

HMPP Wizard

Москва - Лоран Морен 1629 август 2011

Page 17: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Москва - Лоран Морен 1729 август 2011

HMPP Wizard: visual

Analyze your memory

access pattern

Use HMPPCG Directives

and make your kernel GPU

friendly

Page 18: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

An advice: is the loop GPU friendly?

o Example: memory access pattern over the Z dimension (3D)

• The GPU grid is 2D → the coalescing must be on the X dimension

o A pragma permute is proposed to fix the problem

Tune your application with the HMPP Wizard

Москва - Лоран Морен 1830 август 2011

Page 19: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Debug your hybrid & parallel

application: Alinea DDT

Page 20: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

www.allinea.com

Allinea Software

• Software tools company since 2001

– Allinea DDT – the scalable parallel debugger

– Allinea OPT – the optimization tool for MPI and non-MPI

– Users at all scales – at 1 to 100,000 cores and above

– World's only Petascale debugger!

Simplifying the challenge of multi- and many-core development

– Bugs at scale need a debugger at scale

• … until recently debuggers limited to ~4,000-8,000 cores

– Bugs on GPUs need a debugger for GPUs

• … until recently GPU software couldn't be debugged

Allinea Software

Page 21: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

www.allinea.com

DDT for CAPS HMPP

Debuggable inside C kernels (codelets) on the GPU

F90 multi-dimensional arrays support in progress for the GPU

Auto-reporting of CAPS runtime errors to the DDT GUI

Also able to debug codeletsrunning on the CPU

Allinea SoftwareDDT for CAPS HMPP

Page 22: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Optimize CPU-GPU integration:

PARAVER

Page 23: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Москва - Лоран Морен 2329 август 2011

Analyze the CPU-GPU Transfers

HMPP + PARAVER

Get a precise view of HMPP element

behavior

Transfer time, execution time…

Page 24: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Optimize GPU Performance:

HMPP Performance Analyzer

Page 25: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Provide information about the GPU behavior at source level for

HMPP applications

• Main features

o Analyze execution profiles made on the GPU by target specific profilers

o Compute synthetic metrics from raw profile data

• Memory throughput, grid size, computation density…

o Provide detailed statistics from raw profile data

• Benefits

o Avoid using target specific profiling tools for HMPP applications

• Nvidia insight profiles Cuda code !

o Get precise and specific information about the loop nest behavior

o Explore and exploit at best the GPU power from the source level

o Fine tune kernels using HMPPCG directives

Understand how well kernels are performing

HMPP Performance Analyzer

Москва - Лоран Морен 2529 август 2011

Page 26: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Москва - Лоран Морен 2629 август 2011

HMPP Performance Analyzer: visual

Get precise and

specific information

about the kernel

behavior

Explore and Exploit at

best the GPU power

from the C source

level

Page 27: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

HMPP Performance Analyzer: example

Москва - Лоран Морен 2730 август 2011

Synthetic kernel metrics

Global view of the execution time distribution

Details statistics of raw profiling data

Page 28: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский
Page 29: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

An ideal package for a proven methodology

Москва - Лоран Морен 2929 август 2011

A multi-level tool suite for

manycore applications definition,

porting and optimization

A GPU focused porting

methodology and a state-of-the-art

products ecosystem

A resources tank to guide you in

your porting issues

Page 30: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Москва - Лоран Морен 3029 август 2011

PACKAGES

HMPP HYBRID COMPILER

DEVELOPMENT TOOLSHMPP Wizard

HMPP Performance Analyzer

HMPP HYBRID COMPILER

DEVELOPMENT TOOLSHMPP Wizard

HMPP Performance Analyzer

ALLINEA DDT

SCIENTIFIC LIBRARIES

For constructors / Integrators

DevDeck OEM

OEM license

For computing centers

& large companies

DevDeck ENTERPRISE

Floating license

HMPP HYBRID COMPILER

DEVELOPMENT TOOLSHMPP Wizard

HMPP Performance Analyzer

CUSTOMIZED SERVICES

expertise, tools recommendation, training…

For developers’ workstations

DevDeck DEVELOPER

Node-Locked license

+ Access to Web Ressources according to the package:

HMPP Workbench - Wizard - Performance Analyzer - Allinea DDT - Libraries - Methodology supports - Cook book - Tutorials - Use cases…

Page 31: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Conclusion

Page 32: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

• Application Diagnosis by CAPSGo / NoGo Analysis

• Provided by HMPPIncremental approach

• Use GPU libraries/function into HMPPDon’t redevelop what

already exists

• Use tools like paraver, Vampire, TAUUnderstand finely the

data transfers

• HMPP Performance AnalyserGood understanding of

the HW

• HMPPCG directivesLots of experimentation

at a lower cost

Москва - Лоран Морен 3229 август 2011

Cost-effective migration requirements

Page 33: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Gravimetry application

Finite volumes method

Москва - Лоран Морен 33

• Effort

o 1 man-month

• Nehalem X5560

improvement:

o x5.15

• Main porting operation

o Reduce kernel memory

accesses, use of GPU

shared memory and optimize

Data Movement

The finite volumes method (left) is more accurate than the analytic solution (right)

which over estimates the central peak

29 август 2011

Page 34: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Heat transfer simulation

Monte Carlo simulation

for thermal radiation

Москва - Лоран Морен 34

• Efforto 1,5 man-month

• Sizeo ~1kLoC of C d (DP)

• GPU C2050 improvemento x6 over 8 Nehalem cores

• Full hybrid versiono x23 with 8 nodes over 8

Nehalem cores

• Main porting operationo adding a few HMPP directives

• Main difficultyo none

29 август 2011

Page 35: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Numerical Weather Prediction

A global cloud resolving model

Москва - Лоран Морен 35

• Effort

o 1 man-month (part of the code already ported)

• GPU C1060 improvement

o 11x over a Nehalem core

o 19x over a Harpertown core

• Main porting operation

o reduction of CPU-GPU transfers

• Main difficulty

o GPU memory size is the limiting factor

29 август 2011

Page 36: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Computer vision & Medical imaging

MultiView Stereo

Москва - Лоран Морен 36

• Resource spento 1 man-month

• Sizeo ~1kLoC of C99 (DP)

• CPU Improvement

o x 4,86

• GPU C2050 improvemento x 120 over serial code on

Nehalem

• Main porting operation

o Rethinking algorithm

29 август 2011

Page 37: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Image processing

Edge detection algorithm

Москва - Лоран Морен 37

• Sobel Filter benchmark

• Size

o ~ 200 lines of C code

• GPU C2070 improvemento x 25,8 over serial code on

Nehalem

• Main porting operation

o Use of basic HMPP directives

29 август 2011

Page 38: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

Biosciences, phylogenetics

Phylip, DNA distance

• In association with the HMPP Center Of Excellence for APAC

• Computes a matric of distances between DNA distances

• Resource spent

o A first CUDA version developed by Shanghai Jiao Tong University, HPC Lab

o 1 man-month

• Size

o 8700 lines of C code, one main kernel (99% of the execution time)

• GPU C2070 improvement

o x 44 over serial code on Nehalem

• Main porting operation

o Kernel parallelism & data transfer coalescing leverage

o Conversion from double precision to simple precision computation

Москва - Лоран Морен 3829 август 2011

Page 39: Manycore Application Migration, Methodology and Tools.nvidia.esyr.org/files/presentations/0829_GPU_Porting.pdf · Manycore Application Migration: Methodology and Tools 1 Московский

OpenHMPP – http://www.openhmpp.org –

CAPS entreprise – http://www.caps-entreprise.com –