Performance Optimization of HPC Applications: From Hardware to Source Code


TRANSCRIPT

Page 1: Performance Optimization of HPC Applications: From Hardware to Source Code

science + computing ag | IT-Services and Software Solutions for Complex Computing Environments | Tübingen | München | Berlin | Düsseldorf

Performance Optimization of HPC Applications: From Hardware to Source Code

Dr. Fisnik Kraja, Dr. Oliver Schröder

Page 2: Performance Optimization of HPC Applications: From Hardware to Source Code

Outline

• The Optimization Process
• Performance Analysis and System Choice
• Example: AVL Fire
• System Tuning
• Examples: AVL Fire and OpenFOAM
• Summary

Page 3: Performance Optimization of HPC Applications: From Hardware to Source Code

The Optimization Process

1. Application Performance Analysis
2. Choose the right hardware
3. Use the resources efficiently
4. Tune system and middleware parameters
5. Optimize the application itself

Page 4: Performance Optimization of HPC Applications: From Hardware to Source Code

Performance Analysis is the Starting Point

How can we determine the importance of each part in the overall execution time?

[Diagram: contributions to the overall execution time: CPU frequency, memory bandwidth, cache, MPI, I/O]

Page 5: Performance Optimization of HPC Applications: From Hardware to Source Code

The Methodology: Profiling and more…

[Diagram: decomposition of the total time T into Tcomm, Tcomp and Tio, with Tcomp split into TCPU (Tvec, Tscal) and Tmem]

1. CPU frequency dependency
2. Memory (bandwidth & cache) dependency
   a. Scatter vs. compact placement
   b. L3 & L2 hit ratio

Vectorization: AVX2 / AVX / SSE4.2 / no-vec

Scalability tests

Page 6: Performance Optimization of HPC Applications: From Hardware to Source Code

A few words on the methodology

• Scalability check and MPI profiling: tests on different numbers of nodes

• Dependency on the CPU frequency: tests with different frequency values

• Dependency on the interconnect bandwidth: use 2x the number of nodes for the same number of MPI ranks to provide each rank with 2x the interconnect bandwidth. Use only one CPU per node (blocking) so that each MPI rank keeps the same memory bandwidth as before.

• Dependency on the memory hierarchy: as above, but scatter the MPI ranks over both CPUs, providing each MPI rank with twice the memory bandwidth while the ranks also share twice the amount of LLC.

• Tools
  • Allinea Performance Reports
  • Intel MPI Performance Snapshot
  • Internal MPI profiling summaries (Intel MPI, Platform MPI)
  • Intel PCM (pcm-memory.x, pcm.x)
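These node-doubling tests translate into launch lines like the following; a minimal sketch assuming Intel MPI is used for the launch (the pinning controls differ for other MPI libraries, and ./myApp stands for the solver binary):

# Test 1: 8 nodes, fully populated (24 ranks per node)
mpirun -ppn 24 -np 192 ./myApp

# Test 2: 16 nodes, half populated, ranks packed onto one socket per node
# (same memory bandwidth per rank as Test 1, but 2x interconnect bandwidth)
export I_MPI_PIN_ORDER=compact
mpirun -ppn 12 -np 192 ./myApp

# Test 3: 16 nodes, half populated, ranks scattered over both sockets
# (2x memory bandwidth per rank and a share of 2x the LLC)
export I_MPI_PIN_ORDER=scatter
mpirun -ppn 12 -np 192 ./myApp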

Page 7: Performance Optimization of HPC Applications: From Hardware to Source Code

Allinea Performance Reports

Allinea Performance Reports answers the key questions with real data and gives guidance on solutions:

• Are CPU performance extensions being used?
• Are memory and/or cache problems costing cycles?
• Is I/O or communication affecting performance?
• Are threads using the cores well?
• What configuration might suit this application better?

Page 8: Performance Optimization of HPC Applications: From Hardware to Source Code

Allinea Performance Reports

$> performance-report mpirun -np …
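A minimal usage sketch (the rank count and binary are hypothetical): the tool simply wraps the usual launch line and writes a text and HTML performance report next to the job.

performance-report mpirun -np 96 ./myApp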

Page 9: Performance Optimization of HPC Applications: From Hardware to Source Code

Intel MPI Performance Snapshot

• Intel MPS is part of ITAC (Intel Trace Analyzer and Collector)
  • Summary information
  • Memory & counter usage
  • MPI & OpenMP imbalance analysis

• Usage models
  • mpirun -mps -n 2 ./myApp
  • mpi_perf_snapshot -f ./stats.txt ./myApp
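Put together, a minimal two-step sketch based on the commands above (the statistics file name is assumed):

# Step 1: collect a snapshot during the run
mpirun -mps -n 2 ./myApp
# Step 2: summarize the collected statistics
mpi_perf_snapshot -f ./stats.txt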

Page 10: Performance Optimization of HPC Applications: From Hardware to Source Code

Memory Throughput Analysis with Intel PCM (Intel Performance Counter Monitor)

• Open-source project, available here:
  http://software.intel.com/en-us/articles/intel-performance-counter-monitor/

• Provides information on:
  • Memory bandwidth usage
  • Cache hits and misses
  • QPI throughput (requires BIOS support)
  • Temperature

• Two commands:
  • pcm.x
  • pcm-memory.x
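A minimal sketch of monitoring a running job with PCM, assuming the tools were built in the current directory and the msr kernel module is available (both commands need root and take the sampling interval in seconds):

sudo modprobe msr        # expose the model-specific registers
sudo ./pcm-memory.x 1    # per-channel read/write memory bandwidth
sudo ./pcm.x 1           # global view: IPC, cache misses, QPI traffic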

Page 11: Performance Optimization of HPC Applications: From Hardware to Source Code

pcm.x or pcm-memory.x?

                pcm.x                             pcm-memory.x
Processors      since NHM (Nehalem)               since SNB (Sandy Bridge)
Memory BW       summary (read/write per socket)   detailed (read/write per channel)
Cache misses    L2 & L3                           no
IPC             yes                               no

• pcm.x provides a global view of the system
• pcm-memory.x provides details on memory bandwidth usage

Page 12: Performance Optimization of HPC Applications: From Hardware to Source Code

Intel PCM

sudo /bin/pcm.x

Page 13: Performance Optimization of HPC Applications: From Hardware to Source Code

Intel PCM Memory

sudo /bin/pcm-memory.x

Page 14: Performance Optimization of HPC Applications: From Hardware to Source Code

Intel MPI Profiling (1)

export I_MPI_STATS=ipm
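A minimal sketch of a complete run with the IPM-style summary enabled (the rank count is hypothetical); the statistics are written when the application calls MPI_Finalize:

export I_MPI_STATS=ipm
mpirun -np 96 ./myApp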

Page 15: Performance Optimization of HPC Applications: From Hardware to Source Code

Intel MPI Profiling (2)


Summary

Page 16: Performance Optimization of HPC Applications: From Hardware to Source Code

Intel MPI Profiling (3)


Routine Summary:

Page 17: Performance Optimization of HPC Applications: From Hardware to Source Code

Platform MPI Profiling (1)

export MPIRUN_OPTIONS="-i mpiprof"

Summary

Application Summary by Rank (seconds):
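A minimal sketch of enabling this instrumentation for a Platform MPI run (the rank count is hypothetical; the profile prefix mpiprof is taken from the slide and yields the summaries shown on the following slides):

export MPIRUN_OPTIONS="-i mpiprof"
mpirun -np 96 ./myApp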

Page 18: Performance Optimization of HPC Applications: From Hardware to Source Code

Platform MPI Profiling (2)


Routine Summary by Rank:

Page 19: Performance Optimization of HPC Applications: From Hardware to Source Code

Platform MPI Profiling (3)


Message Summary by Rank Pair:

Page 20: Performance Optimization of HPC Applications: From Hardware to Source Code

Example: Y2 WCJ FINE Test Case

5,000,000 cells

The simulation computes the velocity, pressure, and temperature distribution of the liquid coolant in the domain.

Additionally, the heat transfer between the fluid and the surrounding solid wall boundaries is calculated. This makes it possible to evaluate and optimize cooling-jacket designs with respect to engine cooling performance and minimum power consumption of the water pump.

Page 21: Performance Optimization of HPC Applications: From Hardware to Source Code

Example: NADIA Test Case

Simplified cylinder head: fluid domain ~3,000,000 cells, solid domain ~2,000,000 cells

The simulation focuses on computing the heat transfer between a warm solid and a significantly colder liquid coolant. The model imitates dipping the solid into a basin filled with the coolant. During this process the temperature of the solid drops quickly, but not spatially uniformly. This generates stresses, which may lead to failure (e.g. cracks) when the quenched part is operated under real loads (for a cylinder head, the loads correspond to the different operating conditions of an IC engine). The simulation supports designing a quenching process that results in minimum residual stresses, avoiding failure during later use of the part.

Page 22: Performance Optimization of HPC Applications: From Hardware to Source Code

AVL Fire - Y2 Test Case: Scalability and MPI Profiling

Compute vs. MPI share of the wallclock time:

Cores / MPI ranks   24       48       96       192      384
Compute [%]         93.89    91.31    82.84    69.83    51.51
MPI [%]             6.11     8.69     17.16    30.17    48.49

[Chart: wallclock time in seconds vs. number of cores/MPI ranks (log-log), compared against perfect scalability, together with power consumption in watts]

Page 23: Performance Optimization of HPC Applications: From Hardware to Source Code

AVL Fire - Y2 Test Case: Dependency on the CPU Frequency

[Chart: wallclock time in seconds (200-600 s) vs. CPU frequency (1000-3000 MHz), compared against reference curves for a 100 % and a 75 % frequency dependency]

Page 24: Performance Optimization of HPC Applications: From Hardware to Source Code

AVL Fire - Y2 Test Case: Dependency on Memory Hierarchy and Interconnect Bandwidth

Test 1: 8 nodes, PPN=24, compact placement, 2.5 GHz
Test 2: 16 nodes, PPN=12, compact placement, 2.5 GHz
Test 3: 16 nodes, PPN=12, scatter placement, 2.5 GHz

[Chart: wallclock time in seconds (0-400 s) for the three tests]

Page 25: Performance Optimization of HPC Applications: From Hardware to Source Code

AVL Fire – Y2 Test Case 8 Nodes (2x E5-2680v3) - 192 MPI Ranks – 24 PPN

Allinea Performance Report

Time breakdown from the report:

• MPI Collective: 20 %
• MPI P2P: 19 %
• Compute: 61 % (Scalar 12 %, Vector 1 %, Memory 48 %)

Page 26: Performance Optimization of HPC Applications: From Hardware to Source Code

AVL Fire – Nadia Test Case 4 Nodes (2x E5-2680v3) – 64 MPI Ranks – 16 PPN

Allinea Performance Report

Time breakdown from the report:

• MPI Collective: 11 %
• MPI P2P: 46 %
• Compute: 43 % (Scalar 5.7 %, Vector 0.3 %, Memory 37 %)

Page 27: Performance Optimization of HPC Applications: From Hardware to Source Code

AVL Fire: Test Case Comparison

% of time        Y2      Nadia
MPI Collective   20.5    10.5
MPI P2P          18.8    46.5
Scalar           11.8    5.6
Vector           1.2     0.3
Memory           47.7    37.2

[Chart: the same breakdown shown as stacked bars (0-100 % of time) per test case]

Page 28: Performance Optimization of HPC Applications: From Hardware to Source Code

Optimizations for AVL Fire

• Application Performance Analysis
  • Latency-oriented dependency on the memory hierarchy
  • Latency-oriented dependency on the interconnect

• Choose the right hardware
  • A CPU with an intermediate number of cores is more suitable for different usage models
  • Memory configuration: 1 DIMM per memory channel
  • Interconnect with lower latency; reduce the number of hops between the nodes as much as possible

• Use the resources efficiently
  • Run small jobs
  • Distribute the MPI ranks of a job over as many nodes as possible
  • When resources are limited, share the nodes among the jobs

• Tune system and middleware parameters

• Optimize the application itself

Page 29: Performance Optimization of HPC Applications: From Hardware to Source Code

Tuning of the Runtime Environment

1. CPU binding
2. Zone reclaim
3. Transparent huge pages
4. Turbo Boost
5. Hyper-Threading
6. MPI rank distribution over the nodes and over the sockets
7. MPI
   • Communication pattern analysis
   • Collectives
   • Shared memory vs. InfiniBand
8. InfiniBand fabric control
   • Fabric selection and fallback
   • Eager thresholds
   • Connected mode vs. datagram mode
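A sketch of typical settings for items 1-3 and 6, assuming Intel MPI on Linux; the exact values are cluster- and job-specific, and items 4 (Turbo Boost) and 5 (Hyper-Threading) are usually switched in the BIOS:

export I_MPI_PIN=1                       # 1. bind each MPI rank to a core
export I_MPI_PIN_ORDER=compact           # 6. rank placement across the sockets
sudo sysctl -w vm.zone_reclaim_mode=0    # 2. avoid NUMA zone-reclaim stalls
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled   # 3. THP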

Page 30: Performance Optimization of HPC Applications: From Hardware to Source Code

Tuning Example: AVL Fire - Y2 Test Case, 4 Nodes (2x E5-2680v3), 96 MPI Ranks, 24 PPN

Configuration             Wallclock [s]   Power consumption [W]
no CPU binding            2054.91         1130
compact CPU binding       791.07          1397
block rank distribution   770.72          1406
Turbo Boost               767.72          1580
Hyper-Threading           819.14          1463
TB + HT                   809.52          1657

Page 31: Performance Optimization of HPC Applications: From Hardware to Source Code

AVL Fire - Y2 Test Case: Dependency on Memory Hierarchy and Interconnect Bandwidth

Test 3: 16 nodes, PPN=12, scatter placement, 2.5 GHz
Test 4: 16 nodes, PPN=12, scatter placement, Turbo Boost

[Chart: wallclock time in seconds (210-280 s) for Test 3 vs. Test 4]

Page 32: Performance Optimization of HPC Applications: From Hardware to Source Code

Tuning Example: A STAR-CCM+ Test Case, Nodes with E5-2680v2 CPUs

Elapsed time in seconds:

Optimizations     120 Tasks   240 Tasks   480 Tasks   960 Tasks
                  6 Nodes     12 Nodes    24 Nodes    48 Nodes
none              3412.6      1779.3      938.1       553.0
cb                2719.0      1196.5      578.7       316.7
cb,zr             2199.0      1103.5      577.3       325.8
cb,zr,thp         2191.5      1094.4      575.2       313.6
cb,zr,thp,trb     2048.6      1027.7      537.2       300.9
cb,zr,thp,ht      1953.8      1017.3      543.8       317.2

(cb = cpu_bind, zr = zone reclaim, thp = transparent huge pages, trb = turbo, ht = hyperthreading)

The combined optimizations reduce the 120-task elapsed time by roughly 43 %.

Page 33: Performance Optimization of HPC Applications: From Hardware to Source Code

Tuning Example: OpenFOAM Test Case, MPI Profiling (with Intel MPI)

• The portion of time spent in MPI increases with the number of nodes

• Looking at the MPI calls:
  • ~50 % of MPI time is spent in MPI_Allreduce
  • ~30 % of MPI time is spent in MPI_Waitall

MPI vs. user share of the wallclock time:

               320 Tasks    640 Tasks    960 Tasks
               16 Nodes     32 Nodes     48 Nodes
User [%]       76           58           42
MPI [%]        23.5         41.6         57.9

Breakdown of the MPI time:

               320 Tasks    640 Tasks    960 Tasks
               16 Nodes     32 Nodes     48 Nodes
MPI_Allreduce  48.6 %       53.5 %       51.2 %
MPI_Waitall    31.9 %       29.6 %       28.4 %
MPI_Recv       15.1 %       11.7 %       14.0 %
Other          4.4 %        5.2 %        6.4 %

Page 34: Performance Optimization of HPC Applications: From Hardware to Source Code

Tuning Example: OpenFOAM Test Case, MPI Allreduce Tuning

• The best Allreduce algorithm: (5) binomial gather + scatter

• Overall obtained improvement:
  • None on 16 nodes (320 MPI ranks)
  • 10 % on 32 nodes (640 MPI ranks)

Available algorithms:
1. Recursive doubling algorithm
2. Rabenseifner's algorithm
3. Reduce + Bcast algorithm
4. Topology-aware Reduce + Bcast algorithm
5. Binomial gather + scatter algorithm
6. Topology-aware binomial gather + scatter algorithm
7. Shumilin's ring algorithm
8. Ring algorithm
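A minimal sketch of selecting algorithm 5 for an Intel MPI run of the 640-rank case (I_MPI_ADJUST_ALLREDUCE is Intel-MPI-specific; other MPI libraries expose similar knobs under different names):

export I_MPI_ADJUST_ALLREDUCE=5    # 5 = binomial gather + scatter
mpirun -np 640 ./myApp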

Clock time [s] per MPI Allreduce algorithm:

                           default   1      2      3      4      5      6      7      8
16 Nodes - 320 MPI Ranks   1797      1893   2075   2232   1864   1801   1860   2208   2813
32 Nodes - 640 MPI Ranks   1117      1210   1597   4096   1490   1004   1526   3985   7373

Page 35: Performance Optimization of HPC Applications: From Hardware to Source Code

Rank Placement

Rank placement is guided by the "Message Summary by Rank Pair" table of the Platform MPI profiling report.
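A minimal sketch of acting on that table, assuming Intel MPI and hypothetical host names: ranks are assigned to hosts in the order they are listed, so a machinefile can be ordered to keep heavily communicating rank pairs on the same node.

cat > ranks.hosts << EOF
node01:24
node02:24
EOF
mpirun -machinefile ranks.hosts -np 48 ./myApp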

Page 36: Performance Optimization of HPC Applications: From Hardware to Source Code

Source Code Optimisations

• Compiler options and reports
• Profiling
• Hot-spot analysis
• Tracing

Tools:
• Score-P (http://www.vi-hps.org/projects/score-p/)
• Scalasca (http://www.scalasca.org/about/about.html)
• Vampir (https://www.vampir.eu/)
• Intel VTune Amplifier (https://software.intel.com/en-us/intel-vtune-amplifier-xe)
• Intel ITAC (https://software.intel.com/en-us/intel-trace-analyzer)
• Allinea MAP (http://www.allinea.com/products/map)

Page 37: Performance Optimization of HPC Applications: From Hardware to Source Code

Hotspot Analysis with VTune Amplifier and Intel Compiler Reports

[Chart: elapsed time in seconds vs. I_MPI_ADJUST_ALLREDUCE algorithm 0-8: 331, 335, 347, 407, 356, 336, 346, 438, 587]

#pragma ivdep
#pragma simd
#pragma vector always
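A minimal sketch of checking whether the loops carrying these hints actually vectorize, assuming the Intel compiler and a hypothetical source file:

icc -O2 -xHost -qopt-report=5 -qopt-report-phase=vec -c hotspot.c
# the generated hotspot.optrpt lists which loops were vectorized and why others were not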

Page 38: Performance Optimization of HPC Applications: From Hardware to Source Code

Hotspot Analysis with Allinea MAP

Performance improvement through function vectorization

Page 39: Performance Optimization of HPC Applications: From Hardware to Source Code

Hotspot Analysis with Score-P

• Profile obtained on 1 node
  • 2 MPI ranks
  • 2 OpenMP threads

• Performance improvement through a replication-and-reduce approach instead of using atomic operations
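A minimal sketch of collecting such a profile, assuming the Score-P compiler wrapper is installed and the source file name is hypothetical:

scorep mpicc -fopenmp -O2 -o myApp myApp.c    # instrument at build time
export OMP_NUM_THREADS=2
mpirun -np 2 ./myApp                          # profile written to a scorep-* directory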

Page 40: Performance Optimization of HPC Applications: From Hardware to Source Code

Score-P Trace Analysis with VAMPIR

Page 41: Performance Optimization of HPC Applications: From Hardware to Source Code

Summary

• Application Performance Analysis
  • Latency-oriented dependency on the memory hierarchy
  • Latency-oriented dependency on the interconnect

• Choose the right hardware
  • A CPU with an intermediate number of cores is more suitable for different usage models
  • Memory configuration: 1 DIMM per memory channel
  • Interconnect with lower latency; reduce the number of hops between the nodes as much as possible

• Use the resources efficiently
  • Run small jobs
  • Distribute the MPI ranks of a job over as many nodes as possible
  • When resources are limited, share the nodes among the jobs

• Tune system and middleware parameters
  • CPU binding must be applied
  • Tune MPI collectives (e.g. Allreduce)
  • No benefit from Hyper-Threading
  • Turbo Boost should be used only for jobs on non-fully populated nodes

• Optimize the application itself
  • Improve cache usage and reuse
  • Improve the vectorization level of the code

Page 42: Performance Optimization of HPC Applications: From Hardware to Source Code

Thank you for your attention.

science + computing ag | www.science-computing.de

[email protected] | [email protected]