Performance Optimization of HPC Applications: From Hardware to Source Code


TRANSCRIPT

Page 1: Performance Optimization of HPC Applications: From Hardware to Source Code

science + computing ag | IT-Services and Software Solutions for Complex Computing Environments | Tübingen | München | Berlin | Düsseldorf

Performance Optimization of HPC Applications: From Hardware to Source Code

Dr. Fisnik Kraja, Dr. Oliver Schröder

Page 2: Performance Optimization of HPC Applications: From Hardware to Source Code

Outline

• The Optimization Process
• Performance Analysis and System Choice
• Example: AVL Fire
• System Tuning
• Examples: AVL Fire and OpenFOAM
• Summary

Page 3: Performance Optimization of HPC Applications: From Hardware to Source Code

The Optimization Process

1. Application Performance Analysis
2. Choose the right hardware
3. Use the resources efficiently
4. Tune system and middleware parameters
5. Optimize the application itself

Page 4: Performance Optimization of HPC Applications: From Hardware to Source Code

Performance Analysis is the Starting Point

How can we determine the importance of each part in the overall execution time?

[Diagram: contributions to the overall execution time: CPU frequency, memory bandwidth, cache, MPI, I/O]

Page 5: Performance Optimization of HPC Applications: From Hardware to Source Code

The Methodology: Profiling and more…

[Diagram: decomposition of the total time T into Tcomm, Tcomp and Tio, with Tcomp split into TCPU (Tvec, Tscal) and Tmem]

1. CPU frequency dependency
2. Memory (bandwidth & cache) dependency
   a. Scatter vs. compact placement
   b. L3 & L2 hit ratio

Vectorization: AVX2 / AVX / SSE4.2 / no-vec

Scalability tests

Page 6: Performance Optimization of HPC Applications: From Hardware to Source Code

A few words on the methodology

• Scalability check and MPI profiling: tests on different numbers of nodes

• Dependency on the CPU frequency: tests with different frequency values

• Dependency on the interconnect bandwidth: use 2x the number of nodes for the same number of MPI ranks to provide each rank with 2x the interconnect bandwidth. Use only one CPU per node (blocking) so that each MPI rank keeps the same memory bandwidth as before.

• Dependency on the memory hierarchy: as above, but scatter the MPI ranks over both CPUs, providing each MPI rank with twice the memory bandwidth while the ranks also share twice the amount of LLC.

• Tools
  • Allinea Performance Reports
  • Intel MPI Performance Snapshot
  • Internal MPI profiling summaries (Intel MPI, Platform MPI)
  • Intel PCM (pcm-memory.x, pcm.x)
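These node-doubling tests translate into launch lines like the following; a minimal sketch assuming Intel MPI is used for the launch (the pinning controls differ for other MPI libraries, and ./myApp stands for the solver binary):

# Test 1: 8 nodes, fully populated (24 ranks per node)
mpirun -ppn 24 -np 192 ./myApp

# Test 2: 16 nodes, half populated, ranks packed onto one socket per node
# (same memory bandwidth per rank as Test 1, but 2x interconnect bandwidth)
export I_MPI_PIN_ORDER=compact
mpirun -ppn 12 -np 192 ./myApp

# Test 3: 16 nodes, half populated, ranks scattered over both sockets
# (2x memory bandwidth per rank and a share of 2x the LLC)
export I_MPI_PIN_ORDER=scatter
mpirun -ppn 12 -np 192 ./myApp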

Page 7: Performance Optimization of HPC Applications: From Hardware to Source Code

Allinea Performance Reports

Allinea Performance Reports answers the key questions with real data and gives guidance on solutions:

• Are CPU performance extensions being used?
• Are memory and/or cache problems costing cycles?
• Is I/O or communication affecting performance?
• Are threads using the cores well?
• What configuration might suit this application better?

Page 8: Performance Optimization of HPC Applications: From Hardware to Source Code

Allinea Performance Reports

$> performance-report mpirun -np …
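A minimal usage sketch (the rank count and binary are hypothetical): the tool simply wraps the usual launch line and writes a text and HTML performance report next to the job.

performance-report mpirun -np 96 ./myApp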

Page 9: Performance Optimization of HPC Applications: From Hardware to Source Code

Intel MPI Performance Snapshot

• Intel MPS is part of ITAC (Intel Trace Analyzer and Collector)
  • Summary information
  • Memory & counter usage
  • MPI & OpenMP imbalance analysis

• Usage models
  • mpirun -mps -n 2 ./myApp
  • mpi_perf_snapshot -f ./stats.txt ./myApp
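Put together, a minimal two-step sketch based on the commands above (the statistics file name is assumed):

# Step 1: collect a snapshot during the run
mpirun -mps -n 2 ./myApp
# Step 2: summarize the collected statistics
mpi_perf_snapshot -f ./stats.txt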

Page 10: Performance Optimization of HPC Applications: From Hardware to Source Code

Memory Throughput Analysis with Intel PCM (Intel Performance Counter Monitor)

• Open-source project, available here:
  http://software.intel.com/en-us/articles/intel-performance-counter-monitor/

• Provides information on:
  • Memory bandwidth usage
  • Cache hits and misses
  • QPI throughput (requires BIOS support)
  • Temperature

• Two commands:
  • pcm.x
  • pcm-memory.x
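A minimal sketch of monitoring a running job with PCM, assuming the tools were built in the current directory and the msr kernel module is available (both commands need root and take the sampling interval in seconds):

sudo modprobe msr        # expose the model-specific registers
sudo ./pcm-memory.x 1    # per-channel read/write memory bandwidth
sudo ./pcm.x 1           # global view: IPC, cache misses, QPI traffic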

Page 11: Performance Optimization of HPC Applications: From Hardware to Source Code

pcm.x or pcm-memory.x?

                pcm.x                             pcm-memory.x
Processors      since NHM (Nehalem)               since SNB (Sandy Bridge)
Memory BW       summary (read/write per socket)   detailed (read/write per channel)
Cache misses    L2 & L3                           no
IPC             yes                               no

• pcm.x provides a global view of the system
• pcm-memory.x provides details on memory bandwidth usage

Page 12: Performance Optimization of HPC Applications: From Hardware to Source Code

Intel PCM

sudo /bin/pcm.x

Page 13: Performance Optimization of HPC Applications: From Hardware to Source Code

Intel PCM Memory

sudo /bin/pcm-memory.x

Page 14: Performance Optimization of HPC Applications: From Hardware to Source Code

Intel MPI Profiling (1)

export I_MPI_STATS=ipm
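A minimal sketch of a complete run with the IPM-style summary enabled (the rank count is hypothetical); the statistics are written when the application calls MPI_Finalize:

export I_MPI_STATS=ipm
mpirun -np 96 ./myApp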

Page 15: Performance Optimization of HPC Applications: From Hardware to Source Code

Intel MPI Profiling (2)


Summary

Page 16: Performance Optimization of HPC Applications: From Hardware to Source Code

Intel MPI Profiling (3)


Routine Summary:

Page 17: Performance Optimization of HPC Applications: From Hardware to Source Code

Platform MPI Profiling (1)

export MPIRUN_OPTIONS="-i mpiprof"

Summary

Application Summary by Rank (seconds):
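A minimal sketch of enabling this instrumentation for a Platform MPI run (the rank count is hypothetical; the profile prefix mpiprof is taken from the slide and yields the summaries shown on the following slides):

export MPIRUN_OPTIONS="-i mpiprof"
mpirun -np 96 ./myApp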

Page 18: Performance Optimization of HPC Applications: From Hardware to Source Code

Platform MPI Profiling (2)


Routine Summary by Rank:

Page 19: Performance Optimization of HPC Applications: From Hardware to Source Code

Platform MPI Profiling (3)


Message Summary by Rank Pair:

Page 20: Performance Optimization of HPC Applications: From Hardware to Source Code

Example: Y2 WCJ FINE Test Case

5,000,000 cells

The simulation computes the velocity, pressure, and temperature distribution of the liquid coolant in the domain.

Additionally, the heat transfer between the fluid and the surrounding solid wall boundaries is calculated. This makes it possible to evaluate and optimize cooling-jacket designs with respect to engine cooling performance and minimum power consumption of the water pump.

Page 21: Performance Optimization of HPC Applications: From Hardware to Source Code

Example: NADIA Test Case

Simplified cylinder head: fluid domain ~3,000,000 cells, solid domain ~2,000,000 cells

The simulation focuses on computing the heat transfer between a warm solid and a significantly colder liquid coolant. The model imitates dipping the solid into a basin filled with the coolant. During this process the temperature of the solid drops quickly, but not spatially uniformly. This generates stresses, which may lead to failure (e.g. cracks) when the quenched part is operated under real loads (for a cylinder head, the loads correspond to the different operating conditions of an IC engine). The simulation supports designing a quenching process that results in minimum residual stresses, avoiding failure during later use of the part.

Page 22: Performance Optimization of HPC Applications: From Hardware to Source Code

AVL Fire - Y2 Test Case: Scalability and MPI Profiling

Compute vs. MPI share of the wallclock time:

Cores / MPI ranks   24       48       96       192      384
Compute [%]         93.89    91.31    82.84    69.83    51.51
MPI [%]             6.11     8.69     17.16    30.17    48.49

[Chart: wallclock time in seconds vs. number of cores/MPI ranks (log-log), compared against perfect scalability, together with power consumption in watts]

Page 23: Performance Optimization of HPC Applications: From Hardware to Source Code

AVL Fire - Y2 Test Case: Dependency on the CPU Frequency

[Chart: wallclock time in seconds (200-600 s) vs. CPU frequency (1000-3000 MHz), compared against reference curves for a 100 % and a 75 % frequency dependency]

Page 24: Performance Optimization of HPC Applications: From Hardware to Source Code

AVL Fire - Y2 Test Case: Dependency on Memory Hierarchy and Interconnect Bandwidth

Test 1: 8 nodes, PPN=24, compact placement, 2.5 GHz
Test 2: 16 nodes, PPN=12, compact placement, 2.5 GHz
Test 3: 16 nodes, PPN=12, scatter placement, 2.5 GHz

[Chart: wallclock time in seconds (0-400 s) for the three tests]

Page 25: Performance Optimization of HPC Applications: From Hardware to Source Code

AVL Fire – Y2 Test Case 8 Nodes (2x E5-2680v3) - 192 MPI Ranks – 24 PPN

Allinea Performance Report

Time breakdown from the report:

• MPI Collective: 20 %
• MPI P2P: 19 %
• Compute: 61 % (Scalar 12 %, Vector 1 %, Memory 48 %)

Page 26: Performance Optimization of HPC Applications: From Hardware to Source Code

AVL Fire – Nadia Test Case 4 Nodes (2x E5-2680v3) – 64 MPI Ranks – 16 PPN

Allinea Performance Report

Time breakdown from the report:

• MPI Collective: 11 %
• MPI P2P: 46 %
• Compute: 43 % (Scalar 5.7 %, Vector 0.3 %, Memory 37 %)

Page 27: Performance Optimization of HPC Applications: From Hardware to Source Code

AVL Fire: Test Case Comparison

% of time        Y2      Nadia
MPI Collective   20.5    10.5
MPI P2P          18.8    46.5
Scalar           11.8    5.6
Vector           1.2     0.3
Memory           47.7    37.2

[Chart: the same breakdown shown as stacked bars (0-100 % of time) per test case]

Page 28: Performance Optimization of HPC Applications: From Hardware to Source Code

Optimizations for AVL Fire

• Application Performance Analysis
  • Latency-oriented dependency on the memory hierarchy
  • Latency-oriented dependency on the interconnect

• Choose the right hardware
  • A CPU with an intermediate number of cores is more suitable for different usage models
  • Memory configuration: 1 DIMM per memory channel
  • Interconnect with lower latency; reduce the number of hops between the nodes as much as possible

• Use the resources efficiently
  • Run small jobs
  • Distribute the MPI ranks of a job over as many nodes as possible
  • When resources are limited, share the nodes among the jobs

• Tune system and middleware parameters

• Optimize the application itself

Page 29: Performance Optimization of HPC Applications: From Hardware to Source Code

Tuning of the Runtime Environment

1. CPU binding
2. Zone reclaim
3. Transparent huge pages
4. Turbo Boost
5. Hyper-Threading
6. MPI rank distribution over the nodes and over the sockets
7. MPI
   • Communication pattern analysis
   • Collectives
   • Shared memory vs. InfiniBand
8. InfiniBand fabric control
   • Fabric selection and fallback
   • Eager thresholds
   • Connected mode vs. datagram mode
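A sketch of typical settings for items 1-3 and 6, assuming Intel MPI on Linux; the exact values are cluster- and job-specific, and items 4 (Turbo Boost) and 5 (Hyper-Threading) are usually switched in the BIOS:

export I_MPI_PIN=1                       # 1. bind each MPI rank to a core
export I_MPI_PIN_ORDER=compact           # 6. rank placement across the sockets
sudo sysctl -w vm.zone_reclaim_mode=0    # 2. avoid NUMA zone-reclaim stalls
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled   # 3. THP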

Page 30: Performance Optimization of HPC Applications: From Hardware to Source Code

Tuning Example: AVL Fire - Y2 Test Case, 4 Nodes (2x E5-2680v3), 96 MPI Ranks, 24 PPN

Configuration             Wallclock [s]   Power consumption [W]
no CPU binding            2054.91         1130
compact CPU binding       791.07          1397
block rank distribution   770.72          1406
Turbo Boost               767.72          1580
Hyper-Threading           819.14          1463
TB + HT                   809.52          1657

Page 31: Performance Optimization of HPC Applications: From Hardware to Source Code

AVL Fire - Y2 Test Case: Dependency on Memory Hierarchy and Interconnect Bandwidth

Test 3: 16 nodes, PPN=12, scatter placement, 2.5 GHz
Test 4: 16 nodes, PPN=12, scatter placement, Turbo Boost

[Chart: wallclock time in seconds (210-280 s) for Test 3 vs. Test 4]

Page 32: Performance Optimization of HPC Applications: From Hardware to Source Code

Tuning Example: A STAR-CCM+ Test Case, Nodes with E5-2680v2 CPUs

Elapsed time in seconds:

Optimizations     120 Tasks   240 Tasks   480 Tasks   960 Tasks
                  6 Nodes     12 Nodes    24 Nodes    48 Nodes
none              3412.6      1779.3      938.1       553.0
cb                2719.0      1196.5      578.7       316.7
cb,zr             2199.0      1103.5      577.3       325.8
cb,zr,thp         2191.5      1094.4      575.2       313.6
cb,zr,thp,trb     2048.6      1027.7      537.2       300.9
cb,zr,thp,ht      1953.8      1017.3      543.8       317.2

(cb = cpu_bind, zr = zone reclaim, thp = transparent huge pages, trb = turbo, ht = hyperthreading)

The combined optimizations reduce the 120-task elapsed time by roughly 43 %.

Page 33: Performance Optimization of HPC Applications: From Hardware to Source Code

Tuning Example: OpenFOAM Test Case, MPI Profiling (with Intel MPI)

• The portion of time spent in MPI increases with the number of nodes

• Looking at the MPI calls:
  • ~50 % of MPI time is spent in MPI_Allreduce
  • ~30 % of MPI time is spent in MPI_Waitall

MPI vs. user share of the wallclock time:

               320 Tasks    640 Tasks    960 Tasks
               16 Nodes     32 Nodes     48 Nodes
User [%]       76           58           42
MPI [%]        23.5         41.6         57.9

Breakdown of the MPI time:

               320 Tasks    640 Tasks    960 Tasks
               16 Nodes     32 Nodes     48 Nodes
MPI_Allreduce  48.6 %       53.5 %       51.2 %
MPI_Waitall    31.9 %       29.6 %       28.4 %
MPI_Recv       15.1 %       11.7 %       14.0 %
Other          4.4 %        5.2 %        6.4 %

Page 34: Performance Optimization of HPC Applications: From Hardware to Source Code

Tuning Example: OpenFOAM Test Case, MPI Allreduce Tuning

• The best Allreduce algorithm: (5) binomial gather + scatter

• Overall obtained improvement:
  • None on 16 nodes (320 MPI ranks)
  • 10 % on 32 nodes (640 MPI ranks)

Available algorithms:
1. Recursive doubling algorithm
2. Rabenseifner's algorithm
3. Reduce + Bcast algorithm
4. Topology-aware Reduce + Bcast algorithm
5. Binomial gather + scatter algorithm
6. Topology-aware binomial gather + scatter algorithm
7. Shumilin's ring algorithm
8. Ring algorithm
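A minimal sketch of selecting algorithm 5 for an Intel MPI run of the 640-rank case (I_MPI_ADJUST_ALLREDUCE is Intel-MPI-specific; other MPI libraries expose similar knobs under different names):

export I_MPI_ADJUST_ALLREDUCE=5    # 5 = binomial gather + scatter
mpirun -np 640 ./myApp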

Clock time [s] per MPI Allreduce algorithm:

                           default   1      2      3      4      5      6      7      8
16 Nodes - 320 MPI Ranks   1797      1893   2075   2232   1864   1801   1860   2208   2813
32 Nodes - 640 MPI Ranks   1117      1210   1597   4096   1490   1004   1526   3985   7373

Page 35: Performance Optimization of HPC Applications: From Hardware to Source Code

Rank Placement

Rank placement is guided by the "Message Summary by Rank Pair" table of the Platform MPI profiling report.
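A minimal sketch of acting on that table, assuming Intel MPI and hypothetical host names: ranks are assigned to hosts in the order they are listed, so a machinefile can be ordered to keep heavily communicating rank pairs on the same node.

cat > ranks.hosts << EOF
node01:24
node02:24
EOF
mpirun -machinefile ranks.hosts -np 48 ./myApp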

Page 36: Performance Optimization of HPC Applications: From Hardware to Source Code

Source Code Optimisations

• Compiler options and reports
• Profiling
• Hot-spot analysis
• Tracing

Tools:
• Score-P (http://www.vi-hps.org/projects/score-p/)
• Scalasca (http://www.scalasca.org/about/about.html)
• Vampir (https://www.vampir.eu/)
• Intel VTune Amplifier (https://software.intel.com/en-us/intel-vtune-amplifier-xe)
• Intel ITAC (https://software.intel.com/en-us/intel-trace-analyzer)
• Allinea MAP (http://www.allinea.com/products/map)

Page 37: Performance Optimization of HPC Applications: From Hardware to Source Code

Hotspot Analysis with VTune Amplifier and Intel Compiler Reports

[Chart: elapsed time in seconds vs. I_MPI_ADJUST_ALLREDUCE algorithm 0-8: 331, 335, 347, 407, 356, 336, 346, 438, 587]

#pragma ivdep
#pragma simd
#pragma vector always
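A minimal sketch of checking whether the loops carrying these hints actually vectorize, assuming the Intel compiler and a hypothetical source file:

icc -O2 -xHost -qopt-report=5 -qopt-report-phase=vec -c hotspot.c
# the generated hotspot.optrpt lists which loops were vectorized and why others were not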

Page 38: Performance Optimization of HPC Applications: From Hardware to Source Code

Hotspot Analysis with Allinea MAP

Performance improvement through function vectorization

Page 39: Performance Optimization of HPC Applications: From Hardware to Source Code

Hotspot Analysis with Score-P

• Profile obtained on 1 node
  • 2 MPI ranks
  • 2 OpenMP threads

• Performance improvement through a replication-and-reduce approach instead of using atomic operations
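A minimal sketch of collecting such a profile, assuming the Score-P compiler wrapper is installed and the source file name is hypothetical:

scorep mpicc -fopenmp -O2 -o myApp myApp.c    # instrument at build time
export OMP_NUM_THREADS=2
mpirun -np 2 ./myApp                          # profile written to a scorep-* directory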

Page 40: Performance Optimization of HPC Applications: From Hardware to Source Code

Score-P Trace Analysis with VAMPIR

Page 41: Performance Optimization of HPC Applications: From Hardware to Source Code

Summary

• Application Performance Analysis
  • Latency-oriented dependency on the memory hierarchy
  • Latency-oriented dependency on the interconnect

• Choose the right hardware
  • A CPU with an intermediate number of cores is more suitable for different usage models
  • Memory configuration: 1 DIMM per memory channel
  • Interconnect with lower latency; reduce the number of hops between the nodes as much as possible

• Use the resources efficiently
  • Run small jobs
  • Distribute the MPI ranks of a job over as many nodes as possible
  • When resources are limited, share the nodes among the jobs

• Tune system and middleware parameters
  • CPU binding must be applied
  • Tune MPI collectives (e.g. Allreduce)
  • No benefit from Hyper-Threading
  • Turbo Boost should be used only for jobs on non-fully populated nodes

• Optimize the application itself
  • Improve cache usage and reuse
  • Improve the vectorization level of the code

Page 42: Performance Optimization of HPC Applications: From Hardware to Source Code

Thank you for your attention.

science + computing ag | www.science-computing.de

[email protected] | [email protected]