Performance Optimization of HPC Applications: From Hardware to Source Code
TRANSCRIPT
science + computing ag
IT-Services and Software Solutions for Complex Computing Environments
Tübingen | München | Berlin | Düsseldorf
Performance Optimization of HPC Applications: From Hardware to Source Code
Dr. Fisnik Kraja, Dr. Oliver Schröder
Outline
• The Optimization Process
• Performance Analysis and System Choice
• Example: AVL Fire
• System Tuning
• Examples: AVL Fire and OpenFOAM
• Summary
The Optimization Process
1. Application Performance Analysis
2. Choose the right hardware
3. Use the resources efficiently
4. Tune system and middleware parameters
5. Optimize the application itself
How can we determine the importance of each part in the overall execution time?
Performance Analysis is the Starting Point
[Diagram: factors contributing to execution time: CPU frequency, memory bandwidth, cache, I/O, MPI]
The Methodology: Profiling and more…
Decomposition of the total run time: T = Tcomm + Tcomp + Tio, where Tcomp = TCPU + Tmem and TCPU = Tvec + Tscal.
[Diagram: test matrix: (1) CPU frequency dependency; (2) memory (bandwidth & cache) dependency, via (a) scatter vs. compact placement and (b) L3 & L2 hit ratios; vectorization levels AVX2 / AVX / SSE4.2 / no-vec; scalability tests]
• Scalability check and MPI profiling
  – Tests on different numbers of nodes
• Dependency on the CPU frequency
  – Tests with different frequency values (see the sketch after this list)
• Dependency on the interconnect bandwidth
  – Use 2x the number of nodes for the same number of MPI ranks to provide each rank with 2x the interconnect bandwidth
  – Use only one CPU (blocking) to provide each MPI rank the same memory bandwidth as before
• Dependency on the memory hierarchy
  – As above, but scattering the MPI ranks over both CPUs, so that each rank gets twice the memory bandwidth and the ranks share twice the amount of LLC
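A minimal sketch of the frequency sweep, assuming a Linux cluster where frequencies can be fixed per node with the cpupower utility, and a hypothetical wrapper script run_case.sh that launches one benchmark run:

# Sweep the CPU frequency (applied on every compute node) and record each run
for f in 1200MHz 1600MHz 2000MHz 2500MHz; do
    sudo cpupower frequency-set --min "$f" --max "$f"   # pin the frequency range
    ./run_case.sh > "wallclock_${f}.log"                # hypothetical launch script
done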
A few words on the methodology
• Tools
  – Allinea Performance Reports
  – Intel MPI Performance Snapshot
  – Internal MPI profiling summaries: Intel MPI, Platform MPI
  – Intel PCM: pcm.x, pcm-memory.x
Allinea Performance Reports
Allinea Performance Reports answers the key questions with real data and gives guidance on solutions:
• Are CPU performance extensions being used?
• Are memory and/or cache problems costing cycles?
• Is I/O or communication affecting performance?
• Are threads using the cores well?
• What configuration might suit this application better?
Allinea Performance Reports
$> performance-report mpirun -np …
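A complete invocation is simply the usual MPI launch line prefixed with the tool; a sketch with a hypothetical rank count and application name:

# Wraps the unmodified MPI start-up and writes text and HTML report files
performance-report mpirun -np 96 ./myApp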
Intel MPI Performance Snapshot
• Intel MPS is part of ITAC
  – Summary information
  – Memory & counter usage
  – MPI & OpenMP imbalance analysis
• Usage models
  – mpirun -mps -n 2 ./myApp
  – mpi_perf_snapshot -f ./stats.txt ./myApp
Memory Throughput Analysis with Intel PCM
Intel Performance Counter Monitor
• Open-source project, available here:
  http://software.intel.com/en-us/articles/intel-performance-counter-monitor/
• Provides information on:
  – Memory bandwidth usage
  – Cache hits and misses
  – QPI throughput (requires BIOS support)
  – Temperature
• Two commands: pcm.x and pcm-memory.x
pcm.x or pcm-memory.x?

                pcm.x                         pcm-memory.x
Processors      since NHM                     since SNB
Memory BW       summary: read/write per SKT   detailed: per-channel R/W
Cache misses    L2 & L3                       no
IPC             yes                           no

• pcm.x provides a global view on the system
• pcm-memory.x provides details on memory bandwidth usage
Intel PCM
$> sudo /bin/pcm.x
Intel PCM Memory
$> sudo /bin/pcm-memory.x
Intel MPI Profiling (1)
$> export I_MPI_STATS=ipm
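An end-to-end sketch, assuming Intel MPI and a placeholder application ./myApp:

# Enable the IPM-style statistics of the Intel MPI library
export I_MPI_STATS=ipm
export I_MPI_STATS_FILE=stats.ipm   # optional: name of the output file
mpirun -np 48 ./myApp
# stats.ipm then contains the summary and routine tables shown on the next slides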
Intel MPI Profiling (2)
Summary
Intel MPI Profiling (3)
Routine Summary:
Platform MPI Profiling (1)
$> export MPIRUN_OPTIONS="-i mpiprof"
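The corresponding flow for Platform MPI, again with a placeholder application; the profile file name derives from the chosen prefix (e.g. mpiprof.instr):

# Pass the instrumentation option to Platform MPI's mpirun
export MPIRUN_OPTIONS="-i mpiprof"
mpirun -np 48 ./myApp
# the application, routine, and message summaries shown on the next slides are written there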
Summary
Application Summary by Rank (seconds):
Platform MPI Profiling (2)
Routine Summary by Rank:
Platform MPI Profiling (3)
Message Summary by Rank Pair:
Example: Y2 WCJ FINE Test Case
5,000,000 cells
The simulation computes the velocity, pressure, and temperature distribution of the liquid coolant in the domain.
Additionally, the heat transfer between the fluid and the surrounding solid wall boundaries is calculated. With this it is possible to evaluate and optimize cooling-jacket designs with respect to engine cooling performance and minimum power consumption of the water pump.
Example: NADIA Test Case
Simplified cylinder head: fluid domain ~3,000,000 cells, solid domain ~2,000,000 cells
The simulation focuses on computing the heat transfer between a warm solid and a significantly colder liquid coolant. The model imitates dipping the solid into a basin filled with the coolant. During this process the temperature of the solid drops quickly, but not spatially uniformly. This generates stresses, which may lead to failure (e.g. cracks) when the quenched part is operated under real loads (for a cylinder head, the loads are the different operating conditions of an IC engine). The simulation supports designing a quenching process that results in minimum residual stresses, avoiding failure during later use of the part.
AVL Fire – Y2 Test Case: Scalability and MPI Profiling
Compute vs. MPI share of the run time:

Cores / MPI ranks    24      48      96      192     384
Compute [%]          93.89   91.31   82.84   69.83   51.51
MPI [%]              6.11    8.69    17.16   30.17   48.49

[Chart: wallclock time [s] and power consumption [W] vs. number of cores/MPI ranks, compared against perfect scalability]
AVL Fire – Y2 Test Case: Dependency on the CPU Frequency
[Chart: wallclock time (roughly 200 to 600 s) vs. CPU frequency (1000 to 3000 MHz); the measured wallclock curve is plotted against hypothetical curves for 100% and 75% frequency dependency]
AVL Fire – Y2 Test Case: Dependency on Memory Hierarchy and Interconnect Bandwidth
[Chart: wallclock time [s] for three configurations]

         Nodes   PPN   Placement   Frequency
Test 1   8       24    compact     2.5 GHz
Test 2   16      12    compact     2.5 GHz
Test 3   16      12    scatter     2.5 GHz
AVL Fire – Y2 Test Case: 8 Nodes (2x E5-2680v3) – 192 MPI Ranks – 24 PPN
Allinea Performance Report
MPI collective: 20%, MPI P2P: 19%, compute: 61% (scalar 12%, vector 1%, memory 48%)
AVL Fire – Nadia Test Case: 4 Nodes (2x E5-2680v3) – 64 MPI Ranks – 16 PPN
Allinea Performance Report
MPI collective: 11%, MPI P2P: 46%, compute: 43% (scalar 5.7%, vector 0.3%, memory 37%)
AVL Fire: Test Case Comparison
% of time         Y2     Nadia
MPI Collective    20.5   10.5
MPI P2P           18.8   46.5
Scalar            11.8    5.6
Vector             1.2    0.3
Memory            47.7   37.2

[Chart: the same breakdown shown as 100% stacked bars per test case]
Optimizations for AVL Fire
1. Application Performance Analysis
   • Latency-oriented dependency on the memory hierarchy
   • Latency-oriented dependency on the interconnect
2. Choose the right hardware
   • A CPU with an intermediate number of cores is more suitable for different usage models
   • Memory configuration: 1 DIMM per memory channel
   • Interconnect with lower latency; reduce the number of hops between the nodes as much as possible
3. Use the resources efficiently
   • Run small jobs
   • Distribute the MPI ranks of a job over as many nodes as possible
   • When resources are limited, share the nodes among the jobs
4. Tune system and middleware parameters
5. Optimize the application itself
Tuning of the Runtime Environment
1. CPU binding
2. Zone reclaim
3. Transparent huge pages
4. Turbo Boost
5. Hyperthreading
6. MPI rank distribution over the nodes and over the sockets
   • Communication pattern analysis
7. MPI
   • Collectives
   • Shared memory vs. InfiniBand
8. InfiniBand fabric control
   • Fabric selection and fallback
   • Eager thresholds
   • Connected mode vs. datagram mode
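For illustration, a hedged sketch of how several of these knobs are typically set on a Linux cluster running Intel MPI (the values are examples, not recommendations):

# 1. CPU binding (Intel MPI): pin each rank to one core
export I_MPI_PIN=1
export I_MPI_PIN_DOMAIN=core

# 2. Zone reclaim: prefer reclaiming memory on the local NUMA node
sudo sysctl vm.zone_reclaim_mode=1

# 3. Transparent huge pages
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled

# 4./5. Turbo Boost and Hyperthreading are usually toggled in the BIOS

# 6. Rank distribution: 12 ranks per node instead of fully populating the nodes
mpirun -ppn 12 -np 192 ./myApp

# 7./8. Fabric selection and eager threshold (Intel MPI over InfiniBand)
export I_MPI_FABRICS=shm:dapl
export I_MPI_EAGER_THRESHOLD=262144   # bytes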
Tuning Example: AVL Fire – Y2 Test Case: 4 Nodes (2x E5-2680v3) – 96 MPI Ranks – 24 PPN
Configuration              Wallclock [s]   Power [W]
No CPU binding             2054.91         1130
Compact CPU binding         791.07         1397
Block rank distribution     770.72         1406
Turbo Boost                 767.72         1580
Hyperthreading              819.14         1463
TB + HT                     809.52         1657
AVL Fire – Y2 Test Case: Dependency on Memory Hierarchy and Interconnect Bandwidth
[Chart: wallclock time [s], roughly 210 to 280 s]

         Nodes   PPN   Placement   Frequency
Test 3   16      12    scatter     2.5 GHz (fixed)
Test 4   16      12    scatter     Turbo Boost
Tuning Example: A STAR-CCM+ Test Case (Nodes with E5-2680v2 CPUs)
Elapsed time in seconds:

                      120 Tasks   240 Tasks   480 Tasks   960 Tasks
                      6 Nodes     12 Nodes    24 Nodes    48 Nodes
Opt: none             3412.6      1779.3      938.1       553.0
Opt: cb               2719.0      1196.5      578.7       316.7
Opt: cb,zr            2199.0      1103.5      577.3       325.8
Opt: cb,zr,thp        2191.5      1094.4      575.2       313.6
Opt: cb,zr,thp,trb    2048.6      1027.7      537.2       300.9
Opt: cb,zr,thp,ht     1953.8      1017.3      543.8       317.2

(cb = cpu_bind, zr = zone reclaim, thp = transparent huge pages, trb = turbo, ht = hyperthreading; chart annotations: 20%, 19%, 43%)
Tuning Example: OpenFOAM Test Case: MPI Profiling (with Intel MPI)
• The portion of time spent in MPI increases with the number of nodes
• Looking at the MPI calls:
  – ~50% of the MPI time is spent in MPI_Allreduce
  – ~30% of the MPI time is spent in MPI_Waitall
MPI vs. user time:

                 320 Tasks   640 Tasks   960 Tasks
                 16 Nodes    32 Nodes    48 Nodes
User             76%         58%         42%
MPI              23.5%       41.6%       57.9%

Breakdown of the MPI time:

                 320 Tasks   640 Tasks   960 Tasks
                 16 Nodes    32 Nodes    48 Nodes
MPI_Allreduce    48.6%       53.5%       51.2%
MPI_Waitall      31.9%       29.6%       28.4%
MPI_Recv         15.1%       11.7%       14.0%
Other            4.4%        5.2%        6.4%
Tuning Example: OpenFOAM Test Case: MPI Allreduce Tuning
• The best Allreduce algorithm: (5) binomial gather + scatter
• Overall obtained improvement:
  – None on 16 nodes (320 MPI ranks)
  – 10% on 32 nodes (640 MPI ranks)

Available algorithms:
1. Recursive doubling
2. Rabenseifner's
3. Reduce + Bcast
4. Topology-aware Reduce + Bcast
5. Binomial gather + scatter
6. Topology-aware binomial gather + scatter
7. Shumilin's ring
8. Ring
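With Intel MPI the algorithm is selected via the I_MPI_ADJUST_ALLREDUCE variable (see the hotspot slide further below); a minimal sketch, with a hypothetical OpenFOAM launch line:

# Force Allreduce algorithm 5 (binomial gather + scatter)
export I_MPI_ADJUST_ALLREDUCE=5
mpirun -np 640 simpleFoam -parallel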
Clock time [s] per I_MPI_ADJUST_ALLREDUCE algorithm:

                       default   1      2      3      4      5      6      7      8
16 Nodes – 320 Ranks   1797      1893   2075   2232   1864   1801   1860   2208   2813
32 Nodes – 640 Ranks   1117      1210   1597   4096   1490   1004   1526   3985   7373
Rank Placement
Based on the Platform MPI profiling report, “Message Summary by Rank Pair”
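One hedged way to act on this report with Intel MPI is a machine file listing hosts in the desired rank order, so that rank pairs exchanging the most data share a node (host names and counts are hypothetical):

# hosts.txt places ranks 0-11 on node01, ranks 12-23 on node02, ...:
#   node01:12
#   node02:12
mpirun -machinefile hosts.txt -np 24 ./myApp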
Source Code Optimisations
• Compiler options and reports
• Profiling
• Hot-spot analysis
• Tracing
Tools:
• Score-P (http://www.vi-hps.org/projects/score-p/)
• Scalasca (http://www.scalasca.org/about/about.html)
• Vampir (https://www.vampir.eu/)
• Intel V-Tune Amplifier (https://software.intel.com/en-us/intel-vtune-amplifier-xe)
• Intel ITAC (https://software.intel.com/en-us/intel-trace-analyzer)
• Allinea MAP (http://www.allinea.com/products/map)
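As an example of the workflow, a minimal Score-P sketch (source file name and rank count are placeholders):

# Rebuild with the Score-P instrumentation wrapper
scorep mpicc -O2 -fopenmp -o myApp myApp.c
# Run as usual; a profile is written to a scorep-* experiment directory
mpirun -np 2 ./myApp
# Enable tracing (for Vampir) on a second run
export SCOREP_ENABLE_TRACING=true
mpirun -np 2 ./myApp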
Hotspot Analysis with V-Tune Amplifier and Intel Compiler Reports
Elapsed time [s] per I_MPI_ADJUST_ALLREDUCE algorithm (0 = default):

Algorithm      0     1     2     3     4     5     6     7     8
El. time [s]   331   335   347   407   356   336   346   438   587
#pragma ivdep
#pragma simd
#pragma vector always
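These pragmas guide the Intel compiler's auto-vectorizer; whether they took effect can be checked with an optimization report, sketched here for a hypothetical hotspot.c:

# Request a detailed vectorization report from the Intel compiler
icc -O3 -xCORE-AVX2 -qopt-report=5 -qopt-report-phase=vec -c hotspot.c
# hotspot.optrpt lists which loops were vectorized and why others were not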
Hotspot Analysis with Allinea MAP
Performance improvement through function vectorization
Hotspot Analysis with Score-P
• Profile obtained on 1 node: 2 MPI ranks, 2 OpenMP threads
• Performance improvement through a replication-and-reduce approach instead of using atomic operations
Score-P Trace Analysis with VAMPIR
Summary

1. Application Performance Analysis
   • Latency-oriented dependency on the memory hierarchy
   • Latency-oriented dependency on the interconnect
2. Choose the right hardware
   • A CPU with an intermediate number of cores is more suitable for different usage models
   • Memory configuration: 1 DIMM per memory channel
   • Interconnect with lower latency; reduce the number of hops between the nodes as much as possible
3. Use the resources efficiently
   • Run small jobs
   • Distribute the MPI ranks of a job over as many nodes as possible
   • When resources are limited, share the nodes among the jobs
4. Tune system and middleware parameters
   • CPU binding must be applied
   • Tune the MPI collectives (e.g. Allreduce)
   • No benefit from Hyperthreading
   • Turbo Boost to be used only with jobs on non-fully populated nodes
5. Optimize the application itself
   • Improve cache usage and reuse
   • Improve the vectorization level of the code
Thank you for your attention.
science + computing ag
www.science-computing.de
[email protected]@science-computing.de