
Page 1

Clustering Optimizations – How to achieve optimal performance?

Pak Lui

Page 2

130 Applications Best Practices Published

• Abaqus
• ABySS
• AcuSolve
• Amber
• AMG
• AMR
• ANSYS CFX
• ANSYS FLUENT
• ANSYS Mechanics
• BQCD
• CCSM
• CESM
• COSMO
• CP2K
• CPMD
• Dacapo
• Desmond
• DL-POLY
• Eclipse
• FLOW-3D
• GADGET-2
• GROMACS
• Himeno
• HOOMD-blue
• HYCOM
• ICON
• LAMMPS
• Lattice QCD
• LS-DYNA
• MILC
• miniFE
• MM5
• MPQC
• MR Bayes
• MSC Nastran
• NAMD
• Nekbone
• NEMO
• NWChem
• Octopus
• OpenAtom
• OpenFOAM
• OpenMX
• PARATEC
• PFA
• PFLOTRAN
• Quantum ESPRESSO
• RADIOSS
• SPECFEM3D
• WRF

For more information, visit: http://www.hpcadvisorycouncil.com/best_practices.php

Page 3

Agenda

• Overview Of HPC Application Performance

• Ways To Inspect/Profile/Optimize HPC Applications

– CPU/Memory, File I/O, Network

• System Configurations and Tuning

• Case Studies, Performance Optimization and Highlights

– STAR-CCM+

– ANSYS Fluent

• Conclusions

Page 4

HPC Application Performance Overview

• To achieve scalable performance on HPC applications

– It involves understanding the workload through profile analysis

• Tune for where the most time is spent (CPU, network, I/O, etc.)

– Implicit requirement: each node must perform similarly

• Run CPU/memory/network tests or a cluster checker to identify bad node(s) (see the sketch below)

– Compare behavior when using different HW components

• This pinpoints bottlenecks in different areas of the HPC cluster

• A selection of HPC applications will be shown

– To demonstrate the method of profiling and analysis

– To determine the bottlenecks in SW/HW

– To determine the effectiveness of tuning in improving performance
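To make the node-health check above concrete, here is a minimal sketch (ours, not from the original deck) of a STREAM-style per-node memory-bandwidth probe: every rank times a large copy and prints its hostname, so an underperforming node stands out immediately. The 512 MB buffer size and single COPY kernel are illustrative choices, not the actual STREAM or cluster-checker code.

```c
/* Sketch of a per-node sanity check in the spirit of STREAM and cluster
 * checkers: each rank times a large memory copy and reports bandwidth
 * with its hostname, so an underperforming node stands out.
 * Run one rank per node: mpirun -np <nodes> -map-by node ./nodecheck */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char host[MPI_MAX_PROCESSOR_NAME];
    int len;
    MPI_Get_processor_name(host, &len);

    const size_t n = 64UL << 20;                /* 64M doubles = 512 MB per buffer */
    double *a = calloc(n, sizeof *a);
    double *b = malloc(n * sizeof *b);
    if (!a || !b) { fprintf(stderr, "alloc failed\n"); MPI_Abort(MPI_COMM_WORLD, 1); }

    MPI_Barrier(MPI_COMM_WORLD);                /* start everyone together */
    double t0 = MPI_Wtime();
    memcpy(b, a, n * sizeof *a);                /* STREAM-like COPY kernel */
    double sec = MPI_Wtime() - t0;

    /* count both the read and the write stream */
    printf("%-20s rank %3d: %6.1f GB/s\n", host, rank,
           2.0 * n * sizeof(double) / sec / 1e9);

    free(a); free(b);
    MPI_Finalize();
    return 0;
}
```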

Page 5

Ways To Inspect and Profile Applications

• Computation (CPU/Accelerators)

– Tools: top, htop, perf top, pstack, Visual Profiler, etc.

– Tests and Benchmarks: HPL, STREAM

• File I/O

– Bandwidth and Block Size: iostat, collectl, darshan, etc.

– Characterization Tools and Benchmarks: iozone, ior, etc.

• Network Interconnect

– Tools and Profilers: perfquery, MPI profilers (IPM, TAU, etc.)

– Characterization Tools and Benchmarks:

• Latency and Bandwidth: OSU benchmarks, IMB
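As an illustration of what the latency benchmarks above measure, below is a minimal MPI ping-pong sketch, a simplified stand-in for osu_latency or IMB PingPong rather than their actual code. The message size and iteration count are arbitrary illustrative values.

```c
/* Minimal MPI ping-pong latency sketch: a simplified stand-in for
 * osu_latency / IMB PingPong. Run with: mpirun -np 2 ./pingpong */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int iters = 1000;
    char buf[8] = {0};                          /* 8-byte message */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            MPI_Send(buf, sizeof buf, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency = half the average round trip */
        printf("avg one-way latency: %.2f us\n",
               (t1 - t0) / iters / 2.0 * 1e6);

    MPI_Finalize();
    return 0;
}
```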

Page 6

Case Study: STAR-CCM+

Page 7

STAR-CCM+

• STAR-CCM+

– An engineering process-oriented CFD tool

– Client-server architecture, object-oriented programming

– Delivers the entire CFD process in a single integrated software environment

• Developed by CD-adapco

Page 8

Objectives

• The presented research was done to provide best practices

– CD-adapco performance benchmarking

– Interconnect performance comparisons

– Ways to increase CD-adapco productivity

– Power-efficient simulations

• The presented results will demonstrate

– The scalability of the compute environment/application

– Considerations for higher productivity and efficiency

Page 9

Test Cluster Configuration

• Dell™ PowerEdge™ R720xd 32-node (640-core) “Jupiter” cluster

– Dual-Socket 10-Core Intel Xeon E5-2680 v2 @ 2.80 GHz CPUs (Static max Perf in BIOS)

– Memory: 64GB DDR3 1600 MHz

– OS: RHEL 6.2, OFED 2.1-1.0.0 InfiniBand SW stack

– Hard Drives: 24x 250GB 7.2K RPM SATA 2.5” in RAID 0

• Intel Cluster Ready certified cluster

• Mellanox Connect-IB FDR InfiniBand and ConnectX-3 Ethernet adapters

• Mellanox SwitchX SX6036 VPI InfiniBand and Ethernet switches

• MPI: Mellanox HPC-X v1.0.0 (based on OMPI), Platform MPI 8.3.0.6, Intel MPI 4.1.3

• Application: STAR-CCM+ version 9.02.005 (unless specified otherwise)

• Benchmarks:

– Lemans_Poly_17M (Epsilon Euskadi Le Mans car external aerodynamics)

Page 10

STAR-CCM+ Performance – Network

• FDR InfiniBand delivers the best network scalability performance

– Provides up to 208% higher performance than 10GbE at 32 nodes

– Provides up to 191% higher performance than 40GbE at 32 nodes

– FDR IB scales linearly, while 10/40GbE hit a scalability limit beyond 16 nodes

[Chart: STAR-CCM+ performance by interconnect; higher is better, 20 processes/node]

Page 11

STAR-CCM+ Profiling – Network

• InfiniBand reduces network overhead, resulting in higher CPU utilization

– An efficient network interconnect reduces MPI communication overhead

– With less time spent on the network, overall application runtime improves

• Ethernet solutions consume more time in communications

– 73%-95% of overall time is spent in the network due to Ethernet congestion

– FDR IB, by contrast, spends about 38% of overall runtime in the network

[Chart: CPU vs network time by interconnect; higher is better, 20 processes/node]

Page 12

STAR-CCM+ Profiling – MPI Comm. Time

• Identified MPI overheads by profiling communication time

– Communications occur in collective, point-to-point, and non-blocking operations

– 10/40GbE vs FDR IB: more time spent in Allreduce, Bcast, Recv, Waitany
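Profilers such as IPM obtain these per-call timings through the MPI standard's PMPI profiling interface. Below is a minimal sketch of that mechanism, timing only MPI_Allreduce for brevity; it is our illustration of the technique, not IPM's code.

```c
/* Sketch of the MPI standard's PMPI profiling interface, the mechanism
 * profilers like IPM use to attribute time to individual MPI calls.
 * Intercepts MPI_Allreduce only, for brevity; link into the app or
 * preload as a shared library. */
#include <mpi.h>
#include <stdio.h>

static double allreduce_sec = 0.0;

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
{
    double t0 = MPI_Wtime();
    int rc = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    allreduce_sec += MPI_Wtime() - t0;          /* accumulate time in this call */
    return rc;
}

int MPI_Finalize(void)
{
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: %.3f s in MPI_Allreduce\n", rank, allreduce_sec);
    return PMPI_Finalize();
}
```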

Page 13

STAR-CCM+ Profiling – MPI Comm. Time

• Observed where MPI time is spent for different network hardware

– FDR IB: MPI_Test (46%), MPI_Waitany (13%), MPI_Alltoall (13%), MPI_Reduce (13%)

– 10GbE: MPI_Recv (29%), MPI_Waitany (25%), MPI_Allreduce (15%), MPI_Wait (10%)

Page 14

STAR-CCM+ Performance – Software Versions

• Improvement in latest STAR-CCM+ results in higher performance at scale

– v9.02.005 demonstrated a 28% gain compared to v8.06.005 on a 32-node run

– A slight change in the communication pattern helps improve scalability

– The improvement gap is expected to widen at scale

– See the subsequent MPI profiling slides for the differences

[Chart: STAR-CCM+ v9.02.005 vs v8.06.005; higher is better, 20 processes/node]

Page 15

STAR-CCM+ Profiling – MPI Time Spent

• Communication time has dropped with the latest STAR-CCM+ version

– Less time is spent in MPI, although the communication pattern is roughly the same

– MPI_Barrier time is reduced significantly between the two releases

[Chart: MPI time by call, v8.06.005 vs v9.02.005; 20 processes/node]

Page 16

STAR-CCM+ Performance – MPI Implementations

• STAR-CCM+ has made various MPI implementations available to run

– Default MPI implementation used in STAR-CCM+ is Platform MPI

– MPI implementations started to differentiate beyond 8 nodes

– Optimization flags are already set in the vendor’s startup scripts

– Support for HPC-X is based on the existing Open MPI support in STAR-CCM+

– HPC-X provides 21% higher scalability than the alternatives

[Chart: STAR-CCM+ by MPI implementation; higher is better, version 9.02.005]

Page 17

STAR-CCM+ Performance – Single/Dual Port

• The benefit of deploying dual-port InfiniBand is demonstrated at scale

– Running with dual ports provides up to 11% higher performance at 32 nodes

– Connect-IB sits in a PCIe Gen3 x16 slot, which can provide additional throughput over 2 links

[Chart: single-port vs dual-port Connect-IB; higher is better, 20 processes/node]

Page 18

STAR-CCM+ Performance – Turbo Mode

• Enabling Turbo mode results in higher application performance

– Up to 17% improvement seen by enabling Turbo mode

– Higher performance gains seen at higher node counts

– Boosting the base frequency consequently results in higher power consumption

• The kernel tools in “msr-tools” can adjust Turbo mode dynamically

– Allows Turbo mode to be turned on/off dynamically at the OS level (see the sketch below)

[Chart: STAR-CCM+ with Turbo mode on/off; higher is better, 20 processes/node]
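What msr-tools’ rdmsr/wrmsr do for Turbo can be sketched in C against the Linux msr driver. This assumes an Intel CPU where bit 38 (“Turbo Mode Disable”) of IA32_MISC_ENABLE (MSR 0x1A0) controls Turbo, as documented for the Ivy Bridge parts used here; it needs root and the msr kernel module loaded, and is a sketch rather than a production tool.

```c
/* Sketch of what msr-tools' rdmsr/wrmsr do: toggle Turbo on cpu0 by
 * flipping bit 38 ("Turbo Mode Disable") of IA32_MISC_ENABLE (0x1A0).
 * Assumes an Intel CPU, root privileges, and the msr kernel module
 * (modprobe msr); repeat per CPU for a whole node. Usage: ./turbo 0|1 */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define IA32_MISC_ENABLE  0x1a0
#define TURBO_DISABLE_BIT (1ULL << 38)

int main(int argc, char **argv)
{
    int enable = (argc > 1) ? atoi(argv[1]) : 1;

    int fd = open("/dev/cpu/0/msr", O_RDWR);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t val;
    if (pread(fd, &val, sizeof val, IA32_MISC_ENABLE) != sizeof val) {
        perror("rdmsr"); return 1;
    }
    if (enable) val &= ~TURBO_DISABLE_BIT;       /* clear bit -> Turbo on  */
    else        val |=  TURBO_DISABLE_BIT;       /* set bit   -> Turbo off */
    if (pwrite(fd, &val, sizeof val, IA32_MISC_ENABLE) != sizeof val) {
        perror("wrmsr"); return 1;
    }

    printf("Turbo %s on cpu0\n", enable ? "enabled" : "disabled");
    close(fd);
    return 0;
}
```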

Page 19

STAR-CCM+ Performance – File I/O

• Advantages of staging data to a temporary file system in memory

– A data write of ~8GB occurs at the end of the run for this benchmark

– Staging on a local FS avoids access by all processes (vs NFS)

– Staging on local /dev/shm shows an even higher performance gain (~11%); a throughput sketch follows below

• Using temporary storage is not recommended for production environments

– While /dev/shm outperforms the local FS, it is not recommended for production

– If available, a parallel file system is the preferred solution versus local disk or /dev/shm

[Chart: STAR-CCM+ with data staged to local FS and /dev/shm; higher is better, version 8.06.005]
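A quick way to see the staging gap is to time writes against each candidate target. The sketch below is ours (the ~8GB benchmark write is approximated with 1 GB); it reports write throughput for a given path, so you can compare e.g. /dev/shm against a local filesystem and an NFS mount.

```c
/* Quick write-throughput probe for a staging target: writes 1 GB in
 * 8 MB chunks and reports MB/s. Compare e.g. /dev/shm/test.dat against
 * a local-disk path and an NFS path. Usage: ./iotest <path> */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    const char *path = (argc > 1) ? argv[1] : "/dev/shm/iotest.dat";
    const size_t chunk = 8UL << 20, total = 1UL << 30;

    char *buf = malloc(chunk);
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 'x', chunk);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (size_t done = 0; done < total; done += chunk)
        if (write(fd, buf, chunk) != (ssize_t)chunk) { perror("write"); return 1; }
    fsync(fd);                                   /* include the flush in the timing */
    gettimeofday(&t1, NULL);

    double sec = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%s: %.0f MB/s\n", path, (double)(total >> 20) / sec);

    close(fd);
    unlink(path);                                /* clean up the test file */
    free(buf);
    return 0;
}
```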

Page 20

STAR-CCM+ Performance – System Generations

• New generations of software and hardware provide performance gains

– Performance gain demonstrated through variables in HW and SW

– The latest stack provides ~55% higher performance versus 1 generation behind

– The latest stack provides ~2.6x higher performance versus 2 generations behind

• System components used:

– WSM: Westmere-EP CPUs, DDR3-10666, ConnectX-2 QDR IB, 1 disk, v5.04.006

– SNB: Sandy Bridge-EP CPUs, DDR3-12800, ConnectX-3 FDR IB, 24 disks, v7.02.008

– IVB: Intel Xeon E5-2680 v2 @ 2.80 GHz CPUs, DDR3-12800, Connect-IB FDR IB, 24 disks, v9.02.005

[Chart: STAR-CCM+ across system generations; higher is better]

Page 21

STAR-CCM+ Profiling – User/MPI Time Ratio

• STAR-CCM+ spends more time in computation than communication

– The network time ratio gradually increases with more nodes in the job

– Improvements in network efficiency would be reflected in the overall runtime

[Chart: user vs MPI time ratio, FDR InfiniBand]

Page 22

STAR-CCM+ Profiling – Message Sizes

• The majority of messages are small

– Messages are concentrated below 64KB

• Number of messages increases with the number of nodes

Page 23

STAR-CCM+ Profiling – MPI Data Transfer

• As the cluster grows, less data is transferred between MPI processes

– Drops from ~20GB per rank at 1 node to ~3GB at 32 nodes

– Some node imbalance is visible in the amount of data transferred

– Rank 0 shows significantly higher network activities than other ranks

Page 24

STAR-CCM+ Profiling – Aggregated Transfer

• Aggregated data transfer refers to:

– Total amount of data being transferred in the network between all MPI ranks collectively

• Very large data transfers take place in STAR-CCM+

– High network throughput is required to deliver this bandwidth

– 1.5TB of data is transferred between the MPI processes at 32 nodes

[Chart: aggregated data transfer vs node count, version 9.02.005]

Page 25

STAR-CCM+ – Summary

• Performance

– STAR-CCM+ v9.02.005 improved scalability over v8.06.005 by 28% at 32 nodes

• The performance gap is expected to widen at higher node counts

– FDR InfiniBand delivers the highest network performance for STAR-CCM+ to scale

– FDR IB provides higher performance than other networks

• FDR IB delivers ~191% higher performance than 40GbE and ~208% higher than 10GbE on a 32-node run

– Deploying dual-port Connect-IB HCAs provides 11% higher performance at 32 nodes

– Performance improvement seen compared to older hardware/software generations

• Approximately 55% higher performance for 1 generation and 2.6x for 2 generations

– Enabling Turbo mode results in higher application performance

• Up to 17% improvement seen by enabling Turbo mode

– Mellanox HPC-X provides better performance than the alternatives

• MPI Profiling

– Communication time is reduced with v9.02.005, which improves overall performance

– Ethernet solutions consume more time in communications

• 73%-95% of overall time is spent in the network due to Ethernet congestion, vs ~38% for IB

Page 26

Case Study: ANSYS Fluent

Page 27

ANSYS FLUENT

• Computational Fluid Dynamics (CFD) is a computational technology

– Enables the study of the dynamics of things that flow

– Enables better understanding of qualitative and quantitative physical phenomena in the flow, which is used to improve engineering design

• CFD brings together a number of different disciplines

– Fluid dynamics, mathematical theory of partial differential systems, computational geometry, numerical analysis, computer science

• ANSYS FLUENT is a leading CFD application from ANSYS

– Widely used in almost every industry sector and manufactured product

Page 28

Objectives

• The presented research was done to provide best practices

– Fluent performance benchmarking

• MPI Library performance comparison

• Interconnect performance comparison

• CPU comparison

• Compiler comparison

• The presented results will demonstrate

– The scalability of the compute environment/application

– Considerations for higher productivity and efficiency

Page 29

Test Cluster Configuration

• Dell™ PowerEdge™ R720xd 32-node (640-core) “Jupiter” cluster

– Dual-Socket 10-Core Intel Xeon E5-2680 v2 @ 2.80 GHz CPUs (Turbo mode enabled unless otherwise stated)

– Memory: 64GB DDR3 1600 MHz

– OS: RHEL 6.2, OFED 2.3-1.0.1 InfiniBand SW stack

– Hard Drives: 24x 250GB 7.2K RPM SATA 2.5” in RAID 0

• Intel Cluster Ready certified cluster

• Mellanox Connect-IB FDR InfiniBand adapters

• Mellanox ConnectX-3 QDR InfiniBand and Ethernet VPI adapters

• Mellanox SwitchX SX6036 VPI InfiniBand and Ethernet switches

• MPI: Mellanox HPC-X v1.2.0 (based on OMPI); provided with the application: Intel MPI 4.1.030, IBM Platform MPI 9.1

• Application: ANSYS Fluent 15.0.7

• Benchmarks:

– eddy_417k, turbo_500k, aircraft_2m, sedan_4m, truck_poly_14m, truck_14m

– Descriptions for the test cases can be found at the ANSYS Fluent 15.0 Benchmark page

Page 30

Fluent Performance – Interconnects

[Chart: Fluent performance by interconnect; higher is better]

• FDR InfiniBand enables the highest cluster productivity

– Surpasses other network interconnects in scalability performance

• FDR InfiniBand tops performance among different network interconnects

– FDR InfiniBand outperforms QDR InfiniBand by up to 200% at 32 nodes

– Similarly, FDR outperforms 10GbE by 16 times, and 1GbE by over 39 times

Page 31

Fluent Performance – Interconnects

• FDR InfiniBand likewise outperforms the other interconnects on the remaining Fluent benchmarks

Page 32

Fluent Performance – MPI Implementations

[Chart: Fluent with HPC-X vs Platform MPI vs Intel MPI; FDR InfiniBand, higher is better]

• HPC-X delivers higher scalability performance than the other MPIs compared

– HPC-X outperforms the default Platform MPI by 10%, and Intel MPI by 19%

• Support of HPC-X on Fluent is based on the support of Open MPI on Fluent

• The new “yalla” PML reduces overhead. Flags used for HPC-X:

– -mca coll_fca_enable 1 -mca coll_fca_np 0 -mca pml yalla -map-by node -mca mtl mxm -mca mtl_mxm_np 0 -x MXM_TLS=self,shm,ud --bind-to core


Page 33

Fluent Performance – MPI Implementations

• HPC-X outperforms other MPIs on other benchmark data

Page 34

Fluent Performance – Turbo Mode and Clock

[Chart: FDR InfiniBand; higher is better]

• Advantages are seen when running Fluent at a higher clock rate

– Either by enabling Turbo mode or by raising the CPU clock frequency

• Boosting the CPU clock rate yields higher performance at lower cost

– Increasing to 2800MHz (from 2200MHz) runs 42% faster for 18% increased power

• Running Turbo mode also yields higher performance, but at higher cost

– A 13% performance increase at the expense of 25% increased power usage


Page 35

Fluent Performance – Best Published

[Chart: Fluent best published results comparison; higher is better]

• Results demonstrated by HPCAC outperform the previous best record

– The ANSYS Fluent 15.0 Benchmark page publishes ANSYS Fluent performance results

– HPCAC achieved 26.36% higher performance than the best published result (as of 9/22/2014), despite the slower CPUs used on the Jupiter cluster

– The 32-node/640-core result beats the previous 96-node/1920-core record by 8.53%

– Performance is expected to climb on the Jupiter cluster if more nodes are available


Page 36

Fluent Profiling – I/O Profiling

• Minor disk I/O activities take place on all MPI ranks for this workload

– The majority of disk read activities appear at the beginning of the job run

[Chart: file I/O per MPI rank, FDR InfiniBand]

Page 37

Fluent Profiling – Point-to-point dataflow

• Communication appears limited to MPI ranks that are close to each other

– Heavy communication is seen between the first and last ranks

• The communication pattern does not change as the cluster scales

– However, the amount of data transferred is reduced as the node count grows

[Charts: point-to-point dataflow at 2 nodes and at 32 nodes, FDR InfiniBand]

Page 38

Fluent Profiling – Time Spent by MPI Calls

• The majority of the MPI time is spent in MPI_Waitall

– Accounts for 30% of wall time

– MPI_Allreduce – 20%

– MPI_Recv – 11%

• Some load imbalances in the network are observed

– Some ranks spend more time in MPI_Waitall and MPI_Allreduce

– Might be related to how the workload is distributed among the MPI ranks

[Chart: time spent by MPI call, eddy_417k, 32 nodes]

Page 39

Fluent Profiling – MPI Message Sizes

• The majority of data transfer messages are small to medium sized

– MPI_Allreduce: Large concentration of 4-byte msg (~18% wall time)

– MPI_Wait: Large concentration of 16-byte msg (~11% wall time)

[Chart: MPI message size distribution, eddy_417k, 32 nodes]

Page 40

Fluent – Summary

• Performance

– Jupiter cluster outperforms other system architectures on Fluent

• FDR InfiniBand delivers up to 200% higher performance than QDR InfiniBand

• FDR IB outperforms 10GbE by up to 11 times at 32 nodes / 640 cores

– FDR InfiniBand enables Fluent to break the previous performance record

• Outperforms the previously set record by 25.38% at 640 cores / 32 nodes

• Outperforms the previously set record by 8.52% at 1920 cores / 96 nodes

– HPC-X MPI delivers higher performance than other MPI implementations

• HPC-X outperforms Platform MPI by 10%, outperforms Intel MPI by 19%

• CPU

– Higher CPU clock rates and Turbo mode yield higher performance for Fluent

• Bumping the CPU clock (from 2200MHz to 2800MHz) yields 42% faster performance at 18% increased power

• Enabling Turbo mode translates to 13% increased performance at 25% additional power usage

• Profiling

– Heavy use of small messages in MPI_Waitall, MPI_Allreduce, and MPI_Recv communications

Page 41

Thank You

HPC Advisory Council

All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation to the accuracy and completeness of the information contained herein. HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.