
Page 1

Performance and Scalability of FLOW-3D/MP 5.0 in HPC Cluster Environment

Pak Lui, Application Performance Manager

HPC Advisory Council

Page 2

2

Agenda

• Introduction to HPC Advisory Council

• Benchmark Configuration

• Performance Benchmark Testing and Results

• Summary

• Q&A / For More Information

Page 3

3

The HPC Advisory Council

• World-wide HPC organization (360+ members)

• Bridges the gap between HPC usage and its full potential

• Provides best practices and a support/development center

• Explores future technologies and developments

• Working Groups – HPC|Cloud, HPC|Scale, HPC|GPU, HPC|Storage

• Leading edge solutions and technology demonstrations

Page 4

4

HPC Advisory Council Members

Page 5

5

HPC Council Board

• HPC Advisory Council Chairman
  Gilad Shainer - [email protected]

• HPC Advisory Council Media Relations and Events Director
  Brian Sparks - [email protected]

• HPC Advisory Council China Events Manager
  Blade Meng - [email protected]

• Director of the HPC Advisory Council, Asia
  Tong Liu - [email protected]

• HPC Advisory Council HPC|Works SIG Chair and Cluster Center Manager
  Pak Lui - [email protected]

• HPC Advisory Council Director of Educational Outreach
  Scot Schultz - [email protected]

• HPC Advisory Council Programming Advisor
  Tarick Bedeir - [email protected]

• HPC Advisory Council HPC|Scale SIG Chair
  Richard Graham - [email protected]

• HPC Advisory Council HPC|Cloud SIG Chair
  William Lu - [email protected]

• HPC Advisory Council HPC|GPU SIG Chair
  Sadaf Alam - [email protected]

• HPC Advisory Council India Outreach
  Goldi Misra - [email protected]

• Director of the HPC Advisory Council Switzerland Center of Excellence and HPC|Storage SIG Chair
  Hussein Harake - [email protected]

• HPC Advisory Council Workshop Program Director
  Eric Lantz - [email protected]

• HPC Advisory Council Research Steering Committee Director
  Cydney Stevens - [email protected]

Page 6

6

HPC Advisory Council HPC Center

[Diagram: HPC Advisory Council Cluster Center systems (Juniper, Dodecas, Plutus, Janus, Athena, Vesta, Mercury, Mala) backed by InfiniBand-based Lustre storage; individual clusters range from 64 to 704 cores, plus a 16-GPU system]

Page 7

7

Special Interest Subgroups Missions

• HPC|Scale

– To explore the usage of commodity HPC as a replacement for multi-million-dollar mainframes and proprietary supercomputers, with networks and clusters of microcomputers acting in unison to deliver high-end computing services.

• HPC|Cloud

– To explore the usage of HPC components as part of the creation of external/public/internal/private cloud computing environments.

• HPC|Works

– To provide best practices for building balanced and scalable HPC systems, performance tuning, and application guidelines.

• HPC|Storage

– To demonstrate how to build high-performance storage solutions and their effect on application performance and productivity. One of the main interests of the HPC|Storage subgroup is to explore Lustre-based solutions and to expose more users to the potential of Lustre over high-speed networks.

• HPC|GPU

– To explore usage models of GPU components as part of next-generation compute environments and potential optimizations for GPU-based computing.

• HPC|FSI

– To explore the usage of high-performance computing solutions for low-latency trading, more productive simulations (such as Monte Carlo), and overall more efficient financial services.

Page 8

8

HPC Advisory Council

• HPC Advisory Council (HPCAC)

– 360+ members

– http://www.hpcadvisorycouncil.com/

– Application best practices and case studies (over 150)

– Benchmarking center with remote access for users

– World-wide workshops

– Value add for your customers, helping them stay up to date and in tune with the HPC market

• 2013 Workshops

– USA (Stanford University) – January 2013

– Switzerland – March 2013

– Germany (ISC’13) – June 2013

– Spain – Sep 2013

– China (HPC China) – Oct 2013

• For more information

– www.hpcadvisorycouncil.com

[email protected]

Page 9

9

ISC’14 – Student Cluster Competition

• University-based teams compete and demonstrate the incredible capabilities of state-of-the-art HPC systems and applications on the ISC’14 show floor

• The Student Cluster Competition is designed to introduce the next generation of students to the high performance computing world and community

Page 10

10

FLOW-3D/MP Performance Study

• Research performed under the HPC Advisory Council activities

– Participating vendors: Intel, Dell, Mellanox

– Compute resource - HPC Advisory Council Cluster Center

• Objectives

– Give overview of FLOW-3D/MP performance

– Compare different MPI libraries, network interconnects, and other factors

– Understand FLOW-3D/MP communication patterns

– Provide best practices to increase FLOW-3D/MP productivity

Page 11

11

FLOW-3D/MP

• FLOW-3D/MP is powerful and highly accurate CFD software

– Provides engineers valuable insight into many physical flow processes

• FLOW-3D/MP is the ideal computational fluid dynamics software

– To use in the design phase as well as in improving production processes

– Provides special capabilities for accurately predicting free-surface flows

• FLOW-3D/MP is a standalone, all-inclusive CFD package

– Includes an integrated GUI that ties components from problem setup to post-processing

Page 12

12

Test Cluster Configuration

• Dell™ PowerEdge™ R720xd 32-node “Jupiter” cluster

– 16-node Dual-Socket Eight-Core Intel E5-2680 @ 2.70 GHz CPUs

– 16-node Dual-Socket Ten-Core Intel E5-2680 V2 @ 2.80 GHz CPUs

– Memory: 64GB memory, DDR3 1600 MHz

– OS: RHEL 6.2, OFED 1.5.3-3.1.0 (for v4.2) and 2.0-3.0.0 (for v5.0) InfiniBand SW stack

– Hard Drives: 24x 250GB 7.2K RPM SATA 2.5” on RAID 0

• Mellanox Connect-IB FDR InfiniBand adapters and ConnectX-3 Ethernet adapters

• Mellanox SwitchX SX6036 InfiniBand switch

• MPI (vendor provided): Intel MPI 3.2.0.011 (v4.2) and 4.0.0.025 (v5.0)

• Application: FLOW-3D/MP 4.2 and 5.0

• Benchmarks:

– Lid Driven Cavity Flow - completely filled with fluid (in this case air)
– P2 Engine Block - a casting application with a free surface (one-fluid simulation with sharp interface tracking)

Page 13

13

About Intel® Cluster Ready

• Intel® Cluster Ready systems make it practical to use a cluster to increase your simulation and modeling productivity

– Simplifies selection, deployment, and operation of a cluster

• A single architecture platform supported by many OEMs, ISVs, cluster provisioning vendors, and interconnect providers

– Focus on your work productivity, spend less management time on the cluster

• Select Intel Cluster Ready

– Where the cluster is delivered ready to run

– Hardware and software are integrated and configured together

– Applications are registered, validating execution on the Intel Cluster Ready architecture

– Includes the Intel® Cluster Checker tool to verify functionality and periodically check cluster health

• FLOW-3D/MP is Intel Cluster Ready certified

Page 14

14

PowerEdge R720xd – Massive flexibility for data-intensive operations

• Performance and efficiency

– Intelligent hardware-driven systems management with extensive power management features

– Innovative tools including automation for parts replacement and lifecycle manageability

– Broad choice of networking technologies from 1GbE to IB

– Built in redundancy with hot plug and swappable PSU, HDDs and fans

• Benefits

– Designed for performance workloads
• From big data analytics, distributed storage, or distributed computing where local storage is key, to classic HPC and large-scale hosting environments
• High-performance scale-out compute and low-cost dense storage in one package

• Hardware Capabilities

– Flexible compute platform with dense storage capacity

• 2S/2U server, 6 PCIe slots

– Large memory footprint (Up to 768GB / 24 DIMMs)

– High I/O performance and optional storage configurations

• HDD options: 12 x 3.5” - or - 24 x 2.5” + 2 x 2.5” HDDs in rear of server

• Up to 26 HDDs with 2 hot plug drives in rear of server for boot or scratch

Page 15

15

FLOW-3D/MP Performance – MPI vs Hybrid

• FLOW-3D/MP supports an MPI+OpenMP Hybrid mode
– Up to 327% higher performance than pure MPI mode at 16 nodes

• The Hybrid version enables higher scalability than the pure MPI version (see the launch sketch at the end of this page)
– The Hybrid version delivers better scalability beyond 4 nodes
– MPI processes spawn OpenMP threads for computation on the CPU cores
– This streamlines and reduces communication endpoints, improving scalability

[Chart: Performance Rating (Jobs/Day), Intel E5-2680 V2 with FDR InfiniBand; Hybrid mode up to 327% higher than pure MPI at 16 nodes]
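As a complement to the comparison above, the lines below sketch what the two launch modes amount to with the Intel MPI launcher used in this study. This is a minimal sketch only: the solver executable name, node count, and rank counts are placeholders, and in practice the vendor-supplied runhyd_par script sets these options for you.

  # Hypothetical launch lines for 16 dual-socket, 8-core nodes (256 cores total).
  # <flow3d_solver> stands in for the solver binary that runhyd_par actually starts.

  # Pure MPI mode: one rank per core, 16 ranks per node
  mpirun -np 256 -ppn 16 <flow3d_solver>

  # Hybrid mode (the deck's default): one rank per node, 16 OpenMP threads per rank
  export OMP_NUM_THREADS=16
  mpirun -np 16 -ppn 1 -genv I_MPI_PIN_DOMAIN node <flow3d_solver>

Fewer ranks per node means fewer communication endpoints, which is where the Hybrid mode's scalability advantage at higher node counts comes from.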

Page 16

16

FLOW-3D/MP Performance – Hybrid Mode

• The default for FLOW-3D/MP Hybrid mode is to run 1 PPN with 16 threads
– This uses “-genv I_MPI_PIN_DOMAIN node”, as specified in the runhyd_par script

• For best performance on 2-socket platforms:
– Use “-genv I_MPI_PIN_DOMAIN socket” in the runhyd_par script
– 2 PPN with 8/10 threads each can yield better performance than 1 PPN with 16/20 threads
– The flag allows each MPI process to spawn threads within its own socket
– Instead of both MPI processes sharing the same socket (see the sketch below)

[Chart: FLOW-3D/MP 5.0, FDR InfiniBand; gains of 110% and 178% from socket pinning]
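To make the socket-pinning advice concrete, here is a hedged sketch of the tuned Hybrid launch; the real change is made inside runhyd_par, and the binary name and rank count are placeholders, not taken from the deck.

  # Hybrid mode tuned for a 2-socket node: 2 ranks per node, one pinned to each socket,
  # 8 OpenMP threads per rank on the 8-core E5-2680 (10 threads on the 10-core E5-2680 V2)
  export OMP_NUM_THREADS=8
  mpirun -np 32 -ppn 2 -genv I_MPI_PIN_DOMAIN socket <flow3d_solver>

With the “socket” domain, each rank's threads stay on the cores and memory of its own socket instead of spanning both, which is why 2 PPN with 8/10 threads can outperform 1 PPN with 16/20 threads.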

Page 17

17

FLOW-3D/MP Performance – Hybrid Mode

• 2 PPN with 8 threads each can provide better performance than 1 PPN with 16 threads
– Each MPI process spawns its threads within its own socket
– With “I_MPI_PIN_DOMAIN=socket” specified in the runhyd_par script

• The default for FLOW-3D/MP Hybrid mode is to run 1 PPN with 16 threads
– With “I_MPI_PIN_DOMAIN=node” specified in the runhyd_par script

• The flag is changed to “socket” to allow threads to spawn within a single socket
– For the case of 2 PPN with 8 threads each

[Chart: FLOW-3D/MP 4.2, FDR InfiniBand; gains of 9% and 35% from socket pinning]

Page 18

18

FLOW-3D/MP Performance – Processors

• The Intel E5-2680 (Sandy Bridge) cluster outperforms the Intel Xeon X5670 cluster

– Performs 70% better than X5670 cluster at 16 nodes

• System components used:

– Sandy Bridge: 2-socket Intel E5-2680 @ 2.7GHz, 1600MHz DIMMs, FDR IB, 24 disks

– Westmere: 2-socket Intel X5670 @ 2.93GHz, 1333MHz DIMMs, QDR IB, 1 disk

[Chart: FLOW-3D/MP 4.2, FDR InfiniBand; 70% advantage for the Sandy Bridge cluster at 16 nodes]

Page 19

19

FLOW-3D/MP Performance – Processors

• The Intel E5-2680 V2 (Ivy Bridge) cluster outperforms the Intel E5-2680 cluster

– Performs 8% better than the Sandy Bridge cluster at 16 nodes

• System components used:

– Sandy Bridge: 2-socket 8-core Intel E5-2680 @ 2.7GHz

– Ivy Bridge: 2-socket 10-core Intel E5-2680 V2 @ 2.8GHz

[Chart: FLOW-3D/MP 5.0, FDR InfiniBand; Ivy Bridge gains of 5% and 8% over Sandy Bridge]

Page 20

20

FLOW-3D/MP Performance – Software Versions

• FLOW-3D/MP v5.0 outperforms v4.2 in scalability in Hybrid mode
– Runs up to 37% faster in Hybrid mode at 16 nodes

• Running in pure MPI mode shows slightly longer runtimes than the previous version
– Appears to be caused by a change in the communication algorithm
– More MPI collective operations are used than in the prior version

[Charts: Performance Rating (Jobs/Day); v5.0 vs. v4.2 with 16 MPI processes/node in MPI mode; up to 37% gain in Hybrid mode]

Page 21

21

FLOW-3D/MP Performance – Network

• InfiniBand FDR provides better scalability performance than Ethernet (see the fabric-selection sketch below)
– The scalability gap widens as more nodes are involved in the simulation
– FDR InfiniBand provides up to 246% better performance than 1GbE
– FDR InfiniBand delivers up to 39% better performance than 10GbE
– Hybrid mode is shown

[Chart: Performance Rating (Jobs/Day); FDR InfiniBand up to 246% faster than 1GbE and 39% faster than 10GbE]
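For reference, one way to run the same job over the different fabrics when reproducing a comparison like this is Intel MPI's fabric-selection variable. This is a sketch only, since the available provider names (ofa, dapl, tcp) depend on the installed OFED stack, and the solver name is a placeholder.

  # FDR InfiniBand via the verbs/OFA provider (shm:dapl is another common choice)
  mpirun -genv I_MPI_FABRICS shm:ofa -np 256 -ppn 16 <flow3d_solver>

  # Force TCP over the 1GbE or 10GbE interface for the Ethernet data points
  mpirun -genv I_MPI_FABRICS shm:tcp -np 256 -ppn 16 <flow3d_solver>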

Page 22

22

FLOW-3D/MP Profiling – Time Ratio

• InfiniBand FDR reduces the communication time at scale

– InfiniBand FDR consumes about 17% of total runtime on 16-node Hybrid job

– 10GbE consumes 39% of total time, while 1GbE consumes about 75%

• IB RDMA technology allows communication to bypass CPU involvement

– Reduces CPU overhead in handling communication

– This leaves more time for application processing

Page 23

23

FLOW-3D/MP Profiling – # of MPI Calls

• Overall runtime decreases as more nodes take part in the MPI job
– More compute nodes reduce runtime by spreading out the workload

• Computation time drops while communication time stays flat
– As the cluster scales, MPI time stays roughly at the same level

[Chart: computation vs. MPI time, 16 MPI processes/node, pure MPI mode]

Page 24

24

FLOW-3D/MP Profiling – Communication Time

• The most time-consuming MPI functions are:
– FDR: MPI_Allreduce (51%), MPI_Bcast (24%), MPI_Waitall (16%)

• InfiniBand reduces time spent in collective operations more than Ethernet does
– Collective communications are the most used in FLOW-3D/MP v5.0
– These communications account for the highest share of communication time (a profiling sketch follows below)
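The deck does not say which profiler produced this breakdown; one low-effort way to get a similar per-routine summary is Intel MPI's built-in statistics collection, sketched below under that assumption. The output file name and launch parameters are placeholders.

  # Collect native MPI statistics during the run and write them to a text report
  mpirun -genv I_MPI_STATS 3 -genv I_MPI_STATS_FILE flow3d_stats.txt \
         -np 256 -ppn 16 <flow3d_solver>
  # The report lists time and message volume per MPI routine (MPI_Allreduce,
  # MPI_Bcast, MPI_Waitall, ...), the kind of data summarized on this page.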

Page 25

25

FLOW-3D/MP Profiling – # of MPI Calls

• A wide range of message sizes is seen:
– MPI_Allreduce: concentration between 4B and 16B
– MPI_Waitall: around 4MB
– MPI_Bcast: around 1-4MB

Page 26

26

FLOW-3D/MP Performance – File System

• Storing data files on a local file system or tmpfs improves performance (see the staging sketch below)
– Scalability is limited by NFS when running at scale beyond 8 nodes
– The NFS used in this case runs over a 1GbE network

[Chart: FDR InfiniBand, higher is better; up to 25% advantage for local/tmpfs storage over NFS]
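A minimal job-script sketch of the staging idea, assuming a node-local scratch directory (or /dev/shm for tmpfs); the paths and directory names are hypothetical, not taken from the deck.

  # Stage the case from NFS to node-local storage, run there, copy results back
  SCRATCH=/tmp/$USER/flow3d_case             # use /dev/shm/$USER/flow3d_case for tmpfs
  mkdir -p "$SCRATCH"
  cp /nfs/projects/$USER/case/* "$SCRATCH"/  # hypothetical NFS location of the input files
  cd "$SCRATCH"
  # ... launch the FLOW-3D/MP solver here (e.g., via runhyd_par) ...
  cp "$SCRATCH"/* /nfs/projects/$USER/case/  # copy restart/spatial output back after the run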

Page 27

27

FLOW-3D/MP Profiling – Disk IO Time

• File IO access occurs during certain periods of the MPI solver run
– Large spikes occur when writing the restart and spatial data
– Files are written to local storage instead of NFS to avoid an IO bottleneck

Page 28

28

FLOW-3D/MP – Summary

• Scalability

– FLOW-3D/MP v5.0 Hybrid mode enables higher scalability versus pure MPI version

• Hybrid version delivers good scalability to 16 nodes (320 cores); >37% faster than v4.2

• Performance

– Intel Ivy Bridge-EP and FDR InfiniBand enable FLOW-3D/MP to scale to 320 cores

– Allocating MPI processes to the “proper” socket in Hybrid mode allows performance to jump by 178%

– Hybrid mode allows FLOW-3D/MP to scale to 16 nodes, up to 327% faster than pure MPI mode

• Network

– InfiniBand FDR delivers the best scalability performance with its 56Gb/s rate
• Outperforms 1GbE by 246% at 16 nodes (320 cores)
• Outperforms 10GbE by 39% at 16 nodes (320 cores)

– RDMA technology in InfiniBand allows network transfers to bypass the CPU
• This offload reduces CPU overhead in handling communication, so the CPU can focus on the application

• Profiling

– MPI communication time is spent mostly in MPI_Allreduce, at 51% of overall MPI time
– InfiniBand can process collective operations in the network faster than Ethernet
– A large concentration of small messages, typical for latency-sensitive HPC applications

Page 29

29

All trademarks are property of their respective owners. All information is provided “as-is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein, and undertakes no duty and assumes no obligation to update or correct any information presented herein.

Thank You

• Special thanks to Anup Gokarn of Flow Science

• Questions?

– Pak Lui

[email protected]