wrf 3.7.1 performance benchmarking and profiling · –wrf performance benchmarking ... • hp...

16
WRF 3.7.1 Performance Benchmarking and Profiling December 2015

Upload: vonhan

Post on 02-Sep-2018

234 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: WRF 3.7.1 Performance Benchmarking and Profiling · –WRF performance benchmarking ... • HP Proliant XL170r Gen9 32-node (1024-core) ... MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW

WRF 3.7.1

Performance Benchmarking and Profiling

December 2015

Page 2: WRF 3.7.1 Performance Benchmarking and Profiling · –WRF performance benchmarking ... • HP Proliant XL170r Gen9 32-node (1024-core) ... MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW

2

Note

• The following research was performed under the HPC Advisory Council

activities

– Special thanks for: HP Enterprise, Mellanox

• For more information on the supporting vendors solutions please refer to:

– www.mellanox.com, http://www.hp.com/go/hpc

• For more information on the application:

– http://www.ansys.com

Page 3: WRF 3.7.1 Performance Benchmarking and Profiling · –WRF performance benchmarking ... • HP Proliant XL170r Gen9 32-node (1024-core) ... MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW

3

Weather Research and Forecasting (WRF)

• The Weather Research and Forecasting (WRF) Model

– Numerical weather prediction system

– Designed for operational forecasting and atmospheric research

• WRF developed by

– National Center for Atmospheric Research (NCAR)

– The National Centers for Environmental Prediction (NCEP)

– Forecast Systems Laboratory (FSL)

– Air Force Weather Agency (AFWA)

– Naval Research Laboratory

– Oklahoma University

– Federal Aviation Administration (FAA)

Page 4: WRF 3.7.1 Performance Benchmarking and Profiling · –WRF performance benchmarking ... • HP Proliant XL170r Gen9 32-node (1024-core) ... MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW

4

WRF Usage

• The WRF model includes

– Real-data and idealized simulations

– Various lateral boundary condition options

– Full physics options

– Non-hydrostatic and hydrostatic

– One-way, two-way nesting and moving nest

– Applications ranging from meters to thousands of kilometers

Page 5: WRF 3.7.1 Performance Benchmarking and Profiling · –WRF performance benchmarking ... • HP Proliant XL170r Gen9 32-node (1024-core) ... MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW

5

Objectives

• The presented research was done to provide best practices

– WRF performance benchmarking

• CPU performance comparison

• MPI library performance comparison

• Interconnect performance comparison

• System generations comparison

• The presented results will demonstrate

– The scalability of the compute environment/application

– Considerations for higher productivity and efficiency

Page 6: WRF 3.7.1 Performance Benchmarking and Profiling · –WRF performance benchmarking ... • HP Proliant XL170r Gen9 32-node (1024-core) ... MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW

6

Test Cluster Configuration

• HP Proliant XL170r Gen9 32-node (1024-core) cluster

– Mellanox ConnectX-4 100Gbps EDR InfiniBand Adapters

– Mellanox Switch-IB SB7700 36-port 100Gb/s EDR InfiniBand Switch

• HP Proliant XL230a Gen9 32-node (1024-core) cluster

– Mellanox Connect-IB FDR 56Gbps FDR InfiniBand Adapters

– Mellanox SwitchX-2 SX6036 36-port 56Gb/s FDR InfiniBand / VPI Ethernet Switch

• Dual-Socket 16-Core Intel E5-2698v3 @ 2.30 GHz CPUs (BIOS: Maximum Performance, Turbo Off)

• Memory: 128GB memory, DDR4 2133 MHz

• OS: RHEL 6.5, MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW stack

• Tools and Libraries: NetCDF 3.6.3, Intel MPI 5.1.1, Intel Compilers 16.0.0.109

• Application: WRF 3.7.1

• Benchmark datasets: CONUS-12km - 48-hour, 12km resolution case over the Continental US from October 24,

2001

Page 7: WRF 3.7.1 Performance Benchmarking and Profiling · –WRF performance benchmarking ... • HP Proliant XL170r Gen9 32-node (1024-core) ... MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW

7

Item HP ProLiant XL230a Gen9 Server

Processor Tw o Intel® Xeon® E5-2600 v3 Series, 6/8/10/12/14/16 Cores

Chipset Intel Xeon E5-2600 v3 series

Memory 512 GB (16 x 32 GB) 16 DIMM slots, DDR3 up to DDR4; R-DIMM/LR-DIMM; 2,133 MHz

Max Memory 512 GB

Internal Storage1 HP Dynamic Smart Array B140i

SATA controller

HP H240 Host Bus Adapter

Netw orkingNetw ork module supporting

various FlexibleLOMs: 1GbE, 10GbE, and/or InfiniBand

Expansion Slots1 Internal PCIe:

1 PCIe x 16 Gen3, half-height

PortsFront: (1) Management, (2) 1GbE, (1) Serial, (1) S.U.V port, (2) PCIe, and Internal Micro SD

card & Active Health

Pow er SuppliesHP 2,400 or 2,650 W Platinum hot-plug pow er supplies

delivered by HP Apollo 6000 Pow er Shelf

Integrated ManagementHP iLO (Firmw are: HP iLO 4)

Option: HP Advanced Power Manager

Additional FeaturesShared Pow er & Cooling and up to 8 nodes per 4U chassis, single GPU support, Fusion I/O

support

Form Factor 10 servers in 5U chassis

HP ProLiant XL230a Gen9 Server

Page 8: WRF 3.7.1 Performance Benchmarking and Profiling · –WRF performance benchmarking ... • HP Proliant XL170r Gen9 32-node (1024-core) ... MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW

8

WRF Performance - EDR InfiniBand vs FDR InfiniBand

• InfiniBand delivers superior scalability performance

– EDR InfiniBand provides higher performance and more scalable than FDR InfiniBand

– EDR InfiniBand delivers up over 68% of higher performance than FDR InfiniBand

– InfiniBand continues to scalable to higher nodes or processes

32 MPI Processes / Node Higher is better

63%

68%

20%

38%

29%

Page 9: WRF 3.7.1 Performance Benchmarking and Profiling · –WRF performance benchmarking ... • HP Proliant XL170r Gen9 32-node (1024-core) ... MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW

9

• Running WRF on both sockets generate higher performance than on single socket

– More CPU cores in MPI processing would generate more data

– With more data generated, higher demand on the network is needed

– EDR IB can provide better message rate and communication bandwidth than FDR IB

Higher is better

WRF Performance - EDR InfiniBand vs FDR InfiniBand

16 MPI Processes / Socket

Page 10: WRF 3.7.1 Performance Benchmarking and Profiling · –WRF performance benchmarking ... • HP Proliant XL170r Gen9 32-node (1024-core) ... MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW

10

• Both conus12km and conus2.5km input data can scale well with InfiniBand

• Both input data can be benefited by the additional cores on the CPU

Higher is better 16 MPI Processes / Socket

WRF Performance – conus2.5km vs conus12km

Page 11: WRF 3.7.1 Performance Benchmarking and Profiling · –WRF performance benchmarking ... • HP Proliant XL170r Gen9 32-node (1024-core) ... MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW

11

• EDR InfiniBand is demonstrated to provide higher throughput needed for WRF

– EDR IB outperforms FDR IB by 38% at 8 nodes with cores on both CPU sockets

– Data generated by additional CPU cores can utilize the additional bandwidth by the interconnect

– EDR IB can provide better message rate and communication bandwidth than FDR IB

Higher is better 16 MPI Processes / Socket

WRF Performance - EDR InfiniBand vs FDR InfiniBand

38%

29%

Page 12: WRF 3.7.1 Performance Benchmarking and Profiling · –WRF performance benchmarking ... • HP Proliant XL170r Gen9 32-node (1024-core) ... MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW

12

WRF Profiling – % MPI Communications

• Time spent on communications differ between the 2 benchmark data

– Conus12km: 83% in MPI_Bcast (60% wall time in 4-byte MPI_Bcast)

– Conus2.5km: 56% in MPI_Bcast (34% wall time in 4-byte MPI_Bcast)

Higher is better

Conus2.5km – 576 coresConus12km – 896 cores

Page 13: WRF 3.7.1 Performance Benchmarking and Profiling · –WRF performance benchmarking ... • HP Proliant XL170r Gen9 32-node (1024-core) ... MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW

13

Conus2.5km – 576 coresConus12km – 896 cores

WRF Profiling – % MPI Communications

• Majority of the MPI time is spent on MPI_Bcast

– Conus12km: all processes appeared to be balance

– Conus2.5km: imbalance is seen for some MPI processes for MPI_Wait

– MPI_Wait: For waiting for pending non-blocking sends and receives to complete

Higher is better

Page 14: WRF 3.7.1 Performance Benchmarking and Profiling · –WRF performance benchmarking ... • HP Proliant XL170r Gen9 32-node (1024-core) ... MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW

14

WRF Profiling – % MPI Communications

• Majority of data transfer messages are medium sizes for both data, except for:

– MPI_Bcast has a large concentration (60% wall) in small sizes (e.g. 4 byte size)

– MPI_Scatter shows some concentration (~6% wall time) at 16KB buffer

– MPI_Wait: Large concentration of 1 to less than 256 bytes

Higher is better

Conus2.5km – 576 coresConus12km – 896 cores

Page 15: WRF 3.7.1 Performance Benchmarking and Profiling · –WRF performance benchmarking ... • HP Proliant XL170r Gen9 32-node (1024-core) ... MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW

15

WRF Summary

• InfiniBand delivers superior scalability performance

– EDR InfiniBand delivers up over 68% of higher performance than FDR InfiniBand

• EDR InfiniBand is demonstrated to provide higher throughput needed for WRF

– EDR IB outperforms FDR IB by 38% at 8 nodes with cores on both CPU sockets

– Data generated by additional cores can utilize the additional bandwidth by the interconnect

– EDR IB can provide better message rate and communication bandwidth than FDR IB

• Both conus12km and conus2.5km demonstrates good scalability with InfiniBand

• MPI Profiling

– Conus12km: 83% in MPI_Bcast (60% wall time in 4-byte MPI_Bcast)

– Conus2.5km: 56% in MPI_Bcast (34% wall time in 4-byte MPI_Bcast)

Page 16: WRF 3.7.1 Performance Benchmarking and Profiling · –WRF performance benchmarking ... • HP Proliant XL170r Gen9 32-node (1024-core) ... MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW

1616

Thank YouHPC Advisory Council

All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation to the accuracy and

completeness of the information contained herein. HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein