WRF 3.7.1
Performance Benchmarking and Profiling
December 2015
Note
• The following research was performed under the HPC Advisory Council
activities
– Special thanks to Hewlett Packard Enterprise and Mellanox
• For more information on the supporting vendors' solutions please refer to:
– www.mellanox.com, http://www.hp.com/go/hpc
• For more information on the application:
– http://www.wrf-model.org
Weather Research and Forecasting (WRF)
• The Weather Research and Forecasting (WRF) Model
– Numerical weather prediction system
– Designed for operational forecasting and atmospheric research
• WRF developed by
– National Center for Atmospheric Research (NCAR)
– The National Centers for Environmental Prediction (NCEP)
– Forecast Systems Laboratory (FSL)
– Air Force Weather Agency (AFWA)
– Naval Research Laboratory
– University of Oklahoma
– Federal Aviation Administration (FAA)
WRF Usage
• The WRF model includes
– Real-data and idealized simulations
– Various lateral boundary condition options
– Full physics options
– Non-hydrostatic and hydrostatic
– One-way, two-way nesting and moving nest
– Applications ranging from meters to thousands of kilometers
Objectives
• The presented research was done to provide best practices for
– WRF performance benchmarking
• CPU performance comparison
• MPI library performance comparison
• Interconnect performance comparison
• System generations comparison
• The presented results will demonstrate
– The scalability of the compute environment/application
– Considerations for higher productivity and efficiency
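The scalability comparisons that follow reduce to simple speedup and parallel-efficiency arithmetic. A minimal sketch (the wall-clock timings below are hypothetical placeholders, not measured WRF results):

```python
# Sketch: speedup and parallel efficiency from benchmark wall-clock times.
# The timings below are hypothetical placeholders, not measured WRF results.

def speedup(t_base, t_n):
    """Speedup of an N-node run relative to the baseline run."""
    return t_base / t_n

def efficiency(t_base, nodes_base, t_n, nodes_n):
    """Parallel efficiency: achieved speedup divided by ideal (linear) speedup."""
    return speedup(t_base, t_n) / (nodes_n / nodes_base)

# Hypothetical wall-clock times (seconds) keyed by node count
timings = {1: 3600.0, 2: 1850.0, 4: 960.0, 8: 510.0}
for n, t in sorted(timings.items()):
    print(n, round(speedup(timings[1], t), 2),
          round(efficiency(timings[1], 1, t, n), 2))
```

Efficiency near 1.0 at the largest node count is what "good scalability" means in the slides that follow.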
Test Cluster Configuration
• HP ProLiant XL170r Gen9 32-node (1024-core) cluster
– Mellanox ConnectX-4 100Gb/s EDR InfiniBand adapters
– Mellanox Switch-IB SB7700 36-port 100Gb/s EDR InfiniBand switch
• HP ProLiant XL230a Gen9 32-node (1024-core) cluster
– Mellanox Connect-IB 56Gb/s FDR InfiniBand adapters
– Mellanox SwitchX-2 SX6036 36-port 56Gb/s FDR InfiniBand / VPI Ethernet switch
• Dual-Socket 16-Core Intel E5-2698v3 @ 2.30 GHz CPUs (BIOS: Maximum Performance, Turbo Off)
• Memory: 128GB DDR4, 2133 MHz
• OS: RHEL 6.5, MLNX_OFED_LINUX-3.0-1.0.1 InfiniBand SW stack
• Tools and Libraries: NetCDF 3.6.3, Intel MPI 5.1.1, Intel Compilers 16.0.0.109
• Application: WRF 3.7.1
• Benchmark datasets: CONUS-12km – 48-hour, 12km resolution case over the Continental US from October 24, 2001; CONUS-2.5km – higher-resolution 2.5km case over the Continental US
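WRF benchmark numbers are conventionally derived from the per-timestep timings that WRF writes to its rsl.out.0000 log. A sketch of extracting the mean step time (the sample log lines below are illustrative, not output from the runs in this study):

```python
import re

# Sketch: deriving the common WRF benchmark metric (mean elapsed seconds per
# model timestep) from "Timing for main" lines in an rsl.out.0000 log.
# The sample lines below are illustrative, not from this study's runs.
sample_log = """\
Timing for main: time 2001-10-24_12:01:12 on domain 1: 2.10000 elapsed seconds
Timing for main: time 2001-10-24_12:02:24 on domain 1: 1.90000 elapsed seconds
Timing for main: time 2001-10-24_12:03:36 on domain 1: 2.00000 elapsed seconds
"""

def mean_step_time(log_text):
    """Average the per-timestep elapsed seconds reported by WRF."""
    pattern = re.compile(r"Timing for main:.*:\s*([\d.]+) elapsed seconds")
    times = [float(m.group(1)) for m in pattern.finditer(log_text)]
    return sum(times) / len(times)

print(mean_step_time(sample_log))
```

A lower mean step time at a given node count corresponds to the "higher is better" throughput plotted in the following slides.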
HP ProLiant XL230a Gen9 Server
– Processor: Two Intel® Xeon® E5-2600 v3 Series, 6/8/10/12/14/16 cores
– Chipset: Intel Xeon E5-2600 v3 series
– Memory: 512 GB (16 x 32 GB), 16 DIMM slots, DDR4 R-DIMM/LR-DIMM, 2,133 MHz
– Max Memory: 512 GB
– Internal Storage: HP Dynamic Smart Array B140i SATA controller; HP H240 Host Bus Adapter
– Networking: network module supporting various FlexibleLOMs: 1GbE, 10GbE, and/or InfiniBand
– Expansion Slots: 1 internal PCIe: 1 PCIe x16 Gen3, half-height
– Ports: front: (1) Management, (2) 1GbE, (1) Serial, (1) S.U.V port, (2) PCIe; internal Micro SD card & Active Health
– Power Supplies: HP 2,400 or 2,650 W Platinum hot-plug power supplies delivered by the HP Apollo 6000 Power Shelf
– Integrated Management: HP iLO (Firmware: HP iLO 4); option: HP Advanced Power Manager
– Additional Features: shared power & cooling, up to 8 nodes per 4U chassis, single GPU support, Fusion I/O support
– Form Factor: 10 servers in 5U chassis
WRF Performance - EDR InfiniBand vs FDR InfiniBand
• InfiniBand delivers superior scalability
– EDR InfiniBand provides higher performance and better scalability than FDR InfiniBand
– EDR InfiniBand delivers up to 68% higher performance than FDR InfiniBand
– InfiniBand continues to scale as node and process counts increase
[Chart: 32 MPI Processes / Node, higher is better; EDR gains over FDR of 20%, 29%, 38%, 63%, and 68% shown]
WRF Performance - EDR InfiniBand vs FDR InfiniBand
• Running WRF on both sockets generates higher performance than on a single socket
– More CPU cores in MPI processing generate more data
– The additional data places higher demand on the network
– EDR IB provides better message rate and communication bandwidth than FDR IB
[Chart: 16 MPI Processes / Socket, higher is better]
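The EDR-over-FDR percentages quoted in these slides are straightforward relative-throughput ratios. A sketch of the calculation (the per-node-count throughput figures below are hypothetical placeholders, chosen only to illustrate gains of the magnitude reported, not the study's measured data):

```python
# Sketch: percentage performance advantage of one interconnect over another,
# computed from throughput figures (jobs per unit time, higher is better).
# The EDR/FDR numbers below are hypothetical placeholders, not measured data.

def percent_gain(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0

fdr_perf = {8: 10.0, 16: 14.5, 32: 17.0}   # hypothetical throughput, FDR
edr_perf = {8: 13.8, 16: 20.0, 32: 28.6}   # hypothetical throughput, EDR
for nodes in sorted(fdr_perf):
    print(nodes, round(percent_gain(edr_perf[nodes], fdr_perf[nodes]), 1))
```

Note how the relative gain grows with node count in this illustration, mirroring the slides' observation that the interconnect matters more as the cluster scales.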
WRF Performance – conus2.5km vs conus12km
• Both conus12km and conus2.5km input datasets scale well with InfiniBand
• Both datasets benefit from the additional cores on the CPU
[Chart: 16 MPI Processes / Socket, higher is better]
WRF Performance - EDR InfiniBand vs FDR InfiniBand
• EDR InfiniBand demonstrates the higher throughput needed for WRF
– EDR IB outperforms FDR IB by 38% at 8 nodes with cores on both CPU sockets in use
– Data generated by the additional CPU cores can utilize the additional interconnect bandwidth
– EDR IB provides better message rate and communication bandwidth than FDR IB
[Chart: 16 MPI Processes / Socket, higher is better; 29% and 38% EDR gains shown]
WRF Profiling – % MPI Communications
• Time spent in communication differs between the two benchmark datasets
– Conus12km: 83% in MPI_Bcast (60% of wall time in 4-byte MPI_Bcast)
– Conus2.5km: 56% in MPI_Bcast (34% of wall time in 4-byte MPI_Bcast)
[Charts: Conus12km – 896 cores; Conus2.5km – 576 cores]
WRF Profiling – % MPI Communications
• The majority of MPI time is spent in MPI_Bcast
– Conus12km: all processes appear to be balanced
– Conus2.5km: imbalance is seen across some MPI processes in MPI_Wait
– MPI_Wait waits for pending non-blocking sends and receives to complete
[Charts: Conus12km – 896 cores; Conus2.5km – 576 cores]
WRF Profiling – MPI Message Sizes
• The majority of data transfer messages are medium-sized for both datasets, except:
– MPI_Bcast has a large concentration (60% of wall time) at small sizes (e.g. 4 bytes)
– MPI_Scatter shows some concentration (~6% of wall time) at 16KB buffers
– MPI_Wait: large concentration of messages from 1 to under 256 bytes
[Charts: Conus12km – 896 cores; Conus2.5km – 576 cores]
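Breakdowns like the ones above come from aggregating per-call, per-message-size timings collected by an MPI profiler (tools such as IPM or mpiP report in this shape). A sketch of the aggregation step (the profile entries below are hypothetical placeholders chosen to resemble the conus12km breakdown, not the study's actual profile):

```python
# Sketch: tallying the share of total MPI wall time per call from a
# (call, message_bytes, seconds) profile, the kind of breakdown an MPI
# profiler such as IPM or mpiP emits. Entries are hypothetical placeholders.
profile = [
    ("MPI_Bcast",   4,     60.0),  # many tiny 4-byte broadcasts
    ("MPI_Bcast",   8192,  23.0),  # medium-sized broadcasts
    ("MPI_Scatter", 16384,  6.0),  # 16KB scatter buffers
    ("MPI_Wait",    128,   11.0),  # waits on small non-blocking transfers
]

total = sum(t for _, _, t in profile)
by_call = {}
for call, _, t in profile:
    by_call[call] = by_call.get(call, 0.0) + t

# Report each call's share of total MPI wall time, largest first
for call, t in sorted(by_call.items(), key=lambda kv: -kv[1]):
    print(call, round(100.0 * t / total, 1))
```

Separating the tally by message size as well (the second tuple field) is what reveals patterns like the 4-byte MPI_Bcast concentration called out above.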
WRF Summary
• InfiniBand delivers superior scalability performance
– EDR InfiniBand delivers up to 68% higher performance than FDR InfiniBand
• EDR InfiniBand demonstrates the higher throughput needed for WRF
– EDR IB outperforms FDR IB by 38% at 8 nodes with cores on both CPU sockets in use
– Data generated by the additional cores can utilize the additional interconnect bandwidth
– EDR IB provides better message rate and communication bandwidth than FDR IB
• Both conus12km and conus2.5km demonstrate good scalability with InfiniBand
• MPI Profiling
– Conus12km: 83% in MPI_Bcast (60% of wall time in 4-byte MPI_Bcast)
– Conus2.5km: 56% in MPI_Bcast (34% of wall time in 4-byte MPI_Bcast)
Thank You
HPC Advisory Council
All trademarks are property of their respective owners. All information is provided "As-Is" without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and
completeness of the information contained herein. The HPC Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein.