LAMMPS
Performance Benchmark and Profiling
November 2020
Note
• The following research was performed under the HPC Advisory Council activities
– HPC-AI AC – Iris cluster
– Dell – Zenith cluster
• The following was done to provide best practices
– LAMMPS performance overview over Intel based platforms
– Understanding LAMMPS MPI communication patterns
• More info on LAMMPS
– https://lammps.sandia.gov/
LAMMPS
• Large-scale Atomic/Molecular Massively Parallel Simulator
– Classical molecular dynamics code which can model:
– Atomic, Polymeric, Biological, Metallic, Granular, and coarse-grained systems
• LAMMPS-KOKKOS package contains
– Versions of pair, fix, and atom styles that use data structures and macros provided by the Kokkos library
• LAMMPS runs efficiently in parallel using message-passing techniques
– Developed at Sandia National Laboratories
– An open-source code, distributed under the GNU General Public License (GPL)
• More information on LAMMPS can be found at the LAMMPS web site:
http://lammps.sandia.gov
Cluster Configuration
• HPC-AI AC Cluster Center – Iris cluster
– Dual Socket Intel Xeon Gold 6148 CPU @ 2.40GHz
– ConnectX-6 HDR100 InfiniBand
– Quantum Switch HDR InfiniBand
– Memory: 192GB DDR4 2666MHz RDIMMs per node
• Software
– OS: RHEL 7.8
– MLNX_OFED 4.9
– MPI: HPC-X 2.7.0
– LAMMPS: v10-29-2020
– Compiler: Intel 2020.4.304
• Dell Cluster Center – Zenith cluster
– Dual Socket Intel Xeon Gold 6248 CPU @ 2.50GHz
– ConnectX-6 HDR100 InfiniBand
– Quantum Switch HDR InfiniBand
– Memory: 192GB DDR4 2666MHz RDIMMs per node
• Software
– OS: RHEL 7.8
– MLNX_OFED 4.9
– MPI: HPC-X 2.7.0
– LAMMPS: v10-29-2020
– Compiler: Intel 2020.4.304
LAMMPS Inputs
• AF_lennard-jones_2.5
– Problem: https://lammps.sandia.gov/bench/in.lj.txt
– region: box block 0 200 0 200 0 200
– neigh_modify: delay 0 every 20 check no
– Iterations: 1000
• EAM
– Problem:
https://github.com/lammps/lammps/blob/master/bench/POTENTIALS/in.eam
– region: box block 0 200 0 200 0 200
– neigh_modify: delay 1 every 5 check yes
– Iterations: 1000
– thermo 100
• Tersoff
– Problem: https://lammps.sandia.gov/bench/in.tersoff.txt
– region: box block 0 200 0 200 0 200
– Iterations: 1000
• Gay-Berne
– Problem: https://lammps.sandia.gov/bench/in.gb.txt
– region: box block 0 320 0 320 0 320
– set type 1 mass 1.5
– set type 1 shape 1 1.5 2
– neigh_modify: delay 1 every 5 check yes
– Iterations: 1000
– thermo 100
• Rhodopsin
– Problem:
https://github.com/lammps/lammps/blob/master/bench/in.rhodo
– replicate: 1 1 1
– atom_modify map array
– Iterations: 1000
• SNAP
– Problem:
– region: box block 0 5 0 8 0 32
– Iterations: 1000
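The region sizes above set the atom count of each run. As a rough sketch, the count for a fully filled block is the number of unit cells times the atoms per cell. The function names here are hypothetical, and the lattice type per benchmark is an assumption taken from the stock LAMMPS bench inputs (fcc for Lennard-Jones, simple cubic for Gay-Berne), not something the slides state. The per-rank figure further assumes one MPI rank on each of the 40 cores of all 32 nodes.

```python
# Atoms per conventional unit cell for common LAMMPS lattice styles.
BASIS_ATOMS = {"sc": 1, "bcc": 2, "fcc": 4, "diamond": 8}


def atoms_in_block(nx, ny, nz, lattice):
    """Atom count for a block of nx * ny * nz fully filled unit cells."""
    return nx * ny * nz * BASIS_ATOMS[lattice]


def atoms_per_rank(atoms, nodes, cores_per_node):
    """Average atoms per MPI rank, assuming one rank per core."""
    return atoms / (nodes * cores_per_node)


# Assumption: the Lennard-Jones input keeps the fcc lattice of the stock in.lj.
lj_atoms = atoms_in_block(200, 200, 200, "fcc")
print(f"Lennard-Jones, 200^3 cells: {lj_atoms:,} atoms")

# Assumption: the Gay-Berne input keeps the simple cubic lattice of the stock in.gb.
gb_atoms = atoms_in_block(320, 320, 320, "sc")
print(f"Gay-Berne, 320^3 cells: {gb_atoms:,} atoms")

# Assumption: 32 nodes x 40 cores, all cores used as MPI ranks.
print(f"Lennard-Jones atoms per rank: {atoms_per_rank(lj_atoms, 32, 40):,.0f}")
```

Under these assumptions the Lennard-Jones case works out to 32 million atoms, or about 25,000 atoms per rank at 32 nodes.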
LAMMPS Performance – Scalability
Higher is better
[Chart: scalability per input benchmark; data labels: 100%, 100%, 92%, 92%, 100%, 97%]
* Bigger problem size
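The slides do not say how the scalability percentages were derived. One common definition, shown here as a sketch, is parallel efficiency: measured speedup over a baseline run divided by the ideal speedup from the added nodes. The function name and the sample numbers are hypothetical and are not measurements from this study.

```python
def parallel_efficiency(base_nodes, base_perf, nodes, perf):
    """Parallel efficiency relative to a baseline run.

    perf is a rate (for example timesteps per second), so higher is better.
    """
    speedup = perf / base_perf      # measured speedup over the baseline
    ideal = nodes / base_nodes      # speedup expected from perfect scaling
    return speedup / ideal


# Hypothetical example: 1 node at 10.0 steps/s, 32 nodes at 294.4 steps/s.
eff = parallel_efficiency(base_nodes=1, base_perf=10.0, nodes=32, perf=294.4)
print(f"Parallel efficiency at 32 nodes: {eff:.0%}")
```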
LAMMPS Performance – AVX2/AVX512
Higher is better
LAMMPS Performance – CPU
Higher is better
LAMMPS MPI Profiles on 32 nodes
• Lennard-Jones 2.5 – 30% MPI
• EAM – 15% MPI
• Gay-Berne – 12% MPI
• Rhodopsin – 14% MPI
• SNAP – 4% MPI
• Tersoff – 6% MPI
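To compare the inputs at a glance, the MPI shares listed for this slide can be sorted from most to least communication-bound. This is only a small sketch over the percentages given here. The function name is hypothetical, and treating everything outside MPI as "non-MPI" time is a simplification.

```python
# MPI time share (percent of run time) per input on 32 nodes, from the slide.
mpi_share = {
    "Lennard-Jones 2.5": 30,
    "EAM": 15,
    "Gay-Berne": 12,
    "Rhodopsin": 14,
    "SNAP": 4,
    "Tersoff": 6,
}


def rank_by_mpi_share(shares):
    """Return (name, percent) pairs sorted from highest to lowest MPI share."""
    return sorted(shares.items(), key=lambda item: item[1], reverse=True)


for name, pct in rank_by_mpi_share(mpi_share):
    print(f"{name:<18} MPI {pct:>2}%   non-MPI {100 - pct}%")
```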
Summary
• LAMMPS scales well when the problem size suits the CPU architecture and
cluster size. With InfiniBand, scalability is above 92% for the
demonstrated cases
• AVX512 helps five of the six input benchmarks, with up to 2x improvement
• Intel Xeon Gold 6248 @ 2.5GHz (40 cores per node) demonstrated up to 38% higher
performance compared with Intel Xeon Gold 6148 @ 2.4GHz (40 cores per node)
• MPI profiles show up to 30% communication time, mostly in point-to-point and
MPI_Allreduce operations. The Rhodopsin input also shows MPI_Alltoallv
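For context on the CPU comparison, the arithmetic below sets the reported best-case gain next to the gain that the base clock speeds alone would suggest. It is a sketch only: the function name is hypothetical, and the slides do not break down where the remaining difference comes from.

```python
def relative_gain(old, new):
    """Fractional improvement of new over old (0.10 means 10% higher)."""
    return new / old - 1.0


# Base clocks from the cluster configuration: 2.4 GHz (6148) and 2.5 GHz (6248).
clock_gain = relative_gain(2.4, 2.5)
print(f"Base-clock gain, 6148 -> 6248: {clock_gain:.1%}")

# Best-case performance gain reported in the summary.
reported_gain = 0.38
print(f"Reported best-case gain:       {reported_gain:.0%}")
```

The base-clock ratio is roughly 4%, so the reported best case is considerably larger than clock speed alone would account for.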
All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC-AI Advisory Council makes no representation to the accuracy and completeness of the information
contained herein. HPC-AI Advisory Council undertakes no duty and assumes no obligation to update or correct any information presented herein
Thank You