Experiences with HPC and Big Data Applications on the SDSC Comet Cluster: Using Virtualization, Singularity Containers, and RDMA-enabled Data Analytics Tools
Mahidhar Tatineni, Director of User Services, SDSC
HPC Advisory Council China Conference, October 18, 2017, Hefei, China
Acknowledgements: Trevor Cooper, Dmitry Mishin, Christopher Irving, Gregor von Laszewski (IU), Fugang Wang (IU), Rick Wagner (Globus group, U. Chicago), Phil Papadopoulos
This work supported by the National Science Foundation, award ACI-1341698.
Overview
• Comet hardware – compute and GPU nodes, network, filesystems
• MPI implementations, including MVAPICH2-GDR results
• Data analytics frameworks and tools on Comet – RDMA-Hadoop, RDMA-Spark, OSU-Caffe
• Virtual Cluster (VC) design – layout, software
• VC benchmarks: MPI, applications
NSF Award #1341698, Gateways to Discovery: Cyberinfrastructure for the Long Tail of Science
PI: Michael Norman
Co-PIs: Shawn Strande, Philip Papadopoulos, Robert Sinkovits, Nancy Wilkins-Diehr
SDSC project in collaboration with Indiana University (led by Geoffrey Fox)
Comet: System Characteristics
• Total peak flops ~2.1 PF
• Dell primary integrator
• Intel Haswell processors w/ AVX2
• Mellanox FDR InfiniBand
• 1,944 standard compute nodes (46,656 cores)
• Dual CPUs, each 12-core, 2.5 GHz
• 128 GB DDR4 2133 MHz DRAM
• 2 × 160 GB SSDs (local disk)
• 72 GPU nodes
• 36 nodes with two NVIDIA K80 cards, each with dual Kepler3 GPUs; same CPU as main partition
• 36 nodes with 4 P100 GPUs each, 2 Intel Broadwell processors (14 cores each)
• 4 large-memory nodes
• 1.5 TB DDR4 1866 MHz DRAM
• Four Haswell processors/node
• 64 cores/node
• Hybrid fat-tree topology
● FDR (56 Gbps) InfiniBand
● Rack-level (72 nodes, 1,728 cores) full bisection
bandwidth
● 4:1 oversubscription cross-rack
• Performance Storage (Aeon)
● 7.6 PB, 200 GB/s; Lustre
● Scratch & Persistent Storage segments
• Durable Storage (Aeon)
● 6 PB, 100 GB/s; Lustre
● Automatic backups of critical data
• Home directory storage
• Gateway hosting nodes
• Virtual image repository
• 100 Gbps external connectivity to Internet2 & ESNet
Comet Network Architecture
Comet Lustre Filesystems
• Comet features two Lustre filesystems – scratch and projects storage.
• Projects storage is mounted on multiple systems, including non-IB-connected clusters.
• Backend storage servers are connected via a 40 Gbit Ethernet fabric; the Comet network design handles this by using Mellanox bridge switches.
• The design provides the flexibility to mount the filesystem on multiple machines while keeping aggregate performance high.
• Each filesystem (scratch and projects) achieved 100 GB/s in an IOR bandwidth test at scale.
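Large sequential I/O on a Lustre filesystem like the ones above benefits from striping files across multiple storage targets. A minimal sketch of how a user would do this with the standard Lustre `lfs` tool (the path below is illustrative, following the /oasis/scratch/comet mount point mentioned later in this talk; the stripe count is an example, not Comet-specific tuning advice):

```shell
# Inspect the current stripe layout of a directory on scratch.
lfs getstripe /oasis/scratch/comet/$USER/temp_project

# Create a directory striped across 8 OSTs; large files written into it
# are spread over 8 storage targets, raising aggregate bandwidth.
mkdir -p /oasis/scratch/comet/$USER/temp_project/wide_io
lfs setstripe -c 8 /oasis/scratch/comet/$USER/temp_project/wide_io
```

New files inherit the directory's stripe settings, so setting the layout once on an output directory is usually enough.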
Comet: MPI options, RDMA-enabled software
• MVAPICH2 (v2.1) is the default MPI on Comet. Intel MPI and OpenMPI are also available.
• MVAPICH2-X v2.2a provides a unified high-performance runtime supporting both MPI and PGAS programming models.
• MVAPICH2-GDR (v2.2) on the GPU nodes featuring NVIDIA K80s and P100s. Benchmark and application performance results in this talk.
• RDMA-Hadoop (2x-1.1.0) and RDMA-Spark (0.9.4) (from Dr. Panda's HiBD lab) are also available.
Comet K80 node architecture
• 4 GPUs per node
• GPUs (0,1) and (2,3) can do P2P communication
• Mellanox InfiniBand adapter associated with the second socket (GPUs 2, 3)
OSU Latency (osu_latency) Benchmark Intra-node, K80 nodes
• Latency between GPU 2 and GPU 3: 2.82 µs
• Latency between GPU 1 and GPU 2: 3.18 µs
OSU Latency (osu_latency) Benchmark Inter-node, K80 nodes
• Latency for GPU 2, with the process bound to CPU 1 on both nodes: 2.27 µs
• Latency for GPU 2, with the process bound to CPU 0 on both nodes: 2.47 µs
• Latency for GPU 0, with the process bound to CPU 0 on both nodes: 2.43 µs
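Measurements like the ones above come from the OSU micro-benchmarks built against MVAPICH2-GDR. A sketch of such a launch (the hostfile, GPU/CPU binding values, and benchmark path are illustrative; MV2_USE_CUDA is the MVAPICH2-GDR switch that enables communication directly from GPU buffers):

```shell
# Run the device-to-device latency test between two nodes.
# "D D" tells osu_latency to place both send and receive buffers on the GPU.
# Environment variables are passed per-rank via mpirun_rsh.
mpirun_rsh -np 2 -hostfile ./hosts \
    MV2_USE_CUDA=1 CUDA_VISIBLE_DEVICES=2 \
    ./osu_latency D D
```

Varying CUDA_VISIBLE_DEVICES and the CPU binding is what produces the GPU/socket combinations compared on these slides.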
Comet P100 node architecture
• 4 GPUs per node
• GPUs (0,1) and (2,3) can do P2P communication
• Mellanox InfiniBand adapter associated with the first socket (GPUs 0, 1)
OSU Latency (osu_latency) Benchmark Intra-node, P100 nodes
• Latency between GPU 0 and GPU 1: 2.73 µs
• Latency between GPU 2 and GPU 3: 2.95 µs
• Latency between GPU 1 and GPU 2: 3.13 µs
OSU Latency (osu_latency) Benchmark Inter-node, P100 nodes
• Latency for GPU 0, with the process bound to CPU 0 on both nodes: 2.17 µs
• Latency for GPU 2, with the process bound to CPU 1 on both nodes: 2.35 µs
MVAPICH2-GDR Application Example: HOOMD-blue
• HOOMD-blue is a general-purpose particle simulation toolkit.
• Results for the Hexagon benchmark are presented.
References:
• HOOMD-blue web page: http://glotzerlab.engin.umich.edu/hoomd-blue/
• HOOMD-blue benchmarks page: http://glotzerlab.engin.umich.edu/hoomd-blue/benchmarks.html
• J. A. Anderson, C. D. Lorenz, and A. Travesset. General purpose molecular dynamics simulations fully implemented on graphics processing units. Journal of Computational Physics 227(10): 5342-5359, May 2008. doi:10.1016/j.jcp.2008.01.047
• J. Glaser, T. D. Nguyen, J. A. Anderson, P. Liu, F. Spiga, J. A. Millan, D. C. Morse, S. C. Glotzer. Strong scaling of general-purpose molecular dynamics simulations on GPUs. Computer Physics Communications 192: 97-107, July 2015. doi:10.1016/j.cpc.2015.02.028
HOOMD-Blue: Hexagon Benchmark
• N = 1,048,576
• Hard particle Monte Carlo
● Vertices: [[0.5,0],[0.25,0.433012701892219],[-0.25,0.433012701892219],[-0.5,0],[-0.25,-0.433012701892219],[0.25,-0.433012701892219]]
● d = 0.17010672166874857
● a = 1.0471975511965976
● nselect = 4
• Log file period: 10,000 time steps
• SDF analysis
● xmax = 0.02
● δx = 10⁻⁴
● period: 50 time steps
● navg = 2000
• DCD dump period: 100,000
HOOMD-Blue: Hexagon Benchmark – Strong scaling on K80 nodes
RDMA-Hadoop and RDMA-Spark
Network-Based Computing Lab, Ohio State University
NSF funded project in collaboration with Dr. DK Panda*
• HDFS, MapReduce, and RPC over native InfiniBand and RDMA over Converged Ethernet (RoCE).
• Based on Apache distributions of Hadoop and Spark.
• Version RDMA-Apache-Hadoop-2.x 1.1.0 (based on Apache Hadoop 2.6.0) is available on Comet.
• Version RDMA-Spark 0.9.4 (based on Apache Spark 2.1.0) is available on Comet.
• More details on the RDMA-Hadoop and RDMA-Spark projects at:
– http://hibd.cse.ohio-state.edu/
*NSF BIGDATA F: DKM: Collaborative Research: Scalable Middleware for Managing and Processing Big Data on Next Generation HPC Systems, Award #s 1447804 (Ohio State) and 1447861 (SDSC).
● Exploit performance on modern clusters with RDMA-enabled interconnects for Big Data applications.
● Hybrid design with in-memory and heterogeneous storage (HDD, SSDs, Lustre).
● Keep compliance with standard distributions from Apache.
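Because RDMA-Spark keeps compliance with the Apache distribution, jobs are submitted exactly as with stock Spark. A hedged sketch ($RDMA_SPARK_HOME, the master URL, and the examples jar path are illustrative placeholders, not the Comet install layout; SparkPi is the standard example class shipped with Apache Spark):

```shell
# Submitting a job to an RDMA-Spark 0.9.4 installation (Spark 2.1.0 based)
# uses the ordinary spark-submit interface; only the install differs.
$RDMA_SPARK_HOME/bin/spark-submit \
    --master spark://frontend:7077 \
    --class org.apache.spark.examples.SparkPi \
    $RDMA_SPARK_HOME/examples/jars/spark-examples_2.11-2.1.0.jar 1000
```

The RDMA transport is selected by the installation's configuration, so existing Spark applications run unmodified.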
RDMA-Hadoop, Spark
OSU-Caffe, CIFAR10 Quick on K80 nodes
• Results with K80 nodes.
• Current runs with data in the Lustre filesystem (/oasis/scratch/comet)
• All Comet GPU nodes have 280 GB of SSD-based local scratch space. Future tests with larger test cases are planned to evaluate the performance advantages of using the SSDs.
OSU-Caffe, CIFAR10 Quick on K80 nodes
Virtualization on Comet
• Comet Virtual Clusters – KVM based, SR-IOV enabled full virtualization.
• Singularity based containerization – user space only, with namespaces and minimal SetUID.
Comet VC Use Cases
• Root access to nodes for custom OS and software stack.
– Example: the CAIDA group used it for a workshop, allowing attendees to modify the network stack for research. Allows for isolation of tests.
• Simplified install for groups with existing management infrastructure.
– Example: the Open Science Grid (OSG) used their existing installation procedures to enable multiple research groups to run on Comet (including LIGO).
Singularity Use Cases
• Applications with newer library/OS requirements than available on Comet – e.g. TensorFlow, Torch, Caffe.
• Commercial application binaries with specific OS requirements.
• Importing Docker images to enable use in a shared HPC environment.
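The Docker-import workflow in the last bullet looks roughly like this with the Singularity 2.x CLI of this era (the TensorFlow image is an example; the output filename produced by pull can vary by Singularity version):

```shell
# Pull a Docker Hub image and convert it to a Singularity image file,
# all as an unprivileged user.
singularity pull docker://tensorflow/tensorflow:latest

# Execute a command inside the container on the HPC system; home and
# scratch directories can be bind-mounted into the container.
singularity exec tensorflow-latest.img python -c 'import tensorflow'
```

Since execution stays in user space, the image can then be run under the normal batch system like any other binary.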
Overview of Virtual Clusters on Comet
• Projects have a persistent VM for cluster management
● Modest: single core, 1-2 GB of RAM
• Standard compute nodes are scheduled as containers via the batch system
● One virtual compute node per container
• Virtual disk images are stored as ZFS datasets
● Migrated to and from containers at job start and end
• VM use is allocated and tracked like regular computing
[Figure: Virtual cluster management architecture. A persistent virtual front end talks to Nucleus through an API to request nodes and control console & power. Nucleus handles scheduling, storage management, coordinating network changes, and VM launch & shutdown. Idle disk images are attached and synchronized to active virtual compute nodes; the user perspective is the front end plus its virtual compute nodes.]
Enabling Technologies
• KVM—Lets us run virtual machines (all processor features)
• SR-IOV—Makes MPI go fast on VMs
• Rocks—Systems management
• ZFS—Disk image management
• VLANs—Isolate virtual cluster management network
• pkeys—Isolate virtual cluster IB network
• Nucleus—Coordination engine (scheduling, provisioning, status, etc.)
• Cloudmesh—Client interface
User-Customized HPC
[Figure: User-customized HPC layout. The physical side has a frontend, virtual-frontend hosting, a disk image vault, and compute nodes on public and private networks. Each virtual cluster runs its own virtual frontend and virtual compute nodes on an isolated private network.]
High Performance Virtual Cluster Characteristics
[Figure: A virtual frontend connected to virtual compute nodes via private Ethernet and InfiniBand.]
All nodes have
• Private Ethernet
• InfiniBand
• Local disk storage
Virtual compute nodes can network boot (PXE) from their virtual frontend.
All disks retain state
• keep user configuration between boots
InfiniBand Virtualization
• 8% latency overhead
• Nominal bandwidth overhead
Comet: Providing Virtualized HPC for XSEDE
Data Storage/Filesystems
• Local SSD storage on each compute node
• Limited number of large SSD nodes (1.4 TB) for large VM images
• Local (SDSC) network access same as compute nodes
• Modest (TB) storage available via NFS now
• Future: Secure access to Lustre?
Cloudmesh (developed by IU collaborators)
• The Cloudmesh client enables access to multiple cloud environments from a command shell and command line.
• We leverage this easy-to-use CLI for virtual cluster management, allowing the use of Comet as infrastructure for virtual clusters.
• Cloudmesh has more functionality, with the ability to access hybrid clouds (OpenStack, EC2, AWS, Azure); possible to extend to other systems like Jetstream, Bridges, etc.
• Plans for customizable launchers available through the command line or a browser – can target specific application user communities.
Reference: https://github.com/cloudmesh/client
Comet Cloudmesh Client (selected commands)
• cm comet cluster ID
– Show the cluster details
• cm comet power on ID vm-ID-[0-3] --walltime=6h
– Power on nodes vm-ID-0 through vm-ID-3 for 6 hours
• cm comet image attach image.iso ID vm-ID-0
– Attach an image
• cm comet boot ID vm-ID-0
– Boot node 0
• cm comet console vc4
– Open a console
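Putting these commands together, a typical session to bring up part of a virtual cluster might look like the following (the cluster name vc4 follows the console example shown next; image.iso is a placeholder):

```shell
# Show details of virtual cluster vc4
cm comet cluster vc4

# Power on nodes vm-vc4-0 through vm-vc4-3 for a 6-hour walltime
cm comet power on vc4 vm-vc4-[0-3] --walltime=6h

# Attach a boot image to node 0 and boot it
cm comet image attach image.iso vc4 vm-vc4-0
cm comet boot vc4 vm-vc4-0

# Open a console on node 0
cm comet console vc4 vm-vc4-0
```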
Comet Cloudmesh Usage Examples
Comet Cloudmesh Client : Console access
(COMET)host:client$ cm comet console vc4 vm-vc4-0
MPI bandwidth slowdown from SR-IOV is at most 1.21 for medium-sized messages & negligible for small & large ones
MPI latency slowdown from SR-IOV is at most 1.32 for small messages & negligible for large ones
WRF Weather Modeling
• 96-core (4-node) calculation
• Nearest-neighbor communication
• Test Case: 3hr forecast, 2.5 km resolution of the Continental US (CONUS).
• Scalable algorithms
• 2% slower w/ SR-IOV vs native IB.
PSDNS, 1024x1024x1024: Strong scaling case
• 32 (2-node), 64 (4-node), and 128 (8-node) core tests.
• Computational core based on FFTs.
• Communication intensive, mainly alltoallv – bisection bandwidth limited.

Cores (Nodes)    Time/Step
32 (2)           101.51
64 (4)           67.03
128 (8)          33.99
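The strong-scaling numbers in the table can be sanity-checked by computing speedup and parallel efficiency relative to the 32-core run; a quick awk sketch:

```shell
# Speedup = T(32 cores) / T(N cores); efficiency = speedup divided by
# the factor of core-count increase (2x for 64 cores, 4x for 128).
awk 'BEGIN {
  t32 = 101.51; t64 = 67.03; t128 = 33.99
  printf "64 cores:  speedup %.2f, efficiency %.2f\n", t32/t64,  (t32/t64)/2
  printf "128 cores: speedup %.2f, efficiency %.2f\n", t32/t128, (t32/t128)/4
}'
```

An efficiency of roughly 75% when going from 32 to 128 cores is consistent with the good strong scaling noted in the summary.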
Quantum ESPRESSO
• 48-core (3 node) calculation
• CG matrix inversion – irregular communication
• 3D FFT matrix transposes (all-to-all communication)
• Test Case: DEISA AUSURF 112 benchmark.
• 8% slower w/ SR-IOV vs native IB.
RAxML: Code for Maximum Likelihood-based inference of large phylogenetic trees.
• Widely used, including by the CIPRES gateway.
• 48-core (2 node) calculation
• Hybrid MPI/Pthreads code.
• 12 MPI tasks, 4 threads per task.
• Compilers: gcc + mvapich2 v2.2, AVX options.
• Test Case: comprehensive analysis, 218 taxa, 2,294 characters, 1,846 patterns, 100 bootstraps specified.
• 19% slower w/ SR-IOV vs native IB.
MrBayes: Software for Bayesian inference of phylogeny.
• Widely used, including by CIPRES
gateway.
• 32-core (2 node) calculation
• Hybrid MPI/OpenMP Code.
• 8 MPI tasks, 4 OpenMP threads per task.
• Compilers: gcc + mvapich2 v2.2, AVX
options.
• Test Case: 218 taxa, 10,000 generations.
• 3% slower with SR-IOV vs native IB.
Summary
• Comet uses the flexibility and performance of the IB network + bridging to
– provide multi-machine mounted parallel filesystems
– enhance GPU applications using GPU Direct RDMA
– enhance the performance of data analytics tools with RDMA-enabled frameworks
– enable virtualized HPC clusters at scale.
• OSU benchmarks and the HOOMD-Blue application show good performance using MVAPICH2-GDR.
• Results for OSU-Caffe with the CIFAR10 benchmark show good scaling. Future tests with larger test cases are planned to evaluate the performance advantages of using the SSDs.
• Application benchmarks show good performance on the virtualized cluster. The PSDNS example, which is communication intensive, shows good strong scaling for a test case.