Experiences with HPC and Big Data Applications on the SDSC Comet Cluster: Using Virtualization, Singularity Containers, and RDMA-enabled Data Analytics Tools
Mahidhar Tatineni, Director of User Services, SDSC
HPC Advisory Council China Conference, October 18, 2017, Hefei, China
Acknowledgements: Trevor Cooper, Dmitry Mishin, Christopher Irving, Gregor von Laszewski (IU), Fugang Wang (IU), Rick Wagner (Globus group, U. Chicago), Phil Papadopoulos
This work supported by the National Science Foundation, award ACI-1341698.
Overview
• Comet hardware – compute and GPU nodes, network, filesystems
• MPI implementations, including MVAPICH2-GDR results
• Data analytics frameworks and tools on Comet – RDMA-Hadoop, RDMA-Spark, OSU-Caffe
• Virtual Cluster (VC) design – layout, software
• VC benchmarks: MPI, applications
NSF Award #1341698, Gateways to Discovery: Cyberinfrastructure for the Long Tail of Science
PI: Michael Norman
Co-PIs: Shawn Strande, Philip Papadopoulos, Robert Sinkovits, Nancy Wilkins-Diehr
SDSC project in collaboration with Indiana University (led by Geoffrey Fox)
Comet: System Characteristics
• Total peak flops ~2.1 PF
• Dell primary integrator
• Intel Haswell processors w/ AVX2
• Mellanox FDR InfiniBand
• 1,944 standard compute nodes (46,656 cores)
• Dual CPUs, each 12-core, 2.5 GHz
• 128 GB DDR4 2133 MHz DRAM
• 2 × 160 GB SSDs (local disk)
• 72 GPU nodes
• 36 nodes with two NVIDIA K80 cards, each with dual Kepler3 GPUs; same CPU as main partition
• 36 nodes with 4 P100 GPUs each, 2 Intel Broadwell processors (14 cores each)
• 4 large-memory nodes
• 1.5 TB DDR4 1866 MHz DRAM
• Four Haswell processors/node
• 64 cores/node
• Hybrid fat-tree topology
● FDR (56 Gbps) InfiniBand
● Rack-level (72 nodes, 1,728 cores) full bisection
bandwidth
● 4:1 oversubscription cross-rack
• Performance Storage (Aeon)
● 7.6 PB, 200 GB/s; Lustre
● Scratch & Persistent Storage segments
• Durable Storage (Aeon)
● 6 PB, 100 GB/s; Lustre
● Automatic backups of critical data
• Home directory storage
• Gateway hosting nodes
• Virtual image repository
• 100 Gbps external connectivity to Internet2 & ESNet
Comet Network Architecture
Comet Lustre Filesystems
• Comet features two Lustre filesystems – scratch and projects storage.
• Projects storage is mounted on multiple systems, including non-IB-connected clusters.
• Backend storage servers are connected via a 40 Gbit Ethernet fabric; the Comet network design handles this by using Mellanox bridge switches.
• The design provides the flexibility to mount the filesystem on multiple machines while keeping aggregate performance high.
• Each filesystem (scratch and projects) achieved 100 GB/s in an IOR bandwidth test at scale.
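Large sequential I/O on a Lustre filesystem like the ones above benefits from striping files across multiple storage targets. A minimal sketch of how a user would do this with the standard Lustre `lfs` tool (the path below is illustrative, following the /oasis/scratch/comet mount point mentioned later in this talk; the stripe count is an example, not Comet-specific tuning advice):

```shell
# Inspect the current stripe layout of a directory on scratch.
lfs getstripe /oasis/scratch/comet/$USER/temp_project

# Create a directory striped across 8 OSTs; large files written into it
# are spread over 8 storage targets, raising aggregate bandwidth.
mkdir -p /oasis/scratch/comet/$USER/temp_project/wide_io
lfs setstripe -c 8 /oasis/scratch/comet/$USER/temp_project/wide_io
```

New files inherit the directory's stripe settings, so setting the layout once on an output directory is usually enough.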
Comet: MPI options, RDMA-enabled software
• MVAPICH2 (v2.1) is the default MPI on Comet. Intel MPI and OpenMPI are also available.
• MVAPICH2-X v2.2a provides a unified high-performance runtime supporting both MPI and PGAS programming models.
• MVAPICH2-GDR (v2.2) on the GPU nodes featuring NVIDIA K80s and P100s. Benchmark and application performance results in this talk.
• RDMA-Hadoop (2x-1.1.0) and RDMA-Spark (0.9.4) (from Dr. Panda's HiBD lab) are also available.
Comet K80 node architecture
• 4 GPUs per node
• GPUs (0,1) and (2,3) can do P2P communication
• Mellanox InfiniBand adapter associated with the second socket (GPUs 2, 3)
OSU Latency (osu_latency) Benchmark Intra-node, K80 nodes
• Latency between GPU 2 and GPU 3: 2.82 µs
• Latency between GPU 1 and GPU 2: 3.18 µs
OSU Latency (osu_latency) Benchmark Inter-node, K80 nodes
• Latency for GPU 2, with the process bound to CPU 1 on both nodes: 2.27 µs
• Latency for GPU 2, with the process bound to CPU 0 on both nodes: 2.47 µs
• Latency for GPU 0, with the process bound to CPU 0 on both nodes: 2.43 µs
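Measurements like the ones above come from the OSU micro-benchmarks built against MVAPICH2-GDR. A sketch of such a launch (the hostfile, GPU/CPU binding values, and benchmark path are illustrative; MV2_USE_CUDA is the MVAPICH2-GDR switch that enables communication directly from GPU buffers):

```shell
# Run the device-to-device latency test between two nodes.
# "D D" tells osu_latency to place both send and receive buffers on the GPU.
# Environment variables are passed per-rank via mpirun_rsh.
mpirun_rsh -np 2 -hostfile ./hosts \
    MV2_USE_CUDA=1 CUDA_VISIBLE_DEVICES=2 \
    ./osu_latency D D
```

Varying CUDA_VISIBLE_DEVICES and the CPU binding is what produces the GPU/socket combinations compared on these slides.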
Comet P100 node architecture
• 4 GPUs per node
• GPUs (0,1) and (2,3) can do P2P communication
• Mellanox InfiniBand adapter associated with the first socket (GPUs 0, 1)
OSU Latency (osu_latency) Benchmark Intra-node, P100 nodes
• Latency between GPU 0 and GPU 1: 2.73 µs
• Latency between GPU 2 and GPU 3: 2.95 µs
• Latency between GPU 1 and GPU 2: 3.13 µs
OSU Latency (osu_latency) Benchmark Inter-node, P100 nodes
• Latency for GPU 0, with the process bound to CPU 0 on both nodes: 2.17 µs
• Latency for GPU 2, with the process bound to CPU 1 on both nodes: 2.35 µs
MVAPICH2-GDR Application Example: HOOMD-blue
• HOOMD-blue is a general-purpose particle simulation toolkit.
• Results for the Hexagon benchmark are presented.
References:
• HOOMD-blue web page: http://glotzerlab.engin.umich.edu/hoomd-blue/
• HOOMD-blue benchmarks page: http://glotzerlab.engin.umich.edu/hoomd-blue/benchmarks.html
• J. A. Anderson, C. D. Lorenz, and A. Travesset. General purpose molecular dynamics simulations fully implemented on graphics processing units. Journal of Computational Physics 227(10): 5342-5359, May 2008. doi:10.1016/j.jcp.2008.01.047
• J. Glaser, T. D. Nguyen, J. A. Anderson, P. Liu, F. Spiga, J. A. Millan, D. C. Morse, S. C. Glotzer. Strong scaling of general-purpose molecular dynamics simulations on GPUs. Computer Physics Communications 192: 97-107, July 2015. doi:10.1016/j.cpc.2015.02.028
HOOMD-Blue: Hexagon Benchmark
• N = 1,048,576
• Hard particle Monte Carlo
● Vertices: [[0.5,0],[0.25,0.433012701892219],[-0.25,0.433012701892219],[-0.5,0],[-0.25,-0.433012701892219],[0.25,-0.433012701892219]]
● d = 0.17010672166874857
● a = 1.0471975511965976
● nselect = 4
• Log file period: 10,000 time steps
• SDF analysis
● xmax = 0.02
● δx = 10⁻⁴
● period: 50 time steps
● navg = 2000
• DCD dump period: 100,000
HOOMD-Blue: Hexagon Benchmark – Strong scaling on K80 nodes
RDMA-Hadoop and RDMA-Spark
Network-Based Computing Lab, Ohio State University
NSF funded project in collaboration with Dr. DK Panda*
• HDFS, MapReduce, and RPC over native InfiniBand and RDMA over Converged Ethernet (RoCE).
• Based on Apache distributions of Hadoop and Spark.
• Version RDMA-Apache-Hadoop-2.x 1.1.0 (based on Apache Hadoop 2.6.0) is available on Comet.
• Version RDMA-Spark 0.9.4 (based on Apache Spark 2.1.0) is available on Comet.
• More details on the RDMA-Hadoop and RDMA-Spark projects at:
– http://hibd.cse.ohio-state.edu/
*NSF BIGDATA F: DKM: Collaborative Research: Scalable Middleware for Managing and Processing Big Data on Next Generation HPC Systems, Award #s 1447804 (Ohio State) and 1447861 (SDSC).
● Exploit performance on modern clusters with RDMA-enabled interconnects for Big Data applications.
● Hybrid design with in-memory and heterogeneous storage (HDD, SSDs, Lustre).
● Keep compliance with standard distributions from Apache.
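Because RDMA-Spark keeps compliance with the Apache distribution, jobs are submitted exactly as with stock Spark. A hedged sketch ($RDMA_SPARK_HOME, the master URL, and the examples jar path are illustrative placeholders, not the Comet install layout; SparkPi is the standard example class shipped with Apache Spark):

```shell
# Submitting a job to an RDMA-Spark 0.9.4 installation (Spark 2.1.0 based)
# uses the ordinary spark-submit interface; only the install differs.
$RDMA_SPARK_HOME/bin/spark-submit \
    --master spark://frontend:7077 \
    --class org.apache.spark.examples.SparkPi \
    $RDMA_SPARK_HOME/examples/jars/spark-examples_2.11-2.1.0.jar 1000
```

The RDMA transport is selected by the installation's configuration, so existing Spark applications run unmodified.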
RDMA-Hadoop, Spark
OSU-Caffe, CIFAR10 Quick on K80 nodes
• Results with K80 nodes.
• Current runs with data in the Lustre filesystem (/oasis/scratch/comet)
• All Comet GPU nodes have 280 GB of SSD-based local scratch space. Future tests with larger test cases are planned to evaluate the performance advantages of using the SSDs.
OSU-Caffe, CIFAR10 Quick on K80 nodes
Virtualization on Comet
• Comet Virtual Clusters – KVM based, SR-IOV enabled full virtualization.
• Singularity based containerization – user space only, with namespaces and minimal SetUID.
Comet VC Use Cases
• Root access to nodes for custom OS and software stack.
– Example: the CAIDA group used it for a workshop, allowing attendees to modify the network stack for research. Allows for isolation of tests.
• Simplified install for groups with existing management infrastructure.
– Example: the Open Science Grid (OSG) used their existing installation procedures to enable multiple research groups to run on Comet (including LIGO).
Singularity Use Cases
• Applications with newer library/OS requirements than available on Comet – e.g. TensorFlow, Torch, Caffe.
• Commercial application binaries with specific OS requirements.
• Importing Docker images to enable use in a shared HPC environment.
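The Docker-import workflow in the last bullet looks roughly like this with the Singularity 2.x CLI of this era (the TensorFlow image is an example; the output filename produced by pull can vary by Singularity version):

```shell
# Pull a Docker Hub image and convert it to a Singularity image file,
# all as an unprivileged user.
singularity pull docker://tensorflow/tensorflow:latest

# Execute a command inside the container on the HPC system; home and
# scratch directories can be bind-mounted into the container.
singularity exec tensorflow-latest.img python -c 'import tensorflow'
```

Since execution stays in user space, the image can then be run under the normal batch system like any other binary.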
Overview of Virtual Clusters on Comet
• Projects have a persistent VM for cluster management
● Modest: single core, 1-2 GB of RAM
• Standard compute nodes are scheduled as containers via the batch system
● One virtual compute node per container
• Virtual disk images are stored as ZFS datasets
● Migrated to and from containers at job start and end
• VM use is allocated and tracked like regular computing
[Figure: Virtual cluster management architecture. A persistent virtual front end talks to Nucleus through an API to request nodes and control console & power. Nucleus handles scheduling, storage management, coordinating network changes, and VM launch & shutdown. Idle disk images are attached and synchronized to active virtual compute nodes; the user perspective is the front end plus its virtual compute nodes.]
Enabling Technologies
• KVM—Lets us run virtual machines (all processor features)
• SR-IOV—Makes MPI go fast on VMs
• Rocks—Systems management
• ZFS—Disk image management
• VLANs—Isolate virtual cluster management network
• pkeys—Isolate virtual cluster IB network
• Nucleus—Coordination engine (scheduling, provisioning, status, etc.)
• Cloudmesh—Client interface
User-Customized HPC
[Figure: User-customized HPC layout. The physical side has a frontend, virtual-frontend hosting, a disk image vault, and compute nodes on public and private networks. Each virtual cluster runs its own virtual frontend and virtual compute nodes on an isolated private network.]
High Performance Virtual Cluster Characteristics
[Figure: A virtual frontend connected to virtual compute nodes via private Ethernet and InfiniBand.]
All nodes have
• Private Ethernet
• InfiniBand
• Local disk storage
Virtual compute nodes can network boot (PXE) from their virtual frontend.
All disks retain state
• keep user configuration between boots
InfiniBand Virtualization
• 8% latency overhead
• Nominal bandwidth overhead
Comet: Providing Virtualized HPC for XSEDE
Data Storage/Filesystems
• Local SSD storage on each compute node
• Limited number of large SSD nodes (1.4 TB) for large VM images
• Local (SDSC) network access same as compute nodes
• Modest (TB) storage available via NFS now
• Future: Secure access to Lustre?
Cloudmesh (developed by IU collaborators)
• The Cloudmesh client enables access to multiple cloud environments from a command shell and command line.
• We leverage this easy-to-use CLI for virtual cluster management, allowing the use of Comet as infrastructure for virtual clusters.
• Cloudmesh has more functionality, with the ability to access hybrid clouds (OpenStack, EC2, AWS, Azure); possible to extend to other systems like Jetstream, Bridges, etc.
• Plans for customizable launchers available through the command line or a browser – can target specific application user communities.
Reference: https://github.com/cloudmesh/client
Comet Cloudmesh Client (selected commands)
• cm comet cluster ID
– Show the cluster details
• cm comet power on ID vm-ID-[0-3] --walltime=6h
– Power on nodes vm-ID-0 through vm-ID-3 for 6 hours
• cm comet image attach image.iso ID vm-ID-0
– Attach an image
• cm comet boot ID vm-ID-0
– Boot node 0
• cm comet console vc4
– Open a console
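Putting these commands together, a typical session to bring up part of a virtual cluster might look like the following (the cluster name vc4 follows the console example shown next; image.iso is a placeholder):

```shell
# Show details of virtual cluster vc4
cm comet cluster vc4

# Power on nodes vm-vc4-0 through vm-vc4-3 for a 6-hour walltime
cm comet power on vc4 vm-vc4-[0-3] --walltime=6h

# Attach a boot image to node 0 and boot it
cm comet image attach image.iso vc4 vm-vc4-0
cm comet boot vc4 vm-vc4-0

# Open a console on node 0
cm comet console vc4 vm-vc4-0
```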
Comet Cloudmesh Usage Examples
Comet Cloudmesh Client : Console access
(COMET)host:client$ cm comet console vc4 vm-vc4-0
MPI bandwidth slowdown from SR-IOV is at most 1.21 for medium-sized messages & negligible for small & large ones
MPI latency slowdown from SR-IOV is at most 1.32 for small messages & negligible for large ones
WRF Weather Modeling
• 96-core (4-node) calculation
• Nearest-neighbor communication
• Test Case: 3hr forecast, 2.5 km resolution of the Continental US (CONUS).
• Scalable algorithms
• 2% slower w/ SR-IOV vs native IB.
PSDNS, 1024x1024x1024: Strong scaling case
• 32 (2-node), 64 (4-node), and 128 (8-node) core tests.
• Computational core based on FFTs.
• Communication intensive, mainly alltoallv – bisection bandwidth limited.

Cores (Nodes)    Time/Step
32 (2)           101.51
64 (4)           67.03
128 (8)          33.99
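The strong-scaling numbers in the table can be sanity-checked by computing speedup and parallel efficiency relative to the 32-core run; a quick awk sketch:

```shell
# Speedup = T(32 cores) / T(N cores); efficiency = speedup divided by
# the factor of core-count increase (2x for 64 cores, 4x for 128).
awk 'BEGIN {
  t32 = 101.51; t64 = 67.03; t128 = 33.99
  printf "64 cores:  speedup %.2f, efficiency %.2f\n", t32/t64,  (t32/t64)/2
  printf "128 cores: speedup %.2f, efficiency %.2f\n", t32/t128, (t32/t128)/4
}'
```

An efficiency of roughly 75% when going from 32 to 128 cores is consistent with the good strong scaling noted in the summary.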
Quantum ESPRESSO
• 48-core (3 node) calculation
• CG matrix inversion – irregular communication
• 3D FFT matrix transposes (all-to-all communication)
• Test Case: DEISA AUSURF 112 benchmark.
• 8% slower w/ SR-IOV vs native IB.
RAxML: Code for Maximum Likelihood-based inference of large phylogenetic trees.
• Widely used, including by the CIPRES gateway.
• 48-core (2 node) calculation
• Hybrid MPI/Pthreads code.
• 12 MPI tasks, 4 threads per task.
• Compilers: gcc + mvapich2 v2.2, AVX options.
• Test Case: comprehensive analysis, 218 taxa, 2,294 characters, 1,846 patterns, 100 bootstraps specified.
• 19% slower w/ SR-IOV vs native IB.
MrBayes: Software for Bayesian inference of phylogeny.
• Widely used, including by CIPRES
gateway.
• 32-core (2 node) calculation
• Hybrid MPI/OpenMP Code.
• 8 MPI tasks, 4 OpenMP threads per task.
• Compilers: gcc + mvapich2 v2.2, AVX
options.
• Test Case: 218 taxa, 10,000 generations.
• 3% slower with SR-IOV vs native IB.
Summary
• Comet uses the flexibility and performance of the IB network + bridging to
– provide multi-machine mounted parallel filesystems
– enhance GPU applications using GPU Direct RDMA
– enhance the performance of data analytics tools with RDMA-enabled frameworks
– enable virtualized HPC clusters at scale.
• OSU benchmarks and the HOOMD-Blue application show good performance using MVAPICH2-GDR.
• Results for OSU-Caffe with the CIFAR10 benchmark show good scaling. Future tests with larger test cases are planned to evaluate the performance advantages of using the SSDs.
• Application benchmarks show good performance on the virtualized cluster. The PSDNS example, which is communication intensive, shows good strong scaling for a test case.