HPC Cloud Bad; HPC in the Cloud Good


DESCRIPTION

HPC Cloud Bad; HPC in the Cloud Good. Josh Simons, Office of the CTO, VMware, Inc. IPDPS 2013, Cambridge, Massachusetts.

TRANSCRIPT


HPC Cloud Bad; HPC in the Cloud Good
Josh Simons, Office of the CTO, VMware, Inc.

IPDPS 2013, Cambridge, Massachusetts
© 2011 VMware, Inc. All rights reserved.

Post-Beowulf Status Quo
- Enterprise IT
- HPC IT

Closer to True Scale

(NASA)

Converging Landscape
Enterprise IT and HPC IT: convergence driven by increasingly shared concerns, e.g.:
- Scale-out management
- Power & cooling costs
- Dynamic resource mgmt
- Desire for high utilization
- Parallelization for multicore
- Big Data analytics
- Application resiliency
- Low latency interconnect
- Cloud computing

Agenda
- HPC and Public Cloud
  - Limitations of the current approach
- Cloud HPC Performance
  - Throughput
  - Big Data / Hadoop
  - MPI / RDMA
- HPC in the Cloud
  - A more promising model

Server Virtualization
[Figure: application / operating system / hardware stack, shown without virtualization and with a virtualization layer]
- Hardware virtualization presents a complete x86 platform to the virtual machine
- Allows multiple applications to run in isolation within virtual machines on the same physical machine
- Virtualization provides direct access to the hardware resources to give you much greater performance than software emulation

Speaker notes: Virtual machines are not a new concept. They were developed over thirty years ago for mainframe systems to allow multiple users to safely share those expensive machines. As computers became cheaper, the motivation behind virtualization faded, and processor architectures like the Intel x86 were developed without some of the features needed to support virtualization.

VMware's founders resurrected the virtual machine concept when problems like server proliferation and the need to run multiple applications in dedicated operating systems started becoming serious issues for IT managers and software developers. VMware developed revolutionary technology to efficiently virtualize x86 systems that, for the first time, allowed unmodified x86 operating systems and applications to run in true virtual machines with excellent performance.

Multiple virtual machines can operate concurrently on a single x86 host system. Each one can run a different operating system and application stack.

VMware's virtualization technology provides each virtual machine with a true representation of an x86 computer, complete with processor, memory, networking interfaces, and storage devices. The VMware virtualization layer gives the virtual machines direct access to the underlying x86 hardware, an important distinction from the much slower emulation technology that must process all virtual machine operations in software.
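Not from the deck: the notes above mention that early x86 processors lacked the features needed to support virtualization. A quick way to check whether a modern host CPU advertises the Intel VT-x hardware assist that current hypervisors rely on is the CPUID instruction. The sketch below is illustrative and GCC/x86-specific.

/* Illustrative sketch: check whether the host CPU advertises Intel VT-x.
 * Compile with gcc on an x86 Linux host. */
#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID leaf 1 returns feature flags; ECX bit 5 is VMX (Intel VT-x). */
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        fprintf(stderr, "CPUID leaf 1 not supported\n");
        return 1;
    }

    printf("VMX (Intel VT-x) %s\n",
           (ecx & (1u << 5)) ? "supported" : "not reported");
    return 0;
}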

HPC Performance in the Cloud
http://science.energy.gov/~/media/ascr/pdf/program-documents/docs/Magellan_final_report.pdf

Biosequence Analysis: BLAST

C. Macdonell and P. Lu, "Pragmatics of Virtual Machines for High-Performance Computing: A Quantitative Study of Basic Overheads," in Proc. of the High Perf. Computing & Simulation Conf., 2007.

Biosequence Analysis: HMMer

Molecular Dynamics: GROMACS

EDA Workload Example
[Figure: an EDA workload run on a native operating system per host versus in VMs on a virtualization layer; measured results range from 6% slower to 2% faster than native]

Memory Virtualization
[Figure: three-level memory addressing: virtual, physical, machine]

HPL (GFLOPS):
                Native     Virtual, EPT on    Virtual, EPT off
  4 KB pages    37.04      36.04 (97.3%)      36.22 (97.8%)
  2 MB pages    37.74      38.24 (100.1%)     38.42 (100.2%)*

RandomAccess (GUP/s):
                Native     Virtual, EPT on    Virtual, EPT off
  4 KB pages    0.01842    0.0156 (84.8%)     0.0181 (98.3%)
  2 MB pages    0.03956    0.0380 (96.2%)     0.0390 (98.6%)

EPT = Intel Extended Page Tables (hardware page-table virtualization; the AMD equivalent is RVI)

Speaker notes: Briefly describe the three-level memory hierarchy in the virtual world, then discuss the differences shown here between a benchmark with good memory locality (LINPACK) and one with no locality whatsoever (StarRandomAccess, representative of an NSA workload). The point illustrated here is that while EPT very often works well, there may be (important) cases in which it does not. In such cases, use of shadow page tables actually increases performance (as does use of large pages). A sketch of a StarRandomAccess-style access pattern follows this slide.
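Not from the deck: a minimal C sketch, loosely in the spirit of HPCC StarRandomAccess, showing the scattered 64-bit updates for which nested (EPT) page walks on 4 KB pages become the bottleneck. Table size, update count, and the reuse of the HPCC LFSR polynomial are illustrative choices.

/* Illustrative sketch: random 64-bit XOR updates over a large table.
 * Poor locality means nearly every update misses the TLB, which is where
 * hardware (EPT) page walks and small pages hurt. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define POLY       0x0000000000000007ULL   /* LFSR polynomial used by HPCC RandomAccess */
#define LOG2_TABLE 26                      /* 2^26 entries * 8 B = 512 MB table */
#define TABLE_SIZE (1ULL << LOG2_TABLE)
#define NUPDATE    (4 * TABLE_SIZE)

int main(void)
{
    uint64_t *table = malloc(TABLE_SIZE * sizeof *table);
    if (!table) { perror("malloc"); return 1; }

    for (uint64_t i = 0; i < TABLE_SIZE; i++)
        table[i] = i;

    uint64_t ran = 1;
    for (uint64_t i = 0; i < NUPDATE; i++) {
        ran = (ran << 1) ^ ((int64_t)ran < 0 ? POLY : 0);
        table[ran & (TABLE_SIZE - 1)] ^= ran;   /* scattered read-modify-write */
    }

    printf("done: %llu updates over %llu entries\n",
           (unsigned long long)NUPDATE, (unsigned long long)TABLE_SIZE);
    free(table);
    return 0;
}

Run natively and in a VM, with 4 KB and 2 MB pages, the sustained update rate should show the same large-page and EPT sensitivity the table above illustrates.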

vNUMA
[Figure: a single wide VM on the ESXi hypervisor spanning four sockets, each with its own memory]

vNUMA Performance Study
Performance Evaluation of HPC Benchmarks on VMware's ESX Server, Ali Q., Kiriansky V., Simons J., Zaroo P., 5th Workshop on System-level Virtualization for High Performance Computing, 2011.

Speaker notes: Discuss results. The left side shows performance improvements with vNUMA. The right side shows those improvements bring performance in line with native performance. Note, too, that these graphs show vCPU scalability to 64.

Compute: GPGPU Experiment
- General-purpose (GP) computation with GPUs
- CUDA benchmarks
- VM DirectPath I/O
- Small kernels: DSP, financial, bioinformatics, fluid dynamics, image processing
- RHEL 6
- nVidia (Quadro 4000) and AMD GPUs
- Generally 98%+ of native performance (worst case was 85%)
- Currently looking at larger-scale financial and bioinformatics applications

MapReduce Architecture
[Figure: HDFS → map tasks → reduce tasks → HDFS]

Speaker notes: Interest in virtualizing Hadoop clusters, to incorporate into existing VMware environments (presumably most of the customers in the room are already using VMware). Briefly describe the model to set context for the next slide.

vHadoop Approaches
Why virtualize Hadoop?
- Simplified Hadoop cluster configuration and provisioning
- Support Hadoop usage in existing virtualized datacenters
- Support multi-tenant environments
- Project Serengeti

[Figure: per-node deployment models: bare-metal Hadoop node; one or more Hadoop VMs per node; separate data (HDFS) and compute (MapReduce) VMs per node]

Speaker notes: Bare-metal Hadoop: HDFS is built from local disks and includes redundancy for fault tolerance and performance.

vHadoop Benchmarking: Collaboration with AMAX
- Seven-node Hadoop cluster (AMAX ClusterMax)
- Standard tests: Pi, DFSIO, Teragen / Terasort
- Configurations: native; one VM per host; two VMs per host
- Details:
  - Two-socket Intel X5650, 96 GB, Mellanox 10 GbE, 12x 7200 rpm SATA
  - RHEL 6.1, 6- or 12-vCPU VMs, vmxnet3
  - Cloudera CDH3U0, replication=2, max 40 map and 10 reduce tasks per host
  - Each physical host considered a rack in Hadoop's topology description
  - ESXi 5.0 with a development Mellanox driver, disks passed to VMs via raw disk mapping (RDM)

Benchmarks
- Pi: direct-exec Monte-Carlo estimation of pi; # map tasks = # logical processors; 1.68 T samples (a sketch of the pi estimation follows this slide)
- TestDFSIO: streaming write and read; 1 TB; more tasks than processors
- Terasort: 3 phases (teragen, terasort, teravalidate); 10 billion or 35 billion records, each 100 bytes (1 TB, 3.5 TB); more tasks than processors; exercises CPU, networking, and storage I/O
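Not part of the original deck: a minimal serial C sketch of the Monte-Carlo pi estimate the Pi benchmark computes. Hadoop spreads the sampling across map tasks and sums the in-circle counts in a reduce step, but the final arithmetic is the same ratio noted on the slide (4·R/(R+G), presumably the in-circle count over the total sample count). The sample count and seed below are illustrative.

/* Illustrative sketch: serial Monte-Carlo estimate of pi.
 * pi ~= 4 * (samples inside the unit quarter circle) / (total samples). */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const long long total = 100000000;   /* sample count is illustrative */
    long long inside = 0;

    srand(12345);
    for (long long i = 0; i < total; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            inside++;
    }

    printf("pi ~= %.6f\n", 4.0 * (double)inside / (double)total);
    return 0;
}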

Speaker note (Pi benchmark): estimate ≈ 4·R/(R+G) = 22/7.

Ratio to Native, Lower is Better
[Chart: virtual-to-native ratios for the Hadoop benchmarks]
A Benchmarking Case Study of Virtualized Hadoop Performance on VMware vSphere 5, http://www.vmware.com/files/pdf/VMW-Hadoop-Performance-vSphere5.pdf

Kernel Bypass Model
[Figure: conventional path (application → sockets → TCP/IP → driver → hardware) versus RDMA kernel bypass from user space, shown for a native kernel and for a guest kernel on the vmkernel]

Virtual Infrastructure RDMA
- Distributed services within the platform, e.g. vMotion (live migration)
- Inter-VM state mirroring for fault tolerance
- Virtually shared, DAS-based storage fabric
- All would benefit from: decreased latency, increased bandwidth, CPU offload

vMotion/RDMA Performance
[Charts: total vMotion time (sec), pre-copy bandwidth (pages/sec), and destination/source CPU utilization over time]

Guest OS RDMA
- RDMA access from within a virtual machine
- Scale-out middleware and applications increasingly important in the Enterprise: memcached, redis, Cassandra, mongoDB, GemFire Data Fabric, Oracle RAC, IBM pureScale
- Big Data an important emerging workload: Hadoop, Hive, Pig, etc.
- And, increasingly, HPC

SR-IOV Virtual Function VM DirectPath I/O
- Single-Root I/O Virtualization (SR-IOV): PCI-SIG standard
- Physical (IB/RoCE/iWARP) HCA can be shared between VMs or by the ESXi hypervisor
- Virtual Functions direct-assigned to VMs
- Physical Function controlled by hypervisor
- Still VM DirectPath, which is incompatible with several important virtualization features

[Figure: SR-IOV RDMA HCA shared by multiple guest OSes: each guest's OFED stack uses an RDMA HCA VF driver bound to a Virtual Function (VF), while the virtualization layer's PF device driver controls the Physical Function (PF) through the I/O MMU]

Paravirtual RDMA HCA (vRDMA) Offered to the VM
- New paravirtualized device exposed to the virtual machine
- Implements the Verbs interface
- Device emulated in the ESXi hypervisor
- Translates Verbs from the guest into Verbs for the ESXi OFED stack
- Guest physical memory regions mapped to ESXi and passed down to the physical RDMA HCA
- Zero-copy DMA directly from/to guest physical memory
- Completions/interrupts proxied by the emulation
- "Holy Grail" of RDMA options for vSphere VMs

[Figure: vRDMA data path: guest OS OFED stack → vRDMA HCA device driver → vRDMA device emulation in the ESXi I/O stack → ESXi OFED stack → physical RDMA HCA device driver → physical RDMA HCA]
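Not from the deck: both VM DirectPath and the vRDMA device ultimately expose the Verbs API to guest applications. The following is a minimal, illustrative libibverbs sketch of the kernel-bypass setup steps: open an HCA, allocate a protection domain, register (pin) a buffer, and create a completion queue. It assumes libibverbs and an RDMA-capable device are present; error handling is abbreviated and no queue pair is created.

/* Illustrative libibverbs sketch: registration pins the pages so the HCA
 * can DMA to/from them directly, bypassing the kernel data path.
 * Build with: gcc verbs_sketch.c -libverbs */
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "ibv_open_device failed\n"); return 1; }

    struct ibv_pd *pd = ibv_alloc_pd(ctx);            /* protection domain */
    if (!pd) { fprintf(stderr, "ibv_alloc_pd failed\n"); return 1; }

    size_t len = 1 << 20;
    void *buf = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);
    if (!mr || !cq) { fprintf(stderr, "verbs setup failed\n"); return 1; }

    printf("device %s: registered %zu bytes, lkey=0x%x rkey=0x%x\n",
           ibv_get_device_name(devs[0]), len, mr->lkey, mr->rkey);

    /* A real application would now create a queue pair, exchange QPNs and
     * rkeys out of band, post work requests, and poll the CQ for completions. */
    ibv_destroy_cq(cq);
    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}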

InfiniBand Bandwidth with VM DirectPath I/O
RDMA Performance in Virtual Machines using QDR InfiniBand on VMware vSphere 5, April 2011, http://labs.vmware.com/academic/publications/ib-researchnote-apr2012

Latency with VM DirectPath I/O (RDMA Read, Polling)
  Msg size (bytes)   Native (µs)   ESXi ExpA (µs)
  2                  2.28          2.98
  4                  2.28          2.98
  8                  2.28          2.98
  16                 2.27          2.96
  32                 2.28          2.98
  64                 2.28          2.97
  128                2.32          3.02
  256                2.5           3.19

Latency with VM DirectPath I/O (Send/Receive, Polling)
  Msg size (bytes)   Native (µs)   ESXi ExpA (µs)
  2                  1.35          1.75
  4                  1.35          1.75
  8                  1.38          1.78
  16                 1.37          2.05
  32                 1.38          2.35
  64                 1.39          2.9
  128                1.5           4.13
  256                2.3           2.31

Intel 2009 Experiments
Hardware:
- Eight two-socket 2.93 GHz X5570 (Nehalem-EP) nodes, 24 GB
- Dual-ported Mellanox DDR InfiniBand adaptor
- Mellanox 36-port switch

Software:
- vSphere 4.0 (current version is 5.1)
- Platform Open Cluster Stack (OCS) 5 (native and guest)
- Intel compilers 11.1
- HPCC 1.3.1
- STAR-CD V4.10.008_x86
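Not from the deck: the latency tables above are verbs-level microbenchmarks. At the MPI level, the same half-round-trip latency is typically measured with a ping-pong loop; the sketch below is illustrative (message sizes chosen to match the tables, iteration count arbitrary) and assumes any standard MPI implementation.

/* Illustrative MPI ping-pong latency sketch (rank 0 <-> rank 1).
 * Reports half the round-trip time per message size.
 * Build: mpicc pingpong.c -o pingpong ; run: mpirun -np 2 ./pingpong */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char buf[256];
    memset(buf, 0, sizeof buf);

    for (int size = 2; size <= 256; size *= 2) {
        const int iters = 10000;
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++) {
            if (rank == 0) {
                MPI_Send(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, size, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, size, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("%4d bytes: %.2f us one-way\n",
                   size, (t1 - t0) / iters / 2.0 * 1e6);
    }

    MPI_Finalize();
    return 0;
}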

HPCC Virtual to Native Run-time Ratios (Lower is Better)
Data courtesy of Marco Righini, Intel Italy

Point-to-point Message Size Distribution: STAR-CD

Source: http://www.hpcadvisorycouncil.com/pdf/CD_adapco_applications.pdf

Collective Message Size Distribution: STAR-CD

Source: http://www.hpcadvisorycouncil.com/pdf/CD_adapco_applications.pdf

STAR-CD Virtual to Native Run-time Ratios (Lower is Better)
Data courtesy of Marco Righini, Intel Italy

Software Defined Networking (SDN) Enables Network Virtualization

[Figure: analogy between networking and telephony. In traditional networking and wired telephony, an identifier (192.168.10.1, 650.555.1212) is tied to a location; wireless telephony and VXLAN-based network virtualization decouple the identifier from the location.]

Data Center Networks: Traffic Trends

[Figure: data center traffic shifting from north/south (to and from the WAN/Internet) to east/west (server to server)]

Data Center Networks: the Trend to Fabrics

[Figure: two- or three-tier data center network versus a spine/leaf fabric, both connected to the WAN/Internet]
- From 2- or 3-tier to spine/leaf
- Density & bandwidth jump
- ECMP for layer 3 (and layer 2)
- Reduce network oversubscription
- Wire & configure once
- Uniform configurations

Network Virtualization and RDMA
SDN:
- Decouple logical network from physical hardware
- Encapsulate Ethernet in IP (more layers; see the sketch after this slide)
- Flexibility and agility are primary goals
RDMA:
- Directly access physical hardware
- Map hardware directly into userspace (fewer layers)
- Performance is the primary goal
Is there any hope of combining the two?
- A converged datacenter supporting both SDN management and decoupling along with RDMA
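Not part of the deck: to make "encapsulate Ethernet in IP (more layers)" concrete, here is an illustrative C sketch of the 8-byte VXLAN header (RFC 7348) that a tunnel endpoint adds, along with outer Ethernet/IP/UDP framing (UDP destination port 4789, roughly 50 bytes of total overhead), around every inner frame. The struct and helper names are mine, not from any particular implementation.

/* Illustrative sketch: the VXLAN header (RFC 7348). */
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* htonl, ntohl */

struct vxlan_hdr {
    uint32_t flags_reserved;   /* bit 27 (the "I" flag) set => VNI is valid */
    uint32_t vni_reserved;     /* 24-bit VXLAN Network Identifier + 8 reserved bits */
};

/* Fill a VXLAN header for a given virtual network ID (illustrative helper). */
static void vxlan_fill(struct vxlan_hdr *h, uint32_t vni)
{
    memset(h, 0, sizeof *h);
    h->flags_reserved = htonl(1u << 27);       /* set the I flag */
    h->vni_reserved   = htonl((vni & 0xffffffu) << 8);
}

int main(void)
{
    struct vxlan_hdr h;
    vxlan_fill(&h, 5001);                      /* VNI 5001 is an arbitrary example */
    printf("VXLAN header: %zu bytes, VNI=%u\n",
           sizeof h, (ntohl(h.vni_reserved) >> 8) & 0xffffffu);
    return 0;
}

RDMA, by contrast, skips this extra framing entirely by letting the HCA move registered memory directly, which is why combining the two models is an open question on the slide above.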

Secure Private Cloud for HPC
[Figure: research groups and IT reach research clusters 1..n through user portals and programmatic control via the VMware vCloud API; VMware vCloud Director, vShield (security), catalogs, and vCenter Server manage multiple vSphere clusters, with links out to public clouds]

Massive Consolidation

Run Any Software Stacks
[Figure: VMs with different guest stacks (App A / OS A, App B / OS B) on a virtualization layer across hosts]
- Support groups with disparate software requirements
- Including root access

Separate Workloads
- Secure multi-tenancy
- Fault isolation (and sometimes performance)

Live Virtual Machine Migration (vMotion)

Use Resources More Efficiently
[Figure: VMs consolidated across virtualized hosts]
- Avoid killing or pausing jobs
- Increase overall throughput

Workload Agility
[Figure: application stacks moved between a native OS-on-hardware host and virtualized hosts]

Multi-tenancy with Resource Guarantees
- Define policies to manage resource sharing between groups

Protect Applications from Hardware Failures
- Reactive fault tolerance: fail and recover

[Figure: reactive fault tolerance restarts the failed VM (App A / OS) on another host]

Protect Applications from Hardware Failures
[Figure: VMs running MPI ranks (MPI-0, MPI-1, MPI-2); a VM is migrated off a failing host and the job continues]
- Proactive fault tolerance: move and continue

Unification of IT Infrastructure

HPC in the (Mainstream) Cloud
- Throughput
- MPI / RDMA

Summary
- HPC performance in the cloud:
  - Throughput applications perform very well in virtual environments
  - MPI / RDMA applications will experience small to very significant slowdowns in virtual environments, depending on scale and message traffic characteristics
- Enterprise and HPC IT requirements are converging
  - Though less so with HEC (e.g., Exascale)
- Vendor and community investments in Enterprise solutions eclipse those made in HPC due to market size differences
- The HPC community can benefit significantly from adopting Enterprise-capable IT solutions
  - And working to influence Enterprise solutions to more fully address HPC requirements
- Private and community cloud deployments provide significantly more value than cloud bursting from physical infrastructure to public cloud