Thought Leadership White Paper
IBM Systems and Technology Group November 2012
Could the “C” in HPC stand for Cloud?
By Christopher N. Porter, IBM Corporation
Introduction
Most IaaS (infrastructure as a service) vendors such as Rackspace, Amazon and Savvis use various virtualization technologies to manage the underlying hardware they build their offerings on. Unfortunately, the virtualization technologies used vary from vendor to vendor and are sometimes kept secret. Therefore, the question of virtual machines versus physical machines for high performance computing (HPC) applications is germane to any discussion of HPC in the cloud.
This paper examines aspects of computing important in HPC (compute and network bandwidth, compute and network latency, memory size and bandwidth, I/O, and so on) and how they are affected by various virtualization technologies. The benchmark results presented will illuminate areas where cloud computing, as a virtualized infrastructure, is sufficient for some workloads and inappropriate for others. In addition, the paper provides a quantitative assessment of the performance differences between a sample of applications running on various hypervisors, so that data-based decisions can be made for datacenter and technology adoption planning.
A business case for HPC clouds
HPC architects have been slow to adopt virtualization technologies for two reasons:
1. The common assumption that virtualization impacts application performance so severely that any gains in flexibility are far outweighed by the loss of application throughput.
2. Utilization on traditional HPC infrastructure is very high (between 80 and 95 percent). Therefore, the typical driving business cases for virtualization (for example, utilization of hardware, server consolidation or license utilization) simply did not hold significant enough merit to justify the added complexity and expense of running workload on virtualized resources.
In many cases, however, HPC architects would be willing to lose some small percentage of application performance to achieve the flexibility and resilience that virtual machine based computing would allow. There are several reasons architects may make this compromise, including:
• Security: Some HPC environments require data and host isolation between groups of users or even between the users themselves. In these situations VMs and VLANs can be used in concert to isolate users from each other and to restrict data to the users who should have access to it.
• Application stack control: In a mixed application environment where multiple applications share the same physical hardware, it can be difficult to satisfy the configuration requirements of each application, including OS versions, updates and libraries. Using virtualization makes that task easier, since the whole stack can be deployed as part of the application.
• High value asset maximization: In a heterogeneous HPC system the newest machines are often in highest demand. To manage this demand, some organizations use a reservation system to minimize conflicts between users. When using VMs for computing, however, the migration facility available within
most hypervisors allows opportunistic workloads to use high value assets even after a reservation window opens for a different user. If the reserving user submits workload against a reservation, then the opportunistic workload can be migrated to other assets to continue processing without losing any CPU cycles.
• Utilization improvement: If the losses in application performance are very small (single digit percentages), then adoption of virtualization technology may enable incremental steps forward in overall utilization in some cases. In these cases, virtualization may offer an increase in overall throughput for the HPC environment.
• Large execution time jobs: Several HPC applications offer no checkpoint restart capability. VM technology, however, can capture and checkpoint the entire state of the virtual machine, allowing these applications to be checkpointed. If jobs run long enough to approach the MTBF of the solution as a whole, then the checkpoint facility available within virtual machines may be very attractive. Additionally, if server maintenance is a common or predictable occurrence, then checkpoint migration or suspension of a long running job within a VM could prevent loss of compute time (see the sketch following this list).
• Increases in job reliability: Virtual machines, if used on a 1:1 basis with batch jobs (meaning each job runs within a VM container), provide a barrier between their own environment, the host environment and any other virtual machine environments running on the hypervisor. As such, “rogue” jobs which try to access more memory or CPU cores than expected can be isolated from well behaved jobs that were allocated resources as expected. Without virtual machine containment, jobs sharing a physical host often cause problems in the form of slowdowns, swapping or even OS crashes.
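
As an illustration of the checkpoint facility mentioned above, the following Python sketch uses the libvirt bindings to save the full state of a running KVM guest to disk and later restore it. The domain name and state file path are hypothetical, and this is only a minimal sketch of the hypervisor capability, not the management tooling discussed below.

# Minimal sketch: save and restore a running VM's state with libvirt.
# The domain name and state file path are hypothetical example values.
import libvirt

STATE_FILE = "/var/lib/libvirt/save/hpc-job-vm.sav"

def checkpoint_vm(domain_name: str) -> None:
    """Write the VM's memory and device state to disk; the VM stops afterwards."""
    conn = libvirt.open("qemu:///system")
    try:
        dom = conn.lookupByName(domain_name)
        dom.save(STATE_FILE)
    finally:
        conn.close()

def restore_vm() -> None:
    """Resume the VM from the previously saved state file."""
    conn = libvirt.open("qemu:///system")
    try:
        conn.restore(STATE_FILE)
    finally:
        conn.close()

if __name__ == "__main__":
    checkpoint_vm("hpc-job-vm")   # for example, just before host maintenance
    # ... perform maintenance, or copy the state file to another host ...
    restore_vm()                  # the long running job resumes where it left off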
Management tools
Achieving HPC in a cloud environment requires a few well chosen tools, including a hypervisor platform, a workload manager and an infrastructure management toolkit. The management toolkit provides policy definition, enforcement, provisioning management, resource reservation and reporting. The hypervisor platform provides the foundation for the virtual portion of cloud resources, and the workload manager provides the task management.
The cloud computing management tools of IBM® Platform Computing™—IBM® Platform™ Cluster Manager – Advanced Edition and IBM® Platform™ Dynamic Cluster—turn static clusters, grids and datacenters into dynamic shared computing environments. The products can be used to create private internal clouds or hybrid private clouds, which use external public clouds for peak demand. This is commonly referred to as “cloud bursting” or “peak shaving.”
Platform Cluster Manager – Advanced Edition creates a cloud computing infrastructure to efficiently manage application workloads applied to multiple virtual and physical platforms. It does this by uniting diverse hypervisor and physical environments into a single dynamically shared infrastructure. Although this document describes the properties of virtual machines, Platform Cluster Manager – Advanced Edition is not in any way limited to managing virtual machines. It unlocks the full computing potential lying dormant in existing heterogeneous virtual and physical resources according to workload-intelligent and resource-aware policies.
Platform Cluster Manager – Advanced Edition optimizes infrastructure resources dynamically based on perceived demand and critical resource availability using an API or a web interface. This allows users to enjoy the following business benefits:
• Resource utilization is improved by eliminating silos
• Batch job wait times are reduced because of additional resource availability or flexibility
• Users perceive a larger resource pool
• Administrator workload is reduced through multiple layers of automation
• Power consumption and server proliferation are reduced
Subsystem benchmarks
Hardware environment and settings
KVM and OVM testing
Physical hardware: (2) HP ProLiant BL465c G5 with dual socket quad core AMD 2382 + AMD-V and 16 GB RAM
OS installed: RHEL 5.5 x86_64
Hypervisor(s): KVM in RHEL 5.5, OVM 2.2, RHEL 5.5 Xen (para-virtualized)
Number of VMs per physical node: Unless otherwise noted, benchmarks were run on a 4 GB memory VM.
Interconnects: The interconnect between VMs or hypervisors was never used to run the benchmarks. The hypervisor hosts were connected to a 1000baseT network.
Citrix Xen testing
Physical hardware: (2) HP ProLiant BL2x220c in a c3000 chassis with dual socket quad core 2.83 GHz Intel® CPUs and 8 GB RAM
OS installed: CentOS Linux 5.3 x86_64
Storage: Local disk
Hypervisor: Citrix Xen 5.5
VM configuration: (Qty 1) 8 GB VM with 8 cores, (Qty 2) 4 GB VMs with 4 cores, (Qty 4) 2 GB VMs with 2 cores, (Qty 8) 1 GB VMs with 1 core
NetPIPE
NetPIPE is an acronym that stands for Network Protocol Independent Performance Evaluator.[1] It is a useful tool for measuring two important characteristics of networks: latency and bandwidth. HPC application performance is becoming increasingly dependent on the interconnect between compute servers. Because of this trend, not only does parallel application performance need to be examined, but also the performance level of the network alone, from both the latency and the bandwidth standpoints.
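
To make the two quantities concrete, the sketch below measures them in the simplest possible way: round-trip time for small messages and streaming rate for large ones over a plain TCP socket. It is not NetPIPE itself; the peer host, port and message sizes are arbitrary values assumed for the example, and the peer is assumed to run a simple echo service.

# Rough TCP ping-pong illustration of latency and bandwidth; not NetPIPE itself.
# The peer host, port and message sizes are arbitrary example values.
import socket
import time

HOST, PORT = "peer.example.com", 5000        # hypothetical peer running an echo service
SMALL, LARGE = 64, 4 * 1024 * 1024           # 64 B for latency, 4 MB for bandwidth

def ping_pong(size: int, rounds: int) -> float:
    """Return the average round-trip time in seconds for echoes of `size` bytes."""
    msg = b"x" * size
    with socket.create_connection((HOST, PORT)) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        start = time.perf_counter()
        for _ in range(rounds):
            s.sendall(msg)
            received = 0
            while received < size:            # wait for the complete echo
                chunk = s.recv(size - received)
                if not chunk:
                    raise ConnectionError("peer closed the connection")
                received += len(chunk)
        return (time.perf_counter() - start) / rounds

latency_us = ping_pong(SMALL, 100) / 2 * 1e6              # one-way latency estimate
rtt_large = ping_pong(LARGE, 10)
bandwidth_mbps = (2 * LARGE * 8) / rtt_large / 1e6         # bits moved per round trip
print(f"latency ~{latency_us:.1f} microseconds, bandwidth ~{bandwidth_mbps:.0f} Mbit/s")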
The terms used for each data series in this section are defined as follows:
• no_bkpln: Refers to communications happening over a 1000baseT Ethernet network
• same_bkpln: Refers to communications traversing a backplane within a blade enclosure
• diff_hyp: Refers to virtual machine to virtual machine communication occurring between two separate physical hypervisors
• pm2pm: Physical machine to physical machine
• vm2pm: Virtual machine to physical machine
• vm2vm: Virtual machine to virtual machine
Figures 1 and 2 illustrate that the closer the two communicating entities are, the higher the bandwidth and the lower the latency between them. Additionally, they show that when there is a hypervisor layer between the entities, the communication is slowed only slightly, and latencies stay in the expected range for 1000baseT communication (60 - 80 µsec). When two different VMs on separate hypervisors communicate—even when the backplane is within the blade chassis—the latency is more than double. The story gets even worse (by about 50 percent) when the two VMs do not share a backplane and communicate over TCP/IP.
This benchmark illustrates that not all HPC workloads are suitable for a virtualized environment. When applications run in parallel and are latency sensitive (as many MPI based applications are), virtualized resources should be avoided. If there is no choice but to use virtualized resources, then the scheduler must have the ability to choose resources that are adjacent to each other on the network, or the performance is likely to be unacceptable. This conclusion also applies to transactional applications where latency can be the largest part of the ‘submit to receive cycle time.’
Figure 1: Network bandwidth between machines
Figure 2: Network latency between machines
IOzone
IOzone is a file system benchmarking tool which generates and measures a variety of file operations.[2] In this benchmark, IOzone was only run for write, rewrite, read and reread, to mimic the most popular functions an I/O subsystem performs.
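
As a rough illustration of what the write and read phases of such a test measure (this is not IOzone, and the scratch path, block size and file size are arbitrary values chosen for the example), a sequential throughput check can be sketched as follows:

# Simplified sequential write/read throughput check; illustrative only, not IOzone.
# The scratch path, block size and file size are arbitrary example values.
import os
import time

PATH = "/scratch/iotest.bin"
BLOCK = 1024 * 1024                   # write and read in 1 MB blocks
FILE_SIZE = 1024 * 1024 * 1024        # 1 GB test file
block = os.urandom(BLOCK)

start = time.perf_counter()
with open(PATH, "wb") as f:
    for _ in range(FILE_SIZE // BLOCK):
        f.write(block)
    f.flush()
    os.fsync(f.fileno())              # force the data to disk, not just the page cache
write_mb_s = FILE_SIZE / (time.perf_counter() - start) / 1e6

start = time.perf_counter()
with open(PATH, "rb") as f:
    while f.read(BLOCK):
        pass
read_mb_s = FILE_SIZE / (time.perf_counter() - start) / 1e6

os.remove(PATH)
print(f"write ~{write_mb_s:.0f} MB/s, read ~{read_mb_s:.0f} MB/s")

A small file like this would be served largely from the page cache on the read pass; the 32 GB files used in Figures 3 and 4 are larger than the hosts’ memory, which keeps the results dominated by the disk and the hypervisor’s I/O path.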
This steady state I/O test clearly demonstrates that KVM hypervisors are severely lacking when it comes to I/O to disk, in both reads and writes. Even in the OVM case, in a best case scenario the performance of the I/O is nearing 40 percent degradation. Write performance for Citrix Xen is also limited. However, read performance exceeds that of the physical machine by over 7 percent. This can only be attributed to a read-ahead function in Xen, which worked better than the native Linux read-ahead algorithm.
Figure 3: IOzone 32 GB file (Local disk)
Figure 4: IOzone 32 GB file (Local disk)
Regardless, this benchmark, more than others, provides a warning to early HPC cloud adopters of the performance risks of virtual technologies. HPC users running I/O bound applications (Nastran, Gaussian, certain types of ABAQUS jobs, and so on) should steer clear of virtualization until these issues are resolved.
Application benchmarks
Software compilation
Compiler used: gcc-4.1.2
Compilation target: Linux kernel 2.6.34 (with the ‘defconfig’ option).
All transient files were put in a run specific subdirectory using the ‘O’ option in make. Thus the source is kept in a read-only state and writes go into the run specific subdirectory.
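
A minimal sketch of how such a timed, out-of-tree build can be scripted is shown below. The source and build paths and the job count are placeholder values; the ‘O=’ argument is what keeps the source tree read-only while all generated files land in the run specific directory.

# Timed out-of-tree kernel build; paths and job count are placeholder values.
import os
import subprocess
import time

SRC = "/nfs/src/linux-2.6.34"         # read-only kernel source tree (placeholder path)
OUT = "/scratch/kbuild-run-01"        # run specific build directory (placeholder path)

def make(*args: str) -> None:
    subprocess.run(["make", f"O={OUT}", *args], cwd=SRC, check=True)

os.makedirs(OUT, exist_ok=True)
make("defconfig")                     # generate the default configuration in OUT
start = time.perf_counter()
make("-j8")                           # compile with eight parallel jobs
print(f"kernel build took {time.perf_counter() - start:.1f} s")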
Figure 5 shows the difference in compilation performance for a physical machine running a compile on an NFS volume compared to Citrix Xen doing the same thing on the same NFS volume. Citrix Xen is roughly 11 percent slower than the physical machine performing the task. Also included is the difference between compiling to a local disk target versus compiling to the NFS target on the physical machine. The results illustrate how NFS performance can significantly affect a job’s elapsed time. This is of crucial importance because most virtualized private cloud implementations use NFS as the file system instead of local drives, in order to facilitate migration.
SIMULIA® Abaqus
SIMULIA® Abaqus[3] is the standard of the manufacturing industry for implicit and explicit non-linear finite element solutions. SIMULIA publishes a benchmark suite that hardware vendors use to distinguish their products.[4] The “e2” and “s6” models were used for these benchmarks.
Figure 5: Compilation of kernel 2.6.34
Figure 6: Parallel ABAQUS explicit (e2.inp)
The ABAQUS explicit distributed parallel runs were performed using HP MPI (2.03.01), and scratch files were written to local scratch disk. This comparison, unlike the others presented in this paper, was done in two different ways:
1. The data series called “Citrix” is for a single 8 GB RAM VM with 8 cores, where the MPI ranks communicated within a single VM.
2. The data series called “Citrix – Different VMs” represents multiple separate VMs defined on the hypervisor host intercommunicating.
Figure 7: Parallel ABAQUS standard (s6.inp)
As expected, the additional layers of virtualized networking slowed the communication speeds (also shown in the NetPIPE results) and reduced scalability when the job had higher rank counts. For communications within a single VM, on the other hand, the performance of the virtual machine was almost identical to that of the physical machine.
ABAQUS has a different algorithm for solving the implicit Finite Element Analysis (FEA) problem, called “ABAQUS Standard.” This method does not run distributed parallel, but can be run SMP parallel, which was done for the “s6” benchmark.
Figure 8: Serial FLUENT 12.1
Typically, ABAQUS Standard does considerably more I/O to scratch disk than its explicit counterpart. However, this is dependent upon the amount of memory available in the execution environment. It is clear again that when an application is only CPU or memory constrained, a virtual machine has almost no detectable performance impact.
ANSYS® FLUENT
ANSYS® FLUENT[5] belongs to a large class of HPC applications referred to as computational fluid dynamics (CFD) codes. The “aircraft_2m” FLUENT model was selected based on size and run for 25 iterations. The “sedan_4m” model was chosen as a suitably sized model for running in parallel. One hundred iterations were performed using this model.
Figure 9: Distributed parallel FLUENT 12.1 (sedan_4m - 100 iterations)
Though CFD codes such as FLUENT are rarely run serially, because of memory requirements or solution time requirements, the comparison in Figure 8 shows that the solution times for a physical machine and a virtual machine differ by only 1.9 percent, with the virtual machine being the slower of the two. The “aircraft_2m” model was simply too small to scale well in parallel and provided strangely varying results, so the sedan_4m model was used instead.[6]
The results for the parallel case (Figure 9) illustrate that at two CPUs the virtual machine outperforms the physical machine. This is most likely caused by the native Linux scheduler moving processes around on the physical host. If the application had been bound to particular cores, then this effect would disappear. In the four and eight CPU runs, the difference between physical and virtual machines is negligible. This supports the theory that the Linux CPU scheduler is impacting the two CPU job.
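
On Linux, that kind of core binding can be expressed directly. The sketch below is a generic illustration (not what the benchmark harness actually did) of pinning the current process, and anything it later launches, to two specific cores using the standard library; the core numbers are arbitrary for the example.

# Pin the current process to fixed cores so the scheduler cannot migrate it.
# Generic Linux illustration; the core numbers are arbitrary example values.
import os

os.sched_setaffinity(0, {0, 1})                 # pid 0 means "this process"
print("allowed cores:", sorted(os.sched_getaffinity(0)))

# A solver process launched from here inherits the same affinity mask,
# so its work stays on cores 0 and 1 instead of being moved by the scheduler.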
LS-DYNA®
LS-DYNA®[7] is a transient dynamic finite element analysis program capable of solving complex real world time domain problems on serial, SMP parallel, and distributed parallel computational engines. The “refined_neon_30ms” model was chosen for the benchmarks reviewed in this section. HP MPI 2.03.01, now owned by IBM Platform Computing, was the message passing library used.
Figure 10: LS-DYNA - MPP971 - Refined Neon
The MPP-DYNA application responds well when run in a low latency environment. This benchmark supports the notion that distributed parallel LS-DYNA jobs are still very sensitive to network latency, even when using the backplane of a VM. A serial run shows the virtual machine is 1 percent slower. Introduce message passing, however, and at eight CPUs the virtual machine is nearly 40 percent slower than the physical machine. The expectation is that if the same job were run on multiple VMs, as was done for the ABAQUS explicit parallel jobs, the effect would be even greater, with physical machines significantly outperforming virtual machines.
Conclusion
As with most legends, there is some truth to the notion that VMs are inappropriate for HPC applications. The benchmark results demonstrate that latency sensitive and I/O bound applications would perform at levels unacceptable to HPC users. However, the results also show that CPU and memory bound applications, and parallel applications that are not latency sensitive, perform well in a virtual environment. HPC architects who dismiss virtualization technology entirely may therefore be missing an enormous opportunity to inject flexibility and even a performance edge into their HPC designs.
The power of Platform Cluster Manager – Advanced Edition and IBM® Platform™ LSF® is their ability to work in concert to manage both of these types of workload simultaneously in a single environment. These tools allow their users to maximize resource utilization and flexibility through provisioning and control at the physical and virtual levels. Only IBM Platform Computing technology allows for environment optimization at the job-by-job level, and only Platform Cluster Manager – Advanced Edition continues to optimize that environment after jobs have been scheduled and new jobs have been submitted. Such an environment could realize orders of magnitude increases in efficiency and throughput while reducing the overhead of IT maintenance.
Significant results
• The KVM hypervisor significantly outperforms the OVM hypervisor on AMD servers, especially when several VMs run simultaneously.
• Citrix Xen I/O reads and rereads are very fast on Intel servers.
• OVM outperforms KVM by a significant margin for I/O intensive applications running on AMD servers.
• I/O intensive and latency sensitive parallel applications are not a good fit for virtual environments today.
• Memory and CPU bound applications are at performance parity between physical and virtual machines.
For more information
To learn more about IBM Platform Computing, please contact your IBM marketing representative or IBM Business Partner, or visit the following website: ibm.com/platformcomputing
© Copyright IBM Corporation 2012
IBM Corporation, Systems and Technology Group, Route 100, Somers, NY 10589
Produced in the United States of America, November 2012
IBM, the IBM logo, ibm.com, Platform Computing, Platform Cluster Manager, Platform Dynamic Cluster and Platform LSF are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the web at “Copyright and trademark information” at ibm.com/legal/copytrade.shtml
Intel, Intel logo, Intel Inside, Intel Inside logo, Intel Centrino, Intel Centrino logo, Celeron, Intel Xeon, Intel SpeedStep, Itanium, and Pentium are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both.
This document is current as of the initial date of publication and may be changed by IBM at any time. Not all offerings are available in every country in which IBM operates.
The performance data discussed herein is presented as derived under specific operating conditions. Actual results may vary. THE INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS” WITHOUT ANY WARRANTY, EXPRESS OR IMPLIED, INCLUDING WITHOUT ANY WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND ANY WARRANTY OR CONDITION OF NON-INFRINGEMENT. IBM products are warranted according to the terms and conditions of the agreements under which they are provided.
Actual available storage capacity may be reported for both uncompressed and compressed data and will vary and may be less than stated.
[1] http://www.scl.ameslab.gov/netpipe/
[2] http://www.iozone.org/
[3] ABAQUS is a trademark of Simulia and Dassault Systemes (http://www.simulia.com)
[4] See http://www.simulia.com/support/v67/v67_performance.html for a description of the benchmark models and their availability
[5] Fluent is a trademark of ANSYS, Inc. (http://www.fluent.com)
[6] The largest model provided by ANSYS, “truck_14m”, was not an option for this benchmark as the model was too large to fit into memory.
[7] LS-DYNA is a trademark of LSTC (http://www.lstc.com/)
Please Recycle
DCW03038-USEN-0