

© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.

Blades in HPC
Henry Strauss, Strategic Technical Consultant, High Performance Computing

IT Symposium, HP User Society DECUS, Nuremberg, April 18th, 2007

Slide 2 (1 June 2007)

HPC trends and challenges

Trends in HPC
• clustering now mainstream
• rapid increase in scale
• growing cost in facility and system administration

Customer challenges
• high(est) performance of
  • CPUs
  • memory
  • disks
  • network
• best price/performance
  • acquisition costs
  • running costs
    • staff
    • space
    • power
    • cooling
• “soft” factors
  • ease of use
  • manageability
  • architecture flexibility
  • infrastructure complexity
• TCO!


Slide 3 (1 June 2007)

A quote from Chris Willard, Research Vice President at IDC:

“… cluster type systems have become the preferred architecture for HPC, … That said, users also report challenges to clustering in such areas as system complexity and management, and physical system support.”

Why are blades so “hot”?

Slide 4 (1 June 2007)

BladeSystem c-Class meets the HPC challenges
• Performance
  • Broadest choice of fastest processors
  • Fastest interconnect in the industry
• Environment
  • HP Thermal Logic: innovative power & cooling technologies
• Management
  • Insight Control and Virtual Connect: comprehensive management
• Total cost of ownership
  • Lowering CapEx and OpEx
  • Infrastructure headroom for investment protection


Slide 5 (1 June 2007)

c7000 Enclosure: Front View
10U, 8-16 blades

• Server blades
  • 2x features, 2x the density
• Storage blades
  • A new paradigm for “bladed” storage solutions
• Integrated power
  • Simplified configuration and greater efficiency
  • Same flexibility, capacity and redundancy
• Onboard Administrator
  • HP Insight Display
  • Simple set-up delivered out of the box

Slide 6 (1 June 2007)

c7000 Enclosure: Rear View

• Interconnect bays
  • 8 bays; up to 4 redundant I/O fabrics
  • Up to 94% reduction in cables
  • Ethernet, Fibre Channel, iSCSI, SAS, IB
• Active Cool fans
  • Adaptive flow for maximum power efficiency, air movement & acoustics
• Onboard Administrator
  • Remote administration view
  • Robust, multi-enclosure control
• Power management
  • Choice of single-phase or three-phase enclosures
  • AC-redundant mode or power-supply-redundant mode
  • Best performance per watt
• PARSEC architecture
  • Parallel, redundant and scalable cooling and airflow design


Slide 7 (1 June 2007)

From rack-mount to blade
Example configuration: 256-node cluster w/ InfiniBand

BladeSystem advantage:
• Power: 32% saving
• Floor space: from 8 racks to 5 racks
• Network cables: up to 78% less
• And excellent manageability!
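As a rough cross-check of the floor-space claim, the sketch below counts rack units for a 256-node cluster built both ways, using the 10U / 16-blade c7000 figure from the enclosure slide; the 42U rack size and the allowances for external switches and infrastructure are illustrative assumptions, not HP's priced configuration.

```c
/* Rough rack-space estimate for a 256-node cluster: 1U servers vs. c7000
 * enclosures (10U, 16 half-height blades). Rack size and the external
 * switch/infrastructure allowances are illustrative assumptions. */
#include <stdio.h>

int main(void) {
    const int nodes  = 256;
    const int rack_u = 42;   /* assumed usable rack units per rack */

    /* 1U rack-mount servers plus an assumed ~50U of external IB and
     * Ethernet switching, PDUs and head nodes spread across the row. */
    int u_rackmount     = nodes * 1 + 50;
    int racks_rackmount = (u_rackmount + rack_u - 1) / rack_u;

    /* c7000 enclosures: 16 blades in 10U with switch modules inside;
     * assume ~24U of external spine switches and infrastructure. */
    int enclosures   = (nodes + 15) / 16;
    int u_blades     = enclosures * 10 + 24;
    int racks_blades = (u_blades + rack_u - 1) / rack_u;

    printf("1U servers: %dU -> %d racks\n", u_rackmount, racks_rackmount);
    printf("c-Class   : %d enclosures, %dU -> %d racks\n",
           enclosures, u_blades, racks_blades);
    return 0;
}
```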

Slide 8 (1 June 2007)

Cluster cabling: 1U server vs. c-Class
[Photos contrast the rear cabling of a “Cluster w/ 1U servers” with a “Cluster w/ c-Class”; callouts label the Ethernet, power and InfiniBand cables.]


Slide 9 (1 June 2007)

Deliver tangible savings to business
Example: 320 servers over 3 years

[Bar chart comparing conventional IT with a bladed infrastructure; capital expenses and operational costs are each broken down into HP servers & storage, people, and third-party data center. Callouts quote savings of up to 38%, up to 50% and up to 96% across three areas: servers, storage, racks & networking; initial system set-up time; and power & cooling, cable installation, and datacenter space.]

Slide 10 (1 June 2007)

HPC clusters comparison (96 + 1 nodes)
• Facility savings alone pay for the small price premium
• Other savings in support/management contribute to a lower TCO with blades

List prices                  DL140 G3     BL460c       DL145 G3     BL465c
Head node w/ TFT               $9,538       $9,538       $8,388       $8,358
Servers & blade enclosure    $585,024     $669,258     $449,664     $558,858
Network infrastructure        $27,394      $11,821      $29,895      $11,821
InfiniBand interconnect      $203,475     $156,308     $203,475     $156,308
Racks, power cables, PDUs     $15,311      $15,086      $15,339      $15,086
Linux HPC 8 PK                $13,768      $13,768      $13,768      $13,768
Integration SVC               $23,234      $23,229      $23,234      $23,229
Total list price             $877,744     $899,008     $743,763     $787,428
Price premium                      --      $21,264           --      $43,665
Premium %                          --         2.4%           --         5.9%

Facility cost (3 years)
Space                          $3,154       $2,365       $3,154       $2,365
Power                         $95,659      $74,431      $79,014      $57,873
Cooling                       $47,830      $37,215      $39,507      $28,936
Total facility cost          $146,643     $114,011     $121,675      $89,174
Facility savings                   --      $32,631           --      $32,501
Savings %                          --        22.3%           --        26.7%
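The premium and savings percentages follow directly from the list prices and facility costs; the sketch below reproduces the DL140 G3 vs. BL460c arithmetic with figures copied from the table (small differences, e.g. $32,631 vs. $32,632, come from the table's own rounding of the underlying costs).

```c
/* Recompute the DL140 G3 vs. BL460c comparison from the table above. */
#include <stdio.h>

int main(void) {
    /* Total list price and 3-year facility cost, copied from the table. */
    double rack_total    = 877744.0, blade_total    = 899008.0;
    double rack_facility = 146643.0, blade_facility = 114011.0;

    double premium          = blade_total - rack_total;
    double premium_pct      = 100.0 * premium / rack_total;
    double facility_savings = rack_facility - blade_facility;
    double savings_pct      = 100.0 * facility_savings / rack_facility;

    printf("Price premium   : $%.0f (%.1f%%)\n", premium, premium_pct);
    printf("Facility savings: $%.0f (%.1f%%)\n", facility_savings, savings_pct);
    printf("Net over 3 years: $%.0f in favour of blades\n",
           facility_savings - premium);
    return 0;
}
```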

HP restricted – internal use only


Slide 11 (1 June 2007)

HP BladeSystem c-Class Portfolio
• c7000 Enclosure
• Server blades and storage blades
  • A full range of 2P and 4P server blades
  • Best-in-class performance, choice, and reliability for Windows, Linux, and HP-UX applications
• Add interconnects to connect to LAN, SAN, and scale-out clusters (best-in-class industry interconnects)
  • Virtual Connect (IP & FC)
  • LAN interconnects: Ethernet NICs (IP)
  • SAN interconnects: Fibre Channel HBAs (Fibre Channel)
  • InfiniBand 4X DDR (IB)

HP Confidential – NDA Required

Slide 12 (1 June 2007)

Blade clusters are not “flat”
• Enclosure “boundaries” imply a hierarchy => additional considerations
  • non-linearly increasing bisection bandwidth
  • balance?!
• BUT: “classical” clusters aren't flat either
  • non-uniform network distances (hierarchy of switch modules)
  • XC includes administrative subclusters (per rack)
• AND: the compute node itself introduces a hierarchy
  • i.e. a node is an SMP with typically 2-8 cores, with fundamentally different “interconnect characteristics” (see the placement sketch after this list)
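One practical consequence of these hierarchy levels is that process placement matters: ranks on the same blade communicate through shared memory, ranks in the same enclosure stay on the enclosure switch module, and only the rest cross the external fabric. The sketch below shows how a linear rank number maps onto that hierarchy under a simple block placement; the 8 cores per blade and 16 blades per enclosure are illustrative assumptions.

```c
/* Illustrative mapping of a linear rank number onto the blade-cluster
 * hierarchy: core -> blade (SMP node) -> enclosure. Counts are assumptions. */
#include <stdio.h>

int main(void) {
    const int cores_per_blade      = 8;   /* e.g. 2 sockets x 4 cores      */
    const int blades_per_enclosure = 16;  /* half-height blades in a c7000 */

    int sample[] = { 0, 7, 8, 127, 128, 300 };
    for (int i = 0; i < 6; i++) {
        int rank      = sample[i];
        int blade     = rank / cores_per_blade;
        int core      = rank % cores_per_blade;
        int enclosure = blade / blades_per_enclosure;
        printf("rank %3d -> enclosure %d, blade %2d, core %d\n",
               rank, enclosure, blade % blades_per_enclosure, core);
    }
    return 0;
}
```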


Slide 13 (1 June 2007)

More on IB support and configuration examples

Slide 14 (1 June 2007)

Why the interconnect matters
Fluent performance study: InfiniBand versus GigE, 3.6M-cell model on 1 to 16 cores (2, 4 and 8 nodes)

[Chart plots the number of jobs per day (0 to 1600) against the number of processors (2 to 16) for GigE, Mellanox InfiniBand and the ideal scaling curve.]

• Near-linear speedup observed with IB
• GigE does not scale beyond small clusters
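A common way to read such a chart is in terms of speedup and parallel efficiency derived from the throughput metric; the sketch below does that calculation, with sample jobs-per-day figures that are made up for illustration rather than taken from the Fluent study.

```c
/* Speedup and parallel efficiency from a throughput metric (jobs/day).
 * The sample figures are illustrative, not the Fluent results. */
#include <stdio.h>

int main(void) {
    int    procs[]    = { 2, 4, 8, 16 };
    double jobs_day[] = { 190.0, 370.0, 720.0, 1380.0 };  /* assumed */
    double base = jobs_day[0] / procs[0];   /* per-processor baseline */

    for (int i = 0; i < 4; i++) {
        double speedup    = jobs_day[i] / base;
        double efficiency = speedup / procs[i];
        printf("%2d procs: %7.1f jobs/day, speedup %5.1f, efficiency %3.0f%%\n",
               procs[i], jobs_day[i], speedup, 100.0 * efficiency);
    }
    return 0;
}
```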


Slide 15 (1 June 2007)

InfiniBand as a high-performance interconnect
• Performance
  • High bandwidth
    • Now: 10 Gb/s with a 4X SDR link, 20 Gb/s with a 4X DDR link
    • Future: QDR is expected in the 2008 timeframe
  • Very low latency
    • < 4 usec MPI ping-pong with Mellanox technology and the OFED stack (a measurement sketch follows this list)
  • Very low CPU usage during message passing
    • ~ 10%
• Scalability
  • Thousands of nodes per subnet / multiple subnets
• Ease of clustering
  • Self-discovery of nodes
  • Plug and play
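The latency figure above is the kind of number a two-rank MPI ping-pong measures; a minimal sketch is shown below, assuming any MPI implementation (for example HP-MPI or Open MPI) launched with two ranks placed on different blades.

```c
/* Minimal MPI ping-pong latency sketch: run with exactly 2 ranks,
 * e.g. "mpirun -np 2 ./pingpong". Half the round-trip time approximates
 * the one-way small-message latency. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size != 2) MPI_Abort(MPI_COMM_WORLD, 1);

    const int iters = 10000;
    char byte = 0;
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else {
            MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("one-way latency: %.2f usec\n", (t1 - t0) / (2.0 * iters) * 1e6);
    MPI_Finalize();
    return 0;
}
```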

Slide 16 (1 June 2007)

Ethernet layered communication stack


Slide 17 (1 June 2007)

IB layered communication stack
[Diagram: on each host, applications sit on MPI (HP-MPI), which performs message passing over a queue pair (QP) with send and receive queues. Work queue entries (WQEs) are handed to the channel adapter's transport layer, completions come back as CQEs, and IBA operations (Send/Recv, RDMA) travel as IBA packets between the ports across the IB fabric.]
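The queue pairs, work queue entries and completion queue entries in the diagram correspond directly to objects in the verbs API shipped with OFED; the sketch below, assuming libibverbs is installed and an HCA is present, merely creates the local objects (protection domain, completion queue, RC queue pair, registered buffer) and omits the out-of-band exchange needed to connect the QP to a remote peer.

```c
/* Create the local IB verbs objects from the diagram: device context,
 * protection domain, completion queue, RC queue pair and a registered
 * buffer. Connection setup (exchanging LIDs/QPNs and moving the QP to
 * RTR/RTS) is omitted. Build with: gcc verbs_objects.c -libverbs */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) { fprintf(stderr, "no IB device\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    if (!ctx) { fprintf(stderr, "cannot open device\n"); return 1; }
    struct ibv_pd *pd = ibv_alloc_pd(ctx);
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

    struct ibv_qp_init_attr attr = {
        .send_cq = cq, .recv_cq = cq,
        .cap     = { .max_send_wr = 16, .max_recv_wr = 16,
                     .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,   /* reliable connection, as MPI typically uses */
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &attr);

    char *buf = malloc(4096);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, 4096, IBV_ACCESS_LOCAL_WRITE);

    printf("QP number 0x%x, lkey 0x%x\n", qp->qp_num, mr->lkey);

    ibv_dereg_mr(mr); free(buf);
    ibv_destroy_qp(qp); ibv_destroy_cq(cq);
    ibv_dealloc_pd(pd); ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```

In a real transport, the next step would be to exchange LIDs and QP numbers with the peer (MPI does this internally) and post send/receive work requests whose completions appear on the CQ.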

Slide 18 (1 June 2007)

Subnet Management
• Each subnet must have a Subnet Manager (SM)
• Every entity (CA, switch, router) must support a Subnet Management Agent (SMA)
• The SM performs topology discovery, FDB (forwarding database) initialization and fabric maintenance
• Initialization uses directed-route MADs (LID route / directed route vector / LID route); MADs use unreliable datagrams
• Multipathing: the LMC gives a port multiple LIDs (e.g. LMC = 1, LIDs 6 and 7), letting the SM program multiple routes to the same port (see the arithmetic sketch below)

[Diagram: hosts with HCAs and I/O devices with TCAs attached to a fabric of IB switches; every port runs an SMA, one SM is active and several standby SMs are shown, and the LID assignments (LID = 6, LID = 7; LMC: 1 gives LIDs 6,7) illustrate LMC-based multipathing.]
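The LMC value in the multipathing example controls how many LIDs a port answers to: with base LID L and LID mask control LMC, the port is addressable at the 2^LMC consecutive LIDs starting at L, which is what gives the subnet manager several distinct routes to the same port. A small sketch of that arithmetic:

```c
/* LID range implied by a base LID and an LMC value: a port responds to
 * 2^LMC consecutive LIDs starting at its base LID (LMC = 1, base LID 6
 * gives LIDs 6 and 7, as in the diagram above). */
#include <stdio.h>

int main(void) {
    unsigned base_lid = 6, lmc = 1;   /* values from the example */
    unsigned count = 1u << lmc;

    printf("base LID %u, LMC %u -> LIDs %u..%u (%u paths)\n",
           base_lid, lmc, base_lid, base_lid + count - 1, count);
    return 0;
}
```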


Slide 19 (1 June 2007)

HP 4X DDR IB Mezz HCA for c-Class
• Based on Mellanox 4X DDR technology
  • 20 Gb/s in each direction
• PCI Express interface
  • Mezz HCA #1 sits on PCIe x4; Mezz HCA #2 sits on PCIe x8
  • Plug into PCIe mezzanine slot 2 (PCIe x8) for best performance
  • Multiple mezz HCAs per blade are supported*
• Software support options
  • Voltaire GridStack (including OFED)
    • Standard Cluster Platform configurations
  • Cisco IB software stack (proprietary & OFED)
  • OpenFabrics Enterprise Distribution (OFED without HP support)
    • For customers who are familiar with open source
    • Support via the OpenFabrics Alliance

* Note that the standard Cluster Platform configurations support one (1) HCA per server blade; multiple HCAs are supported via customer configuration.
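The advice to use mezzanine slot 2 follows from link arithmetic: both 4X DDR InfiniBand and PCI Express 1.x use 8b/10b encoding, so a x4 slot can carry only about half of what a DDR link can deliver. A back-of-the-envelope sketch (encoding overhead only; protocol overheads are ignored):

```c
/* Why the IB DDR mezzanine HCA wants the PCIe x8 slot: data rates after
 * 8b/10b encoding (protocol overheads ignored). */
#include <stdio.h>

int main(void) {
    double enc       = 8.0 / 10.0;          /* 8b/10b encoding        */
    double ib_4x_ddr = 4 * 5.0 * enc;       /* 4 lanes x 5 Gb/s       */
    double pcie_x4   = 4 * 2.5 * enc;       /* PCIe 1.x, 2.5 Gb/s/lane */
    double pcie_x8   = 8 * 2.5 * enc;

    printf("IB 4X DDR payload rate: %.1f Gb/s per direction\n", ib_4x_ddr);
    printf("PCIe 1.x x4 payload   : %.1f Gb/s per direction\n", pcie_x4);
    printf("PCIe 1.x x8 payload   : %.1f Gb/s per direction\n", pcie_x8);
    return 0;
}
```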

Slide 20 (1 June 2007)

HP 4X DDR IB Switch Module
• Fully non-blocking switch
  • 16 downlinks & 8 uplinks
  • Unmanaged: requires an external subnet manager to establish the IB fabric
  • Multiple switch modules supported per enclosure*
  • IB switch in bays 5 & 6 (Mezz 2); IB switch in bays 7 & 8 (Mezz 3, for full-height blades)
• Subnet Manager (SM) options
  • Voltaire GridVision internally managed switch family
    • Runs on a rack-mount switch with an embedded SM
    • Standard Cluster Platform configurations
  • Voltaire GridVision BladeFM
    • Runs on a server, for one or two enclosures
    • Available by exception request; expected to launch in May 2007
  • Cisco SFS switch
    • SFS 7000D (24-port switch with embedded SM)
  • OpenSM (not recommended for production use)
    • Runs on a server
    • Support via the OpenFabrics Alliance (not HP)

* Note that the standard Cluster Platform configurations support one (1) switch per enclosure; multiple switches are supported via customer configuration.


Slide 21 (1 June 2007)

Small configuration example with c-Class
• Up to a 32-node cluster configuration (2 switch hops)
• Two c7000 enclosures, each with 16 BL460c blades (one HCA each) and a DDR IB switch module
• The two switch modules are cabled directly to each other over their 8 uplink ports
• Host-based SM: OpenSM or Voltaire GridVision BladeFM

Note: other Ethernet networks are not drawn in this diagram.

Slide 22 (1 June 2007)

Single rack example with c-Class
• Up to a 48-node cluster configuration: three c7000 enclosures (16 nodes each), whose IB switch modules act as leaf-level switches
• Two external 24-port DDR IB switches act as spine-level switches; each enclosure runs 4 uplinks to each spine switch
• Subnet manager runs on a switch
• Fabric redundancy
• Max switch hops: 3

Note: Other Ethernet networks are not drawn in this diagram.


Slide 23 (1 June 2007)

Multi-rack configuration example with c-Class
• 256-node cluster configuration with 8 external 24-port DDR IB switches
• 16 c7000 enclosures (16 nodes each); each enclosure's switch module runs one uplink to each of the 8 spine switches
• Subnet manager runs on a switch
• Fabric redundancy
• Max switch hops: 3
• The port arithmetic behind these counts is sketched after this slide

Note: other Ethernet networks are not drawn in this diagram.
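The switch counts in these examples come from simple port arithmetic: 16 enclosures with 8 uplinks each need 128 cables, which eight external 24-port switches terminate while keeping the fabric at the 2:1 oversubscription of the enclosure module itself. The sketch below redoes that count with the figures from this slide.

```c
/* Port arithmetic for the 256-node example: enclosure switch modules with
 * 16 downlinks / 8 uplinks, external 24-port spine switches. */
#include <stdio.h>

int main(void) {
    const int nodes                = 256;
    const int blades_per_enclosure = 16;  /* downlinks per switch module */
    const int uplinks_per_module   = 8;
    const int spine_ports          = 24;

    int enclosures = (nodes + blades_per_enclosure - 1) / blades_per_enclosure;
    int uplinks    = enclosures * uplinks_per_module;         /* IB cables */
    int spines_min = (uplinks + spine_ports - 1) / spine_ports;

    printf("%d enclosures, %d uplink cables\n", enclosures, uplinks);
    printf("at least %d 24-port spine switches needed; the example uses %d\n",
           spines_min, uplinks_per_module);
    printf("oversubscription at the enclosure: %d:%d = %.1f:1\n",
           blades_per_enclosure, uplinks_per_module,
           (double)blades_per_enclosure / uplinks_per_module);
    return 0;
}
```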

Slide 24 (1 June 2007)

Multiple 24-port switches or a larger switch? (256-node example)

                                Eight 24-port switches            One 288-port switch
Rack space for IB switches      8U                                14(15)U
IB cables                       128                               128
IB switch & cable list price    $203K                             $384K
  (2:1 oversubscription)
Subnet manager redundancy       Two internally managed switches   Two management boards
Max switch hops                 3                                 5

[Diagrams: on the left, 16 c7000 enclosures (16 nodes each) with c-Class IB switch modules, each uplinked once to every one of eight external 24-port switches; on the right, the same enclosures uplinked with 8 cables each to a single 288-port switch (max internal switch hops: 3), which also hosts the SM.]

Recommendation: for a cluster of fewer than 24 enclosures, use multiple 24-port switches; they cost less and need fewer switch hops.


Slide 25 (1 June 2007)

Scaling clusters with a larger switch
• 512-node cluster configuration with a single 288-port switch (up to 5 switch hops)
• 32 c7000 enclosures (16 nodes each), each uplinked with 8 cables to the switch (8 x 36 = 288 ports)
• Subnet manager runs on the switch

Note: other Ethernet networks are not drawn in this diagram.

Slide 26 (1 June 2007)

HP's Fully Integrated HPC Clusters
• c-Class building blocks: 16 (HH) or 8 (FH) blade servers per enclosure, IB DDR mezz HCAs in the PCIe x8 slot, IB DDR switch modules on the 20 Gb/s backplane, uplinked to managed IB switches
• HP Unified Cluster Portfolio
  • HP innovation, choice, performance
  • HP Cluster Platforms
  • HP Scalable File Share (SFS)
  • HP Scalable Visualization Array (SVA)
  • HP XC Linux cluster management
  • HP worldwide service and support


Slide 27 (1 June 2007)

Summary
• Fastest industry-standard processors & fastest interconnect: optimal performance for your workloads
• Innovative power and cooling; comprehensive management: significantly lower operating and maintenance cost
• Infrastructure headroom for the next 5+ years: investment protection

Slide 28 (1 June 2007)

CP3000 & 4000BL
• Member of the HP Unified Cluster Portfolio (UCP)
  • Up to 512 nodes standard (more by request)
  • BL460c, BL480c, BL465c, BL685c compute nodes
  • DL3xx or DL1xx control and utility nodes
  • Gigabit Ethernet or InfiniBand interconnect
  • OS: RHEL4/3, SLES10/9, Windows CCS
  • Cluster management choice
    • Insight Control, CMU, XC for Linux
    • Windows CCS
• Benefits
  • Designed with decades of HPC experience
  • Built with HP Factory Express
  • Turn-key solution with integrated HW & SW
  • Warranty and support by HP