© 2006 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice.
Blades in HPC
Henry Strauss, Strategic Technical Consultant, High Performance Computing
IT Symposium, HP User Society DECUS
Nuremberg, April 18th, 2007
HPC trends and challenges
Trends in HPC
• clustering now mainstream
• rapid increase in scale
• growing cost of facilities and system administration
Customer challenges
• high(est) performance of
  • CPUs
  • memory
  • disks
  • network
• best price/performance
  • acquisition costs
  • running costs (staff, space, power, cooling)
• "soft" factors
  • ease of use
  • manageability
• architecture flexibility
• infrastructure complexity
• TCO!
A quote from Chris Willard, Research Vice President at IDC:
“… cluster type systems have become the preferred architecture for HPC, … That said, users also report challenges to clustering in such areas as system complexity and management, and physical system support.”
Why are blades so "hot"?
BladeSystem c-Class meets the HPC challenges
• Performance
  • Broadest choice of fastest processors
  • Fastest interconnect in the industry
• Environment
  • HP Thermal Logic: innovative power & cooling technologies
• Management
  • Insight Control and Virtual Connect: comprehensive management
• Total cost of ownership
  • Lowering CapEx and OpEx
  • Infrastructure headroom for investment protection
c7000 Enclosure: Front View
• Server blades: 2x features, 2x the density
• Storage blades: a new paradigm for "bladed" storage solutions
• Integrated power: simplified configuration and greater efficiency; same flexibility, capacity and redundancy
• Onboard Administrator: HP Insight Display; simple set-up delivered out of the box
• 10U, 8-16 blades
c7000 Enclosure: Rear View
• Interconnect bays: 8 bays; up to 4 redundant I/O fabrics; up to 94% reduction in cables; Ethernet, Fibre Channel, iSCSI, SAS, IB
• Active Cool fans: adaptive flow for maximum power efficiency, air movement & acoustics
• Onboard Administrator: remote administration view; robust, multi-enclosure control
• Power management: choice of single-phase or three-phase enclosures; AC redundant mode or power supply redundant mode; best performance per watt
• PARSEC architecture: parallel, redundant and scalable cooling and airflow design
From rack-mount to blade
BladeSystem Advantage (example configuration: 256-node cluster w/ InfiniBand)
• Power: 32% saving
• Floor space: from 8 racks to 5 racks
• Network cables: up to 78% less
• And excellent manageability!
Cluster Cabling: 1U server vs. c-Class
Figure: Ethernet, power, and InfiniBand cabling for a cluster built with 1U servers compared with a cluster built with c-Class.
Deliver tangible savings to business
Chart: capital expenses and operational costs, conventional IT vs. bladed, broken down into HP servers & storage, people, and third-party data center (example: 320 servers over 3 years).
Savings called out on the chart range from up to 38% to up to 96% across servers/storage/racks & networking, initial system setup time, and power & cooling, cable installation, and datacenter space (up to 50% in between).
HPC Clusters Comparison
• Facility saving alone pays for the small price premium
• Other savings in support/management contribute to a lower TCO w/ blade

Configuration: 96 + 1 nodes

List Prices                     DL140 G3     BL460c       DL145 G3     BL465c
Head w/ TFT                     $9,538       $9,538       $8,388       $8,358
Servers & Blade Enclosure       $585,024     $669,258     $449,664     $558,858
Network Infrastructure          $27,394      $11,821      $29,895      $11,821
InfiniBand Interconnect         $203,475     $156,308     $203,475     $156,308
Racks, power cables, PDUs       $15,311      $15,086      $15,339      $15,086
Linux HPC 8 PK                  $13,768      $13,768      $13,768      $13,768
Integration SVC                 $23,234      $23,229      $23,234      $23,229
Total List Price                $877,744     $899,008     $743,763     $787,428
Price Premium                   --           $21,264      --           $43,665
Premium %                       --           2.4%         --           5.9%

Facility cost
Space (3 years)                 $3,154       $2,365       $3,154       $2,365
Power Cost (3 yr)               $95,659      $74,431      $79,014      $57,873
Cooling Cost (3 yr)             $47,830      $37,215      $39,507      $28,936
Total Facility Cost (3 yr)      $146,643     $114,011     $121,675     $89,174
Facility Savings                --           $32,631      --           $32,501
Savings %                       --           22.3%        --           26.7%
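For orientation (my reading of the table, not stated on the slide): the percentages are taken against the corresponding rack-mount column, e.g. the BL460c premium is $21,264 / $877,744 ≈ 2.4%, and its facility savings are $32,631 / $146,643 ≈ 22.3%.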
HP BladeSystem c-Class Portfolio
c7000 Enclosure
Server blades and storage blades: a full range of 2P and 4P server blades with best-in-class performance, choice, and reliability for Windows, Linux, and HP-UX applications.
Add interconnects to connect to LAN, SAN, and scale-out clusters: Virtual Connect, LAN interconnects (Ethernet NICs), SAN interconnects (Fibre Channel HBAs), and InfiniBand 4X DDR.
Best-in-class industry interconnects: IP & FC, IP, Fibre Channel, IB.
Blade clusters are not "flat"
• enclosure "boundaries" imply a hierarchy => additional considerations
  • non-linearly increasing bisection bandwidth (see the example after this list)
  • balance?!
• BUT: "classical" clusters aren't flat either
  • non-uniform network distances (hierarchy of switch modules)
  • XC includes administrative subclusters (per rack)
• AND: the compute node itself introduces a hierarchy
  • i.e. a node is an SMP with typically 2-8 cores with fundamentally different "interconnect characteristics"
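A concrete instance of the non-linear bisection bandwidth (my example, using the c-Class DDR switch module described later, with 16 downlinks and 8 uplinks): the 16 blades inside one enclosure communicate through a non-blocking switch module, but traffic crossing the enclosure boundary shares at most 8 uplinks, so cross-enclosure bandwidth per node is at best half the within-enclosure figure; placement and balance therefore matter.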
More on IB support and Configuration Examples
Why does the interconnect matter?
Fluent performance study: IB versus GigE
Chart: number of jobs per day vs. number of processors (2 to 16, on 2, 4, and 8 nodes) for GigE, Mellanox InfiniBand, and ideal scaling; 3.6M cell model on 1 to 16 cores.
• Near-linear speedup observed with IB
• GigE does not scale beyond small clusters
InfiniBand as a high performance interconnect
• Performance
  • High bandwidth: now 10 Gb/s w/ 4X SDR link, 20 Gb/s w/ 4X DDR link; QDR is expected in the 2008 timeframe
  • Very low latency: < 4 µsec MPI ping-pong w/ Mellanox technology and OFED stack (see the sketch after this list)
  • Very low CPU usage during message passing: ~10%
• Scalability: thousands of nodes per subnet / multiple subnets
• Ease of clustering: self-discovery of nodes; plug and play
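Such latency numbers come from a ping-pong microbenchmark: rank 0 sends a small message to rank 1, rank 1 echoes it back, and half the averaged round-trip time is reported. A minimal sketch of that idea (my illustration, not HP's benchmark code; usable with HP-MPI or any other MPI implementation):

```c
/* Minimal MPI ping-pong latency sketch: build with mpicc, run with 2 ranks. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, i, iters = 1000;
    char buf[8];                 /* tiny message, so the timing is latency-dominated */
    double t0, t1;
    MPI_Status st;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    t0 = MPI_Wtime();
    for (i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, 8, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &st);
        } else if (rank == 1) {
            MPI_Recv(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &st);
            MPI_Send(buf, 8, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    t1 = MPI_Wtime();

    if (rank == 0)
        printf("avg one-way latency: %.2f usec\n",
               (t1 - t0) * 1e6 / (2.0 * iters));

    MPI_Finalize();
    return 0;
}
```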
Ethernet layered communication stack
IB layered communication stack
Diagram: on each host, applications sit on top of MPI (HP-MPI), which drives a queue pair (send and receive queues, QP) on the channel adapter; work queue entries (WQEs) are posted to the QP and completions are reported as CQEs; the transport layer carries IBA operations (Send/Recv, RDMA) as IBA packets between ports across the IB fabric, giving end-to-end message passing. A verbs-level sketch of this posting/completion flow follows below.
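In code, this work-queue model is what OFED's verbs API exposes: the application builds a work request (which becomes a WQE on the QP's send queue) and later reaps a completion (CQE) from the CQ. A minimal sketch of that single step, assuming the QP, CQ, and registered memory region were created and connected elsewhere (setup, QP state transitions, and address exchange are omitted; names are illustrative):

```c
/* Post one send work request (WQE) on an already-connected QP and wait for its
 * completion (CQE). Creating qp/cq/mr and moving the QP to RTS is assumed to
 * have been done elsewhere; this only shows the post/poll step. */
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

static int send_one_message(struct ibv_qp *qp, struct ibv_cq *cq,
                            struct ibv_mr *mr, void *buf, uint32_t len)
{
    struct ibv_sge sge;
    struct ibv_send_wr wr, *bad_wr = NULL;
    struct ibv_wc wc;
    int n;

    memset(&sge, 0, sizeof(sge));
    sge.addr   = (uintptr_t)buf;          /* buffer inside the registered MR */
    sge.length = len;
    sge.lkey   = mr->lkey;

    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = 1;
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;          /* two-sided send; RDMA write/read also exist */
    wr.send_flags = IBV_SEND_SIGNALED;    /* ask for a CQE when the WQE completes */

    if (ibv_post_send(qp, &wr, &bad_wr))  /* enqueue the WQE on the send queue */
        return -1;

    do {                                  /* poll the completion queue for the CQE */
        n = ibv_poll_cq(cq, 1, &wc);
    } while (n == 0);

    return (n == 1 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```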
Subnet Management
Diagram: an IB subnet with a host (CPU, system memory, HCA), TCAs, and IB switches; a Subnet Manager (SM) plus standby SMs; and a Subnet Management Agent (SMA) on every device.
• Each subnet must have a Subnet Manager (SM); standby SMs take over if it fails
• Every entity (CA, switch, router) must support a Subnet Management Agent (SMA)
• The SM performs topology discovery, FDB initialization, and fabric maintenance
• Initialization uses directed route MADs (LID route / directed route vector / LID route); MADs use unreliable datagrams
• Multipathing: the LMC (LID Mask Control) lets a port answer to multiple LIDs, e.g. LMC = 1 gives LIDs 6 and 7 (see the snippet below)
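The LID arithmetic behind that multipathing bullet is simple: a port answers to 2^LMC consecutive LIDs starting at its base LID. A tiny sketch (an illustrative helper, not from the slides):

```c
/* Print the LID range a port answers to, given its base LID and LMC value.
 * base_lid = 6 with lmc = 1 reproduces the LIDs 6 and 7 in the diagram. */
#include <stdio.h>

static void print_lid_range(unsigned base_lid, unsigned lmc)
{
    unsigned count = 1u << lmc;   /* 2^LMC LIDs per port */
    printf("LIDs %u..%u (%u path(s))\n", base_lid, base_lid + count - 1, count);
}

int main(void)
{
    print_lid_range(6, 0);   /* LMC = 0: the single LID 6 */
    print_lid_range(6, 1);   /* LMC = 1: LIDs 6 and 7, two paths */
    return 0;
}
```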
HP 4X DDR IB Mezz HCA for c-Class
• Based on Mellanox 4X DDR technology: 20 Gb/s each direction
• PCI-Express interface
  • Plug into PCI-E mezz 2 (PCI-E x8) for best performance (Mezz HCA #1: PCIe x4; Mezz HCA #2: PCIe x8)
  • Supports multiple mezz HCAs per blade* (a quick verbs-level visibility check follows below)
• Software support options
  • Voltaire GridStack (including OFED): standard Cluster Platform configurations
  • Cisco IB software stack (proprietary & OFED)
  • OpenFabrics Enterprise Distribution (OFED w/o HP support): for customers who are familiar with open source; support via the OpenFabrics Alliance
* The standard Cluster Platform configs support one (1) HCA per server blade; multiple HCAs are supported via customer configuration.
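A quick way to confirm that the mezzanine HCA(s) are actually visible to the OFED/verbs stack on a blade is to enumerate the IB devices. A minimal sketch (assumes libibverbs from OFED is installed; not part of the original slides):

```c
/* List the InfiniBand devices (HCAs) the verbs stack can see.
 * Build with: gcc list_hcas.c -o list_hcas -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0, i;
    struct ibv_device **devs = ibv_get_device_list(&num);

    if (!devs) {
        perror("ibv_get_device_list");
        return 1;
    }
    printf("%d IB device(s) found\n", num);
    for (i = 0; i < num; i++)
        printf("  %s\n", ibv_get_device_name(devs[i]));  /* e.g. mthca0, mthca1 */

    ibv_free_device_list(devs);
    return 0;
}
```

With two mezz HCAs installed, two devices should be listed.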
HP 4X DDR IB Switch Module
• Fully non-blocking switch: 16 downlinks & 8 uplinks
• Unmanaged: requires an external Subnet Manager to establish the IB fabric (see the port-state check below)
• Multiple switches supported per enclosure*: IB switch in bays 5 & 6 (Mezz 2), IB switch in bays 7 & 8 (Mezz 3, for full-height blades)
• Subnet Manager (SM) options
  • Voltaire GridVision internally managed switch family: SM embedded in a rack-mount switch; standard CP configurations
  • Voltaire GridVision BladeFM: runs on a server, for one or two enclosures; available by exception request; expected to launch in May 07
  • Cisco SFS switch: SFS 7000D (24-port switch w/ embedded SM)
  • OpenSM (not recommended for production use): runs on a server; support via the OpenFabrics Alliance (not HP)
* The standard Cluster Platform configs support one (1) switch per enclosure; multiple switches are supported via customer configuration.
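Because the switch module carries no SM of its own, a simple way to verify that one of the SMs above is managing the fabric is to check that a blade's HCA port has reached the ACTIVE state and holds a LID. A minimal sketch with libibverbs (device and port numbers are illustrative):

```c
/* Query port 1 of the first HCA: an ACTIVE state and a non-zero LID mean a
 * Subnet Manager has configured the fabric. Build with: gcc port_check.c -libverbs */
#include <stdio.h>
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    struct ibv_context *ctx;
    struct ibv_port_attr attr;

    if (!devs || num == 0) {
        fprintf(stderr, "no IB devices found\n");
        return 1;
    }
    ctx = ibv_open_device(devs[0]);
    if (!ctx || ibv_query_port(ctx, 1, &attr)) {
        fprintf(stderr, "cannot query port 1\n");
        return 1;
    }
    printf("port state: %s, LID: %u, SM LID: %u\n",
           attr.state == IBV_PORT_ACTIVE ? "ACTIVE" : "not active",
           attr.lid, attr.sm_lid);

    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```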
Small configuration example with c-Class
Diagram: two c7000 enclosures, each with 16 BL460c (one HCA each) and a DDR IB switch module; the two switch modules are linked via their 8 uplinks each.
• Up to 32-node cluster configuration (2 switch hops)
• Host-based SM: OpenSM or Voltaire GridVision BladeFM
Note: other Ethernet networks are not drawn in this diagram.
Single rack example with c-Class
Diagram: three c7000 enclosures (16 blades each) whose leaf-level switch modules uplink (4 links to each spine) to two external 24-port DDR IB switches at the spine level.
• Up to 48-node cluster configuration
• Subnet manager runs on switch
• Fabric redundancy
• Max switch hops: 3
Note: other Ethernet networks are not drawn in this diagram.
Multi-rack configuration example with c-Class
• 256-node cluster configuration w/ 8 24-port switches
• Subnet manager runs on switch
• Fabric redundancy
• Max switch hops: 3
Diagram: 16 c7000 enclosures (16 blades each, numbered 1 to 16), each leaf-level switch module uplinked to the eight external 24-port DDR IB switches.
Note: other Ethernet networks are not drawn in this diagram.
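Reading the numbers of this example together (my arithmetic, not spelled out on the slide): with 16 downlinks and 8 uplinks per enclosure switch module, 16 enclosures × 16 blades = 256 nodes and 16 enclosures × 8 uplinks = 128 inter-switch cables; spread evenly across the 8 external 24-port switches that is 16 ports used per spine switch, and the 16:8 downlink-to-uplink ratio is the 2:1 oversubscription cited on the next slide.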
Multiple 24-port switches or a larger switch?

                                                      One 288-port switch       Eight 24-port switches
Max switch hops                                       5                         3
Subnet manager redundancy                             Two management boards     Two internally managed switches
IB switch & cable list price (2:1 oversubscription)   $384K                     $203K
IB cables                                             128                       128
Rack space for IB switches                            14(15)U                   8U
Diagram: the same 256-node example built two ways: 16 c7000 enclosures (16 blades each, with the c-Class IB switch module) uplinked either to eight external 24-port switches (SW #1 to #8) or to a single 288-port switch running the SM (max internal switch hops: 3).
Recommendation: for a cluster of fewer than 24 enclosures, use multiple 24-port switches, because they cost less and need fewer switch hops.
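The hop counts follow from the two topologies: with external 24-port spine switches the worst-case path is leaf module, spine switch, leaf module, i.e. 3 switch hops; a 288-port chassis is internally a multi-stage fabric of smaller switch elements (up to 3 internal hops, per the diagram above), so the worst-case path becomes leaf module, 3 internal hops, leaf module, i.e. 5 hops.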
Scaling clusters with larger switch
Diagram: 32 c7000 enclosures (16 blades each, numbered 1 to 32), each switch module uplinked via its 8 uplinks to a single 288-port switch, which also runs the SM.
• 512-node cluster configuration with single 288-port switch (up to 5 switch hops)
Note: other Ethernet networks are not drawn in this diagram.
HP's Fully Integrated HPC Clusters
Diagram: c-Class enclosure with 16 (HH) or 8 (FH) blade servers, an IB DDR mezz HCA in the PCIe x8 slot of each blade, IB DDR switch modules, a 20 Gb/s backplane, and external managed IB switches.
HP Unified Cluster Portfolio
• HP Innovation, Choice, Performance
• HP Cluster Platforms
• HP Scalable File Share (SFS)
• HP Scalable Visualization (SVA)
• HP XC Linux Cluster Management
• HP Worldwide Service and Support
Summary
• Infrastructure headroom for the next 5+ years: investment protection
• Fastest industry standard processors & fastest interconnect: optimal performance for your workloads
• Innovative power and cooling; comprehensive management: significantly lower operating and maintenance cost
CP3000 & 4000BL
• Member of HP Unified Cluster Portfolio (UCP)
  • Up to 512 nodes standard (more by request)
  • BL460c, BL480c, BL465c, BL685c compute nodes
  • DL3xx or DL1xx control and utility nodes
  • Gigabit Ethernet or InfiniBand interconnect
  • OS: RHEL4/3, SLES10/9, Windows CCS
  • Cluster management choice: Insight Control, CMU, XC for Linux, or Windows CCS
• Benefits
  • Designed with decades of HPC experience
  • Built with HP Factory Express
  • Turn-key solution with integrated HW & SW
  • Warranty and support by HP