TRANSCRIPT
Scaling to Petaflop
Ola Torudbakken, Distinguished Engineer
Sun Microsystems, Inc.
BOF, ISC 2008
HPC Market Growth Is Strong
• Market in 2007 doubled from 2003 (Source: IDC 2007)
• CAGR increased from 9.2% (2006) to 15.5% (2007)
IDC Server Data (2007)
Year                  2003      2004      2005      2006      CAGR
Total CPU Shipped     8,663M    10,135M   11,713M   12,779M   13.8%
HPC CPU Shipped       1,005M    1,658M    2,643M    3,333M    49.1%
Non-HPC CPUs          7,658M    8,477M    9,070M    9,446M    7.2%
HPC % of Total CPU    11.6%     16.4%     22.6%     26.1%     31.0%
Technical Computing is the Server Growth Engine
Market Observations
• HPC is now 25% of all CPU units shipped
  - Projected to grow to 33% by 2010
  - Key growth opportunity for the entire industry
• Clustered systems are now 65% of the HPC market
  - IB is the fastest-growing interconnect
• Consolidation
  - Driven by energy, infrastructure, and real-estate management
  - Server virtualization is becoming increasingly popular
  - Fabric convergence through FCoIB/FCoE
• Management
  - Cluster managed as a single machine
PetaScale Fabric Requirements
• Fabric performance is critical to scalability
  - High bandwidth (0.1 B/F or better)
  - Low latency (< 1-2 usec unidirectional ping)
  - Low overhead (more cycles left for computation)
  - High messaging rate (N/2)
  - Ability to handle fabric congestion
  - Multi-path routing
• Host adaptor must support multi-threading
  - Otherwise performance does not scale with cores
  - Must support multiple outstanding sends/receives and optimized collectives to scale performance
  - Support legacy (MPI, sockets) and emerging PGAS-style programming models
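As a quick sanity check on the 0.1 B/F (bytes-per-FLOP) target above: the ratio is simply injection bandwidth over peak compute rate. The node figures below are illustrative assumptions, not from the slides — a 4x DDR InfiniBand link delivers roughly 2 GB/s of data bandwidth, and a hypothetical dual-socket quad-core node is taken at ~19.2 GFLOPS peak.

```python
# Sketch: checking the 0.1 B/F rule of thumb for a hypothetical node.

def bytes_per_flop(link_gbytes_s: float, peak_gflops: float) -> float:
    """Network injection bandwidth divided by peak compute rate."""
    return link_gbytes_s / peak_gflops

# Assumed figures: 4x DDR IB link ~2 GB/s; node peak ~19.2 GFLOPS.
ratio = bytes_per_flop(link_gbytes_s=2.0, peak_gflops=19.2)
print(f"{ratio:.3f} B/F")  # ~0.104: just meets the 0.1 B/F target
```

Doubling per-node compute without adding links halves the ratio, which is why host-side bandwidth has to track core counts.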
PetaScale Fabric Requirements (2)
• Consolidation: unified fabric for clustering and I/O
  - Loss-less reliable delivery
  - Service differentiation
  - High throughput
• High reliability to avoid interruptions
  - Extremely low undetected error rate
  - Link-level and end-to-end CRC mechanisms
  - Efficient retransmission schemes in HW
  - Path failover support
• High availability and serviceability
  - Quickly root-cause and isolate a fault, then service the component
Why InfiniBand?
• Open standard
• Open source
• High performance
• High message rate
• Low latency
• Reliable
• QoS
• Congestion control
• Cost-effective

IB excels in all areas.
IB in Cluster Storage
• Significantly better performance
  - No packet drops, plus RDMA performance
  - Demonstrated scaling to 100 GB/s I/O
• File system software exists today
  - Lustre and other parallel file systems
  - Scaling requires multiple object stores
• The same fabric can be used for MPI and I/O
  - No additional cost for a storage network
• Lots of interest in InfiniBand for storage
  - Major performance gains over Fibre Channel
• Many customers are evaluating
  - Cost-performance is a critical consideration
Fabric architecture
• Torus topologies
  - Used in BlueGene/L and Cray XT3/4
  - Each node connects to its neighbours in the X, Y, and Z directions
  - Pros: easy to build large fabrics
  - Pros: good for nearest-neighbour type applications
  - Cons: a blocking fabric with variable latency requires applications to be deployed with node-locality awareness
• Clos topologies
  - Each node has constant bandwidth
  - Pros: lowest-latency MPI communication
  - Pros: no need to consider locality
  - Cons: more difficult to construct
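The torus wiring rule above can be sketched in a few lines; the wraparound links are what keep every node, even at the grid edge, at degree six:

```python
# Sketch: neighbour computation in a 3-D torus (the topology used by
# BlueGene/L and Cray XT3/4). Each node (x, y, z) links to its +/-1
# neighbours along every axis, with modular wraparound at the edges.

def torus_neighbours(node, dims):
    x, y, z = node
    X, Y, Z = dims
    return [
        ((x + 1) % X, y, z), ((x - 1) % X, y, z),
        (x, (y + 1) % Y, z), (x, (y - 1) % Y, z),
        (x, y, (z + 1) % Z), (x, y, (z - 1) % Z),
    ]

# A corner node of an 8x8x8 torus still has six neighbours:
print(torus_neighbours((0, 0, 0), (8, 8, 8)))
```

Latency between two nodes grows with their hop distance on this grid, which is exactly the locality sensitivity the slide lists as a con.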
The Sun Magnum 3456-port IB Switch
• World's largest Clos switch
  - 3456 DDR ports
  - 110 Tbps total capacity
  - 700 ns latency (DDR)
• Major improvement in reliability
  - 6x cable reduction vs. leaf-and-core switches
  - New 12x connector and cable system
• Major improvement in manageability
  - Single centralized switch with a known topology
  - Dual redundant subnet managers
• Dual-wide rack chassis
  - Redundant power and cooling
  - 36 kW power consumption
• Line cards and fabric cards
  - 24 line cards with 144 4x ports each, realized through 48 12x connectors
  - 18 fabric cards
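The 3456-port figure falls out of folded-Clos (fat-tree) arithmetic: with radix-24 switch chips (the 24-port IB switch ASIC is assumed here as the building block), a three-stage fabric exposes 24 × 12 × 12 end ports.

```python
# Sketch: end-port count of a folded Clos (fat-tree) built from radix-r
# switch chips. Every stage below the top splits its ports half down
# (toward hosts) and half up (toward the next stage).

def fat_tree_ports(radix: int, levels: int) -> int:
    return radix * (radix // 2) ** (levels - 1)

print(fat_tree_ports(24, 3))  # 24 * 12 * 12 = 3456
```

Note that the same formula with two levels gives 24 × 12 = 288 ports, matching the 288-node cluster size discussed later.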
12x Connector and Cable
• 3x denser than the InfiniBand 4x connector
• Electrically and mechanically superior
• Supports active copper cables and optics
• Cable serial number; local/remote cable-inserted detection
• Designed for QDR
• Proposed as the next-gen IBTA 12x connector (12x CSFP)

12x cable: equivalent to 3 CX-4 cables
12x-to-CX4 splitter: breaks out to 3 CX-4 cables
3456-port Switch Comparison

Metric           Traditional   Magnum       Ratio
Switch chassis   300           1            1/300
Rack space       12 racks      dual racks   1/6
Weight           10 tons       1.6 tons     1/6
# Cables         6,912         1,152        1/6

300:1 reduction in management points
6:1 reduction in cables
6:1 reduction in weight and rack space
Order-of-magnitude improvement in reliability
Multi-Path Routing
• Optimized multi-path routing for load balancing and path redundancy
• ~90% efficiency at uniform random distributed load

1) Variable packet-size distribution: 5% 1472-byte packets, 65% 320-byte packets, and 30% 64-byte packets
2) Uniform traffic: every destination has the same probability of being chosen for every packet
3) Round-robin: every source sends x packets to its neighbor, then x packets to neighbor+1, and so forth

Simula Research Center
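The round-robin pattern in note 3 can be written down precisely; the burst size x and node count below are arbitrary illustration values, not from the simulation:

```python
# Sketch of the "round-robin" traffic pattern: each source sends x packets
# to its neighbour, then x packets to neighbour+1, and so on, wrapping
# around the node ring.

def round_robin_dest(src: int, packet_index: int, x: int, n_nodes: int) -> int:
    """Destination of the packet_index-th packet emitted by src."""
    step = packet_index // x          # how many neighbour shifts so far
    return (src + 1 + step) % n_nodes

# Source 0 in a 4-node system, bursts of x=2 packets:
dests = [round_robin_dest(0, i, x=2, n_nodes=4) for i in range(6)]
print(dests)  # [1, 1, 2, 2, 3, 3]
```

Patterns like this stress a fabric differently from uniform random traffic, since many sources shift to the same destination in lockstep.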
Nano-Magnum
[Figure: M1 block diagram — 24-port switch elements linked through 12x iPASS connectors]
• An ultra-dense 72-port IB core switch
• Switch performance
  - 3 x 24 ports, SDR or DDR
  - 140 ns latency (DDR)
• Ideal for cluster configurations up to 288 nodes
  - Just 4 Nano-Magnums and C48 leaf switches
• 1RU 19" enclosure
  - Redundant power and cooling
  - 150 W power consumption
• Management
  - Embedded enclosure management
  - 100T and serial OOB connections
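A quick port count shows why four Nano-Magnums cover 288 nodes. Assuming each 48-port C48 leaf splits its ports 24 down to nodes and 24 up to the core for full bisection (the split is an assumption, not stated on the slide), the uplink total exactly matches the core port total:

```python
# Sketch: port accounting for the 288-node configuration.

nodes = 288
leaf_down = leaf_up = 24           # assumed: 48-port leaf, half down / half up
leaves = nodes // leaf_down        # 12 leaf switches
uplinks = leaves * leaf_up         # 288 uplinks toward the core
core_ports = 4 * 72                # 4 Nano-Magnums at 72 ports each
print(leaves, uplinks, core_ports)
assert uplinks == core_ports       # non-blocking: every uplink gets a core port
```

With the counts balanced, every leaf uplink terminates on a distinct core port, giving constant bandwidth per node as the Clos section promised.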
The Sun Constellation Supercomputer
Open-standards petascale supercomputer: networking, compute, storage, software, developer tools, provisioning, Grid Engine

Compute: ultra-dense blade platform
- Fastest processors: SPARC, AMD Opteron, Intel Xeon
- High-density 2S and 4S blades
- Fastest host channel adaptor

Networking: large-scale and ultra-dense IB switches
- 72- and 3456-port InfiniBand switches (Sun Datacenter Switch 3x24 and Sun Datacenter Switch 3456)
- Unrivaled cable simplification
- Most economical InfiniBand cost per port

Storage: ultra-dense storage solution
- Most economical and scalable storage system with Lustre
- Up to 48 TB in 4RU
- Direct cabling to the IB switch

Software: Linux, comprehensive software stack
- Integrated developer tools and integrated Grid Engine
- Infrastructure provisioning, monitoring, patching
- Simplified inventory management

504 TF peak performance
Constellation Rack
• Density-optimized blade rack
  - Redundant power and cooling
• 192 sockets per rack
  - 36 kW per rack
• Unibody rack design
  - Saves 480 lbs per rack
  - Less weight per socket
• InfiniBand leaf switches
  - Supports 7-stage fabrics
  - Max 13,824 nodes
• Optional heat exchanger
288-node cluster...where is the switch?
6 Constellation Racks + 4 NM + 96 12x Cables
Massive Scale from 3,456 to 13,824 Servers

Core Switches   Servers   PFLOPS
1               3,456     0.4
2               6,912     0.9
3               10,368    1.3
4               13,824    1.7
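The table scales linearly with the number of core switches. Assuming roughly 125 GFLOPS peak per server (an inferred figure, chosen because it reproduces the slide's rounded totals, not a number stated in the deck), the progression can be recomputed:

```python
# Sketch: recomputing the scaling table from an assumed ~125 GFLOPS/server.

per_server_tflops = 0.125  # assumed per-server peak (inferred, see lead-in)

for core_switches in range(1, 5):
    servers = core_switches * 3456
    pflops = servers * per_server_tflops / 1000.0
    print(f"{core_switches} core switch(es): {servers} servers, {pflops:.1f} PFLOPS")
```

Rounded to one decimal, the four totals come out as 0.4, 0.9, 1.3, and 1.7 PFLOPS, matching the slide.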
.....horns optional!