
Hotfoot HPC Cluster
March 31, 2011

Topics

• Overview
• Execute Nodes
• Manager/Submit Nodes
• NFS Server
• Storage
• Networking
• Performance

Overview - Hotfoot Pilot

• Launched May 2009

• Original Partnership
– Astronomy
– Statistics
– CUIT
– Office of the Executive Vice President for Research

Overview - Hotfoot Expansion

• Expanded March 2011
– More Nodes
– More Storage
– Changed Scheduler

• New Participant
– Social Science Computing Committee (SSCC)

Overview - Cluster Components

• 52 Execute Nodes

• 520 Total Cores

• 2 Manager Nodes

• 1 NFS Server (1 Cold Spare)

• 52 TB Storage (72 TB Raw)

Overview - Architecture

[Diagram: Hotfoot components. Manager/Submit Node 1 (Haddock) and Manager/Submit Node 2 (Mahimahi); NFS Server (Herring) with a second NFS Server (Sardine); a RAID storage array; and two blade chassis of Execute nodes.]

• Original blade chassis containing 32 Execute nodes; new blade chassis containing 24 Execute nodes.
• One Manager/Submit node is active; failover is manual.
• A second server is available to provide NFS services; it is currently not connected.
• 72 TB raw storage; approximately 52 TB usable under RAID 5.
• The NFS server provides working storage for all other systems in the cluster.

Execute Nodes

Model         Quantity   CPU Cores     Total Cores   Memory
BL2x220c G5   32         Dual 4-core   256           16 GB
BL2x220c G6   14         Dual 6-core   168           24 GB
BL2x220c G6   8          Dual 6-core   96            96 GB

Manager/Submit Nodes

• HP DL360 G5, 4 GB RAM

• Torque Resource Manager (a descendant of OpenPBS); a sample job script follows this list

• Maui Cluster Scheduler

• User Access via virtual interface (vif)

• Failover via Torque High Availability (HA)
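Since users submit work to Torque from these nodes, a short example job script may help orient readers. This is a sketch only: the job name, resource requests, and program below are illustrative assumptions, not values from the slides.

  #!/bin/sh
  # Example Torque/PBS job script (hypothetical values throughout).
  #PBS -N sample_job              # job name
  #PBS -l nodes=1:ppn=4           # request one node, four cores
  #PBS -l walltime=01:00:00       # one hour wall-clock limit
  #PBS -j oe                      # merge stdout and stderr into one file

  cd $PBS_O_WORKDIR               # start in the directory qsub was run from
  ./my_program                    # hypothetical executable

A user would submit this from the active Manager/Submit node with "qsub sample_job.sh" and check its status with "qstat".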

NFS Servers

• Primary
– HP DL360 G7
– 2 x 4 cores
– 16 GB RAM

• Backup
– HP DL360 G5
– 1 x 2 cores
– 8 GB RAM

Storage

• HP P2000 Storage Array

• 32 x 2 TB Drives

• RAID 5

• ~52 TB Usable

Networking

• Execute Nodes

– Channel-bonding mode 2 (load balancing and fault tolerance); an example configuration follows this list

– 1 Gb connection to chassis switches

– Usage records suggested this was sufficient
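On Linux, bonding mode 2 is "balance-xor": outgoing traffic is hashed across the slave links, and the bond survives the loss of either link. A minimal RHEL-style configuration sketch is below; the interface names and address are assumptions, not taken from the slides.

  # /etc/sysconfig/network-scripts/ifcfg-bond0 (hypothetical address)
  DEVICE=bond0
  IPADDR=10.0.0.10
  NETMASK=255.255.255.0
  ONBOOT=yes
  BOOTPROTO=none
  BONDING_OPTS="mode=2 miimon=100"  # balance-xor; check link state every 100 ms

  # /etc/sysconfig/network-scripts/ifcfg-eth0 (and likewise for eth1)
  DEVICE=eth0
  MASTER=bond0
  SLAVE=yes
  ONBOOT=yes
  BOOTPROTO=none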

Networking

[Graph: Sample Traffic for an Execute Node]

Networking

• Chassis

– Each chassis has four Cisco 3020 switches

– 1 Gb connection to Edge switches

– Usage records suggested this was sufficient

Networking

[Graph: Sample Traffic for a Chassis Switch]

Networking

[Image: Original Chassis, Showing Network Connections for Two Servers]

Performance

• Concern about the ability of NFS to handle I/O demands.

• Reviewed performance of pilot system.

• Ran tests on the expanded system; an example of this kind of test is sketched below.
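The slides do not name the test tool; a common quick check of NFS write throughput in this era was a large sequential write with dd. The mount point and file size below are hypothetical.

  # Write 8 GB of zeros to an NFS-mounted path; dd reports throughput on completion
  dd if=/dev/zero of=/hpc/scratch/ddtest bs=1M count=8192
  # Remove the test file afterwards
  rm /hpc/scratch/ddtest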

Performance

[Graph: Memory Usage on Old NFS Server]

Performance

[Graph: Load Average on Old NFS Server]


Questions?

• Questions?

• Comments?

• Contact: roblane@columbia.edu
