
High Performance Computing

Compute Clusters at the Chrysler Group: Battle Scars and Victory Parades

June 24, 2003

John Picklo

Manager, Mainframes and High Performance Computing

Agenda

Introductory comments

History of clusters at the Chrysler Group

Implementation issues

Lessons learned

Chrysler Group Product Development

Design and engineering for Chrysler Group vehicles

High Performance Computing

Computer Aided Engineering

Vehicle simulation

Fluid Dynamics

Impact Events

Noise, Vibration, Harshness

CAE Simulation Process

Pre-process: create meshed model; elements and properties; > 700,000 elements

Simulation: batch process; compute and memory intensive; duration of hours to days

Post-process: graph results, visualize, animate

Simulation Examples

Workload Characteristics

Multi-CPU jobs: 1-24 CPUs

Processor, memory, and I/O intensive

Not data centric

Heterogeneous systems

Multi-vendor; evolutionary introduction of technology; jobs modified to match the environment

LSF provides workload management (Impact, NVH): job scheduling interface; load leveling across systems; maximize utilization and minimize queue time
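
In practice, a job reaches LSF through a short submission wrapper. A minimal sketch, assuming LSF's standard bsub options; the queue name, solver command, and input-deck flag are illustrative placeholders, not the production configuration:

```python
# Submit a multi-CPU CAE job through LSF's bsub; the "impact" queue and the
# "solver -i" invocation are hypothetical placeholders.
import subprocess

def submit_cae_job(input_deck, ncpus=8, queue="impact"):
    cmd = [
        "bsub",
        "-n", str(ncpus),           # request this many slots; LSF load-levels them
        "-q", queue,                # queue per workload class (Impact, NVH)
        "-o", input_deck + ".log",  # capture solver output
        "solver", "-i", input_deck, # hypothetical solver command line
    ]
    subprocess.run(cmd, check=True)

submit_cae_job("crash_model.inp", ncpus=12)
```

LSF then places the requested slots on the least-loaded eligible hosts, which is what keeps utilization up and queue time down.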

HPC Goals

Reduce simulation cycle time: shorten the design timeline and evaluate more design alternatives

Reduce costs: engineer productivity, physical tests, and the cost of computing

Increased accuracy: improved vehicles

HPC Systems

HP: 6 Superdomes (352 CPUs)

96-node Itanium cluster (192 CPUs)

SGI: 12 Origin 300 (384 CPUs), 4 Origin 3000 (256 CPUs), 2 Origin 2000 (128 CPUs)

IBM: 172-node Pentium cluster (344 CPUs)

64-node Pentium cluster (128 CPUs)

Cluster History

Chrysler Group's First Cluster

Installed 1998; first entry into MPI and clusters; retired 2003

HP N-Class, 56 compute nodes

Impact and NVH

4-8 CPUs per node

400 MHz PA-8500

8-12 CPUs per job

Gigabit and HyperFabric backbone

11 TB scratch disk

5 front-end L-Class data servers with fail-over software

Impact Pentium Cluster

IBM IntelliStations: 172 nodes, 344 CPUs

Xeon 2.2 GHz and 2.8 GHz

Gigabit internal network

12 nodes per switch

Red Hat Linux 7.2

Installed July 2002

Management node with xCAT software (see the sketch below)

Storage node with 1.8 TB shared disk
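
The management node's role is easiest to see with a concrete check. A minimal sketch, assuming xCAT's rpower and psh commands are installed on the management node; the node-range name "compute" is an assumption, not the real configuration:

```python
# Survey the cluster from the management node using xCAT tools.
import subprocess

def survey_cluster(noderange="compute"):
    # report the power state of every node in the range
    subprocess.run(["rpower", noderange, "stat"], check=True)
    # run a command in parallel on every node; here, load average and uptime
    subprocess.run(["psh", noderange, "uptime"], check=True)

survey_cluster()
```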

Itanium NVH Cluster

96 nodes, 192 CPUs

HP zx6000

HP-UX 11.22

MSC/Nastran

675 GB shared storage

288 GB scratch space per node

Gigabit internal network

CFD Hybrid Clusters

2 SGI SMP front-end systems: geographically separated for disaster recovery purposes; 64 CPUs and 64 GB each; 64-bit applications, job scheduling, and file serving

Pentium compute cluster behind each front-end

32 nodes each (64 CPUs); IBM x335, 2.8 GHz; 32-bit applications

PBS Pro: load leveling between sites and between front and back ends; match workload to available licenses
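
The "match workload to available licenses" point amounts to a routing decision between the two front ends. A hedged sketch of that decision only; the site names, license counts, and queue depths are invented, and the real numbers would come from the license manager and PBS Pro:

```python
# Route a CFD job to the front end that has a free license and the shorter queue.
def pick_front_end(sites, free_licenses, queued_jobs, licenses_needed=1):
    candidates = [s for s in sites if free_licenses[s] >= licenses_needed]
    if not candidates:
        return None  # hold the job until a license frees up
    return min(candidates, key=lambda s: queued_jobs[s])

site = pick_front_end(
    sites=["site_a", "site_b"],
    free_licenses={"site_a": 0, "site_b": 3},
    queued_jobs={"site_a": 5, "site_b": 2},
)
print(site)  # -> site_b
```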

Implementation Issues

Cluster Enablers

ISVs: implement the Message Passing Interface (MPI); port to Linux; quality assurance (see the MPI sketch below)

Intel: 1.8 GHz Xeon performance exceeds RISC; Itanium 2 performance

Integrators: industrial-grade solutions

Not a bag-of-wires: turnkey implementation services, hardware and software support, cost-effective designs

Integrator provides research and development: node speed and size, back-end network

ClusterWare
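
The ISV solvers do their message passing in compiled C or Fortran; the snippet below only illustrates the MPI pattern they depend on, using mpi4py as a stand-in chosen for the sketch, not as part of this environment:

```python
# Illustration of the MPI pattern: each rank computes a partial result on its
# share of the work, then the pieces are combined on rank 0.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# each rank works on its own share of the problem...
partial = sum(range(rank, 1_000_000, size))

# ...and the per-rank pieces are reduced on rank 0
total = comm.reduce(partial, op=MPI.SUM, root=0)
if rank == 0:
    print(size, "ranks, total =", total)
```

Run it under an MPI launcher, e.g. `mpiexec -n 4 python sketch.py`.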

Internal Network

Speed vs. efficiency: proprietary (Myrinet, Scali, etc.); 10/100/1000 Ethernet; new technology (InfiniBand)

Switch size: flat is expensive; hierarchical limits job size

Sub-clusters: maintaining locality of traffic for jobs (see the sketch below)

Multiple clusters: a network of private networks; separating public and private traffic

Traffic: data, MPI, and management
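
The sub-cluster / locality idea can be made concrete with a toy placement rule. A sketch under the assumption of 12 nodes per leaf switch (as on the Pentium cluster) and a simple nodeNNN naming scheme; it is not the real scheduler logic:

```python
# Prefer placing a job's nodes on a single switch so its MPI traffic stays local.
def pick_nodes(free_nodes, nodes_needed, nodes_per_switch=12):
    by_switch = {}
    for name in free_nodes:  # e.g. "node037" sits behind switch 3
        switch = int(name.replace("node", "")) // nodes_per_switch
        by_switch.setdefault(switch, []).append(name)
    # a single switch keeps all MPI traffic off the backbone
    for nodes in by_switch.values():
        if len(nodes) >= nodes_needed:
            return nodes[:nodes_needed]
    # otherwise spill across switches and accept the extra hops
    spill = [n for nodes in by_switch.values() for n in nodes]
    return spill[:nodes_needed]

print(pick_nodes(["node001", "node002", "node013", "node014", "node015"], 3))
```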

Workload Management

Workload management: more than job scheduling

Clean-up after kill / failure

Accounting

Visibility to back-end processes

Sub-clusters: maintaining locality of traffic

Heterogeneous systems: unequal nodes (see the accounting sketch below)

Number of CPUs

CPU speed

Mixed environment (SMP and clusters)
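
For accounting across unequal nodes, one workable convention is to charge normalized CPU-hours. The speed factors below are illustrative guesses, not benchmark results:

```python
# Charge jobs in normalized CPU-hours so accounting is fair across unequal nodes.
SPEED_FACTOR = {"pa8500_400mhz": 1.0, "xeon_2800mhz": 1.8, "itanium2": 2.2}

def normalized_cpu_hours(raw_cpu_hours, node_type):
    return raw_cpu_hours * SPEED_FACTOR[node_type]

print(normalized_cpu_hours(96.0, "xeon_2800mhz"))  # 96 Xeon CPU-hours -> 172.8
```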

Data

Internal disks: do you need / want internal disks? Is there activity that should be targeted to the internal disks?

Shared disk: SAN or NFS

SAN: costly and complex, but high performance

NFS: less expensive and simpler, with moderate performance

Lessons Learned

Operating System

Corporate culture: corporate standards for open source; niche implementations vs. corporate direction; can you fly below the radar?

Which OS is right for your cluster? Linux concerns; what about Windows? HP-UX for Itanium

Resilience

Likelihood of failure: are commodity components more or less likely to fail?

Impact of failure: cost of a lost node (see the sketch below)

HW support: back-end nodes vs. shared components

Off-hours level of diagnosis and ability to repair
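
The "impact of failure" question can be framed as simple arithmetic: how likely is a long, wide job to be hit by a node failure before it finishes? The per-node failure rate in this sketch is a made-up illustration, not a measured DCX number:

```python
# Chance that a multi-node job is interrupted by a node failure before it completes.
def p_job_interrupted(nodes, hours, p_fail_per_node_hour=1e-5):
    p_node_survives = (1.0 - p_fail_per_node_hour) ** hours
    return 1.0 - p_node_survives ** nodes

# a 12-node impact job running for 48 hours: roughly 0.6% with these inputs
print(round(p_job_interrupted(12, 48), 4))
```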

DCX Failure Experiences (2002)

SMP : 0.10% jobs lost to system failure

Cluster: 0.05% jobs lost to system failure

Cost savings for mixed-mode hardware support

ClusterWare

Number of offerings

Need for standardization

Interface to other standard tools

Open source vs. proprietary

Choosing Your Integrator

Roll your own ("bag of wires"): who you gonna call? How much did you really save? When will you find out if you made a configuration mistake?

Application expertise is the dominant factor: HW and SW support; let your integrators do your research and development. There are plenty of players; when in doubt, bid it out.

Futures

Internal networks: Ethernet improvements

InfiniBand

Proprietary (Myrinet, Quadrics, etc.): cost, future, duplicate network

Futures (continued)

Multiple clusters: a homogeneous interface to heterogeneous clusters

Multiple operating systems

Hybrid hardware: 32-bit and 64-bit applications

ClusterWare and workload management

Cluster experience will benefit Grid implementation

Closing Remarks

Benefits

Cost metrics? Transactions (jobs) per dollar spent (see the sketch below)

Performance metrics? Turnaround vs. throughput

Visibility
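
The jobs-per-dollar metric is simple enough to compute directly. Every number in this sketch is a placeholder, not a Chrysler figure:

```python
# "Transactions (jobs) per dollar spent" over a year of operation.
def jobs_per_dollar(jobs_completed, hardware, software, support):
    return jobs_completed / (hardware + software + support)

print(round(jobs_per_dollar(jobs_completed=25_000,
                            hardware=900_000,
                            software=300_000,
                            support=150_000), 4))
```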

Benefits (continued)

Pentium cluster: 20% performance improvement over an equivalent number of RISC processors

40% cost benefit

Itanium cluster: 50% performance improvement over RISC

Enablers

It is all about the application

Know your workload: how many bells and whistles will provide benefits?

Run benchmarks

Use application-savvy integrators

Cluster Strategy

Good for only the next installation: technology is changing too fast

When does it make sense to expand an existing cluster?

Niche solutions: you don't have to transform the enterprise

Corporate resistance: the only thing more notorious than a successful implementation is an unsuccessful one

Perspective

Remember, the primary reason to embrace clusters is to reduce costs. It is easy to lose sight of this goal and succumb to the temptation to over-configure.

Questions?