Page 1: Early X1 Experiences at Boeing - CUG

Early X1 Experiences at Boeing

Jim Glidewell

Information Technology Services

Boeing Shared Services Group

Page 2 X1 Experiences - Cray User Group - May 2004

Early X1 Experiences at Boeing

• HPC computing environment

• X1 configuration

• Hardware and OS

• Applications

• Support

• Summary

Current HPC systems

• Two Cray T-90s

• A 384-CPU Origin 3800

• Three 256-CPU Linux clusters

• Cray X1

Cray X1

• Fully populated liquid-cooled chassis

• 64 MSPs

• 512 GB of memory, 32 GB per node

• Additional Java Server

• Total of 26 terabytes of LSI RAID disk

• managed by ADIC StorNext software

• X1 is partitioned into two systems

• 14-node production partition

• 2-node test partition

• Allows testing of weekly OS updates

• Added complexity to the network
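
The configuration figures above are consistent with one another, and a short check can confirm this. The sketch below assumes four MSPs per X1 node, a property of the X1 node design that the slide does not state. The variable names are illustrative only.

```python
# Sanity check of the X1 configuration figures quoted above.
# Assumption (not on the slide): each X1 node holds 4 MSPs.
MSPS_PER_NODE = 4
GB_PER_NODE = 32  # from the slide: 32 GB per node

# Nodes per partition, from the slide.
partitions = {"production": 14, "test": 2}

total_nodes = sum(partitions.values())
total_msps = total_nodes * MSPS_PER_NODE
total_memory_gb = total_nodes * GB_PER_NODE

for name, nodes in partitions.items():
    print(f"{name}: {nodes} nodes, "
          f"{nodes * MSPS_PER_NODE} MSPs, {nodes * GB_PER_NODE} GB")
print(f"total: {total_nodes} nodes, {total_msps} MSPs, {total_memory_gb} GB")
```

Under that assumption the two partitions add up to the 64 MSPs and 512 GB quoted for the full chassis.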

Timeline

• Early 2003 - Continuing discussions with Cray on X1

• August 2003 - Detailed plan for X1 transition

• September 19, 2003 - Final approval for X1 acquisition

• November, 2003 - Factory Visit

• January 2, 2004 - System Delivered

• January 15, 2004 - Early User Access

• March 1, 2004 - Limited X1 Production

• March 8, 2004 - Full X1 Production

• April 24, 2004 - StorNext managed SAN put into production

Hardware experiences

• Fluorinert pumps

• Four of four pumps replaced

• One pump replaced twice, due to install error which ran it dry

• Bad batch has been identified by Cray

• First system shipped with 32 GB per node

• Memory errors seen at factory

• 1 GB DIMMs were out of timing spec

• Cray developed new memory test to validate DIMMs

• Resulted in a one-month delay in shipment

• CPU and memory reliability have been excellent

UNICOS/mp experiences

• System stability

• Data compare errors

• Administration and Operations

System Stability

• Kernel panics

• We've seen very few since start of production in March

• Application migration is still disabled

• Job aborts due to node memory oversubscription

• Overall the OS has been stable

• Better than our expectation of one kernel panic per week

• Improvements continue

Data compare errors

• Problems arose during acceptance testing of disks

• Errors were extremely infrequent (one failure every 72 hours of testing)

• Concerns about data integrity

• Cray dedicated multiple systems to replicating the problem

• Root cause was lack of "I/O cache coherency"

• I/O started before data was flushed from cache to memory

• DMA picked up stale data

• Problem resolved by ensuring all data flushed before writes

• Similar timing windows closed in other I/O related code
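
The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not the UNICOS/mp I/O path: it assumes a write-back cache in front of memory and a DMA engine that reads memory directly, and every class and method name is hypothetical.

```python
# Toy model: a write-back cache in front of memory, and a DMA engine that
# reads memory directly, so it cannot see data still sitting in the cache.
# All names are hypothetical; this is not the actual UNICOS/mp code path.

class ToySystem:
    def __init__(self):
        self.memory = {}  # what the DMA engine sees
        self.cache = {}   # dirty lines not yet written back to memory

    def cpu_store(self, addr, value):
        """CPU writes land in the cache first."""
        self.cache[addr] = value

    def flush(self):
        """Write all dirty cache lines back to memory."""
        self.memory.update(self.cache)
        self.cache.clear()

    def dma_read(self, addr):
        """DMA bypasses the cache and reads memory only."""
        return self.memory.get(addr)

    def write_to_disk(self, addr, flush_first):
        """Return the value the DMA engine would send to disk."""
        if flush_first:
            self.flush()
        return self.dma_read(addr)


system = ToySystem()
system.memory[0x100] = "old"    # previous buffer contents
system.cpu_store(0x100, "new")  # application updates the buffer

buggy = system.write_to_disk(0x100, flush_first=False)  # stale data
fixed = system.write_to_disk(0x100, flush_first=True)   # current data
print(buggy, fixed)
```

In the toy model the stale read happens every time. On the real system it depended on timing, which is consistent with one failure in roughly 72 hours of testing.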

Administration and Operations

• Experience with IRIX really helpful

• Disk configuration & backups

• Network

• PBSPro

• Accounting

Disk configuration

• All disks are LSI RAID

• Complex

• Leaned heavily on Cray support

• Limited guidance on performance tuning

• Delayed the StorNext-managed SAN for months

• Backups

• Weekly backups to tape, daily disk-to-disk copies

• Long term plan to rely heavily on StorNext

Network

• Lots of components - X1, CNS, CPES, X1-JS, backup CNS

• Very complicated - hope nothing breaks!

Boeing X1 Nets – PAM, RAID, StorNext

[Network diagram. Components shown: X1, CWS, CPES, X1-JS, FSS0 (primary), FSS1 (alternate), X1 CNS0 (primary), X1 CNS1 (backup), X1 CNS2 (test), SFM0, SFM1, ACSLS, and RAID. Networks shown: PAM 10.0.104.0, PAM 10.0.106.0, PAM 10.0.109.0, RAID PAM 10.0.116.0, and StorNext 192.168.10.0. Each CNS is labeled IP/FC <> X1 and GbE <> CustomerNet.]

SNFS Net Addresses

• 192.168.10.2 FSS-0

• 192.168.10.3 FSS-1

• 192.168.10.10 CPES

• 192.168.10.11 X1-JS

• 192.168.10.20-22 CNS

• /etc/hosts

• xxx.xxx.214.242 on fss0

• xxx.xxx.214.243 on fss1

• establishes cvfsid (license) and ACSLS client identity

• fsroutes on all nodes to redirect MD traffic to SNFS net
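
With this many components, an address plan is easy to get wrong, so a small consistency check can help. The sketch below records the SNFS address assignments listed above and verifies that they are unique and fall inside the StorNext network. The /24 mask is an assumption, since the slide gives only the network address 192.168.10.0, and the dictionary layout is illustrative.

```python
import ipaddress

# Assumption: the StorNext (SNFS) network 192.168.10.0 uses a /24 mask.
# The slide gives only the network address.
snfs_net = ipaddress.ip_network("192.168.10.0/24")

# SNFS address assignments as listed on the slide.
snfs_hosts = {
    "FSS-0": ["192.168.10.2"],
    "FSS-1": ["192.168.10.3"],
    "CPES": ["192.168.10.10"],
    "X1-JS": ["192.168.10.11"],
    "CNS": [f"192.168.10.{n}" for n in range(20, 23)],  # .20 through .22
}

all_addrs = [
    ipaddress.ip_address(addr)
    for addrs in snfs_hosts.values()
    for addr in addrs
]

# Addresses that fall outside the assumed SNFS network.
outside = [addr for addr in all_addrs if addr not in snfs_net]
# True if any address was assigned twice.
duplicates = len(all_addrs) != len(set(all_addrs))

print(f"{len(all_addrs)} addresses, outside={outside}, duplicates={duplicates}")
```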

PBSPro

• Solid and reliable for us

• Excellent documentation

• User documentation needed (What's an MPPE?)

• Good interaction with psched

• But PBS knows nothing about flexible vs. accelerated mode

• Disabled migration can leave applications stuck in posted queue

• Expect improvements at 2.4

Accounting

• No Cray System Accounting...

• But project accounting is supported

• Using a locally written program to sum usage by user, project

• Session id will be included in UNICOS/mp 2.4 process records
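
A usage-summing program of the kind mentioned above might look roughly like the following. This is a sketch only: the record layout (user, project, CPU seconds) and the sample values are hypothetical, and it does not reflect the actual UNICOS/mp process record format or Boeing's program.

```python
from collections import defaultdict

# Hypothetical process records: (user, project, cpu_seconds).
# The real process record format is not shown in the slides.
records = [
    ("alice", "cfd", 1200.0),
    ("alice", "cfd", 300.0),
    ("bob", "cfd", 450.0),
    ("bob", "loads", 50.0),
]


def sum_usage(records):
    """Total CPU seconds per user and per project."""
    by_user = defaultdict(float)
    by_project = defaultdict(float)
    for user, project, cpu_seconds in records:
        by_user[user] += cpu_seconds
        by_project[project] += cpu_seconds
    return dict(by_user), dict(by_project)


by_user, by_project = sum_usage(records)
print(by_user)
print(by_project)
```

Keying the totals on the (user, project) pair instead would be a one-line change if per-user charges within each project are needed.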

Applications and Tuning

• Compilation time is still an issue

• Cray has provided significant help in getting our key applications performing well on the X1

• Overflow

• Tranair

• Other applications

Overflow

• CFD code developed at NASA Ames

• Parallel CFD code using MLP (Multi-Level Parallelism)

• Scales well to large number of nodes (256 or more)

• Runs very well on our NUMA-based system

• Initial performance was roughly 16X that of a 400 MHz CPU on our existing Overflow system

• After tuning by Cray personnel, performance is now 25X

• Code changes have been integrated into NASA's version
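
The gain from tuning alone follows directly from the two ratios quoted above. The snippet below is just that arithmetic, and the variable names are illustrative.

```python
# Overflow speed relative to one 400 MHz CPU on the existing system,
# as quoted on the slide.
initial_speedup = 16.0  # before tuning
tuned_speedup = 25.0    # after tuning by Cray personnel

# Improvement attributable to the tuning work.
tuning_gain = tuned_speedup / initial_speedup
print(f"tuning gain: {tuning_gain:.2f}x")
```

The tuning work therefore accounts for an improvement of about 1.56 times over the initial port.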

Tranair

• Boeing-developed CFD code

• Adaptive grid single-CPU FORTRAN code

• Out-of-core solver - made heavy use of SSD on T-90s

• Highly optimized for T-90 series

• Large memory requirement - 7-12 GB per job, desire larger

• Forced to use MSP mode due to memory needs

• Initial speed ratio was 0.9-1.18 relative to T-90, both MSP and SSP

• Current MSP speed ratio is 2.0 relative to T-90
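
The Tranair ratios above imply a range for the improvement since the initial port. The following sketch computes it, assuming the initial range and the current figure were measured on comparable jobs, which the slide does not state.

```python
# Tranair speed relative to the T-90, as quoted on the slide.
initial_low, initial_high = 0.9, 1.18  # initial port, MSP and SSP
current = 2.0                          # current MSP mode

# Assumption: the initial and current figures refer to comparable jobs.
gain_min = current / initial_high  # if the job started at the high end
gain_max = current / initial_low   # if the job started at the low end
print(f"improvement since initial port: {gain_min:.2f}x to {gain_max:.2f}x")
```

On that assumption the optimization work improved performance by somewhere between roughly 1.7 and 2.2 times.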

Support

• Cray has provided excellent support throughout the process

• Initial hardware and software setup

• Local technical support

• Technical folks in Chippewa Falls and Mendota

• Hardware resources to reproduce and debug problems

• Excellent training & documentation

• Cray’s assistance was essential to the success of our installation on a very aggressive schedule

Summary

• Significant hardware and software issues were encountered and resolved

• Memory DIMM problems

• Data compare errors

• Kernel panics

• StorNext teething pains

• Cray has provided great support throughout the process

• Users are happy with the turnaround and overall reliability of the X1

• The X1 is already a key part of our CFD design process

Coming soon…