
Page 1

Streamlining Research Computing Infrastructure

A small school's experience

Gowtham
HPC Research Scientist, ITS
Adj. Asst. Professor, Physics/ECE

[email protected]

(906) 487-3593
http://www.mtu.edu

Page 2

Houghton, MI

Page 3

[Map: distances from Houghton, MI]
- Isle Royale National Park, MI: 56 miles
- Green Bay, WI: 215 miles
- Duluth, MN: 215 miles
- Detroit, MI: 550 miles
- Twin Cities, MN: 375 miles
- Sault Ste. Marie, MI/Canada: 265 miles

Page 4

Michigan Tech Fall 2013

1885 · 1897 · 1927 · 1964

- Population
  - Houghton/Hancock: 15,000 (22,000)

- Students: 7,000 (5,600 + 1,400)

- Faculty: 500

- Staff: 1,000

- General budget: $170 million

- Sponsored programs awards: $48 million

- Endowment value: $83 million

Page 5

- 8 mini- to medium-sized clusters
  - Spread around campus

- Varying versions of Rocks

- Different software configurations

- Single power supply for most components

- Manual systems administration and maintenance

- Minimal end user training and documentation

An as-is snapshot, January 2011

These 8 clusters — purchased mostly with start-up funds — had 1,000 CPU cores spanning several hardware generations and a few low-end GPUs. Only one of them had InfiniBand (40 Gb/s).

Page 6

- Move all clusters to one of two data centers
  - Merge clusters when possible

- Consistent racking, cabling and labeling scheme

- Upgrade to Rocks 5.4.2

- Identical software configuration

- End user training

- Complete documentation

Initial consolidation January 2011 — March 2011

Compute nodes deemed not up to the mark were set aside for building a test cluster: wigner.research.mtu.edu

Labeling examples:
- R107B36 OB1: Rack 107, Back side, 36th slot, On-Board NIC 1 (of a node)
- R107B41 P01: Rack 107, Back side, 41st slot, Port 01 (of the switch)
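The labeling scheme illustrated above is regular enough to parse with a short script; a minimal Python sketch (the regular expression, the assumption that an "F" code marks the front side, and the field names are illustrative, not part of the documented scheme):

```python
import re

# Illustrative parser for labels like "R107B36 OB1" or "R107B41 P01":
# rack number, F/B side code, slot, then either an on-board NIC or a switch port.
LABEL = re.compile(r"^R(?P<rack>\d+)(?P<side>[FB])(?P<slot>\d+)\s+(?P<port>OB\d+|P\d+)$")

def parse_label(label: str) -> dict:
    m = LABEL.match(label.strip())
    if not m:
        raise ValueError(f"unrecognized label: {label!r}")
    side = "Back" if m["side"] == "B" else "Front"       # "F" for front is an assumption
    kind = "on-board NIC" if m["port"].startswith("OB") else "switch port"
    return {"rack": int(m["rack"]), "side": side, "slot": int(m["slot"]),
            "port": m["port"], "kind": kind}

if __name__ == "__main__":
    for label in ("R107B36 OB1", "R107B41 P01"):
        print(label, "->", parse_label(label))
```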

Page 7

- hpcmonitor.it.mtu.edu

- Ganglia monitoring system

Capture usage pattern April 2011 — December 2011

Monitoring multiple clusters with Ganglia: http://central6.rocksclusters.org/roll-documentation/ganglia/6.1/x111.html
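The linked Rocks roll documentation covers monitoring multiple clusters with Ganglia; the central piece is a gmetad instance on the monitoring host polling each cluster front end's gmond. A minimal sketch of the relevant gmetad.conf entries (cluster names, hostnames and the grid name are placeholders, not the actual Michigan Tech configuration):

```
# gmetad.conf on the central monitoring host (e.g., hpcmonitor.it.mtu.edu);
# one data_source line per cluster, each pointing at that cluster's front-end gmond.
# Cluster names and hostnames below are placeholders.
data_source "cluster-a" cluster-a-frontend.example.edu:8649
data_source "cluster-b" cluster-b-frontend.example.edu:8649
gridname "Research Computing"   # optional; value is illustrative
```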

Page 8

- Low usage
  - 20% on most days
  - 45-50% on the luckiest of days
- Inability and/or unwillingness to share resources
  - Lack of resources for researchers in need
- More systems administration work
  - Space, power and cooling costs
- Less time for research, teaching and collaborations

Analysis of usage pattern January 2012

Page 9

- VPR, Provost, CIO, CTO, Chair of HPC Committee and yours truly
  - Strongly encourage sharing of under-utilized clusters
  - End of life for existing individual clusters
  - Stop funding new individual clusters
  - Acquire one big centrally managed cluster
- Central administration will fully support the new policies
  - One-person committees
  - No exceptions for anyone

The meeting January 2012

Page 10

The philosophy January 2012

Greatest good for the greatest number - Warren Perger and Gifford Pinchot

Much is said of the questions of this kind, about greatest good for the greatest number. But the greatest number too often is found to be one. It is never the greatest number in the common meaning of the term that makes the greatest noise and stir on questions mixed with money … - John Muir

Page 11

It’s not just a keel and a hull and a deck and sails. That’s what a ship needs but not what a ship is. But what a ship is … what the Black Pearl Superior really is … is freedom. - Captain Jack Sparrow, Pirates of the Caribbean

Adopted shamelessly from Henry Neeman’s SC11 presentation: Supercomputing in Plain English

The philosophy January 2012

Page 12

- $750k for everything

- $675k for hardware + 10% for unexpected expenses

- 5 rounds with 4 vendors (2 local; 2 brand names)
  - Local vendor won the bid, February 2013
  - Staggered delivery of components, April — May 2013
- Flywheel installation, April — May 2013
  - Load test with building and campus generators

Bidding/Acquiring process February 2012 — May 2013

Page 13

- Built with retired nodes from other clusters
  - 1 front end
  - 2 login nodes
  - 1 NAS node (2 TB RAID1 storage)
  - 32 compute nodes
- 50+ software suites
- 150+ users

The first version of wigner had just two nodes: 1 front end and 1 compute node, built with retired lab PCs and no switch.

wigner.research January 2011 — December 2013

As of Spring 2014, wigner has been retired. The nodes are being used as a testing platform for the upcoming Data Science program at Michigan Tech and to teach building and managing a research computing cluster as part of PH4395: Computer Simulations.

Page 14

- HPC Proving Grounds
  - OS installation and customization
  - Software compilation and integration with the queueing system
- Extensive testing of policies, procedures and user experience
  - PH4390, PH4395 and MA5903 students
  - Small to medium-sized research groups
- Automating systems administration
- Integrating configuration files, logs, etc. with a revision control system

wigner.research March 2011 — December 2013

Page 15

- Central Rocks server (x86_64)
  - Serves 6.1, 6.0, 5.5, 5.4.3 and 5.4.2
- Saves time during installation
- Facilitates inclusion of cluster-specific rolls

rocks.it.mtu.edu April 2012 — present

Scripts and procedures were provided by Philip Papadopoulos

Page 16

- 1 front end

- 2 login nodes

- 1 NAS node: 33 TB usable RAID60 storage space

- 72 CPU compute nodes

- 5 GPU compute nodes
  - 4 NVIDIA Tesla M2090 GPUs (448 CUDA cores)

Superior June 2013

Compute nodes (CPU and GPU): Intel Sandy Bridge E5-2670, 2.60 GHz, 16 CPU cores and 64 GB RAM

Housed in the newly built Great Lakes Research Center: http://www.mtu.edu/greatlakes/

Page 17

- 56 Gbps InfiniBand
  - Primary research network
  - Copper cables
- Gigabit Ethernet
  - Administrative and secondary research network
- Redundant power supply for every component

Superior June 2013

With 81 total nodes, there was 33% room for growth before needing to re-design the InfiniBand switch system. The final cost was $680k; the remaining $70k was used to build a test cluster: portage.research.mtu.edu

Page 18

- Physical assembly (7 days)
  - Racking, cabling and labeling
- Rocks Cluster Distribution (5 days)
  - OS installation, customization, compliance
  - Software compilation, user accounts
- 3 pilot research groups (14 days)
  - Reward for being good and productive users
  - Help fix bugs, etc.

Superior June 2013

Page 19

Superior June 2013

[Figure: Superior's components labeled: front end, login nodes, storage node, CPU compute nodes, GPU compute nodes, Ethernet switch system, InfiniBand switch system]

Page 20

- short.q (compute-0-N; N: 0-7)
  - 24-hour limit on run time
- long.q (compute-0-N; N: 8-81)
  - No limit on run time
- gpu.q (compute-0-N; N: 82-86)
  - No limit on run time

Superior June 2013

http://superior.research.mtu.edu/available-resources
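For orientation, the 24-hour limit on short.q maps onto Grid Engine's per-queue hard run-time limit; a minimal sketch of the relevant queue attributes (the @short hostgroup name is illustrative, and Superior's actual queue definitions are not shown in this deck):

```
# Sketch of the short.q run-time limit as Grid Engine queue attributes
# (viewable with "qconf -sq short.q"; comments added for explanation)
qname     short.q
hostlist  @short
h_rt      24:00:00    # hard wall-clock limit; long.q and gpu.q leave this at INFINITY
```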

Page 21

Benchmarks: HPL June 2013

#  Performance   TFLOPS   Notes
1  Theoretical   23.96    --
2  Practical     21.57    ~90% of #1
3  Measured      21.38    89.23% of #1

http://netlib.org/benchmark/hpl

Theoretical performance = (# of nodes) × (# of cores per node) × (clock frequency, cycles/second) × (# of floating-point operations per cycle)
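For Superior's 72 CPU compute nodes, each with 16 cores at 2.60 GHz, and assuming 8 double-precision floating-point operations per cycle per core (the AVX rate for Sandy Bridge, consistent with the table above), the formula works out to:

72 nodes × 16 cores/node × 2.60 × 10^9 cycles/second × 8 FLOP/cycle = 23,961.6 GFLOPS ≈ 23.96 TFLOPS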

Page 22

Benchmarks: LAMMPS June 2013

Benjamin Jensen (advisor: Dr. Gregory Odegard), Computational Mechanics and Materials Research Laboratory, Mechanical Engineering-Engineering Mechanics. Results from a simulation involving 1,440 atoms and 500,000 time steps.

[Chart: total run time (hours, 0-17) versus number of nodes (CPU cores): 2 (32), 4 (64), 6 (96) and 10 (160), comparing Michigan Tech's Superior with NASA's Pleiades]

Page 23

Submit completed proposal to: Dr. Warren Perger, Chair, HPC Committee, [email protected]

Account request

LaTeX/MS Word template available at http://superior.research.mtu.edu/account-request

- Proposal
  - Title and abstract
  - User population
  - Preliminary results
  - Nature of data sets
  - Required resources
  - List of software/compilers
  - Scalability
  - Source of funding
- Résumé

Page 24

- A metric for merit

- An easily accessible list of projects
  - Know what the facility is being used for
  - Intellectual scholarship and computational requirements

- For VPR, CIO, deans, dept. chairs and institutional directors

- A fail-safe opportunity to practice writing proposals seeking allocations in NSF’s XSEDE, etc.

Why a proposal?

http://nsf.gov http://xsede.org http://superior.research.mtu.edu/list-of-projects

Page 25

- Tier A
  - New faculty
  - Established faculty with funding
- Tier B
  - Established faculty with no (immediate) funding

User population

Group members and external collaborators inherit their PI’s tier. New faculty status is valid for 2 years from the first day of work.

Page 26

Job submission: qgenscript

One-stop shop for:

- Array jobs

- Exclusive node access

- Wait on pending jobs

- Email/SMS notifications

- Wait time statistics

- Command to submit the script

- Job information file

http://superior.research.mtu.edu/job-submission/#batch-submission-scripts
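The features above correspond to standard Grid Engine submission options; a sketch of the kind of batch script such a generator might emit (job name, queue, parallel environment, task counts and the notification address are placeholders, and the exclusive-access request assumes an "exclusive" complex is configured on the cluster):

```bash
#!/bin/bash
#$ -N my_job                  # job name (placeholder)
#$ -cwd                       # run from the submission directory
#$ -q long.q                  # target queue
#$ -pe mpi 32                 # parallel environment and slot count (PE name is illustrative)
#$ -t 1-16                    # array job: tasks 1 through 16
#$ -hold_jid earlier_job      # wait on a pending/running job before starting
#$ -l exclusive=true          # exclusive node access (assumes an 'exclusive' complex exists)
#$ -m abe                     # email at begin, abort and end
#$ -M user@example.edu        # notification address (placeholder)

mpirun -np 32 ./my_program    # placeholder workload
```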

Page 27

- Users' priorities are computed periodically
  - A weighted function of CPU time and production
- In effect only when Superior is running at near 100% capacity
- Pre-emption and advanced reservation are disabled
- Any job that starts will run to completion

Job scheduling policy

http://superior.research.mtu.edu/job-submission/#scheduling-policy
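The actual weighting is an in-house formula (see the job-priority slide later in the deck) and is not published here; purely as a hypothetical illustration of a "weighted function of CPU time and production" in Python:

```python
def user_priority(recent_cpu_hours: float, publications: int,
                  w_usage: float = 0.5, w_production: float = 0.5) -> float:
    """Hypothetical priority score, NOT Superior's actual in-house formula.

    Heavier recent CPU usage lowers the score; reported production
    (e.g., publications credited to the facility) raises it.
    """
    usage_term = 1.0 / (1.0 + recent_cpu_hours)   # more usage -> smaller term
    production_term = float(publications)          # more output -> larger term
    return w_usage * usage_term + w_production * production_term


if __name__ == "__main__":
    # Example: a heavy user with no reported output vs. a light user with two papers
    print(user_priority(recent_cpu_hours=5000, publications=0))
    print(user_priority(recent_cpu_hours=200, publications=2))
```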

Page 28

Email/SMS notifications

http://superior.research.mtu.edu/job-submission/#sms-notifications

Page 29

Job information file

Page 30

- Reduces performance for all users

- First offense
  - Terminates the program
  - An email notification [cc: user's advisor]
- Subsequent offenses
  - Same as first offense
  - Logs the user out and locks down the account

Running programs on login nodes

http://superior.research.mtu.edu/job-submission/#running-programs-on-login-nodes

A continued trend will be grounds for removal of the user's account.

Page 31

- Data is not backed up

- Limits per user
  - /home/john: 25 MB
  - /research/john: decided on a per-proposal basis
- When a user exceeds the limit
  - 12 reminders at 6-hour intervals [cc: user's advisor]
  - On the 13th reminder, the user is logged out and the account is locked down

Disk usage

http://superior.research.mtu.edu/job-submission/#disk-usage
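The deck gives the escalation schedule but not the mechanism; a hypothetical Python sketch of the logic a periodic (e.g., cron-driven) check could follow (paths, thresholds and the notification and lockdown helpers are placeholders):

```python
import subprocess

REMINDER_LIMIT = 12        # reminders sent before the account is locked (per the policy above)


def disk_usage_mb(path: str) -> float:
    """Size of `path` in megabytes, via `du -sm` (meant to run from a periodic cron job)."""
    out = subprocess.run(["du", "-sm", path], capture_output=True, text=True, check=True)
    return float(out.stdout.split()[0])


def send_reminder(user: str, path: str, limit_mb: float) -> None:
    print(f"reminder to {user}: {path} exceeds {limit_mb} MB quota (advisor cc'd)")  # placeholder


def lock_account(user: str) -> None:
    print(f"logging out and locking the account for {user}")  # placeholder


def enforce(user: str, path: str, limit_mb: float, reminders_sent: int) -> int:
    """One enforcement pass; returns the updated reminder count for this user."""
    if disk_usage_mb(path) <= limit_mb:
        return 0                                  # back under quota: reset the counter
    if reminders_sent < REMINDER_LIMIT:
        send_reminder(user, path, limit_mb)       # reminders 1 through 12, every 6 hours
        return reminders_sent + 1
    lock_account(user)                            # 13th pass: log out and lock
    return reminders_sent + 1
```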

Page 32

Useful commands

Developed at Michigan Tech

http://superior.research.mtu.edu/job-submission/#useful-commands

- qgenscript

- qresources

- qlist

- qnodes-map

- qnodes-active | qnodes-idle

- qwaittime

- qstatus | quser | qgroup

- qnodes-in-job

- qjobs-in-node

- qjobs-in-active-nodes

- qjobinfo | qjobcount

- qusage

Page 33

Usage reports

All PIs and the Chair of the HPC Committee receive a weekly report. VPR, CIO, deans, department chairs and institutional directors receive quarterly and annual reports (or when necessary).

Page 34

- 21 projects
  - 10 Tier A + 11 Tier B

- 100 users

- 9 publications

- 75+% busy on most days

- $325k worth of usage
  - ~50% of the initial investment
- Cost recovery model: $0.10 per CPU-core per hour

Usage reports July 2013 — December 2013
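At the stated rate of $0.10 per CPU-core per hour, the $325k of usage corresponds to roughly $325,000 / $0.10 ≈ 3.25 million CPU-core-hours billed over this six-month period.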

Page 35

Metrics

Cannot manage what cannot be measured

Not everything that's (easily) measurable is (really) meaningful.
Not everything that's (really) meaningful is (easily) measurable.

Page 36

- Move towards a merit-based system

- Easily measurable quantities
  - Who users are
  - # of CPUs and total CPU time
- Really meaningful entities
  - Publications
    - Type (poster, conference proceeding, journal) and impact factor
    - Citations

Metrics

Page 37

Publications reported to: Dr. Warren Perger, Chair, HPC Committee, [email protected]

An in-house algorithm to compute users' priorities

Metrics: job priority

The system already knows who the users are.

Page 38

Metrics

http://superior.research.mtu.edu/usage-reports

Interactive visualizations are built using the Highcharts framework.

Page 39

Metrics

http://superior.research.mtu.edu/usage-reports

Page 40

Metrics

http://superior.research.mtu.edu/usage-reports

Page 41

Metrics

http://superior.research.mtu.edu/usage-reports

Page 42

Metrics

http://superior.research.mtu.edu/usage-reports

Page 43

Metrics: global impact

http://superior.research.mtu.edu/list-of-publications

[Chart legend: Michigan Tech original, Journal Article, Book Chapter, Conference Proceeding, MS Thesis, PhD Dissertation]

Page 44

- Move all clusters to Great Lakes Research Center
  - Upgrade to Rocks 6.1 and add a login node

- Retire individual clusters when possible

- 16 compute nodes and 1 NAS node added to Superior

- portage.research.mtu.edu
  - Segue to Superior
  - 1 front end, 1 login node, 1 NAS node and 6 compute nodes
  - Testing, coursework projects and beginner research groups

Further consolidation August 2013 — December 2013

Page 45

- 1 big, 1 mini (central) and 3 individual clusters
  - 1 data center with .research.mtu.edu network

- Rocks 6.1

- Identical software configurations

- Automated systems administration and maintenance

- Extensive end user training

- Complete documentation

An as-is snapshot, January 2014

The Immersive Visualization Studio (IVS) is powered by a Rocks 5.4.2 cluster and has 24 HD screens (46", 240 Hz LED) working in unison to create a 160 sq. ft. display wall. @MTUHPCStatus

Page 46

- More tools to enhance user experience
  - Videos for self-paced learning of command-line Linux

- Encourage GPU computing

- Expand storage

- Provide backup

- Re-design InfiniBand switch system (216 nodes)

- Plan for expanded (or new) Superior


Immediate future February 2014 and beyond

Page 47

Thanks be to

- Philip Papadopoulos and Luca Clementi (UCSD and SDSC)

- Timothy Carlson (PNL)

- Thomas Reuti Reuter (Philipps-Universität Marburg)

- Alexander Chekholko (Stanford University)

- Rocks, Grid Engine and Ganglia mailing lists

- Henry Neeman (University of Oklahoma)

- Steven Gordon (The Ohio State University)

- Gergana Slavova, Walter Shands and Michael Tucker (Intel)

- Gaurav Sharma and Scott Benway (MathWorks)

- Adam DeConinck (NVIDIA)