linux clusters institute: hpc user support · hpc user expectations, categorization and...

23
1 Linux Clusters Institute: HPC User Support Fang (Cherry) Liu, PhD [email protected] Wesley Emeneker, PhD [email protected] Mehmet Belgin, PhD [email protected] A Partnership for an Advanced Computing Environment (PACE) OIT/ART, Georgia Tech 4-8 August 2014 2 Targets for this session Target audience: IT professionals with little or no experience with supporting HPC users Points of interest: differences between HPC and conventional IT HPC user categories and differences in their support human aspect of HPC support (i.e. politics, conflicts) problems common to most institutions/centers supporting HPC exchange different approaches/solutions to common problems HPC education and training lessons we learned at GT

Upload: others

Post on 04-Jul-2020

18 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

1

Linux Clusters Institute:HPC User Support

Fang (Cherry) Liu, PhD [email protected]

Wesley Emeneker, PhD [email protected]

Mehmet Belgin, PhD [email protected]

A Partnership for an Advanced Computing Environment (PACE)

OIT/ART, Georgia Tech

4-8 August 2014 2

Targets for this session

Target audience:

IT professionals with little or no experience with supporting HPC users

Points of interest:

• differences between HPC and conventional IT

• HPC user categories and differences in their support

• human aspect of HPC support (i.e. politics, conflicts)

• problems common to most institutions/centers supporting HPC

• exchange different approaches/solutions to common problems

• HPC education and training

• lessons we learned at GT

Page 2: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

2

4-8 August 2014 3

Overview

Part I: HPC user expectations, categorization and commonalities

Part II: Policies, Politics, Conflicts and Personality Management

Part III: Outreach and Education

Part IV: Lessons learned at GT

4-8 August 2014 4

Differences of HPC from conventional IT

• application performance as the primary target

• usually relies on conventional IT services (by a separate team)

• more focus on supporting end-users than services

• uses common IT technologies in uncommon ways

• requires specific middleware and software layers

• requires code compilations using complicated mechanisms

• may require specific knowledge about the application/science

• has irregular usage patterns (maybe not so different than IT?)

Page 3: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

3

4-8 August 2014 5

PART I

HPC user expectations, categorization and commonalities

4-8 August 2014 6

HPC user expectations

Faculty (a.k.a PI) (owner of resources, but not active users):

• Their students and collaborators have everything they need to get the work done (and on time)

• Maximum availability of resources

• Minimum communication with HPC support staff

• Regular status reports

Page 4: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

4

4-8 August 2014 7

HPC user expectations

Students/Collaborators (or computationally active PIs) :

• ultra-fast learning curve

• simple and instant solutions to complex problems

• maximum communication with HPC support staff

• simulations running faster than their laptops (not always possible!)

• help with diagnosing problems that are NOT related to systems

• an "insider friend" in the HPC support staff

• answers that match their level of knowledge

4-8 August 2014 8

HPC User Categories

• Three coarse categories:• Novice

• Intermediate

• Advanced

• Difficult to identify a user's category without any prior interaction

• The language used in requests is a good indicator

• Replies to follow-up questions also reveal their level of proficiency

• In case of uncertainty, assume "novice"

Page 5: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

5

4-8 August 2014 9

Category 1: Novice Users

Common Points:

• 75-80% of the support requests

• no/little Linux skills

• no/little experience with running the domain specific packages

• no/little understanding of the scientific fundamentals behind the packages

• mostly identical or similar requests with straightforward solutions

• usually not aware of the standard help channels

• may ask the impossible

• may type the examples in the help documents literally

• may feel insecure or apologetic when seeking for help

4-8 August 2014 10

Category 1: Novice Users

Common Needs:

• Cluster orientation

• Linux 101

• an email list

• an easy text editor (nano?)

• help with configuring their MS Windows/OSX systems

• location of existing software

• installation of new software

• help with tools to move data in/out

• help with the very first job submission script

Page 6: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

6

4-8 August 2014 11

Category 1: Novice Users

Common approaches for effective support:

• do everything to build mutual trust

• hold regular orientation sessions and help desks

• maintain up-to-date FAQ/help with screenshots

• provide links to existing help locations

• suggest proper web search terms

• make them feel better about their simple (or sometimes stupid) questions

• explain all the steps for resolution in simple, replicable terms

• prefer exact list of commands to general/conceptual answers

• be very patient and polite!

4-8 August 2014 12

Category 2: Intermediate Users

Common Points:

• 10-25% of the support requests

• largest portion of the compute activity on the cluster

• experience with clusters in the same or other institutions

• first to notice and report system problems

• a hybrid mix of straightforward and complex questions

• advanced and multi-step scientific workflows

• aware of the standard help channels

• suggest solutions to their own problems and may not like what you did

• act as the local technical expert and often train novice users in their group

Page 7: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

7

4-8 August 2014 13

Category 2: Intermediate Users

Common Needs:

• advanced (and group-specific) information sessions

• well-explained effective solutions

• more performance/efficiency from already running codes

• specific modules/patches/versions for existing software

• higher level of control on their jobs

• access to specialized computational resources

• configurations that may conflict with system defaults

• code development/debugging/profiling support

• data/statistics for the resolution of conflicts with other users

4-8 August 2014 14

Category 2: Intermediate Users

Common approaches for effective support:

• do everything to build mutual trust

• hold advanced classes to "teach how to fish"

• schedule one-on-one meetings

• add exceptional/advanced cases to existing FAQ/help pages

• present solid data/evidence instead of speculation

• admit to speculation if it is inevitable

• show complete transparency; they can separate 'excuses' from 'facts'

• get help from vendor support and user forums, keeping users CC'ed

• be very patient and polite!

Page 8: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

8

4-8 August 2014 15

Category 3: Advanced Users

Common Points:

• experience with and access to multiple clusters

• only a small fraction of support requests

• inclination for bypassing the ticket system

• usually complex problems with long resolution time

• try to fix problems themselves, and see HPC support as a last resort (i.e. when it's too late)

• usually on the extremes; either hostile or extremely collaborative

• too busy or advanced to act as the local expert for their group

• have complex to incomprehensible workflows

• usually acknowledge challenging problems, open to workarounds

• suggest improvements on the systems (hardware and software) and provide useful feedback

• open to experimentation with new systems and software

• find bugs in libraries and applications

4-8 August 2014 16

Category 3: Advanced Users

Common Needs:

• VIP treatment

• direct and open communication channels

• social contact

• acknowledgement of their level of knowledge and intelligence

• high-level and direct vendor/developer support

• lots of exceptions, even though they require violation of existing policies

• almost everything else listed under "common intermediate users needs"

• root password (the answer is still no)

Page 9: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

9

4-8 August 2014 17

Category 3: Advanced Users

Common approaches for effective support:

• do everything to build mutual trust

• schedule one-on-one meetings

• try to learn more about their research, deadlines and aspirations

• be very careful saying that something is "impossible"

• make small exceptions as long as it does not impact other users

• avoid speculation as much as possible (as with all users)

• be completely transparent, they can easily separate 'excuses' from 'facts'.

• encourage them to contact vendor support or user forums

• Be very patient and polite!

4-8 August 2014 18

How to Build TrustRef: http://www.myspeedoftrust.com/How-The-Speed-of-Trust-works/book

According to Stephen Covey, building trust requires character and competency.

• Character:

• Integrity

• Intent

• Competency:

• Capabilities

• Results

We agree.

Page 10: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

10

4-8 August 2014 19

PART II

Policies, Politics, Conflicts and Personality Management

4-8 August 2014 20

Policies

• clear policies help keep user demands under control

• publish policies in places easy to find (online)

• be prepared to explain the reasoning behind each policy item

• make policies as strict as possible, be open to exceptions when necessary

• encourage users to openly discuss and criticize the policies

• don't hesitate updating policies frequently to stay relevant

• build trust and effective communication with decision makers

• seek delegation privileges to speed things up

• don't make policies for resources you don't own, but "influence" them

Page 11: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

11

4-8 August 2014 21

Politics and Conflicts

• Tricky but inevitable

• No magic formula, needs case-specific creative solutions

• Biggest challenge: conflicts due to limited resources• configure systems to exactly match policies

• collect and store data for past and present usage

• provide users with tools to browse data/statistics for their accounts

• run regular audits to defuse problems before they explode

4-8 August 2014 22

Tiers of Conflict

• Internal to a group/department: usually easier to solve with communication and gentlemen's agreement

• Between groups/department: can get messy quick. Use "wall of shame"

• Between users and HPC support staff: Have clear policies handy as a basis for declining impossible requests, and keep solid statistics/data as evidence

Page 12: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

12

4-8 August 2014 23

Personality Management

• some users are difficult than others; why they behave that way is irrelevant

• do not take anything personal; report any harassment you may receive and do not retaliate

• in most cases users do not mean bad, but they are extremely frustrated

• if your mistake caused frustration, take responsibility and offer an apology

• show empathy and demonstrate sincere intention for resolution

• acknowledge that:

• you understand the problem

• you are aware of its particular impact on the user

• be aware of, and show tolerance for cultural differences and language difficulties

• humor is powerful only when used appropriately, avoid being awkward or insulting

• do not wait until having a resolution, respond immediately to inform that you started working on the problem, and provide frequent updates

4-8 August 2014 24

PART III

Outreach and Education

Page 13: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

13

4-8 August 2014 25

Trainings and Tutorials

• Novice Users:• how to ask for help

• usage limitations and best practices

• basic Linux usage

• basic cluster access, concepts, scheduling system, software

• troubleshooting job related problems

• Intermediate Users• debugging/optimization of codes (incl. parallel)

• system architecture specific details

• advanced use of common tools (scientific python, parallel matlab)

4-8 August 2014 26

Group Consultations

• mini-orientations for newly joining groups

• departmental meetings to provide feedback for resolution of internal conflicts

• resolution of technical problems that are specific to a group

• technical feedback to assist in policy making and system purchases

• introduction of services to new groups with interest in getting resources

Page 14: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

14

4-8 August 2014 27

Grant Writing Help

Level 1: Answer questions (e.g. hardware specs, software licenses)

Level 2: Contribute facilities document, budget, letters of supports

Level 3: Writing and revising portions of the grant

Level 4: Initiate new grants to get more resources

4-8 August 2014 28

Collaborations with Researchers and Vendors

• research scientists helping research scientists

• crucial for staying relevant

• collaborative grant writing

• collaborative projects/papers (in acknowledgements or as authors)

• support for classes and workshops

• developer/vendor collaborations• Bug tracking and fixes

• Hardware/software feedback, evaluation of new systems and technology

• Pilot studies

Page 15: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

15

4-8 August 2014 29

PART IV

Lessons learned at GT

4-8 August 2014 30

PACE: A Partnership for an Advanced Computing Environment

• A High Performance Computing Center at Georgia Tech

• provides the necessary bridge between fast computers and brilliant humans that enables a powerful combination of the two

• is a partnership between Georgia Tech faculty and the Office of Information Technology focused on HPC

• provides HPC environment for GaTech students, researchers and faculty

• total computing capacity includes more than 1,200 compute nodes with total about 30k cores

• total user community has more than 1,200 active users

• 11 staff members and 5 student assistants

Page 16: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

16

4-8 August 2014 31

User support in PACE

• User interactions: user request system, monthly portable helpdesk event, online self-support system (in progress)

• Administrative: quarterly maintenance, software upgrades, queue changes, etc.

• Static information: website FAQ, User Forum, polices(software install, storage, new user account, etc.)

• Outreach: HPC entry level classes, Summer classes, orientation, hardware grant proposals

4-8 August 2014 32

Virtual Teams

• Team organization helps on dividing the tasks and use of the students to offload some of daily routine tasks

• Storage team is in charge of storage hardware to provide reliable access through head/compute nodes. Lots of problems may raise from storage issues

• Software team takes care of all system scientific libraries, user applications, software modules, etc. Besides installing the software, knowledge on how to use software is needed

• Operations team covers all hardware installation, maintenance and repair

• User support team acts as an interface between user community and PACE center, takes care of support request triage, scientific consultation, etc.

Page 17: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

17

4-8 August 2014 33

Virtual Team Organization

Operations

Storage Team

Software Team

User Support Team

Request Tracking System

4-8 August 2014 34

Use a request tracking system

• Provides an incident tracking system for HPC and other customers on campus• Easy-to-use, streamlined interface

• Regular reminders for unresolved incidents

• Time tracking and reporting functions

• Customizable email notices

• Excellent customer communications tool

• identification of emerging problem patterns

• used as a knowledge base

• Improves user satisfaction and increases productivity to help PACE deliver more value to education and research

Page 18: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

18

4-8 August 2014 35

User Support Team

• Acts as entry point for incoming support requests

• Answer the straight-forward questions

• Manually distribute remaining requests to subject matter experts

• Categorize the requests

• Keep track of request assignments

• Internal coordinating/information collection to communicate with user community

• User usage analysis, gathering user experience

• User education and classes

Ticket Categories

• Queue Issues (Q)

• provide access to the resources

• cancel jobs stuck in the queue

• add nodes online/offline

• queue scheduler not responding

• long waiting in the queue

• exam the jobs

• Applications (A)

• compiling, using, debugging, optimizing the user codes

• system level software installation, create modules

• give access to the restricted applications (e.g. VASP and Gaussian)

• system library/user application usage

• system wide script support

4-8 August 2014 36

Page 19: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

19

4-8 August 2014 37

Ticket Categories (Cont.)

• General Usage (G)

• VPN issues

• login issue, slow response

• scheduling meetings

• file permission issues

• system issue reporting

• software license issue

• FAQ (F)

• PBS scripts

• modules usage

• resource checking/use

• general usage on job submission, cancellation, status check, etc.

4-8 August 2014 38

Ticket Categories (Cont.)

• Storage (S)

• inability to access the storage

• file moving between local workstations and PACE

• retrieving the accidentally deleted files

• modify the disk quota and permission

• Account (C)

• user account creation/retiring

• Hardware Repairs (H)

• disk/memory/fans/networks, etc.

Page 20: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

20

Ticket Analysis (July 2013 to June 2014)

• Manually recording the user request since July 2013

• Total 223 days recorded

• Total number of request is 1834

4-8 August 2014 39

4-8 August 2014 40

Total Monthly Ticket Trend (July 2013 - June 2014)

0

50

100

150

200

250

July 2013 to June 2014 Monthly Ticket Trend

Page 21: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

21

4-8 August 2014 41

Tickets Type Trends (July 2013 - June 2014)

Queue18%

Storage15%

General30%

Application18%

Hardware7%

Account creation7%

FAQ5%

Ticket Catagory Percentage

4-8 August 2014 42

Daily Ticket trends (July 2013 - June 2014)

0

5

10

15

20

25

30

35

7/17/13 8/17/13 9/17/13 10/17/13 11/17/13 12/17/13 1/17/14 2/17/14 3/17/14 4/17/14 5/17/14 6/17/14

July 2013 to June 2014 Daily Ticket Trend

Ticket Counts Average Ticket Counts per month

Page 22: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

22

4-8 August 2014 43

Quarterly Maintenance

• Set regular, strict dates, advance announcement

• Specify primary and bonus goals, announce them beforehand

• Predefined "worst case" downtime

• Provide a summary of completed tasks after maintenance

• Plan ahead in details:• Team member / task associations• Estimated task duration• Critical paths and B plans

• Prepare to have unforeseen problems during and after the maintenance days

• Show best effort for minimal impact• configure the scheduler to have no running jobs• Disable user access to resources during the maintenance activities

4-8 August 2014 44

PACE Training and Tutorials

Currently offered classes:• Cluster orientation (once every 2 weeks)

• Linux 101 (on demand, but will be a regular class soon)

• Introduction to Parallel Programming with MPI and OpenMP (summer course)

• Introduction to Parallel Application Debugging and Profiling (summer course)

• A Quick Introduction To Python (summer course)

• Python For Scientific Computing (summer course)

• Python For Data Analysis and Visualization (summer course)

• Data Analysis and Visualization with MATLAB (by vendors, on-demand)

• Handling Large Data Sets in MATLAB (by vendors, on-demand)

Page 23: Linux Clusters Institute: HPC User Support · HPC user expectations, categorization and commonalities 4-8 August 2014 6 HPC user expectations Faculty (a.k.a PI) (owner of resources,

23

4-8 August 2014 45

Portable Help Desks

• 2-hour help-desks in common circulation areas in departments

• regular (~once monthly) and also on-demand

• usually 2-3 of the team members

• usually 5-10 users stop by

• improves visibility of HPC support services

• good for socializing

• attract shy, novice users and encourage them to ask their simple questions

• attract prospective researchers interested in joining the cluster

• defuse frustration caused by unreported problems

4-8 August 2014 46

Questions and Feedback