
Page 1: Monitoring and Troubleshooting Resources

Monitoring and Troubleshooting Resources

Davide Salomoni, NIKHEF

Presented at the NE ROC Meeting

Amsterdam - October 28, 2004

Page 2: Monitoring and Troubleshooting Resources


Agenda (or, Thinking out Loud)

• Monitoring, monitoring, monitoring
  – But what, and how? And why?
  – Can we achieve some consistency?

• How is the software (middleware and bottomware) doing?
  – Do we care about the topware, by the way?

• How do we interact with:
  – Users
  – Other centers
  – Other regions
  – The Rest

Page 3: Monitoring and Troubleshooting Resources


Why is Monitoring/Troubleshooting Complex?

• Well, in our context, at least:
  – “I have attached a picture of the LCG-2 job submission chain, showing how many things have to be in good shape for one's job to run OK...” (Maarten Litmaath to LCG-ROLLOUT, 8/10/2004)

• The picture on the side is simplified: a “successful user experience” also involves:
  – Farm setup
  – Network, firewalls
  – Configurations
  – Dependencies

Page 4: Monitoring and Troubleshooting Resources


Farm Monitoring

• Both NIKHEF and SARA use ganglia
  – With some extensions, e.g. a ganglia/pbs interface (a minimal sketch of the general pattern follows below)

• Available from ftp://ftp.sara.nl/pub/outgoing/

• Several stats are available for both admin and user consumption; for example:
  – The usual ganglia pages
  – Job details
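
The ganglia/pbs interface itself is the package available from the SARA FTP site above. As a rough illustration of the general pattern (pull numbers out of the batch system, push them into ganglia as custom metrics), here is a minimal Python sketch; the metric names and the plain-qstat parsing are illustrative assumptions, not a description of the actual NIKHEF/SARA tool.

```python
# Minimal sketch of the general pattern (batch-system numbers -> ganglia
# custom metrics). NOT the ganglia/pbs interface published on
# ftp://ftp.sara.nl/pub/outgoing/; names and parsing are assumptions.
import subprocess

def count_jobs_by_state():
    """Count PBS jobs per state from plain 'qstat' output."""
    out = subprocess.run(["qstat"], capture_output=True, text=True, check=True).stdout
    counts = {"R": 0, "Q": 0}
    for line in out.splitlines():
        fields = line.split()
        # data lines look like: <jobid> <name> <user> <time use> <state> <queue>
        if len(fields) >= 6 and fields[4] in counts:
            counts[fields[4]] += 1
    return counts

def publish(name, value):
    """Publish one numeric metric to the local gmond via gmetric."""
    subprocess.run(
        ["gmetric", "--name", name, "--value", str(value),
         "--type", "uint32", "--units", "jobs"],
        check=True,
    )

if __name__ == "__main__":
    counts = count_jobs_by_state()
    publish("pbs_jobs_running", counts["R"])
    publish("pbs_jobs_queued", counts["Q"])
```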

Page 5: Monitoring and Troubleshooting Resources


Use of Resources

• Batch system/scheduler: torque/maui
  – Way better than OpenPBS

• Custom RPMs
  – Basically to support the transient $TMPDIR patch (automatic removal of temporary directories upon job completion) – a feature present in PBSpro and other batch systems (a sketch of the cleanup idea follows after this list)

• Extensive use of maui's fairshare mechanism to set targets for users (both local and grid), groups, classes, etc.
  – Flexible, and complex; and there are some annoyances
    • For example, how one specifies the max # of CPUs in the farm (static configuration)
    • Do not forget MAXJOBQUEUED or your system may get unhappy

• Packages and configuration examples are available at http://www.dutchgrid.nl/Admin/Nikhef/
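
To make the transient $TMPDIR idea concrete, here is a minimal Python sketch of the cleanup half, written in the style of a Torque epilogue script. The /scratch/<jobid> directory layout is an illustrative assumption; the actual patched RPMs are the ones published at the URL above.

```python
#!/usr/bin/env python
# Minimal sketch of the cleanup half of the transient $TMPDIR feature,
# as a Torque epilogue-style script: remove the per-job scratch directory
# once the job has finished. The /scratch/<jobid> layout is an assumption;
# the real patched RPMs are at http://www.dutchgrid.nl/Admin/Nikhef/.
import shutil
import sys
from pathlib import Path

SCRATCH_ROOT = Path("/scratch")   # assumed location of the per-job TMPDIRs

def main():
    if len(sys.argv) < 2:
        sys.exit("usage: epilogue <jobid> [...]")  # Torque passes the job id as the first argument
    tmpdir = SCRATCH_ROOT / sys.argv[1]
    # Only remove directories that really live directly under the scratch area.
    if tmpdir.parent == SCRATCH_ROOT and tmpdir.is_dir():
        shutil.rmtree(tmpdir, ignore_errors=True)

if __name__ == "__main__":
    main()
```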

Page 6: Monitoring and Troubleshooting Resources


Use of Resources (2)

• Check fairshare usage. Example report: aggregate use of the NDPF from 2004-09-01 to 2004-09-23 inclusive, for all users that are in groups XXXX:

Date          CPU time     WallTime    GHzHours  #jobs
2004-09-04    00:00:00     00:00:10        0.00      6
2004-09-06    49:38:00     49:41:26      127.61     10
2004-09-07   155:32:36    159:15:56      388.77      9
2004-09-08   559:31:19    579:12:23     1336.88     14
2004-09-09   523:15:21    524:14:17     1202.94     25
2004-09-10  1609:29:32   1617:20:42     3685.88     89
2004-09-11   319:18:39    331:14:29      662.48     13
2004-09-12    96:58:59     97:24:11      194.81      2
2004-09-13   131:43:08    133:06:45      266.23      6
2004-09-14   214:41:10    215:44:00      431.47     11
2004-09-15    59:56:58     65:24:52      130.83      5
2004-09-16    38:50:30     39:06:36       78.22      3
2004-09-17   432:55:49    452:22:26      938.97      6
2004-09-18    95:35:22     96:00:23      192.01      1
2004-09-19    95:26:31     96:00:17      192.01      1
2004-09-20    10:09:34     10:17:38       20.59     22
2004-09-21    49:06:40     49:45:10       99.51      3
2004-09-22    88:14:41     88:37:06      177.24      2
2004-09-23   184:45:49    214:44:09      429.47      3

Summed      4715:10:38   4819:32:56    10555.91    231
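
A report of this kind can be distilled from the Torque accounting logs (one file per day under server_priv/accounting). The following Python sketch shows the general idea only and is not the actual NIKHEF reporting tool; it assumes standard 'E' (job end) records with resources_used.cput and resources_used.walltime fields, and it omits the GHz-hours column, which additionally needs the clock speed of each worker node.

```python
# Sketch: aggregate per-day CPU time, walltime, and job counts from
# Torque accounting logs. Field names follow the standard 'E' records;
# everything else (paths, invocation) is an illustrative assumption.
import re
import sys
from collections import defaultdict

def hms_to_seconds(hms):
    """Convert 'HH:MM:SS' (hours may exceed 24) to seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

def seconds_to_hms(total):
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h}:{m:02d}:{s:02d}"

daily = defaultdict(lambda: {"cput": 0, "wall": 0, "jobs": 0})

for path in sys.argv[1:]:                        # accounting files, e.g. .../accounting/20040910
    with open(path) as f:
        for line in f:
            parts = line.rstrip("\n").split(";", 3)
            if len(parts) < 4 or parts[1] != "E":    # only completed-job records
                continue
            stamp, _rectype, _jobid, rest = parts
            fields = dict(re.findall(r"(\S+)=(\S+)", rest))
            day = stamp.split()[0]
            daily[day]["cput"] += hms_to_seconds(fields.get("resources_used.cput", "0:0:0"))
            daily[day]["wall"] += hms_to_seconds(fields.get("resources_used.walltime", "0:0:0"))
            daily[day]["jobs"] += 1

for day in sorted(daily):
    d = daily[day]
    print(day, seconds_to_hms(d["cput"]), seconds_to_hms(d["wall"]), d["jobs"])
```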

Page 7: Monitoring and Troubleshooting Resources


Babysitting the (local) Grid

• A number of home-built scripts try to keep the system under control
  – Indeed, an often-heard lament in the LCG world is that, regardless of the quality of the middleware, most problems occur because of site misconfigurations/problems.

• Check for unusually short wallclock times repeated in short succession on the same node(s) – often an indication of a black hole

• Check that nodes have a standard set of open ports (e.g. ssh, nfs, ganglia); if not, take them out of the batch system (see the sketch after this list)

• Periodically remove old (stale) state files, lest they are taken into account by the job manager (noticeable burden on the pbs server in that case)

• Monitor the pbs log and accounting files in various ways to check for errors and abnormal conditions

• Standard node monitoring, e.g. CPU temperature, disk space
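
As an illustration of this kind of home-built check, here is a minimal Python sketch of the open-ports test mentioned above. The node list, the port set, and the use of 'pbsnodes -o' to offline a failing node are illustrative assumptions, not the actual NIKHEF script.

```python
# Minimal sketch of the open-ports check: if a worker node no longer
# answers on the expected ports, take it out of the batch system.
import socket
import subprocess

EXPECTED_PORTS = {22: "ssh", 2049: "nfs", 8649: "ganglia gmond"}
NODES = ["node001", "node002"]            # in practice, derived from 'pbsnodes -a'

def port_open(host, port, timeout=3.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for node in NODES:
    missing = [name for port, name in EXPECTED_PORTS.items() if not port_open(node, port)]
    if missing:
        print(f"{node}: not answering on {', '.join(missing)}; offlining node")
        # Torque/OpenPBS: mark the node offline so the scheduler stops using it
        subprocess.run(["pbsnodes", "-o", node], check=False)
```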

Page 8: Monitoring and Troubleshooting Resources


Job Submission, the Grid Way

• The job manager/grid monitor agents spawn several (potentially long-running) processes. These can be, depending on various factors, per RB, per job, per user.
  – These processes in the end all issue qstat calls, i.e. they query the pbs server
    • These calls gather detailed job info for every job owned by a given user – even jobs that for some reason died a long time ago but left some traces on the system (in the form of GRAM state files)
  – With high job submission rates (e.g. during data challenges) and a high number of nodes in the farm, this can lead to 25+ qstat calls/second and 100% CPU on decent hardware (2 x Xeon 2.8 GHz)

• In this case, “job submission” really means “the server submits itself”, i.e. it dies and brings the CE to a halt

• And if you run e.g. GridICE, you will have even more qstat queries

Page 9: Monitoring and Troubleshooting Resources


PBS Caching

• While waiting for somebody to fix this madness, we now have a qstat/qsub/pbsnodes caching mechanism in place
  – CPU load is much more reasonable
  – At least this is not the bottleneck anymore
    • With a farm of our size, and apparently also with bigger farms (e.g. CNAF)
    • But there are many other players in the chain, so scalability may be at risk anyway
  – The caching wrappers are available at http://www.dutchgrid.nl/Admin/Nikhef/
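
To show the idea only (the real wrappers are the ones published at the URL above), here is a minimal Python sketch of a caching front end for qstat: recent output is replayed from a cache file, and the real binary is only invoked when the cache has expired. The cache path, the TTL, and the renamed location of the real qstat are illustrative assumptions.

```python
#!/usr/bin/env python
# Minimal illustration of the qstat caching idea; not the actual
# NIKHEF wrappers (see http://www.dutchgrid.nl/Admin/Nikhef/).
import os
import subprocess
import sys
import time

TTL = 30                                  # seconds before the cache is considered stale
CACHE = "/var/cache/pbs/qstat-f.out"      # assumed cache file
REAL_QSTAT = "/usr/bin/qstat.real"        # assumed name of the renamed real binary

def refresh_cache():
    out = subprocess.run([REAL_QSTAT, "-f"], capture_output=True, text=True, check=True).stdout
    tmp = CACHE + ".tmp"
    with open(tmp, "w") as f:
        f.write(out)
    os.rename(tmp, CACHE)                 # atomic replace: readers never see a partial file

def main():
    # Only the common read-only full query is cached in this sketch;
    # anything else is handed straight to the real qstat.
    if sys.argv[1:] != ["-f"]:
        os.execv(REAL_QSTAT, [REAL_QSTAT] + sys.argv[1:])
    try:
        age = time.time() - os.path.getmtime(CACHE)
    except OSError:
        age = None
    if age is None or age > TTL:
        refresh_cache()
    with open(CACHE) as f:
        sys.stdout.write(f.read())

if __name__ == "__main__":
    main()
```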

Page 10: Monitoring and Troubleshooting Resources


On Being Monitored

• We certainly want to be on this map
  – But for a while we had to ask to be removed

• There seem to be far too many testing scripts

• Two main problems:
  – Given the way the existing job manager works, current monitoring can create disruption on busy sites. David Smith promised to change some important things in this area.
  – GOC polling seems (by default) too frequent (see recent mail from Dave Kant on LCG-ROLLOUT)
    • “we can [submit] fewer jobs to you [… but this] is only a temporary measure” (DK). We need to implement a different strategy.

• Many things are apparently in the works regarding the future of GOC monitoring (John Gordon, GDB 13/10/2004)

Page 11: Monitoring and Troubleshooting Resources


Grid Support, pre-EGEE

• There are too many points of contact here and there, and they often seem not very well correlated
  – LCG GOC @ RAL: http://goc.grid-support.ac.uk/gridsite/gocmain/
  – Testzone page at CERN: http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/lastreport.cgi
    • Useful, but it often says that a site failed a test, and the next time the site is OK. Furthermore, no actions seem to be taken anymore when a site fails.
  – GridICE @ RAL: http://grid-ice.esc.rl.ac.uk/gridice/site/site.php
    • Most sites do not seem to be tracked, though, and I am not sure that the published numbers reflect reality (they don’t for NIKHEF, for example)
  – FAQs abound (but are too uncorrelated):
    • GOC FAQ @ RAL: http://www.gridpp.ac.uk/tb-support/faq/
    • GOC Wiki FAQ @ Taipei: http://goc.grid.sinica.edu.tw/gocwiki/TroubleShootingHistory
    • GGUS FAQ @ Karlsruhe: https://gus.fzk.de/ (seems tailored to GridKA)
  – LCG-ROLLOUT seems a good troubleshooting forum
  – GGUS
  – We (NL) also have our own FAQ and support pages: http://www.dutchgrid.nl/

• See Ron's presentation for more user support details

Page 12: Monitoring and Troubleshooting Resources


Questions (1)

• How can we share expertise for e.g. monitoring, batch systems, or system/farm configurations (with and without firewalls)?
  – But is everybody using the same tools? Probably not, not even within the same region. We need an inventory.

• Many answers can be found (for LCG) in the LCG-ROLLOUT archives
  – About 4000 messages in the last 10 months – can we consolidate?
  – Another interesting (and probably less used) problem database is Savannah
  – There are things that will not be answered or fixed (e.g. because of dependencies, people who left, politics, etc.): what do we do in that case?

• Accounting: how is this currently done across the region?

Page 13: Monitoring and Troubleshooting Resources


Questions (2)

• How does one keep current with e.g. pbs and maui? Goodwill? Or should we explicitly suggest (at least within a region) to upgrade?
  – A potential problem is identifying dependencies when we talk about upgrades to non-middleware software (or even to middleware software, actually – GridKA had a good example)
  – Eventually, an issue for SLAs

• Applications: do we care about what they do to our nodes?
  – In theory, no. In practice, we may want to think about this
  – But this is “Grid”™, so it may well involve extra-region considerations

• Future of the GOC? We need to make better use of the monitoring infrastructure soon
  – Given the complexity of some problems, we can’t rely on reactive monitoring only – it just doesn’t work well in a complex 24x7 environment