Monitoring HTCondor Andrew Lahiff STFC Rutherford Appleton Laboratory European HTCondor Site Admins Meeting 2014



Monitoring HTCondor

Andrew Lahiff

STFC Rutherford Appleton Laboratory

European HTCondor Site Admins Meeting 2014


Introduction
• Two aspects of monitoring
  – General overview of the system
    • How many running/idle jobs? By user/VO? By schedd?
    • How full is the farm?
    • How many draining worker nodes?
  – More detailed views
    • What are individual jobs doing?
    • What’s happening on individual worker nodes?
    • Health of the different components of the HTCondor pool
• ...in addition to Nagios


Introduction
• Methods
  – Command line utilities
  – Ganglia
  – Third-party applications (which run command-line tools or use the Python API)


Command line
• Three useful commands
  – condor_status
    • Overview of the pool (including jobs, machines)
    • Information about specific worker nodes
  – condor_q
    • Information about jobs in the queue
  – condor_history
    • Information about completed jobs


Overview of jobs

-bash-4.1$ condor_status -collector

Name Machine RunningJobs IdleJobs HostsTotal

[email protected]. condor01.gridpp.rl 10608 8355 11347

[email protected]. condor02.gridpp.rl 10616 8364 11360


Overview of machines

-bash-4.1$ condor_status -total

Total Owner Claimed Unclaimed Matched Preempting Backfill

X86_64/LINUX 11183 95 10441 592 0 0 0

Total 11183 95 10441 592 0 0 0
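As a rough illustration of the "how full is the farm?" question, the Claimed and Total columns above can be turned into an occupancy figure. This is only a minimal Python sketch: the helper name `occupancy_percent` is hypothetical, and the numbers are copied from the X86_64/LINUX row above.

```python
def occupancy_percent(claimed: int, total: int) -> float:
    """Return the percentage of slots that are claimed (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * claimed / total


# Values copied from the X86_64/LINUX row of the condor_status -total output.
print(f"{occupancy_percent(10441, 11183):.1f}% of slots claimed")
```

Treating Claimed/Total as "fullness" is a simplification; slots in other states (for example draining) would need separate handling.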


Jobs by schedd

-bash-4.1$ condor_status -schedd

Name Machine TotalRunningJobs TotalIdleJobs TotalHeldJobs

arc-ce01.gridpp.rl.a arc-ce01.g 2388 1990 13

arc-ce02.gridpp.rl.a arc-ce02.g 2011 1995 31

arc-ce03.gridpp.rl.a arc-ce03.g 4272 1994 9

arc-ce04.gridpp.rl.a arc-ce04.g 1424 2385 12

arc-ce05.gridpp.rl.a arc-ce05.g 1 0 6

cream-ce01.gridpp.rl cream-ce01 266 0 0

cream-ce02.gridpp.rl cream-ce02 247 0 0

lcg0955.gridpp.rl.ac lcg0955.gr 0 0 0

lcgui03.gridpp.rl.ac lcgui03.gr 3 0 0

lcgui04.gridpp.rl.ac lcgui04.gr 0 0 0

lcgvm21.gridpp.rl.ac lcgvm21.gr 0 0 0

TotalRunningJobs TotalIdleJobs TotalHeldJobs

Total 10612 8364 71
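For use in a script, the same per-schedd numbers can be requested as bare columns, for example with `condor_status -schedd -af Name TotalRunningJobs TotalIdleJobs TotalHeldJobs`, and summed. The sketch below only parses text of that shape and does not call HTCondor itself; `sum_schedd_totals` is a hypothetical name and the sample rows are copied from the table above (with names truncated as displayed there).

```python
from typing import Dict


def sum_schedd_totals(text: str) -> Dict[str, int]:
    """Sum running/idle/held jobs over lines of 'name running idle held'."""
    totals = {"running": 0, "idle": 0, "held": 0}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        _name, running, idle, held = fields
        totals["running"] += int(running)
        totals["idle"] += int(idle)
        totals["held"] += int(held)
    return totals


# Sample rows copied from the slide (Machine column dropped).
SAMPLE = """\
arc-ce01.gridpp.rl.a 2388 1990 13
arc-ce02.gridpp.rl.a 2011 1995 31
arc-ce03.gridpp.rl.a 4272 1994 9
arc-ce04.gridpp.rl.a 1424 2385 12
arc-ce05.gridpp.rl.a 1 0 6
cream-ce01.gridpp.rl 266 0 0
cream-ce02.gridpp.rl 247 0 0
lcg0955.gridpp.rl.ac 0 0 0
lcgui03.gridpp.rl.ac 3 0 0
lcgui04.gridpp.rl.ac 0 0 0
lcgvm21.gridpp.rl.ac 0 0 0
"""

print(sum_schedd_totals(SAMPLE))
```

The sums agree with the Total row printed by condor_status above.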


Jobs by user, schedd

-bash-4.1$ condor_status -submitters

Name Machine RunningJobs IdleJobs HeldJobs

group_ALICE.alice.alice043@g arc-ce01.gridpp.rl 0 0 0

group_ALICE.alice.alicesgm@g arc-ce01.gridpp.rl 540 0 1

group_ATLAS.atlas_pilot.tatl arc-ce01.gridpp.rl 142 0 0

group_ATLAS.prodatls.patls00 arc-ce01.gridpp.rl 82 5 0

group_CMS.cms.cmssgm@gridpp. arc-ce01.gridpp.rl 1 0 0

group_CMS.cms_pilot.ttcms022 arc-ce01.gridpp.rl 214 390 0

group_CMS.cms_pilot.ttcms043 arc-ce01.gridpp.rl 68 100 0

group_CMS.prodcms.pcms004@gr arc-ce01.gridpp.rl 78 476 4

group_CMS.prodcms.pcms054@gr arc-ce01.gridpp.rl 12 910 0

group_CMS.prodcms_multicore. arc-ce01.gridpp.rl 47 102 0

group_DTEAM_OPS.ops.ops047@g arc-ce01.gridpp.rl 0 0 0

group_LHCB.lhcb_pilot.tlhcb0 arc-ce01.gridpp.rl 992 0 2

group_NONLHC.snoplus.snoplus arc-ce01.gridpp.rl 0 0 0


…Jobs by user

RunningJobs IdleJobs HeldJobs

group_ALICE.alice.al 0 0 0

group_ALICE.alice.al 3500 368 5

group_ALICE.alice_pi 0 0 0

group_ATLAS.atlas.at 0 0 0

group_ATLAS.atlas.at 0 0 0

group_ATLAS.atlas_pi 414 12 10

group_ATLAS.atlas_pi 0 0 2

group_ATLAS.prodatls 354 36 11

group_CMS.cms.cmssgm 1 0 0

group_CMS.cms_pilot. 371 2223 0

group_CMS.cms_pilot. 0 0 1

group_CMS.cms_pilot. 68 200 0

group_CMS.prodcms.pc 188 1905 10

group_CMS.prodcms.pc 312 3410 0

group_CMS.prodcms_mu 47 102 0
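The submitter names start with an accounting group such as `group_CMS`, so a per-group (roughly per-VO) view can be derived by summing on the first dot-separated component. The following is a minimal sketch under that assumption; `running_by_group` is a hypothetical helper, and the rows are the arc-ce01 submitters and RunningJobs values from the first part of the condor_status -submitters output.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def running_by_group(rows: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Sum running jobs per leading name component, e.g. 'group_CMS'."""
    totals: Dict[str, int] = defaultdict(int)
    for submitter, running in rows:
        group = submitter.split(".", 1)[0]  # text before the first dot
        totals[group] += running
    return dict(totals)


# (submitter, RunningJobs) pairs copied from the arc-ce01 rows; names are
# truncated exactly as condor_status displayed them.
ROWS = [
    ("group_ALICE.alice.alice043@g", 0),
    ("group_ALICE.alice.alicesgm@g", 540),
    ("group_ATLAS.atlas_pilot.tatl", 142),
    ("group_ATLAS.prodatls.patls00", 82),
    ("group_CMS.cms.cmssgm@gridpp.", 1),
    ("group_CMS.cms_pilot.ttcms022", 214),
    ("group_CMS.cms_pilot.ttcms043", 68),
    ("group_CMS.prodcms.pcms004@gr", 78),
    ("group_CMS.prodcms.pcms054@gr", 12),
    ("group_CMS.prodcms_multicore.", 47),
    ("group_DTEAM_OPS.ops.ops047@g", 0),
    ("group_LHCB.lhcb_pilot.tlhcb0", 992),
    ("group_NONLHC.snoplus.snoplus", 0),
]

for group, running in sorted(running_by_group(ROWS).items()):
    print(group, running)
```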


condor_q

[root@arc-ce01 ~]# condor_q

-- Submitter: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:64454> : arc-ce01.gridpp.rl.ac.uk

ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

794717.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )

794718.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )

794719.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )

794720.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )

794721.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )

794722.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )

794723.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )

794725.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )

794726.0 pcms054 12/3 12:07 0+00:00:00 I 0 0.0 (gridjob )

3502 jobs; 0 completed, 0 removed, 1528 idle, 1965 running, 9 held, 0 suspended
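A monitoring script that wraps the command-line tool may only need the summary line. The sketch below pulls the counts out of a line shaped like the one above; `parse_queue_summary` is a hypothetical helper, and the line format is taken from this output only, so it may differ between HTCondor versions.

```python
import re
from typing import Dict

# Matches pairs such as "1528 idle" or "3502 jobs" in the summary line.
SUMMARY_RE = re.compile(
    r"(\d+) (jobs|completed|removed|idle|running|held|suspended)"
)


def parse_queue_summary(line: str) -> Dict[str, int]:
    """Return a {label: count} mapping for a condor_q summary line."""
    return {label: int(count) for count, label in SUMMARY_RE.findall(line)}


LINE = ("3502 jobs; 0 completed, 0 removed, 1528 idle, "
        "1965 running, 9 held, 0 suspended")
summary = parse_queue_summary(LINE)
print(summary["running"], summary["idle"], summary["held"])
```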


Multi-core jobs

-bash-4.1$ condor_q -global -constraint 'RequestCpus > 1'

-- Schedd: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:39356>

ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

832677.0 pcms004 12/5 14:33 0+00:15:07 R 0 2.0 (gridjob )

832717.0 pcms004 12/5 14:37 0+00:12:02 R 0 0.0 (gridjob )

832718.0 pcms004 12/5 14:37 0+00:00:00 I 0 0.0 (gridjob )

832719.0 pcms004 12/5 14:37 0+00:00:00 I 0 0.0 (gridjob )

832893.0 pcms004 12/5 14:47 0+00:00:00 I 0 0.0 (gridjob )

832894.0 pcms004 12/5 14:47 0+00:00:00 I 0 0.0 (gridjob )


Multi-core jobs

• Custom print format

-bash-4.1$ condor_q -global -pr queue_mc.cpf

-- Schedd: arc-ce01.gridpp.rl.ac.uk : <130.246.180.236:39356>

ID OWNER SUBMITTED RUN_TIME ST SIZE CMD CORES

832677.0 pcms004 12/5 14:33 0+00:00:00 R 2.0 (gridjob) 8

832717.0 pcms004 12/5 14:37 0+00:00:00 R 0.0 (gridjob) 8

832718.0 pcms004 12/5 14:37 0+00:00:00 I 0.0 (gridjob) 8

832719.0 pcms004 12/5 14:37 0+00:00:00 I 0.0 (gridjob) 8

832893.0 pcms004 12/5 14:47 0+00:00:00 I 0.0 (gridjob) 8

832894.0 pcms004 12/5 14:47 0+00:00:00 I 0.0 (gridjob) 8

https://htcondor-wiki.cs.wisc.edu/index.cgi/wiki?p=ExperimentalCustomPrintFormats


Jobs with specific DN

-bash-4.1$ condor_q -global -constraint 'x509userproxysubject=="/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1"'

-- Schedd: arc-ce03.gridpp.rl.ac.uk : <130.246.181.25:62763>

ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD

678275.0 tatls015 12/2 17:57 2+06:07:15 R 0 2441.4 (arc_pilot )

681762.0 tatls015 12/3 03:13 1+21:12:31 R 0 2197.3 (arc_pilot )

705153.0 tatls015 12/4 07:36 0+16:49:12 R 0 2197.3 (arc_pilot )

705807.0 tatls015 12/4 08:16 0+16:09:27 R 0 2197.3 (arc_pilot )

705808.0 tatls015 12/4 08:16 0+16:09:27 R 0 2197.3 (arc_pilot )

706612.0 tatls015 12/4 09:16 0+15:09:37 R 0 2197.3 (arc_pilot )

706614.0 tatls015 12/4 09:16 0+15:09:26 R 0 2197.3 (arc_pilot )
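Because a DN contains spaces and the constraint needs nested quotes, scripts that run this query may find it easier to build an argument list than a shell string. The following is a sketch only: `condor_q_args_for_dn` is a hypothetical helper, it does not run condor_q, and it simply rejects DNs containing a double quote rather than attempting to escape them.

```python
from typing import List


def condor_q_args_for_dn(dn: str) -> List[str]:
    """Build a condor_q argument list selecting jobs with the given proxy DN."""
    if '"' in dn:
        # Assumption: refuse rather than guess at ClassAd escaping rules.
        raise ValueError("DN must not contain a double quote")
    constraint = f'x509userproxysubject == "{dn}"'
    return ["condor_q", "-global", "-constraint", constraint]


DN = ("/DC=ch/DC=cern/OU=Organic Units/OU=Users/"
      "CN=atlpilo1/CN=614260/CN=Robot: ATLAS Pilot1")
print(condor_q_args_for_dn(DN))
```

A list like this can be passed to `subprocess.run` without a shell, which sidesteps the quoting issues of the interactive command.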


Jobs killed

• Jobs which were removed

[root@arc-ce01 ~]# condor_history -constraint 'JobStatus == 3'

ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD

823881.0 alicesgm 12/5 01:01 1+06:13:22 X ??? /var/spool/arc/grid03/CVuMDmBSwGlnCIXDjqi

831849.0 tlhcb005 12/5 13:19 0+18:52:26 X ??? /var/spool/arc/grid09/gWmLDm5x7GlnCIXDjqi

832753.0 tlhcb005 12/5 14:38 0+17:07:07 X ??? /var/spool/arc/grid00/5wqKDm7C9GlnCIXDjqi

819636.0 alicesgm 12/4 19:27 1+12:13:56 X ??? /var/spool/arc/grid00/mlrNDmoErGlnCIXDjqi

825511.0 alicesgm 12/5 03:03 0+18:52:10 X ??? /var/spool/arc/grid04/XpuKDmxLyGlnCIXDjqi

823799.0 alicesgm 12/5 00:56 1+05:58:15 X ??? /var/spool/arc/grid03/DYuMDmzMwGlnCIXDjqi

820001.0 alicesgm 12/4 19:48 1+06:43:22 X ??? /var/spool/arc/grid08/cmzNDmpYrGlnCIXDjqi

833589.0 alicesgm 12/5 16:01 0+14:06:34 X ??? /var/spool/arc/grid09/HKSLDmqUAHlnCIXDjqi

778644.0 tlhcb005 12/2 05:56 4+00:00:10 X ??? /var/spool/arc/grid00/pIJNDm6cvFlnCIXDjqi
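The constraint above relies on the numeric JobStatus codes, where 3 means removed (shown as X in the ST column). As a small reference sketch, the mapping can be kept in a Python dictionary; the helper `status_constraint` is hypothetical and only builds the constraint string.

```python
# JobStatus codes used in HTCondor job ClassAds.
JOB_STATUS = {
    1: "Idle",
    2: "Running",
    3: "Removed",
    4: "Completed",
    5: "Held",
    6: "Transferring Output",
    7: "Suspended",
}


def status_constraint(name: str) -> str:
    """Return a constraint such as 'JobStatus == 3' for a status name."""
    for code, label in JOB_STATUS.items():
        if label.lower() == name.lower():
            return f"JobStatus == {code}"
    raise KeyError(name)


print(status_constraint("removed"))
```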


Jobs killed

• Jobs removed for exceeding memory limit

[root@arc-ce01 ~]# condor_history -constraint 'JobStatus==3 && ResidentSetSize>1024*RequestMemory' -af ClusterId Owner ResidentSetSize RequestMemory

823953 alicesgm 3500000 3000

824438 alicesgm 3250000 3000

820045 alicesgm 3500000 3000

823881 alicesgm 3250000 3000

[root@arc-ce04 ~]# condor_history -constraint 'JobStatus==3 && ResidentSetSize>1024*RequestMemory' -af x509UserProxyVOName | sort | uniq -c

515 alice

5 cms

70 lhcb
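The factor of 1024 in the constraint is a unit conversion: ResidentSetSize is reported in KiB while RequestMemory is in MiB. The sketch below repeats that comparison in Python on the four rows shown above and uses `collections.Counter` as a stand-in for `sort | uniq -c`; the helper name `exceeded_request` and the short VO list are illustrative assumptions, not real pool data.

```python
from collections import Counter


def exceeded_request(rss_kib: int, request_mib: int) -> bool:
    """True if resident set size (KiB) is above the requested memory (MiB)."""
    return rss_kib > 1024 * request_mib


# (ClusterId, Owner, ResidentSetSize, RequestMemory) copied from the output.
ROWS = [
    (823953, "alicesgm", 3500000, 3000),
    (824438, "alicesgm", 3250000, 3000),
    (820045, "alicesgm", 3500000, 3000),
    (823881, "alicesgm", 3250000, 3000),
]

over = [row for row in ROWS if exceeded_request(row[2], row[3])]
print(len(over), "of", len(ROWS), "jobs exceeded their request")

# Counting values, as 'sort | uniq -c' does on the command line.
# The VO names here are a made-up illustrative list.
vo_counts = Counter(["alice", "lhcb", "alice", "cms", "alice"])
print(vo_counts.most_common())
```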


condor_who

• What jobs are currently running on a worker node?

[root@lcg1211 ~]# condor_who

OWNER CLIENT SLOT JOB RUNTIME PID PROGRAM

[email protected] arc-ce02.gridpp.rl.ac.uk 1_2 654753.0 0+00:01:54 15743 /usr/libexec/condor/co

[email protected] arc-ce02.gridpp.rl.ac.uk 1_5 654076.0 0+00:56:50 21916 /usr/libexec/condor/co

[email protected] arc-ce04.gridpp.rl.ac.uk 1_10 1337818.0 0+02:51:34 31893 /usr/libexec/condor/co

[email protected] arc-ce04.gridpp.rl.ac.uk 1_7 1337776.0 0+03:06:51 32295 /usr/libexec/condor/co

[email protected] arc-ce02.gridpp.rl.ac.uk 1_1 651508.0 0+05:02:45 17556 /usr/libexec/condor/co

[email protected] arc-ce03.gridpp.rl.ac.uk 1_4 737874.0 0+05:44:24 5032 /usr/libexec/condor/co

[email protected] arc-ce04.gridpp.rl.ac.uk 1_6 1336938.0 0+08:42:18 26911 /usr/libexec/condor/co

[email protected] arc-ce01.gridpp.rl.ac.uk 1_8 826808.0 1+02:50:16 3485 /usr/libexec/condor/co

[email protected] arc-ce03.gridpp.rl.ac.uk 1_3 722597.0 1+08:44:28 22966 /usr/libexec/condor/co


Startd history

• If STARTD_HISTORY is defined on your WNs

[root@lcg1658 ~]# condor_history

ID OWNER SUBMITTED RUN_TIME ST COMPLETED CMD

841989.0 tatls015 12/6 07:58 0+00:02:39 C 12/6 08:01 /var/spool/arc/grid03/PZ6NDmPQPHlnCIXDjqi

841950.0 tatls015 12/6 07:56 0+00:02:40 C 12/6 07:59 /var/spool/arc/grid03/mckKDm4OPHlnCIXDjqi

841889.0 tatls015 12/6 07:53 0+00:02:33 C 12/6 07:56 /var/spool/arc/grid01/X3bNDmTMPHlnCIXDjqi

841847.0 tatls015 12/6 07:50 0+00:02:35 C 12/6 07:54 /var/spool/arc/grid00/yHHODmfJPHlnCIXDjqi

841816.0 tatls015 12/6 07:48 0+00:02:36 C 12/6 07:51 /var/spool/arc/grid04/iizMDmVHPHlnCIXDjqi

841791.0 tatls015 12/6 07:45 0+00:02:33 C 12/6 07:48 /var/spool/arc/grid00/N3vKDmKEPHlnCIXDjqi

716804.0 alicesgm 12/4 18:28 1+13:15:07 C 12/6 07:44 /var/spool/arc/grid07/TUQNDmUJqGlnzEJDjqI


Ganglia
• condor_gangliad
  – Runs on a single host (can be any host)
  – Gathers daemon ClassAds from the collector
  – Publishes metrics to Ganglia with host spoofing
• At RAL we have the following on one host:

GANGLIAD_VERBOSITY = 2
GANGLIAD_PER_EXECUTE_NODE_METRICS = False
GANGLIAD = $(LIBEXEC)/condor_gangliad
GANGLIA_CONFIG = /etc/gmond.conf
GANGLIAD_METRICS_CONFIG_DIR = /etc/condor/ganglia.d
GANGLIA_SEND_DATA_FOR_ALL_HOSTS = true
DAEMON_LIST = MASTER, GANGLIAD


Ganglia

• Small subset from schedd


Ganglia

• Small subset from central manager


Easy to make custom plots


Total running, idle, held jobs


Running jobs by schedd


Negotiator health

Plots: negotiation cycle duration, number of AutoClusters


Draining & multi-core slots


(Some) Third party tools


Job overview

• Condor Job Overview Monitor

http://sarkar.web.cern.ch/sarkar/doc/condor_jobview.html


Mimic

• Internal RAL application


htcondor-sysview


htcondor-sysview

• Hover mouse over a core to get job information


Nagios
• Most (all?) sites probably use Nagios or an alternative
• At RAL
  – Process checks for condor_master on all nodes
  – Central managers
    • Check for at least 1 collector
    • Check for the negotiator
    • Check for worker nodes
      – Number of startd ClassAds needs to be above a threshold
      – Number of non-broken worker nodes above a threshold
  – CEs
    • Check for schedd
    • Job submission test
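The threshold checks listed above follow the usual Nagios plugin pattern: compare a count against limits and report a status through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The sketch below shows that pattern only; the function name `check_minimum` and the thresholds are hypothetical, and obtaining the real count of startd ClassAds is left out.

```python
from typing import Tuple

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_minimum(count: int, warn_below: int, crit_below: int) -> Tuple[int, str]:
    """Return (exit_code, message) for a 'count must stay above' check."""
    if count < 0:
        return UNKNOWN, f"UNKNOWN: invalid count {count}"
    if count < crit_below:
        return CRITICAL, f"CRITICAL: {count} startd ClassAds (< {crit_below})"
    if count < warn_below:
        return WARNING, f"WARNING: {count} startd ClassAds (< {warn_below})"
    return OK, f"OK: {count} startd ClassAds"


# Hypothetical thresholds, chosen only for illustration.
code, message = check_minimum(count=450, warn_below=500, crit_below=400)
print(code, message)
```

A real plugin would print the message and then exit with the returned code.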