Resource Management Analysis and Accounting
Mike Showerman, Mark Klein, Joshi Fullop and Jeremy Enos
NCSA Blue Waters
mshow@ncsa.illinois.edu


Page 1: Resource Management Analysis and Accounting Mike Showerman, Mark Klein Joshi Fullop and Jeremy Enos NCSA Blue Waters mshow@ncsa.illinois.edu


Page 2

Interfaces to Accounting data

• Allocations and accounting database
  • Command line
  • Portal
  • User interface
• Spreadsheets of operational metrics
  • For reporting to management (going away I hope)
• Integrated System Console


Page 4

Complex policies can cause confusion

• Prioritizing specific workloads leads to inefficiencies – That’s OK
  • We are capability job focused
• Can be challenging to determine source of utilization drop
  • Policy
  • System issue
  • Available workload

Page 5

Full feature view

Page 6

More typical view

Page 7

Feature descriptions

[Figure labels: Draining; Missing or mismatch; Unavailable; Drain Thrashing]

Page 8

Moab Torque Alps

[Diagram labels: qsub; Allocation and accounting database]

Source: CUG 2013 paper by Matt Ezell (ORNL)

Page 9

Challenges in usage data: A man with one source of usage data knows his usage

• MAM interface (previously Gold)
• The good
  • Realtime logging allocation integration
  • Includes reservations in accounting
• The bad
  • Undocumented
  • No retry if data fails to send
    • Path goes across HSN. You would be shocked how often there is a failure sending data to an outside server.
  • Incomplete data (gres and more)
  • Components communicate but do not coordinate
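One way to close the "no retry" gap is a store-and-forward sender. The following is a minimal Python sketch of that idea only; it does not describe how MAM or Moab actually behave. The names `send_with_retry`, `flush_spool`, the `send` callable and the in-memory `spool` list are all hypothetical.

```python
import time
from typing import Callable, List


def send_with_retry(
    send: Callable[[dict], None],
    record: dict,
    spool: List[dict],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to deliver one usage record; keep it in `spool` if every attempt fails."""
    for attempt in range(attempts):
        try:
            send(record)
            return True
        except OSError:  # includes ConnectionError and timeouts
            # Back off exponentially before the next attempt.
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    spool.append(record)  # nothing is lost; it can be replayed later
    return False


def flush_spool(send: Callable[[dict], None], spool: List[dict]) -> int:
    """Replay spooled records in order; stop at the first failure."""
    delivered = 0
    while spool:
        try:
            send(spool[0])
        except OSError:
            break
        spool.pop(0)
        delivered += 1
    return delivered
```

A real deployment would spool to disk rather than memory so that records survive a restart of the sender.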

Page 10

Semi-Manual analysis

Alpsevents Query

JobId   Date                 Epoch       Event     ResId  apid
485968  2013-11-13 23:33:37  1384407217  bound     1700   0
485968  2013-11-13 23:33:39  1384407219  placed    1700   2418935
485968  2013-11-13 23:33:40  1384407220  released  1700   2418935
485968  2013-11-13 23:33:40  1384407220  placed    1700   2418936
485968  2013-11-13 23:35:03  1384407303  released  1700   2418936
485968  2013-11-13 23:35:03  1384407303  canceled  1700   2418934
485968  2013-11-13 23:35:04  1384407304  removed   1700   2418934

Accounting Database

job_id        485968
login         fiedler
account       jme
machine       nid11293
group_name    vendor_cray
start_time    11/13/2013 11:33:03 PM
end_time      11/13/2013 11:55:30 PM
queue         normal
wallduration  1099
charge        6.11
nodes         20
processors    640
qos           sub_25p

shredded_job_pbs

shredded_job_pbs_id      185482
job_id                   485968
job_array_index          -1
host                     BW
queue                    normal
user                     fiedler
groupname                vendor_cray
ctime                    1384406706
qtime                    1384406706
start                    1384407217
end                      1384407304
etime                    1384406706
exit_status              0
session                  22725
jobname                  test_links
owner                    fiedler@h2ologin1
account                  jme
exec_host                NodesRemoved
resources_used_vmem      157659136
resources_used_mem       11030528
resources_used_walltime  87
resources_used_nodes     20
resources_used_cpus      640
resources_used_cput      1
resource_list_nodes      20:ppn=32:xe
resource_list_neednodes  20:ppn=32:xe
resource_list_walltime   300
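The three record sets for job 485968 can be compared programmatically. The sketch below is one possible way to do the semi-manual analysis in Python, using only values transcribed from this slide; the function and variable names are illustrative. The final check reflects an observation, not something the slide states: the recorded charge equals nodes × wallduration / 3600 rounded to two decimals, which suggests node-hours.

```python
from datetime import datetime

# (epoch, event, apid) rows from the Alps events query.
alps_events = [
    (1384407217, "bound", 0),
    (1384407219, "placed", 2418935),
    (1384407220, "released", 2418935),
    (1384407220, "placed", 2418936),
    (1384407303, "released", 2418936),
    (1384407303, "canceled", 2418934),
    (1384407304, "removed", 2418934),
]

# Selected fields from the accounting database and Torque records.
accounting = {
    "start_time": "11/13/2013 11:33:03 PM",
    "end_time": "11/13/2013 11:55:30 PM",
    "wallduration": 1099,
    "charge": 6.11,
    "nodes": 20,
}
torque = {"start": 1384407217, "end": 1384407304, "resources_used_walltime": 87}


def alps_placed_seconds(events):
    """Sum the placed-to-released interval of every application id."""
    placed_at = {}
    total = 0
    for epoch, event, apid in events:
        if event == "placed":
            placed_at[apid] = epoch
        elif event == "released" and apid in placed_at:
            total += epoch - placed_at.pop(apid)
    return total


def accounting_span_seconds(record):
    """Seconds between the accounting start_time and end_time stamps."""
    fmt = "%m/%d/%Y %I:%M:%S %p"
    start = datetime.strptime(record["start_time"], fmt)
    end = datetime.strptime(record["end_time"], fmt)
    return int((end - start).total_seconds())


alps_seconds = alps_placed_seconds(alps_events)
torque_seconds = torque["end"] - torque["start"]
span_seconds = accounting_span_seconds(accounting)
node_hours = round(accounting["nodes"] * accounting["wallduration"] / 3600, 2)

print("alps placed seconds:     ", alps_seconds)
print("torque end - start:      ", torque_seconds)
print("accounting end - start:  ", span_seconds)
print("accounting wallduration: ", accounting["wallduration"])
print("nodes * wallduration / 3600:", node_hours, "vs charge", accounting["charge"])
```

Each source reports a different duration for the same job, which is the disagreement this slide illustrates.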

Page 11

Integrated System Console

• Does a wide array of tasks
  • Just focusing on relevant parts
• Event and log processing and storage engine
  • Trigger alerts based on event templates
  • Parse and store logged data
    • Moab/torque/alps/hsn/storage/nodes
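Template-driven event processing can be sketched as a list of named regular expressions applied to each log line. This is an illustration in Python, not the ISC implementation. The template names, the regular expressions and the log line formats they match are invented for the example and do not reproduce real Moab, Torque or ALPS log syntax.

```python
import re
from typing import Dict, Iterable, List, Optional, Set, Tuple

Event = Tuple[str, Dict[str, str]]

# Hypothetical event templates: (template name, pattern with named fields).
TEMPLATES = [
    ("alps_cancel", re.compile(r"canceled resId=(?P<res_id>\d+)")),
    ("node_down", re.compile(r"node (?P<nid>nid\d+) state=down")),
]


def match_event(line: str) -> Optional[Event]:
    """Return (template name, parsed fields) for the first matching template."""
    for name, pattern in TEMPLATES:
        found = pattern.search(line)
        if found:
            return name, found.groupdict()
    return None


def process(lines: Iterable[str], alert_on: Set[str]) -> Tuple[List[Event], List[Event]]:
    """Parse every line, store recognized events, and collect those that alert."""
    stored: List[Event] = []
    alerts: List[Event] = []
    for line in lines:
        event = match_event(line)
        if event is None:
            continue  # unrecognized lines are skipped in this sketch
        stored.append(event)
        if event[0] in alert_on:
            alerts.append(event)
    return stored, alerts
```

Keeping the templates as data means new event types can be added without touching the processing loop.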

Page 12

ISC will do more

• Integrated data give us the power to make better decisions
  • When alps sees more than 1 cancel… is the filesystem down? If so, take accounting action; if not, alert
  • Moab time/alps time/torque time out of sync… Adjust charging?
  • Filesystem issue, should walltime limits be increased?
  • Hole in torus… prevent some jobs from starting?
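The first rule above can be written as a small decision function. This Python sketch is an assumption about how such a rule might be encoded; the function names and the returned action strings are hypothetical, and how the filesystem state is determined is left outside the example.

```python
from typing import Iterable, Tuple


def count_cancels(events: Iterable[Tuple[str, int]], res_id: int) -> int:
    """Count 'canceled' events seen for one ALPS reservation id."""
    return sum(1 for event, rid in events if event == "canceled" and rid == res_id)


def cancel_action(cancel_count: int, filesystem_down: bool) -> str:
    """Map the cancel count and filesystem state to a response."""
    if cancel_count <= 1:
        return "none"  # a single cancel is treated as normal
    # More than one cancel: the filesystem state decides the response.
    return "accounting_action" if filesystem_down else "alert"
```

For example, two cancels on reservation 1700 while the filesystem is down would yield `"accounting_action"`, and the same two cancels with a healthy filesystem would yield `"alert"`.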

Page 13

Additional data to collect

• Where is the time really going?
  • Sources: moab/torque/alps issue, hsn, filesystem
• Do you account for it in detail? Suspect time?

[Diagram labels: Job Time; Overhead; Moab Time; Torque Time; Alps Time; User Time]

Page 14

Node state Accounting

• Job failures can cause nodes to become suspect
  • Often very large subsets of the system
  • Overhead has not been quantified
• Extension of the SDB database to trigger on state changes
  • Store node state change data in ISC
  • Account for reduced availability
  • Begin collecting MTTI data
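Once node state changes are stored, reduced availability and MTTI follow from simple arithmetic. The Python sketch below assumes a hypothetical record layout of (epoch seconds, node id, new state) and hypothetical state names such as "up", "suspect" and "down"; it is not the SDB or ISC schema. MTTI is taken here as the observation window divided by the number of interrupts, which is one common definition.

```python
from typing import Dict, Iterable, Optional, Tuple

StateChange = Tuple[int, str, str]  # (epoch seconds, node id, new state)


def unavailable_node_seconds(
    changes: Iterable[StateChange], window_end: int, up_state: str = "up"
) -> int:
    """Sum, over all nodes, the seconds spent in any state other than `up_state`."""
    last: Dict[str, Tuple[int, str]] = {}  # node id -> (epoch, state)
    total = 0
    for epoch, node, state in sorted(changes):
        if node in last:
            prev_epoch, prev_state = last[node]
            if prev_state != up_state:
                total += epoch - prev_epoch
        last[node] = (epoch, state)
    # Nodes still unavailable at the end of the window count up to window_end.
    for prev_epoch, prev_state in last.values():
        if prev_state != up_state:
            total += window_end - prev_epoch
    return total


def mean_time_to_interrupt(window_seconds: float, interrupts: int) -> Optional[float]:
    """Observation window divided by interrupt count; None if there were none."""
    return window_seconds / interrupts if interrupts else None
```

Dividing the unavailable node-seconds by (node count × window length) gives the fraction of capacity lost to suspect or down nodes over the window.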
