why did my job run so long? - cmg

Why did my job run so long?

Speeding Performance by Understanding the Cause

John Baker

MVS Solutions

jbaker@mvssol.com

Agenda

• Where is my application spending its time?

– CPU time, I/O time, wait (queue) times

• What am I waiting for?

– Various flavors of queue time

– What can/should I do about delays?

• Real world comparison

– Stay tuned!

• Q/A

• Conclusions and wrap up

Distribution of Elapsed Time

Elapsed time = CPU time + I/O time + wait times

• CPU time = TCB + SRB

• I/O = IOSQ + PEND + CON + DISC

• Wait (queue) times

– Initiator

– Allocation (ENQ contention)

– System services (HSM recall)

– CPU Delay

– LPAR dispatch

– …

Sample Job A

• Elapsed time over 4 hours

• CPU time almost 1 hour

• I/O time under 10 minutes

= Focus on CPU time

JOB   RUNTM    CPUTM    IOTIME

JOBA  4:23:53  0:48:21  0:09:12
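The breakdown for Job A can be worked through with the deck's own formula (elapsed time = CPU time + I/O time + wait times). The following is a minimal sketch, not a reporting tool: the helper names `to_seconds`, `to_hms`, and `breakdown` are hypothetical, and whatever is not CPU or I/O time is simply treated as wait time.

```python
def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def to_hms(total: int) -> str:
    """Convert whole seconds back to h:mm:ss."""
    hours, rest = divmod(total, 3600)
    minutes, seconds = divmod(rest, 60)
    return f"{hours}:{minutes:02d}:{seconds:02d}"


def breakdown(elapsed: str, cpu: str, io: str) -> dict:
    """Split elapsed time into CPU, I/O, and residual wait.

    Assumption: everything that is not CPU or I/O time counts as wait,
    per the formula elapsed = CPU + I/O + wait.
    """
    e, c, i = to_seconds(elapsed), to_seconds(cpu), to_seconds(io)
    wait = e - c - i
    return {
        "cpu_pct": round(100 * c / e, 1),
        "io_pct": round(100 * i / e, 1),
        "wait": to_hms(wait),
        "wait_pct": round(100 * wait / e, 1),
    }


# Values taken from the Sample Job A table.
job_a = breakdown("4:23:53", "0:48:21", "0:09:12")
print(job_a)
```

For Job A the residual comes out at 3:26:20, which is why the later slides on wait (queue) time matter as much as the CPU and I/O figures.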

Reducing CPU time

• Recompile

– Many improvements in OS updates

• Tune application

– Application Performance Tools

• (e.g. Strobe, FreezeFrame)

– Identify CPU use by area of source code

– Make friends with your developer

Sample Job B

• Elapsed time under 3 hours

• CPU time 20 minutes

• I/O time over 1.5 hours

= Focus on I/O time

JOB   RUNTM    CPUTM    IOTIME

JOBB  2:41:30  0:21:20  1:37:44

Reducing I/O time

• Identify patterns

– sequential vs random; read vs write

• Buffers

– For VSAM consider NSR vs LSR

– Give SORT memory – but not too much!

• Block size

– System-determined generally works well – but check!

– Half track for sequential; No smaller than 2K for random

• Compression

– zEDC looks very impressive!

• Include Storage Subsystem in your capacity planning
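The half-track guideline for sequential data sets can be illustrated with a short calculation. This sketch assumes 3390 device geometry, for which 27,998 bytes is the commonly cited half-track limit, and fixed-length blocked records; the function name `half_track_blksize` is hypothetical. System-determined block size normally does this for you, so treat the code as a way to check a value rather than to set one.

```python
HALF_TRACK_3390 = 27998  # assumption: 3390 geometry, largest half-track block


def half_track_blksize(lrecl: int, limit: int = HALF_TRACK_3390) -> int:
    """Largest multiple of lrecl that fits within the half-track limit.

    Applies to fixed-length blocked records only.
    """
    if lrecl <= 0 or lrecl > limit:
        raise ValueError("lrecl must be between 1 and the half-track limit")
    return (limit // lrecl) * lrecl


print(half_track_blksize(80))   # 27920
print(half_track_blksize(133))  # 27930
```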

Wait/Queue time

[Figure: Time vs Utilization. X-axis: percent utilization, 0% to 100%; legend: Execution]

High Utilization = High wait time

Elapsed time grows nonlinearly with utilization, climbing steeply as utilization approaches 100%. Increasing priority doesn’t make the CPU any faster

• At high utilization levels, wait time is much greater than service time
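The shape of the utilization curve can be reproduced with a textbook single-server queue. The deck does not name a queueing model, so the M/M/1 formula below (wait = service time × ρ / (1 − ρ)) is an assumption chosen for illustration, and `mm1_times` is a hypothetical helper; real systems with several processors and mixed priorities behave differently in detail.

```python
def mm1_times(service_time: float, utilization: float) -> tuple[float, float]:
    """Return (wait, response) for an M/M/1 queue.

    Assumption: single server, random arrivals, exponential service times.
    utilization is a fraction in the range [0, 1).
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    wait = service_time * utilization / (1 - utilization)
    return wait, service_time + wait


# Service time is normalized to 1, so results read as multiples of it.
for pct in (50, 75, 90, 95):
    wait, response = mm1_times(1.0, pct / 100)
    print(f"{pct}% busy: wait = {wait:.1f} x service, response = {response:.1f} x service")
```

Under this model, wait time equals service time at 50% busy and exceeds it at every higher utilization, which matches the point that wait dominates at high utilization.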

Flavors of Queue (wait) Time

• Wait for “server”: initiator / CICS AOR / IMS MPR

• CPU delay (wait for logical CPU)

• I/O delay (IOSQ, PEND, DISC)

• Capping delay (LPAR capped vs actual delay)

• Resource Group maximum enforced

• Wait for LPAR (logical CPU) to be dispatched

– PR/SM weight

– Demand from other LPARs

– CPC/CEC capacity

Initiator Queue

• SMF: R723CQDT

• TOTAL queue time (divide for average per job)

• Just start more inits?

• Not necessarily a good idea

• “Tuning to reduce the number of simultaneously active address spaces to the proper number needed to support a workload can reduce RNI and improve performance”

https://www-304.ibm.com/servers/resourcelink/lib03060.nsf/pages/lsprwork?OpenDocument&pathID=

Automated Initiators: Less is More

[Figure: Benchmark of concurrent jobs, TM vs WLM. X-axis: time (hours), 0 to 9; annotation: TM jobs ahead]

• Concurrency based on performance and utilization

CPU Delay

• Wait for logical CPU

• SMF: R723CCDE

• Work is ready to run but is delayed waiting for access to a CPU

• Related to Service Class / goal / importance

– Dispatching priority

• There is almost always some CPU delay

– Tolerance is subjective

– Are goals/SLAs being met?

• Priorities are relative – overloading leads to thrashing

• Consider discretionary for MTTW

Utilization vs CPU Delay

[Figure: Utilization vs CPU delay. Annotations: a 50% delay may be acceptable; at 100% busy, throughput degrades significantly]

I/O Delay

• IOSQ:

– HyperPAV

• Pend:

– CMR = overloaded controller

– DB = volume contention (reserve?)

– Any remaining = likely channels

• Disconnect

– Random read misses

– Synchronous remote copy

Revisit: sample Job B

• Disconnect time of 0:40:31 = 40% of total 1:37:44

• 40:31 (2431 seconds) divided by 9239646 I/Os…

• = 0.263 ms average disconnect time

• Likely not unreasonable for random reads (consider SSD)

• Could also be replicated writes

• Become familiar with your typical application response times

JOB   RUNTM    CPUTM    IOTIME   SMF30AID  SMF30AIW  EXCPS

JOBB  2:41:30  0:21:20  1:37:44  0:40:31   0:04:40   9239646
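The per-I/O arithmetic for Job B can be checked directly from the table values. This is a minimal sketch: it follows the slide in using the EXCP count as the number of I/Os, which is an approximation, and the helper `to_seconds` is a hypothetical name.

```python
def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string to whole seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Values from the Job B table.
io_total = to_seconds("1:37:44")    # IOTIME
disconnect = to_seconds("0:40:31")  # SMF30AID
pending = to_seconds("0:04:40")     # SMF30AIW
excps = 9_239_646                   # EXCPS, used here as the I/O count

disc_ms_per_io = 1000 * disconnect / excps
pend_ms_per_io = 1000 * pending / excps
disc_share_pct = 100 * disconnect / io_total

print(f"disconnect: {disc_ms_per_io:.3f} ms per I/O")
print(f"pending:    {pend_ms_per_io:.3f} ms per I/O")
print(f"disconnect share of I/O time: {disc_share_pct:.1f}%")
```

The same division applied to SMF30AIW shows pending time is roughly a tenth of disconnect time for this job, so disconnect is the component worth investigating first.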

Capping Delay

• Possible when caps present

• SMF70NSW

– WLM caps the logical CPUs

– Delays LPAR dispatch

• SMF70NCA

– Work is actually delayed for CPU due to capping

• Consider TM automation

[Figure: MSU Demand vs R4HA. Series: LPAR_A, LPAR_B, LPAR_C, CPCR4HA]
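R4HA is the rolling four-hour average of MSU consumption, and it lags instantaneous demand. The sketch below shows that lag with a simple rolling mean. All numbers are hypothetical, hourly samples are used only to keep the example short (real measurements are typically finer grained), and comparing the average against a single `defined_capacity` value is a simplification of how capping is actually triggered.

```python
from collections import deque


def rolling_average(samples, window):
    """Yield the mean of the most recent `window` samples (fewer at the start)."""
    recent = deque(maxlen=window)
    for value in samples:
        recent.append(value)
        yield sum(recent) / len(recent)


# Hypothetical hourly MSU consumption for one LPAR (not from the deck).
msu_per_hour = [100, 120, 200, 260, 300, 280, 150, 90]
defined_capacity = 220  # hypothetical cap, in MSU

r4ha = list(rolling_average(msu_per_hour, window=4))
hours_over_cap = [hour for hour, avg in enumerate(r4ha) if avg > defined_capacity]

print(r4ha)
print(hours_over_cap)  # [5, 6]
```

In this made-up series demand peaks in hour 4, but the average only crosses the cap in hours 5 and 6, and in hour 6 the instantaneous demand (150) is already well below it. That lag is why capping can delay work after the peak has passed.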

Capping vs Delay

LPAR is capped (SMF70NSW)

Work is delayed (SMF70NCA)

Capping can impact all Workloads

[Figure: Velocity of STC_H (Importance 1). Annotation: machine CPU reaches 100%; capping begins]

Batch is not the only workload suffering. Even the most critical workloads are unable to meet their goals

Resource Group (max)

• Also a form of capping (same WLM algorithms)

• Pro: Useful to control “problem” applications

• Con: Static. Not flexible

• R723CCCA

– Resource Group maximum enforced

– Will override Service Class goals

LPAR Dispatch Delay

• Ratio of logical processor busy to physical processor busy

• Not always obvious, but very common!

• Term “Short CPs” introduced by Kathy Walsh (IBM WSC)

– Share, Aug. 2004

– https://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/PRS1077

• MXG: PLCPRDYQ

• Improved with HiperDispatch and IRD

LPAR Dispatch Delay

• More initiators and/or higher dispatching priority will not resolve this problem

LPAR Dispatch Delay

[Figure: LPAR dispatch delay. Series: CPUBSY, MVSBSY; annotation: 40% delay]
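The "ratio of logical processor busy to physical processor busy" can be computed as below. This is a sketch under stated assumptions: MVSBSY is read as the logical (z/OS view) busy percentage and CPUBSY as the physical busy percentage, based only on the chart labels; the sample values are hypothetical and are not the measurements behind the chart; `dispatch_gap` is a made-up helper name.

```python
def dispatch_gap(mvs_busy_pct: float, cpu_busy_pct: float) -> tuple[float, float]:
    """Return (ratio, gap) between logical and physical processor busy.

    Assumption: mvs_busy_pct is logical processor busy as z/OS sees it,
    cpu_busy_pct is the physical processor time the LPAR actually received.
    A ratio well above 1 suggests logical CPs are waiting to be dispatched.
    """
    if cpu_busy_pct <= 0:
        raise ValueError("cpu_busy_pct must be positive")
    ratio = mvs_busy_pct / cpu_busy_pct
    gap = mvs_busy_pct - cpu_busy_pct
    return ratio, gap


# Hypothetical interval: z/OS sees its logical CPs 95% busy,
# while the LPAR received physical CPU for only 70% of the interval.
ratio, gap = dispatch_gap(95.0, 70.0)
print(f"ratio = {ratio:.2f}, gap = {gap:.0f} percentage points")
```

Tracking the gap per interval makes the problem visible even when neither number looks alarming on its own.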

What does this look like in the real world?

Let’s take a trip to the deli

How many in the store at one time?

INITIATORS

Who’s next in line?

Dispatching Priority (Service Class)

How long until I can give my order?

Logical Processor (CP) Busy

How long until I’m done?

Physical Processor (CP) Busy

Forcing one step only stresses the next

Balance

About MVS Solutions

• MVS Solutions Inc. – Installed in over 200 datacenters worldwide

– IBM Partner in Development

• ThruPut Manager – Automated Workload Balancing

– Automated Batch Prioritization

– Automated Capacity Management

Contact me: jbaker@mvssol.com

systemz@cmg.org

Join our Blog at www.thruputmanager.com
