Seismometer CMG-3T & 3ESPc (120s) Accelerometer CMG-5T Digitiser DM24 (6 components)
Why did my job run so long? - CMG
Transcript of Why did my job run so long? - CMG
![Page 1: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/1.jpg)
Why did my job run so long?
Speeding Performance by
Understanding the Cause
John Baker
MVS Solutions
![Page 2: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/2.jpg)
Agenda
• Where is my application spending its time?
– CPU time, I/O time, wait (queue) times
• What am I waiting for?
– Various flavors of queue time
– What can/should I do about delays?
• Real world comparison
– Stay tuned!
• Q/A
• Conclusions and wrap up
2
![Page 3: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/3.jpg)
Distribution of Elapsed Time
Elapsed time = CPU time + I/O time + wait times
3
• CPU time = TCB + SRB
• I/O = IOSQ + PEND + CON + DISC
• Wait (queue) times
– Initiator
– Allocation (ENQ contention)
– System services (HSM recall)
– CPU Delay
– LPAR dispatch
– …
![Page 4: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/4.jpg)
Sample Job A
• Elapsed time over 4 hours
• CPU time almost 1 hour
• I/O time under 10 minutes
= Focus on CPU time
4
JOB RUNTM CPUTM IOTIME
JOBA 4:23:53 0:48:21 0:09:12
![Page 5: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/5.jpg)
Reducing CPU time
• Recompile
– Many improvements in OS updates
• Tune application
– Application Performance Tools
• (e.g. Strobe, FreezeFrame)
– Identify CPU use by area of source code
– Make friends with your developer
5
![Page 6: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/6.jpg)
Sample Job B
• Elapsed time under 3 hours
• CPU time 20 minutes
• I/O time over 1.5 hours
= Focus on I/O time
6
JOB RUNTM CPUTM IOTIME
JOBB 2:41:30 0:21:20 1:37:44
![Page 7: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/7.jpg)
Reducing I/O time
• Identify patterns
– sequential vs random; read vs write
• Buffers
– For VSAM consider NSR vs LSR
– Give SORT memory – but not too much!
• Block size
– System-determined generally works well – but check!
– Half track for sequential; No smaller than 2K for random
• Compression
– zEDC looks very impressive!
• Include Storage Subsystem in your capacity planning
7
![Page 8: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/8.jpg)
Wait/Queue time
8
![Page 9: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/9.jpg)
0
5
10
15
20
25
30
0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90% 95% 100%
Re
spo
nse
/Ela
pse
d T
ime
Percent Utilization
Time vs Utilization
Wait
Execution
High Utilization = High wait time
9
Elapsed time grows exponentially with
utilization. Increasing priority doesn’t make
the CPU any faster
• At high utilization levels, wait time is
much greater than service time
![Page 10: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/10.jpg)
Flavors of Queue (wait) Time
• Wait for “server”: initiator / CICS AOR / IMS MPR
• CPU delay (wait for logical CPU)
• I/O delay (iosq, pend, disconnect)
• Capping delay (LPAR capped vs actual delay)
• Resource Group maximum enforced
• Wait for LPAR (logical CPU) to be dispatched
– PR/SM weight
– Demand from other LPARs
– CPC/CEC capacity
10
![Page 11: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/11.jpg)
Initiator Queue
• SMF: R723CQDT
• TOTAL queue time (divide for average per job)
• Just start more inits?
• Not necessarily a good idea
• “Tuning to reduce the number of simultaneously active
address spaces to the proper number needed to support
a workload can reduce RNI and improve performance”
11 https://www-304.ibm.com/servers/resourcelink/lib03060.nsf/pages/lsprwork?OpenDocument&pathID=
![Page 12: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/12.jpg)
Automated Initiators: Less is More
12
0
50
100
150
200
250
300
350
0 1 2 3 4 5 6 7 8 9
Acti
ve J
ob
s
Time (hours)
Benchmark: TM vs WLM concurrent Jobs
TM
WLM
TM jobs ahead
• Concurrency based on performance and utilization
![Page 13: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/13.jpg)
CPU Delay
• Wait for logical CPU
• SMF: R723CCDE
• Work is ready to run but is delayed access to CPU
• Related to Service Class / goal / importance
– Dispatching priority
• There is almost always some CPU delay
– Tolerance is subjective
– Are goals/SLA’s being met?
• Priorities are relative – overloading leads to thrashing
• Consider discretionary for MTTW 13
![Page 14: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/14.jpg)
0
10
20
30
40
50
60
70
80
90
100
Utilization vs CPU Delay
HIDLY
MDDLY
LODLY
UTIL
Utilization vs CPU delays
14
50% delay may be acceptable
At 100% busy, throughput degrades
significantly
![Page 15: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/15.jpg)
I/O Delay
• IOSQ:
– HyperPAV
• Pend:
– CMR = overloaded controller
– DB = volume contention (reserve?)
– Any remaining = likely channels
• Disconnect
– Random read misses
– Synchronous remote copy
15
![Page 16: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/16.jpg)
Revisit: sample Job B
• Disconnect time of 0:40:31 = 40% of total 1:37:44
• 40:31 (2431 seconds) divided by 9239646 I/O’s…
• = .263 ms average disconnect time
• Likely not unreasonable for random reads (consider SSD)
• Could also be replicated writes
• Become familiar with your typical application response times 16
JOB RUNTM CPUTM IOTIME SMF30AID SMF30AIW EXCPS
JOBB 2:41:30 0:21:20 1:37:44 0:40:31 0:04:40 9239646
![Page 17: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/17.jpg)
Capping Delay
• Possible when caps present
• SMF70NSW
– WLM caps the logical CPUs
– Delays LPAR dispatch
• SMF70NCA
– Work is actually delayed for CPU due to capping
• Consider TM automation
17
![Page 18: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/18.jpg)
0
200
400
600
800
1000
1200
0:0
0:0
0
0:1
5:0
0
0:3
0:0
0
0:4
5:0
0
1:0
0:0
0
1:1
5:0
0
1:1
5:0
0
1:3
0:0
0
1:4
5:0
0
2:0
0:0
0
2:1
5:0
0
2:3
0:0
0
2:4
5:0
0
3:0
0:0
0
3:1
5:0
0
3:3
0:0
0
3:4
5:0
0
4:0
0:0
0
4:1
5:0
0
4:3
0:0
0
4:4
5:0
0
5:0
0:0
0
5:1
5:0
0
5:3
0:0
0
5:4
5:0
0
6:0
0:0
0
6:1
5:0
0
6:3
0:0
0
6:4
5:0
0
7:0
0:0
0
7:1
5:0
0
7:3
0:0
0
7:4
5:0
0
8:0
0:0
0
8:0
0:0
0
MSU
/HR
MSU Demand vs R4HA
LPAR_C
LPAR_B
LPAR_A
CPCR4HA
CAP
Capping vs Delay
LPAR is capped (SMF70NSW)
Work is delayed SMF70NCA
![Page 19: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/19.jpg)
Capping can impact all Workloads
0
10
20
30
40
50
60
70
80
90
100
Ve
loci
ty
STC_H: Importance 1
Goal
Velocity
Machine CPU reaches 100% Capping
begins
Batch is not the only workload suffering. Even the most
critical workloads are unable to meet their goal
![Page 20: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/20.jpg)
Resource Group (max)
• Also a form of capping (same WLM algorithms)
• Pro: Useful to control “problem” applications
• Con: Static. Not flexible
• R723CCCA
– Resource Group maximum enforced
– Will override Service Class goals
20
![Page 21: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/21.jpg)
LPAR Dispatch Delay
• Ratio of logical processor busy to physical processor busy
• Not always as obvious but very common!
• Term “Short CPs” introduced by Kathy Walsh (IBM WSC)
– Share, Aug. 2004
– https://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/PRS1077
• MXG: PLCPRDYQ
• Improved with Hiperdispatch and IRD
21
![Page 22: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/22.jpg)
LPAR Dispatch Delay
• More initiators and/or higher dispatching
priority will not resolve this problem 22
0
10
20
30
40
50
60
70
80
90
100
LPAR Dispatch Delay
INITS
CPUBSY
MVSBSY
40% delay
![Page 23: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/23.jpg)
What does this look like in the real world?
23
Let’s take a trip to the deli
![Page 24: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/24.jpg)
How many in the store at one time?
24
INITIATORS
![Page 25: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/25.jpg)
Who’s next in line?
25
Dispatching Priority (Service Class)
![Page 26: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/26.jpg)
How long til I can give my order?
26
Logical Processor (CP) Busy
![Page 27: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/27.jpg)
How long til I’m done!
27
Physical Processor (CP) Busy
![Page 28: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/28.jpg)
Forcing one step only stresses the next
28
![Page 29: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/29.jpg)
29
![Page 30: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/30.jpg)
Balance
30
![Page 31: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/31.jpg)
About MVS Solutions
• MVS Solutions Inc. – Installed in over 200 datacenters worldwide
– IBM Partner in Development
• ThruPut Manager – Automated Workload Balancing
– Automated Batch Prioritization
– Automated Capacity Management
31
Contact me: [email protected]
Join our Blog at www.thruputmanager.com
![Page 32: Why did my job run so long? - CMG](https://reader031.fdocuments.us/reader031/viewer/2022022706/6219cf14ac678f364e277fe4/html5/thumbnails/32.jpg)