
Don’t keep your CPU waiting: Speed reading for machines

www.pivotor.com John.Baker@EPStrategies.com

z/OS Performance Education, Software, and Managed Service Providers


Agenda

• Storage vs storage

• CPU and processor caches

• Storage tiers

• Sync vs Async

• Basic cache principles

• Improving sync access

• HiperDispatch and DAT

• Improving async access

• DSS stats

• RMF stats

• zHyperLink

Storage vs Storage

All roads lead to the CPU

• z14: 5.2 GHz = .192 ns cycle time


How far can I go in .192 ns?

• Speed of light = 300,000 km / sec

• = 30 cm / ns

• 30 cm ~ 12 inches, so 12 inches * .192 = ~2.3 inches per cycle
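A throwaway sanity check of that arithmetic (nothing z14-specific beyond the 5.2 GHz figure):

```python
# Sanity-check the slide's arithmetic: how far does light travel in one z14 cycle?
CLOCK_HZ = 5.2e9                        # 5.2 GHz
LIGHT_CM_PER_NS = 30.0                  # 300,000 km/s = 30 cm per nanosecond

cycle_ns = 1e9 / CLOCK_HZ               # ~0.192 ns
cm_per_cycle = LIGHT_CM_PER_NS * cycle_ns
print(f"cycle time: {cycle_ns:.3f} ns, light travels {cm_per_cycle:.1f} cm "
      f"(~{cm_per_cycle / 2.54:.1f} inches) per cycle")
# -> ~5.8 cm, i.e. roughly the 2.3 inches quoted above
```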


z14 PU Chip

• 14 nm technology

• 6.14 billion transistors

• 23.2 km or 14.4 miles of wire

• Up to 10 cores

• L3 cache on chip

http://www.redbooks.ibm.com/abstracts/sg248451.html?Open

Processor Caches

L1 (on-core): 128 KB instruction, 128 KB data

L2 (on-core): 2 MB instruction, 4 MB data

L3: 64 MB on chip, one shared per 10 cores

L4: 672 MB per drawer

Main memory: 8 TB (multiple DIMMs) per drawer

Storage Tiers

• Sync (speed/$$$ increases toward the top): Core L1/L2 → L3 → L4 → Memory

• Async (capacity/$ increases toward the bottom): Aux (SCM/VFM) → Disk cache → SSD → HDD → Tape
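To make the sync/async split concrete, here is a rough comparison of access times against the 0.192 ns cycle time; the latencies are illustrative textbook orders of magnitude, not z14 measurements:

```python
# Order-of-magnitude access times only -- illustrative, NOT z14 measurements.
CYCLE_NS = 0.192
tiers_ns = {
    "L1":            1,
    "L3":           20,
    "L4":          100,
    "Memory":      300,
    "SSD":     100_000,
    "HDD":   5_000_000,
}
for tier, ns in tiers_ns.items():
    print(f"{tier:>7}: ~{ns / CYCLE_NS:>12,.0f} cycles the CPU could have spent waiting")
# The top of the list is cheap enough to wait for synchronously; by SSD/HDD scale
# it is better to give up the CPU and take an interrupt later (asynchronous I/O).
```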


Caching Principles

• Temporal locality: data that is accessed once will likely be accessed again
– Try to hold data in cache as long as possible (LRU – a toy sketch follows this list)
– Larger caches can hold more

• Spatial locality: data near the target data will likely be accessed as well
– Cache lines and sequential prefetch

• CPU performance goal: the L1 cache must be accessible in a single clock cycle
– L1 must stay small enough to allow this
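A toy sketch of the LRU idea behind temporal locality (hardware caches do this in silicon, per cache set; this only illustrates the concept):

```python
from collections import OrderedDict

class LRUCache:
    """Toy least-recently-used cache illustrating temporal locality."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None                      # cache miss
        self.data.move_to_end(key)           # touched again: now most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)    # evict the least recently used entry
```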

Improving Sync Access


• Measure! Are you collecting SMF 113s?
– Cycles per instruction, L1 miss percentage, RNI

• Configuration
– Small LPARs, logical/physical CP ratio
– HiperDispatch: avoid Vertical Low CPs

• Use large frames for DB2
– Fewer segment/page table entries
– Reduce TLB misses

• Data-in-memory tech
– Reduce cycles to fetch busy tables

• Applications
– Use current compilers
– Check your old assembler


Processor Cache Measurements

• Cycles per instruction

• L1 miss percent

• Relative Nest Intensity (RNI)
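A minimal sketch of how the first two metrics fall out of raw counters; the names and values are hypothetical, not actual SMF 113 field names (RNI is omitted because it needs machine-specific weights for where L1 misses are sourced):

```python
# Illustrative only: hypothetical counter values, not actual SMF 113 field names.
def cache_metrics(instructions, cycles, l1_imiss, l1_dmiss):
    cpi = cycles / instructions                            # cycles per instruction
    l1_miss_pct = 100.0 * (l1_imiss + l1_dmiss) / instructions
    return cpi, l1_miss_pct

cpi, l1_miss = cache_metrics(instructions=1_000_000_000,
                             cycles=2_400_000_000,
                             l1_imiss=30_000_000,
                             l1_dmiss=20_000_000)
print(f"CPI = {cpi:.2f}, L1 miss% = {l1_miss:.2f}")        # CPI = 2.40, L1 miss% = 5.00
```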

Improving Sync Access


• Configuration
– Small LPARs
– Logical/physical CP ratio
– HiperDispatch: avoid Vertical Low CPs

Pre-HiperDispatch: “Horizontal” alignment
(Total weight = 1000, 15 physical CPs)

LPAR     Weight   LCPs   Share of CPU pool
LPAR1    350      7      350/1000 = 35%
LPAR2    100      3      100/1000 = 10%
LPAR3    100      3      100/1000 = 10%
LPAR4    100      3      100/1000 = 10%
LPAR5    200      4      200/1000 = 20%
LPAR6    100      3      100/1000 = 10%
LPAR7    50       2      50/1000 = 5%

• Weight is a guaranteed minimum
• May be exceeded if other LPARs do not use their full share
• Subject to the number of LCPs (which also limits the maximum)

HiperDispatch

• A method of “aligning” physical CPUs with LPARs
• Goal is to take advantage of newer hardware design
• Also reduce multiprocessing overhead (“MP Effect”)
• Default as of z/OS 1.13

HIPERDISPATCH=NO (“Horizontal” alignment):
– % share distributed across all available physical CPUs
– All CPs distributed among all LPARs
– Utilization of all CPs tends to be even

HIPERDISPATCH=YES (“Vertical” alignment):
– % share distributed by physical CP
– LCPs become Vertical High, Vertical Medium, or Vertical Low
– Unused (Vertical Low) CPs are “parked”

Vertical Alignment

• LPAR1 example: 35% share, 7 LCPs
• .35 * 15 CPs (cores) = 5.25 CPs of entitlement
• Thresholds: 1.0 CP = VH, .5 - .99 = VM, < .5 = VL
• First, every LPAR receives at least one VM = “.5 CPs”
• 5.25 - .5 = 4.75
• 4.75 - 4 (four “whole CPs” = VH) = .75
• .75 - .5 = .25 (one more VM)
• .25 (less than .5) = VL
• Result: 4 VH, 2 VM, 1 VL (see the sketch after this slide)

https://www-03.ibm.com/support/techdocs/atsmastr.nsf/WebIndex/TD106389
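A rough sketch of the arithmetic above, assuming the simplified rules on this slide rather than PR/SM’s actual algorithm:

```python
# A rough sketch of the VH/VM/VL arithmetic described on the slide.
# NOT PR/SM's actual algorithm -- just the simplified heuristic above.
def vertical_polarity(weight, total_weight, physical_cps, lcps):
    share = weight / total_weight        # e.g. 350/1000 = 35%
    entitled = share * physical_cps      # e.g. .35 * 15 = 5.25 CPs
    vm = 1                               # every LPAR gets at least one VM (counted as .5 CP)
    remaining = entitled - 0.5
    vh = int(remaining)                  # each whole CP of entitlement becomes a Vertical High
    remaining -= vh
    if remaining >= 0.5:                 # another half CP of entitlement -> a second VM
        vm += 1
    vl = lcps - vh - vm                  # leftover logical CPs are Vertical Low
    return vh, vm, vl

print(vertical_polarity(weight=350, total_weight=1000, physical_cps=15, lcps=7))
# -> (4, 2, 1): 4 VH, 2 VM, 1 VL, matching the LPAR1 example
```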

PR/SM Association

• LPARs and logical processors are associated with drawers, nodes, and physical processors

HiperDispatch Config


Parked Processors

Improving Sync Access


• Consider large frames for large, long-running DB2 work (arithmetic sketched below)
– Dynamic Address Translation (DAT) costs cycles
– The Translation Lookaside Buffer (TLB) covers an ever-smaller percentage of total memory
– Large (1M) and giant (2G) frames require fewer TLB entries
– Fewer segment/page table entries
– Reduced TLB misses
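A back-of-the-envelope illustration of why larger frames relieve TLB pressure; the 20 GB buffer pool is a made-up example, not a recommendation:

```python
def pages_needed(region_bytes, frame_bytes):
    """How many translations are needed to map a region with a given frame size."""
    return region_bytes // frame_bytes

GB = 1024**3
bufferpool = 20 * GB                      # hypothetical large DB2 buffer pool
for label, frame in [("4K", 4 * 1024), ("1M", 1024**2), ("2G", 2 * GB)]:
    print(f"{label} frames: {pages_needed(bufferpool, frame):,} translations to cover the pool")
# 4K frames: 5,242,880 translations
# 1M frames:    20,480 translations
# 2G frames:        10 translations
```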


DAT Cost

• Translation Lookaside Buffer (TLB) miss percent

Improving Sync Access


• Data-in-memory tech
– Reduce cycles to fetch busy tables
– Improved response, at a cost

• Applications
– Can no longer count on large CPU clock gains
– Separate instruction/data caches offer benefits and drawbacks
– Compilers will take care of it – use current compilers!
– Check your old assembler

Improving Async Access


• The best I/O…
– Buy enough memory to avoid paging
– SCM: flash disk on the EC12/z13
– Replaced by VFM (“expanded memory”) on the z14
– Buffers: enough, and the right type (random VSAM)


Aux Storage Use

• Running out of auxiliary storage can lead to a very bad day…

Improving Async Access


• DSS hit ratio
– If less than 90% (for most workloads), consider more cache
– Law of diminishing returns applies
– Consider the I/O “flavor”: random/sequential, read/write
– Some workloads are inherently “cache unfriendly”


DSS Cache Read Miss %

• Healthy read hit ratios should exceed 90%


Random Reads only

• Isolating random reads, by removing their sequential “cache friendly” brethren, gives more insight

Random reads are also referred to as “Normal” reads.
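A minimal sketch of the hit-ratio arithmetic with made-up counts (not RMF/SMF field names), showing how sequential traffic can mask a weak random-read ratio:

```python
# Made-up activity counts -- not actual RMF/SMF field names.
def read_hit_pct(hits, reads):
    return 100.0 * hits / reads

total_reads      = 1_000_000
sequential_reads =   400_000     # prefetched, almost always hits
sequential_hits  =   398_000
random_hits      =   510_000     # "normal" reads

overall     = read_hit_pct(sequential_hits + random_hits, total_reads)
random_only = read_hit_pct(random_hits, total_reads - sequential_reads)
print(f"Overall read hit%:     {overall:.1f}")      # 90.8 -- looks healthy
print(f"Random-only read hit%: {random_only:.1f}")  # 85.0 -- the real exposure
```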

Improving Async Access


• HDD configuration
– 10K, 15K, 7.2K RPM HDDs
– Consider workload access density (sketched below)
– SSD for cache-unfriendly workloads

• DSS Auto Tiering or all flash
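A small sketch of the access-density idea (I/Os per second per GB); the threshold and workload numbers are made up for illustration:

```python
# Access density = I/O rate per usable capacity; the threshold here is made up.
def suggest_tier(io_per_sec, capacity_gb, dense_threshold=1.0):
    density = io_per_sec / capacity_gb          # I/Os per second per GB
    tier = "SSD / flash" if density > dense_threshold else "high-capacity HDD"
    return density, tier

for name, iops, gb in [("PROD-DB", 12_000, 8_000), ("ARCHIVE", 150, 50_000)]:
    density, tier = suggest_tier(iops, gb)
    print(f"{name:8} density={density:6.3f} IO/s/GB -> consider {tier}")
```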


Disk Activity by Type

• Ideally, the SSDs are supporting the bulk of the workload


Response by Disk Type

• SSD benefit is marginal here

Improving Async Access


• Software striping
– Storage Class, sustained data rate
– More granular
– May span storage subsystems

• Compression
– zEDC ends the debate

Improving Async Access (2)


• IOSQ
– SuperPAV
– Old-fashioned dataset placement (SMF 42)

• PEND
– CMR delay: check the DSS front end
– DB (device busy): hardware reserves

• DISCONNECT
– DSS hit ratio
– SSD

• CONNECT
– zHPF, channels, host adapters (HAs)


Response Time Components

• Sub-millisecond is the new standard
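As a reminder of how the components roll up, a minimal sketch with made-up millisecond values, not figures from an actual RMF report:

```python
# Made-up values in milliseconds -- not from an actual RMF device activity report.
components = {
    "IOSQ":       0.05,   # queueing in z/OS before the I/O starts
    "PEND":       0.03,   # includes CMR and device-busy delay
    "DISCONNECT": 0.40,   # waiting on the DSS, e.g. a read cache miss
    "CONNECT":    0.12,   # data transfer over the channel
}
response_ms = sum(components.values())
print(f"Total response time: {response_ms:.2f} ms")   # 0.60 ms -- sub-millisecond
for name, ms in components.items():
    print(f"  {name:<11}{ms:5.2f} ms  ({100 * ms / response_ms:4.1f}% of total)")
```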

Something new: zHyperLink


http://www.redbooks.ibm.com/Redbooks.nsf/RedpieceAbstracts/redp5186.html?Open


zHyperLink Requirements

• z14
• z/OS 2.1
• DS888x with the zHyperLink feature (FC #0431)
• DB2 v11+ only (for now)


zHyperLink Flow

• Conditions

• Why not make everything sync?

http://www.redbooks.ibm.com/Redbooks.nsf/RedpieceAbstracts/redp5186.html?Open
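One way to reason about “why not make everything sync?”: the latencies and CPU costs below are hypothetical placeholders, not published zHyperLink or zHPF figures:

```python
# Hypothetical numbers (microseconds) -- placeholders, not published measurements.
def cheaper_sync(io_latency_us, async_overhead_cpu_us):
    """Synchronous I/O burns CPU for the whole wait; async instead pays a fixed
    undispatch/interrupt/redispatch cost."""
    return io_latency_us < async_overhead_cpu_us

for label, latency in [("DSS cache hit via zHyperLink", 25),
                       ("cache miss serviced from HDD", 5_000)]:
    choice = "sync" if cheaper_sync(latency, async_overhead_cpu_us=60) else "async"
    print(f"{label:32} {latency:>6} us -> {choice}")
```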

“External memory”


• With zHyperLink, the disk subsystem cache effectively becomes “external memory”: it moves from the asynchronous side of the hierarchy (Aux (SCM/VFM), disk cache, SSD, HDD, tape) to the synchronous side (Core L1/L2, L3, L4, memory) – SYNC!


Thank you!

www.pivotor.com John.Baker@EPStrategies.com