z/VM 6.3 - Changes in Hypervisor Behavior to Support Partitions with up to 1 TB of RAM per...
1
z/VM 6.3 – Changes in hypervisor behavior to support Large LPARs
Lívio Sousa - [email protected] IBM z/VM and Linux on System z LA ATS!
http://br.linkedin.com/in/liviosousa
2
Overview
• z/VM Version 6 Product Evolution
• Implementation of HiperDispatch
  – Dispatching Affinity
  – Vertical CPU Management
• Large Memory Support
• Studying MONWRITE Data
3
z/VM Version 6 Product Evolution
Diagram: consolidation of many smaller z/VM images onto fewer, larger ones across releases, with per-image limits:
– z/VM 6.1: 32 IFLs, 256 GB real + 128 GB expanded storage (each)
– z/VM 6.2: 32 IFLs, 256 GB real + 128 GB expanded storage (each)
– z/VM 6.3: 32 IFLs, 1 TB real + 128 GB expanded storage (each)
4
Reduce the number of z/VM systems
§ CPU
  – Exploit HiperDispatch to improve processor efficiency, allowing more work to be done per IFL and therefore supporting more virtual servers per IFL, requiring fewer systems for applicable workloads
§ Storage
  – Expand z/VM systems constrained by memory up to four times (from 256 GB to 1 TB of real storage) in a single z/VM image
  – Expand the real memory used in a Single System Image cluster up to 4 TB
Diagram: several z/VM 6.2 systems consolidating into fewer z/VM 6.3 systems.
5
Implementation of HiperDispatch
§ Improved processor efficiency
  – Better n-way curves
    • Supported processor limit of 32 remains unchanged
  – Better use of processor cache to take advantage of the cache-rich system design of more recent machines
§ Two components:
  – Dispatching affinity
  – Vertical CPU management
6
What It Means to Reduce CPU Wait Time
Diagram: two timelines (in clock cycles) for an "A R3,MEMWORD" instruction. The cycles per instruction (CPI) split into instruction-complexity CPI (aka infinite CPI), spent doing work, and cache-miss CPI (aka finite CPI), spent waiting for memory.
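To make that split concrete, here is a minimal sketch with made-up numbers (the instruction and cycle counts are illustrative assumptions, not zEC12 measurements):

```python
# Illustrative CPI breakdown: total CPI = infinite CPI + finite CPI.
# All numbers below are assumptions for illustration, not zEC12 measurements.

instructions = 1_000_000      # instructions executed by the workload
busy_cycles = 4_200_000       # processor cycles consumed while running them
infinite_cpi = 1.5            # assumed cycles/instruction if every access hit in cache

total_cpi = busy_cycles / instructions     # 4.2 cycles per instruction
finite_cpi = total_cpi - infinite_cpi      # 2.7 cycles lost waiting for memory

print(f"total CPI    : {total_cpi:.2f}")
print(f"infinite CPI : {infinite_cpi:.2f} (instruction complexity)")
print(f"finite CPI   : {finite_cpi:.2f} (cache misses, waiting for memory)")
```

The better the dispatcher keeps a virtual CPU near its cached data, the smaller the finite CPI component becomes.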
System z EC12 - Multi-Chip Module (MCM) Cache Layers
• L1 cache per core: 36 cores * (64 KB + 96 KB) = 5.6 MB
• L2 cache per core: 36 cores * 2 MB = 72 MB
• L3 cache shared by 6 cores per chip: 6 chips * 48 MB = 288 MB
• L4 cache shared by 24 cores: 2 * 192 MB L4 chips = 384 MB
• Cache total per MCM: L1 + L2 + L3 + L4 = 749.6 MB
• Cache total per zEC12 with 4 books: 4 * MCM = 2.9 GB
Diagram: six hexacore chips per MCM, each with per-core L1/L2 caches and a shared 48 MB L3, connected to two shared 192 MB L4 chips.
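A quick sketch that simply redoes the slide's cache arithmetic (all sizes taken from the bullets above):

```python
# Re-derive the zEC12 MCM cache totals from the slide's figures (sizes in MB).
cores_per_mcm = 36
l1_per_core_mb = (64 + 96) / 1024     # 64 KB + 96 KB of L1 per core
l2_per_core_mb = 2
chips, l3_per_chip_mb = 6, 48
l4_chips, l4_per_chip_mb = 2, 192

l1 = cores_per_mcm * l1_per_core_mb   # ~5.6 MB
l2 = cores_per_mcm * l2_per_core_mb   # 72 MB
l3 = chips * l3_per_chip_mb           # 288 MB
l4 = l4_chips * l4_per_chip_mb        # 384 MB

per_mcm_mb = l1 + l2 + l3 + l4        # ~749.6 MB
per_cec_gb = 4 * per_mcm_mb / 1024    # 4 books -> ~2.9 GB
print(f"per MCM: {per_mcm_mb:.1f} MB, per 4-book zEC12: {per_cec_gb:.1f} GB")
```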
HiperDispatch – Dispatching Affinity
■ Processor cache structures become increasingly complex and critical to performance
■ Goal is to re-dispatch work close (in terms of topology) to where it last ran
9
HiperDispatch – Dispatching Affinity
§ Dispatcher is aware of the cache and memory topology
  – Dispatch a virtual CPU near where its data may be in cache, based on where the virtual CPU was last dispatched
§ Better use of cache can reduce the execution time of a set of related instructions
§ z/VM 6.2 and earlier use “soft” affinity to dispatch virtual CPUs
  – No awareness of chip or book
10
HiperDispatch – Vertical CPU Management
§ Today's “horizontal” management distributes the LPAR weight evenly across the logical processors of the z/VM LPAR
§ “Vertical” management attempts to minimize the number of logical processors in use, so that LPAR can concentrate the partition weight on those logical CPUs
§ Example (see the sketch below):
  – Ten physical IFLs, seven logical IFLs, weight of 400 out of 1000
    • Each logical IFL (LPU) is entitled to 57% of an IFL
  – When the CEC is constrained, the LPAR is limited to its entitlement of four IFLs, so seven logical IFLs are more than required
  – z/VM and LPAR will cooperate
    • z/VM will concentrate the workload on a smaller number of logical processors
    • LPAR will redistribute the partition weight to give a greater portion to this smaller number of logical processors (~100% of four CPUs)
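A minimal sketch of the entitlement arithmetic, using only the numbers from the example above:

```python
import math

# Entitlement arithmetic for the slide's example; all inputs come from the slide.
physical_ifls = 10
logical_ifls = 7
my_weight, total_weight = 400, 1000

# The partition's entitlement, in whole IFLs of CPU power.
entitlement_ifls = physical_ifls * my_weight / total_weight    # 4.0 IFLs

# Horizontal mode spreads that entitlement evenly over all logical CPUs.
per_lpu_horizontal = entitlement_ifls / logical_ifls           # ~57% of an IFL each

# Vertical mode concentrates it on as few logical CPUs as possible.
vertical_lpus = math.ceil(entitlement_ifls)                    # 4 LPUs at ~100%

print(f"entitlement: {entitlement_ifls:.1f} IFLs")
print(f"horizontal : {per_lpu_horizontal:.0%} of an IFL on each of {logical_ifls} LPUs")
print(f"vertical   : ~100% of an IFL on {vertical_lpus} LPUs")
```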
11
Horizontal vs. Vertical CPU Management
Horizontal:
§ The logical processors are all created and treated equally.
§ z/VM dispatches work evenly across the seven logical processors.
Vertical:
§ The logical processors are skewed so that some get a greater share of the weight.
§ z/VM dispatches work according to those heavier-weighted logical processors.
Diagram: in horizontal mode, each of the seven LPUs maps to 57% of a physical IFL; in vertical mode, the same entitlement concentrates onto four LPUs, each of which looks, in concept, like a full (100%) physical IFL.
z/VM HiperDispatch: VMDBK Steal
12
OLD WAY: 0 → 1 → 2 → 3 → 4 … → 19 → 0. Steal from your neighbor by CPU number and work your way around the ring. This is not topologically informed.
NEW WAY: (Easy) steal within your chip. (Harder) steal within your book. (Still harder) steal across books. This is topologically informed.
CP Monitor has been updated to log out steal behavior as a function of topology drag distance.
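The stealing order can be sketched as a toy model; the CPU layout and helper below are invented for illustration and are not z/VM's actual implementation:

```python
# Toy model of topology-informed VMDBK stealing: prefer CPUs on the same chip,
# then the same book, then anywhere else. The layout below is invented purely
# to illustrate the ordering; it is not z/VM's actual data structure.
from collections import namedtuple

Cpu = namedtuple("Cpu", "number chip book")

def steal_order(me, cpus):
    """Return the other CPUs sorted by topology 'drag distance' from `me`."""
    def distance(other):
        if other.book == me.book and other.chip == me.chip:
            return 0        # easy: same chip
        if other.book == me.book:
            return 1        # harder: same book, different chip
        return 2            # still harder: different book
    return sorted((c for c in cpus if c is not me), key=distance)

# Example: 8 logical CPUs, 2 per chip, 2 chips per book, 2 books.
cpus = [Cpu(n, chip=n // 2, book=n // 4) for n in range(8)]
print([c.number for c in steal_order(cpus[0], cpus)])
# -> [1, 2, 3, 4, 5, 6, 7]: same chip first, then same book, then across books
```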
z/VM HiperDispatch: Various Numbers of HEAVY Tiles
13
Chart legend: Blue = 6.2.0; Red = 6.3.0 horizontal with reshuffle; Orange = 6.3.0 vertical with reshuffle; Green = 6.3.0 vertical with rebalance.
Synthetic, memory-touching workload. A HEAVY tile is 540% busy:
– 1-CPU guest, 15% busy
– 4-CPU guest with each CPU 31% busy
– 8-CPU guest with each CPU 50% busy
– No I/O, paging, etc.
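The 540% figure is simply the sum of the per-guest utilizations; a quick check:

```python
# Total CPU demand of one HEAVY tile, summed from the slide's guest definitions.
guests = [
    (1, 15),   # 1-CPU guest, 15% busy per CPU
    (4, 31),   # 4-CPU guest, each CPU 31% busy
    (8, 50),   # 8-CPU guest, each CPU 50% busy
]
total_busy = sum(n_cpus * busy for n_cpus, busy in guests)
print(f"one HEAVY tile is about {total_busy}% busy")   # 539%, i.e. roughly 540%
```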
z/VM HiperDispatch: Knobs
14
Concept → Knob:
– Horizontal or vertical: SET SRM POLARIZATION { HORIZONTAL | VERTICAL }
– How optimistically to predict XPF floors: SET SRM [TYPE cpu_type] EXCESSUSE { HIGH | MED | LOW }
– How much CPUPAD safety margin to allow when we park below available power: SET SRM [TYPE cpu_type] CPUPAD nnnn%
– Reshuffle or rebalance: SET SRM DSPWDMETHOD { RESHUFFLE | REBALANCE }
Defaults: - Vertical mode - EXCESSUSE MEDIUM (70%-confident floor) - CPUPAD 100% - Reshuffle
CP Monitor has been updated to log out the changes to these new SRM settings.
z/VM HiperDispatch: Global Performance Data
“Global Performance Data” is a setting in the partition’s activation profile, “Security” category
– You can also use the SE’s “Change LPAR Security” function to change it while the partition is up
– z/VM can handle changes in GPD without a re-IPL
GPD is on by default (in a DR scenario, ask your partition provider about it)
When it is on, the partition can see performance data about all partitions
– Their weights
– How much CPU they are consuming
That performance data lets the z/VM system:
– Determine every partition’s entitlement
– Determine how much entitled power is being consumed
– Determine how much excess power is available (XP = TP – EP)
– Determine which partitions are over-consuming
– Calculate the z/VM system’s XPF
z/VM HiperDispatch is substantially crippled if you fail to enable GPD for the partition
– You might see HCP1052I, “Global performance data is disabled. This may degrade system performance.”
– You can always use CP QUERY SRM to find out whether GPD is on for your partition
15
16
Large Memory Support
§ Real memory limit raised from 256 GB to 1 TB
  – Proportionately increases total virtual memory, based on tolerable overcommitment levels and workload dependencies (see the sketch after this list)
§ Virtual machine memory limit remains unchanged at 1 TB
§ Paging DASD utilization and requirements change
  – Removes the need to double the paging space on DASD
  – Paging algorithm changes increase the need for a properly configured paging subsystem
§ Expanded Storage continues to be supported, with a limit of 128 GB
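A minimal sketch of what a tolerable overcommitment level translates to; the ratio and guest size below are assumptions for illustration only:

```python
# Illustrative memory overcommitment arithmetic. The ratio and guest size are
# assumptions for this example, not recommendations; derive yours from the
# workload's tolerable overcommitment level.
real_memory_gb = 1024          # a z/VM 6.3 partition with 1 TB of real storage
overcommit_ratio = 1.5         # assumed tolerable virtual:real ratio

total_virtual_gb = real_memory_gb * overcommit_ratio   # 1536 GB of virtual memory
guest_size_gb = 8                                       # assumed average guest size

guests = int(total_virtual_gb // guest_size_gb)
print(f"{total_virtual_gb:.0f} GB virtual -> roughly {guests} guests of {guest_size_gb} GB each")
```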
New Approach: The Big State Diagram
17
Diagram: frames move from the frame-owned lists, through the global aging list, to the available lists (kept separately for <2G and >2G frames, single and contiguous). Early writes write only changed pages; referenced pages on the aging list can be reclaimed back to a frame-owned list; the available lists hand frames to whoever needs them.
Demand scan pushes frames:
– from frame-owned valid sections, down to…
– frame-owned IBR sections, then down to…
– the global aging list, then over to…
– the available lists, from which they…
– are used to satisfy requests for frames
New Approach: How We Now Use Paging DASD
18
Diagram: the global aging list runs from newest to oldest pages, with optional prewriting to paging DASD. One I/O is either a read or a write (across many volumes, of course).
19
Large Memory Support (cont.)
Reorder processing removed
– Commands remain, but have no impact
– Improves the environment for running larger virtual machines
Improved effectiveness of the CP SET RESERVE command
– Stronger “glue” to hold reserved pages in memory
– Support for reserving pages of an NSS or DCSS
  • Example: use with the Monitor Segment (MONDCSS)
– Ability to limit the overall number of reserved pages for the system
20
Dump Support (Enhanced)
§ Stand-alone Dump utility has been rewritten
  – Creates a CP hard abend format dump
  – Dump is written to ECKD™ or SCSI DASD
§ Larger memory sizes supported, up to a maximum of 1 TB
  – Includes stand-alone dump, hard abend dump, SNAPDUMP, DUMPLD2, and the VM Dump Tool
§ Performance improvements for hard abend dump
  – Reduces the time to take a CP hard abend dump
21
Studying MONWRITE Data
• z/VM Performance Toolkit
  – Interactively: possible, but not so useful
  – PERFKIT BATCH command: pretty useful
    • Control files tell Perfkit which reports to produce
    • You can then inspect the reports by hand or programmatically (see the sketch below)
  – See z/VM Performance Toolkit Reference for information on how to use PERFKIT BATCH
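For the programmatic route, a minimal sketch, assuming the batch run leaves a plain-text listing in which the FCXnnn report names appear literally (the file name and layout are assumptions about your setup, not Perfkit specifics):

```python
# Hypothetical post-processing of a PERFKIT BATCH listing: count how often each
# FCXnnn report name shows up. The file name and the assumption that report
# names appear literally in the listing text are mine, not Perfkit's spec.
import re
from collections import Counter

report_name = re.compile(r"\bFCX\d{3}\b")
counts = Counter()

with open("perfkit_batch.listing", encoding="ascii", errors="replace") as listing:
    for line in listing:
        counts.update(report_name.findall(line))

for name, hits in counts.most_common():
    print(f"{name}: seen on {hits} lines")
```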
• Brian Wade
  – MONWRITE Collector: http://www.vm.ibm.com/devpages/bkw/linmon.html
  – PRFIT: http://www.vm.ibm.com/download/packages/descript.cgi?PRFIT
22
Some Final Thoughts
• Large z/VM 6.3 partitions require more attention
• Remember to turn on Global Performance Data
• Vertical mode is on by default
• z/VM Performance Toolkit has been updated
• Remember to measure before and after migration
• Studying MONWRITE data will help you understand the behavior of the environment
23
Thank You!
Contact Information: Livio Sousa IBM Tutóia – SP [email protected] +55 11 9 7203 6637