z/VM 6.3 - Changes in Hypervisor Behavior to Support Partitions with up to 1 TB of RAM per...
1
z/VM 6.3 – Changes in hypervisor behavior to support Large LPARs
Lívio Sousa - [email protected] IBM z/VM and Linux on System z LA ATS!
http://br.linkedin.com/in/liviosousa
2
Overview
• z/VM Version 6 Product Evolution
• Implementation of HiperDispatch
  – Dispatching Affinity
  – Vertical CPU Management
• Large Memory Support
• Studying MONWRITE Data
3
z/VM Version 6 Product Evolution
Diagram: consolidation of many smaller z/VM images onto fewer, larger ones across releases, with per-image limits:
– z/VM 6.1: 32 IFLs, 256 GB real + 128 GB expanded storage (each)
– z/VM 6.2: 32 IFLs, 256 GB real + 128 GB expanded storage (each)
– z/VM 6.3: 32 IFLs, 1 TB real + 128 GB expanded storage (each)
4
Reduce the number of z/VM systems
§ CPU
  – Exploit HiperDispatch to improve processor efficiency, allowing more work to be done per IFL and therefore supporting more virtual servers per IFL, requiring fewer systems for applicable workloads
§ Storage
  – Expand z/VM systems constrained by memory up to four times (from 256 GB to 1 TB of real storage) in a single z/VM image
  – Expand the real memory used in a Single System Image cluster up to 4 TB
Diagram: several z/VM 6.2 systems consolidating into fewer z/VM 6.3 systems.
5
Implementation of HiperDispatch
§ Improved processor efficiency
  – Better n-way curves
    • Supported processor limit of 32 remains unchanged
  – Better use of processor cache to take advantage of the cache-rich system design of more recent machines
§ Two components:
  – Dispatching affinity
  – Vertical CPU management
6
What It Means to Reduce CPU Wait Time
Diagram: two timelines (in clock cycles) for an "A R3,MEMWORD" instruction. The cycles per instruction (CPI) split into instruction-complexity CPI (aka infinite CPI), spent doing work, and cache-miss CPI (aka finite CPI), spent waiting for memory.
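To make that split concrete, here is a minimal sketch with made-up numbers (the instruction and cycle counts are illustrative assumptions, not zEC12 measurements):

```python
# Illustrative CPI breakdown: total CPI = infinite CPI + finite CPI.
# All numbers below are assumptions for illustration, not zEC12 measurements.

instructions = 1_000_000      # instructions executed by the workload
busy_cycles = 4_200_000       # processor cycles consumed while running them
infinite_cpi = 1.5            # assumed cycles/instruction if every access hit in cache

total_cpi = busy_cycles / instructions     # 4.2 cycles per instruction
finite_cpi = total_cpi - infinite_cpi      # 2.7 cycles lost waiting for memory

print(f"total CPI    : {total_cpi:.2f}")
print(f"infinite CPI : {infinite_cpi:.2f} (instruction complexity)")
print(f"finite CPI   : {finite_cpi:.2f} (cache misses, waiting for memory)")
```

The better the dispatcher keeps a virtual CPU near its cached data, the smaller the finite CPI component becomes.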
System z EC12 - Multi-Chip Module (MCM) Cache Layers
• L1 cache per core: 36 cores * (64 KB + 96 KB) = 5.6 MB
• L2 cache per core: 36 cores * 2 MB = 72 MB
• L3 cache shared by 6 cores per chip: 6 chips * 48 MB = 288 MB
• L4 cache shared by 24 cores: 2 * 192 MB L4 chips = 384 MB
• Cache total per MCM: L1 + L2 + L3 + L4 = 749.6 MB
• Cache total per zEC12 with 4 books: 4 * MCM = 2.9 GB
Diagram: six hexacore chips per MCM, each with per-core L1/L2 caches and a shared 48 MB L3, connected to two shared 192 MB L4 chips.
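A quick sketch that simply redoes the slide's cache arithmetic (all sizes taken from the bullets above):

```python
# Re-derive the zEC12 MCM cache totals from the slide's figures (sizes in MB).
cores_per_mcm = 36
l1_per_core_mb = (64 + 96) / 1024     # 64 KB + 96 KB of L1 per core
l2_per_core_mb = 2
chips, l3_per_chip_mb = 6, 48
l4_chips, l4_per_chip_mb = 2, 192

l1 = cores_per_mcm * l1_per_core_mb   # ~5.6 MB
l2 = cores_per_mcm * l2_per_core_mb   # 72 MB
l3 = chips * l3_per_chip_mb           # 288 MB
l4 = l4_chips * l4_per_chip_mb        # 384 MB

per_mcm_mb = l1 + l2 + l3 + l4        # ~749.6 MB
per_cec_gb = 4 * per_mcm_mb / 1024    # 4 books -> ~2.9 GB
print(f"per MCM: {per_mcm_mb:.1f} MB, per 4-book zEC12: {per_cec_gb:.1f} GB")
```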
HiperDispatch – Dispatching Affinity
■ Processor cache structures become increasingly complex and critical to performance
■ Goal is to re-dispatch work close (in terms of topology) to where it last ran
9
HiperDispatch – Dispatching Affinity
§ Dispatcher is aware of the cache and memory topology
  – Dispatch a virtual CPU near where its data may be in cache, based on where the virtual CPU was last dispatched
§ Better use of cache can reduce the execution time of a set of related instructions
§ z/VM 6.2 and earlier use “soft” affinity to dispatch virtual CPUs
  – No awareness of chip or book
10
HiperDispatch – Vertical CPU Management
§ Today's “horizontal” management distributes the LPAR weight evenly across the logical processors of the z/VM LPAR
§ “Vertical” management attempts to minimize the number of logical processors in use, so that LPAR can concentrate the partition weight on those logical CPUs
§ Example (see the sketch below):
  – Ten physical IFLs, seven logical IFLs, weight of 400 out of 1000
    • Each logical IFL (LPU) is entitled to 57% of an IFL
  – When the CEC is constrained, the LPAR is limited to its entitlement of four IFLs, so seven logical IFLs are more than required
  – z/VM and LPAR will cooperate
    • z/VM will concentrate the workload on a smaller number of logical processors
    • LPAR will redistribute the partition weight to give a greater portion to this smaller number of logical processors (~100% of four CPUs)
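A minimal sketch of the entitlement arithmetic, using only the numbers from the example above:

```python
import math

# Entitlement arithmetic for the slide's example; all inputs come from the slide.
physical_ifls = 10
logical_ifls = 7
my_weight, total_weight = 400, 1000

# The partition's entitlement, in whole IFLs of CPU power.
entitlement_ifls = physical_ifls * my_weight / total_weight    # 4.0 IFLs

# Horizontal mode spreads that entitlement evenly over all logical CPUs.
per_lpu_horizontal = entitlement_ifls / logical_ifls           # ~57% of an IFL each

# Vertical mode concentrates it on as few logical CPUs as possible.
vertical_lpus = math.ceil(entitlement_ifls)                    # 4 LPUs at ~100%

print(f"entitlement: {entitlement_ifls:.1f} IFLs")
print(f"horizontal : {per_lpu_horizontal:.0%} of an IFL on each of {logical_ifls} LPUs")
print(f"vertical   : ~100% of an IFL on {vertical_lpus} LPUs")
```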
11
Horizontal vs. Vertical CPU Management
Horizontal:
§ The logical processors are all created and treated equally.
§ z/VM dispatches work evenly across the seven logical processors.
Vertical:
§ The logical processors are skewed so that some get a greater share of the weight.
§ z/VM dispatches work according to those heavier-weighted logical processors.
Diagram: in horizontal mode, each of the seven LPUs maps to 57% of a physical IFL; in vertical mode, the same entitlement concentrates onto four LPUs, each of which looks, in concept, like a full (100%) physical IFL.
z/VM HiperDispatch: VMDBK Steal
12
OLD WAY: 0 → 1 → 2 → 3 → 4 … → 19 → 0. Steal from your neighbor by CPU number and work your way around the ring. This is not topologically informed.
NEW WAY: (Easy) steal within your chip. (Harder) steal within your book. (Still harder) steal across books. This is topologically informed.
CP Monitor has been updated to log out steal behavior as a function of topology drag distance.
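The stealing order can be sketched as a toy model; the CPU layout and helper below are invented for illustration and are not z/VM's actual implementation:

```python
# Toy model of topology-informed VMDBK stealing: prefer CPUs on the same chip,
# then the same book, then anywhere else. The layout below is invented purely
# to illustrate the ordering; it is not z/VM's actual data structure.
from collections import namedtuple

Cpu = namedtuple("Cpu", "number chip book")

def steal_order(me, cpus):
    """Return the other CPUs sorted by topology 'drag distance' from `me`."""
    def distance(other):
        if other.book == me.book and other.chip == me.chip:
            return 0        # easy: same chip
        if other.book == me.book:
            return 1        # harder: same book, different chip
        return 2            # still harder: different book
    return sorted((c for c in cpus if c is not me), key=distance)

# Example: 8 logical CPUs, 2 per chip, 2 chips per book, 2 books.
cpus = [Cpu(n, chip=n // 2, book=n // 4) for n in range(8)]
print([c.number for c in steal_order(cpus[0], cpus)])
# -> [1, 2, 3, 4, 5, 6, 7]: same chip first, then same book, then across books
```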
z/VM HiperDispatch: Various Numbers of HEAVY Tiles
13
Chart legend: Blue = 6.2.0; Red = 6.3.0 horizontal with reshuffle; Orange = 6.3.0 vertical with reshuffle; Green = 6.3.0 vertical with rebalance.
Synthetic, memory-touching workload. A HEAVY tile is 540% busy:
– 1-CPU guest, 15% busy
– 4-CPU guest with each CPU 31% busy
– 8-CPU guest with each CPU 50% busy
– No I/O, paging, etc.
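The 540% figure is simply the sum of the per-guest utilizations; a quick check:

```python
# Total CPU demand of one HEAVY tile, summed from the slide's guest definitions.
guests = [
    (1, 15),   # 1-CPU guest, 15% busy per CPU
    (4, 31),   # 4-CPU guest, each CPU 31% busy
    (8, 50),   # 8-CPU guest, each CPU 50% busy
]
total_busy = sum(n_cpus * busy for n_cpus, busy in guests)
print(f"one HEAVY tile is about {total_busy}% busy")   # 539%, i.e. roughly 540%
```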
z/VM HiperDispatch: Knobs
14
Concept → Knob:
– Horizontal or vertical: SET SRM POLARIZATION { HORIZONTAL | VERTICAL }
– How optimistically to predict XPF floors: SET SRM [TYPE cpu_type] EXCESSUSE { HIGH | MED | LOW }
– How much CPUPAD safety margin to allow when we park below available power: SET SRM [TYPE cpu_type] CPUPAD nnnn%
– Reshuffle or rebalance: SET SRM DSPWDMETHOD { RESHUFFLE | REBALANCE }
Defaults: - Vertical mode - EXCESSUSE MEDIUM (70%-confident floor) - CPUPAD 100% - Reshuffle
CP Monitor has been updated to log out the changes to these new SRM settings.
z/VM HiperDispatch: Global Performance Data
“Global Performance Data” is a setting in the partition’s activation profile, “Security” category
– You can also use the SE’s “Change LPAR Security” function to change it while the partition is up
– z/VM can handle changes in GPD without a re-IPL
GPD is on by default (in a DR scenario, ask your partition provider about it)
When it is on, the partition can see performance data about all partitions
– Their weights
– How much CPU they are consuming
That performance data lets the z/VM system:
– Determine every partition’s entitlement
– Determine how much entitled power is being consumed
– Determine how much excess power is available (XP = TP – EP)
– Determine which partitions are over-consuming
– Calculate the z/VM system’s XPF
z/VM HiperDispatch is substantially crippled if you fail to enable GPD for the partition
– You might see HCP1052I, “Global performance data is disabled. This may degrade system performance.”
– You can always use CP QUERY SRM to find out whether GPD is on for your partition
15
16
Large Memory Support
§ Real memory limit raised from 256 GB to 1 TB
  – Proportionately increases total virtual memory, based on tolerable overcommitment levels and workload dependencies (see the sketch after this list)
§ Virtual machine memory limit remains unchanged at 1 TB
§ Paging DASD utilization and requirements change
  – Removes the need to double the paging space on DASD
  – Paging algorithm changes increase the need for a properly configured paging subsystem
§ Expanded Storage continues to be supported, with a limit of 128 GB
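A minimal sketch of what a tolerable overcommitment level translates to; the ratio and guest size below are assumptions for illustration only:

```python
# Illustrative memory overcommitment arithmetic. The ratio and guest size are
# assumptions for this example, not recommendations; derive yours from the
# workload's tolerable overcommitment level.
real_memory_gb = 1024          # a z/VM 6.3 partition with 1 TB of real storage
overcommit_ratio = 1.5         # assumed tolerable virtual:real ratio

total_virtual_gb = real_memory_gb * overcommit_ratio   # 1536 GB of virtual memory
guest_size_gb = 8                                       # assumed average guest size

guests = int(total_virtual_gb // guest_size_gb)
print(f"{total_virtual_gb:.0f} GB virtual -> roughly {guests} guests of {guest_size_gb} GB each")
```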
New Approach: The Big State Diagram
17
Diagram: frames move from the frame-owned lists, through the global aging list, to the available lists (kept separately for <2G and >2G frames, single and contiguous). Early writes write only changed pages; referenced pages on the aging list can be reclaimed back to a frame-owned list; the available lists hand frames to whoever needs them.
Demand scan pushes frames:
– from frame-owned valid sections, down to…
– frame-owned IBR sections, then down to…
– the global aging list, then over to…
– the available lists, from which they…
– are used to satisfy requests for frames
New Approach: How We Now Use Paging DASD
18
Diagram: the global aging list runs from newest to oldest pages, with optional prewriting to paging DASD. One I/O is either a read or a write (across many volumes, of course).
19
Large Memory Support (cont.)
Reorder processing removed
– Commands remain, but have no impact
– Improves the environment for running larger virtual machines
Improved effectiveness of the CP SET RESERVE command
– Stronger “glue” to hold reserved pages in memory
– Support for reserving pages of an NSS or DCSS
  • Example: use with the Monitor Segment (MONDCSS)
– Ability to limit the overall number of reserved pages for the system
20
Dump Support (Enhanced)
§ Stand-alone Dump utility has been rewritten
  – Creates a CP hard abend format dump
  – Dump is written to ECKD™ or SCSI DASD
§ Larger memory sizes supported, up to a maximum of 1 TB
  – Includes stand-alone dump, hard abend dump, SNAPDUMP, DUMPLD2, and the VM Dump Tool
§ Performance improvements for hard abend dump
  – Reduces the time to take a CP hard abend dump
21
Studying MONWRITE Data
• z/VM Performance Toolkit
  – Interactively: possible, but not so useful
  – PERFKIT BATCH command: pretty useful
    • Control files tell Perfkit which reports to produce
    • You can then inspect the reports by hand or programmatically (see the sketch below)
  – See z/VM Performance Toolkit Reference for information on how to use PERFKIT BATCH
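For the programmatic route, a minimal sketch, assuming the batch run leaves a plain-text listing in which the FCXnnn report names appear literally (the file name and layout are assumptions about your setup, not Perfkit specifics):

```python
# Hypothetical post-processing of a PERFKIT BATCH listing: count how often each
# FCXnnn report name shows up. The file name and the assumption that report
# names appear literally in the listing text are mine, not Perfkit's spec.
import re
from collections import Counter

report_name = re.compile(r"\bFCX\d{3}\b")
counts = Counter()

with open("perfkit_batch.listing", encoding="ascii", errors="replace") as listing:
    for line in listing:
        counts.update(report_name.findall(line))

for name, hits in counts.most_common():
    print(f"{name}: seen on {hits} lines")
```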
• Brian Wade
  – MONWRITE Collector: http://www.vm.ibm.com/devpages/bkw/linmon.html
  – PRFIT: http://www.vm.ibm.com/download/packages/descript.cgi?PRFIT
22
Some Final Thoughts
• Large z/VM 6.3 partitions require more attention
• Remember to turn on Global Performance Data
• Vertical mode is on by default
• z/VM Performance Toolkit has been updated
• Remember to measure before and after migration
• Studying MONWRITE data will help you understand the behavior of the environment
23
Thank You!
Contact Information: Livio Sousa IBM Tutóia – SP [email protected] +55 11 9 7203 6637