VIO Technology
Transcript of VIO Technology
Eserver pSeries
© 2003 IBM Corporation
"Any sufficiently advanced technology will have the appearance of magic."…Arthur C. Clarke
Section 2: The Technology
Section Objectives
On completion of this unit you should be able to:
– Describe the relationship between technology and solutions.
– List key IBM technologies that are part of the POWER5 products.
– Describe the functional benefits that these technologies provide.
– Discuss the appropriate use of these technologies.
IBM and Technology
Science
Technology
Products
Solutions
Technology and innovation
Having technology available is a necessary first step.
Finding creative new ways to use the technology for the benefit of our clients is what innovation is about.
Solution design is an opportunity for innovative application of technology.
When technology won’t ‘fix’ the problem
When the technology is not related to the problem.
When the client has unreasonable expectations.
POWER5 Technology
POWER4 and POWER5 Cores
[Diagram: POWER4 core and POWER5 core side by side]
POWER5
Designed for entry and high-end servers
Enhanced memory subsystem
Improved performance
Simultaneous Multi-Threading
Hardware support for Shared Processor Partitions (Micro-Partitioning)
Dynamic power management
Compatibility with existing POWER4 systems
Enhanced reliability, availability, serviceability
[Diagram: POWER5 chip — two SMT cores sharing a 1.9 MB L2 cache, on-chip L3 directory and memory controllers, enhanced distributed switch, GX+ bus, and chip-chip / MCM-MCM / SMP link interfaces]
Enhanced memory subsystem
Improved L1 cache design
– 2-way set associative i-cache
– 4-way set associative d-cache
– New replacement algorithm (LRU vs. FIFO)
Larger L2 cache
– 1.9 MB, 10-way set associative
Improved L3 cache design
– 36 MB, 12-way set associative
– L3 on the processor side of the fabric
– Satisfies L2 cache misses more frequently
– Avoids traffic on the interchip fabric
On-chip L3 directory and memory controller
– L3 directory on the chip reduces off-chip delays after an L2 miss
– Reduced memory latencies
Improved pre-fetch algorithms
Enhanced memory subsystem
[Diagram: POWER4 vs. POWER5 system structure — in POWER4, the L3 cache sits between the fabric controller and the memory controller; in POWER5, the L3 cache moves to the processor side of the fabric and the memory controller moves on chip. Results: reduced L3 latency, faster access to memory, larger SMPs (64-way), and the number of chips cut in half]
Simultaneous Multi-Threading (SMT)
What is it?
Why would I want it?
POWER4 and POWER5 pipelines
[Diagram: the branch, load/store, fixed-point, and floating-point pipelines, with instruction fetch, instruction crack and group formation, branch redirects, interrupts and flushes, and out-of-order processing]
Instruction pipeline legend: IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, CP = group commit
Multi-threading evolution
Execution unit utilization is low in today's microprocessors
– About 25% average execution unit utilization across a broad spectrum of environments
[Chart: per-cycle occupancy of the FX0, FX1, LS0, LS1, FP0, FP1, BFX, and CRL execution units for a single instruction stream]
Coarse-grained multi-threading
Two instruction streams, but only one thread executes at any instant
Hardware swaps in the second thread when a long-latency event occurs
A swap requires several cycles
[Chart: execution unit occupancy with coarse-grained multi-threading — the two threads alternate at swap points]
Coarse-grained multi-threading (Cont.)
Processor (for example, RS64-IV) is able to store context for two threads
– Rapid switching between threads minimizes lost cycles due to I/O waits and cache misses.
– Can yield ~20% improvement for OLTP workloads.
Coarse-grained multi-threading only beneficial where number of active threads exceeds 2x number of CPUs
– AIX must create a “dummy” thread if there are insufficient numbers of real threads.
• Unnecessary switches to “dummy” threads can degrade performance ~20%
• Does not work with dynamic CPU deallocation
Fine-grained multi-threading
A variant of coarse-grained multi-threading
Threads execute in round-robin fashion
A cycle remains unused when a thread encounters a long-latency event
[Chart: execution unit occupancy with fine-grained multi-threading — threads alternate cycle by cycle]
POWER5 pipeline
[Diagram: the POWER5 instruction pipeline — branch, load/store, fixed-point, and floating-point pipelines with the same stages as POWER4: instruction fetch (IF, IC, BP), decode (D0-D3), transfer (Xfer), group dispatch (GD), mapping (MP), issue (ISS), register file read (RF), execute (EX / EA / DC / F6), format (Fmt), write back (WB), and group commit (CP)]
Simultaneous multi-threading (SMT)
Reducing the number of unused execution units yields a 25-40% performance boost, and sometimes more
[Chart: execution unit occupancy with SMT — instructions from both threads issue to the execution units in the same cycle]
Simultaneous multi-threading (SMT) (Cont.)
Each chip appears as a 4-way SMP to software
– Allows instructions from two threads to execute simultaneously
Processor resources optimized for enhanced SMT performance
– No context switching, no dummy threads
Hardware, POWER Hypervisor, or OS controlled thread priority
– Dynamic feedback of shared resources allows for balanced thread execution
Dynamic switching between single and multithreaded mode
Dynamic resource balancing
Threads share many resources
– Global Completion Table, Branch History Table, Translation Lookaside Buffer, and so on
Higher performance realized when resources balanced across threads
– Tendency to drift toward extremes accompanied by reduced performance
Adjustable thread priority
Instances when unbalanced execution is desirable
– No work for opposite thread
– Thread waiting on lock
– Software-determined non-uniform balance
– Power management
Control instruction decode rate
– Software/hardware controls eight priority levels for each thread
[Chart: instructions per cycle for thread 0 and thread 1 at hardware thread priority pairs 0,7 2,7 4,7 6,7 7,7 7,6 7,4 7,2 7,0 and 1,1 — with power save mode and single-threaded operation annotated]
Single-threaded operation
Advantageous for execution unit limited applications
– Floating or fixed point intensive workloads
Execution unit limited applications provide minimal performance leverage for SMT
– Extra resources necessary for SMT provide higher performance benefit when dedicated to single thread
Determined dynamically on a per processor basis
[Diagram: thread states — Dormant, Null, and Active, with transitions controlled by software or by hardware or software]
Micro-Partitioning
Micro-Partitioning overview
Mainframe inspired technology
Virtualized resources shared by multiple partitions
Benefits
– Finer-grained resource allocation
– More partitions (up to 254)
– Higher resource utilization
New partitioning model
– POWER Hypervisor
– Virtual processors
– Fractional processor capacity partitions
– Operating system optimized for Micro-Partitioning exploitation
– Virtual I/O
Processor terminology
[Diagram: installed physical processors classified as deconfigured, inactive (CUoD), dedicated, or shared; shared processors form the shared processor pool, which delivers entitled capacity to shared processor partitions (SMT on or off) as virtual processors, while dedicated processor partitions own whole processors; with SMT on, each virtual processor appears as logical processors]
Shared processor partitions
Micro-Partitioning allows for multiple partitions to share one physical processor
Up to 10 partitions per physical processor
Up to 254 partitions active at the same time
Partition’s resource definition
– Minimum, desired, and maximum values for each resource
– Processor capacity
– Virtual processors
– Capped or uncapped
• Capacity weight
– Dedicated memory
• Minimum of 128 MB, in 16 MB increments
– Physical or virtual I/O resources
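A partition's resource definition can be sketched as a small data structure. This is an illustrative Python sketch, not an IBM API; the field names are invented, and the validation encodes only the constraints stated here (at least 0.1 processing units per virtual processor, at least 128 MB of memory in 16 MB increments, weight 0-255):

```python
from dataclasses import dataclass

@dataclass
class SharedPartitionProfile:
    """Hypothetical sketch of a shared processor partition's resource definition."""
    min_units: float      # minimum processing units needed to start
    desired_units: float  # processing units requested at activation
    max_units: float      # upper bound for dynamic LPAR operations
    virtual_procs: int    # number of virtual processors
    capped: bool          # capped partitions cannot exceed entitlement
    weight: int = 128     # capacity weight (0-255), meaningful when uncapped
    memory_mb: int = 128  # dedicated memory in MB

    def validate(self) -> bool:
        assert self.min_units <= self.desired_units <= self.max_units
        # each virtual processor needs at least 0.1 processing units
        assert self.desired_units / self.virtual_procs >= 0.1
        assert 0 <= self.weight <= 255
        # at least 128 MB, allocated in 16 MB increments
        assert self.memory_mb >= 128 and self.memory_mb % 16 == 0
        return True
```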
[Diagram: six LPARs sharing the physical CPUs of the pool]
Understanding min/max/desired resource values
The desired value for a resource is given to a partition if enough resource is available.
If there is not enough resource to meet the desired value, then a lower amount is allocated.
If there is not enough resource to meet the min value, the partition will not start.
The maximum value is only used as an upper limit for dynamic partitioning operations.
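These rules amount to a simple three-way decision. A minimal sketch (the function name is ours, not from any IBM tool):

```python
def allocate_at_activation(available, minimum, desired):
    """Processing units granted when a partition activates, or None
    if the partition cannot start, per the min/desired rules above."""
    if available >= desired:
        return desired    # enough resource: the desired value is given
    if available >= minimum:
        return available  # a lower amount is allocated
    return None           # below the minimum: the partition will not start
```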
Partition capacity entitlement
Processing units
– 1.0 processing unit represents one physical processor
Entitled processor capacity
– Commitment of capacity that is reserved for the partition
– Set upper limit of processor utilization for capped partitions
– Each virtual processor must be granted at least 1/10 of a processing unit of entitlement
Shared processor capacity is always delivered in terms of whole physical processors
[Diagram: one physical processor = 1.0 processing units; example allocations of 0.5 and 0.4 processing units; minimum allocation 0.1 processing units]
Capped and uncapped partitions
Capped partition
– Not allowed to exceed its entitlement
Uncapped partition
– Is allowed to exceed its entitlement
Capacity weight
– Used for prioritizing uncapped partitions
– Value 0-255
– Value of 0 referred to as a “soft cap”
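Uncapped prioritization is commonly described as weight-proportional sharing of spare capacity. The sketch below illustrates that idea only; the hypervisor's actual algorithm is not specified on this slide:

```python
def share_surplus(surplus_units, partitions):
    """Split spare pool capacity among uncapped partitions in proportion
    to their capacity weights (0-255); weight 0 acts as a soft cap.
    Each partition is a (name, weight) pair."""
    total_weight = sum(weight for _, weight in partitions)
    if total_weight == 0:
        return {name: 0.0 for name, _ in partitions}
    return {name: surplus_units * weight / total_weight
            for name, weight in partitions}
```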
Partition capacity entitlement example
Shared pool has 2.0 processing units available
LPARs activated in sequence
Partition 1 activated
– Min = 1.0, max = 2.0, desired = 1.5
– Starts with 1.5 allocated processing units
Partition 2 activated
– Min = 1.0, max = 2.0, desired = 1.0
– Does not start
Partition 3 activated
– Min = 0.1, max = 1.0, desired = 0.8
– Starts with 0.5 allocated processing units
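The activation sequence can be replayed with a short sketch that depletes the shared pool partition by partition (illustrative only):

```python
def activate_sequence(pool_units, partitions):
    """Activate partitions in order against a shared pool.
    Each partition is (min, desired); returns the grant per
    partition, or None where the partition does not start."""
    grants = []
    for minimum, desired in partitions:
        if pool_units >= desired:
            grant = desired
        elif pool_units >= minimum:
            grant = pool_units
        else:
            grant = None          # cannot meet the minimum
        if grant is not None:
            pool_units -= grant   # the grant comes out of the pool
        grants.append(grant)
    return grants
```

With a 2.0-unit pool and the three partitions above, this returns 1.5, None, and 0.5, matching the example.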
Understanding capacity allocation – An example
A workload is run under different configurations.
The size of the shared pool (number of physical processors) is fixed at 16.
The capacity entitlement for the partition is fixed at 9.5.
No other partitions are active.
Uncapped – 16 virtual processors
16 virtual processors.
Uncapped.
Can use all available resource.
The workload requires 26 minutes to complete.
[Chart: Uncapped (16PPs/16VPs/9.5CE) — processing units consumed (0-15) vs. elapsed time in minutes]
Uncapped – 12 virtual processors
12 virtual processors.
Even though the partition is uncapped, it can only use 12 processing units.
The workload now requires 27 minutes to complete.
[Chart: Uncapped (16PPs/12VPs/9.5CE) — processing units consumed (0-15) vs. elapsed time in minutes]
Capped
The partition is now capped and resource utilization is limited to the capacity entitlement of 9.5.
– Capping limits the amount of time each virtual processor is scheduled.
– The workload now requires 28 minutes to complete.
[Chart: Capped (16PPs/12VPs/9.5CE) — processing units consumed (0-15) vs. elapsed time in minutes]
Dynamic partitioning operations
Add, move, or remove processor capacity
– Remove, move, or add entitled shared processor capacity
– Change between capped and uncapped processing
– Change the weight of an uncapped partition
– Add and remove virtual processors
• Provided CE / VP > 0.1
Add, move, or remove memory
– 16 MB logical memory block
Add, move, or remove physical I/O adapter slots
Add or remove virtual I/O adapter slots
Min/max values defined for LPARs set the bounds within which DLPAR can work
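The virtual processor constraint can be checked before a DLPAR change. A hypothetical helper; the slide writes CE / VP > 0.1, while the entitlement rule elsewhere in this material is at least 1/10 processing unit per virtual processor, which is what this checks:

```python
def can_set_virtual_procs(entitled_units, new_vp_count):
    """True if a DLPAR change to new_vp_count virtual processors keeps
    at least 0.1 processing units of entitlement per virtual processor."""
    return new_vp_count >= 1 and entitled_units / new_vp_count >= 0.1
```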
Dynamic LPAR
Standard on all new systems
[Diagram: an HMC managing four partitions over the Hypervisor — Part#1 Production, Part#2 Legacy Apps, Part#3 Test/Dev, Part#4 File/Print, running AIX 5L and Linux — with resources moving between live partitions]
Firmware
POWER Hypervisor
POWER Hypervisor strategy
New Hypervisor for POWER5 systems
– Further convergence with iSeries
– But brands will retain unique value propositions
– Reduced development effort
– Faster time to market
New capabilities on pSeries servers
– Shared processor partitions
– Virtual I/O
New capability on iSeries servers
– Can run AIX 5L
POWER Hypervisor component sourcing
[Diagram: POWER Hypervisor components drawn from iSeries and pSeries heritage — H-call interface, nucleus (SLIC), virtual Ethernet (VLAN and VLAN/LAN/SCSI IOAs), Capacity on Demand, shared processor LPAR, virtual I/O, bus recovery, dump, location codes, FSP, load from flash, NVRAM, message passing, 255 partitions, slot/tower and drawer concurrent maintenance, partition on demand, HMC/HSC, I/O configuration]
POWER Hypervisor functions
Same functions as the POWER4 Hypervisor
– Dynamic LPAR
– Capacity Upgrade on Demand
New, active functions
– Dynamic Micro-Partitioning
– Shared processor pool
– Virtual I/O
– Virtual LAN
The machine is always in LPAR mode
– Even with all resources dedicated to one OS
[Diagram: dynamic Micro-Partitioning of shared processor pools built from POWER5 chips; virtual I/O (disk, LAN); dynamic LPAR; Capacity Upgrade on Demand tracking planned vs. actual client capacity growth]
POWER Hypervisor implementation
Design enhancements to previous POWER4 implementation enable the sharing of processors by multiple partitions
– Hypervisor decrementer (HDECR)
– New Processor Utilization Resource Register (PURR)
– Refine virtual processor objects
• Does not include physical characteristics of the processor
– New Hypervisor calls
POWER Hypervisor processor dispatch
Manages a set of processors on the machine (the shared processor pool).
POWER5 generates a 10 ms dispatch window.
– Minimum allocation is 1 ms per physical processor.
Each virtual processor is guaranteed its entitled share of processor cycles during each 10 ms dispatch window.
– ms/VP = CE * 10 / VPs
The partition entitlement is evenly distributed among the online virtual processors.
Once a capped partition has received its CE within a dispatch interval, it becomes not runnable.
A VP dispatched within 1 ms of the end of the dispatch interval will receive half its CE at the start of the next dispatch interval.
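The per-VP guarantee follows directly from the formula on the slide:

```python
def ms_per_virtual_processor(capacity_entitlement, virtual_procs,
                             window_ms=10):
    """Guaranteed milliseconds of physical processor time per virtual
    processor in each dispatch window: ms/VP = CE * 10 / VPs."""
    return capacity_entitlement * window_ms / virtual_procs
```

For example, a partition with 0.8 processing units spread over two virtual processors is guaranteed 4 ms per virtual processor in each 10 ms window.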
[Diagram: the POWER Hypervisor's processor dispatch maps virtual processor capacity entitlement for six shared processor partitions onto a shared processor pool of four physical CPUs built from POWER5 chips]
Dispatching and interrupt latencies
Virtual processors have dispatch latency.
Dispatch latency is the time between a virtual processor becoming runnable and being actually dispatched.
Timers and external interrupts are also subject to latency.
Shared processor pool
Processors not associated with dedicated processor partitions.
No fixed relationship between virtual processors and physical processors.
The POWER Hypervisor attempts to use the same physical processor.
– Affinity scheduling
– Home node
Affinity scheduling
When dispatching a VP, the POWER Hypervisor attempts to preserve affinity by using:
– Same physical processor as before, or
– Same chip, or
– Same MCM
When a physical processor becomes idle, the POWER Hypervisor looks for a runnable VP that:
– Has affinity for it, or
– Has affinity to no-one, or
– Is uncapped
Similar to AIX affinity scheduling
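The idle-processor search order can be sketched directly from the three bullets above (a toy model; the real dispatcher also weighs chip and MCM affinity, which this ignores):

```python
def pick_runnable_vp(idle_cpu, runnable_vps):
    """Pick a virtual processor for an idle physical processor, in the
    stated preference order: affinity for this CPU, then no affinity,
    then any uncapped VP.  Each VP is (name, home_cpu_or_None, uncapped)."""
    for vp in runnable_vps:
        if vp[1] == idle_cpu:   # has affinity for this processor
            return vp
    for vp in runnable_vps:
        if vp[1] is None:       # has affinity to no one
            return vp
    for vp in runnable_vps:
        if vp[2]:               # is uncapped
            return vp
    return None                 # nothing eligible; processor stays idle
```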
Operating system support
Micro-Partitioning capable operating systems need to be modified to cede a virtual processor when they have no runnable work
– Failure to do this wastes CPU resources (for example, a partition spending its CE waiting for I/O)
– Ceding results in better utilization of the pool
May confer the remainder of their timeslice to another VP
– For example, a VP holding a lock
Can be redispatched if they become runnable again during the same dispatch interval
Example
LPAR1: capacity entitlement = 0.8 processing units; virtual processors = 2 (capped)
LPAR2: capacity entitlement = 0.2 processing units; virtual processors = 1 (capped)
LPAR3: capacity entitlement = 0.6 processing units; virtual processors = 3 (capped)
[Diagram: two physical processors dispatching the virtual processors of LPAR1-LPAR3 over two consecutive 10 ms POWER Hypervisor dispatch intervals, with idle time once the capped entitlements are consumed]
POWER Hypervisor and virtual I/O
I/O operations without dedicating resources to an individual partition
POWER Hypervisor's virtual I/O related operations
– Provide control and configuration structures for the virtual adapter images required by the logical partitions
– Provide operations that allow partitions controlled and secure access to physical I/O adapters in a different partition
– The POWER Hypervisor does not own any physical I/O devices; they are owned by an I/O hosting partition
I/O types supported
– SCSI
– Ethernet
– Serial console
Performance monitoring and accounting
CPU utilization is measured against CE.
– An uncapped partition receiving more than its CE will record 100% but will be using more.
SMT
– Thread priorities compound the variable speed rate.
– Twice as many logical CPUs.
For accounting, intervals may be incorrectly allocated.
– New hardware support is required.
The processor utilization register (PURR) records actual clock ticks spent executing a partition.
– Used by performance commands (for example, new flags) and accounting modules.
– Third-party tools will need to be modified.
Virtual I/O Server
Virtual I/O Server
Provides an operating environment for virtual I/O administration
– Virtual I/O server administration
– Restricted scriptable command line user interface (CLI)
Minimum hardware requirements
– POWER5 VIO capable machine
– Hardware Management Console
– Storage adapter
– Physical disk
– Ethernet adapter
– At least 128 MB of memory
Capabilities of the Virtual I/O Server
– Ethernet adapter sharing
– Virtual SCSI disk
• Virtual I/O Server Version 1.1 supports selected configurations, including specific models of EMC, HDS, and STK disk subsystems attached via Fibre Channel
– Interacts with AIX and Linux partitions
Virtual I/O Server (Cont.)
Installation CD when Advanced POWER Virtualization feature is ordered
Configuration approaches for high availability
– Virtual I/O Server
• LVM mirroring
• Multipath I/O
• EtherChannel
– Second virtual I/O server instance in another partition
Virtual SCSI
Allows sharing of storage devices
Vital for shared processor partitions
– Overcomes potential limit of adapter slots due to Micro-Partitioning
– Allows the creation of logical partitions without the need for additional physical resources
Allows attachment of previously unsupported storage solutions
VSCSI server and client architecture overview
Virtual SCSI is based on a client/server relationship.
The virtual I/O resources are assigned using an HMC.
Virtual SCSI enables sharing of adapters as well as disk devices.
Dynamic LPAR operations allowed.
Dynamic mapping between physical and virtual resources on the virtual I/O server.
[Diagram: VSCSI architecture — AIX and Linux client partitions with VSCSI client adapters connect through the POWER Hypervisor to VSCSI server adapters in the Virtual I/O Server partition, whose LVM maps logical volumes on physical disks (SCSI, FC) reached through a physical adapter; clients see the volumes as hdisks]
Virtual devices
Are defined as LVs in the I/O server partition
– Normal LV rules apply
Appear as real devices (hdisks) in the hosted partition
Can be manipulated using Logical Volume Manager just like an ordinary physical disk
Can be used as a boot device and as a NIM target
Can be shared by multiple clients
[Diagram: a logical volume (LV) in the Virtual I/O Server partition, exported through a VSCSI server adapter, appears as a virtual disk (hdisk) in the client partition's LVM]
SCSI RDMA and Logical Remote Direct Memory Access
SCSI transport protocols define the rules for exchanging information between SCSI initiators and targets.
Virtual SCSI uses the SCSI RDMA Protocol (SRP).– SCSI initiators and targets have the
ability to directly transfer information between their respective address spaces.
SCSI requests and responses are sent using the Virtual SCSI adapters.
The actual data transfer, however, is done using the Logical Redirected DMA protocol.
[Diagram: Reliable Command/Response Transport and Logical Remote Direct Memory Access — the VSCSI initiator device driver in the client partition exchanges commands with the VSCSI target device driver in the Virtual I/O Server partition through the POWER Hypervisor; data moves directly between the physical adapter and the client's data buffer via device mapping]
Virtual SCSI security
Only the owning partition has access to its data.
Data is copied directly from the PCI adapter to the client's memory.
Performance considerations
VSCSI requires roughly twice as many processor cycles as locally attached disk I/O (cost split evenly between the client partition and the virtual I/O server)
– The path of each virtual I/O request involves several sources of overhead that are not present in a non-virtual I/O request.
– For a virtual disk backed by the LVM, there is also the performance impact of going through the LVM and disk device drivers twice.
If multiple partitions are competing for resources from a VSCSI server, care must be taken to ensure enough server resources (CPU, memory, and disk) are allocated to do the job.
If not constrained by CPU performance, dedicated partition throughput is comparable to doing local I/O.
Because there is no caching in memory on the server I/O partition, its memory requirements should be modest.
Limitations
The hosting partition must be available before the hosted partition boots.
Virtual SCSI supports FC, parallel SCSI, and SCSI RAID.
Maximum of 65535 virtual slots in the I/O server partition.
Maximum of 256 virtual slots on a single partition.
Support for all mandatory SCSI commands.
Not all optional SCSI commands are supported.
Implementation guideline
Partitions with high performance and disk I/O requirements are not recommended for implementing VSCSI.
Partitions with very low performance and disk I/O requirements can be configured at minimum expense to use only a portion of a logical volume.
Boot disks for the operating system.
Web servers that will typically cache a lot of data.
LVM mirroring
This configuration protects virtual disks in a client partition against failure of:
– One physical disk
– One physical adapter
– One virtual I/O server
Many possibilities exist to exploit this great function!
[Diagram: LVM mirroring — the client partition's LVM mirrors two virtual disks, each backed by a different Virtual I/O Server partition with its own physical SCSI adapter and physical disk]
Multipath I/O
This configuration protects virtual disks in a client partition against:
– Failure of one physical FC adapter in one I/O server
– Failure of one Virtual I/O Server
Physical disk is assigned as a whole to the client partition
Many possibilities exist to exploit this great function!
[Diagram: Multipath I/O — two Virtual I/O Server partitions each reach the same ESS physical disk through their own physical FC adapters and a SAN switch; the client partition's LVM sees the whole disk through two VSCSI client adapters]
Virtual LAN overview
Virtual network segments on top of physical switch devices.
All nodes in the VLAN can communicate without any L3 routing or inter-VLAN bridging.
VLANs provide:
– Increased LAN security
– Flexible network deployment over traditional network devices
VLAN support in AIX is based on the IEEE 802.1Q VLAN implementation.
– VLAN ID tagging of Ethernet frames
– VLAN ID restricted switch ports
[Diagram: three switches (A, B, C) carrying VLAN 1 and VLAN 2 — nodes on the same VLAN communicate across switches; nodes on different VLANs cannot]
Virtual Ethernet
Enables inter-partition communication.
– In-memory point to point connections
Physical network adapters are not needed.
Similar to high-bandwidth Ethernet connections.
Supports multiple protocols (IPv4, IPv6, and ICMP).
No Advanced POWER Virtualization feature required.
– POWER5 Systems
– AIX 5L V5.3 or appropriate Linux level
– Hardware management console (HMC)
Virtual Ethernet connections
VLAN technology implementation
– Partitions can only access data directed to them.
Virtual Ethernet switch provided by the POWER Hypervisor
Virtual LAN adapters appear to the OS as physical adapters
– The MAC address is generated by the HMC.
1-3 Gb/s transmission speed
– Support for large MTUs (~64 KB) on AIX.
Up to 256 virtual Ethernet adapters
– Up to 18 VLANs.
Bootable device support for NIM OS installations
[Diagram: AIX and Linux partitions, each with a virtual Ethernet adapter, attached to the virtual Ethernet switch in the POWER Hypervisor]
Virtual Ethernet switch
Based on IEEE 802.1Q VLAN standard
– OSI-Layer 2
– Optional Virtual LAN ID (VID)
– 4094 virtual LANs supported
– Up to 18 VIDs per virtual LAN port
Switch configuration through HMC
How it works
[Flowchart: virtual Ethernet packet handling — a virtual Ethernet adapter sends a frame to its virtual VLAN switch port; the PHYP caches the source MAC, checks for an IEEE VLAN header and inserts one if missing, verifies the port is allowed for that VLAN, then looks up the destination MAC and VLAN number in its table; a match is delivered, otherwise the frame is passed to a trunk adapter configured on the associated switch port if one is defined, or dropped]
Performance considerations
Virtual Ethernet performance
– Throughput scales nearly linearly with the allocated capacity entitlement
Virtual LAN vs. Gigabit Ethernet throughput
– Virtual Ethernet adapter has higher raw throughput at all MTU sizes
– In-memory copy is more efficient at larger MTU
[Charts: throughput per 0.1 CPU entitlement (Mb/s) at entitlements 0.1 to 1.0 for MTU sizes 1500, 9000, and 65394; and TCP_STREAM throughput (Mb/s) comparing VLAN against Gigabit Ethernet at MTU 1500, 9000, and 65394, in simplex and duplex modes.]
Limitations
Virtual Ethernet can be used in both shared and dedicated processor partitions, provided the appropriate OS levels are installed.
A partition may use Virtual Ethernet connections, real network adapters, or a mixture of both.
Virtual Ethernet can only connect partitions within a single system.
A system’s processor load is increased when using virtual Ethernet.
Implementation guideline
Know your environment and the network traffic.
Choose a high MTU size where it makes sense for the network traffic in the Virtual LAN.
Use the MTU size 65394 if you expect a large amount of data to be copied inside your Virtual LAN.
Enable tcp_pmtu_discover and udp_pmtu_discover in conjunction with MTU size 65394.
Do not turn off SMT.
No dedicated CPUs are required for virtual Ethernet performance.
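A hedged sketch of the tuning above on AIX 5L; the interface name en1 is an example, and the tunables should be verified on your OS level:

```shell
# Use the large 65394-byte MTU on the virtual Ethernet interface
chdev -l en1 -a mtu=65394
# Enable path MTU discovery so traffic leaving the Virtual LAN
# negotiates a correct MTU instead of being fragmented en route
no -p -o tcp_pmtu_discover=1
no -p -o udp_pmtu_discover=1
```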
Connecting Virtual Ethernet to external networks: Routing
– The partition that routes the traffic to the external network does not necessarily have to be the Virtual I/O Server.
[Diagram: two systems. In each, Linux and AIX client partitions sit on an internal virtual Ethernet subnet (3.1.1.x in the first system, 4.1.1.x in the second), and an AIX partition with both a virtual and a physical adapter routes that subnet to an external subnet (1.1.1.x and 2.1.1.x respectively). An external IP router (1.1.1.1 / 2.1.1.1) joins the two subnets, which also host an AIX server (1.1.1.10) and a Linux server (2.1.1.10).]
Shared Ethernet Adapter
Connects internal and external VLANs using one physical adapter.
SEA is a new service that acts as a layer 2 network switch.
– Securely bridges network traffic from a virtual Ethernet adapter to a real network adapter
SEA service runs in the Virtual I/O Server partition.
– Advanced POWER Virtualization feature required
– At least one physical Ethernet adapter required
No physical I/O slot and network adapter required in the client partition.
Shared Ethernet Adapter (Cont.)
Virtual Ethernet MAC addresses are visible to outside systems.
Broadcast/multicast is supported.
ARP (Address Resolution Protocol) and NDP (Neighbor Discovery Protocol) can work across a shared Ethernet.
One SEA can be shared by multiple VLANs and multiple subnets can connect using a single adapter on the Virtual I/O Server.
Virtual Ethernet adapter configured in the Shared Ethernet Adapter must have the trunk flag set.
– The trunk Virtual Ethernet adapter enables a layer-2 bridge to a physical adapter
When the Shared Ethernet Adapter receives IP (or IPv6) packets larger than the MTU of the adapter that the packet is forwarded through, IP fragmentation is performed or an ICMP "packet too big" message is sent back.
Virtual Ethernet and Shared Ethernet Adapter security
The VLAN (virtual local area network) tagging used here follows the IEEE 802.1Q standard.
The implementation of this VLAN standard ensures that the partitions have no access to foreign data.
Only the network adapters (virtual or physical) that are connected to a port (virtual or physical) that belongs to the same VLAN can receive frames with that specific VLAN ID.
Performance considerations
Virtual I/O-Server performance
– Adapters stream data at media speed if the Virtual I/O server has enough capacity entitlement.
– CPU utilization per Gigabit of throughput is higher with a Shared Ethernet adapter.
[Charts: Virtual I/O Server TCP_STREAM throughput (Mb/s) and normalized CPU utilisation (%cpu/Gb) at MTU 1500 and 9000, in simplex and duplex modes.]
Limitations
System processors are used for all communication functions, leading to a significant amount of system processor load.
One of the virtual adapters in the SEA on the Virtual I/O server must be defined as a default adapter with a default PVID.
Up to 16 Virtual Ethernet adapters with 18 VLANs on each can be shared on a single physical network adapter.
Shared Ethernet Adapter requires:
– POWER Hypervisor component of POWER5 systems
– AIX 5L Version 5.3 or appropriate Linux level
Implementation guideline
Know your environment and the network traffic.
Use a dedicated network adapter if you expect heavy network traffic between Virtual Ethernet and local networks.
If possible, use dedicated CPUs for the Virtual I/O Server.
Choose 9000 for MTU size, if this makes sense for your network traffic.
Don’t use Shared Ethernet Adapter functionality for latency-critical applications.
With MTU size 1500, you need about 1 CPU per gigabit Ethernet adapter streaming at media speed.
With MTU size 9000, 2 Gigabit Ethernet adapters can stream at media speed per CPU.
Shared Ethernet Adapter configuration
The Virtual I/O Server is configured with at least one physical Ethernet adapter.
One Shared Ethernet Adapter can be shared by multiple VLANs.
Multiple subnets can connect using a single adapter on the Virtual I/O Server.
[Diagram: a Virtual I/O Server whose Shared Ethernet Adapter bridges physical adapter ent0 to VLAN 1 and VLAN 2 on the POWER Hypervisor's virtual Ethernet switch; an AIX partition (10.1.1.11 on VLAN 1) and a Linux partition (10.1.2.11 on VLAN 2) reach an external AIX server (10.1.1.14, VLAN 1) and Linux server (10.1.2.15, VLAN 2) through the single physical adapter.]
Multiple Shared Ethernet Adapter configuration
Maximizing throughput
– Using several Shared Ethernet Adapters
– More queues
– More performance
[Diagram: a Virtual I/O Server with two Shared Ethernet Adapters, one bridging VLAN 1 through physical adapter ent0 and one bridging VLAN 2 through physical adapter ent1; the AIX partition (10.1.1.11, VLAN 1) and Linux partition (10.1.2.11, VLAN 2) reach the external AIX server (10.1.1.14) and Linux server (10.1.2.15) over separate physical paths.]
Multipath routing with dead gateway detection
This configuration protects your access to the external network against:
– Failure of one physical network adapter in one I/O server
– Failure of one Virtual I/O server
– Failure of one gateway
[Diagram: an AIX client partition with two virtual Ethernet interfaces (9.3.5.12 on VLAN 1, 9.3.5.22 on VLAN 2) using multipath routing with dead gateway detection: a default route to 9.3.5.10 via 9.3.5.12 and a default route to 9.3.5.20 via 9.3.5.22. Each VLAN is bridged by a Shared Ethernet Adapter (9.3.5.11 and 9.3.5.21) in a separate Virtual I/O Server, each with its own physical adapter to the external network, which hosts gateways 9.3.5.10 and 9.3.5.20.]
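On the client partition this amounts to two default routes plus dead gateway detection. A hedged sketch, with addresses taken from the scenario above; the exact route options for multipath routing depend on the AIX 5L level, so treat the route commands as illustrative:

```shell
# Enable passive dead gateway detection so an unreachable
# gateway's routes are demoted automatically
no -p -o passive_dgd=1
# Define one default route through each Virtual I/O Server's gateway
route add default 9.3.5.10
route add default 9.3.5.20
```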
Shared Ethernet Adapter commands
Virtual I/O Server commands
– lsdev -type adapter: lists all the virtual and physical adapters.
– Choose the virtual Ethernet adapter to map to the physical Ethernet adapter.
– Make sure the physical and virtual interfaces are unconfigured (down or detached).
– mkvdev: maps the physical adapter to the virtual adapter, creates a layer-2 bridge, and defines the default virtual adapter with its default VLAN ID. This creates a new Ethernet adapter (for example, ent5).
– mktcpip: configures TCP/IP on the corresponding new interface (for example, en5).
Client partition commands
– No new commands are needed; the typical TCP/IP configuration is done on the virtual Ethernet interface that is defined in the client partition profile on the HMC.
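A hedged sketch of this sequence on the Virtual I/O Server; the adapter names ent0/ent2, VLAN ID 1, and the addresses are examples only:

```shell
# List adapters and pick the physical (here ent0) and trunk virtual (here ent2) adapter
lsdev -type adapter
# Bridge them: ent2 becomes the default adapter with default VLAN ID 1,
# creating the Shared Ethernet Adapter, e.g. ent5
mkvdev -sea ent0 -vadapter ent2 -default ent2 -defaultid 1
# Configure TCP/IP on the resulting interface en5
mktcpip -hostname vios1 -inetaddr 10.1.1.2 -interface en5 \
        -netmask 255.255.255.0 -gateway 10.1.1.1
```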
Virtual SCSI commands
Virtual I/O Server commands
– To map a logical volume (LV):
• mkvg: creates the volume group, in which a new LV is then created using the mklv command.
• lsdev: shows the virtual SCSI server adapters that could be used for mapping with the LV.
• mkvdev: maps the virtual SCSI server adapter to the LV.
• lsmap -all: shows the mapping information.
– To map a physical disk:
• lsdev: shows the virtual SCSI server adapters that could be used for mapping with a physical disk.
• mkvdev: maps the virtual SCSI server adapter to a physical disk.
• lsmap -all: shows the mapping information.
Client partition commands
– No new commands needed; the typical device configuration uses the cfgmgr command.
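A hedged sketch of the LV case on the Virtual I/O Server; the names rootvg_clients, client_lv, hdisk2, and vhost0 are examples, not values from this material:

```shell
# Create a volume group on a free disk and carve out a logical volume
mkvg -f -vg rootvg_clients hdisk2
mklv -lv client_lv rootvg_clients 10G
# Find the virtual SCSI server adapter (e.g. vhost0) and map the LV to it
lsdev -type adapter
mkvdev -vdev client_lv -vadapter vhost0
# Verify the mapping
lsmap -all
```

On the client partition, running cfgmgr then discovers the new hdisk backed by this LV.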
Section Review Questions
1. Any technology improvement will boost performance of any client solution.
a. True
b. False
2. The application of technology in a creative way to solve clients’ business problems is one definition of innovation.
a. True
b. False
3. Client satisfaction with your solution can be enhanced by which of the following?
a. Setting expectations appropriately.
b. Applying technology appropriately.
c. Communicating the benefits of the technology to the client.
d. All of the above.
4. Which of the following are available with POWER5 architecture?
a. Simultaneous Multi-Threading.
b. Micro-Partitioning.
c. Dynamic power management.
d. All of the above.
5. Simultaneous Multi-Threading is the same as hyperthreading, IBM just gave it a different name.
a. True.
b. False.
6. In order to bridge network traffic between the Virtual Ethernet and external networks, the Virtual I/O Server has to be configured with at least one physical Ethernet adapter.
a. True.
b. False.
Review Question Answers
1. b
2. a
3. d
4. d
5. b
6. a
Unit Summary
You should now be able to:
– Describe the relationship between technology and solutions.
– List key IBM technologies that are part of the POWER5 products.
– Describe the functional benefits that these technologies provide.
– Discuss the appropriate use of these technologies.
Reference
You may find more information here:
– IBM eServer pSeries AIX 5L Support for Micro-Partitioning and Simultaneous Multi-threading, White Paper
– Introduction to Advanced POWER Virtualization on IBM eServer p5 Servers, SG24-7940
– IBM eServer p5 Virtualization: Performance Considerations, SG24-5768