How to Fail at VDI
BriForum | © TechTarget
Welcome
Dan Brinkmann, @dbrinkmann, blog.danbrinkmann.com
Solutions Architect, VMware vExpert
Lewan & Associates (Denver, CO)
How to Fail at VDI
“What business problem are we solving?”
Business/Expectation VDI Failures
● No business problem
● Desktop virtualization is not server virtualization
● Saving money
● Project in the hands of the vSphere administrator
● No success criteria
● Assume you know what users do
● The same or better experience remotely as locally
Agenda
● Compute
● Storage
● Guessing
To understand what causes VDI failures
How to Fail at VDI
● Test with 5 users
● Use vendor-provided users/core sizing
● Use vendor-provided IOPs estimates
● Ignore anti-virus
● Ignore user profile management
● Use existing desktop images built for physical PCs
● Guess
The technology failure points
Compute
● Multi-threaded apps
● Latency-sensitive workloads
● Hyperthreading
● Latency = Health
It’s magic until it stops working
Compute
● CPU scheduler in vSphere is entitlement/consumption based, not priority based (unlike Windows)
● There is no priority in the CPU scheduler
● Given equal entitlement, the more a VM/world consumes, the more likely it is to be preempted by another VM/world
● http://www.vmware.com/resources/techresources/10131
CPU scheduler in vSphere
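The entitlement/consumption idea can be sketched with a toy model. This is an illustration only, not VMware's actual scheduler: the `World` class, the MHz figures, and the rule of picking the lowest consumed-to-entitled ratio are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class World:
    name: str
    entitlement_mhz: float  # CPU the world is entitled to (hypothetical figure)
    consumed_mhz: float     # CPU the world has recently consumed (hypothetical figure)


def next_to_run(worlds):
    """Pick the world that has consumed the smallest share of its entitlement."""
    return min(worlds, key=lambda w: w.consumed_mhz / w.entitlement_mhz)


# Two desktops with equal entitlement; desktop-a has been the heavier consumer.
worlds = [
    World("desktop-a", entitlement_mhz=1000, consumed_mhz=900),
    World("desktop-b", entitlement_mhz=1000, consumed_mhz=300),
]
print(next_to_run(worlds).name)  # desktop-b
```

With equal entitlement, the heavier consumer loses the next scheduling decision, which is the behavior the bullets describe.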
Compute with a Physical PC
[Diagram: one box labeled OS/Apps/Profile on CPU 1]
Compute with Citrix XenApp
[Diagram: eight boxes labeled OS/Apps/Profile sharing CPU 1 and CPU 2]
Compute with VDI
[Diagram: CPU 1 and CPU 2]
vSphere Compute
This is poor performance monitoring
vSphere Compute
This is better performance monitoring - ESXTOP
Display | Metric | Threshold | Explanation
CPU | %RDY | 10 | Overprovisioning of vCPUs, excessive usage of vSMP, or a limit (check %MLMTD) has been set.
CPU | %CSTP | 3 | Excessive usage of vSMP. Decrease the number of vCPUs for this particular VM. This should lead to increased scheduling opportunities.
CPU | %SYS | 20 | The percentage of time spent by system services on behalf of the world. Most likely caused by a high-IO VM. Check other metrics and the VM for a possible root cause.
CPU | %MLMTD | 0 | The percentage of time the vCPU was ready to run but deliberately wasn’t scheduled because that would violate the “CPU limit” settings. If larger than 0, the world is being throttled due to the limit on CPU.
CPU | %SWPWT | 5 | VM waiting on swapped pages to be read from disk. Possible cause: memory overcommitment.
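The thresholds in the table can be applied to a set of readings with a few lines of Python. This is a minimal sketch: the `cpu_warnings` helper and the sample readings are hypothetical, and treating a threshold as "strictly greater than" is an assumption (it matches the "larger than 0" wording for %MLMTD).

```python
# Thresholds (percent) copied from the table above.
CPU_THRESHOLDS = {
    "%RDY": 10,
    "%CSTP": 3,
    "%SYS": 20,
    "%MLMTD": 0,
    "%SWPWT": 5,
}


def cpu_warnings(sample):
    """Return, in sorted order, the metrics in `sample` that exceed their threshold."""
    return sorted(
        metric
        for metric, value in sample.items()
        if metric in CPU_THRESHOLDS and value > CPU_THRESHOLDS[metric]
    )


# Hypothetical per-VM readings, typed in by hand for illustration.
sample = {"%RDY": 14.2, "%CSTP": 6.5, "%SYS": 2.0, "%MLMTD": 0.0, "%SWPWT": 0.0}
print(cpu_warnings(sample))  # ['%CSTP', '%RDY']
```

A VM that trips both %CSTP and %RDY at once is a candidate for having too many vCPUs.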
vSphere Compute
%CSTP probably driving %RDY values
vSphere Compute
Now with fewer vCPUs
Summary on Compute
● Multithreading, vSMP
● Not priority based
● % Utilization is not the complete picture
● Latency = Health
● http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1017926
Storage
● #1 cause of performance issues in server virtualization
● #1 cause of performance issues in desktop virtualization
● Latency = Health
   20 ms - in trouble
   50 ms - your users hate you
The wrath of the math
What You Need to Know
● Capacity vs performance
● Random vs sequential
● Average vs peak
● Where it’s coming from
● Most are guessing
Storage
Spinning disk
RAID Penalty
The Math – RAID 5 50/50
● 500 users, Windows 7, 20 IOPs avg, 50/50 read/write, RAID 5
● 500 * 20 = 10,000 IOPs (5,000 read, 5,000 write)
● (5,000 write * 4) + 5,000 read = 20,000 + 5,000 = 25,000 IOPs (RAID 5 write penalty of 4)
● 25,000 IOPs / 200 IOPs per 15K spindle = 125 spindles
Some back of the napkin math
The Math – RAID 10 50/50
● 500 users, Windows 7, 20 IOPs avg, 50/50 read/write, RAID 10
● 500 * 20 = 10,000 IOPs (5,000 read, 5,000 write)
● (5,000 write * 2) + 5,000 read = 10,000 + 5,000 = 15,000 IOPs (RAID 10 write penalty of 2)
● 15,000 IOPs / 200 IOPs per 15K spindle = 75 spindles
Some back of the napkin math
The Math – RAID 10 20/80
● 500 users, Windows 7, 20 IOPs avg, 20/80 read/write, RAID 10
● 500 * 20 = 10,000 IOPs (2,000 read, 8,000 write)
● (8,000 write * 2) + 2,000 read = 16,000 + 2,000 = 18,000 IOPs (RAID 10 write penalty of 2)
● 18,000 IOPs / 200 IOPs per 15K spindle = 90 spindles
Some back of the napkin math
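The napkin math on the three slides above follows one pattern, so it can be wrapped in a small function. This is a sketch under the slides' own assumptions (write penalty of 4 for RAID 5 and 2 for RAID 10, 200 IOPS per 15K spindle, steady average load); the function and variable names are made up for the example.

```python
import math

# Write penalties used on the slides: RAID 5 = 4, RAID 10 = 2.
WRITE_PENALTY = {"RAID 5": 4, "RAID 10": 2}


def spindles_needed(users, iops_per_user, read_pct, raid, iops_per_spindle=200):
    """Return (backend IOPS, spindle count) for a steady average load."""
    frontend = users * iops_per_user
    reads = frontend * read_pct / 100
    writes = frontend - reads
    # Every front-end write costs WRITE_PENALTY[raid] back-end operations.
    backend = reads + writes * WRITE_PENALTY[raid]
    return backend, math.ceil(backend / iops_per_spindle)


for raid, read_pct in [("RAID 5", 50), ("RAID 10", 50), ("RAID 10", 20)]:
    backend, spindles = spindles_needed(500, 20, read_pct, raid)
    print(f"{raid} {read_pct}/{100 - read_pct}: {backend:,.0f} IOPS, {spindles} spindles")
```

Run as written, this reproduces the 125, 75, and 90 spindle figures. Changing `read_pct` or `iops_per_user` shows how quickly the spindle count moves when the workload is more write-heavy than a vendor estimate assumed.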
vSphere Storage Latency
[Diagram: I/O path through the Guest (Application, Filesystem, I/O Drivers) and the VMkernel (Virtual SCSI, Filesystem, Device Queue), with latency measurement points A, R, S, G, K, and D]
A = Application Latency
R = Physical Disk “Disk Secs/Transfer”
G = Guest Latency
K = ESX Kernel
D = Device Latency
vSphere Storage
Performance monitoring for storage
Display | Metric | Threshold | Explanation
DISK | GAVG | 20 | Look at “DAVG” and “KAVG”, as the sum of both is GAVG.
DISK | DAVG | 20 | Disk latency most likely caused by the array.
DISK | KAVG | 2 | Disk latency caused by the VMkernel. High KAVG usually means queuing. Check “QUED”.
DISK | QUED | 1 | Queue maxed out. Possibly queue depth set too low. Check with the array vendor for the optimal queue depth value.
DISK | ABRTS/s | 1 | Aborts issued by the guest (VM) because storage is not responding. Can be caused by failed paths.
DISK | RESETS/s | 1 | The number of commands reset per second.
DISK | CONS/s | 20 | SCSI reservation conflicts per second. Can be caused by too many VMDKs on a datastore.
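The relationship GAVG = DAVG + KAVG and the two latency thresholds from the table can be turned into a quick triage helper. This is a rough sketch: `diagnose_latency` is a hypothetical name, the inputs are assumed to be in milliseconds, and "strictly greater than" is an assumed reading of the thresholds.

```python
def diagnose_latency(davg_ms, kavg_ms):
    """Split guest latency into device and kernel parts and name the likely culprit."""
    gavg_ms = davg_ms + kavg_ms  # GAVG is the sum of DAVG and KAVG
    findings = []
    if davg_ms > 20:
        findings.append("DAVG high: look at the array")
    if kavg_ms > 2:
        findings.append("KAVG high: likely queuing, check QUED")
    if gavg_ms > 20 and not findings:
        findings.append("GAVG high: DAVG and KAVG are each under threshold but add up")
    return gavg_ms, findings


print(diagnose_latency(25.0, 0.5))  # (25.5, ['DAVG high: look at the array'])
print(diagnose_latency(5.0, 8.0))   # (13.0, ['KAVG high: likely queuing, check QUED'])
```

Splitting the guest-visible number this way shows whether to start the conversation with the storage team or with the host's queue settings.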
Building for Read IOPs
● Memory - Storage controller cache, PVS
● Host/Hypervisor - CBRC, IntelliCache
● Storage - SSD tiering / flash cache
Fairly easy
Building for Write IOPs
● Profiles/Apps
● Spinning disk
● SSD tiering
● Local disk
● IO optimization (dedupe, serializing IO)
Much harder…and expensive
Storage Summary
● 25,000 IOPs, RAID 5, 50/50 read/write: 125 spindles
● 15,000 IOPs, RAID 10, 50/50 read/write: 75 spindles
● 18,000 IOPs, RAID 10, 20/80 read/write: 90 spindles
● Latency is the key metric
● Write IOPs, and the things that cause them, are the #1 focus
How does this relate to VDI failure?
● Pilot performance is great, then terrible in production
● Boot storm vs login storm
● Applications in gold image vs streamed
● Read/write ratio is important
● Anti-virus software
● Existing desktop images
Guessing
● Initial sizing
● Determine peaks and when they occur
● Baseline application impact
● Monitor application impact over time
● Application updates/changes
You need to use tools to do this
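As a small illustration of why averages hide peaks, the sketch below summarizes a day of IOPS samples. The hourly numbers are invented for the example and do not come from any real environment; in practice they would come from an assessment or monitoring tool.

```python
from statistics import mean

# Hypothetical hourly IOPS samples for one workday (hour of day -> IOPS).
samples = {8: 9000, 9: 14500, 10: 7000, 11: 6500, 12: 4000,
           13: 7500, 14: 6800, 15: 6200, 16: 5500}


def summarize(samples):
    """Return the average load, the peak load, and the hour the peak occurred."""
    peak_hour = max(samples, key=samples.get)
    return {
        "average": mean(samples.values()),
        "peak": samples[peak_hour],
        "peak_hour": peak_hour,
    }


summary = summarize(samples)
print(f"average {summary['average']:.0f} IOPS, "
      f"peak {summary['peak']} IOPS at {summary['peak_hour']}:00")
```

In this made-up day the 9:00 peak is roughly twice the average, so storage sized to the average would fall short exactly when everyone is logging in.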
Project testing
● Unit/system testing
● Application testing
● Performance/scalability testing
● Operational testing
● User acceptance testing
Good to know what you are and aren’t doing
Summary
● Understand your limited resources (compute/storage)
● Don’t guess
● 5 users = what kind of testing, what are you really accomplishing?