VMware vSphere 4.1 HA and DRS Technical Deepdive Volume 1


Transcript of VMware vSphere 4.1 HA and DRS Technical Deepdive Volume 1


Copyright © 2010 by Duncan Epping and Frank Denneman.

All rights reserved. No part of this book shall be reproduced, stored in a retrieval system, or

transmitted by any means, electronic, mechanical, or otherwise, without written permission from

the publisher. No patent liability is assumed with respect to the use of the information contained

herein. Although every precaution has been taken in the preparation of this book, the publisher and

authors assume no responsibility for errors or omissions. Neither is any liability assumed for

damages resulting from the use of the information contained herein.

International Standard Book Number (ISBN): 9781456301446

All terms mentioned in this book that are known to be trademarks or service marks have been

appropriately capitalized.

Use of a term in this book should not be regarded as affecting the validity of any trademark or

service mark.

Version: 1.1


About the Authors

Duncan Epping is a Principal Architect working for VMware as part of the Technical Marketing

department. Duncan primarily focuses on vStorage initiatives and ESXi. He specializes in vSphere, vStorage, VMware HA and Architecture. Duncan is a VMware Certified Professional and among the first VMware Certified Design Experts (VCDX 007). Duncan is the owner of Yellow-Bricks.com, one of the leading VMware/virtualization blogs worldwide (recently voted the number 1 virtualization blog for the 4th consecutive time on vsphere-land.com) and lead-author of the "vSphere Quick Start Guide" and "Foundation for Cloud Computing with VMware vSphere 4", which was recently published by Usenix/Sage (#21 in the Short Topics Series). He can be

followed on twitter at http://twitter.com/DuncanYB.

Frank Denneman is a Consulting Architect working for VMware as part of the Professional

Services Organization. Frank works primarily with large Enterprise customers and Service

Providers. He is focused on designing large vSphere Infrastructures and specializes in Resource

Management, DRS in general and storage. Frank is a VMware Certified Professional and among the

first VMware Certified Design Experts (VCDX 029). Frank is the owner of FrankDenneman.nl which

has recently been voted number 6 worldwide on vsphere-land.com. He can be followed on twitter

at http://twitter.com/FrankDenneman.


Table of Contents

About the Authors

Acknowledgements

Foreword

Introduction to VMware High Availability

How Does High Availability Work?

Pre-requisites

Firewall Requirements

Configuring VMware High Availability

Components of High Availability

VPXA

VMAP Plug-In

AAM

Nodes

Promoting Nodes

Failover Coordinator

Preferred Primary

High Availability Constructs

Isolation Response

Split-Brain

Isolation Detection

Selecting an Additional Isolation Address

Failure Detection Time

Adding Resiliency to HA (Network Redundancy)

Single Service Console with vmnics in Active/Standby Configuration

Secondary Management Network


Admission Control

Admission Control Policy

Admission Control Mechanisms

Host Failures Cluster Tolerates

Unbalanced Configurations and Impact on Slot Calculation

Percentage of Cluster Resources Reserved

Failover Host

Impact of Admission Control Policy

Host Failures Cluster Tolerates

Percentage as Cluster Resources Reserved

Specify a Failover Host

Recommendations

VM Monitoring

Why Do You Need VM/Application Monitoring?

How Does VM/App Monitoring Work?

Is AAM enabling VM/App Monitoring?

Screenshots

vSphere 4.1 HA and DRS Integration

Affinity Rules

Resource Fragmentation

DPM

Flattened Shares

Summarizing

What is VMware DRS?

Cluster Level Resource Management

Requirements


Operation and Tasks of DRS

Load Balance Calculation

Events and Statistics

Migration and Info Requests

vCenter and Cluster sizing

DRS Cluster Settings

Automation Level

Initial Placement

Impact of Automation Levels on Procedures

Resource Management

Two-Layer Scheduler Architecture

Resource Entitlement

Resource Entitlement Calculation

Calculating DRS Recommendations

When is DRS Invoked?

Defragmenting cluster during Host failover

Recommendation Calculation

Constraints Correction

Imbalance Calculation

Impact of Migration Threshold on Selection Procedure

Selection of Virtual Machine Candidate

Cost-Benefit and Risk Analysis Criteria

The Biggest Bang for the Buck

Calculating the Migration Recommendation Priority Level

Influence DRS Recommendations

Migration Threshold Levels


Rules

VM-VM Affinity Rules

VM-Host Affinity Rules

Impact of Rules on Organization

Virtual Machine Automation Level

Impact of VM Automation Level on DRS Load Balancing Calculation

Resource Pools and Controls

Root Resource Pool

Resource Pools

Resource pools and simultaneous vMotions

Under Committed versus Over Committed

Resource Allocation Settings

Shares

Reservation

VM Level Scheduling: CPU vs Memory

Impact of Reservations on VMware HA Slot Sizes.

Behavior of Resource Pool Level Memory Reservations

Setting a VM Level Reservation inside a Resource Pool

VMkernel CPU reservation for vMotion

Reservations Are Not Limits.

Memory Overhead Reservation

Expandable Reservation

Limits

CPU Resource Scheduling

Memory Scheduler

Distributed Power Management


Enable DPM

Templates

DPM Threshold and the Recommendation Rankings

Evaluating Resource Utilization

Virtual Machine Demand and ESX Host Capacity Calculation

Evaluating Power-On and Power-Off Recommendations

Resource LowScore and HighScore

Host Power-On Recommendations

Host Power-Off Recommendations

DPM Power-Off Cost/Benefit Analysis

Integration with DRS and High Availability

Distributed Resource Scheduler

High Availability

DPM awareness of High Availability Primary Nodes

DPM Standby Mode

DPM WOL Magic Packet

Baseboard Management Controller

Protocol Selection Order

DPM and Host Failure Worst Case Scenario

DRS, DPM and VMware Fault Tolerance

DPM Scheduled Tasks

Summarizing

Appendix A – Basic Design Principles

VMware High Availability

VMware Distributed Resource Scheduler

Appendix B – HA Advanced Settings


Acknowledgements

The authors of this book work for VMware. The opinions expressed here are the authors’ personal

opinions. Content published was not read or approved in advance by VMware and does not

necessarily reflect the views and opinions of VMware. This is the authors’ book, not a VMware book.

First of all we would like to thank our VMware management team (Steve Beck, Director; Rob

Jenkins, Director) for supporting us on this and other projects.

A special thanks goes out to our Technical Reviewers: fellow VCDX Panel Member Craig Risinger

(VMware PSO), Marc Sevigny (VMware HA Engineering), Anne Holler (VMware DRS Engineering)

and Bouke Groenescheij (Jume.nl) for their very valuable feedback and for keeping us honest.

A very special thanks to our families and friends for supporting this project. Without your support

we could not have done this.

We would like to dedicate this book to the VMware Community. We highly appreciate all the effort

everyone is putting in to take VMware, Virtualization and Cloud to the next level. This is our gift to

you.

Duncan Epping and Frank Denneman


Foreword

Since its inception, server virtualization has forever changed how we build and manage the

traditional x86 datacenter. In its early days of providing an enterprise-ready hypervisor, VMware

focused its initial virtualization efforts on meeting the need for server consolidation. Increased optimization of low-utilized systems and lower datacenter costs for cooling, electricity, and floor space were a surefire recipe for VMware's early success. Shortly after introducing

virtualization solutions, customers started to see the significant advantages introduced by the

increased portability and recoverability that were all of a sudden available.

It’s this increased portability and recoverability that significantly drove VMware’s adoption during

its highest growth period. Recovery capabilities and options that were once reserved for the most

critical of workloads within the world’s largest organizations became broadly available to the

masses. Replication, High-Availability, and Fault Tolerance were once synonymous with "Expensive Enterprise Solutions", but are now available to even the smallest of companies. Data protection enhancements, when combined with intelligent resource management, placed VMware squarely at the top of the market leadership board. VMware's virtualization platform can

provide near instant recovery time with increasingly more recent recovery points in a properly

designed environment.

Now, if you’ve read this far, you likely understand the significant benefits that virtualization can

provide, and are probably well on your way to building out your virtual infrastructure and strategy.

The capabilities provided by VMware are not ultimately what dictates the success or failure of a

virtualization project, especially as increasingly more critical applications are introduced and

require greater availability and recoverability service levels. It takes a well-designed virtual

infrastructure and a full understanding of how the business requirements of the organization align

to the capabilities of the platform.

This book is going to arm you with the information necessary to understand the in-depth details of

what VMware can provide you when it comes to improving the availability of your systems. This

will help you better prepare for, and align to, the requirements of your business as well as set the

proper expectations with the key stakeholders within the IT organization. Duncan and Frank have poured their extensive field experience into this book to enable you to drive broader virtualization

adoption across more complex and critical applications. This book will enable you to make the

most educated decisions as you attempt to achieve the next level of maturity within your virtual

environment.

Scott Herold

Lead Architect, Virtualization Business, Quest Software


Part 1 VMware High Availability


Chapter 1 Introduction to VMware High Availability

VMware High Availability (HA) provides a simple and cost effective clustering solution to increase

uptime for virtual machines. HA uses a heartbeat mechanism to detect a host or virtual machine

failure. In the event of a host failure, affected virtual machines are automatically restarted on other

production hosts within the cluster with spare capacity. In the case of a failure caused by the Guest

OS, HA restarts the failed virtual machine on the same host. This feature is called VM Monitoring,

but sometimes also referred to as VM HA.

Figure 1: High Availability in action

Unlike many other clustering solutions, HA is literally configured and enabled with 4 clicks. However, HA is not, and let's repeat it, is not a 1:1 replacement for solutions like Microsoft Clustering Services (MSCS). MSCS and, for instance, Linux Clustering are stateful clustering solutions where the state of the service or application is preserved when one of the nodes fails. The service is transitioned to one of the other nodes and should resume with limited downtime or loss of data. With HA the virtual machine is literally restarted and this incurs downtime. HA is a form of stateless clustering.


One might ask why you would want to use HA when a virtual machine is restarted and service is temporarily lost. The answer is simple: not all virtual machines (or services) need 99.999% uptime. For many services the type of availability HA provides is more than sufficient. Stateful clustering does not guarantee 100% uptime, can be complex and needs special skills and training. One example is managing patches and updates/upgrades in an MSCS environment; this could even cause more downtime if not operated correctly. Just like with MSCS, where a service or application is restarted during a failover, the same happens with HA and the affected virtual machines.

Besides that, HA reduces complexity, costs (associated with downtime and MSCS), resource

overhead and unplanned downtime for minimal additional costs. It is important to note that HA,

contrary to MSCS, does not require any changes to the guest as HA is provided on the hypervisor

level. Also, VM Monitoring does not require any additional software or OS modifications except for

VMware Tools, which should be installed anyway.

We can’t think of a single reason not to use it.

How Does High Availability Work?

Before we deep dive into the main constructs of HA and describe all the choices one has when configuring HA, we will first briefly touch on the requirements. Now, the question of course is: how does HA work? As briefly touched on in the introduction, HA triggers a response based on the loss of heartbeats. However, you might be more interested in knowing which components VMware uses and what is required in order for HA to function correctly. Maybe, if this is the first time you are exposed to HA, you also want to know how to configure it.

Pre-requisites

For those who want to configure HA, the following items are the pre-requisites in order for HA to function correctly:

Minimum of two VMware ESX or ESXi hosts

Minimum of 2300MB memory to install the HA Agent

VMware vCenter Server

Redundant Service Console or Management Network (not a requirement, but highly

recommended)

Shared Storage for VMs – NFS, SAN, iSCSI

Pingable gateway or other reliable address for testing isolation

We recommend against using a mixed cluster. With that we mean a single cluster containing both

ESX and ESXi hosts. Differences in build numbers have led to serious issues in the past when using VMware FT (KB article: 1013637).


Firewall Requirements

The following list contains the ports that are used by HA for communication. If your environment contains firewalls, ensure these ports are opened for HA to function correctly.

High Availability port settings:

8042 – UDP - Used for host-to-host "backbone" (message bus) communication.

8042 – TCP - Used by AAM agents to communicate with a remote backbone.

8043 – TCP - Used to locate a backbone at bootstrap time.

8044 – UDP - Used by HA to send heartbeats.

2050 – 2250 - Used by AAM agent process to communicate with the backbone.
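These ports can also be checked from PowerCLI per host. The following is a minimal sketch, assuming an already connected PowerCLI session (Connect-VIServer); the host name "esx01.local" is a placeholder, and the assumption that the HA (AAM) related rules appear with "AAM" in their name may not hold for every ESX version:

# Sketch: list firewall exceptions that look related to the HA (AAM) agent on one host
Get-VMHost -Name "esx01.local" | Get-VMHostFirewallException |
    Where-Object { $_.Name -match "AAM" } | Select-Object Name, Enabled

If nothing is returned, verify the ports listed above manually in the host's Security Profile.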

Configuring VMware High Availability

As described earlier, HA can be configured with the default settings within 4 clicks. The following steps, however, will show you how to create a cluster and how to enable HA including VM Monitoring. Each of the settings and the mechanisms associated with these will be described more in-depth in the following chapters.

1. Select the Hosts & Clusters view.

2. Right-click the Datacenter in the Inventory tree and click New Cluster.

3. Give the new cluster an appropriate name. We recommend at a minimum including the

location of the cluster and a sequence number, e.g. ams-hadrs-001.

4. In the Cluster Features section of the page, select Turn On VMware HA and click Next.

5. Ensure Host Monitoring Status and Admission Control are enabled and click Next.
6. Leave the Cluster Default Settings as they are and click Next.
7. Enable VM Monitoring Status by selecting "VM Monitoring Only" and click Next.
8. Leave VMware EVC set to the default and click Next.
9. Leave the Swapfile Policy set to the default and click Next.
10. Click Finish to complete the creation of the cluster.


When the HA cluster has been created ESX hosts can be added to the cluster simply by dragging

them into the cluster. When an ESX host is added to the cluster the HA agent will be loaded.
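For those who prefer scripting, roughly the same result can be achieved with PowerCLI. The snippet below is a minimal sketch, assuming a connected PowerCLI session; the datacenter, cluster and host names are placeholders, and VM Monitoring can be enabled afterwards through the cluster settings as shown in the steps above:

# Sketch: create an HA-enabled cluster with admission control and move an existing host into it
$dc = Get-Datacenter -Name "Amsterdam"
$cluster = New-Cluster -Name "ams-hadrs-001" -Location $dc -HAEnabled -HAAdmissionControlEnabled
Move-VMHost -VMHost (Get-VMHost -Name "esx01.local") -Destination $cluster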


Chapter 2 Components of High Availability

Now that we know what the pre-requisites are and how to configure HA, the next step will be describing which components form HA. This is still a "high level" overview, however. There is more under the covers that we will explain in the following chapters. The following diagram depicts a two-host cluster and shows the key HA components.

Figure 3: Components of High Availability

As you can clearly see there are three major components that form the foundation for HA:

VPXA

VMAP

AAM


VPXA

The first and probably the most important is VPXA. This is not an HA agent, but the vCenter agent, and it allows your vCenter Server to interact with your ESX host. It also takes care of stopping and starting virtual machines if and when needed.

HA is loosely coupled with vCenter Server. Although HA is configured by vCenter Server, it does not

need vCenter to manage an HA failover. It is comforting to know that in case of a failure of the host containing the virtualized vCenter server, HA takes care of the failure and restarts the vCenter

server on another host, including all other configured virtual machines from that failed host.

When a virtual vCenter is used we do however recommend setting the correct restart priorities

within HA to avoid any dependency problems.

It’s highly recommended to register ESX hosts with their FQDN in vCenter. VMware vCenter

supplies the name resolution information that HA needs to function. HA stores this locally in a file

called “FT_HOSTS”. In other words, from an HA perspective there is no need to create local host files

and it is our recommendation to avoid using local host files. They are too static and will make

troubleshooting more difficult.

To stress this point even more: as of vSphere 4.0 Update 1, host files (i.e. /etc/hosts) are corrected automatically by HA. In other words, if you have made a typo or, for example, forgot to add the short name, HA will correct the host file to make sure nothing interferes with HA.

Basic design principle:

Avoid using static host files as it leads to inconsistency, which makes troubleshooting

difficult.

VMAP Plug-In

Next on the list is VMAP. Where vpxa is the process for vCenter to communicate with the host, VMAP is the translator between the HA agent (AAM) and vpxa. When vpxa wants to communicate with the AAM agent, VMAP will translate this into instructions the AAM agent understands. A good example of what VMAP would translate is the state of a virtual machine: is it powered on or powered off? Pre-vSphere 4.0, VMAP was a separate process instead of a plugin linked into vpxa. VMAP is loaded into vpxa at runtime when a host is added to an HA cluster.

Vpxa communicates with VMAP and VMAP communicates with AAM. When AAM has received and flushed the info, it will tell VMAP, and VMAP in turn will acknowledge to vpxa that the info has been processed. The VMAP plug-in acts as a proxy for communication to AAM.


One thing you are probably wondering is: why do we need VMAP in the first place? Wouldn't this be something vpxa or AAM should be able to do? The answer is yes, either vpxa or AAM should be able to carry out this functionality. However, when HA was first introduced it was architecturally more prudent to create a separate process for dealing with this, which has now been turned into a plugin.

AAM

That brings us to our next and final component, the AAM agent. The AAM agent is the core of HA and actually stands for "Automated Availability Manager". AAM was originally developed by Legato. It is responsible for many tasks such as communicating host resource information, virtual machine states and HA properties to other hosts in the cluster. AAM stores all this info in a database and ensures consistency by replicating this database amongst all primary nodes. (Primary nodes are discussed in more detail in the next chapter.) It is often mentioned that HA uses an in-memory database only; this is not the case! The data is stored in a database on local storage or in FLASH memory on diskless ESXi hosts.

One of the other tasks AAM is responsible for is the mechanism with which HA detects

isolations/failures: heartbeats.

All this makes the AAM agent one of the most important processes on an ESX host, when HA is

enabled of course, but we are assuming for now it is. The engineers recognized the importance and

added an extra level of resiliency to HA. The agent is multi-process and each process acts as a

watchdog for the other. If one of the processes dies the watchdog functionality will pick up on this

and restart the process to ensure HA functionality remains without anyone ever noticing it failed. It

is also resilient to network interruptions and component failures. Inter-host communication

automatically uses another communication path (if the host is configured with redundant

management networks) in the case of a network failure. The underlying message framework guarantees exactly-once message delivery.


Chapter 3 Nodes

Now that you know what the components of HA are, it is time to start talking about one of the most crucial concepts when it comes to designing HA clusters.

Everyone who has implemented VMware VI3 or vSphere knows that multiple hosts can form a

cluster. A cluster can best be seen as a collection of resources. These resources can be carved up

with the use of VMware Distributed Resource Scheduler (DRS) into separate pools of resources or

used to increase availability by enabling HA.

Before we discuss the various options one has during the configuration of HA there is one

important aspect that needs to be discussed first, and that is the concept of nodes. It is important to

understand the concepts of nodes as how they work can and will influence your design.

The following diagram depicts the concepts of nodes:

Figure 4: Primary and secondary hosts


An HA cluster consists of hosts, or nodes as HA calls them. There are two types of nodes. A node is

either a primary or a secondary node. This concept was introduced to enable scaling up to 32 hosts

in a cluster and each type of node has a different role. Primary nodes hold cluster settings and all

“node states”. The data a primary node holds is stored in a persistent database and synchronized

between primaries as depicted in the diagram above.

An example of node state data would be host resource usage. In case vCenter is not available the

primary nodes will always have a very recent calculation of the resource utilization and can take

this into account when a failover needs to occur. Secondary nodes send their state info to primary

nodes. This will be sent when changes occur, generally within seconds after a change. As of vSphere

4.1 by default every host will send an update of its status every 10 seconds. Pre-vSphere 4.1 this

used to be every second.

This interval can be controlled by an advanced setting called das.sensorPollingFreq. As stated

before, the default value of this advanced setting is 10. Although a smaller value will lead to a more up-to-date view of the status of the cluster overall, it will also increase the amount of traffic between

nodes. It is not recommended to decrease this value as it might lead to decreased scalability due to

the overhead of these status updates. The maximum value of the advanced setting is 30.

As discussed earlier, HA uses a heartbeat mechanism to detect possible outages or network

isolation. The heartbeat mechanism is used to detect a failed or isolated node. However, a node will

recognize it is isolated by the fact that it isn’t receiving heartbeats from any of the other nodes.

Nodes send a heartbeat to each other. Primary nodes send heartbeats to all primary nodes and all

secondary nodes. Secondary nodes send their heartbeats to all primary nodes, but not to

secondaries. Nodes send out these heartbeats every second by default. However, this is a

configurable value through the use of the following cluster advanced setting:

das.failuredetectioninterval. However, we do not recommend changing this interval as it was

carefully selected by VMware.

The first 5 hosts that join the HA cluster are automatically selected as primary nodes. All other

nodes are automatically selected as secondary nodes. When you do a reconfigure for HA, the

primary nodes and secondary nodes are selected again; this is virtually random.

Except for the first host that is added to the cluster, any host that joins the cluster must

communicate with an existing primary node to complete its configuration. At least one primary host

must be available for HA to operate correctly. If all primary hosts are unavailable, you will not be

able to add or remove a host from your cluster.

The vCenter client normally does not show which host is a primary node and which is a secondary

node. As of vCenter 4.1 a new feature has been added which is called “Operational Status” and can

be found on the HA section of the Cluster’s summary tab. It will give details around errors and will

show the primary and secondary nodes. There is one gotcha however; it will only show which

nodes are primary and secondary in case of an error.


Figure 5: Cluster operational status

This however can also be revealed from the Service Console or via PowerCLI. The following are two

examples of how to list the primary nodes via the Service Console (ESX 4.0):

Figure 6: List node command

Another method of showing the primary nodes is:


Figure 7: List nodes command

With PowerCLI the primary nodes can be listed with the following lines of code:

PowerCLI code: Get-Cluster <clustername> | Get-HAPrimaryVMHost

Now that you have seen that it is possible to list all nodes with the CLI, you probably wonder what else is possible… Let's start with a warning: this is not supported! Currently the supported limit of primaries is 5. This is a soft limit, however. It is possible to manually add a 6th primary, but this is neither supported nor encouraged.

Having more than 5 primaries in a cluster will significantly increase network and CPU overhead.

There should be no reason to increase the number of primaries beyond 5. For the purpose of

education we will demonstrate how to promote a secondary node to primary and vice versa.

To promote a node:


Figure 8: Promote node command

To demote a node:

Figure 9: Demote node command

This method, however, is unsupported and there is no guarantee it will keep working in the future. On earlier versions of ESX, "ftcli" should be used; this command cannot be run, however, without setting the required environment variables first. You can execute …/config/agent_env.[platform] to set these.


Promoting Nodes

A common misunderstanding about HA with regard to primary and secondary nodes is the re-election process. When does a re-election, or promotion, occur?

It is a common misconception that a promotion of a secondary occurs when a primary node fails.

This is not the case. Let’s stress that, this is not the case! The promotion of a secondary node to

primary only occurs in one of the following scenarios:

When a primary node is placed in “Maintenance Mode”

When a primary node is disconnected from the cluster

When a primary node is removed from the cluster

When the user clicks “reconfigure for HA” on any ESX host

This is particularly important for the operational aspect of a virtualized environment. When a host

fails it is important to ensure its role is migrated to any of the other hosts in case it was an HA

primary node. To simplify it: when a host fails, we recommend placing it in maintenance mode, disconnecting it or removing it from the cluster to avoid any risks!

If all primary hosts fail simultaneously no HA initiated restart of the virtual machines can take

place. HA needs at least one primary node to restart virtual machines. This is why you can configure

HA to tolerate only up to 4 host failures when you have selected the “host failures” Admission

Control Policy (remember, 5 primaries…). The number of primaries is definitely something to take into account when designing for uptime.
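As an illustration, the number of host failures the cluster tolerates can also be set from PowerCLI with Set-Cluster. This is a minimal sketch, assuming a connected session; the cluster name is a placeholder and, for the reason just explained, the value cannot exceed 4:

# Sketch: enable admission control and tolerate 2 host failures (the maximum is 4, given 5 primaries)
Get-Cluster -Name "ams-hadrs-001" |
    Set-Cluster -HAAdmissionControlEnabled:$true -HAFailoverLevel 2 -Confirm:$false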

Failover Coordinator

As explained in the previous section, you will need at least one primary to restart virtual machines.

The reason for this is that one of the primary nodes will hold the “failover coordinator” role. This

role will be randomly assigned to a primary node; this role is also sometimes referred to as “active

primary”. We will use “failover coordinator” for now.

The failover coordinator coordinates the restart of virtual machines on the remaining primary and

secondary hosts. The coordinator takes restart priorities into account when coordinating the restarts. Pre-vSphere 4.1, when multiple hosts failed at the same time, the restarts would be handled serially. In other words, the virtual machines of the first failed host would be restarted (taking restart priorities into account) and then the virtual machines of the host that failed second (again taking restart priorities into account). As of vSphere 4.1 this mechanism has been significantly improved. In the

case of multiple near-simultaneous host failures, all the host failures that occur within 15 seconds

will have all their VMs aggregated and prioritized before the power-on operations occur.

If the failover coordinator fails, one of the other primaries will take over. This node is again

randomly selected from the pool of available primary nodes. Like any other process within the HA

stack, the failover coordinator process is carefully watched by the watchdog functionality of HA.


Pre-vSphere 4.1 the failover coordinator would decide where a virtual machine would be restarted.

Basically it would check which host had the highest percentage of unreserved and available

memory and CPU and select it to restart that particular virtual machine. For the next virtual

machine the same exercise would be done by HA: select the host with the highest percentage of

unreserved memory and CPU and restart the virtual machine.

HA does not coordinate with DRS when making the decision on where to place virtual machines. HA

would rely on DRS. As soon as the virtual machines were restarted, DRS would kick in and

redistribute the load if and when needed.

As of vSphere 4.1 virtual machines will be evenly distributed across hosts to lighten the load on the

hostd service and to get quicker power-on results. HA then relies on DRS to redistribute the load

later if required. This improvement results in faster restarts of the virtual machines and less stress

on the ESX hosts. DRS also re-parents the virtual machine when it is booted up as virtual machines

are failed over into the root resource pool by default. This re-parenting process, however, already existed pre-vSphere 4.1.

The failover coordinator can restart up to 32 VMs concurrently per host. The number of concurrent

failovers can be controlled by an advanced setting called das.perHostConcurrentFailoversLimit. As

stated the default value is 32. Setting a larger value will allow more VMs to be restarted

concurrently and might reduce the overall VM recovery time, but the average latency to recover

individual VMs might increase.

In blade environments it is particularly important to factor the primary nodes and failover coordinator concept into your design. When designing a multi-chassis environment, the impact of a single chassis failure needs to be taken into account. When all primary nodes reside in a single chassis and the chassis fails, no virtual machines will be restarted, as the failover coordinator is the only one that initiates the restart of your virtual machines. When it is unavailable, no restart will take place.

It is a best practice to have the primaries distributed amongst the chassis so that, in case an entire chassis fails or a rack loses power, there is still a running primary to coordinate the failover. This can even be extended in very large environments by having no more than 2 hosts of a cluster in a chassis.

The following diagram depicts the scenario where four 8-host clusters are spread across four chassis.


Figure 10: Logical cluster layout on blade environment

Basic design principle: In blade environments, divide hosts over all blade chassis and never exceed

four hosts per chassis to avoid having all primary nodes in a single chassis.
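To verify how the primary nodes are actually distributed, the Get-HAPrimaryVMHost cmdlet shown earlier can be put in a small loop. A sketch, assuming a connected PowerCLI session; mapping a host name to a blade chassis is left to whatever naming convention is used in your environment:

# Sketch: print the current primary nodes per cluster so their chassis placement can be checked
foreach ($cluster in Get-Cluster) {
    Write-Host "Cluster:" $cluster.Name
    $cluster | Get-HAPrimaryVMHost | ForEach-Object { Write-Host "  Primary node:" $_.Name }
}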

Preferred Primary

With vSphere 4.1 a new advanced setting has been introduced. This setting is not even experimental; it is currently considered unsupported. We don't recommend anyone using it in a production environment; if you do want to play around with it, use your test environment.

This new advanced setting is called das.preferredPrimaries. With this setting multiple hosts of a

cluster can be manually designated as preferred nodes during the primary node election process.

The list of nodes can either be comma or space separated and both hostnames and IP addresses are

allowed. Below you can find an example of what this would typically look like. The “=” sign has been

used as a divider between the setting and the value.


das.preferredPrimaries = hostname1,hostname2,hostname3

or

das.preferredPrimaries = 192.168.1.1 192.168.1.2 192.168.1.3

As shown, there is no need to specify 5 hosts; you can specify any number of hosts. If you specify 5 hosts or fewer and all of them are available, they will become the primary nodes in your cluster. If you specify more than 5 hosts, the first 5 hosts of your list will become primary.
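If you do want to experiment with this in a test environment, the setting can also be added from PowerCLI. A minimal sketch, assuming a connected session and a PowerCLI version that includes the New-AdvancedSetting cmdlet; the cluster and host names are placeholders and, as stressed above, the setting itself is unsupported:

# Sketch: add the (unsupported) das.preferredPrimaries advanced setting to a test cluster
New-AdvancedSetting -Entity (Get-Cluster -Name "test-cluster") -Type ClusterHA -Name "das.preferredPrimaries" -Value "hostname1,hostname2,hostname3" -Confirm:$false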

Again, please be warned that this is considered unsupported at the time of writing, and please verify in the VMware Availability Guide or online in the knowledge base (kb.vmware.com) what the support status of this feature is before even thinking about implementing it.

A workaround used by some pre-vSphere 4.1 was the "promote/demote" option of HA's CLI as described earlier in this chapter. Although this solution can fairly easily be scripted, it is unsupported and, as opposed to "das.preferredPrimaries", a rather static solution.


Chapter 4 High Availability Constructs

When configuring HA, two major decisions will need to be made:

Isolation Response

Admission Control

Both are important to how HA behaves. Both will also have an impact on availability. It is really

important to understand these concepts. Both concepts have specific caveats. Without a good

understanding of these it is very easy to increase downtime instead of decreasing downtime.

Isolation Response

One of the first decisions that will need to be made when HA is configured is the “isolation

response”. The isolation response refers to the action that HA takes for its VMs when the host has

lost its connection with the network. This does not necessarily mean that the whole network is down; it could just be this host's network ports or just the ports that are used by HA for the heartbeat. Even if your virtual machine has a network connection and only your "heartbeat network" is isolated, the isolation response is triggered.

Today there are three isolation responses: "Power off", "Leave powered on" and "Shut down". These answer the question of what a host should do when it has detected it is isolated from the network. The remaining, non-isolated hosts will always try to restart the virtual machines, no matter which of the following three options is chosen as the isolation response:

Power off – When network isolation occurs all virtual machines are powered off. It is a hard

stop, or to put it bluntly, the power cable of the VMs will be pulled out!

Shut down – When network isolation occurs all virtual machines running on the host will be

shut down using VMware Tools. If this is not successful within 5 minutes, a “power off” will

be executed. This time out value can be adjusted by setting the advanced option

das.isolationShutdownTimeout. If VMware Tools is not installed, a “power off” will be

initiated immediately.

Leave powered on – When network isolation occurs on the host, the state of the virtual

machines remains unchanged.

This setting can be changed on the cluster settings under virtual machine options.

Figure 11: Cluster default setting


The default setting for the isolation response has changed multiple times over the last couple of

years. Up to ESX 3.5 U2 / vCenter 2.5 U2 the default isolation response when creating a new cluster

was “Power off”. This changed to “Leave powered on” as of ESX 3.5 U3 / vCenter 2.5 U3. However

with vSphere 4.0 this has changed again. The default setting for newly created clusters, at the time

of writing, is "Shut down", which might not be the desired response. When installing a new environment, you might want to change the default setting based on your customer's requirements

or constraints.

The question remains: which setting should you use? The obvious answer applies here: it depends. We prefer "Shut down" because we do not want to use a degraded host to run our virtual machines on and it will shut down your virtual machines in a clean manner. Many people, however, prefer to use "Leave powered on" because it eliminates the chance of a false positive and the downtime associated with a false positive. A false positive in this case is an isolated heartbeat network but a

non-isolated virtual machine network and a non-isolated iSCSI / NFS network.

That leaves the question of how the other HA nodes know whether the host is isolated or has failed.

HA actually does not know the difference. The other HA nodes will try to restart the affected virtual

machines in either case. When the host is unavailable, a restart attempt will take place no matter

which isolation response has been selected. If a host is merely isolated, the non-isolated hosts will

not be able to restart the affected virtual machines. The reason for this is the fact that the host that

is running the virtual machine has a lock on the VMDK and swap files. None of the hosts will be able

to boot a virtual machine when the files are locked. For those who don’t know, ESX locks files to

prevent the possibility of multiple ESX hosts starting the same virtual machine. However, when a

host fails, this lock expires and a restart can occur.

To reiterate, the remaining nodes will always try to restart the “failed” virtual machines. The

possible lock on the VMDK files belonging to these virtual machines, in the case of an isolation

event, prevents them from being started. This assumes that the isolated host can still reach the files,

which might not be true if the files are accessed through the network on iSCSI, NFS, or FCoE based

storage. HA however will repeatedly try starting the “failed” virtual machines when a restart is

unsuccessful.

The number of retries is configurable as of vCenter 2.5 U4 with the advanced option "das.maxvmrestartcount". The default value is 5. Pre-vCenter 2.5 U4, HA would keep retrying forever, which could lead to serious problems as described in KB article 1009625, where multiple virtual machines would be registered on multiple hosts simultaneously, leading to a confusing and inconsistent state (http://kb.vmware.com/kb/1009625).


HA will try to start the virtual machine on one of your hosts in the affected cluster; if this is

unsuccessful on that host, the restart count will be increased by 1. The next restart attempt will then occur after two minutes. If that one fails, the next will occur after 4 minutes, and if that one fails, the following will occur after 8 minutes until "das.maxvmrestartcount" has been reached. To make it clearer, look at the following list:

T+0 – Restart

T+2 – Restart retry 1

T+4 – Restart retry 2

T+8 – Restart retry 3

T+8 – Restart retry 4

T+8 – Restart retry 5

As shown in the bullet list above and clearly depicted in the diagram below, a successful power-on attempt could take up to 30 minutes in case multiple power-on attempts are unsuccessful: the intervals of 2, 4, 8, 8 and 8 minutes between the attempts add up to 30 minutes after the initial restart. However, HA does not give a guarantee and a successful power-on attempt might never take place.

Figure 12: High Availability restart timeline


Split-Brain

When creating your design, make sure you understand the isolation response setting. For instance

when using an iSCSI array or NFS based storage choosing “Leave powered on” as your default

isolation response might lead to a split-brain situation.

A split-brain situation can occur when the VMDK file lock times out. This could happen when the

iSCSI, FCoE or NFS network is also unavailable. In this case the virtual machine is being restarted on

a different host while it is not being powered off on the original host because the selected isolation

response is "Leave powered on". This could potentially leave vCenter in an inconsistent state, as two VMs with the same UUID would be reported as running on both hosts. This would cause a

“ping-pong” effect where the VM would appear to live on ESX host 1 at one moment and on ESX

host 2 soon after.

VMware’s engineers have recognized this as a potential risk and developed a solution for this

unwanted situation. (This is not well documented, but was briefly explained by one of the engineers on the VMTN Community forums: http://communities.vmware.com/message/1488426#1488426.)

In short: as of version 4.0 Update 2, ESX detects that the lock on the VMDK has been lost, issues a question whether the virtual machine should be powered off, and auto-answers the question with yes.

However, you will only see this question if you directly connect to the ESX host. HA will generate an

event for this auto-answer though, which is viewable within vCenter. Below you can find a

screenshot of this question.

Figure 13: Virtual machine message


As stated above, as of ESX 4.0 Update 2 the question will be auto-answered and the virtual machine will be powered off to recover from the split-brain scenario.

The question still remains: with iSCSI or NFS, should you power off virtual machines or leave them

powered on?

As described above, in earlier versions "Leave powered on" could lead to a split-brain scenario. You would end up seeing virtual machines ping-ponging between hosts, as vCenter would not know where they resided since they were active in memory on two hosts. As of ESX 4.0 Update 2, this is no longer the case and it should be safe to use "Leave powered on".

We recommend avoiding the chances of a split-brain scenario: configure a secondary Service Console on the same vSwitch and network as the iSCSI or NFS VMkernel portgroup and, pre-vSphere 4.0 Update 2, select either "Power off" or "Shut down" as the isolation response. By doing this you will be able to detect if there's an outage on the storage network. We will discuss the options you have for Service Console / Management Network redundancy more extensively later on in this book.

Basic design principle: For network-based storage (iSCSI, NFS, FCoE) it is

recommended (pre-vSphere 4.0 Update 2) to set the isolation response to "Shut Down" or

“Power off”. It is also recommended to have a secondary Service Console (ESX) or

Management Network (ESXi) running on the same vSwitch as the storage network to detect

a storage outage and avoid false positives for isolation detection.

Isolation Detection

We have explained what the options are to respond to an isolation event. However, we have not extensively discussed how isolation is detected. This is one of the key mechanisms of HA. Isolation detection is a mechanism that takes place on the host that is isolated. The remaining, non-isolated hosts don't know if that host has failed completely or if it is isolated from the network; they only know it is unavailable.

The mechanism is fairly straightforward though and works as earlier explained with heartbeats.

When a node receives no heartbeats from any of the other nodes for 13 seconds (default setting)

HA will ping the "isolation address". Remember: primary nodes send heartbeats to primaries and secondaries; secondary nodes send heartbeats only to primaries.

The isolation address is the gateway specified for the Service Console network (or management

network on ESXi), but there is a possibility to specify one or multiple additional isolation addresses


with an advanced setting. This advanced setting is called “das.isolationaddress” and could be used

to reduce the chances of having a false positive. We recommend setting at least one additional

isolation address.

Figure 14: das.isolationaddress

When isolation has been confirmed, meaning no heartbeats have been received and HA was unable

to ping any of the isolation addresses, HA will execute the isolation response. This could be any of

the above-described options: power off, shut down or leave powered on.

If only one heartbeat is received or just a single isolation address can be pinged, the isolation

response will not be triggered, which is exactly what you want.

Selecting an Additional Isolation Address

A question asked by many people is which address should be specified for this additional isolation

verification. We generally recommend an isolation address closest to the hosts to avoid too many network hops. In many cases the most logical choice is the physical switch to which the host is directly connected; another usual suspect would be a router or any other reliable and pingable device. However, when you are using network-based shared storage like NFS or iSCSI, a good choice would be the IP address of the storage device; this way you would also verify whether the storage is still reachable.

Failure Detection Time

Failure Detection Time seems to be a concept that is often misunderstood but is critical when

designing a virtual infrastructure. Failure Detection Time is basically the time it takes before the

“isolation response” is triggered. There are two primary concepts when we are talking about failure

detection time:

The time it will take the host to detect it is isolated

The time it will take the non-isolated hosts to mark the unavailable host as isolated and

initiate the failover

The following diagram depicts the timeline for both concepts:

Figure 15: High Availability failure detection time

The default value for failure detection is 15 seconds (das.failuredetectiontime). In other words, the

failed or isolated host will be declared failed by the other hosts in the HA cluster on the fifteenth

second and a restart will be initiated by the failover coordinator after one of the primaries has

verified that the failed or isolated host is unavailable by pinging the host on its management

network.

It should be noted that in the case of a dual management network setup both addresses will be

pinged and 1 second will need to be added to the timeline, meaning that the failover coordinator will initiate the restart on the 17th second.


Let’s stress that again, a restart will be initiated after one of the primary nodes has tried to ping all

of the management network addresses of the failed host.

Let’s assume the isolation response is “Power off”. The isolation response “Power off” will be

triggered by the isolated host 1 second before the das.failuredetectiontime elapses. In other words a

“Power off” will be initiated on the fourteenth second. A restart will be initiated on the sixteenth

second by the failover coordinator if the host has a single management network.

Does this mean that you can end up with your virtual machines being down and HA not restarting

them?

Yes, when the heartbeat returns between the 14th and 16th second the “Power off” might have

already been initiated. The restart however will not be initiated because the received heartbeat

indicates that the host is not isolated anymore.

How can you avoid this?

Selecting “Leave VM powered on” as an isolation response is one option. Increasing the

das.failuredetectiontime will also decrease the chances of running into issues like these, and with

ESX 3.5 it was a standard best practice to increase the failure detection time to 30 seconds.

At the time of writing (vSphere) this is not a best practice anymore as with any value the “2-second”

gap exists and the likelihood of running into this issue is small. We recommend keeping

das.failuredetectiontime as low as possible to decrease associated down time.

Basic design principle: Keep das.failuredetectiontime low for fast responses to

failures. If an isolation validation address has been added, “das.isolationaddress”, add 5000

to the default “das.failuredetectiontime” (15000).
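As a worked example of this principle: with one additional isolation address the failure detection time becomes 15000 + 5000 = 20000 milliseconds. The sketch below shows how both advanced settings could be added with PowerCLI, assuming a connected session and a PowerCLI version that includes New-AdvancedSetting; the cluster name and IP address are placeholders:

# Sketch: add a second isolation address and raise the failure detection time accordingly
$cluster = Get-Cluster -Name "ams-hadrs-001"
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.isolationaddress2" -Value "192.168.1.254" -Confirm:$false
New-AdvancedSetting -Entity $cluster -Type ClusterHA -Name "das.failuredetectiontime" -Value "20000" -Confirm:$false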


Chapter 5

Adding Resiliency to HA (Network Redundancy)

In the previous chapter we extensively covered Isolation Detection, which triggers the selected Isolation Response, and the impact of a false positive. The Isolation Response enables HA to restart virtual machines when "Power off" or "Shut down" has been selected and the host is isolated from the network.

To increase resiliency of the “heartbeat” network (Service Console for ESX and Management

Network for ESXi) VMware introduced the concept of NIC teaming.

"NIC teaming is the process of grouping together several physical nics into one single logical nic, which can be used for network fault tolerance and load balancing."

Using this mechanism it is possible to add redundancy to the Management Network or Service

Console network to decrease the chances of a false positive. (This is of course also possible for other

“Portgroups” but that is not the topic of this book.) Another option is configuring a secondary

Management Network or Service Console network. VMware supports both of these configurations

and each has its own pros and cons, which are listed in the section below. To simplify the concepts we use ESX as an example; however, these recommendations are also valid for ESXi. We have

included the vMotion (VMkernel) network in our examples as combining the Service Console and

the VMkernel is the most commonly used configuration and a VMware best practice.

Single Service Console with vmnics in Active/Standby Configuration

Requirements:

2 physical NICs

VLAN trunking

Recommended:

2 physical switches

The vSwitch should be configured as follows:

vSwitch0: 2 Physical NICs (vmnic0 and vmnic2)

2 Portgroups (Service Console and VMkernel)


Service Console active on vmnic0 and standby on vmnic2

VMkernel active on vmnic2 and standby on vmnic0

Failback set to No

Each portgroup has a VLAN ID assigned and runs dedicated on its own physical NIC; only in the case of a failure is it switched over to the standby NIC. We highly recommend setting failback to "No" (NIC Teaming tab) to avoid the chance of a false positive, which can occur when a physical switch routes no traffic during boot but the ports are reported as "up".
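The active/standby order and the failback setting described above can also be applied from PowerCLI. This is a sketch, assuming a connected session and that the Get-NicTeamingPolicy and Set-NicTeamingPolicy cmdlets are available in the PowerCLI version used; the host name is a placeholder and the portgroup and vmnic names follow the example above:

# Sketch: Service Console portgroup active on vmnic0, standby on vmnic2, failback disabled
$esx = Get-VMHost -Name "esx01.local"
$active = Get-VMHostNetworkAdapter -VMHost $esx -Physical -Name vmnic0
$standby = Get-VMHostNetworkAdapter -VMHost $esx -Physical -Name vmnic2
$pg = Get-VirtualPortGroup -VMHost $esx -Name "Service Console"
Get-NicTeamingPolicy -VirtualPortGroup $pg | Set-NicTeamingPolicy -MakeNicActive $active -MakeNicStandby $standby -FailbackEnabled:$false

The same approach can be repeated for the VMkernel portgroup with the active and standby adapters reversed.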

Pros: Only 2 NICs in total are needed for the Service Console and VMkernel, especially useful in

Blade environments. This setup is also less complex.

Cons: Just a single active path for heartbeats.

The following diagram depicts the active/standby scenario:

Figure 16: Active-standby Service Console network layout


Secondary Management Network

Requirements:

3 physical NICs

VLAN trunking

Recommended:

2 physical switches

The vSwitch should be configured as follows:

vSwitch0: 3 Physical NICs (vmnic0, vmnic1 and vmnic2)
3 Portgroups (Service Console, secondary Service Console and VMkernel)


The primary Service Console runs on vSwitch0 and is active on vmnic0, with a VLAN assigned on

either the physical switch port or the portgroup and is connected to the first physical switch. (We

recommend using a VLAN trunk for all network connections for consistency and flexibility.)

The secondary Service Console will be active on vmnic2 and connected to the second physical

switch.

The VMkernel is active on vmnic1 and standby on vmnic2.

Pros: Decreased chances of false alarms due to Spanning Tree "problems", as the setup contains two Service Consoles that are each connected to only one physical switch. Additionally, both Service Consoles will be used for the heartbeat mechanism, which increases resiliency.

Cons: Advanced settings need to be set. It is mandatory to set an additional isolation address

(das.isolationaddress2) in order for the secondary Service Console to verify network isolation via a

different route.

The following diagram depicts the secondary Service Console scenario:

Figure 17: Secondary management network


The question remains: which would we recommend? Both scenarios are fully supported and

provide a highly redundant environment either way. Redundancy for the Service Console or

Management Network is important for HA to function correctly and avoid false alarms about the

host being isolated from the network. We, however, recommend the first scenario. Redundant NICs for your Service Console add a sufficient level of resilience without leading to an overly complex

environment.


Chapter 6

Admission Control

Admission Control is often misunderstood and disabled because of this. However, Admission Control is a must when availability needs to be guaranteed, and isn't that the reason for enabling HA in the first place?

What is HA Admission Control about? Why does HA contain Admission Control?

The "Availability Guide", a.k.a. the HA bible, states the following:

"vCenter Server uses Admission Control to ensure that sufficient resources are available in a cluster to provide failover protection and to ensure that virtual machine resource reservations are respected."

Admission Control guarantees capacity is available for an HA initiated failover by reserving

resources within a cluster. It calculates the capacity required for a failover based on available

resources. In other words, if a host is placed into maintenance mode, or disconnected, it is taken out of the equation. "Available resources" also means that the virtualization overhead has already been subtracted from the total. To give an example: Service Console memory and VMkernel memory are subtracted from the total amount of memory, which results in the available memory for the virtual machines.

There is one gotcha with Admission Control that we want to bring to your attention before drilling

into the different policies.

When Admission Control is set to strict, VMware Distributed Power Management will in no way violate availability constraints. This means that it will always ensure multiple hosts are up and running. (For more info on how DPM calculates this, read Chapter 18.)

When Admission Control was disabled and DPM was enabled in a pre-vSphere 4.1 environment, you could have ended up with all but one ESX host placed in sleep mode, which could lead to potential issues when that particular host failed or resources were scarce, as there would be no host available to power on your virtual machines. (KB: http://kb.vmware.com/kb/1007006)


With vSphere 4.1, however, if there are not enough resources to power on all virtual machines, DPM will be asked to take hosts out of standby mode to make more resources available, and the virtual machines can then be powered on by HA when those hosts are back online.

Admission Control Policy

The Admission Control Policy dictates the mechanism that HA uses to guarantee enough resources

are available for an HA initiated failover. This section gives a general overview of the available

Admission Control Policies. The impact of each policy is described in the following section including

our recommendation.

HA has three mechanisms to guarantee enough capacity is available to respect virtual machine

resource reservations.

Figure 18: Admission control policy


Below we have listed all three options currently available as the Admission Control Policy. Each

option has a different mechanism to ensure resources are available for a failover and each option

has its caveats.

Admission Control Mechanisms

Each Admission Control Policy has its own Admission Control mechanism. Understanding these mechanisms is important in order to appreciate the impact of decisions on your cluster design. For instance, setting a reservation on a specific virtual machine can have an impact on the achieved consolidation ratio. This section will take you on a journey through the trenches of the Admission Control mechanisms.

Host Failures Cluster Tolerates

The Admission Control Policy that has been around the longest is the “Host Failures Cluster

Tolerates” policy. It is also historically the least understood Admission Control Policy due to its

complex admission control mechanism.

The so-called “slots” mechanism is used when selecting “host failures cluster tolerates” as the

Admission Control Policy. The mechanism of this concept has changed several times in the past and

it is one of the most restrictive policies.

Slots dictate how many virtual machines can be powered on before vCenter starts yelling "Out Of Resources!" Normally a slot represents one virtual machine. Admission Control does not limit HA in restarting virtual machines; it ensures enough resources are available to power on all virtual machines in the cluster by preventing over-commitment. For those wondering why HA initiated failovers are not subject to the Admission Control Policy, think back for a second: Admission Control is done by vCenter, while HA initiated restarts are executed directly on the ESX host without the use of vCenter. So even if resources were low and vCenter complained, it could not stop the restart. If a failure has occurred and the host has been removed from the cluster, HA will recalculate all the values and start with an "N+x" cluster again from scratch. This could result in an over-committed cluster, as you can imagine.

“A slot is defined as a logical representation of the memory and CPU resources that satisfy the

requirements for any powered-on virtual machine in the cluster…”

In other words, a slot is the worst case CPU and memory reservation scenario in a cluster. This directly leads to the first "gotcha":

HA uses the highest CPU reservation of any given virtual machine and the highest memory reservation of any given VM in the cluster. If no reservation higher than 256 MHz is set, HA will use a default of 256 MHz for CPU. If no memory reservation is set, HA will use a default of 0MB + memory overhead for memory. (See the VMware vSphere Resource Management Guide for more details on memory overhead per virtual machine configuration.) The following example will clarify what "worst-case" actually means.

Example - If virtual machine “VM1” has 2GHz of CPU reserved and 1024MB of memory reserved

and virtual machine “VM2” has 1GHz of CPU reserved and 2048MB of memory reserved the slot

size for memory will be 2048MB (+memory overhead) and the slot size for CPU will be 2GHz. It is a

combination of the highest reservation of both virtual machines. Reservations defined at the

Resource Pool level however, will not affect HA slot size calculations.

Basic design principle: Be really careful with reservations; if there is no need to have them on a per virtual machine basis, don't configure them, especially when using Host Failures Cluster Tolerates. If reservations are needed, resort to resource pool based reservations.

Now that we know the worst case scenario is always taken into account when it comes to slot size calculations, we will describe what dictates the amount of available slots per cluster.

We will need to know what the slot size for memory and CPU is first. Then we divide the total available CPU resources of a host by the CPU slot size and the total available memory resources of a host by the memory slot size. This leaves us with a number of slots for both memory and CPU. The most restrictive number, again the worst-case scenario, is the number of slots for this host. If you have 25 CPU slots but only 5 memory slots, the amount of available slots for this host will be 5, as HA will always take the worst case scenario into account to "guarantee" all virtual machines can be powered on in case of a failure or isolation.
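To make the mechanism concrete, here is a minimal, illustrative Python sketch of the calculation described above; the reservations and host capacities are made-up example values, and memory overhead is ignored for simplicity:

vm_reservations = [(2000, 1024), (1000, 2048)]   # (CPU MHz, memory MB) per powered-on VM
cpu_slot = max(256, max(cpu for cpu, mem in vm_reservations))   # default minimum of 256 MHz
mem_slot = max(mem for cpu, mem in vm_reservations)             # plus memory overhead in reality

hosts = [(9000, 16384), (9000, 16384)]            # (available MHz, available MB) per host
for mhz, mb in hosts:
    # The most restrictive of the CPU and memory slot counts wins.
    slots = min(mhz // cpu_slot, mb // mem_slot)
    print(cpu_slot, mem_slot, slots)               # -> 2000 2048 4 for each example host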

The question we receive a lot is how do I know what my slot size is? The details around slot sizes

can be monitored on the HA section of the Cluster’s summary tab by clicking the “Advanced

Runtime Info” line.

Figure 19: High Availability cluster summary tab

This will show the following screen that specifies the slot size and more useful details around the

amount of slots available.


Figure 20: High Availability advanced runtime info

As you can see using reservations on a per-VM basis can lead to very conservative consolidation

ratios. However, with vSphere this is something that is configurable. If you have just one virtual

machine with a really high reservation you can set the following advanced settings to lower the slot

size used for these calculations: “das.slotCpuInMHz” or “das.slotMemInMB”.

To avoid being unable to power on a virtual machine with a high reservation, such a virtual machine will simply take up multiple slots. When you are low on resources this could mean that you are not able to power on this high-reservation virtual machine, as resources may be fragmented throughout the cluster instead of available on a single host. As of vSphere 4.1, HA will notify DRS that a power-on attempt was unsuccessful and a request will be made to defragment the resources to accommodate the remaining virtual machines that need to be powered on.

The following diagram depicts a scenario where a virtual machine spans multiple slots:


Figure 21: Virtual machine spanning multiple HA slot

Notice that because the memory slot size has been manually set to 1024MB, one of the virtual machines (grouped with dotted lines) spans multiple slots due to a 4GB memory reservation. As you might have noticed, none of the hosts has 4 slots left. Although in total there are enough slots available, they are fragmented, and HA will not be able to power on this particular virtual machine directly, but will request DRS to defragment the resources to accommodate this virtual machine's resource requirements.
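To make the slot-spanning arithmetic explicit, here is a tiny illustrative sketch using the values from the scenario above:

import math

mem_slot_mb = 1024                      # slot size forced down via das.slotMemInMB
vm_reservation_mb = 4096                # the 4GB reservation from the example
slots_needed = math.ceil(vm_reservation_mb / mem_slot_mb)
print(slots_needed)                     # -> 4: this single VM consumes 4 slots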

Admission Control does not take fragmentation of slots into account when slot sizes are manually defined with advanced settings. It will take the number of slots this virtual machine will consume into account by subtracting them from the total number of available slots, but it will not verify the number of available slots per host to ensure failover. As stated earlier, HA will, as of vSphere 4.1, request DRS to defragment the resources. However, this is no guarantee of a successful power-on attempt or slot availability.


Basic design principle: Avoid using advanced settings to decrease the slot size, as it could lead to more downtime and adds an extra layer of complexity. If there is a large discrepancy in size and reservations are set, it might help to put similarly sized virtual machines into their own cluster.

Unbalanced Configurations and Impact on Slot Calculation

It is an industry best practice to create clusters with similar hardware configurations. However

many companies start out with a small VMware cluster when virtualization is introduced and plan

on expanding when trust within the organization has been built.

When the time has come to expand, chances are fairly high that the same hardware configuration is no longer available. The question is: will you add the newly bought hosts to the same cluster or create a new cluster?

From a DRS perspective, large clusters are preferred as they increase the load balancing options. However, there is a caveat for DRS as well, which is described in the DRS section of this book. For HA there is a big caveat, and when you think about it and understand the internal workings of HA, you probably already know what is coming up.

Let’s first define the term “unbalanced cluster”.

An unbalanced cluster would for instance be a cluster with 6 hosts of which one contains more

memory than the other hosts in the cluster.

Let’s try to clarify that with an example.

Example:

What would happen to the total number of slots in a cluster of the following specifications?

Six host cluster

Five hosts have 16GB of available memory

One host has 32GB of available memory

The sixth host is a brand new host that has just been bought and as prices of memory dropped

immensely the decision was made to buy 32GB instead of 16GB.

The cluster contains a virtual machine that has 1 vCPU and 4GB of memory. A 1024MB memory

reservation has been defined on this virtual machine. As explained earlier a reservation will dictate

the slot size, which in this case leads to a memory slot size of 1024MB+memory overhead. For the

sake of simplicity we will however calculate with 1024MB.

The following diagram depicts this scenario:


Figure 22: High Availability memory slot size

When Admission Control is enabled and a number of host failures has been specified as the

Admission Control Policy, the amount of slots will be calculated per host and the cluster in total.

This will result in:

ESX01 - 16 Slots

ESX02 - 16 Slots

ESX03 - 16 Slots

ESX04 - 16 Slots

ESX05 - 16 Slots

ESX06 - 32 Slots

As Admission Control is enabled a worst-case scenario is taken into account. When a single host

failure has been specified, this means that the host with the largest number of slots will be taken

out of the equation. In other words for our cluster this would result in:


esx01 + esx02 + esx03 + esx04 + esx05 = 80 slots available

Although you have doubled the amount of memory in one of your hosts you are still stuck with only

80 slots in total. As clearly demonstrated there is absolutely no point in buying additional memory

for a single host when your cluster is designed with Admission Control enabled and a number of

host failures as the Admission Control Policy has been selected.

In our example the memory slot size happened to be the most restrictive; the same principle applies when the CPU slot size is the most restrictive.

Basic design principle:When using Admission Control, balance your clusters and be conservative

with reservations as it leads to decreased consolidation ratios.

Now what would happen in the scenario above when the number of allowed host failures is set to 2?

In this case ESX06 is taken out of the equation along with one of the remaining hosts in the cluster, resulting in 64 slots. This makes sense, doesn't it?
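The same reasoning can be expressed in a few lines of illustrative Python; the host slot counts come from the example above, and the code simply removes the N largest hosts before summing:

def cluster_slots(host_slots, host_failures_tolerated):
    # Worst case: assume the largest host(s) fail, so remove them first.
    remaining = sorted(host_slots, reverse=True)[host_failures_tolerated:]
    return sum(remaining)

host_slots = [16, 16, 16, 16, 16, 32]        # esx01..esx06 from the example
print(cluster_slots(host_slots, 1))           # -> 80 slots
print(cluster_slots(host_slots, 2))           # -> 64 slots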

Can you avoid large HA slot sizes due to reservations without resorting to advanced settings? That’s

the question we get almost daily. The answer used to be NO if per virtual machine reservations

were required. HA uses reservations to calculate the slot size and there’s no way to tell HA to ignore

them without using advanced settings pre-vSphere. With vSphere, the new Percentage method is

an alternative.

Percentage of Cluster Resources Reserved

With vSphere, VMware introduced the ability to specify a percentage in addition to a number of host failures and a designated failover host. The percentage avoids the slot size issue, as it does not use

slots for Admission Control. So what does it use?

When you specify a percentage, that percentage of the total amount of available resources will stay reserved for HA purposes. First of all, HA will add up all available resources to see how much it has available in total (virtualization overhead will be subtracted). Then HA will calculate how many resources are currently reserved by adding up all reservations for both memory and CPU for powered-on virtual machines.

For those virtual machines that do not have a reservation larger than 256 MHz a default of 256 MHz

will be used for CPU and a default of 0MB+memory overhead will be used for Memory. (Amount of

overhead per configuration type can be found in the “Understanding Memory Overhead” section of

the Resource Management guide.)

In other words:


((total amount of available resources - total reserved virtual machine resources) / total amount of available resources) >= (percentage HA should reserve as spare capacity)

Here total reserved virtual machine resources include the default reservation of 256MHz and the

memory overhead of the virtual machine.

Let’s use a diagram to make it a bit clearer:

Figure 23: Percentage of cluster resources reserved

Total cluster resources are 24GHz (CPU) and 96GB (MEM). This would lead to the following

calculations:

((24GHz - (2GHz + 1GHz + 256MHz + 4GHz)) / 24GHz) = 69 % available

((96GB - (1.1GB + 114MB + 626MB + 3.2GB)) / 96GB) = 85 % available

As you can see the amount of memory differs from the diagram. Even if a reservation has been set,

the amount of memory overhead is added to the reservation. For both metrics, HA Admission Control will constantly check whether the policy has been violated. When either of the two thresholds is reached, memory or CPU, Admission Control will disallow powering on any additional virtual machines. These thresholds can be monitored on the HA section of the Cluster's summary tab.

Figure 24: High Availability summary
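The percentage check itself is simple arithmetic. The following illustrative sketch reproduces the CPU side of the calculation above; the reservation list and the configured percentage are example values only:

total_cpu_mhz = 24000
cpu_reservations_mhz = [2000, 1000, 256, 4000]    # reservations (and defaults) of the powered-on VMs
configured_spare_pct = 25                          # the percentage configured in the policy (example)

available_pct = (total_cpu_mhz - sum(cpu_reservations_mhz)) / total_cpu_mhz * 100
print(int(available_pct))                          # -> 69, matching the calculation above
print(available_pct >= configured_spare_pct)       # power-on operations are allowed while this holds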


If you have an unbalanced cluster (hosts with different sizes of CPU or memory resources), your percentage should be equal to, or preferably larger than, the percentage of resources provided by the largest host. This way you ensure that all virtual machines residing on this host can be restarted in case of a host failure.
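For example, a quick sanity check of that percentage for the unbalanced cluster used earlier (illustrative values):

host_memory_gb = [16, 16, 16, 16, 16, 32]
largest_host_pct = max(host_memory_gb) / sum(host_memory_gb) * 100
print(round(largest_host_pct))          # -> 29, so reserve at least roughly 29% in this cluster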

As explained earlier, this Admission Control Policy does not use slots; as such, resources might be fragmented throughout the cluster. Although, as of vSphere 4.1, DRS is notified to rebalance the cluster, if needed, to accommodate these virtual machines' resource requirements, a guarantee cannot be given. We recommend ensuring you have at least one host with enough available capacity to boot the largest virtual machine (CPU/MEM reservation). Also make sure you select the highest restart priority for this virtual machine (depending on the SLA, of course) to ensure it will be able to boot.

The following diagram will make it more obvious. You have 5 hosts, each with roughly 80%

memory usage, and you have configured HA to reserve 20% of resources. A host fails and all virtual

machines will need to fail over. One of those virtual machines has a 4GB memory reservation; as you can imagine, the first power-on attempt for this particular virtual machine will fail due to the fact that none of the hosts has enough memory available to guarantee it.

Figure 25: Available resources


Basic design principle: Although vSphere 4.1 will utilize DRS to try to accommodate the resource requirements of this virtual machine, a guarantee cannot be given. Do the math; verify that any single host has enough resources to power on your largest virtual machine. Also take restart priority into account for this/these virtual machine(s).

Failover Host

The third option one could choose is a designated Failover host. This is commonly referred to as a

hot standby. There is actually not much to tell about this mechanism, as it is "what you see is what you get". When you designate a host as a failover host, it will not participate in DRS and you will not be able to power on virtual machines on this host! It is almost as if it were in maintenance mode, and it will only be used in case a failover needs to occur.


Chapter 7

Impact of Admission Control Policy

As with any decision made when architecting your environment, there is an impact. This especially goes for the Admission Control Policy. The first decision that will need to be made is whether Admission Control is enabled or not. We recommend enabling Admission Control, but carefully select the policy and ensure it fits your or your customer's needs.

Basic design principle:

Admission Control guarantees enough capacity is available for virtual machine failover. As

such we recommend enabling it.

We have explained all the mechanisms that are being used by each of the policies in Chapter 6. As

this is one of the most crucial decisions that need to be made we have summarized all the pros and

cons for each of the three policies below.


Host Failures Cluster Tolerates

This option is historically speaking the most used for Admission Control. Most environments are

designed with an N+1 redundancy and N+2 is also not uncommon. This Admission Control Policy

uses “slots” to ensure enough capacity is reserved for failover, which is a fairly complex

mechanism. Slots are based on VM-level Reservations.

Pros:

Fully automated (When a host is added to a cluster, HA re-calculates how many slots are

available.)

Ensures failover by calculating slot sizes.

Cons:

Can be very conservative and inflexible when reservations are used as the largest reservation

dictates slot sizes.

Unbalanced clusters lead to wastage of resources.

Complexity for administrator from calculation perspective.

Percentage as Cluster Resources Reserved

Percentage based Admission Control is the latest addition to the HA Admission Control Policy. The

percentage based Admission Control is based on per VM reservation calculations instead of slots.

Pros:

Accurate as it considers actual reservation per virtual machine.

Cluster dynamically adjusts when resources are added.

Cons:

Manual calculations are needed when adding additional hosts to a cluster if the number of host failures to tolerate needs to remain unchanged.

Unbalanced clusters can be a problem when the chosen percentage is too low and resources are

fragmented, which means failover of a virtual machine can’t be guaranteed as the reservation of this

virtual machine might not be available as resources on a single host.

Specify a Failover Host

With the Specify a Failover Host Admission Control Policy, when a host fails, HA will attempt to

restart all virtual machines on the designated failover host. The designated failover host is

essentially a “hot standby”. In other words DRS will not migrate VMs to this host when resources

are scarce or the cluster is imbalanced.


Pros:

What you see is what you get.

No fragmented resources.

Cons:

What you see is what you get.

Maximum of one failover host. (N+2 redundancy is impossible.)

Dedicated failover host not utilized during normal operations.

Recommendations

We have been asked many times for our recommendation on Admission Control and it is difficult to

answer as each policy has its pros and cons. However, we generally recommend a Percentage based

Admission Control Policy. It is the most flexible policy as it uses the actual reservation per virtual

machine instead of taking a "worst case" scenario approach like the number of host failures policy does.

However, the number of host failures policy guarantees the failover level under all circumstances.

Percentage based is less restrictive, but offers lower guarantees that in all scenarios, HA will be able

to restart all virtual machines. With the added level of integration between HA and DRS we believe

a Percentage based Admission Control Policy will fit most environments.

Basic design principle: Do the math, and take customer requirements into account.

We recommend using a “Percentage” based Admission Control Policy, as it is the most

flexible policy.


Chapter 8

VM Monitoring

VM Monitoring, or VM level HA, is an often overlooked but really powerful feature of HA. The reason for this is most likely that it is disabled by default and relatively new compared to HA. We have tried to gather all the info we could around VM Monitoring, but it is a pretty straightforward feature that actually does what you expect it to do.

With vSphere 4.1 VMware also introduced VM and Application Monitoring. Application Monitoring

is a brand new feature that Application Developers can leverage to increase resiliency as shown in

the screenshot below.

Figure 26: VM and Application Monitoring


As of writing, there was little information available around Application Monitoring besides the fact that the Guest SDK can be used by application developers or partners, like for instance Symantec, to develop solutions against the SDK. In the case of Symantec, a simplified version of Veritas Cluster Server (VCS) is used to enable application availability monitoring, including of course responding to issues. Note that it is not a multi-node clustering solution like VCS itself but a single node solution.

Symantec ApplicationHA as it is called is triggered to get the application up and running again by

restarting it. Symantec's ApplicationHA is aware of dependencies and knows in which order

services should be started or stopped. If, however, this fails for whatever reason an "X" number of times (a configurable option within ApplicationHA), HA will be asked to take action. This action will be a restart of the virtual machine.

Although Application Monitoring is relatively new and there are only a few partners currently

exploring the capabilities it does add a whole new level of resiliency in our opinion. We have tested

ApplicationHA by Symantec and personally feel it is the missing link. It enables you as System

Admin to integrate your virtualization layer with your application layer. It ensures that protected services are restarted in the correct order and it avoids the common pitfalls associated with restarts and maintenance.

Why Do You Need VM/Application Monitoring?

VM and Application Monitoring act on a different level than host-level HA. VM/App Monitoring responds to a single virtual machine or application failure, as opposed to HA, which responds to a host failure. An example of a single virtual machine failure would, for instance, be the infamous "blue screen of death".

How Does VM/App Monitoring Work?

VM Monitoring restarts individual virtual machines when needed. VM/App monitoring uses a

similar concept as HA, heartbeats. If heartbeats, and in this case VMware Tools heartbeats, are not

received for a specific amount of time, the virtual machine will be rebooted. The heartbeats are

communicated directly to VPXA by VMware Tools; these heartbeats are not sent over a network.

Figure 27: VM monitoring sensitivity


When enabling VM/App Monitoring, the level of sensitivity can be configured. The default setting

should fit most situations. Low sensitivity basically means that the number of allowed "missed" heartbeats is higher, and as such the chance of running into a false positive is lower. However, if a failure occurs and the sensitivity level is set to low, the experienced downtime will be higher. When quick action is required in case of a possible failure, "high sensitivity" can be selected; as expected, this is the opposite of "low sensitivity".

Table 1: VM monitoring sensitivity

It is important to remember, though, that VM Monitoring does not infinitely reboot virtual machines; this is to avoid a recurring problem from causing endless restarts. By default, when a virtual machine has been rebooted three times within an hour, no further attempts will be made unless, of course, the specified time window has elapsed. The following advanced settings can be set to change this default behavior:

High Availability advanced settings:

das.maxFailures - Maximum number of virtual machine failures; if this number is reached, VM Monitoring does not restart the machine automatically. The default value is 3 automatic reboots after a failure.

das.maxFailureWindow - Minimum number of seconds between failures. The default value is 3600 seconds; if a virtual machine fails more often than specified with das.maxFailures within this window, VM Monitoring will not restart the machine.


das.iostatsInterval - The number of seconds VM Monitoring looks back to see whether any storage or network I/O has taken place before deciding to reboot a virtual machine in case no VMware Tools heartbeats are received. The default value is 120 seconds.

Although the heartbeat produced by VMware Tools is reliable, VMware added a further verification

mechanism. To avoid false positives, VM Monitoring also monitors I/O activity of the virtual

machine. When heartbeats are not received AND no disk or network activity occurred over the last

120 seconds (per default) the virtual machine will be reset. This 120 second interval can be

modified of course by changing the advanced setting “das.iostatsInterval” as described above.
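Purely as an illustration of how these settings interact, the sketch below mimics the decision described above; the function and parameter names are ours and not anything exposed by HA:

def should_reset_vm(heartbeat_lost, seconds_since_last_io, resets_in_window,
                    max_failures=3, iostats_interval=120):
    if not heartbeat_lost:
        return False            # VMware Tools heartbeats are still being received
    if seconds_since_last_io < iostats_interval:
        return False            # recent disk/network I/O, so treat it as a false positive
    if resets_in_window >= max_failures:
        return False            # das.maxFailures reached within das.maxFailureWindow
    return True                 # reset the virtual machine

print(should_reset_vm(True, 300, 1))    # -> True
print(should_reset_vm(True, 30, 1))     # -> False (I/O seen within the last 120 seconds)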

Is AAM enabling VM/App Monitoring?

Although VM/App Monitoring is configured within HA, VM/App Monitoring is independent from

host-level HA, and the AAM agent has absolutely nothing to do with VM/App Monitoring. The VM Monitoring mechanism sits in the VPXA agent. The agent uses the "Performance Manager" to

monitor the disk and network I/O. VM/App Monitoring uses the “usage” counters for both disk and

network and “vpxa” requests these every 20 seconds. These requests are also logged in the vpxa log

file.

As stated before, VM/App Monitoring uses heartbeats just like host-level HA. The heartbeats are

sent to VPXA, which is responsible for the restarts. Of course this info is also being rolled up into

vCenter, but that is done via the Management Network, not by using the VM Network. This is crucial

to know as this means that when a virtual machine network error occurs the virtual machine

heartbeat will still be received. If and when an error occurs, VPXA will request a restart of the virtual machine through hostd when all conditions are met; again, just like host-level HA, it works independently of vCenter.

Screenshots

The cool thing about VM Monitoring is the fact that it takes screenshots of the VM console. They are

taken right before a virtual machine is reset by VM Monitoring. This has been added as of vCenter

4.0. It is a very useful feature when a virtual machine “freezes” every once in a while with no

apparent reason. This screenshot can be used to debug the virtual machine operating system, if and

when needed, and is stored in the virtual machine’s working directory.

Basic design principle: VM Monitoring can substantially increase availability. It

is part of the HA stack and we heavily recommend using it!


Chapter 9

vSphere 4.1 HA and DRS Integration

HA integrates on multiple levels with DRS as of vSphere 4.1. It is a huge improvement and it is

something that we wanted to stress as it has changed the behavior and the reliability of HA.


Affinity Rules

VMware introduced VM-Host affinity rules with vSphere 4.1. VM-Host affinity rules are specified

within the DRS configuration and are typically used to bind a group of virtual machines to a group

of hosts.

There are two types of VM-Host affinity rules “must” and “should”. If a rule is created of the type

“must” HA will need to adhere to this rule when a failover needs to occur. However, if it is not

possible to perform a failover without violating the rule the failover will not be performed. Affinity

rules are covered in-depth in the DRS section of this book.

Resource Fragmentation

As of vSphere 4.1, HA is closely integrated with DRS. When a failover occurs, HA will first check whether there are sufficient resources available for the failover. If, for instance, that particular virtual machine has a very large reservation and the Admission Control Policy is based on a percentage, it could happen that resources are fragmented across multiple hosts. (For more details on this scenario see Chapter 7.) HA, as of vSphere 4.1, will ask DRS to defragment the resources to accommodate this virtual machine's resource requirements. Although HA will request a defragmentation of resources, a guarantee cannot be given. As such, even with this additional integration, you should still be cautious when it comes to resource fragmentation.

DPM

In the past there was barely any integration between DRS/DPM and HA. Especially when DPM was enabled, this could lead to some weird behavior when resources were scarce and an HA failover needed to happen. With vSphere 4.1 this has changed. In such cases, HA will use DRS to try to adjust the cluster (for example, by bringing hosts out of standby mode or migrating virtual machines to defragment the cluster resources) so that HA can perform the failovers.

Flattened Shares

Pre-vSphere 4.1, an issue could arise when custom shares had been set on a virtual machine. When HA fails over a virtual machine, it will power on the virtual machine in the Root Resource Pool. However, the virtual machine's shares were scaled for its appropriate place in the resource pool hierarchy, not for the Root Resource Pool. This could cause the virtual machine to receive either too many or too few resources relative to its entitlement.

A scenario where and when this can occur would be the following:

VM1 has 1000 shares and Resource Pool A has 2000 shares. However, Resource Pool A has 2 VMs and both will have 50% of those "2000" shares. The following diagram depicts this scenario:


Figure 28: Flatten shares starting point

When the host fails, both VM2 and VM3 will end up on the same level as VM1. However, as a custom shares value of 10,000 was specified on both VM2 and VM3, they will completely blow away VM1 in times of contention. This is depicted in the following diagram:

Figure 29: Flatten shares host failure


This situation would persist until the next invocation of DRS re-parented the virtual machine to its original Resource Pool. To address this issue, as of vSphere 4.1 DRS will flatten the virtual machine's shares and limits before failover. This flattening process ensures that the virtual

machine will get the resources it would have received if it had failed over to the correct Resource

Pool. This scenario is depicted in the following diagram. Note that both VM2 and VM3 are placed

under the Root Resource Pool with a shares value of 1000.

Figure 30: Flatten shares after host failure before DRS invocation

Of course, when DRS is invoked, both VM2 and VM3 will be re-parented under Resource Pool A and will again receive the shares they originally had assigned.
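One way to reproduce the numbers from this example in a few lines of illustrative Python (this is not VMware's implementation, just the proportional idea described above):

pool_shares = 2000                            # shares of Resource Pool A
vm_shares = {"VM2": 10000, "VM3": 10000}      # custom per-VM shares inside the pool

total = sum(vm_shares.values())
flattened = {vm: pool_shares * share // total for vm, share in vm_shares.items()}
print(flattened)                              # -> {'VM2': 1000, 'VM3': 1000}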

Chapter 10


Summarizing

The integration of HA with DRS has been vastly improved, and so has HA in general. We hope everyone sees the benefits of these improvements and of HA and VM and Application Monitoring in general. We have tried to simplify some of the concepts to make them easier to understand; still, we acknowledge that some concepts are difficult to grasp. We hope, though, that after reading this section of the book everyone is confident enough to make the changes to HA needed to increase resiliency and, essentially, the uptime of your environment, because that is what it is all about.

If there are any questions please do not hesitate to reach out to either of the authors.

Part 2


VMware Distributed Resource Scheduler

Chapter 11


What is VMware DRS?

VMware Distributed Resource Scheduler (DRS) is an infrastructure service run by VMware vCenter Server (vCenter). DRS aggregates ESX host resources into clusters and automatically distributes these resources to the virtual machines.

DRS monitors resource usage and continuously optimizes the virtual machine resource distribution

across ESX hosts.

DRS computes the resource entitlement for each virtual machine based on static resource allocation

settings and dynamic settings such as active usage and level of contention.

DRS attempts to satisfy the virtual machine resource entitlement with the resources available in the

cluster by leveraging vMotion. vMotion is used either to migrate the virtual machines to alternative ESX hosts with more available resources or to migrate virtual machines away to free up resources.

Because DRS is an automated solution and easy to configure, we recommend enabling DRS to

achieve higher consolidation ratios at low costs.

A DRS-enabled cluster is often referred to as a DRS cluster. In vSphere 4.1, a DRS cluster can

manage up to 32 hosts and 3000 VMs.

Cluster Level Resource Management

Clusters group the resources of the various ESX hosts together and treat them as a pool of resources; DRS presents the aggregated resources as one big host to the virtual machines. Pooling resources allows DRS to create resource pools spanning all hosts in the cluster and to apply cluster level resource allocation policies. Probably unnecessary to point out, but a virtual machine cannot span hosts even when resources are pooled by using DRS. In addition to resource pools and resource allocation policies, DRS offers the following resource management capabilities:

Initial placement – When a virtual machine is powered on in the cluster, DRS places the

virtual machine on an appropriate host or generates a recommendation depending on the

automation level.

Load balancing – DRS distributes virtual machine workload across the ESX hosts inside the cluster. DRS continuously monitors the active workload and the available resources, compares the results to the ideal resource distribution, and performs or recommends virtual machine migrations to ensure workloads receive the resources to which they are entitled, with the goal of allocating resources to maximize workload performance.

Power management – When Distributed Power Management (DPM) is enabled, DRS

compares cluster-level and host-level capacity to the demand of the virtual machines, including


recent historical demand. It places, or recommends placing, ESX hosts in standby mode if excess capacity is detected, or DPM powers on hosts if more capacity is needed.

Constraint correction – DRS redistributes virtual machines across ESX hosts to evacuate hosts when users request that the hosts enter maintenance or standby mode, and moves virtual machines as needed to adhere to user-defined affinity and anti-affinity rules.

Requirements

In order for DRS to function correctly the environment must meet the following requirements:

VMware ESX or ESXi in a cluster

VMware vCenter Server

VMware vSphere Enterprise or Enterprise Plus License

Meet vMotion requirements (not mandatory, but highly recommended)

o Shared VMFS volumes accessible by all ESX hosts inside the cluster.

o Private migration network

o Gigabit Ethernet

o Processor compatibility

For DRS to allow automatic load-balancing, vMotion is required. For initial placement though,

vMotion is not a requirement.

Basic design principle: Configure vMotion to fully benefit from DRS capabilities.

Operation and Tasks of DRS

Load Balance Calculation

vCenter creates and runs a DRS thread on the vCenter Server per cluster which communicates with

the management agent (VPXA) on every ESX host inside the cluster. By default a DRS thread is

invoked every five minutes. This thread calculates the imbalance of the cluster, applies resource

settings and (if needed) generates migration recommendations. In practice the thread may be

invoked more frequently due to changes made inside the cluster. For example, DRS will be invoked

when virtual machine resource settings are changed or hosts are added or removed.

Figure 31: DRS thread components


Events and Statistics

The vCenter agent (VPXA) runs inside each ESX host in the cluster and enables two-way communication, keeping the status of both ESX and the VMs in sync with the status in vCenter. The VPXA sends information when a virtual machine's power state changes or when a virtual machine is migrated with vMotion. Periodically, the VPXA sends additional notifications and statistics to the vCenter Server.

Migration and Info Requests

In turn DRS sends messages to the ESX host, such as proposed migrations and information

requests.

vCenter and Cluster sizing

The impact of the resource utilization by the DRS threads on the vCenter must be taken into

account when sizing the vCenter server and designing the cluster environment.

To ensure performance of vCenter, add sufficient vCPUs and memory. The technical paper "VMware vCenter Server Performance and Best Practices" lists the minimum hardware recommendations for three deployment sizes, ranging from 50 hosts and 500 virtual machines to 1,000 hosts and 10,000 powered-on virtual machines. It's recommended to follow these hardware recommendations when sizing the vCenter virtual machine.


The configuration of cluster sizes, combination of workload types, virtual machine management and number of virtual machines have an impact on the behavior and performance of the vCenter Server and therefore influence the performance of the DRS threads, which in turn can impact the performance of the virtual machines due to slow or insufficient load-balancing migration recommendations and resource entitlement calculations.

For example, vCenter servers in Virtual Desktop Infrastructure environments experience more load due to the number of virtual machines and the higher frequency of virtual machine power state changes, which leads to invoking DRS threads more often. (Table 3 of Chapter 14 lists the events invoking DRS calculations.)

Separate workloads

In large environments it’s recommended to separate VDI workloads and server workloads and

assign different clusters to each workload to reduce the DRS invocations. By isolating server

workloads from VDI workloads, only the VDI cluster experiences increased DRS invocations, reducing the complexity and the number of calculations performed by DRS per cluster.

Amount of clusters

A lower number of virtual machines inside a cluster reduces the number of load-balancing calculations, ensuring fast DRS performance. A lower number of virtual machines generally results in a smaller number of hosts per cluster. However, the potential danger is creating too many small clusters: having 200 x 3-host clusters instead of 100 x 6-host clusters could drive up the CPU utilization of the vCenter Server, as each cluster will invoke the periodic load-balancing calculation at least every 5 minutes.
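A quick back-of-the-envelope illustration of that point, using the default 5-minute interval:

invocations_per_hour = 3600 // 300            # each cluster runs the load-balancing thread every 300s
print(200 * invocations_per_hour)             # 200 x 3-host clusters -> 2400 DRS invocations per hour
print(100 * invocations_per_hour)             # 100 x 6-host clusters -> 1200 DRS invocations per hour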

It is believed that the current "sweet spot" of hosts per cluster ranges between 16 and 24 hosts per cluster, offering sufficient options to load-balance the virtual machines across the hosts inside the cluster without introducing too many DRS threads in vCenter.


Chapter 12

DRS Cluster Settings

When DRS is enabled on a cluster, you need to select the automation level and set the migration threshold. DRS settings can be modified while the cluster is in use and without disruption of service. The following steps, however, will show you how to create a cluster and how to enable DRS:

1. Select the Hosts & Clusters view.

2. Right Click your Datacenter in the Inventory tree and click New Cluster

3. Give the new cluster an appropriate name. We recommend at a minimum including the location of the cluster and a sequence number, e.g. ams-hadrs-001.

4. In the Cluster Features section of the page, select Turn On VMware DRS and click Next

5. Verify the Automation Level is set to Fully Automated and select Next

6. Leave the Swapfile Policy set to default and click Next

7. Click Finish to complete the creation of the cluster

Figure 32: Enable DRS


Automation Level

The automation level determines the level of autonomy of DRS, ranging from generating placement

and load-balancing recommendations to automatically applying the generated recommendations.

Three automation levels exist:

Manual - DRS generates recommendations for initial placement of the virtual machines and if the

cluster becomes unbalanced, DRS suggests migration recommendations for the virtual machines.

The recommendations will not be automatically applied, but they will be applied if an administrator

accepts each one.

Partially automated - DRS starts the VM on the most suitable host. If the cluster is

unbalanced DRS only suggests migration recommendations, which the user must apply manually.


Fully automated - DRS places the virtual machine on the most suitable host when it is

powered on. If the cluster is unbalanced, DRS migrates the virtual machine to a more suitable host.

Table 2: DRS automation level

Initial Placement

Initial placement occurs when a virtual machine is powered on or resumed. By default DRS selects

an ESX host based on the virtual machine resource entitlement. DRS will create a prioritized list of

recommended hosts for virtual machine placement if the cluster is configured with the manual automation level. This list is presented to the user to help the user select the appropriate host. The DRS policy settings, which are discussed in a later chapter, are also taken into account.

Impact of Automation Levels on Procedures

When the manual or partially automated automation level is selected, the user must manually apply the recommendations issued by DRS. DRS reviews the state of the cluster at an interval of five minutes and publishes recommendations to solve the imbalance of the cluster. Consequently, administrators should check the recommendations after each DRS invocation to solve the cluster imbalance. Besides being inefficient, it is possible that DRS rules are violated because the administrators apply the recommendations infrequently. DRS rules are explained in the section "Rules" of chapter 16.

The automation level of the cluster can be changed without disrupting virtual machines. It's easy to change it, so why not try fully automated for a while to get comfortable with it?

Basic design principle: Set the automation level to Fully Automated to fully benefit from DRS capabilities.


Chapter 13

Resource Management

As stated before, the primary goal of DRS is to ensure each virtual machine receives its entitled resources; to do this it rebalances virtual machine workload across the hosts in the cluster. But contrary to popular belief, DRS is not concerned with performance per se. Instead, DRS focuses on whether each virtual machine inside the cluster or resource pool gets its specified resource allocation. DRS examines the current demand and contention in the environment and uses the resource allocation settings of the virtual machine to determine its resource entitlement.

By trying to ensure that all virtual machines receive enough resources to satisfy their resource

entitlement, DRS assumes that a virtual machine should not have any performance problems if it

receives those resources. In other words, the entitlements are adequate to ensure the virtual

machine’s performance goals. To satisfy each virtual machine resource entitlement, DRS

dynamically moves a virtual machine across the cluster to optimize the cluster load balance. To do

this as effectively as possible, DRS computes the cluster imbalance and creates recommendations

for migrating virtual machines to solve the resource supply and resource demand imbalance.

To properly interact with the local resource scheduler of each ESX host, DRS converts cluster level

resource pool settings into host level settings. Let us look at the scheduler architecture layer.

Two-Layer Scheduler Architecture

When enabling DRS on a cluster a two-layer scheduler architecture is created. In addition to the ESX

local host resource scheduler, DRS introduces a global scheduler.

Figure 33: Global scheduler and local schedulers


The global scheduler supervises the entire cluster and the local scheduler manages the resource allocation of the virtual machines on each host. Both DRS and the local ESX host scheduler use the virtual machines' resource allocation settings to compute the resource entitlement of the virtual machines. This entitlement is the allocation of resources that a virtual machine should receive.

DRS relies on host-level scheduling to allocate resources: the global scheduler calculates resource entitlements when virtual machines are placed inside a resource pool and sends these calculations to the host. The host-level CPU and memory schedulers then handle the resource entitlement of the virtual machine.

Resource Entitlement

Every virtual machine has a "resource entitlement" for CPU and memory. This is how much of the physical resources ESX thinks the virtual machine should get. It is the target ESX defines for how much resource to give the virtual machine. By default this will be everything the virtual machine wants, unless there are too few resources to meet the virtual machines' aggregated demand (in other words, "contention"), or an artificial limit is imposed. A virtual machine's resource entitlement changes as the virtual machine runs.

A virtual machine’s resource entitlement is based on static entitlement and dynamic entitlement

which subsequently consist of static settings and dynamic metrics. The static entitlement consists of

resource allocation settings, reservations, shares, and limits. The dynamic entitlement consists of

dynamic metrics, such as the estimated active memory, which is also known as the working set size,

it’s CPU demand which is an estimated amount of CPU the virtual machine would consume if no

contention exists and the utilization or degree of contention of the host.

Resource allocation settings:

Reservation - Also referred to as “MIN”. A reservation is the amount of physical resources

(MHz or MB) guaranteed to be available for the virtual machine.


Shares - Shares specify the relative importance of the virtual machine. Shares are always

measured against other powered-on sibling virtual machines and resource pools on the

host.

Limit - Also referred to as "MAX". A limit specifies an upper bound for resources that can be

allocated to a virtual machine. By default, if no limit is explicitly set, the limit will implicitly

be the amount installed as virtual hardware in the virtual machine.

Working set - Estimated amount of active memory of the virtual machine.

CPU demand - Estimated amount of CPU the virtual machine would consume if no

contention.

Idle memory tax - Mechanism by which ESX reallocates unreserved idle memory from

virtual machines

It is very important to know that if a cluster or host is under-committed the virtual machine

resource entitlement will be the same as its resource demand. In other words, the virtual machine

will be allocated whatever it wants to consume within its configured limit. The virtual machine will

receive its CPU cycles and the memory pages issued by the virtual machine will be mapped on

machine pages (physical memory of the ESX host). A limit is the only exception; the ESX host can

still revert to swapping when a memory limit is set on the virtual machine because this introduces

an arbitrary cap not due to genuine contention.

When a cluster is overcommitted, the cluster might experience more resource demand than its

current capacity; in that case, DRS and the VMkernel will distribute and allocate resources based on

the resource entitlement of each virtual machine.

Resource Entitlement Calculation

The distribution of resources is as follows: if a reservation is set on the virtual machine, it will have a static resource entitlement that is at least as large as the reservation.

The reservation setting guarantees that physical resources will be available to back the specified

amount of resources. The limit setting is the opposite of reservation; it defines the upper limit of

usage of physical resources even if there is a surplus beyond the aggregated demand. For example,

if a CPU limit is set, any vCPU instructions will be placed in the pCPU scheduler queue once the virtual machine has consumed all of its assigned time. The instructions will not be processed by the pCPU until CPU timeslots are available again for the virtual machine.

CPU demand is used to calculate the virtual machine’s CPU entitlement. If the vCPU is not

demanding any physical CPU resources it is de-scheduled anyway.

When a virtual machine hits its memory limit, the amount of memory used above the limit is

ballooned, compressed or swapped out to a hypervisor swap file.

CPU and memory

The rest of the resource entitlement is based on the number of shares and, in the case of memory, the working set size. The working set size is the estimated amount of active memory. ESX applies the idle memory tax to the virtual machine, and all the inactive memory can be reallocated to other virtual machines during contention. The number of shares determines which virtual machine has priority over other virtual machines. ESX compares the shares value of each sibling virtual machine and selects "victims" to confiscate memory from.

This is the memory entitlement of the virtual machine. If the virtual machine uses more than its

entitlement during contention, the excess memory is ballooned, compressed or swapped depending

on the free memory state of the ESX host. The host will keep on reclaiming memory until the virtual

machine resource usage is at or below its resource entitlement.
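The following is a deliberately over-simplified sketch of the inputs involved; it is not the actual DRS/VMkernel algorithm, only an illustration of how reservation, limit, demand and shares interact as described above:

def entitlement_sketch(demand, reservation, limit, share_fraction, spare_after_reservations=None):
    # Under-committed: the VM simply gets what it demands, bounded by its limit.
    if spare_after_reservations is None:
        return min(max(demand, reservation), limit)
    # Under contention: the reservation is guaranteed, the rest is divided roughly by shares.
    proportional = reservation + share_fraction * spare_after_reservations
    return min(proportional, limit, demand)

print(entitlement_sketch(4096, 1024, 8192, 0.25))                                 # -> 4096
print(entitlement_sketch(4096, 1024, 8192, 0.25, spare_after_reservations=4096))  # -> 2048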


Chapter 14

Calculating DRS Recommendations

DRS takes several metrics into account when calculating migration recommendations to load balance the cluster: the current resource demand of the virtual machines, host resource availability, and the applied high-level resource policies. The following section explores how DRS uses these metrics to create a new and better placement for a virtual machine than its existing location, while still satisfying all the requirements and constraints.

When is DRS Invoked?

By default DRS is invoked every 300 seconds. When the invocation interval expires DRS will

compute and generate recommendations to migrate virtual machines. The DRS imbalance

calculation is also triggered if the cluster detects changes in its resource pool tree, operations and

events such as changes in resource supply, or modification of resource settings.

Table 3: Events invoking DRS calculations

Recommendations generated by DRS that are not applied are retired at the next invocation of DRS; DRS might generate the exact same recommendation again if the imbalance is not solved. The interval at which the DRS algorithm is invoked can be controlled through the vpxd

configuration file (vpxd.cfg) with the following option:

vpxd config file:


<config>
  <drm>
    <pollPeriodSec>300</pollPeriodSec>
  </drm>
</config>

The default frequency is 300 seconds, but can be set to anything in the range of 60 seconds to 3600

seconds. It is strongly discouraged to change the default value. A less frequent interval might

reduce the number of vMotions and therefore overhead but would risk leaving the cluster

imbalanced for a longer period of time. Shortening the interval will likely generate extra overhead,

for little added benefit.

MaxMovesPerHost

Adjusting the interval impacts the number of migrations DRS will recommend. There are limits to how many migrations DRS will recommend per interval per ESX host. A limit is imposed because there is no advantage in recommending migrations that cannot be completed before the next DRS invocation. During the next re-evaluation cycle, virtual machine resource demand may have changed, rendering the previous recommendations obsolete.

Please note that there is no limit on max moves per host for a host entering maintenance or standby

mode, but only a limit on the maximum amount of moves per host for load balancing. This can, but

usually shouldn't, be changed by setting the DRS Advanced Option MaxMovesPerHost. The

default value of this parameter is 8.

In vSphere 4.1, the limit on moves per host is dynamic, based on how many moves DRS estimates can be completed during one DRS evaluation interval cycle. The MaxMovesPerHost value adapts to the maximum number of concurrent vMotions per host and the average migration time observed from previous migrations. These improvements make DRS less conservative compared to vSphere 4.0 and allow DRS to reach a steady state more quickly when a significant load imbalance exists in the cluster.

The MaxMovesPerHost parameter still exists, but can be exceeded by DRS. The limit adapts to the DRS invocation frequency and the average time per vMotion, so there is no need to tweak the value.
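The exact formula is not documented here, but the idea can be sketched as follows; all values are made up and only illustrate that the limit scales with the invocation interval, the observed vMotion duration and the number of concurrent vMotions per host:

def estimated_moves_per_host(invocation_interval_s=300, avg_vmotion_time_s=60,
                             concurrent_vmotions_per_host=4):
    # How many migrations could realistically complete before the next DRS invocation?
    return (invocation_interval_s // avg_vmotion_time_s) * concurrent_vmotions_per_host

print(estimated_moves_per_host())       # -> 20 with these example values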


Defragmenting cluster during Host failover

As described in Chapter 9, "vSphere 4.1 HA and DRS Integration", HA requests DRS to defragment resources to accommodate a virtual machine's resource requirements and enable an HA initiated restart. As explained, when strict Admission Control is enabled and the "host failures" policy is selected, DRS will keep spare resources defragmented and permanently available to support the configured HA host failover level. Selecting a different failover policy or disabling strict Admission Control can, however, result in a fragmented cluster. In a fragmented cluster enough resources are available overall, but not enough resources are available at the host level to restart a specific virtual machine.

In vSphere 4.1 if HA is unable to find sufficient available resources to restart the virtual machines

after a host failure, HA will request DRS to provide these resources to accommodate the virtual

machine that couldn’t be restarted.

This scenario however will not invoke a traditional cluster wide DRS load-balance calculation (due

to the duration of this process), but DRS will simulate an over-reservation. During an over-

reservation, the virtual machines reserve more memory than available on the host. To create an

over-reservation, HA creates “ghost virtual machines” identical to the virtual machines that could

not be restarted and places them on a random host in the cluster. The reason why HA uses a

random approach is because HA already knows that none of the available hosts can accommodate

this virtual machine, due to the fact that HA could not restart the virtual machines on any host in

the cluster after the host failure.

The host on which the ghost virtual machine is placed becomes over-reserved, which violates the DRS rules. DRS will move virtual machines around in the cluster and, if possible and necessary, leverage DPM by powering on ESX hosts to try to fix this over-reservation. These moves are mandatory.

What distinguishes this scenario from the common DRS load-balancing policy scenario is that DRS

can calculate migrations that involve multiple virtual machines. The normal load-balancing policy

only calculates single-hop migrations, VM A from host 1 to host 2, where this scenario uses multi-

hop migrations, VM A from host 1 to host 2, VM B from host 2 to host 3.

Due to the ability to use multi-hop migrations, defragmentation of resources occurs across the

cluster, but is started by the over-reservations on a particular host. This reduces the time, overhead

and cost incurred compared to a traditional DRS load-balancing calculation.

Recommendation Calculation

To generate a migration recommendation, DRS executes a series of calculations and passes in which

it determines the level of cluster imbalance and which virtual machines it needs to migrate to solve

the imbalance.


Constraints Correction

Before DRS runs its load-balancing pass, it runs a pass to consider and correct constraints,

including:

Evacuating hosts that the user requested enter maintenance or standby mode.

Correcting Mandatory VM-Host affinity/anti-affinity rule violations.

Correcting VM/VM anti-affinity rule violations.

Correcting VM/VM affinity rule violations.

Correcting host resource overcommitment (rare, since DRS is controlling resources).

These constraints are respected during load-balancing. The constraints may cause imbalance, and that imbalance may not be fixable because of these constraints. The imbalance information on the cluster summary page informs the administrator if there was an unfixable imbalance.

VM-Host affinity rules are a special case. More details about VM-Host affinity rules can be found in

chapter 16. But a quick primer on VM-Host affinity rules should help you to understand the

Constraints Correction Pass better. vSphere 4.1 introduces VM-Host affinity/anti-affinity rules in addition to the VM-VM affinity (or anti-affinity) rules.

The VM-Host affinity (or anti-affinity) rules specify which virtual machines must or should run on a group of ESX hosts. Two types of VM-Host affinity/anti-affinity rules exist: Must-rules and Should-rules. Must-rules are mandatory rules for HA, DRS and DPM. A Should-rule is a preferential rule for DRS and DPM, and both DRS and DPM use their best effort to apply the Should-rules.

The Should-rules are a special case in the DRS algorithm. The entire DRS algorithm (constraint correction + load-balancing) is executed while DRS tries to place the virtual machines listed in the Should-rules; DRS essentially treats the Should-rules as hard rules during this phase. If all virtual machines can be placed without introducing violations or over-utilized hosts, the results are output. If the Should-rules introduce constraint violations or over-utilization of hosts, the DRS algorithm is repeated with the Should-rules dropped (since they are best effort) and only the Must-rules in place.
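The following minimal Python sketch illustrates this two-pass approach; run_drs_pass() is a hypothetical placeholder and not VMware's implementation.

def run_drs_pass(cluster, rules):
    # Placeholder: a real pass would return recommendations plus any rule
    # violations or over-utilized hosts it could not resolve.
    return {"recommendations": [], "violations": [], "overutilized_hosts": []}

def generate_recommendations(cluster, must_rules, should_rules):
    # First pass: treat the preferential (Should) rules as if they were hard rules.
    first = run_drs_pass(cluster, rules=must_rules + should_rules)
    if not first["violations"] and not first["overutilized_hosts"]:
        return first["recommendations"]
    # Second pass: drop the Should-rules (they are best effort) and retry
    # with only the mandatory Must-rules in place.
    second = run_drs_pass(cluster, rules=must_rules)
    return second["recommendations"]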

Imbalance Calculation

DRS needs to establish whether the cluster is imbalanced. It does this by comparing the "current host load standard deviation" metric to the "target host load standard deviation". If the Current Host Load Standard Deviation (CHLSD) exceeds the Target Host Load Standard Deviation (THLSD), the cluster is considered imbalanced. To calculate the CHLSD and THLSD, DRS first needs to determine the load of each host. It does this by computing the resource entitlement of each active virtual machine on the host and summing the entitlements of all virtual machines on that host. This sum is divided by the capacity of the host; the resulting value is called the host’s normalized entitlement.

Sum (VM entitlements) / (capacity of host)


VM entitlement

As mentioned in chapter 14 “Resource Management” the virtual machine resource entitlement

calculated by the local scheduler is based upon its static settings, various dynamic metrics and the

virtual machine overhead on the ESX host. DRS enforces an overhead limit that is respected by the

VMkernel scheduler, so that growth in a virtual machine’s use of overhead memory will not go out

of control in between invocations of DRS. DRS will (if possible) keep increasing the limit on a virtual

machine’s overhead memory if the virtual machine needs more overhead memory.

Capacity of host

The memory capacity of the host is lower than the amount of installed physical memory. The capacity of the host is calculated by subtracting the VMkernel overhead, the Service Console overhead and a 6% reservation from the installed memory. In a cluster with VMware High Availability (HA) enabled and HA Admission Control enabled (default), DRS maintains excess powered-on capacity to meet the High Availability settings. This information is displayed on the resource allocation tab of the cluster in vCenter.

The outcome of sum(VM entitlements) / (capacity of host) becomes the load metric of the host. The standard deviation of this value across all the hosts in the cluster is the CHLSD. Given this value and the migration threshold, DRS computes the Target Host Load Standard Deviation (THLSD).
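A minimal Python sketch of this load metric could look as follows; the entitlement and capacity values are purely illustrative assumptions, not taken from a real cluster.

from statistics import pstdev

def normalized_entitlement(vm_entitlements, host_capacity):
    # Sum of the VM entitlements on the host divided by the host capacity.
    return sum(vm_entitlements) / host_capacity

def current_host_load_standard_deviation(hosts):
    # hosts: list of (list_of_vm_entitlements, host_capacity) tuples
    loads = [normalized_entitlement(vms, cap) for vms, cap in hosts]
    return pstdev(loads)   # standard deviation of the host load metric = CHLSD

# Example: three hosts with different aggregate VM entitlements (MHz).
hosts = [([2000, 1500], 8000), ([1000, 1200], 8000), ([2500], 8000)]
print(current_host_load_standard_deviation(hosts))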

Impact of Migration Threshold on Selection Procedure

The migration threshold reflects the tolerance of cluster load imbalance. The THLSD is derived from the migration threshold setting. Five settings can be selected, ranging from conservative to aggressive. If the cluster is configured with an aggressive migration threshold, DRS will calculate a more restrictive THLSD, which results in frequent migrations to keep the CHLSD beneath the THLSD. The conservative migration threshold setting creates a more tolerant environment that leads to fewer migrations.

Figure 34: Migration threshold

Every migration recommendation from DRS has a priority level which indicates how beneficial the migration is expected to be. The conservative migration threshold setting applies only the “priority-one” recommendations, which are mandatory recommendations. With the aggressive migration threshold setting, the cluster is less tolerant of imbalance and even “priority-five” recommendations, which are expected to produce only very modest improvements, are generated and applied.


Selection of Virtual Machine Candidate

If an imbalance is detected, a procedure is triggered to decide which virtual machine(s) DRS will migrate to correct the imbalance. DRS uses the following procedure:

DRS procedure:

While (load imbalance metric > threshold) {
    move = GetBestMove();
    If no good migration is found:
        stop;
    Else:
        Add move to the list of recommendations;
        Update cluster to the state after the move is added;
}

While the cluster is imbalanced (Current Host Load Standard Deviation > Target Host Load Standard Deviation), DRS selects a virtual machine to migrate based on specific criteria and simulates the migration in the cluster. In this simulation, DRS computes the possible Current Host Load Standard Deviation after the migration. If the CHLSD is still above the threshold it repeats the procedure, but if the migration solves the imbalance, it stops after adding the move to the migration recommendation list.

The GetBestMove procedure aims to find the virtual machine that will give the best improvement in

the cluster wide imbalance. The GetBestMove procedure consists of the following instructions:

getbestmove procedure:

GetBestMove() {
    For each virtual machine v:
        For each host h that is not the source host:
            If h is lightly loaded compared to the source host:
                If the Cost-Benefit and Risk Analysis is accepted:
                    Simulate the move of v to h;
                    Measure the new cluster-wide load imbalance metric as g;
    Return the move v that gives the least cluster-wide imbalance g;
}


This procedure tries to find the migration that will offer the best improvement. DRS cycles through each DRS-enabled virtual machine and each host that is not the source host. Only hosts that are using fewer resources than the source host are considered. After the cost-benefit and risk analysis is completed and the results are accepted, a migration of the virtual machine to the host is simulated. DRS measures the new cluster-wide load imbalance metric, does this for all the virtual machines, compares the results of all tried combinations (VM <-> host) and returns the vMotion that results in the least cluster imbalance.
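The following self-contained Python sketch illustrates this greedy selection loop; simulate_move(), imbalance() and the example numbers are illustrative stand-ins, not VMware code.

from statistics import pstdev

def imbalance(loads):                      # CHLSD over normalized host loads
    return pstdev(loads.values())

def simulate_move(loads, vm_load, src, dst):
    new = dict(loads)
    new[src] -= vm_load
    new[dst] += vm_load
    return new

def get_best_move(loads, vms):
    # vms: list of (vm_name, vm_normalized_load, source_host)
    best = None
    for name, vm_load, src in vms:
        for dst in loads:
            if dst == src or loads[dst] >= loads[src]:
                continue                   # only consider lighter-loaded targets
            # A real implementation would also apply the cost-benefit and
            # risk filter here before simulating the move.
            g = imbalance(simulate_move(loads, vm_load, src, dst))
            if best is None or g < best[0]:
                best = (g, name, src, dst)
    return best                            # least cluster-wide imbalance, or None

loads = {"esx1": 0.70, "esx2": 0.30, "esx3": 0.35}
vms = [("vm1", 0.20, "esx1"), ("vm2", 0.10, "esx1")]
print(get_best_move(loads, vms))           # vm1 to esx2 gives the best balance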

Basic design principle: The number of clusters and virtual machines managed by vCenter influences the number of calculations, which impacts the performance of vCenter. Take this into account when sizing the vCenter server.

Cost-Benefit and Risk Analysis Criteria

The purpose of the cost-benefit and risk analysis is to filter out expensive and unstable migrations.

NOTE

The term “unstable migration” indicates the effect of the migration on the cluster load balance and examines the stability of the workload pattern of the virtual machine. It has nothing to do with the stability of the vMotion process itself! If the workload of the virtual machine changes after the migration due to an increase or decrease in demand, the migration is deemed unstable. One might argue that the terminology used can be alarming.

By performing a cost-benefit and risk analysis, DRS tries to throttle migrations and avoid a constant stream of vMotions and the high cost associated with unnecessary vMotions. The vMotion process itself uses CPU and memory resources; constantly migrating virtual machines would nullify the benefit of migrating them.

The cost-benefit and risk analysis also prevents unstable workloads from affecting the recommendations. If the virtual machine’s workload changes directly after the recommendation, the recommendation becomes useless and creates a situation where the virtual machine is selected over and over again, resulting in “ping-pong” migrations.

The following diagram and section will go into the cost benefit and risk analysis:


Figure 35: Cost benefit risk analysis

Cost - During a migration, vMotion tries to reserve 30% of a physical CPU on both the source and destination host. A shadow virtual machine is created on the destination host during the vMotion process; the memory consumption of this shadow virtual machine is also factored into the cost of the recommendation. At the end of the vMotion process the migrated virtual machine has a short period of downtime during which a snapshot of the virtual machine state is taken and the virtual machine is resumed on the destination host. This brief downtime is approximately one second or less and is not disruptive to virtual machine connections.

NOTE

The term “downtime” needs some clarification; Downtime indicates the interruption of

service of the virtual machine. Downtime incurred during vMotion is usually measured in

milliseconds. But because there is an interruption of service, although negligible, it needs to

be factored in.


Benefit -

Due to the migration of the virtual machine, resources are freed up on the source host and the virtual machine itself receives more resources due to the availability of resources on its new host. The migration of workload results in a much more balanced cluster.

Risk -

Risk accounts for the possibility of irregular loads. Irregular load indicates inconsistent and spiky demand workloads.

The cost-benefit and risk analysis results in a resource gain, which can be positive or negative. VMware uses the following chart to illustrate this resource gain.

Figure 36: Resource gain calculation

The X-axis of the chart displays the progress of time and the Y-axis shows the absolute positive or negative gain of the virtual machine on both the source and destination hosts. The resource gain is expressed in absolute units (MHz or MB), depending on the resource type being measured.

DRS uses historical data for this calculation. DRS starts by determining how many resources the virtual machine is consuming, using the metrics Host CPU: active and Host Memory: active. After establishing the consumed resources, it predicts how long this workload will last; this is called the stable time. After the stable time, DRS becomes conservative and assumes that the virtual machine will run at the worst possible load listed in its history (up to 60 minutes) until the next DRS invocation. Imagine what impact adjusting the invocation interval has on this analysis. The net resource gain is calculated for each of these periods and weighted by the length of the period.
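As an illustration, the following Python sketch shows how such a period-weighted net gain could be summed; the gain values and period lengths are purely illustrative assumptions.

def net_resource_gain(periods):
    # periods: list of (gain_per_second, duration_seconds); gain may be negative.
    return sum(gain * duration for gain, duration in periods)

periods = [
    (-50, 20),   # migration cost while the vMotion runs
    (+40, 120),  # positive gain during the predicted stable time
    (-10, 160),  # conservative worst-case gain until the next DRS invocation
]
print(net_resource_gain(periods))  # the move is only recommended if the sum is acceptable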


In the example shown, the gain during the migration is lower because of the extra migration cost. After the migration, there is a period where the gain is positive, and when the stable period ends, the worst-case gain turns out to be negative. The areas are added together and the sum is used to decide whether the move should be rejected. DRS will only recommend a migration if the result of the cost-benefit and risk analysis is acceptable.

The Biggest Bang for the Buck

After the cost-benefit and risk analysis and the simulation, a recommendation is created; this recommendation should result in the migration that gives the most improvement in terms of cluster balance, in other words: the biggest bang for the buck!

This is the reason why usually the busier, most resource-hungry virtual machines are moved, as they are most likely to decrease the Current Host Load Standard Deviation the most. If one move is not enough to balance the cluster within the given threshold, GetBestMove is executed again by the outer procedure to form a set of recommendations.

Basic design principle: Although DRS migrates busier virtual machines to gain the most improvement in cluster balance, this does not justify the use of big virtual machines. Virtual machines with smaller memory sizes or fewer virtual CPUs provide more placement opportunities for DRS.

Virtual machines with larger memory sizes and/or more virtual CPUs add more constraints to the selection and migration process. It is therefore recommended to configure the size of the virtual machine to what it actually needs, preventing oversizing.

Calculating the Migration Recommendation Priority Level

VMware published the algorithm in VMware knowledgebase article 1007485 “Calculating the

priority level of a VMware DRS migration recommendation”, which explains how DRS assigns the

priority ratings to the migration recommendations:

“For each migration recommendation, the priority level is limited to the integral range priority 2 to priority 5 (inclusive) and is calculated according to the following formula:

6 - ceil(LoadImbalanceMetric / 0.1 * sqrt(NumberOfHostsInCluster))

Here, ceil(x) is the smallest integral value not less than x. LoadImbalanceMetric is the current host load standard deviation shown on the cluster's summary page of the vSphere Client. For each host, compute the load on the host as sum(expected VM loads) / (capacity of host). Then compute the standard deviation of the host load metric across all hosts to determine LoadImbalanceMetric.”

The LoadImbalanceMetric value used in the algorithm is the current host load standard deviation, and ceil rounds the value up to an integer (a whole number, like 1, 2, 3, etc.).

Let us use this formula in an example. According to the screenshot, the 3-host cluster has a current host load standard deviation of 0.022. According to the formula, the calculation would be:

6 – ceil(0.022 / 0.1 * sqrt(3))

Figure 37: DRS summary

This would result in a priority level of 5 for the migration recommendation if the cluster were imbalanced. We created a workflow diagram to help visualize the flow of the DRS imbalance calculation process.
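The worked example can be reproduced with the following Python snippet. The operator grouping follows the formula as quoted above (both common readings give the same result for these numbers), and the clamping to the 2-5 range follows the KB text.

from math import ceil, sqrt

def recommendation_priority(chlsd, num_hosts):
    priority = 6 - ceil(chlsd / 0.1 * sqrt(num_hosts))
    return min(max(priority, 2), 5)   # the KB limits the result to the range 2..5

print(recommendation_priority(0.022, 3))  # -> 5, matching the example above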


Figure 38: DRS workflow diagram


Chapter 15

Influence DRS Recommendations

Some DRS settings and features can influence the DRS migration recommendations. This chapter takes a closer look at the various settings and the impact they can have on the DRS processes.

Migration Threshold Levels

As mentioned in the previous chapter, the migration threshold specifies the tolerance of imbalance of the Current Host Load Standard Deviation relative to the Target Host Load Standard Deviation. The migration threshold is configured in the DRS settings at the cluster level. To make the migration threshold setting more understandable, priority levels were introduced to indicate which level of tolerance is used to generate migration recommendations.

During the calculation, DRS assigns a priority level to each recommendation and this priority level is compared to the migration threshold. If the priority level is less than or equal to the migration threshold, the recommendation is displayed or applied, depending on the automation level of the cluster. If it is above the migration threshold, the recommendation is either not displayed or discarded.
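A tiny Python sketch of this filtering rule, using hypothetical recommendations:

def filter_recommendations(recommendations, threshold_level):
    # recommendations: list of (description, priority_level) tuples.
    # Keep a recommendation only if its priority level is <= the threshold level.
    return [r for r in recommendations if r[1] <= threshold_level]

recs = [("move vm1", 1), ("move vm2", 3), ("move vm3", 5)]
print(filter_recommendations(recs, threshold_level=3))  # drops the priority-5 move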

Level 1 (conservative)

When the conservative migration threshold level is selected, only mandatory moves, the priority-one recommendations, are executed. The DRS cluster will not invoke load-balancing migrations. Mandatory moves are issued when:

The ESX host enters maintenance mode

The ESX host enters standby mode

An (anti) affinity rule is violated

The sum of the virtual machine reservations exceeds the capacity of the host.

It is possible that a mandatory move will cause a violation on another host; if this happens, DRS will move virtual machines to fix the new violation. This scenario is possible when multiple rules exist in the cluster. It is not uncommon to see several migrations needed to satisfy the configured DRS rules.

Level 2 (moderately conservative)

The level 2 migration threshold applies only priority-one and priority-two recommendations; priority-two recommendations promise a very good improvement in the cluster’s load balance.


Level 3 (moderate)

The level 3 migration threshold is the default when creating DRS clusters. The moderate migration threshold applies priority-one, priority-two and priority-three recommendations, which promise a good improvement in the cluster’s load balance.

Level 4 (moderately aggressive)

The level 4 migration threshold applies all recommendations up to priority level four. Priority-four

recommendations promise a moderate improvement in the cluster’s load balance.

Level 5 (aggressive)

Level 5 is the right-most setting on the migration threshold slider and applies recommendations of all five priority levels; every recommendation which promises even a slight improvement in the cluster’s load balance is applied.

A level 1 (five star) recommendation should always be applied, but a list of several priority level 5 recommendations could also collectively affect the cluster negatively if those recommendations are not applied.

Although the cost-benefit and risk analysis takes unstable workloads into account, selecting an aggressive migration threshold in a cluster hosting virtual machines with varying loads can lead to a higher possibility of wasted migrations. A moderate migration threshold is more suitable in such a scenario. Aggressive thresholds (levels 4 and 5) are considered suitable for clusters with equal-sized hosts, relatively constant workload demands and few, if any, DRS rules.

The default moderate migration threshold provides sufficient balance without excessive migration

activity. It is typically aggressive enough to maintain workload balance across hosts without

creating excessive overhead caused by too-frequent migrations.

Basic design principle: Select a moderate migration threshold if the cluster hosts virtual

machines with varying workloads.

Rules

VMware vSphere 4.1 contains two types of affinity rules: Virtual Machine to Host (VM-Host) rules

and Virtual Machine to Virtual Machine (VM-VM) rules. A VM-Host affinity rule specifies the affinity

between a group of virtual machines and a group of ESX hosts inside the cluster, whereas a VM-VM

affinity rule only specifies the affinity between individual virtual machines.

Affinity rules can specify if the virtual machines should stay together and run on specified hosts

(affinity rules) or if they are not allowed to run on the same host (anti-affinity).


VM-VM Affinity Rules

When an affinity rule is configured, DRS should try to keep the specified virtual machines together

on the same ESX host. An example would be running both the front-end and back-end servers of an

application on the same host; it can improve performance due to lower latencies of communication.

An anti-affinity rule is the opposite of an affinity rule: it tries to prevent the specified virtual machines from running on the same host. These rules are suitable for creating a highly available application, such as multiple web servers; by running them on different hosts, the application will still be running even if an ESX host fails. Another example of an anti-affinity rule could be the separation of two virtual machines with network-intensive workloads. DRS measures CPU and memory usage, not network usage, so it is possible to end up with both virtual machines running on the same ESX host, which could saturate the host networking capacity available for virtual machines.

VM-Host Affinity Rules

A VM-Host affinity rule determines the groups of available ESX hosts on which the virtual machine

can be powered-on or moved to by DRS, HA and the user. Apart from the affinity and anti-affinity

specification, a VM-Host can be a mandatory rule or a preferential rule.

VM-Host affinity rules are created to establish an association between a group of virtual machines

and a group of ESX hosts. These associations enable the administrator to designate certain ESX

hosts for virtual machines to comply with ISV license regulations or to create availability zones.

For example, Oracle database virtual machines can be restricted to run only on ESX hosts which are licensed for Oracle, or virtual machines can be separated across different blade chassis for availability reasons.

Components

A virtual machine to host affinity rule consists of three components:

Virtual machine DRS group

ESX host DRS group

Designation – “Must” affinity/anti-affinity or “Should” affinity/anti-affinity

Virtual machine DRS groups and ESX host DRS groups are quite self-explanatory so let’s dive into

the designations component straight away.

Designations

Two different types of VM-Host rules are available: a VM-Host affinity rule can either be a “must” rule or a “should” rule. The must-rule is a mandatory rule for HA, DRS and DPM; it forces the virtual machines to run on the ESX hosts specified in the ESX host DRS Group.

The “should” rule is a preferential rule for DRS and DPM. DRS and DPM use their best effort to confine the virtual machines to, or keep them off, the ESX hosts specified by the rule, but DRS and DPM can violate “should” rules if honoring them would compromise certain key operations. HA is not aware of preferential rules because DRS does not communicate these rules to HA.


HA, DRS and DPM must take the mandatory rules into account when generating or executing operations and will never take any action that results in the violation of mandatory affinity rules. Because of this, mandatory rules place more constraints on VM mobility, making it more difficult for DRS to balance load and enforce resource allocation policies. HA and DPM operations are constrained as well; for example, mandatory rules will:

Limit DRS in selecting hosts to load-balance the cluster

Limit HA in selecting hosts to power up the virtual machines

Limit DPM in selecting hosts to power down

Due to their limiting behavior, it is recommended to use mandatory rules sparingly and only for

specific cases, such as licensing requirements. Preferential rules can be used to meet availability

requirements such as separating virtual machines between blade enclosures.

DRS and mandatory rules

DRS takes mandatory rules into account when generating load-balance recommendations. If a rule

is created and the current virtual machine placement is in violation of a rule, DRS will create a

priority one recommendation (five stars) and, if DRS is set to fully automatic, execute the

recommendation. DRS will not generate recommendations that would violate a rule. For example, an attempt to place an ESX host into maintenance mode would not be allowed if this would violate a mandatory rule. Similarly, vMotion will reject the operation if it detects that the operation is in violation of a mandatory rule.

If a reservation is set on the virtual machine, DRS takes both the reservation and the mandatory affinity rule into account. Both requirements must be satisfied during placement or power-on. If DRS is unable to honor either one of the requirements, the virtual machine is not powered on or migrated to the proposed destination host. For example, if a new rule is created and the current virtual machine placement is in violation of the rule, the virtual machine can only migrate to a new host if its memory reservation can be satisfied on the new host; if this is not possible, DRS will not generate the recommendation.

If a rule is created that conflicts with another active rule, the older rule overrules the newer rule and DRS will disable the new rule. When creating a new rule, a message indicating the conflicting rule will appear and the rule will be visibly disabled.

As you can imagine, mandatory affinity rules can complicate troubleshooting in certain scenarios; for example, determining why a virtual machine is not migrated from a highly utilized host to an alternative, lightly utilized host in the cluster.

DPM

DPM does not place an ESX host into standby mode if the result would violate a mandatory rule and

will power-on ESX hosts if these are needed to meet the requirements of the mandatory rules.


High Availability

Due to the DRS-HA integration in vSphere 4.1, HA respects only mandatory (must) rules. During an

ESX host failure event, HA uses an archived list of hosts provided by DRS and places the virtual

machines only on a compatible host, i.e. one of the hosts that are allowed by the mandatory rules.

HA is unaware of the preferential (should) rules, so HA might unknowingly violate the rule during

placement of virtual machines after an ESX failure, but the violation will be corrected by the next

DRS invocation.

Let us take a look at a configuration which is very likely to be widely implemented soon: the Oracle Must affinity rule.

1. Place all Oracle virtual machines in a Cluster VM DRS group. (VM01, VM03, VM11, VM20)

2. Place all Oracle licensed ESX hosts in a Cluster Host DRS Group (ESX01, ESX02, ESX09,

ESX10)

3. Select “Must run on Host in Group”

Figure 39: Mandatory VM-Host affinity rule

In this scenario, DRS never places, migrates, or recommends placement of a host-affined virtual machine on a host that is not listed in the Cluster Host DRS Group (ESX01 – ESX06 & ESX09 – ESX14). This means that DRS will never place the virtual machine on an unlicensed host: not for maintenance mode, not for DPM power saving and not after an ESX host failure event.


This virtual-machine-to-host affinity rule makes it possible to run Oracle inside big clusters without having to license all the ESX hosts. Oracle licenses can create a constraint on the design: normally, separate smaller clusters were deployed for Oracle database virtual machines, increasing both the OPEX and CAPEX of the environment. These new rules allow the Oracle virtual machines to run inside the main cluster with other virtual machines without having to license all the ESX hosts inside the cluster.

Mandatory VM-Host rule behavior

By design, mandatory rules are considered very important, and the intended use case, licensing compliance, is deemed so important that VMware decided to apply these restrictions to non-DRS operations in the cluster as well.

If DRS is disabled while mandatory VM-Host rules still exist, the mandatory rules remain in effect and the cluster continues to track, report and alert on mandatory rules. If a manually started vMotion would violate a mandatory VM-Host affinity rule, the cluster still rejects the vMotion even after DRS is disabled.

Mandatory rules can only be disabled if the administrator explicitly does so. If it is the administrator’s intent to disable DRS, remove the mandatory rules first before disabling DRS.

Basic design principle:

Use VM-Host and VM-VM affinity rules sparingly, as rules can have an impact on the

effectiveness of the Load balancing calculation. The DRS algorithm has less choice when

rules are configured. Mandatory affinity rules apply even when DRS is disabled.

Impact of Rules on Organization

Many users create rules but seem to forget to create a backup of them or to document them. Anti-affinity rules can play an important role in meeting certain SLA or BC/DR requirements, so creating a backup or documenting them seems appropriate. By using PowerCLI, the rules can easily be extracted from the vCenter database.

Virtual Machine Automation Level

You can customize the automation level for individual virtual machines in a DRS cluster to override

the automation level set on the cluster. This might be necessary to meet certain availability or

business requirements.

There are five automation level modes:

Fully Automated


Partially Automated

Manual

Default (cluster automation level)

Disabled

If the automation level of a virtual machine is set to disabled, DRS does not migrate that virtual

machine or provide migration recommendations for it. By setting the automation mode of the

virtual machine to manual, maintenance mode is able to evacuate this virtual machine

automatically.

Partially automated versus disabled automation level mode

As mentioned before, due to requirements or constraints it might be necessary to exclude a virtual

machine from automatic migration and stop DRS from moving it around. Use the Partially Automated setting instead of Disabled at the individual virtual machine automation level. Partially Automated blocks automated migration by DRS, but keeps the initial placement function. During startup, DRS is still able to select the most optimal host for the virtual machine. By selecting the Disabled option, the virtual machine is started on the ESX host on which it is registered, and the chances of getting an optimal placement are low(er).

An exception to this recommendation might be a virtualized vCenter server; most admins like to keep track of the vCenter server in case a disaster happens. After a disaster occurs, for example a datacenter-wide power outage, you only need to power up the ESX host on which the vCenter virtual machine is registered and manually power up the vCenter virtual machine. An alternative to this method is keeping track of the datastore the vCenter virtual machine is placed on and registering and powering on the virtual machine on a (random) ESX host after a disaster. This is slightly more work than disabling DRS for vCenter, but it probably offers better performance of the vCenter virtual machine during normal operations.

Due to expanding virtual infrastructures and new additional features, vCenter is becoming more

and more important for day-to-day operational management. Assuring good performance

outweighs any additional effort necessary after a (hopefully) rare occasion, but both methods have

merits.

Select the automation level in accordance with your environment and level of comfort. Try to have

virtual machines in DRS fully automated mode as much as possible, as DRS considers these virtual

machines for cluster load balance migrations before the virtual machines not in fully automated

mode.


Basic design principle:

Leave virtual machine automation mode set to Default to minimize administration overhead

and possibility of human error. Set automation level to Manual instead of Partially

Automated if more control over placement is required.

Impact of VM Automation Level on DRS Load Balancing Calculation

Contrary to popular belief, a virtual machine set to disabled automation level still has impact on the

calculation of the current host load standard deviation as the sum of the active workload is divided

through the capacity of the host. DRS does not need to be aware of virtual machine automation

levels at that stage. During the recommendation calculation, DRS skips the virtual machines which

are set to disabled automation level and selects other virtual machines on that host. If DRS can

choose between virtual machines set to the automatic automation level and the manual automation

level, DRS chooses the virtual machines which are set to automatic as it prefers them over virtual

machines set to manual.


Chapter 16

Resource Pools and Controls

As virtualization adoption progresses, most administrators spend less time setting up environments and more time on resource management of the virtual infrastructure. VMware introduced DRS, clusters and resource pools to help simplify resource management. Clusters aggregate ESX host capacity into one large pool and create an independent layer between the resource providers (ESX hosts) and resource consumers (virtual machines).

This independent layer has several advantages; one of these advantages is subdividing the cluster capacity into smaller resource pools. Resource pools do not "carve up" the physical resources of the cluster for exclusive use by their member virtual machines; instead they guarantee, limit and prioritize their member virtual machines' access to a certain amount of cluster capacity. The cluster provides resources based on the resource allocation controls set at the resource pool level.

These resource allocation controls, reservations, limits and shares, are similar to the virtual machine resource allocation settings. But how do these settings work at the resource pool level and what impact do they have on virtual machine workloads? Let us explore the construct called the resource pool a bit more.

Root Resource Pool

When a VMware cluster is created, a top-level resource pool called the root resource pool is created

implicitly. The cluster aggregates all the CPU and memory resources made available by the ESX hosts inside the cluster, minus the resources used by the virtualization overhead. When the cluster

is created both reservation and limit parameters are set at the root resource pool. Both settings are

set to the same value and indicate the total amount of resources the cluster has available to run the

virtual machines. The root resource pool is not displayed by vCenter and its resource allocation

settings cannot be changed by the user.

For example, when a cluster is created of 4 hosts and each host has 4 GHz of CPU and 4 GB of memory available, the root resource pool contains 16 GHz of CPU and 16 GB of memory resources.


Figure 40: Root resource pool

The following legend applies to all diagrams in the following chapters.

When calculating the root resource pool, vCenter will exclude resources reserved for the

virtualization layer, such as the Service Console and VMkernel. The amount of resources required to

satisfy HA failover (assuming HA Admission Control is enabled) will be shown in the root resource

pool as reserved, whereas the amount of resources used by the Service Console and VMkernel will

not even show up in the capacity of the root resource pool.
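A trivial Python sketch of this aggregation, with purely illustrative overhead values:

def root_pool_capacity(hosts):
    # hosts: list of (cpu_mhz, mem_mb, overhead_cpu_mhz, overhead_mem_mb)
    # The root resource pool is the sum of host capacities after the
    # virtualization overhead (VMkernel / Service Console) is excluded.
    cpu = sum(c - oc for c, m, oc, om in hosts)
    mem = sum(m - om for c, m, oc, om in hosts)
    return cpu, mem

hosts = [(4000, 4096, 0, 0)] * 4          # four hosts with 4 GHz / 4 GB available
print(root_pool_capacity(hosts))           # -> (16000, 16384), i.e. 16 GHz / 16 GB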


Resource Pools

As stated before a VMware cluster allocates resources from hosts (resource providers) to virtual

machines (resource consumers). Resource pools are in between and are both resource providers

and consumers; a resource pool provides resources to virtual machines, but consumes resources

from the cluster.

Figure 41: Resource pools

Apart from the root resource pool, each resource pool has a parent resource pool. A resource pool contains children, which can be other resource pools or virtual machines. In the example pictured above, the root resource pool is the parent of resource pools 1 and 2. Resource pool 1 is a child of the root resource pool but also the parent of resource pool 3, vm3 and vm4.

NOTE

Placing virtual machines at the same level as resource pools is not a recommended configuration!

The maximum resource pool tree depth is 8, excluding the 4 resource pools created internally on each ESX host. These internal resource pools are independent of the DRS resource pools. To avoid complicated proportional share calculations and complex DRS resource entitlement calculations, we advise not to exceed a resource pool depth of 2. The flatter the resource pool tree, the easier it is to manage.


Resource pools and simultaneous vMotions

vCenter Server 4.1 supports up to 512 resource pools in a DRS cluster and the resource pools can be up to 8 levels deep under the root resource pool. Some administrators have the bad habit of using resource pools to create a folder structure in the "Hosts and Clusters" view of vCenter. Virtual machines are placed inside a resource pool to show some kind of relation or sorting order, like operating system or type of application.

A downside of using resource pools as a folder structure is the limitation resource pools inflict on vMotion operations. Depending on the network speed, vSphere 4.1 allows 8 simultaneous vMotion operations; however, simultaneous migrations with vMotion can only occur if the virtual machine is moving between hosts in the same cluster and is not changing its resource pool.

Fortunately, simultaneous cross-resource-pool vMotions can occur if the virtual machines are migrating to different resource pools, but still only one vMotion operation per target resource pool. Because clusters are actually implicit resource pools (the root resource pool), migrations between clusters are also limited to a single concurrent vMotion operation.

Figure 42: simultaneous migrations

Under Committed versus Over Committed

Resource allocation settings are used to guarantee a certain amount of resources when the cluster is overcommitted. When the active usage of resources does not exceed the available amount of physical resources of the host, the state of the system is called under committed. During this state, every resource request made by a virtual machine is backed by physical resources.

When the active usage of resources exceeds the available amount of physical resources, the system reaches a state called over committed. During overcommitment, the VMkernel uses several techniques and mechanisms to allocate resources according to the virtual machines' resource entitlements. Similar to virtual machines, resource pools have reservation, limit and shares parameters for CPU and memory resources. Expandable reservation is the only setting that exists at the resource pool level and not at the virtual machine level.


Resource Allocation Settings

By configuring the resource allocation settings (reservation, shares and limit) on the resource pool, a collective sum of resources is defined. This collective sum of resources can be used by the virtual machines inside the resource pool. For example, by setting a 10GB memory reservation on a resource pool, the virtual machines within the resource pool have guaranteed access to 10GB of machine memory (ESX physical memory) inside the cluster.

Virtual machine workloads are executed at the ESX host level, but how are resource pool settings translated to the ESX host level? DRS does this through a mechanism called resource pool mapping. DRS mirrors the resource pool hierarchy to each host and divides the parent resource pool resource allocation values across the mirrored local trees.

Figure 43: Local resource pool mapping

Dividing these values across the hosts is based on the amount of running active virtual machines,

their VM resource allocation settings and their current utilization. Once the parent resource

allocation settings are propagated to the host local RP tree, the local host CPU and memory

scheduler takes care of the actual resource allocation.

Resource pool resource allocation settings

Shares - Shares specify the relative importance of the virtual machine or resource pool. Shares are always measured against other powered-on virtual machines or resource pools containing such virtual machines at the same hierarchical level.

Reservation – Also referred to as "MIN" (minimum). A reservation is the amount of physical resources (MHz or MB) guaranteed to be available to the virtual machine or resource pool.

Limit – Also referred to as "MAX" (maximum). A limit specifies an upper bound for the resources that can be allocated to a virtual machine or resource pool.

Expandable reservation - This allows the resource pool, once it has already reserved as much capacity as defined in its own reservation setting, to reserve even more. The additional reservation is taken from the unreserved capacity in the parent of this resource pool.


Shares

Shares specify the priority of the virtual machine or resource pool relative to other resource pools and/or virtual machines with the same parent in the resource hierarchy. There is no official term (yet) for this, but let us use the term sibling share level. The relative priority is calculated across all siblings in relation to their sibling share level.

Figure 44: Parent-child relation

The key point is that share values can be compared directly only among siblings: the ratio of the shares of VM1:VM2 tells which VM has higher priority, but the shares of VM2:VM3 do not tell which VM has higher priority.

Contrary to reservations and limits, which are specified in absolute numbers, shares are relative to the other virtual machines and resource pools. Consequently, when more virtual machines become active at the same hierarchical level, the relative share of the resources allocated to the virtual machine will change.

When configuring shares you can select one of the three predefined settings; High, Normal or Low,

which specify share values with a 4:2:1 ratio, or select the Custom setting to specify a more

granular value.

Because shares determine relative priority at the same hierarchical level, the absolute values do not matter: configuring resource pools 1 and 2 with share values of 10 and 20 respectively has the same effect as configuring the resource pools with share values of 10000 and 20000. Use care when selecting the Custom setting, as a custom value can lead to the virtual machine owning a disproportionately large portion of the shares, in other words priority.


The default behavior of shares is that they scale with the size of the virtual machine: a virtual machine set to share level Normal and configured with 1 vCPU and 1024 MB of memory will receive 1000 CPU shares and 10240 memory shares. The pre-defined share settings have the following values:

Table 4: Share values

When a virtual machine is created, the amount of shares the virtual machine receives is based upon the number of configured vCPUs and the default share level (Low, Normal or High); a virtual machine set to share level Normal receives 1000 CPU shares per vCPU. This means that, without changing any share settings, a virtual machine that is configured with more vCPUs and memory is entitled to a correspondingly larger amount of physical resources during contention. The assumption is that a virtual machine with more vCPUs actually needs more CPU resources. In practice, this is not always true.
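The following Python sketch reproduces these defaults. The Normal values (1000 CPU shares per vCPU, 10 memory shares per MB) are quoted above; the High and Low values are derived here from the 4:2:1 ratio and should be treated as assumptions.

CPU_SHARES_PER_VCPU = {"high": 2000, "normal": 1000, "low": 500}
MEM_SHARES_PER_MB   = {"high": 20,   "normal": 10,   "low": 5}

def default_shares(num_vcpus, memory_mb, level="normal"):
    # Shares scale with the configured size of the virtual machine.
    return (num_vcpus * CPU_SHARES_PER_VCPU[level],
            memory_mb * MEM_SHARES_PER_MB[level])

print(default_shares(1, 1024))        # -> (1000, 10240), as in the example above
print(default_shares(4, 16384))       # -> (4000, 163840), the resource pool default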

NOTE

This method of assigning shares based on the amount of vCPU implicitly indicates (and

explicitly controls) that a virtual machine with multiple vCPUs is more important and has a

higher priority than a virtual machine with a lower amount of vCPUs. This situation does

not always reflect the business side requirements; the virtual machine with fewer vCPUs

might contain an application which is more important to the business than the resource

intensive application running on the virtual machine with multiple vCPUs. The level of

importance to the business does not automatically equal bigger virtual machines, and vice

versa.

By default, a resource pool is configured similarly to a virtual machine with 4 vCPUs and 16GB set at Normal level, i.e. 4000 CPU shares and 163840 memory shares. Caution must be taken when placing

virtual machines on the same hierarchical level as resource pools, as virtual machines can end up

with a higher priority than intended.

Placing virtual machines at the same level as resource pools is something we see very often;

sometimes by design and sometimes accidental (and not recommended). During a manual vMotion,

the administrator needs to select a resource pool, and by default the root resource pool is selected.

If this step is overlooked by the administrator, the virtual machine ends up in the root resource


pool. In this scenario, a virtual machine can be denied resources by, or can deny resources to, its sibling virtual machines and resource pools.

vSphere 4.1 introduces a mechanism called flattened shares, explained in chapter 10 “vSphere 4.1 HA and DRS Integration”. Please be aware that this mechanism only comes into play when HA fails over the virtual machine.

Resource pool share settings do not influence the share settings of the virtual machines inside the pool. When creating a virtual machine inside a resource pool with its share value set to High, the share level of the virtual machine is still set to the default value of Normal. This is because these VM-level shares indicate the relative importance of the virtual machine within its own hierarchical level, i.e. relative to its siblings, not to virtual machines in other resource pools.

So how do resource pool shares affect virtual machine workloads? As mentioned before, DRS

mirrors the resource pool hierarchy to each host, and the ESX host’s local CPU and memory

scheduler retrieves the resource allocation settings of each active virtual machine from this tree.

DRS divides the resource pool cluster level share amount across the mirrored local trees based on

the amount of running active virtual machines, their VM-level shares amounts and their current

utilization. Once the resource allocation settings are propagated to the host local RP tree, the local

host CPU and memory scheduler takes care of the actual resource allocation.

For the sake of simplicity, let’s forget the previous example of nested resource pools and use a 2-host cluster. In this cluster a resource pool is created and the default resource allocation settings are used. By default, a resource pool is configured similarly to a 4 vCPU, 16GB virtual machine at Normal level, i.e. 4000 CPU shares and 163840 memory shares. (This example uses memory shares; the same applies to CPU shares.) The four virtual machines running inside the resource pool are configured as follows:

Table 5: Share configuration scenario

Let us assume that all the virtual machines are running equal and stable workloads. DRS will balance the virtual machines across both hosts and create the following resource pool mapping:

Figure 45: Resource pool mapping


The amount of shares specified on virtual machines vm1, vm2 and vm3 totals 20480, which equals half of the total amount of shares configured inside the resource pool. In this example DRS decides to place these virtual machines on ESX host ESX1 and therefore assigns half of the share value of the resource pool to resource pool 1's mirror in the host-level resource pool tree.

At this point resource pool 2 is created, but with a different share configuration: resource pool 2 is configured with double the amount of shares of resource pool 1, i.e. 327680. Inside the resource pool, the virtual machines are identical to the virtual machines in resource pool 1.

Figure 46: Share ratio result


The local resource pool tree of ESX1 is updated with resource pool 2; this resource pool is configured with twice the amount of shares of resource pool 1. By introducing 327680 shares, the total amount of shares active on the host is increased to 491520 (163840 + 327680). Resource pool 2 owns 327680 of the total of 491520, which equals roughly 66.6 percent.

Due to the 67% - 33% ratio at the resource pool level, the local resource scheduler will allocate more resources to resource pool 2. The resources allocated to resource pool 2 are subdivided between its virtual machines based on their hierarchical level (sibling share level). This means that virtual machine VM5 is entitled to 50% of the resource pool's resources during contention, which is basically 33% of the ESX host's resources (1/2 of a pool which is 2/3 of the host = 2/6 = 1/3 of the host). The %Shares column in the resource allocation tab of the cluster displays the amount of shares each object receives per parent share level.
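The arithmetic of this example can be reproduced with a few lines of Python:

rp1_shares, rp2_shares = 163840, 327680
total = rp1_shares + rp2_shares                      # 491520 shares active on the host
rp2_fraction = rp2_shares / total                    # roughly 0.667
vm5_fraction_of_host = 0.5 * rp2_fraction            # 1/2 of 2/3 = 1/3 of the host
print(round(rp2_fraction, 3), round(vm5_fraction_of_host, 3))   # 0.667 0.333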

To emphasize: this example used virtual machines with equal and stable workloads. During normal conditions, some virtual machines have a higher utilization than others. As the working set is part of the resource entitlement calculation, active workload is taken into account when dividing the resource pool shares and resources between the hosts and local resource pool trees. Because resource usage changes, this division occurs every time DRS is invoked, and the distribution of resources will therefore keep changing when appropriate.

Reservation

A reservation is a guaranteed lower bound of resources that is reserved for the resource pool or

virtual machine to ensure availability of physical resources at all times, even during resource

contention. Reservations can be set at resource pool level and virtual machine level. When setting a

reservation at resource pool level, you will guarantee a certain amount of resources for all its

children collectively, though they may contend with one another. To exclusively guarantee

resources for a specific virtual machine a VM-level memory reservation has to be set.

VM Level Scheduling: CPU vs Memory

The behavior of the VMkernel CPU scheduler differs from the VMkernel memory scheduler when it

comes to claiming and releasing physical resources.

If a CPU reservation is set on a virtual machine, the virtual machine is guaranteed to have these CPU

cycles available. But since CPU reservations are friendly, the “owner” sometimes has to wait for the

interloper to leave when the owner wants the pCPU. Normally this should be brief.

If the virtual machine does not use all CPU cycles, they will flow back to the system and be available

for other virtual machines until the virtual machine wants to use these again.


If a memory reservation is set on a virtual machine, the virtual machine is guaranteed to have this amount of physical memory available. If the virtual machine uses and therefore claims these machine pages, the VMkernel will not reclaim them. These pages are allocated by the virtual machine and no other virtual machine can use them anymore. Even if the virtual machine idles, these pages are listed as reserved capacity. This claiming behavior has an enormous impact when using Microsoft Windows as the guest OS: because Windows touches every page during startup, these pages are claimed by the virtual machine. The pages aren't really used, but this behavior can result in the ESX host starting to swap and balloon if no free memory is available for other virtual machines while the owning VMs aren't using their claimed reserved memory.

Because of the “friendly” conduct of CPU scheduling and the "greedy" behavior of memory scheduling, we will focus more on memory reservations than on CPU reservations in the following paragraphs.

Impact of Reservations on VMware HA Slot Sizes

Both virtual machine level CPU and memory reservations are used by HA Admission Control as

input for the calculation of slot sizes. Reservations set at resource pool level are ignored by HA.

This can be useful to circumvent the HA slot size if virtual machines configured with large amounts

of memory are active in the virtual infrastructure. Most of the time these “heavy hitters” run mission-critical applications, so it's not unusual to set memory reservations to guarantee the availability of memory resources. If such a virtual machine is placed in an HA cluster, these significant memory reservations can lead to a very conservative consolidation ratio, due to the impact on the HA slot size calculation. Chapter 7 contains more information about HA slot sizes.

By placing these virtual machines inside a resource pool and configuring the resource pool with a memory reservation equal to the configured memory, the virtual machines are guaranteed physical resources without creating an over-conservative slot size.

We recommend implementing this workaround very sparingly as creating a resource pool for each

VM creates a lot of administrative overhead and makes the host and cluster view a very unpleasant

environment to work in.

Behavior of Resource Pool Level Memory Reservations

Fortunately the memory reservation mechanism on resource pool level has a different behavior

than the mechanism of virtual machine level reservation. The next section compares the behavior

of memory reservation at resource pool level and virtual machine level.

When a resource pool is created it instantly "annexes" the specified memory reservation, even

when no virtual machine is running inside the resource pool. When reviewing the resource

allocation tab of the cluster, the amount of memory reservation is added to the reserved capacity.

DRS subtracts this amount of memory from the available memory pool, prohibiting other virtual machines or resource pools from claiming it. Basically, the amount of reserved memory is simply subtracted from the pool of unreserved memory in the cluster; it has nothing to do with physical usage or claiming of the memory.


When we power up a virtual machine inside the resource pool, the virtual machine can start

consuming resources, and as long as no resource contention occurs, the VMkernel will back every

request with physical resources. (Except if a limit is set on the virtual machine level, in that scenario

the virtual machine cannot allocate more than its limit.) In this example no limits are set on virtual

machine and resource pool level. Limits are addressed in Chapter 17.

Once contention occurs, the resource pool must divide the reserved memory pool across its

member virtual machines. How does it know which virtual machines are more "entitled" to physical

memory and which are not really in need of resources?

The resource pool looks at the virtual machine resource entitlement. Resource pool level reservation settings do not "flow" to the virtual machine: the resource pool does not explicitly set a calculated reservation per virtual machine, but uses the reservation to calculate the resource entitlement of each virtual machine from the resource pool's point of view. So basically the memory reservation parameter becomes a dynamic "metric". Virtual machines that are actively using memory receive more access to memory resources than idle virtual machines. Resource pool reservations are therefore more in line with the whole concept of consolidation and fairness, as they only consume resources when needed.

Now the big difference between virtual machine level and resource pool level memory reservation

is that resource pool memory reservations do not "hoard" memory. Virtual machine memory

reservations will not be reclaimed by the VMkernel, so when memory is protected by a VM level

reservation it is never reclaimed. It does not matter how idle the virtual machine is. Because of the

dynamic characteristics of the resource pool memory reservation and the fact that it does not flow

to the virtual machine, unused memory will flow back to the system. The memory is available and

free to be used even by virtual machines external to the resource pool; however, they can only use it, not reserve it. Therefore, shares at the virtual machine level are much more important than many administrators realize.

NOTE

We must stop treating shares as the redheaded stepchild of the resource allocation settings

family and realize how important share ratios really are. Besides shares, active memory

usage and the configured memory size will impact the resource entitlement and thus

performance of the virtual machine.

Now setting a memory reservation on a resource pool level has its own weaknesses. As stated

before resource pool reservations do not flow to virtual machines, so they will not influence HA slot

sizes. Using only resource pool reservations and not virtual machine reservations can lead to

(temporary) performance loss if a host failover occurs. When virtual machines are restarted by HA, they are not restarted in their correct resource pool but in the root resource pool, which can lead to (temporary) starvation. vCenter 4.1 uses the flattened shares mechanism when restarting virtual machines in the root resource pool. The placement is corrected in the next DRS invocation, when the virtual machine is moved from the root resource pool back into its correct resource pool, but until that point in time the virtual machine needs to do without any memory reservation.


Setting a VM Level Reservation inside a Resource Pool

It is possible to set a reservation on a virtual machine inside a resource pool that is configured with

a reservation as well. When this scenario occurs, the amount of resources specified at the virtual

machine level is subtracted from its parent. For example, assume a 10GB memory reservation is set on the resource pool and a 2GB memory reservation is set on a virtual machine inside it. DRS respects the virtual machine reservation and immediately passes it to the local scheduler, but subtracts this amount from the resource pool reservation, resulting in an 8GB pool of guaranteed memory which DRS divides across the remaining child virtual machines and resource pools inside the resource pool.
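The bookkeeping of this example can be sketched in a few lines of Python (hypothetical values, ignoring memory overhead reservations):

    # Sketch: a VM-level reservation is passed to the local scheduler and
    # subtracted from the reservation of its parent resource pool.
    pool_reservation_gb = 10
    vm_reservation_gb = 2

    remaining_for_siblings_gb = pool_reservation_gb - vm_reservation_gb
    print(remaining_for_siblings_gb)   # 8 GB left for the remaining children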

Basic design principle:

Set per-VM memory reservations only if a virtual machine absolutely requires guaranteed

memory.

VMkernel CPU reservation for vMotion

The VMkernel reserves a certain amount of CPU capacity for the vMotion task. In vSphere 4.0 and

earlier ESX reserved 30% of a CPU core at both source and destination host. In vSphere 4.1, this

CPU reservation is 30% of a processor core for a 1 Gb network interface and 100% of a CPU core

for a 10 Gb network interface. This amount of CPU resources is needed to take full advantage of the network bandwidth and minimize migration time.

By default, all vMotions initiated by DRS are of low priority. The distinction between high and low priority is that a low priority vMotion tries to reserve this percentage of a core but will proceed regardless of how much it actually receives, whereas high priority vMotions are designed to fail if they cannot reserve sufficient resources.

Basic design principle:

Leave some CPU capacity in a cluster as unreserved. This helps to ensure that vMotion tasks

get the resources needed and are completed as quickly as possible.

Reservations Are Not Limits.

Setting a memory reservation on a resource pool does not imply that the virtual machines inside it cannot use memory above the reservation. If a virtual machine wants to allocate memory above the reserved amount, that memory is allocated based on shares in case of resource contention.


Figure 47: Reservations and share based resources

In the picture above, resource pool 1 has a 10GB memory reservation configured, but the total

amount of configured memory of its virtual machines (16GB) is greater than the reservation.

During contention, the amount of reserved memory of resource pool 1 is divided between its virtual

machines based on their resource entitlements. In the example above, the 6GB of memory demand exceeding the reserved memory pool is allocated from the unreserved memory pool, based on the proportional share levels of resource pool 1 and resource pool 2. The maximum amount of memory that resource pool 1 can allocate equals the combined configured memory of all its virtual machines plus their memory overhead reservations.
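A minimal sketch of the allocation logic in Figure 47, assuming for simplicity that the resource entitlement of each virtual machine is proportional to its demand:

    # Sketch: reserved memory is divided based on entitlement; demand above the
    # reservation competes for unreserved memory based on proportional shares.
    pool_reservation_gb = 10
    vm_demand_gb = {"vm1": 8, "vm2": 8}            # 16 GB of demand in total

    total_demand = sum(vm_demand_gb.values())
    backed_by_reservation = {vm: pool_reservation_gb * demand / total_demand
                             for vm, demand in vm_demand_gb.items()}
    from_unreserved_pool_gb = total_demand - pool_reservation_gb   # 6 GB

    print(backed_by_reservation)        # {'vm1': 5.0, 'vm2': 5.0}
    print(from_unreserved_pool_gb)      # allocated according to the pool's share level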

Memory Overhead Reservation

If you do not set a memory reservation at the resource pool level, be sure to enable Expandable reservation; otherwise, Admission Control cannot allocate resources to satisfy the virtual machine memory overhead reservations.

Every virtual machine running on an ESX host consumes some memory overhead in addition to the current usage of its configured memory. This extra space is needed by ESX for the internal

VMkernel data structures like virtual machine frame buffer and mapping table for memory

translation (mapping physical virtual machine memory to machine memory). Two kinds of virtual

machine overhead exist:


Static overhead

Static overhead is the minimum overhead required to start the virtual machine. DRS and the VMkernel use this metric for Admission Control and vMotion calculations. The destination ESX host must be able to back the virtual machine reservation plus the static overhead, otherwise the vMotion will fail.

Dynamic overhead

Once the virtual machine has started up, the virtual machine monitor (VMM) can request additional

memory space. The VMM will request the space, but the VMkernel is not required to supply it. If the

VMM does not obtain the extra memory space, the virtual machine will continue to function but this

could lead to performance degradation. The VMkernel treats virtual machine overhead reservation

the same as VM-level memory reservation and it will not reclaim this memory once it has been

used.

Admission control

As mentioned earlier, DRS and the VMkernel will not allow a virtual machine to be powered on if its reservations cannot be guaranteed. The effective memory reservation for a virtual machine is therefore the user-configured memory reservation (VM-level reservation) plus the overhead reservation. Consequently, when designing a resource pool, the memory overhead of each virtual machine must be included in the calculation of the memory reservation specified on the resource pool. The behavior of dynamic overhead must also be taken into account. Table 3.2 of the

vSphere Resource Management guide lists the overhead memory of virtual machines. The table

listed below is an excerpt from the Resource Management guide and lists the most common ones.

Table 6: Virtual machine memory overhead (in MB)

Please be aware of the fact that memory overheads can grow with each new release of ESX, so keep

this in mind when upgrading to a new version. Verify the documentation of the virtual machine

memory overhead and check the specified memory reservation on the resource pool.
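The arithmetic that follows from this is straightforward; the sketch below uses hypothetical overhead values, so always look up the actual figures for your ESX version and virtual machine configuration.

    # Sketch: the effective reservation per VM is the configured VM-level
    # reservation plus its memory overhead reservation; a resource pool
    # reservation should at least cover the sum of these values.
    vms = [
        ("db01",  2048, 180),   # (name, VM-level reservation MB, overhead MB)
        ("app01", 1024, 110),
        ("web01",    0, 110),
    ]

    required_pool_reservation_mb = sum(reservation + overhead
                                       for _, reservation, overhead in vms)
    print(required_pool_reservation_mb)   # minimum pool reservation to power on all VMs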


Expandable Reservation

The setting expandable reservation only exists on a resource pool level, although it is used to

allocate resources for virtual machine level reservations, including virtual machine memory

overhead reservation.

Expandable reservation is used by Admission Control. If the expandable reservation setting is

selected, Admission Control considers the capacity in the ancestor resource pool tree as available

for satisfying VM-level reservations. If the expandable reservation is not selected, Admission

Control considers only the resources available of the resource pool to satisfy the reservation.

A simple way to think of it is this: Add the VM-level reservations—plus implicit overhead

reservations—of every VM running in the resource pool. That sum cannot be greater than the

resource-pool level reservation, unless Expandable is checked.

Note that this has nothing to do with how much memory can be configured in or used by the VMs in

the resource pool; it's only about what they can reserve.

Figure 48: Expandable reservation workflow


When a virtual machine is powered on, it will search for unreserved capacity through the resource

pool tree. It will only consider unreserved resources from its ancestors but not siblings. Ancestors

are direct parents of the resource pool, parents of the parents, etc.

The search for unreserved capacity stops when a resource pool is configured without the

expandable reservation selected or when a limit is set. When the requested capacity would allocate

more resources than the limit of the parent resource pool specifies, the request is rejected, and the

virtual machine will not be started.
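The traversal can be sketched roughly as follows. This is a simplified model under the assumption that only unreserved memory and the expandable flag are checked; real Admission Control also honors limits on every level and performs the same exercise for CPU.

    # Sketch: walking up the ancestor tree for unreserved capacity at power-on.
    class ResourcePool:
        def __init__(self, name, unreserved_mb, expandable, parent=None):
            self.name = name
            self.unreserved_mb = unreserved_mb
            self.expandable = expandable
            self.parent = parent

    def can_admit(pool, required_mb):
        available = 0
        node = pool
        while node is not None:
            available += node.unreserved_mb
            if available >= required_mb:
                return True
            if not node.expandable:
                return False            # search stops at a non-expandable pool
            node = node.parent          # ancestors only, never siblings
        return False

    root = ResourcePool("cluster root", unreserved_mb=8192, expandable=False)
    gold = ResourcePool("gold", unreserved_mb=512, expandable=True, parent=root)
    print(can_admit(gold, 2048))        # True: 512 MB locally plus the parent's capacity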

Figure 49: Traversing expandable reservation

Basic design principle:

Enable expandable reservation if VM-level reservations will be defined and the memory reservation set on the resource pool might be inadequate to back them.

Limits

A limit is an artificial cap on the usage of a resource and is the complete opposite of a reservation: where the reservation is a guaranteed lower bound of resources, the limit is an enforced upper bound. A virtual machine or resource pool is prevented from using more physical resources than its configured limit. Even when there are plenty of resources available, the limit prohibits the virtual machine or resource pool from making use of them. Setting limits at the resource pool or virtual machine level can affect the performance of those virtual machines, but limits can negatively affect the rest of the environment as well.

The VMkernel CPU scheduler behaves differently from the VMkernel memory scheduler when it

comes to limiting physical resources. Let us take a quick look at the differences between the

VMkernel CPU and memory scheduler when setting limits.

CPU Resource Scheduling

If a CPU limit is set on a virtual machine, the VMkernel restricts the amount of time the virtual machine is scheduled. The lower the number of MHz specified in the limit setting, the less time the virtual machine receives from the CPU scheduler.

If a limit is set on the virtual machine, this limit specifies the upper boundary of CPU resources for

the entire virtual machine; it is not used for specifying the limit on a vCPU level. The CPU scheduler

makes sure that the total CPU resource consumption of the virtual machine does not exceed the

specified limit. Whether the virtual machine has 1 vCPU or 8 vCPUs, the VMkernel will not allocate

more physical resources than are specified by the limit. For example, a 2000 MHz limit is set on a 2

vCPU virtual machine; if vCPU 1 uses 1600 MHz, the VMkernel restricts the second vCPU and allows

it to consume a maximum of only 400 MHz.
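The aggregate behavior of the example can be expressed as a small sketch (hypothetical numbers; the real scheduler distributes the capped amount based on demand and entitlement rather than in a fixed order):

    # Sketch: a VM-level CPU limit caps the whole virtual machine, not each vCPU.
    def grant_mhz(vcpu_demand_mhz, vm_limit_mhz):
        granted, remaining = [], vm_limit_mhz
        for demand in vcpu_demand_mhz:
            give = min(demand, remaining)
            granted.append(give)
            remaining -= give
        return granted

    # 2 vCPU virtual machine with a 2000 MHz limit: vCPU 1 consumes 1600 MHz,
    # leaving only 400 MHz for vCPU 2.
    print(grant_mhz([1600, 1200], vm_limit_mhz=2000))   # [1600, 400]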

When a CPU limit is set at the resource pool level, the physical resources are divided based on the resource entitlement of the virtual machines. DRS takes both virtual machine and resource pool shares into account when calculating the resource entitlement of each virtual machine.

The resource pool divides the resources and assigns a limit based on the resource entitlement of each virtual machine. A dual vCPU virtual machine receives more shares than a single vCPU virtual machine. If both virtual machines are equally active and no per-VM reservations or limits are set, the share level indicates the priority level of the virtual machine, and the 1 vCPU virtual machine is therefore limited more than the virtual machine with 2 vCPUs. This might end up negatively affecting the performance of a business critical application. If it is necessary to set a limit at the resource pool level, be aware that the number of vCPUs of each virtual machine can impact the assignment of resources within the resource pool.

Memory Scheduler

If a memory limit is set on the virtual machine, the virtual machine is not allowed to consume more

physical memory than its configured limit. If a virtual machine is configured with 4GB of memory,

but the administrator sets the memory limit to 1 GB, the virtual machine can consume only 1 GB of physical memory; the other 3 GB of memory space is supplied by ballooning or the swap file. A 4 GB swap file is created for the virtual machine to ensure the availability of memory space for the virtual machine.
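Expressed as a quick calculation (a sketch of the example above; the actual split between ballooning, compression and swapping depends on the memory state of the host):

    # Sketch: only the limit is backed by machine memory; the rest of the
    # configured memory has to come from ballooning or the swap file.
    configured_gb = 4
    limit_gb = 1
    reservation_gb = 0                                  # no reservation set

    physical_gb = min(configured_gb, limit_gb)          # 1 GB of machine memory
    reclaimed_gb = configured_gb - physical_gb          # 3 GB via balloon/swap
    swap_file_gb = configured_gb - reservation_gb       # 4 GB swap file created

    print(physical_gb, reclaimed_gb, swap_file_gb)      # 1 3 4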


Figure 50: Virtual machine memory limit

The guest OS inside the virtual machine is unaware of the specified limit. As a result, setting a limit can impact the performance of the application inside the virtual machine, even if the virtual machine does not always try to consume memory above the limit threshold. When modern operating systems boot, one of the first things they do is check how much RAM is available and then tune their caching algorithms and memory management accordingly. Applications such as SQL Server, Oracle and JVMs do much the same thing. As stated before, the limit is not exposed to the operating system itself, and as such the application will suffer and so will the service provided to the user. The funny thing is that although the application might request everything it can, it might not even need it. In that case, which is more common than we think, it is better to decrease the provisioned memory than to apply a limit: the limit will impose an avoidable and unwanted performance impact in most cases, while lowering the configured memory most likely will not.


Basic design principle:

Configure a virtual machine with correct memory size instead of applying a memory limit.

Let us focus again on the resource pool limit setting. If a memory limit is set on the resource pool,

the limit applies to all the virtual machines inside the resource pool. The virtual machines cannot

consume more physical resources than the specified memory limit.

Figure 51: Resource pool memory limit

DRS divides the limit among the ESX hosts based on the number of active virtual machines inside the resource pool and the aggregated resource entitlement of those virtual machines. DRS calculates the maximum amount of allowed resources for each ESX host and pushes this resource allocation information to every ESX host inside the cluster.

The VMkernel memory scheduler only knows of the part of the DRS resource tree that is relevant to

its own local node. The memory scheduler will divide the amount of resources between the virtual

machines belonging to the same resource pool. The memory scheduler uses the same mechanism as

the CPU scheduler and assigns a limit based on the resource entitlement of the virtual machine.

Because the limit is set on the resource pool, the memory scheduler is free to allocate resources within the pool as required. This results in dynamically limiting the availability of physical resources, as the limit for a virtual machine is related to the resource utilization of its sibling virtual machines in the resource pool.


In the scenario where the other virtual machines are dormant, an active virtual machine can possibly allocate up to its configured memory, contrary to a per-VM limit, which is always enforced regardless of the resource utilization of other virtual machines. If it is necessary to set limits, a resource pool level limit is preferred over per-VM limits.

Figure 52: Dividing of resource pool limit

Basic design principle:

If a limit is necessary use resource pool limits instead of virtual machine limits if possible.

If the virtual machines want to consume more than the limit threshold, VMkernel resorts to

ballooning, compressing or swapping to provide the additional resources. It is possible to see

virtual machines balloon or swap while the ESX server does not experience any memory pressure.

The same applies when the virtual machine is using large memory pages: if the virtual machine reaches the limit of the local resource pool tree, the large pages (2MB) will be broken into small pages (4KB) to allow reclamation.

Because the limit obstructs the virtual machine from using physical resources above the specified

limit, the VMkernel will back the remaining memory request by ballooning, compression-cache or

the swap file.

Ballooning, compressing and swapping virtual machine memory have an impact on the ESX host and possibly on the SAN infrastructure. The VMkernel needs resources to communicate with and run the balloon driver, and it needs to store the memory pages inside a SAN-based swap file, consuming bandwidth and creating additional load on the storage processors. Some administrators might ignore the additional load created by swapping and ballooning, but if the virtual machine is sized properly to reflect its workload or SLA, these overhead situations will not occur.

Basic design principle:

We recommend using memory limits sparingly as they are invisible to the guest OS and can

cause swapping. As virtual machines cannot allocate more memory than their configured

memory, we recommend sizing the virtual machine correctly.

Expandable reservation and limits

As explained in the previous section, expandable reservation is used to allocate unreserved

memory for virtual machine reservations and virtual machine memory overhead reservations on

behalf of the virtual machines inside the resource pool. If the resource pool is unable to provide

enough unreserved resources, it will traverse the ancestor tree to allocate sufficient unreserved

resources.

However, when a limit is set at the resource pool level, the resource pool cannot allocate more

physical resources than defined by the limit setting. Although the expandable reservation setting

allows the resource pool to allocate additional unreserved resources, the limit parameter will

prohibit the resource pool from allocating more physical resources than the configured limit.

If the resource pool is configured with a memory reservation less than the limit, the expandable

reservation setting allows the resource pool to allocate unreserved memory resources from its

parent resource pools, up to the configured memory limit.


Chapter 17

Distributed Power Management

With ESX 3.5, VMware introduced Distributed Power Management (DPM). DPM provides power

savings by dynamically sizing the cluster capacity to match the virtual machine resource demand.

DPM will dynamically consolidate virtual machines onto fewer ESX hosts and power down excess

ESX hosts during periods of low resource utilization. If the resource demand increases ESX hosts

are powered back on and the virtual machines are redistributed among all available ESX hosts in

the ESX cluster.

The goal of DPM is to keep the cluster utilization within a specific DPM target range, but at the same

time take various cluster settings, virtual machines settings and requirements into account when

generating DPM recommendations. After DPM has determined the maximum number of hosts

needed to handle the resource demand of the virtual machines, it leverages the DRS algorithm to

distribute the virtual machines across the number of hosts before placing the target ESX hosts into

standby mode.

Enable DPM

DPM is disabled by default and can be enabled by selecting the power management mode Manual or Automatic. Because DPM uses DRS to migrate the virtual machines off the ESX hosts, DRS must be enabled before DPM can be enabled on the cluster.

Figure 53: DPM settings


DPM can be set to run in either manual or automated mode for the cluster. All hosts inside the

cluster will inherit the default cluster setting, but in addition a per-host setting can be set as well.

This setting overrides the cluster default. Per-host settings are only meaningful when DPM is enabled. A use case for overriding the default cluster setting is when VMware Fault Tolerance protected virtual machines are running inside the cluster; see the section “DRS, DPM and VMware Fault Tolerance” in chapter 18 for more info about the constraints Fault Tolerance introduces to DPM.

Each power management mode operates differently:

Power Management State and DPM behavior

Disabled – No power recommendations will be issued.

Manual – Power recommendations will be generated; each recommendation must be manually confirmed by the user.

Automatic – Power recommendations will be generated and executed automatically, no user intervention required.

The power management mode, manual or automatic, can differ from the DRS automation level, and the thresholds can vary from each other as well. The combination of both settings determines the role of the user and whether the recommendations generated by DRS and DPM are applied automatically.

Table 7: Effect of combining DPM and DRS


Basic design principle:

Configure DRS to automation level Automatic if DPM is set to automation level Automatic.

Templates

While DPM leverages DRS to migrate all active virtual machines on the host before powering down

the host, the registered templates are not moved. This means that templates registered on the ESX

host placed in standby mode will not be accessible as long as the host is in standby mode.

Basic design principle:

Register templates on a single host and disable DPM on this host.

DPM Threshold and the Recommendation Rankings

The DPM threshold slider works similarly to the DRS slider: move the slider to set the DPM threshold to be more conservative or more aggressive. DPM recommendation priority levels can be

compared to the DRS priority levels. Setting the DPM threshold at the most conservative level,

would generate only the most important (priority 1) recommendations. Setting the DPM threshold

at the most aggressive level would generate all recommendations.

Each level indicates the importance of the recommendation with regard to the current utilization of the ESX hosts in the cluster and the possible constraints on the current capacity. DPM uses different ranges for its recommendations: host power-on recommendations range from priority level 1 to priority level 3, while power-off recommendations range from priority level 2 to priority level 5. The highest power-off priority level (2) indicates a larger amount of underutilized powered-on capacity in the cluster. Recommendations with a higher priority level will result in more power savings if the recommendations are applied. For the range of power-on recommendations, a priority level 1 recommendation is generated when a VMware High Availability requirement must be met. Priority level 1 recommendations are also generated to meet the powered-on capacity requirements set by the user.

Power-on priority level 2 indicates a more urgent recommendation to solve higher host utilization

saturation levels than priority level 3.

Table 8: Recommendations priority level


Setting the DPM threshold to the most conservative level will result in DPM generating only priority

level 1 recommendations, according to the accompanying text below the threshold slider:

“Apply only priority 1 recommendations. vCenter will apply power-on recommendations produced to meet HA requirements or user-specified capacity requirements. DPM will only automatically apply the power-on recommendations…”

DPM will not generate power-off recommendations; this effectively means that the automatic DPM power-saving mode is disabled. The user is still able to place a server in standby mode manually, but DPM will only power on ESX hosts when the cluster fails to meet certain HA or user-specified capacity requirements or constraints.

Evaluating Resource Utilization

DPM generates power management recommendations based on the CPU and memory utilization of

the ESX host. DPM aims to keep the ESX host resource utilization within the target utilization range.

If the resource utilization is above the target utilization range, DPM evaluates host power-on

operations. When the ESX host resource utilization is below the target utilization range, DPM

evaluates power-off operations. DPM calculates the target utilization range as follows:

Target resource utilization range = DemandCapacityRatioTarget ± DemandCapacityRatioToleranceHost

The DemandCapacityRatioTarget is the utilization target of the ESX host; by default this is set at 63%. The DemandCapacityRatioToleranceHost specifies the tolerance around the utilization target for each host; by default this is set at 18%. This means that DPM will try to keep the ESX host resource utilization centered at the 63% sweet spot, plus or minus 18 percent, resulting in a range

between 45 and 81 percent. If the resource utilization of both CPU and memory resources of an ESX

host falls below 45%, DPM evaluates power-off operations. If the resource utilization exceeds 81 percent for either CPU or memory resources, DPM evaluates power-on operations for standby ESX hosts.
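The range calculation and the resulting decision can be sketched as follows, using the default values mentioned above:

    # Sketch: DPM target utilization range and the evaluation it triggers.
    def dpm_evaluation(utilization_pct, target=63, tolerance=18):
        low, high = target - tolerance, target + tolerance    # 45 and 81 by default
        if utilization_pct > high:
            return "evaluate power-on operations"
        if utilization_pct < low:
            return "evaluate power-off operations"
        return "within target utilization range"

    for utilization in (30, 63, 90):
        print(utilization, "->", dpm_evaluation(utilization))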


Figure 54: Power operations regarding to host utilization levels

The sweet spot of 63 percent is based on in-house testing and feedback from customers. Both the

DemandCapacityRatioTarget and DemandCapacityRatioToleranceHost values can be modified by the user: the DemandCapacityRatioTarget can be set within the range of 40 to 90%, and the allowed input range of the DemandCapacityRatioToleranceHost is between 10 and 40%. We recommend using the default values and only modifying them when you fully understand the impact.

Virtual Machine Demand and ESX Host Capacity Calculation


DPM calculates the resource utilization of the ESX host based on the virtual machine demand and

the ESX host capacity. The calculated host capacity equals the installed physical CPU and memory

resources minus the overhead created by the Service Console and the VMkernel.

DPM calculates the ESX host resource demand as the sum, over each active virtual machine, of its demand during a historical period of interest plus two standard deviations. The demand itself is a combination of the virtual machine's working set (active memory) and an estimation of unsatisfied demand during periods of contention. By using historical data over a longer period of time instead of the current active demand of the virtual machines, DPM ensures that the evaluated virtual machine demand is representative of the virtual machine's normal workload behavior. Using shorter periods of time or only current demand would include short-term spikes in resource demand; if DPM reacted to those, it would unnecessarily generate power-on and power-off recommendations. Not only would this negatively affect the power-saving efficiency, it would also impact the current resource utilization, as DRS would try to load-balance the active virtual machines across a constantly changing landscape of available hosts. Finding a proper balance between providing resources and resource

demand can be quite difficult as underestimating resource demand can result in lower performance

while overestimating resource demand can lead to less optimal power savings.

DPM uses two periods of interest when calculating the average demand. When evaluating virtual machine demand that can possibly lead to power-on operations, DPM uses a period of 300 seconds (5 minutes). DPM uses a longer period when evaluating resource demand that may lead to power-off operations: it evaluates the virtual machine workload of the past 2400 seconds (40 minutes).

By using a shorter period of time for evaluating power-on operations, DPM has the ability to respond to demand increases relatively quickly. A longer period is used to evaluate power-off

operations so that DPM will respond slowly to a decrease in workload demand. DPM must be

absolutely sure that it will not negatively impact virtual machine performance.

Providing adequate resources for workload demand is considered more important by DPM than

rapid response to decreasing workloads, so performance receives a higher priority by DPM than

saving power. This also becomes visible when reviewing the rules for power-on and power-off recommendations: a power-off recommendation is only applied when the ESX host is below the specified target utilization range AND there are no power-on recommendations active.

Evaluating Power-On and Power-Off Recommendations

If the resource utilization evaluation indicates low or high resource utilization, the next step taken by DPM is generating recommendations that reduce the distance of the current resource utilization to the target resource utilization range; in other words, aligning the power consumed with the workload demand.

Before selecting an ESX host for power-off, DPM iterates through the active hosts inside the cluster and places them in a specific order for the DPM power-off evaluation process. If the cluster contains hosts in both automatic mode and manual mode, they are placed in separate groups. Hosts inside the automatic mode group are considered before hosts inside the manual mode group. If the cluster contains homogeneously sized hosts, DPM considers hosts in order of lower virtual machine evacuation cost: hosts inside the automatic mode group with fewer or smaller virtual machines are considered before heavily loaded hosts in the same group. If the cluster contains heterogeneously sized hosts, DPM considers hosts in order of critical resource capacity: hosts with more of the critical resource (CPU or memory) are sorted before the other hosts in their group. For power-on recommendations, larger capacity hosts are favored first; for power-off recommendations, smaller capacity hosts are favored first.

Table 9: DPM preference

If the sort process discovers hosts that are equal with respect to capacity or evacuation cost, DPM will randomize their order, which provides a wear-leveling effect. Be aware that the sorting of the hosts for

power-on or power-off recommendations does not determine the actual order for the selection

process to power-on or power-off hosts. In addition, it might be possible that DPM will not strictly

adhere to its host sort order if doing so would lead to choosing a host with excessively larger

capacity than needed, if a smaller capacity host that can adequately handle the demand is also

available. But under normal circumstances DPM generates the power-off recommendation based on

the resource LowScore and HighScore.

Resource LowScore and HighScore

To measure the amount of resource utilization below the target resource utilization range, DPM calculates values for the CPU and memory resources called cpuLowScore and memLowScore. To measure the amount of resource utilization above the target resource utilization range, DPM computes the resource HighScore values, called cpuHighScore and memHighScore. The formula used for each resource is similar and calculates the weighted distance below or above the target utilization. For example, memLowScore is calculated as follows:

memLowScore = sum, across all hosts below the target utilization, of (target utilization – host utilization)

DPM is aware of which resource is more critical and takes this into account in the evaluation. If the hosts are overcommitted on memory, DPM determines that memory is the critical resource and will prioritize memory over CPU recommendations.
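A minimal sketch of the scoring, assuming per-host utilization percentages and using the boundaries of the 45–81% target range from the previous section as the reference points:

    # Sketch: weighted distance of host utilization below/above the target range.
    def low_and_high_score(host_utilization_pct, low=45, high=81):
        low_score = sum(low - u for u in host_utilization_pct if u < low)
        high_score = sum(u - high for u in host_utilization_pct if u > high)
        return low_score, high_score

    # Hypothetical cluster: two lightly loaded hosts and one saturated host.
    print(low_and_high_score([20, 35, 90]))   # (35, 9)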


Host Power-On Recommendations

If the resource utilization evaluation indicates a host with high utilization inside the cluster, DPM

considers generating host power-on recommendations. DPM begins by iterating through the standby hosts in the sort order described in the previous section. DPM then evaluates each standby host and invokes DRS to run simulations. These simulate the distribution of virtual machines across all the hosts inside the cluster, including migrations to the hosts that are currently placed in standby mode. By using the HighScore calculation, DPM determines the impact that a power-up operation has on the current utilization ratio: it needs to determine how much the power-up operation improves the distance of the resource utilization from the target utilization, or how much it reduces the number of highly utilized hosts. DPM compares the HighScore value of the cluster in its current state (standby host still down) to the HighScore value of the simulations. If a simulation offers an improved HighScore value when a standby host is powered on, DPM will generate a power-on recommendation for that specific host. In some cases constraints will limit the host selection, such as the inability to migrate virtual machines to the candidate host if it were powered on, or the expectation that the virtual machines that would move to a candidate host will not reduce the load on the highly utilized hosts in the cluster.

DPM continues to run simulations as long as there are hosts in the cluster exceeding the target

utilization range. DPM is very efficient in homogeneously sized clusters, as it will skip every host that is identical regarding physical resources or vMotion compatibility to any host already rejected for a power-on operation during the simulations.

Basic design principle:

We recommend using homogeneous clusters as DPM will operate more efficiently.

If the user sets custom values for the advanced settings MinPoweredOnCpuCapacity and MinPoweredOnMemCapacity, DPM needs to adjust its power-on operation recommendations to fulfill the requirements defined in these settings. By default these settings have values of 1 MHz and 1 MB respectively, which ensures that at least one host in the cluster is kept powered on. If these settings are altered, it might happen that DPM and DRS do not actually need the physical resources that must be kept powered on to run the virtual machines at a proper level; the ESX hosts may sit idle, leading to a less efficient power ratio.

Contrary to a power-off recommendation, a power-on recommendation does not include the redistribution of virtual machines among the powered-on hosts; DPM relies on future DRS invocations for that.

Host Power-Off Recommendations

DPM uses a similar approach for power-off recommendations as for power-on recommendations. If the resource utilization evaluation indicates low utilization, DPM considers host power-off recommendations. DPM iterates through the active hosts in the sort order described in the previous

section.

DPM evaluates the candidate hosts and uses DRS to run simulations in which the candidate hosts are powered off in the cluster. These simulations are used by DPM to determine how much impact the power-off operations have on reducing the number of lightly loaded hosts, reducing the distance of the lightly utilized hosts to the target resource utilization, and minimizing the increase of resource utilization on the remaining hosts.

DPM compares the LowScore value of the cluster with all candidate hosts active to the LowScore value of the simulations. If a simulation offers an improvement of the LowScore and the HighScore value does not increase, DPM generates a power-off recommendation. This power-off

recommendation also contains virtual machine migration recommendations for the virtual

machines running on this particular host. DRS will not indicate these virtual machine migration

recommendations as priority level 1 migrations. If DRS is set to the conservative migration

threshold level, then DRS will only generate priority level 1 migration recommendations. These

priority level 1 migrations are mandatory moves and only address constraint violations, such as

anti-affinity rules or an ESX host entering maintenance mode. This threshold does not generate

non-mandatory recommendations to rebalance the workload of the virtual machines across the ESX

hosts in the cluster; therefore setting the migration threshold of DRS to generate priority level 1

recommendations will effectively disable DPM.

Basic design principle:

When DPM is activated, make sure DRS is not set to the “conservative” threshold level.

DPM will not power down a host if it violates the minimum powered-on capacity specified by the

settings MinPoweredOnCpuCapacity and MinPoweredOnMemCapacity. Another reason for DPM not to select a specific candidate host can be based on DRS constraints or objectives. For example, a host might be rejected for power-off if the virtual machines that need to be migrated can only be moved to hosts that would become too heavily utilized. This situation can occur when multiple DRS (anti-)affinity rules are active in the cluster. A third factor is that DPM does not select a candidate host for power-down if the power-off cost/benefit analysis run by DPM indicates a negative or non-existent benefit.

Similar to power-on recommendations, DPM continues to run simulations as long as the cluster

contains ESX hosts below the target utilization range. (Considering both resources in case of a

power-off.)

DPM Power-Off Cost/Benefit Analysis

Before DPM generates a power-off recommendation, DPM calculates the cost associated with

powering down a host. DPM takes the following costs into account:

Migrating virtual machines off the candidate host

The power consumed during the power-down period

Unavailable resources of candidate host during power-down


Loss of performance if candidate host resources are needed to meet workload demand

while candidate host is powered off.

Unavailability of candidate host resources during power-up period

The power consumed during the power-up period

Cost of migrating virtual machines to the candidate host

DPM runs the power-off cost/benefit analysis which compares the costs and risk associated with a

power-off operation to the benefit of powering off the host. DPM will only accept a host power-off

recommendation if the benefits meet or exceed the performance impact multiplied by the

PowerPerformanceRatio setting. The default value is 40 but can be modified to a value in the range of 0 to 500. As always, do not change this setting unless the impact of modifying it is fully understood. Both the cost and the benefit calculations include both CPU and memory resources.

The power-off benefits and power-off cost are calculated as follows. The power-off benefit analysis

calculates the StableOffTime value, which indicates the amount of time the candidate host is

expected to be powered-off until the cluster needs its resources because of an anticipated increase

in virtual machine workload. The time that the virtual machine workload is stable and no power-up

operations are required is called the ClusterStableTime. DPM will use the virtual machine

stabletime, calculated by DRS cost-benefit-risk analysis, as input for the ClusterStableTime

calculation.

The time it takes from applying the power-off recommendation to the power-off state is taken into

account as well. The analysis breaks this time down into two sections and calculates this as the sum

of the time it takes migrating all active virtual machines off the host (HostEvacuationTime) and the

time it takes to power off the host (HostPowerOffTime). These values are combined in the sum:

StableOffTime = ClusterStableTime – (HostEvacuationTime + HostPowerOffTime)
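With hypothetical durations in minutes, the benefit side of the analysis looks like this:

    # Sketch: StableOffTime decides whether a candidate power-off is worthwhile.
    cluster_stable_time = 120     # minutes the workload is expected to stay stable
    host_evacuation_time = 10     # time to migrate all active VMs off the host
    host_power_off_time = 2       # time to actually power the host down

    stable_off_time = cluster_stable_time - (host_evacuation_time + host_power_off_time)

    if stable_off_time <= 0:
        print("No benefit: stop evaluating this candidate for power-off")
    else:
        print("Candidate host expected to stay powered off for", stable_off_time, "minutes")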

The power-off cost is calculated as the summation of the following estimated resource costs:

Migration of the active virtual machines running on the candidate host to other ESX hosts

Unsatisfied virtual machine resource demand during power-on candidate host at the end of

the ClusterStableTime

Migration of virtual machines back onto the candidate host

The last two bullet points can only be estimated by DPM; DPM calculates the required hosts which

need to be available at the end of the ClusterStableTime. This calculation is somewhat of a worst-

case scenario as DPM expects all the virtual machines to generate heavy workloads at the end of the

ClusterStableTime, thereby generating a conservative value.

As previously mentioned, DPM will only recommend a power-off operation if its benefit equals or exceeds the performance impact. If the ClusterStableTime is low, this can result in a StableOffTime equal to or even less than zero. In that scenario, DPM stops evaluating the candidate host for a power-off recommendation because it will not offer any benefit.


Chapter 18

Integration with DRS and High Availability

Distributed Resource Scheduler

DPM tries to match the availability of resources in the cluster to the virtual machine workload and

resource demand. During the recommendation generation process, DRS is executed in what-if mode to ensure that the power operation recommendations do not violate the DRS constraints and objectives. In addition, DRS has the ability to bring ESX hosts out of standby mode to acquire the necessary resources to match the resource demand created by an unexpected increase in virtual machine workloads. DRS does not distinguish between ESX hosts placed in standby mode by DPM and hosts placed in standby mode manually by the administrator; DRS might undo a manually placed standby mode the next time DRS runs.

High Availability

If HA strict Admission Control is enabled (default), DPM will maintain the necessary level of

powered-on capacity to meet the configured HA failover capacity. HA places a constraint to prevent

DPM from powering down too many ESX hosts if it would violate the Admission Control Policy.

Contrary to disconnected hosts or hosts in maintenance mode, HA will consider the unreserved resources provided by a host in standby mode for Admission Control, and the ESX host can be brought out of standby mode if its resources are required.

If HA strict Admission Control is disabled, the failover constraints are not passed on to DPM.

Because no constraint to keep enough resources available is enforced, DPM will generate power-off recommendations and place ESX hosts in standby mode regardless of the impact on the HA failover requirements. However, starting with vCenter 4.1, if a failure happens and HA cannot

restart some virtual machines due to insufficient powered-on hosts, HA will ask DRS/DPM to

power-on hosts to accommodate the restart of those virtual machines.

vSphere 4.0 HA clusters have a soft limit of 40 virtual machines per host if the cluster contains nine or more hosts. If the cluster contains a maximum of eight hosts, the allowed number of virtual machines per host is far higher than 40. DPM does not consider this soft limit and can create a scenario where the remaining hosts end up with more than 40 virtual machines each. Fortunately, in vSphere 4.1 this soft limit on the number of virtual machines for clusters with nine or more ESX hosts has been removed.

DPM awareness of High Availability Primary Nodes

DPM does not take the current HA node roles into account when selecting one or multiple hosts for power-down recommendations, simply because DPM is not aware of the different types of HA nodes. To avoid DPM powering down all HA primary nodes, DPM explicitly disables HA on the host before placing it into standby mode. By disabling the HA agent on the host, HA triggers a new primary node election, resulting in a recalculation of primary nodes for each former primary that is put into standby mode.

DPM Standby Mode

The term “Standby mode” used by DPM specifies a powered down ESX host. The term is used to

indicate that the ESX host is available to be powered on should the cluster require its resources.

DPM requires the host to be able to wake from an ACPI S5 state via Wake-On-LAN (WOL) packets or one of the two out-of-band methods: Intelligent Platform Management Interface (IPMI) version 1.5 (or higher) or HP Integrated Lights-Out (iLO) technology. Both IPMI and iLO require the availability of a Baseboard Management Controller (BMC), providing access to hardware control functions and allowing the server hardware to be accessed from the vCenter server using a LAN connection. To use WOL, the ESX host must contain a Network Interface Card that supports the WOL protocol. If the host does not offer hardware support for, and configuration of, any of these protocols it cannot be placed into standby mode by DPM.

DPM WOL Magic Packet

If the ESX host is not an HP server, does not support IPMI version 1.5, or the appropriate credentials for using iLO or IPMI have not been configured and set up in vCenter, DPM uses Wake-On-LAN packets to bring the ESX host out of standby mode. The magic packet, the network packet that brings the server back to life, is sent over the vMotion network by another, currently powered-on ESX server in the cluster. For this reason, DPM keeps at least one host powered on in the cluster at all times, managed by the DPM advanced controls MinPoweredOnCpuCapacity and MinPoweredOnMemCapacity, both configured with their respective default values of 1 MHz and 1 MB.

Because the magic packet is sent across the vMotion network to a powered-off server, DPM impacts the configuration of the vMotion network as well. Because most NICs support WOL only if they can switch to 100 Mb/s, the switch port used by the vMotion NIC must be set to auto-negotiate link speed instead of a fixed speed such as 1000 Mb/s Full. Industry best practices advise setting both NIC and switch port to identical settings, so ensure that the vmnic speed is set to auto-negotiate as well.

Baseboard Management Controller

If both IPMI and WOL are present and operational, DPM will attempt to use IPMI by default. To ensure IPMI is operational, configure the BMC LAN channel to always be available; some BMC LAN channels also need to allow operator-privileged commands. A number of BMC boards require IPMI accounts to be set in the BIOS.

DPM uses MD5- or plaintext-based authentication with IPMI. Only if the BMC reports that it supports MD5 and has the operator role enabled will vCenter use MD5 authentication; vCenter switches to plaintext authentication if only one or neither of these requirements is met. If neither MD5 nor plaintext authentication is enabled or supported, vCenter will not use IPMI and attempts to use Wake-On-LAN.

Protocol Selection Order

If the server is configured for IPMI or iLO, DPM will try to use the protocols in the order IPMI, iLO, WOL. If vCenter is unable to power on the ESX host with IPMI, it will try the second protocol, iLO; if this attempt fails too, DPM will fall back to Wake-On-LAN and instruct a powered-on ESX host to send the magic packet.
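The fallback order can be sketched as below; the three protocol functions are hypothetical stand-ins for vCenter's actual wake-up mechanisms and simply return True on success.

    # Sketch: wake-up protocol fallback order IPMI -> iLO -> WOL magic packet.
    def power_on_standby_host(host, try_ipmi, try_ilo, send_wol_magic_packet):
        for attempt in (try_ipmi, try_ilo, send_wol_magic_packet):
            if attempt(host):
                return True
        return False    # host cannot be brought out of standby mode

    # Example with stubbed protocols where only WOL succeeds:
    print(power_on_standby_host("esx01",
                                try_ipmi=lambda host: False,
                                try_ilo=lambda host: False,
                                send_wol_magic_packet=lambda host: True))   # True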

Placing the ESX host in standby mode does not use any power management protocol; vCenter initiates a graceful shutdown of the ESX host.

DPM and Host Failure Worst Case Scenario

As described in the previous section, DPM relies on vCenter to power on a standby ESX host when using IPMI or iLO, and on another ESX server in the cluster to send the WOL magic packet over the vMotion network. Both technologies depend on another system, either vCenter or a powered-on ESX host. Now imagine the following scenario: HA strict admission control is disabled, DPM is enabled in a four-host cluster, and due to the minimal amount of required resources DPM powers down three ESX hosts. The single active ESX server hosts the vCenter server virtual machine.

Due to unlucky circumstances this ESX host fails, thereby failing the vCenter server virtual machine as well and leaving three ESX hosts in standby mode and one ESX host down. Because both IPMI and iLO depend on vCenter, no ESX host can be brought out of standby mode using these methods. And because no ESX host is operational, no magic packet can be sent to the standby hosts using Wake-On-LAN.


As the section title suggests, this is the absolute worst-case scenario that can happen to a DPM cluster, but it must be considered. Multiple solutions can be used to avoid this situation. Minor impact solutions include running vCenter in a non-DPM cluster or disabling DPM on an additional ESX host. A major impact solution, such as running the vCenter server on a physical server, can also be implemented, but this is something we would not easily recommend. Another valuable option is to configure HA with Admission Control enabled; Admission Control ensures that enough resources are kept available and will not allow DPM to power down ESX hosts if doing so would violate the failover requirements.

DRS, DPM and VMware Fault Tolerance

vSphere 4.1 includes DRS-Fault Tolerance integration. vSphere 4.1 allows DRS not only to perform

initial placement of the Fault Tolerance (FT) virtual machines, but also to migrate the primary and secondary virtual machines during DRS load balancing operations.

In vSphere 4.0 DRS is disabled on the FT primary and secondary virtual machines. When FT is

enabled on a virtual machine in 4.0, the existing virtual machine becomes the primary virtual machine and is powered on on its registered host, while the newly spawned virtual machine, called the secondary virtual machine, is automatically placed on another host. DRS refrains from generating load balancing recommendations for both virtual machines.

Basic design principle:

In vSphere 4.0 clusters, disable DPM on two ESX hosts which can act as hosts for Fault Tolerance enabled virtual machines. However, be aware that there is a limit of 4 FT-enabled virtual machines per host, so disable DPM on enough hosts.

The new DRS integration removes both the initial placement and the load-balancing limitation. DRS is able to select the most suitable host for initial placement and generate migration recommendations for the FT virtual machines based on the current workload inside the cluster. This results in a better load-balanced cluster, which likely has a positive effect on the performance of

the FT virtual machines. In vSphere 4.0, an anti-affinity rule prohibited the FT primary and secondary virtual machines from running on the same ESX host; vSphere 4.1 additionally offers the possibility to create a VM-Host affinity rule ensuring that the FT primary and secondary virtual machines do not run on ESX hosts in the same blade chassis, if the design requires this.

Not only does the DRS-FT integration have a positive impact on the performance of the FT-enabled virtual machines and arguably all other VMs in the cluster, it also reduces the impact of FT-enabled virtual machines on the virtual infrastructure. For example, DPM is now able to move the FT virtual

machine to other hosts if DPM decides to place the current ESX host in standby mode. In vSphere

4.0, DPM needs to be disabled on at least two ESX hosts because of the DRS disable limitation.


Because DRS is able to migrate the FT-enabled virtual machines, DRS can evacuate all the virtual

machines automatically if the ESX host is placed into maintenance mode. The administrator does not need to manually select an appropriate ESX host and migrate the virtual machines to it; DRS will automatically select a suitable host to run the FT-enabled virtual machines. This reduces the need for manual operations and for “exciting” operational procedures on how to deal with FT-enabled virtual machines during maintenance windows.

DRS-FT integration requires EVC to be enabled on the cluster. Many companies do not enable EVC on their ESX clusters, based either on FUD (Fear, Uncertainty and Doubt) about performance loss or on the argument that they do not intend to expand their clusters with new types of hardware and thus keep their clusters homogeneous. The advantages and improvements DRS-FT integration offers in both performance and reduced complexity of cluster design and operational procedures shed some new light on the discussion whether to enable EVC in a homogeneous cluster. If EVC is not enabled, vCenter reverts back to the vSphere 4.0 behavior and enables the DRS disable setting on the FT virtual machines.

Basic design principle:

Enable EVC on the vSphere 4.1 clusters to allow DRS to select an appropriate host for

placement and allow DRS to load balance the FT-enabled virtual machines.

DPM Scheduled Tasks

vSphere 4.1 offers the option to enable and disable DPM via scheduled tasks. The DPM “Change

cluster power settings” scheduled task allows the administrator to enable or disable DPM via an automated task. If the administrator selects the option DPM off, vCenter will disable all DPM features on

the selected cluster and all hosts in standby mode will be powered on automatically when the

scheduled task runs.

This option removes one of the biggest obstacles of implementing DPM. One of the main concerns

administrators have is the (periodic) latency incurred when DPM is enabled. If DPM places an ESX host in standby mode, it can take up to five minutes before DPM decides to power up the ESX host again. During this (short) period of time, the environment experiences latency or performance loss; usually this latency occurs in the morning.

It is common for DPM to place ESX hosts in standby mode during the night due to the decreased workload. When the employees arrive in the morning, the workload increases and DPM needs to power on additional ESX hosts. The period between 7:30 and 10:00 is recognized as one of the busiest periods of the day, and during that period the IT department wants its computing power lock, stock and ready to go.

This scheduled task gives administrators the ability to disable DPM before the employees arrive. Because the ESX hosts remain powered on until the administrator or a DPM scheduled task enables DPM again, another schedule can be created to enable DPM after the period of high workload demand ends.

For example, by scheduling a DPM disable task every weekday at 7:00, the administrator ensures that all ESX hosts are powered on before the morning peak, rather than having to wait for DPM to react to the workload increase.

By scheduling the DPM disable task more than one hour in advance of the morning peak, DRS has time to rebalance the virtual machines across all active hosts inside the cluster and the Transparent Page Sharing process can collapse the memory pages shared by the virtual machines on the ESX hosts. By powering up all ESX hosts early, the ESX cluster will be ready to accommodate the load increase.
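
The scheduled task itself is created through the vSphere Client. For illustration only, the sketch below shows the underlying cluster reconfiguration that flips the DPM enabled flag, again using the pyVmomi Python bindings; driven from an external scheduler it would approximate the behavior described above (the built-in task additionally powers on any hosts that are in standby mode). The connection handling and cluster lookup follow the earlier sketch, and all names remain hypothetical.

# Minimal illustrative sketch: disable (or re-enable) DPM on a cluster by
# reconfiguring it. Assumes 'cluster' was looked up as in the earlier sketch.
from pyVmomi import vim

def set_dpm(cluster, enabled):
    """Flip the cluster-level DPM flag via a cluster reconfiguration."""
    spec = vim.cluster.ConfigSpecEx(
        dpmConfig=vim.cluster.DpmConfigInfo(enabled=enabled))
    return cluster.ReconfigureComputeResource_Task(spec, modify=True)

set_dpm(cluster, False)   # e.g. run at 07:00, before the morning peak
# set_dpm(cluster, True)  # e.g. run again after the peak period has ended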

Chapter 19

Summarizing

Improvements were made to DRS in vSphere 4.1. Better integration and more efficient algorithms allow DRS to reach a steady state more quickly when there is significant load imbalance in the cluster.

We have tried to simplify some of the concepts to make them easier to understand; still, we acknowledge that some concepts are difficult to grasp. We hope, though, that after reading this section of the book everyone is confident enough to create and configure DRS clusters to achieve higher consolidation ratios at low cost.

If there are any questions please do not hesitate to reach out to either of the authors.

Appendix

Appendix A – Basic Design Principles

VMware High Availability

Avoid using static host files as it leads to inconsistency, which makes troubleshooting difficult.

In blade environments, divide hosts over all blade chassis and never exceed four hosts per

chassis to avoid having all primary nodes in a single chassis.

For network-based storage (iSCSI, NFS, FCoE) it is recommended (pre-vSphere 4.0 Update

2) to set the isolation response to "Shut Down" or “Power off”. It is also recommended to

have a secondary Service Console (ESX) or Management Network (ESXi) running on the

same vSwitch as the storage network to detect a storage outage and avoid false positives for

isolation detection.

Keep das.failuredetectiontime low for fast responses to failures.

If an additional isolation address has been added via “das.isolationaddress”, add 5000 to the default “das.failuredetectiontime” (15000).

Be really careful with reservations: if there is no need to have them on a per-virtual machine basis, don’t configure them, especially when using Host Failures Cluster Tolerates. If reservations are needed, resort to resource pool-based reservations.

Avoid using advanced settings to decrease the slot size, as it could lead to more downtime and adds an extra layer of complexity. If there is a large discrepancy in size and reservations are set, it might help to put similarly sized virtual machines into their own cluster.

When using Admission Control, balance your clusters and be conservative with reservations, as they lead to decreased consolidation ratios.

Although vSphere 4.1 will utilize DRS to try to accommodate the resource requirements of this virtual machine, a guarantee cannot be given. Do the math: verify that any single host has enough resources to power on your largest virtual machine. Also take restart priority into account for this/these virtual machine(s).

Admission Control guarantees enough capacity is available for virtual machine failover. As

such we recommend enabling it.

Do the math, and take customer requirements into account. We recommend using a “Percentage” based Admission Control Policy as it is the most flexible policy (a configuration sketch follows this list).

VM Monitoring can substantially increase availability. It is part of the HA stack and we

heavily recommend using it!
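
As referenced above, the following sketch shows how the percentage-based Admission Control Policy could be configured programmatically with the pyVmomi Python bindings. It is a minimal illustration, not a recommendation: the 25% values are placeholders, and the 'cluster' object is assumed to have been looked up as in the earlier sketches.

# Minimal illustrative sketch: enable HA admission control with the
# percentage-based policy. The 25% values are placeholders; 'cluster' is
# assumed to have been looked up as in the earlier sketches.
from pyVmomi import vim

spec = vim.cluster.ConfigSpecEx(
    dasConfig=vim.cluster.DasConfigInfo(
        enabled=True,
        admissionControlEnabled=True,
        admissionControlPolicy=vim.cluster.FailoverResourcesAdmissionControlPolicy(
            cpuFailoverResourcesPercent=25,
            memoryFailoverResourcesPercent=25)))
cluster.ReconfigureComputeResource_Task(spec, modify=True)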

VMware Distributed Resource Scheduler

Configure vMotion to fully benefit from DRS capabilities.

Set automation level to Automatic to fully benefit from DRS capabilities.

The number of clusters and virtual machines managed by vCenter influences the number of calculations, which impacts the performance of vCenter. Take this into account when sizing the vCenter server.

Although DRS migrates virtual machines to gain the most improvement in cluster balance, this does not justify the use of big virtual machines. Virtual machines with smaller memory sizes or fewer virtual CPUs provide more placement opportunities for DRS, while virtual machines with larger memory sizes and/or more virtual CPUs add more constraints to the selection and migration process. It is therefore recommended to configure the size of the VM to what the VM actually needs, preventing oversizing.

Select a moderate migration threshold if the cluster hosts virtual machines with varying workloads (a configuration sketch follows this list).

Use VM-Host and VM-VM affinity rules sparingly, as rules can have an impact on the effectiveness of the load-balancing calculation. The DRS algorithm has less choice when rules are configured. Mandatory affinity rules apply even when DRS is disabled.

Leave the virtual machine automation mode set to Default to minimize administration overhead and the possibility of human error. Set the automation level to Manual instead of Disabled if more control over placement is required. Set a per-VM memory reservation only if the virtual machine absolutely requires guaranteed memory.

Leave some CPU capacity in a cluster as unreserved. This helps ensure that vMotion tasks

get the resources needed and are completed as quickly as possible.

Enable expandable reservation on the resource pool if VM-level reservations will be defined and the memory reservation set on the resource pool is possibly inadequate.

Configure a virtual machine with a correct memory size instead of applying a memory limit. If a limit is necessary, use resource pool limits instead of virtual machine limits if possible. Because memory limits are invisible to the guest OS and can cause swapping, and because virtual machines cannot allocate more memory than their configured memory anyway, use limits sparingly.

Configure DRS to automation level Automatic if DPM is set to automation level Automatic.

Register templates on a single host and disable DPM on this host.

DPM operates more efficiently in homogeneous clusters.

When DPM is activated make sure DRS is not set to the “conservative” threshold level.

In vSphere 4.0 clusters, disable DPM on two ESX hosts which can act as hosts for Fault Tolerant enabled virtual machines. However, be aware that there is a limit of four FT-enabled virtual machines per host, so disable DPM on enough hosts.

Enable EVC on the vSphere 4.1 clusters to allow DRS to select an appropriate host for

placement and allow DRS to load balance the FT-enabled virtual machines.
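
The configuration sketch referenced in the list above sets the cluster-wide DRS automation level and migration threshold with the pyVmomi Python bindings. It is illustrative only; the values simply reflect the fully automated level and the default (middle) threshold discussed in this list, and the 'cluster' object is assumed to have been looked up as in the earlier sketches.

# Minimal illustrative sketch: set DRS to fully automated with the default
# (middle) migration threshold. 'cluster' is assumed to have been looked up
# as in the earlier sketches.
from pyVmomi import vim

spec = vim.cluster.ConfigSpecEx(
    drsConfig=vim.cluster.DrsConfigInfo(
        enabled=True,
        defaultVmBehavior="fullyAutomated",  # cluster-wide automation level
        vmotionRate=3))                      # 3 = default, middle of the threshold slider
cluster.ReconfigureComputeResource_Task(spec, modify=True)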

Appendix B – HA Advanced Settings

HA is probably the feature with the most advanced settings. Although many of them are rarely used, some of them are needed in specific situations or are included in best practices documents. The most used and most valuable advanced settings are described below; a short configuration sketch follows the list:

das.failuredetectiontime - Number of milliseconds (timeout time) for the isolation response action, with a default of 15000 milliseconds. Pre-vSphere 4.0 it was a general best practice to increase the value to 60000 when an active/standby Service Console setup was used. This, however, is no longer needed. For a host with two Service Consoles or a secondary isolation address it still is a best practice to increase the value to at least 20000. The impact of this, however, is that the failover response will be delayed.

das.isolationaddress[x] - IP address the ESX host uses to check for isolation when no heartbeats are received, where [x] = 1-10. HA will use the default gateway as an isolation address and the provided values as an additional check. We recommend adding an isolation address when a secondary Service Console is being used for redundancy.

das.usedefaultisolationaddress - Value can be true or false and needs to be set to false in case the default gateway, which is the default isolation address, should not or cannot be used for this purpose. In other words, if the default gateway is a non-pingable address, set “das.isolationaddress” to a pingable address and disable the use of the default gateway by setting this to “false”. As of vSphere 4.0, HA will verify during configuration whether the isolation addresses are pingable by the host and will raise a configuration issue if this is not the case.

das.allowNetwork[x] - Enables the use of port group names to control the networks used for HA, where [x] is a number between 0 and 10. You can set the value to be “Service Console 2” or “Management Network” to use (only) the networks associated with those port group names in the networking configuration. These networks need to be compatible for HA to configure successfully. Please note that the number [x] has no relationship with the network; it only gives you the option to specify multiple networks.

das.bypassNetCompatCheck - Disables the “compatible network” check for HA that was introduced with ESX 3.5 Update 2. Disabling this check will enable HA to be configured in a cluster which contains hosts in different subnets, also called incompatible networks. The default value is “false”; setting it to “true” disables the check.

das.ignoreRedundantNetWarning - Removes the error icon/message from vCenter when you don’t have a redundant Service Console connection. The default value is “false”; setting it to “true” will disable the warning. HA must be reconfigured to make the configuration issue go away.

das.vmMemoryMinMB - The minimum default slot size used for calculating failover

capacity. Higher values will reserve more space for failovers. Do not confuse with

das.slotMemInMB.

das.vmCpuMinMHz - The minimum default slot size used for calculating failover capacity.

Higher values will reserve more space for failovers. Do not confuse with das.slotCpuInMHz.

das.slotMemInMB - Sets the slot size for memory to the specified value. This advanced setting can be used when a virtual machine with a large memory reservation skews the slot size, as this will typically result in a conservative number of available slots.

das.slotCpuInMHz - Sets the slot size for CPU to the specified value. This advanced setting can be used when a virtual machine with a large CPU reservation skews the slot size, as this will typically result in a conservative number of available slots.

das.sensorPollingFreq - Set the time interval for status updates. As of vSphere 4.1 the

default value of this setting is 10. It can be configured between 1 and 30. It is not

recommended to decrease this value as it might lead to less scalability due to the overhead

of the status updates.

das.perHostConcurrentFailoversLimit - By default HA can restart up to 32 VMs

concurrently per host. This setting controls the maximum amount of concurrent restarts on

a single host. Setting a larger value will allow more VMs to be restarted concurrently and

might reduce the overall VM recovery time, but the average latency to recover individual

VMs might increase.

das.failureInterval (VM Monitoring) - The polling interval for failures. Default value is 30

seconds.

das.minUptime (VM Monitoring) - The minimum uptime in seconds before VM

Monitoring starts polling. The default value is 120 seconds.

das.maxFailures (VM Monitoring) - Maximum number of virtual machine failures within the specified “das.maxFailureWindow”; if this number is reached, VM Monitoring doesn’t restart the machine automatically. The default value is 3.

das.maxFailureWindow (VM Monitoring) - The time window, in seconds, within which failures are counted. The default value is 3600 seconds; if a virtual machine fails more than das.maxFailures times within 3600 seconds, VM Monitoring doesn’t restart the machine.

das.vmFailoverEnabled (VM Monitoring) - If set to true VM Monitoring is enabled. When

it is set to false VM Monitoring is disabled.
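
The configuration sketch mentioned at the start of this appendix is shown below. It pushes a few of the advanced settings described above as cluster-level option/value pairs using the pyVmomi Python bindings; the specific values are examples taken from the descriptions in this list, and the 'cluster' object is assumed to have been looked up as in the earlier sketches.

# Minimal illustrative sketch: push a few HA advanced settings as cluster-level
# option/value pairs. The values are examples from the list above; 'cluster' is
# assumed to have been looked up as in the earlier sketches.
from pyVmomi import vim

options = [
    vim.option.OptionValue(key="das.isolationaddress1", value="192.168.1.1"),     # additional isolation address
    vim.option.OptionValue(key="das.failuredetectiontime", value="20000"),        # 15000 + 5000 ms
    vim.option.OptionValue(key="das.usedefaultisolationaddress", value="false"),  # skip the default gateway
]
spec = vim.cluster.ConfigSpecEx(dasConfig=vim.cluster.DasConfigInfo(option=options))
cluster.ReconfigureComputeResource_Task(spec, modify=True)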

Soon in a book store near you....

VMware vSphere ...

Clustering

technical deepdive
