
Troubleshooting VMware System Problems II

Session: PAC057-B

David Webber, VMware Education

This presentation may contain VMware confidential information.

Copyright © 2005 VMware, Inc. All rights reserved. All other marks and names mentioned herein may be trademarks of their respective companies.

Learning Objectives

• Know how to work with performance problems

• Prevent and diagnose network problems

• Avoid and troubleshoot SAN and storage problems

• Learn what to do about system reliability problems

Performance Problems: Before You Begin

• Create a clear statement of the problem:
  • Problems that can't be measured can't be repaired
  • "Poor performance" and "poor speeds" are relative: compared to previous results, or to expectations?

• Gather performance and utilization data
  • Capture an empirical measurement of a quantitative value
  • Compare it to a normal baseline "benchmark" taken before the problem appeared
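The slides do not prescribe a tool here; as a hedged sketch, a baseline snapshot can be recorded from the Service Console (bash). Paths and invocations are illustrative: /proc/vmware/scsi is cited later in this deck, and esxtop ships with ESX, but exact output locations vary by release.

    # Hedged baseline-capture sketch (ESX 2.x Service Console, bash)
    DIR=/root/baseline-$(date +%Y%m%d)
    mkdir -p $DIR
    uptime > $DIR/uptime.txt                        # Service Console load averages
    cp -r /proc/vmware/scsi $DIR/scsi 2>/dev/null   # VMkernel SCSI counters (path cited in this deck)
    # Record interactive esxtop readings (CPU, memory, disk, network) in
    # notes stored alongside the snapshot; repeat after any change and compare.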

Troubleshooting Performance Problems (flowchart)

1. Clarify and quantify the initial complaint.
2. Perception or reality? If perception: supply education; done.
3. If reality: measure performance and rule out common errors.
4. Identify the resource that is the bottleneck.
5. Relieve the bottleneck: allocate more resources, or decrease competition.
6. Measure performance again. Satisfactory performance? If yes, done.
7. If not, is another bottleneck present? If yes, return to step 4. If no, consider physical hardware for this application.

Performance: Perception vs. Reality

• Common perceived performance problems:
  • The display is sluggish: a virtual machine without VMware Tools installed appears sluggish in Remote Console because it lacks proper graphics drivers
  • The mouse is slow: hardware mouse acceleration is turned off by default in Windows Server 2003, resulting in poor mouse responsiveness
  • Virtual machines are perceived as generally slow: bad network connectivity between the client and the Service Console

• Remote Console performance indicates nothing about the actual performance of the virtual machines

• High utilization does not necessarily mean bad performance
  • True measures of performance are user-facing metrics: transactions per unit time, response time

Performance: Common Errors in Virtual Machine Configuration

• Failure to use high-performance virtual devices (a hedged configuration sketch follows this list)
  • Virtual SCSI adapters: choose LSI Logic rather than BusLogic
  • Virtual Ethernet adapters: choose vmxnet rather than vlance

• Failure to size virtual machines properly
  • Set the virtual machine's maximum memory high enough to avoid paging inside the guest OS
  • Set the virtual machine's minimum memory to accommodate steady-state memory needs
  • Employ Virtual SMP only for guest operating systems supported for SMP use and for applications that benefit from multiple CPUs

• Failure to distinguish between high- and low-priority consumers
  • Give high-priority virtual machines many shares of their key resources
  • Give low-priority virtual machines comparatively few shares
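A hedged sketch of how these choices might look in a virtual machine's configuration (.vmx) file. scsi0.virtualDev, ethernet0.virtualDev, and memsize are standard ESX 2.x-era option names; the sched.* names are assumptions to verify against your version's documentation.

    # Illustrative .vmx entries
    scsi0.virtualDev = "lsilogic"      # high-performance virtual SCSI adapter
    ethernet0.virtualDev = "vmxnet"    # high-performance virtual NIC
    memsize = "1024"                   # maximum memory (MB): large enough to avoid guest paging
    sched.mem.minsize = "512"          # minimum memory (MB) -- option name assumed
    sched.cpu.shares = "2000"          # extra shares for a high-priority VM -- option name assumed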

Performance: Common Errors in Virtual Machine Configuration

• Failure to set up the guest OS and application environment wisely
  • Disable screensavers
  • Run antivirus software in scheduled-scan mode, not continuously
  • Choose the right Windows HAL or Linux kernel for the number of virtual CPUs

• Failure to lay out virtual machine storage correctly
  • If an application benefits from a particular RAID setting, put its virtual disks in a VMFS volume on a LUN with that setting
  • Do not use undoable disks casually
  • Avoid using many undoable disks in the same VMFS volume simultaneously

Performance: Common Errors in Server Configuration

• Failure to match system configuration to workload
  • Virtual machines need at least as much CPU and RAM as physical machines running the same application
  • Size your system's physical RAM to accommodate all virtual machines' steady-state memory needs
    • If virtual machines' memory use totals more than the RAM allocated to the VMkernel, VMkernel swapping will occur

• Failure to lay out PCI buses properly
  • Connect no more than one high-traffic adapter to each bus
  • If possible, place the storage adapter on one bus and the NICs on another
  • If possible, spread the NICs in a bond across multiple buses

• Failure to design the storage layout properly
  • Choose the right failover policy for your disk array
  • Spread the traffic to your LUNs across the available paths

Performance: Limiting Resource for Some Applications

Application: Citrix MetaFrame XP terminal services
• CPU: size the physical system properly; choose the Terminal Services workload option

Application: ATG Dynamo application server
• RAM: size the virtual machine's minimum memory to hold the Java heap
• CPU: preallocate database pools

Application: Apache Web server
• RAM: size the virtual machine's minimum memory to hold content
• CPU: avoid system calls by configuring Apache with minimal logging, DNS lookups, and process creation

Application: Microsoft SQL Server database server
• Disk: place logs and data in virtual disks on different physical LUNs
• CPU: monitor the scheduler queue length; do not run 2-VCPU virtual machines on 2-PCPU hardware

Performance: Run Physical? or Run Virtual?

• Workloads that use more resources than virtual machines' maxima should stay on physical hardware
  • Examples: applications that can use more than 3.6 GB of RAM or more than two CPUs

• Virtualization is always a tradeoff between performance and manageability

• Workloads that spend much time executing operating-system code suffer the greatest performance cost
  • Examples: process creation and destruction, thread management

• Spectrum of virtualization cost, from less to more: computationally intensive workloads, then I/O-intensive workloads, then workloads with high system-call overhead

Network: Problems in Functional Layers

Layer: Physical hardware
• Reasons for connectivity loss: gateway, router, switch

Layer: Virtual machine or guest OS
• Firewall running inside the guest OS
• Packets dropped inside the guest OS

Layer: VMkernel or Service Console
• Virtual switch configuration
• VLAN configuration settings
• NIC teaming configuration
• Service Console NIC device driver

• You may not be able to tell which layer the problem is in just from the symptom:
  • Example: a ping timeout (lost network connectivity) to a virtual machine could originate in any layer

Network: Issues With Physical Hardware

Eliminate any problems with physical network components:

• Bad cables

• Bad switch ports

• Broken firmware/hardware

• Software or configuration problems on the switch

Network: VMkernel or Service Console Issues

• Configure networking components to work properly with ESX Server:
  • Configure the Service Console NIC to properly negotiate a link with the switch (a hedged sketch follows)
    • For example, see KB Article #1564
  • Configure VLAN switch ports as trunk ports
  • Configure link aggregation on the switch-port side if using NIC teaming with the out-ip load-balancing mode
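As a hedged illustration (not from the slides), the Linux-based Service Console can usually report what a NIC negotiated; mii-tool availability and the eth0 name depend on the driver and configuration.

    # Check the Service Console NIC's negotiated link (assumes mii-tool
    # is present and the NIC is eth0)
    mii-tool eth0
    # Illustrative: force 100 Mb/s full duplex when the switch port is
    # hard-set; prefer fixing negotiation per KB Article #1564
    mii-tool --force=100baseTx-FD eth0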

Network: VMkernel or Service Console Issues

• Ensure that your overall network design is correct
  • Check for proper VLAN configuration
  • Check for proper firewall design

• Ensure that applications do not conflict with VMkernel networking functionality
  • Unicast NLB does not work with VMotion: the two features require conflicting VMkernel parameters
    • For more information, see KB Article #1573

Network: Virtual Machine or Guest OS Problems

• Prevent random connectivity loss when using the vlance virtual NIC with Windows Server 2003:
  • Install the updated AMD PCnet driver
  • For more information, see KB Article #1631

• Isolate problems by switching the guest OS network adapter from vlance to vmxnet, or vice versa

• Eliminate duplicate-IP-address errors that result from changing a network connection's TCP/IP configuration from DHCP to a static IP address in a Windows guest OS
  • The problem can also occur in a native environment
  • For more information, see KB Article #1179

• Allow a Red Hat Linux 9.0 virtual machine to properly receive a DHCP-assigned IP address
  • The symptom is the error message "Determining IP information for eth0… failed; no link present. Check cable?"
  • For more information, see KB Article #977 (a hedged workaround sketch follows)
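The workaround commonly associated with KB Article #977 (hedged; consult the article for the authoritative fix) overrides the link check in the interface's configuration file inside the guest:

    # Append to /etc/sysconfig/network-scripts/ifcfg-eth0 in the
    # Red Hat Linux 9.0 guest -- workaround as commonly cited for
    # KB Article #977; verify against the article before relying on it
    check_link_down () {
        return 1;    # report "link not down" so DHCP negotiation proceeds
    }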

Storage: Document and Verify SAN Topology

• How many hosts? How many HBAs in each?

• How many arrays? How many SPs in each? How many ports per SP?

• How many FC switches? How is each zoned? How are they interconnected?

• How is each HBA cabled to each FC switch?

• How is each SP port cabled to each FC switch?

• Are your disk arrays and HBAs supported?
  • Disk arrays can be supported for basic LUN connectivity but not for advanced features

Storage: Avoid Common Errors

• When installing ESX Server on a production system:
  • Disconnect Fibre Channel HBAs if doing a local install
  • Carefully verify zoning before doing a boot-from-SAN install
  • The installer lets you wipe any accessible disk, including SAN LUNs that others may be using

• If possible, keep the VMkernel's resources on a path not exposed to SAN-administrator error:
  • The VMkernel's core dump partition (filesystem type vmkcore)
  • The VMkernel swap file (on a VMFS partition)

• Dedicate HBAs to the VMkernel; do not share them with the Service Console
  • Eliminates the risk of I/O contention between the two
  • Preserves the ability to dynamically scan for new SAN LUNs

Storage: Troubleshoot SAN Connectivity (flowchart)

Problem: a SAN-based VMFS is not available.

1. Does the VMkernel see the LUN? Check /proc/vmware/scsi (a hedged command sketch follows this flowchart).
   • LUN is present: connectivity is fine; there is no VMFS in the LUN.
   • LUN is absent: continue.
2. Does the FC card see the LUN? Boot into the card's menu.
   • LUN is present: VMkernel configuration problem.
   • LUN is absent: continue.
3. Does the FC switch see the FC card? Check for a fabric login.
   • No, the switch does not see the card: cabling problem.
   • Yes, the switch sees the card: continue.
4. Will the switch allow the FC card to talk to storage? Check for a port login.
   • No port login to the target: zoning problem.
   • Port login to the target occurred: LUN masking problem.
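A hedged Service Console sketch of step 1 of this flowchart; the /proc layout shown and the rescan option are assumptions based on ESX 2.x-era conventions.

    # List what the VMkernel currently sees (one directory per adapter)
    ls /proc/vmware/scsi/
    cat /proc/vmware/scsi/vmhba0/* 2>/dev/null   # per-target/LUN entries; names illustrative
    # Rescan the adapter after fixing zoning or masking; the -s (scan)
    # option is assumed from ESX 2.x documentation -- verify for your release
    vmkfstools -s vmhba0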

Storage: SAN Multipathing

• Multipathing allows continuous availability of a SAN LUN in the event of a hardware failure
  • The administrator may set preferred paths for each LUN

• ESX Server supports failover with any supported HBAs
  • Failover occurs automatically, with a configurable delay

• Do not attempt to combine ESX Server failover with other multipathing solutions
  • Other software or hardware multipathing will conflict with the VMkernel

• Use zones to enforce access from your ESX Server to your disk array

• Choose the right failover policy for your disk array (a hedged command sketch follows)
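One hedged way to inspect and set paths from the Service Console. vmkmultipath is the ESX 2.x-era path-management tool; the exact flags below are assumptions to verify against the SAN configuration guide.

    # Query all paths and the current policy for each LUN (flag assumed)
    vmkmultipath -q
    # Set the policy for one LUN to MRU (syntax assumed; vmhba0:0:1 is a
    # hypothetical adapter:target:LUN name)
    vmkmultipath -s vmhba0:0:1 -p mru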

Storage: Failover Policies, Disk-Array Types

• ESX Server multipath failover policies: MRU and Fixed
  • The Fixed policy "fails back" once the original failed path is restored (that path is preferred)
  • MRU ("most recently used") does NOT "fail back", even when the original failed path is restored (with MRU, there is no preferred path)

• Disk-array types: active/active and active/passive
  • Active/active: any SP may access any LUN at any time
  • Active/passive: one SP is active at a time; the other is a hot standby

• Examples of disk-array types:
  • Active/active: EMC Symmetrix, IBM ESS ("Shark"), Hitachi 9900
  • Active/passive: HP MSA 1000, EMC CLARiiON, IBM FAStT
  • HP EVA is technically active/active, but ESX Server uses it as active/passive

Storage: Clustering Correctly

Cluster type: Cluster-in-a-box
• In the virtual machine configurations, set the bus sharing type to virtual

Cluster type: Cluster-across-boxes
• In the virtual machine configurations, set the bus sharing type to physical
• Set the VMFS accessibility mode to shared, whether using virtual disks or raw-device mappings
• If not using RDM, replace VMFS volume labels in the virtual machine configuration with vmhbaC:T:L:P paths

Cluster type: Physical-to-virtual clustering
• In the virtual machine configuration, set the bus sharing type to physical
• Use raw-device mapping in physical compatibility mode

(A hedged .vmx sketch of the bus-sharing setting follows.)
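A hedged sketch of the bus-sharing entry in a virtual machine's .vmx file; scsi1.sharedBus follows ESX 2.x-era naming, and scsi1 assumes the shared disks sit on a second virtual SCSI adapter.

    # Illustrative .vmx entries for clustered virtual machines
    scsi1.sharedBus = "virtual"      # cluster-in-a-box
    # scsi1.sharedBus = "physical"   # cluster-across-boxes or physical-to-virtual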

Storage: VMFS Accessibility Modes

The VMFS accessibility mode controls whether more than one virtual machine at a time can access a file in a VMFS (the VMFS resides in a SCSI LUN).

Mode: public
• How virtual disks are protected from corruption: the VMkernel locks entire virtual disks
• How the VMFS structure is protected from corruption: VMkernels cooperate to make one change at a time
• When to use: general use

Mode: shared
• How virtual disks are protected from corruption: software inside the virtual machines (MSCS, etc.) requests SCSI reservations; SCSI reservations can limit access to one ESX Server at a time
• How the VMFS structure is protected from corruption: the VMFS structure is read-only
• When to use: cluster-across-boxes

Storage: Troubleshoot Boot-From-SAN (flowchart)

Problem: boot from a SAN LUN fails.

1. Supported hardware? Check the SAN guide.
   • Unsupported: replace the unsupported gear.
2. Was the installation done with the boot-from-san option?
   • No: repeat the install.
3. Is the server's BIOS set to boot from FC?
   • No: modify the BIOS boot order.
4. Is the FC card pointed at the correct disk-array WWN and LUN?
   • No: modify the FC card configuration.
5. If all of the above check out, mask any lower-numbered LUN and rule out cabling or zoning problems.

Storage: Boot-from-SAN BIOS Configuration

Ensure that the Fibre Channel adapter’s boot code is enabled

Storage: Boot-From-SAN BIOS Configuration

• Configure the BIOS so that the Fibre Channel adapter is the boot device, and the desired LUN is the boot volume

• Disable the built-in IDE controller if present

Reliability: Avoid Common Causes of Crashes

• Hardware problems may produce a crash or hang:
  • System memory: "burn in" new memory for 72 hours
  • CPU failures or problems
  • (Local) boot disk
  • Storage: the controller's battery dies and it loses its cache

• VMware software: apply patches or upgrades

• ASR (Automatic System Reset, or reboot):
  • If intermittent: difficult to predict, track, and document

• Some problems may have to be referred to the OEM for diagnosis:
  • Third-party vendor software (backup or management agents)
  • Disk hardware

Reliability: Avoid Common Causes of Crashes

ESX Server misconfigurations:

• BIOS settings:
  • For example, MPS table mode needs to be set to Full Table APIC so that devices receive PCI IRQ routing entries
  • For details, see Knowledge Base Article #1081

• ESX Server SAN configuration:
  • Each LUN must have at least one active path
  • If the SAN is active/passive, the ESX Server multipathing policy should be set to MRU
    • If set to Fixed, it can easily cause a crash or hang

• VMkernel and Service Console device allocation:
  • When using vmkpcidivy, make sure that devices are properly allocated (to the Service Console, to the virtual machines, or to both); a hedged usage sketch follows
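A hedged usage sketch: vmkpcidivy is the ESX 2.x device-allocation tool, and -i is its interactive mode; verify the exact behavior for your release.

    # Review and adjust device allocation interactively (run from the
    # Service Console; the prompts determine whether each PCI device goes
    # to the Service Console, to the virtual machines, or to both)
    vmkpcidivy -i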

Reliability: Avoid Common Causes of Crashes

External and/or environmental factors:

• Connections physically broken (to the network, to attached storage, etc.)

• Temperature and other environmental changes

• SAN hardware online maintenance:
  • Some SAN products have an online-maintenance feature (Storage Processor service downtime)
  • It does not work reliably with ESX Server and can sometimes cause a crash or hang
  • It may need to be disabled

Reliability: PSOD (Purple Screen of Death)

Caused by hardware problems, VMkernel problems, and Service Console oopses:

• These typically halt the system ("PSOD", or Purple Screen of Death)

• On the next reboot, the Service Console copies the contents of the vmkcore partition to a core file placed in the root user's home directory (/root)

• The customer then needs to run vm-support and upload the resulting file to ftpsite.vmware.com (a hedged sketch follows)
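A hedged sketch of the collection steps; vm-support is the standard ESX diagnostic script, while the core-file and bundle names shown are illustrative.

    # After the reboot that follows a PSOD
    ls /root/vmkernel-core*   # confirm the saved VMkernel core exists (name assumed)
    vm-support                # writes a compressed diagnostic bundle (.tgz) in the current directory
    # Upload the resulting bundle to ftpsite.vmware.com as directed by
    # VMware support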

Reliability: PSOD

An example of a Purple Screen of Death appears on this slide (screenshot not reproduced in this transcript).

Reliability: Server Management

Problems not with the entire Service Console, but with one of its key daemon processes:

• vmware-serverd

• vmware-ccagent (if managed by a VirtualCenter server)

• vmware-authd

A crash usually produces a conventional Unix core file:

• Path: /var/log/vmware/core

• Also produces a characteristic "511 error"

Reliability: vmm (Virtual Machine Monitor)

If the vmm world crashes:

• If the virtual machine has one VCPU, a core file named vmware-core is generated

• If the virtual machine has two VCPUs, two core files, vmware-core0 and vmware-core1, are generated

• Core files are placed in the virtual machine's configuration-file directory

• Core files are archived and compressed; for example, vmware-core becomes vmware-core.gz (a hedged retrieval sketch follows)
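A hedged sketch of retrieving such a core file for support; the directory path is hypothetical.

    # In the virtual machine's configuration-file directory
    cd /home/vmware/myvm      # hypothetical path to the VM's directory
    gunzip vmware-core.gz     # restores vmware-core for analysis
    # Include the file with the diagnostics sent to VMware support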

Reliability: the Virtual Machine's Processes

Service Console processes:

• vmware-vmx processes support removable virtual devices (virtual floppy disk, virtual CD-ROM, etc.)

• The vmware-mks process supports the virtual mouse, keyboard, and video through VMware Remote Console

• A crash of any of these typically produces a conventional Unix core file: "core" (or "core.<pid>")

Reliability: the Virtual Machine's Guest OS

Guest OS: Windows, Linux, NetWare, etc.

• Produces a bluescreen (BSOD) or an application fault

• Capture the screen

• Minidump (64 KB) or kernel dump (the size of the kernel image)

• Dr. Watson output file (Windows)

• Core file (Linux)

• Abend (NetWare)

Reliability: Guest OS Bluescreen

Troubleshoot just as you would a similar problem on a physical machine.

Summary

• Performance problems

• Network problems

• SAN and storage problems

• System reliability problems

Questions?