Additional Info (some are still draft)


Tech notes that you may find useful as input to the design. A lot more material can be found at the Design Workshop.

Internal Cloud: Gartner model and VMware model
Gartner take: a virtual infrastructure that is on-demand, elastic, and automated/dynamic, which improves agility and business continuity.

Virtual infrastructure management components: self-service provisioning portal, service catalog, performance management, enterprise service management, identity and access management, life cycle management, orchestrator, external cloud connector, chargeback system, configuration and change management, capacity management, physical infrastructure, virtual infrastructure, and a service governor / infrastructure authority.

Internal cloud means a lot of things to a lot of customers. Gartner defines internal cloud as an IT capability delivered as a service by an IT organization to its own business. Hardware infrastructure as a service is defined as a virtual or physical hardware platform offered as a service. The hardware platform is offered as a raw resource, so the consumer must install additional software or connect the resource to additional resources to create a complete solution. Typically, the administrator manages the infrastructure platform using a web-based interface to provision, start, and stop individual platform instances.

Gartner clients look for specific attributes in an internal cloud, including:
- A virtual infrastructure base
- On-demand service
- Elasticity (scale on demand)
- Automation (e.g., service delivery and optimization)
These attributes are key enablers for improved IT and business agility as well as better business continuity.

Master/slave concept

Cluster: Settings
For the 3 sample sizes, here is my personal recommendation:
- DRS fully automated. Sensitivity: Moderate. Use anti-affinity or affinity rules only when needed; they are more things for you to remember and give DRS less room to maneuver.
- DPM enabled. Choose hosts that support DPM. Do not use WOL; use DPM or IPMI.
- VM Monitoring enabled, with sensitivity set to Medium. HA will restart the VM if the heartbeat between the host and the VM has not been received within a 60-second interval.
- EVC enabled. This enables you to upgrade CPU generations in the future.
- Enable "Prevent VMs from being powered on if they violate availability constraints" for better availability.
- Host isolation response: Shut down VM. See http://www.yellow-bricks.com/vmware-high-availability-deepdiv/. Compared with "Leave VM powered on", this prevents data/transaction integrity risk (the risk is rather low, as the VM itself holds a lock). Compared with "Power off VM", it allows a graceful shutdown; some applications need to run a consistency check after a sudden power off.

DRS, DPM, EVC
In our 3 sizes, here are the settings:
- DRS: Fully Automated. DRS sensitivity: leave it at the default (middle, 3-star migration).
- EVC: turn on. It does not reduce performance; it is a simple CPU feature mask.
- DPM: turn on, unless the hardware vendor shows otherwise.
- VM affinity: use sparingly. It adds complexity, as we are already using group affinity.
- Group affinity: use (as per the diagram in the design).

Intel Xeon 32nm Core i7 baseline: applies the baseline feature set of Intel Xeon 32nm Core i7 (Westmere) processors to all hosts in the cluster. Compared to the Intel Xeon Core i7 mode, this EVC mode exposes additional CPU features, including AES and PCLMULQDQ. Note: Intel i3/i5 Xeon Clarkdale series processors that do not support AESNI and PCLMULQDQ cannot be admitted to EVC modes higher than the Intel Xeon Core i7 mode.

Why turn on DPM: power cost is a real concern.
Singapore example: S$0.24 per kWh x (600 W + 600 W) x 24 hours x 365 days x 3 years / 1000 ≈ S$7,600 over 3 years, which is in the region of the price of one server (see the sketch below). For every 1 W of power the server consumes, budget at least another 1 W for air conditioning, UPS, and lighting.
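The power-cost argument above is simple arithmetic. A minimal sketch, assuming the slide's Singapore tariff and a 600 W server with an equal cooling/UPS/lighting overhead (all of these inputs are assumptions, not measurements):

```python
# Back-of-envelope 3-year power cost behind the DPM recommendation.
def three_year_power_cost(tariff_per_kwh=0.24, server_watts=600,
                          overhead_watts=600, years=3):
    kwh = (server_watts + overhead_watts) * 24 * 365 * years / 1000
    return tariff_per_kwh * kwh

print(f"S${three_year_power_cost():,.0f}")   # ~S$7,569 with the slide's inputs
```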

Intel Xeon Core i7 baseline: applies the baseline feature set of Intel Xeon Core i7 (Nehalem) processors to all hosts in the cluster. Compared to the Intel Xeon 45nm Core 2 EVC mode, this EVC mode exposes additional CPU features, including SSE4.2 and POPCOUNT.

VMware VMmark
- Use VMmark as the basis for CPU selection only, not for the entire box selection. It is the official benchmark for VMware and uses multiple workloads; other benchmarks are not run on vSphere and typically test one workload.
- VMmark does not include TCO. Consider the entire cost when choosing a hardware platform.
- Use it as a guide only. Your environment is not the same, and you need headroom and HA.
How it's done: VMmark 2.0 uses VMs with 1-4 vCPUs running MS Exchange, MySQL, Apache, J2EE, a file server, and an idle VM. VMmark 2.0 results are not comparable with 1.x results. Results page: www.vmware.com/products/vmmark/results.html
(This slide needs an update.)

VMmark: sample benchmark result (HP only). I'm only showing results from one vendor, as vendor comparison is more than just a VMmark result. IBM, Dell, HP, Fujitsu, Cisco, Oracle, and NEC all have VMmark results.

Look at the tile count: 20 tiles = 100 active VMs.

This tells us that a Xeon 5500 box can run 17 tiles at 100% utilisation. Each tile has 6 VMs, but 1 is idle, so 17 x 5 VM = 85 active VMs in one box. At 80% peak utilisation, that is roughly 65 VMs. A rough tile-to-VM conversion is sketched below.

CPUs compared: Opteron 8439 (24 cores), Xeon 5570 (8 cores), Opteron 2435 (12 cores), Xeon 5470 (8 cores). The score difference only matters when comparing at the same number of tiles; a ~10% margin is acceptable for real-life sizing, since this is a benchmark.
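The tile-to-VM conversion used above is straightforward; a small sketch, assuming 6 VMs per tile with 1 idle and an 80% peak-utilisation planning ceiling (the 80% figure is the slide's planning assumption):

```python
# Rough VMmark tile-to-active-VM conversion for host sizing.
def active_vms(tiles, vms_per_tile=6, idle_per_tile=1, peak_utilisation=0.8):
    active_at_100pct = tiles * (vms_per_tile - idle_per_tile)
    return active_at_100pct, int(active_at_100pct * peak_utilisation)

print(active_vms(17))   # (85, 68) -> the slide rounds this to ~65 active VMs per host
```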

Fault Tolerance: workload types and application specifics
- Databases: the most popular workloads on FT. Small to medium instances, mostly SQL Server.
- MS Exchange and messaging: BES, Exchange. One gaming company runs 750 mailboxes on 1 FT VM. See the FT load test at blogs.vmware.com.
- Web and file servers: a file server might be stateless, but applications using it may be sensitive to denial of service and may be very costly to lose. A simulation relying on a file server might have to be restarted if the file server fails.
- Manufacturing and custom applications: these workloads keep production lines moving; breaks result in loss of productivity and material. Examples: propeller factory, meat factory, pharma line.
- SAP: SAP ECC 6.0 system based on the SAP NetWeaver 7.0 platform. ASCS, a message and transaction locking service, is a SPOF.
- BlackBerry: BlackBerry Enterprise Server 4.1.6 (BES). A 1 vCPU BES can support 200 users at 100-200 emails/day.

MS Clustering
- ESX port group properties: Notify Switches = No, Forged Transmits = Accept.
- Windows 2008 does not support NFS.
- Storage design: virtual SCSI adapter is LSI Logic Parallel for Windows Server 2003 and LSI Logic SAS for Windows Server 2008.
- ESXi changes: ESXi 5.0 uses a different technique to determine whether RDM LUNs are used for MSCS cluster devices, introducing a configuration flag to mark each device participating in an MSCS cluster as "perennially reserved".
- Unicast mode reassigns the station (MAC) address of the network adapter for which it is enabled, and all cluster hosts are assigned the same MAC address. You therefore cannot have ESX send ARP or RARP to update the physical switch port with the actual MAC address of the NICs, as this breaks unicast NLB communication.

Symantec ApplicationHA
- Can install the agent to multiple VMs simultaneously.
- Additional roles for security.
- It does not cover Oracle yet.
- Presales contact for ASEAN: Vic

VMware HA and DRS
Read Duncan's yellow-bricks material first. Done? Read it again; this time, try to internalise it. See the speaker notes below for an example.

vSphere 4.1 primary nodes
- Primary nodes hold cluster settings and all node states, which are synchronized between primaries. Node states hold, for instance, resource usage information. If vCenter is not available, the primary nodes still have a rough estimate of resource occupation and can take this into account when a fail-over needs to occur.
- Primary nodes send heartbeats to both primary and secondary nodes.
- HA needs at least one primary because the fail-over coordinator role is assigned to a primary; this role is also described as the active primary. If all primary hosts fail simultaneously, no HA-initiated restart of VMs will take place; HA needs at least one primary host to restart VMs. This is why you can only account for four host failures when configuring the "host failures" HA admission control policy (remember: 5 primaries).
- The first 5 hosts that join the VMware HA cluster are automatically selected as primary nodes; all the others become secondary nodes. A cluster of 5 will be all primaries.
- When you do a "Reconfigure for HA", the primary and secondary nodes are selected again, at random. The vCenter client does not show which host is a primary and which is not.

Secondary nodes
- Secondary nodes send their state info and heartbeats to the primary nodes only.
- HA does not know whether a host is isolated or completely unavailable (down). The VM lock file is the safety net: in VMFS the file is not visible; in NFS it is the .lck file.
- Nodes send a heartbeat every 1 second; this is the mechanism to detect possible outages.

Percentage-based admission control uses a more flexible mechanism. It accumulates all reservations and subtracts them from the total amount of available resources, while making sure the specified spare capacity is always available.

What happens if you set the number of allowed host failures to 1? The host with the most slots is taken out of the equation. Example: you have 8 hosts with 90 slots in total, but 7 hosts have 10 slots each and 1 host has 20 slots. For some reason the cluster is not balanced; perhaps that host has an odd CPU/RAM combination. Remember, this is HA, not DRS: VMware HA treats that single host as the largest, so it must be able to handle the failure of that host and therefore caters for 20 slots. It plans for the worst-case scenario. In other words, the remaining 7 hosts must be able to provide enough resources for the cluster when the 20-slot host fails. A small sketch of this worst-case calculation follows.
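A minimal sketch of the worst-case reasoning described above (illustrative only, not VMware's exact implementation): with N tolerated host failures, the N largest hosts by slot count are removed before counting usable slots.

```python
# "Host failures cluster tolerates" admission control, worst-case style.
def usable_slots(slots_per_host, tolerated_failures=1):
    # Drop the largest 'tolerated_failures' hosts, then count what is left.
    keep = sorted(slots_per_host)[:len(slots_per_host) - tolerated_failures]
    return sum(keep)

cluster = [10, 10, 10, 10, 10, 10, 10, 20]   # the slide's 8-host example
print(usable_slots(cluster, 1))              # 70 usable slots, not 90 - 10 = 80
```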

In vCenter 2.5, the number of vCPUs for any given VM was also taken into account. This led to very conservative and restrictive admission control. This behavior was modified in vCenter 2.5 U2; the number of vCPUs is no longer taken into account.

How does HA calculate how many slots are available per host? (The key phrase is per host: each host can have a different slot count.) First we need the slot size for memory and CPU; that is defined for the entire cluster. We then divide the total available CPU resources of the host by the CPU slot size, and the total available memory resources of the host by the memory slot size. This leaves us with a slot count for both CPU and memory; say the host has 7 CPU slots and 5 RAM slots. The lower number is the number of slots for this host: if you have 25 CPU slots but only 5 memory slots, the host provides 5 slots. A minimal sketch of this calculation follows.
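A small sketch of the per-host slot count described above. The host capacity and slot-size values in the example call are hypothetical; real HA also factors memory overhead and reservations into the slot size.

```python
# Per-host HA slot count: the scarcer resource wins.
def slots_per_host(host_cpu_mhz, host_mem_mb, slot_cpu_mhz, slot_mem_mb):
    cpu_slots = host_cpu_mhz // slot_cpu_mhz
    mem_slots = host_mem_mb // slot_mem_mb
    return min(cpu_slots, mem_slots)

# Hypothetical host: 20 GHz CPU, 48 GB RAM; slot size 256 MHz / 8 GB.
print(slots_per_host(20000, 49152, 256, 8192))   # -> 6 slots for this host
```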

As you can see, this can lead to very conservative consolidation ratios. With vSphere this is configurable: if you have just one VM with a really high reservation, you can set the advanced settings das.slotCpuInMHz or das.slotMemInMB to lower the slot size used in these calculations. So that the VM with the high reservation can still be powered on, it will simply take up multiple slots.

vSphere 4.1 adds a more refined check. Before 4.1, when you were low on resources, you might not be able to power on a high-reservation VM because resources were fragmented throughout the cluster instead of located on a single host. vSphere 4.1 therefore does a per-host check.

vSphere 4.1: HA and DRS best practices
Avoid using advanced settings to decrease the slot size, as it might lead to longer downtime: admission control does not take fragmentation of slots into account when slot sizes are manually defined with advanced settings.

What can go wrong in HA: VM network lost, HA network lost, storage network lost.

Failure scenarios (what failed / what still works / what happens as a result):
- VM network failed (HA and storage networks fine): users can't access the VM; if there are active users, they will complain. HA does nothing, as this is not within the scope of HA in vSphere 4.1.
- HA network failed (VM and storage networks fine): it depends on whether the cluster is split-brain or partitioned. If the host is isolated, it executes the isolation response (shut down VM); the lock is released, another host gains the lock and then starts the VM.
- Storage network failed (the rest does not matter): the VM probably crashes as it can't access its disk. The lock expires, the host loses its connection to the array, and another host (the first one to get the lock?) boots the VM.

HA monitors the heartbeat via VMware Tools.

If a host stops receiving heartbeats from all other hosts in the cluster for more than 12 seconds, it attempts to ping its isolation address; if that also fails, it declares itself isolated from the network. By default, the isolation address is the default gateway. As long as two primary nodes are active, the default gateway can be brought down.

Q. How is VM Monitoring performed?
A. If regular heartbeats from the VMware Tools process are not received within the failure interval, the VM Monitoring service verifies the VM's disk and network I/O stats (to avoid unnecessary resets) for about 120 seconds; if there is no I/O activity either, the VM is reset. The default of 120 seconds can be changed via das.iostatsinterval. This setting helps you avoid restarting a guest whose heartbeat has stopped by checking the guest's I/O activity for a certain period of time. A minimal sketch of this decision logic follows.
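A simplified sketch of the VM Monitoring decision described above. The thresholds are parameters here (60 s matches the medium-sensitivity heartbeat interval mentioned earlier, 120 s is the das.iostatsinterval default); the real service works on sampled statistics inside the HA agent, not a function like this.

```python
# VM Monitoring: reset only if both the Tools heartbeat AND guest I/O have stopped.
def should_reset_vm(heartbeat_missing_for_s, io_ops_in_window,
                    failure_interval_s=60, iostats_interval_s=120):
    if heartbeat_missing_for_s < failure_interval_s:
        return False                  # Tools heartbeat still within tolerance
    return io_ops_in_window == 0      # no disk/network I/O in the stats window -> reset

print(should_reset_vm(90, io_ops_in_window=0))      # True: reset the VM
print(should_reset_vm(90, io_ops_in_window=12000))  # False: VM is still doing I/O
```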

VMware HA and DRS: split brain vs. partitioned cluster
A large cluster that spans racks might experience partitioning. Each partition will think it is the full cluster, and as long as there is no loss of the storage network, each partition will happily run its own VMs. Split brain is when two hosts want to run the same VM. Partitioning can happen when the cluster is separated by multiple switches; the diagram below shows a cluster of 4 ESX hosts.

HA: Admission Control Policy (% of cluster)
- Specify a percentage of capacity that needs to be reserved for failover. You need to set it manually so that it is at least equal to one host failure. E.g., if you have an 8-node cluster and want to handle 2 node failures, set the percentage to 25% (a small calculation sketch follows after the monitoring note below).
- Complexity arises when nodes are not equal (different RAM or CPU). This also impacts the other admission control options, so always keep node sizes equal, especially in Tier 1.
- The total amount of reserved resources must stay below the available resources minus the reserved failover capacity.
- If no reservation is set, a default of 256 MHz is used for CPU, and 0 MB plus overhead for memory.

Monitor the thresholds with vCenter on the cluster's Summary tab.
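A minimal sizing sketch for the percentage-based policy, assuming equally sized hosts (which the slide recommends; with unequal hosts you must reserve more):

```python
# Percentage of cluster capacity to reserve for N tolerated host failures.
def failover_percentage(total_hosts, host_failures_to_tolerate):
    return 100.0 * host_failures_to_tolerate / total_hosts

print(failover_percentage(8, 2))   # 25.0 -> reserve 25% of cluster capacity
```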

Snapshots
- Keep a snapshot for a maximum of 1-3 days. Delete or commit it as soon as you are done; a large snapshot may cause issues when committing/deleting.
- For high-transaction VMs (e.g., databases, email), delete/commit as soon as you have finished verifying.
- Third-party tools: snapshots taken by third-party software (called via the API) may not show up in the vCenter Snapshot Manager. Routinely check for snapshots via the command line.
- Increasing the size of a disk with snapshots present can lead to corruption of the snapshots and potential data loss. Check for snapshots via the CLI before you increase the disk size.
Source: http://kb.vmware.com/kb/1025279

The maximum supported number of snapshots in a chain is 32. However, VMware recommends using only 2-3 snapshots in a chain, and no single snapshot for more than 24-72 hours. This prevents snapshots from growing so large that they cause issues when being deleted/committed to the original virtual machine disks. Take the snapshot, make the changes to the virtual machine, and delete/commit the snapshot as soon as you have verified the proper working state of the virtual machine. Be especially diligent with snapshot use on high-transaction virtual machines such as email and database servers; these snapshots can grow in size very quickly and fill datastore space. Commit snapshots on these virtual machines as soon as you have verified the proper working state of the process you are testing.

If using a third party product that takes advantage of snapshots (such as virtual machine backup software), regularly monitor systems configured for backups to ensure that no snapshots remain active for extensive periods of time. Snapshots should only be present for the duration of the backup process. Snapshots taken by third party software (called via API) may not show up in the vCenter Snapshot Manager. Routinely check for snapshots via the command-line.

An excessive number of snapshots in a chain, or snapshots large in size, may decrease virtual machine and host performance. Configure automated vCenter Server alarms to trigger when a virtual machine is running from snapshots; for more information, see "Configuring VMware vCenter Server to send alarms when virtual machines are running from snapshots" (1018029). Confirm that no snapshots are present (via the command line) before a Storage vMotion; if snapshots are present, delete them prior to the Storage vMotion. For more information, see "Migrating an ESX 3.x virtual machine with snapshots in powered-off or suspended state to another datastore might cause data loss and make the virtual machine unusable" (1020709). Confirm that there are no snapshots present (via the command line) before increasing the size of any virtual machine virtual disk or virtual RDM; if snapshots are present, delete them prior to increasing the size of the disk(s). Increasing the size of a disk with snapshots present can lead to corruption of the snapshots and potential data loss. For more information, see "Increasing the Size of a Virtual Disk".

vMotion
- Can be encrypted, at a cost. If the vMotion network is isolated, there is no need.
- You may lose one ping during the switch-over.
- Inter-cluster vMotion is not the same as intra-cluster: it involves additional calls into vCenter (so there is a hard limit), and the VM loses its cluster properties (HA restart priority, DRS settings, etc.).


ESXi: network configuration with UCS
If you are using Cisco UCS blades:
- 2x 10G or 4x 10G, depending on the blade model and mezzanine card. All mezzanine card models support FCoE (unified I/O, low latency).
- The Cisco Virtual Interface Card (VIC) supports multiple virtual adapters per physical adapter, Ethernet and FC on the same adapter, and up to 128 virtual adapters (vNICs). High performance: 500K IOPS. Ideal for FC, iSCSI and NFS.
Once you decide it's Cisco, discuss the details with Cisco.
Belmont: see the link for details: http://www.cisco.com/en/US/prod/collateral/ps10265/ps10280/data_sheet_c78-525049.html

What Is Auto Deploy
Without Auto Deploy:
- Host image tied to the physical server; each host needs a full install and config; not easy to recover a host; redundant boot disks / dedicated LUN.
- A lot of time/effort building hosts; deploying hosts is repetitive and tedious; heavy reliance on scripting, which needs updating for each new release.
- Configuration drift between hosts is always a concern; it compromises HA/DR, and managing drift consumes admin resources.
With Auto Deploy:
- Host image decoupled from the server; runs on any server with matching hardware; config stored in a Host Profile; no boot disk.
- Agile deployment model: deploy many hosts quickly and efficiently, no pre/post install scripts, no need to update with each release.
- Host state guaranteed: a single boot image shared across hosts; every reboot provides a consistent image; eliminates the need to detect/correct drift.

Auto Deploy components:
- PXE boot infrastructure (DHCP server, TFTP server): set up independently; the gPXE file comes from vCenter; can use the Auto Deploy appliance.
- Auto Deploy server (rules engine, PowerCLI snap-in, web server): build/manage rules, match a server to an Image Profile and Host Profile, deploy the server.
- Image Builder (Image Profiles, PowerCLI snap-in): combine the ESXi image with 3rd-party VIBs to create custom Image Profiles.
- vCenter Server (rules, Host Profiles, Answer Files): provides the store for rules; host configs are saved in Host Profiles; custom host settings are saved in Answer Files.

Datastore comparison (space, I/O, connected CPU, connected memory, integrated metric):
- Datastore 1: space High, I/O High, connected CPU Low, connected memory Low; integrated metric Low.
- Datastore 2: space Low, I/O Medium, connected CPU Medium, connected memory Medium; integrated metric Medium.
- Datastore 3: space High, I/O Medium, connected CPU High, connected memory High; integrated metric High.


Storage DRS and DRS
Interactions: Storage DRS placement may impact VM-host compatibility for DRS, and DRS placement may impact VM-datastore compatibility for Storage DRS.
Solution: datastore and host co-placement, done at provisioning time by Storage DRS, based on an integrated metric for space, I/O, CPU and memory resources. Overcommitted resources get more weight in the integrated metric; DRS placement then proceeds as usual. It is easier to architect this properly by mapping an ESX cluster to a datastore cluster manually.
Here's an example. Datastore 1 has a lot of available space and I/O resources, so if we were only considering those metrics it would be a good placement choice; however, it only has one loaded host connected to it, so the CPU and memory availability behind that datastore is poor. Datastore 2 has more CPU and memory availability, but little space remaining. Datastore 3 gets the best score on the integrated metric, since it has a reasonable amount of I/O bandwidth available along with plenty of CPU, memory, and space. (An illustrative scoring sketch follows.)
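An illustrative scoring sketch of the idea described above. This is my own simplification, not VMware's actual Storage DRS algorithm: each resource's headroom is scored 0-1 and the scarce (overcommitted) resources carry more weight. The headroom and weight values are made up to mirror the three-datastore example.

```python
# Weighted "integrated metric" across space, I/O, CPU and memory headroom.
def integrated_score(headroom, weights):
    total_w = sum(weights.values())
    return sum(headroom[r] * weights[r] for r in headroom) / total_w

weights = {"space": 1.0, "io": 1.0, "cpu": 2.0, "mem": 2.0}   # CPU/mem are scarce here
datastores = {
    "Datastore 1": {"space": 0.9, "io": 0.9, "cpu": 0.2, "mem": 0.2},
    "Datastore 2": {"space": 0.2, "io": 0.5, "cpu": 0.6, "mem": 0.6},
    "Datastore 3": {"space": 0.8, "io": 0.5, "cpu": 0.9, "mem": 0.9},
}
for name, h in datastores.items():
    print(name, round(integrated_score(h, weights), 2))   # Datastore 3 scores highest
```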

Note that we only showed you the create workflow on these slides, but you can also clone, relocate, and add-disk into a datastore cluster and the same ease of use applies to those cases as well.

Unified fabric with fabric extender
- Traditional design: separate FC and Ethernet blade switches mean multiple points of management and a high cable count.
- Unified fabric with fabric extender: a single point of management, reduced cabling, fiber between racks and copper within racks.

End-of-row deployment with fabric extenders.
Source: a great VMworld presentation by McKesson Corporation, with a lot of blade comparison.

Storage I/O Control: suggested congestion threshold values

Rule one: avoid different settings for datastores sharing underlying resources. Use the same congestion threshold on A and B, and use comparable share values (e.g., use Low/Normal/High everywhere).

Suggested congestion thresholds by storage media:
- Solid state disks: 10-15 milliseconds
- Fibre Channel: 20-30 milliseconds
- SAS: 20-30 milliseconds
- SATA: 30-50 milliseconds
- Auto-tiered storage, full-LUN auto-tiering: vendor-recommended value; if none is provided, use the threshold above for the slowest storage.
- Auto-tiered storage, block-level / sub-LUN auto-tiering: vendor-recommended value; if none is provided, use a combination of the thresholds above for the fastest and slowest media types.
(Diagram: physical drives shared by Datastore A and Datastore B, each with SIOC enabled.)


NAS & NFS
Two key NAS protocols: NFS (the Network File System), which is what we support, and SMB (Windows networking, also known as CIFS).
Things to know about NFS:
- Simpler for people who are not familiar with SAN complexity.
- Removing a VM lock is simpler, as the lock is visible. When ESX accesses a VM disk file on an NFS-based datastore, a special .lck-XXX lock file is generated in the same directory as the disk file to prevent other ESX hosts from accessing this virtual disk file. Don't remove the .lck-XXX lock file; otherwise the running VM will not be able to access its virtual disk file.
- No SCSI reservations; this is a minor issue.
- One datastore will only use one path. Does Load Based Teaming work with it?
- For 1 GE, throughput will peak at around 100 MB/s; at a 16 KB block size, that is roughly 7,500 IOPS (see the sketch after this list).
- The VMkernel in vSphere 5 only supports NFS v3, not v4, and over TCP only (no UDP support).
- MSCS (Microsoft Clustering) is not supported with NAS.
- NFS traffic is sent in clear text by default, since ESX does not encrypt it. Use NAS storage only over trusted networks; layer 2 VLANs are another good choice here.
- 10 Gb NFS is supported, as are jumbo frames; configure them end to end.
- Deduplication can save a sizeable amount of space; see the speaker notes.
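The bandwidth-to-IOPS remark above is a simple conversion; a sketch, assuming roughly 100 MB/s usable on a 1 GbE link and a uniform block size (the slide's ~7,500 figure implies a slightly higher usable throughput):

```python
# Link throughput -> achievable IOPS at a given block size.
def iops_from_bandwidth(throughput_mb_s, block_kb):
    return int(throughput_mb_s * 1024 / block_kb)

print(iops_from_bandwidth(100, 16))    # ~6,400 IOPS on 1 GbE; the slide rounds to ~7,500
print(iops_from_bandwidth(1000, 16))   # ~64,000 IOPS on a 10 GbE link
```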

CIFS: Common Internet File System, used for client/server communication within Microsoft operating systems. SMB: short for Server Message Block, the communications protocol used by Windows-based operating systems to support sharing of resources across a network.

Some NAS storage vendors offer data de-duplication features that can greatly reduce the amount of storage space required. It is important to distinguish between in-place de-duplication and de-duplication of backup streams. Both offer significant savings in space requirements, but in-place de-duplication seems to be far more significant for virtualization environments. Some customers have been able to reduce their storage needs by up to 75 percent of their previous storage footprint with the use of in-place de-duplication technology.

The default maximum number of NFS mount points (datastores) an ESX server can concurrently mount is eight, although the limit can be increased to 64 in the current release. If you increase the maximum number of NFS mounts above the default of eight, make sure to also increase Net.TcpipHeapSize. If 32 mount points are used, increase the TCP/IP heap size to 30 MB.

Best practices:
- Separate your IP storage and LAN network traffic onto separate physical switches, or be willing and able to logically isolate them using VLANs.
- Enable flow control.
- Enable spanning tree protocol with either RSTP or portfast enabled.
- Filter/restrict bridge protocol data units on storage network ports.
- Configure jumbo frames, always end to end (meaning every device in all possible IP storage network paths). Support for jumbo frames for NFS (and iSCSI) was added in VMware ESX 3.5 U3 and later.
- Strongly consider using Cat6 cables rather than Cat5/5e. Can 1 GbE work on a Cat5 cable? Yes. But are you building a bet-the-business Ethernet infrastructure? Remember that retransmissions will absolutely recover from errors, but they have a more significant impact in these IP storage use cases than in general networking use cases.
- Ensure your Ethernet switches have the proper amount of port buffers, and the other internals, to support NFS (and iSCSI) traffic optimally.
- While vSphere adds IPv6 support for VM networks and VMkernel networks, IPv6 for VMkernel storage traffic is experimental in the initial vSphere release.
- With NFS datastores, strongly consider switches that support cross-stack EtherChannel or virtual port channeling technologies (this will become apparent during the HA section).
- With NFS datastores, strongly consider 10 GbE, or a simple upgrade path to 10 GbE, as an important Ethernet switch feature.

Every NFS datastore mounted by ESX (including vSphere, though NetApp and EMC are both collaborating on longer-term NFS client improvements in the VMkernel) uses two TCP sessions: one for NFS control information and one for the NFS data flow itself. This means the vast majority of the traffic to a single NFS datastore uses a single TCP session. Consequently, the upper throughput limit achievable for a single datastore, regardless of link aggregation or other methods, is that of a single link.

Review this carefully: http://communities.vmware.com/thread/197850 (time-out issue).

iSCSI
Use a virtual port storage system instead of a plain active/active array; I'm not sure whether they cost much more. iSCSI has one additional array type over traditional FC: the virtual port storage system, which allows access to all available LUNs through a single virtual port. These are active-active arrays, but they hide their multiple connections behind a single port. ESXi multipathing cannot detect the multiple connections to the storage; ESXi does not see multiple ports on the storage and cannot choose which storage port it connects to. These arrays handle port failover and connection balancing transparently (this is often referred to as transparent failover), and the storage system uses this technique to spread the load across the available ports.

If ESX/ESXi loses connection to one of these ports, it automatically attempts to reconnect with the virtual port of the storage system and should be redirected to an active, usable port. This reconnection and redirection happens quickly and generally does not disrupt running virtual machines. These storage systems can also request that iSCSI initiators reconnect to the system, to change which storage port they are connected to; this allows the most effective use of the multiple ports. Figure 1-6 shows an example of port redirection: ESX/ESXi attempts to connect to the 10.0.0.1 virtual port, the storage system redirects this request to 10.0.0.2, and ESX/ESXi connects to 10.0.0.2 and uses this port for I/O communication.

iSCSI limitations
- ESX/ESXi does not support iSCSI-connected tape devices.
- You cannot use virtual-machine multipathing software to perform I/O load balancing to a single physical LUN.
- A host cannot access the same LUN when it uses dependent and independent hardware iSCSI adapters simultaneously.
- Broadcom iSCSI adapters do not support IPv6 and jumbo frames. [e1: still true in vSphere 5??]
- Some storage systems do not support multiple sessions from the same initiator name or endpoint; multiple sessions to such targets can result in unpredictable behavior.

Dependent and independent
A dependent hardware iSCSI adapter is a third-party adapter that depends on VMware networking and on the iSCSI configuration and management interfaces provided by VMware. This type of adapter can be a card, such as a Broadcom 5709 NIC, that presents a standard network adapter and iSCSI offload functionality for the same port. The iSCSI offload functionality appears in the list of storage adapters as an iSCSI adapter.

Error correction
To protect the integrity of iSCSI headers and data, the iSCSI protocol defines error correction methods known as header digests and data digests. These digests pertain to the header and SCSI data being transferred between iSCSI initiators and targets, in both directions. Both parameters are disabled by default, but you can enable them. They impact CPU; Nehalem processors offload the iSCSI digest calculations, reducing the impact on performance.

Hardware iSCSI
When you use a dependent hardware iSCSI adapter, performance reporting for the NIC associated with the adapter might show little or no activity even when iSCSI traffic is heavy. This occurs because the iSCSI traffic bypasses the regular networking stack.

Best practice: configure jumbo frames end to end and use NICs with TCP segmentation offload (TSO).

iSCSI & NFS: caveats when used together
Avoid using them together. iSCSI and NFS have different HA models: iSCSI uses vmknics with no Ethernet failover, relying on MPIO instead, while the NFS client relies on vmknics using link aggregation/Ethernet failover. NFS relies on the host routing table, so NFS traffic may end up using the iSCSI vmknic and run over links without redundancy. Use of multiple-session iSCSI together with NFS is not supported by NetApp; EMC supports it, but the best practice is to use separate subnets and virtual interfaces.

Source: Chad and Vaughn's VMworld presentation, TA8133. Listen to the audio, not just the slides.

NPIV
What it is: NPIV allows a single Fibre Channel HBA port to register with the Fibre Channel fabric using several worldwide port names (WWPNs). This makes the HBA port appear as multiple virtual ports, each having its own ID and virtual port name. Virtual machines can then claim each of these virtual ports and use them for all RDM traffic. Note that this is WWPN, not WWNN:
- WWPN: World Wide Port Name
- WWNN: World Wide Node Name
A single-port HBA typically has a single WWNN and a single WWPN (which may be the same). Dual-port HBAs may have a single WWNN to identify the HBA, with each port typically having its own WWPN; however, they could also have an independent WWNN per port.

Design considerations
- Only applicable to RDM.
- The VM does not get its own HBA, and no FC driver is required; it just gets an N_Port, so it is visible from the fabric.
- The HBA and SAN switch must support NPIV.
- You cannot perform Storage vMotion, or vMotion between datastores, while NPIV is enabled. All RDM files must be in the same datastore.
- Still in place in v5.

In the example, the first value is the WW Node Name and the second is the WW Port Name.

2 TB VMDK barrier
You may need a disk larger than 2 TB within a VM. There are some solutions, each with pros and cons. Say you need a 5 TB disk in one Windows VM. RDM (even with physical compatibility) and DirectPath I/O do not increase the virtual disk limit.
Solution 1: VMFS or NFS. Create a 5 TB datastore, create 3 VMDKs, and present them to Windows; Windows then combines the 3 disks into 1. Limitation: certain low-level storage software may not work, as it needs one real disk (not one combined by the OS).
Solution 3: iSCSI within the guest. Configure the iSCSI initiator in Windows, configure a 5 TB LUN, and present the LUN directly to Windows, bypassing the ESX layer. You can't monitor it from ESX, and by default it will only have 1 GE; NIC teaming requires a driver from Intel, and I'm not sure whether that is supported.

Storage: queue depth
When should you adjust the queue depth?
- If a VM generates more commands to a LUN than the LUN queue depth, adjust the device/LUN queue.
- Generally, with fewer, very high-I/O VMs on a host, larger queues at the device driver will improve performance.
- If the VM's queue depth is lower than the HBA's, adjust the VMkernel.
Be cautious when setting queue depths: with overly large device queues, the storage array can easily be overwhelmed and its performance may suffer with high latencies. The device driver queue depth is global and is a per-LUN setting; change the device queue depth on all ESX hosts in the cluster.
Calculating the queue depth: to verify that you are not exceeding the queue depth of an HBA, check that the device queue setting multiplied by the number of LUNs on the HBA does not exceed the HBA's maximum queue depth.
Queues exist at multiple levels:
- A LUN queue for each LUN at the ESXi host. If this queue is full, the kernel queue fills up.
- A LUN queue at the array level for each LUN. If this queue does not exist, the array writes straight to disk.
- The disk queue, at the disk level, if there is no LUN queue.

Sizing the storage array
For RAID 1 (which has an I/O penalty of 2): 60 drives = ((7000 x 2 x 30%) + (7000 x 70%)) / 150 IOPS. Why does RAID 5 have an I/O penalty of 4? (See the sizing sketch after the penalty table below.)

RAID level and write I/O penalty:
- RAID 1: 2
- RAID 5: 4
- RAID 6: 6
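The spindle-count formula above generalises easily; a sketch, assuming 150 IOPS per drive and the RAID write penalties listed in the table (real sizing must also consider capacity, latency targets, and array cache behaviour):

```python
# Back-end spindle count from front-end IOPS, write ratio and RAID penalty.
RAID_WRITE_PENALTY = {"RAID1": 2, "RAID5": 4, "RAID6": 6}

def drives_needed(front_end_iops, write_ratio, raid_level, iops_per_drive=150):
    writes = front_end_iops * write_ratio * RAID_WRITE_PENALTY[raid_level]
    reads = front_end_iops * (1 - write_ratio)
    return (writes + reads) / iops_per_drive

print(round(drives_needed(7000, 0.30, "RAID1")))   # ~61 drives (the slide rounds to 60)
print(round(drives_needed(7000, 0.30, "RAID5")))   # ~89 drives with RAID 5's penalty of 4
```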

Storage: performance monitoring
Get a baseline of your environment during a normal I/O time frame and capture as many data points as possible for analysis, from the SAN fabric, the storage array, and the hosts. Statistics to capture:
- Max and average read/write IOPS
- Max and average read/write latency (ms)
- Max and average throughput (MB/sec)
- Read and write percentages
- Random vs. sequential ratio
- Capacity, total and used

#ConfidentialSCSI Architecture Model (SAM)

The SCSI-3 specification allows for faster transfer rates and more devices on a single chain (a maximum of 32), and it incorporates serial connections in addition to the traditional parallel interconnect of SCSI-2.

The serial interconnect of SCSI-3 incorporates three technologies: Serial Storage Architecture (SSA), Fibre Channel, and IEEE P1394. The serial transfer mode allows faster data rates, more devices per bus, simple connectors, and longer cables.

Fibre Channel is an ANSI standard that specifies high-speed serial communication between devices.It encapsulates networking and I/O protocols (e.g. SCSI) for transportation within FC frames.Fibre Channel supports serial communications across both copper and optical connections. Fibre Channel is one of the serial bus architectures supported by SCSI.

Fibre Channel multi-switch fabric
(Diagram: two fabric switches connected by E_Ports over an ISL; nodes A-H attach their N_Ports to F_Ports on the switches.)
For long-distance ISLs, note that additional licensing from the switch vendor is required to purchase extended link credits (you also need special fibre optic cables in your hardware infrastructure).

Backup: VADP vs agent-based
An ESX host has 23 VMs, each around 40 GB. All VMs are idle, so this CPU/disk load is purely from backup. CPU peak is >10 GHz (just above 4 cores), and disk peak is >1.4 Gbps of I/O, almost 50% of a 4 Gb HBA. After VADP, both CPU and disk drop to negligible levels.

VADP: adoption status (as at June 2010; always check with the vendor for the most accurate data)
- CA ArcServe 12.5 w/patch: released
- Commvault Simpana 8.0 SP5: released
- EMC Avamar 5.0: released
- EMC Networker 7.6.x: not yet
- HP Data Protector 6.1.1 with patch: not yet
- IBM Tivoli Storage Manager 6.2.0: released
- Symantec Backup Exec 2010: released
- Symantec Backup Exec System Recovery 2010: released
- Symantec NetBackup 7.0: released
- Vizioncore vRanger Pro 4.2: released
- Veeam Backup & Replication 4.0: released
Internal source: https://vmshare.vmware.com/alliances/BCDR-Storage/BCDRTechAlliances/Lists/Tasks/AllItems.aspx

Partition alignment
- Affects every protocol and every storage array: VMFS on iSCSI, FC, and FCoE LUNs; NFS; VMDKs and RDMs with NTFS, EXT3, etc.
- VMware VMFS partitions that align to 64 KB track boundaries give reduced latency and increased throughput.
- Check with the storage vendor whether there are any recommendations to follow. If no recommendation is made, use a starting block that is a multiple of 8 KB.
- This is the responsibility of the storage team, not the vSphere team.
- On NetApp: VMFS partitions are automatically aligned (starting block in multiples of 4 KB). The mbrscan and mbralign tools are available to detect and correct misalignment.

(Diagram: cluster/chunk/block alignment across layers; VMFS block 1 MB-8 MB, array chunk 4 KB-64 KB, guest file system cluster 4 KB-1 MB.) See the great tips from virtualgeek.

Creating a VMFS volume generates a unique hexadecimal value known as a UUID. This UUID, the label name, and the SCSI ID of the disk are all embedded in the VMFS volume metadata.

Tools: array-specific integration
The example below is from NetApp; other storage partners have integration capabilities too. Always check with the respective product vendor for the latest information.

Tools: array-specific integration
Management of the array can be done from the vSphere client; the example below is from NetApp. Ensure storage access is not accidentally given to the vSphere admin by using RBAC.

Data Recovery
- No integration with tape; you can do that part manually.
- If a third-party solution is being used to back up the deduplication store, those backups must not run while the Data Recovery service is running. Do not back up the deduplication store without first powering off the Data Recovery backup appliance or stopping the datarecovery service using the command "service datarecovery stop".
Some limits:
- 8 concurrent jobs on the appliance at any time (backup and restore).
- An appliance can have at most 2 dedupe store destinations, due to the overhead involved in deduping.
- VMDK or RDM based deduplication stores of up to 1 TB, or CIFS based deduplication stores of up to 500 GB.
- No IPv6 addresses.
- No multiple backup appliances on a single host.
- VDR cannot back up VMs that are protected by VMware Fault Tolerance, VMs with 3rd-party multipathing enabled where shared SCSI buses are in use, or VMs with raw device mapped (RDM) disks in physical compatibility mode.
- Data Recovery can back up VMware View linked clones, but they are restored as unlinked clones.
- Using Data Recovery to back up Data Recovery backup appliances is not supported. This should not be an issue: the backup appliance is a stateless device, so there is not the same need to back it up as with other types of VMs.

Fault-tolerant VMs cannot be snapshotted, so we cannot back them up; VDR cannot back up FT VMs because there is no way to snapshot them. The VDR appliance cannot reside in a vApp.

Limitations & supportability statements are all in the VDR 1.2 Admin Guide.

Two dedupe stores is a hard limit with VDR 1.2. You now get a pop-up message that you can't have more than 2 dedupe stores.

VMware Data Recovery
We assume the following requirements:
- Back up to an external array, not the same array. The external array can be used for other purposes too, so the two arrays back each other up. How do we ensure write performance, given that the array is shared?
- One backup a day; no need for multiple backups per day of the same VM.
Considerations:
- Bandwidth: need a dedicated NIC to the Data Recovery VM.
- Performance: need to reserve CPU/RAM for the VM?
- Group like VMs together; it maximises dedupe.
- Destination: an RDM LUN presented via iSCSI to the appliance (see "Hard disk 2" in the picture below). Not using VMDK format, to enable LUN-level operations; not using CIFS/SMB, as the deduplication store limit is 0.5 TB versus 1 TB on RDM/VMDK.
- Space calculation: we need a tool to help estimate the disk requirements (a rough estimation sketch follows this list).
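One rough way to estimate the dedupe store size, sketched under my own assumptions (daily change rate, retention window, and dedupe/compression ratio are all placeholders, not VMware figures; validate against a pilot):

```python
# First full backup plus daily changed blocks over the retention window,
# reduced by an assumed dedupe/compression ratio.
def dedupe_store_gb(total_vm_gb, daily_change_rate=0.05,
                    retention_days=30, dedupe_ratio=0.5):
    full = total_vm_gb
    incrementals = total_vm_gb * daily_change_rate * retention_days
    return (full + incrementals) * dedupe_ratio

print(round(dedupe_store_gb(total_vm_gb=1000)))   # ~1250 GB for 1 TB of protected VMs
```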

#ConfidentialFirst backup is full VM, then incremental forever. Any limit on this? Does it impact restore if there are too many chains?

The first image backup is full; all the rest are synthetic fulls, but restores are always from a full. The de-duplication is block-level, not file-level, and it happens within the VMDK even though the VMDK is a file. De-duplication can happen across VMs in the destination dedupe store, so it is very efficient.

At most two destinations per appliance (hard limit). At most 100 VMs backed up per appliance (hard limit). Suggest a thick destination disk to start, not thin; thin works, but thick provides a small performance improvement and avoids a pause issue if VDR uses all the space. The destination can be VMDK or CIFS, but VMDK is strongly recommended for performance and CIFS for DR.

The backups will not run if the host CPU utilization is 90% or higher, or if the free space in the destination is less than 5 GB per VM in the job.

Mapping: datastore to VM
Criteria to use when placing a VM into a tier:
- How critical is the VM? Importance to the business.
- What are its performance and availability requirements?
- What are its point-in-time restoration requirements?
- What are its backup requirements?
- What are its replication requirements?
Have a document that lists which VM resides on which datastore group. The content can be generated using PowerCLI or Orchestrator, which can show datastores and their VMs; an example tool is Quest PowerGUI. While it rarely happens, you can't rule out datastore metadata getting corrupted, and when that happens you want to know which VMs are affected.
A VM normally changes tiers throughout its life cycle. Criticality is relative and might change for a variety of reasons, including changes in the organization, operational processes, regulatory requirements, disaster planning, and so on. Be prepared to do Storage vMotion; always test it first so you know how long it takes in your specific environment. VAAI is critical here, otherwise the traffic will impact your other VMs.

Example mapping table columns: datastore group, VM name, size (GB), IOPS. Example total: 12 VMs, 1 TB, 1,400 IOPS.

RDM
Use sparingly. VMDK is more portable, easier to manage, and easier to resize, and VMDK and RDM have similar performance.
- Physical RDM: can't take snapshots, and no Storage vMotion (vMotion still works). Physical mode specifies minimal SCSI virtualization of the mapped device, allowing the greatest flexibility for SAN management software. The VMkernel passes all SCSI commands to the device, with one exception: the REPORT LUNS command is virtualized so that the VMkernel can isolate the LUN to the owning virtual machine.
- Virtual RDM: specifies full virtualization of the mapped device; features like snapshots work. The VMkernel sends only READ and WRITE to the mapped device. The mapped device appears to the guest operating system exactly the same as a virtual disk file in a VMFS volume, and the real hardware characteristics are hidden.

Human experts vs Storage DRS
Two VMware performance engineers competed against Storage DRS to balance the following: 13 VMs (3 DVD Store, 2 Swingbench, 4 mail servers, 2 OLTP, 2 web servers) across 2 ESX hosts and 3 storage devices (different FC LUNs, shown in shades of blue). Storage DRS provided the lowest average latency while maintaining similar throughput. Why did the human experts lose? Too many numbers to crunch and too many dimensions to the analysis; the humans took a couple of hours to think it through. Why bother anyway?

(Chart: latency in ms and IOPS per LUN; the green bars are average latency, with Storage DRS achieving the lowest.) IBM built Deep Blue to pit against Garry Kasparov; focus on the green bars to see why humans lose to computers (e.g., in chess), even though our performance engineers are very good.

Alternative backup method
The VMware ecosystem may provide new ways of doing backup. The example below is NetApp SnapManager for Virtual Infrastructure (SMVI).
- In a large cloud, the SMVI server should sit on a separate VM from vCenter. While it has no particular performance requirement, this is best from a segregation-of-duties point of view. Best practice is to keep vCenter clean and simple; vCenter plays a much more critical role in larger environments, where plug-ins rely on vCenter uptime.
- Allows consistent array snapshots and replication.
- Combine with other SnapManager products (SM for Exchange, SM for Oracle, etc.) for application consistency. Exchange and SQL work with VMDK; Oracle, SharePoint, and SAP require RDM.
- Can be combined with SnapVault for vaulting to disk. Three levels of data protection: on-disk array snapshots for fast backup (seconds) and recovery (up to 255 snapshot copies of any datastore can be kept with no performance impact); vaulting to a separate array for better protection with slightly slower recovery; SnapMirror to an offsite array for DR purposes.
- Serves to minimize the backup window (and the frozen-VMDK period while changes are applied).
- Option to skip creating a VM snapshot, producing crash-consistent array snapshots.
Source: William, NetApp

(Decision flow: with one VMkernel port and IP subnet, if the switches support multi-switch link aggregation, use multiple links with IP-hash load balancing on the NFS client (ESX) and on the NFS server (array); the storage needs multiple sequential IP addresses. If not, use multiple VMkernel ports and IP subnets and rely on the ESX routing table; the storage still needs multiple sequential IP addresses.)

NetApp recommends defining a separate VMkernel for each storage protocol, which makes configuring iSCSI together with NFS very simple. Each of these VMkernel ports supports IP traffic on a different subnet; using different subnet addressing schemes for iSCSI and NFS lets you control which VMkernel ports are used for each protocol (as an example, see Figure 31). Because the two VMkernel ports are in the same vSwitch, they can share the vmnics in that vSwitch. For NFS datastores, each VMkernel port is configured with a single active vmnic and one or more standby vmnics, which allows the administrator to control which vmnic is used for traffic by each VMkernel port.

vMotion performance on 1 GbE vs 10 GbE. Scenarios (CPU %USED, web traffic): idle VM (0%, 0 Gbps); moderately loaded VM (140%, 2.5 Gbps); heavily loaded VM (325%, 6 Gbps).

Duration of vMotion (lower is better):
- Idle and moderately loaded VM scenarios: reductions in duration when using 10 GbE vs 1 GbE on both vSphere 4.1 and vSphere 5. Consider switching the vMotion network from 1 GbE to 10 GbE.
- Heavily loaded VM scenario: reductions in duration when using 10 GbE vs 1 GbE. On 1 GbE with vSphere 5, SDPS kicked in, resulting in zero connection drops; vMotion in vSphere 5 never fails due to memory-copy convergence issues. On 1 GbE with vSphere 4.1, memory-copy convergence issues led to network connection drops.

In the 1 GbE / vSphere 4.1 case, where there was no forward progress, vMotion decided to suspend the VM and send over the pages; according to its estimate, the maximum suspend time was 54 seconds, less than the maximum of 100 seconds. In the 1 GbE / vSphere 5 case, SDPS was enabled when there was no forward progress, and the vMotion went through successfully.

Impact on database server performance during vMotion:
- The performance impact is minimal during the memory trace phase in vSphere 5.
- Throughput was never zero in vSphere 5 (switch-over time was under half a second).
- Time to resume the normal level of performance was about 2 seconds better in vSphere 5.
(Charts: throughput over time; vMotion duration 23 sec on vSphere 4.1 vs 15 sec on vSphere 5, with impact visible during the guest trace period and the switch-over period.)

vMotion network bandwidth usage during evacuation.

Network settings
- Load-Based Teaming: we will not use it, as we are using 1 GE in this design. If you use 10 GE, the default settings are a good starting point; they give VM traffic twice the shares of the hypervisor.
- NIC teaming: if the physical switch can support it, use IP hash. This needs a stacked switch, i.e., switches that can be managed as if they were one bigger switch (multi-chassis EtherChannel switch is another name). IP hash does not help if the source and destination addresses are constant: for example, vMotion always uses one path only, as the source-destination pair is constant, and the connection from the VMkernel to an NFS server is constant (a small illustration of the hashing behaviour follows this section). If the physical switch can't support it, use source port; you then need to balance manually so that not all VMs go via the same uplink.
- VLAN: we are using VST; the physical switch must support VLAN trunking.
- PVLAN: not used in this design. Most physical switches are PVLAN-aware already; packets will be dropped or security can be compromised if the physical switch is not PVLAN-aware.
- Beacon probing: not enabled, as this design only has 2 NICs per vSwitch; ESXi will flood both NICs if there are only 2.
- Review the default settings: change Forged Transmits to Reject and MAC Address Changes to Reject.

Cross-stack EtherChannel: a pairing of ports that can span two physical LAN switches managed as one logical switch. This is only an option with a limited number of switches available today.
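A simplified illustration of why IP-hash teaming cannot spread a single source/destination pair across links. This is a sketch of the general idea only; it assumes an XOR-of-last-octet style hash, which is close to but not necessarily identical to what ESXi implements, and any deterministic hash shows the same behaviour.

```python
# Deterministic per-pair uplink selection: same pair -> always the same uplink.
def uplink_for(src_ip, dst_ip, n_uplinks):
    last_octet = lambda ip: int(ip.split(".")[-1])
    return (last_octet(src_ip) ^ last_octet(dst_ip)) % n_uplinks

print(uplink_for("10.0.0.5", "10.0.0.50", 2))   # vMotion pair: always the same uplink
print(uplink_for("10.0.0.6", "10.0.0.50", 2))   # a different source may land on the other
```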

Terminology differences between vendors:
- Multiple VLANs on a port: Cisco calls this a trunk; HP calls it tagged, or port aggregation.
- Multiple ports bundled together (for load balancing or failover, not related to VLANs): Cisco calls this an EtherChannel; HP calls it a trunk.
Also note that MEC (multi-chassis EtherChannel) is the best practice from a performance and HA point of view; leverage the Nexus 1000V to simplify the design.
PVLAN: PVLAN is a Cisco feature now available to the Distributed Switch as part of our collaboration. Use it if you want to build a large L2 domain with security.
(Open question: how do we assign/map a VM to use a particular vmnic as its active port?)

VLAN and native VLAN
Native VLAN means the switch can receive and transmit untagged packets. VLAN hopping occurs when an attacker with authorized access to one VLAN creates packets that trick physical switches into transmitting the packets to another VLAN that the attacker is not authorized to access: the attacker forms an ISL or 802.1Q trunk port to the switch by spoofing DTP messages, getting access to all VLANs, or sends double-tagged 802.1Q packets to hop from one VLAN to another, sending traffic to a station it would otherwise not be able to reach. This vulnerability usually results from a switch being misconfigured for native VLAN, as it can then receive untagged packets.
Local vSwitches do not support native VLAN, but the Distributed vSwitch does. All data passed on these switches is appropriately tagged; however, because physical switches in the network might be configured for native VLAN, VLANs configured with standard switches can still be vulnerable to VLAN hopping. If you plan to use VLANs to enforce network security, disable the native VLAN feature on all switches unless you have a compelling reason to operate some of your VLANs in native mode. If you must use native VLAN, see your switch vendor's configuration guidelines for this feature.
VLAN 0: the port group sees only untagged (non-VLAN) traffic. VLAN 4095: the port group sees traffic on any VLAN while leaving the VLAN tags intact.

Source: http://bradhedlund.com/2010/02/10/vswitch-illusion-dmz-virtualization/

Distributed Switch
Design consideration: version upgrade?? Upgrade procedure still to be worked out.

Feature comparison among switches (partial), listed as vSS / vDS / Cisco N1KV:
- VLAN: yes / yes / yes
- Port security: yes / yes / yes
- Multicast support: yes / yes / yes
- Link aggregation: static / static / LACP
- Traffic management: limited / yes / yes
- Private VLAN: no / yes / yes
- SNMP, etc.: no / no / yes
- Management interface: vSphere Client / vSphere Client / Cisco CLI
- NetFlow: no / yes / yes

vNetwork Standard Switch (vSwitch): a closer look
- Port groups are policy definitions for a set or group of ports, e.g., VLAN membership, port security policy, teaming policy, etc.
- Uplinks (physical NICs) are attached to the vSwitch.
- A vSS is defined on a per-host basis from Home > Inventory > Hosts and Clusters.
Source: great slide from Donald Mann, Senior Solutions Architect, ePlus Technology.

vNetwork Distributed Switch: a closer look
- DV port groups span all hosts covered by the vDS and are groups of ports defined with the same policy, e.g., VLAN.
- The DV uplink port group defines uplink policies.
- DV uplinks abstract the actual physical NICs (vmnics) on hosts; the vmnics on each host are mapped to dvUplinks.
- The vDS operates off a local cache, so there is no operational dependency on the vCenter server. The host local cache is under /etc/vmware/dvsdata.db and /vmfs/volumes//.dvsdata. The local cache is a binary file; do not hand-edit it.

Source: great slide from Donald Mann, Senior Solutions Architect, ePlus Technology.

Nexus 1000V: VSM
VM properties:
- Each VSM requires 1 vCPU and 2 GB RAM, which must be reserved, so it will impact the cluster slot size.
- Use "Other Linux 64-bit" as the guest OS.
- Each needs 3 vNICs and requires the Intel e1000 network driver (because no VMware Tools are installed?).
Availability:
- 2 VSMs are deployed in an active-standby configuration, the first functioning in the primary role and the other in a secondary role. If the primary VSM fails, the secondary takes over. They do not use the VMware HA mechanism.
- Unlike crossbar-based modular switching platforms, the VSM is not in the data path. General data packets are not forwarded to the VSM to be processed, but are switched by the VEM directly.

Belmont: see also the best-practices whitepaper on Nexus 1000V on UCS:

http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9902/white_paper_c11-558242.html

Nexus 1000V: the VSM has 3 interfaces for management
- Control interface: used for VSM-VEM and VSM-VSM communication. It handles low-level control packets such as heartbeats, as well as any configuration data that needs to be exchanged between the VSM and VEM. Because of the nature of the traffic carried over it, the control interface is the most important interface in Nexus 1000V. Requires very little bandwidth (