VMworld 2013: Building a Validation Factory for VMware Partners


VMworld 2013 session by Tim Harris, VMware. Learn more about VMworld and register at http://www.vmworld.com/index.jspa?src=socmed-vmworld-slideshare

Transcript of VMworld 2013: Building a Validation Factory for VMware Partners

Building a Validation Factory for VMware Partners

Tim Harris, VMware

TEX5485

#TEX5485


Disclaimer

This session may contain product features that are

currently under development.

This session/overview of the new technology represents

no commitment from VMware to deliver these features in

any generally available product.

Features are subject to change, and must not be included in

contracts, purchase orders, or sales agreements of any kind.

Technical feasibility and market demand will affect final delivery.

Pricing and packaging for any new technologies or features

discussed or presented have not been determined.


About the Speaker…

Tim Harris:

• At VMware since 2007

• Currently running ISV Validation Program

• Engineering and Lab Resources for TAP members

• Oracle Corp for nearly 10 years

• Managed various performance engineering teams

• Ran Oracle Applications Standard Benchmark effort

• PhD in Computer Science

• Focus on Parallel Computing algorithms and architectures

• BS in Electrical Engineering


Agenda

Validation Services Overview

• Goals and Best Practices

Why Build a Validation Factory?

• Business and Technical Value

Process and Procedures

• Org charts, resources, planning and objectives

Tuning Best Practices and Telco

• What’s challenging today, and how best to solve those challenges


Validation Services


Overview of Validation Services

Engineering Back-End to ISV Alliances

• Lab and Engineer Resources

• Free of Cost, Indirect Revenue for VMware

Performance Validations

• Virtualized Net-New App

Business Continuity/Disaster Recovery

• Site Recovery Manager

• VMware HA, vMotion, DRS, FT

Cloud Migration Services

• vCloud Director

• vApps

• vShield

• Hosting and Billing

[Diagram of the three service pillars: Performance Validations, View and BCDR, Cloud/SaaS]


Setting Goals for a Performance Validation

Primary: Remove blockers for adoption

• As perceived by you, the Partner

VMware is in a supporting role here

• We do not set requirements

Supportability

• Our mutual customers should be happy

Maximize Value Proposition

• Synergy in combined functionality?

• 1 + 1 = 3 opportunities?


Performance Goals

Same performance as physical?

• Is “nearly the same” enough?

What are the application stress points?

• Realtime access to CPU?

• High throughput access to I/O?

• Dynamic memory footprint?

Infrastructure requirements

• Storage requirements

• Load driver requirements

Application level KPIs?

• For small, medium and large customers


Validation Goals and Common vSphere Use Cases

Validation Collaboration

• Many general learning opportunities

What is the likely vSphere configuration?

• Existing cluster of 6 to 12 nodes

• DRS turned on

• Reservations turned off

• HA turned on

• Mix of diverse workloads

vSphere admins may

• Prioritize the good of the many

• Vs the good of the few (applications)

Any conflicts with your best practices?


VMware Ready and Validations

VMware Ready is a marketing certification program

• Applications Category requires some performance testing

• Designed as self-service activity

A completed validation can waive the testing requirements

• If you’ve done good performance work

• We can provide a testing waiver

Testing requirements are modest

• Apply load and observe behavior and capacity


Why Build a Validation Factory?


What Is a Validation Factory?

Validate All Your Applications

• Solution Level, Suite Level, Company Level

Plan for Capacity with Resource Requirements

• Hardware, Manpower, Marketing, Management

• Move from Event to Service model

Leverage results

• Document, Market, Enable the Field

Broaden solutions

• BC/DR, Hybrid Cloud (Private/Public), VDI

Get Certified

• VMware Ready status for all products


Validation Factory: Why Do It?

Provide Suite level virtualization advice

• Combine point products into virtualized solutions

Differentiate from competitors

• Establish technical leadership across products

Provide broader value of single platform

• Point products not sufficient

Enable delivery of specific deployment architectures

• E.g. 5 product suite on 3 node cluster supports 200 users


Process and Procedures


Org Chart and Process

Centralized Resources are easier

• Center of Expertise Model

Two Major Product Categories

• Products that need a full validation to support

• Products that just need the VMware Ready logo

Build Prioritized List

• Easy/Quick wins

• Hard/Longer Challenges

Internal and External Marketing

• Take credit for incremental achievements


Factory Deliverables

Suite Level VMware Ready Status

• vSphere based solutions

• Reference architectures

• Availability story

• Solution Deployment Guide

Span the Gap from R&D to Field

• Key architects in the loop

• Field enabled to understand and sell

Document and Market

• External doc delivered

• Internal message delivered

Planning Your Validation Effort


Validation Process in Agile Sprints

Planning Sprint: 3 weeks

• Iteratively populate test plan template

• HW resource requirements

• Storage volume and throughput

• Workload and Load Driver Tooling

Execution: 3 weeks

• At VMware Labs or ISV Labs

Wrap up: 3 weeks

• Interactively create Field Facing Documents

• Any joint marketing/press releases/VMware Ready logos, etc.

Add concurrency to increase throughput

• Different products can overlap sprints

Plan → Execute → Wrap-up
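
As a rough illustration of that throughput gain, here is a minimal Python sketch, assuming uniform three-week sprints and unconstrained lab capacity (an illustration only, not from the session):

```python
# Illustrative only: calendar time for N products through the three
# 3-week sprints above, serial vs. overlapped scheduling.
SPRINTS = (3, 3, 3)  # Plan, Execute, Wrap-up, in weeks

def serial_weeks(products, sprints=SPRINTS):
    # One product at a time: every product pays the full 9 weeks.
    return products * sum(sprints)

def pipelined_weeks(products, sprints=SPRINTS):
    # A product enters Plan as soon as the previous one moves to Execute,
    # so total time is one full pass plus one sprint per extra product.
    return sum(sprints) + max(sprints) * (products - 1)

print(serial_weeks(4))     # 36 weeks
print(pipelined_weeks(4))  # 18 weeks
```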


Planning Risk Factors

Infrastructure limitations

• Little is learned by testing with insufficient capacity

• Entire benchmark limited by smallest bottleneck

Storage throughput

• Do we know the requirements?

• Can we verify the device can hit requirements?

• E.g. run IOMeter before testing begins (a rough sketch follows this list)

Length of effort

• Assume problems throughout the effort before locking in dates

• Or choose timeline and work backwards to test schedule

• E.g. We plan 2 weeks of testing and reserve 3 weeks of HW
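
As an illustration of that IOMeter-style pre-check, here is a crude Python sketch; the datastore path is hypothetical, and real verification should still use IOMeter or a comparable tool:

```python
import os
import time

def rough_write_throughput(path, size_mb=1024, block_kb=1024):
    """Crude sequential-write probe in MB/s. Not a substitute for IOMeter:
    no read mix, no queue depths, no random I/O -- just a sanity check
    that the device can move bulk data before the benchmark starts."""
    block = b"\0" * (block_kb * 1024)
    start = time.time()
    with open(path, "wb") as f:
        for _ in range(size_mb * 1024 // block_kb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # push past the page cache to the device
    elapsed = time.time() - start
    os.remove(path)
    return size_mb / elapsed

# Hypothetical datastore mount point:
# print(rough_write_throughput("/mnt/test_datastore/probe.bin"))
```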

Executing on Your Validation Effort


Environment Build-out

Assume the build period is largely single-threaded

• Not considered full lab time

Start all staging/installs a week before

• Assume long copy/install/datagen steps

• May include snail mail steps

• Ship USB drives for items bigger than 20G

• 10G and under via FTP

Full install on greenfield VM

• Most common process

vApps (OVFs) arguably better

• But more likely to break size limits for FTP


Load Drivers and Validations

Good load driver is critical to Performance testing

• Not virtualization specific

Load drivers are expensive to build

• Assume 2 man years and 6 calendar months

Bad load drivers don’t represent realistic use cases

• Focus should be on customer critical activities

• Proving the performance of edge cases is a waste of resources

• Load should represent common production load
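
As a toy sketch of a closed-loop driver focused on one customer-critical operation (the endpoint URL here is hypothetical, not from the session):

```python
import statistics
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

TARGET = "http://app.example.com/checkout"  # hypothetical critical path

def one_request(_):
    """Time a single request; a real driver would also classify errors."""
    start = time.time()
    try:
        urllib.request.urlopen(TARGET, timeout=10).read()
    except OSError:
        return None
    return time.time() - start

def run_load(users=20, iterations=50):
    """Closed-loop load: each simulated user issues its next request only
    after the previous one completes, like real production traffic."""
    with ThreadPoolExecutor(max_workers=users) as pool:
        samples = sorted(s for s in
                         pool.map(one_request, range(users * iterations))
                         if s is not None)
    return {"p50": statistics.median(samples),
            "p95": samples[int(len(samples) * 0.95)]}
```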


Physical vs. Virtual Comparisons

Obvious choice, but not always the correct choice

• Costs substantially more

• Adds a bit more value

Assume P-vs-V costs 2X+ more time/resources

• Physical HW setup is slow and inflexible

• Apples to Oranges comparisons common

Apples to Apples is…

• Must remove resources from physical to match VM

• VM must not consume all physical resources

• Hypervisor will have resources in production

• Needs to have resources in testing too


Tuning Best Practices and Telco


Executive Summary: vSphere Tuning in Last 5 Years

Used to be scary – now they just work:

• High I/O Applications: Run at wire speed now

• Monster VM type workloads: Big iron now in a VM

• Enterprise use cases for Linux: Now safer

What’s still hard?

• Realtime requirements under 1 ms

• ESX 3.5 – 100 ms

• ESX 4 and 5 – 10 ms

• ESX 5.1 and 5.5 – working on sub-ms (100s of microseconds) now

• vMotion of Huge Realtime VMs

• 64 GB in-memory DBs like to stay still


Example Telco Workload Challenges

Service Provider Use Cases

• Large SaaS deployments

BC/DR QOS Built into application

• Realtime active/passive failover

Conservative by nature

• “Don’t try and fix it if you might break it”

Realtime Transaction Rates

• Latency requirements of <10ms


Tuning Strategies

A shopping list of tunables may be misused

• Changes for change’s sake

Experimental science says

• Make one change at a time

• Assess value of change

• Remove or move on to next change

Prioritize by relative impact

• No reason to make a change if it can’t solve a problem


Large Tuning Knobs Available

Incrementally back off virtualization

• Realtime demands likely can be met

Reservations for CPU and Memory

• Hard allocation of resources (a sketch follows this list)

If truly needed – CPU Affinity

• Exclusive, or with the halt_desched flag

If truly needed – NIC passthrough

• With SRIOV or not

Horizontally scaled apps

• Still have less scheduling overhead

Storage design still critical

• Ensure IOPS are available before tuning
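
As a sketch of the simplest knob above, hard reservations can be set programmatically; this example assumes pyVmomi (the open-source Python bindings for the vSphere API) and an already-connected vim.VirtualMachine object:

```python
from pyVmomi import vim  # open-source vSphere API bindings

def set_hard_reservations(vm, cpu_mhz, mem_mb):
    """Reserve CPU (MHz) and memory (MB) for a VM -- the 'simple
    reservations first' knob, before resorting to affinity."""
    spec = vim.vm.ConfigSpec()
    spec.cpuAllocation = vim.ResourceAllocationInfo(reservation=cpu_mhz)
    spec.memoryAllocation = vim.ResourceAllocationInfo(reservation=mem_mb)
    # Returns a task; the caller should wait on it and check for errors.
    return vm.ReconfigVM_Task(spec=spec)
```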


Advanced Tuning: CPU Affinity

CPU Affinity (aka Pinning)

• Rumored to be critical for VOIP

• Our data shows little gain with vSphere 4.x and before

Affinity and vSphere 5.0

• Allows “Exclusive Affinity”

• Previously, cores still accessible to other VMs despite affinity

[Chart: Max DSP execution time in milliseconds, without vs. with Exclusive Affinity, relative to the SLA]
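
A minimal pyVmomi sketch of plain pinning, under the same assumptions as the earlier reservation example:

```python
from pyVmomi import vim

def pin_vm_to_cores(vm, cores):
    """Plain CPU affinity (pinning): restrict which physical CPUs this VM
    may run on. Note this is NOT the Exclusive Affinity described above --
    other VMs can still be scheduled onto the same cores."""
    spec = vim.vm.ConfigSpec(
        cpuAffinity=vim.vm.AffinityInfo(affinitySet=list(cores)))
    return vm.ReconfigVM_Task(spec=spec)

# e.g. pin_vm_to_cores(vm, [2, 3, 4, 5])
```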


Halt Desched vs. Affinity vs. Latency Sensitive

“Pre-Allocating” CPU resources to a VM

• Reducing benefits of virtualization (vMotion, overcommit)

• Reducing scheduling overhead

Hierarchy of Techniques

• Simple reservations first

• Exclusive CPU Affinity (5.0 and beyond)

• Halt Desched option

Latency Sensitive UI available in 5.1 and beyond

• At highest setting, equivalent to Exclusive CPU affinity

Halt Desched

• vCPUs at 100% usage even if no work being done

• monitor_control.halt_desched set to FALSE
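
A hedged pyVmomi sketch of the hierarchy above; verify that your vSphere release accepts these settings before relying on them:

```python
from pyVmomi import vim

def apply_latency_tuning(vm, use_halt_desched=False):
    """Walk the hierarchy from the slide: the Latency Sensitivity setting
    (5.1+ UI; 'high' is roughly Exclusive CPU Affinity), optionally plus
    the halt_desched advanced option."""
    spec = vim.vm.ConfigSpec()
    spec.latencySensitivity = vim.LatencySensitivity(
        level=vim.LatencySensitivity.SensitivityLevel.high)
    if use_halt_desched:
        # vCPUs will report 100% usage even when idle, per the slide.
        spec.extraConfig = [vim.option.OptionValue(
            key="monitor_control.halt_desched", value="FALSE")]
    return vm.ReconfigVM_Task(spec=spec)
```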


Horizontal Scaling and Latency Sensitivity

Scheduling overhead a function of vCPUs per VM

• 4 to 8 vCPU VMs may be our sweet spot

Many Applications scale horizontally effectively

• Doesn’t need to impact aggregate resources for an application

• E.g. double VM count and halve vCPUs per VM

• Trade-offs with management overhead of more VMs

Expect less jitter with smaller VMs

• Empirical result across many workloads


Non-Uniform Memory Access (NUMA) Impacts

Physical Memory Spread across NUMA Nodes

• Typically one node per socket

Access to remote node’s memory expensive

• Access to local node “cheap”

Monitor from ESXtop

• NUMA stats: %local memory should be 100

• vSphere 5 more NUMA aware than previous

• Smallish VMs and smallish RAM are the best case

Align Core count per socket with vCPUs

• Fully occupy integer socket count

Disable “Node Interleaving” at BIOS to enable NUMA

• Node interleaving (enabled) leads to consistent but poor performance
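
A small sketch of that alignment check, assuming the host's cores-per-socket count is known:

```python
def numa_fit(vcpus, cores_per_socket):
    """Check the sizing advice above: a VM should fit inside one NUMA
    node, or fully occupy an integer number of sockets."""
    if vcpus <= cores_per_socket:
        return "fits one node -- best case, %local memory should stay ~100"
    if vcpus % cores_per_socket == 0:
        return f"spans exactly {vcpus // cores_per_socket} nodes -- aligned"
    return "straddles a partial node -- expect remote-memory penalties"

print(numa_fit(8, 8))    # fits one node
print(numa_fit(12, 8))   # straddles a partial node
```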


Advanced Tuning: Direct Path I/O

Direct Path I/O (aka NIC Pass-through)

• Disables vMotion

• Makes physical NIC available for only one VM

Substantial jitter improvements in realtime workloads

• But at substantial cost in vSphere functionality

SRIOV provides alternative

• Reusable NIC with vMotion and Pass-through

[Chart: Worst-case latency in milliseconds, without vs. with Direct Path I/O, relative to the SLA]


Interrupt Management and Latency Sensitive Workloads

Interrupt coalescing in vSphere 4.x and 5

• Does “Adaptive Interrupt Coalescing” by default

• Groups interrupts to reduce impact and CPU

• Group size (queue depth) dynamically adjusts to the workload

Adaptive coalescing may introduce latency

• Can disable coalescing for latency sensitive workloads

• Some improvements observed, but not always a win

Pinning of interrupts

• Likely used with CPU pinning

• Keeps all interrupts on the same vCPU and hence the same pCPU

• Modest gain – test before using
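
A pyVmomi sketch of disabling coalescing on a single vNIC; the advanced option name is taken from VMware's latency-sensitivity guidance and should be confirmed against your release:

```python
from pyVmomi import vim

def disable_vnic_coalescing(vm, nic_index=0):
    """Turn off adaptive interrupt coalescing for one vmxnet3 vNIC via the
    ethernetX.coalescingScheme advanced option. Modest gain at best --
    measure before and after, per the slide."""
    spec = vim.vm.ConfigSpec(extraConfig=[
        vim.option.OptionValue(
            key="ethernet%d.coalescingScheme" % nic_index,
            value="disabled"),
    ])
    return vm.ReconfigVM_Task(spec=spec)
```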


Latency Sensitive Tuning and Overcommitment

Safest solution – undercommit physical cores on each host

• E.g. 16 core server runs no more than 14 vCPUs

• 1-2 cores per host and 2G of RAM uncommitted

Challenges with undercommitment

• HW utilization, DRS in cluster with mixed workloads, etc.

• Most viable with dedicated (to one app) clusters

Alternative approaches

• CPU Affinity locks a VM to cores

• Other cores available for general use in cluster
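
That undercommit rule of thumb reduces to simple arithmetic; a one-line sketch:

```python
def max_vcpus(physical_cores, hypervisor_cores=2):
    """Undercommit rule of thumb: leave 1-2 cores (and ~2 GB RAM) per
    host unallocated for the hypervisor itself."""
    return max(physical_cores - hypervisor_cores, 0)

assert max_vcpus(16) == 14  # the 16-core example above
```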


Realtime Tuning Summary

Start with simple techniques

• Reservations, BIOS tuning, etc

Move towards pre-allocation of resources

• CPU Exclusive Affinity if CPU bound

• NIC-passthrough if network bound

Consider horizontal scaling of configuration

• More, smaller VMs

Test one change at a time and iterate

• Don’t overlap your changes


Telco Progress In-flight

Active Efforts with Nearly Every Global Telco Provider

• Some solutions in market, more on the way

Easy-to-virtualize pieces definitely exist

• Careful prioritization of efforts underway

Realtime workloads are achievable

• 2ms for compute and packet send consistently achievable (5.1)

• <1ms QOS work in progress (5.5?)

Availability still adds value

• Augment built in availability story

• Protect previously unprotected components


Validation Factory Summary

Vendors see value in suite-level solution design

• TAP program can provide support for such efforts

VMware Ready status for all applications

• Detailed performance assessment for some

What was once hard is now possible

• Most challenging applications successfully virtualized today


Questions?

THANK YOU

Building a Validation Factory for VMware Partners

Tim Harris, VMware

TEX5485

#TEX5485