Applying Benchmark Data To A Model for Relative Server Capacity CMG 2013
Applying Benchmark Data To A Model for Relative Server Capacity
CMG 2013
Joseph Temple, LC-NS Consulting
John J Thomas, IBM
Relative Server Capacity
• “How do I compare machine capacity?”
• “What platform is the best fit to deliver a given workload?”
• Simple enough questions, but difficult to answer!
• Establishing server capacity is complex
− Different platform design points
− Different machine architectures
− Continuously evolving platform generations
• “Standard” benchmarks (SPECint, TPC-C etc.) and composite metrics (RPE2, QPI etc.) help, but may not be sufficient
− Some platforms do not support these metrics
− May not be sufficient to decide best fit for a given workload
• We need a model to address Relative Server Capacity
− See “Alternative Metrics for Server RFPs” [J. Temple]
Local Factors / Constraints
[Figure: local factors and constraints (Non-Functional Requirements, Technology Adoption, Strategic Direction, Cost Models, Reference Architectures) shown with the platforms System z, System x and Power, under the label Workload Fit]
Fit for Purpose Workload Types
Mixed Workload – Type 1
• Scales up
• Updates to shared data and work queues
• Complex virtualization
• Business Intelligence with heavy data sharing and ad hoc queries
Highly Threaded – Type 2
• Scales well on large SMP
• Web application servers
• Single instance of an ERP system
• Some partitioned databases
Parallel Data Structures – Type 3
• Scales well on clusters
• XML parsing
• Business intelligence with Structured Queries
• HPC applications
Small Discrete – Type 4
• Limited scaling needs
• HTTP servers
• File and print
• FTP servers
• Small end user apps
Workload factors: Application Function, Data Structure, Usage Pattern, SLA, Integration, Scale
(On the original slide, design factors are shown in black and local factors in blue.)
Fitness Parameters in Machine Design
• Can be customized to machines of interest; need to know the specific comparisons desired
• These parameters were chosen to represent the ability to handle parallel, serial and bulk data traffic
• This is based on Greg Pfister’s work on workload characterization in “In Search of Clusters”
Key Aspects Of The Theoretical Model
• Throughput (TP)
− Common concept: Units of Work Done / Units of Time Elapsed
− Theoretical model defines TP as a function of Thread Speed: TP = Thread Speed x Threads
− Thread Speed is calculated as clock rate x Threading multiplier / Threads per Core
− Threading multiplier is the increase in throughput due to multiple threads per core
• Thread Capacity (TC)
− Throughput (TP) gives us an idea of instantaneous peak throughput rate; in order to sustain this rate the load has to keep all threads of the machine busy
− In the world of dedicated systems, TP is the parameter of interest because it tells us the peak load the machine can handle without causing queues to form
− However, in the world of virtualized/consolidated workloads, we are stacking multiple workloads on threads of the machine; thread capacity is an estimator of how deep these stacks can be
− Theoretical model defines TC as: TC = Thread Speed x Cache per Thread
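The TP and TC definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' tooling: the `Machine` class, its field names, and the example values are hypothetical, and "cache per thread" is assumed here to mean total cache divided by hardware threads, since the slide does not say which cache level is counted.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    """Hypothetical container for the inputs named on the slide."""
    clock_ghz: float             # clock rate
    cores: int
    threads_per_core: int
    threading_multiplier: float  # throughput gain from multiple threads per core
    total_cache_mb: float        # assumption: total cache shared by all threads

    @property
    def threads(self) -> int:
        return self.cores * self.threads_per_core

    def thread_speed(self) -> float:
        # Thread Speed = clock rate x Threading multiplier / Threads per Core
        return self.clock_ghz * self.threading_multiplier / self.threads_per_core

    def throughput(self) -> float:
        # TP = Thread Speed x Threads
        return self.thread_speed() * self.threads

    def thread_capacity(self) -> float:
        # TC = Thread Speed x Cache per Thread
        cache_per_thread = self.total_cache_mb / self.threads
        return self.thread_speed() * cache_per_thread


# Made-up machine, for illustration only (not a real product specification).
example = Machine(clock_ghz=3.0, cores=8, threads_per_core=2,
                  threading_multiplier=1.5, total_cache_mb=32.0)
print(example.thread_speed(), example.throughput(), example.thread_capacity())
```

The absolute numbers carry no units of their own; they are only meaningful as ratios between two machines described with the same conventions.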
Throughput, Saturation, Capacity
[Figure: relates TP, measured ITR and capacity. Labels: TP (pure parallel CPU); ITR (other resources and serialization); ETR (load and response time)]
Single Dimension Metrics Do Not Reflect True Capacity
• The “standard metrics” do not leverage cache. This leads to the pure ITR view of relative capacity on the right.
• Common Metrics: ITR TP; ETR ITR
− Power advantaged
− z is not price competitive
• Consolidation: ETR << ITR unless loads are consolidated
− Consolidation accumulates working sets
− Power and z advantaged
− Cache can also mitigate “Saturation”
Bridging Two Worlds - I
• There appears to be a disconnect between “common benchmark metrics” and “theoretical model metrics” like TP
• Does this mean metrics like TP are invalid? No
− We see the effect of TP/TC in real world deployments: a machine performs either better or poorer than what a common benchmark metric would have suggested
• Does this mean benchmark metrics are useless? No
− They provide valuable data points
• A better approach would be to try and bridge these two worlds in a meaningful way
Bridging Two Worlds - II
• Theoretical model calculates TP and TC using estimated values for thread speed
− Based on machine specifications
• Example: TP calculation for POWER7
− A key factor in TP calculation is Thread Speed, which in turn depends on the value of the thread multiplier
− But this factor is only an estimate
− We estimated the thread multiplier for POWER7 in SMT-4 mode was 2
• However, using an estimate for thread speed assumes common path length and linear scaling
• An inherent problem here: these estimates are not measured or specified using any common metric across platforms
− As an example, should the thread multiplier be the same for POWER7 in SMT-2 mode as Intel running with HyperThreading?
• Recommendation: Refine factors in the theoretical model with benchmark results
− Instead of using theoretical values for thread speed, pathlength etc., plug in benchmark observations
Two Common Categories Of Benchmarks
• Stress tests
− Measure raw throughput
− Measure the maximum throughput that can be driven through a system, focusing all system resources to this particular task
• VM density tests
− Consolidation ratios (VM density) that can be achieved on a platform
− Usually do not try to maximize throughput of a system
− They usually look at how multiple workloads can be stacked efficiently to share the resources on a system, while delivering steady throughput
• Adjusting Thread Speed affects both TP and TC
Example of a Stress Test, A Misleading One If Used In Isolation!
• Online trading WAS ND workload (TradeLite) driven as a stress test
− 2ch/16co Intel 2.7GHz Blade: Peak ITR 3467 tps
− Linux on System z, 16 IFLs: Peak ITR 3984 tps
• This benchmark result is quite misleading: it suggests a z core yields only 15% better ITR, but we know that z has much higher “capacity”
• What is wrong here?
− System z design point is to run multiple workloads together, not a single atomic application under stress
− This particular application doesn’t seem to leverage many of z’s capabilities (cache, IO etc.)
• Can this benchmark result be used to compare capacity?
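The "15% better" figure can be checked directly from the two peak ITR values. The sketch below assumes the 3467 tps result belongs to the 16-core Intel blade and the 3984 tps result to the 16-IFL System z configuration, which is the pairing consistent with the slide's statement; the variable and function names are illustrative.

```python
# Peak ITR values from the slide; the platform pairing is an assumption
# inferred from the "z core yields only 15% better ITR" statement.
INTEL_PEAK_ITR_TPS = 3467.0
Z_PEAK_ITR_TPS = 3984.0
CORES_PER_SYSTEM = 16  # 2ch/16co Intel blade and 16 IFLs


def per_core_itr(peak_itr_tps: float, cores: int) -> float:
    """Peak internal throughput rate attributed to one core."""
    return peak_itr_tps / cores


ratio = (per_core_itr(Z_PEAK_ITR_TPS, CORES_PER_SYSTEM)
         / per_core_itr(INTEL_PEAK_ITR_TPS, CORES_PER_SYSTEM))
advantage_pct = (ratio - 1.0) * 100.0
print(f"z per-core ITR advantage: {advantage_pct:.1f}%")
```

Because both systems have 16 cores, the per-core ratio equals the ratio of the raw peak ITR values, which comes out at roughly 15%.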
Use Benchmark Data To Refine Relative Capacity Model
• Calculate Effective thread speed from measured values
− What is the benchmarked thread speed?
− Normalizing thread speed and clock to a platform allows us to calculate pathlength for a given platform
− This in turn allows us to calculate Effective thread speed
• Doing this affects both TP and TC
• Plug Effective thread speed values into the Relative Capacity calculation model
Use Benchmark Data To Refine Relative Capacity Model - Results
• Benchmarked thread speed: ITR / Threads
• Pathlength: Clock ratio / Threadspeed ratio
• Relative capacity: Effective Threadspeed * Total Threads * Cache/Thread
• In this case, System z ends up with a 13.5x Relative Capacity factor, relative to Intel
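The three expressions on this slide (ITR / Threads, Clock ratio / Threadspeed ratio, and Effective Threadspeed * Total Threads * Cache/Thread) can be written as small functions. This is a sketch under stated assumptions: the function names are hypothetical, the transcript does not show how effective thread speed is derived from pathlength (so it is taken as an input here), and the thread counts, z clock rate and cache inputs behind the 13.5x result are not in the transcript, so the example does not try to reproduce that figure.

```python
def benchmarked_thread_speed(itr: float, threads: int) -> float:
    """ITR / Threads."""
    return itr / threads


def relative_path_length(clock_ratio: float, thread_speed_ratio: float) -> float:
    """Clock ratio / Threadspeed ratio, both normalized to the same base platform."""
    return clock_ratio / thread_speed_ratio


def relative_capacity(effective_thread_speed: float, total_threads: int,
                      cache_per_thread: float) -> float:
    """Effective Threadspeed * Total Threads * Cache/Thread."""
    return effective_thread_speed * total_threads * cache_per_thread


# Illustration with the stress-test ITR values. The thread counts (32 and 16)
# and the z clock rate (5.5 GHz) are assumptions, not taken from the slides.
intel_speed = benchmarked_thread_speed(3467.0, threads=32)
z_speed = benchmarked_thread_speed(3984.0, threads=16)
path_length = relative_path_length(clock_ratio=5.5 / 2.7,
                                   thread_speed_ratio=z_speed / intel_speed)
print(f"z pathlength relative to Intel (assumed inputs): {path_length:.2f}")
```

Dividing the relative capacity of one platform by that of another gives a relative capacity factor of the kind the slide reports.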
Example of a VM Density Test: Consolidating Standalone VMs With Light CPU Requirements
• Light workloads: online banking WAS ND workloads, each driving 22 transactions per second with light I/O
− Common x86 hypervisor, 2ch/16co Intel 2.7GHz Blade: 48 VMs per IPAS Intel blade
− PowerVM, 2ch/16co POWER7+ 3.6GHz Blade: 68 VMs per IPAS POWER7+ blade
− z/VM on zEC12, 16 IFLs: 100 VMs per 16-way z/VM
• Consolidation ratios derived from IBM internal studies. Results will vary based on workload profiles/characteristics.
Use Benchmark Data To Refine Relative Capacity Model - Results
• Follow a similar exercise to calculate effective thread speed
• Each VM is driving a certain fixed throughput
− This test used a constant injection rate
− If throughput varies (for example, holding a constant think time), need to adjust for that
• Calculate benchmarked thread speed
• Normalize to a platform to get path length
• Calculate effective thread speed
• Plug into relative server capacity calculation
• In this case, System z ends up with a 22.2x Relative Capacity factor relative to Intel
Math Behind Consolidation
Roger’s Equation: Uavg = 1 / (1 + HR(avg))
Where HR(avg) = k * c / sqrt(N)
For consolidation:
• N is the number of loads (VMs)
• k is a design parameter (Service Level)
• c is the variability of the initial load
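A short sketch of the equation follows, assuming the headroom term is HR(avg) = k * c / sqrt(N), which is the reading that reproduces the 6x, 3.5x, 2.25x and 1.42x peak-to-average ratios in the worked slides. It also assumes k = 2, since a 97.7% service level corresponds to about two standard deviations of a normal distribution; the function names are illustrative.

```python
import math


def headroom_ratio(k: float, c: float, n: int) -> float:
    """HR(avg): headroom relative to average demand for n pooled loads."""
    return k * c / math.sqrt(n)


def average_utilization(k: float, c: float, n: int) -> float:
    """Uavg = 1 / (1 + HR(avg))."""
    return 1.0 / (1.0 + headroom_ratio(k, c, n))


K = 2.0  # assumption: 97.7% SLA treated as two standard deviations
C = 2.5  # coefficient of variation used in the worked slides

for n in (1, 4, 16, 144):
    peak_to_average = 1.0 + headroom_ratio(K, C, n)
    utilization_pct = average_utilization(K, C, n) * 100.0
    print(f"N={n:3d}  peak/average={peak_to_average:.2f}  "
          f"utilization={utilization_pct:.1f}%")
```

Average utilization is the reciprocal of the peak-to-average ratio, so it rises toward 100% as N grows and the headroom term shrinks.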
Larger Servers With More Resources Make More Effective Consolidation Platforms
Most workloads experience variance in demand
When you consolidate workloads with variance on a virtualized server, the variance of the sum is less (statistical multiplexing)
The more workloads you can consolidate, the smaller is the variance of the sum
Consequently, bigger servers with capacity to run more workloads can be driven to higher average utilization levels without violating service level agreements, thereby reducing the cost per workload
A Single Workload Requires a Machine Capacity Of 6x the Average Demand
• Average demand: m = 10/sec
• Server capacity required: 60/sec (6x peak to average)
• Server utilization = 17%
• Assumes coefficient of variation = 2.5, required to meet 97.7% SLA
Consolidation Of 4 Workloads Requires Server Capacity Of 3.5x Average Demand
• Average demand: 4*m = 40/sec
• Server capacity required: 140/sec (3.5x peak to average)
• Server utilization = 28%
• Assumes coefficient of variation = 2.5, required to meet 97.7% SLA
Consolidation Of 16 Workloads Requires Server Capacity Of 2.25x Average Demand
• Average demand: 16*m = 160/sec
• Server capacity required: 360/sec (2.25x peak to average)
• Server utilization = 44%
• Assumes coefficient of variation = 2.5, required to meet 97.7% SLA
Consolidation Of 144 Workloads Requires Server Capacity Of 1.42x Average Demand
• Average demand: 144*m = 1440/sec
• Server capacity required: 2045/sec (1.42x peak to average)
• Server utilization = 70%
• Assumes coefficient of variation = 2.5, required to meet 97.7% SLA
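The capacity figures in these four examples follow from adding the demands of the pooled workloads and sizing for the mean plus k standard deviations. The sketch below assumes the workloads are independent (so variances add and the standard deviation of the sum grows with sqrt(N)) and again takes k = 2 for the 97.7% service level; the function name is hypothetical.

```python
import math


def required_capacity(mean_per_workload: float, cov: float, k: float, n: int) -> float:
    """Capacity needed for n pooled workloads at mean + k standard deviations.

    Assumes independent, identically distributed workloads, so the standard
    deviation of the total grows with sqrt(n) while the mean grows with n.
    """
    total_mean = n * mean_per_workload
    sigma_one = cov * mean_per_workload
    sigma_total = math.sqrt(n) * sigma_one
    return total_mean + k * sigma_total


M = 10.0   # average demand per workload, per second (from the slides)
COV = 2.5  # coefficient of variation (from the slides)
K = 2.0    # assumption: 97.7% SLA treated as two standard deviations

for n in (1, 4, 16, 144):
    capacity = required_capacity(M, COV, K, n)
    print(f"N={n:3d}  average={n * M:.0f}/sec  capacity={capacity:.0f}/sec  "
          f"ratio={capacity / (n * M):.2f}")
```

For 144 workloads this calculation gives 2040/sec; the 2045/sec on the slide corresponds to multiplying 1440/sec by the rounded ratio 1.42.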
Let’s Look At Actual Customer Data
• Large US insurance company
• 13 Production POWER7 frames
− Some large servers, some small servers
• Detailed CPU utilization data
− 30 minute intervals, one whole week
− For each LPAR on the frame
− For each frame in the data center
• Measure peak, average, variance
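The peak, average and variance measurements can be computed per LPAR or per frame from the interval samples. The sketch below is one way to do that with the Python standard library; the `summarize` function and the sample values are hypothetical, not the customer's data. One week at 30 minute intervals would give 336 samples per series.

```python
from statistics import mean, pvariance


def summarize(utilization_samples: list[float]) -> dict[str, float]:
    """Peak, average, variance and derived ratios for one utilization series."""
    average = mean(utilization_samples)
    peak = max(utilization_samples)
    variance = pvariance(utilization_samples)  # population variance of the samples
    return {
        "peak": peak,
        "average": average,
        "variance": variance,
        "peak_to_average": peak / average,
        "coefficient_of_variation": variance ** 0.5 / average,
    }


# Made-up CPU utilization percentages for a handful of intervals.
samples = [20.0, 30.0, 40.0, 30.0]
print(summarize(samples))
```

The peak-to-average ratio computed this way is the quantity plotted against LPAR count when comparing frames.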
Detailed Data Example: One Frame
[Charts for frame PAF5PDC over one week (12/9 to 12/16): CPU % on a 0 to 100 scale, labeled MSP159; Cores on a 0 to 12 scale, with series All and Guidewire]
Workloads vs. Peak-to-Average (Final Theoretical Model Overlaid)
[Chart: Peak To Average Ratio (0 to 8) versus LPAR Count (0 to 60)]
Customer Data Confirms Theory
• Servers with more workloads have less variance in their utilization and less headroom requirements
Consolidation Observations
• There is a benefit to large scale servers
− The headroom required to accommodate variability goes up only by sqrt(n) when n workloads are pooled
− The larger the shared processor pool is, the more statistical benefit you get
− Large scale virtualization platforms are able to consolidate large numbers of virtual machines because of this
• Servers with capacity to run more workloads can be driven to higher average utilization levels without violating service level agreements
Summary
• We need a theoretical model for relative server capacity comparisons
• Purely theoretical models need to be grounded in reality
− Atomic benchmarks can sometimes be quite misleading in terms of overall system capability
− Refine theoretical models with benchmark measurements
• Real world (customer) data trumps everything!
− Validates or negates models
− Customer data validates sqrt(n) model for consolidation