Computer Measurement Group, India 1
Guerrilla Capacity Assessment using USL
A Case Study
Prajakta Bhatt, Infosys December 2013
www.cmgindia.org
Contents
• Typical capacity assessment approach
• Its challenges – the need for USL
• Understanding scalability models
• Answering the WH questions (What, Where, How, When, etc.) on USL
• Assumptions & constraints of USL
• Detailed case study: capacity assessment of a real production DB server (Oracle based)
Introduction
• The key is effective capacity planning: the science and art of predicting the resources (s/w, h/w, connection infrastructure) required to handle additional future loads optimally. It helps to:
• Reduce cost (by avoiding over-sizing)
• Improve productivity (by avoiding under-sizing)
Can my app scale well to take 2x the existing load?
At what time do I need to upgrade my hardware?
Will this new CR interfere with existing app performance?
Traditional Capacity Assessment Approach
Gather NFRs
• Growth projections from Business
• Data from Prod logs
Test Scalability
• Collect throughput, utilization, and service demand data from load tests / prod logs using regression analysis techniques
Build Capacity Model
• Build analytical or simulation models using the service demands
Predict Capacity
• Extrapolate the model
• Predict capacity requirements for critical resources
• Challenges in traditional capacity modeling:
–Predicts capacity assuming theoretical linear scalability
–Uses complex statistical algorithms (like non-linear least squares regression)
–Unable to predict response times
• Any alternatives? USL: the Universal Scalability Law
What is USL?
• The Universal Scalability Law (USL) quantifies scalability for an application setup as a whole (both hardware and software together)
• It helps depict scalability behavior of systems realistically
• Other features:
Universal in Nature
• H/w - disk arrays, SANs, CPUs, multicores
• S/W – Virtual Users, Unix Processes, POSIX threads
• Certain N/W IO Types
Simple to Implement
• No time consuming computation of Service Demand
Fairly accurate predictions
• System Throughput
• System Utilization
• Concurrency at maximum Throughput
• Transaction Response times
Understanding Scalability Models – Perfect Scalability
• Ideally, Scalability should be perfectly linear.
• If there are N processors and X(N) is the load handled by N processors.
If for N = 1, X(N) = 5, Then with N = 10, desired X(N) = 50
• Hence, the capacity model depicting perfect linear scalability is: C(N) = N, i.e. X(N) = N · X(1)
Explaining non-linearity in graphs
• Based on measurements on real multi-processor systems we know that scalability is non-linear.
If for N = 1, X(N) = 5 Then at N = 10, X(N) < 50
• In 1967, Gene Amdahl first recognized that this theoretical linearity cannot be achieved, because certain portions of the workload can only be executed sequentially, and he accounted for this with the contention factor α.
E.g. when there are N processes, each of them competes for shared resources, resulting in contention at various layers, e.g. read/write lock contention, bus contention, etc.
Amdahl’s law implications
• Used in parallel computing to predict theoretical maximum speedup using multiple processors, given as:
Speedup, S(N) = N / (1 + α(N − 1))
Speedup, S(N), is defined as the ratio of the serial execution time to the parallel execution time.
α is the degree of contention, i.e. the part of the task that cannot be parallelized.
• Often used to find the maximum expected improvement to an overall system when only part of the system is improved.
E.g. Assume A & B are independent parts of a work task taking 75% & 25% of execution time.
Make B 5x faster: S(5) = 5 / (1 + 0.75·(5 − 1)) = 1.25; Make A 2x faster: S(2) = 2 / (1 + 0.25·(2 − 1)) = 1.6
Though B's speed-up factor is the greater one (5x), the better overall improvement is achieved by tuning A!
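As a minimal sketch (Python, using the slide's own illustrative numbers), the speedup formula above can be evaluated directly:

```python
def amdahl_speedup(n, alpha):
    """Amdahl's law: S(N) = N / (1 + alpha*(N - 1)),
    where alpha is the serial (non-improved) fraction of the task."""
    return n / (1.0 + alpha * (n - 1))

# Part B (25% of runtime) made 5x faster: the other 75% stays serial here.
print(round(amdahl_speedup(5, 0.75), 2))   # 1.25
# Part A (75% of runtime) made 2x faster: the other 25% stays serial here.
print(round(amdahl_speedup(2, 0.25), 2))   # 1.6
```

This reproduces the slide's conclusion: improving the larger part A yields the bigger overall gain.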
Introducing Scale-up in Amdahl’s law
• Amdahl’s law derives speedup achieved by executing task in a multi-processor environment
• In the real world, by adding more processors we actually try to get more work done rather than reduce response time, i.e. achieve more throughput while keeping response times reasonably constant; this is termed Scale-up.
• Hence we can apply the same Amdahl’s law to scale-up as well:
Scaleup, C(N) = N / (1 + α(N − 1))
• However, in the real world we see that throughput does not keep following this curve: beyond some point it drops off, or even becomes unpredictable.
Scalability – Effect of Coherence
• In 1993, Dr. Neil Gunther defined the Universal Scalability Law, which quantifies scalability quite close to that of realistic systems.
• In addition to the contention factor α (e.g. queuing for shared resources) addressed by Amdahl's law, USL adds a coherency factor β (the latency for shared data to become consistent) to capture this non-linearity.
• USL puts forward the Scaleup by accounting for both contention and coherence as:
Scaleup, C(N) = N / (1 + α(N − 1) + βN(N − 1))
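The scale-up formula above can be sketched as a one-line function (Python; the α, β values below are illustrative, not from the case study):

```python
def usl_capacity(n, alpha, beta):
    """Relative capacity C(N) = N / (1 + alpha*(N-1) + beta*N*(N-1))."""
    return n / (1.0 + alpha * (n - 1) + beta * n * (n - 1))

# Without contention or coherency cost, scaling is perfectly linear:
print(usl_capacity(10, 0.0, 0.0))   # 10.0
# With both costs present, capacity falls short of linear:
print(round(usl_capacity(10, 0.02, 0.0005), 2))
```

With β = 0 the function reduces to Amdahl's law; with β > 0 the curve eventually turns downward (retrograde scaling).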
Contention Vs. Coherency
• Meaning
– Contention (α): degree of contention because of shared data.
– Coherency (β): penalty incurred for maintaining consistency of shared data.
• Example in a DBMS
– Contention: one user process has to wait in a queue to get access to a table row (acquire a DB row lock).
– Coherency: even after the user process gets the DB row lock, it cannot directly update the table; it may have to check whether the data in its cache is stale, and if so, wait for its local copy to become consistent with the latest copy from another CPU's cache before updating. This additional processing is the coherency delay.
• Root cause
– Contention: the part of the program that is serial in nature (cannot be parallelized).
– Coherency: inter-process communication, which increases in proportion to the square of the concurrency.
• Dependent factor
– Contention: factor (N − 1). With N processes, in the worst case a process needs to wait for the other (N − 1) processes to finish before getting hold of the shared resource.
– Coherency: factor N(N − 1). With N processes, each process needs to communicate with the other (N − 1) processes, so inter-process communication scales as N(N − 1).
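The two dependent factors above grow at very different rates, which a tiny sketch (Python) makes concrete:

```python
def contention_term(n):
    """Worst-case waits behind the other N-1 peers (linear in N)."""
    return n - 1

def coherency_term(n):
    """Pairwise inter-process exchanges (quadratic in N)."""
    return n * (n - 1)

# The coherency term dominates at high concurrency:
for n in (2, 10, 100):
    print(n, contention_term(n), coherency_term(n))
```

At N = 100 the coherency term (9900) is two orders of magnitude larger than the contention term (99), which is why coherency delay eventually drives throughput downward.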
Quiz – 1
Bucket the items into their categories:
• Memory thrashing
• Wait to obtain a DB latch to modify a shared structure
• Wait for another thread to update a shared counter
• Cache-miss latency
Contention: memory thrashing; wait to obtain a DB latch to modify a shared structure
Coherency: cache-miss latency; wait for another thread to update a shared counter
Universal Scalability Law (USL) – A revised look
• Linear scalability – without contention and coherence, linear (perfect) scalability is achieved, i.e. C(N) = N
• Contention – The factor α represents the degree of contention because of shared data
• Coherence – The factor β represents the penalty incurred for maintaining consistency of shared data
USL gives the point of maximum throughput, beyond which performance actually degrades!
USL Application
Steps to apply:
1. Collect data points for load C(N) at various concurrency levels (N)
2. By definition, N/C(N) = 1 + α(N − 1) + βN(N − 1), a second-degree polynomial in N; hence transform the data to plot points (X, Y) as: X = (N − 1), Y = N/C(N) − 1
3. Then perform least-squares regression to fit the data to a polynomial of degree 2 (y = ax² + bx + c)
4. Do curve fitting with R² ~ 1, and calculate the values for α, β as: α = b − a, β = a
[Chart: non-linear scalability fit of (N/C(N)) − 1 vs. (N − 1); trendline y = 2E-05x² + 0.0006x, R² = 0.9906]
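The transform-and-fit steps above can be sketched in pure Python. The data here is synthetic, generated from assumed parameters α = 0.02, β = 0.0005 (not the case-study values), so the fit should recover them exactly:

```python
def fit_usl(ns, cs):
    """Estimate USL parameters from concurrency levels ns and measured
    relative capacities cs = C(N).  Transform x = N - 1, y = N/C(N) - 1,
    then least-squares fit y = a*x^2 + b*x (no intercept), so that
    beta = a and alpha = b - a, as on the slide."""
    xs = [n - 1 for n in ns]
    ys = [n / c - 1 for n, c in zip(ns, cs)]
    # Normal equations for [a, b] minimizing sum((a*x^2 + b*x - y)^2):
    s4 = sum(x ** 4 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s2 = sum(x ** 2 for x in xs)
    t2 = sum(y * x ** 2 for x, y in zip(xs, ys))
    t1 = sum(y * x for x, y in zip(xs, ys))
    det = s4 * s2 - s3 * s3
    a = (t2 * s2 - s3 * t1) / det
    b = (s4 * t1 - s3 * t2) / det
    return b - a, a  # alpha, beta

# Synthetic check: generate C(N) from known parameters and recover them.
def c_usl(n, alpha, beta):
    return n / (1 + alpha * (n - 1) + beta * n * (n - 1))

ns = [1, 2, 4, 8, 16, 32, 64]
cs = [c_usl(n, 0.02, 0.0005) for n in ns]
alpha, beta = fit_usl(ns, cs)
print(round(alpha, 4), round(beta, 5))  # 0.02 0.0005
```

Real measurements are noisy, so in practice one would check R² of the fit (as the slide does) rather than expect an exact recovery.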
Leveraging USL for predictions:
• Compute the scale-up value C(N) as per the USL formula
• Predict load (throughput/utilization) for different N points as: Xp(N) = N · C(N)
• Predict maximum concurrency as: Npmax = √((1 − α) / β)
• Predict response times using the user mix, throughput mix & Little's law:
R1 = N1/X1(N) − Z1, R2 = N2/X2(N) − Z2, R3 = N3/X3(N) − Z3
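The two prediction formulas above can be sketched directly (Python; the α, β and load values below are illustrative, not from the case study):

```python
import math

def n_max(alpha, beta):
    """Concurrency at peak throughput: Npmax = sqrt((1 - alpha) / beta)."""
    return math.sqrt((1 - alpha) / beta)

def response_time(n, x_n, z):
    """Little's law with think time Z: R = N / X(N) - Z."""
    return n / x_n - z

print(round(n_max(0.02, 0.0005), 1))   # 44.3
print(response_time(10, 2.0, 1.0))     # 4.0
```

In the slide's workflow, `response_time` would be applied per workload class (N1/X1, N2/X2, N3/X3) with each class's own think time Z.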
*Indicative* USL application for App Server Predictions
SDLC Stage: Testing Phase
1. From performance testing results, collect data points for various #virtual users (N), throughput X(N), and CPU utilization U(N)
2. Transform the data to plot points (X, Y) as: X = (N − 1), Y = N/C(N) − 1
3. Do curve fitting* and compute the regression coefficients a, b, c from a curve with R² close to 100%
4. Compute the contention & coherency parameters as: α = b − a, β = a
Note*: In Excel, the inversion transformation is necessary because Excel cannot fit a rational function by default. Also, to get around precision problems in Excel, serious capacity planners are advised to use statistical tools such as R, MATLAB, etc.
USL App Server Predictions vs. Actual Test Results
[Charts: predicted vs. actual throughput (Xpmax at Npmax) and utilization (Upmax at Npmax)]
Assumptions and Limitations of USL
• The selected workload mix must accurately represent business activities and should be constant.
• All the test conditions (application and database setup, code/configurations) should be the same; only the load may vary.
• For analysis and forecasts it is assumed that the impact on workload and utilization from any application/component other than the client application is negligible.
• As systems don't generally scale as well as they are supposed to, USL is best used for capacity planning as a best-case bound. It is better not to count on getting any more performance than the model indicates!
• Any major change (functional or at the configuration level) to the application, database, or infrastructure can cause the model to change.
• If C(1) is not directly available, choose it so that the other points don't show better-than-linear behavior and it stays within reasonable limits.
• It is observed that the forecast error margins are somewhere between 15% and 20%. However, this can be improved by having sufficient, accurate data points that depict system throughput and concurrency correctly, and by fitting the curve well (R² > 97%).
Database Capacity Assessment Case Study
• A major telecom client wanted a capacity assessment of its critical workforce application, which aids in effective planning, scheduling, and dispatching of work to technicians.
• The application was a combination of Java and .NET platforms, comprising: 58+ application servers for various interface transactions, 15+ virtual servers for background job processing, and 1 Oracle DB server (a Sun SPARC F25K system hosting 48 CPUs), handling 88 business transactions/sec across the various app servers.
• Capacity assessment challenges:
– Capacity projections after actual performance testing: not an option due to time/cost constraints
– No information about the workload mix for each transaction in the database
– No information available about DB service demands for each transaction
– Regression analysis on DB data (MVA): not an option, as no data was available on how a change in application throughput translates into a change in DB throughput
So a paper-based Guerrilla capacity assessment approach was employed to come up with the database projections.
Steps to apply USL
Choose Parameters
• Concurrency N – Average Active Sessions (AAS), Average Session Load (ASL)
• Load – Throughput - X(N), Utilization - U(N)
Collect Data
• Oracle AWR/ Statspack report
• Custom Queries
Consolidate Data
• Generate unified view of: ASL (N), Logical Reads X(N), and CPU Utilization U(N)
Model Data
• Feed the data into the USL tool (Excel-based, or a custom R script)
• Fit the curve well so that the error is low and the α, β values are realistic
Analyze Model
• Find Max Concurrency - Nmax
• Predict Max Load - Xmax, Umax
Step 1a: Choose Data – Throughput Parameter
• Various parameters could represent load on the system: physical reads, user commits + user rollbacks, execute count, session logical reads, CPU utilization
• Session logical reads was selected as the measure of throughput, as it is closely related to 'queries executed on the database': it gives the number of actual buffer reads along with physical reads.
• CPU utilization was selected as the measure of load on the system.
Step 1b: Choosing Data - Concurrency (N)
• We could use the Average Active Sessions (AAS) metric, readily available from the Oracle 10g AWR report.
• However since application DB was Oracle 9i which doesn’t have AWR reports, Average session load (ASL) concept was used:
ASL = (CPU time + time spent in wait events) / elapsed time, where:
CPU time = given by 'CPU used by this session' in the v$sysstat table; it represents the total amount of CPU used by all sessions, excluding background processes.
Time spent in wait events = the time taken by DB events that were not idle, i.e. excluding obvious idle events such as client message, dispatcher timer, lock element cleanup, etc.
Please note: the ASL value indicates the degree of concurrency the database can support. This is not to be confused with the actual number of concurrent users/connections supported by the database system; in real life, queues and other mechanisms exist that can support thousands of concurrent end users.
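The ASL formula above reduces to simple arithmetic per sampling interval; a minimal sketch (Python, with hypothetical timing values in seconds):

```python
def average_session_load(cpu_time_s, wait_time_s, elapsed_s):
    """ASL = (CPU time + non-idle wait time) / elapsed wall-clock time.
    All inputs are in seconds, for one sampling interval."""
    return (cpu_time_s + wait_time_s) / elapsed_s

# e.g. 180 s of CPU plus 240 s of non-idle waits over a 600 s interval:
print(average_session_load(180, 240, 600))  # 0.7
```

An ASL of 0.7 means that, on average, less than one session's worth of work was active during the interval.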
Quiz - 2
For applying USL to other databases, what other parameters could be employed?
• Concurrency – SQL Server: user connections, logical connections, #sessions, etc.; MySQL: queries executing per sec (Threads_running), etc.
• Load – SQL Server: transactions per sec, batch requests per sec, etc.; MySQL: queries received per sec (Questions), etc.
Step 2: Collect Data Details
• ASL & session logical reads can both be measured in two ways: using the Oracle Statspack report, or using custom queries on system tables like v$sysstat and v$system_event.
• Since Statspack reports on the client production environment were readily available only at one-hour intervals, a duration too coarse to capture system dynamics closely, custom queries on the system tables v$sysstat and v$system_event were executed every 10 minutes, and the CPU utilization for each interval was noted.
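Since v$sysstat counters are cumulative from instance startup, each 10-minute figure is the difference between two snapshots. A minimal sketch of that consolidation step (Python; the snapshot values are hypothetical, though 'session logical reads' and 'CPU used by this session' are real Oracle statistic names):

```python
# Hypothetical snapshots of cumulative v$sysstat counters taken 10 min apart.
snap_t0 = {"session logical reads": 1_200_000, "CPU used by this session": 45_000}
snap_t1 = {"session logical reads": 4_800_000, "CPU used by this session": 63_000}

def interval_delta(before, after):
    """v$sysstat values are cumulative since instance start, so the
    per-interval figure is the difference between two snapshots."""
    return {name: after[name] - before[name] for name in after}

print(interval_delta(snap_t0, snap_t1))
```

The same delta logic applies to the non-idle wait times pulled from v$system_event.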
Step 3: Consolidate Data
• Results from the SQL queries were consolidated to get a unified view of ASL (N), total logical reads (X), and average CPU utilization (U) for every 10 minutes.
(Databases 1 and 2 reside on the same physical database server; totals and CPU utilization are for that server.)

Date Time | ASL (N1) | Logical reads (X1) | ASL (N2) | Logical reads (X2) | Total N (N1+N2) | Total X (X1+X2) | CPU Utilization (%)
8/1/2012 9:18 | 0.468 | 1544544 | 6.18 | 35639445 | 6.648 | 37183989 | 19.47
8/1/2012 9:28 | 0.438 | 1057181 | 6.952 | 44029984 | 7.39 | 45087165 | 20.47
8/1/2012 9:38 | 0.396 | 905675 | 5.896 | 35182053 | 6.292 | 36087728 | 21.15
8/1/2012 9:48 | 0.799 | 1968244 | 5.593 | 30522881 | 6.392 | 32491125 | 23.16
.. | .. | .. | .. | .. | .. | .. | ..
8/1/2012 11:08 | 1.445 | 3521876 | 8.669 | 42440075 | 10.114 | 45961951 | 29.38
8/1/2012 11:18 | 4.022 | 2783427 | 8.668 | 53818447 | 12.69 | 56601874 | 27.62
.. | .. | .. | .. | .. | .. | .. | ..
8/1/2012 13:28 | 0.795 | 1937491 | 9.406 | 53881083 | 10.201 | 55818574 | 30.52
8/1/2012 13:38 | 0.687 | 4024107 | 8.404 | 49021605 | 9.091 | 53045712 | 28.99
Step 4: Model Data
This data was then fed into the USL tooling (an R script), and appropriate points were selected so that the USL curve fit well and realistic values for α, β were obtained.
Utilization model (N vs. U): R² = 97.78%, Nmax = 18.33
Throughput model (N vs. X): R² = 99.86%, Nmax = 17.64
Step 5: Analyze Model-Observations and Inferences
• The model is validated, as: curve-fitting efficiency (R²) > 97% for both models, and both the throughput and utilization capacity models indicate Nmax close to 18.
• The throughput capacity model shows a negative α (i.e. no contention), implying improved performance due to parallelization of tasks. This is as expected for a database, since the DB benefits from buffering of data: as the workload increases, data in the buffer becomes more locally available, so more throughput can be serviced without extra work, hence the notion of improved performance.
• In the utilization capacity model we see that at Nmax = 18, CPU utilization is ~30%. Hence the application is not able to scale beyond ~30% CPU utilization.
• This fact was also confirmed by an event that occurred on 31st July, 11 AM-12 PM: due to users directly logging onto some Transaction1 servers, the number of DB connections suddenly increased, leading to higher CPU utilization and poor DB response times.
Conclusion
• The DB hardware has sufficient capacity to handle additional workload, to the tune of a 100% increase.
• Hence, in its current state, it can easily support the workload projections for the next 1 year.
• However, a separate database tuning exercise is recommended to improve the scalability of the application and better utilize the underlying hardware.
• Thus, paper-based scalability assessment through USL helped analyze scalability issues in the application and aided effective capacity planning on real production systems.
References
• Average Session Load
• Interpreting Wait Events to Boost System Performance
• How to Quantify Scalability, Neil J. Gunther
• Forecasting MySQL Scalability with the Universal Scalability Law, Baron Schwartz and Ewen Fortune
• Guerrilla Capacity Planning, Neil J. Gunther, 2005