
Cost Based Performance Modeling – dealing with performance “uncertainties”

Eugene Margulis

Telus Health Solutions, Ottawa [email protected]

Abstract. Traditional performance evaluation methods (e.g. the “big system test”, instrumentation) are costly and do not deal well with the inherent performance related “uncertainties” of modern systems. There are three such uncertainties: requirements that are either unclear or change from deployment to deployment, 3rd party code that one has no direct access to, and the variable h/w platform. These uncertainties make exhaustive testing impractical, and “worst case” testing results in engineering for the “worst impossible case” rather than for a realistic customer scenario. Creating a single model that is based on traceable and repeatable test results of individual system components (transactions) saves a large amount of effort/cost on performance related engineering/QA and provides an almost instantaneous "what if" analysis for product planning.

Introduction

The primary goal of performance/capacity activities is the ability to articulate and quantify the resource requirements for a given system behavior on given h/w, subject to a number of constraints.

The resource requirements may apply to CPU, memory, network bandwidth, size of thread pools, heap sizes of individual JVMs, disk sizes, etc. There are many different types of resources that the system uses while processing its “payload”.

The behavior of the system defines the expected quantifiable use patterns. For example, the number of network elements a network controller is connected to and the frequency of events from those network elements to the controller represent one aspect of system behavior. A system can have several “behaviors” – e.g. payload processing, upgrade, overload, etc.

The constraints are the additional sets of requirements that define what behaviors are “useful” (from the user perspective). For example, it may not be useful to process 10 events per second if the response time per event is greater than 20 seconds. Or it may not be useful to process 10 events per second if you cannot retain event logs for 30 days.

The h/w defines the target h/w environment for system deployment. An 8-core, 32-“strand”, 1 GHz processor system with 8 GB of memory may be a better fit for one behavior pattern favoring throughput, whereas a single core at 2 GHz might fit another behavior where response time is more important.

The goal is to create a quantifiable relation between behaviors, constraints, resource requirements and h/w. This can, of course, be accomplished using “brute force” testing, but the size of the “test space” makes this approach impractical. Traditionally this has been addressed by testing for “small/medium/large” configurations. Unfortunately many deployments/behaviors do not easily scale; they are simply different rather than larger or smaller. (E.g. how do you compare a deployment A that requires 30 day historical event retention and 3 events/sec with a deployment B that requires 1 day retention but 20 events/sec?)

In addition, the traditional methods of performing most test measurements in a high capacity lab are often costly, inflexible and in most cases can only be done at the tail end of the development process when discovered problems are very expensive to fix or mitigate. Cost based capacity modeling provides a lean and flexible approach to performance and capacity evaluation resulting in a significant reduction of performance related R&D costs.

In this paper we will describe cost based performance modeling, how it applies to the different performance/capacity QA activities and outline the benefits of using this approach.


What are the challenges and uncertainties? The key challenges during development of such systems, from a performance perspective, are dealing with three uncertainties:

Behavior Uncertainty – because of multiple deployment scenarios one is never quite certain of the conditions the system is going to be used in. If the product that is being developed is new and provides a new capability there is little historical data on how it might be used.

Code Uncertainty – in the past, when most of the code was developed “in-house”, one could always have direct access to the code, test it in isolation, contact the developers and understand the resource requirements. When using 3rd party code this is no longer the case. Such code can only be treated as a “black box”; often one cannot even rebuild it with debug/optimization options.

H/W uncertainty – the underlying h/w architecture is no longer fixed. If it takes 6 months to develop a system, the deployment h/w may be very different from the h/w available in the lab.

Performance is no longer a verification exercise. Traditionally, performance activities during system development were viewed as a verification exercise. It was assumed that there was a set of requirements, defined h/w and a stable system to test on. The uncertainties mentioned above make performance analysis more exploratory in nature, where the focus of performance activities is to discover and identify system operating costs and limits rather than to validate a specific scenario. (Direct validation/verification of a specific scenario/behavior is still the best option – provided that this scenario is known.)

Efficient articulation and sharing of performance information. In addition to these uncertainties there are organizational challenges of sharing performance information between different groups of stakeholders. The performance profile is usually a multi-dimensional problem (multiple resources, multiple constraints, multiple behaviors). It is important to articulate and communicate this kind of complex information effectively and consistently. A development group may span continents, time zones and language boundaries – designers can be in India, architects in Canada, testers in China and customers in Spain. It is important to make sure that when someone refers to the “cost of event processing” everyone knows exactly what it means and what the implications are.

Performance estimates available at any time. Given the length of the development cycle and the size of modern systems, it is usually too late to identify/address performance issues after the code “freeze”. It is important to be able to determine performance bounds as early as possible. For example, if the event-processing component is ready to be tested before historical event reporting, then there is no reason to wait until everything is ready before determining the cost of event processing. The initial cost estimates of a given component (e.g. event processing) should be determined as soon as the component has basic functionality and should be refined throughout the development cycle. A quantifiable “best guess” of the system performance should be available at any time, and the accuracy and confidence of such a “guess” should improve continuously throughout the development cycle.

Cost Based Transactional Model

3+ way view The system performance and capacity relationship can be visualized as the following “3+ view”, where the Cost Model provides the mapping between the system behavior (e.g. events per second), costs (cost per event), and the total resource requirements. The relationship is subject to a number of constraints and h/w characteristics:


[Figure: the “3+ view” – Behavior, Costs and Resource Requirements linked through the COST MODEL, subject to HW latency and other constraints.]

The cost model is the “glue” that provides quantification of resources based on quantifiable behavior (requirements), measured costs and specific constraints. The model is used to drive the performance analysis activities throughout the project to make sure that we perform the minimal amount of testing for the expected deployment. Building and using the model early on ensures that the tests performed and data collected are relevant for the intended system deployment.

The model is capable of accurately determining a fairly comprehensive capacity envelope even though it is based on a much smaller set of test measurements than the traditional brute force multi-dimensional testing normally required to produce such an envelope. The model forecasts (either extrapolates or interpolates) most of the behavior/constraints combinations so that explicit measurements are not necessary. The cost savings are realized by the relatively small number of test measurements required to do this.

The following sections describe the process of building of such a model.

Transaction as a unit of System Behavior System behavior can be described in terms of processing a number of distinct transactions within a unit of time. A transaction represents some unit of work offered to the system. System behavior is the offered workload (that is, a set of transactions).

For example, a system may be expected to process X events per second, update Y user GUI displays every Z seconds, collect N performance management reports from M network elements every K minutes, etc. Each of these examples can be viewed as a transaction associated with some frequency of execution:

Transaction = (TransactionType, TransactionFrequency)

SystemBehaviour = SetOf {Transactions}
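As an illustration, these definitions map naturally onto a couple of small data structures. The sketch below is ours; the type names and example rates are illustrative assumptions, not part of the model description above:

    # Minimal sketch of the definitions above (type names and rates are illustrative).
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Transaction:
        transaction_type: str      # TransactionType, e.g. "EventProcessing"
        frequency_per_sec: float   # TransactionFrequency

    # SystemBehaviour = SetOf {Transactions}
    steady_state_behaviour = frozenset({
        Transaction("EventProcessing", 10.0),   # 10 events/sec (assumed)
        Transaction("GuiUpdate", 2.0),          # 2 GUI display updates/sec (assumed)
    })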

Clearly, defining an exhaustive set of transactions for any large system is impractical. However, most systems execute only a small subset of transactions most of the time, especially during steady state “payload” processing.

It is possible to define multiple system behaviors for different system states to expand this model. For example, steady state processing can be associated with one set of transactions; an overload state can be associated with another set. From a performance/capacity perspective we need to identify and prioritize the set of system behaviors that need to be assessed and evaluated.

Each transaction executed on the system results in use of system resources – CPU, memory, disk, etc. The total amount of resources required to support the behavior depends on behavior (the set of transactions and their frequencies) and constraints:

SystemRequiredResourceUsage = Function (SystemBehavior, Constraints)

The resource cost of a given system behavior is usually dependent on the transaction cost and the frequency of transactions – however, different resources may require different computations. For example, the “cost” of disk (disk space) is related to the total disk “residue” of the transaction – e.g. DB space required per event record; memory is the most difficult to trace since memory utilization is affected by multiple layers of memory management policies.


The cost based model implements the function above – namely the mapping of system behavior and constraints to the system resources required to support such behavior. This mapping is based on cost measurements and testing of individual transactions rather than on direct testing of every behavior.

Transaction Linearity Each transaction is associated with resource “costs” – the price the system has to “pay” for performing the transaction. For example, processing 1 event per second may require 10% of CPU on a 1500 MHz processor. The CPU utilization may or may not be linear with respect to the frequency of the given transaction type. That is, processing 2 events per second might not be twice as expensive as processing 1 event. However, using a linear model as a starting point provides a good initial approximation:

Most of the processes are linear within the “operating scope of the system” of 20%-70% of CPU. Above 70% CPU utilization, OS switching overhead results in non-linear CPU utilization (the actual maximum CPU cut-off may be higher than 70% for many systems). However, in that range we are usually not interested in the capacity of the system since it is outside the operating scope (as long as we know that the utilization is over 70%). If the utilization is below, say, 20%, then the error due to non-linearity is fairly small and can be ignored.

If the behavior of the system with respect to a given transaction is demonstrated (as a result of testing) to be non-linear within the operating scope of the system, then it may be possible to decompose the transaction into independent linear components. For example, event processing may include aggregating and storing events in a historical database. The database “write” (the most expensive part) is usually done in batches of multiple events. Therefore the overall cost of event processing is non-linear. However, decomposing event processing into two transactions – 'event arrival' and 'event save', with different frequencies – results in a linear transaction set.

If the system behavior is demonstrated to be non-linear with respect to transaction frequency, further decomposition is not possible/practical, and the linear assumption results in significant estimation errors, then a non-linear cost function must be developed based on capacity tests at various transaction rates.

Using the linear assumption above we can derive the system resource usage as follows:

SystemRequiredResourceUsage =
    Function(
        SumOf {Cost(TransactionType, TransactionFrequency)},
        Constraints
    ) + C

The C above is a constant representing the background resource utilization. In theory it is possible to decompose C into individual transactions (consisting of OS activities, system management, etc.) but in practice C is simply the background utilization in the absence of the defined transactions.
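A minimal sketch of how this linear relationship can be evaluated, assuming a per-transaction CPU cost table and a background constant C (all numbers below are illustrative, not measured values):

    # SystemRequiredResourceUsage = SumOf {Cost(type, frequency)} + C  (CPU only, illustrative numbers)
    CPU_COST_PCT_PER_TPS = {        # % of CPU used by 1 transaction/sec of each type
        "EventProcessing": 4.8,
        "GuiUpdate": 1.2,
    }
    BACKGROUND_CPU_PCT = 3.0        # the constant C: background utilization with no payload

    def required_cpu_pct(behaviour):
        """behaviour: dict of {transaction_type: frequency_per_sec}."""
        return BACKGROUND_CPU_PCT + sum(
            CPU_COST_PCT_PER_TPS[t] * freq for t, freq in behaviour.items())

    print(required_cpu_pct({"EventProcessing": 10.0, "GuiUpdate": 2.0}))  # 3.0 + 48.0 + 2.4 = 53.4% CPU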

Transaction Costs A transaction may depend on a number of resources; therefore the maximum rate of a given transaction may be reached even though a given resource is not exhausted. For example, assume event processing is a single-threaded process that “costs” 5% of CPU for 1 event per second on a 4-core processor. In this case processing 5 events per second will result in 25% CPU. Processing 6 events per second will not be possible, since the single-threaded process will never be able to use more than 1 core of the 4-core CPU.
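The arithmetic of this example can be made explicit. The sketch below simply restates the numbers from the paragraph above (5% of total CPU per event/sec, single-threaded, 4 cores):

    # Max sustainable rate of a single-threaded transaction on a 4-core machine.
    CORES = 4
    COST_PCT_PER_EVENT_PER_SEC = 5.0        # % of *total* CPU per 1 event/sec
    ONE_THREAD_LIMIT_PCT = 100.0 / CORES    # a single thread can never use more than 25% of total CPU

    max_rate = ONE_THREAD_LIMIT_PCT / COST_PCT_PER_EVENT_PER_SEC
    print(max_rate)   # 5.0 events/sec; 6 events/sec is unreachable even though total CPU stays well below 100%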

The costs per transaction are measured by monitoring resource utilization and constraints (e.g. latency) for various rates of the transaction. For example, the following charts show the test results of executing a certain transaction at rates of 2, 4, 6, 8, 10 and 12 rps (requests per second):

[Figure: vmstat CPU trace (TOTAL/usr/sys) over the test run (left) and averaged CPU% with latency (sec) versus requests per second, 2-12 rps (right); linear CPU trend y = 0.048x + 0.006, R² = 0.9876.]

The chart on the left shows the CPU utilization pattern throughout the test. The chart on the right shows the averaged CPU utilization per test (blue) as well as transaction latency (pink) for different rates of transactions.

One can see that at rates higher than 10 rps the latency increases dramatically (yet the CPU utilization at this rate is only about 50%). Clearly the limiting factor for this transaction is not CPU but something else (in this specific case it was JVM heap utilization). All of the tests were run by generating the requested transaction rate for the same amount of time. The CPU% trace on the left chart shows that during the 12 rps test the system was active for longer than during the other tests – the latency increased so much that the system was not able to keep up with the given rate of transactions. Based on these results the maximum sustainable rate (MaxRate) for this transaction is 10 rps. The 12 rps rate is not sustainable and was therefore not included in the average CPU values used for the trend estimation in the chart on the right.

It is also clear that the transaction costs are linear with respect to transaction frequency – the slope of the trend line shows that the cost is approximately 4.8% of CPU per transaction per second. For model computation purposes it is useful to express the CPU costs in terms of MHz per transaction (so that we can map them from one processor to another). This test was performed on a Netra 440 (4 CPUs, 1.6 GHz each) – therefore 4.8% is equivalent to 4.8% * 4 * 1600 = 307 MHz per request.

The transaction cost therefore is described as (MHz, MaxRate) = (307, 10).
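The CPU%-to-MHz conversion used above can be captured in a small helper; the helper name is ours, while the Netra 440 figures are the ones from this test:

    # Convert a measured CPU% cost (per transaction/sec) into MHz per transaction,
    # so the cost can be mapped between processors.
    def cost_in_mhz(cpu_pct_per_tps, n_cpus, mhz_per_cpu):
        return cpu_pct_per_tps / 100.0 * n_cpus * mhz_per_cpu

    # Netra 440: 4 CPUs x 1600 MHz, measured slope ~4.8% CPU per request/sec.
    print(round(cost_in_mhz(4.8, 4, 1600)))   # ~307 MHz per request -> (MHz, MaxRate) = (307, 10)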

In practice the cost of a transaction may depend on additional parameters. The cost of a DB insert will depend on the number of records in the database. Other transactions may depend on multiple factors, but there are usually very few parameters that transaction costs are most sensitive to. Once these parameters are identified, the cost of the transaction is measured for different values of the parameters. The MaxRate and MHz for each combination of parameters are measured and derived as above. The final transaction cost profile is represented as a transaction cost matrix:

DB Insert Cost

ActiveRecords   MHz    MaxRate
0               12.0   125.0
10000           15.5   96.6
30000           16.5   90.8
60000           18.0   83.3
100000          20.0   75.0
200000          24.9   60.1


The table above shows the costs and maximum rates of the DB Insert transaction with respect to the value of the ActiveRecords parameter.

The Transaction Cost Matrix description is unambiguous, can be shared between groups, can be traced directly to specific test runs, and can be reproduced, tested and consistently used for projections and estimations.
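As an illustration of how such a matrix can be consumed by the model, the sketch below (our own helper, not the author's tooling) linearly interpolates MHz and MaxRate between the measured ActiveRecords points:

    # Look up the DB Insert cost for an arbitrary ActiveRecords value by linear
    # interpolation between the measured points of the cost matrix above.
    DB_INSERT_COST = [  # (ActiveRecords, MHz per insert, MaxRate in inserts/sec)
        (0, 12.0, 125.0), (10000, 15.5, 96.6), (30000, 16.5, 90.8),
        (60000, 18.0, 83.3), (100000, 20.0, 75.0), (200000, 24.9, 60.1),
    ]

    def db_insert_cost(active_records):
        rows = DB_INSERT_COST
        if active_records <= rows[0][0]:
            return rows[0][1:]
        for (x0, mhz0, max0), (x1, mhz1, max1) in zip(rows, rows[1:]):
            if x0 <= active_records <= x1:
                f = (active_records - x0) / (x1 - x0)
                return (mhz0 + f * (mhz1 - mhz0), max0 + f * (max1 - max0))
        return rows[-1][1:]   # beyond the last measured point, use the last row

    print(db_insert_cost(45000))   # interpolated (MHz, MaxRate) between the 30000 and 60000 record rows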

Model Implementation This model lends itself to a fairly straightforward spreadsheet implementation. The spreadsheet implementation allows for continuous update, refinement and automatic import of test data and measurements. Once implemented, the model allows immediate “what-if” analysis of the impact of a change in customer behavior or a given cost (e.g. what if we have twice as many events but half as many GUI clients? How many more clients can we handle if we speed up the security function by a factor of 2?).
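The “what-if” questions above reduce to re-evaluating the same cost model with scaled frequencies or scaled costs. A minimal sketch, with purely illustrative costs and function names:

    # "What-if" sketch: recompute required CPU (in MHz) after scaling behaviour or costs.
    COST_MHZ = {"Event": 307.0, "GuiClient": 40.0, "Security": 120.0}   # MHz per unit/sec (illustrative)

    def required_mhz(behaviour, cost=COST_MHZ):
        return sum(cost[t] * rate for t, rate in behaviour.items())

    baseline = {"Event": 10.0, "GuiClient": 20.0, "Security": 5.0}
    what_if  = {"Event": 20.0, "GuiClient": 10.0, "Security": 5.0}      # twice the events, half the GUI clients

    faster_security = dict(COST_MHZ, Security=COST_MHZ["Security"] / 2) # security function sped up 2x

    print(required_mhz(baseline))                     # baseline demand
    print(required_mhz(what_if))                      # changed behaviour
    print(required_mhz(baseline, faster_security))    # same behaviour, cheaper security transaction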

The primary “deliverable” of such a model is an operating envelope of the system with respect to the system resources. The operating envelope can be graphically represented as a 2D or 3D chart showing the system operating limits with respect to key transactions and resources. An envelope can be dynamically recomputed with respect to any behavior/scenario/hardware.

For example, the following figure shows the operating envelope of a system with respect to 3 parameters (event rate, number of users and number of records). This particular system had approximately 30 different “dimensions” (about 10 transaction types plus other constraints and parameters). The model was capable of computing the operating envelope for any three parameters while fixing the values of the rest. The area below the 3D surface represents the system operating range that would meet the capacity and behavior constraints and allows instantaneous “what if” analysis. Based on this chart, if the intended deployment calls for 30000 users then the system will meet its performance/capacity constraints provided that the event rate is less than 14 per second and the number of records is less than 22000. The capability to compute this kind of operating envelope greatly assists and simplifies the “what-if” analysis necessary to determine, articulate and communicate the limits of a potential deployment.

[Figure: 3D operating envelope – EVENT RATE vs. NRECORDS vs. NUSERS; the region below the surface meets the capacity and behavior constraints.]

In the example above, the operating envelope showed the general limitations of the system based on the value of certain parameters and transaction frequencies.

Another example (below) shows an operating envelope chart that displays the specific resources that limit the system capacity. In this case the system, a network controller, can support various combinations of two network element types (NE1 and NE2). The white line (MAX) shows the overall operating limit of the system. The support of NE2s maxes out at about 21 and is bounded by CPU (MAX_cpu). The limiting factor for NE1s is the disk (MAX_ed), which maxes out at 300-400 depending on the number of NE2s. This chart clearly shows which resource use needs to be optimized (or which resource increased) depending on the potential deployment.

[Figure: Operating envelope of #NE2 vs. #NE1 showing the overall limit (MAX) and the individual resource limits MAX_cpu, MAX_ed, MAX_ned, MAX_bw and MAX_aeps.]
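A limit chart of this kind falls out of the cost model by solving each resource budget for the number of NE1s at a fixed number of NE2s and taking the minimum. The sketch below uses invented per-NE costs and budgets purely to show the mechanics:

    # Operating envelope of #NE1 vs #NE2 as the minimum over per-resource limits.
    # Per-NE costs and budgets are invented for illustration, not measured values.
    COST_PER_NE1 = {"cpu_mhz": 5.0,   "disk_gb": 0.5,   "bw_mbps": 0.3}
    COST_PER_NE2 = {"cpu_mhz": 300.0, "disk_gb": 0.5,   "bw_mbps": 2.0}
    BUDGET       = {"cpu_mhz": 6400.0, "disk_gb": 200.0, "bw_mbps": 400.0}

    def ne1_limits(n_ne2):
        """For a fixed number of NE2s, the NE1 limit imposed by each resource and the overall MAX."""
        limits = {res: max(0.0, (budget - n_ne2 * COST_PER_NE2[res]) / COST_PER_NE1[res])
                  for res, budget in BUDGET.items()}
        return limits, min(limits.values())     # per-resource curves and the overall MAX line

    for ne2 in (0, 10, 20):
        per_resource, overall_max = ne1_limits(ne2)
        print(ne2, round(overall_max), {r: round(v) for r, v in per_resource.items()})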

As part of the total resource utilization computation for a given behavior, the model must compute the expected resource utilization of each individual transaction (or application/component identified with a transaction set). This information is valuable for identifying which feature/transaction is responsible for most of the resource use for a particular system behavior. The next three charts show this type of breakdown for CPU, memory and disk utilization on various systems:

[Figure: Resource breakdown charts – CPU distribution by feature (500A x 15K cph), RAM top 10 users, and disk space allocation (GB) by component.]

For example, the CPU distribution chart shows that the biggest CPU users are the “CollectDigits” and “RTDisplay” components (and the transactions associated with them). If CPU is the limiting factor then these two should be the focus of improvement/optimization.


Model Accuracy – Start early, improve as you go The model accuracy clearly depends on the accuracy of the underlying data. The underlying data broadly belongs to two categories – tested/measured transaction costs and the quantification of expected system behavior. As the system matures, the tested/measured transaction costs become more and more accurate. However, the expected behavior and expected use (especially for new systems) is often a best guess based on perceived market need, similar systems, past experience or even wishful thinking. The benefit of the model is that it allows one to focus on the implications of such guesses and encourages Product Line Management / Marketing to come up with better guesses and estimates.

A performance team should be able to provide best estimates of system performance at any stage of development, and the accuracy of such estimates should improve as the system matures. The model allows order-of-magnitude estimates very early in the design cycle. One can usually start getting initial sensitivity/assessment results even before a single line of code is tested, based on expected behavior and educated guesses of transaction costs. Often the educated “guesses” of transaction costs can be based on previous experience, past system analysis or simple baseline testing of similar functionality. This early analysis is valuable for detecting “big” showstoppers and allows early adjustment of the architecture.

As the system is being developed and measurements are performed, the model becomes progressively more accurate. The model is usually accurate to within 10% by the end of the development cycle for a new product, but it does require final calibration and verification with a “Big System Test” where resource use is measured under the load of multiple transactions at a time. For mature systems where this methodology has been applied over a number of years, the accuracy is such that in some projects the model was productized as a provisioning and planning tool.

Improving communications and addressing uncertainties

Use of the model allows unambiguous and clear communication of test results. The results per individual transaction are communicated in terms of costs (transaction cost matrices can be exchanged between groups and tracked by load and on a release-to-release basis). The results in terms of the overall behavior impact are communicated pictorially as operating envelopes that show the performance/capacity implications visually and explicitly. For example, the following two charts show the operating envelope of the system with and without a certain code optimization improving the processing of transactions SEC1 and SEC2. Since there is a cost associated with deployment of the optimization/patch it is important to demonstrate the impact of the change. The charts clearly show which behaviors (deployments) will be impacted and where the change needs to be deployed. Deployments that have fewer than 15000 records and do not use SEC2 transactions will not benefit from the patch.

[Figure: Operating envelopes (NREC vs. SEC1/SEC2 rates) without the code change (left) and with the code change (right).]


The model can also be used to compute the operating envelope with respect to different h/w. For example, the following two charts show the different ranges of system behaviors that can be supported on different h/w. From the two charts below it is clear that the V890 would be a better choice for handling high rates of TX1 transactions (right), but the T2 would be better at handling the mix of TX1 and TX2 transactions (left), specifically with a high number of records (the TX2 implementation was able to take advantage of the multithreaded architecture of the T2, whereas TX1 had a single-threaded implementation that benefited from the higher clock speed of the V890).

[Figure: Operating envelopes (NRECS vs. TX1/sec and TX2/sec) on a Sun T2 (1000 MHz/core, 4 cores, 32 strands, left) and a Sun V890 (1800 MHz/core, 8 cores, right).]

The use of the cost based model as the basis for performance analysis activities throughout the project addresses and mitigates the three uncertainties:

Behavior. Even though the expected system behavior at deployment time may not be known, the model allows forecasting for ANY behavior. The ability to compute the operating envelope identifies and quantifies the range of acceptable behaviors/scenarios, which allows intelligent decisions on deployment, capacity limitations and trade-offs.

Code. The transaction based cost model is based on identifying costs with respect to customer visible behavior, work/activity, rather than a specific code module, process, etc. There is no need to access, instrument, rebuild or recompile 3rd party code. All of the code is treated as a black box (although, depending on the granularity of the metrics collected, one can always map transactional resource utilization to a specific process or module).

H/W. Using the cost model allows mapping of results from one h/w platform to another. It might be necessary to perform a small set of pilot test cases on different h/w platforms to identify the mapping; the model can then be used to compute the operating envelope for different target h/w throughout the development cycle. The h/w limitations are identified early, which allows changes to be made if necessary.

Cost Reduction

There is an inherent cost reduction in using the model to forecast the comprehensive operating envelope of the system vs. brute force multi-dimensional testing. Suppose the system has N transactions, and each transaction can have Rn distinct rates that can be expected in some deployment scenario. Suppose this system also has M parameters with Pm distinct values each. Then the total number of “brute force” tests necessary to determine the operating envelope for the system would be equivalent to the number of points within this (N+M)-dimensional space, or the product of the number of distinct rates per transaction and the number of distinct parameter values:

BruteForceTests = R1 * R2 * R3 … * Rn * P1 * P2 * P3 … * Pm

On the other hand, if we use the model to forecast such an envelope, we only need to determine the cost of each transaction individually and forecast the cost of any combination. Each individual transaction would depend on a (small) subset of all the parameters, so the number of tests for transaction X will be Rx * Ptx, where Ptx is the product of the number of distinct values of the parameters that transaction X depends on. Therefore the total number of tests is the sum, over all transactions, of the number of distinct rates times the relevant parameter combinations:

ModelTests = R1*Pt1 + R2*Pt2 + R3*Pt3 … + Rn*Ptn
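The two formulas translate directly into a test-count comparison; in the sketch below the rate and parameter counts are invented purely to show the order-of-magnitude difference:

    # Brute-force vs model-based test counts (illustrative counts only).
    from math import prod

    rates_per_transaction  = [5, 5, 4, 6]   # Rn: distinct rates for each of N = 4 transactions
    param_values           = [3, 4, 5]      # Pm: distinct values for each of M = 3 parameters
    # Ptn: parameter-value combinations each transaction actually depends on (a small subset)
    params_per_transaction = [3, 4, 1, 5]

    brute_force_tests = prod(rates_per_transaction) * prod(param_values)     # R1*..*Rn * P1*..*Pm = 36000
    model_tests = sum(r * p for r, p in zip(rates_per_transaction,
                                            params_per_transaction))         # R1*Pt1 + .. + Rn*Ptn = 69
    print(brute_force_tests, model_tests)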

The reduction of the number of tests represents a significant direct cost reduction. There are additional cost benefits related to the use of the cost based model:

The test simplicity – individual transaction tests are much simpler to run and automate than the big system test required to test multiple transactions in combination.

Hardware cost reduction – there is no need to run all tests on the target hardware all the time. Once the mapping is established, the results obtained on one hardware platform can be mapped to another. (The mapping needs to be re-tested/re-established periodically.)

There are, of course, costs associated with creating and maintaining such a model (and validating it with BST-like tests), but in many cases these are far outweighed by the cost reductions associated with testing.

There are situations when direct testing makes more sense (and is more efficient) than modeling. If the expected rate for each transaction and the specific value of each parameter are known in advance, then according to the BruteForceTests formula above only one test will be required. However, this is rare and signifies a system with no “uncertainties” as described here.

Practical experience Using transaction based cost analysis is not new. A similar approach, Microsoft’s Transaction Cost Analysis, is described in [1] and is used for web service cost evaluation. Transaction Aware Performance Modeling [2] focuses on response time per individual transaction. We used this approach to focus on the capacity profile with respect to multiple resource types (CPU, memory, threading, disk space, etc.) under a set of well-defined constraints. We have applied this approach in a number of projects in a telecommunications environment, in a variety of systems including:

Call Centre Server (WinNT platform, C++)

Optical Switch (VxWorks, C, Java)

Network Management System (Solaris, Mixed, 3rd party, Java)

Management Platform Application (Solaris, Mixed, 3rd party, Java)

Each project presented different challenges (memory modeling in VxWorks, Java heap sizing, threading in Solaris, etc.). Application of this method has resulted in a number of “indirect” benefits (apart from the direct benefit of addressing uncertainties and enabling forecasting):

Improved Communication across groups – everyone speaks the same language (well defined transactions/costs). The “framework” or “platform” groups were able to describe the cost of the services they provide to the rest of the system by publishing the cost of the key transactions. The users of these services were able to use these costs directly for their estimations.

Change of verification focus – the verification activity focuses on validation and calibration of the model rather than on some specific scenario. This results in a more reliable and trustworthy model.

“De-politicization” of performance engineering. We found that the visual representation of the complex information as well as an instantaneous “what-if” capability greatly reduced the negative political aspects of performance engineering. The charts and the underlying data were open and traceable to individual costs/tests, making all the numbers and trade-offs clear and quantifiable.

Better requirements – quantifiable; the PLM/customer can see the value in quantifying behavior. The PLM and customer were also much more motivated to spend extra effort determining realistic requirements.


Documentation reduction – engineering guides are replaced by the model; the performance related design documentation focuses on design/architecture improvements.

Early problem detection – most performance problems are discovered before code “freeze” and the beginning of the official verification cycle.

Cost Reduction – less need for BST-type tests/equipment, less effort to run PV, reduced “over-engineering”. The transaction costs can be measured on one h/w platform and mapped to another. The transaction costs are determined based on automated tests that can often be executed on non-dedicated h/w by designers (although final tests must be done in a controlled environment).

End-user capacity planning tools – the model can be directly used to develop end-user capacity planning and performance analysis tools.

Summary

Cost Based Modeling effectively addresses the key deployment uncertainties of performance evaluation in modern systems by providing a quick and inexpensive method to estimate the performance impact of changing system behavior and hardware platform. It is a “black box” based approach that does not require access to the “hidden” 3rd party code. In addition, cost based modeling provides the ability to obtain performance and capacity estimates for key product functionality throughout the entire development cycle, often even before the first line of code is written. It is conceptually simple and inexpensive to implement, requiring no large-scale equipment. The approach improves the communication of performance/capacity information in large projects and facilitates iterative feedback to project management and design groups.

Acknowledgements:

I would like to thank my former colleague Robert Lieberman for his advice and comments. Most of the results described here are based on various performance projects at Nortel.

References:

[1] Using Transaction Cost Analysis for Site Capacity Planning, Microsoft, http://technet.microsoft.com/en-us/commerceserver/bb608757.aspx

[2] Securing System Performance with Transaction Aware Performance Modelling, Michael Kok, (Parts 1,2,3 in CMG MeasureIt) http://www.cmg.org/measureit/issues/mit61/m_61_17.html

About the author:

Eugene Margulis is a software performance lead at Telus Health Solutions. He has worked on capacity/performance analysis and evaluation on multiple projects at Nortel over the last 15 years. During this time he was involved in design, architecture and QA of telecommunication systems – from hard real time call processing to network management.