Planning and Auditing Your Firm's Capacity Planning Efforts
By Ron "The Hammer" Kaminski

Foreign speaker rules
- Please feel free to stop me to ask any questions.
- Raise your hand or clap if I am going too fast, or if my Mississippi accent becomes impossible for y'all to understand.
- This is not rude, and I will not take it that way.
- The paper and all slides will be furnished to my hosts.

Introduction
- Over the past 20 years, I've started and expanded capacity planning groups at dozens of firms; my most recent is now 15 months old.
- You learn things in that process.
- CMG is the place to share this information. I look forward to your presentation on this topic in a few years!
- Today's goal is to give you planning and audit points that you can use to review how you do capacity planning, and maybe persuade you that other methods might be more productive, or at least worth a shot!
- There will also be "how to" information that may have you adding some "to do" items to your list.
- If you have a question, ask it! I like nothing better than surfing off on a tangent that helps the class.
- Story times!
- New risks
Ron Kaminski 2010, All Rights Reserved

Introduction
In the next few hours, we will cover:
- Defining your mission
- Picking the right vendor partners
- Going Extra-Product
- Avoiding the IT Mindset Traps
- The politics of capacity planning in organizations, the key factor in your eventual success or failure
- Reporting: what you should and, surprisingly, should not do
- Classic capacity planning question descriptions and proper answering techniques

Introduction (continued)
In the next few hours, we will cover:
- How clouds and software as a service will still need capacity tracking and planning tools, and what new kinds you will need
- Modeling when all of the cards are stacked against you, or "Tricks of the trade"
- Goals to work towards
- An audit list to compare to your systems
Capacity planning done well can change the fortunes of a company and help all of our careers. Come sharpen your methods and learn tricks that will make you part of your firm's future productive assets, and not an expense to be controlled.

Ron's Rules
- You can ask anything, at any time. Sometimes the answer is coming up soon in the examples, and in that case I'll tell you so.
- Quick survey: does anyone here already have
  - a network queuing theory based modeling package?
  - regular, automated process and workload pathology detection?
  - fast web reporting of resource consumption by business useful workloads?
- By the end of this talk, I hope that you will realize that workload characterized views of consumption, web accessible, over business useful time spans, are a must-have part of the best run IT shops. Let's see why.

Defining your mission
Every site has its own "hot button" issues:
- "We are buying a new $23 million computer room every 6 months!" Attack server sprawl with data, not words.
- "I don't know why we hired a capacity planner, we just ..."
- "Our critical applications are slowing down!" Use relative response times and historical information to show why.
- Chargeback used to be a big draw, but it has really faded away in the post-dot-com world. Hearing about it tells you that you are talking to an old vendor.
- The ITIL push, and reality when facing outsourcing or ZOG: ITIL takes a back seat to cost control, at least in the States.
- "We need better reporting!" Be careful to be holistic in what you deliver: cover everything that they can buy, historically and ideally with business cycle peaks.
- When you start hearing terms like "focus on business priority" and "really look at travel expenses", realize that cost cutting is in your future, and report in ways that enable them to cut power and machines.

Defining your mission
- You might think that all that variation would lead to very different solutions, and you'd be wrong!
- All effective capacity planning systems are based on having:
  - Efficient data collection, regrouping, reduction and storage
  - Effective graphical reporting of business meaningful spans of time
  - Components of workload response time that lead to diagnosis
  - A way to answer "What if?" questions
  - Problematic consumption diagnosis, reporting and ticketing
- Some capacity planning product features marketed by vendors to the naive are actually seldom used in the real world, and for good reasons:
  - Linear trending, when what you really need is business cycle discovery and planning (for example, the retail cycle at grocery chains and web payment system vendors)
  - Real time monitors, when you might want to go home or on vacation some day. Remember, problems happen 24 x 7, and humans won't be watching twitch monitors that consistently. (The mission control room story)
  - "Top 10" is often used to focus a newbie on peak consumption, which may all be valid
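To make the linear trending point concrete, here is a minimal sketch, not taken from the talk. It assumes made-up monthly peak CPU percentages with a December retail spike, and the function names `linear_forecast` and `cycle_forecast` are hypothetical. It compares a straight-line projection with a "same month last cycle, scaled by growth" projection.

```python
def linear_forecast(history, steps_ahead):
    """Fit a least-squares straight line to the history and extend it."""
    n = len(history)
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + steps_ahead)


def cycle_forecast(history, cycle_len, steps_ahead):
    """Take the same point in the last full cycle, scaled by cycle-over-cycle growth."""
    last = history[-cycle_len:]
    prev = history[-2 * cycle_len:-cycle_len]
    growth = sum(last) / sum(prev)
    return last[(steps_ahead - 1) % cycle_len] * growth


# Illustrative data only: two years of monthly peaks, quiet months plus a December spike.
year1 = [40.0] * 11 + [80.0]
year2 = [44.0] * 11 + [88.0]
history = year1 + year2

# Project 12 months ahead, i.e. next December.
print("linear trend :", round(linear_forecast(history, 12), 1))
print("business cycle:", round(cycle_forecast(history, 12, 12), 1))
```

With this toy data the straight line lands well below the December peak that the cycle-aware projection anticipates, which is the kind of miss that matters when the peak is the thing you have to buy for.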

Defining your mission
Who is doing the reporting? Vendor supplied reports:
- Tend to be single metric
- Often don't include contextual information
- Are often generated on demand, and therefore any useful span of time takes beyond the allowable attention span
- Often have serious contextual clarity problems
  - Workloads change colors as the number present changes, or when you switch machines
  - They use black outlines that swamp the colors for small workloads
  - The "I'm only using vendor reports this time" and hit count story
- Can take unimaginable resources to produce
  - Set yourself a consumption budget and manage to it
  - "You want to trade more bonds? Stop looking at it!"
- May focus on reporting "right now" data rather than long term useful decision support information
- Seldom contain "disturbance to the status quo" notation capabilities

Defining your mission
Who is doing the reporting? Write your own reports:
- Can be anything that you dream up (and can deliver the code for)
- There are multiple free languages and infrastructure to pick from. We've used Perl, PHP, Java and a whole lot more.
- Can be tailored for your firm's decision makers' specific needs
- Can use "generate ahead" and other techniques to speed web reporting
Writing your own can also have down sides:
- Staff turnover and the "Who is going to maintain this ___?" issues
- Some staff are not gifted visual communicators
- If the information used changes formats (and over time they all do), someone is going to have to maintain that stuff

Defining your mission
What do you want to present?
- Workload characterized subdivisions of consumption over time?
- Long term historical context for decision makers over multiple natural business cycles?
- Information subdivided into audience specific groupings for ease of use by subgroups
- Integration into your firm's
  - CMDB
  - Ticketing systems
  - Software development life cycle
- Totals over time: the sparklines counter-argument

Why sparklines of totals can be really useful
- These are sparklines of total CPU used, average CPU used and the average CPU used by all nodes in that O/S.
- Is there one in particular that draws your eyes to it, that makes you want to probe deeper?

Why sparklines of totals can be really useful
- If you are like me, ustca102 has you wondering, "What made it step up like that?"
- On our system, clicking on the tiny sparkline brings up a zoomed-in image, which really gets you wondering.
- Clicking on that graphic brings up our normal web reporting system.
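If you have no sparkline generator today, a text-only version is a cheap way to try the idea. The sketch below is an assumption-laden illustration, not our production code: it maps each value onto one of eight Unicode block characters, scaled to that series' own minimum and maximum, and the daily CPU numbers are invented to mimic a step up.

```python
BARS = "▁▂▃▄▅▆▇█"  # eight block heights, lowest to highest


def sparkline(values):
    """Render a sequence of numbers as a one-line text sparkline."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A flat series has no shape to show; draw it at the lowest height.
        return BARS[0] * len(values)
    top = len(BARS) - 1
    return "".join(BARS[int((v - lo) / span * top)] for v in values)


# Invented daily total CPU percentages for a hypothetical node with a step change.
daily_cpu = [12, 11, 13, 12, 12, 41, 43, 42, 44, 43]
print(sparkline(daily_cpu))
```

Because each line is scaled to its own range, this shows shape (steps, ramps, cycles) rather than absolute level, so it is best paired with a number or a shared scale when comparing nodes.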

Why sparklines of totals can be really useful
- OK, sometimes totals are useful.
- Sometimes they can draw your eye to issues.
- They can quickly dispel rumors that "All of our machines are maxed out!"
- For example, our application specialists were consistently maintaining that all of their machines were barely big enough to make month end, and they would argue mightily whenever we suggested that there was room for consolidation.
- I brought the chart on the next slide to the next meeting, and suddenly their tune changed.

Why sparklines of totals can be really useful
What happened after the meeting? In the next 9 months, using extremely conservative criteria, we:
- Virtualized 230 machines ($1,521,000)
- Retired 55 machines ($390,553). The "Oh! You can just turn that off!", or "See steam come out of the operations folks' ears" stories.
- Planned: 10 machines ($40,000)
- Potential: 28 machines ($112,000)
We then plan on going back over with slightly less conservative criteria and finding a couple million more. We will also be doing more application stacking where it makes more sense. Sort of makes capacity planning tools look cheap, doesn't it?
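For anyone who wants the totals, the four figures on this slide add up as follows. The split into "done" (virtualized plus retired) and "pipeline" (planned plus potential) is my own grouping for illustration, not something stated on the slide.

```python
# (machine count, dollars) per category, copied from the slide.
savings = {
    "virtualized": (230, 1_521_000),
    "retired": (55, 390_553),
    "planned": (10, 40_000),
    "potential": (28, 112_000),
}

total_machines = sum(count for count, _ in savings.values())
total_dollars = sum(dollars for _, dollars in savings.values())

# Assumed grouping: work already completed versus work still ahead.
done_dollars = savings["virtualized"][1] + savings["retired"][1]
pipeline_dollars = savings["planned"][1] + savings["potential"][1]

print(f"machines: {total_machines}")
print(f"total:    ${total_dollars:,}")
print(f"done:     ${done_dollars:,}")
print(f"pipeline: ${pipeline_dollars:,}")
```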

Why sparklines of totals can be really useful
- A DBA pal of mine asked for a review of memory on a box, asking for an increase to add caching and improve performance.
- I didn't really detect a memory shortage:

Why sparklines of totals can be really useful
- Still, people don't usually mention issues unless there is an underlying cause. So, as a capacity planner, you have to always look deeper and always check all of the following:
  - CPU
  - Disk I/O
  - Memory
  - Network
  - Response time for key workloads
- If you don't always check everything, something can sneak by.
- Here is what I found when I followed the "always check everything" rule. When I looked at CPU, I saw:

Update!
- They've since added 2 more CPUs and the issue continues unabated.
- Some issues are not based in physics and data!

New, new update, just for St. Louis!

New, new update, just for India!
- In the end, someone looked at what was running, and decided most was waste!
- Look at what happened after Feb 22nd!

Why sparklines of totals can be really useful
- Now you see several reasons why longer term sparklines can be pretty useful.
- Do you currently have ways to generate them? If not, do you want to get ways to generate them?
- Don't you all think that your vendor ought to provide them, in group and zoomed-in formats? So let's start asking them to.
- Do you also see why you should always check everything and then sit back and ask yourself: "If I had asked that question and then got this response, what would I ask next?"

Defining your mission
- Anticipate the next questions and always answer them before being asked.
- The unanswered next question can be a huge time waster. It is:
  - often a stall technique used by the politically astute. It raises temporary doubt in your findings, and builds their case for swift purchase, before you answer their question.
  - often a way for the old guard to show management that they still are the top dogs.
- Impatient or frightened management might run off and buy something!
- The undeclared war between Project Managers and Capacity Planners
  - The "project manager weasel who never lost" story

Defining your mission
- If you are going to shoot down someone's hypothesis that lack of CPU was the cause of a problem, you'd better find out what really caused the problem before the meeting!
- Your goal: one meeting or phone call per issue!
- They may say "We just want a quick and dirty answer", but they never really do!
- Always cover at least:
  - CPU
  - Memory
  - Disk I/O
  - Workload response time changes
  - For web-centric systems, network distances and loads

Defining your mission
- Cultural differences are real and might affect your workload choices.
  - Some cultures avoid direct blame or information that would cause someone to lose face.
  - Any workloads are better than none.
  - The "No personal pronouns" story
- Be consistent!
  - Always use the same groupings on all similar nodes.
  - Use the same colors if you can!
  - Reduce the burden on your audience.
  - Multiply the value of your workload creation efforts.
  - Use a consistent precedence order to decide where to put a process that meets the criteria to be in several different workloads.
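One simple way to get a consistent precedence order is an ordered rule list where the first matching rule wins. The sketch below is only an illustration: the workload names and command patterns are hypothetical, and your own tool's workload definition syntax will differ.

```python
import re

# Hypothetical workloads, listed in precedence order: earlier rules win ties.
RULES = [
    ("backup", re.compile(r"backup")),
    ("database", re.compile(r"oracle|mysqld")),
    ("web", re.compile(r"httpd|java")),
]


def classify(command):
    """Return the first workload whose pattern matches the process command."""
    for workload, pattern in RULES:
        if pattern.search(command):
            return workload
    return "other"  # catch-all so every process lands somewhere


for cmd in ["oracle_backup_agent", "mysqld", "httpd -k start", "vi notes.txt"]:
    print(f"{cmd:22s} -> {classify(cmd)}")
```

Using the same ordered list on every similar node means a process that could fit two workloads (here, a database backup agent) always lands in the same one, so colors and groupings stay comparable across machines.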

Defining your mission
Whatever you decide:
- Track your own tools' usage!
  - There are multiple great freeware web usage reports that will tell you if folks are using or snoozing your data. (We use Webalizer: http://www.mrunix.net/webalizer/ )
  - Unviewed information is wasted time and effort.
- Use speed tests.
  - If there are multiple ways to do something (CSV files versus a performance database), code for both and have a race.
  - Will your web users want the slower one?
  - The capacity planning reporting challenge story
- Don't settle; always seek new audiences and better reports.
  - Add new functions.
  - Sadly, there is no shortage of bad vendor reporting on expensive infrastructure. Anyone here ever seen a great graphical historical display, in business useful terms, of SAN information or LAN usage by segment?
  - Your firm may have business specific information that might be really useful to decision makers if overlaid on, or graphically reported near, IT resource consumption.
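A race between two storage choices can be very small. The following sketch is an illustration under stated assumptions: it uses Python's standard `csv` and `sqlite3` modules with in-memory data as stand-ins for "CSV files versus a performance database", the table layout and node names are invented, and real results depend entirely on your data volumes and queries.

```python
import csv
import io
import sqlite3
import time

# Invented sample data: (node, hour, cpu percent).
ROWS = [(f"node{i % 50}", i % 24, float(i % 100)) for i in range(20_000)]


def build_csv():
    """Serialize the rows as CSV text (stand-in for a CSV file on disk)."""
    buf = io.StringIO()
    csv.writer(buf).writerows(ROWS)
    return buf.getvalue()


def build_db():
    """Load the rows into an indexed in-memory SQLite table."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE cpu (node TEXT, hour INTEGER, pct REAL)")
    con.executemany("INSERT INTO cpu VALUES (?, ?, ?)", ROWS)
    con.execute("CREATE INDEX idx_node ON cpu (node)")
    return con


def total_from_csv(text, node):
    """Scan every CSV row and add up the CPU percent for one node."""
    reader = csv.reader(io.StringIO(text))
    return sum(float(pct) for name, _, pct in reader if name == node)


def total_from_db(con, node):
    """Ask the database for the same total."""
    query = "SELECT SUM(pct) FROM cpu WHERE node = ?"
    return con.execute(query, (node,)).fetchone()[0]


def race(fn, *args, repeats=5):
    """Best-of-N wall clock time for one call, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


csv_text, con = build_csv(), build_db()
print("csv:", race(total_from_csv, csv_text, "node7"))
print("db :", race(total_from_db, con, "node7"))
```

Always check that both contestants return the same answer before comparing their times; a fast wrong report is not a winner.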

Our site's web usage:

Our shared long term mission
- When you innovate and come up with new report ideas, share them at CMG! Or at least send me examples in mail and I'll do it for you!
- Share code in this or other user groups that make sense.
- We should all work together in user groups, public forums, on the web, etc., to push all of our vendor partners to address these needs. The more they do for us, the less we carry the home brew code weight.
- We should also all work to reduce the volume, impact and long term storage requirements of our solutions. I have yet to encounter a vendor that isn't carrying around a lot of extra metrics in the bowels of their systems that will never be used.
- We should have a CMG sponsored "help wanted" section for capacity planning specialist positions in the various countries.

Picking the right vendor partners
I believe that all capacity planning efforts should have tools that include:
- Efficient resource usage and process consumption collectors
- Network queuing theory based "what if?" modeling based on workloads, not total consumption (the bulge trap)
- Efficient, speedy web-based historical consumption data display
Ideally your chosen vendor would:
- support most or all of your differing operating systems and devices
- have ample training and consultants available; there is nothing better than a co-pilot when you are starting out
- participate in and support CMG!

Picking the right vendor partners
In the not too distant future, the best vendors should be offering efficient, low impact, cloud deployable wrappers that run with your applications in a cloud:
- "We don't have to worry, it's in a cloud" is nonsensical.
  - Are you going to generate fake transactions and time them?
  - When you get a long time back, or significant variance, are you going to have enough information to know why?
  - I think that in time people will realize this need, and want it in their contracts.
  - Don't you want to know the overhead of encryption and decryption in the process, and its response time effects?
- Stupidity is infinitely scalable, as long as you aren't getting the bill.
  - If nobody cares to make their code efficient, because they just send it to the cloud, how good is that code going to be?
  - Will it be running on the same machine as you tested?
  - Will it impact your users?
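Timing fake transactions needs very little code to get started. This is a minimal sketch with hypothetical names (`time_transaction`, `slow_samples`) and an arbitrary "3 times the median" threshold that I picked for illustration; the transaction itself is a placeholder for whatever call you would make against your cloud application.

```python
import statistics
import time


def time_transaction(fn, repeats=10):
    """Run a synthetic transaction several times; return each elapsed time in seconds."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


def slow_samples(samples, factor=3.0):
    """Return the indexes of samples slower than factor times the median."""
    median = statistics.median(samples)
    return [i for i, s in enumerate(samples) if s > factor * median]


def fake_transaction():
    """Placeholder workload; replace with a real request to your application."""
    sum(range(10_000))


timings = time_transaction(fake_transaction)
print("median seconds:", statistics.median(timings))
print("slow runs:", slow_samples(timings))
```

Note that this only tells you that a transaction was slow. Knowing why still requires consumption data collected alongside the application, which is the point of the slide.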

Picking the right vendor partners
In the not too distant future, the best vendors should be offering efficient, low impact, cloud deployable wrappers that run with your applications in a cloud (continued):
- The internet will continue to grow exponentially.
  - So those clouds could get mighty full, mighty quick.
  - How do you want to find out that it is too full?
  - Do you want your customers telling you?
  - Or do you want your own reports, based on scientifically accurate collected consumption data?
- Social media sites are becoming valuable business tools.
  - Businesses tweet and have Facebook pages!
  - Do you think that a free application originally designed to let 14 year olds share photos is designed for high performance business needs?
  - How will you be sure?

Picking the right vendor partners
In the not too distant future, the best vendors should be thinking about SaaS user tools as well:
- Sure, SaaS vendors maintain the code and pay if it is a hog, but are they:
  - running maintenance activities like backups and virus scans that slow things down right during prime time for Australia in your globally distributed firm?
  - suffering from office hours peaks of consumption that impact your users' response times?
  - taking outages to horizontally scale that might impact your firm's ability to ship product?
- Without your own data, you will never know.
- What responsibility do you have to your firm's users?
Why is this network queuing theory based modeling stuff so important? Let's understand what it means and then see an example.

Modeling Norms
- Most modeling packages assume a Poisson or Chi-squared distribution for the arrival rate of transactions.
- Some simpler, yet often quite elegant, systems like Dr. Neil Gunther's PDQ modeling just use a quadratic and forget the tails.
- They aren't all that different, despite what we modeling junkies might say!
- Don't focus on the distribution selected; focus on whether they use queuing theory models and give you relative response times.

Why network queuing theory based modeling?
- These concepts are also often illustrated with simple queue graphics like the one at the right.
- An important implied assumption is that all requests are served; none are lost.
- Response time is the sum of Queuing Time plus Service Time.

Why network queuing theory based modeling?
- Methods do differ, but queues for interactive workloads are usually computed based on load percentage using a formula like:
  Q = U / (1 - U)
  where Q = Expected Queue and U = Utilization
- Response time is the sum of Queuing Time plus Service Time.
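The formula is easy to try for yourself. The sketch below computes the expected queue from utilization and then a response time, under one explicit assumption that is mine rather than the slide's: a single-server queue where queuing time is service time multiplied by the expected queue, so that response time works out to S / (1 - U). The 10 ms service time is an invented example value.

```python
def expected_queue(utilization):
    """Q = U / (1 - U), with U expressed as a fraction between 0 and 1."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in the range [0, 1)")
    return utilization / (1 - utilization)


def response_time(service_time, utilization):
    """Queuing time plus service time, assuming a single-server queue."""
    queuing_time = service_time * expected_queue(utilization)
    return service_time + queuing_time


SERVICE_MS = 10.0  # invented service time, in milliseconds
for u in (0.10, 0.50, 0.80, 0.90, 0.95):
    q = expected_queue(u)
    r = response_time(SERVICE_MS, u)
    print(f"U={u:4.0%}  Q={q:6.2f}  response={r:7.1f} ms")
```

Running it shows why the last few percent of utilization hurt so much: the queue at 50% busy is 1, at 90% busy it is about 9, and at 95% busy it is about 19.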

Why network queuing theory based modeling?
- So, as a workload competes for resources throughout a day, its response time is likely to vary.
- Computed relative response times show us both the variations and the reason.
- The Y axis metric does not matter! Just pick a basis; the ratio is the important part!
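Here is one way to see why the Y axis metric does not matter. Under the same single-server assumption as before (response time = S / (1 - U), which is my simplification, not a statement about any vendor's model), the service time S cancels out when you divide by a basis, so only utilization drives the ratio. The hourly utilizations are invented.

```python
def relative_response(utilizations, basis=None):
    """Response time at each utilization, relative to a basis utilization.

    If no basis is given, the quietest sample is used as the basis.
    """
    stretch = [1 / (1 - u) for u in utilizations]
    base = min(stretch) if basis is None else 1 / (1 - basis)
    return [s / base for s in stretch]


# Invented hourly utilization for one resource across part of a day.
hourly = [0.20, 0.35, 0.60, 0.85, 0.90, 0.50]
for u, rel in zip(hourly, relative_response(hourly)):
    print(f"U={u:4.0%}  relative response = {rel:4.1f}x")
```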

Why network queuing theory based modeling?
- A workload's typical transaction is likely to rely on several resources.
- Imagine a workload running on a machine with four CPUs, six disks and some network I/O on one card.
- Note that when technologies differ, service times can differ.

Why network queuing theory based modeling?
- Now do you see where a graph like this can come from?
- If the warehouse folks are complaining about response times at 3:00 AM, should you upgrade the CPU?
- When do you suspect that the backups are running?
- Would a CPU upgrade help daytime response? But it also might make demand for I/Os faster and really slow down the warehouse at 3:00 AM too, so you'd better address the I/O issue!
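The components-of-response-time idea can be sketched by adding up a per-resource time for each resource a transaction visits. Everything numeric below is invented, and each resource is treated as a single queue, which is a deliberate simplification (a real model of four CPUs and six disks would use multi-server queues). The point is only to show how the dominant component can move from CPU in the daytime to disk at 3:00 AM.

```python
def residence_times(demands, utilizations):
    """Per-resource time (service plus queuing) using demand / (1 - utilization)."""
    return {r: demands[r] / (1 - utilizations[r]) for r in demands}


def bottleneck(components):
    """Name of the resource contributing the most to response time."""
    return max(components, key=components.get)


# Invented service demand per transaction, in seconds.
DEMANDS = {"cpu": 0.020, "disk": 0.050, "network": 0.010}

# Invented utilizations: busy CPUs by day, disks saturated by backups at night.
DAYTIME = {"cpu": 0.80, "disk": 0.30, "network": 0.20}
THREE_AM = {"cpu": 0.20, "disk": 0.95, "network": 0.20}

for label, util in (("daytime", DAYTIME), ("3:00 AM", THREE_AM)):
    parts = residence_times(DEMANDS, util)
    total = sum(parts.values())
    print(f"{label}: total={total:.3f}s, dominated by {bottleneck(parts)}")
```

In this toy case a CPU upgrade would help the daytime numbers and do nothing for the 3:00 AM complaint, because the time there is almost entirely disk.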

Picking the right vendor partners
- In my experience, network queuing theory based tools move folks quickest to actionable answers.
- Once you understand relative response times, most issues are quick and easy to diagnose.
- If a new vendor harps on linear trending graphics and projections, don't expect them to be around for very long.
- If a monitoring or other product vendor keeps adding "and you can use this for capacity studies", it is probably because the salesperson heard that you were looking for capacity planning tools!
- Stick with network queuing theory based packages and you won't go wrong!
- Dozens of "And we can do capacity planning too!" stories

Ron Goes Off on VMware
- VMware is not a capacity solution.
- VMware is a symptom of no capacity management.

Ron Goes Off on VMware
- VMware is the single biggest indictment of the poor way most firms have done capacity planning in the Windows space.
- The lack of workload characterized views of consumption is why folks bought a server for each functional part.
- "We don't want to stack multiple applications on one server! So we VMware them!", which is just stacking with the added joy of paying for not only extra copies of the OS and tools, but $900+ for VMware as well.
- And in the end, the code is running on the same box!
- VMware's so-called capacity planning tool is proof that they never attended a CMG! It is as near useless as any marketed tool that I have ever seen, but at least it is expensive.

Going Extra-Product
- Once you get used to your vendor's product, if you are like me, you'll start wishing for more functions tailored to your specific needs.
- In the old days, a grey haired expert would whip out a spreadsheet or other mathematical package and start creating some home-brew solution.
- I use Perl and GD:Graphics, PHP, JavaScript and anything else that I can think of; you can use what makes sense to you.
- Check out old CMG papers; they are laced with great ideas.
- In other words, don't feel limited to what your vendor does "out of the box".
- Find buddies that use the same vendor and start sharing ideas and code.
- Things that you will see later in this presentation are shared among dozens of firms, and they wouldn't live without them.
- You don't have to agree 100%; take what fits best and leave the rest.

Going Extra-Product
- There is a whole group of us running many of the extensions that we've developed over time.
- Some of our extensions have made it into some products, but nowhere near enough of them!
- We probably get 50% of our firm's benefit from the tools from our own extensions.
- We regularly meet with the vendors and implore them to add the features that we like.
- Having more singing from the same hymnal might just get through to them!
- Come join us! The best ideas might be in your head! Share!

Avoiding the IT Mindset Traps
- Capacity planners come in several flavors, because people from several different camps end up in this role:
  - Scientists: scientifically minded users of network queuing theory tools and simulation models who want to subdivide consumption into different behavioral groups and analyze them.
  - Application specialists: application subject matter experts who know the application, are trusted by management, and care deeply about its success. They often come from the application side of the firm.
  - Old Timers: they know everybody, have worked on everything, and have connections and favors to call in to get things done. They often come from the operations side of the firm.
- Each of these can be successful, but some are more prone to certain behaviors that can limit your capacity planning effectiveness and raise the costs of doing it.
- Let's look at the typical pros, cons and peccadilloes of each.

The Scientists
The Scientist capacity planner:
- Loves to get data from everywhere and everything that they can
- Willingly tackles huge tasks as long as there is a possible learning benefit
- Will constantly tweak the automation to be able to get yet more data
- Will go extra-product and build tools for specific functions without fear, because they are used to building things from scratch and being successful
Pros:
- No fear: they view no problem as intractable and are sure that if they can get real data into a scientifically designed framework, business useful learning will result.
- No agenda: all applications and systems are equally important to them. They will not lobby for one application to get resources instead of another, preferring instead a rising tide that raises all boats.
- Willing to try new methods and tools in search of solutions

The Scientists
Cons:
- Scientists can be viewed as "remote" or "doesn't know the business" by some in management, particularly application development.
- They may want some really expensive and/or tricky software, and on every machine, and these tools produce copious amounts of data that needs to be processed, graphed and stored.
- The volume of tools and special case software that they accumulate over time can be hard for others to support.
- Good ones are relatively rare; ones that can teach or mentor others are extremely scarce.
Mindset traps:
- Scientists can go off on tangents. They really need a manager who can:
  - help them get the most productive subset of tools working first
  - translate their outputs into terms understandable to the business
  - help keep them focused on what the business deems most valuable
- Their pursuit of the one scientifically superior way, left unchecked, can lead to ongoing high costs.

The Application Specialist
The application specialist in the capacity planning role:
- Will often drop everything else to don their fire-fighter jacket and save the firm by working on emergencies
- Will rely strictly on simple O/S tools and minimal data, often just totals, "because that was all we needed when we started this thing, and look how far we've come"
- Seldom tracks historical consumption data over time, or if they do, seldom presents it in a format that is easily understood by others
Pros:
- They really do know the application and the folks who are powerful, and they have a lot of chips at the bargaining table when it comes time to get things negotiated.
- Their application specific knowledge can really come in handy when strange behaviors are noticed.
- Their continuing drive to make an application succeed, and the lengths that they go to, are often very favorably viewed by non-technical management.

The Application Specialist
Cons:
- EGO!
- "Our conference rooms are named after comic book super heroes!"

The Application Specialist
Cons:
- Their self confidence can lead to large egos; they dismiss opposing views of how to address issues other than "the way that we've always done it".
- Their extreme willingness to join in every fire-fight eats a lot of time and delays the deployment of tools and systems (like long term historical consumption tracking) that would help others understand and make better decisions.
- They tend to enjoy being the "go to guy" and thus seldom share the basis for their decisions. This is sometimes covering up the fact that the basis for their decisions is gut feel, not data.
- They will commit, in public forums where management is present, to supporting the scientists to get some application specific technical need, and then fail to do so in a timely manner, if ever.
- They really know their silo, but they are very uncomfortable when asked to go outside of it.

The Application Specialist
Mindset traps:
- These folks' career successes have been built on thinking on their feet as issues occur, so they seldom take the time to build data collection and reporting structures that lead to well informed decisions. "When you need to know something, just ask me."
- They may even resist or delay deployment of capacity planning systems, calling them "costly", "unnecessary" and "not our application's highest priority".
- They will resist changes to their "sacred architectures from the 1980s".
- They can be initially really interested in capacity planning information about their application, and use it to point out the positive impacts of their past decisions and successes, but don't expect them to mention immense over capacity.
- Often their interest stops immediately at the edge of their application.
- When there are issues larger than one application, they view it as their duty to defend their application's turf and will move to segregate the environments into "us" and "them" groupings that need not share any infrastructure.
- They think that "The vendor will tell us when to ..."

The Old Timers
The old timers in the capacity planning role:
- Are a calming presence in meetings
- Have stories of a time when "we faced something similar"
- Have the best jokes
- Know and address the VPs as "Phil" and "Sandy"
- Have capacity tracking systems that tend to the super-inclusive. When asked, they alone can root out data about darn near anything, but they have to be asked.
Pros:
- They have the trust and respect of nearly everyone, because everyone has worked successfully with them over time.
- When they need tools or space to get or keep their data, they just go ask Phil or Sandy.
- They are among the few to have worked on many of the systems, not just one or two, and so they understand deeply the inter-reliance of many of the systems and how an issue in one can manifest elsewhere.

The Old Timers
Cons:
- Old timers are often tired of learning. They seldom want to embrace radical new methods when they are retiring in a few years.
- Old timers are survivalists, or they wouldn't be old timers. They have a great political sense of when not to rock the boat and who not to mess with, and that can prevent or delay the introduction of useful new information.
Mindset traps:
- They approach capacity planning like they approached most of the IT issues that they've faced in their long careers:
  - "Let's start with a database with thousands of metrics!"
  - You never know what will come in handy, so resist deleting them while disk can still be purchased.
- Their reporting systems evolved over a long time, and hence can be hopeless for someone new to decipher or change.
- They can be based on large tables of numbers that only a select few can successfully use.

Avoiding the IT Mindset Traps
So what do we do? How do we get the pros of each type and minimize the downsides?
- You must build a matrixed team containing some of each type.
  - The team concept must have support from the highest levels.
  - It must have priority from each of their respective managements.
  - They must be charged with:
    - enabling the scientists to integrate new tools into the environment
    - getting graphical reporting working that management can understand
    - maintaining just enough information to provide long term historical context for decisions, but no more
- Sometimes you'll have to bring in outside expertise, and the only way that will succeed is to have friends in high places.
- It is critical to put this under an excellent manager.
  - Each of the three types has useful and less useful behavior patterns.
  - You need a manager that all can respect, who doesn't try to be the expert, but rather one who coaches each to be part of a well functioning whole.

The politics of capacity planning in organizations
- Organizational politics are often the key factor in your capacity planning group's eventual success or failure.
- Long experience has taught many of us the importance of:
  - Friends in high places
    - Try to get the capacity planning issue instigated by a knowledgeable VP, or at least a director.
    - Often a major initial stumbling block is even getting permission to install collectors on production systems, much less the physics of actually doing it, and there is nothing better than having their boss's boss saying, "Yes, you must do this, it is a priority."
  - Determining and rating the skills and power balances in your organization, usually by O/S
    - Managerial chaos can be a severe issue.
  - Diagnosing and surmounting the barriers to success
    - Describing the type
    - Their common barriers and techniques to surmount them

Identifying and surmounting barriers. Barrier: The "not invented here" über-geek. Identification clues: They often are early members of a firm. They usually position themselves as masters of several related technologies, but can be rather sparse on details. The younger the firm, the more often you find them; internet firms in high-growth areas are full of them. They are convinced that "If we didn't need it then, we don't need it now!" Their typical barrier methods: "This is not an organizational priority." "This collector code is not proven on our sensitive production systems." Techniques to surmount their barriers: Friends in high places compel them. Share credit for successes with them to their management. Involve them in the model setup, ideally modeling alongside them and letting them suggest probable growth steps.

Identifying and surmounting barriers. Barrier: The high priests of the old tool set. Identification clues: They like twitch monitoring and often have built an extensive installation of monitors with impressive-sounding names like "The war room" or "mission control." Whenever you enter it during non-emergencies, notice how few people are actually using the displays. They prefer current totals like total CPU because they've never had consumption broken out by business-identifiable sub-groupings. They react to brief workload peaks by demanding upgrades. Their typical barrier methods: Stalling. They ask streams of technical questions, and each answer that you give prompts another. Requests to integrate: new capacity tools must feed information to their war room. Techniques to surmount their barriers: Ask them to put long-term, workload-characterized consumption on their displays. Have them tasked to help address automatically detected pathologies (that their monitors did not seem to surface).

Identifying and surmounting barriers. Barrier: The application architects. Identification clues: They rigorously defend their current multi-node spread as vital for the organization, uptime, and scalability. 90% of their machines will be empty or nearly so. The architecture was set in stone a decade ago, and is designed to solve the issues of that time: minuscule PCs. Their typical barrier methods: Lecturing you on how their way is the only way. "Don't you realize that these are business critical systems?" is used to justify all manner of excessive purchasing. They will lecture you on availability and scalability at the drop of a hat. Techniques to surmount their barriers: Show them the serious speedups possible by collapsing application layers onto fewer machines and removing network time from chatty applications. Ask them for estimates of just how much more their application will need to scale, given that it is 7 years old and already in use firm-wide.

Identifying and surmounting barriers. Barrier: The entrenched fire-fighting squad. Identification clues: They offer to work with you, but not today, as there is an emergency. They position themselves as the experts in an application. They are hyper-sensitive to any changes in the environment; they view them as dangerous. "Our conference rooms are named after comic book super heroes!" revisited: when you fly in to interview, everyone is fighting a fire. Their typical barrier methods: They position themselves as must-have team members and then are never available. Beware their commitments to make data or specifics available; they will often be too busy later to do it in a timely manner, if at all. Techniques to surmount their barriers: Agree to work with them as valued members of the team, then ignore them in your plans, as they will always be too busy to help anyway. Never trust them to come through with a key item; always plan for another way to get what they promise that does not involve them. Over time, train them that many of the time-consuming fires that they fight are simple pile-ups of multiple pathologies that won't bite if addressed in a timely manner.

Identifying and surmounting barriers. Barrier: The overwhelmed, outsource-able and scared. Identification clues: They have single functions, often somewhat amorphous, on which it is difficult to put a dollar value. They are not in politically savvy management structures. Their typical barrier methods: They stall, seemingly frightened to take on any task without exact instructions from their management. They view tasks related to capacity planning as "not their priority." They view all new functions as threats. They seem to ignore all information not generated by their own function. Techniques to surmount their barriers: These are politically weak people in politically weak areas; stay away from them so as not to have to rely on them. If forced to work with them, work with their manager to emphasize that capacity planning is an important priority that they cannot stall. Help the good ones get out of that group.

Identifying and surmounting barriers. Barrier: "This is a database server only" DBAs. Identification clues: They claim that "In order to save the firm database license money, we are concentrating the databases from multiple applications on just a few servers, and nothing else can run on these servers." Their typical barrier methods: Outright refusal to try collapsing micro-applications onto database servers. They claim the remaining capacity on the one-third-used database server is "for growth," but are real hard to pin down for specifics, usually because there aren't any. Techniques to surmount their barriers: Try to get them to allow only a certain small percentage of application code on their machines "due to a network emergency." That seems tiny and reasonable. Use a number like 10% to 20%. They don't need to know that that was all of the applications that you ever dreamed of moving. Show them how your automated process pathology code works, to ease their fears about rogue applications eating their machines alive and harming other applications. Praise them to their boss as innovative and balanced problem solvers.

Identifying and surmounting barriers. Barrier: Lying, manipulative project leaders. Identification clues: You are originally asked to model 400 users from a sample of 30. Later they say, "Oh no! We meant 1000 users!" Their typical barrier methods: Some project leaders view themselves as risk minimizers. Sadly, they often feel that 60% excess hardware is a properly sized cushion, so they inflate their usage estimate by 60% to make the modelers justify excess hardware for them. They took 3 extra months to get all these whacky features in, way past their deadline, but now time is an emergency and they need their results immediately, or they just need to buy hardware right away because they have no time to test properly. Techniques to surmount their barriers: Speed. You can model this stuff far faster than they can get a load test to work without half of those whacky features blowing up. Ask more people how many users really are going to be there.

Identifying and surmounting barriers. Barrier: Enthusiastic but "We went to Load Runner class and we absolutely have to run huge saturation load tests" drones. Identification clues: They don't understand that mesa tests and modeling are all that is needed. Even if you can get a decent mesa test out of them, they still want to do a saturated load test anyway. They REALLY BELIEVE two seemingly counterintuitive things: 1. Your operations group must run out and buy exactly the machine and memory that they dreamed up from dubious research for their tests. 2. They do not have to run against realistic data volumes with indexes and sizes similar to intended production. They will NEVER create a statistically relevant data source. They will frankly state: "It is impossible!"

Identifying and surmounting barriers. Barrier: Enthusiastic but "We went to Load Runner class and we absolutely have to run huge saturation load tests" drones. Their typical barrier methods: No matter how many times you say not to, they will always strive to ramp up users at the start and ramp down afterward. Get ready to lose your first and last measurement periods. If you can get a realistic transaction mix from them, they will still strive to run the transactions too fast (the "30 second contract review, 8 hours a day" story). Techniques to surmount their barriers: Always question their user think times, then adjust your model to deal with the silliness that you uncover. Maybe 20% of the samples that I get have realistic transaction arrival rates, so beware. Be consistent; over a series of tests you will wear them down, or get them fired.
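The think-time complaint is easy to check with arithmetic. The sketch below is an illustration only: the function name is made up, it assumes think time dominates each transaction (service time is ignored), and the 5 minute figure is borrowed from the mail message on the following slides.

```python
def transactions_per_user_day(think_time_seconds, hours_per_day=8.0):
    """Transactions one simulated user completes in a working day.

    Assumption: each transaction is followed by the given think time and
    the time spent executing the transaction itself is negligible.
    """
    if think_time_seconds <= 0:
        raise ValueError("think time must be positive")
    return hours_per_day * 3600.0 / think_time_seconds


# A script that "reviews a contract" every 30 seconds, 8 hours a day
scripted = transactions_per_user_day(30)

# A user who really spends 5 minutes reading each contract
realistic = transactions_per_user_day(5 * 60)

# How much the scripted load overstates the per-user arrival rate
inflation = scripted / realistic

print(f"scripted:  {scripted:.0f} transactions per user per day")
print(f"realistic: {realistic:.0f} transactions per user per day")
print(f"the script overstates the load by a factor of {inflation:.0f}")
```

If the ratio is far from 1, scale the arrival rate in the model rather than trusting the scripted rate.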

A mail message to a new fleet of Load Runner enthused contractor drones

The purpose of load tests can be manifold: to test functionality, capacity, and feel. Modeling based on a sample does the same things and more, and usually much faster and cheaper. If you choose to run a load test, be sure to run a realistic transaction mix with the expected blend of all commands, not just one kind. If you are limited by physics to simulating a subset of intended loads (we don't recommend simulating above 20 users per load running PC, for accuracy), we can then take that load and model much higher ones and any alternate hardware that you might dream of. We have these caveats to improve accuracy:

1. Perform the tests on real, not virtual, servers for measurement accuracy.
2. Run a proper mesa test for sampling, which includes:
   A. Make sure that the CPP group has a collector on your intended test machine days before the test.
   B. Start your test precisely on an hour boundary.
   C. Do not, repeat, DO NOT ramp up or ramp down users. Just start and go; 20 users per load runner box will not overwhelm anything. Ramping is not required for models, indeed it is wrong to do it.
   D. Stop precisely on an hour boundary.
   E. Send mail to us telling us:
      I. How many users you simulated
      II. The precise timings
      III. How many more users we should add in the models
      IV. Anything else pertinent

A mail message to a new fleet of Load Runner enthused contractor drones (continued)

3. The purpose of the test is to produce a flat-topped "mesa" of usage that depicts your users acting normally. A graph of CPU consumption should look like a rectangle with a flat, steady top, nowhere near saturated. We then take that sample of happy, unconstrained users and model what hardware is needed for more happy, unconstrained users.
4. Do a practice run several days before your real test to flush out issues, and tell us so we can see how well you followed the mesa instructions.
5. DO NOT do any of the following, which will waste your time, ruin the data and cause rework:
   A. DO NOT ramp up or ramp down usage at the start or end of your tests. It just makes us throw out that data.
   B. DO NOT try to saturate the machine. The models will find that saturation load, so don't waste your time. Concentrate on producing an unsaturated load of happy users getting great response times.
   C. DO NOT try to simulate hundreds of users from one PC with one network card. It will fail or, worse, produce incorrect data leading to massive errors.
   D. DO NOT create loads with unrealistically fast think times. If the user is likely to do a transaction, then wait 5 minutes reading it or processing it, then set the inter-transaction wait time to 5 minutes, not 30 seconds. Remember, your goal is to be realistic, not to have high unrealistic loads.

Mesa tests may seem odd at first, but in time you will learn to love mesa tests and their time and cost savings to projects. After a few of them, you'll never load test the old way ever again. Questions? Please ask, or invite us to your team meetings for a confab!
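The mesa rules in the mail message can be checked mechanically once the collector data is in. The following is a minimal sketch, not the author's tooling: the function name, the 15% flatness tolerance and the 80% saturation cutoff are all assumptions chosen for illustration, and the samples are assumed to be evenly spaced CPU percentages from the test window.

```python
from datetime import datetime
from statistics import median


def check_mesa(start, stop, cpu_samples, capacity_pct=100.0,
               flat_tolerance=0.15, saturation_fraction=0.8):
    """Return a list of problems; an empty list means the run looks like a mesa.

    flat_tolerance and saturation_fraction are illustrative assumptions.
    """
    problems = []

    # Rules 2B and 2D: start and stop precisely on an hour boundary
    for label, moment in (("start", start), ("stop", stop)):
        if moment.minute != 0 or moment.second != 0:
            problems.append(
                f"{label} time {moment:%H:%M:%S} is not on an hour boundary")

    if not cpu_samples:
        problems.append("no samples collected")
        return problems

    plateau = median(cpu_samples)
    low = plateau * (1.0 - flat_tolerance)
    high = plateau * (1.0 + flat_tolerance)

    # Rule 5A: a ramp shows up as first or last samples well below the plateau
    for label, value in (("first", cpu_samples[0]), ("last", cpu_samples[-1])):
        if value < low:
            problems.append(
                f"{label} sample {value:.1f}% is below the plateau, looks like a ramp")

    # Rule 3: the top of the mesa should be flat and steady
    ragged = [v for v in cpu_samples[1:-1] if not low <= v <= high]
    if ragged:
        problems.append(f"{len(ragged)} interior samples stray from the plateau")

    # Rule 5B: the plateau should be nowhere near saturated
    if max(cpu_samples) > saturation_fraction * capacity_pct:
        problems.append("peak usage is too close to saturation")

    return problems


good_run = check_mesa(datetime(2010, 3, 1, 9, 0, 0),
                      datetime(2010, 3, 1, 11, 0, 0),
                      [40, 41, 39, 40, 42, 40])

ramped_run = check_mesa(datetime(2010, 3, 1, 9, 5, 0),
                        datetime(2010, 3, 1, 11, 0, 0),
                        [5, 20, 40, 41, 40, 10])
```

Running this on the practice run (item 4) gives the test team concrete feedback before the real test.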

The politics of capacity planning in organizations: How to win friends and influence people in the operations group. Set up being on the capacity planning team as an aspirational goal, a promotion path, for the operations folks. Try to find an operations or O/S expert at the top of their game and get them assigned to the capacity planning effort. These are often the best acolytes and really take well to capacity planning. As the operations staff start to use the capacity planning reporting and pathology detection systems: praise their efforts and successes to management; coach their failures privately; get them (and their management) to realize that keeping process pathology counts down reduces emergencies and call-outs, and greatly contributes to system stability; train them on the tools so they start to use them and build new skills. If the only users of the capacity planning reports are on the capacity planning team, you are doing something wrong!

The politics of capacity planning in organizations: How to win friends and influence people in the application development group. In addition to the barriers presented previously, you may also encounter: the earnest improver, who takes the time to learn about new technologies and tries to integrate their benefits into their software development lifecycle; the non-technical manager, who may never understand all of the math and formulas, but who will be far better at the political skill required for success; and external vendors whose future profits hinge on success. Try to become an asset to each of these groups: make sure that they see you as a willing partner in their success; work late on their models; help them succeed and get the resources that they need when they need them. Send mail when you work early, late or on the weekends (and CC your boss, of course); it shows that you are really trying to help.

The politics of capacity planning in organizations: How to win over and influence your boss. There are several types of bosses: the experienced true believer, the unbeliever, and the unconvinced cost counter. There are techniques to deal with each. Your goal is to convert the last two into the first one! Keeping all happy will involve deploying collectors, and generating workload-characterized historical consumption web pages and "What if..?" models of future consumption. The key is to survive long enough to get proper network queuing theory model based software purchased in sufficient quantity to make a difference. Get some applications leadership on your side to keep the last two from canning you before you start to get meaningful results on a large scale.

The experienced true believer: Usually you have worked with or for this boss before, so they already know: how expensive the tools can be, so they are not shocked; what a reasonable time for results is; how to help enable your success; what battles to fight, and what battles to avoid. My last 4 gigs have been for someone whom I had either consulted for or worked for. Delivering results delivers career options for you! Characteristics of the experienced believer: Patience. Helps get the software quickly. Helps break through organizational politics to get your collectors quickly deployed. Projects confidence in meetings with other management.

The unbeliever: These folks (often with a development background) are distrustful of "fancy" methods like network queuing theory. This is often based on an insecurity: they don't understand complex tools and thus distrust them. They have made their career by betting on simple solutions and extrapolating linearly. They are often in their position due to management turmoil. In several gigs I've had non-believers in the management structure above me. Characteristics of the non-believer: Initial open contempt of scientific capacity planning methods. They demand results before they help you get collectors in place to answer with a historical basis. They often will throw CPU and memory at disk I/O slowness. They can be turned, but wow, it sure takes patience!

The unconvinced cost counter: These can be great bosses in time, because like scientists, they demand proof before supporting you, but once they have it, they will be true believers. They either have no experience with sophisticated capacity planning, or have had running the group forced on them by higher-ups who have. Characteristics of the unconvinced cost counter: Repeated references early in the process to how much your group and your software cost, and lots of implying that savings results had better surpass that soon. Caution early on, so they will spend the time with other departments getting them to go along with you. They thrive on informational updates, so show steady progress. You don't have to be perfect, just constantly getting better. You'll know that they have switched to true believers when: they start buying you more licenses! They stop complaining about costs. The "We need to show results!" to "Do you need more licenses?" conversion.

Reporting: There are a lot of tragically bad business graphics, and especially capacity planning reports, out there. Issues include: Graphics that distort the viewer's perceptions (quasi-3D; black outlines around bar charts; non-calendar displays of long spans of time). No color consistency. "Foolish consistency" may be the "hobgoblin of little minds," but it is also the key to getting management to use your site for decision making (don't pay attention to "little minds" and "management" appearing in the same sentence). Lots of chrome, little content. Tufte: question every pixel. Basically, any pixel that isn't conveying new data, get rid of it!

Reporting: Other issues that limit effectiveness. Multi-page reports that nobody ever reads: if your answer is so complex that it requires that much evidence, start over on a new one. "They paid $10,000! It has to hit the desk with a thud!" The same thud lives on! Relying on the untrained user to wade in and find the answers themselves: some you can train, most, no. If any correlation of graphics requiring memory is needed, forget it. Ron's position: non-web presentations in general are useless relics of a bygone age. Most of your readers' data comes in hyperlinked form, so get with it or be left behind. Web reports of all nodes in the firm: most users really appreciate ways to see only their span of control.

Reporting: There are also some "must haves." Automated context that graphically highlights when something is out of the ordinary (managers love this stuff). Automated business and hardware context, ideally driven by your CMDB, that includes: hardware and software specifics; business purpose; business owner; primary and backup technical contacts; ideally a text description of its business function; other helpful links.

The Zen of Great Reporting: Seek minimalism in all parts of it. Reduce graphic clutter. Reduce user-perceived complexity. Workload color consistency is a simple must-have. Reduce user choices and actions. If the user needs to know 4 things to make a decision, they had better be close together on the same web page. Add extra information that lets the user more fully understand odd behaviors and situations. Sorting it by date is nice too. Don't restrict yourself to measured quantities. Workload response time detail is one of the most powerful graphics that you can use.

Reporting Examples

Reporting Examples (UNIX)

Reporting Examples (Windows)

Reporting Examples (Windows). Tangent: Multiple Memory Leaks. Here is an example of a rather severe repeating set of memory leaks. See the saw-toothing memory? See the climbing Commit Bytes in a different sequence?

Reporting Examples (Windows). Tangent: Multiple Memory Leaks. When you dig deeper, you can see memory totals by process owner. People often want to blame someone. Alas, sometimes the "someone" is harder to pin down by just username.

Reporting Examples (Windows). Tangent: Multiple Memory Leaks. When you dig deeper, you can see the individual process names that are leaking. In time you'll find the best way to keep them unique; we use process start date/time and PID. You can show these to the "Fake_Name" vendor, and then it is hard for "Fake_Name" to deny a memory leak. I believe that "java" is Finnish for "memory leak."
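Keying on process start date/time plus PID matters because PIDs get reused after each crash. A minimal sketch of that idea follows; the sample layout, the function name and both thresholds are assumptions made for illustration, not the detection code behind the graphs.

```python
from collections import defaultdict


def find_leak_suspects(samples, min_growth_mb=100.0, min_rising_fraction=0.8):
    """Flag process instances whose memory climbs steadily.

    samples: iterable of (name, pid, start_time, sample_time, memory_mb).
    An instance is identified by (name, pid, start_time), so a reused PID
    after a restart is tracked as a separate instance.
    Both thresholds are illustrative assumptions.
    """
    series = defaultdict(list)
    for name, pid, started, when, memory_mb in samples:
        series[(name, pid, started)].append((when, memory_mb))

    suspects = []
    for key, points in series.items():
        points.sort()  # order each instance's samples by sample time
        values = [memory for _, memory in points]
        if len(values) < 3:
            continue  # too few samples to call it a trend
        steps = list(zip(values, values[1:]))
        rising = sum(1 for before, after in steps if after > before)
        growth = values[-1] - values[0]
        if growth >= min_growth_mb and rising / len(steps) >= min_rising_fraction:
            suspects.append((key, growth))

    # Largest growth first, so the worst offender tops the report
    return sorted(suspects, key=lambda item: item[1], reverse=True)


# Hypothetical data: the same PID is reused by a restarted process
samples = []
for hour, memory in enumerate([100, 200, 300, 400, 500]):
    samples.append(("app_server", 3444, "2010-01-15 04:59", hour, memory))
for hour, memory in enumerate([100, 101, 100, 101, 100]):
    samples.append(("app_server", 3444, "2010-01-16 00:07", hour, memory))

suspects = find_leak_suspects(samples)
```

Only the first instance is flagged, even though both share a name and PID.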

Reporting Examples (Windows). Tangent: Multiple Memory Leaks. Well, it is hard to deny a leak, but some "Fake_Name" vendor might want raw data. Since you already have it, put out some CSV files that can be easily mailed to the vendor, eliminating one of their stall tactics.

Reporting Examples (Windows). Tangent: Multiple Memory Leaks. The right way to convey the message: We detected the issue and sent mail to the application owner, stating the exact processes with the issue, that they can expect to keep crashing every day or so until they get the vendor to fix it, and offering to help with data or technical calls. We get no response at all. Three weeks later, we get a request to add memory to the machine. The owner can't get the vendor to respond quickly and wants to reduce outage counts in the meantime. Don't get mad. Stay positive and helpful in tone; they are just trying to help their users have fewer outages. Continue to urge them to turn up the heat on their vendors, but do it in a nice way.

New! Reporting Examples: Windows

New! Reporting Examples: UNIX

Classic capacity planning question descriptions and proper answering techniques: Capacity issues are usually an emergency to someone. Roughly 93% of the requests for upgrades are nonsensical if you have any historical workload-based resource consumption information, so you have to say no in a way that makes the evidence clear. What to expect when you say no: the 5 stages of grief (also called the Kübler-Ross model, http://en.wikipedia.org/wiki/K%C3%BCbler-Ross_model): Denial, Anger, Bargaining, Depression, Acceptance. Always give them a way to succeed along with your no; remember that they may still have a real problem! "No, you don't need CPU or memory, but you are doing 5500 I/Os a second to your slow, locally attached C drive. Can you turn down logging? Can you send those I/Os to fast SAN or RAM drives? Can you get help from your DBA pals?" "No, you don't need more CPUs, you need to fix those looping processes."

Classic capacity planning question descriptions and proper answering techniques: Here is the pattern for this next section: real quotes from the users (disguised, slightly); the evidence; the answer; what happened. I want some interaction on these: if you did it better, speak up! Share! That is what CMG is for! The graphs used in the examples are all homebrew perl and GD:Graphics, and they are used at several firms. Yes, I will share the code if you want it, but sheesh, you can do better! You are going to want some form of screen graphics capture tool. I use freeware ZScreen, downloadable from many sources; it is fabulous.

Classic capacity planning question descriptions and proper answering techniques: User quote: "We are keeping these machines rather heavily loaded." But they won't tell you why. The evidence:

Classic capacity planning question descriptions and proper answering techniques: The answer: It turns out that this application was on three nodes, two heavily used and one lightly used. They wanted a review of each. Is ustca027 too empty? Is ustwa007 too full? Is ustca031 too full? Let's use Relative Response Time by hour to answer them.

Is ustwa007 too full?

Is ustca031 too full?

Classic capacity planning question descriptions and proper answering techniques: What happened: The users are initially shocked to see that the capacity planners, whom they view as machine stealers for VMs, are recommending that they get more hardware! Once they started to understand relative response time graphs, they became quite sophisticated at moving workloads around. You'll know that you've converted them when they e-mail you asking if their IO_Wait could be solved if they split their I/Os over more drives or better RAID choices. The morals of the story: Any vendor can show totals. Favor vendors that show workload-characterized historical views of consumption. Favor vendors that can show you workload relative response times, so that your answers make sense to the business.

Classic capacity planning question descriptions and proper answering techniques: We started getting warnings from our automated checks:

10/03/23 CPU_SATURATION_WARNING: Windows2003 node in04sqp001 used up to 392.920% of an available 400% from 2010/03/23 at 0200 until 2300.
10/03/26 CPU_SATURATION_WARNING: Windows2003 node in04sqp001 used up to 394.572% of an available 400% from 2010/03/26 at 0000 until 2300.
10/03/27 CPU_SATURATION_WARNING: Windows2003 node in04sqp001 used up to 396.000% of an available 400% from 2010/03/27 at 0000 until 2300.
10/03/28 CPU_SATURATION_WARNING: Windows2003 node in04sqp001 used up to 392.920% of an available 400% from 2010/03/23 at 0300 until 2300.

The evidence (here's what the sparkline looked like):
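A warning in this format can be produced from hourly peak CPU figures with very little code. This is only a sketch of one way to do it: the slides do not say how the real checks decide that a node is saturated, so the 95% threshold, the function name and the input layout are all assumptions.

```python
def saturation_warning(os_name, node, day, hourly_peak_pct, cpus, threshold=0.95):
    """Build a warning line in the format shown above, or return None.

    day: 'YYYY/MM/DD' string.
    hourly_peak_pct: dict mapping hour (0-23) to peak CPU percent, where
    each CPU contributes 100%.
    threshold: fraction of available CPU treated as saturated (an assumption).
    """
    available = 100.0 * cpus
    hot_hours = [hour for hour, pct in sorted(hourly_peak_pct.items())
                 if pct >= threshold * available]
    if not hot_hours:
        return None

    peak = max(hourly_peak_pct[hour] for hour in hot_hours)
    short_day = day[2:]  # '2010/03/23' becomes '10/03/23'
    return (f"{short_day} CPU_SATURATION_WARNING: {os_name} node {node} "
            f"used up to {peak:.3f}% of an available {available:.0f}% "
            f"from {day} at {hot_hours[0]:02d}00 until {hot_hours[-1]:02d}00.")


# Hypothetical hourly peaks for a 4 CPU node: quiet until 0200, then pegged
hourly = {hour: 100.0 for hour in range(24)}
for hour in range(2, 24):
    hourly[hour] = 390.0
hourly[14] = 392.92

warning = saturation_warning("Windows2003", "in04sqp001", "2010/03/23", hourly, cpus=4)
print(warning)
```

One line per node per day keeps the morning mail short enough that people actually read it.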

Classic capacity planning question descriptions and proper answering techniques: More evidence:

Classic capacity planning question descriptions and proper answering techniques: My initial suspicions were "code improvement opportunities," so I contacted my DBA pals:

Classic capacity planning question descriptions and proper answering techniques: Those CPU graphs, with response time increases due to CPU_Wait when they hit the knee in the curve:

Classic capacity planning question descriptions and proper answering techniques: The answer from my DBA pals:

Classic capacity planning question descriptions and proper answering techniques: What happened (the changes went in on Mar 29th):

Classic capacity planning question descriptions and proper answering techniques: "What about the charts, Ron?"

Classic capacity planning question descriptions and proper answering techniques: Things to learn from this example: Not all code innovations work as efficiently as desired. SQL developed in far-flung places for even farther-flung places is especially suspect. "When the answer is correct, the code is done." Well, maybe not. Not all innovations will go through a rigid capacity planning review. You need either automated warnings or to take the time to scan thousands of graphs often to detect and correct these. You need fast graphical evidence to get fast reactions. You need to go out of your way to be nice to DBAs; they will save your firm millions if you let them, and if you only ring them up when there is real evidence of mayhem. Always ask their boss to praise their efforts; those memos come in handy at review time.

Classic capacity planning question descriptions and proper answering techniques: Many of you will be deploying virtual terminal environments to hundreds of users. What if something goes a little wrong? The evidence:

Classic capacity planning question descriptions and proper answering techniques: The answer: We started ticketing suspicious CPU-consuming VMware slices on Feb 3rd. Most of it was Bezier curve screen savers! We banned them. What happened: We got back more than half of our VMware farm!

Classic capacity planning question descriptions and proper answering techniques: User quote: "I was wondering if we could get the memory increased on our Exchange 2007 CAS servers USTCAX100 and USTWAX100? Right now both servers are running 4.25GB and I would like to move them to 8GB. We are seeing performance issues with those servers and we are noticing that RAM usage is at 80%-90% or higher all of the time. Users are starting to notice this with Communicator. Due to the fact that it can't get a response quick enough from CAS, it is putting an exclamation point on the communicator alerting them to address book issues. If we are not able to increase the memory, the only other option would be to add more CAS servers in the environment to balance the load. We also are going to be increasing the load on these servers with the 2000 users we will be adding to the North America environment from the XYZ Co. acquisition and moving South American users to North America servers. Please let me know if this is feasible or not?"

Classic capacity planning question descriptions and proper answering techniques: The evidence: First, look to see if anything has gone wrong recently. They might be reacting to a recent problem, but don't stop there.

Classic capacity planning question descriptions and proper answering techniques: The evidence: Looking deeper, we don't see a memory shortage (there is evidence of a slight leak): paging is very low and CommitBytes isn't anywhere near CommitLimit. But CPU seems in short supply, and the CPU Wait component of relative response time is huge. Their short-term performance issue is due to a CPU shortage, not memory!

Classic capacity planning question descriptions and proper answering techniques: The answer: Along with sending the graphs from the previous page (and getting them to address the lsass loop), we added two virtual processors to this VMware slice. Note that if you disagree with their solution, give them an alternative that fixes present issues. We may give them more memory later, when they've earned it.

Classic capacity planning question descriptions and proper answering techniques: What happened: The CPU Wait disappeared immediately. The users' immediate issues were solved. The users now know that decisions will be based on evidence, the results will be real, and they like it! Hardware in use for a growing application will grow, but slowly.

Classic capacity planning question descriptions and proper answering techniques: Sometimes your own systems detect problems, so answer in a way that provides all required information.

Hey folks, there is still one more issue, with the imjpmig process, the Input Method Editor, which lets you use Japanese characters. It is looping regularly:

10/01/15 LOOP_PROBLEM: 3444 running imjpmig CPU looped from Jan 15 04:59:54 until Jan 15 23:54:53 and may still be looping.
10/01/16 LOOP_PROBLEM: 3444 running imjpmig CPU looped from Jan 16 00:07:48 until Jan 16 23:54:58 and may still be looping.
10/01/21 LOOP_PROBLEM: 5344 running imjpmig CPU looped from Jan 21 13:59:59 until Jan 21 23:54:58 and may still be looping.
10/01/22 LOOP_PROBLEM: 5344 running imjpmig CPU looped from Jan 22 00:01:27 until Jan 22 23:54:56 and may still be looping.
10/01/23 LOOP_PROBLEM: 5344 running imjpmig CPU looped from Jan 23 00:01:25 until Jan 23 23:54:53 and may still be looping.

I changed the workload to just highlight Input Method Editor by itself. I also found a bunch of patches available: http://search.microsoft.com/Results.aspx?q=imjpmig+downloads&mkt=en-US&FORM=QBME1&l=1&refradio=0&qsc0=0
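The slides do not show how LOOP_PROBLEM is detected. One plausible approach, sketched here purely as an assumption, is to look for a process that stays pinned near a full CPU for many consecutive collector samples; the function name, the 95% cutoff and the minimum run length are all illustrative.

```python
def find_loop_runs(cpu_pct_by_sample, busy_pct=95.0, min_samples=12):
    """Return (first_index, last_index) for each sustained busy run.

    cpu_pct_by_sample: one process's CPU percent per collector interval,
    where 100 means one fully busy CPU.
    A run counts as a suspected loop only if it lasts at least min_samples
    consecutive intervals. Both thresholds are illustrative assumptions.
    """
    runs = []
    run_start = None
    for index, pct in enumerate(cpu_pct_by_sample):
        if pct >= busy_pct:
            if run_start is None:
                run_start = index  # a busy run begins here
        else:
            if run_start is not None and index - run_start >= min_samples:
                runs.append((run_start, index - 1))
            run_start = None
    # A run still open at the end of the data "may still be looping"
    if run_start is not None and len(cpu_pct_by_sample) - run_start >= min_samples:
        runs.append((run_start, len(cpu_pct_by_sample) - 1))
    return runs


# Hypothetical samples: idle, a long pegged stretch, idle, a short burst
samples = [5] * 3 + [99] * 15 + [4] * 2 + [98] * 5
loops = find_loop_runs(samples)
```

The short burst at the end is ignored, which keeps legitimate brief peaks out of the morning mail.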

Classic capacity planning question descriptions and proper answering techniques
What happened?
Eventually they got the fix migrated to production, and everything worked fine from then on.
Don't get discouraged if folks don't always do what you want immediately.
Change controls, priority conflicts and other issues may stall the fix.
With enough graphical evidence, eventually you will win!

Classic capacity planning question descriptions and proper answering techniques
Ron logs in on a Saturday to work on slides for UKCMG ("Again! And what do you get paid to do this?" asks my dear wife) and sees the following:
The evidence (from my pathology detection code's morning mail):
CPU saturation found:
CPU_SATURATION_WARNING: Windows2000 node ustca337 used up to 99.000% of an available 100% from 2010/03/12 at 0400 until 2300.
CPU_SATURATION_WARNING: Windows2003 node ustwasbx16 used up to 99.000% of an available 100% from 2010/03/12 at 1400 until 2300.
CPU_SATURATION_WARNING: Windows2003 node uktcas06 used up to 99.000% of an available 100% from 2010/03/12 at 0300 until 2300.
CPU_SATURATION_WARNING: Windows2003 node ustca227 used up to 99.000% of an available 100% from 2010/03/12 at 0400 until 2300.
CPU_SATURATION_WARNING: Windows2003 node ustca724 used up to 99.000% of an available 100% from 2010/03/12 at 0400 until 2300.
CPU_SATURATION_WARNING: Windows2003 node ustcas44 used up to 99.000% of an available 100% from 2010/03/12 at 0400 until 2300.
CPU_SATURATION_WARNING: Windows2003 node ustcas54 used up to 99.000% of an available 100% from 2010/03/12 at 0400 until 2300.
CPU_SATURATION_WARNING: Windows2003 node ustca088 used up to 99.000% of an available 100% from 2010/03/12 at 0800 until 2300.

Classic capacity planning question descriptions and proper answering techniques
The evidence, continued:
Whenever a whole bunch of bad things happen synchronized over many machines, think global tool.
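The "synchronized over many machines" test is easy to automate once the warnings are in a consistent format. The sketch below parses CPU_SATURATION_WARNING lines in the format shown in the morning mail and reports the hours in which several nodes were saturated at once. The function name and the min_nodes cutoff of 3 are assumptions for illustration, not part of the actual detection system.

```python
import re
from collections import defaultdict

# Matches the warning format shown in the morning mail.
WARNING_RE = re.compile(
    r"CPU_SATURATION_WARNING: (?P<os>\S+) node (?P<node>\S+) used up to "
    r"(?P<pct>[\d.]+)% of an available [\d.]+% "
    r"from (?P<day>\d{4}/\d{2}/\d{2}) at (?P<start>\d{4}) until (?P<end>\d{4})"
)


def synchronized_saturation(lines, min_nodes=3):
    """Return {(day, hour): [nodes]} for hours with >= min_nodes saturated nodes.

    min_nodes is a hypothetical cutoff; tune it to the size of your estate.
    """
    by_hour = defaultdict(set)
    for line in lines:
        m = WARNING_RE.search(line)
        if not m:
            continue  # not a saturation warning, skip it
        first_hour = int(m["start"]) // 100
        last_hour = int(m["end"]) // 100
        for hour in range(first_hour, last_hour + 1):
            by_hour[(m["day"], hour)].add(m["node"])
    return {
        key: sorted(nodes)
        for key, nodes in by_hour.items()
        if len(nodes) >= min_nodes
    }
```

A non-empty result is the cue to go looking for a monitoring agent, backup job, or script that was pushed to all of those nodes.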

Classic capacity planning question descriptions and proper answering techniques
This is really bad news: a critical (Business Sensitive / Critical) production server doing its normal, real sqlservr workload, with a Tool process going on a CPU binge and causing excessive response times due to CPU_Wait.

Classic capacity planning question descriptions and proper answering techniques
The answer:
A new piece of monitoring code was installed, BREAKING THE "NO NEW CODE INSTALLS ON A FRIDAY" rule!
What happened:
The code creator had deployed a new script, and he reviewed it after getting mail about all of the warnings:
"This was a bug in a script update that I made; we should be seeing this behavior on most of the attached server list. ______ is pushing out an update to the script now; once this is done we'll have to log into each of the affected servers, verify the looping process is running sqlcheck.vbs, and kill it."
We were able to swiftly detect and fix the issue.
How would your site do this?

Classic capacity planning question descriptions and proper answering techniques
What we saw:
We started getting "Commit_Bytes approaching Commit_Limit" warnings:
10/04/05 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 5 18:00:00 until Apr 5 23:59:00 and may still be.
10/04/06 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 6 00:00:00 until Apr 6 23:59:00 and may still be.
10/04/07 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 7 00:00:00 until Apr 7 23:59:00 and may still be.
10/04/09 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 9 00:00:00 until Apr 9 23:59:00 and may still be.
10/04/10 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 10 00:00:00 until Apr 10 23:59:00 and may still be.
10/04/11 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 11 00:00:00 until Apr 11 23:59:00 and may still be.
10/04/12 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 12 00:00:00 until Apr 12 23:59:00 and may still be.
10/04/13 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 13 00:00:00 until Apr 13 23:59:00 and may still be.
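The check behind a warning like this can be as simple as a ratio. The sketch below reads "within 80% of Commit Limit" as "Commit Bytes have reached at least 80% of Commit Limit", which is an assumption about the wording; the function name and parameter names are hypothetical.

```python
def commit_pressure(commit_bytes, commit_limit, warn_fraction=0.80):
    """Return (ratio, warn) for one sample of the two Windows memory counters.

    warn is True when committed memory has reached warn_fraction of the limit.
    The 0.80 default mirrors the 80% in the warnings; the reading of that
    figure as a simple ratio is an assumption.
    """
    if commit_limit <= 0:
        raise ValueError("commit_limit must be positive")
    ratio = commit_bytes / commit_limit
    return ratio, ratio >= warn_fraction
```

Run it per sample, then report the first and last sample of each unbroken run of warnings to get "from ... until ..." lines like the ones above.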

Classic capacity planning question descriptions and proper answering techniques
The answer:
Clearly this application has a memory leak in a jlaunch process (run by the SAPServicePRG user).
You have two options:
Get them to patch/fix the application, or
Get them to reboot the machine periodically so that you don't start paging hard and affect performance.
So you notify the project leader:
Hi all,
If you look at memory usage over the last few months on these three servers, you'll see steady and/or repeating ramps.
http://ustwu002.kcc.com/node_reports/ustca146/memory.html
http://ustwu002.kcc.com/node_reports/ustca147/memory.html
http://ustwu002.kcc.com/node_reports/ustca148/memory.html
This leads eventually to warnings like these:
COMMIT_BYTES_PROBLEM: On ustca146, Commit Bytes were within 80% of Commit Limit from Apr 6 00:00:00 until Apr 6 23:59:00 and may still be.
COMMIT_BYTES_PROBLEM: On ustca147, Commit Bytes were within 80% of Commit Limit from Apr 6 00:00:00 until Apr 6 23:59:00 and may still be.
COMMIT_BYTES_PROBLEM: On ustca148, Commit Bytes were within 80% of Commit Limit from Apr 6 00:00:00 until Apr 6 23:59:00 and may still be.
and after that, when commit bytes hits commit limit, you can experience rather severe application slowdowns.
In every case, the major rising memory consumer seems to be jlaunch processes run by SAPServicePRG. Most recently:
PID 6160 on ustca146 started Mar 2 20:54:58
PID 3772 on ustca147 started Mar 2 20:54:50
PID 8032 on ustca148 started Mar 2 20:54:56
Could someone take a look at these to see if a fix is possible? If not, could we recycle these jlaunch processes, perhaps weekly, to keep memory usage down?
Thanks for looking!

Classic capacity planning question descriptions and proper answering techniques
What happened:
Hi Ron,
Thank you for keeping an eye on these servers! You are right, there is a steady growth of memory usage by the SAP PRG processes on these application servers. This is not a surprise. There are several known issues regarding memory leaks with the current version of the Java hibernate libraries being used in the fake_name application and old fake_product.
We have worked with the application vendor, fake_name, to resolve some of the more significant issues that were causing regular outages. Fake_vendor has not resolved some of the less-severe issues.
There are plans to upgrade the entire application suite and change the underlying application execution platform from fake_product to new fake product. The application upgrade includes new libraries for hibernate, and the memory leak issues related to hibernate with fake_product have not appeared in new fake product. The landscape upgrade is currently scheduled for June.
We will go ahead and schedule a recycle of the old fake product to recycle the jlaunch processes you mentioned below. We will schedule regular process recycles until the system is upgraded.
Please let me know if you have any additional questions or concerns.
Thank you!

Classic capacity planning question descriptions and proper answering techniques
Memory leaks, key points to remember:
Graphics help get their attention; CSV files are there for the whackos who demand "the real data".
Sometimes they say that they need it to prove the problem to the vendor. Believe me, the vendor usually knows all too well.
It is easy to do and nips their evasions in the bud. Remember the stall techniques?
Sometimes they can't, or aren't going to, fix it. Welcome to big corporations and priorities.
Then you need to get them to reboot periodically to get the leaked memory back.
Do you have the graphs and data quickly available to discover, document and communicate this?
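Spotting a "steady ramp" does not have to wait for someone to look at a graph. Here is a rough sketch of one way to screen per-process memory samples for leak-like growth. It is not the method used in the talk: the function name and both thresholds (80% of steps rising, 20% net growth) are hypothetical, and a real system would also have to handle the sawtooth you get after each process restart.

```python
def looks_like_leak(mem_samples, min_rising_fraction=0.8, min_growth_fraction=0.2):
    """Screen one process's memory samples (oldest first, evenly spaced).

    Returns True when most sample-to-sample steps go up AND the net growth
    over the window is large relative to the starting size. Both thresholds
    are assumed values for illustration.
    """
    if len(mem_samples) < 3 or mem_samples[0] <= 0:
        return False  # too little data to call it a ramp
    steps = [later - earlier for earlier, later in zip(mem_samples, mem_samples[1:])]
    rising_fraction = sum(1 for step in steps if step > 0) / len(steps)
    net_growth = (mem_samples[-1] - mem_samples[0]) / mem_samples[0]
    return rising_fraction >= min_rising_fraction and net_growth >= min_growth_fraction
```

Anything this flags still deserves a graph before you send the mail: the picture is what gets their attention.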

Classic capacity planning question descriptions and proper answering techniques
The evidence:
Subject: Possible disk space issue looming on ustca479
Hi All,
Here is a view of total disk space and disk space used on ustca479:
Perhaps some purge/delete/cleanup is in order?
Ron Kaminski

Classic capacity planning question descriptions and proper answering techniques
The answer:
Subject: RE: Possible disk space issue looming on ustca479
Ron,
Thank you for the heads up. The increased disk space utilization is partially due to enhanced logging that we have enabled over the past few months. I have cleaned up some old logs and we will continue to monitor the disk utilization to determine if additional disk space is required.
Thanks,
Matt

Classic capacity planning question descriptions and proper answering techniques
What happened:
Well, it was a start! But alas, note the inexorable rise beginning again after the cleanup.
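When the rise starts again after a cleanup, a "days until full" number makes the follow-up mail harder to ignore. The sketch below fits a straight line to daily disk-used samples and projects when the trend reaches capacity. It assumes evenly spaced daily samples taken since the last cleanup and roughly linear growth; the function name and units are hypothetical.

```python
def days_until_full(used_gb, capacity_gb):
    """Project days until the disk fills, from daily 'used' samples (oldest first).

    Uses an ordinary least-squares slope over the sample index (one sample
    per day is assumed). Returns None when usage is flat or shrinking.
    """
    n = len(used_gb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(used_gb) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(used_gb))
    slope = sxy / sxx  # GB per day
    if slope <= 0:
        return None
    return (capacity_gb - used_gb[-1]) / slope
```

For example, five daily samples growing by 2 GB a day, ending at 58 GB on a 100 GB volume, project to 21 days of headroom.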

Classic capacity planning question descriptions and proper answering techniques
The best way to deal with these is to avoid them proactively by making great, workload-characterized consumption information available to all.
Train your firm to use the capacity reporting and pathology detection systems.
You have automated pathology detection, all the way through ticketing issues, haven't you?
Think graphics, not tables of numbers.
If only a secret club knows the capacity data, you are making a big mistake.
Train OS support folks to use the "What if?" models.

What I said about clouds and SaaS last year:
Say goodbye to your data centers and your privileges, folks!
Cloudy days are coming, and this is good.
Paying people in each firm to worry about OS, backup, security, and staying current was always expensive, and now it is ridiculous.
Change firms a few times and note how wildly different "It has to be this way!" is.
Our capacity planning needs, and tools, will have to change too.
Instead of vendors selling you software, many will sell the service running on their cloud.
This is great! Let the vendor maintain their own code!
They are naturally the cheapest way, since the expertise needed is naturally concentrated there.
Having had a year more to search for and find issues, I see a few potential storm clouds in some firms' sunny plans! Let's dig into why.

Clouds and Software as a Service
Definition: Clouds = Running our stuff on someone else's computers, plus whatever else will be needed for the new demands that will place on us, like:
Encryption, so we can run sensitive corporate data over the world wide web safely. Note that this is done on both sides, the user's machine and in the cloud. This may be an unpleasant surprise for firms that have replaced those "expensive" desktop processors (and all that excess capacity) with light desktops running virtual machines on shared hardware.
Exhaustive disk cleansing when we delete files or parts of files.
Network lag measuring tools, because there will be slowdowns and our users will want to direct their wrath.
Increased firm internet firewall bandwidth.
Increased firm internet bandwidth.

Cloud issues
"We'll just run everything in someone else's cloud, so we won't need capacity planning any more. It will be the cloud vendor's problem!"
Clouds will place new, different, and often resource-intensive demands on our firm's computing infrastructure.
Capacity concerns will become very important, and historical records of what consumed what will be paramount for figuring things out.
Someone is going to have to pay for all of that extra processing, and it won't be the vendor!
"The Mushroom Cloud" will be appearing at firms that ignore these risks.

Clouds and Software as a Service
Definition: Software as a Service = Letting someone else run their code on their machines to serve us, but undoubtedly with our data, plus whatever else will be needed for the new demands that will place on us:
If there is customer-identifiable information, we will need all of that encrypt/decrypt overhead again.
Disk cleansing will be less of a priority because, unlike in the cloud, no one else can run disk scrapers.
Network lag measuring tools, because there will be slowdowns and our users will want to direct their wrath.
Increased firm internet firewall bandwidth.
Increased firm internet bandwidth.

Other Cloud and SaaS issues
The key thing to remember is that cloud and SaaS vendors will eventually have to operate at a profit!
This will drive them to the same attempts to economize that your firms are trying now:
Big and cheap IO devices, which are of course much slower.
Virtualization will be a certainty; you will never know what fraction of what hardware you will be on.
Architectural choices of the firm's past won't make sense any more.
What hardware largess do you tolerate now for "Mission critical" applications? Hot spares? N+1 copies of data?
Will your cloud vendor leave enough excess capacity for your theoretical worst case? How will you be sure?

Other Cloud and SaaS issues
And remember the graph: they have to run it on 2X to 3X+ the hardware for the same loads!
Unless your firm's Data Processing division is utterly ridiculous in its spending (and many are), how can clouds be cheaper?
Clouds and SaaS only make sense when the non-hardware savings exceed the hardware and network costs, or when they provide other useful business opportunities.
Perhaps outsourcing a staff-intensive application to a SaaS vendor is still a really great idea.

The moral of the story:
Eventually businesses may evolve into partial cloud and SaaS users when the overhead of extra processing is smaller than some fraction (I'll go out on a limb and say half) of the average resources needed to run the application and the security demands are low, and/or the total function cost is lower.
Quick! Think of a low-security function at your firm that you would be happy to have some greasy-haired geek intercept, and put that in a low-security cloud.
I couldn't think of any as an example! Can anyone here?
Almost all real corporate work will demand far more internal resources to run externally than to run internally.
Be sure to add those costs to your cloud and SaaS plans!
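The rule of thumb in the moral above can be written down as a quick screening test. This is only one reading of the "and/or" in that sentence: an application passes if the extra processing overhead is under the chosen fraction of its average resource use and its security demands are low, or if the total cost of running the function externally is simply lower. All names and units here are hypothetical; the 0.5 default is the "half" from the slide.

```python
def worth_moving_out(avg_resources, extra_overhead, low_security,
                     internal_cost, external_cost, overhead_fraction=0.5):
    """Screen one application as a cloud/SaaS candidate.

    avg_resources and extra_overhead must be in the same unit (for example
    CPU-hours per day); internal_cost and external_cost are total function
    costs, including the extra internal processing the move creates.
    """
    overhead_is_small = extra_overhead < overhead_fraction * avg_resources
    cheaper_outside = external_cost < internal_cost
    return (overhead_is_small and low_security) or cheaper_outside
```

The hard part is not the test but the inputs: the encryption, bandwidth, and monitoring overheads have to be measured and added to the external cost before the comparison means anything.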

The story continues
Go out and repeat my analysis at your firm on one of your firm's attempts to do it.
Publish a paper via CMG or elsewhere where you outline the specific true costs in consumption and hosting spend.
If your costs come out like mine did, i.e. "This doesn't make a lot of sense!":
expect a flood of "analyst" calls from "consulting groups" wanting you to expound on your cloud computing experiences
expect some wholehearted chuckling and agreement that it is nuts
I think that many firms are acting like consulting groups when they are in fact trying to gather data to beat down internal pushes to go cloud or to go SaaS.
Or they are consulting to potential cloud providers and giving them a less than rosy view.
For now, I would proceed very slowly.

Clouds, last words
I used to live not far from here, in Oviedo, FL.
Every summer day a lot of sunlight hitting the swampy ground would create a lot of hot, rising, moist air, so we had clouds and thunderstorms at about 3 PM each day.
In IT journals and analysts' sessions, there is a lot of hot, moist, hype-filled air rising.
Maybe that is why they see clouds!

Modeling when all of the cards are stacked against you
In a perfect world, when new code is written:
There is a comprehensive test plan to verify functionality.
All issues are corrected prior to the capacity planning tests.
The capacity planning tests are performed on real (non-virtual) hardware with known characteristics.
The testing group are old pros who know how to run a proper "mesa" test, which has:
An hour of nothing.
An unsaturated hour of a realistic transaction mix, with realistic think times, on a realistic indexed database, without excessive application logging or extraneous monitors adding logging loads to key disks.
Followed by an hour of nothing.
And finally a truthful estimate of expec