Foreign speaker rules
Please feel free to stop me to ask any questions.
Raise your hand or clap if I am going too fast or if my Mississippi accent becomes impossible for y'all to understand.
This is not rude, and I will not take it that way.
The paper and all slides will be furnished to my hosts.
Slide 3
Introduction
Over the past 20 years, I've started and expanded capacity planning groups at dozens of firms; my most recent is now 15 months old.
You learn things in that process, and CMG is the place to share this information. I look forward to your presentation on this topic in a few years!
Today's goal is to give you planning and audit points that you can use to review how you do capacity planning, and maybe persuade you that other methods might be more productive, or at least worth a shot!
There will also be "how to" information that may have you adding some "to do" items to your list.
If you have a question, ask it! I like nothing better than surfing off on a tangent that helps the class.
Story Times!
New risks
Ron Kaminski 2010, All Rights Reserved
Slide 4
Introduction
In the next few hours, we will cover:
Defining your mission
Picking the right vendor partners
Going Extra-Product
Avoiding the IT Mindset Traps
The politics of capacity planning in organizations, the key factor in your eventual success or failure
Reporting: what you should and, surprisingly, should not do
Classic capacity planning question descriptions and proper answering techniques
Slide 5
Introduction
In the next few hours, we will cover:
How clouds and software as a service will still need capacity tracking and planning tools, and what new kinds you will need
Modeling when all of the cards are stacked against you, or "Tricks of the trade"
Goals to work towards
An audit list to compare to your systems
Capacity planning done well can change the fortunes of a company and help all of our careers. Come sharpen your methods and learn tricks that will make you part of your firm's future productive assets, and not an expense to be controlled.
Slide 6
Ron's Rules
You can ask anything, at any time.
Sometimes the answer is coming up soon in the examples, and in that case I'll tell you so.
Quick Survey: Does anyone here already have:
A network queuing theory based modeling package?
Regular, automated process and workload pathology detection?
Fast web reporting of resource consumption by business useful workloads?
By the end of this talk, I hope that you will realize that workload characterized views of consumption, web accessible, over business useful time spans, are a must-have part of the best run IT shops. Let's see why.
Slide 7
Defining your mission
Every site has their own "Hot button!" issues:
"We are buying a new $23 million computer room every 6 months!" Attack server sprawl with data, not words.
"I don't know why we hired a capacity planner, we just ..."
"Our critical applications are slowing down!" Use relative response times and historical information to show why.
Chargeback used to be a big draw, but it has really faded away in the post-dot-com world. It shows you when you are talking to an old vendor.
The ITIL push and reality: when facing outsourcing or ZOG, ITIL takes a back seat to cost control, at least in the States.
"We need better reporting!" Be careful to be holistic in what you deliver: cover everything that they can buy, historically and ideally with business cycle peaks.
When you start hearing terms like "focus on business priority" and "really look at travel expenses", realize that cost cutting is in your future and report in ways that enable them to cut power and machines.
Slide 8
Defining your mission
You might think that all that variation would lead to very different solutions, and you'd be wrong!
All effective capacity planning systems are based on having:
Efficient data collection, regrouping, reduction and storage
Effective graphical reporting of business meaningful spans of time
Components of workload response time that lead to diagnosis
Solving the desire for answers to "What if?" questions
Problematic consumption diagnosis, reporting and ticketing
Some capacity planning product features marketed by vendors to the naive are actually seldom used in the real world, and for good reasons:
Linear Trending, when what you really need is business cycle discovery and planning (the retail cycle at grocery chains and web payment system vendors)
Real Time Monitors, when you might want to go home or on vacation some day. Remember, problems happen 24 x 7, and humans won't be watching twitch monitors that consistently. (The mission control room story)
Top 10 is often used to focus a newbie on peak consumption, which may all be valid
Slide 9
Defining your mission
Who is doing the reporting?
Vendor supplied reports:
Tend to be single metric
Often don't include contextual information
Are often generated on demand, and therefore any useful span of time takes beyond the allowable attention span
Often have serious contextual clarity problems
Workloads change colors as the number present changes or as you switch machines
Use black outlines that swamp the colors for small workloads
The "I'm only using vendor reports this time" and hit count story
Can take unimaginable resources to produce. Set yourself a consumption budget and manage to it. "You want to trade more bonds? Stop looking at it!"
May focus on reporting "right now" data rather than long term useful decision support information
Seldom contain "disturbance to the status quo" notation capabilities
Slide 10
Defining your mission
Who is doing the reporting?
Write your own reports:
Can be anything that you dream up (and can deliver the code for)
There are multiple free languages and infrastructure to pick from. We've used Perl, PHP, Java and a whole lot more
Can be tailored for your firm's decision makers' specific needs
Can use "generate ahead" and other techniques to speed web reporting
Writing your own can also have down sides:
Staff turnover and the "Who is going to maintain this ___?" issues
Some staff are not gifted visual communicators
If the information used changes formats (and over time they all do), someone is going to have to maintain that stuff
Slide 11
Defining your mission
What do you want to present?
Workload characterized subdivisions of consumption over time?
Long term historical context for decision makers over multiple natural business cycles?
Information subdivided into audience specific groupings for ease of use by subgroups
Integration into your firm's CMDB, ticketing systems, and software development life cycle
Totals over time: the sparklines counter-argument
Slide 12
Why sparklines of totals can be really useful
These are sparklines of total CPU used, average CPU used, and the average CPU used by all nodes in that O/S.
Is there one in particular that draws your eyes to it, that makes you want to probe deeper?
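A sparkline of the kind shown on this slide can be sketched in a few lines. This is only an illustration, assuming one CPU-percent sample per interval per node; the node names and numbers are made up, and `sparkline` is a hypothetical helper, not the reporting system described in the talk.

```python
# Minimal text sparkline sketch. All data and names here are illustrative assumptions.
BLOCKS = "▁▂▃▄▅▆▇█"


def sparkline(values):
    """Scale each value between the series' own min and max and pick a block character."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A perfectly flat series is drawn at the lowest level.
        return BLOCKS[0] * len(values)
    top = len(BLOCKS) - 1
    return "".join(BLOCKS[int((v - lo) / span * top)] for v in values)


# Hypothetical CPU-percent samples: one steady node, one that steps up.
cpu_by_node = {
    "node-a": [12, 14, 13, 15, 14, 13, 12, 14],
    "node-b": [10, 11, 10, 12, 38, 40, 39, 41],
}

for node, samples in cpu_by_node.items():
    print(f"{node:8s} {sparkline(samples)}")
```

Each row here is scaled to its own range, which makes a step change stand out but also exaggerates small wiggles on a steady node; scaling every row to a shared range is the other reasonable choice.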
Slide 13
Why sparklines of totals can be really useful
If you are like me, ustca102 has you wondering, "What made it step up like that?"
On our system, clicking on the tiny sparkline brings up a zoomed in image, which really gets you wondering:
Clicking on that graphic brings up our normal web reporting system:
Slide 14
Why sparklines of totals can be really useful
Slide 15
Why sparklines of totals can be really useful
OK, sometimes totals are useful.
Sometimes they can draw your eye to issues.
They can quickly dispel rumors that "All of our machines are maxed out!"
For example, our applications specialists were consistently maintaining that all of their machines were barely big enough to make month end, and they would argue mightily whenever we suggested that there was room for consolidation.
I brought the chart on the next slide to the next meeting, and suddenly their tune changed.
Slide 16
Why sparklines of totals can be really useful
Slide 17
Why sparklines of totals can be really useful
What happened after the meeting?
In the next 9 months, using extremely conservative criteria, we:
Virtualized 230 machines ($1,521,000)
Retired 55 machines ($390,553). "Oh! You can just turn that off!", or, "See steam come out of the operations folks' ears" stories
Planned 10 machines ($40,000)
Potential 28 machines ($112,000)
We then plan on going back over with slightly less conservative criteria and finding a couple million more.
We will also be doing more application stacking where it makes more sense.
Sort of makes capacity planning tools look cheap, doesn't it?
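The four line items above add up quickly. The sketch below simply totals the machine counts and the dollar figures as the slide lists them; splitting them into "done" and "pipeline" groups is my own assumption about how to read "Planned" and "Potential".

```python
# Figures copied from the slide; the done/pipeline grouping is an assumption.
done = {"virtualized": (230, 1_521_000), "retired": (55, 390_553)}
pipeline = {"planned": (10, 40_000), "potential": (28, 112_000)}


def totals(groups):
    """Return (machine count, dollar amount) summed over a dict of (count, dollars)."""
    machines = sum(count for count, _ in groups.values())
    dollars = sum(amount for _, amount in groups.values())
    return machines, dollars


done_machines, done_dollars = totals(done)
pipe_machines, pipe_dollars = totals(pipeline)

print(f"done:     {done_machines} machines, ${done_dollars:,}")
print(f"pipeline: {pipe_machines} machines, ${pipe_dollars:,}")
print(f"all:      {done_machines + pipe_machines} machines, ${done_dollars + pipe_dollars:,}")
```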
Slide 18
Why sparklines of totals can be really useful
A DBA pal of mine asked for a review of memory on a box, asking for an increase to add caching and improve performance.
I didn't really detect a memory shortage:
Slide 19
Why sparklines of totals can be really useful
Still, people don't usually mention issues unless there is an underlying cause. So, as a capacity planner, you have to always look deeper and always check all of the following:
CPU
Disk I/O
Memory
Network
Response time for key workloads
If you don't always check everything, something can sneak by.
Here is what I found when I followed the "always check everything" rule. When I looked at CPU, I saw:
Slide 20
Why sparklines of totals can be really useful
Slide 21
Why sparklines of totals can be really useful
Slide 22
Update!
They've since added 2 more CPUs and the issue continues unabated.
Some issues are not based in physics and data!
Slide 23
New, new update, Just for St. Louis!
Slide 24
New, new update, Just for India!
In the end, someone looked at what was running, and decided most was waste! Look at what happened after Feb 22nd!
Slide 25
Why sparklines of totals can be really useful
Now you see several reasons why longer term sparklines can be pretty useful.
Do you currently have ways to generate them? If not, do you want to get ways to generate them?
Don't you all think that your vendor ought to provide them, in group and zoomed in formats? So let's start asking them to.
Do you also see why you should always check everything and then sit back and ask yourself: "If I had asked that question and then got this response, what would I ask next?"
Slide 26
Defining your mission
Anticipate the next questions and always answer them before being asked.
The unanswered next question can be:
a huge time waster
often a stall technique used by the politically astute. It raises temporary doubt in your findings, and builds their case for swift purchase, before you answer their question
often a way for the old guard to show management that they are still the top dogs
Impatient or frightened management might run off and buy something!
The undeclared war between Project Managers and Capacity Planners
The "project manager weasel who never lost" story
Slide 27
Defining your mission
If you are going to shoot down someone's hypothesis that lack of CPU was the cause of a problem, you'd better find out what really caused the problem before the meeting!
Your goal: one meeting or phone call per issue!
They may say "We just want a quick and dirty answer", but they never really do!
Always cover at least:
CPU
Memory
Disk I/O
Workload response time changes
For web-centric systems, network distances and loads
Slide 28
Defining your mission
Cultural differences are real and might affect your workload choices.
Some cultures avoid direct blame or information that would cause someone to lose face.
Any workloads are better than none.
The "No personal pronouns" story
Be consistent!
Always use the same groupings on all similar nodes.
Use the same colors if you can! Reduce the burden on your audience. Multiply the value of your workload creation efforts.
Use consistent precedence order to decide where to put a process that meets the criteria to be in several different workloads.
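One way to get a consistent precedence order is an ordered rule list where the first matching rule wins. The sketch below assumes processes are classified by their command line; the workload names and patterns are invented for illustration and are not taken from any vendor product.

```python
import re

# Hypothetical rules in precedence order: the first pattern that matches wins.
WORKLOAD_RULES = [
    ("backup", re.compile(r"backup", re.IGNORECASE)),
    ("database", re.compile(r"oracle|postgres", re.IGNORECASE)),
    ("web", re.compile(r"httpd|nginx", re.IGNORECASE)),
]


def classify(command, rules=WORKLOAD_RULES, default="other"):
    """Return the first workload whose pattern matches the command line."""
    for workload, pattern in rules:
        if pattern.search(command):
            return workload
    return default


# This process qualifies for both "backup" and "database";
# the rule order decides, the same way on every node.
print(classify("oracle_backup_agent --full"))
print(classify("/usr/sbin/httpd -k start"))
print(classify("vi notes.txt"))
```

Keeping the rules in one shared list means the same process lands in the same workload, and ideally the same color, on every similar node.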
Slide 29
Defining your mission
Whatever you decide:
Track your own tools' usage!
There are multiple great freeware web usage reports that will tell you if folks are using or snoozing your data. (We use Webalizer: http://www.mrunix.net/webalizer/ )
Unviewed information is wasted time and effort.
Use speed tests.
If there are multiple ways to do something (CSV files versus a performance database), code for both and have a race. Will your web users want the slower one?
The capacity planning reporting challenge story
Don't settle, always seek new audiences and better reports.
Add new functions.
Sadly, there is no shortage of bad vendor reporting on expensive infrastructure. Anyone here ever seen a great graphical historical display in business useful terms of SAN information or LAN usage by segment?
Your firm may have business specific information that might be really useful to decision makers if overlaid on, or graphically reported near, IT resource consumption.
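The "code for both and have a race" idea can be sketched as a small timing harness. Everything here is an assumption for illustration: the sample layout (node, hour, cpu), the in-memory CSV text and the in-memory SQLite table stand in for real files and a real performance database, so the timings say nothing about your own environment.

```python
import csv
import io
import sqlite3
import time


def make_samples(n=1000):
    """Build hypothetical (node, hour, cpu) samples."""
    return [(f"node{i % 10}", i % 24, float(i % 100)) for i in range(n)]


def total_from_csv(csv_text, node):
    """Sum CPU for one node by scanning CSV text."""
    reader = csv.reader(io.StringIO(csv_text))
    return sum(float(cpu) for name, _hour, cpu in reader if name == node)


def total_from_db(conn, node):
    """Sum CPU for one node with a SQL query."""
    row = conn.execute(
        "SELECT COALESCE(SUM(cpu), 0) FROM samples WHERE node = ?", (node,)
    ).fetchone()
    return row[0]


def race(fn, *args, repeats=20):
    """Run fn repeatedly; return its last result and the elapsed seconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        result = fn(*args)
    return result, time.perf_counter() - start


samples = make_samples()

buffer = io.StringIO()
csv.writer(buffer).writerows(samples)
csv_text = buffer.getvalue()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (node TEXT, hour INTEGER, cpu REAL)")
conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", samples)

csv_total, csv_secs = race(total_from_csv, csv_text, "node3")
db_total, db_secs = race(total_from_db, conn, "node3")

print(f"CSV: total={csv_total} in {csv_secs:.4f}s")
print(f"DB:  total={db_total} in {db_secs:.4f}s")
```

Check that both contestants return the same answer before comparing times; a fast wrong report is not a winner.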
Slide 30
Our site's web usage:
Slide 31
Our site's web usage:
Slide 32
Our site's web usage:
Slide 33
Our shared long term mission
When you innovate and come up with new report ideas, share them at CMG! Or at least send me examples in mail and I'll do it for you!
Share code in this or other user groups that make sense.
We should all work together in user groups, public forums, on the web, etc., to push all of our vendor partners to address these needs. The more they do for us, the less we carry the home brew code weight.
We should also all work to reduce the volume, impact and long term storage requirements of our solutions. I have yet to encounter a vendor that isn't carrying around a lot of extra metrics in the bowels of their systems that will never be used.
We should have a CMG sponsored "help wanted" section for capacity planning specialist positions in the various countries.
Slide 34
Picking the right vendor partners
I believe that all capacity planning efforts should have tools that include:
Efficient resource usage and process consumption collectors
Network queuing theory based "what if?" modeling based on workloads, not total consumption (the bulge trap)
Efficient, speedy web-based historical consumption data display
Ideally your chosen vendor would:
support most or all of your differing operating systems and devices
have ample training and consultants available; there is nothing better than a co-pilot when you are starting out
participate in and support CMG!
Slide 35
Picking the right vendor partners
In the not too distant future, the best vendors should be:
Offering efficient, low impact, cloud deployable wrappers that run with your applications in a cloud
"We don't have to worry, it's in a cloud" is nonsensical.
Are you going to generate fake transactions and time them? When you get a long time back, or significant variance, are you going to have enough information to know why?
I think that in time people will realize this need, and want it in their contracts.
Don't you want to know the overhead of encryption and decryption in the process, and its response time effects?
Stupidity is infinitely scalable, as long as you aren't getting the bill. If nobody cares to make their code efficient, because they just send it to the cloud, how good is that code going to be?
Will it be running on the same machine as you tested? Will it impact your users?
Slide 36
Picking the right vendor partners
In the not too distant future, the best vendors should be:
Offering efficient, low impact, cloud deployable wrappers that run with your applications in a cloud (continued)
The internet will continue to grow exponentially, so those clouds could get mighty full, mighty quick.
How do you want to find out that it is too full? Do you want your customers telling you? Or do you want your own reports based on scientifically, accurately collected consumption data?
Social media sites are becoming valuable business tools. Businesses tweet and have Facebook pages!
Do you think that a free application originally designed to let 14 year olds share photos is designed for high performance business needs? How will you be sure?
Slide 37
Picking the right vendor partners
In the not too distant future, the best vendors should be:
Thinking about SaaS user tools as well. Sure, SaaS vendors maintain the code and pay if it is a hog, but are they:
running maintenance activities like backups and virus scans that slow things down right during prime time for Australia in your globally distributed firm?
suffering from "office hours" peaks of consumption that impact your users' response times?
taking outages to horizontally scale that might impact your firm's ability to ship product?
Without your own data, you will never know. What responsibility do you have to your firm's users?
Why is this network queuing theory based modeling stuff so important? Let's understand what it means and then see an example.
Slide 38
Modeling Norms
Most modeling packages assume a Poisson or Chi-squared distribution of the arrival rate of transactions.
Some simpler, yet often quite elegant systems like Dr. Neil Gunther's PDQ modeling just use a quadratic and forget the tails.
They aren't all that different, despite what we modeling junkies might say!
Don't focus on the distribution selected; focus on whether they use queuing theory models and give you relative response times.
Slide 39
Why network queuing theory based modeling?
These concepts are also often illustrated with simple queue graphics like the one at the right.
An important implied assumption is that all requests are served, none are lost.
Response time is the sum of Queuing Time plus Service Time.
Slide 40
Why network queuing theory based modeling?
Methods do differ, but queues for interactive workloads are usually computed based on load percentage using a formula like:
Q = U / (1 - U)
where:
Q = Expected Queue
U = Utilization
Response time is the sum of Queuing Time plus Service Time.
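The formula on this slide can be turned into a small calculator. This sketch assumes a single-server reading of it: queuing time is taken as the expected queue Q multiplied by the service time, which is my interpretation and not something the slide states. The function names are hypothetical.

```python
def expected_queue(utilization):
    """Q = U / (1 - U), with U expressed as a fraction between 0 and 1."""
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be >= 0 and < 1")
    return utilization / (1.0 - utilization)


def response_time(service_time, utilization):
    """Response time = queuing time + service time.

    Assumption: queuing time = Q * service_time (single-server view).
    """
    queuing_time = expected_queue(utilization) * service_time
    return service_time + queuing_time


def relative_response_time(utilization):
    """Response time as a multiple of the unloaded service time."""
    return response_time(1.0, utilization)


for u in (0.10, 0.50, 0.75, 0.90, 0.95):
    print(f"U={u:.2f}  Q={expected_queue(u):6.2f}  relative R={relative_response_time(u):6.2f}")
```

The table it prints shows why the ratio matters more than the absolute numbers: relative response time stays nearly flat at low utilization and climbs steeply as utilization approaches 100%.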
Slide 41
Why network queuing theory based modeling?
So, as a workload competes for resources throughout a day, its response time is likely to vary.
Computed relative response times show us both the variations and the reason.
The Y axis metric does not matter! Just pick a basis; the ratio is the important part!
Slide 42
Why network queuing theory based modeling?
A workload's typical transaction is likely to rely on several resources.
Imagine a workload running on a machine with four CPUs, six disks and some network I/O on one card.
Note that when technologies differ, service times can differ.
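A transaction's response time can be sketched as the sum of the time it spends at each resource it visits. The numbers below are invented, the six disks are collapsed to the single busiest disk to keep the example short, and the multi-server form S / (1 - U^m) is a commonly cited approximation that I am assuming here; it is not taken from the slides or from any particular modeling package.

```python
def residence_time(service_time, utilization, servers=1):
    """Time spent at one resource, queuing plus service.

    Assumption: S / (1 - U**m) for m identical servers. With m == 1 this
    reduces to the single-queue form S / (1 - U).
    """
    if not 0.0 <= utilization < 1.0:
        raise ValueError("utilization must be >= 0 and < 1")
    return service_time / (1.0 - utilization ** servers)


# Hypothetical workload: name -> (service seconds per transaction, utilization, servers)
resources = {
    "cpu": (0.010, 0.60, 4),      # four CPUs
    "disk": (0.020, 0.80, 1),     # busiest of the six disks
    "network": (0.005, 0.30, 1),  # one network card
}


def breakdown(resources):
    """Return the residence time contributed by each resource."""
    return {
        name: residence_time(service, util, servers)
        for name, (service, util, servers) in resources.items()
    }


parts = breakdown(resources)
total = sum(parts.values())
worst = max(parts, key=parts.get)

for name, seconds in parts.items():
    print(f"{name:8s} {seconds * 1000:7.2f} ms  ({seconds / total:5.1%} of response time)")
print(f"largest component: {worst}")
```

Seeing the components side by side is what points you at the resource worth fixing; with these made-up numbers a CPU upgrade would barely move the total.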
Slide 43
Why network queuing theory based modeling?
Now do you see where a graph like this can come from?
If the warehouse folks are complaining about response times at 3:00 AM, should you upgrade the CPU?
When do you suspect that the backups are running?
Would a CPU upgrade help daytime response? But it also might make demand for I/Os faster and really slow down the warehouse at 3:00 AM too, so you'd better address the I/O issue!
Slide 44
Picking the right vendor partners
In my experience, network queuing theory based tools move folks quickest to actionable answers.
Once you understand relative response times, most issues are quick and easy to diagnose.
If a new vendor harps on linear trending graphics and projections, don't expect them to be around for very long.
If a monitoring or other product vendor keeps adding "and you can use this for capacity studies", it is probably because the salesperson heard that you were looking for capacity planning tools!
Stick with network queuing theory based packages and you won't go wrong!
Dozens of "And we can do capacity planning too!" stories
Slide 45
Ron Goes Off on VMware
VMware is not a capacity solution.
VMware is a symptom of no capacity management.
Slide 46
Ron Goes Off on VMware
VMware is the single biggest indictment of the poor way most firms have done capacity planning in the Windows space.
The lack of workload characterized views of consumption is why folks bought a server for each functional part.
"We don't want to stack multiple applications on one server! So we VMware them!", which is just stacking with the added joy of paying for not only extra copies of the OS and tools, but $900+ for VMware as well.
And in the end, the code is running on the same box!
VMware's so-called capacity planning tool is proof that they never attended a CMG! It is as near useless as any marketed tool that I have ever seen, but at least it is expensive.
Slide 47
Going Extra-Product
Once you get used to your vendor's product, if you are like me, you'll start wishing for more functions tailored to your specific needs.
In the old days, a grey haired expert would whip out a spreadsheet or other mathematical package and start creating some home-brew solution.
I use Perl and GD:Graphics, PHP, JavaScript and anything else that I can think of; you can use what makes sense to you.
Check out old CMG papers, they are laced with great ideas.
In other words, don't feel limited to what your vendor does out of the box.
Find buddies that use the same vendor and start sharing ideas and code.
Things that you will see later in this presentation are shared among dozens of firms, and they wouldn't live without them.
You don't have to agree 100%; take what fits best and leave the rest.
Slide 48
Going Extra-Product
There is a whole group of us running many of the extensions that we've developed over time.
Some of our extensions have made it into some products, but nowhere near enough of them!
We probably get 50% of our firm's benefit from the tools from our own extensions.
We regularly meet with the vendors and implore them to add the features that we like.
Having more singing from the same hymnal might just get through to them!
Come join us! The best ideas might be in your head! Share!
Slide 49
Avoiding the IT Mindset Traps
Capacity planners come in several flavors, because people from several different camps end up in this role:
Scientists: Scientifically minded users of network queuing theory tools and simulation models that want to subdivide consumption into different behavioral groups and analyze them.
Application specialists: Application subject matter experts who know the application, are trusted by management, and care deeply about its success. They often come from the application side of the firm.
Old Timers: They know everybody, have worked on everything and have connections and favors to call in to get things done. They often come from the operations side of the firm.
Each of these can be successful, but some are more prone to certain behaviors that can limit your capacity planning effectiveness and raise the costs of doing it.
Let's look at the typical pros, cons and peccadilloes of each.
Slide 50
The Scientists
The Scientist capacity planner:
Loves to get data from everywhere and everything that they can
Willingly tackles huge tasks as long as there is a possible learning benefit
Will constantly tweak the automation to be able to get yet more data
Will go "extra product" and build tools for specific functions without fear, because they are used to building things from scratch and being successful
Pros:
No fear: they view no problem as intractable and are sure that if they can get real data into a scientifically designed framework, business useful learning will result
No agenda: all applications and systems are equally important to them; they will not lobby for one application to get resources instead of another, preferring instead a rising tide that raises all boats
Willing to try new methods and tools in search of solutions
Slide 51
The Scientists
Cons:
Scientists can be viewed as "remote" or "doesn't know the business" by some in management, particularly application development
They may want some really expensive and/or tricky software, and on every machine, and these tools produce copious amounts of data that need to be processed, graphed and stored
The volume of tools and special case software that they accumulate over time can be hard for others to support
Good ones are relatively rare; ones that can teach/mentor others are extremely scarce
Mindset Traps:
Scientists can go off on tangents; they really need a manager who can:
help them get the most productive subset of tools working first
translate their outputs into terms understandable to the business
help keep them focused on what the business deems most valuable
Their pursuit of the "one scientifically superior way", left unchecked, can lead to ongoing high costs
Slide 52
The Application Specialist
The application specialist in the capacity planning role:
Will often drop everything else to don their fire-fighter jacket and save the firm by working on emergencies
Will rely strictly on simple O/S tools and minimal data, often just totals, because "that was all we needed when we started this thing, and look how far we've come"
Seldom tracks historical consumption data over time, or if they do, seldom presents it in a format that is easily understood by others
Pros:
They really do know the application and the folks who are powerful, and they have a lot of chips at the bargaining table when it comes time to get things negotiated
Their application specific knowledge can really come in handy when strange behaviors are noticed
Their continuing drive to make an application succeed and the lengths that they go to are often very favorably viewed by non-technical management
Slide 53
The Application Specialist
Cons: EGO!
Our conference rooms are named after comic book super heroes!
Slide 54
The Application Specialist
Cons:
Their self confidence can lead to large egos; they dismiss opposing views of how to address issues other than "the way that we've always done it"
Their extreme willingness to join in every fire-fight eats a lot of time and delays the deployment of tools and systems (like long term historical consumption tracking) that would help others understand and make better decisions
Tend to enjoy being the "go to guy" and thus seldom share the basis for their decisions. This is sometimes covering up the fact that the basis for their decisions is gut feel, not data
They will commit in public forums where management is present to supporting the scientists to get some application specific technical need, and then fail to do so in a timely manner, if ever
They really know their silo, but they are very uncomfortable when asked to go outside of it
Slide 55
The Application Specialist
Mindset traps:
These folks' career successes have been built on thinking on their feet as issues occur, so they seldom take the time to build data collection and reporting structures that lead to well informed decisions. "When you need to know something, just ask me."
They may even resist or delay deployment of capacity planning systems, calling them "costly", "unnecessary" and "not our application's highest priority"
They will resist changes to their sacred architectures from the 1980s
They can be initially really interested in capacity planning information about their application, and use it to point out the positive impacts of their past decisions and successes, but don't expect them to mention immense over capacity
Often their interest stops immediately at the edge of their application
When there are issues larger than one application, they view it as their duty to defend their application's turf and will move to segregate the environments into "us" and "them" groupings that need not share any infrastructure
They think that "The vendor will tell us when to ..."
Slide 56
The Old Timers
The old timers in the capacity planning role:
Are a calming presence in meetings
Have stories of a time when we faced something similar
Have the best jokes
Know and address the VPs as "Phil" and "Sandy"
Have capacity tracking systems that tend to the super-inclusive; when asked, they alone can root out data about darn near anything, but they have to be asked
Pros:
They have the trust and respect of nearly everyone, because everyone has worked successfully with them over time
When they need tools or space to get or keep their data, they just go ask Phil or Sandy
Are among the few to have worked on many of the systems, not just one or two, and so they understand deeply the inter-reliance of many of the systems and how an issue in one can manifest elsewhere
Slide 57
The Old Timers
Cons:
Old timers are often tired of learning. They seldom want to embrace radical new methods when they are retiring in a few years
Old timers are survivalists, or they wouldn't be old timers. They have a great political sense of when not to rock the boat and who not to mess with, which can prevent or delay the introduction of useful new information
Mindset Traps:
They approach capacity planning like they approached most of the IT issues that they've faced in their long careers: "Let's start with a database with thousands of metrics!"
"You never know what will come in handy", so they resist deleting them while disk can still be purchased
Their reporting systems evolved over a long time, hence can be hopeless for someone new to decipher or change
They can be based on large tables of numbers that only a select few can successfully use
Slide 58
Avoiding the IT Mindset Traps
So what do we do? How do we get the pros of each type and minimize the downsides?
You must build a matrixed team containing some of each type.
The team concept must have support from the highest levels.
It must have priority from each of their respective management.
They must be charged with:
enabling the scientists to integrate new tools into the environment
getting graphical reporting working that management can understand
maintaining just enough information to provide long term historical context for decisions, but no more
Sometimes, you'll have to bring in outside expertise, and the only way that will succeed is to have friends in high places.
It is critical to put this under an excellent manager.
Each of the three types has useful and less useful behavior patterns.
You need a manager that all can respect, who doesn't try to be the expert, rather one who coaches each to be part of a well functioning whole.
Slide 59
The politics of capacity planning in organizations
Organizational politics are often the key factor in your capacity planning group's eventual success or failure.
Long experience has taught many of us the importance of:
Friends in high places. Try to get the capacity planning issue instigated by a knowledgeable VP or at least a director. Often a major initial stumbling block is even getting permission to install collectors on production systems, much less the physics of actually doing it, and there is nothing better than having their boss's boss saying, "Yes, you must do this, it is a priority."
Determining and rating the skills and power balances in your organization, usually by O/S. Managerial chaos can be a severe issue.
Diagnosing and surmounting the barriers to success: describing the type, their common barriers, and techniques to surmount them
Slide 60
Identifying and surmounting barriers
Barrier: The "not invented here" über-geek
Identification clues: They are often early members of a firm. They
usually position themselves as masters of several related
technologies, but can be rather sparse on details. The younger the
firm, the more often you find them; internet firms in high-growth
areas are full of them. They are convinced that "If we didn't need
it then, we don't need it now!"
Their typical barrier methods: "This is not an organizational
priority." "This collector code is not proven on our sensitive
production systems."
Techniques to surmount their barriers: Friends in high places compel
them. Share credit for successes with them to their management.
Involve them in the model setup, ideally modeling alongside them and
letting them suggest probable growth steps.
Slide 61
Identifying and surmounting barriers
Barrier: The high priests of the old tool set
Identification clues: They like twitch monitoring and have often
built an extensive installation of monitors with impressive-sounding
names like "The war room" or "mission control". Whenever you enter
it during non-emergencies, notice how few people are actually using
the displays. They prefer current totals like total CPU because
they've never had consumption by business-identifiable
sub-groupings. They react to brief workload peaks by demanding
upgrades.
Their typical barrier methods: Stalling. They ask streams of
technical questions, and each answer that you give prompts another.
Requests to integrate: new capacity tools must feed information to
their war room.
Techniques to surmount their barriers: Ask them to put long-term,
workload-characterized consumption on their displays. Have them
tasked to help address pathologies automatically detected (that
their monitors did not seem to surface).
Slide 62
Identifying and surmounting barriers
Barrier: The application architects
Identification clues: They rigorously defend their current
multi-node spread as vital for the organization, for uptime, and for
scalability. 90% of their machines will be empty or nearly so. The
architecture was set in stone a decade ago, and was designed to
solve the issues of that time: minuscule PCs.
Their typical barrier methods: Lecturing you on how their way is the
only way. "Don't you realize that these are business-critical
systems?" is used to justify all manner of excessive purchasing.
They will lecture you on availability and scalability at the drop of
a hat.
Techniques to surmount their barriers: Show them the serious
speedups possible by collapsing application layers onto fewer
machines and removing network time from chatty applications. Ask
them for estimates of just how much more their application will need
to scale, given that it is 7 years old and already in use firm-wide.
Slide 63
Identifying and surmounting barriers
Barrier: The entrenched fire-fighting squad
Identification clues: They offer to work with you, but not today, as
there is an emergency. They position themselves as the experts in an
application. They are hyper-sensitive to any changes in the
environment; they view them as dangerous. "Our conference rooms are
named after comic book superheroes!" revisited: when you fly in to
interview, everyone is fighting a fire.
Their typical barrier methods: They position themselves as must-have
team members and then are never available. Beware their commitments
to make data or specifics available; they will often be too busy
later to do it in a timely manner, if at all.
Techniques to surmount their barriers: Agree to work with them as
valued members of the team, then ignore them in your plans, as they
will always be too busy to help anyway. Never trust them to come
through with a key item; always plan for another way to get what
they promise that does not involve them. Over time, train them that
many of the time-consuming fires that they fight are simple pile-ups
of multiple pathologies that won't bite if addressed in a timely
manner.
Slide 64
Identifying and surmounting barriers
Barrier: The overwhelmed, outsource-able and scared
Identification clues: They have single functions, often somewhat
amorphous and difficult to put a dollar value on. They are not in
politically savvy management structures.
Their typical barrier methods: They stall, seemingly frightened to
take on any task without exact instructions from their management.
They view tasks related to capacity planning as "not their
priority". They view all new functions as threats. They seem to
ignore all information not generated by their own function.
Techniques to surmount their barriers: These are politically weak
people in politically weak areas; stay away from them so as not to
have to rely on them. If forced to work with them, work with their
manager to emphasize that capacity planning is an important priority
that they cannot stall. Help the good ones get out of that group.
Slide 65
Identifying and surmounting barriers
Barrier: The "This is a database server only" DBAs
Identification clues: They claim that "In order to save the firm
database license money, we are concentrating the databases from
multiple applications on just a few servers, and nothing else can
run on these servers."
Their typical barrier methods: Outright refusal to try collapsing
micro-applications onto database servers. They claim that the
remaining capacity on the one-third-used database server is for
growth, but are really hard to pin down for specifics, usually
because there aren't any.
Techniques to surmount their barriers: Try to get them to allow (or
install) only a certain small percentage of application code on
their machines due to a network emergency. That seems tiny and
reasonable. Use a number like 10% to 20%. They don't need to know
that that was all of the applications that you ever dreamed of
doing. Show them how your automated process pathology code works, to
ease their fears about rogue applications eating their machines
alive and harming other applications. Praise them to their boss as
innovative and balanced problem solvers.
Slide 66
Identifying and surmounting barriers
Barrier: Lying, manipulative project leaders
Identification clues: You are originally asked to model 400 users
from a sample of 30. Later they say, "Oh no! We meant 1000 users!"
Their typical barrier methods: Some project leaders view themselves
as risk minimizers. Sadly, they often feel that 60% excess hardware
is a properly sized cushion, so they inflate their usage estimate by
60% to make the modelers justify excess hardware for them. They took
3 extra months to get all these wacky features in, way past their
deadline, but now time is an emergency and they need their results
immediately, or they just need to buy hardware right away because
they have no time to test properly.
Techniques to surmount their barriers: Speed. You can model this
stuff far faster than they can get a load test to work without half
of those wacky features blowing up. Ask more people how many users
really are going to be there.
Slide 67
Identifying and surmounting barriers
Barrier: Enthusiastic but "We went to Load Runner class and we
absolutely have to run huge saturation load tests" drones
Identification clues: They don't understand that mesa tests and
modeling are all that is needed. Even if you can get a decent mesa
test out of them, they still want to do a saturated load test
anyway. They REALLY BELIEVE two seemingly counterintuitive things:
1. Your operations group must run out and buy exactly the machine
   and memory that they dreamed up from dubious research for their
   tests.
2. They do not have to run against realistic data volumes with
   similar indexes and size as intended production. They will NEVER
   create a statistically relevant data source. They will frankly
   state: "It is impossible!"
Slide 68
Identifying and surmounting barriers
Barrier: Enthusiastic but "We went to Load Runner class and we
absolutely have to run huge saturation load tests" drones
Their typical barrier methods: No matter how many times you say not
to, they will always strive to ramp up users at the start and ramp
down afterward. Get ready to lose your first and last measurement
periods. If you can get a realistic transaction mix from them, they
will still strive to run them too fast (the "30-second contract
review, 8 hours a day" story).
Techniques to surmount their barriers: Always question their user
think times, then adjust your model to deal with the silliness that
you uncover. Maybe 20% of the samples that I get have realistic
transaction arrival rates, so beware. Be consistent; over a series
of tests you will wear them down, or get them fired.
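A quick way to question a think time is to work out what it implies per user per day. The sketch below is only arithmetic, not a tool from the talk: the function name and the 8-hour default are assumptions for illustration, and service time is ignored, which slightly overstates the counts.

```python
def transactions_per_user_day(think_time_seconds, hours_per_day=8.0):
    """Transactions one simulated user completes in a working day.

    Assumes the user does nothing but wait the think time between
    transactions (service time is ignored for simplicity).
    """
    if think_time_seconds <= 0:
        raise ValueError("think time must be positive")
    return hours_per_day * 3600.0 / think_time_seconds


# A script with a 30-second think time versus a 5-minute one.
fast = transactions_per_user_day(30)
slow = transactions_per_user_day(300)
print(f"30 s think time:  {fast:.0f} transactions per user-day")
print(f"5 min think time: {slow:.0f} transactions per user-day")
print(f"the fast script generates {fast / slow:.0f}x the load")
```

If nobody in the business really reviews that many contracts in a day, the think time in the test script is wrong, and the model built from it will be too.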
Slide 69
A mail message to a new fleet of Load Runner enthused contractor
drones
The purposes of load tests can be manifold: to test functionality,
capacity, and feel. Modeling based on a sample does the same things
and more, and usually much faster and cheaper. If you choose to run
a load test, be sure to run a realistic transaction mix with the
expected blend of all commands, not just one kind. If you are
limited by physics to simulating a subset of intended loads (we
don't recommend simulating above 20 users per load-running PC, for
accuracy), we can then take that load and model much higher ones and
any alternate hardware that you might dream of. We have these
caveats to improve accuracy:
1. Perform the tests on real, not virtual, servers for measurement
   accuracy.
2. Run a proper mesa test for sampling, which includes:
   A. Make sure that the CPP group has a collector on your intended
      test machine days before the test.
   B. Start your test precisely on an hour boundary.
   C. Do not, repeat, DO NOT ramp up or ramp down users. Just start
      and go; 20 users per Load Runner box will not overwhelm
      anything. Ramping is not required for models; indeed, it is
      wrong to do it.
   D. Stop precisely on an hour boundary.
   E. Send mail to us telling us:
      I. How many users you simulated
      II. The precise timings
      III. How many more users we should add in the models
      IV. Anything else pertinent
Slide 70
A mail message to a new fleet of Load Runner enthused contractor
drones
3. The purpose of the test is to produce a flat-topped mesa of usage
   that depicts your users acting normally. A graph of CPU
   consumption should look like a rectangle with a flat, steady top,
   nowhere near saturated. We then take that sample of happy,
   unconstrained users and model what hardware is needed for more
   happy, unconstrained users.
4. Do a practice run several days before your real test to flush out
   issues, and tell us so we can see how well you followed the mesa
   instructions.
5. DO NOT do any of the following, which will waste your time, ruin
   the data and cause rework:
   A. DO NOT ramp up or ramp down usage at the start or end of your
      tests. It just makes us throw out that data.
   B. DO NOT try to saturate the machine. The models will find that
      saturation load, so don't waste your time. Concentrate on
      producing an unsaturated load of happy users getting great
      response times.
   C. DO NOT try to simulate hundreds of users from one PC with one
      network card. It will fail or, worse, produce incorrect data
      leading to massive errors.
   D. DO NOT create loads with unrealistically fast think times. If
      the user is likely to do a transaction and then spend 5
      minutes reading it or processing it, then set the
      inter-transaction wait time to 5 minutes, not 30 seconds.
      Remember, your goal is to be realistic, not to have high
      unrealistic loads.
Mesa tests may seem odd at first, but in time you will learn to love
mesa tests and their time and cost savings to projects. After a few
of them, you'll never load test the old way ever again. Questions?
Please ask, or invite us to your team meetings for a confab!
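One way to check a practice run against these instructions is to test whether the CPU samples form a flat, unsaturated plateau with no ramps at either end. This is a rough sketch, not the checks actually used by the author: the function name and all three thresholds are assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def looks_like_mesa(cpu_pct, capacity_pct=100.0,
                    max_busy_fraction=0.7, max_variation=0.10):
    """Return True if the samples form a flat, unsaturated plateau.

    cpu_pct: CPU consumption samples from the test window, in order.
    max_busy_fraction and max_variation are illustrative thresholds,
    not values from the talk.
    """
    if len(cpu_pct) < 3:
        return False
    avg = mean(cpu_pct)
    if avg <= 0:
        return False
    # Flat top: small spread relative to the average level.
    flat = pstdev(cpu_pct) / avg <= max_variation
    # Nowhere near saturated: the peak stays well under capacity.
    unsaturated = max(cpu_pct) <= max_busy_fraction * capacity_pct
    # A ramp shows up as first or last samples far below the plateau.
    no_ramp = min(cpu_pct[0], cpu_pct[-1]) >= (1 - 2 * max_variation) * avg
    return flat and unsaturated and no_ramp


mesa = [40, 41, 39, 40, 42, 40]
ramped = [5, 20, 40, 41, 40, 20, 5]
saturated = [98, 99, 100, 99, 98]
for name, series in [("mesa", mesa), ("ramped", ramped), ("saturated", saturated)]:
    print(name, looks_like_mesa(series))
```

A run that fails this kind of check is a candidate for a redo before anyone spends time modeling from it.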
Slide 71
The politics of capacity planning in organizations
How to win friends and influence people in the operations group:
Set up being on the capacity planning team as an aspirational goal,
a promotion path, for the operations folks. Try to find an
operations or O/S expert at the top of their game and get them
assigned to the capacity planning effort. These are often the best
acolytes and really take well to capacity planning. As the
operations staff start to use the capacity planning reporting and
pathology detection systems: praise their efforts and successes to
management; coach their failures privately; get them (and their
management) to realize that keeping process pathology counts down
reduces emergencies and call-outs, and greatly contributes to system
stability; and train them on the tools so they start to use them and
build new skills. If the only users of the capacity planning reports
are on the capacity planning team, you are doing something wrong!
Slide 72
The politics of capacity planning in organizations
How to win friends and influence people in the application
development group: In addition to the barriers presented previously,
you may also encounter: the earnest improver, who takes the time to
learn about new technologies and tries to integrate their benefits
into their software development lifecycle; the non-technical
manager, who may never understand all of the math and formulas, but
who will be far better at the political skill required for success;
and external vendors whose future profits hinge on success. Try to
become an asset to each of these groups: make sure that they see you
as a willing partner in their success, work late on their models,
and help them succeed and get the resources that they need when they
need them. Send mail when you work early, late or on the weekends
(and CC your boss, of course); it shows that you are really trying
to help.
Slide 73
The politics of capacity planning in organizations
How to win over and influence your boss: There are several types of
bosses: the experienced true believer, the unbeliever, and the
unconvinced cost counter. There are techniques to deal with each.
Your goal is to convert the last two into the first one! Keeping all
happy will involve deploying collectors, generating
workload-characterized historical consumption web pages, and
"What if..?" models of future consumption. The key is to survive
long enough to get proper network queuing theory model based
software purchased in sufficient quantity to make a difference. Get
some applications leadership on your side to keep the last two from
canning you before you start to get meaningful results on a large
scale.
Slide 74
The experienced true believer
Usually you have worked with or for this boss before, so they
already know: how expensive the tools can be (so they are not
shocked); what a reasonable time for results is; how to help enable
your success; and what battles to fight, and what battles to avoid.
My last 4 gigs have been for someone I had either consulted for or
worked for. Delivering results delivers career options for you!
Characteristics of the experienced believer: Patience. Helps get the
software quickly. Helps break through organizational politics to get
your collectors quickly deployed. Projects confidence in meetings
with other management.
Slide 75
The unbeliever
These folks (often with a development background) are distrustful of
fancy methods like network queuing theory. This is often based on an
insecurity: they don't understand complex tools and thus distrust
them. They have made their career by betting on simple solutions and
extrapolating linearly. They are often in their position due to
management turmoil. In several gigs I've had non-believers in the
management structure above me.
Characteristics of the non-believer: Initial open contempt of
scientific capacity planning methods. They demand results before
they help you get the collectors in place that would let you answer
with a historical basis. They will often throw CPU and memory at
disk I/O slowness. They can be turned, but wow, it sure takes
patience!
Slide 76
The unconvinced cost counter
These can be great bosses in time because, like scientists, they
demand proof before supporting you, but once they have it, they will
be true believers. They either have no experience with sophisticated
capacity planning, or have had running the group forced on them by
higher-ups who have.
Characteristics of the unconvinced cost counter: Repeated references
early in the process to how much your group and your software cost,
and lots of implying that savings results had better surpass that
soon. Caution early on, so they will spend the time with other
departments getting them to go along with you. They thrive on
informational updates, so show steady progress. You don't have to be
perfect, just constantly getting better.
You'll know that they have switched to true believers when: they
start buying you more licenses; they stop complaining about costs;
and "We need to show results!" turns into "Do you need more
licenses?"
Slide 77
Reporting
There are a lot of tragically bad business graphics, and especially
capacity planning reports, out there. Issues include: Graphics that
distort the viewer's perceptions (quasi-3D, black outlines around
bar charts). Non-calendar displays of long spans of time. No color
consistency. "Foolish consistency" may be the hobgoblin of little
minds, but it is also the key to getting management to use your site
for decision making (don't pay attention to "little minds" and
"management" appearing in the same sentence). Lots of chrome, little
content. Tufte: Question every pixel. Basically, any pixel that
isn't conveying new data, get rid of it!
Slide 78
Reporting
Other issues that limit effectiveness: Multi-page reports that
nobody ever reads. If your answer is so complex that it requires
that much evidence, start over on a new one. "They paid $10,000! It
has to hit the desk with a thud!" The same thud lives on! Relying on
the untrained user to wade in and find the answers themselves. Some
you can train, most no. If any correlation of graphics requiring
memory is needed, forget it. Ron's position: Non-web presentations
in general are useless relics of a bygone age. Most of your readers'
data comes in hyperlinked form, so get with it or be left behind.
Web reports of all nodes in the firm: most users really appreciate
ways to see only their span of control.
Slide 79
Reporting
There are also some must-haves: Automated context that graphically
highlights when something is out of the ordinary (managers love this
stuff). Automated business and hardware context, ideally driven by
your CMDB, that includes: hardware and software specifics; business
purpose; business owner; primary and backup technical contacts;
ideally a text description of its business function; and other
helpful links.
Slide 80
The Zen of Great Reporting
Seek minimalism in all parts of it: reduce graphic clutter; reduce
user-perceived complexity; workload color consistency is a simple
must-have; reduce user choices and actions. If the user needs to
know 4 things to make a decision, they had better be close together
on the same web page. Add extra information that lets the user more
fully understand odd behaviors and situations. Sorting it by date is
nice too. Don't restrict yourself to measured quantities. Workload
response time detail is one of the most powerful graphics that you
can use.
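Workload color consistency can be had cheaply by deriving each color from the workload name rather than from its position in a chart. The sketch below is one possible approach, not the author's reporting code: the palette, the function name and the hashing scheme are all assumptions, and two workloads can land on the same color, so a real site may prefer an explicit name-to-color table.

```python
import hashlib

# Hypothetical ten-color palette; any fixed list of colors would do.
PALETTE = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
           "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf"]


def workload_color(workload):
    """Map a workload name to the same palette entry on every chart."""
    normalized = workload.strip().lower()
    digest = hashlib.sha256(normalized.encode("utf-8")).digest()
    # The first digest byte picks the palette slot; the result depends
    # only on the name, never on chart order or on the node.
    return PALETTE[digest[0] % len(PALETTE)]


for name in ["Oracle", "Backup", "Exchange"]:
    print(name, workload_color(name))
```

Because the color depends only on the name, the same workload is drawn in the same color on every node and every page, which is what lets a manager read a new chart at a glance.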
Slide 81
Reporting Examples
Slide 82
Reporting Examples (UNIX)
Slide 83
Reporting Examples (Windows)
Slide 84
Reporting Examples (Windows): Tangent, Multiple Memory Leaks
Here is an example of a rather severe repeating set of memory leaks.
See the saw-toothing memory? See the climbing Commit Bytes in a
different sequence?
Slide 85
Reporting Examples (Windows): Tangent, Multiple Memory Leaks
When you dig deeper, you can see memory totals by process owner.
People often want to blame someone. Alas, sometimes the "someone" is
harder to pin down by just username.
Slide 86
Reporting Examples (Windows): Tangent, Multiple Memory Leaks
When you dig deeper, you can see the individual process names
leaking. In time you'll find the best way to keep them unique; we
use process start date/time and PID. You can show these to the
Fake_Name vendor, and then it is hard for Fake_Name to deny a memory
leak. I believe that "java" is Finnish for "memory leak".
Slide 87
Reporting Examples (Windows): Tangent, Multiple Memory Leaks
Well, it is hard to deny a leak, but some Fake_Name vendor might
want raw data. Since you already have it, put out some CSV files to
be easily mailed to the vendor, eliminating one of their stall
tactics.
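Putting out the raw data can be a few lines of code once the samples are keyed uniquely. This is a minimal sketch, not the author's Perl: the field names, the function name and the sample values are hypothetical, and the only idea taken from the slides is pairing the process start date/time with the PID so that a reused PID is never mistaken for the same process.

```python
import csv
import io


def write_process_memory_csv(samples, out):
    """Write per-process memory samples as CSV, one row per sample.

    samples: iterable of dicts with the hypothetical keys
    'name', 'pid', 'start_time', 'sample_time' and 'memory_bytes'.
    """
    writer = csv.writer(out)
    writer.writerow(["process_key", "name", "sample_time", "memory_bytes"])
    for s in samples:
        # PIDs get reused, so pair the PID with the process start time.
        key = f"{s['start_time']}/{s['pid']}"
        writer.writerow([key, s["name"], s["sample_time"], s["memory_bytes"]])


# Made-up samples for illustration only.
samples = [
    {"name": "java", "pid": 3444, "start_time": "2010-01-15T04:00:00",
     "sample_time": "2010-01-15T05:00:00", "memory_bytes": 512000000},
    {"name": "java", "pid": 3444, "start_time": "2010-01-15T04:00:00",
     "sample_time": "2010-01-15T06:00:00", "memory_bytes": 640000000},
]
buf = io.StringIO()
write_process_memory_csv(samples, buf)
print(buf.getvalue())
```

In practice `out` would be an open file that gets attached to the mail to the vendor; `io.StringIO` is used here only to keep the example self-contained.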
Slide 88
Reporting Examples (Windows): Tangent, Multiple Memory Leaks
The right way to convey the message: We detected the issue and sent
mail to the application owner, stating: the exact processes with the
issue; that they can expect to keep crashing every day or so until
they get the vendor to fix it; and offers to help with data or
technical calls. We get no response at all. Three weeks later, we
get a request to add memory to the machine. The owner can't get the
vendor to respond quickly and wants to reduce outage counts in the
meantime. Don't get mad. Stay positive and helpful in tone; they are
just trying to help their users have fewer outages. Continue to urge
them to turn up the heat on their vendors, but do it in a nice way.
Slide 89
Reporting Examples
Slide 90
Reporting Examples
Slide 91
New! Reporting Examples Windows
Slide 92
New! Reporting Examples UNIX
Slide 93
Reporting Examples
Slide 94
Classic capacity planning question descriptions and proper
answering techniques
Capacity issues are usually an emergency to someone. Roughly 93% of
the requests for upgrades are nonsensical if you have any historical
workload-based resource consumption information. So you have to say
no in a way that makes the evidence clear. What to expect when you
say no: the 5 stages of grief (also called the Kübler-Ross model),
http://en.wikipedia.org/wiki/K%C3%BCbler-Ross_model
Denial, Anger, Bargaining, Depression, Acceptance.
Always give them a way to succeed along with your no; remember that
they may still have a real problem! "No, you don't need CPU or
memory, but you are doing 5500 I/Os a second to your slow, locally
attached C drive. Can you turn down logging? Can you send those I/Os
to fast SAN or RAM drives? Can you get help from your DBA pals?"
"No, you don't need more CPUs, you need to fix those looping
processes."
Slide 95
Classic capacity planning question descriptions and proper
answering techniques
Here is the pattern for this next section: real quotes from the
users (disguised, slightly); the evidence; the answer; what
happened. I want some interaction on these; if you did it better,
speak up! Share! That is what CMG is for! The graphs used in the
examples are all homebrew Perl and GD:Graphics, and they are used at
several firms. Yes, I will share the code if you want it, but
sheesh, you can do better! You are going to want some form of screen
graphics capture tool. I use freeware ZScreen, downloadable from
many sources; it is fabulous.
Slide 96
Classic capacity planning question descriptions and proper
answering techniques
User quote: "We are keeping these machines rather heavily loaded."
...but they won't tell you why.
The evidence:
Slide 97
Classic capacity planning question descriptions and proper
answering techniques
The answer: It turns out that this application was on three nodes,
two heavily used and one lightly used. They wanted a review of each:
Is ustca027 too empty? Is ustwa007 too full? Is ustca031 too full?
Let's use Relative Response Time by hour to answer them.
Slide 98
Is ustwa007 too full?
Slide 99
Is ustca031 too full?
Slide 100
Classic capacity planning question descriptions and proper
answering techniques
What happened: The users were initially shocked to see that the
capacity planners, whom they view as machine stealers for VMs, were
recommending that they get more hardware! Once they started to
understand relative response time graphs, they became quite
sophisticated at moving workloads around. You'll know that you've
converted them when they e-mail you asking if their IO_Wait could be
solved if they split them over more drives or better RAID choices.
The morals of the story: Any vendor can show totals. Favor vendors
that show workload-characterized historical views of consumption.
Favor vendors that can show you workload relative response times, so
that your answers make sense to the business.
Slide 101
Classic capacity planning question descriptions and proper
answering techniques
We started getting warnings from our automated checks:
10/03/23 CPU_SATURATION_WARNING: Windows2003 node in04sqp001 used
  up to 392.920% of an available 400% from 2010/03/23 at 0200 until
  2300.
10/03/26 CPU_SATURATION_WARNING: Windows2003 node in04sqp001 used
  up to 394.572% of an available 400% from 2010/03/26 at 0000 until
  2300.
10/03/27 CPU_SATURATION_WARNING: Windows2003 node in04sqp001 used
  up to 396.000% of an available 400% from 2010/03/27 at 0000 until
  2300.
10/03/28 CPU_SATURATION_WARNING: Windows2003 node in04sqp001 used
  up to 392.920% of an available 400% from 2010/03/23 at 0300 until
  2300.
The evidence (here's what the sparkline looked like):
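A check that produces warnings in this shape can be quite small. The sketch below is a guess at the logic, not the author's code: the function name, the input layout (peak CPU percent per hour, where 100% equals one CPU) and the 95% threshold are assumptions, and "from ... until" is taken to mean the first and last hot hour of the day.

```python
from datetime import date


def cpu_saturation_warning(node, os_name, day, hourly_cpu_pct, cpu_count,
                           threshold=0.95):
    """Return a warning line if any hour nears capacity, else None.

    hourly_cpu_pct: dict mapping hour (0-23) to peak CPU percent,
    where 100% equals one CPU. The 0.95 threshold is an assumption.
    """
    capacity = 100 * cpu_count
    hot_hours = sorted(h for h, pct in hourly_cpu_pct.items()
                       if pct >= threshold * capacity)
    if not hot_hours:
        return None
    peak = max(hourly_cpu_pct[h] for h in hot_hours)
    return (f"{day:%y/%m/%d} CPU_SATURATION_WARNING: {os_name} node {node} "
            f"used up to {peak:.3f}% of an available {capacity}% "
            f"from {day:%Y/%m/%d} at {hot_hours[0]:02d}00 "
            f"until {hot_hours[-1]:02d}00.")


# Made-up hourly peaks for a 4-CPU node: quiet until 0200, then pinned.
hourly = {h: 150.0 for h in range(24)}
for h in range(2, 24):
    hourly[h] = 390.0
hourly[5] = 392.92
line = cpu_saturation_warning("in04sqp001", "Windows2003",
                              date(2010, 3, 23), hourly, cpu_count=4)
print(line)
```

Run nightly over every node, a check like this surfaces the handful of machines worth a human look without anyone scanning thousands of graphs.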
Slide 102
Classic capacity planning question descriptions and proper
answering techniques
More evidence:
Slide 103
Classic capacity planning question descriptions and proper
answering techniques
My initial suspicions were "code improvement opportunities", so I
contacted my DBA pals:
Slide 104
Classic capacity planning question descriptions and proper
answering techniques
Those CPU graphs, with response time increases due to CPU_Wait when
they hit the knee in the curve:
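The knee in the curve falls out of basic queuing theory. The sketch below is a textbook M/M/m calculation (Erlang C), offered only to show why response time explodes near saturation; it is not the formula behind any vendor's relative response time graph, and the function name and the assumption of random arrivals and exponential service times are mine.

```python
from math import factorial


def relative_response_time(utilization, cpus=1):
    """Response time divided by service time for an M/M/m queue.

    utilization: average busy fraction per CPU, 0 <= utilization < 1.
    A result of 1.0 means no queueing at all; 2.0 means work takes
    twice as long as it would on an idle machine.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    if cpus < 1:
        raise ValueError("cpus must be at least 1")
    offered = utilization * cpus  # offered load in Erlangs
    tail = offered ** cpus / factorial(cpus) / (1 - utilization)
    head = sum(offered ** k / factorial(k) for k in range(cpus))
    p_wait = tail / (head + tail)  # Erlang C: chance a request must queue
    return 1 + p_wait / (cpus * (1 - utilization))


for busy in (0.50, 0.80, 0.90, 0.95, 0.98):
    print(f"{busy:.0%} busy on 4 CPUs: "
          f"{relative_response_time(busy, cpus=4):.2f}x service time")
```

Under these assumptions a 4-CPU node stays close to 1x through moderate load and then climbs steeply as it approaches 100% busy per CPU, which is the shape those CPU_Wait graphs show.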
Slide 105
Classic capacity planning question descriptions and proper
answering techniques
The answer from my DBA pals:
Slide 106
Classic capacity planning question descriptions and proper
answering techniques
What happened (the changes went in on Mar 29th):
Slide 107
Classic capacity planning question descriptions and proper
answering techniques
What about the charts, Ron?
Slide 108
Classic capacity planning question descriptions and proper
answering techniques
Things to learn from this example: Not all code innovations work as
efficiently as desired. SQL developed in far-flung places for even
farther-flung places is especially suspect. "When the answer is
correct, the code is done": well, maybe not. Not all innovations
will go through a rigid capacity planning review. You need either
automated warnings or to take the time to scan thousands of graphs
often to detect and correct these. You need fast graphical evidence
to get fast reactions. You need to go out of your way to be nice to
DBAs; they will save your firm millions if you let them, and if you
only ring them up when there is real evidence of mayhem. Always ask
their boss to praise their efforts; those memos come in handy at
review time.
Slide 109
Classic capacity planning question descriptions and proper
answering techniques
Many of you will be deploying virtual terminal environments to
hundreds of users. What if something goes a little wrong?
The evidence:
Slide 110
Classic capacity planning question descriptions and proper
answering techniques
The answer: We started ticketing suspicious CPU-consuming VMware
slices on Feb 3rd. Most of it was Bezier curve screen savers! We
banned them.
What happened: We got back more than half of our VMware farm!
Slide 111
Classic capacity planning question descriptions and proper
answering techniques
User quote: "I was wondering if we could get the memory increased on
our Exchange 2007 CAS servers USTCAX100 and USTWAX100? Right now
both servers are running 4.25GB and I would like to move them to
8GB. We are seeing performance issues with those servers and we are
noticing that RAM usage is at 80%-90% or higher all of the time.
Users are starting to notice this with Communicator. Due to the fact
that it can't get a response quick enough from CAS, it is putting an
exclamation point on the communicator alerting them to address book
issues. If we are not able to increase the memory, the only other
option would be to add more CAS servers in the environment to
balance the load. We also are going to be increasing the load on
these servers with the 2000 users we will be adding to the North
America environment from the XYZ Co. acquisition and moving South
American users to North America servers. Please let me know if this
is feasible or not?"
Slide 112
Classic capacity planning question descriptions and proper
answering techniques
The evidence: First, look to see if anything has gone wrong
recently. They might be reacting to a recent problem, but don't stop
there.
Slide 113
Classic capacity planning question descriptions and proper
answering techniques
The evidence: Looking deeper, we don't see a memory shortage (though
there is evidence of a slight leak): paging is very low, and
CommitBytes isn't anywhere near CommitLimit. But CPU seems in short
supply, and the CPU Wait component of relative response time is
huge. Their short-term performance issue is due to a CPU shortage,
not memory!
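The same reasoning can be written down as a simple triage rule. This is an illustrative sketch only: the function name, the inputs and every threshold are assumptions, and the sample numbers are invented rather than taken from the servers in the example.

```python
def short_resources(commit_bytes, commit_limit, pages_per_sec,
                    cpu_wait_share, commit_ratio_limit=0.9,
                    paging_limit=100.0, cpu_wait_limit=0.3):
    """Return the resources that look short, based on simple thresholds.

    cpu_wait_share: fraction of relative response time spent waiting
    for CPU. All three limits are illustrative assumptions.
    """
    findings = []
    memory_tight = commit_bytes / commit_limit >= commit_ratio_limit
    paging_heavy = pages_per_sec >= paging_limit
    if memory_tight or paging_heavy:
        findings.append("memory")
    if cpu_wait_share >= cpu_wait_limit:
        findings.append("cpu")
    return findings


# Invented numbers: commit charge far below the limit, little paging,
# but most of the response time is spent waiting for CPU.
print(short_resources(commit_bytes=3.0e9, commit_limit=8.0e9,
                      pages_per_sec=5.0, cpu_wait_share=0.6))
```

A high "RAM usage" percentage alone does not trip the memory test here, which mirrors the point of the example: look at commit charge, paging and CPU wait before agreeing to a memory upgrade.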
Slide 114
Classic capacity planning question descriptions and proper
answering techniques
The answer: Along with the graphs from the previous page (and
getting them to address the lsass loop), we added two virtual
processors to this VMware slice. Note that if you disagree with
their solution, give them an alternative that fixes present issues.
We may give them more memory later, when they've earned it.
Slide 115
Classic capacity planning question descriptions and proper
answering techniques
What happened: The CPU Wait disappeared immediately. The users'
immediate issues were solved. The users now know that decisions will
be based on evidence, the results will be real, and they like it!
Hardware in use for a growing application will grow, but slowly.
Slide 116
Classic capacity planning question descriptions and proper
answering techniques
Sometimes your own systems detect problems, so answer in a way that provides all required information

Hey folks, there is still one more issue, with the imjpmig process, the Input Method Editor, which lets you use Japanese characters. It is looping regularly:

10/01/15 LOOP_PROBLEM: 3444 running imjpmig CPU looped from Jan 15 04:59:54 until Jan 15 23:54:53 and may still be looping.
10/01/16 LOOP_PROBLEM: 3444 running imjpmig CPU looped from Jan 16 00:07:48 until Jan 16 23:54:58 and may still be looping.
10/01/21 LOOP_PROBLEM: 5344 running imjpmig CPU looped from Jan 21 13:59:59 until Jan 21 23:54:58 and may still be looping.
10/01/22 LOOP_PROBLEM: 5344 running imjpmig CPU looped from Jan 22 00:01:27 until Jan 22 23:54:56 and may still be looping.
10/01/23 LOOP_PROBLEM: 5344 running imjpmig CPU looped from Jan 23 00:01:25 until Jan 23 23:54:53 and may still be looping.

I changed the workload to just highlight Input Method Editor by itself. I also found a bunch of patches available:
http://search.microsoft.com/Results.aspx?q=imjpmig+downloads&mkt=en-US&FORM=QBME1&l=1&refradio=0&qsc0=0
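For readers wondering how a LOOP_PROBLEM message like these might be produced, here is a minimal sketch. It is not the pathology detection code from the talk; it assumes you already collect per-process CPU samples, and the function names, the 95% busy threshold and the one hour minimum duration are all hypothetical choices.

```python
from datetime import datetime, timedelta


def find_cpu_loops(samples, busy_pct=95.0, min_duration=timedelta(hours=1)):
    """Return (start, end) spans where one process stayed at or above busy_pct.

    samples: list of (datetime, cpu_pct) for a single process, sorted by time.
    A span is reported only if it lasted at least min_duration.
    """
    spans = []
    start = last = None
    for when, pct in samples:
        if pct >= busy_pct:
            if start is None:
                start = when          # a busy run begins here
            last = when
        else:
            if start is not None and last - start >= min_duration:
                spans.append((start, last))
            start = last = None       # the run was broken by a quiet sample
    if start is not None and last - start >= min_duration:
        spans.append((start, last))   # run still in progress at end of data
    return spans


def format_loop(pid, name, start, end):
    """Render a span in the same general shape as the messages on the slide."""
    return (f"LOOP_PROBLEM: {pid} running {name} CPU looped from "
            f"{start:%b %d %H:%M:%S} until {end:%b %d %H:%M:%S} "
            f"and may still be looping.")
```

Requiring a minimum duration keeps short, legitimate CPU bursts from being reported; the right duration for your site is a judgment call.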
Slide 117
Classic capacity planning question descriptions and proper
answering techniques
What happened?
Eventually they got the fix migrated to production and everything worked fine from then on
Don't get discouraged if folks don't always do what you want immediately
Change controls, priority conflicts and other issues may stall the fix
With enough graphical evidence, eventually you will win!
Slide 118
Classic capacity planning question descriptions and proper
answering techniques
Ron logs in on a Saturday to work on slides for UKCMG (Again! "And what do you get paid to do this?" asks my dear wife) and sees the following:
The evidence (from my pathology detection code's morning mail)
CPU saturation found:
CPU_SATURATION_WARNING: Windows2000 node ustca337 used up to 99.000% of an available 100% from 2010/03/12 at 0400 until 2300.
CPU_SATURATION_WARNING: Windows2003 node ustwasbx16 used up to 99.000% of an available 100% from 2010/03/12 at 1400 until 2300.
CPU_SATURATION_WARNING: Windows2003 node uktcas06 used up to 99.000% of an available 100% from 2010/03/12 at 0300 until 2300.
CPU_SATURATION_WARNING: Windows2003 node ustca227 used up to 99.000% of an available 100% from 2010/03/12 at 0400 until 2300.
CPU_SATURATION_WARNING: Windows2003 node ustca724 used up to 99.000% of an available 100% from 2010/03/12 at 0400 until 2300.
CPU_SATURATION_WARNING: Windows2003 node ustcas44 used up to 99.000% of an available 100% from 2010/03/12 at 0400 until 2300.
CPU_SATURATION_WARNING: Windows2003 node ustcas54 used up to 99.000% of an available 100% from 2010/03/12 at 0400 until 2300.
CPU_SATURATION_WARNING: Windows2003 node ustca088 used up to 99.000% of an available 100% from 2010/03/12 at 0800 until 2300.
Slide 119
Classic capacity planning question descriptions and proper
answering techniques
The evidence continued
Whenever a whole bunch of bad things happen synchronized over many machines, think global tool
Slide 120
Classic capacity planning question descriptions and proper
answering techniques
The evidence continued
Whenever a whole bunch of bad things happen synchronized over many machines, think global tool
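The "synchronized over many machines" test can be automated. The sketch below parses warning text in the shape of the CPU_SATURATION_WARNING lines shown earlier and reports any date on which several distinct nodes saturated. The regular expression, the function name and the `min_nodes` cutoff are assumptions made for this example, not the author's code.

```python
import re
from collections import defaultdict

# Pattern matching the warning lines as they appear in the morning mail.
WARNING = re.compile(
    r"CPU_SATURATION_WARNING: (?P<os>\S+) node (?P<node>\S+) used up to "
    r"(?P<pct>[\d.]+)% of an available 100% from "
    r"(?P<date>\d{4}/\d{2}/\d{2}) at (?P<start>\d{4}) until (?P<end>\d{4})\."
)


def synchronized_nodes(text, min_nodes=3):
    """Map each date to the sorted nodes that saturated on it.

    Only dates with at least min_nodes distinct nodes are returned, since
    many machines misbehaving together suggests a shared (global) tool.
    """
    normalized = " ".join(text.split())   # undo mail line wrapping
    by_date = defaultdict(set)
    for match in WARNING.finditer(normalized):
        by_date[match["date"]].add(match["node"])
    return {date: sorted(nodes)
            for date, nodes in by_date.items()
            if len(nodes) >= min_nodes}
```

Grouping by date is the simplest choice; a stricter version could require overlapping hours as well.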
Slide 121
Classic capacity planning question descriptions and proper
answering techniques
This is really bad news: a Business Sensitive / Critical production server doing its normal, real sqlservr workload, with a Tool process going on a CPU binge and causing excessive response times due to CPU_Wait
Slide 122
Classic capacity planning question descriptions and proper
answering techniques
The answer
A new piece of monitoring code was installed, BREAKING THE "NO NEW CODE INSTALLS ON A FRIDAY" rule!
What happened
The code creator had deployed a new script, and he reviewed it after getting mail about all of the warnings:
"This was a bug in a script update that I made; we should be seeing this behavior on most of the attached server list. ______ is pushing out an update to the script now; once this is done we'll have to log into each of the affected servers, verify the looping process is running sqlcheck.vbs, and kill it."
We were able to swiftly detect and fix the issue
How would your site do this?
Slide 123
Classic capacity planning question descriptions and proper
answering techniques
What we saw:
We started getting "Commit_Bytes approaching Commit_Limit" warnings:
10/04/05 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 5 18:00:00 until Apr 5 23:59:00 and may still be.
10/04/06 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 6 00:00:00 until Apr 6 23:59:00 and may still be.
10/04/07 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 7 00:00:00 until Apr 7 23:59:00 and may still be.
10/04/09 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 9 00:00:00 until Apr 9 23:59:00 and may still be.
10/04/10 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 10 00:00:00 until Apr 10 23:59:00 and may still be.
10/04/11 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 11 00:00:00 until Apr 11 23:59:00 and may still be.
10/04/12 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 12 00:00:00 until Apr 12 23:59:00 and may still be.
10/04/13 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 13 00:00:00 until Apr 13 23:59:00 and may still be.
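A check like the one behind these warnings might look like the sketch below. It assumes that "within 80% of Commit Limit" means committed bytes at or above 80% of the limit; that reading, the function names, and the shape of the input dictionary are all assumptions for illustration.

```python
def commit_pressure(commit_bytes, commit_limit, threshold=0.80):
    """True when committed memory is at or above threshold * commit_limit."""
    if commit_limit <= 0:
        raise ValueError("commit_limit must be positive")
    return commit_bytes / commit_limit >= threshold


def flagged_days(daily_peaks, threshold=0.80):
    """Return the days whose peak commit usage crossed the threshold.

    daily_peaks: dict mapping a day label to (peak_commit_bytes, commit_limit).
    Days are returned in sorted label order.
    """
    return [day
            for day, (used, limit) in sorted(daily_peaks.items())
            if commit_pressure(used, limit, threshold)]
```

Warning at a fraction of the limit, rather than at the limit itself, is what buys you the days of lead time seen on this slide.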
Slide 124
Classic capacity planning question descriptions and proper
answering techniques
We investigated, seeing rising total memory:
Slide 125
Classic capacity planning question descriptions and proper
answering techniques
The evidence, memory by user:
Slide 126
Classic capacity planning question descriptions and proper
answering techniques
The evidence, memory by leaking process:
Slide 127
Classic capacity planning question descriptions and proper
answering techniques
The evidence, for the spreadsheet inclined:
Slide 128
Classic capacity planning question descriptions and proper
answering techniques
The answer:
Clearly this application has a jlaunch process (run by the SAPServicePRG user) memory leak
You have two options:
Get them to patch/fix the application, or
Get them to reboot the machine periodically so that you don't start paging hard and affect performance
So you notify the project leader:

Hi all,
If you look at memory usage over the last few months on these three servers, you'll see steady and/or repeating ramps.
http://ustwu002.kcc.com/node_reports/ustca146/memory.html
http://ustwu002.kcc.com/node_reports/ustca147/memory.html
http://ustwu002.kcc.com/node_reports/ustca148/memory.html
This leads eventually to warnings like these:
COMMIT_BYTES_PROBLEM: On ustca146, Commit Bytes were within 80% of Commit Limit from Apr 6 00:00:00 until Apr 6 23:59:00 and may still be.
COMMIT_BYTES_PROBLEM: On ustca147, Commit Bytes were within 80% of Commit Limit from Apr 6 00:00:00 until Apr 6 23:59:00 and may still be.
COMMIT_BYTES_PROBLEM: On ustca148, Commit Bytes were within 80% of Commit Limit from Apr 6 00:00:00 until Apr 6 23:59:00 and may still be.
and after that, when commit bytes hits commit limit, you can experience rather severe application slowdowns.
In every case, the major rising memory consumer seems to be jlaunch processes run by SAPServicePRG. Most recently:
PID 6160 on ustca146 started Mar 2 20:54:58
PID 3772 on ustca147 started Mar 2 20:54:50
PID 8032 on ustca148 started Mar 2 20:54:56
Could someone take a look at these to see if a fix is possible? If not, could we recycle these jlaunch processes, perhaps weekly, to keep memory usage down?
Thanks for looking!
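Spotting the "steady ramp" by eye works for three servers; for hundreds you may want a first pass in code. The sketch below fits a least-squares line to each process's memory samples and lists the ones growing faster than a cutoff. The function names, the input layout and the 10 MB/day cutoff are assumptions for illustration, not the method used to produce the graphs in this talk.

```python
def slope_per_day(samples):
    """Least-squares slope of (day_number, megabytes) samples, in MB per day."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    return sxy / sxx


def leak_suspects(by_process, min_mb_per_day=10.0):
    """Return (process, slope) pairs growing at least min_mb_per_day.

    by_process: dict mapping a process label to its (day, MB) samples.
    Results are sorted with the fastest grower first.
    """
    suspects = []
    for name, samples in by_process.items():
        growth = slope_per_day(samples)
        if growth >= min_mb_per_day:
            suspects.append((name, growth))
    return sorted(suspects, key=lambda pair: pair[1], reverse=True)
```

Note that a process that is restarted regularly shows a sawtooth ("repeating ramps"), so a single line fitted over the whole window can understate the leak; fitting each run between restarts separately would be more faithful.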
Slide 129
Classic capacity planning question descriptions and proper
answering techniques What happened: Hi Ron, Thank you for keeping
an eye on these servers! You are right, there is a steady growth of
memory usage by the SAP PRG processes on these application servers.
This is not a surprise. There are several known issues regarding
memory leaks with the current version of the Java hibernate
libraries being used in the fake_name application and old
fake_product. We have worked with the application vendor,
fake_name, to resolve some of the more significant issues that were
causing regular outages. Fake_vendor has not resolved some of the
less-severe issues. There are plans to upgrade the entire
application suite and change the underlying application execution
platform from fake_product to new fake product. The application
upgrade includes new libraries for hibernate, and the memory leak
issues related to hibernate with fake_product have not appeared in
new fake product. The landscape upgrade is currently scheduled for
June. We will go ahead and schedule a recycle of the old fake
product to recycle the Jlaunch processes you mentioned below. We
will schedule regular process recycles until the system is
upgraded. Please let me know if you have any additional questions
or concerns. Thank you!
Slide 130
Classic capacity planning question descriptions and proper
answering techniques
What happened: Memory leaks, key points to remember
Graphics help get their attention; CSV files are there for the whackos who demand the real data
Sometimes they say that they need it to prove to the vendor
Believe me, the vendor usually knows all too well
It is easy to do and nips their evasions in the bud
Remember the stall techniques?
Sometimes they can't, or aren't, going to fix it
Welcome to big corporations and priorities
Then you need to get them to reboot periodically to get the leaked memory back
Do you have the graphs and data quickly available to discover, document and communicate this?
Slide 131
We have this really cool way to see all of the servers' disk
space for the last 90 days
Slide 132
Classic capacity planning question descriptions and proper
answering techniques
The evidence:
Subject: Possible disk space issue looming on ustca479
Hi All,
Here is a view of total disk space and disk space used on ustca479:
Perhaps some purge/delete/cleanup is in order?
Ron Kaminski
Slide 133
Classic capacity planning question descriptions and proper
answering techniques
The answer:
Subject: RE: Possible disk space issue looming on ustca479
Ron,
Thank you for the heads up. The increased disk space utilization is partially due to enhanced logging that we have enabled over the past few months. I have cleaned up some old logs and we will continue to monitor the disk utilization to determine if additional disk space is required.
Thanks,
Matt
Slide 134
Classic capacity planning question descriptions and proper
answering techniques
What happened:
Well, it was a start!
But alas, note the inexorable rise beginning again after the clean up.
Slide 135
An update from Friday
Note that the max space has grown considerably, from 83 to 112 GB
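An "inexorable rise" can be turned into a rough date. The sketch below fits a straight line to daily used-space readings and estimates the days left before the volume fills. It is a simple linear projection under the assumption that growth continues at the recent rate; the function name and input layout are hypothetical, and the readings in the usage note are invented for illustration (only the 112 GB capacity comes from the slide).

```python
def days_until_full(used_gb_by_day, capacity_gb):
    """Estimate days until a volume fills, from one reading per day.

    used_gb_by_day: used space in GB, oldest reading first.
    Returns None when usage is flat or shrinking, 0.0 when already full.
    """
    n = len(used_gb_by_day)
    if n < 2:
        raise ValueError("need at least two daily readings")
    mean_x = (n - 1) / 2
    mean_y = sum(used_gb_by_day) / n
    sxx = sum((i - mean_x) ** 2 for i in range(n))
    sxy = sum((i - mean_x) * (y - mean_y)
              for i, y in enumerate(used_gb_by_day))
    growth_gb_per_day = sxy / sxx          # least-squares slope
    if growth_gb_per_day <= 0:
        return None
    remaining_gb = max(capacity_gb - used_gb_by_day[-1], 0.0)
    return remaining_gb / growth_gb_per_day
```

For example, invented readings of 70, 72, 74, 76 and 78 GB on a 112 GB volume grow at 2 GB per day, leaving 34 GB, or about 17 days. A cleanup will reset the clock, which is why the projection needs to be rerun as new data arrives.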
Slide 136
Classic capacity planning question descriptions and proper
answering techniques
The best way to deal with these is to avoid them proactively by making great, workload characterized consumption information available to all
Train your firm to use the capacity reporting and pathology detection systems
You have automated pathology detection, all the way through ticketing issues, haven't you?
Think graphics, not tables of numbers
If only a secret club knows the capacity data, you are making a big mistake
Train OS support folks to use the "What if?" models
Slide 137
Break Time! Please be back at
Slide 138
What I said about clouds and SAAS last year:
Say goodbye to your data centers and your privileges, folks!
Cloudy days are coming, and this is good
Paying people in each firm to worry about OS, backup, security, and staying current was always expensive, and now it is ridiculous
Change firms a few times and note how wildly different "It has to be this way!" is
Our capacity planning needs, and tools, will have to change too
Instead of vendors selling you software, many will sell the service running on their cloud
This is great! Let the vendor maintain their own code!
They are naturally the cheapest way; the expertise needed is naturally concentrated
Having a year more to search for and find issues, I see a few potential storm clouds in some firms' sunny plans!
Let's dig into why
Slide 139
Clouds and Software as a Service
Definition: Clouds = Running our stuff on someone else's computers, plus whatever else will be needed for the new demands that will place on us, like:
Encryption, so we can run sensitive corporate data over the world wide web safely
  Note that this is done on both sides, the user's machine and in the cloud. This may be an unpleasant surprise for firms that have replaced those expensive desktop processors (and all that excess capacity) with light desktops running virtual machines on shared hardware
Exhaustive disk cleansing when we delete files or parts of files
Network lag measuring tools, because there will be slowdowns and our users will want to direct their wrath
Increased firm internet firewall bandwidth needed
Increased firm internet bandwidth needed
Slide 140
What will those loads look like?
Slide 141
Cloud issues
"We'll just run everything in someone else's cloud, so we won't need capacity planning any more. It will be the cloud vendor's problem!"
Clouds will place new, different, and often resource intensive demands on our firms' computing infrastructure
Capacity concerns will become very important, and historical records of what consumed what will be paramount for figuring things out
Someone is going to have to pay for all of that extra processing and it won't be the vendor!
The "Mushroom Cloud" will be appearing at firms that ignore these risks
Slide 142
Clouds and Software as a Service
Definition: Software as a Service = Letting someone else run their code on their machines to serve us, but undoubtedly with our data, plus whatever else will be needed for the new demands that will place on us
If there is customer identifiable information, we will need all of that encrypt/decrypt overhead again
Disk cleansing will be less of a priority, as no one can run disk scrapers, unlike in the cloud
Network lag measuring tools, because there will be slowdowns and our users will want to direct their wrath
Increased firm internet firewall bandwidth needed
Increased firm internet bandwidth needed
Slide 143
Other Cloud and SaaS issues
The key thing to remember is that cloud and SaaS vendors will have to eventually operate at a profit!
This will drive them to the same attempts to economize that your firms are trying now
Big and cheap IO devices, that are of course much slower
Virtualization will be a certainty; you will never know what fraction of what hardware you will be on
Architectural choices of the firm's past won't make sense any more
What hardware largess do you tolerate now for "Mission critical" applications?
Hot spares? N+1 copies of data?
Will your cloud vendor leave enough excess capacity for your theoretical worst case?
How will you be sure?
Slide 144
Other Cloud and SaaS issues
And remember the graph, they have to run it on 2X to 3X+ the hardware for the same loads!
Unless your firm's Data Processing division is utterly ridiculous in their spending (and many are) how can clouds be cheaper?
Clouds and SaaS only make sense when the non-hardware savings exceed the hardware and network costs, or provide other business useful opportunities
Perhaps outsourcing a staff intensive application to a SaaS vendor is still a really great idea
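The break-even argument above is simple arithmetic, and writing it down helps when someone waves a cloud proposal at you. The sketch below is one possible reading of it, assuming the costs to beat are the extra hardware implied by the 2X to 3X+ multiplier plus added network spend. The function name, parameter names and the figures in the usage note are all invented for illustration; plug in your own measured numbers.

```python
def cloud_net_savings(internal_hw_cost, hw_multiplier,
                      extra_network_cost, non_hw_savings):
    """Net yearly savings from moving a workload out, under simple assumptions.

    internal_hw_cost:   what the hardware costs to run the load internally
    hw_multiplier:      how much more hardware the same load needs externally
                        (the talk suggests 2X to 3X+)
    extra_network_cost: added firewall and internet bandwidth spend
    non_hw_savings:     staff, licensing and other non-hardware savings
    A positive result favors the move; a negative one does not.
    """
    extra_hw_cost = internal_hw_cost * (hw_multiplier - 1.0)
    return non_hw_savings - (extra_hw_cost + extra_network_cost)
```

With invented figures of 100 for internal hardware, a 2.5X multiplier, 20 of extra network cost and 120 of non-hardware savings, the result is 120 - (150 + 20) = -50, so the move loses money. Drop the multiplier to 2X with 10 of network cost and 150 of savings and it comes out at +40, which is the staff intensive SaaS case the slide allows for.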
Slide 145
The moral of the story:
Eventually businesses may evolve into partial cloud and SaaS users when the overhead of extra processing is smaller than some fraction (I'll go out on a limb and say half) of the average resources needed to run the application and the security demands are low, and/or the total function cost is lower
Quick! Think of a low security function at your firm that you would be happy to have some greasy haired geek intercept, and put that in a low security cloud
I couldn't think of any as an example! Can anyone here?
Almost all real corporate work will demand far more internal resources to run externally than to run internally
Be sure to add those costs to your cloud and SaaS plans!
Slide 146
The story continues
Go out and repeat my analysis at your firm on one of your firm's attempts to do it
Publish a paper via CMG or elsewhere where you outline the specific true costs in consumption and hosting spend
If your costs come out like mine did, i.e. "This doesn't make a lot of sense!"
  expect a flood of analyst calls from consulting groups wanting you to expound on your cloud computing experiences
  expect some wholehearted chuckling and agreement that it is nuts
I think that many firms are acting like consulting groups when they are in fact trying to gather data to beat down internal pushes to go cloud or to go SAAS
Or they are consulting to potential cloud providers and giving them a less than rosy view
For now, I would proceed very slowly
Slide 147
Clouds, last words
I used to live not far from here in Oviedo, FL
Every summer day a lot of sunlight hitting the swampy ground would create a lot of hot rising moist air, so we had clouds and thunderstorms about 3 PM each day
In IT journals and analysts' sessions, there is a lot of hot, moist, hype-filled air rising
Maybe that is why they see clouds!
Slide 148
Last words: Don't trust the vendor (or yourselves)
Slide 149
In the interests of time, we are going to skip some here
But you all have the slides!
Slide 150
Modeling when all of the cards are stacked against you
In a perfect world, when new code is written
There is a comprehensive test plan to verify functionality
All issues are corrected prior to the capacity planning tests
The capacity planning tests are performed on real (non-virtual) hardware with known characteristics
The testing group are old pros who know how to run a proper mesa test which has
  An hour of nothing
  An unsaturated hour of a realistic transaction mix, with realistic think times, on a realistic indexed database, without excessive application logging or extraneous monitors adding logging loads to key disks
  Followed by an hour of nothing
And finally a truthful estimate of expec