Ghost Processes Production 0110


    The SysAdmin Group

    The SysAdmin Group Pty Ltd (ACN 069 951 677)
    238 Richardson Street, Middle Park, VIC 3206, Australia
    Phone: (03) 9686-3233  Fax: (03) 9686-3399  Email: [email protected]

    A SysAdmin Group Technical Report

    Copyright 1997, The SysAdmin Group Pty Ltd. All Rights Reserved.

    All information in this document is copyright of The SysAdmin Group (SysAdmin). This document may not be duplicated in any form (paper, electronic, etc.) except as permitted by a valid license agreement with SysAdmin.

    Managing a Production Environment

    Prepared For: General Release

    Author: Geoff Halprin

    Reference: GHOST-PROCESSES-0110

    Version: V1.00

    Date Created: 17 March 1997 17:42

    Date Modified: 24 January 1998 09:17

    Abstract

    Management of a production computing environment is all about the goals of Reliability, Availability and Serviceability (RAS). These have been the benchmarks for evaluating system management practices since the mainframe environments of the 1960s.

    In order to attain the goals of RAS, we must seek to maximise predictability. The quest for predictability is fought on three fronts: Standards, Processes and Technology.

    In this paper I examine the processes that should be used to implement a total quality solution to the management of mission critical production environments.


    Table of Contents

    1.0 Preface
        1.1 Scope
        1.2 Intended Audience
        1.3 References
        1.4 Acknowledgements
        1.5 Change Control
    2.0 Introduction
    3.0 Setting The Scene - The Cost of Downtime
    4.0 Managing a Distributed Computing Environment
    5.0 Process Context
    6.0 The Change Management Process
        6.1 Why do we Need a Change Management Process?
        6.2 The Change Management Process
        6.3 Change Management 101
        6.4 Controlled Learning
        6.5 The Change Request
        6.6 The Change Request Form
        6.7 The Customer Authorisation Matrix
        6.8 The Change Control Board
        6.9 Performing Changes - The Test Lab
        6.10 Classes of Change - The ESPA System
    7.0 The Production Acceptance Process
        7.1 The Process
        7.2 The Steps of the PA Process
        7.3 The Authorisation Gates
        7.4 The Sociability Laboratory
        7.5 Roles in the PA Process
    8.0 A Final Word


    1.0 Preface

    1.1 Scope

    This paper is about the Processes that are required for the predictable, formal management of a production computing environment, specifically the Change Management and Production Acceptance processes.

    1.2 Intended Audience

    It is assumed that the reader is involved in the management of production computing resources at some level, either at the coal face as a system or network administrator, or at the management level.

    This paper is a general discussion paper, and does not use concepts or terminology specific to a particular operating system or environment, except where necessary to illustrate a point.

    1.3 References

    [1] Peopleware, Tom DeMarco and Timothy Lister, 1987. ISBN 0-932633-05-6.

    [2] Change Management, Michelle Trout. The MOSES Whitepapers (Massive Open Systems Environment Standards). http://www.uniforum.org/news/html/publications/techpubs/moses/start.html

    1.4 Acknowledgements

    1.5 Change Control

    Version   Release Date   Author          Comments
    V01.00    24-Jun-97      Geoff Halprin   Initial Release
    V01.10    24-Jan-98      Geoff Halprin   Minor updates


    2.0 Introduction

    Management of a production computing environment is all about the goals of Reliability, Availability and Serviceability (RAS). These have been the benchmarks for evaluating system management practices since the mainframe environments of the 1960s.

    These goals were addressed in the mainframe environment primarily by one manufacturer dictating standards (both good and bad). This problem is far more complex in the open systems world where there are many manufacturers and standards bodies, and further exacerbated by the sheer scale of the distributed environments which we must manage. Such flexibility requires careful management. We must devise a model for the flexible, disciplined management of distributed computing environments.

    In order to attain the goals of RAS, we must seek to maximise predictability. The quest for predictability is fought on three fronts: Standards, Processes and Technology (SPT).

    With Standards, we seek to improve predictability through consistency. By improving the conceptual integrity of a site, we reduce downtime due to trouble-shooting and entropy. We keep hosts, subsystems and software versions in sync in order to support a less diverse environment.

    With Processes, we seek to improve predictability by ensuring that system maintenance activities follow known paths which include quality assurance steps such as peer review, impact analysis and deployment planning. This is predictability through planning.

    With Technology, we seek to improve the manner in which we manage our environment. This ranges from simple tools which automate functions and hence improve consistency of results, through to tools which seek to reduce the management effort in real terms through a paradigm shift in the way we perform higher level tasks.

    So, where RAS are the measures (we can measure quantities like Mean Time Between Failures and Application Availability), SPT is the strategy by which we seek to improve those measures. We can allocate a cost to downtime, and hence evaluate the effectiveness of our strategies.

    In this paper I examine the processes that should be used to implement a total quality solution to the management of mission critical production environments.


    3.0 Setting The Scene - The Cost of Downtime

    Why write a paper about a subject as boring as Change Management or Production Acceptance processes? Well, let's just take a few moments to answer that.

    Let's assume that you are managing a normal Prime Shift computer operation (8am to 6pm, Monday to Friday). That's 10 hours a day, 5 days a week, 52 weeks a year; or 2600 hours per year. (A 24x7 operation is 8736 hours per year, so multiply these figures by 3.36.)

    At 95% availability, you are allowed 130 hours of downtime per year, 10.83 hours per month, 2.50 hours per week. At 99% availability, this drops to 26 hours per year, 2.17 hours per month, or just 0.50 hours per week.

    Now, let's assume a company employs 100 engineers at an average of $50k per annum. This is a payroll of $5 million, which is around $2000 per hour (assuming they work a 10-hour day).

    So, on salaries alone, a downtime of 5% means people unable to use the system for 130 hours per year, costing your company $260,000 in lost time. If your availability is increased to 99%, this figure drops to $52,000 - a saving of $208,000. And this is solely based upon salaries.

    It turns out that this estimate might be way down the low end of the scale. I have one customer who estimates their cost of downtime to be around $500,000 per day. This would make the above improvement yield an annual saving of $5.2 million (104 fewer hours of downtime at $50,000 per hour).
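
    The arithmetic above can be reproduced with a short script. This is only a sketch of the paper's own calculation; the function names are illustrative, and the flat hourly cost is the same simplifying assumption the text makes.

```python
# Worked version of the downtime arithmetic above (all figures from the text).
PRIME_SHIFT_HOURS = 10 * 5 * 52  # 10 hours/day, 5 days/week, 52 weeks = 2600


def downtime_hours(availability, hours_per_year=PRIME_SHIFT_HOURS):
    """Annual downtime implied by an availability expressed as a fraction."""
    return (1.0 - availability) * hours_per_year


def downtime_cost(availability, cost_per_hour, hours_per_year=PRIME_SHIFT_HOURS):
    """Annual cost of that downtime at a flat hourly cost."""
    return downtime_hours(availability, hours_per_year) * cost_per_hour


def annual_saving(old, new, cost_per_hour):
    """Saving from raising availability from `old` to `new`."""
    return downtime_cost(old, cost_per_hour) - downtime_cost(new, cost_per_hour)


if __name__ == "__main__":
    # 2000: the salaries-only rate; 50000: $500,000 spread over a 10-hour day.
    for rate in (2000, 50000):
        print(rate, round(annual_saving(0.95, 0.99, rate)))
```

    Substituting your own site's hours and hourly cost gives a first estimate of what an availability improvement is worth.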

    This paper is about predictability. Downtime is a fact of life. Trying to minimise it is only one aspect of reducing this cost. If we can predict when systems will be unavailable, then this will translate into increased productivity (people can plan to do something else), decreased stress ("What do you mean I just lost 8 hours' work?"), and a more enjoyable work environment in which you finally have time to improve the quality of life (for yourself and your colleagues and customers).

    3.1 Proactive Workflow

    Being a system administrator shouldn't be solely about fighting fires - it should be about designing real, systemic solutions to problems before they arise.

    The processes presented in this paper are the primary mechanism for shifting the balance of system administration workflow from being reactive to being proactive. By doing this, you gain control over your destiny as a system administrator. You start to find the time to craft those tools you so desperately need, time to find out more about what the users' requirements and expectations are, and time to evaluate new technology and plan for future upgrades.

    So, read this paper with an open mind as to what it can really do for you.


    4.0 Managing a Distributed Computing Environment

    There are three key processes which form the basis for the quality management of a modern production computing environment:

    1. Production Change Management (CM) Process.

    The Change Management process is intended to maximise the availability of the existing production environment. It forces the careful planning, peer review, post-change testing and user acceptance of any changes to the production environment.

    No matter how stable a production environment is in theory, in practice there is always ongoing change taking place which does not affect or augment the underlying functionality of the system. The CM process ensures the integrity of such changes by forcing careful planning and impact analysis of any proposed change. It guides support staff through a controlled learning exercise with respect to the potential impact of a change on one or more elements of the production environment.

    2. Production Acceptance (PA) Process.

    By contrast to the CM process, the Production Acceptance process is intended to guide support staff through a controlled learning exercise with respect to the introduction of a new element into the production environment.

    The PA process provides a framework for introducing change into the production environment in a manner which is controlled, predictable and auditable. This process seeks to ensure maximum availability of systems and maximum customer satisfaction with a minimum amount of ongoing intervention by support staff. This involves learning what it means to manage and support the new product, and then introducing it into production in a controlled manner.

    3. Problem Management (Helpdesk) Process.(1)

    No matter how we might try, and even with the help of the CM and the PA processes, things will always go awry. In such a case, it is a user who will often notice the problem first.

    There needs to be a clearly defined process for the accepting, handling and tracking of all user complaints, requests and suggestions such that they can be reacted to according to defined criteria, and in a quality assured manner.

    The Problem Management (PM) process should not only ensure timely response to user problems, but should provide valuable statistics to management on the progress and effectiveness of support staff, and the satisfaction of the user base.

    (1) Due to space restrictions, and the already exhaustive coverage of this topic elsewhere, I will not address the PM process in this paper (other than as it provides context to the other processes).


    5.0 Process Context

    As described in the previous chapter, there are three key processes which contribute to the quality management of a production computing environment. The diagram below illustrates the (simplified) overall process flow and interaction in a normal production environment:

    Figure 1 - Process Context

    An operational computer system continues to operate in the RUN state until a defect is noticed or a new element is introduced. If the user notices a defect, then the Service Level Agreement is consulted to determine the level of support that should be provided. Assuming that the problem must be fixed, this will then involve changes to the production environment in some form. These changes are controlled by the CM process.

    When the Change Management process is invoked, any aspects of the system which are within the affected scope of the change must be tested for correct working behaviour (regression testing) before the change is deemed complete.

    When a new element is being introduced into this environment, it is put through the Production Acceptance process. This guides support staff through the creation of key documentation which controls the entire life-cycle of the product. This documentation includes a deployment plan, a regression test suite, and a product signature, architecture and technical description.

    These documents then feed into the other key processes involved in production management.

    [Figure 1 diagram labels: BUILD, Deploy, RUN, Test, Change Mgmt, Problem Mgmt, Problem Resolution, Production Acceptance, End of Life; documents: Deployment Plan, Service Level Agreement, Test Suite, Product Notes.]


    6.0 The Change Management Process

    The first of the processes we will examine is the Change Management process. There is a higher public awareness of the need for a CM process than for the other processes described in this document, and these other processes all feed into the CM process, so it makes sense to discuss change management first.

    6.1 Why do we Need a Change Management Process?

    The principal goal of any change management process is to ensure the stability of the existing production environment by forcing the careful and thorough impact analysis, planning, peer review and testing of any change to the production environment.

    Without such a process, production management consists of a never-ending series of ad-hoc, undocumented and unplanned changes, all of which dramatically add to the entropy of the system. It becomes easier to replace these systems than maintain them. This in turn leads to system support staff being entirely reactive, going from one fire to the next.

    Indeed, this is usually the most obvious sign of problems at an organisation: system support staff running all over the place, not documenting what they do, not discussing problems with each other, and generally living a fairly stressed existence.

    By contrast, using a change management process offers several clear advantages:

    - It allows us to solve a problem once. This means that we aren't re-diagnosing problems, and always re-fixing the same thing.

    - It allows us to learn from previous experience. By recording and reviewing the process, we can learn from our mistakes and seek to further improve uptime.

    - It ensures that the customer is kept informed. This is no small advantage.

    Try to answer these questions:

    - How did I do it six months ago?

    - Can someone else do it next time?

    If you can't, then you're in need of a change management process!

    6.2 The Change Management Process

    The single most important characteristic of a production operations environment is predictability. Most users don't mind downtime if they can plan for it, but these same people will jump through the phone at you if it just happens!

    The only way to achieve predictability is through controlling all changes to the production environment, including the steps of:

    - Planning the change,
    - Testing the change,
    - Backing-out an unsuccessful change, and
    - User acceptance of the change.

    The Change Management process governs the change life-cycle. It enforces appropriate planning before any change can be effected to the production environment, and controls the application of that change to the production environment.


    6.3 Change Management 101

    All changes must begin their life-cycle as an unknown amount of work with unknown effects. Thus, the initial emphasis of change management is to perform a controlled analysis and impact study of the change before any implementation work can be commenced. Resulting from this analysis will be a plan of attack, and this plan must be reviewed by a peer to ensure the quality of the solution. Of paramount importance in all this is to also determine how to test the correctness of the change, and how to back out the change if it does not succeed. Of course, every change should include user acceptance of the change.

    From this we see a very embryonic change management process:

    Figure 2 - Embryonic Change Management Process

    1. The change begins with a Change Request (CR) being submitted. This is the basic unit of work for a change management system. We will examine the CR in detail in the next section.

    2. Each new CR must be assessed so as to determine the scope of change and the overall priority of the change. We will return to these concepts shortly.

    3. Once a change has been assessed and resolution authorised, we must plan the change, i.e. how we will effect the change onto a production system; when we will schedule the change to occur; which hosts and applications will be affected by the change; and what actions we must therefore take with respect to those applications.

    4. Generally, one person will create such a plan alone. The single most valuable form of quality assurance is to have a peer review that plan and suggest changes or omissions.

    5. After these preliminary steps, we are ready to execute the changes as per the plan. Copious notes should be kept for review.

    6. Once the changes have been made, we must test (a) that the changes worked correctly, and (b) that all related systems (see scope of change) still work correctly. If this is not the case, then we must back out the changes, and re-test that original functionality has been restored.

    7. Finally, the user must accept the changes as correct, and we should review the process to see what we have learned.
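
    The seven steps above amount to a small state machine. The following is a minimal sketch of that idea, not part of the process definition: the state names and the `ChangeRequest` class are hypothetical, and the paths out of a rejected review and out of a backout (both returning to planning) are assumptions, since the steps do not say what follows them.

```python
from enum import Enum


class State(Enum):
    SUBMITTED = "submitted"
    ASSESSED = "assessed"
    REJECTED = "rejected"
    PLANNED = "planned"
    REVIEWED = "peer reviewed"
    EXECUTED = "executed"
    TESTED = "tested"
    BACKED_OUT = "backed out"
    ACCEPTED = "accepted by user"
    CLOSED = "reviewed and closed"


# Allowed moves between states, following steps 1 to 7.
TRANSITIONS = {
    State.SUBMITTED: {State.ASSESSED},
    State.ASSESSED: {State.PLANNED, State.REJECTED},
    State.PLANNED: {State.REVIEWED},
    State.REVIEWED: {State.EXECUTED, State.PLANNED},  # assumption: re-plan
    State.EXECUTED: {State.TESTED, State.BACKED_OUT},
    State.TESTED: {State.ACCEPTED},
    State.BACKED_OUT: {State.PLANNED},                # assumption: re-plan
    State.ACCEPTED: {State.CLOSED},
    State.REJECTED: set(),
    State.CLOSED: set(),
}


class ChangeRequest:
    """Tracks one CR and refuses steps that skip part of the process."""

    def __init__(self, title):
        self.title = title
        self.state = State.SUBMITTED
        self.history = [State.SUBMITTED]

    def advance(self, new_state):
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"{self.title}: cannot go from {self.state.value} "
                f"to {new_state.value}"
            )
        self.state = new_state
        self.history.append(new_state)
```

    The point of encoding the transitions is that a change cannot reach the executed state without passing through assessment, planning and peer review, and the recorded history doubles as an audit trail.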

    6.4 Controlled Learning

    A major advantage of recording all this information is that once we have learned what is required to make a change, and now have that knowledge on file, we can re-use it when next we have a similar change. We can further refine and tune it over time, and hence save enormous amounts of personal effort, and improve our rate of success.

    [Figure 2 diagram labels: CR, Assess, Plan, Review, Execute, Test, Accept, Review; alternate paths: Reject, Backout.]

    6.5 The Change Request

    The life of a change begins and is tracked through the Change Request. The CR is a form (paper or electronic) which records the progress and decisions made during the CM process.

    6.5.1 Priority, Scope and Severity

    Figure 3 - Assessing a Change Request

    [Figure 3 diagram labels: CR, Assess, Plan, Schedule, Review, Inform Requestor; PRI, SC, LEVEL, Customer Authorisation Matrix.]

    Each Change Request has a priority. This is a coarse-grained priority set by the submitter at the time of submission of the CR. This priority reflects the severity of the problem in terms of its impact upon the submitting user. This priority should, hence, be a simple choice based upon criteria such as resolution time. Something like:

        P1  Critical. Resolution is required within 4 hours.
        P2  High. Resolution is required within 48 hours.
        P3  Medium. Resolution is required within 1 week.
        P4  Low. All other CRs.

    Alternatively, schemes are used which require the submitter to qualify the impact that the problem is having on them. I do not recommend these more qualitative approaches.

    Next, there is the scope (aka severity) of the change. This is a classification of the breadth of the applications, systems, user communities or sites which are affected by this change, i.e. does this change affect a single user, or the main production host and hence the entire user base?

    An example might be:

        S1  Major site/division affected. Critical functionality lost. No work-around.
        S2  Multiple people directly affected. Critical functionality lost. No work-around.
        S3  Multiple people affected. Major functionality lost. Staff can continue most work functions.
        S4  Failure only affects a single user.

    These inputs of priority and scope combine to determine the level of the impending change. This is discussed in the next section.
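
    The example priority and scope codes can be held as plain data, so that a CR tracking tool can compute a resolution deadline from the priority. This is a sketch only: the dictionary and function names are hypothetical, and the scope descriptions are abbreviated. It deliberately does not compute the level, because the rule for combining priority and scope is not given here.

```python
from datetime import datetime, timedelta

# Resolution targets from the example priority table (P4 has no fixed target).
RESOLUTION_TARGET = {
    "P1": timedelta(hours=4),
    "P2": timedelta(hours=48),
    "P3": timedelta(weeks=1),
    "P4": None,
}

# Abbreviated descriptions of the example scope codes.
SCOPE = {
    "S1": "Major site/division affected",
    "S2": "Multiple people directly affected",
    "S3": "Multiple people affected, most work can continue",
    "S4": "Single user affected",
}


def resolution_due(priority, submitted):
    """Return the latest resolution time for a CR, or None if unbounded."""
    target = RESOLUTION_TARGET[priority]
    return None if target is None else submitted + target


def summarise(priority, scope, submitted):
    """One-line summary of a CR's classification and deadline."""
    due = resolution_due(priority, submitted)
    when = due.isoformat(sep=" ") if due else "no fixed target"
    return f"{priority}/{scope}: {SCOPE[scope]}; resolve by {when}"


if __name__ == "__main__":
    print(summarise("P2", "S3", datetime(1997, 3, 17, 9, 0)))
```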


    6.6 The Change Request Form

    All of this information, planning, testing, etc. must be recorded for quality control. Quality Assurance, as specified by ISO 9000, consists of two parts: known processes and records of execution of those processes. The Change Request form is the basic record of the execution of the CM process.

    A CR form should record the following information:

    1. Description of Change. A brief description of what the change is and why it is being performed.

    2. Effect on Availability and Customer Impact Statement. Details what applications and systems will be made unavailable during the performance of this change, and what impact that will have on the user base.

    3. Risks. Details what risks are involved in performing this change and what risks there are to the success of the change. A risk analysis should be included.

    4. Prerequisites. What needs to be in place prior to this CR commencing. This should be a table so that each entry can be signed off as ready. Most important amongst the prerequisites is an appropriate form of backup, as this is usually an integral part of a backout plan.

    5. Change Plan. What actions are to be taken to effect the change? Include as much detail as is necessary to avoid any possible ambiguity when it comes time to follow the plan.

    6. Backout Plan. What actions are to be taken to back out the change in the event of a test failure or time plan over-run. This should include details of the conditions under which the backout plan must be executed.

    7. Test Plan. This should include a list of pre-change tests, post-change tests, and post-backout tests. These tests should be derived from the relevant product regression test suites, as detailed in the Production Acceptance process. See Section 7.0 ("The Production Acceptance Process") for more details.

    8. Documentation Updates. This section should list any documentation which requires updating as a result of this change.

    9. Customer Authorisations. This should be a table listing the required approvals before the change can commence. This table should have room for the appropriate signatures.

    10. Customer Acceptance. This is a table for the user acceptance of the change as passing post-change acceptance tests. This table should have room for the appropriate signatures.

    11. Change Notes. A large section for recording any progress notes whilst executing the change. Always record any deviations from any of the above plans.
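
    For an electronic CR, the eleven sections above map naturally onto a record type. The sketch below is one possible shape, with hypothetical class and field names; the `blockers` check is an assumption about what "ready to start" means, based on the sign-off tables described in items 4 and 9.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SignOff:
    """One row of a sign-off table: the item, and whether it is signed."""
    item: str
    signed: bool = False


@dataclass
class ChangeRequestForm:
    description: str
    availability_impact: str
    risks: str
    prerequisites: List[SignOff] = field(default_factory=list)
    change_plan: List[str] = field(default_factory=list)
    backout_plan: List[str] = field(default_factory=list)
    test_plan: List[str] = field(default_factory=list)
    documentation_updates: List[str] = field(default_factory=list)
    authorisations: List[SignOff] = field(default_factory=list)
    acceptance: List[SignOff] = field(default_factory=list)
    change_notes: List[str] = field(default_factory=list)

    def blockers(self) -> List[str]:
        """Reasons the change should not start yet (empty list means ready)."""
        problems = []
        for name in ("change_plan", "backout_plan", "test_plan"):
            if not getattr(self, name):
                problems.append(f"{name} is empty")
        problems += [f"prerequisite not ready: {p.item}"
                     for p in self.prerequisites if not p.signed]
        problems += [f"authorisation missing: {a.item}"
                     for a in self.authorisations if not a.signed]
        return problems
```

    A form with an unsigned authorisation reports that authorisation as a blocker, which makes it hard to begin a change whose backout plan or approvals are missing.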

    6.7 The Customer Authorisation Matrix

    When managing production services for a large organisation, it is most often the case that different applications and data are used by different user communities, with some common infrastructure, applications and data which is used by all communities.

    It becomes vital, then, to build up a list of the managers responsible for each area, what applications they use, and indirectly which systems they therefore rely on. This information is built up into a Customer Authorisation Matrix.


    This matrix serves two purposes:

    1. When a change is pending which requires some application or host downtime, we can quickly ascertain which user communities will be affected by the change. This in turn tells us which managers to coordinate the downtime amongst, and who must authorise this downtime on behalf of the affected users.

    2. When a system breaks, we can immediately identify which user communities are affected, and contact the appropriate managers so that they know we are working on the problem. We, of course, also know who to contact when normal services have been resumed. This pro-active communication maintains a strong communication path and high level of trust with the user communities.
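
    Both uses of the matrix reduce to the same lookup: given a host, which managers need to be contacted? A minimal sketch follows, assuming the matrix is held as two mappings. The community, manager, application and host names are invented purely for illustration.

```python
# Hypothetical matrix: user community -> responsible manager and applications.
MATRIX = {
    "Finance": {"manager": "finance-manager", "applications": {"ledger", "mail"}},
    "Engineering": {"manager": "eng-manager", "applications": {"cad", "mail"}},
}

# Hypothetical mapping of each application to the hosts it relies on.
APP_HOSTS = {
    "ledger": {"db1"},
    "cad": {"cad1"},
    "mail": {"mail1"},
}


def managers_for_host(host, matrix=MATRIX, app_hosts=APP_HOSTS):
    """Managers whose communities use any application running on `host`."""
    return sorted(
        entry["manager"]
        for entry in matrix.values()
        if any(host in app_hosts.get(app, ()) for app in entry["applications"])
    )


if __name__ == "__main__":
    for host in ("db1", "mail1"):
        print(host, managers_for_host(host))
```

    Shared infrastructure (the mail host in this example) resolves to every community's manager, which is exactly the case where coordinating downtime matters most.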

    6.8 The Change Control Board

    The Change Control Board (CCB) is the controlling body for CR processing. The CCB is responsible for the authorisation and coordination of CRs. Its duties include:

    Authorising CRs. Reviewing the CR, its scope, priority and impact, and assessing this against corporate priorities. Just because someone submitted a CR doesn't mean that it must be performed! The CCB must decide whether a CR is worthy of the resources required and associated system impact. This authorisation should occur before any heavy work is done on CR preparation.

    Grouping/un-grouping of CRs. It is often useful to group the execution of CRs together, such as those which will occur during a maintenance window. In these cases, the CRs should not work on related systems or products. (This is an attempt to minimise downtime, not combine CRs.)

    Coordination of downtime across platforms and CRs. Where an organisation has multiple platforms which it supports (e.g. Unix, NT, MVS), then the CCB must coordinate changes across platforms to minimise the impact on the user base.

    The CCB must approve each CR prior to execution (apart from emergency changes), and must ensure that appropriate preparation has been performed, that customers have signed off on the CR and that pre-requisites have been fulfilled.

    The CCB should meet weekly to assess new requests, and on demand for more urgent requests.
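    The grouping rule above (CRs executed in the same window should not touch related systems or products) lends itself to a mechanical check. The following Python sketch assumes each CR is described simply by the set of systems or products it touches; the CR identifiers and the function name are hypothetical.

```python
from itertools import combinations


def conflicting_pairs(change_requests):
    """Return pairs of CR ids in a proposed group that share a system.

    `change_requests` maps a CR id to the set of systems or products
    that the CR touches.
    """
    conflicts = []
    for id_a, id_b in combinations(sorted(change_requests), 2):
        if change_requests[id_a] & change_requests[id_b]:
            conflicts.append((id_a, id_b))
    return conflicts
```

    An empty result means the group can share a maintenance window under this rule; any pair returned should be un-grouped by the CCB.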

    6.9 Performing Changes - The Test Lab

    There are several very important strategies for improving quality and further improving the likelihood of any given change being successful. As mentioned previously, the single most important of these is peer review.

    A second strategy is the trial run of a change in a controlled, non-production environment - the test lab. The lab should be configured to reflect the existing environment, and then the CR put through a trial run.

    Note that the amount of gain is proportional to the complexity of the change. For example, when evaluating new operating system releases or major patches, it is highly advisable to perform such trials to ensure functionality is not lost.


    6.10 Classes of Change - The ESPA System

    As was seen in the overall process context diagram, a CM process takes a Change Request, performs some work to process that request and (hopefully) returns the system to a fully working state. It turns out that there are several distinct classes of change, and each must be handled differently.

    The ESPA CM system defines four classes of change which form a natural hierarchy. These four levels work together to define a comprehensive approach to production change management:

    Figure 4 - The Four Levels of Change

    Scheduled (the default). The standard level assigned to any change is level 2, the Scheduled Change. This can be characterised as a change which is new (has not previously been performed), and so we must investigate and plan accordingly. The principal role of the Scheduled CM process is that it guides us through a controlled learning exercise as to what is required to ensure that the change is successful.

    Emergency. This is the process we invoke when something has broken (a major component of the production system is no longer available) and, hence, we do not have the privilege of planning the change. In this case, rather than creating and reviewing a plan prior to execution, our priority must be to fix the problem first.

    Instead, we perform the peer review step in the form of a downtime conference, and perform that review after the repair has been performed, in order to ensure that we have indeed resolved all problems and correctly brought the system back into a fully operational state, with known outstanding issues. Every incidence of downtime must be dealt with according to this process.

    Procedurised. Those tasks which occur more than once (such as replacing a disk drive, adding a user, migrating Oracle database tables between tablespaces) can and should be turned into procedures.

    Once a change has been through the scheduled change process, and we have successfully completed it and noted any failures, then we have completed the necessary controlled learning exercise. We now know how to successfully complete this type of task. Hence, we can take the lessons learned and the plan developed, and turn it into a defined, repeatable procedure.

    [Figure 4 diagram: the four levels, from top to bottom: Level 4 - Automated Change; Level 3 - Procedurised Change; Level 2 - Scheduled Change; Level 1 - Emergency Change. Flow labels: CR, Assess, End.]


    Automated. Finally, those procedures which, by their nature, we execute many times over can be automated to improve consistency and reduce execution overhead. (You can't automate a procedure you haven't yet defined! At least, not if you want it to work.) Obvious examples here are things like: adding a user, adding a new disk drive, or removing a newsgroup.
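    The way a CR is assigned to one of the four classes can be sketched as a small decision function. This is only an illustration in Python: the three yes/no inputs are an assumed simplification of the assessment step, and the names are hypothetical. The level numbers follow Figure 4.

```python
from enum import IntEnum


class ChangeClass(IntEnum):
    EMERGENCY = 1      # something has broken; repair first, review after
    SCHEDULED = 2      # the default: a new change, planned and peer reviewed
    PROCEDURISED = 3   # a defined, repeatable procedure exists
    AUTOMATED = 4      # the defined procedure has also been automated


def classify_change(production_is_down, has_procedure, has_automation):
    """Pick the class of change for a CR from three yes/no facts."""
    if production_is_down:
        return ChangeClass.EMERGENCY
    if has_procedure and has_automation:
        return ChangeClass.AUTOMATED
    if has_procedure:
        return ChangeClass.PROCEDURISED
    return ChangeClass.SCHEDULED
```

    Note that automation without a defined procedure still falls back to a Scheduled change, in line with the remark that an undefined procedure cannot sensibly be automated.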

    6.10.1 Scheduled Changes

    The process we developed in Section 6.3 ("Change Management 101") on page 9 was a basic Scheduled CM process.

    Scheduled changes can be further broken down into sub-levels based upon such criteria as resolution window. For example, an organisation may deem it appropriate to have five scheduled categories: 48 hours, 7 days, 14 days, maintenance window, 60 days. Their Change Management Policy might read:

    Level S1  High Priority. This level is appropriate for a change which will affect a major production component, but which must occur within a short time frame (48 hours). This might include replacing a disk which is on the way out, or migrating some software between disks to avoid an imminent space problem. This level can be characterised as appropriate when we must provide a limited negotiation with the user base as to when the downtime will occur. (We have to operate before your appendix explodes!)

    Level S2  Medium Priority. To be resolved within 7 days.

    Level S3  Low Priority. To be resolved within 14 days.

    Level S4  Maintenance Window. To be resolved during the standard monthly maintenance window. This will be the first Wednesday of each month between 18:00 (6pm) and midnight.

    Level S5  Enhancement. To be resolved within 60 days.
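    A policy of this kind translates directly into a deadline calculation. The Python sketch below uses the example windows from the policy above; the function names are hypothetical, the 14-day level is labelled S3 on the assumption that the levels are numbered consecutively, and a CR raised once a month's maintenance window has opened is assumed to wait for the following month.

```python
from datetime import datetime, timedelta

# Fixed resolution windows taken from the example policy.
RESOLUTION_WINDOWS = {
    "S1": timedelta(hours=48),
    "S2": timedelta(days=7),
    "S3": timedelta(days=14),
    "S5": timedelta(days=60),
}


def first_wednesday_window(year, month):
    """Return the 18:00 start of the first Wednesday of the given month."""
    day = datetime(year, month, 1, 18, 0)
    # weekday(): Monday is 0, so Wednesday is 2.
    return day + timedelta(days=(2 - day.weekday()) % 7)


def resolve_by(level, raised):
    """Return the latest time by which a CR raised at `raised` is due."""
    if level != "S4":
        return raised + RESOLUTION_WINDOWS[level]
    window = first_wednesday_window(raised.year, raised.month)
    if raised >= window:
        # This month's window has started or passed; use next month's.
        if raised.month == 12:
            window = first_wednesday_window(raised.year + 1, 1)
        else:
            window = first_wednesday_window(raised.year, raised.month + 1)
    return window + timedelta(hours=6)  # 18:00 plus six hours is midnight
```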

    6.10.2 Un-Scheduled Changes

    A careful look at the Scheduled CM process reveals that it is only appropriate for those changes we have the privilege of scheduling and planning in advance. Clearly a number of changes do not fall into this category, and are reactive to an incident which has already happened and already compromised system functionality.

    A modified change process is required for these situations:

    Figure 5 - The Emergency Change Management Process

    [Diagram; step labels: CR, Assess, Plan, Test, Accept, Review, Execute, Inform.]


    The primary difference is that rather than peer review and user sign-off happening before the change is effected, we repair whatever is broken (execute), then hold a downtime conference to assess what happened, whether our reaction was appropriate and sufficient, and what issues remain as a result of this problem. The other addition to the change process is that we must be sure to inform the customer immediately when there is a problem.

    When downtime is encountered, the general process should be:

    1. Trouble-shoot and repair/work-around the problem. Return the system to an interim working state. Take copious notes on the activities undertaken during the trouble-shooting and repair of the system. (Where feasible, make use of the Unix script(1) command or similar.)

    2. Prepare a list of follow-up issues that were uncovered or initiated as a result of the downtime and/or an interim solution.

    3. Hold the Post-Mortem (Downtime Conference).

    6.10.3 The Downtime Conference

    Each incident of server or application downtime must be thoroughly investigated by a composite team of support staff to determine the cause of the downtime, and any courses of action to reduce the likelihood of similar downtime in the future.

    The emphasis of these conferences is team learning from mistakes, so that we can continue to improve customer service levels.

    This conference should involve a number of support staff. A copy of the incident report (minutes of this meeting) should be circulated to all members of the support team as well as to the management.

    Prior to the downtime conference, the primary consultant who worked on the resolution of the problem should prepare a draft of the incident report, as per the agenda below, and previous reports. This will help the conference run more quickly and smoothly, and minimise the impact of these conferences on other members of the support team.

    1. Assign incident number.

    2. Summary of incident.

    3. Current system status.

    4. Investigation walk-through. What actions were taken? Are they sufficient? Are there any unplanned side-effects?

    5. Incident follow-up. What outstanding issues remain from this downtime?

    6. System Resolution. How do we prevent similar downtime in the future?

    The Downtime Conference is a powerful tool for identifying trends, issues which require further investigation, and systemic cures to problems.
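    Because every incident report follows the same six-point agenda, the draft can be kept as a simple structured record. The Python sketch below is one possible shape, not a prescribed format; the class and field names are hypothetical and simply mirror the agenda items.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentReport:
    incident_number: str                                    # 1. Assign incident number
    summary: str                                            # 2. Summary of incident
    current_status: str                                     # 3. Current system status
    actions_taken: list = field(default_factory=list)       # 4. Investigation walk-through
    follow_up_issues: list = field(default_factory=list)    # 5. Incident follow-up
    prevention: list = field(default_factory=list)          # 6. System resolution

    def missing_sections(self):
        """Return the agenda sections that are still empty in this draft."""
        return [name for name, value in vars(self).items() if not value]
```

    A draft prepared before the conference would typically fill in the first four items, leaving the conference itself to settle the follow-up and prevention sections.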

    Warning: Do not attempt to execute other pending CRs whilst a system is down due to a fault. Your sole aim is to follow the Emergency CR process to bring the system back up, then execute other CRs as per the approved schedule. Attempting to group any other CR with an Emergency CR is asking for trouble in the form of prolonged, unplanned downtime.


    6.10.4 Procedurised Change - Learning from History

    The above change processes both deal with controlled learning. We need to ensure the quality of our work by planning the impact of that work. This is an unnecessary repetition, however, where we have already performed such a change, and know the answers.

    In such cases, we should be able to document what we did last time, along with what we learned in the reviews, and turn this into a procedure to follow whenever a significantly similar situation arises.

    Such procedures must still include all necessary customer coordination and inter-CR downtime scheduling, but the actual planning and review steps become much more streamlined, and the peer review is eliminated, greatly optimising the overall process. Procedurising a change does not affect the level of authority required to perform the change either. This can be documented and enforced by policy as appropriate to the organisation.

    Of course, we still need to review the procedure to identify when a process is no longer appropriate, or must be enhanced due to a changing environment.

    6.10.5 Automated Change

    Finally, where a particular change request is frequently executed, then once we have procedurised the change and tuned that procedure over a few executions, it becomes possible to reduce the overhead of executing that procedure by automating it. It is a waste of time to attempt to automate a procedure which has not been defined, and is therefore not properly understood.


    7.0 The Production Acceptance Process

    The PA process is the natural counterpart to the CM process. Where the CM process addresses the problem of maintaining existing functionality, the PA process addresses the introduction of new functionality.

    The PA process provides a framework for introducing change into the production environment in a manner which is controlled, predictable and auditable. This process seeks to ensure maximum availability of systems and maximum customer satisfaction with a minimum of ongoing intervention by systems support staff. The PA process contributes to this by addressing issues such as the following:

    How are new products introduced to a stable production environment?
    How are applications successfully transitioned from development to production?
    What are the training, security, maintenance, availability, output, and other requirements of new products in a production environment?
    Who is responsible for support, security, hardware and software maintenance, etc.?
    What host resources are affected by the installation of the new product?
    How does one recognize when the product is not working as expected?
    How does one add users to this product?
    How is backup and restore of this product accomplished?
    What are the disaster recovery requirements of the product?
    How does this product interact with other products?

    In short, the PA process guides support staff through a controlled learning process to ensure that they understand the effect of a new product on the production environment before they introduce it into that environment. It answers the question:

    What does it mean to say "yes" to this product?

    Before we commit to supporting something, we need to know what that means to the entire support organisation. The PA process ensures that we understand what it is that we are about to commit to.

    The alternative to the PA process is to just introduce a new product, and discover its idiosyncrasies over time as errors arise. Usually, these findings will not be documented in any coherent manner, leading to greater repetition of effort, decreased consistency, and increased entropy.

    It may seem like extra work to perform the PA process, but the fact is that you will always incur this expense; the PA process merely captures it all up-front and records the results where they can be found, rather than this being a hidden cost. Using the PA process, you will also exert far less effort for a greater return.


    7.1 The Process

    It would appear that the PA process will make production management predictable, simple, controlled and perhaps [gasp!] a little less stressful. So, what is it?

    The PA process consists of a small number of well-defined steps. Each step takes certain inputs in the form of guidelines and document templates, and generates one or more records. These records either form an input to another step, or are filed as part of our new product knowledge.

    Figure 6 - Overview of the PA Process

    The intention is that a single binder (or the on-line equivalent) is built up per product, and this binder is then continually referred to and updated as the product is supported over time.

    When there is a problem with a product we can immediately run the acceptance tests detailed in the binder to determine whether the system is functioning as per customer specifications. Where it passes these tests, but still fails customer requirements, then we must review the test plan, and perhaps also the SLA.

    Moreover, when we are performing any change to production (via the CM process), then we can identify all products which are in scope (affected by the CR), and hence we immediately have a list of known tests to perform with respect to those products. Further, we will always run the same tests with respect to a product's functionality. This is a huge win in terms of quality assurance.

    By creating detailed product descriptions, we can quickly understand the basic nature of the product and how to trouble-shoot problems with it.

    Phase          Step                                               Outputs

    Proposal        1  Create Product Proposal and Quotation          Product Proposal
                    2  Plan Project                                   Coarse Project Plan
    GATE 0
    Negotiation     3  Create SLA                                     Service Level Agreement
                    4  Create Acceptance Test Plan                    Acceptance Test Plan
                    5  Create Deployment Plan                         Deployment Plan
                    6  Review Quote                                   Revised Project Plan
    GATE 1
    Investigation   7  Analyse Product                                Product Notes
                    8  Stress Test
    Revision        9  Review Deployment Plan, Test Plan and SLA      Revised Plans
    GATE 2
    Deployment     10  Product Installation                           Installation Record
    Operation      11  Initial Operation and Tuning                   Operational Observations
    Acceptance     12  Customer Acceptance                            Customer Sign-off
    Review         13  Review Process                                 Process Review Minutes


    A second important aspect of the PA process is that each new product is passed through this process, and so each time the PA is executed we have created a self-contained project. This project mentality is useful; in seeing the product successfully through the PA into production we have completed the project. This gives you, management and the customer a sense of achievement.

    The PA is not a Heavy Weight Methodology

    At this point it should be made clear that the PA process described is a guideline. It is a "little m" methodology, not a "big M" Methodology [1]. You should always apply it with intelligence and discretion.

    Clearly, some products require vastly more effort to analyse and control than others. It would not be appropriate to spend the same effort bringing Samba into production as Oracle. Similarly, a simple product being installed on a single machine will not require the same complexity of project plan and deployment plan as a new piece of accounting software to be deployed across a number of sites as a replacement for an existing system.

    The intention of the PA is to guide you through a process, so that you can make those decisions, and be prompted to look at critical issues in making these determinations. As with everything else, there is no substitute for experience. The PA process will be unwieldy at first while you are unfamiliar with it, but with time it will become second nature.

    7.2 The Steps of the PA Process

    The above overview does not reveal the amount of work involved in each step, nor the real nature of that work. We must look at each step in turn in order to appreciate what it is trying to achieve.

    1. Create Product Proposal and Quotation

    Life for a new product must begin with a user requirement. At that point a product proposal is written detailing the business case for the new product. Also documented is the priority of the product, and the time frame within which it must enter production.

    This step formally recognises the requirement for a new product, and hence sets management expectations with respect to the work involved in executing the PA process for this new product. This avoids the trap of management just expecting a new product to be introduced in passing.

    2. Plan the Project

    It is very important that we control the effort required to bring a new product into production. As was stated in the introduction to the PA, the cost of deploying a new product, and the cost of the ongoing support of that product are non-trivial, and these generally go unidentified and uncontrolled. This leads to support staff overload.

    If more than a week of time is going to be required, then it is definitely worth a proper project plan. Depending upon the nature of your organisation, you may wish to charge the customer for time spent on this project. In such cases, a customer will more often than not wish to see a project plan and budget.


    3. Create a Service Level Agreement (SLA)

    The most overlooked part of a non-PA regime is that of setting customer expectations. Do they expect the product to be available 24 hours a day? Of course they do - if you haven't pointed out how much that will cost.

    The SLA is the basic document for agreement between the customer (external or internal) and support staff. This document is created in close consultation with the customer because both the customer and the support staff must be happy that they can meet the levels of support promised in the SLA.

    The SLA becomes the benchmark by which quality and quantity of future service will be measured.

    4. Create an Acceptance Test Plan

    One of the most powerful concepts behind the PA is that right up front we define some tests, in agreement with the customer, which constitute the benchmark for whether the product is behaving as expected. Whenever there is a suspected problem with that product, we can immediately identify whether this baseline of functionality is being met.

    This gives support staff a far more open and powerful way for dealing with customers. If you can clearly show that the system has met their defined requirements, then they will be more willing to help you tune those tests and the system behaviour to match their changing requirements. A more congenial and respectful relationship will generally result.

    These tests are far more than just acceptance tests performed once as the product is placed into production. Whenever a CR is processed which might affect the functionality of this product, we re-run these tests to ensure that it is behaving as expected. We generally refer to such a set of tests as a regression test suite.

    As more products are run through the PA process, we build up a more comprehensive set of regression tests such that we can easily test system status, and SLA compliance.
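    One way to picture such a suite is as a registry of test functions keyed by product, from which we run everything belonging to the products in scope of a CR. The Python sketch below is illustrative only: the registry, the decorator and the two example tests are all hypothetical, and real tests would exercise the live product rather than return a constant.

```python
REGRESSION_TESTS = {}


def acceptance_test(product):
    """Decorator that registers a test function under a product name."""
    def register(func):
        REGRESSION_TESTS.setdefault(product, []).append(func)
        return func
    return register


def run_regression(products_in_scope):
    """Run every registered test for the products a CR touches.

    Returns a dict mapping "product.test_name" to True (pass) or False (fail).
    """
    results = {}
    for product in products_in_scope:
        for test in REGRESSION_TESTS.get(product, []):
            try:
                results[f"{product}.{test.__name__}"] = bool(test())
            except Exception:
                # A test that raises is treated as a failure.
                results[f"{product}.{test.__name__}"] = False
    return results


# Hypothetical example tests.
@acceptance_test("mail")
def accepts_local_delivery():
    return True


@acceptance_test("database")
def answers_simple_query():
    return True
```

    Because the tests are registered once per product, every CR that names a product in its scope runs exactly the same checks against it.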

    5. Create a Deployment Plan

    Bringing a new product up under isolated conditions is a relatively straightforward exercise. You can probably even just follow the instructions that came with the product. But introducing a product into an existing production environment is far more difficult.

    Even armed with full knowledge of the product (as discovered in step 7 below), there are many changes that will affect functionality of other products across a number of target hosts.

    Deployment of the new product must, therefore, be carefully planned. Events such as downtime must be coordinated. Where a product is to be deployed across multiple hosts, we must do a partial deployment and test for unexpected side-effects before continuing with a wider deployment.
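    The partial-then-wider deployment described above can be sketched as a two-phase loop. In this Python illustration the `install` and `regression_ok` callables and the `pilot_size` parameter are assumptions supplied by the caller; they stand in for whatever installation and testing mechanism a site actually uses.

```python
def phased_deploy(hosts, install, regression_ok, pilot_size=1):
    """Install on a small pilot group, test, then continue more widely.

    `install(host)` performs the installation and `regression_ok(host)`
    returns True if the host still passes its tests afterwards.
    Returns the hosts installed so far; the wider phase is skipped if
    any pilot host fails its tests.
    """
    deployed = []
    pilot, remainder = hosts[:pilot_size], hosts[pilot_size:]
    for phase in (pilot, remainder):
        for host in phase:
            install(host)
            deployed.append(host)
        if not all(regression_ok(host) for host in phase):
            break  # unexpected side-effect: do not widen the deployment
    return deployed
```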

    6. Review the Quotation

    At this point we have a far better understanding of the requirements of the product, and so can review the quoted effort and costs of accepting this product into production.


    7. Analyse the Product

    This is the heart of the PA process. Here we perform a sequence of steps to gather knowledge about the product and its interaction with other products.

    Using a sociability laboratory (see below), we create a virgin system image, and snapshot that system (Tripwire is an excellent tool for this). We then load the software onto this test platform, and re-snapshot the system to detect changes. This tells us what parts of a host the product touches during installation.

    We then run the software and again snapshot the system. This tells us about what the product touches when running. All of this is on a test platform, isolated from production hosts.
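    The snapshot-and-compare idea can be illustrated with a few lines of Python. This is not Tripwire and is far less thorough: it is a minimal sketch that records only a SHA-256 digest of each regular file's contents (ignoring permissions, ownership and symbolic links), and the function names are hypothetical.

```python
import hashlib
import os


def snapshot(root):
    """Map each regular file under `root` to the SHA-256 of its contents."""
    digests = {}
    for directory, _subdirs, files in os.walk(root):
        for name in files:
            path = os.path.join(directory, name)
            if os.path.islink(path) or not os.path.isfile(path):
                continue  # skip symlinks and special files in this sketch
            with open(path, "rb") as handle:
                digests[path] = hashlib.sha256(handle.read()).hexdigest()
    return digests


def diff_snapshots(before, after):
    """Return (added, removed, changed) path lists between two snapshots."""
    added = sorted(set(after) - set(before))
    removed = sorted(set(before) - set(after))
    changed = sorted(p for p in set(before) & set(after) if before[p] != after[p])
    return added, removed, changed
```

    Taking one snapshot of the clean image, one after installation and one after a run, and comparing each consecutive pair, gives the install-time and run-time footprints described above.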

    We then need to scour the product manuals and ascertain the overall architecture, component interaction and data flows of the product: what processes does it comprise? How does data move between these? What are the basic data flows of the product into various data areas? What accounts/groups does it require? How is the application started up cleanly? How is it shut down cleanly? How do we determine its current status? What regular housekeeping is required?

    8. Stress Testing

    A vital step in the acceptance of applications with a wide user base is the true stress testing of the product to ensure it meets the needs of the intended user base under load. It is quite astounding the number of problems which first surface under true production conditions, and stress testing is the primary tool in attempting to reduce this occurrence, and the ensuing nightmare of eleventh hour changes.

    At the completion of the product analysis step, we have a fully installed product in the SocLab, on hardware similar to the final production hardware. (See Section 7.4 ("The Sociability Laboratory") on page 22 for more on this.) We can now proceed to stress test the product to determine how it behaves under a representative load.

    Of course, it is usually impossible to test a product using an actual load, so instead we often need to construct test scaffolding which simulates the user load in some appropriate manner. Discussion of benchmarking tests is beyond the scope of this paper.
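    To give a feel for what such scaffolding might look like in its simplest form, the Python sketch below runs a caller-supplied transaction from several concurrent worker threads and records how long each call took. It is an assumption-laden toy rather than a benchmarking method: `transaction` is a hypothetical callable standing in for one user action, and threads are only a crude stand-in for real users.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def simulate_load(transaction, users, requests_per_user):
    """Run `transaction()` from `users` concurrent workers and time each call.

    Returns a list of elapsed times in seconds, one per completed call.
    """
    def one_user(_user_index):
        timings = []
        for _ in range(requests_per_user):
            started = time.perf_counter()
            transaction()
            timings.append(time.perf_counter() - started)
        return timings

    with ThreadPoolExecutor(max_workers=users) as pool:
        per_user = list(pool.map(one_user, range(users)))
    return [t for timings in per_user for t in timings]
```

    Repeating the run with an increasing number of users and comparing the timings against the response times agreed in the SLA shows roughly where behaviour starts to degrade.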

    9. Review SLA and Deployment and Test Plans

    Having now created a knowledge base about this product, we are in a very good position to proceed with the installation. Before we do so, however, we must review the SLA, deployment plan and test plan to determine if they are still appropriate, given this new knowledge.

    10. Install Product

    Now, with that major work behind us we can finally deploy the product according to the deployment plan, with a reasonable level of confidence in the results.

    As part of this we must re-test all other products on any affected hosts to ensure that existing functionality has not been compromised.

    Note that in a distributed environment, this installation step is non-trivial, and may be quite complex. In fact, it is combined and overlapping with the stress testing and user acceptance testing steps. We will often perform an install in several distinct phases, onto a number of hosts, clients and users as is appropriate, testing under an increasing user load as we go.


    11. Initial Operation and Tuning

    Once the product has been fully deployed, there will usually be a settling-in period where the product is closely monitored and tuned under a real workload.

    12. Customer Acceptance

    Finally, we can show the customer the results of the acceptance tests and have them sign off that the product is in production and working correctly. We now have the baseline for a working production product.

    13. Review Process and Project Closure

    An important part of quality assurance is the review of the project and of the PA process, and the continuing tuning of that process to better meet the environment.

    7.3 The Authorisation Gates

    The table that was presented at the beginning of this section (Figure 6 on page 18) had various steps separated by authorisation gates. It is often the case in larger organisations that a project does not get complete approval at the start and then just proceeds until done. Rather, there are a number of checkpoints to ensure that the project is proceeding as expected, and that costs are within budget, or that an appropriate budget extension has been authorised.

    The authorisation gates are the basic mechanism for review, and hence occur after each re-visit of the project plan. Again, this emphasises the need for formal project management on larger mission critical product installations.

    7.4 The Sociability Laboratory

    In analysing the product, the notion of the Sociability Laboratory (test lab) was introduced. This is an extremely important part of both the CM and PA processes. Wherever we are doing something for the first time, we should do so first in a controlled environment, isolated from the production environment, where any unexpected behaviour will not affect production systems.

    For the SocLab to be of most benefit, it should be identical to, or as close as reasonable to, the existing production platform. This will further reduce the likelihood of unexpected behaviour first showing up on the production platform. This is especially true where the SocLab is used during the CM process to test changes before applying them to production (such as testing operating system or product patches), and for stress testing of applications prior to production deployment.

    7.5 Roles in the PA Process

    The PA process is best performed by the staff that will be supporting the product in production. It is also best performed as a team of two or more people. This ensures that key knowledge is more likely to be recorded, and we further reduce any key-person reliance.

    This is an excellent opportunity for support staff to perform non-reactive tasks, and to learn and practice the planning aspects of production management.


    8.0 A Final Word

    In this paper I have introduced two key processes. The Change Management process forces support staff to be proactive and plan changes to a system, rather than just execute them and wait for the consequences to become apparent. The Production Acceptance process forces support staff to learn about a product before they support it.

    Together these processes provide a powerful tool for the management of mission critical computing environments.

    This paper covers a lot of material in varying levels of depth. In order to gain the most out of this topic, I have identified the key concepts which, if you take away and use, will provide the most benefit.

    1. The Production Acceptance Process. The PA process is the natural counterpart to the CM process. By forcing each new product to be put through the paces, we obtain a far greater level of quality of service and this can be measured in terms of application availability. Remember, you will always spend this effort in some form; the PA process just maximises the return on that effort.

    2. Regression Test Suites. Providing a single consistent set of tests for each product, and then using these same tests each time a CR is processed which may affect that product, will significantly improve the quality of changes.

    3. Downtime Conferences. Diligently holding downtime conferences after every occurrence of application (not just host) downtime, and seeking to perform a root cause analysis, will yield significant improvements to application availability (as long as you follow it up). If you compare the cost of downtime (what do you mean you haven't calculated it?) to the cost of this analysis, you will be quite surprised by the gains.

    4. The Sociability Laboratory (aka Test Lab). Having a test laboratory in which you can test operating system upgrades, patches, and new products will significantly increase the success rate of changes and additions, and reduce the incidence of downtime.