Availability 99 999

  • 8/12/2019 Availability 99 999

    High availability

    Availability %            Downtime per year   Downtime per month*   Downtime per week
    90% ("one nine")          36.5 days           72 hours              16.8 hours
    95%                       18.25 days          36 hours              8.4 hours
    97%                       10.95 days          21.6 hours            5.04 hours
    98%                       7.30 days           14.4 hours            3.36 hours
    99% ("two nines")         3.65 days           7.20 hours            1.68 hours
    99.5%                     1.83 days           3.60 hours            50.4 minutes
    99.8%                     17.52 hours         86.4 minutes          20.16 minutes
    99.9% ("three nines")     8.76 hours          43.2 minutes          10.1 minutes
    99.95%                    4.38 hours          21.6 minutes          5.04 minutes
    99.99% ("four nines")     52.56 minutes       4.32 minutes          1.01 minutes
    99.999% ("five nines")    5.26 minutes        25.9 seconds          6.05 seconds
    99.9999% ("six nines")    31.5 seconds        2.59 seconds          0.605 seconds

    * For monthly calculations, a 30-day month is used. Yearly figures assume a 365-day year.
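The figures in the table follow from multiplying the length of the period by the unavailability (1 minus the availability). The sketch below reproduces that arithmetic; the function name `allowed_downtime` and the `PERIODS` mapping are illustrative names, not part of the article, and the 365-day year and 7-day week are assumptions chosen to match the table.

```python
from datetime import timedelta

# Period lengths assumed to match the table: 365-day year, 30-day month, 7-day week.
PERIODS = {
    "year": timedelta(days=365),
    "month": timedelta(days=30),
    "week": timedelta(days=7),
}


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Return the downtime permitted within `period` at the given availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailability = 1.0 - availability_percent / 100.0
    return timedelta(seconds=period.total_seconds() * unavailability)


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        row = ", ".join(
            f"{name}: {allowed_downtime(pct, length).total_seconds():.1f} s"
            for name, length in PERIODS.items()
        )
        print(f"{pct}% -> {row}")
```

Running the script prints the permitted downtime in seconds for each period, which can be compared row by row against the table.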

    Uptime and availability are not synonymous. A system can be up, but not available, as in the case of a network outage.

    In general, network engineers seldom use the number of nines when modeling and measuring availability, because it is hard to apply in formulas. More often, the unavailability is expressed as a probability (such as 0.00001), or a downtime per year is quoted. Availability specified as a number of nines is often seen in marketing documents.
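The two notations are related by a base-10 logarithm: an unavailability of 0.00001 corresponds to five nines. A minimal sketch of the conversion follows; both function names are illustrative and not taken from the article.

```python
import math


def nines_to_unavailability(nines: int) -> float:
    """Unavailability (a probability) for a whole number of nines, e.g. 5 -> 0.00001."""
    if nines < 0:
        raise ValueError("number of nines must be non-negative")
    return 10.0 ** (-nines)


def availability_to_nines(availability: float) -> float:
    """Number of nines for an availability given as a fraction in [0, 1)."""
    if not 0.0 <= availability < 1.0:
        raise ValueError("availability must be a fraction in [0, 1)")
    return -math.log10(1.0 - availability)


if __name__ == "__main__":
    print(nines_to_unavailability(5))
    print(round(availability_to_nines(0.999), 3))
```

The fractional result of `availability_to_nines` shows why the probability form is easier to work with: an availability such as 99.95% does not map to a whole number of nines.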

    The use of the "nines" has been called into question, since it does not appropriately reflect that the impact of unavailability varies with its time of occurrence. [2]

    Measurement and interpretation

    How availability is measured is subject to some degree of interpretation. A system that has been up for 365 days in a non-leap year might have been eclipsed by a network failure that lasted for 9 hours during a peak usage period; the user community will see the system as unavailable, whereas the system administrator will claim 100% "uptime." However, given the true definition of availability, the system will be approximately 99.9% available, or three nines (8751 hours of available time out of 8760 hours per non-leap year). Also, systems experiencing performance problems are often deemed partially or entirely unavailable by users, even when the systems are continuing to function. Similarly, unavailability of select application functions might go unnoticed by administrators yet be devastating to users; a true availability measure is holistic.

    Availability must be measured to be determined, ideally with comprehensive monitoring tools ("instrumentation") that are themselves highly available. If there is a lack of instrumentation, systems supporting high volume transaction processing throughout the day and night, such as credit card processing systems or telephone switches, are often inherently better monitored, at least by the users themselves, than systems which experience periodic lulls in demand.
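The 9-hour outage example can be checked with a few lines of arithmetic. This is a sketch under the assumption that availability is measured as available time divided by total time in the observation window; the function name `measured_availability` is illustrative.

```python
from datetime import timedelta
from typing import Iterable


def measured_availability(window: timedelta, outages: Iterable[timedelta]) -> float:
    """Fraction of the observation window during which the service was available."""
    total = window.total_seconds()
    if total <= 0:
        raise ValueError("observation window must be positive")
    down = sum(outage.total_seconds() for outage in outages)
    if down > total:
        raise ValueError("total outage time exceeds the observation window")
    return (total - down) / total


if __name__ == "__main__":
    # One 9-hour network failure in a non-leap year: 8751 of 8760 hours available.
    year = timedelta(days=365)
    availability = measured_availability(year, [timedelta(hours=9)])
    print(f"{availability:.4%}")  # roughly 99.9%
```

Whether an outage counts toward `outages` is exactly the point of interpretation raised above: the administrator's uptime log and the users' experience can produce different lists.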

    Closely related concepts

    Recovery time (or estimated time of repair, ETR) is closely related to availability: it is the total time required for a planned outage, or the time required to fully recover from an unplanned outage. Recovery time could be infinite with certain system designs and failures, that is, full recovery is impossible. One such example is a fire or flood that destroys a data center and its systems when there is no secondary disaster recovery data center.

    Another related concept is data availability, that is, the degree to which databases and other information storage systems faithfully record and report system transactions. Information management specialists often focus separately on data availability in order to determine acceptable (or actual) data loss with various failure events. Some users can tolerate application service interruptions but cannot tolerate data loss.

    A service level agreement ("SLA") formalizes an organization's availability objectives and requirements.

    System design for high availability

    Paradoxically, adding more components to an overall system design can undermine efforts to achieve high availability. That is because complex systems inherently have more potential failure points and are more difficult to implement correctly. Some analysts hold that the most highly available systems adhere to a simple architecture (a single, high quality, multi-purpose physical system with comprehensive internal hardware redundancy); however, this architecture suffers from the requirement that the entire system must be brought down for patching and operating system upgrades. More advanced system designs allow for systems to be patched and upgraded without compromising service availability (see load balancing and failover).

    High availability implies no human intervention to restore operation in complex systems. For example, an availability limit of 99.999% allows about one second (0.864 seconds) of downtime per day, which is impractical using human labor. The need for human intervention for maintenance actions in a large system will exceed this limit. An availability limit of 99% would allow an average of about 15 minutes (14.4 minutes) per day, which is realistic for human intervention.

    Redundancy is used to eliminate the need for human intervention. The two kinds of redundancy are passive redundancy and active redundancy.

    Passive redundancy is used to achieve high availability by including enough excess capacity in the design to accommodate a performance decline. The simplest example is a boat with two separate engines driving two separate propellers. The boat continues toward its destination despite failure of a single engine or propeller so long as boat speed exceeds water velocity long enough to avoid running out of fuel. A more complex example is multiple redundant power generation facilities within a large system involving electric power transmission. Malfunction of single components is not considered to be a failure unless the resulting performance decline exceeds the specification limits for the entire system.

    Active redundancy is used in complex systems to achieve high availability with no performance decline. Multiple items of the same kind are incorporated into a design that includes a method to detect failure and automatically reconfigure the system to bypass failed items using a voting scheme. This is used with complex computing systems that are linked. Internet routing is derived from early work by Birman and Joseph in this area. [3] Active redundancy may introduce more complex failure modes into a system, such as continuous system reconfiguration due to faulty voting logic.
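As an illustration of a voting scheme, the sketch below takes the outputs of several redundant units, accepts the value reported by a strict majority, and flags the units that disagreed or stayed silent so they can be bypassed. It is a simplified assumption of how such logic might look, not the mechanism of any particular system; `majority_vote` and the use of `None` for a silent unit are illustrative choices.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple


def majority_vote(outputs: Sequence[Optional[Hashable]]) -> Tuple[Hashable, List[int]]:
    """Return the majority value and the indices of units to bypass.

    `outputs` holds one entry per redundant unit; None means the unit gave
    no response. A strict majority of all units must agree.
    """
    responses = [value for value in outputs if value is not None]
    if not responses:
        raise RuntimeError("no unit responded")
    value, count = Counter(responses).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no strict majority among redundant units")
    # Units that were silent or disagreed with the majority are suspects.
    suspects = [index for index, out in enumerate(outputs) if out != value]
    return value, suspects


if __name__ == "__main__":
    print(majority_vote([42, 42, 41]))   # unit 2 disagrees
    print(majority_vote([7, None, 7]))   # unit 1 is silent
```

The `RuntimeError` branch reflects the failure mode mentioned above: when the voter itself cannot reach a decision, or is faulty, the system can end up reconfiguring continuously instead of recovering.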

    Zero downtime system design means that modeling and simulation indicate that the mean time between failures significantly exceeds the period of time between planned maintenance, upgrade events, or system lifetime. Zero downtime involves massive redundancy, which is needed for some types of aircraft and for most kinds of communications satellites. The Global Positioning System is an example of a zero downtime system.

    Fault instrumentation can be used in systems with limited redundancy to achieve high availability. Maintenance actions occur during brief periods of downtime only after a fault indicator activates. Failure is only significant if it occurs during a mission-critical period. This strategy is called condition-based maintenance, and it is only effective with active redundancy.

    Modeling and simulation is used to evaluate the theoretical reliability for large systems. The outcome of this kind of model is used to evaluate different design options. A model of the entire system is created, and the model is stressed by removing components. Redundancy simulation involves the N-x criteria. N represents the total number of components in the system. x is the number of components used to stress the system. N-1 means the model is stressed by evaluating performance with all possible combinations where one component is faulted. N-2 means the model is stressed by evaluating performance with all possible combinations where two components are faulted simultaneously.
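The N-x procedure can be sketched as an exhaustive loop over combinations of faulted components. The model below is deliberately simplistic and is an assumption for illustration: each component contributes a fixed capacity, and the system fails when the remaining capacity drops below a required level. The function name `n_minus_x_failures` and the generator figures are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def n_minus_x_failures(
    capacities: Dict[str, float], required: float, x: int
) -> List[Tuple[str, ...]]:
    """Return every set of x faulted components that violates the requirement."""
    if not 0 <= x <= len(capacities):
        raise ValueError("x must be between 0 and the number of components")
    total = sum(capacities.values())
    failing = []
    # Stress the model with all possible combinations of x faulted components.
    for faulted in combinations(sorted(capacities), x):
        remaining = total - sum(capacities[name] for name in faulted)
        if remaining < required:
            failing.append(faulted)
    return failing


if __name__ == "__main__":
    # Hypothetical generators (capacity in MW) and a hypothetical 80 MW demand.
    generators = {"G1": 50.0, "G2": 50.0, "G3": 30.0}
    print("N-1 violations:", n_minus_x_failures(generators, required=80.0, x=1))
    print("N-2 violations:", n_minus_x_failures(generators, required=80.0, x=2))
```

With these example figures the design survives every single fault but no double fault, so it would meet an N-1 criterion and miss N-2. The number of combinations grows quickly with N and x, which is one reason real studies rely on dedicated simulation tools.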

    Reasons for unavailability

    A survey among academic availability experts in 2010 ranked reasons for unavailability of enterprise IT systems, from most to least important, as follows [4]:

    Causal factor of unavailability
    1. Lack of best practice change control
    2. Lack of best practice monitoring of the relevant components
    3. Lack of best practice requirements and procurement
    4. Lack of best practice operations
    5. Lack of best practice avoidance of network failures
    6. Lack of best practice avoidance of internal application failures
    7. Lack of best practice avoidance of external services that fail
    8. Lack of best practice physical environment
    9. Lack of best practice network redundancy
    10. Lack of best practice technical solution of backup
    11. Lack of best practice process solution of backup
    12. Lack of best practice physical location
    13. Lack of best practice infrastructure redundancy
    14. Lack of best practice storage architecture redundancy

    The factors themselves are based on the work of Evan Marcus & Hal Stern. [5]

    Costs of unavailability

    A 1998 report from IBM Global Services estimated that unavailable systems cost American businesses $4.54 billion in 1996, due to lost productivity and revenues. [6]

    References

    [1] Piedad, Floyd. High Availability: Design, Techniques, and Processes (http://books.google.com/books?id=kHB0HdQ98qYC&dq=high+availability+floyd+piedad+book&printsec=frontcover&source=bn&hl=en&ei=gs0LSrLvBKjm6gOT3ISPCA&sa=X&oi=book_result&ct=result&resnum=7)

    [2] Evan L. Marcus, The myth of the nines (http://searchstorage.techtarget.com/tip/0,289483,sid5_gci921823,00.html)

    [3] RFC 992

    [4] Ulrik Franke, Pontus Johnson, Johan König, Liv Marcks von Würtemberg: Availability of enterprise IT systems - an expert-based Bayesian model, Proc. Fourth International Workshop on Software Quality and Maintainability (WSQM 2010), Madrid (http://www.kth.se/ees/forskning/publikationer/modules/publications_polopoly/reports/2010/IR-EE-ICS_2010_047.pdf?l=en_UK)

    [5] E. Marcus and H. Stern, Blueprints for high availability, second edition. Indianapolis, IN, USA: John Wiley & Sons, Inc., 2003.

    [6] IBM Global Services, Improving systems availability, IBM Global Services, 1998 (http://www.dis.uniroma1.it/~irl/docs/availabilitytutorial.pdf)

    External links

    Carrier grade: The Myth of the Nines (http://www.pipelinepub.com/0407/pdf/Article%204_Carrier%20Grade_LTC.pdf) - Pipeline PDF

    Service Availability Reporting (http://themonitoringguy.com/articles/service-availability-reporting/) - A Guide To Service Availability Reporting

    Cisco IOS Management for High Availability Networking (http://www.cisco.com/en/US/tech/tk869/tk769/technologies_white_paper09186a00800a998b.shtml/) - Best Practices White Paper

    Article Sources and Contributors

    High availability. Source: http://en.wikipedia.org/w/index.php?oldid=433430281 Contributors: 1ForTheMoney, Ais523, AlephGamma, BBCWatcher, BD2412, Bakahamster, Barrylb, Bovineone, Brian2wood, Chipmc, Chrism, Chuq, Clarityfiend, Constantine Kulikovsky, Dbu, Dyl, Eriberto mota, Ettrig, Fixe, Galloping Moses, Garyzx, Geek2003, Golbez, Graham87, Hm2k, Interchange88, JCLately, JForget, JHunterJ, Jim.henderson, Jkelly, JonHarder, Joy, Jpbowen, Kmcnamee, Krille, Kvng, Little Mountain 5, Marc Kupper, MarkoKevac, Matt Britt, Mikehou, Mild Bill Hiccup, MrOllie, Mwanner, Nanoatzin, Ngupta4, Openminds, PabloStraub, Pablothegreat85, Pearle, Platte Daddy, Rchandra, Siress, Sohale, SolarisBigot, Thepillow, Trabadori, Ulrikf, Woohookitty, Wwwwolf, Xepto, 72 anonymous edits

    License

    Creative Commons Attribution-Share Alike 3.0 Unported
    http://creativecommons.org/licenses/by-sa/3.0/