© Copyright 2009, Information Technology Intelligence Corp. (ITIC) All rights reserved.

Other products and companies referred to herein are trademarks or registered trademarks of their respective companies or mark holders.

INFORMATION TECHNOLOGY INTELLIGENCE CORP.

ITIC 2009 Global Server Hardware and Server OS Reliability Survey

July 2009



Executive Summary

“Time is money”

For the second year in a row, IBM AIX UNIX running on the Power or “P” series servers scored

the highest reliability ratings among 15 different server operating system platforms – including

Linux, Mac OS X, UNIX and Windows.

Those are the results of the ITIC 2009 Global Server Hardware and Server OS Reliability Survey

which polled C-level executives and IT managers at 400 corporations from 20 countries

worldwide. The results indicate that the IBM AIX operating system running on Big Blue’s

Power servers (System p5s) is the clear winner; it offers rock solid reliability, besting all

competing operating systems, including those running on Intel-based x86 machines. The IBM

servers running AIX consistently score at least 99.99% uptime, or just 15 minutes of unplanned

downtime per server, per annum (See Exhibit 1).

Overall, the results showed improvements in reliability, patch management procedures and an

across-the-board reduction in per server, per annum Tier 1, Tier 2 and the most severe Tier 3

outages.

IBM AIX on the Power series System p5 and System p6 servers leads all vendors for

both server hardware and server OS reliability. The IBM UNIX distribution recorded the

fewest number of Tier 1, Tier 2 and Tier 3 unplanned server outages per year. IBM AIX

running on the System p5s and newer p6s had less than one unplanned outage incident

per server in a 12 month period. More impressively, the IBM servers experience no

severe Tier 3 outages.

Hewlett-Packard’s HP UX 11i running on the HP 9000 and Integrity servers also

performed very well, though HP servers notch approximately 21 to 25 minutes more

downtime than IBM servers, depending on model and configuration. The HP UX 11i v.3

Update 4 on the HP 9000s averages 36 minutes of per server, per annum downtime, while

the HP UX 11i v.3 Update 4 on HP Integrity servers recorded 39 minutes of per server,

per annum downtime.

Faster Patch Management. IT managers spend approximately 11 minutes to apply

patches to IBM servers running the AIX operating system, which is, again, the least

amount of time spent patching any server or operating system. The open source Ubuntu

distribution is a close second with IT managers spending 12 minutes to apply patches,

while IT managers in the Novell SUSE Enterprise, customized Linux distribution and

Apple Mac OS X 10.x. environments each spend a very economical 15 to 19 minutes

applying patches.

Unplanned severe Tier 2 and Tier 3 Outages Decline. IBM also took top honors in

another important category: IBM Power Series System p5 and p6 servers running AIX

experience the lowest amount of the more severe Tier 2 and Tier 3 outages combined of

any server hardware or server operating system. The combined total of Tier 2 and Tier 3



outages accounted for just 19% of all per server, per annum failures in IBM network

environments. HP UX on the 9000 and Integrity servers, Novell SUSE Linux Enterprise

11 and “other” Linux distributions were close behind with combined Tier 2 + Tier 3

outages accounting for 24% to 25% of unplanned yearly downtime.

Novell SUSE Superiority. Among the Linux and Open Source server operating system

distributions, both Novell SUSE Linux Enterprise 10 and 11 versions consistently

achieved superior reliability ratings. In fact, Novell SUSE in a customized

implementation had the lowest downtime (approximately 16 minutes per server/server

OS, per annum) of any distribution with the exception of IBM’s AIX on the

Power Series. Many IT managers specifically mentioned and extolled the high level of

integration and interoperability between their Novell SUSE Linux Enterprise and

Microsoft Windows Server 2003 and Windows Server 2008 in heterogeneous networks,

in their anecdotal responses and first person customer interviews.

Most Improved. Microsoft Windows Server 2003 and Windows Server 2008 showed the

biggest improvements of any of the vendors. The Windows Server 2003 and 2008

operating systems running on Intel-based platforms saw a 35% reduction in the amount

of unplanned per server, per annum downtime from 3.77 hours in 2008 to 2.42 hours in

2009. The number of annual Windows Server Tier 3 outages also decreased by 31% year

over year and the time spent applying patches similarly declined by 35% from last year to

32 minutes in 2009.

Apple Mac and OS X 10.x Competitive Enterprise Reliability. This year’s survey for

the first time also incorporated reliability results for the Apple Mac and OS X 10.x OS

platform. Over the past two to three years, the Apple Mac platform has made a comeback

in corporate enterprises. The numbers of Mac G4 servers are modest in comparison to the

more entrenched Windows, Linux and UNIX distributions. Nonetheless, they are making

their presence known. IT managers report the reliability has been generally very good.

The survey respondents indicated that the Apple Mac G4 servers are extremely

competitive in an enterprise setting. IT managers spend approximately 15 minutes per

server to apply patches and record an average downtime of about 40 minutes per

server, per annum. It is important to note that at this point, the workloads of the G4 Macs

are not comparable to those of the high end IBM, HP and Sun (now Oracle) UNIX

systems or the customized Linux and open source distributions.



The intent of this Report is to quantify and qualify the reliability of 15 different server operating

system platforms running on a variety of proprietary UNIX and Intel-based hardware platforms.

This will allow organizations to more easily identify baseline reliability metrics associated with

individual platforms in order to better determine and optimize their total cost of ownership

(TCO), accelerate return on investment (ROI) and more efficiently manage risk.



Table of Contents

Executive Summary

Introduction

Survey Methodology

Survey Demographics

Data & Analysis

Conclusions

Recommendations

Recommendations for Corporate Customers

Recommendations for Vendors



Introduction

Server hardware and server operating system reliability is the foundation and bedrock upon

which crucial applications, storage, security and third party utilities and management rest. The

stability and health of the entire network infrastructure depend heavily on the server hardware

and the operating systems that run on them. Server hardware and server operating system

reliability are inextricably linked to the corporation’s ability to lower its TCO, accelerate ROI

and reduce the risk factors that negatively impact performance.

Information on specific reliability metrics allows businesses to calculate the real-time resources

and monies needed to manage and maintain their various server hardware platforms and

operating systems. It also enables them to determine whether or not their mission critical server

hardware and operating system software are assisting or impeding the business from meeting key

service level agreements (SLAs) to their customers, business partners and suppliers as well as

internally to the company’s own end users.

The ITIC self-selecting reliability survey polled IT managers at 400 corporations worldwide on

the annual amount and percent of unplanned per server, per annum downtime experienced in the

following 15 hardware and server OS environments:

IBM AIX on Power series System p5 and p6 servers

HP UX on the 9000

HP UX on Integrity servers

Sun Solaris UNIX on the SPARC Servers

Apple Mac OS X 10.5, 10.6 on G4 Macs

Novell SUSE Linux Enterprise on Intel x86 servers

Novell SUSE Linux Enterprise with customization

Red Hat Enterprise Linux on Intel x86 servers

Red Hat Enterprise Linux with customization

Windows Server 2003 on Intel x86 servers

Windows Server 2008 on Intel x86 servers

Ubuntu open source

Debian open source

Other Linux distributions (e.g. Mandriva, Turbo Linux)

Other Linux distributions with customization

The survey data gives a detailed comparison breakdown of the percentage of Tier 1, Tier 2 and

highest severity Tier 3 outages.



ITIC’s definition of server outages is as follows:

Tier 1: These are typically minor, common, albeit annoying occurrences. A network

administrator can usually resolve such incidents with less than 30 minutes of downtime for dependent

users. Tier 1 incidents can usually be resolved by rebooting the server and rarely involve any

data loss.

Tier 2: These are moderate issues in which the server may be offline from one hour to four

hours or about a half-day. Tier 2 problems may require the intervention of more than one

network administrator to troubleshoot and frequently affect the corporation’s end users

and possibly business partners, customers and suppliers in the event they are attempting to

access data on an affected corporate extranet.

Tier 3: This is the most severe type of incident. Tier 3 outages last longer than four hours

for network administrators and the company’s associated dependent users. Tier 3

outages almost always require a team of multiple network administrators to resolve. Data loss

or damage to systems and applications may or may not occur. Another real threat associated

with a protracted Tier 3 outage is potential lost business and the potential damage to the

company’s reputation.
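As a rough illustration, the duration thresholds in the three definitions above can be expressed as a small classifier. This is only a sketch under stated assumptions: the function name `classify_outage` is hypothetical, duration is treated as the sole criterion (the definitions also weigh staffing and data loss), and outages lasting between 30 minutes and one hour, which the definitions do not explicitly cover, are assumed here to fall into Tier 2.

```python
def classify_outage(duration_minutes: float) -> int:
    """Map an outage duration in minutes to an ITIC-style tier (1, 2 or 3).

    Assumptions, not taken from the survey text:
      * duration is the only input considered;
      * the undefined 30-to-60 minute band is assigned to Tier 2.
    """
    if duration_minutes < 0:
        raise ValueError("duration_minutes must not be negative")
    if duration_minutes < 30:
        return 1  # Tier 1: resolved in less than 30 minutes
    if duration_minutes <= 4 * 60:
        return 2  # Tier 2: up to four hours (about a half-day)
    return 3      # Tier 3: longer than four hours


if __name__ == "__main__":
    for minutes in (10, 45, 180, 300):
        print(f"{minutes:>4} min outage -> Tier {classify_outage(minutes)}")
```

An organization applying this in practice would likely adjust the boundaries to match its own SLA language.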

The length and severity of each of these outage types correspond to specific line item capital

expenditure and operational expenditure costs for the business. Reliability, measured by

downtime, can positively or negatively impact TCO and accelerate or delay the time it takes to

realize ROI.

Improvements or declines in reliability also mitigate or increase technical and business risks to

the organization’s end users and external customers. The ability to meet service-level

agreements (SLAs) hinges on server reliability, uptime and manageability. These are key

indicators that enable organizations to determine which server operating system platform or

combination thereof is most suitable.

The survey data detailed the disparity in the number and severity of unplanned server outages

and the amount of time in minutes and hours that businesses experience on the various Linux,

Windows and UNIX platforms.

The survey closely examined both the actual quantitative reliability statistics as well as the

qualitative issues that positively or negatively impacted outage time. The ITIC survey queried

corporate IT managers and C-level executives on myriad reliability-related functions including:

The amount of downtime (minutes/hours) experienced per server, per annum

The amount of time spent patching each server

Whether the IT administrators apply updates via an automated group policy procedure or

manually apply the patches to individual servers



On average, individual corporate Linux, Windows and UNIX servers experience from zero to

approximately two failures per server per year. In a best case scenario, this results in 20 minutes

(IBM AIX running on p5 and p6 Power servers) to 4.3 hours (Debian open source) of

annual downtime for each server. Windows Server 2008 servers experienced a total of just under

three unplanned yearly Tier 1, Tier 2 and Tier 3 outages. However, the necessity of having to

take many of the Windows Servers offline to apply monthly patches and then do a system reboot,

resulted in Windows Server 2008 machines being offline for just under two and a half hours each

year. Still, this is a 35% reduction from the 3.77 hours of downtime experienced by Windows

Server 2008 machines in last year’s ITIC reliability survey.
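The year-over-year figure quoted above can be checked with a short calculation. The sketch below is illustrative only; the function name `percent_reduction` is a hypothetical helper, and the inputs are the 3.77-hour and 2.42-hour Windows Server downtime values reported in this survey.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return the relative reduction from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


if __name__ == "__main__":
    hours_2008 = 3.77  # per server, per annum downtime reported for 2008
    hours_2009 = 2.42  # per server, per annum downtime reported for 2009
    change = percent_reduction(hours_2008, hours_2009)
    print(f"Downtime reduction: {change:.1f}%")
```

With these inputs the reduction works out to just under 36%, in line with the approximately 35% cited.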

Among the Linux distributions Novell SUSE Enterprise exhibited consistent reliability

reminiscent of the late 1980s and 1990s when Novell NetWare was famous for running several

years – in some cases as long as nine years – without experiencing a failure or the need to reboot.

This can be attributed to the stability of the Novell distribution, the experience of the SUSE

engineers and the length of experience of many IT managers who came from the NetWare

environment. Novell also inked an interoperability and technical service and support agreement

with Microsoft two and a half years ago, which also served to improve reliability.

The open source Ubuntu distribution also scored some impressive reliability gains as it continues

to gain in popularity and deployments.

Overall, these survey responses provide crucial, comparative reliability metrics to enable

customers to make informed choices on which server hardware and server operating system or

combination thereof, best suits their specific business and budget needs.

Survey Methodology

ITIC conducted the 2009 Global Server Hardware and Server OS Survey, an independent Web-based

survey that included multiple-choice questions and essay responses, from March through

July 2009. ITIC polled C-level executives and IT managers at 400 corporations worldwide.

ITIC analysts supplemented the Web survey by conducting two dozen first-person customer

interviews. ITIC conducted additional interviews with customers in October 2009 and updated

the Report with specific information on server downtime statistics. The anecdotal data obtained

from these interviews validates the survey responses and provides deeper insight into the

challenges confronting businesses in both the immediate and long term.

To deliver the most unbiased, accurate information, ITIC did not accept any vendor

sponsorship money for the online poll or the subsequent first-person interviews conducted in

connection with this project. ITIC employed authentication and tracking mechanisms to prevent

tampering and to prohibit multiple responses by the same parties.



Survey Demographics

Companies of all sizes and all vertical markets were represented in the survey. Respondents

came from companies ranging from small and medium businesses (SMBs) with fewer than 50

workers, to large enterprises with more than 100,000 employees.

Roughly 33% of the survey respondents came from the SMB segment with 1 to 100 employees;

12% of those polled were from midsize companies with 100 to 500 employees; 14% were drawn

from corporations employing 500 to 1,000 employees; and 41% of respondents worked in large

enterprises with 1,000 to more than 100,000 workers.

The survey was truly global. Approximately 85% of respondents came from North America.

The remaining 15% hailed from more than 20 countries across Europe, Asia, Australia, New

Zealand, South America and Africa.

Data & Analysis

Server hardware and server operating system reliability has improved immeasurably in the last

five years.

When ITIC began conducting reliability research and surveys, our original definition of

unplanned downtime was an unexpected external or internal incident that caused the server

hardware and/or the server operating system software to spontaneously fail or freeze, thereby

disrupting network operations and requiring remediation efforts and a reboot. Depending on the

seriousness of the incident, the downtime may also have resulted in lost or damaged data.

However, it quickly became apparent from the anecdotal survey comments and during our first

person customer interviews, that IT managers and network administrators had a broader

definition of what constituted downtime.

As far as IT departments are concerned, anything that causes them to take the server offline,

regardless of the cause, is unplanned downtime. Included in this category are instances of

vendors releasing an unanticipated patch to fix a technical bug or security vulnerability. Such an

occurrence does not qualify as unplanned downtime in the narrowest definition of the term, but

network administrators oftentimes do not make that distinction. To them downtime is downtime

because it disrupts their routine and may also impact daily operations because it means the IT

department must devote time to remedial issues that would have been spent performing other IT

chores. And in some network environments like Windows, it’s still necessary to take the servers

down, apply the patch and perform a hard reboot.

Time very literally equates to money. The economic downturn has forced companies to cut staff

and put network and software upgrades on hold; it has decimated IT departments and severely reduced

the training and recertification for network administrators.



A recent ITIC survey that polled 250 corporations worldwide in October 2009 found that 47% of

businesses had budget cuts within the past 12 months. That number was even greater for

companies with over 500 end users; 64% of large enterprises experienced budget cuts.

Consequently, 84% of the respondents reported that their IT departments simply pick up the

slack and work longer and harder.

Downtime by the Numbers

In the early days of networks, corporate enterprises considered 99% uptime to be an adequate

reliability standard. Not so in 2009. An ITIC survey of 250 enterprises conducted in October

found that only 14% of survey respondents consider 99% uptime adequate for their most mission

critical, line of business (LOB) applications. Another 14% said that 99.9% or three nines met

their reliability needs. A two-thirds majority (66%) of those polled, however, said their

network environments require 99.95%, 99.999% or greater reliability for their most mission

critical LOB applications.

It’s easy to see why when you correlate the uptime percentages to actual downtime:

99% = average unplanned downtime of one hour and 40 minutes per week

99.9% = average unplanned downtime of 45 minutes per month

99.95% = average unplanned downtime of 22 minutes per month

99.999% = average unplanned downtime of 5 1/2 minutes per year
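The conversions listed above follow from simple arithmetic: allowed downtime is the unavailable fraction multiplied by the length of the period. The following is a minimal sketch of that calculation, not part of the survey methodology. The function names are hypothetical, and the period lengths (a 365-day year, a month taken as one-twelfth of a year, a seven-day week) are assumptions.

```python
# Period lengths in minutes (assumed: 365-day year, month = year / 12).
MINUTES_PER_WEEK = 7 * 24 * 60
MINUTES_PER_YEAR = 365 * 24 * 60
MINUTES_PER_MONTH = MINUTES_PER_YEAR / 12


def downtime_minutes(uptime_pct: float, period_minutes: float) -> float:
    """Minutes of downtime implied by an uptime percentage over a period."""
    if not 0.0 <= uptime_pct <= 100.0:
        raise ValueError("uptime_pct must be between 0 and 100")
    return (100.0 - uptime_pct) / 100.0 * period_minutes


def uptime_from_downtime(downtime_min: float, period_minutes: float) -> float:
    """Inverse calculation: uptime percentage implied by minutes of downtime."""
    if not 0.0 <= downtime_min <= period_minutes:
        raise ValueError("downtime_min must lie within the period")
    return 100.0 * (1.0 - downtime_min / period_minutes)


if __name__ == "__main__":
    print(f"99%     -> {downtime_minutes(99.0, MINUTES_PER_WEEK):.1f} min per week")
    print(f"99.9%   -> {downtime_minutes(99.9, MINUTES_PER_MONTH):.1f} min per month")
    print(f"99.95%  -> {downtime_minutes(99.95, MINUTES_PER_MONTH):.1f} min per month")
    print(f"99.999% -> {downtime_minutes(99.999, MINUTES_PER_YEAR):.1f} min per year")
```

Under these assumptions the four levels come to roughly 100.8 minutes per week, 43.8 and 21.9 minutes per month, and 5.3 minutes per year, which agrees with the rounded figures in the list. The inverse helper can be used to translate a measured downtime figure, such as minutes per server per annum, back into an uptime percentage.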

Taken in this context, it’s easy to understand how the ongoing economic crisis has cast renewed

emphasis on server and server operating system reliability. Businesses of all sizes and across all

vertical markets are extremely risk averse. IT departments grapple daily with the reality of

keeping networks up and running in the face of cost cuts, layoffs and fewer resources.

Businesses and their IT departments are under

pressure to maximize server hardware and server operating system uptime in order to realize the

greatest economies of scale and ensure that their server hardware, server operating systems and

the crucial business applications and services that run on them are available to end users,

corporate clients, business partners and suppliers. A server outage of even a few minutes

duration can disrupt network operations and result in lost data, steep monetary losses and

damage to a company’s reputation.

Reliability Then and Now

The first generations of server hardware and server operating system software platforms

introduced in the mid-to-late 1980s, were proprietary. Network administrators typically became

experts in a particular vendor’s platform. The 1.0 versions of new hardware and software products



from 10 to 20 years ago were also rife with bugs. It typically took from six months to a year for

the vendors to work the kinks out and achieve an acceptable level of stability and for IT managers to

gain sufficient expertise and knowledge resulting in higher levels of uptime. It is also worth

noting that two decades ago, businesses were not as wholly dependent on their networks as they

are today.

In the 1990s, 99% reliability was considered an acceptable industry standard. That is no longer

the case; 99% uptime is the equivalent of over 80 hours of annual per server downtime. ITIC’s

separate 2009 Global Application Availability Survey conducted in April found that eight out

of 10 of the 300 businesses polled said that their major business applications require higher

availability rates than they did two or three years ago. However, nearly three-quarters of

companies – 72% -- are unable to quantify the cost of downtime or the impact that unplanned

reliability outages have on the business. Among the other 2009 Global Application Availability

survey findings:

Nearly two-thirds (61%) of organizations are unsure of how to estimate the impact of

downtime on the business or do not even attempt to track the losses associated with

application downtime and reliability

Two out of five firms -- 41% -- said they require conventional 99% to 99.9% application

availability; 29% said they needed 99.95% or 99.99% uptime; while 7% of respondents

indicated they need continuous availability of 99.999% or 99.9999%.

Just under half – 49% of companies – lack the budget to purchase additional third party

software or hardware availability technology. This places more of an onus on the underlying

server hardware and server OS to deliver high reliability.

The responses from the ITIC 2009 Global Application Availability Survey underscore the crucial importance of

having highly reliable server hardware and server operating systems. If the servers,

server OS and related applications are unavailable for any reason, business and daily operations

grind to a halt – with sometimes catastrophic results.

The demand for server hardware, server OS and application availability has grown, particularly

with the emergence of new technologies like cloud computing and virtualization. Corporations

need to ensure that reliability keeps pace. To quantify the reliability statistics: 99.95% uptime

equates to roughly four and a half hours (about 263 minutes) of per server, per annum downtime, while 99.99% uptime equates to about 53 minutes.

Today’s networks demand near perfect reliability; corporations deem any downtime

anathema to their business operations. This is particularly true for those companies in vertical

markets such as banking and finance, stock exchanges, insurance, healthcare and legal, whose

businesses are based on intensive data transactions. A server crash of even 15 to 30 minutes

duration can cost a company anywhere from tens of thousands to tens of millions of dollars in lost business and

remediation efforts. Zero downtime, or as close to it as is humanly and technologically possible,

is the obvious goal and Holy Grail of reliability.



While system flaws will always be present in some fashion, the survey found that at present,

server hardware and server OS reliability was also inextricably linked with several other crucial

factors and components. They are:

Integration and interoperability are crucial. Over 85% of businesses with 300+ end

users have myriad types of server hardware and three different operating systems present

in their environment. Heterogeneity and openness are essential to the reliability of

today’s networks. The 2007 wide ranging, non-exclusive interoperability pact between

Microsoft and Novell was extremely well received and a huge boon for the respective

customer bases of both firms. As part of the deal, Microsoft and Novell teamed up to

provide joint sales, technical service and support to deliver plug and play interoperability

between the Windows and SUSE Linux Enterprise environments.

Workloads. The applications themselves are growing in size and complexity. It is

therefore imperative that the server hardware be robust enough to handle the increased

demands of new classes of applications such as streaming audio and digital and highly

complex processes. It is a fact that a robust server configuration that includes new multi-

core and multi-threading technologies, maximum memory, hard drive and the fastest

processors will perform better than old, outmoded and inadequate equipment. The survey

showed for example that the high reliability ratings for IBM and HP were no fluke: the

powerful IBM System p5 and System p6 Power Series servers and the HP 9000 and

Integrity Servers achieved very high reliability – 99.99% and 99.999% uptime – while

carrying workloads that were 30% to 40% greater than comparable x86-based machines.

Experience of the IT managers. Errors by neophyte, inexperienced network

administrators and IT managers who have not been able to get trained and re-certified on

the latest technologies are another major factor that contributes to extended downtime and

adversely impacts system reliability.

Patch management. The amount of time spent applying patches is one of the biggest

contributors to system downtime; this is especially true of security patches, as we see in

Exhibit 2 below.



IBM AIX administrators spent the least amount of time – 11 minutes – applying patches. They

were followed closely by the Ubuntu open source distribution, Apple Mac, niche market “other”

customized Linux distributions and Novell SUSE; administrators in each of these environments

spent on average from 12 to 15 minutes applying patches.

This speaks to the underlying stability of these environments as well as the experience of the

administrative staff. Typically, UNIX installations (notably IBM’s AIX), as well as Novell

SUSE Enterprise and Apple Mac, tend to be stable, static environments with experienced, hands-on

network administrators who are familiar with the most minute details of the bits and bytes of

their systems. Fast patch management positively impacts reliability.

The feedback from the survey respondents reinforced the importance of being able to receive and

download patches quickly once a bug has been identified. Corporate IT managers noted the

significant strides that had been made by all of the vendors across the board in recent years,

though they still voiced some concerns. Among the anecdotal comments:



• “IBM has done a wonderful job of keeping our AIX systems up and ready. We rarely if

ever have reliability issues,” said an IT manager at a Midwest financial institution.

• “Patch management automation has significantly reduced both the manpower required to

apply patches and the downtime associated with patch management over the last three

years,” noted an IT administrator at a large health care facility in the northeast.

• “Novell SUSE Linux Enterprise is always very up-to-date on patches; Zenworks is nice

and we never have a problem,” said a longtime Novell user at a large healthcare provider

in the Southwest.

• “The amount of time it takes to identify vulnerability and when the vendors release the

patch, has decreased significantly, but if the bug is a dangerous one, we still worry,”

according to a chief technology officer at a midsized retailer.

• “Our patches are tested at our corporate headquarters location and then distributed as

needed to the various remote locations, downloaded to a local Microsoft Systems

Management Server (SMS) and automatically downloaded via group policy to each

workstation and server. The process is accelerated and it’s relatively painless for the IT

department,” said an administrator at a large West Coast enterprise.

• “Our patch management dramatically improved with SUSE 10.2 and SUSE 11,” noted

another veteran Novell administrator. “We have no problems now to speak of.”

• “We currently use Group Policy to download patches on each server, but we manually

apply them. So it takes us about 15 minutes to patch each Windows server. This means

that each server takes less than 15 min to patch. On a whole, other than hardware issues,

we've averaged less than two failures per server, per year on our Windows Server 2003

systems,” said an IT manager at a large East coast insurance firm.



Serious Tier 2 + Tier 3 Incidents Decline

The survey results also showed a discernible decline in the number and percentage of the more

serious Tier 2, Tier 3 and combined Tier 2 + Tier 3 incidents, according to Exhibit 3 below.



Once again, IBM AIX on the Power Series System p5 and p6s recorded the smallest percentage

of combined Tier 2 + Tier 3 incidents at 19%. The other UNIX and Linux distributions including

the HP UX 11i v3 on the HP 9000 and HP Integrity, Novell SUSE Linux Enterprise and Sun

Solaris also scored well with the more serious aggregate Tier 2+ Tier 3 outages accounting for

24% to 25% of total outages. And all of the aforementioned distributions managed to lower their

scores from the similar survey in 2008.

Microsoft’s Windows Server 2003 on x86-based servers came in with a very respectable 30% of

reliability outages being in the Tier 2 + Tier 3 categories; this was a reduction of 11 percentage points from the

41% reported by respondents to the 2008 ITIC Global Reliability Survey.

One of the most impressive statistics was that IBM AIX Power Series System p5 and System p6

servers notched no severe Tier 3 incidents whatsoever. Again, this achievement is even more

impressive when one considers that these systems typically run higher workloads than their x86-

based counterparts as shown in Exhibit 4.



HP’s UX 11i v.3 Update 4 on the HP 9000 and Integrity servers and Sun Solaris on SPARC

Servers (now owned by Oracle), Novell SUSE, Red Hat Enterprise Linux and Apple Mac OS X

10.5 and 10.6 on the G4 Macs also recorded very few Tier 3 outages: less than one each, per server

per annum.

The most common Tier 1 incidents, which usually last between 10 and 30 minutes, also

showed across the board reductions among all server hardware and server operating system

platforms as we see from Exhibit 5.



In the Tier 1 category, IBM also came out on top with less than one-half of one Tier 1 incident

per AIX Power Series System p5 and System p6 per annum. This equates to about four to seven

minutes downtime per server, per year.

In fact, all of the server hardware and server OS environments racked up less than one Tier

1 outage per server, per annum.

The results were similarly encouraging for the average number of Tier 2 outages as we see in

Exhibit 6 below.



Conclusions

In summary, the ITIC 2009 Global Server Hardware and Server OS Reliability Survey findings

indicate that all of the server operating system platforms have achieved a high degree of

reliability. However, IBM AIX running on the p5 and p6 Power Servers leads the UNIX

distributions and is the clear winner, followed closely by HP, Novell SUSE Enterprise Linux and the

Ubuntu open source distribution.

These results are especially noteworthy in light of the ongoing economic crunch which has

caused companies to cut their budgets and reduce IT staff. As they strive to accomplish more

with fewer resources, IT departments must rely even more heavily on their vendors to deliver

more reliable servers and server operating system software.

To reiterate, time is literally money. Even a few minutes of downtime can cost companies

thousands or millions of dollars and cause business operations to grind to a halt. Downtime can

also adversely affect a company’s relationship with its customers, business suppliers, partners

and internal end users. Reliability or lack thereof can potentially damage a company’s reputation

and result in lost business.

Hence, corporations must have confidence in the reliability and stability of the underlying server

hardware and server OS platforms.

The advances in technology are encouraging. Now companies must tackle other equally

important and challenging issues to ensure the highest level of uptime and reliability. Close

attention must be paid to integration and interoperability, patch management, documentation and

getting the necessary training and certification for the appropriate IT managers. The most

bulletproof hardware and software platforms can be undone by human error. It’s equally

important that companies find the funds to stay as current as possible on their server hardware

and server OS software. Performance will suffer if the server configuration is old and

inadequate.

Recommendations

Server hardware and server operating system reliability has improved vastly since the 1980s,

1990s and even in just the last two to three years. While technical bugs still exist, the number,

frequency and severity have declined significantly.

With few exceptions, common human error poses a bigger threat to server hardware and server

operating system reliability than technical glitches.



Crucial TCO metrics such as reliability, performance, security and management ultimately

depend as much on each firm’s specific implementation as they do on the properties of the

server and server OS technology itself. There are inherent dependencies between the underlying

capabilities of a particular server operating system and an individual corporation’s ability to

adhere to best deployment practices with respect to training, testing and configuration. The

reliability, security and manageability of even the most hardened server and server operating

system are easily compromised by human error.

A company that does not restrict physical access to the server is asking for trouble. Similarly,

any firm that does not enact and enforce strong usage and security policies risks

compromising the reliability and integrity of its server hardware and server OS environment. The

reliability of the server environment can also be undone easily or seriously compromised by such

actions as: a bad configuration; the use of incompatible or unapproved memory and logic chips,

hardware, peripherals and software drivers; overclocking machines; failing to apply necessary

patches; failing to upgrade or retrofit inadequate or obsolete servers and operating systems; and

taxing server and software resources beyond their capabilities.

Recommendations for Corporate Customers

To optimize uptime and reliability, ITIC advises corporations to:

Regularly analyze and review configurations, usage and performance levels. This

will enable companies to determine whether or not their current server and server OS

environment allows them to achieve optimal reliability.

Adopt formal SLAs. Service level agreements enable organizations to define acceptable

performance metrics. Companies should meet with their vendors and customers on at

least an annual basis to ensure the terms are met.

Define, measure and monitor reliability and performance metrics. It is imperative that

companies measure component, system, server hardware, desktop and server OS,

security, network infrastructure, storage and application performance. Keep a

log of the planned and unplanned downtime in a continuous fashion throughout the

enterprise.

Regularly track server and server OS reliability and downtime. Keep accurate

records of outages and their causes. Segment the outages according to their severity and

length – e.g. Tier 1, Tier 2 and Tier 3. The appropriate IT managers should also keep

detailed logs of remediation efforts in the event of the outage. These logs should include

a full account of remediation activities, specifying how the problem was solved, how

long it took and which staff members participated in the event. They should also list the

monetary costs as well as any material impact on the business, its operations and its end

users. This will prove an invaluable resource should the problem recur. It may also make the



difference in containing or curtailing the reliability-related incident, saving precious time

for the IT department, the end users and corporate customers.

Calculate the cost of unplanned downtime. Companies should determine the average

cost of minor Tier 1 outages. They should also keep more detailed cost assessments of the

more serious unplanned Tier 2 and Tier 3 incidents. It’s essential for businesses to know

the monetary amount of each outage – including IT and end user salaries due to

troubleshooting and any lost productivity – as well as the impact on the business. C-level

executives and IT managers should also pay close attention to whether the

company’s reputation suffered as a result of a reliability incident; whether any litigation

ensued; and whether customers, business partners and suppliers were impacted (and at what cost).

They should at least try to gauge whether the company lost business or potential business.

Ensure that your organization has robust server hardware that can adequately

handle the OS and application workloads. The server hardware (standalone, blade,

cluster, etc.) and the server operating system are inextricably linked. To achieve optimal

performance from both components, corporations must ensure that the server hardware is

robust enough to carry both the current and anticipated workloads for the lifecycle of

both.

Compile a list of best practices and adhere to them. This is absolutely essential. Chief

technology officers (CTOs), software developers, engineers, network administrators and

managers should have extensive familiarity with the products they currently use and are

considering. Check and adhere to your vendors’ list of approved, compatible hardware,

software and applications. Software developers and network administrators must obey the

rules. That means avoiding ill-advised and iffy practices such as overclocking server

and desktop hardware or allowing unskilled or neophyte administrators to make changes to

the registry. All of these actions can lead to serious reliability problems.

Don’t skimp on training and recertification for IT administrators, software

developers and engineers. In these days of budget cuts, it’s common practice to

eliminate monies that were formerly earmarked for training. ITIC understands that money

is tight. If you can’t afford the time or expense to re-certify your entire IT department,

designate the most experienced or appropriate IT staffer to take the course – even if it’s

only an online course – and allow that person to train additional appropriate managers.

Perform regular asset management testing. Schedule asset management reviews on a

yearly, bi-annual or quarterly basis, as needed. This will assist your company in remaining

current on hardware and software and help you to adhere to the terms and conditions of

licensing contracts. All of these issues influence network reliability. It also allows

organizations to be better equipped to meet their SLA requirements and maintain peak

performance and reliability.

Manual vs. Automated Group Policy Patch Management. IT managers, particularly in

high end UNIX environments and in corporations whose environments feature a high degree

of customization, will continue to perform manual patch management.



Keep your software updated with the latest necessary patches and upgrades. You don’t

have to apply every patch, but it’s wise to keep track of which patches are crucial to the

network’s health. Construct and adhere to a regular schedule to apply patches, preferably on a

monthly basis. This will help the company avoid potentially nasty surprises.

Standardize legacy and future hardware, server OS and application environments

as much as possible. ITIC survey data indicates that standardization—that is, following a

prescribed configuration and version for the company’s hardware, software and network

infrastructure components—can lower TCO costs by 15%. Standardization benefits all

users—including organizations that have custom configurations.

Note that custom software implementations require the highest level of expertise. Any

firm that elects to customize its Linux or open source server operating system distribution

should either employ guru-level administrators or contract with a systems integrator or

outsourcer with the appropriate expertise.

Automated patch management applied via Group Policy vs. manual patching. Companies should also regularly review whether it is feasible for the firm to migrate away

from manual patch management. Collecting this information may seem to be a chore at first,

but it will be an invaluable source of information that can guide the company to lower its

TCO and improve the rate of its ROI.
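The outage-tracking and downtime-cost recommendations above can be sketched as a simple log record plus a cost roll-up by tier. This is only an illustration under stated assumptions: the field names, the hourly rates and the cost model (IT staff and affected end users are both charged for the full length of the outage) are hypothetical and are not prescribed by ITIC.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Outage:
    """One unplanned outage, as an IT manager might log it (hypothetical fields)."""
    server: str
    tier: int                 # 1, 2 or 3, assigned by severity and length
    minutes: float            # how long the outage lasted
    cause: str
    remediation: str          # how the problem was solved
    staff: List[str] = field(default_factory=list)  # who worked on the fix
    affected_users: int = 0   # end users unable to work during the outage
    other_costs: float = 0.0  # estimated lost business, litigation, etc.


def outage_cost(o: Outage, it_hourly_rate: float, user_hourly_rate: float) -> float:
    """Estimated cost of one outage.

    Assumption: every listed staff member and every affected user loses
    the full duration of the outage.
    """
    hours = o.minutes / 60.0
    it_cost = len(o.staff) * it_hourly_rate * hours
    user_cost = o.affected_users * user_hourly_rate * hours
    return it_cost + user_cost + o.other_costs


def summarize_by_tier(log: List[Outage], it_hourly_rate: float,
                      user_hourly_rate: float) -> Dict[int, Dict[str, float]]:
    """Count, total minutes and total estimated cost of outages per tier."""
    summary: Dict[int, Dict[str, float]] = defaultdict(
        lambda: {"count": 0, "minutes": 0.0, "cost": 0.0})
    for o in log:
        entry = summary[o.tier]
        entry["count"] += 1
        entry["minutes"] += o.minutes
        entry["cost"] += outage_cost(o, it_hourly_rate, user_hourly_rate)
    return dict(summary)


if __name__ == "__main__":
    # Invented sample entries, for illustration only.
    log = [
        Outage("app-01", 1, 20, "faulty driver update", "rolled back driver",
               staff=["admin-a"], affected_users=50),
        Outage("db-02", 3, 300, "failed disk controller", "replaced controller",
               staff=["admin-a", "admin-b"], affected_users=200),
    ]
    for tier, stats in sorted(summarize_by_tier(log, 60.0, 30.0).items()):
        print(f"Tier {tier}: {stats}")
```

Keeping the record and the cost model separate lets a company refine its rates or add cost categories later without rewriting the historical log.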

Recommendations for Vendors

It is a buyer’s market and is likely to remain so for the foreseeable future. Competition among

vendors is intense because businesses have a wide array of server hardware and server operating

system platforms from which to choose. In order to retain the current customer base and attract

new corporate customers, all of the vendors must strive to improve the features, performance,

reliability and security of their respective server hardware and server OS software. Additionally,

ITIC advises vendors to:

Embrace Interoperability and Integration. The survey data indicates that backwards

compatibility and integration with other hardware, server OS, applications and third party

tools and utilities pose a significant potential threat to the underlying stability of the network

environment.

Provide Explicit Guidance around Patches and Patch Management. Patches vary

according to the importance and severity of the fix or update, and by the number of patches in a

formal release. Data ITIC obtained from anecdotal essay comments and first-person

customer interviews underscore the need for vendors to issue patches in an efficient,

expeditious manner and to provide full transparency on the nature and severity of all bugs.

Many IT managers expressed frustration and confusion with the patch management process,

which was sometimes cumbersome. IT managers also noted that oftentimes they were unsure

of which patches were crucial versus optional. ITIC advises vendors to deliver specific

recommendations and instructions on the download process, since patch management is a

crucial element of IT management that can positively or negatively impact reliability.



Provide the latest technical documentation. Ready access to clear, concise technical

guidelines and detailed documentation has never been more important. The economic

downturn forced many companies to cut staff. Time and money are scarce or non-existent for

training and re-certification of IT administrators. It is therefore crucial that vendors pick up

the slack and publicize and disseminate technical “how to” guidelines via their respective

Websites, Emails and Webinars.

Vendors should also actively work with third party ISVs to assist in resolving driver

and application compatibility issues. As we noted above, integration and interoperability

issues are a top priority for IT departments who wish to maintain a high level of reliability.

While many of the largest third party ISVs do an exemplary job of ensuring that their

applications and drivers are certified to work with new server hardware and server OS

releases, many smaller and niche ISVs – particularly in specific verticals like finance, legal

and healthcare – in many instances lack the necessary resources and funds to support new

releases. Vendors should poll their customers on which third party applications, drivers and

utilities are crucial and when necessary assist ISVs in providing the necessary compatibility.

Work with partners to provide expanded access to discounted certification and online

training courses. One of the biggest challenges confronting IT departments today is finding

the money and sparing the time to get the appropriate administrators re-trained and certified

on the latest server hardware and server OS software.