© Copyright 2009, Information Technology Intelligence Corp. (ITIC) All rights reserved.
Other products and companies referred to herein are trademarks or registered trademarks of their respective companies or mark holders.
INFORMATION TECHNOLOGY
INTELLIGENCE CORP.
ITIC 2009 Global Server Hardware and Server OS Reliability Survey
July 2009
Executive Summary “Time is money”
For the second year in a row, IBM AIX UNIX running on the Power or “P” series servers scored
the highest reliability ratings among 15 different server operating system platforms – including
Linux, Mac OS X, UNIX and Windows.
Those are the results of the ITIC 2009 Global Server Hardware and Server OS Reliability Survey,
which polled C-level executives and IT managers at 400 corporations from 20 countries
worldwide. The results indicate that the IBM AIX operating system running on Big Blue’s
Power servers (System p5s) is the clear winner; it offers rock solid reliability, besting all
competing operating systems, including those running on Intel-based x86 machines. The IBM
servers running AIX consistently score at least 99.99% uptime, with just 15 minutes of unplanned
downtime per server, per annum (See Exhibit 1).
Overall, the results showed improvements in reliability, patch management procedures and an
across-the-board reduction in per server, per annum Tier 1, Tier 2 and the most severe Tier 3
outages.
IBM AIX on the Power series System p5 and System p6 servers leads all vendors for
both server hardware and server OS reliability. The IBM UNIX distribution recorded the
fewest Tier 1, Tier 2 and Tier 3 unplanned server outages per year. IBM AIX
running on the System p5s and newer p6s had less than one unplanned outage incident
per server in a 12-month period. More impressively, the IBM servers experienced no
severe Tier 3 outages.
Hewlett-Packard’s HP UX 11i running on the HP 9000 and Integrity servers also
performed very well, though HP servers notch approximately 21 to 25 minutes more
downtime than IBM servers, depending on model and configuration. HP UX 11i v.3
Update 4 on the HP 9000s averages 36 minutes of per server, per annum downtime, while
HP UX 11i v.3 Update 4 on HP Integrity servers recorded 39 minutes of per server,
per annum downtime.
Faster Patch Management. IT managers spend approximately 11 minutes to apply
patches to IBM servers running the AIX operating system, which is again the least
amount of time spent patching any server or operating system. The open source Ubuntu
distribution is a close second with IT managers spending 12 minutes to apply patches,
while IT managers in the Novell SUSE Enterprise, customized Linux distribution and
Apple Mac OS X 10.x environments each spend a very economical 15 to 19 minutes
applying patches.
Unplanned severe Tier 2 and Tier 3 Outages Decline. IBM also took top honors in
another important category: IBM Power Series System p5 and p6 servers running AIX
experience the lowest amount of the more severe Tier 2 and Tier 3 outages combined of
any server hardware or server operating system. The combined total of Tier 2 and Tier 3
outages accounted for just 19% of all per server, per annum failures in IBM network
environments. HP UX on the 9000 and Integrity servers, Novell SUSE Linux Enterprise
11 and “other” Linux distributions were close behind with combined Tier 2 + Tier 3
outages accounting for 24% to 25% of unplanned yearly downtime.
Novell SUSE Superiority. Among the Linux and Open Source server operating system
distributions, both Novell SUSE Linux Enterprise 10 and 11 versions consistently
achieved superior reliability ratings. In fact, Novell SUSE in a customized
implementation had the lowest downtime (approximately 16 minutes per server/server
OS, per annum) of any distribution with the exception of IBM’s AIX on the
Power Series. In their anecdotal responses and first person customer interviews, many IT
managers specifically extolled the high level of integration and interoperability between
their Novell SUSE Linux Enterprise and Microsoft Windows Server 2003 and Windows
Server 2008 systems in heterogeneous networks.
Most Improved. Microsoft Windows Server 2003 and Windows Server 2008 showed the
biggest improvements of any of the vendors. The Windows Server 2003 and 2008
operating systems running on Intel-based platforms saw a 35% reduction in the amount
of unplanned per server, per annum downtime from 3.77 hours in 2008 to 2.42 hours in
2009. The number of annual Windows Server Tier 3 outages also decreased by 31% year
over year, and the time spent applying patches similarly declined by 35% from last year to
32 minutes in 2009.
Apple Mac and OS X 10.x Competitive Enterprise Reliability. This year’s survey for
the first time also incorporated reliability results for the Apple Mac and OS X 10.x OS
platform. Over the past two to three years, the Apple Mac platform has made a comeback
in corporate enterprises. The numbers of Mac G4 servers are modest in comparison to the
more entrenched Windows, Linux and UNIX distributions. Nonetheless, they are making
their presence known. IT managers report the reliability has been generally very good.
The survey respondents indicated that the Apple Mac G4 servers are extremely
competitive in an enterprise setting. IT managers spend approximately 15 minutes per
server to apply patches and report an average downtime of about 40 minutes per
server, per annum. It is important to note that at this point, the workloads of the G4 Macs
are not comparable to those of the high end IBM, HP and Sun (now Oracle) UNIX
systems or the customized Linux and open source distributions.
The intent of this Report is to quantify and qualify the reliability of 15 different server operating
system platforms running on a variety of proprietary UNIX and Intel-based hardware platforms.
This will allow organizations to more easily identify baseline reliability metrics associated with
individual platforms in order to better determine and optimize their total cost of ownership
(TCO), accelerate return on investment (ROI) and more efficiently manage risk.
Table of Contents
Executive Summary
Introduction
Survey Methodology
Survey Demographics
Data & Analysis
Conclusions
Recommendations
Recommendations for Corporate Customers
Recommendations for Vendors
Introduction
Server hardware and server operating system reliability is the bedrock upon
which crucial applications, storage, security, third party utilities and management rest. The
stability and health of the entire network infrastructure depend heavily on the server hardware
and the operating systems that run on it. Server hardware and server operating system
reliability are inextricably linked to the corporation’s ability to lower its TCO, accelerate ROI
and reduce the risk factors that negatively impact performance.
Information on specific reliability metrics allows businesses to calculate the real-time resources
and monies needed to manage and maintain their various server hardware platforms and
operating systems. It also enables them to determine whether their mission critical server
hardware and operating system software are helping or hindering the business in meeting key
service level agreements (SLAs) to their customers, business partners and suppliers as well as
internally to the company’s own end users.
The ITIC self-selecting reliability survey polled IT managers at 400 corporations worldwide on
the annual amount and percentage of unplanned per server, per annum downtime experienced in
the following 15 hardware and server OS environments:
IBM AIX on Power series System p5 and p6 servers
HP UX on the 9000
HP UX on Integrity servers
Sun Solaris UNIX on the SPARC Servers
Apple Mac OS X 10.5, 10.6 on G4 Macs
Novell SUSE Linux Enterprise on Intel x86 servers
Novell SUSE Linux Enterprise with customization
Red Hat Enterprise Linux on Intel x86 servers
Red Hat Enterprise Linux with customization
Windows Server 2003 on Intel x86 servers
Windows Server 2008 on Intel x86 servers
Ubuntu open source
Debian open source
Other Linux distributions (e.g. Mandriva, Turbo Linux)
Other Linux distributions with customization
The survey data gives a detailed comparison breakdown of the percentage of Tier 1, Tier 2 and
highest severity Tier 3 outages.
ITIC’s definition of server outages is as follows:
Tier 1: These are typically minor, common, albeit annoying occurrences. A network
administrator can usually resolve such incidents with less than 30 minutes of downtime for
dependent users. Tier 1 incidents can usually be resolved by rebooting the server and rarely
involve any data loss.
Tier 2: These are moderate issues in which the server may be offline from one hour to four
hours or about a half-day. Tier 2 problems may require the intervention of more than one
network administrator to troubleshoot, and they frequently affect the corporation’s end users
and possibly business partners, customers and suppliers in the event they are attempting to
access data on an affected corporate extranet.
Tier 3: This is the most severe type of incident. Tier 3 outages last longer than four hours
for network administrators and the company’s associated dependent users. Tier 3
outages almost always require a team of multiple network administrators to resolve. Data loss
or damage to systems and applications may or may not occur. Another real threat associated
with a protracted Tier 3 outage is potential lost business and potential damage to the
company’s reputation.
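The tier definitions above can be read as duration thresholds. The sketch below is illustrative only and is not part of ITIC’s methodology: the function names are hypothetical, and because the definitions leave outages of 30 to 60 minutes unassigned, the code assumes they count as Tier 2.

```python
from typing import Iterable

TIER1_LIMIT_MIN = 30    # Tier 1: resolved in less than 30 minutes
TIER2_LIMIT_MIN = 240   # Tier 2: up to four hours; Tier 3: longer than four hours


def classify_outage(duration_minutes: float) -> str:
    """Map an outage duration in minutes to a tier label."""
    if duration_minutes < 0:
        raise ValueError("duration cannot be negative")
    if duration_minutes < TIER1_LIMIT_MIN:
        return "Tier 1"
    # Assumption: the 30-60 minute gap in the definitions is treated as Tier 2.
    if duration_minutes <= TIER2_LIMIT_MIN:
        return "Tier 2"
    return "Tier 3"


def severe_share(durations: Iterable[float]) -> float:
    """Percentage of outages that are Tier 2 or Tier 3 (0.0 if there are none)."""
    tiers = [classify_outage(d) for d in durations]
    if not tiers:
        return 0.0
    severe = sum(1 for t in tiers if t != "Tier 1")
    return 100.0 * severe / len(tiers)


if __name__ == "__main__":
    sample = [10, 20, 15, 90, 25]  # made-up durations, not survey data
    print([classify_outage(d) for d in sample])
    print(f"Tier 2 + Tier 3 share: {severe_share(sample):.0f}%")
```

The second function mirrors the way the report expresses combined Tier 2 + Tier 3 outages as a percentage of all outages.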
The length and severity of each of these incidents correspond to specific line item capital
expenditure and operational expenditure costs for the business. Reliability, measured by
downtime, can positively or negatively impact TCO and accelerate or delay the time it takes to
realize ROI.
Improvements or declines in reliability also mitigate or increase technical and business risks to
the organization’s end users and external customers. The ability to meet service-level
agreements (SLAs) hinges on server reliability, uptime and manageability. These are key
indicators that enable organizations to determine which server operating system platform or
combination thereof is most suitable.
The survey data detailed the disparity in the number and severity of unplanned server outages
and the amount of time in minutes and hours that businesses experience on the various Linux,
Windows and UNIX platforms.
The survey closely examined both the actual quantitative reliability statistics as well as the
qualitative issues that positively or negatively impacted outage time. The ITIC survey queried
corporate IT managers and C-level executives on myriad reliability-related functions including:
The amount of downtime (minutes/hours) experienced per server, per annum
The amount of time spent patching each server
Whether the IT administrators apply updates via an automated group policy procedure or
manually apply the patches to individual servers
On average, individual corporate Linux, Windows and UNIX servers experience from zero to
approximately two failures per server per year. In a best case scenario, this results in 20 minutes
(IBM AIX running on p5 and p6 Power servers) to 4.3 hours (Debian open source) of
annual downtime for each server. Windows Server 2008 servers experienced a total of just under
three unplanned yearly Tier 1, Tier 2 and Tier 3 outages. However, the necessity of having to
take many of the Windows Servers offline to apply monthly patches and then do a system reboot
resulted in Windows Server 2008 machines being offline for just under two and a half hours each
year. Still, this is a 35% reduction from the 3.77 hours of downtime experienced by Windows
Server 2008 machines in last year’s ITIC reliability survey.
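The year-over-year figure can be checked with simple arithmetic. The following is a minimal sketch, assuming the 2.42-hour figure for 2009 quoted in the Executive Summary; the function name is illustrative and not taken from the survey.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, expressed as a percentage."""
    if before <= 0:
        raise ValueError("`before` must be positive")
    return (before - after) / before * 100


if __name__ == "__main__":
    hours_2008 = 3.77  # per server, per annum downtime reported for 2008
    hours_2009 = 2.42  # per server, per annum downtime reported for 2009
    print(f"Reduction: {percent_reduction(hours_2008, hours_2009):.1f}%")
```

The result is a little under 36%, which is in line with the roughly 35% reduction the report cites.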
Among the Linux distributions, Novell SUSE Enterprise exhibited consistent reliability
reminiscent of the late 1980s and 1990s when Novell NetWare was famous for running several
years – in some cases as long as nine years – without experiencing a failure or the need to reboot.
This can be attributed to the stability of the Novell distribution, the experience of the SUSE
engineers and the length of experience of many IT managers who came from the NetWare
environment. Novell also inked an interoperability and technical service and support agreement
with Microsoft two and a half years ago, which also served to improve reliability.
The open source Ubuntu distribution also scored some impressive reliability gains as it continues
to gain in popularity and deployments.
Overall, these survey responses provide crucial, comparative reliability metrics to enable
customers to make informed choices on which server hardware and server operating system or
combination thereof best suits their specific business and budget needs.
Survey Methodology
ITIC conducted the 2009 Global Server Hardware and Server OS Survey, an independent Web-based
survey that included multiple-choice questions and essay responses, from March through
July 2009. ITIC polled C-level executives and IT managers at 400 corporations worldwide.
ITIC analysts supplemented the Web survey by conducting two dozen first-person customer
interviews. ITIC conducted additional interviews with customers in October 2009 and updated
the Report with specific information on server downtime statistics. The anecdotal data obtained
from these interviews validates the survey responses and provides deeper insight into the
challenges confronting businesses in both the immediate and long term.
To deliver the most unbiased, accurate information, ITIC did not accept any vendor
sponsorship money for the online poll or the subsequent first-person interviews conducted in
connection with this project. ITIC employed authentication and tracking mechanisms to prevent
tampering and to prohibit multiple responses by the same parties.
Survey Demographics
Companies of all sizes and all vertical markets were represented in the survey. Respondents
came from companies ranging from small and medium businesses (SMBs) with fewer than 50
workers, to large enterprises with more than 100,000 employees.
Roughly 33% of the survey respondents came from the SMB segment with 1 to 100 employees;
12% of those polled were from midsize companies with 100 to 500 employees; 14% were drawn
from corporations employing 500 to 1,000 employees; and 41% of respondents worked in large
enterprises with 1,000 to more than 100,000 workers.
The survey was truly global. Approximately 85% of respondents came from North America.
The remaining 15% hailed from more than 20 countries in regions including Europe, Asia,
Australia, New Zealand, South America and Africa.
Data & Analysis
Server hardware and server operating system reliability has improved immeasurably in the last
five years.
When ITIC began conducting reliability research and surveys, our original definition of
unplanned downtime was an unexpected external or internal incident that caused the server
hardware and/or the server operating system software to spontaneously fail or freeze, thereby
disrupting network operations and requiring remediation efforts and a reboot. Depending on the
seriousness of the incident, the downtime may also have resulted in lost or damaged data.
However, it quickly became apparent from the anecdotal survey comments and during our first
person customer interviews, that IT managers and network administrators had a broader
definition of what constituted downtime.
As far as IT departments are concerned, anything that causes them to take the server offline,
regardless of the cause, is unplanned downtime. Included in this category are instances of
vendors releasing an unanticipated patch to fix a technical bug or security vulnerability. Such an
occurrence does not qualify as unplanned downtime in the narrowest definition of the term, but
network administrators oftentimes do not make that distinction. To them, downtime is downtime
because it disrupts their routine and may also impact daily operations because it means the IT
department must devote time to remedial issues that would have been spent performing other IT
chores. And in some network environments like Windows, it’s still necessary to take the servers
down, apply the patch and perform a hard reboot.
Time very literally equates to money. The economic downturn has forced companies to cut staff
and put network and software upgrades on hold; it has also decimated IT departments and
severely reduced the training and recertification for network administrators.
A recent ITIC survey that polled 250 corporations worldwide in October 2009 found that 47% of
businesses had budget cuts within the past 12 months. That number was even greater for
companies with over 500 end users; 64% of large enterprises experienced budget cuts.
Consequently, 84% of the respondents reported that their IT departments simply pick up the
slack and work longer and harder.
Downtime by the Numbers
In the early days of networks, corporate enterprises considered 99% uptime to be an adequate
reliability standard. Not so in 2009. An ITIC survey of 250 enterprises conducted in October
found that only 14% of survey respondents consider 99% uptime adequate for their most mission
critical, line of business (LOB) applications. Another 14% said that 99.9% or three nines met
their reliability needs. A two-thirds majority (66%) of those polled, however, said their
network environments require 99.95%, 99.999% or greater reliability for their most mission
critical LOB applications.
It’s easy to see why when you translate the uptime percentages into actual downtime:
99% = average unplanned downtime of one hour and 40 minutes per week
99.9% = average unplanned downtime of 45 minutes per month
99.95% = average unplanned downtime of 22 minutes per month
99.999% = average unplanned downtime of 5 1/2 minutes per year
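The conversions above follow from multiplying the downtime fraction by the length of the period. The sketch below is a minimal illustration rather than ITIC’s own calculation; it assumes a 365-day year and treats a month as one twelfth of a year, and the names are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year

# Period lengths in minutes. A "month" is taken as one twelfth of a year.
PERIOD_MINUTES = {
    "week": 7 * 24 * 60,
    "month": MINUTES_PER_YEAR / 12,
    "year": MINUTES_PER_YEAR,
}


def downtime_minutes(uptime_percent: float, period: str) -> float:
    """Return the downtime, in minutes, implied by an uptime percentage over a period."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (1 - uptime_percent / 100) * PERIOD_MINUTES[period]


if __name__ == "__main__":
    cases = [(99.0, "week"), (99.9, "month"), (99.95, "month"), (99.999, "year")]
    for pct, period in cases:
        print(f"{pct}% uptime -> {downtime_minutes(pct, period):.1f} minutes per {period}")
```

The computed values (about 100.8, 43.8, 21.9 and 5.3 minutes) agree with the rounded figures in the list above.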
Taken in this context, it’s easy to understand how the ongoing economic crisis has cast renewed
emphasis on server and server operating system reliability. Businesses of all sizes and across all
vertical markets are extremely risk averse. IT departments grapple daily with the reality of
keeping networks up and running in the face of cost cuts, layoffs and fewer resources.
Businesses and their IT departments are under
pressure to maximize server hardware and server operating system uptime in order to realize the
greatest economies of scale and ensure that their server hardware, server operating systems and
the crucial business applications and services that run on them are available to end users,
corporate clients, business partners and suppliers. A server outage of even a few minutes’
duration can disrupt network operations and result in lost data, steep monetary losses and
damage to a company’s reputation.
Reliability Then and Now
The first generations of server hardware and server operating system software platforms,
introduced in the mid-to-late 1980s, were proprietary. Network administrators typically became
experts in a particular vendor’s platform. The 1.0 versions of new hardware and software products
from 10 to 20 years ago were also rife with bugs. It typically took from six months to a year for
the vendors to work the kinks out and achieve an acceptable level of stability, and for IT managers to
gain sufficient expertise and knowledge to deliver higher levels of uptime. It is also worth
noting that two decades ago, businesses were not as wholly dependent on their networks as they
are today.
In the 1990s, 99% reliability was considered an acceptable industry standard. That is no longer
the case; 99% uptime is the equivalent of over 80 hours of annual per server downtime. ITIC’s
separate 2009 Global Application Availability Survey conducted in April found that eight out
of 10 of the 300 businesses polled said that their major business applications require higher
availability rates than they did two or three years ago. However, nearly three-quarters of
companies (72%) are unable to quantify the cost of downtime or the impact that unplanned
outages have on the business. Among the other 2009 Global Application Availability
survey findings:
Nearly two-thirds (61%) of organizations are unsure of how to estimate the impact of
downtime on the business or do not even attempt to track the losses associated with
application downtime and reliability.
Two out of five firms (41%) said they require conventional 99% to 99.9% application
availability; 29% said they needed 99.95% or 99.99% uptime; while 7% of respondents
indicated they need continuous availability of 99.999% or 99.9999%.
Just under half (49%) of companies lack the budget to purchase additional third party
software or hardware availability technology. This places more of an onus on the underlying
server hardware and server OS to deliver high reliability.
The responses from the ITIC 2009 Global Application Availability Survey underscore the crucial
importance of having highly reliable server hardware and server operating systems. If the servers,
server OS and related applications are unavailable for any reason, business and daily operations
grind to a halt – with sometimes catastrophic results.
The demand for server hardware, server OS and application availability has grown, particularly
with the emergence of new technologies like cloud computing and virtualization. Corporations
need to ensure that reliability keeps pace. To quantify the reliability statistics: 99.99% uptime
equates to approximately 53 minutes of per server, per annum downtime.
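Going in the other direction, a measured annual downtime figure can be turned into an uptime percentage. This is a minimal sketch under the assumption of a 365-day year; the function name is hypothetical, and the sample inputs are the per server, per annum downtime figures quoted in the Executive Summary.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; assumes a non-leap year


def availability_percent(annual_downtime_minutes: float) -> float:
    """Convert annual downtime in minutes to an uptime percentage."""
    if not 0 <= annual_downtime_minutes <= MINUTES_PER_YEAR:
        raise ValueError("downtime must be between 0 and one year of minutes")
    return (1 - annual_downtime_minutes / MINUTES_PER_YEAR) * 100


if __name__ == "__main__":
    # Downtime figures (minutes per server, per annum) from the Executive Summary.
    reported = {
        "IBM AIX on Power": 15,
        "HP UX on HP 9000": 36,
        "HP UX on Integrity": 39,
    }
    for platform, minutes in reported.items():
        print(f"{platform}: {minutes} min/year -> {availability_percent(minutes):.4f}% uptime")
```

Each of these platforms stays under roughly 52.6 minutes of annual downtime, which is the threshold for 99.99% uptime over a 365-day year.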
Today’s networks demand near perfect reliability; corporations deem any downtime
anathema to their business operations. This is particularly true for those companies in vertical
markets such as banking and finance, stock exchanges, insurance, healthcare and legal, whose
businesses are based on intensive data transactions. A server crash of even 15 to 30 minutes’
duration can cost a company from tens of thousands to tens of millions in lost business and
remediation efforts. Zero downtime, or as close to it as is humanly and technologically possible,
is the obvious goal and Holy Grail of reliability.
While system flaws will always be present in some fashion, the survey found that at present,
server hardware and server OS reliability is also inextricably linked with several other crucial
factors and components. They are:
Integration and interoperability is crucial. Over 85% of businesses with 300+ end
users have myriad types of server hardware and three different operating systems present
in their environment. Heterogeneity and openness are essential to the reliability of
today’s networks. The 2007 wide ranging, non-exclusive interoperability pact between
Microsoft and Novell was extremely well received and a huge boon for the respective
customer bases of both firms. As part of the deal, Microsoft and Novell teamed up to
provide joint sales, technical service and support to deliver plug and play interoperability
between the Windows and SUSE Linux Enterprise environments.
Workloads. The applications themselves are growing in size and complexity. It is
therefore imperative that the server hardware be robust enough to handle the increased
demands of new classes of applications such as streaming audio and digital and highly
complex processes. It is a fact that a robust server configuration that includes new multi-
core and multi-threading technologies, maximum memory, hard drive and the fastest
processors will perform better than old, outmoded and inadequate equipment. The survey
showed for example that the high reliability ratings for IBM and HP were no fluke: the
powerful IBM System p5 and System p6 Power Series servers and the HP 9000 and
Integrity Servers achieved very high reliability – 99.99% and 99.999% uptime – while
carrying workloads that were 30% to 40% greater than comparable x86-based machines.
Experience of the IT managers. Errors by inexperienced network
administrators and IT managers who have not been able to get trained and re-certified on
the latest technologies are another major factor that contributes to extended downtime and
adversely impacts system reliability.
Patch management. The amount of time spent applying patches is one of the biggest
contributors to system downtime; this is especially true of security patches, as we see in
Exhibit 2 below.
IBM AIX administrators spent the least amount of time (11 minutes) applying patches. They
were followed closely by the Ubuntu open source distribution, Apple Mac, niche market “other”
customized Linux distributions and Novell SUSE; administrators in each of these environments
spent on average from 12 to 15 minutes applying patches.
This speaks to the underlying stability of these environments as well as the experience of the
administrative staff. Typically, UNIX installations (notably IBM’s AIX), as well as Novell
SUSE Enterprise and Apple Mac, tend to be stable, static environments with experienced,
hands-on network administrators who are familiar with the most minute details of the bits and
bytes of their systems. Fast patch management positively impacts reliability.
The feedback from the survey respondents reinforced the importance of being able to receive and
download patches quickly once a bug has been identified. Corporate IT managers noted the
significant strides that had been made by all of the vendors across the board in recent years,
though they still voiced some concerns. Among the anecdotal comments:
• “IBM has done a wonderful job of keeping our AIX systems up and ready. We rarely if
ever have reliability issues,” said an IT manager at a Midwest financial institution.
• “Patch management automation has significantly reduced both the manpower required to
apply patches and the downtime associated with patch management over the last three
years,” noted an IT administrator at a large health care facility in the Northeast.
• “Novell SUSE Linux Enterprise is always very up-to-date on patches; Zenworks is nice
and we never have a problem,” said a longtime Novell user at a large healthcare provider
in the Southwest.
• “The amount of time it takes to identify vulnerability and when the vendors release the
patch, has decreased significantly, but if the bug is a dangerous one, we still worry,”
according to a chief technology officer at a midsized retailer.
• “Our patches are tested at our corporate headquarters location and then distributed as
needed to the various remote locations, downloaded to a local Microsoft Systems
Management Server (SMS) and automatically downloaded via group policy to each
workstation and server. The process is accelerated and it’s relatively painless for the IT
department,” said an administrator at a large West Coast enterprise.
• “Our patch management dramatically improved with SUSE 10.2 and SUSE 11,” noted
another veteran Novell administrator. “We have no problems now to speak of.”
• “We currently use Group Policy to download patches on each server, but we manually
apply them. So it takes us about 15 minutes to patch each Windows server. This means
that each server takes less than 15 min to patch. On a whole, other than hardware issues,
we've averaged less than two failures per server, per year on our Windows Server 2003
systems,” said an IT manager at a large East Coast insurance firm.
Serious Tier 2 + Tier 3 Incidents Decline
The survey results also showed a discernible decline in the number and percentage of the more
serious Tier 2, Tier 3 and combined Tier 2 + Tier 3 incidents, according to Exhibit 3 below.
Once again, IBM AIX on the Power Series System p5 and p6 servers recorded the smallest percentage
of combined Tier 2 + Tier 3 incidents at 19%. The other UNIX and Linux distributions, including
HP-UX 11i v3 on the HP 9000 and HP Integrity, Novell SUSE Linux Enterprise and Sun
Solaris, also scored well, with the more serious aggregate Tier 2 + Tier 3 outages accounting for
24% to 25% of total outages. All of the aforementioned distributions lowered their
scores from the comparable 2008 survey.
Microsoft’s Windows Server 2003 on x86-based servers came in with a very respectable 30% of
reliability outages in the Tier 2 + Tier 3 categories; this was a reduction of 11 percentage points from the
41% reported by respondents to the 2008 ITIC Global Reliability Survey.
One of the most impressive statistics was that IBM AIX Power Series System p5 and System p6
servers notched no severe Tier 3 incidents whatsoever. Again, this achievement is even more
impressive when one considers that these systems typically run higher workloads than their x86-
based counterparts as shown in Exhibit 4.
HP-UX 11i v3 Update 4 on the HP 9000 and Integrity servers, Sun Solaris on SPARC
servers (now owned by Oracle), Novell SUSE, Red Hat Enterprise Linux and Apple Mac OS X
10.5.6 on the G4 Macs also recorded very few Tier 3 outages – fewer than one each, per server,
per annum.
The most common Tier 1 incidents, which usually last between 10 and 30 minutes, also
showed across-the-board reductions among all server hardware and server operating system
platforms, as we see in Exhibit 5.
In the Tier 1 category, IBM also came out on top with less than one-half of one Tier 1 incident
per AIX Power Series System p5 and System p6 server, per annum. This equates to about four to seven
minutes of downtime per server, per year.
In fact, every one of the server hardware and server OS environments racked up fewer than one Tier
1 outage per server, per annum.
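As a rough illustration of how an incident rate translates into downtime, annual downtime per server is the number of incidents multiplied by their average length. The Python sketch below is not part of the survey methodology; the function name and the 8- and 14-minute average durations are assumptions, chosen only because they bracket the four-to-seven-minute range quoted above for one-half of one incident per year.

```python
def annual_downtime_minutes(incidents_per_year: float,
                            avg_minutes_per_incident: float) -> float:
    """Expected unplanned downtime per server, per year, in minutes."""
    return incidents_per_year * avg_minutes_per_incident


# 0.5 incidents per server, per year is the upper bound reported for IBM AIX.
# The average durations are assumed values, not survey data.
for avg_minutes in (8, 14):
    downtime = annual_downtime_minutes(0.5, avg_minutes)
    print(f"{avg_minutes}-minute incidents: {downtime} minutes per year")
```

The same function can be applied to any platform's incident rate once an organization has measured its own average outage length.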
The results were similarly encouraging for the average number of Tier 2 outages as we see in
Exhibit 6 below.
Conclusions
In summary, the ITIC 2009 Global Server Hardware and Server OS Reliability Survey findings
indicate that all of the server operating system platforms have achieved a high degree of
reliability. However, the UNIX distributions, led by IBM AIX running on the p5 and p6 Power
servers, are the clear winners, followed closely by HP, Novell SUSE Enterprise Linux and the
Ubuntu open source distribution.
These results are especially noteworthy in light of the ongoing economic crunch, which has
caused companies to cut their budgets and reduce IT staff. As they strive to accomplish more
with fewer resources, IT departments must rely even more heavily on their vendors to deliver
more reliable servers and server operating system software.
To reiterate, time is literally money. Even a few minutes of downtime can cost companies
thousands or millions of dollars and cause business operations to grind to a halt. Downtime can
also adversely affect a company’s relationship with its customers, business suppliers, partners
and internal end users. Reliability or lack thereof can potentially damage a company’s reputation
and result in lost business.
Hence, corporations must have confidence in the reliability and stability of the underlying server
hardware and server OS platforms.
The advances in technology are encouraging. Now companies must tackle other equally
important and challenging issues to ensure the highest level of uptime and reliability. Close
attention must be paid to integration and interoperability, patch management, documentation and
getting the necessary training and certification for the appropriate IT managers. The most
bulletproof hardware and software platforms can be undone by human error. It’s equally
important that companies find the funds to stay as current as possible on their server hardware
and server OS software. Performance will suffer if the server configuration is old and
inadequate.
Recommendations
Server hardware and server operating system reliability has improved vastly since the 1980s
and 1990s, and even in just the last two to three years. While technical bugs still exist, their number,
frequency and severity have declined significantly.
With few exceptions, common human error poses a bigger threat to server hardware and server
operating system reliability than technical glitches.
Crucial TCO metrics such as reliability, performance, security and management ultimately
depend as much on each firm’s specific implementation as they do on the properties of the
server and server OS technology itself. There are inherent dependencies between the underlying
capabilities of a particular server operating system and an individual corporation’s ability to
adhere to best deployment practices with respect to training, testing and configuration. The
reliability, security and manageability of even the most hardened server and server operating
system are easily compromised by human error.
A company that does not restrict physical access to the server is asking for trouble. Similarly,
any firm that does not enact and enforce strong usage and security policies risks
compromising the reliability and integrity of its server hardware and server OS environment. The
reliability of the server environment can also be undone easily or seriously compromised by such
actions as: a bad configuration; the use of incompatible or unapproved memory and logic chips,
hardware, peripherals and software drivers; overclocking machines; failing to apply necessary
patches; failing to upgrade or retrofit inadequate or obsolete servers and operating systems and
taxing server and software resources beyond their capabilities.
Recommendations for Corporate Customers
To optimize uptime and reliability, ITIC advises corporations to:
Regularly analyze and review configurations, usage and performance levels. This
will enable companies to determine whether or not their current server and server OS
environment allows them to achieve optimal reliability.
Adopt formal SLAs. Service level agreements enable organizations to define acceptable
performance metrics. Companies should meet with their vendors and customers on at
least an annual basis to ensure the terms are met.
Define, measure and monitor reliability and performance metrics. It is imperative that
companies measure component, system, server hardware, server OS, desktop,
security, network infrastructure, storage and application performance. Keep a
continuous log of planned and unplanned downtime throughout the
enterprise.
Regularly track server and server OS reliability and downtime. Keep accurate
records of outages and their causes. Segment the outages according to their severity and
length – e.g. Tier 1, Tier 2 and Tier 3. The appropriate IT managers should also keep
detailed logs of remediation efforts in the event of the outage. These logs should include
a full account of remediation activities, specifying how the problem was solved, how
long it took and what staff members participated in the event. It should also list the
monetary costs as well as any material impact on the business, its operations and its end
users. This will prove an invaluable resource should the problem recur. It may also make the
difference in containing or curtailing the reliability-related incident, saving precious time
for the IT department, the end users and corporate customers.
Calculate the cost of unplanned downtime. Companies should determine the average
cost of minor Tier 1 outages. They should also keep more detailed cost assessments of the
more serious unplanned Tier 2 and Tier 3 incidents. It’s essential for businesses to know
the monetary amount of each outage – including IT and end user salaries due to
troubleshooting and any lost productivity – as well as the impact on the business. C-level
executives and IT managers should also pay close attention to whether or not the
company’s reputation suffered as a result of a reliability incident; did any litigation
ensue; were customers, business partners and suppliers impacted (and at what cost) and at
least try and gauge whether or not the company lost business or potential business.
Ensure that your organization has robust server hardware that can adequately
handle the OS and application workloads. The server hardware (standalone, blade,
cluster, etc.) and the server operating system are inextricably linked. To achieve optimal
performance from both components, corporations must ensure that the server hardware is
robust enough to carry both the current and anticipated workloads for the lifecycle of
both.
Compile a list of best practices and adhere to them. This is absolutely essential. Chief
technology officers (CTOs), software developers, engineers, network administrators and
managers should have extensive familiarity with the products they currently use and are
considering. Check and adhere to your vendors’ list of approved, compatible hardware,
software and applications. Software developers and network administrators must obey the
rules. That means avoiding such ill-advised practices as overclocking server
and desktop hardware or allowing unskilled or neophyte administrators to make changes to
the registry. All of these actions can lead to serious reliability problems.
Don’t skimp on training and recertification for IT administrators, software
developers and engineers. In these days of budget cuts, it’s common practice to
eliminate monies that were formerly earmarked for training. ITIC understands that money
is tight. If you can’t afford the time or expense to re-certify your entire IT department,
designate the most experienced or appropriate IT staffer to take the course – even if it’s
only an online course – and allow that person to train additional appropriate managers.
Perform regular asset management testing. Schedule asset management reviews on a
yearly, bi-annual or quarterly basis, as needed. This will assist your company in remaining
current on hardware and software and help you to adhere to the terms and conditions of
licensing contracts. All of these issues influence network reliability. It also allows
organizations to be better equipped to meet their SLA requirements and maintain peak
performance and reliability.
Manual vs. Automated Group Policy Patch Management. IT managers, particularly in
high end UNIX environments and in corporations whose environments feature a high degree
of customization, will continue to perform manual patch management.
Keep your software updated with the latest necessary patches and upgrades. You don’t
have to apply every patch, but it’s wise to keep track of which patches are crucial to the
network’s health. Construct and adhere to a regular schedule to apply patches, preferably on a
monthly basis. This will help the company avoid potentially nasty surprises.
Standardize legacy and future hardware, server OS and application environments
as much as possible. ITIC survey data indicates that standardization – that is, following a
prescribed configuration and version for the company’s hardware, software and network
infrastructure components – can lower TCO by 15%. Standardization benefits all
users, including organizations that have custom configurations.
Note that custom software implementations require the highest level of expertise. Any
firm that elects to customize its Linux or open source server operating system distribution
should either employ guru-level administrators or contract with a systems integrator or
outsourcer with the appropriate expertise.
Automated patch management applied via Group Policy vs. manual patching. Companies should also regularly review whether it is feasible for the firm to migrate away
from manual patch management. Collecting this information may seem to be a chore at first,
but it will be an invaluable source of information that can guide the company to lower its
TCO and improve the rate of its ROI.
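The outage-tracking and downtime-cost recommendations above could be put into practice with a simple structured log. The Python sketch below is one possible layout, not an ITIC tool: the `Outage` record, its field names, the hourly cost figure and the sample entries are all hypothetical. The 30-minute Tier 1 ceiling echoes the typical Tier 1 duration cited earlier in this report, while the 240-minute Tier 2 ceiling is purely an assumption that each organization should replace with its own definition.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Assumed tier ceilings in minutes; substitute your organization's definitions.
TIER_1_MAX_MINUTES = 30
TIER_2_MAX_MINUTES = 240


@dataclass
class Outage:
    """One unplanned outage and how it was resolved (hypothetical record layout)."""
    server: str
    minutes: float
    cause: str
    remediation: str
    staff: List[str] = field(default_factory=list)
    hourly_cost: float = 0.0  # assumed blended cost of salaries and lost productivity

    @property
    def tier(self) -> int:
        """Classify the outage by length using the assumed ceilings above."""
        if self.minutes <= TIER_1_MAX_MINUTES:
            return 1
        if self.minutes <= TIER_2_MAX_MINUTES:
            return 2
        return 3

    @property
    def cost(self) -> float:
        """Monetary cost of the outage at the assumed hourly rate."""
        return self.minutes * self.hourly_cost / 60


def summarize(outages: List[Outage]) -> Dict[int, Tuple[int, float, float]]:
    """Return {tier: (incident count, total minutes, total cost)}."""
    totals = defaultdict(lambda: [0, 0.0, 0.0])
    for outage in outages:
        entry = totals[outage.tier]
        entry[0] += 1
        entry[1] += outage.minutes
        entry[2] += outage.cost
    return {tier: tuple(values) for tier, values in sorted(totals.items())}


# Sample entries are invented for illustration only.
log = [
    Outage("app01", 12, "failed patch", "rolled back patch", ["admin A"], 600.0),
    Outage("db01", 90, "bad configuration", "restored known-good configuration",
           ["admin A", "admin B"], 600.0),
]
print(summarize(log))
```

Keeping the cause, remediation and staff fields alongside the duration means the same log serves both purposes the recommendations describe: a per-tier tally for SLA reviews and a reference for troubleshooting if a problem recurs.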
Recommendations for Vendors
It is a buyer’s market and is likely to remain so for the foreseeable future. Competition among
vendors is intense because businesses have a wide array of server hardware and server operating
system platforms from which to choose. In order to retain the current customer base and attract
new corporate customers, all of the vendors must strive to improve the features, performance,
reliability and security of their respective server hardware and server OS software. Additionally,
ITIC advises vendors to:
Embrace Interoperability and Integration. The survey data indicates that backwards
compatibility and integration with other hardware, server OS, applications and third party
tools and utilities pose a significant potential threat to the underlying stability of the network
environment.
Provide Explicit Guidance around Patches and Patch Management. Patches vary
according to the importance, severity of the fix or update and by the number of patches in a
formal release as well. Data ITIC obtained from anecdotal essay comments and first person
customer interviews underscore the need for vendors to issue patches in an efficient,
expeditious manner and to provide full transparency on the nature and severity of all bugs.
Many IT managers expressed frustration and confusion with the patch management process,
which was sometimes cumbersome. IT managers also noted that oftentimes they were unsure
of which patches were crucial versus optional. ITIC advises vendors to deliver specific
recommendations and instructions on the download process, since patch management is a
crucial element of IT management that can positively or negatively impact reliability.
Provide the latest technical documentation. Ready access to clear, concise technical
guidelines and detailed documentation has never been more important. The economic
downturn forced many companies to cut staff. Time and money are scarce or non-existent for
training and re-certification of IT administrators. It is therefore crucial that vendors pick up
the slack and publicize and disseminate technical “how to” guidelines via their respective
Websites, Emails and Webinars.
Vendors should also actively work with third party ISVs to assist in resolving driver
and application compatibility issues. As we noted above, integration and interoperability
issues are a top priority for IT departments who wish to maintain a high level of reliability.
While many of the largest third party ISVs do an exemplary job of ensuring that their
applications and drivers are certified to work with new server hardware and server OS
releases, many smaller and niche ISVs – particularly in specific verticals like finance, legal
and healthcare – in many instances lack the necessary resources and funds to support new
releases. Vendors should poll their customers on which third party applications, drivers and
utilities are crucial and when necessary assist ISVs in providing the necessary compatibility.
Work with partners to provide expanded access to discounted certification and online
training courses. One of the biggest challenges confronting IT departments today is finding
the money and sparing the time to get the appropriate administrators re-trained and certified
on the latest server hardware and server OS software.