
1999-2000 Operating System Function Review, March 2000

Copyright 2000 D.H. Brown Associates, Inc.

support at least 1 TB. Tru64 UNIX 5.0 has strong 64-bit capabilities and good scalability clustering options, along with a variety of miscellaneous performance optimizations that are useful for particular classes of applications. However, until Compaq's long-awaited future eight-, 16-, and 32-node high-end GS-series systems arrive in 1H00, Tru64 UNIX's SMP range significantly trails its competitors at 14 processors.

RELIABILITY, AVAILABILITY, SERVICEABILITY (RAS) RESULTS

Tru64 UNIX 5.0 shares the lead with IRIX 6.5 for RAS functions. Tru64 UNIX offers unmatched storage reliability features and leading HA clustering functions, thanks in part to its clustering file system, which is unique among all studied products. IRIX offers particularly strong resiliency functions, along with competitive HA clustering options. Solaris 7 follows, having the strongest resiliency functions due to its unmatched Dynamic Reconfiguration and Alternate Pathing capabilities. Solaris 7 offers only average HA clustering capabilities, however. HP-UX 11.0 and AIX 4.3.3 have roughly equivalent RAS capabilities. HP offers stronger resiliency functions and the best overall serviceability functions, but IBM offers very strong HA clustering functions.

    SYSTEM MANAGEMENT RESULTS

Tru64 UNIX 5.0 also ranks first for system management, thanks to very strong operating-system management functions, which now match the strength of AIX, the long-time leader in this area. Tru64 provides very strong heterogeneous management capabilities, bundling the ability to host Windows NT network-authentication functions. AIX 4.3.3 follows closely with the strongest hardware management, supporting plug-and-play configuration of RS/6000 hardware and peripherals and strong remote manageability based on its web-based system manager. HP-UX 11.0 achieves average ratings across most functional areas, but still trails in heterogeneous management and interoperability functions. Solaris 7 stands out for its leading resource-management capabilities, but has yet to catch up with the leaders for operating-system management functions. IRIX 6.5 leads in storage management and has strong resource-management tools, but has relatively weak operating-system management and remote-manageability functions.

INTERNET AND WEB APPLICATION SERVICES RESULTS

AIX 4.3.3 retains the lead for Internet and web application services, benefiting from the strongest support for TCP/IP protocols and extensions; the strongest Internet file, mail, and web services; and the richest set of e-commerce options. HP-UX 11.0 follows; it also offers a strong TCP/IP implementation, coupled with very good e-commerce options and Internet file, mail, and web services. Tru64 UNIX 5.0 has a very strong JVM implementation and unique support for Microsoft's DCOM distributed object protocol, but otherwise has average


capabilities. Solaris 7 places fourth overall, a surprising position for a pioneering Internet company that has contributed so much technology to the industry. In a rapidly growing arena where every player wants to be first, Sun's choice to bundle its own web server with Solaris 7 rather than iPlanet or Apache; a modest set of e-commerce offerings; and the lack of many TCP/IP extensions hamper Sun's functional leadership. IRIX 6.5 includes a good set of bundled Internet file, mail, and web services, but it trails in most other areas.

    DIRECTORY AND SECURITY SERVICES RESULTS

Tru64 UNIX 5.0 leads in directory and security services, bundling the strongest set of directory services and sharing the top spot for secure networking functions. AIX 4.3.3 follows closely, sharing the lead for Virtual Private Network (VPN) functions while also providing very competitive directory services. HP-UX shares the lead for secure networking and VPN functions, but provides only average directory services. Solaris 7 has competitive directory services, but average capabilities in remaining areas. IRIX 6.5 trails in all areas.

  • 8/10/2019 osfr

    4/73

    1999-2000 Operating System Function ReviewSS, March 2000

    4 Copyright 2000 D.H. Brown Associates, Inc.

    TABLE OF CONTENTS

EXECUTIVE SUMMARY ..... 1
  OVERALL RESULTS ..... 1
  SCALABILITY RESULTS ..... 1
  RELIABILITY, AVAILABILITY, SERVICEABILITY (RAS) RESULTS ..... 2
  SYSTEM MANAGEMENT RESULTS ..... 2
  INTERNET AND WEB APPLICATION SERVICES RESULTS ..... 2
  DIRECTORY AND SECURITY SERVICES RESULTS ..... 3

METHODOLOGY ..... 6
  NOTES ON THE 1999-2000 EDITION ..... 7
  MICROSOFT WINDOWS NT/2000 ..... 8
  SUN SOLARIS 8 ..... 8

SCALABILITY ..... 9
  SUMMARY ..... 9
  SCALABILITY CRITERIA ..... 9
    64-BIT SUPPORT ..... 10
    SMP/NUMA SCALABILITY ..... 10
    SMP BENCHMARK EVIDENCE ..... 11
    MAXIMUM SMP CONFIGURATION SIZE ..... 13
    SMP LINEARITY ..... 13
    PERFORMANCE CLUSTERING ..... 13
    TECHNICAL COMPUTING CLUSTERS ..... 14
    DATABASE CLUSTERING ..... 14
    PACKAGED WEB SERVER FARMS ..... 15
    MISCELLANEOUS PERFORMANCE OPTIMIZATIONS ..... 16
  AIX 4.3.3 ..... 16
  HP-UX 11.0 ..... 18
  IRIX 6.5 ..... 19
  SOLARIS 7 ..... 21
  TRU64 UNIX 5.0 ..... 22

RELIABILITY, AVAILABILITY AND SERVICEABILITY (RAS) ..... 24
  SUMMARY ..... 24
  RAS CRITERIA ..... 24
    RESILIENCY FUNCTIONS ..... 24
    HIGH-AVAILABILITY CLUSTERING FUNCTIONS ..... 25
    STORAGE RELIABILITY AND SCALABILITY ..... 27
    SERVICEABILITY ENHANCEMENTS ..... 27
  AIX 4.3.3 ..... 28
  HP-UX 11.0 ..... 29
  IRIX 6.5 ..... 29
  SOLARIS 7 ..... 31
  TRU64 UNIX 5.0 ..... 32


SYSTEM MANAGEMENT ..... 35
  SUMMARY ..... 35
  SYSTEM MANAGEMENT CRITERIA ..... 35
    OPERATING-SYSTEM MANAGEMENT ..... 36
    EVENT MANAGEMENT ..... 37
    HARDWARE STATE MANAGEMENT ..... 38
    STORAGE PERIPHERAL MANAGEMENT ..... 38
    REMOTE MANAGEABILITY ..... 39
    RESOURCE MANAGEMENT ..... 40
    HETEROGENEOUS MANAGEMENT AND INTEROPERABILITY ..... 41
  AIX 4.3.3 ..... 43
  HP-UX 11.0 ..... 45
  IRIX 6.5 ..... 47
  SOLARIS 7 ..... 49
  TRU64 UNIX 5.0 ..... 51

INTERNET AND WEB APPLICATION SERVICES ..... 55
  SUMMARY ..... 55
  INTERNET AND WEB APPLICATION CRITERIA ..... 55
    TCP/IP FEATURES ..... 56
    WEB APPLICATION SERVICES ..... 58
    E-COMMERCE TOOLS ..... 58
    BUNDLED FILE, MAIL, AND WEB SERVERS ..... 60
  AIX 4.3.3 ..... 61
  HP-UX 11.0 ..... 63
  IRIX 6.5 ..... 64
  SOLARIS 7 ..... 65
  TRU64 UNIX 5.0 ..... 66

DIRECTORY AND SECURITY SERVICES ..... 68
  SUMMARY ..... 68
  DIRECTORY SERVICES CRITERIA ..... 68
  SECURITY INFRASTRUCTURE CRITERIA ..... 69
  VIRTUAL PRIVATE NETWORKING (VPN) CRITERIA ..... 70
  AIX 4.3.3 ..... 71
  HP-UX 11.0 ..... 71
  IRIX 6.5 ..... 72
  SOLARIS 7 ..... 72
  TRU64 UNIX 5.0 ..... 73


METHODOLOGY

In this study, D.H. Brown Associates, Inc. (DHBA) evaluates five leading UNIX operating systems (IBM AIX 4.3.3, Hewlett-Packard HP-UX 11.0, SGI IRIX 6.5, Sun Solaris 7, and Compaq Tru64 UNIX 5.0) based on their functional capabilities as of January 1, 2000. In this edition of the study, each operating system receives a rating for its support of over 100 functional items across five areas:

    scalability,

    RAS,

    system management,

    Internet and web application services, and

    directory and security services.

This study primarily notes items for their existence or non-existence on a given platform, although it judges some according to the quality and breadth of their implementation. Vendors receive maximum credit only for functions they bundle and integrate in their operating systems. They take a penalty if the function requires a separately priced option, and suffer a greater penalty if the function is not available directly from the operating system's supplier (i.e., if it requires involvement of a third-party supplier). They receive a maximum penalty if a function is unavailable for the platform or if it can be implemented only through an awkward workaround.

The individual ratings sum to a score for each of the five functional categories, based on weights indicated at the beginning of each chapter. The overall ranking results from the average of all category rankings, with each of the major functional areas given an equal weight toward the total.

To determine its ratings for the studied functional items, DHBA evaluated each operating system and its layered products using a variety of approaches, including:

    hands-on evaluation,

    examination of system documentation and related publications, and

discussions with marketing and engineering staff from each operating-system vendor.

DHBA must emphasize that this report represents a technology assessment, which exposes findings that remain distinct from other types of research, such as market-share statistics, customer-satisfaction surveys, or laboratory-based stress testing. One cannot extrapolate the results of this assessment to make conclusions in other domains. The industry has frequently shown that the best technology does not always win in the marketplace.


To arrive at a complete profile of an operating-system product, users should consider a number of factors in addition to those addressed by this study, including:

Application portfolio: An operating system is only as useful as the applications available for it. The suitability of an application portfolio for a given user, though, ultimately depends on that user's specific requirements.

Quality: As with any other complex technical product, an operating system may ship with a number of defects that are independent of its relative technical richness. Formal methods to measure quality vary; two alternatives are stress testing and collecting empirical data based on customer-satisfaction surveys.

Vendor support: At the high end of software complexity, operating systems introduce a notoriously high support burden, especially when deployed on servers. The ability of vendors to meet those support requirements may vary.

Vendor experience: Vendors offering multiple operating systems may have different levels of experience within their respective product lines, depending on when they entered the market and with what level of commitment.

Skills availability: This factor applies both to the skills available within a user's organization and in the market as a whole.

Hardware/system capabilities: Since an operating system will only perform as well as its underlying hardware, users must remain aware of factors such as processor performance and the SMP ranges available on host platforms.

Cost: A complex and contentious area, this factor depends not only on the prices of operating-system software and associated client license fees, but also on any necessary add-on packages, the price and price/performance of underlying hardware, and a wide variety of hard-to-measure "soft costs" related to ongoing management and training.

    NOTES ON THE 1999-2000 EDITION

DHBA revised the latest version of its scorecard to reflect new areas of technology differentiation among vendors and shifts in enterprise-level computing priorities. While the scalability, RAS, and system-management categories remain unchanged, the other top-level categories changed as follows:

The features evaluated in the PC client support category were incorporated into the system-management category, as baseline file- and print-sharing capability for PCs has become commoditized. Relevant differentiation now relates to operating systems' ability to integrate PC and UNIX management functions in terms of heterogeneous network resources.

The features evaluated in the distributed enterprise services category were split, with some functions moving to the Internet and web applications category (formerly Internet/intranet) and the remainder going into a new category that evaluates directory and security services. These changes reflect the industry's growing orientation around web-based infrastructures for network and application architectures.


    MICROSOFT WINDOWS NT/2000

While the previous version of this report rated Microsoft's Windows NT 4.0 product, this edition does not include Windows NT. The release of Windows 2000, which occurred after the research deadline for this report, introduces significant architectural changes, including a major kernel upgrade and a new approach to network services (the Active Directory Service). These changes will potentially require that DHBA modify its scorecard line items to take the Windows 2000 development into account before reassessing the entire operating-system area. This work is in progress, and DHBA expects to publish a new report reflecting the changed landscape later this year.

    SUN SOLARIS 8

This edition of the report evaluates Solaris 7, rather than the recently introduced Solaris 8. Sun shipped Solaris 8 after the research deadline, and the company has staggered the release of some Solaris 8 functions, so that the entire solution set is not currently available. DHBA will not formally evaluate Solaris 8 until a number of critical features ship. In addition, many enhancements Sun touts for Solaris 8 were in fact previously shipped for Solaris 7 in the form of patches and add-ons. These enhancements are included in this report.


    SCALABILITY

[Figure 2: bar chart of scalability functional ratings for Tru64 UNIX 5.0, AIX 4.3.3, HP-UX 11.0, IRIX 6.5, and Solaris 7, on a scale from Fair (5.00) to Excellent (9.00).]

    SUMMARY

Solaris 7 captures the lead for overall scalability, supporting a very broad SMP range, offering strong 64-bit capabilities, and holding competitive ratings in other areas. IRIX 6.5 follows closely; it also offers a very broad SMP range, but lacks performance evidence for its database clustering options. HP-UX 11.0 has the best 64-bit capabilities, thanks to the extraordinary memory ranges supported on HP's SCA hardware, and very competitive scalability clustering options. AIX 4.3.3 provides leading performance clustering capabilities on IBM's SP hardware, but its SMP range remains average. Further, AIX's maximum file size, 64 GB, falls significantly behind competitors, most of whom support at least 1 TB. Tru64 UNIX 5.0 has strong 64-bit capabilities and good scalability clustering options, along with a variety of miscellaneous performance optimizations that are useful for particular classes of applications. However, until Compaq's long-awaited future eight-, 16-, and 32-node high-end GS-series systems arrive in 1H00, Tru64 UNIX's SMP range significantly trails its competitors at 14 processors.

    SCALABILITY CRITERIA

Three basic functional areas determine the scalability of a system in an enterprise environment:

64-bit support: the ability to exploit processing, memory, and storage beyond the 4 GB limitation imposed by 32-bit systems. Several levels of 64-bit capabilities exist, including 64-bit processor support, large file systems, large files, large physical memories, and large process address spaces (where "large" means greater than 4 GB).

Shared-memory multiprocessing (SMP) support: the ability to take advantage of multiple processors in a server. Criteria include kernel locking granularity, kernel thread mechanisms, and evidence of scalability based on industry-standard benchmarks.

FIGURE 2: Scalability Functional Ratings


Performance clustering options: the ability to grow system capacity, including performance and storage, by lashing together multiple servers using high-speed interconnects. Typically, a system's ability to handle technical applications and commercial applications (e.g., database or web) classifies its performance clustering capabilities.

    64-BIT SUPPORT

64-bit support typically pays off the most for applications that use large databases. 64-bit systems can cache complete database indexes (or the database contents themselves) in physical memory, offering a roughly 10x improvement in access time over disk. Performance improvements in real-world situations with real workloads prove substantially more modest; TPC-C results for various 64-bit vendors come in closer to a factor of 10%-2x, for example.

In general, operating systems can support 64-bit capabilities at four incremental levels:

64-bit processor support: can run on 64-bit processors such as Alpha, MIPS, PA-RISC, PowerPC, and UltraSPARC. All current Intel X86 processors use 32-bit instruction sets, although Pentium Pro and Pentium II Xeon support 36-bit physical memory addressing (i.e., a maximum of 64 GB RAM).

Large storage support: can support file systems and files greater than 4 GB. Large file systems must use large RAID configurations, which may range as high as 2-4 TB. Support for large files requires the availability of API functions that allow applications to access 64-bit ranges.

Large physical memory support: can take advantage of physical memory greater than 4 GB. While this capability proves most useful when coupled with 64-bit virtual memory (see below), applications can exploit the larger memory configurations even in systems that otherwise support only 32 bits. In particular, administrators can configure database systems to use the extra physical memory for caching purposes, boosting performance.

Large virtual memory support: the ability for applications to run in a 64-bit process address space. Only operating systems with this capability qualify as fully 64-bit enabled.

    SMP/NUMA SCALABILITY

The ability of an operating system to exploit SMP systems continues to represent a critical differentiator in server environments. Relevant factors include:

The degree to which the kernel has been optimized to exploit multiple processors, which influences the absolute range of processors that it can effectively support. This ranges from two processors to more than 100 in advanced NUMA architectures.

The availability of mechanisms to support SMP-optimized applications, such as threads.


The availability of performance evidence from industry-standard benchmarks for high-end systems, such as TPC-C and TPC-D, which stress I/O as well as computation.

    SMP BENCHMARK EVIDENCE

When discussing the quality of SMP implementations, quibbling over the details of kernel architecture becomes relatively meaningless beyond a certain point. Developers can plan only a finite degree of SMP scalability into the design; the final analysis hinges on industry-standard benchmark performance.

Traditional uniprocessor performance metrics such as SPECint95 and SPECfp95 cannot measure SMP system performance because they submit tasks to the system serially rather than concurrently. Instead, SMP performance must be assessed using benchmarks that stress running jobs in parallel. These tests fall into two basic classes: technical and commercial. Technical users can rely on benchmarks such as the NAS Parallel Series to assess parallelized engineering-related application performance on SMP systems, particularly those related to computational fluid dynamics and finite element analysis. Users can project performance from NAS test results to the extent that other technical applications rely on similar algorithmic techniques.

Commercial workloads present a greater challenge to SMP implementations, because they tend to exhibit a high degree of communication and synchronization overhead among processors. Most commercial applications are essentially database applications that stress I/O, cache management, and communication. As the number of processors increases, these operations place increasing demands on scarce resources such as memory/bus bandwidth and I/O bandwidth, rigorously exercising kernel-locking mechanisms. Commercially-oriented benchmarks thus provide the most credible assessment of the quality of an SMP implementation.

Many types of benchmarks claim to measure realistic commercial server performance. Proprietary benchmarks tend to have a narrow focus, favoring particular architectures, such as PC fileservers or terminal-centric hosts, or particular products, such as SAP. To overcome any potential biases in the measured workloads, vendors tend to rely on a number of industry-standard benchmarks to demonstrate SMP scalability for commercial applications. Multi-vendor committees define these benchmarks and require that results be published under strict guidelines, including detailed auditing procedures. Some of the most rigorous and widely accepted tests relevant to SMP systems include:

SPECint_rate95: a variation of the SPECint95 test commonly used to measure raw processor performance. The SPEC (Standard Performance Evaluation Corporation) committee manages several SPEC benchmarks. The SPECint_rate95 benchmark measures the capacity of a computer to execute multiple CPU-intensive processes concurrently. It derives from the same set of applications used in the traditional SPECint95 test, but runs multiple copies of this application set in parallel. However, because SPECint_rate95


tests involve relatively little I/O or interprocess communication, it is fairly forgiving of inefficient SMP kernel locking and thus serves only as a baseline measure of commercial SMP capabilities.

SPECWeb96: measures web server performance. The SPECWeb96 benchmark is designed to provide comparable measures of how well systems can handle HTTP GET requests. The SPEC committee based the workload of this test on analysis of server logs from websites ranging from a small personal server up through some of the Internet's most popular sites. Built on the framework of the SPEC SFS/LADDIS benchmark, SPECWeb96 can coordinate the driving of HTTP protocol requests from single- or multiple-client systems.

TPC-C: measures database transactions completed per minute, expressed in tpmC ratings. The Transaction Processing Performance Council, an organization devoted to benchmarking transaction-processing systems, manages this test. To determine the number of transactions a system can process in a given timeframe, TPC benchmarks measure the total performance of the system, including the computer, operating system, database-management system, and any other related components involved in the transaction-processing operation.

TPC-D: designed for decision support and tests 17 complex queries. TPC-D results are relative numbers based on the size of the database being queried and yield a single-user QppD Power metric and a multiple-user QthD Throughput metric. TPC-D is being phased out in favor of TPC-H (below), after vendors discovered that by pre-caching the specific queries made by the benchmark, scores could leap dramatically.

TPC-H: restores the emphasis on ad hoc (not pre-cached) queries. Like TPC-D, it is designed for decision support. TPC-H results are relative numbers based on the size of the database being queried and yield a single-user QphH query-per-hour metric.

While it is tempting to simply compare the absolute performance numbers obtained with a particular operating system to rate SMP capabilities, this approach proves invalid. The benchmark result achieved by an SMP server depends on a variety of factors, including the processor performance (i.e., Intel X86 compared to RISC), the cache sizes used (which can range from 256 KB to 8 MB), the hardware interconnect design and performance, the database or web server, and the applications used. The operating system itself represents only one component in this equation. However, one can draw relevant conclusions about the capabilities of an SMP kernel from two key metrics:

Maximum configuration size: the largest number of processors on which the operating system was tested with an industry-standard benchmark; and

Linearity: the ratio of performance gained when additional processors are added to the system.
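The linearity metric can be made concrete with a small calculation. The sketch below is an illustration only; the function name is invented, and the definition of linearity as measured speedup divided by ideal speedup is an assumption consistent with how the figures in this report read:

```python
def linearity(perf_small: float, n_small: int,
              perf_large: float, n_large: int) -> float:
    """Scaling efficiency of an SMP system between two configurations.

    Measured speedup divided by the ideal speedup n_large / n_small;
    1.0 (100%) would mean perfectly linear scaling.
    """
    measured_speedup = perf_large / perf_small
    ideal_speedup = n_large / n_small
    return measured_speedup / ideal_speedup

# Example: a four-way system scoring 100 (arbitrary units) and an
# eight-way system scoring 140 shows 70% linearity, the figure cited
# later for early AIX TPC-C results on Bull's Escala servers.
print(linearity(100, 4, 140, 8))  # → 0.7
```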


    MAXIMUM SMP CONFIGURATION SIZE

Regardless of the absolute performance value achieved, benchmark results published from high-end configurations help to prove that the operating system itself can effectively and competitively exploit that number of processors. At a certain point, every SMP system will roll over (become slower as more processors are added), because synchronization overhead starts to overtake computing performance. Since superior SMP designs can push that threshold to larger numbers of processors, vendors choose to run benchmarks on configurations that produce the most impressive results with the fewest processors.

    SMP LINEARITY

Linearity on SMP systems is typically expressed as a percentage relative to the ideal scalability, in which increasing the number of processors from n-1 to n should produce n/(n-1) times the performance. An ideal SMP system would truly be linear, i.e., performance would increase by a factor of one for every single CPU added (100% linearity). Loath to be measured by such an unforgiving standard, vendors hesitate to disclose the SMP benchmark data points necessary to draw conclusions about linearity. Even when vendors provide multiple measurements for the same machine, the tested environments almost always vary by processor clock speed, cache size, database system, database version, or operating-system version.

In rare instances, vendors have released enough benchmark data on a system to allow a gauge of its linearity. For example, on Bull's Escala servers, which are internally identical to IBM's current 32-bit RS/6000 servers, earlier versions of AIX achieved 70% linearity on TPC-C benchmarks when going from four to eight processors in identical configurations.

    PERFORMANCE CLUSTERING

Clusters can sometimes increase a system's capacity, including performance and storage. To scale performance on a cluster, applications work in concert with clustering software to partition their workloads into subtasks, which the clustering software then distributes across a group of clustered servers. Since even the fastest cluster interconnects usually have lower bandwidth and greater latency than the bus in an SMP (in some cases by several orders of magnitude), synchronization among the subtasks becomes a critical bottleneck that systems must minimize. Identifying opportunities for coarse-grained parallelism proves key to effective scalability on clusters. A variety of parallel-programming tools and techniques have emerged to assist in partitioning applications for clusters. Their use requires considerable expertise, however, and some classes of applications fundamentally cannot be adapted at all. If sufficiently partitioned, applications can exploit clustered systems containing hundreds or even thousands of nodes, delivering monumental gains in performance.


    TECHNICAL COMPUTING CLUSTERS

Today, parallel computing addresses some of the world's deepest computational problems across a variety of scientific and engineering domains, including simulation of natural phenomena, finite element analysis, and mechanical design. From a hardware standpoint, clustered computing has evolved into variants such as Clusters of Workstations (COWs) and Massively Parallel Processors (MPPs), all of which share the assumption that attached nodes are dedicated exclusively to their participation in cluster activity. Software designs have converged around two public-domain parallel processing packages, Message-Passing Interface (MPI) and Parallel Virtual Machine (PVM), which handle dispatch, collection, and management of processing tasks across cluster nodes.
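The dispatch-and-collect pattern that MPI and PVM implement can be sketched in miniature. This is an illustrative analogy only, not either library's API; the function names are invented:

```python
def scatter(tasks, nodes):
    """Deal tasks out to nodes round-robin, the way an MPI_Scatter-style
    dispatch distributes work across cluster members."""
    chunks = [[] for _ in range(nodes)]
    for i, task in enumerate(tasks):
        chunks[i % nodes].append(task)
    return chunks

def gather(chunks):
    """Collect per-node result lists back into one list."""
    return [result for chunk in chunks for result in chunk]

# Each "node" squares its share of the work; gather() merges the results.
chunks = scatter(list(range(10)), nodes=4)
results = gather([[x * x for x in chunk] for chunk in chunks])
```

In a real cluster the per-chunk work would run on separate machines; here the list comprehension stands in for the compute step.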

Even as researchers have gained parallel programming experience, they have continued looking for more affordable alternatives to expensive supercomputer products (many of whose developers have now adopted parallel architectures themselves). Recently, the dramatic improvements in price-performance of such commodity technology as Intel X86 processors and Ethernet LAN adapters have allowed developers to pursue advanced cluster architectures based entirely on industry-standard technology, requiring little or no involvement from major hardware vendors. The development of low-cost operating systems and related PVM-type capability on Linux through software known as Beowulf has extended the popularity and availability of technical computing clusters.

    DATABASE CLUSTERING

Technical applications tend to be analysis-oriented and thus read more data than they write. By contrast, most commercial applications involve Online Transaction Processing (OLTP), which usually requires that database records be updated frequently. Since any data-set changes must be copied to every node in the cluster to maintain consistency, OLTP-oriented tasks tend to scale better on SMP systems, which suffer a much less severe penalty with regard to inter-processor communication.

However, a few commercial applications rely on analysis as well and thus lend themselves well to cluster deployment. For example, data warehousing involves scanning large databases for patterns that can be used to help make business decisions. (Decision support is typically cited as a key benefit of data-warehousing applications.) Many classes of data warehousing applications can partition their data sets so as to minimize inter-node synchronization, allowing them to achieve good scalability on clusters. However, data partitioning and distribution must be implemented at the core of a database engine to work effectively, meaning that database systems require modifications to properly support clustered operation.
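The kind of data partitioning a cluster-aware database engine performs can be sketched as hash partitioning on a key, so that rows sharing a key land on the same node and queries grouped by that key need no cross-node synchronization. A minimal illustration; the function and field names are invented:

```python
def hash_partition(rows, key, nodes):
    """Assign each row to the node that owns its key value."""
    shards = [[] for _ in range(nodes)]
    for row in rows:
        shards[hash(row[key]) % nodes].append(row)
    return shards

# All orders for one customer end up on one node, so a per-customer
# aggregate can run entirely locally on each shard.
orders = [{"customer": c, "amount": a}
          for c, a in [(1, 10), (2, 5), (1, 7), (3, 2), (2, 1)]]
shards = hash_partition(orders, "customer", nodes=2)
```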

Several commercial database systems, including Oracle Parallel Server (OPS), IBM DB2 Universal Database (UDB), and Informix XPS, have been extended to work on clusters of servers connected by high-speed interconnects.


    PACKAGED WEB SERVER FARMS

While technical clustering revolves around distribution of compute cycles across nodes, and commercial database clustering revolves around distribution of both disk I/O and compute cycles across nodes, IP clustering focuses on the distribution of network requests, such as TCP/IP or web service requests, across nodes. The largest websites on the Internet process millions of hits per day, a volume of traffic that can exceed the capabilities of a single server.

IP clusters allow ISPs or corporate Intranet sites to map all the traffic destined for a single node (say, home.netscape.com) to a farm of multiple web servers across which the Internet traffic is balanced. This mapping can take place either in hardware (at a router-like device sitting in front of the web server farm) or in software (on a separate server that sits in front of the web server farm). Virtually all operating systems support the hardware approach, epitomized by the expensive but well-known Cisco LocalDirector solution. LocalDirector, like other hardware products, takes incoming IP sessions and rewrites the IP headers of a packet stream to redirect them to a particular server, using a technique called Network Address Translation defined in RFC 1631. This approach requires no changes to DNS configurations and minimal configuration of web servers, other than to ensure that the web servers have mirrored data or are operating off a common network-based file store. The balancing of connections can occur for a broad range of TCP/IP services, such as email or FTP, and not just web services. On the downside, systems require a backup LocalDirector to avoid having a single point of failure, and throughput can be limited by the rate at which LocalDirector can rewrite packets. In addition, the hardware approach does not offer the ability to dynamically balance the connections according to the load on each server in the web server farm.
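The header-rewriting behavior described above can be modeled in a few lines. This is a toy sketch of RFC 1631-style destination NAT, not LocalDirector's actual implementation; the class and method names are invented:

```python
import itertools

class NatDirector:
    """Toy destination-NAT front end for a web server farm."""

    def __init__(self, virtual_ip, servers):
        self.virtual_ip = virtual_ip
        self._rotation = itertools.cycle(servers)
        self._sessions = {}  # (client_ip, client_port) -> real server

    def rewrite(self, client, dst_ip):
        """Rewrite the destination of a packet bound for the virtual IP.

        A session stays pinned to one real server so a single TCP
        stream is not split across backends."""
        if dst_ip != self.virtual_ip:
            return dst_ip  # not our traffic; pass through unchanged
        if client not in self._sessions:
            self._sessions[client] = next(self._rotation)
        return self._sessions[client]
```

Every packet passes through rewrite(), which is why the text observes that throughput is bounded by the device's rewriting rate.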

The earliest software approaches revolved around a technique called Round Robin DNS. In this approach, when a DNS server was asked for the IP addresses of a site name (e.g., www.dhbrown.com), it would return a numeric IP address that alternated among a predefined set of IP addresses, each referring to a separate back-end server (e.g., 127.1.1.1, 127.1.1.2, 127.1.2.5). This approach had two main flaws, however. First, DNS mappings often got cached in intermediate routers and other DNS servers in such a way that the load was not evenly distributed. Second, if a back-end server failed, the DNS table had to be modified by hand to remove the failed system's IP address. Otherwise, connections would continue to be routed to the dead system, even when a user pressed the reload button on their browser.
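A minimal model of the round-robin rotation, and of why caching undermines it, might look like the following; the class name and details are assumptions for illustration:

```python
from collections import deque

class RoundRobinDNS:
    """Toy DNS server that rotates the answer for one name per query."""

    def __init__(self, name, addresses):
        self.name = name
        self._addresses = deque(addresses)

    def resolve(self, name):
        if name != self.name:
            raise KeyError(name)
        answer = self._addresses[0]
        self._addresses.rotate(-1)  # next query gets the next server
        return answer

dns = RoundRobinDNS("www.dhbrown.com",
                    ["127.1.1.1", "127.1.1.2", "127.1.2.5"])
first = dns.resolve("www.dhbrown.com")   # a caching resolver would keep
second = dns.resolve("www.dhbrown.com")  # reusing `first`, skewing the load
```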

Over the last several years, a variety of other software approaches have sprung up that attempt to provide load balancing via software that intercepts incoming requests for information and distributes those requests accordingly. As with database clustering solutions, a full evaluation of these products goes beyond the scope of this paper. However, at a minimum, all the studied operating systems can support the hardware redirection approach to IP clustering, as well as the primitive round-robin DNS approach.


    MISCELLANEOUS PERFORMANCE OPTIMIZATIONS

Several vendors have tried to address scalability in particular situations. Because of their limited applicability, DHBA weighted these areas less heavily in its rankings. Potential areas for optimization include:

Multiple 2-GB shared-memory segments: Large servers running several copies of 32-bit enterprise applications such as SAP can run into bottlenecks over shared memory if the kernel can only provide 2 GB of shared memory to the whole system. A proper kernel design or modification can enable multiple applications to use their own private 2-GB shared-memory windows without exhausting the limited shared-memory space addressable by a 32-bit kernel. This tactical feature should help scalability in certain server-consolidation environments where the application vendor has yet to port its application to 64 bits.

Dynamic page sizing: Historically, operating systems used fixed-size I/O pages. However, some classes of applications may benefit from different page sizes. For example, applications that involve use of many small files may operate more efficiently with small page sizes, while I/O-intensive applications implementing large block transfers may run better with large page sizes. Some operating systems allow administrators to set page size by process.

Kernel thread architecture: All studied environments now support kernel threads, which are required to effectively scale threaded applications on SMP systems. Some environments innovate over traditional one-to-one (1-1) thread mechanisms, in which each application thread has one corresponding kernel thread, with the addition of MxN thread mechanisms. MxN thread scheduling multiplexes user threads over a fixed (but configurable) number of kernel threads. In some application classes, MxN thread scheduling boosts application efficiency, as it avoids calling kernel functions directly, thus reducing the overhead of saving and restoring the kernel state when making those calls. MxN also potentially allows the creation of many more user threads, because it requires a smaller overhead per thread.

Kernel-based asynchronous I/O: Asynchronous I/O mechanisms prove useful in programming SMP applications by allowing threads to continue processing while waiting for time-consuming I/O operations, such as disk reads, to complete. Some operating systems support asynchronous I/O deep in the kernel, potentially making its use more efficient with heavy-duty programs such as databases.
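The last two ideas can be illustrated together with Python's asyncio, which multiplexes many user-level tasks over very few OS threads and overlaps I/O waits with other work. This is only an analogy to MxN scheduling and kernel asynchronous I/O, not an implementation of either:

```python
import asyncio

async def read_block(block_id):
    # Stands in for a disk read that completes asynchronously; the task
    # yields instead of blocking a kernel thread while it waits.
    await asyncio.sleep(0.01)
    return f"data-{block_id}"

async def main(m=50):
    # M user-level tasks share one OS thread; their waits overlap, so
    # total time is far less than m * 0.01 seconds of serial waiting.
    return await asyncio.gather(*(read_block(i) for i in range(m)))

results = asyncio.run(main())
```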

AIX 4.3.3

AIX offers good scalability overall, with excellent performance clustering capabilities and solid 64-bit and SMP capabilities. IBM's latest generation of 64-bit SMP servers supports up to 24 processors, twice what it supported last year. IBM has published respectable TPC-C results demonstrating that AIX can effectively exploit such high-end configurations.


While linearity remains unknown at the 12-way or 24-way level, AIX has achieved impressive results on past system generations. For example, as noted earlier, using Bull's Escala servers, which are internally identical to IBM's current 32-bit RS/6000 servers, earlier versions of AIX achieved 70% linearity on TPC-C benchmarks when going from four to eight processors in identical configurations. SPECweb96 shows 69% linearity between similar two-way and four-way H70s. For technical batch jobs, IBM relies mainly on clustered SP systems, but the older R50 has shown linearity of 82% on SPECfp_rate_base95 and SPECint_rate_base95. This is somewhat low, but still respectable.

IBM introduced its first 64-bit UNIX hardware two years ago, and over the last year it extended 64-bit benefits to the midrange of its server line with the H70. Also over the last year, AIX's maximum physical memory support has grown to 64 GB with AIX 4.3.3 running on the S80. While AIX provides some degree of 64-bit capability across all four major criteria (large file systems, files, physical memory, and address space), two implementation weaknesses remain:

AIX 4.3's maximum file size, 64 GB, falls significantly behind competitors, most of whom support at least 1 TB.

AIX 4.3's 64-bit addressing scheme rests on a hybrid 32-bit/64-bit kernel addressing mechanism that penalizes 64-bit application performance, among other tradeoffs (described more fully below).

Some controversy has arisen over AIX's single kernel, since the inner kernel itself remains 32-bit, meaning that its pointers internally remain 32 bits wide. The 64-bit pointers used by applications are handled internally by the kernel as 64-bit cookies passed among internal routines. Manipulation of these pointers is restricted to a small set of 64-bit-aware kernel routines. A full description and examination of the implications of this falls beyond the scope of this paper, except for a brief examination of the tradeoffs. With AIX, 64-bit applications that frequently call upon (32-bit) kernel routines may invoke a small performance penalty for checking, reshaping, and creating internal kernel data structures. IBM points out that 32-bit applications running on a 64-bit kernel face at least some overhead of a similar nature. Given the relatively small number of 64-bit applications and the great benefits that those applications receive from going to 64 bits, AIX's kernel architects feel comfortable with the occasional case where a 64-bit application takes a small performance penalty. In return, AIX offers the unique ability to use older device drivers and a single kernel across its 32-bit and 64-bit systems. Furthermore, the cache effects of larger 64-bit code and data that reduce performance may offset much of the potential gain of a true 64-bit kernel. Overall, the issue has an impact on some 64-bit applications' performance, but the impact appears to be minor.

AIX is equipped with very good clustering options. IBM's HACMP clustering package supports industry-leading high-availability (HA) functions, and on IBM's SP systems, AIX has proven its ability to support world-class computational problems. In terms of concurrent database support as rated by DHBA's HA research, IBM ranks second only to Tru64 UNIX in breadth of capabilities. IBM has demonstrated the scalability of its systems with strong TPC-D benchmarks on a 48-node SP. IBM also holds second place for clustering performance on the TPC-C benchmark with a five-node cluster of S70 servers. Vendors such as Oracle also support their OPS parallel database on AIX. The SP and AIX support the broadest range of clustered databases, including IBM DB2 UDB EEE, Informix XPS, Oracle Parallel Server, Red Brick xPP, and Sybase MPP. Note that there is typically a three-month delay before the most recent version of AIX is made available on the SP.

IBM's technical clustering stands out for its SP system, which consists of AIX systems connected by a proprietary, high-performance switch; a version of MPI optimized for the switch; and sophisticated cluster-management tools.

Like other vendors, IBM depends on third-party reverse proxy software (such as iPlanet Proxy Server) for web-farm clustering. IBM supports 64-bit kernel asynchronous I/O and an MxN thread model, as well as the ability to support multiple pools of shared memory on 32-bit systems using techniques analogous to HP's Memory Windows feature. However, unlike all the other products studied in this report, AIX does not support dynamic page sizing, in part due to hardware limitations.

    HP-UX 11.0

HP-UX offers solid scalability, offering strong SMP support up to 32-way systems and matching other vendors for 64-bit capabilities. However, HP-UX lacks MxN threads and strong concurrent database capabilities. While HP's V-class servers have supported up to 32 processors since last year, HP has recently moved its NUMA technology from Convex into HP-UX, allowing non-uniform shared-memory servers that began shipping at the end of 1999 to reach 128 processors. HP's SMP servers will support up to 32 processors today, with prototype NUMA systems of 128 processors expected to become generally available in early 2000. At least some performance gain from additional processors seems likely at the 32-way level, since HP has published TPC-C, TPC-H, and TPC-D benchmarks on 32-processor systems. Linearity remains somewhat unclear, but some eight-way N-class and 32-way V-class results suggest reasonable scalability, despite the fact that they are not strictly comparable, due to different backplanes and other minor factors. For example, two 440 MHz systems running TPC-H benchmarks with Informix as the database show 63% linearity between eight processors and 32 processors. Similarly, an eight-way 440 MHz system running TPC-C benchmarks with Sybase suggests 47% linearity when compared to a 32-way 440 MHz system running Oracle. (Assuming Oracle is better than Sybase, 47% would be an upper bound.) A later 32-way Sybase result with a slightly newer operating system and slightly newer Sybase yields 52% linearity.

In terms of SPECweb96 linearity, one- to two-way linearity is 74%, with two- to four-way being 94% and four- to eight-way being 85% (one- to eight-way thus is 78%). Also, a 16-way V-class showed 55% SPECweb96 linearity over a comparably clocked four-way K-class, albeit with different backplanes. In terms of trivially parallelized benchmarks useful for technical or batch-oriented computing, SPECint_rate_base95 linearity from a 16-way to a 32-way 440 MHz V-class is 84%, while a 200 MHz V-class shows 93% linearity going from a one-way to a 16-way configuration. No results for HP's 128-way NUMA systems are yet available.

HP-UX 11.0 supports 64-bit capabilities in all four areas: files, file systems, physical memory, and process address space. HP's 64-bit servers exceed many other UNIX competitors in terms of maximum memory capacity, with support for 128 GB in the SCA-node V-class servers that started shipping at the end of 1999. Previously, HP's 32 GB V-class limit matched that of others. HP-UX supports large files and file systems up to 1 TB. HP has largely put compatibility and software-availability transition issues for the 64-bit platform behind it, having filled in holes such as its OpenGL implementation.

As measured by DHBA's HA research, HP's support for performance clustering functions is limited by its lack of virtual raw disk access or low-overhead messaging protocols, no distributed lock manager in the kernel, and no software RAID5 support. Still, HP-UX systems support both XPS and OPS. Moreover, in terms of benchmark evidence, HP has provided eight-node TPC-C and eight-node TPC-H benchmarks, the former running Oracle and the latter with Informix. Like other vendors, HP depends on third-party reverse proxy software (such as iPlanet Proxy Server) for web-farm clustering.

HP's Memory Windows feature, introduced largely to support large SAP installations on 32-bit systems, removes the restriction that all applications on a server would have to share a single 1.75-2.75 GB pool of shared memory. Memory Windows allow each application to have its own, semi-private pool of 1+ GB. With HP-UX 11.0, HP finally introduced kernel threads, using the 1-1 model, but has yet to catch up with competitors offering the more modern MxN threads for peak scalability. HP does support dynamic page-sizing optimizations and kernel asynchronous I/O, features that prove useful for accelerating database performance.

    IRIX 6.5

IRIX has excellent scalability, supporting more processors and memory in SMP systems than any other studied UNIX product. Commercial performance clustering remains less fully addressed, however. IRIX offers particularly strong SMP capabilities for technical requirements. Currently, SGI systems scale as high as 512 processors with their Origin2000 NUMA hardware, although mainstream commercial benchmarks do not go nearly as high. SGI has published TPC-C benchmark results for its 28-way Origin2000 servers, as well as a respectable 32-way TPC-D result. While linearity cannot be determined based on the single TPC-C result, two reasonable data points on the TPC-D benchmark indicate 77% linearity on database query power and throughput between comparable eight-way and 32-way systems running the same processors, operating system, and database.

  • 8/10/2019 osfr

    20/73

    1999-2000 Operating System Function ReviewSS, March 2000

    20 Copyright 2000 D.H. Brown Associates, Inc.

SPECweb96 linearity is fine, with one- to two-way Origin results indicating 52-59% linearity. Two- to four-way linearity appears to be between 59% and 70%, depending on the choice of system pairs, with more results falling at the high end of that range. Four- to eight-way SPECweb linearity appears to be 60%.

SGI has largely abandoned its vision of a Cellular IRIX that would solve scalability problems, such as kernel page-table bottlenecks, by running multiple images of the operating system. Instead, SGI has turned its focus to tuning for technical and batch applications. Linearity for trivially parallelizable technical and batch applications appears strong, with 99% linearity from one- to four- to eight- to 16- to 32- to 64-way Origin2000 systems all running at 250 MHz. From 64- to 128-way, linearity stays at 88%, and from 128- to 256-way, linearity improves slightly to 92%. Overall, then, linearity from a one-way to a 256-way system appears remarkably strong over a very wide range, with each processor performing at 88% of maximum performance.

SGI provides 64-bit hardware across its entire product line. The 64-bit version of IRIX, first released in 1993, is highly mature, meaning compatibility issues are largely an issue of the past. IRIX stands out for its large real-memory support of 256 GB in a single Origin2000 system, four times its nearest competitor, Sun's Enterprise 10000 server. IRIX is also notable for its storage scalability. SGI has customers who use its XFS file system with over 100 TB, a stronger claim of real-world testing than any of its competitors, most of whom guarantee 1 TB-level testing and support.

SGI has sharpened its focus to emphasize technical performance clustering scalability, rather than commercial database cluster scalability. While Oracle supports its Parallel Server, and Informix supports XPS on IRIX, SGI has not yet run any TPC-C or TPC-D benchmarks using either of these systems. SGI's ORIGIN ARRAY clustering technology supports eight SMP nodes linked by Fibre Channel (FC), aimed primarily at technical tasks. SGI offers a suite of capabilities, including NQE and LSF for workload balancing/distribution; the TotalView cluster debugger for debugging a single MPI application across a cluster; Array Services, allowing commands to be run across the cluster for cluster management; and a cluster accounting package that helps track system usage. SGI's 48-node, 128-way SMP cluster at LANL demonstrates its ability to deliver and extend this technology in the most advanced technical-computing environments. Current SGI environments support up to 512-way SMP, and those systems can be partitioned into smaller clusters as desired.

In terms of miscellaneous optimizations, IRIX implements MxN thread scheduling, per-process dynamic page sizing, and kernel asynchronous I/O.


    SOLARIS 7

Solaris provides excellent SMP scalability, with support for up to 64 processors. TPC-C and TPC-D benchmarks at the 64-way level suggest at least some performance gains from that number of processors. Linearity remains extremely difficult to judge, even more so than for other vendors, because Sun ran all its tests with different databases or hardware configurations. If one drops the requirement for comparable databases and backplanes, a 64-way Oracle configuration shows 40% linearity per processor over a four-way Sybase configuration, both running 400 MHz processors with 4 MB L2 caches. An even grosser comparison can be made from that 64-way system to a 24-way system running Sybase on 336 MHz processors, indicating 69% linearity, assuming that TPC-C results would scale perfectly with comparable clock speeds. A strong correlation exists between SPECint speeds and clock rates and between SPECint and TPC-C results, but the accuracy of these linearity results is highly qualified given the number of changing variables.

In terms of SPECweb96 linearity, one- to two-way improvements on comparable Enterprise 250s run a surprisingly low 63%, or 72% on comparable Enterprise 450s. The Enterprise 450 shows 80% linearity from two to four processors. In terms of technical compute scalability, Sun's SPECfp_rate_base95 scales between 78% and 87% on various 32- to 64-way configurations.

Solaris 7 added support for large process address spaces to its previous support for large files, file systems, and real memory. Sun's support for large physical memory sizes is strong at 64 GB, matching or exceeding all vendors except SGI. Sun supports files and file systems up to 1 TB. Like HP-UX and IRIX, Solaris 7 comes in both 32-bit and 64-bit flavors, chosen at install time, with the same binary compatibility for applications and the minor inconvenience of checking that proper device drivers are available and installed on the appropriate 32-bit or 64-bit hardware.

Sun's commercial clustering performance has been validated by four-node benchmarks with Oracle 8i and four-node TPC-D benchmarks with Informix Dynamic Server XP. As one of the first UNIX environments to optimize for kernel threads, Solaris pioneered the MxN thread model and fully supports kernel asynchronous I/O. Sun does not appear to support a shared-memory feature similar to HP's Memory Windows for systems running 32-bit applications on large-memory servers.


    TRU64 UNIX 5.0

While Tru64 UNIX's 64-bit and clustering technologies remain key areas of strength, the operating system has yet to prove its scalability on high-end SMP configurations, weakening its scalability range. Until Compaq's future eight-, 16-, and 32-node high-end GS-series systems arrive in 1H00, Tru64 UNIX's SMP support remains limited to up to 14 processors. Linearity of that SMP support remains unclear at best, questionable at worst. Compaq has published 12-way TPC-D results and recent eight-way TPC-C and TPC-H results (as well as a now-obsolete 10-way TPC-C result). These results indicate that additional processors do improve performance through at least the eight-way range and probably up through the 12-way range. While it might appear odd that Compaq has released newer benchmarks with fewer processors, this is likely due not to operating-system factors but to the fact that the newer Alpha 21264 processor saturates the memory bus more quickly than the 21164, for which the systems were originally designed.

On the positive side, comparable six-way and eight-way 700 MHz Sybase systems, with slightly different backplanes, do show 97% TPC-C linearity over that narrow processor range. However, SPECweb linearity is much more modest: 48% improvement going from one- to two-way (DS20), 84% going from a two-way DS20 to a four-way ES40, and only 11% faster performance going from a four-way system to a 10-way system with a faster clock speed, once the MHz gains are scaled out. SPECfp_rate_base95 linearity is surprisingly poor, with 52% improvement from a four-way to an eight-way. This is perhaps due to inadequate memory bandwidth for the 21264 on the 8400; SPECint_rate_base95 linearity for that same comparison is 95%.

Tru64 UNIX's lead in 64 bits grows less relevant by the year, as all other UNIX vendors now effectively match its 64-bit addressing capabilities. Compaq boasts the largest portfolio of 64-bit applications, but the payoff of this achievement, aside from large-memory databases, remains unproven.

    From the very beginning, Compaq (then Digital) designed its operating system to be a fully 64-bit environment. Not surprisingly, Tru64 UNIX offers the strongest 64-bit functionality and the best compatibility story for customers going forward. Compaq's entire line of Alpha hardware has been 64-bit capable for almost five years, and the system provides large files, file systems, process address space, and physical memory of up to 28 GB (limited by hardware). Tru64 UNIX's AdvFS file system has supported large files since its introduction, although its UFS file system has not. Unlike its competitors, Tru64 UNIX makes all applications available on its 64-bit platform without future migration issues. While Tru64 UNIX does not yet conform to the UNIX98 standard, it does support the earlier UNIX95 standard and offers a full set of 64-bit device drivers and applications with a single operating-system binary.

    Compaq's TruCluster Server software provides effective clustering support. Although its HA clustering functions rank as average, Tru64 has driven more on performance scalability clustering, ranking first in terms of concurrent database support as rated by DHBA's HA research. Tru64 UNIX systems support both Informix XPS and Oracle OPS parallel databases. An eight-node Oracle OPS and Tru64 UNIX cluster was the first system to break the 100,000 tpmC TPC-C barrier. More recently, Compaq released a new TPC-H benchmark running Informix XPS across eight nodes running Tru64 UNIX.

    Like other vendors, Compaq depends on third-party reverse-proxy software (such as iPlanet Proxy Server) for web-farm clustering. Tru64 UNIX does provide an MxN thread model for SMP applications. Its dynamic page sizing is particularly flexible, allowing page sizes to vary across processors and on a per-process basis.


    RELIABILITY, AVAILABILITY AND SERVICEABILITY (RAS)

    [FIGURE 3: RAS Functional Ratings. Bar chart rating AIX 4.3.3, HP-UX 11.0, Solaris 7, Tru64 UNIX 5.0, and IRIX 6.5 on a scale of Fair, OK, Good, Very Good, and Excellent.]

    SUMMARY

    IRIX 6.5 shares the lead with Tru64 UNIX 5.0 for RAS functions. Tru64 UNIX offers unmatched storage reliability features and leading HA clustering functions, thanks in part to its clustering file system, which is unique among all studied products. IRIX offers particularly strong resiliency functions, along with competitive HA clustering options. Solaris 7 follows, having the strongest resiliency functions due to its unmatched Dynamic Reconfiguration and Alternate Pathing capabilities. Solaris 7 offers only average HA clustering capabilities, however. HP-UX 11.0 and AIX 4.3.3 have roughly equivalent RAS capabilities. HP offers stronger resiliency functions and the best overall serviceability functions, but IBM offers very strong HA clustering functions.

    RAS CRITERIA

    Virtually all systems have downtime, but enterprise environments place a premium on functions that help minimize it. Developers have created a number of software tools and mechanisms to reduce both planned and unplanned downtime, including:

    Resiliency functions: allow an operating system to adapt to outages by certain hardware components in single systems, including I/O, CPUs, and memory.

    HA clustering functions: protect a complex of multiple systems against hardware and software failures, both in the operating system and applications, by allowing servers to fail over operations to a backup server.

    Storage reliability functions: features such as journaling file systems that maintain the integrity of system and user data in the event of unplanned shutdowns.

    RESILIENCY FUNCTIONS

    In general, hardware has become more reliable over time. Server designs increasingly build on highly integrated components, reducing complexity and hence the number of points of failure. Hardware areas particularly vulnerable to mechanical failure, such as storage, can be protected through techniques such as RAID. Systems now build in redundancy for components such as fans to further improve reliability. Despite these improvements, critical failures can still occur in components such as memory and CPUs. Leading-edge developers have responded by introducing features that allow an operating system to adapt to certain hardware failures, in some cases drawing on techniques that have traditionally been implemented in mainframe environments. Emerging operating-system technology that enables such self-healing includes:

    Dynamic processor resilience: can adapt to processor failures by isolating failed CPU components. In the event of a soft error (a non-fatal error that allows the system to continue processing), the system should gracefully discontinue use of the failed unit. If a processor failure results in a system crash, the system should reboot automatically after isolating the failed unit.

    Dynamic memory resilience: can dynamically cordon off memory that has suffered single-bit errors so that software no longer risks using potentially unreliable areas. Most systems can typically detect and correct single-bit failures with error-correcting code (ECC) memory. With dynamic memory resilience, however, the operating system registers repeated single-bit failures in software so it can isolate affected areas before fatal double-bit errors occur.

    Dynamic reconfiguration: can support online addition and removal of I/O adapters, CPUs, and memory modules for repairs or upgrades. Dynamic removal of CPUs and memory requires the operating system to gracefully quiesce use of those resources. Dynamic reconfiguration of I/O typically requires support for Alternate Pathing (AP) in the operating system, so any logical I/O reference can be switched among different physical I/O adapters.

    Software-aware internal partitions: can support the division of a large SMP system into several smaller SMP systems, each running its own copy of the operating system for increased reliability.
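    The dynamic memory resilience described above can be sketched as a simple error-tracking policy. This is an illustrative toy, not any vendor's actual implementation; the class name, threshold, and page identifiers are invented for the example:

```python
class MemoryResilienceMonitor:
    """Count corrected single-bit ECC errors per page and retire a page
    before a fatal double-bit error has a chance to occur."""

    def __init__(self, threshold=3):
        self.threshold = threshold    # corrected errors tolerated per page
        self.error_counts = {}        # page frame -> corrected-error count
        self.retired_pages = set()    # pages withdrawn from the allocator

    def record_corrected_error(self, page):
        if page in self.retired_pages:
            return
        self.error_counts[page] = self.error_counts.get(page, 0) + 1
        if self.error_counts[page] >= self.threshold:
            self.retire(page)

    def retire(self, page):
        # A real kernel would migrate live data off the frame and record
        # it in a persistent log so it stays offline across reboots.
        self.retired_pages.add(page)

    def is_usable(self, page):
        return page not in self.retired_pages
```

    The same thresholding idea applies to processor cache errors, as the IRIX discussion later in this section illustrates.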

    HIGH-AVAILABILITY CLUSTERING FUNCTIONS

    The vast majority of risks to system reliability derive from failures in software, including the operating system, middleware, and applications. Administrators can use HA clustering techniques to maintain the availability of operating-system services and applications by failing over to a backup system in the event of a system outage, either planned or unplanned. HA clustering allows one or more servers to take over for a server that has crashed or stopped processing normally due to an operating-system or application failure, allowing processing to continue. By isolating faults on the failed node, the remaining nodes can continue functioning, keeping the overall clustered system in operation, albeit at reduced capacity.

    In some cases, clustering can help with some management tasks by absorbing planned downtime in addition to addressing system failure. For example, a cluster could allow testing of new software or hardware in a working system while still protecting the remaining nodes from any resulting failures. Clusters can also be used to respond to failure of hardware components such as disks or adapters.


    Note that most clustering solutions only try to ensure that service gets restored within a reasonable time limit. They do not necessarily guarantee continuous service. In fact, at the time of a failure, cluster clients will likely receive errors while the cluster completes state-transition changes. Unlike fault-tolerant (FT) systems, which tend to use specially designed and usually costly proprietary mechanisms to enable truly continuous availability, clusters emphasize the use of standard building blocks (i.e., traditional servers used to construct "meta-systems" with some level of a single-system image). As part of the design tradeoff, a cluster's failover process does not necessarily occur immediately or transparently.

    Full-function HA clustering solutions typically comprise a number of components, including:

    Failure detection and recovery: Clustering software monitors the health of systems and applications by running agents that continuously probe for certain conditions. Vendors usually provide agents for monitoring hardware, the operating system, and key applications such as databases and messaging systems. They typically also provide an API that developers can use to configure monitoring of their own applications.

    Failover configuration: When agents detect a failure, they can trigger a variety of actions, depending on the configurability of the clustering package. First, the system must decide whether to attempt a local recovery or initiate a failover, in which the workload is moved to a backup server. In failover situations, support for more than two nodes becomes a significant added value because of the ability to perform cascading and multidirectional failover. Cascading failover provides higher levels of reliability by allowing the workload to continue migrating to yet another backup node if the primary backup node fails. Multidirectional failover allows a failed node's workload to be split and failed over to multiple backup nodes.

    Cluster administration: The basic definition of a cluster has long invited contentious debate in both marketing and academic circles. The one concept agreed on by all relates to the fundamental requirement for a single-system image: the ability to view and operate the cluster as if it were a single virtual server. From a client's perspective, a cluster implementation should be transparent and require no special modification to client software or hardware. From an operator's standpoint, administration should involve a single point of interaction, and management tools should hide the implementation details of multiple servers as much as possible.

    Disaster recovery: Many clustering packages depend on the ability for systems to share disks, since backup nodes need to access the same data used by primary nodes. However, most shared-storage configurations constrain the distance between nodes to the maximum length of I/O channels such as SCSI or FC, which at best extend to campus ranges of a thousand yards or so. Disaster-recovery configurations allow nodes to be separated by geographically significant distances, measured in miles or even continents. These greater distances protect systems from outages that affect entire sites, such as floods or terrorist attacks.


    Cluster file system: The cluster can share a single file system across multiple nodes, both for data and for the operating-system code itself on the root file system. This feature dramatically simplifies serviceability and manageability of HA clusters.
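    The cascading-failover behavior described in the components above can be sketched as a placement policy: each workload carries an ordered list of candidate nodes, and on node failure its workloads migrate to the next healthy candidate. This toy model is hypothetical and not modeled on any one product:

```python
class FailoverCluster:
    """Minimal cascading-failover policy for an HA cluster."""

    def __init__(self, nodes):
        self.healthy = set(nodes)
        self.placement = {}     # workload name -> currently hosting node
        self.candidates = {}    # workload name -> ordered failover list

    def add_workload(self, name, candidates):
        self.candidates[name] = list(candidates)
        self.placement[name] = self._first_healthy(name)

    def _first_healthy(self, name):
        for node in self.candidates[name]:
            if node in self.healthy:
                return node
        raise RuntimeError("no healthy node left for " + name)

    def node_failed(self, node):
        # In a real cluster, failure detection (e.g., missed heartbeats
        # from a monitoring agent) would trigger this transition.
        self.healthy.discard(node)
        for name, host in self.placement.items():
            if host == node:
                self.placement[name] = self._first_healthy(name)
```

    Multidirectional failover would extend this by giving each workload of a failed node a different candidate list, so the load spreads across several surviving nodes instead of one.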

    STORAGE RELIABILITY AND SCALABILITY

    Since data management represents a central function in most server environments, operating systems must implement specific features to maintain the integrity of storage. A journaling file system (JFS) provides two particularly important storage reliability benefits: it increases the robustness of the file system, and it reduces the time required after an unplanned shutdown to boot a system configured with large amounts of storage. Journaling employs transaction-based logging techniques similar to those of database systems. Before updating any file-system control information (i.e., metadata), the operating system enters information concerning the update into a disk-based log. Only after the system has confirmed that it has written the user data safely to disk does it attempt to update the actual metadata. If the system loses power or otherwise fails during the metadata update, the JFS can reconstruct the all-important metadata from information in the log. In this way, file systems always move from one consistent state to another, never attempting unsafe writes.
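    The log-then-update sequence just described can be illustrated with a minimal write-ahead journal. Real journaled file systems are far more involved; this sketch, with its invented class and method names, only shows why a logged intent survives a crash:

```python
class JournalingFS:
    """Toy write-ahead metadata journal: log the intent durably, then
    apply the real metadata update, then retire the log entry."""

    def __init__(self):
        self.log = []        # durable intent log (the "journal")
        self.metadata = {}   # on-disk metadata, updated only after logging

    def update_metadata(self, key, value, crash_before_update=False):
        self.log.append((key, value))    # 1. record the intent in the log
        if crash_before_update:
            return                       # power lost before the real write
        self.metadata[key] = value       # 2. apply the actual update
        self.log.pop()                   # 3. retire the completed entry

    def recover(self):
        # After a crash, replay logged intents so metadata reaches a
        # consistent state without an exhaustive fsck-style disk scan.
        while self.log:
            key, value = self.log.pop(0)
            self.metadata[key] = value
```

    Recovery time is proportional to the (small) log rather than to total storage, which is why journaled systems reboot quickly even with very large volumes attached.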

    SERVICEABILITY ENHANCEMENTS

    Enterprise systems administrators require a broad portfolio of tools to help them service the operating system. They use these tools to harden the system against failures (usually by performing postmortems on past failures) and to tune it for optimal performance. Potential serviceability options include:

    Checkpoint/restart: This capability allows the operating system to take snapshots of a running application, including memory contents and register values. When a server fails, the snapshot can be used after the server comes back up to restore an application to its exact state at the time of failure.

    Resource management: While standard UNIX provides disk quotas and per-process resource limitations, some systems provide more advancedmanagement capabilities, such as allocating CPU and memory percentages byuser or user group.

    Year 2000 validation: While most Year 2000 risks derive from shortsighted designs in application code, users must also test operating-system functions as a minimum level of protection. A number of vendors certify that their operating systems work under Year 2000 conditions, in some cases referring to assessments by independent organizations.

    Enhanced core dump analysis: This capability provides tools for analyzing application failures. When UNIX applications crash, they leave behind a core dump file containing the state of the application at the time of failure. Operating systems can provide enhanced abilities to analyze these files with more system-specific detail than standard debuggers provide.


    Efficient kernel dump: This capability provides tools for analyzing total system failures. Extreme software failures can result in operating-system crashes. As with application crashes, developers can examine dump files containing a snapshot of the entire system memory at the time of failure. On high-end servers configured with very large amounts of memory, especially 64-bit systems, such files can grow significantly. Operating systems can make analysis of such files more efficient by reducing the amount of data through compression or elimination of irrelevant information.
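    The two size-reduction techniques named above, selectivity and compression, can be combined as in the sketch below. The page layout and the relevance predicate are invented for illustration; real dump facilities operate on physical page frames with kernel-specific filters:

```python
import zlib

def write_kernel_dump(memory_pages, is_relevant):
    """Keep only pages a predicate deems relevant (e.g., kernel data
    rather than free or page-cache pages), then compress the result."""
    selected = b"".join(page for page in memory_pages if is_relevant(page))
    return zlib.compress(selected)

def read_kernel_dump(dump):
    """Recover the selected pages for offline crash analysis."""
    return zlib.decompress(dump)
```

    On a large-memory 64-bit server, dropping irrelevant pages before compressing is what keeps dump size, and therefore reboot and analysis time, manageable.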

    AIX 4.3.3

    AIX has processor resilience and partial memory resilience. If AIX discovers a sick processor or memory block at boot time, the system turns off the defective part and does not use it. This includes the situation in which a processor encounters too many recoverable errors, although recoverable errors for ECC memory are not yet trapped in a similar manner. In any case, if the system is halted for a sick processor or memory block, the processor and block are turned off and not used when the system reboots. However, AIX does not yet support any dynamic reconfiguration, other than the ability to turn off processors using the cpudisable command.

    AIX achieves the highest overall rating for HA cluster features according to DHBA's HA scorecard. While achieving competitive ratings in every area, AIX breaks out with overwhelming advantages for Disaster Recovery/Remote Data Replication. AIX does not yet offer a full root CFS, although it provides networked file systems with its SP cluster hardware. In terms of storage reliability, AIX includes a journaled file system, albeit one that protects the integrity of the file system as a whole, and not individual files.

    AIX matches many of the serviceability and performance improvements provided by competitors, including efficient core and kernel dump facilities and some checkpoint/restart capability, although it does not offer dump analysis tools. AIX's kernel dumps are both selective in the data saved and can be compressed on the fly when created. For example, a device driver that runs in kernel mode can explicitly specify what it wants to dump, possibly querying the device upon a dump for specific status information. IBM claims that such device-driver data typically runs about 0.25 MB per driver. IBM also claims its dump sizes are in general limited to less than 10% of real memory, and less than 5% on larger systems (>4 GB RAM), with 64 GB systems producing dumps typically 2 GB in size. In terms of checkpoint/restart capability, IBM's LoadLeveler allows checkpoint/restart of a set of processes with parent-child relationships intact and process IDs in place. This capability covers any kind of process run under LoadLeveler, even Perl and shell scripts, without requiring explicit API hooks. However, LoadLeveler carries an additional charge, and it does not include more sophisticated multiple-process-family checkpoint/restart or socket checkpoint/restart.
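    The checkpoint/restart idea can be shown schematically. Products such as LoadLeveler capture memory images, registers, and process IDs at the operating-system level; this toy version, with invented names and a trivially small state, only deep-copies Python-level state to show the snapshot-and-roll-back pattern:

```python
import copy

class CheckpointableTask:
    """Snapshot a task's state so work can resume from the snapshot
    after a failure instead of restarting from scratch."""

    def __init__(self):
        self.state = {"step": 0, "partial_results": []}
        self._checkpoint = None

    def run_step(self):
        self.state["step"] += 1
        self.state["partial_results"].append(self.state["step"] ** 2)

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)

    def restart(self):
        # Roll back to the last snapshot, as if the process were rebuilt
        # from its saved image after a crash.
        self.state = copy.deepcopy(self._checkpoint)
```

    For long-running batch jobs, losing only the work done since the last checkpoint, rather than the entire run, is the whole value of the feature.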


    HP-UX 11.0

    While HP-UX offers little in the way of dynamic reconfiguration, it is at the forefront for resiliency features. When certain classes of faults occur at run-time, HP-UX will trap and check for processor failure. If a processor has failed, HP-UX will notify an administrator, who can take the processor down while the system stays up. HP-UX can also check memory at run-time to detect whether single-bit hard errors or repeating soft errors have occurred. If so, HP-UX deallocates the respective 4 KB page of memory to prevent a second and fatal bit error (note that ECC memory detects, but does not correct, a second bit error). HP-UX also logs these errors so that, after analysis or reboot, the bad memory is permanently kept offline throughout succeeding boot cycles. HP-UX 11 does not yet support dynamic addition and removal of CPU, memory, or I/O devices.

    HP-UX offers solid HA cluster features, taking third place according to DHBA's HA scorecard. In particular, HP's clustering options offer leading cluster failover configuration and detection/backup/recovery functions. In terms of storage reliability, HP-UX includes a journaled file system, albeit one that protects the integrity of the file system as a whole, not individual files.

    Miscellaneous reliability features include the ability to log all console error messages to a file (syslogd), and core and kernel dump infrastructures optimized for efficient saving of relevant information to disk. User core dumps can be analyzed with WDB, HP's Windowed Debugger, which can attach to a running or hung process or attach to and examine application core files. For kernel core dumps, HP has an internal tool, /usr/contrib/bin/q4, typically used only by HP field support or a few specially trained customers. HP-UX lacks a checkpoint/restart tool, although HP provided one on past SPP-UX-based systems from its Convex division.

    IRIX 6.5

    While IRIX lags its competitors in traditional failover HA capabilities critical to the commercial server market, SGI's focus on reliability for technical servers shows. IRIX offers strong processor and memory resiliency, a unique integrated checkpoint-restart capability, highly flexible dynamic page sizing, and a static partitioning capability that is second only to Sun's Dynamic Domains.

    IRIX offers strong resilience to processor and memory failures. IRIX can catch a wide range of faults in user mode. When IRIX encounters an unrecoverable CPU error, it runs a software routine to gather hardware diagnostics. If the failure did not cause a crash, an administrator can stop scheduling processes onto the CPU via the mpadmin command. Some processor errors are automatically recovered from without crashing or administrator intervention, including cache errors in the instruction cache, cache errors on clean data lines, and unrecoverable cache errors for user data, in which case the running process is killed. IRIX can also use those scheduler features to restrict a processor whose cache has too many single-bit (correctable) errors. If the failure did cause a crash, the CPU is disabled after rebooting. The power-on test also detects the failed CPU and automatically disables it. If ECC-correctable errors grow beyond a certain threshold, SGI says IRIX can deallocate those pieces of memory so they are not used by the operating system for further read/write operations. IRIX also offers alternate pathing of network and disk I/O, providing automatic failover to a second already-configured Ethernet or SCSI adapter. Despite strong resilience to failure, SGI does not yet provide any support for dynamic reconfiguration of CPUs, memory, or I/O devices.

    Finally, IRIX stands out for its hardware processor-partitioning capability, a reliability enhancement that allows several operating-system images to run on a single large server, ensuring that in the case of an operating-system or hardware failure in one partition, other partitions will remain up. While this partitioning is inferior to Sun's Dynamic Domains capability (which allows partition sizes to be shrunk or expanded at run-time rather than at reboot), it is available across SGI's whole Origin 2000 product line. While the ability to run different versions of the operating system and to tune partitions can essentially be accomplished using a cluster, those clusters cannot aggregate their compute resources to form a single-system image upon reboot as the SGI processor-partitioning scheme allows.

    SGI has made significant strides forward in its HA failover offerings, largely catching up to the other major UNIX vendors. In 1999, SGI extended its FailSafe product with version 2.0 from a dual-node-only failover solution to one providing eight-node multidirectional and cascading failover. FailSafe now includes disaster-recovery and remote data-replication capabilities over two-kilometer Ethernet or FC connections. Management of the cluster can occur within a single GUI environment called IRIS Console, and preconfigured failover scripts are available for a wide range of server scenarios, including web, email, NFS, and Samba file serving, and Oracle or Informix database serving. Still, SGI currently lacks software RAID 5 support or a cluster file system to enable single-system-image clusters. A cluster file system is currently under development, however.

    IRIX supports efficient kernel dumps, allowing both minimal dumps and compressed dumps. These dumps can then be analyzed (in compressed form) by a software tool (icrash) and by the hardware diagnostic processor (FRU), which performs an automatic analysis of the kernel dump and generates a list of probable causes, ordered by the percentage likelihood that a specific item is causing the failure. The crash information can also be sent back to SGI automatically over the network or a dial-out modem for further analysis or response. IRIX does not support compression for user-level program dumps, since unlike full system dumps, they generally are not multiple gigabytes in length.

    SGI also provides availmon as a standard part of IRIX, embedded in the system boot and shutdown processes. The availmon utility differentiates between controlled shutdowns, system panics, system hangs, power cycles, and power failures. Uptime is tracked by a lightweight daemon, and diagnostic information is collected from icrash, syslog, hinv, versions, and gfxinfo. All availability and diagnostic data for cooperating systems are maintained in an SGI database (a check-box presented upon installation asks whether users want to send failure information to SGI over the Internet). This database provides SGI with overall reliability data and a specific problem history for individual machines. While it remains unclear whether IRIX can log console error messages to a file, IRIX systems do provide an unattended reboot capability in case of system failure in an attempt to restore services.

    SGI has integrated checkpoint-restart into the kernel, and it is now bundled beginning with IRIX 6.5, a further improvement over i