An Evaluation of Alternative Designs for a Grid Information Service


Cluster Computing 4, 29–37, 2001. © 2001 Kluwer Academic Publishers. Manufactured in The Netherlands.


WARREN SMITH ∗, ABDUL WAHEED ∗∗, DAVID MEYERS ∗∗∗ and JERRY YAN
NASA Ames Research Center, Moffett Field, CA 94035-1000, USA

Abstract. Computational grids consisting of large and diverse sets of distributed resources have recently been adopted by organizations such as NASA and the NSF. One key component of a computational grid is an information service that provides information about resources, services, and applications to users and their tools. This information is required to use a computational grid and therefore should be available in a timely and reliable manner. In this work, we describe the Globus information service, describe how this service is used, analyze its current performance, and perform trace-driven simulations to evaluate alternative implementations of this grid information service. We find that the majority of the transactions with the information service are changes to the data maintained by the service. We also find that of the three servers we evaluate, one of the commercial products provides the best performance for our workload and that the response time of the information service was not improved during the single experiment we performed with data distributed across two servers.

Keywords: computational grid, grid information service, LDAP

1. Introduction

Computational grids [3] consisting of large and diverse sets of distributed resources have recently been adopted by organizations such as NASA in their Information Power Grid (IPG) [7,9] and NSF in their Partnership for Advanced Computing Infrastructure (PACI) effort [11,12]. The key middleware supporting computational grids is the Globus toolkit [4]. The Globus toolkit provides services such as security, communication, managing distributed applications, remote data transfer, and information. The increase in the number of resources and users in Globus-based computational grids has highlighted deficiencies in the current implementation of some of these services. In particular, the implementation of the Globus Grid Information Service (GIS) was insufficient to handle the loads being placed on it until the recent addition of a second server. The result of the high load on the GIS was that queries made by users were not being fulfilled in a timely manner and therefore, users could not effectively locate and determine how to access the resources available in the Globus-based computational grids.

The goal of this study is to examine the demands made on the Globus GIS and evaluate how well different GIS implementations can meet those demands. We begin in section 2 by describing the current Globus GIS and how it is used by Globus software and users.

∗ Computer Sciences Corporation.
∗∗ MRJ Technology Solutions.
∗∗∗ Directory Research L.L.C.
This work was partially funded by grant 08.008.005.002 from the Research Institute for Advanced Computer Science at NASA Ames Research Center.

In section 3, we use trace data obtained from the Globus GIS to study the load that was placed on the GIS and characterize who is accessing the GIS for what reasons. We find that the majority of the accesses made to the GIS are for the purposes of modifying the information stored in the GIS. This is contrary to the assumption made by most of the commercial software used to implement grid information services. These implementations assume that the vast majority of the operations will be searches. This access pattern is also similar to the access patterns expected for Directory Enabled Networking [8] and our results should therefore apply to that domain. We also find that fairly high demands are placed on the GIS: there are typically 90 connections open to the GIS and 8.8 operations per second occur on average.

Section 4 describes how we use trace-driven simulation to evaluate GIS configurations, presents the results of these simulations, and analyzes these results. We find that of the three servers we evaluate, the server provided by Vendor 1 exhibits the best performance on the hardware we used for evaluation. We also find that using indexes improves search performance by over 90% and does not decrease update performance when only a small percentage of the updates are made to entries that are indexed. Finally, we find that distributing the GIS data across two servers decreases the time to perform modifications by 14–25% but increases the average search time by 18–21%. This increase in search time is negated at higher loads because it takes longer to connect and bind to a single server under these conditions. Section 5 describes the changes to Globus in version 1.1.3 that will affect the Globus GIS and discusses the effects we believe they will have on GIS performance and reliability. Section 6 presents our conclusions and future work.


2. Metacomputing directory service

The Metacomputing Directory Service (MDS) [2] is the grid information service of the Globus project. The MDS is a repository of information for using computational grids. It contains information about organizations, people, computers, networks, software, applications, and project-specific data. The MDS is accessed using the Lightweight Directory Access Protocol (LDAP) [5,6] and data in the MDS is organized as entries in a hierarchical tree called the directory information tree. The location of an entry in the directory information tree is based on the organizations and other entries it is associated with. For example, a Portable Batch System scheduler interface for an SGI Origin computer system at NASA Ames would be located in the directory information tree (moving towards the root of the tree) at: service=jobmanager-pbs, hn=origin.arc.nasa.gov, ou=Ames Research Center, o=National Aeronautical and Space Administration, o=Globus, c=US. Each entry in the tree is a set of attributes where an attribute has a name and one or more values. The names are text strings and the values can be of any number of pre-defined types, but are typically strings. For example, the entry for the above PBS scheduler interface might contain the name of the host it is running on, the port it is listening on, the type of scheduler available, how many nodes are managed and available through the interface, properties of the scheduler, and so forth. Details of the MDS directory information tree and the types of entries that are defined are available in [2].
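To make the entry structure concrete, the following sketch reads the example entry above through JNDI, the Java LDAP interface used by the clients in our later experiments. The MDS server URL and the attribute names (hostname, port) are illustrative assumptions rather than values taken from the MDS schema.

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

public class ReadMdsEntry {
    public static void main(String[] args) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://mds.example.org:389"); // hypothetical MDS server

        DirContext ctx = new InitialDirContext(env); // anonymous (unauthenticated) connection
        // Distinguished name of the example entry from the text, leaf component first.
        String dn = "service=jobmanager-pbs, hn=origin.arc.nasa.gov, "
                  + "ou=Ames Research Center, "
                  + "o=National Aeronautical and Space Administration, o=Globus, c=US";
        Attributes attrs = ctx.getAttributes(dn);
        System.out.println("hostname = " + attrs.get("hostname")); // assumed attribute names
        System.out.println("port     = " + attrs.get("port"));
        ctx.close();
    }
}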

The LDAP protocol supports addition, deletion, and modification of entries in the directory information tree and allows clients to search an LDAP server for entries that satisfy specified search constraints. The LDAP communications between the client and the server can be unauthenticated, authenticated with an identity and password, or authenticated with an identity and password over a Secure Sockets Layer (SSL) connection. Currently, the MDS is not using SSL connections. LDAP databases and clients are provided by many vendors. In this work, we are concerned with implementations provided by OpenLDAP and two companies, Vendor 1 and Vendor 2, that we cannot identify for licensing reasons. Currently, the Globus software uses the OpenLDAP client software to access the MDS, which is contained in two Netscape LDAP servers located at the National Center for Supercomputing Applications (NCSA) in Champaign-Urbana, Illinois.
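As an illustration of the bind options described above, a minimal JNDI sketch of an authenticated connection (simple authentication, no SSL) follows; the server URL, bind identity, and password are hypothetical.

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingException;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

public class BindExample {
    public static void main(String[] args) throws NamingException {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://mds.example.org:389");        // hypothetical server
        env.put(Context.SECURITY_AUTHENTICATION, "simple");                 // identity and password
        env.put(Context.SECURITY_PRINCIPAL, "cn=manager, o=Globus, c=US");  // hypothetical identity
        env.put(Context.SECURITY_CREDENTIALS, "secret");

        DirContext ctx = new InitialDirContext(env); // the connect and bind happen here
        // ... adds, deletes, modifies, and searches would be issued on ctx here ...
        ctx.close();                                 // unbind and close the connection
    }
}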

As the number of entities participating in the Globus computational grid increased, the response time of a single server became inadequate. Fortunately, LDAP servers support several techniques to improve response times. First, data can be distributed across several computer systems. This is accomplished by placing sub-trees of the directory information tree on different hosts and accessing these sub-trees through LDAP referrals. For example, all data from NASA Ames can reside in a database on an LDAP server running on a machine at NASA Ames. This approach may increase the number of adds, deletes, and modifies that can occur in a given time interval but may increase the response time of searches if the searches have to contact multiple servers to get their results.

Second, data can be replicated on multiple hosts. This approach may increase the number of searches that can be serviced in a given time by servicing the searches on different hosts, but it may decrease the number of adds, deletes, and modifies that can occur in a given time since any changes made must be propagated to the replicas. Third, the LDAP servers can be tuned to improve their response time. This tuning varies from implementation to implementation, but one example is the creation of indexes that allow for faster searches. As mentioned above, the Globus MDS data has recently been distributed across two servers at NCSA and this has significantly improved the access times to the data contained in the MDS.

There are several different Globus software modules that update information in the MDS. First, when a computer system is initially added to the Globus grid, a setup script populates the MDS with information about the host, its network interfaces, the networks it is attached to, and the Globus software running on the host. Second, there is a set of scripts that are run periodically to update information about the computers, networks, and software available through the grid. Third, there are MDS updates associated with the Globus Resource Allocation Manager (GRAM) [1]. The GRAM is used to start applications on remote computer systems and there are two GRAM components on remote computer systems that interest us here. The GRAM job manager is a daemon that is started for each application. The job manager starts, monitors, and manages an application and informs a second software module, the GRAM reporter, of the application state. Periodically, the GRAM reporter will determine the number of available nodes on the computer system it is associated with, determine the status of any queues associated with the GRAM, determine the users that can use the GRAM, gather the state of the applications submitted through the GRAM, and update all of this information in the MDS. By default, this process is performed every 30 s and user and job information is not published into the MDS. Many sites do not publish the user information for security reasons, and do not publish job information due to the load it places on the MDS.
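A minimal sketch of the kind of periodic update a GRAM reporter performs is shown below, assuming a hypothetical MDS server and attribute names; the real reporter updates many more attributes and can also publish user and job entries.

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.directory.BasicAttribute;
import javax.naming.directory.BasicAttributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;

public class ReporterSketch {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://mds.example.org:389"); // hypothetical server
        DirContext ctx = new InitialDirContext(env);

        String dn = "service=jobmanager-pbs, hn=origin.arc.nasa.gov, ou=Ames Research Center, "
                  + "o=National Aeronautical and Space Administration, o=Globus, c=US";

        while (true) {
            BasicAttributes changes = new BasicAttributes();
            changes.put(new BasicAttribute("freenodes", String.valueOf(queryFreeNodes())));
            changes.put(new BasicAttribute("lastupdate", String.valueOf(System.currentTimeMillis())));
            // Replace only the listed attributes; other attributes of the entry are untouched.
            ctx.modifyAttributes(dn, DirContext.REPLACE_ATTRIBUTE, changes);
            Thread.sleep(30_000); // the default reporter period described above: every 30 s
        }
    }

    private static int queryFreeNodes() {
        return 16; // placeholder for asking the local scheduler
    }
}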

There are many possible ways that users can use the MDS data. One common way is that when a user uses the globusrun program to start an application on a host, the user can specify the hostname, and globusrun will contact the MDS to find the host name, port number, and other information necessary for GRAM to start an application on the remote computer. Another common use of the MDS is to query the status of applications that are started on remote systems. Users have not typically employed the MDS in more sophisticated ways because the response time of the MDS was not sufficient to support these activities.
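The following sketch illustrates the kind of lookup globusrun performs: a search under a host's entry for job-manager services, reading back contact information. The server URL, search filter, and attribute names are assumptions for illustration.

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

public class JobManagerLookup {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://mds.example.org:389"); // hypothetical server
        DirContext ctx = new InitialDirContext(env);

        // Search below the host entry for any job-manager service entries.
        String base = "hn=origin.arc.nasa.gov, ou=Ames Research Center, "
                    + "o=National Aeronautical and Space Administration, o=Globus, c=US";
        SearchControls controls = new SearchControls();
        controls.setSearchScope(SearchControls.SUBTREE_SCOPE);

        NamingEnumeration<SearchResult> results =
                ctx.search(base, "(service=jobmanager*)", controls);
        while (results.hasMore()) {
            SearchResult r = results.next();
            Attributes a = r.getAttributes();
            System.out.println(r.getNameInNamespace()
                    + "  host=" + a.get("hostname") + "  port=" + a.get("port")); // assumed attributes
        }
        ctx.close();
    }
}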

A new release of Globus, version 1.1.3, occurred in June of 2000 and there are changes that affect the MDS in this release. The most significant change is that the MDS will be a highly distributed information service with an OpenLDAP server running on every host that provides compute cycles to remote users through Globus. These LDAP servers will maintain local data and can also be configured to push data to organizational LDAP servers that maintain data from a group of hosts that are running the Globus software. A further description of this approach is provided in section 5 along with our thoughts on this approach.

Figure 1. The number of open connections to the MDS server over time.

Figure 2. Histogram of connection durations.

3. Workload characterization

In this section, we analyze 20 h of trace data recorded from the Globus MDS. This data consists of all of the accesses to the LDAP server from when the server restarted on February 24, 2000 to when it restarted again on February 25. During this time the MDS was contained in a single server located at NCSA. These 20 h of data contain 86,695 connections to the server and 143,446 adds, deletes, modifies, and searches. If we also consider connects, binds to an identity, responses to requests, unbinds from an identity, and closes, then there are 633,672 operations in the workload, or an average of 8.8 operations per second.

Figure 1 graphs the number of open connections at any given time. The data shows that there are typically 90 connections open at any time with two spikes of over 900 active connections. We have not been able to determine an exact cause for these spikes. Note that they appear to be periodic. We did not find any periodic increases in connections in the workload so we assume these spikes are due to some periodic actions going on inside the LDAP server. Figure 2 presents a histogram of connection duration. This data shows that there are a large number of relatively short-duration connections. In fact, 88% of the connections last less than 120 s and 97% of the connections last less than 240 s. Examining the data closer, we determine that the long-duration connections are those where a user connects to the information service and periodically searches for the state of their application using that connection. Figure 3 shows a histogram of the number of adds, deletes, modifies, or searches per connection. The data shows that the vast majority of the connections have relatively few operations. In fact, 97% of the connections have two or fewer of these operations. Examination of this data shows that the connections with the most operations are those where a user is periodically searching for the state of their application. There are also a fairly high number of operations per connection when a GRAM reporter updates information when publishing of job information is enabled.

Figure 3. Histogram of operations per connection.

Each connection consists of a connect, a bind to an identity, one or more adds, deletes, modifies, or searches, an unbind, and a close. Table 1 shows the number of add, delete, modify, and search operations in the trace data. As one can see, the majority of the operations are modifies. These modifications come from GRAM reporters to update information such as job status, the load on workstations, the nodes available through schedulers, and so forth. Modifications are also used to touch objects so that clients will know that Globus daemons were up in the recent past. There are relatively few entries added and deleted because only entries for jobs are added and deleted and very few computer systems are publishing this information due to the load it places on the MDS. Entries can also be added to the MDS when new organizations start using the MDS, but these events are relatively rare and did not occur in the trace data analyzed here. There are relatively few searches because at the time this data was recorded, users were avoiding searches of the MDS because these searches were not returning results for long periods of time. Table 1 also presents the number of errors that occur during the operations.

Table 1
Occurrences of LDAP operations.

Operation    Number of     Percent of total    Number of    Percent of operations
             operations    LDAP operations     errors       resulting in error
Add                1044          0.73               943            90.33
Delete               81          0.06                 6             7.41
Modify           134611         93.84              3807             2.83
Search             7710          5.37              5867            76.10
Total            143446        100.00             10623             7.41

The data shows that a high percentage of the add and search operations result in errors. Most of the errors that occur during add operations occur when Globus software first tries to modify an entry in the MDS, the modify fails, an add is attempted, and it also fails. The modify typically fails because the bind to an identity failed. These successions of failures can be avoided by responding correctly to the LDAP error codes that are generated: an add should only be attempted after a failed modify if the modify failed because the entry does not exist. The search operations also result in a high percentage of errors. Almost all of these errors are caused by the searches timing out before they complete because the server was too highly loaded.
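A sketch of the corrected update logic is shown below: the add is attempted only when the modify fails because the entry does not exist, rather than after every failed modify. The helper method and its parameters are hypothetical.

import javax.naming.NameNotFoundException;
import javax.naming.NamingException;
import javax.naming.directory.Attributes;
import javax.naming.directory.DirContext;

public class UpdateOrAdd {
    // Update an existing entry, adding it only if the modify reports that the
    // entry does not exist. Any other failure (e.g., a failed bind) propagates
    // to the caller instead of triggering an add that would also fail.
    static void updateOrAdd(DirContext ctx, String dn, Attributes changes,
                            Attributes fullEntry) throws NamingException {
        try {
            ctx.modifyAttributes(dn, DirContext.REPLACE_ATTRIBUTE, changes);
        } catch (NameNotFoundException missing) {
            // Only this error means the entry is absent, so add it now.
            ctx.createSubcontext(dn, fullEntry).close();
        }
    }
}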

We also use this trace data to classify the connections and identify what type of entity initiated the connection and for what purpose. These classifications are shown in table 2. As one can see, we can classify almost 100% of the connections and the majority of the connections, 67%, are modifications of data made by GRAM reporters.

4. Experimental analysis

We use a set of experiments to evaluate the performance of LDAP server implementations, implementation-specific LDAP configurations, and distribution of an LDAP directory information tree across multiple hosts. Our approach in this work is to evaluate GIS configurations by starting one or more LDAP servers that will act as the GIS on one or more systems, loading these servers with the contents of the Globus MDS as of February 24, 2000, and then replaying 20 h of accesses that were made to the Globus MDS server between February 24, 2000 and February 25, 2000 from one or more workstations.

Table 2
Classification of MDS connections.

Number of     Percent of    Description
connections   connections
58476         67.45         Modification to job manager, the queues it can submit to, and the jobs it has submitted.
364            0.42         Modification of host information. This includes entries for the host, its views from various networks, and the Globus software running on the host.
73             0.08         Modification of network information.
7              0.01         Modification of software information.
71             0.08         Deletion of jobs.
191            0.22         Search for job status.
50             0.06         Search for job managers.
1494           1.72         Search by the MDS monitor.
1234           1.42         Search for all Globus Physical Resource objects. We do not currently know what entity is generating these searches.
22             0.03         Search for all objects. These are most likely users performing tests.
0              0.00         Unclassified adds.
5              0.01         Unclassified deletes. Deletions are so infrequent that we do not classify them.
4440           5.12         Connections with a bind and unbind but no operations. We do not currently have an explanation for these connections.
20232         23.34         Connections containing a connect, a bind failure, an unbind, and a close.
86659         99.96         Classified connections.
86695        100.00         Total number of connections.


Table 3
Performance of operations on individual LDAP servers under varying load.

Load    LDAP        Connect and    Unbind    Add      Delete    Modify    Search    Weighted
        server      bind (ms)      (ms)      (ms)     (ms)      (ms)      (ms)      average (ms)
<0.5    OpenLDAP        150           1        101       108       317      1657         387
0.5     Vendor 1         67           2         59        61        73      1351          81
        Vendor 2        120           5        314       756       381       782         215
1.0     Vendor 1         70           2         92        88       103      1443          99
        Vendor 2        208           3        973      1660       828       945         436
1.5     Vendor 1       2110           3        114       136       134      1556         673
        Vendor 2       2255           1       6050      6094      3987      4056        2429
2.0     Vendor 1       4083           4        107       116       128      1641        1213
        Vendor 2      18154           1      27439     28914     22063     18905       14893

The clients on the workstations that exercise the LDAP servers are written in Java and use the Java Naming and Directory Interface (JNDI). The trace data used for these simulations is derived from the data that we analyzed in section 3.

The data used in the simulation differs from the recorded data in that the recorded data does not include the actual modifications made to entries or the actual contents of the entries that were added to the LDAP servers. We construct this data off-line using the data in the MDS and our knowledge of which attributes Globus modifies. We perform different experiments and adjust the load on the LDAP servers by simulating the trace data faster or slower than real time. Our testing environment consists of a Sun UltraSparc 30 with one 296 MHz CPU, 512 MB of memory and UltraSCSI disk drives running Solaris 2.6, where we run most of our experiments, and a Sun UltraSparc 10 with a 333 MHz CPU, 128 MB of memory and an UltraSCSI disk drive that we use for experiments when two servers are used to provide an information service. We use several client systems in different geographical locations to test OpenLDAP server version 1.2.11 and LDAP products from Vendor 1 and Vendor 2 running on one or both of these systems. We evaluate a GIS configuration using the response time of the LDAP commands and whether the LDAP servers continue to operate.
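To illustrate how a replay client measures response times, the following simplified sketch times one connect-and-bind and one search; the server URL, search base, and filter are assumptions, and the real simulator replays the recorded trace and enforces connection limits.

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

public class TimedTransaction {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://server-under-test.example.org:389"); // hypothetical

        long start = System.currentTimeMillis();
        DirContext ctx = new InitialDirContext(env);           // connect and bind
        long bindMs = System.currentTimeMillis() - start;

        SearchControls controls = new SearchControls();
        controls.setSearchScope(SearchControls.SUBTREE_SCOPE);
        start = System.currentTimeMillis();
        NamingEnumeration<SearchResult> results =
                ctx.search("o=Globus, c=US", "(objectclass=*)", controls);
        while (results.hasMore()) {
            results.next();                                     // drain the results so the timing
        }                                                       // covers the complete response
        long searchMs = System.currentTimeMillis() - start;

        ctx.close();                                            // unbind and close
        System.out.println("connect+bind " + bindMs + " ms, search " + searchMs + " ms");
    }
}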

There are many possible GIS designs and we only aim to evaluate a few in this work. First, we evaluate the relative performance of the OpenLDAP, Vendor 1, and Vendor 2 LDAP servers. Second, we evaluate the performance effects of using indexes to improve search performance. Third, we evaluate the performance of a GIS that distributes data over two LDAP servers using referrals. Fourth, we discuss the advantages and disadvantages of data replication in our environment.

4.1. Comparison of LDAP servers

To compare the performance of the LDAP servers from OpenLDAP, Vendor 1, and Vendor 2, we start one of these servers on our test system, load the LDAP server with the MDS contents from February 24, as described above, and then use a simulator running on a workstation to exercise the LDAP server under test. The simulator can replay the operations in the workload, replay some fraction of the operations, or replay more than one copy of some of the operations in the workload. The simulator can also be programmed with constraints such as the maximum number of open connections at a time and the maximum number of new connections that can be opened per second. Table 3 summarizes the results of these experiments. The load column indicates the load that was placed on the server under test. For the Vendor 1 and Vendor 2 results, the load is the fraction of operations in the original workload that are replayed during a 20 h experiment. For the OpenLDAP server we indicate a load of less than 0.5, but this does not directly relate to the fraction of the operations in the original trace that were performed. We found that the OpenLDAP server failed when we attempted to perform simulations at a load of 1.0 or 0.5. The server failed by not responding to queries in the middle of the simulations. To complete a simulation, we limited the maximum number of open connections to 50 and the maximum number of new connections per second to 20, roughly half the average of 90 open connections that occur during a real-time simulation. These are the results reported for OpenLDAP in table 3, and this leads to our first result: the OpenLDAP server is the only server of the three we tested to fail under the loads we placed on it.

The table also shows the performance results for the two commercial LDAP servers with loads between 0.5 and 2.0. Several observations can be made from this data. First, we examine search performance and find that the Vendor 2 server performs better at lower loads while the Vendor 1 server performs better at higher loads. The Vendor 2 server performs searches 42 and 35% faster at loads of 0.5 and 1.0, respectively, but the Vendor 1 server performs searches 62 and 91% faster at loads of 1.5 and 2.0, respectively. We also observe that the search times for the server from Vendor 1 do not dramatically increase as the load on the server is increased, while the search times for the Vendor 2 server do dramatically increase as the load increases. The search time for Vendor 1 only increases by 21% as the load quadruples from 0.5 to 2.0. We believe that the search time for Vendor 2 increases so dramatically because this server is attempting to optimize search performance by, for example, maintaining indexes for all of the entries in the database (indexes and their performance effects are discussed further in section 4.2). Optimizing search performance is an excellent characteristic for the typical data in LDAP servers that is not modified very often, but trading improved search performance for an increased amount of work to perform for each modification is not the best choice in our environment. We believe that the large amount of work resulting from maintaining indexes for a large number of changes to the data results in fewer resources to perform the searches and therefore larger search times, the opposite of what was intended.

Second, we observe that the times to perform add, delete, and modify operations are always lower for the Vendor 1 server. The performance of Vendor 1 is from 81 to 99% better than Vendor 2, with the performance difference increasing as the load increases. We believe the main cause of this increasing performance gap is the extra work the Vendor 2 server is performing when data is changed, as described above. One measure of the amount of work being performed is the CPU load of the system running the LDAP server. For the Vendor 2 server, we observed that the CPU load was typically over 5 (measured with the Unix uptime command) while the CPU load for the Vendor 1 server was at most 0.1. This also seems to indicate that the Vendor 2 server is targeted towards multiprocessor systems.

Third, we observe that for Vendor 1, the times for the add, delete, modify, and search operations only increase a relatively small amount (the largest increase is 21%, for searches) as the load quadruples from 0.5 to 2.0. We also observe that the time to connect and bind to an identity does increase significantly as the load increases. This behavior may occur because the Vendor 1 server has a default maximum of 20 threads. This places a limit on the number of operations (possibly including connects and binds) that occur simultaneously and therefore limits how poor the add, delete, modify, and search performance can become.

Fourth, we observe that the Vendor 2 server does not perform well as the load increases. The Vendor 2 server can also be configured to set various properties, including the number of threads used to serve requests. Unfortunately, the default configuration does not perform as well as that of the Vendor 1 server. The response times of the connect and bind, add, delete, modify, and search operations increase super-linearly as the load increases, resulting in operations that take between 19 and 29 s at a load of 2.0, compared to between 0.1 and 4 s for the Vendor 1 server.

4.2. Indexing

One technique that is used to improve the performance of searches is indexing. An index essentially stores search results for quick lookups when a search occurs. For example, an index can be maintained for an operating system attribute so that a search for all Solaris computer systems is quickly responded to by accessing the index.

Table 4
Performance of the LDAP server from Vendor 1 with and without an approximate index for the GlobalJobID attribute.

Load    Indexing    Add     Delete    Modify    Search
                    (ms)    (ms)      (ms)      (ms)
0.5     No            86      93        82       2020
        Yes           82      69        69        174
1.0     No           105      87       101       2026
        Yes          100      93        89        143
1.5     No           142     109       137       2201
        Yes          136     103       133        158
2.0     No           154     137       155       2338
        Yes          126     137       142        157

The index would be an equality index on the operating system attribute that would contain a list of entries associated with the value Solaris. These entries would be all of the entries in the directory that have a value of Solaris for the operating system attribute.

The disadvantage to indexes is that they have to be updated whenever an attribute they are indexing is changed. This adds overhead to the add, modify, and delete operations to maintain any indexes that refer to any of the attributes in the entries that are changed. We evaluate the performance of indexes by adding an index to the server from Vendor 1 to improve the performance of the 63% of the searches that are made to determine the status of jobs and then performing a real-time simulation. These searches are performed over the whole directory tree to look for entries with GlobalJobIDs that contain the name of the system that the job is executing on. To improve the performance of these searches, we add an approximate index on the GlobalJobID attribute.
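For reference, the job-status searches that this index accelerates are approximate-match searches over the whole tree; a JNDI sketch of such a search is shown below, with the server URL, job identifier, and status attribute name as illustrative assumptions.

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

public class JobStatusSearch {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://mds.example.org:389"); // hypothetical server
        DirContext ctx = new InitialDirContext(env);

        SearchControls controls = new SearchControls();
        controls.setSearchScope(SearchControls.SUBTREE_SCOPE);       // the whole directory tree

        // Approximate match (~=) against GlobalJobID, the kind of search the index accelerates.
        NamingEnumeration<SearchResult> jobs =
                ctx.search("o=Globus, c=US", "(GlobalJobID~=origin.arc.nasa.gov)", controls);
        while (jobs.hasMore()) {
            SearchResult job = jobs.next();
            System.out.println(job.getNameInNamespace()
                    + "  status=" + job.getAttributes().get("status")); // assumed attribute name
        }
        ctx.close();
    }
}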

The results of these experiments are shown in table 4. The table shows the response times of only the add, delete, modify, and search operations that access entries that contain the GlobalJobID attribute. First, the data shows that the time to add, delete, or modify these entries does not change significantly whether or not an index is used. This result is unexpected because we assumed that maintaining an index would add significant overhead whenever the entries that are indexed are changed. One important factor to note is that the fraction of modifications that are to entries with a GlobalJobID attribute is about 1%. It may be the case that if a higher percentage of the modifications are made to entries that are indexed, the small amount of extra work for each of these modifications could lead to lower overall performance. Second, the data clearly shows that there is a large decrease in search times when using an index. The search times when performing approximate searches for GlobalJobIDs decrease by between 91 and 93% when an index is used.

4.3. Data distribution

Distribution of data can be used to support very large databases and to improve the performance of databases. Distributing data across multiple servers results in more resources being available to handle data modifications and, hopefully, better performance.


Table 5
Performance of the LDAP server from Vendor 1 when data is and is not distributed across two servers.

Load    Distributed    Connect and    Unbind    Add     Delete    Modify    Search
                       bind (ms)      (ms)      (ms)    (ms)      (ms)      (ms)
1.0     No                 70            2        92       88       103      1443
        Yes                71            1        44      137        89      1831
1.5     No               2110            3       114      136       134      1556
        Yes                73            1        47      133       101      1906
2.0     No               4083            4       107      116       128      1641
        Yes                74            1        54      129       109      1996

Data distribution can also improve search performance when the searches access data from only a few servers. In this situation, more resources are available to satisfy searches. If searches access data from many servers, many transactions have to occur to obtain the search results and this can reduce search performance.

We are interested in distributing data across multiple servers because of the large percentage of operations in our environment that change the data. Globus users observed a dramatic performance improvement when the Globus GIS moved from a single server to two servers several months ago. We wish to perform simulations to characterize the effects of distributing data across more than one server. We distribute our data across two servers from Vendor 1 on the two Solaris systems previously described in the same way that the Globus GIS currently distributes its data: server 1 contains all of the data from NASA and the NSF Alliance sites and server 2 contains all of the other data. The second server also contains referrals to the NASA and NSF Alliance data on the first server. Referrals are special LDAP entries that point to where data is actually located. When a client accesses one of these special LDAP entries, the client is referred to the LDAP server that actually contains the data for entries rooted at that position in the directory information tree. These referrals allow searches over all of the data in the database to begin at the second server and be referred to the first server to search those entries.
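For illustration, a JNDI client can be told to follow referrals automatically, so a search that starts at the second server is transparently continued on the first; the following sketch assumes hypothetical server names and a whole-tree search.

import java.util.Hashtable;
import javax.naming.Context;
import javax.naming.NamingEnumeration;
import javax.naming.directory.DirContext;
import javax.naming.directory.InitialDirContext;
import javax.naming.directory.SearchControls;
import javax.naming.directory.SearchResult;

public class ReferralSearch {
    public static void main(String[] args) throws Exception {
        Hashtable<String, String> env = new Hashtable<>();
        env.put(Context.INITIAL_CONTEXT_FACTORY, "com.sun.jndi.ldap.LdapCtxFactory");
        env.put(Context.PROVIDER_URL, "ldap://server2.example.org:389"); // hypothetical second server
        env.put(Context.REFERRAL, "follow"); // chase referral entries instead of returning them
        DirContext ctx = new InitialDirContext(env);

        SearchControls controls = new SearchControls();
        controls.setSearchScope(SearchControls.SUBTREE_SCOPE);

        // A search over the whole tree: entries held only on server 1 are reached
        // through the referrals stored on server 2.
        NamingEnumeration<SearchResult> results =
                ctx.search("o=Globus, c=US", "(objectclass=*)", controls);
        while (results.hasMore()) {
            System.out.println(results.next().getNameInNamespace());
        }
        ctx.close();
    }
}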

The results of our experiments are shown in table 5. For these experiments, we assume that the clients know which server contains the data they are interested in. If an add, delete, or modify is made to an entry for NASA or the NSF Alliance sites, the client will communicate directly with server 1. If a search is performed starting from within the NASA or NSF Alliance sub-trees, it will communicate directly with server 1. All other accesses will communicate with server 2. If a search is performed over data in both servers, the search will start with server 2 and also be referred to server 1 to search there.

We find that distributing data across two servers decreases the response time when adding data by 50–59% and decreases the response time when modifying data by 14–25%. When deleting entries, there is not a clear trend in response times. We believe this is an artifact of the relatively few deletions that occur (see table 1) and therefore the relatively few response times we average. The data also shows that the time to perform a search is 18–21% lower when data is not distributed. The faster searches occur when only one server is used because almost all of the searches search the entire directory tree and must therefore query all of the servers used to hold the data. A final important fact to note is how the time to connect and bind to the LDAP servers increases as the load increases when only one server is used. This is an important consideration at higher loads. For example, at a load of 2.0, if a user performs a connect, bind, search, and then an unbind, it takes 5728 ms when using one server but only 2071 ms when using two servers. The slower connects and binds when using a single server outweigh the benefits of faster searches.

4.4. Data replication

Another way to use multiple computer systems to store our data is to replicate data on one or more servers. Replication improves search performance by having more resources available to perform searches and improves reliability by having data still available when a server goes down. The disadvantage of replication is that when data is changed, these changes must be propagated to the replicas of the data and this adds overhead. At this time, we do not evaluate replication because of the relatively small number of searches in our workload and the relatively large number of modifications.

5. Globus 1.1.3 Grid information service

The Globus group has made several changes in Globus version 1.1.3 to attempt to improve the performance of their information service. The major change is that the default information service is highly distributed to lower the number of changes made to the data on any single server and eliminate the bottleneck caused by having many data updates go to only a few servers. By default, each host that supports application execution via Globus has a Grid Resource Information Server (GRIS) on it. The GRIS consists of the OpenLDAP front end layered over the GRAM reporter (described in section 2) that provides information about the host the GRIS is running on, including the host itself, the software on the host, the users who can access the host, and the applications running on the host.


The other new component of the Globus 1.1.3 GIS is organizational LDAP servers. An organizational server is an LDAP server (Globus will configure an OpenLDAP organizational server if it is asked to) that contains referrals to the GRIS servers it is associated with. For example, an organizational server would contain an entry for each of the GRIS servers in that organization and each of these entries would refer to a GRIS server on a machine in the organization. This configuration results in a “pull” model for retrieving data: when a user performs a search, an organizational server queries the GRIS servers that may contain the data the user is interested in and then passes this data to the user. This is very different from the “push” model used by earlier Globus releases where the GRAM reporter pushed data to remote LDAP servers.

We have not evaluated these changes to the Globus information service using experiments but we do have some initial thoughts. First, having a large number of LDAP servers will mean fewer accesses to each server and therefore faster response times. The difficulty is that the data of interest to users is now widely distributed. If a user is interested in information from a small set of hosts, we do not believe it will add a large overhead to perform a small number of queries to different servers to find the information. If a user is interested in information that comes from a large number of hosts, we believe it will take an unacceptably long time to query all of the hosts that have the information. This is where organizational servers that aggregate grid information can improve search performance. Searches that examine data from a large number of hosts can query a smaller number of organizational servers to find their results in an acceptable period of time if the organizational servers cache the data they pull from GRIS servers. The next problem is that if users perform searches for dynamic information from many hosts, the data cached in the organizational servers will not be up to date and must be pulled from the GRIS servers. This means that a search for dynamic information sent to an organizational server can require many pulls of data from GRIS servers and the potential benefits of having an LDAP server on each Globus host are negated. If this situation occurs in practice, it implies that there is no reason to have an LDAP server on each Globus host.

To summarize our analysis, the effectiveness of the changes to the Globus information service in version 1.1.3 will depend on how users want to access data from this service. If users do not perform many searches for dynamic data produced by a significant number of hosts, then this approach should provide good performance. If users do wish to search for dynamic data produced by a significant number of hosts, the OpenLDAP servers on the Globus hosts will not improve performance and a set of organizational servers that maintain up-to-date information should be used.

6. Conclusions

In this paper, we described our investigation of alternative designs for a grid information service. We described the Globus grid information service and how the Globus toolkit and users access this information service. We analyzed trace data obtained from the Globus information service and found that the majority of the operations are modifications of existing data, that the information service has roughly 90 connections open at any given time, and that the information service performs 8.8 operations per second. We described our methodology for experimentally evaluating LDAP server designs using trace data and contents obtained from the Globus grid information service and we evaluated the OpenLDAP server and two servers from vendors we cannot specify. We found that the OpenLDAP server failed when we attempted to place our recorded load upon it. For searches, we found that the Vendor 2 server has better performance at lower loads and the Vendor 1 server has better performance at higher loads. The Vendor 2 server performs 35–42% better for lower loads while the Vendor 1 server performs 62–91% better for higher loads. Further, the search performance of the Vendor 1 server only degraded slowly as we quadrupled the load while the search performance of the Vendor 2 server degraded super-linearly. We hypothesize that the Vendor 2 implementation is highly optimized for searching and for executing on multiprocessor computer systems. We find that indexing can be used to reduce the response time of searches by more than 90% without increasing update times, if only a small percentage of the updates are made to indexed entries. Finally, we distributed our directory information tree across two servers on two computer systems and found that distributing data across two servers decreases the time to perform adds by 50–59% and decreases the time to perform modifies by 14–25%. The disadvantage of data distribution is that since almost all of our searches had to contact both servers, the time to perform a search increased by 18–21%. This disadvantage is negated at higher loads due to much longer times to connect and bind when there is only a single server.

In future work, we will continue to evaluate different configurations for grid information services and we will investigate other factors that impact the performance of a grid information service, such as an increase in the number of users and the use of secure connections to the servers. To assist in this work, we plan to develop a system of synthetic grid entities to apply loads to a proposed grid information service. The current Globus components use the grid information service in a relatively predictable way. This makes it relatively easy to develop synthetic components and have these components apply loads to the grid information service. Further, as described in our workload analysis, users also use the MDS in predictable ways. This allows us to develop synthetic users and evaluate the performance and fault tolerance of the design of a grid information system when there are hundreds or thousands of users. We expect the number of users of computational grids to greatly increase as the middleware grows in stability and more users observe the advantages of using computational grids.


References

[1] K. Czajkowski, I. Foster, N. Karonis, C. Kesselman, S. Martin, W. Smith and S. Tuecke, A resource management architecture for metacomputing systems, Lecture Notes in Computer Science (1998).
[2] S. Fitzgerald, I. Foster, C. Kesselman, G. von Laszewski, W. Smith and S. Tuecke, A directory service for configuring high-performance distributed computations, in: Proceedings of the 6th IEEE Symposium on High-Performance Distributed Computing (1997) pp. 365–375.
[3] I. Foster and C. Kesselman, eds., The Grid: Blueprint for a New Computing Infrastructure (Morgan Kaufmann, 1999).
[4] I. Foster and C. Kesselman, Globus: a metacomputing infrastructure toolkit, International Journal of Supercomputing Applications 11(2) (1997) 115–128.
[5] T. Howes, M. Smith and G. Good, Understanding and Deploying LDAP Directory Services (Macmillan, 1999).
[6] T. Howes and M. Smith, LDAP: Programming Directory-Enabled Applications with Lightweight Directory Access Protocol (Macmillan, 1997).
[7] W. Johnston, D. Gannon and B. Nitzberg, Grids as production computing environments: The engineering aspects of NASA's Information Power Grid, in: Proceedings of the Eighth IEEE International Symposium on High Performance Distributed Computing (1999).
[8] J. Strassner and F. Baker, Directory Enabled Networks (Macmillan, 1999).
[9] The Information Power Grid, http://ipg.arc.nasa.gov.
[10] The Globus Project, http://www.globus.org.
[11] The National Computational Science Alliance, http://www.ncsa.uiuc.edu/alliance.
[12] The National Partnership for Advanced Computing Infrastructure, http://www.npaci.edu.

Warren Smith is currently a research scientist working for Computer Sciences Corporation at NASA Ames Research Center. He received B.S. and M.S. degrees from the Johns Hopkins University and M.S. and Ph.D. degrees from Northwestern University. His main research interest is to provide the services required for efficient and easy use of computational grids. He is a member of the Institute of Electrical and Electronics Engineers and the Association for Computing Machinery.
E-mail: [email protected]

Abdul Waheed received his M.S. and Ph.D. degrees from Michigan State University and a B.Sc. degree from the University of Engineering and Technology in Pakistan. His research interests include parallel and distributed computing. He is a member of the Institute of Electrical and Electronics Engineers.
E-mail: [email protected]

David Meyers was born in Chicago, Illinois, USA. He received a B.A. degree in computer and information science from the University of California at Santa Cruz in 1986. He worked at the National Center for Experiments in Television under a Rockefeller Foundation grant for research into the use of television as a creative medium. For the last seven years, he has worked at the National Aeronautics and Space Administration, Ames Research Center at Moffett Field, California. Currently, he is working under a NASA grant from the Research Institute for Advanced Computer Science (RIACS). His research interests include high-performance distributed computing, performance and modeling of directory services, X.509 public-key infrastructure (PKI), and policy-based network quality of service.

Dr. Jerry Yan received his Ph.D. and MSEE degrees from Stanford University. He currently works as a Senior Scientist at NASA Ames Research Center. He has published over 40 articles in the areas of parallel processing, performance evaluation, and computer architecture. He is a founding member of the Parallel Tools Consortium, a Senior Member of the Institute of Electrical and Electronics Engineers, and a Member of the Institution of Electrical Engineers (UK).
E-mail: [email protected]