The Grid
Ian Foster and Carl Kesselman, eds.


PART I

PERSPECTIVES

Why is the Grid important? We asked some distinguished thinkers to provide us with their perspectives on this question, and their thoughtful responses form the first section of this book.

Larry Smarr, in his chapter “Grids in Context,” analyzes the emergence of the Grid as a new infrastructure. He compares Grids with other major infrastructures, such as the railroad, using historical analogies to provide perspectives on both the complexity of such developments and the widespread impact of the associated changes. He also considers how individuals and groups working in computational science and engineering, experimental and observational science, industrial engineering, and corporate communications can use the Grid today to solve immediate and urgent problems.

Fran Berman and Tony Hey speak to “The Scientific Imperative” in Chapter 2, reviewing the revolutionary changes in technology and methodology driving scientific and engineering communities to embrace Grid technologies. They discuss the challenging and interrelated demands of data-, compute-, and instrumentation-intensive science and also the increased importance of large-scale collaboration. They also outline the ambitious new government programs being created to support these new approaches, and thereby the scientific breakthroughs of the twenty-first century.

Irving Wladawsky-Berger addresses “The Industrial Imperative” in Chapter 3, explaining by analogy and analysis how advances in computing technology are driving an unprecedented integration of technologies, applications, and data that enable global resource sharing beyond what has been possible with the Web alone. He introduces the important concepts of virtualization and on-demand access, and argues that these and other concepts enabled by the Grid are driving information technology infrastructure into a “post-technology” phase in which it has become so ubiquitous as to be invisible.


CHAPTER 1

Grids in Context

Larry Smarr

This second edition of The Grid marks a major turning point in the evolution of the Grid concept. The notion of linking people, computers, sensors, and data with networks is decades old, and we have seen increasingly sophisticated Grid applications in recent years. However, here we have for the first time a coherent description of the hardware, software, and applications required to create a functioning and persistent Grid.

For the Grid to be successful, we must consider not only technologies but also the social aspects of what will induce researchers, educators, businesses, and consumers to use the Grid as an everyday part of their work. One approach is to study the development of previous infrastructures, such as railroads and the electrical power distribution system, for clues as to how they were jump-started. Another approach is to envision how diverse subgroups of our society would benefit from the Grid and how people would change their working environment as a result of the new enabling technologies. I briefly pursue both approaches in this chapter.

1.1 LESSONS FROM HISTORY

To understand the future of the Grid, it helps to study the history of other infrastructures. The Corporation for National Research Initiatives has specifically considered the development of infrastructures for railroads, telephones and telegraphs, power and light, and banking (293–296). These are a few examples of infrastructures that are distributed, have different end-user devices and intermediate reservoirs, and have created hundreds of billions of dollars in today’s world market. Without such infrastructures, life as we know it would be impossible. As we study these infrastructures, which are all quite different on the surface, we begin to notice that they have in common a number of striking features.

The evolution of each infrastructure was incredibly complicated. There is no simple story of how each came into being; indeed, there is a good deal of disagreement on the historical details. However, one point is clear: these infrastructures did not come into being just because a bunch of researchers got together and thought it would be a cool thing to do. Private market forces and the government played fundamental roles, as did invention, research, and standardization. These infrastructures had an important local/global ordering to their time development that is not necessarily apparent, but was critical to the dynamics. When these infrastructures started, almost all traffic was local. For instance, there were robust city telephone systems, but long distance was rare. County roads were prevalent for many decades before the interstate highway system was started in the mid-1950s.

Also critical is a distribution of capacity throughout the infrastructure. When we think of the physical distribution of goods, there are local warehouses, regional warehouses, and national warehouses to enable efficient just-in-time resupply. This distributed system can work only because of the regional “caching” or storing of physical goods. The same is true, for example, in the national electrical distribution system, which ranges from huge-capacity cables leaving dams and nuclear reactors, to cross-country power lines, to local neighborhood lines, to the 110-V lines in private homes.

Such a capacity distribution is already becoming apparent in the developing Grid. In the United States, the Extensible Terascale Facility (ETF) is a good example, with five supernodes at the National Center for Supercomputing Applications (NCSA), the San Diego Supercomputer Center (SDSC), the Pittsburgh Supercomputing Center, Argonne National Laboratory, and the California Institute of Technology, with plans to extend it to a broad range of regional centers, hosting medium-sized nodes with capacity and functions integrated across the ETF by a common Grid middleware infrastructure. Networking capacity is also distributed, or hierarchical—with a 40 gigabit/sec network connecting major supercomputer centers, while the average person is using 50 kilobit/sec over a dial-up modem—a span of almost one million-fold in bandwidth. Similar structures are found within the UK, EU, Japanese, Chinese, and other national Grids. Such distributed differential capacity is characteristic of many of these other infrastructures.

Grids are almost fractal in nature: we need only to look at the billions of appliances in homes worldwide and work upward to the nuclear power generators. Grids are not flat.


1.1.1 Why Is There Chicago?

We tend to think of the world into which we were born as the way it always was, yet history reveals that this is not the case. Much of the development of societal structures is bound up with the development of infrastructure. A fascinating book on this topic is Nature’s Metropolis: Chicago and the Great West (1992), written by William Cronon, a professor of history at Yale University. Cronon is a quantitative historian—a historian who takes data from shipping records, land sales, and so forth and reconstructs what actually happened. The issue he addresses is “Why is there Chicago?”

Chicago is an artifact of the emergence of infrastructure. Until the early 1800s, Chicago was a small field of onions on a very large lake with a lot of mosquitoes. The development of the modern metropolis of Chicago is a microstudy in the phenomenal story of how the emergence of railroads, enabling the movement of physical goods around the country, completely changed the United States in a few short decades. Cronon emphasizes that we cannot study the development of Chicago in isolation; rather, we must take a national viewpoint to capture the scale of the phenomenon. Specifically, Chicago can be understood only by considering the total transformation of the West.

Just as is happening in Brazil today, the native ecosystems of the center of the United States were rapidly annihilated and replaced by artificial ones in the last half of the nineteenth century. Bison were killed and replaced by cattle. The prairie was destroyed and replaced by monoculture agriculture (wheat or corn). The northern forests were destroyed to provide wood to build homes on the prairie where there were no trees. Cronon points out that none of this could have happened had it not been for the new railroad transportation infrastructure that allowed for goods to be shipped from deep inland to markets far away.

The Great Lakes and the rivers were the previous transportation infrastructure. St. Louis was the dominant midwestern city for a long time before Chicago came into existence. Yet as the railroads emerged and linked with the Great Lakes lines, Chicago grew much more rapidly than St. Louis did. Chicago emerged first of all as an intermediate “cache” for agricultural products. Before then, each farmer would grow his or her own corn, take bags on a wagon to a levee in St. Louis, and sell the bags individually. Chicago enabled farmers to pool their corn into a common elevator.

This fundamental change of pooling and “caching” mixed-origin grain allowed for new groups of middle people who bought and sold goods that they had not grown themselves. Financial institutions, such as the Chicago Board of Trade, came into being because a new class of financial instruments and interactions was required. Stockyards arose for the same “caching” purpose, but with livestock instead of grain. Inventions of “Grid technology” like the refrigerated railroad car, without which meat would spoil before it got from the Chicago stockyards to the eastern consumers, allowed meat for the first time to be sold far from where the livestock was raised or butchered.

The new “middleware” infrastructure enabled the creation of new industries that were not anticipated and not derivative. For example, Chicago quickly became one of the great retailing centers of the world. Today, Chicago hosts the largest number of corporate headquarters outside New York City. Chicago has the world’s second busiest airport, which emerged out of the rise of the air transport infrastructure. If there was ever an example of a city that came into existence as a result of the emergence of infrastructure, it is Chicago.

1.1.2 Why Is There Champaign–Urbana?

Champaign–Urbana, located in east central Illinois, had an analogous history. One hundred and fifty years after the laying of the railroad tracks through Champaign County, we can find the living reminders of the social impact of the railroad infrastructure. Across Champaign County, a number of small towns are strung along the railroad track: Homer, Sydney, Philo, Tolono, Savoy, St. Joseph, Thomasboro, Rantoul, and Mahomet, each located where the grain elevators were built as grain “caches” along the railroad track. Their spacing was set by one day’s wagon drive to bring corn to the grain elevator.

And how did Champaign–Urbana come into being? Urbana townspeople and the Illinois Central could not agree on a price for the railroad station, so the railroad simply built its station a few miles to the west of Urbana and went on south. That railroad station grew into Champaign. Today a number of shipping companies locate in Champaign–Urbana because it is at the intersection of two major east–west and north–south interstate highways. Again and again, when we look through the “lens” of infrastructure development, we understand just how powerful a social force it is.

1.2 THE GRID ON STEROIDS

We see that infrastructure has serious social consequences. Grids are going to have a revolutionary effect similar to that of the railroads. However, instead of waiting 30 or 40 years to see the changes brought about by this infrastructure, we are seeing the changes much faster. It is not clear that people are sociologically ready to deal with this rate of change.


The world’s computer and communications infrastructure is driven by exponentials: it’s on steroids. Every aspect—the microprocessor, bandwidth and fiber, disk storage—is on an exponential curve, a situation that has never happened before in the history of infrastructure. For example, NCSA’s first Cray X-MP, purchased in 1986, cost $8 million, had its own electrical line to the Illinois Power substation, had special cooling, and had absolutely no graphics. It was connected to other supercomputers by the first National Science Foundation (NSF) network backbone with a capacity of only 56 kilobit/sec. A child’s video game console, such as a Microsoft Xbox, has a microprocessor with roughly 20 times the compute power of an X-MP processor and has a list price of $200. It uses 250 W instead of 60,000 W. It has incredibly interactive three-dimensional graphics. Those who have 500 kilobit/sec DSL in their homes have more computer power and bandwidth than did all five NSF supercomputer centers put together only 17 years ago.

This situation is not like the railroads. It is not like anything in history. So, while we look back on history to get some guiding principles, we need to be a little humble in the face of what we are launching. Grids are going to change the world so quickly that we are not going to have much of a chance—on a human, political, or social time scale—to react and change our institutions. Therefore, there will be a lot of noise, a lot of angry people, and a lot of controversy.

1.3 FROM NETWORKS TO THE GRID

ARPANET, the precursor of the Internet, which forms the backbone of the emerging information Grid, started in the early 1970s. ARPANET was an experimental network used by a few computer scientists and the Department of Defense (DoD) community. It developed critically important protocols such as TCP/IP and the notion of packet switching (analogous to the standardization of the electrical power distribution industry on AC versus DC). Various production and research networks evolved from this beginning. One of these was NSFNET, created in 1986 with a 56 kilobit/sec backbone that tied together the five (then new) NSF supercomputer centers. The backbone bandwidth of 56 kilobit/sec was upgraded to 1.5 megabit/sec, and then to 45 megabit/sec during the late 1980s and early 1990s. In 1995 the NSF transferred NSFNET to the commercial sector, which evolved it into today’s Internet.

The Internet could never have arisen so quickly had it not been for the federal government funding 100% of the NSFNET backbone, partially funding the regionals, and occasionally funding the academic networking efforts. Within five years, however, the total funding by everybody else involved in the Internet was probably a hundred times more than the federal government’s initial investment. (How this worked in the case of railroads is still controversial, but the federal funding for the land ultimately turned out to be a fairly small percentage of the private funding (293).) Thus, I believe that the federal government has a critical role in bringing today’s Grid into being.

Similarly, high-performance Grids will not spring into being just by waiting for private industry to do it. In 2002, NSF funded the creation of the TeraGrid, which links five major supercomputing sites with a 40-gigabit/sec dedicated optical network. With its Extensible Terascale Facility program, it is now looking to expand this new infrastructure to other research institutions. Major projects are also under way in Europe and Asia-Pacific to establish the necessary network and computational infrastructure. The cycle of initial government funding leading to a new set of commercial services has started again.

1.4 WHO WILL USE THE GRID?

We cannot assume that the Grid we are building is just going to work and instantly fill with users. We cannot make a Field of Dreams assumption: build it and they will come. If nobody shows up to use it, the Grid is going to be a failure. People will use it only because they want to communicate with resources or people that they care about and need to do their work. This is an example of the local/global phenomenon mentioned in regard to previous infrastructures.

We must stimulate the use of the Grid by applications researchers, drawn from both computational science and experimental and observational science, as well as educators and visionary corporations. We must find groups of people who are globally or nationally dispersed but who have a preexisting need to interact with each other or remote resources—and are totally committed to doing so over the Grid. Their use of the Grid from the beginning will greatly shorten the time it takes for this new infrastructure to develop. If we rely on natural evolution, as witnessed by the other major infrastructures, we will wait decades for new capabilities to appear.

Computational scientists and engineers need the Grid. Computational scientists and engineers would like to visualize their applications in real time and, for many applications, steer or navigate the computation as well. It is still common practice for computational scientists to transfer the results of complex simulations by shipping tapes—and as a result, it can take days from when a simulation starts to when an error is detected. This is unacceptable. Many other researchers just cannot get access to enough computational horsepower. They need the ability to reach out across the Grid to remote computers.

Experimental scientists need the Grid. It is important to remember that for every theoretical or computational scientist, there are 10 experimental and observational scientists. If we wish to generally influence science, we need to influence experimental and observational science. Experimental scientists want to hook up their remote instrumentation to supercomputers or to advanced visualization devices, and to use advanced user interfaces, such as Java or voice command, for instrument functions. Thus, for example, brain surgeons in Chicago could control MRI machines in Urbana and collaborate with colleagues in San Francisco while viewing and manipulating the data in real time in three dimensions.

The Berkeley–Illinois–Maryland Radio Telescope Array (BIMA), the fastest millimeter radio telescope array in the world, is located in the high desert of northern California. It is a beautiful region but quite remote from research universities. BIMA uses networks to send data to NCSA supercomputers, which effectively become the computational “lens” of this telescope. Because BIMA is an aperture synthesis telescope, the phase shifts of the radio waves coming in arrive slightly differently at each of the antennas, and a supercomputer is needed to reconstruct what the original object in the sky must have looked like. The telescope takes 1000 two-dimensional images of the sky at one time at different wavelengths, and the computer produces data cubes from them (two dimensions of space and one dimension of frequency). These visualizations can be used, in turn, to steer the instrument or to create a digital library on the fly, which people can look at remotely through Web browsers. This is a perfect example of the type of science that we would like to be able to do at much higher bandwidth, with much more robust software, for a broad range of scientific instruments.
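The “data cube” described here has a simple concrete form. The sketch below is illustrative only, not BIMA’s actual pipeline: `reconstruct_channel` and the image size are invented, and the expensive aperture-synthesis inversion is replaced by fabricated data so the example runs.

```python
import numpy as np

N_CHANNELS = 1000        # one 2-D sky image per wavelength channel
IMAGE_SIZE = (64, 64)    # invented spatial resolution, kept small

def reconstruct_channel(channel: int) -> np.ndarray:
    """Stand-in for the supercomputer step: inverting the interferometer's
    phase data to recover one 2-D sky image. Fabricated noise here."""
    rng = np.random.default_rng(channel)
    return rng.normal(size=IMAGE_SIZE)

# Two dimensions of space plus one of frequency: shape (1000, 64, 64).
cube = np.stack([reconstruct_channel(c) for c in range(N_CHANNELS)])

# The downstream uses the text mentions, such as a browsable digital
# library, amount to slicing this cube: an image per frequency, or a
# spectrum per sky position.
spectrum = cube[:, 32, 32]
print(cube.shape, spectrum.shape)  # (1000, 64, 64) (1000,)
```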

Corporations need the Grid. Most large corporations today are global in extent. While the Web has allowed for corporate intranets to arise, these are still fairly primitive in terms of the types of functionality this book discusses. For example, in the 1990s NCSA worked for several years with Caterpillar on the development of virtual environments to support virtual prototyping. Linking to computer-aided design files that define new heavy earth-moving equipment, Caterpillar has moved from standard computer graphics to heads-up displays to single stereo walls to full CAVE virtual worlds. Caterpillar prototyped teleimmersion with multiple sites, including some internationally. They have suppliers, manufacturers, and customers spread out all over the world and could imagine linking these people over the Grid.


Allstate Insurance, another NCSA industrial partner, has explored using high-speed networks to link its Menlo Park research lab with the data-mining activities going on at NCSA, where large claims datasets reside on high-performance computers. Such a link would allow for new types of pattern recognition using a total corporate database approach. Given that Allstate has 15,000 claims offices, it imagines using Grid capabilities to unite the company into a single collaborative team. Similar stories will arise for almost any large corporation.

The environment needs the Grid. Although society is becoming politically ready to deal with large-scale environmental problems such as ozone depletion, global warming, and air and water pollution, researchers are lagging in creating trusted interactive knowledge bases that integrate all known science about these issues. Given the multidisciplinary nature of such problems, it is clear that to gather all the experts needed to study the problem, we will need the Grid.

For example, the dominant environmental threat to the Chesapeake Bay system has moved from industrial pollution to agricultural runoff. Farmers put fertilizer on their fields, which runs off in the spring into the rivers, which feeds large algae blooms, which later produce massive decaying organic masses near the bottom, which ultimately reduce the dissolved oxygen for the shellfish. This is a dynamic, nonlinear, highly coupled, multitimescale, and multispacescale multidisciplinary problem, and the only possible way to deal with it scientifically is computationally. In fact, researchers should be able to work in a collaborative computational framework that links researchers and remote sensors with the Grid, allowing for the integration of much more detailed models of chemical, biological, and physical effects and the testing of the model against field data.

Training and education need the Grid. One of the first applications of Grid technologies will be in remote training and education. Imagine the productivity gains if we had routine access to virtual lecture rooms! Currently, we often must spend several days flying across the country to deliver a lecture using an antiquated viewgraph projector in a meeting room with only 20 people in the audience. What if we were able to walk up to a local “power wall” and give a lecture fully electronically in a virtual environment with interactive Web materials to an audience gathered from around the country—and then simply walk back to the office instead of going back to a hotel or an airplane?

Given the rapid rate at which K–12 schools are joining community colleges and universities online, it will not be long before some of the more advanced schools begin to use these Grid technologies. The question is not how we use these technologies to redo classic education but what new types of collaborative education will arise using a national-scale Grid. Informal social experiments that show how easily children can work together over the Internet are being run today with shared virtual world games, such as “Quake.” The immersive game world has exploded (see Chapter 13). Children may teach us more about collaboration than we learn from university research scientists!

Nations and states need the Grid. Many countries are installing dedicated fiber or “lambda” networks internally, as if the rest of the world were not there. Within the United States, for example, states are installing “dark” fiber networks without regard for the rest of the country. As discussed previously, that is how infrastructure developments start—locally. As these regional “Tera POPs” and national and state dark fiber networks come online, the next question is “Can we interconnect them?” In the early days of the railroad, a number of different track gauges were adopted. When the train tracks extended from their local strongholds, trains could not pass from one system to another. People who are part of the statewide University of California system can interact at high speed with one another today over experimental fiber networks because they are linked institutionally by CENIC. However, what if they need to interact with researchers across the country using experimental networks? This is one of the first capabilities that a continental-scale LambdaGrid will bring to university researchers.

Of course, the Grid will rapidly evolve from research into the private sector. Already, massive networks linking supplier chains are being constructed in industries such as automobile manufacturing, and wireless extensions using RFID tags are exploding. Electronic commerce is just beginning to take off, and eliminating “speed bumps” from one state to the other will be essential for the free flow of information on which commerce is based.

The world needs the Grid. Naturally, Grid development will not stop at national boundaries. Recent proposals have advocated that goods traded over the Internet should be made tax and tariff free. Such a situation could radically change the patterns of world trade over the next few decades. Thus, there is great interest internationally in bringing advanced Grid technologies to all countries. NSF initiatives such as STAR-TAP and StarLight are facilitating the long-term interconnection and interoperability of advanced international networking in support of applications, performance measuring, and technology evaluations, while bodies such as the Global Grid Forum are establishing connections at the levels of protocols and software.

Consumers need the Grid. The functions described previously will all gradually move down the exponential to the consumer. Already we can enter virtual stores, click on merchandise, and find information about various items. Soon we will be able to click on “Sales Help” to have a live streaming video of a store person come up on our screen, from anywhere in the country. Serious research is under way on personal avatars that, given personal measurements, can try on clothes (on a three-dimensional computer screen) to help a customer decide whether to buy an item.

1.5 SUMMARY

All the projects described in this chapter are under development and will come together as part of the Grid projects described in this book. The Grid, however, will require that we adopt new ways of working. The Grid is about collaboration, about working together. Fortunately, a whole new set of information technologies has come from the Web and from companies building tools that enable us to work together in collaborative online spaces. Old technologies such as proprietary video teleconferencing are giving way to open-system versions linked over the Internet, surrounded by white boards, interactive software, and recording capabilities.

More advanced efforts tailored to the needs of researchers are also under way, such as the Access Grid technology developed at Argonne National Laboratory. This technology enables shared collaborative spaces. A Web site allows all Access Grid software developers and users to interact with each other via hypernews, application galleries, and shared programs. This new approach to creating a virtual community of software developers working on a common software system (in this case, the Access Grid libraries) allows all participants to work together regardless of place. Presumably, participants in these virtual communities can progress far faster than they would working in isolation.

As the new Grid technologies come into widespread use, we must shift our social patterns of interaction and our reward structure so that we can realize the potential gains from the nonlinear advancements that collaboration will create. Collaboration can be an almost magical amplifier for human work. The success of the Grid will both enable and depend on this amplification.


CHAPTER 2

Computational Grids

Ian Foster
Carl Kesselman

In this introductory chapter, we lay the groundwork for the rest of the book by providing a more detailed picture of the expected purpose, shape, and architecture of future grid systems. We structure the chapter in terms of six questions that we believe are central to this discussion: Why do we need computational grids? What types of applications will grids be used for? Who will use grids? How will grids be used? What is involved in building a grid? And, what problems must be solved to make grids commonplace? We provide an overview of each of these issues here, referring to subsequent chapters for more detailed discussion.

2.1 REASONS FOR COMPUTATIONAL GRIDS

Why do we need computational grids? Computational approaches to problem solving have proven their worth in almost every field of human endeavor. Computers are used for modeling and simulating complex scientific and engineering problems, diagnosing medical conditions, controlling industrial equipment, forecasting the weather, managing stock portfolios, and many other purposes. Yet, although there are certainly challenging problems that exceed our ability to solve them, computers are still used much less extensively than they could be. To pick just one example, university researchers make extensive use of computers when studying the impact of changes in land use on biodiversity, but city planners selecting routes for new roads or planning new zoning ordinances do not. Yet it is local decisions such as these that, ultimately, shape our future.

There are a variety of reasons for this relative lack of use of computational problem-solving methods, including lack of appropriate education and tools. But one important factor is that the average computing environment remains inadequate for such computationally sophisticated purposes. While today’s PC is faster than the Cray supercomputer of 10 years ago, it is still far from adequate for predicting the outcome of complex actions or selecting from among many choices. That, after all, is why supercomputers have continued to evolve.

2.1.1 Increasing Delivered Computation

We believe that the opportunity exists to provide users—whether city planners, engineers, or scientists—with substantially more computational power: an increase of three orders of magnitude within five years, and five orders of magnitude within a decade. These dramatic increases will be achieved by innovations in a wide range of areas:

1. Technology improvement: Evolutionary changes in VLSI technology and microprocessor architecture can be expected to result in a factor of 10 increase in computational capabilities in the next five years, and a factor of 100 increase in the next ten.

2. Increase in demand-driven access to computational power: Many applications have only episodic requirements for substantial computational resources. For example, a medical diagnosis system may be run only when a cardiogram is performed, a stock market simulation only when a user recomputes retirement benefits, or a seismic simulation only after a major earthquake. If mechanisms are in place to allow reliable, instantaneous, and transparent access to high-end resources, then from the perspective of these applications it is as if those resources are dedicated to them. Given the existence of multiteraFLOPS systems, an increase in apparent computational power of three or more orders of magnitude is feasible.

3. Increased utilization of idle capacity: Most low-end computers (PCs and workstations) are often idle: various studies report utilizations of around 30% in academic and commercial environments [407, 164]. Utilization can be increased by a factor of two, even for parallel programs [31], without impinging significantly on productivity. The benefit to individual users can be substantially greater: factors of 100 or 1,000 increase in peak computational capacity have been reported [348, 585].

4. Greater sharing of computational results: The daily weather forecast involves perhaps 10^14 numerical operations. If we assume that the forecast is of benefit to 10^7 people, we have 10^21 effective operations—comparable to the computation performed each day on all the world’s PCs. (A back-of-envelope restatement of these figures follows this list.) Few other computational results or facilities are shared so effectively today, but they may be in the future as other scientific communities adopt a “big science” approach to computation. The key to more sharing may be the development of collaboratories: “. . . center[s] without walls, in which the nation’s researchers can perform their research without regard to geographical location—interacting with colleagues, accessing instrumentation, sharing data and computational resources, and accessing information in digital libraries” [410].

5. New problem-solving techniques and tools: A variety of approaches can improve the efficiency or ease with which computation is applied to problem solving. For example, network-enabled solvers [146, 104] allow users to invoke advanced numerical solution methods without having to install sophisticated software. Teleimmersion techniques [412] facilitate the sharing of computational results by supporting collaborative steering of simulations and exploration of data sets.
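As flagged above, the figures in points 3 and 4 can be checked explicitly. The following back-of-envelope sketch (Python, chosen only for illustration) restates the chapter’s own numbers; nothing in it goes beyond what the text claims.

```python
# Point 3: raising average utilization from roughly 30% (the studies
# cited above) to 60% doubles delivered cycles with no new hardware.
utilization_now, utilization_doubled = 0.30, 0.60
print(f"utilization gain: {utilization_doubled / utilization_now:.0f}x")

# Point 4: one 10^14-operation forecast shared by 10^7 people yields
# 10^21 effective operations.
ops_per_forecast = 1e14
beneficiaries = 1e7
print(f"effective operations: {ops_per_forecast * beneficiaries:.0e}")  # 1e+21
```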

Underlying each of these advances is the synergistic use of high-performance networking, computing, and advanced software to provide access to advanced computational capabilities, regardless of the location of users and resources.

2.1.2 Definition of Computational Grids

The current status of computation is analogous in some respects to that of electricity around 1910. At that time, electric power generation was possible, and new devices were being devised that depended on electric power, but the need for each user to build and operate a new generator hindered use. The truly revolutionary development was not, in fact, electricity, but the electric power grid and the associated transmission and distribution technologies. Together, these developments provided reliable, low-cost access to a standardized service, with the result that power—which for most of human history has been accessible only in crude and not especially portable forms (human effort, horses, water power, steam engines, candles)—became universally accessible. By allowing both individuals and industries to take for granted the availability of cheap, reliable power, the electric power grid made possible both new devices and the new industries that manufactured them.

By analogy, we adopt the term computational grid for the infrastructure that will enable the increases in computation discussed above. A computational grid is a hardware and software infrastructure that provides dependable, consistent, pervasive, and inexpensive access to high-end computational capabilities.

We talk about an infrastructure because a computational grid is concerned, above all, with large-scale pooling of resources, whether compute cycles, data, sensors, or people. Such pooling requires significant hardware infrastructure to achieve the necessary interconnections and software infrastructure to monitor and control the resulting ensemble. In the rest of this chapter, and throughout the book, we discuss in detail the nature of this infrastructure.

The need for dependable service is fundamental. Users require assurances that they will receive predictable, sustained, and often high levels of performance from the diverse components that constitute the grid; in the absence of such assurances, applications will not be written or used. The performance characteristics that are of interest will vary widely from application to application, but may include network bandwidth, latency, jitter, computer power, software services, security, and reliability.

The need for consistency of service is a second fundamental concern. As with electric power, we need standard services, accessible via standard interfaces, and operating within standard parameters. Without such standards, application development and pervasive use are impractical. A significant challenge when developing standards is to encapsulate heterogeneity without compromising high-performance execution.

Pervasive access allows us to count on services always being available, within whatever environment we expect to move. Pervasiveness does not imply that resources are everywhere or are universally accessible. We cannot access electric power in a new home until wire has been laid and an account established with the local utility; computational grids will have similarly circumscribed availability and controlled access. However, we will be able to count on universal access within the confines of whatever environment the grid is designed to support.

Finally, an infrastructure must offer inexpensive (relative to income) access if it is to be broadly accepted and used. Homeowners and industrialists both make use of remote billion-dollar power plants on a daily basis because the cost to them is reasonable. A computational grid must achieve similarly attractive economics.
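One way to make these four requirements concrete is to treat the performance characteristics listed above as a checkable contract. The sketch below is only an illustration of that idea; the field names and thresholds are invented and correspond to no real grid interface.

```python
from dataclasses import dataclass

@dataclass
class ServiceLevel:
    """Invented encoding of the performance characteristics listed above."""
    bandwidth_mbps: float  # sustained network bandwidth
    latency_ms: float      # end-to-end latency
    jitter_ms: float       # variation in latency
    gflops: float          # delivered compute power
    availability: float    # fraction of time the service is reachable

def dependable(offered: ServiceLevel, required: ServiceLevel) -> bool:
    """Dependability as a predictable floor: the offered level must meet
    or beat the required level on every axis."""
    return (offered.bandwidth_mbps >= required.bandwidth_mbps
            and offered.latency_ms <= required.latency_ms
            and offered.jitter_ms <= required.jitter_ms
            and offered.gflops >= required.gflops
            and offered.availability >= required.availability)

need = ServiceLevel(bandwidth_mbps=100, latency_ms=50, jitter_ms=5,
                    gflops=10, availability=0.99)
offer = ServiceLevel(bandwidth_mbps=622, latency_ms=20, jitter_ms=2,
                     gflops=50, availability=0.999)
print(dependable(offer, need))  # True
```

Consistency, in this framing, is agreement on the descriptor itself: without a standard vocabulary of service levels, no such check can cross institutional boundaries.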


It is the combination of dependability, consistency, and pervasiveness that will cause computational grids to have a transforming effect on how computation is performed and used. By increasing the set of capabilities that can be taken for granted to the extent that they are noticed only by their absence, grids allow new tools to be developed and widely deployed. Much as pervasive access to bitmapped displays changed our baseline assumptions for the design of application interfaces, computational grids can fundamentally change the way we think about computation and resources.

2.1.3 The Impact of Grids

The history of network computing (see Chapters 21 and 22) shows that orders-of-magnitude improvements in underlying technology invariably enable revolutionary, often unanticipated, applications of that technology, which in turn motivate further technological improvements. As a result, our view of network computing has undergone repeated transformations over the past 40 years.

There is considerable evidence that another such revolution is imminent. The capabilities of both computers and networks continue to increase dramatically. Ten years of research on metacomputing has created a solid base of experience in new applications that couple high-speed networking and computing. The time seems ripe for a transition from the heroic days of metacomputing to more integrated computational grids with dependable and pervasive computational capabilities and consistent interfaces. In such grids, today’s metacomputing applications will be routine, and programmers will be able to explore a new generation of yet more interesting applications that leverage teraFLOP computers and petabyte storage systems interconnected by gigabit networks. We present two simple examples to illustrate how grid functionality may transform different aspects of our lives.

Today’s home finance software packages leverage the pervasive availability of communication technologies such as modems, Internet service providers, and the Web to integrate up-to-date stock prices obtained from remote services into local portfolio value calculations. However, the actual computations performed on this data are relatively simple. In tomorrow’s grid environment, we can imagine individuals making stock-purchasing decisions on the basis of detailed Monte Carlo analyses of future asset value, performed on remote teraFLOP computers. The instantaneous use of three orders of magnitude more computing power than today will go unnoticed by prospective retirees, but their lives will be different because of more accurate information.
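To make the example concrete, here is a toy version of such an analysis, assuming a geometric Brownian motion model of asset value; all parameters are invented. The grid version the authors imagine would be the same calculation with a richer model and vastly more paths, dispatched to remote machines rather than run on the desktop.

```python
import math
import random

def future_value_samples(value_now: float, annual_return: float,
                         annual_vol: float, years: float, n_paths: int) -> list:
    """Monte Carlo samples of future asset value under geometric
    Brownian motion: each path draws one lognormal outcome."""
    samples = []
    for _ in range(n_paths):
        z = random.gauss(0.0, 1.0)
        drift = (annual_return - 0.5 * annual_vol ** 2) * years
        shock = annual_vol * math.sqrt(years) * z
        samples.append(value_now * math.exp(drift + shock))
    return samples

paths = sorted(future_value_samples(100_000, 0.07, 0.18, years=20, n_paths=100_000))
print(f"median outcome: {paths[len(paths) // 2]:,.0f}")
print(f"5th percentile: {paths[len(paths) // 20]:,.0f}")
```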


Today, citizen groups evaluating a proposed new urban development must study uninspiring blueprints or perspective drawings at city hall. A computational grid will allow them to call on powerful graphics computers and databases to transform the architect’s plans into realistic virtual reality depictions and to explore such design issues as energy consumption, lighting efficiency, or sound quality. Meeting online to walk through and discuss the impact of the new development on their community, they can arrive at better urban design and hence improved quality of life. Virtual reality-based simulation models of Los Angeles, produced by William Jepson [345], and the walkthrough model of Soda Hall at the University of California–Berkeley, constructed by Carlo Séquin and his colleagues, are interesting exemplars of this use of computing [88].

2.1.4 Electric Power Grids

We conclude this section by reviewing briefly some salient features of the computational grid’s namesake. The electric power grid is remarkable in terms of its construction and function, which together make it one of the technological marvels of the 20th century. Within large geographical regions (e.g., North America), it forms essentially a single entity that provides power to billions of devices, in a relatively efficient, low-cost, and reliable fashion. The North American grid alone links more than ten thousand generators with billions of outlets via a complex web of physical connections and trading mechanisms [105]. The components from which the grid is constructed are highly heterogeneous in terms of their physical characteristics and are owned and operated by different organizations. Consumers differ significantly in terms of the amount of power they consume, the service guarantees they require, and the amount they are prepared to pay.

Analogies are dangerous things, and electricity is certainly very different from computation in many respects. Nevertheless, the following aspects of the power grid seem particularly relevant to the current discussion.

Importance of Economics

The role and structure of the power grid are driven to a large extent by economic factors. Oil- and coal-fired generators have significant economies of scale. A power company must be able to call upon reserve capacity equal to its largest generator in case that generator fails; interconnections between regions allow for sharing of such reserve capacity, as well as enabling trading of excess power. The impact of economic factors on computational grids is not well understood [282]. Where and when are there economies of scale to be obtained in computational capabilities? Might economic factors lead us away from today’s model of a “computer on every desktop”? We note an intriguing development. Recent advances in power generation technology (e.g., small gas turbines) and the deregulation of the power industry are leading some analysts to look to the Internet for lessons regarding the future evolution of the electric power grid!

Importance of Politics

The developers of large-scale grids tell us that their success depended on regulatory, political, and institutional developments as much as on technical innovation [105]. This lesson should be taken to heart by developers of future computational grids.

Complexity of Control

The principal technical challenges in power grids—once technology issues relating to efficient generation and high-voltage transmission had been overcome—relate to the management of a complex ensemble in which changes at a single location can have far-reaching consequences [105]. Hence, we find that the power grid includes a sophisticated infrastructure for monitoring, management, and control. Again, there appear to be many parallels between this control problem and the problem of providing performance guarantees in large-scale, dynamic, and heterogeneous computational grid environments.

2.2 GRID APPLICATIONS

What types of applications will grids be used for? Building on experiences in gigabit testbeds [355, 476], the I-WAY network [155], and other experimental systems (see Chapter 22), we have identified five major application classes for computational grids, listed in Table 2.1 and described briefly in this section. More details about applications and their technical requirements are provided in the referenced chapters.

TABLE 2.1  Five major classes of grid applications.

Category                   | Examples                                                          | Characteristics                                                                      | Chapter reference
Distributed supercomputing | DIS, stellar dynamics, ab initio chemistry                        | Very large problems needing lots of CPU, memory, etc.                                | 2
High throughput            | Chip design, parameter studies, cryptographic problems            | Harnessing many otherwise idle resources to increase aggregate throughput            | 12
On demand                  | Medical instrumentation, network-enabled solvers, cloud detection | Remote resources integrated with local computation, often for bounded amount of time | 3, 6
Data intensive             | Sky survey, physics data, data assimilation                       | Synthesis of new information from many or large data sources                         | 4
Collaborative              | Collaborative design, data exploration, education                 | Support communication or collaborative work between multiple participants            | 5

2.2.1 Distributed Supercomputing

Distributed supercomputing applications use grids to aggregate substantial computational resources in order to tackle problems that cannot be solved on a single system. Depending on the grid on which we are working (see Section 2.3), these aggregated resources might comprise the majority of the supercomputers in the country or simply all of the workstations within a company. Here are some contemporary examples:

• Distributed interactive simulation (DIS) is a technique used for training and planning in the military. Realistic scenarios may involve hundreds of thousands of entities, each with potentially complex behavior patterns. Yet even the largest current supercomputers can handle at most 20,000 entities. In recent work, researchers at the California Institute of Technology have shown how multiple supercomputers can be coupled to achieve record-breaking levels of performance (see Section 3.4).

• The accurate simulation of complex physical processes can require high spatial and temporal resolution in order to resolve fine-scale detail. Coupled supercomputers can be used in such situations to overcome resolution barriers and hence to obtain qualitatively new scientific results. Although high latencies can pose significant obstacles, coupled supercomputers have been used successfully in cosmology [423], high-resolution ab initio computational chemistry computations [421], and climate modeling [371].

Challenging issues from a grid architecture perspective include the need to coschedule what are often scarce and expensive resources, the scalability of protocols and algorithms to tens or hundreds of thousands of nodes, latency-tolerant algorithms, and achieving and maintaining high levels of performance across heterogeneous systems.

2.2.2 High-Throughput Computing

In high-throughput computing, the grid is used to schedule large numbers of loosely coupled or independent tasks, with the goal of putting unused processor cycles (often from idle workstations) to work. The result may be, as in distributed supercomputing, the focusing of available resources on a single problem, but the quasi-independent nature of the tasks involved leads to very different types of problems and problem-solving methods. (A minimal sketch of this task-farm pattern follows the examples below.) Here are some examples:

• Platform Computing Corporation reports that the microprocessor manufacturer Advanced Micro Devices used high-throughput computing techniques to exploit over a thousand computers during the peak design phases of their K6 and K7 microprocessors. These computers are located on the desktops of AMD engineers at a number of AMD sites and were used for design verification only when not in use by engineers.

• The Condor system from the University of Wisconsin is used to manage pools of hundreds of workstations at universities and laboratories around the world [348]. These resources have been used for studies as diverse as molecular simulations of liquid crystals, studies of ground-penetrating radar, and the design of diesel engines.

• More loosely organized efforts have harnessed tens of thousands of computers distributed worldwide to tackle hard cryptographic problems [338].
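The pattern common to these three examples is a large bag of independent tasks drained by whatever workers happen to be free. Here is a minimal single-machine sketch of that task farm; `verify_design` is an invented stand-in for one unit of work, and a real system such as Condor adds discovery of idle machines, checkpointing, and scheduling policy on top of this skeleton.

```python
from concurrent.futures import ThreadPoolExecutor

def verify_design(case_id: int) -> bool:
    """Invented stand-in for one independent task: a design-verification
    run, a parameter-study point, or one shard of a key search."""
    return case_id % 7 != 0  # pretend a fraction of the cases fail

cases = range(10_000)

# Throughput, not latency, is the goal: the tasks are quasi-independent,
# so they can be farmed out to however many workers are currently idle.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(verify_design, cases))

print(f"{sum(results)} of {len(results)} cases passed")
```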

2.2.3 On-Demand Computing

On-demand applications use grid capabilities to meet short-term requirements for resources that cannot be cost-effectively or conveniently located locally. These resources may be computation, software, data repositories, specialized sensors, and so on. In contrast to distributed supercomputing applications, these applications are often driven by cost-performance concerns rather than absolute performance. (A sketch of the local-versus-remote dispatch decision follows the examples below.) For example:

• The NEOS [146] and NetSolve [104] network-enhanced numerical solver systems allow users to couple remote software and resources into desktop applications, dispatching to remote servers calculations that are computationally demanding or that require specialized software.

• A computer-enhanced MRI machine and scanning tunneling microscope (STM) developed at the National Center for Supercomputing Applications use supercomputers to achieve realtime image processing [456, 457]. The result is a significant enhancement in the ability to understand what we are seeing and, in the case of the microscope, to steer the instrument. (See also Section 4.2.4.)

• A system developed at the Aerospace Corporation for processing of data from meteorological satellites uses dynamically acquired supercomputer resources to deliver the results of a cloud detection algorithm to remote meteorologists in quasi real time [332].
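The dispatch decision these systems make can be sketched in a few lines, assuming a NetSolve/NEOS-style remote solver. The threshold, the cubic cost model, and both solver functions below are invented for illustration and contact no real service.

```python
LOCAL_OP_BUDGET = 1e9  # invented threshold: what the desktop can finish promptly

def solve_locally(n: int) -> str:
    return f"{n}x{n} problem solved locally"

def solve_remotely(n: int) -> str:
    # Placeholder for an RPC to a remote solver service.
    return f"{n}x{n} problem dispatched to a remote server"

def solve(n: int) -> str:
    """On-demand pattern: use local resources when they suffice, and
    reach out to remote ones only for episodic, oversized requests."""
    estimated_ops = n ** 3  # e.g., dense linear algebra scales as n^3
    if estimated_ops <= LOCAL_OP_BUDGET:
        return solve_locally(n)
    return solve_remotely(n)

print(solve(500))   # 1.25e8 ops  -> local
print(solve(5000))  # 1.25e11 ops -> remote
```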

The challenging issues in on-demand applications derive primarily from the dynamic nature of resource requirements and the potentially large populations of users and resources. These issues include resource location, scheduling, code management, configuration, fault tolerance, security, and payment mechanisms.

2.2.4 Data-Intensive Computing

In data-intensive applications, the focus is on synthesizing new information from data that is maintained in geographically distributed repositories, digital libraries, and databases. This synthesis process is often computationally and communication intensive as well. (A miniature of this scatter-gather pattern appears at the end of this section.)

• Future high-energy physics experiments will generate terabytes of data per day, or around a petabyte per year (see Section 4.2.3). The complex queries used to detect “interesting” events may need to access large fractions of this data [363]. The scientific collaborators who will access this data are widely distributed, and hence the data systems in which data is placed are likely to be distributed as well.

• The Digital Sky Survey (Section 5.1.2) will, ultimately, make many terabytes of astronomical photographic data available in numerous network-accessible databases. This facility enables new approaches to astronomical research based on distributed analysis, assuming that appropriate computational grid facilities exist.

• Modern meteorological forecasting systems make extensive use of data assimilation to incorporate remote satellite observations (see Section 5.1.1). The complete process involves the movement and processing of many gigabytes of data.

Challenging issues in data-intensive applications are the scheduling and configuration of complex, high-volume data flows through multiple levels of hierarchy.
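The selective-query core of this class can be reduced to a few lines: push the predicate out to each repository so filtering happens where the data lives, move only the matches, and synthesize locally. The repositories, records, and predicate below are fabricated for the sketch; a real grid system must additionally plan where each stage of a high-volume flow runs.

```python
# Fabricated stand-ins for geographically distributed repositories.
repositories = {
    "site_a": [{"event": i, "energy": i * 0.7} for i in range(1000)],
    "site_b": [{"event": i, "energy": i * 1.1} for i in range(1000)],
}

def remote_query(site: str, min_energy: float) -> list:
    """Stand-in for a query evaluated at the data's location, so that
    only 'interesting' records cross the network."""
    return [rec for rec in repositories[site] if rec["energy"] >= min_energy]

interesting = []
for site in repositories:
    interesting.extend(remote_query(site, min_energy=1000.0))

print(f"gathered {len(interesting)} interesting events from {len(repositories)} sites")
```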

2.2.5 Collaborative Computing

Collaborative applications are concerned primarily with enabling and enhancing human-to-human interactions. Such applications are often structured in terms of a virtual shared space. Many collaborative applications are concerned with enabling the shared use of computational resources such as data archives and simulations; in this case, they also have characteristics of the other application classes just described. For example:

• The BoilerMaker system developed at Argonne National Laboratory allows multiple users to collaborate on the design of emission control systems in industrial incinerators [158]. The different users interact with each other and with a simulation of the incinerator.

• The CAVE5D system supports remote, collaborative exploration of large geophysical data sets and the models that generate them—for example, a coupled physical/biological model of the Chesapeake Bay [567].

• The NICE system developed at the University of Illinois at Chicago allows children to participate in the creation and maintenance of realistic virtual worlds, for entertainment and education [482].

Challenging aspects of collaborative applications from a grid architecture perspective are the realtime requirements imposed by human perceptual capabilities and the rich variety of interactions that can take place.

We conclude this section with three general observations. First, we note that even in this brief survey we see a tremendous variety of already successful applications. This rich set has been developed despite the significant difficulties faced by programmers developing grid applications in the absence of a mature grid infrastructure. As grids evolve, we expect the range and sophistication of applications to increase dramatically. Second, we observe that almost all of the applications demonstrate a tremendous appetite for computational resources (CPU, memory, disk, etc.) that cannot be met in a timely fashion by expected growth in single-system performance. This emphasizes the importance of grid technologies as a means of sharing computation as well as a data access and communication medium. Third, we see that many of the applications are interactive, or depend on tight synchronization with computational components, and hence depend on the availability of a grid infrastructure able to provide robust performance guarantees.

2.3 GRID COMMUNITIES

Who will use grids? One approach to understanding computational grids is to consider the communities that they serve. Because grids are above all a mechanism for sharing resources, we ask, What groups of people will have sufficient incentive to invest in the infrastructure required to enable sharing, and what resources will these communities want to share?

One perspective on these questions holds that the benefits of sharing will almost always outweigh the costs and, hence, that we will see grids that link large communities with few common interests, within which resource sharing will extend to individual PCs and workstations. If we compare a computational grid to an electric power grid, then in this view, the grid is quasi-universal, and every user has the potential to act as a cogenerator. Skeptics respond that the technical and political costs of sharing resources will rarely outweigh the benefits, especially when coupling must cross institutional boundaries. Hence, they argue that resources will be shared only when there is considerable incentive to do so: because the resource is expensive, or scarce, or because sharing enables human interactions that are otherwise difficult to achieve. In this view, grids will be specialized, designed to support specific user communities with specific goals.

Rather than take a particular position on how grids will evolve, we propose what we see as four plausible scenarios, each serving a different community. Future grids will probably include elements of all four.

2.3.1 Government

The first community that we consider comprises the relatively small number—thousands or perhaps tens of thousands—of officials, planners, and scientists concerned with problems traditionally assigned to national government, such as disaster response, national defense, and long-term research and planning. There can be significant advantage to applying the collective power of the nation's fastest computers, data archives, and intellect to the solution of these problems. Hence, we envision a grid that uses the fastest networks to couple relatively small numbers of high-end resources across the nation—perhaps tens of teraFLOP computers, petabytes of storage, hundreds of sites, thousands of smaller systems—for two principal purposes:

1. To provide a "strategic computing reserve," allowing substantial computing resources to be applied to large problems in times of crisis, such as to plan responses to a major environmental disaster, earthquake, or terrorist attack

2. To act as a "national collaboratory," supporting collaborative investigations of complex scientific and engineering problems, such as global change, space station design, and environmental cleanup

An important secondary benefit of this high-end national supercomputing grid is to support resource trading between the various operators of high-end resources, hence increasing the efficiency with which those resources are used.

This national grid is distinguished by its need to integrate diverse high-end (and hence complex) resources, the strategic importance of its overall mission, and the diversity of competing interests that must be balanced when allocating resources.

2.3.2 A Health Maintenance Organization

In our second example, the community supported by the grid comprises administrators and medical personnel located at a small number of hospitals within a metropolitan area. The resources to be shared are a small number of high-end computers, hundreds of workstations, administrative databases, medical image archives, and specialized instruments such as MRI machines, CAT scanners, and cardioangiography devices (see Chapter 4). The coupling of these resources into an integrated grid enables a wide range of new, computationally enhanced applications: desktop tools that use centralized supercomputer resources to run computer-aided diagnosis procedures on mammograms or to search centralized medical image archives for similar cases; life-critical applications such as telerobotic surgery and remote cardiac monitoring and analysis; auditing software that uses the many workstations across the hospital to run fraud detection algorithms on financial records; and research software that uses supercomputers and idle workstations for epidemiological research. Each of these applications exists today in research laboratories, but has rarely been deployed in ordinary hospitals because of the high cost of computation.

This private grid is distinguished by its relatively small scale, central management, and common purpose on the one hand, and on the other hand by the complexity inherent in using common infrastructure for both life-critical applications and less reliability-sensitive purposes and by the need to integrate low-cost commodity technologies. We can expect grids with similar characteristics to be useful in many institutions.

2.3.3 A Materials Science Collaboratory

The community in our third example is a group of scientists who operate and use a variety of instruments, such as electron microscopes, particle accelerators, and X-ray sources, for the characterization of materials. This community is fluid and highly distributed, comprising many hundreds of university researchers and students from around the world, in addition to the operators of the various instruments (tens of instruments, at perhaps ten centers). The resources that are being shared include the instruments themselves, data archives containing the collective knowledge of this community, sophisticated analysis software developed by different groups, and various supercomputers used for analysis. Applications enabled by this grid include remote operation of instruments, collaborative analysis, and supercomputer-based online analysis.

This virtual grid is characterized by a strong unifying focus and relatively narrow goals on the one hand, and on the other hand by dynamic membership, a lack of central control, and a frequent need to coexist with other uses of the same resources. We can imagine similar grids arising to meet the needs of a variety of multi-institutional research groups and multicompany virtual teams created to pursue long- or short-term goals.

2.3.4 Computational Market Economy

The fourth community that we consider comprises the participants in a broad-based market economy for computational services. This is a potentially enormous community with no connections beyond the usual market-oriented relationships. We can expect participants to include consumers, with their diverse needs and interests; providers of specialized services, such as financial modeling, graphics rendering, and interactive gaming; providers of compute resources; network providers, who contract to provide certain levels of network service; and various other entities such as banks and licensing organizations.

This public grid is in some respects the most intriguing of the four scenarios considered here, but is also the least concrete. One area of uncertainty concerns the extent to which the average consumer will also act as a producer of computational resources. The answer to this question seems to depend on two issues. Will applications emerge that can exploit loosely coupled computational resources? And, will owners of resources be motivated to contribute resources? To date, large-scale activity in this area has been limited to fairly esoteric computations—such as searching for prime numbers, breaking cryptographic codes [338], or detecting extraterrestrial communications [527]—with the benefit to the individuals being the fun of participating and the potential momentary fame if their computer solves the problem in question.
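
The underlying structure of these efforts is simple and worth sketching: a server maintains a bag of independent work units, and each participating machine repeatedly fetches a unit, computes, and returns the result. The Python sketch below is illustrative only; an in-process queue stands in for the project's real server protocol, and the prime search is a toy stand-in for the actual computation.

```python
import queue

# Work-pull pattern used by volunteer-computing projects: a shared bag of
# independent tasks, drained by any machine that volunteers cycles.
work = queue.Queue()
for lo in range(2, 10_000, 1_000):        # partition the search space
    work.put((lo, lo + 1_000))

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def volunteer():
    """One volunteer: pull a unit, compute, report, repeat until done."""
    found = []
    while True:
        try:
            lo, hi = work.get_nowait()
        except queue.Empty:
            return found
        found.extend(n for n in range(lo, hi) if is_prime(n))

print(len(volunteer()), "primes found by a single volunteer")
```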

We conclude this section by noting that, in our view, each of these scenarios seems quite feasible; indeed, substantial prototypes have been created for each of the grids that we describe. Hence, we expect to see not just one single computational grid, but rather many grids, each serving a different community with its own requirements and objectives. Just which grids will evolve depends critically on three issues: the evolving economics of computing and networking, and the services that these physical infrastructure elements are used to provide; the institutional, regulatory, and political frameworks within which grids may develop; and, above all, the emergence of applications able to motivate users to invest in and use grid technologies.

2.4 USING GRIDS

How will grids be used? In metacomputing experiments conducted to date, users have been "heroic" programmers, willing to spend large amounts of time programming complex systems at a low level. The resulting applications have provided compelling demonstrations of what might be, but in most cases are too expensive, unreliable, insecure, and fragile to be considered suitable for general use.

For grids to become truly useful, we need to take a significant step forward in grid programming, moving from the equivalent of assembly language to high-level languages, from one-off libraries to application toolkits, and from hand-crafted codes to shrink-wrapped applications. These goals are familiar to us from conventional programming, but in a grid environment we are faced with the additional difficulties associated with wide area operation—in particular, the need for grid applications to adapt to changes in resource properties in order to meet performance requirements. As in conventional computing, an important step toward the realization of these goals is the development of standards for applications, programming models, tools, and services, so that a division of labor can be achieved between the users and developers of different types of components.

TABLE 2.2  Classes of grid users.

Class | Purpose | Makes use of | Concerns
End users | Solve problems | Applications | Transparency, performance
Application developers | Develop applications | Programming models, tools | Ease of use, performance
Tool developers | Develop tools, programming models | Grid services | Adaptivity, exposure of performance, security
Grid developers | Provide basic grid services | Local system services | Local simplicity, connectivity, security
System administrators | Manage grid resources | Management tools | Balancing local and global concerns

We structure our discussion of grid tools and programming in terms of the classification illustrated in Table 2.2. At the lowest level, we have grid developers—the designers and implementors of what we might call the "Grid Protocol," by analogy with the Internet Protocol that provides the lowest-level services in the Internet—who provide the basic services required to construct a grid. Above this, we have tool developers, who use grid services to construct programming models and associated tools, layering higher-level services and abstractions on top of the more fundamental services provided by the grid architecture. Application developers, in turn, build on these programming models, tools, and services to construct grid-enabled applications for end users who, ideally, can use these applications without being concerned with the fact that they are operating in a grid environment. A fifth class of users, system administrators, is responsible for managing grid components. We now examine this model in more detail.

2.4.1 Grid Developers

A very small group of grid developers are responsible for implementing the basic services referred to above. We discuss the concerns encountered at this level in Section 2.5.

2.4.2 Tool Developers

Our second group of users are the developers of the tools, compilers, libraries, and so on that implement the programming models and services used by application developers. Today's small population of grid tool developers (e.g., the developers of Condor [348], Nimrod [5], NEOS [146], NetSolve [104], Horus [548], grid-enabled implementations of the Message Passing Interface (MPI) [198], and CAVERN [337]) must build their tools on a very narrow foundation, comprising little more than the Internet Protocol. We envision that future grid systems will provide a richer set of basic services, hence making it possible to build more sophisticated and robust tools. We discuss the nature and implementation of those basic services in Section 2.5; briefly, they comprise versions of those services that have proven effective on today's end systems and clusters, such as authentication, process management, data access, and communication, plus new services that address specific concerns of the grid environment, such as resource location, information, fault detection, security, and electronic payment.

Tool developers must use these basic services to provide efficient implementations of the programming models that will be used by application developers. In constructing these translations, the tool developer must be concerned not only with translating the existing model to the grid environment, but also with revealing to the programmer those aspects of the grid environment that impact performance. For example, a grid-enabled MPI [198] can seek to adapt the MPI model for grid execution by incorporating specialized techniques for point-to-point and collective communication in highly heterogeneous environments; implementations of collective operations might use multicast protocols and adapt a combining tree structure in response to changing network loads. It should probably also extend the MPI model to provide programmers with access to resource location services, information about grid topology, and group communication protocols.
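
To make the combining-tree idea concrete, the sketch below (plain Python, with hypothetical cluster and node names; it does not reflect any particular MPI implementation) builds a two-level broadcast schedule: the root sends once to a designated leader in each remote cluster over the wide area link, and each leader then fans out to its local peers over the fast local network.

```python
# A sketch of topology-aware collective communication, as a grid-enabled
# MPI might implement it: minimize wide-area messages by routing the
# broadcast through one "leader" node per cluster. Cluster names and the
# node-to-cluster map are hypothetical.
clusters = {
    "argonne":   ["a0", "a1", "a2"],
    "urbana":    ["u0", "u1"],
    "san_diego": ["s0", "s1", "s2", "s3"],
}

def broadcast_schedule(root):
    """Return (wide_area, local_area) message lists for a broadcast."""
    wide, local = [], []
    root_cluster = next(c for c, nodes in clusters.items() if root in nodes)
    for cluster, nodes in clusters.items():
        leader = root if cluster == root_cluster else nodes[0]
        if cluster != root_cluster:
            wide.append((root, leader))        # one expensive WAN message
        for node in nodes:
            if node != leader:
                local.append((leader, node))   # cheap LAN fan-out
    return wide, local

wide, local = broadcast_schedule("a0")
# 2 wide-area messages instead of the 6 a flat broadcast from "a0" would need.
```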

2.4.3 Application Developers

Our third class of users comprises those who construct grid-enabled applications and components. Today, these programmers write applications in what is, in effect, an assembly language: explicit calls to the Internet Protocol's User Datagram Protocol (UDP) or Transmission Control Protocol (TCP), explicit or no management of failure, hard-coded configuration decisions for specific computing systems, and so on. We are far removed from the portable, efficient, high-level languages that are used to develop sequential programs, and the advanced services that programmers can rely upon when using these languages, such as dynamic memory management and high-level I/O libraries.
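
As a brief illustration of this "assembly language" style (the host, port, and message format below are all hypothetical and hard coded, which is exactly the practice being criticized), consider what even a trivial data exchange looks like when written directly against TCP:

```python
import socket

# Direct TCP programming: explicit connection management, a hard-coded
# endpoint, a hand-rolled framing convention, and only whatever failure
# handling the programmer remembers to write by hand.
HOST, PORT = "compute1.example.org", 9000    # hypothetical endpoint

def send_task(payload: bytes) -> bytes:
    sock = socket.create_connection((HOST, PORT), timeout=30)
    try:
        # 4-byte big-endian length prefix: an ad hoc, per-application protocol.
        sock.sendall(len(payload).to_bytes(4, "big") + payload)
        size = int.from_bytes(sock.recv(4), "big")
        result = b""
        while len(result) < size:            # TCP may deliver partial reads
            chunk = sock.recv(size - len(result))
            if not chunk:
                raise ConnectionError("peer closed mid-message")
            result += chunk
        return result
    finally:
        sock.close()
```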

Future grids will need to address the needs of application developers in two ways. They must provide programming models (supported by languages, libraries, and tools) that are appropriate for grid environments and a range of services (for security, fault detection, resource management, data access, communication, etc.) that programmers can call upon when developing applications.

The purpose of both programming models and services is to simplify thinking about and implementing complex algorithmic structures, by providing a set of abstractions that hide details unrelated to the application, while exposing design decisions that have a significant impact on program performance or correctness. In sequential programming, commonly used programming models provide us with abstractions such as subroutines and scoping; in parallel programming, we have threads and condition variables (in shared-memory parallelism), message passing, distributed arrays, and single-assignment variables. Associated services ensure that resources are allocated to processes in a reasonable fashion, provide convenient abstractions for tertiary storage, and so forth.

There is no consensus on what programming model is appropriate for a grid environment, although it seems clear that many models will be used. Table 2.3 summarizes some of the models that have been proposed; new models will emerge as our understanding of grid programming evolves. These models are discussed in more detail in Chapters 8, 9, and 10, while Chapter 15 discusses the related question of tools.

As Table 2.3 makes clear, one approach to grid programming is to adapt models that have already proved successful in sequential or parallel environments. For example, a grid-enabled distributed shared-memory (DSM) system would support a shared-memory programming model in a grid environment, allowing programmers to specify parallelism in terms of threads and shared-memory operations. Similarly, a grid-enabled MPI would extend the popular message-passing model [198], and a grid-enabled file system would permit remote files to be accessed via the standard UNIX application programming interface (API) [545]. These approaches have the advantage of potentially allowing existing applications to be reused unchanged, but can introduce significant performance problems if the models in question do not adapt well to high-latency, dynamic, heterogeneous grid environments.

TABLE 2.3  Potential grid programming models and their advantages and disadvantages.

Model | Examples | Pros | Cons
Datagram/stream communication | UDP, TCP, Multicast | Low overhead | Low level
Shared memory, multithreading | POSIX Threads, DSM | High level | Scalability
Data parallelism | HPF, HPC++ | Automatic parallelization | Restricted applicability
Message passing | MPI, PVM | High performance | Low level
Object-oriented | CORBA, DCOM, Java RMI | Support for large-system design | Performance
Remote procedure call | DCE, ONC | Simplicity | Restricted applicability
High throughput | Condor, LSF, Nimrod | Ease of use | Restricted applicability
Group ordered | Isis, Totem | Robustness | Performance, scalability
Agents | Aglets, Telescript | Flexibility | Performance, robustness

Another approach is to build on technologies that have proven effective in distributed computing, such as Remote Procedure Call (RPC) or related object-based techniques such as the Common Object Request Broker Architecture (CORBA). These technologies have significant software engineering advantages, because their encapsulation properties facilitate the modular construction of programs and the reuse of existing components. However, it remains to be seen whether these models can support performance-focused, complex applications such as teleimmersion or the construction of dynamic computations that span hundreds or thousands of processors.
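
To indicate the flavor of the RPC approach, here is a minimal sketch using Python's standard xmlrpc machinery (the service name, port, and solve function are hypothetical). The encapsulation is what gives RPC its software engineering appeal; the cost of marshaling every call is what raises the performance questions noted above.

```python
from xmlrpc.server import SimpleXMLRPCServer

# Server side: expose a computational service behind a procedure-call
# interface. Clients never see how (or where) the work is performed.
def solve(coefficients):
    """Hypothetical service: return the roots of a quadratic a*x^2 + b*x + c."""
    a, b, c = coefficients
    disc = (b * b - 4 * a * c) ** 0.5
    return [(-b + disc) / (2 * a), (-b - disc) / (2 * a)]

server = SimpleXMLRPCServer(("localhost", 8000), logRequests=False)
server.register_function(solve, "solve")
# server.serve_forever()                 # run the service

# Client side: a remote call looks like a local one.
# import xmlrpc.client
# proxy = xmlrpc.client.ServerProxy("http://localhost:8000/")
# print(proxy.solve([1, -3, 2]))         # -> [2.0, 1.0]
```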

The grid environment can also motivate new programming models and services. For example, high-throughput computing systems (Chapter 13), as exemplified by Condor [348] and Nimrod [5], support problem-solving methods such as parameter studies in which complex problems are partitioned into many independent tasks. Group-ordered communication systems represent another model that is important in dynamic, unpredictable grid environments; they provide services for managing groups of processes and for delivering messages reliably to group members. Agent-based programming models represent another approach apparently well suited to grid environments; here, programs are constructed as independent entities that roam the network searching for data or performing other tasks on behalf of a user.
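
The parameter-study pattern is worth making concrete, since it is the canonical high-throughput workload. In the sketch below, a hypothetical simulate function stands in for a real simulation code, and the standard-library process pool stands in for a scheduler such as Condor or Nimrod farming tasks out to idle machines:

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

def simulate(params):
    """Hypothetical simulation: score one (temperature, pressure) point."""
    temperature, pressure = params
    return params, temperature * 0.3 + pressure * 1.7   # stand-in model

if __name__ == "__main__":
    # The cross-product of parameter values defines many independent tasks;
    # a high-throughput system's job is simply to run them all, anywhere.
    grid_points = list(product(range(250, 400, 10), range(1, 20, 2)))
    with ProcessPoolExecutor() as pool:     # stand-in for the grid scheduler
        results = dict(pool.map(simulate, grid_points))
    best = max(results, key=results.get)
    print(f"{len(results)} runs; best parameters: {best}")
```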

A wide range of new services can be expected to arise in grid environments to support the development of more complex grid applications. In addition to grid analogs of conventional services such as file systems, we will see new services for resource discovery, resource brokering, electronic payments, licensing, fault tolerance, specification of use conditions, configuration, adaptation, and distributed system management, to name just a few.

2.4.4 End Users

Most grid users, like most users of computers or networks today, will not write programs. Instead, they will use grid-enabled applications that make use of grid resources and services. These applications may be chemistry packages or environmental models that use grid resources for computing or data; problem-solving packages that help set up parameter study experiments [5]; mathematical packages augmented with calls to network-enabled solvers [146, 104]; or collaborative engineering packages that allow geographically separated users to cooperate on the design of complex systems.

End users typically place stringent requirements on their tools, in terms of reliability, predictability, confidentiality, and usability. The construction of applications that can meet these requirements in complex grid environments represents a major research and engineering challenge.

2.4.5 System Administrators

The final group of users that we consider are the system administrators who must manage the infrastructure on which computational grids operate. This task is complicated by the high degree of sharing that grids are designed to make possible. The user communities and resources associated with a particular grid will frequently span multiple administrative domains, and new services will arise—such as accounting and resource brokering—that require distributed management. Furthermore, individual resources may participate in several different grids, each with its own particular user community, access policies, and so on. For a grid to be effective, each participating resource must be administered so as to strike an appropriate balance between local policy requirements and the needs of the larger grid community. This problem has a significant political dimension, but new technical solutions are also required.

The Internet experience suggests that two keys to scalability when administering large distributed systems are to decentralize administration and to automate trans-site issues. For example, names and routes are administered locally, while essential trans-site services such as route discovery and name resolution are automated. Grids will require a new generation of tools for automatically monitoring and managing many tasks that are currently handled manually.

New administration issues that arise in grids include establishing, monitoring, and enforcing local policies in situations where the set of users may be large and dynamic; negotiating policy with other sites and users; accounting and payment mechanisms; and the establishment and management of markets and other resource-trading mechanisms. There are interesting parallels between these problems and management issues that arise in the electric power and banking industries [114, 218, 216].

2.5 GRID ARCHITECTURE

What is involved in building a grid? To address this question, we adopt a system architect's perspective and examine the organization of the software infrastructure required to support the grid users, applications, and services discussed in the preceding sections.

As noted above, computational grids will be created to serve different communities with widely varying characteristics and requirements. Hence, it seems unlikely that we will see a single grid architecture. However, we do believe that we can identify basic services that most grids will provide, with different grids adopting different approaches to the realization of these services.

One major driver for the techniques used to implement grid services is scale. Computational infrastructure, like other infrastructures, is fractal, or self-similar at different scales. We have networks between countries, organizations, clusters, and computers; between components of a computer; and even within a single component. However, at different scales, we often operate in different physical, economic, and political regimes. For example, the access control solutions used for a laptop computer's system bus are probably not appropriate for a trans-Pacific cable.

In this section, we adopt scale as the major dimension for comparison. We consider four types of systems, of increasing scale and complexity, asking two questions for each: What new concerns does this increase in scale introduce? And how do these new concerns influence how we provide basic services? These system types are as follows (see also Table 2.4):

1. The end system provides the best model we have for what it means to compute, because it is here that most research and development efforts have focused in the past four decades.

2. The cluster introduces new issues of parallelism and distributed management, albeit of homogeneous systems.

3. The intranet introduces the additional issues of heterogeneity and geographical distribution.

4. The internet introduces issues associated with a lack of centralized control.

TABLE 2.4  Computer systems operating at different scales.

System type | Computational model | I/O model | Resource management | Security
End system | Multithreading, automatic parallelization | Local I/O, disk striping | Process creation, OS signal delivery, OS scheduling | OS kernel, hardware
Cluster (increased scale, reduced integration) | Synchronous communication, distributed shared memory | Parallel I/O (e.g., MPI-IO), file systems | Parallel process creation, gang scheduling, OS-level signal propagation | Shared security databases
Intranet (heterogeneity, separate administration, lack of global knowledge) | Client/server, loosely synchronous: pipelines, coupling manager/worker | Distributed file systems (DFS, HPSS), databases | Resource discovery, signal distribution networks, high throughput | Network security (Kerberos)
Internet (lack of centralized control, geographical distribution, international issues) | Collaborative systems, remote control, data mining | Remote file access, digital libraries, data warehouses | Brokers, trading, mobile code, negotiation | Trust delegation, public key, sandboxes

An important secondary driver for architectural solutions is the performance requirements of the grid. Stringent performance requirements amplify the effect of scale because they make it harder to hide heterogeneity. For example, if performance is not a big concern, it is straightforward to extend UNIX file I/O to support access to remote files, perhaps via a HyperText Transport Protocol (HTTP) gateway [545]. However, if performance is critical, remote access may require quite different mechanisms—such as parallel transfers over a striped network from a remote parallel file system to a local parallel computer—that are not easily expressed in terms of UNIX file I/O semantics. Hence, a high-performance wide area grid may need to adopt quite different solutions to data access problems. In the following, we assume that we are dealing with high-performance systems; systems with lower performance requirements are generally simpler.
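
The low-performance half of this trade-off is easy to demonstrate: Python's standard library already presents an object fetched over HTTP as an ordinary file-like handle, so existing read-oriented code needs no changes (the URL below is a placeholder):

```python
import urllib.request

# Remote access through an HTTP gateway: the handle supports the familiar
# read() interface, so file-oriented code works unchanged. What it cannot
# express is striped, parallel transfer between parallel file systems.
with urllib.request.urlopen("http://data.example.org/sky/plate42.fits") as f:
    header = f.read(2880)        # read the first FITS block, like a local file
print(len(header), "bytes read via HTTP")
```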

2.5.1 Basic Services

We start our discussion of architecture by reviewing the basic services provided on conventional computers. We do so because we believe that, in the absence of strong evidence to the contrary, services that have been developed and proven effective in several decades of conventional computing will also be desirable in computational grids. Grid environments also require additional services, but we claim that, to a significant extent, grid development will be concerned with extending familiar capabilities to the more complex wide area environment.

Our purpose in this subsection is not to provide a detailed exposition of well-known ideas but rather to establish a vocabulary for subsequent discussion. We assume that we are discussing a generic modern computing system, and hence refrain from prefixing each statement with "in general," "typically," and the like. Individual systems will, of course, differ from the generic systems described here, sometimes in interesting and important ways.

The first step in a computation that involves shared resources is an authentication process, designed to establish the identity of the user. A subsequent authorization process establishes the right of the user to create entities called processes. A process comprises one or more threads of control, created for either concurrency or parallelism, and executing within a shared address space. A process can also communicate with other processes via a variety of abstractions, including shared memory (with semaphores or locks), pipes, and protocols such as TCP/IP.

A user (or process acting on behalf of a user) can control the activities in another process—for example, to suspend, resume, or terminate its execution. This control is achieved by means of asynchronously delivered signals.
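
These abstractions are concrete enough to demonstrate in a few lines; the sketch below uses Python's multiprocessing module to create a process, communicate with it over a pipe, and then terminate it with an asynchronously delivered signal:

```python
import multiprocessing as mp

def worker(conn):
    # The child process echoes messages until told (or forced) to stop.
    while True:
        conn.send(conn.recv().upper())

if __name__ == "__main__":
    parent_end, child_end = mp.Pipe()       # interprocess communication
    proc = mp.Process(target=worker, args=(child_end,))
    proc.start()                            # process creation
    parent_end.send("hello grid")
    print(parent_end.recv())                # -> HELLO GRID
    proc.terminate()                        # control via a signal (SIGTERM on POSIX)
    proc.join()
```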

A process acts on behalf of its creator to acquire resources, by executing instructions, occupying memory, reading and writing disks, sending and receiving messages, and so on. The ability of a process to acquire resources is limited by underlying authorization mechanisms, which implement a system's resource allocation policy, taking into account the user's identity, prior resource consumption, and/or other criteria. Scheduling mechanisms in the underlying system deal with competing demands for resources and may also (for example, in realtime systems) support user requests for performance guarantees.

Underlying accounting mechanisms keep track of resource allocations and consumption, and payment mechanisms may be provided to translate resource consumption into some common currency. The underlying system will also provide protection mechanisms to ensure that one user's computation does not interfere with another's.

Other services provide abstractions for secondary storage. Of these, virtual memory is implicit, extending the shared address space abstraction already noted; file systems and databases are more explicit representations of secondary storage.

2.5.2 End Systems

Individual end systems—computers, storage systems, sensors, and other devices—are characterized by relatively small scale and a high degree of homogeneity and integration. There are typically just a few tens of components (processors, disks, etc.), these components are mostly of the same type, and the components and the software that controls them have been co-designed to simplify management and use and to maximize performance. (Specialized devices such as scientific instruments may be significantly more complex, with potentially thousands of internal components, of which hundreds may be visible externally.)

Such end systems represent the simplest, and most intensively studied, environment in which to provide the services listed above. The principal challenges facing developers of future systems of this type relate to changing computer architectures (in particular, parallel architectures) and the need to integrate end systems more fully into clusters, intranets, and internets.

State of the Art

The software architectures used in conventional end systems are well known [511]. Basic services are provided by a privileged operating system, which has absolute control over the resources of the computer. This operating system handles authentication and mediates user process requests to acquire resources, communicate with other processes, access files, and so on. The integrated nature of the hardware and operating system allows high-performance implementations of important functions such as virtual memory and I/O.

Programmers develop applications for these end systems by using a variety of high-level languages and tools. A high degree of integration between processor architecture, memory system, and compiler means that high performance can often be achieved with relatively little programmer effort.

Future Directions

A significant deficiency of most end-system architectures is that they lack features necessary for integration into larger clusters, intranets, and internets. Much current research and development is concerned with evolving end-system architectures in directions relevant to future computational grids. To list just three: Operating systems are evolving to support operation in clustered environments, in which services are distributed over multiple networked computers, rather than replicated on every processor [25, 544]. A second important trend is toward a greater integration of end systems (computers, disks, etc.) with networks, with the goal of reducing the overheads incurred at network interfaces and hence increasing communication rates [167, 288]. Finally, support for mobile code is starting to appear, in the form of authentication schemes, secure execution environments for downloaded code ("sandboxes"), and so on [238, 559, 555, 370].

The net effect of these various developments seems likely to be to reduce the currently sharp boundaries between end system, cluster, and intranet/internet, with the result that individual end systems will more fully embrace remote computation, as producers and/or consumers.

2.5.3 Clusters

The second class of systems that we consider is the cluster, or network of workstations: a collection of computers connected by a high-speed local area network and designed to be used as an integrated computing or data processing resource (see Chapter 17). A cluster, like an individual end system, is a homogeneous entity—its constituent systems differ primarily in configuration, not basic architecture—and is controlled by a single administrative entity who has complete control over each end system. The two principal complicating factors that the cluster introduces are as follows:

1. Increased physical scale: A cluster may comprise several hundred or thousand processors, with the result that alternative algorithms are needed for certain resource management and control functions.

2. Reduced integration: A desire to construct clusters from commodity parts means that clusters are often less integrated than end systems. One implication of this is reduced performance for certain functions (e.g., communication).

State of the Art

The increased scale and reduced integration of the cluster environment make the implementation of certain services more difficult and also introduce a need for new services not required in a single end system. The result tends to be either significantly reduced performance (and hence range of applications) or software architectures that modify and/or extend end-system operating systems in significant ways.

We use the problem of high-performance parallel execution to illustrate the types of issues that can arise when we seek to provide familiar end-system services in a cluster environment. In a single (multiprocessor) end system, high-performance parallel execution is typically achieved either by using specialized communication libraries such as MPI or by creating multiple threads that communicate by reading and writing a shared address space.

Both message-passing and shared-memory programming models can be implemented in a cluster. Message passing is straightforward to implement, since the commodity systems from which clusters are constructed typically support at least TCP/IP as a communication protocol. Shared memory requires additional effort: in an end system, hardware mechanisms ensure a uniform address space for all threads, but in a cluster, we are dealing with multiple address spaces. One approach to this problem is to implement a logical shared memory by providing software mechanisms for translating between local and global addresses, ensuring coherency between different versions of data, and so forth. A variety of such distributed shared-memory systems exist, varying according to the level at which sharing is permitted [586, 177, 422].
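
The following toy sketch (a deliberately simplified, single-writer invalidation protocol; all names are hypothetical, and real DSM systems are far more sophisticated) shows the bookkeeping that software must supply once hardware no longer provides a uniform address space: a directory maps each global page to an owner, reads fetch a copy, and writes invalidate cached copies elsewhere.

```python
# Toy software DSM: a directory maps global page numbers to their current
# owner node; reads cache a copy, writes invalidate all remote copies.
# A real system would also handle faults, granularity, and consistency models.
class ToyDSM:
    def __init__(self, nodes):
        self.directory = {}                    # page -> owner node
        self.copies = {n: {} for n in nodes}   # node -> {page: value}

    def read(self, node, page):
        if page not in self.copies[node]:      # miss: fetch from the owner
            owner = self.directory[page]
            self.copies[node][page] = self.copies[owner][page]
        return self.copies[node][page]

    def write(self, node, page, value):
        for other, cache in self.copies.items():
            if other != node:                  # invalidate stale remote copies
                cache.pop(page, None)
        self.copies[node][page] = value
        self.directory[page] = node            # writer becomes the owner

dsm = ToyDSM(["n0", "n1"])
dsm.write("n0", page=7, value="hello")
print(dsm.read("n1", 7))                       # n1 transparently fetches from n0
```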

In low-performance environments, the cluster developer's job is done at this point; message-passing and DSM systems can be run as user-level programs that use conventional communication protocols and mechanisms (e.g., TCP/IP) for interprocessor communication. However, if performance is important, considerable additional development effort may be required. Conventional network protocols are orders of magnitude slower than intra-end-system communication operations. Low-latency, high-bandwidth inter-end-system communication can require modifications to the protocols used for communication, the operating system's treatment of network interfaces, or even the network interface hardware [553, 431] (see Chapters 17 and 20).

The cluster developer who is concerned with parallel performance must also address the problem of coscheduling. There is little point in communicating extremely rapidly to a remote process that must be scheduled before it can respond. Coscheduling refers to techniques that seek to schedule simultaneously the processes constituting a computation on different processors [174, 520]. In certain highly integrated parallel computers, coscheduling is achieved by using a batch scheduler: processors are space shared, so that only one computation uses a processor at a time. Alternatively, the schedulers on the different systems can communicate, or the application itself can guide the local scheduling process to increase the likelihood that processes will be coscheduled [25, 121].

To summarize the points illustrated by this example: in clusters, the implementation of services taken for granted in end systems can require new approaches to the implementation of existing services (e.g., interprocess communication) and the development of new services (e.g., DSM and coscheduling). The complexity of the new approaches and services, as well as the number of modifications required to the commodity technologies from which clusters are constructed, tends to increase proportionally with performance requirements.

We can paint a similar picture in other areas, such as process creation, process control, and I/O. Experience shows that familiar services can be extended to the cluster environment without too much difficulty, especially if performance is not critical; the more sophisticated cluster systems provide transparent mechanisms for allocating resources, creating processes, controlling processes, accessing files, and so forth, that work regardless of a program's location within the cluster. However, when performance is critical, new implementation techniques, low-level services, and high-level interfaces can be required [544, 180].

Future Directions

Cluster architectures are evolving in response to three pressures:

1. Performance requirements motivate increased integration and hence operating system and hardware modifications (for example, to support fast communications).

2. Changed operational parameters introduce a need for new operating system and user-level services, such as coscheduling.

3. Economic pressures encourage a continued focus on commodity technologies, at the expense of decreased integration and hence performance and services.

It seems likely that, in the medium term, software architectures for clusters will converge with those for end systems, as end-system architectures address issues of network operation and scale.

2.5.4 Intranets

The third class of systems that we consider is the intranet, a grid comprising a potentially large number of resources that nevertheless belong to a single organization. Like a cluster, an intranet can assume centralized administrative control and hence a high degree of coordination among resources. The three principal complicating factors that an intranet introduces are as follows:

1. Heterogeneity: The end systems and networks used in an intranet are almost certainly of different types and capabilities. We cannot assume a single system image across all end systems.

2. Separate administration: Individual systems will be separately administered; this feature introduces additional heterogeneity and the need to negotiate potentially conflicting policies.

3. Lack of global knowledge: A consequence of the first two factors, and the increased number of end systems, is that it is not possible, in general, for any one person or computation to have accurate global knowledge of system structure or state.

State of the Art

The software technologies employed in intranets focus primarily on the problems of physical and administrative heterogeneity. The result is typically a simpler, less tightly integrated set of services than in a typical cluster. Commonly, the services that are provided are concerned primarily with the sharing of data (e.g., distributed file systems, databases, Web services) or with providing access to specialized services, rather than with supporting the coordinated use of multiple resources. Access to nonlocal resources often requires the use of simple, high-level interfaces designed for "arm's-length" operation in environments in which every operation may involve authentication, format conversions, error checking, and accounting. Nevertheless, centralized administrative control does mean that a certain degree of uniformity of mechanism and interface can be achieved; for example, all machines may be required to run a specific distributed file system or batch scheduler, or may be placed behind a firewall, hence simplifying security solutions.

Software architectures commonly used in intranets include the Distributed Computing Environment (DCE), DCOM, and CORBA. In these systems, programs typically do not allocate resources and create processes explicitly, but rather connect to established "services" that encapsulate hardware resources or provide defined computational services. Interactions occur via remote procedure call [352] or remote method invocation [424, 290], models designed for situations in which the parties involved have little knowledge of each other. Communications occur via standardized protocols (typically layered on TCP/IP) that are designed for portability rather than high performance. In larger intranets, particularly those used for mission-critical applications, reliable group communication protocols such as those implemented by ISIS [62] and Totem [401] (see Chapter 18) can be used to deal with failure by ordering the occurrence of events within the system.

The limited centralized control provided by a parent organization can allow the deployment of distributed queuing systems such as Load Sharing Facility (LSF), Codine, or Condor, hence providing uniform access to compute resources. Such systems provide some support for remote management of computation, for example, by distributing a limited range of signals to processes through local servers and a logical signal distribution network. However, issues of security, payment mechanisms, and policy often prevent these solutions from scaling to large intranets.

In a similar fashion, uniform access to data resources can be provided by means of wide area file system technology (such as DFS), distributed database technology, or remote database access (such as SQL servers). High-performance, parallel access to data resources can be provided by more specialized systems such as the High Performance Storage System [562]. In these cases, the interfaces presented to the application would be the same as those provided in the cluster environment.

The greater heterogeneity, scale, and distribution of the intranet environment also introduce the need for services that are not needed in clusters. For example, resource discovery mechanisms may be needed to support the discovery of the name, location, and other characteristics of resources currently available on the network. A reduced level of trust and greater exposure to external threats may motivate the use of more sophisticated security technologies. Here, we can once again exploit the limited centralized control that a parent organization can offer. Solutions such as Kerberos [418] can be mandated and integrated into the computational model, providing a unified authentication structure throughout the intranet.

Future Directions

Existing intranet technologies do a reasonable job of projecting a subset of familiar programming models and services (procedure calls, file systems, etc.) into an environment of greater complexity and physical scale, but are inadequate for performance-driven applications. We expect future developments to overcome these difficulties by extending lighter-weight interaction models originally developed within clusters into the more complex intranet environment, and by developing specialized performance-oriented interfaces to various services. Some relevant issues are discussed in Chapters 17 and 20.

2.5.5 Internets

The final class of systems that we consider is also the most challenging on which to perform network computing—internetworked systems that span multiple organizations. Like intranets, internets tend to be large and heterogeneous. The three principal additional complicating factors that an internet introduces are as follows:

1. Lack of centralized control: There is no central authority to enforce operational policies or to ensure resource quality, and so we see wide variation in both policy and quality.

2. Geographical distribution: Internets typically link resources that are geographically widely distributed. This distribution leads to network performance characteristics significantly different from those in local area or metropolitan area networks of clusters and intranets. Not only does latency scale linearly with distance, but bisection bandwidth arguments [147, 197] suggest that accessible bandwidth tends to decline linearly with distance, as a result of increased competition for long-haul links.

3. International issues: If a grid extends across international borders, export controls may constrain the technologies that can be used for security, and so on.

State of the Art

The internet environment's scale and lack of central control have so far prevented the successful widespread deployment of grid services. Approaches that are effective in intranets often break down because of the increased scale and lack of centralized management. The set of assumptions that one user or resource can make about another is reduced yet further, a situation that can lead to a need for implementation techniques based on discovery and negotiation.

We use two examples to show how the internet environment can require new approaches. We first consider security. In an intranet, it can be reasonable to assume that every user has a preestablished trust relationship with every resource that he wishes to access. In the more open internet environment, this assumption becomes intractable because of the sheer number of potential process-to-resource relationships. This problem is accentuated by the dynamic and transient nature of computation, which makes any explicit representation of these relationships infeasible. Free-flowing interaction between computations and resources requires more dynamic approaches to authentication and access control. One potential solution is to introduce the notion of delegation of trust into security relationships; that is, we introduce mechanisms that allow an organization A to trust a user U because user U is trusted by a second organization B, with which A has a formal relationship. However, the development of such mechanisms remains a research problem (see Chapter 16).
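
In skeletal form, delegation replaces a per-user access list with a chain of attestations. The sketch below (hypothetical names, with signed certificates reduced to bare lookup tables) checks whether some organization that the resource owner has a formal relationship with vouches for the requesting user:

```python
# Skeletal trust delegation: organization A accepts user U if some
# organization B that A has a formal relationship with vouches for U.
# Real systems would use signed certificates and bounded delegation chains.
relationships = {"org_a": {"org_b", "org_c"}}      # A's formal partners
vouches = {"org_b": {"alice"}, "org_c": {"bob"}}   # who each org vouches for

def authorize(resource_owner, user):
    if user in vouches.get(resource_owner, set()):
        return True                     # direct, preestablished trust
    return any(user in vouches.get(partner, set())
               for partner in relationships.get(resource_owner, set()))

print(authorize("org_a", "alice"))      # True: org_b vouches for alice
print(authorize("org_a", "mallory"))    # False: no partner vouches
```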

As a second example, we consider the problem of coscheduling. In an intranet, it can be reasonable to assume that all resources run a single scheduler, whether a commercial system such as LSF or a research system such as Condor. Hence, it may be feasible to provide coscheduling facilities in support of applications that need to run on multiple resources at once. In an internet, we cannot rely on the existence of a common scheduling infrastructure. In this environment, coscheduling requires that a grid application (or scheduling service acting for an application) obtain knowledge of the scheduling policies that apply on different resources and influence the schedule either directly through an external scheduling API or indirectly via some other means [144].

Future Directions

Future development of grid technologies for internet environments will involve the development of more sophisticated grid services and the gradual evolution of the services provided at end systems in support of those services. There is little consensus on the shape of the grid architectures that will emerge as a result of this process, but both commercial technologies and research projects point to interesting potential directions. Three of these directions—commodity technologies, Legion, and Globus—are explored in detail in later chapters. We note their key characteristics here but avoid discussion of their relative merits. There is as yet too little experience in their use for such discussion to be meaningful.

The commodity approach to grid architecture, as advocated in Chapter 10, adopts as the basis for grid development the vast range of commodity technologies that are emerging at present, driven by the success of the Internet and Web and by the demands of electronic information delivery and commerce. These technologies are being used to construct three-tier architectures, in which middle-tier application servers mediate between sophisticated back-end services and potentially simple front ends. Grid applications are supported in this environment by means of specialized high-performance back-end and application servers.

The Legion approach to grid architecture, described in Chapter 9, seeks to use object-oriented design techniques to simplify the definition, deployment, application, and long-term evolution of grid components. Hence, the Legion architecture defines a complete object model that includes abstractions of compute resources called host objects, abstractions of storage systems called data vault objects, and a variety of other object classes. Users can use inheritance and other object-oriented techniques to specialize the behavior of these objects to their own particular needs, as well as develop new objects.

The Globus approach to grid architecture, discussed in Chapter 11, is based on two assumptions:

1. Grid architectures should provide basic services, but not prescribe particular programming models or higher-level architectures.

2. Grid applications require services beyond those provided by today's commodity technologies.

Hence, the focus is on defining a "toolkit" of low-level services for security, communication, resource location, resource allocation, process management, and data access. These services are then used to implement higher-level services, tools, and programming models.

In addition, hybrids of these different architectural approaches are possible and will almost certainly be addressed; for example, a commodity three-tier system might use Globus services for its back end.

A wide range of other projects are exploring technologies of potential relevance to computational grids, for example, WebOS [546], Charlotte [47], UFO [13], ATLAS [40], Javelin [122], Popcorn [99], and Globe [549].

2.6 RESEARCH CHALLENGES

What problems must be solved to enable grid development? In preceding sections, we outlined what we expect grids to look like and how we expect them to be used. In doing so, we tried to be as concrete as possible, with the goal of providing at least a plausible view of the future. However, there are certainly many challenges to be overcome before grids can be used as easily and flexibly as we have described. In this section, we summarize the nature of these challenges, most of which are discussed in much greater detail in the chapters that follow.

2.6.1 The Nature of Applications

Early metacomputing experiments provide useful clues regarding the nature of the applications that will motivate and drive early grid development. However, history also tells us that dramatic changes in capabilities such as those discussed here are likely to lead to radically new ways of using computers—ways as yet unimagined. Research is required to explore the bounds of what is possible, both within those scientific and engineering domains in which metacomputing has traditionally been applied, and in other areas such as business, art, and entertainment. Some of these issues are discussed at greater length in Chapters 3 through 6.

2.6.2 Programming Models and Tools

As noted in Section 2.4, grid environments will require a rethinking of existing programming models and, most likely, new thinking about novel models more suitable for the specific characteristics of grid applications and environments. Within individual applications, new techniques are required for expressing advanced algorithms, for mapping those algorithms onto complex grid architectures, for translating user performance requirements into system resource requirements, and for adapting to changes in underlying system structure and state. Increased application and system complexity increases the importance of code reuse, and so techniques for the construction and composition of grid-enabled software components will be important. Another significant challenge is to provide tools that allow programmers to understand and explain program behavior and performance. These issues are discussed in Chapters 7 through 10 and 15.

2.6.3 System Architecture

The software systems that support grid applications must satisfy a variety of potentially conflicting requirements. A need for broad deployment implies that these systems must be simple and place minimal demands on local sites. At the same time, the need to achieve a wide variety of complex, performance-sensitive applications implies that these systems must provide a range of potentially sophisticated services. Other complicating factors include the need for scalability and evolution to future systems and services. It seems likely that new approaches to software architecture will be needed to meet these requirements—approaches that do not appear to be satisfied by existing Internet, distributed computing, or parallel computing technologies. Architectural issues are discussed in Chapters 9, 10, 11, and 13.

2.6.4 Algorithms and Problem-Solving Methods

Grid environments differ substantially from conventional uniprocessor and parallel computing systems in their performance, cost, reliability, and security characteristics. These new characteristics will undoubtedly motivate the development of new classes of problem-solving methods and algorithms. Latency-tolerant and fault-tolerant solution strategies represent one important area in which research is required [40, 47, 99]. Highly concurrent and speculative execution techniques may be appropriate in environments where many more resources are available than at present. These issues are touched upon in a number of places, notably Chapters 3 and 7.

2.6.5 Resource Management

A defining feature of computational grids is that they involve sharing of networks, computers, and other resources. This sharing introduces challenging resource management problems that are beyond the state of the art in a variety of areas. Many of the applications described in later chapters need to meet stringent end-to-end performance requirements across multiple computational resources connected by heterogeneous, shared networks. To meet these requirements, we must provide improved methods for specifying application-level requirements, for translating these requirements into computational resources and network-level quality-of-service parameters, and for arbitrating between conflicting demands. These issues are discussed in Chapters 12, 13, and 19.

2.6.6 Security

Sharing also introduces challenging security problems. Traditional network security research has focused primarily on two-party client-server interactions with relatively low performance requirements. Grid applications frequently involve many more entities, impose stringent performance requirements, and involve more complex activities such as collective operations and the downloading of code. In larger grids, issues that arise in electronic markets become important. Users may require assurance and licensing mechanisms that can provide guarantees (backed by financial obligations) that services behave as advertised [325]. Some of these issues are addressed in Chapter 16 and Section 4.3.4.

2.6.7 Instrumentation and Performance Analysis

The complexity of grid environments and the performance complexity of many grid applications make techniques for collecting, analyzing, and explaining performance data of critical importance. Depending on the application and computing environment, poor performance as perceived by a user can be due to any one or a combination of many factors: an inappropriate algorithm, poor load balancing, inappropriate choice of communication protocol, contention for resources, or a faulty router. Significant advances in instrumentation, measurement, and analysis are required if we are to be able to relate subtle performance problems in the complex environments of future grids to appropriate application and system characteristics. Chapters 14 and 15 discuss these issues.

2.6.8 End Systems

Grids also have implications for the end systems from which they are constructed. Today's end systems are relatively small and are connected to networks by interfaces and with operating system mechanisms originally developed for reading and writing slow disks. Grids require that this model evolve in two dimensions. First, by increasing demand for high-performance networking, grid systems will motivate new approaches to operating system and network interface design in which networks are integrated with computers and operating systems at a more fundamental level than is the case today. Second, by developing new applications for networked computers, grids will accelerate local integration and hence increase the size and complexity of the end systems from which they are constructed. Significant research is required in both areas, as discussed in Chapters 17 and 20.

2.6.9 Network Protocols and Infrastructure

Grid applications can be expected to have significant implications for future network protocols and hardware technologies. Mainstream developments in networking, particularly in the Internet community, have focused on best-effort service for large numbers of relatively low-bandwidth flows. Many of the future grid applications discussed in this book require both high bandwidths and stringent performance assurances. Meeting these requirements will require major advances in the technologies used to transport, switch, route, and manage network flows. These issues are discussed in Chapters 18 and 21. In addition, as discussed in Chapter 22, a next generation of testbeds will be required to support the experiments that will advance the state of the art.

2.7 SUMMARY

This chapter has provided a high-level view of the expected purpose, shape, and architecture of future grid systems and, in the process, sketched a road map for more detailed technical discussion in subsequent chapters. The discussion was structured in terms of six questions.

Why do we need computational grids? We explained how grids can enhance human creativity by, for example, increasing the aggregate and peak computational performance available to important applications and allowing the coupling of geographically separated people and computers to support collaborative engineering. We also discussed how such applications motivate our requirement for a software and hardware infrastructure able to provide dependable, consistent, and pervasive access to high-end computational capabilities.

What types of applications will grids be used for? We described five classes of grid applications: distributed supercomputing, in which many grid resources are used to solve very large problems; high throughput, in which grid resources are used to solve large numbers of small tasks; on demand, in which grids are used to meet peak needs for computational resources; data intensive, in which the focus is on coupling distributed data resources; and collaborative, in which grids are used to connect people.

Who will use grids? We examined the shape and concerns of four grid communities, each supporting a different type of grid: a national grid, serving a national government; a private grid, serving a health maintenance organization; a virtual grid, serving a scientific collaboratory; and a public grid, supporting a market for computational services.

How will grids be used? We analyzed the requirements of five classes of users for grid tools and services, distinguishing between the needs and concerns of end users, application developers, tool developers, grid developers, and system managers.

What is involved in building a grid? We discussed potential approaches to grid architecture, distinguishing between the differing concerns that arise and technologies that have been developed within individual end systems, clusters, intranets, and internets.

What problems must be solved to enable grid development? We provided a brief review of the research challenges that remain to be addressed before grids can be constructed and used on a large scale.

FURTHER READING

For more information on the topics covered in this chapter, see www.mkp.com/grids and also the following references:

• A series of books published by the Corporation for National Research Initiatives [215, 217, 218, 216] reviews and draws lessons from other large-scale infrastructures, such as the electric power grid, telecommunications network, and banking system.

• Catlett and Smarr's original paper on metacomputing [109] provides an early vision of how high-performance distributed computing can change the way in which scientists and engineers use computing.

• Papers in a 1996 special issue of the International Journal of Supercomputer Applications [155] describe the architecture and selected applications of the I-WAY metacomputing experiment.

• Papers in a 1997 special issue of the Communications of the ACM [515] describe plans for a National Technology Grid.

• Several reports by the National Research Council touch upon issues relevant to grids [411, 412, 410].

• Birman and van Renesse [63] discuss the challenges that we face in achieving reliability in grid applications.


CHAPTER 4

Realtime Widely Distributed Instrumentation Systems

William E. Johnston

Useful and robust operation of realtime distributed systems requires many capabilities, including automated management of data streams and distributed system components; semiautonomous operation of remote instrument systems; generalized access control; dynamic scheduling and resource reservation; application designs that can adapt to congestion; and brokering mechanisms. These capabilities will be built on supporting architecture, middleware, and low-level services, such as high-speed network-based caches, realtime cataloging systems, and agent-based systems that provide, for example, dynamic performance analysis.

In this chapter we first provide some rationale for remote realtime applications and then characterize the associated problems, discussing the nature of remote operation in terms of an example: a remotely controlled beamline at the Advanced Light Source. Then, we provide more detailed descriptions of three additional applications that we have constructed and that illustrate the practical issues that arise in systems of this type: a cardioangiography system that depends on realtime data cataloging, a high-energy physics data-processing architecture that supports very high data rates and volume, and a remote-control system for an electron microscope.

Finally, we discuss issues and approaches to providing the infrastructure required to support the cited applications. We describe a model architecture, network-based caches, agent-based management and monitoring, and policy-based access control systems.


4.1 DISTRIBUTED REALTIME APPLICATIONS

High-speed data streams result from the operation of many types of online instruments and imaging systems and are a staple of modern scientific, health care, and intelligence environments. The advent of shared, widely available high-speed networks is providing the potential for new approaches to the collection, organization, storage, analysis, and distribution of the large data objects that result from such data streams. These new approaches will make both the data and its analysis much more readily available. To illustrate this emerging paradigm, we examine several examples that come from quite different application domains but that have a number of similar architectural elements.

Health care imaging systems illustrate both high data rates and the need for realtime cataloging. High-volume health care video and image data used for diagnostic purposes (e.g., X-ray CT, MRI, and cardioangiography) are collected at centralized facilities and, through widely distributed systems, may be stored, managed, accessed, and referenced at locations other than the point of collection (e.g., the hospitals of the referring physicians). In health care imaging systems, it is important that the health care professionals at the referring facility (hospitals or clinics frequently remote from the tertiary imaging facility) have ready access not only to the image analyst's reports, but also to the original image data. Additionally, it is important to provide and manage distributed access to tertiary storage because laboratory instrumentation environments, hospitals, and so on are frequently not the best place to maintain a large-scale digital storage system. Such systems can have considerable economy of scale in operational aspects, and an affordable, easily accessible, high-bandwidth network can provide location independence for such systems.

High-energy physics experiments illustrate both very high data rates and volumes that have to be processed and archived in real time and must be accessible to large scientific collaborations—typically hundreds of investigators at dozens of institutions around the world. High-bandwidth (20–40 MB/s) data handling for analysis of high-energy and nuclear physics data is increasingly likely to have a source of data that is remote from the computational and storage facilities. The output from particle detectors (the instrument) is subjected to several stages of data reduction and analysis. After the initial processing, the analysis functions are carried out by dispersed collaborators and facilities. Their analysis is then organized in information systems that may reside on a single storage system or be distributed among several physical systems.

Remote microscopy control illustrates the problem of the human always being remote from the controlled system or object of interest. Data is typically collected as images (in the spatial or Fourier domains) that are then analyzed to provide information both for experiment control and for analysis. Experiment and instrument control includes object tracking, both in order to keep the object visible (e.g., drift and depth-of-focus compensation) and to observe changes in the object. Some of this information may be fed back to the apparatus that is acting on the object, as in application of electromagnetic fields and thermal gradients. In all of these cases, the precision, repetition, or time scale means that humans cannot directly perform the required tasks effectively. The human operators provide the high-level control, such as initially identifying objects of interest, establishing operating set points, and defining protocols for the in situ experiments. Automated operation of the low-latency, low-level control enables the human functions to be carried out over wide area as well as local area networks.

4.2 PROBLEM CHARACTERIZATION AND PROTOTYPES

Realtime management of distributed instrumentation systems involves remote operation of instrument control functions, distributed data collection and management, and/or distributed data analysis and cataloging. Each of these regimes requires a supporting infrastructure of middleware and of systems and communications services.

The required middleware services include automated cataloging and tertiary storage system interfaces (i.e., a digital library system between the instrument and the user; see Chapter 5); automated monitoring and management systems for all aspects of the distributed components (see Chapters 14 and 15); policy-based access control systems to support scheduling and resource allocation (e.g., quality of service, or QoS; see Chapter 19), security, distributed system integrity, and (potentially) automated brokering and system construction; and rich media capabilities to support telepresence and collaboration (see Chapter 6).

Supporting systems and communications services include flexible transport mechanisms; reliable and unreliable wide area multicast; resource reservation and QoS for computing, storage, and communications; and security to protect the network-level infrastructure (see Chapter 16).

These capabilities are not sufficient, but are a representative collection of necessary services for remotely operated, high-performance data systems. In the next few sections we will illustrate some of the issues that give rise to the need for these services.


4.2.1 The Nature of Remote Operation

Distributed instruments can be remote in space, scale, or time. Remote in space is the typical circumstance for network-distributed scientific collaboration, where instruments are located at one facility, users are located at others, and data processing and storage at yet others. Another common circumstance is that the controlled function is sufficiently remote in scale that direct control is not possible. Many microscopic experiment environments fall into this category. The operation of the Mars Pathfinder mission Rover vehicle provides an example of functional control that is remote in time. (Rover operation was specified a day in advance, and then the actions were uploaded for the following day's mission, which was carried out autonomously.) Each of these scenarios provides circumstances that have to be addressed for remote operation.

When the operator is remote from the instrument, as is the case when the instrument is located at a national facility like the Lawrence Berkeley National Laboratory (LBNL) Advanced Light Source, and the investigators are located at universities and laboratories scattered across the country, several issues arise. Multiple media streams are typically required in order to support human interaction (audio and video conferencing and worksurface sharing) and to provide a sense of presence (remote environment monitoring) so that the general environment, including the equipment area and local personnel, can be observed in order to verify general operational status. The experiment itself (e.g., a sample chamber) must typically be visually monitored as a "sanity" check to ensure that the data stream is actually the result of the intended experiment. Finally, since data is shared in real time among several experimenters, additional data streams are required for online analysis and control (see Plate 7) [10, 9, 485].

Multiple collaborators, each of whom needs to see the instrument output in real time and potentially control the instrument, require synchronized and reliable access to the data and control. The shared control panels shown in Plate 7 illustrate such a capability, which is based on reliable multicast protocols.

When the scale of the operations is very different from human scale, remote operation must typically involve some machine intelligence. Automated operations analyze the sensor data in real time and adapt the progress of the experiment depending on the results of analysis. The human function is to set up the experiment, identify the object of interest in the experiment environment, and so on. The actual operation, however, cannot be in human hands.


Automation can also be critical to the remote operation of experiments when operating over a wide area IP network of unpredictable or high latency. Such a network cannot be used to provide fine-grained, realtime control, as required, for example, in a closed-loop servo system where the operating functions are at one end of the network and the data analysis that provides the feedback is at the other end. Incorporating machine intelligence into the experiment control system and remotely performing monitoring and data analysis address this problem.

In the rest of this section, we use three examples to illustrate some approaches to addressing these issues.

4.2.2 Cardioangiography: Realtime Data Cataloging

In many environments the key aspect of realtime data is the immediate and automated processing necessary to organize and catalog the data and make it available to remote sites. The online cardioangiography system that we describe here is typical of such an environment. Data is generated in large volumes and with high throughput, and the people generating the data are geographically separated from the people cataloging or using the data.

There are several important considerations for managing this type of instrument-generated data:

1. Automatic generation of at least minimal metadata

2. Automatic cataloging of the data and the metadata as the data is received (or as close to real time as possible)

3. Transparent management of tertiary storage systems where the original data is archived

4. Facilitation of cooperative research by providing specified users at local and remote sites immediate as well as long-term access to the data

5. Incorporation of the data into other databases or documents

For the online cardioangiography system (a remote medical imaging system), a realtime digital library system collects data from the instrument and automatically processes, catalogs, and archives each data unit together with the derived data and metadata, with the result being a Web-based object representing each data set. This automatic system operates 10 hours/day, 5–6 days/week, with data rates of about 30 Mb/s during the data collection phase (about 20 minutes per hour); see Figure 4.1.


FIGURE 4.1 A distributed health care imaging application. (The diagram shows digital video capture at the Kaiser San Francisco Hospital Cardiac Catheterization Lab connected over the NTON network testbed to the LBNL WALDO server and DPSS for data processing, cataloging, and storage, with access by Kaiser Oakland Hospital and the Kaiser Division of Research, and a link to the MAGIC testbed.)

WALDO (Wide Area Large Data Objects) is a realtime digital library system that uses federated textual and URL-linked metadata to represent the characteristics of large data sets (see Figure 4.2 and [296]). Incoming data is automatically cataloged by extracting associated metadata and converting it into text records, by generating auxiliary metadata and derived data, and by combining this data into Web-based objects that include persistent references to the original data components. Tertiary storage management for the original data sets is achieved by using the remote program execution capability of Web servers to manage the data on a mass storage system. For subsequent use, the data components may be staged to a local disk and then returned as usual via the Web browser or, as is the case for high-performance applications, moved to a high-speed cache for direct access by the specialized applications. The location of the data components on tertiary storage, information on how to access them, and other descriptive material are all part of the object definition. The creation of object definitions, the inclusion of "standardized" derived-data objects as part of the metadata, and the use of typed links in the object definition are intended to provide a general framework for dealing with many different types of data, including abstract instrument data and multicomponent, multimedia programs.
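
To make the shape of such an object definition concrete, the sketch below shows what a WALDO-style large data object might look like. The field names and link types are our own shorthand, not the actual WALDO schema.

# Illustrative sketch of a WALDO-style large data object (LDO) definition.
# The field names and link types are assumptions, not the actual WALDO schema.
from dataclasses import dataclass, field

@dataclass
class TypedLink:
    link_type: str   # e.g., "original-data", "derived-data", "metadata"
    url: str         # persistent reference (URL or URN)

@dataclass
class LargeDataObject:
    object_id: str        # globally unique, persistent name
    text_metadata: dict   # extracted, searchable text records
    components: list = field(default_factory=list)  # typed links to data

ldo = LargeDataObject(
    object_id="urn:example:angio/run17",
    text_metadata={"modality": "cardioangiography", "frames": 4200},
    components=[
        TypedLink("original-data", "https://mss.example.org/run17.raw"),
        TypedLink("derived-data", "https://cache.example.org/run17-preview.mpg"),
    ],
)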


FIGURE 4.2 Overall architecture and data flow of WALDO. (The diagram traces data from the data source through a high-speed network cache (DPSS) and processing to LDO object description generation, Web server–based component storage, and tertiary archiving on a mass storage system, with search, access control, curator, and data-user interfaces; the numbered elements correspond to the list under "WALDO Software Architecture" below.)

WALDO uses an object-oriented approach for capture, storage, catalog, retrieval, and management of large data objects (LDOs) and their associated metadata. The architecture includes a collection of widely distributed services to provide flexibility in managing storage resources, reliability and integrity of access, and high-performance access, all in an open environment where the use conditions for resources and stored information are guaranteed through the use of a strong, but decentralized, security architecture.

Elements of the WALDO Model

The WALDO model offers realtime cataloging of extensible, linked, multicomponent data objects that can be asynchronously generated by remote, online data sources. Class-based methods are used to manage large data objects. Collections of data objects are handled by flexible curator/collection owner management, including "anytime" management of the collection organization and object metadata. There is also globally unique and persistent naming of the objects and their various components via URLs and URNs. There is strong access control at the level of individual object components based on use condition certificates managed by the data owner. Additionally, there is high-performance application access to the data components.

WALDO Software Architecture

Figure 4.2 illustrates the data flow and overall organization of the WALDO architecture. The basic elements of the architecture include the following:

1. Data collection systems and the instrument network interfaces

2. High-speed, network-based cache storage for receiving data, for providing intermediate storage for processing, and for high-speed application access

3. Processing mechanisms for various sorts of data analysis and derived data generation

4. Data management that provides for the automatic cataloging and metadata generation that produces the large data object definitions

5. Data access interfaces, including application-oriented interfaces

6. Flexible mechanisms for providing various searching strategies

7. Transparent security that provides strong access control for the data components based on data owner policies

8. Transparent tertiary storage ("mass storage") management for the data components

9. Curator interfaces for managing both the metadata and the large data object collection

10. User access interfaces for all relevant aspects of the data (applications, data, and metadata)

These elements are all provided with flexible, location-independent interfaces so that they can be freely (transparently) moved around the network as required for operational or other logistical convenience.

The model just described has been used in several data-intensive computing applications; however, it raises a number of issues. The distributed cache is an important component, but one that requires distributed management and distributed security. The incorporation of a digital-library-like function is an important consideration, but such automatic cataloging in the face of human error in the operation of the instrument (and the resulting errors in the metadata and cataloging) requires human curation of the library. Access control is a critical aspect when sensitive or confidential data is involved, and the management of the access control must also be distributed to the various principals. Approaches to several of these issues are discussed below.

4.2.3 Particle Accelerators: High-Data-Rate Systems

Our next example concerns a detector system at a high-energy physics particle accelerator. Modern detectors like STAR (Solenoidal Tracker at the Relativistic Heavy Ion Collider at Brookhaven National Laboratory) will generate 20–40 MB/s of data that must be processed in two phases: data collection and event reconstruction (Phase 1) and physics analysis (Phase 2) [247, 295].

In Phase 1, a detector puts out a steady-state high-data-rate stream. Traditionally, the data is archived, and a first level of processing is performed at the experiment site. The resulting second-level data is also archived and then used for the subsequent physics analysis. The data is thus archived at the experiment site in "medium-sized" tertiary storage systems. This approach has disadvantages: large mass storage systems are one of the few computing technologies that continue to exhibit significant economies of scale, and therefore central sites remain an important architectural component in high-data-volume systems. However, the potential problems of network access to large-scale storage systems can be overcome with network-based caching.

In a grid environment, medium-sized tertiary systems at experiment sites can be replaced by a distributed cache consisting of a high-speed, high-capacity network-based cache and very large tertiary systems at dedicated storage sites.

The Distributed Parallel Storage System (DPSS), described below, can serve as the cache for all stages of data manipulation. DPSS provides a scalable, dynamically configurable, high-performance, and highly distributed storage system that is usually used as a (relatively long-term) cache of data. It is typically used to collect data from online instruments and then supply that data to analysis applications or to high-data-rate visualization applications (as in the case of MAGIC, the wide area gigabit testbed where DPSS was originally developed; see Chapter 21 [330, 539, 476]). The system is also being used in satellite image-processing systems and for the distributed online, high-data-rate health care imaging systems described above.


FIGURE 4.3 Distributed physics data handling. Phase 1 processing is shown on the left, Phase 2 processing on the right. (In Phase 1, the detector's data flows through a local data buffer to an online event data cache (DPSS) and a reconstruction and high-performance analysis cluster, with an offline event archive; in Phase 2, events are cached locally on DPSS at remote analysis sites, connected by ATM LAN and WAN.)

The architecture illustrated in Figure 4.3 supports distributed computational systems doing the Phase 1 data processing in real time. Realtime data processing potentially also supports two capabilities. First, it can provide auxiliary information to assist in the organization of data as it is transferred to tertiary storage (the STAR experiment will generate about 1.7 TB/day). Second, it can provide feedback to the instrument operators about the functioning of the accelerator detector system and the progress of the experiment, so that changes and corrections may be made.
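
The quoted daily volume follows directly from the detector's steady-state output rate; a quick check of the arithmetic, using the low end of the 20–40 MB/s range:

# Back-of-the-envelope check of the quoted STAR daily data volume,
# assuming continuous operation at 20 MB/s.
rate_mb_per_s = 20
seconds_per_day = 24 * 60 * 60
daily_tb = rate_mb_per_s * seconds_per_day / 1e6
print(f"{daily_tb:.2f} TB/day")  # ~1.73 TB/day, matching the ~1.7 TB/day above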

In the Phase 2 processing (interactive analysis), the architecture enables an efficient implementation of the second-level analysis of the data. This involves using a high-speed cache like DPSS as a large "window" on the tape-based data in the tertiary storage system in order to support the use of both local and remote computational resources (Figure 4.3). Prototype versions of this architecture have been successfully tested [295].

The issues raised in this environment include the use of distributed caches, the organization of the cache, the various interfaces to the cache, the management of the movement of data to and from the tertiary storage systems, and the management of the cache components in a wide area network.


FIGURE 4.4 The high-voltage electron microscope (HVEM) at NCEM.

4.2.4 Electron Microscopy: Control-Centered Systems

Our final example concerns the remote control of an electron microscope. An evolutionary step in multimedia systems is for them to provide a computational framework for the extraction of information from images and video sequences in real time. This information can then be used to manipulate experiments or to perform other operations, based on the information content of the images. This realtime analysis enables semiautonomous remote control based on the image content. One such application of this approach is a system for remote operation of in situ microscopy. A testbed for this approach is a 1.5 MeV transmission electron microscope, shown in Figure 4.4, that is operated by the National Center for Electron Microscopy (NCEM).


In situ microscopy refers to a class of scientific experiments in which a specimen is excited by external stimuli and the response must be either observed or controlled. The stimuli could, for example, be in the form of temperature variation or stress in the sample environment. The interaction of the external stimuli and specimen can result in sample drift, shape deformation, changes in object localization, changes in focus, or simply anomalous specimen responses to normal operating conditions. Currently, during the in situ experiments the operator must constantly adjust the instrument to maintain depth of focus and compensate for various drifts. These activities are labor intensive and error-prone, require a high-bandwidth video link to the operator, and are nearly impossible to perform over wide area networks because of limited network bandwidth.

For example, a class of in situ electron microscopy experiments requires dynamic interaction with the specimen under observation as it is excited with external stimuli (e.g., temperature variation, EM field variation). The dynamic operations include control of the sample's position and orientation under the electron beam, and of the illumination conditions and focus. Remote control via wide area networks like the Internet, which do not offer realtime data and command delivery guarantees, is not practical for the finely tuned adjustments that dynamic studies require.

Enabling remote control of dynamic experiments involves such tasks as separating the basic human interaction of establishing control system parameters, like gross positioning and identifying objects of interest (which do not require low-latency interaction), from the control servoing that performs operations like autofocus, object detection, and continuous fine positioning to compensate for thermal drift.

The human interaction operations, together with the supporting human communication involving video and audio teleconferencing, can easily be performed in a wide area network environment [194, 367]. The dynamic control operations, on the other hand, must occur in a much more controlled environment where the control operation and the monitored response to the control or stimuli have to be coupled by low-latency communication that is not possible in wide area networks. Therefore, dynamic remote-control applications usually involve automated control operations performed near the instrument, in order to eliminate the wide area network realtime delivery requirement.
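
One way to picture this partitioning is the sketch below: a simple proportional servo loop runs locally, next to the instrument, at rates an unpredictable WAN could never sustain, while the remote operator merely updates set points asynchronously. The names and the control law are our own illustrative assumptions, not the NCEM control software.

# Illustrative skeleton of the control partition described above: a
# low-latency servo loop runs locally at the instrument, while the remote
# operator only updates set points over the WAN. Names are assumptions.
import threading, time

set_points = {"focus": 0.0, "stage_x": 0.0}   # updated remotely, read locally
lock = threading.Lock()

def local_servo_loop(read_sensor, actuate, period_s=0.01):
    # Runs near the instrument: a 100 Hz closed loop is far beyond what an
    # unpredictable WAN could support.
    while True:
        with lock:
            target = set_points["focus"]
        error = target - read_sensor()
        actuate(0.5 * error)           # simple proportional correction
        time.sleep(period_s)

def remote_update(name, value):
    # Called over the WAN: latency here only delays set-point changes,
    # never the servo loop itself.
    with lock:
        set_points[name] = value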

This approach requires determining the type of servo loops needed to enable remote operation and collaboration, and the implementation of a control architecture. The basic aspect of the architecture is a partitioning that separates the low-frequency servo loop functions that enable direct human interaction performed over the wide area network from those functions that require low-latency control and are performed locally by using automated techniques (see Figure 4.5). The approach hides the latencies in the wide area network and permits effective remote operation. The result is telepresence that provides the illusion of close geographical proximity for in situ studies. With this approach, the testbed 1.5 MeV transmission electron microscope can now be used online via the global Internet.

FIGURE 4.5 Remote, semiautonomous, dynamic experiment operation architecture. (The diagram partitions the system into a remote-control environment, connected across an unpredictable WAN for human interaction and compressed monitoring video, and a local environment in which video stream analysis and the equipment control server close the low-latency loop with the microscope, video imaging, and stage drive through a gateway.)

In the case of image-based instrumentation, control may be automated by using computer vision algorithms that permit instrumentation adjustments to be made automatically in response to information extracted from the video signal generated by the microscope imaging system. Thus, by relieving the operator of having to do the dynamic adjustment of the experimental setup, remote collaboration and remote operation of the in situ studies over a wide area network are made possible. The computational vision techniques that support remote in situ microscopy applications include image compression, autofocusing, self-calibration, object detection, tracking by using either high-level or low-level features, and servo loop control mechanisms [441, 443].


FIGURE 4.6 Remote in situ experiment interface for the NCEM HVEM microscope.

The image content analysis that provides the information that is fed back to the control system is automated and performed in the environment local to the instrument. That is, the computers that acquire and analyze the video images and then communicate with the control system are all connected by fast local area networks. The set points that initialize the servo loops—the selection of objects of interest and the parameters of external forcing functions, as well as the monitoring of the experiment—may be carried out in a wide area network environment.

The microscope and experiment control interface, a typical image, and the results of video content analysis for shape and drift velocity are illustrated in Figure 4.6.

The main issues that are raised by this sort of remote operation are the servoing architecture and the algorithms used for information extraction and control [442].

4.2.5 Summary

The four examples presented in this section (a media-rich instrument control environment, a health care imaging system doing autonomous data collection and cataloging, a high-data-volume physics experiment environment, and a semiautonomous control system) illustrate several aspects of remote operation and expose some of the capabilities that will be needed to support routine construction and use of these types of systems in the future. In summary, the online angiography system requires automated management of data streams, the use of a network cache, automatic cataloging, and distributed access control. These, in turn, require semiautonomous monitoring and QoS guarantee mechanisms in the network and in the processing and storage systems. The STAR detector scenario uses a widely distributed configuration of the network cache and distributed management of computational resources and data. The shared interface example of the Advanced Light Source Beamline 7 requires reliable multicast in wide area networks and rich-media management mechanisms. All of the examples require distributed management of system resources and of distributed access control, both for security and for the "distributed enterprise" management of users and resources. In the next section we examine some approaches to providing these capabilities.

In addition, most of the scenarios would potentially benefit from bandwidth-adaptive interface features; the Beamline 7 and microscopy scenarios are candidates for dynamic system construction with brokered resources to support their transient needs for significant computational resources. These desired capabilities are, however, not addressed in our current systems.

4.3 ISSUES, CAPABILITIES, AND FUTURE DIRECTIONS

In this section we describe some of the architectural and middleware approaches that are proving useful, and sometimes critical, in implementing high-performance distributed instrumentation and data systems.

4.3.1 A Model Data-Intensive Architecture

In our research, we have demonstrated the utility, in the automated cataloging and high-data-rate application domains, of using a high-speed distributed cache as a common element for all of the sources and sinks of data involved in high-performance data systems. This cache-based approach provides standard interfaces to a large, application-oriented, distributed, online, transient storage system. Each data source deposits its data in the cache, and each data consumer takes data from the cache, usually writing the processed data back to the cache. In almost every case there is also a tertiary storage system manager that migrates data to and from the cache at various stages of processing (see Figure 4.7).


FIGURE 4.7 The data-handling model. (The diagram shows an instrument (e.g., a detector), initial data processing, analysis applications, and object archiving and management all attached through cache interfaces to a high-speed, distributed random access cache; an archive data mover and object manager connect the cache to a tertiary storage system, such as HPSS, through an MSS interface.)

For the various data sources and sinks, the cache, which is itself a complex and widely distributed system, provides a standardized approach for high-data-rate interfaces; an "impedance"-matching function (e.g., between the coarse-grained nature of parallel tape drives in the tertiary storage system and the fine-grained access of hundreds of applications); and flexible management of online storage resources to support initial caching of data, processing, and interfacing to tertiary storage.

Depending on the size of the cache relative to the objects of interest, the tertiary storage system management (the object manager plus the archive data mover of Figure 4.7) may involve moving only partial objects to the cache; that is, the cache is a moving window for the offline object/data set. The application interface to the cache can support a variety of I/O semantics, including UNIX disk I/O semantics (i.e., upon posting a read, the available data is returned; requests for data in the data set but not yet migrated to cache cause the application-level read to block until the data is migrated from tape to cache).
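
A minimal sketch of these blocking-read semantics, assuming placeholder hooks (block_in_cache, request_migration, read_block) for whatever the cache and mass storage system actually provide:

# Minimal sketch of UNIX-style read semantics over a moving-window cache.
# block_in_cache() and request_migration() are assumed placeholder hooks,
# not actual DPSS or HPSS calls.
import time

def cached_read(dataset, offset, length, cache):
    block_id = dataset.block_for(offset)
    if not cache.block_in_cache(block_id):
        # Data is in the data set but not yet staged from tape: the read
        # blocks, as in UNIX disk I/O, until migration completes.
        cache.request_migration(block_id)
        while not cache.block_in_cache(block_id):
            time.sleep(0.1)
    return cache.read_block(block_id, offset, length)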

Generally, the cache storage configuration is large compared with the available disks of a typical computing environment, and very large compared with any single disk (e.g., hundreds of gigabytes).


4.3.2 Network-Based Caches

DPSS serves several roles in high-performance, data-intensive computing environments. This application-oriented cache provides a standard interface for high-speed data access and provides the functionality of a single, very large, random-access, block-oriented I/O device (i.e., a "virtual disk"). It provides high capacity (we anticipate a terabyte-sized system for physics data) and isolates the application from the tertiary storage system. Many large datasets can be logically present in the cache by virtue of the block index maps being loaded even if the data is not yet available. In this way processing can begin as soon as the first data has been migrated from tertiary storage.

Generally speaking, DPSS can serve as an application cache for any number of high-speed data sources (instruments, multiple mass storage systems, etc.). The naming issue (e.g., resolving independent name space conflicts) is handled elsewhere. For example, in the online health care imaging system mentioned above, the name space issue is addressed by having all of the data represented by Web-based objects that are managed by WALDO [296]. At the minimum, WALDO provides globally unique naming and serves as a mechanism for collecting different sources of information about the data. The Web object system can also provide a uniform user (or application) front end for managing the data components (e.g., migration to and from different mass storage systems), and it manages object use conditions (PKI access control [297]).

DPSS provides several important and unique capabilities for the distributed architecture. The system provides application-specific interfaces to an extremely large space of logical blocks (16-byte indices). It may be dynamically configured by aggregating workstations and disks from all over the network (this is routinely done in the MAGIC testbed [476] and will in the future be mediated by the agent-based management system). It offers the ability to build large, high-performance storage systems from inexpensive commodity components. It also offers the ability to increase performance by increasing the number of parallel DPSS servers. A cache management policy module operates on a per-data-set basis to provide block aging and replacement when the cache is serving as a front end for tertiary storage.

The high performance of DPSS—about 10 MB/s of data delivered to the user application per disk server—is obtained through parallel operation of independent network-based components. Flexible resource management—dynamically adding and deleting storage elements, partitioning the available storage, and so on—is provided by design, as are high availability and strongly bound security contexts. Scalability is provided by many of the same design features that provide the flexible resource management (and hence the capability to aggregate dispersed and independently owned storage resources into a single cache).

FIGURE 4.8 DPSS architecture. (The diagram shows a client application issuing data requests through application data access methods and the DPSS API client-side library; a DPSS master performs logical-to-physical name translation, cache management, disk resource management, and data set access control; disk servers return the data stream directly to the application via "third-party" transfers; agent-based management covers storage server and network state, redundant masters, and data set metadata; two security contexts cover system integrity and physical resources, and data use conditions.)

When datasets are identified by the object manager (e.g., as in Figure 4.7) and are requested from tertiary storage, the logical-to-physical block maps become immediately available. The data mover operates asynchronously, and if an application "read" requests a block that has not yet been loaded, the application is notified (e.g., the read operation blocks). At this point the application can wait or request information on available blocks in order to continue processing.

While the basic interface provides for requesting lists of named logical blocks, many applications use file I/O semantics, as provided in the DPSS client-side interface library.
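
As an illustration of how file semantics can sit on top of the block interface, the shim below maps a byte-range read onto logical block requests. The names are ours; the actual DPSS client-library API is not reproduced here.

# Sketch of a file-I/O shim over a block-oriented cache interface.
# request_blocks() stands in for the DPSS block-request call; the actual
# client-library API is not reproduced here.
BLOCK_SIZE = 64 * 1024  # assumed block size

def file_read(dataset, offset, length, request_blocks):
    # Translate a byte-range read into the list of logical blocks
    # that cover it, then reassemble the requested bytes.
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    blocks = request_blocks(dataset, list(range(first, last + 1)))
    data = b"".join(blocks)
    start = offset - first * BLOCK_SIZE
    return data[start:start + length]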

The internal architecture of DPSS is illustrated in Figure 4.8. Typical DPSS implementations consist of several low-cost workstations, each with several disk controllers, and several disks on each controller. A three-server DPSS can thus provide transparent parallel access to 20–30 disks. The data layout on the disks is completely up to the application, and the usual strategy for sequential reading applications is to write the data "round-robin" (striped across servers). Otherwise, the most common strategy is to determine the physical block locations randomly when they are written. Our experience has shown that, with the high degree of parallelism provided at the block level when a DPSS is configured from, say, 30 disks spread across three servers, random placement of blocks provides nearly optimal access time for a wide range of read patterns.

DPSS provides several features to ensure that distributed caches provide significant value to the remote operation and computational grid environments. These features include agent-managed dynamic reconfiguration (i.e., adding and deleting servers and storage resources during operation), agent-managed replication (of data, name translation, and disk servers) for reliability and performance, data block request semantics to support application data prediction, and application access semantics (e.g., a large block index space allows encoding of some application information, such as longitude, latitude, and elevation for tiled geographical image data).

4.3.3 Agent-Based Management and Monitoring

The combination of generalized autonomous management of distributed components and accurate monitoring of all aspects of the environment in which data moves has turned out to be a critical aspect of the debugging, evaluation, adaptation, and management of widely distributed high-data-rate applications.

In widely distributed systems, when we observe that something has gone wrong, it is generally too late to react. In fact, we frequently cannot even tell what is wrong, because the problem depends on a history of events, or because the needed information is no longer accessible, or because it will take too long to ask and answer all of the required questions.

An agent-based approach for analysis of the operation of distributed applications in high-speed wide area networks can be used to monitor and identify all of the factors that affect performance and to isolate the problems arising from individual hardware and software components. Agents not only can provide standardized access to comprehensive monitoring, but they can also perform tasks such as keeping a state history in order to answer the question, How did we get here? Active analysis of operational patterns (e.g., pattern analysis of event-based lifeline traces) will lead to adapting behavior/configuration to avoid or correct problems.


Monitoring

One successful monitoring methodology involves recording every event of potential significance together with precision timestamps, and then correlating events on the basis of the logged information. This allows constructing a comprehensive view of the overall operation, under realistic operating conditions, revealing the behavior of all the elements of the application-to-application communications path in order to determine exactly what is happening within complex distributed systems. This approach has been used in the DPSS distributed storage system and its client applications. As data requests flow through the system, timestamps and log records are generated at every critical point. Network and operating system monitoring tools are used to log additional events of interest using a common format. This monitoring is designed to facilitate performance tuning, distributed application performance research, the characterization of distributed algorithms, and the management of functioning systems (by providing the input that allows adaptation to changes in operating conditions). The approach allows measuring network performance in a manner that is a much better real-world test than, for example, ttcp, and allows us to accurately measure the dynamic throughput and latency characteristics of our distributed application code—top to bottom and end to end [295].
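
The logging pattern this implies can be sketched in a few lines: every component stamps each event with a high-resolution clock and a request identifier, so that per-request lifelines can be reassembled afterward. Event names and record format here are illustrative assumptions.

# Sketch of lifeline-style event monitoring: each component records
# precision-timestamped events keyed by request id, and the log is later
# correlated to reconstruct the end-to-end path. Event names are illustrative.
import time

event_log = []

def log_event(request_id, component, event):
    # A common record format lets application, OS, and network events
    # be merged and correlated after the fact.
    event_log.append((time.monotonic_ns(), request_id, component, event))

def lifeline(request_id):
    # Reassemble the time-ordered lifeline of one request.
    return sorted(e for e in event_log if e[1] == request_id)

log_event(17, "client", "read-posted")
log_event(17, "dpss-master", "name-translated")
log_event(17, "disk-server-2", "block-sent")
for ts, _, comp, ev in lifeline(17):
    print(ts, comp, ev)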

This sort of detailed monitoring is also a practical tool for system-level problem analysis, as has been demonstrated in the analysis of a TCP-over-ATM problem that was uncovered while developing the monitoring methodology in the ARPA-funded MAGIC gigabit testbed (a large-scale, high-speed ATM network [540]).

The high-level motivation for this work is twofold. First, when developing high-speed, network-based distributed services, we often observe unexpectedly low network throughput and/or high latency. The reason for the poor performance is frequently not obvious. The bottlenecks can be (and have been) in any of the components: the applications, the operating systems, the device drivers, the network adapters on either the sending or receiving host (or both), the network switches and routers, and so on. It is difficult to track down a performance problem because of the complex interaction between the many distributed system components and the fact that problems in one place may be most apparent somewhere else. A precise and comprehensive monitoring and event analysis methodology is an invaluable tool for diagnosing such problems (see Chapters 14 and 15).

Second, such monitoring is one aspect of an approach to building predictable high-speed components that can be used as building blocks for high-performance applications, rather than having to tune the applications top to bottom, as is all too common today. Continuous and comprehensive monitoring can provide the basis of adapting distributed system behavior to "congestion" in processing, storage, and communication elements.

Agent-Based Management of Widely Distributed Systems

If comprehensive monitoring is the key to diagnosis, agent-based management may be the key to keeping widely distributed systems running reliably.

In one prototype system [568], "agents" are autonomous adaptable monitors, managers, information aggregates, and Knowledge Query and Manipulation Language (KQML)-based information filters implemented in Java and constantly communicating with peers and resources. Initial experimentation with such agents in DPSS indicates several potential advantages (see Figure 4.9).

The first is structured access to current and historical information regarding the state of DPSS components.

The second is reliability. Not only does this system keep track of all components within the system, but it also restarts any component that has crashed, including one of the other agents (addressing fault tolerance). "Associated" agents communicate with each other using IP multicast.

A third advantage is automatic reconfiguration. When new components (such as a new disk server) are added, the agents do not have to be reconfigured. Rather, an agent is started on the new host, and it will inform all other agents about itself and the new server. Brokers and agents may discover interesting new agents via a dynamic directory protocol like SDR or by the support of reliable multicast protocols that provide interagent communication. Following discovery, the new agent—and the resource that it represents—is added to the configuration.
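
As a sketch of this discovery pattern, a newly started agent might announce itself roughly as follows; the multicast group, port, and message format are invented for illustration.

# Sketch of agent self-announcement over IP multicast, as in the automatic
# reconfiguration described above. Group address, port, and message format
# are invented for illustration.
import json, socket

GROUP, PORT = "239.1.2.3", 5007

def announce(agent_name, resource):
    # A newly started agent announces itself and the resource it manages;
    # existing agents add both to their configuration on receipt.
    msg = json.dumps({"agent": agent_name, "resource": resource}).encode()
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
    sock.sendto(msg, (GROUP, PORT))

announce("disk-agent-4", "dpss://server-d/disks/0-7")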

Information management is a fourth potential advantage of agent-based management. Broker agents manage information from a collection of monitor agents—usually on behalf of a user client—and provide an integrated view for applications. For example, the Java-based graphical status interface illustrated in Figure 4.10 shows aggregated information from DPSS system-state and data-set–state monitoring agents—the two elements emphasized in Figure 4.9. Agents manage data-set metadata—dynamic state, alternate locations, tertiary location—at each storage system, as well as the state of all network interfaces and data paths and the load of each DPSS disk server.

A fifth advantage is user representation. Brokers can perform actions on behalf of a user. For example, if a data set is not currently loaded onto a DPSS (which is typically used as a cache), the broker can cause the data set to be loaded from tertiary storage.

FIGURE 4.9 An agent-based monitoring architecture that addresses adaptive operation, reliability/survivability, and dynamically updated metadata for data repositories. 1, distributed system management; 2, data state management. (The diagram shows data curators with data set agents and a data set broker, a DPSS broker and client agent serving a client application's monitor interface, and monitoring agents attached to the DPSS master, the DPSS servers, and other caches.)

The broker/agent architecture also allows for efficient system administration. In particular, rule-based operation of the agents can be used to determine what policies are to be enforced while remaining separate from the actual mechanism used to implement these policies.

Finally, agent-based management provides flexible functionality. New agent methods can be added at any time. For example, the brokers have an algorithm for determining which DPSS configuration to use based on a set of parameters that include network bandwidth, latency, and disk server load. This algorithm can be modified on the fly by loading new methods into the agents. Related agents are part of the same security context, and new code/methods presented to the agents are cryptographically signed for origin verification and integrity.
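
The signed-code check might look roughly like the following sketch, which uses a generic RSA signature verification; the prototype's actual signing scheme is not specified in the text.

# Sketch of verifying cryptographically signed agent code before loading it.
# Uses a generic RSA signature check; the prototype's actual signing scheme
# is not specified in the text, so this is only illustrative.
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import padding

def load_method(agent, method_bytes, signature, origin_public_key):
    # Reject code whose signature does not verify against a trusted origin.
    origin_public_key.verify(
        signature, method_bytes,
        padding.PKCS1v15(), hashes.SHA256(),
    )  # raises InvalidSignature on failure
    agent.install(method_bytes)  # only reached if verification succeeded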


FIGURE 4.10 A prototype display for an agent-based distributed storage monitoring system.

Prototype Implementation

With a prototype of such an agent architecture in the MAGIC testbed, an application uses aggregated information from a broker to present an adaptive and dynamic view of the system: data throughput, server state, and data set metadata as reported by the agents. Self-configuring user interfaces (e.g., as in Figure 4.10) can be built dynamically, based on a broker agent collecting and organizing classes of information from the relevant set of monitor agents.

4.3.4 Policy-Based Access Control

Collaborative distributed environments that involve multiuser instruments at national facilities, widely distributed supercomputers and large-scale storage systems, data sharing in restricted collaborations, and network-based multimedia collaboration channels give rise to a range of requirements for distributed access control. For example, administration of such resources as network QoS will need to be handled by an automated authorization infrastructure so that management of both resource availability and allocation, as well as subsequent enforcement of use conditions, can be done automatically and without recourse to a central or single authority.

FIGURE 4.11 Societal access control model. (The diagram illustrates a hypothetical ALS medical beamline whose stakeholders—DOE-HQ, LBNL, the ALS, the UC Human Use Committee, and the group PI—each impose use conditions (exclude certain countries; LBNL staff or guests only; X-ray safety training; an approved protocol; group membership), which are matched at the access control point against user attributes issued by attribute certifiers such as a passport agency, the LBNL Personnel Department, XYZ State University, and the ALS Medical Beamline Group PI; access is granted only after the use conditions and attributes match.)

In all of these scenarios, the resource (data, instrument, computational and storage capacity, communication channel) has multiple stakeholders (typically the intellectual principals and policy makers), and each stakeholder will impose use conditions on the resource. All of the use conditions must be met simultaneously in order to satisfy the requirements for access. This model is common in society and is illustrated in Figure 4.11.

Further, scientific collaborations often are diffuse, with the principals and stakeholders being geographically distributed and multiorganizational. Therefore, the access control mechanism must accommodate these circumstances by providing distributed management of policy-based access control for all resources; authentication, integrity, confidentiality, and so on of resource-related information; and mechanisms supporting the internal integrity of distributed systems.

We also anticipate that the resulting infrastructure will support automated brokering and policy-based negotiation for resources.

Goals

The goal for access control in such distributed environments is to reflect, in a computing and communication-based working environment, the general principles that have been established in society for policy-based resource access control.

All responsible entities (principals and stakeholders) should be able to make their assertions (as they do now by signing, for example, a policy statement) without reference to a mediator, and especially without reference to a centralized mediator (e.g., a system administrator) who must act on their behalf. The mechanism must be dynamic and easily used, while maintaining strong assurances. Only in this way will computer-based security systems achieve the decentralization and utility needed for the scalability to support large distributed environments.

The computer systems-based resource access control mechanisms should be able to collect all of the relevant assertions (stakeholder use conditions and corresponding attributes) and make an unambiguous access decision without requiring entity-specific or resource-specific local, static configuration information that must be centrally administered. (This requirement does not imply that such specific configuration is precluded, only that it should not be required.) The mechanism (Figure 4.12) should also be based on, and evolve with, the emerging, commercially supplied public-key certificate infrastructure components.
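
A minimal policy engine for this decision might look like the following sketch. The data structures are illustrative and far simpler than the prototype described below [297]; the key property is that every stakeholder’s use condition must be satisfied by the verified attributes before access is granted:

```python
# Minimal sketch of a stakeholder use-condition / user-attribute match.
# All stakeholders' conditions must hold simultaneously (Figure 4.11);
# the structures here are illustrative, not the prototype's actual API.
from dataclasses import dataclass
from typing import Callable, Mapping


@dataclass
class UseCondition:
    stakeholder: str                                 # who imposed it
    predicate: Callable[[Mapping[str, str]], bool]   # the condition itself


def access_decision(conditions: list[UseCondition],
                    attributes: Mapping[str, str]) -> bool:
    """Grant access only if every use condition is met by the attributes.

    The attributes are assumed to have been verified against the attribute
    certifiers' digital signatures before this decision is made.
    """
    return all(c.predicate(attributes) for c in conditions)


conditions = [
    UseCondition("LBNL", lambda a: a.get("affiliation") in ("staff", "guest")),
    UseCondition("ALS", lambda a: a.get("xray-safety-training") == "passed"),
    UseCondition("Group PI", lambda a: a.get("group") == "medical-R&D"),
]
print(access_decision(conditions, {"affiliation": "guest",
                                   "xray-safety-training": "passed",
                                   "group": "medical-R&D"}))   # True
```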

Expected Benefits

For security to be successful in distributed environments, providing both protection and policy enforcement, each principal entity should have the same involvement as in the currently established procedure in the absence of computer security: no more, no less. That is, those who have the authority to set access conditions or use conditions by, for example, holographically signing statements in a paper environment, will digitally sign functionally equivalent statements in a distributed computing-based environment. The use of these credentials should be automatic, and the functions of checking credentials, auditing, and so on should be performed by appropriate entities in either circumstance.


FIGURE 4.12 An authorization- and attribute-based access control architecture. (The diagram shows certification authorities attesting to identity, attribute authorities attesting to user characteristics, and authorization authorities supplying resource owner-generated use conditions. Their digitally signed documents, generated by many different principals, are published as general authoritative information on certificate servers. A policy engine acting on behalf of all stakeholders combines the stakeholder identities and the user identity to answer authorization requests from an access control gateway, which authorizes the user client application to operate on the resource.)


The expected advantages of computer-based systems are in maintaining access control policy, but with greatly increased independence from temporal and spatial factors (e.g., time zone differences and geographic separation), together with automation of redundant tasks such as credential checking and auditing. The intended outcome is that the scientific community will more easily share expensive resources, unique systems, and sensitive data. A further expected benefit is that this sort of security infrastructure should provide the basis for automated brokering of resources that precedes the construction of dynamically and just-in-time configured systems to support, for example, scientific experiments with transient computing, communication, or storage requirements.


Authorization-Based Distributed Security

An approach that addresses the general goals noted above can be based on authorization and attribute certificates. These digitally signed documents have the characteristic that they assert document validity without physical presence of the signer or physical possession of holographically signed documents. The result is that the digitally signed documents that provide the assertions of the principals, stakeholders, attribute authorities, and so on may be generated, represented, used, and verified independent of time or location.

Other parts of the approach are implemented through the use of “authorities” that provide delegation mechanisms and assured information as digitally signed documents: identity authorities connect human entities and systems to digital signatures; stakeholder authorities provide use conditions; attribute authorities attest to user characteristics. Additional components include reliable mechanisms for generating, distributing, and verifying the digitally signed documents; mechanisms that match use conditions and attributes; and resource access control mechanisms that use the resulting credentials to enforce policy for the specific resource. (For a general introduction to public-key infrastructure, see [196, 493].)

Architecture for Distributed Management of Fine-Grained Access Control

A prototype implementation [297] that is addressing distributed management of access control to limited, valuable, or large-scale resources, data, and objects (e.g., large scientific instruments, distributed supercomputers, sensitive but unclassified databases) is providing some experience with decentralized security environments. The prototype includes fully distributed resource management and access. In our target environment, the resource users, resource owners, and other stakeholders are remote from the protected resource, which is the norm in large-scale scientific instrument environments, among others. In the prototype, all significant resources have multiple stakeholders, all of whom provide their own use conditions, which are specified in the environment of the stakeholder and then provided to the resource access control mechanism. At the heart of the prototype is an attribute-based access policy: users are permitted access to resources based on attributes that satisfy the stakeholder use conditions. These attributes are attested to by trusted third parties. Validation of the right of access is typically used to establish the security context for an underlying security system such as SSL (e.g., between Web browsers and servers [268]) or GSS (secure messaging between components of distributed systems [347]).


The prototype provides for resource owners and other stakeholders to remotely exercise control over access to the resource (objects and data), for legitimate users (those that satisfy the use conditions of the resource stakeholders) to obtain easy access, and for unqualified or unauthorized users to be strongly denied access. The architecture is illustrated in Figure 4.12.

In addition to the technology issues of integrity and management of the access control system and associated computing platforms, useful security is as much (or more) a deployment and user ergonomics issue. That is, the problem is as much one of integrating good security into the end-user (e.g., scientific) environment, so that it will be used, trusted to provide the protection that it claims, easily administered, and genuinely useful in the sense of “providing distributed enterprise capabilities” (that is, providing new functionality that supports distributed organizations and operation), as it is one of addressing the more traditional security issues.

While the security architecture provides the basic technology, in order to accomplish a useful service the architecture must be applied in such a way that the resources are protected as intended by the principals. This involves understanding the information/resource use and structure model and developing a policy model that will support the intended access control. These must be supported by a security model that specifies how the elements of the security architecture and infrastructure will implement the policy model.

A prototype implementation of this architecture [297] provides a policy engine that implements both flat and hierarchical multiple-use-condition policy models, uses X.509 identity certificates and ad hoc attribute and use-condition certificates obtained from Web and LDAP servers, and provides a policy evaluation service to the Apache Web server and an implementation of SPKM/GSS.

ACKNOWLEDGMENTS

The material presented in this chapter represents the work of numerous people: in particular, Deborah Agarwal, Bahram Parvin, Mary Thompson, and Brian Tierney, as well as the author and others at Lawrence Berkeley National Laboratory. More information may be found at www.itg.lbl.gov.

Physicists Craig Tull and Doug Olson are our collaborators in the STAR project; Joe Terdiman and Bob Lundstrum of Kaiser Permanente and Evert-Jan Pol of Philips Research are our collaborators for the cardioangiography project; Brian Tonner of the University of Wisconsin–Milwaukee is our collaborator on the ALS Beamline 7, Spectro-Microscopy Collaboratory; and Ulrich Dahmen is our collaborator at the NCEM.


We also acknowledge Stewart C. Loken, division director of the Information and Computing Sciences Division, LBNL, for his long-term support and contributions to this work specifically, and the idea of collaboratories generally.

The work described here is supported by the U.S. Department of Energy, Energy Research Division, Mathematical, Information, and Computational Sciences and ER-LTT offices, under contract DE-AC03-76SF00098 with the University of California, and by DARPA, ISTO.

FURTHER READING

For more information on the topics covered in this chapter, see www.mkp.com/grids and also the following references:

• The National Research Council report [410] first described the concept of a collaboratory.

• Craig Partridge’s book [440] discusses gigabit networking and includes useful bibliographical references.


CHAPTER 5

Data-Intensive Computing

Reagan W. Moore
Chaitanya Baru
Richard Marciano
Arcot Rajasekar
Michael Wan

Computational grids provide access to distributed compute resources and distributed data resources, creating unique opportunities for improved access to information. When data repositories are accessible from any platform, applications can be developed that support nontraditional uses of computing resources. Environments thus enabled include knowledge networks, in which researchers collaborate on common problems by publishing results in digital libraries, and digital government, in which policy decisions are based on knowledge gleaned from teams of experts accessing distributed data repositories. In both cases, users access data that has been turned into information through the addition of metadata that describes its origin and quality. Information-based computing within computational grids will enable collective advances in knowledge [396].

In this view of the applications that will dominate in the future, application development will be driven by the need to process and analyze information, rather than the need to simulate a physical process. In addition to accessing specific data sets, applications will need to use information discovery interfaces [138] and dynamically determine which data sets to process. In Section 5.1, we discuss how these applications will evolve, and we illustrate their new capabilities by presenting projects now under way that use some concepts implicit within grid environments. Data-intensive applications that will require the manipulation of terabytes of data aggregated across hundreds of files range from comparisons of numerical simulation output, to analyses of satellite observation data streams, to searches for homologous structures for use as input conditions in chemical structure computations. Accessing terabytes of data will require data transfer rates approaching 10 GB/s, implying the ability to manipulate a petabyte of data per day from data repositories distributed across multiple sites.
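
A quick check of that figure (the arithmetic below is ours, not the chapter’s):

```python
# 10 GB/s sustained around the clock is on the order of a petabyte per day.
rate_gb_per_s = 10
seconds_per_day = 24 * 60 * 60             # 86,400
print(rate_gb_per_s * seconds_per_day)     # 864,000 GB, roughly 0.86 PB/day
```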

The creation of computational grids implies not only ubiquitous access to computing resources but also uniform access to all data systems. No matter where an application is executed, it needs access to input data sets on local or remote disk systems, in distributed data repositories, or in archival storage systems. For data sets to be remotely accessible, metadata must be provided that describes the data sets’ location. Providing this information in a metadata catalog constitutes a form of publication because it makes it possible to find the data sets through information discovery interfaces.

Grids will require new mechanisms to support publication and peer review of data. Data is only as good as the degree of assurance about its quality and the validity of the categorization of the data by discipline-specific attributes. If the data is not of high quality, conclusions drawn about the data are suspect. If data sets cannot be located because they are described incorrectly, they will never be used. Peer-reviewed publication of data solves both problems by providing metadata that can be used to access high-quality curated data globally, effectively turning data into information. In grids, publication of peer-reviewed data sets will become as important as publication of peer-reviewed scientific reports: By using information discovery interfaces, the most recently published data can be used directly in subsequent analyses, providing an information feedback loop that nonlinearly advances scientific discoveries.

The implementation of information-based computing will require dramatic extensions to the data support infrastructure of grid systems. Data sets are valuable when they are organized as information and made accessible to other researchers. Several scientific disciplines (e.g., high-energy physics, molecular science, astronomy, and computational fluid dynamics) have recognized this fact and are now aggregating domain-specific results into data repositories. The data is reviewed, annotated, and made accessible through a common, uniform interface. Researchers are then able to make faster progress by accessing all of the curated data related to their discipline. Such data repositories, however, require a variety of services to make the data useful, including support for data publication, data curation and quality assurance, information discovery, and distributed data analysis. The emerging technology that provides these services is based on digital libraries. In Sections 5.2 and 5.3, we discuss how digital library technology can be integrated into grids to enable information-based computing.


Grid software infrastructure must be based on application requirements to be useful to the user community. Postulating grid environments that will not be functional until after the year 2000, though, requires us to postulate similarly how applications are most likely to evolve. We base our projection of future application requirements on current supercomputer systems in the belief that individual nodes in computational grids will have capabilities similar to those of current supercomputers. At the same time, the pressure to analyze information and build discipline-specific data collections will force the development of new technologies. We base our projection of such new technologies on the services that digital libraries now provide for local data repositories. We expect that data-intensive applications will compel the coevolution of supercomputer, digital library, and grid technologies. This conclusion is based on the fact that teraFLOPS-capable computers of the future will generate petabytes of data that must be managed and turned into information. Similarly, the digital libraries of the future will house petabytes of data that will need teraFLOPS-capable computers to support analysis and other services. In Section 5.4, we discuss how grid systems can facilitate the merger of these technologies.

5.1 EVOLUTION OF DATA-INTENSIVE APPLICATIONS

The term data-intensive computing is used to describe applications that are I/O bound. Such applications devote the largest fraction of execution time to movement of data. They can be identified by evaluating “computational bandwidth”: the number of bytes of data processed per floating-point operation. On vector supercomputers, applications that sustain high performance usually access 7 bytes of data from memory for every floating-point operation [552, 395]. For well-balanced applications, this ratio should match the memory bandwidth divided by the CPU execution rate. When data transmission rates between computer memory and local disk are examined, we find that memory acts as a cache that greatly reduces the disk bandwidth requirements. For vector supercomputers, the computational bandwidth to disk is 1 byte of data accessed per 70 FLOPS, a factor of 490 smaller. For well-balanced applications, this ratio should match the disk bandwidth divided by the CPU execution rate. We can think of data-intensive applications, then, as codes that require data access rates to data storage peripherals that are substantial fractions of their memory data access rates.
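
Expressed in code, the two ratios and the factor between them (using the chapter’s numbers) are:

```python
# Computational bandwidth: bytes of data accessed per floating-point operation.
mem_bytes_per_flop = 7.0            # memory traffic on vector supercomputers
disk_bytes_per_flop = 1.0 / 70.0    # disk traffic: 1 byte per 70 FLOPS
# Memory caching absorbs most of the data traffic before it reaches disk.
print(round(mem_bytes_per_flop / disk_bytes_per_flop))   # 490
```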

In computational grids, a data-intensive application may need a high-bandwidth data access rate all the way to a remote data repository. Since network bandwidth performance tends to be smaller than local disk bandwidth performance, it appears that data-intensive applications will be difficult to support in grid environments. In practice, when the available bandwidth is less than the required bandwidth, either the CPU is held idle until the data sets are available, or other jobs are executed while waiting for the data to be cached on the local system. One challenge is hence to support both distributed processing, in which the application is moved to the site where the data resides, and distributed caching, in which the data is moved to the supercomputer for analysis. The former method tends to be preferred when data is processed through the CPU once, but the latter if data is read multiple times by the application. The decision between these options depends on determining which one minimizes the total time needed for solution. This calculation is dependent on the network protocol overhead, network latency, computational and network bandwidths, and the total amount of data accessed [150, 394, 397]. This task should be supported within computational grids by a scheduler that tracks the availability of resources and chooses the optimal site to execute each application (see Chapter 12).
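
A first-order version of that calculation might be sketched as follows; the additive cost model and all parameter values are illustrative assumptions, whereas a real grid scheduler would use measured bandwidths, latencies, and load forecasts:

```python
# Hypothetical first-order comparison of distributed caching (move the data
# to the supercomputer) versus distributed processing (move the application
# to the data). All parameters are illustrative.

def cache_locally_s(data_bytes, net_bw_Bps, latency_s, overhead_s,
                    reads, local_pass_s):
    # One wide-area transfer, then `reads` passes over the local copy.
    return overhead_s + latency_s + data_bytes / net_bw_Bps + reads * local_pass_s

def process_remotely_s(reads, remote_pass_s):
    # No bulk transfer; every pass runs at the (possibly slower) remote site.
    return reads * remote_pass_s

def choose(data_bytes, net_bw_Bps, latency_s, overhead_s,
           reads, local_pass_s, remote_pass_s):
    t_cache = cache_locally_s(data_bytes, net_bw_Bps, latency_s,
                              overhead_s, reads, local_pass_s)
    t_remote = process_remotely_s(reads, remote_pass_s)
    if t_cache < t_remote:
        return "cache data locally", t_cache
    return "process at the data", t_remote

# One pass over 100 GB at 100 MB/s favors remote processing; ten passes
# amortize the transfer and favor caching the data locally.
print(choose(100e9, 1e8, 0.1, 1.0, reads=1, local_pass_s=60, remote_pass_s=300))
print(choose(100e9, 1e8, 0.1, 1.0, reads=10, local_pass_s=60, remote_pass_s=300))
```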

5.1.1 Data Assimilation

A good example of a data-intensive application is the problem of assimilating remote satellite observations into a comprehensive data set that describes the weather over the entire globe. Satellite observations provide measurements of some of the physical parameters needed to describe the weather, but only for the portion of the earth that is covered by the orbit, and only for the time period that the satellite is over a given area. What is desired instead is a globally consistent data set that describes all of the physical parameters for the entire globe, with snapshots of the data at regular intervals in time. At the NASA Data Assimilation Office, the Goddard Earth Observing System (GEOS) Data Assimilation System (DAS) is used to accomplish this task [449]. The analysis requires running a General Circulation Model (GCM) to predict the global weather patterns, comparing the satellite observations with the predicted weather, calculating the discrepancies between the observed and predicted data, then rerunning the model using gridded corrections, or increments, to reproduce the observed data. The assimilation cycle is repeated every 6 hours.

The discrete observations are interpolated onto a regular time and space grid. The gridded data is then used to evaluate global hydrological and energy cycles. The winds derived from the atmospheric circulation are used to transport trace gases in the troposphere and stratosphere. The end products are data sets used by other researchers to support, for example, investigation of greenhouse gases, dust circulation from volcanoes, and global heating.


Network connection    Bandwidth (Mb/s)    Daily transfer (GB/day)
T1                    1.4                 15
T3                    45                  486
OC-3                  155                 1,670
OC-12                 622                 6,720
OC-48                 2,488               26,870
OC-192                9,952               107,480

TABLE 5.1 Upper limits for data transmission.


The GEOS DAS is a prototype computational grid application. Approximately 2 GB of data per day are collected at the NASA Goddard Space Flight Center Distributed Active Archive Center (DAAC) in Maryland. The raw data is processed at Goddard to generate the input data sets for DAS. The data is then sent to the NASA Ames computer facility in California for analysis by DAS. The results, approximately 6 GB per day, are then sent back to the Goddard DAAC. This process requires moving data on a continual basis across the country between the two sites, turning the raw data into curated information through the use of the GCM simulation, then moving the data products back to Goddard for publication. The data is cached at NASA Ames while the data assimilation is done. DAS requirements are tractable because the total amount of data movement per day is small compared with the available network bandwidth.

As shown in Table 5.1, a T3 network connection (45 Mb/s) can transmit more than 400 GB of data per day. The amount of data transmitted by DAS will grow as higher-resolution grids are used in the GCM weather simulation or as more data is collected from satellites. Once the data movement uses an appreciable fraction of the available bandwidth, computational grids must manage collective effects arising from contention for resources. Eventually, the data assimilation will require scheduling of data transmission and of disk cache space utilization, in addition to scheduling of CPU access. See Chapter 19 for quality-of-service issues related to network use.
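
The table’s entries follow directly from the link rates; recomputing them (our arithmetic, which matches the table to rounding):

```python
# Reproduce Table 5.1: GB moved per day by a fully utilized link.
links_mbps = {"T1": 1.4, "T3": 45, "OC-3": 155,
              "OC-12": 622, "OC-48": 2488, "OC-192": 9952}
for name, mbps in links_mbps.items():
    gb_per_day = mbps * 1e6 * 86400 / 8 / 1e9     # bits/s -> GB/day
    print(f"{name:7s}{gb_per_day:10,.0f} GB/day")  # e.g., T3 -> 486
```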

While the software systems used to support DAS have been explicitly developed for this purpose, the data-handling steps are quite general:

1. Identify the required raw data sets.

2. Retrieve the data from a data repository.


3. Subdivide the data to generate the required input set.

4. Cache the data at the site where the computation will take place.

5. Analyze the data, and generate new data products.

6. Transmit the results back to the data repository.

7. Publish (register) the new data sets in the repository.

Grid data-handling environments should provide mechanisms to support all these steps, which would greatly simplify the effort required to develop other data assimilation applications.

A second component of the DAS mission is to support reanalysis of the data. As the physical models for representing weather improve, the GCM will be modified correspondingly. The new models are expected to provide better simulation predictions and improved assimilation of the data. Prior observational data will be reanalyzed to provide higher-quality weather patterns. The reanalyses will be done using data over 10-year periods, requiring the movement of 29 TB of data stored in more than 47,000 files. Such data handling is onerous, unless identification of the data sets and caching of the data can be automated and managed within the application.

Data handling can be automated if general interfaces can be designed that support information discovery from running applications. One difficulty occurs in creating a logical handle for the input file. This handle must be generated dynamically based on the attributes associated with the data set, such as the type of satellite and the time of the observation. A second difficulty occurs in determining where the input file is located, since in general it will not be on local disk. The application must be able to perform caching of that named data set onto the local disk from the remote data repository.

Traditionally, name generation has been automated by embedding the data set’s attributes within the UNIX pathname under which the data set is stored. An application-specific algorithm is used to concatenate the attributes into a valid pathname, which is then accessed through a UNIX open statement. This works as long as the application knows the concatenation algorithm. Computational grids should provide a more general support mechanism to identify data sets by querying an information discovery system for the location and name of the data set that corresponds to the desired attributes. This will require users to learn how to invoke information discovery interfaces and interpret the results [262].
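
The traditional convention amounts to something like the following (a purely illustrative encoding; the real concatenation algorithm is application specific):

```python
import os

# Illustrative application-specific encoding of data-set attributes into a
# UNIX pathname; producer and consumer must share this algorithm.
def pathname_for(satellite: str, observed: str, root: str = "/data/das") -> str:
    return os.path.join(root, satellite, observed[:4], observed + ".obs")

path = pathname_for("GOES-8", "19971104T0600")
# -> /data/das/GOES-8/1997/19971104T0600.obs
with open(path) as f:      # the conventional UNIX open (file assumed to exist)
    data = f.read()
```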


5.1.2 Distributed Data Analysis

The size of the individual data sets required to support the DAS analysis is relatively modest. But what if the total amount of data becomes much larger than the transmission capacity of the underlying grid infrastructure? This situation can occur because either the size of the data collection becomes very large or the sizes of individual data sets become very large. Then additional data-handling systems are needed to support processing of the data within the repository. An example is the Digital Sky project, which will integrate sky surveys that contain data taken at different light wavelengths.

Recent advances in astronomy research have made it feasible to create digital images of large areas of the sky by digitizing existing astronomical photographic plates (Digital Palomar Observatory Sky Survey) or directly recording digital signals from light detection devices attached to a telescope (Two Micron All-Sky Survey). The images are analyzed automatically to identify the location and brightness of each object. The aggregate raw data collections range in size from 3 to 40 TB of pixel data. Since most pixels are black, corresponding to no observable star or galaxy, the size of the data collection can be reduced significantly by saving only those pixels where light is detected. This analysis turns 40 TB of unprocessed data into approximately 2 billion objects, each of which has associated metadata to describe its location and brightness. The size is still large, on the order of 250 GB for the object metadata and several terabytes for all the nonblack pixels. The pixel images of each object must be saved to allow reanalysis of each object to verify whether it is a star or galaxy.

This project is of great interest to astronomers because it will allow statistical questions to be answered for objects observed at multiple wavelengths of light. It will be possible to find and access subsets of the data representing objects of the same type and to analyze the images corresponding to the objects for morphology.

Such analyses require the ability to generate database queries based on object attributes or image specifications. The Digital Sky project will need to coordinate such queries across multiple databases, including small (even personal) data sets residing in various locations. Through the use of advanced database and archival storage technology, the goal is to do the analyses in days instead of years.

Unique data-handling requirements for this project arise because of the very large number of objects that must be individually accessible, since as the surveys are completed, the aggregate number of objects will grow to billions. The object metadata can be stored in databases, but the object images will be stored in archival storage systems. A user requesting information about an individual star will need to format the request as a query to the database, while statistical analyses may access a large fraction of the image archive. Thus, accessing the data may require execution of methods to obtain the desired result. This implies that metadata to describe objects within the data-handling environment should be augmented with metadata to describe the types of methods or processing algorithms applicable to the data, making it practical to apply algorithms within the data resource and minimizing the amount of data that needs to be transmitted over the network.

A second requirement on the metadata comes from the need to integrate data from multiple digital sky survey repositories. Each of the surveys will be located at the site where the scientific expertise resides for maintaining data from that survey. Such ownership is necessary to support data curation and validation. An example is the automated classification of stellar objects as stars or galaxies. If a meteor transits the sky during an observation, an automated system might classify its track as a string of galaxies. To ensure against this, the data sets need to be checked by experts at each repository to guarantee their validity. In the Digital Sky project, it will be necessary to integrate access to combinations of relational databases, object-oriented databases, and archival storage systems, which can be done only if system-level metadata is kept that describes the access protocol, network location, and local data-set identifiers for each of the data repositories.

A third requirement is the need to support multiple copies of a data set. When joint statistical analyses are done across two or more surveys, copies of each survey will need to be colocated to facilitate the comparisons. The data-handling system should be able to record the location of the copy and use that information to optimize future requests. Having multiple distributed copies of data sets minimizes network traffic, ensures fault tolerance, improves disaster recovery, and allows the data to be stored in different formats to minimize presentation time. The data access scheduling system can determine which copy is closest in terms of transmission time and can use that copy to minimize the time required to support a query. If the copy becomes inaccessible for any reason, the data-handling system can automatically redirect a data access retrieval request to a backup site.
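
Replica selection of this kind might be sketched as follows (an illustrative metadata layout; `retrieve` stands in for whatever site-specific transfer mechanism is in use):

```python
# Illustrative replica selection: pick the copy with the smallest estimated
# transmission time; fall back to the next copy if a site is unreachable.
from dataclasses import dataclass

@dataclass
class Replica:
    site: str
    latency_s: float        # round-trip startup cost to this site
    bandwidth_Bps: float    # measured transfer rate from this site

def fetch(dataset_bytes: int, replicas: list[Replica], retrieve) -> bytes:
    ranked = sorted(replicas,
                    key=lambda r: r.latency_s + dataset_bytes / r.bandwidth_Bps)
    for replica in ranked:
        try:
            return retrieve(replica)   # site-specific retrieval callback
        except ConnectionError:
            continue                   # redirect to the backup copy
    raise IOError("no accessible replica")
```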

The need to federate access to multiple data repositories will be quite common. Many scientific disciplines are generating multiple data repositories. Each tends to focus on one aspect of the discipline. Consequently, each may use a different set of unique attributes to describe its data sets. Since general questions may be answered only by accessing all of the repositories within a discipline, mechanisms are needed to publish the discipline-specific metadata at each site [244].


Multiple data repositories are also being created in neuroscience. In another project, detailed images of brains for primates (human, macaque monkey), rats, and crickets are being collected in repositories at UCSD, UCLA, and Washington University. Images from all three sites can be used to make statistically significant claims about anatomical properties relating to both structure and function. Comparisons between two brain images are made by transforming the shape of one brain to match the shape of the second through deformation algorithms, requiring access to the transformation algorithms used to represent the brain structures as well as access to both data sets.

Data sets will continue to grow in size as it becomes possible to achieve ever higher resolutions in observational data and simulation calculations. It will become necessary to support data sets that are partitioned across multiple storage devices, with only the active portion of the data set on the local high-performance disk and the rest stored within a repository such as an archival storage system. In the neuroscience community, the size of a current brain image is on the order of 50 GB. However, rapid advances in technology are expected to enable the capture of brain images at much higher resolutions, with up to 1 TB of data stored per image. Therefore, a collection of 1,000 brain images could be as large as 1 PB. This scenario implies that subsets of given images will be used, rather than the entire image, creating the need for the computational grids’ data-handling environments to support replicates of partitions of data sets. In this case, the metadata will have to include information about how the data set is partitioned across multiple storage resources.

5.1.3 Information Discovery

In addition to supporting scientific analysis of distributed data sets, computational grids will also need to support science-based policy making [237]. Grid users will include citizens who want access to information and decision makers who need access to scientific knowledge. The needs of these users have been discussed in multiple NSF-sponsored workshops, including Knowledge Networking [575] and Research and Development Opportunities in Federal Information Services [496]. Both workshops focused on how to turn data into information and how to use information to support predictive modeling, problem analysis, and decision making.

The Knowledge Networking workshops proposed a generalization of grid applications. The term knowledge networks was defined to represent the multiple sets of discipline expertise, information, and knowledge that can be aggregated to analyze a problem of scientific or societal interest, that is, both people and infrastructure. For scientific problems, the people include the application scientists to model the underlying physical systems and the computer scientists to organize and manage the information. For societal problems, the people include not only application and computer scientists, but also planners to develop policy decisions based on the results of scientific models. The infrastructure includes computational grids, an information-based computing environment, models, and applications. Knowledge is presented either as predictions of effects (e.g., the impact of human-released aerosols on global change) or as interpretations of existing data (e.g., analyses of social science surveys). The knowledge can then be used to change previous planning decisions or direct new research activities.

Individual researchers comprise a knowledge network enclave that includes their expertise, data collections, and analysis tools. Researchers form groups to address larger-scale problems whose scope exceeds the capabilities of any individual. Groups, in turn, aggregate themselves into multidisciplinary consortia to tackle Grand Challenge problems. At each level of this hierarchy, information is exchanged within and between enclaves. The enclaves include legacy systems and, consequently, are heterogeneous and distributed. The heterogeneity spans all components of the data and information organization hierarchies (including the data sources themselves, the ways the data is organized, the vocabularies describing the data, and the cultures of the groups of experts). One of the fundamental challenges facing grid systems is to support interoperability between legacy systems, emerging technology, and the multiple cultures that might be connected by a knowledge network [149].

Each knowledge network enclave may impose unique requirements on the data-handling environment of computational grids. Some may keep data private until it can be analyzed for new effects. Some may publish immediately to establish precedence. Some enclaves may organize information for either scientists’ or the public’s use. These enclaves will create new ideas that change the approach to data handling. For instance, an enclave might establish a content standard for all their scientific data objects, implying a world view of that domain. Since the purpose of science is to evolve world views, the content standard within an enclave can also be expected to evolve over time.

Digital library technology addresses some of these issues. For each discipline, an ontology is created that defines how the information is to be structured. Attributes are defined that encapsulate the information as metadata, which are then organized into a schema. Definitions of each attribute are specified by semantics that have a common meaning within the discipline. When world views evolve, the ontology, schema, metadata, and semantics may all need to change. This process implies that the structures used to organize information must themselves evolve over time [219].

Computational grid data-handling environments can be simplified greatly by providing access to persistent objects, with changes to data sets recorded as new versions of the data. Access to prior versions is needed to allow comparisons between versions to determine the impact of modifications. This approach minimizes caching consistency problems. The metadata for a particular repository must be kept consistent with the stored objects, but new versions of a data set can be disseminated lazily to external metadata caches. For example, applications identify the specific version of each data set that is used. The validity of the analysis is then a function of the publishing date of each data set, and the analysis can be rerun if a new version of the data becomes available. Data sets that are highly referenced by the community will become the standards on which analyses are done.

Consequently, grid data-handling environments should provide lineage information about each data set to allow reanalysis of the original data. This information will need to be recorded as part of the system-level metadata, with the lineage preserved through every method invoked on the data. The publication of new data sets should include metadata about the source of all input files and metadata attributes that identify the application or method used to generate the data products.
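
A system-level lineage record might be as simple as the following sketch, written at publication time (a hypothetical layout consistent with these requirements; the field names and example values are ours):

```python
# Hypothetical system-level lineage record stored with each published data set.
from dataclasses import dataclass, field

@dataclass
class LineageRecord:
    dataset_id: str
    version: int
    published: str                       # publication date, ISO 8601
    method: str                          # application/method that produced it
    inputs: list[str] = field(default_factory=list)  # "dataset_id@version" refs

rec = LineageRecord(
    dataset_id="geos-das/assimilated-winds",        # illustrative identifiers
    version=3,
    published="1998-06-01",
    method="GEOS DAS analysis (illustrative)",
    inputs=["goddard-daac/raw-obs@17"],
)
# Rerunning `method` on the recorded inputs reproduces version 3 exactly.
```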

The concern about semantics (the vocabulary used to describe the metadata) is that the data sets are created within a culture. Persons without that cultural background are at risk because they do not understand the underlying ontology used within the domain or the vocabulary used to convey meaning. The cultures may be discipline driven or socially driven (e.g., users with different levels of education). Thus, mechanisms are needed to provide hierarchies of information from general to domain-specific to satisfy public and discipline-oriented scientific access.

Data quality is critical, implying the need for peer review mechanisms for users of data to provide feedback. Even for high-quality data, errors can be introduced from cross-discipline data exchanges and unintended uses of the data. The underlying organization of the data may be inappropriate and may result in the data being biased with respect to another discipline’s usage pattern. An example is a data collection that gives the location of all the hardwood forests. If this is used to represent the location of all of the trees within an area, the data will be inaccurate because nonhardwood trees are not represented.


5.1.4 Application Requirements Summary

Data-intensive computing will require access to data at rates that may exceed the capabilities of computational grid networks. Hence, data-intensive applications will continue to be run on local resources where data and compute servers can be tightly coupled together. Grid systems will be able to support data-intensive applications when it is possible to cache the data at the compute server.

The more general applications in the future, however, will be as interested in metadata about the data set as in the data set itself. The metadata constitutes information that can be used to determine how the data set should be used. Information-based computing will enable applications to make effective use of computational grids by implementing data access behind information discovery interfaces. Information environments within grids will be established through publication of data sets in data repositories or digital libraries. The most general applications will be based on knowledge networks that combine grids and information-based computing environments with enclaves of experts to enable collective analysis of societally important problems.

The application requirements for information-based computing are summarized in Table 5.2. The requirements have been organized loosely as a function of the evolving data environments needed by future applications. They all assume access is being provided to published data sets. In Section 5.2, we examine how data support software infrastructure has also been evolving to address these requirements.

5.2 SOFTWARE INFRASTRUCTURE EVOLUTION

The evolution of data-handling environments has been driven by the need to develop storage systems to hold data, information discovery mechanisms to locate data sets, data-handling mechanisms to retrieve data sets, publication mechanisms to populate high-quality data repositories, and systems to support data manipulation services. Each of these areas has experienced a steady increase in the ability to manage and manipulate data. The evolving capabilities are characterized in Table 5.3. Each row illustrates a different capability, which eventually should constitute part of computational grids. Chapters that provide more detailed discussions of the capabilities are also listed.

In each area, we examine the available data-handling software infrastructure and identify the research goals that are needed to enable information-based computing within computational grids.


Data environment              Requirements
Data-intensive computing      Data-caching system
                              Attribute-based data set identification
                              Access to heterogeneous legacy systems
                              Automated data handling
                              Data-subsetting mechanisms
Information-based computing   Data publication mechanisms
                              Quality assurance mechanisms
                              Information discovery interfaces
                              Attribute-based access to data sets
                              System-level metadata for resources, users,
                                data sets, and methods
                              Discipline-specific metadata
                              Replicated data sets for fault tolerance and
                                disaster recovery
                              Partitioned data sets
Knowledge networks            Extensible semantics and schemas
                              Publication mechanisms for semantics and schemas
                              Interoperability mechanisms for semantics and
                                schemas
                              Lineage metadata and audit mechanisms

TABLE 5.2 Application requirements for data-handling environments.

5.2.1 Data-Naming Systems

Traditionally, applications use input data sets to define the problem of interest and store results in output files written to local disk. The problem of identifying or naming the data sets is handled by specifying a unique UNIX pathname on a given host for each data set. The user maintains a private metadata catalog to equate the pathname with the unique attributes that identify the contents of the data set. This task may be done manually in a notebook or by encoding the attributes in the pathname. In either case, the only way to discover the naming convention is by communicating with the data’s originator.

With the advent of the Web, naming conventions have been developed to describe data sets based on their network address. A URL specifies both the Internet address of a server and the pathname of each object. This extends the UNIX pathname convention to include the address of the site where the data object resides. URNs extend the concept of URLs by providing location transparency. URNs are unique names across the Web that can map to multiple URLs. A URN service is required to translate the URN into a URL. Users must still individually learn the URN that corresponds to a given object to build their own metadata catalog of interesting data object names.


Capability          Growth paths
Data naming         UNIX pathname → LDAP → Database metadata (Chapter 11)
Data storage        Local disk files → Archival storage (Chapter 17) →
                      Integrated database/archive
Data handling       Manual access → Integrated archive/file system →
                      Homogeneous access to file systems, archives, databases
Data services       Local applications → Distributed objects (Chapter 9) →
                      Knowledge networks
Data publication    Data repositories → Digital libraries →
                      Federated information repositories
Data presentation   Application-specific visualization → User-managed
                      dataflow systems → Coordinated presentation,
                      JavaBeans (Chapter 10)

TABLE 5.3 Evolution of data-handling environments.


One approach to improve the ability to name data sets is to impose a standard structure on the UNIX pathname. The Lightweight Directory Access Protocol (LDAP) [582, 281] organizes entries in a hierarchical treelike structure that reflects political, geographic, and/or organizational boundaries. Typically, entries representing countries appear at the top of the tree, with entries representing states or national organizations hierarchically listed below. A structure may be defined that represents arbitrary metadata attributes. LDAP is a protocol for accessing online directory services [281] and is used by nearly all X.500 directory clients.

The LDAP directory service model is based on entries, which are collections of attributes with a distinguished name. The distinguished name refers to an entry unambiguously by taking the name of the entry itself and concatenating the names of its ancestor entries. This process is similar to using a UNIX pathname to define a data-set name, except in this case the name is defined within the context of the attributes associated with the LDAP directory structure.
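
For example (illustrative entry values):

```python
# Illustrative distinguished-name construction: the entry's own relative
# name is concatenated with those of its ancestors, most specific first.
rdns = [("cn", "dataset-42"), ("ou", "Neuroscience"), ("o", "UCSD"), ("c", "US")]
dn = ",".join(f"{attr}={value}" for attr, value in rdns)
print(dn)   # cn=dataset-42,ou=Neuroscience,o=UCSD,c=US
```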

Research proposals for the Web and LDAP focus on metadata extensions to facilitate data naming and information discovery. For the Web, the Dublin Core provides mechanisms to associate descriptive fields with every document at a Web URL [324, 564]. The Warwick Framework [323] provides a container architecture to integrate distinct packages of metadata, including the Dublin Core. (See Section 11.4 for additional discussion of LDAP.)

An alternative approach is to use a relational database to store the attributes associated with the data set. As with LDAP, a structure must be designed to organize the attributes, but in this case the relation between the attributes is specified by the database schema. This allows the design of more general relationships and supports more complex queries. As a result, it becomes possible to access a data set by specifying a subset of the attributes associated with the data set instead of the distinguished name. As disciplines identify more attributes to characterize their data, the schema used to describe the data sets will also increase in complexity, implying the need for an extensible database schema.
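
A minimal sketch of attribute-subset access using an in-memory SQLite catalog (our schema, purely illustrative):

```python
import sqlite3

# Minimal illustrative metadata catalog: locate a data set by a subset of
# its attributes instead of a full pathname or distinguished name.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE datasets
              (name TEXT, satellite TEXT, observed TEXT, location TEXT)""")
db.execute("INSERT INTO datasets VALUES (?,?,?,?)",
           ("obs-001", "GOES-8", "1997-11-04T06:00", "srb://goddard/das/obs-001"))

row = db.execute(
    "SELECT location FROM datasets WHERE satellite=? AND observed LIKE ?",
    ("GOES-8", "1997-11-04%")).fetchone()
print(row[0])   # srb://goddard/das/obs-001
```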

5.2.2 Data Storage Systems

At supercomputer centers, archival storage systems are used to maintain copies of data sets on tape robots that back-end large disk caches. Data written to the archive migrates from the cache to the robot based on the frequency of access (e.g., data sets that are never accessed reside on tape, minimizing the cost of storage across the data collection). Archives typically store millions of files and have capacities measured in the hundreds of terabytes. Almost all archives in use today rely on a user-specified pathname to identify each file. A current research topic is how to integrate object-relational database technology with archival storage systems to enable attribute-based access to the data sets within the archive. (Information about data storage system capabilities is given in Chapter 17.)

Within computational grids, the archival storage system will provide persistent storage. But several challenges must be met. The number of data sets will grow into the billions and vastly exceed the name server design specifications for current archives. A second challenge is archive access latencies, which are measured in tens of seconds for data migrated to tape. The retrieval of a large number of small data sets (size less than the tape access latency times the access bandwidth) will be inconveniently long if the data sets are distributed across a large number of tapes. Again, database technology is being considered for its ability to aggregate data into containers, which are the entities stored in the archive. This minimizes the number of entities stored and the access latency for multiple data-set requests. This scenario suggests that storage of data within archives needs to be controlled by clustering algorithms that automatically aggregate jointly accessed data sets in the same database container.

For data-intensive computing on large data sets, the latency of access to the data peripheral is small compared with the transmission time. If large data sets are accessed whose size is greater than the local disk cache, the data must be paged between the archive and the disk. This is feasible if the archive transmission rate can be increased to a substantial fraction of the local disk access rate. The standard way to do this is to use third-party transfer, in which data is moved from network-attached peripherals to the disk cache or requesting computer. This is possible by separating the data control and data movement functions within the archive [284, 130].

Some archival storage systems support movement of a single data set across parallel I/O channels from tape and disk [562]. This approach allows aggregation of I/O bandwidth up to the capability of the receiving system. Fully parallel implementations make it possible to increase the I/O access rate of the data in the archive in proportion to the size of the archive. It then becomes feasible to construct archives in which the entire data collection can be processed in a single day. The standard figure of merit with current technology is an access rate of 1 GB/s per terabyte of disk cache in the archive. A 10 TB disk cache enables data-intensive problems requiring the movement of a petabyte of data per day. An area of research interest is how to integrate the standard MPI I/O data redistribution primitives [377] on top of third-party transfer, thus enabling dynamic redistribution of the data onto the appropriate nodes of a parallel computer.

5.2.3 Data-Handling Systems

Data-intensive applications are labor-intensive, requiring manual intervention to identify the data sets and to move the data from a remote repository to a cache on the local disk. In addition, the application must be told the local filenames before execution. When the number of data files is counted in the hundreds or the sizes of the data files are measured in gigabytes, the time expended in manual data-handling support can exceed the CPU execution time by orders of magnitude.

Multiple software infrastructures have been developed to minimize manual intervention in accessing files:

• Distributed file systems to provide a global name space. Examples are the Andrew File System, the Distributed File System (DFS), and remotely mounted versions of the Network File System (NFS). In each case, the user must know the unique UNIX pathname for each data set. Data repositories that do not provide an NFS or DFS interface must be accessed separately.

• Persistent object computation environments that federate file systems. The Legion environment (see Chapter 9) transparently migrates objects for execution among systems [248]. Although the user is required to know the unique Legion object identifier to access an object and must maintain a list of objects, the manipulation of objects is automated.

• Database systems that support queries against local data collections. Retrieving data from a data repository managed by a database typically requires the user to generate SQL syntax to identify the data set of interest. Interfaces are now available that support queries across distributed databases.

• Data migration systems that tightly couple file systems with tape storage. The Cray Data Migration Facility (DMF) uses hooks within the UNIX file system to identify data sets that have been migrated to tape and automatically retrieves the data sets when they are requested by the user.

These solutions are characterized by acting on strictly local resources or by requiring the user to identify the data set based on an arbitrarily chosen pathname. In computational grids, these restrictions need to be alleviated. When a user can access data sets anywhere within the grid, it is not reasonable to expect the user to know the unique UNIX pathname that identifies a data set. What is needed is a metadata catalog that maintains the correlation between data set attributes and the data set name.

In addition, the storage systems accessible within computational grids will not have uniform access protocols. What is needed is a storage resource broker (SRB) that supports protocol conversion between the UNIX-style streaming interface used by most applications and the protocols used by the storage resources. Figure 5.1 shows an architecture for such a storage resource broker.

In the SRB, storage interface definitions (SIDs) provide alternate interfaces that the application may choose to access. The SRB then converts the data request into the format needed by a particular storage system. The conversion process requires access to system-level metadata to identify the characteristics of the storage resource to decide what type of protocol conversion is required. The SRB provides the homogeneous access to file systems, databases, and archival storage systems needed by grid applications.


FIGURE 5.1 Storage resource broker for interoperation between data resource systems. (The diagram shows an application calling the SRB APIs through storage interface definitions: a file SID, a DBlob SID, and an object SID. The SRB, supported by catalog services, authentication and access control, and a scheduling broker, dispatches each request to the appropriate driver: UniTree, Informix, HPSS, DB2, or the local file system.)

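
Schematically, the broker reduces to a dispatch table driven by system-level metadata (hypothetical classes; this is a sketch, not the actual SRB API):

```python
# Schematic storage resource broker: a uniform read interface whose calls
# are converted, via system-level metadata, into each storage system's
# native protocol. The driver classes are placeholders.
from abc import ABC, abstractmethod

class StorageDriver(ABC):
    @abstractmethod
    def get(self, object_id: str) -> bytes: ...

class FileSystemDriver(StorageDriver):
    def get(self, object_id):
        with open(object_id, "rb") as f:
            return f.read()

class ArchiveDriver(StorageDriver):        # e.g., an HPSS-like archive
    def get(self, object_id):
        raise NotImplementedError("stage from tape, then stream")

class SRB:
    def __init__(self, catalog: dict[str, tuple[str, str]],
                 drivers: dict[str, StorageDriver]):
        self.catalog = catalog    # dataset -> (resource type, native id)
        self.drivers = drivers    # resource type -> protocol driver

    def read(self, dataset: str) -> bytes:
        kind, native_id = self.catalog[dataset]    # system-level metadata
        return self.drivers[kind].get(native_id)   # protocol conversion
```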

5.2.4 Data Service Systems

Data sets may require preprocessing before they are accessed by an application. The preprocessing typically can be expressed as well-defined services that support general data-subsetting operations. A data-handling infrastructure is needed to encapsulate the services as methods that can be applied within computational grids. Two technologies are emerging to support this capability. One is CORBA, in which data sets are encapsulated within objects that provide the required manipulation [424]. This system works very well as long as the requested service is defined and available. CORBA attempts to provide some support for information discovery using its notion of trader services.

Digital libraries provide a more comprehensive and powerful set of tools to manipulate data sets, by supporting services on top of data repositories. The services can be invoked against any of the data sets stored in the library and registered as methods within object-relational databases. The combination of metadata-based access to data sets through catalog services, with the ability to register methods to manipulate the data sets, provides most of the attributes needed for computational grid data-handling environments. An example of a digital library architecture is shown in Figure 5.2.


FIGURE 5.2 Digital library architecture. (The diagram shows an application invoking, through APIs, services for registration and publication, discovery support (MDAs), storage resource brokering, authentication and access control, scheduling, and method execution. Catalog services are reached through a catalog API, and the services are implemented on underlying systems such as LDAP and X.500; HPSS, DPS, and file systems; GSS, SSH, and Kerberos; NWS and AppLeS; and Globus and Legion.)


Possible services include publication/registration of new data sets, support for information discovery for attribute-based identification of data sets, support for access to heterogeneous data resources through a storage resource broker, support for authentication and access control, support for scheduling of compute and I/O resources, and support for distributed execution of the services that constitute the digital library.

Although the digital library architecture is extensible and capable of scaling to wide area grid environments, current implementations tend to be closely coupled to local resources. The underlying data storage resources are usually the local file system, and the methods are executed locally by the database. Current research topics include generalizing digital library technology to function in a distributed environment with support for executing the methods on nonlocal resources [42].


5.2.5 Data Publication Systems

Data publication provides the quality assessment needed to turn data into information. This capability is provided by the experts who assemble data repositories for a given discipline. The mechanisms used to assess the quality of the data include statistical analyses of the data to identify outliers (possibly poor data points) and to compute the inherent properties of the data (mean, standard deviation, distribution). For the large data sets accessed by data-intensive applications, research topics include how to generate statistical properties when the size of the data set exceeds the storage capacity of the local resources. For publishing data across multiple repositories, the coordination of data privacy requires cryptographically guaranteed properties for authorship and modification records (discussed in Chapter 16).

Publication also involves developing peer review mechanisms to validate the worth of the data set, which is an active area of research within the library community. The mechanisms will be similar to those employed for review of scientific reports. Indeed, in the chemistry community, some scientific journals require publication of molecular structures in data repositories before reports that analyze the structures can be published in the journal.

The harder research issue is supporting interoperability among multiple data repositories and digital libraries [137]. To enable ubiquitous access to information, mechanisms are needed that support interoperability between schemas. This requires continued research on how to specify schemas and interpret semantics so that a query can be completed across two heterogeneous databases. The schemas and associated semantics must be published for each repository. One approach is to generate a set of global semantics that spans all disciplines. Unfortunately, as noted above, the semantics used within a given discipline are expected to evolve. Thus, the semantics associated with data set attributes must also evolve. Queries for information that require comparing historical and current data will require interoperability between schemas based on different semantics. Current approaches include generalizing the query to access higher-level attributes that can be defined with the same semantics across the heterogeneous schemas.
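One way to realize this query generalization is to keep, for each repository, a mapping from shared higher-level attributes to local schema terms. The sketch below assumes invented repository and attribute names.

```python
# Sketch of query mediation across heterogeneous schemas: a query posed
# against shared, higher-level attributes is rewritten into each
# repository's local vocabulary. All names here are hypothetical.

SCHEMA_MAPS = {
    "repo-1998": {"temperature": "sea_surface_temp_c"},
    "repo-2005": {"temperature": "sst_celsius"},
}

def rewrite(query, repository):
    """Translate shared attribute names into a repository's local schema."""
    mapping = SCHEMA_MAPS[repository]
    return {mapping.get(attr, attr): value for attr, value in query.items()}

query = {"temperature": (10.0, 20.0)}   # shared semantics: a range in deg C
for repo in SCHEMA_MAPS:
    print(repo, "->", rewrite(query, repo))
```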

A unifying approach to enable interoperability is the use of proxies. Proxies act as interpreters between a standard protocol and the native protocols of individual information sources. The Stanford InfoBus protocol is an implementation of such a standard [479]. It uses distributed CORBA objects to support information discovery across multiple digital libraries. Information access and retrieval are accomplished through a Digital Library Interoperation Protocol (DLIOP).



5.2.6 Data Presentation Systems

Unifying data presentation architectures are needed to enable collaborative examination of results. The data presentation may involve teleinstrumentation (Chapter 4), with data streaming from instruments in real time for display on collaborators’ workstations, or dynamic steering of applications as they execute (Chapter 7). The associated data-handling systems require support for asynchronous communication mechanisms, so that processing of the data stream will not impede the executing application or interrupt the instrument-driven data stream. When data sets become very large, the data streams may have to be redirected to archives able to accommodate the entire data set. Subsets of the data may then be redirected for display on the researcher’s workstation.

The realtime constraints associated with collaborative examination of data will also affect the design of grid data-caching infrastructure. Multiple representations of the data sets at different resolutions may be needed to maintain interactive response when the collaborations are distributed across a continent. This, in turn, will affect the type of data-subsetting services that the digital library should provide. (Realtime applications are discussed in detail in Chapter 4.)

Visualization of data sets will be as important as the ability to locate and retrieve them. For data-intensive applications, the resolution of the data set can be finer than the resolution of the display device. One approach to this situation is to zoom into a data set through multiple levels of resolution, with each level stored as a reduced data set within the data repository. An alternative approach is to decompose the data set into basis functions that can be used to display the data at different resolutions. Fractal decompositions can minimize the amount of data that must be transmitted as higher-resolution versions of the data are requested. Both approaches require additional metadata to describe the alternate representations of the data that are available.
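The first approach amounts to storing a resolution pyramid alongside the full data set. A minimal sketch follows, with averaging of adjacent samples standing in for a real reduction operator.

```python
# Sketch of the multiple-resolution approach: precompute reduced versions
# of a data set so a viewer can zoom level by level. Each level halves
# the resolution by averaging adjacent samples.

def reduce_once(samples):
    """Halve resolution by averaging adjacent pairs of samples."""
    return [(samples[i] + samples[i + 1]) / 2.0
            for i in range(0, len(samples) - 1, 2)]

def build_pyramid(samples, levels):
    """Return [full-resolution, half, quarter, ...] reduced data sets."""
    pyramid = [samples]
    for _ in range(levels):
        samples = reduce_once(samples)
        pyramid.append(samples)
    return pyramid

pyramid = build_pyramid([float(i) for i in range(16)], levels=3)
for level, data in enumerate(pyramid):
    print("level", level, "resolution", len(data))
```

Each reduced level would then be registered in the repository’s metadata as an alternate representation of the same data set.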

Data coordination systems are also needed to ensure that the same presentation is provided on all windows active on a display system and across the multiple windows within a distributed collaborative environment. Changes in the visualization control parameters in one window need to be reflected in all other windows. JavaBeans is a platform-neutral technology that can accomplish this process [528]. This requirement poses a major architectural design challenge for computational grid data-support environments. Presentation environments will need to be integrated across a combination of computational grids, CORBA object services, Java presentation services, and digital library publication services. (More information about Java is provided in Chapter 10.)
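The JavaBeans mechanism alluded to here is its property-change event model. The same coordination pattern can be sketched in Python for illustration; the window titles below are hypothetical.

```python
# Sketch of JavaBeans-style property-change propagation: when one window
# changes a visualization control parameter, every registered window is
# notified so all displays stay consistent.

class SharedViewState:
    def __init__(self):
        self.params = {}
        self.listeners = []   # windows to notify on any change

    def add_listener(self, listener):
        self.listeners.append(listener)

    def set(self, name, value):
        old = self.params.get(name)
        self.params[name] = value
        for listener in self.listeners:
            listener.property_changed(name, old, value)

class Window:
    def __init__(self, title):
        self.title = title

    def property_changed(self, name, old, new):
        print(f"{self.title}: redraw with {name} = {new}")

state = SharedViewState()
for title in ("local CAVE", "remote ImmersaDesk"):
    state.add_listener(Window(title))
state.set("zoom", 2.0)   # both windows redraw consistently
```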



TABLE 5.4 Evolution of data-handling paradigms.

    Local                  Distributed                  Computational grids
    Data storage system    Distributed data handling    Data-intensive computing
                                                        on data repositories
    Data analysis system   Information discovery        Information-based
                           environments                 computing
    Digital libraries      Knowledge networking         Information-based
                                                        policy making

5.3 DATA-HANDLING PARADIGMS

Data-handling environments are evolving from local systems that can interact only with local data peripherals to distributed systems that integrate access to multiple heterogeneous data resources. Data access via user-defined data set names is evolving to information access based on data set attributes. Compute environments are evolving from support for local execution of services or methods to distributed execution of services within computational grids.

This shift from local to global resource access will enable new paradigms for data handling. Three such shifts are shown in Table 5.4. They can be characterized as a shift from local resources, to distributed resources, to an environment that supports ubiquitous access to computing and information resources. Each paradigm builds upon the capabilities of the prior one. The long-range goal is to develop infrastructure that supports information-based policy decisions by experts organized into knowledge network enclaves.

The basic software infrastructure to build an information-based computing environment includes the following:

- Persistent object computation environments for grids that support runtime execution of applications and services

- Information discovery system that supports attribute-based access to data, metadata mining, semantic interoperability, shared ontologies, and improved data annotation [41, 137]

- Digital library technology that supports publication, cataloging, and curation of scientific data sets



- Data management system that provides a system-level metadata catalog to support interoperation among objects and resources within computational grids

- Storage resource broker that provides a uniform access mechanism to heterogeneous data sources

- Database repositories that support domain-specific data collections

- Archival storage systems that provide permanent data repositories

5.4 INFORMATION REVOLUTION

A major user of the information discovery environment will be the internal systems comprising computational grids themselves. For ubiquitous computing infrastructure to be accessible, system-level metadata is required to identify available resources. For scheduling systems to be capable of operating within grids, access is needed to resource utilization statistics. To support security access control lists and authentication systems, system-level metadata is needed to describe user privileges. For data sets to be accessible in remote archives, again system-level metadata is needed to determine the protocol that should be used to access the heterogeneous data resources. In short, the technologies needed to implement grids will be driven by the needs of their internal subsystems, with development of support mechanisms for system-level metadata providing the unifying infrastructure.

One consequence of the implementation of an information-based computing environment on top of grid systems will be a revolution in the ability to generate information. Data analysis has been a cottage industry in which researchers develop unique applications that access local data sets. Information-based computing will turn data analysis into a national infrastructure by making it possible to access and manipulate any published data set. The synergy observed within individual disciplines when they integrate access to their data will be made possible across multiple fields of study that choose to work together. Common metadata models are being developed to enable such interchange [185, 171].

The emergence of ubiquitous access to data is revolutionizing the conduct of science [445]. Researchers are publishing scientific results on the Web and providing Web-based access mechanisms to query data repositories and apply analysis algorithms. The opportunity exists to develop scalable information discovery systems that generalize the above capabilities and enable analysis of terabyte-sized data collections.



Information-based computing will enable information access from applications running on supercomputers. This in turn will enable automated examination of, and access to, all available information sources for a discipline, including scientific data generated by simulations, observational data, standard reference test case results, published scientific algorithms for analyzing data, published research literature, data collections, and domain-specific databases.

This infrastructure is expected to enable analyses that could not be contemplated before, resulting in faster progress in scientific research through the nonlinear feedback made possible when new information is used to improve data analysis. The publication of the results of computations, followed by the dynamic discovery of information when new applications are run, forms a feedback loop that can rapidly accelerate the generation of knowledge.

Rapid progress in building this infrastructure can be made by building upon existing technologies. Supercomputer centers are evolving from providing support for predominantly numerically intensive computing to also providing support for data-intensive applications. Systems that can manage the movement of terabytes of data per day are in development. Data-handling environments are also evolving from syntactic-based to semantic-based access systems. Digital library technology is evolving to include the capability to analyze data in associated workspaces through application of published algorithms. Finally, user interfaces to these systems are evolving into dynamic collaboration environments in which researchers simultaneously view and interact with data.

The coevolution and integration of computational grids, information-based computing, and digital library technologies promise to create a unique infrastructure that will enable ubiquitous computing and ubiquitous access to information. The resulting synergy between the ability to analyze data to create information and the ability to access information to drive the data analysis will have a profound effect on the conduct of science.

FURTHER READING

For more information on the topics covered in this chapter, see www.mkp.com/grids and also the following references:

- A white paper prepared for the Workshop on Distributed Heterogeneous Knowledge Networks [575] provides an introduction to the knowledge networking concept.


CHAPTER 6

Teleimmersion

Tom DeFanti and Rick Stevens

Two centuries ago, the poet Byron wrote of his desire “to mingle with the Universe, and feel / What I can ne’er express, yet cannot all conceal.” Now an emerging approach to computing, communication, and collaboration, called teleimmersion, promises to enable us to mingle with—indeed, immerse ourselves in—the universe in new ways that Byron could only dream of. The term teleimmersion refers to the use of immersive virtual reality systems over a network, where the generators, data sets, and simulations are remote from the user’s display environment. These systems are often used to support collaboration, and so we use the terms “teleimmersion” and “collaborative virtual environments” interchangeably. In more modern, and more prosaic, language, teleimmersion has been called the “ultimate synthesis of networking and media technologies” [518]. We believe that teleimmersion is likely to emerge as one of the key applications for future computational grids, simultaneously enabled by the availability of grid capabilities and driving their continued development.

In this chapter, we explore both the nature of teleimmersion and its implications for future grid systems. We first explain why grids and teleimmersion are so naturally suited. Then, we review the applications that currently use, or will use in the future, teleimmersive environments. Finally, we describe the demands that teleimmersive technology places on the underlying infrastructure, discussing in detail the grid requirements for teleimmersion.



6.1 TELEIMMERSION AND THE GRID

We believe that future developments will see increasingly strong links being established between teleimmersion and emergent grids, based on principles of coevolution and symbiosis. By providing improved performance, greater reliability, and more sophisticated services, grids will enable the widespread acceptance and use of teleimmersion. Simultaneously, the success of teleimmersion will motivate ongoing improvements to grid technologies. In this section, we expand upon these ideas.

One key to this complex symbiotic relationship between application and technology is that teleimmersive environments are inherently multimodal. Gaze, gesture, facial expressions, buttons, and speech are typical media inputs. Rendering, sonification, video, text, and speech are media outputs of the computer simulations. Hence, a key requirement of teleimmersion is aggressive networking and display technology. Bandwidth, latency, and jitter (the variation in latency) all are problems that must be addressed if interactive teleimmersion is to become a reality. Equally critical to the success of teleimmersion is the integration of computers and databases. The systems accessed by teleimmersion include potentially large and distributed databases and simulations, all under some sort of real-time control.

These requirements will, we believe, be the most aggressive drivers for the computational grid in the next five years. Increased access to high-performance networks and improved access to high-end virtual reality environments will cause teleimmersion applications to come to the fore. Areas that are likely to benefit in the near term from teleimmersion include distributed design, collaborative scientific visualization, collaborative engineering, and advanced communications and networking management [235, 28, 157].

Because grid technologies will be driven by teleimmersive applications, they will have to be closely coupled with, and designed a priori for, the types of interactive applications that people prefer. This requirement distinguishes computational grids from the existing Internet and means, moreover, that grid services will need to track the latest developments in human-to-human and human-to-machine communications. For example, grids will have to support not only large numbers of voice-grade connections (like the phone system), but also multichannel, high-resolution audio (CD quality and better) and spatialized audio streams.

Image manipulation and transport will be another challenge driven by teleimmersion applications. Ideally, we would like to support facile manipulation, editing, transport, and delivery of very high quality still and motion images.



Grids will need to support the delivery of HDTV streams to tens of thousands of users, concert-quality audio, ultra-high-resolution interactive teleimmersive environments, and high-resolution (in both space and time) haptic images. Each of these interaction modalities will require the ability to interconnect many people and simulations in a collaborative setting. Sources and sinks of data will be live generators (user interfaces), displays, programs, databases, and scientific instruments or industrial machinery.

As more sites acquire network-capable virtual reality devices [469], and as collaborative tools are developed and deployed, high-end teleimmersion applications will be able to support multiple users in a collaborative virtual environment mode. Increasingly, researchers will rely on teleimmersion to avoid travel or to augment desktop collaboration technologies. This situation will, in turn, lead to even more pressure to improve grid capabilities.

We can also expect that new types of requirements will emerge. Examples might include higher-order quality-of-service requirements (guarantees on the correlation or anticorrelation of jitter across multiple streams, for instance), latency equalization (where we request that all latencies in a set of multiple streams be normalized), or complex modality mixing and adjusting.

In the longer term, we can expect new modalities of use to evolve from grid-based applications and services and from the availability of digital proxies (for users and other resources). One such example is telepresence, a special case of teleimmersion in which input and output are generated by arrays of remote sensors and robotic actuators. Telepresence technology, in turn, will enable remote control of manufacturing processes and such large-scale tasks as in-orbit construction of space structures and vehicles, seafloor mining and construction, and remote (e.g., polar, desert, or remote planetary) exploration and extraction duties. The military, too, may make considerable use of telepresence. These future applications may have significantly more internal structure and communication requirements than even today’s most aggressive teleimmersion applications.

To judge how far we are from pervasive use of teleimmersion, let’s consider the current situation with audio and video. There is a worldwide network optimized for speech (the phone system) that supports both two-way and multiway interactions. Computers and other equipment that can be bought in suburban shopping malls can completely record, edit, play back, and duplicate speech. Realtime synthesis of speech is close to being available with gigaFLOP-class machines. Similarly, for video recording, editing, playback, global teleconferencing, and broadcast, mature and optimized systems exist (at grossly higher cost, of course).



Video and audio traditionally implemented with analog technology are today handled by modest-bandwidth digital streaming technology. Teleimmersive applications, on the other hand, require realtime rendering, sonification, and guaranteed quality of service—capabilities not provided by current streaming data technology.

6.2 APPLICATIONS OF TELEIMMERSION

Widespread awareness and adoption of teleimmersion concepts have occurred only in the past 10 years. A major event that advanced the use of immersive virtual reality (VR) and network-based applications was the I-WAY project, demonstrated at Supercomputing ’95 in San Diego [154, 155] (see also Chapter 22). Building on earlier events that had demonstrated the utility and feasibility of linking large-scale immersive VR with high-performance computing [447], the I-WAY project was the first attempt to demonstrate a large number of remote immersive VR applications.

Teleimmersion application prototypes developed for the I-WAY and subsequently have had to contend with a lack of network quality of service, less-than-ideal bandwidth, and only limited understanding of the modality and networking requirements for supporting optimum human interactions. Nevertheless, exciting results have been achieved, and a community of users and technology developers has formed that continues to devise new applications and new classes of display and software environments [337, 161, 202]. Over 50 groups are developing teleimmersive applications, and many more are acquiring the hardware and networking technology needed to do so. We expect that during the next five years teleimmersion applications will emerge as a major user of high-end networking.

In the rest of this section, we first review the applications that have been developed to date and then present a “case study” that summarizes our vision of where teleimmersion may be 10 years from now.

6.2.1 Teleimmersion Application Classes

The following incomplete list indicates some of the major areas in which significant successes have already been achieved.

Interactive Scientific Visualization

Many applications involve extensions and extrapolations of desktop systems used for visualization of scientific data, whether obtained from a storage system or generated in real time by a scientific simulation.



Immersive environments are used to increase the sense of “being there with the data” and to support collaborative exploration of large remote data sets [567]. (For example, see Plates 8 and 9, which show snapshots from teleimmersive, collaborative explorations of an astrophysical simulation and a molecular dynamics simulation, respectively.) Realtime scientific visualization is often also coupled with the ability to directly manipulate objects in the virtual world [142, 583], hence providing a mechanism for the “steering” of simulations. These techniques can be expected to become increasingly important in the future, as petaFLOPS computers and petabyte storage systems increase the amount of data that must be processed by a scientist.

Education, Training, and Scenario Simulations

There is considerable interest in the creation of artificial or real-world simulacra to facilitate training in situations that are expensive, difficult, or dangerous to recreate in the real world. These applications have traditionally been extremely important to the realtime graphics market, and they form a large class of existing teleimmersion test cases [482, 82, 484].

Art and Entertainment

Teleimmersion applications whose purpose is to entertain include both multiparty VR games and purely aesthetic creations. These applications often test the limits of human perception and interfaces [300, 163].

Industrial Design, Architectural Review, and Evaluation

In “virtual prototyping,” teleimmersion is used to try out ideas quickly by building prototypes that exist only as computer models. These prototypes can be tested through simulation, evaluated for human factors or manufacturability, and reviewed by groups of nondevelopers, all without building expensive physical prototypes. Several companies have already begun to explore distributed virtual design teams that couple geographically distributed design and engineering groups [157, 336]. In principle, dozens of sites may be involved in design evaluation [517]. Notice that virtual prototyping requires that the grid provide simultaneous support for both collaborative CAD and distributed modeling and simulation (see Plate 10, which shows a prototype immersive CAD tool).



Information Visualization and Data Mining

These applications go beyond conventional displays to fully exploit the capability of humans to perceive patterns in large-scale data. Teleimmersion applications offer the benefits of immersion plus the added attraction of interfaces to remote data and remote users [472] (see also Chapter 15).

Telecollaboration Environments and Human Factors

In telecollaboration applications, teleimmersion systems are used largely to support human-to-human interactions. These systems are pushing the frontiers of collaborative work environments and will provide critical data on the networking requirements of future teleimmersion applications [561, 257]. An intriguing area for future research is the use of teleimmersion to enable many people to cooperate on tasks that traditionally have been done by only a few. An example is the development of a software system. With teleimmersion, a large team of programmers—a team that is larger than possible without teleimmersion—might be able to cooperate on the development of a system. The goal would be to have many people virtually superimposed, yet still able to cooperate to accelerate the completion of complex software development tasks. Possible targets might include the development of an operating system for a petaFLOPS computer [524] or a compiler and programs for a quantum computer.

Telepresence for Exploration, Construction, and Recreation

Telepresence already has been used successfully for planetary exploration. Widespread availability of teleimmersion environments should enable the adoption of telepresence as a standard way of solving complex manual and intellectual tasks at a distance. Deep-sea, arctic, and underground exploration and mining are good candidates for telepresence. So, too, is the use of telepresence to project expertise to remote areas (in the same spirit as telesurgery): one can have telemaintenance of aircraft, boats, submarines, and land vehicles. It is also not hard to imagine use of this technology for education or for recreation—to explore remote or dangerous areas of the earth [271].

6.2.2 Teleimmersion Case Study

We firmly expect that new applications will be developed in each of the areas discussed above. Some future applications will combine multiple existing application areas into new types of problem-solving environments.



Others will focus on migrating concepts from the desktop into the teleimmersion framework. We use an imaginary example to illustrate the impact that we believe teleimmersion will have on the future.

It is the year 2009. The U.S. Department of Energy and Environment’s Office of Nanotechnology has recently completed the installation of the world’s third petaFLOPS-capable supercomputer. This latest supercomputing system has been connected via the computational grid to thousands of distributed supercomputers, data repositories, and teleimmersive environments around the world.

Six laboratories on three continents have been working around the clock to test a new immersive collaborative molecular CAD tool for designing self-replicating synthetic nanostructures. This new tool, when coupled with the petaFLOPS computer, will enable teams of scientists to cooperatively investigate innovative strategies for molecular assembly in a virtual molecular laboratory. The grid-connected supercomputers are needed to provide the physical simulation of the billion-atom structure as the users modify, interact with, and manipulate component molecules.

Each of the cooperating laboratories is using advanced teleimmersive design environments first prototyped in Chicago back in 1999. The immersive molecular CAD system, unlike desktop systems currently in use, enables the distributed collection of nanodesigners to interact directly and communicate naturally while building and sharing a large-scale 3D dynamic prototype of the complex molecular system they are studying. The teleimmersive environment surrounds the user with a fully integrated audio, visual, and haptic world, effectively teleporting the entire team to the universe of the nanoscale.

Each user of the teleimmersion environment is able to fully participate in the design and analysis of these new models regardless of distance. When the nanodesigners are in the teleimmersive environment, they can talk naturally with others in the same virtual space. They can also direct voice commands to the computer to navigate, access data sets, or manipulate the molecular model. When particularly complex topics are to be discussed, the designers have the option of live video images to replace or augment the avatar renderings. At any time, control of the distributed set of simulations can be rapidly passed from one site to another as users take turns directly manipulating the molecular model. Some complex operations may require multiple users to manipulate different parts of a molecular structure in concert.



The teleimmersion environment enables multiple users to easily and directly interact with the virtual objects in the shared space, by implicitly generating the needed transactions between distributed world databases. When one designer adds to or manipulates the shared model, all sites are updated. Advanced haptic interfaces provide force feedback and tactile imaging, enabling the users to touch and feel the model as they cooperatively explore and test complex series of delicate assembly tasks.

In addition to testing the new petaFLOPS system, these laboratories are also testing a new spandexlike “tracking” suit that fits their bodies closely (similar to the suits worn by 20th-century Olympic athletes for skiing and speed skating). Each tracking suit contains hundreds of position sensors and electromagnetic emitters that automatically couple with the active space tracking system to accurately track the location, velocity, and orientation of each part of the user’s body in the environment. This information is transmitted to remote sites, where it is used to generate a realistic rendering of each participant in the shared virtual space. Remote nanodesigners can be rendered in a form chosen by either the local or the remote user. Nanodesigners may choose normal human forms (best when working together at the same scale), or, when they are engaging in multiscale work (where some users are operating at a different spatial or temporal scale from that of others), they can choose from a variety of iconic and geometric forms that adjust appropriately.

Although this example is imaginary, the technologies discussed are real. In the next section we discuss the core technologies needed to support teleimmersion.

6.3 TELEIMMERSION TECHNOLOGIES

We review briefly the display technologies that underlie teleimmersive systems. High-end teleimmersive collaborative virtual environments represent the most technologically advanced human-computer interfaces under development today. They combine state-of-the-art visual display environments, full-motion body and limb tracking, spatialized audio output and active voice/audio input, and haptic feedback mechanisms (Figure 6.1). Requirements for such environments include the delivery of many channels of realtime streaming audio and video into the visual/audio display environment, scalable interconnection of many users and worlds, close coupling of the virtual worlds to distributed networks of large-scale simulations and databases, and realtime interaction [532].



FIGURE 6.1 Collaborative virtual reality system architecture, showing types of input and output devices. (The figure connects a virtual environment generator to the user environment through an audio localizer, audio synthesizer, speech recognizer, display electronics, head/eye/hand tracking electronics, and a haptic/tactile/kinesthetic system; the display is either a head-mounted display or a projection display system, and the flows exchanged include commands, graphics, and eye, head, and hand position.)

Two main types of VR devices are currently available: projection-based systems, in which large stereo images are presented to a tracked user (or users), and monitor-based or head-mounted systems, which present tiny images much closer to the eyes and physically coupled to the head. Examples of the former are the CAVE (Cave Automatic Virtual Environment), a room-sized virtual environment made up of three or more screens; the ImmersaDesk and Responsive Work Bench, table-sized systems; and the PowerWall, a full-wall multiscreen system. Examples of the latter include head-mounted displays (HMDs) and Binocular Omni-Orientation Monitors (BOOMs). Doing justice to the full range of possible VR devices and network options is beyond the scope of this chapter. For detailed information, see the references at the end of this chapter.



FIGURE 6.2 A continuum of collaborative virtual environment display technologies exists, with more high-end devices typically supporting more users and higher fidelity, but at greater cost. (The continuum runs from personal information infrastructure and desktop collaborative environments, through the ImmersaDesk and CAVE, to the PowerWall, serving individuals, workgroups, collaborations, and institutions, respectively.)

Our description here focuses on projection-based virtual reality systems because these are what we are most familiar with and because these systems represent most of the VR devices on the nascent grid [515]. However, we see these different devices providing a continuum of useful services, as illustrated in Figure 6.2.

6.3.1 CAVE

The CAVE [143] is a multiperson, room-sized, high-resolution, 3D video and audio environment. Graphics are rear-projected in stereo onto three walls and the floor and are viewed with stereo glasses. As a viewer wearing a location sensor moves within its display boundaries, the correct perspective and stereo projections of the environment are updated, and the image moves with and surrounds the viewer.

The CAVE environment provides users with numerous capabilities:

- A user is presented with dynamically moving stereo full-color images at multithousand-pixel horizontal resolution on the walls and floor. The images seem to float in space and allow the viewer to walk around them.

- The primary user’s position is tracked so that the correct perspective view is generated in real time. Head rotation is used to subtly adjust the perspective, not swing the entire world as in head-mounted displays.

- The primary user can navigate with a variety of intuitive navigational devices currently under test and construction. At interesting points, the viewer can freeze the viewpoint, automatically giving the computer-graphics generation time to fill in the image (called “successive refinement”). The viewers may rotate their heads to take in the entire refined scene and still achieve a good approximation of correct stereo without requiring that the image be recomputed.




- All users wearing LCD shutter glasses can see full 3D stereo projected into the room. In 3D movies or workstation stereo, objects must be kept near the center of the screen or behind the screen, because the edges of the display cause the illusion to be destroyed for objects between the user and the screen (this is called “edge violation”). The CAVE effectively has no edges because of its wrap-around screens.

- Since all users can still see their hands, body, and feet, they do not need training to stay upright in the virtual space. Disorientation common in head-mounted displays is not an issue with the CAVE unless specifically induced. The CAVE allows groups of people to be led by a scientist or demonstrator to interesting places, a preservation of the teacher-student relationship not typically practical with head-mounted displays.

6.3.2 Smaller VR Devices

The CAVE is a representative of the high end of virtual environment systems that make up the end points of a teleimmersion system. While the CAVE is ideal for many applications, its large size and high cost often make it impractical for small groups or areas with limited space. To address some of these issues, smaller devices have been developed [145].

The ImmersaDesk is a drafting-table format version of the CAVE. The ImmersaDesk, when folded up, fits through a standard institutional door frame and deploys into a 6 ft × 8 ft footprint. It requires a single graphics engine of the Onyx or Octane class (to which many researchers have access), one projector, and no architectural modifications to the working space. The ImmersaDesk is 100% software compatible with the CAVE library and supports interfaces to software packages such as Sense8’s WorldToolKit, SGI’s Performer/Inventor, AVS, and IBM Data Explorer. A version of the ImmersaDesk with an integral airline-safe shipping case has been developed and dubbed the “IDesk2.”

The PowerWall achieves very high display resolution through parallelism, building a single image from an array of display panels projected from the rear onto a single screen. High-speed playback of previously rendered images is made possible by attaching extremely fast disk subsystems, accessed in parallel, to a rack Onyx.



FIGURE 6.3 Types of flow in collaborative virtual reality: control, text, audio, video, tracking, database and event transactions, simulation, haptics, and remote rendering.

The PowerWall is often used for interactive playback of large movie sequences of prerendered, high-resolution images from supercomputer simulations and has software for panning a single, extremely large image. For either application, a very high capacity and high-speed disk subsystem is required, with associated tape archiving capabilities.

6.4 TELEIMMERSION PERFORMANCE REQUIREMENTS

We emphasized earlier the multimodal nature of the data flows associated with teleimmersion applications and the fact that user interactions can lead to stringent performance requirements. In this section, we provide a more detailed analysis of the nature of these flows and the performance characteristics required for these flows in realistic teleimmersive applications.

6.4.1 Teleimmersion Data Flows

As a first step toward understanding teleimmersion network requirements, we analyze the communication structures commonly encountered in teleimmersion systems. As illustrated in Figure 6.3, we distinguish nine different classes of flow.



Control

Control information consists of two types of data: that which is used to manage the teleimmersion session and that which governs synchronization of events in the shared spaces. Examples of the former are communications to authenticate users or processes, to start up or launch processes, to tune or control the display or tracking systems, and to communicate out-of-band (with respect to the user) control or metadata between the world servers and VR systems. Examples of the latter include events generated during object manipulation, entry and exit of spaces, object intersection, and communications to support arbitration.

Text

Text data currently is used to provide simple communications channels (e.g., via text-based collaborative environments) for communicating within a collaborative session and for nonaudio, nonvisual interactions among users. Text is also used to provide command and control to the UNIX processes driving the environments. In the future, text may be used in part to integrate handheld devices into the teleimmersion system and to support messaging.

Audio

Audio has three primary uses in current teleimmersion systems: to provide ambient auditory cues, data sonification, and sound effects linked to actions or events in the environment; to provide communications between users, as in a teleconferencing system; and to act as a command and control interface to the system with voice recognition and audio feedback. A typical application may use multiple audio streams.

Video

Video plays an important role in teleimmersion systems. Like audio, video is used to provide teleconferencing interfaces in the virtual world (with the important difference that the video image can be texture-mapped onto virtually any object or surface), but streaming video or perhaps video-based imaging techniques will extend the use of still and full-motion high-resolution video as an important media type. Synthetic video can also be generated from inside the virtual environment (directly generating digital video of the simulated world) [153] and can be either streamed to remote sites or captured for archival purposes. Coupling multiple streams of live and synthetic video is important for teaching and tutorial purposes.



Tracking

Location and orientation sensors are used in the VR environment to capture the position and orientation of the user. Typically, the data is streamed to the computer responsible for computing the perspective of the scene. In a teleimmersion application, this tracking data will need to be shared among multiple sites. Early VR systems required a minimum of two sensors per user, one mounted on the head and one on a hand or glove. Future systems may have many more sensors, allowing more natural avatars and the graphical representation of more complex postures and body motion.

Database

At the heart of a teleimmersion application is a database representing the virtual “world.” This database contains graphical models of virtual scenes, objects, and data. Each participant in a teleimmersion session must have access to the database and, depending on the application, may need to be able to rapidly update and modify it. Because the database is used to provide the models that are rendered, it must be maintained in a coherent state across multiple sites. Databases might be as simple as shared VRML files or as complex as a multiterabyte scientific data set.

Simulation

We often want to allow objects in our “world” to have dynamic behavior, for example, so that they can respond to user actions. This behavior is sometimes provided by running additional processes on the same computer that is rendering the image. However, this strategy often does not scale; therefore, the simulation frequently is run on a separate, more powerful dedicated supercomputer system. User input is captured and transmitted to the simulation via the network, and the simulation generates an update to the database, which is then propagated to each user site for local rendering. Typically, the data transferred to the simulation is considerably smaller than the data returned by it. For example, if the user is conducting an interactive molecular docking experiment, only tracking data indicating the location of the user’s hand needs to be sent to the molecular model. In response, the simulation will return updated coordinates of hundreds or thousands of atoms.
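The asymmetry of this exchange can be illustrated with a back-of-the-envelope sketch; the tracker format and atom count below are invented for illustration.

```python
# Back-of-the-envelope illustration of the asymmetric steering exchange:
# tracking data sent to the simulation is tiny compared with the updated
# atom coordinates returned for rendering. All sizes are illustrative.

FLOATS_PER_TRACKER = 6          # position (x, y, z) plus orientation
ATOMS = 10_000
BYTES_PER_FLOAT = 4

upstream = FLOATS_PER_TRACKER * BYTES_PER_FLOAT   # one hand-tracker update
downstream = ATOMS * 3 * BYTES_PER_FLOAT          # xyz per atom returned

print(f"upstream per update:   {upstream} bytes")
print(f"downstream per update: {downstream} bytes")
print(f"asymmetry:             {downstream // upstream}x")
```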



Haptics

Haptics refers to the class of user interfaces that provide force and touch feedback to the user [89]. Haptics may include a variety of sensors and actuators that are “attached” to the hands, arms, and legs of the user and connected to one or more computers. Applications can generate haptic “images” that allow the user to feel what is in the visual environment. For example, a user might be able to feel the magnetic field around a star simulation or the texture of an atomic-scale surface being imaged by a scanning microscope.

Rendering

Rendering involves the transformation of geometric information—typically represented by computer graphics models—into images for display. Most current VR environments render graphics locally (i.e., the computer is directly attached to the frame buffer or display system) without producing intermediate digital representations of the images that are transmitted from the rendering system to the display system via a general network. In the future, however, we believe that it will become increasingly common for teleimmersion worlds and scenes to be built up from a set of partial images, some of which are rendered remotely and transmitted to each site in real time. While some methods incorporate digital video (see above), others use nonstreaming, compositing network virtual frame buffers.

6.4.2 VR Lag and Communications Performance

Before proceeding to a detailed discussion of performance issues, we will make some general comments about the issue of user-perceived lag in a virtual environment. Lag is the term used to describe the delay between action in the real world, as captured by tracking or haptics, and the perceived response of the system to that action (e.g., in a collaborative design application, the video update lag from a change in head position or the delay from a button click to a visual or audio response). Lag is a key issue for usability of the teleimmersion system, and reducing lag is a major technical challenge for the VR research community.

VR system lag is the result of the perceived delays in the following processes of the VR system: rendering, display, tracking, simulation, communications, and synchronization [531]. Our primary concern here is with the communications contribution to teleimmersion lag.



Multiple sources of latency exist in the communications system. In our models of communications latency we include transmission latency, the time it takes to send a zero-length packet from one node to another; bandwidth or transfer latency, the time it takes to move data because of the size of the transfer; switching or routing latency, the sum of the delays that arise because the network is not composed of just point-to-point links; contention latency, the delay caused by competition for limited resources in the network (bandwidth, queues, etc.); and protocol latency, the delay caused by the segmentation and reassembly operations needed to build data packets and by the header processing for protocol stacks.
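Under this decomposition, the communications contribution to lag is simply the sum of the five terms. A sketch of such a latency budget, with invented millisecond figures:

```python
# Sketch of a communications latency budget using the five components
# named above. The millisecond figures are invented for illustration.

budget_ms = {
    "transmission": 20.0,   # zero-length packet, coast to coast
    "transfer":      8.0,   # data size / available bandwidth
    "switching":     3.0,   # per-hop routing delays
    "contention":    5.0,   # queueing under competing traffic
    "protocol":      2.0,   # segmentation, reassembly, header processing
}

total = sum(budget_ms.values())
print(f"communications lag: {total:.1f} ms")
# With the non-networked VR components already consuming 200-300 ms of
# the ~300 ms usability budget, little remains for the network.
```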

In local (nonnetworked) VR systems, the contribution of LAN latency to the overall lag of the system is usually negligible. In WAN environments (typical for future teleimmersion systems), the virtual world may be composed of many interacting network-based objects. In these multisite worlds, communications delays can become a critical contributor to end-user lag. For example, in a computational steering application, in response to a user event, many sites may need to exchange megabytes of data before the next scene can be updated.

Research has shown [350, 385, 570] that users are highly sensitive to lag once it exceeds a certain threshold (see also Chapter 19). For example, most users have difficulty manipulating objects in virtual reality once lag exceeds 300 ms. For augmented reality (where the virtual display is coupled with the real world), this limit is approximately 30 ms. Since the other (nonnetworked) components of the VR system often together exceed 200–300 ms, there is little room for wide-area communications delay in the lag budget.

Asynchronous teleimmersion models, in which local models of remote users, objects, and behaviors are asynchronously updated, may improve this situation. However, absolute limits on transmission latency imposed by speed-of-light round-trip times may ultimately limit the geographical extent of tightly coupled teleimmersion environments.

6.4.3 Performance Requirements

We now analyze each of the nine flow types with respect to seven metrics: maximum acceptable latency, bandwidth requirements, need for reliability, need for multicast, level of security, use of streaming communication, and variability in quality-of-service requirements. We emphasize that the wide variation in teleimmersion applications means that the results of this analysis are necessarily approximate and may not apply in all circumstances.

Bandwidth and latency requirements are derived under the assumption that the display system is capable of a rate of 60 field updates per second per eye and that it is driven with 30 frames per second of updated content. This rate translates into a frame update interval of ≈30 ms (see Table 6.1).



TABLE 6.1 Requirements of teleimmersive applications.

    Flow type    Latency    Bandwidth      Reliable  Multicast  Security  Stream  Dynamic QoS
    Control      < 30 ms    64 Kb/s        Yes       No         High      No      Low
    Text         < 100 ms   64 Kb/s        Yes       No         Medium    No      Low
    Audio        < 30 ms    N × 128 Kb/s   No        Yes        Medium    Yes     Medium
    Video        < 100 ms   N × 5 Mb/s     No        Yes        Low       Yes     Medium
    Tracking     < 10 ms    N × 128 Kb/s   No        Yes        Low       Yes     Medium
    Database     < 100 ms   > 1 GB/s       Yes       Maybe      Medium    Maybe   High
    Simulation   < 30 ms    > 1 GB/s       Mixed     Maybe      Medium    Maybe   High
    Haptics      < 10 ms    > 1 Mb/s       Mixed     Maybe      High      Yes     High
    Rendering    < 30 ms    > 1 GB/s       No        Maybe      Low       Maybe   Medium


Latency

One way to think about latency is to look at bounds: the low-water mark would be speed-of-light travel times for the site geometry; the high-water mark would be the acceptable delays for each of the data stream types as they contribute to VR lag and usability. Of special note here is the latency (and bandwidth) cost of multicast or broadcast mechanisms such as reflectors, the MBone, ATM cell replication, software-based multicast, and unicast proxies. We emphasize that it is still largely an open research issue to define exactly the nature of the latency requirements for supporting effective human-human interactions via teleimmersion. Much work has focused on the teleoperator (a restricted version of telepresence) or on the pure audio and video lag problem (telephony applications), but comparatively little work has been done to quantify the impact of lag on effective cooperative tasking via teleimmersion.

Low latency is one of the most important grid requirements for supporting teleimmersion applications. Control streams need to respond quickly to the user and should enable the user to have effective control of the environment, particularly the display system. We believe the latency for control messages should equal the frame rate, or ≈30 ms. Text communication is rarely time critical but needs to provide a sense of continuity to the user. Hence, ≈100 ms for text messages is probably adequate. The human auditory system is extremely sensitive to timing variations in speech and music. For tight lip-synced video, we believe that audio latency should be ≈30 ms.



Video streams that are used for teleconferencing ideally should be lip-synced, although studies [565] have shown that audio quality is more important than video quality in preserving the effectiveness of the overall teleconferencing experience. For this reason, we estimate that ≈100 ms latency for video is probably adequate; however, when a media stream has both audio and video, audio should have higher priority.

Tracking data is generally collected at a rate of 10–500 Hz. It is used not only to compute the current location and orientation of the user’s head and hands, but also to compute accelerations and to extrapolate positions. Therefore, one needs both the most recent value of the sensors and (ideally) a series of recent values from which to compute derivatives and extrapolate future values. Large-scale motions of the head and body have relatively low natural frequencies (2–10 Hz); however, hand and eye motion rates can reach ≈30–100 Hz. We estimate the tracking latency requirement at ≈10 ms.

The size of database updates is dependent on the application; to avoid having the system pause during transactions, however, database transaction latency should be ≈100 ms, independent of the size of the update. This latency has considerable implications for peak bandwidth. Simulation updates (i.e., the latency from the supercomputer to the display environment) should ideally match the frame rate, ≈30–100 ms. If significant processing is required for the simulation update, then the latency for the communication of the simulation results may need to be even less.

The latency requirements for effective haptic interfaces that have been reported in the literature range from 1 ms to 20 ms [89]. We believe that ≈10 ms is a reasonable value for general teleimmersion systems; more specialized systems that could support telesurgery may have more aggressive requirements.

Finally, distributed rendering rates will need to match the display update rate and ideally maintain single-frame synchronization rates, which imply a latency of ≈30 ms.

Bandwidth Requirements

Bandwidth requirements for teleimmersion can be summarized in the following equation:

$$B_{\mathrm{total}} = \sum_{f \in \{c,\,e,\,a,\,v,\,t,\,d,\,s,\,h,\,r\}} N_f \cdot B_f,$$

where $N_f$ is the number of flows of type $f$, $B_f$ is the average bandwidth associated with a flow of type $f$, and the flow types $c, e, a, v, t, d, s, h, r$ correspond to control, text, audio, video, tracking, database, simulation, haptics, and rendering, respectively.
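For a hypothetical session, the equation reduces to a short computation. The per-flow bandwidths below follow Table 6.1 loosely, and the stream counts are invented for illustration.

```python
# Evaluate B_total = sum over flow types of N_f * B_f for a hypothetical
# session. Per-flow bandwidths follow Table 6.1 loosely; stream counts
# are invented for illustration. Units: Kb/s.

per_flow_kbps = {"control": 64, "text": 64, "audio": 128,
                 "video": 5_000, "tracking": 128}
streams = {"control": 1, "text": 1, "audio": 6, "video": 4, "tracking": 8}

b_total = sum(streams[f] * per_flow_kbps[f] for f in streams)
print(f"steady-state media demand: {b_total / 1000:.1f} Mb/s")
# Database, simulation, and rendering flows add bursts that can reach
# gigabytes per second and dominate the peak requirement.
```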



A more complete bandwidth model would require additional terms relating the bandwidth requirements to the size and nature of distributed world models, the nature of the simulation data streams, consistency messages and deadlines for exchanges, tracking data, replication and multicast, control messages, degrees of adjustability, guard bandwidth, and bandwidth needed for latency reduction.

Control and text streams rarely will exceed 64 Kb/s data rates, since they are typically generated from keyboard input or GUIs. Bandwidth needed for audio depends on the quality of the audio and the sampling rates, as well as on the number of audio streams the application needs. Current encoding methods can transmit a CD-quality (44 kHz sampling rate) stream in about 128 Kb/s. Teleimmersion applications may, however, have several to dozens of such streams active at any given time; hence, we characterize the audio requirement as N × 128 Kb/s, where N is the number of streams. Fully spatialized audio may require higher bandwidth, but this is balanced by the fact that most voice-grade audio encoding requires less than 64 Kb/s. Video is perhaps the next most difficult data type to characterize. Depending on the quality of video encoding and the resolution required, bandwidth requirements can vary from 128 Kb/s to 5 Mb/s. Like audio, the number of video streams can vary, and so we characterize the video requirement as N × 5 Mb/s.

Tracking data is typically generated by one or more sources in the virtual environment; the data rates depend on the degrees of freedom of the tracker and the sampling rate. We characterize typical tracking requirements as N × 128 Kb/s, where N is, as before, the number of streams; this would support approximately 8 tracker sources at 100 Hz. Database updates are also highly application dependent, with bandwidth requirements driven by the need to update potentially large world models in real time. For example, updating a 40 MB world model within a single ≈30 ms frame interval implies a peak rate of roughly 40 MB / 0.03 s ≈ 1.3 GB/s, that is, a peak bandwidth of gigabytes per second.

The bandwidth required for interfaces to simulations is also highly variable, ranging from a few tens of kilobytes per second to gigabytes per second, depending on the application type, simulation update rate, and frame rate. Very high peak numbers can be required if updates are to be completed within one frame interval (i.e., ≈30 ms).

Haptics requires that two types of data be sent: a continuous stream of data corresponding to the instantaneous forces that need to be applied at each haptic interface point, and periodic updates to the haptic textures (cached locally) that are “imaged” at each interface point. We estimate that current haptic interfaces can be effective with 1 Mb/s.



Remote rendering anticipates future VR environments in which a combination of local and remote rendering is used to generate the images of the world. Since this requirement is highly dependent on the application and display environment, we simply acknowledge that it is not unreasonable to assume that rates in excess of gigabytes per second would be extremely useful.

Reliability

Not all the data exchanged in a teleimmersion session needs to be sent reliably. Control and text data typically do. On the other hand, tracking data has a limited useful lifetime. Essentially, we are generally interested only in the latest value, not the entire time series or history of the stream. Thus, if a tracking packet is lost, rather than requesting a retransmission, we would simply use the next packet in the stream and interpolate between the two. Depending on the latency budget in the application, similar techniques can apply to audio and video streams and to rendering data. Some encoding and interleaving of the data can be done at the source to improve the quality of the receiver interpolation. Other types of data exchanged in a session must be sent reliably to ensure consistency of the world database or to accurately reflect an update from a simulation or user interface event. In these exchanges, performance is still critical, but the exchange must occur exactly once to maintain consistency or to have a valid transaction. On the other hand, if simulation data is simply updating (“tracking”) the position of a simulated object, reliability may be less important. Haptics can share properties with control or tracking, depending on the application.
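The interpolation strategy described for lost tracking packets can be sketched as linear extrapolation from the two most recent samples; the sample format below is hypothetical.

```python
# Sketch of loss-tolerant tracking: rather than retransmitting a lost
# packet, extrapolate the user's position from the latest two samples.
# Each sample is (timestamp_seconds, position); the format is invented.

def extrapolate(samples, now):
    """Linear extrapolation from the two most recent tracking samples."""
    (t0, p0), (t1, p1) = samples[-2], samples[-1]
    velocity = (p1 - p0) / (t1 - t0)
    return p1 + velocity * (now - t1)

samples = [(0.00, 1.00), (0.01, 1.02)]   # 100 Hz tracker stream
# The packet expected at t = 0.02 was lost; estimate the position anyway.
print(f"estimated position: {extrapolate(samples, 0.02):.2f}")
```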

Multicast

In an N-way teleimmersion session, much of the data from one user needs to be available to many other users in essentially the same form. When this is the case, it is reasonable to use multicast transmission rather than repeated unicast transmissions to each receiver. Examples include audio, video, tracking, and database updates and certain kinds of user or simulation events. Both reliable multicast and unreliable multicast are required, particularly if unreliable multicast has lower latency. Early teleimmersion applications have made little use of multicast, largely because of two issues: the difficulty of gaining access to high-performance multicast-enabled networks, and the lack of middleware tools providing the teleimmersion programmer with simple access to multicast. However, multicast support is definitely required if teleimmersion applications are to scale beyond a few tens of users per session.
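As a flavor of what the missing middleware must wrap, here is a minimal sketch using the standard IP multicast socket API (the group address, port, and TTL are arbitrary examples): one send reaches every joined receiver, replacing N unicast sends.

```python
import socket
import struct

GROUP, PORT = "239.1.2.3", 5007   # arbitrary administratively scoped group

def send_update(payload: bytes):
    """One transmission reaches every subscribed receiver (unreliably)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 8)
    s.sendto(payload, (GROUP, PORT))

def make_receiver():
    """Join the multicast group; datagrams then arrive via recvfrom()."""
    r = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    r.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    r.bind(("", PORT))
    mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
    r.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return r
```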

Security

Teleimmersive applications require numerous levels of security. For example, authentication is needed at both the session and resource level to protect against unauthorized users. Individual streams in a teleimmersion application may need to be further protected by encryption, for privacy reasons or for user safety reasons (as with control and haptic streams). In addition, computational steering and data analysis applications may have content that is sensitive and that needs to be protected. A major requirement is that the user be able to control the trade-off between the level of security and its performance impact on the application.

Streaming

Of the nine classes of flows in a teleimmersion application, at least four are typically thought of as streams: video, audio, tracking, and haptics. In addition, depending on the application, remote databases, simulation, and rendering could have streaming modes. What differentiates streams from other types of network flows is the assumption of continuity and relatively constant bandwidth demands per stream. When teleimmersion applications are implemented on a networking environment that can make separate provisions for streaming data and burst data, it makes sense to use network channels that support constant bit rate traffic (e.g., ATM AAL 1-4). Moreover, when characterizing the networking demands of a teleimmersion application, we should think in terms of steady-state demands and peak demands, which can vary over an order of magnitude during a session.

Quality of Service

The large peak transfer rates noted above are driven by the fact that relatively simple actions in the virtual world by the user can cause a considerable demand for synchronization or consistency updates at each participating site. One example is rapidly loading new geometry files or world models when the user edits or navigates through the world. Realtime rendering requirements may imply the need to distribute updates within one frame update interval (1/30–1/60 s) to avoid jerkiness or pauses in the graphics or inconsistencies in the shared world. While intelligent and speculative prefetching can often reduce the need for peak bandwidth, the ultimate limit is the nature and complexity of the world model and the restrictions (if any) placed on the user.

For QoS environments supporting two classes of service—high priority (C1) and normal priority (C2)—a possible assignment of flow types to class types is as follows. In these expressions, Nx is the number of streams of flow type x and Bx the per-stream bandwidth; reading the subscripts against the flow types above, a = audio, v = video, t = tracking, h = haptics, r = rendering, s = simulation, d = database, c = control, and e = text.

C1 = NaBa + NvBv + NtBt + NhBh

C2 = NcBc + NeBe + NrBr + NsBs + NdBd

The assignment of database flows (NdBd) to the Class 2 (C2) service might need to be reconsidered, depending on the latency required for event arbitration.

For a QoS environment supporting three classes of service—a constant bit rate service (C1), normal traffic (C2), and lower-priority (C3) traffic—a possible assignment of teleimmersion flows is as follows:

C1 = NaBa + NtBt + NhBh

C2 = NvBv + NrBr + NsBs + NdBd

C3 = NcBc + NeBe

If latency is the primary criterion, we might make the following alternative assignment in a three-class system:

C1 = NtBt + NhBh

C2 = NaBa + NsBs + NrBr + NcBc

C3 = NeBe + NdBd + NvBv
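A small sketch, assuming the subscript reading given above plus illustrative stream counts and per-stream rates, shows how the per-class aggregates of the latency-first assignment would be evaluated:

```python
# Aggregate demand per class is the sum of Nx * Bx over the assigned flow
# types. Rates Bx (Kb/s) and counts Nx below are illustrative only.

B = {"a": 128, "v": 5000, "t": 128, "h": 1000,
     "c": 64, "e": 64, "r": 10000, "s": 1000, "d": 2000}
N = {"a": 4, "v": 4, "t": 4, "h": 2, "c": 1, "e": 1, "r": 1, "s": 1, "d": 1}

CLASSES = {          # the latency-first three-class assignment above
    "C1": ["t", "h"],
    "C2": ["a", "s", "r", "c"],
    "C3": ["e", "d", "v"],
}

for cls, flows in CLASSES.items():
    total = sum(N[f] * B[f] for f in flows)
    print(f"{cls}: {total / 1000:.1f} Mb/s")
```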

6.4.4 Discussion

One message to take away from this analysis is that teleimmersion requires much more from the grid than just high raw bandwidth and low latency. High raw bandwidth is certainly required to support many simultaneous media streams and to enable sharing of large world models and database updates in real time. However, jitter (variation in latency) is equally important, as may be higher-level QoS metrics, as discussed below.

Some of the needs expressed in this section may be unachievable in wide area environments. This situation suggests that certain teleimmersion applications may not be feasible on very large geographic scales. More interesting, it suggests that non-brute-force solutions, such as predictive simulation, should be investigated in future research. Techniques can be developed to accommodate lost packets or larger-than-ideal latency in teleimmersion applications, provided the rates of packet loss are bounded and relatively low and the latency is relatively constant. For example, local models can be used to predict tracking trajectories and to interpolate through periods of lost data for voice or world model updates. All these are research areas where breakthroughs may significantly reduce the raw performance requirements of future applications [36].

We emphasize that the material presented here is preliminary. Little work has been done on measuring the effectiveness of VR systems, and even less work has been done to evaluate the effectiveness of teleimmersive collaboration environments in enabling groups of people to work together on complex tasks [264, 560]. Additional research is needed to further understand the types of tasks that teleimmersion may effectively support and to identify the features and performance needed to best support different classes of cooperative tasks. For example, educational uses of teleimmersion may impose relatively few performance or quality requirements on the system, but cooperative design and experimental assembly activities may impose strict network performance and human factors requirements.

6.5 HIGHER-LEVEL SERVICE REQUIREMENTS

We can also identify a number of higher-level grid services required for teleimmersion.

6.5.1 QoS Mechanisms

Quality-of-service requirements can be discussed broadly in three areas: (1) the ability to assign minimum service guarantees and relative priorities to each of many streams, (2) the ability to specify notification and compensation actions if the QoS dynamics of the network change over time, and (3) the ability to predict the reliability of service guarantees or service estimates. A teleimmersion application may be able to make intelligent use of information regarding QoS variations provided by the network or middleware software layer, provided that appropriate dynamic control interfaces are available. For example, an application might compensate for a reduction in available bandwidth by reducing the fidelity of a simulation or the resolution of a video stream. Complex relative priorities for stream queues may be used to balance some peak bandwidth requirements. However, significant experiments will be required to understand the user impact of various priority schemes (see Chapter 19 for more details on QoS).
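As a sketch of requirement (2), the callback below shows one plausible compensation policy. The middleware hook (on_bandwidth_change) and the quality ladder are hypothetical, not an interface defined in this chapter:

```python
# An application-side compensation action: when notified of a bandwidth
# change, step the video stream down (or up) a hypothetical quality ladder.

class AdaptiveVideoStream:
    # (bandwidth floor in Kb/s, encoding) ordered from best to worst
    LADDER = [(5000, "full resolution"), (1500, "half resolution"),
              (500, "quarter resolution"), (128, "slides plus audio")]

    def __init__(self):
        self.quality = "full resolution"

    def on_bandwidth_change(self, available_kbps: float):
        """Called by the (hypothetical) QoS middleware when capacity shifts."""
        for floor, quality in self.LADDER:
            if available_kbps >= floor:
                self.quality = quality
                break
        else:
            self.quality = "paused"

stream = AdaptiveVideoStream()
stream.on_bandwidth_change(900)
print(stream.quality)  # "quarter resolution"
```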

6.5.2 Grid Programming Environment

Teleimmersion systems will make extensive use of distributed computing programming environments. Tools are needed that enable the specification and monitoring of complex QoS schemes (see above) and the monitoring of the actual utilization of grid resources (networking, computing, data, etc.). Finding or discovering resources (e.g., teleimmersion servers, computational servers, databases, and instruments) is a key requirement. Teleimmersion applications will make heavy demands on grid resource locators and directories. The user must also be able to easily find and attach to information resources (like today's Web) and to interactive resources (compute servers, databases, collaborative virtual environment servers, and remotely controllable instruments) and combine them into ad hoc problem-solving systems. Distributed resource management and scheduling tools are required that allow teleimmersion systems to specify resources in an abstract way, by generic type or class, rather than by absolute name as is done today. Since teleimmersion users are likely to be assembling complex collections of grid resources in an ad hoc manner for exploration and analysis, it may be desirable to be able to refer to previously used collections of grid resources within a persistent gridwide name space.

6.5.3 Advanced Protocols

As was the case with desktop collaboration tools [367], broad deployment of virtual environments will increase the size of the user community and hence require that network-based systems scale to increasingly large numbers of users. In desktop collaboration tools, scalability was achieved via innovations at the protocol level—specifically, the low-level multicast features commonly available in modern IP routers [183]. Teleimmersion applications will require broad deployment of high-performance multicast implementations as well as other protocol innovations, such as various forms of reliable multicast (see Chapter 18).

6.5.4 Performance Monitoring and Measurement

It is not easy in the current Internet to determine the cause of poor end-to-end network performance. A related difficulty is that we are rarely able to predict even short-term network performance for a particular distributed application configuration. Future grids must incorporate infrastructure for collecting network performance data as it relates to a specific end-user application. Performance data must be captured for an entire teleimmersion application, not just one stream or connection. This multistream view is important because dynamic QoS mechanisms may be changing the allocations of network resources among a set of streams with complex interdependent priorities (see Chapter 14).

6.5.5 Scheduling

Resource scheduling is critical to teleimmersion applications. The scheduling of resources—whether cycles on a large computer server or networking bandwidth—must occur in real time and be sensitive to the needs of the entire application. Users may wish to employ a combination of scheduling techniques: resource reservation for heavily used and unique resources, and ad hoc interactive scheduling of generic resources. Hence, the grid scheduling environment must support both techniques (see Chapters 12 and 19).

6.6 CONCLUSIONS

We have presented an overview of a new class of applications that combine virtual reality with many of the features of applications described in the preceding three chapters: remote simulation, realtime interaction and steering, and large-scale databases. Teleimmersion applications require not only high-bandwidth/low-latency networks, but also comprehensive programming and performance measurement tools and a variety of robust "gridware" services. We believe that teleimmersion applications will drive grid requirements in the next five years. The applications can be used as ideal testbeds for the development and evaluation of grid infrastructure services and programming environments. For these reasons, teleimmersion applications are likely to be flagship applications for future computational grids.

FURTHER READING

For more information on the topics covered in this chapter, see www.mkp.com/grids and also the following references:

• Kalawsky's reference text [301] on virtual environments is comprehensive.

• Two National Research Council reports discuss open research topics in virtual environments and user interfaces [173, 412].

• Burdea provides a comprehensive survey of haptics [89], and Minsky a wonderful haptics bibliography [387].

• Matlin and Foley survey human sensory and perception systems [365], from which many of the requirements of teleimmersive and virtual environments derive.


CHAPTER 9

Object-Based Approaches

Dennis Gannon and Andrew Grimshaw

The design of the current generation of desktop software technology differs from that of past generations in a fundamental way. The new paradigm states that applications should be built by composing off-the-shelf components, much as hardware designers build systems from integrated circuits, and that, furthermore, these components may be distributed across a wide area network of compute and data servers. Components are defined by their public interfaces, which specify the function as well as the protocols that they may use to communicate with other components. In this model, an application program becomes a dynamic network of communicating objects. This basic distributed-object design philosophy is having a profound impact on all aspects of information-processing technology. We are already seeing the software industry move away from handcrafted, standalone applications and toward investment in software components. A technology war over the design of component composition architecture is being fought within the industry.

High-performance computing cannot remain immune to this paradigm shift. As the Internet continues to scale in both size and bandwidth, it is not unrealistic to imagine applications that incorporate 10,000 active components distributed over 10,000 compute hosts. Furthermore, pressure from the desktop software industry will eventually lead to the integration of applications that currently run only on supercomputer systems into distributed problem-solving environments that use object technology. Computational grids that couple massively parallel processor (MPP) servers, advanced networked instruments, database servers, and gigabit networks will require a robust and scalable object model that supports high-performance application design.

This chapter is divided into two parts. In the first half (Sections 9.1 through 9.3), we explore the concepts underlying current distributed-object and component system architectures. We describe the basic features of the designs that, for now, constitute the standards of the desktop software industry. As good as they are, however, these designs fall short of what is needed for a high-performance national grid object system. By looking at three of the applications described in other chapters of this book, we can extract the requirements that software component middleware must meet in order to build these types of applications.

The second half of this chapter (Sections 9.4 through 9.6) focuses on the large-scale architecture of a complete grid software system, based on object-oriented design concepts and using the Legion system as an example.

9.1 BASIC CONCEPTS

Before discussing any applications, we should define some of the terms used in the chapter to classify computation and communication types. The key ideas in object-oriented software design, and in this discussion, are as follows (a short code sketch after the list makes them concrete):

• Data and the functions that operate on the data should be bound together into objects. These objects are instances of an abstract data type called a class. The data associated with an object are called data members, or attributes, and the functions that are associated with a class of objects are called member functions.

• Interfaces describe a set of functions that can be used to interact with a family of objects. Those classes of objects that respond to a particular interface are said to implement that interface. A class may implement more than one interface.

• A new object class may be built from an existing class by adding new data attributes or member functions. Instances of the new class each contain an instance of the original (parent) class and thus can implement the same interfaces as the parent. This process of extending one class to build another is called inheritance: the extended class is said to inherit from the original class. The extended class can override its parent class's definition of a member function and specialize or modify the parent's behavior (i.e., it responds to the same functions, but not in the same way).

• A new interface definition can also be created by simply adding new functions, thereby extending the definition of one or more other interfaces.
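A minimal sketch (Python, with invented example classes) illustrates the ideas above:

```python
# Objects bind data (attributes) and member functions; an interface names a
# set of functions; inheritance extends a parent class and may override it.

class Drawable:                  # an interface: a named set of functions
    def draw(self) -> str: ...

class Shape(Drawable):           # a class implementing that interface
    def __init__(self, name):
        self.name = name         # a data member (attribute)
    def draw(self):              # a member function
        return f"<{self.name}>"

class LabeledShape(Shape):       # inheritance: extends Shape...
    def __init__(self, name, label):
        super().__init__(name)
        self.label = label       # ...adding a new data attribute...
    def draw(self):              # ...and overriding/specializing behavior
        return f"<{self.name}:{self.label}>"

print(LabeledShape("box", "A").draw())  # same interface as Shape, new behavior
```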

These object-oriented software design principles are only the first step in building the next generation of grid applications. As the desktop software industry has learned, it is also necessary to understand the process whereby an object class instance becomes a component in a distributed system. The term component architecture describes the framework for designing and using components; it usually has two parts: components and containers. The architecture also defines a set of rules that prescribe the required features all components must support in order to be integrated into functioning applications. A container is an application that handles the integration of a set of components, or their proxies. We will discuss containers in more detail later.

Components often (but not necessarily) share the following characteristics: they are objects, they have persistent state, they have visual interfaces, they can be manipulated with the graphical representations of component container toolkits, and they can communicate with other components—either locally or remotely—by one or more mechanisms (events, method invocations, procedure calls, message passing, etc.). The way in which a component presents a visual interface (if it has one), responds to events, and communicates with other components is defined by the component architecture. The three most important commercial component architectures are Microsoft ActiveX, OMG's CORBA/OpenDoc, and JavaBeans/Java Studio. But, since our interest here is grid systems, not graphical user interfaces, we will focus on the aspects of component systems that describe the composition and communication behavior of most component architectures.

There are two common models of component integration:

• Client-server communication: In this model a client is an application that acts as a container of components or their proxies. The application makes requests of objects by invoking the public member functions defined by the component objects' interfaces. The individual components are servers, which respond to the client as illustrated in Figure 9.1. The control flow is based on a function call from and return to the client. Microsoft ActiveX follows this model [117, 427]. CORBA was also designed with this model in mind, but as a distributed-object system it is flexible enough to support other models [424, 428, 427].

• Software ICs: An electronic IC is a component that has input buffers and output ports. A design engineer can connect any output port of the right signal type to an input port of another IC. Software IC systems have the same nature. A software module has input ports and output ports, and a graphical container or script-based composition tool can be used to create object instances and to define the connections between the components.

FIGURE 9.1: Client-server component models consist of a client container application, which holds object components that often act as proxies for remote objects. Such an architecture may support multiple protocols between the proxies and the remote components (in the figure: a CORBA object via IIOP, ActiveX via DCOM, Java RMI, and HPC++ via Nexus).

The input port's type is an interface describing the messages that the port can receive. These ports can be connected to other ports, whose interface descriptions describe what type of messages are sent. As with electronic ICs, an output port's messages can be multicast to matching input ports on several other components, as shown in Figure 9.2. The control flow of messages is based on macro-data-flow techniques.

In addition to this data stream style of communication between object ports, there are two other standard forms of component system communication (a toy sketch after this list illustrates all three modes).

• Control signals: Every component implements a standard control message interface, which is used by the component control container framework to query other components about their properties and state.

• Events and exceptions: Events are messages generated by a component and broadcast to any other components that are "listening" for events of that type. Most user input, as well as other GUI management tasks, is handled by events.
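The toy framework below (every name invented) sketches all three modes: port-to-port data streams with fan-out, a standard control interface for the container, and broadcast events:

```python
# A minimal software-IC flavor: output ports connected to input handlers
# (possibly several: multicast), a control interface the container can
# query, and events pushed to registered listeners.

class Component:
    def __init__(self, name):
        self.name, self.listeners, self.outputs = name, [], []

    def describe(self):                  # control interface for the container
        return {"name": self.name, "outputs": len(self.outputs)}

    def connect(self, downstream, handler_name):
        self.outputs.append(getattr(downstream, handler_name))

    def emit(self, message):             # macro-data-flow: push to all inputs
        for handler in self.outputs:
            handler(message)

    def broadcast_event(self, event):    # events reach whoever is listening
        for listener in self.listeners:
            listener(event)

class Printer(Component):
    def receive(self, message):
        print(f"{self.name} got {message!r}")

src, view_a, view_b = Component("tracker"), Printer("viewA"), Printer("viewB")
src.connect(view_a, "receive")
src.connect(view_b, "receive")           # one output port, two inputs
src.emit({"pos": (0, 0, 1)})
```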

Sun's JavaBeans and Java Studio systems follow the software IC component model closely. Other commercial systems based on this architecture include AVS and its descendant, NAG Explorer, which are used to build visualization tools from components. Unfortunately, Explorer has a limited and inflexible type system that restricts its extensibility to larger distributed applications. The CORBA-based OpenDoc system uses a similar object model.

FIGURE 9.2: A software IC component architecture breaks the client-server hierarchy. Each component has three standard modes of communication: data streams that connect component ports (solid lines), control messages from the component container (dashed lines), and events (star bursts), which are broadcast to all "listening" objects.

The final piece of a component system architecture that distinguishes it from other types of software infrastructure is the concept of a component container framework. The container, an application that runs on the user's workstation, is used to select components, connect them, and respond to event messages. The container uses the control interface of each component to learn its properties and to initialize it.

Microsoft's Internet Explorer is an example of a component container for ActiveX. Java Studio provides a graphical user interface for composing and connecting components that is similar to the layout system used by Explorer and other component breadboards. We will return to more of the technical requirements for high-performance components and container frameworks later in this chapter.

9.2 THREE APPLICATION SCENARIOS

Having defined the basic object-oriented software concepts, we can now look at three application case studies. Each case illustrates a different set of design requirements, but they all share certain features.

9.2.1 Example: Distributed Algorithm Design

An important class of grid-based programming environments comprises problem-solving environments (PSEs), software frameworks for integrating algorithmic modules into distributed scientific computations. As discussed in Chapter 6, these systems make it possible for a user to exploit the resources of the grid without having to deal with the complexities of low-level communication and resource management. These systems differ from generic component architectures by providing a high-level framework that allows users to approach a problem in terms of the application area semantics for which the PSE was designed.

SCIRun (see Chapter 7) is an excellent example of a PSE system architecture for scientific problem solving. Based on the NAG Explorer–style component composition model, SCIRun currently supports a small but powerful set of data types for communication, including mesh, field, surface, and matrix types. However, the type system cannot, in principle, be extended to arbitrary types.

Two other distributed scientific PSEs are NetSolve (see Chapter 7) and WebFlow (Chapter 10). NetSolve is not based on an object component architecture, but uses a combination of a client-server model together with a novel approach to agent-based design. WebFlow is based on component design but is built on, and derives its power and versatility from, commodity Web technologies. Another example is the Linear System Analyzer (LSA), designed by Bramley et al. [222] and illustrated in Figure 9.3.

FIGURE 9.3: Building distributed algorithms by composing components using LSA.

LSA was built to simplify the process of solving large sparse systems of linear equations. While many may consider the task of solving matrix equations to be a "solved problem," nothing could be further from the truth. This job remains one of the most difficult problems in most large-scale scientific simulations. The difficulty arises because no single method works for all problems and little theory exists to guide the user in selecting the correct method for a given problem. Furthermore, the most successful methods involve a combination of matrix preconditioning and iterative solvers, or careful reordering and scaling and a direct solver. Bramley observed that the problem can be broken down into the following steps:

1. Read the matrix (or extract it from another part of a larger problem).

2. Analyze the matrix for obvious properties that help guide the solution process. For example, is it symmetric, banded, or strongly diagonally dominant?

3. Apply a reordering or scaling transformation, such as Markowitz pivoting or blocking.

4. Select and apply a preconditioner, such as MG, ILU, MILU, RILU, ILUT, or SSOR.

5. Select a solver from the many that are available, such as Direct, AMG, BiCG, CGS, BiCGstab, GMRES, GCR, or OrthoMin.

6. Extract a solution.

In Figure 9.3 the reordered system is sent to a sparse direct solver (SuperLU) and an iterative library (SPLIB). LSA provides a library of components that implement these steps in the solution process. By connecting a matrix analysis component to a preconditioner that is connected to an iterative solver and a solution extractor, the user can build a custom solver for the problem at hand. The components can be linked to form a single library, which can then be added to a larger application. Suppose, however, that the best solver is located on a specific remote parallel machine, or that the problem is so large that it can be solved only on a remote machine with a large memory. Since LSA allows components to be placed on remote machines by assigning a host IP address to that component, the underlying component container architecture works with the grid scheduler to make sure that the component is initialized and running at that location.
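A sketch of what composing such a pipeline might look like in code. The component API is invented; only the six-step sequence and the idea of pinning a component to a remote host come from the text:

```python
# Composing a custom solver, LSA style: each step is a component, and the
# ordered connections define the dataflow.

def apply_component(name, params, data):
    """Stand-in for a container dispatching to a (possibly remote) component."""
    print(f"[{params.get('host', 'local')}] {name} {params}")
    return data  # a real component would transform the matrix or solution

pipeline = [
    ("reader",         {"source": "matrix.mtx"}),     # 1. read/extract matrix
    ("analyzer",       {}),                           # 2. symmetry, bandedness, ...
    ("reorderer",      {"method": "markowitz"}),      # 3. reordering/scaling
    ("preconditioner", {"method": "ILU"}),            # 4. preconditioner choice
    ("solver",         {"method": "GMRES",            # 5. iterative solver,
                        "host": "mpp.example.edu"}),  #    placed on a remote host
    ("extractor",      {}),                           # 6. extract the solution
]

def run(pipeline, data=None):
    for name, params in pipeline:
        data = apply_component(name, params, data)
    return data

run(pipeline)
```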

Requirements Imposed on the Component System

Note that LSA and other distributed algorithm systems impose special requirements that are not part of the conventional desktop software component model. Many of these will be common to all the examples in this book.

First, network quality of service and performance characteristics may deteriorate when a problem is distributed. Large-scale problems have large-scale bandwidth demands, but moving a large sparse matrix over a network link should not take longer than the combined execution time of the sending and receiving components. Otherwise, it may not make sense to distribute the computation. (There are important exceptions to this rule. For example, if the host system has special capabilities or the component objects are proprietary, it may be necessary to distribute the computation in spite of reduced performance.) None of the commercial architectures (ActiveX, CORBA, JavaBeans) has a standard model for associating performance characteristics or requirements with the communication infrastructure. Other network performance factors affect performance time and service as well, and they should be considered when examining component models. (Quality of service is a major concern; it is described in greater detail in Chapter 19.)

Second, scheduling the execution of a large distributed computation can be extraordinarily complex. For an application such as LSA, some components can execute interactively, while other components must wait in batch queues. Consequently, the synchronization between components must be flexible enough to allow the network of components to work asynchronously with long latencies.

Third, it is important to have a scripting interface to supplement the graphical composition model, since a network of components may be executed several times with different inputs and different parameter configurations. A scripting language such as Python or Perl will allow iterative control of the execution as well as composition of large graphs of components.
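Such a driver might look like the following sketch (run_graph is invented, standing in for whatever instantiates and executes the component network):

```python
# Iterative control from a script: rerun the solver network over parameter
# configurations without redrawing the graph.

def run_graph(precond, solver):
    """Hypothetical: build the component network once and execute it."""
    print(f"running pipeline with {precond} + {solver}")

for precond in ("ILU", "MILU", "SSOR"):      # step 4 alternatives
    for solver in ("GMRES", "BiCG", "CGS"):  # step 5 alternatives
        run_graph(precond, solver)
```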

Finally, mixed-language components are essential for linking scientific applications, such as linear algebra solvers, with Java-based graphical interfaces and component architectures. In LSA, approximately 40% of the system is Fortran plus MPI, 30% is Java, and 30% is HPC++ [221], which encapsulates the parallel Fortran and communicates with the Java front end.

9.2.2 Example: Teleimmersive Collaborative Design

Consider the following application of the teleimmersion environment described in Chapter 6. A car company uses a collaborative design system to reduce costs and time in its new product design process. For each new car, there is a master design database at the main factory, and each subcontractor maintains a separate design database with details about the components that they supply. Some of the information in the subcontractors' databases is proprietary and does not appear in the master design database, but it is possible for the master design database to extract any required performance information from a subcontractor by means of simple RPC transactions. These performance responses can be used to create a simulation of the car on a remote supercomputer at the company headquarters. The simulation results can be transmitted to teleimmersion systems at the main facility and at the subcontractors' facilities over a high-bandwidth network. The teleimmersion environment displays the car responding, in a virtual environment, to the user's control.

Suppose, though, that the designers want to see how an altered engine design affects the car's handling (Figure 9.4). The main designers would ask the engine builder to update the main database with the new engine model. A virtual mountain road scenario could be loaded and used to create a simulation. The designers could then interactively experiment with the new handling characteristics of the simulated vehicle.

FIGURE 9.4: Two CAVE environments, CAVE 1 (main) and CAVE 2 (subcontractor), connected to a distributed design database and a remote simulation facility. Media streams and event messages flow between the CAVEs; each site holds a scene database; the design database, the subcontractor design database, and a finite-element simulation are coupled by database requests and reply messages.

This system has several components. The teleimmersion environment can be viewed as one large component, but it probably consists of several smaller components, such as the following:

• The data input stream consists of updates sent to the associated visual database and then rendered by the display object.

• As with any graphical user interface system, the user interface control components, which include pointers, head trackers, and other haptic devices, detect user events and then output the information as a stream to other components that are associated with the application.

• The application components receive information from the control device components. With this information the application components can query and update the visual database.

The design database is a description of the car and is used in the simulation and the manufacturing process. This database can also be viewed as a large component or as a collection of smaller ones. Output from this object includes the polygon model used in the rendering component and the finite-element model used by the simulation.

The required inputs for the simulation object are the finite-element models of the car and the road, as well as a sequence of control inputs that "drive" the car during the simulation.

Requirements Imposed on the Component System

The application's most important aspect is its dependence upon managed bandwidth, which makes realtime performance possible. The need to manage many different types of data streams between components makes this dependence more complex. Currently, realtime CORBA implementations are being investigated in the research community [488], but the standard implementations of CORBA, DCOM (ActiveX), and Java communication mechanisms would be insufficient for the teleimmersion application. The object architecture must provide a mechanism that allows performance constraints and quality-of-service mechanisms to be associated with the logical data paths between component ports.

In addition, the application needs support for multicast communication in the object model. While it is likely that we will see extensions of Java RMI (Remote Method Invocation) to multicast, it is not part of the CORBA or ActiveX model. (More details about multicast communication and CORBA are found in Chapter 18.)

9.2.3 Example: The Digital Sky Project

The Digital Sky project (detailed in Chapter 5) illustrates a different set of distributed-object needs. Multiple databases contain billions of metadata objects, each of which describes a visible object such as a star or galaxy. Each metadata object is linked to a digital image in the archival system. The collection of metadata objects is organized as a relational database. In addition to a reference to the appropriate archived image object, each metadata object contains basic reference information and a list of data extraction methods that can be applied to the image object.

Suppose, then, that a scientist at some remote location decides to search for all galaxies that exhibit a particular set of properties. Some of these properties may relate to information stored in the metadata, but some may require an analysis of stored images. The request is formulated as a database query and sent to the database, where a set of objects that satisfy conditions associated with the metadata can be extracted. Then, for each of these objects, a request to apply the remaining tests to the images is sent to the data archive. This is a data-parallel operation and results in a set of references to the subset of galaxies that satisfy all conditions. The result of this query might then be used as part of a second query submitted to another remote repository. This may involve the transmission of a large stream of data from the first repository host to the second.

FIGURE 9.5: The Digital Sky project couples a distributed collection of database and archive components (an object RDB and an image archive): (1) apply the initial relational select operation and send a message to the archive to complete the request; (2) apply a data-parallel image analysis operation; (3) send selected references and image objects to a second survey for more analysis. Heavy lines between components indicate communication channels that require high bandwidth.

As shown in Figure 9.5, the components of the solution process are the relational databases and the repositories. To set up a complex analysis of the data, two or more components may need to be connected by a high-bandwidth link. The information that is communicated between components consists of image objects, object references, metadata information, and relational database queries.
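The query flow can be sketched as follows; the metadata records, the image test, and the thread pool standing in for the archive's data-parallel operation are all invented for illustration:

```python
# Digital Sky flavor: relational select over metadata, then a data-parallel
# image test at the archive, then a join of the survivors.

from concurrent.futures import ThreadPoolExecutor

metadata = [  # each record references an archived image object
    {"id": 1, "type": "galaxy", "mag": 14.2, "image": "img-0001"},
    {"id": 2, "type": "star",   "mag":  9.1, "image": "img-0002"},
    {"id": 3, "type": "galaxy", "mag": 15.8, "image": "img-0003"},
]

def image_test(image_ref):
    """Stand-in for a data extraction method executed next to the archive."""
    return image_ref.endswith("3")  # pretend: a morphology test on the pixels

# 1. relational select on the metadata
candidates = [m for m in metadata if m["type"] == "galaxy" and m["mag"] < 16]

# 2. data-parallel analysis applied over the selected images
with ThreadPoolExecutor() as pool:
    passed = list(pool.map(image_test, (m["image"] for m in candidates)))

# 3. references satisfying all conditions, ready for a second survey
print([m for m, ok in zip(candidates, passed) if ok])
```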

Requirements Imposed on the Component System

While many of the problems associated with this application can be found in the preceding two examples, there are also some unique features in this particular problem.

The first is the extensive use of database technology. While there is a Java database interface standard, it may not scale to the problems described here. In particular, the interaction between the object-relational database and the image archive requires the implementation of the data-parallel remote method invocation described above.

The second feature is related to the communications that must occur between components with parallel implementations. Existing commercial technologies require that a single logical channel be implemented as a single network stream connection. However, if both components have parallel implementations, it may be possible to implement the communication as a set of parallel communication streams. Pardis, a parallel implementation and extension of CORBA [307], is an example of a system that supports this feature. Pardis demonstrates that it is possible to significantly improve the utilization of network bandwidth by arranging parallel streams that can implement remote method calls.

We will return to this example again at the end of this chapter.

9.3 GRID COMPONENT FRAMEWORKS

To build a component-based, high-performance application, certain additional problems must be considered. First, objects in the framework need to know about each other in order to be able to transmit the data and member function messages as indicated. For example, a CAD database may be located in one city and a flow simulation may be running on a parallel processing system in another city, while the coupled visualization system may be an immersive environment such as a CAVE at yet another location. To complicate matters further, some objects, such as the design database, may be persistent, while other objects, such as the grid generation filter, may exist only for the duration of the computation.

One solution to this configuration problem would be a visual programming system that allows the user to draw the application component graph. NAG Explorer uses this technique, as do the LSA example described above and Java Studio. Unfortunately, NAG's type system is not very rich, and it is unclear whether graphical composition tools will scale to networks of more than a few dozen objects. We would also like to be able to describe networks that are dynamic and can incorporate new component resources as they are discovered. Systems such as Explorer and the current version of SCIRun use a fixed-type system, but most distributed-object systems allow arbitrary user-defined types to be transmitted over the channels between components.

9.3.1 Serialization Problem

A system must also know how to transmit application-specific objects over the network. This is called the serialization problem, and its solution requires a protocol for packing and unpacking data structure components so that they may be reliably transmitted between different computer architectures in a heterogeneous environment. The traditional solution is to use an Interface Definition Language (IDL) to describe the types of the objects being transmitted. IDL is a simple C++-like language for describing structures and interfaces and was first used in the DCE infrastructure [352]. The DCE IDL was adopted and extended for use in Microsoft DCOM and CORBA. The CORBA extension is the most complete, and it is used as the foundation of the specification of the entire CORBA system. Java RMI, on the other hand, is a strictly Java-to-Java communication model, so Java serves as its own IDL. However, there is now a Java-to-HPC++ link that uses a combination of IDL and Java RMI, and JavaSoft has agreed to reimplement RMI so that it runs over the CORBA communication protocol known as IIOP. (The use of CORBA, Java, and DCOM in a commodity-based grid architecture is described in much greater detail in Chapter 10. Chapter 18 provides a good introduction to IIOP as well as other relevant protocols.)
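What the IDL-generated stubs do under the hood can be sketched with explicit network-byte-order packing. The Sample structure is an invented example, not an interface from the text:

```python
import struct

# An IDL compiler emits marshaling code equivalent to this for a declaration
# such as:  struct Sample { long id; double x; double y; };
# A fixed byte order ("!" = network order) makes the wire format independent
# of the sending and receiving architectures, which is the point.

SAMPLE_FMT = "!idd"  # long id; double x; double y

def marshal(sample):
    return struct.pack(SAMPLE_FMT, sample["id"], sample["x"], sample["y"])

def unmarshal(buf):
    ident, x, y = struct.unpack(SAMPLE_FMT, buf)
    return {"id": ident, "x": x, "y": y}

wire = marshal({"id": 7, "x": 1.5, "y": -2.0})
assert unmarshal(wire) == {"id": 7, "x": 1.5, "y": -2.0}
```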

9.3.2 Performance Issues

A persistent problem with existing commercial technologies is the poor performance of serialization and communication. Java RMI provides the most sophisticated serialization model, but its performance is several orders of magnitude below the requirements of the grid applications described above.

The Agile Objects project, at the University of Illinois, is exploring techniques for high-performance implementation of component object standard interfaces and protocols that focus on lowering the cost of crossing component boundaries (lower invocation overhead) and reducing the latency of an RPC (lower invocation latency). In particular, these efforts are focusing on Java RMI and DCOM invocation mechanisms and are building on technologies from the Illinois Concert run time, which executes RPC and message calls within a cluster of workstations in 10–20 µs. In addition, the Gigabit CORBA project [236, 490, 489] and the Indiana Java RMI-Nexus project [80] are addressing the same problem in the case of heterogeneous environments.

9.3.3 Additional Issues

Most good object-oriented systems also include mechanisms for the following additional problems.

Naming

Any application-level programming framework must provide a mechanism that uniquely identifies the objects being integrated into a distributed computation. Naming mechanisms must be incorporated into the design of the system at a fundamental level.

Persistence and Storage Management

An object may need to be "frozen" so that its state is preserved on some storage device and "thawed" when it is needed. A system with this ability is said to support persistence, which is closely related to serialization as described above.

Object Sharing

If each object instance belonged to only one application, this would be problem enough, but when objects are used in multiple applications concurrently—a design database may be used simultaneously by several applications, for example—additional problems arise. To overcome these problems, the programmer can associate a session identifier with each circuit of objects, so that when an object receives a message, there is an accompanying session identifier. The identifier identifies which objects need to receive any outgoing messages associated with that transaction. An important related concept is that of collaboration: distributed applications, such as the teleimmersive design example, are often based on the ability of multiple users to share views of and access to an object. The Infospheres system [114] is an excellent example of a component architecture that treats collaboration as a central design objective.

Process and Thread Management

Most instances of distributed objects are encapsulated within their own processes, but some situations require that multiple objects belong to the same process or that an object respond concurrently to different requests for the same method invocation. This capability requires that the object system be integrated with a thread system. An important associated concern is that the thread model used to implement the communication and events for the component must be consistent with a thread model that might be used in the computation kernel. For example, an application that uses Fortran OpenMP may generate threads with one runtime system, but the component architecture may use another. These thread systems often have difficulty coexisting in the same process.

Object Distribution and Object Migration

An object implementation may itself be distributed. This is especially relevant to parallel programming, but it can occur in other situations, as when one part of a particular interface needs to be implemented on one system and another part on another system. An object may also need to migrate from one host to another, as when a host's compute resources are too limited and an object could be more efficiently implemented on a second, more powerful host.

Network Adaptability

As the examples described above illustrate, it is essential that the object middleware layer be able to adapt to dynamic network loads and fluctuating alternative pathway availability.

Dynamic Invocation

As described thus far, the interfaces to distributed objects must be known at compilation time, as must the IDL description used to generate the proxies/stubs and interface skeleton for the remote objects. However, a component system may be required to provide a mechanism for an application to discover these interfaces to an object at run time. This will allow the application to take advantage of special properties of the component without having to recompile the application.

Reflection

Both object migration and network adaptability are examples of object behavior that depends upon an object's implementation and its runtime system. The ability to obtain information about an object at run time, such as its class or the interfaces it implements, is called reflection. Reflection also refers to an object's capability to infer properties about its implementation and the state of the environment in which it is executing. Reflection can be used to implement dynamic invocation, for example. While reflection can be implemented in any system, Java is the only conventional language that supports reflection directly.

A closely related concept is that of a metaobject [311], which can be thought of as a runtime object that is bound to each application-level object. In some systems, metaobjects are used to implement method invocations, so that the choice of network protocol for executing a particular method invocation can be controlled by the associated metaobject. This strategy allows the object making the method call to be written without concern for how the call is implemented, since that is the metaobject's job. It also allows greater variety in implementing some of the features listed in this section. For example, an alternative to object migration is to endow a system with a pseudomigration capability, which works as follows. The metaobject associated with an object caches each of that object's requests for a member function call. If the metaobject can detect that the current compute host is too busy, it can create an instance of the controlled object on another host and forward the call to the new instance.
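A toy metaobject, sketched below with an invented load probe and host name, shows how reflection (here Python's getattr) lets a runtime object mediate every invocation and even perform the pseudomigration just described:

```python
# Every member-function invocation on the application object passes through
# the metaobject, which can log it, pick a protocol, or re-instantiate the
# target elsewhere and forward the call (pseudomigration).

class MetaObject:
    def __init__(self, target_factory, host="local"):
        self._factory = target_factory
        self._target = target_factory()
        self._host = host

    def _host_overloaded(self):
        return False  # stand-in for a real load probe

    def __getattr__(self, name):
        if self._host_overloaded():           # pseudomigration on overload
            self._host = "backup.example.edu"
            self._target = self._factory()    # re-create the instance "there"
        method = getattr(self._target, name)  # reflection on the target
        def invoke(*args, **kwargs):
            print(f"[{self._host}] call {name}{args}")
            return method(*args, **kwargs)
        return invoke

class Counter:
    def __init__(self): self.n = 0
    def bump(self, k): self.n += k; return self.n

proxy = MetaObject(Counter)
print(proxy.bump(3))  # the call is mediated, and logged, by the metaobject
```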

Event Logging

Debugging distributed systems is difficult and requires a mechanism that can log the timing and explanation of events associated with a given set of distributed interactions.

Fault Tolerance

An exception-handling mechanism is the first step toward building reliable systems, but it falls far short of providing a mechanism that reliably tolerates failure. The system must be able to restart applications automatically and to roll back transactions to a previous known state.

Authentication and Security

Authentication allows us to identify which applications and users are allowed to access system components. Security ensures that interactions can be accomplished safely for the data as well as the implementations. It is an issue that goes far beyond the domain of the object system, but the object system must provide a way to allow the user access to available authentication and security tools.

Beyond Client-Server

For high-performance computation, future distributed-object systems must support a greater variety of models than simple client-server schemes. As illustrated in Section 9.2, there are paradigms that include peer-to-peer object networks, and we can imagine future massive networks of components and software agents that work without centralized control and dynamically respond to changing loads and requirements.

Support for Parallelism

Beyond multithreaded applications are those involving the concurrent activity of many components. An object system must allow both asynchronous and synchronous method calls, as well as multicast communication and collective synchronization, both of which are essential for supporting parallel operation on large numbers of concurrently executing objects.

9.4 THE LEGION GRID ARCHITECTURE

In the preceding sections we have outlined many of the technical problems that are associated with extending contemporary component and distributed-object technology to support grid applications. In the remainder of this chapter, we address the problem of delivering this type of programming infrastructure to the application builder. More specifically, the component architecture that the programmer uses is a high-level programming model, which must provide easy-to-use abstractions for complex grid services.

The task of implementing application-level programming abstractions in terms of basic grid functionality is a major challenge. There are three ways to address this problem. One approach, explored in Chapter 10, is to extend existing commodity technology. A second approach is to layer an application-level component architecture on top of a grid architecture such as the Globus toolkit, described in Chapter 11. The current versions of HPC++ and CC++, for example, use Nexus, the Globus communication layer, to support object-oriented RMI, and Java RMI has been ported to run over Nexus [80]. An effort to extend this object layer to a Globus-compatible component model is under way. The third approach is to provide a single, coherent virtual machine that addresses key grid issues such as scalability, programming ease, fault tolerance, security, and site autonomy completely within a reflective, object-based metasystem. The University of Virginia's Legion system is the best example of this type of grid architecture.

Legion is designed to support millions of hosts and trillions of objects existing in a loose confederation and tied together with high-speed links. The user can sit at a terminal and manipulate objects on several processors, but has the illusion of working on a single powerful computer. The objects the user manipulates can represent data resources, such as digital libraries and video streams; applications, such as teleconferencing and physical simulations; and physical devices, such as cameras, telescopes, and linear accelerators. Naturally, the objects being manipulated may be shared with other users. It is Legion's responsibility to support the abstractions presented to the user; to transparently schedule application components on processors; to manage data migration, caching, transfer, and coercion; to detect and manage faults; and to ensure that the user's data and physical resources are adequately protected.

9.4.1 Legion Design Objectives

The Legion design is based on the following 10 central objectives.

1. Site autonomy: Legion will not be a monolithic system. It will be composed of resources owned and controlled by an array of organizations. These organizations, quite properly, will insist on having control over their own resources—for example, specifying how much of a resource can be used, who can use it, and when it can be used.

2. Extensible core: It is not possible to know or predict many current and future needs of all users. Legion's mechanism and policy must be realized via extensible and replaceable components that permit Legion to evolve over time and allow users to construct their own mechanisms and policies to meet their specific needs.

3. Scalable architecture: Because Legion will consist of millions of hosts, it must have a scalable architecture rather than a centralized structure. This means that the system must be totally distributed.

4. Easy-to-use, seamless computational environment: Legion must mask the complexity of the hardware environment and of the communication and synchronization involved in parallel processing. Machine boundaries, for example, should be invisible to users, and compilers acting in concert with runtime facilities must manage the environment as much as possible.

5. High performance via parallelism: Legion must support easy-to-use parallel processing with large degrees of parallelism. This requirement includes task and data parallelism and their arbitrary combinations.

6. Single persistent name space: One of the most significant obstacles to wide area parallel processing is the lack of a single name space for file and data access. The existing multitude of disjoint name spaces makes writing applications that span sites extremely difficult.

7. Security for users and resource owners: Because Legion does not replace existing operating systems, we cannot significantly strengthen existing operating system protection and security mechanisms. In order to ensure that existing mechanisms are not weakened by Legion, it must provide mechanisms that allow users to manage their own security needs. Legion should not define the user's security policy or require a "trusted" Legion.

8. Management and exploitation of resource heterogeneity: Legion must support interoperability between heterogeneous hardware and software components, as well as take advantage of the fact that some architectures are better than others at executing particular applications (e.g., vectorizable codes).

9. Multiple language support and interoperability: Legion applications will be written in a variety of languages. It must be possible to integrate heterogeneous source-language application components in much the same manner that heterogeneous architectures are integrated. Interoperability requires that Legion support legacy codes.

10. Fault tolerance: In a system as large as Legion, it is certain that at any given instant several hosts, communication links, and disks will fail. Dealing with these failures and with the resulting dynamic reconfiguration is a necessity for both Legion and its applications.

In addition, the Legion design is shaped by the following three constraints. First, Legion cannot replace host operating systems. Organizations will not permit their machines to be used if their operating systems must be replaced. Operating system replacement would require them to rewrite many of their applications, retrain many of their users, and possibly make their machines incompatible with other systems in their organization.

Second, Legion cannot legislate changes to the interconnection network, but must assume that the network resources and the protocols in use are outside any one group's control and should be accepted as an ungovernable element in large-scale parallel processing.

And, finally, Legion cannot insist that it be run as "root" (or the equivalent). Indeed, quite the contrary: most Legion users will want it to run with the fewest possible privileges in order to protect themselves.

9.4.2 Legion System Architecture

Legion is a reflective object-based system that endows classes and metaclasses (classes whose instances are themselves classes) with system-level responsibility. Legion users will require a wide range of services on various levels, including security, performance, and functionality. No single policy or set of policies will satisfy every user; hence, whenever possible, users must be able to decide which trade-offs are necessary and desirable. Several characteristics of Legion's architecture reflect and support this philosophy.

Everything Is an Object

The Legion system consists of a variety of hardware and software resources, each of which is represented by a Legion object (defined as an active process that responds to member function invocations from other objects in the system). Legion describes the message format and high-level protocol for object interaction, but not the programming language or the communications protocol.

Classes Manage Their Instances

Every Legion object is defined and managed by its class object, which is itself an active Legion object. Class objects are given system-level responsibility: classes create new instances, schedule them for execution, activate and deactivate them, and provide information about their current location to client objects that wish to communicate with them. In this sense, classes act as managers and make policy, as well as define instances. Classes whose instances are themselves classes are called metaclasses.

Users Can Provide Their Own Classes

Legion allows users to define and build their own class objects, which permits programmers to determine and even change the system-level mechanisms that support their objects. Legion 1.0 (and future Legion systems) contains default implementations of several useful types of classes and metaclasses. Users are not forced to use these implementations, however, particularly if the implementations do not meet the users' performance, security, or functionality requirements.


Core Objects Implement Common Services

Legion defines the interface and basic functionality of a set of core object types, which support basic system services such as naming and binding, and object creation, activation, deactivation, and deletion. Core Legion objects provide the mechanisms that classes use to implement policies appropriate for their instances. Examples of core objects include hosts, vaults, contexts, binding agents, and implementations.

9.4.3 The Legion Object Model

Legion objects are independent and logically address-space-disjoint active objects that communicate with one another via nonblocking method calls, which may be accepted in any order by the called object. Each method has a signature that describes the parameters and return value, if any, of the method. The complete set of method signatures for an object fully describes its interface (which is determined by its class). Legion class interfaces can be described in an IDL, several of which will be supported by Legion.

Naming System

Legion implements a three-level naming system. At the highest level, users refer to objects using human-readable strings, called context names. Context objects map context names to LOIDs (Legion object identifiers), which are location-independent identifiers. Each identifier includes an RSA public key. Since LOIDs are location independent, they are insufficient for communication by themselves. A LOID is therefore mapped to an LOA (Legion object address) for communication. An LOA is a physical address (or set of addresses, in the case of a replicated object) that contains sufficient information to allow other objects to find and communicate with the object, for example, an (IP address, port number) pair.
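
To make the mapping chain concrete, the following minimal Java sketch models the three naming levels; all class and method names (ContextObject, BindingAgent, resolve, and so on) are illustrative inventions for this discussion, not part of the Legion interface.

```java
import java.util.HashMap;
import java.util.Map;

// A minimal sketch of Legion's three-level naming scheme:
// context name (string) -> LOID (location independent) -> LOA (address).
public class NamingSketch {
    // Hypothetical LOID: location-independent identifier carrying a public key.
    record Loid(String id, byte[] rsaPublicKey) {}
    // Hypothetical LOA: a physical (IP address, port number) pair.
    record Loa(String ipAddress, int port) {}

    static class ContextObject {             // maps human-readable names to LOIDs
        private final Map<String, Loid> names = new HashMap<>();
        void bind(String contextName, Loid loid) { names.put(contextName, loid); }
        Loid lookup(String contextName)          { return names.get(contextName); }
    }

    static class BindingAgent {               // maps LOIDs to LOAs
        private final Map<String, Loa> bindings = new HashMap<>();
        void record(Loid loid, Loa loa) { bindings.put(loid.id(), loa); }
        Loa resolve(Loid loid)          { return bindings.get(loid.id()); }
    }

    public static void main(String[] args) {
        ContextObject ctx = new ContextObject();
        BindingAgent agent = new BindingAgent();
        Loid loid = new Loid("loid-42", new byte[0]);
        ctx.bind("/home/alice/mySimulation", loid);
        agent.record(loid, new Loa("192.0.2.7", 5432));

        // Full resolution: context name -> LOID -> LOA.
        Loa addr = agent.resolve(ctx.lookup("/home/alice/mySimulation"));
        System.out.println("Object reachable at " + addr);
    }
}
```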

Object States

A Legion object can be in one of two different states, active or inert. As designed, Legion will contain too many objects for all to be represented simultaneously as active processes and therefore requires a strategy for maintaining and managing representations of these objects in their inert state in persistent storage. An inert object is represented by an object-persistent representation (OPR), which is a set of associated bytes residing in stable storage somewhere in the Legion system. The OPR contains information about an object’s state that enables the object to move to an active state. An active object runs as a process that is ready to accept member function invocations; an active object’s state is typically maintained in the address space of the process, although this is not strictly necessary.
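
The activation cycle can be pictured with a small sketch; the names below, and the use of an in-memory map as a stand-in for a vault, are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the active/inert life cycle: an inert object exists only as an
// OPR (bytes in stable storage); activation turns the OPR back into a running
// representation, and deactivation persists it again.
public class ObjectStateSketch {
    enum State { ACTIVE, INERT }

    static class LegionObject {
        State state = State.INERT;
        byte[] opr;                        // object-persistent representation

        void activate(Map<String, byte[]> vault, String key) {
            opr = vault.get(key);          // fetch saved state from the vault
            state = State.ACTIVE;          // object now runs as a process
        }
        void deactivate(Map<String, byte[]> vault, String key) {
            vault.put(key, opr);           // persist state as an OPR
            state = State.INERT;
        }
    }

    public static void main(String[] args) {
        Map<String, byte[]> vault = new HashMap<>();  // stand-in for a vault object
        vault.put("loid-42", "serialized state".getBytes());
        LegionObject obj = new LegionObject();
        obj.activate(vault, "loid-42");
        System.out.println("After activation: " + obj.state);
        obj.deactivate(vault, "loid-42");
        System.out.println("After deactivation: " + obj.state);
    }
}
```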

Core Objects

Several core object types implement the basic system-level mechanisms required by all Legion objects. Like classes and metaclasses, core objects are replaceable system components; users (and in some cases resource controllers) can select or implement appropriate core objects.

Binding agents are Legion objects that map LOIDs to LOAs. A (LOID, LOA) pair is called a binding. Binding agents can cache bindings and organize themselves in hierarchies and software combining trees, in order to implement the binding mechanism in a scalable and efficient manner.

Context objects map context names to LOIDs, allowing users to name objects with arbitrary high-level string names, and enabling multiple disjoint name spaces to exist within Legion. All objects have a current context and a root context, which define parts of the name space in which context names are evaluated.

Host objects represent processors in Legion. One or more host objects run on each computing resource that is included in Legion. Host objects create and manage processes for active Legion objects. Classes invoke member functions on host objects in order to activate instances on the computing resources that the hosts represent. Representing computing resources with Legion objects abstracts the heterogeneity that results from different operating systems having different mechanisms for creating processes. Further, it provides resource owners with the ability to manage and control their resources as they see fit.

Just as the host object represents computing resources and maintains active Legion objects, the vault object represents persistent storage, but only for the purpose of maintaining the state, in OPRs, of the inert Legion objects supported by the vault.

Implementation objects allow other Legion objects to run as processes in the system. An implementation object typically contains machine code that is executed when a request to create or activate an object is made. More specifically, an implementation object is generally maintained as an executable file that a host object can execute when it receives a request to activate or create an object. An implementation object (or the name of an implementation object) is transferred from a class object to a host object to enable the host to create processes with the appropriate characteristics.

Legion specifies functionality and interfaces, not implementations. Legion 1.0 provides useful default implementations of class objects and of all the core system objects, but users are never required to use the defaults. In particular, users can select (or build their own) class objects, which are empowered by the object model to select or implement system-level services. This feature of the system enables object services (e.g., creation, scheduling, security) to be made appropriate for the object types on which they operate, and eliminates Legion’s dependence on a single implementation for its success.

9.5 A CLOSER LOOK AT LEGION

Space limitations do not permit a detailed discussion of how Legion realizes its objectives. Thus, rather than attempt to compress a large and complex system into a few pages, we will briefly expand on three aspects of Legion that are of interest to the high-performance computing community: security, high performance, and scheduling and resource management.

9.5.1 Security

Legion offers the opportunity of bringing the power and resources of millions of interlinked computers to the desktop. While that possibility is highly attractive, users will adopt Legion only if they feel confident that it will not compromise the privacy and integrity of their resources. Without security, Legion systems can offer only limited uses; if the full Legion vision of a worldwide metacomputer is to become a reality, reliable and flexible security is essential.

Security Problems

Security has been a fundamental part of the Legion design from the beginning. Early work identified two main problems: users must be able to install Legion on their sites without significant risk, and they must be able to protect and control their Legion resources as they see fit.

The solution to the first problem is reflected in the broad design goals for Legion. Specifically, Legion does not require any special privileges from the host systems that run it. Administrators have the option of taking a very conservative approach while installing the system. Furthermore, Legion is defined as an architecture, not an implementation, allowing individual sites to reimplement functionality as necessary to reflect their particular security constraints.


The second problem, protecting Legion resources, requires multiple solutions. In an environment where users may range from students to banks to defense laboratories, it is impossible for Legion to dictate a single security policy that can hope to satisfy everyone. Therefore, Legion uses a flexible framework that adapts to many different needs. Individual users can choose how much they are willing to pay in time and convenience for the level of security they want. They can also customize their Legion system’s security policies to match their organization’s existing policies.

Placing policy in the hands of users is much more than just an attractive design feature. A decentralized system does not use security architectures based on control and mediation by “the system.” Nor is there a single owner who sets and enforces global policies. In such an environment, users must ultimately take responsibility for security policies. Legion is designed to facilitate that goal.

The Security Model

The basic unit in Legion is the object, and the Legion security model is therefore oriented toward protecting both objects and object communication. Objects are accessed and manipulated via method calls; an object’s rights are centered in its capabilities to make those calls. A file object may support methods for read, write, seek, and so forth, so that the read right for a file object might permit read and seek, but not write. The user determines the security policy for an object by defining the object’s rights and the method calls they allow. Once this step is done, Legion provides the basic mechanism for enforcing that policy.

Every object in Legion supports a special member function called “MayI” (objects with no security have a NULL MayI). MayI is Legion’s traffic cop: all method calls to an object must first pass through MayI before the target member function is invoked. If the caller has the appropriate rights for the target method, MayI allows that method invocation to proceed.

To make rights available to a potential caller, the owner of an object gives the caller a certificate listing the rights granted. This certificate cannot be forged. When the caller invokes a method on the object, it presents the appropriate certificate to MayI, which then checks the scope and authenticity of the certificate. Alternatively, the owner of an object can permanently assign a set of rights to a particular caller or group. In that case, MayI’s responsibility is to confirm the caller’s identity and membership in one of the allowed groups and then to compare the rights authorized with the rights required for the method call.
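
The MayI pattern can be sketched in a few lines of Java. The rights table and certificate check below are simplified stand-ins (a real Legion certificate is cryptographically protected), and all of the names are hypothetical.

```java
import java.util.Map;
import java.util.Set;

// Sketch of the MayI gate: every method call is checked against the rights
// named in a caller-presented certificate before it is dispatched.
public class MayISketch {
    record Certificate(String caller, Set<String> rightsGranted) {}

    static class FileObject {
        // Hypothetical policy: which right each method requires. Here the
        // read right permits read and seek, but not write.
        private final Map<String, String> required =
            Map.of("read", "readRight", "seek", "readRight", "write", "writeRight");

        boolean mayI(String method, Certificate cert) {
            String needed = required.get(method);
            return needed != null && cert.rightsGranted().contains(needed);
        }

        String invoke(String method, Certificate cert) {
            if (!mayI(method, cert))           // MayI vets every call first
                return method + ": denied";
            return method + ": ok";            // dispatch to the real member function
        }
    }

    public static void main(String[] args) {
        FileObject file = new FileObject();
        Certificate readers = new Certificate("alice", Set.of("readRight"));
        System.out.println(file.invoke("read", readers));   // read: ok
        System.out.println(file.invoke("seek", readers));   // seek: ok
        System.out.println(file.invoke("write", readers));  // write: denied
    }
}
```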


Besides regulating user access control, Legion also protects underlying communications between objects. Every Legion object has a public-key pair; the public key is part of the object’s name (its LOID). Objects can use the public key of a target object to encrypt their communications to it. Likewise, an object’s private key can be used to sign messages, thereby providing authentication and nonrepudiation. This integration of public keys and object names eliminates the need for a certification authority. If an intruder tries to tamper with the public key of a known object, the intruder will create a new and unknown name.
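
These mechanisms map directly onto standard public-key primitives. The sketch below uses the JDK’s java.security API to show the sign-and-verify round trip described above; the idea that the verifying key travels inside the object’s LOID is Legion’s, while the surrounding code is purely illustrative.

```java
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.Signature;

// Signing with an object's private key and verifying with the public key
// carried in its LOID, using standard JDK primitives.
public class LoidSignatureSketch {
    public static void main(String[] args) throws Exception {
        KeyPairGenerator gen = KeyPairGenerator.getInstance("RSA");
        gen.initialize(2048);
        KeyPair pair = gen.generateKeyPair();   // key pair created with the object

        byte[] message = "method invocation payload".getBytes();

        Signature signer = Signature.getInstance("SHA256withRSA");
        signer.initSign(pair.getPrivate());     // sender signs with its private key
        signer.update(message);
        byte[] sig = signer.sign();

        Signature verifier = Signature.getInstance("SHA256withRSA");
        verifier.initVerify(pair.getPublic());  // receiver uses the key from the LOID
        verifier.update(message);
        System.out.println("Authentic: " + verifier.verify(sig));
    }
}
```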

The combined components of the security model encourage the creation of a large-scale Legion system with multiple overlapping trust domains. Each domain can be separately defined and controlled by the users that it affects. When difficult problems arise, such as merging two trust domains, Legion provides a common and flexible context in which they can be resolved.

9.5.2 High Performance

Legion achieves high-performance computing in two ways: by selecting processing resources based on load and job affinity, and by parallel processing.

Even single-task jobs can have better performance when presented with a range of possible execution sites. The user can choose the host with the lowest load or the greatest power. Power, in this context, might be defined by performance on the SPEC benchmarks adjusted for load, or by using the application itself as a benchmark. Similarly, different components of a coarse-grained meta-application may be scheduled on different hosts (based on the component’s affinity to that type of host), leading to a phenomenon known as superconcurrency [214]. In either scenario, Legion’s flexible resource management scheme lets user-level scheduling agents choose the right resource.

Alternatively, Legion can be used for traditional parallel processing, as when executing a single application across geographically separate hosts, or supporting meta-applications (e.g., scheduling the components of a single meta-application on the nodes of an MPP). Legion supports a distributed-memory parallel computing model in four ways: supporting parallel libraries, supporting parallel languages, wrapping parallel components, and exporting the runtime library interface.

Supporting Parallel Libraries

The vast majority of parallel applications written today use MPI [250] or PVM [227]. Legion supports both MPI’s and PVM’s libraries via emulation libraries, which use the underlying Legion runtime library. Existing applications need only to be recompiled and relinked in order to run on Legion.

Supporting Parallel Languages

Legion supports MPL (Mentat Programming Language, described in [569]), BFS (Basic Fortran Support), and Java. MPL is a parallel C++ language in which the user specifies those classes that are computationally complex enough to warrant parallel execution. Class instances are then used like C++ class instances: the compiler and runtime system take over, construct parallel computation graphs of the program, and then execute the methods in parallel on different processors. Legion is written in MPL. BFS is a set of pseudocomments for Fortran and a preprocessor that gives the Fortran programmer access to Legion objects. It also allows parallel execution via remote asynchronous procedure calls, as well as the construction of program graphs. The Java interface allows Java programs to access Legion objects and to execute member functions asynchronously.

Wrapping Parallel Components

Object wrapping is a time-honored tradition in the object-oriented world, but Legion extends the notion of encapsulating existing legacy codes into objects one step further by encapsulating a parallel component into an object. To other Legion objects, the encapsulated object appears sequential, but executes faster. Thus, one could encapsulate a PVM, HPF, or shared-memory threaded application in a Legion object.

Exporting the Runtime Library Interface

The Legion team cannot provide the full range of languages and tools that users need. The designers of Legion, rather than developing everything at the University of Virginia, intended the system to be an open community artifact to which other languages and tools are ported. To support third-party software development, the complete runtime library interface is available and may be directly manipulated by user libraries. The Legion library is completely reconfigurable: it supports basic communication, encryption/decryption, authentication, exception detection and propagation, and other features. One feature of particular interest is program graph support.

Program graphs (Figure 9.6) represent functions and are first class and recursive. Graph nodes are member function invocations on Legion objects or subgraphs. Arcs model data dependencies. Graphs are constructed by starting with an empty graph and adding nodes and arcs. Graphs may be combined, resulting in a form of function composition. Finally, graphs may be annotated with arbitrary information, such as resource requirements and architecture affinities. The annotations may be used by schedulers, fault tolerance protocols, and other user-defined services.

FIGURE 9.6: A Legion program graph: functions of three arguments and two outputs.
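
A program graph can be captured with a small data structure. The sketch below is a minimal Java rendering of the nodes–arcs–annotations model; the class and method names are our own, not Legion’s.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a Legion-style program graph: nodes are method invocations (or
// subgraphs), arcs are data dependencies, and arbitrary annotations can be
// attached for schedulers and other user-defined services.
public class ProgramGraphSketch {
    static class Node {
        final String invocation;                      // e.g. "solver.step()"
        final List<Node> dependsOn = new ArrayList<>();
        final Map<String, String> annotations = new HashMap<>();
        Node(String invocation) { this.invocation = invocation; }
    }

    static class Graph {
        final List<Node> nodes = new ArrayList<>();
        Node add(String invocation) { Node n = new Node(invocation); nodes.add(n); return n; }
        void arc(Node from, Node to) { to.dependsOn.add(from); }    // data dependency
        void compose(Graph other)    { nodes.addAll(other.nodes); } // combine graphs
    }

    public static void main(String[] args) {
        Graph g = new Graph();                        // start with an empty graph
        Node fluid = g.add("fluid.compute()");
        Node struct = g.add("structures.compute()");
        Node merge = g.add("coupler.merge()");
        g.arc(fluid, merge);
        g.arc(struct, merge);
        merge.annotations.put("architecture", "C90"); // annotation for a scheduler
        System.out.println(g.nodes.size() + " nodes; merge waits on "
                + merge.dependsOn.size() + " inputs");
    }
}
```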

9.5.3 Scheduling and Resource Management

The Legion scheduling philosophy is one of reservation through a negotiation process between resource providers and resource consumers. Autonomy is considered to be the single most crucial aspect of this process, for two reasons.

First, site autonomy is crucial in attracting resource providers. In particular, participating sites must be assured that their local policies will be respected by the system at large. Therefore, final authority over the use of a resource is placed with the resource itself.

Second, user autonomy is crucial to achieving maximum performance. A single scheduling policy will not be the best answer for all problems and programs: users should be able to choose between scheduling policies, selecting the one that best fits the problem at hand or, if necessary, providing their own schedulers. A special, and vitally important, example of user-provided schedulers is that of application-level scheduling. This allows users to provide per-application schedulers that are specially tailored to match the needs of the application. Application-level schedulers will be commonplace in high-performance computing domains.

FIGURE 9.7: The Legion scheduling model.

Legion currently provides two types of resources: computational resources (hosts) and storage resources (vaults). Network resources will be incorporated in the future. As seen in Figure 9.7, the Legion scheduling module consists of three major components: a resource state information database, a module that computes the mapping of requests to resources (hosts and vaults), and an activation agent responsible for implementing the computed schedule. These items are called the Collection, Scheduler, and Enactor, respectively.

The Collection interacts with resource objects to collect information describing the system’s state (Figure 9.7, step 1). The Scheduler queries the Collection to determine a set of available resources that match the Scheduler’s requirements (step 2). After computing a schedule, or set of desired schedules, the Scheduler passes a list of schedules to the Enactor for implementation (step 3). The Enactor then makes reservations with the individual resources (step 4) and reports the results to the Scheduler (step 5). Upon approval by the Scheduler, the Enactor places objects on the hosts and monitors their status (step 6).
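
The numbered negotiation steps can be summarized in a short sketch. The class names below follow the text, but the method names, the load-based reservation rule, and the in-memory resource list are illustrative assumptions.

```java
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the Collection/Scheduler/Enactor negotiation, following the six
// numbered steps of Figure 9.7.
public class SchedulingSketch {
    record Host(String name, double load) {}

    static class Collection {                         // step 1: resource state database
        List<Host> query(double maxLoad) {            // step 2: Scheduler queries it
            return List.of(new Host("hostA", 0.2), new Host("hostB", 0.9))
                    .stream().filter(h -> h.load() <= maxLoad)
                    .collect(Collectors.toList());
        }
    }

    static class Enactor {
        boolean reserve(Host h) { return h.load() < 0.5; } // step 4: ask the resource itself
        void place(Host h) { System.out.println("object placed on " + h.name()); } // step 6
    }

    public static void main(String[] args) {
        Collection collection = new Collection();
        Enactor enactor = new Enactor();
        for (Host candidate : collection.query(0.95)) {     // step 3: schedules passed on
            if (enactor.reserve(candidate)) {               // steps 4-5: reserve and report
                enactor.place(candidate);                   // step 6: place and monitor
                break;                                      // final say stays with the resource
            }
        }
    }
}
```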

If the user does not wish to select or provide an external scheduler, the Legion system (via the class mechanism) provides default scheduling behavior that supplies general-purpose support. Through the use of class defaults, sample schedulers, and application-level schedulers, the user can balance the effort put into scheduling against the resulting application performance gain.


9.6 APPLICATION SCENARIOS AND LEGION

To conclude this overview of Legion, let us revisit the application scenarios described in the first half of this chapter. A Legion implementation of the teleimmersion collaboration design application would closely follow the general object-oriented design presented earlier. The display object, the visual database object, the simulation components, the design database, and the user interface control objects would all be Legion objects that would communicate via method invocations.

The Digital Sky project is a more interesting example of a system that can exploit Legion. In a Legion implementation, the application “object databases” that contain the observations would be Legion objects—perhaps instances of an observation_db class. An observation_db object would have an interface tailored to the application. Thus, rather than generic (and therefore hard to optimize for the application) functions such as read() and write() or select-from-where(), an observation_db object would have functions such as get_sky_volume() or get_object(). In fact, the interface is completely arbitrary, allowing the designer the choice of query and update interfaces.
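
Such an application-tailored interface might look as follows in Java; the record types, method signatures, and units are hypothetical, chosen only to contrast with generic read()/write() access.

```java
import java.util.List;

// Sketch of a domain-specific interface for the hypothetical observation_db
// class: the methods mirror the queries astronomers actually make, rather
// than generic byte-level or relational operations.
public interface ObservationDb {
    record SkyRegion(double raDegrees, double decDegrees, double radiusDegrees) {}
    record Observation(String objectId, double magnitude) {}

    // Retrieve every observation falling inside a volume of sky.
    List<Observation> getSkyVolume(SkyRegion region);

    // Retrieve the observation history of a single catalogued object.
    List<Observation> getObject(String objectId);

    // The interface is arbitrary: updates can be expressed in domain terms too.
    void recordObservation(Observation obs, SkyRegion where);
}
```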

Access to observation_db instances would be location transparent because of the nature of Legion LOIDs. Access would also be transparent with respect to online versus archival storage; that is, the Legion vault storing the persistent state of the objects could mask whether the state is on disk or on an archival medium such as tape.

The implementation of observation_db could then be optimized for the type of data stored and the most common data requests. For example, high performance for large sparse databases can be realized by using a PLOP file [502] or a quad-tree [487], as has been done for radio astronomy data [303, 302]. Data could also be prefetched and cached based on access predictions provided by the user via special member functions, rather than using a demand-driven strategy and naive assumptions about temporal and spatial locality, as is typically the case.

The implementation of observation_db could be internally parallel as well. The data could be horizontally partitioned across multiple subobjects, each of which resides on a separate device. Queries against the observation_db could then be executed in parallel, with multiple devices active at the same time, resulting in greater bandwidth.

Legion further supports the Digital Sky requirements by providing the following:


- A flexible access control policy that can easily be tailored to meet a variety of needs

- Support for flow-oriented processing via MPL, BFS, and the underlying graph support mechanism

- Support for user-level scheduling that allows either the computation to be moved to the data or the data to be moved to the computation

- The ability to dynamically insert object-monitoring code for performance debugging

- The ability to encapsulate legacy databases in Legion objects by wrapping them in an object and restricting the object’s placement to the particular host or hosts where the legacy system can run

Finally, using Legion’s unique metaclass system, one could create a metaclass for the observation_db class that supported object replicas. Replicas could be generated and placed as needed at multiple sites for both faster access and increased availability in the event of equipment failure. Replicas could also be transparently generated on demand close to a data consumer, acting as intelligent prefetching and caching agents.

ACKNOWLEDGMENTS

Fritz Knabe, Steve Chapin, and Mike Lewis assisted in preparing the Legion material. Portions of the Legion material have appeared elsewhere, specifically the 10 design objectives and the three constraints, which have appeared in many Legion papers. The Legion work has been supported by the following grants and contracts: DARPA (Navy) contract no. N66001-96-C-8527, DOE grant DE-FD02-96ER25290, DOE contract Sandia LD-9391, DOE D459000-16-3C, DARPA (GA) SC H607305A, and Northrop-Grumman.

FURTHER READING

For more information on the topics covered in this chapter, see www.mkp.com/grids and also the following references:


- Books by Orfali, Harkey, and Edwards [427] and Chappell [117] provide good introductions to distributed objects.

- Lockhart’s book [352] provides information on DCE, which plays a critical, historical role in the evolution of this technology.

- Schmidt’s papers discuss high-performance CORBA and the ACE Adaptive Communication Environment [490, 488].

- Chandy’s work in the Infospheres project [114] also provides many unique and inventive approaches to the problems described here.


CHAPTER 10

High-Performance Commodity Computing

Geoffrey C. Fox
Wojtek Furmanski

In this chapter, we consider the role of commodity off-the-shelf software technologies and components in the construction of computational grids. We take the position that computational grids can and should build on emerging commodity network computing technologies, such as CORBA, COM, JavaBeans, and less sophisticated Web and networked approaches. These technologies are being used to construct three-tier architectures, in which middle-tier application servers mediate between sophisticated back-end services and potentially simple front ends. The decomposition of application functionality into separate presentation, application, and back-end service tiers results in a distributed computing architecture that can be extended transparently to incorporate grid-enabled second- and/or third-tier services. Consequently, the three-tier architecture being deployed for commodity network applications is well suited to serve as an architecture for computational grids, combining high performance with the rich functionality of commodity systems. The distinct interface, server, and specialized service implementation layers enable technological advances to be incorporated in an incremental fashion.

This commodity approach to grid architecture should be contrasted with more specialized grid architectures such as Globus (Chapter 11) and Legion (Chapter 9). Clearly, these latter technologies can be used to implement lower-tier services. This would seem to be particularly consistent with the Globus service-oriented approach to grid computing.

The rest of this chapter proceeds as follows. We first define what we mean by “commodity technologies” and explain the different ways that they can be used in high-performance computing. Then, we discuss an emerging distributed commodity computing and information system in terms of a conventional three-tier commercial computing model. We describe how this model can be used as a CORBA facility, and we give various examples of how commodity technologies can be used effectively for computational grids. Finally, we discuss how commodity technologies such as Java can be used to build parallel programming environments that combine high functionality and high performance.

10.1 COMMODITY TECHNOLOGIES

The past few years have seen an unprecedented level of innovation and progress in commodity technologies. Three areas have been critical in this development: the Web, distributed objects, and databases. Each area has developed impressive and rapidly improving software artifacts. Examples at the lower level include HTML, HTTP, MIME, IIOP, CGI (Common Gateway Interface), Java, JavaScript, JavaBeans, CORBA, COM, ActiveX, VRML (Virtual Reality Modeling Language), ORBs (object request brokers), and dynamic Java servers and clients. Examples at the higher level include collaboration, security, commerce, and multimedia technologies. Perhaps more important than these raw technologies is a set of open interfaces that enable large components to be quickly integrated into new applications.

Computational grid environments can be constructed that incorporate these commodity capabilities in such a way as to achieve both high performance and high functionality. One approach to this goal would be to use just a few of the emerging commodity technologies as point solutions. For example:

- VRML or Java3D could be used for scientific visualization.

- Web (including Java applets) front ends could provide convenient, customizable, interoperable user interfaces to high-performance facilities [208].

- The public-key security and digital signature infrastructure being developed for electronic commerce could enable more powerful approaches to secure high-performance systems.

- Java could become a common scientific programming language.

- The universal adoption of Java DataBase Connectivity (JDBC) and the growing convenience of Web-linked databases could result in the growing importance of systems that link large-scale commercial databases with high-performance computing resources.


- The emerging “Object Web” (linking the Web, distributed objects, and databases) could encourage a growing use of modern object technology.

- Emerging collaboration and other distributed information systems could encourage new distributed work paradigms in place of traditional approaches to collaboration [53].

Our focus, however, is not on such point solutions but on exploiting the overall architecture of commodity systems for high-performance parallel or distributed computing. You might immediately raise the objection that over the past 30 years, many other major broad-based hardware and software developments have occurred—such as IBM business systems, UNIX, Macintosh and PC desktops, and video games—without any profound impact on high-performance computing software. However, the emerging distributed commodity computing and information system (DcciS) is different: it gives us a worldwide, enterprisewide distributed computing environment. Previous software revolutions could help individual components of a high-performance computing system, but DcciS can, in principle, be the backbone of a complete high-performance computing software system—whether it be for some global distributed application, an enterprise cluster, or a tightly coupled large-scale parallel computer.

To achieve this goal, we must add high performance to the emerging DcciS environment. This task may be extremely difficult, but, by using DcciS as a basis, we inherit a multibillion-dollar investment and what is, in many respects, the most powerful productive software environment ever built.

10.2 THE THREE-TIER ARCHITECTURE

Within commodity network computing, the three-tier architecture has become pervasive. As shown in Figure 10.1, the top level of this model provides a customizable client tier, consisting of components such as graphical user interfaces, application programs, and collaboration tools. The middle tier, often referred to as an application server, consists of high-level agents that can provide application functionality as well as a range of high-level services such as load balancing, integration of legacy systems, translation services, metering, and monitoring. An important aspect of the middle tier is that it both defines interfaces and provides a control function, coordinating requests across one or more lower-tier servers. The bottom tier provides back-end services, such as traditional relational and object databases. A set of standard interfaces allows a rich set of custom applications to be built with appropriate client and middleware software. As indicated in Figure 10.1, these layers can use Web technology such as Java and JavaBeans, distributed objects with CORBA, and standard interfaces such as JDBC.

FIGURE 10.1: Industry three-tier view of enterprise computing.

10.2.1 Implementing Three-Tier Architectures

To date, two basic technologies have been used to construct commodity three-tier networked computing systems: distributed-object systems and distributed-service systems. Distributed object-based systems are built on object-oriented technologies such as CORBA, COM, and JavaBeans (when combined with RMI). These technologies are discussed in depth in Chapter 9. Object-oriented systems are ideal for defining middle-tier services. They provide well-defined interfaces and a clean encapsulation of back-end services. Through the use of mechanisms such as inheritance, these systems can be easily extended, allowing specialized services to be created.

In spite of the advantages of the distributed-object approach, many network services today are provided through a distributed-service architecture. The most notable example is the Web, in which the middle-tier service is provided by a distributed collection of HTTP servers. Linkage to back-end services (databases, simulations, and other custom services) is provided via programs called CGI scripts.

The use of Web technology in networked databases provides a good example of how the distributed-services architecture is used. Originally, remote access to these databases was provided via a two-tier client-server architecture. In these architectures, sophisticated clients would submit SQL queries to remote databases using proprietary network access protocols to connect the client to the server. The three-tier version of this system might use Web-based forms implemented on a standard thin client (i.e., a Web browser) with middle-tier application functionality implemented via CGI scripts in the HTTP server. These scripts then access back-end databases by using vendor-specific methods. This scenario becomes even more attractive with the introduction of Java and JDBC. Using this interface, the middle-tier HTTP service can communicate transparently with a wide range of vendor databases.
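
A minimal sketch of this middle-tier pattern, using the standard JDBC API, follows; the connection URL, credentials, and table schema are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Middle-tier sketch: a thin client posts a Web form, and the middle tier
// turns it into a vendor-neutral JDBC query against a back-end database.
public class MiddleTierQuery {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://dbhost/experiments";  // any JDBC driver works here
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement stmt = conn.prepareStatement(
                     "SELECT run_id, status FROM runs WHERE owner = ?")) {
            stmt.setString(1, "alice");                       // value taken from the Web form
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    System.out.println(rs.getLong("run_id") + " " + rs.getString("status"));
                }
            }
        }
    }
}
```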

Currently, a mixture of distributed-service and distributed-object architectures is deployed, using CORBA, COM, JavaBeans, HTTP servers and CGI scripts, Java servers, databases with specialized network protocols, and other services. These all coexist in a heterogeneous environment with common themes but disparate implementations. We believe that in the near future, there will be a significant convergence of network computing approaches that combines both the distributed-server and distributed-object approaches [134, 426]. Indeed, we already see a blurring of the distinction between Web and distributed-object servers, with Java playing a central role in this process.

On the Web side, we are seeing a trend toward the use of extensible Java-based Web servers. Rather than implementing middle-tier services using CGI scripts, written in a variety of languages, these servers can be customized on the fly through the use of Java “servlets.” Alternatively, CORBA ORBs already exist whose functionality can be implemented by using Java.
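
For illustration, a minimal servlet of the kind such servers load looks as follows; the job-status example and its URL parameter are hypothetical.

```java
import java.io.IOException;
import java.io.PrintWriter;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// A minimal servlet: middle-tier logic loaded into the Web server itself,
// replacing an external CGI script. The job-status lookup is a placeholder.
public class JobStatusServlet extends HttpServlet {
    @Override
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
            throws IOException {
        String jobId = req.getParameter("job");   // e.g. /status?job=1234
        resp.setContentType("text/html");
        PrintWriter out = resp.getWriter();
        out.println("<html><body>Job " + jobId + ": RUNNING</body></html>");
    }
}
```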

We also believe that these advances will lead to an integrated architecture in which Web-based services (browsers, Java, JavaBeans) and protocols (HTTP, RMI) are used to construct the top layer, distributed-object services and protocols (such as CORBA and IIOP) are used to interface to the lower tier, and the middle tier consists of a distributed set of extensible servers that can process both Web-based and distributed-object protocols. The exact technologies used are not critical; however, for the sake of discussion, we will consider a middle tier based on an integrated Java and CORBA server. We believe that the resulting “Object Web” three-tier networked computing architecture will have profound importance and will be the most appropriate way to implement a range of computational grid environments.

10.2.2 Exploiting the Three-Tier Structure

We believe that the evolving service/object three-tier commodity architecture can and should form the basis for high-performance computational grids. These grids can incorporate (essentially) all of the services of the three-tier architecture outlined above, using its protocols and standards wherever possible, but using specialized techniques to achieve the grid goal of dependable performance. This goal might be achieved by simply porting commodity services to high-performance computing systems. Alternatively, we could continue to use the commodity architecture on current platforms while enhancing specific services to ensure higher performance and to incorporate new capabilities such as high-end visualization (e.g., immersive visualization systems, Chapter 6) or massively parallel end systems (Chapter 17). The advantage of this approach is that it facilitates tracking the rapid evolution of commodity systems, avoiding the need for continued upkeep with each new upgrade of the commodity service. This results in a high-performance commodity computing environment that offers the evolving functionality of commodity systems without requiring significant reengineering as advances in hardware and software lead to new and better commodity products.

The preceding discussion indicates that the essential research challenge for high-performance commodity computing is to enhance the performance of selected components within a commodity framework in such a way that the performance improvement is preserved through the evolution of the basic commodity technologies. We believe that the key to achieving this goal is to exploit the three-tier structure by keeping high-performance computing enhancements in the third layer—which is, inevitably, the home of specialized services. This strategy isolates high-performance computing issues from the control or interface issues in the middle layer.

Let us briefly consider how this strategy works. Figure 10.2 shows a hybrid three-tier architecture in which the middle tier is implemented as a distributed network of servers. In general, these servers can be CORBA, COM, or Java-based Object Web servers: any server capable of understanding one of the basic protocols is possible. The middle layer not only includes networked servers with many different capabilities but can also contain multiple instantiations of the same server to increase performance. The use of high-functionality but modest-performance communication protocols and interfaces at the middle layer limits the performance levels that can be reached. Nevertheless, this first step gives a modest-performance, scalable parallel (implemented, if necessary, in terms of multiple servers) system that includes all commodity services (such as databases, object services, transaction processing, and collaboratories).

The next step is applied only to those services whose lack of performance constitutes a bottleneck. In this case, an existing back-end (third-layer) implementation of a commodity service is replaced by its natural high-performance version. For example, sequential databases are replaced by parallel database machines, and sequential or socket-based messaging distributed simulations are replaced by message-passing implementations on low-latency, high-bandwidth dedicated parallel machines (specialized architectures or clusters of workstations).

FIGURE 10.2: Today’s heterogeneous interoperating hybrid server architecture. High-performance commodity computing involves adding high performance in the third tier. (Key: D, database; DC, distributed computing component; N, sequential networked computer server; O, object server; PC, parallel computer; PD, parallel database; T, collaboratory server; W, Web server.)

Note that with the right high-performance software and network connectivity, clusters of workstations (Chapter 17) could be used at the third layer. Alternatively, collections of middle-tier services could be run on a single parallel computer. These various possibilities underscore the fact that the relatively clean architecture of Figure 10.3 can become confused. In particular, the physical realization may not reflect the logical architecture shown in Figure 10.2.


FIGURE 10.3: Integration of object technologies (CORBA) and the Web.

10.3 A HIGH-PERFORMANCE FACILITY FOR CORBA

As discussed above, we envision the middle tier of the network computing architecture providing the interface to high-performance computing capabilities. In the CORBA-based strawman that we have presented, this means that high-performance computing components must be integrated into the CORBA architecture.

CORBA is defined in terms of a set of facilities, where each facility defines an established, standardized high-level service. Facilities are split into those that are required by most applications (called horizontal facilities) and those that are defined to promote interoperability within specific application domains (called vertical facilities). As CORBA evolves, it is expected that some vertical facilities will migrate to become horizontal facilities.

We believe that high-performance computing can be integrated into the CORBA model by creating a new facility that defines how CORBA objects interact with one another in a high-performance environment. CORBA currently supports only relatively simple computing models, including the embarrassingly parallel activities of transaction processing or data flow. High-performance commodity computing therefore would fill a gap by providing CORBA’s high-performance computing facility.

This new facility allows us to define a commercialization strategy for high-performance computing technologies. Specifically, academia and industry could experiment with high-performance commodity computing as a general framework for providing high-performance CORBA services. Then, one or more industry-led groups could propose high-performance commodity computing specifications, following a process similar to the MPI or HPF forum activities. Such specifications could include another CORBA facility—one that provided user interfaces to (scientific) computers. This facility could comprise interfaces necessary for performance tools and resource managers, file systems, compilation, debugging, and visualization.

Although we focus here on the use of CORBA, analogies exist in the Java and COM object models. In particular, in Section 10.6.2, we discuss how wrappers might be used to provide a Java framework for high-performance computing.

10.4 HIGH-PERFORMANCE COMMUNICATION

Communication performance is a critical aspect of the performance of many high-performance systems. Indeed, a distinguishing feature between distributed and high-performance parallel computation is the bandwidth and latency of communication. In this section, we present an example of how high-performance communication mechanisms can be integrated into a commodity three-tier computing system.

The example we consider is a multidisciplinary simulation. As discussed in Section 9.2.2, multidisciplinary applications typically involve the linkage of two or more modules—say, computational fluid dynamics and structures applications—into a single simulation. We can assume that simulation components are individually parallel.

As an initial approach, we could view the linkage between components sequentially, with the middle tier coordinating the movement of data from one simulation component to the other at every step in the simulation. If higher performance is required, we may need to link the components directly, using a high-performance communication mechanism such as MPI [250]. The connections between modules can be set up by a middle-tier service (such as WebFlow or JavaBeans); then, two third-tier modules can communicate with one another without intervention of the middle-tier service. Control flow returns to the middle tier when the simulation is complete.


A third possibility is to keep the control function within the middle tier, setting up high-performance connections between the two modules and initiating data transfer at every step in the simulation. Unlike the first scenario we considered, the actual transfer of data between modules would take place using the high-performance interface, not through the middle tier. This approach preserves the advantages of the other strategies, using the commodity protocols and services of the three-tier architecture for all user-visible control functions while exploiting the performance of high-performance software only where necessary.

A key element of this example is the structure of the middle-tier service. One approach would be to use JavaBeans (see Chapter 9) as the vehicle for integrating the individual simulation components. In the JavaBeans model, there is a separation of control (handshake) and implementation. This separation makes it possible to create JavaBeans “listener objects” that reside in the middle tier and act as a bridge between a source and sink of data. The listener object can decide whether high performance is necessary or possible and invoke the specialized high-performance layer. As discussed above, this approach can be used to advantage in runtime compilation and resource management, with execution schedules and control logic in the middle tier and high-performance communication libraries implementing the determined data movement. This approach can also be used to provide parallel I/O and high-performance CORBA.
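
A sketch of such a listener object appears below; the endpoint interface and the rule for choosing the fast path are illustrative assumptions, not an actual JavaBeans or WebFlow API.

```java
// Sketch of a middle-tier "listener object" bridging a data source and sink.
// It keeps control in the middle tier but, when both endpoints support it,
// tells them to exchange bulk data over a direct high-performance channel
// (e.g., an MPI binding) rather than through the middle tier itself.
public class ListenerBridgeSketch {
    interface DataEndpoint {
        boolean supportsFastTransport();            // e.g., has an MPI binding
        void transferDirectly(DataEndpoint peer);   // third-tier-to-third-tier copy
        void transferVia(ListenerBridgeSketch bridge, DataEndpoint peer);
    }

    // Invoked once per simulation step; the bridge chooses the data path.
    void onStepComplete(DataEndpoint source, DataEndpoint sink) {
        if (source.supportsFastTransport() && sink.supportsFastTransport()) {
            source.transferDirectly(sink);          // data bypasses the middle tier
        } else {
            source.transferVia(this, sink);         // fall back to commodity protocols
        }
    }

    void relay(byte[] chunk, DataEndpoint sink) {
        // Commodity path: data actually flows through the middle tier here.
    }
}
```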

10.5 HIGH-PERFORMANCE COMMODITY SERVICES

A key feature of high-performance commodity computing is its support for databases, Web servers, and object brokers (see Section 10.1). In this section, we use the additional example of collaboration services to illustrate the power of the commodity computing approach.

Traditionally, a collaborative system is one in which specific capabilities are integrated across two or more clients. Examples of such systems include whiteboards, visualization, and shared control. With the introduction of the flexible Java-based three-tier architecture, support for collaboration can also be integrated into the computing model by providing collaboration services in the middle tier.

Building grid applications on the three-tier architecture provides a well-defined separation between high-performance computing (bottom tier) and collaboration (top and middle tier). Consequently, we can reuse collaboration systems built for the general Web market to address areas that require people to be integrated with the computational infrastructure, such as computational steering and collaborative design.

This configuration enables the best commodity technology (e.g., from business or distance education) to be integrated into the high-performance computing environment. Currently, commodity collaboration systems are built on top of the Web and are not yet defined from a general CORBA point of view. Nevertheless, facilities such as WorkFlow are being developed, and we assume that collaboration will emerge as a CORBA capability to manage the sharing and replication of objects.

CORBA is a server-server model in which clients are viewed as servers (i.e., run ORBs) by outside systems. This makes the object-sharing view of collaboration natural, whether an application runs on the “client” (e.g., a shared Microsoft Word document) or the back-end tier (e.g., a shared parallel computer simulation).

Two systems, TANGO [53] and WebFlow [60], can be used to illustrate the differences between collaborative and multidisciplinary computation. Both systems use Java servers for their middle tier. TANGO provides collaboration services: client-side applications are replicated by using an event distribution model. To put a new application into TANGO, you must be able to define both its absolute state and changes therein. By using Java object serialization or similar mechanisms, this state is maintained identically in the linked applications. On the other hand, WebFlow integrates program modules by using a data flow paradigm. With this system, the module developer defines data input and output interfaces and builds methods to handle data I/O. Typically, there is no need to replicate the state of a module in a WebFlow application.

10.6 COMMODITY PARALLEL COMPUTING

Most of the discussion in this chapter has focused on the use of commodity technologies for computational grids, a field sometimes termed high-performance distributed computing. We believe, however, that commodity technologies can also be used to build parallel computing environments that combine high functionality and high performance. In this section, we first compare alternative views of high-performance distributed parallel computers. Then, we discuss Java as a scientific and engineering programming language.


FIGURE 10.4: A parallel computer viewed as a single CORBA object in a classic host-node computing model. Logically, the host is in the middle tier and the nodes in the lower tier. The physical architecture could differ from the logical architecture.

10.6.1 High-Performance Commodity Communication

Consider two views of a parallel computer. In both, various nodes and the host are depicted as separate entities. These represent logically distinct functions, but the physical implementation need not reflect the distinct services. In particular, two or more capabilities can be implemented on the same sequential or shared-memory multiprocessor system.

Figure 10.4 presents a simple multitier view with commodity protocols (HTTP, RMI, COM, or the IIOP pictured) used to access the parallel computer as a single entity. This entity (object) delivers high performance by running classic high-performance computing technologies (such as HPF [269], PVM [227], or the pictured MPI [250]) in the third tier. This approach has been successfully implemented by many groups to provide parallel computing systems with important commodity services based on Java and JavaScript client interfaces. Nevertheless, the approach addresses the parallel computer only as a single object and is, in effect, the “host-node” model of parallel programming [160]; the distributed computing support of commodity technologies for parallel programming is not exploited.

Figure 10.5 depicts the parallel computer as a distributed system with a fast network and integrated architecture. Each node of the parallel computer runs a CORBA ORB (or, perhaps more precisely, a stripped-down ORBlet), Web server, or equivalent commodity server. In this model, commodity protocols can operate both internally and externally to the parallel machine. The result is a powerful environment where we can uniformly address the full range of commodity and high-performance services. Other tools can now be applied to parallel as well as distributed computing.


FIGURE 10.5: Each node of a parallel computer instantiated as a CORBA object. The “host” is logically a separate CORBA object but could be instantiated on the same computer as one or more of the nodes. Via a protocol bridge, we could address objects using CORBA, with local parallel computing nodes invoking MPI and remote accesses using CORBA where its functionality (access to many services) is valuable.

Obviously, we should be concerned that the flexibility of this second parallel computer is accompanied by a reduction in communication performance. Indeed, most commodity messaging protocols (e.g., RMI, IIOP, and HTTP) have unacceptable performance for most parallel computing applications. However, good performance can be obtained by using a suitable binding of MPI or other high-speed communication library to the commodity protocols.

In Figure 10.6, we illustrate such an approach to high performance, which uses a separation between messaging interface and implementation. The bridge shown in this figure allows a given invocation syntax to support several messaging services with different performance-functionality trade-offs.


FIGURE 10.6: A message optimization bridge allows MPI (or equivalently Globus or PVM) and commodity technologies to coexist with a seamless user interface.

In principle, each service can be accessed by any applicable protocol. For instance, a Web server or database can be accessed by HTTP or CORBA; a network server or distributed computing resource can support HTTP, CORBA, or MPI.

Note that MPI and CORBA can be linked in one of two ways: (1) the MPI function call can call a CORBA stub, or (2) a CORBA invocation can be trapped and replaced by an optimized MPI implementation. Current investigations of a Java MPI linkage have raised questions about extending MPI to handle more general object data types. For instance, the MPI communicator field could be extended to indicate a preferred protocol implementation, as is done in Nexus [199]. Other research issues focus on efficient object serialization needed for a high-performance implementation of the concept in Figure 10.6.
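
The bridge idea can be sketched as a uniform send() whose implementation is either a commodity (CORBA/IIOP-style) stub or an optimized native transport standing in for MPI; everything below (the interface, the destination-based selection rule) is an illustrative assumption, not a real binding.

```java
// Sketch of a message optimization bridge: one invocation syntax, several
// messaging services with different performance-functionality trade-offs.
public class MessageBridgeSketch {
    interface Transport { void send(String dest, byte[] data); }

    static class CommodityTransport implements Transport {   // high functionality
        public void send(String dest, byte[] data) {
            System.out.println("IIOP-style send to " + dest);
        }
    }
    static class FastTransport implements Transport {        // high performance
        public void send(String dest, byte[] data) {
            System.out.println("MPI-style send to " + dest);
        }
    }

    // The bridge traps an invocation and picks the best transport for the
    // destination: local parallel nodes get the fast path, remote services
    // the commodity path.
    static Transport select(String dest) {
        return dest.startsWith("node") ? new FastTransport() : new CommodityTransport();
    }

    public static void main(String[] args) {
        select("node3").send("node3", new byte[16]);
        select("remote-db").send("remote-db", new byte[16]);
    }
}
```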


10.6.2 Java and High-Performance Computing

We have thus far discussed many critical uses of Java in both client interfaces and middle-tier servers to high-performance systems. Here, we focus on the direct use of Java as a scientific and engineering programming language [209], taking the role currently played by Fortran 77, Fortran 90, and C++. (In our three-tier architecture, this is the use of Java in lower-tier engineering and science applications or in a CORBA vertical facility designed to support high-performance computing.)

User Base

One of Java’s important advantages over other languages is that it will be learned and used by a broad group of users. Java is already being adopted in many entry-level college programming courses and will surely be attractive for teaching in middle or high schools. We believe that entering college students, fresh from their Java classes, will reject Fortran as quite primitive in contrast. C++, as a more complicated systems-building language, may well be a natural progression; but although it is quite heavily used, C++ has limitations as a language for simulation. In particular, it is hard for C++ to achieve good performance even on sequential code. We expect that Java will not have these problems.

Performance

Performance is arguably a critical issue for Java. However, there seems little reason why native Java compilers, as opposed to current portable JavaVM interpreters or just-in-time (JIT) compilers, cannot obtain performance comparable with that of C or Fortran compilers. One difficulty in compiling Java is its rich exception framework, which could restrict compiler optimizations: users would need to avoid complex exception handlers in performance-critical portions of a code. Another important issue with Java is the lack of any operator overloading, which could allow efficient, elegant handling of Fortran constructs like COMPLEX. Much debate centers on Java’s rule that code not only must run everywhere but must give the same value on all machines. This rule inhibits optimization on machines such as the Intel Pentium that include multiply-add instructions with intermediate results stored to higher precision than final values of individual floating-point operations.

An important feature of Java is the lack of pointers. Their absence allows much more optimization for both sequential and parallel codes. Optimistically, we can say that Java shares the object-oriented features of C++ and the performance features of Fortran. An interesting area is the expected performance of Java interpreters (using JIT techniques) and compilers on Java bytecodes (virtual machine). Currently, a PC just-in-time compiler shows a factor of 3–10 lower performance than C-compiled code, and this can be expected to decrease to a factor of 2. Hence, with some restrictions on programming style, we expect Java language or VM compilers to be competitive with the best Fortran and C compilers. We also expect a set of high-performance “native-class” libraries to be produced, which can be downloaded and accessed by applets to improve performance in the usual areas where scientific libraries are built.

Parallelism

To discuss parallel Java, we consider four forms of parallelism seen in applications:

1. Data parallelism: By data parallelism, we mean large-scale parallelism resulting from parallel updates of grid points, particles, and other basic components in scientific computations (see Chapter 8). Such parallelism is supported in Fortran by either high-level data-parallel HPF or, at a lower level, Fortran plus message passing. Java has no built-in parallelism of this type, but the lack of pointers means that natural parallelism is less likely to be obscured. There seems no reason why Java cannot be extended to high-level data-parallel form (HPJava) in a similar way to Fortran (HPF) or C++ (HPC++). Such an extension can be done by using threads on shared-memory machines; on distributed-memory machines, message passing may be used.

2. Modest-grain functional parallelism: Functional parallelism refers to the type of parallelism obtained when unique application functions can be executed concurrently. For example, Web browsers frequently use functional parallelism by overlapping computation with I/O operations. Support for functional parallelism is built into the Java language with threads, but has to be added explicitly with libraries for Fortran and C++ (see the sketch after this list).

3. Object parallelism: Object parallelism is quite natural for C++ or Java. Java can use the applet mechanism to represent objects portably.

4. Metaproblem parallelism: Metaproblem parallelism occurs in applications that are made up of several different subproblems, which themselves may be sequential or parallel.
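
To illustrate item 2, the following minimal sketch overlaps a computation with a concurrent I/O task using only the standard java.lang.Thread API; the tasks themselves are placeholders.

// Functional parallelism with built-in Java threads: overlap an I/O task
// with computation, as a Web browser might. The tasks are placeholders.
public class FunctionalParallelism {
    public static void main(String[] args) throws InterruptedException {
        Thread io = new Thread(() -> {
            System.out.println("I/O task running");  // fetch data, read files, ...
        });
        io.start();                  // I/O proceeds concurrently ...

        double sum = 0.0;            // ... while we compute here
        for (int i = 1; i <= 1_000_000; i++) sum += 1.0 / i;

        io.join();                   // wait for the I/O task to finish
        System.out.println("sum = " + sum);
    }
}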

FIGURE 10.7: An architecture for an interpreted Java front end communicating with a middle-tier server controlling dynamically an HPCC back end. (The figure shows a client-side display applet and scripted user commands linked over IIOP/HTTP to a middle-tier proxy library in the chosen scripting language, which in turn drives dynamically added libraries, compiled script invocation, and instrumented compiled code with breakpoints on the server.)

Interpreted Environments

Java and Web technology suggest new programming environments that integrate compiled and interpreted or scripting languages. In Figure 10.7, we show a system that uses an interpreted Web client interacting dynamically with compiled code through a typical middle-tier server. This system uses an HPF back end, but the architecture is independent of the back-end language. The Java or JavaScript front end holds proxy objects produced by an HPF front end operating on the back-end code. These proxy objects can be manipulated with interpreted Java or JavaScript commands to request additional processing, visualization, and other interactive computational steering and analysis. We note that for compiled (parallel) Java, the use of objects (as opposed to simple types in the language) probably has unacceptable overhead. However, such objects are appropriate for interpreted front ends, where object references are translated into efficient compiled code. We believe such hybrid architectures are attractive and warrant further research.
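
The proxy idea of Figure 10.7 can be sketched as follows; every name here (BackEndConnection, DistributedArray, and so on) is a hypothetical stand-in for whatever IIOP/HTTP transport and back-end objects a real system would provide.

// Sketch of the Figure 10.7 proxy pattern: the interpreted front end holds
// lightweight proxies whose methods forward requests to compiled code on
// the back-end server. All names are hypothetical.
interface BackEndConnection {                       // IIOP/HTTP channel
    Object invoke(String objectId, String method, Object... args);
}

interface DistributedArray {
    void redistribute(int blockSize);               // done by the back end
    double[] sample(int n);                         // fetch values to display
}

class DistributedArrayProxy implements DistributedArray {
    private final BackEndConnection conn;
    private final String remoteId;                  // server-side handle

    DistributedArrayProxy(BackEndConnection conn, String remoteId) {
        this.conn = conn;
        this.remoteId = remoteId;
    }

    public void redistribute(int blockSize) {
        conn.invoke(remoteId, "redistribute", blockSize);
    }

    public double[] sample(int n) {
        return (double[]) conn.invoke(remoteId, "sample", n);
    }
}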

Evaluation

In summary, we see that Java has no obvious major disadvantages and some clear advantages compared with C++ and especially Fortran as a basic language for large-scale simulation and modeling. Obviously, we cannot and should not port all our codes to Java. Putting Java (or, more generally, CORBA) wrappers around existing code does, however, seem a good way of preserving old codes. Java wrappers can both document their capability (through the CORBA trader and JavaBean information services) and allow definition of methods that enable such codes to be naturally incorporated into larger systems. In this way a Java framework for high-performance commodity computing can be used in general computing solutions. As compilers get better, we expect users will find it more and more attractive to use Java for new applications. Thus, we can expect to see a growing adoption by computational scientists of commodity technology in all aspects of their work.
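
One concrete way to wrap an existing compiled code in Java is the Java Native Interface; the sketch below assumes a hypothetical native library and solver routine.

// Wrapping an existing Fortran/C code in a Java class via JNI; the
// library name "legacysolver" and the solve routine are hypothetical.
public class LegacySolverWrapper {
    static {
        System.loadLibrary("legacysolver");   // hypothetical native library
    }

    // Declared in Java, implemented by the existing compiled code.
    private native void solve(double[] grid, int nx, int ny, int iters);

    // A documented entry point that a larger system (or a CORBA/JavaBean
    // wrapper generated around this class) can call naturally.
    public void run(double[] grid, int nx, int ny) {
        solve(grid, nx, ny, 100);
    }
}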

10.7 RELATED WORK

The Nile project [363] is developing a CORBA-based distributed-computing solution for the CLEO high-energy physics experiment using a self-managing, fault-tolerant, heterogeneous system of hundreds of commodity workstations, with access to a distributed database in excess of 100 TB. These resources are spread across the United States and Canada at 24 collaborating institutions.

TAO is a high-performance ORB being developed by Douglas Schmidt of Washington University. Schmidt conducts research on high-performance implementations of CORBA [489], geared toward realtime image processing and telemedicine applications on workstation clusters over ATM. TAO, which is based on an optimized version of a public-domain IIOP implementation from SunSoft, outperforms commercial ORBs by a factor of two to three.

The OASIS (Open Architecture Scientific Information System) [376] environment, being developed by Richard Muntz of UCLA for scientific data analysis, allows the storage, retrieval, analysis, and interpretation of selected data sets from a large collection of scientific information scattered across heterogeneous computational environments of earth science projects such as EOSDIS. Muntz is exploring the use of CORBA for building large-scale object-based data-mining systems. Several groups are also exploring specialized facilities for CORBA-based distributed computing. Examples include the Workflow Management Coalition and the Distributed Simulation Architecture.

10.8 SUMMARY

We have described the three-tier architecture employed in commodity computing and also reviewed a number of the commodity technologies that are used in its implementation. The resulting separation of concerns among interface, control, and implementation may well make the integration of high-performance capabilities quite natural. We have also sketched a path by which this integration may be achieved. Although significant challenges must be overcome before commodity technologies can guarantee the performance required for computational grids, there is much to be gained from structuring approaches to grid architectures in terms of this framework.

FURTHER READING

For more information on the topics covered in this chapter, see www.mkp.com/grids and also the following references:

• The books on CORBA and Java by Mowbray and Ruh [402] and Orfali and Harkey [426] are excellent.

• Our article [210] discusses supercomputing on the Web.

• A book by Sessions [505] discusses COM and DCOM.


CHAPTER 12

High-Performance Schedulers

Francine Berman

Computational grids will provide a platform for a new generation of applications. Grid applications will include “portable” applications that can be executed at a number of computation and communication sites, resource-intensive applications that must aggregate distributed resources (memory, data, computation) to produce results for the problem sizes of interest, and coupled applications that combine computers, immersive and visualization environments, and/or remote instruments. Grid applications will include sequential, parallel, and distributed programs. All of these applications will execute simultaneously and will share resources. Most important, each application will seek to leverage the performance potential of the grid to optimize its own execution.

From the application’s perspective, performance is the point. But how can applications leverage the performance potential of the grid? Experience with two decades of parallel and distributed applications indicates that scheduling is fundamental to performance. Schedulers employ predictive models to evaluate the performance of the application on the underlying system, and use this information to determine an assignment of tasks, communication, and data to resources, with the goal of leveraging the performance potential of the target platform.

In grid environments, applications share resources—computation resources, communication resources, instruments, data—and both applications and system components must be scheduled to achieve performance. However, each scheduling mechanism may have a different performance goal. Job schedulers (high-throughput schedulers) will promote the performance of the system (as measured by aggregate job performance) by optimizing throughput (measured by the number of jobs executed by the system); resource schedulers will coordinate multiple requests for access to a given resource by optimizing fairness criteria (to ensure that all requests are satisfied) or resource utilization (to measure the amount of the resource used). Both job schedulers and resource schedulers will promote the performance of the system over the performance of individual applications. These goals may conflict with the goals of application schedulers (high-performance schedulers), which promote the performance of individual applications by optimizing performance measures such as minimal execution time, resolution, speedup, or other application-centric cost measures. Since their notion of performance differs, grid programmers cannot rely on resource schedulers or other system components to promote application performance goals.

In a computational grid setting, high-performance schedulers become a critical part of the programming environment. However, high-performance scheduling on a grid is particularly challenging: Both the software and hardware resources of the underlying system may exhibit heterogeneous performance characteristics; resources may be shared by other users; and networks, computers, and data may exist in distinct administrative domains. Moreover, centralization is typically not feasible in a grid environment, since no one system may be in control of all of the resources. In this chapter, we focus on the problem of developing high-performance application schedulers for grids: schedulers that focus on the problem of achieving performance for a single application on distributed heterogeneous resources. The related problems of achieving system performance through high-throughput scheduling and through resource scheduling are considered in Chapters 13 and 19, respectively.

12.1 SCHEDULING GRID APPLICATIONS

It should be clear from the preceding discussion that “performance” means different things in different contexts. Webster’s dictionary defines performance as “the manner in which a mechanism performs,” with perform defined as “to successfully complete a process.” From this we can infer that achieving performance requires both a model of behavior and some way to determine which behaviors are successful. In a technical context, we can say that performance is achieved by optimizing a cost model that provides a means of comparing and ranking alternatives within that model. With regard to high-performance (application) scheduling, the cost model assigns a value (cost) to the execution resulting from a particular schedule. Executions can then be evaluated by comparing them with respect to some quantifiable measure of their execution (e.g., execution time, speedup, resolution). We call this quantifiable measure the performance measure.

12.1.1 The Scheduling Problem

Grid applications consist of one or more tasks that may communicate and cooperate to form a single application. Scheduling grid applications involves a number of activities. A high-performance scheduler may do the following:

1. Select a set of resources on which to schedule the task(s) of the application.

2. Assign application task(s) to compute resources.

3. Distribute data or colocate data and computation.

4. Order tasks on compute resources.

5. Order communication between tasks.

In the literature, item 1 is often termed resource location, resource selection, or resource discovery. Resource selection refers to the process of selecting candidate resources from a pool; resource discovery and resource location refer to the determination of which resources are available to the application. Item 2 may be called mapping, partitioning, or placement. For task-parallel programs, computation or data may reside in distinct locations, and the scheduler must determine which needs to be moved (item 3). For data-parallel programs, all computation resources execute the same program, and the complexity of the scheduling process lies in the determination of a performance-efficient distribution or decomposition of data (item 3). For data-parallel programs, load balancing—the assignment of equivalent amounts of work to processors that will execute concurrently—is often the scheduling policy of choice for the high-performance scheduler.

Note that items 1 through 3 (generally termed mapping) focus on the allocation of computation and data “in space”; items 4 and 5 (generally termed scheduling) deal with the allocation of computation and communication “over time.” For many authors, scheduling is also used to describe activities 1 through 5, as we use it here.

A scheduling model consists of a scheduling policy—a set of rules for producing schedules; a program model, which abstracts the set of programs to be scheduled; and a performance model, which abstracts the behavior of the program on the underlying system for the purpose of evaluating the performance potential of candidate schedules. In addition, the scheduling model utilizes a performance measure, which describes the performance activity to be optimized by the performance model.

High-performance schedulers are software systems that use scheduling models to predict performance, determine application schedules based on these models, and take action to implement the resulting schedule. Given appropriate input, the high-performance scheduler determines an application schedule—an assignment of tasks, data, and communication to resources, ordered in time—based on the rules of the scheduling policy, and evaluated as “performance efficient” under the criteria established by the performance model. The goal of the high-performance scheduler is to optimize the performance experienced by the application on computational grids.
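
The core decision procedure of such a scheduler can be sketched in a few lines; Schedule and PerformanceModel here are hypothetical interfaces standing in for whatever scheduling policy and performance model a real system supplies.

import java.util.List;

// Core of a high-performance scheduler: evaluate each candidate schedule
// under the performance model and actuate the one with the best predicted
// cost. Both interfaces are hypothetical stand-ins.
class HighPerformanceScheduler {
    interface Schedule { void actuate(); }
    interface PerformanceModel {
        double predictedCost(Schedule s, long startTime);  // e.g., seconds
    }

    Schedule chooseBest(List<Schedule> candidates,
                        PerformanceModel model, long startTime) {
        Schedule best = null;
        double bestCost = Double.POSITIVE_INFINITY;
        for (Schedule s : candidates) {
            double cost = model.predictedCost(s, startTime);
            if (cost < bestCost) { bestCost = cost; best = s; }
        }
        return best;   // the caller then invokes best.actuate()
    }
}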

Note that while distributed parallel applications are among the most challenging to schedule on grids, “portable” single-site applications must also be scheduled. Even if an application cannot profit from distribution, the scheduler may have to locate a computational site and/or colocate data in a way that promotes application performance. For parallel grid applications, the high-performance scheduler will need to determine whether performance is optimized by assigning all tasks to a single site or by distributing the application to multiple sites.

One approach to developing high-performance schedulers initially thought fruitful was to modify successful strategies from massively parallel processor (MPP) schedulers for grid environments. This seemed reasonable because applications in both MPP and grid environments require careful coordination of processing, communication, and data to achieve performance. Moreover, in the MPP environment, strategies such as gang scheduling [187] provide a method by which both application and system behavior can be optimized (under the assumption that achieving good throughput, utilization, and/or fairness for uniform resources will promote good application performance on average).

However, MPP scheduling models generally produce poor grid schedules in practice. To determine why, it is useful to look carefully at the assumptions that underlie the model used for MPP scheduling:

• The MPP scheduler is in control of all resources.

• All resources lie within a single administrative domain.

• The resource pool is invariant.

• The impact caused by contention from other applications in the system on application execution performance is minimal.

• All computation resources and all communication resources exhibit similar performance characteristics.

None of these assumptions hold in typical grid environments. The grid high-performance scheduler is rarely in control of all resources, which may lie in a number of administrative domains. The resource pool will vary over time as new resources are added, old resources are retired, and other resources become available or unavailable. Other users will share the system and may dramatically impact the performance of system resources. Finally, resources are of different types and may exhibit highly nonuniform performance characteristics. Even uniform resources may exhibit nonuniform performance characteristics because of variations in load resulting from other users sharing the system.

Because the fundamental model of MPP scheduling makes incorrect assumptions about grid environments, the optimal schedules as determined by this model typically do not perform well in practice. Consequently, a new scheduling model (and new scheduling techniques) must be developed for the grid. Such a model must reflect the complex and dynamic interactions between applications and the underlying system.

12.1.2 Lessons Learned from Application Scheduling

Before we turn to the challenge of developing an adequate high-performance scheduling model for computational grids, it is useful to review the experiences of programmers scheduling applications on parallel and distributed platforms. It is clear from the accumulated experience of both MPP and grid programmers, users, and application developers that the choice of a scheduling model can make a dramatic difference in the performance achieved by the application (e.g., [508, 58, 371]). Let’s review some of the lessons learned from application scheduling in MPP and grid environments.

Efficient application performance and efficient system performance are not necessarily the same. In both MPP and grid environments, achieving system performance, resource performance, and application performance may present conflicting goals. In particular, it is unrealistic to expect the job scheduler or resource scheduler to optimize application performance. In grid environments, specific application schedulers must be developed in order for the application to leverage the system’s performance potential.

It may not be possible to obtain optimal performance for multiple applications simultaneously. In the MPP environment, if N processors are available and applications A and B both require N − 1 processors in single-user mode to achieve minimal execution time for their given problem sizes, then both cannot be executed concurrently with the best performance. In grid environments, networked resources may be shared, and A and B may both be able to obtain the same resources concurrently. However, each application may slow down or degrade the performance of the other [191], diminishing the resulting performance of both applications.

Load balancing may not provide the optimal application scheduling policy. In grid environments, the performance deliverable by a given resource will vary over time, depending on the fraction of the resource allocated to other programs that share the system. Assigning equivalent amounts of work to a set of processors whose load will vary may result in a performance degradation occurring when lightly loaded processors wait for more heavily loaded processors to finish. Moreover, communication is also “work” on computational grids, so the impact of distributing data over shared networks may incur additional performance penalties because of variation in network load and traffic.

The application and system environment must be modeled in some detail in order to determine a performance-efficient schedule. All scheduling is based implicitly or explicitly on a predictive performance model. The accuracy and quality of predicted behavior as determined by this model are fundamental to the effectiveness of the scheduler. Experience shows that simple performance models permit analysis but often yield poor schedules in practice. Grid performance models must be sufficiently complex to represent the phenomena that impact performance for real programs at the problem sizes of interest, but tractable enough to permit analysis and verification.

Fundamentally, MPP scheduling policies and performance models are inadequate for computational grids because “good” MPP schedules do not correlate with the “good” schedules for grid programs observed in practice. The challenge is to develop grid scheduling policies and performance models so that the good schedules as determined by a grid scheduling model will correlate with good application schedules as observed in practice. Moreover, this should be true with respect to the domain of programs that are actually likely to be executed.

In the next section, we discuss the problem of developing adequate performance models and scheduling policies for computational grids.

12.2 DEVELOPING A GRID SCHEDULING MODEL

As we indicated in the preceding section, the effectiveness of high-performance schedulers is based on the development of adequate scheduling models for computational grids. Why is application performance on grids so difficult to model? Much of the difficulty can be derived from the impacts of hardware and software heterogeneity and from variations in deliverable resource performance because of contention for shared resources. To predict performance in this dynamic distributed environment, models must represent grid characteristics that impact application performance. In particular, the challenge is to develop a grid scheduling model that can do the following:

• Produce performance predictions that are timeframe-specific: Since the deliverable performance of system resources and application resource requirements vary over time, predictions of execution performance must also vary over time.

• Utilize dynamic information to represent variations in performance: Since computational grids are dynamic, application performance may vary dramatically over time and per resource. Performance models can reflect evolving system state by utilizing dynamic parameters. In addition, such attributes as the range or accuracy of dynamic values can provide important metainformation that can be used to develop grid-aware schedules.

• Adapt to a wide spectrum of potential computational environments: Applications may have a choice of potential platforms for execution. Performance prediction models must be able to target distinct execution environments and adapt to the deliverable performance of the resources within those environments. While dynamic information helps models perceive performance variations, adaptation provides a way for models to respond to their impact. One technique that fosters adaptation is to develop models in which parameters can change, or alternative models can be substituted, based on dynamic characteristics of the application and the target execution platform.

Many approaches to the development of application scheduling models are documented in the literature. Early recognition of the multiparametered nature of performance and program models can be seen in the work on optimal selection theory [214]. In addition, a number of sophisticated scheduling policies have been devised to address the grid scheduling problem (e.g., [510, 107, 71, 331, 503, 373, 346, 233, 261, 184, 315, 509]).

One promising approach to developing grid models is to compose models from constituent components that reflect application performance activities. This approach is being taken by a number of researchers (e.g., [494, 532, 35, 580, 197]). To illustrate, let us consider a simple model that predicts execution time for a grid application that executes task A to completion and passes all data to task B, which then executes to completion. (Some grid applications that compute and then visualize the resulting data at a visualization or immersive site have this form.) A performance model for this application is

ExecTime(t1) = CompA(t1) + Comm(t2) + CompB(t3)

where the CompA(t1), Comm(t2), and CompB(t3) components provide predictions of their performance activities when initiated at times t1, t2, and t3, respectively, and are composed (by summing) to form a time-dependent prediction of the execution time performance (ExecTime(t1)). Note that each of the constituent models (CompA(t1), CompB(t3), and Comm(t2)) may themselves be decomposed into other constituent component models and/or parameters that reflect performance activities.
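
This compositional structure translates directly into code; the sketch below assumes a hypothetical ComponentModel interface and derives t2 and t3 from the predicted completion times of the preceding components.

// Compositional performance model for the two-task application:
// ExecTime(t1) = CompA(t1) + Comm(t2) + CompB(t3). The ComponentModel
// interface is a hypothetical stand-in for the constituent models.
interface ComponentModel {
    double predict(long t);   // predicted duration if initiated at time t
}

class TwoTaskModel {
    private final ComponentModel compA, comm, compB;

    TwoTaskModel(ComponentModel a, ComponentModel c, ComponentModel b) {
        compA = a; comm = c; compB = b;
    }

    double execTime(long t1) {
        double a = compA.predict(t1);
        long t2 = t1 + (long) a;          // communication starts when A ends
        double c = comm.predict(t2);
        long t3 = t2 + (long) c;          // B starts when the data arrives
        return a + c + compB.predict(t3);
    }
}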

For this application, as for many grid applications, the complexity of the modeling process is not in the overall structure of the application, but in the parameterization of its components—in other words, “the devil is in the details.” In particular, the way in which parameters are used to derive component model predictions critically impacts how well the model reflects expected application performance. In the following, we briefly describe how compositional models manifest the desired characteristics of grid scheduling models described in the preceding subsection.

12.2.1 Timeframe-Specific Predictions

In grid environments, the execution performance for the application will vary. This is captured by the parameterization of ExecTime(t1), CompA(t1), CompB(t3), and Comm(t2) by time parameters in the model. Each time parameter is the time for which we would like a prediction of application performance, with t1 being the time the application execution will be initiated. CompA(t1), CompB(t3), and Comm(t2) are also time dependent in another way: they are calculated by using dynamic parameters, as described below.

12.2.2 Dynamic Information

In a production environment, computation time may depend upon CPU load(s), and communication performance may depend upon available network bandwidth. Such parameters may vary over time because of contention from other users. Predictions of these values at schedule time may be reflected by dynamic parameters to the CompA(t1), CompB(t3), and Comm(t2) components in the performance model.

For example, assume that task A iteratively computes a particular operation. A performance model for CompA(t1) on machine M might be

CompA(t1) = (Niters × Oper/pt) / CPUavail(t1)

where Niters is the number of iterations, Oper/pt is the operation cost per point when M is unloaded, and CPUavail(t1) is the predicted percentage of CPU available for M at time t1. The use of the dynamic CPUavail(t1) parameter provides a time-dependent prediction for CompA(t1), which can be combined with other models to form a time-dependent prediction for ExecTime(t1).
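
A constituent model of this form might look as follows in code, with the CPU-availability forecast supplied by a hypothetical dynamic-information source (a role played in practice by facilities such as the Network Weather Service, discussed later in this chapter):

// CompA(t1) = (Niters * Oper/pt) / CPUavail(t1), with the dynamic CPU
// availability term supplied by a hypothetical forecast interface.
class CompAModel {
    interface CpuForecast {
        double availableFraction(long t);   // predicted fraction in (0, 1]
    }

    private final long nIters;        // number of iterations
    private final double operPerPt;   // per-point cost on an unloaded M
    private final CpuForecast forecast;

    CompAModel(long nIters, double operPerPt, CpuForecast forecast) {
        this.nIters = nIters;
        this.operPerPt = operPerPt;
        this.forecast = forecast;
    }

    double predict(long t1) {
        return (nIters * operPerPt) / forecast.availableFraction(t1);
    }
}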

12.2.3 Adaptation

The performance model must target the execution platform that will be used by the application. A common grid scheduling policy is to compare predictions of application performance on candidate sets of resources to determine the best schedule and execution platform for the application. Under this policy, the performance model must be able to adapt to distinct execution environments and produce accurate (and comparable) predictions of behavior on each of them.

In our simple example, task A must complete before it communicates with task B. If overlapped communication and computation were possible, the application would have to be modeled to reflect the more complex interplay of communication and computation. For example, with overlapped communication and computation, it may be more appropriate to replace + by max or a pipeline operator to reflect the way in which computation for task A, computation for task B, and communication between the processors on which they reside are coordinated. In this way, the performance model can be adapted to reflect the performance characteristics of the application with respect to a particular execution environment. Such an approach is taken in [532], which describes a compositional model for predicting lag time in interactive virtual reality simulations.
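
Substituting the composition operator is a one-line change if the model is written over a generic operator; the following minimal sketch contrasts the sequential (sum) and fully overlapped (max) cases with made-up component times.

import java.util.function.DoubleBinaryOperator;

// Adapting the model by swapping its composition operator: sum for the
// sequential case, max when communication and computation fully overlap.
class AdaptiveComposition {
    static double compose(double compA, double comm, double compB,
                          DoubleBinaryOperator op) {
        return op.applyAsDouble(op.applyAsDouble(compA, comm), compB);
    }

    public static void main(String[] args) {
        double a = 12.0, c = 5.0, b = 8.0;  // made-up component predictions
        System.out.println(compose(a, c, b, Double::sum)); // 25.0: sequential
        System.out.println(compose(a, c, b, Math::max));   // 12.0: overlapped
    }
}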

Adaptation can also be used to weight the relative impact of the performance activities represented in a performance model. For example, if our example application is compute intensive, it may be quite important to derive an accurate model (exhibiting a small error between modeled and actual performance) for CompA(t1), CompB(t3), or both, and less important to derive an accurate model for Comm(t2). Our primary goal is for ExecTime(t1) to be able to predict application execution time within acceptable accuracy, and it may be possible to combine the accuracies of each of the predictions of CompA(t1), CompB(t3), and Comm(t2) to deliver a performance prediction of ExecTime(t1) with the desired accuracy [495].

Note that the accuracy, lifetime, and other characteristics of performance parameters and predictions constitute metainformation (attributes that describe the determination or content of information), which provides a qualitative measure of the performance information being used in (produced by) the model. Such performance metainformation can be used to derive sophisticated high-performance scheduling strategies that combine the accuracies of the performance models and their parameters with the performance penalties of deriving poor predictions. This approach can be used to address both the accuracy and the robustness of derived application schedules.

Compositional scheduling models provide a mechanism for representing both the high-level structural character of grid applications and the critical details that describe the dynamic interaction of the application with the grid environment. As such, they constitute a promising approach to developing high-performance scheduling models.

Whatever approach is used to develop grid scheduling models, such models will have a significant impact on the development of grid software architecture. In particular, scheduling models employed by high-performance schedulers will need to both provide predictive information for a particular timeframe and utilize dynamic performance information extracted from the grid software infrastructure. These activities will need to be performed in a flexible, extensible, and computationally efficient manner, as well as in and for a timeframe suited to the application.

In the next section, we focus on the most visible current efforts in developing high-performance schedulers for grid environments. The spectrum of scheduling models developed for these efforts represents the state of the art in modeling approaches for high-performance grid schedulers.

12.3 CURRENT EFFORTS

There are a number of exciting initial efforts at developing schedulers for grid systems. In this section, we focus on a representative group of these pioneering efforts to illustrate the state of the art in high-performance schedulers. (For more details on each project, see the references provided.)

We do not provide a valuation ranking of these projects. With high-performance schedulers, as with many software projects, it is often difficult to make head-to-head comparisons between distinct efforts. Schedulers are developed for a particular system environment, language representation, and program domain, and many research efforts are incomparable. Even when distinct high-performance schedulers target the same program domains and grid systems, it may be difficult to devise experiments that compare them fairly in a production setting. Fair comparisons can be made, however, in an experimental testbed environment. In grid testbeds, conditions can be replicated, comparisons can be made, and different approaches may be tested, resulting in the development of mature and more usable software. (Current efforts to develop grid testbeds are described in Chapter 22.)

Table 12.1 summarizes major current efforts in developing high-performance grid schedulers. The way in which the scheduling model is developed for each project in Table 12.1 illustrates the spectrum of different decisions that can be made, and experience with applications will give some measure of the effectiveness of these decisions. In the next subsections, we discuss these efforts from the perspective of the constituent components of their scheduling models.

12.3.1 Program Model

Program models for current high-performance schedulers generally represent the program by a (possibly weighted) data-flow-style program graph or by a set of program characteristics (which may or may not include a structural task dependency graph).

Data-flow-style program graphs are a common representation for grid programs. Dome [29] and SPP(X) [35] provide a language abstraction for the program, which is compiled into a low-level program dependency graph representation.

TABLE 12.1: Representative high-performance grid scheduler projects.

AppLeS [57]
  Program model: communicating tasks.
  Performance model: application performance model parameterized by dynamic resource performance capacities.
  Scheduling policy: best of candidate schedules based on user’s performance criteria.
  Remarks: Network Weather Service [574] used to forecast resource load and availability.

MARS [225]
  Program model: phased message-passing programs.
  Performance model: dependency graph built from program and used to determine task migration.
  Scheduling policy: determines candidate schedule that minimizes execution time.
  Remarks: program history information used to improve successive executions.

Prophet [566]
  Program model: Mentat SPMD and parallel pipeline programs.
  Performance model: execution time = sum of communication and computation, parameterized by static and dynamic information.
  Scheduling policy: determines schedule with the minimal predicted execution time.
  Remarks: focuses primarily on workstation clusters.

VDCE [541]
  Program model: programs composed of tasks from mathematical task libraries.
  Performance model: task dependency graph weighted by dedicated task benchmarks and dynamic load information.
  Scheduling policy: list scheduling used to match resources with application tasks.
  Remarks: communication weighted as 0 in current version; parameterized by experiments and an analytical model in next version.

SEA [514]
  Program model: dataflow-style program dependence graph.
  Performance model: expert system that evaluates “ready” tasks in program graph.
  Scheduling policy: “ready” tasks enabled in program graph are next to be scheduled.
  Remarks: resource selection performed by a Mapping Expert Advisor (MEA).

I-SOFT [200]
  Program model: applications that couple supercomputers, remote instruments, immersive environments, data systems.
  Performance model: developed by users; static capacity information used for scheduling some applications.
  Scheduling policy: centralized scheduler maintains user queues and static capacities; applications scheduled as “first come, first served.”
  Remarks: users select own resources; scheduling approach used for I-WAY at SC ’95.

IOS [87]
  Program model: realtime, iterative automatic target recognition applications.
  Performance model: applications represented as a dependency graph of subtasks, each of which can be assigned one of several possible algorithms.
  Scheduling policy: offline genetic algorithm mappings indexed by dynamic parameters used to determine mapping for current iteration.
  Remarks: approach uses dynamic parameters to index offline mappings.

SPP(X) [35]
  Program model: base serial language X and structured coordination language.
  Performance model: compositional performance model based on skeletons associated with program structure.
  Scheduling policy: determination of performance model for candidate schedules with minimal execution time.
  Remarks: skeleton performance models can be derived automatically from program structure.

Dome [29]
  Program model: SPMD C++ PVM programs.
  Performance model: program rebalanced based on past performance, after some number of Dome operations.
  Scheduling policy: globally controlled or locally controlled load balancing.
  Remarks: initial benchmark data based on short initial phase with uniform data distribution.

MARS [225] assumes that programs are phased (represented by a sequence of program stages or phases) and builds a program dependency graph as part of the scheduling process. SEA [514] and VDCE [541] represent the program as dependency graphs of coarse-grained tasks. In the case of VDCE, each of the tasks is drawn from a mathematical task library.

In other efforts, the program is represented by its characteristics. AppLeS [57] and I-SOFT [200] take this approach, representing programs in terms of their resource requirements. AppLeS and I-SOFT focus on coarse-grained grid applications. IOS [87], on the other hand, represents realtime, fine-grained, iterative, automatic target recognition applications. Each task in IOS is associated with a set of possible image-processing algorithms. This approach combines both the program graph and resource requirements approaches to program modeling.

12.3.2 Performance Model

A performance model provides an abstraction of the behavior of an application on an underlying system and yields predictions of application performance within a given timeframe. Current high-performance schedulers employ a wide spectrum of approaches to performance modeling; however, parameterization of these models by both static and dynamic information is common. Approaches differ in terms of who supplies the performance model (the system, the programmer, some combination), its form, and its parameterization.

At one end of the spectrum are scheduler-derived performance models. SPP(X) [35] derives “skeleton” performance models from programs developed using a structured coordination language. Similarly, MARS [86] uses the program dependency graph built during an iterative execution process and parameterized by dynamic information as its performance model for the next iteration. Dome [29] uses the last program iteration as a benchmark for its SPMD programs and as a predictor of future performance. IOS [87] associates a set of algorithms with each fine-grained task in the program graph and evaluates prestored offline mappings for this graph indexed by dynamic information. VDCE uses a representation of the program graph as a performance model, in which the scheduler evaluates candidate schedules based on predicted task execution times. All of these approaches require little intervention from the user.

At the other end of the spectrum are user-derived performance models. AppLeS [57] assumes that the performance model will be provided by the user. Current AppLeS applications rely on structural performance models [494], which compose performance activities (parameterized by static and dynamic information) into a prediction of application performance. The I-SOFT scheduler [200] assumed that both the performance model and the resulting schedule were determined by the programmer. Information about system characteristics was available, but usage of this information was left up to the programmer.

Some approaches combine both programmer-provided and scheduler-provided performance components. Prophet [566] provides a more generic performance model (ExecutionTime = Computation + Communication) for its SPMD programs, parameterized by benchmark, static, and dynamic program capacity information. SEA [514] uses its data-flow-style program graph as input to an expert system that evaluates which tasks are currently “ready.” These approaches require both programmer and scheduler information.

12.3.3 Scheduling Policy

The goal of a high-performance scheduler is to determine a schedule that optimizes the application’s performance goal. This performance goal may vary from application to application, although a common goal is to minimize execution time. The current efforts in developing high-performance schedulers utilize a number of scheduling policies to accomplish this goal.

In many current efforts, the scheduling policy is to choose the “best” (according to a performance model and the performance criteria) from among the candidate resource choices. Some schedulers perform resource selection as a preliminary step to filter the candidate resource sets to a manageable number, and some schedulers do not.

AppLeS [57] performs resource selection as an initial step, and its default scheduling policy chooses the best schedule among the resulting candidates (scheduled by a “Planner” subsystem) based on the user’s performance criteria (which may not be minimal execution time). Other scheduling policies may be provided by the user. SPP(X), Prophet [566], and MARS [86] use similar approaches, although they do not provide as much latitude for user-provided scheduling policies or performance criteria: the performance goal for all applications is minimal execution time.

VDCE [541] uses a list scheduling algorithm to match resources with application tasks. The performance criterion is minimal execution time. Dome [29] focuses on load balancing as a scheduling policy for its SPMD PVM programs with the performance goal of minimal execution time. The load-balancing policy used by Dome can be globally controlled or locally controlled and, after executing a short initial benchmark, uses dynamic capacity information to rebalance at Dome-specified or programmer-specified intervals. The scheduling policy used by SEA is embodied in its expert system data flow approach: the application is scheduled by enabling tasks as they become “ready” in the program graph.

I-WAY was a successful “proof of concept” experiment in grid computing at Supercomputing ’95. The I-SOFT scheduler [200] was centralized and operated on a “first come, first served” policy. Information about static capacities and user queues was used by many users to develop schedules for their own applications.

Finally, IOS [87] uses a novel approach for scheduling fine-grained automatic target recognition (ATR) applications. Offline genetic algorithm mappings are developed for different configurations of program parameters prior to execution. Dynamic information is then used to select a performance-efficient mapping at run time. ATR applications may be rescheduled at the end of every iteration if a new schedule is predicted to perform better than the existing schedule.

12.3.4 Related Work

Application scheduling is performed not only by high-performance schedulers. There are also a number of problem-solving environments, program development tools, and network-based libraries that act as “applications” and use high-performance scheduling techniques to achieve high performance. Application-centric resource management systems such as Autopilot [475] (see Chapter 15) seek to control resource allocation based on application performance and represent application-aware systems. Autopilot controls resource allocation based on application-driven events. A fuzzy logic model is used to determine allocation decisions based on the “quality” of monitored performance data.

Ninf [366], NetSolve [104], Nile [363], and NEOS [146] represent problem-solving environments that can benefit from high-performance scheduling to achieve performance. For example, Ninf incorporates “metaserver” agents to gather network information to optimize client-server-based applications. The system schedules accesses from a client application to remote libraries, with the performance goal of optimizing application performance. More information about Ninf, NetSolve, NEOS, and other related work can be found in Chapter 7.

12.4 CASE STUDY: THE AppLeS PROJECT

To demonstrate the operation of high-performance schedulers in more detail, we present an overview of the AppLeS high-performance scheduler. AppLeS (Application-Level Scheduler) [57] is a high-performance scheduler targeted to multiuser distributed heterogeneous environments. Each grid application is scheduled by its own AppLeS, which determines and actuates a schedule customized for the individual application and the target computational grid at execution time.

AppLeS is based on the application-level scheduling paradigm, in which everything in the system is evaluated in terms of its impact on the application. As a consequence of this approach, resources in the system are evaluated in terms of predicted capacities at execution time, as well as their potential for satisfying application resource requirements.

The target platform for AppLeS applications is intended to be a distributed wide area and/or local area network that connects computational resources, data resources, visualization resources, and “smart instruments.” No resource management system is assumed; however, AppLeS applications are currently being targeted to the Globus [202] and Legion [248] software systems and their testbeds (see Chapters 11 and 9, respectively) and the PACI metasystems. AppLeS operates at the user level and does not assume any special permissions. Neither computation nor communication resources are assumed to be homogeneous, nor are they assumed to be under uniform administrative control. All resources are represented by their deliverable performance as measured by such characteristics as predicted capacity, load, availability, memory, and quality of performance information.

The application program model assumes an application comprising communicating tasks. No specific programming paradigm or language representation is assumed. AppLeS agents utilize user preferences (log-in information, libraries, the user’s performance measure) and application-specific and dynamic information to determine a performance-efficient custom schedule. The user provides an application-specific performance prediction model that reflects the components of the application, their composition, and their impact on application performance. The user may also provide information about the resource requirements of the application. In the default scheduling policy, the AppLeS agent selects candidate resource configurations, determines an efficient schedule for each configuration, selects the “best” of the schedules according to the user’s performance measure, and actuates that schedule on the underlying resource management system. Users may also provide their own scheduling policy. The goal of the AppLeS agent is to achieve performance for its application in the dynamic grid environment. The AppLeS approach is to model the user’s scheduling process. However, since AppLeS schedulers run at machine speeds, they can consider much more widely varied information, thereby enhancing the capability of the user.

A facility called the Network Weather Service [574] is used to provide dynamic forecasts of resource load and availability to each AppLeS. Dynamic Network Weather Service information is used to parameterize performance models and to predict the state of grid resources at the time the application will be scheduled.

AppLeS and the Network Weather Service demonstrate that dynamic information, prediction, and performance forecasting can be used effectively to achieve good schedules. Figure 12.1 shows an experiment involving a demonstration AppLeS developed for a distributed Jacobi application. The Jacobi code used was a regular two-dimensional grid application that iteratively performed a computation at each grid point after receiving updates from its northern, western, southern, and eastern neighbors on the grid. The application was decomposed into strips and time-balanced to achieve its performance goal of minimal execution time. (In time balancing, all processors are assigned some possibly nonuniform amount of work with the goal that they will all finish at roughly the same time.) Performance at schedule time was modeled by using a compositional performance model parameterized by dynamic parameters representing CPU load and available bandwidth. The computation and communication component models were chosen to reflect resource performance in the distributed environment. More information about the Jacobi2D AppLeS can be found in [58].
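
Time balancing of this kind reduces to assigning each processor a strip whose width is proportional to that processor’s predicted capacity; a minimal sketch follows, with the capacity values assumed to come from a forecast source such as the Network Weather Service.

// Time balancing for a strip-decomposed 2D grid: give each processor a
// strip width proportional to its predicted capacity, so that all strips
// are predicted to finish at roughly the same time. A sketch; the
// capacity values would come from dynamic forecasts.
class TimeBalancer {
    static int[] stripWidths(int nColumns, double[] predictedCapacity) {
        double total = 0.0;
        for (double c : predictedCapacity) total += c;

        int[] widths = new int[predictedCapacity.length];
        int assigned = 0;
        for (int i = 0; i < widths.length; i++) {
            widths[i] = (int) (nColumns * predictedCapacity[i] / total);
            assigned += widths[i];
        }
        widths[widths.length - 1] += nColumns - assigned;  // leftover columns
        return widths;
    }
}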

The experiments represented by Figure 12.1 compared three partitionings:

1. A blocked HPF-style partitioning, which decomposed the N × N Jacobi grid into equal-sized blocks

2. A nonuniform strip partitioning, which used resource capacity measurements taken at compile time to assign work to processors

3. A runtime AppLeS strip partitioning, which used resource capacity measurements predicted by the Network Weather Service at run time

FIGURE 12.1: Execution times for blocked, nonuniform strip, and AppLeS partitionings for a distributed 2D Jacobi application. All resources were used for each partitioning. (The plot shows execution time in seconds against problem sizes from 1,000 to 2,000 for the HPF uniform/blocked, nonuniform strip, and AppLeS partitionings.)

The experiments were run in a local area production environment consisting of distributed heterogeneous workstations in the Parallel Computation Laboratory at the University of California at San Diego (UCSD) and at the San Diego Supercomputer Center (SDSC). Both computation and communication resources were shared with other users. In the experiment shown in Figure 12.1, all processors were used by all partitionings. The figure demonstrates that dynamic information and prediction can be used to provide a performance advantage for the application in a production environment.

Figure 12.2 illustrates the effectiveness of adaptive scheduling. In this set of experiments, the gateway between the Parallel Computation Laboratory at UCSD and the SDSC was rebooted, and the two sites were connected by a slower network link for a short period of time. Experiments conducted during this timeframe (for problem sizes around n = 1,800) show that the AppLeS agent (which used resource selection in these experiments) assessed the poor performance of the gateway and chose an alternative resource configuration, maintaining the performance trajectory of the application. The uniform blocked partitioning was unable to adapt to dynamic conditions and consequently exhibited poor performance for the application. Such results are representative of experiments with Jacobi2D and other prototype AppLeS applications currently being developed, and demonstrate the performance potential of an adaptive scheduling approach.

FIGURE 12.2: Execution times for blocked and AppLeS partitionings when a gateway rebooted. Estimated times using the Jacobi performance model and measured times for the Jacobi AppLeS are also compared. (The plot shows execution time in seconds against problem sizes from 0 to 2,500 for the uniform blocked partitioning and for measured and predicted AppLeS with selection.)

12.5 SCHEDULING DIGITAL SKY SURVEY ANALYSIS

The preceding sections have focused on the development of high-performance schedulers for computational grids. In this section, we discuss how such a scheduler might achieve performance for a specific grid application. In particular, we focus on the Digital Sky Survey Analysis (DSSA) application described in Chapter 5 to illustrate the scheduling issues involved.

The analysis of Digital Sky Survey data requires a distributed infrastructure for accessing and handling a large amount of distributed data. The “application” itself is the analysis of this data to answer statistical questions about observed objects within the universe. The application must discover where the relevant data resides and perhaps migrate data to a remote computational site to perform the analysis.

DSSA is representative of an important class of scalable distributed database problems (see, e.g., [363] and Chapter 4) that require resources to be scheduled carefully in order to achieve high performance. In developing a high-performance scheduler for DSSA, the following issues must be addressed.

Efficient strategies must be developed for discovering the sites where relevant data resides and for generating the data sets. The high-performance scheduler must determine how the relevant data is, and should be, decomposed among the Palomar, Sloan, and 2-Mass databases, and which data sets will be required for the analysis. To generate the data set needed to perform a statistical analysis, data may need to be sorted into a temporary file on local disk, with the entire data set accessed by the statistical analysis program once the data set is complete.

Resource selection and scheduling strategies must be developed. For DSSA, the set of potential data servers currently is small (Palomar, Sloan, 2-Mass, and archives that replicate the information); however, the set of potential compute servers may be large. If the required analysis is small enough, it could be performed directly at the data server. Alternatively, if a large fraction of the database must be manipulated, analysis could be moved to another location that delivers greater execution capacity (e.g., if the data is already cached).

The high-performance scheduler must determine whether data sets will be moved to site(s) where the statistical analysis will be performed or whether the statistical analysis will be performed at the data server(s). For DSSA, the scheduler must determine a candidate resource set and schedule that can be accomplished in the minimum execution time.

Both resource selection and scheduling decisions will be based on a DSSA performance model. This model must reflect the cost of data decomposition as well as the costs of migrating data and/or computation, and it may build on database mechanisms for estimating the cost of execution, augmented by models of statistical analysis operation. Communication and computation capacities in the model could be assessed from dynamic information and should be predicted for the timeframe in which the application will be executed.

A user interface must be developed for the application. The user interface would provide an accessible representation for the potentially large client base of the application. In addition, the user interface should be structured so that it could be extended to include additional potential compute resources and data servers. If the interface is Web based, the time it takes to transmit the request from the client site over the Internet or other networks to potential compute and data sites must be included in the costs as evaluated by the performance model.

We now discuss some more concrete issues relating to the development of a DSSA AppLeS. Recall that each AppLeS conjoins with its application to form a new application that can develop and actuate a time-dependent, adaptive grid schedule. Consequently, the DSSA application would be modified somewhat to allow the DSSA AppLeS to make scheduling decisions for it. The DSSA AppLeS would use as input application-specific information (log-in information, a characterization of application resource requirements, user preferences, etc.) and an adaptable DSSA performance model to characterize the behavior of the application on potential resources. Dynamic information required by the performance model and the AppLeS would be obtained via the Network Weather Service.

A DSSA AppLeS would schedule the application using the following strategy (sketched in code after the list):

1. Select resources by ordering them based on both their deliverable performance and their usage by the application. This information could be determined by developing a representative computation (which would likely include communication, since bandwidth is important to DSSA) that could serve as an application-specific benchmark. The benchmark could be parameterized by dynamic Network Weather Service load information and used to evaluate and prioritize the speed and capacity of potential resources.

2. For each candidate set of resources, plan a schedule based on a compositional performance model, parameterized by dynamic resource information. Metainformation (e.g., lifetime or accuracy) that quantifies the quality of performance information may also be used.

3. From among the candidates, select the schedule that is predicted to achieve minimal execution time.

4. Actuate the selected schedule by interacting with the underlying resource management system to initialize and monitor execution of the application.
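
The four steps can be sketched as a single method; every type below (Resource, Schedule, Planner) is a hypothetical stand-in for the grid services and AppLeS subsystems a real implementation would consult.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// The four-step DSSA AppLeS strategy in outline; all types are
// hypothetical stand-ins for real grid services and subsystems.
class DssaAppLeS {
    interface Resource { double forecastCapacity(long t); }  // e.g., via NWS
    interface Schedule { double predictedExecTime(); void actuate(); }
    interface Planner  { Schedule plan(List<Resource> set, long t); }

    void schedule(List<Resource> all, Planner planner, long startTime) {
        // Step 1: order resources by forecast deliverable performance.
        List<Resource> ranked = new ArrayList<>(all);
        ranked.sort(Comparator.comparingDouble(
                (Resource r) -> r.forecastCapacity(startTime)).reversed());

        // Steps 2-3: plan a schedule for each candidate resource set
        // (here, each prefix of the ranking) and keep the one with the
        // minimal predicted execution time.
        Schedule best = null;
        for (int k = 1; k <= ranked.size(); k++) {
            Schedule s = planner.plan(ranked.subList(0, k), startTime);
            if (best == null
                    || s.predictedExecTime() < best.predictedExecTime()) {
                best = s;
            }
        }

        // Step 4: actuate the selected schedule on the underlying
        // resource management system.
        if (best != null) best.actuate();
    }
}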

Although much of the information that would be required by the DSSA AppLeS would be dynamic or application specific, experience shows that it could be obtained efficiently during scheduling or offline [58]. Since the DSSA application will be used repeatedly by researchers, the time spent building an AppLeS scheduler for it by application developers would be amortized by improved application execution performance for users. Moreover, the DSSA AppLeS would provide a mechanism for achieving high performance as the application scales to a wider set of database servers and potential computational sites.

12.6 TRENDS AND CHALLENGES IN HIGH-PERFORMANCE SCHEDULING

High-performance scheduling for grid environments is evolving into an area of active research and development. The projects described previously provide a sampling of the state of the art. Considering these projects as an aggregate, we can identify a number of trends as well as a set of issues that provide continuing challenges in the development of high-performance schedulers.

12.6.1 Current Trends

We discuss here several trends in high-performance scheduling, as exemplified in various projects and scheduling policies.

Using Dynamic Information

Many high-performance schedulers utilize dynamic information to create adaptive schedules that better reflect the dynamics of changing application resource requirements and dynamically varying system state. This characteristic can be seen, for example, in the MARS project [86] with respect to the retention of performance information for the purpose of rescheduling, in the AppLeS project [57] in terms of the use of dynamic system information via the Network Weather Service, and in the load-balancing approach used in the Dome system [29].

Using Metainformation

A number of scheduling projects use not only information from various sources but also metainformation in the form of a quantitative measure of the quality of the information given. For example, both the lifetime and accuracy of a performance prediction provide important additional metainformation of potential value to high-performance schedulers. Autopilot [475] incorporates metainformation in the fuzzy logic values used for determining resource allocation decisions. AppLeS [57] uses quality-of-information (QoIn) measures to evaluate the quality of predictions derived by structural performance models. Nimrod-G (see Chapter 11) is being extended to allow users to specify metainformation in the form of cost and completion time constraints. Such uses of metainformation parallel some of the important ideas on metadata emerging from the data analysis and data-mining communities (see Chapter 5).

Using Realistic Programs

Much of the early scheduling literature involved the development of scheduling policies that optimized the execution of parallel programs represented by random program graphs. However, in practice, parallel programs have nonrandom communication patterns and program graphs with identifiable structure. There is a trend in the current literature to illustrate the efficiency of schedulers on programs more representative of the high-performance codes that would actually be scheduled. Although it is often infeasible to show that the schedule derived from a given high-performance scheduler is actually “optimal,” schedulers are more frequently shown to be efficient on benchmarks representative of real parallel codes in production environments, engendering confidence that they are likely to develop an efficient schedule for the user’s code.

Restricting the Program Domain

One way of deriving information about application behavior and performance is to restrict the applications to be scheduled to those that lie within a well-defined domain. Several high-performance scheduling efforts target a particular class of programs. IOS [87] targets iterative automatic target recognition programs; Prophet [566] targets SPMD and parallel pipelined Mentat applications; and the PHASE [226] system performs resource selection to support the efficient execution of pharmaceutical applications on computational grids. Restricting the program domain enables a scheduler to better predict application behavior and to use specialized scheduling policies. In this way, good performance can be achieved for a restricted class of programs, in contrast to the less efficient performance that may be achieved by a more broadly based scheduling policy.


Deriving Scheduling Information from Languages

Adaptive schedulers depend heavily on the availability of sufficient information about the resource requirements of the application. Although programmers often know much of this information, the development of an adequate interface for providing application-specific information is a difficult problem. However, some researchers are obtaining useful information for scheduling from programming languages and abstractions. Efforts to develop languages that incorporate information about task decomposition, data decomposition and location, resource requirements, and so on assist in automating the scheduling process. Projects such as SPP(X) [35] and Dome [29] provide a promising approach to obtaining grid-relevant performance information from high-level language abstractions.

12.6.2 Challenges

While current efforts to develop high-performance schedulers promise improved application performance, there are still a number of challenges to be met.

Portability versus Performance

The development of portable programs often focuses on architecture independence, and the development of performance-efficient programs often focuses on leveraging architectural features. A continuing challenge for parallel and distributed application developers is to develop code that is both portable and performance efficient. In grid environments, portability may sometimes promote performance by allowing an application a choice of platforms on which to execute. The challenge for the high-performance scheduler is to use good scheduling to minimize the performance impact of architecture independence, and to leverage the availability of multiple resources and the dynamism of grid environments to achieve application performance.

Grid-Aware Programming

Currently, considerable effort must be spent modifying programs before they can experience the benefits of scheduling. An important challenge for programmers is to design applications that can work with high-performance schedulers to leverage the performance potential of computational grids. Such grid-aware programs should be able to adapt to dynamic system state; assess the performance impact of, and possibly negotiate for, resources; and select among and leverage multiple potential platforms.

Scalability

Although application schedules may target a manageable number of resources, the resources themselves may be selected from an ever-widening resource domain. For this reason, it is important that the high-performance scheduler use a scalable approach for resource selection. Generally, the strategy for dealing with large resource sets involves clustering the resources based on some metric (similarity in system characteristics, similarity based on application resource requirements, etc.). However, how resources should be clustered, when they should be repartitioned, and how the clusters should be treated are open questions that require further research.
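
As a concrete illustration of such clustering, the sketch below groups resources into coarse equivalence classes so that selection can rank a handful of clusters rather than thousands of individual hosts. The attribute names and the bucketing metric are illustrative assumptions.

    from collections import defaultdict

    def signature(resource):
        # Coarse equivalence class: architecture plus a memory bucket.
        # The metric is an illustrative assumption; real systems might
        # also use network locality or application requirements.
        return (resource["arch"], resource["memory_mb"] // 64)

    def cluster(resources):
        groups = defaultdict(list)
        for r in resources:
            groups[signature(r)].append(r)
        return groups

    resources = [
        {"name": "n1", "arch": "alpha", "memory_mb": 64},
        {"name": "n2", "arch": "alpha", "memory_mb": 96},
        {"name": "n3", "arch": "sparc", "memory_mb": 64},
    ]
    # The scheduler now ranks two clusters instead of three hosts.
    for sig, members in cluster(resources).items():
        print(sig, [m["name"] for m in members])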

Efficiency

The development of high-performance schedulers that not only promote performance for their applications but do so in a computationally efficient manner presents an important challenge to developers and researchers. Resource selection, performance modeling, and schedule generation can all incur substantial overhead, depending on the techniques used. A useful scheduler cannot take more time to schedule the application than the application would take to execute under any choice of schedule. In addition, there may be trade-offs between the complexity and accuracy of performance models and the intrusiveness and precision of dynamic monitors. The scheduler must maximize the predicted performance of the ultimate schedule while itself running efficiently; developing performance-efficient schedules with low overhead remains a challenge for the developers of high-performance grid schedulers.
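
One simple way to respect this constraint is to scale the scheduling effort to the job at hand, as in the following sketch; the overhead figures and the expected fractional gain are invented for illustration.

    def scheduling_effort(est_runtime_s, cheap_cost_s=0.5,
                          thorough_cost_s=30.0, expected_gain=0.15):
        # Pick a scheduling technique only if the runtime it is expected
        # to save exceeds its own cost (all figures are illustrative).
        if est_runtime_s * expected_gain > thorough_cost_s:
            return "thorough"   # full performance-model search
        if est_runtime_s * expected_gain > cheap_cost_s:
            return "cheap"      # greedy heuristic
        return "none"           # scheduling would cost more than it saves

    for runtime in (2, 120, 3600):
        print(runtime, scheduling_effort(runtime))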

Repeatability

One of the most critical problems in developing parallel applications on any platform is the ability to repeat a program run and obtain the same results. Repeatability is a key component of scheduler development, since both scheduled programs and the scheduler itself must be tested in a variety of development and production environments before they can be assumed to run correctly and produce useful results. To achieve repeatability, the grid environment must be able to provide trace information about the performance of single-user and shared resources and must be able to impose consistent orderings and constraints on multiple executions of the same application. Many ingenious approaches have been developed to attack the repeatability problem for parallel programs targeted to MPPs. However, most strategies assume that resources are uniform, enjoy the same performance characteristics, and can be loosely synchronized with respect to one another. Such approaches may not be applicable to grids, where heterogeneous performance characteristics, the impact of other users on shared resources, and the asynchronous nature of the system make it extraordinarily difficult to repeat behavior and diagnose problems.

Multischeduling

High-performance schedulers will not operate in a vacuum; they will coexist with a number of scheduling mechanisms, including local resource schedulers, high-throughput schedulers, and other high-performance schedulers. Coordinating multiple schedulers (or multischeduling, also known as co-allocation in [144]) is difficult. For example, just as applications must be scheduled with respect to a particular timeframe, resources must also be allocated with respect to a particular timeframe. The application scheduler and the resource scheduler must cooperate so that the application scheduler can take advantage of the resource at the time the resource scheduler can provide it.

Developing an integrated scheduling subsystem in which each scheduler is able to promote the performance of the programs or resources in its domain is the ultimate scheduling challenge. This challenge must be addressed in order for all schedulers to achieve their performance goals.

In addition to the problem of multischeduling different types of schedulers, multiple independent high-performance schedulers will also need to be coordinated. To satisfy their individual performance goals, multiple high-performance schedulers must select resources and implement schedules for their applications. Thrashing can occur if all schedulers select a particular resource, sense poor performance, and then all select the same alternative resource. This can cause instability in the system and can result in both poor application performance and poor resource performance. Multischeduling strategies must be devised that enable each scheduler to promote its performance goals without compromising the stability of the system.
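
One family of remedies injects randomness into resource selection so that competing schedulers do not move in lockstep. The sketch below, in which the slack parameter is an illustrative assumption, chooses randomly among all resources whose scores are close to the best rather than deterministically taking the single best one.

    import random

    def pick_resource(scored, slack=0.10):
        # Instead of deterministically taking the single best resource
        # (which every competing scheduler would also take), choose
        # randomly among all resources within `slack` of the best score.
        best = max(score for _, score in scored)
        near_best = [name for name, score in scored
                     if score >= best * (1.0 - slack)]
        return random.choice(near_best)

    scored = [("a", 100.0), ("b", 97.0), ("c", 60.0)]
    print(pick_resource(scored))   # "a" or "b"; competing schedulers
                                   # no longer converge on one resource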

12.7 SYSTEM SUPPORT FOR SCHEDULERS

The goal of the high-performance scheduler is the same as that of the user: to leverage the best performance possible from the system for the application at execution time. This task can be made considerably easier and more efficient if the underlying system provides an infrastructure that supports high-performance scheduling. In the following, we outline requirements for an infrastructure to support high-performance schedulers on computational grids. If such an infrastructure were available, greater integration between application scheduling and other grid activities would be possible.

Resource Reservation/QoS Guarantees

Resource reservation and QoS guarantees can allow resources to be dedicated to an application for some timeframe, increasing the predictability of the system. This predictability can allow high-performance schedulers to derive more performance-efficient application schedules.

When multiple applications must share the same resources, resource reservation and QoS guarantees also provide a unifying notion of “goods and services,” which can be used to drive a computational grid “economy.” In particular, resource reservations and QoS guarantees provide a way for competing high-performance schedulers to quantify the impact of other applications on the system and to negotiate a performance-efficient application platform. (Information on QoS and resource reservation can be found in Chapter 19.)

12.7.1 Dynamic Monitoring Mechanisms

Information about dynamic system state and application resource usage can be used effectively by high-performance and other schedulers to derive performance-efficient schedules. Autopilot [475] (Chapter 15) provides an example of a mechanism that gathers resource information (for the purpose of managing resource allocation) based on application-driven events; the Network Weather Service [574] provides an example of dynamic resource information that is gathered at regular intervals.

It is often useful for dynamic resource information to be persistent. Time series analyses and predictive models utilize such information to promote good schedules and require the retention of dynamic information. Mechanisms that retain dynamic information must be extensible, so that new categories of information relevant to an application’s resource usage can be stored, and flexible, so that information can be gathered and accessed in a variety of ways. The Metacomputing Directory System [192] (Chapter 11) provides an example of a database facility that retains dynamic resource information usable by schedulers, such as the number of nodes currently available and the status of the resource.


Although implementations may vary, the information retained in the resource information database must be accessible and useful to the high-performance scheduler. In particular, from the high-performance scheduler’s perspective, the database must provide information useful to applications executing on distinct resources simultaneously, provide both static and dynamic information of interest to the application, provide metainformation that indicates the quality of the resource information (accuracy, lifetime, etc.), support queries from several applications simultaneously, and be accessible in real time.
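
The sketch below is a toy stand-in for such a database (it is not the MDS interface itself): each published value carries a timestamp and a lifetime, and a query reports whether the value is fresh, stale, or undefined, leaving the trust decision to the scheduler.

    import time

    class ResourceDirectory:
        # Toy stand-in for a resource information database; each value
        # is stored with the metainformation (timestamp, lifetime) that
        # the text calls for. Not an actual MDS interface.
        def __init__(self):
            self._data = {}   # (resource, attribute) -> (value, stamp, ttl)

        def publish(self, resource, attribute, value, lifetime_s):
            self._data[(resource, attribute)] = (value, time.time(),
                                                 lifetime_s)

        def query(self, resource, attribute):
            entry = self._data.get((resource, attribute))
            if entry is None:
                return None, "UNDEFINED"
            value, stamp, lifetime_s = entry
            if time.time() - stamp > lifetime_s:
                return value, "STALE"   # caller decides whether to trust it
            return value, "FRESH"

    d = ResourceDirectory()
    d.publish("node1", "free_nodes", 12, lifetime_s=30)
    print(d.query("node1", "free_nodes"))   # (12, 'FRESH')
    print(d.query("node1", "os"))           # (None, 'UNDEFINED')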

12.7.2 High-Level Language Support

High-level language support for scheduling can assist the programmer in the process of developing grid-aware applications and provide important information for high-performance schedulers. The Legion system [248] (Chapter 9) provides an example of such an approach. High-level language primitives (data streams, object method invocations, etc.) provide support for building high-level semantic objects. These objects can be incorporated in various language paradigms and used for scheduling. High-level language support ensures uniform semantics across computational grids. Such support provides an important complement to the necessary low-level services that must also be provided. Since low-level services may change over time, high-level language support plays an important role in defining a consistent set of stable abstractions on which programming models can be built.

12.7.3 Integration with Other Software Tools

To develop grid applications, the programmer will use many tools and facilities, including compilers, problem-solving environments, and libraries. Coordination of the high-performance scheduler with these tools and facilities can improve the performance of both. For example, the compiler can provide a considerable amount of useful information about program structure and resource requirements to the high-performance scheduler. Conversely, scheduling directives within the application can provide useful information to the compiler. Coordinating the activities of compiling and scheduling would enhance them both. (Compilation for grids is addressed in Chapter 8.)

Adaptive scheduling can also enhance the performance of problem-solving environments. Tools such as SCIRun [438] and NetSolve [104] (see Chapter 7) can themselves be scheduled to achieve better performance. The integration of problem-solving environments and high-performance schedulers would provide an execution and development environment in which the quality of application results, as well as application execution performance, could be addressed.

Finally, programmers rely on support for both performance monitoring and evaluation to improve the performance of their applications. Tools such as Pablo [466] and Paradyn [380] can be used to develop performance-efficient programs. High-performance schedulers often require similar sorts of dynamic performance information to make scheduling decisions. Coordination between performance monitoring, evaluation, and scheduling activities would allow such facilities to leverage each other’s information more efficiently and to utilize adaptive techniques for improving application performance. (More information on performance tools can be found in Chapters 14 and 15.)

12.7.4 Assistance for Multischeduling

High-performance schedulers, resource schedulers, and job schedulers will share the same resources and, in many cases, use the same grid infrastructure to manage communication, store or access information, reserve resources, and so on. One approach to providing a uniform interface for multiple schedulers is GRAM [144] (Section 10.2), which provides services to support resource discovery, resource inquiry, MDS access, and other activities useful for grid scheduling.

When multiple schedulers work together, consistency of information and metainformation across the grid becomes especially important. To obtain consistent information, grids will need to support wide-scale data synchronization. Information interfaces will need to have the flexibility to provide consistent system-centric and application-centric information and metainformation in the timeframe appropriate to the requesting component.

12.8 CONCLUSION

Scheduling holds the key to performance in grid environments; however, high-performance application scheduling on computational grids represents a “brave new world” in which much progress needs to be made. Part of the difficulty is that distributed applications, resource management systems, and grid testbeds are all being developed concurrently, comprising the same sort of “shifting sands” that have made the development of software for parallel environments so difficult. Part of the problem lies in the inherent difficulty of the scheduling problem, whose optimal solution is considered infeasible even in the simplest environments.

In spite of the difficulty of the grid scheduling problem, promising efforts are being made. The immense appeal and potential of coordinating networks of resources to attack our most difficult problems have created enormous excitement and interest in computational grids from both the scientific and nonscientific communities. The development of infrastructure and grid-aware applications is progressing at a rapid pace. Currently, the most advanced grid applications are targeted to specific resources known to the developer. Over the next decade, the evolution of infrastructure for the grid and the development of sophisticated high-performance scheduling policies will enable users to target their applications to the grid itself, rather than to specific resources.

In this brave new world, high-performance schedulers and other system components will have an expanded and even more important role in the achievement of application performance. In particular, the development of high-performance schedulers is a critical activity in the establishment of grid environments, serving as a fundamental building block for an infrastructure in which applications must leverage the deliverable performance of diverse, distributed, and shared resources in order to achieve their performance potential.

ACKNOWLEDGMENTS

Special thanks to my colleague Rich Wolski for many substantive discussions on this material and comments on the text. I am also grateful to the editors and to Walfredo Cirne, John Darlington, Salim Hariri, Dan Marinescu, Reagan Moore, Dan Reed, Alexander Reinefeld, Randy Ribler, Jennifer Schopf, H. J. Siegel, Valerie Taylor, Jon Weissman, and the members of the UCSD AppLeS group for useful comments on previous drafts. Finally, I am grateful to NSF, DARPA, and the DoD Modernization Program for support during the development of this chapter.

FURTHER READING

For more information on the topics covered in this chapter, see www.mkp.com/grids and also the following references:


• Two recent papers discuss AppLeS [57, 438].

• Papers by Gehring and Reinefeld [58] and Weissman and Zhao [566] discuss MARS and Prophet, respectively.

• Casavant and Kuhl [107] provide a taxonomy of scheduling for general-purpose distributed computing systems.


CHAPTER 13

High-Throughput Resource Management

Miron Livny
Rajesh Raman

Historically, users of computing facilities have been concerned primarily with the response time of applications, while system administrators have been concerned with throughput. As a paradigm, users judged the power of a system by the time taken to perform a fixed amount of work. Given this fixed amount of computing to perform, the question most users asked was, How long will I have to wait to get the results of this computation? Administrators, who were charged with the responsibility of managing scarce and expensive computing resources, were judged by the utilization and throughput of the facility. Although the average response time and the throughput of a facility are related, they represent two very different perspectives on the performance of a computing environment. In recent years, however, we have experienced a change in these traditional viewpoints.

The dramatic decrease in the cost-performance ratio of computing resources has effectively substituted “response time” for “utilization” as a primary concern of administrators. At the same time, a growing community of users is now concerned about the throughput of their applications. As more scientists, engineers, and decision makers use computers to generate behavioral data on complex phenomena, it is not uncommon to find users who ask the question, How much behavioral data can I generate by the time my report is due? This question represents a paradigm shift. In contrast to other users, these users measure the power of the system by the amount of work performed by the system in a fixed amount of time. For these throughput-oriented users, the fixed time period is measured in the relatively coarse units of days or weeks, and the amount of work is seemingly unbounded: you can never have too much data when studying a biological system, testing the design of a new hardware component, or evaluating the risk of an investment.

The computing needs of these throughput-oriented users are satisfied by high-throughput computing (HTC) environments that can generate large amounts of behavioral data. These users are less concerned with the instantaneous performance of the environment (typically measured in FLOPS) than with the amount of computing they can harness over a month or a year. They measure the performance of a computing environment in units of scenarios per day, wind patterns per week, instruction sets per month, or crystal configurations per year. Given this unbounded need for computing resources, the HTC user community is closely watching activity in the computational grids area and eagerly awaiting the moment it can tap into the vast computational power of these grids.

In this chapter we present important lessons learned, promising directions, and future challenges in the design and implementation of scalable and robust HTC environments. We present these issues as a result of our decade-long interaction with groups of HTC users that include scientists and engineers who employ diverse computation techniques from a wide range of disciplines. These users have been using HTC resources to study a wide spectrum of phenomena, including diesel engines, neural networks, high-energy physics events, computer hardware and software, the structure of crystals, and optimization techniques. Most of them have been customers of the Condor [348, 181] environment that we have developed. Although this chapter is based on our experience with Condor, our objective is by no means to present Condor or to evaluate its capabilities. Since a Condor pool can be viewed as a private computational grid of desktop workstations that are managed for HTC use, we hope that builders of computational grids who would like to provide HTC services (e.g., high-throughput distributed supercomputing, Chapter 3) will find our experience and frameworks useful.

We believe that the experience from these interactions, in terms of lessons learned and promising future directions, is applicable to all types of grid, regardless of whether they are private, virtual, organizational, or public. Furthermore, we expect that the size, scope, heterogeneity, and dynamics of grids will only strengthen the validity of these conclusions. The confidence in these beliefs stems from working with a wide range of customers with real-life computing needs, from maintaining and supporting Condor for more than a decade, and from managing a large HTC production environment at the University of Wisconsin–Madison. (We currently manage a Condor Flock [181] at the University of Wisconsin that consists of more than 500 desktop UNIX workstations and serves users throughout the campus.)


The most important lesson our HTC experience has taught us is that in order to deliver and sustain high throughput over long time intervals, a computing environment must build its resource management services on an integrated collection of robust, scalable, and portable mechanisms. Robustness minimizes downtime; scalability and portability increase the size of the resource pool the environment can draw upon to serve its customers. As will be argued in the next section, a typical environment is physically distributed, its resource pool is heterogeneous and owned by several entities, the availability of resources can change at any time, and new types of resources are continuously added to the pool as older technology is removed. Fragile mechanisms that depend on the unique characteristics of specific computing platforms are likely to have a negative rather than a positive impact on the long-term throughput of the environment.

Four groups of users are served by the mechanisms provided by an HTC environment: resource owners, customers, system administrators, and application writers. The needs and expectations of each of these groups and the role they play in the success of an HTC environment will be discussed in Section 13.2. In the same way that an electric power grid is not just a collection of generators, lines, outlets, and trading policies, but a community that consists of power providers, customers, shareholders, and maintenance crews, an HTC computational grid is a community with its own culture and a unique set of rules. In Section 13.3 we present and discuss a promising suite of matchmaking mechanisms that can bring providers and consumers of computational services together, thus integrating the HTC community.

In the most general case, either party, the provider or the customer, can have the right to break an allocation at any time. A mechanism capable of preserving any partially completed work is thus needed. In Section 13.4 we discuss a user-level checkpointing mechanism by which a snapshot of an executing program can be stored away. The snapshot can later be used to restart the program from that state. A brief overview of commercial and public domain batch systems is given in Section 13.5.

13.1 CHARACTERISTICS OF HTC ENVIRONMENTS

Given the seemingly infinite appetite of its customers for computing power, an HTC environment is continuously on the lookout for additional resources. HTC environments have the mentality of scavengers: the services of a provider of computing power are always accepted, regardless of the resource’s characteristics, degree of availability, or duration of service. Hence, the pools of resources HTC environments draw upon to serve their customers are large, dynamic, and heterogeneous collections of hardware, middleware, and software.

As a result of the recent decrease in the cost-performance ratio of commodity hardware and the proliferation of software vendors, resources that meet the needs of HTC applications are plentiful and come in many different characteristics, configurations, and flavors. A large majority of these resources reside today on desktops, are owned by interactive users, and are frequently upgraded, replaced, or relocated.

The change in the cost-performance ratio of hardware not only improved the power of our desktop machines but also rendered the concept of multiuser timesharing obsolete. In the early days of computing, the idea of allocating a computer to a single person was not sensible, but it has become common practice in recent years. In most organizations, each machine is usually allocated to one individual in the organization to support his or her daily duties. A small fraction of these machines are grouped into small farms and allocated to groups who are considered by management to be heavy users of computing. We believe that the trend to distribute the ownership of resources within organizations will continue, giving full control over powerful computing resources to individuals and small groups. As a result of this trend, while the absolute computing power of organizations has improved dramatically, only a small fraction of this computing power is accessible to HTC users because of the ever-increasing fragmentation of computing resources. In order for an HTC environment to productively scavenge these distributively owned resources, the boundaries marked by owners around their computing resources must be crossed.

However, crossing ownership boundaries for HTC requires that the rights and needs of individual resource owners be honored. Resource owners are generally unwilling to donate their machines for HTC use at the cost of degraded performance or availability. The restrictions placed by owners on resource usage for HTC can be complex and dynamic, involving parameters such as the recent “idleness” of the resource and the characteristics of the customer. These restrictions constrain when, and to which customers, the resource can be allocated.

The constraints attached by owners to their resources prevent the HTC environment from planning future allocations. All the resource manager knows is the current state of the resources. It therefore has to treat them as sources of computing power that should be exploited opportunistically. Available resources can be reclaimed at any time, and resources occupied by their owners can become available without any advance notice. The resource pool is also continuously evolving, as the mean time between hardware and software upgrades of desktop machines is steadily decreasing. Owners are likely to replace their hardware as faster CPUs and larger memories become affordable, and they will install new versions of operating systems, or switch to a new system altogether, soon after announcement.

In addition to ownership boundaries, HTC environments must cross administrative domains as well. The main obstacle to interdomain execution is access to the environment from which the application was submitted, such as input and output files. The HTC environment has to provide means by which an application executing in a foreign domain can access the input and output files that are stored at its home domain. The ability to cross administrative domains not only contributes to the processing capabilities of the environment but also broadens the “customer circle” of the environment, since it becomes easy to connect the computer of a potential customer to the environment. In a way, the HTC environment appears to the user as a huge increase in the processing power of the personal computer, since almost everything looks the same except for the throughput. As in the case of an electric grid, where you do not know who generated the power that cooks your meal, a user of an HTC environment does not know who executed the program that transformed the parameters stored in the input file into the time series that has “miraculously” appeared in the output file.

The applications that perform these transformations usually follow the master-worker computing paradigm, in which a list of tasks is executed by a group of workers under the supervision of a master. The realization of the master and the workers and their interaction may take different forms. The workers may be independent jobs submitted by a shell script that acts as the master and may collect and summarize their outputs, or they can be implemented as a collection of PVM processes that receive their work orders as messages from a master process that expects to receive the results back in messages [459]. Regardless of the granularity of the work units and the extent to which the master regulates the workers, the overall picture is the same: a heap of work is handed over to a master, who is responsible for distributing its elements among a group of workers.
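
The following sketch captures the paradigm in miniature, using Python threads on a single machine; in a real HTC environment the workers would be independent jobs or PVM processes scattered across pools, but the shape of the computation is the same: the master fills a queue with tasks and collects the results.

    import queue
    import threading

    def worker(tasks, results):
        while True:
            task = tasks.get()
            if task is None:                  # sentinel: no more work
                return
            results.put((task, task * task))  # stand-in "computation"

    def master(work_items, n_workers=4):
        tasks, results = queue.Queue(), queue.Queue()
        workers = [threading.Thread(target=worker, args=(tasks, results))
                   for _ in range(n_workers)]
        for w in workers:
            w.start()
        for item in work_items:   # hand the heap of work to the workers
            tasks.put(item)
        for _ in workers:         # one sentinel per worker
            tasks.put(None)
        for w in workers:
            w.join()
        return [results.get() for _ in work_items]

    print(master(range(10)))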

Since workers in one tier can act as masters to workers in a lower tier, hierarchies of masters and workers can easily be formed. These hierarchies may span more than one HTC environment. For example, a group of researchers from the University of Amsterdam has been running its HTC application in six Condor pools located in three different countries and spanning two continents. Over the past three years they have used more than 150 CPU-years to search for global potential energy minima of an N-particle system consisting of Lennard-Jones particles on a spherical surface [554].

At any given time, hundreds of workers of the above application could have been found scattered over the different pools. Some of these workers consumed more than 100 days of CPU time over a lifetime of 4–5 months. In many cases these workers were left unattended as members of the group were away from their desks, at meetings, or on vacation. The group expected the HTC environment not to lose any of these workers before or during their execution phase. Any such loss may have had a significant impact on their throughput. Like most other HTC users we have worked with, they counted on the robustness of the mechanisms used by the environment to successfully take their workers from submission to completion, and were much less concerned about the efficiency of the mechanisms or the policies that control them.

Although losing workers prematurely is clearly what HTC customers worry about most, they obviously have some basic expectations regarding wasted resources and fairness in resource allocation. Given the rapid changes in hardware and operating systems, the biggest potential source of throughput inefficiency is the exclusion of resources caused by the inability of the mechanisms of an HTC environment to operate on new computers. Nothing is more frustrating for an HTC customer than seeing new resources, which are likely to be the biggest and fastest available, excluded from the HTC environment because of porting difficulties.

Simplicity clearly holds the key to the robustness and portability of HTC mechanisms. As will be discussed in the next section, these mechanisms serve not only the customers, but also owners of resources, administrators of the environment, and programmers who write HTC applications. Since computational grids are likely to be large, physically distributed, distributively owned, dynamic, and evolving, just like HTC environments, we believe that the same principles hold for the mechanisms that will support these grids. While users with tight time constraints or demanding quality-of-service needs will expect computational grids to employ sophisticated resource allocation policies, most of them, and the entire community of HTC users, will expect robust services that run anywhere.

13.2 RESOURCE MANAGEMENT LAYERS

The cornerstone of an HTC environment is the resource management system (RMS) that manages its pool of distributively owned resources. The RMS provides resource management (RM) services to its user community, which consists of four groups of people: owners, system administrators, application writers, and customers. The ordering in this list is significant because we believe that owners are the most important group of people in a distributively owned environment. Without resources that have been “donated” by owners for HTC, the RMS ceases to exist. The distinguishing aspect of distributed ownership is that this donation is not unconditional: the RMS must ensure that owners have unhindered access to their resources, that there is no perceived degradation in the availability or performance of the resource during personal use, and that the resource access policy specified by the owner is honored. Next, system administrators must feel confident that the RMS is robust and can run continuously without frequent intervention. If the RMS fails to win the trust of system administrators, its installation and use at a site are not possible. Even in the presence of a robust and reliable system, inflexible and obscure APIs to the services provided by the RMS nullify the generality and power of the RMS, since the available features cannot be effectively harnessed for productive computation. Thus, it is important to address the requirements of application writers during the design and implementation of RMSs. Customers are in many ways the easiest group to please because they are the ultimate beneficiaries of the RMS. However, if the system is not flexible enough to adapt to the requirements of the customers, it will fail to effectively address their concerns and fall into disuse.

Clearly, the requirements of an effective RMS are quite demanding, and building a successful RMS is a complex task. We emphasize here that in addition to the performance-related requirements of the user community, security concerns must be addressed by the HTC environment. In a computational grid, the RMS will rely on the services of other grid components, such as network services (Chapter 19) and security, accounting, and assurance services (Chapter 16), to satisfy all the needs of its users. The success of an RMS can be assessed only when it runs continuously and reliably in “production mode,” with owners and customers who are satisfied by the delivered quality of service and reliability, and with system administrators and application writers who can rely on the robustness and flexibility of the system. These requirements suggest that the system must be built by using a layered approach, with close interaction, monitoring, and control of resources at the bottom level, and with abstractions and interfaces for application developers and customers at the topmost level.

An important point to note is that each layer is defined by its responsibility and the protocols with which it interacts with other layers. Actual implementations of components in layers may vary greatly across the RMS. Thus, for example, it is possible to have different implementations and paradigms of access control at the owner level for different resources, as long as each implementation is compliant with the behavioral specification of owner-layer components. The same argument extends to all layers of the RMS. These layers therefore define the architecture of the system, the granularity of interoperability, and domains of fault containment.

FIGURE 13.1: Layers of an RMS. [Placeholder for the original diagram, which shows two customer/application stacks side by side. Each stack comprises an application layer (application tasks), an application RM layer (RM library and intertask RM), and a customer layer (request queue and interrequest RM); on the resource side, an owner layer (access control) sits above a local RM layer and the resource itself. Resource requests from the customer layers and resource offers from the owner layers flow to a system layer performing global resource management (intercustomer RM), which returns matches; matched applications then claim resource access through the owner layer's access control.]

The principal layers of an RMS are illustrated in Figure 13.1 and enumerated below:

1. Local RM layer: The first layer of the RMS is not really part of the system but rather is logically part of the resource. It is a software layer (e.g., operating system, batch system, or even another computational grid) that provides basic RM services for processes executing in the domain of that resource. Since we are principally interested in the higher-order problem of managing distributed resources rather than local RM, we do not discuss this layer further. Nevertheless, it is a fundamental and important component of a robust HTC environment, because unreliable hardware and local services can seriously affect the sustained operation of an HTC system.

2. Owner layer: The owner layer of the RMS represents the interests of the resource owner. A fundamental purpose of the owner layer is to provide access control mechanisms for the resource that interact with and enforce the owner’s policy. These policies go beyond those otherwise imposed by the RMS itself and imply the necessity of constraints that determine when and to whom the resource may be allocated for HTC. Within the constraints of the owner’s policy, the owner layer also informs the system layer of the characteristics and availability of the resource.

3. System layer: The system layer may be thought of as the global resource allocation layer. Its principal function is matchmaking, that is, matching resource offers and requests so that the constraints of both are satisfied. This matchmaking occurs in the context of high-level policies that implement intercustomer scheduling policy. Although this policy is not directly relevant to the architecture of the RMS, it may dictate when and with whom matchmaking may take place. For example, these policies may enforce fair-share [305], stable-marriage [252], or economic-based [321] matching policies.

4. Customer layer: The customer layer represents the customer’s interests in the RMS. This layer provides the abstraction of a “user” as a queue of resource requests. The primary goals of this layer are to maintain this queue in a persistent and fault-tolerant manner and to interact with the system layer by injecting resource requests for matchmaking, claiming matched resources for the requests, and handing these resources off to the application RM layer. The injection of requests takes place in the context of an interrequest resource management policy, which may dictate, for example, which requests have priority over others or which requests depend on others, so that certain requests are satisfied before others. Another important function of this layer is to provide an interface to the HTC environment for human users and applications.

5. Application RM layer: Once a resource has been claimed by the customer layer, it is passed on to the application RM layer, which implements per-application RM services. The application RM is responsible for communicating with the resource’s access control module to establish the runtime environment for the application. It also provides runtime services for querying, utilizing, and requesting more resources. This functionality is extremely useful because it provides a framework for the development of applications that are adaptive and can grow and exploit resources as and when these resources become available. This functionality is afforded by close interaction of the application RM layer with both the application itself and the customer layer through well-defined interfaces. New requests for resources made by applications appear in the queue of the customer layer, which then negotiates for a resource in the usual way. An additional responsibility of the application RM is the implementation of intertask resource management, which determines which resource will be used in fulfilling which task’s request.

6. Application layer: The application layer represents instances (or tasks) of the customer’s application. These tasks accomplish pieces of the end result of the customer’s computation by utilizing resources handed to the application’s resource manager. Runtime RM services required by the application are forwarded to the application RM layer, which may service these requests directly, or indirectly by acting as an intermediary to the customer layer.

The effectiveness of an HTC environment depends on how well the four different groups that constitute its user community are interwoven. What brings them together are the services provided by the six layers of the RMS. All the layers have to operate in harmony in order to establish an atmosphere of collaboration among the members of the community. Such an atmosphere, which cannot be built without goodwill and mutual respect, holds the key to the success of an HTC environment. As in any community, matchmaking plays a pivotal role in the nature of the relationships developed between owners, system administrators, application writers, and customers in an HTC environment. We therefore start our discussion of HTC resource management mechanisms with a presentation of a matchmaking suite.


13.3 MATCHMAKING AND CLAIMING

In this section, we present the basic requirements of a robust and effective matchmaking suite (or framework). A primary concern of distributed resource management in large computational grids is the scalability, flexibility, and robustness of the matchmaking mechanism. We introduce the concept of classified advertisements (classads), an approach for representing and matching requests and offers for services in a distributed matchmaking framework.

The paradigm of entities advertising their attributes and requirements to a matchmaker is a promising one. This scheme has several advantages.

First, the details of formulating and managing requirements and constraints are the responsibility of the advertisers themselves. This facilitates an end-to-end approach to resource management: the matched entities themselves control the claiming, use, and management of resources and services, without subsequent intervention by the matchmaker, whose responsibility ends after identifying the match. This approach enhances the generality and scalability of the system.

Second, the paradigm does not imply an architecture for the matchmaker. The implementation of the abstract matchmaking service can be parallelized and distributed for better reliability, availability, and performance.

Third, the paradigm is extremely flexible because it is not tied down to any specific type of resource. Indeed, the matchmaker may be used for more abstract services than finding resources, such as finding other matchmakers.

Thus, classads constitute a negotiation-based approach to resource management, in which the advertising entities (and not the matchmaker) assume full responsibility for advertising, claiming, and managing resources and services. We discuss this mechanism in further detail below.

13.3.1 Advertising Offers and Requests

The fundamental problem of an HTC system is resource management. As such, its purpose is to bring resources and customers together to enable productive computation. To perform this matchmaking, the system must first have a method of representing resources and customers. The flexibility and expressiveness of the representation are extremely important because they directly affect the functionality of the resource management system. For example, in a general system of resources, not all resources of interest are compute nodes. Other possibilities include software licenses and network links.


A representation that assumes that every resource is a single compute node would be unable to effectively represent other entities such as storage media, network links, and multiprocessor parallel machines. Thus, a representation must avoid any assumptions about the nature and characteristics of resources. This requirement would allow an HTC environment to represent many heterogeneous resources. Such flexibility, along with the implied dynamic implementation, would facilitate the inclusion and exclusion of resources from the pool at run time.

The distributed ownership of resources implies that owners of resources must be able to restrict the usage of their resources. The mechanism that implements these access policies must be flexible enough to account for both the technical concerns and sociological idiosyncrasies of owners’ policies. For example, the owner of a workstation may demand an access policy that states that the resource is available for HTC only if the keyboard has been idle for over 15 minutes and the background load average is less than 0.3; customer requests made by the owner have higher priority than those made by members of the owner’s research group, which in turn have a higher priority than other requests. Finally, no requests made by members of a competing research group are to be serviced, and customers requiring less than 100 MB of virtual memory are preferred. These restrictions may be thought of as the conditions under which the owner grants the resource to a resource request, or the requirements of the resource offer, which must be honored by the matchmaker.

Similar restrictions may be placed by requests on offers. Malleable parallel applications and large jobs with task dependencies can significantly change their resource requirements during a run. These applications place both qualitative and quantitative constraints on required resources as the task set in question grows and shrinks with time. For example, a customer of the system may state a requirement of at least five machines, and at most fifty. Of these, four machines with over 64 MB of memory are required, and the rest should have at least 32 MB of memory and a MIPS rating of over 80. Furthermore, since the application was compiled for a particular architecture, it requires compute nodes of that same architecture.

Thus, with respect to matchmaking, a symmetry exists in the structure of offers and requests: both need to express their attributes and requirements. This allows us to formulate the basic unit of the matchmaking mechanism as the encapsulation of the attributes and requirements of an entity that requires matchmaking services. The similarity between this mechanism and the classified advertisements found in newspapers prompted us to define this unit of encapsulation as a classified advertisement, or classad.


Before discussing classads in further detail, we note that despite the similarity of our formulation to newspaper classified ads, important differences exist:

• Requests versus offers: A key feature of newspaper classified ads is that it is trivial to determine whether an ad represents a request or an offer, either purely by the contents or by the context of the ad. Although this differentiation is often useful when implementing nontrivial matching policies (e.g., fair matching), our mechanism does not require it. In matching two classads, identifying and distinguishing the entity offering the service is not generally necessary because, as we shall shortly see, the matching process is operational and does not intrinsically depend on any implied semantic content of the ads. However, in the interest of clarity of discussion, we continue to make this distinction.

• Advertisers versus matchmakers: Unlike newspaper classified ads, our framework clearly differentiates between advertisers and matchmakers. This allows one to designate specialized matchmakers that match offers and requests using criteria such as priority, fairness, and preferences. Optionally, these designated matchmakers may be treated as trusted authorities that can grant capabilities or tickets to the matched entities.

13.3.2 Matchmaking Requirements

The assumed model of the HTC environment is that of an open environment: services and customers of different types (including completely new ones) can be added or removed at run time. There is no inherent necessity for a central authority that determines which entities may advertise and what they should advertise for. The following constraints are immediately imposed:

• Portability: Since the specific entities involved in matchmaking cannot be assumed to be of a fixed type or architecture, the mechanism must be portable and architecture independent.

• Self-describing: The advertised services in the environment may vary from compute nodes to software licenses to storage space to network bandwidth. Each resource requires a different description, but the matchmaker must be able to function correctly in a manner that is independent of the specific descriptions.


• Well-defined and robust semantics: Because of the inherent uncertainties in large heterogeneous open environments, the mechanism must have well-defined and robust semantics to handle situations such as when characteristics required by an entity are not correctly represented in the candidate match ad, or when such information is completely absent.

• Decoupled protocols: For maximum robustness and scalability, the mechanism must carefully distinguish and decouple the protocols for advertising, matchmaking, and claiming. Decoupling the advertising and matchmaking protocols allows the matchmaker the freedom of matching asynchronously with respect to advertising clients. Decoupling the claiming and matchmaking protocols relaxes the required degree of information consistency in disseminated advertisements.

The classad mechanism has been designed to address all of the above issues.

13.3.3 The Classad Mechanism

A classad-based matchmaking framework consists of five logically independent components: (1) the evaluation mechanisms, (2) the claiming protocol, (3) the advertising protocol, (4) the matchmaking protocol, and (5) the matchmaking algorithm. Of these components, the evaluation mechanisms are absolute; that is, they are standard and remain fixed across all matchmaking frameworks. In contrast, a given matchmaker defines its own matchmaking algorithm, advertising protocol, and matchmaking protocol, and advertising entities employ a claiming protocol to connect with each other. Thus, the definitions of these components are relative to a given framework.

The matchmaker of a framework defines the following:

• Its advertising protocol, which describes both the expected contents of ads and the means by which it obtains these ads

• Its matchmaking protocol, through which it communicates the outcome of the matchmaking process to the entities involved

• The matchmaking algorithm, which semantically relates the contents of classads to the matchmaking process

The claiming protocol is executed by the entities matched by the matchmaker to connect to each other and perform productive computation. The protocol may also involve a verification phase by the entities involved, in which the match is validated with respect to their current state, which may have changed since the advertisement from which the match was made.


The distinction between absolute and relative components is noteworthy because a classad is defined as an attribute list that has been constructed in conformance to a given matchmaker’s advertising protocol. Thus, an attribute list may be a classad with respect to matchmaker A, but just an arbitrary attribute list to another matchmaker B that defines its relative components differently. Although this distinction is useful for the purposes of design and discussion, we note the following:

• The evaluation mechanisms, which define the semantics of expression evaluation, do not depend on this distinction. The repercussions of nonconformance to an advertising protocol are defined by the matchmaker’s matchmaking protocol and are completely independent of the absolute components, which have well-defined behavior regardless. Specifically, the correctness and performance of a matchmaker are not compromised by an entity that advertises “nonconforming classads” (which are construed to be arbitrary attribute lists).

• In the interest of simplicity, the different aspects of matchmaking are explained with respect to a single matchmaker. Thus, in this discussion, “classad” and “attribute list” are used interchangeably.

Evaluation Mechanisms

An entity that requires matchmaking services expresses its characteristics and requirements as a set of attributes called an attribute list. Each attribute is a binding of an identifier to an expression. The expressions are structurally similar to arithmetic expression constructs in common programming languages and are composed of constants, attribute references (which can refer to attributes in candidate match ads), calls to primitive built-in functions, and other subexpressions combined with operators and parentheses. Figure 13.2 illustrates two example attribute lists.

The key feature of the attribute list that makes it an attractive mechanism for open environments is the semantics of expression evaluation, which are defined so that the uncertainties of an open environment can be handled in a graceful manner. Specifically, the evaluation of an expression is well defined even if required expressions are not available or do not yield values of expected types. In these cases, the evaluation results in the distinguished UNDEFINED and ERROR values, respectively, which can be explicitly tested for.
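To make these semantics concrete, the following minimal Python sketch shows how such three-valued evaluation might behave. The lookup and less_than helpers are illustrative stand-ins for the classad evaluation rules, not part of any actual implementation.

UNDEFINED = "UNDEFINED"
ERROR = "ERROR"

def lookup(ad, name):
    # A missing attribute evaluates to UNDEFINED rather than raising an error.
    return ad.get(name, UNDEFINED)

def less_than(a, b):
    # Comparison that propagates UNDEFINED and maps type mismatches to ERROR.
    if UNDEFINED in (a, b):
        return UNDEFINED
    try:
        return a < b
    except TypeError:
        return ERROR

machine_ad = {"LoadAvg": 0.1, "KbdIdle": 17}
print(less_than(lookup(machine_ad, "LoadAvg"), 0.3))  # True
print(less_than(lookup(machine_ad, "Memory"), 64))    # UNDEFINED: no Memory attribute

A constraint can thus test explicitly for UNDEFINED instead of failing when it references an attribute the other ad does not carry.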

Example 1

    Type        ⇒ "Machine"
    OpSys       ⇒ "OSF/1"
    Arch        ⇒ "Alpha"
    Memory      ⇒ 32
    Disk        ⇒ 782
    KbdIdle     ⇒ 17
    LoadAvg     ⇒ 0.1
    ReplyTo     ⇒ "<chestnut.cs.wisc.edu:5964>"
    Requirement ⇒ (self.LoadAvg < 0.3) && (self.KbdIdle > 15) &&
                  (other.Owner != "foo")

Example 2

    Owner       ⇒ "bar"
    Group       ⇒ "condor team"
    Executable  ⇒ "a.out"
    ImageSize   ⇒ 10
    State       ⇒ "Idle"
    RemoteCPU   ⇒ 0
    ReplyTo     ⇒ "<perdita.cs.wisc.edu:3748>"
    Requirement ⇒ (Type == "Machine") && (OpSys == "OSF/1") &&
                  (Arch == "Alpha")

FIGURE 13.2: Examples of classads. The expressions in these classads may be arbitrarily complex. Attribute references of the form self.X and other.X force lookup of attribute X in the same ad and the candidate match ad, respectively. If the references do not have these prefixes, natural default lookup rules are used.

The evaluation of expressions is usually (but not required to be) carried out by the matchmaker when testing two classads for mutual constraint satisfaction. This test is usually performed by evaluating expressions from well-known attributes (specified by the advertisement protocol) from the two ads and ensuring that they evaluate to TRUE. The matchmaker evaluates these expressions from the classad in an "environment" that contains the two classads being tested. The two classads involved in the match are expected to conform to the advertising protocol, and the number, contents, and scope names of the other classads (which contain default attributes and other match-related information) are fixed by the matchmaking algorithm. This entire set of classads serves as an environment from which attributes can be looked up.

Expressions in classads can refer to any attribute in the environment, including attributes from other classads. This lookup in other ads may be performed explicitly by prefixing attribute references with scope resolution prefixes such as "self," "other," and "env" in the example, which explicitly name the classads from which the attributes will be looked up. The semantics of attribute references without scope resolution is also defined. (See [462], which details the structure and evaluation semantics of classad expressions.)

13.3.4 Matchmaking

The model of matchmaking in the classad framework is that entities that require matchmaking services post classads to a matchmaker, which matches the ads and notifies the advertisers concerned in the event of a successful match. The framework may contain several matchmakers, each of which may be distinguished by one or more features, such as the domain in which it matches (e.g., automobile, furniture), the matching algorithm used, the semantics of a match, and its communication protocol.

Notably absent from the responsibilities of a matchmaker is any notion of allocation. There are several reasons for this intentional omission.

First, in highly dynamic environments, the status of advertising entities may have changed since their last advertisement. The matchmaker may therefore make some invalid matches with regard to the current state of the advertising entities. Entities that receive notification of a match must activate a claiming protocol that both validates the match with regard to their current state and establishes a relationship between the matched entities if the match is deemed valid. This protocol, which involves only the two matched entities and not the matchmaker, is required irrespective of whether we consider the match as an "allocation." In such dynamic environments a match is a substantially weaker operation than an allocation. This operation may be strengthened by having the matchmaker generate a capability as a part of the match. In this case, the match may be considered as "permission" rather than a "hint," but the operation still remains considerably weaker than allocation.

Second, the details of allocation can vary greatly depending on the type of the entity being allocated. These details are best left out of the matchmaker, which may be involved in matching a (possibly unknown) number of types of heterogeneous entities.

Finally, allocation is by nature an asymmetric operation where one entity is allocated to another. Matchmaking is more symmetric, thus allowing more general interactions. Any desired asymmetry can be introduced by the claiming protocol in the context of the matched entities and should not be imposed by the matchmaker itself.

13.3.5 Claiming

Claiming is the process by which the two parties agree to use the services of each other: the provider that serves the request, and the consumer that requests that it be served. Claiming has two important roles to play in the matching of offers and requests.

First, since the matchmaker does not constrain or verify the contents of advertisements, it is possible for an entity to incorrectly represent itself or its characteristics and origin to obtain a match. Verifying the correctness of advertisements is an issue that requires further investigation. A promising approach is that of "licensing and assurance," which is discussed in Chapter 16. Regardless, the claiming protocol is an extremely useful interaction in this regard because it can be designed to include a challenge-response protocol for mutual authentication.

Second, in addition to authentication, both entities use the claiming protocol to verify that their respective constraints are indeed satisfied with respect to their current states. Thus, the claiming protocol forms the first phase of implementing the constraints imposed by entities involved in the match. If these constraints are not satisfied, the match is rejected, and the entities restart the advertise-match-claim cycle.

Example

We now furnish an example that illustrates matchmaking and claiming in a simple classad-based framework. The example is necessarily informal about the specification of the relative components.

• Advertising protocol: Every classad sent to the matchmaker must include an attribute named Requirement that represents the constraints of the advertiser. The classad must also contain an attribute named ReplyTo that is a communication end point at which it can be contacted. (Communication protocols regarding sending the classad to the matchmaker are omitted.)

• Matchmaking algorithm: Two classads A and B are said to match if A's Requirement evaluates to TRUE and B's Requirement evaluates to TRUE in the environment constructed by the matchmaker.

• Matchmaking protocol: In the event of a match, the matchmaker will contact the two matched entities at their ReplyTo addresses and will pass the ReplyTo attribute of the other entity involved in the match. Additionally, exactly one of the two entities is passed a tag, which denotes it to be the active entity. The other entity is said to be the passive entity.

• Claiming protocol: In the event of a match, the active entity contacts the passive entity (i.e., its match) at the ReplyTo address sent by the matchmaker. The passive entity makes sure the network connection comes from the ReplyTo address specified by the matchmaker. The connection is then made, and the matched entities jointly perform their computation.

In the context of the above relative components, we can see that the examples of attribute lists in Figure 13.2 are classads in this matchmaking framework and may be potentially matched by the matchmaker.

It is important to note that the matchmaking algorithm does not contain any references to specific resources or services, or what it takes to match them. All this information is contained in the classads themselves, which are created by the entities requiring matchmaking services. The algorithm also makes no distinction between the offer and the request for the service. If any of the ads sent to the matchmaker do not have a Requirement expression, the match would fail because the evaluation of the Requirement attribute would result in UNDEFINED and not TRUE.
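To illustrate this symmetry, here is a hypothetical Python sketch of the matchmaking algorithm of the example framework. Ads are plain dictionaries, and Requirement is a Python callable standing in for a classad expression; none of this reflects the actual classad syntax or implementation.

def matches(ad_a, ad_b):
    # Symmetric test: A's Requirement must hold over (A, B) and B's over (B, A).
    req_a = ad_a.get("Requirement")
    req_b = ad_b.get("Requirement")
    if req_a is None or req_b is None:
        return False  # a missing Requirement evaluates to UNDEFINED, not TRUE
    return req_a(ad_a, ad_b) is True and req_b(ad_b, ad_a) is True

machine = {
    "Type": "Machine", "OpSys": "OSF/1", "Arch": "Alpha",
    "LoadAvg": 0.1, "KbdIdle": 17,
    "Requirement": lambda self, other:
        self["LoadAvg"] < 0.3 and self["KbdIdle"] > 15
        and other.get("Owner") != "foo",
}
job = {
    "Owner": "bar", "State": "Idle",
    "Requirement": lambda self, other:
        other.get("Type") == "Machine" and other.get("OpSys") == "OSF/1"
        and other.get("Arch") == "Alpha",
}
print(matches(machine, job))  # True: the ads of Figure 13.2 match

Note that matches knows nothing about machines or jobs; all domain knowledge lives in the ads themselves.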

In the most general and flexible case, either party—the provider or the requester—has the right to break an allocation at any time. The provider may have to give the resource back to the owner or may have an offer from a more important or profitable customer, while the consumer may have obtained access to a cheaper or more powerful resource. In many of these cases, it would be unfortunate if the work accomplished so far were lost. It is therefore in the interest of both sides to have access to a checkpointing mechanism that can save the current state of the computation so that another provider can resume execution at a later stage. While the traditional view of checkpointing is that it is a means to improve the reliability of a computing environment, for an HTC environment it is a basic resource management tool, with a mean interusage time that is much smaller than the mean interfailure time of the hardware or the software. In the next section we discuss the different aspects of the checkpointing problem and provide an overview of a checkpointing mechanism we developed for UNIX systems.

13.4 CHECKPOINTING

A checkpoint of an executing program is a snapshot of its state, which can be used to restart the program from that state at a later time. Computing systems have traditionally employed checkpointing to provide reliability: when a compute node fails, the program running on that node can be restarted from its most recent checkpoint, either on that same node once it is restored or potentially on another available node. Checkpointing also enables preemptive-resume scheduling. All parties involved in an allocation can break the allocation at any time, without losing the work already accomplished, by simply checkpointing the application. Thus, a long-running application can make progress even when allocations last for relatively short periods of time. Because of the opportunistic nature of resources in a distributively owned environment, any attempt to deliver HTC has to rely on a checkpointing mechanism [460].

Checkpointing services provide an interface both to the application and the surrounding environment. At the least, an application should be able to request that its state be checkpointed at any time during its run and be able to request that no checkpoints be performed during specified critical sections. The POSIX P1003.10 draft standard on checkpoint and restart additionally provides an interface for an application to specify pre- and postcheckpoint processing. Researchers have also developed user-directed checkpointing services [451], which rely on user hints about memory usage to significantly increase the performance of checkpointing. In addition to the application interface, a checkpointing service provides an external interface to schedulers and users to trigger an application checkpoint because of external events (preemption, system shutdown, etc.). Often this is done by sending a signal to the application (in the case of user-level checkpointing) or by making a system call (in the case of kernel-level checkpointing).

Checkpointing can be an expensive and time-consuming operation, since the (potentially large) checkpoint must be written (possibly over the network) to disk. Checkpoints of parallel applications can be particularly huge, since the state of a parallel program includes the state of the interconnection network in addition to the state of each process. Also, performing a checkpoint requires that the process's address space be read, which can involve swapping virtual memory pages in from disk. In an opportunistic environment, it is imperative that a preempted process vacate the machine quickly; hence, if the scheduler cannot write a checkpoint quickly, the work accomplished since the last checkpoint will be lost at preemption time. In a system where checkpoints are written periodically and very fast preemption is required, preemption without checkpointing may be desirable.

User-directed checkpointing is one method for writing potentially fast checkpoints. Another method is to deploy specialized checkpoint file storage servers throughout the grid and direct checkpoints to the nearest or least-loaded server at preemption time [460]. A third method is to migrate the process immediately to another machine by writing the checkpoint to a network stream and reading it directly off the network on the new machine. In this method, the checkpoint does not need to be written to disk. This requires, of course, that a new machine be available at checkpoint time.

The decision of where to send a checkpoint can have a significant impact on performance and reliability and can impact other scheduling decisions. Migration requires that a new compute node be allocated for the task at the time of preemption. Disk space must be available for checkpoints not being used for immediate migration. Network bandwidth will affect the speed with which a checkpoint can be written. A checkpointing mechanism should, therefore, provide an interface to allow a scheduler to direct checkpoints to the appropriate network endpoint or disk.

Since most workstation operating systems do not provide kernel-level checkpointing services, an HTC environment frequently must rely on user-level checkpointing. In our experience developing and maintaining a user-level checkpointing library [349], we have found portability to be a significant challenge. For example, after porting our library to a new version of a popular UNIX operating system, we had reports from a user that his simulation was exiting prematurely after restarting from checkpoint. After much investigation, we discovered that we needed to reset a flag to tell the operating system to save floating-point registers on context switches. Small differences like this between operating systems and operating system versions add up, making the maintenance of a portable, robust user-level checkpointing mechanism a significant challenge for the HTC environment developer. Silicon Graphics has included kernel-level checkpointing services in a recent version of the IRIX operating system. We hope this is the start of a trend among operating system vendors.

Process checkpointing is implemented in our user-level checkpoint library as a signal handler. When a process linked with this library receives a checkpoint signal, the provided signal handler writes the state of the process out to a file or a network socket. To determine where to write the checkpoint, the signal handler either uses a file location provided on the command line or sends a message to a controlling process asking for a file location or a network address to connect to. The checkpoint includes the contents of the process's stack and data segments, all shared library code and data mapped into the process's address space, all CPU state including register values, the state of all open files, and any signal handlers and pending signals (see Figure 13.3). On restart, the process reads the checkpoint from the file or network socket, restoring the stack, shared library and data segments, file state, signal handlers, and pending signals. Again, the location from which to read the checkpoint is determined by either a command line option or the response to a query of a controlling process. The checkpoint signal handler then restores the CPU state and returns to the user code, which continues from where it left off when the checkpoint signal arrived. A program can request that a checkpoint be performed by sending itself the checkpoint signal and can disable checkpointing in critical sections by blocking the checkpoint signal.

FIGURE 13.3: Structure of a checkpoint file. The file consists of a header (CKPTHDR) recording the number of segments, followed by a segment header for each segment (Name, StartADDR, Size, PROT), followed by the checkpoint data itself (Size 1 bytes, Size 2 bytes, . . .).
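The following is a highly simplified Python sketch of this signal-driven pattern. The real library works in C at the level of stack and data segments, registers, and open files; here the application state is just a serializable object, and the file name is illustrative.

import os
import pickle
import signal

CKPT_PATH = "app.ckpt"            # illustrative checkpoint location
state = {"iteration": 0}          # stand-in for the application's state

def write_checkpoint(signum, frame):
    # The handler snapshots the current state to the checkpoint file.
    with open(CKPT_PATH, "wb") as f:
        pickle.dump(state, f)

signal.signal(signal.SIGUSR1, write_checkpoint)   # external checkpoint trigger

# Restart: if a checkpoint exists, resume from it rather than starting over.
if os.path.exists(CKPT_PATH):
    with open(CKPT_PATH, "rb") as f:
        state = pickle.load(f)

while state["iteration"] < 100:
    # Block the checkpoint signal around a critical section (UNIX only) so
    # that no checkpoint is taken while the state is being updated.
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGUSR1})
    state["iteration"] += 1
    signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGUSR1})

# A program can also request its own checkpoint by signaling itself.
os.kill(os.getpid(), signal.SIGUSR1)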

Other challenges in implementing a practical checkpointing mechanism, such as handling dynamic libraries and checkpointing state that is not directly accessible from user level (e.g., the open file table), have also been overcome with indirect solutions [349].

Since our checkpointing support is implemented in a static library, applications that use it must be linked with this library. This requirement, unfortunately, means that applications for which source or object files are not available cannot make use of our checkpointing support. Some UNIX variants include a method for injecting a dynamic library into an executable at startup time. This method could potentially be used to provide a shared library implementation of checkpointing support that could be injected into unmodified programs at startup time.

Checkpointing processes that use network communication requires that the state of the network be checkpointed and restored. Our checkpointing library has been enhanced to support applications that use PVM or MPI. To checkpoint the state of the network, this library synchronizes communicating processes by flushing all communication channels prior to checkpoint. At restart time, the library restores the communication channels.

Programs that communicate with processes that cannot be checkpointed also pose an interesting problem. Programs that communicate with X servers or license managers fall into this category. We have developed a solution that places a switchboard process between the two end points. Instead of connecting directly, these processes connect through the switchboard. When the program is checkpointed, it notifies the switchboard and closes its connection. The switchboard, however, keeps the connection to the other end point open and buffers any communication from this end point until the checkpointed program is restarted and the connection is restored. Protocol-specific knowledge is required in the switchboard if the noncheckpointable end point expects prompt replies.
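The following toy Python model illustrates the switchboard's buffering behavior; the class and messages are invented for illustration, and the real implementation would relay bytes between network sockets.

class Switchboard:
    # Sits between a checkpointable program and a fixed peer (e.g., a
    # license manager), keeping the peer's connection open across checkpoints.
    def __init__(self):
        self.program_connected = True
        self.buffer = []                 # peer traffic awaiting delivery

    def from_peer(self, msg):
        if self.program_connected:
            self.deliver(msg)
        else:
            self.buffer.append(msg)      # buffer while the program is down

    def deliver(self, msg):
        print("to program:", msg)

    def program_checkpointing(self):
        self.program_connected = False   # program closes only its own side

    def program_restarted(self):
        self.program_connected = True
        for msg in self.buffer:          # flush buffered traffic in order
            self.deliver(msg)
        self.buffer.clear()

sb = Switchboard()
sb.from_peer("license granted")
sb.program_checkpointing()
sb.from_peer("heartbeat")   # buffered: the program is checkpointed
sb.program_restarted()      # delivered after the connection is restored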

13.5 BATCH SYSTEMS

Since the days of the first mainframes, batch systems have played a crucial role in providing computing resources to HTC applications. Equipped with queuing mechanisms, scheduling policies, priority schemes, and resource classifications, these systems are used to run batch jobs on dedicated resources. In recent years the mechanisms employed by batch systems have been extended to deal with large multiprocessor computers and clusters of workstations. Their policies have also been adapted to meet the needs of workloads that consist of a mix of sequential and parallel applications.

The resources controlled by a batch system are typically owned by one organization and located within a single administrative domain. System administrators have full control over all resources and are in charge of the scheduling policies. Jobs are placed in queues classified according to their resource requirements and the customer who submitted them. Each queue is assigned computing resources to process the class of jobs it serves. Designed and built to operate as a production tool, batch systems are known for their robustness and reliability. These qualities will be extremely valuable assets to any computational grid that provides HTC services and wants to exploit the resources managed by such a system.

It is beyond the scope of this chapter to provide an in-depth discussion and evaluation of the currently available commercial and public-domain batch systems. Baker et al. provide an excellent review [38]. The results of a detailed and systematic evaluation of six job management systems (JMSs) were recently published in the latest NASA Job Management System (Batch/Queuing Software) evaluation report [298]. Three of the systems evaluated (CODINE [230], DQS [529], and LSF [585]) emphasize heterogeneous environments, whereas the other three systems (LoadLeveler [283], NQE [139], and PBS [265]) focus their efforts mainly on supercomputers.

In a recent do-it-yourself trend, administrators of large production systems design and implement their own batch schedulers (e.g., the Maui Scheduler from MHPCC and EASY from Argonne National Laboratory [344]) and make them available to the community. These schedulers reflect the unique needs and resource allocation philosophy of their implementors and utilize the APIs of an underlying batch system that provides the scheduler with queuing and process management mechanisms. The Nimrod system [5] is an example of another recent trend to build tailored batch schedulers designed to support customers who are engaged in large multijob computing efforts in a specific domain.

13.6 CHALLENGES

There are several challenges on the way to large-scale HTC computational grids. Indeed, every chapter in this book identifies and examines technologies that must be revisited or created to achieve this goal. In the interest of brevity, we identify issues that immediately challenge the very large scale deployment of the technology of high-throughput computing.

1. Understanding the sociology of a very large and diverse community of providers and consumers of computing resources: Unlike electrical grids, in an HTC computational grid, every consumer can also be a provider and every provider can also be a consumer of services. In electrical grids, a small number of providers serves a much larger community of consumers. If every consumer in an electrical grid had his or her own generator and some consumers were always looking for more power, electrical grids would look and behave much more like HTC environments. The National Technology Grid [525] of the NCSA alliance will provide us with a laboratory to study such a community.

2. Semantic-free matchmaking services: In the interest of flexibility and expressiveness, semantic-free matchmaking services for providers and customers of complex services with constraints must be developed. These constraints define whom they are willing to serve or by whom they are willing to be served, respectively. These mechanisms must be not only expressive but also efficient, as matchmakers will have to check extremely large numbers of candidate matches in grids of even moderate size.

3. Tools to develop and maintain robust and portable resource management mechanisms for large, dynamic, and distributed environments: Current approaches to developing such complex resource management frameworks usually involve a new implementation of large fractions of the framework for each instance of a marginally different RMS. An established framework with tools and APIs would allow the construction of interoperable components. This would greatly enhance the functionality of complex RMSs and reduce their development time. Many of these concerns are addressed in Chapter 11 in some detail.

4. Universally available checkpoint services for sequential and parallel applications: This goal is perhaps one of the most difficult to achieve, for purely practical considerations. Differences in vendor implementations of operating systems, varied architectures, and inadequate user-level support for checkpointing make providing ubiquitous checkpointing services challenging. However, as described in Section 13.4, recent activity in the field makes this goal more attainable. The availability of a ubiquitous checkpointing mechanism would greatly increase the percentage of available cycles productively harnessed by applications.

5. Understanding the economics (relationship between supply and demand) of an HTC computational grid: Although basic priority schemes for guaranteeing fairness in the allocation of resources are well understood, mechanisms and policies for equitable dispersion of services across large computational grids are not. A major aspect in the development of such policies involves understanding supply and demand for grid services. Clearly, this topic warrants further investigation.

6. Data staging—moving data to and from computation sites: When the problem of providing vast amounts of CPU power to applications becomes better understood, the next hurdle will be that of providing sufficiently sophisticated RM services to individual throughput-oriented applications. A major aspect of this problem is that of integrated staging, transport, and storage media management mechanisms and policies for high throughput.

13.7 SUMMARY

The need for high computing power over sustained intervals has increased in the scientific and engineering community. In contrast to other users who are concerned with response time and use interactive computing services, these users are primarily concerned with the throughput of their applications over relatively long time periods.

In this chapter we argue that grid resources can be productively scavenged to service these applications. By doing so, both the HTC community and the high-performance computing community will benefit, as HTC applications will migrate to opportunistically managed commodity resources, freeing the high-end resources for high-performance computing applications. In the general case, grid resources are distributively owned, thereby presenting several technical difficulties if these resources are to be scavenged for productive computation. The most important aspect of using distributively owned resources is that the RMS must honor the policies of resource owners at all times. Other requirements for satisfying system administrators, application authors, and customers must also be addressed.

The varied requirements of such an RMS require careful decomposition of the system into manageable modules whose responsibilities and interactions are well defined. To this end we present a flexible, scalable, and robust six-layered architecture for distributed resource management systems, and we discuss the specific responsibilities and interactions of each layer in the system.

A primary interaction in the system is that between customers and resources, which are brought together by a matchmaking service. The flexibility and robustness of the matchmaking service are extremely important because they directly affect the usability and quality of service provided by the HTC system. We present the classad matchmaking framework as a promising matchmaking mechanism for grid environments.

For maximum flexibility, the matchmaking service must have the ability to preempt and rematch resources that were matched previously. To guarantee that applications make progress in the face of such dynamic policies, it is important to have an application checkpointing mechanism. Such a mechanism is also important when an owner's access control policy revokes the resource for personal use or other more preferred requests.

We claim that these mechanisms, although originally developed in the context of a cluster of workstations, are also applicable to larger grids. In addition to the required flexibility of services in these grids, a very important concern is that the system be robust enough to run in "production mode" continuously, even in the face of component failures. A layered architecture with dynamic matchmaking frameworks can be used to address both concerns and, with the help of other grid services, provide reliable and sophisticated high-throughput computing resource management services.

ACKNOWLEDGMENTS

We thank the editors for their invaluable comments and suggestions, which greatly improved this chapter. We also thank Jim Basney for his contribution to the Condor checkpointing section. In addition, we offer many thanks to Mike Litzkow for his work on the early implementations of Condor, and Todd Tannenbaum, Derek Wright, and Adiel Yoaz for their patient and skillful work on the Condor team.

FURTHER READING

For more information on the topics covered in this chapter, see www.mkp.com/grids.


C H A P T E R  16

Security, Accounting, and Assurance

Clifford Neuman

The creation and deployment of computational grids will have a profound impact on the security of distributed systems. In traditional systems, the focus of security mechanisms has been to protect the system from its users and, in turn, to protect data maintained by the system on behalf of one user from compromise by another. While such protection remains important for grid applications, grids introduce the extra requirements of protecting applications and user data from the systems on which parts of a computation will execute. Further, because running code may originate from many points on a network, there is greater potential for running malicious code, requiring stronger methods to verify the origin and authenticity of the code and means to confine its execution. Because grid resources are managed by many organizations, often with different security requirements and possibly conflicting security policies, managing security for such a system is difficult.

This chapter discusses grid security requirements and provides an overview of some of the technologies that are available or under development to address these requirements. Issues related to the scalability of the computer security infrastructure are discussed. The chapter concludes with a summary of some of the issues that will be faced as we try to integrate these technologies into a grid software infrastructure.

16.1 SECURITY REQUIREMENTS

At the highest level, the security requirements of any system involve preventing the unauthorized disclosure or modification of data and ensuring the continued operation of the system. Systems differ in the policy that determines when disclosure or modification is authorized, and they also differ in the kinds of attack to which they are subjected. Because grids typically span multiple organizations, and even different countries with different laws, security requirements may vary from one part of such systems to another. For most systems, however, including grids, security requirements encompass authentication and authorization.

16.1.1 Authentication

Authentication is the process of verifying the identity of a participant to an operation or request. A principal is an entity whose identity is verified through authentication and on whose authority the operation is performed or authorized. The principal may be the user logged into a remote system and on whose behalf the application client is running, it may be a local user logged into a server, or it may be the server itself.

In traditional systems, the requirement for authentication is focused on the client, since the goal of security in such systems is to protect the system (servers) from the users. In grid systems, mutual authentication of the server is just as important, to ensure that resources and data provided by the server are not really provided by an attacker. User and server authentication provides assurance that the principal, or more precisely a process possessing some object or secret held or known by the principal, is an active participant in a protocol exchange at the time authentication is performed.

Data origin authentication provides assurance that a particular message, data item, or executable object originated with a particular principal, and makes it possible to determine the origin of an incoming program. This information can be used to determine whether a program was modified or was sent by an attacker to compromise the resources to which the program has access. By itself, however, data origin authentication does not ensure that the data was recently sent by the principal, only that it was generated by the principal at some point in the past.

In some cases, an application or process may assume the identity of a different principal for the purpose of performing particular operations. Authority to act as this other principal is granted through a process called delegation of identity.

16.1.2 Authorization

Authentication is useful primarily to enable authorization. Authorization is the process through which it is determined whether a particular operation is allowed. In traditional systems, authorization is usually based on the authenticated identity of the requester and on information local to the server. This local information identifies the individuals authorized to perform an operation, and it often takes the form of an access control list associated with a file, directory, or service.

Authorization mechanisms are required within grid systems to determine whether access to a resource is allowed. Such resource access may involve accessing a file in a data repository, reserving network bandwidth by using a system like RSVP, or running a task on a particular processing node.

In some cases, the ability to run a task on a processing node may be based not just on the identity of the user asking to run the task, but on the identity of the task or application to be run. When the code for an application is stored locally on the execution node, the identity of the application may be determined from the name of the application; but if the user provides the code to be run, the application itself must be authenticated by using data origin authentication—usually by verifying a digitally signed checksum of the executable. To identify the particular programs, access control lists might contain the names or checksums of authorized programs, together with the names of the principals authorized to invoke the program.
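As a sketch of such a list, the following Python fragment pairs each authorized principal with the checksum of a program that principal may invoke; the names, program bytes, and choice of digest are illustrative only.

import hashlib

program = b"...contents of a.out..."     # hypothetical executable image
acl = {("alice", hashlib.sha256(program).hexdigest())}

def authorized(principal, executable_bytes):
    # The submitted code is identified by its digest, and the
    # (principal, program) pair is then checked against the list.
    digest = hashlib.sha256(executable_bytes).hexdigest()
    return (principal, digest) in acl

print(authorized("alice", program))    # True
print(authorized("mallory", program))  # False: principal not on the list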

Many applications can benefit from an authorization mechanism that supports delegation of authority. Delegation of authority is a means by which a user or process authorized to perform an operation can grant that authority to perform the operation to another process. This is a more restricted form of delegation than delegation of identity (discussed earlier). Delegation of authority is important for tasks that will run remotely on the grid but that must make calls themselves to read or write data stored across the network. For example, in implementing distributed authorization [415], a resource manager might allocate a node to a job and might delegate to the job's initiator [417] the authority to use that node.

16.1.3 Assurance

While authorization mechanisms allow the provider of a service to decide whether to perform an operation on behalf of the requester of the service, assurance mechanisms [325] allow the requester of a service to decide whether a candidate system or service provider meets the requester's requirements for security, trustworthiness, reliability, or other characteristics. Real-world examples of assurance include hotel ratings by the American Automobile Association and endorsement by the Better Business Bureau.

Assurance is a form of authorization used for validating the authority of the service provider. When applied to computer systems, this authorization of the system for use in particular applications is sometimes called accreditation.

In a computational grid, assurance credentials may be checked when selecting nodes for a computation, to ensure that they meet the performance, reliability, and security requirements of the application and that the computing service is run by an organization that is trusted to handle the data used by the task that will run on the selected nodes. When applied to programs, a resource manager might verify assurance credentials attached to a program before it is run. This form of assurance is analogous to the Underwriters Laboratories seal found on electric appliances in the United States.

16.1.4 Accounting

Accounting provides the means to track, limit, or charge for the consumption of resources in a system. It is critical for providing a fair allocation of the available resources to users that need them. Accounting will be critical for deployment of grid applications, supporting payment or barter for the use of computing resources and providing incentives to the owners of computing resources to make idle capacity available to others. Additionally, when the aggregate computing requirements of grid applications exceed available resources, the accounting system will provide a tool that is useful in deciding which processes to run.

Any grid accounting mechanism must be distributed so that quotas can be applied to any node in the grid, making the allocation of resources more flexible than it would be if quotas were maintained separately on each node. Further, because computational grids will cross organizational boundaries, the accounting servers should be distributed and scalable across administrative domains. This feature will allow an organization to administer quotas for its users, independently from the quotas granted and maintained by other organizations. To prevent the compartmentalization of the computing resources in these domains, a settlement and clearing process should be provided between accounting servers in different domains.

16.1.5 Audit

An audit function records operations that have been performed by a system, associating each action with the principal on whose behalf the operation was performed. An audit is useful for figuring out what went wrong if something breaks or for tracking breaches of security in the system. An intrusion detection system will look at events generated by the audit in order to find patterns of operations that fit the profile of a system intrusion or that do not fit the profiles of legitimate users. If such patterns are detected, an alert is generated. In traditional systems, the audit function is local to each server. To detect network attacks, the audit function should be distributed or audit records transmitted to a central location for each organization or administrative unit, where a higher-level view of the system can be constructed. In the ideal case, summary information (and in certain cases details) might be shared across administrative boundaries.

Because code can be loaded onto the system nodes from many sources, and because grid computations can take place across multiple nodes, an audit mechanism for a grid must itself be distributed. Consider the case of a denial-of-service attack on the grid. To be effective, the attack would have to be mounted across many nodes. A distributed audit function could aid in identifying such an attack.

16.1.6 Integrity and Confidentiality

At the highest level, the impact of the other security services discussed in this section is to support confidentiality and integrity of data as it is stored or processed on computer systems. Full security requires that such data be protected during transmission on the network. In a grid system, the correct functioning of the applications that run remotely depends on the integrity of the communications when programs and data are sent between nodes. If the data consumed or produced by remote tasks is sensitive, confidentiality of the communication will also be necessary.

16.1.7 Optional Security Services

Not all of the security requirements in this section are necessary for all grid applications or for all environments. In fact, meeting some of the requirements will have a negative impact on the performance, implementation, and administrative costs of a system. When security needs are balanced with other system requirements, certain security services may be omitted. For example, when applications operate on nonsensitive data, confidentiality of the communications between tasks can be omitted, often with significant performance improvement.

Similarly, some security services might not be needed in particular computing environments. For example, if all the tasks of an application will run within a single multiprocessor system, and if the design of the system does not provide a method for user-level processes to monitor or modify bus traffic, then confidentiality and integrity of communications may be assumed, and there is no need for other techniques to provide these services.

Assuming that a system is not to be left completely unprotected and that it will try to protect users from interference by other users, then for the services discussed in this section, three features should be considered mandatory: authentication, authorization, and integrity. With the exception of passwords, confidentiality in such a system would be optional, with the application and organization's policies determining need. Assurance mechanisms are necessary only in environments where the service providers (the nodes available for a computation) are under the control of different administrative entities and where not all the nodes are trusted to meet the user's requirements. Accounting is optional: it is useful primarily when there are charges for resource consumption or when limited resources require allocation of quotas. Whether audit is required is a matter of policy, since it is not needed for the proper functioning of a system; its benefits are felt after a security breach.

Grid systems should provide a modular framework within which different security mechanisms can be integrated, so that knowledge of the needs of a particular application, user, or computing environment can be used to dynamically select from among the implemented choices and so that the quality (or level) of protection can be negotiated as part of the process of resource allocation.

16.2 TECHNOLOGIES

Many technologies have been developed to provide the security services described in the preceding section. Here, we examine these technologies, starting with those that serve as a base for other technologies and progressing to those that provide security directly. We focus on technologies designed to provide security services in an environment where an attacker may have access to the underlying communications medium.

16.2.1 Cryptography

Cryptography is the most basic technology for distributed system security. Almost all effective techniques that provide security for open networks require cryptography for their effective operation. This requirement arises because anyone connected to part of an open network may observe, insert, and in some cases remove messages on that part of the network at will.

Cryptography comprises two transformations on messages. Encryption is a transformation that scrambles data in a way that varies based on a secret parameter called an encryption key, so that the data cannot be interpreted (or unscrambled) without the corresponding decryption key. Decryption is the inverse transformation using the corresponding decryption key, restoring the data to its original form. The scrambled data is called ciphertext, and the original or subsequently unscrambled data is called plaintext.

Many algorithms can be used for the encryption and decryption transformations. It is generally accepted that the security of a cryptosystem is conditioned on the secrecy of the keys from an attacker, but the attacker usually can be assumed to know the algorithm that uses the keys as parameters. In an effective cryptosystem, the parameters needed for encryption and decryption cannot be determined simply by observing the messages that are encrypted. The parameters are provided to the legitimate parties that need to use them through a process called key distribution.

Two Classes of Cryptosystems

The algorithms used for encryption and decryption are classified as symmetric or asymmetric according to the relationship between the parameters used for encryption and decryption. In a symmetric (or conventional) cryptosystem like the Data Encryption Standard (DES) [32], triple-DES [375], IDEA [326], Blowfish [492], RC4, and RC5, data is encrypted by using a parameter as an encryption key so that it can be decrypted only by a similar transformation using the same parameter as the decryption key; users then need to remember only the single key used to communicate with the other party. Therefore, when using symmetric cryptosystems, both the party encrypting the data and the party decrypting the data must share the same encryption key. In other words, a user needs a different key for every other user or service provider with which it exchanges information or messages, and each service provider must maintain a key for every potential user. This limitation is usually mitigated by using a mutually trusted intermediary to generate a new encryption key that is distributed to both parties.

In an asymmetric cryptographic algorithm such as RSA [477] or the Digital Signature Algorithm (DSA) [409], encryption and decryption are performed by using a pair of keys as parameters such that knowledge of one key does not provide knowledge of the other key in the pair [159]. One key is called the public key and is published and available to anyone, including a potential attacker. The other key, called the private key, is kept private and is known by only one of the parties to the exchange. The principal advantage of asymmetric cryptography is that secrecy is not needed for the public key, making dissemination of the public key easier. Additionally, only a single key pair needs to be generated for each user or service provider, as compared with conventional cryptography, which requires a key for every possible pair of senders and receivers.

The principal disadvantage of asymmetric cryptography is its performance. Existing asymmetric cryptosystems are significantly slower than their symmetric counterparts. With common key sizes of 512 to 1,024 bits, encryption of a single block with a private key using the RSA algorithm [477] can take on the order of a tenth of a second or longer to complete on computers in common use today. The public key operation using the DSA [409] has similar performance.

Because of the performance issues, asymmetric cryptography is rarely used in isolation. Instead, it is almost always used to encrypt a symmetric encryption key and checksum, which are in turn used to protect the actual data. While using asymmetric cryptography in this manner reduces the cost of encrypting large messages and documents, at least one encryption operation using an asymmetric algorithm is required for each signed document and for the exchange of a symmetric key between any new pair of users. For applications that process frequent operations originating from different clients, performance requirements preclude the use of asymmetric encryption. Asymmetric encryption is, however, well suited for store-and-forward applications such as electronic mail and information dissemination applications, where documents can be signed before they are stored (e.g., many Web documents).
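The following sketch shows this hybrid pattern using the third-party Python cryptography package (assuming it is installed); the message is illustrative, and a production design would add signatures and careful key management.

from cryptography.fernet import Fernet
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa, padding

# Receiver's long-lived asymmetric key pair.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
public_key = private_key.public_key()

# Sender: encrypt the bulk data with a fresh symmetric key (fast) ...
session_key = Fernet.generate_key()
ciphertext = Fernet(session_key).encrypt(b"large message body ...")

# ... and encrypt only the small symmetric key asymmetrically (slow).
oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)
wrapped_key = public_key.encrypt(session_key, oaep)

# Receiver: unwrap the symmetric key, then decrypt the bulk data.
recovered = Fernet(private_key.decrypt(wrapped_key, oaep)).decrypt(ciphertext)
assert recovered == b"large message body ..."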

When using asymmetric cryptography to exchange symmetric encryption keys or to sign checksums, each party must know the other party's public key a priori or must rely on a trusted third party to certify the other user's public key. Without a trusted third party, an attacker can replace the public key of a participant with a different key for which the corresponding private key is known by the attacker. This substitution would allow the attacker to decrypt messages encrypted by using the fictitious key and to generate messages signed by the key. The requirement for a trusted third party is similar to the requirement for such an intermediary to distribute keys for a purely symmetric cryptosystem. The use of such third parties for the exchange and certification of encryption keys is closely tied to authentication.

Confidentiality and Integrity

The most direct application of cryptography to security is to protect the confidentiality of data accessed through computer networks. When the sender and receiver share an encryption key known only to them, data can be encrypted before transmission and decrypted after transmission, protecting the data from disclosure to eavesdroppers. Encryption is the only suitable means to provide confidentiality and integrity of data as it is transmitted across an open computer network such as that which connects the nodes of any large-scale, administratively decentralized computational grid.

Besides protecting the confidentiality of data, encryption also protects data integrity. Because knowledge of the encryption key is required to produce ciphertext that will yield a predictable value when decrypted, modification of the data by someone who doesn't know the key can be detected by attaching a checksum to the data before encryption and requiring that the receiver verify the checksum after decryption. Alternatively, a message digest function can be calculated over data sent unencrypted, but the digest itself is encrypted and attached to the message to provide a digital signature. A message digest function is a one-way function used as a checksum: given a message and a digest, it is computationally infeasible to find a different message that shares the same digest.
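A small Python example of a message digest used as a checksum follows; SHA-256 is an illustrative choice, and in a digital signature the digest would additionally be encrypted with the signer's private key.

import hashlib

message = b"job output: 42 records"
digest = hashlib.sha256(message).hexdigest()   # one-way checksum of the data

# The receiver recomputes the digest; any change to the message changes it.
assert hashlib.sha256(message).hexdigest() == digest
assert hashlib.sha256(b"job output: 43 records").hexdigest() != digest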

16.2.2 Authentication

Several methods can be used to verify the identity of the party with which a user communicates, including assertion, passwords, and encryption-based authentication protocols. Assertion-based authentication is suitable only in an environment where processors and their associated system software are trusted to properly identify local users to other processors and where messages are protected from modification by adversaries. Passwords are suitable only in situations where messages cannot be read by untrusted processes while transiting the network; where communication can be intercepted, the passwords can be intercepted and used later by an adversary to impersonate the original user. Unfortunately, these assumptions for protection of communications do not generally hold for distributed systems.

In less controlled environments, authentication protocols can be used to prove knowledge of a password, without actually sending the password across the network. This process is accomplished by using the password as an encryption key; because knowledge of the encryption key is required to produce ciphertext that will yield a predictable value when decrypted, knowledge of the encryption key can be demonstrated by encrypting a known, but nonrepeating value, and sending the encrypted value to the party verifying the authentication. As was the case with encryption for confidentiality and integrity, unless each party knows the other party's key a priori, a trusted third party is required to certify or distribute the keys.
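A minimal Python sketch of this idea uses a keyed transformation (an HMAC) of a fresh challenge in place of encryption; the password, salt, and iteration count are illustrative, and real protocols such as Kerberos are considerably more elaborate.

import hashlib
import hmac
import os

# Both parties derive the same key from the password (parameters illustrative).
key = hashlib.pbkdf2_hmac("sha256", b"user password", b"salt", 100_000)

# Verifier: issue a fresh, nonrepeating challenge.
challenge = os.urandom(16)

# Claimant: prove knowledge of the key by transforming the challenge with it;
# the password itself never crosses the network.
response = hmac.new(key, challenge, hashlib.sha256).digest()

# Verifier: recompute the expected response and compare in constant time.
expected = hmac.new(key, challenge, hashlib.sha256).digest()
print(hmac.compare_digest(response, expected))  # True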

The Kerberos authentication service, described later, is an example of a system that provides authentication of clients and servers across a computer network. Certificate-based mechanisms like Secure Sockets Layer (SSL) are also capable of providing authentication services to higher-level applications.

16.2.3 Certification

Encryption provides the base technology for confidentiality and integrity of data communications, and authentication methods for distributed systems allow the user to prove possession of an encryption key known only by the user, but it is the certification mechanism that provides the binding between a particular encryption key and the authenticated identity. A certification authority (CA) is the third party that certifies this binding, issuing a certificate signed by the CA that attests to the validity of the binding.

A certificate is a data object that specifies a distinguished name of a principal and, for certificates based on public-key cryptography, the public key that was issued to or selected by the principal and that is the inverse of the private key known only by the principal. The certificate may contain additional attributes of the principal; depending on the kind of certificate, these might include authorizations, group memberships, email addresses, or alternate names. Once constructed, the certificate data is signed by the CA, ensuring its authenticity. X.509 [110] is the most widely used certificate format; X.509 certificates are used by most Web browsers, commercial secure email products, and public-key-based electronic payment systems.

To validate the binding of the key in the certificate to a distinguished name and to other attributes, the verifier must validate the CA's signature. This validation requires knowledge of the CA's public key. The key may be known a priori by the verifier, or it may be obtained from the CA's certificate, which was itself issued by a higher-level CA. Thus, certification is usually hierarchical, with CAs authorized to issue certifications only for distinguished names delegated by the higher-level CA.

Clients are configured with certain well-known public keys that are in many cases obtained when software is installed. These keys are used to validate the certificates of lower-level CAs, whose keys are then used to validate the certificates for end users and other CAs. Because certificates are used for many purposes, and because of the lack of a single universally trusted certification authority, the certification hierarchies used in practice have multiple roots, and applications and servers are configured with the public keys of those root nodes whose certifications are trusted.
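The following Python sketch walks such a hierarchy from a trusted root down to an end-entity certificate. The certificate fields and the verify check are stand-ins for real X.509 structures and signature verification.

def verify(cert, key):
    # Stand-in for a real signature check: here a certificate is "signed"
    # if it simply records the signer's key. A real verifier checks a
    # digital signature over the certificate contents.
    return cert["signed_by"] == key

def validate_chain(chain, trusted_root_keys):
    # chain is ordered root CA first, end-entity certificate last.
    signing_key = chain[0]["signed_by"]
    if signing_key not in trusted_root_keys:
        return False                       # chain does not root in a trusted key
    for cert in chain:
        if not verify(cert, signing_key):
            return False
        signing_key = cert["subject_key"]  # this key validates the next cert
    return True

root_ca  = {"subject_key": "root-key", "signed_by": "root-key"}   # self-signed
site_ca  = {"subject_key": "site-key", "signed_by": "root-key"}
end_user = {"subject_key": "user-key", "signed_by": "site-key"}
print(validate_chain([root_ca, site_ca, end_user], {"root-key"}))  # True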

16.2.4 Distributed Authorization and Assurance

Authorization and assurance mechanisms provide information (besides the name of a principal) that will be used to determine whether an operation is allowed. For authorization, it is the ability of the requester to obtain access that is checked; for assurance, it is the ability of the provider to provide a service that is determined. To provide authorization and assurance information in a distributed system, the authenticity of the authorization and assurance information must be verified.

Authenticity of authorization and assurance information is protected by embedding the attributes in certificates, which are then signed by a third party that is trusted to provide such information. These certificates are sometimes called privilege attribute certificates [182] (for authorization) or assurance credentials [325]. A restricted proxy [415] is a form of authorization certificate that grants authority to perform an operation on behalf of the grantor, but is restricted for access to particular objects and only when the specified restrictions are satisfied. The use condition model [297] (see Section 4.3.4) provides similar functionality, with conditions and privilege attributes embedded in certificates that are then matched to verify authority to perform a particular operation.
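As a rough illustration of a restricted proxy, the following Python sketch signs a delegation statement whose restrictions (a specific object and an expiration time) are checked when the proxy is used. The fields and the HMAC-based signing are invented for illustration and do not follow the format of [415].

import hashlib
import hmac
import json
import time

grantor_key = b"grantor signing key"   # stands in for the grantor's private key

proxy = {
    "grantor": "alice",
    "grantee": "job-1234",
    "right":   "read",
    "object":  "/data/run42/output",
    "expires": time.time() + 3600,     # restriction: one-hour lifetime
}
encoded = json.dumps(proxy, sort_keys=True).encode()
signature = hmac.new(grantor_key, encoded, hashlib.sha256).hexdigest()

def proxy_valid(proxy, signature):
    # Accept the proxy only if it is unmodified and its restrictions hold.
    encoded = json.dumps(proxy, sort_keys=True).encode()
    expected = hmac.new(grantor_key, encoded, hashlib.sha256).hexdigest()
    return hmac.compare_digest(signature, expected) and time.time() < proxy["expires"]

print(proxy_valid(proxy, signature))  # True while unexpired and unmodified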

In an alternative approach to distributed authorization, authorization information is provided through a separate connection between an authorization server and the party providing the service for which authorization must be checked. The party checking authorization then asks the authorization server whether a named principal is authorized. The authorization server either responds yes or no to the specific question or returns an access control list that is then checked locally. In either case, the integrity of the connection between the authorization server and the party checking the authorization information must be assured by using methods described earlier.

16.2.5 Accounting

Distributed accounting mechanisms can be layered on top of integrity, authentication, and authorization services. Accounting serves two functions in a distributed system: billing and authorization. Both functions require that the integrity of accounting data be protected. Similarly, the authentication of users is necessary to make sure that the correct user is charged at the time a resource is consumed. Finally, integration with authorization bounds consumption to within specified quotas and specifies which accounts a user may charge.

A distributed accounting system may be implemented as a database distributed across designated servers that maintain account balances for the users of a system. The entries in this database will indicate how much of a resource was used, how much of the resource a user is still allowed to use, or both. The limits and accumulated use of each resource in a distributed accounting system cover the use of the resource across multiple machines, not just a single machine. Hence, remote update and synchronization must be provided so that users cannot exceed resource limits in aggregate by spreading their use across multiple machines.

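To make the mechanics concrete, the sketch below (in Python) shows the core of such a scheme: one aggregate balance per user, checked and updated atomically regardless of which machine the request originates from. The class and method names are invented for illustration, not drawn from any deployed accounting service.

    # Illustrative sketch of a distributed accounting server's core logic
    # (names and structure are hypothetical, not from any deployed system).
    import threading

    class AccountingServer:
        def __init__(self):
            self.lock = threading.Lock()   # serialize remote updates
            self.quota = {}                # user -> allowed units of a resource
            self.used = {}                 # user -> units consumed, across all machines

        def set_quota(self, user, units):
            with self.lock:
                self.quota[user] = units
                self.used.setdefault(user, 0)

        def charge(self, user, units):
            """Atomically record consumption; refuse if the aggregate
            limit across all machines would be exceeded."""
            with self.lock:
                if self.used.get(user, 0) + units > self.quota.get(user, 0):
                    return False           # request denied: over quota
                self.used[user] += units
                return True

    server = AccountingServer()
    server.set_quota("alice", 100)         # e.g., 100 CPU hours grid-wide
    print(server.charge("alice", 60))      # True
    print(server.charge("alice", 60))      # False: would exceed aggregate quota
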
16.2.6 Intrusion Detection and Auditing

Intrusion detection and auditing systems require the ability to record a log of the events that occur in a system for later or concurrent analysis to detect attacks while they are in progress or to reconstruct events following an attack. The transmission and storage of data for this log are vulnerable to attacks ranging from modification, insertion, and deletion of data to denial-of-service attacks designed to prevent receipt of audit records by the systems recording or analyzing the data. The confidentiality of audit data must also be protected to ensure the privacy of the user and to prevent disclosure of critical information such as passwords (which are sometimes recorded in audit data when a user accidentally types a password in a username field during login).

As with the other security services discussed so far, the storage and transmission function requires cryptographic protection for the confidentiality and integrity of the audit data. Intrusion detection and auditing also depend on authentication, since audit records usually associate individuals with the actions they have taken, and the identity of the user must be determined if these entries are to be trusted.

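As one concrete illustration of integrity protection for a stored audit log (a standard cryptographic technique, not a mechanism prescribed by this chapter), each record can carry a message authentication code chained to its predecessor, making modification, insertion, or deletion anywhere in the log detectable. The key handling and record format below are simplified assumptions.

    # Sketch: an append-only audit log whose records are chained with HMACs,
    # so modification, insertion, or deletion of an entry is detectable.
    # Key management is elided; 'key' would be held by the audit service.
    import hmac, hashlib

    key = b"audit-service-secret"          # illustrative key only

    def append(log, record):
        prev_mac = log[-1][1] if log else b""
        mac = hmac.new(key, prev_mac + record, hashlib.sha256).digest()
        log.append((record, mac))

    def verify(log):
        prev_mac = b""
        for record, mac in log:
            expected = hmac.new(key, prev_mac + record, hashlib.sha256).digest()
            if not hmac.compare_digest(mac, expected):
                return False
            prev_mac = mac
        return True

    log = []
    append(log, b"login user=alice")
    append(log, b"job submit user=alice host=node7")
    print(verify(log))                             # True
    log[0] = (b"login user=mallory", log[0][1])    # tamper with a stored record
    print(verify(log))                             # False
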
Once audit records are securely transmitted and stored, intrusion detection programs analyze the data. The techniques used by these programs are a topic of current research, and space does not allow discussion in this chapter. The presence of grid applications on a network, however, will affect these programs because the execution characteristics of the communication, migration, and invocation of these applications may be similar to the characteristics of certain kinds of network attacks.

16.3 CURRENT PRACTICE

The computer security technologies most widely used today include file and email encryption technologies such as Pretty Good Privacy (PGP), transport layer security technologies such as SSL, authentication technologies such as Kerberos and alternatives using public-key certificates, assurance technologies such as Authenticode, confinement technologies to limit the actions of untrusted applications, and network encryption technologies including IPSec, which are sometimes used to implement virtual private networks. This section describes existing implementations of the security services mentioned earlier in this chapter. The section should be considered illustrative of some of the mechanisms that are available, but not necessarily the only way to provide security.

16.3.1 File Encryption, Email, and Public-Key Authentication

Many programs that are in widespread use ensure the integrity, authentication, and confidentiality of email and data files. PGP is one of the most popular programs providing these protections, but similar support is built into various commercial email and messaging application suites. Although not specifically targeted at computational grids, the same techniques and message formats are usable to protect programs and data for grid applications.

These programs work by computing a message digest function over a message, encrypting the message by using a conventional cryptosystem, then encrypting the message digest and the message key (from the conventional cryptosystem) by using public-key cryptography. Confidentiality is ensured by encrypting the message key using the recipient’s public key, and integrity (i.e., a digital signature) is provided by encrypting the digest using the private key of the message originator.

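A minimal sketch of this hybrid scheme, using the third-party Python cryptography package, appears below. PGP’s actual message formats and algorithm choices differ in detail; the message contents here are illustrative.

    # Hybrid encryption sketch (pip install cryptography); not PGP's format.
    import os
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import rsa, padding
    from cryptography.hazmat.primitives.ciphers.aead import AESGCM

    sender = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    recipient = rsa.generate_private_key(public_exponent=65537, key_size=2048)

    message = b"grid job parameters"
    session_key = AESGCM.generate_key(bit_length=128)  # conventional (symmetric) key
    nonce = os.urandom(12)
    ciphertext = AESGCM(session_key).encrypt(nonce, message, None)

    # Confidentiality: session key encrypted under the recipient's public key.
    oaep = padding.OAEP(mgf=padding.MGF1(hashes.SHA256()),
                        algorithm=hashes.SHA256(), label=None)
    wrapped_key = recipient.public_key().encrypt(session_key, oaep)

    # Integrity/signature: the message digest signed with the sender's private key.
    pss = padding.PSS(mgf=padding.MGF1(hashes.SHA256()),
                      salt_length=padding.PSS.MAX_LENGTH)
    signature = sender.sign(message, pss, hashes.SHA256())

    # Recipient side: unwrap the key, decrypt, then verify the signature.
    key2 = recipient.decrypt(wrapped_key, oaep)
    plaintext = AESGCM(key2).decrypt(nonce, ciphertext, None)
    sender.public_key().verify(signature, plaintext, pss, hashes.SHA256())
    print(plaintext)                                   # b'grid job parameters'
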
Certification of public keys in most of these systems uses a common technique: a certificate containing each user’s public key is signed by another entity that acts as a certifier. The systems differ in the policies regarding who certifies user certificates. In PGP, any user can certify any other user’s certificate, and the verifier decides which certifiers to accept. This acceptance policy can range from very secure, such as accepting only the verifier’s own certifications, to weaker policies including accepting certifications by personal friends or those held in high regard in the Internet community.

Most other implementations depend on established certification authorities run by trusted organizations (e.g., the user’s own employer) whose authority is itself certified by higher-level authorities, including organizations such as Verisign. In these systems, the “root CAs” are the organizations that certify other CAs, and applications must be configured to know the public keys of these root CAs. This latter policy is a specialization of the first policy, where the choice of CAs is more rigidly specified, and it provides a means to enforce an organization’s policies and limit the errors that might occur from misconfiguration or misplaced trust by users.

Although certification policies differ from system to system, most of the systems in widespread use today have adopted a common certificate format specified by the X.509 standard. Other certificate formats have been proposed and are under discussion.

16.3.2 Secure Sockets Layer and Transaction-Level Security

Embedded in practically every Web browser, SSL is probably the most widely deployed technology for confidentiality on the Internet today.

As shown in Figure 16.1, when a Web browser supporting SSL communicates with an SSL-enabled Web server, the server sends a public-key certificate to the client, which verifies the signature on the certificate by decrypting with the public key of the CA. This key was obtained in advance during installation and subsequent configuration of the browser. Verifying this certificate and then checking the host name embedded within it assures the client that it is talking with the intended server (the one from the URL it is following), and it provides the server’s public key. The server’s public key is then used to encrypt a session key from a conventional cryptosystem. The encrypted session key is sent to the server, which decrypts it, and subsequent communication between the client and the server (for the duration of the session) is protected for confidentiality and integrity by using this session key.

As it is typically used, only the server has a certificate, and only the server is authenticated. Authentication of the user is supported through a standard password-based mechanism; but because the password passes over a confidentiality-protected connection, it is not vulnerable to eavesdropping.

In cases where the client has a certified public key, the SSL protocol supports cryptographic authentication of the client. In this scenario the client’s response to the server includes the client’s certificate, plus additional data that is sent to the server encrypted by using the private key that corresponds to the public key from the client’s certificate.

[Figure 16.1: An SSL-enabled Web server. The exchange between client and server: (1) HELLO; (2) server certificate; (3) optional client certificate and {premaster secret} encrypted under the server’s public key; (4) server verify (proves server identity); (5+) application communication protected by a key derived from the premaster secret.]

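The client’s side of this exchange can be illustrated with Python’s standard ssl module, which performs the certificate and host name checks described above during the handshake. The host name below is a placeholder.

    # Minimal illustration of the client side of Figure 16.1.
    import socket, ssl

    host = "www.example.org"                # placeholder host
    ctx = ssl.create_default_context()      # loads the root-CA public keys the
                                            # platform was configured with
    with socket.create_connection((host, 443)) as sock:
        # wrap_socket performs the handshake: it verifies the CA's signature on
        # the server certificate and checks that 'host' matches the name
        # embedded in the certificate, as described above.
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            print(tls.version())            # negotiated protocol version
            print(tls.getpeercert()["subject"])
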
Although used primarily to protect communication on the Web, SSL may be used to protect communication by other applications, including communication between tasks in a grid system such as Globus [202]. When considering the use of SSL for an application, performance of the connection establishment phase may be an issue because the public-key operations can become a bottleneck (although once an SSL connection is established, the performance of the conventional cryptosystems used is less of a factor). To improve performance, an SSL server can be allowed to cache the conventional keys used for subsequent connections between the same client and server. When performance is an issue, the patterns of communication (how frequently new connections are established) must be considered to determine the performance effect of connection establishment.

16.3.3 Kerberos

Kerberos is an authentication and key distribution protocol that uses conventional cryptography, providing significantly better performance than authentication mechanisms that rely on public-key cryptography. Kerberos is well suited for applications and services requiring frequent authentication, and its central administration makes it well suited for integration with intrusion detection and authorization systems. The primary disadvantage of Kerberos over systems using public-key cryptography is the requirement for a trusted “online” (connected to the Internet) certification authority called the key distribution center (KDC) and the need to go back to the KDC for each pair of communicating entities. This is less of an issue than it might seem, however, since an online intermediary is necessary even in public-key systems to support fast revocation of credentials.

The Kerberos protocol [418] is based in part on the symmetric version of the Needham and Schroeder authentication protocol [413], with changes to reduce the number of messages needed for basic authentication and the addition of a facility for subsequent authentication without reentry of the user’s password. The Kerberos protocol is shown in Figure 16.2.

[Figure 16.2: The Kerberos authentication protocol. The messages exchanged among the client C, the authentication server AS, and the verifier V:
1. as_req: c, v, time_exp, n
2. as_rep: {K_c,v, v, time_exp, n, ...}K_c, {T_c,v}K_v
3. ap_req: {ts, ck, K_subsession, ...}K_c,v, {T_c,v}K_v
4. ap_rep: {ts}K_c,v (optional)
where T_c,v = c, K_c,v, time_exp.]

When a client (C) wishes to communicate with a service provider (the verifier, V), it contacts the Kerberos authentication server (AS), sending its own name, the name of the server to be contacted, and additional information (1). The Kerberos server randomly generates a session key (K_c,v) and returns it to the client encrypted in the key derived from the user’s password (K_c) and registered in advance with the Kerberos server (2). The encrypted session key is returned together with a ticket (T_c,v) that contains the name of the client and the session key, all encrypted in the service provider’s key (K_v).

The session key and ticket received from the Kerberos server are valid until time_exp and are cached by the client, reducing the number of requests to the Kerberos server. Additionally, the user’s secret key is needed only when initially logging in. Instead of using the user’s secret key, subsequent requests during the same log-in session use the session key returned by the Kerberos server in response to an initial request.

To prove its identity to a service provider, the client forwards the ticket together with a timestamp encrypted in the session key from the ticket (3). The service provider decrypts the ticket and uses the session key contained therein to decrypt the timestamp. If the timestamp is recent, the server knows that the message was recently generated by someone who knew the session key. Since the session key was issued only to the user named in the ticket, the client is authenticated. If the client requires authentication from the server, the server extracts the timestamp, reencrypts it using the session key, and returns it to the client (4).

We emphasize that users and service providers need to register encryption keys in advance only with the Kerberos server itself, and not with each party with which they will eventually communicate. Authentication by using Kerberos can work across administrative domains; when the client and application server are registered with different Kerberos servers, interrealm authentication supports access to services in other realms. The Kerberos server, as a trusted intermediary, generates a session key when needed, distributes it to the client, and places it in the ticket where it can be subsequently recovered by the service provider. This session key can then be used directly by the client and the service provider for encrypted communication as described in the preceding paragraphs.

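The sketch below reenacts messages (2) and (3) of Figure 16.2 in Python, using Fernet from the cryptography package as a stand-in symmetric cipher. Real Kerberos uses different encodings, key derivation, and many more fields; the field names here are illustrative.

    # Toy Kerberos message flow (pip install cryptography); not wire-compatible.
    import json, time
    from cryptography.fernet import Fernet

    K_c = Fernet.generate_key()   # derived from the user's password, known to AS
    K_v = Fernet.generate_key()   # the service provider's key, known to AS

    # (2) as_rep: the AS generates a session key and returns it encrypted in K_c,
    # together with a ticket encrypted in K_v.
    K_cv = Fernet.generate_key()
    ticket = Fernet(K_v).encrypt(json.dumps(
        {"client": "alice", "session_key": K_cv.decode(),
         "expires": time.time() + 3600}).encode())
    as_rep = Fernet(K_c).encrypt(K_cv)

    # (3) ap_req: the client recovers the session key and sends the ticket plus
    # a timestamp encrypted in the session key.
    session_key = Fernet(K_c).decrypt(as_rep)
    authenticator = Fernet(session_key).encrypt(
        json.dumps({"ts": time.time()}).encode())

    # Verifier: decrypt the ticket with K_v, then the authenticator with the
    # session key from the ticket; a recent timestamp authenticates the client.
    t = json.loads(Fernet(K_v).decrypt(ticket))
    ts = json.loads(Fernet(t["session_key"].encode()).decrypt(authenticator))["ts"]
    print(t["client"], "authenticated" if abs(time.time() - ts) < 300 else "rejected")
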
Although the Kerberos authentication protocol is based on conventional cryptography, recent extensions have provided for integration with public-key systems. In particular, the PKINIT extensions to the Kerberos protocol provide for the use of public-key cryptography and the use of existing certificates for initial authentication to the KDC. Subsequent authentication to application services uses the traditional Kerberos protocol and conventional cryptography. This hybrid approach allows for the use of certifications from public-key CAs and common administration, but the performance penalty for using public-key cryptography is felt only when the user first logs into the system (when it is less likely to be noticed).

16.3.4 Assurance

While assurance technologies are not widely deployed on the Internet today, several organizations offer images that may be placed on Web sites to indicate adherence to acceptable standards of practice. The limitation of these approaches is that their presence and authenticity are not validated and enforced by the application. Several frameworks for assurance have been proposed, including a mechanism for issuing assurance credentials [325] and the Platform for Internet Content Selection (PICS) [473].

Microsoft’s Authenticode and the Betsi system [483] provide for assurance of the validity and authenticity of executable content that may be downloaded (or uploaded). Authenticode has seen deployment within Microsoft products such as Internet Explorer. In general, though, acceptance of the techniques has been limited.

In the absence of absolute trust in the originator of executable code, systems may run untrusted code in an interpreter that will limit the functions that may be called by the code, and several systems are available to confine the execution of untrusted applications in this manner [238].

While this approach provides some protection against malicious code, the desire to give legitimate code enough power to perform the intended function often provides malicious code with functionality that was not intended. Many of the interpreters available today have bugs that allow malicious code to do more than is intended, and it isn’t clear how effectively the needs of legitimate functions can be balanced with the desire to contain malicious code.

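The toy Python fragment below suggests the flavor of interpreter-based confinement: untrusted code sees only an explicitly permitted set of functions. It also illustrates the fragility noted above; restricted interpreters of this naive kind are notoriously escapable and should not be mistaken for a real sandbox.

    # Toy confinement: untrusted code runs with only an allowed set of callables.
    # Naive sandboxes like this are famously easy to escape -- exactly the
    # weakness the text describes.
    ALLOWED = {"len": len, "sum": sum, "print": print}

    def run_untrusted(source):
        # An empty __builtins__ removes open(), __import__(), etc. from direct reach.
        exec(source, {"__builtins__": {}, **ALLOWED})

    run_untrusted("print(sum([1, 2, 3]))")          # permitted: prints 6
    try:
        run_untrusted("open('/etc/passwd')")        # denied: open is not exposed
    except NameError as e:
        print("blocked:", e)
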
16.3.5 Authorization

Distributed authorization mechanisms are included as part of systems like the Open Software Foundation’s Distributed Computing Environment (OSF-DCE), and local solutions are provided on computing platforms like UNIX and Windows NT. Only recently have we seen the definition of comprehensive frameworks for authorization, and to date only components of these frameworks have been implemented.

In general, distributed authorization services provide for the distributed maintenance of authorization information, such as group membership and access control lists, separate from the services that use them. Information about group membership, or authority to perform a particular operation, is transmitted to an end service provider through restricted authentication credentials [415, 182] or through the addition of special authorization attributes to public-key certificates.

Upon receiving such a certificate, a service provider verifies the signature of the issuer of the certificate, or the authenticity of the authentication credentials, and checks to make sure the rights conveyed allow the operation requested by the user. Implementation of the signed authorization certificates depends on the integrity and authentication services described earlier.

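A sketch of such a check appears below, with an invented certificate format: the service provider verifies the issuer’s signature over the attribute payload and then confirms that the conveyed rights cover the requested operation. It uses the Python cryptography package.

    # Sketch of checking a signed authorization attribute certificate
    # (format and field names are invented for illustration).
    import json, time
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
    from cryptography.exceptions import InvalidSignature

    issuer = Ed25519PrivateKey.generate()   # the trusted attribute authority

    attrs = {"subject": "alice", "rights": ["submit_job", "read_output"],
             "expires": time.time() + 600}
    payload = json.dumps(attrs, sort_keys=True).encode()
    cert = {"payload": payload, "sig": issuer.sign(payload)}

    def authorize(cert, issuer_public_key, requested_right):
        try:
            issuer_public_key.verify(cert["sig"], cert["payload"])  # issuer's signature
        except InvalidSignature:
            return False
        a = json.loads(cert["payload"])
        return time.time() < a["expires"] and requested_right in a["rights"]

    print(authorize(cert, issuer.public_key(), "submit_job"))   # True
    print(authorize(cert, issuer.public_key(), "reboot_node"))  # False
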
Performance will be an important consideration when selecting authorization mechanisms for computational grids. We expect that authorization decisions will be made multiple times during the life of a task, but authentication might be necessary only during initiation of the task. Thus, we anticipate a large number of operations to validate a task’s authority to perform a particular operation. When using public-key cryptography as a basis for authorization, the use of the slower operation (for RSA it is signing; for DSA it is verification) should be minimized. Finally, if delegation is to be based on the issuance of a new certificate with more restricted attributes and a new encryption key, the cost of generating the encryption key may be prohibitive for frequent use.

16.3.6 IPSec, IPv6, and Virtual Private Networks

Many of the attacks on the security of distributed systems rely on the ability of an attacker to monitor and modify packets on the network. The IPSec suite of protocols developed by the Internet Engineering Task Force (IETF) and the security services that are present in IP version 6 provide for confidentiality and integrity protection of data at the network layer when sent between end systems.

When communication is first established between a pair of Internet hosts, a key distribution function is initiated to exchange a conventional encryption key. That key is used to provide confidentiality and integrity of the packets subsequently exchanged between the two systems. The key distribution function may be based on public-key cryptography, it may be based on other key distribution mechanisms like Kerberos, or it may use keys that were distributed in advance between the communicating systems. In contrast to the other examples of authentication and key distribution, these keys are associated with the communicating hosts rather than with applications or end users.

IPSec, IPv6, and proprietary technologies available from some vendors allow the creation of virtual private networks (VPNs): networks implemented by using the shared physical infrastructure of the Internet, but with communication permitted only between participating nodes in the private network and protected from disclosure to, and modification by, nodes that are not participants.

These systems provide some improvement in security for distributed applications and will often be the appropriate technologies to use when it is impractical to integrate security at the application layer (which might be difficult without the source code for the distributed application). However, because these systems operate at the network layer, they cannot provide for authentication of the end user, and they do not have knowledge of the application-level objects that are to be protected. Hence, they have limited ability to support security policies that distinguish users and application objects.

16.3.7 Firewalls

Firewalls provide a barrier at the boundary to an organization’s network through which only specifically authorized communication may proceed. In general, firewalls fill an important need in an organization’s security policy because if they have been properly configured and if all paths into the network are protected by a firewall, then they prevent many kinds of attack on hosts within the organization’s network. Firewalls are less useful as a means to protect grid applications because the communication patterns for legitimate applications running on a computational grid will, by their very nature, require communication through the firewall, making it difficult for the firewall to distinguish legitimate communication from security violations.

By integrating IPSec and VPN technologies at network boundaries, firewalls can play a role in constructing a computational grid across a set of cooperating organizations. In such a system, communication on the internal networks of the cooperating organizations could remain unprotected. A firewall at the boundary between each unprotected network and the rest of the Internet would encrypt messages leaving the local network and decrypt messages entering the local network. Communication between nodes in this private grid shared by the cooperating organizations would then be protected when sent over the Internet, but would remain in the clear for communication within the local network, hence removing the need for each internal host to maintain its own set of security parameters.

16.3.8 Integration with Communication Layers

For the services described so far to have an effect on the security of a computational grid, the protocols already developed and those under development must be integrated with the communications and resource management mechanisms used by the grid. In general, integration is one of the most difficult aspects of deploying security services today. Security services can be integrated with protocols at several layers.

Efforts are under way in the IETF to add security services at the IP layer [33]. With these extensions, computer systems will be able to authenticate to one another, and communication between the systems can be encrypted. Integrating security services at this layer does not provide authentication of the individual users of the system to the remote service providers and thus does not, by itself, meet the requirements for authentication (in support of access control) by many applications. It does, however, improve the confidentiality and integrity of communications by applications running on those systems, including applications that have not been modified to use application-level security services.

Integration of security services can also occur at the application layer, and changes at the application layer are necessary for services where the operations allowed depend on the identity of the user. Integrating security at this layer can be cumbersome, requiring changes to the application protocol for each application. The Common Authentication Technology Working Group of the IETF has developed the Generic Security Services Application Programming Interface (GSS-API) [347] to facilitate the integration of security services at the application layer. When using the GSS-API, applications make calls to authentication, confidentiality, and integrity services in a manner that is independent of the underlying security services.

Integration of security services is easier for applications that run on top of RPC and similar transport mechanisms. When running on top of such transport protocols, user authentication, confidentiality, and integrity can be provided at the transport layer. Though the application must still be modified to ask the right questions and to use the answers as a basis for authorization, such changes to the application are less intrusive than changes to the application protocol itself. Security services have been integrated at the RPC layer for the Open Software Foundation’s DCE RPC [425] and Sun’s ONC RPC [289].

The transport layer is likely to be the correct place to integrate security services for a computational grid. Security services can be integrated with the communications layer used for communication between cooperating tasks, providing the appropriate level of communications security (confidentiality and integrity protection) for the application’s needs. The level of protection provided at this layer may also be adjusted to take into account knowledge about the lower-level communications medium. For example, when two tasks are communicating across a bus on a tightly coupled multiprocessor, where it is known that no untrusted jobs have access to the bus, encryption might be bypassed, improving the performance of the communications primitives.

Modular Integration

Security requirements differ from application to application, and leaving out security modules can significantly improve application performance when the physical network topology already guarantees the level of security that is required. For these reasons, the integration of security services into the computational grid must be modular: decisions will be made based on topology and other factors to select the security modules to be used. However, it must be understood that for two processes to communicate directly, they should share a common security mechanism, and this will dictate a relatively small set of required mechanisms.

16.4 FUTURE DIRECTIONS

This chapter presented an overview of the security requirements for grid applications, and it discussed some of the technologies that are available today to meet those requirements. The requirements that have not yet been sufficiently met by existing technologies fall into three categories: group communication, accounting, and policy management and administration.

16.4.1 Group Communication

One characteristic of grid applications that affects communications security is the use of group communication. As will be discussed in Chapter 18, many grid applications use algorithms that require that data generated by a task during one phase of a computation be sent to multiple neighbor tasks, or even to all other tasks, for use during the next phase of the computation. When a computation is performed on a grid that spans multiple administrative domains or when it crosses insecure networks, mechanisms are needed to ensure the confidentiality and integrity of the data to be distributed. We already discussed how encryption can be used to provide those assurances, but encryption can be an expensive operation and it could adversely affect the performance of the application. Thus, it is beneficial if the data can be encrypted and integrity protected once, for receipt and verification by all intended recipients.

If all tasks taking part in the computation are considered trusted, and the nodes on which they run are considered trusted, a single symmetric encryption key can be distributed that will be held by all tasks and used to protect communication. If tasks do not all trust one another, it becomes necessary to use a separate key for each pair of communicating entities, requiring multiple encryption of the same message as it is sent to different recipients, precluding the use of multicast communications for distribution of the message (unless all versions are sent to all recipients). Group communication can also be based on public-key cryptography using the private key of the sender for generation and the public key for verification. This approach is simpler than generating multiple messages, and it reduces the necessary network bandwidth to transmit the message (one copy multicast instead of many copies point to point), but it imposes a significant performance penalty because the public- and private-key operations must now be performed on each message.

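The contrast between the two approaches can be sketched as follows: with pairwise keys, the sender produces one protected copy per recipient, whereas a single public-key signature lets one multicast copy be verified by every recipient. Key distribution is elided, and the recipient names and pairwise key derivation are illustrative only.

    # Pairwise symmetric protection vs. sign-once multicast (pip install cryptography).
    import hmac, hashlib
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    data = b"phase-2 boundary values"
    recipients = ["taskB", "taskC", "taskD"]

    # Pairwise keys: the sender must produce one protected copy per recipient.
    pair_keys = {r: hashlib.sha256(r.encode()).digest() for r in recipients}  # illustrative
    copies = {r: hmac.new(k, data, hashlib.sha256).digest() for r, k in pair_keys.items()}
    print(len(copies), "separately protected copies")            # 3

    # Public-key signing: one signature, one multicast copy, verified by all.
    sender = Ed25519PrivateKey.generate()
    signature = sender.sign(data)                                # done once per message
    for r in recipients:
        sender.public_key().verify(signature, data)              # each recipient verifies
    print("one signed copy verified by", len(recipients), "recipients")
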
Several research efforts have attempted to address aspects of this problem, including the ISIS [62], Horus [548], and Cliques [523] systems.

16.4.2 Distributed Accounting

Few of today’s grid systems provide a means to limit the aggregate consumption of resources across all nodes in the system. Instead, a user must be authorized to run applications on each node. This form of authorization is somewhat analogous to requiring that the user have an account on each node in the system where his or her application will run.

As discussed earlier, accounting provides for the management of resource quotas across all the nodes in a grid and will thus eliminate the need for the user to have an account on each system that is to be used. Instead, the user’s job can run under a temporary ID set aside for use by foreign users, and the resources consumed can be charged against the original user’s quota through the distributed accounting system. Eliminating the need to maintain user accounts across all computer nodes eases the administration of the grid as a whole and allows the grid to scale beyond systems managed by a single authority.

The NetCheque system [416] implements a distributed accounting system that is suitable for grid applications, but it is not yet widely deployed. With the NetCheque system, a user wishing to launch a task on a node in the grid generates a signed payment instrument that allows the resource manager to charge the user’s account. When a resource is consumed, the manager forwards the instrument to an accounting server, which credits an account specified by the resource manager and debits the user’s account. Accounts can be maintained in currencies such as disk blocks, CPU cycles, or printer pages, or in monetary form as dollars or yen, with the price charged for a resource negotiated between the user and the resource manager.

The NetCheque payment instrument is represented as an authorization credential (the restricted proxy described earlier in this chapter) that delegates authority to transfer a specified value from the grantor’s NetCheque account. Charges can clear across multiple accounting servers, improving the scalability of the system and allowing its use across organizations. When the user and service provider are registered with different accounting servers, the service provider deposits the payment instrument with its local accounting server with an endorsement authorizing that accounting server to clear the check on behalf of the service provider.

16.4.3 Policy Management and Administration

Today’s tools for managing and administering the security of distributed systems are inadequate for managing large systems, and significant work is needed in this area. Currently, security policy is often specified—if it is specified at all—by configuring access control lists associated with individual resources. Flexible methods for distributed authorization can go a long way toward easing the administration problem by allowing users and objects to be grouped so that when a particular policy is changed in one place (such as when membership in a group changes), the change affects all resources to which the policy applies (for example, objects to which the group has access).

16.5 SUMMARY

Security is critical for the widespread deployment of a computational grid, and there are many security technologies that can be directly applied to meet these requirements. To be effective, a policy must be developed regarding access to grid resources, including a policy regarding the executable code that may run, who may originate a computation, which systems are trusted to provide computational resources, and how data is to be protected as it is sent across the network.

Because grids may span many organizations, multiple security policies will apply. Organizations providing computing resources will specify policies to protect their nodes, and in turn to protect other computations that run using common resources. The user will have policies regarding protection of the results of the computation and access to the data that is manipulated by the application. This policy may limit the selection of nodes on which the computation will execute.

Enforcement of these policies will affect the choice of security mechanisms. When selecting mechanisms, we must consider the performance and scalability characteristics of each mechanism. Because different grid components will have different characteristics and different parties will have different policies, the integration of security technologies should be modular, allowing any pair of nodes to choose the security mechanisms that are best suited to their particular requirements, as long as the choice is consistent with the security policies of both.

16.5.1 The Grid Security Environment

Computational grids will operate in a heterogeneous security environment. Some nodes will be on internal networks protected from the Internet by firewalls, with communication between these internal networks encrypted. Other nodes may be out in the open and require network-, transport-, or application-level encryption to protect the confidentiality and integrity of communications. Where confidentiality and integrity of data are of particular concern, encryption of data will be provided even behind firewalls because this data is otherwise vulnerable to attack from any system on the internal network and because these internal systems may be compromised through holes left in the firewall or through alternative communication paths into the network.

The security policies applied within computational grids must specify protection of individual objects and resources, and the access to be granted to individuals. Such policies require integration of security services with applications, and firewalls and network layer encryption will not be sufficient on their own to support such policies. We will also see policies based on barter, or ability to pay for resources, requiring integration with distributed accounting mechanisms.

Grid computations will require authority to access resources available to the originator of the application. Such access will be granted through delegated authority, and the credentials granting such authority will require protection when sent to the remote node, and from other applications sharing the node.

16.5.2 Implications for Grid Applications

Technologies exist today to protect confidentiality and integrity; to provide authentication, authorization, and accounting; and to ensure the integrity of service providers and executable content. Systems and infrastructure supporting confidentiality, integrity, and authentication are already deployed.

The biggest problem still to be faced is the deployment of the remaining technologies, and the integration of each of the technologies with applications. Grid applications and the tools for managing resources and applications must provide an interface through which authorization information may be specified, and such information must be carried or referenced by protected objects. This information encodes the security policies specified by the user, the user’s organization, and the organization owning grid resources. This information will be used to select and implement the appropriate security mechanisms.

ACKNOWLEDGMENTS

The writing of this chapter was supported in part by the Defense Advanced Research Projects Agency under the Scalable Computing Infrastructure (SCOPE) Project, TNT contract no. DABT63-95-C-0095, and the Security Infrastructure for Large Distributed Systems (SILDS) Project, contract no. DABT63-94-C-0034. The views and conclusions contained in this chapter are those of the author and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Army Intelligence Center and Fort Huachuca Directorate of Contracting, the Defense Advanced Research Projects Agency, or the U.S. government.

FURTHER READING

For more information on the topics covered in this chapter, see www.mkp.com/grids and also the following references:

- Schneier’s book [493] provides comprehensive coverage of many of the available cryptographic algorithms and computer security protocols.

- The proceedings of the ACM Conferences on Computer and Communications Security, the ISOC Symposia on Network and Distributed Systems Security, and the IEEE Symposia on Security and Privacy describe recent developments in security for distributed systems.

18 CHAPTER

Network Protocols

P. M. Melliar-Smith and Louise E. Moser

Traditional network protocols supported the movement of bits from source to destination. With the inadequate communication media of the past, that was achievement enough. But communication media are improving quite rapidly, though perhaps not rapidly enough, and are becoming cheaper, faster, more flexible, more dependable, and ubiquitous. The network protocols of the future will be designed specifically to support applications, and application requirements will determine the communication services that are provided. Ingenious protocol mechanisms will no longer be required to overcome the inadequacies of the network infrastructure. Communication services are what is important to the users; protocol mechanisms should remain invisible to them.

Many network protocols are built hierarchically as a stack of protocols, as shown in Figure 18.1. The implementation of each protocol in the stack exploits the services provided by the protocols below it. The service provided by each protocol is freestanding and does not depend on whether the implementation of that protocol exploits other protocols. One approach to assembling a protocol stack is to allow the user to construct the stack by choosing protocols from a toolkit of microprotocols [491, 548].

[Figure 18.1: Protocol stack. Two applications communicate over a logical protocol; each protocol in the stack provides a service to the layer above it, and only the lowest layer uses the physical connection.]

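The layered structure can be sketched as follows: each layer implements its service by exploiting the send operation of the layer below, and an application assembles the stack it needs. The layers shown are invented for illustration.

    # Toy composition of microprotocols into a stack, in the spirit of Figure 18.1.
    class Channel:                              # lowest layer: raw "network"
        def send(self, data): print("wire:", data)

    class Framing:                              # adds a length header
        def __init__(self, lower): self.lower = lower
        def send(self, data):
            self.lower.send(len(data).to_bytes(4, "big") + data)

    class Checksum:                             # adds an integrity check
        def __init__(self, lower): self.lower = lower
        def send(self, data):
            self.lower.send(data + sum(data).to_bytes(4, "big"))

    # The application composes a stack from the microprotocols it needs.
    stack = Checksum(Framing(Channel()))
    stack.send(b"hello")
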
In this chapter, we address the topic of network protocols—their characteristics and the services they provide. First, we consider the grid applications discussed in Chapters 3 through 6 and the generic types of communication services they need (Section 18.1). In Section 18.2, we identify four different classes of network protocols that provide these types of services. In Section 18.3, we discuss different types of message delivery services needed by the applications. In Section 18.4, we address the issue of maintaining consistency of replicated information, which leads into a discussion in Section 18.5 of group communication and group membership protocols. In Section 18.6, we consider the issues of resource management, including flow control, congestion control, latency, and jitter. With these services and issues in mind, we then present in Section 18.7 an overview of several of the more interesting recently developed protocols. Finally, we discuss in Section 18.8 the future of network protocols.

18.1 APPLICATION REQUIREMENTS

The classical network protocol provides a point-to-point, sender-initiated data transfer service, delivering messages in the order in which they were sent, with good reliability and with throughput as the measure of performance. However, each application has its own reasons for communicating and its own types of information to be communicated, which may impose very different requirements on the network protocols.

Teleimmersion and collaborative virtual environments (see Chapter 6) are some of the most challenging applications for network protocols because of their varied, demanding requirements. These applications require the following protocols:

- Data streaming protocols for audio and video, often multicast at high rates

- Reliable data transport protocols for collaboration, again often multicast

- Coordinated control protocols and membership protocols for collaboration

- Protocols for the integration of complex, independently developed modules

Realtime instrumentation applications (see Chapter 4) also have demanding requirements and are currently limited by a reluctance to commit such critical applications to the unreliability of the Internet. These applications require the following protocols:

- Data transport protocols that provide reliable delivery of messages and also unreliable timestamped delivery of messages containing the most recent available data

- Coordinated control protocols with fault tolerance

- Protocols for the integration of complex, independently developed modules

Data-intensive applications (see Chapter 5) are still in their infancy. These applications will undoubtedly require the following protocols:

- Data transport protocols for the efficient, rapid, and reliable transport of huge datasets

- Protocols for the integration of complex, independently developed modules

- Protocols for the encapsulation and integration of existing modules and movement of program code to remote sites where the data are located

Distributed supercomputing applications (see Chapter 3), both on interconnected supercomputers and on clusters of small computers, are perhaps the best developed of the four types of applications. These applications require the following protocols:

- Low-latency, reliable data transport protocols, sometimes multicast

- Low-latency control and synchronization protocols that are scalable to quite large numbers of nodes

- Protocols for the encapsulation and integration of existing modules and movement of program code to remote sites at which the computation is to be performed

18.2 CLASSES OF NETWORK PROTOCOLS

Despite the multitude of network protocols that have been proposed, a few major classes of protocols are starting to emerge. We consider four of these below. There is a common tendency to describe the mechanisms of the protocols, even though it is the protocol services that are important to the users. We endeavor to emphasize the services.

18.2.1 Data Transport Protocols

Data transport protocols are the workhorses of the Internet and are exemplified by the venerable but highly effective Transmission Control Protocol (TCP). They typically provide reliable, source-ordered delivery of messages with excellent flow control and buffer management. Most data transport protocols are point to point, but multicast protocols are becoming more common.

The Xpress Transfer Protocol (XTP) [34, 526] is an example of a data transport protocol that has been designed specifically to achieve high performance, particularly for parallel computations. Quite different is the Scalable Reliable Multicast (SRM) protocol [194], a reliable data transfer protocol intended for multicasting to large groups over the Internet, as required, for example, in collaborative virtual environment applications. XTP and SRM are described in more detail in Sections 18.7.1 and 18.7.2, respectively.

18.2.2 Streaming Protocols

Streaming protocols are typically used for audio, video, and multimedia and also for certain instrumentation applications. These applications do not require reliable delivery and can accept message loss, particularly if selective-loss algorithms are used. Multicasting is common in these applications, but unreliable source-ordered delivery suffices. Even when the traffic is bursty, flow control is ineffective, and bandwidth reservation is preferable. Low jitter is important.

An example of a streaming protocol is the Real-Time Transport Protocol (RTP) [498], described in Section 18.7.4, which might be used in conjunction with the Resource reSerVation Protocol (RSVP) [76, 584], described in Section 18.7.5. Unfortunately, the datagram orientation of the Internet does not yet support streaming protocols well; the connection-oriented, negotiated quality-of-service approach of ATM generally provides a better foundation for streaming protocols.

18.2.3 Group Communication Protocols

What distinguishes group communication protocols from the data transport and streaming protocols described above is that group communication protocols are concerned with more than just movement of data. In a distributed system, to permit effective cooperation and to provide fault tolerance, a group of several processes must all keep copies of the same data, and the copies of the data must be kept consistent as the application executes. Maintaining consistency is difficult for the application programmer, particularly in the presence of faults. Group communication protocols assist application programmers in maintaining the consistency of replicated data by maintaining the memberships of process groups and by multicasting messages to those process groups.

The InterGroup Protocols (IGPs) [56], described in Section 18.7.3, are group communication protocols being developed for distributed collaborative virtual environments and realtime instrumentation applications, to allow scientists and engineers to collaborate and conduct experiments over the Internet. The IGP protocols might also be suitable for other applications. They pay special attention to scaling to large sizes and to minimizing the effect on latency of long propagation times across a large network.

18.2.4 Distributed Object Protocols

A new area of protocol development is distributed object protocols for heterogeneous distributed systems, exemplified by the Internet Inter-Orb Protocol (IIOP) [388] developed by the OMG for CORBA (see also Chapter 9). CORBA supports the use of existing legacy code, provides interoperability across diverse platforms, and hides the distributed nature of the computation and the location of objects from the application. However, the overheads of CORBA’s remote method invocation are currently still high. IIOP, described in Section 18.7.6, allows ORBs from different vendors to work together. CORBA and IIOP are well suited for distributed collaborative virtual environments and realtime instrumentation applications, which involve the integration of complex distributed systems from existing software.

Also important is Java’s Remote Method Invocation (RMI) [290], described in Section 18.7.7, which supports the movement of code to remote machines (see also Chapter 9). While similar in some respects to CORBA, Java RMI provides less support for the integration of complex systems and no support for existing code. Currently, Java is less efficient than CORBA, but improved performance should come soon. Java and RMI are particularly well suited for data-intensive and distributed supercomputing applications, as these can require movement of programs. The integration of Java with CORBA achieves the best of both technologies.

18.3 TYPES OF MESSAGE DELIVERY SERVICES

The most basic form of message delivery service is unreliable message delivery, which provides only a best-effort service with no guarantees of delivery. Unreliable message delivery is used for audio and video streaming protocols, such as are needed in teleimmersion. Most applications require a more stringent service, reliable message delivery, which ensures that each message sent is eventually received, possibly after multiple retransmissions, despite loss of messages by the communication medium.

Many recent network protocols extend the point-to-point service of the traditional network protocol to a multicast service [151]. Multicasting allows efficient delivery of the same information to several, possibly many, destinations—a service needed particularly by collaborative virtual environments but also by other applications. Multicasting is conceptually simple but involves many complications in the protocol mechanisms to achieve efficiency. Typical multicast protocols provide only one-to-many multicasting, where a single sender sends to multiple destinations, but collaborative virtual environments require many-to-many multicasting.

Another important criterion for a message delivery service is the order in which messages are delivered, including causally ordered delivery, source-ordered delivery, group-ordered delivery, and totally ordered delivery, as described below.

Causally ordered delivery [328] requires that (1) if process P sends message M before it sends message M′ and process Q delivers both messages, then Q delivers M before it delivers M′, and (2) if process P receives message M before it sends message M′ and process Q delivers both messages, then Q delivers M before it delivers M′. Delivery of messages in causal order prevents anomalies in the processing of data contained in the messages, but it does not maintain the consistency of replicated data.

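The standard vector-clock delivery rule (one common way to realize causally ordered delivery, not the only one) makes this definition concrete: a message from process j is deliverable at process i once i has delivered j’s previous message and every message the new one causally depends on.

    # Sketch of the vector-clock rule for causally ordered delivery: a message
    # m from process j is deliverable at process i when m's clock shows exactly
    # one new event from j and nothing undelivered from any other process.
    def deliverable(local_clock, msg_clock, sender):
        ok_sender = msg_clock[sender] == local_clock[sender] + 1
        ok_others = all(msg_clock[k] <= local_clock[k]
                        for k in range(len(local_clock)) if k != sender)
        return ok_sender and ok_others

    # Process 0's view; processes are numbered 0..2.
    local = [2, 1, 0]
    print(deliverable(local, [2, 2, 0], sender=1))  # True: next message from 1
    print(deliverable(local, [2, 2, 1], sender=1))  # False: depends on an
                                                    # undelivered message from 2
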
Source-ordered delivery, or FIFO delivery, requires that messages from a particular source are delivered in the order in which they were sent. For multimedia streaming protocols, such as are required for teleimmersion, source-ordered delivery suffices. Data distribution, such as in data-intensive applications, also requires only source-ordered delivery.

Group-ordered delivery requires that, if processes P and Q are members of a process group G, then P and Q deliver the messages originated by the processes in G in the same total order. A reliable group-ordered message delivery service helps to maintain the consistency of replicated data when processes that constitute a group have copies of the same data and must update that data.

Totally ordered delivery requires that, if process P delivers message M before it delivers message M′ and process Q delivers both messages, then Q delivers M before it delivers M′. Totally ordered delivery is important where systemwide consistency across many groups is required, as in realtime instrumentation applications.

Various types of message delivery services are illustrated in Figure 18.2.

[Figure 18.2: Types of message delivery services, where A, B, and C are processors and A1, A2, and A3 are messages sent by processor A, and so forth. The figure contrasts reliable group-ordered delivery (every processor delivers all messages in the same total order), reliable source-ordered delivery (every processor delivers all messages, with ordering guaranteed only per source), and unreliable source-ordered delivery (messages may be lost, with ordering again guaranteed only per source).]

18.4 MAINTAINING CONSISTENCY

In a distributed system, processing and data may be replicated for increased reliability or availability, or faster access. In particular, several processes may perform different tasks of the application and may each have a copy of the same data so that they can cooperate. Alternatively, several processes may be replicated for fault tolerance, with each replica performing the same task of the application and holding a copy of the same data. In both cases, the copies of the replicated data must be kept consistent as the application executes.

Group communication protocols [56, 62, 162, 372, 401, 548] help to maintain the consistency of replicated data. Such protocols distribute update messages to all processes holding copies of the replicated data using a multicast service. They provide a reliable, totally ordered message delivery service, which ensures that each process performs the same sequence of updates in the same order, as shown in Figure 18.3. This simplifies the application programming required to maintain the consistency of the replicated data.

[Figure 18.3: Total ordering on messages helps to maintain the consistency of replicated data. Three clients multicast messages to two servers holding replicated data; both servers deliver the messages in the same total order.]

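The sketch below illustrates why this suffices: replicas buffer updates and apply them strictly in the agreed sequence order, so their copies converge even when updates arrive in different orders. A real group communication system establishes the sequence itself; here it is simply given.

    # Replicas applying totally ordered updates converge to the same state.
    class Replica:
        def __init__(self):
            self.data = {}
            self.next_seq = 0
            self.pending = {}                  # seq -> update, held back

        def deliver(self, seq, key, value):
            self.pending[seq] = (key, value)
            # Apply updates strictly in sequence order, buffering gaps.
            while self.next_seq in self.pending:
                k, v = self.pending.pop(self.next_seq)
                self.data[k] = v
                self.next_seq += 1

    a, b = Replica(), Replica()
    updates = [(0, "x", 1), (1, "x", 5), (2, "y", 7)]
    for u in updates:                 # replica a receives the updates in order
        a.deliver(*u)
    for u in reversed(updates):       # replica b receives them out of order
        b.deliver(*u)
    print(a.data == b.data)           # True: identical state at both replicas
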
Because they are designed to support fault tolerance in addition to coordination of distributed computation, these protocols are very robust and handle a wide range of faults, including processor faults and network partitioning. They include excellent fault detection, group membership, and recovery protocols that provide a consistent view of faults and membership changes systemwide [62, 400].

Early group communication protocols were inefficient, but performance has improved. Current group communication protocols operate very efficiently in the benign private world of a LAN, as efficiently as other protocols despite the more elaborate services they provide. Effective strategies are, however, still being developed for the hostile world of the Internet [56], where resources must be shared with unknown and uncooperative users.

Applications for these protocols can be purely computational (e.g., distributed supercomputing applications), where tasks might be allocated to processors by a distributed scheduler, which is coordinated by a group communication protocol. Alternatively, these protocols can be used to mediate human interaction, as in a collaborative virtual environment, where a group communication protocol can ensure that all users see the same updates to a shared workspace. The protocols can also be used in the control or monitoring of a physical device, as in a realtime instrumentation application, where different processes control different parts of the instrumentation, coordinated by a group communication protocol.

18.5 GROUP MEMBERSHIP

A process group [62] is a set of processes that collaborate on an application task. Messages related to the application task are multicast to the members of the process group. A distributed application may consist of hundreds of process groups, and a process group may contain hundreds of processes. The processes may join and leave the groups voluntarily, as in a collaborative virtual environment, or because a process or processor has failed, as in a realtime instrumentation application.

Naming mechanisms that work well for simple applications on small, stable networks are inappropriate for large complex dynamic applications. For example, if a processor fails, its processes may be replaced by other processes on different processors. Thus, messages are not addressed to a specific process on a specific processor. Rather, each message is addressed to the process group and is delivered to all of the processes that have joined that group. This strategy must be supported by mechanisms that allocate unique names to process groups and by a name server.

Some applications and many protocol mechanisms need to know the current group membership for each process group so that, for example, a distributed scheduling algorithm can assign tasks on a distributed supercomputer. It is also important to establish a precise relationship between message ordering and membership changes. For example, if state is being transferred to a new process on another processor, it is essential to establish whether the messages were processed before the membership change and thus are reflected in the state being transferred, or whether the messages were processed after the membership change and thus should be processed by the new process, as shown in Figure 18.4. The concept of ordering membership changes relative to data messages is called virtual synchrony [62].

[Figure 18.4: When state is transferred from one process to another, it is important that both processes agree on which messages should be processed before the state transfer and which should be processed after. In the figure, S fails while the membership is Q, R, and S; operation continues before the fault is detected; P then joins the group, state is transferred to P, and the membership becomes P, Q, and R.]

The maintenance of group membership is quite difficult—indeed theoretically impossible without the use of a fault detector—but several effective practical membership protocols have been developed. As shown in Figure 18.5, the typical approach involves the following phases:

- A discovery phase of broadcast solicitations and responses that establishes which processes are alive and can communicate

- A commitment phase using a form of two-phase commit that determines the membership

- A recovery phase in which messages from the prior membership are collected and delivered

[Figure 18.5: The phases of a membership protocol. During normal operation the membership is Q, R, and S. When S fails and the fault is detected, the membership algorithm runs its discovery, commitment, and recovery phases; P joins the group, the last few messages of the old membership are delivered, and normal operation resumes with new messages under the membership P, Q, and R.]

Liveness is ensured by repeating these steps as often as is required to complete the algorithm while at each step excluding at least one additional process thought to be faulty in the previous membership attempt. This guarantees termination in finite time but at the expense of possibly allowing, in unfavorable circumstances, each nonfaulty process to form a membership containing only itself. In practice, with appropriate values of the time-outs, the approach is robust and effective.

The problem with such a membership protocol, and many other protocols that maintain network topology, connectivity, or distance information, is that the cost is proportional to N², where N is the number of processes in the group. Furthermore, the interval between membership changes is inversely proportional to N. As N becomes large, the system may spend too much of its time in the membership protocol. Worse still, to ensure virtual synchrony, many group communication systems suspend message delivery during membership changes. As N becomes large, they may deliver no messages at all. Research is under way into weaker forms of membership that scale better while remaining useful to the applications, such as the collaborative virtual environments and distributed supercomputing applications, that may involve many processes.

18.6 RESOURCE MANAGEMENT

The great advantage that the Internet has demonstrated over traditional telecommunication networks is increased sharing of network resources, which leads to reduced costs. Many network connections have intermittent and bursty traffic. Where a network resource can be shared between enough connections, typically more than 20, quite high utilization of the resource can be obtained without serious degradation of the service provided to each connection. Multiple bursty traffic sources will, however, always be able to overwhelm the network; thus, flow control and congestion control will remain important.

18.6.1 Flow Control and Congestion Control

With modern communication equipment, packet or cell loss is almost always caused by congestion, either in the switches and routers of the network or in the input buffers at the destinations, rather than by corruption of packets or cells by the medium. Consequently, the TCP protocol uses packet loss as an indication that it should reduce its transmission rate. The TCP backoff algorithm, which employs rapid rate reduction on packet loss and slow rate increase in the absence of packet loss, is very effective for data transmission, as for example in data-intensive and distributed supercomputing applications. It allows many TCP connections to share the network fairly and efficiently without explicit coordination.
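The essence of this backoff behavior, additive increase and multiplicative decrease, can be sketched as follows; the constants are illustrative, and this is a caricature rather than TCP's exact algorithm:

    # Illustrative additive-increase/multiplicative-decrease rate control.
    def update_window(cwnd, packet_lost,
                      increase=1.0, decrease_factor=0.5, floor=1.0):
        """Return the new congestion window (in segments)."""
        if packet_lost:
            return max(floor, cwnd * decrease_factor)  # rapid rate reduction
        return cwnd + increase                         # slow increase per round trip

    cwnd = 10.0
    for lost in [False, False, True, False]:
        cwnd = update_window(cwnd, lost)   # 11 -> 12 -> 6 -> 7

Because every well-behaved connection reacts to loss the same way, the shared bottleneck converges toward a roughly fair division of bandwidth.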

The strategy suffers, however, from several problems. First, it is not appropriate for data streaming, such as the multimedia connections of a teleimmersion application, where the transmission rate is relatively high but predictable and constant. Such connections are best served by an explicit reservation strategy, such as ATM provides and RSVP has proposed for the Internet.

A second problem is that connections that observe the TCP backoff strategy can be squeezed out by rogue connections that do not reduce their transmission rate in the presence of congestion and packet loss. This problem is addressed by ATM's traffic-policing mechanisms and, for the Internet, by random early drop (RED) algorithms in Internet switches [193]. RED algorithms explicitly discriminate, by dropping packets, against connections that do not reduce their transmission rate in the presence of congestion.

[Figure 18.6] Latency is the time from message origination at the source to message delivery at the destination(s). Jitter is the variance in the latency.

TCP's backoff strategy also becomes less effective for connections that communicate at high rates over long distances. The time to detect packet loss on a 1 Gb/s transcontinental link corresponds to about 5 MB of transmitted data, given the 40 ms round-trip time. At, say, 40 Gb/s, this becomes over 200 MB, and switch buffering requirements become unreasonable.
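These figures follow from the bandwidth-delay product, the amount of data in flight before a loss can be detected:

    \[
    10^{9}\ \mathrm{b/s} \times 0.040\ \mathrm{s} = 4 \times 10^{7}\ \mathrm{bits} = 5\ \mathrm{MB},
    \qquad
    4 \times 10^{10}\ \mathrm{b/s} \times 0.040\ \mathrm{s} = 1.6 \times 10^{9}\ \mathrm{bits} = 200\ \mathrm{MB}.
    \]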

18.6.2 Throughput, Latency, and Jitter

Critical performance metrics for network protocols and for the applications are throughput, latency, and jitter. Throughput is the number of user data bits communicated per second. Latency is the time from message origination at the source to message delivery at the destination(s), and jitter is the variance in the latency, as shown in Figure 18.6.

Because of the inadequate performance of the network infrastructure, protocols have focused on maximizing the throughput of the network, which has been the bottleneck. As network throughput has improved, attention has shifted to latency and jitter. Protocol mechanisms that increase the throughput of the network may also increase the latency.

Long latencies are undesirable for some applications, such as distributed supercomputing and realtime instrumentation and control. Protocols have been developed, particularly for distributed supercomputing, with mechanisms to reduce the latency caused by the protocol stack and the operating system (see Chapter 20). Such mechanisms are effective, however, only for lightly loaded systems in which queuing delays are not significant. For heavily loaded complex systems, high throughput reduces queuing delays and is the most effective approach for reducing overall system latency.

[Figure 18.7] Multicast protocols incur a high risk of packet loss at the input buffers of the destinations.

Jitter is highly undesirable for audio, video, and multimedia applications, such as teleimmersion. Special protocols that timestamp every packet, such as RTP [498], allow accurate measurement of latency, which is effective at helping to reduce or mask jitter. Also important for multimedia applications, particularly those that use MPEG encoding of streams, are protocols and switches that select intelligently the packets to be dropped when congestion occurs [193].

18.6.3 Multicast/Acknowledgment Mechanisms

Good performance of group communication protocols, and even the feasibility of Internet video broadcasting, requires that multicasting be able to reach many destinations with a single transmission, sharing the bandwidth and avoiding wasteful multiple transmissions of the same information. With a modern high-performance network and multiple sources multicasting simultaneously, as in a collaborative virtual environment, it is easy to transmit information faster than the receivers can process it, resulting in a bottleneck at the input buffers of the receivers, as shown in Figure 18.7. When using high-performance LANs, rather than the slower Internet, the best group communication protocols focus their flow control mechanisms on preventing input buffer overflow [401].

Many recent protocols exploit negative acknowledgments rather than positive acknowledgments to achieve higher performance and higher reliability [194]. Positive acknowledgment protocols, also called sender-based protocols, require the receiver to send an acknowledgment to the sender if it has received the message. If the sender does not receive an acknowledgment within a certain period of time, it retransmits the message. In contrast, negative acknowledgment protocols, or receiver-based protocols, require the receiver to send a negative acknowledgment to the sender if it has not received the message within a certain period of time. With a large number of receivers in a multicast group, positive acknowledgment protocols can result in the sender being overwhelmed by an implosion of positive acknowledgments, as shown in Figure 18.8, making a negative acknowledgment protocol more appropriate.

[Figure 18.8] Positive and negative acknowledgments. With many receivers in a multicast group, there is an implosion of acknowledgments at the sender.

To achieve reliable delivery, positive acknowledgments are typically combined with negative acknowledgments. This combination allows messages to be removed from the sender's buffers when those messages are no longer needed for possible future retransmission. As larger amounts of data are transmitted over longer distances at higher speeds, the buffers must become larger, and buffer management becomes more important.

18.6.4 Translucency

The limited performance of current communication networks, coupled with rapidly growing and changing demands on network protocols, has encouraged translucency, or access, by the application programmer to the mechanisms of the network protocols. Examples are the following:

• Network-aware applications adapt their own behavior and demands on the network to reflect the current network load, latency, or rate of message loss. This strategy may allow applications to operate more effectively under adverse network conditions. The disadvantage is that the strategy increases application programming costs. Building a better network may be cheaper.

• Application-level framing allows the application to be aware of, and even manipulate directly, the actual packets or frames transmitted over the communication medium. The advantage can be lower overheads for applications such as teleimmersion; the disadvantage is that the application program loses portability and adaptability to future networks.

• Microprotocol toolkits allow the user to assemble the protocol stack by choosing the protocols from a toolkit of microprotocols [491, 548]. The advantage is that a skilled user can construct a protocol suited to the needs of the application. In general, however, network protocols constructed by the user from a microprotocol toolkit are less efficient than more highly integrated protocol implementations because the interfaces between the microprotocols incur overhead and inhibit code optimization. Additional inefficiency is caused by microprotocols that must be designed to be free-standing so that they can be used with or without other microprotocols.

As high-performance networks become readily available, as the networking environment stabilizes, and as the needs of applications become better understood, the advantages of translucency will diminish, and the costs of custom network programming in the application will be harder to justify, even for performance-sensitive applications such as teleimmersion and distributed supercomputing. The predominant consideration will be the convenience and efficiency of a standard, fully integrated protocol providing a simple, well-defined, and stable service to the application.

18.6.5 Achieving Adequate Network Performance

The primary requirement of adequate network performance is adequate bandwidth, which is currently a problem but will become less of a problem in the future. Efficient and cost-effective use of the network will, however, depend on sharing the cost of the bandwidth with many users, and on a pricing mechanism that restrains the growth in demand and also funds the increase in available bandwidth. Other resources, such as buffer space, must also be shared. Users who require exclusive reservation of resources, such as ATM AAL1 users, are likely to pay much higher costs than users who agree to share resources. The rapid growth and effectiveness of the Internet depend in part on the low costs made possible by packet-switching protocols that share the available bandwidth among multiple packets simultaneously. The congestion of the Internet results from the lack of an effective pricing mechanism.

As networks continue to develop, the greatest contributions of network protocols will result from improved services rather than improved mechanisms. Substantial performance improvements result from transmitting information only once instead of many times in a collaborative virtual environment, from transmitting the right information in a data-intensive application, or from not needing to transmit the information at all in a distributed supercomputing application. Caching, now used quite extensively to speed access to popular sites on the Web, is an example of a mechanism that improves network performance by reducing the need to transmit information. As yet, other network protocols have made little use of caching beyond buffering messages for possible retransmission after message loss. Improved protocols and improved network performance, resulting from transmitting the right information at the right time, will result from understanding better the services that the applications need.

18.7 EXAMPLE NETWORK PROTOCOLS

We describe here a few of the more interesting, recently developed network protocols. The different characteristics and mechanisms of these protocols make them appropriate for different aspects of grid applications such as those considered in this book.

18.7.1 Xpress Transfer Protocol

The Xpress Transfer Protocol (XTP) is a transport-level protocol designed for distributed parallel applications operating over clusters of computers. For flexibility and efficiency, it supports several different communication paradigms, as discussed below.

XTP supports both unicasting from a single sender to a single receiver and one-to-many multicasting from a single sender to a group of receivers. Data flows in one direction from the sender to the receiver(s), with control traffic flowing in the opposite direction. Multiple instances, or contexts, of XTP can be active at a node, providing partial support for many-to-many multicasting.

Normally, the receivers do not inform the sender of missing packets until the sender asks for such information; however, an option is available that allows the receivers to notify the sender immediately after detecting a packet loss. A control packet indicating lost packets contains the highest consecutive sequence number of any of the packets received by a receiver and the range of sequence numbers of missing packets. Retransmissions are multicast to the group and are either selective or go-back-N. The sender may disable error control but may still ask the receivers for flow control and rate control information.

XTP supports both flow control and rate control algorithms, which provide the requested quality of service. Flow control, in which further data is sent when acknowledgments are received for prior transmissions, is appropriate for both data and control. Rate control, in which data is transmitted at a constant rate, is appropriate for data streaming. Flow control considers end-to-end buffer space; rate control considers processor speed and congestion.

Multicast groups are managed by the user, based on information that the sender maintains about the receivers. The user specifies how the initial group of receivers is formed, and the criterion for admission and removal once the multicast group is established. The sender learns which receivers are active by periodically soliciting control packets.

The scalability of XTP is limited by its method of error recovery. As in other sender-initiated error recovery approaches, the sender may be unable to maintain state for large receiver groups, and the throughput of the sender may be slowed by control packet implosion and processing and by retransmissions. Multicasting retransmissions to the entire multicast group uses unnecessary bandwidth on links to receivers that have already received the packet.

18.7.2 Scalable Reliable Multicast Protocol

The Scalable Reliable Multicast (SRM) protocol is a lightweight protocol that supports the notion of application-level framing and is intended for applications such as collaborative virtual environments. SRM aims to provide reliable delivery of packets over the Internet, in that all packets should eventually be delivered to all members of the multicast group; no delivery order, either source ordered or group ordered, is enforced [194]. To receive packets sent to the group, a receiver sends a join message on the local subnet announcing that it is interested in receiving the packets. Each receiver voluntarily joins and leaves the group without affecting the transmission of packets to other group members.

Each group member is individually responsible for its own reception of packets by detecting packet loss and requesting retransmission. Packet loss is detected by finding a gap in the sequence numbers of the packets from a particular source. To ensure that the last packet in a session is received, each member periodically multicasts a session message that reports the highest sequence number of packets that it has received from the current senders. The session messages are also used to determine the current participants in the session and to estimate the distance between nodes.

When a node detects a missing packet, it schedules a repair request (retransmission request) for a random time in the future. This random delay is determined by the distance between the node and the sender of the packet. When its repair request timer for the missing packet goes off, the node multicasts a repair request for the packet, doubles its request timer delay, and waits for the repair. If the node receives a request for the packet from another node before its request timer goes off, it does an exponential backoff and resets its request timer.

When a node receives a repair request and it has the requested packet, the node schedules a repair for a random time in the future. This random delay is determined by the distance between the node and the node requesting the repair. When its repair timer for the packet goes off, the node multicasts the repair. If the node receives a repair for the packet from another node before its repair timer goes off, it cancels its repair timer.

The request/repair timer algorithm introduces some additional delay before retransmission to reduce implosion and to reduce the number of duplicates. It combines the systematic setting of the timer as a function of distance (between the given node and the source of the lost packet or of the repair request) and randomization (for the nodes that are at an equal distance from the source of the missing packet or repair request). Because the repair timer delay depends on the distance to the node requesting the repair, a nearby node is likely to time out first and retransmit the packet.
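A sketch of the request-timer logic follows. The class structure and the constants C1 and C2 are illustrative, but the timer is drawn, as in SRM, uniformly from an interval that scales with the estimated distance to the source:

    import random

    # SRM-style repair-request timer (sketch; constants c1, c2 illustrative).
    class RepairRequestTimer:
        def __init__(self, distance, c1=1.0, c2=1.0):
            self.distance = distance     # estimated distance to the source
            self.c1, self.c2 = c1, c2
            self.backoff = 1
            self.reset()

        def reset(self):
            # Deterministic part grows with distance; the random part breaks
            # ties among nodes at equal distance from the source.
            d = self.distance
            self.delay = self.backoff * random.uniform(
                self.c1 * d, (self.c1 + self.c2) * d)

        def on_duplicate_request(self):
            # Another node multicast the same request first:
            # back off exponentially and reschedule.
            self.backoff *= 2
            self.reset()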

18.7.3 The InterGroup Protocols

The InterGroup Protocols (IGPs) are intended for large-scale networks, like the Internet, with highly variable delays in packet reception and with relatively high packet loss rates, and for large-scale applications with relatively few senders and many receivers [56]. The IGPs are particularly appropriate for collaborative virtual environments and realtime instrumentation applications.

Each Data packet has in its header a unique source identifier, a sequence number, and a timestamp obtained from the sender's local clock. The sequence numbers are used to detect missing packets from the sender and to provide reliable source-ordered delivery of packets. The timestamps provide reliable group-ordered delivery of packets, while maintaining the causal order of messages sent within the group.

Each multicast group consists of two subgroups: the sender group and the receiver group. Superimposed on the underlying multicast backbone (the MBone of the Internet) is a forest of trees that includes all of the receivers in the multicast group. This forest of trees is used to collect acknowledgments, negative acknowledgments, and fault detection packets to reduce congestion and to achieve scalability. Data packets and retransmissions are multicast by using the MBone.

Only the senders multicast Data packets, but the senders are also receivers in that they receive Data packets. Both senders and receivers may send retransmissions of Data packets and are required to send Alive packets. An Alive packet serves as a heartbeat and also as a positive acknowledgment, since it contains the highest timestamp up to which the node multicasting the Alive packet has received all packets. Each node sends an Alive packet on an infrequent periodic basis, but only to its parent in the tree, and the information in the packet is then propagated up the tree. Upon receiving the information, a root node (i.e., sender) multicasts its Alive packet to the group. If a node detects that it has missed a packet, it sends a Nack (negative acknowledgment) packet. Under high packet loss rates, Nack packets are sent more frequently than Alive packets, but again a node sends a Nack packet only to its parent in the tree.

The InterGroup Protocols use approximately synchronized clocks at the senders and receivers, obtained by running a clock synchronization algorithm occasionally or by using the Global Positioning System. At minimum, these clocks are Lamport clocks that respect causality [328]. If every member of the sender group is sending packets, the maximum delivery time of a packet is D + τ, where D is the diameter of the group and τ is the interpacket generation time of the slowest sender. The maximum delivery time is kept low by using Alive packets from the senders.

To maintain consistency of packet delivery, both senders and receivers must know the membership of the sender group. For scalability, only the senders are responsible for detecting faults of the senders in the sender group and for repairs of the sender group. Before a node can deliver a packet from a particular sender, it must have received a packet, from each sender in the sender group, with a timestamp at least as great as that of the packet to be delivered. Otherwise, the node may deliver a packet from one sender with a higher timestamp before it receives a packet from another sender with a lower timestamp.

Both senders and receivers must know the membership of the receiver group, either implicitly or explicitly. This information is used for garbage collection so that a node can remove a Data packet from its buffers when it knows that all nonfaulty members of the receiver group have received the packet and thus that it will never need to retransmit the packet subsequently. Again, for scalability, to maintain the membership of the receiver group, each node is responsible only for fault detection of its children in the tree.

18.7.4 Real-Time Transport Protocol

The Real-Time Transport Protocol (RTP) [498] provides end-to-end delivery services for data with realtime characteristics, such as interactive audio and video (see also Chapter 19). RTP is intended primarily for multimedia conferencing with multiple participants, as in teleimmersion applications, but it is also applicable to interactive distributed simulation, storage of continuous data, and so on. RTP typically runs on top of the User Datagram Protocol (UDP) to utilize its multiplexing and checksum services, and supports data transfer to multiple destinations using multicasting provided by the underlying network.

RTP works in conjunction with the Real-Time Transport Control Protocol (RTCP). The Real-Time Transport Protocol provides realtime transmission of data packets, while the Real-Time Transport Control Protocol monitors quality of service and conveys session control information to the participants in a session.

Among the services provided by RTP are payload type identification, marker identification, sequence numbering, timestamping, and delivery monitoring. The payload type identifies the format of the RTP payload (e.g., H.261 for video), and the marker identifies significant events for the payload (e.g., the beginning of a talk spurt). The sequence numbers allow the receivers to detect packet loss and to restore the sender's packet sequence, but may also be used to determine the position of a packet in a sequence of packets as, for example, in video transmission. The timestamp, obtained from the sender's local clock at the instant at which a data packet is generated, is used for synchronization and jitter calculations.

RTP itself does not provide any mechanisms to ensure timely delivery or other quality-of-service guarantees. It does not guarantee delivery or prevent out-of-order delivery, nor does it assume that the underlying network provides such services. For jitter-sensitive applications, a receiver can buffer packets and then exploit the sequence numbers and timestamps to deliver a delayed but almost jitter-free message sequence.

RTCP is based on periodic transmission of control packets to all of the participants in a session. The traffic is monitored and statistics are gathered on the number of packets lost, highest sequence number received, jitter, and so forth. These statistics, transmitted in control packets, are used as feedback for diagnosing problems in the network, controlling congestion, handling packet errors, and improving timely delivery.

In addition, RTCP conveys minimal session control information, without membership control or parameter negotiation, as the participants in a session enter or leave the session. Although RTCP collects and distributes control and quality-of-service information to the participants in the session, enforcement is left to the application.

18.7.5 Resource reSerVation Protocol

The Resource reSerVation Protocol (RSVP) [76, 584] is designed for an integrated services Internet and is suitable for applications such as teleimmersion. More details on RSVP are provided in Chapter 19. RSVP makes resource reservations for both unicast and multicast applications, adapting dynamically to changing group memberships as well as to changing routes. RSVP operates on top of the Internet Protocol IPv4 or IPv6 and occupies the place of a transport protocol in the protocol stack. However, it does not transport application data, nor does it route application data; rather, it is a control protocol.

RSVP is based on the concept of a session, which is composed of at least one data stream or flow and is defined in relation to a destination. A flow is any subset of the packets in a session that are sent by a particular source to a particular destination or group of destinations. RSVP is used to request resource reservations in each node along the path of a flow and to request specific qualities of service from the network for that flow.

RSVP is receiver oriented in that the receiver of a flow initiates and maintains the resource reservation used for that flow. Thus, it aims to accommodate large groups, dynamic group memberships, and heterogeneous receiver requirements. It carries a resource reservation request to all of the switches, routers, and hosts along the reverse data path to the source. Since the membership of a large multicast group and the topology of the corresponding multicast tree are likely to change with time, RSVP sends periodic refresh messages to maintain the state in the routers and hosts along the reserved paths. This is referred to as soft state because it is built and destroyed incrementally.
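The soft-state idea reduces to a table of timed entries that persist only while refreshes keep arriving; a minimal sketch, with a hypothetical lifetime value:

    import time

    # Soft state: reservation entries expire unless refreshed (sketch).
    class SoftStateTable:
        def __init__(self, lifetime=90.0):   # e.g., a few refresh periods
            self.lifetime = lifetime
            self.entries = {}                # flow id -> expiry time

        def refresh(self, flow_id):
            # Called for each periodic refresh message for the flow.
            self.entries[flow_id] = time.time() + self.lifetime

        def expire(self):
            # State for departed receivers or changed routes simply times out;
            # no explicit teardown message is required.
            now = time.time()
            for flow_id in [f for f, t in self.entries.items() if t < now]:
                del self.entries[flow_id]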

Quality of service is implemented for a particular flow by a mechanism called traffic control that includes a packet classifier and a packet scheduler. The packet classifier determines the QoS for each packet, while the packet scheduler achieves the promised QoS for each outgoing interface. During reservation setup, a QoS request is passed to two local decision modules, admission control and policy control. Admission control determines whether the node has sufficient available resources to supply the requested QoS; policy control determines whether the user has administrative permission to make the reservation.

18.7.6 CORBA GIOP/IIOP

CORBA is based on the client-server model and supports interoperability at the user level (language transparency) via the OMG's Interface Definition Language (IDL) and at the communication level via inter-ORB protocols [388]. CORBA is also discussed in Chapters 9 and 10.

When a client object invokes a method of a server object, the call is bound to a static stub generated from the IDL specification of the server object, or the operation is invoked dynamically. The stub passes the call and its parameters to the Object Request Broker (ORB), which determines the location of the server object, marshals (encodes and packages) the call and its parameters into a message, and sends the message across the network to the processor hosting the server object. The ORB at the server unmarshals the message and passes the call and its parameters to a skeleton, which invokes the operation and returns the results to the client object, again via the ORB.

The CORBA 2.0 specification provides the technology that enables ORBs from different vendors to communicate with each other, known as the ORB Interoperability Architecture or, more specifically, the General Inter-ORB Protocol (GIOP). GIOP is designed to map ORB requests and responses to any connection-oriented medium. The protocol consists of the Common Data Representation, the GIOP message formats, and the GIOP message transport assumptions.

The Common Data Representation (CDR) is a transfer syntax that maps every IDL-defined data type into a low-level representation that allows ORBs on diverse platforms to communicate with each other. CDR handles variations in byte ordering (little-endian vs. big-endian) across the different machines hosting the ORBs. It also handles memory alignments to natural boundaries within GIOP messages.
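The byte-ordering problem that CDR addresses can be illustrated with Python's struct module; this shows the endianness issue itself, not the actual CDR encoding rules:

    import struct

    value = 0x12345678
    big    = struct.pack(">I", value)  # bytes 12 34 56 78 on the wire
    little = struct.pack("<I", value)  # bytes 78 56 34 12 on the wire

    # A receiver must know the sender's byte order to decode correctly:
    struct.unpack("<I", big)[0]        # 0x78563412 -- misinterpreted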

The GIOP message formats include Request, Reply, LocateRequest, LocateReply, CancelRequest, CloseConnection, MessageError, and Fragment. Each of these message formats is specified in IDL and consists of a standard GIOP header, followed by a message format-specific header and, when necessary, a body of the message. The Request header identifies the target object of the invocation and the operation to be performed on that object, and the message body contains the parameters of the operation. The Reply header indicates a successful (or unsuccessful) operation, and the message body contains the results (or the exception).

The GIOP message transport assumptions mandate that the underlying transport is a connection-oriented, reliable byte stream. Moreover, the transport must provide notification of connection loss. Clearly, TCP/IP is a candidate for such an underlying transport, and the mapping of GIOP to TCP/IP is standardized by the OMG as the Internet Inter-ORB Protocol (IIOP).

All CORBA 2.0–compliant ORBs must be able to communicate over IIOP, which effectively allows the ORBs to use the Internet as a communication bus. In using IIOP, objects publish their object references, or names, as Interoperable Object References (IORs) and register them with CORBA's Naming Service. These object references allow object addressing and discovery even in a network of heterogeneous ORBs.

18.7.7 Java RMI

While CORBA provides interoperability between different languages, Java Remote Method Invocation (RMI) [290] assumes the homogeneous environment of the Java Virtual Machine (JVM) (see Chapter 9). Like CORBA, RMI is based on the client-server model. For RMI, marshaling and unmarshaling of objects are handled by the Java Object Serialization system, which converts the data into a sequence of bytes that are sent over the network as a flat stream.

The RMI model consists of a suite of classes to support distributed object computing by allowing a client object to invoke methods defined in a server object running on a remote machine. Each remote object has (1) a remote interface that declares the methods that can be invoked remotely on the remote object and extends the interface Remote and (2) a remote object (server) that implements the interface and extends the class UnicastRemoteObject. Remote Method Invocation is the action of invoking a method defined in a remote interface on a remote object that implements this interface.

When a remote object is created, it is registered in a registry on the same machine. The registry stores the remote object's name along with a reference to a stub for the remote object. In order for the registry lookup to be effective, the remote object must have been bound previously to the registry. An application using RMI first makes contact with a remote object by finding the name of the object in the registry. The client then downloads the stub, which acts as a proxy for the remote object. The client can then invoke a method on the remote object exactly as it would invoke a method locally.

The arguments passed to a remote method can include both remote and nonremote objects. Remote objects are passed by reference. The remote reference passed is a reference to the stub for the remote object. Objects that are not remote are passed by value by copying over the network, and therefore must implement the Serializable interface.

Thus, Java RMI allows distributed applications written in Java to pass data, references to remote objects, and complete objects (including the code associated with those objects) from one machine to another. The ability to load code across the network distinguishes the Java RMI system from other distributed computing frameworks, such as CORBA. This is made possible because both client and server objects are implemented within the Java Virtual Machine and because Java bytecodes form a portable binary format.

18.8 THE FUTURE OF NETWORK PROTOCOLS

In the past, the primary limitations on computer applications were processing speed and memory size. Modern processors have removed these limitations for most applications. Currently, communication bandwidth is a major limitation for distributed applications, but the end of this limitation is in sight. The primary limitation on our ability to build distributed application systems in the future will be the complexity of the application programs. The network protocols of the future must, therefore, reflect this new primary limitation. They must be designed to simplify application programming, even at the expense of less efficient processing and use of communication bandwidth.

The application programmer of the future will focus on the application, rather than on communication. The invocation of a remote operation across the network will be no different from a local invocation. If an application must be fault tolerant, then replication of data will be automatic and invisible, as will be recovery from a fault. If an application involves multimedia, then the streams will be transmitted and delivered in a synchronized and jitter-free manner without involvement of the application programmer.

Network protocols are critical to the future of computing. Almost all computing in the future will be distributed computing; network protocols provide the glue that holds such computing together. Much of the computing of the future will involve the integration of components developed independently with little coordination, and the use of those components in ways that could not have been foreseen by their developers. Such integration is possible only with simple, well-defined, stable interfaces; such characteristics have been achieved by many network protocols but by little else in computing. If we treasure that simplicity, precise definition, and stability, our current and future network protocols will provide the interfaces from which the distributed application systems of the future will be built.

FURTHER READING

For more information on the topics covered in this chapter, see www.mkp.com/grids and also the following references:

• Gouda's text [241] on network protocol design is very readable.

• Halsall's text [259] is a comprehensive reference book.

• Holzmann's book [278] discusses, from an expert's perspective, how to design protocols.

• Stallings's IEEE tutorial [522] is a bit dated but still excellent.

CHAPTER 19

Network Quality of Service

Roch Guerin and Henning Schulzrinne

This chapter is devoted to the issue of quality-of-service support at the network level, as well as the use of such capabilities by distributed applications. In particular, this chapter reviews existing network mechanisms available to support QoS and identifies what they imply in terms of an application's performance and behavior. It also points to differences in cost between network services, as applications should also consider this aspect when selecting a service.

Without attempting a rigorous or comprehensive definition, we first try to clarify what is meant by "network QoS." The network is responsible for the delivery of data between entities involved in a distributed application, and this delivery has several dimensions that reflect the operational requirements of the application. Examples of those dimensions include the amount of data that needs to be delivered (rate guarantees), the timeliness of their delivery (delay and jitter guarantees), and the quality of their delivery (loss guarantees). Network QoS essentially implies some form of commitment along one or more of these dimensions. This differs from the current best-effort Internet model, which does not differentiate along such dimensions and where the network makes no commitment regarding the delivery of data.

In this chapter we focus on network services supported in IP networks. Although a number of networking technologies offer QoS services, our choice of IP is primarily motivated by the fact that it is likely to be the technology of choice for applications to interface to. In other words, we believe that the API used to request QoS services from the network will, in most instances, be IP based.

[Figure 19.1] Host and router architecture for providing QoS. The components shown include a QoS-aware application, the RSVP daemon, routing, the resource manager, admission control, a policy server, the packet classifier, and the link scheduler, connected through interfaces such as RAPI, the packet classifier API, and the socket API.

To better understand the capabilities and limitations of various network QoS services, we first identify the basic building blocks that networks use in offering and supporting those services. The two major ones are as follows:

1. Control path: the mechanisms that let an application describe the kind of service it wants to request, and allow it to propagate that information through the network. The control path is of particular importance to applications because it determines the semantics of the interface that the network makes available to applications, in order for them to request services. One example of such an interface to be used with the RSVP protocol can be found in [74].

2. Data path: the specific guarantees on how the transfer of packets through the network is to be effected and the mechanisms used to enforce those guarantees. These mechanisms are not specific to a given network technology and typically need not be externalized to applications requesting service from the network.

Figure 19.1 indicates the components of a router or host providing quality-of-service guarantees. Throughout this chapter, we will be explaining the various components.

Another important aspect that must be understood is cost. There is a cost associated with the provision of QoS guarantees in the network because the network has to allocate (dedicate) some amount of resources in order to ensure it can satisfy the applications' requirements. The amount of resources needed will vary as a function of how stringent the requested QoS guarantees are and how efficient the network is at utilizing its resources. However, we need to be aware that certain guarantees may be expensive to provide. For example, as discussed in Section 19.3.2, services that allow applications to request hard delay guarantees, while feasible even over IP networks, are likely to be among the more expensive. As a result, the user may want to evaluate an application's requirements for hard delay bounds against this cost. As discussed in Section 19.5, this is especially true for applications that can adapt to a certain range of fluctuations in network performance.

In the rest of this chapter, we expand on the various QoS services that networks offer. Section 19.1 provides a general perspective and classification of the criteria that the user may want to use in order to assess the suitability of different network services. Sections 19.2 and 19.3 focus on the two main components that influence the kind of network services available from the network: the signaling and service definition models. In Section 19.2, we describe the characteristics and behavior of the RSVP protocol that let applications request QoS guarantees from IP networks (see also Chapter 18). In Section 19.3, we review the current service models that have been defined for IP networks—guaranteed service and controlled load. Section 19.4 highlights criteria of significance to applications when selecting a specific service, and Section 19.5 describes alternative techniques that applications can use to adapt their requirements to the available network resources. Finally, Section 19.6 identifies a number of extensions and ongoing activities.

19.1 SELECTING NETWORK SERVICES

In this section, we outline a possible road map for how an application may choose between different network services. A user contemplating the selection of a particular network service should evaluate it along four main dimensions:

1. What kind of performance guarantees is it able to provide (e.g., throughput, delay, loss probability, delay variations), and how do those guarantees rank against the application's requirements?

2. What does it require in order to provide such guarantees? That is, what constraints does it impose on user behavior (traffic descriptors, conformance rules, etc.)?

3. What kind of flexibility does it offer to deal with variations in user requirements? That is, how does it handle excess traffic?

4. What does it cost?

The user should, therefore, attempt to articulate the requirements of the application in a manner that defines as precisely as possible the expectations along the above dimensions.

What kind of service guarantees does the network provide? Typically, network service guarantees specify the reliability of data transfer across the network, the kind of transit delay that the data will experience, and (possibly) the variations in this transit delay. Loss guarantees can range from the promise not to lose any data (except because of transmission errors), to specific loss probabilities, to only the avoidance of excessive losses through "adequate" provisioning. Similarly, delay guarantees can cover a wide range, from hard (deterministic) bounds on the maximum delay, to loose guarantees that "typical" delays won't exceed some generic value. Guarantees on delay variations also follow a similar pattern. For each of the above guarantees, the application needs to assess their significance for the quality of its operation, since the more stringent its requirements, the higher the associated network cost.

Another important aspect that must be considered is the relation between the data units of relevance to the application (e.g., a complete video frame) and the network data units to which the service guarantee will apply. The two are often quite different. For example, the loss of a single network data unit may render unusable a whole application data unit consisting of multiple network data units. It is therefore important to factor in those potential differences when requesting a certain level of service from the network (see [124] for discussions on these issues).

What does the network require in order to provide a given service guarantee? Typically, networks provide service guarantees only for a specific amount of traffic, which the application must specify when contracting the service. In other words, a service contract is typically associated with a traffic contract. Networks support a range of traffic contracts, but the challenge is to select a contract that accurately captures an application's traffic characteristics. Alternatively, the application may choose to renegotiate its service and traffic contracts as its requirements change. For example, an application could ask for a higher level of service (e.g., lower delay) or request permission to inject more traffic in the network to accommodate an increase in activity.
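In IP networks, such a traffic contract is commonly expressed as a token bucket with rate r and depth b; the following sketch (parameter values illustrative) checks packets for conformance:

    # Token-bucket conformance check (sketch): rate r tokens/s, depth b.
    class TokenBucket:
        def __init__(self, rate, depth):
            self.rate, self.depth = rate, depth
            self.tokens, self.last = depth, 0.0

        def conforms(self, now, packet_bytes):
            # Refill tokens at rate r, capped at depth b.
            # (now must be nondecreasing across calls.)
            self.tokens = min(self.depth,
                              self.tokens + (now - self.last) * self.rate)
            self.last = now
            if packet_bytes <= self.tokens:
                self.tokens -= packet_bytes
                return True       # within the traffic contract
            return False          # excess traffic (may be tagged or dropped)

The rate bounds the long-term average, while the depth bounds how large a burst the application may send at once; choosing these two numbers is precisely the challenge of capturing the application's traffic characteristics.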

How does the network handle excess traffic? And what does it cost? Although a flexible traffic contract or support for service renegotiation can help provide a service that matches an application's requirements, there are still instances when an application will exceed its contract. For example, some applications often specify minimal traffic and service contracts on the basis of cost. These contracts are meant only to ensure the application's ability to operate even in the presence of heavy network congestion and are not really representative of their typical requirements. As a result, such applications will usually generate a volume of traffic that is much higher than the value specified in their traffic contract. Network services that allow transmission of this excess traffic—although at a lower priority and to the extent that network resources are available—are clearly preferred for such applications.

Many of the above aspects are tightly coupled, and in general it is possible for an application to come up with multiple answers, each representing a different trade-off. In order to better understand the process that an application may wish to follow, it is useful to give an example as a guiding thread when reviewing the many different aspects that an application must evaluate in selecting a network service.

19.1.1 A Road Map for Service Selection

In the rest of this section, we use the teleimmersive collaborative design system application (of Chapter 6) to illustrate some of the possible choices that an application may face. The teleimmersion application actually consists of multiple types of flows, each with different requirements and, as a result, possible trade-offs in terms of network QoS. Its flows and the requirements that are of greatest significance to network QoS are summarized in Table 19.1 (a simplified version of Table 6.1).

The parameters shown in Table 19.1 are as follows (see Chapter 6 for explicit details). The latency requirement specifies the end-to-end delay, including propagation and network queuing delays, that each component flow can tolerate. The bandwidth requirement expresses the amount of data each flow is expected to generate. Flows that are identified as multicast expect the network to replicate and deliver their data to multiple (N) entities. The stream characteristic identifies flows with explicit synchronization constraints that are, therefore, also sensitive to delay variations (i.e., jitter). Finally, the dynamic QoS column specifies the expected range of fluctuations of each flow's traffic characteristics and service requirements.

In the rest of this section, we review possible network service choices for these different flows.

Flow type     Latency   Bandwidth       Multicast   Stream   Dynamic QoS
Control       Medium    64 Kb/s         No          No       Low
Text          High      64 Kb/s         No          No       Low
Audio         Medium    N × 128 Kb/s    Yes         Yes      Medium
Video         Medium    N × 5 Mb/s      Yes         Yes      Medium
Tracking      Low       N × 128 Kb/s    Yes         Yes      Medium
Database      High      > 1 GB/s        Maybe       Maybe    High
Simulation    Medium    > 1 GB/s        Maybe       Maybe    High
Haptics       Low       > 1 Mb/s        Maybe       Yes      High
Rendering     Medium    > 1 GB/s        Maybe       Maybe    Medium

Table 19.1: Requirements of the teleimmersive application.

A Conservative Service Selection

At one extreme, the application may insist on having the network deliver all of its data as it is being generated, with minimal disruption and with the smallest possible latency. This is essentially the service that would be provided by a fixed-rate circuit. Such a service is also available from packet (IP) networks in the form of the guaranteed service (see Section 19.3.2) and may be the right choice for flows such as audio, video, tracking, and haptics, which have relatively stringent delay and/or synchronization requirements.

Requesting this service will result in the allocation of a guaranteed bandwidth pipe between the sender and all receivers. The amount of bandwidth of the pipe is set to guarantee that the flow's delay requirements are met. The above service selection could also be suitable for the simulation and rendering flows, which have similarly stringent delay requirements. However, their much higher peak data rates (1 GB/s) make such a choice all but impractical, as it would be prohibitively expensive.

The guaranteed bandwidth service represents one extreme in the spectrum of solutions, where the worst-case resources are allocated. There are, however, other choices, whose desirability depends both on the efficiency of the network resource allocation mechanisms and on the traffic characteristics of the application.

Relaxing Service Guarantees for a Lower Cost

There are many approaches an application can follow to lower the cost of its network service. One is to basically tell the network that it won't need all the bandwidth all the time. The issue there is how to express this in a quantitative manner. One approach is to predict the expected range of traffic fluctuation. Another approach is to ask for bandwidth only when it is needed. This requires continuous "renegotiation" between the network and the application as the rate of a flow changes. It avoids the difficulty of predicting traffic fluctuations, but creates the risk of the bandwidth not being available when requested. Yet another approach is to ask only for a minimal service guarantee (or none) and adapt to the availability of network resources. We briefly review the pros and cons of each approach.

The specification of an envelope will secure the availability of the necessary network resources for the traffic contained within that envelope. However, identifying the "right" envelope can be a difficult task for an application; for example, it may be feasible for the audio flow but much harder for the simulation flow, whose behavior can vary widely. The impact of this uncertainty can be mitigated by the fact that there is not a unique correct answer, and a range of traffic envelopes is likely to be adequate. For example, the flow's traffic can be reshaped, by buffering it and delaying its transmission, to make it fit different traffic profiles. However, this approach assumes that the flow can tolerate the additional (reshaping) delay.

Renegotiating with the network entails overhead and latency in receiving the desired service, and even possible failure to do so if network resources are unavailable. The impact of such failures can be mitigated by requesting a floor service guarantee. In general, the use of renegotiation, with or without floor guarantees, is mainly suitable to adaptive applications, that is, applications that support different levels of operation and can adjust to accommodate changes in the availability of network resources (see Section 19.5 for additional discussions on the issue of adaptation). For example, the database flow could request a floor bandwidth of 10 Mb/s (substantially lower than its peak rate of 1 GB/s), but attempt higher-speed transmissions whenever possible. Similarly, if the video flow supports multilevel encoding, it could select a floor bandwidth matching the transmission requirements of the coarsest coding level and send additional coding levels on the basis of available network bandwidth.

In general, and irrespective of which level of service an application requests from the network, that service will often at some point prove inadequate. In other words, the application will be generating traffic in excess of its contract with the network. How the network reacts to such situations is important to applications.

Insufficient Service Selection

In most cases, networks allow applications to transmit traffic in excess of what they have requested. However, the network will usually carry this excess traffic at a lower loss priority; that is, excess traffic will be dropped first in case of congestion. The main difference is in how the network identifies excess traffic: either implicitly or explicitly.

Implicit identification of excess traffic is typically used in networks that are able to guarantee resources to individual applications. This means that network elements monitor the resource usage of each individual application. If a network element is uncongested and has idle resources, the application will be able to access them, and its excess traffic will get through. However, when congestion is present, the network element will limit the amount of resources that the application can use to what it is entitled to from its service contract, and its excess traffic will be dropped.

Controlling the resource usage of individual flows can be costly. As a result, networks often rely on another mechanism based on explicit marking (tagging) of excess traffic. Packets that exceed the service contract requested by the application are identified and tagged as they enter the network, and dropped first in case of congestion. Because tagging is a global indicator, it eliminates the need to control individual flows. The penalty is a coarser control, so that applications can experience substantial differences in the amount of excess traffic they can successfully transmit. However, applications can decide to premark their packets and exercise some control over which packets should be preferentially dropped.

Finally, we need also to be aware that in some instances the network will send excess traffic on a lower-priority path that is distinct from the one used for regular traffic. Hence, packets can be delivered out of order. For streaming flows such as the audio, video, and tracking flows of the teleimmersion application, the resulting packet misordering can have a substantial impact on the delay and delay variations they experience.

In the next two sections, we review existing network services and identify the guarantees and options they offer.

19.2 THE RSVP SIGNALING PROTOCOL

The RSVP protocol is a resource reservation setup protocol designed for an integrated services Internet. The protocol is documented in several IETF documents (primarily [76, 75, 577]) that describe various aspects related to its operation and use. The protocol is intended to be used by both hosts (end systems) and routers (network elements); that is, the same RSVP messages are used by hosts to communicate with routers and by routers to communicate with each other. From the viewpoint of an application, the RSVP protocol has a number of characteristics that are of significance.

Requests for resources are initiated by the receiver based on information it receives from the sender. The sender communicates with the receiver(s) through PATH messages. The information carried in PATH messages includes characteristics of the sender's traffic and of the path that the flow will take through the network. The main potential advantage of knowing the characteristics of the path is that this can assist the application in reserving the "right" amount of resources. For example, knowing the maximum amount of bandwidth available on the path can avoid requesting more bandwidth than is feasible. Additionally, knowledge of the path characteristics is key to supporting hard end-to-end delay guarantees (see Section 19.3.2). Reservations are communicated from the receiver(s) using RESV messages, which propagate hop by hop back toward the sender.

RSVP requests resources for simplex flows; that is, it requests resources in only one direction, from sender to receiver. Hence, if an application requires duplex communications, separate requests need to be sent by the application at each end. For example, in the case of the teleimmersion application of Chapter 6, each site involved in a teleimmersive session needs to individually reserve resources to receive data from all other sites.

The protocol allows renegotiation of QoS parameters at any time. This is achieved simply by sending a new reservation message with updated QoS parameters. The network will attempt to satisfy the new request, but if it fails, it will leave the old one in place. This support for dynamic QoS negotiations can be a powerful tool for applications, especially those that may not be able to accurately predict their traffic patterns or that exhibit a wide range of variations.

The RSVP protocol allows sharing of reservations across flows. This means that a receiver can specify that a given reservation request can be shared across multiple senders. In addition, RSVP allows two styles of reservation: implicit and explicit (see [76] for details).

The availability of shared reservations can benefit applications involving multiple parties, but some care must be exercised because of the lack of specification on how sharing of the reservation is to be carried out. Currently, no mechanism is available that will let an application specify to the network how it wants resources to be shared. As a result, resource sharing can be controlled only at the application level. For example, in the context of an audio conference with 10 participants, bandwidth could be reserved assuming that two speakers at most will be simultaneously active. The implied coordination mechanism is that speakers will be backing off as soon as they realize that several of them are talking simultaneously. For other applications, special coordination mechanisms may be needed to take advantage of shared reservations.

FIGURE 19.2: Sample RSVP signaling flows. [The original figure shows a sender (Tx), routers R1 through R12, and receivers Rx1 through Rx5, annotated with PATH messages, RESV messages labeled with reserved token rates, a blockade state, and a reservation failure.]

The RSVP protocol supports heterogeneous reservations in the case of multicast connections. Hence receivers listening to the same sender can request different levels of service. As reservation requests from the receivers travel back to the sender, they are merged so that only the “larger” one is propagated upstream. This merging is defined in RSVP using the least upper bound (LUB) operation [76, Section 2.2]. Heterogeneous reservations can be useful to support different end-to-end delay requirements of geographically distributed receivers.
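To make merging concrete, the following is a minimal sketch of a componentwise LUB over TSpec-style reservation requests. The FlowSpec class and the simplified rule (take the “larger” value of each parameter, and the smaller minimum policed unit) are illustrative assumptions; the authoritative merging rules are given in [76].

```python
from dataclasses import dataclass

@dataclass
class FlowSpec:
    r: float  # token rate (bytes/s)
    b: float  # bucket depth (bytes)
    p: float  # peak rate (bytes/s)
    m: int    # minimum policed unit (bytes)
    M: int    # maximum datagram size (bytes)

def lub(x: FlowSpec, y: FlowSpec) -> FlowSpec:
    """Simplified least-upper-bound merge: the merged reservation must
    satisfy both downstream requests, so take the larger rates/depths
    (and the smaller policed unit, since policing must cover both)."""
    return FlowSpec(
        r=max(x.r, y.r),
        b=max(x.b, y.b),
        p=max(x.p, y.p),
        m=min(x.m, y.m),
        M=max(x.M, y.M),
    )
```

A router holding requests from two downstream branches would propagate lub(x, y) upstream, which corresponds to the merging of token rates shown in Figure 19.2.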

The operation of the RSVP protocol is shown in Figure 19.2 for the case of a multicast flow. The figure illustrates the receiver-oriented reservation of RSVP and highlights some of its key features. In particular, it shows how reservations are merged (only the reserved token rate is shown in the figure), as well as what happens in the case of reservation failures. Specifically, the link between R2 and R6 is unable to satisfy the request for 4 units of reservation, so that the flow has no reservation on this link. However, note that reservations are in place on links between R6 and R10 as well as between R10 and Rx3. The case of the link between R2 and R7 is more involved, as it shows an instance of a blockade state, where a reservation for 7 units initially failed and is “blockaded” at R7 to ensure that the reservation for 5 units can proceed (see [76] for details).

The RSVP protocol is important to applications because it is the mechanism through which they will talk to the network to request specific service guarantees. This determines not only the semantics of the network services that will be supported, but also the API that applications will need to use in order to request those services (see [74] for an example). In particular, application developers need to be aware that applications will have to be modified to interact and exchange parameters with the RSVP signaling daemon. In addition, in the case of an operating system with support for realtime scheduling and/or prioritization (as discussed in Chapter 20), an interface is also needed between the RSVP daemon and the OS QoS manager to ensure the allocation of appropriate operating system resources. In that context, RSVP is then also the entity that applications will use to request QoS guarantees from the operating system as well as the network (see [50] for an example of such a system).

In the next section, we describe the two services currently available to obtain QoS guarantees in IP networks.

19.3 SPECIFICATIONS FOR QoS GUARANTEES

A number of proposals have defined different types of service guarantees to be provided in IP networks, but currently only two are being standardized. These two services, controlled load service [576] and guaranteed service [506], can be considered to be at opposite ends of the spectrum of QoS guarantees. The controlled load service provides only loose delay and throughput guarantees; the guaranteed service ensures lossless operation with hard delay bounds. Despite those differences, both services share a number of common elements, such as the formats and parameter set used to characterize flows and service capabilities in the networks.

A first important set of such parameters is the specification used to characterize the traffic from a sender on behalf of which a reservation is being made (by a receiver). For applications, this is a key set of parameters because it determines which of the application packets are eligible to receive the requested QoS guarantees. This traffic specification, or TSpec, consists of a token bucket, a peak rate (p), a minimum policed unit (m), and a maximum datagram size (M). The token bucket has a bucket depth (b) and a bucket rate (r), where rates are in units of bytes per second, and packet sizes and bucket depth are in bytes.

The token bucket, the peak rate, and the maximum datagram size together define the conformance test used by the network to identify the packets to which the reservation applies. This conformance test states that the reservation applies to all packets of the flow as long as the amount of traffic A(t) it generates in any time interval of duration t satisfies

A(t) ≤ min(M + pt, b + rt)

This equation bounds the amount of data and the speed at which the application can inject it into the network. In addition, the minimum policed unit m is used to require that any packet of size less than m be counted as being of size m. This is to account for possible per-packet processing or transmission overhead, for example, the number of cycles required to schedule a packet transmission.
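To illustrate, the following sketch enforces the conformance test as a dual token bucket: one bucket of depth b filling at rate r (enforcing b + rt) and one of depth M filling at the peak rate p (enforcing M + pt). The class and method names are illustrative, not part of any standardized API.

```python
class TSpecPolicer:
    """Checks packets against A(t) <= min(M + p*t, b + r*t).
    Rates are in bytes/second; sizes and depths are in bytes."""

    def __init__(self, r, b, p, m, M):
        self.r, self.b, self.p, self.m, self.M = r, b, p, m, M
        self.rate_tokens = b   # bucket enforcing b + r*t
        self.peak_tokens = M   # bucket enforcing M + p*t
        self.last = 0.0

    def conformant(self, t, size):
        """Return True if a packet of 'size' bytes arriving at time t
        is covered by the reservation; nonconformant packets would be
        forwarded as best-effort traffic."""
        dt = t - self.last
        self.last = t
        self.rate_tokens = min(self.b, self.rate_tokens + self.r * dt)
        self.peak_tokens = min(self.M, self.peak_tokens + self.p * dt)
        size = max(size, self.m)  # minimum policed unit
        if size <= self.rate_tokens and size <= self.peak_tokens:
            self.rate_tokens -= size
            self.peak_tokens -= size
            return True
        return False
```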

As was mentioned in Section 19.1.1, the challenge for an application is to determine which values to pick to characterize its traffic to the network. The peak rate p is often set to the raw speed of the network interface of the end system where the application resides. However, this can be quite expensive, for example, in the case of SONET OC-12 or Gigabit Ethernet interfaces; the network usually charges a premium for high peak transmission rates because it needs to provide buffering to absorb very high speed bursts. Hence, an application may want to specify a lower peak rate and control the transmission of its packets to ensure that it complies with this lower rate. Doing so, however, requires support for such pacing in the network interface (see Chapter 20 for additional discussion of network interface characteristics).

The selection of the token rate r and token bucket depth b is a more complex task. A large token bucket depth gives the application the ability to burst data (at its peak rate) into the network without any delay, which is key to minimizing latency. However, as was mentioned earlier, large bursts are difficult for the network to handle and require additional resources (buffers), which increases the cost of the service. A possible alternative is to trade a large token bucket depth for a smaller one while increasing the token bucket rate r. Transmissions can then proceed at a reasonably high rate even when the token bucket is empty. The higher the token rate r, the lower the additional transmission latency when the token bucket is empty; but, on the other hand, the higher the token rate r, the higher the service cost, because the value of r directly corresponds to the minimum amount of bandwidth the network needs to allocate to the flow.


The right choice for b and r depends both on the service pricing model used by the network and on the application characteristics. For instance, the haptic flow of the teleimmersion application has stringent delay requirements, so it might want to minimize the likelihood of running out of tokens. This would lead to the choice of a token bucket depth b sufficient to accommodate the maximum transmission burst the application can generate. Similarly, its token rate r would be chosen high enough to ensure that the token bucket is always replenished between consecutive transmissions (e.g., r ≥ 1 Mb/s). On the other hand, less dynamic flows such as the control and text flows should be able to select a small value for b (e.g., one or two packets) and a token bucket rate r approximately equal to their long-term bandwidth requirement of 64 Kb/s.

Independent of how the TSpec was selected, packets whose transmission violates the conformance test equation given above are deemed nonconformant with the TSpec and are not eligible for the service guarantees implied by the reservation. Those packets will then be treated as best-effort traffic by the network.

A number of other characterization parameters are shared by the controlled load service and the guaranteed service. Some of the more interesting ones for applications are the minimum path latency, the available path bandwidth, and the path MTU (see [507] for details).

19.3.1 Controlled Load Service

The service definition of the controlled load service is qualitative. Its stated aim is to approximate the kind of service that the application would experience from an unloaded network. The main aspect of the controlled load service is that it assumes the use of call admission to ensure that flows with reservations see this level of performance, irrespective of the actual volume of traffic in the network.

When the RSVP protocol is used, this request for a reservation is communicated back toward the sender by using a flowspec, which essentially specifies a TSpec of the same form as the one used by the sender. The TSpec specified by the receiver need not be the same as the one used by the sender. This ability can be used by applications in a number of ways. For example, a receiver with limited ability to buffer bursts may specify a smaller value of the token bucket depth b than the one used by the sender. This signals to the sender that it should reshape its traffic accordingly. For instance, the recipient of a database synchronization in the teleimmersion application may not want to be dumped with gigabytes of data and may specify a TSpec that will prevent this situation from happening.

As mentioned earlier, the service guarantees apply only to conformant packets, that is, packets that pass the conformance test equation. The treatment of packets in excess of the TSpec is left unspecified in the controlled load specifications. The specifications [576] state the following:

The controlled-load service does not define the QoS behavior delivered to flows with non-conformant arriving traffic. Specifically, it is permissible either to degrade the service delivered to all of the flow’s packets equally, or to sort the flow’s packets into a conformant set and a non-conformant set and deliver different levels of service to the two sets.

For an application, the two behaviors have a very different impact.

The first approach corresponds to an implementation where the network guarantees each controlled load flow a transmission rate of at least its token rate r, for example, by using mechanisms such as weighted fair queuing (WFQ) [156, 437]. In this case, if the flow sends at a rate greater than r for an extended period of time during which the network is congested, packets from the flow will start accumulating in the network buffers, since they are arriving faster than they can be transmitted. As a result of these larger queues in the network, the end-to-end delay seen by all packets will increase. This may be an adequate behavior in the case of an adaptive application that will detect the increase in delay and use it to lower its rate, for example, all the way down to conform to its original TSpec. However, this may not be a desired behavior for an application that is sensitive to increases in delay and can tolerate some losses, for example, a telephony application; in case of congestion, such an application would prefer to see nonconformant packets dropped rather than have all packets experience additional delay.
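The delay impact of the first behavior can be approximated with a simple fluid model (an illustrative simplification; actual WFQ behavior depends on packetization and competing traffic):

```python
def added_delay(send_rate, r, duration):
    """Extra queueing delay (seconds) seen after sending at 'send_rate'
    bytes/s for 'duration' seconds through a scheduler that guarantees
    only 'r' bytes/s -- a simplified fluid approximation."""
    backlog = max(0.0, (send_rate - r) * duration)  # bytes queued in the network
    return backlog / r                              # backlog drains at rate r

# Example: sending 25% above a 1 MB/s reservation for 2 s adds ~0.5 s of delay.
print(added_delay(1.25e6, 1.0e6, 2.0))
```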

The second network behavior, where conformant and nonconformant packets are treated differently, may then be more appropriate for delay-sensitive applications. However, the identification of nonconformant controlled load packets inside the network is a difficult, if not impossible, task unless some form of “marking” as proposed in [125] is used. The result is that even when the network implementation of controlled load handles nonconformant packets by downgrading them to a lower-priority service, this will happen only after the application experiences some initial increase in end-to-end delay.

Currently, no mechanism is available to let an application signal to the network which of the two above behaviors it would like to see applied to its nonconformant traffic. Dealing with this uncertainty may require additional functionality in some applications. For example, a delay-sensitive application may want to preemptively drop packets to limit increases in end-to-end delay when it detects congestion. Alternatively, an application can take advantage of the renegotiation ability of the RSVP protocol and attempt to increase its reservation if its traffic warrants it and the network is becoming congested.

Flows from the teleimmersion application that could use the controlled load service are essentially those with relatively loose delay and synchronization requirements. The controlled load service should be suitable for the text and database flows, and possibly even the audio and video flows. The ability of the latter two to use controlled load will depend on how tolerant they are to delay variations.

19.3.2 Guaranteed Service

The guaranteed service aims at providing hard (deterministic) service guarantees to applications. Those hard guarantees again apply only to conformant packets. For conformant packets, the network commits to an upper bound on the end-to-end delay they will experience and ensures that no packet will be lost except through transmission errors. The goal of the service is to emulate, over a packet-switched network, the behavior of a dedicated-rate circuit.

An application requesting the guaranteed service needs to specify the characteristics (TSpec) of the traffic to which it wants the guarantees to apply, as well as the value of the maximum end-to-end delay it wants. Rather than getting into details of how such guarantees are supported inside the network (see [506]), we concentrate here on the features of the guaranteed service of relevance to applications.

First, applications should be aware that the guaranteed service is likely to be an “expensive” service, mainly because of the deterministic nature of the guarantees it provides. We illustrate this cost next for three applications with different traffic characteristics and end-to-end delay requirements. The three applications are 64 Kb/s packetized voice, packet video conference, and playback of stored video. The traffic parameters of these three applications are given in Table 19.2, together with the associated end-to-end delays and corresponding service rate. The service rate was computed by assuming a five-hop path, where the end-to-end propagation delay was taken to be 20 ms. As can be seen from the table, the service rate R can be substantially higher than the token rate r. This is particularly significant for the voice application, where the token rate r equals the peak rate p, but the required service rate R is about three times as much.

Traffic type       M (KB)   b (KB)   r (Mb/s)   p (Mb/s)   End-to-end delay (ms)   R (Mb/s)
64 Kb/s voice      0.1      0.1      0.064      0.064      50                      0.162
Video conference   1.5      10       0.5        10         75                      2.32
Stored video       1.5      100      3          10         100                     6.23

TABLE 19.2: Sample applications and service requirements.
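The service rates in Table 19.2 follow from the delay-bound formula of the guaranteed service specification [506]. The sketch below illustrates the computation; it folds all per-hop error terms into the parameters Ctot and Dtot, whose values (like the search bounds) are illustrative assumptions rather than the exact terms used for the table.

```python
def delay_bound(R, b, r, p, M, Ctot, Dtot):
    """End-to-end queueing delay bound for a reserved rate R >= r,
    following the guaranteed service formula of [506].
    Rates in bytes/s, sizes in bytes, delays in seconds."""
    if p > R:
        return (b - M) / R * (p - R) / (p - r) + (M + Ctot) / R + Dtot
    return (M + Ctot) / R + Dtot

def required_rate(target, b, r, p, M, Ctot=0.0, Dtot=0.0):
    """Binary-search the minimum service rate R whose queueing-delay
    bound meets 'target' seconds (propagation delay subtracted first).
    Returns the upper search bound if even that rate is insufficient."""
    lo, hi = r, 100 * p
    for _ in range(60):
        mid = (lo + hi) / 2
        if delay_bound(mid, b, r, p, M, Ctot, Dtot) <= target:
            hi = mid
        else:
            lo = mid
    return hi
```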

Another sample point illustrating the high cost of the guaranteed service is found in [231], which gives some achievable link loads. For example, on an OC-3 link (≈ 150 Mb/s), a typical mix of flows that saturates the link (i.e., adding one more flow would result in the violation of end-to-end delay guarantees) achieves a typical utilization of about 40%. The remaining bandwidth will most certainly not be left unused (i.e., it will be assigned to lower-priority traffic), but the network is likely to charge a premium for guaranteed service. The hard guarantees it provides may be worthwhile for certain applications, but the cost may not be justified for all.

In the context of the teleimmersion application, the two flows for which the guaranteed service may be the right service are the tracking and haptic flows, because of their stringent delay requirements. The control flow may be another candidate because of its combination of reliability and relatively low delay requirements. The bounds on delay and the absence of packet loss offered by the guaranteed service may then be well matched to those requirements.

Another aspect that applications need to be aware of, because it can significantly impact the cost of guaranteed service, is packet size. In general, the larger the (maximum) packet size used by the application, the higher the service rate that will have to be reserved in order to guarantee a given end-to-end delay bound. This impact is primarily due to the store-and-forward nature of packet networks, which results in packet transmission times being paid at each hop. As a result, it is strongly advisable for applications wishing to use the guaranteed service to specify the smallest possible packet size that is compatible with other system requirements. The impact of packet sizes is shown in Table 19.3 for the three applications described earlier.

A last aspect of guaranteed service that applications need to be aware of is that its deterministic guarantees do not lend themselves well to dealing with shared reservations and handling of nonconformant traffic.


Maximum packet          Reservation rate R (Mb/s)
size L (KB)        64 Kb/s voice   Video conference   Stored video
0.1                0.16            1.40               5.91
0.5                0.81            1.66               6.00
1.0                1.62            1.99               6.11

TABLE 19.3: Impact of packet size on reservation rate.

19.3.3 Summary of IP QoS Service Offerings

In this section, we have described the two services that have been defined to offer QoS guarantees over IP networks. When used with the RSVP protocol, those services will let applications interact with the network to request and negotiate a variety of service guarantees. Here, we summarize the main features and constraints associated with those services, and we also highlight a number of service guarantees that the network is today not capable of providing.

The network can provide rate, delay, and loss guarantees to both unicast and multicast flows. In particular, the guaranteed service can provide hard delay bounds and lossless transmission, which should be capable of satisfying the needs of the most stringent realtime applications, although at a potentially high cost for the service. The controlled load service is a suitable alternative for applications that do not have stringent delay constraints and require only that the network guarantee them a certain transfer rate.

For both services, the guarantees apply only to packets that fall within a specific traffic contract. The specification of this traffic contract is probably the biggest challenge that applications face in requesting QoS guarantees: it requires that they characterize their expected traffic patterns, which may be a difficult task for many applications. The main impact of the traffic contract is in terms of what happens to data that falls outside it. Not only is this data not covered by the service guarantees, but no mechanism is available for the application to specify to the network how it wants this data handled—for example, which packets to drop or delay in case of congestion and whether excess packets can be sent on a separate path. Applications need to be aware of this possible range of behavior, and the selection of a traffic contract may need to be adjusted accordingly.

The network will support dynamic renegotiation of service guarantees, so that applications can adjust both their traffic contract and QoS requirements. This can help offset some of the complexity in selecting an appropriate traffic contract. The network also supports sharing of reservations across traffic from multiple senders. However, this sharing is “blind” in that the network will not distinguish between the individual flows sharing the reservation. Hence, the burden of controlling this sharing lies with the application.

In general, service guarantees provided by the network do not extend across flows. In other words, the network does not allow the specification of service guarantees that would allow an application to request, for example, a bound on the maximum delay difference between packets from two different flows. This feature could be useful to, say, synchronize an audio and a video stream, but it can be supported only by specifying separate delay bounds for each flow and having the application perform the necessary synchronization in the end system.

Another important limitation of the current network service models is that advance reservations are not supported. The network simply grants or denies a service request based on the availability of resources at the time the request is made. This limitation is significant for applications such as teleimmersion, which require scheduling the availability of many different resources (e.g., supercomputers, workstations, CAVEs). Currently, no mechanisms are available to ensure that network resources will also be available at the same time. A solution to that problem should become available with the policy control mechanisms mentioned in Section 19.6, but their ubiquitous deployment is still some time in the future.

Finally, we emphasize that network QoS guarantees are only one component in providing applications with the service guarantees they require. Many other factors will affect end-to-end performance, but the higher-layer (e.g., transport) protocol used by the application and the resource management capabilities of the operating system are two that can have a major impact (see Chapter 20).

For example, the behavior of a transport protocol such as TCP in the presence of losses can substantially affect the actual useful throughput that an application can achieve. Conversely, the use of a realtime transport protocol such as RTP [498] can help an application recover from delay variations experienced when crossing the network. In general, higher-layer protocols play an important role in providing applications with the desired service guarantees (see Chapter 18 for additional discussion of this issue).

Similarly, ensuring that adequate resources are allocated to the application in the operating system is key to delivering end-to-end service guarantees. For example, it may be of little use for an application to select a service such as guaranteed service if similar guarantees cannot be provided in the operating system (see Chapter 20 for additional information on performance issues in the operating system).

19.4 EXAMPLES OF SERVICE SELECTION CRITERIA

While bulk data transfers such as teleimmersion database updates typically are treated as having no specific QoS requirements, certain applications such as remote backup or downloading media content for off-line playback may well require a minimum guaranteed throughput.

Transaction-oriented applications such as RPC or remote log-in require a response time commensurate with human patience, usually on the order of a second or less. Unless RPC can be pipelined and parallelized, propagation delay makes it difficult to use in a wide area network, so that resource control as described in this chapter tends to be of little help.

Multimedia applications have widely varying delay and throughput requirements, even if the content is similar. We can distinguish four types of continuous media applications: stored, noninteractive; stored, with trick modes such as fast forward; interactive (conferencing), without echo; and interactive, with echo. Teleimmersion’s audio, haptic, and video streams fall into the interactive, without echo, category.

Stored, noninteractive multimedia services are limited in delay only by the ability of the receiver to buffer content. If the viewer is to have the ability to fast-forward or skip through the presentation, the round-trip delay can be no more than about half a second to ensure that control action and visible result can be correlated by the viewer. Surprisingly, the acceptable one-way delay for interactive multimedia such as video conferencing is of about the same magnitude as that for stored video, 200 to 300 ms [77, 286]. The one-way delay tolerance decreases to 45 ms if there is an acoustical or electrical echo for audio. For haptic feedback, delay constraints may be much tighter, but it may be possible to limit the need for feedback to the local environment rather than propagating it across the network.

We note that the variable network delay that can be addressed by the resource control mechanisms described in this chapter may be only a small fraction of the total delay budget: transcontinental propagation delays add about 25 ms (5 µs/km). Audio codecs have to wait for a whole block of audio, typically about 20 to 40 ms, and may have an algorithmic lookahead of around 5 to 10 ms. Video codecs often have coding delays of several frames at 30 ms each. The operating system, unless specifically designed for low latency, may add substantial buffering and DMA delays, with a certain popular operating system adding up to 1 second of delay (see again Chapter 20 for further discussion of operating system issues). Given the delays inherent in a packet network, packet audio is feasible in the wide area only if echoes are suppressed.
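To see how quickly these components consume an interactive budget of roughly 200 to 300 ms, the following back-of-the-envelope sum uses mid-range values from the text; the specific numbers are illustrative assumptions, not measured values:

```python
# Illustrative one-way delay budget for transcontinental packet audio (ms).
budget = {
    "propagation (5000 km at 5 us/km)": 25,
    "audio codec block": 30,            # 20-40 ms typical
    "codec lookahead": 8,               # 5-10 ms typical
    "OS buffering / DMA": 40,           # highly OS dependent
    "network queueing (QoS-controlled)": 30,
    "receiver playout buffer": 60,
}
total = sum(budget.values())
print(f"total one-way delay: {total} ms")  # ~193 ms, near the 200-300 ms limit
```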

Since the queuing delay for weighted fair queuing is inversely proportional to the allocated rate, a flow can decrease its delay by merging into a larger flow, at the cost of losing protection against other members of the aggregate flow.

Although attracting a fair amount of attention, delay jitter is only a secondary quality-of-service parameter. A receiver can always convert delay jitter into additional fixed delay by using a playout delay buffer, possibly adaptive, as discussed in Section 19.5.2.

Some applications require that several streams be synchronized in time, for example, for maintaining lip-sync for video conferences and video delivery. Since individual streams may have very different QoS requirements, it may be more efficient to create separate flows for, say, an audio and a video stream rather than multiplexing them into a single packet stream. However, this strategy forces the application to compensate for delay jitter not just within a flow but also between flows. A simple mechanism [316] adjusts the playout delay plus any decoding delay to the maximum of all flows to be synchronized.

The tolerance for packet loss varies widely. Regardless of the level, it is always necessary to make codecs aware of the packetization, so that each packet can be decoded independently [273, 52]. For many continuous-media applications, not only the loss fraction but also the burstiness of losses matters. For example, if losses are bursty, video frames consisting of several packets may suffer less degradation in quality compared with random losses. For most codecs, however, bursty losses are more noticeable than random losses, since they disrupt the prediction built into codecs and thus lead to artifacts that are noticeable long after the packet loss burst has subsided.

For applications that are loss sensitive, such as remote procedure calls or delivery of stored video, packet loss can also be translated into additional delay, simply by using an Automatic Repeat Request (ARQ) mechanism.

19.5 APPLICATION CONTROL AND ADAPTATION

In this section we describe different techniques for adapting applications to network bandwidth, delay, and loss.


19.5.1 Bandwidth Adaptation

So far, we have assumed that the network will ensure the desired quality of service by reserving appropriate resources for each flow. However, this approach imposes significant costs. First, the network needs to maintain state for each flow; an OC-48 link can easily carry 150,000 audio-rate flows, requiring approximately 75 MB of high-speed memory for storing flowspecs. (Even high-end routers can currently manage only about 3,000 flows.) Also, link utilization for the guaranteed service class may be low. In addition, some networks, notably shared-media networks such as nonswitched Ethernets and wireless LANs, may not support resource reservation. Finally, because a resource reservation can block a large fraction of the network resources, possibly beyond the capacity of the user’s access link, policy control, security, and charging mechanisms need to be deployed. Thus, given these practical difficulties, it appears unlikely that resource reservation will be widely deployed for commodity applications in the next few years.

Rather than having applications explicitly reserve resources (and possibly be denied network access), another approach is to have them adapt their requirements to the resources available. This approach is also guided by the notion that the utility function for most applications is convex, with a large initial jump at the minimum usable rate.

The range and adjustment speed of adaptivity vary widely; many data applications can tolerate zero throughput for a few seconds, with throughput varying at round-trip delay intervals, while changing the audio encodings, quantization factors, or frame rate across the whole spectrum at that rate is likely to be rather annoying to the recipient. Thus, the adaptation mechanisms found, for example, in TCP are not applicable to continuous-media applications.

Also, adaptation is difficult in a heterogeneous multicast environment, as this tends to lead to a “race to the bottom,” with the poorest-connected receiver determining the quality of service for all. Instead, it may be necessary to send base and enhancement layers of the content to different multicast groups [368].

Adapting to the current network conditions requires that the sender have an accurate picture of the quality-of-service conditions among the receiver population that it serves. Receivers can also benefit from obtaining QoS measurements for their fellow receivers, as this allows the application to determine whether a QoS problem is likely local or widespread. RTP [498] is commonly used to deliver continuous media in the Internet. It encompasses a control protocol (RTCP), in which each receiver periodically reports on the packet loss, delay, and delay jitter for each sender (see also Chapter 18). Senders report their nominal rate. The reporting interval scales with the number of receivers, so that the traffic due to QoS reports remains below a settable fraction (typically 5%). The sender can then use any adaptation algorithm to adjust its sending rate to the current available bandwidth [68, 93], typically with a variation on an additive-increase, multiplicative-decrease algorithm triggered by moving outside given loss thresholds. If the continuous-media application shares bandwidth with an elastic application like TCP, issues of fairness need to be addressed. Since TCP and continuous-media adaptation use very different algorithms, the likelihood exists that TCP will yield all bandwidth to the continuous-media streams or vice versa. For unicast streams, TCP-like congestion control has been suggested: emulating slow start and congestion avoidance, but without actually retransmitting packets.
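A minimal sketch of such loss-threshold-driven additive-increase, multiplicative-decrease adaptation, run whenever an RTCP receiver report arrives; the thresholds and step sizes below are illustrative assumptions, not values taken from [68, 93]:

```python
def adapt_rate(rate, loss_fraction,
               lo=0.02, hi=0.05,            # loss thresholds (assumed)
               step=50_000, factor=0.875,   # additive step (b/s), decrease factor
               rmin=64_000, rmax=3_000_000):
    """Adjust the sender's target rate based on the loss fraction
    reported in the latest RTCP receiver report."""
    if loss_fraction > hi:                   # congestion: back off multiplicatively
        rate = max(rmin, rate * factor)
    elif loss_fraction < lo:                 # headroom: probe additively
        rate = min(rmax, rate + step)
    return rate                              # between thresholds: hold steady
```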

If a soft-state reservation mechanism such as RSVP or YESSIR [433] is used, the PATH or forward reservation messages themselves could be used to indicate the “fair share” of available bandwidth. However, this indication of the minimum available resources still has to be conveyed back to the sender, possibly by using a mechanism similar to the RTCP feedback algorithm described above.

19.5.2 Delay Adaptation

Continuous-media applications are defined by their need to reconstruct the timing of the source at the receiver. With the help of a playout buffer (a queue for arriving packets that is emptied at the playback rate), they convert a variable-delay packet network into a fixed-delay one. To prevent starvation, noticeable as gaps in playback, the playout buffer has to be large enough to compensate for the delay jitter [393, 463]. For best-effort and controlled load service, the delay variation is not bounded, so the application has to estimate the necessary depth of the buffer, trading off loss due to late arrival against a lower playback delay. With guaranteed service, the delay jitter is bounded, so a fixed-length playout buffer is sufficient. However, since few packets experience the worst-case delay, the application is better off adapting to the actual network delays, as long as an occasional packet loss is acceptable. The influence of different scheduling disciplines on playout delay adaptation is not well understood.
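One classical adaptive estimator, in the spirit of the algorithms surveyed in [463], tracks a smoothed network delay and its mean deviation and schedules playout a few deviations later; the smoothing gains below are conventional choices, not values prescribed by the text:

```python
class PlayoutEstimator:
    """Adaptive playout delay: EWMA of the observed one-way delay plus
    a safety margin of several mean deviations."""

    def __init__(self, gain=0.125, margin=4.0):
        self.gain, self.margin = gain, margin
        self.delay_est = None
        self.deviation = 0.0

    def observe(self, delay):
        """Update with the delay of a newly arrived packet; returns the
        playout delay to apply at the next talkspurt or frame boundary."""
        if self.delay_est is None:
            self.delay_est = delay
        diff = abs(delay - self.delay_est)
        self.delay_est += self.gain * (delay - self.delay_est)
        self.deviation += self.gain * (diff - self.deviation)
        return self.delay_est + self.margin * self.deviation
```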


19.5.3 Loss Adaptation and Concealment

Packets sent at risk may be lost. For continuous media, several techniques have been explored to reduce and conceal packet losses: retransmission, redundant transmission, interleaving, and forward error correction. Retransmission is limited by the delay tolerance of the application, but may be useful in regional networks or for delivery of stored media. With redundant transmission, a lower-bit-rate version of the same stream is sent at a time offset. If a packet from the primary encoding is missing, a packet from the lower-rate encoding may be used to fill in the gap [69]. Interleaving distributes a single media block across several packets, turning packet loss into sample loss, which is presumably easier to conceal by the codec itself. Finally, forward error correction [478, 480] adds redundant parity or Reed-Solomon packets to a group of audio or video packets. It has the advantage of offering perfect reconstruction (as long as packet losses are isolated) and of scaling nicely to multicast, but may add substantial overhead. With forward error correction or redundant transmission, decoding delays can be made adaptive, since the receiver has to wait for the end of the block only if a packet was lost. Retransmission and forward error correction may be combined, for example, by retransmitting a parity packet if any packet within a block was lost.
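The simplest parity scheme appends one XOR packet per group, which recovers any single loss within the group. A minimal sketch follows; the group size and padding behavior are illustrative choices:

```python
def xor_parity(group):
    """Build a parity packet as the bytewise XOR of all packets in
    the group (packets are implicitly padded to the longest length)."""
    size = max(len(p) for p in group)
    parity = bytearray(size)
    for pkt in group:
        for i, byte in enumerate(pkt):
            parity[i] ^= byte
    return bytes(parity)

def recover(received, parity):
    """Reconstruct the single missing packet of a group from the packets
    that did arrive plus the parity packet. Trailing zero padding must be
    trimmed using length information carried elsewhere (e.g., a header)."""
    return xor_parity(list(received) + [parity])
```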

19.5.4 Other Service Models

A number of other service models are possible beyond those offering guaranteed delay bounds or service equivalent to an unloaded network. Recently, there has been interest in services that do not reserve resources but nevertheless provide service appropriate, for example, to realtime applications. As discussed in the context of delay adaptation, late packets are lost to the application. Thus, it may be desirable to accept excess traffic, but drop packets once they have exceeded their delay bound [499]. As an additional mechanism, scheduling priority could be given to packets that had the bad fortune of experiencing long delays in earlier nodes or that still have many nodes to traverse [500]. However, these scheduling mechanisms require additional delay information in each packet.

19.6 EXTENSIONS AND OPEN ISSUES

In this section, we briefly review a number of additional topics of relevance to network service guarantees and provide pointers to additional resources.


The IP networking environment assumed in this chapter will continue to be heterogeneous. As a result, the available end-to-end service guarantees will typically consist of only the intersection of the guarantees offered by the network technologies on the path between end systems. The IETF Integrated Services over Specific Link Layers (ISSLL) working group is addressing the mapping of the integrated services onto link layers such as ATM, LANs, and low-speed links.

The IP integrated services model presented here assumes that support for service guarantees is flow or application specific. However, it has been suggested [125, 188] to guarantee performance at the coarser level of a service class encompassing many flows. This approach improves scalability, since the network does not have to maintain state for individual flows. As discussed in Section 19.5, such a model of looser service guarantees may be adequate for adaptive applications (see [188] for its use with TCP). However, it is unlikely to be sufficient for applications that require strong service guarantees, for example, the control, tracking, haptic, and rendering flows of the teleimmersion application of Chapter 6.

An important component in delivering quality of service to users is controlling who is entitled to a better service. The addition of such control to the RSVP protocol is being addressed in the IETF RSVP Admission Policy working group. This policy control mechanism can also be used for other purposes, such as allowing a high-priority application to cancel or reduce existing low-priority reservations. This mechanism may be useful in emergency conditions or to support advance reservations.

Finally, because QoS guarantees do come at a price, networks will have to price their services, at least in part, based on the amount of resources being allocated. Beyond cost recovery, the use of pricing as a mechanism to better control networks is also being explored. Relevant information can be found at the home pages of the CA$hMAN project [106] and of the Journal of Electronic Publishing, which focuses on Internet-specific activities.

19.7 SUMMARY

We have described issues faced by applications when requesting service guarantees from IP networks, and we have reviewed the basic components that will be involved in providing those services.

We point out that the current set of QoS capabilities being deployed represents the first generation of such offerings. As such, they may not be able to satisfy the requirements of all applications. Furthermore, their introduction has been driven primarily by a number of recent technology and algorithmic developments, rather than by an attempt to satisfy explicit requirements put forward by applications.

For example, as mentioned in Section 19.3.3, current services do not support QoS specifications across multiple streams, except through the notion of shared reservations. One possible way to extend this capability, which might be useful, for example, in teleimmersion, would be to introduce intelligent discard capabilities for shared reservations.

FURTHER READING

For more information on the topics covered in this chapter, see www.mkp.com/grids and also the following references:

• Guerin and Peris [251] provide an overview of QoS in packet networks. They describe basic scheduling and buffer management mechanisms, and review the properties of both controlled load and guaranteed service. Their paper also includes a section on more recent service proposals that sacrifice precision in the service guarantees for simpler implementations and easier deployment.

• Thomas [538] provides byte-level details on RSVP and RTP.

• The role of RTP for Internet telephony is discussed in [501].

• Ferguson and Huston [189] give a timely, if somewhat partial, discussion of QoS support in the Internet and examine the various methods used to deliver QoS over different types of networks.

• Peterson and Davie [446] introduce basic networking concepts and provide useful material on interactions between operating systems and networks. They also touch upon aspects of QoS support and provide implementation examples.


CHAPTER 20

Operating Systems and Network Interfaces

Peter Druschel and Larry L. Peterson

This chapter covers the operating systems and network interfaces in the computing nodes that form a computational grid system. The node operating system provides the interface between grid applications and middleware services (discussed in Chapters 3 through 6 and 11) on the one hand, and the underlying hardware platform on the other. The hardware platform consists of the node computer and the high-speed network that connects the nodes (Chapters 17 and 21).

Particular emphasis is placed on the network subsystem, as it is responsible for meeting the novel, stringent communication requirements of grid applications. The network subsystem comprises the network interface and the part of the operating system responsible for communication. Three main tasks are performed by the network subsystem:

1. It multiplexes the network among the multiple (possibly untrusted) applications running on the node.

2. It implements the network protocols (discussed in Chapter 18), which transform the raw packet-delivery service of the underlying network into the reliable, semantically rich communication service expected by the application.

3. It provides a standardized, abstract communication interface that ensures the portability and interoperability of applications and middleware services.


Grid applications have demanding communication requirements, including high-bandwidth and realtime data streams, and multiple concurrent streams with widely differing quality of service. Moreover, the grid is based on a shared infrastructure, implying that grid applications have to coexist and share resources with unrelated applications. The task of the general-purpose, vendor-supplied node operating system is to provide the fundamental services required to support grid applications. The grid middleware then builds on these services to provide high-level services specific to the support of grid applications.

The fundamental services needed to support grid applications are high-bandwidth, low-delay communication and predictable communication performance and resource allocation. Data-intensive grid applications must be able to utilize the bandwidth of the underlying high-speed (possibly gigabit) network. Low communication delay is needed to allow effective communication with local network-attached sensors or actuator devices, network-attached storage devices, and other nodes in a local cluster. To satisfy the realtime requirements of grid applications, the node’s resources must be scheduled in such a way that QoS guarantees provided by the network (see Chapter 19) are maintained in the end host.

The remainder of this chapter is organized as follows. Sections 20.1 and 20.2 discuss challenges and principles, respectively, in designing a network subsystem suitable for grid systems. Network interface design is briefly covered in Section 20.3. Issues in achieving high bandwidth, low delay, and predictable performance are discussed in Sections 20.4, 20.5, and 20.6, respectively. Finally, Section 20.7 looks at future challenges and research in network subsystem design for grids.

20.1 CHALLENGES

The advent of high-speed networks and the simultaneous development of demanding distributed applications have made the network subsystem a performance-critical component of the computer system. In designing a high-performance network subsystem, the key technical challenge is to achieve application-to-application communication performance close to the capabilities of the physical network, while maintaining standardized APIs to ensure code portability. More precisely, in delivering communication service to applications, the network subsystem must be able to preserve the performance characteristics of the physical network (bandwidth, latency, QoS), while making effective use of host resources.


Network subsystems in current general-purpose commercial operating systems fail to achieve this goal. In these systems, high per-byte processing overhead reduces effective application throughput and/or causes high CPU load; high per-message processing overhead substantially increases application communication latency on local area networks; and inappropriate accounting and scheduling of OS resources cause high variance in the application’s communication throughput and latency. The result is loss of the performance and quality of service provided by the network because of inappropriate processing in the end system.

The root cause of this “OS bottleneck” is that rapid changes in hardware technology have made obsolete the architecture of traditional network subsystems. First, the memory hierarchies found in modern computer systems depend on locality of reference in memory for performance [266]. Unfortunately, communication processing in current network subsystem implementations tends to have poor memory access locality, resulting in high overhead and/or poor performance [167]. Second, current network subsystem implementations fail to isolate and optimize for the common case [399]. As a result, critical execution paths are burdened with code for special-case handling, thus increasing communication latency. Third, general-purpose network subsystems do not properly account for and schedule resources consumed in communication processing [168]. The result is scheduling anomalies and loss of predictable communication performance.

This chapter gives an overview of work that has been done in recent years to address these problems. In subsequent sections, we discuss work to achieve high bandwidth, low per-byte overhead, low communication latency, and predictable communication performance, without sacrificing modularity, portability, and interoperability of the operating system and applications.

20.2 PRINCIPLES

We briefly discuss some fundamental principles for the design of high-performance network subsystems. These principles permeate a large body of work on the subject and have been identified as critical to achieving high-performance communication in the end system:

• Coordinated design: Achieving high performance requires a coordinated design of the entire network subsystem, including network adaptor, network protocol implementation, communication abstractions, and application programming interface.


• Early demultiplexing: To appropriately schedule resources in the end system, it is necessary to have a single point of demultiplexing for incoming data packets, as close to the point of network attachment as possible.

• Path-oriented structure: High performance requires a system structure that uses the data path as the focal point for resource management and optimization. Paths represent the flow of network data through the components of the network subsystem and applications.

• Integrated layer processing (ILP): The poor memory access locality of data-touching operations results in high per-byte overhead. Integrated layer processing is a technique that attempts to combine all computations and transformations on the network payload data in a single traversal of the data.

• Application-level framing (ALF): Achieving high performance requires applications to make effective use of the services provided by the physical network. Application-level framing is an approach to the design of communication abstractions and protocols that allows applications to control and identify the individual data units (packets) transmitted over the physical network.

Traditional network subsystems have strictly layered designs. That is, network adaptor, protocol layer implementations, communication abstractions, and interfaces are designed in isolation and with minimal regard for end-to-end performance. This typically results in redundant data copying and inappropriate scheduling of resources. High-performance network subsystem design requires coordination and integration of key functions among the layers, without sacrificing modularity.

In current network subsystems, incoming packets are demultiplexed incrementally as they travel upward through the protocol layers (e.g., IP, TCP). Because of this layered demultiplexing [533], lower layers cannot associate an incoming packet with its transport-level connection or its recipient application. This leads to inappropriate placement and subsequent copying of data and to suboptimal scheduling of resources, including priority inversion. These problems can be avoided through the use of a single point of demultiplexing, performed as “early” (i.e., close to the point of network attachment) as possible.
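A minimal sketch of early demultiplexing: classify each arriving packet once, on its transport 5-tuple, into a per-application receive queue. The field offsets assume a plain IPv4 header without options; class and function names are illustrative.

```python
import struct

def five_tuple(frame):
    """Extract (proto, src_ip, src_port, dst_ip, dst_port) from an
    IPv4 packet, assuming no IP options (IHL == 5) for brevity."""
    proto = frame[9]
    src_ip, dst_ip = frame[12:16], frame[16:20]
    src_port, dst_port = struct.unpack("!HH", frame[20:24])
    return (proto, src_ip, src_port, dst_ip, dst_port)

class EarlyDemux:
    """Single classification point near the network attachment: one
    lookup places the packet on its application's queue, so later
    layers never need to re-classify it."""

    def __init__(self):
        self.bindings = {}   # 5-tuple -> per-application queue (list)
        self.default = []    # unclassified traffic (e.g., to the OS stack)

    def bind(self, tup, queue):
        self.bindings[tup] = queue

    def receive(self, frame):
        self.bindings.get(five_tuple(frame), self.default).append(frame)
```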

Layering is the principal structuring tool for traditional network subsystems. While network data traverses the network in a vertical cut through the layers, all resource management and optimization focus on individual layers. As a result, many opportunities for optimized resource management and processing along the data path cannot be realized. High-performance network subsystem design requires data-path-centric resource management and optimization.

In a layered implementation, data-touching operations prescribed by different protocol layers are performed sequentially, as a packet traverses the protocol stack. This process leads to poor memory access locality, particularly if layer processing is temporally separated by CPU scheduling points. With ILP, all data-touching operations are performed in a single traversal of the packet’s data, thus eliminating memory traffic and improving locality.
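As an illustration of ILP, the following sketch fuses a copy and the 16-bit Internet checksum into one pass over the packet payload, so the data is touched once rather than twice (a simplified illustration; production implementations fuse these loops in carefully tuned C for speed):

```python
def copy_with_checksum(src: bytes, dst: bytearray) -> int:
    """Copy 'src' into 'dst' and compute the 16-bit ones'-complement
    Internet checksum in the same traversal of the data."""
    total = 0
    n = len(src)
    for i in range(0, n - 1, 2):
        dst[i:i + 2] = src[i:i + 2]            # copy one 16-bit word
        total += (src[i] << 8) | src[i + 1]    # accumulate the same word
    if n % 2:                                  # odd trailing byte
        dst[n - 1] = src[n - 1]
        total += src[n - 1] << 8
    while total >> 16:                         # fold carries into 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF
```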

With traditional protocol architectures and communication abstractions, an application has no control over, or knowledge of, the data packets that traverse the physical network. Fragmentation and reassembly, buffering, ordering, and error control are all handled transparently by the protocol stack. However, many applications can achieve better performance on a high-speed network if they exercise these functions in a customized manner. Application-level framing (ALF) enables such optimizations by offering sophisticated applications knowledge and control of physical network packets. Notice that unlike the other principles, which affect only the implementation of the network subsystem, ALF also affects the design of communication abstractions and protocol architecture.

In the subsequent sections, we will see how these principles are realized in numerous approaches to high-performance subsystem design.

20.3 NETWORK INTERFACE DESIGN

Before discussing the issues involved in building an efficient network subsystem, it is important to explain the key parameters in the design of network adaptors. These include how data is transferred between the adaptor and host memory, what bus the adaptor is attached to, how the adaptor and host signal each other, and where packets are buffered while they await processing. (See also Chapter 17 for additional information on network interfaces and the emerging VIA standard.)

20.3.1 I/O versus Memory Bus

An essential first question does not concern the adaptor per se, but rather what bus is used to connect the network adaptor to the host. There are two options: the I/O bus, which has traditionally been used to connect external devices to the host (e.g., the PCI or S-bus), or the internal memory (system) bus, which is used to connect the processor to main memory, and sometimes to video memory.

Almost all of today’s computers connect the network adaptor to the host by using the I/O bus. This strategy is adopted for good reason: I/O buses are both open and fairly stable, making it possible for third-party vendors to build and sell network adaptors at low cost. In contrast, memory buses are generally proprietary and change at the vendor’s whim, making third-party adaptors impractical.

If adaptors for system buses could be made practical, then several advantages could be achieved [405]. First, the data path between the adaptor and host memory would be dramatically shortened—from tens or hundreds of microseconds to under one microsecond—thereby improving latency. Second, adaptors would be able to exploit the higher bandwidth of memory buses—on the order of tens of gigabits per second rather than approximately 1 Gb/s. Perhaps most important, however, the adaptor could actively participate in the system’s cache coherency protocol, thereby making it possible to implement support for distributed shared memory (DSM) on the adaptor.

20.3.2 DMA versus PIO

The next question is how data is transferred between the host and the adaptor. One option is to use direct memory access (DMA), in which case the adaptor directly reads and writes host memory without involving the processor. The other option is programmed I/O (PIO), which requires that the processor execute a loop to copy data to/from the adaptor one word at a time.

The trade-off between using DMA and PIO to transfer data between host memory and the adaptor is as follows. On the one hand, DMA supports large transfers, on the order of an entire network packet. This means that the cost (latency) of acquiring the bus can be amortized over the transfer of hundreds or thousands of bytes. In contrast, PIO implies that only a small amount of data—on the order of a single cache line—is transferred across the bus for each load or store executed by the host processor. PIO also has the disadvantage of requiring that the CPU be directly involved in the transfer, meaning that it cannot be performing other useful work, such as executing application instructions.

On the other hand, because the CPU brings the data into its registers when using PIO, the data is then available to compute on when receiving. Moreover, incoming data is also loaded into the cache. (A similar situation exists for outgoing data.) This is often not the case with DMA, where the processor must still load the data into registers and the cache if and when data accesses by the CPU occur as part of the application processing. Furthermore, initiating a DMA transfer requires certain actions to ensure virtual memory and cache consistency; the cost of these operations may offset the performance advantage of DMA over PIO, particularly for short transfers.

Which approach is best is one of the most lively debates in network adaptor design. Both the literature on the subject (e.g., [461, 45, 148]) and our own experience have led us to the conclusion that the preferable technique is highly machine dependent. The only way to fairly compare DMA performance versus PIO is to determine how fast an application program can access the data in each case.

In the PIO case, with carefully designed software, data can be read from the adaptor and written directly to the application’s buffer in main memory, leaving the data in the cache [288, 148]. If the application reads the data soon after the PIO transfer, the data may still be in the cache. According to one study, the PIO transfer from adaptor to application buffer must be delayed until the application is scheduled for execution, in order to ensure sufficient proximity of data accesses for the data to remain cached under realistic system load conditions [429]. Loading data into the cache too early not only is ineffective, but can actually decrease overall system performance by evicting live data from the cache. Unfortunately, delaying the transfer of data from adaptor to main memory until the receiving application is scheduled for execution requires a substantial amount of buffer space on the adaptor. With DMA, instead of using dedicated memory resources on the adaptor, incoming data can be buffered in main memory. Using main memory to buffer network data has the advantage that a single pool of memory resources is dynamically shared among applications, operating system, and network subsystem.

20.3.3 Interrupts versus Polling

Handling an interrupt asserted by the network adaptor is a time-consuming task and can negatively impact latency. For example, an interrupt takes approximately 75 µs in the Mach operating system on a DECStation 5000/200. For comparison, the service time for a received UDP/IP packet is 200 µs; this number includes protocol processing and driver overhead, but not interrupt handling. The delay on the physical network can be as low as a few microseconds. Given this relatively high cost of interrupt processing, minimizing the number of host interrupts during network communication is important to overall system performance.

The alternative is for the processor to periodically poll the adaptor to see whether it has received a new packet or is finished transmitting a previous packet. This has the obvious downside of stealing processor cycles from the application, even in the likely case that there are no outstanding packets. Although in some situations polling is appropriate—there is sufficient buffering on the adaptor and the frequency of communication is fairly predictable—most systems employ interrupts. Fortunately, it is often possible to avoid many interrupts in an interrupt-based system.

In the outgoing direction, the completion of a packet transmission, which is traditionally signaled to the host by using an interrupt, can instead be indicated by the advance of the transmit queue's tail pointer. The device driver can then check for this condition as part of other driver activity—for example, while queuing another outgoing packet—and take the appropriate action. Interrupts are used only in the relatively infrequent event of a full transmit queue. In this case, the host suspends its transmit activity, and the transmit processor asserts an interrupt as soon as the queue reaches the half-empty state.

In the receiving direction, an interrupt need be asserted only once for a burst of incoming packets. More specifically, whenever a packet is enqueued before the host has dequeued the previous packet, no interrupt is asserted. This approach achieves both low packet delivery latency for individually arriving packets and high throughput for incoming packet trains. Note that in situations where high throughput is required (i.e., when packets arrive closely spaced), the number of interrupts is much lower than the traditional one per packet.
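The receive-side rule can be expressed in a few lines of adaptor-side C. This is a hedged sketch of the policy just described, not actual firmware; the names and queue layout are invented, and real code would add synchronization and overflow handling.

    #include <stdbool.h>

    struct rx_queue {
        int head;                        /* next slot the adaptor fills */
        int tail;                        /* next slot the host driver drains */
        int size;                        /* number of slots */
    };

    static bool host_idle(const struct rx_queue *q)
    {
        return q->head == q->tail;       /* host has dequeued everything so far */
    }

    /* Called on the adaptor when a received packet has been queued for the host. */
    void packet_enqueued(struct rx_queue *q, void (*assert_interrupt)(void))
    {
        bool was_idle = host_idle(q);
        q->head = (q->head + 1) % q->size;
        if (was_idle)
            assert_interrupt();          /* first packet of a burst: one interrupt */
        /* otherwise the host is still draining the queue and will pick up
           this packet without a further interrupt */
    }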

20.4 HIGH-BANDWIDTH NETWORK I/O

In this section, we discuss principles and techniques to achieve low data-dependent (i.e., per-byte) overhead in the network subsystem. Achieving low per-byte overhead is critical to the network subsystem's performance. It enables the system to deliver the bandwidth of a high-speed network to applications while incurring reasonably low overhead on the host's resources (CPU, memory, interconnect).

The primary source of per-byte processing cost is data-touching operations such as copying, checksumming, and presentation conversions. Traditional network subsystems tend to incur large overheads due to repeated data copying. As a result, numerous techniques have been developed to reduce or minimize data copying in the network subsystem. Unlike copying, checksumming and presentation conversions cannot usually be avoided, since they are prescribed by network protocols. However, the performance impact of these operations can be reduced by using the technique of integrated layer processing (ILP).


20.4.1 Integrating Data-Touching Operations

We will first focus on techniques to integrate data-touching operations in the network subsystem. The following section will cover techniques to avoid data copying.

In traditional network subsystem implementations, a data packet is processed in a single traversal of the protocol stack. In each layer, all of the operations prescribed for the packet by the corresponding protocol are performed before passing the packet to the next layer. If multiple data-touching operations, such as checksum calculations or presentation conversion, are prescribed in different layers, each of the operations is performed in sequence. The resultant multiple data accesses to the packet have poor temporal locality, particularly if the operations are separated by scheduling points [429].

ILP has been proposed as a general technique to reduce the overhead of multiple data-touching operations [127]. The key idea is to merge all the data-touching operations performed on a packet and to perform the resulting compound operation in a single sequential traversal of the data. ILP is similar to the loop fusion optimization performed by certain compilers. In practice, implementing ILP is complicated by the need to merge operations from independent protocols and the fact that the order of certain protocol operations may change [1].
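As an illustration, a copy and an Internet-style checksum can be fused into one loop so that each word is touched once. This is a generic sketch of the ILP idea, not code from [127]; it assumes the packet is a whole number of 16-bit words.

    #include <stddef.h>
    #include <stdint.h>

    /* Fused data-touching operations: one pass performs both the copy and the
       one's-complement checksum, instead of two passes with two sets of
       cache misses over the same packet. */
    uint16_t copy_and_checksum(uint16_t *dst, const uint16_t *src, size_t nwords)
    {
        uint32_t sum = 0;
        for (size_t i = 0; i < nwords; i++) {
            dst[i] = src[i];             /* operation 1: copy */
            sum += src[i];               /* operation 2: checksum accumulation */
        }
        while (sum >> 16)                /* fold carries back into the low 16 bits */
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t)~sum;           /* one's complement of the folded sum */
    }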

20.4.2 Avoiding Data Copying

In this section, we focus on techniques to avoid data copying in the network subsystem. Data copying occurs most frequently at the boundaries between layers of the network subsystem. Assuming a conventional operating system structure, most network protocols are implemented inside the OS kernel. Copying among these protocol layers can be avoided relatively easily by using an appropriate buffering abstraction, such as the mbuf system in BSD UNIX [335]. In current systems, this leaves the data copying that occurs at the protection boundary between OS and applications, and the data transfer between main memory and network adaptor.

The data transfer between network adaptor and main memory cannot usually be avoided, except in special circumstances where the data can remain in outboard buffers located in the network adaptor. Data transfer between network adaptor and main memory can be implemented by using DMA or PIO, as discussed above. In some cases, additional data copying occurs in the device driver, to move data between kernel buffers and a staging buffer area associated with the network adaptor. This copying is usually the result of an inadequate design of either the network adaptor or the device driver. Avoiding this copying poses no fundamental problems. Avoiding data copying between OS and applications requires the solution of a complex set of issues and affects the design of the network adaptor, the demultiplexing strategy, the data buffering system in the OS, and the network API. Numerous approaches for efficient data transfer have been proposed in the literature. Six methods covered in this section are summarized in Table 20.1.


TABLE 20.1   Copy-avoiding techniques.

Method          Copy semantics   Safety   Copy freedom           Outboard buffering   Early demux   VM operations
Copy            Yes              Yes      No                     Not needed           Not needed    Never
WITLESS         Yes              Yes      On simple data paths   Needed               Not needed    Never
Genie           Yes              Yes      On simple data paths   Not needed           Needed        Yes
Remap           No               Yes      Always                 Not needed           Not needed    Always
Shared memory   No               No       Always                 Not needed           Not needed    Not in common case
fbufs           No               Yes      Always                 Not needed           Needed        Not in common case


Copy Method

The “Copy” method refers to the conventional approach of copying data between operating system and application. It incurs one data copy operation every time network data crosses the user-kernel boundary. The other methods for network data buffering and cross-domain data transfer are discussed below.

WITLESS

The WITLESS approach was proposed by Van Jacobson [288]. It refers to a coordinated design of network adaptor and OS buffering system to eliminate data copying at the user-kernel boundary. The key idea is to provide outboard buffering in the network adaptor and to transfer data directly between outboard buffers and application buffers, thereby avoiding repeated data copying while maintaining the copy semantics of the traditional UNIX network API. Checksum calculations can be integrated with the copy of data between outboard buffers and application, following the principle of ILP. However, to avoid problems with delayed checksum calculations in TCP, WITLESS network adaptors often provide checksum calculation in hardware [313].

The main advantage of WITLESS is that it can avoid copying on simple data paths without requiring a change to the API. Disadvantages are the need for outboard buffering (a substantial amount of dedicated memory is needed for this purpose, particularly on networks with high bandwidth-delay products) and the fact that copying is still needed for data paths that intersect multiple applications/servers.

Genie

The Genie I/O system incorporates a data transfer method called emulated copy [84]. It avoids most copying while maintaining the semantics of the traditional UNIX API, using a set of techniques called input alignment, reverse copy-out, and page swapping. The basic idea is to remap VM pages in order to transport data between kernel buffers and application buffer without copying. On the receive side, incoming data is placed such that it matches the alignment of the application buffer, and any overlapping data at the boundaries of the application buffer is copied. On the output side, either the application is blocked or the pages containing the application buffer are marked read-only until the network subsystem releases the buffer. Emulated copy is effective for synchronous receive operations (i.e., when the receive operation is posted by the application before the corresponding data arrives from the network) and when the amount of overlapping data at the edges of the application buffer is small compared to the buffer size.

Emulated copy requires early demultiplexing to determine the proper memory alignment for incoming packets. Its advantage is that it can support the traditional UNIX API without requiring outboard buffering. Disadvantages of emulated copy are its potential to block senders, its dependence on the length and alignment of the application buffer for effectiveness, and its reliance on VM remap operations in the critical data path.

Page Remapping

Page remapping is the data transfer method used in DASH [22], V [120], and the container shipping system [24]. The idea is to use the VM system to remap physical pages containing network data between protection domains, thus avoiding data copying. Since buffers need to be page aligned and page remapping has move semantics, a modified network API is required. Advantages of page remapping are its ability to avoid data copying even on complex data paths and the fact that it does not require support from the network adaptor. Disadvantages are the need for a modified API with move semantics and the reliance on VM operations in the critical data path.

Shared Memory

In the shared-memory approach, a region of virtual memory is statically shared among OS kernel and applications. This region is used to buffer network data. Shared memory can avoid data copying, provided that (1) the region is shared by all protection domains that need access to the data, and (2) the region is large enough to hold all network data until it is consumed. The use of shared memory requires the use of an alternative API. All forms of read/write shared memory may compromise protection between the sharing domains.

Condition 1 can be satisfied trivially by sharing the network buffers among all protection domains, but this approach sacrifices data privacy [497]. To maintain privacy, buffers must be shared only among protection domains that have legitimate access rights to the buffer's contents. This approach requires early demultiplexing (to place an incoming data packet into an appropriately shared buffer) and a priori knowledge of the set of protection domains that will eventually need access to a network data packet (or else data copying may be required after all).

Condition 2 requires that there be no arbitrary limitations on the size of the shared region and that the region be pageable. Consider an application that reads a large dataset from the network and then performs a lengthy computation on the dataset. If the size of the shared region cannot extend to the size of the machine's virtual memory, then copying is required to avoid unnecessary limitations.

fbufs

The fbuf system combines the advantages of page remapping and shared memory, while avoiding many of the shortcomings of either method. Two key ideas underlie the design of fbufs. First, fbufs are immutable; hence, they avoid the protection problems associated with shared memory. Second, with fbufs, shared buffer pools are established lazily for each data path using VM operations. Once established, shared buffers are reused on the same data path, thereby avoiding repeated VM operations in the critical path. Early demultiplexing is used to determine the data path of an incoming packet prior to its placement in main memory.

It is equally correct to view fbufs as using shared memory (where sharing is read-only and page remapping is used to dynamically change the set of pages shared among a set of domains) or using page remapping (where pages that have been mapped into a set of domains are retained for use by future transfers). A complete description of the design and implementation of fbufs, along with a detailed performance study, can be found in [170]. A number of research projects have recently adopted variations of the fbufs mechanism [299, 535]. Also, work is under way at Rice University to build an integrated, copy-free I/O and file-caching system based on fbufs, called IO-Lite [430], for a commercial UNIX system. Work is also under way at the University of Arizona to build a new OS, called Scout, that makes the data path upon which fbufs depend a first-class object [398].

20.5 LOW-LATENCY NETWORK ACCESS

In this section, we cover principles and techniques to achieve low per-message processing overhead and latency in the network subsystem. Ideally, the network subsystem should not reduce the maximal rate of small message transmissions supported by the network, and it should not significantly add to the end-to-end latency of a local-area network.

Per-message processing overhead refers to the number of CPU cycles spent per sent or received application message. High overhead may limit the rate at which messages can be sent or received. The network subsystem latency refers to (1) the elapsed time between the execution of a send operation by an application and the posting of the transmission command to the network adaptor for the corresponding packet or (2) the elapsed time between the signal from the network adaptor of a packet reception and the successful completion of the corresponding receive operation by the application. This latency contributes to the end-to-end communications latency, which is a critical factor in the performance of distributed applications.

Observe that the relationship between network subsystem overhead and latency is not direct. First, some overhead cycles do not contribute to latency because they are executed off the critical path. For instance, cleanup work that occurs after a packet has been transmitted on the network falls in this category. In fact, one way to improve latency without reducing overhead is to move overhead cycles out of the critical path. Second, latency is affected both by overhead cycles in the critical path and by scheduling delays. Scheduling delays occur when a task involved in the processing of a message has to wait for the completion of an unrelated task before it can acquire the CPU. Achieving low overhead requires reducing the amount of work (i.e., CPU cycles) needed to process a message. Achieving low latency requires reducing the number of overhead cycles in the critical path and minimizing scheduling delays.

Scheduling delays are subject to the operating system's scheduling policy. Simply speaking, to minimize latency, each task involved in the processing of a message must have sufficiently high priority to avoid scheduling delays. We will return to the issue of scheduling in Section 20.6. In the remainder of this section, we focus on techniques to reduce overhead cycles, particularly those that occur in the critical path.

20.5.1 Sources of Overhead Cycles

Two factors contribute to overhead cycles. A fraction of overhead cycles is due to the inherent complexity of implementing the communication abstraction, executing the network protocols, and controlling the network adaptor. The design of communication abstractions and network protocols for low complexity is covered in Chapter 18.

The remaining overhead cycles are caused by a suboptimal implementation. As in all software systems, implementation quality is constrained by the quality of the tools used and the human effort spent on optimizing the system. General performance-oriented software design techniques and the use of appropriate tools such as optimizing compilers will reduce overhead cycles. We focus here on techniques and tools that are specific to optimizing network subsystem performance.

Two key ideas underlie many of the techniques for optimizing message processing: (1) optimize message processing for the common case and (2) eliminate the operating system kernel from the critical processing path of messages.

20.5.2 Optimizing for Common Case Processing

Executing the instructions that implement a protocol stack is the most obvious source of latency in protocol processing and has been widely studied in the literature [306, 126, 537, 399]. What is less obvious is that this overhead is a function not only of the number of instructions executed on behalf of each protocol, but also of the number of cycles it takes to execute each instruction. In other words, on today's RISC architectures, optimizing protocol latency involves not only reducing the number of instructions, but also minimizing the number of cycles per instruction (CPI). The key ingredient in the CPI of a particular protocol stack running on a particular machine is how often the processor stalls waiting for memory. Memory stalls are becoming increasingly important as the gap between processor speed and memory speed widens [458].

Little can be said in the way of general principles when it comes to trimming the number of instructions required to process a network packet. As mentioned above, one technique is to do as much work as possible after the packet has been transmitted, that is, move instructions off the critical path. These instructions can then be executed in parallel with the transmission of the packet.

Another general technique is to use conditional and careful inlining. Conditional inlining is a technique that allows inlining a function provided that a subset of the function's actual arguments is constant. This allows generating inline code for the simple cases without forcing inlining for the complex cases where the resulting code inflation would be unacceptably large. Careful inlining limits the use of inlining to the cases that will result in improved performance even for latency-sensitive code. This is in contrast to the blind inlining that is often used when optimizing execution in tight loops.

Research into techniques for improving the CPI of networking code is still in its infancy; it is a much better understood problem when dealing with application codes such as the SPEC benchmarks. The following summarizes three techniques that have been used successfully in optimizing TCP/IP and RPC protocol stacks; a more detailed discussion can be found in [399].

Outlining

Outlining, as the name suggests, is the opposite of inlining. It exploits the fact that not all basic blocks in a function are executed with equal frequency. For example, error handling in the form of a kernel panic is clearly expected to be a low-frequency event. Unfortunately, it is rarely possible for a compiler to detect such cases based only on compile-time information. In general, basic blocks are generated simply in the order of the corresponding source code lines. For example, the following sample C source code is often translated to machine code of the form shown below it:


    /* C source: */
    ...
    if (bad_case) {
        panic("bad day");
    }
    printf("good day");
    ...

    /* Resulting machine code: */
        ...
        load        r0,(bad_case)
        jump_if_0   r0,good_day
        load_addr   a0,"bad day"
        call        panic
    good_day:
        load_addr   a0,"good day"
        call        printf
        ...

The above machine code is suboptimal for two reasons: (1) it requires a jump to skip the error-handling code, and (2) it introduces a gap in the i-cache if the i-cache block size is larger than one instruction. A taken jump often results in pipeline stalls, and i-cache gaps waste memory bandwidth because useless instructions are loaded into the cache. This can be avoided by moving error-handling code out of the main line of execution, that is, by outlining error-handling code. For example, error-handling code could be moved to the end of the function or to the end of the program.

We modified the GNU C compiler such that if statements can be annotated with a static prediction as to whether the if conditional will mostly evaluate to TRUE or FALSE. Annotated if statements will have the machine code for the unlikely branch generated at the end of the function. Unannotated if statements are translated as usual. With this compiler extension, the annotated source code below is translated into the machine code that follows it:

    /* Annotated C source ("@ 0" predicts that bad_case is mostly FALSE): */
    ...
    if (bad_case @ 0) {
        panic("bad day");
    }
    printf("good day");
    ...

    /* Resulting machine code: */
        ...
        load            r0,(bad_case)
        jump_if_not_0   r0,bad_day
        load_addr       a0,"good day"
        call            printf
    continue:
        ...
        return_from_function
    bad_day:
        load_addr       a0,"bad day"
        call            panic
        jump            continue


The above machine code avoids the taken jump and the i-cache gap at the cost of an additional jump in the infrequent case. Corresponding code will be generated for if statements with an else branch; in that case, however, the static number of jumps remains the same. It is also possible to use if statement annotations to direct the compiler's optimizer. For example, it would be reasonable to give outlined code low priority during register allocation.
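For readers with a stock compiler: later versions of GNU C expose a comparable static prediction through the (real) __builtin_expect intrinsic, so a similar effect can be obtained without modifying the compiler. A minimal equivalent of the annotated example above:

    /* __builtin_expect tells GCC that the condition is expected to be
       false (0), so the optimizer can move the panic path out of line. */
    if (__builtin_expect(bad_case, 0)) {
        panic("bad day");
    }
    printf("good day");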

Outlining alone does not make a huge difference in end-to-end latency. However, the code density improvements that it achieves are essential to the effectiveness of the next technique.

Cloning

Cloning involves making a copy of a function. The cloned copy can be relocated to a more appropriate address and/or optimized for a particular use of that function. For example, if the TCP/IP path is executed frequently, it may be desirable to pack the involved functions as tightly as possible. It is usually not necessary to clone outlined code. The resulting increase in code density can improve i-cache, TLB, and paging behavior. The longer cloning is delayed, the more information is available to specialize the cloned functions. For example, if cloning is delayed until a TCP/IP connection is established, most of the connection state will remain constant and can be used to partially evaluate the cloned function. This strategy achieves benefits similar to those from code synthesis [364].

We experimented extensively with different layout strategies for cloned code and found that a simple layout strategy consistently outperformed more complex layouts. The simple strategy uses a bipartite layout. Cloned functions are divided into two classes: path functions that are executed once and library functions that are executed multiple times per path invocation. There is very little benefit in keeping path functions in the cache after they execute, since there is no temporal locality unless the entire path fits into the i-cache. In contrast, library functions should be kept cached starting with the first and ending with the last invocation. Based on these considerations, it makes sense to partition the i-cache into a path partition and a library partition. Within a partition, functions are placed in the order in which they are called. Such a sequential layout maximizes the effectiveness of prefetching hardware that may be present. This layout strategy is so simple that it can be computed easily at run time: the only dynamic information required is the order in which the functions are invoked. In essence, computing a bipartite layout consists of applying the well-known “closest is best” strategy to the library and path partitions individually [448].


Path Inlining

Path inlining is the third latency-reducing technique. This is an aggressive form of inlining where the entire latency-sensitive path of execution is inlined to form a single function. Since the resulting code is specific to a single path, this is warranted only if the path is executed frequently. However, it is important to limit inlining to path functions. Library functions are used multiple times during a single path execution, so it is better to preserve the locality of reference that they afford. Also, inlining library functions would likely lead to an excessive growth in code size.

The advantage of path inlining is that it removes almost all call overheads and greatly increases the amount of context available to the compiler for optimization. For example, one protocol's output processing often consists of little more than a call to the next lower layer's output function. With path inlining, the compiler can trivially detect and eliminate such useless call overheads.

Path inlining is relatively easy as long as no indirect function calls are involved. This is usually the case for the outbound side of network processing. On the inbound side, traditional networking code discovers the path of execution incrementally and as part of other protocol processing: a protocol's header contains the identifier of the higher-level protocol. This higher-level protocol identifier is then mapped into the address of the function that implements the appropriate protocol processing. In short, inbound processing is full of indirect function calls. To make path inlining work for this important case, it is necessary to assume that a packet will follow a given path, generate path-inlined code for that assumed path, and then at run time establish that an incoming packet really will follow the assumed path. The Scout OS, which supports an explicit path abstraction, supports this optimization [398].

We found that these three techniques commonly resulted in 20–40% reductions in processing latency when applied to common protocol stacks such as TCP/IP and RPC and, in some cases, improved processing latency by as much as 186%. Moreover, the techniques had the potential to reduce the amount of time the processor stalled waiting for memory by a factor of roughly four to six.

20.5.3 Bypassing the Kernel: Application Device Channels

Protection boundaries add latency to I/O operations because of the necessary argument validation, protection domain switch, and resulting drop in memory access locality. In this section, we describe an approach that gives applications direct access to a network device for common I/O operations, thus bypassing the OS kernel and removing protection boundaries from the critical message processing path.

The design of application device channels (ADCs) recognizes communication as a fine-grained, performance-critical operation and allows applications to bypass the operating system kernel during network send and receive operations. The OS is normally involved only in the establishment and termination of network connections. Protection, safety, and fairness are maintained because the network adaptor validates send and receive requests from application programs based on information provided by the OS during connection establishment. Unlike other systems that support user-level network access using special-purpose dedicated network interfaces [65, 234, 94, 362], ADCs can be used with many commercial general-purpose network adaptors. The U-Net project [51] has recently developed a mechanism very similar to ADCs.

The basic approach taken in designing ADCs is depicted in Figure 20.1. First, instead of centralizing the network communication software inside the operating system, a copy of this software is placed in each user domain as part of the standard library that is linked with application programs. This user-level network software supports the standard application programming interface. Thus, the use of ADCs is transparent to application programs, except for performance.

Second, the user-level network software is granted direct access to a restricted set of functions provided by the network adaptor. This set of functions is sufficient to support common network send and receive operations without involving the OS kernel. As a result, the OS kernel is removed from the critical network send/receive path. An application process communicates with the network adaptor through an application device channel, which consists of a set of data structures that is shared between the network adaptor and the user-level network software. These data structures include queues of buffer descriptors for the transmission and reception of network messages.

When an application opens a network connection, the operating system informs the network adaptor about the mapping between the network connection and ADC, creates the associated shared data structures, and grants the application process access to these data structures. The network adaptor passes subsequently arriving network messages to the appropriate ADC and transmits outgoing messages queued on an ADC by an application using the appropriate network connection. An application cannot gain access to network messages destined for another application, nor can it transmit messages other than through network connections opened on its behalf by the OS.


FIGURE 20.1   Application device channel. (The figure shows an application and its linked protocol library communicating with the network interface directly through an ADC for send and receive operations, while connection management and the network protocols remain in the OS.)

The use of application device channels has a number of advantages. First, network send and receive operations bypass the OS kernel. This strategy eliminates protection domain boundary crossings, which would otherwise entail data transfer operations, domain switching, and the associated drop in memory access locality. Second, since application device channels give application domains low-level network access, it is possible to use customized network protocols and software. This flexibility can lead to further performance improvements, since it allows the use of application-specific knowledge to optimize communications performance. Finally, with application device channels, all processing and resources necessary for network communication are associated with an application process. This strategy eliminates kernel resource constraints and scheduling anomalies that plague traditional network implementations.

The implementation of ADCs comprises three components:

1. A user-level implementation of the networking software, including device driver, network protocols, and communications API

2. The actual ADC mechanism, which provides a shared-memory communication channel between application process and network adaptor (see the sketch after this list)

3. Network adaptor support for ADC-based networking.
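The shared data structures of component 2 can be pictured as a pair of descriptor rings mapped into both the application's address space and the adaptor. The following C layout is an illustrative guess at such a channel; field names and sizes are hypothetical and do not reproduce the actual Osiris/ADC structures.

    #include <stdint.h>

    #define ADC_RING_SLOTS 64            /* hypothetical ring size */

    struct adc_buf_desc {
        uint64_t addr;                   /* buffer address as seen by the adaptor */
        uint32_t len;                    /* buffer or packet length in bytes */
        uint32_t flags;                  /* ownership and completion bits */
    };

    struct adc_channel {
        uint32_t conn_id;                /* tag installed by the OS at connection setup;
                                            the adaptor validates every request against it */
        struct adc_buf_desc tx[ADC_RING_SLOTS];  /* application produces, adaptor consumes */
        uint32_t tx_head, tx_tail;
        struct adc_buf_desc rx[ADC_RING_SLOTS];  /* adaptor produces, application consumes */
        uint32_t rx_head, rx_tail;
    };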


A detailed description of application device channels can be found in [166, 169].

A prototype implementation of ADCs was realized in a Mach 3.0/x-kernel environment, using the Osiris ATM network adaptor [169]. Performance results were obtained on DEC 3000/600 (175 MHz Alpha) workstations connected by a pair of Osiris boards, linked back to back. The latency figures include interrupt latency; that is, the receiver does not poll the network device. A short message (1 B) round-trip latency of 154 µs was measured between test programs configured directly on top of the user-level Osiris device driver. This number is significant because it is a lower bound for the latency an application can achieve using customized network protocols on top of ADCs. More comprehensive performance results are presented in [169].

20.6 PREDICTABLE COMMUNICATION PERFORMANCE

In this section, we discuss principles and techniques that enable the network subsystem to deliver predictable communication service to applications. Specifically, the goal is to enable the network subsystem to maintain the quality of service provided by the network, to maintain fairness in providing communication service to multiple applications, and to maintain stability when the volume of incoming network traffic exceeds the end host's capacity.

The key to predictable communication performance is appropriate scheduling of resources in the end host. For instance, maintaining constant bandwidth requires that application and network subsystem obtain a sufficient share of CPU and memory resources per unit time. Achieving low jitter requires fixed scheduling delays for application and network subsystem in response to a message transmission/reception. Four requirements underlie the appropriate scheduling of resources in the end system for predictable communication performance:

1. All communication-related processing must be scheduled according to a policy that is able to maintain the network's QoS guarantees, ensure fairness among competing applications, and maintain stability under overload. For instance, a realtime CPU scheduling policy is generally required to maintain realtime QoS guarantees provided by the network.

2. Communication events must be associated with the responsible resource principal prior to the processing of the event. Examples of communication events are the transmission or reception of a network packet and the handling of a time-out. The resource principal is the entity on whose behalf the communication is performed (normally an application process).

3. Communication processing must be scheduled according to the contract that the responsible principal has established with the OS. For instance, in a priority-driven system, all communication processing performed on behalf of an application must be performed at the priority of that application.

4. Resources consumed during the processing of communication events must be charged against the resource allocation of the associated resource principal. For instance, in a system with a fair-share scheduler, CPU time spent in processing network packets must be charged to the application process on whose behalf the communication is performed, thus reducing that process's future priority.

Requirement 1 stipulates that the end system's scheduler must be able to provide guarantees sufficient for the preservation of the network's QoS. Requirement 2 implies that early demultiplexing is critical to achieving predictable communication performance [533]. Without early demultiplexing, it is impossible to determine the responsible resource principal of an incoming network packet; thus, it is not possible to schedule that packet's processing appropriately. Requirements 3 and 4 stipulate that all communication processing be scheduled and accounted for under a policy that satisfies the contract between the OS and the responsible application.

20.6.1 Network Subsystem Structure

The first step is to structure the network subsystem in such a way that all resources spent in communication processing are properly scheduled and accounted for. In particular, communication processing must be scheduled according to the contract between the OS and the application that is responsible for the communication.

In conventional network subsystems, the processing of incoming network packets is interrupt driven; that is, it is scheduled at a priority higher than that of any application process. This leads to priority inversion when the arrival of a packet for a low-priority application preempts an application with higher priority. Moreover, a stream of incoming network packets can cause the system to enter a state known as receiver livelock, where all of the system's resources are spent on processing incoming packets and the system is unable to make progress [391].

In general, the interrupt-driven mode of processing incoming packets can cause the scheduler to violate its contract with the application. To solve this problem, all communication processing steps, including the processing of incoming packets, must be scheduled under the scheduler's control and in accordance with the contract between the OS and the responsible application. To achieve this, early demultiplexing must be used to associate an incoming packet with its responsible application before processing is scheduled. Then, processing of the packet is carried out by a schedulable entity (e.g., a thread) that acts on behalf of the responsible application.

Two approaches have been used to achieve appropriate communication processing. In the user-level networking approach, all communication processing is performed at the user level, in the context of the application process responsible for the communication. In the lazy receiver processing approach, the communication processing remains in the OS kernel but is performed by a schedulable entity associated with the responsible application process.

User-Level Network Subsystems

In a user-level network subsystem, early demultiplexing is performed by the network interface or the OS kernel. All other communication processing is performed by schedulable entities that are part of the application process. Assuming that the demultiplexing overhead imposed on the host CPU is negligible, all communication processing performed on behalf of an application is performed within the context of that application's process. Therefore, predictable communication performance can be achieved simply through the use of an appropriate scheduling policy and appropriate resource contracts for the various schedulable entities involved in communication processing.

Several user-level network subsystems with predictable performance have been built, for instance in realtime Mach [333] and Nemesis [340]. In realtime Mach, early demultiplexing is performed by the OS kernel via a packet filter. Code implementing the network subsystem is linked as a library with each application. A dedicated application thread executes this code and performs communication processing. The realtime Mach scheduler provides resource contracts called processor capacity reserves [374]. The dedicated communication thread is provided with a capacity reserve adequate for the application's communication performance requirements.

In the Nemesis system, all I/O is performed by libraries linked with application programs. Device drivers perform early demultiplexing. A single global address space memory model with protection domains reduces context switch times and allows efficient data transfer. A split-level CPU scheduling approach is used where the kernel multiplexes the CPU among processes by using a variant of the “earliest deadline first” (EDF) discipline, and processes further multiplex the CPU among their threads by using an application-specific scheduler.

As we have seen, user-level network subsystems can be naturally extended to provide predictable service through the use of an appropriate scheduling policy and appropriate resource contracts. However, predictable communication performance can also be achieved in a traditional system structure where the network subsystem is centralized within the OS kernel.

Lazy Receiver Processing

Lazy receiver processing (LRP) [168] is a centralized kernel-level network subsystem that allows proper accounting and scheduling of communication processing. As such, it provides improved performance, fairness, and stable overload behavior. Like user-level network subsystems, LRP can readily be extended to provide predictable communication performance when combined with an appropriate CPU scheduler.

The implementation of LRP in a BSD UNIX–like system [335] can be summarized as follows. First, protocol processing is scheduled at the priority of the responsible application. Protocol processing for an incoming packet in many cases does not occur until the application requests the packet in a receive system call. Packet processing no longer interrupts the currently executing process at the time of the packet's arrival, unless the receiver has higher scheduling priority than the currently executing process. This avoids inappropriate context switches and can increase performance.

Second, the network interface separates (demultiplexes) incoming traffic by destination socket and places packets directly into per-socket receive queues. (A socket is a communication end point in BSD UNIX.) Combined with the receiver protocol processing at application priority, this provides feedback to the network interface about application processes' ability to keep up with the traffic arriving at a socket. The feedback is used as follows: Once a socket's receive queue fills, the network interface discards further packets destined for the socket until applications have consumed some of the queued packets. Thus, the network interface can effectively shed load without consuming significant host resources. As a result, the system has stable overload behavior and increased throughput under high load.
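In outline, the receive path just described might look as follows. This is a schematic reconstruction of the idea, not code from the LRP prototype; the queue size, names, and helper functions drop() and protocol_input() are invented.

    struct packet;                       /* opaque here */
    void drop(struct packet *p);         /* assumed helpers, not real kernel APIs */
    void protocol_input(struct packet *p);

    #define SOCKQ_SLOTS 32

    struct socket {
        struct packet *q[SOCKQ_SLOTS];   /* per-socket receive queue */
        int head, tail, count;
    };

    /* Runs at packet arrival: early demultiplexing only, no protocol code. */
    void nic_deliver(struct socket *so, struct packet *p)
    {
        if (so->count == SOCKQ_SLOTS) {  /* queue full: shed load before any */
            drop(p);                     /* host CPU cycles are invested */
            return;
        }
        so->q[so->tail] = p;
        so->tail = (so->tail + 1) % SOCKQ_SLOTS;
        so->count++;
    }

    /* Runs in the context, and at the priority, of the receiving process. */
    struct packet *lrp_receive(struct socket *so)
    {
        if (so->count == 0)
            return 0;                    /* a real system would block here */
        struct packet *p = so->q[so->head];
        so->head = (so->head + 1) % SOCKQ_SLOTS;
        so->count--;
        protocol_input(p);               /* protocol processing happens lazily
                                            and is charged to this process */
        return p;
    }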


Third, the network interface's separation of received traffic, combined with the receiver processing at application priority, eliminates interference among packets destined for separate sockets. Moreover, the delivery latency of a packet cannot be influenced by a subsequently arriving packet of equal or lower priority. And, the elimination of the shared IP queue found in traditional TCP/IP network subsystems greatly reduces the likelihood that a packet is delayed or dropped because traffic destined for a different socket has exhausted shared resources.

Finally, CPU time spent in receiver protocol processing is charged to the application process that receives the traffic. This feature is important because in UNIX the recent CPU usage of a process influences the priority that the scheduler assigns to the process. In particular, it ensures fairness in the case where application processes receive high volumes of network traffic.

A prototype implementation of LRP exists for the SunOS and FreeBSD operating systems. Experiments show that a prototype system based on LRP maintains its throughput and remains responsive even when faced with excessive network traffic on a 155 Mb/s ATM network. In comparison, a conventional UNIX system collapses under network traffic conditions that can easily arise on a 10 Mb/s Ethernet. Further results show increased fairness in resource allocation, traffic separation, and increased throughput under high load. A more detailed description of LRP, along with results of an experimental evaluation, can be found in [168].

20.6.2 Scheduling

This section briefly discusses the scheduling of CPU resources suitable for realtime communication. Related material on job scheduling and network resource scheduling can be found in Chapters 12 and 19, respectively. A CPU scheduler consists of an application programming interface, which specifies a contract between scheduler and application for each of the application's schedulable entities (e.g., threads), and a scheduling algorithm that can be shown to multiplex the CPU in such a way that the existing resource contracts for each schedulable entity are satisfied.

The two classic realtime scheduling algorithms are rate monotonic and earliest deadline first [351]. With rate monotonic scheduling, a periodic realtime task is assigned a priority level that is proportional to the rate at which it executes. For example, the frame rate at which a video is displayed would be directly proportional to the priority of the task responsible for decoding and displaying that video. Tasks are then scheduled according to priority. With EDF scheduling, each task is assigned a deadline by which the next unit of realtime work must be done, and the task assigned the earliest deadline is selected next for execution. In effect, a task's priority is given by its deadline. For example, if a video task has already produced three video frames that are ready to be displayed, and if the video is playing at a rate of 30 frames per second, then the deadline for this task to produce another frame is 90 ms in the future.
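The EDF dispatch decision itself is simple to state in code. The sketch below is illustrative only; a real scheduler would maintain the ready set incrementally and recompute deadlines as units of work complete.

    #include <stddef.h>

    struct task {
        long deadline_us;                /* absolute deadline of the next unit of work */
        int  ready;
    };

    /* Earliest deadline first: among ready tasks, run the one whose deadline
       is nearest. A task's priority is thus implied by its deadline. */
    struct task *edf_pick(struct task *tasks, size_t n)
    {
        struct task *best = NULL;
        for (size_t i = 0; i < n; i++) {
            if (!tasks[i].ready)
                continue;
            if (best == NULL || tasks[i].deadline_us < best->deadline_us)
                best = &tasks[i];
        }
        return best;
    }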

EDF is more attractive than rate monotonic scheduling for several reasons. First, deadlines are a natural way to express the realtime needs of an application. Second, EDF is provably optimal as long as the system is not overloaded. Unfortunately, EDF behaves poorly under overload because it tends to schedule the tasks that are most likely to miss their deadlines.

An alternative is to allocate each task some share of the CPU's capacity. Such algorithms are commonly called proportional share algorithms, and they attempt to provide each task with its fair share of CPU time either probabilistically [557] or deterministically [558]. However, pure proportional share schedulers do not support task execution deadlines, and for this reason they are not suitable for environments that require support for (soft) realtime guarantees.

The problem with all these schedulers is that they fix a single scheduling algorithm for all the tasks in the system. It is not uncommon, however, for the user to want to run a mix of realtime and nonrealtime applications. To support such generality, hierarchical schedulers are often proposed [195, 242]. With hierarchical scheduling, some fraction of the CPU is managed by one scheduler (e.g., EDF) and some other fraction of the CPU is controlled by another scheduler (e.g., standard priority scheduler). The fundamental problem with hierarchical scheduling, however, is how to allocate a share of the processor to each scheduler and, as a consequence, to each class of tasks. The state of the art is to make this partitioning decision statically; for example, realtime tasks get 75% of the CPU and nonrealtime tasks get the other 25%. In the long run, what is needed is a single unified scheduler that accommodates both realtime and nonrealtime tasks. Recent research is making progress in this direction [420].

20.7 FUTURE DIRECTIONS

In this section, we identify future challenges and research problems in operating systems for grid environments. These environments will likely require additional OS support in the areas of security, resource accounting, performance monitoring, and end-to-end resource scheduling.

Grids based on shared infrastructure depend critically on strong security, accounting, and assurance, which in turn require OS support for the security architectures and protocols discussed in Chapter 16. A strong security architecture must rely on a small trusted computing base to minimize the risk of software bugs that might compromise the system's integrity. Unfortunately, reducing the size of the trusted computing base in today's general-purpose operating systems requires significant structural changes.

Since grid applications execute on shared resources, proper accounting and control of resource consumption are important. These are likely to require mechanisms and APIs not found in present-day operating systems. Since availability of resources is generally unpredictable in a shared infrastructure, continuous performance monitoring and automatic adaptation/reconfiguration may be required to achieve acceptable application performance. These are likely, in turn, to require additional instrumentation mechanisms and APIs not available in current operating systems.

To support applications with run times that exceed the expected availability of any given computing resource, a checkpointing facility is required that allows the capture, migration, and restart of an application's runtime state. In general, OS support is required to support checkpointing of distributed applications. As discussed in Chapter 13, such support is not currently available in general-purpose operating systems.

Finally, executing a complex grid application requires resource scheduling at several layers. Job schedulers manage the aggregate resources (computers, disks, devices) needed to execute a complex application (see Chapter 12). CPU schedulers allocate node CPU cycles to application processes (see Section 20.6.2). Network QoS algorithms allocate network resources to network connections (see Chapter 19). Each of these subsystems acts independently and uses its own API, yet meeting an application's performance target generally requires the orchestration of all resources involved in the computation. Achieving global end-to-end scheduling of resources requires an integration of resource scheduling and new APIs that allow applications to specify high-level performance requirements.

FURTHER READING

For more information on the topics covered in this chapter, see www.mkp.com/grids and also the following references:


• Basu et al. [51] and Druschel and Banga [168] provide descriptions of network subsystem architectures.

• Brustoloni and Steenkiste [84] discuss the effects of buffering semantics on I/O performance.

• Druschel et al. [169] discuss experiences with a high-speed network adaptor.

• Mosberger et al. analyze techniques for improving protocol latency [399].

• Ramakrishnan's paper addresses performance considerations in designing network interfaces [461].


C H A P T E R   22

Testbeds: Bridges from Research to Infrastructure

Charlie Catlett
John Toole

Developing, testing, and refining a technology (or set of technologies) are the functions of a testbed. In some cases, the desired capabilities are simply expansions—bigger, faster, easier—of currently available systems. In other instances, what is envisioned goes against established thinking or cannot be accomplished by mere extensions of current technology. In either case, for building computational grids, testbeds are critical in at least three ways:

• Scale of integration: Diverse technologies must be integrated, deployed, and tested in radically new ways. The technologies described in the preceding chapters are maturing at different rates, demanding a conscious evolutionary testbed approach to scale both the magnitude and the breadth of the experiments.

• Building communities: Distributed computing enables new communities of users and developers to form as computational resources are linked with data and people. Testbeds provide a way to accelerate the formation of mutually agreeable, but strategically chosen communities. The evolution of these distributed testbeds over time will allow rapid prototyping of future visions of the grid.

• Mitigating risk: In addition to building new communities and users on a rapidly changing technology base, we also face the challenge of quantifying and qualifying the evolutionary results in ways that help new users understand the wealth of new opportunities and the corresponding risks. Consequently, testbeds must support measurements and carefully chosen goals, while also providing incentives for new opportunities that may be discovered.

This chapter examines the role, application, and development of testbeds, looking to both past and present systems to learn how testbeds can provide insight and capability as national and international computational grids evolve. Some of the testbeds that we consider have already led to significant infrastructure, while others were crucial steps along the way. Still others are part of today's landscape, whose effects are not yet known. In the process, contributions are noted from the standpoint of both technical achievement and “best practices.” In the conclusion we discuss salient characteristics of “good” testbeds and the common challenges successful testbeds overcome. In all cases, the testbeds employed a variety of the technologies discussed in preceding chapters, focused on developing new communities that advanced the field, and resulted in solid, measurable progress that has proven useful in the next generation.

22.1 INTRODUCTION: DECIBITS TO KILOBITS

In 1843, the U.S. Congress approved funding for what today might be called the “Decibit testbed” to examine the merits of a new technology: the telegraph. The following year, Congress was treated to a demonstration of this technology in the form of an electronic message sent over the testbed from Baltimore to Washington, D.C. [217]. For several decades prior to this, inventors in Europe and in the United States had been working on the technology (Samuel Morse applied for a patent in 1837). By 1843 it appeared that the technology would in fact work, but it was not clear how the technology would scale in distance or complexity, and it was even less clear what would be the application of this technology or the utility of that application.

About 125 years later, U.S. federal funding was approved for another experiment in long-distance communications: the ARPANET. In 1972 Washington, D.C., was the venue for the ARPANET's first major demonstration as well. The network was extended into the conference hotel of the International Conference on Computers and Communications (ICCC) to show how ARPANET could support remote computer access, file transfer, and remote control of peripheral devices (from printers to robots).

In both of these “testbed” examples, the technology being examined ran somewhat crosswise to current practice or state of the art. The telegraph came at a time when long-distance communication was done by moving paper, with delays ranging from days to months. It was unclear what benefits would emerge if communication over distances took place in minutes or hours. Further, the existing communications industry, such as it was (essentially the U.S. postal service and private pony express enterprises such as Wells Fargo), would be threatened by this new technology.

Similarly, ARPANET and the research behind it advocated two ideas that did not fit with established practice or current thinking: packet-switched networks and the use of computers as communication devices to augment human interaction. The notion of a packet-switched network was deemed to be infeasible by the existing communications industry, whose infrastructure model was based on circuit switching. The use of computers as communications devices to assist human collaboration [343] did not mesh with the view of the computer industry, which saw the computer as an arithmetic calculation device.

Both the telegraph and the ARPANET eventually led to global infrastructure. The telegraph and its follow-on, the telephone, resulted in the telecommunications infrastructure we use in everyday life today. The ARPANET led to today's Internet, an infrastructure that is rapidly approaching the same scale of ubiquity.

These two “testbeds”—the early telegraph trial and the ARPANET—also provide several lessons regarding the transition of research into viable infrastructure. Both testbeds proposed models that were not necessarily consistent with current practice and were generally considered impractical or outlandish. Both put in place, at great expense, facilities whose application (much less benefit) was as yet unknown and certainly unproven. Both involved some combination of stable infrastructure beneath experimental devices and algorithms. In the case of the telegraph, the device was somewhat experimental: while the stringing of iron wires was a common practice (though generally the wires were used for fencing, not telegraphy), the telegraph's encoding system (Morse code) was essentially a new protocol. In the case of the ARPANET, the leased telephone circuits and Honeywell minicomputers (used as IMPs) were current infrastructure, while the software, interface devices, applications, and protocols were new and unproven. In fact, at the start of the project only a proposed design for the IMP systems existed, and there were no proposals, much less designs, in place regarding protocols, interfaces, or applications.

22.1.1 Testbeds

What is a testbed? In one sense, any project experimenting with new capabilities is a testbed. Generally speaking, the collection of users trying out a new software application program is a testbed aimed at determining the utility of such a program. Here we will use the term to describe a broader effort in which cooperating project teams are attempting to provide a particular community of users with new capabilities, by both developing and combining multiple underlying components of technology.

Some testbeds are aimed at specific communities of users; others strive for more ambitious scale. The most successful testbeds have tended to strike the right balance of scale, component technologies, and coupling between the envisioned capabilities and the needs of the target communities. Testbeds are complex combinations of technology and people. Thus, it is important to point out the organizational contributions of testbeds as well.

Perhaps the best way to illustrate the concept of a testbed is to begin with a look at a well-known example: the ARPANET.

22.1.2 ARPANET

During the early 1960s, researchers in the United States and Great Britain began to develop the concept of a communications network that would send information in discrete packages, or packets. Such a network could, in theory, provide redundant paths from one point to another in order to route information around infrastructure failures. After nearly a decade of incubation of these ideas, the U.S. Department of Defense Advanced Research Projects Agency (DARPA) launched ARPANET—a project to determine whether packet-switched networks might be useful. At the time, DARPA was funding expensive time-shared computers at several computer science laboratories, and a computer network interconnecting them might prove useful in facilitating resource sharing among projects [256]. Below we discuss the broad issues of ARPANET’s contributions as a testbed combining people, new ideas, and technology (see Chapter 21 for a review of the evolution of ARPANET in terms of networking technology).

J. C. R. Licklider, who directed DARPA’s Information Processing Techniques Office (IPTO) during the 1960s, set the stage for the ARPANET with a rather radical vision that saw computers not as merely arithmetic engines but as tools that might one day augment human intelligence and interaction. Licklider’s vision proposed a new model in which humans would interact with computers in a symbiotic relationship and would interact with one another through networks of computers [343]. Throughout most of the 1960s, the state of the art was batch processing, followed by timeshared use toward the end of the decade. However, computers did not interact with one another, and users needed a separate, custom terminal for each mainframe used. Dial-up or dedicated phone lines connected these terminals to remote computers. To compute on two systems at two labs, a user needed two separate terminals and phone lines. Even exchanging data manually was impractical because of differences in physical media formats and information representation (different character set encodings, etc.).

The ARPANET project began in 1968 under the direction of Larry Roberts, who had succeeded Licklider as director of IPTO. A contract was awarded to Bolt Beranek and Newman (BBN) to build an Interface Message Processor (IMP) that would be the building block of a packet-switched network. IMPs at multiple sites would be interconnected with leased telephone circuits and would serve as packet-switching, or routing, nodes (see Figure 22.1). Each site with an ARPA-funded computer was required to build a custom interface between its computer and the IMP. Part of the challenge, however, was that no standards or example systems existed: the entire system architecture was essentially wide open. How would information on one computer be transmitted to a distant computer? What would be the division of labor between software, hardware, hosts, and IMPs? What applications would run on one computer or the other to take advantage of such a network? The goal of sharing resources between locations led to an initial application that would allow a teletype terminal at one location to act as a user interface to a host at a distant location—that is, an application that allowed the packet-switched network to function in the place of a dial-up connection to a distant computer.

Layered Protocols

In 1968, a group of graduate students at the participating universities began to meet and discuss the potential architecture for this network. The first task of this group (which became known as the network working group) was to develop the interface technology that would allow a host computer to connect to an IMP. The group’s development of mechanisms for getting data from one host to another through the network essentially represents the first notion of network protocols.

The first end-to-end protocol, called “host-to-host,” was devised by the network working group along with a program to interface with the host operating system, called the Network Control Protocol (NCP, also generally used to refer to the host-to-host protocol; see Chapter 21). Like testbeds today, one of the obstacles to be overcome in the context of heterogeneity was integrating separate “worlds.” Steve Crocker describes this best in his introduction to RFC 1000 [454]: “Systems of that era tended to view themselves as the center of the universe; symmetric cooperation did not fit into the concepts currently available within these operating systems.”

FIGURE 22.1 The initial ARPANET connecting four sites: #1 UCLA (Sigma 7), #2 SRI (940), #3 UCSB (360), and #4 Utah (PDP-10). Redrawn from Bob Kahn and Vint Cerf, “ARPANET Maps 1969–1990,” Computer Communications Review, October 1990. (Original sketch by J. Postel.)

In 1973, TCP was proposed by Vint Cerf, then on the faculty at Stanford University (formerly a UCLA graduate student in the network working group), and Bob Kahn, who had designed the host-to-IMP interface specification while working on the BBN IMP team prior to moving to DARPA. While not the officially used protocol of the ARPANET until the January 1, 1983, “flag day” transition, TCP evolved steadily over its first decade. Its evolution represents one of the most significant concepts to arise from the ARPANET testbed.

As the community began to look at interconnecting multiple networks, the distribution of work between hosts and IMPs and between applications and protocols began to move toward one of functional layers. Under Bob Kahn’s leadership in the 1970s and early 1980s, IPTO initiated several packet-switching testbeds in addition to the ARPANET. These included SATNET, a satellite-based packet-switched network between the United Kingdom and the United States, and the packet-radio testbed, which used terrestrial radio transmission to allow mobile devices to be interconnected with packet-switched networks. During a discussion about interconnecting these networks with vastly different properties, Cerf, Kahn, and Jon Postel came up with the idea that TCP ought to be split into two separate pieces. The pieces that dealt with addressing and forwarding messages through the network became IP, while the functions dealing with guaranteed delivery (sequence numbers, retransmission, multiplexing separate streams) went to TCP. And thus, layered protocols and TCP/IP were born.
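
To make the division of labor concrete, here is a minimal sketch in Python. The field names are illustrative, not the actual wire formats, but the grouping follows the split described above: IP carries only what an intermediate node needs to forward a packet, while TCP carries what the two end hosts need for reliable, multiplexed delivery.

    from dataclasses import dataclass

    @dataclass
    class IPHeader:
        # Addressing and forwarding: the only fields a router must examine.
        src_addr: str   # source host address
        dst_addr: str   # destination host address
        ttl: int        # hop limit, decremented at each router

    @dataclass
    class TCPHeader:
        # Guaranteed delivery: interpreted only by the two end hosts.
        src_port: int   # ports multiplex separate streams on one host
        dst_port: int
        seq_num: int    # sequence numbers order data and expose losses
        ack_num: int    # acknowledgments drive retransmission

    @dataclass
    class Packet:
        ip: IPHeader    # routers look here...
        tcp: TCPHeader  # ...and never need to look beyond it
        payload: bytes

The design consequence is the one that mattered for internetworking: a router can forward a Packet while remaining entirely ignorant of TCP, so networks with vastly different properties need only agree on IP.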

Steve Crocker captures these early days in his introduction to RFC 1000:

We envisioned the possibility of application specific protocols, with code downloaded to user sites, and we took a crack at designing a language to support this. . . . With the pressure to get something working and the general confusion as to how to achieve the high generality we all aspired to, we punted and defined the first set of protocols to include only Telnet and FTP functions. In particular, only asymmetric, user-server relationships were supported. In December 1969, we met with Larry Roberts in Utah, and suffered our first direct experience with “redirection.” Larry made it abundantly clear that our first step was not big enough, and we went back to the drawing board. Over the next few months we designed a symmetric host-host protocol, and we defined an abstract implementation of the protocol known as the Network Control Program. (“NCP” later came to be used as the name for the protocol, but it originally meant the program within the operating system that managed connections. The protocol itself was known blandly only as the host-host protocol.) Along with the basic host-host protocol, we also envisioned a hierarchy of protocols, with Telnet, FTP and some splinter protocols as the first examples. If we had only consulted the ancient mystics, we would have seen immediately that seven layers were required.

Applications

While most of the effort during the initial period of the ARPANET project went toward creating the technology for the network, the driving force remained: to use interconnected computer systems to support human collaboration. Initially, applications were developed to transfer data from one computer to another (file transfer protocol) and for remote log-in (telnet). Already there were visions of applications such as sending programs to be executed on remote computers or accessing large databases from across the country, but these future applications had to begin with simpler capabilities.

One of the first applications to move the network testbed toward using computers to enable human collaboration was electronic mail. Initially there were multiple mail systems, each with its own user interface and message format. Eventually, the internetworking community’s culture of idea refinement through open discussions resulted in standard message formats to make it easier for people to exchange mail between various email user interfaces (i.e., client programs on the different hosts running different operating systems). The “finger” program followed this—a simple query program that allowed a user on one host to find out whether a colleague on another host (perhaps across the country) was logged in.

22.1.3 Organizational Testbed Issues in the ARPANET

The initial ARPANET community, starting with the graduate students working on host interfaces, began to record its progress and discuss new ideas in the form of “Requests for Comments” (RFCs). The RFCs put in place the process for wide discussion of proposed protocols, standards, and other ideas. Equally important, the fact that RFCs could be submitted by anyone set the stage for the open and cooperative environment that shaped and crafted the Internet over the next two decades.

Another important concept of a testbed is the use of time-forcing functions to drive research and development toward working systems. Just as release deadlines and end-of-quarter profit statements drive product development, large-scale demonstrations can be useful in driving research and development toward producing working prototypes. At least two events during the ARPANET project can be considered such drivers. The first was a “bake-off” held in October 1971, when all of the ARPANET site participants gathered at MIT to try to log into one another’s sites over the network (all but one worked). The second, more public, demonstration was held in conjunction with the ICCC in Washington, D.C., the following October. An IMP was installed in the conference hall and connected to the ARPANET, with each participating site bringing terminal equipment and peripherals to be hooked to the IMP.

With live demonstrations of remote log-in, email, remote printing, even remote control of robots, the ARPANET community was able to show representatives from funding agencies, the computer and communications industries, and other researchers that packet-switched networks were indeed viable. At this point it can be argued that the ARPANET reached a major milestone in the transition from a packet-switched testbed to infrastructure.

22.2 POST-ARPANET NETWORK TESTBEDS

During the late 1970s and early 1980s, ARPANET testbed results made their way to infrastructure through technology transfer to the Department of Defense as well as to industry. MILNET was created to interconnect defense sites using ARPANET technology. Schlumberger, a multinational oil industry services company, used ARPANET technology for its worldwide corporate network during the late 1970s and early 1980s. In addition, new companies formed to commercialize the technology: GTE Telenet (not to be confused with the telnet protocol) and Tymnet (which created packet-switched networks to provide network services for corporations and universities) are two examples. Finally, the ARPANET testbed concepts and technologies led to other government and academic network projects such as CSNET, BITNET, MFENET, ESnet, NSI, and NSFNET.

Packet-switched networks, having been proven viable by ARPANET (see Chapter 21), also influenced the computer industry to create products that could be interconnected by local area (and later wide area) networks. Digital Equipment Corporation and IBM created their own sets of layered protocols for interconnecting their products. AT&T’s UNIX operating system was further developed with DARPA funding at the University of California–Berkeley, and DARPA funding to Stanford resulted in startups such as Sun Microsystems, which delivered a minicomputer running the UNIX operating system. DARPA’s funding of the Berkeley UNIX work also came with the encouragement to include the TCP/IP protocol stack in the operating system. Thus, with the explosion of desktop UNIX workstations in the early 1980s came a wide deployment of the TCP/IP protocols.

The U.S. university computer science community, many of whom were not included in the ARPANET, created CSNET as well as USENET (based on AT&T’s UUCP data transfer program that was also included with UNIX) to exchange electronic mail and documents. U.S. federal agencies began creating their own packet-switched networks as well. The Department of Energy initially created MFENET in the mid-1970s using its own protocols, later evolving it into the multiprotocol (including IP) ESnet in the late 1980s to network its research laboratories; NASA used Digital Equipment’s DECNET protocols largely because of the popularity of the DEC VAX computers among its researchers; and the National Science Foundation created NSFNET to interconnect its supercomputer centers. Here we will discuss NSFNET in more detail as illustrative of these second-generation networks. Each contributed significantly as an intermediate step between the ARPANET testbed and today’s global Internet infrastructure.

Following the discussion of NSFNET, we will examine some of the high-performance network testbeds of the late 1980s and early 1990s to look at how these helped to scale the ARPANET technology to the point that it could support the global Internet infrastructure we see today.

22.2.1 NSFNET

Between 1985 and 1986 the NSF created five national supercomputer centers to provide access to advanced computational capabilities for researchers in U.S. universities. Initially there were several thousand users, growing to over ten thousand in the first five years of the program. During the first two years of the program, most users accessed the centers via dial-up lines, GTE Telenet X.25 service, or (for file transfer) BITNET. None of these networks, however, was sufficient in capacity or functionality to support the advanced needs of supercomputer users, such as remote visualization. The NSFNET program began with a 56 Kb/s backbone network between the five supercomputer centers and the NSF-funded National Center for Atmospheric Research (NCAR), coupled with a funding program to assist universities in forming collaborative “regional” networks (see Figure 22.2).

A three-layer model was developed, consisting of a backbone network (between the six supercomputer centers), mid-level networks (the regionals), and campus networks. To receive NSF funding support for connecting a campus to a mid-level network, a campus was required to show commitment to installing a campus network that would extend the NSFNET to individual researchers’ desktops.

Although the original goal of the NSFNET program was to use ARPANET technology to provide infrastructure for supercomputer users, a significant amount of additional research was nonetheless necessary for its success as infrastructure. The decision was made early on that the network would use the Internet protocols (IP, TCP) even though several successful networks were already in place demonstrating that other protocol choices were valid. NASA’s network relied on the DECNET protocols, and the Department of Energy’s network used protocols developed by its laboratories as well as DECNET and later IP.

While the original ARPANET architecture was a subnetwork of IMPs, each having one or more directly attached hosts, it had evolved by the early 1980s to interconnect local area networks as well, requiring IMP-like devices (later called routers) to interconnect the LANs.

The selection of IP and the goal of interconnecting local area networks resulted in a need for routers in the NSFNET backbone project as well, although no commercial routers were available at the time. Borrowing again from the ARPANET community, the NSFNET backbone used minicomputer-based routers called “fuzzballs” [79].

FIGURE 22.2 NSFNET backbone and mid-level networks circa 1990. Backbone nodes are shown as circles, with small boxes attached to backbone nodes showing entry points for mid-level networks and supercomputer centers.

Each fuzzball consisted of an Ethernet LAN interface and multiple serial interfaces that were used to interconnect the fuzzballs over leased 56 Kb/s phone circuits. A community of technical staff began to develop among the fuzzball sites, as well as at the University of Michigan and the University of Delaware, where NSF was funding research and development of the fuzzball systems and network routing protocols.

In the course of providing the NSF supercomputer center users with a network infrastructure, many important research and development issues were identified and resolved. As noted earlier, the NSFNET’s three-layer network architecture (backbone/mid-level/campus) resulted in a wide deployment of IP routers. As the network grew to include complex mid-level and campus routing, protocols had to take into account issues of scale that had not been encountered in previous networks. For example, early routing protocols assumed a maximum network diameter (number of hops from source to destination) of less than 15; thus, any packet with a hop count of 15 or more was considered to be lost in a routing loop and discarded. As mid-level and campus networks grew, there were instances where hop counts from one end of NSFNET to the other exceeded the routing protocol’s limits, forcing changes in these protocols. Commercial IP routers deployed in mid-level networks also used different routing protocols from those used in the core backbone (for various reasons, including scale). The interaction between these different routing protocols resulted in both improved routing protocols and new strategies and software such as GATED [186], which is used to translate between multiple routing protocols.
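
A minimal sketch of the diameter assumption, in Python with hypothetical packet fields: each router increments a hop counter on arrival and discards any packet that reaches the limit, on the presumption that it is circling in a routing loop.

    MAX_HOPS = 15  # assumed maximum network diameter in early routing protocols

    def forward(packet, routing_table):
        """Return the next hop for a packet, or None if it must be discarded."""
        packet["hops"] += 1
        if packet["hops"] >= MAX_HOPS:
            return None  # presumed lost in a routing loop; drop it
        return routing_table[packet["dst"]]

Raising MAX_HOPS is trivial in isolation, but in a deployed protocol the limit defines every router’s notion of “unreachable,” which is why legitimate paths longer than 15 hops forced coordinated protocol changes rather than a local tweak.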

By 1987 the 56 Kb/s NSFNET backbone network had grown more and more congested, but no routers were available that could handle T1 (1.5 Mb/s) circuits, the next logical upgrade in bandwidth. At the same time, ARPANET was being decommissioned, and all of its traffic was beginning to flow over the NSFNET backbone. In a controversial move, researchers working on the fuzzball software adjusted the algorithms used to handle congestion so that interactive performance improved at the expense of file transfer throughput. During congestion, buffers in routers fill and packets are discarded. The fuzzballs would normally discard packets on a first-come, first-served basis, but this was changed so that packets associated with interactive sessions (e.g., telnet) were favored over packets associated with file transfer sessions (e.g., ftp). This meant that interactive users saw an improvement in service, while file transfers took a bit longer.
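
The change amounts to replacing a first-come, first-served drop rule with a class-based one. The sketch below is illustrative rather than the fuzzball code; in particular, classifying traffic by well-known destination port is a simplification. When the buffer is full, an arriving interactive packet evicts a queued bulk-transfer packet instead of being refused.

    from collections import deque

    INTERACTIVE_PORTS = {23}   # e.g., telnet
    BULK_PORTS = {20, 21}      # e.g., ftp data and control connections

    BUFFER_LIMIT = 64          # illustrative buffer size, in packets
    buffer = deque()

    def enqueue(packet):
        """Queue a packet for transmission; return False if it is dropped."""
        if len(buffer) < BUFFER_LIMIT:
            buffer.append(packet)
            return True
        # Buffer full: favor interactive traffic over bulk transfers.
        if packet["dst_port"] in INTERACTIVE_PORTS:
            for i, queued in enumerate(buffer):
                if queued["dst_port"] in BULK_PORTS:
                    del buffer[i]      # a file-transfer packet absorbs the loss
                    buffer.append(packet)
                    return True
        return False  # first-come, first-served for everything else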

Network management prior to the late 1980s was done using techniques that had developed during the ARPANET era, whereby a network operations center could monitor network devices (e.g., IMPs, hosts) and keep track of errors or outages, in many cases even using trend data to predict problems. During the NSFNET project, as the number of routers in the networks increased to hundreds (including the backbone network, mid-level networks, and campus networks), Dave Mills wrote a simple query mechanism that allowed certain console commands to be executed over the network without logging into the fuzzballs. This allowed simple network monitoring software such as the “ping monitor” from MIT (which kept track of host reachability by periodically pinging a list of hosts) to be augmented to keep track of interfaces within routers. Members of the community of mid-level and backbone network managers and researchers (specifically Martin Schoffstall from NYSERnet and Jeffrey Case from the University of Tennessee) expanded this notion with a specification for the Simple Gateway Monitoring Protocol (SGMP), which grew into the now common device management protocol, the Simple Network Management Protocol (SNMP).
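
In the spirit of the ping monitor, reachability tracking needs little more than a periodic sweep over a host list. The sketch below shells out to the system ping command and logs only transitions between up and down; the host names and polling interval are placeholders.

    import subprocess
    import time

    HOSTS = ["hosta.example.edu", "hostb.example.edu"]  # hypothetical hosts
    INTERVAL = 60                                       # seconds between sweeps

    def reachable(host):
        """Send one echo request; exit status 0 means a reply came back."""
        result = subprocess.run(["ping", "-c", "1", host], capture_output=True)
        return result.returncode == 0

    def monitor():
        status = {host: None for host in HOSTS}
        while True:
            for host in HOSTS:
                up = reachable(host)
                if up != status[host]:  # report only changes in reachability
                    print(time.ctime(), host, "is", "up" if up else "down")
                    status[host] = up
            time.sleep(INTERVAL)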

By 1994 it was clear that Internet technology had created a viable commercial marketplace, widely deployed in the form of private corporate IP networks and supporting a large number of equipment suppliers. Many of the mid-level networks, originally funded by NSF and/or consortium member dues, had already been commercialized. The NSFNET backbone was turned off in mid-1995 in favor of a new architecture proposed by NSF. Specifically, network access points (NAPs) would be used to interconnect mid-level networks rather than a national backbone, and inter-NAP traffic would be carried by commercial Internet service providers (many of whom had begun as NSFNET mid-level networks). The NAP concept was crucial to the development of the Internet, not only because it provided peering points for network providers, but also because it provided the capability for further evolution of the infrastructure by allowing piecemeal technology upgrades.

22.2.2 Gigabit Testbeds Program

As noted in the NSFNET discussion earlier, the evolution of the Internet through the late 1980s revealed that commercial Internet technology was not quite keeping up with the demand for higher-capacity networks. Further, because of the technical limitations of and cost factors related to higher-capacity networks, the development of distributed systems applications was somewhat stifled.

Shortly after leaving DARPA, Bob Kahn had formed the Corporation for National Research Initiatives (CNRI) to attempt to coordinate large-scale infrastructure projects with cooperation between the public and private sectors. Watching the growth of the Internet in terms of participation and capacity demand, coupled with rapid increases in computer processor speeds and memory sizes, Kahn proposed a major program in high-speed networks and applications that was eventually funded jointly by DARPA and NSF. The initiative would attempt to answer two questions: How would a gigabit-per-second network be architected? And what would its utility be to end users?

After a national solicitation yielded nearly 100 proposals, CNRI, DARPA, and NSF began to organize testbeds based on a number of factors, including synergy between research projects and whether the research required a high-performance testbed. With substantial cost sharing and cooperation from a number of regional and long-distance telecommunications carriers (MCI, AT&T, BellSouth, USWest, PacBell, Bell Atlantic), five testbeds were formed in 1990 (see Figure 22.3). Roughly a year later the first testbed, a metropolitan area ATM testbed in North Carolina, was operational. The remaining four testbeds became operational over the next 18 months.

FIGURE 22.3 The five gigabit testbeds coordinated by CNRI with funding and support from DARPA, NSF, and industry (AURORA, BLANCA, CASA, NECTAR, and VISTANET), shown with MAGIC, a gigabit testbed funded separately by DARPA.

It is instructive to examine the state of technology and the prevailing questions that existed at the outset of the testbed initiative. In the late 1980s, “broadband” meant 1.5 Mb/s (T1), and high-performance networks running at 45 Mb/s were just beyond reach. ATM was debated widely, with some saying it was ideal for integrating video, data, and voice, and others saying that its 53-byte cells would produce too much overhead (in headers and in segmentation/reassembly) to support high speeds. Protocol processing was also an issue, with claims that TCP overhead was too high to support gigabit-per-second rates and that lightweight protocols, or perhaps protocol processing in hardware, would be needed.
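
The overhead objection is easy to quantify. Every 53-byte ATM cell spends 5 bytes on its header, leaving 48 bytes of payload, and a packet that does not fill its final cell wastes padding on top of that. The example packet size below is arbitrary.

    import math

    CELL, HEADER = 53, 5    # standard ATM cell: 5-byte header, 48-byte payload
    PAYLOAD = CELL - HEADER

    print(f"fixed header overhead: {HEADER / CELL:.1%}")    # about 9.4%

    packet = 1500           # e.g., one Ethernet-sized IP packet
    cells = math.ceil(packet / PAYLOAD)                     # 32 cells
    on_wire = cells * CELL                                  # 1,696 bytes
    print(f"{packet}-byte packet: {cells} cells, "
          f"{(on_wire - packet) / on_wire:.1%} overhead")   # about 11.6%

The other half of the objection, segmentation and reassembly, is paid in processing rather than bytes: every packet must be chopped into cells on entry and reassembled on exit, at gigabit rates.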

The five testbeds supported in the CNRI gigabit-per-second testbed initiative each had a unique blend of research in applications and in networking and computer science. Below are brief overviews of these five testbeds, along with the MAGIC testbed, which was funded separately by DARPA after the first five testbed projects had been launched:

• CASA (Caltech, SDSC, LANL, JPL, MCI, USWest, Pacific Bell) focused primarily on distributed supercomputing applications, attempting to achieve “superlinear speedup” by strategically mapping application components to the supercomputers best suited to the computation. The CASA network was constructed using HIPPI switches interconnected by HIPPI-over-SONET at OC-12 (622 Mb/s).

• BLANCA (NCSA, University of Illinois, UC-Berkeley, AT&T, University of Wisconsin) applications included virtual environments, remote visualization and steering of computation, and multimedia digital libraries. BLANCA network research included distributed virtual memory, realtime protocols, congestion control, and signaling protocols, using experimental ATM switches from AT&T Bell Laboratories running over 622 Mb/s and 45 Mb/s circuits provided by the AT&T Bell Laboratories Experimental University Network (XUNET) project.

• The VISTANET testbed (MCNC, UNC, BellSouth) supported the development of a radiation treatment planning application that allowed medical personnel to plan radiation beam orientation using a supercomputer and a visualization application, extending the planning process from two beams in two dimensions to multiple beams in three dimensions. Using an ATM network at OC-12 (622 Mb/s) interconnecting HIPPI local area networks, the VISTANET application linked a graphics workstation at the UNC Medical Center with special-purpose graphics hardware at UNC’s Computer Science Department across campus and a supercomputer several miles away at MCNC.

• NECTAR (CMU, Bell Atlantic, Bellcore, PSC) was a metropolitan area testbed with OC-48 (2.4 Gb/s) links between the PSC supercomputer facility just outside of Pittsburgh at Westinghouse and the downtown campus of Carnegie Mellon University. The primary application work involved coupling supercomputers running chemical reaction dynamics, and the computer science research included both distributed software environments and the development of HIPPI-ATM-SONET conversion devices.

• AURORA (MIT, IBM, Bellcore, Penn, MCI) research focused primarily on network and computer science issues. An OC-12 (622 Mb/s) network interconnected the four research sites and supported the development of ATM host interfaces, ATM switches, and network protocols. AURORA research included telerobotics, distributed virtual memory, and operating system issues such as reducing the overhead in network protocol implementations.

• The MAGIC testbed [540, 476] (U.S. Army Battle Laboratory, Sprint, University of Kansas, Army High Performance Computing Center, University of Minnesota, Lawrence Berkeley Laboratory) was funded separately by DARPA after the CNRI initiative had already begun. MAGIC used an OC-12 (622 Mb/s) network to interconnect ATM-attached hosts, developing remote vehicle control applications as well as high-speed access to databases for terrain visualization and battle simulation.

The gigabit testbeds initiative set out to achieve nearly a 700-fold increase in bandwidth relative to the 1.5 Mb/s circuits typically in use on the Internet in 1989. While local area HIPPI networks were supporting peak data transfers between supercomputers at between 100 and 400 Mb/s, typical high-end workstations were capable of under 25 Mb/s. Even so, applications themselves rarely saw more than 25% of these peak numbers. Applications researchers in the testbeds were for the most part hoping to achieve 300–400 Mb/s of actual throughput, a roughly 200-fold increase in performance. The design point in 1990, then, was for applications to target capabilities that might be supported in the 1992–93 timeframe. In the end, transmission rates of 622 Mb/s were supported, and memory-to-memory throughput between computers was demonstrated at 300–400 Mb/s. Thus, technically there were demonstrations showing success in reaching the original goals.

At the same time, the fact that the testbeds combined research at multiple layers—from hardware to network protocols to middleware to applications—brought significant challenges. Research at any particular layer generally requires the layers below to be predictable, if not stable. For some of the testbeds, by the time a facility was presented to an application developer (much less an end user), the constraints in terms of availability and stability were quite restrictive. This situation, unfortunately, reduced the number of end users (as opposed to software developers) who participated in the testbeds.

The selection of end points for the testbeds was partially constrained by the fact that some locations simply did not have the fiber infrastructure in place to support deployment of a gigabit testbed. This meant that at some testbed sites there existed a critical mass of computer science researchers and/or application developers, but no high-performance computing resources with which to construct high-end applications. The applications seen on the testbeds reflected not only the strengths and interests of the participating sites but also the constraints of the resources available. For example, the BLANCA testbed applications focused on remote visualization and control of supercomputer applications, but work in distributed applications between supercomputers was limited by the fact that there were supercomputers at only one site on the testbed.

The testbed initiative did, in fact, provide a wealth of answers to the original two questions asked: What alternatives are there for architecting a gigabit network, and what utility would it provide to end users? The testbeds also delivered what they proposed. However, the fact that they remained small “islands” of infrastructure prevented a large number of users from joining in. Part of the lesson here relates to the interdependency between end-user applications and technology development. Technology development is largely justified by the hope for improved and/or new application capabilities for end users. However, in order for end users to fully benefit from technology development testbeds, the testbeds must result in infrastructure upon which applications can support the work of those end users. In the case of the gigabit testbeds, the finite lifetime of the testbeds, without any follow-on infrastructure, prevented scientific use of the facilities and capabilities that were developed. It was not until two years later, when the vBNS was deployed (discussed below), that the testbed applications work was resumed in wide area networks.

A number of important technology developments came from the testbeds initiative. Several testbeds (CASA, BLANCA) demonstrated wide area heterogeneous supercomputer applications, achieving between 300 and 600 Mb/s. The participation of multiple carriers in providing testbed switching and transmission facilities resulted in the first multivendor SONET interoperation, and AT&T’s deployment of optical amplifiers in BLANCA was the first such deployment on a service basis. The AURORA testbed produced the first demonstration of striping data over multiple OC-3 channels and the first ATM host interfaces for workstations operating above OC-3 speeds.

Despite heated debates early in the project about the feasibility of using ATM in a high-performance network, several of the participating telecommunications carriers had begun to deploy ATM in their commercial networks by the end of the project. While many of the carriers initially asked why 45 Mb/s would not be sufficient for the foreseeable future, several had directly participated in successful demonstrations of OC-3 (155 Mb/s), OC-12 (622 Mb/s), and OC-48 (2.4 Gb/s) by the conclusion of the project. Thus, technology deployment within the telecommunications industry was notably accelerated by the gigabit testbeds initiative as well.

22.2.3 Other Testbeds

The important networking testbeds are numerous enough to fill an entire chapter in and of themselves. For example, the multiagency ATDnet (Advanced Technology Development Network, also occasionally known as the Washington Area Bitway or WABitway), a 2.4 Gb/s SONET/ATM testbed in the Washington, D.C., area, has directly influenced the network architecture of vast sectors of the military and of the government in general. Today ATDnet is part of a larger effort that includes satellite technology from NASA’s Advanced Communications Technology Satellite (ACTS) network, commercial ATM technology from Sprint’s Interim Defense Research and Engineering Network (I-DREN), and advanced network research on the CAIRN (Collaborative Advanced Internet Research Network) network (see below). Efforts among universities and high-tech companies in the San Francisco area, such as BAGNET (Bay Area Gigabit Network) and NTON (National Transparent Optical Network), have had similar effects in industry and academia.

In Canada, CANet was one of the first continental-scale ATM networks, running at 45 Mb/s to 155 Mb/s and demonstrating interoperability of multiple commercial ATM switches interconnecting dozens of research and commercial laboratories as well as advanced regional networks such as WURCNET in Alberta. Its follow-on network, CANet*2, expands the number of connected sites and increases transmission speeds, in some cases to 622 Mb/s. A similar testbed in the United Kingdom, SuperJANET, interconnects universities at 140 Mb/s.

Finally, network testbeds have been formed in industry as well. The ARIES network, aimed at using ATM technology to address the needs of the petroleum industry, combined technologies ranging from T1 satellite connections to 155 Mb/s terrestrial networks. ARIES involved participants from the petroleum industry (Geko Prakla, AMOCO, and others), telecommunications (Sprint), government (the NASA ACTS testbed), and academia (the Minnesota Supercomputer Center). Its goal was to demonstrate the use of ATM network technology in a heterogeneous (satellite, terrestrial, various bandwidths) network environment to support applications such as the processing of seismic data collected on ships and transmitted directly to supercomputers.

22.3 SYSTEM TESTBEDS

Testbeds have been used in the development of hardware, software, and systems beyond the networking examples given previously. Indeed, the evolution of computing has a rich history in what many would call testbeds, although there is not sufficient space to report on them here.

Two testbed efforts during the past several years (now concluded) illustrate the advantages of using existing network technology to support innovative applications. These are significantly different from network testbeds, whose primary focus is on improving the underlying networking and software technology. The first effort, the I-WAY project, attempted to exploit the multiple ATM testbeds in place in the United States and Canada in 1995 to support high-performance applications in science and engineering [155]. The second testbed, ARIES, was aimed at exploiting existing ATM services from telecommunications carriers and the NASA ACTS satellite to support applications important to the oil industry.

Here we examine the I-WAY because of its multinational scope, involving dozens of organizations.

I-WAY: The Importance of Middleware

Seeking to exploit the soon-to-be-deployed 155 Mb/s NSF vBNS testbed as well as DOE’s and NASA’s OC-3 networking infrastructure, organizers for IEEE Supercomputing ’95 released a call for proposals in the fall of 1994. The goal of this solicitation was to find teams of developers and researchers who would demonstrate innovative scientific applications, given the availability of computing resources at dozens of laboratories, interconnected with broadband national networks, and accessible from high-end graphics workstations and virtual reality environments to be made available at Supercomputing ’95. A select jury of leaders from universities, corporations, and government reviewed the more than 60 proposals received, selecting roughly 40 for support.

Featured Applications

I-WAY applications were classified into five general categories: distributed supercomputing, remote visualization and virtual environments, collaborative environments (particularly those using virtual reality technology and techniques; see Chapter 6), distributed supercomputing coupled with collaborative environments, and video [155]. The applications teams represented over 50 research institutions, laboratories, companies, and federal agencies. Several example applications will illustrate the various application types.

An NSF-funded Grand Challenge team working on cosmology coupled multiple supercomputers to compute an n-body galaxy simulation, displaying the results in the CAVE at Supercomputing ’95. The code was a message-passing code, coupling supercomputers from Cray, SGI, Thinking Machines, and IBM [423].

A team from Argonne National Laboratory, working with a commercial firm, Nalco/Fueltech, demonstrated a teleimmersive collaborative environment for the design of emission control systems for industrial incinerators [158]. This application coupled a supercomputer in Chicago with CAVEs in San Diego and in Washington, D.C.

The University of Wisconsin Space Science and Engineering Center’s Vis5D software was adapted to the CAVE to support a simulation of the Chesapeake Bay ecosystem [567], allowing researchers at Supercomputing ’95 to explore the virtual Chesapeake Bay while interacting with a running simulation on a Thinking Machines CM-5 at NCSA in Illinois. Another group used a network-enabled version of Vis5D to explore remote climate modeling datasets [267].

MCI and equipment supplier Netstar experimented with video and quality of service using the vBNS. The experiment demonstrated the use of priority queuing in routers to improve the quality of video streams in the presence of congestion in the network.

Another key application area covered by I-WAY was remote control and visualization of experiments using network-attached instruments (see Chapter 4), including the use of immersive virtual environments and voice control. Instrument output was sent to supercomputers for near-realtime conversion to three-dimensional imagery displayed in the CAVE. This particular set of applications emphasized both high bandwidth and bounded delay, the latter due to human factors in virtual environments [456]. A group from the Aerospace Corporation demonstrated a system that acquired networked computing resources to process data downloaded from a meteorological satellite and then made the enhanced data available in real time to meteorologists at the conference [332].

I-WAY Human and Technology Infrastructure

Staff from Argonne National Laboratory, the University of Illinois–Chicago Electronic Visualization Laboratory, and the National Center for Supercomputing Applications (NCSA) formed a leadership team to coordinate I-WAY. This coordination entailed working out details regarding the network connections, soliciting laboratories and computing centers to volunteer their resources to the effort, and developing and deploying the software infrastructure necessary to support the application teams.

The network infrastructure for I-WAY required working with multiple agencies and telecommunications carriers to connect multiple networks (including vBNS, AAI, ESnet, ATDnet, CalREN, NREN, MREN, MAGIC, and CASA) and to install DS3 and OC-3 connections into the Supercomputing ’95 show floor network.

Multiple equipment vendors volunteered equipment to be used for demonstrations, ranging from high-end graphics workstations to a fully immersive virtual environment CAVE. Each participating computing center worked with the I-WAY team to provide resource allocations for application teams and to deploy a standard workstation system at their sites, called an I-WAY Point of Presence (IPOP) [200]. The IPOP was used for authentication of distributed applications, for distribution of associated libraries and other software, and for monitoring the connectivity of the I-WAY virtual network.

A scheduling system was developed and deployed on the IPOP systems, and the scheduler software was ported to each type of computing resource by staff at the participating centers. Applications could use the IPOP-based software infrastructure, which provided single authentication and job submission across multiple sites, or they could work directly with the end resources.

For most of 1995, development teams worked on the I-WAY software, the network deployment plans, and the logistics of making resources available at several dozen computing centers. During the course of the year, many centers withdrew because of an inability to dedicate staff to port software or integrate their systems with the I-WAY software “cloud.” During the four months prior to the Supercomputing ’95 demonstrations, teams of staff at the participating sites worked with applications teams to prepare their applications for demonstration over the I-WAY. These teams debugged the applications, tuned them to take into account the longer delays in the wide area networks, and in many instances ported the user interfaces into the CAVE environment using the associated libraries.

The I-WAY project also used software from the gigabit testbeds, as well as the expertise of a number of researchers who had participated in them. Vis5D, developed on the BLANCA gigabit testbed by the University of Wisconsin, was adapted to several environmental applications, and BLANCA’s Data Transfer Mechanism (DTM) communications library provided a communications API for the CAVE. Many of the principles learned during CASA’s experiments in coupling multiple supercomputers were also employed to hide latency for distributed supercomputing applications in I-WAY.

22.4 THE LANDSCAPE IN 1998

Network testbeds have become components of broader “infrastructure” testbeds that attempt to deliver total solutions. Within the United States, new efforts such as the NSF Partnerships for Advanced Computational Infrastructure (PACI), the DOE Accelerated Strategic Computing Initiative (ASCI) and DOE2000 programs, the NASA Information Power Grid (IPG) initiative, and the Globus project are aimed at computational science and engineering while riding on top of networks and network testbeds such as vBNS, ESnet, NREN, and AAInet. Even network testbeds such as CAIRN rely partly on infrastructure from other ATM/SONET testbeds. European ACTS projects use networks such as SuperJANET and other European testbeds, as well as complex concatenations of networks in Europe, across the Atlantic, and into the vBNS via CANARIE and the STAR-TAP. Several important network testbeds are worth examining today as well, including the vBNS, AAInet, CAIRN, SuperJANET, and CANet*2. While there is not sufficient space to cover all of the major efforts, we examine a number of representative ones.

22.4.1 ACTS ATM Internet (AAInet)

One of the most aggressive ATM testbeds under way in the United States is ARPA’s ACTS ATM Internetwork (AAInet), a testbed interconnecting NASA’s ACTS satellite network; the Washington, D.C., area ATDnet; MAGIC; and the Defense Research and Engineering Network (DREN). AAInet is addressing network signaling (e.g., virtual circuit setup between switches), congestion control, multicast, and interoperability with non-ATM networks. Constructed of several separate and autonomous networks, AAInet is an ideal testbed not only for these technical issues but also for issues of scale and interoperability among multiple equipment vendors and multiple separately managed networks.

22.4.2 DARTnet/CAIRN

The CNRI-coordinated gigabit testbeds were not the only network testbed activities taking place during the late 1980s and early 1990s. At lower speed (T1), DARPA was also funding DARTnet, whose research contributed to many of today’s capabilities in multimedia and multicast protocols. CAIRN (Collaborative Advanced Internet Research Network) is the present-day DARPA-funded network testbed follow-on to DARTnet. Today, the DARTnet-II network interconnects 18 sites at T1 and is a subset of the CAIRN infrastructure, which adds 45 Mb/s and 155 Mb/s ATM links to several of the sites. Unlike many network testbeds that limit the scope of network research in favor of stability for applications projects, the sole purpose of the DARTnet/CAIRN infrastructure is to provide a “breakable” testbed for network research.

DARTnet/CAIRN’s comprehensive network research agenda has resulted in major contributions to Internet capabilities, including integrated service models, multicast routing protocols, QoS schemes (e.g., RSVP), network time protocols (including the use of Global Positioning System clocks for unidirectional delay measurement), IP security, and practical experience with IPv6.
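
The GPS clock item deserves a word of explanation, since it is less obvious than the others. With GPS-disciplined clocks, sender and receiver share a common timebase, so one-way delay can be measured directly; without them, only the round trip is measurable, and the one-way figure must be approximated. A minimal sketch, assuming timestamps in seconds from synchronized clocks:

    def one_way_delay(departure_ts, arrival_ts):
        """Direct measurement: both clocks GPS-synchronized."""
        return arrival_ts - departure_ts

    def one_way_estimate(send_ts, reply_ts):
        """Fallback with unsynchronized clocks: assume a symmetric path."""
        return (reply_ts - send_ts) / 2

The halved round trip hides any asymmetry between the forward and reverse paths, which is one reason unidirectional measurement matters for network research.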

22.4.3 NSF PACI and vBNS

As a follow-on to the supercomputer centers program that began in 1985 and initiated the NSFNET backbone project, NSF created the PACI program to fund several large-scale infrastructure development efforts to “prototype the 21st century computational environment.” Two consortia were funded in 1997: the National Computational Science Alliance (NCSA Alliance), centered at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign, and the National Partnership for Advanced Computational Infrastructure (NPACI), centered at the University of California–San Diego and the San Diego Supercomputer Center (SDSC). In addition to over 120 principal investigators at roughly 80 universities in 27 states, the two PACI consortia involve partnerships with other agency laboratories. Argonne National Laboratory, for example, is a partner in the NCSA effort, while Lawrence Berkeley Laboratory is involved in the NPACI team.

The PACI program, just under way as of late 1997, relies heavily on the use of testbeds at multiple levels. Systems and software from enabling technology teams will be stress-tested by application technology teams in order to determine their usefulness as well as their suitability for general infrastructure. Many of these testbeds will run on the vBNS network, a cooperative program between NSF and MCI. As of late 1997 the vBNS backbone was running at OC-12 (622 Mb/s), with full OC-12 connectivity between SDSC and NCSA, and was interconnecting several dozen locations at speeds ranging from 45 Mb/s to 155 Mb/s. By the end of 1998 the NSF expects over 100 locations to be connected to the vBNS, including most of the PACI consortium members and all of the major resource centers in the consortia.

The participants in the PACI consortia will take part in a complex, multilayer set of testbeds (e.g., see [469, 515, 525]). Application technology teams cover more than a dozen fields, from astronomy to biology to nanotechnology, and involve teams of 6 to 12 principal investigators at as many institutions. These applications teams provide driving applications to multiple teams of computer scientists and engineers (the enabling technology teams) in order to influence the development of underlying infrastructure capabilities. The two consortia, NPACI and NCSA, are complementary in terms of application areas as well as computer science and engineering. For example, while the NPACI-led consortium has a strong concentration of work aimed at data-intensive computing, the NCSA-led effort emphasizes visual supercomputing and teleimmersion. Both centers provide a complementary suite of high-performance computing platforms, giving the community of more than 6,000 users across the country the ability to select the environment that is best suited to their work.

22.4.4 Globus

In 1998 there are many other testbeds, ranging in scale from half a dozen institutions to multitestbed initiatives. The Globus project (see Chapter 11), a DOE- and DARPA-funded effort based at Argonne National Laboratory and the Information Sciences Institute (ISI) at the University of Southern California, involves many of the same institutions participating in the NSF PACI program. Globus is in many ways an outgrowth of work done in scheduling, security, and other distributed systems areas during the I-WAY project in 1995 [155, 200]. Globus components are rapidly becoming some of the first pieces of infrastructure within the wide area prototype activities of the PACI program (see Plate 17), partly because of the Globus emphasis on interoperability across a variety of underlying component systems.

22.4.5 NLANR

The National Laboratory for Applied Network Research (NLANR), a distributed laboratory with staff at several NSF-sponsored supercomputer centers, involves a variety of testbed support efforts as well as research efforts. For example, NLANR uses the vBNS as a national “backplane” to interconnect an experimental Web caching system aimed at improving the performance of the Web while reducing Internet load. NLANR comprises three complementary functions: a distributed applications support center (at NCSA), a network measurement research and tools development effort (at UCSD), and a network engineering resource center (at CMU). (More information on NLANR can be found at www.nlanr.net.)

22.4.6 ACTS

The Advanced Communications Technologies and Services (ACTS) program in Europe is one of the world’s most comprehensive and broad “testbeds of testbeds,” involving universities as well as government laboratories and private corporations, under various funding arrangements with the European Union. Several dozen consortia, each involving as many as 20 participating organizations, have been formed since 1995 to address a variety of communications, media, and computational developments. For example, the Distributed Virtual Prototype project involves both government laboratories and universities in multiple European Community nations. Private corporations such as Caterpillar and its subsidiary operation in Belgium are developing techniques for collaborative engineering over transatlantic ATM networks interconnecting virtual environments facilities in the United States (NCSA) and Germany (GMD).

22.5 TESTBEDS FOR THE FUTURE: CHALLENGES AND OPPORTUNITIES

The preceding sections show clearly that testbeds can be a mechanism for technology development, technology transfer, and community building, and that their results can later develop into mainstream information technology. At the time, however, the outcomes and transfer paths were rarely as clear as they appear in hindsight. In the dynamic world of information technology, the accelerating pace of industry and users opens the potential for a new generation of testbeds that will be critical to building the computational grids of the future.

22.5.1 Evolution and Revolution

Ironically, the testbeds cited here have proven that evolution is one of the most important ingredients in producing revolutionary results in computing and information technology. Successfully moving from research to usable prototype to a viable, industrially supported base is often stimulated by carefully chosen testbeds. The families of computing, networking, and information technology that have had such revolutionary effects on our lives, economies, and jobs evolved over long periods of time [411]. In addition, testbeds focused on the revolutionary technologies of their day (such as all-optical networking) eventually evolved into supporting broad-based classes of applications. Thus, technology push and applications pull often uniquely meet at the testbed.

Computational grids will require a unique sensitivity to this experience. On the one hand, revolutionary software and middleware are needed to make it all happen. On the other hand, very evolutionary policies and procedures will affect the nature and growth of grid testbeds. Since these testbeds will initially be perceived as competing directly with the production facilities of many resource providers, careful attention to scale and risk will be needed to ensure successful evolution.

One challenge facing the creation and support of testbeds will be to achieve concurrent support, on as much of the same infrastructure as possible, for production network capabilities for applications as well as for network research and bleeding-edge applications. If achieved, such coexistence will contain costs, make the best use of knowledgeable personnel, avoid unnecessary duplicative infrastructure, and provide the means for applications to migrate easily between production and testbed modes with minor effort.

22.5.2 Getting Real Users Involved

A major advantage of testbeds is that they get real users involved in applying advanced technologies to their problems early in the development cycle. In addition, this involvement provides a real focus for the technologists in the application of their technology. This marriage of user and technology is often the key to success and can become a test of whether new technology holds potential value for an even wider class of users. The vision demonstrated in the I-WAY experiments is a precursor of what a computational grid could become.

As the national computational grid is built, testbeds of real users emerging from the application teams are expected to form successively linked networks of communities tied together with computational resources. Success or refinements in any given testbed will lead to the definition of the next wave of testbeds, which can bring new users, technologies, and communities together in productive ways. An important challenge for these computational grid testbeds will be to productively construct new multidisciplinary user communities, which in turn may routinely have the ability to discover new applications (see Chapters 3 through 6).

22.5.3 Funding and Organization

Sustained funding for a series of testbeds is essential to realize their potential and, even more important, to attract the high-quality users, experiments, and researchers needed to drive the testbeds to meet their aggressive goals. Fortunately, government agencies have initiated several programs (NSF PACI, DOE2000, NASA IPG, etc.), but to fully realize the potential of computational grids, existing and new programs will have to become grid-enabled and participate in the grid’s evolution.

This situation could lead to political, organizational, and technical fragmentation unless approached from a systems perspective. The testbeds developed could be applied across the various agencies, bringing new communities into place and delivering both capabilities and additional resources to the grid.

Long-term funding ensures that applications researchers and developers will have something to use at the end of the technology development stages. Similarly, technologists will have the opportunity to see how their systems behave under persistent use over time. Persistence is key to being able to provide the technology to the applications that helped develop it.

22.6 CONCLUSIONS

The understanding, organization, and experience developed in both large and small testbeds have been key contributors to today’s core networking, software, and computing infrastructure. The successful construction of the national-scale grid environments envisioned in this book will require careful choices as we select the technologies and communities that will form the testbeds for tomorrow. In making these choices, we must seek to balance the potentially conflicting requirements of technologists and users, while mitigating risk and leaving room to exploit new opportunities. Significant commitments of time, money, and talent will be required to complement the ongoing evolution of technology and applications. However, history and our view of the future suggest that these investments will be very worthwhile.

FURTHER READING

For more information on the topics covered in this chapter, see www.mkp.com/grids and also the following references:

• Books by Hafner and Lyon [256] and Lynch and Rose [354] discuss the history of the Internet and Internet technologies, respectively.

• A paper by Catlett [108] contains a comprehensive discussion of applications investigated and developed on the gigabit testbeds.

• A special issue of IEEE Annals of the History of Computing on “Time-Sharing and Interactive Computing at MIT” [334] describes the development of interactive timeshared computing during the 1960s.

• A book by Kaufmann and Smarr [304] provides a brief history of supercomputing as well as a rich discussion of supercomputing applications.
