Opus: an Overlay Peer Utility Service (Duke University, vahdat/ps/openarch02.pdf)


Opus: an Overlay Peer Utility Service

Rebecca Braynard, Dejan Kostić, Adolfo Rodriguez, Jeff Chase and Amin Vahdat*

Department of Computer Science, Duke University

{rebecca,dkostic,razor,chase,vahdat}@cs.duke.edu

Abstract

Today, an increasing number of important network services, such as content distribution, replicated services, and storage systems, are deploying overlays across multiple Internet sites to deliver better performance, reliability and adaptability. Currently, however, such network services must individually reimplement substantially similar functionality. For example, applications must configure the overlay to meet their specific demands for scale, service quality and reliability. Further, they must dynamically map data and functions onto network resources (including servers, storage, and network paths) to adapt to changes in load or network conditions.

In this paper, we present Opus, a large-scale overlay utility service that provides a common platform and the necessary abstractions for simultaneously hosting multiple distributed applications. In our utility model, wide-area resource mapping is guided by an application's specification of performance and availability targets. Opus then allocates available nodes to meet the requirements of competing applications based on dynamically changing system characteristics. Specifically, we describe issues and initial results associated with: i) developing a general architecture that enables a broad range of applications to push their functionality across the network, ii) constructing overlays that match both the performance and reliability characteristics of individual applications and scale to thousands of participating nodes, iii) using Service Level Agreements to dynamically allocate utility resources among competing applications, and iv) developing decentralized techniques for tracking global system characteristics through the use of hierarchy, aggregation, and approximation.

1 Introduction

A key insight from the Active Networking approach is that distributed services can benefit tremendously from pushing their functionality to intermediate points in the network.

*This research is supported in part by the National Science Foundation (EIA-9972879, ITR-0082912), Hewlett-Packard, IBM, Intel, and Microsoft. Braynard is supported by an NSF graduate fellowship and Vahdat is also supported by an NSF CAREER award (CCR-9984328).

The extreme stance of associating code with every packet that travels across the network has not seen wide deployment. However, instances of Active Networking abound in the current Internet architecture in the form of L4/L7 switches, overlay networks, transparent proxy caches, firewalls, and network address translation. Most of these network services provide point solutions to point problems. Overlay networks [3, 18, 20], however, are emerging as a fundamental technique for enhancing wide-area service scalability, performance, and availability. Consider the following applications:

• Replicated Services: To deliver target levels of performance and availability, application developers are increasingly replicating their services at multiple wide-area sites. Of course, replication is not a panacea. Important questions here include where to place replicas, how many replicas are required, how to route client requests to the proper replica, and how to propagate updates to maintain consistency across replicas.

• Application Layer Multicast (ALM): Despite over a decade of research, native IP multicast has not enjoyed widespread deployment. However, multicast-style applications, such as stock quotes, event notification, audio, and video, continue to grow in popularity. Thus, many industrial and research efforts are investigating and deploying ALM, where nodes across the Internet act as intermediate routers to efficiently distribute data along a pre-defined mesh or tree. Initial evaluations [18, 20] indicate that ALM can perform within a small factor of the cost and latency of native IP multicast by routing through strategic intermediate points in the network and by building an overlay that matches the characteristics of the underlying network.


• QoS Guarantees: Finally, more traditional unicast applications are also using distributed network resources to achieve better performance and reliability than would be delivered by the underlying IP network. One initial study [31] used network traces to determine that, in many cases, default IP routing results in paths with inferior reliability and performance relative to indirect routing through one of a set of intermediate nodes. Recent work on Resilient Overlay Networks [3] quantifies such improvements in a 13-node testbed running across the Internet. Another recent study [32] advocates using multiple intermediate points in an overlay to redundantly transmit the same data from source to destination, reducing end-to-end loss rates and performance variability.

Today, all of these services must redundantly acquire and administer nodes across the Internet to provide the requisite functionality. This approach forces services to reimplement substantially similar functionality, such as tracking changing network characteristics, building appropriate topologies, detecting failures, or matching the underlying IP topology. Further, given the highly bursty nature of Internet traffic and service access patterns, individual services must overprovision for peak levels of demand that are often a factor of 3-10 higher than the average case.

We believe that a common system infrastructure to support the requirements of a broad range of applications will improve the performance of existing applications by exporting common best practices, and will also effect a qualitative shift in the ease with which novel distributed applications can be deployed. Thus, we propose Opus, an Overlay Peer Utility Service, to automatically configure server network overlays with the goal of dynamically meeting the performance and availability requirements of a broad range of competing applications. By observing changing network conditions and application access patterns, Opus will: i) allocate a portion of global resources to each application, ii) place application replicas at appropriate points in the network, and iii) create overlays that satisfy the requirements of individual distributed applications. Key to our approach is building scalable structures to track changing system characteristics and developing a common abstraction for prioritizing among competing applications.

The rest of this paper is organized as follows. Section 2 describes the Opus system architecture. Next, Section 3 presents individual challenges we are addressing to realize this model, along with some initial results. Section 4 compares our work to related efforts, and Section 5 presents our conclusions.

2 Architecture

The Opus service allocates overlays of server and network resources from a shared pool, as needed to meet service quality goals efficiently in the face of dynamically changing global characteristics. Service workloads are streams of requests originating from clients throughout the network and requiring varying amounts of computation, shared data access, and data transfer to and from each client. We assume that service applications show stable average per-request behavior; loads are defined by offered request rates that may vary continuously through time.

Figure 1 depicts the high-level Opus system model. We envision a collection of server sites (e.g., small clusters or data centers) colocated with switching centers at the interior of the Internet "cloud." Opus manages these Points-of-Presence (PoPs) in a coordinated fashion as a shared physical infrastructure for distributed Internet applications. Applications consist of components running on selected Opus nodes. Opus configures these nodes to run application software and organizes them as an application-specific overlay network.

Opus resource allocators cooperate to assign resources to each overlay application. These resources include "slices" of the server and network capacity on some subset of the Opus PoPs (the pie charts in Figure 1 represent per-region application demand levels that ideally correspond to resource allocation levels in nearby Opus PoPs), together with an overlay topology configured for that application. Our approach is to describe applications abstractly in terms of their service quality goals, then generate candidate allotments and overlay topologies that balance service quality with network performance and cost. A request routing infrastructure directs external traffic (e.g., client requests) destined for each application service to selected nodes assigned to that application. Request routing may leverage DNS redirection [10], anycast [5, 21, 38], or an Opus naming interface.

The service overlay. Each Opus PoP runs an instance of the Opus site manager, which coordinates resource usage at that site and exchanges status summaries with other Opus sites. Opus uses its own overlay services internally to disseminate status information and related metadata through a service overlay that interconnects all active nodes. The service overlay forms the "backbone" for coordinated, decentralized resource allocation and resource control, as described below. Thus the service overlay must be dynamic and self-healing: if a network path is lost or degraded, the service overlay must reconfigure to reroute traffic through a different path.
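This self-healing behavior can be illustrated with a toy rerouting routine (our own sketch, not Opus code): when a monitored overlay link is reported failed or degraded, paths are simply recomputed over the surviving links.

```python
import heapq

def shortest_path(adj, src, dst):
    """Dijkstra over a weighted adjacency dict; returns the node list
    of the least-delay path from src to dst, or None if unreachable."""
    dist, parent = {src: 0.0}, {src: None}
    pq = [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if u == dst:
            break
        if d > dist[u]:
            continue  # stale queue entry
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v], parent[v] = d + w, u
                heapq.heappush(pq, (d + w, v))
    if dst not in parent:
        return None
    path, v = [], dst
    while v is not None:
        path.append(v)
        v = parent[v]
    return path[::-1]

def reroute(overlay, src, dst, failed_links):
    """Drop the failed links (a set of frozenset node pairs) and
    recompute the route over what remains of the service overlay."""
    pruned = {u: [(v, w) for v, w in nbrs
                  if frozenset((u, v)) not in failed_links]
              for u, nbrs in overlay.items()}
    return shortest_path(pruned, src, dst)
```

In a real deployment the site managers would trigger this from their status summaries; here the failure report is just an argument.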

Page 3: Opus: an Overlay Peer Utility Service - Duke Universityvahdat/ps/openarch02.pdf · data centers) colocated with switching centers at the inte-rior of the Internet “cloud.” Opus

[Figure: overlay nodes at Opus PoPs host multiple application overlays; pie charts show application demand per network region.]

Figure 1: Opus system model.

Scalability is a key concern in the design of the service overlay, as we expect Opus to scale to 10,000 or more nodes across the wide area. We take a decentralized approach in which local site managers "think globally but act locally," making local resource allocation choices to converge on desirable global outcomes based on information disseminated through the service overlay. This status information includes summaries of resource availability, network conditions, load, and delivered performance at each site. The schemes for configuring overlays also require estimates of link capacities, delays, and failure probabilities for the physical network interconnecting the Opus sites. A key premise of the Opus architecture is that effective resource allocation and control requires only approximate information about global conditions. Section 3.3 presents some results from our initial approach to scalable dissemination of system metadata through the service overlay, called dicast.

Adaptive per-application overlays. A primary task of the service overlay is to assist in the construction and maintenance of application overlays. Individual applications use these overlays to route internal application traffic, disseminate content, and/or synchronize their state information efficiently. For example, a video delivery system would use its overlay to disseminate its content to participating sites, which in turn transmit the data to end clients. A replicated database system would use its overlay to maintain replica consistency by propagating updates among active replica sites.

At the core of the Opus architecture are algorithms to select the number and placement of site locations for each application, allocate global resource shares, and configure overlays to link the selected sites. These inter-related aspects of the overlay construction problem interact in a complex way to determine end-to-end application behavior. As an example, consider an Internet service using dynamic replication for scalability and availability. For a replicated service, the need to propagate updates across replicas imposes new network load and may compromise availability. Previous work in the TACT project has shown that the availability of a replica configuration depends on the application's consistency demands, as well as the number and placement of replicas and the reliability of their interconnections [39]. While adding server sites can improve performance and availability, more is not necessarily better: we find that in some cases additional sites can actually reduce overall performance and availability, depending on application consistency requirements. In fact, a small set of well-placed and well-provisioned replica sites generally outperforms a larger set of poorly-placed replicas. The techniques developed for TACT allow the Opus allocators to predict performance and quantify availability as a function of the candidate overlay characteristics and the application's consistency targets.

After instantiating an overlay for an application, the Opus resource allocators dynamically adapt the overlay topology and site allotments to respond to observed load and network conditions. For example, if many accesses are observed for an application in a given network region, the system may reallocate additional resources at a PoP close to that location, possibly adding a new site presence to the application overlay. The system continuously monitors local and global conditions through the service overlay, and uses feedback control as a basis for incremental, adaptive resource provisioning. In addition to enabling dynamic adaptation, the feedback loop enables the system to continuously refine resource allotments, so that the quality of an initial solution is less critical.
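As a simplified illustration of such a feedback loop (our own sketch; the paper does not specify the controller at this level), a damped proportional step can nudge per-PoP allotments toward observed regional demand each monitoring epoch:

```python
def rebalance(alloc, demand, total, gain=0.5):
    """One feedback step: move each PoP's allotment part of the way
    toward its demand-proportional share of total capacity. A gain
    below 1 damps each correction so that noisy demand estimates do
    not make the allocation oscillate."""
    demand_sum = sum(demand.values())
    return {pop: alloc[pop] + gain * (total * demand[pop] / demand_sum - alloc[pop])
            for pop in alloc}

# Iterating the step converges to the demand-proportional allocation,
# so a poor initial split is gradually repaired (PoP names hypothetical).
alloc = {"us-east": 50.0, "eu-west": 50.0}
demand = {"us-east": 80.0, "eu-west": 20.0}
for _ in range(20):
    alloc = rebalance(alloc, demand, total=100.0)
```

The same structure accommodates reassigning a PoP to a different application when its share decays toward zero.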

Resource allocation and service quality. Opus strives for resource allotments that are both effective and efficient. An effective allotment meets service quality goals; an efficient allotment conserves resources. One approach is to strive for least-cost allotments that satisfy fixed application service quality bounds under existing traffic conditions. We take a more general approach to enable the system to prioritize applications under resource constraints. Although we expect that the utility is adequately provisioned and employs admission control to avoid overcommitting its resources, the Internet environment is adversarial, and large-scale services should design in "fallback" positions for extreme scenarios involving site or link failures, flash crowds, or attacks on the system or its physical infrastructure. For this reason, our approach emphasizes dynamic tradeoffs of service quality and cost. This enables the system to match resource demand with dynamically varying levels of resource supply, in order to maximize the global good under the full range of conditions and constraints that it might encounter in operation. Indeed, a key benefit of the utility approach is that it can reallocate shared infrastructure to respond to adverse conditions. Such reallocation may take place based on economic considerations (e.g., who is willing to pay the most?) or based on relative application priority (e.g., which services must absolutely stay up and running during a denial of service attack?).

A major technical challenge for flexible resource allocation in an overlay utility service is to generate candidate overlay configurations with varying tradeoffs of cost and service quality (benefit). Section 3.1 presents two overlay structures we are investigating to support this objective. Dynamic cost/benefit tradeoffs also depend on models to predict and quantify the benefit of each candidate configuration along multiple dimensions of service quality. The models must consider non-traditional quality measures such as availability or consistency, as well as more traditional performance measures such as response time, fairness, or throughput. The units used to quantify different dimensions of service quality and cost are arbitrary: the system may scale these measures arbitrarily before comparing or combining them to balance competing objectives.

Given measures for service quality and cost for candidate overlay configurations, the system needs flexible criteria to establish customer priority. Our initial approach is to present service quality goals to the resource allocators as value-based Service Level Agreements (SLAs) representing a continuum of service quality tradeoffs. Opus SLAs specify service quality goals as continuous utility functions giving the value associated with each level of service volume and quality for each customer; these utility functions may represent arbitrary criteria for establishing customer priority. In our previous work on server provisioning in individual data centers, we found that generalized utility functions are a flexible means to guide dynamic tradeoffs of service quality and cost [8]. Modest constraints on the form of the utility functions enable the resource allocators to identify utility-maximizing allocations efficiently, and to refine them incrementally in a feedback loop. Section 3.2 outlines our approach in more detail.

Security and isolation. Security is an important consideration for any general-purpose utility. Opus allocates resources to applications at the granularity of individual nodes, eliminating a subset of the security and isolation issues associated with simultaneously hosting multiple applications. In the future, we plan to investigate the use of virtual machine technology to isolate services running on the same physical host [37]. On the network side, we must still isolate traffic on the wire from different applications running at the same Opus site. VLANs, supported in most modern switches, provide such functionality, and it should be straightforward to automate the requisite isolation in response to dynamic reallocation of local site resources. Finally, policy-based sharing of physical resources depends on accurate measures of application resource demand. In some cases it may be useful for the customer itself to provide the load and QoS measures. If so, Opus relies on simple economics to encourage customers to deploy efficient software and accurately represent their resource needs for a given demand level: customers conserve resources when they are asked to pay for their usage.

3 System Components

This section describes four of the principal challenges that we must address to deploy a general-purpose, large-scale service utility: constructing overlays, allocating resources, propagating status, and improving reliability through multi-path routing. A common theme running across all system components is that local decisions must be made to approximate the global good based on partial and uncertain information.

3.1 Overlay Topology Construction

As discussed above, Opus must build and maintain two separate types of overlays. The service overlay maintains and distributes overall service metadata among participating sites. The service overlay also facilitates the construction of smaller-scale application overlays designed to meet the performance and reliability requirements of a broad range of network services. A number of efforts have investigated techniques for building proper overlay topologies to match a particular application's requirements [3, 18, 20]. However, current techniques require global knowledge and do not scale beyond a few tens of nodes. We must devise solutions that scale to thousands of nodes for application overlays and tens of thousands of nodes for the service overlay.

Our initial work focuses on developing a general overlay topology that enables dynamic tradeoffs between network performance/reliability and cost. Note that the cost of an overlay link can be assigned arbitrarily, but is likely to depend on the costs of the individual physical links that compose the overlay link. This cost may reflect current congestion levels on a link, the price paid to an ISP to use a link, etc. The actual assignment of costs to individual links is beyond the scope of this paper, though we do assume that no individual Opus node is aware of this global cost metric and that the metric changes dynamically over time.

One of the key initial goals in our work is to build application overlays that enable flexible and dynamic tradeoffs between overlay cost (logically, a measure of the total network resources consumed in transmitting data across the overlay) and the associated performance and reliability characteristics of the overlay. To quantify the benefits of competing structures, we need a set of metrics to compare the quality of candidate overlay topologies. Initially, we focus on network cost and relative delay penalty (RDP) to characterize overlay topologies. RDP measures the relative increase in delay incurred from using a particular overlay compared to direct transmission in the underlying IP network. Network cost is the sum of all the link weights associated with a given overlay topology.
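Both metrics are easy to state precisely. A minimal sketch (our own code, assuming per-link delay estimates for the underlying IP network are available) computes the network cost of a candidate overlay and its worst-case RDP over a set of node pairs:

```python
import heapq

def dijkstra(adj, src):
    """Shortest-path delays from src over a weighted adjacency dict."""
    dist, pq = {src: 0.0}, [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in adj.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(pq, (d + w, v))
    return dist

def network_cost(overlay_edges):
    """Network cost: sum of the overlay's link weights."""
    return sum(w for _, _, w in overlay_edges)

def max_rdp(overlay_edges, ip_adj, pairs):
    """Worst-case RDP: overlay delay divided by direct IP delay,
    maximized over the given (source, destination) pairs."""
    ov_adj = {}
    for u, v, w in overlay_edges:
        ov_adj.setdefault(u, []).append((v, w))
        ov_adj.setdefault(v, []).append((u, w))
    worst = 0.0
    for s, t in pairs:
        worst = max(worst, dijkstra(ov_adj, s)[t] / dijkstra(ip_adj, s)[t])
    return worst
```

Here each overlay link's weight is taken to be the IP-level delay of the path it traverses, matching the later experiments that equate edge cost with delay.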

We have identified two candidate overlay topologies that enable such flexible tradeoffs [23]. A k-spanner [7] ensures that all paths in an overlay have an RDP no worse than k. Lower values of k result in a higher cost for building the overlay. Because k-spanners attempt to guarantee low-latency paths between all pairs of hosts, they are more appropriate for multi-sender applications. A second structure, a LAST [22] (lightweight approximate shortest-path tree), enables similar tradeoffs for single-sender applications. With a LAST, a configuration parameter, α, bounds the RDP of all paths from a designated source: every destination is reached along a path with an RDP no worse than α. For instance, a LAST with α = 1.5 ensures that all destinations receive data with delay at most 50% higher than transmission through IP.
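For illustration, a centralized LAST construction in the spirit of [22] (our own sketch; the scalable, distributed variant is the subject of ongoing work discussed below) starts from a minimum spanning tree for low cost, walks it depth-first, and grafts in the shortest path from the source whenever a node's accumulated tree delay would exceed α times its direct delay:

```python
import heapq

def dijkstra(adj, src):
    """Shortest-path delays and parent pointers from src."""
    dist, parent, pq = {src: 0.0}, {src: None}, [(0.0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue
        for v, w in adj[u]:
            if d + w < dist.get(v, float("inf")):
                dist[v], parent[v] = d + w, u
                heapq.heappush(pq, (d + w, v))
    return dist, parent

def prim_mst(adj, root):
    """Minimum spanning tree as an adjacency dict, grown from root."""
    mst, seen = {u: [] for u in adj}, {root}
    pq = [(w, root, v) for v, w in adj[root]]
    heapq.heapify(pq)
    while pq:
        w, u, v = heapq.heappop(pq)
        if v in seen:
            continue
        seen.add(v)
        mst[u].append((v, w))
        mst[v].append((u, w))
        for x, wx in adj[v]:
            if x not in seen:
                heapq.heappush(pq, (wx, v, x))
    return mst

def last_tree(adj, root, alpha):
    """LAST sketch: every node's tree delay from root stays within
    alpha times its shortest-path delay, at near-MST cost."""
    spt_dist, spt_parent = dijkstra(adj, root)
    weight = {frozenset((u, v)): w for u in adj for v, w in adj[u]}
    mst = prim_mst(adj, root)
    tour = []                      # Euler tour: each MST edge walked down and back
    def dfs(u, p):
        tour.append(u)
        for v, _ in mst[u]:
            if v != p:
                dfs(v, u)
                tour.append(u)
    dfs(root, None)
    d = {v: float("inf") for v in adj}
    d[root], parent = 0.0, {root: None}
    for u, v in zip(tour, tour[1:]):
        if d[u] + weight[frozenset((u, v))] < d[v]:   # relax the tour edge
            d[v] = d[u] + weight[frozenset((u, v))]
            parent[v] = u
        if d[v] > alpha * spt_dist[v]:                # bound violated: graft SPT path
            path, x = [], v
            while x is not None:
                path.append(x)
                x = spt_parent[x]
            path.reverse()
            for a, b in zip(path, path[1:]):          # relax along root -> v
                if d[a] + weight[frozenset((a, b))] < d[b]:
                    d[b] = d[a] + weight[frozenset((a, b))]
                    parent[b] = a
    return parent
```

Setting α near 1 grafts in most of the shortest-path tree (high cost, best delay); a large α leaves the MST intact (low cost, unbounded RDP), matching the tradeoff evaluated next.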

We now present the results of some initial experiments to quantify the benefits of k-spanners and LASTs. The principal goal here is to enable Opus to use overlay-specific tuning parameters to match application requirements. For example, Opus can adapt to changing conditions by turning a knob (such as the k or α value) to reallocate resources and adjust the balance of cost and performance. To quantify the benefits of dynamically trading network cost for performance in overlays, we ran simulations of both k-spanners and LASTs. For our experiments, we constructed a 200-node overlay randomly distributed among a 600-node GT-ITM generated topology [6]. Edge delay was assigned based on default GT-ITM parameters. For these experiments, we equate edge cost with delay, though we are currently investigating techniques to allow simultaneous, bi-criteria network optimization [25]. In the case of a LAST, Figure 2(a) shows how the α parameter affects the cost of the resulting overlay, relative to both a shortest path tree (RDP = 1.0) and a minimum cost spanning tree (with an unbounded RDP). At α = 1, the overlay cost is high, comparable to a shortest path tree. However, as demonstrated in Figure 2(b), this same point corresponds to the best performance (comparable to routing packets along shortest paths in the underlying network). As α increases, the network cost of the LAST overlay decreases, eventually matching the cost of a minimum cost spanning tree at α = 3. Of course, Figure 2(b) also shows that such a low-cost overlay also results in relatively poor performance. One nice quality of the tradeoffs expressed above is that it is possible to build distribution structures that balance cost and RDP. For example, with α = 1.5, we are able to obtain a cost within 15% of an MST and an RDP within 15% of an SPT for our target topology and edge weights. This result shows promise for our ability to build overlays that match application requirements with relatively low cost overhead (for all but the most demanding applications).

A key next challenge is to develop scalable distributed algorithms for building and maintaining k-spanners and LASTs. To support our goal of scalability, we must avoid the necessity of global knowledge, excessive network probing, and distributed locking to build and maintain such topologies. Our approach is to use probabilistic techniques and hierarchy to selectively probe the characteristics of various network regions. Key to our approach is having each node gradually migrate to its (approximately) "proper" location in the overlay. This is relatively straightforward assuming the presence of global group membership and pairwise probing; however, these require unscalable O(n²) memory and network overhead, respectively. Recent proposals in peer-to-peer networking address scalability concerns by building randomized overlays [28, 30, 33] requiring only O(lg n) per-node state. In contrast, our goal is to investigate the practicality of constructing overlays with specific performance characteristics using partial, approximate and probabilistic knowledge of network information.


[Figure 2 shows two simulation plots against the tunable LAST parameter α (x-axis, 1 to 3): (a) network cost of the LAST, falling from near the SPT's cost toward the MST's cost as α grows; (b) root RDP of the LAST, rising from the SPT's RDP of 1.0 toward the MST's RDP as α grows.]

Figure 2: Dynamically trading network cost for relative delay penalty using a lightweight approximate shortest-path tree (LAST).

3.2 Resource Allocation

One of the key components of Opus is resource allocation among competing applications. This principally requires determining the relative priority of competing applications, and the proportion of global resources that should be allocated to each based on current system conditions. We will use SLAs as the basis for economic prioritization, building on our initial success with using an economic model for prioritization and resource allocation in a cluster setting [8].

The basic resource mapping challenge is to establish a matrix of allotments from j system resources across i customers (applications). The system resources include servers in the Opus PoPs and the network links interconnecting them. The system strives to balance the service quality of the selected allotments with their costs. In Opus, our challenge is to allocate these resources in a decentralized manner, based on partial information about resource supply and demand collected through the service overlay.

Opus uses a generalized measure of benefit or utility as a basis for flexible SLAs representing dynamic tradeoffs between service quality and value. Customers are associated with utility functions specifying the value of any given level of service volume and service quality predicted to result from a candidate allotment. Opus makes resource allocation decisions by comparing the expected utility of a set of candidate configurations, with the goal of maximizing global utility. The system uses models to predict the effects of candidate resource allotments on service quality, then evaluates the SLA functions to determine the expected value of the predicted behavior. Informally, the domains of these composite functions are continuous measures of the cost of resources assigned, e.g., the aggregate amount of server resources assigned to the application at the Opus PoPs, or the network cost of a LAST tree with a given α parameter. The units of value are arbitrary, as long as the system can combine values assigned to multiple measures of service quality, and compare the total values of candidate configurations to determine which of the alternatives is preferable.

The resulting optimization problems fall into a classical economic framework for resource allocation. Computing optimal resource allocations from sets of utility functions and service quality estimates is a linearly constrained non-linear optimization problem. To make the problem tractable, we constrain the composite utility functions to be concave. This means that the marginal benefit of assigning additional resources, e.g., servers or network links, to a configuration declines steadily and approaches zero: adding resources beyond some point does not result in meaningful improvement of service quality. More formally, the utility gradient (the derivative if the function is differentiable over a real-valued domain) is non-negative and monotonically non-increasing. This reduces the optimization problem to a simple convex programming problem with an efficient solution based on gradient climbing [19]. If there are sufficient resources to avoid starving any customer, then there exists a unique optimal solution with the property that the marginal utility of an additional resource unit is in equilibrium across all customers. This equilibrium marginal utility is equivalent to the equilibrium price that matches resource demand with available supply in an economic market for allocating resources.

[Figure 3 plots throughput on the y-axis against allocated resources on the x-axis for two applications, App 1 and App 2, with the gradient of each curve marked.]

Figure 3: Example of gradient climbing to determine resource allocation.

A simple example helps to illustrate this point. Suppose that an Opus system hosts two application services with a constant level of offered load. Figure 3 shows concave curves that qualitatively represent the throughput of the two hypothetical applications as a function of the resources allocated to each. If the SLA functions for these customers define value as linear with delivered throughput, and they have equal priority (their utility functions have the same slope), then Opus will seek a resource allotment that maximizes global throughput. Note that while we use throughput in this simple example, the y-axis could just as easily represent availability, reliability, latency, or some other service quality metric, e.g., a composite metric representing expected customer revenue.

The curves show that adding resources significantly improves throughput when allotments are low and the customers are starved. However, as more resources are added, the marginal gain in throughput declines and approaches zero (trivially, for an offered load of 100 small file requests per second, changing the allocation from 10 to 11 machines is not likely to measurably improve throughput). The marginal benefit of an additional resource unit can be measured by the first derivative or gradient of each application's "utility curve." In this example, the gradients at particular points on the x-axis represent the current allocation of resources to each application and the expected benefit of allocating an additional unit of resource to application 1 versus application 2. Here, application 1 would enjoy a greater estimated boost in throughput from an additional unit of resource because it has a larger gradient than application 2. Opus gives preference to application 1 until its marginal gain equalizes.
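To make the gradient-climbing intuition concrete, the following Python sketch greedily hands each resource unit to whichever application currently shows the larger marginal utility. The two logarithmic "utility curves" are illustrative stand-ins, not measured Opus data, and the unit-at-a-time loop is a simplification of the incremental algorithm cited in the text.

```python
import math

# Hypothetical concave utility curves: throughput as a function of
# allocated resource units (shapes chosen for illustration only).
def app1_utility(units):
    return 100 * math.log1p(units)   # steeper curve

def app2_utility(units):
    return 60 * math.log1p(units)    # shallower curve

def marginal_gain(curve, units):
    """Discrete gradient: the value of one additional resource unit."""
    return curve(units + 1) - curve(units)

def gradient_climb(curves, total_units):
    """Greedy gradient climbing: repeatedly assign the next unit to the
    application with the largest marginal utility. For concave curves,
    this converges to the equilibrium allocation described in the text,
    where marginal utilities are (approximately) equal."""
    alloc = [0] * len(curves)
    for _ in range(total_units):
        gains = [marginal_gain(c, a) for c, a in zip(curves, alloc)]
        alloc[gains.index(max(gains))] += 1
    return alloc

a1, a2 = gradient_climb([app1_utility, app2_utility], 20)
# The steeper curve (application 1) receives more units, and the final
# marginal gains of the two applications end up close together.
```

With equal-priority SLAs that are linear in throughput, maximizing the summed utility here is exactly the global-throughput objective of the example above.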

Thus, the Opus resource allocators strive to maximize global value across all applications. In the general case, the SLA functions may specify utility as a combination of service quality metrics in a "common currency" of value. The utility functions may also incorporate priority by valuing service quality for some applications higher than others. For example, the value metrics would prioritize, say, dissemination of tactical information over distribution of training videos, enabling the system to provision resources rationally if faced with an unexpected crisis and resource shortage. The value of the allotments changes dynamically with changing conditions and offered load. Our challenge then is to estimate the changing shape and gradient of these curves to respond to dynamic changes, based on partial knowledge propagated through the service overlay.

Overall, the concavity constraint allows the system to adjust equilibrium allotments incrementally to adapt to changing conditions. The system continuously monitors load and resource status, and propagates status information through the service overlay. This status information constitutes a feedback signal to trigger adaptive resource reallocation. Rather than computing a new allocation from scratch, the system responds to changes by incrementally adapting an existing configuration to restore equilibrium. This can be done using an efficient greedy algorithm whose cost scales with the magnitude of shifts in load or resource availability from one interval to the next [8].

Economic resource allocation scales naturally using a decentralized federation of autonomous local "markets" exchanging information to converge toward a global equilibrium. Our initial design centers around a hierarchical structure to aggregate related resources into cells capable of planning their internal allocations locally. A cell might be an entire Opus PoP or a portion of a large PoP, e.g., an array of generic servers sharing a redirecting switch node in close proximity. Cells cooperate to trade load or resources in order to balance resource usage across the system. To derive the magnitude of resource shifts, cells exchange information about the supply of and demand for resources in each cell. This can be captured compactly as the marginal utility gained by adding resources to that cell or shifting load away from that cell to free up resources for some other use; this marginal utility is equivalent to the "price" for resources in that cell. We believe that this cellular structure is the key to scalable resource provisioning in large data centers and networks of server sites.

A key tenet of this work is that service quality must be measured in an application-specific manner. Thus, one important question involves incorporating multiple dimensions of service quality, including reliability, performance, and data consistency, into a single utility function. One option is to define a unified performability measure incorporating all aspects of service quality, with a single utility function for each customer. An alternative is to define each dimension of service quality as a separate utility function, and represent tradeoffs by optimizing the sum of the individual value measures.

[Figure 4 depicts leaf nodes grouped into clusters within an overlay network, with an elected cluster agent representing each cluster at the next level of the hierarchy.]

Figure 4: Hierarchical data dissemination in dicast.

3.3 Scalable Tracking of System Characteristics

As discussed above, a primary challenge to building and maintaining large-scale utilities involves maintaining distributed state about global system characteristics. Consider the requirements of the following Opus tasks. First, for request routing, clients spread across the network must choose the replica most likely to deliver the best performance, reliability, security, etc. To achieve such functionality, the request routing infrastructure must track dynamically changing replica characteristics, for instance, available bandwidth and load information. Second, building and maintaining overlays requires probing the network characteristics among all participating replicas. In general, nodes "near" one another in the underlying topology (i.e., displaying strong pair-wise performance and reliability characteristics) should peer together in the overlay. Finally, the system must track dynamic group membership information to retire nodes that fail or fall behind long-lasting network partitions. Thus, an Opus node requires an abstraction to communicate its local state and local observations (e.g., network probes) to other system nodes. Similarly, Opus nodes must receive updates about global system characteristics from remote sites. In a large-scale utility, it is impractical to maintain accurate global system characteristics. Our challenge then is to balance communication costs with data accuracy as a function of system size and global characteristics.

To develop a communication abstraction able to scale to large numbers of nodes, we draw inspiration from Internet routing protocols [16, 26, 29], perhaps the best example of distributed protocols that scale to global proportions. The fundamental lesson we draw is that aggregation, hierarchy, and approximation are fundamental to wide-area scalability. We apply these design ideas to a generic communication library within Opus, called dicast, designed to distribute approximate data in a scalable fashion. Thus, not all updates originating at a given node will be (or even need be) delivered to all participants. Further, individual updates may be aggregated together to increasing degrees as data moves through the network.

The use of aggregation in dicast naturally leads to the construction of a tree-based structure, as depicted in Figure 4. Nodes are partitioned into clusters of size d, where d determines the height of the tree (for n nodes, a cluster size of d implies a tree height of approximately lg_d n). Each cluster elects an agent, a speaker responsible for disseminating local cluster information to the rest of the dicast tree. Agents from d adjacent clusters form second-level clusters. This process is repeated until an h-th level cluster is formed, where h is the height of the tree. Note that all physical nodes in the dicast tree are at the leaves (first-level clusters) and intermediate nodes in the tree are elected members from the leaf set who serve multiple responsibilities. Deriving good performance from such an approach requires assigning nodes to clusters with other topologically "nearby" nodes. We plan to leverage existing work on clustering [24] to aid in this process where possible.
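The clustering just described can be sketched in a few lines of Python. The recursive partitioning and agent election below are a simplified illustration (the first member of each cluster becomes its agent, and topology-aware placement is ignored), not the Opus implementation.

```python
import math

def dicast_height(n, d):
    """Number of clustering levels for n leaf nodes and cluster size d;
    approximately lg_d(n), as described in the text."""
    height, clusters = 0, n
    while clusters > 1:
        clusters = math.ceil(clusters / d)
        height += 1
    return height

def build_hierarchy(nodes, d):
    """Partition nodes into clusters of size d, elect the first member
    of each cluster as its agent, and recurse over the agents until a
    single top-level cluster remains. Returns one list of clusters per
    tree level, leaves first."""
    levels, members = [], list(nodes)
    while True:
        clusters = [members[i:i + d] for i in range(0, len(members), d)]
        levels.append(clusters)
        if len(clusters) == 1:
            return levels
        members = [cluster[0] for cluster in clusters]  # elected agents

levels = build_hierarchy(range(16), d=4)
# 16 nodes with d = 4: four leaf clusters, then one cluster of agents,
# matching a height of lg_4(16) = 2.
```

Because every agent also remains a member of its leaf cluster, data aggregated at higher levels can flow back down through the same membership, as the text notes below.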

In dicast, data travels up the tree, potentially being aggregated with data from other dicast nodes. At each level of the tree, an overlay (as discussed in Section 3.1 above) propagates the data among all participating cluster members. Associated with each level of the tree is a target (application-specific) level of accuracy for either aggregated or individual node information. Once a particular update reaches a level of the tree where aggregate accuracy requirements are not violated, it is buffered awaiting the arrival of further updates that will eventually force the propagation of an aggregate update to nodes higher up the tree. As data spreads to higher-level clusters, it is in turn transmitted back toward the leaves because each agent is a member of at least two adjacent levels in the tree.
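One way to realize these per-level accuracy targets is a buffer that suppresses upward propagation until the aggregate drifts too far from the last value reported. The sketch below assumes a mean aggregate and an absolute-error accuracy target; both are illustrative choices, not dicast's actual policy.

```python
class AccuracyBuffer:
    """Illustrative sketch of dicast-style update suppression: an agent
    forwards an aggregate upward only when it drifts from the last
    value it propagated by more than its level's accuracy target."""

    def __init__(self, accuracy_target):
        self.accuracy_target = accuracy_target
        self.last_propagated = None
        self.pending = []

    def observe(self, value):
        """Buffer a new observation; return the aggregate to propagate
        upward, or None if buffering still meets the accuracy target."""
        self.pending.append(value)
        aggregate = sum(self.pending) / len(self.pending)  # mean aggregate
        if (self.last_propagated is None or
                abs(aggregate - self.last_propagated) > self.accuracy_target):
            self.last_propagated = aggregate
            self.pending = [aggregate]
            return aggregate
        return None

buf = AccuracyBuffer(accuracy_target=5.0)
first = buf.observe(100.0)   # first observation always propagates
small = buf.observe(102.0)   # within target: buffered, returns None
```

Looser targets at higher levels of the tree would suppress more updates, trading accuracy for communication cost exactly as the text describes.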

One example use of dicast is to propagate per-region resource consumption information to influence local resource allocation decisions. Thus, a local node may have exact information about per-application resource consumption for "nearby" nodes (in the same cluster), but only aggregate (and somewhat inaccurate) information about resource consumption in remote clusters. However, such approximate and aggregate data is likely to be sufficient to set local allocation levels to meet global allocation targets. Similarly, information on per-cluster load imbalances may be used to make a decision to reallocate a given replica from one application to another to better meet target SLAs or to maximize global system throughput.

3.4 Reliability QoS Guarantees

For many emerging Internet services, reliability and availability are more important metrics than raw service performance. There are a number of potential definitions for service availability; we define availability to be the percentage of requests that can be satisfied within individual client performance requirements. Many existing metrics consider a service "available" if it is currently satisfying client requests, reducing availability to a simple measure of uptime, or the amount of time without hardware/software failures. In our approach, availability is measured by integrating across all client requests, with those requests that return too slowly (e.g., based on an expected distribution or even on per-client performance expectations) marked as "unavailable."

In the context of a replicated utility, an individual hosted service may be considered unavailable for a number of reasons, including failures in the request routing infrastructure or in network links, or, in our more general model, because insufficient resources were allocated to meet target performance characteristics. One approach we are pursuing for addressing failures at the network level, called restricted flooding, is to build overlay topologies that redundantly transmit the same data over multiple logical paths [32]. However, we use a variant of anti-entropy [34] to minimize the overhead associated with redundant transmission for certain application classes. Here, particular overlay nodes may choose to forward an application-layer frame redundantly along multiple paths to a single destination, especially if any given path does not meet aggregate reliability requirements. As the data travels toward its destination, certain downstream nodes may receive multiple copies of the same frame (as identified by a unique identifier). In this case, the downstream node will re-evaluate the estimated reliability of the remainder of the path and suppress duplicate frames if reliability targets are likely to be achieved with the propagation of a single frame. This manner of restricted flooding provides two principal advantages. First, restricted flooding means that the overlay does not necessarily have to prevent "loops," simplifying overlay construction. Second, multiple independent paths to the destination mean that individual delays, failures, or packet drops will not necessarily prevent the timely delivery of data given available redundancy in the distribution graph.

[Figure 5 shows a source S reaching a destination D over two disjoint paths through intermediate nodes A and B that merge at a join node J; per-link reliabilities are S-A = .96, A-J = .97, S-B = .98, B-J = .97, and J-D = .99, with a target end-to-end reliability of .98.]

Figure 5: Using Restricted Flooding to control cost versus reliability tradeoffs.

A primary challenge in developing such an approach is ensuring that the overlay topology matches the failure characteristics of the underlying network. For instance, if separate logical links in an overlay correspond to a common failure-prone link in the underlying physical network, a single failed physical link can result in failures in multiple logical overlay links. Thus, it is important to construct overlays with disjoint paths, where the failure correlation among logical overlay links is low. We determine the loss correlation among multiple potential links by collecting statistical information about loss correlations and by using network topology information where available. Our use of multiple redundant paths enables immediate failover rather than relying on the underlying network to converge to new routes in the face of failure. Further, we hope that the combination of restricted flooding and careful construction of overlay topologies will result in only minimal traffic overhead relative to single-path routing.

Consider the simple example depicted in Figure 5, where a source S wishes to transmit data to a destination D with an end-to-end reliability of 98% and where all links are disjoint. Omitting the details of the simple calculations, transmitting the data through either A or B toward D results in reliability of 93.1%. However, transmitting data through both A and B means that at least one copy of the data arrives at the join point, J, 99.6% of the time, with two copies of the data arriving with an 88.5% probability. When node J forwards one copy of the data (suppressing the second should it arrive later), the resulting end-to-end reliability is 98.7%, which meets the target yield. Forwarding both copies results in 99.8% reliability.
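The figure's arithmetic can be checked directly. In the sketch below, the per-link reliabilities (S-A = .96, A-J = .97, S-B = .98, B-J = .97, J-D = .99) are read off Figure 5, and link failures are assumed to be independent.

```python
def series(*links):
    """Reliability of links in series: every link must succeed."""
    p = 1.0
    for link in links:
        p *= link
    return p

def at_least_one(p1, p2):
    """Probability that at least one of two independent paths succeeds."""
    return 1.0 - (1.0 - p1) * (1.0 - p2)

via_a = series(0.96, 0.97)                  # copy reaches J through A
via_b = series(0.98, 0.97)                  # copy reaches J through B
one_copy_at_j = at_least_one(via_a, via_b)  # ~99.6% of the time
two_copies_at_j = via_a * via_b             # ~88.5% of the time
end_to_end = one_copy_at_j * 0.99           # J forwards one copy: ~98.7%
```

The 98.7% figure exceeds the 98% target with a single frame crossing edge JD, which is exactly the case where J can safely suppress the duplicate.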

The goal of our work in restricted flooding is to provide each node with enough information to determine how many simultaneous routes to maintain for a given communication stream to achieve a given level of reliability. Intermediate nodes must then determine whether it is feasible to suppress subsequent transmission of the same data and still maintain target reliability. In this example, restricted flooding must determine whether an approximately 88% increase in the utilization of the overlay edge JD for this particular communication stream is worth the potential 1.1% improvement in end-to-end reliability. Of course, this evaluation must be made in response to changing network conditions and application demands.

Finally, in Section 3.1, we discussed techniques for allowing application developers to dynamically trade "cost" for performance. Our approach to providing high reliability through redundant transmission and disjoint paths adds another dimension to this tradeoff: it allows applications to specify both performance and reliability targets. Opus then strives to build the lowest-cost (or lowest-overhead) overlay that meets the specified goals.

4 Related Work

Our work on Opus is inspired by related efforts in a number of different fields. Research into Active Networks [1, 17, 27, 36] proposes moving computation into the network at a per-packet level. We view our utility model as a logical culmination of the Active Network philosophy. That is, overlays push application-level functionality to specific intermediate nodes in the network. However, the granularity of computation in overlays is coarser grained than in Active Networks, operating on application-layer frames [9] rather than individual network packets. In designing the abstractions for our utility environment, we will build on the work already performed in the context of Active Networks.

Work on Active Services [2] investigates a similar intermediate point of pushing application functionality into the network. Relative to this effort, we focus on the wide-area issues associated with simultaneously deploying and allocating resources among competing applications in a scalable utility. Where possible, we intend to leverage the set of abstractions developed for active services running within a cluster environment (analogous to our individual Opus sites).

A number of efforts are investigating a utility model for wide-area computing. Akamai [10] hosts a large number of servers across the Internet. Globus [13] and Legion [14] investigate resource allocation in the context of a wide-area computational grid. WebOS [35] investigates system support for wide-area services. Within a single machine room, Cluster Reserves enforces a global allocation of resources among multiple "resource principals" [4]. Relative to these efforts, our goal is to simultaneously investigate issues of resource allocation, replica placement, and overlay construction based on an economic model to determine per-application priority levels. We believe that our work in Opus will be complementary to these existing efforts.

Our utility model investigates techniques for allocating network resources to competing applications. We leverage overlay networks both to track the characteristics of the utility as a whole and to propagate updates among individual application nodes. The idea of an overlay network is not new, having been leveraged to ease the deployment of both multicast in the MBone [12] and IPv6 in the 6bone [15]. Until recently, overlays were viewed as a transition technology. However, recent academic and commercial efforts advocate the use of overlays as a fundamental approach both for deploying new network functionality (e.g., multicast [18, 20]) and for improving the performance and availability of existing applications (e.g., improved application-layer routing [3, 31, 32]). Relative to existing approaches, our work is a general utility infrastructure that allocates nodes among competing applications. Further, we investigate fundamental techniques for scaling overlay networks to thousands of nodes and for designing, implementing, and evaluating distributed algorithms for building and maintaining overlays capable of matching application performance and availability requirements.

Recently, there has been tremendous interest in scalable peer-to-peer lookup services [11, 28, 30, 33]. At a high level, these systems hash an object name to a key within some address space and randomly assign cooperating peers responsibility for some region of this address space. An end client wishing to look up a particular object performs the hash and uses the lookup infrastructure to route its request to the appropriate peer in O(lg n) application-level hops. The system is scalable in that peers maintain no more than O(lg n) state in facilitating this lookup. These elegant designs provide significant scalability benefits at the cost of loss of control over exactly how nodes are interconnected, the cost of the resulting overlays, etc. Our work on resource allocation and managing inexact information across the wide area is orthogonal to these efforts. However, one explicit goal of this work is to determine the relative performance benefits and computational/communication costs of explicit versus implicit overlay construction and maintenance in large-scale distributed systems.


5 Conclusions

This paper presents a novel model for wide-area computing in which a collection of server sites distributed across the Internet simultaneously supports the requirements of a broad range of decentralized Internet applications. Rather than forcing individual applications to reimplement significant functionality and to redundantly administer distributed service resources, an overlay peer utility service, Opus, dynamically allocates resources among competing applications. This paper describes our approach to realizing this vision and some of the specific research issues we are addressing. In particular, we present: i) the system architecture and abstractions necessary for diverse applications to push functionality to intermediate nodes, ii) models for resource allocation and replica placement for competing applications based on dynamically changing system characteristics, iii) techniques for constructing dynamic per-application scalable overlays that both match application performance/availability requirements and make efficient use of underlying network resources, and iv) decentralized and scalable techniques for tracking global system characteristics through aggressive use of hierarchy, aggregation, and approximation.

References

[1] D. Scott Alexander, William A. Arbaugh, Michael W. Hicks, Pankaj Kakkar, Angelos D. Keromytis, Jonathan T. Moore, Carl A. Gunter, Scott M. Nettles, and Jonathan M. Smith. The SwitchWare Active Network Architecture. IEEE Network, 12(3):29–36, May/June 1998.

[2] Elan Amir, Steven McCanne, and Randy Katz. An Active Service Framework and its Application to Real-Time Multimedia Transcoding. In Proceedings of SIGCOMM, September 1998.

[3] David G. Andersen, Hari Balakrishnan, M. Frans Kaashoek, and Robert Morris. Resilient Overlay Networks. In Proceedings of SOSP 2001, October 2001.

[4] Mohit Aron, Peter Druschel, and Willy Zwaenepoel. Cluster Reserves: A Mechanism for Resource Management in Cluster-based Network Servers. In Proceedings of the ACM Sigmetrics 2000 International Conference on Measurement and Modeling of Computer Systems, June 2000.

[5] S. Bhattacharjee, M. Ammar, E. Zegura, V. Sha, and Z. Fei. Application-Layer Anycasting. In Proceedings of IEEE Infocom, April 1997.

[6] Ken Calvert, Matt Doar, and Ellen W. Zegura. Modeling Internet Topology. IEEE Communications Magazine, June 1997.

[7] Barun Chandra, Gautam Das, Giri Narasimhan, and Jose Soares. New Sparseness Results on Graph Spanners. In Symposium on Computational Geometry, pages 192–201, 1992.

[8] Jeffrey S. Chase, Darrell C. Anderson, Prachi N. Thakar, Amin M. Vahdat, and Ronald P. Doyle. Managing Energy and Server Resources in Hosting Centers. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP), October 2001.

[9] David D. Clark and David L. Tennenhouse. Architectural Considerations for a New Generation of Protocols. In Proceedings of SIGCOMM, September 1990.

[10] Akamai Corporation, 1999. www.akamai.com.

[11] Frank Dabek, M. Frans Kaashoek, David Karger, Robert Morris, and Ion Stoica. Wide-area Cooperative Storage with CFS. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01), October 2001.

[12] H. Eriksson. MBone: The Multicast Backbone. Communications of the ACM, 37(8):54–60, 1994.

[13] Ian Foster and Carl Kesselman. Globus: A Metacomputing Infrastructure Toolkit. International Journal of Supercomputer Applications, 11(2):115–128, 1997.

[14] Andrew S. Grimshaw, William A. Wulf, and the Legion team. The Legion Vision of a Worldwide Virtual Computer. Communications of the ACM, 40(1), January 1997.

[15] I. Guardini, P. Fasano, and G. Girardi. IPv6 Operational Experience within the 6bone. In Proceedings of the Internet Society Conference, July 2000.

[16] Roch Guerin and Ariel Orda. QoS-based Routing in Networks with Inaccurate Information. In Proceedings of IEEE INFOCOM, 1997.

[17] Michael Hicks, Pankaj Kakkar, Jonathan T. Moore, Carl A. Gunter, and Scott Nettles. PLAN: A Packet Language for Active Networks. In Proceedings of the Third ACM SIGPLAN International Conference on Functional Programming Languages, pages 86–93, 1998.

[18] Yang-hua Chu, Sanjay Rao, and Hui Zhang. A Case for End System Multicast. In Proceedings of ACM Sigmetrics, June 2000.

[19] Toshihide Ibaraki and Naoki Katoh, editors. Resource Allocation Problems: Algorithmic Approaches. MIT Press, Cambridge, MA, 1988.

[20] John Jannotti, David K. Gifford, Kirk L. Johnson, M. Frans Kaashoek, and James W. O'Toole, Jr. Overcast: Reliable Multicasting with an Overlay Network. In Proceedings of Operating Systems Design and Implementation (OSDI), October 2000.

[21] Dina Katabi and John Wroclawski. A Framework for Scalable Global IP-Anycast. In Proceedings of SIGCOMM, August 2000.

[22] S. Khuller, B. Raghavachari, and N. Young. Balancing Minimum Spanning and Shortest Path Trees. In Proceedings of the ACM/SIAM Symposium on Discrete Algorithms, January 1993.

[23] Dejan Kostic and Amin Vahdat. Latency versus Cost Optimizations in Hierarchical Overlay Networks. Technical Report CS-2001-04, Duke University, January 2002.

[24] Balachander Krishnamurthy and Jia Wang. On Network-Aware Clustering of Web Clients. In Proceedings of ACM SIGCOMM 2000, August 2000.

[25] Adam Meyerson, Kamesh Munagala, and Serge Plotkin. Cost-Distance: Two Metric Network Design. In Proceedings of the Symposium on the Foundations of Computer Science (FOCS), November 2000.

[26] J. Moy. OSPF Version 2. RFC 2178, Internet Engineering Task Force, Network Working Group, July 1997.

[27] Erik L. Nygren, Stephen Garland, and M. Frans Kaashoek. PAN: A High-Performance Active Network Node Supporting Multiple Mobile Code Systems. In Proceedings of IEEE OpenArch 1999, March 1999.

[28] Sylvia Ratnasamy, Paul Francis, Mark Handley, Richard Karp, and Scott Shenker. A Scalable Content-Addressable Network. In Proceedings of SIGCOMM 2001, August 2001.

[29] Y. Rekhter and T. Li. A Border Gateway Protocol 4 (BGP-4). RFC 1771, Internet Engineering Task Force, Network Working Group, March 1995.

[30] Antony Rowstron and Peter Druschel. Storage Management and Caching in PAST, a Large-Scale, Persistent Peer-to-Peer Storage Utility. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01), October 2001.

[31] Stefan Savage, Thomas Anderson, Amit Aggarwal, David Becker, Neal Cardwell, Andy Collins, Eric Hoffman, John Snell, Amin Vahdat, Geoff Voelker, and John Zahorjan. Detour: A Case for Informed Internet Routing and Transport. IEEE Micro, 19(1), January 1999.

[32] Alex C. Snoeren, Kenneth Conley, and David K. Gifford. Mesh-Based Content Routing Using XML. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01), October 2001.

[33] Ion Stoica, Robert Morris, David Karger, Frans Kaashoek, and Hari Balakrishnan. Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications. In Proceedings of SIGCOMM 2001, August 2001.

[34] Douglas B. Terry, Marvin M. Theimer, Karin Petersen, Alan J. Demers, Mike J. Spreitzer, and Carl H. Hauser. Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System. In Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles, December 1995.

[35] Amin Vahdat, Thomas Anderson, Michael Dahlin, Eshwar Belani, David Culler, Paul Eastham, and Chad Yoshikawa. WebOS: Operating System Services for Wide-Area Applications. In Proceedings of the Seventh IEEE Symposium on High Performance Distributed Computing, Chicago, Illinois, July 1998.

[36] David Wetherall. Active Network Vision and Reality: Lessons from a Capsule-based System. In Proceedings of the 17th Symposium on Operating Systems Principles (SOSP), December 1999.

[37] Andrew Whitaker, Marianne Shaw, and Steven D. Gribble. Denali: Lightweight Virtual Machines for Distributed and Networked Applications. Technical Report 02-02-01, University of Washington, 2002.

[38] Chad Yoshikawa, Brent Chun, Paul Eastham, Amin Vahdat, Thomas Anderson, and David Culler. Using Smart Clients to Build Scalable Services. In Proceedings of the USENIX Technical Conference, January 1997.

[39] Haifeng Yu and Amin Vahdat. Design and Evaluation of a Continuous Consistency Model for Replicated Services. In Proceedings of Operating Systems Design and Implementation (OSDI), October 2000.