
STAR: Self-Tuning Aggregation for Scalable Monitoring

Navendu Jain, Dmitry Kit, Prince Mahajan, Praveen Yalagandula†, Mike Dahlin, and Yin Zhang

Department of Computer Sciences, University of Texas at Austin
†Hewlett-Packard Labs, Palo Alto, CA

ABSTRACT

We present STAR, a self-tuning protocol that adaptively sets numeric precision constraints to accurately and efficiently answer continuous aggregate queries over distributed data streams. Adaptivity and approximation are essential both for robustness to varying workload characteristics and for scalability to large systems. In contrast to previous studies, we treat the problem as an optimization problem whose goal is to minimize the total communication load for a multi-level aggregation tree under a fixed error budget. Our hierarchical self-tuning algorithm, STAR, computes an optimal error distribution and performs cost-benefit throttling to direct error slack to where it benefits the most. A key novel feature of STAR is that it takes into account the update rate and variance in the input data distribution in a principled manner. Our prototype implementation of STAR in a large-scale monitoring system provides (1) a new distribution mechanism that enables self-tuning error distribution and (2) an optimization to reduce communication overhead in a practical setting by carefully distributing the initial, default error budgets. Through extensive experiments performed on synthetic data simulations and a real network monitoring implementation, we show that STAR achieves significant performance benefits compared to existing approaches while still providing high accuracy and low resource utilization for dynamic workloads.

1. INTRODUCTION

This paper describes STAR, a self-tuning algorithm for adaptively setting numeric precision constraints to process continuous aggregate queries in a large-scale monitoring system.

Scalable system monitoring is a fundamental abstraction for large-scale networked systems. It serves as a basic building block for applications such as network monitoring and management [8, 18, 41], financial applications [3], resource location [20, 40], efficient multicast [37], sensor networks [20, 40], resource management [40], and bandwidth provisioning [13]. To provide a real-time view of global system state for these monitoring and control applications, the central challenge for a monitoring system is scalability: processing queries over multiple, continuous, rapid, time-varying data streams that generate updates for thousands or millions of dynamic attributes (e.g., per-flow or per-object state) spanning tens of thousands of nodes.

Recent studies [26, 28, 34, 37, 42] suggest that real-world applications often can tolerate some inaccuracy as long as the maximum error is bounded, and that small amounts of approximation error can provide substantial bandwidth reductions. However, setting a static error budget a priori is difficult when workloads are not known in advance and inefficient when workload characteristics change unpredictably over time. Therefore, for many real-world systems that require processing of long-running or continuous queries over data streams, it is important to consider adaptive approaches for query processing [2, 5, 7, 28].

A fundamental observation behind this work is that an adaptive algorithm for setting error budgets should be based on three key design principles for large-scale query processing:

• Principled Approach: First, to provide a general and flexible framework for adaptive error distribution for different workloads, we require a solution that does not depend on any a priori knowledge about the input data distribution. Rather, a self-tuning algorithm should be based on first principles and use the workload itself to guide the process of dynamically adjusting the error budgets.

• Cost-Benefit Throttling: Second, rather than continuously redistributing error budgets in pursuit of a perfect distribution, our algorithm explicitly determines when the current distribution is close enough to optimal that sending messages to redistribute allocations will likely cost more than it will ultimately save, given the measured variability of the workload. For example, in our network monitoring service for detecting elephant flows [13], we track bandwidth for tens of thousands of flows, but the vast majority of these flows are mice that produce so few updates that further redistribution of the error budgets is not useful. Avoiding fruitless optimization in such cases significantly improves scalability in systems with tens of thousands of attributes.

• Aggregation Hierarchy: Third, providing a global view of the system requires processing and aggregating data from distributed data streams in real time. In such an environment, a hierarchical, decentralized query processing infrastructure provides an attractive solution to minimize communication/computation costs for both scalability and load balancing. Further, in a hierarchical aggregation tree, the internal nodes can not only split the error budget among their children but may also retain some local error budget to prevent updates received from children from being propagated further up the tree, e.g., by merging updates that cancel each other out.

Unfortunately, existing approaches do not satisfy these requirements. On one hand, protocols such as adaptive filters [4, 28] and adaptive thresholded counts [24] can effectively reduce communication overhead given a fixed error budget for flat (2-tier) topologies, but they offer neither scalability to a large number of nodes and attributes nor the functionality of in-network aggregation. On the other hand, existing hierarchical protocols either assign a static error budget [27] that cannot adapt to changing workloads, or periodically shrink error thresholds [11] at each node to create a redistribution error budget, which incurs high load when monitoring a large number of attributes. Finally, although the previous solutions have an intuitive appeal, they lack a rigorous problem formulation, leading to solutions that are difficult to compare against the global optimization problem of minimizing the total communication overhead.

To address these challenges, STAR's design focuses on providing three key properties for adaptively setting error thresholds in hierarchical aggregation: high performance, scalability, and adaptivity.

• High Performance: To compute optimal error assignments, STAR uses an informed mathematical model to formulate an optimization problem whose goal is to minimize the global communication load in an aggregation tree given a fixed total budget. This model provides a closed-form, optimal solution for adaptive setting of error budgets using only local and aggregated information at each node in the tree. Given the optimal error budgets, STAR performs cost-benefit throttling to balance the tradeoff between the cost of redistributing error budgets and the expected benefits. Our experimental results show that self-tuning distribution of error budgets can reduce monitoring costs by up to a factor of five over previous approaches.

• Scalability: STAR builds on recent work that uses distributed hash tables (DHTs) to construct scalable, load-balanced forests of self-organizing aggregation trees [40, 14, 6, 30]. Scalability to tens of thousands of nodes and millions of attributes is achieved by mapping different attributes to different trees. In this framework of a forest of aggregation trees, STAR's self-tuning algorithm directs error slack to where it is most needed.

• Convergence: STAR computes optimal error budgets and performs cost-benefit analysis to continually adjust to dynamic workloads. But since STAR's self-tuning algorithm adapts its solution as the input workload changes, it is difficult to qualify its convergence properties. Nonetheless, under stable input data distributions, we show that STAR converges to a solution with desirable properties. Further, STAR balances the speed of adaptivity and robustness to workload fluctuations.

We study the performance of our algorithm through both synthetic data simulations and a prototype implementation on our SDIMS aggregation system [40] based on FreePastry [15]. Experience with a Distributed Heavy Hitter detection (DHH) application built on STAR illustrates how explicitly managing imprecision can qualitatively enhance a monitoring service. Our experimental results show the improved performance and scalability benefits: for the DHH application, small amounts of imprecision drastically reduce monitoring load. For example, a 10% error budget allows us to reduce network load by an order of magnitude compared to the uniform allocation policy. Further, for 90:10 skewness in attributes (i.e., 10% heavy hitters) in terms of input load, self-tuning gives more than an order of magnitude better performance than both uniform error allocation and approaches that do not perform cost-benefit throttling.

This paper makes three key contributions. First, we present STAR, the first self-tuning algorithm for scalable aggregation that computes optimal error distribution and performs cost-benefit throttling for large-scale system monitoring. Second, we provide a scalable implementation of STAR in our SDIMS monitoring system. Our implementation provides a new distribution abstraction, a dual mechanism to the traditional bottom-up tree-based aggregation, that enables self-tuning error distribution top-down along the aggregation tree to reduce communication load. Further, it provides an important optimization that reduces communication overhead in a practical setting by carefully distributing the initial, default error budgets. Third, our evaluation demonstrates that adaptive error distribution is vital for enabling scalable aggregation: a system that performs self-tuning of error budgets can significantly reduce communication overheads.

Figure 1: The aggregation tree for key 000 in an eight-node system. Also shown are the aggregate values for a simple SUM() aggregation function.

The rest of this paper is organized as follows. Section 2 provides a background description of SDIMS [40], a scalable DHT-based aggregation system that underlies STAR. Section 3 describes the mechanism and the STAR self-tuning algorithm for adaptively setting arithmetic imprecision to reduce monitoring overhead. Section 4 presents the implementation of STAR in our SDIMS aggregation system, policies for maintaining the precision of query results under failures, and policies for initializing the error budgets. Section 5 presents the experimental evaluation of our system. Finally, Section 6 discusses related work, and Section 7 provides conclusions.

2. BACKGROUND

SDIMS builds on two ongoing research efforts for scalable monitoring: aggregation and DHT-based aggregation.

2.1 Aggregation

Aggregation is a fundamental abstraction for scalable monitoring [6, 14, 20, 30, 37, 40] because it allows applications to access summary views of global information and detailed views of rare events and nearby information.

SDIMS’s aggregation abstraction defines a tree spanningall nodes in the system. As Figure 1 illustrates, each phys-ical node in the system is a leaf and each subtree repre-sents a logical group of nodes. Note that logical groups cancorrespond to administrative domains (e.g., department oruniversity) or groups of nodes within a domain (e.g., a /28subnet with 14 hosts on a LAN in the CS department) [17,40]. An internal non-leaf node, which we call a virtual node,is simulated by one or more physical nodes at the leaves ofthe subtree rooted at the virtual node.

The tree-based aggregation in the SDIMS framework is defined in terms of an aggregation function installed at all the nodes in the tree. Each leaf node (physical sensor) inserts or modifies its local value for an attribute defined as an {attribute type, attribute name} pair, which is recursively aggregated up the tree. For each level-i subtree Ti in an aggregation tree, SDIMS defines an aggregate value Vi,attr for each attribute: for a (physical) leaf node T0 at level 0, V0,attr is the locally stored value for the attribute or NULL if no matching tuple exists. The aggregate value for a level-i subtree Ti is the result returned by the aggregation function computed across the aggregate values of Ti's children. Figure 1, for example, illustrates the computation of a simple SUM aggregate.

2.2 DHT-based aggregation

SDIMS leverages DHTs [30, 31, 32, 36, 44] to construct a forest of aggregation trees and maps different attributes to different trees [6, 14, 30, 33, 40] for scalability and load balancing. DHT systems assign a long (e.g., 160 bits), random ID to each node and define a routing algorithm to send a request for key k to a node rootk such that the union of paths from all nodes forms a tree DHTtreek rooted at the node rootk. By aggregating an attribute with key k = hash(attribute) along the aggregation tree corresponding to DHTtreek, different attributes are load balanced across different trees. Studies suggest that this approach can provide aggregation that scales to large numbers of nodes and attributes [40, 33, 30, 14, 6].

2.3 Example Application

Aggregation is a building block for many distributed applications such as network management [41], service placement [16], sensor monitoring and control [26], multicast tree construction [37], and naming and request routing [9]. In this paper, we focus on a case-study example: a distributed heavy hitter detection service.

2.3.1 Heavy Hitter Detection

Our case-study application is identifying heavy hitters¹ in a distributed system—for example, the top 10 IPs that account for a significant fraction of total incoming traffic in the last 10 minutes [13]. The key challenge for this distributed query is scalability: aggregating per-flow statistics for tens of thousands to millions of concurrent flows in real time. For example, a subset of the Abilene [1] traces used in our experiments includes up to 80 thousand flows that send about 25 million updates per hour.

To scalably compute the global heavy hitters list, we chain two aggregations, where the results from the first feed into the second. First, SDIMS calculates the total incoming traffic for each destination from all nodes in the system using SUM as the aggregation function and hash(HH-Step1, destIP) as the key. For example, the tuple (H = hash(HH-Step1, 128.82.121.7), 700 KB) at the root of the aggregation tree TH indicates that a total of 700 KB of data was received for 128.82.121.7 across all vantage points during the last time window. In the second step, we feed these aggregated total bandwidths for each destination IP into a SELECT-TOP-10 aggregation with key hash(HH-Step2, TOP-10) to identify the TOP-10 heavy hitters among all flows.
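To illustrate the two-step chaining, the self-contained sketch below uses assumed helper names (key, step1_totals, step2_top10) rather than the SDIMS API: the first aggregation sums per-destination traffic under per-flow keys, and the second selects the top 10 totals under a single key.

import hashlib
from collections import defaultdict

def key(*parts):
    # Hypothetical stand-in for the hash(...) used to name SDIMS aggregation trees.
    return hashlib.sha1("|".join(parts).encode()).hexdigest()

def step1_totals(observations):
    """Per-destination SUM aggregation: one aggregation tree per destIP key.
    observations: iterable of (dest_ip, nbytes) seen at the leaf sensors."""
    totals = defaultdict(int)
    for dest_ip, nbytes in observations:
        totals[(key("HH-Step1", dest_ip), dest_ip)] += nbytes
    return totals

def step2_top10(totals):
    """SELECT-TOP-10 aggregation keyed by hash(HH-Step2, TOP-10): consumes the
    roots of the step-1 trees and keeps the 10 largest totals."""
    top = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:10]
    return {key("HH-Step2", "TOP-10"): [(dest, total) for (_, dest), total in top]}

observations = [("128.82.121.7", 400_000), ("128.82.121.7", 300_000), ("10.0.0.1", 1_200)]
print(step2_top10(step1_totals(observations)))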

Although there exist other centralized monitoring services, in Section 5 we show that by using our STAR self-tuning algorithm in the SDIMS aggregation system, we can monitor a large number of attributes at much finer time scales while incurring significantly lower network costs.

¹Note that the standard definition of a heavy hitter is an entity that accounts for at least a specified proportion of the total activity measured in terms of number of packets, bytes, connections, etc. [13]. We use a slightly different definition of "heavy hitters" to denote flows whose bandwidth is greater than a specified threshold fraction of the maximum flow value.

3. STAR DESIGN

Arithmetic imprecision (AI) bounds the numeric difference between a reported value of an attribute and its true value [29, 43]. For example, a 10% AI bound ensures that the reported value either underestimates or overestimates the true value by at most 10%.

When applications do not need exact answers and data values do not fluctuate wildly, arithmetic imprecision can greatly reduce the monitoring load by allowing caching to filter small changes in aggregated values. Furthermore, for applications like distributed heavy hitter monitoring, arithmetic imprecision can completely filter out updates for most "mice" flows.

We first describe the basic mechanism for enforcing AI for each aggregation subtree in the system. Then we describe how our system uses a self-tuning algorithm to address the policy question of distributing an AI budget across subtrees to minimize system load.

3.1 Mechanism

To enforce AI, each aggregation subtree T for an attribute has an error budget δT which defines the maximum inaccuracy of any result the subtree will report to its parent for that attribute. The root of each subtree divides this error budget between itself (δself) and its children (δc), with δT ≥ δself + Σ_{c∈children} δc, and the children recursively do the same. Here we present the AI mechanism for the SUM aggregate since it is likely to be most common in practice in network monitoring and financial applications; other standard aggregation functions (e.g., MAX, MIN, AVG) are similar [23].

This arrangement reduces system load by filtering small updates that fall within the range of values cached by a subtree's parent. In particular, after a node A with error budget δT reports a range [Vmin, Vmax] for an attribute value to its parent (where Vmax ≤ Vmin + δT), if the node A receives an update from a child, the node A can skip updating its parent as long as it can ensure that the true value of the attribute for the subtree lies between Vmin and Vmax, i.e., if

    Vmin ≤ Σ_{c∈children} V^c_min   and   Vmax ≥ Σ_{c∈children} V^c_max        (1)

where V^c_min and V^c_max denote the most recent update received from child c.

Note the trade-off in splitting δT between δself and δc. A large δc allows children to filter updates before they reach a node. Conversely, by setting δself > 0, a node can set Vmin < Σ_c V^c_min, set Vmax > Σ_c V^c_max, or both, to avoid further propagating some updates it receives from its children.

SDIMS maintains per-attribute δ values so that different attributes with different error requirements and different update patterns can use different δ budgets in different subtrees. SDIMS implements this mechanism by defining a distribution function; just as an attribute type's aggregation function specifies how aggregate values are aggregated from children, an attribute type's distribution function specifies how δ budgets are distributed among the children and δself.
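A minimal sketch of the resulting filtering check at an internal node for the SUM aggregate (our code, with hypothetical names; Equation 1 is the test being evaluated):

def needs_report(child_ranges, reported_min, reported_max):
    """AI filtering check at an internal node (sketch of Equation 1, not SDIMS code).
    child_ranges: latest [V^c_min, V^c_max] reported by each child c.
    Returns True when the cached range no longer covers the children's sums,
    i.e., the node must report a new range to its parent."""
    sum_min = sum(lo for lo, _ in child_ranges)
    sum_max = sum(hi for _, hi in child_ranges)
    return not (reported_min <= sum_min and reported_max >= sum_max)

# Example: a node that last reported [0, 10] can absorb children moving to
# [1, 2] and [2, 8] (sums 3 and 10), but not [10, 11] and [2, 8] (sums 12 and 19).
assert needs_report([(1, 2), (2, 8)], 0, 10) is False
assert needs_report([(10, 11), (2, 8)], 0, 10) is True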

3.2 Policy Decisions

Given these mechanisms, there is plenty of flexibility to (i) set δroot to an appropriate value for each attribute, and (ii) compute Vmin and Vmax when updating a parent.

Setting δroot: Note that aggregation queries can set the root error budget δroot to any non-negative value. For some applications, an absolute constant value may be known a priori (e.g., count the number of connections per second ±10 at port 1433). For other applications, it may be appropriate to set the tolerance based on measured behavior of the aggregate in question (e.g., set δroot for an attribute to be at most 10% of the maximum value observed) or on the measurements of a set of aggregates (e.g., in our heavy hitter application, set δroot for each flow to be at most 1% of the bandwidth of the largest flow measured in the system). Our algorithm supports all of these approaches by allowing new absolute δroot values to be introduced at any time and then distributed down the tree via a distribution function. We have prototyped systems that use each of these three policies.

Computing [Vmin, Vmax]: When either Σ_c V^c_min or Σ_c V^c_max goes outside of the last [Vmin, Vmax] that was reported to the parent, a node needs to report a new range. Given a δself budget at an internal node, we have some flexibility in how to center the [Vmin, Vmax] range. Our approach is to adopt a per-aggregation-function range policy that reports Vmin = (Σ_c V^c_min) − bias ∗ δself and Vmax = (Σ_c V^c_max) + (1 − bias) ∗ δself to the parent. The bias parameter can be set as follows:

• bias ≈ 0.5 if inputs are expected to be roughly stationary
• bias ≈ 0 if inputs are expected to be generally increasing
• bias ≈ 1 if inputs are expected to be generally decreasing

For example, suppose a node with total δT of 10 and δself of 3 has two children reporting [V^c_min, V^c_max] ranges of [1, 2] and [2, 8], respectively, and reports [0, 10] to its parent. Then, if the first child reports a new range [10, 11], the node must report to its parent a range that includes [12, 19]. If bias = 0.5, it reports [10.5, 20.5] to the parent to filter out small deviations around the current position. Conversely, if bias = 0, it reports [12, 22] to filter out the maximal number of updates of increasing values.
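The range policy can be written down directly; the sketch below (hypothetical function name, assuming the SUM aggregate) reproduces the example above:

def report_range(child_ranges, delta_self, bias=0.5):
    """Compute the [Vmin, Vmax] range reported to the parent (a sketch of the
    per-aggregation-function range policy for SUM, not the SDIMS implementation).
    bias ~ 0.5 for stationary inputs, ~ 0 for increasing, ~ 1 for decreasing."""
    sum_min = sum(lo for lo, _ in child_ranges)
    sum_max = sum(hi for _, hi in child_ranges)
    return (sum_min - bias * delta_self, sum_max + (1 - bias) * delta_self)

# Example from the text: delta_self = 3, children now report [10, 11] and [2, 8].
assert report_range([(10, 11), (2, 8)], 3, bias=0.5) == (10.5, 20.5)
assert report_range([(10, 11), (2, 8)], 3, bias=0.0) == (12.0, 22.0)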

3.3 Self-tuning Error Budgets

The key AI policy question is how to divide a given error budget δroot across the nodes in an aggregation tree.

A simple approach is a static policy that divides the error budget uniformly among all the children. E.g., a node with budget δT could set δself = 0.1δT and then divide the remaining 0.9δT evenly among its children. Although this approach is simple, it is likely to be inefficient because different aggregation subtrees may experience different loads.

To make cost/accuracy tradeoffs self-tuning, we provide an adaptive algorithm. The high-level idea is simple: increase δ for nodes with high load and standard deviation but low δ (relative to other nodes); decrease δ for nodes with low load and low standard deviation but high δ. Next, we address the problem of optimal distribution of error budgets for a 2-tier (one-level) tree and later extend it as a general approach for a hierarchical aggregation tree.

3.3.1 One-level tree

Quantify AI Filtering Gain: In order to derive the optimal distribution of error budgets among different nodes, we need a simple way of quantifying the amount of load reduction that can be achieved when a given error budget is used for AI filtering.

Intuitively, the AI filtering gain depends on the size of the error budget relative to the inherent variability in the underlying data distribution. Specifically, as illustrated in Figure 2, if the allocated error budget δi at node i is much smaller than the standard deviation σi of the underlying data distribution, δi is unlikely to filter many data updates. Meanwhile, if δi is above σi, we would expect the load to decrease quickly as δi increases.

In order to quantify the tradeoff between load and error budget, one possibility is to compute the entire tradeoff curve as shown in Figure 2. However, doing so imposes several difficulties. First, it is in general difficult to compute the tradeoff curve without a priori knowledge about the underlying data distribution. Second, maintaining the entire tradeoff curve becomes expensive when there are a large number of attributes and nodes. Finally, it is not easy to optimize the distribution of error budgets among different nodes based on the tradeoff curves.

To overcome these difficulties, we develop a simple metric in STAR to capture the tradeoff between load and error budget. Our metric is motivated by Chebyshev's inequality in probability theory, which bounds the probability that a random variable deviates from its mathematical expectation in terms of its variance. Let X be a random variable with finite mathematical expectation µ and variance σ². Chebyshev's inequality states that for any k ≥ 0,

    Pr(|X − µ| ≥ kσ) ≤ 1/k²        (2)

For AI filtering, the term kσ represents the error budget δi for node i. Substituting for k in Equation 2 gives:

    Pr(|X − µ| ≥ kσi) ≤ 1/k² = σi²/δi²        (3)

Intuitively, this equation implies that if δi ≤ σi, i.e., the error budget is smaller than the standard deviation (implying k ≤ 1), then δi is unlikely to filter any data updates (Figure 2). In this case, Equation 3 provides only a weak bound on message cost: the probability that each incoming update will trigger an outgoing message is upper bounded by 1. However, if δi = kσi for some k ≥ 1, the fraction of unfiltered updates is bounded by σi²/δi².

In general, given the input update rate ui for child i with error budget δi, the expected cost for node i per unit time is:

    MIN(1, σi²/δi²) ∗ ui        (4)
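Equation 4 translates directly into code; the following sketch (our names, not STAR's implementation) gives the expected per-unit-time message cost of a child:

def expected_cost(sigma, delta, update_rate):
    """Expected messages per unit time sent by a child with input update rate
    `update_rate`, data standard deviation `sigma`, and error budget `delta`
    (Equation 4): each update escapes the filter with probability at most
    min(1, sigma^2 / delta^2)."""
    if delta == 0:
        return update_rate          # no budget: every update propagates
    return min(1.0, (sigma ** 2) / (delta ** 2)) * update_rate

# A volatile child (sigma >> delta) forwards essentially every update,
# while a child with delta = 3*sigma forwards at most ~11% of them.
print(expected_cost(sigma=5.0, delta=1.0, update_rate=100))   # 100.0
print(expected_cost(sigma=1.0, delta=3.0, update_rate=100))   # ~11.1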

Compute Optimal Error Budget: To derive the optimal error distribution at each node, we can formulate an optimization problem of minimizing the total incoming network load at root R under a fixed total AI budget δT, i.e.,

    MIN  Σ_{i∈child(R)} (σi² ∗ ui) / (δi^opt)²    s.t.    Σ_{i∈child(R)} δi^opt = δT        (5)

Using Lagrange multipliers yields a closed-form and computationally inexpensive optimal solution [23]:

    δi^opt = δT ∗ (σi² ∗ ui)^(1/3) / Σ_{i∈child(R)} (σi² ∗ ui)^(1/3)        (6)

Figure 2: Expected load vs. error budget of a node.

Figure 3: Cost-benefit analysis: monitoring load, redistribution load, and total load vs. frequency of error budget distribution.

Figure 4: Maximum node stress vs. number of nodes: 1-level vs. hierarchical tree.

The above optimal error assignment assumes that for all nodes, the expected cost per update is equal to σi²/δi² based on Equation 3, i.e., σi ≤ δi. However, for nodes with high σi relative to the error budget δi, it is highly likely that an update will be sent to the root for each incoming message. We term these nodes volatile nodes [11] since they may not yield a significant benefit in spite of being allocated a large fraction of the error budget (Equation 6).

To account for volatile nodes, we apply an iterative algorithm that determines the largest volatile node j at each step and recomputes Equation 6 for all remaining children assuming j is absent. A node j is labeled volatile if σj/δj^opt ≥ 1, i.e., the standard deviation is larger than the optimal error budget (under the fixed total budget) corresponding to Equation 3, and the ratio σj/δj^opt is maximal among all remaining children. If no such j exists, the procedure terminates, giving the optimal AI budgets for each node; all volatile nodes get zero budget since any non-zero budget would not effectively filter their updates. A sketch of this allocation procedure appears below. Note that for our DHT-based aggregation trees, the fan-in of a node is typically 16 (i.e., 4-bit correction per hop), so the iterative algorithm runs in constant time (at most 16 iterations).
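The sketch below is our reading of this procedure (not the STAR implementation): it applies the closed-form allocation of Equation 6 and iteratively removes the most volatile child until no volatile children remain.

def optimal_budgets(stats, total_budget):
    """Closed-form allocation of Equation 6 with iterative elimination of
    volatile nodes. stats: dict child -> (sigma, update_rate);
    returns dict child -> delta_opt (volatile children end up with 0)."""
    budgets = {c: 0.0 for c in stats}
    active = set(stats)
    while active:
        # Equation 6: delta_i proportional to (sigma_i^2 * u_i)^(1/3) over remaining children.
        weights = {c: (stats[c][0] ** 2 * stats[c][1]) ** (1.0 / 3) for c in active}
        total_w = sum(weights.values())
        if total_w == 0:
            break
        alloc = {c: total_budget * weights[c] / total_w for c in active}
        # A child is volatile if sigma_j >= delta_j^opt; drop the most volatile one.
        volatile = [c for c in active if alloc[c] > 0 and stats[c][0] / alloc[c] >= 1]
        if not volatile:
            for c in active:
                budgets[c] = alloc[c]
            break
        worst = max(volatile, key=lambda c: stats[c][0] / alloc[c])
        active.remove(worst)
    return budgets

# Two quiet children and one highly volatile child competing for a budget of 6:
# the volatile child is dropped and the budget is split between the other two.
print(optimal_budgets({"a": (1.0, 10), "b": (1.0, 10), "c": (50.0, 10)}, 6.0))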

Relaxation: A self-tuning algorithm that adapts too rapidly may react inappropriately to transient situations. Therefore, we next apply exponential smoothing to compute the new error budget δi^new for each node as the weighted average of the new optimal budget (δi^opt) and the previous budget (δi):

    δi^new = α δi^opt + (1 − α) δi        (7)

where α = 0.05.

Cost-Benefit Throttling: Finally, root R needs to send messages to its children to rebalance the error budget. Therefore, there is a tradeoff between the error budget redistribution overhead and the AI filtering gain (as illustrated in Figure 3). A naive rebalancing algorithm that ignores this tradeoff could easily spend more network messages redistributing δs than it saves by filtering updates. Limiting redistribution overhead is a particular concern for applications like distributed heavy hitter detection that monitor a large number of attributes, only a few of which are active enough to be worth optimizing.

To address this challenge, after computing the new error budgets, the root node computes a charge metric for each child subtree c, which estimates the number of extra messages sent by c due to a sub-optimal δ:

    Charge_c = (Tcurr − Tadjust) ∗ (Mc − Mc^new)

where Mc = σc² ∗ uc / δc², Mc^new = σc² ∗ uc / (δc^new)², and Tadjust is the last time δ was adjusted at R for child c. Notice that a subtree's charge will be large if (a) there is a large load imbalance (e.g., Mc − Mc^new is large) or (b) there is a stable, long-lasting imbalance (e.g., Tcurr − Tadjust is large).

We only send messages to redistribute deltas when doing so is likely to save at least k messages (i.e., if Charge_c > k). To ensure the invariant that δT ≥ δself + Σ_c δc, we make this adjustment in two steps. First, we replenish δself from the child whose δc is the farthest above δc^new by ordering c to reduce δc by Min(0.1 δc, δc − δc^new). Second, we loan some of the δself budget to the node c that has accumulated the largest charge by incrementing c's budget by Min(0.1 δc, δc^new − δc, Max(0.1 δself, δself − δself^new)).

A node responds to a request from its parent to update δT using a similar approach.

Note that in a practical setting, a parent node should first reclaim error budget from children and then give budget to other child nodes, as guided by the self-tuning algorithm.
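The charge-based test can be sketched as follows (our hypothetical names; k is the message-saving threshold from the text):

def charge(sigma, rate, delta_old, delta_new, t_now, t_last_adjust):
    """Charge metric for a child subtree (a sketch of the cost-benefit test,
    not STAR's code): extra messages attributable to keeping the sub-optimal
    budget delta_old instead of delta_new since the last adjustment."""
    m_old = (sigma ** 2 * rate) / delta_old ** 2
    m_new = (sigma ** 2 * rate) / delta_new ** 2
    return (t_now - t_last_adjust) * (m_old - m_new)

def should_redistribute(sigma, rate, delta_old, delta_new, t_now, t_last_adjust, k=10):
    """Only send rebalancing messages if doing so is expected to save at least k messages."""
    return charge(sigma, rate, delta_old, delta_new, t_now, t_last_adjust) > k

# A child stuck with a budget of 1 instead of 4 for 10 time units accumulates
# a large charge; a child already close to its optimal budget does not.
print(should_redistribute(2.0, 5, delta_old=1.0, delta_new=4.0,
                          t_now=110, t_last_adjust=100))   # True  (charge ~ 187.5)
print(should_redistribute(2.0, 5, delta_old=3.5, delta_new=4.0,
                          t_now=110, t_last_adjust=100))   # False (charge ~ 3.8)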

3.3.2 Multi-level trees

For large-scale multi-level trees, we extend our basic algorithm for a one-level tree to a distributed algorithm for a multi-level aggregation hierarchy. First, this reduces maximum node stress (Figure 4) and reduces communication load, because internal nodes not only split δc among their children c but may also retain δself to help prevent updates received from children from being propagated further up the tree. Second, in order to scale to the tens of thousands of attributes required by per-flow network monitoring services, our algorithm performs cost/benefit throttling. In particular, we use a competitive analysis to ensure that rebalancing δ accounts for only a fraction of the system's total message load, by redistributing δ only to correct large imbalances or stable, long-duration (though perhaps small) imbalances.

At any internal node in the aggregation tree, the self-tuning algorithm works as follows:

1. Estimate the optimal distribution of δT across δself and δc.

Each node p tracks its incoming update rate, i.e., the aggregate number of messages sent by all its children per time unit (up), and the standard deviation (σp) of the updates received from its children. Note that the uc, σc reports are accumulated by child c until they can be piggy-backed on an update message to its parent.

Given this information, each parent node n computes the optimal value δv^opt for each child v's underlying subtree that minimizes the total system load in the entire subtree rooted at n. We apply Equation 6 by viewing each child v as representing two individual nodes: (1) v itself (as a data source) with update rate uv and standard deviation σv, and (2) a node representing the subtree rooted at v. Figure 5 illustrates this local self-tuning view: when any internal node computes optimal budgets, it aims to minimize both the incoming messages it receives and the outgoing messages to its parent (by treating the parent link as a virtual child), i.e., it minimizes global communication load from a local perspective.

Figure 5: Local self-tuning view considers both incoming and outgoing bandwidth.

Figure 6: Global self-tuning view collapses a multi-level hierarchy to a one-level tree.

Given this view model, we define LoadFactor for a node v as (σv² ∗ uv)^(1/3). Recursively, we can define AccLoadFactor for a subtree rooted at node v as:

    AccLoadFactorv = LoadFactorv                                            (v is a leaf node)
    AccLoadFactorv = LoadFactorv + Σ_{j∈child(v)} AccLoadFactorj            (otherwise)

Next, we estimate the optimal error budget for v's subtree (v ∈ child(n)) as follows:

    δv^opt = δT ∗ AccLoadFactorv / AccLoadFactorn        (8)

Equation 8 is globally optimal since it virtually maps a multi-level hierarchy into a one-level tree (as illustrated in Figure 6) and, in this transformed view, computes the optimal error budget for each node.

To account for volatile nodes, we apply an iterative algorithm similar to the one-level tree case: determine the largest volatile node i (i ∈ child(n)) at each step, recompute Equation 8 for all remaining children, and so on. The condition to check whether node i is volatile remains the same: σi / δ_i(self)^opt ≥ 1 and σi / δ_i(self)^opt ≥ σj / δ_j(self)^opt for all j, where δ_i(self)^opt is computed as:

    δ_i(self)^opt = δT ∗ LoadFactori / AccLoadFactorn        (9)

2. Relaxation: adaptive adjustment of deltas.

    δi^new = α δi^opt + (1 − α) δi

where α = 0.05.

3. Redistribute deltas iff the expected benefit exceeds the redistribution overhead.

To do cost-benefit throttling, we recursively apply the one-level formula to compute the cumulative charge for a subtree rooted at node v, using the following terms for computing Mi − Mi^new:

    AccChargev = Chargev                                              (v is a leaf node)
    AccChargev = Chargev + Σ_{j∈child(v)} AccChargej                  (otherwise)

    Mi = σi² ∗ ui / δi² + Σ_{j∈child(i)} σj² ∗ uj / δj²

    Mi^new = σi² ∗ ui / (δi^new)² + Σ_{j∈child(i)} σj² ∗ uj / (δj^new)²

Note that the terms AccLoadFactorv for computing Equation 8 and AccChargev can be computed using a SUM aggregation function, and are piggybacked on updates sent by children to their parents in our implementation.
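As a sketch of how Equation 8 splits a parent's budget using the recursively accumulated load factors (our illustration with a dictionary-based tree; the field names are hypothetical):

def load_factor(sigma, rate):
    # LoadFactor_v = (sigma_v^2 * u_v)^(1/3), the weight used by Equation 8.
    return (sigma ** 2 * rate) ** (1.0 / 3)

def acc_load_factor(node):
    """AccLoadFactor of the subtree rooted at `node` (node is a dict with
    hypothetical fields 'name', 'sigma', 'rate', 'children')."""
    own = load_factor(node["sigma"], node["rate"])
    return own + sum(acc_load_factor(c) for c in node["children"])

def subtree_budgets(parent, total_budget):
    """Equation 8: split the parent's budget across its children's subtrees
    in proportion to their accumulated load factors."""
    denom = acc_load_factor(parent)
    return {c["name"]: total_budget * acc_load_factor(c) / denom
            for c in parent["children"]}

leafA = {"name": "A", "sigma": 1.0, "rate": 10, "children": []}
leafB = {"name": "B", "sigma": 4.0, "rate": 10, "children": []}
mid = {"name": "mid", "sigma": 0.5, "rate": 20, "children": [leafA, leafB]}
# The children's budgets do not sum to the total: the remainder corresponds to
# the parent's own LoadFactor share, i.e., the delta_self it retains.
print(subtree_budgets(mid, total_budget=10.0))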

While beyond the scope of this paper, our algorithms can be extended in a straightforward manner to adaptively assign error budgets for multiple queries involving overlapping sets of data objects.

4. IMPLEMENTATION

In this section, we describe three important design aspects of implementing STAR in our SDIMS prototype. First, we present a new distribution abstraction to provide the functionality of distributing AI error budgets in an aggregation tree. Second, we discuss different policies to maintain AI error precision in query results under failures. Finally, we describe two practical optimizations in our prototype implementation to reduce communication load for large-scale system monitoring.

4.1 Distribution Abstraction

A distribution abstraction is a dual mechanism to the traditional bottom-up tree-based aggregation that enables an update to be distributed top-down along the aggregation tree.

The basic mechanism of a distribution function at a node is to receive its list of children for an attribute and a (possibly null) Distribution message from its parent, and to return a set of messages destined for a subset of its children and itself. For self-tuning AI, a distribution function implements the STAR algorithm that takes an AI error budget from the parent and distributes it down to its children.

In the SDIMS framework, each node implements an Aggregation Management Layer (AML) that maintains attribute tuples, performs aggregations, and stores and propagates aggregate values [40]. On receiving a Distribution message, the AML layer invokes the distribution function for the given attribute and level in the aggregation tree. The logic of how to process these messages is implemented by the application. For example, for self-tuning AI, the STAR protocol generates distribution messages to either allocate more error budget to a child subtree or decrease the error budget of a child subtree.

The AML layer provides communication between an aggregation function and a distribution function by passing local messages. Conceptually, a distribution function extends an aggregation function by allowing the flexibility of defining multiple distribution functions (policies) for a given aggregation function.
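The callback shape of such a distribution function might look as follows; this is our illustrative sketch (the Distribution class, star_distribution_fn, and the uniform initial split are assumptions, not the SDIMS API). In STAR, the self-tuning allocation of Section 3.3 would replace the uniform split shown here.

from dataclasses import dataclass
from typing import Optional

@dataclass
class Distribution:
    """A top-down message carrying an error budget for a subtree (illustrative)."""
    attribute: str
    budget: float

def star_distribution_fn(children, msg: Optional[Distribution], delta_self_fraction=0.1):
    """Sketch of a distribution function: given the children for an attribute and a
    (possibly null) Distribution message from the parent, return the messages to
    send to a subset of children. Here the budget is split uniformly after
    retaining a delta_self share; STAR would use Equation 8 instead."""
    if msg is None or not children:
        return {}
    delta_self = delta_self_fraction * msg.budget
    per_child = (msg.budget - delta_self) / len(children)
    return {c: Distribution(msg.attribute, per_child) for c in children}

print(star_distribution_fn(["child-0", "child-1"], Distribution("flow-bytes", 10.0)))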

In SDIMS, a store of (attribute type, attribute name, value) tuples is termed a Management Information Base (MIB). For hierarchical aggregation of a given attribute, each node stores child MIBs received from children and a reduction MIB containing locally aggregated values across the child MIBs. To provide the distribution abstraction, we implemented a distribution MIB that stores as its value an object used for maintaining application-defined state for adaptive error budget distribution. In STAR, a distribution MIB also holds a local copy of the load information received from child MIBs.

Given this abstraction, it is simple to implement STAR's self-tuning adaptation of error budgets in an aggregation tree. For example, in STAR:

(a) A parent can increase δself by sending a distribution message to a child c to reduce its subtree budget δc,

(b) A parent can increase δc for a subtree child c by reducing its own δself, and

(c) Filter small changes: a parent only changes a δ assignment if doing so is likely to save more messages than rebalancing the δs costs.

The policy decision of when the self-tuning function gets called can be based on a periodic timer, on processing a threshold number of messages, or simply on-demand.

4.2 Robustness

To handle node failures and disconnections, our STAR implementation currently provides a simple policy to maintain the AI error precision in query results under churn. On each invocation of the distribution function, it receives the current child MIB set from the AML layer. If the new child MIB set is inconsistent with the previous child MIB snapshot maintained by the distribution function, it takes corrective action as follows:

• On a new child event: insert a new entry in the distribution MIB for the new child, assign an error budget to that child subtree, and send it to that child for distribution in its subtree.

• On a dead child event: garbage collect the child state and reclaim all AI error budget previously assigned to that child subtree.

• On a new parent event: reset the AI error budget to zero, as the new parent will allocate a new error budget.

Assuming that the error bounds were in a state where all precision constraints were satisfied prior to a failure, the error temporarily lost due to the failure of a child only improves precision, thus no precision constraints can become violated. A new child receives a fraction of the error budget allocation, so correctness still holds. Finally, on a new parent event, the error budget at its children will be reset.

Though this policy is simple and conservative, it always guarantees that the reported AI error precision in query results is satisfied. Under this policy, however, a single failure might incur high communication overhead; e.g., if a whole subtree moves to a new parent, the entire AI budget is lost, resulting in every leaf update being propagated up the tree until the new parent sends distribution messages to reassign the AI error budgets in the affected subtree.

Alternatively, policies that exploit consistency versus performance tradeoffs can be used. One such policy is that when a child gets connected to a new parent, it can keep reporting aggregate values to the new parent with the precision error assigned by its last parent; the self-tuning algorithm then continually adjusts the previous subtree AI budget to converge with the new allocation over time.

Another policy would be to allow the invariant that a subtree meets the AI bound to be violated during periods of reconfiguration. Instead, a new parent can send down a target budget and aggregate up the actual AI error bound reported by each child's subtree. This policy is simple both for this failure case and for the common case of adaptively redistributing error budgets. We leave the empirical comparison of different policies that trade strict consistency for performance as future work.

4.3 Optimizing for Scalability

With the STAR algorithm described in Section 3, the error budgets can be adaptively set in an aggregation hierarchy to minimize the communication overhead. However, since we are interested in large-scale system monitoring, in this section we present an optimization for setting initial error budgets that complements adaptive error setting to further reduce the network load in a real-world setting.

In our DHH application, we want to maintain accurate information only for the heavy hitter flows that generate a significant fraction of the total traffic. For mice flows that seldom send updates, we allow them to get culled at the leaves and maintain conservative information on their aggregate value.

Since SDIMS is an event-driven system, for error distribution, an initial update needs to be propagated to the root of an aggregation tree to distribute the error budget among all the nodes in that tree. Therefore, the initial cost of budget distribution would be O(log N) for a mice update to reach the root and O(N) for error distribution. However, for mice flows that send only a few updates, this cost will be significantly higher than the benefit of filtering them. Therefore, for scalability, the key question is how to set the initial error budgets so that the mice flows never generate any updates.

One simple yet expensive policy is to keep the entire budget at the root and perform on-demand error distribution. Another policy is to give the root a share of zero and uniformly divide the entire budget among all the leaf nodes. In general, there is a continuum of policies that can be defined by initially dividing the total budget δT among the root and the N leaf nodes. This set of policies can be expressed as:

    AIRoot = Rootshare ∗ δT
    AILeaf = (1 − Rootshare) ∗ δT / N

For example, if Rootshare = 0.5, then the root gets δT/2 and each leaf gets δT/(2N). To implement this policy, on receiving an update, a leaf node checks if it already has a local AI error budget. If not, it performs a lookup for AILeaf to filter the update. If the local budget is insufficient, the leaf sends the update to its parent, which in turn applies AI error filtering. If that update reaches the root, the root initiates error distribution of its AIRoot error budget across its tree. Given this setting, we can cull a large fraction of the mice flows at the leaves, preventing their updates from reaching the root. We show the performance benefit of using this optimization in Section 5.
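A minimal sketch of this continuum of initial-budget policies (hypothetical function name):

def initial_budgets(total_budget, num_leaves, root_share=0.5):
    """Continuum of initial error budget policies: the root keeps
    Rootshare * delta_T and the rest is split uniformly across the N leaves."""
    ai_root = root_share * total_budget
    ai_leaf = (1 - root_share) * total_budget / num_leaves
    return ai_root, ai_leaf

# Rootshare = 0.5 with delta_T = 100 and 1000 leaves: the root keeps 50 and
# each leaf starts with 0.05, enough to cull most mice flows locally.
print(initial_budgets(100.0, 1000, root_share=0.5))   # (50.0, 0.05)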

Figure 7: Normalized load vs. noise of synthetic workload for a fixed AI error budget. If noise < AI, a majority of updates get filtered. The x-axis is on a log scale.

5. EXPERIMENTAL EVALUATION

We have implemented a prototype of STAR in our SDIMS monitoring framework [40] on top of FreePastry [32], and performed both large-scale simulation experiments and a real-world monitoring deployment on two real networks: 120 node instances mapped onto 30 physical machines in our department's Condor cluster, and the same setup on 30 machines in the Emulab [39] testbed.

Our experiments characterize the performance and scalability of self-tuning AI for the distributed heavy hitters (DHH) application. First, we quantify the reduction in monitoring overheads due to self-tuning AI using simulations. Second, we evaluate the performance benefits of our optimization in a real-world monitoring implementation. Finally, we investigate the reduction in communication load achieved by STAR for the DHH application using our prototype implementation. In summary, our experimental results show that STAR is an effective substrate for scalable monitoring: introducing small amounts of AI error and adaptivity using self-tuning AI significantly reduces monitoring load.

5.1 Simulation Experiments

To generalize the trade-off between AI error budget and monitoring load, we evaluate via simulations under what conditions an AI error budget, and self-tuning of the AI error budget, are effective. In all experiments, all active sensors are at the leaf nodes of the aggregation tree. Each sensor generates a data value every time unit (round) for 100,000 rounds for two sets of synthetic workloads: (1) a Gaussian distribution with standard deviation 1 and mean 0, and (2) a random walk pattern in which the value either increases or decreases by an amount sampled uniformly from [0.5, 1.5].
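For reference, the two per-sensor synthetic workloads can be generated along the following lines (our sketch of the stated parameters, not the simulator's code):

import random

def gaussian_stream(rounds=100_000, mu=0.0, sigma=1.0, seed=0):
    """Synthetic sensor workload (1): i.i.d. Gaussian samples, one per round."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(rounds)]

def random_walk_stream(rounds=100_000, start=0.0, seed=0):
    """Synthetic sensor workload (2): each round the value moves up or down
    by a step drawn uniformly from [0.5, 1.5]."""
    rng = random.Random(seed)
    values, v = [], start
    for _ in range(rounds):
        v += rng.choice((-1, 1)) * rng.uniform(0.5, 1.5)
        values.append(v)
    return values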

We first investigate under what conditions an AI error budget is effective. Figure 7 shows the simulation results for a 4-level degree-6 aggregation tree with 1296 leaf nodes for the two data distributions under uniform static error distribution. The x-axis denotes the ratio of total AI budget to total noise induced by the leaf sensors, and the y-axis shows the total message load normalized with respect to zero AI error budget. We observe that when noise is small compared to the error budget, there is about an order of magnitude load reduction as the majority of updates are filtered. But, as expected, when noise is large compared to the error budget, the load asymptotically approaches the unfiltered load with AI = 0. The random walk pattern allows almost perfect culling of updates for small amounts of noise, whereas for the Gaussian distribution, there is a small yet finite probability for data values to deviate arbitrarily from their previously reported range.

Figure 8: Performance benefits due to cost-benefit throttling. Load vs. precision error (log scale) for a 10-node 1-level tree, random walk data.

Next, we quantify the cost of the periodic bound shrinking in previous approaches [11, 28] compared with STAR, which performs cost-benefit throttling. In our experiments, the following configuration gave the best results for Adaptive-filters [28]: shrink percentage = 5%, high self-tuning frequency, and distributing error budgets to a small set (e.g., 10-15%) of nodes with the highest burden scores, where burden is the ratio of load to error budget. These observations are consistent with previous work [11].

Figure 8 shows the performance results of uniform allocation, Adaptive-filters [28], and STAR for a 10-node 1-level tree for 1 attribute using a random walk pattern with the same step size at each node. In this case, uniform error allocation would be close to the optimal setting. We observe that when the error budget exceeds the noise, Adaptive-filters incurs a constant error redistribution cost that is proportional to the frequency of self-tuning. As discussed in Section 1, for per-flow monitoring services that require tracking millions of attributes, each having small but non-zero noise, the corresponding communication overhead would be very expensive for Adaptive-filters. STAR, however, performs cost-benefit throttling and does not redistribute error when the corresponding gain is negligible.

We now characterize the higher performance benefits of STAR compared to other approaches as skewness in a workload increases. In Figure 9, the three figures show the communication load vs. error budget to noise ratio for the following skewness settings: (a) 20:80% (b) 50:50% (c) 90:10%. For example, the 20:80% skewness represents that only 20% of nodes have zero noise and the remaining 80% of flows have a large noise. In this case, since only a small fraction of the nodes are stable, both STAR and Adaptive-filters can only reclaim 20% of the total error budget from the zero-noise sources and distribute it to noisy sources to cull their updates.

Figure 9: STAR provides higher performance benefits as skewness in a workload increases. The three figures show load vs. error budget to noise ratio for different skewness settings: (a) 20:80% (b) 50:50% (c) 90:10%.

STAR has a constant difference from Adaptive-filters since the latter does periodic shrinking of error bounds at each node. For the 50:50 case, both self-tuning algorithms can reclaim 50% of the total budget compared to uniform allocation and give it to noisy sources. However, even when the optimal configuration (error budget large compared to noise) is reached, Adaptive-filters keeps readjusting the budgets due to periodic shrinking of error bounds. Finally, when 90% are stable, STAR gives more than an order of magnitude reduction in load compared to both Adaptive-filters and uniform allocation.

Next, we compare the performance of STAR, Adaptive-filters, and uniform allocation under different configurationsby varying input data distribution, standard deviation (stepsizes), and update frequency at each node. For data dis-tribution, the workload is either generated from a random-walk pattern or Gaussian. For standard deviation/step-size,70% of the nodes have uniform parameters as previously de-scribed; the remaining 30% nodes have these parametersproportional to rank or randomly assigned from the range[0.5, 150].

Figure 10 shows the corresponding results for differentsettings of data distribution and standard deviation for a4 level degree-4 tree but fixed update frequency of 1 up-date per node per round. We make the following observa-tions from Figure 10(a). First, when error budget is smallerthan noise, no algorithm in any configuration achieves betterperformance than uniform allocation. Adaptive-filters, how-ever, incurs a slightly higher overhead due to self-tuning eventhough it does not benefit. In comparison, STAR avoids self-tuning costs via cost-benefit throttling. Second, Adaptive-filters and uniform error allocation reach a cross-over pointhaving a similar performance. This cross-over implies thatfor Adaptive-filters, the cost of self-tuning is equal to thebenefits. Third, as error budget increases, STAR achievesbetter performance than Adaptive-filters. Because step-sizesare based on node rank, STAR’s outlier detection avoids al-locating budget to the nodes having the largest step-sizes.Adaptive-filters, however, do not make such a distinctionand computes burden scores based on load thereby favoringnodes with relatively large step sizes. Thus, since the totalbudget is limited, reducing error budget at nodes with smallstep sizes increases their load but does not benefit outlierssince the additional slack is still insufficient to filter theirupdates. Finally, as expected, when error budget is higherthan noise, all algorithms achieve better performance. Inthis configuration, Adaptive-filters reduces monitoring loadby 3x-5x compared to uniform allocation and by 2x-3x com-

Under a random distribution of step sizes as described above, STAR reduces load by up to 1.5x compared to Adaptive-filters and up to 3x against uniform allocation. Comparing across configurations, all algorithms perform better under the Gaussian input distribution than under the random-walk model. Overall, across all configurations, STAR reduces monitoring load by up to an order of magnitude compared to uniform allocation and by up to 5x compared to Adaptive-filters.

5.2 Testbed experiments

In this subsection we quantify the reduction in monitoring load due to self-tuning AI and the query precision of reported results for the Distributed Heavy Hitter (DHH) monitoring application.

Figure 11: CDF of the percentage of flows vs. (a) flow value (KB) and (b) number of updates for the Abilene dataset. The graphs are on a log-log scale.

We use multiple netflow traces obtained from the Abilene [1] Internet2 backbone network. The data was collected from 3 Abilene routers for 1 hour; each router logged per-flow data every 5 minutes, and we split these logs into 120 buckets based on the hash of the source IP. As described in Section 2.3, our DHH application executes a Top-100 query on this dataset for tracking the top 100 flows (destination IP as key) in terms of bytes received over a 30-second moving window shifted every 10 seconds.
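
A minimal, centralized sketch of these query semantics (not our distributed DHH implementation) is shown below; the record format, the bucket handling, and the function names are assumptions made for illustration.

import heapq
from collections import defaultdict, deque

WINDOW = 30   # seconds covered by the moving window
SLIDE = 10    # seconds between successive answers
TOP_K = 100

def top_k_over_window(flow_records):
    """Yield (window_end, top-100 destinations by bytes) as the window slides.
    flow_records are assumed to be (timestamp_sec, dst_ip, num_bytes) tuples."""
    buckets = deque()              # one (bucket_end, per-destination bytes) per 10 s slide
    current = defaultdict(int)
    bucket_end = None
    for ts, dst, nbytes in sorted(flow_records):
        if bucket_end is None:
            bucket_end = ts + SLIDE
        while ts >= bucket_end:    # close the current 10-second bucket
            buckets.append((bucket_end, current))
            current = defaultdict(int)
            # keep only buckets inside the trailing 30-second window
            while buckets and buckets[0][0] <= bucket_end - WINDOW:
                buckets.popleft()
            totals = defaultdict(int)
            for _, b in buckets:
                for d, v in b.items():
                    totals[d] += v
            yield bucket_end, heapq.nlargest(TOP_K, totals.items(), key=lambda kv: kv[1])
            bucket_end += SLIDE
        current[dst] += nbytes

In the actual system, the per-destination sums are maintained in-network by the aggregation tree with AI error budgets filtering small updates; the sketch only fixes the window and Top-100 semantics of the query.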

We first describe the characteristics of this Abilene workload. Figure 11 shows the cumulative distribution function (CDF) of the percentage of network flows versus the number of bytes (KB) sent by each flow. We observe that 60% of flows send less than 1 KB of aggregate traffic, 80% send less than 12 KB, 90% less than 55 KB, and 99% less than 330 KB of total traffic during the 1-hour run. Note that the distribution is heavy-tailed and the maximum aggregate flow value is about 179.4 MB.

Figure 10: Performance comparison of STAR vs. Adaptive-filters and uniform allocation for different (workload, standard deviation) configurations: (a) random walk, rank (b) random walk, random (c) Gaussian, rank (d) Gaussian, random. Each panel plots message cost per second vs. the error budget to noise ratio.

Figure 12: Performance comparison of different policies for setting initial error budgets for the DHH application. BW is the bandwidth cost per node and DistBW is the redistribution overhead per node; curves show message cost per second vs. the AI error budget (% of max flow value) for Rootshare values of 0%, 50%, 90%, and 100%.

Figure 13: Performance benefits of keeping local error budget at the internal nodes of the aggregation tree. The graph plots normalized message cost per second vs. Root_share (% of total error budget) for AI budgets of 5%, 10%, 15%, and 20%.

Figure 11 also shows the corresponding CDF of the percentage of network flows versus the number of updates. We observe that 40% of flows send only a single update (a 28-byte IP/UDP packet). Further, 80% of flows send fewer than 70 updates, 90% fewer than 360 updates, and 99% fewer than 2,000 updates. Note that the distribution of the number of updates is also heavy-tailed, with the maximum number of updates sent by a single flow being about 42,000.

Overall, the total number of updates sent by roughly 80,000 flows originating from 120 sensors is around 25 million. Thus, the monitoring load for an AI error budget of 0 would be about 58.6 messages per second for each of the 120 nodes, and a centralized scheme would incur the prohibitive cost of processing this workload, which generates about 7,000 updates per second.
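
As a back-of-the-envelope check of these figures (using the rounded trace size of 25 million updates over the 1-hour run):

\[
\frac{25 \times 10^{6}\ \text{updates}}{3600\ \text{s}} \approx 6{,}900\ \text{updates/s at a central collector},
\qquad
\frac{6{,}900\ \text{updates/s}}{120\ \text{nodes}} \approx 58\ \text{messages/s per node},
\]

which is consistent with the reported 7,000 updates per second and 58.6 messages per second per node once the exact trace size is used.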

To address this scalability challenge, we analyze different settings of the optimization discussed in Section 4.3 to filter the majority of mice flows in our monitoring system. For this experiment, we enable the distribution mechanism for error distribution using our self-tuning algorithm. Figure 12 shows the performance comparison obtained by varying the Rootshare parameter over values of 0%, 50%, 90%, and 100% for setting initial error budgets, with a global AI error budget of 5%, 10%, 15%, and 20% in each aggregation tree. Note that this figure shows both the total bandwidth cost per node (BW) and the error redistribution overhead per node (DistBW). We observe that with a Rootshare of 50% and an AI budget of 5%, we incur an overhead of about 2.5 messages per node per second. By increasing the AI error budget to 20%, we can reduce this cost by almost a factor of five to about 0.5 messages per node per second. However, for a Rootshare of 100%, each mice update reaches the root, which then initiates an error allocation to each node of the aggregation tree. Finally, note that even a setting of Rootshare of 90% gives an order of magnitude reduction in load compared to a Rootshare of 100% by filtering mice flows at lower levels of the aggregation tree.
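
The sketch below illustrates one plausible reading of this initial-allocation policy: the root retains a Rootshare fraction of the global AI budget, and the remainder is pushed down the tree, with each internal node optionally keeping a self_share fraction before splitting the rest evenly among its children. The Node structure, the even per-child split, and the self_share parameter are assumptions for illustration, not our system's actual code.

class Node:
    def __init__(self, children=None):
        self.children = children or []   # empty list => leaf node
        self.delta = 0.0                 # local error budget (delta_self)

def distribute_budget(node, budget, self_share=0.0):
    """Recursively assign `budget` to the subtree rooted at `node`."""
    if not node.children:                # leaves keep their entire share
        node.delta = budget
        return
    node.delta = self_share * budget     # slack retained at this internal node
    per_child = (budget - node.delta) / len(node.children)
    for child in node.children:
        distribute_budget(child, per_child, self_share)

def initial_allocation(root, total_budget, root_share, self_share=0.0):
    """root_share and self_share are fractions in [0, 1]."""
    root.delta = root_share * total_budget
    remaining = total_budget - root.delta
    per_child = remaining / len(root.children) if root.children else 0.0
    for child in root.children:
        distribute_budget(child, per_child, self_share)

With root_share = 1.0 every leaf starts with zero slack, so each mice update climbs to the root, matching the behavior described above for a Rootshare of 100%; smaller values of root_share leave slack at the lower levels, where mice flows can be filtered cheaply.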

Next, we show the performance benefits of keeping a non-zero local error budget (δself) at the internal nodes of the tree. Intuitively, when an internal node aggregates across its children's values, even though individual child values may have deviated enough to bypass their own error ranges, the net effect of merging all the children's updates may still be close to zero, i.e., the children's updates cancel each other out. Figure 13 shows the benefits of keeping 10% of the error budget at the internal nodes during initial error allocation compared to a 0% budget at the internal nodes. The different lines in the graph correspond to different AI budgets. We observe that for a Rootshare of 50%, we achieve only a 10% load reduction due to filtering at internal nodes, because the local error budget is only able to filter about 10% of updates. However, for a Rootshare of 90%, we achieve almost an order of magnitude load reduction; in this case, the δself at the internal nodes is sufficient to filter a significant fraction of flows.

Note that this reduction due to internal nodes is in addition to the filtering at the leaf nodes. Overall, a scalable monitoring service can significantly benefit from filtering at both the leaf and the internal nodes.
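
The following minimal sketch, written for a SUM aggregate, captures this cancellation effect; the class and method names are illustrative assumptions, not the actual protocol code.

class InternalNode:
    def __init__(self, delta_self):
        self.delta_self = delta_self
        self.child_values = {}       # child id -> last value received
        self.last_reported = 0.0     # value last propagated to the parent

    def on_child_update(self, child_id, value):
        """Merge a child's update; return the new total if it must be
        forwarded to the parent, or None if delta_self absorbs it."""
        self.child_values[child_id] = value
        total = sum(self.child_values.values())
        if abs(total - self.last_reported) > self.delta_self:
            self.last_reported = total
            return total             # must propagate further up the tree
        return None                  # children's updates (nearly) cancel out

For instance, if one child's value rises by 8 while a sibling's falls by 7, the net change of 1 stays inside a delta_self of 10 and no message is sent up the tree, even though both children had to report their individual changes.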

In summary, our evaluation shows that adaptive setting of modest AI budgets can provide large bandwidth savings to enable scalable monitoring.

6. RELATED WORK

Olston et al. [28] proposed a self-tuning algorithm for a one-level tree: increase δ for nodes with high load and low previous δ, and decrease δ for nodes with low load and high previous δ. Our algorithm differs from Olston's in two fundamental ways: First, to work with large-scale multi-level trees, we define a distributed algorithm in which internal nodes not only split δc among their children c but may also retain δself to help prevent updates received from children from being propagated further up the tree. Second, in order to scale to the tens of thousands of attributes required by per-flow network monitoring services, our algorithm performs cost/benefit throttling. In particular, we use a competitive analysis to ensure that rebalancing δ accounts for only a fraction of the system's total message load, by only redistributing δ to correct large imbalances or to correct stable, long-duration (though perhaps small) imbalances.

For hierarchical topologies, Manjhi et al. [27] determine an optimal but static distribution of slack to the internal and leaf nodes of a tree for finding frequent items in data streams. IrisNet [12] filters sensor data at the leaves and caches timestamped results in a hierarchy, with queries that specify the maximum staleness they will accept and that trigger retransmission if needed. Deligiannakis et al. [11] propose an adaptive precision-setting technique for hierarchical aggregation, with a focus on sensor networks. However, similar to Olston's approach, their technique also periodically shrinks the error budgets for each tree, which limits scalability for tracking a large number of attributes. Further, since their approach uses only two local anchor points around the current error budget to infer the precision-performance tradeoff shown in Figure 2, it cannot infer the complete correlation, making it susceptible to dynamic changes. None of the previous studies, to our knowledge, uses the variance and the update rate of the input data distribution in a principled manner. In comparison, STAR provides an efficient and practical algorithm that uses an informed mathematical model to compute globally optimal budgets and performs cost-benefit throttling to adaptively set precision constraints in a general communication hierarchy.

Some recent studies [19, 22, 24] have proposed monitoring systems with distributed triggers that fire when an aggregate of remote-site behavior exceeds an a priori global threshold. Their solutions are based on a centralized architecture. STAR may enhance such efforts by providing a scalable way to track top-k and other significant events.

Other studies have proposed prediction-based techniques for data stream filtering, e.g., using Kalman filters [21], neural networks [25], etc. Recently, there has been considerable interest in the database and sensor network communities in approximate data management techniques; Skordylis et al. [35] provide a good survey.

There are ongoing efforts similar to ours in the P2P and database communities to build global monitoring services. PIER [20] is a DHT-based relational query engine targeted at querying real-time data from many vantage points on the Internet. Sophia [38] is a distributed monitoring system designed with a declarative logic programming model. Gigascope [10] provides stream database functionality for network monitoring applications.

Traditionally, DHT-based aggregation is event-driven and best-effort, i.e., each update event triggers re-aggregation for the affected portions of the aggregation tree. Further, such systems often provide only eventual consistency guarantees on their data [37, 40], i.e., updates by a live node will eventually be visible to probes by connected nodes.

7. CONCLUSIONS AND FUTURE WORK

Without adaptive setting of precision constraints, large-scale network monitoring systems may be too expensive to implement (because too many events flow through the system) even under considerable error budgets. STAR provides self-tuning arithmetic imprecision to adaptively bound the numerical accuracy of query results, together with optimizations that enable scalable monitoring of a large number of stream events in a distributed system.

While our self-tuning algorithm focuses on minimizing communication load in an aggregation hierarchy under fixed data precision constraints, it might be useful for some applications and environments to investigate the dual problem of maximizing data precision subject to constraints on the availability of global computation and communication resources. Another interesting topic for future work is to consider self-tuning error distribution for general graph topologies, e.g., DAGs, rings, etc., that are more robust to node failures than tree networks. Finally, reducing monitoring load in real-world systems would also require understanding other dimensions of imprecision, such as temporal imprecision, where queries can tolerate bounded staleness, and topological imprecision, where only a subset of nodes is needed to answer a query.

8. REFERENCES

[1] Abilene Internet2 network. http://abilene.internet2.edu/.
[2] R. Avnur and J. M. Hellerstein. Eddies: Continuously adaptive query processing. In SIGMOD, 2000.
[3] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In PODS, 2002.
[4] B. Babcock and C. Olston. Distributed top-k monitoring. In SIGMOD, June 2003.
[5] S. Babu, R. Motwani, K. Munagala, I. Nishizawa, and J. Widom. Adaptive ordering of pipelined stream filters. In SIGMOD, 2004.
[6] A. Bharambe, M. Agrawal, and S. Seshan. Mercury: Supporting scalable multi-attribute range queries. In SIGCOMM, Portland, OR, August 2004.
[7] S. Chaudhuri, G. Das, and V. Narasayya. A robust, optimization-based approach for approximate answering of aggregate queries. In SIGMOD, 2001.
[8] D. D. Clark, C. Partridge, J. C. Ramming, and J. Wroclawski. A knowledge plane for the internet. In SIGCOMM, 2003.
[9] R. Cox, A. Muthitacharoen, and R. T. Morris. Serving DNS using a peer-to-peer lookup service. In IPTPS, 2002.
[10] C. D. Cranor, T. Johnson, O. Spatscheck, and V. Shkapenyuk. Gigascope: A stream database for network applications. In SIGMOD, 2003.
[11] A. Deligiannakis, Y. Kotidis, and N. Roussopoulos. Hierarchical in-network data aggregation with quality guarantees. In EDBT, 2004.
[12] A. Deshpande, S. Nath, P. Gibbons, and S. Seshan. Cache-and-query for wide area sensor databases. In SIGMOD, 2003.
[13] C. Estan and G. Varghese. New directions in traffic measurement and accounting. In SIGCOMM, 2002.
[14] M. J. Freedman and D. Mazières. Sloppy hashing and self-organizing clusters. In IPTPS, Berkeley, CA, February 2003.
[15] FreePastry. http://freepastry.rice.edu.
[16] Y. Fu, J. Chase, B. Chun, S. Schwab, and A. Vahdat. SHARP: An architecture for secure resource peering. In SOSP, October 2003.
[17] N. J. A. Harvey, M. B. Jones, S. Saroiu, M. Theimer, and A. Wolman. SkipNet: A scalable overlay network with practical locality properties. In USITS, March 2003.
[18] J. M. Hellerstein, V. Paxson, L. L. Peterson, T. Roscoe, S. Shenker, and D. Wetherall. The network oracle. IEEE Data Eng. Bull., 28(1):3-10, 2005.
[19] L. Huang, M. Garofalakis, A. D. Joseph, and N. Taft. Communication-efficient tracking of distributed cumulative triggers. In ICDCS, 2007.
[20] R. Huebsch, J. M. Hellerstein, N. Lanham, B. T. Loo, S. Shenker, and I. Stoica. Querying the Internet with PIER. In VLDB, 2003.
[21] A. Jain, E. Y. Chang, and Y.-F. Wang. Adaptive stream resource management using Kalman filters. In SIGMOD, 2004.
[22] A. Jain, J. M. Hellerstein, S. Ratnasamy, and D. Wetherall. A wakeup call for internet monitoring systems: The case for distributed triggers. In HotNets, San Diego, CA, November 2004.
[23] N. Jain, D. Kit, P. Mahajan, P. Yalagandula, M. Dahlin, and Y. Zhang. STAR: Self-tuning aggregation for scalable monitoring (extended). Technical Report TR-07-15, UT Austin Department of Computer Sciences, March 2007.
[24] R. Keralapura, G. Cormode, and J. Ramamirtham. Communication-efficient distributed monitoring of thresholded counts. In SIGMOD, 2006.
[25] V. Kumar, B. F. Cooper, and S. B. Navathe. Predictive filtering: A learning-based approach to data stream filtering. In DMSN, 2004.
[26] S. R. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. TAG: A Tiny AGgregation service for ad-hoc sensor networks. In OSDI, 2002.
[27] A. Manjhi, V. Shkapenyuk, K. Dhamdhere, and C. Olston. Finding (recently) frequent items in distributed data streams. In ICDE, 2005.
[28] C. Olston, J. Jiang, and J. Widom. Adaptive filters for continuous queries over distributed data streams. In SIGMOD, 2003.
[29] C. Olston and J. Widom. Offering a precision-performance tradeoff for aggregation queries over replicated data. In VLDB, September 2000.
[30] C. G. Plaxton, R. Rajaraman, and A. W. Richa. Accessing nearby copies of replicated objects in a distributed environment. In ACM SPAA, 1997.
[31] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. A scalable content addressable network. In SIGCOMM, 2001.
[32] A. Rowstron and P. Druschel. Pastry: Scalable, distributed object location and routing for large-scale peer-to-peer systems. In Middleware, 2001.
[33] J. Shneidman, P. Pietzuch, J. Ledlie, M. Roussopoulos, M. Seltzer, and M. Welsh. Hourglass: An infrastructure for connecting sensor networks and applications. Technical Report TR-21-04, Harvard University, 2004.
[34] A. Singla, U. Ramachandran, and J. Hodgins. Temporal notions of synchronization and consistency in Beehive. In SPAA, 1997.
[35] A. Skordylis, N. Trigoni, and A. Guitton. A study of approximate data management techniques for sensor networks. In WISES, Fourth Workshop on Intelligent Solutions in Embedded Systems, 2006.
[36] I. Stoica, R. Morris, D. Karger, F. Kaashoek, and H. Balakrishnan. Chord: A scalable peer-to-peer lookup service for internet applications. In SIGCOMM, 2001.
[37] R. van Renesse, K. Birman, and W. Vogels. Astrolabe: A robust and scalable technology for distributed system monitoring, management, and data mining. TOCS, 21(2):164-206, 2003.
[38] M. Wawrzoniak, L. Peterson, and T. Roscoe. Sophia: An information plane for networked systems. In HotNets-II, 2003.
[39] B. White, J. Lepreau, L. Stoller, R. Ricci, S. Guruprasad, M. Newbold, M. Hibler, C. Barb, and A. Joglekar. An integrated experimental environment for distributed systems and networks. In OSDI, pages 255-270, Boston, MA, December 2002.
[40] P. Yalagandula and M. Dahlin. A scalable distributed information management system. In SIGCOMM, August 2004.
[41] P. Yalagandula, P. Sharma, S. Banerjee, S.-J. Lee, and S. Basu. S3: A scalable sensing service for monitoring large networked systems. In Proceedings of the SIGCOMM Workshop on Internet Network Management, 2006.
[42] H. Yu and A. Vahdat. Design and evaluation of a continuous consistency model for replicated services. In OSDI, pages 305-318, 2000.
[43] H. Yu and A. Vahdat. Design and evaluation of a conit-based continuous consistency model for replicated services. ACM Trans. on Computer Systems, 20(3), August 2002.
[44] B. Y. Zhao, J. D. Kubiatowicz, and A. D. Joseph. Tapestry: An infrastructure for fault-tolerant wide-area location and routing. Technical Report UCB/CSD-01-1141, UC Berkeley, April 2001.
