Scalable Hierarchical Coarse Grained Timers

ACM Operating Systems Review, January 2000, Volume 34, Number 1, Pages 11-20

Rohit Dube

High Speed Networks Research, Room 4C-508, Bell Labs, Lucent Technologies

101 Crawfords Corner Road, Holmdel, NJ 07733. Email: rohit [email protected]

Abstract— Several network servers and routing and signalling protocols need a large number of events to be scheduled off timers. Some of these applications can withstand a bounded level of inaccuracy in when the timer is scheduled. In this paper we describe a novel mechanism called “scalable hierarchical coarse-grained timers” which can handle the scheduling of a large number of events while incurring a minimum of cpu and memory overhead. The techniques presented here were implemented on a commercial IP routing system and are used by the routing stack to damp flapping BGP routes. The paper reflects our experiences in carrying out this implementation and the subsequent performance analysis.

I. INTRODUCTION

With the explosive growth of network use, a plethora of new services and protocols have been developed. Virtually all of these applications assume a timer facility, both in the definition of the service or protocol and in its implementation. Routing and signalling protocols, especially those that deal with a large number of routes and/or states, make intense use of the timer abstraction. For example, a router on the border of a routing domain peering with routers in other routing domains using the Border Gateway Protocol (BGP [15]) may experience a large number of route fluctuations due to network instability and may need to hold down these routes till they stabilize [19]. In most router systems, this requires extensive use of the timer sub-system.

Timer implementations typically have a practical limit on how far they scale in terms of sheer numbers. With any reasonable timer implementation, this limit is dictated by either the cpu or memory usage of the timer module. Given the number of events that can be handled by such a module, there exist many network applications which require an order of magnitude or more events to be scheduled and descheduled off timers than what the module can handle. If these events use a timer each, it follows that the system resources cannot meet the processing requirements of the application.

Oftentimes these very applications have a high tolerance to inaccuracies in the scheduling of the events and spend a comparatively small amount of time processing the event as compared to other applications. This allows for aggregating events which are to be scheduled within a short duration of each other. This set of events can be scheduled using a single timer, reducing the total number of timers in use as well as the overhead of the timer sub-system by processing multiple events per timer. In the following sections we describe a technique which combines this allowed coarseness with hierarchical stacking of timer modules to produce a highly scalable timer sub-system.

A timer module using the concepts described in this paper was implemented on a commercial IP router. The BGP route-flap damping code was then modified to use this new timer module. As we demonstrate in the subsequent sections, the implementation is modular, highly scalable and lends itself to tuning as per an application's requirements.

We start with a description of the main concepts of this technique in Section II. This is followed by implementation details in Section III, a real-life application to BGP route-flap damping in Section IV and an analysis of the design in Section V. Related work is discussed in Section VI just before we conclude in Section VII.

II. HIERARCHICAL TIMERS AND COARSENESS

Most complete systems like operating systems or routers have a very high resolution clock (potentially supplied by the hardware) around which a timer module is built. Such a timer module, which is typically implemented in software, has a resolution in the microsecond to millisecond range. The clock and the timer module represent the bottom two layers of the hierarchy depicted in figure 1. Well implemented timer modules scale into the range of thousands in terms of outstanding timer events and meet the requirements of most applications which use them directly.

However, the Level 0 timer module is not adequate for certain applications which schedule and deschedule upwards of a hundred thousand events off timers. This limitation stems from the total amount of memory required by the Level 0 timers, one of which is needed for each event scheduled by the application.

[Figure 1. Hierarchical Timers: a low-level clock at the bottom, the Level 0 timer module above it, and N Level 1 timer modules (1, 2, ..., N) layered on top.]

More seriously, the cpu overhead of firing a Level 0 timer (which may be a system call) per event is prohibitive and can severely degrade the performance of other sub-systems supported by the Level 0 timers. Such applications, one of which is discussed later in Section IV, can be supported with a Level 1 timer module which makes use of the primitives provided by the timer module at Level 0. In theory, one can stack a third layer of timers on top of an instance of Level 1 timers, and so on.

Two properties need to be satisfied by a timer module layered on top of another. First, it must have a coarser resolution compared to the resolution of the lower level timer module it sits on. Second, it must have the ability to bundle events which are scheduled at the same instant, using a single lower level timer.

[Figure 2. Scheduling with coarse resolution: a time line in which short notches mark the Level 0 resolution, long notches mark the Level 1 resolution, and events 1-4 are shown waiting to be scheduled.]

Consider the time line in figure 2. The shorter notches in the diagram stand for the resolution of Level 0 timers whereas the longer notches depict the resolution for Level 1. The small buckets at the bottom represent events to be scheduled. If these events are scheduled using Level 1 timers, they are forced to align along the resolution boundary of the Level 1 timers. In this process, some of them, like events 3 and 4, are scheduled at the same time and can be managed by the same Level 1 timer. This bundling due to the coarse resolution of the Level 1 timers allows the Level 1 module to support a large number of events using a relatively small number of Level 0 timers. Note that a Level 1 timer is a Level 0 timer with one or more events queued off it. For clarity, we will refrain from using the term ‘Level 1 timer’. Instead, we will refer to these timers as ‘Level 0 timers with multiple queued events’.
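As a minimal sketch of this alignment (the helper below is hypothetical and not taken from the paper's code), an event's requested pop time can be rounded up to the next Level 1 resolution boundary; events whose rounded pop times coincide can share a Level 0 timer. With a resolution of 8 seconds, for instance, pop times of 17 and 20 seconds both round to 24 and are bundled together.

/* Hypothetical helper: round a requested pop time (seconds from now) up to
 * the next Level 1 resolution boundary.  Rounding up ensures an event never
 * fires earlier than requested; events with equal results can be bundled. */
static unsigned long
align_to_resolution(unsigned long poptime_sec, unsigned long resolution_sec)
{
    return ((poptime_sec + resolution_sec - 1) / resolution_sec) * resolution_sec;
}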

Since the Level 1 module makes use of Level 0 timers, the features provided by and the performance of the Level 0 module directly affect the Level 1 timers. For example, if the Level 0 timers provide a low-priority or bulk facility, the Level 1 module can be made to exclusively use this facility. This helps the overall system stability, as higher priority events can be scheduled by the Level 0 module over the Level 1 events, which are resistant to inaccuracies and slip.

III. IMPLEMENTATION DETAILS

An implementation of a Level 1 module was carried out on a commercial IP routing system [12]. The implementation was done using C [9] and is platform independent, having been tested on Linux [14] and Solaris [17] besides the IP router. In this section we describe the salient points of this implementation.

A. Level 1 Timer Interface

Like any other timer implementation, the Level 1 timer module has the standard Application Programming Interface (API):

timer_init(resolution, maxevents, handlers, callback);
timer_schedule(instance, event, poptime);
timer_deschedule(instance, event);

The differences over typical timer APIs come from supporting multiple instances of this module and from the use of handlers, which are needed because the module does not understand the internal structure of the events to be scheduled and descheduled but needs access to some of the fields of the event's data structures for efficient queuing and dequeuing.

The ‘resolution’ parameter in the timer_init() call is a measure of coarseness in seconds. The module ensures that any Level 0 timers used will be spaced at ‘resolution’ seconds. The lowest resolution supported by our implementation is 1 second. The second parameter, ‘maxevents’, is the maximum number of events that can be queued off a single Level 0 timer. This is important so that the user has a way of preventing the system from going into an extended loop in the case where a large number of events are queued to be fired at the same time. The ‘callback’ is a function or a method provided by the user application to process the event when the timer fires. The ‘callback’ is unusual in the sense that it can process multiple events with the aid of the ‘handlers’.

The timer_init() call returns an ‘instance’, which is a reference to the instantiation of the Level 1 module being referred to. The ‘instance’ is an input to the schedule and deschedule calls. The ‘event’ parameter is a pointer or reference to the event being scheduled or descheduled. The ‘poptime’ parameter is the time increment from when the call is made to when the timer is expected to be fired (i.e. the scheduled time of the event).
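A minimal sketch of what this interface could look like in C is given below. Only the three calls and their parameters come from the paper; the type names, the handler fields and the callback signature are assumptions made for illustration, with events treated as opaque pointers.

#include <stddef.h>

struct l1_instance;                       /* one instantiation of the Level 1 module */

/* Handlers supplied by the application so the module can queue and dequeue
 * events whose internal structure it does not understand. */
struct l1_handlers {
    void (*enqueue)(void *event, void **queue_head);
    void (*dequeue)(void *event, void **queue_head);
};

/* Callback invoked when a Level 0 timer fires; it may be handed a whole
 * queue of bundled events rather than a single one. */
typedef void (*l1_callback)(void *queue_head, size_t nevents);

struct l1_instance *timer_init(unsigned resolution_sec, size_t maxevents,
                               const struct l1_handlers *handlers,
                               l1_callback callback);
int timer_schedule(struct l1_instance *instance, void *event,
                   unsigned long poptime_sec);
int timer_deschedule(struct l1_instance *instance, void *event);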

B. Scheduling and Descheduling

As the astute reader may have realized, queuing multiple events off the same Level 0 timer requires a scheme to quickly seek to an existing timer. When a timer_schedule() call is made to schedule an event, the module needs to determine if there is an existing timer with space to accommodate a new event. Similarly, when an application decides to deschedule an event, the module needs to seek to the timer queue holding the event and dequeue it.

This seek to an existing timer is a search problem which can be solved efficiently with a balanced-tree or a hash-table lookup. But for either of these search schemes to be used, the module needs a key to uniquely identify a timer. We use the absolute time since the module is initialized to generate this key. A timer is simply identified by the time difference between when the timer is to be fired and the initialization time. We use a red-black tree (rb-tree) [2] as the search mechanism. Figure 3 shows a block level view of the software architecture: the Level 1 module sits on top of and is implemented using the primitives from an rb-tree module and the Level 0 timer module.
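A sketch of this keying scheme follows (the function name and the exact unit of the key are assumptions; the paper only states that the key is the time from initialization to the timer's firing). The key is the event's firing time, aligned to the module's resolution and expressed relative to the initialization time of the instance:

#include <time.h>

/* Hypothetical rb-tree key: the event's firing time, rounded up to the next
 * resolution boundary and expressed in seconds since the instance was
 * initialized, so all events bundled on one timer map to the same key. */
static unsigned long
l1_timer_key(time_t init_time, time_t now, unsigned long poptime_sec,
             unsigned long resolution_sec)
{
    unsigned long fire_sec = (unsigned long)(now - init_time) + poptime_sec;
    return ((fire_sec + resolution_sec - 1) / resolution_sec) * resolution_sec;
}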

[Figure 3. API Layering: the Level 1 module is built on top of the rb-tree module and the Level 0 timer module.]

A search mechanism such as an rb-tree makes the implementation of the timer_schedule() call efficient, but not necessarily that of the timer_deschedule() call. This is because multiple events may be scheduled off the same timer and searching for the event to be descheduled involves yet another search. To solve this problem, the module requires the user application to provide ‘handlers’ in the timer_init() call, using which the module maintains the events queued off a timer in a doubly linked list. This allows for efficient queueing onto and dequeuing from the timer queue.
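One plausible way for an application to support such handlers (illustrative only; the structure, field and function names below are assumptions and are independent of the interface sketch above) is to embed the queue links directly in its event structure, so that enqueue and dequeue are constant-time pointer operations:

#include <stddef.h>

/* Hypothetical application event with embedded timer-queue links. */
struct bgp_flap_event {
    struct bgp_flap_event *tq_next;   /* next event on the same timer queue  */
    struct bgp_flap_event *tq_prev;   /* previous event on the same queue    */
    /* ... application fields, e.g. the prefix being damped ...              */
};

/* Push an event onto the head of a timer queue. */
static void
flap_enqueue(struct bgp_flap_event *ev, struct bgp_flap_event **head)
{
    ev->tq_prev = NULL;
    ev->tq_next = *head;
    if (*head != NULL)
        (*head)->tq_prev = ev;
    *head = ev;
}

/* Unlink an event from its timer queue in constant time. */
static void
flap_dequeue(struct bgp_flap_event *ev, struct bgp_flap_event **head)
{
    if (ev->tq_prev != NULL)
        ev->tq_prev->tq_next = ev->tq_next;
    else
        *head = ev->tq_next;
    if (ev->tq_next != NULL)
        ev->tq_next->tq_prev = ev->tq_prev;
    ev->tq_next = ev->tq_prev = NULL;
}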

Finally, since the module allows the user to place a bound on the maximum number of events to be queued off a single timer, multiple timers are needed to accommodate events in excess of ‘maxevents’. The module maintains multiple Level 0 timers in a doubly linked list attached to the rb-tree node. As far as the Level 1 module is concerned, these timers fire at the same time. Of course, the Level 0 module which actually fires the timers will space them depending on the load that it sees. Since the Level 1 module is coarse-grained to begin with, the loss of accuracy is not a cause for concern.

[Figure 4. Run-time View: Level 1 timer nodes, each holding Level 0 timers whose scheduled events sit in timer queues.]

Figure 4 shows a run-time view of a module instance. The rightmost Level 1 timer node is expanded out. This node contains two Level 0 timers, the first of which is filled to capacity while the second holds the overflow. A few additional notes need to be made at this point. First, the Level 0 timer nodes belonging to the same Level 1 node are arranged in a doubly linked list. This is to facilitate easy cancellation and removal of a Level 0 timer in case all the events from one of the Level 0 timers are descheduled. Second, when a new event is scheduled, the linked list of timers is traversed as the module looks for space in the first available timer. Thus a counter or a flag needs to be maintained in the Level 0 node, indicating the number of events queued off the node. If no timer has fewer than ‘maxevents’ events in its queue, a new Level 0 timer is obtained and a new Level 0 node is added to the linked list. Third, it is possible at run-time that more than one of the queues hanging off Level 0 nodes which are in turn attached to the same Level 1 node are less than full. These queues can be compacted if desired. We left this optimization out of our implementation, as the overhead of compacting and the subsequent cost of maintaining the optimization in large systems were deemed greater than the potential savings in Level 0 timers.
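A sketch of the scheduling walk described above is given below. The data structures and the l0_timer_create() stand-in are assumptions for illustration; the paper does not publish its data structure definitions.

#include <stdlib.h>

/* Stand-in for acquiring a timer from the Level 0 module. */
static void *l0_timer_create(unsigned long key) { (void)key; return NULL; }

/* Hypothetical Level 0 node: one underlying timer plus its event queue. */
struct l0_node {
    struct l0_node *next, *prev;    /* Level 0 timers firing at the same time */
    unsigned long   nevents;        /* occupancy counter for this queue       */
    void           *l0_timer;       /* handle returned by the Level 0 module  */
};

/* Hypothetical Level 1 node: one rb-tree entry per aligned firing time. */
struct l1_node {
    unsigned long   key;            /* aligned firing time, relative to init  */
    struct l0_node *timers;         /* doubly linked list of Level 0 nodes    */
};

/* Return a Level 0 node with spare capacity, creating one if all are full. */
static struct l0_node *
find_or_add_l0(struct l1_node *n, unsigned long maxevents)
{
    struct l0_node *t;

    for (t = n->timers; t != NULL; t = t->next)
        if (t->nevents < maxevents)
            return t;               /* first timer with space wins            */

    t = calloc(1, sizeof(*t));      /* all queues full: add a new Level 0 node */
    if (t == NULL)
        return NULL;
    t->l0_timer = l0_timer_create(n->key);
    t->next = n->timers;
    if (n->timers != NULL)
        n->timers->prev = t;
    n->timers = t;
    return t;
}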

IV. APPLICATION TO BGP ROUTE-FLAP DAMPING

The Border Gateway Protocol (BGP [7], [8], [15]) is the inter-domain routing protocol of choice in the Internet today. Most Internet service providers make a distinction between their border routers (which import routes from their customers and other service providers) and core routers (which are higher capacity and see half a million to a million prefixes). The Internet sees a great amount of instability, which leads to constant flapping of routes [10], [11].

If these flaps are allowed to propagate all the way to the core routers, their performance is seriously impacted. Service providers therefore turn on flap-damping [19] on their border routers (see figure 5) for routes learnt from routers external to their network. Flapping routes are held down till they stabilize before they are passed on to internal routers which are part of the core network.

[Figure 5. BGP border router damps route flaps: external routes learnt over external BGP sessions are damped before being advertised over internal BGP sessions to the core.]

During periods of serious disruption in the networks of other providers, several thousand routes can flap up and down. The border router needs to be able to damp all of these. Naively implemented damping uses a Level 0 timer for each flapping route. The timer is set to fire after the period of the hold-down. In case the route flaps again, the timer needs to be canceled and scheduled again for a later time (or rescheduled, depending on the Level 0 primitive supported). During large-scale network outages, the usage of Level 0 timers goes up as the number of flapping routes increases in volume, severely slowing down the timer sub-system and affecting other protocols, which may drop into failure modes.

Route-flap damping does not require a great deal of accuracy, and routes held down for a few additional seconds do not impact the network adversely by much. This presents an ideal opportunity for using a Level 1 timer module.

Indeed, the implementation described in the preceding sections was grafted onto a commercial IP router [12]. The direct use of Level 0 timers was taken out and replaced by a Level 1 module within a matter of days. Most of the work involved came from coding the ‘handlers’, modifying the ‘callback’ function and testing the system for regression. Even so, the overall task of migrating to the Level 1 module was simple.
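As an illustration of how damping code might drive the Level 1 interface (a hypothetical fragment, reusing the assumed API and event sketches from Section III; the router's actual code is not shown in the paper), each flapping route owns a single event which is pushed out to a later pop time whenever the route flaps again:

struct l1_instance;                         /* from the API sketch, Section III-A */
struct bgp_flap_event;                      /* application event, Section III-B   */
int timer_schedule(struct l1_instance *instance, void *event,
                   unsigned long poptime_sec);
int timer_deschedule(struct l1_instance *instance, void *event);

/* Hypothetical damping step: when a route flaps, cancel any outstanding
 * hold-down for it and re-arm the event for the new, later release time. */
static void
route_flapped(struct l1_instance *damp_timers, struct bgp_flap_event *ev,
              int already_scheduled, unsigned long holddown_sec)
{
    if (already_scheduled)
        timer_deschedule(damp_timers, ev);
    timer_schedule(damp_timers, ev, holddown_sec);
}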

V. ANALYSIS

An application using Level 1 timers in place of Level 0 timers decreases both the memory and cpu usage of the system. The improvement in memory usage comes simply because fewer Level 0 timers are needed by the system.

Resolution      Maxevents
(in seconds)    10      50      250
 2              98      98      98
 4              92      92      92
 8              85      85      85
16              77      77      77
32              64      64      64

TABLE I. 100 EVENTS

Similarly, the cpu usage is improved because the cost of scheduling a timer and subsequently firing or descheduling it is amortized over multiple events which are processed from the same callback. This is especially true for systems which implement the Level 0 module in kernel space, typically with pre-allocated memory. If the Level 1 module is implemented in user space on these systems, the failure characteristics of the system are improved in addition to memory scalability: the kernel memory usage is decreased as fewer Level 0 timers are used, as is the probability of an event being denied a timer. Further, the cpu utilization is minimized as the expensive kernel-user boundary is crossed once per Level 0 timer scheduled. The total number of Level 0 timers used is a small fraction of the total number of events scheduled, implying that there are fewer kernel-user crossings in all. The following experiments corroborate this claim.

A. Experimental Results

Tables I to V show data obtained from sample runs using the previously described implementation of the Level 1 module. The metric that we record is the number of Level 0 timers actually used by the module for varying ‘resolution’, ‘maxevents’ and the total number of events scheduled (see Section III for definitions). We recorded runs with resolutions of 2, 4, 8, 16 and 32 seconds and maxevents of 10, 50 and 250 under loads of 100, 1000, 10000, 100000 and 1000000 events. In all cases, the events were uniformly distributed to be scheduled over a period of 3600 seconds (one hour). During the experimental runs, special care was taken to ensure that all the modules had adequate memory. No failures were observed on the system.

In the experiment with 100 events (table I), the number of Level 0 timers used is comparable to the number of events scheduled. This is to be expected as the time scale over which the events are scheduled is quite large compared to the total number of events.

Resolution      Maxevents
(in seconds)    10      50      250
 2              780     780     780
 4              617     617     617
 8              410     410     410
16              226     224     224
32              143     113     113

TABLE II. 1,000 EVENTS

Resolution      Maxevents
(in seconds)    10      50      250
 2              1849    1797    1797
 4              1391     902     902
 8              1196     452     452
16              1107     269     226
32              1047     235     113

TABLE III. 10,000 EVENTS

Resolution      Maxevents
(in seconds)    10      50      250
 2              10814   3154    1849
 4              10417   2586     924
 8              10207   2238     475
16              10106   2114     457
32              10050   2059     454

TABLE IV. 100,000 EVENTS

Resolution      Maxevents
(in seconds)    10       50      250
 2              101029   21131   5543
 4              100513   20553   4619
 8              100259   20283   4244
16              100131   20141   4141
32              100058   20072   4074

TABLE V. 1,000,000 EVENTS

The experiment with 1,000,000 events (table V) shows impressive savings, with a best case of 4074 Level 0 timers proving adequate for scheduling all the events.

Looking through a row of any of tables II, III, IV and V, it is clear that the number of Level 0 timers used is reduced as the maximum number of events queued off a single timer increases. Similarly, as the coarseness increases (or the resolution drops), the number of timers needed decreases. The explanation for both observations is the higher occupancy of the timer queues. (As of this writing we are unable to make available results comparing the cpu and memory usage with and without the Level 1 module. These will be included in the final version of the paper if required.)

[Figure 6. Comparison Plot: log10(timers used) versus log10(events scheduled) for ‘Level 1, resolution 8 seconds, maxevents 250’ and ‘Level 0 only’.]

In the application discussed in Section IV, a setting of 8 seconds for the ‘resolution’ and 250 for the ‘maxevents’ would be considered appropriate. Figure 6 plots the number of Level 0 timers used for this setting against the expected number of Level 0 timers if they were used directly. Note that the axes are log10() in order to meaningfully accommodate the large numbers from the experiments.

B. Algorithmic Analysis

Having discussed the empirical results, an algorithmic analysis of this scheme is in order. The point of interest here is the performance of the timer_schedule() and timer_deschedule() calls. As we show below, the running time of these calls largely depends on whether a Level 0 timer call is required for the operation.

We start with some terminology: assume that the number of Level 1 nodes in the steady state is n and that the average number of Level 0 timers per Level 1 node is b. Further assume that the time taken to insert (delete) an event in the queue for a Level 0 timer is the constant Qa (Qd). The time taken to create (cancel) a Level 0 timer is dependent on the implementation of that module; we assume it is given by the function La() (Ld()).

A timer_schedule() call takes O(log(n)) (for the rb-tree search) + O(b) (to walk through the Level 0 nodes) + Qa (to insert the event in the queue) time when an existing timer can accommodate the item. If there is an existing Level 1 timer but no space in any of the existing queues, the call takes O(log(n)) + O(b) (to search the tree) + La() (to create a new Level 0 timer) + Qa time. If there is no matching Level 1 timer in the tree, the call takes O(log(n)) + La() + Qa time.

timer_deschedule() calls take Qd time when the descheduled event doesn't leave behind an empty timer queue, and O(b) + Qd + Ld() when the Level 1 node stays intact but a Level 0 node is to be deleted because its queue has become empty. Finally, if the Level 1 node itself is deleted in addition to the Level 0 node because there are no events queued at all, the running time is O(log(n)) + Qd + Ld(), the O(log(n)) coming from the delete operation on the rb-tree.

The case where a new Level 1 node is created or deleted is not treated further as the properties directly reflect those of the rb-tree, where inserts, deletes and searches all take O(log(n)) time. Also, note that ‘b’ is expected to be small in practice and the doubly linked list suffices for small ‘b’. If this is not the case for a set of applications, the list can be replaced with a red-black tree to yield O(log(b)) time instead of O(b).

La() and Ld() overshadow the other costs, as acquiring or canceling a Level 0 timer minimally requires an API boundary crossing. This API boundary crossing is often an expensive system call. Note that without the Level 1 module, scheduling an event would always cost La() whereas descheduling would always cost Ld(). Hence, as the occupancy of the timer queues (figure 4) increases, the running time of the Level 1 module improves, as most schedule and deschedule calls complete without requiring a Level 0 operation.
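As a rough worked figure (our reading of the table data, under the assumption that each Level 0 timer reported in Table V corresponds to one La() call), the fraction of schedule operations in the 1,000,000-event run that had to cross into the Level 0 module at the setting discussed with figure 6 is approximately:

\frac{\text{Level 0 timers used}}{\text{events scheduled}} = \frac{4244}{10^{6}} \approx 0.42\%
\qquad \text{(Table V, resolution 8 seconds, maxevents 250)}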

VI. RELATED WORK

Brown [1] and Davison [4] independently discovered calendar queues, which are modeled after desk calendars and can be used to implement a timer facility. Varghese et al. [18] describe a way of building scalable timer implementations using cascaded timing wheels, each of which is similar to a calendar queue. These techniques qualify as Level 0 in the hierarchy of figure 1 and have been used by multiple Unix-like operating systems to implement the timeout facility [3].

The concept of using hierarchical timers has been implicitly used at various times in the operating systems and data networks community. Most significantly, various flavors of BSD [13] implement the TCP/IP stack by layering protocol timers on top of a few kernel timers obtained from the timeout facility. The kernel timers in use by the stack fire at every tick, and the callback routine that is called walks through the active protocols, calling one function per protocol which in turn processes the protocol timers which need to be fired at that instant. Further implementation details on this can be found in [20].

Sharma et al. [16] describe a way of dynamically adjusting timers in soft state protocols in order to keep the protocol control traffic bounded. The idea of dynamic adjustments can be applied to the Level 1 timer implementation by allowing on-the-fly changes to the ‘resolution’ and ‘maxevents’ parameters which control the occupancy of the timer queues and hence the efficiency of the module.

VII. SUMMARY

In this paper we have described in detail a novel mechanism which trades off accuracy in favor of scalability. The result is a highly scalable timer module built on top of existing finer granularity timer implementations with the help of a fast access algorithm (in our implementation, a red-black tree). This mechanism has been implemented in a commercial system, where one of its applications is to damp BGP route-flaps which in the worst case generate a load of half a million to one million prefixes, i.e. events which need to be scheduled off timers.

Note that the design discussed here may not be suitable (without modification) for certain applications, as it does not provide a way to directly control the jitter of the timers. For example, routing networks which can synchronize without deliberate jitter in the control messages [6] may not be built on top of the module as described here. On the other hand, jitter is a functionality typically provided by Level 0 modules, and the Level 1 API can be extended to control the jitter of the lower level timers. Scale permitting, applications are of course free to use the Level 0 timers directly.

Acknowledgements

We would like to thank David Ward (IENG) and Ping Pan (Bell Labs) for discussions which started the train of thought that led to the development of this idea. We would also like to thank Bernhard Suter and Lampros Kalampoukas (Bell Labs) and Sambit Sahu (University of Massachusetts) for helpful comments on preliminary versions of this paper, and Shivkumar Haran (Lucent Technologies) for helping debug the implementation.

Note: The author's employers may patent the ideas presented in this paper [5].

REFERENCES

[1] R. Brown. Calendar Queues: A Fast O(1) Priority Queue Implementation for the Simulation Event Set Problem. Communications of the ACM, 31(10), 1988.
[2] T.H. Cormen, C.E. Leiserson, and R.L. Rivest. Introduction to Algorithms. McGraw-Hill, 1991.
[3] A. Costello and G. Varghese. Redesigning the BSD Callout and Timer Facilities. Technical Report 95-23, Washington University, St. Louis, MO, 1995.
[4] G. Davison. Calendar P's and Queues. Communications of the ACM, 32(10), 1989.
[5] R. Dube. Scalable Hierarchical Coarse-grained Timers. Patent Application.
[6] S. Floyd and V. Jacobson. The Synchronization of Periodic Routing Messages. In SIGCOMM Conference. ACM, 1993.
[7] B. Halabi. Internet Routing Architectures. Cisco Press, 1997.
[8] J.W. Stewart III. BGP4: Inter-Domain Routing in the Internet. Addison-Wesley, 1998.
[9] B.W. Kernighan and D.M. Ritchie. The C Programming Language. Prentice Hall, 1988.
[10] C. Labovitz, G.R. Malan, and F. Jahanian. Internet Routing Instability. In SIGCOMM Conference. ACM, 1997.
[11] C. Labovitz, G.R. Malan, and F. Jahanian. Origins of Internet Routing Instability. In INFOCOM Conference. IEEE, 1999.
[12] Lucent Technologies. PacketStar 6400 Series IP Switch On-line User Documentation, 1999. Release 1.1.
[13] M.K. McKusick, K. Bostic, M.J. Karels, and J.R. Quarterman. The Design and Implementation of the 4.4BSD Operating System. Addison-Wesley, 1996.
[14] Red Hat. Linux OS, 6.0 Intel edition, 1999. http://www.redhat.com.
[15] Y. Rekhter and T. Li. A Border Gateway Protocol (BGP-4), March 1995. IETF RFC 1771.
[16] P. Sharma, D. Estrin, S. Floyd, and V. Jacobson. Scalable Timers for Soft State Protocols. In INFOCOM Conference. IEEE, 1997.
[17] Sun Microsystems. SunOS, 5.5.1 SPARC edition, 1997. http://www.sun.com.
[18] G. Varghese and A. Lauck. Hashed and Hierarchical Timing Wheels: Efficient Data Structures for Implementing a Timer Facility. IEEE/ACM Transactions on Networking, 5(6), 1997.
[19] C. Villamizar, R. Chandra, and R. Govindan. BGP Route Flap Damping, November 1998. IETF RFC 2439.
[20] G.R. Wright and W.R. Stevens. TCP/IP Illustrated, Volume 2. Addison-Wesley, 1994.