
Failover, Load Sharing and Server Architecture in SIP Telephony

Kundan Singh and Henning Schulzrinne
Department of Computer Science, Columbia University

1214 Amsterdam Ave, Mail Code 0401, New York, NY 10027, USA
Email: {kns10,hgs}@cs.columbia.edu, tel:+1-212-9397000, fax:+1-212-6660140

Abstract

We apply some of the existing web server redundancy techniques for high service availability and scalability to the relatively new IP telephony context. The paper compares various failover and load sharing methods for registration and call routing servers based on the Session Initiation Protocol (SIP). In particular, we consider SIP server failover techniques based on the clients, DNS (Domain Name Service), database replication and IP address takeover, and load sharing techniques using DNS, SIP identifiers, network address translators and servers with the same IP address. We describe our two-stage reliable and scalable SIP server architecture in which the first stage proxies the request to one of the second stage server groups based on the destination user identifier. We quantitatively evaluate the performance improvement of the load sharing architecture using our SIP server, and quantitatively compare the effect of SIP server architectures such as event-based and thread pool. Additionally, we present an overview of the failover mechanism we implemented in our test-bed using the open source MySQL database.

Keywords: Availability; scalability; failover; load sharing; SIP

1 Introduction

The Session Initiation Protocol (SIP) [1] is a distributed signaling protocol for IP telephony. SIP-based telephony services have been proposed as an alternative to the classical PSTN (public switched telephone network) and offer a number of advantages over the PSTN [2]. Traditionally, telephony service is perceived as more reliable than Internet-based services such as web and email. To ensure wide acceptance of SIP among carriers, SIP servers should demonstrate similar quantifiable guarantees on service availability and scalability. For example, PSTN switches have a "5 nines" reliability requirement, i.e., they are available 99.999% of the time, which implies at most about 5 minutes of outage a year.

SIP proxy servers are more lightweight than PSTN switches because they only route call signaling messages without maintaining any per-call state. The SIP proxy server of a domain is responsible for forwarding the incoming requests destined for the logical address of the form user@domain to the current transport address of the device used by this logical entity, and for forwarding the responses back to the request sender. Consider the example shown in Fig. 1. When a user, Bob, starts his SIP phone, it registers his unique identifier [email protected] with the SIP server in the home.com domain. The server maintains the mapping between his identifier and his phone's IP address. When another user, Alice, calls sip:[email protected], her phone does a DNS (Domain Name Service) lookup for the SIP service [3] of home.com and sends the SIP call initiation message to the resolved server IP address. The server "proxies" the call to Bob's currently registered phone. Once Bob picks up the handset, the audio packets can be sent directly between the two phones without going through the server. Further details [2, 1] of the call are skipped for brevity.

Figure 1: An example SIP call ((1) Bob's phone REGISTERs with the proxy/registrar, which stores the mapping in the database; (2) Alice's phone performs a DNS lookup, (3) sends the INVITE to the proxy, and (4) the proxy forwards the INVITE to Bob)

If the server fails for some reason, the call initiation or termination messages cannot be proxied correctly. (Note that the call termination message need not traverse proxy servers unless the servers record-route.) We can improve the service availability by adding a second server that automatically takes over in case of the failure of the first server. Secondly, if there are thousands of registered users and a single server cannot handle the load, then a second server can work along with the first server such that the load is divided between the two. Our goal is to provide the carrier grade capacity of one to ten million BHCA (busy hour call attempts) for IP telephony using commodity hardware. We describe some of the failover and load sharing techniques for SIP servers in Sections 3 and 4, respectively. These techniques also apply beyond telephony, for example, to SIP-based instant messaging and presence that use the same SIP servers for registration and message routing. Section 5 quantitatively evaluates the performance improvement of our load sharing architecture. Section 6 further evaluates and improves the performance based on the software architecture, such as event-based and thread pool.

2 Related Work

Failover and load sharing for web servers is a well-studied problem [4, 5, 6, 7]. TCP connection migration [8], IP address takeover [9] and MAC address takeover [10] have been proposed for high availability. Load sharing via connection dispatchers [11] and HTTP content or session-based request redirection [12, 13, 10] are available for web servers. Some of these techniques such as DNS-based load sharing [14, 15] also apply to other Internet services like email and SIP. Although SIP is an HTTP-like request-response protocol, there are certain fundamental differences that make the problem slightly different. For example, SIP servers can use both TCP and UDP transport, the call requests and responses are usually not bandwidth intensive, caching of responses is not useful, and the volume of data update (REGISTER message) and lookup (INVITE message) is often similar, unlike common read-dominated database and web applications.

For SIP server failover, IP anycast does not work well with TCP, and the backend database requires synchronization between the primary and backup servers [16]. Section 3.5 describes how to apply the IETF's Reliable Server Pooling (Rserpool [17, 18, 19]) architecture for SIP telephony. The primary disadvantage of Rserpool is that it requires new protocol support in the clients.

SIP-based telephony services exhibit three bottlenecks to scalability: signaling, real-time media data and gateway services. The signaling part requires high request processing capacity in the SIP servers. The data part requires enough network bandwidth and capacity (CPU and memory) in the end systems. The gateway part requires optimal placement of media gateways and switching components [20]. This paper focuses on the signaling part only. SIP allows redirecting a request to a less loaded server using the 302 response, or transferring an existing call dialog to a less loaded endpoint or gateway [1, 21].

Optimizations such as memory pool, counted strings and lazy parsing have been proposed for SIP servers [22, 23]. These optimizations can further improve our load sharing performance. Event and thread-based architectures, particularly for web servers, are well known in systems research. We study the effect of server architecture on SIP server performance using commodity hardware and standard POSIX threads.

3GPP's IP Multimedia Subsystem (IMS) uses SIP for call control to support millions of users. It defines different server roles such as the outbound proxy in the visited network, the interrogating proxy as the first point of contact for incoming calls in the home network, and the serving proxy providing services based on the subscriber's profile.

Identifier-based load balancing has been used for email. We combine this with DNS-based server redundancy for a two-stage reliable and scalable architecture. The novelty of our work lies in the application of existing techniques to the relatively new Internet telephony context, and in the quantitative performance evaluation of the architecture for SIP-based telephony.

We describe and compare some of these techniques in the context of SIP. We also present an overview of our implementation of failover and describe some practical issues.

3 Availability: Failover

High availability is achieved by adding a backup component such as the SIP server or user record database. Depending on where the failure is detected and who does the failover, there are various design choices: client-based, DNS-based, database failover and IP takeover.

3.1 Client-based failover

In the client-based failover (Fig. 2), Bob's phone knows the IP addresses of the primary and the backup servers, P1 and P2. It registers with both, so that either server can be used to reach Bob. Similarly, Alice's phone also knows about the two servers. It first tries P1, and if that fails it switches to P2.


Figure 2: Client-based failover (Bob REGISTERs with both P1 and P2; Alice INVITEs P1 first, then P2 on failure)

Figure 3: DNS-based failover (Bob REGISTERs with P1 and P2; Alice queries DNS for example.com, getting _sip._udp SRV 0 0 p1.example.com and SRV 1 0 p2.example.com, then INVITEs P1, falling back to P2)

All failover logic is built into the client. The servers operate independently of each other. This method is used by the Cisco IP phones [24]. Configuring phones with the two server addresses works well within a domain. However, DNS can be used to allow adding or replacing backup servers without changing the phone configurations, as described next.

3.2 DNS-based failover

DNS-based failover using NAPTR and SRV records is the cleanest and hence the preferred way to fail over [3]. For instance, Alice's phone can retrieve the DNS SRV [14] record for _sip._udp.home.com to get the two server addresses (Fig. 3). In the example, P1 is preferred over P2 by assigning a lower numeric priority value to P1.

Alternatively, dynamic DNS can be used to update the A-record for home.com from the IP address of P1 to P2 when P1 fails. P2 can periodically monitor P1 and update the record when P1 is dead. Setting a low time-to-live (TTL) for the A-record bindings can reduce the failover latency due to DNS caching [25].
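As an illustration, the following minimal Python sketch (not from the paper's implementation; the hardcoded record list and the simplified probe are assumptions) shows how a client can order SRV records by priority and fail over from p1 to p2:

    # Sketch of SRV-priority failover. Records are shown hardcoded; a real
    # client would fetch them for _sip._udp.home.com via a DNS query.
    import socket

    RECORDS = [  # (priority, weight, port, target); lower priority preferred
        (0, 0, 5060, "p1.example.com"),
        (1, 0, 5060, "p2.example.com"),
    ]

    # Simplified probe; a real SIP request needs Via, From, To, Call-ID, CSeq.
    REQUEST = b"OPTIONS sip:[email protected] SIP/2.0\r\n\r\n"

    def try_server(host, port, timeout=2.0):
        """Send a request over UDP and treat any reply as server-alive."""
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.settimeout(timeout)
        try:
            s.sendto(REQUEST, (host, port))
            s.recvfrom(4096)
            return True
        except OSError:          # timeout, unresolvable host, ICMP error
            return False
        finally:
            s.close()

    def send_with_failover(records):
        for prio, weight, port, target in sorted(records):
            if try_server(target, port):
                return target
        raise RuntimeError("all servers failed")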

3.3 Failover based on database replication

Figure 4: Failover based on database replication (Bob REGISTERs with P1, which stores the mapping in the master database D1; D1 replicates to the slave D2 used by P2; Alice's INVITE is proxied to Bob)

Not all SIP phones are capable of registering with multiple servers. Moreover, to keep the server failover architecture independent of the client configuration, the client can register with only P1, which can then propagate the registration to P2. If a database is used to store the user records, then replication can be used as shown in Fig. 4. Bob's phone registers with the primary server, P1, which stores the mapping in the database D1. The secondary server, P2, uses the database D2. Any change in D1 is propagated to D2. When P1 fails, P2 can take over and use D2 to proxy the call to Bob. There could be a small delay before D2 gets the updated record from D1.

3.4 Failover using IP address takeover

If DNS-based failover cannot be used for some reason (e.g., not implemented in the client), then IP takeover [9] can also be used (Fig. 5). Both P1 and P2 have identical configurations but run on different hosts on the same Ethernet. Both servers are configured to use the external master database, D1. The slave, D2, is replicated from D1. The clients know the server IP address as P1's 10.1.1.1 in this example.

Figure 5: When the primary server fails

Figure 6: When the master database fails

Figure 7: Co-located database and proxy

P2 periodically monitors the activity of P1. When P1 fails, P2 takes over the IP address 10.1.1.1. Now, all requests sent to the server address will be received and processed by P2. When D1 fails, P1 detects this and switches to D2 (Fig. 6). IP takeover is not used by D2, since the SIP servers can be modified to switch over when D1 fails. There can be a small failover latency due to the ARP cache.
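A minimal sketch of the monitor-and-takeover logic on P2 (illustrative only; the interface name, addresses, TCP health check and miss threshold are assumptions, and production deployments typically use a dedicated heartbeat tool):

    # Hypothetical takeover monitor on P2.
    import socket, subprocess, time

    PRIMARY = ("10.1.1.1", 5060)       # P1's address, also the virtual IP
    VIP, DEV = "10.1.1.1/24", "eth0"

    def primary_alive():
        try:
            socket.create_connection(PRIMARY, timeout=2).close()
            return True
        except OSError:
            return False

    misses = 0
    while misses < 3:                  # require 3 consecutive misses to avoid flapping
        misses = 0 if primary_alive() else misses + 1
        time.sleep(5)

    # Take over the IP and announce it with gratuitous ARP so that the
    # switch and neighbors update their ARP caches (iputils "arping -U").
    subprocess.run(["ip", "addr", "add", VIP, "dev", DEV], check=True)
    subprocess.run(["arping", "-U", "-c", "3", "-I", DEV, "10.1.1.1"], check=False)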

The architecture is transparent to the rest of the network (clients and DNS) and can be implemented without external assumptions. However, if the replication is only from the master to the slave, it requires modification in the SIP server software to first try D1, and if that fails use D2, so that all the updates are done to the master server. To avoid replicating the database, P1 can also propagate the REGISTER message to P2.

Alternatively, to avoid the server modification, the server and the associated database can be co-located on the same host as shown in Fig. 7. If the primary host fails, both P2 and D2 take over. P1 always uses D1, whereas P2 always uses D2.

3.5 Reliable server pooling

In the context of the IETF's Reliable Server Pooling architecture [17], Fig. 8 shows the client phone as the pool user (PU), P1 and P2 as the pool elements (PE) in the "SIP server pool", and D1 and D2 as PEs in the "database pool". P1 and P2 register with their home name server, NS2, which supervises them and informs the other name servers (NS) about these PEs. Similarly, D1 and D2 also register with the NS. The SIP servers are the pool users of the "database pool". A pool element is removed from the pool if it is out of service.

Figure 8: Reliable server pooling for SIP (the client (PU) performs name resolution via name servers NS1 and NS2 and accesses the SIP server pool P1, P2; the SIP servers in turn access the database pool D1, D2; pool elements register with the name servers)

When the client wants to contact the "SIP server pool", it queries one of the name servers, NS1, to get the list of P1 and P2 with relative priorities for failover and load sharing. The client chooses to connect to P1 and sends the call invitation. If P1 fails, the client detects this and sends the message to P2. For stateful services, P1 can exchange state information with another server, P2, and return the backup server, P2, to the client in the initial message exchange. This way the client knows which backup server to use in the case of failure. P1 can also give the client a signed cookie, similar to an HTTP cookie, which the client sends to the new failover server, P2, in the initial message exchange. This is needed for call stateful services such as conferencing, but not for SIP proxy server failover.

The SIP server, P1, queries the NS to get the list, D1 and D2, for the "database pool". D1 and D2 are backed up and replicated by each other, so they can return this backup server information in the initial message exchange.

The primary limitation is that this requires new protocol support for name resolution and aggregate server access in the clients. A translator can be used to interoperate with clients that do not support reliable server pooling. However, this makes the translator a single point of failure between the client and the server, hence limiting the reliability. Secondly, the name space is flat unlike the DNS hierarchy, and is designed for a limited scale (e.g., within an enterprise), but may be combined with wide-area DNS-based name resolution, for example. More work is needed in that context.

3.6 Implementation

We have used some of the above techniques in our Columbia InterNet Extensible Multimedia Architecture (CINEMA). The architecture [26, 27] consists of our SIP server, sipd, and a MySQL database for user profiles and system configuration. Other components such as the PSTN gateway and media servers are outside the scope of this paper. Configuration and management are done via a web interface that accesses various CGI (Common Gateway Interface) scripts written in Tcl on the web server. All the servers may run on a single machine for an enterprise setup.

Figure 9: Failover in CINEMA (phone.cs.columbia.edu and sip2.cs.columbia.edu each host a SIP server and database; DNS has _sip._udp SRV 0 0 5060 phone.cs and SRV 1 0 5060 sip2.cs; phones use proxy1=phone.cs with backup=sip2.cs; D1 and D2 are master and slave of each other and are also accessed by the web scripts)

For failover, we use two sets of identical servers on two different machines, as shown in Fig. 9. The database and SIP server share the same host. The databases are replicated using MySQL 4.0 replication [28] such that both D1 and D2 are master and slave of each other. MySQL propagates the binary log of the SQL commands of the master to the slave, and the slave runs these commands again to do the replication. Our technical report [29] contains the details of the two-way replication.

MySQL 4.0 does not support any locking protocol between the master and the slave to guarantee the atomicity of the distributed updates. However, the updates from the SIP server are additive, i.e., each registration from each device is one database record, so having two devices for the same user register with two database replicas does not interfere with the other registration. For example, if [email protected] registers [email protected] with D1 and [email protected] with D2, then D1 and D2 will propagate the updates to each other such that both D1 and D2 will have both of Bob's locations. There is a slight window of vulnerability: if one contact is added at D1 and the same contact is removed at D2, then after the propagation of updates the two databases will be inconsistent, with different contacts for the user. It turns out that this does not occur for the simple failover case, as we describe next. We can safely use the two-way replication as long as updates are done by only the SIP server.

For a simple failover case, the primary server P1 is preferred over the secondary server P2. So all the REGISTER requests go to P1 and are updated in D1. The replication happens from D1 to D2, not the other way. Only in the case of failure of P1 will the update happen to D2 through P2. But D1 will not be updated by the server in this case. By making sure that the database becomes consistent before the failed server is brought up again, we can avoid the database inconsistency problem mentioned above.

Web scripts are used to manage user profiles and system configuration. To maintain database consistency, the web scripts should not be allowed to modify D2 if D1 is up. To facilitate this, we modified the MySQL-Tcl client interface to accept a list of connection attributes. For example, if D1 and D2 are listed, then the script tries to connect to D1 first, and if that fails, tries D2, as shown in Fig. 9. For our web scripts, the short-lived TCP connection to MySQL is active only as long as the CGI script is running, so failover at connection setup is sufficient. In the future, for long-lived connections, the interface should be modified to provide failover even when an established TCP connection breaks.
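The fallback logic at connection setup is simple; a minimal Python sketch of the same idea (the actual Tcl interface is not shown in the paper, and connect_one here is a stand-in that merely opens the TCP socket):

    # Try-primary-then-backup connection setup, mirroring the modified
    # MySQL-Tcl interface described above (host names are illustrative).
    import socket

    def connect_one(host, port=3306, timeout=2.0):
        # Stand-in for a real MySQL client connect.
        return socket.create_connection((host, port), timeout=timeout)

    def connect_with_failover(hosts):
        for host in hosts:                 # try D1 first, then D2
            try:
                return connect_one(host)
            except OSError:
                continue                   # fall through to the next replica
        raise RuntimeError("no database replica reachable")

    # conn = connect_with_failover(["d1.example.com", "d2.example.com"])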

3.7 Analysis

The architecture provides high reliability due to redundancy. Assuming the reliability of each of the primary and backup sets of servers is $R$, the overall reliability is $1 - (1-R)^2$.

Server failure affects the call setup latency (since the client retries the call request to the secondary server after a timeout) and the user availability (the probability that the user is reachable via the server given that her SIP phone is up). If the primary server is down for a longer duration, the DNS records can be updated to change the secondary server into the primary. If the individual server reliability is $R$ (such that $0 \le R \le 1$), the client retry timeout is $T_R$, and the DNS TTL is $T_D$, then the average call setup latency increases by $T_R (1-R) P[t_M < T_D]$ (assuming no network delay and $R \approx 1$), where $P[t_M < T_D]$ is the probability that the time, $t_M$ (a random variable), to repair the server is less than the DNS TTL. For example, if the repair time is exponentially distributed with mean $T_M$, then $P[t_M < T_D] = 1 - e^{-T_D/T_M}$, assuming that the mean time to failure is much larger than the mean time to repair (i.e., $(1-R) T_M \approx 0$).
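As a worked example (parameter values are chosen here purely for illustration), the formula is easy to evaluate:

    # Average added call setup latency, using the formula above.
    import math

    T_R = 2.0      # client retry timeout (seconds)
    R   = 0.999    # individual server reliability
    T_D = 60.0     # DNS TTL (seconds)
    T_M = 300.0    # mean time to repair (seconds)

    p_repair_within_ttl = 1 - math.exp(-T_D / T_M)   # P[t_M < T_D], ~0.181
    added_latency = T_R * (1 - R) * p_repair_within_ttl
    print(added_latency)   # ~0.00036 s added per call on average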

User availability is mostly unaffected by the primary server failure, because most registrations are REGISTER refreshes. However, if the primary server fails after the phone registers a new contact for the first time, but before the registration is propagated to the secondary server, then the phone contact location is unreachable until the next registration refresh. In this case, assuming that the server uptime is exponentially distributed, and given the memoryless property, the time-to-failure has the same distribution. Suppose the mean time to failure is $T_F$ and the database replication latency is $T_d$; then the probability that the server goes down before the replication is completed (given that it is up at $t = 0$) is $P[\mathrm{lifetime} < T_d] = 1 - e^{-T_d/T_F}$. If this happens, the user record is unavailable for at most $T_r + T_R$, where $T_r$ is the registration refresh interval (typically one hour), and $T_R$ is the client retry timeout. After this time, the client refreshes the registration and updates the secondary server, making the user record available.

We use an in-memory cache of user records inside the SIP server to improve its performance [26, 23]. This causes more latency in updating the user registration from P1 to P2. If the failure happens before the update is propagated to P2, then P2 may have an old and expired record. However, in practice the phones refresh registrations well before the expiry, and the problem is not visible. For example, suppose the record expires every two hours and the refresh happens every 50 minutes. Suppose P1 receives the registration update from a phone and fails before propagating the update to D1. At this point, the record in D2 has 70 minutes until expiry, so P2 can still handle the calls to this phone. The next refresh happens in 50 minutes, before the expiration of the record in D2. If a new phone is set up (first-time registration) just before the failure of P1, it will be unavailable until the next refresh. Suppose $T_d$ and $T_F$ are defined as before, and $T_c$ is the database refresh interval; then the probability that the server goes down before the replication is completed is $1 - e^{-(T_d + T_c)/T_F}$.

With the Cisco phone [24] that has the primary and backup proxy address options (Section 3.1), the phone registers with both P1 and P2. Both D1 and D2 propagate the same contact location change to each other. However, since the contact record is keyed on the user identifier and contact location, the second write just overrides the first write without any other side effect. Alternatively, the server can be modified to perform immediate synchronization between the in-memory cache and the external database if the server is not loaded.

The two-way replication can be extended to more servers by using circular replication such as D1-D2-D3-D1 using the MySQL master/slave configuration [28]. To provide failover of individual servers (e.g., D1 fails but not P1), the SIP server P1 should switch to D2 if D1 is not available.

4 Scalability: Load Sharing

In failover, the backup server takes over in the case of failure, whereas in load sharing all the redundant servers are active and distribute the load among them. Some of the failover techniques can also be extended to load sharing.

4.1 DNS-based load sharing

The DNS SRV [14] and NAPTR [15] mechanisms can be used for load sharing using the priority and weight fields in these resource records [3], as shown below:

    example.com
    _sip._udp    0 40 a.example.com
                 0 40 b.example.com
                 0 20 c.example.com
                 1  0 backup.somewhere.com

The above DNS SRV entry indicates that the servers a, b and c should be used if possible (priority 0), with backup.somewhere.com as the backup server (priority 1) for failover. Within the three primary servers, a and b are to receive a combined total of 80% of the requests, while c, presumably a slower server, should get the remaining 20%. Clients can use weighted randomization to achieve this distribution, as sketched below.
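A minimal sketch of such weighted selection within the lowest available priority (illustrative only; real clients follow the full RFC 2782 selection algorithm):

    # Pick a target among same-priority SRV records with probability
    # proportional to weight (a simplification of RFC 2782 selection).
    import random

    SRV = [  # (priority, weight, target)
        (0, 40, "a.example.com"),
        (0, 40, "b.example.com"),
        (0, 20, "c.example.com"),
        (1,  0, "backup.somewhere.com"),
    ]

    def pick_server(records):
        best = min(prio for prio, _, _ in records)        # lowest priority wins
        candidates = [(w, t) for p, w, t in records if p == best]
        total = sum(w for w, _ in candidates)
        r = random.uniform(0, total)
        for weight, target in candidates:
            r -= weight
            if r <= 0:
                return target
        return candidates[-1][1]

    # Over many calls, a and b each get ~40% and c ~20% of the requests.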

However, simple random distribution of requests is not sufficient, since the servers need to access the same registration information. Thus, in the example above, each server would have to replicate incoming REGISTER requests to all other servers or update the common shared and replicated database(s). In either case, the updates triggered by REGISTER quickly become the bottleneck. The SIP phones typically do a REGISTER refresh once an hour; thus, for a wireless operator with one million subscribers, the system has to process about 280 updates per second.

Figure 10: DNS-based load sharing (three proxy servers P1-P3 share D=2 databases; every write is propagated to both)

Figure 11: Identifier-based load sharing (a stateless first stage proxy P0 routes to P1 (a-h), P2 (i-q) and P3 (r-z), each with its own database, D=3)

Fig. 10 shows an example with three redundant servers and two redundant databases. For every REGISTER, the server performs one read and one write in the database. For every INVITE-based call request, it performs one read from the database. Every write should be propagated to all the $D$ databases, whereas a read can be done from any available database. Suppose there are $N$ writes and $rN$ reads; e.g., if the same number of INVITEs and REGISTERs are processed, then $r = 2$. Suppose the database write takes $T$ units of time, and the database read takes $tT$ units. The total time per database will be $(\frac{tr}{D} + 1)TN$.

This architecture also provides high reliability due to redundancy. Assuming that the mean-time-to-repair is much less than the mean-time-to-failure, that the reliability of an individual proxy server is $R_p$ and of a database server is $R_d$, and that there are $P$ proxy servers and $D$ database servers, the reliability of the system becomes $(1-(1-R_p)^P)(1-(1-R_d)^D)$. The reliability increases with increasing $D$ and $P$.

4.2 Identifier-based load sharing

For identifier-based load sharing (Fig. 11), the user space is divided into multiple non-overlapping groups. A hash function maps the destination user identifier to the particular group that handles the user record, e.g., based on the first letter of the user identifier. For example, P1 handles a-h, P2 handles i-q and P3 handles r-z. A high speed first stage server, P0, proxies the call request to P1, P2 or P3 based on the destination user identifier. If a call is received for destination [email protected], it goes to P1, whereas [email protected] goes to P3. Each server has its own database and does not need to interact with the others. To guarantee an almost uniform distribution of call requests to the different servers, a better hashing algorithm such as SHA-1 can be used, or the groups can be re-assigned dynamically based on the load, as sketched below.
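The following Python sketch (illustrative only; server names are made up) maps a user identifier to one of the second-stage groups with SHA-1, which distributes identifiers far more evenly than first-letter grouping:

    # Map a SIP user identifier to one of the server groups using SHA-1.
    import hashlib

    SERVERS = ["p1.example.com", "p2.example.com", "p3.example.com"]

    def group_for(user):
        # Hash only the user part of user@domain, as in the scheme above.
        digest = hashlib.sha1(user.lower().encode()).digest()
        return SERVERS[int.from_bytes(digest[:4], "big") % len(SERVERS)]

    # group_for("bob") and group_for("sam") may land on different groups;
    # over many users the distribution is close to uniform.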

Suppose $N$, $D$, $T$, $t$ and $r$ are as defined in the previous section. Since each read and write operation is limited to one database, and assuming a uniform distribution of requests to the different servers, the total time per database will be $(\frac{tr+1}{D})TN$. With increasing $D$, this scales better than the previous method. Since the writes do not have to be propagated to all the databases, and the database can be co-located on the same host as the proxy, it reduces the internal network traffic.
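For intuition, a small script (with illustrative parameter values) compares the per-database load of the two schemes as $D$ grows:

    # Per-database time: DNS-based (writes hit every replica) versus
    # identifier-based (each operation hits exactly one database).
    N, T, t, r = 1000.0, 1.0, 0.5, 2.0   # writes, write cost, read/write cost ratio, reads per write

    for D in (1, 2, 4, 8):
        dns_based = (t * r / D + 1) * T * N       # (tr/D + 1)TN
        id_based  = ((t * r + 1) / D) * T * N     # ((tr+1)/D)TN
        print(D, dns_based, id_based)
    # At D=4: 1250 vs 500 time units. The identifier-based scheme keeps
    # dividing the write load, while the DNS-based scheme does not.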

However, because of the lack of redundancy, this architecture does not improve system reliability. Assuming that the mean-time-to-repair is much less than the mean-time-to-failure, that the reliability of the first stage proxy, second stage proxy and database server are $R_0$, $R_p$ and $R_d$, respectively, and that there are $D$ groups, the system reliability becomes $R_0 \cdot (R_p)^D \cdot (R_d)^D$. The least reliable component affects the system reliability the most, and the reliability decreases as $D$ increases.

The only bottleneck may be the first stage proxy. We observed that the stateful performance is roughly similar to the stateless performance (Section 5); hence, a single stateless load balancing proxy may not work well in practice.

4.3 Network address translation

A network address translator (NAT) device can expose a unique public address as the server address and distribute the incoming traffic to one of several internal private hosts running the SIP servers [30]. Eventually, the NAT itself becomes the bottleneck, making the architecture inefficient. Moreover, the transaction-stateful nature of SIP servers requires that subsequent retransmissions be handled by the same internal server. So the NAT needs to maintain the transaction state for the duration of the transaction, further limiting scalability.

4.4 Multiple servers with the same IP address

In this approach, all the redundant servers in the same broadcast network (e.g., Ethernet) use the same IP address. The router on the subnet is configured to forward incoming packets to one of these servers' MAC addresses. The router can use various algorithms such as "round robin" or "response time from server" to choose the least loaded server.

To avoid storing SIP transaction states in the subnet router, this method is only recommended for stateless SIP proxies that use only UDP transport and treat each request as independent, without maintaining any transaction state.

In the absence of DNS SRV and NAPTR, we can use this method for the first stage in Fig. 12. This is less efficient, since the network bandwidth of this subnet may limit the number of servers in the cluster. Moreover, this method does not work if the network itself is unreachable.

4.5 Two-stage reliable and scalable architecture

Figure 12: Two-stage reliable and scalable architecture (stateless first stage servers s1-s3.example.com, selected via _sip._udp SRV 0 0 records, route sip:[email protected] to the cluster a.example.com (a1, a2) and sip:[email protected] to the cluster b.example.com (b1, b2); within each cluster, SRV priorities 0 and 1 select the primary and backup)

Since none of the mechanisms above is sufficiently general or infinitely scalable, we propose to combine the two methods (Fig. 10 and 11) in a two-stage scaling architecture (Fig. 12) to improve both reliability and scalability. The first set of proxy servers, selected via DNS NAPTR and SRV, performs request routing to the particular second-stage cluster based on the hash of the destination user identifier. The cluster member is again determined via DNS. The second-stage server performs the actual request processing. Adding an additional stage does not affect the audio delay, since the media path (usually directly between the SIP phones) is independent of the signaling path. Use of DNS does not require the servers to be co-located, thus allowing geographic diversity.

Suppose there are $S$ first stage proxy servers, $P$ clusters in the second stage, and $B$ proxy and database servers in each cluster. The second stage cluster has one primary server and $B-1$ backups. All the databases in a cluster are replicated using circular replication. Suppose the REGISTER message arrivals are uniformly distributed (because of the uniform registration refresh rate of most user agents) with mean $\lambda_R$, and INVITE (or other requests that need to be proxied, such as MESSAGE) arrivals are Poisson distributed with mean $\lambda_P$, such that the total request rate is $\lambda = \lambda_R + \lambda_P$. Suppose the constant service rate of a first stage server is $\mu_s$, and those of a second stage server are $\mu_r$ and $\mu_p$ for registration and proxying, respectively. We assume a hash function such that each cluster's arrival rate is approximately $\lambda/P$. Suppose the reliability (probability that the system is available for processing an incoming message) and maintainability (repair rate of the system after a failure) distributions for a first stage proxy are represented by probability distribution functions (pdf) $R_s$ and $M_s$, respectively, and those for a second stage proxy by $R_p$ and $M_p$, respectively. Note that Fig. 9 is a special case where $S=0$, $P=1$ and $B=2$. Similarly, Fig. 11 is a special case where $S=B=1$.

The goal is to quantitatively derive the relationship between the different service parameters ($\mu$), system load ($\lambda$), reliability parameters ($R$, $M$) and redundancy parameters ($S$, $B$, $P$). We want to answer questions such as (1) when is a first stage proxy needed, and (2) what are the optimal values of the redundancy parameters to achieve a given scalability and reliability? Our goal is to achieve carrier grade reliability (99.999% available) and scalability (10 million BHCA) using commodity hardware. We provide our performance measurement results for the scalability parameters ($S$ and $P$) and system load ($\lambda$) in the next section.

Suppose each server is 99% reliable, and $S = P = B = 3$; then the overall system reliability is $(1-(1-R_s)^S) \cdot (1-(1-R_p)^B)^P = 99.9996\%$, i.e., "five nines".
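This arithmetic is easy to check (a one-off script, not from the paper):

    # Verify the five-nines claim: S = P = B = 3, each server 99% reliable.
    Rs = Rp = 0.99
    S = P = B = 3

    first_stage  = 1 - (1 - Rs) ** S            # at least one of S proxies up
    second_stage = (1 - (1 - Rp) ** B) ** P     # every cluster has a live server
    print(first_stage * second_stage)           # ~0.999996, i.e., 99.9996%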

We do not consider the case of load sharing by different proxies in the same cluster, because load sharing is better achieved by creating more clusters. For handling sudden load spikes within one cluster, the DotSlash on-demand rescue system [31] is more appropriate, where a backup server in the same or another cluster temporarily shares the load with the primary server of the overloaded cluster.

5 Performance Evaluation

In this section, we quantitatively evaluate the performance of our two-stage architecture for scalability, using our SIP registration and proxy server, sipd, and the SIPstone test suite [32].

5.1 Test setup

We performed the SIPstone Proxy 200 tests over UDP. The SIPstone test suite has loaders and call handlers to generate SIP requests and to respond to incoming requests, respectively. The server under test (SUT) is a two-stage cluster of our SIP server, sipd, implementing the reactive system model [23]. An example test setup is shown in Fig. 13. Each instance of sipd was run on a dedicated host with a Pentium 4 3 GHz CPU on an 800 MHz motherboard, with 1 GB of memory, running Redhat Linux (Fedora). The hosts communicated over a lightly loaded 100base-T Ethernet connection. A single external MySQL database, running version 3.23.52 of the MySQL server, was shared by all the sipd instances. This is not an issue because the Proxy 200 test does not modify the database, but uses the in-memory cache of sipd [26].

Figure 13: Example test setup for S3P3 (the SIPstone controller drives four loaders L1-L4 that generate total load λ to three first stage servers S1-S3; the hash distributes roughly 0.4λ, 0.3λ and 0.3λ to the three second stage servers P1-P3, which partition the user identifiers A1000-A1024, A1025-A1049, A1050-A1074 and A1075-A1099 among four call handlers H1-H4)

To focus only on the scalability aspects, we used one server in each group of the second stage (Fig. 12, $B=1$). We use the convention SnPm to represent $n$ first stage servers and $m$ second stage groups with one server per group. S0P1 is the same as a single SIP proxy server without any first stage load balancer.

Figure 14: Example message flow for the Proxy 200 test (INVITE, 180 Ringing and 200 OK, followed by ACK and then BYE with 200 OK, between the load generator L1, the stateless first and second stage servers, and the call handler H2)

On startup, a number of call handlers (in our tests, four) register a number of destination locations (from non-overlapping user identifier sets, as shown in Fig. 13) with the proxy server. Then, for the Proxy 200 test, a number of loaders (in our tests, four) send SIP INVITE requests, using a Poisson distribution for call generation, to the SUT, randomly selecting from among the registered addresses as shown in Fig. 14. If there is more than one first stage server (n > 1), then the loader randomly selects one of the first stage servers. The first stage server proxies the request to one of the second stage servers based on the destination user identifier. The second stage server forwards each request to the appropriate call handler responsible for this user identifier. The call handler immediately responds with 180 Ringing and 200 OK messages. These are forwarded back to the load generators along the reverse path. Upon receiving the 200 OK response, the load generator sends an ACK message for the initial transaction and a BYE request for a new transaction. The BYE is similarly forwarded to the call handler via the two-stage servers to reflect the record-route behavior in real operational conditions [32]. The call handler again responds with 200 OK. If the 200 OK response is not received by the loader within two seconds, or if any other unexpected behavior occurs, then the test is considered a failure. The loader generates requests for one minute at a given request rate. The server is then restarted, and the test is repeated for a higher request rate. We used an increment of 100 calls per second (CPS).

This process is repeated until 50% or more of the tests fail. Although [32] requires 95% success, we measure until 50% to show that the throughput is stable at higher loads. There is no retransmission on failure [32]. The complete process is repeated for different values of n and m in the cluster configuration, SnPm.

5.2 Analysis

Fig. 15 compares the performance of the different SnPm configurations. It shows the average of three experiments for each configuration at various call rates. A single sipd server handles about 900 calls/second (CPS) (see S0P1 in Fig. 15), which corresponds to about three million BHCA. When the load is more than the server capacity, the throughput remains almost constant at about 900 CPS. When the throughput is the same as the load, i.e., 100% success rate, the graph is a straight line. Once the throughput reaches the capacity (900 CPS), the graph for S0P1 flattens, indicating a lower success rate at higher load. At a load of 1800 CPS, the system gives only a 50% success rate (i.e., the throughput is half the load), and the experiment stops. Note that for all practical purposes, a success rate close to 100% is desired.

When the server is overloaded, the CPU utilization is close to 100%. Introducing an extra server in the second stage and having a first stage load balancing proxy puts the bottleneck on the first stage server, which has a capacity of about 1050 CPS (S1P2 in Fig. 15). An additional server in the first stage (S2P2) gives a throughput of approximately double the single second stage server capacity. Similarly, S3P3 has a capacity of approximately 2800 CPS, which is about three times the capacity of the single second stage server, and S2P3 has a capacity of 2100 CPS, which is double the capacity of the single first stage server.

Figure 15: Server throughput in the SnPm configuration (n first stage and m second stage servers). The results show that the performance increases linearly with the number of servers, i.e., s2p2 is twice and s3p3 is thrice that of s1p1 and s0p1 performance.

Figure 16: Theoretical and experimental capacity for configuration SnPm

The results show that we can achieve linear scaling by putting more servers in the first and second stages of our architecture. Below, we present the theoretical analysis for the two-stage architecture.

Suppose the first and second stage servers in SnPm have capacities of $C_s$ and $C_p$, respectively (usually, $C_s \ge C_p$). The servers are denoted as $S_i$ and $P_j$, $1 \le i \le n$, $1 \le j \le m$, for the first and second stage, respectively. Suppose the incoming calls arrive at an average rate $\lambda$, with exponential inter-arrival time. Suppose the load is uniformly distributed among all the $n$ first stage servers, so each first stage server gets a request rate of $\lambda/n$. Suppose the hash function distributes the requests to the second stage servers such that the $i$th server, $P_i$, gets a fraction, $f_i$, of the calls (note that $\sum f_i = 1$). Assuming that all the users are equally likely to get called, and that the hash function uniformly distributes the user identifiers among the second stage servers, all $f_i$ will be the same (i.e., $f_i = 1/m$). However, differences in the number of incoming calls for different users will cause a non-uniform distribution in reality.

The throughput, $\tau$, at a given load, $\lambda$, is the combined throughput of the two stages. The throughput of the first stage is $\lambda' = \min(\lambda, nC_s)$, which is the load (input) to the second stage. The server, $P_j$, in the second stage has a throughput of $\min(\lambda' f_j, C_p)$. Thus,

$$\tau(\lambda) = \sum_{j=1}^{m} \min(f_j \min(\lambda, nC_s),\ C_p)$$

Without loss of generality, we assume that $f_i \ge f_j$ for $i < j$. The resulting throughput vs. load graph is given by $m+1$ line segments, $L_i$: $(\lambda_i, \tau_i) \to (\lambda_{i+1}, \tau_{i+1})$, for $i = 0$ to $m$, where $(\lambda_k, \tau_k)$ is given as follows:

$$(\lambda_k, \tau_k) = \begin{cases} (0,\ 0) & k = 0 \\ \left(\frac{C_p}{f_k},\ \tau_{k-1} + (\lambda_k - \lambda_{k-1})F_k\right) & 1 \le k \le m;\ f_k \ge \frac{C_p}{nC_s} \\ \left(nC_s,\ \tau_{k-1} + (\lambda_k - \lambda_{k-1})F_k\right) & 1 \le k \le m;\ f_k < \frac{C_p}{nC_s} \\ (\infty,\ \tau_m) & k = m+1 \end{cases}$$

where $F_k = \sum_{i=k}^{m} f_i = 1 - \sum_{i=1}^{k-1} f_i$ is the fraction of requests going to the second stage servers that are not yet overloaded.

The initial line segment represents a 100% success rate with slope $F_1 = 1$. At the request load of $C_p/f_1$, server $P_1$ reaches its capacity and drops any additional request load. So the throughput increases at a rate equal to the remaining fraction of requests that go to the other non-overloaded servers, $P_k$, $k = 2, 3, \ldots, m$. This gives the slope $F_2 = f_2 + f_3 + \ldots + f_m$ for the second line segment. Similarly, $P_2$ reaches its capacity at load $C_p/f_2$, and so on. When all the second stage servers are overloaded, the throughput remains constant, giving the last line segment. At the request load of $nC_s$, all the first stage servers, $S_i$, reach their capacity limit. If the second stage server $P_j$'s capacity, $C_p$, is more than the load it receives at that time, $f_j(nC_s)$, then the system throughput is not limited by $P_j$.

We used a set of one hundred user identifiers for the test. The hash function we used distributed these identifiers as follows: for $m = 2$, $f$ is roughly $\{0.6, 0.4\}$, and for $m = 3$, $f$ is roughly $\{0.4, 0.3, 0.3\}$. Note that with 1000 or 10,000 user identifiers the same hash function distributed the set more uniformly, as expected, but our skewed distribution of one hundred identifiers helps us verify the results under a non-uniform call distribution across users. The capacities $C_s$ and $C_p$ are 900 CPS and 1050 CPS, respectively. The resulting theoretical performance is shown in Fig. 16 for s1p2, s2p2, s2p3 and s3p3, with system capacities of 1050, 1740, 2100 and 2700 CPS, respectively. Although S2P2's second stage can handle $900 \times 2 = 1800$ CPS, the throughput of the first stage is only $1050 \times 2 = 2100$ CPS, of which 60% (i.e., 1260 CPS) goes to P1, which drops $1260 - 900 = 360$ CPS. So the system throughput is $2100 - 360 = 1740$ CPS. Our experimental results are plotted as data points (not averages, but individual throughput values) in the same graph for comparison.
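The throughput model is easy to evaluate numerically; a small sketch (using the parameters from this section) reproduces the 1740 CPS figure for S2P2:

    # tau(load) = sum_j min(f_j * min(load, n*Cs), Cp), evaluated at overload
    # for S2P2 with the skewed 60:40 hash distribution.
    Cs, Cp = 1050.0, 900.0     # first and second stage server capacities (CPS)
    n, f = 2, [0.6, 0.4]       # two first stage servers, two second stage groups

    def tau(load):
        first_stage_out = min(load, n * Cs)
        return sum(min(fj * first_stage_out, Cp) for fj in f)

    print(tau(4000))   # 1740.0: P1 saturates at 900, P2 gets 0.4*2100 = 840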

5.3 Non-uniform call distribution

If the call requests for the user population are non-uniformly distributed among the different second stage servers, then the system starts dropping call requests at a load lower than the combined capacity of the second stage servers. To prevent this, the user data should be redistributed among the second stage servers to provide a uniform distribution on average, e.g., by changing the hash function. Fig. 17 compares two experiments for the S2P2 configuration: one with the earlier skewed hash function that distributed the user identifiers in the ratio 60:40, and another with a hash function (Bernstein's hash [33]) that distributed the user identifiers in the ratio 50:50.

If the number of second-stage groups changes frequently, then a consistent hashing function [34] is desirable, as it avoids a large redistribution of the user identifiers among the servers.
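A minimal consistent hashing sketch (illustrative; real deployments tune the number of virtual nodes per server for smoother balance):

    # Consistent hashing: servers and users hash onto the same ring; a user
    # maps to the first server clockwise, so adding or removing one server
    # only moves the identifiers adjacent to it on the ring.
    import bisect, hashlib

    def ring_pos(key):
        return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")

    class Ring:
        def __init__(self, servers, vnodes=64):
            self.points = sorted(
                (ring_pos(f"{s}#{v}"), s) for s in servers for v in range(vnodes))
            self.keys = [p for p, _ in self.points]

        def server_for(self, user):
            i = bisect.bisect(self.keys, ring_pos(user)) % len(self.points)
            return self.points[i][1]

    ring = Ring(["p1", "p2", "p3"])
    # ring.server_for("bob") stays stable if "p4" is later added: only about
    # a quarter of the users move, instead of nearly all with modulo hashing.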

5.4 Stateful proxy

So far we have shown the test results using the stateless proxy mode. A SIP request over UDP that needs to be proxied to only one destination (i.e., no request forking) can be proxied statelessly. Our SIP server, sipd, can be configured to try the stateless mode, if possible, for every request that needs to be proxied. If a request cannot be proxied statelessly, sipd falls back to the transaction stateful mode for that request. Stateful mode requires more processing and state in the server, e.g., for matching the responses against the request.

We ran one experiment with the stateless proxy mode disabled in the second stage. Fig. 18 shows the experimental results along with the theoretical throughput using the earlier hash function. The first and second stage server capacities are C=800 and C'=650 CPS, respectively. The first stage server capacity is lower when the second stage is stateful (800 CPS) than when the second stage is stateless (1050 CPS), because the stateful second stage server generates two additional 100 Trying SIP responses, for the INVITE and BYE in a call, which increases the number of messages handled by the first stage server (see Fig. 19 and 14). If a fraction, $f_s$, of the input load needs to be handled in stateful mode (e.g., due to request forking to multiple callee devices), then the effective server capacity becomes $(1-f_s)C + f_s C'$.

Figure 17: Effect of user identifier distribution among second stage servers for S2P2. Uniform distribution gives the best performance, i.e., the success rate is close to 100% until the peak performance (1800 CPS), whereas with non-uniform distribution the success rate drops as soon as one of the servers is overloaded (at 1500 CPS).

Figure 18: Performance of SnPm with a stateful proxy in the second stage. The results show that the performance increases linearly with the number of servers, i.e., s2p2 is twice and s3p3 is thrice that of s1p1 and s0p1 performance.

Figure 19: Stateful proxy message flow (the stateful second stage additionally generates 100 Trying responses for the INVITE and BYE)

Our more recent optimizations enhance the single second stage server throughput to 1200 CPS and 1600 CPS for the stateful and stateless proxy, respectively.

5.5 Effect of DNS

In our earlier experiments, the call handler registered its DNS host name with the proxy server, so that the server does a DNS lookup to locate the call handler host. We observed comparatively poor performance, e.g., a single proxy server capacity with DNS was 110 CPS on the same hardware, compared to 900 CPS without DNS. There were two problems in our implementation: (1) it used a blocking DNS resolver that waits for the query to complete, so the internal request queue builds up if the DNS latency is more than the average inter-arrival duration; and (2) it did not implement any host-cache for DNS, so the second stage server did a DNS lookup for every call request. We also observed some fluctuations in throughput even before the server reached its capacity. This was because the DNS server was not in the same network, and the DNS procedure took between 10 and 25 ms for each call. In our tests, sipd sent about 29 DNS queries for each call, due to multiple resolver search domains (six in our tests) and the DNS record types used in the implementation (e.g., sipd tries NAPTR, SRV and A records, falling back in that order).

We then implemented a simple DNS host-cache in sipd and observed the same performance as without DNS (i.e., 900 CPS for a single second stage server). In practice, the first-stage servers access records for the second-stage servers within the same domain, thus doing localized DNS queries in the domain. It will be interesting to measure the host-cache performance for real callee host names at the second-stage servers, instead of the few call handler host names that, in our tests, were cached after the first lookups until the end of the test run. We plan to use an event-based DNS resolver to improve the performance and eliminate the potential bottleneck due to DNS access.
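A host-cache of the kind described is only a few lines; a hedged Python sketch (the actual sipd cache is in C and is not shown in the paper):

    # A minimal TTL-bound DNS host cache: resolve once, reuse until expiry.
    import socket, time

    class HostCache:
        def __init__(self, ttl=300.0):
            self.ttl, self.entries = ttl, {}   # name -> (address, expiry)

        def lookup(self, name):
            entry = self.entries.get(name)
            if entry and entry[1] > time.monotonic():
                return entry[0]                # fresh cache hit, no DNS query
            addr = socket.gethostbyname(name)  # blocking lookup on miss only
            self.entries[name] = (addr, time.monotonic() + self.ttl)
            return addr

    # cache = HostCache(); cache.lookup("example.com") hits DNS once per TTL.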

5.6 Other SIPstone tests

We also performed one experiment with the Registration test without authentication. The performance is shown in Fig. 20 along with the expected throughput values. We used capacity values of $C_s$=2500 registrations/second (RPS) and $C_p$=2400 RPS for the first and second stage servers, respectively. Authentication requires two transactions, thus reducing the capacity by half. Even so, the S3P3 configuration will be able to support more than 10 million subscribers, assuming a one hour registration refresh interval.

Figure 20: Performance of SnPm with a registration server in the second stage. The results show that the performance increases linearly with the number of servers, i.e., s2p2 is twice and s3p3 is thrice that of s1p1 and s0p1 performance.

Note that the second stage registrar is always stateful. Moreover, we set the database refresh interval to be longer than the test duration, thus removing the database synchronization variable from the results. The effect of database synchronization on performance is for further study. The first stage proxy server capacity for the registration test is higher because it handles only two messages per transaction in the registration test, compared to six in the Proxy 200 test (see Fig. 21 and 14).

The Proxy 200 test determines the BHCA (busy hour call attempts) metric, whereas the registration test determines the number of registered subscribers for the system.

Figure 21: REGISTER message flow (REGISTER and 200 OK between the load generator, the stateless first stage and the second stage registrar)

6 Server Architecture

There are two components in providing high capacity IP telephony services: network components such as bandwidth, server location and load sharing, and server components such as server hardware (CPU, memory), the features vs. performance tradeoff, non-blocking I/O and software architecture. In general, scaling any Internet service involves individual server performance enhancements and load distribution among multiple servers. We described and evaluated SIP load sharing in Sections 4 and 5. This section deals with performance enhancements of an individual SIP server on commodity server hardware. In particular, we evaluate the effect of the software architecture (events, threads or processes) on the SIP proxy server. We try to answer the following questions: (1) For a SIP-style server, which of the basic architectures is likely to perform better in a given situation? (2) Does performance scale with CPU speed, or is it memory dominated? (3) What can be done to improve the performance on a multiprocessor machine?

We built a very basic SIP server in different software architectures using the same set of libraries for SIP processing. This helps us understand the effect of the server architecture on performance. The server includes a parser module and has many simplifications, such as memory-only lookups without any database write-through, no SIP Route header handling, minimal configuration, only UDP transport (i.e., no TCP or TLS), no programmable scripts, and no user authentication. We used standard POSIX threads, which map to kernel-level threads on Solaris and Linux. On multi-processor hardware, concurrency is utilized via multi-threading and multi-processing software architectures. Our goal is to use commodity hardware without any custom tweaks, hence optimized user-level threads and CPU scheduler reconfiguration are not investigated.


Figure 22: Processing steps in a SIP server (receive on UDP/TCP, initial parsing, request matching against existing transactions, registration database update or lookup, proxy/redirect/reject decision, response matching and forwarding; several steps, marked (B) in the figure, may block)

6.1 Processing steps

Figure 22 describes the steps involved in processing a SIP request in any SIP server. It includes both transaction stateless and stateful processing. The server receives a message on UDP, TCP, SCTP or TLS transport. We use only UDP in our tests. The message is parsed using our unoptimized SIP parser. If the message is a SIP request, it is matched against the existing transactions. If a matching transaction is found, the request is a retransmission, and the last response, if any, in the transaction is returned for the request. If a match is not found, and the request is a SIP REGISTER request, then the user contact records are updated for this registration and a response is sent. For any other request, the user record is looked up. Depending on the policy chosen, the call is then proxied, redirected or rejected. In proxy mode, the server looks up the callee's current contact locations, forwards the request and waits for the response. During this process, the server may need to perform additional retransmissions for reliability. When receiving a response, the server looks up the appropriate matching transaction for this response and forwards the response. If the policy decides to redirect the request instead of proxying it, the server sends the response to the caller listing the new contact location(s). The first stage load balancing server selects the next stage server based on the destination user identifier, without doing any database query. The steps are based on our SIP server implementation, but are likely to be similar for other implementations.

These processing steps can be implemented in various software architectures for both stateless and stateful proxy modes.

6.2 Stateless proxy

A stateless proxy does not maintain any transaction state and has a single control flow per message. That is, once a message is received, it can be processed to completion without interfering with other messages. We used only UDP transport for our tests and did not perform any DNS lookups. As shown in Figure 14, a single Proxy 200 test involves six messages. In Figure 22, for an incoming call request, the steps performed are recvfrom, initial parsing, database lookup, modify request and sendmsg. Similarly, for an incoming call response, the steps performed are recvfrom, initial parsing, modify response and sendmsg. A first stage load balancer proxy is also a stateless proxy, but it does not include the database lookup stage. The processing steps can be implemented in different software architectures as follows:

Event-based: A single thread listens for incoming messages and processes each to completion. There is no locking. This does not take advantage of the underlying multiprocessor architecture. If DNS is used, then the same thread also listens for events such as the DNS response and timeout.
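As a rough sketch, the event-based stateless proxy reduces to a single recvfrom/send loop like the one below, assuming plain UDP on port 5060. Parsing, user lookup and next-hop selection are elided; the echo merely stands in for the rewritten, forwarded message. This is not the sipd implementation.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);          // UDP only, as in our tests
        sockaddr_in addr = {};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(5060);                      // default SIP port
        bind(fd, (sockaddr*)&addr, sizeof(addr));

        char buf[65536];
        for (;;) {                                        // one thread, no locks
            sockaddr_in from = {};
            socklen_t len = sizeof(from);
            ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, (sockaddr*)&from, &len);
            if (n <= 0) continue;
            // Initial parsing, database lookup and request/response modification
            // would happen here; the message is then sent to the next hop.
            sendto(fd, buf, n, 0, (sockaddr*)&from, len);
        }
    }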

Thread per message: A main thread listens for incoming messages. A new parsing thread is created to do the initial parsing. Then another processing thread is created to perform the remaining steps, depending on whether the message is a request or a response. The thread terminates after the steps are completed. DNS lookups, if any, are performed synchronously in the processing thread. Locks (i.e., mutexes) are used for accessing shared data such as the database. Potentially blocking operations include DNS, sendmsg, and database lookup.

Pool-thread per message: This is similar to the previous method, except that instead of creating a new thread, it reuses a thread from a thread pool. A set of threads is created in the thread pool on server initialization and persists throughout the server lifetime. This reduces the thread creation overhead and is the original architecture of our SIP server, sipd [23]. To further reduce lock contention, the user data can be divided into multiple sets (say, 100), each with its own transaction tables or user records. Thus, accesses to user records in different sets do not contend for the same lock; a sketch of this partitioning follows.
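A minimal sketch of the set partitioning, with hypothetical names (the real sipd data structures differ): a static hash of the user identifier selects one of a fixed number of independently locked sets, so two threads touching different sets never contend.

    #include <array>
    #include <functional>
    #include <mutex>
    #include <string>
    #include <unordered_map>

    constexpr std::size_t kSets = 100;                   // number of independent sets

    struct UserSet {
        std::mutex lock;                                 // one lock per set
        std::unordered_map<std::string, std::string> records;
    };

    std::array<UserSet, kSets> sets;

    UserSet& setFor(const std::string& user) {           // static hash chooses the set
        return sets[std::hash<std::string>{}(user) % kSets];
    }

    void updateContact(const std::string& user, const std::string& contact) {
        UserSet& s = setFor(user);
        std::lock_guard<std::mutex> guard(s.lock);       // contends only within one set
        s.records[user] = contact;
    }

    int main() {
        updateContact("[email protected]", "sip:192.0.2.1:5060");
    }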

Process pool: On server initialization, a pool of identical processes is created, all listening on the same socket. When a message is received, one of the processes gets the socket message and performs all the processing steps for that message. Shared memory is used for sharing the database among multiple processes. This is the architecture of the SIP express router [35].
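A minimal sketch of this pattern for UDP: the socket is bound once before forking, so the kernel delivers each datagram to exactly one member of the pool. Shared-memory access to the database is omitted; this is a generic illustration, not the SIP express router code.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        sockaddr_in addr = {};
        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(5060);
        bind(fd, (sockaddr*)&addr, sizeof(addr));   // bind before forking the pool

        for (int i = 0; i < 3; i++)                 // parent plus three children
            if (fork() == 0) break;                 // each child serves the socket too

        char buf[65536];
        for (;;) {
            // Each message wakes exactly one process, which performs all the
            // processing steps; the shared database would live in shared memory.
            ssize_t n = recvfrom(fd, buf, sizeof(buf), 0, nullptr, nullptr);
            if (n > 0) std::printf("pid %d handled %zd bytes\n", (int)getpid(), n);
        }
    }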

Thread pool: This is similar to the previous method, but it uses threads instead of processes. Only one thread can call recvfrom on the listening socket at a time: if one thread has called recvfrom, other threads are blocked from calling this function until the first thread completes the call.

Software architecture              1xP    4xP    1xS    2xS
Event-based                       1550    400    150    600
Thread per message                1300    500    100    500
Pool-thread per message (sipd)    1400    850    110    600
Thread pool                       1500   1300    152    750
Process pool                      1600   1350    160   1000

Table 1: Performance (CPS) of stateless proxy for Proxy 200 test

We ran our tests on four different platforms: (1xP) Pentium 4, 3 GHz, 1 GB memory, running Linux 2.4.20; (4xP) four-processor Pentium, 450 MHz, 512 MB, running Linux 2.4.20; (1xS) UltraSPARC-IIi, 300 MHz, 64 MB, running Solaris 5.8; and (2xS) two-processor UltraSPARC-III+, 900 MHz, 2 GB, running Solaris 5.8. The results of our tests are shown in Table 1. The numbers presented in this section differ from the earlier load sharing experiments of sipd because these tests were done after some optimizations, such as a per-transaction memory pool to reduce memory deallocation and copying [23]. The performance of the different architectures relative to the event-based model on each platform is shown in Figure 23.

For a single-processor system (1xP and 1xS), the performance of the event-based, thread pool and process pool models is roughly similar. We found that the thread pool model had more context switches than the process pool model; in the process pool model, the same process keeps getting scheduled to handle subsequent requests, which accounts for the slight difference in performance, and the process pool model performs best. The thread-per-message and pool-thread-per-message models have many-fold more context switches, resulting in much poorer performance, because every message processed must involve at least two context switches. One interesting observation is that both single-processor systems (1xP and 1xS) took approximately 2 MHz of CPU cycles per CPS (call per second) of load.
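To check the arithmetic from Table 1: the 3 GHz 1xP machine sustains about 1550 CPS in the event-based model (3000 MHz / 1550 CPS ≈ 1.9 MHz per CPS), and the 300 MHz 1xS machine sustains about 150 CPS (300 MHz / 150 CPS = 2 MHz per CPS).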

For a multiprocessor system, the performance of the process pool implementation scales linearly with the number of processors. The performance of the pool-thread-per-message model is much worse than process pool because the former does not fully utilize the available concurrency of the multiprocessor hardware: the processor running the main listening thread becomes the bottleneck.

Figure 23: Performance of software architectures relative to the event-based model on different hardware. For example, the performance of the stateless proxy on the 4xP hardware in the thread pool architecture is approximately three times that of the event-based architecture on the same hardware.


6.3 Stateful proxy

Unlike the stateless proxy, a transaction stateful proxy needs to maintain SIP transaction state for the duration of each transaction. We used only UDP transport for our tests and did not perform any DNS lookups. As shown in Figure 19, a single Proxy 200 test involves six incoming and eight outgoing messages. In Figure 22, compared to the stateless proxy, the stateful proxy performs additional steps such as transaction (or client branch) matching. The transaction data structures are locked for exclusive access in a multi-threaded system. The processing steps can be implemented in different software architectures as follows:

Event-based: Most of the blocking operations are made non-blocking using events. A single thread handles events from a queue (e.g., timer events) as well as messages from the listening socket. There is no locking and there are no mutexes. Only two operations remain blocking: listening for an incoming message on the socket, and listening for events on the event queue. A single-threaded event-based system does not take advantage of the underlying multiprocessor architecture.

Thread per message (or transaction): A main thread listens for incoming messages. If the message is a request not matching any previous transaction, then a new thread is created to handle the new transaction associated with this message. The thread persists as long as the transaction exists. Similarly, a process-per-message model can be defined that creates a new process for each incoming connection and message.

Thread pool: This is similar to the previous method, except that instead of creating a new thread, it reuses a thread from the thread pool. This reduces the thread creation overhead. Locks are used for accessing shared data. Potentially blocking operations include DNS lookup, sendmsg, request matching and database access. This is the original architecture of our SIP server, sipd [23].

(Two-stage) thread pool: A pool of identical threads is created. Each thread handles a specific subset of the user population based on the hash value of the user identifier, similar to the second stage of our load sharing architecture. A request is processed in two stages. The first stage thread listens for incoming messages, does minimal parsing, and chooses the second stage thread based on the destination user identifier. The message is then handed over to that second stage thread. The second stage is purely event-based with no other locking. Since a single thread handles the requests for the same set of users, we do not need to lock the database or transaction data structures. The number of threads in the pool is determined by the number of processors. A minimal sketch of this dispatch appears below.
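The following sketch shows the two-stage hand-off, with hypothetical names and strings standing in for parsed messages; the queue hand-off is the only locked step, and each second-stage thread owns its user set outright.

    #include <condition_variable>
    #include <deque>
    #include <functional>
    #include <mutex>
    #include <string>
    #include <thread>
    #include <vector>

    struct Worker {                       // one second-stage thread per processor
        std::mutex m;
        std::condition_variable cv;
        std::deque<std::string> q;        // messages for this thread's user set
    };

    void workerLoop(Worker& w) {
        for (;;) {
            std::unique_lock<std::mutex> lk(w.m);
            w.cv.wait(lk, [&] { return !w.q.empty(); });
            std::string msg = std::move(w.q.front());
            w.q.pop_front();
            lk.unlock();
            // Process msg to completion: this thread owns all transactions and
            // user records of its user set, so no further locking is needed.
        }
    }

    int main() {
        unsigned n = std::thread::hardware_concurrency();
        if (n == 0) n = 2;                               // fallback if unknown
        std::vector<Worker> pool(n);
        std::vector<std::thread> threads;
        for (auto& w : pool) threads.emplace_back(workerLoop, std::ref(w));

        // First stage: minimal parsing extracts the destination user, whose
        // hash selects the second-stage thread; no database query is needed.
        auto dispatch = [&](const std::string& user, std::string msg) {
            Worker& w = pool[std::hash<std::string>{}(user) % pool.size()];
            { std::lock_guard<std::mutex> g(w.m); w.q.push_back(std::move(msg)); }
            w.cv.notify_one();
        };
        dispatch("[email protected]", "INVITE sip:[email protected] SIP/2.0 ...");
        for (auto& t : threads) t.join();                // this sketch runs forever
    }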

These models can be further extended to processes as follows; we have not evaluated these extensions yet:

Process pool: A pool of identical processes is created, each listening on the same socket. When a message is received, the receiving process performs all the processing steps for that message. Shared memory is used for sharing the transaction state and user contacts among multiple processes. This is the architecture of the SIP express router [35].

Two-stage event and process-based: This is similar to the two-stage thread pool model, but uses processes instead of threads. The inter-process communication is done using pipes or shared memory. Multiple first stage processes can be used to allow more concurrency.

A generic thread-per-message design is easy to understand and implement. However, this model suffers from poor performance at higher load [36]. As the load increases, the number of threads in the system also increases. If threads block waiting for network responses, the maximum number of simultaneous requests active in the system is small. The transaction lifetime further reduces the system capacity. For example, if the OS supports 10000 threads and the SIP transaction lifetime is about 30 seconds, then at most 10000/30 ≈ 333 transactions/second can be processed by the system. Unlike a web server, a SIP server is further constrained by the fact that about 70% of calls are answered within roughly 8.5 seconds [37], while unanswered calls ring for 38 seconds. Thus, a bad design results in an insufficient number of threads, which leads to higher call blocking or call setup delays at high call volume. Instead, we need a true event-driven architecture, which requires threads to be returned to the free-thread pool whenever they would otherwise make a blocking call.

Table 2 and Figure 23 compare the performance of the stateful proxy in different architectures on the same set of hardware, except that 1xS is replaced by a single-processor UltraSPARC-IIi, 360 MHz, 256 MB, running Solaris 5.9. The event-based system performs best on a single-processor machine. For an N-processor machine, the thread pool performance is much worse than N times the single-processor performance due to memory access contention.

Software architecture              1xP    4xP    1xS    2xS
Event-based                       1150    300    160    400
Thread per message                 600    175     90    300
Thread pool (sipd)                 850    340    120    300
2-stage thread pool               1100    550    155    500

Table 2: Performance (CPS) of stateful proxy for Proxy 200 test

6.4 The best architecture

The two-stage thread pool model for the stateful proxy and the thread pool model for the stateless proxy combine the event and thread pool architectures. They provide an event loop in each thread and a pool of threads for concurrency on multiprocessor machines. Lock contention is reduced by allowing the same thread to process all the steps of a message or transaction after the initial parsing. Among the multi-threaded software architectures, this gives the best performance in our tests. We have not yet evaluated the stateful proxy in the process pool model.

The stateless proxy performance is usually limited by CPU speed, whereas memory utilization remains constant. On the other hand, the stateful proxy may be limited by either CPU or memory, depending on the various transaction timers. By default, SIP transaction state is maintained for about 30 seconds. Thus, a load of 1000 CPS creating 2000 transactions per second will require memory for about 60 thousand transactions. Assuming 10 kB for storing each transaction state, this requires 600 MB. In our tests, we reduced the timer values significantly so that memory is not the bottleneck.
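In general, the transaction memory requirement is approximately: load (calls/second) × transactions per call × transaction lifetime (seconds) × state size per transaction. The example above is 1000 × 2 × 30 × 10 kB = 600 MB.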

6.5 Effect on load sharing performance

The software architecture choice of the SIP server further enhances the load sharing results, since the best single stateless proxy capacity is about 1600 CPS on a 3 GHz Pentium 4 with 1 GB of memory running Linux 2.4.20. In addition, we achieved about 4000 CPS throughput for the first stage proxy in a simplified implementation. This means that even S1P2 in stateless proxy mode can achieve close to 3200 CPS, i.e., 11 million BHCA, on this hardware configuration.

Similarly, S3P3 in stateful proxy mode can achieveclose to 13 million BHCA.
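(The arithmetic: 3200 calls/second × 3600 seconds/hour ≈ 11.5 million call attempts per hour. For the stateful case, assuming each of the three second-stage servers sustains roughly 1200 CPS, S3P3 reaches about 3600 CPS, i.e., close to 13 million BHCA.)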

7 Conclusions

We have shown how to apply some of the existing failover and load sharing techniques to SIP servers, and have proposed an identifier-based two-stage load sharing method. Using DNS is the preferred way to offer redundancy since it does not require network co-location of the servers; for example, one can place SIP servers on different networks, which is rather difficult with IP address takeover and NATs. This is less important for enterprise environments, but interesting for voice service providers such as Vonage. DNS itself is replicated, so a single name server outage does not affect operation. We combine DNS, server redundancy and identifier-based load sharing in our two-stage reliable and scalable server architecture, which can theoretically scale to any capacity. A large user population is divided among independent second stage servers such that each server's load remains below its capacity.

We have also described the failover implementation and the performance evaluation of our two-stage architecture for scalability, using the SIPstone test suite in our test bed. Our results verify the theoretical improvement of load sharing for call handling and registration capacity. We achieve carrier grade scalability using commodity hardware: for example, the 2800 calls/second supported by our S3P3 load sharing configuration with six servers roughly translates to 10 million call arrivals per hour (Lucent's 5E-XC switch, a high-end 5ESS, can support four million BHCA for the PSTN). This is further increased to 16 million BHCA with our memory pool and event-based architecture. We also achieve the 5-nines reliability goal, even if each individual server has an uptime of only 99% (about 3 days/year of downtime), using the two-stage architecture. Other call stateful services such as voicemail, conferencing and PSTN interworking need more work to support failover and load sharing in the middle of a call without breaking the session.

Detection and recovery of wide area path outages [38] is complementary to individual server failover. Instead of statically configuring the redundant servers, it would be useful if servers could automatically discover and configure other available servers on the Internet, e.g., to handle temporary overload [31]. This gives rise to a service model where a provider can sell its SIP services dynamically by becoming part of another customer's SIP network. Dynamic or adaptive load sharing based on the workload of each server is left for further study. However, it is not clear how useful this will be for Internet telephony, because the call distribution is fairly uniform, unlike the Zipf distribution of web page popularity; therefore, a good static hash function can uniformly distribute the call requests among the servers. A peer-to-peer approach to the SIP service also seems promising for scalability and robustness [39].

Acknowledgment

Huitao Sheng helped with the initial performance measurement. Jonathan Lennox is the primary architect of our SIP server, sipd, and helped with the SIPstone tools. Sankaran Narayanan implemented the efficient database interaction in sipd. This work is supported by a grant from SIPquest, Inc.

References

[1] J. Rosenberg, H. Schulzrinne, G. Camarillo, A. R. Johnston, J. Peterson, R. Sparks, M. Handley, and E. Schooler, “SIP: session initiation protocol,” RFC 3261, Internet Engineering Task Force, June 2002.
[2] H. Schulzrinne and J. Rosenberg, “Internet telephony: Architecture and protocols – an IETF perspective,” Computer Networks and ISDN Systems, vol. 31, pp. 237–255, Feb. 1999.
[3] J. Rosenberg and H. Schulzrinne, “Session initiation protocol (SIP): locating SIP servers,” RFC 3263, Internet Engineering Task Force, June 2002.
[4] H. Bryhni, E. Klovning, and Ø. Kure, “A comparison of load balancing techniques for scalable web servers,” IEEE Network, vol. 14, July 2000.
[5] K. Suryanarayanan and K. J. Christensen, “Performance evaluation of new methods of automatic redirection for load balancing of Apache servers distributed in the Internet,” in IEEE Conference on Local Computer Networks, (Tampa, Florida, USA), Nov. 2000.
[6] O. Damani, P. Chung, Y. Huang, C. Kintala, and Y. Wang, “ONE-IP: techniques for hosting a service on a cluster of machines,” Computer Networks, vol. 29, pp. 1019–1027, Sept. 1997.
[7] D. Oppenheimer, A. Ganapathi, and D. Patterson, “Why do Internet services fail, and what can be done about it?,” in 4th USENIX Symposium on Internet Technologies and Systems (USITS ’03), (Seattle, WA), Mar. 2003.
[8] A. C. Snoeren, D. Andersen, and H. Balakrishnan, “Fine-grained failover using connection migration,” in USENIX Symposium on Internet Technologies and Systems, (San Francisco), Mar. 2001.
[9] High-Availability Linux Project, http://www.linux-ha.org/.
[10] Cisco Systems, Failover configuration for LocalDirector, http://www.cisco.com/warp/public/cc/pd/cxsr/400/tech/locdf_wp.htm.
[11] G. Hunt, G. Goldszmidt, R. P. King, and R. Mukherjee, “Network dispatcher: a connection router for scalable Internet services,” Computer Networks, vol. 30, pp. 347–357, Apr. 1998.
[12] C.-L. Yang and M.-Y. Luo, “Efficient support for content-based routing in web server clusters,” in 2nd USENIX Symposium on Internet Technologies and Systems, (Boulder, Colorado, USA), Oct. 1999.
[13] Akamai Technologies, Inc., http://www.akamai.com.
[14] A. Gulbrandsen, P. Vixie, and L. Esibov, “A DNS RR for specifying the location of services (DNS SRV),” RFC 2782, Internet Engineering Task Force, Feb. 2000.
[15] M. Mealling and R. W. Daniel, “The naming authority pointer (NAPTR) DNS resource record,” RFC 2915, Internet Engineering Task Force, Sept. 2000.
[16] N. Ohlmeier, “Design and implementation of a high availability SIP server architecture,” Thesis, Computer Science Department, Technical University of Berlin, Berlin, Germany, July 2003.
[17] M. Tuexen, Q. Xie, R. Stewart, M. Shore, J. Loughney, and A. Silverton, “Architecture for reliable server pooling,” Internet Draft draft-ietf-rserpool-arch-09, Internet Engineering Task Force, Feb. 2005. Work in progress.
[18] M. Tuexen, Q. Xie, R. J. Stewart, M. Shore, L. Ong, J. Loughney, and M. Stillman, “Requirements for reliable server pooling,” RFC 3237, Internet Engineering Task Force, Jan. 2002.
[19] P. Conrad, A. Jungmaier, C. Ross, W.-C. Sim, and M. Tuexen, “Reliable IP telephony applications with SIP using RSerPool,” in World Multiconference on Systemics, Cybernetics and Informatics (SCI), (Orlando, USA), July 2002.
[20] A. Srinivasan, K. G. Ramakrishnan, K. Kumaran, M. Aravamudan, and S. Naqvi, “Optimal design of signaling networks for Internet telephony,” in Proceedings of the Conference on Computer Communications (IEEE Infocom), (Tel Aviv, Israel), Mar. 2000.
[21] R. Sparks, “The session initiation protocol (SIP) refer method,” RFC 3515, Internet Engineering Task Force, Apr. 2003.
[22] J. Janak, “SIP proxy server effectiveness,” Master’s thesis, Department of Computer Science, Czech Technical University, Prague, Czech Republic, May 2003.
[23] J. Lennox, “Services for Internet telephony,” Ph.D. thesis, Department of Computer Science, Columbia University, New York, New York, Jan. 2004. http://www.cs.columbia.edu/~lennox/thesis.pdf.
[24] Cisco IP phone 7960, Release 2.1, http://www.cisco.com.
[25] J. Jung, E. Sit, H. Balakrishnan, and R. Morris, “DNS performance and the effectiveness of caching,” in ACM SIGCOMM Internet Measurement Workshop, (San Francisco, California), Nov. 2001.
[26] K. Singh, W. Jiang, J. Lennox, S. Narayanan, and H. Schulzrinne, “CINEMA: Columbia Internet extensible multimedia architecture,” Technical Report CUCS-011-02, Department of Computer Science, Columbia University, New York, New York, May 2002.
[27] W. Jiang, J. Lennox, S. Narayanan, H. Schulzrinne, K. Singh, and X. Wu, “Integrating Internet telephony services,” IEEE Internet Computing, vol. 6, pp. 64–72, May 2002.
[28] MySQL, Open Source SQL server, http://www.mysql.com.
[29] K. Singh and H. Schulzrinne, “Failover and load sharing in SIP telephony,” Technical Report CUCS-011-04, Department of Computer Science, Columbia University, New York, NY, USA, Mar. 2004.
[30] P. Srisuresh and D. Gan, “Load sharing using IP network address translation (LSNAT),” RFC 2391, Internet Engineering Task Force, Aug. 1998.
[31] W. Zhao and H. Schulzrinne, “DotSlash: A self-configuring and scalable rescue system for handling web hotspots effectively,” in International Workshop on Web Caching and Content Distribution (WCW), (Beijing, China), Oct. 2004.
[32] H. Schulzrinne, S. Narayanan, J. Lennox, and M. Doyle, “SIPstone – benchmarking SIP server performance,” Technical Report CUCS-005-02, Department of Computer Science, Columbia University, New York, New York, Mar. 2002.
[33] B. Jenkins, “Algorithm alley,” Dr. Dobb’s Journal, Sept. 1997. http://burtleburtle.net/bob/hash/doobs.html.
[34] D. R. Karger, A. H. Sherman, A. Berkheimer, B. Bogstad, R. Dhanidina, K. Iwamoto, B.-J. J. Kim, and L. Matkins, “Web caching with consistent hashing,” Computer Networks, vol. 31, pp. 1203–1213, May 1999.
[35] “SIP express router (SER): a high performance free SIP server,” http://www.iptel.org/ser.
[36] M. Welsh, D. Culler, and E. Brewer, “SEDA: an architecture for well-conditioned, scalable Internet services,” in Symposium on Operating Systems Principles (SOSP), (Chateau Lake Louise, Canada), ACM, Oct. 2001.
[37] F. P. Duffy and R. A. Mercer, “A study of network performance and customer behavior during direct-distance-dialing call attempts in the USA,” Bell System Technical Journal, vol. 57, no. 1, pp. 1–33, 1978.
[38] D. Andersen, H. Balakrishnan, M. F. Kaashoek, and R. Morris, “Resilient overlay networks,” in 18th ACM SOSP, (Banff, Canada), Oct. 2001.
[39] K. Singh and H. Schulzrinne, “Peer-to-peer Internet telephony using SIP,” Technical Report CUCS-044-04, Department of Computer Science, Columbia University, New York, NY, Oct. 2004.
