
Securing Stateful Grid Servers through Virtual Server Rotation

Matthew Smith, Christian Schridde and Bernd Freisleben
Dept. of Mathematics and Computer Science, University of Marburg
Hans-Meerwein-Str. 3, D-35032 Marburg, Germany
{matthew,schriddc,freisleb}@informatik.uni-marburg.de

ABSTRACT
The Grid computing paradigm is aimed at providing seamless access to different kinds of resources, such as compute clusters, data, special appliances and even people. Like most complex IT systems, Grid middleware systems exhibit a number of security problems, and there will always be attacks that are unknown and can circumvent even the best security measures and intrusion detection systems. This creates the requirement that Grid environments should be equipped with intrusion tolerance mechanisms as well as with the traditional intrusion prevention and intrusion detection mechanisms. In this paper, we present a new intrusion tolerance approach which improves the security of stateful WSRF Grid servers against stealth attacks. The proposal is based on a novel server rotation strategy utilizing paravirtualization to close attack windows for stateful service-oriented Grid headnode servers. A flexible plugin-based rotation manager deals with the complex issue of stateful connections to the Grid server, and a database connector is utilized to detach service state from the rotating functional components of the Grid server. A prototypical implementation based on the Globus Toolkit 4 is presented.

Categories and Subject Descriptors
D.4.6 [Software]: OPERATING SYSTEMS—Security and Protection; C.2.4 [Computer Systems Organization]: COMPUTER-COMMUNICATION NETWORKS—Distributed Systems; C.4 [Computer Systems Organization]: PERFORMANCE OF SYSTEMS—Fault tolerance; Reliability, availability, and serviceability

General Terms
Security

Keywords
Grid Computing, Intrusion Tolerance, Security, Self-Cleansing Servers, Virtualization

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
HPDC’08, June 23–27, 2008, Boston, Massachusetts, USA.
Copyright 2008 ACM 978-1-59593-997-5/08/06 ...$5.00.

1. INTRODUCTION
Unlike traditional cluster computing, in which only a small number of users work in a closed system, Grid computing exposes local clusters to a large number of users via the Internet using open Grid middleware such as Globus, gLite and Unicore. Like most complex IT systems, these middleware solutions exhibit a number of security problems [1, 2, 3, 4], opening the entire system to attack. As a consequence, Grids are an attractive target for intruders, since the Grid offers standardized access to a large number of machines, which can be misused in various ways. The considerable computing power of clusters exposed via the Grid can be used to break passwords, and the large storage capacity is well suited to storing and sharing illegal software and data. The generous bandwidth of the Internet is ideal for launching Denial-of-Service (DoS) attacks or for hosting file sharing services, to name just a few attacks.

The difficulty in securing computer systems in general is largely due to the increasing complexity of today's systems and the constant innovation and morphing of attack techniques [5]. The research area of Intrusion Tolerant Systems (ITS) [6] aims to cope with the inevitable attacks and to create systems which, to a certain extent, can continue operating even though they are being attacked. Traditionally, this is done by implementing the following steps: self-diagnosis, repair, and reconstitution. The main drawback of this approach is that self-diagnosis requires an Intrusion Detection System (IDS) to raise an alarm. This works well for known or obvious attacks (like Distributed Denial-of-Service (DDoS) attacks) but fails to cope with unknown and stealth attacks. A stealth attack is an attack which does not affect the regular function of the attacked system and as such might not be noticed; for example, stealing a copy of /etc/shadow from a web server might not be noticed, while defacing the front page of the same server or even crashing the system would definitely draw attention to the attack. Furthermore, many ITS work by using redundancy to replace compromised resources with backups. This creates problems with maintaining the state of the compromised resources, so many ITS only deal with stateless resources, such as static web servers.

In this paper, we present a new intrusion tolerance approach which improves the security of stateful WSRF Grid servers against stealth attacks. The system is based on a novel server rotation strategy utilizing paravirtualization to close attack windows for stateful service-oriented Grid headnode servers. Our approach does not require complex attack detection procedures or heterogeneous redundancy mechanisms. A flexible plugin-based rotation manager deals with the complex issue of managing stateful connections to the Grid server, and a database connector is utilized to detach service state from the rotating functional components of the Grid server. A prototypical implementation based on the Globus Toolkit 4 is presented. The Globus services survive attacks against both the Grid middleware and the underlying operating system through server rotation without losing their state.

The paper is organized as follows. In Section 2, related work is discussed. In Section 3, several attack scenarios are described. Section 4 presents our novel server rotation strategy for self-cleansing stateful Grid servers. The implementation is described in Section 5. Experimental results are presented in Section 6. Section 7 concludes the paper and outlines areas for future work.

2. RELATED WORK
Sood, Huang and Arsenault [7, 8] suggest treating all servers as potentially compromised, regardless of whether an Intrusion Detection System (IDS) has raised an alert. Their work focuses on high availability computing where all servers have redundant backup servers and hardware failover switches. The authors argue that undetected attacks do not cause instant harm, but increase damage over time. In [8], an estimate based on a report by banking security experts is given: a theft of $5,000 to $10,000 can be carried out over a few weeks, while larger losses of up to $1 million are likely to take four to six months [9]. To make IT systems resilient against long-lasting attacks, the authors propose rotating backup servers with the primary servers on a regular basis. The server which is currently offline is then restored from a secure image. All malicious code (together with the state of the server) is lost during rotation, thus automatically cleaning the system of attack code. The rotation is made possible by using the redundant servers and hardware failover switches available in high availability computing. The drawback of the proposed system is that it only works well for stateless servers. DNS, NFS and static web servers are given as example applications for the Self-Cleansing Intrusion Tolerance (SCIT) technology proposed by the authors. Furthermore, no allowance is made for long-lasting TCP/IP connections, which would be terminated with an error state if a rotation interrupted them, making UDP the more viable protocol for the proposed architecture. Grid servers, like many other server products, are not stateless and show complex stateful interaction patterns which make them incompatible with the proposed approach.

The same basic idea using virtualization technology was later also proposed by Reiser and Kapitza [10]. The presented VM-FIT system uses the Xen hypervisor technology for redundant server copies which can periodically be refreshed to increase the resilience of the server. To achieve this, Domain0/Xen0 runs multiple XenUs containing the actual application and one XenU called Domain NV which is responsible for passing user requests to the replicas. This makes Domain NV the critical component of the system, since a compromise of Domain NV can compromise all replicas, and Domain NV is not protected by the replication mechanism. Critically, the approach requires that dedicated support for an application be integrated into Domain NV. In the case of the CORBA prototype presented in the paper, this means that a CORBA middleware must run in Domain NV. In addition, Domain NV must implement group communication to ensure that the state is updated on all replicas. This creates a potential for security problems in Domain NV defeating the rest of the security system before it has a chance to operate. The authors state this problem and offer two alternatives. First, Domain NV must be made intrusion tolerant. The authors themselves state that the drawback of this approach is that it requires more complex, Byzantine fault tolerant group communication protocols, which are not presented in the paper. However, even a fault tolerant group communication protocol would not protect the system from an operating system fault or an attack on Domain NV. The second approach presented in the paper is that the system makes sure that Domain NV is a protected, isolated entity that cannot be influenced by Domain 0, but no suggestion is presented on how this could be done. The requirement that a Domain U must be protected against Domain 0 is very difficult to achieve, since it contradicts the paravirtualization technology used: the distinguishing feature of Domain 0 is that it can control all hosted domains. Even if this requirement could be met, this approach would not deal with the risk of an internal compromise due to the heavyweight nature of the software running in Domain NV. Furthermore, due to the group communication infrastructure, the presented system does not work for applications with multithreading and thus excludes almost all modern server applications.

SITAR [11, 12] is an intrusion-tolerant architecture for distributed services which includes adaptive reconfiguration, heartbeat monitors, runtime checks, and commercial off-the-shelf servers mediated by proxies. SITAR relies on hardware redundancy for intrusion tolerance, but requires a shared memory environment for the redundant servers and creates a large overhead for the intrusion detection and tolerance infrastructure. Next to the functional hardware redundancy, SITAR requires a Proxy Server, a Ballot Monitor and an Acceptance Monitor. Thus, if a server has a single redundant backup, SITAR adds six more servers to the setup for the intrusion tolerance architecture. Furthermore, SITAR still requires an IDS to successfully detect an attack for the intrusion tolerance mechanisms to operate, thus making it difficult to defend against unknown attacks.

The ITUA project [13] relies on intrusion detection and randomized automated responses. It provides middleware based on process replication and unpredictable adaptation to tolerate the faults that result from staged attacks. As in SITAR, the IDS must successfully detect an attack for ITUA to operate.

HACQIT [14] uses a primary/backup server architecture which, unlike the two approaches above, can cope with unknown attacks, but only works for scenarios with known users. The basic idea is to mirror servers using different implementations for the server software. For example, for a web server scenario both IIS and Apache would be utilized. All user requests are sent to both servers, the idea being that an attack against IIS will not work against Apache. Thus, the two results can be compared, and if they differ, one of the servers has been compromised. Determining which of the servers was compromised can be difficult, but according to the authors most attacks lead to one server responding with 2xx or 3xx (success and redirect codes) and the other with 4xx or 5xx (error codes), thus clearly showing which server was compromised. In the case of a compromise, the compromised server is taken offline. The approach does not deal with operating system or cross-server attacks, such as SQL injection. To fully utilize the proposed system, a fully heterogeneous environment must be created. Each functional component must be paired with a different implementation offering the same functionality, e.g. Windows/Linux, IIS/Apache, MySQL/Microsoft SQL, etc. Apart from the fact that this is not always possible (for instance in Grid computing), it significantly increases the administrative overhead for the uncompromised system. Furthermore, HACQIT uses a Mediator/Adaptor/Controller to sanitize requests before passing them on to the redundant servers. This introduces a single point of failure in a key component with a complex functional behaviour.

Valdes et al. [15] describe a similar system using heterogeneity and redundancy to cope with attacks. As such, it suffers from the same problems. The authors assume that no more than a critical number of servers are in an undetected compromised state at any given time. Thus, they need an intrusion detection system. They complement this with an agreement protocol based on heterogeneous redundant components. The agreement protocol assumes that all non-faulty and non-compromised servers give the same answer to the same request. Thus, the architecture is meant to provide content that is static from the end user's point of view. If an attacker chooses not to change the expected responses (i.e. the HTML content), (s)he can misuse the resources in an undetected manner in any number of ways (spambot, file server, relay, etc.). The work does not deal with the issue of content update.

Saidane et al. [16] introduce an extension to the work of Valdes et al. to cover SQL storage queries for web servers. An agreement protocol over a number of redundant SQL storage query generators is aimed at weeding out malformed SQL queries. Using this system, a certain amount of state can be stored in a database and can be accessed by the redundant web servers. The problem with this approach is finding at least three different web servers that generate the same SQL queries without being vulnerable to the same attack. As with all other approaches based on heterogeneous redundant servers, the hardware overhead is quite large: at least two further alternate servers and three proxies are required for one production server. Furthermore, attacks which do not target the primary system go undetected, since the agreement protocol does not catch them.

Several publications deal with integrating an IDS into a Grid. Schulter et al. [17] describe a Grid IDS which combines a Host- and Network-IDS to analyze the user's behavior. A scheduler loads the user's profiles and starts one or more analyzer processes to detect anomalies. All components interact closely with a database to update changed profiles regularly. The IDS utilizes stored user behavior to detect anomalous activities. Fang-Yie Lue et al. [18] also integrate an IDS into a Grid. Their solution uses existing Grid resources to detect high volume packets, especially DDoS attacks. Instead of standard technologies, they use their own solution to overcome possible performance bottlenecks. Their approach mainly deals with the distribution of load for the IDS. None of the Grid IDS solutions proposes countermeasures for attacks which are not detected, limiting them to anomaly detection algorithms and the Grid specific attacks currently known. None of the above systems deals with intrusion tolerance, and none copes well with unknown stealth attacks.

To summarize, there are two categories of related work. The first category does not rely on detecting an intrusion and as such can deal with stealth attacks. However, this is achieved at the cost of limiting the scope of the approaches to stateless static content systems, which is not sufficient for the area of Grid computing. The second category requires that the attack is detected one way or the other and often also has the problem of state reproduction due to the distributed nature of its redundancy mechanisms. The solution of using heterogeneous servers to validate primary function responses is not applicable to Grid computing, since unlike web servers, which have been successfully standardized, Grid servers currently do not have the same interfaces or behaviour. Thus, requests cannot be transparently sent to different Grid server implementations. The solution of using heterogeneous servers also does not detect attacks which do not affect the primary function of a server, since only the responses to user requests are monitored. Our work presents a novel approach which successfully deals with unknown stealth attacks, while also preserving the state of the Grid servers and coping with TCP/IP connections.

Figure 1: Traditional Grid

3. ATTACK SCENARIOS
Before we introduce our intrusion tolerance system, we need to look at the types of attacks relevant to Grid computing systems and some recent intrusion prevention mechanisms deployed to handle them. Figure 1 shows a traditional Grid setup often used in low security environments. The Grid consists of two clusters P and Q, and two users A and B with valid credentials are present. User software is installed locally on all nodes. There is also an external attacker without valid Grid credentials. The external attacker must utilize vulnerabilities in the Grid middleware or the underlying operating system to compromise the Grid headnode (1.1), and from there (s)he can attack the compute nodes (1.2). The valid users can use their login and their installed software to attack other users' software and data (2) or the resources of other compute nodes (3). We can classify these attacks into three attack vectors: external attacks (Vector 1) and internal attacks (Vectors 2 and 3). Dealing with these attacks is a daunting task, since the number of users and the amount of deployed software are growing rapidly and exhibit a high rate of churn. This creates very complex interaction patterns which make intrusion detection difficult, since the IDS must separate many varying legitimate interaction patterns from attack patterns. In previous work [19, 20], we discussed these new threats for Grid computing systems, and we briefly recap the attack classification relevant for this paper. The attacks against Grid systems can be divided into five classes: resource attacks and data attacks, each directed against either the Grid providers or the Grid users, and meta-data attacks against Grid users. Resource attacks focus on the misuse of resources, such as stealing bandwidth for a botnet attack. Data attacks are attacks which steal or modify data like certificates or job data. Meta-data attacks are attacks which steal knowledge about Grid jobs as opposed to the job data itself. For a more comprehensive discussion, the reader is referred to [20].

Since there are several known vulnerabilities in the three common Grid middleware solutions (GT4, gLite and Unicore) [1, 2, 3, 4] and most probably a fair number of undisclosed vulnerabilities, attack vector 1 is a large threat to Grid computing, since the Grid headnode is the most critical component of the Grid infrastructure. In the following, we will mainly deal with external resource and meta-data attacks on the Grid provider, hindering the attacker from getting a foothold on the Grid headnode (1.1). Removing the foothold also significantly hampers external attacks against the users (1.2). Some data attacks against the Grid provider will also be treated. Attack vectors 2 and 3 (the internal attacks) will also be discussed, but more briefly. For more information on the internal attacks, the reader is referred to [20].

4. SELF-CLEANSING GRID SERVERS
In this section, we present our architecture for an intrusion tolerant Grid server using virtualization technology to enable self-cleansing of the entire guest operating system. As stated in Section 3, targeted stealth attacks are becoming a serious concern. Given the additional complexity of Grid systems, we believe that undetected Grid server compromises will occur no matter which security precautions are taken. Thus, the main design goal of our work is to periodically refresh the Grid server from a clean read-only image every couple of minutes, thus removing any attack code from the Grid system, whether or not it was detected. This periodic refresh significantly reduces the window of opportunity to damage the system. The basic approach is simple: a read-only image of the Grid headnode, complete with a valid server certificate, is created and booted on two systems. At any given time, only one of the Grid headnodes is connected to the Internet and the cluster. Periodically, the active headnode is shut down and the passive headnode takes the active role. The node that was shut down then immediately restarts from the read-only image and takes the passive role. This procedure is repeated indefinitely.
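The active/passive cycle described above can be illustrated with a small simulation. This is only a sketch under stated assumptions and not the prototype's code: the class Headnode, the function rotate and the tainted flag (a stand-in for undetected attack code) are hypothetical names introduced for illustration, and no real virtual machines are started.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Headnode:
    """Hypothetical model of one Grid headnode booted from the read-only image."""
    name: str
    role: str              # "active" or "passive"
    boots: int = 0         # number of restarts from the clean image
    tainted: bool = False  # stands in for undetected attack code

    def restart_from_image(self) -> None:
        """Reboot from the clean image; every runtime modification is lost."""
        self.boots += 1
        self.tainted = False


def rotate(active: Headnode, passive: Headnode) -> Tuple[Headnode, Headnode]:
    """Perform one rotation and return (new_active, new_passive)."""
    active.role = "passive"      # the active node is shut down ...
    passive.role = "active"      # ... and the clean standby takes over
    active.restart_from_image()  # the old node restarts from the image
    return passive, active


if __name__ == "__main__":
    active, passive = Headnode("X", "active"), Headnode("Y", "passive")
    for step in range(4):
        # Pessimistic assumption: the active node is compromised in every cycle.
        active.tainted = True
        active, passive = rotate(active, passive)
        print(step, "active:", active.name, "clean standby:", not passive.tainted)
```

In this model a compromise never outlives one rotation interval, which is the property the periodic refresh is meant to provide. The sketch deliberately ignores service state and open connections, which are the hard part of the problem.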

One of the main problems to be solved is state management. Each time the active node is shut down, the running state of all software is lost and all open connections are terminated. In a traditional Grid, the headnode runs both the Grid middleware and the cluster scheduling software and as such must keep the state of a number of vital Grid services (e.g. WS-GRAM, GSI-SSH, GridFTP), the scheduling software (e.g. Torque, SGE) and current file transfers (e.g. GridFTP, SCP), and it must also handle the state of interactive sessions. Simply shutting down and restarting this sort of system every couple of minutes would lead to an enormous number of errors. No job data which takes more than a couple of minutes to upload could be transferred to the Grid, unscheduled jobs would be lost and scheduled jobs would become orphans. The Grid would be completely inoperable. To solve this problem, the first step is to separate the issues into manageable parts.

4.1 New Grid Architecture
To better cope with the complexity of a traditional Grid, we propose a new Grid setup which separates a number of Grid components currently running on one machine into separate zones. Each user gets his or her own virtual operating system with a private login into which his or her software is installed exclusively. This prevents malicious software installed by one user from affecting others and prevents direct inter-user attacks, since each user is confined to his or her own virtual environment. It also greatly simplifies intrusion detection: since there are no shared nodes, it can be clearly seen which user is authorized to be where. For more information on the inter-user attacks, the reader is referred to [20].

Furthermore, the cluster scheduling software is operated solely by the Grid middleware and thus should not be exposed to the Internet (and the external attacks from the Internet). For this purpose, we created a custom bridge to link the Grid headnode to the cluster scheduling system over two machines and set up a Grid demilitarized zone to separate the cluster headnode from the Grid headnode and the Internet. This minimizes the threat from a Grid middleware compromise to the compute clusters and cluster virtualization. More information on decoupling the Grid headnode from the cluster headnode can be found in [21].

Figure 2: Virtual Grid

Figure 2 shows this new Grid setup. The software of each user is only installed in his or her virtual Grid environment, and the users can only log on to their virtual machines. This prevents the inter-user attacks (2) and (3), since they would require that the user is able to break out of his or her virtual machine. The use of private virtual machines enables job data to be pulled directly onto the worker nodes, avoiding the danger of Grid headnode compromises to the data transfer. This leaves the external attacks (1.x). Two firewalls separate the Grid headnode from the Internet and the backend cluster. This minimizes the threat posed by external attackers. However, since a number of ports must be open to allow the Grid to function, it still remains possible for an attacker to compromise the Grid headnode (1.1) and from there to attack the cluster (1.2). Once on the physical cluster, the attacker can then compromise the virtual execution environment of the users (1.3). Since job data can be pulled by the worker nodes directly, no GridFTP or SCP connection will pass through the headnode. This unconventional worker node configuration is made possible by mapping worker node IP addresses to dynamic inter-node firewall rules implemented on the Xen0. For more information on the security of the cluster nodes and the firewalling setup, the reader is referred to [21, 22]. Since the cluster scheduler now runs on a separate machine, its state does not need to be considered. If the user requires an interactive login to his or her worker nodes, a direct login can be enabled, so no interactive sessions are routed via the headnode. Enabling this feature opens up a new attack possibility, but its risk is minimal, since GSIssh would require an attacker to successfully break an X.509 certificate protected login in a single user environment. The only thing left to be discussed is the state of the actual Grid server with its services.

Figure 3: State Management

4.2 Managing State
With traditional off-the-shelf servers like IIS or Apache, state is managed in several ways. It is kept in files, databases and in-memory in any number of combinations, making it very difficult to save the entire state of a product. However, the adoption of the service-oriented computing paradigm, in particular the Web Service Resource Framework (WSRF), in Grid computing has opened up an interesting opportunity to cleanly manage the state of a Grid server. The WSRF introduces the notion of a Web Service Resource (WS-Resource) formed by the combination of a Resource Document and a corresponding web service. It is the purpose of the Resource Document to capture state information for a WS-Resource while the corresponding web service implementation remains stateless. In this way, a multitude of WS-Resources can be created using one stateless web service implementation while capturing the state of execution in multiple Resource Documents. The WSRF further defines web service interfaces to inspect and alter the information contained in the Resource Document and to receive and subscribe for notifications about property changes of a WS-Resource. This new paradigm allows us to neatly separate the state of a server from the functional aspects of the server.

Figure 3 shows the architecture for state separation. We utilize two redundant virtual operating systems to host the Grid services (Service Hosts) and one virtual operating system to host a database for the Resource Documents (Storage Host). The Service Host contains the full Grid middleware with all the associated security vulnerabilities. The Storage Host is a minimal Linux system with the database being the only application service available to the two Service Hosts; even ssh logon is deactivated. The Globus Toolkit already provides functionality to create and manage persistent Grid services locally. We simply extended the persistence capabilities to store the Resource Documents on the remote Storage Host. The stateless service (e.g. the WS-GRAM Service) uses the ResourceHome to create our remote DataBaseResource object. This resource object stores its state in the remote database. We do not have to deal with concurrency issues, since at any given time only one of the Service Hosts will have write access to the state document.
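The separation of state from function can be illustrated with a minimal sketch. It is not the Globus persistence code: the class names ResourceStore and StatelessCounterService are hypothetical, an in-memory map stands in for the database on the Storage Host, and a simple integer counter stands in for a Resource Document.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for the Storage Host: maps a resource key to its stored state. */
class ResourceStore {
    private final Map<String, Integer> documents = new ConcurrentHashMap<>();

    int load(String key) {
        // A resource that has never been stored starts at 0 in this sketch.
        return documents.getOrDefault(key, 0);
    }

    void save(String key, int value) {
        documents.put(key, value);
    }
}

/** Hypothetical stateless service: it keeps no state of its own between calls. */
class StatelessCounterService {
    private final ResourceStore store;

    StatelessCounterService(ResourceStore store) {
        this.store = store;
    }

    /** Reads the state from the store, updates it and writes it back. */
    int add(String key, int delta) {
        int updated = store.load(key) + delta;
        store.save(key, updated);
        return updated;
    }
}
```

Because the service object holds only a reference to the store, a second instance created against the same store (as a fresh Service Host would be) continues from the value written by the first.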

Figure 4: Self-Cleansing Grid Servers

4.3 Rotating Servers

Since the Service Hosts contain the Grid middleware and the associated risks, it is this system that we wish to protect. We do this by periodically rotating a fresh Service Host into active duty. The virtualization setup described above allows us to continuously restart and exchange the Service Hosts without losing the state of the Grid services.

Figure 4 shows the architecture of our rotating server setup: (1) is the data connection between the Service Hosts (X, Y) and the Storage Host (S). The connection uses the private virtual network interfaces vif(n).1, vif(m).1 and vif1.0 which are not routed outside of the Xen0 domain. Utilizing IP Conntrack, we monitor all incoming connections on the physical network interface eth0 of the Xen0 (2)¹, since rotation must not take place while an active TCP/IP connection is in progress. The Rotation Manager is responsible for monitoring the current connections (3), refreshing the Service Hosts and setting the dynamic routing rules to connect and disconnect the virtual machines (4), thus creating our desired self-cleansing rotation mechanism (5).

Figure 5: State Preserving Rotation Algorithm

Figure 5 shows the algorithm implemented by the Rotation Manager. First, the Storage Host is started and a private network connection is set up. A copy of a read-only base image is made for Service Host X, and Service Host X is started. On the first pass, Service Host X then loads the state for all relevant services, and Xen0 sets a network route to Service Host X. Once Service Host X is active, a copy of the read-only base image is made for Service Host Y, and Service Host Y is started. Also, a timer is started for the Rotation Manager. At the end of the time slice, the algorithm enters a critical part² and Xen0 stops accepting new connections. Then, Service Host Y starts loading the service state from the Storage Host, and concurrently the Rotation Manager checks if it is safe to rotate. If there are still active connections, the rotation is delayed and new connections are allowed again for a given time slice. If it is safe to rotate, and Service Host Y has loaded its state, Xen0 removes the old network route to Service Host X and sets a new route to Service Host Y. The critical part ends with Xen0 allowing new connections again. The same procedure is repeated for Service Host Y. For the sake of clarity, the critical part for Service Host Y is not shown in the figure. The critical part starts when the Xen0 stops accepting connections and ends when connections are accepted again. The concurrent loading of state and checking for rotation safety is not a problem, since due to our new Grid setup the Globus services only change their state through external triggers (from users or the cluster); thus, if no external connections are present, both the rotation and the state are safe. If there are external connections, the algorithm terminates the current rotation. The rotation safety checks are an important aspect of the system discussed in the following section.

¹ Usually there are two physical network interfaces on a Grid headnode, one for connections to the Internet and one for connections to the cluster. For the sake of clarity, we only show one. The mechanism for the second interface is analogous to the first.
² This part of the algorithm is critical, since no new connections are accepted and thus the Grid headnode is not reachable. We will deal with this issue in the implementation section.
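The critical part of the algorithm can be summarized as a small simulation. This is only a sketch under simplifying assumptions: the class RotationSketch and its members are hypothetical, state loading is treated as instantaneous, and the routing and firewall actions are recorded in a log instead of being executed.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of one pass through the critical part of the rotation algorithm. */
class RotationSketch {
    enum Outcome { ROTATED, DELAYED }

    String activeHost = "X";        // host that currently holds the network route
    String waitingHost = "Y";       // freshly cloned host waiting to take over
    boolean acceptingNew = true;    // whether new connections are accepted
    final List<String> log = new ArrayList<>();

    Outcome attemptRotation(int activeConnections) {
        acceptingNew = false;                          // critical part begins
        log.add("load state on " + waitingHost);       // waiting host pulls state from the Storage Host
        if (activeConnections > 0) {
            acceptingNew = true;                       // not safe: abort this attempt
            log.add("delayed");
            return Outcome.DELAYED;
        }
        log.add("route " + activeHost + " -> " + waitingHost);
        String old = activeHost;
        activeHost = waitingHost;                      // the fresh host becomes active
        waitingHost = old;                             // the old host would now be re-cloned from the base image
        acceptingNew = true;                           // critical part ends
        return Outcome.ROTATED;
    }
}
```

Calling attemptRotation with a non-zero connection count leaves the route on the current host, while a call with zero connections swaps the roles of X and Y; in both cases new connections are accepted again when the method returns.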

4.4 Guardian Plugins

Since our system must be capable of dealing with TCP/IP connections and TCP/IP connections are stateful, a mechanism is needed to prevent rotation while there is an active connection. Using our proposed new Grid architecture, this is not a significant problem, since the long-lasting connections (e.g. for job or result data transmission) are no longer routed via the headnode but go directly to the users' own private worker nodes. This only leaves the command and control messages (e.g. WS-GRAM or Virtual Workspace calls) to be handled. In our experience, these calls usually do not last longer than a few minutes, and there are ample gaps between connections in which rotation can take place. Since the communication patterns and the desired responses to failed rotations differ from site to site, the Rotation Manager uses a plugin mechanism to decide whether to rotate or not. This allows the manager of a Grid site to adapt the rotation mechanisms to his or her needs with minimal effort. These Guardian Plugins have access to the IP Conntrack information and can thus monitor the current connections. They are called by the Rotation Manager when the rotation time slice has ended and must then decide whether to rotate or not. If there are no current connections, rotation is granted. If there are TCP/IP connections, it is up to the specific plugin to decide what to do. This depends greatly on the site policy. A usual approach is to delay the rotation for a short while and try again. The amount of time and the number of delays depend on the site's policy and should be chosen with care, since every delayed rotation increases the length of the attack window. Factors currently used to decide whether to rotate or not are: source IP, source port, duration of connection, number of rotations delayed due to source IP. It is important to note that only active connections affect the rotation mechanism. This means that the type of attack (stealth, data, resource, etc.) does not affect the rotation strategy.

The goal of the Guardian Plugins is to delay rotation if legitimate connections are present, while not falling prey to halting rotation because of the presence of malicious connections. One easy way of doing this is to simply rotate even if there is an active connection. This is the safest course of action from the server perspective but relies on error recovery mechanisms on the client side. However, detecting malicious activity in this scenario is relatively simple. The Grid communication pattern usually has one or two connections from a user to start a job, and several days later one connection at the end of the job. Even with a large number of users, the length of the Grid jobs provides ample time to rotate. During our test runs with real-world applications from a German national Grid (D-Grid) community project, the plugin did not have to delay a single rotation. An attacker in a normal scenario can easily mask his or her attack connection in the background noise of the Grid, since the attack only has to be launched once. With our system in place, the attacker would have to keep a permanent connection or several alternating connections open, since this is the only way to prevent the rotation from being executed and thus to prevent all attack code from being removed. Every rotation which is prevented sends an alert to the site administrator and thus any attack, no matter how stealthy it is, can easily be detected.
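One of the factors listed above, the number of rotations delayed due to a source IP, could be turned into a policy along the following lines. This is a hypothetical sketch, not one of the plugins used in the paper: the class name, the per-source budget and the method delaysCausedBy are assumptions made for illustration, and the input is reduced to the source IPs of the currently active connections.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;

/** Hypothetical guardian policy: each source IP may delay rotation only a limited number of times. */
class PerSourceGuardianSketch {
    private final int maxDelaysPerSource;
    private final Map<String, Integer> delaysBySource = new HashMap<>();

    PerSourceGuardianSketch(int maxDelaysPerSource) {
        this.maxDelaysPerSource = maxDelaysPerSource;
    }

    /** Returns true if rotation may proceed, false if it should be delayed. */
    boolean validateSwitch(Collection<String> activeSourceIps) {
        boolean delay = false;
        // Deduplicate so that several connections from one source count once per attempt.
        for (String ip : new HashSet<>(activeSourceIps)) {
            int used = delaysBySource.getOrDefault(ip, 0);
            if (used < maxDelaysPerSource) {
                delaysBySource.put(ip, used + 1);   // this source consumes one unit of its budget
                delay = true;
            }
            // Sources that have used up their budget no longer block rotation.
        }
        return !delay;
    }

    int delaysCausedBy(String ip) {
        return delaysBySource.getOrDefault(ip, 0);
    }
}
```

A policy of this kind would still need a global cap on consecutive delays, since an attacker who alternates between many source addresses could otherwise keep obtaining fresh budgets.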

4.5 Rotating Server Attacks

The novel rotating server approach presented in this section preserves all relevant state for normal operation of the Grid, but all other state stored on the Service Hosts is lost. Thus, any attack code is lost, and any interactive attacks are terminated. Since the rotation is executed irrespective of attack detection, all attack code is removed from the system including stealth attacks (see section 3). If the system was crashed by an attacker, it is automatically rejuvenated. Depending on the configuration of the Guardian Plugin, rotation may be delayed a number of times due to open connections. By continuously connecting to the server, an attacker can prevent rotation, extending the lifetime of an attack. In this case, the Guardian Plugin raises an alarm and either automatically terminates the connections after n connection requests (with the danger of terminating valid connections) and rotates anyway or informs the administrator of the potential attack. The fact that an attacker must prevent the Guardian Plugin from rotating to preserve the attack code makes stealth attacks, which normally would not be detected, visible through the fact that something is blocking the rotation mechanism.

One attack which remains possible is a data injection attack against the storage database. An attacker who fully compromises one of the Service Hosts can insert bogus entries into or remove valid entries from the database concerning the state of running Grid services by masquerading as the Grid service. It should be noted that this does not affect job data, result data or login information. The user's data is secure in its own virtual environment. The data attacks possible here are some of the meta-data attacks (e.g. what jobs were submitted by which user) and data attacks against the control information of services like WS-GRAM. Thus, job IDs could be deleted from the database, making it impossible to check the status of a job. The users could still, however, log on to their worker nodes and check manually, since a Service Host compromise does not spill over to the virtual worker nodes.

The final attack which is possible is an attack against the Storage Host itself. Since the Storage Host operates in a private address space, the attack must be executed from a compromised Service Host and thus must be done in a short time span. The only service running on the Storage Host is the database; all other services are deactivated. Thus, an attacker must either use an exploit against the database or the network protocol stack. The network protocol stack has not had a remote vulnerability for a long time, making the database the more serious threat. This threat is not countered by the solution presented in this paper and can only be mitigated by using a database with good security measures. However, since the Storage Host operates in a private address space and can only connect to the two Service Hosts, attack code is severely limited in its capabilities. Inserting false entries into the database is one possible form of misuse that can be achieved from a compromised Service Host and does not require the Storage Host to be compromised. Since the Storage Host does not rotate, the attack code can remain indefinitely. However, since the Storage Host cannot be reached from external sources, any information must be transmitted via a compromised Service Host. This requires repeated attacks against the Service Hosts. Since they are cleaned periodically, the likelihood of detection is increased. Although these attacks must not be taken lightly, they only have a minimal attack window and are usually easily noticeable, since they affect the primary operation of the Grid (e.g. a user's job is cancelled before it is started, or the notification of a completed job is removed before it is sent). This type of attack is possible against all the analyzed intrusion tolerance systems discussed in the related work section, except those which disallow state completely. The main goal of preventing stealth attacks is not compromised by these attacks. If required, a data-specific IDS could be run at the database access point to check for irregular behavior (such as a job that was started in Europe and terminated outside of Europe). While an IDS must usually play follow the leader, the very small area of attack will ease the configuration and increase the accuracy beyond what a full IDS could manage.

5. IMPLEMENTATION

We implemented a number of different systems to achieve virtual server rotation. Our first idea was to utilize existing technology for hot failover like the projects in the related work section did. We used the Linux UCARP daemon and a simple script. UCARP hot failover works by monitoring the heartbeat of an active server. If the heartbeat stops (when the Rotation Manager shuts down the VM), the waiting backup server takes the active role. Since this is done automatically, the active VM just has to be destroyed and the backup will take over. Then, a new backup server is started. However, due to the polling nature of UCARP and the lack of built-in state rescue mechanisms, the delay between one server going down and the other serving requests with the correct state is a few seconds. In the meantime, clients trying to connect to the server get an error message.

To avoid the polling delay of existing hot failover mechanisms, we decided to use an active component for the rotation and utilize dynamic routing to switch between the Xen instances. To avoid having to utilize network address translation (NAT) and the problems involved with it, and to minimize the security risk to Xen0, the two Service Host XenUs both receive public IP addresses for their eth0/vifx|y.0 device. This means that Xen0 does not need to rewrite any packets or offer any complex services; it simply acts as a router to the outside world. Since only one of the XenUs' eth0 is connected to the physical eth0 network device, we gave both XenUs the same IP address for their eth0s. The VMs are started in routed mode instead of the standard bridged mode, since this way the virtual interfaces are immediately connected and there is no collision. It is convenient to give both VMs the same IP, since the Globus Grid middleware requires a host certificate for secure communication. This Grid X.509 certificate is bound to a distinguished name (DN) and the DN is bound to an IP address. When Globus starts, it does a reverse lookup on its IP address, compares the name with its certificate and fails to start if there is a mismatch. If different IPs were to be used, two entries would need to be made in the reverse lookup table of the DNS server, so the same DN would be returned for two different IP addresses. This is not a real problem, but since we can give both XenUs the same IP address, we avoid having to contact our DNS administrator and thus create a more portable system. It is also important to configure Xen0 as an ARP proxy, so Xen0 can respond to ARP requests for the Service Hosts and receive and forward packets addressed to them. The second virtual network interface eth1/vifx|y.1 is started with different IP addresses, so both images can be connected to the Storage Host at the same time. The eth1 network is not routed outside of the Xen0 and does not connect Service Host X to Service Host Y.

    // Excerpt: vmRunning, vmWaiting, p (the GuardianPlugin) and intervall
    // are fields of the Rotation Manager.

    private void initialize(String plugin) throws Exception {
        // Load the Guardian Plugin class by name via reflection.
        Class<?> cl = Class.forName(plugin);
        java.lang.reflect.Constructor<?> con =
            cl.getConstructor(new Class[] { String.class, String.class });
        // Assign to the field p so that rotate() can use the plugin.
        p = (GuardianPlugin) con.newInstance(new Object[] {
            vmRunning.getIP(), vmRunning.getNetmask() });
    }

    private int rotate()
            throws InterruptedException, SwitchValidatorException {
        dropSyns(true);                  // start of the critical part
        vmWaiting.loadState();           // threaded
        if (!p.validateSwitch(parseConntrack())) {
            dropSyns(false);             // rotation denied: accept connections again
            return p.interval();         // time until the next attempt
        }
        while (!vmWaiting.isStateLoaded()) {
            Thread.yield();
            Thread.sleep(intervall);
        }
        // Switch the network route from the running VM to the waiting VM.
        runCommand(new String[] { "ip", "route", "del",
            vmRunning.getIP(), "dev", vmRunning.getVirtualInterface() });
        runCommand(new String[] { "ip", "route", "add",
            vmWaiting.getIP(), "dev", vmWaiting.getVirtualInterface() });
        dropSyns(false);                 // end of the critical part
        return 0;
    }

Figure 6: Rotation Manager

The Rotation Manager and the Guardian Plugins are written in Java. We also implemented a simple Rotation Manager using a Bash script for better performance; however, the greater flexibility and security of Java outweighed the performance deficit. Figure 6 shows an excerpt of the Rotation Manager. The first method is used to load the Guardian Plugin, which implements a method validateSwitch that evaluates the entries of the /proc/net/ip_conntrack file and decides whether a rotation is safe or not and whether the rotation should take place anyway. The method rotate is called in an endless loop at the end of every time slice to initiate rotation. The first statement sets iptables to drop all SYN packets, so no new connections are accepted. This is the start of the critical part. A separate thread is then tasked to load the state for the waiting VM while the Guardian Plugin checks whether to rotate or not. If the rotation is not allowed, the rotate method returns the time period after which the next rotation is to be attempted, as defined by the Guardian Plugin.

The critical part under typical connection patterns is roughly 350 ms long and has not caused any problems outside of our stress tests. A secure programming language like Java is desirable since the Rotation Manager is one of the few components running on Xen0. A compromise in this code compromises the entire system. While the attack possibilities are slim, it is conceivable that the parsing of the IP conntrack file at /proc/net/ip_conntrack could be used as an attack vector. If the Guardian Plugin allows the rotation process to continue, the waiting VM is queried whether the state has been successfully loaded. Once this is the case, the old network route is deleted and the new network route is created. The last command tells iptables to accept new connections again. If the Guardian Plugin denies the rotation, new connections are accepted again and the rotate method returns the number of milliseconds after which a new rotation attempt is to be made.
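To make the conntrack parsing step more concrete, the following sketch extracts the fields a Guardian Plugin needs from a single line. It is not the parser of the Rotation Manager: the class and method names are hypothetical, and the assumed line layout (protocol name, protocol number, timeout, an optional TCP state, then key=value pairs with the original direction first) reflects typical /proc/net/ip_conntrack output and should be checked against the kernel in use.

```java
/** Hypothetical parser for single lines in the assumed ip_conntrack layout. */
class ConntrackSketch {
    static final class Entry {
        final String protocol;  // e.g. "tcp" or "udp"
        final String state;     // TCP state, or null if the line carries none
        final String srcIp;     // source of the original direction
        final String dstIp;     // destination of the original direction

        Entry(String protocol, String state, String srcIp, String dstIp) {
            this.protocol = protocol;
            this.state = state;
            this.srcIp = srcIp;
            this.dstIp = dstIp;
        }
    }

    static Entry parseLine(String line) {
        String[] t = line.trim().split("\\s+");
        String state = null, src = null, dst = null;
        for (int i = 3; i < t.length; i++) {
            String tok = t[i];
            if (tok.startsWith("src=") && src == null) {
                src = tok.substring(4);          // first src= belongs to the original direction
            } else if (tok.startsWith("dst=") && dst == null) {
                dst = tok.substring(4);
            } else if (i == 3 && !tok.contains("=")) {
                state = tok;                     // fourth field is the state on TCP lines
            }
        }
        return new Entry(t[0], state, src, dst);
    }

    /** Mirrors the check described for the example plugin: open TCP connections to or from the host. */
    static boolean blocksRotation(Entry e, String hostIp) {
        boolean touchesHost = hostIp.equals(e.srcIp) || hostIp.equals(e.dstIp);
        boolean finished = "TIME_WAIT".equals(e.state) || "CLOSE".equals(e.state);
        return touchesHost && "tcp".equals(e.protocol) && !finished;
    }
}
```

Keeping the parser this small and free of external input beyond the kernel-provided file limits the code that runs on Xen0, which is in line with the concern about the parser as an attack surface.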

During the critical part, no new connections are allowed; thus, if a connection attempt is made within these 350 ms, the connection attempt will fail. Since the TCP/IP protocol is fault-tolerant, this does not lead to an application error but to a TCP retry. The TCP/IP SYN retransmission timeout typically is between 3 and 6 seconds (depending on the TCP/IP implementation and configuration). Since packet loss occurs naturally and Grid execution time is usually hours or days, the delay is not an issue. If, however, the connection is due to human interaction, a couple of seconds can be too long as a response time. Our current implementation simply drops the SYN packets. A more efficient solution is to delay the SYN packets by 400 ms instead of dropping them. This is within the tolerance for TCP/IP packets and would not be noticed by the user. At the end of the critical part, the delay is turned off again. The latter solution, however, is not implemented yet. Since none of our real-world applications ran into this problem, it was not a priority. We will come back to this issue in the experimental results section.
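As a rough plausibility check of why such collisions are rare, the share of time spent in the critical part can be estimated. The estimate assumes one rotation attempt roughly every 300 s (the interval used in the experiments, not a fixed property of the system), the critical part length of roughly 350 ms stated above, and connection attempts that arrive uniformly at random. The symbols $t_{\mathrm{crit}}$, $T_{\mathrm{rot}}$ and $p_{\mathrm{hit}}$ are introduced here for illustration only.

```latex
p_{\mathrm{hit}} \approx \frac{t_{\mathrm{crit}}}{T_{\mathrm{rot}}}
  = \frac{0.35\,\mathrm{s}}{300\,\mathrm{s}}
  \approx 1.2 \times 10^{-3}
```

Under these assumptions, only about 0.12% of connection attempts would fall into the critical part, and each of those would be delayed by one SYN retransmission timeout rather than fail outright.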

A simple example of a Guardian Plugin is shown in Figure 7. The first statement checks whether the number of times this Guardian has consecutively aborted has reached the maximum number of allowed failed rotations; if so, it allows rotation regardless of whether there are active connections or not. Otherwise, it loops through the Conntrack entries to check for TCP/IP connections. If it finds TCP/IP connections which are not in the CLOSE or TIME_WAIT state, it increments the number of aborted rotations and returns false. If the Guardian allows rotation, it resets the number of aborted rotations. This is a very simple Guardian Plugin and should only be seen as an example. The plugin allows 3 failed rotations and specifies a 10 second delay for retrying the rotation. The number of allowed failed rotations, reasons to delay rotation and the delay times are complex issues which, due to space limitations, will be discussed in a follow-up publication.

    int interval = 10;  // seconds
    int maxRot = 3;     // maximum failed rotations

    public boolean validateSwitch(ConntrackEntry[] ce)
            throws SwitchValidatorException {
        if (nrOfRetries >= maxRot) {
            nrOfRetries = 0;
            return true;  // forced
        }
        for (int i = 0; i < ce.length; i++) {
            // An open TCP connection to or from this host delays the rotation.
            if ((ce[i].getDstIP().equals(getIP())
                    || ce[i].getSrcIP().equals(getIP()))
                    && ce[i].getProtocol().equals(ConntrackEntry.TCP)
                    && !ce[i].getTCPState().equals(ConntrackEntry.TIME_WAIT)
                    && !ce[i].getTCPState().equals(ConntrackEntry.CLOSE)) {
                nrOfRetries++;
                return false;  // delayed
            }
        }
        nrOfRetries = 0;
        return true;  // allowed
    }

Figure 7: Guardian Plugin

One of the most important services for the operation of the Grid is the WS-GRAM Service. The modifications to this service are relatively small. The standard WS-GRAM uses the Globus class PersistenceHelper to store and load its state. We configured the service to use our PersistenceDBHelper with the same interface to load and store the state in the remote database.

6. EXPERIMENTAL RESULTS

In the following, an evaluation of the proposed approach is presented. It should be noted that the performance overhead incurred by the rotation mechanism only affects the Grid headnode and not the worker nodes and thus not the execution time of Grid jobs. The main criteria the headnode must fulfill are that the performance overhead of the rotation mechanism does not affect the responsiveness of the headnode to user requests and that the state of the server is not corrupted, i.e. the rotation mechanism must not drop any user messages during rotation. The rotation mechanism was tested on an Intel Core Duo 1666 MHz processor with 1024 MB RAM and a Hitachi Travelstar 5K060. 314 MB of memory was allocated to Xen0. The following VMs are started in this Xen0:

• The Storage Host with PostgreSQL 7.4.7 and Derby 10.2.1.6 used by the Grid services for storage. The Derby database is currently needed for the Virtual Workspace (VW) Service. The Storage Host receives 150 MB RAM.

• Two Service Hosts running Globus 4.0.3, VW TP 1.2. Each Service Host gets 280 MB of memory.

Additionally, there is one machine running the cluster headnode software Sun Grid Engine 6.0. This machine has no effect on the performance of the rotating servers. The network cards are 100 MBit cards. The Xen version is 3.0.2-testing and the operating system is Debian (4.0) Etch with Kernel 2.6.16-29-xen0.

In the following, we will first present an attack evaluation before discussing the performance impact of the rotation system. Two common attacks against the Grid headnode were performed. The first one involves the installation and execution of a spambot, and the second one installs a botnet slave which sits idle waiting for the botnet controller to activate it, for example to participate in a DDoS attack. Both attacks were executed using a root shell controlled via SSH. To validate the correct functioning of the Grid middleware and the state preserving mechanisms, a Grid job is submitted before the attack, and the status of the Grid job is queried after the attack. To visualize the attacks, the packet throughput was measured in intervals of 3 seconds. Figure 8(a) shows the spambot attack executed against a system which is not protected by the presented rotation mechanism; once the spambot is installed and activated, it runs indefinitely. The first large peak in the packet rate seen in Figure 8(a) is the submission of the Grid job and the transfer of the proxy certificates. The second large peak is the initial execution of the exploit and the initialization of the SSH root shell. The third peak is the download of the spambot software. Once the spambot is started, the network load goes off the scale, since roughly 30000 packets are sent per 3-second interval. The job query was executed a little slower than usual, but the primary function of the Grid middleware was not compromised; thus, if no standard IDS is installed or the IDS misses the attack, there is no alert for the administrator. The same attack executed against a system using the presented rotation mechanism is shown in Figure 8(b). Here, the same submission, exploit and malware installation pattern was followed. The Guardian Plugin was configured to tolerate 9 failed rotations with an interval of 10 seconds before forcing a rotation. At this point, the malware is removed. The job status query is executed normally. The total time in which the server was compromised was only roughly 250 seconds compared to the indefinite compromise in the unprotected case.

The second attack is not quite as obvious as the first. In Figure 9(a), a botnet slave is installed in much the same way as the spambot was installed. However, the botnet slave is a passive component which cannot easily be detected until it is activated by the botnet master. To enable activation, it opens a port and waits for the command and control messages from the master. Since the botnet slave does not keep any open TCP connections, the rotation mechanism is not affected and simply rotates on schedule and removes the malware. The total time of the compromise was roughly 150 seconds. In the second attack shown in Figure 9(b), the attacker is aware of the rotation mechanism and has reprogrammed the botnet slave to keep a number of alternating open connections to prevent rotation as long as possible. The Guardian Plugin in this attack was set to force rotation after 3 failed attempts and then alert the administrator. Thus, the lifetime of the botnet slave in this attack was extended from roughly 100 seconds to 130 seconds. Since botnet compromises can be very difficult to detect until it is too late, especially in the case of a previously unknown botnet, the window of compromise offered by the presented rotation system is a great advantage, since the botnet slave will have been removed before the master can activate it. If the attack is executed and the botnet is activated immediately, the scenario is similar to the spambot in the previous attack. In all cases, the operation of the Grid was not affected adversely by the rotation mechanism while the attack code was removed.
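The effect of the Guardian Plugin settings on the attack window can be approximated with simple arithmetic. The relation below is a simplification that ignores the time needed to load state and switch routes; the symbols $W$, $W_0$, $n$ and $d$ are introduced here for illustration, with $W_0$ the lifetime of the attack code when no rotation is delayed, $n$ the number of tolerated failed rotations and $d$ the retry delay.

```latex
W_{\max} \approx W_0 + n \cdot d,
\qquad
n = 3,\; d = 10\,\mathrm{s}
\;\Rightarrow\;
W_{\max} \approx W_0 + 30\,\mathrm{s}
```

With the setting of 3 failed attempts and a 10 second delay, this gives an extension of about 30 seconds, which agrees with the reported increase from roughly 100 to 130 seconds; with the setting of 9 tolerated failures used in the spambot experiment, the same estimate allows an attacker to add up to about 90 seconds.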

To evaluate the performance and the possible functional impact of the rotation mechanisms on normal Grid operation, we first tested the system using an engineering application for metal casting processes from the D-Grid. However, no noticeable difference could be detected, since the application (like most Grid applications) running on our new Grid setup only requires minimal effort from the Grid headnode: one job submission connection and certificate check to start the computation and one notification connection when the job is completed. The data transferred goes directly to the worker nodes over a secured connection. As such, we were not able to get the application to collide with the rotation mechanism. To better judge the behavior of the system, we used a modified CounterService to bombard the Grid headnode in quicker intervals than we could with real Grid applications. The CounterService is a simple Globus test service containing a stateful counter which can be incremented using a service call containing an integer value.

Figure 8: Spambot Attacks. (a) Spambot Attack No Rotation; (b) Spambot Attack With Rotation.

Figure 9: Botnet Installation. (a) Botnet Installation 1; (b) Botnet Installation 2.

Figure 10: CPU performance. (a) CPU load with unmodified CounterService; (b) CPU load with modified CounterService; (c) CPU load with modified CounterService and rotations. Axes: percentage over seconds (0 to 1800); series: I/O Wait, System, User; panel (c) additionally marks rotations 1 to 6.

All experiments were done in three different settings: onceusing the unmodified CounterService without rotation, oncewith the modified CounterService which stores its state onthe Storage Host without rotation and once with the mod-ified CounterService which stores its state on the StorageHost with the first rotation after 200 seconds and then a ro-tation roughly every 300 seconds. The experiments were runfor 30 minutes each. This allows us to see the effect of both

Page 11: Securing Stateful Grid Servers through Virtual Server Rotation · Securing Stateful Grid Servers through Virtual Server Rotation Matthew Smith, Christian Schridde and Bernd Freisleben

0

200

400

600

800

1000

0 200 400 600 800 1000 1200 1400 1600 1800

Blo

cks

Seconds

SendRecieved

(a) I/O load with unmodified Counter-Service

0

200

400

600

800

1000

0 200 400 600 800 1000 1200 1400 1600 1800

Blo

cks

Seconds

SendRecieved

(b) I/O load with modified CounterSer-vice

0

5000

10000

15000

20000

25000

30000

35000

40000

45000

50000

0 200 400 600 800 1000 1200 1400 1600 1800

654321

Blo

cks

Seconds

Rotations

SendRecieved

(c) I/O load with modified CounterSer-vice and rotations

Figure 11: I/O performance

0

10

20

30

40

50

0 10 20 30 40 50 60 70 80 90 100

Sec

onds

spe

nd e

xecu

ting

Minutes / Client calls

RealUser

System

(a) CounterService execution times withunmodified CounterService

0

10

20

30

40

50

0 10 20 30 40 50 60 70 80 90 100

Sec

onds

spe

nd e

xecu

ting

Minutes / Client calls

RealUser

System

(b) CounterService execution times withmodified CounterService

0

10

20

30

40

50

0 10 20 30 40 50 60 70 80 90 100

201918171614.71413121110987654321

Sec

onds

spe

nd e

xecu

ting

Minutes / Client calls

Rotations

RealUser

System

(c) CounterService execution times withmodified CounterService and rotations

Figure 12: CounterService execution time.

Figure 13: Client response time. Panels: (a) CounterService response times without collisions; (b) CounterService response times with collisions. Axes: response time of add() in milliseconds (y) over the number of add() calls (x); both panels also mark the rotations.

the externalization of the state storage and of the actual rotation mechanism. The Counter client calls the CounterService on the headnode once every minute and increments the Counter 100 times. Since the Counter client only issues 100 calls, the likelihood of a critical part collision was very low, and in fact no collision occurred.
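The rotation schedule used in the third setting can be written down in a few lines. The following is a minimal sketch, not the rotation manager itself: the function name `rotation_times` is hypothetical, and the schedule is idealized, since the text states that rotations happen only roughly every 300 seconds.

```python
def rotation_times(duration_s: int, first_s: int = 200, interval_s: int = 300) -> list[int]:
    """Return the nominal start times (in seconds) of all rotations in a run.

    Idealized: the first rotation starts after `first_s` seconds and every
    further rotation exactly `interval_s` seconds after the previous one.
    """
    return list(range(first_s, duration_s, interval_s))


thirty_minutes = rotation_times(30 * 60)
print(thirty_minutes)       # [200, 500, 800, 1100, 1400, 1700]
print(len(thirty_minutes))  # 6
```

Under these assumptions, a 30-minute run contains six rotations, which would be consistent with the six rotation markers in Figure 11(c). For a 100-minute run (6000 seconds), the same schedule yields 20 rotations.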

Figure 10 shows the results. The large spikes after each rotation stem from the cloning of the Service Host image. The image is 640 MB in size and is currently simply duplicated on the same hard disk, which creates both I/O and CPU load. This high load can be avoided by using RAM disks or copy-on-write layers, which avoid the costly hard disk access.

Figure 11 shows the corresponding disk I/O load. The I/O load for the modified CounterService (b) is lower than for the unmodified CounterService (a), since the state is no longer written to disk but to the remote database. The cloning of the images creates the expected spikes in measurement (c). Since the memory is pre-allocated to the Xen VMs, the memory consumption remains constant in all three cases.

The rotation is currently quite expensive due to the full hard disk copy. However, the average load with rotation is only 30%, and the machine remains responsive even during the copy operations, i.e., response times remain stable, no messages are dropped, and the server state is consistent with the value expected by the client.
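The consistency of the server state across rotations follows from keeping the state outside the rotating Service Host. The sketch below illustrates the idea under stated assumptions: an in-memory `sqlite3` database stands in for the remote database on the Storage Host, and the class `CounterService` and the function `run_with_rotation` are hypothetical names for illustration, not the Globus Toolkit 4 implementation.

```python
import sqlite3


class CounterService:
    """Hypothetical stateless counter service: all state lives in `store`."""

    def __init__(self, store: sqlite3.Connection) -> None:
        self.store = store
        self.store.execute(
            "CREATE TABLE IF NOT EXISTS counters "
            "(id TEXT PRIMARY KEY, value INTEGER NOT NULL)"
        )

    def add(self, counter_id: str, amount: int = 1) -> int:
        # The connection context manager commits both statements together.
        with self.store:
            self.store.execute(
                "INSERT OR IGNORE INTO counters (id, value) VALUES (?, 0)",
                (counter_id,),
            )
            self.store.execute(
                "UPDATE counters SET value = value + ? WHERE id = ?",
                (amount, counter_id),
            )
        row = self.store.execute(
            "SELECT value FROM counters WHERE id = ?", (counter_id,)
        ).fetchone()
        return row[0]


def run_with_rotation(calls: int, rotate_every: int) -> int:
    """Issue `calls` increments, replacing the service every `rotate_every` calls."""
    store = sqlite3.connect(":memory:")  # stand-in for the Storage Host database
    service = CounterService(store)
    value = 0
    for i in range(calls):
        if i > 0 and i % rotate_every == 0:
            # "Rotation": discard the old instance and start a fresh one.
            service = CounterService(store)
        value = service.add("demo")
    return value


print(run_with_rotation(calls=100, rotate_every=7))  # 100
```

Because each fresh instance reads the counter from the store rather than from its own memory, the value seen by the client after 100 increments is 100 regardless of how many rotations took place in between.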

In our next test, we analyzed the client performance. 100 clients were started, which then incremented the CounterService 100 times at a rate of one call per minute. The average processing time for the 100 calls is shown in Figure 12. As before, no collisions took place. The execution time increased from roughly 30 seconds to 50 seconds over the 100 calls, mainly due to the remote state transfer.

In our last experiment, we measured the response time from the client side. In the first measurement, we used the same call frequency as above with the modified CounterService and rotations. In the second measurement, we increased the frequency of the client calls to the maximum our network would allow in order to make collisions more likely. Figure 13 shows


the results. The spikes in response time in (a) occur immediately after a rotation and do not stem from a collision. They are caused by the lazy class loading delay that follows the re-initialization of the Globus Toolkit. The request frequency in (b) was high enough to cause collisions, and thus the spikes represent the TCP/IP retry time plus the lazy loading delay. The collisions did not have any performance effects on the Service Hosts. Given the many performance optimizations still possible, we believe that the performance figures observed for the rotation system are tolerable considering the benefits the system offers, since the overhead does not affect the actual Grid jobs, which run on the worker nodes.

7. CONCLUSIONS

In this paper, we presented a novel intrusion tolerance system which can deal with unknown stealth attacks on stateful Grid servers. The system is based on a novel server rotation strategy utilizing paravirtualization to close attack windows for stateful service-oriented Grid headnode servers. A flexible plugin based rotation manager deals with the complex issue of stateful connections to the Grid server, and a database connector is utilized to detach service state from the rotating functional components of the Grid server. A prototypical implementation based on the Globus Toolkit 4 was presented. The Virtual Workspace, WS-GRAM, RFT and Delegation Services were extended to survive attacks through server rotation without losing their state. Attacks against the rotation system itself were easily identified, because such attacks necessarily have to be sustained. The presented approach differs from other work in this area, since it does not require an intrusion detection system to operate and can handle both Grid server state and stateful TCP/IP connections, all of which are vital issues for its applicability to Grid computing systems. Experimental results including attack scenarios and a performance evaluation were presented.

There are several areas for future work, such as (a) modifying the routing mechanism to enable SYN delays instead of SYN dropping for the critical part to avoid the SYN retransmission timeout due to critical part collision, (b) introducing copy-on-write and RAM disk mechanisms for efficient image cloning, (c) investigating the remaining data attacks against the database to see if new strategies can be found to prevent meta-data corruption, and (d) evaluating Guardian Plugin configuration.
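The motivation for item (a) can be illustrated with a small model of the connection setup delay a client sees when its SYN arrives during the critical part. This is a sketch under assumptions that are not taken from the measurements: the client is assumed to retransmit a dropped SYN after an initial timeout of 3 seconds that doubles on every attempt, and all function names are hypothetical.

```python
from typing import List, Optional


def syn_retry_times(initial_rto_s: float = 3.0, retries: int = 5) -> List[float]:
    """Offsets (seconds after the first SYN) at which the client resends it.

    Assumption: exponential backoff starting at `initial_rto_s`; the real
    values depend on the client's TCP stack.
    """
    times, elapsed, rto = [], 0.0, initial_rto_s
    for _ in range(retries):
        elapsed += rto
        times.append(elapsed)
        rto *= 2
    return times


def delay_with_syn_drop(remaining_critical_s: float) -> Optional[float]:
    """Setup delay if SYNs are dropped until the critical part is over."""
    if remaining_critical_s <= 0:
        return 0.0
    for t in syn_retry_times():
        if t >= remaining_critical_s:
            return t  # first retransmission that arrives after the critical part
    return None  # every retry fell into the critical part


def delay_with_syn_hold(remaining_critical_s: float) -> float:
    """Setup delay if the SYN is held back and forwarded afterwards."""
    return max(0.0, remaining_critical_s)


print(delay_with_syn_drop(0.5))  # 3.0
print(delay_with_syn_hold(0.5))  # 0.5
```

In this model, a client that hits a critical part with half a second remaining waits for the full retransmission timeout when the SYN is dropped, but only for the remaining half second when the SYN is delayed.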

8. ACKNOWLEDGMENTS

This work is partly supported by the German Ministry of Education and Research (BMBF) (D-Grid Initiative). The authors would like to thank Salim Hariri for his valuable input.

9. REFERENCES

[1] Globus Security Team, “Globus Security Advisory 2007-03: Nexus Vulnerability,” http://www.globus.org/mail_archive/security-announce/2007/05/msg00000.html, May 2007.

[2] Globus Security Team, “Globus Security Advisory 2007-02: GSI-OpenSSH Vulnerability,” http://www-unix.globus.org/mail_archive/security-announce/2007/04/msg00000.html, March 2007.

[3] The Grid Security Vulnerability Group, “Critical Vulnerability: OpenPBS/Torque,” http://security.fnal.gov/CriticalVuln/openpbs-10-23-2006.html.

[4] Internet Security Systems, “UNICORE Client Keystore Information Disclosure,” http://xforce.iss.net/xforce/xfdb/30157, November 2006.

[5] President’s Information Technology Advisory Committee (PITAC), “Cyber Security: A Crisis of Prioritization,” 2005, available at www.nitrd.gov.

[6] Y. Deswarte, L. Blain, and J.-C. Fabre, “Intrusion Tolerance in Distributed Computing Systems,” in Intl. Symposium on Security and Privacy. IEEE, 1991, pp. 110–121.

[7] D. Arsenault, A. Sood, and Y. Huang, “Secure, Resilient Computing Clusters: Self-Cleansing Intrusion Tolerance with Hardware Enforced Security (SCIT/HES),” in ARES ’07: Proceedings of the Second International Conference on Availability, Reliability and Security. IEEE Computer Society, 2007, pp. 343–350.

[8] Y. Huang, D. Arsenault, and A. Sood, “Closing Cluster Attack Windows through Server Redundancy and Rotations,” in IEEE International Symposium on Cluster Computing and the Grid Workshops, 2006, pp. 12–18.

[9] S. Junnarkar, “Anatomy of a Hacking,” 2002, available at http://news.com.com/2009-1017-893228.html.

[10] H. Reiser and R. Kapitza, “VM-FIT: Supporting Intrusion Tolerance with Virtualisation Technology,” in Proceedings of the First Workshop on Recent Advances on Intrusion-Tolerant Systems, 2007, pp. 18–22.

[11] F. Wang, F. Gong, C. Sargor, K. Goseva-Popstojanova, K. Trivedi, and F. Jou, “SITAR: A Scalable Intrusion Tolerance Architecture for Distributed Servers,” in Second IEEE SMC Information Assurance Workshop. IEEE, 2001, pp. 135–144.

[12] D. Wang, B. B. Madan, and K. S. Trivedi, “Security Analysis of SITAR Intrusion Tolerance System,” in SSRS ’03: Proceedings of the 2003 ACM Workshop on Survivable and Self-Regenerative Systems. ACM Press, 2003, pp. 23–32.

[13] M. Cukier, J. Lyons, P. Pandey, H. V. Ramasamy, W. H. Sanders, P. Pal, F. Webber, R. Schantz, J. Loyall, R. Watro, M. Atighetchi, and J. Gossett, “Intrusion Tolerance Approaches in ITUA,” in Fast Abstract Supplement of the 2001 Intl. Conf. on Dependable Systems and Networks, 2001, pp. 64–65.

[14] J. Reynolds, J. Just, E. Lawson, L. Clough, R. Maglich, and K. Levitt, “The Design and Implementation of an Intrusion Tolerant System,” in Foundations of Intrusion Tolerant Systems. IEEE, 2003, pp. 64–65.

[15] A. Valdes, M. Almgren, S. Cheung, Y. Deswarte, B. Dutertre, J. Levy, H. Saidi, V. Stavridou, and T. E. Uribe, “An Architecture for an Adaptive Intrusion Tolerant Server,” in Security Protocols Workshop. Springer, 2002, pp. 122–145.

[16] A. Saidane, Y. Deswarte, and V. Nicomette, “An Intrusion Tolerant Architecture for Dynamic Content Internet Servers,” in SSRS ’03: Proceedings of the 2003 ACM Workshop on Survivable and Self-Regenerative Systems. ACM Press, 2003, pp. 110–114.

[17] A. Schulter, F. Navarro, F. Koch, and C. B. Westphall, “Towards Grid-based Intrusion Detection,” in 10th IEEE/IFIP Network Operations and Management Symposium (NOMS 2006), 2006, pp. 1–4.

[18] F.-Y. Leu, J.-C. Lin, M.-C. Li, C.-T. Yang, and P.-C. Shih, “Integrating Grid with Intrusion Detection,” in Proc. of the 19th International Conference on Advanced Information Networking and Applications. Washington, DC, USA: IEEE Press, 2005, pp. 304–309.

[19] M. Smith, M. Engel, T. Friese, B. Freisleben, G. A. Koenig, and W. Yurcik, “Security Issues in On-Demand Grid and Cluster Computing,” in CCGRID ’06: Proc. of the IEEE International Symposium on Cluster Computing and the Grid Workshops. Washington, DC, USA: IEEE Computer Society, 2006, pp. 24–32.

[20] M. Smith, T. Friese, M. Engel, and B. Freisleben, “Countering Security Threats in Service-Oriented On-Demand Grid Computing Using Sandboxing and Trusted Computing Techniques,” Journal of Parallel and Distributed Computing, vol. 66, no. 9, pp. 1189–1204, 2006.

[21] M. Schmidt, M. Smith, N. Fallenbeck, H. Picht, and B. Freisleben, “Building a Demilitarized Zone with Data Encryption for Grid Environments,” in Proceedings of First International Conference on Networks for Grid Applications, 2007.

[22] M. Schmidt, N. Fallenbeck, M. Smith, and B. Freisleben, “Virtual Organization Based Firewalling in Virtualized Grid Environments,” 2008 (submitted for publication).